- **Similarity:** Computed between the input document and all representatives of existing clusters
- **Example --- cover coefficient algorithm of Can et al.:** Select a set of documents as cluster seeds; assign each document to the cluster that maximally covers it
- **Time:** O(N log N)
- **Space:** O(M)
- **Advantages:** Simple, requiring only one pass through the data; may be useful as a starting point for reallocation methods
- **Disadvantages:** Produces large clusters early in the process; the clusters formed depend on the order of the input data

1. Assign the first document, D1, as the representative of cluster C1.
2. For document Di, calculate the similarity Sj with the representative of each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than a threshold Sthreshold, add the document to the corresponding cluster; otherwise, create a new cluster with Di as its centroid.
4. If any documents remain, return to step 2 (a sketch of this procedure follows the list).
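
The steps above translate almost directly into code. The sketch below is illustrative rather than canonical: it assumes documents arrive as dense term vectors, uses cosine similarity, and represents each cluster by the mean of its members; the names `single_pass` and `threshold` are my own.

```python
import numpy as np

def single_pass(docs, threshold=0.5):
    """Single-pass clustering sketch: one scan over the document vectors.

    Assumptions (not prescribed by the notes): documents are dense term
    vectors, similarity is the cosine, and each cluster is represented
    by the mean of its member vectors.
    """
    centroids, members = [], []          # cluster representatives and their doc ids
    for i, d in enumerate(docs):
        d = np.asarray(d, dtype=float)
        # similarity of the new document to every existing representative
        sims = [np.dot(d, c) / (np.linalg.norm(d) * np.linalg.norm(c) + 1e-12)
                for c in centroids]
        if sims and max(sims) > threshold:
            best = int(np.argmax(sims))  # most similar existing cluster
            members[best].append(i)
            # update the representative as the mean of the member vectors
            centroids[best] = np.mean([docs[j] for j in members[best]], axis=0)
        else:
            centroids.append(d.copy())   # start a new cluster seeded by this document
            members.append([i])
    return members
```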

- **Similarity:** Allows various definitions of similarity / cluster cohesion
- **Type of clusters:** Refinements of the initial input clustering
- **Time:** O(MN)
- **Space:** O(M + N)
- **Advantages:** Allows processing of larger data sets than other methods
- **Disadvantages:** Can take a long time to converge to a solution, depending on how well the reallocation criterion fits the structure of the data

1. Select M cluster representatives (centroids).
2. Assign each document to the most similar centroid.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2 and 3 until there is little change in cluster membership from pass to pass (see the sketch below).
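
A minimal sketch of this reallocation scheme (essentially the familiar k-means pattern), assuming documents are dense vectors, "most similar centroid" means smallest Euclidean distance, and the initial centroids are chosen at random; all names are illustrative.

```python
import numpy as np

def reallocate(docs, m, max_passes=20, seed=0):
    """Reallocation clustering sketch (k-means style)."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=m, replace=False)]  # step 1
    assign = np.full(len(docs), -1)
    for _ in range(max_passes):
        # step 2: assign each document to its nearest centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # step 4: little change -> stop
            break
        assign = new_assign
        # step 3: recompute each centroid from its members
        for k in range(m):
            if np.any(assign == k):
                centroids[k] = docs[assign == k].mean(axis=0)
    return assign, centroids
```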

- **Similarity:** Join the most similar pair of objects that are not yet in the same cluster. The distance between two clusters is the distance between the closest pair of points, each of which is in one of the two clusters.
- **Type of clusters:** Long straggly clusters, chains, ellipsoidal
- **Time:** Usually O(N²), though it can range from O(N log N) to O(N⁵)
- **Space:** O(N)
- **Advantages:** Theoretical properties, efficient implementations, widely used. No cluster centroid or representative is required, so there is no need to recalculate the similarity matrix.
- **Disadvantages:** Unsuitable for isolating spherical or poorly separated clusters

- Arrays record cluster pointer and distance information.
- Process the input documents one by one (see the sketch below). For each document:
  - Compute and store a row of the distance matrix
  - Find the nearest other point, using the matrix
  - Relabel clusters, as needed
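
One well-known realization of this one-document-at-a-time, array-based scheme is Sibson's SLINK, which keeps a pointer array and a level array and updates them from each new row of the distance matrix. The sketch below follows that pattern and assumes Euclidean distance; it is a sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def slink(docs):
    """Single link clustering via pointer/level arrays (SLINK style).

    Documents are processed one at a time, one row of the distance
    matrix is computed per document, and the arrays are relabelled as
    needed.  Returns (pi, lam): point i joins the cluster of point
    pi[i] at distance lam[i] in the single link hierarchy.
    """
    docs = np.asarray(docs, dtype=float)
    n = len(docs)
    pi = np.zeros(n, dtype=int)          # pointer array
    lam = np.zeros(n)                    # level (distance) array
    dist_row = np.zeros(n)               # one row of the distance matrix
    for k in range(n):
        pi[k] = k
        lam[k] = np.inf
        dist_row[:k] = np.linalg.norm(docs[:k] - docs[k], axis=1)
        for i in range(k):               # update pointers/levels of earlier points
            if lam[i] >= dist_row[i]:
                dist_row[pi[i]] = min(dist_row[pi[i]], lam[i])
                lam[i] = dist_row[i]
                pi[i] = k
            else:
                dist_row[pi[i]] = min(dist_row[pi[i]], dist_row[i])
        for i in range(k):               # relabel clusters with stale pointers
            if lam[i] >= lam[pi[i]]:
                pi[i] = k
    return pi, lam
```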

- The MST contains all the information needed to generate the single link hierarchy in O(N²) operations; alternatively, the single link hierarchy can be built at the same time as the MST.
- The Prim-Dijkstra algorithm operates as follows (sketched in code below):
  1. Place an arbitrary document in the MST and connect its nearest neighbor (NN) to it, to create an initial MST fragment.
  2. Find the document not yet in the MST that is closest to any point in the MST, and add it to the current MST fragment.
  3. If any document remains outside the MST fragment, repeat the previous step.
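
A sketch of the Prim-Dijkstra construction over a precomputed dissimilarity matrix; sorting the resulting MST edges by weight yields the single link merge sequence. The function name and the use of a full in-memory matrix are assumptions for illustration.

```python
import numpy as np

def prim_mst(dist):
    """Prim-Dijkstra MST sketch over a full dissimilarity matrix.

    Grows a single MST fragment: at each step the document closest to
    any point already in the fragment is attached.  The returned edges,
    sorted by weight, give the single link merge sequence.
    """
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    in_tree = np.zeros(n, dtype=bool)
    best_dist = np.full(n, np.inf)       # distance from each point to the fragment
    best_link = np.zeros(n, dtype=int)   # nearest point inside the fragment
    edges = []

    in_tree[0] = True                    # start from an arbitrary document
    best_dist = dist[0].copy()
    for _ in range(n - 1):
        # pick the out-of-tree document closest to the current fragment
        candidates = np.where(~in_tree, best_dist, np.inf)
        j = int(candidates.argmin())
        edges.append((best_dist[j], int(best_link[j]), j))
        in_tree[j] = True
        # update distances to the enlarged fragment
        closer = dist[j] < best_dist
        best_dist[closer] = dist[j][closer]
        best_link[closer] = j
    return sorted(edges)                 # ascending weights = single link merges
```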

- **Similarity:** Join the least similar pair between each of two clusters
- **Type of clusters:** All entries in a cluster are linked to one another within some minimum similarity, so clusters are small and tightly bound
- **Time:** Worst case for the Voorhees algorithm is O(N³), but sparse matrices require much less
- **Space:** Worst case for the Voorhees algorithm is O(N²), but sparse matrices require much less
- **Advantages:** Good results in (Voorhees) comparative studies
- **Disadvantages:** Difficult to apply to large data sets, since the most efficient algorithm is the general HACM using the stored data or stored matrix approach

- A variation on the sorted matrix HACM approach.
- Based on the fact that if the similarities between all pairs of documents are processed in descending order, two clusters of sizes mi and mj can be merged as soon as the (mi × mj)th similarity between documents in those clusters is reached.
- Requires a sorted list of document-document similarities.
- Requires a way to count the number of similarities seen between any two active clusters (see the sketch below).
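
A sketch of this counting idea, assuming the similarities are supplied as (similarity, doc_i, doc_j) tuples; the union-find bookkeeping for active clusters is an implementation choice of mine, not part of the original description.

```python
from collections import defaultdict

def complete_link(sims, n):
    """Complete link clustering from (similarity, i, j) document pairs,
    processed in descending order: two active clusters are merged as
    soon as every cross-pair between them has been seen.
    """
    parent = list(range(n))              # union-find labels for active clusters
    size = [1] * n
    seen = defaultdict(int)              # cross-pair counts between active clusters
    merges = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for sim, i, j in sorted(sims, reverse=True):
        a, b = find(i), find(j)
        if a == b:
            continue
        key = (min(a, b), max(a, b))
        seen[key] += 1
        if seen[key] == size[a] * size[b]:           # all mi x mj cross-pairs seen
            merges.append((sim, a, b))
            parent[b] = a                            # absorb cluster b into a
            size[a] += size[b]
            # fold b's counts against other clusters into a's counts
            for (x, y), c in list(seen.items()):
                if b in (x, y):
                    other = y if x == b else x
                    if other != a:
                        seen[(min(a, other), max(a, other))] += c
                    del seen[(x, y)]
    return merges
```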

- **Similarity:** Use the average value of the pairwise links within a cluster, based upon all objects in the cluster. Space and time can be saved by using the inner product of two (appropriately weighted) vectors --- see the Voorhees observation below.
- **Type of clusters:** Intermediate in tightness between single link and complete link
- **Time:** O(N²)
- **Space:** O(N)
- **Advantages:** Ranked well in evaluation studies
- **Disadvantages:** Expensive for large collections

- The similarity between a cluster centroid and any document equals the mean similarity between that document and all documents in the cluster.
- Hence centroids alone can be used to compute cluster-cluster similarities, in O(N) space (a small numeric check follows).
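
A small numeric check of this observation, under the assumption that similarity is the inner product of (appropriately weighted) document vectors: by linearity, the inner product with the cluster centroid equals the mean of the document's inner products with the cluster members.

```python
import numpy as np

# Toy check: with inner-product similarity,
# sim(doc, centroid of cluster) == mean over members of sim(doc, member).
rng = np.random.default_rng(42)
cluster = rng.random((5, 8))             # 5 member documents, 8 "terms"
doc = rng.random(8)                      # an arbitrary document

centroid = cluster.mean(axis=0)
via_centroid = np.dot(doc, centroid)
via_pairwise = np.mean([np.dot(doc, member) for member in cluster])

assert np.isclose(via_centroid, via_pairwise)
print(via_centroid, via_pairwise)        # identical up to floating point error
```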

- **Similarity:** Join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on the Euclidean distance between centroids
- **Type of clusters:** Homogeneous clusters in a symmetric hierarchy; each cluster is nicely characterized by its centroid (center of gravity)
- **Time:** O(N²)
- **Space:** O(N) --- carries out agglomerations in restricted spatial regions
- **Advantages:** Good at recovering cluster structure; yields a unique and exact hierarchy
- **Disadvantages:** Sensitive to outliers, poor at recovering elongated clusters

We can apply a reciprocal nearest neighbor (RNN) algorithm, since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each other's NN.

1. Select an arbitrary document.
2. Follow the NN chain from it until an RNN pair is found.
3. Merge the two points in that pair, replacing them with a single new point.
4. If only one point remains, stop. Otherwise, if there is a point in the NN chain preceding the merged pair, return to step 2 (continuing the chain from that point); else return to step 1. (A sketch follows.)
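
A sketch of the RNN chain procedure applied to Ward's method, assuming the standard Ward merge cost |A||B| / (|A| + |B|) times the squared Euclidean distance between centroids; names and data structures are illustrative.

```python
import numpy as np

def ward_rnn(points):
    """Ward's method via the reciprocal nearest neighbor (NN chain) algorithm.

    Returns the merge sequence as (cost, cluster_a, cluster_b) triples.
    """
    pts = np.asarray(points, dtype=float)
    centroid = {i: pts[i].copy() for i in range(len(pts))}
    size = {i: 1 for i in range(len(pts))}
    active = set(centroid)
    next_label = len(pts)
    merges = []

    def cost(a, b):
        diff = centroid[a] - centroid[b]
        return size[a] * size[b] / (size[a] + size[b]) * diff.dot(diff)

    chain = []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))                 # step 1: arbitrary start
        while True:                                          # step 2: grow the NN chain
            top = chain[-1]
            nn = min((c for c in active if c != top), key=lambda c: cost(top, c))
            if len(chain) > 1 and cost(top, chain[-2]) <= cost(top, nn):
                break                                        # reciprocal NN pair found
            chain.append(nn)
        b, a = chain.pop(), chain.pop()                      # step 3: merge the RNN pair
        merges.append((cost(a, b), a, b))
        centroid[next_label] = (size[a] * centroid[a] + size[b] * centroid[b]) / (size[a] + size[b])
        size[next_label] = size[a] + size[b]
        for old in (a, b):
            active.remove(old)
            del centroid[old], size[old]
        active.add(next_label)
        next_label += 1
        # step 4: if the chain is non-empty, continue from its tail; else restart
    return merges
```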

- **Similarity:** At each stage, join the pair of clusters with the most similar centroids.
- **Type of clusters:** Each cluster is represented by the coordinates of its group centroid or median.
- **Disadvantages:** Newly formed clusters may differ considerably from their constituent points, so updates may cause large changes throughout the cluster hierarchy.

1. Identify the two closest points and combine them into a cluster.
2. Identify and combine the next two closest points (treating existing clusters as points).
3. If more than one cluster remains, return to step 2.

Lance and Williams proposed the Lance-Williams dissimilarity update formula, to calculate dissimilarities between a new cluster and existing points, based on the dissimilarities prior to forming the new cluster. This formula has 3 parameters, and each HACM can be characterized by its own set of Lance-Williams parameters (see Table 16.1 in text). Then, the algorithm above can be applied, using the appropriate Lance-Williams dissimilarities.
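
A stored-matrix sketch of the general HACM driven by the Lance-Williams update. For symmetric methods such as single and complete link the two alpha coefficients coincide, which matches the three distinct parameters mentioned above; the linear scan for the closest pair makes this the O(N³)-time variant discussed below. Names and parameter tables are illustrative.

```python
import numpy as np

# Standard Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma)
# for two of the HACMs; other methods plug in their own values.
LW_PARAMS = {
    "single":   lambda ni, nj, nk: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ni, nj, nk: (0.5, 0.5, 0.0,  0.5),
}

def general_hacm(dissim, method="single"):
    """General agglomerative clustering over a stored dissimilarity matrix.

    Each pass finds the closest active pair by a linear scan and
    recomputes the merged cluster's row with the Lance-Williams update:
        d(k, i+j) = a_i*d(k,i) + a_j*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|
    """
    d = np.asarray(dissim, dtype=float).copy()
    n = len(d)
    sizes = np.ones(n)
    active = list(range(n))
    merges = []
    params = LW_PARAMS[method]

    while len(active) > 1:
        # find the closest pair of active clusters (linear scan of the matrix)
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: d[ab[0], ab[1]])
        merges.append((d[i, j], i, j))
        # Lance-Williams update: cluster j is absorbed into cluster i
        for k in active:
            if k in (i, j):
                continue
            ai, aj, beta, gamma = params(sizes[i], sizes[j], sizes[k])
            new = (ai * d[k, i] + aj * d[k, j] + beta * d[i, j]
                   + gamma * abs(d[k, i] - d[k, j]))
            d[i, k] = d[k, i] = new
        sizes[i] += sizes[j]
        active.remove(j)
    return merges
```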

**Implementations of the general algorithm:**

- **Stored matrix** approach: Store the dissimilarity matrix and apply the Lance-Williams formula to recalculate dissimilarities involving newly formed clusters. Storage is therefore O(N²), and time is at least O(N²), but becomes O(N³) if the matrix is scanned linearly.
- **Stored data** approach: O(N) space for the data, but pairwise dissimilarities must be recomputed at each merge, so O(N³) time is needed.
- **Sorted matrix** approach: O(N²) to calculate the dissimilarity matrix, O(N² log N²) to sort it, and O(N²) to construct the hierarchy; however, the data set need not be stored, and the matrix can be processed linearly, which reduces disk accesses.