Details of Clustering Algorithms
Non-hierarchical clustering methods
Single-pass methods
- Similarity: Computed between input and all representatives of existing clusters
- Example --- the cover coefficient algorithm of Can et al.: select a set of documents as cluster seeds, then assign each document to the cluster that maximally covers it
- Time: O(N log N)
- Space: O(M)
- Advantages: Simple, requiring only one pass through the data; may be useful as a starting point for reallocation methods
- Disadvantages: Produces large clusters early in the process; the clusters formed depend on the order in which the input data are processed
Algorithm
1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and the representative of each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add Di to the cluster attaining Smax; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
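As an illustration, here is a minimal Python sketch of the steps above. The notes leave the similarity measure and the cluster representative open, so the use of cosine similarity, running-centroid representatives, and the names (single_pass_cluster, threshold) are assumptions made for this sketch.

```python
import numpy as np

def single_pass_cluster(docs, threshold):
    """Single-pass clustering of an (N, M) NumPy array of document vectors.
    Returns a list of clusters, each a list of document indices."""
    centroids = [docs[0].astype(float)]          # step 1: D1 founds C1
    clusters = [[0]]
    for i in range(1, len(docs)):
        # step 2: similarity (here: cosine) to every cluster representative
        sims = [np.dot(docs[i], c) / (np.linalg.norm(docs[i]) * np.linalg.norm(c))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] > threshold:               # step 3: Smax > Sthreshold
            clusters[best].append(i)
            centroids[best] = docs[clusters[best]].mean(axis=0)  # update representative
        else:                                    # otherwise start a new cluster
            centroids.append(docs[i].astype(float))
            clusters.append([i])
    return clusters
```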
Reallocation methods
- Similarity: Allows various definitions of similarity / cluster cohesion
- Type of clusters: Refinements of initial input clustering
- Time: O(MN)
- Space: O(M + N)
- Advantages: Allows processing of larger data sets than other methods
- Disadvantages: Can take a long time to converge to a solution, depending on the appropriateness of the reallocation criteria to the structure of the data
Algorithm
1. Select M cluster representatives or centroids.
2. Assign each document to the most similar centroid.
3. Recalculate the centroid for each cluster.
4. Repeat steps 2 and 3 until there is little change in cluster membership from pass to pass.
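A minimal sketch of this reallocation loop, written in the familiar k-means style (Euclidean distance to centroids, centroids recomputed as means); the function name, the random seeding of centroids, and the stopping test are illustrative assumptions.

```python
import numpy as np

def reallocate(docs, m, max_iter=50, seed=0):
    """Reallocation sketch: choose m centroids, assign each document to
    the closest centroid, recompute centroids, and repeat until
    membership stops changing (or max_iter passes elapse)."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=m, replace=False)].astype(float)
    labels = np.full(len(docs), -1)
    for _ in range(max_iter):
        # step 2: assign each document to the most similar (nearest) centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 4: little change -> stop
            break
        labels = new_labels
        # step 3: recalculate the centroid of each (non-empty) cluster
        for j in range(m):
            if np.any(labels == j):
                centroids[j] = docs[labels == j].mean(axis=0)
    return labels, centroids
```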
Hierarchical clustering methods
Single link method
- Similarity: Join the most similar pair of objects that are not yet in the same cluster. The distance between two clusters is the distance between the closest pair of points, one drawn from each cluster.
- Type of clusters: Long straggly clusters, chains, ellipsoidal
- Time: Usually O(N**2) though can range from O(N log N) to O(N**5)
- Space: O(N)
- Advantages: Good theoretical properties, efficient implementations, widely used. No cluster centroid or representative is required, so there is no need to recalculate the similarity matrix.
- Disadvantages: Unsuitable for isolating spherical or poorly separated clusters
SLINK Algorithm
- Arrays record cluster pointer and distance information
- Process input documents one by one. For each:
- Compute and store a row of the distance matrix
- Find nearest other point, using the matrix
- Relabel clusters, as needed
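A compact Python rendering of this pointer representation, following Sibson's published SLINK algorithm; pi[i] and lam[i] are the pointer and distance arrays the steps above refer to, and `metric` is whatever distance function the caller supplies.

```python
import numpy as np

def slink(points, metric):
    """SLINK sketch: process documents one at a time, maintaining
    Sibson's pointer representation of the single link hierarchy.
    pi[i] points toward the cluster i next merges into; lam[i] is the
    distance (dendrogram level) at which that merge occurs."""
    n = len(points)
    pi = np.zeros(n, dtype=int)
    lam = np.full(n, np.inf)
    m = np.zeros(n)
    for t in range(1, n):
        pi[t], lam[t] = t, np.inf
        # compute and store one row of the distance matrix
        for i in range(t):
            m[i] = metric(points[i], points[t])
        # update pointers and distances using the new row
        for i in range(t):
            if lam[i] >= m[i]:
                m[pi[i]] = min(m[pi[i]], lam[i])
                lam[i], pi[i] = m[i], t
            else:
                m[pi[i]] = min(m[pi[i]], m[i])
        # relabel: point at t whenever t's merge happens no later
        for i in range(t):
            if lam[i] >= lam[pi[i]]:
                pi[i] = t
    return pi, lam
```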
MST algorithms (e.g., Prim's or Kruskal's algorithm)
- The MST contains all the information needed to generate the single link hierarchy in O(N**2) operations; alternatively, the single link hierarchy can be built at the same time as the MST.
- Prim-Dijkstra algorithm operates as follows:
- Place an arbitrary document in the MST and connect its nearest neighbor (NN) to it, creating an initial MST fragment.
- Find the document not in the MST that is closest to any point in the MST, and add it to the current MST fragment.
- If any document remains outside the MST fragment, repeat the previous step.
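A sketch of this fragment-growing loop over a stored (N, N) distance matrix; sorting the returned edge weights yields the single link merge levels. The names (`prim_mst`, `best`, `link`) are illustrative.

```python
import numpy as np

def prim_mst(dist):
    """Grow one MST fragment from document 0: repeatedly attach the
    out-of-fragment document closest to any in-fragment document.
    Returns (u, v, weight) edges; their sorted weights give the single
    link hierarchy's merge levels."""
    n = len(dist)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].astype(float)          # distance from each doc to the fragment
    link = np.zeros(n, dtype=int)         # closest in-fragment doc
    edges = []
    for _ in range(n - 1):
        best[in_tree] = np.inf            # never re-add fragment members
        j = int(np.argmin(best))          # closest document outside the fragment
        edges.append((int(link[j]), j, float(dist[link[j], j])))
        in_tree[j] = True
        closer = dist[j] < best           # j may now be the nearest fragment point
        best[closer] = dist[j][closer]
        link[closer] = j
    return edges
```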
Complete link method
- Similarity: Join the pair of clusters whose least similar pair of members (one from each cluster) is most similar; inter-cluster similarity is thus defined by the furthest pair.
- Type of clusters: All entries in a cluster are linked to one another within some minimum similarity, so have small, tightly bound clusters.
- Time: Worst case for the Voorhees algorithm is O(N**3), but sparse similarity matrices require much less
- Space: Worst case for the Voorhees algorithm is O(N**2), but sparse similarity matrices require much less
- Advantages: Good results from (Voorhees) comparative studies.
- Disadvantages: Difficult to apply to large data sets, since the most efficient algorithm is the general HACM using the stored data or stored matrix approach.
Voorhees Algorithm
- A variation on the sorted matrix HACM approach
- Based on the fact that if the similarities between all pairs of documents are processed in descending order, two clusters of sizes mi and mj can be merged as soon as the (mi x mj)-th similarity between documents in those clusters is reached
- Requires a sorted list of document-document similarities
- Requires a way to count the number of similarities seen between any two active clusters
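The counting idea can be sketched as follows, assuming a dense dict `sim` holding every pairwise similarity; cluster membership is tracked with a small union-find, and a merge fires exactly at the (mi x mj)-th cross-cluster similarity. This is an illustrative reconstruction of the idea, not Voorhees's published code.

```python
from collections import defaultdict

def complete_link(sim):
    """Sketch of the Voorhees counting approach to complete link.
    `sim` maps every document pair (i, j) to its similarity. Pairs are
    scanned in descending order; two active clusters merge as soon as
    all size_i * size_j cross-pair similarities have been seen."""
    parent = {}
    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in sim:
        parent[i], parent[j] = i, j
    size = {d: 1 for d in parent}
    seen = defaultdict(int)           # cross-pair similarities counted so far
    merges = []
    for (i, j), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        a, b = find(i), find(j)
        if a == b:
            continue
        key = frozenset((a, b))
        seen[key] += 1
        if seen[key] == size[a] * size[b]:     # all cross pairs seen: merge
            merges.append((a, b, s))           # s is the clusters' min similarity
            parent[b] = a
            size[a] += size[b]
            del seen[key]
            # fold b's counts against other clusters into a's
            for other in [k for k in seen if b in k]:
                (x,) = other - {b}
                seen[frozenset((a, x))] += seen.pop(other)
    return merges
```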
Group average link method
- Similarity: Use the average value of the pairwise links within a cluster, based upon all objects in the cluster. Space and time can be saved by using the inner product of two (appropriately weighted) vectors --- see the algorithm below.
- Type of clusters: Intermediate in tightness between single link and complete link
- Time: O(N**2)
- Space: O(N)
- Advantages: Ranked well in evaluation studies
- Disadvantages: Expensive for large collections
Algorithm
- The similarity between a cluster centroid and any document equals the mean similarity between that document and all documents in the cluster.
- Hence, cluster-cluster similarities can be computed using centroids only, in O(N) space.
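The identity is easy to check numerically. In this sketch the vectors are unit-length and similarity is the inner product (the "appropriately weighted" case above), so the mean of the document-to-member similarities equals a single document-to-centroid inner product.

```python
import numpy as np

# Verify the centroid identity behind the group average method: with
# unit-length vectors and inner-product similarity, the similarity of a
# document to a cluster centroid equals its mean similarity to the members.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(5, 8))
cluster /= np.linalg.norm(cluster, axis=1, keepdims=True)   # unit vectors
doc = rng.normal(size=8)
doc /= np.linalg.norm(doc)

centroid = cluster.mean(axis=0)
mean_pairwise = (cluster @ doc).mean()   # five inner products, averaged
via_centroid = centroid @ doc            # one inner product with the centroid
assert np.isclose(mean_pairwise, via_centroid)
```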
Ward's method (minimum variance method)
- Similarity: Join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on Euclidean distance between centroids
- Type of clusters: Homogeneous clusters in a symmetric hierarchy; cluster centroid nicely characterized by its center of gravity
- Time: O(N**2)
- Space: O(N) --- carries out agglomerations in restricted spatial regions
- Advantages: Good at recovering cluster structure, yields unique and exact hierarchy
- Disadvantages: Sensitive to outliers, poor at recovering elongated clusters
RNN: We can apply a reciprocal nearest neighbor (RNN) algorithm, since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each other's NN.
Algorithm - Single Cluster:
1. Select an arbitrary document.
2. Follow the NN chain from it until an RNN pair is found.
3. Merge the two points in that pair, replacing them with a single new point.
4. If only one point remains, stop. Otherwise, if a point in the NN chain precedes the merged points, return to step 2; else return to step 1.
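A sketch of this loop for Ward's method. The merge cost below is the standard increase in within-group error sum of squares for merging two centroids; for simplicity, this sketch restarts the NN chain after every merge rather than resuming from the preceding chain point (a valid but slower variant of step 4).

```python
import numpy as np

def ward_cost(ci, cj, ni, nj):
    """Increase in total within-group error sum of squares when clusters
    with centroids ci, cj and sizes ni, nj are merged."""
    return ni * nj / (ni + nj) * np.sum((ci - cj) ** 2)

def rnn_ward(points):
    """RNN sketch for Ward's method: follow NN chains (under the Ward
    cost) until a reciprocal NN pair appears, then merge that pair."""
    active = [(p.astype(float), 1) for p in points]   # (centroid, size)
    merges = []
    while len(active) > 1:
        chain = [0]                       # step 1: an arbitrary cluster
        while True:
            ci, ni = active[chain[-1]]
            # step 2: NN of the chain tip under the Ward criterion
            costs = [ward_cost(ci, cj, ni, nj) if k != chain[-1] else np.inf
                     for k, (cj, nj) in enumerate(active)]
            nn = int(np.argmin(costs))
            if len(chain) > 1 and nn == chain[-2]:
                break                     # RNN pair: the tip and chain[-2]
            chain.append(nn)
        i, j = chain[-1], chain[-2]
        (ci, ni), (cj, nj) = active[i], active[j]
        merges.append((ni + nj, float(ward_cost(ci, cj, ni, nj))))
        merged = ((ni * ci + nj * cj) / (ni + nj), ni + nj)  # step 3: new point
        for k in sorted((i, j), reverse=True):               # replace the pair
            active.pop(k)
        active.append(merged)
    return merges
```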
Centroid and median methods
- Similarity: At each stage join the pair of clusters with the most similar centroids.
- Type of clusters: Cluster is represented by the coordinates of the group centroid or median.
- Disadvantages: Newly formed clusters may differ substantially from their constituent points, so updates can cause large changes throughout the cluster hierarchy.
General algorithm for HACM
All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm.
Algorithm
1. Identify the two closest points and combine them into a cluster.
2. Identify and combine the next two closest points (treating existing clusters as points).
3. If more than one cluster remains, return to step 2.
How to apply the general algorithm:
Lance and Williams proposed the Lance-Williams dissimilarity update formula, which calculates the dissimilarity between a new cluster and each existing point from the dissimilarities that held before the new cluster was formed. The formula has 3 parameters, and each HACM is characterized by its own set of Lance-Williams parameter values (see Table 16.1 in the text). The general algorithm above can then be applied, using the appropriate Lance-Williams dissimilarities.
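A stored matrix sketch of the general algorithm driven by the Lance-Williams update, d(k, i+j) = ai*d(k,i) + aj*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|, where ai and aj are the symmetric counterparts of the table's single alpha parameter. The single link parameter values are shown; substituting another row of Table 16.1 yields the corresponding HACM.

```python
import numpy as np

def single_link_params(ni, nj):
    """Lance-Williams values for single link: ai = aj = 1/2, b = 0, g = -1/2."""
    return 0.5, 0.5, 0.0, -0.5

def hacm(d, params=single_link_params):
    """Stored matrix HACM: d is a full (N, N) dissimilarity matrix.
    Repeatedly merge the closest pair of active clusters, updating
    dissimilarities with the Lance-Williams formula."""
    d = d.astype(float).copy()
    np.fill_diagonal(d, np.inf)
    size = [1] * len(d)
    alive = list(range(len(d)))
    merges = []
    while len(alive) > 1:
        sub = d[np.ix_(alive, alive)]             # scan active clusters only
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = alive[a], alive[b]
        merges.append((i, j, float(d[i, j])))
        ai, aj, beta, gamma = params(size[i], size[j])
        for k in alive:                            # Lance-Williams update:
            if k not in (i, j):                    # row/col i becomes cluster i+j
                d[i, k] = d[k, i] = (ai * d[k, i] + aj * d[k, j] + beta * d[i, j]
                                     + gamma * abs(d[k, i] - d[k, j]))
        alive.remove(j)
        size[i] += size[j]
    return merges
```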
Implementations of the general algorithm:
- Stored matrix approach: Compute and store the dissimilarity matrix, then apply the Lance-Williams update to recalculate dissimilarities as clusters form. Storage is therefore O(N**2), and time is at least O(N**2), rising to O(N**3) if the matrix is scanned linearly at each merge.
- Stored data approach: O(N) space for the data, but pairwise dissimilarities must be recomputed at each merge, so O(N**3) time is needed.
- Sorted matrix approach: O(N**2) to compute the dissimilarity matrix, O(N**2 log N**2) to sort it, and O(N**2) to construct the hierarchy; however, the data set need not be stored, and the matrix can be processed linearly, which reduces disk accesses.