Details of Clustering Algorithms
Non-hierarchical clustering methods
Single-pass methods
- Similarity: Computed between input and all representatives of existing clusters
- Example --- the cover coefficient algorithm of Can et al.: select a set of documents as cluster seeds, then assign each document to the cluster that maximally covers it
- Time: O(N log N)
- Space: O(M)
- Advantages: Simple, requiring only one pass through the data; may be useful as a starting point for reallocation methods
- Disadvantages: Produces large clusters early in the process; the clusters formed depend on the order in which the input data are processed
Algorithm
1. Assign the first document D1 as the representative of cluster C1.
2. Calculate the similarity Sj between document Di and the representative of each existing cluster, keeping track of the largest value, Smax.
3. If Smax is greater than Sthreshold, add Di to the cluster attaining Smax; otherwise create a new cluster with Di as its centroid.
4. If documents remain, repeat from step 2.
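As an illustration, here is a minimal Python sketch of the steps above. The notes leave the similarity measure and the cluster representative open, so the use of cosine similarity, running-centroid representatives, and the names (single_pass_cluster, threshold) are assumptions made for this sketch.

```python
import numpy as np

def single_pass_cluster(docs, threshold):
    """Single-pass clustering of an (N, M) NumPy array of document vectors.
    Returns a list of clusters, each a list of document indices."""
    centroids = [docs[0].astype(float)]          # step 1: D1 founds C1
    clusters = [[0]]
    for i in range(1, len(docs)):
        # step 2: similarity (here: cosine) to every cluster representative
        sims = [np.dot(docs[i], c) / (np.linalg.norm(docs[i]) * np.linalg.norm(c))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] > threshold:               # step 3: Smax > Sthreshold
            clusters[best].append(i)
            centroids[best] = docs[clusters[best]].mean(axis=0)  # update representative
        else:                                    # otherwise start a new cluster
            centroids.append(docs[i].astype(float))
            clusters.append([i])
    return clusters
```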
Reallocation methods
- Similarity: Allows various definitions of similarity / cluster cohesion
- Type of clusters: Refinements of initial input clustering
- Time: O(MN)
- Space: O(M + N)
- Advantages: Allows processing of larger data sets than other methods
- Disadvantages: Can take a long time to converge to a solution, depending on the appropriateness of the reallocation criteria to the structure of the data
Algorithm
1. Select M cluster representatives or centroids.
2. Assign each document to the most similar centroid.
3. Recalculate the centroid for each cluster.
4. Repeat steps 2 and 3 until there is little change in cluster membership from pass to pass.
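A minimal sketch of this reallocation loop, written in the familiar k-means style (Euclidean distance to centroids, centroids recomputed as means); the function name, the random seeding of centroids, and the stopping test are illustrative assumptions.

```python
import numpy as np

def reallocate(docs, m, max_iter=50, seed=0):
    """Reallocation sketch: choose m centroids, assign each document to
    the closest centroid, recompute centroids, and repeat until
    membership stops changing (or max_iter passes elapse)."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=m, replace=False)].astype(float)
    labels = np.full(len(docs), -1)
    for _ in range(max_iter):
        # step 2: assign each document to the most similar (nearest) centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # step 4: little change -> stop
            break
        labels = new_labels
        # step 3: recalculate the centroid of each (non-empty) cluster
        for j in range(m):
            if np.any(labels == j):
                centroids[j] = docs[labels == j].mean(axis=0)
    return labels, centroids
```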
Hierarchical clustering methods
Single link method
- Similarity: Join the most similar pair of objects that are not yet in the same cluster. The distance between two clusters is the distance between the closest pair of points, one drawn from each cluster.
- Type of clusters: Long straggly clusters, chains, ellipsoidal
- Time: Usually O(N**2) though can range from O(N log N) to O(N**5)
- Space: O(N)
- Advantages: Good theoretical properties, efficient implementations, widely used. No cluster centroid or representative is required, so there is no need to recalculate the similarity matrix.
- Disadvantages: Unsuitable for isolating spherical or poorly separated clusters
SLINK Algorithm
- Arrays record cluster pointer and distance information
- Process input documents one by one. For each:
- Compute and store a row of the distance matrix
- Find nearest other point, using the matrix
- Relabel clusters, as needed
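A compact Python rendering of this pointer representation, following Sibson's published SLINK algorithm; pi[i] and lam[i] are the pointer and distance arrays the steps above refer to, and `metric` is whatever distance function the caller supplies.

```python
import numpy as np

def slink(points, metric):
    """SLINK sketch: process documents one at a time, maintaining
    Sibson's pointer representation of the single link hierarchy.
    pi[i] points toward the cluster i next merges into; lam[i] is the
    distance (dendrogram level) at which that merge occurs."""
    n = len(points)
    pi = np.zeros(n, dtype=int)
    lam = np.full(n, np.inf)
    m = np.zeros(n)
    for t in range(1, n):
        pi[t], lam[t] = t, np.inf
        # compute and store one row of the distance matrix
        for i in range(t):
            m[i] = metric(points[i], points[t])
        # update pointers and distances using the new row
        for i in range(t):
            if lam[i] >= m[i]:
                m[pi[i]] = min(m[pi[i]], lam[i])
                lam[i], pi[i] = m[i], t
            else:
                m[pi[i]] = min(m[pi[i]], m[i])
        # relabel: point at t whenever t's merge happens no later
        for i in range(t):
            if lam[i] >= lam[pi[i]]:
                pi[i] = t
    return pi, lam
```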
MST algorithms (e.g., Prim's or Kruskal's algorithm)
- The MST contains all the information needed to generate the single link hierarchy in O(N**2) operations; alternatively, the single link hierarchy can be built at the same time as the MST.
- Prim-Dijkstra algorithm operates as follows:
- Place an arbitrary document in the MST and connect its nearest neighbor (NN) to it, creating an initial MST fragment.
- Find the document not in the MST that is closest to any point in the MST, and add it to the current MST fragment.
- If any document remains outside the MST fragment, repeat the previous step.
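A sketch of this fragment-growing loop over a stored (N, N) distance matrix; sorting the returned edge weights yields the single link merge levels. The names (`prim_mst`, `best`, `link`) are illustrative.

```python
import numpy as np

def prim_mst(dist):
    """Grow one MST fragment from document 0: repeatedly attach the
    out-of-fragment document closest to any in-fragment document.
    Returns (u, v, weight) edges; their sorted weights give the single
    link hierarchy's merge levels."""
    n = len(dist)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].astype(float)          # distance from each doc to the fragment
    link = np.zeros(n, dtype=int)         # closest in-fragment doc
    edges = []
    for _ in range(n - 1):
        best[in_tree] = np.inf            # never re-add fragment members
        j = int(np.argmin(best))          # closest document outside the fragment
        edges.append((int(link[j]), j, float(dist[link[j], j])))
        in_tree[j] = True
        closer = dist[j] < best           # j may now be the nearest fragment point
        best[closer] = dist[j][closer]
        link[closer] = j
    return edges
```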
Complete link method
- Similarity: Join the pair of clusters whose least similar pair of members (one from each cluster) is most similar; inter-cluster similarity is thus defined by the furthest pair.
- Type of clusters: All entries in a cluster are linked to one another within some minimum similarity, so have small, tightly bound clusters.
- Time: Worst case for the Voorhees algorithm is O(N**3), but sparse similarity matrices require much less
- Space: Worst case for the Voorhees algorithm is O(N**2), but sparse similarity matrices require much less
- Advantages: Good results from (Voorhees) comparative studies.
- Disadvantages: Difficult to apply to large data sets, since the most efficient algorithm is the general HACM using the stored data or stored matrix approach.
Voorhees Algorithm
- A variation on the sorted matrix HACM approach
- Based on the fact that if the similarities between all pairs of documents are processed in descending order, two clusters of sizes mi and mj can be merged as soon as the (mi x mj)-th similarity between documents in those clusters is reached
- Requires a sorted list of document-document similarities
- Requires a way to count the number of similarities seen between any two active clusters
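The counting idea can be sketched as follows, assuming a dense dict `sim` holding every pairwise similarity; cluster membership is tracked with a small union-find, and a merge fires exactly at the (mi x mj)-th cross-cluster similarity. This is an illustrative reconstruction of the idea, not Voorhees's published code.

```python
from collections import defaultdict

def complete_link(sim):
    """Sketch of the Voorhees counting approach to complete link.
    `sim` maps every document pair (i, j) to its similarity. Pairs are
    scanned in descending order; two active clusters merge as soon as
    all size_i * size_j cross-pair similarities have been seen."""
    parent = {}
    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in sim:
        parent[i], parent[j] = i, j
    size = {d: 1 for d in parent}
    seen = defaultdict(int)           # cross-pair similarities counted so far
    merges = []
    for (i, j), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        a, b = find(i), find(j)
        if a == b:
            continue
        key = frozenset((a, b))
        seen[key] += 1
        if seen[key] == size[a] * size[b]:     # all cross pairs seen: merge
            merges.append((a, b, s))           # s is the clusters' min similarity
            parent[b] = a
            size[a] += size[b]
            del seen[key]
            # fold b's counts against other clusters into a's
            for other in [k for k in seen if b in k]:
                (x,) = other - {b}
                seen[frozenset((a, x))] += seen.pop(other)
    return merges
```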
Group average link method
- Similarity: Use the average value of the pairwise links within a cluster, based upon all objects in the cluster. Space and time can be saved by using the inner product of two (appropriately weighted) vectors --- see the algorithm below.
- Type of clusters: Intermediate in tightness between single link and complete link
- Time: O(N**2)
- Space: O(N)
- Advantages: Ranked well in evaluation studies
- Disadvantages: Expensive for large collections
Algorithm
- The similarity between a cluster centroid and any document equals the mean similarity between that document and all documents in the cluster.
- Hence, cluster-cluster similarities can be computed using centroids only, in O(N) space.
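The identity is easy to check numerically. In this sketch the vectors are unit-length and similarity is the inner product (the "appropriately weighted" case above), so the mean of the document-to-member similarities equals a single document-to-centroid inner product.

```python
import numpy as np

# Verify the centroid identity behind the group average method: with
# unit-length vectors and inner-product similarity, the similarity of a
# document to a cluster centroid equals its mean similarity to the members.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(5, 8))
cluster /= np.linalg.norm(cluster, axis=1, keepdims=True)   # unit vectors
doc = rng.normal(size=8)
doc /= np.linalg.norm(doc)

centroid = cluster.mean(axis=0)
mean_pairwise = (cluster @ doc).mean()   # five inner products, averaged
via_centroid = centroid @ doc            # one inner product with the centroid
assert np.isclose(mean_pairwise, via_centroid)
```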
Ward's method (minimum variance method)
- Similarity: Join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares, based on Euclidean distance between centroids
- Type of clusters: Homogeneous clusters in a symmetric hierarchy; cluster centroid nicely characterized by its center of gravity
- Time: O(N**2)
- Space: O(N) --- carries out agglomerations in restricted spatial regions
- Advantages: Good at recovering cluster structure, yields unique and exact hierarchy
- Disadvantages: Sensitive to outliers, poor at recovering elongated clusters
RNN: We can apply a reciprocal nearest neighbor (RNN) algorithm, since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each other's NN.
Algorithm - Single Cluster:
1. Select an arbitrary document.
2. Follow the NN chain from it until an RNN pair is found.
3. Merge the two points in that pair, replacing them with a single new point.
4. If only one point remains, stop. Otherwise, if a point in the NN chain precedes the merged points, return to step 2; else return to step 1.
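A sketch of this loop for Ward's method. The merge cost below is the standard increase in within-group error sum of squares for merging two centroids; for simplicity, this sketch restarts the NN chain after every merge rather than resuming from the preceding chain point (a valid but slower variant of step 4).

```python
import numpy as np

def ward_cost(ci, cj, ni, nj):
    """Increase in total within-group error sum of squares when clusters
    with centroids ci, cj and sizes ni, nj are merged."""
    return ni * nj / (ni + nj) * np.sum((ci - cj) ** 2)

def rnn_ward(points):
    """RNN sketch for Ward's method: follow NN chains (under the Ward
    cost) until a reciprocal NN pair appears, then merge that pair."""
    active = [(p.astype(float), 1) for p in points]   # (centroid, size)
    merges = []
    while len(active) > 1:
        chain = [0]                       # step 1: an arbitrary cluster
        while True:
            ci, ni = active[chain[-1]]
            # step 2: NN of the chain tip under the Ward criterion
            costs = [ward_cost(ci, cj, ni, nj) if k != chain[-1] else np.inf
                     for k, (cj, nj) in enumerate(active)]
            nn = int(np.argmin(costs))
            if len(chain) > 1 and nn == chain[-2]:
                break                     # RNN pair: the tip and chain[-2]
            chain.append(nn)
        i, j = chain[-1], chain[-2]
        (ci, ni), (cj, nj) = active[i], active[j]
        merges.append((ni + nj, float(ward_cost(ci, cj, ni, nj))))
        merged = ((ni * ci + nj * cj) / (ni + nj), ni + nj)  # step 3: new point
        for k in sorted((i, j), reverse=True):               # replace the pair
            active.pop(k)
        active.append(merged)
    return merges
```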
Centroid and median methods
- Similarity: At each stage join the pair of clusters with the most similar centroids.
- Type of clusters: Cluster is represented by the coordinates of the group centroid or median.
- Disadvantages: Newly formed clusters may differ substantially from their constituent points, so updates can cause large changes throughout the cluster hierarchy.
General algorithm for HACM
All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm.
Algorithm
1. Identify the two closest points and combine them into a cluster.
2. Identify and combine the next two closest points (treating existing clusters as points).
3. If more than one cluster remains, return to step 2.
How to apply the general algorithm:
Lance and Williams proposed the Lance-Williams dissimilarity update formula, which calculates the dissimilarity between a new cluster and each existing point from the dissimilarities that held before the new cluster was formed. The formula has 3 parameters, and each HACM is characterized by its own set of Lance-Williams parameter values (see Table 16.1 in the text). The general algorithm above can then be applied, using the appropriate Lance-Williams dissimilarities.
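A stored matrix sketch of the general algorithm driven by the Lance-Williams update, d(k, i+j) = ai*d(k,i) + aj*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|, where ai and aj are the symmetric counterparts of the table's single alpha parameter. The single link parameter values are shown; substituting another row of Table 16.1 yields the corresponding HACM.

```python
import numpy as np

def single_link_params(ni, nj):
    """Lance-Williams values for single link: ai = aj = 1/2, b = 0, g = -1/2."""
    return 0.5, 0.5, 0.0, -0.5

def hacm(d, params=single_link_params):
    """Stored matrix HACM: d is a full (N, N) dissimilarity matrix.
    Repeatedly merge the closest pair of active clusters, updating
    dissimilarities with the Lance-Williams formula."""
    d = d.astype(float).copy()
    np.fill_diagonal(d, np.inf)
    size = [1] * len(d)
    alive = list(range(len(d)))
    merges = []
    while len(alive) > 1:
        sub = d[np.ix_(alive, alive)]             # scan active clusters only
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = alive[a], alive[b]
        merges.append((i, j, float(d[i, j])))
        ai, aj, beta, gamma = params(size[i], size[j])
        for k in alive:                            # Lance-Williams update:
            if k not in (i, j):                    # row/col i becomes cluster i+j
                d[i, k] = d[k, i] = (ai * d[k, i] + aj * d[k, j] + beta * d[i, j]
                                     + gamma * abs(d[k, i] - d[k, j]))
        alive.remove(j)
        size[i] += size[j]
    return merges
```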
Implementations of the general algorithm:
- Stored matrix approach: Compute and store the dissimilarity matrix, then apply the Lance-Williams update to recalculate dissimilarities as clusters form. Storage is therefore O(N**2), and time is at least O(N**2), rising to O(N**3) if the matrix is scanned linearly at each merge.
- Stored data approach: O(N) space for the data, but pairwise dissimilarities must be recomputed at each merge, so O(N**3) time is needed.
- Sorted matrix approach: O(N**2) to compute the dissimilarity matrix, O(N**2 log N**2) to sort it, and O(N**2) to construct the hierarchy; however, the data set need not be stored, and the matrix can be processed linearly, which reduces disk accesses.