Details of Clustering Algorithms

Non-hierarchical clustering methods

Single-pass methods

Algorithm
  1. Assign the first document D1 as the representative of cluster C1
  2. Calculate the similarity Sj between document Di and each cluster, keeping track of the largest, Smax
  3. If Smax is greater than Sthreshold, add the document to the appropriate cluster, else create a new cluster with centroid Di
  4. If documents remain, repeat from step 2

Reallocation methods

Algorithm
  1. Select M cluster representatives or centroids
  2. Assign each document to the most similar centroid
  3. Recalculate the centroid for each cluster
  4. Repeat steps 2 and 3 until there is little change in cluster membership from pass to pass


Hierarchical clustering methods

single link method


SLINK Algorithm

MST Algorithms (see Prim, Kruskal algorithm animations)

complete link method

Voorhees Algorithm

group average link method

Algorithm

Ward's method (minimum variance method)

RNN:
We can apply a reciprocal nearest neighbor (RNN) algorithm, since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each others' NN.
Algorithm - Single Cluster:
  1. Select an arbitrary doc
  2. Follow the NN chain from it till an RNN pair is found.
  3. Merge the 2 points in that pair, replacing them with a single new point.
  4. If there is only one point, stop. Otherwise, if there is a point in the NN chain preceding the merged points, return to step 2 else to step 1.

centroid and median methods

General algorithm for HACM

All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm.
Algorithm
  1. Identify the 2 closest points and combine them into a cluster
  2. Identify and combine the next 2 closest points (treating existing clusters as points too)
  3. If more than one cluster remains, return to step 1.
How to apply general algorithm:
Lance and Williams proposed the Lance-Williams dissimilarity update formula, to calculate dissimilarities between a new cluster and existing points, based on the dissimilarities prior to forming the new cluster. This formula has 3 parameters, and each HACM can be characterized by its own set of Lance-Williams parameters (see Table 16.1 in text). Then, the algorithm above can be applied, using the appropriate Lance-Williams dissimilarities.

Implementations of the general algorithm:


Back to Main Page