Notes on Clustering Algorithms
(Based on Notes from Ed Fox's Course at Virginia Tech, 1995-1996)
Applications in IR
- Improve search efficiency: similar items are close-by on disk.
- Improve search effectiveness: documents similar to each other are likely to be relevant to the same queries.
- Precision: improved if the items in a cluster are all relevant or all non-relevant.
- Recall: improved if retrieving a relevant cluster brings in items that would not otherwise be retrieved.
- Explore the characteristics of a document collection, citation network, etc.; for example, one can easily find nearest neighbors (NNs) or analyze the classification structure.
- Construct a thesaurus based on term-term co-occurrence in documents, as sketched below.
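A minimal sketch of the co-occurrence idea (the example documents and the threshold are illustrative assumptions, not from the notes):

    # Count, for each term pair, the number of documents in which both occur;
    # pairs above a threshold become thesaurus links.
    from collections import defaultdict
    from itertools import combinations

    docs = [["information", "retrieval", "cluster"],
            ["cluster", "retrieval", "search"],
            ["information", "search", "query"]]

    cooc = defaultdict(int)
    for doc in docs:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            cooc[(t1, t2)] += 1

    threshold = 2    # hypothetical cutoff
    thesaurus = [pair for pair, n in cooc.items() if n >= threshold]
    # -> [('cluster', 'retrieval')]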
Checklist of Clustering Issues and Steps
- Select features/attributes of objects to use in their representation.
- Select clustering method and similarity measure.
- Run clustering algorithm, producing clusters (and possibly a cluster hierarchy).
- Evaluate the clustering results.
- Update the clustering as changes are made.
- Search or otherwise use the clusters.
Similarity measures
Heuristics:
- Similarities should be normalized by length of document vectors.
- Term weighting may not be of particular value.
(For a cluster centroid, weights might even be re-computed based on ranks. For example, consider a cluster centroid with t terms, where the term with the largest weight has rank 1 and the term with the lowest weight has rank t. Suppose we want a linear decrease in weights, from 1 for the rank-1 term down to (1-c) for the rank-t term, where c might be, say, 0.9. Then the weight of the term with rank i would be:
1 - c*(i-1)/(t-1) )
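As a sketch, this rank-based weight might be computed as follows (the guard for t = 1 is an assumption, added to avoid dividing by zero):

    def rank_weight(i, t, c=0.9):
        # Weight of the term with rank i (1..t) in a centroid of t terms:
        # decreases linearly from 1 at rank 1 to (1 - c) at rank t.
        if t == 1:
            return 1.0
        return 1 - c * (i - 1) / (t - 1)

    # e.g., t = 10: rank 1 -> 1.0, rank 5 -> 0.6, rank 10 -> 0.1
    weights = [rank_weight(i, 10) for i in range(1, 11)]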
Common measures:
In all cases, the numerator is based on the inner product, and the denominator is a normalization factor.
- Dice: multiply the inner product by 2, and normalize by the sum of the two sums of squares.
- Jaccard: normalize by the sum of the two sums of squares, less the inner product.
- Cosine: normalize by the product of the two vectors' Euclidean norms (the square roots of the sums of squares).
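A sketch of the three measures on dense weight vectors (the vector representation is an assumption; the notes do not fix one):

    import math

    def inner(x, y):
        return sum(a * b for a, b in zip(x, y))

    def dice(x, y):
        return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

    def jaccard(x, y):
        return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

    def cosine(x, y):
        return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

Note that the cosine measure normalizes by vector length, in line with the first heuristic above.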
Similarity matrix
General comments:
- Matrix is useful in many clustering algorithms.
- Matrix is useful for n nearest neighbor (NN) computations.
- Naive algorithm is O(N**2).
- An inverted-file (IF) based algorithm can work for sparse matrices or individual matrix rows.
IF-based Algorithm:
Rationale ---
- sim is zero if there are no terms in common
- we can mark docs that have terms in common, with the aid of the IF
Pseudocode ---
foreach document D
    foreach term T in D
        get IF list for T
        mark documents appearing in IF list
    foreach marked document D'
        compute sim(D, D')
    clear all marks before the next iteration
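A runnable sketch of this algorithm, assuming each document is a dict mapping term -> weight (the representation and helper names are illustrative):

    import math
    from collections import defaultdict

    def build_inverted_file(docs):
        inv = defaultdict(list)            # term -> ids of docs containing it
        for doc_id, doc in enumerate(docs):
            for term in doc:
                inv[term].append(doc_id)
        return inv

    def cosine_sim(x, y):
        num = sum(w * y[t] for t, w in x.items() if t in y)
        norm = math.sqrt(sum(w * w for w in x.values())) * \
               math.sqrt(sum(w * w for w in y.values()))
        return num / norm

    def similarity_matrix(docs):
        inv = build_inverted_file(docs)
        sims = {}
        for i, doc in enumerate(docs):
            marked = set()
            for term in doc:               # only docs sharing a term get marked
                marked.update(inv[term])
            for j in marked:
                if j > i:                  # sim is symmetric; compute each pair once
                    sims[(i, j)] = cosine_sim(docs[i], docs[j])
        return sims                        # sparse: zero-similarity pairs omitted

Because only marked documents are compared, pairs with no terms in common never incur a similarity computation, which is the point of the rationale above.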
Computational complexity
- Terminology: Assume N documents and M clusters
- Space: Need:
- O(N): if store documents / input data set;
- O(M): if store clusters / centroids; and
- O(N**2): if store doc-doc sim matrix
- Time: Need:
- O(N): to just examine input data set;
- O(NM): for simple reallocation methods; and
- O(N**2): to build doc-doc sim matrix
Update
In some cases, clustering must start anew if there are any changes. Sometimes, additions can be made to the most similar cluster (but eventually that may force a regeneration or reclustering); a sketch of this heuristic appears below.
In other cases, update is part of the clustering algorithm:
- single link: van Rijsbergen and SLINK algorithms
- complete link: CLINK algorithm
Finally, cluster maintenance may be a part of the algorithm, as in the work of Crouch or that of Can et al.
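A sketch of the add-to-most-similar-cluster heuristic mentioned above, with documents and centroids as equal-length weight vectors; the running-mean centroid update is an assumption, not a detail from the notes:

    def add_document(doc, clusters, centroids, sim):
        # Assign doc to the cluster whose centroid it is most similar to;
        # sim could be, e.g., the cosine measure sketched earlier.
        best = max(range(len(centroids)), key=lambda k: sim(doc, centroids[k]))
        clusters[best].append(doc)
        n = len(clusters[best])
        # Update that centroid as the running mean of its member vectors.
        centroids[best] = [(c * (n - 1) + d) / n
                           for c, d in zip(centroids[best], doc)]
        return best

As noted, repeated additions can degrade cluster quality, eventually forcing a full reclustering.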
Evaluation
In general, Ward's method and the group average method have performed well.
Voorhees found:
- complete link most effective for larger collections
- complete link and group average comparable for smaller collections
- single link hierarchies provided the worst retrieval performance
El-Hamdouchi and Willett found:
- group average most suitable for document clustering
- complete link not as effective (based on using Defays' CLINK algorithm)