Notes on Clustering Algorithms
(Based on Notes from Ed Fox's Course at Virginia Tech, 1995-1996)
Applications in IR
- Improve search efficiency: similar items are close-by on disk.
- Improve search effectiveness: documents similar to each other are likely to be relevant to the same queries.
- Precision: improved if the items in a cluster are all relevant or all non-relevant.
- Recall: improved if retrieving a relevant cluster brings in items that would not otherwise be retrieved.
- Explore the characteristics of a document collection, citation network, etc.; for example, one can easily find nearest neighbors (NNs) or analyze the classification structure.
- Construct a thesaurus based on term-term co-occurrence in documents, as sketched below.
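A minimal sketch of the co-occurrence idea (the example documents and the threshold are illustrative assumptions, not from the notes):

    # Count, for each term pair, the number of documents in which both occur;
    # pairs above a threshold become thesaurus links.
    from collections import defaultdict
    from itertools import combinations

    docs = [["information", "retrieval", "cluster"],
            ["cluster", "retrieval", "search"],
            ["information", "search", "query"]]

    cooc = defaultdict(int)
    for doc in docs:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            cooc[(t1, t2)] += 1

    threshold = 2    # hypothetical cutoff
    thesaurus = [pair for pair, n in cooc.items() if n >= threshold]
    # -> [('cluster', 'retrieval')]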
Checklist of Clustering Issues and Steps
- Select features/attributes of objects to use in their representation.
- Select clustering method and similarity measure.
- Run clustering algorithm, producing clusters (and possibly a cluster hierarchy).
- Evaluate the clustering results.
- Update the clustering as changes are made.
- Search or otherwise use the clusters.
Similarity measures
Heuristics:
- Similarities should be normalized by length of document vectors.
- Term weighting may not be of particular value.
(For a cluster centroid, weights might even be re-computed based on ranks. For example, consider a cluster centroid with t terms, where the term with the largest weight has rank 1 and the term with the lowest weight has rank t. Suppose we want a linear decrease in weights, from 1 for the rank-1 term down to (1-c) for the rank-t term, where c might be, say, 0.9. Then the weight of the term with rank i would be:
1 - c*(i-1)/(t-1) )
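As a sketch, this rank-based weight might be computed as follows (the guard for t = 1 is an assumption, added to avoid dividing by zero):

    def rank_weight(i, t, c=0.9):
        # Weight of the term with rank i (1..t) in a centroid of t terms:
        # decreases linearly from 1 at rank 1 to (1 - c) at rank t.
        if t == 1:
            return 1.0
        return 1 - c * (i - 1) / (t - 1)

    # e.g., t = 10: rank 1 -> 1.0, rank 5 -> 0.6, rank 10 -> 0.1
    weights = [rank_weight(i, 10) for i in range(1, 11)]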
Common measures:
In all cases, the numerator is based on the inner product, and the denominator is a normalization factor.
- Dice: multiply the inner product by 2, and normalize by the sum of the two sums of squares.
- Jaccard: normalize by the sum of the two sums of squares, less the inner product.
- Cosine: normalize by the product of the two vectors' Euclidean norms (the square roots of the sums of squares).
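A sketch of the three measures on dense weight vectors (the vector representation is an assumption; the notes do not fix one):

    import math

    def inner(x, y):
        return sum(a * b for a, b in zip(x, y))

    def dice(x, y):
        return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

    def jaccard(x, y):
        return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

    def cosine(x, y):
        return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

Note that the cosine measure normalizes by vector length, in line with the first heuristic above.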
Similarity matrix
General comments:
- Matrix is useful in many clustering algorithms.
- Matrix is useful for n nearest neighbor (NN) computations.
- Naive algorithm is O(N**2).
- An inverted-file (IF) based algorithm can work for sparse matrices or individual matrix rows.
IF-based Algorithm:
Rationale ---
- sim is zero if there are no terms in common
- we can mark docs that have terms in common, with the aid of the IF
Pseudocode ---
foreach document D
    foreach term T in D
        get IF list for T
        mark documents appearing in IF list
    foreach marked document D'
        compute sim(D, D')
    clear all marks before the next iteration
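A runnable sketch of this algorithm, assuming each document is a dict mapping term -> weight (the representation and helper names are illustrative):

    import math
    from collections import defaultdict

    def build_inverted_file(docs):
        inv = defaultdict(list)            # term -> ids of docs containing it
        for doc_id, doc in enumerate(docs):
            for term in doc:
                inv[term].append(doc_id)
        return inv

    def cosine_sim(x, y):
        num = sum(w * y[t] for t, w in x.items() if t in y)
        norm = math.sqrt(sum(w * w for w in x.values())) * \
               math.sqrt(sum(w * w for w in y.values()))
        return num / norm

    def similarity_matrix(docs):
        inv = build_inverted_file(docs)
        sims = {}
        for i, doc in enumerate(docs):
            marked = set()
            for term in doc:               # only docs sharing a term get marked
                marked.update(inv[term])
            for j in marked:
                if j > i:                  # sim is symmetric; compute each pair once
                    sims[(i, j)] = cosine_sim(docs[i], docs[j])
        return sims                        # sparse: zero-similarity pairs omitted

Because only marked documents are compared, pairs with no terms in common never incur a similarity computation, which is the point of the rationale above.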
Computational complexity
- Terminology: Assume N documents and M clusters
- Space: Need:
- O(N): if store documents / input data set;
- O(M): if store clusters / centroids; and
- O(N**2): if store doc-doc sim matrix
- Time: Need:
- O(N): to just examine input data set;
- O(NM): for simple reallocation methods; and
- O(N**2): to build doc-doc sim matrix
Update
In some cases, clustering must start anew if there are any changes. Sometimes, additions can be made to the most similar cluster (but eventually that may force a regeneration or reclustering); a sketch of this heuristic appears below.
In other cases, update is part of the clustering algorithm:
- single link: van Rijsbergen and SLINK algorithms
- complete link: CLINK algorithm
Finally, cluster maintenance may be a part of the algorithm, as in the work of Crouch or that of Can et al.
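A sketch of the add-to-most-similar-cluster heuristic mentioned above, with documents and centroids as equal-length weight vectors; the running-mean centroid update is an assumption, not a detail from the notes:

    def add_document(doc, clusters, centroids, sim):
        # Assign doc to the cluster whose centroid it is most similar to;
        # sim could be, e.g., the cosine measure sketched earlier.
        best = max(range(len(centroids)), key=lambda k: sim(doc, centroids[k]))
        clusters[best].append(doc)
        n = len(clusters[best])
        # Update that centroid as the running mean of its member vectors.
        centroids[best] = [(c * (n - 1) + d) / n
                           for c, d in zip(centroids[best], doc)]
        return best

As noted, repeated additions can degrade cluster quality, eventually forcing a full reclustering.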
Evaluation
In general, Ward's method and the group average method have performed well.
Voorhees found:
- complete link most effective for larger collections
- complete link and group average comparable for smaller collections
- single link hierarchies provided the worst retrieval performance
El-Hamdouchi and Willett found:
- group average most suitable for document clustering
- complete link not as effective (based on using Defays' CLINK algorithm)