# Notes on Clustering Algorithms

(Based on Notes from Ed Fox's Course at Virginia Tech., 1995-1996)

## Applications in IR

• Improve search efficiency: similar items are close-by on disk.
• Improve search effectiveness: it is likely that similar documents will be similar to the same query.
• Precision: if cluster items are all relevant, or all non-relevant
• Recall: if one might not otherwise retrieve all items in a relevant cluster
• Explore the characteristics of a document collection, citation network, etc. Thus, one can easily find nearest neighbors (NNs), or analyze the classification structure.
• Construct a thesaurus based on term-term co-occurrence in documents.

## Checklist of Clustering Issues and Steps

• Select features/attributes of objects to use in their representation.
• Select clustering method and similarity measure.
• Run clustering algorithm, producing clusters (and possibly a cluster hierarchy).
• Evaluate the clustering results.
• Update the clustering as changes are made.
• Search or otherwise use the clusters.

## Similarity measures

Heuristics:
• Similarities should be normalized by length of document vectors.
• Term weighting may not be of particular value.
(For a cluster centroid, weights might even be re-computed based on ranks. For example, consider a cluster centroid with t terms, so the one with largest weight has rank 1 and the term with lowest weight has rank t. Assume we want a linear decrease in weights, from value 1 for the rank 1 term down to value (1-c) for the term with rank t, where c might be say 0.9. Then, we might use the following value for the weight of the term with rank i:
1 - c*(i-1)/(t-1) )
Common measures:
In all cases, the numerator is based on the inner product, and the denominator is a normalization factor.
• Dice: multiply numerator by 2, and normalize by the sum of both sums of squares
• Jaccard: normalize by the sum of both sums of squares, less the inner product
• Cosine

## Similarity matrix

• Matrix is useful in many clustering algorithms.
• Matrix is useful for n nearest neighbor (NN) computations.
• Naive algorithm is O(n**2).
• IF-based algorithm can work for sparse matrices or matrix rows.
IF-based Algorithm:
Rationale ---
• sim is zero if there are no terms in common
• we can mark docs that have terms in common, with the aid of the IF
Pseudocode ---
.......foreach document
..............foreach term in document
.....................get IF list for term
............................mark documents appearing in IF list
..............foreach marked document
.....................compute sim with doc in outer loop

## Computational complexity

• Terminology: Assume N documents and M clusters
• Space: Need:
• O(N): if store documents / input data set;
• O(M): if store clusters / centroids; and
• O(N**2): if store doc-doc sim matrix
• Time: Need:
• O(N): to just examine input data set;
• O(NM): for simple reallocation methods; and
• O(N**2): to build doc-doc sim matrix

## Update

In some cases, clustering must start anew if there are any changes. Sometimes, additions can be made to the most similar cluster (but eventually that may force a regeneration or reclustering).

In other cases, update is part of the clustering algorithm:

Finally, cluster maintenance may be a part of the algorithm, as in the work of Crouch or that of Can et al.

## Evaluation

In general, Ward's method, and the group average method, have performed well.

Voorhees found:

• complete link most effective for larger collections
• complete link and group average comparable for smaller collections
• single link hierarchies provided the worst retrieval performance

El-Hamdouchi and WIllett found:

• group average most suitable for document clustering
• complete link not as effective (based on using Defays' CLINK algorithm)