Notes on Clustering Algorithms

(Based on Notes from Ed Fox's Course at Virginia Tech., 1995-1996)


Illustrations of Key Clustering Concepts

Applications in IR

Checklist of Clustering Issues and Steps

Similarity measures

Heuristics: Common measures:
In all cases, the numerator is based on the inner product, and the denominator is a normalization factor.

Similarity matrix

General comments: IF-based Algorithm:
Rationale --- Pseudocode ---
.......foreach document
..............foreach term in document
.....................get IF list for term
............................mark documents appearing in IF list
..............foreach marked document
.....................compute sim with doc in outer loop

Computational complexity


Clustering Algorithm Details


Update

In some cases, clustering must start anew if there are any changes. Sometimes, additions can be made to the most similar cluster (but eventually that may force a regeneration or reclustering).

In other cases, update is part of the clustering algorithm:

Finally, cluster maintenance may be a part of the algorithm, as in the work of Crouch or that of Can et al.

Evaluation

In general, Ward's method, and the group average method, have performed well.

Voorhees found:

El-Hamdouchi and WIllett found: