Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and that we are interested in clustering the terms using the single pass method (note that the same method can beused to cluster the documents, but in that case, we would be using the document vectors (rows) rather than the term vector (columns). T1 T2 T3 T4 T5 Doc1 1 2 0 0 1 Doc2 3 1 2 3 0 Doc3 3 0 0 0 1 Doc4 2 1 0 3 0 Doc5 2 2 1 5 1 Start with T1 in a cluster by itself, say C1. At this point, C1 contains only one item, T1, so the centroid of C1 is simply the vector for T1: C1 = <1, 3, 3, 2, 2>. Now compare (i.e., measure similarities) of the next item (T2) to centroids of all existing clusters. At this point we have only one cluster, C1 (we will use dot product for simplicity): SIM(T2, C1) = 1*2 + 1*3 + 0*3 + 1*2 + 2*2 = 11 Now we need a pre-specified similarity threshold. Let's say that our threshold is 10. This means that if the similarity of T2 to the cluster centroid is >= 10, then we add T2 to the cluster, otherwise we use T2 to start a new cluster. In this case. SIM(T2, C1) = 11 > 10. Therefore we add T2 to cluster C1. We now need to compute the new centroid for C1 (which now contains T1 and T2). The centroid (which is the average vector for T1 and T2 is: C1 = <3/2, 4/2, 3/2, 3/2, 4/2> Now, we move to the next item, T3. Again, there is only one cluster, C1, so we only need to compare T3 with C1 centroid. The dot product of T3 and the above centroid is: SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6 This time, T3 does not pass the threshold test (the similarity is less than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two clusters C1 = {T1, T2} C2 = {T3} We move to the next unclustered item, T4. Since we now have two clusters, we need to compute the MAX similarity of T4 to the 2 cluster centroids (note that the centroid of cluster C2 right now is just the vector for T3): SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2> = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5 SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1> = 0 + 6 + 0 + 0 + 5 = 11 Note that both similarity scores pass the threshold (10), however, we pick the MAX, and therefore, T4 will be added to cluster C1. Now we have the following: C1 = {T1, T2, T4} C2 = {T3} The centroid for C2 is still just the vector for T3: C2 = <0, 2, 0, 0, 1> and the new centroid for C1 is now: C1 = <3/3, 7/3, 3/3, 6/3, 9/3> The only item left unclustered is T5. We compute its similarity to the centroids of existing clusters: SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3> = 3/3 + 0 + 3/3 + 0 + 9/3 = 5 SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1> = 0 + 0 + 0 + 0 +1 = 1 Neither of these similarity values pass the threshold. Therefore, T5 will have to go into a new cluster C3. There are no more unclustered items, so we are done (after making a single pass through the items). The final clusters are: C1 = {T1, T2, T4} C2 = {T3} C3 = {T5} Note: Obviously, the results for this method are highly dependent on the similarity threshold that is used. You should use your judgment in setting this threshold so that you are left with a reasonable number of clusters.