In this example, we illustrate several unsupervised learning techniques (clustering, PCA, and SVD) using a small term-document matrix as the data.

In [1]:
 
import numpy as np
import pylab as pl
from sklearn.cluster import KMeans 
In [3]:
 
cd C:\WinPython27\Data
C:\WinPython27\Data
In [7]:
 
TD = np.genfromtxt('term-doc-mat.csv',delimiter=',',usecols=(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15),dtype=int)
In [13]:
 
terms = np.genfromtxt('term-doc-mat.csv',delimiter=',',usecols=(0),dtype=str)
In [14]:
 
print terms
['database' 'index' 'likelihood' 'linear' 'matrix' 'query' 'regression'
 'retrieval' 'sql' 'vector']
In [15]:
 
print TD
[[24 32 12  6 43  2  0  3  1  6  4  0  0  0  0]
 [ 9  5  5  2 20  0  1  0  0  0 27 14  3  2 11]
 [ 0  3  0  0  3  7 12  4 27  4  0  1  0  0  0]
 [ 3  0  0  0  0 16  0  2 25 23  7 12 21  3  2]
 [ 1  0  0  0  0 33  2  0  7 12 14  5 12  4  0]
 [12  2  0  0 27  0  0  0  0 22  9  4  0  5  3]
 [ 0  0  0  0  0 18 32 22 34 17  0  0  0  0  0]
 [ 1  0  0  0  2  0  0  0  3  9 27  7  5  4  4]
 [21 10 16  7 31  0  0  0  0  0  0  0  0  1  0]
 [ 2  0  0  2  0 27  4  2 11  8 33 16 14  7  3]]

First, we want to do some document clustering. Since the data is in term-document format, we need to obtain the transpose of the TD matrix.

In [16]:
 
DT = TD.T

Now we have a document-term matrix:

In [18]:
 
print DT
[[24  9  0  3  1 12  0  1 21  2]
 [32  5  3  0  0  2  0  0 10  0]
 [12  5  0  0  0  0  0  0 16  0]
 [ 6  2  0  0  0  0  0  0  7  2]
 [43 20  3  0  0 27  0  2 31  0]
 [ 2  0  7 16 33  0 18  0  0 27]
 [ 0  1 12  0  2  0 32  0  0  4]
 [ 3  0  4  2  0  0 22  0  0  2]
 [ 1  0 27 25  7  0 34  3  0 11]
 [ 6  0  4 23 12 22 17  9  0  8]
 [ 4 27  0  7 14  9  0 27  0 33]
 [ 0 14  1 12  5  4  0  7  0 16]
 [ 0  3  0 21 12  0  0  5  0 14]
 [ 0  2  0  3  4  5  0  4  1  7]
 [ 0 11  0  2  0  3  0  4  0  3]]
In [25]:
 
numTerms=len(DT[0])
In [26]:
 
print numTerms
10

Next, we will transform the data to TFxIDF weights:

In [37]:
 
# Find document frequencies for each term
DF = np.array([(DT!=0).sum(0)])
print DF
[[10 11  8 10  9  8  5  9  6 12]]
In [38]:
 
NDocs = len(DT[:,0])
print NDocs
15
In [411]:
 
# Create a matrix with all entries = NDocs
NMatrix=np.ones(np.shape(DT), dtype=float)*NDocs
In [49]:
 
# Convert each entry into IDF values
# Note that IDF is only a function of the term, so all rows will be identical.
IDF = np.log2(np.divide(NMatrix, DF))
In [412]:
 
np.set_printoptions(precision=2,suppress=True)
print IDF[0:2,]
[[ 0.58  0.45  0.91  0.58  0.74  0.91  1.58  0.74  1.32  0.32]
 [ 0.58  0.45  0.91  0.58  0.74  0.91  1.58  0.74  1.32  0.32]]
In [413]:
 
# Finally compute the TFxIDF values for each document-term entry
DT_tfidf = DT * IDF
In [414]:
 
print DT_tfidf
[[ 14.04   4.03   0.     1.75   0.74  10.88   0.     0.74  27.76   0.64]
 [ 18.72   2.24   2.72   0.     0.     1.81   0.     0.    13.22   0.  ]
 [  7.02   2.24   0.     0.     0.     0.     0.     0.    21.15   0.  ]
 [  3.51   0.89   0.     0.     0.     0.     0.     0.     9.25   0.64]
 [ 25.15   8.95   2.72   0.     0.    24.49   0.     1.47  40.98   0.  ]
 [  1.17   0.     6.35   9.36  24.32   0.    28.53   0.     0.     8.69]
 [  0.     0.45  10.88   0.     1.47   0.    50.72   0.     0.     1.29]
 [  1.75   0.     3.63   1.17   0.     0.    34.87   0.     0.     0.64]
 [  0.58   0.    24.49  14.62   5.16   0.    53.89   2.21   0.     3.54]
 [  3.51   0.     3.63  13.45   8.84  19.95  26.94   6.63   0.     2.58]
 [  2.34  12.08   0.     4.09  10.32   8.16   0.    19.9    0.    10.62]
 [  0.     6.26   0.91   7.02   3.68   3.63   0.     5.16   0.     5.15]
 [  0.     1.34   0.    12.28   8.84   0.     0.     3.68   0.     4.51]
 [  0.     0.89   0.     1.75   2.95   4.53   0.     2.95   1.32   2.25]
 [  0.     4.92   0.     1.17   0.     2.72   0.     2.95   0.     0.97]]
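
For reuse, the manual TFxIDF steps above can be wrapped into a small helper function. This is just a sketch (tfidf_transform is a name introduced here, not part of the original notebook) using the same log2-based IDF as above:

In [ ]:
 
def tfidf_transform(dt):
    # dt: document-term matrix (rows = documents, columns = terms)
    ndocs = dt.shape[0]
    df = (dt != 0).sum(axis=0)          # document frequency of each term
    idf = np.log2(float(ndocs) / df)    # same log2-based IDF as above
    return dt * idf                     # broadcast the IDF weights across all documents
 
# should reproduce DT_tfidf computed above
print np.allclose(tfidf_transform(DT), DT_tfidf)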

Now we are ready for clustering:

In [415]:
 
import kMeans   # custom k-means module providing kMeans, distCosine, and randCent
In [104]:
 
centroids_tfidf, clusters_tfidf = kMeans.kMeans(DT_tfidf, 3, kMeans.distCosine, kMeans.randCent)
Iteration  1
Iteration  2
Iteration  3

Let's take a look at the cluster centroids:

In [187]:
 
print "\t\tCluster0\tCluster1\tCluster2"
for i in range(len(terms)):
    print "%10s\t%.4f\t\t%.4f\t\t%.4f" %(terms[i],centroids_tfidf[0][i],centroids_tfidf[1][i],centroids_tfidf[2][i])
		Cluster0	Cluster1	Cluster2
  database	1.4039		0.4680		13.6881
     index	0.0895		5.1010		3.6692
likelihood	9.7944		0.1814		1.0883
    linear	7.7215		5.2647		0.3510
    matrix	7.9592		5.1588		0.1474
     query	3.9903		3.8089		7.4365
regression	38.9901		0.0000		0.0000
 retrieval	1.7687		6.9275		0.4422
       sql	0.0000		0.2644		22.4728
    vector	3.3481		4.7002		0.2575

Because the centroids are based on TFxIDF weights, they are not as descriptive as raw term frequencies or binary occurrence data. Let's redo the clustering with the original raw term frequencies.

In [108]:
 
centroids, clusters = kMeans.kMeans(DT, 3, kMeans.distCosine, kMeans.randCent)
Iteration  1
Iteration  2
Iteration  3
In [188]:
 
print "\t\tCluster0\tCluster1\tCluster2"
for i in range(len(terms)):
    print "%10s\t%.4f\t\t%.4f\t\t%.4f" %(terms[i],centroids[0][i],centroids[1][i],centroids[2][i])
		Cluster0	Cluster1	Cluster2
  database	2.4000		0.8000		23.4000
     index	0.2000		11.4000		8.2000
likelihood	10.8000		0.2000		1.2000
    linear	13.2000		9.0000		0.6000
    matrix	10.8000		7.0000		0.2000
     query	4.4000		4.2000		8.2000
regression	24.6000		0.0000		0.0000
 retrieval	2.4000		9.4000		0.6000
       sql	0.0000		0.2000		17.0000
    vector	10.4000		14.6000		0.8000
In [416]:
 
print clusters
[[ 2.    0.  ]
 [ 2.    0.01]
 [ 2.    0.01]
 [ 2.    0.01]
 [ 2.    0.  ]
 [ 0.    0.03]
 [ 0.    0.03]
 [ 0.    0.03]
 [ 0.    0.  ]
 [ 0.    0.04]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 1.    0.04]
 [ 1.    0.01]
 [ 1.    0.05]]
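
To read the assignments more easily, we can group the document indices by cluster. This is a short sketch; it assumes (as the printout above suggests) that the first column of clusters holds the assigned cluster index and the second column an error/distance value:

In [ ]:
 
for k in range(3):
    members = [i for i in range(NDocs) if int(clusters[i, 0]) == k]
    print "Cluster %d: documents %s" % (k, members)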

Next, let's use principal component analysis to reduce the dimensionality of the data:

In [417]:
 
from sklearn import decomposition

We'll perform PCA to obtain the top 5 principal components and then project the DT matrix into this lower dimensional space:

In [418]:
 
pca = decomposition.PCA(n_components=5)
DTtrans = pca.fit(DT).transform(DT)
In [419]:
 
np.set_printoptions(precision=2,suppress=True)
print DTtrans
[[-25.45  -1.8    4.02  -3.12  -0.24]
 [-23.78  -7.29  -0.53  -4.77   6.32]
 [-15.03  -5.44 -10.09  -3.46   2.87]
 [ -6.75  -4.39 -14.84  -3.08   0.44]
 [-46.7    0.01  21.78   3.11  -0.29]
 [ 27.41   7.93  12.27 -15.16  13.47]
 [ 15.67 -21.74  -1.33  11.4    5.81]
 [  8.28 -16.54  -6.94   5.39   1.47]
 [ 29.51 -18.64  16.68   4.82  -1.69]
 [ 11.41   0.44  15.03  -3.03 -17.19]
 [  6.98  37.69   4.5   14.83   5.5 ]
 [  5.97  13.91  -6.64   2.12  -3.32]
 [ 12.36  10.4   -4.93 -11.5   -5.51]
 [  1.67   2.69 -13.18  -2.04  -3.72]
 [ -1.53   2.79 -15.8    4.49  -3.94]]
In [420]:
 
print(pca.explained_variance_ratio_)
[ 0.45  0.23  0.15  0.07  0.05]

Looking at the above, the first 5 components together explain about 95% of the variance in the data.
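
As a quick check (a one-line sketch reusing the fitted pca object), the cumulative sum of the explained variance ratios shows how that 95% accumulates across the five components:

In [ ]:
 
print pca.explained_variance_ratio_.cumsum()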

Now, we can redo the clustering, but this time in the lower dimensional space:

In [421]:
 
centroids_pca, clusters_pca = kMeans.kMeans(DTtrans, 3, kMeans.distCosine, kMeans.randCent)
Iteration  1
Iteration  2
Iteration  3
In [422]:
 
print clusters_pca
[[ 1.    0.  ]
 [ 1.    0.  ]
 [ 1.    0.03]
 [ 1.    0.31]
 [ 1.    0.01]
 [ 2.    0.14]
 [ 2.    0.06]
 [ 2.    0.17]
 [ 2.    0.  ]
 [ 2.    0.21]
 [ 0.    0.05]
 [ 0.    0.  ]
 [ 0.    0.11]
 [ 0.    0.14]
 [ 0.    0.19]]

Next, let's actually derive the principal components manually using linear algebra rather than relying on the PCA package from sklearn:

The first step is to compute the covariance matrix of the mean-centered data:

In [423]:
 
meanVals = np.mean(DT, axis=0)
meanRemoved = DT - meanVals   # remove the mean from each column
covMat = np.cov(meanRemoved, rowvar=0)
 
np.set_printoptions(precision=2,suppress=True,linewidth=100)
print covMat
[[ 179.7    38.44  -17.06  -50.7   -40.93   66.87  -60.9   -19.62  116.32  -59.49]
 [  38.44   67.26  -21.54  -19.81   -6.5    31.83  -55.7    38.13   27.67   27.04]
 [ -17.06  -21.54   51.78   31.1     9.36  -11.61   77.41   -8.72  -16.2     4.67]
 [ -50.7   -19.81   31.1    85.97   51.43    2.54   45.59   15.13  -41.97   47.47]
 [ -40.93   -6.5     9.36   51.43   80.57   -4.43   25.86   17.64  -35.07   74.14]
 [  66.87   31.83  -11.61    2.54   -4.43   72.97  -22.49   15.7    45.17   -8.39]
 [ -60.9   -55.7    77.41   45.59   25.86  -22.49  162.03  -18.1   -50.37    7.87]
 [ -19.62   38.13   -8.72   15.13   17.64   15.7   -18.1    48.12  -19.18   49.06]
 [ 116.32   27.67  -16.2   -41.97  -35.07   45.17  -50.37  -19.18   93.92  -48.33]
 [ -59.49   27.04    4.67   47.47   74.14   -8.39    7.87   49.06  -48.33  102.26]]
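
As a sanity check (a sketch, not part of the original notebook), the same covariance matrix can be computed directly from the mean-centered data as X_centered.T * X_centered / (n - 1):

In [ ]:
 
# Manual covariance computation; should match np.cov above
covCheck = np.dot(meanRemoved.T, meanRemoved) / (NDocs - 1.0)
print np.allclose(covCheck, covMat)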
In [424]:
 
eigVals, eigVects = np.linalg.eig(np.mat(covMat))
eigValInd = np.argsort(eigVals)   # argsort returns indices from smallest to largest
eigValInd = eigValInd[::-1]       # reverse to get the largest eigenvalues first
sortedEigVals = eigVals[eigValInd]
print sortedEigVals
total = sum(sortedEigVals)
varPercentage = sortedEigVals/total*100
print varPercentage
[ 426.77  214.24  144.96   61.87   47.51   26.97   13.24    4.71    3.25    1.06]
[ 45.18  22.68  15.35   6.55   5.03   2.86   1.4    0.5    0.34   0.11]
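
These percentages should agree (up to numerical precision) with the explained_variance_ratio_ values reported by sklearn's PCA earlier; a quick sketch to verify:

In [ ]:
 
# The first 5 eigenvalue percentages should match sklearn's explained variance ratios
print np.allclose(varPercentage[:5] / 100, pca.explained_variance_ratio_)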

We can plot the principal components based on the percentage of variance they capture:

In [425]:
 
import matplotlib
import matplotlib.pyplot as plt
 
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(1, 11), varPercentage[:10], marker='^')
plt.xlabel('Principal Component Number')
plt.ylabel('Percentage of Variance')
plt.show()
In [426]:
 
topNfeat = 5
topEigValInd = eigValInd[:topNfeat]          # cut off unwanted dimensions
reducedEigVects = eigVects[:,topEigValInd]   # reorganize eig vects largest to smallest
reducedDT = np.dot(meanRemoved, reducedEigVects)   # transform data into the new (reduced) dimensions
print reducedDT
[[-25.45  -1.8    4.02  -3.12  -0.24]
 [-23.78  -7.29  -0.53  -4.77   6.32]
 [-15.03  -5.44 -10.09  -3.46   2.87]
 [ -6.75  -4.39 -14.84  -3.08   0.44]
 [-46.7    0.01  21.78   3.11  -0.29]
 [ 27.41   7.93  12.27 -15.16  13.47]
 [ 15.67 -21.74  -1.33  11.4    5.81]
 [  8.28 -16.54  -6.94   5.39   1.47]
 [ 29.51 -18.64  16.68   4.82  -1.69]
 [ 11.41   0.44  15.03  -3.03 -17.19]
 [  6.98  37.69   4.5   14.83   5.5 ]
 [  5.97  13.91  -6.64   2.12  -3.32]
 [ 12.36  10.4   -4.93 -11.5   -5.51]
 [  1.67   2.69 -13.18  -2.04  -3.72]
 [ -1.53   2.79 -15.8    4.49  -3.94]]
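
Since eigenvectors are only defined up to sign, this projection should match the DTtrans matrix obtained from sklearn's PCA earlier, up to possible sign flips of individual components; a quick sketch to check:

In [ ]:
 
# Compare absolute values to ignore per-component sign differences
print np.allclose(np.abs(np.asarray(reducedDT)), np.abs(DTtrans))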

Next, let's look at an application of Singular Value Decomposition. This time, we'll focus on the term-document matrix in order to find themes based on combinations of terms.

In [427]:
 
u, s, vt = np.linalg.svd(TD, full_matrices=False)
In [428]:
 
print u
[[ 0.39 -0.6   0.22  0.17  0.22 -0.1  -0.59  0.07 -0.07  0.04]
 [ 0.3  -0.2  -0.33 -0.5   0.06 -0.17  0.21 -0.08 -0.57 -0.31]
 [ 0.2   0.16  0.33 -0.16  0.04 -0.4   0.   -0.76  0.23 -0.02]
 [ 0.37  0.27  0.    0.42 -0.49 -0.53  0.04  0.27 -0.16 -0.05]
 [ 0.32  0.23 -0.19  0.49  0.31  0.38  0.03 -0.22  0.   -0.52]
 [ 0.29 -0.23 -0.02  0.   -0.66  0.53  0.07 -0.31 -0.03  0.22]
 [ 0.36  0.38  0.62 -0.35  0.08  0.3   0.02  0.34 -0.1  -0.03]
 [ 0.21  0.05 -0.34 -0.38 -0.2  -0.02 -0.28  0.21  0.65 -0.33]
 [ 0.23 -0.44  0.15  0.11  0.15 -0.09  0.72  0.18  0.38 -0.02]
 [ 0.42  0.25 -0.42 -0.03  0.33  0.    0.02  0.01  0.07  0.68]]
In [429]:
 
print s
[ 93.97  77.25  54.14  29.74  26.27  19.76  13.75   9.41   7.88   3.88]
In [430]:
 
print np.diag(s)
[[ 93.97   0.     0.     0.     0.     0.     0.     0.     0.     0.  ]
 [  0.    77.25   0.     0.     0.     0.     0.     0.     0.     0.  ]
 [  0.     0.    54.14   0.     0.     0.     0.     0.     0.     0.  ]
 [  0.     0.     0.    29.74   0.     0.     0.     0.     0.     0.  ]
 [  0.     0.     0.     0.    26.27   0.     0.     0.     0.     0.  ]
 [  0.     0.     0.     0.     0.    19.76   0.     0.     0.     0.  ]
 [  0.     0.     0.     0.     0.     0.    13.75   0.     0.     0.  ]
 [  0.     0.     0.     0.     0.     0.     0.     9.41   0.     0.  ]
 [  0.     0.     0.     0.     0.     0.     0.     0.     7.88   0.  ]
 [  0.     0.     0.     0.     0.     0.     0.     0.     0.     3.88]]
In [431]:
 
originalTD = np.dot(u, np.dot(np.diag(s), vt))   # reconstruct TD from U * S * VT
print originalTD
[[ 24.  32.  12.   6.  43.   2.  -0.   3.   1.   6.   4.   0.   0.   0.   0.]
 [  9.   5.   5.   2.  20.   0.   1.   0.  -0.   0.  27.  14.   3.   2.  11.]
 [ -0.   3.  -0.  -0.   3.   7.  12.   4.  27.   4.   0.   1.   0.   0.  -0.]
 [  3.   0.   0.   0.  -0.  16.   0.   2.  25.  23.   7.  12.  21.   3.   2.]
 [  1.   0.   0.  -0.   0.  33.   2.  -0.   7.  12.  14.   5.  12.   4.   0.]
 [ 12.   2.   0.   0.  27.   0.   0.   0.   0.  22.   9.   4.   0.   5.   3.]
 [  0.  -0.   0.   0.  -0.  18.  32.  22.  34.  17.  -0.   0.   0.   0.  -0.]
 [  1.   0.   0.   0.   2.   0.   0.  -0.   3.   9.  27.   7.   5.   4.   4.]
 [ 21.  10.  16.   7.  31.  -0.   0.   0.   0.   0.  -0.   0.   0.   1.   0.]
 [  2.  -0.  -0.   2.   0.  27.   4.   2.  11.   8.  33.  16.  14.   7.   3.]]
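
Since all 10 singular values were kept, the reconstruction recovers the original TD matrix up to floating point error; a one-line check:

In [ ]:
 
print np.allclose(originalTD, TD)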
In [432]:
 
numDimensions = 3
u_ld = u[:, :numDimensions]
sigma = np.diag(s)[:numDimensions, :numDimensions]
vt_ld = vt[:numDimensions, :]
lowRankTD = np.dot(u_ld, np.dot(sigma, vt_ld))
In [433]:
 
np.set_printoptions(precision=2,suppress=True,linewidth=120)
print lowRankTD
[[ 25.33  22.89  13.5    6.13  45.48  -1.98   1.99   2.79   1.9    7.87   4.6    1.34  -1.82   1.01   2.14]
 [ 10.71   7.37   4.79   2.6   18.7    6.82  -5.26  -3.14  -3.04   6.47  21.6    9.52   6.96   4.02   4.41]
 [  1.46   2.05   0.6    0.29   2.07  10.24  12.93   8.54  19.25   9.57  -2.74  -0.     2.18   0.08  -0.94]
 [  1.25  -0.19  -0.45   0.22   1.03  20.3   10.26   6.46  20.37  14.86  15.49   8.63  10.26   3.55   2.17]
 [  0.3   -1.69  -1.06   0.02  -0.57  18.19   4.81   2.78  13.17  12.03  19.71  10.15  10.88   4.12   3.05]
 [ 12.74  10.59   6.33   3.07  22.51   4.64   0.76   1.09   3.03   7.25  10.81   4.81   3.12   2.17   2.49]
 [  0.58   1.9   -0.08   0.01  -0.06  20.55  25.06  16.42  37.57  18.2   -5.23   0.13   4.68   0.16  -1.94]
 [  2.26  -0.13   0.23   0.54   3.39   9.93  -3.09  -2.23   0.62   5.95  19.64   9.17   8.18   3.73   3.5 ]
 [ 17.34  15.78   9.34   4.21  31.25  -3.21   0.35   1.28  -0.65   4.05   1.88   0.19  -2.16   0.39   1.3 ]
 [  1.34  -2.11  -1.07   0.26   0.94  23.19   1.78   0.66  11.78  14.65  31.61  15.6   15.57   6.33   5.21]]

The reduced VT matrix can be viewed as the new representation of the documents in the lower dimensional space: each column corresponds to a document and each row to a latent dimension (theme).

In [434]:
 
print vt_ld
[[ 0.24  0.18  0.1   0.06  0.41  0.39  0.18  0.12  0.37  0.35  0.41  0.21  0.21  0.09  0.07]
 [-0.34 -0.32 -0.2  -0.08 -0.63  0.33  0.2   0.11  0.36  0.13  0.06  0.07  0.15  0.02 -0.02]
 [ 0.07  0.14  0.06  0.02  0.13 -0.06  0.4   0.27  0.43  0.08 -0.62 -0.26 -0.2  -0.1  -0.12]]
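
Similarly, the columns of the reduced U matrix give the term loadings for each latent dimension. As an illustrative sketch (not part of the original notebook), we can list the terms with the largest absolute loadings for each of the three themes:

In [ ]:
 
for d in range(numDimensions):
    topTermInds = np.argsort(np.abs(u_ld[:, d]))[::-1][:3]
    print "Theme %d: %s" % (d, ", ".join(terms[i] for i in topTermInds))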

In information retrieval, a query is compared to documents using vector-space similarity between the query vector and the document vectors. In the lower dimensional space, this is achieved by first mapping the query into that space and then comparing it to the documents there.

In [435]:
 
queryVector = np.array([0,0,1,5,4,0,6,0,0,2])
lowDimQuery = np.dot(np.linalg.inv(sigma), np.dot(u_ld.T, queryVector))
print lowDimQuery
 
[ 0.07  0.07  0.05]
In [440]:
 
# Compute the cosine similarity between the query and the documents in the lower dimensional space
 
qNorm = lowDimQuery / np.linalg.norm(lowDimQuery)
In [442]:
 
docNorm = np.array([vt_ld[:,i]/np.linalg.norm(vt_ld[:,i]) for i in range(len(vt_ld[0]))])
print docNorm
[[ 0.57 -0.81  0.16]
 [ 0.47 -0.8   0.37]
 [ 0.45 -0.85  0.27]
 [ 0.55 -0.82  0.15]
 [ 0.54 -0.83  0.17]
 [ 0.75  0.64 -0.12]
 [ 0.37  0.41  0.83]
 [ 0.38  0.33  0.86]
 [ 0.55  0.54  0.64]
 [ 0.92  0.33  0.2 ]
 [ 0.55  0.08 -0.83]
 [ 0.61  0.2  -0.77]
 [ 0.64  0.47 -0.61]
 [ 0.64  0.16 -0.75]
 [ 0.53 -0.13 -0.84]]
In [450]:
 
sims = np.dot(qNorm, docNorm.T)
# return indices of the docs in descending order of similarity to the query
simInds = sims.argsort()[::-1]
for i in simInds:
    print "Cosine similarity between Document %d and the query is: %.4f" %(i,sims[i])
Cosine similarity between Document 8 and the query is: 0.9693
Cosine similarity between Document 9 and the query is: 0.8843
Cosine similarity between Document 6 and the query is: 0.8604
Cosine similarity between Document 5 and the query is: 0.8362
Cosine similarity between Document 7 and the query is: 0.8309
Cosine similarity between Document 12 and the query is: 0.4358
Cosine similarity between Document 13 and the query is: 0.1840
Cosine similarity between Document 11 and the query is: 0.1767
Cosine similarity between Document 10 and the query is: 0.0404
Cosine similarity between Document 1 and the query is: -0.0555
Cosine similarity between Document 0 and the query is: -0.0810
Cosine similarity between Document 3 and the query is: -0.1039
Cosine similarity between Document 14 and the query is: -0.1097
Cosine similarity between Document 4 and the query is: -0.1119
Cosine similarity between Document 2 and the query is: -0.1375
In [466]:
 
centroids_svd, clusters_svd = kMeans.kMeans(vt_ld.T, 3, kMeans.distCosine, kMeans.randCent)
Iteration  1
Iteration  2
Iteration  3
In [467]:
 
print clusters_svd
[[ 0.    0.  ]
 [ 0.    0.  ]
 [ 0.    0.  ]
 [ 0.    0.  ]
 [ 0.    0.  ]
 [ 2.    0.05]
 [ 2.    0.01]
 [ 2.    0.01]
 [ 2.    0.  ]
 [ 2.    0.01]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 1.    0.  ]]
In [468]:
 
print "\t\tCluster0\tCluster1\tCluster2"
for i in range(numDimensions):
    print "Theme %d\t\t%.4f\t\t%.4f\t\t%.4f" %(i,centroids_svd[0][i],centroids_svd[1][i],centroids_svd[2][i])
		Cluster0	Cluster1	Cluster2
Theme 0		0.1994		0.1982		0.2813
Theme 1		-0.3141		0.0570		0.2243
Theme 2		0.0844		-0.2611		0.2239
In [ ]: