Another example of model selection, this time using text data from 20 Newsgroups

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')

# We'll use only the first 3000 documents
n_samples = 3000

X = news.data[:n_samples]
y = news.target[:n_samples]
In [33]:
X[:2]
Out[33]:
[u"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 u'From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)\nSubject: Which high-performance VLB video card?\nSummary: Seek recommendations for VLB video card\nNntp-Posting-Host: midway.ecn.uoknor.edu\nOrganization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA\nKeywords: orchid, stealth, vlb\nLines: 21\n\n  My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-performance VLB card\n\n\nPlease post or email.  Thank you!\n\n  - Matt\n\n-- \n    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   \n  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- \n    |   of heaven, because everything he does is right and all his ways  |   \n    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   \n']
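
The integer labels in y index into news.target_names; as a quick sanity check (a minimal sketch; output omitted since this cell was not run), we can map the two documents above back to their newsgroups:

In [ ]:
# Map the integer class labels of the first two documents back to newsgroup names
print [news.target_names[label] for label in y[:2]]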
In [31]:
cd D:\Documents\Class\CSC478\Data
D:\Documents\Class\CSC478\Data
In [29]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
In [55]:
# Read in the list of stop words
def get_stop_words():
    result = set()
    # one stop word per line; 'with' closes the file when we're done
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result

stop_words = get_stop_words()
In [37]:
# TfidfVectorizer performs tokenization, removes stop words, and applies the tf-idf transformation

tfidf = TfidfVectorizer(stop_words=stop_words, token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")
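
To see exactly what the vectorizer will do to a document, build_analyzer() returns the combined preprocessing, tokenization, and stop-word-removal callable. A minimal sketch on a made-up string:

In [ ]:
# build_analyzer() bundles lowercasing, tokenization (via token_pattern),
# and stop-word removal into a single callable
analyze = tfidf.build_analyzer()
print analyze("The Orchid Farenheit 1280 is a high-performance VLB video card")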
In [38]:
X_tfidf = tfidf.fit_transform(X)
In [39]:
X_tfidf.shape
Out[39]:
(3000, 57115)
In [59]:
# Note that the transformed data is stored in a sparse matrix (which is much more efficient for large data sets)

X_tfidf
Out[59]:
<3000x57115 sparse matrix of type '<type 'numpy.float64'>'
	with 300365 stored elements in Compressed Sparse Row format>
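
From the output above (300365 stored elements out of 3000 x 57115 cells), the matrix is about 99.8% zeros; the sparse matrix's nnz attribute makes this easy to verify:

In [ ]:
# Fraction of cells that are actually nonzero (about 0.0018)
print float(X_tfidf.nnz) / (X_tfidf.shape[0] * X_tfidf.shape[1])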
In [44]:
print X_tfidf[:2]
  (0, 44235)	0.0778425181055
  (0, 21610)	0.0629754665732
  (0, 20274)	0.0720747079949
  (0, 30805)	0.0796134821484
  (0, 27117)	0.103274444627
  (0, 16304)	0.108250855547
  (0, 27621)	0.0955786159432
  (0, 41009)	0.123578072434
  (0, 7893)	0.0834961366781
  (0, 21618)	0.0689388683338
  (0, 13707)	0.0643253811618
  (0, 30056)	0.0503632147368
  (0, 9126)	0.110337523139
  (0, 39461)	0.0861066564803
  (0, 54424)	0.0728661071854
  (0, 21341)	0.157405851825
  (0, 30819)	0.111396638078
  (0, 48129)	0.0870836492325
  (0, 45315)	0.158001176271
  (0, 42622)	0.156248179166
  (0, 8238)	0.0503632147368
  (0, 46176)	0.0861066564803
  (0, 28202)	0.0333080236389
  (0, 27386)	0.231191011074
  (0, 50445)	0.0577174841757
  :	:
  (1, 52872)	0.0434515509519
  (1, 36145)	0.0906625099373
  (1, 36913)	0.0924084110306
  (1, 52505)	0.0291185713372
  (1, 35361)	0.0643802853074
  (1, 12844)	0.044843904834
  (1, 18471)	0.0573784901113
  (1, 42356)	0.0954862548979
  (1, 45403)	0.0830905280299
  (1, 48909)	0.0593329377833
  (1, 10579)	0.228476754279
  (1, 53630)	0.195289538619
  (1, 53831)	0.356402992763
  (1, 24304)	0.328949479945
  (1, 29724)	0.246464057433
  (1, 32140)	0.154160615738
  (1, 33163)	0.225914978976
  (1, 32279)	0.246464057433
  (1, 28202)	0.03159481851
  (1, 40064)	0.0954862548979
  (1, 35837)	0.027292266182
  (1, 30380)	0.0148477442017
  (1, 39908)	0.052082770997
  (1, 37323)	0.0153919545043
  (1, 48690)	0.0148230267927
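
The column indices in the printout above can be mapped back to their terms through the fitted vocabulary. A quick sketch using a few indices from document 0:

In [ ]:
# get_feature_names() returns the vocabulary terms in column order
terms = tfidf.get_feature_names()
print [terms[i] for i in [44235, 21610, 20274]]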
In [56]:
# It's possible (though not usually necessary) to convert the matrix into a "dense" matrix

newX = X_tfidf.todense()
In [43]:
newX.shape
Out[43]:
(3000L, 57115L)
In [57]:
np.set_printoptions(linewidth=120, edgeitems=12)
print newX[:10]
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. ...,  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
In [58]:
newX.sum(axis=1)
Out[58]:
matrix([[  6.38255787],
        [  6.22316442],
        [ 11.10466984],
        [  8.01377232],
        [  6.59598284],
        [  8.64233053],
        [  4.08752316],
        [  8.23253906],
        [  6.9290635 ],
        [  5.54138631],
        [  6.36652435],
        [  5.6413991 ],
        ..., 
        [  9.56666541],
        [ 11.19261544],
        [  4.75115972],
        [  6.31888744],
        [  7.1462622 ],
        [  6.72387298],
        [  6.76975027],
        [  7.00729698],
        [  5.90752046],
        [  9.89775822],
        [  6.17808897],
        [  7.46697255]])
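
Note that the rows do not sum to 1: TfidfVectorizer applies L2 (Euclidean) normalization by default, so it is the sum of the squared weights in each row that equals 1. A quick check:

In [ ]:
# With the default norm='l2', each row has unit Euclidean length
print np.multiply(newX, newX).sum(axis=1)[:5]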
In [36]:
# Fortunately, scikit-learn modules handle sparse matrices natively

# Let's create a pipeline to perform preprocessing and to build the model
clf = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('nb', MultinomialNB(alpha=0.01)),
])
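
Parameters of the pipeline's steps are addressed as <step_name>__<param_name>; this is the naming scheme that calc_params and GridSearchCV rely on below. For example:

In [ ]:
# The alpha parameter of the 'nb' step is exposed as 'nb__alpha'
print clf.get_params()['nb__alpha']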
In [16]:
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator with K folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    # sem is the standard error of the mean: std(scores) / sqrt(K)
    print "Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores))
In [17]:
evaluate_cross_validation(clf, X, y, 3)
[ 0.812  0.808  0.822]
Mean score: 0.814 (+/-0.004)
In [18]:
def calc_params(X, y, clf, param_values, param_name, K):
    # initialize training and testing scores with zeros
    train_scores = np.zeros(len(param_values))
    test_scores = np.zeros(len(param_values))
    
    # iterate over the different parameter values
    for i, param_value in enumerate(param_values):
        print param_name, ' = ', param_value
        
        # set classifier parameters
        clf.set_params(**{param_name:param_value})
        
        # initialize the K scores obtained for each fold
        k_train_scores = np.zeros(K)
        k_test_scores = np.zeros(K)
        
        # create the K-fold cross-validation iterator
        # (len(y) rather than the global n_samples, so the function works for any input size)
        cv = KFold(len(y), K, shuffle=True, random_state=0)
        
        # iterate over the K folds
        for j, (train, test) in enumerate(cv):
            # fit the classifier in the corresponding fold
            # and obtain the corresponding accuracy scores on train and test sets
            clf.fit([X[k] for k in train], y[train])
            k_train_scores[j] = clf.score([X[k] for k in train], y[train])
            k_test_scores[j] = clf.score([X[k] for k in test], y[test])
            
        # store the mean of the K fold scores
        train_scores[i] = np.mean(k_train_scores)
        test_scores[i] = np.mean(k_test_scores)
       
    # plot the training and testing scores on a log scale
    plt.semilogx(param_values, train_scores, alpha=0.4, lw=2, c='b', label='train')
    plt.semilogx(param_values, test_scores, alpha=0.4, lw=2, c='g', label='test')

    plt.xlabel(param_name + " values")
    plt.ylabel("Mean cross validation accuracy")
    plt.legend(loc='best')

    # return the training and testing scores on each parameter value
    return train_scores, test_scores
In [19]:
alphas = np.logspace(-7, 0, 8)
print alphas
[  1.00000000e-07   1.00000000e-06   1.00000000e-05   1.00000000e-04
   1.00000000e-03   1.00000000e-02   1.00000000e-01   1.00000000e+00]
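
Here alpha is MultinomialNB's additive (Lidstone/Laplace) smoothing parameter: every per-class word count is inflated by alpha before the conditional probabilities are estimated, roughly P(w|c) = (count(w,c) + alpha) / (count(c) + alpha * |V|), so smaller values smooth less and larger values smooth more.
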
In [20]:
train_scores, test_scores = calc_params(X, y, clf, alphas, 'nb__alpha', 3)
nb__alpha  =  1e-07
nb__alpha  =  1e-06
nb__alpha  =  1e-05
nb__alpha  =  0.0001
nb__alpha  =  0.001
nb__alpha  =  0.01
nb__alpha  =  0.1
nb__alpha  =  1.0
In [21]:
print 'training scores: ', train_scores
print 'testing scores: ', test_scores
training scores:  [ 1.          1.          1.          1.          1.          1.
  0.99683333  0.97416667]
testing scores:  [ 0.77133333  0.77666667  0.78233333  0.79433333  0.80333333  0.814
  0.80733333  0.74533333]
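
The training accuracy stays at or near 1.0 for the smallest alpha values, a sign of overfitting, while the test accuracy peaks at alpha = 0.01 (0.814), the value used in the pipeline above.
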
In [ ]:
# Let's now try Support Vector Machines for classification
In [22]:
from sklearn.svm import SVC

clf = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('svc', SVC()),
])
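
SVC uses the RBF kernel by default, K(x, x') = exp(-gamma * ||x - x'||^2): a small gamma gives a broad, smooth kernel (risk of underfitting), while a large gamma gives a narrow one that can memorize the training set (risk of overfitting). A minimal numeric sketch with two toy vectors:

In [ ]:
# RBF kernel value for two fixed toy points; larger gamma -> faster decay
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for gamma in [0.01, 0.1, 1.0, 10.0]:
    print gamma, np.exp(-gamma * np.sum((a - b) ** 2))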
In [23]:
gammas = np.logspace(-2, 1, 4)

train_scores, test_scores = calc_params(X, y, clf, gammas, 'svc__gamma', 3)
svc__gamma  =  0.01
svc__gamma  =  0.1
svc__gamma  =  1.0
svc__gamma  =  10.0
In [24]:
print 'training scores: ', train_scores
print 'testing scores: ', test_scores
training scores:  [ 0.06183333  0.279       0.99966667  1.        ]
testing scores:  [ 0.04866667  0.162       0.74666667  0.05166667]

For gamma < 1 we have underfitting, and for gamma > 1 we have overfitting. The best result here is at gamma = 1, where we obtain a training accuracy of 0.999 and a testing accuracy of about 0.75.

In [25]:
from sklearn.grid_search import GridSearchCV

parameters = {
    'svc__gamma': np.logspace(-2, 1, 4),
    'svc__C': np.logspace(-1, 1, 3),
}

clf = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",         
    )),
    ('svc', SVC()),
])

gs = GridSearchCV(clf, parameters, verbose=2, cv=3)
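
GridSearchCV exhaustively evaluates every combination in the grid: 4 gamma values x 3 C values = 12 candidates, each fit once per fold, giving the 36 fits reported in the log below. ParameterGrid from the same module can enumerate the candidates, as a quick sketch:

In [ ]:
# Enumerate the 12 (gamma, C) candidate settings GridSearchCV will try
from sklearn.grid_search import ParameterGrid
print len(list(ParameterGrid(parameters)))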
In [26]:
%time _ = gs.fit(X, y)

gs.best_params_, gs.best_score_
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   7.2s
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   7.4s
[CV] svc__gamma=0.01, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=0.1 -   7.2s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   7.1s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   7.2s
[CV] svc__gamma=0.1, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=0.1 -   7.5s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   7.5s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   7.2s
[CV] svc__gamma=1.0, svc__C=0.1 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=0.1 -   7.4s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   7.5s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   7.5s
[CV] svc__gamma=10.0, svc__C=0.1 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=0.1 -   7.4s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   7.2s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   7.4s
[CV] svc__gamma=0.01, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=0.01, svc__C=1.0 -   7.4s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   7.2s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   7.2s
[CV] svc__gamma=0.1, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=0.1, svc__C=1.0 -   7.4s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   7.3s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   7.2s
[CV] svc__gamma=1.0, svc__C=1.0 ......................................
[CV] ............................. svc__gamma=1.0, svc__C=1.0 -   7.4s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   7.5s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   7.3s
[CV] svc__gamma=10.0, svc__C=1.0 .....................................
[CV] ............................ svc__gamma=10.0, svc__C=1.0 -   7.4s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   7.1s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   7.3s
[CV] svc__gamma=0.01, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=0.01, svc__C=10.0 -   7.2s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   7.1s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   7.1s
[CV] svc__gamma=0.1, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=0.1, svc__C=10.0 -   7.2s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   7.3s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   7.3s
[CV] svc__gamma=1.0, svc__C=10.0 .....................................
[CV] ............................ svc__gamma=1.0, svc__C=10.0 -   7.3s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   7.4s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   7.4s
[CV] svc__gamma=10.0, svc__C=10.0 ....................................
[CV] ........................... svc__gamma=10.0, svc__C=10.0 -   7.5s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    7.2s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  4.5min finished
Wall time: 4min 39s
Out[26]:
({'svc__C': 10.0, 'svc__gamma': 0.10000000000000001}, 0.82666666666666666)

With the grid search we found a better combination of the C and gamma parameters: for values of 10.0 and 0.1 respectively, we obtained a 3-fold cross-validation accuracy of 0.827, much better than the best value (about 0.75) we obtained in the previous experiment by adjusting only gamma and keeping C at its default of 1.0.
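
Since GridSearchCV refits the best parameter combination on the full data by default (refit=True), the tuned pipeline is available as gs.best_estimator_; a minimal sketch of using it for prediction:

In [ ]:
# Predict with the refit best model and map the labels back to newsgroup names
pred = gs.best_estimator_.predict(X[:5])
print [news.target_names[p] for p in pred]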

In [ ]: