Assignment 3
Due: Sunday, May 17
In this assignment you will explore and experiment with several classification
and predictive modeling approaches discussed in the lectures. You may also wish
to watch the class video "Classification in WEKA" (30 min), which demonstrates
many of the techniques used in this assignment.

In this problem we will use the PEP data from Assignment 2 for the purpose of
target marketing. In this case, we plan on using the historical data from past
customer responses (the training data from last assignment) in order to build a
classification model. The model will then be applied to a new set of prospects to
whom we may want to extend an offer for a PEP. Rather than running a mass marketing campaign to
all new prospects, we would like to target those that are likely to respond positively
to our offer (according to our classification model).
There are two data sets available (in ARFF format), contained in the Zip
archive bankdata.zip:
 bankdata.arff - Preclassified training data set for building a model
 (this is the data from Assignment 2)
 banknew.arff - A set of new customers from which to find the "hot prospects"
 for the next target marketing campaign (i.e., those that are likely to
 respond positively to an offer for a PEP).
Note that since the ID attribute is not used for building the classifier, you
should begin by loading each of these data sets into WEKA, and in each case
removing the ID attribute and saving both filtered data sets into new files.
 Using the WEKA package, create a "C4.5" classification model based on the
preclassified training data. In WEKA, the C4.5 algorithm is implemented by
"weka.classifiers.trees.J48". Use 10-fold cross-validation to evaluate your model
accuracy. Record the final decision tree and model accuracy statistics obtained from your
model. Be sure to indicate the parameters you use in building your classification model
(if you experiment with non-default values).
You can save the statistics and results by right-clicking the last result set in the
"Result list" window and selecting "Save result buffer." You should also create a
screenshot of your tree by selecting the "Visualize tree" command from the same menu
[Note: you can resize the window as necessary, right-click inside the
window, and select the command "Fit to Screen" to get a better view of the full
tree]. You should provide the decision tree together with the accuracy results from the
cross-validation as part of your submission.
 Next, apply the classification model from the previous part to the new customers
data set as the "Supplied test set." Be sure to select the option "Output predictions" in the test
options for the classifier (under More Options). This
option will show you the predicted classes for the 200 new instances. In your
final submitted result you should map the resulting answers back to the original customer "id" field
for the new customers (this could be done using a spreadsheet program such as
Excel and the original new customers data set in CSV format). Provide your resulting predictions for the 200 new cases and other supporting
documentation as part of your submission.
 Lift Charts: Suppose that we would like to use our predictive model
from the previous part as a response model for a future targeted marketing
campaign. To do so, we want to use the 200 new cases in part (b) as the test
data. Suppose that we have the actual positive responses provided by these 200
prospects. These actual responses are given in the spreadsheet
pepactualresp.xls. Note that the total
number of positive responses is 50 (i.e., the response rate for the untargeted
marketing is 25%). Given this information and the predicted PEP values from
part (b), compute the
Cumulative Gain Lift Chart corresponding to the response model. Note that
the predicted PEP values are not actual responses, but only a prediction that
the prospect is likely to be interested in PEP. To create the chart, you will
need to compute and record the probability of PEP="YES" for each of the 200
prospects (this is part of the output generated in part b). You can then sort the 200 prospects according to
this probability and compute the cumulative positive responses (from the actual
response spreadsheet) against the total
number of prospects contacted. This should then be compared against the
untargeted case which has a fixed 25% response rate. Your final lift chart
should look something like this. Finally,
based on your lift chart, compute the lift value if only the top 70 prospects
are targeted. What does this value mean?
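The sorting and accumulation steps described above can be sketched in Python as follows. This is only an illustrative sketch: the function names `cumulative_gains` and `lift_at` are hypothetical, and the eight-prospect toy data is made up and is not taken from banknew.arff or pepactualresp.xls. Lift is taken here to be the response rate among the top n prospects divided by the overall (untargeted) response rate.

```python
def cumulative_gains(probs, actual):
    """Cumulative positive responses after contacting the top-n prospects, n = 1..N."""
    # Sort prospect indices by predicted probability of PEP=YES, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    gains, total = [], 0
    for i in order:
        total += actual[i]  # actual[i] is 1 for a positive response, 0 otherwise
        gains.append(total)
    return gains


def lift_at(gains, n):
    """Response rate among the top n prospects divided by the overall response rate."""
    overall_rate = gains[-1] / len(gains)
    return (gains[n - 1] / n) / overall_rate


# Made-up toy data (8 prospects), for illustration only.
probs = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3]
actual = [1, 0, 1, 0, 0, 0, 1, 0]

gains = cumulative_gains(probs, actual)
print(gains)              # cumulative positives by number of prospects contacted
print(lift_at(gains, 4))  # lift when only the top 4 prospects are targeted
```

Plotting `gains` against the number of prospects contacted, next to the straight line for the untargeted case, gives the cumulative gain chart. The same `lift_at` call with n = 70 would apply to the 200 real prospects.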
 In this problem you will use Naïve
Bayesian Classification on usage data associated with a hypothetical
e-commerce Web site to determine if a user will return to the site in the
future. The data set (VisitNominal.csv) contains a set of 100 user sessions
involving activities on the Web site. The attributes in this data set have
been converted into categorical (nominal) binary attributes indicating
whether the user has visited a specific section of the site or has purchased
a product in past visits. The attributes are described as follows:
 Home - indicating whether the user has visited the
 homepage.
 Browsed - indicating whether the user has
 spent time (using some pre-specified threshold) browsing the product
 catalog.
 Searched - indicating whether the user has
 performed searches for specific products.
 Prod_A, Prod_B, Prod_C - indicating whether the user has purchased products
 belonging to the corresponding product category.
 Visit_Again - the class attribute indicating whether the user has
 subsequently returned to the site in a future session.
Your tasks in this problem are as follows:
 Load the data set into WEKA and under the Classify
tab choose classifiers.bayes.NaiveBayesSimple. Under
the Test options select Use training set.
Then run the classifier and save the result set buffer. You will notice
that the model specifies the conditional probabilities associated with
different attributes for each of the two classes (Visit_Again=yes
and Visit_Again=no). For example, using this
information you can find Pr(Browsed=no | Visit_Again=yes)
or Pr(Searched=yes | Visit_Again=no). Also, the model
includes the prior probabilities of each of the two classes, Pr(Visit_Again=no)
and Pr(Visit_Again=yes). Submit your result set as part
of your answer.
 Next, using the probabilities you
obtained from the model and Bayes' Rule, manually compute the
probabilities of each of the following two new instances belonging
to each of the two classes:
 New instance X = <Home=yes, Browsed=no,
Searched=yes, Prod_A=no, Prod_B=yes, Prod_C=no>

 New instance Y = <Home=yes, Browsed=yes, Searched=no, Prod_A=yes,
Prod_B=no, Prod_C=yes>
For example, in the case of X, you must use Bayes' rule to compute
Pr(X | Visit_Again=yes) and Pr(X | Visit_Again=no), and
similarly for Y. Show the details of your computation.
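As a way to check a manual computation, the calculation can be sketched in Python. The sketch below is not the WEKA implementation. It assumes the usual naive independence assumption, so Pr(X | class) is the product of the per-attribute conditional probabilities, and it normalizes Pr(X | class) Pr(class) over the two classes. The function name `class_posteriors` is hypothetical, and all probabilities shown are made-up placeholders covering only two attributes; the real values come from your own result buffer.

```python
from math import prod


def class_posteriors(instance, priors, cond):
    """Normalized Pr(class | instance) under the naive independence assumption."""
    joint = {}
    for c, prior in priors.items():
        # Pr(X | class) = product of Pr(attribute=value | class) over all attributes.
        likelihood = prod(cond[c][(attr, val)] for attr, val in instance.items())
        joint[c] = likelihood * prior
    evidence = sum(joint.values())  # Pr(X), summed over both classes
    return {c: j / evidence for c, j in joint.items()}


# Placeholder probabilities (NOT taken from VisitNominal.csv), two attributes only.
priors = {"yes": 0.6, "no": 0.4}
cond = {
    "yes": {("Home", "yes"): 0.5, ("Browsed", "no"): 0.2},
    "no": {("Home", "yes"): 0.5, ("Browsed", "no"): 0.6},
}
x = {"Home": "yes", "Browsed": "no"}

posterior = class_posteriors(x, priors, cond)
print(posterior)
```

For the actual instances X and Y, `cond` would need one entry per attribute value for all six attributes.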

For this problem you will use an
image segmentation data set
and perform classification based on the K-Nearest-Neighbor (KNN)
approach. This data set contains information characterizing images, with each line
corresponding to one image. Each image is represented by 19 features (these
are the columns in the data and correspond to the feature names in the list
of attributes). The last column in the data contains the
class labels corresponding to the image types (brickface, sky, foliage,
cement, window, path, grass). The data set contains three files. The file "segmenttrain.arff"
is the training data consisting of 30 instances (images) from each of the 7
categories. The test data ("segmenttest.arff") is used for
evaluating the model built using the training data and it contains 2310
instances. A detailed description of the data set, including the meanings of
various attributes is provided in the file "segmentdecription.txt". The data set used in this problem is based on the
Image Segmentation data set at the UCI Machine Learning Repository.
Your tasks in this problem are the following:
 Load
the training image segment data into WEKA and select WEKA's KNN
implementation under the Classify tab. This implementation is called IBk and
it is located in the module: weka.classifiers.lazy.IBk.
Open the classifier options dialog box and select an appropriate value for K
(number of neighbors). Under Test options choose 10-fold
cross-validation (which is the default). Run the classifier multiple times,
experimenting with different values of K (you may wish to try 5, 10, 15, 20
and so on) and with or without the distance weighting option set. For each
run examine the evaluation result. Once you are satisfied that you have the
best set of options, record the final results by saving your buffer for the
corresponding result set. You should submit this result set and also provide
a 1-2 paragraph summary of which options you tried and your findings.
 Next, apply your model
from part (a) to the test data. Under the Test options
select "Supplied test set" and set the test set to the file
segmenttest.arff. Under More options, make sure that
"Output predictions" is selected. Finally, run the KNN classifier on the
test data. Compare the evaluation results to the results from 10-fold
cross-validation. Submit your result set (including the predictions) along
with a summary of your observations.

Suppose that an online bookseller has collected
ratings information from
20 past users (U1-U20) on a selection of recent books. The ratings range from 1 = worst to 5 = best.
Two new users (NU1 and NU2) have recently visited the site and rated some of the books
("?" represents missing ratings). The two new users' ratings are given in the last
two rows of the spreadsheet.
Using the K-Nearest-Neighbor algorithm, predict the ratings of these new users for
each of the books they have not yet rated. Use the Pearson correlation coefficient
as the similarity measure. [Note: you should complete this problem using
Microsoft Excel or a similar spreadsheet program. You may also choose to write a
program that performs the specified computations below.]
 First compute the correlations between the new users (NU1 and NU2) and all
other users (you can show these as added columns in original spreadsheet). Then
for each new user compute the predicted rating for each of the unrated items using
K=3
(i.e., 3 nearest neighbors). Use the weighted average function to compute the
predictions based on ratings of the nearest neighbors. Be sure to show
the intermediate steps in your work (or provide a short explanation of how you
computed the predictions).
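For those who choose to write a program, the two steps above might be sketched in Python as follows. Several choices here are assumptions rather than requirements of the assignment: the correlation is computed only over items both users have rated, the K neighbors are chosen among users who actually rated the target item, and the weighted average uses the correlations themselves as weights. The function names and the four-user toy ratings are hypothetical and not taken from the spreadsheet.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation over items both users rated (None = missing rating)."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = sqrt(sum((a - mu) ** 2 for a, _ in pairs)
               * sum((b - mv) ** 2 for _, b in pairs))
    return num / den if den else 0.0


def predict(new_user, others, item, k=3):
    """Similarity-weighted average of the k most similar users' ratings on `item`."""
    # Only users who rated the target item can contribute a rating.
    sims = [(pearson(new_user, o), o[item]) for o in others if o[item] is not None]
    top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    weight = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / weight if weight else None


# Made-up ratings for four past users on four books (illustration only).
others = [
    [5, 4, 1, 4],
    [1, 2, 5, 2],
    [4, 5, 2, 5],
    [5, 5, 1, 3],
]
new_user = [5, 4, 1, None]  # the last book is unrated

prediction = predict(new_user, others, item=3, k=2)
print(prediction)
```

In a spreadsheet, the built-in correlation function applied to the co-rated cells plays the role of `pearson`, and a sum-product divided by the sum of the weights plays the role of `predict`.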
 Measure the Mean Absolute Error (MAE) on the predictions
using NU1 and
NU2 as test users. You can compute MAE by generating predictions for items
already rated by the test user (e.g., for NU1 these are all items except "The DaVinci Code"
and "Runny Babbit"). Then, for each of these items you can compute the
absolute value of the difference between the predicted and the actual ratings.
Finally, you can average these errors across all 12 compared items (for both NU1
and NU2) to obtain the
MAE.
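The MAE computation itself is short. A minimal Python sketch, with a hypothetical function name and made-up predicted and actual ratings, could look like this:

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all compared items."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Made-up values for three items (illustration only).
predicted = [3.5, 4.2, 2.0]
actual = [4, 4, 3]

mae = mean_absolute_error(predicted, actual)
print(mae)
```

For the assignment, the two lists would hold the predictions and actual ratings for all compared items of NU1 and NU2 together.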
 Item-Based Collaborative Filtering. Using the same data as above
and the item-based collaborative filtering algorithm (instead of the user-based CF
used in the previous parts), compute the predicted rating of NU1 on the
book "The DaVinci Code". Note that in this case, you will need to find
the K most similar items (books) to the target item based
on their rating vectors (columns in the table), and then use NU1's
ratings on the K neighbor items. For this problem use K
= 2, and use Cosine Similarity to identify the most similar neighbors
to "The DaVinci Code".
In order to compute Cosine similarities, you may assume that missing values in
the ratings table are considered to be zeros.
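The item-based procedure can be sketched in Python as follows. This is a sketch under stated assumptions: missing ratings are already replaced by zeros in the item columns (as permitted above), candidate neighbors are restricted to items the new user has rated, and the prediction is a similarity-weighted average of the new user's ratings on the K neighbor items. The function names, item labels, and ratings are hypothetical toy values.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two rating vectors (missing ratings given as 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_item_based(columns, user_ratings, target, k=2):
    """Similarity-weighted average of the user's ratings on the k items most similar to `target`."""
    sims = [(cosine(columns[target], columns[i]), user_ratings[i])
            for i in columns if i != target and i in user_ratings]
    top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    weight = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / weight if weight else None


# Made-up item columns over three past users (illustration only); "T" is the target item.
columns = {"A": [5, 0, 4], "B": [4, 0, 5], "C": [0, 5, 1], "T": [5, 1, 4]}
user_ratings = {"A": 4, "B": 5, "C": 2}  # the new user's known ratings

prediction = predict_item_based(columns, user_ratings, target="T", k=2)
print(prediction)
```

Here each column of the ratings table becomes one entry of `columns`, which mirrors comparing book columns rather than user rows.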
