Assignment 4

Due: Sunday, May 31

In this Assignment you will explore and experiment with clustering in some of its applications in Web usage mining, customer segmentation, and recommender systems. Note: An example of using WEKA for clustering (using an older version of WEKA) can be found in K-Means Clustering in WEKA.

Consider the following document-term matrix, where each entry represents the raw frequency of a term T_i in document D_j. We would like to apply clustering to automatically group these documents into 3 classes (clusters). Note: You must not use WEKA or other clustering tools for this problem. However, you are encouraged to use a spreadsheet program such as Microsoft Excel to facilitate computation in intermediate steps.

Download table as a Microsoft Excel Worksheet
Suppose we initially assign D2 to Class1, D4 and D6 to Class2, and D5 and D7 to Class3. Using the K-means clustering method discussed in class, compute the final contents of the 3 classes. Use the Cosine similarity of two vectors as your similarity measure. Show the details of your computation, including intermediate steps in each iteration of the algorithm.
Note: Recall that the Cosine similarity of two vectors is their dot product divided by the product of their norms. For example, Consider the two vectors X and Y:
```
  X = <3, 0, 1, 2, 0, 3>
  Y = <2, 0, 0, 3, 8, 4>
```
The dot product is given by sum of the coordinate-wise multiples:
```
  dot-product(X, Y) = 3*2 + 0*0 + 1*0 + 2*3 + 0*8 + 3*4
                      = 6 + 0 + 0 + 6 + 0 + 12
                      = 24.
```
The norm of each vector is the square-root of the sum of the squares of its dimension values. So, the norms of X and Y are:

and the Cosine similarity of X and Y is given by:

An online shopping site has the following primary pages or sections: Home, Products, Search, Prod_A, Prod_B, Prod_C, Cart, Purchase. A user may browse from "Home" to "Products" and then to one of the individual products. The user may also search for a specific product by using the "Search" function. A visit to "Cart" implies that the user has placed an item in the shopping cart, and "Purchase" indicates that the user has completed the purchase of items in the shopping cart. The site has collected some hypothetical session data for 100 sessions. This data is available in CSV and format.
Use WEKA's K-means clustering algorithm to cluster these user sessions into segments. Try different clustering runs with various numbers of clusters (e.g., between 4 and 8), and select the result set(s) that seem to best answer as many of the following questions as possible.
- If a new user is observed to access the following pages: Home => Search => Prod_B, according to your clusters, what other product should be recommended to this user? Explain your answer based on your clustering results. What if the new user has accessed the following sequence instead: Products => Prod_C?
- Can clustering help us identify casual browsers ("window shoppers"), focused browsers (those who seem to know what products they are looking for), and searchers (those using the search function to find items they want)? If so, Are any of these groups show a higher or lower propensity to make a purchase?
- Do any of the segments show particular interest in one or more products, and if so, can we identify any special characteristics about their navigational behavior or their purchase propensity?
- If we know that, during the time of data collection, independent banner ads had been placed on some popular sites pointing to products A and B, can we identify segments corresponding to visitors that respond to the ads? (note that such user's are likely to enter the site by going directly to product pages rather than navigate from the Home page). If so, can we determine if either of these promotional campaigns are having any success?
For this problem, you should submit your clustering result summary (including the cluster centroids), the final data set which shows the final assignment of these sessions to clusters, and your answers to the above questions along with your justification based on the clustering results.
Other Notes: You may also want to use WEKA's cluster visualization capabilities to identify interesting distributions of various page visits among and within clusters. Examples of using WEKA for clustering can be found in K-Means Clustering in WEKA.

For this problem you will use part of the actual customer transaction data from a real bakery (bakery-binary.csv). The data set contains 1000 customer transactions, each corresponding to purchases by the customer during a visit to the store. The items (attributes) are 50 assorted cakes, pastries, coffee products, etc. A 1 in a transaction indicated the purchase of the corresponding item and a 0 indicates no purchase for that item. There is one additional attribute, Weekend, indicating whether the transaction took place on a weekend (this there are a total of 51 attributes). Your goal in this assignment is to use clustering in order to identify different customer segments, and then characterize those segments by analyzing the aggregate representations of each cluster (the cluster centroids).

Your tasks in this problem are the following:
1. Load in the data into WEKA and select WEKA's K-means clustering algorithm (SimpleKmeans) under the cluster tab. Cluster the customers and review the cluster centroids. Try a few runs with different numbers of clusters (e.g., K values between 4 and 8) and pick a result set that seems to provide clearly distinguishable customer segments based on the centroid values of the dominants items in those clusters. Include your final result set in your submission along with a short summary of your other runs.
2. Next, characterize the customer segments by indentifying significant patterns represented in each segment based on the cluster centroids. You can identify dominant items in segments and make observations about sets of items that seem strongly associated with each segment. Also compare different segments and discuss what are the unique characteristics of each that distinguished that segment from others. Finally, answer the following question: Is there a customer segment that primarily makes purchases over the weekends, and if so, are there specific items in which this group of customers is especially interested? Include your analysis of customer segments in your submission. [Note: a good way to represent the clusters is to generate "aggregate profiles" for each cluster. You can do this by exporting the centroids given in WEKA output to a spreadsheet program such as Microsoft Excel. Then, sort each centroid independently along with the item names in the decreasing order of the cluster values for the items. This will make it easier to focus on dominant items and compare clusters. You may also wish to filter out the items with very low centroid values. See an example of a typical profile for this problem.]
3. Based on your observations in part (b) can you identify 2-3 potential opportunities for the bakery to increase sales of one or more of the products through cross-sell recommendations or other types of customer incentives? Explain your answer.

Consider the following table representing a user’s navigational activity as captured in a server log file (assuming that data cleaning and user identification have already been performed). Note: “-“ in the referrer field represents an external access and is assumed to be different each time it occurs. The site graph for this hypothetical site is also given below.
1. Using sessionization heuristics, discussed in class, in each of the following cases give the sessions constructed by the heuristic for the above user (give each session as a list of pages accessed by the user during that session, e.g., S1 => A, C, B, ... and S2 => I, J, ..., and so on).
  - Sessions Using h1 heuristic with timeout threshold = 25 mins.
  - Sessions Using h2 heuristic with timeout threshold = 10 mins.
  - Sessions Using href heuristic (referrer-based heuristic)
2. Suppose (hypothetically) that sessionization for a user U results in a session given in the log file below:
Perform path completion to fill in missing references due to client-side caching. If there are multiple candidate completions, then use the completion resulting in the fewest number of “back” references.

Consider again the book rating data from 20 users (U1-U20) that you used in Assignment 3. Recall that the ratings range from 1 = worst to 5 = best. We have used K-means clustering algorithm to cluster these users into 4 clusters. The results of clustering (cluster centroids) are provided in the spreadsheet book-clusters.xlsx.
Recall that the two new users (NU1 and NU2) who have recently visited the site and rated some of the books. The two new users' ratings are given in the last two rows of the ratings spreadsheet.
In assignment 3 you used the K-nearest-neighbor (Knn) approach to predict ratings for NU1 and NU2. In this problem you will compute predicted ratings by comparing NU1 and NU2 to the cluster centroids.
1. First compute the correlations between the new users (NU1 and NU2) and the centroids of each cluster. Then using the most similar cluster to each of these users, compute the predicted ratings for the books they have not yet rated.
2. Measure the Mean Absolute Error (MAE) on the predictions using NU1 and NU2 as test users. Recall that you can compute MAE by generating predictions for items already rated by the test users (e.g., for NU1 these are all items except "The DaVinci Code" and "Runny Babbit"). Then, for each of these items you can compute the absolute value of the difference between the predicted and the actual ratings. Finally, you can average these errors across all 12 compared items (for both NU1 and NU2) to obtain the MAE. How does your MAE compare to the MAE you obtained using Knn in Problem 4 (part b) in Assignment 3?