Assignment 2

Due: Friday, May 1

In this Assignment you will explore unsupervised knowledge discovery techniques of association rule mining and sequential patterns mining. In particular you will apply some of the techniques discussed in the lectures by tracing the algorithms using sample data sets. You will also experiment with performing market basket analysis using WEKA on a couple of real-world data sets that have been preprocessed and prepared for the tasks in this assignment.

Note: You must do this problem manually by hand-tracing the Apriori algorithm discussed in the lectures [see lecture notes on Association Rule Discovery] rather than using a third party tool or program.

Consider the following transaction database. Each row represents a single transactions in which the specified items have been purchased.

Transaction ID	Items Purchased
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20	A,B,C,D B,C,D,E,G A,C,G,H,K B,C,D,E,K D,E,F,H,L A,B,C,D,E,L A,D,E,F,L B,I,K,L C,D,F,L A,B,D,E,K C,D,H,I,K B,C,E,K B,C,D,F A,B,C,D C,H,I,J A,E,F,H,L H,K,L A,B,D,H,K D,E,K B,C,D,E,H

Applying the Apriori algorithm with minimum support of 30% find all the frequent itemsets in the data set. For each step in the algorithm, give the list of frequent itemsets that satisfy minimum support (i.e., for each iteration i, give the set L_i along with the support values for the frequent itemsets).
From the maximal frequent itemsets in part a, generate all association rules that meet a minimum confidence of 75%. In addition to the confidence also specify the Lift (improvement) values for each of the final set of rule you discovered.

Suppose that we have the following site map associated with a hypothetical ecommerce Web site:

where A represents the event of visiting the homepage, E and F represent the action of adding items (perhaps from two different product categories) to the shopping cart, and G represents the purchase of the item(s) in the cart.
We also have some preprocessed and sessionized usage data recorded in the server log capturing past user navigation in the Web site. These sessions are stored in a sessions data file. Each row in the data file represents pages accessed by a user during one session in the order in which these pages were accessed.

Construct a Markov chain [see lecture notes on Mining Sequential & Navigational Patterns] based on the above site graph with states corresponding to the pages in the graph and transition probabilities computed using the training session data. You should also add a starting state with transitions to all pages that appear as starting pages in the session data, as well as a final state with transitions from pages that represent the end of a session. Show the computed transition probabilities as labels on the edges of the complete state transition diagram for the Markov Chain.

After computing the transition probabilities, answer each of the following using only the Markov model (not the original session data). Be sure to show the details of your computations.
- What is the probability that someone will visit page D?
- What is the probability that someone visiting the homepage will go on to visit additional pages in the site?
- What is the probability that a user who starts from the homepage will place something in his/her shopping cart?
- What is the probability that a user who starts from the homepage will actually make a purchase?

What is the probability of cart abandonment (i.e., the probability that someone who has something in the shopping cart will exit the site without making a purchase)?

For this problem you will use some preprocessed and aggregated clickstream data from a real e-commerce site, and use association rule mining to perform market basket analysis on the visitor session data.

[Note: Please watch the class video Association Rule Mining with WEKA (23 min) demonstrating the use of Apriori algorithm in WEKA for market basket analysis.]

There are two primary types of products sold through the above site, leg care products, and leg ware products. Each category includes various subcategories and individual products from multiple vendors. There is also a separate categorization of products by specialized "Collections" and by "Assortments." The data collection mechanism, in addition to capturing clickstream page-level data, also captures the information on categories, subcategories, assortments, and collections of products accessed in a given session.
For simplicity, the provided data combines and aggregates visited pages from the log files, category and subcategory names, and product related content pages/categories. The aggregate data contains a total of 182 attributes corresponding to pages or categories. These attributes are listed in the file Leg-Pages.txt. The session data is provides in ARFF format in the file legs.arff (in a Zip archive). This data contains a total of 7296 sessions (each row in the data). For the purpose of market basket analysis in WEKA, the session data is represented in relational format with unary categorical attributes (a value of "Y" indicates that the corresponding page/category was visited in the session, while a value of "?" indicates that the page/category is missing from the session). Thus, a typical association rule might look similar to the following:
/Products/Legwear=Y /Products/Legwear/Berkshire=Y ==> Collection: Better Than Bare - Queen=Y
or
Category: Health Supplements=Y ==> Subcategory: Bones & Joints=Y
Your task in this problem is as follows:
1. Load the data into WEKA and review the distributions in the data (go down the list of attributes and make a note of which pages or products are most frequent and which are least frequent - you can list the top 3 and the bottom 3 in your submission.
2. Perform association rule mining using Apriori algorithm with a "lowerBoundMinSupport" of 0.05 and using Lift of 2.5 as the minMetric for filtering the rules. Also, set "outputItemsets" to "True" so that you can also view the frequent items sets of different sizes in addition to the rules. Write a short summary of your observations, including any significant or interesting (e.g., unobvious or unexpected) associations you observe in the data based on the results. Save your result set and submit along with your assignment submission
3. Next, run the Apriori algorithm with a lower "lowerBoundMinSupport" so that you can identify associations at a more granular level (e.g., the level of individual products brands rather than higher level categories). You might want to start with 0.025 and go lower if necessary. Experiment with this threshold, as well as the Lift or Confidence metrics in different runs and pick the result set that seems to provide the most useful information (e.g., not too many obvious or noisy rules and not too few general rules). Again, provide a short summary of your observations, including some examples of rules or associations that you find interesting or useful.

The marketing department of a financial firm keeps records on customers, including demographic information and the type of accounts. When launching a new product, such as a "Personal Equity Plan" (PEP), a direct mail piece, advertising the product, is sent to existing customers, and a record kept as to whether that customer responded and bought the product. Based on this store of prior experience, the managers decide to use data mining techniques to build customer profile models. In this particular problem we are interested only in deriving (quantitative) association rules from the data (in a future assignment we will consider the use of classification with this data).

The data contains the following fields

id	a unique identification number
age	age of customer in years (numeric)
sex	MALE / FEMALE
region	inner_city/rural/suburban/town
income	income of customer (numeric)
married	is the customer married (YES/NO)
children	number of children (numeric)
car	does the customer own a car (YES/NO)
save_acct	does the customer have a saving account (YES/NO)
current_acct	does the customer have a current account (YES/NO)
mortgage	does the customer have a mortgage (YES/NO)
pep	did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

The data is contained in the file bank-data.csv. Each record is a customer description where the "pep" field indicates whether or not that customer bought a PEP after the last mailing.

Your goal is to perform Association Rule discovery on the data set using the Weka package.

Note: Association rule mining requires discretization of continuous variables. This task can be performed in the data transformation step or (in some cases) by the mining program. WEKA is a full data mining suite which includes various preprocessing modules (filters). When using WEKA, you will first apply the relevant preprocessing filters to transform the data before you performing association rule discovery.

First perform the necessary preprocessing steps required for association rule mining. Specifically, the "id" field will need to be removed and the numerical attributes must be discretized.

Next perform association rule discovery on the transformed data. Experiment with different parameters so that you get at least 20-30 strong rules (e.g., rules with high lift and confidence which at the same time have relatively good support).In WEKA Apriori algorithm interface set "outputItemsets" to "True" so that you can also view the frequent items sets of different sizes in addition to the rules.

Select the top 5 most "interesting" rules and for each specify the following:

an explanation of the pattern and why you believe it is interesting based on the business objectives of the company;
any recommendations based on the discovered rule that might help the company to better understand behavior of its customers or in its marketing campaign.

Note: The top 5 most interesting rules are most likely not the top 5 in the result set of the Apriori algorithm. They are rules that, in addition to having high support, lift, and confidence, also provide some non-trivial, actionable knowledge based on the underlying business objectives.