Data Preparation and Pattern Discovery

Next: The Recommendation Engine Up: A Framework for Personalization Previous: A Framework for Personalization

2.1 Data Preparation and Pattern Discovery

The starting and critical point for successful personalization based on usage data is data preprocessing. The required high-level tasks are data cleaning, user identification, session identification, pageview identification, and the inference of missing references due to caching. Transaction identification can be performed as a final preprocessing step prior to pattern discovery in order to focus on the relevant subsets of pageviews in each user session. Pageview identification is the task of determining which page file accesses contribute to a single browser display. For Web sites using cookies or embedded session IDs, user and session identification is trivial. Web sites without the benefit of additional information for user and session identification must rely on heuristics methods. These heuristics and details of usage preprocessing tasks are explained in [4] and we do not discuss them further in this paper.

The above preprocessing tasks ultimately result in a set of n pageviews, $P = \{p_1,p_2,\cdots,p_n\}$ , and a set of m user transactions, $T = \{t_1,t_2,\cdots,t_m\}$ , where each $t_i \in T$ is a subset of P. Conceptually, we view each transaction t as an l-length sequence of ordered pairs:

$\begin{displaymath}t = \left\langle{(p^t_1,w(p^t_1)),(p^t_2,w(p^t_2)),\cdots,(p^t_l,w(p^t_l))} \right\rangle, \end{displaymath}$

where each p^t_i = p_j for some $j \in \{1,\cdots,n\}$ , and w(p^t_i) is the weight associated with pageview p^t_i in the transaction t. The weights can be determined in a number of ways, however in the context of personalization based on clickstream data, the primary sources of data are server access logs. This allows us to choose two types of weights for pageviews: weights can be binary, representing the existence or non-existence of a product-purchase or a documents access in the transaction; or they can be a function of the duration of the associated pageview in the user's session. In this paper, since our focus is on association rule mining, we only consider binary weights on pageviews within user transactions, and furthermore, we ignore the ordering among the pageviews. Thus, a transaction can be viewed as a set of pageviews $s_t = \{p^t_i~\vert~ 1 \leq i \leq l~\mbox{and}~w(p^t_i)=1\}$ .

Association rules capture the relationships among items based on their patterns of co-occurrence across transactions. In the case of Web transactions, association rules capture relationships among pageviews based on the navigational patterns of users. For the current paper we have used the Apriori algorithm [2,15] which follows a generate-and-test methodology. The Apriori algorithm, initially finds groups of items (which in this case are the pageviews appearing in the preprocessed log) occurring frequently together in many transactions (i.e., satisfying a user specified minimum support threshold). Such groups of items are referred to as frequent item sets.

Given a transaction set T and a set $I = \{I_1,I_2,\cdots,I_k\}$ of frequent itemsets over T, the support of an itemset $I_i \in I$ is defined as

$\begin{displaymath}\sigma (I_i ) = \frac{{\vert\{ t \in T:I_i \subseteq t\} \vert}}{{\vert T\vert}} \end{displaymath}$

An important property of support is its downward closure: if an item set does not satisfy the minimum support criteria, then neither do any of its supersets. This property is essential for pruning the search space during each iteration of the Apriori algorithm.

Association rules which satisfy a minimum confidence threshold are then generated from the frequent itemsets. An association rule r is an expression of the form $X \Rightarrow Y (\sigma _r ,\,\alpha _r)$ , where X and Y are itemsets, $\sigma_r$ is the support of $X \cup Y$ , and $\alpha_r$ is the confidence for the rule r given by $\sigma(X \cup Y) / \sigma(X)$ .

A problem with using a global minimum support threshold is that the discovered patterns will not include ``rare" but important items which may not occur frequently in the transaction data. This is particularly important in the current context: when dealing with Web usage data, it is often the case that references to deeper content or product-oriented pages occur far less frequently that those of top level navigation-oriented pages. Yet, for effective Web personalization, it is important to capture patterns and generate recommendations that contain these items. Liu et al. [8] proposed a mining method with multiple minimum support that allows users to specify different support values for different items. In this method, the support of an itemset is defined as the minimum support of all items contained in the itemset. The specification of multiple minimum supports allows frequent itemsets to potentially contain rare items which are nevertheless deemed important. Our experimental results, presented in the next Section, show that the use of multiple support association rules can maintain the overall precision of recommendations, while dramatically improving coverage.

Next: The Recommendation Engine Up: A Framework for Personalization Previous: A Framework for Personalization

Bamshad Mobasher (mobasher@cs.depaul.edu)
2001-07-29