Data preprocessing is the starting point, and a critical one, for successful personalization based on usage data. The required high-level tasks are data cleaning, user identification, session identification, pageview identification, and the inference of missing references due to caching. Transaction identification can be performed as a final preprocessing step prior to pattern discovery in order to focus on the relevant subsets of pageviews in each user session. Pageview identification is the task of determining which page file accesses contribute to a single browser display. For Web sites using cookies or embedded session IDs, user and session identification is trivial. Web sites without the benefit of such additional information must rely on heuristic methods for user and session identification. These heuristics and the details of the usage preprocessing tasks are explained in [4], and we do not discuss them further in this paper.
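As a rough illustration of what a session identification heuristic can look like, the sketch below splits each user's requests wherever the gap between consecutive requests exceeds a fixed inactivity timeout. This is only one commonly used heuristic and is not necessarily the one described in [4]; the function name `sessionize`, the 30-minute timeout, and the `(user_id, timestamp, url)` record layout are all assumptions made for the example.

```python
from collections import defaultdict


def sessionize(records, timeout=30 * 60):
    """Split each user's requests into sessions at long inactivity gaps.

    records: iterable of (user_id, timestamp_seconds, url) tuples
             (a hypothetical layout for already-cleaned log entries).
    timeout: assumed inactivity threshold in seconds.
    Returns a list of sessions, each a list of urls in time order.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for hits in by_user.values():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the session and start a new one.
                sessions.append(current)
                current = [url]
            else:
                current.append(url)
        sessions.append(current)
    return sessions


records = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 60, "/about"),
]
sessions = sessionize(records)
```

With the sample records, the third request of `u1` arrives more than 30 minutes after the second, so it opens a new session.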
The above preprocessing tasks ultimately result in a set of $n$ pageviews, $P = \{p_1, p_2, \ldots, p_n\}$, and a set of $m$ user transactions, $T = \{t_1, t_2, \ldots, t_m\}$, where each $t_i \in T$ is a subset of $P$. Conceptually, we view each transaction
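$t$ as an $l$-length sequence of ordered pairs:
$$t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle,$$
where each $p_i^t \in P$ and $w(p_i^t)$ is the weight associated with pageview $p_i^t$ in the transaction $t$.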
Association rules capture the relationships among items based on their patterns of co-occurrence across transactions. In the case of Web transactions, association rules capture relationships among pageviews based on the navigational patterns of users. For the current paper we have used the Apriori algorithm [2,15], which follows a generate-and-test methodology. The Apriori algorithm first finds groups of items (in this case, the pageviews appearing in the preprocessed log) that occur together frequently across transactions (i.e., that satisfy a user-specified minimum support threshold). Such groups of items are referred to as frequent itemsets.
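The generate-and-test idea can be sketched in a few lines. The code below is a minimal, unoptimized illustration of level-wise frequent itemset discovery, not the implementation used in our experiments; it assumes each transaction is given as a set of pageview identifiers, and the function name `apriori` and the sample pageview names are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    transactions: non-empty list of sets of pageview identifiers.
    min_support: fraction of transactions (0..1) an itemset must occur in.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single pageviews.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Test: count each surviving candidate against the transactions.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]
frequent = apriori(transactions, min_support=0.5)
```

In this toy data the three-item candidate is never counted: one of its two-item subsets (`home` with `cart`) is infrequent, so the prune step discards it before the test step.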
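Given a transaction set $T$ and a set $I = \{I_1, I_2, \ldots, I_k\}$ of frequent itemsets over $T$, the support of an itemset $I_i \in I$ is defined as
$$\sigma(I_i) = \frac{|\{t \in T : I_i \subseteq t\}|}{|T|}.$$
Association rules which satisfy a minimum confidence threshold $\alpha$ are then generated from the frequent itemsets. An association rule $r$ is an expression of the form $X \Rightarrow Y \; (\sigma_r, \alpha_r)$, where $X$ and $Y$ are itemsets, $\sigma_r$ is the support of $X \cup Y$, and $\alpha_r$ is the confidence for the rule $r$ given by $\sigma(X \cup Y) / \sigma(X)$.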
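Rule generation from frequent itemsets can be sketched as follows, assuming the standard definition of confidence as the support of the union of X and Y divided by the support of X. The function name `generate_rules` and the support table are hypothetical; the table is assumed to list every subset of each frequent itemset, as is the case for the output of a level-wise miner.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Derive rules X => Y from a table of frequent-itemset supports.

    supports: dict mapping frozenset -> support, assumed to contain
              every non-empty subset of each itemset it lists.
    Returns a list of (X, Y, support, confidence) tuples.
    """
    rules = []
    for itemset, sigma in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        # Every non-empty proper subset X is a candidate antecedent.
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                antecedent = frozenset(x)
                confidence = sigma / supports[antecedent]
                if confidence >= min_confidence:
                    consequent = itemset - antecedent
                    rules.append((antecedent, consequent, sigma, confidence))
    return rules


# Hypothetical supports for three pageviews and two frequent pairs.
supports = {
    frozenset({"home"}): 0.75,
    frozenset({"products"}): 0.75,
    frozenset({"cart"}): 0.5,
    frozenset({"home", "products"}): 0.5,
    frozenset({"products", "cart"}): 0.5,
}
rules = generate_rules(supports, min_confidence=0.8)
```

Only the rule from `cart` to `products` survives the 0.8 threshold here: every transaction containing `cart` also contains `products`, whereas the other candidate rules have confidence 0.5 / 0.75.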
A problem with using a global minimum support threshold is that the discovered patterns will not include "rare" but important items which may not occur frequently in the transaction data. This is particularly important in the current context: when dealing with Web usage data, it is often the case that references to deeper content or product-oriented pages occur far less frequently than those to top-level navigation-oriented pages. Yet, for effective Web personalization, it is important to capture patterns and generate recommendations that contain these items. Liu et al. [8] proposed a mining method with multiple minimum supports that allows users to specify different support values for different items. In this method, the minimum support required of an itemset is defined as the lowest of the minimum supports specified for the items it contains. The specification of multiple minimum supports allows frequent itemsets to potentially contain rare items which are nevertheless deemed important. Our experimental results, presented in the next section, show that the use of multiple support association rules can maintain the overall precision of recommendations, while dramatically improving coverage.
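The effect of item-specific thresholds can be illustrated with a small brute-force sketch. This is not the mining algorithm of Liu et al. [8]; it simply enumerates itemsets up to a fixed size and compares each one's support with the lowest minimum support among its items. The function name, the `max_size` cap, and the sample thresholds are assumptions made for the example.

```python
from itertools import combinations


def frequent_with_item_supports(transactions, item_min_support, max_size=3):
    """Brute-force frequent itemsets under per-item minimum supports.

    transactions: non-empty list of sets of pageview identifiers.
    item_min_support: dict mapping each item to its own minimum support.
    An itemset qualifies when its support reaches the lowest minimum
    support among the items it contains.
    """
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            threshold = min(item_min_support[i] for i in itemset)
            s = sum(1 for t in transactions if itemset <= t) / n
            if s >= threshold:
                result[itemset] = s
    return result


transactions = [
    {"home", "products"},
    {"home", "products"},
    {"home", "about"},
    {"home", "products", "checkout"},
    {"home"},
]
# The rare but important "checkout" page gets a lower threshold.
item_min_support = {"home": 0.5, "products": 0.5, "about": 0.5, "checkout": 0.1}
result = frequent_with_item_supports(transactions, item_min_support)
```

With a single global threshold of 0.5, only `home`, `products`, and their pair would qualify. Lowering the threshold for `checkout` alone admits the itemsets that contain it, while `about`, which is equally rare but keeps the 0.5 threshold, is still excluded.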
Bamshad Mobasher (mobasher@cs.depaul.edu)