For our experiments, we used the access logs from the Web site of the Association for Consumer Research (ACR) Newsletter (www.acr-news.org). After data preprocessing, the data set contained a total of 18342 transactions and 122 URLs. For our analysis, we eliminated pageviews appearing in fewer than 0.5% or more than 80% of transactions, as well as short transactions containing fewer than 6 pageviews. After these preprocessing steps, the total number of remaining pageview URLs was 40. The data set was then divided into a training set and an evaluation set: approximately 70% of the transactions were randomly selected as the training set, and the remaining transactions were used for evaluation.
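For concreteness, the following is a minimal sketch of this preprocessing, assuming transactions are represented as lists of pageview URLs. The support thresholds, minimum transaction length, and split ratio come from the figures above, while the function names and random seed are illustrative assumptions rather than the authors' actual code.

```python
import random
from collections import Counter

def preprocess(transactions, min_support=0.005, max_support=0.80, min_length=6):
    """Filter pageviews by transaction frequency and drop short transactions (hypothetical helper)."""
    n_trans = len(transactions)
    # Fraction of transactions in which each pageview URL appears.
    freq = Counter(url for t in transactions for url in set(t))
    keep = {url for url, c in freq.items()
            if min_support <= c / n_trans <= max_support}
    # Keep only retained URLs, then drop transactions with fewer than min_length pageviews.
    filtered = [[url for url in t if url in keep] for t in transactions]
    return [t for t in filtered if len(t) >= min_length]

def split_train_eval(transactions, train_fraction=0.7, seed=0):
    """Randomly assign ~70% of transactions to training, the rest to evaluation."""
    rng = random.Random(seed)
    shuffled = transactions[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```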
Our evaluation methodology is as follows. Each transaction $t$ in the evaluation set is divided into two parts. The first $n$ pageviews in $t$ are used for generating recommendations, whereas the remaining portion of $t$ is used to evaluate the generated recommendations. The value $n$ reflects the maximum allowable window size for the experiments (in our case, 4). Given a window size $w \le n$, we select a subset of the first $n$ pageviews as the surrogate for a user's active session window. The active session window is the portion of the user's clickstream used by the recommendation engine in order to produce a recommendation set. We call this portion of the transaction $t$ the active session with respect to $t$, denoted by $as_t$. The recommendation engine takes $as_t$ and a recommendation threshold $\tau$ as inputs and produces a set of pageviews as recommendations. We denote this recommendation set by $R(as_t, \tau)$.
Note that $R(as_t, \tau)$ contains all pageviews whose recommendation score is at least $\tau$ (in particular, if $\tau = 0$, then $R(as_t, \tau) = P$, where $P$ is the set of all pageviews).
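As a sketch, forming this recommendation set from engine scores could look like the following, where the `scores` mapping stands in for whatever scores the recommendation engine assigns to pageviews (an assumption, since the scoring model is not specified in this section).

```python
def recommendation_set(scores, tau):
    """Return R(as_t, tau): all pageviews whose recommendation score is at least tau.

    `scores` maps each pageview URL in P to a score in [0, 1]; with tau = 0 the
    whole set P is returned, matching the note above.
    """
    return {p for p, score in scores.items() if score >= tau}
```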
The set of pageviews $R(as_t, \tau)$ can now be compared with the remaining $|t| - n$ pageviews in $t$. We denote this portion of $t$ by $eval_t$. Our comparison of these sets is based on two different metrics, namely, precision and coverage. The precision of $R(as_t, \tau)$ is defined as
$$\textit{precision}(R(as_t, \tau)) = \frac{|R(as_t, \tau) \cap eval_t|}{|R(as_t, \tau)|},$$
and the coverage of $R(as_t, \tau)$ is defined as
$$\textit{coverage}(R(as_t, \tau)) = \frac{|R(as_t, \tau) \cap eval_t|}{|eval_t|}.$$
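Assuming the recommendation set and $eval_t$ are available as Python sets of pageview URLs, the two measures for a single transaction could be computed as in the following sketch:

```python
def precision(recommended, evaluated):
    """Fraction of recommended pageviews that actually occur in eval_t."""
    if not recommended:
        return 0.0
    return len(recommended & evaluated) / len(recommended)

def coverage(recommended, evaluated):
    """Fraction of the pageviews in eval_t that were recommended."""
    if not evaluated:
        return 0.0
    return len(recommended & evaluated) / len(evaluated)
```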
Precision measures the degree to which the recommendation engine produces accurate recommendations (i.e., the proportion of relevant recommendations to the total number of recommendations). Coverage measures the ability of the recommendation engine to produce all of the pageviews that are likely to be visited by the user (i.e., the proportion of relevant recommendations to all pageviews that should be recommended).
Finally, for a given recommendation threshold $\tau$, the mean over all transactions in the evaluation set was computed as the overall evaluation score for each measure. We ran each set of experiments for thresholds ranging from 0.1 to 1.0. The results of these experiments are presented below.
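A sketch of this overall evaluation loop, reusing the `precision` and `coverage` helpers above, is shown below. The `recommend` callable is a hypothetical stand-in for the recommendation engine, and the default n = 4 and threshold range 0.1 to 1.0 follow the setup described in this section.

```python
def evaluate(eval_transactions, recommend, n=4,
             thresholds=tuple(t / 10 for t in range(1, 11))):
    """Compute mean precision and coverage over the evaluation set for each threshold.

    `recommend(active_session, tau)` stands in for the recommendation engine and
    should return the set R(as_t, tau) of recommended pageview URLs.
    """
    results = {}
    for tau in thresholds:
        p_scores, c_scores = [], []
        for t in eval_transactions:
            active_session = t[:n]   # first n pageviews generate recommendations
            eval_t = set(t[n:])      # remaining |t| - n pageviews evaluate them
            recs = recommend(active_session, tau)
            p_scores.append(precision(recs, eval_t))
            c_scores.append(coverage(recs, eval_t))
        results[tau] = (sum(p_scores) / len(p_scores),
                        sum(c_scores) / len(c_scores))
    return results
```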