Data Preparation for Usage and Content Mining

The required high-level tasks in usage data preprocessing are data cleaning, user identification, session identification, pageview identification, and path completion. Path completion may be necessary because of client-side or proxy-level caching. User identification is necessary for Web sites that do not use cookies or embedded session IDs. We use the heuristics proposed in [3] to identify unique user sessions from anonymous usage data and to infer cached references.
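As a rough illustration of session identification, the sketch below splits each user's requests wherever the gap between consecutive requests exceeds a timeout. The 30-minute timeout, the `sessionize` name, and the tuple layout of the log are assumptions made for this example; they are not necessarily the heuristics of [3], and path completion is not attempted.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Split each user's requests into sessions using an inactivity timeout.

    requests: iterable of (user_id, timestamp_seconds, url) tuples (assumed layout).
    timeout:  assumed inactivity gap, in seconds, that starts a new session.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, [u for _, u in current]))
                current = []
            current.append(hit)
        sessions.append((user, [u for _, u in current]))
    return sessions


# Hypothetical cleaned log entries.
log = [
    ("u1", 0, "/index.html"),
    ("u1", 60, "/products.html"),
    ("u1", 4000, "/index.html"),
    ("u2", 10, "/about.html"),
]
sessions = sessionize(log)
```

Here the third request of `u1` arrives more than 30 minutes after the second, so it opens a second session for that user.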

Pageview identification is the task of determining which page file accesses contribute to a single browser display. Not all pageviews are relevant for specific mining tasks. Furthermore, among the relevant pageviews, some may be more significant than others. The significance of a pageview may depend on usage, content, and structural characteristics of the site, as well as prior domain knowledge specified by the site designer. For example, in an e-commerce site, pageviews corresponding to product-oriented events (e.g., shopping cart changes or product information views) may be considered more significant than others. In order to provide a flexible framework for a variety of data mining activities, a number of attributes must be recorded with each pageview. These attributes include the pageview id (normally a URL uniquely representing the pageview), duration (for a given user session), static pageview type (e.g., content, navigational, product view, or index page), and other meta-data.

Transaction identification can be performed as a final preprocessing step prior to pattern discovery in order to focus on the relevant subsets of pageviews in each user session [3]. The transaction file can be further filtered by removing very low support or very high support pageview references (i.e., references to those pageviews which do not appear in a sufficient number of transactions, or those that are present in nearly all transactions). This type of support filtering can be useful in eliminating noise from the data, such as that generated by shallow navigational patterns of "non-active" users, and pageview references with minimal knowledge value for the purpose of personalization.
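The support filtering described above can be sketched as follows. The function name `support_filter`, the threshold values, and the decision to drop transactions that become empty are assumptions for illustration; the text does not prescribe specific thresholds.

```python
from collections import Counter


def support_filter(transactions, min_support=0.1, max_support=0.9):
    """Remove pageview references with very low or very high support.

    transactions: list of lists of pageview URLs.
    Support of a pageview = fraction of transactions that contain it.
    """
    m = len(transactions)
    # Count each pageview at most once per transaction.
    counts = Counter(p for t in transactions for p in set(t))
    keep = {p for p, c in counts.items() if min_support <= c / m <= max_support}
    filtered = [[p for p in t if p in keep] for t in transactions]
    # Assumption: transactions left empty after filtering are discarded.
    return [t for t in filtered if t]


# Hypothetical transaction file.
transactions = [
    ["/home", "/a", "/b"],
    ["/home", "/a"],
    ["/home", "/c"],
    ["/home", "/a", "/b"],
]
filtered = support_filter(transactions, min_support=0.3, max_support=0.9)
```

In this example `/home` appears in every transaction and `/c` in only one of four, so both are removed and the third transaction disappears entirely.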

Usage preprocessing ultimately results in a set of n pageview records appearing in the transaction file,

$P = \{p_1, p_2, \cdots, p_n \}$ ,

with each pageview record uniquely represented by its associated URL, and a set of m user transactions,

$T = \{t_1, t_2, \cdots, t_m \}$ ,

where each $t_i \in T$ is a subset of P. To facilitate various data mining operations such as clustering, we view each transaction t as an n-dimensional vector over the space of pageview references,

$t = \left\langle {w(p_1 ,t),w(p_2 ,t),\cdots,w(p_n ,t)} \right\rangle$ ,

where $w(p_i, t)$ is a weight, in the transaction $t$, associated with the pageview represented by $p_i \in P$ .

The weights can be determined in a number of ways. For example, binary weights can be used to represent the existence or non-existence of a product purchase or a document access in the transaction. Alternatively, the weights can be a function of the duration of the associated pageview in order to capture the user's interest in a content page. The weights may also, in part, be based on domain-specific significance weights assigned by the analyst.
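A minimal sketch of the two weighting schemes follows, assuming a transaction is stored as a mapping from pageview URL to viewing duration in seconds. The function name and the choice to normalize durations by the total duration of the transaction are assumptions for this example; other functions of duration are equally possible.

```python
def transaction_vector(transaction, pageviews, weighting="binary"):
    """Build the n-dimensional weight vector of a transaction over P.

    transaction: dict mapping pageview URL -> duration in seconds (assumed layout).
    pageviews:   ordered list of all pageview URLs p_1, ..., p_n.
    """
    if weighting == "binary":
        # 1 if the pageview occurs in the transaction, 0 otherwise.
        return [1.0 if p in transaction else 0.0 for p in pageviews]
    if weighting == "duration":
        # Assumption: weight = share of the transaction's total viewing time.
        total = sum(transaction.values())
        return [transaction.get(p, 0.0) / total if total else 0.0
                for p in pageviews]
    raise ValueError("unknown weighting: " + weighting)


# Hypothetical pageview set and transaction.
P = ["/a", "/b", "/c"]
t = {"/a": 30, "/c": 90}
binary_vec = transaction_vector(t, P, "binary")
duration_vec = transaction_vector(t, P, "duration")
```

For this transaction the binary vector marks `/a` and `/c` as present, while the duration vector assigns `/c` three times the weight of `/a`.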

Content preprocessing involves the extraction of relevant features from text and meta-data. Meta-data extraction becomes particularly important when dealing with product-oriented pageviews or those involving non-textual content. In the current implementation of our framework, features are extracted from meta-data embedded into files in the form of XML or HTML meta-tags, as well as from the textual content of pages. In order to use features in similarity computations, appropriate weights must be associated with them. For features extracted from meta-data, we assume that feature weights are provided as part of the domain knowledge specified by the site designer. For features extracted from text, we use a standard function of the term frequency and inverse document frequency (tf.idf) for feature weights, as commonly used in information retrieval [5,15].

Specifically, each pageview $p$ is represented as a $k$-dimensional feature vector, where $k$ is the total number of features extracted from the site and stored in a global dictionary. Each dimension in a feature vector represents the corresponding feature weight within the pageview. Thus, the feature vector for a pageview $p$ is given by:

$p = \left\langle {fw(p ,f_1),fw(p ,f_2),\cdots,fw(p ,f_k)} \right\rangle$

where $fw(p, f_j)$ is the weight of the $j$th feature in pageview $p \in P$ , for $1 \leq j \leq k$ .
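For text features, one common tf.idf variant can be sketched as below: raw term frequency multiplied by $\log(N / df)$, with each vector then scaled to unit length. This particular formula, the function name, and the input layout are assumptions; the text only states that a standard tf.idf function is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute unit-length tf.idf feature vectors for pageviews.

    docs: dict mapping pageview id -> list of extracted terms (assumed layout).
    Returns (features, vectors), where features is the sorted global
    dictionary and vectors maps each pageview id to its k weights.
    """
    n_docs = len(docs)
    # Document frequency: number of pageviews in which each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    features = sorted(df)
    vectors = {}
    for pid, terms in docs.items():
        tf = Counter(terms)
        raw = [tf[f] * math.log(n_docs / df[f]) for f in features]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors[pid] = [w / norm if norm else 0.0 for w in raw]
    return features, vectors


# Hypothetical term lists for two pageviews.
docs = {
    "/p1": ["laptop", "price", "laptop"],
    "/p2": ["price", "review"],
}
features, vectors = tfidf_vectors(docs)
```

A term that occurs in every pageview, such as `price` here, receives an idf of zero and therefore contributes nothing to either vector.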

For features extracted from textual content of pages, the feature weight is obtained as the normalized tf.idf value for the term. Finally, in order to combine feature weights from meta-data (specified externally) and feature weights from the text content, proper normalization of those weights must be performed as part of preprocessing. The feature vectors obtained in this way are organized into an inverted file structure containing a dictionary of all extracted features and posting files for each feature specifying the pageviews in which the feature occurs along with its weight. Conceptually, this structure can be viewed as a feature-pageview matrix in which each column is a feature vector corresponding to a pageview.
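The inverted file structure can be pictured with the following sketch, which maps each feature in the dictionary to a posting list of (pageview, weight) pairs. The in-memory dictionary representation and the name `build_inverted_file` are assumptions for illustration; the actual implementation stores postings in files.

```python
from collections import defaultdict


def build_inverted_file(feature_vectors, features):
    """Build feature -> [(pageview id, weight), ...] posting lists.

    feature_vectors: dict mapping pageview id -> list of k feature weights.
    features:        list of the k feature names, in vector order.
    """
    postings = defaultdict(list)
    for pid, vec in feature_vectors.items():
        for f, w in zip(features, vec):
            if w != 0:
                # Only pageviews in which the feature occurs are recorded.
                postings[f].append((pid, w))
    return dict(postings)


# Hypothetical normalized feature vectors.
feature_names = ["laptop", "price"]
feature_vectors = {"/p1": [0.8, 0.6], "/p2": [0.0, 1.0]}
index = build_inverted_file(feature_vectors, feature_names)
```

Reading the posting list of one feature across all pageviews corresponds to reading one row of the feature-pageview matrix.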

Bamshad Mobasher
2000-08-14