
Discovery of Content Profiles

We use precisely the same representation for content profiles as for usage profiles, i.e., a weighted collection of pageviews. In contrast to usage profiles, content profiles represent the different ways in which pages with partly similar content may be grouped together. Our goal here is to capture the common interest of users in a group of pages whose contents are similar in specific portions. Because different groups of users may be interested in different segments of each page, content profiles must capture overlapping interests of users.

Clusters of pageviews obtained using standard clustering algorithms that partition the data are not appropriate candidates for content profiles, because a partition assigns each pageview to exactly one cluster and so cannot capture overlapping interests. To obtain content profiles, instead of clustering pageviews (as k-dimensional feature vectors, where k is the number of extracted features in the global site dictionary), we cluster the features. Using the inverted feature-pageview matrix obtained in the content preprocessing stage, each feature can be viewed as an n-dimensional vector over the original space of pageviews, where n is the number of pageviews. Thus, each dimension in the pageview vector for a feature is the weight associated with that feature in the corresponding pageview. We use a multivariate k-means clustering technique to cluster these pageview vectors. Now, given a feature cluster G, we construct a content profile $C_G$ as a set of pageview-weight pairs:

\begin{displaymath}C_G = \{ \left\langle {p,weight(p,C_G )} \right\rangle \vert ~p \in P,
\;weight(p,C_G) \ge \tau \}
\end{displaymath}

where the significance weight, $weight(p, C_G)$, of the pageview p within the content profile is obtained as follows:

\begin{displaymath}weight(p,C_G ) = \frac{{\sum\limits_{f \in G} {fw(p,f)}
}}{{\sum\limits_{i = 1}^n {\sum\limits_{f \in G} {fw(p_i ,f)} } }}
\end{displaymath}

and fw(p, f) is the weight of a feature f in pageview p. As in the case of usage profiles, we normalize pageview weights so that the maximum weight in each profile is 1, and we filter out pageviews whose weight is below a specified significance threshold, $\tau$. Note that the representation of content profiles as a set of pageview-weight pairs is identical to that for usage profiles discussed earlier. This uniform representation allows us to easily integrate both types of profiles with the recommendation engine.
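
The construction of a single content profile can be illustrated with a minimal Python sketch. It assumes that the feature cluster G has already been produced by the clustering step and that the feature weights are available as a nested dictionary; the function name `content_profile`, the dictionary layout, and the sample pageviews, features, and weights are all hypothetical and chosen only for illustration.

```python
def content_profile(fw, cluster, tau):
    """Build a content profile for one feature cluster.

    fw      -- assumed layout: {pageview: {feature: weight}}, i.e. fw(p, f)
    cluster -- the set of features G (assumed to come from the clustering step)
    tau     -- significance threshold applied after normalization
    """
    # Numerator of weight(p, C_G): summed weight of the cluster's features in p.
    raw = {p: sum(feats.get(f, 0.0) for f in cluster) for p, feats in fw.items()}

    # Denominator: the same sum taken over all n pageviews.
    total = sum(raw.values())
    if total == 0:
        return {}  # no pageview contains any feature of this cluster

    weights = {p: r / total for p, r in raw.items()}

    # Normalize so that the maximum weight in the profile is 1.
    top = max(weights.values())
    normalized = {p: w / top for p, w in weights.items()}

    # Keep only pageviews whose weight reaches the threshold tau.
    return {p: w for p, w in normalized.items() if w >= tau}


# Hypothetical feature weights for three pageviews.
fw = {
    "p1": {"mining": 0.8, "usage": 0.2},
    "p2": {"mining": 0.2, "cluster": 0.2},
    "p3": {"usage": 0.9},
}
profile = content_profile(fw, {"mining", "cluster"}, tau=0.3)
```

Because a pageview may carry features from several clusters, the same pageview can appear in more than one profile built this way, which is the overlap that partitioning pageviews directly would rule out.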


Bamshad Mobasher
2000-08-14