Alexandrin Popescul and Lyle H. Ungar
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
popescul@unagi.cis.upenn.edu, ungar@cis.upenn.edu

David M. Pennock and Steve Lawrence
NEC Research Institute
4 Independence Way
Princeton, NJ 08540
dpennock@research.nj.nec.com, lawrence@research.nj.nec.com
When data is extremely sparse, as is typically the case for collaboration data, EM can suffer from overfitting. In Sections 4 and 5, we present two techniques to effectively increase the density of the data by exploiting secondary data. The first uses a similarity measure to fill in the user-item co-occurrence matrix by inferring which items users are likely to have accessed without the system's knowledge. The second creates an implicit user-content co-occurrence matrix by treating each user's access to an item as if it were many accesses to all of the pieces of content in the item's descriptive information. We evaluate these models in the context of a document recommendation system. Specifically, we train and test the models on data from ResearchIndex,¹ an online digital library of Computer Science papers (Lawrence et al., 1999; Bollacker et al., 2000). Section 6 presents empirical results and evaluations. In Section 6.2, we demonstrate the potential ineffectiveness of EM in sparse-data situations, using both ResearchIndex data and synthetic data. In Section 6.3, we show that both of our density-augmenting methods are effective at reducing overfitting and improving predictive accuracy. Our models yield more accurate recommendations than the commonly-employed k-nearest neighbors (k-NN) algorithm. Moreover, our global models can produce predictions for any user-item pair, whereas local methods like k-NN are simply incapable of producing meaningful recommendations for many user-item combinations.

2 BACKGROUND AND RELATED WORK

A variety of collaborative filtering algorithms have been designed and deployed. The Tapestry system relied on each user to identify like-minded users manually (Goldberg et al., 1992). GroupLens (Resnick et al., 1994) and Ringo (Shardanand & Maes, 1995), developed independently, were the first to automate prediction. Typical algorithms compute similarity scores between all pairs of users; predictions for a given user are generated by weighting other users' ratings proportionally to their similarity to the given user. A variety of similarity metrics are possible, including correlation (Resnick et al., 1994), mean-squared difference (Shardanand & Maes, 1995), vector similarity (Breese et al., 1998), or the probability that users are of the same type (Pennock et al., 2000b). Other algorithms construct a model of underlying user preferences, from which predictions are inferred. Examples include Bayesian network models (Breese et al., 1998), dependency network models (Heckerman et al., 2000), clustering models (Ungar & Foster, 1998), and models of how people rate items (Pennock et al., 2000b). Collaborative filtering has also been cast as a machine learning problem (Basu et al., 1998; Billsus & Pazzani, 1998; Nakamura & Abe, 1998) and as a list-ranking problem (Cohen et al., 1999; Freund et al., 1998; Pennock et al., 2000a). Singular Value Decomposition (SVD) was used to improve the scalability of collaborative filtering systems by dimensionality reduction (Sarwar et al., 2000).

Pure information filtering systems use only content to make recommendations. For example, search engines recommend web pages with content similar to (e.g., containing) user queries (Salton & McGill, 1983). In contrast to collaborative methods, content-based systems can even recommend new (previously unaccessed) items to users without any history in the system. Mooney & Roy (2000) develop a content-based book recommender using information extraction and machine learning techniques for text categorization.

Several authors suggest methods for combining collaborative filtering with information filtering. Basu et al. (1998) present a hybrid collaborative and content-based movie recommender. Collaborative features (e.g., Bob and Mary like Titanic) are encoded as set-valued attributes. These features are combined with more typical content features (e.g., Traffic is rated R) to inductively learn a binary classifier that separates liked and disliked movies. Also in a movie recommender domain, Good et al. (1999) suggest using content-based software agents to automatically generate ratings to reduce data sparsity. Claypool et al. (1999) employ separate collaborative and content-based recommenders in an online newspaper domain, combining the two predictions using an adaptive weighted average: as the number of users accessing an item increases, the weight of the collaborative component tends to increase. Web hyperlinks and document citations can be thought of as implicit endorsements or ratings. Cohn and Hofmann (2001) combine document content information with this type of connectivity information to identify principal topics and authoritative documents in a collection.

Recommender systems technology is in current use in many Internet commerce applications. For example, the University of Minnesota's GroupLens and MovieLens² research projects spawned Net Perceptions,³ a successful Internet startup offering personalization and recommendation services. Alexa⁴ is a web browser plug-in that recommends related links based in part on other people's web surfing habits. A growing number of companies,⁵ including Amazon.com, CDNow.com, and Levis.com, employ or provide recommender system solutions (Schafer et al., 1999). Recommendation tools originally developed at Microsoft Research are now included with the Commerce Edition of Microsoft's SiteServer,⁶ and are currently in use at multiple sites.

¹ http://researchindex.org/
² http://movielens.umn.edu/
³ http://www.netperceptions.com/
⁴ http://www.alexa.com/
⁵ http://www.cis.upenn.edu/~ungar/CF/
⁶ http://www.microsoft.com/siteserver

UAI 2001 POPESCUL ET AL. 439
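The similarity-weighted prediction scheme described above (weighting other users' ratings in proportion to their similarity to the active user) can be sketched in a few lines. The following Python sketch is illustrative only, with hypothetical names, and uses Pearson correlation (one of the metrics listed) over full rating vectors; a real system would compute correlations over co-rated items only and handle missing values.

```python
import numpy as np

def predict(ratings, user, item):
    """Memory-based collaborative filtering: predict ratings[user, item]
    as the active user's mean rating plus a similarity-weighted average
    of the other users' deviations from their own mean ratings."""
    mu = ratings.mean(axis=1)              # each user's mean rating
    num, den = 0.0, 0.0
    for v in range(ratings.shape[0]):
        if v == user:
            continue
        # Pearson correlation between the two users' rating vectors
        sim = np.corrcoef(ratings[user], ratings[v])[0, 1]
        num += sim * (ratings[v, item] - mu[v])
        den += abs(sim)
    return mu[user] + (num / den if den > 0 else 0.0)
```

For example, if the active user's ratings track one neighbor's closely and run opposite to another's, the prediction is pulled toward the like-minded neighbor's rating and away from the dissimilar one's.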
Figure 1: Graphical representation of the three-way aspect model.

Users choose among topics according to their interests; topic variables in turn "generate" documents. A factorization of the joint distribution that treats users and documents symmetrically is Pr(z) Pr(u|z) Pr(d|z). The joint distribution over just users and documents is

    Pr(u, d) = Σ_z Pr(z) Pr(u|z) Pr(d|z).

Model parameters are learned using EM (or variants) to find a local maximum of the log-likelihood of the training data. After the model is learned, documents can be ranked for a given user according to Pr(d|u) ∝ Pr(u, d); that is, according to how likely it is that the user will access the corresponding document. Documents with high Pr(d|u) that the user has not yet seen are good candidates for recommendation. Note that the aspect model allows multiple topics per user, unlike most clustering algorithms that assign each user to a single class.

This model is a pure collaborative filtering model; document content is not taken into account. We propose an extension of the aspect model to include three-way co-occurrence data over users, documents, and words. Here n(u, d) is the number of times user u accessed document d, and n(d, w) is the number of times word w occurs in document d. Given training data of this form, the log-likelihood L of the data is

    L = Σ_{u,d,w} n(u, d, w) log Pr(u, d, w).

The corresponding EM algorithm is:

E step:

    Pr(z|u, d, w) = Pr(z) Pr(u|z) Pr(d|z) Pr(w|z) / Σ_{z'} Pr(z') Pr(u|z') Pr(d|z') Pr(w|z')

M step:

    Pr(u|z) ∝ Σ_{d,w} n(u, d, w) Pr(z|u, d, w)

    Pr(d|z) ∝ Σ_{u,w} n(u, d, w) Pr(z|u, d, w)
TEM makes use of an inverse computational temperature β. EM is modified by raising the conditionals on the right-hand side of the E-step equation to the power β. TEM starts with β = 1 and decreases β at a rate η < 1, setting β = β × η whenever performance on a held-out portion of the training set deteriorates.

In Section 6.2, we see that even TEM fails to generalize when data is extremely sparse. In the next two sections, we propose two methods that effectively increase data density, thereby improving learning performance.

4 SIMILARITY-BASED DATA SMOOTHING

where d_i and d_j are vectors with tf-idf coordinates as described above. In our setting, the user-document co-occurrence data matrix is smoothed by replacing zero entries with average similarities, above a certain threshold, between the corresponding document and all documents that the user has accessed. This effectively increases the density (i.e., the fraction of non-zero entries) of the matrix. Figure 2 shows how the density of the ResearchIndex data (described in detail in Section 6.1) changes depending on the similarity threshold used in smoothing.

where w_i are words in d and |d| is the length of d. Conditional probabilities Pr(w_i|u) follow directly from the model:

    Pr(w_i|u) = Pr(u, w_i) / Σ_w Pr(u, w)

Inclusion of words through documents, and elimination of documents from direct participation in modeling, increased the density of our dataset (described below) from 0.38% to almost 9%.

6 RESULTS AND EVALUATION

Section 6.1 describes the ResearchIndex data. In Section 6.2, we examine under what conditions learning occurs at all, by measuring the increase in the log-likelihood of test data as EM proceeds. We find that if data is too sparse, neither EM nor TEM succeeds in significantly increasing the test data log-likelihood over a random initial guess. In Section 6.3, we evaluate the recommendations of our density-augmented models according to Breese et al.'s (1998) rank scoring metric.

6.1 RESEARCHINDEX DATA

The data for our experiments was taken from ResearchIndex (formerly CiteSeer), the largest freely available database of scientific literature (Lawrence et al., 1999; Bollacker et al., 2000). ResearchIndex catalogs scientific publications available on the web in PostScript and PDF formats. The full text of documents, as well as the citations made in them, is indexed. ResearchIndex supports keyword-based retrieval and browsing of the database, for example by following the links between papers formed by citations. Document detail page access information was obtained for July to November, 2000 (multiple accesses by the same user were included). Heuristics were used to filter out robots. Words from the first 5 kbytes of the text of each document were extracted.

We used the data from July to October as the training set, and the data from November as the testing set. Due to the rapid growth in usage of ResearchIndex, November accounted for 31% of the total five-month activity. The data included 33,050 unique users accessing the details of 177,232 documents. The density of this dataset was only 0.01%.

6.2 OVERFITTING

6.2.1 User-Document and User-Document-Word Aspect Models

Training the two-way user-document aspect model on the relatively dense set of 1000 users and 5000 documents resulted in immediate overfitting of EM, meaning that the test data log-likelihood began to fall after only the first or second iteration. This immediate overfitting occurred for numbers of latent classes ranging from 3 to 50. Using tempered EM (under several reasonable temperature change schedules) only kept the test data log-likelihood approximately at the same level as the initial random seed, without significant improvements.

Including the words contained in the 5,000 documents and fitting the three-way aspect model also resulted in immediate overfitting. Again, TEM failed to yield significant improvements in the test data log-likelihood.

6.2.2 Standard Aspect Model, Synthetic Data

To examine whether this extreme overfitting was specific to the ResearchIndex data, we tested the aspect model on a simple synthetic data set. Users are divided into three disjoint groups according to the following scheme:

1. users 0-49 read papers 0-299,
2. users 50-99 read papers 300-599, and
3. users 100-149 read papers 600-899,

where the probabilities that users read papers in their interest set are uniform.

We designed the data so that the "correct" model with three latent states is obvious. We generated several datasets of differing densities and trained a three-latent-variable aspect model on each to see whether EM converges to the correct model. We performed validation tests at each iteration with test sets of the same density as the corresponding training set. Figure 3 plots the iteration (averaged over fifty random restarts of EM) where overfitting⁷ first occurs versus the dataset density. In datasets of density less than 1.5%, the process consistently overfits from the first iteration.
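The evaluations in Section 6.3 use Breese et al.'s (1998) rank scoring metric R, modified (per footnote 8) for observed accesses rather than ratings. As a sketch of the underlying half-life utility, with hypothetical function and variable names and the half-life parameter α set to 5 for illustration: an accessed item at rank j of a user's recommendation list contributes 1 / 2^((j-1)/(α-1)).

```python
def rank_score(ranked_items, accessed, alpha=5):
    """Half-life utility of one user's ranked recommendation list
    (after Breese et al., 1998), with binary observed accesses: an
    accessed item at rank j contributes 1 / 2**((j - 1) / (alpha - 1))."""
    score = 0.0
    for j, item in enumerate(ranked_items, start=1):
        if item in accessed:
            score += 1.0 / 2 ** ((j - 1) / (alpha - 1))
    return score

def max_rank_score(n_accessed, alpha=5):
    """Utility attained if all of a user's accessed items topped the list."""
    return sum(1.0 / 2 ** ((j - 1) / (alpha - 1))
               for j in range(1, n_accessed + 1))
```

A common aggregation, following Breese et al., sums these per-user utilities over all users and normalizes by the summed maxima, so that R measures attained utility relative to the best achievable ranking.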
Figure 4 gives R scores for the experiments with k-NN in standard formulation on the user-document data for different values of k, ranging from 10 to 60 with an interval of 5.⁸

Figure 5: Total utility of the ranked lists over all users produced by the similarity-based User-Document model against the similarity threshold used in smoothing (25 latent class variables).

The maximum value R has reached is 2.10, which is greater than the best k-NN result (1.87), but not as good as the User-Words model (2.92), discussed below.

6.3.4 User-Words Aspect Model

Figure 6: Total utility of the ranked lists over all users produced by the User-Words aspect model.

7 CONCLUSIONS AND FUTURE WORK

We presented three probabilistic mixture models for recommending items based on collaborative and content-based evidence merged in a unified manner. Incorporating content into a collaborative filtering system can increase the flexibility and quality of the recommender. Moreover, when data is extremely sparse, as is typical in many real-world applications, additional content information seems almost necessary to fit global probabilistic models at all. The density of the ResearchIndex data is only 0.01%. Even the most active users reading the most popular articles induce a subset of density only 0.38%, still too sparse for the straightforward EM and TEM approaches to work. We find that a particularly good way to include content information in the context of a document recommendation system is to treat users as reading the words of the document, rather than the document itself. In our case, this increased the density from 0.38% to almost 9%, resulting in recommendations superior to k-NN.

There are many areas for future research. Similar methods to those presented here might be used to recommend items such as movies, which have attributes other than text. A movie can be viewed as consisting of the director and the actors in it, just as a document contains words. Both of our sparsity reduction techniques, similarity-based smoothing and an equivalent of a user-words aspect model, can be used.

⁸ We modify Breese et al.'s formula slightly for the case of observed accesses rather than ratings.

References

Basu, C., Hirsh, H., & Cohen, W. (1998). Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 714-720.

Billsus, D., & Pazzani, M. J. (1998). Learning collaborative information filters. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 46-54.

Bollacker, K., Lawrence, S., & Giles, C. L. (2000). Discovering relevant scientific literature on the web. IEEE Intelligent Systems, 15(2), 42-47.

Breese, J. S., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43-52.

Claypool, M., Gokhale, A., & Miranda, T. (1999). Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems: Implementation and Evaluation.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243-270.

Cohn, D., & Hofmann, T. (2001). The missing link: A probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems, Vol. 13. The MIT Press.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 170-178.

Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.

Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B. M., Herlocker, J. L., & Riedl, J. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 439-446.

Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for collaborative filtering and data visualization. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 264-273.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 289-296.

Hofmann, T., & Puzicha, J. (1999). Latent class models for collaborative filtering. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 688-693.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 175-186.

Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56-58.

Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.

Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. T. (2000). Application of dimensionality reduction in recommender system: A case study. In ACM WebKDD Web Mining for E-Commerce Workshop.

Schafer, J. B., Konstan, J., & Riedl, J. (1999). Recommender systems in e-commerce. In Proceedings of the ACM Conference on Electronic Commerce, pp. 158-166.

Shardanand, U., & Maes, P. (1995). Social information filtering: Algorithms for automating 'word of mouth'. In Proceedings of Computer Human Interaction, pp. 210-217.

Ungar, L. H., & Foster, D. P. (1998). Clustering methods for collaborative filtering. In Workshop on Recommendation Systems at the Fifteenth National Conference on Artificial Intelligence.