cific information based on the amount of training data available for the specific user. When a user u starts using a system, the system has almost no information about the user. Because N_u for the user is small, the prior distribution p(f|θ) is a major contributor when estimating the user model f̂ (first term in Equation 1). If the system has a reliable prior, it can perform well even with little data for the user. How can we find a good prior for f? We treat user u's model f_u as a sample from the distribution P(f|θ). Given a set of user models, we can choose θ to maximize their likelihood.

As the system gets more training data for this particular user, f_u^MAP is dominated by the data likelihood (second term in Equation 1) and the prior learned from other users becomes less important. As a result, the system works like a well-tuned nonpersonalized system for a new user, but keeps on improving the user model as more feedback from the user is available. Eventually, after a sufficient amount of data is collected, the user's profile should reach a fixed point.
4. HIERARCHICAL GAUSSIAN NETWORK
Specific functional forms for f_u and P(f|θ) are needed to build a practical learning system. For our work, we let f be a simple linear regression function and let w be the parameter of f. We chose to represent the prior distribution as a multivariate normal, i.e. P(f|θ) = P(w|θ) = N(w; μ, Σ). The motivation for this choice is that the self-conjugacy of the normal distribution simplifies computation. For simplicity, we assume that the covariance matrix Σ is diagonal, i.e. the components of the model are independent. Thus we have:
w_u ∼ N(w; μ, Σ)
y = x^T · w_u + ε
ε ∼ N(ε; 0, κ_u^2)

where ε is independent zero-mean Gaussian noise. One can also view y as a random sample from a normal distribution N(y; x^T · w_u, κ_u^2). It is possible to extend the hierarchical model to include a prior on the variance of the noise κ_u and learn it from the data. For simplicity, we set κ_u^2 to a constant value of 0.1 for all u in our experiments.

Based on Equation 1, the MAP estimate of the user model w_u for a particular user u is [7]:
w_u^MAP = argmax_w [ log P(w|θ) + ∑_{i=1}^{N_u} log P(x_{ui}, y_{ui}|w) ]
        = argmin_w [ (w − μ)^T Σ^{−1} (w − μ) + (1/κ_u^2) ∑_{i=1}^{N_u} (x_i^T · w − y_i)^2 ]    (2)

Equation 2 also shows the natural tradeoff that occurs between the prior and the data. As we collect more data for user u, the second term is expected to grow linearly with N_u. This term, which corresponds to the mean squared error (L_2 loss) of the model w on the training sample, will eventually dominate the first term, which is the loss associated with the deviation from the prior mean.

The variance of the prior Σ also affects this tradeoff. For instance, if the values on the diagonal of Σ are very large for all j, the prior is close to a uniform distribution and w_MAP is close to the linear least squares estimator. In the reverse situation, the prior dominates the data and the MAP estimate will be closer to μ.

Figure 1: Illustration of dependencies of variables in the hierarchical model. The rating, Y, for a document, X, is conditioned on the document and the model, W, associated with that user. Users share information about their model through the use of a prior, Θ.

So far, we have discussed the learning process of a single user and assumed the prior is fixed. However, the system will have many users, and the prior should be learned from the other users' data. We use an initially uniform prior when the system is launched
^2, i.e. P(μ, Σ) is the same for all μ, Σ. As users join the system, the prior distribution p(f|θ) is updated. At a certain point, when the system has K existing users, it has a set of K user models w_u : u = 1, ..., K. Based on the evidence from these existing users, we can infer the parameters θ = (μ, Σ) of our prior distribution:
μ̂ = (1/K) ∑_{u=1}^{K} w_u    (3)

Σ̂ = (1/(K−1)) ∑_{u=1}^{K} (w_u − μ̂) · (w_u − μ̂)^T    (4)

This is, in essence, the mechanism by which we share user information: the prior mean of a parameter is the average of that parameter over the other users' models. Similarly, the prior variance captures how consistent a parameter is across users.
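Equations 2–4 can be exercised end to end on synthetic data. The sketch below is a rough illustration only: the dimensions, the user models, and the helper names are invented for the example, and the argmin in Equation 2 is computed via the ridge-style closed form obtained by setting its gradient to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, kappa_sq = 3, 50, 0.1          # noise variance fixed at 0.1 as in the text

# Pretend we already fitted K user models w_u (here: sampled directly).
W = rng.normal(loc=[1.0, -2.0, 0.5], scale=0.3, size=(K, d))

# Equations 3-4: empirical prior; Sigma is kept diagonal per the
# independence assumption, so only the per-component variances are used.
mu_hat = W.mean(axis=0)                       # Eq. 3
Sigma_hat = np.diag(W.var(axis=0, ddof=1))    # Eq. 4 (diagonal part)
Sigma_inv = np.linalg.inv(Sigma_hat)

# Equation 2 for a new user: setting the gradient to zero gives
#   w_MAP = (Sigma^-1 + X^T X / kappa^2)^-1 (Sigma^-1 mu + X^T y / kappa^2)
X = rng.normal(size=(5, d))                   # five feedback examples
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(kappa_sq), size=5)

A = Sigma_inv + X.T @ X / kappa_sq
b = Sigma_inv @ mu_hat + X.T @ y / kappa_sq
w_map = np.linalg.solve(A, b)
print(w_map.shape)   # (3,)
```

With few examples, w_map stays near mu_hat; as rows are added to X, the likelihood term in A and b grows and w_map moves toward the ordinary least squares solution, exactly the tradeoff Equation 2 describes.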
4.1 Finding w_MAP with Belief Propagation
Given a large pool of data, we need to find the MAP estimates of all f_u and θ in Figure 1. A simple approach is to treat it as one large optimization problem where all parameters are optimized simultaneously. If there are K users with data sets D̄ = {D_1, ..., D_K}, then we can write
^2 It is, of course, possible to extend the hierarchical model to include a prior distribution for the parameters μ and Σ.