cific information based on the amount of training data avail-able for the specific user. When a user
u
starts using asystem, the system has almost no information about theuser. Because
N
u
for the user is small, the prior distribu-tion
p
(
f
|
θ
) is a major contributor when estimating the usermodelˆ
f
(first term in Equation 1). If the system has areliable prior, it can perform well even with little data forthe user. How can we find a good prior for
f
? We treatuser
u
’s model
f
u
as a sample from the distribution
P
(
f
|
θ
).Given a set of user models, we can choose
θ
to maximizetheir likelihood.As the system gets more training data for this particu-lar user,
f
uMAP
is dominated by the data likelihood (secondterm in Equation 1) and the prior learned from other usersbecomes less important. As a result, the system works likea well tuned non-personalized system for a new user, butkeeps on improving the user model as more feedback fromthe user is available. Eventually, after a sufficient amountof data is collected, the user’s profile should reach a fixedpoint.
4. HIERARCHICALGAUSSIANNETWORK
Specific functional forms for
f
u
and
P
(
f
|
θ
) are needed tobuild a practical learning system. For our work, we let
f
be asimple linear regression function and let
w
be the parameterof
f
. We chose to represent the prior distribution as a mul-tivariate normal, i.e.
P
(
f
|
θ
) =
P
(
w
|
θ
) =
N
(
w
;
µ,
Σ). Themotivation for this choice is the fact that the self-conjugacyof the normal distribution simplifies computation. For sim-plicity, we assume that the covariance matrix Σ is diagonal,i.e. the components of the model are independent. Thus wehave:
w
u
∼ N
(
w
;
µ,
Σ)
y
=
x
T
·
w
u
+
εε
∼ N
(
ε
;0
,κ
2
u
)where
ε
is independent zero mean Gaussian noise. One canalso view
y
as a random sample from a normal distribution
N
(
y
;
x
T
·
w
u
,κ
2
u
). It is possible to extend the hierarchicalmodel to include a prior on the variance of the noise
κ
u
and learn it from the data. For simplicity, we set
κ
2
u
to aconstant value of 0
.
1 for all
u
in our experiments.Based on Equation 1, the MAP estimate of the user model
w
u
for a particular user
u
is [7]:
w
uMAP
= argmax
w
log
P
(
w
|
θ
) +
N
u
X
i
=1
log
P
(
x
ui
,y
ui
|
w
)= argmin
w
(
w
−
µ
)
T
Σ
−
1
(
w
−
µ
) +1
κ
uN
u
X
i
=1
(
x
T i
·
w
−
y
i
)
2
(2)Equation 2 also shows the natural tradeoff that occursbetween the prior and the data. As we collect more data foruser
u
, the second term is expected to grow linearly with
N
u
.This term, which corresponds to the mean square error (
L
2
loss) of the model
w
on the training sample, will eventuallydominate the first term, which is the loss associated withthe deviation from the prior mean.The variance of the prior Σ also affects this tradeoff. Forinstance, if the values on the diagonal of Σ are very largefor all
j
, the prior is close to a uniform distribution and
¡¢£¤ ¡¢£¥
Figure 1: Illustration of dependencies of variablesin the hierarchical model. The rating,
Y
, for a doc-ument,
X
, is conditioned on the document and themodel,
W
, associated with that user. Users shareinformation about their model through the use of aprior,
Θ
.
w
MAP
is close to the linear least square estimator. In thereverse situation, the prior dominates the data and the MAPestimate will be closer to
µ
.So far, we have discussed the learning process of a singleuser and assume the prior is fixed. However, the system willhave many users, and the prior should be learned from theother users data. We use an initially uniform prior when thesystem is launched
2
, i.e.
P
(
µ,
Σ) is the same for all
µ
, Σ.As users join the system, the prior distribution
p
(
f
|
θ
) isupdated. At certain point when the system has
K
existingusers, the system has a set of
K
user models
w
u
:
u
= 1
...K
.Based on the evidence from these existing users, we can inferthe parameters
θ
= (
µ,
Σ) of our prior distribution:ˆ
µ
=1
k
K
X
u
=1
w
u
(3)ˆΣ =1
k
−
1
K
X
u
=1
(
w
u
−
ˆ
µ
)
·
(
w
u
−
ˆ
µ
)
T
(4)This is, in essence, the mechanism by which we share userinformation: the prior mean is the average of that param-eter for all other user models. Similarly, the prior variancecaptures how consistent a parameter is for all users.
4.1 Finding
w
MAP
with Belief Propagation
Given a large pool of data, we need to find the MAPestimation of all
f
u
and
θ
in Figure 1. A simple approachis to treat it as one large optimization problem where allparameters are optimized simultaneously. If there are
K
users with data sets¯
D
=
{
D
1
,...,D
K
}
then we can write
2
It is, of course, possible to extend the hierarchical modelto include a prior distribution for the parameters
µ
and Σ.
Leave a Comment