3 million ratings: the scores used in determining the final winner.
[Figure: the utility matrix R, 480,000 users × 17,700 movies; each entry is a 1–5 star rating and most entries are missing]
"
RMSE = ∑((,')∈# 𝑟̂'( − 𝑟'( *
0
2/15/17 Jure Leskovec, Stanford C246: Mining Massive Datasets
Predicted rating 5
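For concreteness, a minimal NumPy sketch of this metric (array names are illustrative, not from the slides):

```python
import numpy as np

def rmse(pred, true):
    """Root-mean-square error over a set of (prediction, truth) pairs."""
    pred, true = np.asarray(pred), np.asarray(true)
    return np.sqrt(np.mean((pred - true) ** 2))
```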
- The winner of the Netflix Challenge!
- Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined, local view:
  - Global effects
  - Factorization
  - Collaborative filtering (the local view)
$$\hat r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

- $s_{ij}$ … similarity of items i and j
- $r_{xj}$ … rating of user x on item j
- $N(i;x)$ … set of items similar to item i that were rated by x
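A minimal sketch of this prediction rule, assuming a ratings matrix R (users × items, with np.nan for missing entries) and a precomputed item-item similarity matrix S; the top-k neighborhood rule and all names are illustrative assumptions:

```python
import numpy as np

def predict_cf(R, S, x, i, k=5):
    """Item-item CF: weighted average of user x's ratings on the items
    most similar to item i (the neighborhood N(i; x))."""
    rated = np.where(~np.isnan(R[x]))[0]             # items rated by user x
    rated = rated[rated != i]
    neighbors = rated[np.argsort(S[i, rated])[-k:]]  # k most similar to i
    sims = S[i, neighbors]
    # sum_j s_ij * r_xj / sum_j s_ij
    return sims @ R[x, neighbors] / sims.sum()
```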
- In practice we get better estimates if we model deviations:

$$\hat r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij}\,(r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$

where the baseline estimate for $r_{xi}$ is $b_{xi} = \mu + b_x + b_i$, with:
- μ = overall mean rating
- b_x = rating deviation of user x = (avg. rating of user x) − μ
- b_i = rating deviation of movie i = (avg. rating of movie i) − μ

Problems/issues with CF:
1) Similarity measures are "arbitrary"
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting

Solution: instead of $s_{ij}$, use weights $w_{ij}$ that we estimate directly from the data.
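A sketch of the baseline estimates under these definitions (representing R as a users × items array with np.nan for missing ratings is my assumption):

```python
import numpy as np

def baseline_estimates(R):
    """b_xi = mu + b_x + b_i, computed from the observed ratings."""
    mu = np.nanmean(R)                    # overall mean rating
    b_x = np.nanmean(R, axis=1) - mu      # per-user deviation from the mean
    b_i = np.nanmean(R, axis=0) - mu      # per-movie deviation from the mean
    return mu + b_x[:, None] + b_i[None, :]
```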
- Use a weighted sum rather than a weighted average:

$$\hat r_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$$

- A few notes:
  - $N(i;x)$ … set of movies rated by user x that are similar to movie i
  - $w_{ij}$ is the interpolation weight (some real number)
  - We allow $\sum_{j \in N(i;x)} w_{ij} \neq 1$
  - $w_{ij}$ models the interaction between pairs of movies (it does not depend on user x)
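The corresponding prediction is a one-liner; here W holds the learned interpolation weights and B the baseline estimates (this data layout is an assumption for illustration):

```python
def predict_weighted(R, W, B, x, i, neighborhood):
    """Weighted sum (not a weighted average): the w_ij need not sum to 1."""
    return B[x, i] + sum(W[i, j] * (R[x, j] - B[x, j]) for j in neighborhood)
```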
$$\hat r_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$$

- How to set $w_{ij}$?
  - Remember, the error metric is $\sqrt{\frac{1}{|R|}\sum_{(i,x)\in R}(\hat r_{xi} - r_{xi})^2}$, or equivalently SSE: $\sum_{(i,x)\in R}(\hat r_{xi} - r_{xi})^2$
  - Find the $w_{ij}$ that minimize SSE on the training data!
  - This models relationships between item i and its neighbors j
  - $w_{ij}$ can be learned/estimated based on x and all other users that rated i

Why is this a good idea?
- Quantify goodness using RMSE.

[Figure: toy ratings matrix used to illustrate fitting the weights]
- We have the optimization problem:

$$J(w) = \sum_{(x,i) \in R} \Big( b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj}) - r_{xi} \Big)^2$$

- Gradient descent:
  - Iterate until convergence: $w \leftarrow w - \eta\,\nabla_w J$, where η is the learning rate
  - $\nabla_w J$ is the gradient (the derivative evaluated on the data):

$$\nabla_{w} J = \frac{\partial J(w)}{\partial w_{ij}} = 2 \sum_{(x,i) \in R} \Big( b_{xi} + \sum_{k \in N(i;x)} w_{ik}\,(r_{xk} - b_{xk}) - r_{xi} \Big)(r_{xj} - b_{xj})$$

  for $j \in \{N(i;x),\ \forall i, \forall x\}$; else $\frac{\partial J(w)}{\partial w_{ij}} = 0$
  - Note: we fix movie i, go over all $r_{xi}$, and for every movie $j \in N(i;x)$ we compute $\frac{\partial J(w)}{\partial w_{ij}}$

In pseudocode:

while |w_new − w_old| > ε:
    w_old = w_new
    w_new = w_old − η · ∇_{w_old} J
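A minimal sketch of this gradient descent, with the ratings, baselines, and neighborhoods stored in dicts; these data structures and the hyperparameters are illustrative assumptions, not the contestants' implementation:

```python
def learn_weights(ratings, B, neigh, eta=1e-4, eps=1e-6, max_iters=1000):
    """Learn interpolation weights w_ij by gradient descent on J(w).
    ratings: {(x, i): r_xi}; B: {(x, i): b_xi}; neigh: {(x, i): [j, ...]}."""
    W = {}                                    # w_ij, defaulting to 0
    for _ in range(max_iters):
        grad = {}
        for (x, i), r in ratings.items():
            # Prediction error for this (user, movie) pair.
            err = B[(x, i)] - r + sum(
                W.get((i, j), 0.0) * (ratings[(x, j)] - B[(x, j)])
                for j in neigh[(x, i)])
            for j in neigh[(x, i)]:
                grad[(i, j)] = grad.get((i, j), 0.0) + \
                    2 * err * (ratings[(x, j)] - B[(x, j)])
        # w <- w - eta * grad_w J; stop once the step is tiny.
        for key, g in grad.items():
            W[key] = W.get(key, 0.0) - eta * g
        if max(abs(g) for g in grad.values()) * eta < eps:
            break
    return W
```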
- So far: $\hat r_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$
  - Weights $w_{ij}$ are derived directly from the data rather than from an arbitrary similarity measure
  - (For reference, Netflix's own system: RMSE 0.9514)
[Figure: latent-factor intuition — movies such as Sense and Sensibility, Ocean's 11, and Lethal Weapon placed on an axis from "geared towards females" to "geared towards males"; the ratings matrix R (items × users) is approximated as the product of an items × factors matrix Q and a factors × users matrix PT]
- For now, let's assume we can approximate the rating matrix R as a product of "thin" matrices, R ≈ Q · PT:
  - R has missing entries, but let's ignore that for now!
  - Basically, we want the reconstruction error to be small on the known ratings, and we don't care about the values at the missing ones.
- How do we estimate the missing rating of user x for item i?

$$\hat r_{xi} = q_i \cdot p_x = \sum_f q_{if}\, p_{xf}$$

where $q_i$ = row i of Q, $p_x$ = column x of PT, and f ranges over the latent factors.

[Figure: the unknown entry "?" in R is filled in from the corresponding row of Q and column of PT; in the example it evaluates to 2.4]
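In code the prediction is just a dot product; Q and P are the factor matrices (the orientation items × factors and users × factors is my convention):

```python
import numpy as np

def predict(Q, P, x, i):
    """r_hat_xi = q_i . p_x = sum over factors f of q_if * p_xf."""
    return Q[i] @ P[x]
```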
[Figure: movies in latent-factor space; Factor 1 runs from "geared towards females" (The Color Purple, Sense and Sensibility) to "geared towards males" (Lethal Weapon, Ocean's 11), with a second axis from serious (Braveheart, Amadeus) to funny]
- So in our case, "SVD" on the Netflix data gives R ≈ Q · PT, with the correspondence A = R, Q = U, PT = Σ·VT, and

$$\hat r_{xi} = q_i \cdot p_x$$
- We already know that SVD gives the minimum reconstruction error (sum of squared errors):

$$\min_{U,\Sigma,V} \sum_{ij \in A} \big(A_{ij} - [U \Sigma V^\top]_{ij}\big)^2$$

- Note two things:
  - SSE and RMSE are monotonically related: $\mathrm{RMSE} = \sqrt{\mathrm{SSE}/c}$, with c the number of ratings. Great news: SVD is minimizing RMSE!
  - Complication: the sum in the SVD error term is over all entries (no rating is interpreted as a zero rating). But our R has missing entries!
[Figure: R ≈ Q · PT again, now with missing entries in R]
- SVD isn't defined when entries are missing!
- Use specialized methods to find P, Q:

$$\min_{P,Q} \sum_{(i,x) \in R} \big(r_{xi} - q_i \cdot p_x\big)^2, \qquad \hat r_{xi} = q_i \cdot p_x$$

  - Note:
    - We don't require the columns of P, Q to be orthogonal or unit length
    - P, Q map users/movies to a latent space
    - This was the most popular model among Netflix contestants
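As a sketch, the objective these specialized methods minimize can be written so that the missing entries simply drop out (representing R as items × users with np.nan for missing ratings is an assumption):

```python
import numpy as np

def sse_known(R, Q, P):
    """Sum of squared errors over the known ratings only."""
    E = R - Q @ P.T            # residuals; NaN wherever the rating is missing
    return np.nansum(E ** 2)   # NaN entries contribute nothing to the sum
```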
- Our goal is to find P and Q such that:

$$\min_{P,Q} \sum_{(i,x) \in R} \big(r_{xi} - q_i \cdot p_x\big)^2$$

[Figure: R ≈ Q · PT, as before]
[Figure: a toy ratings matrix with many unknown "?" entries — easy to fit exactly, hard to generalize from]

- Overfitting: the model fits the training data too well and thus does not generalize to unseen test data.
- The fix is regularization:
$$\min_{P,Q} \underbrace{\sum_{\text{training}} (r_{xi} - q_i p_x)^2}_{\text{"error"}} + \underbrace{\Big[\lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2\Big]}_{\text{"length"}}$$

- λ₁, λ₂ … user-set regularization parameters
- Note: we do not care about the "raw" value of the objective function; we care about the P, Q that achieve the minimum of the objective.
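The regularized objective as a sketch under the same conventions (lam1 and lam2 stand for the user-set λ₁, λ₂):

```python
import numpy as np

def objective(R, Q, P, lam1, lam2):
    """'error' + 'length': SSE on known ratings plus the norm penalties."""
    error = np.nansum((R - Q @ P.T) ** 2)     # sum over training ratings only
    length = lam1 * (P ** 2).sum() + lam2 * (Q ** 2).sum()
    return error + length
```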
[Figure series: the effect of regularization in latent-factor space; movies (The Color Purple, Amadeus, Braveheart, Lethal Weapon, Sense and Sensibility, Ocean's 11, Independence Day) arranged along Factor 1 from "geared towards females" to "geared towards males", with a second axis from serious to funny]

$$\min_{P,Q} \sum_{\text{training}} (r_{xi} - q_i p_x)^2 + \lambda \Big[\sum_x \|p_x\|^2 + \sum_i \|q_i\|^2\Big]$$

minimize over factors: "error" + λ · "length"
- Want to find matrices P and Q:

$$\min_{P,Q} \sum_{\text{training}} (r_{xi} - q_i p_x)^2 + \Big[\lambda_1 \sum_x \|p_x\|^2 + \lambda_2 \sum_i \|q_i\|^2\Big]$$

- Gradient descent:
  - Initialize P and Q (using SVD; pretend missing ratings are 0)
  - Do gradient descent:
    - P ← P − η · ∇P
    - Q ← Q − η · ∇Q
  - How to compute the gradient of a matrix? Compute the gradient of every element independently!
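A minimal full-batch version of this procedure; random initialization (instead of the SVD-based one the slide suggests) and the hyperparameter values are my simplifications:

```python
import numpy as np

def factorize(R, f=10, lam1=0.1, lam2=0.1, eta=1e-3, steps=500, seed=0):
    """Gradient descent for min SSE + lam1*||P||^2 + lam2*||Q||^2.
    R: items x users, with np.nan marking missing ratings."""
    rng = np.random.default_rng(seed)
    n_items, n_users = R.shape
    Q = 0.1 * rng.standard_normal((n_items, f))
    P = 0.1 * rng.standard_normal((n_users, f))
    for _ in range(steps):
        E = np.where(np.isnan(R), 0.0, R - Q @ P.T)  # residuals on known entries
        # Element-wise gradients, as the slide suggests.
        gQ = -2 * E @ P + 2 * lam2 * Q
        gP = -2 * E.T @ Q + 2 * lam1 * P
        Q -= eta * gQ
        P -= eta * gP
    return Q, P
```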
- Example:
  - Mean rating: μ = 3.7
  - You are a critical reviewer: your ratings are 1 star lower than the mean, so b_x = −1
  - Star Wars gets a mean rating 0.5 higher than the average movie, so b_i = +0.5
  - Predicted baseline rating for you on Star Wars: b_xi = μ + b_x + b_i = 3.7 − 1 + 0.5 = 3.2
- Solve:

$$\min_{Q,P} \sum_{(x,i) \in R} \big(r_{xi} - (\mu + b_x + b_i + q_i p_x)\big)^2 + \Big(\lambda_1 \sum_i \|q_i\|^2 + \lambda_2 \sum_x \|p_x\|^2 + \lambda_3 \sum_x b_x^2 + \lambda_4 \sum_i b_i^2\Big)$$

where the second term is the regularization; the λ's are selected via grid search on a validation set.
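A sketch of that grid search; `fit` is any trainer with a `lam` keyword (for instance, a single-λ variant of the factorization sketch above), and the grid values are placeholders:

```python
import numpy as np

def validation_rmse(R_val, Q, P):
    """RMSE on the validation ratings (np.nan = missing)."""
    return np.sqrt(np.nanmean((R_val - Q @ P.T) ** 2))

def pick_lambda(R_train, R_val, fit, grid=(0.01, 0.1, 1.0, 10.0)):
    """Train at each lambda; keep the one with the best validation RMSE."""
    scores = {lam: validation_rmse(R_val, *fit(R_train, lam=lam))
              for lam in grid}
    return min(scores, key=scores.get)
```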
[Plots: test RMSE vs. number of parameters (millions, log scale, 1 to 10,000). The latent-factor model drops from roughly 0.905 toward 0.885; adding CF pushes it further, toward about 0.875 — both well below Netflix's 0.9514.]
- BellKor:
  - Continue to get small improvements in their scores
  - Realize they are in direct competition with team Ensemble
- Strategy:
  - Both teams carefully monitor the leaderboard
  - The only sure way to check for improvement is to submit a set of predictions
  - But this alerts the other team to your latest score
- Submissions were limited to one per day
  - Only one final submission could be made in the last 24 hours
- 24 hours before the deadline...
  - A BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's
- A frantic last 24 hours for both teams
  - Much computer time spent on final optimization
  - Carefully calibrated to end about an hour before the deadline
- Final submissions:
  - BellKor submits a little early (on purpose), 40 minutes before the deadline
  - Ensemble submits their final entry 20 minutes later
  - ... and everyone waits ...
What's the moral of the story?
Submit early! :-)
- Some slides and plots borrowed from Yehuda Koren, Robert Bell, and Padhraic Smyth
- Further reading:
  - Y. Koren, "Collaborative Filtering with Temporal Dynamics," KDD '09
  - http://www2.research.att.com/~volinsky/netflix/bpc.html
  - http://www.the-ensemble.com/