
Introduction to Algorithms for Behavior-Based Recommendation
Tokyo Web Mining Meetup

March 26, 2016


Kimikazu Kato

Silver Egg Technology Co., Ltd.

1 / 36
About myself
加藤公一 Kimikazu Kato

Twitter: @hamukazu

LinkedIn: http://linkedin.com/in/kimikazukato

Chief Scientist at Silver Egg Technology


Ph.D. in computer science, master's degree in mathematics
Experience in numerical computation and mathematical algorithms,
especially:
Geometric computation, computer graphics
Partial differential equations, parallel computation, GPGPU
Mathematical programming
Now specializing in:
Machine learning, especially recommendation systems

2 / 36
About our company
Silver Egg Technology

Established: 1998

CEO: Tom Foley

Main Service: Recommendation System, Online Advertisement

Major Clients: QVC, Senshukai (Bellemaison), Tsutaya

We provide recommendation systems to Japan's leading websites.

3 / 36
Table of Contents
Introduction
Types of recommendation
Evaluation metrics
Algorithms
Conclusion

4 / 36
Caution
This presentation includes:

State-of-the-art algorithms for recommendation systems

But does NOT include:

Any information about the core algorithm in Silver Egg Technology

5 / 36
Recommendation System
Recommender systems or recommendation systems (sometimes
replacing "system" with a synonym such as platform or engine) are a
subclass of information filtering system that seek to predict the
'rating' or 'preference' that a user would give to an item. — Wikipedia

In this talk, we focus on collaborative filtering methods, which utilize only
users' behavior, activity, and preferences.

Other methods include:

Content-based methods
Methods using demographic data
Hybrid methods

6 / 36
Rating Prediction Problem
user\movie W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?

Given rating information for some user/movie pairs,

Want to predict a rating for an unknown user/movie pair.

7 / 36
Item Prediction Problem
user\item W X Y Z
A 1 1 1 1
B 1
C 1
D 1 ? 1 ?

Given "who bought what" information (user/item pairs),

Want to predict which item is likely to be bought by a user.

8 / 36
Input/Output of the systems
Rating Prediction
Input: set of ratings for user/item pairs
Output: map from user/item pair to predicted rating

Item Prediction
Input: set of user/item pairs as shopping data, and an integer k
Output: the top k items for each user that he/she is most likely to buy

9 / 36
Evaluation Metrics for Recommendation
Systems
Rating prediction
Root Mean Squared Error (RMSE)
The square root of the mean of the squared errors

Item prediction
Precision
(# of Recommended and Purchased)/(# of Recommended)
Recall
(# of Recommended and Purchased)/(# of Purchased)

10 / 36
RMSE of Rating Prediction
Some user/item pairs are randomly chosen to be hidden.

user\movie W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?

If a rating is predicted as 3.1 but the actual value is 4, the squared error is
$|3.1 - 4|^2 = 0.9^2$.

Take the mean of the squared errors over all the hidden elements, and then
take the square root:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathrm{hidden}|} \sum_{(u,i) \in \mathrm{hidden}} (\mathrm{predicted}_{ui} - \mathrm{actual}_{ui})^2}$$
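
For instance, a minimal sketch of this computation (array names are illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over the hidden user/item pairs."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Example from above: one hidden rating predicted as 3.1, actual 4
print(rmse([3.1], [4.0]))  # 0.9
```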

11 / 36
Precision/Recall of Item Prediction

If three items are recommended:

2 of the 3 recommended items are actually bought: the precision is 2/3.
2 of the 4 bought items are recommended: the recall is 2/4.

These are denoted by prec@3 and recall@3.

Ex. recall@5 = 3/4, prec@5 = 3/5
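
A minimal sketch of prec@k and recall@k for a single user (names are illustrative):

```python
def precision_recall_at_k(recommended, bought, k):
    """prec@k and recall@k: recommended is a ranked list of item ids,
    bought is the set of items the user actually bought."""
    top_k = set(recommended[:k])
    hits = len(top_k & set(bought))
    return hits / k, hits / len(bought)

# The slide's example: 3 recommendations, 2 hits among 4 bought items
prec, recall = precision_recall_at_k(["a", "b", "c"], {"a", "c", "d", "e"}, 3)
print(prec, recall)  # 0.666..., 0.5
```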

12 / 36
ROC and AUC

# of recom.   1  2  3  4  5  6  7  8  9  10
# of whites   1  1  1  2  2  3  4  5  5   6
# of blacks   0  1  2  2  3  3  3  3  4   4

Divide the "# of whites" and "# of blacks" rows by the total numbers of
whites and blacks respectively (here 6 and 4), and plot the resulting values
in the xy-plane.
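
As an illustration, a small sketch that turns the table above into ROC points and integrates the AUC with the trapezoidal rule (assuming the counts are cumulative in the number of recommendations):

```python
import numpy as np

# Cumulative counts from the table above
whites = np.array([1, 1, 1, 2, 2, 3, 4, 5, 5, 6])  # relevant items found
blacks = np.array([0, 1, 2, 2, 3, 3, 3, 3, 4, 4])  # irrelevant items found

tpr = whites / whites[-1]   # y-axis: true positive rate
fpr = blacks / blacks[-1]   # x-axis: false positive rate

# Prepend the origin and integrate under the curve
auc = np.trapz(np.r_[0, tpr], np.r_[0, fpr])
print(auc)
```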

13 / 36
This curve is called the "ROC curve." The area under this curve is called the "AUC."

Higher AUC is better (max = 1).

The AUC is often used in academia, but for practical purposes...

14 / 36
Netflix Prize
The Netflix Prize was an open competition for the best collaborative
filtering algorithm to predict user ratings for films, based on previous
ratings without any other information about the users or films, i.e.
without the users or the films being identified except by numbers
assigned for the contest. — Wikipedia

In short, an open competition for preference prediction.

Closed in 2009.

15 / 36
Outline of Winner's Algorithm
Refer to the blog post by E. Chen:

http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/

Digest of the methods:

Neighborhood Method
Matrix Factorization
Restricted Boltzmann Machines
Regression
Regularization
Ensemble Methods

16 / 36
Notations
Number of users: $n$
Set of users: $U = \{1, 2, \ldots, n\}$
Number of items (movies): $m$
Set of items (movies): $I = \{1, 2, \ldots, m\}$
Input matrix: $A$ (an $n \times m$ matrix)

17 / 36
Matrix Factorization
Based on the assumption that each item is described by a small number of
latent factors
Each rating is expressed as a linear combination of the latent factors
Achieved good performance in the Netflix Prize

$$A \approx X^T Y$$

Find matrices $X \in \mathrm{Mat}(f, n)$ and $Y \in \mathrm{Mat}(f, m)$, where $f \ll n, m$.

18 / 36
$$p(A \mid X, Y, \sigma) = \prod_{A_{ui} \neq 0} N(A_{ui} \mid X_u^T Y_i, \sigma)$$

$$p(X \mid \sigma_X) = \prod_u N(X_u \mid 0, \sigma_X I)$$

$$p(Y \mid \sigma_Y) = \prod_i N(Y_i \mid 0, \sigma_Y I)$$

Find $X$ and $Y$ that maximize $p(X, Y \mid A, \sigma)$.

19 / 36
According to Bayes' theorem,

$$p(X, Y \mid A, \sigma) = p(A \mid X, Y, \sigma)\, p(X \mid \sigma_X)\, p(Y \mid \sigma_Y) \times \mathrm{const.}$$

Thus,

$$-\log p(X, Y \mid A, \sigma, \sigma_X, \sigma_Y) = \sum_{A_{ui} \neq 0} (A_{ui} - X_u^T Y_i)^2 + \lambda_X \|X\|_{\mathrm{Fro}}^2 + \lambda_Y \|Y\|_{\mathrm{Fro}}^2 + \mathrm{const.}$$

where $\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm; maximizing the posterior
means minimizing this regularized squared error.

How can this be computed? Use MCMC. See [Salakhutdinov et al., 2008].

Once $X$ and $Y$ are determined, set $\tilde{A} := X^T Y$; the prediction for $A_{ui}$ is
then $\tilde{A}_{ui}$.
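
The MCMC sampler of [Salakhutdinov et al., 2008] is more than a slide can hold, but as a rough illustration, here is a toy SGD sketch that minimizes the regularized squared error above (shapes follow $X \in \mathrm{Mat}(f, n)$, $Y \in \mathrm{Mat}(f, m)$; the data matrix and hyperparameters are made up):

```python
import numpy as np

def factorize(A, f, lam=0.1, alpha=0.01, epochs=200, seed=0):
    """Toy SGD for: min sum_{A_ui != 0} (A_ui - X_u^T Y_i)^2
                        + lam * (||X||_Fro^2 + ||Y||_Fro^2)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    X = 0.1 * rng.standard_normal((f, n))   # X in Mat(f, n)
    Y = 0.1 * rng.standard_normal((f, m))   # Y in Mat(f, m)
    users, items = A.nonzero()              # observed ratings only
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = A[u, i] - X[:, u] @ Y[:, i]
            X[:, u] += alpha * (err * Y[:, i] - lam * X[:, u])
            Y[:, i] += alpha * (err * X[:, u] - lam * Y[:, i])
    return X, Y

# Toy rating matrix (0 = unknown), not the exact table from the slides
A = np.array([[5., 4., 1., 4.],
              [0., 4., 0., 0.],
              [2., 0., 0., 3.],
              [1., 0., 4., 0.]])
X, Y = factorize(A, f=2)
print(X.T @ Y)   # A~ = X^T Y: predictions for every user/item pair
```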

20 / 36
Difference between Rating and Shopping
Rating:

user\movie W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?

Includes negative feedback ("1" means "boring")
Zero means "unknown"

Shopping (Browsing):

user\item W X Y Z
A 1 1 1 1
B 1
C 1
D 1 ? 1 ?

Includes no negative feedback
Zero means "unknown" or "negative"
More degrees of freedom

Consequently, an algorithm that is effective for the rating matrix is not
necessarily effective for the shopping matrix.

21 / 36
Solutions
Adding a constraint to the optimization problem
Changing the objective function itself

22 / 36
Adding a Constraint
The problem has too many degrees of freedom
A desirable property is that many elements of the product are zero
Assume that a certain ratio of the zero elements of the input matrix remain
zero after the optimization [Sindhwani et al., 2010]
Experimentally outperforms the "zero-as-negative" method

23 / 36
One-class Matrix Completion
[Sindhwani et al., 2010]

Introduced variables $p_{ui}$ to relax the problem.

Minimize

$$\sum_{A_{ui} \neq 0} (A_{ui} - X_u^T Y_i)^2 + \lambda_X \|X\|_{\mathrm{Fro}}^2 + \lambda_Y \|Y\|_{\mathrm{Fro}}^2 + \sum_{A_{ui} = 0} \left[ p_{ui}(0 - X_u^T Y_i)^2 + (1 - p_{ui})(1 - X_u^T Y_i)^2 \right] + T \sum_{A_{ui} = 0} \left[ -p_{ui} \log p_{ui} - (1 - p_{ui}) \log(1 - p_{ui}) \right]$$

subject to

$$\frac{1}{|\{A_{ui} \mid A_{ui} = 0\}|} \sum_{A_{ui} = 0} p_{ui} = r$$

24 / 36
$$\sum_{A_{ui} \neq 0} (A_{ui} - X_u^T Y_i)^2 + \lambda_X \|X\|_{\mathrm{Fro}}^2 + \lambda_Y \|Y\|_{\mathrm{Fro}}^2 + \sum_{A_{ui} = 0} \left[ p_{ui}(0 - X_u^T Y_i)^2 + (1 - p_{ui})(1 - X_u^T Y_i)^2 \right] + T \sum_{A_{ui} = 0} \left[ -p_{ui} \log p_{ui} - (1 - p_{ui}) \log(1 - p_{ui}) \right]$$

Intuitive explanation:

$p_{ui}$ means how likely the $(u, i)$-element is to be zero.
The second term is the estimation error, taking the $p_{ui}$'s into account.
The third term is the entropy of the distribution.
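
To make the three terms concrete, a minimal sketch that merely evaluates this objective for given $X$, $Y$, and $p_{ui}$ (the actual alternating minimization of [Sindhwani et al., 2010] is omitted, and all names are illustrative):

```python
import numpy as np

def one_class_objective(A, X, Y, P, lam_x, lam_y, T):
    """Value of the relaxed objective above. P holds the p_ui; only its
    entries at A == 0 are used and must lie strictly between 0 and 1."""
    R = X.T @ Y                          # current low-rank reconstruction
    obs = A != 0
    err_obs = np.sum((A[obs] - R[obs]) ** 2)
    reg = lam_x * np.sum(X ** 2) + lam_y * np.sum(Y ** 2)  # Fro norms squared
    p, r = P[~obs], R[~obs]
    err_zero = np.sum(p * (0 - r) ** 2 + (1 - p) * (1 - r) ** 2)
    entropy = np.sum(-p * np.log(p) - (1 - p) * np.log(1 - p))
    return err_obs + reg + err_zero + T * entropy
```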

25 / 36
Implicit Sparseness Constraint: SLIM (Elastic Net)
In a regression model, adding an L1 term makes the solution sparse (the
elastic net [Zou and Hastie, 2005]):

$$\min_w \left[ \frac{1}{2n} \|Xw - y\|_2^2 + \frac{\lambda(1 - \rho)}{2} \|w\|_2^2 + \lambda \rho \|w\|_1 \right]$$

A similar idea is used for matrix factorization [Ning et al., 2011]:

Minimize

$$\|A - AW\|_{\mathrm{Fro}}^2 + \frac{\lambda(1 - \rho)}{2} \|W\|_{\mathrm{Fro}}^2 + \lambda \rho \|W\|_1$$

subject to

$$\operatorname{diag} W = 0$$
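
A common way to solve this is column by column; the following minimal sketch assumes scikit-learn's ElasticNet as the inner solver (its alpha and l1_ratio play the roles of $\lambda$ and $\rho$) and enforces $\operatorname{diag} W = 0$ by zeroing out column $j$ before regressing on it:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def slim_train(A, alpha=0.1, l1_ratio=0.5):
    """Learn W one column at a time. A: float user-item matrix
    (temporarily modified in place, then restored)."""
    n, m = A.shape
    W = np.zeros((m, m))
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
    for j in range(m):
        a_j = A[:, j].copy()
        A[:, j] = 0                 # exclude item j from its own predictors
        model.fit(A, a_j)           # regress column j on all other columns
        W[:, j] = model.coef_
        A[:, j] = a_j               # restore
    return W
```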

26 / 36
Ranking prediction
Another strategy for shopping prediction
A "learn from the order" approach
Predict whether X is more likely to be bought than Y, rather than the
probability of buying X or Y.

27 / 36
Bayesian Personalized Ranking
[Rendle et al., 2009]

Consider a matrix factorization model, but update its elements according
to the observed "orders"
The parameters are the same as in usual matrix factorization, but the
objective function is different

Consider a total order $>_u$ for each $u \in U$. Suppose that $i >_u j$ ($i, j \in I$)
means "the user $u$ is more likely to buy $i$ than $j$."

The objective is to estimate $p(i >_u j)$ for pairs with $A_{ui} = 0$ and $A_{uj} = 0$
(i.e., neither $i$ nor $j$ has been bought by $u$).

28 / 36
Let

$$D_A = \{(u, i, j) \in U \times I \times I \mid A_{ui} = 1, A_{uj} = 0\},$$

and define

$$\prod_{u \in U} p(>_u \mid X, Y) := \prod_{(u,i,j) \in D_A} p(i >_u j \mid X, Y),$$

where we assume

$$p(i >_u j \mid X, Y) = \sigma(X_u^T Y_i - X_u^T Y_j), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$

According to Bayes' theorem, the function to be optimized becomes:

$$\prod p(X, Y \mid >_u) = \prod p(>_u \mid X, Y) \times p(X)\, p(Y) \times \mathrm{const.}$$

29 / 36
Taking the log of this,

$$L := \log \left[ \prod p(>_u \mid X, Y) \times p(X)\, p(Y) \right] = \log \prod_{(u,i,j) \in D_A} p(i >_u j \mid X, Y) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2 = \sum_{(u,i,j) \in D_A} \log \sigma(X_u^T Y_i - X_u^T Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

Now consider the following problem:

$$\max_{X, Y} \left[ \sum_{(u,i,j) \in D_A} \log \sigma(X_u^T Y_i - X_u^T Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2 \right]$$

This means: "find a pair of matrices $X, Y$ which preserve the order of the
elements of the input matrix for each $u$."

30 / 36
Computation
The function we want to optimize:

$$\sum_{(u,i,j) \in D_A} \log \sigma(X_u^T Y_i - X_u^T Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2$$

$U \times I \times I$ is huge, so in practice a stochastic method is necessary.

Let the parameters be $\Theta = (X, Y)$.

The algorithm is the following:

Repeat the following:

Choose $(u, i, j) \in D_A$ randomly
Update $\Theta$ with

$$\Theta \leftarrow \Theta + \alpha \frac{\partial}{\partial \Theta} \left( \log \sigma(X_u^T Y_i - X_u^T Y_j) - \lambda_X \|X\|_{\mathrm{Fro}}^2 - \lambda_Y \|Y\|_{\mathrm{Fro}}^2 \right)$$

(the "+" sign because we are maximizing the objective)
This method is called Stochastic Gradient Descent (SGD).
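
A toy end-to-end sketch of this loop (the uniform sampling of $j$ and the fixed learning rate are simplifying assumptions, not details taken from [Rendle et al., 2009]):

```python
import numpy as np

def bpr_sgd(A, f=2, alpha=0.05, lam=0.01, iters=50000, seed=0):
    """Toy stochastic gradient ascent on the BPR objective above."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    X = 0.1 * rng.standard_normal((f, n))
    Y = 0.1 * rng.standard_normal((f, m))
    users, bought = A.nonzero()             # all (u, i) with A_ui = 1
    for _ in range(iters):
        t = rng.integers(len(users))
        u, i = users[t], bought[t]
        j = rng.integers(m)                 # candidate non-bought item
        if A[u, j] != 0:                    # rejection sampling of j
            continue
        x_uij = X[:, u] @ (Y[:, i] - Y[:, j])
        g = 1.0 / (1.0 + np.exp(x_uij))     # d/dx log(sigma(x)) = 1 - sigma(x)
        X[:, u] += alpha * (g * (Y[:, i] - Y[:, j]) - lam * X[:, u])
        Y[:, i] += alpha * (g * X[:, u] - lam * Y[:, i])
        Y[:, j] += alpha * (-g * X[:, u] - lam * Y[:, j])
    return X, Y
```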

31 / 36
MyMediaLite
http://www.mymedialite.net/

Open-source implementation of recommendation algorithms

Written in C#
Reasonable computation time
Supports rating prediction and item prediction

32 / 36
Practical Aspects of the Recommendation Problem
Computational time
Memory consumption
How many services can be integrated into a single server rack?
Super-high accuracy that requires a supercomputer is useless for real business

33 / 36
Concluding Remarks: What is Important for
Good Prediction?
Theory
Machine learning
Mathematical optimization
Implementation
Algorithms
Computer architecture
Mathematics
Human factors!
Hand tuning of parameters
Domain specific knowledge

34 / 36
References (1/2)
For beginners
比戸ら, データサイエンティスト養成読本 機械学習入門編 (Data Scientist Training
Reader: Introduction to Machine Learning), 技術評論社, 2016.
T. Segaran. Programming Collective Intelligence, O'Reilly Media, 2007.
E. Chen. Winning the Netflix Prize: A Summary.
A. Gunawardana and G. Shani. A Survey of Accuracy Evaluation Metrics of
Recommendation Tasks, The Journal of Machine Learning Research,
Volume 10, 2009.

35 / 36
References (2/2)
Papers
Salakhutdinov, Ruslan, and Andriy Mnih. "Bayesian probabilistic matrix
factorization using Markov chain Monte Carlo." Proceedings of the 25th
international conference on Machine learning. ACM, 2008.
Sindhwani, Vikas, et al. "One-class matrix completion with low-density
factorizations." Data Mining (ICDM), 2010 IEEE 10th International
Conference on. IEEE, 2010.
Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit
feedback." Proceedings of the Twenty-Fifth Conference on Uncertainty in
Artificial Intelligence. AUAI Press, 2009.
Zou, Hui, and Trevor Hastie. "Regularization and variable selection via the
elastic net." Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 67.2 (2005): 301-320.
Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-n
recommender systems." Data Mining (ICDM), 2011 IEEE 11th
International Conference on. IEEE, 2011.

36 / 36
