RMBI1020 - Data Analytics For Business - Collaborative Filtering
[Course topic map: Regression, Market Basket Analysis, Clustering, Optimization, Classification, Association Rule Mining, Collaborative Filtering (this lecture)]
RMBI1020@JeanWang, UST
Introduction
Preferences of Users towards Items
› In a typical e-commerce website, there are users and items
– Users have preferences (explicit or implicit) towards items
[Diagram: users linked to items through explicit and implicit ratings]
Recommendation Systems
› A recommendation system makes predictions based on users’ historical activities
– What is the probability of a particular user purchasing a specific item?
– What rating will a user give to an unseen item?
– What are the top k unseen items that could be recommended to a user?
Outline
✓Collaborative Filtering
✓ User-User
✓ Item-Item
✓ Pros & Cons
✓Content-based Filtering
✓ Main idea and workflow
✓ Item profile and user profile
RMBI1020 – Collaborative Filtering
• Problem Specification
• User-User CF / Item-Item CF
• Adjusted Cosine Similarity
• Pros and Cons
Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item
[Example: a binary purchase matrix and a matrix of numeric ratings]
Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item
[Example: movie rating matrix with rows for users Sam, Jacob, Mary, … and columns for movies]
Collaborative Filtering: User-User and Item-Item
› To predict the fondness of Captain America for Space Stone
User-User CF Item-Item CF
User-User Similarity
› For an unobserved user-item pair (𝑥, 𝑖), if we find a set of other users whose
ratings are “similar” to 𝑥, their ratings on item 𝑖 could be used to estimate 𝑥’s
rating on item 𝑖
Adjusted Cosine Similarity
› The Adjusted Cosine Similarity better captures the intuition because
– Missing ratings are treated as neutral (i.e., the same as the average)
– Differences between the “tough raters” and “easy raters” are smoothed out

$$\mathit{AdjustedCosineSim}(x, y) = \frac{\sum_{m \in M}(x_m - \bar{x})(y_m - \bar{y})}{\sqrt{\sum_{m \in M}(x_m - \bar{x})^2} \times \sqrt{\sum_{m \in M}(y_m - \bar{y})^2}}$$

It is very similar to the correlation calculation in statistics, so the CORREL() function can be used in Excel.
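The formula above can be sketched in a few lines of Python. This is an illustrative implementation only (the function name is invented): here each user is a dict mapping items to ratings, each user is centered by the mean of all of their own observed ratings, and the sums run over the items both users have rated; conventions vary on exactly which mean and which item set to use.

```python
import math

def adjusted_cosine_sim(x, y):
    """Adjusted cosine similarity between two users given as
    item -> rating dicts. Centering each rating by that user's own
    mean smooths out the tough-rater vs. easy-rater effect."""
    x_mean = sum(x.values()) / len(x)          # user x's average rating
    y_mean = sum(y.values()) / len(y)          # user y's average rating
    common = set(x) & set(y)                   # co-rated items M
    num = sum((x[m] - x_mean) * (y[m] - y_mean) for m in common)
    den_x = math.sqrt(sum((x[m] - x_mean) ** 2 for m in common))
    den_y = math.sqrt(sum((y[m] - y_mean) ** 2 for m in common))
    if den_x == 0 or den_y == 0:
        return 0.0                             # no variation on co-rated items
    return num / (den_x * den_y)

# Two users with exactly opposite tastes score -1.
print(adjusted_cosine_sim({"A": 5, "B": 1, "C": 3},
                          {"A": 1, "B": 5, "C": 3}))  # -1.0
```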
From Similarity to Prediction
› User-to-User Collaborative Filtering
– Obtain the similarity scores between the target user and other users
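The prediction step that follows the similarity computation can be written as a weighted average. The form below matches the worked example later in this deck, which divides by the absolute value of the sum of the neighbors' similarities (a common variant divides by the sum of absolute similarities instead); here $N$ is the set of the $k$ users most similar to $x$ who have rated item $i$:

```latex
\hat{r}_{xi} = \frac{\sum_{y \in N} \mathrm{sim}(x, y)\, r_{yi}}{\left| \sum_{y \in N} \mathrm{sim}(x, y) \right|}
```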
Item-Item Collaborative Filtering
› Item-to-item Collaborative Filtering: an analogous approach to User-User CF
– Obtain the similarity scores between the target item and other items
Prediction Example
[Rating matrix over movies TWL, HP, LR, AVG, FRZ, FF for users Sam, Jacob, Mary, Andrew, Emily, Olivia, Mia, James, Michael, Daniel, Sofia, Victoria, with user similarity scores sim(Emily, …) and movie similarity scores sim(TWL, …)]
› User-user based prediction (let k = 2)
– The 2 most similar users to Emily who have rated the movie Twilight are Michael (sim 0.47) and Mary (sim 0.21)
– The average of their ratings on Twilight, weighted by their similarity scores to Emily, is
› (0.21*3 + 0.47*5)/|0.21 + 0.47| = 4.38
› Item-item based prediction (let k = 2)
– The 2 most similar movies to Twilight which have been rated by Emily are Fantastic Four (sim 0.50) and Lord of the Rings (sim 0.35)
– The average of Emily’s ratings on them, weighted by their similarity scores to Twilight, is
› (0.35*2 + 0.50*3)/|0.35 + 0.50| = 2.59
In practice, it has been observed that item-item CF often works better than user-user CF.
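The two weighted averages in this example can be reproduced with a few lines of Python; the similarity/rating pairs are copied from the example, and the function name is illustrative only.

```python
def weighted_knn_prediction(neighbors):
    """Similarity-weighted average of the k nearest neighbors' ratings,
    dividing by the absolute value of the summed similarities, as the
    worked example does. `neighbors` is a list of (sim, rating) pairs."""
    num = sum(sim * rating for sim, rating in neighbors)
    return num / abs(sum(sim for sim, _ in neighbors))

# User-user: Mary (sim 0.21, Twilight rating 3), Michael (sim 0.47, rating 5)
print(round(weighted_knn_prediction([(0.21, 3), (0.47, 5)]), 2))  # 4.38
# Item-item: Lord of the Rings (sim 0.35, Emily's rating 2),
#            Fantastic Four (sim 0.50, Emily's rating 3)
print(round(weighted_knn_prediction([(0.35, 2), (0.50, 3)]), 2))  # 2.59
```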
User-User CF vs Item-Item CF
› In practice, item-item CF is often preferred over user-user CF
– A user’s newly indicated preference can significantly change the similarity scores in User-User CF
– The rating matrix typically has far more users than items (e.g., dimension 1,000,000 × 1,000), so the item-item similarity matrix is much smaller and more stable
Pros & Cons of Collaborative Filtering
✔ Works for any kind of item (no need to consider item features and user profiles)
✘ Hard to find multiple users who have rated the same items in a sparse rating matrix
– Difficult to recommend items to someone with a unique taste
✘ Faces the Cold Start problem
– Cannot recommend an item that has not been previously rated
– Cannot recommend an item to a completely new customer
✘ Tends to recommend popular items
Accuracy Evaluation
› Compare predictions with observed ratings (a held-out portion of the rating matrix is used for testing)
– Root-mean-square error (RMSE)
› $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{x,i} \left(r^{*}_{xi} - r_{xi}\right)^2}$, where $r^{*}_{xi}$ is the predicted rating and $r_{xi}$ is the true rating of user $x$ on item $i$
– Accuracy in top 10 recommendations
› % of those in the top 10 recommendations that have been rated highly by the user
[Rating matrix with the cells used for testing highlighted]
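The RMSE metric is a one-liner in Python; the sketch below uses toy numbers, not the deck's data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings
    over the n held-out test cells."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Three held-out ratings: errors of 1.0, 0.5 and 0.0
print(rmse([4.0, 2.5, 3.0], [5.0, 2.0, 3.0]))
```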
RMBI1020 – Content-based Filtering
• Main Idea
• Workflow
• Pros and Cons
Content-based Filtering
› Two types of content-based recommendations
– Recommend to user 𝑥 the items similar to previous items rated highly by 𝑥
– Recommend those items to user 𝑥 that are rated highly by other users similar to 𝑥
Plan of Action
› A content-based filtering method typically consists of the following steps
– Identify the descriptive features that can differentiate the items and influence users
– Build item profiles
– Build user profiles
– Recommend to a user the items that are nearest to him in terms of the similarity between the user profile and the item profiles
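The steps above can be sketched end-to-end in Python. The feature space, item profiles, movie names, and ratings below are all invented for illustration; the user profile is built here as a rating-weighted average of the rated items' profiles, which is one common choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Item profiles over two hypothetical features, e.g. (action, romance).
item_profiles = {
    "MovieA": (1.0, 0.0),
    "MovieB": (0.8, 0.2),
    "MovieC": (0.0, 1.0),
    "MovieD": (0.1, 0.9),
}

# User profile: rating-weighted average of the profiles of rated items.
user_ratings = {"MovieA": 5, "MovieC": 1}
total = sum(user_ratings.values())
user_profile = tuple(
    sum(r * item_profiles[i][d] for i, r in user_ratings.items()) / total
    for d in range(2)
)

# Recommend the unseen item whose profile is nearest to the user profile.
unseen = [i for i in item_profiles if i not in user_ratings]
best = max(unseen, key=lambda i: cosine(user_profile, item_profiles[i]))
print(best)  # MovieB (the user leans heavily towards action)
```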
Content-based Filtering Example
Build item profiles
Pros & Cons of Content-Based Filtering
✔ No need to have data on other users
✔ Eases the sparsity issue and partially solves the cold-start problem
– Can make recommendations to users with unique tastes
– Can recommend new & unpopular items to users
RMBI1020 Data Analytics for Business – Collaborative Filtering
Case Demo #8: Joke Funniness Rating Prediction
Joke Recommendation
› About
– Jester is a joke recommender system developed at UC Berkeley to study social information filtering
– It contains over 5 million anonymous joke ratings from 150k users
– Data are downloaded and extracted from http://eigentaste.berkeley.edu/dataset/
› Data (lec10_jokes.xlsx)
– Number of users: 250
– Number of jokes: 100
– Ratings: ranging from -10 to 10
› Problem
– Predict the ratings of user 250 towards jokes 1-20
Calculation of Adjusted Cosine Similarity in Excel
lec10_jokes.xlsx – “Ratings”, “User-User Matrix” and “Item-Item Matrix” Worksheets
User-User step 1: normalize the ratings by user mean
User-User step 2: compute a similarity matrix between users
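Steps 1-2 can be sketched in plain Python on a toy matrix (the worksheet operates on the full 250×100 Jester matrix; `None` marks a missing rating, which becomes 0, i.e. neutral, after centering — mirroring how the centered worksheet treats blanks).

```python
import math

# Toy rating matrix: rows are users, columns are jokes; None = not rated.
ratings = [
    [4.0, None, 2.0, 5.0],   # user 0
    [3.0, 1.0, None, 4.0],   # user 1
    [1.0, 5.0, 4.0, None],   # user 2
]

# Step 1: normalize each user's row by subtracting that user's mean rating;
# missing entries become 0 (neutral) after centering.
def normalize(row):
    observed = [r for r in row if r is not None]
    mean = sum(observed) / len(observed)
    return [(r - mean) if r is not None else 0.0 for r in row]

centered = [normalize(row) for row in ratings]

# Step 2: similarity matrix between users = cosine of the centered rows
# (the adjusted cosine / CORREL-style computation of the worksheet).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

sim = [[cosine(u, v) for v in centered] for u in centered]
```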
User-User Collaborative Filtering in Excel
lec10_jokes.xlsx –”User-User CF” Worksheet
User-User step 3: find the top 5 similarities between user 250 and other users who have rated jokes 1-20
User-User step 4: find the user IDs matching the top 5 similarities found in step 3
Item-Item Collaborative Filtering in Excel
lec10_jokes.xlsx – ”Item-Item CF” Worksheet
Item-Item step 3: find the top 5 similarities between jokes 1-20 and other jokes that have been rated by user 250
Item-Item step 4: find the joke IDs matching the similarities found in step 3
Evaluation of the Model in Excel
lec10_jokes.xlsx – “User-User CF” and “Item-Item CF” Worksheets
Accuracy in %
Lecture Summary
✓Problem that recommendation systems try to address
✓Collaborative Filtering
✓ User-User CF and Item-Item CF
✓ Pros and cons
✓Content-based Filtering
✓ Pros and cons
Readings
› [1] Mining Massive Datasets – Chapter 9 Recommendation Systems
– http://mmds.org/#book