
RMBI1020 – Data Analytics for Business

– Collaborative Filtering

Dr. Jean Wang


RMBI@IPO
HKUST
Topics
[Course topic map for RMBI1020 Data Analytics for Business: Introduction to BI; Data Visualization; Data Modeling and Storage; Big Data and BI; Analytics Technologies that Enable BI — including Excel Basics, Time Series Forecast, Regression Analysis, Market Basket Analysis, Clustering Analysis, Optimization, Classification, Association Rule Mining, and Collaborative Filtering]
RMBI1020@JeanWang, UST
Introduction

Preferences of Users towards Items
› In a typical e-commerce website, there are users and items
– Users have preferences (explicit or implicit) towards items

[Figure: users linked to items through explicit ratings and implicit ratings]
Recommendation Systems
› A recommendation system makes predictions based on users' historical activities
– What is the probability of a particular user purchasing a specific item?
– What rating will a user give to an unseen item?
– What are the top k unseen items that could be recommended to a user?

Two main approaches: Content-based Filtering and Collaborative Filtering
Outline
✓ Collaborative Filtering
  ✓ User-User
  ✓ Item-Item
  ✓ Pros & Cons

✓ Content-based Filtering
  ✓ Main idea and workflow
  ✓ Item profile and user profile

Why did the bee get married?

He's finally found his honey.
RMBI1020 – Collaborative Filtering
• Problem Specification
• User-User CF / Item-Item CF
• Adjusted Cosine Similarity
• Pros and Cons

Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item

[Figure: an example rating matrix]
Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item

› Goal of Collaborative Filtering
– Predict an unobserved preference pair (𝑥, 𝑖), i.e., the rating of user 𝑥 towards item 𝑖

[Rating matrix: 12 users (Sam, Jacob, Mary, Andrew, Emily, Olivia, Mia, James, Michael, Daniel, Sofia, Victoria) × 6 movies; Emily's rating of the first movie, marked "?", is the unobserved pair to predict]
Association Rule Mining vs Collaborative Filtering
[Figure: Association Rule Mining and Collaborative Filtering, side by side]
Collaborative Filtering considers the target user's individual preferences, so it can make more personalized recommendations.
Collaborative Filtering: User-User and Item-Item
› To predict the fondness of Captain America for Space Stone

User-User CF Item-Item CF

User-User Similarity
› For an unobserved user-item pair (𝑥, 𝑖), if we find a set of other users whose ratings are "similar" to 𝑥's, their ratings on item 𝑖 could be used to estimate 𝑥's rating on item 𝑖

        TWL  HP  LR  AVG  FRZ  FF
Mary     3    5       4    4    3
James                 4         2
Sofia         1   5   2    2    4
Sam      1        2             1

The matrix is very sparse. Q: whose taste is more similar to James's, Mary's or Sofia's?

› Consider two rating vectors 𝑥 and 𝑦 representing two users
– We need a similarity measure 𝑠𝑖𝑚(𝑥, 𝑦) to capture the similarity between users
– However, both 𝑥 and 𝑦 may have many missing values, with few or no common items; missing values are treated as 0
Similarity Measure between Users (I)

        TWL  HP  LR  AVG  FRZ  FF
Mary     3    5       4    4    3
James                 4         2
Sofia         1   5   2    2    4
Sam      1        2             1

› Option 1: EuclideanDistance(𝑥, 𝑦) = √[ Σⁿᵢ₌₁ (𝑥ᵢ − 𝑦ᵢ)² ]
– dis(James, Mary) = √(3² + 5² + 0² + 0² + 4² + 1²) = 7.14
– dis(James, Sofia) = √(0² + 1² + 5² + 2² + 2² + 2²) = 6.16

› Option 2: ManhattanDistance(𝑥, 𝑦) = Σⁿᵢ₌₁ |𝑥ᵢ − 𝑦ᵢ|
– dis(James, Mary) = 3 + 5 + 0 + 0 + 4 + 1 = 13
– dis(James, Sofia) = 0 + 1 + 5 + 2 + 2 + 2 = 12

Euclidean and Manhattan indicate that Mary is further away from James and Sofia is closer to James, which is NOT desired.

› Option 3: CosineSimilarity(𝑥, 𝑦) = [ Σⁿᵢ₌₁ 𝑥ᵢ𝑦ᵢ ] / [ √Σⁿᵢ₌₁ 𝑥ᵢ² × √Σⁿᵢ₌₁ 𝑦ᵢ² ]
– sim(James, Mary) = (0+0+0+16+0+6) / ( √(9+25+16+16+9) × √(16+4) ) = 0.57
– sim(James, Sofia) = (0+0+0+8+0+8) / ( √(16+4) × √(1+25+4+4+16) ) = 0.51

Cosine indicates that James is closer to Mary (desired), but the difference is not that large. Note that only commonly-rated items contribute to the Cosine numerator, whereas Euclidean and Manhattan are affected by every rating.
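The three measures above can be sketched in Python (a minimal sketch; as in the slide, missing ratings are filled with 0):

```python
import math

# Ratings over [TWL, HP, LR, AVG, FRZ, FF]; missing ratings treated as 0
mary  = [3, 5, 0, 4, 4, 3]
james = [0, 0, 0, 4, 0, 2]
sofia = [0, 1, 5, 2, 2, 4]

def euclidean(x, y):
    # Option 1: square root of summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Option 2: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Option 3: dot product divided by the product of vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(round(euclidean(james, mary), 2))   # 7.14
print(round(euclidean(james, sofia), 2))  # 6.16
print(round(cosine(james, mary), 2))      # 0.57
print(round(cosine(james, sofia), 2))     # 0.51
```

The printed values reproduce the slide's calculations for James vs Mary and James vs Sofia.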
Similarity Measure between Users (II)
› Option 4: Adjusted Cosine Similarity
– Normalize ratings by subtracting the row mean (centering the average rating of each row to 0)
– Fill in 0s for the unobserved values (user preference towards unseen items is neutral)

        TWL  HP  LR  AVG  FRZ  FF   avg            TWL    HP    LR    AVG   FRZ    FF
Mary     3    5       4    4    3   19/5    Mary  -0.8   1.2    0     0.2   0.2  -0.8
James                 4         2    6/2    James  0     0      0     1     0    -1
Sofia         1   5   2    2    4   14/5    Sofia  0    -1.8    2.2  -0.8  -0.8   1.2
Sam      1        2             1    4/3    Sam   -0.33  0      0.67  0     0    -0.33

– Calculate the similarity using the Cosine formula
› sim(James, Mary) = (0+0+0+0.2+0+0.8) / ( √(1+1) × √(0.64+1.44+0.04+0.04+0.64) ) = 0.42
› sim(James, Sofia) = (0+0+0−0.8+0−1.2) / ( √(1+1) × √(3.24+4.84+0.64+0.64+1.44) ) = −0.43

Compared with sim(James, Mary) = 0.57 and sim(James, Sofia) = 0.51 under plain Cosine, the adjusted values capture the intuition much better!
Adjusted Cosine Similarity
› The Adjusted Cosine Similarity better captures the intuition because
– Missing ratings are treated as neutral (i.e., the same as the average)
– Differences between "tough raters" and "easy raters" are smoothed out

AdjustedCosineSim(𝑥, 𝑦) = [ Σ_{𝑚∈𝑀} (𝑥ₘ − 𝑥̄)(𝑦ₘ − 𝑦̄) ] / [ √Σ_{𝑚∈𝑀} (𝑥ₘ − 𝑥̄)² × √Σ_{𝑚∈𝑀} (𝑦ₘ − 𝑦̄)² ]

where 𝑀 is the set of items commonly rated by user 𝑥 and user 𝑦, 𝑥̄ is the average observed rating of user 𝑥, and 𝑦̄ is the average observed rating of user 𝑦.

This is very similar to the correlation calculation in Statistics, so the CORREL() function can be used in Excel.

– Value ranges in [−1, 1]
› If 𝑠𝑖𝑚(𝑥, 𝑦) = −1, users 𝑥 and 𝑦 have exactly opposite preferences
› If 𝑠𝑖𝑚(𝑥, 𝑦) = 1, users 𝑥 and 𝑦 have exactly the same preferences
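A minimal sketch of the adjusted-cosine computation (as in the slide's worked example, unobserved ratings are excluded from the row mean and filled with 0 after centering):

```python
import math

# Ratings over [TWL, HP, LR, AVG, FRZ, FF]; None marks an unobserved rating
ratings = {
    "Mary":  [3, 5, None, 4, 4, 3],
    "James": [None, None, None, 4, None, 2],
    "Sofia": [None, 1, 5, 2, 2, 4],
}

def center(row):
    """Subtract the mean of the observed ratings; fill unobserved with 0."""
    observed = [r for r in row if r is not None]
    mean = sum(observed) / len(observed)
    return [r - mean if r is not None else 0.0 for r in row]

def adjusted_cosine(x, y):
    cx, cy = center(x), center(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    nx = math.sqrt(sum(a * a for a in cx))
    ny = math.sqrt(sum(b * b for b in cy))
    return dot / (nx * ny)

print(round(adjusted_cosine(ratings["James"], ratings["Mary"]), 2))   # 0.42
print(round(adjusted_cosine(ratings["James"], ratings["Sofia"]), 2))  # -0.43
```

The output matches the slide: James is now clearly similar to Mary and dissimilar to Sofia.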

From Similarity to Prediction
› User-User Collaborative Filtering

Step 1. Obtain the similarity scores between the target user and all other users
Step 2. Find the top 𝑘 users who are the most similar to user 𝑥 and have rated the target item 𝑖
Step 3. Predict the rating of the unobserved pair (𝑥, 𝑖) as the average of the observed ratings weighted by the user similarity scores:

rating(𝑥, 𝑖) = [ Σ_{𝑦∈KNN(𝑥)} 𝑠𝑖𝑚(𝑥, 𝑦) × rating(𝑦, 𝑖) ] / [ Σ_{𝑦∈KNN(𝑥)} 𝑠𝑖𝑚(𝑥, 𝑦) ]
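The three steps can be sketched as a small function (a sketch assuming the similarity scores are precomputed; the toy data below reproduce the Emily/Twilight prediction worked through later in these slides):

```python
def predict_rating(x, i, sim, ratings, k=2):
    """User-user CF: similarity-weighted average of the ratings on item i
    from the k users most similar to x."""
    # Step 2: among users other than x who rated item i ...
    candidates = [y for y in ratings if y != x and i in ratings[y]]
    # ... keep the k most similar to x
    knn = sorted(candidates, key=lambda y: sim[x][y], reverse=True)[:k]
    # Step 3: similarity-weighted average of their observed ratings
    num = sum(sim[x][y] * ratings[y][i] for y in knn)
    den = sum(sim[x][y] for y in knn)
    return num / den

# Toy data from the prediction example: Emily's rating of Twilight (TWL)
sim = {"Emily": {"Mary": 0.21, "Michael": 0.47}}
ratings = {"Emily": {}, "Mary": {"TWL": 3}, "Michael": {"TWL": 5}}
print(round(predict_rating("Emily", "TWL", sim, ratings, k=2), 2))  # 4.38
```

Item-item CF uses the same weighted average with the roles of users and items swapped.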

Item-Item Collaborative Filtering
› Item-to-Item Collaborative Filtering: an approach analogous to User-User CF

Step 1. Obtain the similarity scores between the target item and all other items
Step 2. Find the top 𝑘 items which are the most similar to item 𝑖 and have been rated by the target user 𝑥
Step 3. Predict the rating of the unobserved pair (𝑥, 𝑖) as the average of the observed ratings weighted by the item similarity scores:

rating(𝑥, 𝑖) = [ Σ_{𝑗∈KNN(𝑖)} 𝑠𝑖𝑚(𝑖, 𝑗) × rating(𝑥, 𝑗) ] / [ Σ_{𝑗∈KNN(𝑖)} 𝑠𝑖𝑚(𝑖, 𝑗) ]

Prediction Example
› User-user based prediction (let 𝑘 = 2)
– The 2 most similar users to Emily who have rated the movie Twilight are Michael (𝑠𝑖𝑚 = 0.47) and Mary (𝑠𝑖𝑚 = 0.21)
– The average of their ratings on Twilight weighted by their similarity scores to Emily is
› (0.21×3 + 0.47×5) / |0.21 + 0.47| = 4.38

› Item-item based prediction (let 𝑘 = 2)
– The 2 most similar movies to Twilight which have been rated by Emily are Fantastic Four (𝑠𝑖𝑚 = 0.50) and Lord of the Rings (𝑠𝑖𝑚 = 0.35)
– The average of Emily's ratings on them weighted by their similarity scores to Twilight is
› (0.35×2 + 0.50×3) / |0.35 + 0.50| = 2.59

[Rating matrix annotated with 𝑠𝑖𝑚(Emily, ·) for every other user and 𝑠𝑖𝑚(TWL, ·) for every other movie, e.g., 𝑠𝑖𝑚(Emily, Michael) = 0.47, 𝑠𝑖𝑚(Emily, Mary) = 0.21, 𝑠𝑖𝑚(TWL, FF) = 0.50, 𝑠𝑖𝑚(TWL, LR) = 0.35]

In practice, it has been observed that item-item CF often works better than user-user CF.
User-User CF vs Item-Item CF
› In practice, item-item CF is preferred over user-user CF in large commercial websites
– Items are simpler to classify, but users have multiple tastes
– Rows in the rating matrix are more sparse than the columns: a user does not have enough items in common with anybody else
– The user-user similarity matrix is much larger, so the calculation is computationally expensive (e.g., a rating matrix of dimension 1,000,000 users × 1,000 items gives a 1,000,000 × 1,000,000 user-user similarity matrix but only a 1,000 × 1,000 item-item one)
– The user-user similarity matrix needs to be updated more frequently: a user's newly indicated preferences can significantly change the similarity scores
Pros & Cons of Collaborative Filtering
✔ Works for any kind of item (no need to consider item features and user profiles)

✘ Hard to find multiple users rating the same items in a sparse rating matrix
– Difficult to recommend items to someone with a unique taste
✘ Faces the Cold Start problem
– Cannot recommend an item that has not been previously rated
– Cannot recommend an item to a totally new customer
✘ Tends to recommend popular items

Real-world recommendation systems often combine collaborative filtering with content-based methods or association rule mining models.
Accuracy Evaluation
› Compare predictions with observed ratings: hold out some of the observed ratings and use them for testing

– Root-mean-square error (RMSE)
› RMSE = √[ (1/𝑛) Σ_{(𝑥,𝑖)} (𝑟ₓᵢ − 𝑟*ₓᵢ)² ], where 𝑟ₓᵢ is the predicted rating and 𝑟*ₓᵢ is the true rating of user 𝑥 on item 𝑖

– Accuracy in top 10 recommendations
› % of the top 10 recommendations that have been rated highly by the user

[The rating matrix from earlier, with some observed ratings held out and used for testing]
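The RMSE formula can be sketched as follows (the predicted and held-out true ratings below are hypothetical, for illustration only):

```python
import math

def rmse(predicted, true):
    """Root-mean-square error between predicted and held-out true ratings."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / n)

# Hypothetical predictions vs held-out true ratings
print(round(rmse([4.38, 2.59], [5, 3]), 2))  # 0.53
```

A lower RMSE means the model's predictions sit closer to the held-out ratings.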

RMBI1020 – Content-based Filtering
• Main Idea
• Workflow
• Pros and Cons

Content-based Filtering
› Two types of content-based recommendations
– Recommend to user 𝑥 the items similar to previous items rated highly by 𝑥
– Recommend those items to user 𝑥 that are rated highly by other users similar to 𝑥

› Different from collaborative filtering
– The similarity measure is based on native attributes (selected features) of the items and users, not on user-item preference patterns
Plan of Action
› A content-based filtering method typically consists of the following steps

Step 1. Identify the descriptive features that can differentiate the items and influence users
Step 2. Build item profiles and user profiles over those features
Step 3. Recommend to a user the items that are nearest to him in terms of the similarity between the user profile and the item profiles
Content-based Filtering Example
Workflow: select features → build item profiles → build user profiles → recommend movies

[Figure: two movies with item profiles (0, 0.8, 0.5, 0.9) and (0.3, 0.1, 0.9, 0.2) over four selected features]

User profiles over the same features:
Mary  (0.15, 0.45, 0.70, 0.55)
James (0.97, 0.85, 0.16, 0.21)
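The matching step can be sketched with cosine similarity as the user-item score (a sketch: the profile vectors come from the slide, but the movie labels "movie A"/"movie B" and the choice of cosine as the matching score are assumptions):

```python
import math

def cosine(x, y):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Profile vectors from the slide; movie labels are hypothetical
item_profiles = {
    "movie A": [0.0, 0.8, 0.5, 0.9],
    "movie B": [0.3, 0.1, 0.9, 0.2],
}
user_profiles = {
    "Mary":  [0.15, 0.45, 0.70, 0.55],
    "James": [0.97, 0.85, 0.16, 0.21],
}

# Recommend to each user the item whose profile is most similar to theirs
recommendations = {
    user: max(item_profiles, key=lambda item: cosine(profile, item_profiles[item]))
    for user, profile in user_profiles.items()
}
print(recommendations)
```

Because both profiles live in the same feature space, no other users' ratings are needed to score an item for a user.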

Pros & Cons of Content-based Filtering
✔ No need to have data of other users
✔ Eases the sparsity issue and partially solves the cold-start problem
– Can make recommendations to users with unique tastes
– Can recommend new & unpopular items to users
✔ Able to provide explanations by listing the content features that caused an item to be recommended

✘ Finding the appropriate features is hard
✘ Recommendation for new users is still hard
✘ Will not recommend items outside a user's profile
✘ Unable to exploit quality judgements of other users
RMBI1020 Data Analytics for Business
– Collaborative Filtering
Case Demo #8: Joke Funniness Rating Prediction

Joke Recommendation
Data are downloaded and extracted at http://eigentaste.berkeley.edu/dataset/

› About
– Jester is a joke recommender system developed at UC Berkeley to study social information filtering
– It contains over 5 million anonymous joke ratings from 150k users

› Data (lec10_jokes.xlsx)
– Number of users: 250
– Number of jokes: 100
– Ratings: in the range [-10, 10]

› Problem
– Predict the ratings of user 250 towards jokes 1-20
Calculation of Adjusted Cosine Similarity in Excel
lec10_jokes.xlsx – "Ratings", "User-User Matrix" and "Item-Item Matrix" worksheets

User-User step 1: normalize the ratings by user mean
User-User step 2: compute a similarity matrix between users
Item-Item step 1: normalize the ratings by item mean
Item-Item step 2: compute a similarity matrix between items
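The same steps 1-2 can be sketched in Python with pandas (a sketch on a tiny hypothetical matrix rather than the actual worksheet; in the demo the same operations would run on the 250 × 100 ratings table):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the "Ratings" worksheet: users as rows, jokes as
# columns, NaN for unrated jokes (the real table is 250 users x 100 jokes)
ratings = pd.DataFrame(
    {"joke1": [3.0, np.nan, 1.0],
     "joke2": [5.0, 4.0, np.nan],
     "joke3": [np.nan, 2.0, 5.0]},
    index=["u1", "u2", "u3"],
)

# User-User step 1: centre each row on the user's mean observed rating,
# then fill unobserved entries with 0 (neutral)
centered = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)

# User-User step 2: user-user similarity matrix = cosine similarity
# between the centered rows
norms = np.sqrt((centered ** 2).sum(axis=1))
user_sim = (centered @ centered.T).div(norms, axis=0).div(norms, axis=1)

# Item-Item steps 1-2 are identical, starting from ratings.T
print(user_sim.round(2))
```

This mirrors the Excel worksheets: the centered table corresponds to the normalized ratings, and `user_sim` to the "User-User Matrix" worksheet.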

User-User Collaborative Filtering in Excel
lec10_jokes.xlsx – "User-User CF" worksheet

User-User step 3: find the top 5 similarities between user 250 and other users who have rated jokes 1-20
User-User step 4: find the user IDs matching the top 5 similarities found in step 3
User-User step 5: find the ratings on jokes 1-20 from the top 5 similar user IDs found in step 4
User-User step 6: calculate the average ratings weighted by the similarity scores of the top 5 users
Item-Item Collaborative Filtering in Excel
lec10_jokes.xlsx – "Item-Item CF" worksheet

Item-Item step 3: for each of jokes 1-20, find the top 5 similarities with other jokes that have been rated by user 250
Item-Item step 4: find the joke IDs matching the similarities found in step 3
Item-Item step 5: find the ratings from user 250 on the top 5 similar jokes found in step 4
Item-Item step 6: calculate the average ratings weighted by the similarity scores of the top 5 similar jokes
Evaluation of the Model in Excel
lec10_jokes.xlsx – "User-User CF" and "Item-Item CF" worksheets

User-User/Item-Item step 7: evaluate the prediction results by comparing their signs with the signs of the actual ratings of user 250, reporting accuracy in %
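Step 7 can be sketched as follows (the predicted and actual ratings below are hypothetical, for illustration only):

```python
# A prediction counts as correct when its sign matches the sign of the
# user's actual rating (Jester ratings range over [-10, 10])
predicted = [2.1, -0.4, 3.3, -1.8]   # hypothetical predictions
actual    = [4.0, 1.5, 2.0, -3.0]    # hypothetical actual ratings of user 250
hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
accuracy = hits / len(actual) * 100
print(f"{accuracy:.0f}%")  # 75%
```

Sign agreement is a coarse but convenient proxy: it only checks whether the model predicts that the user will like or dislike each joke.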

Lecture Summary
✓ Problem that recommendation systems try to address

✓ Collaborative Filtering
  ✓ User-User CF and Item-Item CF
  ✓ Pros and cons

✓ Content-based Filtering
  ✓ Pros and cons
Readings
› [1] Mining Massive Datasets – Chapter 9 Recommendation Systems
– http://mmds.org/#book
› [2] Collaborative Filtering
– http://recommender-systems.org/collaborative-filtering/
› [3] Recommender Systems
– http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html
› [4] Explanation of Collaborative Filtering vs Content-Based Filtering
– https://codeburst.io/explanation-of-recommender-systems-in-information-retrieval-13077e1d916c
