
RMBI1020 – Data Analytics for Business

– Collaborative Filtering

Dr. Jean Wang


RMBI@IPO
HKUST
Topics
[Course topic map for RMBI1020 Data Analytics for Business: Introduction to BI; Data Visualization; Data Modeling and Storage; Big Data and BI; Analytics Technologies that Enable BI — including Excel Basics, Time Series Forecast, Regression Analysis, Market Basket Analysis, Clustering Analysis, Optimization, Classification, Association Rule Mining, and Collaborative Filtering]
RMBI1020@JeanWang, UST
Introduction

Preferences of Users towards Items
› In a typical e-commerce website, there are users and items
– Users have preferences (explicit or implicit) towards items

[Figure: users linked to items through explicit ratings and implicit ratings]
Recommendation Systems
› A recommendation system makes predictions based on users' historical activities
– What is the probability of a particular user purchasing a specific item?
– What rating will a user give to an unseen item?
– What are the top k unseen items that could be recommended to a user?

Two main approaches: Content-based Filtering and Collaborative Filtering
Outline
✓ Collaborative Filtering
  ✓ User-User
  ✓ Item-Item
  ✓ Pros & Cons

✓ Content-based Filtering
  ✓ Main idea and workflow
  ✓ Item profile and user profile

Why did the bee get married?

He's finally found his honey.
RMBI1020 – Collaborative Filtering
• Problem Specification
• User-User CF / Item-Item CF
• Adjusted Cosine Similarity
• Pros and Cons

Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item

[Figure: an example rating matrix]
Problem Specification
› Given
– A set of users and a set of items
– A set of observed user-item preferences in a Rating Matrix (a.k.a. Preference Matrix)
– Each row is for one user, and each column is for one item

› Goal of Collaborative Filtering
– Predict an unobserved preference pair (𝑥, 𝑖), i.e., the rating of user 𝑥 towards item 𝑖

[Rating matrix: 12 users (Sam, Jacob, Mary, Andrew, Emily, Olivia, Mia, James, Michael, Daniel, Sofia, Victoria) × 6 movies; Emily's rating of the first movie, marked "?", is the unobserved pair to predict]
Association Rule Mining vs Collaborative Filtering
[Figure: Association Rule Mining and Collaborative Filtering, side by side]
Collaborative Filtering considers the target user's individual preferences, so it can make more personalized recommendations.
Collaborative Filtering: User-User and Item-Item
› To predict the fondness of Captain America for Space Stone

User-User CF Item-Item CF

User-User Similarity
› For an unobserved user-item pair (𝑥, 𝑖), if we find a set of other users whose ratings are "similar" to 𝑥's, their ratings on item 𝑖 could be used to estimate 𝑥's rating on item 𝑖

        TWL  HP  LR  AVG  FRZ  FF
Mary     3    5       4    4    3
James                 4         2
Sofia         1   5   2    2    4
Sam      1        2             1

The matrix is very sparse. Q: whose taste is more similar to James's, Mary's or Sofia's?

› Consider two rating vectors 𝑥 and 𝑦 representing two users
– We need a similarity measure 𝑠𝑖𝑚(𝑥, 𝑦) to capture the similarity between users
– However, both 𝑥 and 𝑦 may have many missing values, with few or no common items; missing values are treated as 0
Similarity Measure between Users (I)

        TWL  HP  LR  AVG  FRZ  FF
Mary     3    5       4    4    3
James                 4         2
Sofia         1   5   2    2    4
Sam      1        2             1

› Option 1: EuclideanDistance(𝑥, 𝑦) = √[ Σⁿᵢ₌₁ (𝑥ᵢ − 𝑦ᵢ)² ]
– dis(James, Mary) = √(3² + 5² + 0² + 0² + 4² + 1²) = 7.14
– dis(James, Sofia) = √(0² + 1² + 5² + 2² + 2² + 2²) = 6.16

› Option 2: ManhattanDistance(𝑥, 𝑦) = Σⁿᵢ₌₁ |𝑥ᵢ − 𝑦ᵢ|
– dis(James, Mary) = 3 + 5 + 0 + 0 + 4 + 1 = 13
– dis(James, Sofia) = 0 + 1 + 5 + 2 + 2 + 2 = 12

Euclidean and Manhattan indicate that Mary is further away from James and Sofia is closer to James, which is NOT desired.

› Option 3: CosineSimilarity(𝑥, 𝑦) = [ Σⁿᵢ₌₁ 𝑥ᵢ𝑦ᵢ ] / [ √Σⁿᵢ₌₁ 𝑥ᵢ² × √Σⁿᵢ₌₁ 𝑦ᵢ² ]
– sim(James, Mary) = (0+0+0+16+0+6) / ( √(9+25+16+16+9) × √(16+4) ) = 0.57
– sim(James, Sofia) = (0+0+0+8+0+8) / ( √(16+4) × √(1+25+4+4+16) ) = 0.51

Cosine indicates that James is closer to Mary (desired), but the difference is not that large. Note that only commonly-rated items contribute to the Cosine numerator, whereas Euclidean and Manhattan are affected by every rating.
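The three measures above can be sketched in Python (a minimal sketch; as in the slide, missing ratings are filled with 0):

```python
import math

# Ratings over [TWL, HP, LR, AVG, FRZ, FF]; missing ratings treated as 0
mary  = [3, 5, 0, 4, 4, 3]
james = [0, 0, 0, 4, 0, 2]
sofia = [0, 1, 5, 2, 2, 4]

def euclidean(x, y):
    # Option 1: square root of summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Option 2: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Option 3: dot product divided by the product of vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(round(euclidean(james, mary), 2))   # 7.14
print(round(euclidean(james, sofia), 2))  # 6.16
print(round(cosine(james, mary), 2))      # 0.57
print(round(cosine(james, sofia), 2))     # 0.51
```

The printed values reproduce the slide's calculations for James vs Mary and James vs Sofia.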
Similarity Measure between Users (II)
› Option 4: Adjusted Cosine Similarity
– Normalize ratings by subtracting the row mean (centering the average rating of each row to 0)
– Fill in 0s for the unobserved values (user preference towards unseen items is neutral)

        TWL  HP  LR  AVG  FRZ  FF   avg            TWL    HP    LR    AVG   FRZ    FF
Mary     3    5       4    4    3   19/5    Mary  -0.8   1.2    0     0.2   0.2  -0.8
James                 4         2    6/2    James  0     0      0     1     0    -1
Sofia         1   5   2    2    4   14/5    Sofia  0    -1.8    2.2  -0.8  -0.8   1.2
Sam      1        2             1    4/3    Sam   -0.33  0      0.67  0     0    -0.33

– Calculate the similarity using the Cosine formula
› sim(James, Mary) = (0+0+0+0.2+0+0.8) / ( √(1+1) × √(0.64+1.44+0.04+0.04+0.64) ) = 0.42
› sim(James, Sofia) = (0+0+0−0.8+0−1.2) / ( √(1+1) × √(3.24+4.84+0.64+0.64+1.44) ) = −0.43

Compared with sim(James, Mary) = 0.57 and sim(James, Sofia) = 0.51 under plain Cosine, the adjusted values capture the intuition much better!
Adjusted Cosine Similarity
› The Adjusted Cosine Similarity better captures the intuition because
– Missing ratings are treated as neutral (i.e., the same as the average)
– Differences between "tough raters" and "easy raters" are smoothed out

AdjustedCosineSim(𝑥, 𝑦) = [ Σ_{𝑚∈𝑀} (𝑥ₘ − 𝑥̄)(𝑦ₘ − 𝑦̄) ] / [ √Σ_{𝑚∈𝑀} (𝑥ₘ − 𝑥̄)² × √Σ_{𝑚∈𝑀} (𝑦ₘ − 𝑦̄)² ]

where 𝑀 is the set of items commonly rated by user 𝑥 and user 𝑦, 𝑥̄ is the average observed rating of user 𝑥, and 𝑦̄ is the average observed rating of user 𝑦.

This is very similar to the correlation calculation in Statistics, so the CORREL() function can be used in Excel.

– Value ranges in [−1, 1]
› If 𝑠𝑖𝑚(𝑥, 𝑦) = −1, users 𝑥 and 𝑦 have exactly opposite preferences
› If 𝑠𝑖𝑚(𝑥, 𝑦) = 1, users 𝑥 and 𝑦 have exactly the same preferences
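A minimal sketch of the adjusted-cosine computation (as in the slide's worked example, unobserved ratings are excluded from the row mean and filled with 0 after centering):

```python
import math

# Ratings over [TWL, HP, LR, AVG, FRZ, FF]; None marks an unobserved rating
ratings = {
    "Mary":  [3, 5, None, 4, 4, 3],
    "James": [None, None, None, 4, None, 2],
    "Sofia": [None, 1, 5, 2, 2, 4],
}

def center(row):
    """Subtract the mean of the observed ratings; fill unobserved with 0."""
    observed = [r for r in row if r is not None]
    mean = sum(observed) / len(observed)
    return [r - mean if r is not None else 0.0 for r in row]

def adjusted_cosine(x, y):
    cx, cy = center(x), center(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    nx = math.sqrt(sum(a * a for a in cx))
    ny = math.sqrt(sum(b * b for b in cy))
    return dot / (nx * ny)

print(round(adjusted_cosine(ratings["James"], ratings["Mary"]), 2))   # 0.42
print(round(adjusted_cosine(ratings["James"], ratings["Sofia"]), 2))  # -0.43
```

The output matches the slide: James is now clearly similar to Mary and dissimilar to Sofia.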

From Similarity to Prediction
› User-User Collaborative Filtering

Step 1. Obtain the similarity scores between the target user and all other users
Step 2. Find the top 𝑘 users who are the most similar to user 𝑥 and have rated the target item 𝑖
Step 3. Predict the rating of the unobserved pair (𝑥, 𝑖) as the average of the observed ratings weighted by the user similarity scores:

rating(𝑥, 𝑖) = [ Σ_{𝑦∈KNN(𝑥)} 𝑠𝑖𝑚(𝑥, 𝑦) × rating(𝑦, 𝑖) ] / [ Σ_{𝑦∈KNN(𝑥)} 𝑠𝑖𝑚(𝑥, 𝑦) ]
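The three steps can be sketched as a small function (a sketch assuming the similarity scores are precomputed; the toy data below reproduce the Emily/Twilight prediction worked through later in these slides):

```python
def predict_rating(x, i, sim, ratings, k=2):
    """User-user CF: similarity-weighted average of the ratings on item i
    from the k users most similar to x."""
    # Step 2: among users other than x who rated item i ...
    candidates = [y for y in ratings if y != x and i in ratings[y]]
    # ... keep the k most similar to x
    knn = sorted(candidates, key=lambda y: sim[x][y], reverse=True)[:k]
    # Step 3: similarity-weighted average of their observed ratings
    num = sum(sim[x][y] * ratings[y][i] for y in knn)
    den = sum(sim[x][y] for y in knn)
    return num / den

# Toy data from the prediction example: Emily's rating of Twilight (TWL)
sim = {"Emily": {"Mary": 0.21, "Michael": 0.47}}
ratings = {"Emily": {}, "Mary": {"TWL": 3}, "Michael": {"TWL": 5}}
print(round(predict_rating("Emily", "TWL", sim, ratings, k=2), 2))  # 4.38
```

Item-item CF uses the same weighted average with the roles of users and items swapped.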

Item-Item Collaborative Filtering
› Item-to-Item Collaborative Filtering: an approach analogous to User-User CF

Step 1. Obtain the similarity scores between the target item and all other items
Step 2. Find the top 𝑘 items which are the most similar to item 𝑖 and have been rated by the target user 𝑥
Step 3. Predict the rating of the unobserved pair (𝑥, 𝑖) as the average of the observed ratings weighted by the item similarity scores:

rating(𝑥, 𝑖) = [ Σ_{𝑗∈KNN(𝑖)} 𝑠𝑖𝑚(𝑖, 𝑗) × rating(𝑥, 𝑗) ] / [ Σ_{𝑗∈KNN(𝑖)} 𝑠𝑖𝑚(𝑖, 𝑗) ]

Prediction Example
› User-user based prediction (let 𝑘 = 2)
– The 2 most similar users to Emily who have rated the movie Twilight are Michael (𝑠𝑖𝑚 = 0.47) and Mary (𝑠𝑖𝑚 = 0.21)
– The average of their ratings on Twilight weighted by their similarity scores to Emily is
› (0.21×3 + 0.47×5) / |0.21 + 0.47| = 4.38

› Item-item based prediction (let 𝑘 = 2)
– The 2 most similar movies to Twilight which have been rated by Emily are Fantastic Four (𝑠𝑖𝑚 = 0.50) and Lord of the Rings (𝑠𝑖𝑚 = 0.35)
– The average of Emily's ratings on them weighted by their similarity scores to Twilight is
› (0.35×2 + 0.50×3) / |0.35 + 0.50| = 2.59

[Rating matrix annotated with 𝑠𝑖𝑚(Emily, ·) for every other user and 𝑠𝑖𝑚(TWL, ·) for every other movie, e.g., 𝑠𝑖𝑚(Emily, Michael) = 0.47, 𝑠𝑖𝑚(Emily, Mary) = 0.21, 𝑠𝑖𝑚(TWL, FF) = 0.50, 𝑠𝑖𝑚(TWL, LR) = 0.35]

In practice, it has been observed that item-item CF often works better than user-user CF.
User-User CF vs Item-Item CF
› In practice, item-item CF is preferred over user-user CF in large commercial websites
– Items are simpler to classify, but users have multiple tastes
– Rows in the rating matrix are more sparse than the columns: a user does not have enough items in common with anybody else
– The user-user similarity matrix is much larger, so the calculation is computationally expensive (e.g., a rating matrix of dimension 1,000,000 users × 1,000 items gives a 1,000,000 × 1,000,000 user-user similarity matrix but only a 1,000 × 1,000 item-item one)
– The user-user similarity matrix needs to be updated more frequently: a user's newly indicated preferences can significantly change the similarity scores
Pros & Cons of Collaborative Filtering
✔ Works for any kind of item (no need to consider item features and user profiles)

✘ Hard to find multiple users rating the same items in a sparse rating matrix
– Difficult to recommend items to someone with a unique taste
✘ Faces the Cold Start problem
– Cannot recommend an item that has not been previously rated
– Cannot recommend an item to a totally new customer
✘ Tends to recommend popular items

Real-world recommendation systems often combine collaborative filtering with content-based methods or association rule mining models.
Accuracy Evaluation
› Compare predictions with observed ratings: hold out some of the observed ratings and use them for testing

– Root-mean-square error (RMSE)
› RMSE = √[ (1/𝑛) Σ_{(𝑥,𝑖)} (𝑟ₓᵢ − 𝑟*ₓᵢ)² ], where 𝑟ₓᵢ is the predicted rating and 𝑟*ₓᵢ is the true rating of user 𝑥 on item 𝑖

– Accuracy in top 10 recommendations
› % of the top 10 recommendations that have been rated highly by the user

[The rating matrix from earlier, with some observed ratings held out and used for testing]
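The RMSE formula can be sketched as follows (the predicted and held-out true ratings below are hypothetical, for illustration only):

```python
import math

def rmse(predicted, true):
    """Root-mean-square error between predicted and held-out true ratings."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / n)

# Hypothetical predictions vs held-out true ratings
print(round(rmse([4.38, 2.59], [5, 3]), 2))  # 0.53
```

A lower RMSE means the model's predictions sit closer to the held-out ratings.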

RMBI1020 – Content-based Filtering
• Main Idea
• Workflow
• Pros and Cons

Content-based Filtering
› Two types of content-based recommendations
– Recommend to user 𝑥 the items similar to previous items rated highly by 𝑥
– Recommend those items to user 𝑥 that are rated highly by other users similar to 𝑥

› Different from collaborative filtering
– The similarity measure is based on native attributes (selected features) of the items and users, not on user-item preference patterns
Plan of Action
› A content-based filtering method typically consists of the following steps

Step 1. Identify the descriptive features that can differentiate the items and influence users
Step 2. Build item profiles and user profiles over those features
Step 3. Recommend to a user the items that are nearest to him in terms of the similarity between the user profile and the item profiles
Content-based Filtering Example
Workflow: select features → build item profiles → build user profiles → recommend movies

[Figure: two movies with item profiles (0, 0.8, 0.5, 0.9) and (0.3, 0.1, 0.9, 0.2) over four selected features]

User profiles over the same features:
Mary  (0.15, 0.45, 0.70, 0.55)
James (0.97, 0.85, 0.16, 0.21)
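The matching step can be sketched with cosine similarity as the user-item score (a sketch: the profile vectors come from the slide, but the movie labels "movie A"/"movie B" and the choice of cosine as the matching score are assumptions):

```python
import math

def cosine(x, y):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Profile vectors from the slide; movie labels are hypothetical
item_profiles = {
    "movie A": [0.0, 0.8, 0.5, 0.9],
    "movie B": [0.3, 0.1, 0.9, 0.2],
}
user_profiles = {
    "Mary":  [0.15, 0.45, 0.70, 0.55],
    "James": [0.97, 0.85, 0.16, 0.21],
}

# Recommend to each user the item whose profile is most similar to theirs
recommendations = {
    user: max(item_profiles, key=lambda item: cosine(profile, item_profiles[item]))
    for user, profile in user_profiles.items()
}
print(recommendations)
```

Because both profiles live in the same feature space, no other users' ratings are needed to score an item for a user.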

Pros & Cons of Content-based Filtering
✔ No need to have data of other users
✔ Eases the sparsity issue and partially solves the cold-start problem
– Can make recommendations to users with unique tastes
– Can recommend new & unpopular items to users
✔ Able to provide explanations by listing the content features that caused an item to be recommended

✘ Finding the appropriate features is hard
✘ Recommendation for new users is still hard
✘ Will not recommend items outside a user's profile
✘ Unable to exploit quality judgements of other users
RMBI1020 Data Analytics for Business
– Collaborative Filtering
Case Demo #8: Joke Funniness Rating Prediction

Joke Recommendation
Data are downloaded and extracted at http://eigentaste.berkeley.edu/dataset/

› About
– Jester is a joke recommender system developed at UC Berkeley to study social information filtering
– It contains over 5 million anonymous joke ratings from 150k users

› Data (lec10_jokes.xlsx)
– Number of users: 250
– Number of jokes: 100
– Ratings: in the range [-10, 10]

› Problem
– Predict the ratings of user 250 towards jokes 1-20
Calculation of Adjusted Cosine Similarity in Excel
lec10_jokes.xlsx – "Ratings", "User-User Matrix" and "Item-Item Matrix" worksheets

User-User step 1: normalize the ratings by user mean
User-User step 2: compute a similarity matrix between users
Item-Item step 1: normalize the ratings by item mean
Item-Item step 2: compute a similarity matrix between items
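The same steps 1-2 can be sketched in Python with pandas (a sketch on a tiny hypothetical matrix rather than the actual worksheet; in the demo the same operations would run on the 250 × 100 ratings table):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the "Ratings" worksheet: users as rows, jokes as
# columns, NaN for unrated jokes (the real table is 250 users x 100 jokes)
ratings = pd.DataFrame(
    {"joke1": [3.0, np.nan, 1.0],
     "joke2": [5.0, 4.0, np.nan],
     "joke3": [np.nan, 2.0, 5.0]},
    index=["u1", "u2", "u3"],
)

# User-User step 1: centre each row on the user's mean observed rating,
# then fill unobserved entries with 0 (neutral)
centered = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)

# User-User step 2: user-user similarity matrix = cosine similarity
# between the centered rows
norms = np.sqrt((centered ** 2).sum(axis=1))
user_sim = (centered @ centered.T).div(norms, axis=0).div(norms, axis=1)

# Item-Item steps 1-2 are identical, starting from ratings.T
print(user_sim.round(2))
```

This mirrors the Excel worksheets: the centered table corresponds to the normalized ratings, and `user_sim` to the "User-User Matrix" worksheet.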

User-User Collaborative Filtering in Excel
lec10_jokes.xlsx – "User-User CF" worksheet

User-User step 3: find the top 5 similarities between user 250 and other users who have rated jokes 1-20
User-User step 4: find the user IDs matching the top 5 similarities found in step 3
User-User step 5: find the ratings on jokes 1-20 from the top 5 similar user IDs found in step 4
User-User step 6: calculate the average ratings weighted by the similarity scores of the top 5 users
Item-Item Collaborative Filtering in Excel
lec10_jokes.xlsx – "Item-Item CF" worksheet

Item-Item step 3: for each of jokes 1-20, find the top 5 similarities with other jokes that have been rated by user 250
Item-Item step 4: find the joke IDs matching the similarities found in step 3
Item-Item step 5: find the ratings from user 250 on the top 5 similar jokes found in step 4
Item-Item step 6: calculate the average ratings weighted by the similarity scores of the top 5 similar jokes
Evaluation of the Model in Excel
lec10_jokes.xlsx – "User-User CF" and "Item-Item CF" worksheets

User-User/Item-Item step 7: evaluate the prediction results by comparing their signs with the signs of the actual ratings of user 250, reporting accuracy in %
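Step 7 can be sketched as follows (the predicted and actual ratings below are hypothetical, for illustration only):

```python
# A prediction counts as correct when its sign matches the sign of the
# user's actual rating (Jester ratings range over [-10, 10])
predicted = [2.1, -0.4, 3.3, -1.8]   # hypothetical predictions
actual    = [4.0, 1.5, 2.0, -3.0]    # hypothetical actual ratings of user 250
hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
accuracy = hits / len(actual) * 100
print(f"{accuracy:.0f}%")  # 75%
```

Sign agreement is a coarse but convenient proxy: it only checks whether the model predicts that the user will like or dislike each joke.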

Lecture Summary
✓ Problem that recommendation systems try to address

✓ Collaborative Filtering
  ✓ User-User CF and Item-Item CF
  ✓ Pros and cons

✓ Content-based Filtering
  ✓ Pros and cons
Readings
› [1] Mining Massive Datasets – Chapter 9 Recommendation Systems
– http://mmds.org/#book
› [2] Collaborative Filtering
– http://recommender-systems.org/collaborative-filtering/
› [3] Recommender Systems
– http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html
› [4] Explanation of Collaborative Filtering vs Content-Based Filtering
– https://codeburst.io/explanation-of-recommender-systems-in-information-retrieval-13077e1d916c
