
Recommender System – using Collaborative Filtering
Contents

• Recommender System – Introduction


• Recommender System – Rating Data
• Similarity Measures
• Euclidean distance
• Cosine Similarity
• Jaccard Similarity
• Pearson Correlation
• Collaborative Filtering
• User-based Collaborative Filtering (UBCF)
• Item-based Collaborative Filtering (IBCF)
• Evaluation
• Top-N Recommendations Considerations
Recommender System -
Introduction
Recommender System - Why ?
• How do we buy things in our day-to-day lives?
• We ask our friends, research the product specifications, compare the product with similar
products on the Internet, read the feedback from anonymous users, and then we make
decisions.
• How would it be if there is some mechanism that does all these tasks automatically and
recommends the products best suited for you efficiently?
• The answer is a Recommender System
Recommender System - What ?
• “Friends you may know” – Facebook
• “People you may know” – LinkedIn
• What are these recommendations?
• These features recommend a list of people whom you might know
and who are similar to you, based on:
• your friends
• friends of friends in your close circle
• geographical location
• skill sets
• groups
• liked pages, and so on
• The recommendations are specific to you and differ
from user to user.
Recommender System – Defining it

• Recommender systems are the software tools and techniques that provide suggestions, such as useful
products on e-commerce websites, videos on YouTube, friends' recommendations on Facebook, book
recommendations on Amazon, news recommendations on online news websites, and so on.

• The main goal of recommender systems is to provide suggestions to online users to make better
decisions from many alternatives available over the Web.

• A better recommender system is directed more towards personalized recommendations by taking
into consideration the available digital footprint of the user and information about a product, such as
specifications, feedback from the users, comparison with other products, and so on, before making
recommendations.
Recommender System – Landscape – Types
• Recommender System Methods
  • Personalized
    • Content-Based
    • Collaborative Filtering (CF) Based
      • Memory-based: User-based CF, Item-based CF
      • Model-based: Matrix Factorization using SVD, Deep Learning
    • Hybrid
  • Non-Personalized (Sorting/Filtering based on)
    • Popularity Based
    • Recency Based
    • Most-valued
    • Genre/Category/Topic Based
Recommender System – Rating Data
Rating Data of Social Media and Web Platforms
• Rating data is widely used across web platforms and
social media platforms:
• Video streaming (e.g., Netflix, YouTube, Prime Video)
• Hotels & Hospitality (e.g., Airbnb, Booking.com,
Makemytrip.com)
• Taxi/Cab (Ola, Uber), Books (Google Books, Goodreads.com)
• Vendors (B2B platforms like IndiaMart or Alibaba)
• General consumer goods (Amazon.com, Flipkart)
• Doctors & Hospitals (Yelp.com), Movies (IMDb, Rotten Tomatoes)
• Social Media Posts & Content (Twitter, Facebook)
• Digital news websites and articles (Times of India, Wall Street Journal),
among others.
Rating Data
• Ratings can be on a scale of 1 to 5 or 1 to 7, or categorical: Like, Yay-Nay-Love, Like/Dislike,
Thumbs-up/Thumbs-down, Yes/No, Okay/Not-Okay.
• A business can also change its rating scheme:
https://www.cnet.com/tech/services-and-software/netflix-adds-two-thumbs-up-rating-for-content-you-absolutely-love/
• What do ratings represent?
• Feelings of liking or disliking an activity/product/service/experience, measured in degrees of
rating.
• They reflect human behavior and become a snapshot of past likes and dislikes.
Past behavior can be used to predict future behavior.
• This is the foundation for using ratings as data for analytics. The most common
use is to generate recommendations based on past behavior that are
useful and apt for the user's current behavior.
• A system that incorporates various ways of generating recommendations is called a
recommender system. Many social media and web platforms use one to offer users more
personalized options and to boost user activity on their platforms.
Rating Data Uses
• Personalized Recommendations: Rating data is often leveraged to
provide personalized recommendations to users. By analyzing the
preferences and patterns in users' rating behaviors, platforms can
offer tailored suggestions and content based on individual tastes and
preferences.
• Social Influence: Ratings and reviews can influence the purchasing decisions of other
users, including through word-of-mouth and trend mechanisms.
• Decision Making: Users make informed decisions about which
products, services, or experiences to choose.
• Trust and Credibility: Ratings contribute to the establishment
of trust and credibility on online platforms.
• Market Research and Analysis: Aggregated rating data can be used for market
research and analysis purposes.
• Feedback for Improvement: Rating data serves as a feedback
mechanism for businesses and platform operators.
Rating Data Types
• Ratings can be explicit.
• The rating data is collected directly from users by businesses and platforms to improve the
user experience and provide more choices for ease of decision
making of users.

• Ratings can be implicit.
• The rating data can be imputed from the user's usage of the
products/services/items of the business/platform.
• For example, the viewing duration of YouTube videos or the number of e-book pages read
can, with appropriate assumptions, be used to impute ratings.
• Similarly, text reviews can be used to impute ratings, and the frequency of
conversation can be used to impute ratings of “friends” on social media.
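As a minimal sketch of imputing an implicit rating from usage, the example below maps the fraction of a video that was watched to a rating from 1 to 5. The function name `implied_rating` and the linear mapping are assumptions made for illustration only; a real platform would choose and validate its own rule.

```python
import math


def implied_rating(watched_seconds: float, video_seconds: float) -> int:
    """Map the fraction of a video watched to an implied rating from 1 to 5.

    Assumption (hypothetical): liking grows linearly with the fraction watched.
    """
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    # Clamp to [0, 1] so that re-watching does not push the rating above 5
    fraction = min(max(watched_seconds / video_seconds, 0.0), 1.0)
    # 0% watched maps to 1, 100% watched maps to 5, rounded to the nearest integer
    return 1 + math.floor(fraction * 4 + 0.5)


low = implied_rating(30, 600)    # 5% of the video watched
mid = implied_rating(300, 600)   # half of the video watched
high = implied_rating(600, 600)  # the whole video watched
print(low, mid, high)
```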
Data Pre-Processing

Given Rating Data

User    Item         Rating
Bhim    Baahubaali1  5
Calvin  Baahubaali1  2
Hobbes  Baahubaali1  5
Peanut  Baahubaali1  3
Bhim    Baahubaali2  3
Calvin  Baahubaali2  1
Hobbes  Baahubaali2  4
Peanut  Baahubaali2  1
Bhim    Baahubaali3  0
Calvin  Baahubaali3  0
Hobbes  Baahubaali3  4
Peanut  Baahubaali3  1
Bhim    KGF1         2
Calvin  KGF1         4
Hobbes  KGF1         3
Peanut  KGF1         2
Bhim    KGF2         0
Calvin  KGF2         2
Hobbes  KGF2         0
Peanut  KGF2         4
Bhim    KGF3         5
Calvin  KGF3         0
Hobbes  KGF3         1
Peanut  KGF3         2

User-Item Rating Matrix (after data pre-processing)

User\Item  Baahubaali1  Baahubaali2  Baahubaali3  KGF1  KGF2  KGF3
Bhim       5            3            0            2     0     5
Calvin     2            1            0            4     2     0
Hobbes     5            4            4            3     0     1
Peanut     3            1            1            2     4     2
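The reshaping from the given rating data (one row per user-item pair) to the user-item rating matrix can be sketched in plain Python as follows. The variable names are illustrative, and filling absent user-item pairs with 0 is an assumption; with tabular libraries, a pivot operation does the same job.

```python
# (user, item, rating) triples copied from the "Given Rating Data" table
ratings = [
    ("Bhim", "Baahubaali1", 5), ("Calvin", "Baahubaali1", 2),
    ("Hobbes", "Baahubaali1", 5), ("Peanut", "Baahubaali1", 3),
    ("Bhim", "Baahubaali2", 3), ("Calvin", "Baahubaali2", 1),
    ("Hobbes", "Baahubaali2", 4), ("Peanut", "Baahubaali2", 1),
    ("Bhim", "Baahubaali3", 0), ("Calvin", "Baahubaali3", 0),
    ("Hobbes", "Baahubaali3", 4), ("Peanut", "Baahubaali3", 1),
    ("Bhim", "KGF1", 2), ("Calvin", "KGF1", 4),
    ("Hobbes", "KGF1", 3), ("Peanut", "KGF1", 2),
    ("Bhim", "KGF2", 0), ("Calvin", "KGF2", 2),
    ("Hobbes", "KGF2", 0), ("Peanut", "KGF2", 4),
    ("Bhim", "KGF3", 5), ("Calvin", "KGF3", 0),
    ("Hobbes", "KGF3", 1), ("Peanut", "KGF3", 2),
]

users = sorted({user for user, _, _ in ratings})
# dict.fromkeys keeps the first-seen order of the items
items = list(dict.fromkeys(item for _, item, _ in ratings))

# Start from zeros (assumption: an absent pair counts as 0), then fill in the known ratings
matrix = {user: {item: 0 for item in items} for user in users}
for user, item, rating in ratings:
    matrix[user][item] = rating

print("User\\Item", *items)
for user in users:
    print(user, *(matrix[user][item] for item in items))
```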


Similarity Measures
Data Pre-Processing for Collaborative Filtering

• Similarity measurements
  • Euclidean distance
  • Cosine Similarity
  • Pearson Correlation coefficient
  • Jaccard Similarity

• Necessary for Collaborative Filtering (CF)
  • Find User-User Similarity → Apply collaborative filtering → User-based CF Recommendations
  • Find Item-Item Similarity → Apply collaborative filtering → Item-based CF Recommendations
Distance Measures – A prerequisite for the Similarity Concept
• Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are
considered “more similar” than objects a and c. Two objects exactly alike would have a
distance of zero. One of the most popular examples is Euclidean distance. To be a ‘true’
metric, it must obey the following four conditions:
• d(a, b) >= 0, for all a and b, non-negativity
• d(a, b) == 0, if and only if a = b, positive definiteness
• d(a, b) == d(b, a), symmetry
• d(a, c) <= d(a, b) + d(b, c), the triangle inequality
• There are a number of ways to convert between a distance metric and a similarity measure.
For example, a decreasing function of the distance, such as 1 / (1 + d(a, b)), can be used as a similarity.
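The four conditions can be checked numerically for Euclidean distance on a few sample points. This is only a spot check under assumed data (the three points are hypothetical rating vectors), not a proof, and `similarity` shows just one possible distance-to-similarity conversion.

```python
import math
from itertools import product

points = [(2, 3), (1, 2), (4, 0)]  # hypothetical rating vectors


def d(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)


for a, b, c in product(points, repeat=3):
    assert d(a, b) >= 0                                # non-negativity
    assert (d(a, b) == 0) == (a == b)                  # positive definiteness
    assert math.isclose(d(a, b), d(b, a))              # symmetry
    assert d(a, c) <= d(a, b) + d(b, c) + 1e-12        # triangle inequality


def similarity(a, b):
    """One possible conversion: a decreasing function of the distance."""
    return 1 / (1 + d(a, b))


print(similarity((2, 3), (2, 3)), similarity((2, 3), (1, 2)))
```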
Similarity Measures - Euclidean distance - Example
• Let the user-item rating matrix be

User\Item  Item1  Item2
A          2      3
B          1      2

• The Euclidean distance between users A and B is
$d(A, B) = \sqrt{(2 - 1)^2 + (3 - 2)^2} = \sqrt{2} \approx 1.414$

Mathematically

For two n-dimensional vectors x and y in general, the same calculation can be written with dot products (matrix multiplication) as
• dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

In Python this is implemented as a pairwise metric function for Euclidean distance:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances

• Then compute Euclidean Similarity = 1/(1+Euclidean Distance). Values are greater than 0 and at most 1: a value of 1 means
identical, and values closer to zero mean dissimilar. For the example above, the similarity is 1/(1+1.414) ≈ 0.414.
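The worked example can be reproduced with NumPy as a minimal sketch. It computes the distance both from the definition and from the dot-product form, then converts it to a similarity; the variable names are illustrative.

```python
import numpy as np

A = np.array([2.0, 3.0])  # ratings of user A for Item1, Item2
B = np.array([1.0, 2.0])  # ratings of user B for Item1, Item2

# Definition: square root of the summed squared differences
dist_direct = np.sqrt(np.sum((A - B) ** 2))

# Dot-product form: sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))
dist_dot = np.sqrt(A @ A - 2 * (A @ B) + B @ B)

# Euclidean similarity = 1 / (1 + Euclidean distance)
euclid_sim = 1 / (1 + dist_direct)

print(float(dist_direct), float(dist_dot), float(euclid_sim))
```

Both forms give the same distance; the dot-product form is convenient when all pairwise distances of a large matrix are computed at once.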
Similarity Measures - Cosine Similarity

• Cosine Similarity is based on angular distance rather than linear
distance. Cosine Similarity = 1 – Cosine Distance.
• Cosine similarity is a metric used to measure the similarity of two
vectors. Specifically, it measures the similarity in the direction or
orientation of the vectors, ignoring differences in their magnitude or
scale. Both vectors need to be part of the same inner product space,
meaning they must produce a scalar through inner product
multiplication. The similarity of two vectors is measured by the cosine
of the angle between them.

• Example use cases for Cosine Similarity:
• Text Analytics: find the similarity between two text documents using the
number of terms used in both documents
• Recommendation System: e-commerce, movie, and book recommendations, etc.
Similarity Measures - Cosine Similarity - Example

• Let the user-item rating matrix be

User\Item  Item1  Item2
A          2      3
B          1      2

Mathematically Cosine Similarity = $\cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$

• Norm of A = SQRT(2^2+3^2) = 3.605551, Norm of B = SQRT(1^2+2^2) = 2.236068

• Cosine Similarity = Cos(θ) = Dot((2,3),(1,2)) / ((Norm of A) x (Norm of B))

= ((2 x 1) + (3 x 2)) / (3.605551 x 2.236068) = 8 / 8.062258 = 0.992278

The cosine measure ranges from 0 to 1 if all ratings are non-negative. If negative ratings are allowed in the
user-item matrix, then cosine similarity ranges from -1 to 1, with 1 meaning the most similar (vectors pointing in the same direction).

In Python this is implemented as a pairwise metric function for Cosine Similarity:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity
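A minimal NumPy sketch of the same calculation is shown below. The helper `cosine_sim` is a hypothetical name introduced here; it also illustrates that scaling a vector leaves the similarity unchanged, while flipping its sign flips the sign of the similarity.

```python
import numpy as np

A = np.array([2.0, 3.0])  # ratings of user A
B = np.array([1.0, 2.0])  # ratings of user B


def cosine_sim(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


cos_ab = cosine_sim(A, B)
cos_scaled = cosine_sim(A, 10 * B)   # magnitude is ignored, only direction matters
cos_negative = cosine_sim(A, -B)     # opposite direction gives a negative value

print(round(cos_ab, 6), round(cos_scaled, 6), round(cos_negative, 6))
```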
Similarity Measures - Jaccard Similarity – Only Binary Rating

• The Jaccard similarity measures the similarity between two sets of data
to see which members are shared and distinct. The Jaccard similarity is
calculated by dividing the number of observations in both sets by the
number of observations in either set.
• In other words, the Jaccard similarity can be computed as the size of the
intersection divided by the size of the union of two sets.

• Jaccard similarity can be used to find the similarity between two
asymmetric binary vectors or to find the similarity between two sets.
• Example use cases for Jaccard Similarity:
• Text Analytics: find the similarity between two text documents using the
number of terms used in both documents
• Recommendation System: e-commerce, movie, and book recommendations, etc.
Similarity Measures - Jaccard Similarity – Only Binary Rating
• Let the user-item rating matrix be

User\Item  Item1  Item2
A          2      3
B          1      2

Mathematically Jaccard Similarity = $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

• Jaccard measures work on binary rating data. Let us recompute the user-item rating matrix with the
following assumptions:
• Any rating greater than or equal to 2 is considered a liking of the user for the item, so it is treated as
one.
• Any rating less than 2 is considered a dislike of the user for the item, so it is treated as zero.

Revised user-item matrix will be

User\Item  Item1  Item2
A          1      1
B          0      1
Similarity Measures - Jaccard Similarity - Example
• Revised user-item matrix

User\Item  Item1  Item2
A          1      1
B          0      1

• Jaccard score is based on set theory. Each user is represented by the set of items the user likes (the items marked 1).
• So, Set User A = {Item1, Item2} and Set User B = {Item2}
• The intersection of sets A and B is the set containing the common elements between the two sets. In this case, the only
common element is Item2. Intersection of A and B: {Item2}. Size of Intersection = 1
• The union of sets A and B is the set containing all unique elements from both sets. In this case, the union is {Item1, Item2}.
Size of Union = 2
• Jaccard Similarity = (Size of Intersection) / (Size of Union) = 1 / 2
• So, the Jaccard similarity coefficient between sets A and B is 0.5, or 50%. This indicates that half of the items liked by
at least one of the two users are liked by both. The users have a moderate level of similarity.
Jaccard similarity ranges from 0 to 1, where 0 means no similarity and 1 means complete similarity between the
sets.
In Python, this is implemented by the jaccard_score function:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html#sklearn.metrics.jaccard_score
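The binarization and the set calculation can be sketched in plain Python. The helper names `liked_items` and `jaccard`, and the choice to return 0 for two empty sets, are assumptions for illustration; the threshold of 2 follows the example.

```python
ratings = {
    "A": {"Item1": 2, "Item2": 3},
    "B": {"Item1": 1, "Item2": 2},
}
LIKE_THRESHOLD = 2  # ratings >= 2 count as "liked", as in the example


def liked_items(user_ratings, threshold=LIKE_THRESHOLD):
    """Return the set of items whose rating reaches the threshold."""
    return {item for item, rating in user_ratings.items() if rating >= threshold}


def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(set_a & set_b) / len(union)


liked_a = liked_items(ratings["A"])
liked_b = liked_items(ratings["B"])
jaccard_ab = jaccard(liked_a, liked_b)

print(sorted(liked_a), sorted(liked_b), jaccard_ab)
```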
Similarity Measures – Pearson Correlation
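• Let the user-item rating matrix be

User\Item  Item1  Item2
A          2      3
B          1      2

Mathematically Pearson Correlation Similarity =
$r_{AB} = \dfrac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_i (a_i - \bar{a})^2} \, \sqrt{\sum_i (b_i - \bar{b})^2}}$
where $a_i$ and $b_i$ are the ratings of users A and B for item i, and $\bar{a}$ and $\bar{b}$ are their mean ratings.

• The Pearson correlation coefficient measures the linear relationship between two datasets.
Strictly speaking, the significance test (p-value) for Pearson's correlation assumes that each dataset is normally distributed.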
Similarity Measures – Pearson Correlation - Example
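• Pearson correlation similarity = 1 for the user-item rating matrix of users A and B above. Note that with only
two items, the correlation between two users is always +1 or -1 (or undefined if a user's ratings are constant),
so more co-rated items are needed for a meaningful value.
• Pearson correlation similarity ranges from -1 to 1, where -1 means perfectly opposite rating patterns, 0 means
no linear relationship, and 1 means complete similarity between the users.
In Python, Pearson correlation is implemented by the pearsonr function in scipy.stats:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html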
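As a minimal sketch, the coefficient can be computed from its definition with NumPy. The helper `pearson` is a hypothetical name, and the three-item rating vectors in the second call are made-up data added to show a value other than +1 or -1.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # deviations from each user's mean rating
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        return float("nan")  # undefined when a user's ratings are all equal
    return float((xc * yc).sum() / denom)


# Users A and B from the example: two items only, so the result is exactly +1
r_ab = pearson([2, 3], [1, 2])

# Hypothetical users with three co-rated items
r_three = pearson([5, 3, 2], [2, 1, 4])

print(r_ab, r_three)
```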
Collaborative Filtering
Collaborative filtering recommender systems
• Collaborative filtering is a branch of recommendation that takes account of the information about different
users. The word "collaborative" refers to the fact that users collaborate with each other to recommend
items. In fact, the algorithms take account of user purchases and preferences in the form of ratings.
• The task performed here is filtering items from a large set of alternatives, using the preferences of many
users together, expressed in the form of user ratings for the items.
• The underlying assumption: if two users shared the same interests in the past (i.e., they liked the same items
such as books, posts, movies, music etc.), they will also have similar tastes in the future.
• The collaborative filtering approach considers only user preferences (rating data) and does not take into
account the features or contents of the items being recommended. This approach requires a large set of
user preferences for more accurate results.
• Item-based collaborative filtering: This recommends to a user the items that are most similar to the
user's purchases.
• User-based collaborative filtering: This recommends to a user the items that are the most preferred by
similar users.

• Recommendation task: Use Collaborative Filtering to generate Top-N
recommendations for the items not rated by a particular user.
Case Study - User-Item Rating Matrix

Ratings as collected (missing values are shown by ‘-’):

User\Item  Baahubali1  Baahubali2  KGF1  KGF2  Kantara
Calvin     5           4           2     -     3
Hobbes     -           3           4     5     -
Peanut     2           -           3     4     5
Bhim       4           2           -     1     -
New_User   1           -           2     -     -

Ratings used for calculations (‘-’ replaced by zero):

User\Item  Baahubali1  Baahubali2  KGF1  KGF2  Kantara
Calvin     5           4           2     0     3
Hobbes     0           3           4     5     0
Peanut     2           0           3     4     5
Bhim       4           2           0     1     0
New_User   1           0           2     0     0

The user-item rating matrix covers four existing users and a “New_User” of the web app, with five items, which are movies.
Some ratings are not available because, in a real-world scenario, not every user rates every movie and not every movie is
rated by all users.
Recommendation task : Use Collaborative Filtering to generate Top-N recommendations for the items not rated by the
New_User
User-based collaborative filtering (UBCF)

• Measure how similar each user is to the new one. As in IBCF, popular similarity measures are
correlation and cosine.

• Identify the most similar users.
• You can also take into account all users whose similarity is above a defined threshold.

• Rate the items purchased by the most similar users. The predicted rating of an item is the average rating
among the similar users:
• Take the weighted average rating, using the similarities as weights.
• Optionally, apply the relative difference (for example, each rating minus that user's average rating) to remove
the bias of harsh raters and generous raters.

• Pick the top-rated items.
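The UBCF steps can be sketched on the case-study matrix as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: cosine similarity is computed on the zero-filled rows, the neighbourhood size `k = 2` is an arbitrary choice, zeros are treated as "not rated" when averaging, and the optional bias correction is left out.

```python
import numpy as np

users = ["Calvin", "Hobbes", "Peanut", "Bhim"]
items = ["Baahubali1", "Baahubali2", "KGF1", "KGF2", "Kantara"]
R = np.array([
    [5, 4, 2, 0, 3],   # Calvin
    [0, 3, 4, 5, 0],   # Hobbes
    [2, 0, 3, 4, 5],   # Peanut
    [4, 2, 0, 1, 0],   # Bhim
], dtype=float)
new_user = np.array([1, 0, 2, 0, 0], dtype=float)


def cosine_sim(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Step 1: similarity of every existing user to the new user
sims = np.array([cosine_sim(row, new_user) for row in R])

# Step 2: keep the k most similar users (k is an assumed parameter)
k = 2
neighbours = np.argsort(sims)[::-1][:k]

# Step 3: similarity-weighted average rating for each item the new user has not rated
predictions = {}
for j, item in enumerate(items):
    if new_user[j] != 0:
        continue  # already rated by the new user
    raters = [u for u in neighbours if R[u, j] != 0]
    if not raters:
        continue  # no neighbour rated this item, so no prediction
    weights = sims[raters]
    predictions[item] = float(weights @ R[raters, j] / weights.sum())

# Step 4: pick the top-rated items
top_n = sorted(predictions, key=predictions.get, reverse=True)
print([users[u] for u in neighbours], top_n)
```

With these assumptions the two nearest neighbours are Calvin and Hobbes. A different `k`, similarity measure, or treatment of missing ratings can change the ranking.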


Item-based collaborative filtering (IBCF)
• The starting point is a rating matrix in which rows correspond to users and columns
correspond to items.
• For each pair of items, measure how similar they are in terms of having received similar ratings from
the same users.
• For each item, identify the most similar items and store the result as an item-item similarity matrix.
• For each user whose recommendations need to be generated, identify the items that are most
similar to this user's already purchased items.
• Item-item similarity values act as weights.
• This is done by calculating a predicted rating for each item the user has not yet rated, as a
similarity-weighted combination of the user's existing ratings.

• Pick the top-rated items.
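The IBCF steps can be sketched on the same case-study matrix. The slide's own formula is not reproduced in the text, so the weighted average used here (sum of similarity times rating, divided by the sum of similarities) is a common choice assumed for illustration, as is the use of cosine similarity on zero-filled columns.

```python
import numpy as np

items = ["Baahubali1", "Baahubali2", "KGF1", "KGF2", "Kantara"]
R = np.array([
    [5, 4, 2, 0, 3],   # Calvin
    [0, 3, 4, 5, 0],   # Hobbes
    [2, 0, 3, 4, 5],   # Peanut
    [4, 2, 0, 1, 0],   # Bhim
    [1, 0, 2, 0, 0],   # New_User
], dtype=float)
target = 4  # row index of New_User

# Item-item cosine similarity: compare the columns of R with each other
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)

user_ratings = R[target]
rated = np.flatnonzero(user_ratings)          # items the user has already rated
unrated = np.flatnonzero(user_ratings == 0)   # candidate items to recommend

# Predicted rating = similarity-weighted average of the user's own ratings
predictions = {}
for i in unrated:
    weights = item_sim[i, rated]
    if weights.sum() == 0:
        continue  # the item is not similar to anything the user has rated
    predictions[items[i]] = float(weights @ user_ratings[rated] / weights.sum())

top_n = sorted(predictions, key=predictions.get, reverse=True)
print(top_n)
```

Because each prediction is a weighted average of New_User's own ratings (1 and 2), all predicted values lie between 1 and 2; only their order matters for the Top-N list.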


LIMITATIONS of COLLABORATIVE FILTERING
• If the new user hasn't seen any movie yet, neither IBCF nor UBCF is able to recommend any
item. IBCF can't work unless it knows the items purchased by the new user. UBCF needs to
know which users have similar preferences to the new one, but the new user's ratings are not known.

• If a new item hasn't been purchased by anyone, it will never be recommended. IBCF matches items
that have been purchased by the same users, so it won't match the new item with any of the others.
UBCF recommends to each user items purchased by similar users, and no one has purchased the new
item. So, the algorithm won't recommend it to anyone.
Hybrid recommender systems
• A hybrid system combines various recommender systems so that the disadvantages of one system are
offset by the advantages of another, which yields a more robust system.
• For example, collaborative filtering fails when new items don't have ratings, while content-based
systems can use the available feature information about the items. Combining the two allows
new items to be recommended more accurately and efficiently.
• Considerations?
• What techniques should be combined to achieve the business solution?
• How should we combine various techniques and their results for better predictions?
Evaluation
Evaluation techniques
• Is the system efficient or accurate? On what basis do we state that the system is good?

• Is the model overfitting or underfitting? How well does the model fit future data or test
data?

• You can do cross validation, create a confusion matrix, or use RMSE values.

Cross Validation

This is a very popular technique for model evaluation for almost all models. In this technique, we divide the
data into two datasets: a training dataset and a test dataset. The model is built using the training dataset
and evaluated using the test dataset.
• This process is repeated many times with different splits. The test errors are calculated for every iteration.
• The averaged test error is calculated at the end of all the iterations to generalize the model accuracy.

Confusion matrix

This technique is popularly used in evaluating a classification model. We build a confusion matrix using
the results of the model. We calculate precision and recall/sensitivity/specificity to evaluate the model.
• Precision: This is the proportion of records classified as relevant that are truly relevant.
• Recall/Sensitivity: This is the proportion of truly relevant records that are classified as relevant.
• Specificity: Also known as true negative rate, this is the proportion of truly non-relevant records
that are classified as non-relevant.
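The metrics named above can be sketched in plain Python on a small held-out test set. The ratings are made-up data, and the rule that a rating of 4 or more counts as "relevant" (actual) or "recommended" (predicted) is an assumption for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def confusion_counts(relevant, recommended):
    """Return (TP, FP, FN, TN) for two parallel lists of booleans."""
    tp = sum(r and c for r, c in zip(relevant, recommended))
    fp = sum((not r) and c for r, c in zip(relevant, recommended))
    fn = sum(r and (not c) for r, c in zip(relevant, recommended))
    tn = sum((not r) and (not c) for r, c in zip(relevant, recommended))
    return tp, fp, fn, tn


# Hypothetical held-out ratings and the model's predictions for the same items
actual = [5, 3, 4, 1, 2, 4, 2]
predicted = [4, 4, 5, 2, 2, 3, 1]

test_rmse = rmse(actual, predicted)

# Assumed rule: a rating of 4 or more means relevant / recommended
relevant = [a >= 4 for a in actual]
recommended = [p >= 4 for p in predicted]
tp, fp, fn, tn = confusion_counts(relevant, recommended)

precision = tp / (tp + fp)      # recommended items that are truly relevant
recall = tp / (tp + fn)         # relevant items that were recommended
specificity = tn / (tn + fp)    # non-relevant items that were not recommended

print(round(test_rmse, 4), precision, recall, specificity)
```

In cross validation, the same metrics would be computed on each test split and then averaged.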
Top-N recommendations
Considerations
Top-N recommendations – Considerations – User Experience

• Device and Screen Size: Recommendations should be optimized
for the device and screen size the user is using. Mobile devices, tablets, and
desktops may have different layouts and visual designs for
recommendations.
• User Interface: Consider the user interface and interaction methods. For
touch-based devices, the UI may need to be more touch-friendly, while for
desktops, more options might be presented in a single view.
• User Preferences: Incorporate user-specific preferences and settings, such
as language, location, and other personalized parameters.
• Visual Appeal: The presentation of recommendations should be visually
appealing and aligned with the overall design of the application or platform.
• Item Thumbnails: Use images and concise descriptions to help users
quickly understand the items being recommended.
• Feedback Loop: Incorporate user feedback to continuously improve
recommendations. Allow users to provide ratings, reviews, or explicit
feedback on recommended items.
• Learning and Adaptation: Recommendation models should adapt
to changing user preferences over time. Periodically update the models using
the latest data.
• Avoid Over-Personalization: While personalization is important,
recommendations that are too focused on a user's past behavior might lead to
a filter bubble. Including diverse items can expose users to new and
unexpected options.
• Novelty: Introduce novelty by occasionally recommending items that
are slightly outside the user's usual preferences. This can enhance the
user's exploration of the catalog.
Top-N recommendations – Considerations – Best Practices
• Business Goals:
• Business Constraints: Consider any business-specific constraints, such as promoting certain items,
adhering to inventory levels, or avoiding recommending certain items due to legal or ethical reasons.
• Promotion and Sales: If recommendations are used to drive sales or promotions, consider strategies to
feature special offers or high-margin items.

• Cold Start Problem:


• New Users: For new users with limited interaction history, use techniques such as content-based
recommendations or hybrid approaches to overcome the "cold start" problem.

• A/B Testing and Evaluation:


• A/B Testing: Test different recommendation algorithms or strategies using A/B testing to assess their
impact on user engagement, conversion rates, or other relevant metrics.
• Metrics: Define appropriate metrics for evaluating recommendation performance, such as click-through
rates, conversion rates, or user satisfaction.

• Multi-Modal Recommendations:
• Cross-Platform Consistency: If recommendations are provided across multiple platforms (website, app,
smart TV), ensure a consistent experience and recommendations.
Top-N recommendations – Considerations – Data Privacy & Data Management

• Privacy and Security:


• Data Privacy: Ensure that the recommendations are generated without compromising user privacy.
Use anonymized and aggregated data whenever possible.
• Data Security: Protect user data and ensure that the recommendation system is resistant to attacks
such as data poisoning or adversarial attacks.

• Real-Time and Scalability:


• Real-Time Recommendations: For some applications, recommendations need to be generated in
real-time. Ensure that the recommendation system can handle the load and generate timely
suggestions.
• Scalability: As the user base and catalog size grow, the recommendation system should scale
effectively to maintain responsiveness.
