
UNIT III COLLABORATIVE FILTERING

A systematic approach, Nearest-neighbor collaborative filtering (CF), user-based and item-based CF, components of neighborhood methods (rating normalization, similarity weight computation, and neighborhood selection).
Suggested Activities:
• Practical learning – Implement collaborative filtering concepts
• Assignment on security aspects of recommender systems
Suggested Evaluation Methods:
• Quiz on collaborative filtering
• Seminar on security measures of recommender systems

What Is Collaborative Filtering?

• Collaborative filtering filters information by using the interactions and data collected by the system from
other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely
to agree again in the future.
• The concept is simple: when we want to find a new movie to watch we’ll often ask our friends for
recommendations. Naturally, we have greater trust in the recommendations from friends who share tastes
similar to our own.
• Most collaborative filtering systems apply the so-called similarity index-based technique. In the
neighborhood-based approach, a number of users are selected based on their similarity to the
active user. Inference for the active user is made by calculating a weighted average of the ratings
of the selected users.
• Collaborative-filtering systems focus on the relationship between users and items. The similarity of items
is determined by the similarity of the ratings of those items by the users who have rated both items.
• Collaborative filtering recommender systems have played a significant role in the rise of web services
and content platforms like Amazon, Netflix, YouTube, etc. in recent years. In this age of information,
knowing what the customer wants before they even know it themselves is nothing short of a superpower.
As the name suggests, recommender system algorithms are used to offer relevant content or products to consumers based on their tastes or previous choices.

There are two classes of Collaborative Filtering:

• User-based, which measures the similarity between target users and other users.
• Item-based, which measures the similarity between the items that target users rate or interact with and
other items.

Why do we need recommender systems?

• Back in 2006, Netflix offered a prize to solve a problem that had been around for years: finding the best collaborative filtering algorithm to predict user ratings for films that users haven't watched yet, based on their previous ratings of other movies.
• Today, e-commerce giants continue to try to solve this problem in better ways by observing users' past behavior to predict what other things the same user will like.
• Recommendations also help customers discover new products and offers that they’re
not explicitly looking for, thus speeding up the search process. This allows companies to
send out personalized newsletters via email that offer new TV shows, movies, products,
and services that are better suited for them.
• One of the most significant advantages of modern recommendation algorithms is their
ability to take implicit feedback and suggest new content/products, thus staying up-to-
date with customers’ preferences. This enables businesses to continue catering to
customers even if their tastes change over time.

User-item interaction matrix


• In collaborative filtering, we ignore the features of an individual item. Instead, we focus on a similar
group of people using the item and recommend other items that the group likes.

• Similar users are divided into small clusters and are recommended new items according to the
preferences of that cluster. Let’s understand this with an easy movie recommendation example:

What we can infer from this user-item matrix is:

• Users 1 and 2 liked Movie 1. Since User 1 liked movies 2 and 4 a lot, there’s a high
chance of User 2 enjoying the same.
• Users 1 and 3 have opposite tastes.
• Users 3 and 4 both disliked Movie 2, so there’s a high chance User 4 will also dislike
Movie 4.
• User 3 might dislike Movie 1.

Collaborative filtering: Advantages and disadvantages


Advantages
• No domain knowledge is required since all the features are learned automatically.
• Can help users discover new interests even if they’re not actively searching for them by recommending
new items similar to what they’re interested in.
• Does not require detailed features or contextual data about products or items. It only needs the user-item interaction matrix to train a matrix factorization model.
Disadvantages
• Data sparsity can lead to difficulty in recommending new products or users since the suggestions are
based on historic data and interactions.
• As the user base grows, the algorithms suffer due to high data volume and lack of scalability.
• Lack of diversity in the long run. This might seem counterintuitive since the whole point of collaborative
filtering is to recommend new items to the user. However, since the algorithms function based on
historical ratings, it will not recommend items with little or limited data. Popular products will be more
popular in the long run and there will be a lack of new and diverse options.

Types of collaborative filtering


The two types of collaborative filtering approaches are:

• Memory-based collaborative approach


• Model-based collaborative approach

A systematic approach to collaborative filtering involves the following steps:

1. Data Collection: Gather user-item interaction data, such as ratings, reviews, purchases, or clicks.
2. Data Preprocessing: Clean and prepare the data for analysis, including handling missing values, outliers, and
data normalization.
3. User or Item Representation: Encode user preferences or item features into a suitable representation, such as
user-item matrices or item-attribute vectors.
4. Similarity Calculation: Compute similarity scores between users or items based on their respective
representations.
5. Nearest Neighbor Identification: Identify the nearest neighbor for each user or item based on the calculated
similarity scores.
6. Prediction Generation: Predict the rating or preference of a user for an item based on the ratings or preferences
of their nearest neighbor.
7. Evaluation and Optimization: Evaluate the performance of the CF algorithm using appropriate metrics and
refine the model parameters to improve accuracy.
8. Deployment and Maintenance: Integrate the CF algorithm into the recommender system and monitor its
performance over time, making adjustments as needed.

Effective collaborative filtering relies on the quality and quantity of user-item interaction data. Additionally, the
choice of similarity measures, nearest neighbor identification techniques, and prediction algorithms can
significantly impact the performance of the CF system.

• Unlike content-based approaches, which use the content of items previously rated by a user u, collaborative (or social) filtering approaches rely on the ratings of u as well as those of other users in the system.
• The key idea is that the rating of u for a new item i is likely to be similar to that of another user v, if u and v have rated other items in a similar way. Likewise, u is likely to rate two items i and j in a similar fashion, if other users have given similar ratings to these two items.

Collaborative approaches overcome some of the limitations of content-based ones.


• Items for which the content is not available or difficult to obtain can still be recommended to users
through the feedback of other users.
• Collaborative recommendations are based on the quality of items as evaluated by peers, instead of
relying on content that may be a bad indicator of quality.
• Collaborative filtering methods can recommend items with very different content, as long as other users have already shown interest in these different items.

• Collaborative filtering methods can be grouped into the two general classes of neighborhood-based and model-based methods.
• In neighborhood-based (memory-based or heuristic-based) collaborative filtering, the user-item ratings stored in the system are directly used to predict ratings for new items.
• This can be done in two ways, known as user-based or item-based recommendation.
o User-based systems, such as GroupLens (Social Computing Research at the University of Minnesota), Bellcore Video (a library toolkit for constructing and browsing libraries of digital video), and Ringo (social information filtering for music recommendation), evaluate the interest of a user u for an item i using the ratings for this item by other users, called neighbors, that have similar rating patterns. The neighbors of user u are typically the users v whose ratings on the items rated by both u and v, i.e., I_uv, are most correlated to those of u.
o Item-based approaches, on the other hand, predict the rating of a user u for an item i based on
the ratings of u for items similar to i. In such approaches, two items are similar if several users
of the system have rated these items in a similar fashion.
• Model-based approaches, in contrast, use these ratings to learn a predictive model. The general idea is to model the user-item interactions with factors representing latent characteristics of the users and items in the system, like the preference class of users and the category class of items. This model is then trained using the available data, and later used to predict ratings of users for new items. Model-based approaches for the task of recommending items are numerous and include Bayesian Clustering, Latent Semantic Analysis, Latent Dirichlet Allocation, Maximum Entropy, Boltzmann Machines, Support Vector Machines, and Singular Value Decomposition.

Memory-based collaborative approach


• Memory-based collaborative filtering (also called neighborhood-based or user-item filtering) is based on the assumption that users with similar historical preferences will continue to have similar preferences in the future. In this method, item ratings are computed in a straightforward manner by factoring in the ratings of neighboring users or items.
• In memory-based collaborative filtering, only the user-item interaction matrix is utilized to make new
recommendations to users. The whole process is based on the users’ previous ratings and interactions.
• Memory-based filtering consists of 2 methods: user-based collaborative filtering and item-based
collaborative filtering.
User-based collaborative filtering
• To suggest new recommendations to a particular user, a group of similar users (nearest
neighbors) is created based on the interactions of the reference user. The items that are most
popular in this group, but new to the target user, are used for the suggestions.
• User-based CF algorithms recommend items to a user based on the preferences of similar users. The
algorithm first identifies a set of similar users, also known as nearest neighbors, based on their past
interactions with items. The similarity between users is typically measured using distance metrics such
as cosine similarity or Pearson correlation. Once the nearest neighbors are identified, the algorithm
predicts the rating of an item for the active user by aggregating the ratings of that item from the nearest
neighbors.
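
As a minimal sketch (the toy matrix, similarity choice, and k are illustrative, not a production implementation), a user-based prediction can be written as:

```python
import numpy as np

# Toy user-item matrix (made-up data); rows are users, 0 = unrated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return (a @ b) / d if d else 0.0

def predict_user_based(R, u, i, k=2):
    """Predict R[u, i] from the k most similar users who rated item i."""
    sims = np.array([cosine(R[u], R[v]) if v != u else -np.inf
                     for v in range(R.shape[0])])
    neighbors = [v for v in np.argsort(sims)[::-1] if R[v, i] > 0][:k]
    w = sims[neighbors]
    return float(w @ R[neighbors, i] / np.abs(w).sum())

print(round(predict_user_based(R, u=1, i=1), 2))  # e.g. 2.34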

Item-based collaborative filtering


• In item-based filtering, new recommendations are selected based on the old interactions of the target
user. First, all the items that the user has already liked are considered. Then, similar products are
computed and clusters are made (nearest neighbors). New items from these clusters are suggested to the
user.

• Item-based CF algorithms recommend items to a user based on the similarity of items to items that
the user has interacted with in the past. The algorithm first identifies a set of similar items based
on their attributes or features. The similarity between items is typically measured using distance
metrics or similarity measures such as Jaccard similarity or cosine similarity. Once the similar
items are identified, the algorithm recommends to the active user items that are similar to items
that the user has liked in the past.
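
A matching item-based sketch on the same kind of toy data (again an illustrative sketch, not a definitive implementation):

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def predict_item_based(R, u, i, k=2):
    """Predict R[u, i] from the k items most similar to i that u has rated."""
    items = R.T                                  # item vectors across users
    norms = np.linalg.norm(items, axis=1)
    denom = np.where(norms * norms[i] == 0, 1, norms * norms[i])
    sims = items @ items[i] / denom              # cosine similarity to item i
    sims[i] = -np.inf                            # exclude the item itself
    rated = [j for j in np.argsort(sims)[::-1] if R[u, j] > 0][:k]
    w = sims[rated]
    return float(w @ R[u, rated] / np.abs(w).sum())

print(round(predict_item_based(R, u=0, i=2), 2))  # e.g. 1.66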

Advantages of Memory-Based Collaborative Filtering:

• Simplicity: Memory-based approaches are intuitive and simple to implement, making them a
viable option for solving problems with moderately big datasets in a short amount of time.
• Transparency: Memory-Based systems’ suggestions are easy to understand since they are
grounded in the user’s and the item’s direct interactions.
• Serendipity: Memory-based filtering has the potential to provide serendipitous
recommendations, in which users stumble onto previously unknown but potentially
fascinating content through shared relationships with other users
Drawbacks of Memory-Based Collaborative Filtering:
• Sparsity and Scalability: Since the frequency of user-item interactions tends to decrease as the dataset
expands, it becomes more difficult to discover trustworthy neighbours and might cause scaling
problems.
• Cold Start: Memory-based systems struggle when there are too few interactions from new users or for new items to make reliable suggestions.
• Limited Representation: Memory-based approaches may provide subpar results because they fail to
fully capture complicated patterns in the data.
Model-based collaborative approach
• Model-based collaborative filtering uses a statistical or machine learning model, instead of a predetermined set of rules, to identify and exploit hidden links and patterns in the data. These models are then used to estimate users' preferences for unseen items based on their training data of past interactions between users and items.
• In the model-based approach, machine learning models are used to predict and rank interactions
between users and the items they haven’t interacted with yet. These models are trained using the
interaction information already available from the interaction matrix by deploying different
algorithms like matrix factorization, deep learning, clustering, etc.

Matrix factorization
Matrix factorization is used to generate latent features by decomposing the sparse user-item interaction matrix
into two smaller and dense matrices of user and item entities.

Matrix factorization is a popular technique used in Collaborative Filtering (CF) for recommendation systems.
CF is a method to predict a user's interests by collecting preferences or behavior information from many users. Matrix factorization is particularly effective in collaborative filtering because it can handle the sparsity
of user-item interaction data.
Here's how matrix factorization works in the context of collaborative filtering:
1. Understanding the Data Matrix:
• Assume you have a matrix R representing user-item interactions. Rows correspond to users, columns correspond to items, and the entries R_ui represent user u's interaction (like a rating, purchase, or view) with item i. However, most entries are unknown (missing) because not all users interact with all items.
2. Objective of Matrix Factorization:
• The goal of matrix factorization in CF is to decompose this sparse matrix R into the product of two lower-dimensional matrices U and I:

R ≈ U × I^T

• Here, U (an m × k matrix) represents user embeddings, where each row u (out of m rows) corresponds to a user's latent factors in a k-dimensional space.
• I (an n × k matrix) represents item embeddings, where each row i (out of n rows) corresponds to an item's latent factors in the same k-dimensional space.
3. Matrix Factorization Process:
• Matrix factorization aims to learn the matrices U and I by minimizing the reconstruction error between R and U × I^T. This is typically achieved through optimization techniques like gradient descent, alternating least squares, or stochastic gradient descent.
• The objective function can be formulated as:

minimize Σ_{(u,i)∈observed} ( R_ui − (U × I^T)_ui )² + λ ( ‖U‖² + ‖I‖² )

where λ is a regularization parameter to prevent overfitting.
4. Prediction and Recommendations:
Once the matrices U and I are learned, the missing entries in R can be estimated as U × I^T. Recommendations for a user u can be made by suggesting items that have the highest predicted scores (entries in U × I^T) for that user but have not been interacted with yet.
5. Key Advantages:
• Matrix factorization is effective in handling sparsity because it leverages latent factors to
capture user and item interactions.
• It can provide personalized recommendations even for users with very few interactions.
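
As a hedged sketch of the process above (toy data and untuned hyperparameters, not a definitive implementation), plain stochastic gradient descent on the observed entries looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 = unobserved
m, n, k = R.shape[0], R.shape[1], 2
U = 0.1 * rng.standard_normal((m, k))       # user latent factors
I = 0.1 * rng.standard_normal((n, k))       # item latent factors
lr, lam = 0.01, 0.1                         # learning rate, regularization λ
observed = [(u, i) for u in range(m) for i in range(n) if R[u, i] > 0]

for epoch in range(2000):                   # SGD on the regularized loss
    for u, i in observed:
        err = R[u, i] - U[u] @ I[i]         # R_ui - (U I^T)_ui
        U[u] += lr * (err * I[i] - lam * U[u])
        I[i] += lr * (err * U[u] - lam * I[i])

print(np.round(U @ I.T, 2))                 # dense predicted rating matrix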
Advantages of Model-Based Collaborative Filtering:
• Scalability: Model-Based approaches outperform Memory-Based ones in dealing with big and
sparse datasets because they learn underlying patterns without making direct comparisons of users
or things.
• Cold Start Mitigation: By using supplementary data or a hybrid method, model-based filtering may
help with the cold start issue.
• Flexibility: Model-based methods may use a wide variety of data and attributes, allowing for the
incorporation of context to enhance suggestions.

Drawbacks of Model-Based Collaborative Filtering:


• Complexity: Due to the complexity of the models they need, the development and tuning of model-
based approaches often take more time and skill.
• Black Box: High accuracy is possible with Model-Based filtering, although the models’ inner
workings may be less visible and interpretable than those of Memory-Based approaches.
• Overfitting: Overfitting is a problem in model-based systems when there is insufficient data, and this may result in suggestions that are too heavily weighted towards prior interactions.
Hybrid Approaches
Hybrid approaches in Collaborative Filtering (CF) combine different methods or techniques to overcome
limitations and enhance the performance of recommendation systems. These approaches leverage the
strengths of multiple recommendation strategies, such as collaborative filtering (CF) and content-based
filtering (CBF), to provide more accurate and diverse recommendations. Here's a breakdown of hybrid
approaches in CF:
1. Collaborative Filtering (CF):
• Collaborative Filtering methods recommend items based on user-item interactions or
similarities between users. This can be user-based CF (recommending items liked by similar
users) or item-based CF (recommending similar items to those a user has liked).
2. Content-Based Filtering (CBF):
• Content-Based Filtering recommends items based on their features or attributes. It analyzes
item descriptions or user profiles to suggest items that are similar in content to previously
liked items.
3. Types of Hybrid Approaches:
a. Weighted Hybrid:
• In this approach, predictions from different recommendation techniques (e.g., CF and CBF)
are combined using weighted averages or other blending methods. The weights can be fixed
or learned based on data.
b. Feature Combination:
• Features derived from both CF and CBF methods are combined to create a unified feature
representation. Machine learning algorithms can then use this combined feature representation
to make recommendations.
c. Cascade or Switch Hybrid:
• Recommendations from one method (e.g., CF) are used to filter or augment recommendations
from another method (e.g., CBF). This can improve recommendation accuracy by leveraging
the strengths of both methods.
d. Meta-Level Hybrid:
• In this approach, predictions from different recommendation algorithms are treated as input
features to a meta-learner (e.g., a machine learning model). The meta-learner then combines
these predictions to generate final recommendations.
Advantages of Hybrid Approaches:
• Improved Accuracy: By combining multiple methods, hybrid approaches can mitigate weaknesses
and improve recommendation accuracy.
• Diversity: Hybrid methods can provide more diverse recommendations by leveraging different
recommendation strategies.
• Robustness: They are more robust to data sparsity and the cold start problem compared to individual
CF or CBF methods.
• Improved Performance: Hybrid techniques might possibly provide higher overall performance by
using the capabilities of Memory-Based and Model-Based methodologies.
• Cold Start Mitigation: Cold-starting difficulties may be mitigated with the use of hybrid technology.
Examples:
• Netflix's recommendation system uses a hybrid approach, combining collaborative filtering
(based on user ratings) with content-based filtering (analyzing movie attributes like genre).
• Amazon's recommendation system also uses a hybrid approach, combining user-item
interactions with item attributes and user demographics.

Movie Recommendation System


Data:
• User Preferences: User ratings for movies.
• Movie Attributes: Genre, director, actors, release year, etc.
Hybrid Approach Components:
1. Collaborative Filtering (CF):
• Idea: Recommend movies based on user behavior and preferences.
• Implementation:
• Use matrix factorization (such as Singular Value Decomposition) to learn latent factors from user-item interactions (ratings).
• Predict ratings for unseen movies based on similar users' preferences.
2. Content-Based Filtering (CBF):
• Idea: Recommend movies based on the attributes or content of the items.
• Implementation:
• Extract features from movies such as genre, director, actors, release year.
• Build a profile for each user based on their rated movies.
• Recommend movies that are similar in content to the ones a user has liked.
3. Hybridization:
• Combining CF and CBF:
• Weighted Approach: Combine scores from CF and CBF using a weighted sum or
other fusion techniques.
• Switching Strategy: Use CF for some users and CBF for others based on data
availability or performance metrics.
• Feature Combination: Include content-based features (e.g., movie genres, director)
as additional input to the collaborative filtering model.
Recommendation Process:
• For a New User:
• If the user has not rated any movies yet:
• Use CBF to recommend movies based on their provided preferences (e.g., preferred
genres).
• Once the user rates some movies:
• Incorporate these ratings into the CF model to provide personalized recommendations.
• For Existing Users:
• Use the hybrid approach to generate recommendations:
• Combine CF predictions (based on user-item interactions) with CBF recommendations
(based on movie attributes).
• Present the top-rated hybrid recommendations to the user.
Benefits of Hybrid Approach:
• Increased Accuracy: Combining multiple recommendation techniques can lead to more accurate
predictions.
• Improved Coverage: Content-based filtering can recommend items even when user-item interactions
are sparse (cold start problem).
• Enhanced Personalization: Incorporating user preferences (CBF) along with user-item interactions
(CF) leads to more personalized recommendations.
In this movie recommendation system example, the hybrid approach leverages both collaborative filtering and
content-based filtering techniques to provide diverse and accurate movie recommendations tailored to
individual users' tastes and preferences. Hybridization allows for a more robust recommendation system that
can handle various scenarios and user behaviours effectively.

NEAREST NEIGHBOR COLLABORATIVE FILTERING

• Neighborhood-based recommender systems fall under the collaborative filtering umbrella and focus
on using behavioral patterns, such as movies that users have watched in the past, to identify similar
users (i.e., users who demonstrate similar preferences), or similar items (i.e., items that receive similar
interest from the same users).
• Nearest Neighbors Collaborative Filtering (NNCF) is a technique used in recommendation systems
to predict user preferences based on the similarity between users or items. It falls under the umbrella
of Collaborative Filtering (CF), which utilizes the collective wisdom of users to make
recommendations.
• User-based Collaborative Filtering (UBCF):
o Predict a user's preference for an item by finding similar users based on their historical
ratings.
• Item-based Collaborative Filtering (IBCF):
o Predict a user's preference for an item by finding similar items based on how users have rated
them.
Steps Involved Nearest Neighbors Collaborative Filtering
Step-1: Data Representation: Represent user-item interactions as a matrix R, where rows correspond to users and columns correspond to items. Each entry R_ui represents user u's rating of (or interaction with) item i.

Step-2: Similarity Calculation: Compute similarity between users (for UBCF) or items (for IBCF) based on
their rating patterns. Common similarity metrics include cosine similarity, Pearson correlation, or Jaccard
similarity.
Step-3: Nearest Neighbors Selection: For a given user u (or item i), identify the k most similar users (or
items) based on the computed similarity scores.
Nearest Neighbors are typically selected based on the highest similarity scores.
Step-4: Prediction:
• UBCF Prediction: Predict user u's rating for item i by averaging the ratings of the k nearest
Neighbors who have rated item i, weighted by their similarity to user u.
• IBCF Prediction: Predict user u's rating for item i by combining ratings of items similar to item i,
weighted by the similarity between items.
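
As an illustration of Steps 1-3 (the matrix is made-up data, and this assumes scikit-learn is available), cosine-based nearest neighbors for each user row can be retrieved as follows:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix; rows are users.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Steps 1-3: represent users as rating vectors, use cosine distance,
# and retrieve each user's nearest neighbors (k=2, plus the user itself).
knn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(R)
distances, indices = knn.kneighbors(R)

for u in range(R.shape[0]):
    # The first neighbor is the user itself (distance 0), so we skip it.
    print(f"user {u}: neighbors {indices[u][1:]}, "
          f"similarities {np.round(1 - distances[u][1:], 3)}")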

• We refer to the technique that computes similar users as user-based and to the technique that focuses
on computing similar items as item-based.
• An example of the item-based technique is Netflix’s “Because you watched…” feature, which
recommends movies or shows based on examples that users previously showed interest in.
• An example of a user-based recommender system is booking.com, which recommends destinations
based on the historical behavior of other users with similar travel history.

Pipeline Overview

The image below summarizes the pipeline for our implementation of item-based and user-based recommender
systems in our declarative language, Rel. Without loss of generality, we focus on a movie recommendation
use case, where we are given interactions between users and movies.
Step 1: We convert user-item interactions to a bipartite graph.
The first step is to convert the input interactions data to a bipartite graph that contains two types of nodes:
Users and Movies, as shown in the image below.

The two node types are connected by an edge that we call watched. In Rel, Users and Movies are represented by entity types, and their attributes, such as id and name, are represented by value types.

Step 2: MovieLens Graph.


• We use user-item interactions to compute item-item and user-user similarities by leveraging the
functions supported by the graph analytics library.
• Once we define the entity and value types, the next step is to populate the entities with data from the
original MovieLens dataset.
• Assuming we have a relation called watched_train(user, movie) that represents the train subset of the
MovieLens data and contains the watch history for the users, and a relation called movie_info(movie,
movie_name) that contains the movie names, we create the Movie entity from these relations. The User entity is created similarly. Finally, we add an additional edge called watched that connects the movie entities to the user entities.

Step 3: Similarity Computation.


• We use the similarities to predict the scores for all (user, movie) pairs. Each score is an indication of
how likely it is for a user to interact with a movie.
• Now that we have modeled our data as a graph, we can compute item-item and user-user similarities
using the user-item interactions: movies that have been watched by the same users will have a high
similarity value, while movies that have been watched by different users will have a low similarity
value.
• Here, we focus on the item-based method. The approach for the user-based method is very similar.
There are several similarity metrics that can be used for this task. Currently, the Rel graph library provides the cosine_similarity and jaccard_similarity relations.

Step 4: Scoring
• We sort the scores for every user in order to generate top-k recommendations.
• Using the similarities calculated in the previous step, we then compute the (user, movie) scores for all
pairs. We predict that a user will watch movies that are similar to the movies they have watched in the
past (item-based approach).
The score for a pair (user, movie) indicates how likely it is for a user to watch a movie and is calculated as follows:

Score_{u,i} = Σ_{n ∈ N[i] ∩ W[u]} S_{i,n}

Where:

• Score_{u,i} is the predicted score for user u and item i
• N[i] is the set of item i's nearest neighbors
• W[u] is the set of items watched by user u
• S_{i,n} is the similarity score between items i and n

That is, the score is the sum of the similarity scores of the target movie's nearest neighbors that have been watched by the target user.
• The pred relation takes the following inputs:
• neighborhood_size: The number of similar movies (neighbors) we use to predict the score
• M: The relation containing (movie, user) pairs
• S: The similarity metric, e.g., jaccard, cosine
• T: The relation that selects the top neighborhood_size most similar movies to the target movie
(i.e., the nearest neighbors).
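
Rel itself is declarative, but the same scoring rule can be sketched in Python; the watched, sim, and nbrs structures below are hypothetical stand-ins for the pipeline's relations, not the Rel library's API:

```python
# A minimal sketch of Score_{u,i} = sum of S_{i,n} over n in N[i] ∩ W[u].
def score(user, movie, watched, sim, nbrs):
    """Sum the similarities of movie's neighbors that the user watched."""
    return sum(sim[(movie, n)] for n in nbrs[movie] if n in watched[user])

watched = {"alice": {"m1", "m2"}}                 # W[u] (placeholder data)
sim = {("m3", "m1"): 0.9, ("m3", "m2"): 0.4,      # S_{i,n}
       ("m3", "m4"): 0.8}
nbrs = {"m3": ["m1", "m2", "m4"]}                 # N[i]
print(score("alice", "m3", watched, sim, nbrs))   # 0.9 + 0.4 = 1.3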

Step 5: We evaluate performance using evaluation metrics that are widely used for recommender systems

User-Based Collaborative Filtering

User-Based Collaborative Filtering is a technique used to predict the items that a user might like on the basis of ratings given to those items by other users who have tastes similar to those of the target user. Many websites use collaborative filtering for building their recommendation systems.
Step 1: Finding the similarity of users to the target user U. The similarity of any two users a and b can be calculated with the Pearson correlation over their co-rated items P:

Sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )

where r̄_a and r̄_b are the mean ratings of the two users.
Step 2: Prediction of the missing rating of an item. The target user might be very similar to some users and less similar to others. Hence, the ratings given to a particular item by more similar users should be given more weight than those given by less similar users. This problem can be solved by using a weighted average approach, in which the rating of each user is multiplied by a similarity factor calculated with the above formula. The missing rating can be calculated as

r̂_{U,i} = r̄_U + Σ_v Sim(U, v) · (r_{v,i} − r̄_v) / Σ_v |Sim(U, v)|

Example: User-Based Collaborative Filtering


Consider a matrix that shows how four users, Alice, U1, U2, and U3, rated different news apps. The ratings range from 1 to 5 based on how much each user likes the app. The '?' indicates that the user has not rated the app.

| Name  | Inshorts (I1) | HT (I2) | NYT (I3) | TOI (I4) | BBC (I5) |
|-------|---------------|---------|----------|----------|----------|
| Alice | 5 | 4 | 1 | 4 | ? |
| U1    | 3 | 1 | 2 | 3 | 3 |
| U2    | 4 | 3 | 4 | 3 | 5 |
| U3    | 3 | 3 | 1 | 5 | 4 |

Step 1: Calculating the similarity between Alice and all the other users. First, we calculate the average rating of each user, excluding I5 since Alice has not rated it:

r̄_Alice = (5 + 4 + 1 + 4) / 4 = 14/4 = 3.5
r̄_U1 = (3 + 1 + 2 + 3) / 4 = 9/4 = 2.25
r̄_U2 = (4 + 3 + 4 + 3) / 4 = 14/4 = 3.5
r̄_U3 = (3 + 3 + 1 + 5) / 4 = 12/4 = 3

Subtracting each user's average from their ratings, we get the following mean-centered matrix:

| Name  | Inshorts (I1) | HT (I2) | NYT (I3) | TOI (I4) |
|-------|---------------|---------|----------|----------|
| Alice | 1.5 | 0.5 | -2.5 | 0.5 |
| U1    | 0.75 | -1.25 | -0.25 | 0.75 |
| U2    | 0.5 | -0.5 | 0.5 | -0.5 |
| U3    | 0 | 0 | -2 | 2 |

Now, we calculate the similarity between Alice and each other user as the cosine of their mean-centered rating vectors, which gives Sim(Alice, U1) ≈ 0.301, Sim(Alice, U2) ≈ −0.33, and Sim(Alice, U3) ≈ 0.707.

Step 2: Predicting the rating of the app not rated by Alice. Now, we predict Alice's rating for the BBC News app:

r̂_{Alice,BBC} = 3.5 + [0.301·(3 − 2.25) + (−0.33)·(5 − 3.5) + 0.707·(4 − 3)] / (|0.301| + |−0.33| + |0.707|)
= 3.5 + (0.226 − 0.495 + 0.707) / 1.338
≈ 3.5 + 0.33 = 3.83
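
As a quick check, this self-contained snippet reproduces the similarities and the predicted rating (small rounding differences aside):

```python
# Cosine similarity on mean-centered ratings, as in the worked example.
ratings = {"Alice": [5, 4, 1, 4], "U1": [3, 1, 2, 3],
           "U2": [4, 3, 4, 3], "U3": [3, 3, 1, 5]}
bbc = {"U1": 3, "U2": 5, "U3": 4}  # ratings for BBC (I5)

def mean(r):
    return sum(r) / len(r)

def sim(a, b):
    ca = [x - mean(a) for x in a]          # mean-center both vectors
    cb = [y - mean(b) for y in b]
    num = sum(x * y for x, y in zip(ca, cb))
    den = (sum(x * x for x in ca) ** 0.5) * (sum(y * y for y in cb) ** 0.5)
    return num / den

sims = {u: sim(ratings["Alice"], ratings[u]) for u in ("U1", "U2", "U3")}
num = sum(sims[u] * (bbc[u] - mean(ratings[u])) for u in sims)
pred = mean(ratings["Alice"]) + num / sum(abs(s) for s in sims.values())
print({u: round(s, 2) for u, s in sims.items()})  # ≈ 0.30, -0.33, 0.71
print(round(pred, 2))  # 3.82 (3.83 above, using the rounded similarities)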

Item-to-Item Based Collaborative Filtering

• Collaborative Filtering is a technique or a method to predict a user’s taste and find the items that a
user might prefer on the basis of information collected from various other users having similar tastes
or preferences.
• It takes into consideration the basic fact that if person X and person Y have a similar reaction to some items, then they might have the same opinion for other items too.
• The two most popular forms of collaborative filtering are:
• User Based: Here, we look for the users who have rated various items in the same way and then find
the rating of the missing item with the help of these users.
• Item Based: Here, we explore the relationship between the pair of items (the user who bought Y, also
bought Z). We find the missing rating with the help of the ratings given to the other items by the user.
• Item to Item Similarity: The similarity between item pairs can be found in different ways. One of the most common methods is to use cosine similarity, which for two item rating vectors A and B is cos(A, B) = (A · B) / (‖A‖ ‖B‖).

• Prediction Computation: The second stage is to generate predictions. The system uses the items already rated by the user that are most similar to the missing item to generate a rating. We hence try to generate predictions based on the ratings of similar products, using a formula that computes the rating for a particular item as a weighted sum of the user's ratings of the other similar items.

Example: Item-to-Item Based Collaborative Filtering


Given below is a table that contains some items and the users who have rated them. The ratings are explicit and on a scale of 1 to 5. Each entry in the table denotes the rating given by the i-th user to the j-th item. In most cases, the majority of cells are empty, as a user rates only a few items. Here, we have taken 4 users and 3 items. We need to find the missing ratings for the respective users.

| User/Item | Item_1 | Item_2 | Item_3 |
|-----------|--------|--------|--------|
| User_1 | 2 | – | 3 |
| User_2 | 5 | 2 | – |
| User_3 | 3 | 3 | 1 |
| User_4 | – | 2 | 2 |

Example:
Step 1: Finding similarities of all the item pairs.
Form the item pairs: in this example, the pairs are (Item_1, Item_2), (Item_1, Item_3), and (Item_2, Item_3). For each pair, we find all the users who have rated both items, form a vector for each item over those users, and calculate the similarity between the two items using the cosine formula stated above.

Sim(Item_1, Item_2)
In the table, we can see that only User_2 and User_3 have rated both Item_1 and Item_2.
Thus, let I1 be the vector for Item_1 and I2 for Item_2. Then,
I1 = 5U2 + 3U3 and
I2 = 2U2 + 3U3, so
Sim(Item_1, Item_2) = (5·2 + 3·3) / (√(5² + 3²) · √(2² + 3²)) = 19 / (√34 · √13) ≈ 0.904

Sim(Item_2, Item_3)
In the table, we can see that only User_3 and User_4 have rated both Item_2 and Item_3.
Thus, let I2 be the vector for Item_2 and I3 for Item_3. Then,
I2 = 3U3 + 2U4 and
I3 = 1U3 + 2U4, so
Sim(Item_2, Item_3) = (3·1 + 2·2) / (√(3² + 2²) · √(1² + 2²)) = 7 / (√13 · √5) ≈ 0.868

Sim(Item_1, Item_3)
In the table, we can see that only User_1 and User_3 have rated both Item_1 and Item_3.
Thus, let I1 be the vector for Item_1 and I3 for Item_3. Then,
I1 = 2U1 + 3U3 and
I3 = 3U1 + 1U3, so
Sim(Item_1, Item_3) = (2·3 + 3·1) / (√(2² + 3²) · √(3² + 1²)) = 9 / (√13 · √10) ≈ 0.789

Step 2: Generating the missing ratings in the table

Now, in this step, we calculate the missing ratings as similarity-weighted sums of each user's other ratings:

Rating of Item_2 for User_1 = (0.904·2 + 0.868·3) / (0.904 + 0.868) ≈ 2.49

Rating of Item_3 for User_2 = (0.789·5 + 0.868·2) / (0.789 + 0.868) ≈ 3.43

Rating of Item_1 for User_4 = (0.904·2 + 0.789·2) / (0.904 + 0.789) = 2.0
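
The following short script verifies the cosine similarities and weighted-sum predictions above:

```python
from math import sqrt

def cos(a, b):
    """Cosine similarity between two rating vectors."""
    return sum(x * y for x, y in zip(a, b)) / (
        sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

s12 = cos([5, 3], [2, 3])  # co-raters of Items 1 and 2: User_2, User_3
s23 = cos([3, 2], [1, 2])  # co-raters of Items 2 and 3: User_3, User_4
s13 = cos([2, 3], [3, 1])  # co-raters of Items 1 and 3: User_1, User_3
print(round(s12, 3), round(s23, 3), round(s13, 3))  # 0.904 0.868 0.789

# Weighted-sum predictions from each user's other ratings:
r_u1_i2 = (s12 * 2 + s23 * 3) / (s12 + s23)  # ≈ 2.49
r_u2_i3 = (s13 * 5 + s23 * 2) / (s13 + s23)  # ≈ 3.43
r_u4_i1 = (s12 * 2 + s13 * 2) / (s12 + s13)  # = 2.0
print(round(r_u1_i2, 2), round(r_u2_i3, 2), round(r_u4_i1, 2))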

Advantages:
• Simple and intuitive approach to collaborative filtering.
• Effective in scenarios where users/items have sparse interactions.
• Can capture complex user-item relationships based on similarity metrics.
Challenges and Considerations:
• Data Sparsity: Nearest Neighbors CF may struggle with sparse datasets, where not all users
have rated many items.
• Scalability: Computing pairwise similarities can be computationally expensive for large
datasets.
• Cold Start Problem: Nearest Neighbors CF may face challenges when dealing with new users or
items with few ratings.

COMPONENTS OF NEIGHBORHOOD METHODS

Three very important considerations in the implementation of a neighborhood-based recommender system are:
1) the normalization of ratings,
2) the computation of the similarity weights, and
3) the selection of neighbors.

• Neighborhood methods, a class of collaborative filtering algorithms, rely on the concept of finding
a "neighborhood" of users or items similar to a target user or item. These methods are based on
the idea that users who have similar preferences tend to like similar items, and vice versa. The key
components of neighborhood methods include:

• Similarity Measure: Neighborhood methods use a similarity measure to quantify the similarity between
users or items. Common similarity measures include cosine similarity, Pearson correlation coefficient,
and Jaccard similarity. The choice of similarity measure can significantly affect the performance of the
algorithm.

• Neighborhood Selection: Once the similarity between users or items is computed, the next step is to
select a subset of neighbors that are most similar to the target user or item. This subset is known as the
neighborhood. The size of the neighborhood, i.e., the number of nearest neighbors to consider, can be
fixed or adaptive.

• Rating Prediction: After selecting the neighborhood, the algorithm predicts the rating of a target user
for an item by aggregating the ratings of its neighbors for that item. This can be done using various
aggregation functions such as weighted average, weighted sum, or regression-based methods.

• Item or User-Based Approach: Neighborhood methods can be either item-based or user-based. In item-
based approaches, similarities between items are computed based on the ratings given by users, and
recommendations are made by finding items similar to those the user has liked. In user-based approaches,
similarities between users are calculated based on their rating patterns, and recommendations are made
by identifying users similar to the target user and recommending items they have liked.

• Rating Normalization: To improve the accuracy of predictions, rating normalization techniques may be
applied. These techniques adjust the ratings to account for user or item biases, such as users who tend to
rate items more positively or items that are consistently rated higher or lower than others.

• Sparse Data Handling: Neighborhood methods often face the challenge of dealing with sparse data,
where many user-item pairs have no ratings. Various strategies such as neighborhood expansion,
imputation, or incorporating auxiliary information may be employed to handle sparse data and improve
recommendation quality.

Components of neighborhood methods -Rating Normalization

• When it comes to assigning a rating to an item, each user has their own personal scale. Even if an explicit definition of each of the possible ratings is supplied (e.g., 1 = "strongly disagree", 2 = "disagree", 3 = "neutral", etc.), some users might be reluctant to give high/low scores to items they liked/disliked.
• Two of the most popular rating normalization schemes that have been proposed to convert individual ratings to a more universal scale are mean-centering and Z-score normalization.

I. Mean-centering

Mean-centering converts a raw rating into a deviation from an average rating. In user-based methods, a rating r_ui is normalized by subtracting the mean of user u's ratings, giving the user-mean-centered rating r_ui − r̄_u. In item-based methods, the mean rating r̄_i of item i is subtracted instead, giving the item-mean-centered rating r_ui − r̄_i.
Example: Consider the following ratings matrix:

|       | The Matrix | Titanic | Die Hard | Forrest Gump | Wall-E |
|-------|------------|---------|----------|--------------|--------|
| John  | 5 | 1 | – | 2 | 2 |
| Lucy  | 1 | 5 | 2 | 5 | 5 |
| Eric  | 2 | – | 3 | 5 | 4 |
| Diane | 4 | 3 | 5 | 3 | – |

Although Diane gave an average rating of 3 to the movies "Titanic" and "Forrest Gump", the user-mean-centered ratings show that her appreciation of these movies is in fact negative. This is because her ratings are high on average, and so an average rating corresponds to a low degree of appreciation. Differences are also visible when comparing the two types of mean-centering. For instance, the item-mean-centered rating of the movie "Titanic" is neutral, instead of negative, due to the fact that much lower ratings were given to that movie.

Therefore, we have:

r̄_John = (5 + 1 + 2 + 2) / 4 = 10/4 = 2.5
r̄_Lucy = (1 + 5 + 2 + 5 + 5) / 5 = 18/5 = 3.6
r̄_Eric = (2 + 3 + 5 + 4) / 4 = 14/4 = 3.5
r̄_Diane = (4 + 3 + 5 + 3) / 4 = 15/4 = 3.75

Subtracting each user's mean from their ratings, we get the following user-mean-centered matrix:

|       | The Matrix | Titanic | Die Hard | Forrest Gump | Wall-E |
|-------|------------|---------|----------|--------------|--------|
| John  | 2.5 | -1.5 | – | -0.5 | -0.5 |
| Lucy  | -2.6 | 1.4 | -1.6 | 1.4 | 1.4 |
| Eric  | -1.5 | – | -0.5 | 1.5 | 0.5 |
| Diane | 0.25 | -0.75 | 1.25 | -0.75 | – |

Likewise, Diane's appreciation for "The Matrix" and John's distaste for "Forrest Gump" are more pronounced in the item-mean-centered ratings.

For item-mean-centering, we have:

r̄_Matrix = (5 + 1 + 2 + 4) / 4 = 12/4 = 3
r̄_Titanic = (1 + 5 + 3) / 3 = 9/3 = 3
r̄_WallE = (2 + 5 + 4) / 3 = 11/3 ≈ 3.67
r̄_DieHard = (2 + 3 + 5) / 3 = 10/3 ≈ 3.33
r̄_ForrestGump = (2 + 5 + 5 + 3) / 4 = 15/4 = 3.75
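
A minimal NumPy sketch of both normalizations, using the ratings matrix from the example above (columns ordered as The Matrix, Titanic, Die Hard, Forrest Gump, Wall-E; np.nan marks missing ratings):

```python
import numpy as np

# Rows: John, Lucy, Eric, Diane; nan = not rated.
R = np.array([[5, 1, np.nan, 2, 2],
              [1, 5, 2, 5, 5],
              [2, np.nan, 3, 5, 4],
              [4, 3, 5, 3, np.nan]], dtype=float)

user_means = np.nanmean(R, axis=1, keepdims=True)  # 2.5, 3.6, 3.5, 3.75
item_means = np.nanmean(R, axis=0, keepdims=True)  # 3, 3, 3.33, 3.75, 3.67
print(np.round(R - user_means, 2))  # user-mean-centered ratings
print(np.round(R - item_means, 2))  # item-mean-centered ratings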

Criteria to be considered
When choosing between the implementation of a user-based and an item-based neighborhood
recommender system, five criteria should be considered:
• Accuracy: The accuracy of neighborhood recommendation methods depends mostly on the ratio between the number of users and items in the system. The similarity between two users in user-based methods, which determines the neighbors of a user, is normally obtained by comparing the ratings made by these users on the same items. An item-based method, on the other hand, usually computes the similarity between two items by comparing ratings made by the same user on these items. When users greatly outnumber items, as is often the case, item similarities are therefore computed from more ratings and tend to be more reliable.
• Efficiency: The memory and computational efficiency of recommender systems also depends on the ratio between the number of users and items. Thus, when the number of users exceeds the number of items, as is most often the case, item-based recommendation approaches require much less memory and time to compute the similarity weights (training phase) than user-based ones, making them more scalable. However, the time complexity of the online recommendation phase, which depends only on the number of available items and the maximum number of neighbors, is the same for user-based and item-based methods.
• Stability: The choice between a user-based and an item-based approach also depends on the frequency
and amount of change in the users and items of the system. If the list of available items is fairly static in
comparison to the users of the system, an item-based method may be preferable. On the contrary, in
applications where the list of available items is constantly changing, e.g., an online article recommender,
user-based methods could prove to be more stable.
• Justifiability: An advantage of item-based methods is that they can easily be used to justify a
recommendation. User-based methods, however, are less amenable to this process because the active user
does not know the other users serving as neighbors in the recommendation.
• Serendipity: In item-based methods, the rating predicted for an item is based on the ratings given to similar items. Consequently, recommender systems using this approach will tend to recommend items that are related to those usually appreciated by this user, which makes serendipitous recommendations less likely than with user-based methods.

II. Z-SCORE NORMALIZATION

Z-score normalization, also known as standard score normalization, is a statistical technique used to
rescale a distribution of values to have a mean of zero and a standard deviation of one. This
normalization technique is often applied to features or variables in data preprocessing to ensure that
they are on a comparable scale, which can be beneficial for certain machine learning algorithms.
The formula for calculating the Z-score of a data point x is:
z = (x − μ) / σ
Where:
• z is the Z-score.
• x is the original value.
• μ is the mean of the distribution.
• σ is the standard deviation of the distribution.
Here's how Z-score normalization works:
1. Calculate Mean and Standard Deviation: Compute the mean (μ) and standard deviation (σ) of the
data distribution.
2. Normalize Data: For each data point, subtract the mean (μ) and then divide by the standard deviation
(σ). This centers the data distribution around zero and scales it to have a standard deviation of one.
Z-score normalization is particularly useful in situations where the data distribution may have outliers
or exhibit skewness. By rescaling the data to have a mean of zero and a standard deviation of one, Z-score normalization helps to mitigate the impact of outliers and ensures that all features contribute
equally to the analysis.
It's important to note that Z-score normalization assumes that the data distribution is approximately
Gaussian (normal). If the distribution is significantly non-normal, other normalization techniques may
be more appropriate. Additionally, Z-score normalization is sensitive to outliers, so preprocessing
steps such as outlier removal or transformation may be necessary before normalization.

Example of how Z-score normalization can be applied in collaborative filtering:

Suppose we have a matrix of user ratings for items:


| | Item 1 | Item 2 | Item 3 |
|--------|--------|--------|--------|
| User 1 | 5 | 3 | 0 |
| User 2 | 4 | 0 | 0 |
| User 3 | 1 | 1 | 0 |
Calculate Mean and Standard Deviation: Compute the mean and standard deviation of each item's
ratings across all users.

For example, for Item 1:


Mean = (5 + 4 + 1) / 3 ≈ 3.33
Standard Deviation = √(((5 − 3.33)² + (4 − 3.33)² + (1 − 3.33)²) / 3) ≈ 1.70
Normalize Data: For each user-item pair, apply the Z-score normalization formula:

For example, for User 1 and Item 1:

Z-score = (5 − 3.33) / 1.70 ≈ 0.98


Similarly, for other user-item pairs, calculate the Z-scores.

The normalized ratings might look something like this:

|        | Item 1 | Item 2 | Item 3 |
|--------|--------|--------|--------|
| User 1 | 0.98 | ??? | ??? |
| User 2 | 0.39 | ??? | ??? |
| User 3 | -1.37 | ??? | ??? |
Now, all ratings are standardized such that they have a mean of approximately zero and a standard
deviation of approximately one. This normalization allows for fair comparison of ratings across users
and items, which is useful in collaborative filtering algorithms where similarities between users or
items are computed based on these ratings.

Consider, two users A and B that both have an average rating of 3. Moreover, suppose that the
ratings of A alternate between 1 and 5, while those of B are always 3. A rating of 5 given to an
item by B is more exceptional than the same rating given by A, and, thus, reflects a greater
appreciation for this item.
While mean-centering removes the offsets caused by the different perceptions of an average rating, Z-score normalization also considers the spread in the individual rating scales.
Once again, this is usually done differently in user-based than in item-based recommendation. In user-based methods, the normalization of a rating r_ui divides the user-mean-centered rating by the standard deviation σ_u of the ratings given by user u:

h(r_ui) = (r_ui − r̄_u) / σ_u

A user-based prediction of rating r_ui using this normalization approach would therefore be obtained as

r̂_ui = r̄_u + σ_u · ( Σ_{v∈N_i(u)} w_uv (r_vi − r̄_v)/σ_v ) / ( Σ_{v∈N_i(u)} |w_uv| )

Likewise, the Z-score normalization of r_ui in item-based methods divides the item-mean-centered rating by the standard deviation σ_i of the ratings given to item i:

h(r_ui) = (r_ui − r̄_i) / σ_i

The item-based prediction of rating r_ui would then be

r̂_ui = r̄_i + σ_i · ( Σ_{j∈N_u(i)} w_ij (r_uj − r̄_j)/σ_j ) / ( Σ_{j∈N_u(i)} |w_ij| )

• Comparing mean-centering with Z-score, as mentioned, the second one has the additional benefit of
considering the variance in the ratings of individual users or items. This is particularly useful if the
rating scale has a wide range of discrete values or if it is continuous. On the other hand, because the
ratings are divided and multiplied by possibly very different standard deviation values, Z-score can be
more sensitive than mean-centering and, more often, predict ratings that are outside the rating scale.
• Finally, if rating normalization is not possible or does not improve the results, another possible
approach to remove the problems caused by the individual rating scale is preference-based filtering.
The particularity of this approach is that it focuses on predicting the relative preferences of users instead
of absolute rating values. Since the rating scale does not change the preference order for items,
predicting relative preferences removes the need to normalize the ratings.

SIMILARITY WEIGHT COMPUTATION


The similarity weights play a double role in neighborhood-based recommendation
methods:
1) they allow the selection of trusted neighbors whose ratings are used in the prediction
2) they provide the means to give more or less importance to these neighbors in the prediction.
A measure of the similarity between two objects a and b, often used in information retrieval, consists in representing these objects as two vectors x_a and x_b and computing the Cosine Vector (CV) (or Vector Space) similarity [7, 8, 44] between these vectors:

cos(x_a, x_b) = (x_a · x_b) / (‖x_a‖ ‖x_b‖)

The similarity between two users u and v would then be computed as

CV(u, v) = Σ_{i∈I_uv} r_ui r_vi / √( Σ_{i∈I_uv} r_ui² · Σ_{j∈I_uv} r_vj² )

where I_uv once more denotes the set of items rated by both u and v. A problem with this measure is that it does not consider the differences in the mean and variance of the ratings made by users u and v.
A popular measure that compares ratings where the effects of mean and variance have been removed is the Pearson Correlation (PC) similarity:

PC(u, v) = Σ_{i∈I_uv} (r_ui − r̄_u)(r_vi − r̄_v) / √( Σ_{i∈I_uv} (r_ui − r̄_u)² · Σ_{i∈I_uv} (r_vi − r̄_v)² )

Written for two users x and y, this gives the Pearson correlation coefficient r_xy.

The Pearson correlation coefficient ranges from -1 to 1:


• 𝑟𝑥𝑦 =1 indicates a perfect positive correlation, meaning that the ratings of users x and y are perfectly
linearly related (i.e., when one user rates an item highly, the other user also tends to rate it highly).
• 𝑟𝑥𝑦 =−1 indicates a perfect negative correlation, meaning that the ratings of users x and y are perfectly
inversely related (i.e., when one user rates an item highly, the other user tends to rate it poorly).
• 𝑟𝑥𝑦 =0 indicates no linear correlation between the ratings of users x and y.
• In collaborative filtering, Pearson correlation similarity is used to identify similar users or items based
on their rating patterns. Users or items with higher Pearson correlation coefficients are considered more
similar, and their ratings can be used to make recommendations for each other.
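
A small pure-Python Pearson similarity function; the sample vectors are Users 1 and 2 from the example that follows:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sqrt(sum((a - mx) ** 2 for a in x)) *
           sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

print(round(pearson([5, 3, 0, 1], [4, 0, 0, 1]), 3))  # ≈ 0.774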

Example Of Pearson Correlation (PC) Similarity In Collaborative Filtering


• Identify the movies rated by both users.
• Calculate the mean ratings for both users based on the common movies.
• Calculate the deviations from the mean for both users.
• Compute the covariance between the deviations.
• Calculate the standard deviations for both users.
• Finally, compute the Pearson correlation coefficient.
Pearson correlation similarity is a measure used in collaborative filtering to determine the similarity between
two users (or items) based on their rating patterns. It measures the linear correlation between the ratings given
by two users (or items), taking into account the mean rating of each user. A positive correlation indicates similar
rating patterns, while a negative correlation indicates dissimilar rating patterns.
Here's an example of how Pearson correlation similarity can be calculated for users in collaborative filtering:
Suppose we have a small dataset representing user ratings for movies:

|        | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
|--------|---------|---------|---------|---------|
| User 1 | 5 | 3 | 0 | 1 |
| User 2 | 4 | 0 | 0 | 1 |
| User 3 | 1 | 1 | 0 | 5 |
| User 4 | 0 | 1 | 5 | 4 |
To calculate the Pearson correlation similarity between User 1 and User 2 based on the provided ratings for movies, we follow the steps outlined earlier. For simplicity, this example treats the 0 entries as actual ratings and computes the correlation over all four movies.
Calculate mean ratings for User 1 and User 2:

Mean rating for User 1: (5 + 3 + 0 + 1) / 4 = 2.25


Mean rating for User 2: (4 + 0 + 0 + 1) / 4 = 1.25
Calculate deviations from the mean:

User 1: [5-2.25, 3-2.25, 0-2.25, 1-2.25] = [2.75, 0.75, -2.25, -1.25]


User 2: [4-1.25, 0-1.25, 0-1.25, 1-1.25] = [2.75, -1.25, -1.25, -0.25]
Calculate the Pearson correlation similarity:

Sum of products of deviations = 2.75·2.75 + 0.75·(−1.25) + (−2.25)·(−1.25) + (−1.25)·(−0.25)
= 7.5625 − 0.9375 + 2.8125 + 0.3125
= 9.75

√(Σ deviations₁²) = √(7.5625 + 0.5625 + 5.0625 + 1.5625) = √14.75 ≈ 3.84
√(Σ deviations₂²) = √(7.5625 + 1.5625 + 1.5625 + 0.0625) = √10.75 ≈ 3.28

Pearson correlation similarity = 9.75 / (3.84 × 3.28) ≈ 0.77

So, the Pearson correlation similarity between User 1 and User 2 is approximately 0.77. This indicates a strong positive correlation between their ratings.

Similarly, for User 1 and User 3:


Calculate mean ratings for User 1 and User 3:

Mean rating for User 1: (5 + 3 + 0 + 1) / 4 = 2.25


Mean rating for User 3: (1 + 1 + 0 + 5) / 4 = 1.75
Calculate deviations from the mean:

User 1: [5-2.25, 3-2.25, 0-2.25, 1-2.25] = [2.75, 0.75, -2.25, -1.25]


User 3: [1-1.75, 1-1.75, 0-1.75, 5-1.75] = [-0.75, -0.75, -1.75, 3.25]
Calculate the Pearson correlation similarity:

Sum of products of deviations = 2.75·(−0.75) + 0.75·(−0.75) + (−2.25)·(−1.75) + (−1.25)·3.25
= −2.0625 − 0.5625 + 3.9375 − 4.0625
= −2.75

√(Σ deviations₁²) = √14.75 ≈ 3.84
√(Σ deviations₃²) = √(0.5625 + 0.5625 + 3.0625 + 10.5625) = √14.75 ≈ 3.84

Pearson correlation similarity = −2.75 / (3.84 × 3.84) ≈ −0.19

So, the Pearson correlation similarity between User 1 and User 3 is approximately −0.19. This weak negative correlation suggests some dissimilarity between their rating patterns.

Finally, for User 1 and User 4:


Calculate mean ratings for User 1 and User 4:

Mean rating for User 1: (5 + 3 + 0 + 1) / 4 = 2.25


Mean rating for User 4: (0 + 1 + 5 + 4) / 4 = 2.5
Calculate deviations from the mean:

User 1: [5-2.25, 3-2.25, 0-2.25, 1-2.25] = [2.75, 0.75, -2.25, -1.25]


User 4: [0-2.5, 1-2.5, 5-2.5, 4-2.5] = [-2.5, -1.5, 2.5, 1.5]
Calculate the Pearson correlation similarity:

Sum of products of deviations = 2.75·(−2.5) + 0.75·(−1.5) + (−2.25)·2.5 + (−1.25)·1.5
= −6.875 − 1.125 − 5.625 − 1.875
= −15.5

√(Σ deviations₁²) = √14.75 ≈ 3.84
√(Σ deviations₄²) = √(6.25 + 2.25 + 6.25 + 2.25) = √17 ≈ 4.12

Pearson correlation similarity = −15.5 / (3.84 × 4.12) ≈ −0.98

So, the Pearson correlation similarity between User 1 and User 4 is approximately −0.98. This strong negative correlation indicates that their rating patterns are nearly opposite.

III. Mean Squared Difference (MSD)

The Mean Squared Difference (MSD) is a statistical measure used to quantify the average squared difference between two sets of values. For two users u and v who have both rated the items in I_uv, it is defined as MSD(u, v) = Σ_{i∈I_uv} (r_ui − r_vi)² / |I_uv|. It is commonly employed in various fields, including statistics, machine learning, and signal processing, to assess the similarity or dissimilarity between datasets.

Example:
Suppose we have three users (User X, User Y, and User Z) and their ratings for four movies (Movie
1, Movie 2, Movie 3, and Movie 4). Here are the ratings:

User X: [4, 3, 5, 2]
User Y: [3, 2, 4, 3]
User Z: [5, 4, 3, 2]
To calculate the Mean Squared Difference (MSD) between User X and User Y for these movies, we
follow these steps:
• Compute the squared difference between corresponding ratings of User X and User Y for each movie.
• Calculate the mean of these squared differences.
Let's proceed with the calculations:
Squared differences:

Movie 1: (4 - 3)^2 = 1
Movie 2: (3 - 2)^2 = 1
Movie 3: (5 - 4)^2 = 1
Movie 4: (2 - 3)^2 = 1
Mean Squared Difference (MSD):
MSD(X, Y) = (1 + 1 + 1 + 1) / 4 = 4 / 4 = 1
So, the Mean Squared Difference between User X and User Y for these movies is 1.

Similarly, you can calculate the MSD between other pairs of users or for different sets of movies. MSD is
a simple metric that gives you an idea of how similar or dissimilar the ratings of two users are. A lower
MSD indicates greater similarity in ratings.
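
As a quick check, the three pairwise MSDs in this section (including the two computed next) can be reproduced in a few lines of Python:

```python
def msd(x, y):
    """Mean squared difference between two equal-length rating lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

print(msd([4, 3, 5, 2], [3, 2, 4, 3]))  # User X vs User Y -> 1.0
print(msd([4, 3, 5, 2], [5, 4, 3, 2]))  # User X vs User Z -> 1.5
print(msd([3, 2, 4, 3], [5, 4, 3, 2]))  # User Y vs User Z -> 2.5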

To calculate the Mean Squared Difference (MSD) between User X and User Z, we'll follow the same
steps:
Here are the ratings:

User X: [4, 3, 5, 2]
User Z: [5, 4, 3, 2]
Squared differences:

Movie 1: (4 - 5)^2 = 1
Movie 2: (3 - 4)^2 = 1
Movie 3: (5 - 3)^2 = 4
Movie 4: (2 - 2)^2 = 0
Mean Squared Difference (MSD):

MSD(X, Z) = (1 + 1 + 4 + 0) / 4 = 6 / 4 = 1.5
So, the Mean Squared Difference between User X and User Z for these movies is 1.5.

A lower MSD indicates greater similarity in ratings. In this case, the MSD between User X and User
Z is higher than the MSD between User X and User Y (which was 1), suggesting that User X's ratings
are more similar to User Y's ratings than to User Z's ratings.

To calculate the Mean Squared Difference (MSD) between User Y and User Z for the given movies, we'll
follow the same steps as before:

Compute the squared difference between corresponding ratings of User Y and User Z for each movie.
Calculate the mean of these squared differences.
Here are the ratings:

User Y: [3, 2, 4, 3]
User Z: [5, 4, 3, 2]
Squared differences:

Movie 1: (3 - 5)^2 = 4
Movie 2: (2 - 4)^2 = 4
Movie 3: (4 - 3)^2 = 1
Movie 4: (3 - 2)^2 = 1
Mean Squared Difference (MSD):

MSD(Y, Z) = (4 + 4 + 1 + 1) / 4 = 10 / 4 = 2.5
So, the Mean Squared Difference between User Y and User Z for these movies is 2.5.

This indicates that User Y and User Z have somewhat differing preferences across these movies, as
reflected by their ratings.

In summary, the Mean Squared Differences between the three pairs of users are:

MSD(X, Y) = 1
MSD(X, Z) = 1.5
MSD(Y, Z) = 2.5
These values indicate the level of similarity between the ratings of each pair of users. Lower MSD values
indicate greater similarity.
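
A compact Python sketch of this computation, using the three rating vectors above, might look as follows:

import numpy as np

def msd(u, v):
    # Mean Squared Difference between two equal-length rating vectors
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.mean((u - v) ** 2)

x = [4, 3, 5, 2]  # User X
y = [3, 2, 4, 3]  # User Y
z = [5, 4, 3, 2]  # User Z

print(msd(x, y))  # 1.0
print(msd(x, z))  # 1.5
print(msd(y, z))  # 2.5

Since MSD measures dissimilarity, some libraries convert it into a similarity score, for example as sim = 1 / (1 + MSD), so that identical rating vectors receive the maximum similarity of 1.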

NEIGHBORHOOD SELECTION
The selection of the neighbors used in the recommendation of items is normally done in two
steps:
1) a global filtering step, where only the most likely candidates are kept, and
2) a per-prediction step, which chooses the best candidates for that prediction.
PRE-FILTERING OF NEIGHBORS
The pre-filtering of neighbors is an essential step that makes neighborhood-based approaches
practicable by reducing the number of similarity weights to store and limiting the number of
candidate neighbors to consider in the predictions. There are several ways in which this can be
accomplished (a short code sketch illustrating these strategies follows the list):

• Top-N filtering: For each user or item, only a list of the N nearest neighbors and their respective similarity weights is kept.
N should be chosen carefully to avoid problems with efficiency or accuracy: if N is too large, an excessive amount of memory
is required to store the neighborhood lists and predicting ratings becomes slow; if N is too small, the coverage of the
recommendation method may be reduced, causing some items never to be recommended.
• Threshold filtering: Instead of keeping a fixed number of nearest neighbors, this approach keeps all the neighbors whose
similarity weight has a magnitude greater than a given threshold w_min. While this is more flexible than the previous filtering
technique, as only the most significant neighbors are kept, the right value of w_min may be difficult to determine.
• Negative filtering: In general, negative rating correlations are less reliable than positive ones. Intuitively, this is because
a strong positive correlation between two users is a good indicator of their belonging to a common group (e.g., teenagers,
science-fiction fans, etc.). However, although a negative correlation may indicate membership in different groups, it does not
tell how different these groups are, or whether these groups are compatible for other categories of items. While experimental
investigations have found negative correlations to provide no significant improvement in prediction accuracy, whether
such correlations can be discarded depends on the data.
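
Here is a minimal Python sketch of these three pre-filtering strategies, assuming a precomputed user-user similarity matrix; the matrix, the value of N, and the threshold below are illustrative only:

import numpy as np

# Illustrative 4x4 user-user similarity matrix (diagonal = self-similarity)
sim = np.array([
    [ 1.0,  0.8, -0.3,  0.1],
    [ 0.8,  1.0,  0.4, -0.6],
    [-0.3,  0.4,  1.0,  0.2],
    [ 0.1, -0.6,  0.2,  1.0],
])

def top_n_neighbors(sim, user, n):
    # Top-N filtering: keep only the N most similar other users
    scores = sim[user].copy()
    scores[user] = -np.inf               # exclude the user itself
    return np.argsort(scores)[::-1][:n]

def threshold_neighbors(sim, user, w_min):
    # Threshold filtering: keep neighbors whose weight magnitude exceeds w_min
    return [v for v in range(len(sim))
            if v != user and abs(sim[user][v]) > w_min]

def positive_neighbors(sim, user):
    # Negative filtering: discard neighbors with negative correlation
    return [v for v in range(len(sim)) if v != user and sim[user][v] > 0]

print(top_n_neighbors(sim, user=0, n=2))        # [1 3]
print(threshold_neighbors(sim, 0, w_min=0.25))  # [1, 2]
print(positive_neighbors(sim, 0))               # [1, 3]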

NEIGHBORS IN THE PREDICTIONS

To find neighbors in the context of predictions, particularly in collaborative filtering-based recommendation
systems, we often use similarity metrics to identify users or items that are similar to each other. These similar
users or items are considered neighbors. Once we identify neighbors, we can use their ratings or preferences
to make predictions for a user or item.

Here's a basic outline of the process:

Calculate Similarity: Use a similarity metric (such as Pearson correlation, cosine similarity, or Jaccard
similarity) to measure the similarity between users or items based on their ratings or features.

Identify Neighbors: Select the top-k most similar users or items as neighbors. The value of k can be
predefined or determined dynamically.

Make Predictions: Use the ratings of the neighbors to predict ratings for the target user or item. This can be
done by taking a weighted average of the ratings given by neighbors, where the weights are the similarities
between the neighbors and the target user (or item).

Recommendation: Once predictions are made, recommend items with the highest predicted ratings to the
target user.
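
To make the "Make Predictions" step concrete, the weighted average is usually written as follows (a standard neighborhood-based formula; here N(u) denotes the selected neighbors of user u who have rated item i):

predicted_rating(u, i) = [ sum over v in N(u) of sim(u, v) * rating(v, i) ] / [ sum over v in N(u) of |sim(u, v)| ]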

Here's a simplified example:

Suppose we have three users (User A, User B, and User C) and their ratings for movies (Movie 1, Movie 2,
and Movie 3). We want to predict the rating of Movie 3 for User A.

User A: [5, 4, -] (User A hasn't rated Movie 3)
User B: [4, 5, 3]
User C: [3, 2, 4]
• Calculate Similarity: We can use a similarity metric (e.g., cosine similarity over the co-rated items) to
calculate the similarity between User A and each of the other users.

• Identify Neighbors: Let's say we choose User B and User C as neighbors based on their high
similarity scores.

• Make Predictions: We can predict the rating of Movie 3 for User A by taking a weighted average of
the ratings given by User B and User C for Movie 3, where the weights are their similarities with
User A.

• Recommendation: Recommend Movie 3 to User A if the predicted rating is above a certain threshold.

In practice, recommendation systems use more sophisticated algorithms and techniques, but the basic
idea remains the same: identify similar users or items as neighbors and use their preferences to make
predictions or recommendations.
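
Putting these four steps together, here is a minimal Python sketch for this example; cosine similarity over co-rated items is an assumed metric choice, so the exact numbers below follow from that assumption:

import numpy as np

# Ratings for Movies 1-3; np.nan marks User A's missing rating of Movie 3
a = np.array([5, 4, np.nan])
b = np.array([4, 5, 3], dtype=float)
c = np.array([3, 2, 4], dtype=float)

def cosine_on_corated(u, v):
    # Cosine similarity restricted to the items both users have rated
    mask = ~np.isnan(u) & ~np.isnan(v)
    u, v = u[mask], v[mask]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim_ab = cosine_on_corated(a, b)   # 40 / 41            ≈ 0.9756
sim_ac = cosine_on_corated(a, c)   # 23 / sqrt(41 * 13) ≈ 0.9962

# Weighted average of the neighbors' ratings of Movie 3
pred = (sim_ab * b[2] + sim_ac * c[2]) / (sim_ab + sim_ac)
print(round(pred, 2))  # 3.51

With both neighbors almost equally similar to User A, the prediction lands roughly midway between their Movie 3 ratings of 3 and 4.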

31
SECURITY ASPECTS OF RECOMMENDER SYSTEMS

• Security is a critical aspect of recommender systems, especially given the sensitive
nature of user data and the potential for malicious actors to exploit vulnerabilities. Here
are some key security considerations for recommender systems:
• Privacy Protection: Recommender systems often rely on user data to generate
recommendations. It's essential to implement robust privacy protection mechanisms to
safeguard sensitive user information. Techniques such as data anonymization, differential
privacy, and secure multiparty computation can be employed to protect user privacy while
still providing effective recommendations.
• Data Integrity: Ensuring the integrity of data is crucial to prevent unauthorized tampering
or manipulation of recommendation algorithms. Employing cryptographic techniques, access
controls, and data validation mechanisms can help maintain the integrity of data used by
recommender systems.
• Authentication and Authorization: Implement strong authentication and authorization
mechanisms to control access to recommender system resources and functionalities. This
helps prevent unauthorized access to user data and system components, reducing the risk of
data breaches and malicious activities.
• Secure Communication: Secure communication protocols, such as HTTPS, should be used
to encrypt data transmitted between clients and recommender system servers. This protects
sensitive user information from eavesdropping and interception by unauthorized parties.
• Model Robustness: Ensure that recommendation algorithms are robust against adversarial
attacks and manipulation attempts. Adversarial training, model validation, and robustness
testing can help identify and mitigate vulnerabilities in recommendation models.
• User Trust and Transparency: Promote user trust and transparency by providing clear
explanations of how recommendations are generated and how user data is used. Offering
users control over their data and preferences, as well as transparent opt-in/opt-out
mechanisms, can enhance trust in the recommender system.
• Monitoring and Auditing: Implement monitoring and auditing mechanisms to detect
anomalous behavior, security incidents, and unauthorized access attempts in real-time.
Regular security audits and penetration testing can help identify and address security
vulnerabilities proactively.
• Regulatory Compliance: Ensure compliance with relevant data protection regulations (e.g.,
GDPR, CCPA) and industry standards to protect user privacy and data rights. This includes
obtaining explicit user consent for data processing, providing users with access to their data,
and adhering to data retention and deletion policies.

IMPLEMENT COLLABORATIVE FILTERING

Collaborative filtering is a popular technique used in recommender systems to predict the preferences
of a user by leveraging the preferences of other similar users. There are two main types of collaborative
filtering: user-based collaborative filtering and item-based collaborative filtering.

Here's a basic implementation of user-based collaborative filtering using Python:

import numpy as np

class CollaborativeFiltering:
    def __init__(self, ratings):
        self.ratings = ratings
        self.similarity_matrix = self.calculate_similarity_matrix()

    def calculate_similarity_matrix(self):
        # Pairwise user-user similarity matrix; the diagonal is 1 (self-similarity)
        similarity_matrix = np.zeros((len(self.ratings), len(self.ratings)))
        for i in range(len(self.ratings)):
            for j in range(len(self.ratings)):
                if i == j:
                    similarity_matrix[i][j] = 1
                else:
                    similarity_matrix[i][j] = self.calculate_similarity(
                        self.ratings[i], self.ratings[j])
        return similarity_matrix

    def calculate_similarity(self, user1_ratings, user2_ratings):
        # Cosine similarity over the items both users have rated (0 = no rating)
        common_items_mask = np.logical_and(user1_ratings != 0, user2_ratings != 0)
        if np.sum(common_items_mask) == 0:
            return 0
        user1_common = user1_ratings[common_items_mask]
        user2_common = user2_ratings[common_items_mask]
        return np.dot(user1_common, user2_common) / (
            np.linalg.norm(user1_common) * np.linalg.norm(user2_common))

    def predict_ratings(self, user_id):
        user_ratings = self.ratings[user_id]
        # Use a float array here: zeros_like on the integer ratings matrix
        # would silently truncate the predicted ratings to integers
        predicted_ratings = np.zeros_like(user_ratings, dtype=float)
        for i in range(len(user_ratings)):
            if user_ratings[i] == 0:  # predict only the unrated items
                weighted_sum = 0.0
                similarity_sum = 0.0
                for j in range(len(self.ratings)):
                    if self.ratings[j][i] != 0:
                        weighted_sum += self.similarity_matrix[user_id][j] * self.ratings[j][i]
                        similarity_sum += self.similarity_matrix[user_id][j]
                if similarity_sum != 0:
                    predicted_ratings[i] = weighted_sum / similarity_sum
        return predicted_ratings

# Example usage
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
    [0, 1, 5, 0],
])

cf = CollaborativeFiltering(ratings)
user_id = 0
predicted_ratings = cf.predict_ratings(user_id)
print("Predicted ratings for user", user_id, ":", predicted_ratings)

This implementation computes the similarity between users with the cosine similarity metric, restricted
to the items both users have rated. It then predicts the ratings of a given user's unrated items as a
weighted average of the ratings of other users, with weights given by their similarity scores, and finally
prints the predicted ratings for the specified user.

Example of Collaborative Filtering

Let's consider a simple example of collaborative filtering for movie recommendations.

Suppose we have a small dataset of user ratings for a few movies:

        Movie 1   Movie 2   Movie 3   Movie 4
User 1     5         3         0         1
User 2     4         0         0         1
User 3     1         1         0         5
User 4     0         1         5         4
User 5     0         1         5         0
In this example:

Users have rated movies on a scale of 1 to 5, with 0 indicating no rating. We want to predict the ratings
for movies that a user hasn't rated based on the ratings of similar users.

Now, let's say we want to predict the ratings for User 5. We can use collaborative filtering to do this.
We'll calculate the similarity between User 5 and each of the other users based on their rating patterns.
Then, we'll use the ratings of the most similar users to predict the ratings for User 5.

For instance, if we use cosine similarity over the full rating vectors (treating the 0 entries as zeros, a
common textbook simplification):

Similarity(User 5, User 1) = 3 / (sqrt(26) * sqrt(35)) ≈ 0.0995
Similarity(User 5, User 2) = 0 / (sqrt(26) * sqrt(17)) = 0.0
Similarity(User 5, User 3) = 1 / (sqrt(26) * sqrt(27)) ≈ 0.0377
Similarity(User 5, User 4) = 26 / (sqrt(26) * sqrt(42)) ≈ 0.7868

Based on these similarities, User 4 is by far the most similar to User 5, followed (distantly) by User 1.

Now we can predict User 5's missing ratings (Movie 1 and Movie 4) by taking a weighted average of the
ratings of the two nearest neighbors, User 1 and User 4, skipping any neighbor who has not rated the
movie in question:

Predicted rating for Movie 1 = (0.0995 * 5) / 0.0995 = 5
(User 4 has not rated Movie 1, so only User 1 contributes.)

Predicted rating for Movie 4 = (0.7868 * 4 + 0.0995 * 1) / (0.7868 + 0.0995)
= (3.1472 + 0.0995) / 0.8863
≈ 3.66

So, the predicted ratings for User 5 are approximately 5 for Movie 1 and 3.66 for Movie 4. These
predictions are driven mainly by User 4, who is the most similar to User 5; Movies 2 and 3 need no
prediction, since User 5 has already rated them.
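
The hand computation above can be reproduced with a short standalone sketch (full-vector cosine similarity, the top-2 neighbors, and missing ratings excluded from the weighted average; this mirrors the worked example rather than the class defined earlier, which uses cosine over co-rated items instead):

import numpy as np

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
    [0, 1, 5, 0],
], dtype=float)

target = 4  # User 5 (0-indexed)

# Cosine similarity of User 5 to every user over the full rating vectors
norms = np.linalg.norm(R, axis=1)
sims = R @ R[target] / (norms * norms[target])
sims[target] = -np.inf                  # exclude the user itself
neighbors = np.argsort(sims)[::-1][:2]  # top-2: users 3 and 0 (User 4 and User 1)

for movie in np.where(R[target] == 0)[0]:
    # Only neighbors who actually rated the movie contribute
    # (assumes at least one of them has rated it, as in this data)
    rated = [v for v in neighbors if R[v, movie] != 0]
    pred = sum(sims[v] * R[v, movie] for v in rated) / sum(sims[v] for v in rated)
    print(f"Movie {movie + 1}: {pred:.2f}")
# Movie 1: 5.00
# Movie 4: 3.66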

