
HYBRID BERT REC: BERT BASED HYBRID RECOMMENDER

Arandkar Vishal (20951A6759)


G. Tushar Reddy (20951A6754)
Ch Bala Varun Chary (21955A6702)

HYBRID BERT REC: BERT BASED HYBRID RECOMMENDER

A Project Phase-II Report


Submitted in Partial Fulfilment of the
Requirements for the Award of the Degree of

Bachelor of Technology
in
CSE (Data Science)

by

Arandkar Vishal (20951A6759)


G. Tushar Reddy (20951A6754)
Ch Bala Varun Chary (21955A6702)

Department of CSE(Data Science)

INSTITUTE OF AERONAUTICAL ENGINEERING


(Autonomous)
Dundigal, Hyderabad – 500 043, Telangana

January, 2024

© 2024, Vishal, Tushar, Varun. All rights reserved.

DECLARATION

I certify that
a. The work contained in this report is original and has been done by me under the guidance
of my supervisor(s).
b. The work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in preparing the report.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the
Institute.
e. Whenever I have used materials (data, theoretical analysis, figures, and text) from other
sources, I have given due credit to them by citing them in the text of the report and giving
their details in the references. Further, I have taken permission from the copyright owners
of the sources, whenever necessary.

Place: Signature of the Student :


Date: Roll No. :

CERTIFICATE

This is to certify that the project Phase-II report entitled Hybrid BERT Rec: BERT Based
Hybrid Recommender submitted by Mr. Arandkar Vishal, Mr. Gali Tushar Reddy and
Mr. Ch. Bala Varun Chary to the Institute of Aeronautical Engineering, Hyderabad, in
partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in
CSE (Data Science) is a bona fide record of work carried out by them under my
guidance and supervision. In whole or in part, the contents of this report have not been
submitted to any other institute for the award of any Degree.

Supervisor Head of the Department


Ms N. Lakshmi Deepthi Dr. G. Sucharitha Reddy
Professor

Date:

ABSTRACT
In the world of e-commerce, the dynamics of user behaviour evolve rapidly, shaped by societal trends
and changing preferences. This project introduces an innovative approach to recommendation systems
for e-commerce websites, focusing on the dynamic nature of user interactions. Leveraging content-based
filtering (CBF) and collaborative filtering (CF), the project pioneers the integration of a bidirectional
self-attention network inspired by BERT, designed to capture and adapt to sequential user behaviours.

Traditional unidirectional models face limitations in capturing contextual information, especially in
e-commerce scenarios where user choices are influenced by diverse factors. The proposed bidirectional
model enhances representation power by considering both left and right context, addressing the
shortcomings of unidirectional approaches.

Sequential recommender systems seek to capture information about user affinities and behaviours
from their sequential series of interactions. In this project, we detail BERT4Rec, a sequential
recommendation approach based on the bidirectional encoder of the self-attention-based Transformer.
BERT4Rec applies the Bidirectional Encoder Representations from Transformers (BERT) technique to
model user behaviour sequences by considering the target user's historical data, i.e., a content-based
filtering (CBF) approach. Despite BERT4Rec's effectiveness, we argue that considering only this
historical data is insufficient to provide the most accurate recommendations. We therefore present
HybridBERT4Rec, which applies BERT to both CBF and collaborative filtering (CF). For CBF, we
extract the characteristics of the target user's interactions with purchased items. For CF, we find
neighbouring users who are similar to the target user: we extract the target item's characteristics
using all other users who rated the target item as a second input to BERT, which generates
a target item profile. After obtaining both profiles, we use them to predict a rating score. We
experimented with three datasets, finding that our model was more accurate than the original
BERT4Rec.

Keywords: Recommender system, sequential recommendation, hybrid recommendation, BERT

TABLE OF CONTENTS

COVER PAGE i
TITLE PAGE ii
DECLARATION iii
CERTIFICATE iv
ABSTRACT v
CONTENTS vi
LIST OF FIGURES viii
LIST OF ABBREVIATIONS ix
CHAPTER 1 INTRODUCTION 1
1.1 Introduction 1
1.2 Existing System 2
1.3 Proposed System 3

CHAPTER 2 LITERATURE SURVEY 4


CHAPTER 3 PROJECT ARCHITECTURE 7
3.1 High-Level System Architecture 7
3.2 UML Diagram 9
3.2.1 Data Flow Diagram 9
3.2.2 Sequence Diagram 10
3.2.3 Use Case Diagram 11

CHAPTER 4 METHODOLOGY AND IMPLEMENTATION 12


4.1 Methodology 12
4.1.1 What is BERT /BERT Architecture 12
4.2 Understanding BERT 13
4.2.1 Transformer Layer 13
4.2.2 Attention mechanism 13
4.2.3 Feed forward neural network 13

4.2.4 Stacking and Transformer Layer 14
4.2.5 Embeddings and softmax 14
4.3 Dataset 15
4.4 Implementation 17
4.4.1 CBF recommendation using BERT 17
4.4.2 CF recommendations using SVD 19
4.4.3 Hybrid Recommendation 20

CHAPTER 5 RESULTS AND DISCUSSION 21


5.1 Feature Importance And Analysis 21
5.2 Evaluation Metrics 22
5.3 Results 24

CHAPTER 6 CONCLUSION AND FUTURE SCOPE 25


6.1 Proposed Improvements 25
6.2 Future Scope 25

References 27
Appendices 27

LIST OF FIGURES

Figure 1: High level architecture of recommendation system ………………………………………7
Figure 2: Collaborative Filtering …………………………………………………………………….7
Figure 3: Content based filtering …………………………………………………………..………...8
Figure 4: The architecture of the HybridBERT4Rec model …………………………………………8
Figure 5: Data flow diagram ………………………………………………………………………..10
Figure 6: Sequence diagram ………………………………………………………………………...10
Figure 7: Use case diagram …………………………………………………………………………11
Figure 8: A comparison between Transformers and traditional RNNs ………………………….…..12
Figure 9: Feed forward neural network ………………………………………………………….….14
Figure 10: BERT model architecture for predicting the next item that a user might interact with .….15
Figure 11: Top five genres and their percentages ………………………………………………...….16
Figure 12: Correlation heatmap for ratings dataframe ……………………………………...…....….16
Figure 13: Training BERT on movie overview ……………………………………………………..18
Figure 14: Cosine similarity for first ten embeddings of dataframe ………………..……………….21
Figure 15: Top 10 recommendations for user input …………………………..……………………..24
Figure 16: The similarity score and ratings of top recommendations …………..…………………...24

LIST OF ABBREVIATIONS

CBF - Content-Based Filtering

CF - Collaborative Filtering
BERT - Bidirectional Encoder Representations from Transformers
SVD - Singular Value Decomposition
UBCF - User-Based Collaborative Filtering
IBCF - Item-Based Collaborative Filtering
MSE - Mean Squared Error
RMSE - Root Mean Squared Error

Chapter 1
Introduction
1.1 Introduction

Recommender systems are indispensable tools in e-commerce platforms, aiding users in
navigating extensive product catalogues and facilitating personalized experiences. Traditional
recommendation algorithms, broadly categorized as collaborative filtering-based and
content-based [2], face distinct challenges in meeting the dynamic demands of real-world
e-commerce platforms.
Content-Based Filtering (CBF) is a popular recommendation paradigm utilized for items
such as web pages, news articles, books, and more. The fundamental idea behind CBF is to
recommend items based on their intrinsic features and user preferences. Unlike collaborative
filtering approaches that rely on ratings from other users, CBF relies on item attributes such
as genre, director, actors, or descriptions to create a user profile. This approach works
independently of the number of users in the system and is particularly effective when dealing
with structured data.
Collaborative Filtering (CF), on the other hand, relies on user-item interaction data to make
recommendations. User-based collaborative filtering identifies similar users based on their
historical preferences and recommends items based on what similar users have liked. Item-
based collaborative filtering recommends items similar to those a user has already interacted
with. However, both approaches are impacted by data sparsity in highly active e-commerce
platforms with large and dynamic inventories.
Hybrid recommendation systems blend diverse recommendation strategies to harness
complementary advantages. Traditional hybrids fuse collaborative filtering (CF), relying on
user preferences, and content-based filtering (CBF), centered on item characteristics. While
existing hybrids, such as those incorporating textual information in movie recommendations,
address specific domains, they often struggle to capture latent user preferences. Neural-based
approaches have emerged to bridge this gap, but many are limited to either CF or CBF. This
project explores an innovative hybrid approach that integrates both CF and CBF into neural
recommendations, offering a comprehensive solution for capturing nuanced user preferences.
This bidirectional approach overcomes the constraints of unidirectional models, providing
enhanced representation power and flexibility in capturing dynamic user preferences.
Training the bidirectional model involves addressing challenges in predicting the next item
for each position in the input sequence, requiring thoughtful consideration in the training
process.

1.2 Existing System

The existing CBF systems make suggestions based on a user's item and profile information.
They assume that if a user has expressed interest in something, they will do so again in
the long term. Comparable items are frequently grouped based on their features. User
profiles are created by analysing previous interactions or by directly asking users
about their interests. Systems that additionally use individual and interpersonal user
data are not regarded as purely content-based.
Moreover, collaborative filtering is classified into two types:
a) Memory-based approaches: This is also known as neighbourhood-based collaborative
filtering. Ratings of user-item pairs are simply predicted based on their proximity [33].
It is further classified into two types: user-based collaborative filtering and item-based
collaborative filtering. User-based filtering assumes that strong and similar
recommendations come from like-minded people; item-based filtering suggests items based
on perceived relevance, as determined by customer ratings.
i) User-Based Collaborative Filtering (UBCF): recommends items to a user by identifying
similar users and predicting the user's preferences based on their ratings. It employs a
user-item matrix, predicting unrated items for the active user by comparing their preferences
to those of similar users. The k-nearest neighbours (KNN) algorithm is commonly used to find
similar users, making the approach easy to implement and more accurate than some techniques
such as content-based methods. However, it faces challenges with sparsity in user ratings,
scalability as the user base grows, and cold-start problems with new users and items.
ii) Item-Based Collaborative Filtering (IBCF): focuses on finding items similar to the
user's preferences. It involves calculating the similarity among items using techniques
such as cosine-based similarity or correlation-based similarity. Unlike UBCF, IBCF
pre-calculates the item similarities.
b) Model-based approaches: Model-based collaborative filtering does not need to memorize
the ratings matrix. Instead, machine learning models are used to predict and estimate the
rating a customer would give to each product. These algorithms predict unrated products
from existing customer ratings, and are further divided into different subsets, i.e.,
matrix factorization-based algorithms, deep learning methods, and clustering algorithms.

1.3 Proposed System

The proposed BERT-based solution tries to mitigate existing problems such as:
1. Cold-Start Problem Mitigation: Instead of relying on unique item identifiers to aggregate
historical information, our approach utilizes only the item's title as content, coupled with
token embeddings. This helps address the cold-start problem, a significant shortcoming of
traditional recommendation algorithms, by allowing the model to provide meaningful
recommendations even when historical data is limited.
2. User Latent Interest Learning: By training our model with user behaviour data, we aim
to enable the system to learn not only item similarities but also the latent interests of users.
This contrasts with traditional recommendation algorithms and some pair-wise deep learning
algorithms that primarily focus on providing similar items based on past purchases. Our
approach seeks to offer more personalized recommendations by capturing the nuanced
preferences of users.
In CF-HybridBERT4Rec, our solution involves extracting the target item representation,
capturing the similarity levels between all neighbours and the target user. During training, a
random user-masking technique is employed on the user sequence associated with each target
item. This process aims to enable the model to reconstruct the masked user's original
embedding as accurately as possible. Once training is completed, the network is capable of
constructing the next user representation based on the characteristics of users interacting with
the target item. During testing, the target item representation is constructed by masking the
target user and adding them to the end of the user sequence. This representation encapsulates
the comparison values between each neighbouring user and the target user, signifying their
similarity levels.

Conversely, in CBF-HybridBERT4Rec, the focus is on extracting the user representation that
signifies the target user's preferences toward the target item. For each target user, an item
sequence is considered, consisting of a series of items the target user has interacted with.
During the training stage, random item masking is applied to the item sequence. Similar to
user masking, this technique allows the model to rebuild the masked item as closely as
possible to its original embedding. After training, the obtained network can predict the next
item in the sequence. In testing, item masking involves merely masking the target item and
appending it to the end of the sequence, facilitating the prediction of the target user's profile
toward the target item. This approach ensures that the user preferences are effectively
captured in the hybrid recommendation model.

Chapter 2
Literature Survey
F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor [1]:
The first comprehensive handbook dedicated entirely to recommendation systems (RSs), which
are applications that offer relevant items to users, from simple book recommendations to
more complex recommendations such as conversational recommender systems. Recommender
systems are software tools and techniques providing suggestions for items likely to be of use
to a user. The suggestions relate to various decision-making processes, such as what items
to buy, what music to listen to, or what online news to read. Although not exclusively about
recommendation systems, the book also covers large-scale data mining, including collaborative
filtering and recommendation algorithms.

G. Adomavicius and A. Tuzhilin [2]:
"Towards the next generation of recommender systems: a survey of the state-of-the-art and
possible extensions." The paper discusses existing recommender systems and their drawbacks.
The current generation of recommendation methods is usually classified into three main
categories: content-based, collaborative, and hybrid recommendation approaches. The paper
also describes various limitations of current recommendation methods and discusses possible
extensions that can improve recommendation capabilities. Among the drawbacks, CBF and CF
suffer from data sparsity and the cold-start problem; the hybrid recommender system, a
combination of CBF and CF, tries to solve these problems to some extent.

H. Wang [3] proposed ZeroMat:
"Solving the cold-start problem of recommender systems with no input data." Wang proposed a
solution named ZeroMat that requires no input data at all and predicts user-item rating data
that is competitive in mean absolute error and fairness metrics with classic matrix
factorization trained on abundant data, with much better performance than random placement.
The cold-start problem occurs when a recommender system has little or no information about a
new user or item, making it challenging to provide accurate and personalized recommendations.
Wang also noted a drawback: matrix factorization can address the cold-start problem, but it
still suffers from data sparsity and can be computationally expensive for large datasets.

R. Esmeli, M. Bader-El-Den and H. Abdullahi [4]:
Conducted a fully content-based movie recommendation (FCMR) study using Word2Vec for
improved purchase prediction. The system uses a neural network model, Word2Vec CBOW, with
content information. The model learns vector-form features of each element, and then takes
advantage of the linear relationships of the learned features to calculate the similarity
between movies. Word2Vec is a popular word-embedding technique that can also be applied in
recommendation systems. The FCMR system achieved good performance on a number of metrics,
but the authors suggested that a hybrid recommender system combining content-based and
collaborative filtering approaches would be preferable, since Word2Vec cannot capture
long-term dependencies.

Sun, Fei, et al. [5]:
Proposed BERT4Rec, a recommendation system that uses BERT (Bidirectional Encoder
Representations from Transformers) to generate embeddings for items. These embeddings are
then used to calculate the similarity between items, which can be used to make
recommendations. BERT4Rec has been shown to be effective for a variety of recommendation
tasks, including sequential recommendation, cross-domain recommendation, and cold-start
recommendation. BERT4Rec refers to the application of BERT in the context of recommender
systems. BERT is a pre-trained transformer-based model that has been highly successful in
natural language processing tasks. When applied to recommender systems, BERT can learn
complex patterns and dependencies in user-item interactions, potentially leading to more
accurate and personalized recommendations. The application of BERT in recommender systems
has gained attention because of its ability to capture long-range dependencies and context
in user-item interactions, allowing the model to understand the semantics of items and user
preferences more effectively. Related models such as SASRec and KeBERT4Rec have also been
proposed, each with its own advantages; for example, Self-Attentive Sequential
Recommendation (SASRec) can capture the relationships between items that are far apart in
the user's history.
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang [6]:
Modelling users' dynamic preferences from their historical behaviours is challenging and
crucial for recommendation systems. Previous methods employ sequential neural networks
to encode users' historical interactions from left to right into hidden representations for
making recommendations. Despite their effectiveness, the authors argue that such left-to-right
unidirectional models are sub-optimal due to limitations including: a) unidirectional
architectures restrict the power of hidden representations of users' behaviour sequences;
b) they often assume a rigidly ordered sequence, which is not always practical. To address
these limitations, they proposed a sequential recommendation model called BERT4Rec, which
employs deep bidirectional self-attention to model user behaviour sequences. To avoid
information leakage and to train the bidirectional model efficiently, they adopt the Cloze
objective for sequential recommendation, predicting the randomly masked items in the
sequence by jointly conditioning on their left and right context.

Anushree H and Shashidhara H S, Efficient Recommendation System Using BERT
Technology, International Journal of Advanced Research in Engineering and Technology [7]:
Modelling the diverse desires of users for such systems is difficult and essential. Past
techniques utilize sequential neural networks to encode customers' historical experiences,
from left to right, into suggestion models. Recommendation systems in e-commerce are
becoming an integral means of helping consumers navigate the accessible content; they are an
important aspect of e-commerce platforms that assist consumers in choosing products at a
wide scale, backed by huge investments. The metadata termed the contextual phrase, which
incorporates the reference label and the description of the quote, has been used by several
authors to locate the relevant referenced research. The lack of a well-established
benchmarking dataset, and of a recommendation tool that achieves high efficiency, has made
the study challenging. The personalization of recommendation systems is also the foundation
of shared marketplace platforms in the rental property field, where such a framework provides
a valuable tool to support people. The presented technique aims to determine the context of
the movie plot summary from the given movies using BERT-as-a-service and to predict similar
movie recommendations.
Yuyangzi Fu and Tian Wang [8]:
In e-commerce, recommender systems have become an indispensable part of helping users
explore the available inventory. In this work, the authors present a novel approach for
item-based collaborative filtering that leverages BERT to understand items and to score
relevancy between different items. The proposed method can address problems that plague
traditional recommender systems, such as cold start and "more of the same" recommended
content. They conducted experiments on a large-scale real-world dataset with a full
cold-start scenario, and the proposed approach significantly outperforms the popular
Bi-LSTM model.

Chapter 3
Project Architecture
3.1 High-Level System Architecture

The main component of our architecture is the recommendation engine, which consists of three
main components:
1. Collaborative Filtering
2. Content-Based Filtering
3. Hybrid Combination

Figure 1. High level architecture of recommendation system


COMPONENTS:
1. Collaborative Filtering: The collaborative filtering recommender is based entirely on
past behaviour and not on context. More specifically, it is based on the similarity in
preferences, tastes and choices of two users. It analyses how similar the tastes of one user
are to another's and makes recommendations on that basis. For instance, if user A likes
movies 1, 2, 3 and user B likes movies 2, 3, 4, then they have similar interests, so A should
like movie 4 and B should like movie 1. This makes it one of the most commonly used
algorithms, as it does not depend on any additional information.

In general, collaborative filtering is the workhorse of recommender engines. The algorithm
has a very interesting property of being able to do feature learning on its own, which means
that it can start to learn for itself what features to use. It can be divided into
memory-based collaborative filtering and model-based collaborative filtering.

Figure 2. Collaborative Filtering

2. Content-Based Filtering: The content-based recommender relies on the similarity of the
items being recommended. The basic idea is that if you like an item, then you will also like a
"similar" item. It generally works well when it is easy to determine the context/properties of
each item.

A content-based recommender works with data that the user provides, for example explicit
movie ratings as in the MovieLens dataset. Based on that data, a user profile is generated,
which is then used to make suggestions to the user. As the user provides more inputs or acts
on the recommendations, the engine becomes more and more accurate.

Figure 3. Content Based Filtering

Figure 4. The architecture of the HybridBERT4Rec model, which comprises a CBF part, a CF
part, and a prediction layer
3.2 UML Diagrams
3.2.1 Data Flow Diagram:
The fundamental data workflow encompasses the collection of data, analysis of the raw data,
the subsequent cleaning and preprocessing, model application, and analysis of the results.
Let's look at how the text processing is done. While dealing with text data preprocessing,
the focus is on refining the data by eliminating unwanted elements, removing stop words, and
applying lemmatization, as sketched in the code example after this list.
1.Removing Noise from Data:
Before extracting meaningful information, it is crucial to eliminate any irrelevant or
extraneous elements that might hinder the analysis process. This noise removal step involves
identifying and discarding data components that do not contribute to the overall
understanding or insights derived from the data.
2.Stop words Removal:
Stop words, commonly occurring words in a language (e.g., 'the,' 'is,' 'and'), are often devoid
of significant semantic meaning and can add noise to textual data. Stop words removal
involves systematically excluding these words from the dataset to streamline the analysis
process, allowing the focus to shift toward more contextually relevant terms. This enhances
the efficiency of subsequent natural language processing tasks.
3.Lemmatization:
Lemmatization is a linguistic process aimed at reducing words to their base or root form,
known as lemmas. This normalization technique ensures that different grammatical
variations of a word, such as verb conjugations or plural forms, are transformed into a
common base. By doing so, lemmatization facilitates more accurate and consistent analysis
of the text data, enabling improved understanding and interpretation of the content.
4.BERT Model:
Utilizing transformers, we load the pre-trained BERT model known as bert-base-uncased, a
comprehensive discussion of which will be provided in Chapter 4. The process involves
tokenizing sentences and subsequently encoding them into embeddings by passing them
through the BERT model. This mechanism allows us to harness the power of pre-trained
language representations for diverse natural language processing tasks.

Figure 5. Data Flow Diagram

3.2.2 Sequence Diagram

Figure 6. Sequence Diagram

Sequence diagrams can be useful reference diagrams for businesses and other organizations.
We draw a sequence diagram to:
• Represent the details of a UML use case.
• Model the logic of a sophisticated procedure, function, or operation.

• See how tasks are moved between objects or components of a process

3.2.3 Use Case Diagram:

Figure 7. Use Case Diagram


1. User: Represents the actor interacting with the system.

2. Recommendation Engine: the recommendation engine is a combination of both

content-based and collaborative filtering, along with an in-built BERT
(Bidirectional Encoder Representations from Transformers) model.

3. The dataset consists of multiple sub-datasets, such as ratings and movies, which are
stored after we clean and preprocess them.

4. The ultimate step involves presenting movie recommendations to the user, following
a ranked list recommendation approach.

Chapter 4
Methodology And Implementation
4.1 Methodology:
Before getting started, let us introduce the terminology.

In sequential recommendation, let $U = \{u_1, u_2, \ldots, u_{|U|}\}$ denote a set of users,
$V = \{v_1, v_2, \ldots, v_{|V|}\}$ a set of items, and $S_u = [v_1^{(u)}, \ldots, v_t^{(u)}, \ldots, v_{n_u}^{(u)}]$
the interaction sequence in chronological order for user $u \in U$, where $v_t^{(u)} \in V$ is
the item that $u$ has interacted with at time step $t$ and $n_u$ is the length of the interaction
sequence for user $u$. Given the interaction history $S_u$, sequential recommendation aims to
predict the item that user $u$ will interact with at time step $n_u + 1$.

4.1.1 What is BERT / BERT Architecture:

Let's get started with the sequential recommendation model called BERT4Rec, which adapts
Bidirectional Encoder Representations from Transformers to a new task: sequential
recommendation.

Figure 8. A comparison between Transformers and traditional RNNs

BERT4Rec is built upon the popular self-attention layer, the "Transformer layer". As
illustrated in Figure 8, BERT4Rec is a stack of L bidirectional Transformer layers. At each
layer, it iteratively revises the representation of every position by exchanging information
across all positions of the previous layer in parallel. Instead of learning to pass relevant
information forward step by step, as RNN-based methods do, the self-attention mechanism
endows BERT4Rec with the capability to directly capture dependencies at any distance. This
mechanism results in a global receptive field, while CNN-based methods like Caser usually
have a limited receptive field. In addition, in contrast to RNN-based methods, self-attention
is straightforward to parallelize.

Let's go through the components one by one.

4.2 Understanding BERT:

4.2.1 Transformer Layer: Let's understand with an example. In machine translation, text is
translated from one language to another through a Transformer: the source text is first
encoded into embeddings by an "encoder", and a "decoder" then decodes it into the target
language.

The encoder is built from multiple self-attention mechanisms plus a feed-forward neural
network, and the same holds for the decoder. BERT is an encoder-only model, so we will focus
on how the encoder works.

4.2.2 Attention Mechanism: Attention mechanisms have become an integral part of
compelling sequence modelling and transduction models in various tasks, allowing the
modelling of dependencies without regard to their distance in the input or output sequences
[2, 3]. In all but a few cases [4], however, such attention mechanisms are used in conjunction
with a recurrent network, or, as in our case, with feed-forward neural networks.

Attention weights are calculated through a function called "attention". An attention function
can be described as mapping a query and a set of key-value pairs to an output, where the
query, keys, values, and output are all vectors. The output is computed as a weighted sum of
the values.

The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We
compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a
softmax function to obtain the weights on the values. In practice, we compute the attention
function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and
values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs
as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

● Q is the query matrix

● K is the key matrix

● V is the value matrix

4.2.3 Feed Forward Neural Network:

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a
fully connected feed-forward network, which is applied to each position separately and
identically. This consists of two linear transformations with a ReLU activation in between.

Mathematically, ReLU is defined as $f(x) = \max(0, x)$, meaning it replaces any negative
values with zero while leaving positive values unchanged. The purpose of ReLU is to introduce
non-linearity to the network, allowing it to learn complex relationships and patterns in the
data.

Figure 9. Feed Forward Neural Network

4.2.4 Stacking Transformer Layers:

BERT-base has 12 layers in its encoder stack, while BERT-large has 24. Both are deeper than
the Transformer architecture described in the original paper (6 encoder layers).

The BERT architectures (base and large) also have larger hidden sizes (768 and 1024 units
respectively) and more attention heads (12 and 16 respectively) than the Transformer
suggested in the original paper, which contains 512 hidden units and 8 attention heads.

BERT-base contains 110M parameters, while BERT-large has 340M parameters; we are going to
use BERT-base for our recommendation system. However, the network becomes more difficult to
train as it goes deeper. Therefore, we employ a residual connection around each of the two
sub-layers, followed by layer normalization [5]. Moreover, we also apply dropout [6] to the
output of each sub-layer, before it is normalized. That is, the output of each sub-layer is
$\mathrm{LN}(x + \mathrm{Dropout}(\mathrm{sublayer}(x)))$, where $\mathrm{sublayer}(\cdot)$ is
the function implemented by the sub-layer itself and $\mathrm{LN}$ is the layer normalization
function.

4.2.5 Embeddings and Softmax:

As elaborated above, without any recurrence or convolution module, the Transformer layer
(Trm) is not aware of the order of the input sequence. In order to make use of the sequential
information of the input, we inject positional embeddings into the input item embeddings at
the bottom of the Transformer layer stack. For a given item $v_i$, its input representation
$h_i^0$ is constructed by summing the corresponding item and positional embeddings:

$$h_i^0 = v_i + p_i$$

where $v_i \in E$ is the $d$-dimensional embedding for item $v_i$, and $p_i \in P$ is the
$d$-dimensional positional embedding for position index $i$.

After L layers that hierarchically exchange information across all positions in the previous
layer, we get the final output $H^L$ for all items of the input sequence. Assuming that we
mask the item $v_t$ at time step $t$, we predict the masked item $v_t$ by applying a softmax
layer at the end.

Softmax Layer: It is typically used as the output layer of a neural network to convert the
raw output scores, or logits, into a probability distribution over multiple classes. The
softmax function takes a vector of real numbers as input and produces a probability
distribution as output:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

● $z_i$ is the raw score or logit for class $i$

● $e$ is the base of the natural logarithm

● $K$ is the number of classes

Figure 10. BERT model architecture for predicting the next item that a user might interact
with

4.3 Dataset:
We have acquired a movie dataset from Kaggle, comprising several sub-datasets, including a
primary movie dataset and separate rating datasets.

The movie dataset, intended for content-based filtering, is composed of 45,000 rows and 24
columns. Among these columns, 12 are deemed essential for analysis: 'id', 'title', 'genres',
'original language', 'overview', 'tagline', 'production countries', 'release date', 'status',
'vote average', 'vote count', and 'runtime'. To gain insights into the data, we conduct
exploratory data analysis (EDA).

Figure 11. Top five genres and their percentages (detailed bar graph and pie chart of the
dataframe)

The ratings dataset, designated for collaborative filtering, comprises 10,000 rows and
includes 4 columns. Among these columns, three are considered crucial for analysis:
'movieId,' 'userId,' and 'rating.' Each entry in the dataset corresponds to a user identified by a
unique user ID providing a rating to a specific movie identified by a unique movie ID. The
ratings provided by users for these movies fall within the range of 1 to 5.

Figure 12. Correlation heatmap for the ratings dataframe
There are a total of 671 unique users and 9,066 unique movies; the correlation among them is
shown in Figure 12.

4.4 Implementation
4.4.1 CBF-Recommendation using BERT:
For content-based recommendation, we consolidate various text columns such as title,
overview, genre, tagline, and production companies. Subsequently, we apply natural
language processing (NLP) preprocessing steps, as outlined earlier.
Now, let's consider a scenario wherein a movie boasting an average rating of 9, derived
from merely two votes, cannot be deemed superior to a film with a lower average rating of 8
that has garnered 1000 votes. In light of this, we opt to employ IMDB's weighted rating
methodology to assess the overall quality of a movie. The weighted rating of a movie is
defined as:

$$WR = \left(\frac{v}{v+m}\right) R + \left(\frac{m}{v+m}\right) C$$

where,

● v is the number of votes for the movie (vote count)

● R is the average rating of the movie (vote average)
● C is the mean vote across the whole dataset
● m is the minimum votes required to be listed in the chart

Now we use the pre-trained "bert-base-uncased" model: "base" refers to the smaller model,
and "uncased" means that all text used during pre-training was converted to lowercase,
including both the input text and the vocabulary. For example, "Hello" and "hello" are
treated as the same word in the "uncased" model.
Some key parameters and details of the BERT model:
Architecture: BERT utilizes a transformer architecture, specifically the transformer
encoder. The transformer model enables the bidirectional processing of input sequences,
allowing it to capture contextual information effectively.
Layers: BERT consists of multiple layers of transformer blocks. The number of layers in the
model is a hyperparameter that can be adjusted. The original BERT-base model has 12
layers.
Hidden Units: Each layer in the transformer contains a certain number of hidden units,
which is another hyperparameter. The BERT-base model has 768 hidden units in each layer.
Attention Heads: The attention mechanism in the transformer is divided into attention
heads. The number of attention heads is also a hyperparameter. BERT-base has 12 attention
heads.

Vocabulary Size: BERT uses WordPiece tokenization, and the "uncased" variant has a
vocabulary size of 30,522 tokens. This vocabulary includes both whole words and subwords.
Embedding Dimension: BERT represents words as vectors in an embedding space. The
embedding dimension for BERT-base is 768.
First, we set our device to CUDA, and then we load our pre-trained BERT model.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
application programming interface (API) model created by NVIDIA. It allows developers to
use NVIDIA graphics processing units (GPUs) for general-purpose processing, not just
graphics-related tasks. CUDA enables programmers to harness the computational power of
NVIDIA GPUs to accelerate various types of applications, including scientific simulations,
data processing, machine learning, and more.

Now the combined text column from above is passed through the BERT model to be encoded.

Figure 13. Training BERT on movie overview

After cleaning, we are left with approximately 44,000 rows; with the default batch size of
32, the dataset is divided into about 1,377 batches.
4.4.2 CF Recommendation Using SVD:
As previously discussed, collaborative filtering is of two types, user-based and item-based:

● User-based, which measures the similarity between target users and other users.

● Item-based, which measures the similarity between the items that target users rate or
interact with and other items.

When you have millions of users and/or items, computing pairwise correlations is expensive
and slow. We already saw that we can avoid processing all the data every time by
establishing neighbourhoods on the basis of similarity. Is it possible to reduce the size of
the ratings matrix in some other way?
The answer is yes: we can use SVD (singular value decomposition) to reduce the
dimensionality of the matrix. Let's understand how. Singular Value Decomposition (SVD) is a
method from linear algebra that has been widely used as a dimensionality reduction technique
in machine learning. SVD is a matrix factorization technique which reduces the number of
features of a dataset by reducing the space dimension from N dimensions to K dimensions
(where K < N). In the context of recommender systems, SVD is used as a collaborative
filtering technique. It uses a matrix structure where each row represents a user and each
column represents an item; the elements of this matrix are the ratings given to items by
users.
$$A = U \Sigma V^{T}$$

● A is the input matrix

● U contains the left singular vectors

● Σ is the diagonal matrix of singular values

● $V^{T}$ is the transpose of the right singular vectors

Here A is the input data matrix (users' ratings), U is the left singular vectors (the user
"features" matrix), Σ is the diagonal matrix of singular values (essentially the
weights/strengths of each concept), and $V^{T}$ is the right singular vectors (the movie
"features" matrix). U and $V^{T}$ are column-orthogonal and represent different things: U
represents how much each user "likes" each feature, and $V^{T}$ represents how relevant each
feature is to each movie.

SVD and Recommendations

With SVD, we turn the recommendation problem into an optimization problem that deals
with how well we can predict the ratings of items for a given user. One common metric for
this optimization is Root Mean Square Error (RMSE): a lower RMSE indicates better
performance, and vice versa. RMSE is minimized over the known entries of the utility matrix.
SVD also has the useful property of minimal reconstruction Sum of Squared Errors (SSE);
therefore, it is commonly used in dimensionality reduction. Below is the formula:

$$\min_{U, \Sigma, V} \sum_{(i,j) \in A} \left(A_{ij} - [U \Sigma V^{T}]_{ij}\right)^{2}$$

● i, j index the rows (users) and columns (items) of the known entries in A

RMSE and SSE are monotonically related: the lower the SSE, the lower the RMSE. Given the
convenient property that SVD minimizes SSE, we know that it also minimizes RMSE. Thus, SVD
is a great tool for this optimization problem. To predict an unseen item for a user, we
simply multiply U, Σ, and $V^{T}$.

4.4.3 Hybrid Recommendation:


Now, let's dive into the main aspect: the integration of collaborative and content-based
filtering into a unified component. This integration aims to mitigate the limitations of each
approach. Let's explore how this is achieved and how it addresses the drawbacks of both
methods.
Upon acquiring sentence embeddings from BERT, we proceed to calculate the cosine
similarity between each sentence and every other sentence. This computation can be carried
out with scikit-learn's cosine_similarity function (or implemented directly with NumPy dot
products), and the resulting values are stored in a matrix.
In the initial step, our hybrid recommendation system begins by soliciting input from the
user, specifically the user ID and the title of a movie they are interested in. The system then
checks if the provided title exists within our dataset. If a match is found, the system proceeds
to identify the index corresponding to that particular movie title in our Data Frame.
Once the movie title index is determined, the system utilizes the cosine similarity matrix
(cos_sim) to generate the top 10 recommendations for movies that exhibit similarity to the
user's specified movie. This content-based filtering aspect ensures that recommendations are
based on the content features of the input movie, thereby capturing thematic similarities.
Subsequently, these top 10 recommendations, along with the user ID and movie ID, are
forwarded to the collaborative filtering component. This collaborative filtering step takes
into account the user's historical preferences and behaviors, recommending movies that users
with similar tastes have enjoyed. By combining both content-based and collaborative
filtering approaches, our hybrid recommendation system aims to leverage the strengths of
each method while mitigating their individual limitations. This integration provides a more
comprehensive and personalized set of recommendations for the user, enriching their overall
experience with the recommendation system.
However, in cases where the provided movie title is not found within our database, an
alternative approach is implemented. Instead, we request the user to input a movie overview.
Subsequently, we utilize BERT embeddings for the input overview, computing the cosine
similarity between this input and the overviews of other movies. The system then generates
and presents the top 10 recommendations based solely on content-based filtering using the
cosine similarity scores derived from the movie overviews.
In this scenario, the recommendation system exclusively relies on content-based filtering,
emphasizing the semantic similarities between the input movie overview and the existing
movies in the dataset. This allows the system to offer relevant recommendations by
identifying movies with similar thematic content, thus ensuring a robust and adaptable
recommendation mechanism even when specific movie titles are absent from the database.
Chapter 5
Results and Discussion
5.1 Feature Importance and Analysis:
The comprehension and selection of appropriate features significantly impact the
performance of a model. In the context of content-based filtering, we consider all text data as
crucial features, as it encapsulates the essence of item content and behavior. The sentence
embeddings obtained from the BERT model have a dimensionality of 768.
Now, let's delve into the pivotal role played by cosine similarity in this context. Cosine
similarity is instrumental in quantifying the angle between each pair of vectors representing
the sentence embeddings. This measurement helps gauge the similarity or dissimilarity
between different pieces of text data. By calculating the cosine similarity, we effectively
capture the semantic relationships and similarities among the items, forming a fundamental
aspect of the content-based recommendation system.
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

● where A and B are vectors

● $A \cdot B$ is the dot product of the two vectors, and $\lVert A \rVert \, \lVert B \rVert$ is the product of their magnitudes

Our dataset comprises 45,000 rows, and plotting all the cosine similarity values would result
in a cluttered visualization. Therefore, let's focus on examining the first 10 plots to gain
insights.

Figure 14. Cosine similarity for first 10 embeddings of dataframe



5.2 Evaluation Metrics:

For a recommendation system that provides a list of top-10 recommendations based on user
input, there are a few evaluation metrics to assess its performance. Here are some
evaluation approaches.

Precision@k:

Precision@k measures the proportion of relevant items among the top-k recommended
items. In our case, if a user interacts with or likes certain items, we can calculate the
precision of the system by checking how many of the top 10 recommended items are
relevant to the user.

Precision@k = (number of relevant items in top-k) / k

Here we take k as 10.

What would a Precision@10 of 10.00% indicate?

Precision measures the accuracy of the recommendations by calculating the proportion of
relevant items among the top-N recommended items. A Precision@10 of 10.00% would indicate
that 10% of the top 10 recommended movies, i.e., 1 out of 10, were relevant to the user
according to the ground-truth ratings.

The Precision@10 for our Hybrid BERT recommendation system is:
Precision@10: 90.00%

RMSE And MAE in Recommendation Systems:

RMSE (Root Mean Squared Error) is a widely used metric in recommendation systems to
evaluate the accuracy of predicted ratings compared to the actual ratings given by users. It is
particularly relevant when dealing with collaborative filtering algorithms that predict user
preferences for items based on the preferences of other users.


$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n}}$$

where
● $Y_i$ is the actual value of the dependent variable for the i-th observation
● $\hat{Y}_i$ is the predicted value of the dependent variable for the i-th observation
● $n$ is the number of observations in the dataframe

To evaluate the collaborative filtering component with RMSE, we use the Surprise library,
which provides various ready-to-use prediction algorithms, including SVD, and lets us
evaluate its RMSE (Root Mean Squared Error) on the ratings dataset. It is a Python scikit
for building and analysing recommender systems.

The Surprise library, whose name stands for "Simple Python Recommendation System Engine",
is a Python library designed for building and evaluating recommendation systems. It provides
a convenient and easy-to-use interface for implementing and testing various recommendation
algorithms, and it is particularly popular for collaborative filtering-based recommendation
systems.

A caveat of the Surprise library is that it focuses on collaborative filtering, so we cannot
use it directly to evaluate the hybrid recommendation engine.

MAE, or Mean Absolute Error, is the most straightforward evaluation metric. It is literally
the average difference between the rating a user would give a movie and what our system
predicts:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$$

where
● $Y_i$ is the actual value of the dependent variable for the i-th observation
● $\hat{Y}_i$ is the predicted value of the dependent variable for the i-th observation
● $n$ is the number of observations in the dataframe

In order to evaluate our CF-SVD model, we compute RMSE and MAE over 5 folds. Folds refer to
the divisions or partitions of the dataset created for the purpose of training and testing a
model. Cross-validation is a technique used to assess the performance of a model by splitting
the dataset into multiple subsets or folds, training the model on some of these folds, and
evaluating it on the remaining fold(s).

The RMSE and MAE results are:

            Fold 1      Fold 2      Fold 3      Fold 4      Fold 5      Mean
Train RMSE  0.8976      0.8971      0.8982      0.8883      0.8962      0.8955
Train MAE   0.6894      0.6907      0.6914      0.6844      0.6916      0.6895
Test RMSE   0.89758099  0.89706553  0.89817946  0.88828538  0.89622038  0.89620
Test MAE    0.68941309  0.69074929  0.69138941  0.6843557   0.69160636  0.69160

5.3 Results:

If the movie title is found in the dataframe, the system adopts a hybrid approach, providing
movie recommendations and criteria.

Movie recommendations:

Figure 15. Top 10 Recommendation for user input

Criteria:
The criteria involve the similarity score, which describes how similar the recommended movies
are to one another, and the weighted rating, which was discussed in the implementation section.

Figure 16. The similarity score and ratings of top 10 recommendations

Chapter 6
Conclusion and Future Scope
6.1 Conclusion
In this project, we created a recommendation system using two approaches. First, we looked
at the content of movies using BERT to understand their details. This helped us capture
subtle similarities in movie content based on their descriptions. Second, we considered what
similar users liked using collaborative filtering. We used SVD to handle missing data in our
ratings.
Combining these methods resulted in a smart recommendation system. It thinks about both
the content of movies and what people generally enjoy. It turned out to be quite effective,
with 90% precision (Precision@10) in suggesting movies that users might like. The
collaborative filtering part, which checks what similar users liked, had an error measure
called RMSE of 0.89, showing it is doing a good job too.
In simpler terms, our recommendation system acts like a helpful friend who understands
both the details of movies and what people typically enjoy. This makes it better at suggesting
movies you might really enjoy watching. It's like having a buddy who knows your taste in
movies and gives you great suggestions!
6.2 Future Scope:
For future enhancements, there's an opportunity to refine our recommendation system by
fine-tuning the BERT model on a more extensive dataset. Currently, we've utilized a pre-
trained BERT base model, but by subjecting it to thorough training on a custom dataset, we
can expect superior performance and heightened accuracy, particularly tailored to our
specific application.

Furthermore, in collaborative filtering, where we've employed Singular Value Decomposition
(SVD) to reduce dimensionality, there exists room for improvement. Instead of adhering solely
to SVD, exploring and implementing more advanced techniques for model refinement could
potentially yield better outcomes in terms of recommendation quality. This way, we can stay
at the forefront of advancements in both content-based and collaborative filtering
components, ensuring a recommendation system that continually evolves for optimal results.

APPENDIX
REFERENCES

1. F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Introduction to Recommender Systems
Handbook. Springer, October 2017.
2. G. Adomavicius and A. Tuzhilin, "Towards the next generation of recommender
systems: A survey of the state-of-the-art and possible extensions," IEEE Transactions on
Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, June 2005.
3. M. J. Pazzani and D. Billsus, Content-Based Recommendation Systems. Springer, 2020,
pp. 325–341.

4. H. Wang, "ZeroMat: Solving Cold-start Problem of Recommender System with No
Input Data," 2021 IEEE 4th International Conference on Information Systems and
Computer Aided Education (ICISCAE), Dalian, China, 2021, pp. 102-105
5. R. Sharma, S. Rani and S. Tanwar, "Machine Learning Algorithms for building
Recommender Systems," 2019 International Conference on Intelligent Computing and
Control Systems (ICCS), Madurai, India, 2019, pp. 785-790
6. M. E. B. H. Kbaier, H. Masri and S. Krichen, "A Personalized Hybrid Tourism
Recommender System," 2017 IEEE/ACS 14th International Conference on Computer
Systems and Applications (AICCSA), Hammamet, Tunisia, 2017, pp. 244-250,
7. Y. Wang, S. C.-F. Chan, and G. Ngai, "Applicability of demographic recommender
system to tourist attractions: A case study on TripAdvisor," in Proceedings of the
2018 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and
Intelligent Agent Technology, Volume 03. IEEE Computer Society, 2018.
8. H.-W. Chen, Y.-L. Wu, M.-K. Hor and C.-Y. Tang, "Fully content-based movie
recommender system with feature extraction using neural network," 2017 International
Conference on Machine Learning and Cybernetics (ICMLC), Ningbo, China, 2017, pp.
504-509.
9. R. Esmeli, M. Bader-El-Den and H. Abdullahi, "Using Word2Vec Recommendation for
Improved Purchase Prediction," 2020 International Joint Conference on Neural
Networks (IJCNN), Glasgow, UK, 2020, pp. 1-8
10. Sun, Fei, Rui Wang, Rui Zhang, Xin Chen, Xudong Hu, and Zhiyuan Liu. "BERT4Rec:
Sequential Recommendation with Bidirectional Encoder Representations from
Transformers." arXiv preprint arXiv:1904.06690 (2019)
11. Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. A Survey on
Conversational Recommender Systems. ACM Comput. Surv. 54, 5, Article 105 (June
2022), 36 pages.
12. R. Ahuja, A. Solanki and A. Nayyar, "Movie Recommender System Using K-Means
Clustering and K-Nearest Neighbor," 2019 9th International Conference on Cloud
Computing, Data Science & Engineering (Confluence), Noida, India, 2019, pp. 263-268.
M. U. Gul, K. John Pratheep, M. Junaid and A. Paul, "Spiking Neural Network
(SNN) for Crop Yield Prediction," 2021 9th International Conference on Orange
Technology (ICOT), Tainan, Taiwan, 2021, pp. 1-4.

13. Tan, Y.; Zhang, M.; Liu, Y.; and Ma, S. 2016. Rating-Boosted Latent Topics:
Understanding Users and Items with Ratings and Reviews. In IJCAI, 2640–2646.
IJCAI/AAAI Press.
14. McAuley, J. J.; and Leskovec, J. 2013. Hidden factors and hidden topics: understanding
rating dimensions with review text. In RecSys, 165–172. ACM.
15. Zheng, L.; Noroozi, V.; and Yu, P. S. 2017. Joint Deep Modeling of Users and Items
Using Reviews for Recommendation. In WSDM, 425–434. ACM.
16. Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text
Classification. In ACL (1), 328–339. Association for Computational Linguistics.
17. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language
models are unsupervised multitask learners. OpenAI Blog 1(8): 9.
18. Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and
Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In NAACL-HLT,
2227–2237. Association for Computational Linguistics.
19. Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube
recommendations. In Proceedings of the 10th ACM Conference on Recommender
Systems, pages 191–198. ACM.
20. Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2018. Content-based
recommender systems: State of the art and trends. In Recommender systems handbook,
pages 73–105. Springer
