
MOVIE RECOMMENDATION SYSTEM

Project Report

submitted
in partial fulfillment
for the award of the Degree of
Bachelor of Technology
in Department of Computer Science and Engineering

Supervisor: Mrs. Pallavi Sharma, Assistant Professor

Submitted by: Mr. Abhay Pratap Singh Tomar, B.Tech Sem VIII

Department of Computer Science and Engineering


Amity School of Engineering and Technology

AMITY UNIVERSITY RAJASTHAN


April 2020
CANDIDATE’S DECLARATION
I hereby declare that the work presented in this report, entitled “MOVIE RECOMMENDATION
SYSTEM”, in partial fulfilment for the award of the Degree of “Bachelor of Technology” in the
Department of Computer Science and Engineering, submitted to the Department of Computer
Science and Engineering, Amity University Rajasthan, is a record of my own investigations
carried out under the guidance of Mrs. Pallavi Sharma, Department of Computer Science and
Engineering, Amity University Rajasthan.

I have not submitted the matter presented in this report anywhere for the award of any other
Degree.

Abhay Pratap Singh Tomar


Computer Science and Engineering
Enrolment No.: A20405216082

Amity University Rajasthan

Counter Signed by

Mrs. Pallavi Sharma

TABLE OF CONTENTS
CANDIDATE’S DECLARATION

TABLE OF CONTENTS

ABSTRACT

CHAPTER 1

INTRODUCTION

1.1 Introduction

1.2 Justification for the need of the project

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

2.2 Motivation of Study

2.3 Example of Scheme

2.3.1 Content-based Filtering Systems (CBF based systems)

2.3.2 Collaborative filtering based systems (CF based systems)

CHAPTER 3

PROPOSED WORK

3.1 Introduction

3.2 Modules of Project

3.3 Feasibility Study

3.4 Issues Faced

CHAPTER 4

SYSTEM DESCRIPTION

4.1 Algorithm

4.2 Classification

CHAPTER 5

RESULTS

5.1 Testing

5.2 Results

CONCLUSION

REFERENCES

ABSTRACT
This report presents movie recommender systems, which have become ubiquitous in our lives.
We start by preparing and comparing various models on a smaller dataset of ratings. We then
scale the algorithms so that they can handle a much larger number of ratings. On the smaller
dataset, user-based collaborative filtering gives the lowest Mean Squared Error.

In today’s digital world there is an endless variety of content to be consumed, such as
books, videos, articles and movies, and finding content to one’s liking has become a difficult
task. Digital content providers, on the other hand, want to engage as many users on their
service as possible for the maximum time. This is where recommender systems come into the
picture: the content providers recommend content to users according to the users’ liking. In
this paper we propose a movie recommender system. Its objective is to provide accurate movie
recommendations to users. Basic recommender systems usually consider one of the following
factors when generating recommendations: the preference of the user (i.e. content-based
filtering) or the preference of similar users (i.e. collaborative filtering). To build a stable
and accurate recommender system, a hybrid of content-based filtering and collaborative
filtering will be used.

The system adds a new dimension to the movie watching experience by providing real-time
personalized movie recommendations to users. It takes a collaborative social-networking
approach in which a user’s own tastes are mixed with those of the entire community to
generate meaningful results. Unlike the basic systems described above, this recommendation
engine continually analyses each individual user’s movie preferences and produces custom
movie recommendations. It is purely a movie recommendation service in that it offers a list of
movie suggestions based on previous user ratings.

CHAPTER 1

INTRODUCTION
1.1 Introduction

Recommendation systems help users find and select items (e.g., books, movies, restaurants)
from the huge number available on the web or in other electronic information sources. Given
a large set of items and a description of the user’s needs, they present the user with a small
set of items that are well suited to that description. Similarly, a movie recommendation
system provides a level of comfort and personalization that helps the user interact better with
the system and watch movies that cater to his or her needs. Providing this level of comfort to
the user was our primary motivation for choosing a movie recommendation system as our BE
Project. The chief purpose of our system is to recommend movies to its users based on their
viewing history and the ratings they provide. The system can also help e-commerce
companies publicize their products to specific customers based on the genre of movies they
like. Personalized recommendation engines help millions of people narrow the universe of
potential films to fit their unique tastes. Collaborative filtering and content-based filtering
are the prime approaches to providing recommendations to users. Each of them is best
applicable in specific scenarios because of its respective strengths and weaknesses. In this
paper we propose a mixed approach in which the two algorithms complement each other,
thereby improving the performance and accuracy of our system.

We use machine learning to build a personalized movie scoring and recommendation system
based on a user’s previous movie ratings. Different people have different tastes in movies, and
this is not reflected in the single score that we see when we Google a movie. Our movie scoring
system helps users instantly discover movies to their liking, regardless of how distinct their
tastes may be. Current recommender systems generally fall into two categories: content-based
filtering and collaborative filtering. We experiment with both approaches in our project. For
content-based filtering, we take movie features such as actors, directors, movie description,
and keywords as inputs and use TF-IDF and doc2vec to calculate the similarity between
movies. For collaborative filtering, the input to our algorithm is the observed users’ movie
ratings, and we use K-nearest neighbors and matrix factorization to predict a user’s movie
ratings. We found that collaborative filtering performs better than content-based filtering in
terms of prediction error and computation time.
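
The TF-IDF similarity mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library, and the tokenizer, the TF-IDF weighting variant, and the three movie descriptions are assumptions made for illustration; they are not the report's actual pipeline or data.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical movie descriptions, invented for illustration only.
descriptions = [
    "space crew explores distant planet",
    "space pilot defends distant colony",
    "chef opens restaurant in paris",
]
vectors = tfidf_vectors(descriptions)
sim_space = cosine_similarity(vectors[0], vectors[1])  # two space movies
sim_mixed = cosine_similarity(vectors[0], vectors[2])  # no shared terms
```

The two space-themed descriptions share terms and therefore score higher than the unrelated pair, which is the behaviour a content-based recommender relies on.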

1.2 Justification for the need of project

Due to the advances in recommender systems, users constantly expect good recommendations.
They have a low threshold for services that are not able to make appropriate suggestions. If a
music streaming app is not able to predict and play music that the user likes, then the user
will simply stop using it. This has led to a high emphasis by tech companies on improving
their recommendation systems. However, the problem is more complex than it seems. Every
user has different preferences and likes. In addition, even the taste of a single user can vary
depending on a large number of factors, such as mood, season, or the type of activity the user
is doing. For example, the type of music one would like to hear while exercising differs greatly
from the type of music one would listen to when cooking dinner. Another issue that
recommendation systems have to solve is the exploration vs exploitation problem. They must
explore new domains to discover more about the user, while still making the most of what is
already known about the user.

Two main approaches are widely used for recommender systems. One is content-based
filtering, where we try to profile the user’s interests using the information collected, and
recommend items based on that profile. The other is collaborative filtering, where we try to
group similar users together and use information about the group to make recommendations
to the user.

Given the huge number of movies available all over the world, it is challenging for a user
to find movies suited to his or her tastes. Different users like different movies or actors. It
is important to find a method of filtering out irrelevant movies and/or finding a set of
relevant movies. A movie recommendation system performs exactly these tasks.

Such a system has many implications and is inspired by the success of recommendation
systems in different domains such as books, TV programs, jokes and news articles. It is one of
the most important research topics in the digital television domain. The best-known
recommendation systems are mainly based on Collaborative Filtering (CF) and Content-based
Filtering. CF first tries to find groups of similar users automatically from a set of active
users. The similarities between users are computed using a correlation measure. It then
recommends items to a user based on the opinions of the user’s group. Although CF is
successful in many domains, it has shortcomings such as sparsity and scalability. CF uses
user ratings to find similar users. However, it is very difficult to find such users since very
few movies have ratings. In this paper, we propose two methods important for movie
recommendation: movie swarm mining, which mines a set of movies that helps producers plan
new movies and supports new-item recommendation, and popular and interesting movie
mining, which can be used to solve the new-user problem. The effectiveness of our proposed
methods is demonstrated using movie data sets.

CHAPTER 2

LITERATURE REVIEW
2.1 Introduction

The movie recommendation system is based on the collaborative filtering approach.
Collaborative filtering makes use of information provided by the user. That information is
analyzed and movies are recommended to the user, arranged with the highest-rated movie
first. The system also has a provision for the user to select the attributes on which the
recommendation should be based. Another study analyzed two traditional recommender
systems, namely content-based filtering and collaborative filtering. As both have their own
drawbacks, its author proposed a new system that combines a Bayesian network with
collaborative filtering. The proposed system is optimized for the given problem and provides
probability distributions to make useful inferences. The system uses a mix of content-based
and collaborative filtering algorithms. The context of the movies is also considered while
recommending.

2.2 Motivation of Study

The user-user relationship as well as the user-item relationship plays a role in the
recommendation. The user-specific or item-specific information is grouped into clusters
using Chameleon, an efficient hierarchical clustering technique for recommender systems. A
voting system is used to predict the rating of an item. The proposed system has lower error
and better clustering of similar items. Clustering has also been proposed elsewhere as a way
to deal with recommender systems: two methods of computing cluster representatives were
presented and evaluated. A centroid-based solution and memory-based collaborative filtering
methods were used as a basis for comparing the effectiveness of the two proposed methods.
The result was a significant increase in the accuracy of the generated recommendations
compared with the centroid-based method alone. Costin-Gabriel Chiru proposed Movie
Recommender, a system which uses the information known about the user to provide movie
recommendations. This system attempts to solve the problem of unique recommendations
which results from ignoring the data specific to the user. The psychological profile of the
user, their watching history and the data involving movie scores from other websites are
collected. The recommendations are based on aggregate similarity calculation.

2.3 Example of Scheme

The system is a hybrid model which uses both content-based filtering and collaborative
filtering. To predict the difficulty level of each case for each trainee, a method called
content-boosted collaborative filtering (CBCF) has been proposed. The algorithm is divided
into two stages: the first is content-based filtering, which improves the existing trainee case
ratings data, and the second is collaborative filtering, which provides the final predictions.
The CBCF algorithm combines the advantages of both CBF and CF while overcoming the
disadvantages of each. There are various types of recommender systems with different
approaches, and some of them are classified below:

2.3.1 Content-based Filtering Systems (CBF based systems): In content-based filtering,
items are recommended based on comparisons between the item profile and the user profile. A
user profile is content that is found to be relevant to the user, in the form of keywords (or
features). It can be seen as a set of assigned keywords (terms, features) collected by the
algorithm from items found relevant (or interesting) by the user. The set of keywords (or
features) of an item is the item profile. For example, consider a scenario in which a person
goes to a pastry shop to buy his favorite cake ‘X’. Unfortunately, cake ‘X’ has been sold out,
so the shopkeeper recommends that the person buy cake ‘Y’, which is made of ingredients
similar to those of cake ‘X’. This is an instance of content-based filtering. Advantages of
content-based filtering are:

• They are capable of recommending unrated items.

• We can easily explain the working of the recommender system by listing the content
features of an item.

• Content-based recommender systems need only the ratings of the concerned user, and not
those of any other user of the system.

Disadvantages of content-based filtering are:

• It does not work for a new user who has not rated any item yet, since enough ratings are
required before a content-based recommender can evaluate the user’s preferences and provide
accurate recommendations.

• No recommendation of serendipitous items.

• Limited Content Analysis- The recommender does not work if the system fails to
distinguish the items that a user likes from the items that he does not like.
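
The comparison between a user profile and item profiles described in Section 2.3.1 can be sketched as follows. This is a minimal illustration, assuming keyword sets as profiles and Jaccard overlap as the similarity measure; the report does not specify a measure, and the cake keywords are hypothetical.

```python
def build_user_profile(liked_items, item_profiles):
    """Union of the keywords of every item the user found relevant."""
    profile = set()
    for item in liked_items:
        profile |= item_profiles[item]
    return profile

def jaccard(a, b):
    """Share of keywords two profiles have in common (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(liked_items, item_profiles, top_n=1):
    """Rank items the user has not chosen yet by similarity to the user profile."""
    user_profile = build_user_profile(liked_items, item_profiles)
    candidates = [item for item in item_profiles if item not in liked_items]
    ranked = sorted(candidates,
                    key=lambda item: jaccard(user_profile, item_profiles[item]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical item profiles echoing the cake example above.
item_profiles = {
    "cake_X": {"chocolate", "cream", "hazelnut"},
    "cake_Y": {"chocolate", "cream", "almond"},
    "cake_Z": {"lemon", "meringue"},
}
suggestion = recommend(["cake_X"], item_profiles)
```

Because the ranking needs only the concerned user's liked items, the sketch also reflects the advantage listed above: no other user's ratings are involved.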

2.3.2 Collaborative filtering based systems (CF based systems): A collaborative filtering
system recommends items based on similarity measures between users and/or items. The
system recommends items preferred by similar users. This mirrors the scenario where a
person asks friends who have similar tastes to recommend some movies. We explore two
algorithms for collaborative filtering, the Nearest Neighbors Algorithm and the Latent
Factors Algorithm.

• Nearest Neighbors Collaborative Filtering: This approach relies on the idea that users
who have similar rating behaviors so far, share the same tastes and will likely exhibit
similar rating behaviors going forward. The algorithm first computes the similarity
between users by using the row vector in the ratings matrix corresponding to a user as
a representation for that user. The similarity is computed by using either cosine
similarity or Pearson Correlation. In order to predict the rating for a particular user for
a given movie j, we find the top k similar users to this particular user and then take a
weighted average of the ratings of the k similar users with the weights being the
similarity values.

• Latent Factor Methods: The latent factor algorithm looks to decompose the ratings
matrix R into two tall and thin matrices Q and P, with matrix Q having dimensions
num_users × k and P having dimensions num_items × k, where k is the number of
latent factors. The decomposition of R into Q and P is such that $R \approx Q P^{T}$. Any rating
$r_{ij}$ in the ratings matrix can be computed by taking the dot product of row $q_i$ of matrix
Q and row $p_j$ of matrix P. The matrices Q and P are initialized randomly or by performing
SVD on the ratings matrix. Then, the algorithm solves the problem of minimizing the
error between the actual rating value $r_{ij}$ and the value given by taking the dot product
of rows $q_i$ and $p_j$. The algorithm performs stochastic gradient descent to find the
matrices Q and P with minimum error, starting from the initial matrices.
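
The latent factor method described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the learning rate, the L2 regularization term, the number of epochs, and the tiny ratings list are all hypothetical choices, not values taken from the report.

```python
import random

def factorize(ratings, num_users, num_items, k=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Learn Q (num_users x k) and P (num_items x k) so that q_i . p_j ~ r_ij.

    `ratings` is a list of (user, item, rating) triples for observed entries.
    """
    rng = random.Random(seed)
    # Random initialization of both factor matrices.
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(num_users)]
    P = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(num_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # stochastic: visit observed ratings in random order
        for i, j, r in samples:
            err = r - sum(Q[i][f] * P[j][f] for f in range(k))
            for f in range(k):
                q_if, p_jf = Q[i][f], P[j][f]
                # Gradient step on the squared error with L2 regularization.
                Q[i][f] += lr * (err * p_jf - reg * q_if)
                P[j][f] += lr * (err * q_if - reg * p_jf)
    return Q, P

def predict(Q, P, i, j):
    """Predicted rating: dot product of row q_i and row p_j."""
    return sum(q * p for q, p in zip(Q[i], P[j]))

def mean_squared_error(ratings, Q, P):
    return sum((r - predict(Q, P, i, j)) ** 2 for i, j, r in ratings) / len(ratings)

# Hypothetical observed ratings: (user index, movie index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (0, 2, 1.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
Q0, P0 = factorize(ratings, 3, 3, epochs=0)   # untrained starting point
Q, P = factorize(ratings, 3, 3)               # after stochastic gradient descent
mse_before = mean_squared_error(ratings, Q0, P0)
mse_after = mean_squared_error(ratings, Q, P)
```

Only observed entries contribute to the updates, so the missing entries of R never need to be stored; a prediction for an unobserved pair is obtained afterwards with `predict`.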

Advantages of collaborative filtering based systems:

• It is dependent on the relation between users which implies that it is content-independent.

• CF recommender systems can suggest serendipitous items by observing similar-minded
people’s behavior.

• They can make real quality assessment of items by considering other people’s experience.

Disadvantages of collaborative filtering are:

• Early rater problem: Collaborative filtering systems cannot provide recommendations for
new items since there are no user ratings on which to base a prediction.

• Gray sheep: For a CF based system to work, groups with similar characteristics are
needed. Even if such groups exist, it is very difficult to make recommendations to users who
do not consistently agree or disagree with these groups.

• Sparsity problem: In most cases, the number of items exceeds the number of users by a great
margin, which makes it difficult to find items that are rated by enough people.

CHAPTER 3

PROPOSED WORK
3.1 Introduction

The project is loaded in Anaconda Studio 2016. We used Anaconda Studio for design and
Jupyter for coding the project. All databases were created and maintained in SQL Server 2014,
where we created the tables and stored queries used to hold the project’s data and records.
Hardware Requirement:
• i5 (7th gen) processor based computer
• 4 GB RAM
• 8 GB hard disk
Software Requirement:
• Windows 8, Windows 10 (Original)
• Anaconda Studio 2016
• Jupyter Notebook
• SQL Server 2014
• Administrator access

3.2 Modules of Project

• Baseline methods: We try out the following simple baseline methods to give us an
idea of the performance to expect:

a. Global Average- The global average technique serves as a simple baseline technique.
The average rating for all users across all movies is computed. This global average
serves as a prediction for all the missing entries in the ratings matrix.

b. User average- All users exhibit varying rating behaviors. Some users are lenient in
their ratings, whereas some are very stringent, giving lower ratings to almost all
movies. This user bias needs to be incorporated into the rating predictions. We
compute the average rating for each user. The average rating of the user is then used
as the prediction for each missing rating entry for that particular user. This method
can be expected to perform slightly better than the global average since it takes
the rating behavior of the users into account.

c. Movie average- Some movies are rated highly by almost all users whereas some
movies receive poor ratings from everyone. Another simple baseline which can be
expected to perform slightly better than the global average is the movie average
method. In this technique, each missing rating entry for a movie j is assigned the
average rating for the movie j.

d. Adjusted Average- This simple method tries to incorporate some information about
the user i and the movie j when making a prediction for the entry $r_{ij}$. We predict a
missing entry for user i and movie j by assigning it the global average value adjusted
for the user bias and movie bias. The adjusted average rating is given by the formula
below:

$$r_{ij} = \mathrm{global\_avg} + (\mathrm{u\_avg}(i) - \mathrm{global\_avg}) + (\mathrm{m\_avg}(j) - \mathrm{global\_avg})$$

The user bias is given by the difference between the average user rating and the global
average rating. The movie bias is given by the difference between the average movie
rating and the global average rating. Consider the following example, which
demonstrates the working of the adjusted average method. Let the global average
rating be 3.7. The user A has an average rating of 4.1. Thus the bias of the user is
(4.1 - 3.7) = 0.4, i.e. the user rates 0.4 stars more than the global average. The movie
Fast and Furious has an average rating of 3.1 stars. Thus the bias for the movie is
(3.1 - 3.7) = -0.6. The adjusted average method will predict that user A will give the
movie Fast and Furious a rating of 3.7 + 0.4 - 0.6 = 3.5.
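
The four baselines above can be sketched as small Python functions. The nested-dictionary layout of `ratings` and the sample values in it are assumptions made for illustration; only the 3.7, 4.1 and 3.1 figures come from the worked example in the text.

```python
def global_average(ratings):
    """Average of every observed rating; `ratings` maps user -> {movie: rating}."""
    values = [r for user_ratings in ratings.values() for r in user_ratings.values()]
    return sum(values) / len(values)

def user_average(ratings, user):
    """Average rating given by one user."""
    values = list(ratings[user].values())
    return sum(values) / len(values)

def movie_average(ratings, movie):
    """Average rating received by one movie."""
    values = [user_ratings[movie] for user_ratings in ratings.values()
              if movie in user_ratings]
    return sum(values) / len(values)

def adjusted_average(global_avg, user_avg, movie_avg):
    """Global average corrected by the user bias and the movie bias."""
    user_bias = user_avg - global_avg
    movie_bias = movie_avg - global_avg
    return global_avg + user_bias + movie_bias

# Worked example from the text: global 3.7, user A 4.1, Fast and Furious 3.1.
predicted = adjusted_average(3.7, 4.1, 3.1)

# Hypothetical ratings used only to exercise the three simple averages.
sample = {"A": {"m1": 4, "m2": 5}, "B": {"m1": 2}}
sample_global = global_average(sample)
```

Up to floating-point rounding, `predicted` reproduces the 3.5 stars computed by hand in the example.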

• Collaborative Filtering

We implement the nearest neighbors and latent factor methods for collaborative
filtering. The smaller version of the dataset can be processed by using dense matrices
and non-vectorized operations. However, if we try to use the simple implementations
on the larger dataset of 20 million ratings, the code fails. If we stored the ratings matrix
for the 20 million dataset in a dense format, it would take up around 140000 × 27000 × 8
bytes ≈ 28 GB of memory, assuming 8 bytes per matrix entry. Thus we need to use sparse
matrix representations to be able to handle the bigger dataset effectively. Also, the
matrix operations have to be as vectorized as possible to make efficient use of threads.
We observed that even after our best attempts to optimize the code as much as
possible, the algorithms still needed a lot of time to process such huge
amounts of data. We thus decided to make use of Apache Spark to parallelize the
operations and improve the runtime performance of the algorithms by running them
on Spark clusters.
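
The memory estimate and the idea of a sparse representation can be illustrated with the sketch below. The `SparseRatings` class is a hypothetical dictionary-based stand-in written for this example; it is not the representation used in the project, where a dedicated sparse-matrix library would be the more usual choice.

```python
def dense_size_bytes(num_users, num_movies, bytes_per_entry=8):
    """Memory needed to store every (user, movie) cell, rated or not."""
    return num_users * num_movies * bytes_per_entry

class SparseRatings:
    """Store only observed ratings, keyed by user and then by movie."""

    def __init__(self):
        self._rows = {}

    def add(self, user, movie, rating):
        self._rows.setdefault(user, {})[movie] = rating

    def get(self, user, movie, default=None):
        return self._rows.get(user, {}).get(movie, default)

    def num_stored(self):
        return sum(len(row) for row in self._rows.values())

# Figures from the text: about 140000 users and 27000 movies, 8 bytes per entry.
dense_bytes = dense_size_bytes(140_000, 27_000)
dense_gib = dense_bytes / 2**30   # roughly the 28 GB quoted above

# Hypothetical ratings: only the three observed cells are stored.
store = SparseRatings()
store.add(1, 10, 4.0)
store.add(1, 20, 3.5)
store.add(2, 10, 5.0)
```

With 20 million observed ratings, a structure of this kind holds 20 million values instead of the 3.78 billion cells of the dense matrix.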

• Spark implementation: We implemented the algorithms such that they could be run
on an Apache Spark cluster. We then ran the algorithms on a cluster of 20 AWS EC2
instances. We were able to achieve significant improvements in the running time by
using Spark. The nearest neighbors algorithm took around 30 hrs to run on a single
machine, whereas the Spark implementation on the 20 machine cluster was able to run
in around 50 minutes.

• Content based implementation: Data scraping- For this algorithm, we needed to
obtain movie metadata. While some basic movie metadata, such as release year, tags
and genre, were provided in the dataset, we decided to collect further information about
the movies, in the hope of incorporating it into our system. We decided to obtain
the required information from the IMDB website. The page content was accessed
using the Beautiful Soup Python library. Beautiful Soup is a Python library designed
for quick turnaround projects like screen-scraping. It allows us to easily search and
navigate the page content. Once the structure of the website was identified, we wrote
a script to automatically extract the relevant information from a web page and create
a dump. In Python 3, Beautiful Soup also handles the encoding of the incoming file,
further assisting in the automation.
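
The extract-and-dump step can be sketched as follows. Note the substitution: the project used Beautiful Soup, whereas this sketch uses Python's standard-library `html.parser` so that it runs without extra packages. The page markup, the `summary` class name, and the extracted fields are hypothetical; the real IMDB page structure is not reproduced here.

```python
import pickle
from html.parser import HTMLParser

class TitleAndSummaryParser(HTMLParser):
    """Collect the <title> text and the text of <p class="summary"> elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.summaries = []
        self._in_title = False
        self._in_summary = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p" and ("class", "summary") in attrs:
            self._in_summary = True
            self.summaries.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_summary = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_summary:
            self.summaries[-1] += data

# Hypothetical page markup standing in for a downloaded movie page.
SAMPLE_HTML = """
<html><head><title>Example Movie (1999)</title></head>
<body><p class="summary">A hacker learns the truth.</p>
<p>Unrelated text.</p></body></html>
"""

def scrape(html_text):
    """Parse one page and return the metadata record for it."""
    parser = TitleAndSummaryParser()
    parser.feed(html_text)
    return {"title": parser.title.strip(), "summaries": parser.summaries}

record = scrape(SAMPLE_HTML)
dump = pickle.dumps(record)     # serialized form that could be written to a dump file
restored = pickle.loads(dump)
```

A real scraper would fetch each page over the network and handle missing fields; those parts are left out because they depend on the site's actual structure.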

• Movie Plot keyword search: We also implemented an additional feature that provides
a distributed indexer on the movie synopsis, as extracted from the IMDB website, to
allow searching for movies based on keywords. It uses a MapReduce framework to build
an inverted index. The framework for the indexer is the same as the one developed
earlier in the semester for the assignments. It has been modified to allow searching on
the movie information dataset.

First, the dataset is generated by scraping data from the IMDB pages of the
respective movies. The IMDB movie id is provided along with the MovieLens dataset.
After accessing the webpage, the metadata for the movie is collected, then
pickled and stored in a dump. This same dump can also be used to implement and
extend the content-based model. In the current model, the synopsis refers to the first
of the provided plot summaries. The summaries tend to be much shorter than the
synopsis and can be extracted much more easily.

Also, the size of the dump increased by a factor of almost 100 when using synopses
instead of plot summaries. One more reason to select plot summaries is that detailed
synopses are not available for a large portion of the movies, which would result in a
bias in the search. The rest of the search is similar to the one provided in the
assignments. The reformatter has been adjusted to match the format of the dump. The
rest of the search can be run in the same manner as the assignment. More detailed
instructions are provided in a readMe file.
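
The inverted index behind the keyword search can be sketched in a single process as a map phase followed by a reduce phase. This is only an illustration of the idea: the function names and the three plot summaries are hypothetical, and the distributed indexer and index servers of the actual project are not reproduced.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair for every distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]

def reduce_phase(pairs):
    """Group document ids by term to form the inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index

def build_index(documents):
    """Run the map phase over every document, then reduce the emitted pairs."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)

def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical plot summaries keyed by movie id.
plots = {
    1: "a detective hunts a killer in the city",
    2: "a killer shark terrorizes a beach town",
    3: "two friends open a bakery in the city",
}
index = build_index(plots)
hits = search(index, "killer city")
```

In a distributed setting the map calls would run on separate workers and the reduce step would merge their output per term, but the resulting term-to-documents mapping is the same.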

• Person name search: In addition to searching the movie plot, we also developed a
separate search for names of people associated with the movie. These could be the
actors, the directors, or the writers. The two searches are independent, in that
the person search does not return results for queries appearing in the movie
plots, and vice versa. In order to incorporate the person search for ambiguous queries
that could appear either in a plot or in a name, such as Hill, we expanded the number of
documents that are returned by the index servers, and evaluated all the movies until
either 10 appropriate results are obtained or all the movies returned by the index
servers have been analyzed.

3.3 Feasibility Study

A feasibility study is a high-level capsule version of the entire process. It is necessary to
determine whether the proposed system is feasible by considering the technical, operational,
and economic factors. With a detailed feasibility study, the management will have a clear-cut
view of the proposed system.

• Technical Feasibility

• Economic Feasibility

• Operational Feasibility

In this phase, we study the feasibility of all proposed systems and pick the most feasible
solution for the problem. The feasibility is studied based on the three main factors that follow.

Technical Feasibility:

In this step, we verify whether the proposed systems are technically feasible, i.e., whether all
the technologies required to develop the system are readily available. Technical feasibility
determines whether the organization has the technology and skills necessary to carry out the
project and how these should be obtained.

● All necessary technology exists to develop the system.

● This system is very flexible and can be expanded in the future.

Economic Feasibility:

Economically, this project is completely feasible because it requires no extra financial
investment and, with respect to time, it is entirely possible to complete this project in 6
months. In this step, we verify which proposal is more economical. The new system is
economically feasible only when the financial benefits are greater than the investments and
expenditure. Economic feasibility determines whether the project goal can be achieved within
the resource limits allocated to it. The following costs are considered:

● The cost to conduct a full system investigation.

● The cost of hardware and software for the class of application being considered, including
the development tool.

Our project is economically feasible because the cost of development is minimal when
compared to the financial benefits of the application.

Operational Feasibility:

In this step, we verify different operational factors of the proposed systems, such as
manpower and time. Whichever solution uses the fewest operational resources is the best
operationally feasible solution. Operational feasibility determines whether the proposed
system satisfies user objectives and can be fitted into the current system operation.

● The methods of processing and presentation are completely accepted by the clients.

● The clients have been involved in the planning and development of the system.

● The proposed system will not cause any problem under any circumstances.

3.4 Issues Faced

• Scalability Issues- One of the major challenges in working with the 20 million
dataset is memory constraints. The data cannot be stored as a dense matrix due to its
huge size. We have to make use of sparse matrix representations in order for the
program to work without memory issues. Further, intermediate results such as the
user-user similarity matrix cannot be computed and stored due to the huge memory
footprint. We had to think of ways to compute the similarity values as and when
needed. Further, the 20 million dataset also needed a lot of time to run. We were able
to overcome the time requirements by writing parallelized implementations of the
algorithms using Apache Spark.

• Broken links- As mentioned earlier, metadata about the movies was collected by
scraping details from the IMDB site. The smaller dataset provided auto-generated
links to each movie’s URL based on the movie’s title and release year. This
caused a large portion of the links to be broken. Some titles were ambiguous, leading to a
search page with recommendations rather than the movie page. For others, there was
some sort of error in the reference to the link. Some examples of these errors were
usage of a secondary foreign title instead of the English one, usage of a former title,
and an incorrect year for the movie. As a result, almost a third of the links were broken
and had to be corrected before the data could be used. To fix this, we decided to use a
different dataset where the IMDB movie id was provided instead, which was easier to
use.

CHAPTER 4

SYSTEM DESCRIPTION

Owing to the various demerits of pure content-based and pure CF based systems, we have
proposed a hybrid recommender system known as a content-boosted collaborative filtering
system. This hybrid system takes advantage of both the representation of the content and
the similarities among users. The intuition behind this technique is to use a content-based
predictor to fill the user-rating matrix, which is sparsely populated. A web crawler is used to
download the necessary movie content for our dataset. After preprocessing, the movie content
database is stored. The dataset consists of a user-rating matrix. Content-based predictions
are used to complete each user-rating vector in the user-rating matrix, converting it into a
pseudo rating matrix that combines actual ratings with predicted ratings. Collaborative
filtering is then applied to this full pseudo user-rating matrix to make recommendations
for an active user.

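4.1 Algorithm

Hybrid Algorithm

Step1: Use content-based predictor to calculate the pseudo user-rating vector ‘v’ for every
user ‘u’ in the database.

$$v_{u,i} = \begin{cases} r_{u,i} & \text{if user } u \text{ rated item } i \\ c_{u,i} & \text{otherwise} \end{cases}$$

Here $r_{u,i}$ is the actual rating and $c_{u,i}$ is the rating predicted by the content-based
predictor.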

Step2: Weight all users with respect to similarity with the active user.

• Similarity between users is measured as the Pearson correlation between their ratings
vectors.

Step3: Select n users that have the highest similarity with the active user.

• These users form the neighborhood.

Step4: Compute a prediction from a weighted combination of the selected neighbors’ ratings.
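
Steps 1 to 4 can be sketched end to end as follows. Several parts are assumptions made for illustration: the content-based predictor is replaced by a placeholder that returns the user's mean rating, the weighted combination in Step 4 uses the common mean-centred form, and the ratings are hypothetical. The report does not state which combination formula it used.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pseudo_ratings(actual, items, content_predict):
    """Step 1: keep actual ratings and fill the gaps with content-based predictions."""
    return {
        user: {
            item: rated[item] if item in rated else content_predict(user, item)
            for item in items
        }
        for user, rated in actual.items()
    }

def pearson(vec_a, vec_b, items):
    """Step 2: Pearson correlation between two complete rating vectors."""
    mean_a = mean(vec_a[i] for i in items)
    mean_b = mean(vec_b[i] for i in items)
    cov = sum((vec_a[i] - mean_a) * (vec_b[i] - mean_b) for i in items)
    var_a = sum((vec_a[i] - mean_a) ** 2 for i in items)
    var_b = sum((vec_b[i] - mean_b) ** 2 for i in items)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def predict(pseudo, active, item, items, n=2):
    """Steps 3 and 4: pick the n most similar users, then combine their ratings."""
    sims = [(pearson(pseudo[active], pseudo[u], items), u)
            for u in pseudo if u != active]
    neighbours = sorted(sims, reverse=True)[:n]
    active_mean = mean(pseudo[active].values())
    total_weight = sum(abs(sim) for sim, _ in neighbours)
    if total_weight == 0:
        return active_mean
    # Mean-centred weighted combination (an assumed, commonly used form).
    offset = sum(sim * (pseudo[u][item] - mean(pseudo[u].values()))
                 for sim, u in neighbours)
    return active_mean + offset / total_weight

# Hypothetical actual ratings; missing entries are movies the user has not rated.
actual = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m1": 1, "m3": 5, "m4": 2},
}
items = ["m1", "m2", "m3", "m4"]

def user_mean_predictor(user, item):
    # Placeholder only: stands in for a real content-based predictor.
    return mean(actual[user].values())

pseudo = pseudo_ratings(actual, items, user_mean_predictor)
prediction = predict(pseudo, "alice", "m4", items)
```

Because the pseudo rating matrix is full, the Pearson correlation is computed over all items rather than only over co-rated ones, which is the point of filling the matrix first.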

To the best of our knowledge, there are a number of methods have been proposed in
recommendation system. The well-known recommendation system is Collaborative Filtering,
which uses users assessment on observed items to measure users similarity. Such assessment
is determined either explicitly or implicitly. In an explicit determination, users are asked to
provide their ratings in a one-to-five scale, which are then used for measuring the similarity.
In an implicit determination, users rating are determined based on the browsing behaviors.

However, if the item set is large and users rate only a small fraction of it, it is often
difficult to find similarities between users. This leads to low-accuracy predictions or even to
a failure to make predictions. Balabanovic proposed a content-based recommendation system
which can be applied in different domains, such as books, movies, videos, or music. It uses
features such as author, genre, and most frequently used words. TF-IDF and Information Gain
(IG) are commonly used to extract these features. George proposed a hybrid approach to movie
recommendation. It is a Web-based recommendation system that collects user ratings of movies
on a one-to-five scale through a graphical user interface. The process is implemented in two
variations: substitute and switching. The aim of substitute is to utilize collaborative
filtering. The system uses a collaborative filtering technique as the recommendation method.
However, it uses a content-based technique for prediction if the number of available ratings
falls below a given threshold.

In collaborative filtering, when a new user or a new item is introduced, the system has no
ratings from which to make recommendations. Content-based methods can handle new items but
fail to handle new users. Although a hybrid system tries to incorporate both collaborative and
content-based filtering, it also has difficulty dealing with new users. In this work, we
propose a recommendation system that can handle both new users and new items. First, movie
swarms are formed on the basis of movie genres, which are feature-based and therefore cover
content-based recommendation. This process solves the new-item and new-user recommendation
issues. However, it might be overloaded when a large number of movies of the same genre are
released. To solve this issue, we propose a method that uses popular and interesting movies.
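The threshold rule described for the hybrid approach (collaborative filtering by default, content-based prediction when too few ratings are available) can be illustrated as follows. This is a hedged sketch, not the cited system: the function name hybrid_predict, the min_ratings default and the two predictor callables are assumptions made for the example.

```python
def hybrid_predict(user, item, ratings, cf_predict, cb_predict, min_ratings=5):
    """Use collaborative filtering unless the user has fewer than
    `min_ratings` ratings, in which case fall back to content-based."""
    n_ratings = len(ratings.get(user, {}))
    if n_ratings < min_ratings:
        return cb_predict(user, item)
    return cf_predict(user, item)

# Hypothetical stand-ins for the two underlying predictors.
def cf_stub(user, item):
    return 4.0

def cb_stub(user, item):
    return 3.0

ratings = {
    "frequent_user": {"m1": 5, "m2": 4, "m3": 3, "m4": 4, "m5": 2},
    "new_user": {"m1": 4},
}
frequent_score = hybrid_predict("frequent_user", "m9", ratings, cf_stub, cb_stub)
new_score = hybrid_predict("new_user", "m9", ratings, cf_stub, cb_stub)
```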

Different Search Features to search items

(a) Auto-complete search: The system provides its users with an auto-complete search box,
which automatically pulls up the movies matching the keywords typed by the user. The feature
is activated once the user has typed 3 characters. It also displays the average rating of each
movie beside its title. Auto-complete displays 10 results matching the user's keywords. If the
user is unable to find a match among the 10 results, he/she can click on the ‘more’ link
provided at the bottom of the results to view more results matching the search.
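A minimal sketch of the auto-complete behavior described in (a) is given below. It assumes a simple in-memory mapping from title to average rating and a case-insensitive substring match; the function name autocomplete, the sample titles and the sample ratings are illustrative, not taken from the implemented system.

```python
def autocomplete(query, movies, min_chars=3, limit=10):
    """Return (matches, has_more) for the typed query.

    movies maps title -> average rating. Nothing is returned until
    min_chars characters have been typed; has_more signals that a
    'more' link should be shown.
    """
    text = query.strip().lower()
    if len(text) < min_chars:
        return [], False
    hits = [(title, rating) for title, rating in movies.items()
            if text in title.lower()]
    hits.sort(key=lambda hit: hit[0])  # assumption: alphabetical order
    return hits[:limit], len(hits) > limit

# Hypothetical catalogue with made-up average ratings.
movies = {"Batman Begins": 4.1, "Batman Returns": 3.6, "Titanic": 3.9}
too_short = autocomplete("ba", movies)
matches = autocomplete("bat", movies)
```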

(b) Advanced search: The system also provides users with an advanced search; users can search
for movies by director, publisher, ISBN, etc. Users can also view all the available versions
of a particular movie released by the author so far.

Rate Movies: Users can rate the movies they like or dislike by providing a numerical rating
on a scale of one to ten. The system also allows users to tag their movies and provide
feedback.

View/Edit past Movies: The system allows the users to view and edit their past ratings, tags,
and feedback.

4.2 Classification

➢ Content-based Filtering Systems: In content-based filtering, items are
recommended based on comparisons between the item profile and the user profile. A user
profile is content that is found to be relevant to the user, in the form of keywords (or
features). It can be seen as a set of keywords (terms, features) collected by the
algorithm from items the user found relevant (or interesting). The item profile is the
set of keywords (or features) of an item. For example, consider a scenario in which a
person goes to a pastry shop to buy his favorite cake ‘X’. Unfortunately, cake ‘X’ has
sold out, so the shopkeeper recommends that the person buy cake ‘Y’, which is made of
ingredients similar to those of cake ‘X’. This is an instance of content-based
filtering.

Fig. Content-based filtering

We use cosine similarity to calculate a numeric quantity that denotes the similarity between
two movies. We use the cosine similarity score since it is independent of magnitude and is
relatively easy and fast to calculate. For two movie feature vectors x and y, it is defined as
follows:

cos(x, y) = (x · y) / (||x|| ||y||)

We are now in a good position to define our recommendation function. We follow these steps:

● Get the index of the movie given its title.

● Get the list of cosine similarity scores of that particular movie with all movies. Convert it
into a list of tuples where the first element is the position and the second is the similarity
score.

● Sort the list of tuples by similarity score, in descending order.

● Get the top 10 elements of this list. Ignore the first element as it refers to the movie
itself (the movie most similar to a particular movie is the movie itself).

● Return the titles corresponding to the indices of the top elements.
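The steps above can be sketched as follows. This is a simplified stand-in rather than the code used in the project: it builds plain word-count vectors with the standard library instead of TF-IDF, and the titles, plot strings and the names vectorize, cosine and get_recommendations are made up for the example.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words vector: word -> count (a simple stand-in for TF-IDF)."""
    return Counter(text.lower().split())

def cosine(x, y):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(x[term] * y[term] for term in x if term in y)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

def get_recommendations(title, titles, vectors, top_n=10):
    idx = titles.index(title)                      # index of the movie
    scores = [(i, cosine(vectors[idx], vec))       # (position, similarity)
              for i, vec in enumerate(vectors)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Drop the movie itself by index; this equals "ignore the first
    # element" except when another movie ties with a score of 1.
    top = [pair for pair in scores if pair[0] != idx][:top_n]
    return [titles[i] for i, _ in top]

# Hypothetical titles and plot descriptions.
titles = ["Movie A", "Movie B", "Movie C"]
plots = [
    "batman fights crime in gotham",
    "batman returns to gotham",
    "a ship sinks in the ocean",
]
vectors = [vectorize(plot) for plot in plots]
recommended = get_recommendations("Movie A", titles, vectors, top_n=2)
```

"Movie B" shares two words with "Movie A" while "Movie C" shares only one, so "Movie B" is ranked first.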

While our system has done a decent job of finding movies with similar plot descriptions, the
quality of recommendations is not that great. "The Dark Knight Rises" returns all Batman
movies while it is more likely that the people who liked that movie are more inclined to enjoy
other Christopher Nolan movies. This is something that cannot be captured by the present
system.

Credits, Genres and Keywords Based Recommender: The quality of our recommender should
improve with the use of better metadata, and that is what we do in this section. We build a
recommender based on the following metadata: the 3 top actors, the director, related genres
and the movie plot keywords. From the cast, crew and keywords features, we need to extract
the three most important actors, the director and the keywords associated with that movie. At
present, our data is in the form of "stringified" lists, so we need to convert it into a safe
and usable structure.
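One safe way to turn such "stringified" lists back into Python objects is ast.literal_eval from the standard library, which only evaluates literals. The sketch below assumes each entry is a dictionary with a "name" key and, for crew members, a "job" key; that record layout, the helper names and the placeholder people are assumptions for illustration.

```python
import ast

def parse_field(raw):
    """Convert a stringified list of dicts into a real list (literals only)."""
    return ast.literal_eval(raw)

def get_director(crew):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew:
        if member.get("job") == "Director":
            return member["name"]
    return None

def top_names(entries, k=3):
    """Return the names of the first k entries (e.g. the 3 top-billed actors)."""
    return [entry["name"] for entry in entries[:k]]

# Hypothetical raw values, as they might appear in a CSV column.
cast_raw = "[{'name': 'Actor A'}, {'name': 'Actor B'}, {'name': 'Actor C'}, {'name': 'Actor D'}]"
crew_raw = "[{'name': 'Person X', 'job': 'Editor'}, {'name': 'Person Y', 'job': 'Director'}]"

actors = top_names(parse_field(cast_raw))
director = get_director(parse_field(crew_raw))
```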

Fig. Credits, Genres and Keywords Based Recommender

➢ Collaborative filtering based systems: Our content based engine suffers from some
severe limitations. It is only capable of suggesting movies which are close to a certain
movie. That is, it is not capable of capturing tastes and providing recommendations
across genres.

a) User-based filtering - These systems recommend products to a user that similar users
have liked. To measure the similarity between two users we can use either Pearson
correlation or cosine similarity. This filtering technique can be illustrated with an
example. In the following matrix, each row represents a user, while the columns
correspond to different movies, except the last one, which records the similarity between
that user and the target user. Each cell holds the rating that the user gave to that
movie. Assume user E is the target.

Since users A and F do not share any movie ratings with user E, their similarities with user E
are not defined under Pearson correlation. Therefore, we only need to consider users B, C and
D. Using Pearson correlation, we can compute the following similarities.

Although user-based CF is very simple to compute, it suffers from several problems. One main
issue is that users’ preferences can change over time. This means that precomputing the
matrix based on their neighboring users may lead to bad performance. To tackle this
problem, we can apply item-based CF.

b) Item-based Collaborative Filtering - Instead of measuring the similarity between users,
item-based CF recommends items based on their similarity with the items that the
target user rated. Likewise, the similarity can be computed with Pearson correlation
or cosine similarity. The major difference is that, with item-based collaborative filtering,
we fill in the blanks vertically, as opposed to the horizontal manner of user-based CF.
The following table shows how to do so for the movie Me Before You.
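To make the item-based variant concrete, the sketch below computes cosine similarity between two movies over the users who rated both, then predicts a missing rating as a similarity-weighted average of the target user's own ratings. The data, user and movie names, and function names are hypothetical, and the weighting scheme is one common choice rather than the report's exact computation.

```python
from math import sqrt

def item_cosine(ratings, item_a, item_b):
    """Cosine similarity of two items over users who rated both; None if none did."""
    common = [u for u, r in ratings.items() if item_a in r and item_b in r]
    if not common:
        return None
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_item_based(ratings, user, target):
    """Weight the user's own ratings by each rated item's similarity to `target`."""
    weighted, total = 0.0, 0.0
    for item, rating in ratings[user].items():
        if item == target:
            continue
        sim = item_cosine(ratings, target, item)
        if sim is None or sim <= 0:
            continue
        weighted += sim * rating
        total += sim
    return weighted / total if total else None

# Hypothetical toy data: user -> {movie: rating}
ratings = {
    "u1": {"x": 5, "y": 4, "z": 1},
    "u2": {"x": 4, "y": 5},
    "u3": {"y": 2, "z": 5},
    "u4": {"x": 2, "z": 4},
    "target": {"y": 4, "z": 2},
}
predicted = predict_item_based(ratings, "target", "x")

# With a single co-rating user, raw-rating cosine similarity is always 1.
single = item_cosine({"u": {"p": 1, "q": 5}}, "p", "q")
```

Here movie "x" is more similar to "y" than to "z", so the prediction lies closer to the target user's rating of "y". The last line shows why very sparse data can produce misleadingly high similarities.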

It successfully avoids the problem posed by dynamic user preferences, as item-based CF is
more static. However, several problems remain for this method. First, the main issue is
scalability. The computation grows with both the number of customers and the number of
products. The worst-case complexity is O(mn) with m users and n items. In addition, sparsity
is another concern. Take a look at the above table again. Although only one user rated
both Matrix and Titanic, the similarity between them is 1. In extreme cases, we can
have millions of users, and the similarity between two fairly different movies could be
very high simply because they have similar ratings from the only user who rated them both.

Singular Value Decomposition: One way to handle the scalability and sparsity issues
created by CF is to leverage a latent factor model to capture the similarity between users
and items. Essentially, we want to turn the recommendation problem into an optimization
problem. We can view it as how good we are at predicting the rating of items given a
user. One common metric is Root Mean Square Error (RMSE). The lower the RMSE, the
better the performance. What is a latent factor? It is a broad idea which describes a
property or concept that a user or an item has. For instance, for music, a latent factor
can refer to the genre that the music belongs to. SVD decreases the dimension of the
utility matrix by extracting its latent factors. Essentially, we map each user and each
item into a latent space of dimension r. This helps us better understand the relationship
between users and items, as they become directly comparable. The figure below illustrates
this idea.

Let us now see how to implement this. Since the dataset we used before did not have a userId
field (which is necessary for collaborative filtering), we load another dataset. We use the
Surprise library to implement SVD.
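The following is not the Surprise code used in the project. It is a dependency-free sketch of the kind of latent factor model that SVD-style recommenders fit: a global mean, user and item biases, and r-dimensional user and item vectors learned by stochastic gradient descent. The toy ratings, the hyperparameter values and the function names are assumptions, and the RMSE reported here is measured on the training data itself, not on a held-out test set.

```python
import random
from math import sqrt

def predict(model, user, item):
    """Predicted rating: global mean + user bias + item bias + dot product."""
    dot = sum(a * b for a, b in zip(model["p"][user], model["q"][item]))
    return model["mu"] + model["bu"][user] + model["bi"][item] + dot

def train(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    model = {
        "mu": sum(r for _, _, r in ratings) / len(ratings),
        "bu": {u: 0.0 for u in users},
        "bi": {i: 0.0 for i in items},
        "p": {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users},
        "q": {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items},
    }
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(model, u, i)
            model["bu"][u] += lr * (err - reg * model["bu"][u])
            model["bi"][i] += lr * (err - reg * model["bi"][i])
            for f in range(n_factors):
                p_uf, q_if = model["p"][u][f], model["q"][i][f]
                model["p"][u][f] += lr * (err * q_if - reg * p_uf)
                model["q"][i][f] += lr * (err * p_uf - reg * q_if)
    return model

def rmse(model, ratings):
    """Root Mean Square Error of the model's predictions on `ratings`."""
    squared = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return sqrt(squared / len(ratings))

# Hypothetical toy data: (user, movie, rating)
data = [
    ("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4), ("u2", "m3", 1),
    ("u3", "m2", 2), ("u3", "m3", 5), ("u4", "m1", 5), ("u4", "m3", 2),
]
untrained_rmse = rmse(train(data, n_epochs=0), data)
trained_rmse = rmse(train(data), data)
```

Comparing untrained_rmse with trained_rmse shows the optimization view described above: training lowers the prediction error on the observed ratings.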

CHAPTER 5

RESULTS
5.1 Testing

As the project is on a large scale, it needs testing to be successful. If each component
works properly in all respects and gives the desired output for all kinds of inputs, then
the project is said to be successful. The testing done here was System Testing, which checks
whether the user requirements are satisfied. The code for the new system was written entirely
in Jupyter, with Python as the coding language and machine learning as the underlying
technique. The new system was tested with the help of users, and all the applications were
verified from the user's point of view. Some applications were found to be erroneous; these
were corrected before being implemented.
5.2 Results
➢ Content-based Filtering Systems

➢ Collaborative filtering based system

Fig. Collaborative Based Output_1

Fig. Collaborative Based Output_2

For the movie with ID 302, we get an estimated rating of 2.851. One startling feature
of this recommender system is that it does not care what the movie is (or what it
contains). It works purely on the basis of an assigned movie ID and tries to predict
ratings based on how other users have rated the movie.

CONCLUSION
A hybrid approach combining content-based filtering and collaborative filtering is taken to
implement the system. This approach overcomes the drawbacks of each individual algorithm and
improves the performance of the system. Techniques like Clustering, Similarity and
Classification are used to get better recommendations, thus reducing MAE and increasing
precision and accuracy. In future, we can work on a hybrid recommender using clustering and
similarity for better performance. Our approach can be further extended to other domains to
recommend songs, videos, venues, news, books, tourism, e-commerce sites, etc.

We have explored both content-based and collaborative filtering for building the
recommendation system. Collaborative filtering overall performs better than content-based
filtering in terms of test RMSE. Additionally, content-based filtering is computationally more
expensive than collaborative filtering, as it involves extensive processing of text features.
Therefore, collaborative filtering is preferred. For future work, we would like to address the
skewed predictions caused by the imbalance between the number of low ratings and the number
of high ratings. We would also explore ways such as regularization to address the overfitting
issue in KNN. Additionally, our recommendation system can be improved by combining
content-based filtering and collaborative filtering. Possible techniques include incorporating
content features as additional features in collaborative filtering, or vice versa, as well as
decision trees and neural networks.

REFERENCES
[1] James Bennett, Stan Lanning; “The Netflix Prize”, in KDD Cup and Workshop in conjunction
with KDD, 2007.
[2] Mohammad Yahya H. Al-Shamri, Kamal K. Bharadwaj; “A Compact User Model for Hybrid
Movie Recommender System”, in International Conference on Computational Intelligence and
Multimedia Applications, 2007.
[3] Hugo Caselles-Dupré; “Word2vec applied to recommendation: hyperparameters matter”, RecSys,
2018.
[4] Christina Christakou, Leonidas Lefakis, Spyros Vrettos and Andreas Stafylopatis; “A Movie
Recommender System Based on Semi-supervised Clustering”, IEEE Computer Society,
Washington, DC, USA, 2015.
[5] Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Miguel A. Rueda-Morales;
“Combining content-based and collaborative recommendations: A hybrid approach based on
Bayesian networks”, International Journal of Approximate Reasoning, revised 2010.
[6] M. Pazzani and D. Billsus; “Learning and Revising User Profiles: The Identification of
Interesting Web Sites”, Machine Learning, vol. 27, pp. 313-331, 1997.
[7] Dietmar Jannach, Gerhard Friedrich; “Tutorial: Recommender Systems”, International Joint
Conference on Artificial Intelligence, Beijing, August 4, 2013.
[8] Gaurangi, Eyrun, Nan; “MovieGEN: A Movie Recommendation System”, UCSB.
[9] Harpreet Kaur Virk, Er. Maninder Singh; “Analysis and Design of Hybrid Online Movie
Recommender System”, International Journal of Innovations in Engineering and Technology
(IJIET), Volume 5, Issue 2, April 2017.
[10] Manoj Kumar, D.K. Yadav, Ankur Singh, Vijay Kr. Gupta; “A Movie Recommender System:
MOVREC”, International Journal of Computer Applications (0975 – 8887), Volume 124 – No. 3,
August 2015.
