
Chapter 1

INTRODUCTION

1.1 Problem definition

A Recommendation System can be described as a system that runs in a clustered or non-clustered
environment, takes a customer's online history and usage behaviour as one of its inputs, and
produces a likely result for the client, thereby giving its customers a prediction closer
to the real world. Recommender systems generally require a huge dataset and a fast computing
framework that can perform analysis on that data within seconds.

Recommendation Systems, in simpler terms, are data-intensive software that run complex pattern
analysis over a large set of predefined parameters. A recommender system models a
customer's interests with the goal of suggesting items to buy or look at. They have become
essential applications in the media streaming business as well, giving suggestions, backed by large
data sets, that direct users toward the content they like most and that matches their
interests. Many systems have been proposed to date for performing suggestions
and recommendations; for example, content-based, collaborative, knowledge-based
and demographic systems are all used for recommendation.

In the proposed Video Recommendation System, videos are suggested and displayed using a
content-based filtering technique, which can work even with a small amount of data.
Building a recommender system from scratch raises several different problems. Most current
recommender systems are based on user information, so what should we do if the
website has not yet acquired enough users?

After that, we must solve the representation of a movie, i.e. how the system can understand a
movie; this is the precondition for comparing the similarity between two movies. Movie features
such as genre, actor and director are one way to categorize movies. But each feature of the
movie should carry a different weight, since each plays a different role in
recommendation. This leads to the following questions:

• How can we recommend movies when there is no user information?
• What kind of movie features can be used for the recommender system?
• How can we calculate the similarity between two movies?
• Is it possible to set a weight for each feature?
1.2 Project Overview

1.2.1 Objective

The goal of our recommendation system is to provide personalized recommendations that help
users consistently find videos relevant to their interests. In order to improve
user satisfaction, it is imperative that these recommendations are provided in real time
and reflect the user's recent activities on the site.

This requirement prevents us from generating recommendations by pre-computation, as is done in
many works, because pre-computed recommendations cannot reflect a user's recent
activities. Instead, we compute new recommendations in real time for each user request.
A video recommendation system that provides users with suitable videos to choose from is
considered an effective way to achieve higher user satisfaction. We also consider attributes such
as mouse hover, video watching time and ratings given to the videos, and recommend accordingly.

1.2.2 Vision

Creating a recommendation system that can:

 detect user difficulty,
 determine the current task,
 provide reasonable assistance.

1.2.3 Mission

The mission of a Recommender System is to connect users and information: in one direction it
helps users find information valuable to them, and in the other it pushes information to
specific users. The mission is to meet the growing needs of its digital user base.

1.2.4 Comparative Analysis

Recommendation Strategy   Statement

KNN                       The major drawback is that it becomes significantly
                          slower as the size of the data in use grows.

Cosine Similarity         Even if two similar documents are far apart by the
                          Euclidean distance (due to the size of the documents),
                          they may still be oriented closer together. The
                          smaller the angle, the higher the cosine similarity.
Chapter 2
Literature Survey

Existing system:

The existing engines make use of conventional algorithms for recommendation. In a Content-based
Recommendation Engine, the system generates recommendations based on the features
associated with products and the user's information.

Content-based recommenders treat recommendation as a user-specific classification problem and
learn a classifier for the user's likes and dislikes based on product features. In Collaborative
recommendation engines, suggestions are generated on the basis of ratings given by a group of
people: the system locates peer users with a rating history similar to the current user and
generates recommendations for the user.

In a Context-based Recommendation Engine, the system requires additional data about the context
of item consumption, such as time, mood and behavioural aspects. These data may be used to
improve the recommendation compared to what could be achieved without this additional source
of information.

Disadvantages:

The major problem with the existing system is that it needs a good amount of data to work even
reasonably well, which can be a challenge for small businesses and startups.

The data used for training should be precise and filtered: any mistake in the data
can lead to inaccuracy of the whole system.

Proposed System:

The proposed Video Recommender System will use a content-based filtering technique with the
cosine similarity algorithm. This methodology depends on creating a rich set of parameters that
describe a particular video or clip file; for a video, the potential parameters could be
channel, genre, year released, etc. The bigger the parameter set, the better
and simpler it is to match patterns with the customer's interests and online footprint. The
parameters are then assigned weights, so that a relative priority is set for each
parameter. All these parameters are then used to build a customer profile. Hence the system
learns about the client's interests and choice patterns by understanding their online behaviour.
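The parameter weighting described above can be sketched as follows; the feature layout and weight values here are hypothetical, chosen only to illustrate how a relative priority per parameter changes the similarity score:

import numpy as np

# Hypothetical binary feature vectors for two videos
# (columns: channel match, three genre flags, year bucket)
video_a = np.array([1, 0, 1, 1, 0], dtype=float)
video_b = np.array([1, 1, 1, 0, 0], dtype=float)

# Assumed relative weights: a higher weight gives that parameter more priority
weights = np.array([0.5, 1.0, 2.0, 1.0, 0.5])

def weighted_cosine(a, b, w):
    """Cosine similarity after scaling each parameter by its weight."""
    aw, bw = a * w, b * w
    return float(np.dot(aw, bw) / (np.linalg.norm(aw) * np.linalg.norm(bw)))

score = weighted_cosine(video_a, video_b, weights)

With weights of all ones this reduces to plain cosine similarity, so the weights act purely as per-parameter priorities.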

Advantages:

The major advantage of using a content-based filtering algorithm is that no huge dataset is
required. Using this kind of system, a video streaming service can increase content consumption
per customer. The content-based filtering algorithm is also flexible in nature.
Feasibility study

Recommender Systems (RSs) are systems capable of predicting the preferences of users over sets
of items, given historical user-preference data. RSs can be found almost everywhere in the
digital space (e.g. Amazon, Google, Netflix), shaping the choices we make, the products we buy,
the books we read, or the movies we watch. However, there are almost no RSs in the academic
world, where we expect they have great potential. The feasibility of this project lies in
analysing the video recommender system based on human interest. During system analysis, the
feasibility study of the proposed system is carried out to ensure that the proposed system is not
a burden to the users. For feasibility analysis, some understanding of the major requirements
for the system is essential.

Three key considerations involved in the feasibility analysis are:

 Economical Feasibility
 Technical Feasibility
 Social Feasibility

Economical Feasibility:

It is important to make our system to reach to many number of users since the profit is calculated
based on the number of users using the system. If we got the amount more than what we had
invested then we can say that our project is economically feasible.

Technical Feasibility:

Any project developed must not have a high demand on the available technical resources. This
will lead to high demands on the available technical resources. This will lead to high demands
being placed on the user.

Market Feasibility:

A marketing plan maps out specific ideas, strategies, and campaigns based on feasibility study
investigations, and is intended to be implemented. Think of a market feasibility study as a
logistical study, and a marketing plan as a specific, planned course of action to take.
Chapter 3
System Environment

About Python

Python is a programming language, which means it's a language both people and computers can
understand. Python was developed by a Dutch software engineer named Guido van Rossum, who
created the language to solve some problems he saw in the computer languages of the time.

Python is an interpreted high-level programming language for general-purpose programming.


Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, and a syntax that allows programmers to express concepts in fewer
lines of code, notably using significant whitespace. It provides constructs that enable clear
programming on both small and large scales.

Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural, and
has a large and comprehensive standard library. Python interpreters are available for many
operating systems. CPython, the reference implementation of Python, is open source software and
has a community-based development model, as do nearly all of its variant implementations.
CPython is managed by the non-profit Python Software Foundation.

You Can Use Python for Pretty Much Anything:

One significant advantage of learning Python is that it’s a general-purpose language that can be
applied in a large variety of projects. Below are just some of the most common fields where
Python has found its use:

 Data science

 Scientific and mathematical computing

 Web development

 Computer graphics

 Basic game development

 Mapping and geography (GIS software)


Python Is Widely Used in Data Science:

Python's ecosystem has been growing over the years, and it is more and more capable of
statistical analysis. It offers the best compromise between scale and sophistication (in terms
of data processing), and emphasizes productivity and readability.

Python is used by programmers who want to delve into data analysis or apply statistical
techniques (and by developers who turn to data science).

There are plenty of Python scientific packages for data visualization, machine learning, natural
language processing, complex data analysis and more.

All of these factors make Python a great tool for scientific computing and a solid alternative to
commercial packages such as MATLAB. The most popular libraries and tools for data science are:

NumPy:

The fundamental package for scientific computing with Python, adding support for large, multi-
dimensional arrays and matrices, along with a large library of high-level mathematical functions
to operate on these arrays.

NumPy is a Python package whose name stands for 'Numerical Python'. It is the core library for
scientific computing: it contains a powerful n-dimensional array object and provides tools for
integrating C, C++, etc. It is also useful for linear algebra, random number generation, and
more. A NumPy array can also be used as an efficient multi-dimensional container for generic
data.
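A minimal illustration of NumPy arrays and vectorized arithmetic:

import numpy as np

# A 2-D (2x3) array created from nested Python lists
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # number of dimensions: 2
print(a.shape)   # (rows, columns): (2, 3)
print(a * 2)     # element-wise arithmetic, no explicit loop needed
print(a.sum())   # sum of all elements: 21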

Pandas:

Pandas is a library for data manipulation and analysis. The library provides data structures and
operations for manipulating numerical tables and time series.

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.

pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and flexible open source data
analysis / manipulation tool available in any language. It is already well on its way toward this
goal.
We'll start with a quick, non-comprehensive overview of the fundamental data structures in
pandas to get you started. The fundamental behavior regarding data types, indexing, and axis
labeling/alignment applies across all of the objects. To get started, import NumPy and load
pandas into your namespace:

In [1]: import numpy as np

In [2]: import pandas as pd
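Continuing from the imports above, a small example of the two fundamental pandas structures, Series and DataFrame (the movie titles are arbitrary sample data):

import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([1.0, 3.0, np.nan], index=["a", "b", "c"])

# A DataFrame is a two-dimensional labeled table
df = pd.DataFrame({"title": ["Toy Story", "Jumanji"],
                   "year": [1995, 1995]})

print(s["b"])            # label-based access: 3.0
print(df["year"].max())  # column-wise operation: 1995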

Matplotlib:

Matplotlib is a python 2D plotting library which produces publication quality figures in a variety
of hardcopy formats and interactive environments across platforms. Matplotlib allows you to
generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, and more.

Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.
Python Saves Time

Even the classic "Hello, world" program illustrates this point:

print("Hello, world")

For comparison, this is what the same program looks like in Java:

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world");
    }
}
Python Keywords and Identifier

 Keywords are the reserved words in Python.


 We cannot use a keyword as a variable name, function name or any other identifier. Keywords
are used to define the syntax and structure of the Python language.
 In Python, keywords are case sensitive.
 There are 33 keywords in Python 3.3. This number can vary slightly over the course of time.

All the keywords except True, False and None are in lowercase, and they must be written exactly
as they are. The list of all the keywords is given below:
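The keyword list can be printed with the standard `keyword` module; the exact contents and count depend on the Python version in use:

import keyword

print(keyword.kwlist)       # e.g. ['False', 'None', 'True', 'and', 'as', ...]
print(len(keyword.kwlist))  # 33 in Python 3.3; slightly more in later versions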

An identifier is the name given to entities like classes, functions and variables in Python. It
helps to differentiate one entity from another.

Rules for writing identifiers

Identifiers can be a combination of lowercase letters (a to z), uppercase letters (A to Z),
digits (0 to 9), and underscores (_). Names like myClass, var_1 and print_this_to_screen are
all valid examples.

An identifier cannot start with a digit. 1variable is invalid, but variable1 is perfectly fine.
Keywords cannot be used as identifiers.

>>> global = 1
File "<interactive input>", line 1
global = 1
^

SyntaxError: invalid syntax

We cannot use special symbols like !, @, #, $, % etc. in our identifier.

>>> a@ = 0
File "<interactive input>", line 1
a@ = 0
^
SyntaxError: invalid syntax
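These rules can be checked programmatically with `str.isidentifier()` and the standard `keyword` module; the helper function below is written just for this illustration:

import keyword

def is_valid_identifier(name):
    """True only if name follows the identifier rules and is not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_identifier("variable1"))  # True
print(is_valid_identifier("1variable"))  # False: starts with a digit
print(is_valid_identifier("global"))     # False: reserved keyword
print(is_valid_identifier("a@"))         # False: contains a special symbol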

Python


Python's large standard library, commonly cited as one of its greatest strengths, provides tools
suited to many tasks. For Internet-facing applications, many standard formats and protocols such
as MIME and HTTP are supported. It includes modules for creating graphical user interfaces,
connecting to relational databases, generating pseudorandom numbers, arithmetic with arbitrary-
precision decimals, manipulating regular expressions, and unit testing.
Some parts of the standard library are covered by specifications (for example, the Web Server
Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not.
They are specified by their code, internal documentation, and test suites (if supplied). However,
because most of the standard library is cross-platform Python code, only a few modules need
altering or rewriting for variant implementations.
Anaconda

Anaconda is a free and open-source distribution of Python aimed at scientific computing and data
science, which simplifies package management and deployment. Python itself is a widely used
high-level, general-purpose, interpreted, dynamic programming language. Its design philosophy
emphasizes code readability, and its syntax allows programmers to express concepts in fewer
lines of code than would be possible in languages such as C++ or Java. The language provides
constructs intended to enable clear programs on both a small and large scale.

Machine Learning

Machine learning is learning based on experience. For example, it is like a person who learns to
play chess by observing others play. In this way, computers can be programmed through the
provision of data on which they are trained, acquiring the ability to identify elements or
their characteristics with high probability.

First of all, you need to know that there are various stages of machine learning:

 data collection
 data sorting
 data analysis
 algorithm development
 checking the generated algorithm
 using the algorithm to draw further conclusions

To look for patterns, various algorithms are used, which are divided into two groups:

 Unsupervised learning

 Supervised learning

With unsupervised learning, the machine receives only a set of input data; it is then up to the
machine to determine the relationships between the entered data and any other hypothetical
data. Unlike supervised learning, where the machine is provided with verification data for
learning, unsupervised learning implies that the computer itself will find patterns and
relationships between different data sets. Unsupervised learning can be further divided into
clustering and association.

Supervised learning implies the computer's ability to recognize elements based on provided
samples: the computer studies the samples and develops the ability to recognize new data based
on them. For example, you can train your computer to filter spam messages based on previously
received information.
Chapter 4
SYSTEM ANALYSIS & DESIGN

Requirement Specification

Software Requirements:

Operating System : Windows 7 Ultimate

Coding Language : Python

Editor : Jupyter Notebook

Server : Anaconda Server (Dashboard)

Command Prompt : Anaconda Prompt

Hardware Requirements:

Processor : i3 (minimum), i5

RAM : 4 GB (minimum)

Mouse : Optical mouse

Monitor : 15.6" colour monitor

Hard disk : 1 TB


Algorithm

K-NN with cosine similarity as distance measurement: -

K-NN is a non-parametric and lazy learning algorithm. Non-parametric means there is no
assumption about the underlying data distribution, i.e. the model structure is determined from
the dataset. It is called a lazy algorithm because it does not need any training data points for
model generation: all training data is used in the testing phase, which makes training faster
but the testing phase slower and costlier.

K-Nearest Neighbor (K-NN) is a simple algorithm that stores all the available cases and classifies
the new data or case based on a similarity measure.

K-NN classification

In K-NN classification, the output is a class membership. An object is classified by a plurality
vote of its neighbors, with the object being assigned to the class most common among its k
nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is
simply assigned to the class of its single nearest neighbor.
To determine which of the K instances in the training dataset are most similar to a new input, a
distance measure is used. For real-valued input variables, the most popular distance measure is the
Cosine similarity.
Steps to be carried out during the K-NN with Cosine Similarity algorithm are as follows:

1. Divide the data into training and test data.
2. Select a value K.
3. Determine which distance function is to be used.
4. Choose a sample from the test data that needs to be classified and compute the distance to
the n training samples.
5. Sort the distances obtained and take the k nearest data samples.
6. Assign the test sample to the class with the majority vote among its k neighbors.
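The steps above can be sketched in a few lines; the tiny training set (genre vectors over Comedy, Horror, Thriller) and its class labels are made up purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote of its k nearest training samples,
    using cosine distance (1 - cosine similarity) as the distance function."""
    dists = []
    for x, label in zip(train_X, train_y):
        cos = np.dot(x, query) / (np.linalg.norm(x) * np.linalg.norm(query))
        dists.append((1.0 - cos, label))      # step 4: compute the distances
    dists.sort(key=lambda t: t[0])            # step 5: sort, take the k nearest
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]         # step 6: majority vote

# Hypothetical genre vectors (Comedy, Horror, Thriller) with known classes
train_X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
train_y = ["light", "light", "dark", "dark"]
print(knn_predict(train_X, train_y, np.array([1.0, 0.0, 0.0])))  # "light"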

Performance of the K-NN algorithm is influenced by three main factors:

1. The distance function or distance metric used to determine the nearest neighbors.
2. The decision rule used to derive a classification from the K nearest neighbors.
3. The number of neighbors used to classify the new example.

Advantages of K-NN:

1. The K-NN algorithm is very easy to implement.
2. It is nearly optimal in the large-sample limit.
Cosine Similarity

Cosine similarity is a metric used to measure how similar documents are, irrespective of their
size. Mathematically, it measures the cosine of the angle between two vectors projected in a
multi-dimensional space. Cosine similarity is advantageous because even if two similar
documents are far apart by the Euclidean distance (due to the size of the documents), they may
still be oriented closer together. The smaller the angle, the higher the cosine similarity.

Cosine similarity is computed using the following formula:

cosine(a, b) = (a ⋅ b) / (∥a∥ ⋅ ∥b∥)

Values range between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.

Libraries that implement cosine similarity typically contain both procedures and functions to
calculate similarity between sets of data: functions are best used when calculating the
similarity between small numbers of sets, while procedures parallelize the computation and are
therefore more appropriate for computing similarities on bigger datasets.

Derivation:

 Recall the definition of the dot product:

a ⋅ b = ∥a∥ ⋅ ∥b∥ ⋅ cos θ

 Rearranging gives:

cos θ = (a ⋅ b) / (∥a∥ ⋅ ∥b∥)

 So let's define the cosine similarity function as:

cosine(d1, d2) = (d1 ⋅ d2) / (∥d1∥ ⋅ ∥d2∥)

 The cosine usually lies in [−1, 1], but document vectors (see Vector Space Model) are usually
non-negative, so the angle between two documents can never be greater than 90 degrees, and for
document vectors:

cosine(d1, d2) ∈ [0, 1]

The minimum cosine is 0 (maximum angle: the documents are orthogonal); the maximum cosine is 1
(minimum angle: the documents are the same).

Cosine Normalization

 If documents have unit length, then cosine similarity is the same as the dot product:

cosine(d1, d2) = (d1 ⋅ d2) / (∥d1∥ ⋅ ∥d2∥) = d1 ⋅ d2

 We can "unit-normalize" document vectors and then compute the dot product on them to get the
cosine; this "unit-length normalization" is often called "cosine normalization" in IR:

d′ = d / ∥d∥

Cosine Distance

 For documents, cosine(d1, d2) ∈ [0, 1].

 Since the maximal value of the cosine is 1, we can define the cosine distance as:

dc(d1, d2) = 1 − cosine(d1, d2) = 1 − (d1 ⋅ d2) / (∥d1∥ ⋅ ∥d2∥)

 Combining the normalization and cosine distance, we get:

similarity(a, b) = (a ⋅ b) / (∥a∥ ⋅ ∥b∥)
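The derivation above can be checked with a few lines of NumPy; the three document vectors are arbitrary examples chosen so that two point in the same direction and one is orthogonal:

import numpy as np

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])
d3 = np.array([0.0, 0.0, 5.0])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(d1, d2))        # 1.0: same direction, maximally similar
print(cosine(d1, d3))        # 0.0: orthogonal, maximally dissimilar

# Cosine normalization: after unit-normalizing, cosine reduces to a dot product
u1, u2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
print(np.dot(u1, u2))        # same value as cosine(d1, d2)

# Cosine distance
print(1.0 - cosine(d1, d2))  # 0.0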


 Use cases: when to use the Cosine Similarity algorithm

We can use the Cosine Similarity algorithm to work out the similarity between two things. We
might then use the computed similarity as part of a recommendation query, for example to get
movie recommendations based on the preferences of users who have given similar ratings to other
movies you have seen.

Vectorization in Cosine Similarity


Vectorization is the process of converting an algorithm from operating on a single value at a time
to operating on a set of values (vector) at one time. Modern CPUs provide direct support for
vector operations where a single instruction is applied to multiple data. Vectorization divides the
computation times by several order of magnitudes and the difference with loops increase with the
size of the data. Hence, if you want to deal with large amount of data, rewriting the algorithm as
matrix operations may lead to important performances gains.
Vectorization is used to speed up the Python code using loop. Using such a function can help in
minimizing the running time of code efficiently. Various operations are being performed over
vector such as dot product of vectors which is also known as scalar product as it produces single
output, outer products which results in square matrix of dimension equal to length X length of the
vectors, Element wise multiplication which products the element of same indexes and dimension
of the matrix remain unchanged.
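The vector operations just mentioned (dot product, outer product, element-wise multiplication) look like this in NumPy:

import numpy as np

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

print(np.dot(v, w))    # scalar (dot) product: 1*4 + 2*5 + 3*6 = 32
print(np.outer(v, w))  # outer product: a 3x3 matrix (len(v) x len(w))
print(v * w)           # element-wise multiplication: [4, 10, 18]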
Vectorization Process:

Word embeddings, or word vectorization, is a methodology in NLP for mapping words or phrases
from a vocabulary to corresponding vectors of real numbers, which are used to find word
predictions and word similarities/semantics. The process of converting words into numbers is
called vectorization. By using the CountVectorizer function we can convert a text document into
a matrix of word counts; the matrix produced is a sparse matrix. Applying CountVectorizer to the
document above yields a 5×15 sparse matrix of type numpy.int64. Vectorized code refers to
operations that are performed on multiple components of a vector.
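A minimal, pure-Python sketch of what a count vectorizer does (scikit-learn's CountVectorizer implements the same idea efficiently as a sparse matrix); the two sample documents are made up for illustration:

def count_vectorize(documents):
    """Build a document-term matrix of raw word counts."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)          # one count per vocabulary word
        for word in doc.lower().split():
            row[index[word]] += 1
        matrix.append(row)
    return vocab, matrix

docs = ["comedy horror", "horror horror thriller"]
vocab, matrix = count_vectorize(docs)
print(vocab)   # ['comedy', 'horror', 'thriller']
print(matrix)  # [[1, 1, 0], [0, 2, 1]]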

GENRE      Comedy   Horror   Thriller

Comedy       1        0        0
Horror       0        1        0
Thriller     0        0        1
Horror       0        1        0
Chapter 5
Implementation

Sample code to display the dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.read_csv('movie.csv')
data = data1.copy()
data
Sample code to split the genre from the dataset:

movies = data.genres.str.split("|")
movies
Sample code for Vectorization:

def vectorization(var):
    gen = set()
    for i in var:
        for j in i.split('|'):
            gen.add(j)
    diction = {}
    for i in gen:
        diction[i] = []
    for i in var:
        dt = i.split('|')
        for k in gen:
            if k in dt:
                diction[k].append(1)
            else:
                diction[k].append(0)
    return diction

diction = vectorization(data['genres'])
diction.keys()
Sample code for the Vectorization process:

for i in diction.keys():
    data[i] = diction[i]
data

Sample code to display numeric values based on genre:

data.drop(columns=['genres', 'movieId', 'title'], inplace=True)
data

Sample code to display a graph on genre:

data.hist(figsize=(20, 10))
plt.show()
Sample code for Cosine Similarity:

values = []
index = []
for i in range(1):          # compare the first movie (index 0) against the rest
    for j in range(1, 27278):
        x = data.iloc[i]
        y = data.iloc[j]
        z = sum(x * y)
        # the denominator is the product (not the sum) of the two vector norms
        w = np.sqrt(sum(x * x)) * np.sqrt(sum(y * y))
        k = z / w
        values.append(k)
        index.append(j)
values
Sample code to display index values:
index
Sample code to display maximum value of index:
values.index(max(values))
Sample code to display title of maximum value:
data1['title'][2208]
Sample code to display genre of maximum value:
data1['genres'][2208]
Sample code to display values of each index:
values1=values.copy()
values1
Sample code to display the best_match values:

best_match = []
for i in values1:
    if i > 0.9:    # cosine similarity never exceeds 1, so use a high threshold
        best_match.append(i)
best_match
Sample code to display the best_match indexes:

j = []
for i in best_match:
    j.append(values.index(i))
print(set(j))
Sample code for movie recommendation titles:

recommend = set(j)
for i in recommend:
    print('movies to watch', data1['title'][i], data1['genres'][i])
Chapter 6
Output Screens

Output for displaying the dataset:


Below output is splitting the genre:

Vectorization Output:
Output for Vectorization process
Output of numeric values based on genre:
Output for displaying graph on genre:

Output for Cosine Similarity:


Output for index values:

Output for maximum values of index:


Output for title of maximum value:

Output of genre maximum value:

Output of each index values:

Output of best_match values:


Output of best_match index:

Output of movie recommendation titles:


Chapter 7
Conclusion

Recommender systems are a powerful new technology for extracting additional value for a
business from its user databases. These systems help users find items they want to buy from a
business. Recommender systems benefit users by enabling them to find items they like.
Conversely, they help the business by generating more sales. Recommender systems are rapidly
becoming a crucial tool in E-commerce on the Web. Recommender systems are being stressed by
the huge volume of user data in existing corporate databases, and will be stressed even more by
the increasing volume of user data available on the Web. New technologies are needed that can
dramatically improve the scalability of recommender systems. Recommender systems open new
opportunities of retrieving personalized information on the Internet. It also helps to alleviate the
problem of information overload which is a very common phenomenon with information retrieval
systems and enables users to have access to products and services which are not readily available
to users on the system.
Chapter 8
Future Scope

The cosine similarity calculation does not work well when we do not have enough ratings for a
movie, or when a user's rating for some movie is exceptionally high or low. As an improvement
to this project, other methods such as adjusted cosine similarity can be used to compute
similarity. Adjusted cosine similarity, which is similar to cosine similarity, is measured by
normalizing the user vectors Ux and Uy and computing the cosine of the angle between them.
However, unlike cosine similarity, when computing the dot product of the two user vectors,
adjusted cosine similarity uses the deviation between each of the user's item ratings, denoted
Ru, and their average item rating, denoted R̄u, in place of the user's raw item rating. In the
near future, the system will be installed on an Apache server and published on the Internet.
Datasets will be updated continuously, and the system will make actual online rating predictions
for users whose habits change day by day; as a result, it can sensitively satisfy current user
tastes. Web services in particular suffer from producing recommendations of millions of items
to millions of users.
