
Netflix Prize

Paper-Name: Netflix Prize.pdf


Created By: Rajan Dhabalia (San Francisco State University)
Mail: rajan.sfsu@gmail.com
1. Introduction:
The Netflix Prize was an open competition for predicting user ratings based on each user's previous rating history.
The competition was held by Netflix, a movie rental company.

The idea behind this project is to narrow down the selection by providing suggestions to users based on
their preferences. On sites like Netflix there are millions of movies and songs available, and most users
are likely to be overwhelmed by the variety of possibilities. Very few users know exactly which movie
they want to watch or which music they want to listen to.

To collect this data on user preferences, most companies with a wide variety of products ask users to
rate items on a 5-star scale.

With the recent expansion of internet-accessible catalogues of movies, music, etc., recommendation becomes a valuable
technology. In this project we apply some general techniques to the Netflix Prize problem and
present results and analysis.

2. Motivation:
The idea behind this project is to recommend items which a customer is likely to order. These recommendations
are based on the customer's own history and the history of other customers with similar interests.
Examples include amazon.com, ebay.com, and netflix.com.

3. Goal:
In the Netflix Prize project we implemented a movie recommendation system which works on the dataset of
netflix.com.

Nowadays all e-commerce leaders like Netflix, Amazon, and eBay have designed recommendation systems
and treat them as a salient feature of their web sites. For a service provider, accurately predicting whether a
customer will enjoy a certain product, based on their history with other products, can connect people to
the products they love. This also improves customer service and drastically increases the
likelihood of a customer making a purchase.

4. Implementation Techniques:
(1) Pre-Processing of Data
(2) Evaluation of Result
 Method
 Algorithms

(1) Pre-Processing of Data

In the pre-processing part we followed these steps.


 First, we read all 1500 movie files, which are structured as: (CustomerID, MovieID,
Rating, Year of Release)
 These files are loaded into memory in Java as follows:
o In a Customer class with:
 Customer ID and ArrayLists of: MovieID, Rating, Dates
Now we create an ArrayList of the Customer class called "Customers", which contains all the customers. We
write this object to a file using Java's object input/output streams.

So, we now need to read only this one object file to retrieve the dataset instead of reading all 1500 movie files.
This saves roughly 90% of the execution time.
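The steps above can be sketched in Java as follows; the class and method names here are illustrative assumptions, not the project's actual code.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Holds one customer's ID together with parallel lists of the
// movies they rated, the ratings given, and the rating dates.
class Customer implements Serializable {
    int customerId;
    List<Integer> movieIds = new ArrayList<>();
    List<Integer> ratings = new ArrayList<>();
    List<String> dates = new ArrayList<>();

    Customer(int customerId) { this.customerId = customerId; }
}

public class PreProcess {
    // Serialize the full customer list once, so later runs read a single
    // object file instead of re-parsing all 1500 movie files.
    static void save(List<Customer> customers, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(new ArrayList<>(customers));
        }
    }

    @SuppressWarnings("unchecked")
    static List<Customer> load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (List<Customer>) in.readObject();
        }
    }
}
```

Reading the single serialized object avoids re-parsing the raw text files on every run, which is where the time saving comes from.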

(2) Evaluation of Result

Method:

The main task of this system is to predict the rating given to movie mk by user uj for an unseen pair (uj,
mk). To accomplish this task, we rely on two different data-mining techniques and concepts: first we use
clustering to get relevant data, and then we use a distribution-based prediction algorithm to predict the rating of
movie mk by user uj.

Clustering is used to find structure in a collection of unlabeled data. It is the process of organizing objects
into groups whose members are similar in behaviour. So, when identifying users by their rating behaviour, we can
create clusters on the basis of users' tastes or of movie content (actors, actresses, and movie genre).

Here we cluster users on the basis of their tastes. In a sample of 500,000 users there are users with
vastly different rating behaviour: some users rate movies only 1 star, some rate only 5 stars, and some
distribute their ratings over a relatively uniform range. Using this dataset, we find the cluster in which all
users have the same taste in movies and rate each movie in almost the same way. To find this cluster of users,
we implemented kNN clustering, which computes the nearest neighbours of the target object or user.

The second stage of the prediction phase relies on the probability distributions P(r | u), using each
user's aggregate rating history and the behaviour of the neighbours retrieved in the previous stage
(kNN). We implemented this phase using a distribution-based prediction algorithm.

Algorithm:

(1) Phase-1: [Clustering: K Nearest Neighbour (kNN)]

The k-nearest-neighbour algorithm (kNN) classifies users based on the closest training
examples in the feature space (here, the movie ratings of other users). The aim of this algorithm is to classify a new
user based on attributes and training samples: we find the K training users closest to the target user,
where the target user is a test customer for whom we need to predict the rating of a movie. The kNN algorithm
uses the neighbourhood's classification as the prediction value for the target user.
Our implementation of the k-nearest-neighbour algorithm is as follows.

(1) Determine the parameter K = total number of nearest neighbours.


(2) Calculate the distance (using the Tanimoto coefficient) between the test user and all training users.

(3) Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
(4) Gather the rating data of the nearest neighbours (training users).
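The four steps above can be sketched as follows, assuming each user's ratings are stored as a MovieID-to-rating map (an illustrative choice, not the project's actual data structure):

```java
import java.util.*;

public class KnnPredictor {
    // Tanimoto (extended Jaccard) coefficient between two users' rating
    // vectors; movies a user has not rated are treated as zero entries,
    // which is an assumption of this sketch.
    static double tanimoto(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (int r : a.values()) normA += (double) r * r;
        for (int r : b.values()) normB += (double) r * r;
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer rb = b.get(e.getKey());
            if (rb != null) dot += (double) e.getValue() * rb;
        }
        return dot / (normA + normB - dot);
    }

    // Steps (1)-(4): rank training users by Tanimoto distance from the test
    // user, then average the k nearest neighbours' ratings of the movie.
    // Note: sorts (mutates) the trainingUsers list in place.
    static double predict(Map<Integer, Integer> testUser,
                          List<Map<Integer, Integer>> trainingUsers,
                          int movieId, int k) {
        trainingUsers.sort(Comparator.comparingDouble(u -> 1.0 - tanimoto(testUser, u)));
        double sum = 0;
        int count = 0;
        for (Map<Integer, Integer> u : trainingUsers) {
            Integer r = u.get(movieId);
            if (r == null) continue; // neighbour has not rated this movie
            sum += r;
            if (++count == k) break;
        }
        return count == 0 ? 0 : sum / count;
    }
}
```

The distance used for sorting is 1 minus the Tanimoto coefficient, so identical rating vectors have distance 0.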

Testing:
Use the average rating of the nearest neighbours as the predicted rating for the test customer.
In our first attempt at executing the project, we tested the application using only kNN and obtained the
following results.

The results show that as we increase the value of K, the predictions become less accurate.


This means that when we have 50 neighbours (k=50), all of them have a strong correlation with the test customer,
so the result is accurate.

When we have 2000 neighbours (k=2000), the set also contains neighbours with a weak correlation with the test
customer, so increasing k degrades the result (RMSE increases).
[Fig 1: Graphical Presentation of KNN on Netflix Project]
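For reference, RMSE, the error metric plotted above and the one used by the Netflix Prize itself, is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch:

```java
public class Rmse {
    // Root-mean-square error between predicted and actual ratings over
    // parallel arrays; lower RMSE means more accurate predictions.
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / predicted.length);
    }
}
```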
(2) Phase-2: [Distribution-based Prediction Algorithm]

In the second phase, we implemented a distribution-based prediction algorithm. This algorithm relies on P(r | u),
each user's aggregate rating behaviour.

Our implementation of this algorithm is as follows.

(1) Retrieve the K neighbours of the testing user using the kNN method.


(2) Find the Kullback-Leibler divergence between each neighbour and the testing customer:

KL(ui, uj) = Σr P(r | ui) · log2( P(r | ui) / P(r | uj) )

where KL(ui, uj) is the Kullback-Leibler divergence between users ui and uj.


 The KL divergence compares the entropy of two distributions over the same random variable.
 KL measures the expected number of extra bits required to code samples from the testing user (ui)
using a code based on the neighbour user (uj).
(3) Compute the distance matrix D of pairwise distances between all users, from which the k nearest neighbours of
each user can be extracted.

(4) Gather the divergence information for all neighbours.


(5) Apply the distribution-based algorithm for the testing user (ui) to predict the rating for movie (mj),

where R(uk, mj) is the known rating given to mj by uk. The rating given to mj by ui is predicted using the ratings
given to mj by the neighbours of ui, weighted according to each neighbour's distance from ui by the weighting
parameter α. The constant c is a normalization parameter.
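Since the prediction formula itself is not reproduced here, the following sketch makes its assumptions explicit: P(r | u) is estimated with add-one smoothing, neighbours are weighted by exp(−α · KL), and c is taken as the reciprocal of the total weight. These specific choices are illustrative, not necessarily the project's exact formula.

```java
import java.util.*;

public class DistributionPredictor {
    // P(r | u): a user's empirical distribution over the 5 star levels,
    // with add-one smoothing so the KL divergence stays finite.
    static double[] ratingDistribution(Collection<Integer> ratings) {
        double[] p = new double[5];
        Arrays.fill(p, 1.0); // add-one smoothing (an assumption of this sketch)
        for (int r : ratings) p[r - 1] += 1.0;
        double total = ratings.size() + 5.0;
        for (int i = 0; i < 5; i++) p[i] /= total;
        return p;
    }

    // KL(ui, uj) = sum over r of P(r | ui) * log2( P(r | ui) / P(r | uj) ):
    // the expected number of extra bits needed to code samples from ui
    // with a code built for uj.
    static double klDivergence(double[] pi, double[] pj) {
        double kl = 0;
        for (int r = 0; r < pi.length; r++)
            kl += pi[r] * (Math.log(pi[r] / pj[r]) / Math.log(2));
        return kl;
    }

    // Step (5): predict ui's rating of movie mj as a weighted average of the
    // neighbours' known ratings R(uk, mj). The weight exp(-alpha * KL) is
    // this sketch's assumption; c is the reciprocal of the total weight.
    static double predict(double[] testDist, List<double[]> neighbourDists,
                          List<Integer> neighbourRatings, double alpha) {
        double weighted = 0, totalWeight = 0;
        for (int k = 0; k < neighbourDists.size(); k++) {
            double w = Math.exp(-alpha * klDivergence(testDist, neighbourDists.get(k)));
            weighted += w * neighbourRatings.get(k);
            totalWeight += w;
        }
        return weighted / totalWeight;
    }
}
```

Under this weighting, neighbours whose rating distributions diverge more from the test user's contribute less to the predicted rating.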

Results:

Probe dataset: 500 movie predictions


Training dataset: 1.2 million records [1500 movie files]

(1) First we executed the program with a combination of 500 test users and different values of k.
[Fig 2: RMSE vs. Prediction Algorithm]

(2) We also took results using a combination of k=2000 and different numbers of test users.

[Fig 3: RMSE vs. Prediction Algorithm]


5. Conclusion:

All our work was implemented in Java using NetBeans 6.8. Java was very convenient because it parsed the
data with ease, and the various memory-related problems we faced were resolved using virtual memory.
However, the algorithm became increasingly slow and performance became an issue, although we think this was
mainly due to the huge amount of data we were using. The data-mining techniques we explored illustrate the
value of prediction in the business domain.

6. GUI [Result]:
7. Contribution:
 Pre-Processing of Training-Dataset
 Retrieval of Test-Dataset
 Implementation of KNN Algorithm using Tanimoto Coefficient
 Implementation of GUI
 Implementation of Distribution based Prediction Algorithm
 Integration Algorithm with the KNN
 Evaluation of Result

8. References:

(1) Dan Gillick and Josh Blumenstock, "Combining meta-data features with neighbor-based models"
(2) Miles Osborne, "Entropy and Kullback-Leibler Divergence", University of Edinburgh
(3) http://www.timelydevelopment.com/demos/NetflixPrize.aspx
(4) http://www.wikipedia.org/
