
Introductory Demo

• http://www.caffeinatedrook.com/MovieRec/MovieRecServlet

1
Problem Statement
Data: User-Movie Ratings
Input: User number and Movie number
Output: Predicted Rating
Goal: Predict Ratings with the smallest RMSD possible.
(Make Customer happy.)
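As a minimal sketch of the goal above, RMSD (root-mean-square deviation) can be computed as follows. The function name `rmsd` and the sample numbers are illustrative assumptions, not taken from the slides.

```python
import math

def rmsd(predicted, actual):
    """Root-mean-square deviation between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions versus the ratings the users actually gave.
example_rmsd = rmsd([4.312, 3.0, 2.5], [5, 3, 2])
print(round(example_rmsd, 3))
```

A lower value means the predicted ratings sit closer to the real ones; a perfect predictor scores 0.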

[Figure: two columns of feature values combined (+, =) into a predicted rating of 4.312 / 5]

2
Motivation – Netflix

“The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. The Netflix Prize improves our ability to connect people to the movies they love.”

www.netflix.com

3
Impact to Field
Better Recommendations
= Happy Customers
Happy Customers
= More Money
= Larger Market Share

4
Motivation - Personal
One million of them

Feature Extraction

My uber-competitive nature (aka Justin’s Wife)

5
A problem with the problem statement
Tackling the Netflix Challenge requires many hundreds (thousands…more?) of hours of computation.
Ultimately, it will require the solution to many sub-problems:
Sparsity
Noise
Memory Requirements
Movie Similarity
User Similarity (more on these a little later)

6
The Problem Statement Redefined

Data: User-Movie Ratings


Goal: Discover the relationships between:
Movies to other movies
Users to other users
Movies to users

7
Related Work
• Netflix Prize forum http://www.netflixprize.com//community/
• Lots of info on strategies people are trying.

www.netflix.com www.blockbuster.com
www.amazon.com www.spout.com

• Singular value decomposition and least squares solutions, Numerische Mathematik, Springer Berlin / Heidelberg
• Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing)
• Predicting User Preference for Movies using NetFlix database, Dhiraj Goel and Dhruv Batra, Carnegie Mellon University
• The Netflix Prize, James Bennett and Stan Lanning
• Use of KNN for the Netflix Prize, Ted Hong and Dimitris Tsamis, Stanford University
• How To Break Anonymity of the Netflix Prize Dataset, Arvind Narayanan and Vitaly Shmatikov

8
Domain Understanding

The success or failure of a retailer depends on matching the customer to the product. Online retailers like Netflix can build recommender systems that make use of the vast amounts of data generated online.

Netflix keeps a record for each user, containing the rating (1-5) for each film the user has rated.

9
Data Selection
First, what the data does not contain.

It does not contain:
Movie titles, directors, actors, studio, year
Customer age, sex, income, favorite color

Some contestants have written web-crawlers to mine this information from the web.

10
Data Selection
Sample of the raw data:
6:
2031561,1,2004-07-26
1176140,1,2004-02-16
2336133,2,2004-09-05
1521836,1,2004-08-11
117277,3,2004-10-12
326587,3,2004-09-06
1961542,3,2004-04-20
1041552,3,2004-10-19
1678346,3,2005-04-11
643182,2,2004-07-18
2182301,5,2004-08-04
2502669,2,2004-02-10
2211030,4,2004-05-26
603277,3,2004-12-13
214166,2,2005-10-09
……..

• 17,770 Movies
• 480,189 users
• 100,480,507 Ratings
• 17,770 * 480,189 = 8,532,958,530
• 100,480,507 / 8,532,958,530 = 0.01177
(98.8% sparse!)
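A short sketch of how a block like the sample could be read, followed by the sparsity arithmetic. It assumes each block is a `movie_id:` header followed by `customer_id,rating,date` rows; that field order is inferred from the sample, and `parse_movie_file` is a hypothetical helper name.

```python
def parse_movie_file(text):
    """Parse one block: a 'movie_id:' header, then 'customer,rating,date' rows (assumed layout)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    movie_id = int(lines[0].rstrip(":"))
    ratings = []
    for row in lines[1:]:
        customer, rating, date = row.split(",")
        ratings.append((int(customer), int(rating), date))
    return movie_id, ratings

sample = """6:
2031561,1,2004-07-26
1176140,1,2004-02-16
2336133,2,2004-09-05
"""
movie_id, ratings = parse_movie_file(sample)

# Sparsity arithmetic using the counts given on the slide.
n_movies, n_users, n_ratings = 17_770, 480_189, 100_480_507
cells = n_movies * n_users      # every possible (movie, user) pair
density = n_ratings / cells     # fraction of pairs that carry a rating
sparsity = 1 - density          # fraction of pairs with no rating
print(cells, f"{sparsity:.1%}")
```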
11
Cleaning and Preprocessing

1) Transformed files from movie-view to user-view.
2) Normalized user ratings via Z-Score normalization.
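The two preprocessing steps above might look roughly like this. The sketch assumes normalization is done per user with the population standard deviation; the slides do not say which variant was used, and the helper names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def to_user_view(movie_view):
    """Regroup {movie_id: [(user_id, rating), ...]} into {user_id: {movie_id: rating}}."""
    user_view = defaultdict(dict)
    for movie_id, rows in movie_view.items():
        for user_id, rating in rows:
            user_view[user_id][movie_id] = rating
    return dict(user_view)

def zscore_user(ratings):
    """Z-score one user's ratings: (rating - user mean) / user standard deviation."""
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A user who gave every movie the same rating has no spread; map to 0.
        return {movie: 0.0 for movie in ratings}
    return {movie: (r - mu) / sigma for movie, r in ratings.items()}

# Toy movie-view data: movie_id -> list of (user_id, rating).
movie_view = {1: [(10, 5), (11, 3)], 2: [(10, 1), (11, 3)], 3: [(10, 3)]}
user_view = to_user_view(movie_view)
normalized = {user: zscore_user(r) for user, r in user_view.items()}
```

Normalizing per user removes the difference between harsh and generous raters, so a normalized value describes how a user felt about a movie relative to their own habits.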

12
Discovering Patterns
• Which Software to use? SPSS, SAS, Weka?
8,532,958,530 ratings * 4 bytes / rating
= 34,131,834,120 bytes
= 33,331,869 kilobytes
= 32,550 megabytes
= 31 gigabytes

Too big to hold the entire matrix
Too big to hold condensed matrix
Too “stupid” to manage memory without paging.
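The memory arithmetic above can be reproduced in a few lines. For comparison, the sketch also sizes a hypothetical compact layout that stores only the observed ratings; that layout is an assumption for illustration and is not described in the slides.

```python
N_MOVIES, N_USERS, N_RATINGS = 17_770, 480_189, 100_480_507
BYTES_PER_RATING = 4

# Dense matrix: one 4-byte cell for every (movie, user) pair.
dense_bytes = N_MOVIES * N_USERS * BYTES_PER_RATING
dense_gib = dense_bytes / 1024**3

# Assumed compact layout: ratings grouped by user, each stored as a
# 2-byte movie index (17,770 < 65,536) plus a 1-byte rating (1-5).
compact_bytes = N_RATINGS * (2 + 1)
compact_mib = compact_bytes / 1024**2

print(f"dense: {dense_gib:.1f} GiB, compact: {compact_mib:.1f} MiB")
```

The gap between the two figures is why hand-written code that controls its own storage can be an alternative to a general-purpose statistics package.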
13
Discovering Patterns
Which feature selection method to use?
Principal Component Analysis
Singular Value Decomposition
Multifactor Dimensionality Reduction
Latent Semantic Analysis

14
Discovering Patterns
M = 17,770 * 25 = 444,250
U = 25 * 480,189 = 12,004,725
M + U = 12,448,975
D = 17,770 * 480,189 = 8,532,958,530

Movie: a
User: b
v_ab = ∑_i (U_ai × M_bi)

[Slide annotations: 1000, .001, 1c/5h, 25c / ~5]

15
A little board work to explain the algorithm
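The slides do not record the board work itself. As one plausible illustration, the sketch below fits two small factor matrices by stochastic gradient descent so that the sum over features of `U[user][i] * M[movie][i]` approximates each observed rating. The training loop, the hyperparameters, and the toy data are all assumptions, not the author's confirmed method.

```python
import random

def predict(U, M, user, movie):
    """Predicted rating: dot product of the user's and the movie's feature vectors."""
    return sum(u_i * m_i for u_i, m_i in zip(U[user], M[movie]))

def rmse(U, M, ratings):
    """Root-mean-square error over the observed (user, movie, rating) triples."""
    total = sum((r - predict(U, M, u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5

def init_factors(n_rows, n_features, rng):
    """Small random starting values, one row of features per user or movie."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_rows)]

def train(U, M, ratings, learning_rate=0.05, epochs=500):
    """Update U and M in place, visiting only the ratings that actually exist."""
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(U, M, u, m)
            for i in range(len(U[u])):
                u_i, m_i = U[u][i], M[m][i]
                U[u][i] += learning_rate * err * m_i
                M[m][i] += learning_rate * err * u_i

# Toy normalized ratings as (user, movie, rating): 3 users, 2 movies.
ratings = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 1.0),
           (1, 1, -1.0), (2, 0, -1.0), (2, 1, 1.0)]
rng = random.Random(0)
U = init_factors(3, 2, rng)
M = init_factors(2, 2, rng)
before = rmse(U, M, ratings)
train(U, M, ratings)
after = rmse(U, M, ratings)
```

Because the loop touches only observed ratings, the full 8,532,958,530-cell matrix never has to exist in memory; only the two much smaller factor matrices do.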

16
Interpretation: Feature 1-movie view
Left column:
Trailer Park Boys: Season 3
Trailer Park Boys: Season 4
The Lord of the Rings: The Fellowship of the Ring: Extended Edition
Lord of the Rings: The Return of the King: Extended Edition
Lord of the Rings: The Two Towers: Extended Edition
Lost: Season 1
Veronica Mars: Season 1
House
As Time Goes By: Series 9
Gilmore Girls: Season 4

Right column:
Sweet Potato Pie
Legion of the Dead
Dark Town
Comedy Only in Da Hood
Predator Island
Bad Bizness
Vampiyaz
My Big Phat Hip Hop Family
Jack O'Lantern
Desperate Souls

[Chart: feature value per title; axis ticks 4, 3, -1, -2]

17
Interpretation: Feature 2-movie view
Left column:
Lost in Translation
National Lampoon's Mr. Wong
Without You I'm Nothing
Punch-Drunk Love
Dogville
The Royal Tenenbaums
Whiteboyz
Pornografia
Spooks & Creeps
Kaaterskill Falls

Right column:
Dragon Ball Z: World Tournament
Dragon Ball: Piccolo Jr. Saga: Part 2
Dragon Ball: Tien Shinhan Saga
Dragon Ball Z: Fusion
Dragon Ball: Red Ribbon Army Saga
Dragon Ball Z: Garlic Jr.
Dragon Ball: Piccolo Jr. Saga: Part 1
Armageddon
Dragon Ball: The Path to Power
Pearl Harbor

[Chart: feature value per title; axis ticks from 1 to -1 in steps of 0.2]

18
Interpretation: Feature 3-movie view
Nostradamus: A Voice from the Past
Absolution
Ozzy Osbourne: Double O: Unauthorized
Monster-a-Go-Go!
Dark Harvest 2: The Maize
Jessica: A Ghost Story
Vanilla Sky
Ivan Vasilievich: Back to the Future
American Beauty
Still Bout It

WWE: Rebellion 2002
Battle Athletes: Vol. 3: Go
Sailor Moon: Vol. 10: The Trouble With Rini
Battle Athletes Victory: Vol. 7: The Last Dance
Battle Athletes Victory: Vol. 1: Training
Battle Athletes Victory: Vol. 8: The Human Race!
ECW: Extreme Evolution: Extreme Championship Wrestling
Battle Athletes Victory: Vol. 6: Willpower
Fushigi Yugi: The Mysterious Play: Eikoden
Lupin the 3rd: Dead or Alive

[Chart: feature value per title; axis ticks 1.5, 1, 0.5, 0, -0.5, -1, -1.5]

19
Interpretation: Nearest Neighbors

18 components
Find the nearest neighbors using Euclidean (q=2) distance

q=1
1. American Beauty (1999)
2. Fight Club (1999)
3. Reservoir Dogs (1992)
4. Mystic River (2003)

q=2
1. American Beauty (1999)
2. Fight Club (1999)
3. Mystic River (2003)
4. Reservoir Dogs (1992)

q=3
1. American Beauty (1999)
2. Mystic River (2003)
3. Fight Club (1999)
4. Traffic (2000)

q=4
1. American Beauty (1999)
2. Mystic River (2003)
3. Fight Club (1999)
4. Traffic (2000)
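The rankings above change with q, which suggests a Minkowski distance of order q over the movie feature vectors (q=2 being Euclidean). A minimal sketch under that assumption, using hypothetical 3-component vectors and placeholder titles rather than the 18 components from the slides:

```python
def minkowski(a, b, q):
    """Minkowski distance of order q (q=1 is Manhattan, q=2 is Euclidean)."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

def nearest_neighbors(features, query, k, q=2):
    """Return the k titles closest to `query`; the query itself ranks first at distance 0."""
    target = features[query]
    ranked = sorted(features, key=lambda title: minkowski(features[title], target, q))
    return ranked[:k]

# Hypothetical feature vectors, made up for illustration.
features = {
    "Movie A": [0.0, 0.0, 0.0],
    "Movie B": [0.5, 0.5, 0.0],
    "Movie C": [0.9, 0.0, 0.0],
    "Movie D": [2.0, 2.0, 2.0],
}

for q in (1, 2, 4):
    print(q, nearest_neighbors(features, "Movie A", k=3, q=q))
```

Larger q gives more weight to the single largest per-feature difference, which is why a neighbor that differs a little on several features can overtake one that differs a lot on one feature.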

20
Demo – Name a movie!
• http://www.caffeinatedrook.com/MovieRec/MovieRecServlet

21
