
May 5, 2011

Abstract. This status report contains the ideas and the experiments that we have performed or are currently performing on the KDD Cup 2011 dataset. The dataset is the biggest of its kind, with some unique features such as hierarchical relations among items, different types of items, and dates/timestamps for ratings. We have implemented several variants of matrix factorization approaches. We currently achieve an RMSE of 25.0426 on the test set using the Alternating Least Squares approach, which places us at 92nd position on the leaderboard. After parallelizing the training, one epoch takes roughly 200-400 seconds.

Contents

1 Introduction
2 Dataset
3 Experiments and Results
   3.1 Notation
   3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
   3.3 Sigmoid based Matrix Factorization (SMF)
       3.3.1 Adding Temporal Term
   3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
   3.5 Alternating least squares based Matrix Factorization (ALS)
       3.5.1 Adding Temporal Term
   3.6 Latent Feature log linear model
   3.7 Neighborhood Based Correction
   3.8 Results
       3.8.1 Timing Information
4 Ideas for further exploration
5 Parallelism
   5.1 Alternating update and grouping strategy
   5.2 Joint SGD Update by grouping strategy
6 Software

1 Introduction

This report investigates different collaborative filtering methods on the KDD Cup 2011 dataset. The dataset was provided by Yahoo! and was collected from their music service. It is the biggest of its kind, which restricts our choice of algorithms to the ones that scale. Apart from the typical (user, item, rating) triplets, there is hierarchical information among the items (tracks/albums/artists/genres) and there are time stamps, both of which need to be exploited. So far, we have been able to parallelize several variants of matrix factorization approaches and run them in the order of minutes per epoch. We have also analyzed the dataset in terms of types of items and found significant overfitting for tracks and albums. Furthermore, on the validation set we found that the majority of the error is on items that are rated fewer times. The rest of the report contains our current progress and a description of the dataset.

2 Dataset

The KDD Cup 2011 competition has two tracks. This report presents the experiments we performed on the Track 1 dataset. Statistics for the dataset are presented in Table 1. The ratings range from 0-100 and the dates range roughly over [0-5xxx] days. Session information is also present in the dataset along with the days.

Record fields: USER ID | NO: RATINGS | DAY | TIME STAMP | ITEM IDS

#Users               1,000,990
#Items                 624,961
#Ratings           262,810,175
#TrainRatings      252,800,275
#ValidationRatings   4,003,960
#TestRatings         6,005,940

Table 1. KDD Cup 2011 Track 1 Dataset.

Table 2. Track 1 - Hierarchy statistics for items. [Table data not recoverable.]

[Figure: histogram of rating counts vs rating value.]

[Figure: the item hierarchy, with genres at the top, then artists, albums, and tracks.]

[Figure: total ratings count by item type (Genre, Artist, Album, Track).]

[Figure: distribution of ratings in the test set by item type (Genre, Artist, Album, Track).]

[Figure: histograms of #Tracks, #Artists, and #genres vs log(#ratings) for that item type.]

3 Experiments and Results

3.1 Notation

r_{u,i}       : true rating for user u and item i
r̂_{u,i}      : predicted rating for user u and item i
U_u           : latent feature vector for user u
I_i           : latent feature vector for item i
k             : size of the feature vector, i.e., the number of latent factors
U             : concatenated feature matrix for all users
I             : concatenated feature matrix for all items
N             : number of users
M             : number of items
η             : learning rate parameter
λ             : regularization parameter
σ(x)          : sigmoid of x

3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)

The objective function that we are minimizing here is the squared loss. The updates for the U and I matrices are simultaneous, and we set the first column of U and the second column of I to 1; the corresponding second column of U and first column of I can then be interpreted as the bias terms. We calculate the RMSE on the validation set after each epoch using the trained U and I matrices and terminate either when the epoch limit has been reached or when the RMSE diverges.

Objective function: E = Σ_{(u,i)} (r_{u,i} - U_u · I_i)² + λ(||U_u||² + ||I_i||²)

Optimization type: SGD

Derivative with respect to each example:
∂/∂U_{uk} (r_{u,i} - U_u · I_i)² = -2 (r_{u,i} - U_u · I_i) I_{ik}
∂/∂I_{ik} (r_{u,i} - U_u · I_i)² = -2 (r_{u,i} - U_u · I_i) U_{uk}

Update rules:
U_{uk} ← U_{uk} - η((∂E/∂U_{uk}) + λ U_{uk})
I_{ik} ← I_{ik} - η((∂E/∂I_{ik}) + λ I_{ik})
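A minimal sketch of a simultaneous SGD step of this kind, assuming plain Python lists for U and I and omitting the fixed bias columns described above; the function and parameter names are ours:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_step(U, I, u, i, r, eta=0.001, lam=0.001):
    """One simultaneous SGD step on the rating r for user u and item i.

    Minimizes (r - U_u.I_i)^2 + lam*(||U_u||^2 + ||I_i||^2). The bias-column
    trick (columns fixed to 1) from the text is omitted for brevity.
    """
    err = r - dot(U[u], I[i])
    old_Uu = list(U[u])  # use the pre-update U_u when updating I_i
    for k in range(len(U[u])):
        U[u][k] += eta * (err * I[i][k] - lam * U[u][k])
        I[i][k] += eta * (err * old_Uu[k] - lam * I[i][k])
    return err
```

With a small learning rate, repeated steps on the same (u, i, r) triple drive the prediction toward r.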

3.3 Sigmoid based Matrix Factorization (SMF)

The motivation for SMF is to keep the predicted rating in the range [0-100]. The objective function is similar to BRISMF. There are two ways of parallelizing SGD, both of which are discussed in the Parallelism section.

Objective function: E = Σ_{(u,i)} (r_{u,i} - 100 σ(U_u · I_i))² + λ(||U_u||² + ||I_i||²)

Optimization type: SGD

Derivative with respect to each example:
∂/∂U_{uk} (r_{u,i} - 100 σ(U_u · I_i))² = -2 (r_{u,i} - 100 σ(U_u · I_i)) · 100 σ(U_u · I_i)(1 - σ(U_u · I_i)) I_{ik}
∂/∂I_{ik} (r_{u,i} - 100 σ(U_u · I_i))² = -2 (r_{u,i} - 100 σ(U_u · I_i)) · 100 σ(U_u · I_i)(1 - σ(U_u · I_i)) U_{uk}

Update rules:
U_{uk} ← U_{uk} - η((∂E/∂U_{uk}) + λ U_{uk})
I_{ik} ← I_{ik} - η((∂E/∂I_{ik}) + λ I_{ik})

3.3.1 Adding Temporal Term

The objective function after adding the temporal term is E = Σ (r_{u,i} - 100 σ(Σ_k U_{uk} I_{ik} T_{tk}))² + λ(||U_u||² + ||I_i||²). The rest of the derivation is similar to the previous section.
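The sigmoid-bounded gradient step (without the temporal term) can be sketched as follows; the names are ours and the default eta/lam values are illustrative, not the tuned parameters from the results table:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smf_step(U, I, u, i, r, eta=0.001, lam=1e-4):
    """One SGD step for sigmoid MF. The prediction 100*sigmoid(U_u.I_i)
    stays in [0, 100] by construction."""
    s = sigmoid(dot(U[u], I[i]))
    g = (r - 100.0 * s) * 100.0 * s * (1.0 - s)  # chain rule through sigmoid
    old_Uu = list(U[u])
    for k in range(len(U[u])):
        U[u][k] += eta * (g * I[i][k] - lam * U[u][k])
        I[i][k] += eta * (g * old_Uu[k] - lam * I[i][k])
    return 100.0 * s
```

Note that the effective step size shrinks near the boundaries of the range, since σ(x)(1 - σ(x)) vanishes as σ(x) approaches 0 or 1.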

3.4

The algorithm is similar to SMF except for two key differences a) SGD training in a hierarchical fashion as shown below, here we use alternating training method instead of simultaneous updates. 8

b) Regularization term to make Ii for each item in the hierarchy similar, which is motivated by the fact that users tend to rate items in the same hierarchy similarly for ex: rating for a track and its corresponding album would be similar. for each epoch Update U using I Update Ii using U (i Genres) Update Ii using U (i Artists) Update Ii using U (i Albums) and regularization by (Ii Iartist(i) ) Update Ii using U (i T racks) and regularization by (Ii Ialbums(i) )
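A sketch of the hierarchy-regularized item update, assuming a plain linear prediction for brevity and a hypothetical weight mu on the pull toward the parent item (both assumptions are ours, not the report's exact formulation):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hier_item_step(I, parent, i, Uu, r, eta=0.001, lam=1e-4, mu=0.01):
    """Update item i's latent vector with an extra pull toward its
    hierarchy parent (the album for a track, the artist for an album).

    mu weighs the assumed (I_i - I_parent) similarity regularizer.
    """
    err = r - dot(Uu, I[i])
    p = parent[i]
    for k in range(len(I[i])):
        I[i][k] += eta * (err * Uu[k] - lam * I[i][k]
                          - mu * (I[i][k] - I[p][k]))
    return err
```

Even when the rating error is zero, the mu term keeps shrinking the distance between an item and its parent, which is the intended hierarchical smoothing.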

3.5

This method was rst presented in [4]. The main differences compared to previously dicussed methods are a) the update rule for Uu or Ii is the least squares solution and b) the regularization parameter is multiplied by the number of ratings for that user (nu ) or item (ni ). Objective Function E = (ru,i Uu Ii )2 + ( nu ||Uu ||2 + ni ||Ii ||2 ) Least squares solution for a Uu and Ii T (MI(u) MI(u) + (nu E))Uu = Vu where MI(u) is sub matrix of I, where columns are chosen based on items that user u has rated. and E is the identity matrix and Vu = MI(u) RT (u, I(u)) Optimization Type LS T Update Rule Uu A1 Vu where A = (MI(u) MI(u) + (nu E)) u Ii Bi1 Yi ; derivation similar to Uu 3.5.1 Adding Temporal Term

The objective function after adding temporal term is E = (ru,i k Uuk Iik Ttk ))2 +( nu ||Uu ||2 + ni ||Ii ||2 ). The rest of the derivation is similar to the previous section. We rst learn the U and I matrices by xing all elements of T to be 1 and T is estimated in the end after we have estimated for U and I matrices.
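The per-user least-squares solve (without the temporal term) can be sketched as follows; a small pure-Python Gaussian elimination stands in for the report's actual solver, and all names are assumptions:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def als_user_step(rated, ratings, I, lam=1.0):
    """ALS update for one user: solve (M M^T + lam*n_u*E) U_u = M r,
    where M's columns are the latent vectors of the items the user rated."""
    k = len(I[0])
    n_u = len(rated)
    A = [[sum(I[i][a] * I[i][b] for i in rated)
          + (lam * n_u if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    V = [sum(I[i][a] * r for i, r in zip(rated, ratings)) for a in range(k)]
    return solve(A, V)
```

Since each user's solve only reads I, all user updates within one half-sweep are independent, which is what makes ALS easy to parallelize.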

3.6

In LFL model [1] we restrict output ratings to be in the set Rc = {0,10,20,30,40,50...100} each corresponding to c = {0,...11} classes and learn latent features for each of the ratings. We x U 0 9

c Rc exp(Uu Iic ) 2 c ) + ( c ||Uu ||2 + ||Iic ||2 ) Z c Z = c exp(Uu Iic ) - Normalization term c exp(Uu Iic ) p(c|U c , I c ) = Zc c c c R exp(Uu Ii ) r= Z Derivative with respect to each example c foreach c U c (ru,i c Rc exp(Uu Iic ))2 = 2(ru,i c (Rc p(c|U c , I c ))p(c|U c , I c ) uk c (Rc c (Rc p(c|U c , I c ))Iik c foreach c I c (ru,i c Rc exp(Uu Iic ))2 = 2(ru,i c (Rc p(c|U c , I c ))p(c|U c , I c ) ik c (Rc c (Rc p(c|U c , I c ))Uuk Optimization Type SGD c c c Update Rule Uuk Uuk (( U c E) + (Uuk )) uk c c c Iik Iik (( I c E) + (Iik ))

Objective Function E =

(ru,i

ik

Another scheme to cut down on parameters it to keep a single Uu for all ratings. Although we have implemented this scheme, we have skipped running experiments with it since we did not see signicant difference between the schemes in the initial runs.
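The LFL prediction, a softmax over the 11 rating classes followed by an expectation, can be sketched as below; the list-of-class-vectors layout is our assumption:

```python
import math

RATING_CLASSES = list(range(0, 101, 10))  # R^c = {0, 10, ..., 100}

def lfl_predict(Uu, Ii):
    """Expected rating under the per-class softmax.

    Uu[c] and Ii[c] are the class-c latent vectors for the user and the
    item (one k-vector per rating class).
    """
    scores = [sum(a * b for a, b in zip(Uu[c], Ii[c]))
              for c in range(len(RATING_CLASSES))]
    m = max(scores)                        # subtract max for stability
    expd = [math.exp(s - m) for s in scores]
    Z = sum(expd)
    probs = [e / Z for e in expd]          # p(c | U, I)
    return sum(R * p for R, p in zip(RATING_CLASSES, probs))
```

With all scores equal the prediction is the mean of the class values (50); a dominant class pulls the expectation toward that class's rating.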

3.7 Neighborhood Based Correction

We use the method in [3], which is a post-processing step after learning the latent features for users and items. The NB correction is as follows:

r^c_{u,i} = r^p_{u,i} + α · (Σ_{j, j≠i} sim(item_i, item_j)(r_{u,j} - r^p_{u,j})) / (Σ_{j, j≠i} sim(item_i, item_j)),

where r^c is the corrected rating for user u and item i, r^p is the predicted rating, and sim is the similarity metric. α is learned through regression on the validation set. The summation over j runs over all items the user has rated in the training set.

sim(item_i, item_j) = max{0, (Σ_k I_{ik} I_{jk}) / (√(Σ_k I_{ik}²) · √(Σ_k I_{jk}²))}

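A sketch of this correction, assuming the caller passes the user's other training items (already excluding item i) with their true and predicted ratings; names are ours:

```python
import math

def item_sim(Ii, Ij):
    """Nonnegative cosine similarity between two item latent vectors."""
    num = sum(a * b for a, b in zip(Ii, Ij))
    den = (math.sqrt(sum(a * a for a in Ii))
           * math.sqrt(sum(b * b for b in Ij)))
    return max(0.0, num / den)

def nb_correct(pred_ui, Ii, rated, alpha):
    """Neighborhood-based correction of a predicted rating.

    rated: list of (I_j, r_uj, pred_uj) over the user's training items,
    excluding item i itself. alpha is the coefficient learned by
    regression on the validation set.
    """
    num = sum(item_sim(Ii, Ij) * (r - p) for Ij, r, p in rated)
    den = sum(item_sim(Ii, Ij) for Ij, r, p in rated)
    if den == 0.0:
        return pred_ui
    return pred_ui + alpha * num / den
```

Intuitively, if the model consistently under-predicts the user's similar items, the correction shifts the prediction up by the similarity-weighted residual.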

3.8 Results

The figure below shows the RMSE on the train and validation sets for sigmoid matrix factorization.

[Figure 6: SMF: RMSE on train and validation sets vs epochs.]

The results show overfitting after the second epoch. Item-type-specific RMSE is shown in Figure 7.

[Figure 7: RMSE vs epochs on train and validation sets, per item type (Tracks, Albums, Artists, Genres).]

[Figure 8: From the ALS run: validation RMSE vs log(#ratings) for Tracks, Albums, Artists, and Genres. The red line shows the actual RMSE for that specific item type.]

The results are summarized in the table below. The results from tensor factorization have been excluded since we are not confident about the training scheme. The LFL run was not tuned for the best parameters. Regularization is similar in all the methods except ALS, where it is multiplied by n_u or n_i.

Method (λ/η/k)           RMSE (Test set)   RMSE with NB correction (Test set)
BRISMF (.001/.001/100)   28.6200           -
SMF (10/.0001/100)       25.6736           25.4884
SHMF (10/.0001/100)      25.1183           -
ALS (1/-/50)             25.0426           -
LFL (10/.0001/100)       26.5238           -

Table 4. Current Results on Test Set.

3.8.1 Timing Information

All these runs used 7 cores on the same node. It takes around 250 seconds to load all the files into memory for Track 1 on a single compute node. On vSMP the loading time is around 400 seconds.

Method (k)   Time in sec per epoch
SMF (100)    200
ALS (50)     400

Table 5. Time per epoch.

4 Ideas for further exploration

There are multiple schemes for residual fitting mentioned in [5] which need attention. Another idea is to exploit the hierarchy in the constrained method described in [2], where U_u = Y_u + (Σ_i I_i W_i) / (Σ_i I_i) and I_i here is 1 if user u has rated item i in the training set. We see that the NB correction method on SMF does improve the test RMSE, and NB seems to have the same essence as the constrained method, namely that users who have rated items similarly tend to rate items in a similar fashion. Both the NB correction and the constrained method try to bring the latent user features of similar users closer. The authors of [2] have noted that for users with sparse ratings the constrained method provides a considerable improvement. We first need to think of a way to parallelize the updates for W_i and then ponder how to exploit the hierarchy here. One of the other contestants has claimed an RMSE of 23.97 using ALS with the validation set included in training and 100 latent features [6]. He is currently at rank 47. We could optimistically assume that after adding the constrained feature terms and including the validation set in training we should reach the top 20.

Another scheme to try out is to blend different results. Since we are currently aiming at learning more about the dataset and coming up with a better RMSE on the validation set using a single method, we feel we should leave this until the end. Another note is that the alternating update and grouping strategy is faster, but the RMSE diverges for some initializations of the latent matrices. The results of ALS tensor factorization are similar to ALS without the temporal term on the validation set. One major difference is that the tensor factorization achieves a significantly lower RMSE on the training set (18.xx compared to 20.xx).

5 Parallelism

5.1 Alternating update and grouping strategy

In this scheme, the SGD updates for U and I are decoupled: the U matrix is updated while fixing I, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups, and each group is updated independently.
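A Python sketch of this decomposition (the report's actual implementation uses C++ with pthreads; regularization is omitted and the stride-based group assignment is our choice):

```python
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_user_group(U, I, group, ratings, eta=0.001):
    """SGD-update one group of user rows while I stays fixed."""
    for u in group:
        for i, r in ratings.get(u, []):
            err = r - dot(U[u], I[i])
            for k in range(len(U[u])):
                U[u][k] += eta * err * I[i][k]

def alternating_user_pass(U, I, ratings, n_groups=4):
    """One half-pass of the alternating scheme: U is updated with I fixed,
    one worker per user group. Groups touch disjoint rows of U, so the
    workers need no locking on the matrix being updated."""
    users = list(range(len(U)))
    groups = [users[g::n_groups] for g in range(n_groups)]
    with ThreadPoolExecutor(max_workers=n_groups) as ex:
        futures = [ex.submit(update_user_group, U, I, g, ratings)
                   for g in groups]
        for f in futures:
            f.result()  # propagate any worker exception
```

The symmetric half-pass updates I with U fixed, grouping by item instead of by user.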

[Figure: the U matrix split into groups, each updated in parallel while I is held fixed.]

5.2 Joint SGD Update by grouping strategy

In this scheme, the SGD updates for U and I are parallelized by creating two disjoint sets of (u, i) pairs, as illustrated in the figure below. This scheme can be applied recursively to each of the disjoint sets for further levels of parallelism. However, since the alternating update strategy seems to work for all the algorithms discussed, this scheme has not been implemented yet.
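One way to sketch the resulting schedule: with the user and item groups arranged in an n x n grid, a diagonal rotation (our generalization of the two-set case described above) yields rounds of mutually disjoint blocks:

```python
def disjoint_rounds(n):
    """Schedule (user-group, item-group) blocks into rounds such that
    blocks in the same round share no user group and no item group, so
    their SGD updates touch disjoint rows of U and I and can run in
    parallel. The recursion mentioned in the text would apply the same
    idea inside each block."""
    return [[(g, (g + shift) % n) for g in range(n)] for shift in range(n)]
```

For n = 2 this reduces to exactly the two disjoint sets described above: one round updates the (0,0) and (1,1) blocks, the next the (0,1) and (1,0) blocks.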

[Figure: the rating matrix partitioned by user and item groups into disjoint blocks.]

6 Software

We initially chose Matlab for implementing the baseline methods, but parallel processing in Matlab turned out to be slower than sequential processing on a single node. We suspect the slowness is due to some communication overhead in our code which we have not been able to debug. After spending several days trying to figure out the problem, we gave up and decided to code in C++, which turned out to be a good choice. The pthreads library on GNU/Linux is currently being used for parallelism. As far as we know, there exist no efficient collaborative filtering packages online. As a byproduct of the competition, we are also trying to build a robust and efficient package along the lines of liblinear for regression.

References

1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.
2. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2008.
3. Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).
4. Zhou, Y., Wilkinson, D.M., Schreiber, R., Pan, R. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM (2008) 337-348.
5. A. Paterek. Improving regularized Singular Value Decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.
6. http://groups.google.com/group/graphlab-kdd
