
Triton Miners: Competing in the KDD Cup 2011

May 5, 2011
Abstract. This status report contains the ideas and the experiments that we have performed or are currently performing on the KDD Cup 2011 dataset. The dataset is the biggest of its kind, with some unique features such as hierarchical relations among items, different types of items, and dates/timestamps for ratings. We have implemented several variants of matrix factorization approaches. We are currently at 25.0426 RMSE on the test set using the Alternating Least Squares approach, which places us at 92nd position on the leaderboard. After parallelizing the training, one epoch takes roughly 200-400 seconds.

Contents

1 Introduction
2 Dataset
3 Experiments and Results
  3.1 Notation
  3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)
  3.3 Sigmoid based Matrix Factorization (SMF)
    3.3.1 Adding Temporal Term
  3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)
  3.5 Alternating least squares based Matrix Factorization (ALS)
    3.5.1 Adding Temporal Term
  3.6 Latent Feature log linear model
  3.7 Neighborhood Based Correction
  3.8 Results
    3.8.1 Timing Information
4 Ideas for further exploration
5 Parallelism
  5.1 Alternating update and grouping strategy
  5.2 Joint SGD Update by grouping strategy
6 Software

1 Introduction

This report investigates different collaborative filtering methods on the KDD Cup 2011 dataset. The dataset was provided by Yahoo! and was collected from their music service. It is the biggest of its kind, which restricts our choice of algorithms to the ones that scale. Apart from the typical (user, item, rating) triplets, there is hierarchical information among the items (tracks/albums/artists/genres) and there are timestamps, both of which need to be exploited. So far, we have been able to parallelize several variants of matrix factorization approaches and run them in the order of minutes per epoch. Apart from that, we have analyzed the dataset in terms of types of items and found that there is significant overfitting for tracks and albums. Furthermore, on the validation set we found that the majority of the error is on items that are rated fewer times. The rest of the report contains our current progress and a description of the dataset.

2 Dataset

The KDD Cup 2011 competition has two tracks. This report presents the experiments we performed on the Track 1 dataset. The statistics for the dataset are presented in Table 1. The ratings range from 0-100 and the dates range over roughly [0-5xxx] days. There is also session information present in the dataset along with the days.
Figure 1: Dataset format (USER ID, NO: RATINGS, DAY, TIME STAMP, ITEM IDS)

#Users     #Items   #Ratings     #TrainRatings  #ValidationRatings  #TestRatings
1,000,990  624,961  262,810,175  252,800,275    4,003,960           6,005,940

Table 1. KDD Cup 2011 Track 1 Dataset.

#Genres  #Artists  #Albums  #Tracks
992      27,888    88,909   507,172

Table 2. Track 1 - Hierarchy statistics for items.

Figure 2: Training set rating histogram (counts vs. rating, 0-100)


Figure 3: Hierarchical information among items (genres at the top, followed by artists, albums, and tracks)

Figure 4: Distribution of ratings by item type (Genre, Artist, Album, Track): (a) training set, (b) validation set, (c) test set

Figure 5: Histograms specific to item type (log is to the base e): (a) #Tracks vs log(#ratings of Tracks), (b) #Albums vs log(#ratings for Albums), (c) #Artists vs log(#ratings for Artists), (d) #Genres vs log(#ratings for Genres)

3 Experiments and Results

3.1 Notation

r_{u,i}        True rating for user u and item i
\hat{r}_{u,i}  Predicted rating for user u and item i
U_u            Latent feature vector for user u
I_i            Latent feature vector for item i
k              Size of the feature vector, i.e. the number of latent factors
U              Concatenated feature matrix for all users
I              Concatenated feature matrix for all items
N_u            Number of users
N_i            Number of items
\eta           Learning rate parameter
\lambda        Regularization parameter
\sigma(x)      Sigmoid of x

3.2 Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF)

The objective function that we minimize here is the squared loss. The updates for the U and I matrices are simultaneous, and we fix the first column of U and the second column of I to 1; the second column of U and the first column of I can then be interpreted as the bias terms. We calculate the RMSE on the validation set after each epoch using the trained U and I matrices, and terminate when either the epoch limit has been reached or the RMSE diverges.

Objective Function: E = \sum_{(u,i)} (r_{u,i} - U_u \cdot I_i)^2 + \lambda (\|U_u\|^2 + \|I_i\|^2)
Optimization Type: SGD
Derivative with respect to each example:
\frac{\partial}{\partial U_{uk}} (r_{u,i} - U_u \cdot I_i)^2 = -2 (r_{u,i} - U_u \cdot I_i) I_{ik}
\frac{\partial}{\partial I_{ik}} (r_{u,i} - U_u \cdot I_i)^2 = -2 (r_{u,i} - U_u \cdot I_i) U_{uk}
Update Rule:
U_{uk} \leftarrow U_{uk} - \eta ((\partial E / \partial U_{uk}) + \lambda U_{uk})
I_{ik} \leftarrow I_{ik} - \eta ((\partial E / \partial I_{ik}) + \lambda I_{ik})
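As an illustration only (our actual implementation is in C++), a minimal numpy sketch of one BRISMF-style SGD step; the pinned bias columns follow the convention described above:

```python
import numpy as np

def brismf_sgd_step(U, I, u, i, r, lr=0.001, reg=0.001):
    """One SGD update for a single (user u, item i, rating r) example.

    U: (num_users, k) user factors; U[:, 0] pinned to 1, so I[:, 0] acts
       as the item bias.
    I: (num_items, k) item factors; I[:, 1] pinned to 1, so U[:, 1] acts
       as the user bias.
    """
    err = r - U[u].dot(I[i])                 # prediction error
    grad_u = -2.0 * err * I[i] + reg * U[u]  # d/dU_u of squared loss + L2
    grad_i = -2.0 * err * U[u] + reg * I[i]
    U[u] -= lr * grad_u
    I[i] -= lr * grad_i
    U[u, 0] = 1.0                            # restore the pinned columns
    I[i, 1] = 1.0
    return err
```

The simultaneous update of both factors per example is what distinguishes this from alternating schemes.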

3.3 Sigmoid based Matrix Factorization (SMF)

The motivation for SMF was to keep the predicted rating in the range [0-100]. Here the objective function is similar to BRISMF. There are two ways of parallelizing SGD, both of which are discussed in the Parallelism section.

Objective Function: E = \sum_{(u,i)} (r_{u,i} - 100\,\sigma(U_u \cdot I_i))^2 + \lambda (\|U_u\|^2 + \|I_i\|^2)
Optimization Type: SGD
Derivative with respect to each example:
\frac{\partial}{\partial U_{uk}} (r_{u,i} - 100\,\sigma(U_u \cdot I_i))^2 = -2 (r_{u,i} - 100\,\sigma(U_u \cdot I_i)) \cdot 100\,\sigma(U_u \cdot I_i)(1 - \sigma(U_u \cdot I_i)) I_{ik}
\frac{\partial}{\partial I_{ik}} (r_{u,i} - 100\,\sigma(U_u \cdot I_i))^2 = -2 (r_{u,i} - 100\,\sigma(U_u \cdot I_i)) \cdot 100\,\sigma(U_u \cdot I_i)(1 - \sigma(U_u \cdot I_i)) U_{uk}
Update Rule:
U_{uk} \leftarrow U_{uk} - \eta ((\partial E / \partial U_{uk}) + \lambda U_{uk})
I_{ik} \leftarrow I_{ik} - \eta ((\partial E / \partial I_{ik}) + \lambda I_{ik})

3.3.1 Adding Temporal Term

The objective function after adding the temporal term is E = \sum (r_{u,i} - 100\,\sigma(\sum_k U_{uk} I_{ik} T_{tk}))^2. The rest of the derivation is similar to the previous section.
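A sketch of the sigmoid-based gradient step in numpy, where the prediction is 100 times the sigmoid of the latent dot product (an illustration, not our C++ code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smf_sgd_step(U, I, u, i, r, lr=1e-5, reg=0.0001):
    """One SGD update for SMF; the prediction is 100 * sigmoid(U_u . I_i),
    so it always lies in [0, 100]. Returns the pre-update prediction."""
    s = sigmoid(U[u].dot(I[i]))
    pred = 100.0 * s
    err = r - pred
    # chain rule: d(pred)/d(U_u . I_i) = 100 * s * (1 - s)
    common = -2.0 * err * 100.0 * s * (1.0 - s)
    U_old = U[u].copy()
    U[u] -= lr * (common * I[i] + reg * U[u])
    I[i] -= lr * (common * U_old + reg * I[i])
    return pred
```

The sigmoid's derivative shrinks near 0 and 100, which naturally damps updates for extreme predictions.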

3.4 Sigmoid based Hierarchical Matrix Factorization (SHMF)

The algorithm is similar to SMF except for two key differences: a) SGD training proceeds in a hierarchical fashion as shown below, and we use an alternating training method instead of simultaneous updates; b) a regularization term makes I_i for each item in the hierarchy similar to its parent, which is motivated by the fact that users tend to rate items in the same hierarchy similarly; for example, the rating for a track and its corresponding album would be similar.

for each epoch:
  Update U using I
  Update I_i using U (i \in Genres)
  Update I_i using U (i \in Artists)
  Update I_i using U (i \in Albums), with regularization by (I_i - I_{artist(i)})
  Update I_i using U (i \in Tracks), with regularization by (I_i - I_{album(i)})
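One such epoch could be sketched as follows; `ratings_by_type` (mapping an item type to its (u, i, r) triples), `parent` (the item one level up, or -1 at the top), and the helper names are our assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shmf_epoch(U, I, ratings_by_type, parent, lr=0.01, reg=1e-4, hreg=1e-4):
    """One alternating SHMF epoch: U first with I fixed, then I level by
    level, pulling albums toward their artist and tracks toward their album."""
    # Update U with I fixed (plain SMF gradient on the user side).
    for triples in ratings_by_type.values():
        for u, i, r in triples:
            s = sigmoid(U[u].dot(I[i]))
            err = r - 100.0 * s
            U[u] -= lr * (-2.0 * err * 100.0 * s * (1.0 - s) * I[i] + reg * U[u])
    # Update I with U fixed, in hierarchy order.
    for item_type, pull in [('genre', False), ('artist', False),
                            ('album', True), ('track', True)]:
        for u, i, r in ratings_by_type[item_type]:
            s = sigmoid(U[u].dot(I[i]))
            err = r - 100.0 * s
            grad = -2.0 * err * 100.0 * s * (1.0 - s) * U[u] + reg * I[i]
            if pull and parent[i] >= 0:
                grad += hreg * (I[i] - I[parent[i]])  # keep child near parent
            I[i] -= lr * grad
```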

3.5 Alternating least squares based Matrix Factorization (ALS)

This method was first presented in [4]. The main differences compared to the previously discussed methods are: a) the update rule for U_u or I_i is the least squares solution, and b) the regularization parameter is multiplied by the number of ratings for that user (n_u) or item (n_i).

Objective Function: E = \sum_{(u,i)} (r_{u,i} - U_u \cdot I_i)^2 + \lambda (n_u \|U_u\|^2 + n_i \|I_i\|^2)
Least squares solution for U_u:
(M_{I(u)} M_{I(u)}^T + \lambda n_u E) U_u = V_u,
where M_{I(u)} is the sub-matrix of I whose columns correspond to the items that user u has rated, E is the identity matrix, and V_u = M_{I(u)} R^T(u, I(u)).
Optimization Type: LS
Update Rule: U_u \leftarrow A^{-1} V_u where A = (M_{I(u)} M_{I(u)}^T + \lambda n_u E);
I_i \leftarrow B_i^{-1} Y_i, with a derivation similar to that of U_u.

3.5.1 Adding Temporal Term

The objective function after adding the temporal term is E = \sum (r_{u,i} - \sum_k U_{uk} I_{ik} T_{tk})^2 + \lambda (n_u \|U_u\|^2 + n_i \|I_i\|^2). The rest of the derivation is similar to the previous section. We first learn the U and I matrices by fixing all elements of T to be 1; T is estimated at the end, after we have estimated the U and I matrices.
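The per-user solve can be sketched in a few lines of numpy (here the rated-item factors are passed as rows rather than columns, i.e. the transpose of M_I(u)):

```python
import numpy as np

def als_update_user(M_items, ratings, lam):
    """Least-squares solve for one user's latent vector.

    M_items: (n_u, k) rows of I for the items this user rated.
    ratings: (n_u,) this user's ratings for those items.
    lam:     regularization, scaled by n_u as in the objective above.
    """
    n_u, k = M_items.shape
    A = M_items.T @ M_items + lam * n_u * np.eye(k)  # normal equations + ridge
    V = M_items.T @ ratings
    return np.linalg.solve(A, V)
```

Each user's solve is independent of every other user's, which is what makes ALS straightforward to parallelize.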

3.6 Latent Feature log linear model

In the LFL model [1] we restrict output ratings to be in the set R^c = {0, 10, 20, ..., 100}, each value corresponding to a class c \in {0, ..., 10}, and learn latent features for each of the ratings. We fix U^0 and I^0 to be zero, i.e. keeping class 0 as the base class.

Objective Function: E = (r_{u,i} - \sum_c R^c \frac{\exp(U_u^c \cdot I_i^c)}{Z})^2 + \lambda \sum_c (\|U_u^c\|^2 + \|I_i^c\|^2)
Z = \sum_c \exp(U_u^c \cdot I_i^c) - normalization term
p(c | U^c, I^c) = \frac{\exp(U_u^c \cdot I_i^c)}{Z}
\hat{r} = \frac{\sum_c R^c \exp(U_u^c \cdot I_i^c)}{Z}
Derivative with respect to each example, for each class c:
\frac{\partial E}{\partial U_{uk}^c} = -2 (r_{u,i} - \sum_{c'} R^{c'} p(c')) (R^c - \sum_{c'} R^{c'} p(c')) p(c) I_{ik}^c
\frac{\partial E}{\partial I_{ik}^c} = -2 (r_{u,i} - \sum_{c'} R^{c'} p(c')) (R^c - \sum_{c'} R^{c'} p(c')) p(c) U_{uk}^c
Optimization Type: SGD
Update Rule:
U_{uk}^c \leftarrow U_{uk}^c - \eta ((\partial E / \partial U_{uk}^c) + \lambda U_{uk}^c)
I_{ik}^c \leftarrow I_{ik}^c - \eta ((\partial E / \partial I_{ik}^c) + \lambda I_{ik}^c)

Another scheme to cut down on parameters is to keep a single U_u for all ratings. Although we have implemented this scheme, we have skipped running experiments with it since we did not see a significant difference between the schemes in the initial runs.
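The LFL prediction is an expectation under a softmax over the rating classes; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def lfl_predict(Uu, Ii, R=np.arange(0, 101, 10)):
    """Expected rating under the LFL model for one (user, item) pair.

    Uu, Ii: (C, k) per-class latent vectors; class 0 is the base class,
    so Uu[0] and Ii[0] stay fixed at zero.
    R: rating value attached to each class (0, 10, ..., 100).
    """
    logits = np.einsum('ck,ck->c', Uu, Ii)  # U_u^c . I_i^c for each class c
    logits -= logits.max()                  # stabilize the softmax
    p = np.exp(logits)
    p /= p.sum()                            # p(c | U^c, I^c)
    return float(R @ p)                     # expected rating
```

With all-zero features every class is equally likely, so the prediction is the mean of R.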

3.7 Neighborhood Based Correction

We use the method in [3], which is a post-processing step after learning the latent features for users and items. The NB correction is as follows:

r^c_{ui} = r^p_{ui} + \alpha \frac{\sum_{j \ne i} \mathrm{sim}(item_i, item_j)(r_{uj} - r^p_{uj})}{\sum_{j \ne i} \mathrm{sim}(item_i, item_j)},

where r^c is the corrected rating for user u and item i, r^p is the predicted rating, and sim is the similarity metric. \alpha is learned through regression on the validation set. The summation over j runs over all items the user has rated in the training set.

\mathrm{sim}(item_i, item_j) = \max\{0, \frac{\sum_k I_{ik} I_{jk}}{\sqrt{\sum_k I_{ik}^2} \sqrt{\sum_k I_{jk}^2}}\}
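A numpy sketch of the neighborhood correction for one prediction (the function name and argument layout are ours; the caller is assumed to exclude item i itself from the rated-item list):

```python
import numpy as np

def nb_correct(pred_ui, item_i, rated_items, residuals, I, alpha):
    """Neighborhood-based correction of a predicted rating.

    rated_items: ids of the items this user rated in training (j != i).
    residuals:   r_uj - predicted r_uj for those items.
    I:           item latent matrix; alpha is learned on the validation set.
    """
    Ii = I[item_i]
    sims = I[rated_items] @ Ii
    norms = np.linalg.norm(I[rated_items], axis=1) * np.linalg.norm(Ii)
    sims = np.maximum(0.0, sims / norms)  # truncated cosine similarity
    if sims.sum() == 0.0:
        return pred_ui                    # no positive neighbors: no correction
    return pred_ui + alpha * (sims @ residuals) / sims.sum()
```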

3.8 Results

The figure below shows the RMSE on the train and validation sets for sigmoid matrix factorization.

Figure 6: SMF: RMSE on train and validation set

The results show overfitting after the second epoch. Item-type-specific RMSE is shown in Figure 7.

Figure 7: SMF: RMSE specific to item type: (a) Tracks, (b) Albums, (c) Artists, (d) Genres

Figure 8: From the ALS run: validation RMSE vs log(#ratings) for (a) Tracks, (b) Albums, (c) Artists, (d) Genres (the red line shows the actual RMSE for that specific type)

The results are summarized in the table below. The results from tensor factorization have been excluded since we are not confident about the training scheme. The LFL run was not tuned for the best parameters. Regularization is similar in all the methods except in ALS, where it is multiplied by n_u or n_i.

Method (\eta/\lambda/k)   RMSE (Test set)   RMSE with NB correction (Test set)
BRISMF (.001/.001/100)    28.6200           -
SMF (10/.0001/100)        25.6736           25.4884
SHMF (10/.0001/100)       25.1183           -
ALS (1/-/50)              25.0426           -
LFL (10/.0001/100)        26.5238           -

Table 3. Current results on the test set.

3.8.1 Timing Information

All these runs used 7 cores on the same node. It takes around 250 seconds to load all the files into memory for Track 1 on a single compute node. On vSMP, loading time is around 400 seconds.

Method (k)   Time in sec per epoch
SMF (100)    200
ALS (50)     400

Table 4. Time per epoch.

4 Ideas for further exploration

There are multiple schemes for residual fitting mentioned in [5] which need attention. Another idea is to exploit the hierarchy in the constrained method described in [2], where U_u = Y_u + \frac{\sum_i I_i W_i}{\sum_i I_i}, and I_i is 1 if user u has rated item i in the training set. We see that the NB correction method on SMF does improve the test RMSE, and NB seems to have the same essence as the constrained method, namely that users who have rated items similarly tend to rate items in a similar fashion. Both the NB correction and the constrained method try to make the latent user features for similar users closer. The authors of [2] have noted that for users with sparse ratings the constrained method provides a considerable improvement. We first need to think of a way to parallelize the updates for W_i, and then ponder how to exploit the hierarchy here. One of the other contestants has claimed an RMSE of 23.97 using ALS with the validation set included in training, using 100 latent features [6]. He is currently ranked 47th. We could optimistically assume that after adding the constrained feature terms and including the validation set in training we should reach the top 20.

Another scheme to try is to blend different results. Since we are currently aiming at learning more about the dataset and arriving at a better RMSE on the validation set using a single method, we feel we should leave this until the end. Another note is that the alternating update and grouping strategy is faster, but the RMSE diverges for some initializations of the latent matrices. The results of ALS tensor factorization are similar to those of ALS without the temporal term on the validation set. One major difference, however, is that the tensor factorization achieves significantly lower RMSE on the training set (18.xx compared to 20.xx).

5 Parallelism

5.1 Alternating update and grouping strategy

In this scheme, the SGD updates for U and I are decoupled: the U matrix is updated while fixing I, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups, and each group is updated independently.

Figure 9: Each of the blocks of the user matrix is updated independently (I fixed, U split into four independent blocks).
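The grouping can be sketched as follows. Our actual implementation uses C++ with pthreads; this Python illustration only shows the decomposition (CPython threads will not give true CPU parallelism for a loop like this), and `ratings_by_user` is an assumed helper mapping each user to their (item, rating) pairs:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_user_update(U, I, ratings_by_user, n_groups=4, lr=0.001, reg=0.001):
    """Update U with I fixed. Each user's row of U is touched by exactly one
    group, so the groups can run concurrently without locking."""
    users = list(ratings_by_user)
    groups = [users[g::n_groups] for g in range(n_groups)]

    def update_group(group):
        for u in group:
            for i, r in ratings_by_user[u]:
                err = r - U[u].dot(I[i])
                U[u] -= lr * (-2.0 * err * I[i] + reg * U[u])

    with ThreadPoolExecutor(max_workers=n_groups) as ex:
        list(ex.map(update_group, groups))
```

The symmetric pass (updating I with U fixed, grouped by item) follows the same pattern.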

5.2 Joint SGD Update by grouping strategy

In this scheme, the SGD updates for U and I are parallelized by creating two disjoint sets of (u, i) pairs, as illustrated in the figure below. This scheme can be applied recursively to each of the disjoint sets for further levels of parallelism. However, since the alternating update strategy seems to work for all the algorithms discussed, this scheme has not been implemented yet.


Figure 10: Joint SGD update by grouping independent blocks of U and I (two independent (user, item) blocks are updated jointly).
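One way to construct such groups, sketched here as an illustration (the function name is ours): split users and items at a midpoint; the (low, low) and (high, high) blocks share no user rows and no item rows, so U and I can be jointly updated inside each block in parallel, and the two remaining blocks form a second, equally independent phase:

```python
def independent_phases(n_users, n_items, ratings):
    """Partition (u, i, r) triples into two phases of two independent blocks.

    Within each phase, the two blocks touch disjoint user rows and disjoint
    item rows, so joint SGD updates of U and I never collide across blocks.
    """
    um, im = n_users // 2, n_items // 2
    phase1 = ([t for t in ratings if t[0] < um and t[1] < im],
              [t for t in ratings if t[0] >= um and t[1] >= im])
    phase2 = ([t for t in ratings if t[0] < um and t[1] >= im],
              [t for t in ratings if t[0] >= um and t[1] < im])
    return phase1, phase2
```

Applying the same split recursively inside each block yields further levels of parallelism, as noted above.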

6 Software

We initially chose Matlab for implementing the baseline methods, but parallel processing in Matlab turned out to be slower than sequential processing on a single node. We suspect that the slowness is due to some communication overhead in our code which we haven't been able to debug. After spending several days trying to figure out the problem, we gave up and decided to code in C++, which turned out to be a good choice. The pthreads library on GNU/Linux is currently being used for parallelism. As far as we know, there exist no efficient collaborative filtering packages online. As a byproduct of the competition, we are also trying to build a robust and efficient package along the lines of liblinear for regression.

References

1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.
2. R. Salakhutdinov and A. Mnih. Probabilistic Matrix Factorization. Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2008.
3. Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656 (2009).
4. Zhou, Y., Wilkinson, D.M., Schreiber, R., Pan, R. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM (2008) 337-348.
5. A. Paterek. Improving regularized Singular Value Decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.
6. http://groups.google.com/group/graphlab-kdd
