
TritonMiners: Ensemble of LFL and ALS

August 14, 2011


Abstract: This document describes our final submission for Track 1 of KDD Cup 2011. We achieved a final RMSE of 23.5797 on the test set using an ensemble of Alternating Least Squares and the Latent Feature Log-Linear approach, which placed us 38th on the Track 1 leaderboard. Our main contribution is the parallelism for LFL using the Joint SGD Update by grouping strategy.

Contents
1 Notation
2 Alternating least squares based Matrix Factorization (ALS)
3 Parallelism for ALS training
  3.1 Alternating update and grouping strategy
4 Latent Feature log linear model
5 Parallelism for LFL training
  5.1 Joint SGD Update by grouping strategy
6 Results
7 Timing Information

1 Notation

  r_{u,i}         True rating for user u and item i
  \hat{r}_{u,i}   Predicted rating for user u and item i
  U_u             Latent feature vector for user u
  I_i             Latent feature vector for item i
  k               Size of the feature vector, i.e. the number of latent factors
  U               Concatenated feature matrix for all users
  I               Concatenated feature matrix for all items
  N_u             Number of users
  N_i             Number of items
  \eta            Learning rate parameter
  \lambda         Regularization parameter
  \sigma(x)       Sigmoid of x

2 Alternating least squares based Matrix Factorization (ALS)

This method was first presented in [4]. The main differences compared to previously discussed methods are a) the update rule for U_u or I_i is the least squares solution, and b) the regularization parameter is multiplied by the number of ratings for that user (n_u) or item (n_i).

Objective function:

  E = \sum_{(u,i)} (r_{u,i} - U_u^T I_i)^2 + \lambda ( \sum_u n_u ||U_u||^2 + \sum_i n_i ||I_i||^2 )

Least squares solution for U_u:

  (M_{I(u)} M_{I(u)}^T + \lambda n_u E) U_u = V_u

where M_{I(u)} is the sub-matrix of I whose columns correspond to the items that user u has rated, E is the identity matrix, and V_u = M_{I(u)} R^T(u, I(u)).

Optimization type: LS

Update rule:

  U_u \leftarrow A_u^{-1} V_u, where A_u = M_{I(u)} M_{I(u)}^T + \lambda n_u E
  I_i \leftarrow B_i^{-1} Y_i  (derivation similar to that for U_u)
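
A minimal sketch of this least-squares update for a single user in Python/NumPy follows; the function and variable names (als_update_user, rated_items, lam) are our own illustrative choices, not code from our actual system.

    import numpy as np

    def als_update_user(I, rated_items, ratings, lam):
        """Least-squares update for one user's latent vector U_u.

        I           : k x Ni matrix of item factors (items are columns)
        rated_items : indices of the items rated by this user, I(u)
        ratings     : this user's ratings for those items, R(u, I(u))
        lam         : regularization parameter lambda
        """
        M = I[:, rated_items]                          # M_I(u): k x n_u sub-matrix of I
        n_u = len(rated_items)                         # number of ratings by this user
        A = M @ M.T + lam * n_u * np.eye(M.shape[0])   # A_u = M M^T + lambda n_u E
        V = M @ ratings                                # V_u = M_I(u) R(u, I(u))^T
        return np.linalg.solve(A, V)                   # U_u = A_u^{-1} V_u

The item update I_i is obtained the same way with the roles of U and I exchanged.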

3 Parallelism for ALS training


3.1 Alternating update and grouping strategy

In this scheme, the updates for U and I are decoupled: the U matrix is updated while keeping I fixed, and vice versa (alternating). This allows us to exploit the inherent parallelism in the matrix updates. The matrix being updated is split into N groups, and each group is updated independently.
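
A minimal sketch of the grouping for one user pass, reusing als_update_user from the sketch above; the process pool, the group count, and the user_data layout are illustrative assumptions rather than details from the report.

    import numpy as np
    from multiprocessing import Pool

    def update_group(args):
        """Re-solve every user vector in one group while I stays fixed."""
        user_ids, I, user_data, lam = args
        out = {}
        for u in user_ids:
            rated_items, ratings = user_data[u]
            out[u] = als_update_user(I, rated_items, ratings, lam)
        return out

    def parallel_user_pass(U, I, user_data, lam, n_groups=4):
        """Split the users into n_groups blocks and update each block independently."""
        groups = np.array_split(np.arange(U.shape[1]), n_groups)
        with Pool(n_groups) as pool:
            parts = pool.map(update_group, [(g, I, user_data, lam) for g in groups])
        for part in parts:
            for u, vec in part.items():
                U[:, u] = vec
        return U

The item pass is analogous, splitting I into blocks while U is held fixed.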

Figure 1: Each of the blocks of the user matrix is updated independently while I is held fixed (four independent blocks).

4 Latent Feature log linear model

In the LFL model [1] we restrict output ratings to be in the set R_c = {0, 10, 20, 30, 40, 50, ..., 100}, each value corresponding to one of the classes c = {0, ..., 10}, and learn latent features for each of the ratings. We fix U^0 and I^0 to be zero, i.e. class 0 is kept as the base class.
Objective function:

  E = \sum_{(u,i)} ( r_{u,i} - \frac{\sum_c R_c \exp(U_u^c \cdot I_i^c)}{Z} )^2 + \lambda \sum_c ( ||U_u^c||^2 + ||I_i^c||^2 )

  Z = \sum_c \exp(U_u^c \cdot I_i^c)   (normalization term)

  p(c | U^c, I^c) = \frac{\exp(U_u^c \cdot I_i^c)}{Z}, \qquad \hat{r} = \frac{\sum_c R_c \exp(U_u^c \cdot I_i^c)}{Z}

Derivative with respect to each example, for each class c:

  \frac{\partial}{\partial U_{uk}^c} ( r_{u,i} - \sum_c R_c\, p(c | U^c, I^c) )^2
    = -2 ( r_{u,i} - \sum_c R_c\, p(c | U^c, I^c) ) \; p(c | U^c, I^c) \; ( R_c - \sum_c R_c\, p(c | U^c, I^c) ) \; I_{ik}^c

  \frac{\partial}{\partial I_{ik}^c} ( r_{u,i} - \sum_c R_c\, p(c | U^c, I^c) )^2
    = -2 ( r_{u,i} - \sum_c R_c\, p(c | U^c, I^c) ) \; p(c | U^c, I^c) \; ( R_c - \sum_c R_c\, p(c | U^c, I^c) ) \; U_{uk}^c

Optimization type: SGD

Update rule:

  U_{uk}^c \leftarrow U_{uk}^c - \eta ( \frac{\partial E}{\partial U_{uk}^c} + \lambda U_{uk}^c )
  I_{ik}^c \leftarrow I_{ik}^c - \eta ( \frac{\partial E}{\partial I_{ik}^c} + \lambda I_{ik}^c )
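
A minimal sketch of the forward computation and per-example gradients for one (u, i) pair; it assumes U_u and I_i are stored as (num_classes x k) arrays with the class-0 rows held at zero, and all names are ours.

    import numpy as np

    R = np.arange(0, 101, 10)        # R_c = {0, 10, ..., 100}

    def lfl_example_gradients(Uu, Ii, r_ui):
        """Return (predicted rating, dE/dUu, dE/dIi) for a single rating r_ui.

        Uu, Ii : (num_classes x k) latent features; row 0 is the base class (all zeros).
        """
        scores = np.sum(Uu * Ii, axis=1)            # U_u^c . I_i^c for every class c
        p = np.exp(scores - scores.max())           # shift for numerical stability
        p /= p.sum()                                # p(c | U^c, I^c)
        r_hat = np.dot(R, p)                        # predicted rating
        # d r_hat / d U^c = p(c) (R_c - r_hat) I^c, and symmetrically for I^c
        coeff = -2.0 * (r_ui - r_hat) * p * (R - r_hat)   # one scalar per class
        grad_U = coeff[:, None] * Ii
        grad_I = coeff[:, None] * Uu
        return r_hat, grad_U, grad_I

An SGD step then subtracts \eta times (gradient + \lambda times the parameter), keeping the class-0 rows fixed at zero.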

5 Parallelism for LFL training


5.1 Joint SGD Update by grouping strategy

In this scheme, the SGD updates for U and I are parallelized by creating two disjoint sets of (u, i) pairs, as illustrated in the figure below. The scheme can be applied recursively to each of the disjoint sets for further levels of parallelism. To create the disjoint sets we used the modulo operator to partition the (u, i) pairs; it turns out that on this dataset the modulo operator splits the data into disjoint sets of almost equal size. One of the main advantages of this strategy over the alternating strategy is that the trained model is identical to the model one would obtain from sequential SGD training, whereas the alternating strategy produces a different model altogether.
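
A minimal sketch of one possible modulo-based split; the exact pairing rule and the blocks that are grouped into a "wave" are our reading of the strategy, not code from the original system.

    from collections import defaultdict

    def modulo_split(pairs, n=2):
        """Group (u, i) pairs by (u mod n, i mod n).

        For n = 2, the blocks keyed (0, 0) and (1, 1) share no users and no
        items, so SGD can process them concurrently without two workers ever
        touching the same U_u or I_i; the same holds for (0, 1) and (1, 0).
        """
        blocks = defaultdict(list)
        for u, i in pairs:
            blocks[(u % n, i % n)].append((u, i))
        return blocks

    # Example usage for n = 2:
    #   wave 1: process blocks[(0, 0)] and blocks[(1, 1)] in parallel
    #   wave 2: process blocks[(0, 1)] and blocks[(1, 0)] in parallel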
Figure 2: Joint SGD update by grouping independent blocks of users and items (two independent blocks, joint update of U and I).

6 Results

The results below are from our experiments using both the training and validation sets during training. The ensemble coefficients were learned using linear regression on the validation set, using a model trained on the training set.

    Method                                        RMSE
    ALS with validation set (1 / - / 200)         23.88
    LFL with validation set (10 / .0001 / 120)    23.87
    Ensemble of LFL and ALS                       23.57

    Table 4: Current results on the test set.
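
A minimal sketch of fitting ensemble coefficients by linear regression on the validation set; the function names and the use of an intercept term are our assumptions.

    import numpy as np

    def fit_ensemble(pred_als, pred_lfl, targets):
        """Learn weights w for r_hat = w0 + w1*ALS + w2*LFL by least squares
        on validation-set predictions and true ratings."""
        X = np.column_stack([np.ones_like(pred_als), pred_als, pred_lfl])
        w, *_ = np.linalg.lstsq(X, targets, rcond=None)
        return w

    def ensemble_predict(w, pred_als, pred_lfl):
        return w[0] + w[1] * pred_als + w[2] * pred_lfl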

7 Timing Information

All these runs used 8 cores on the same node. It takes around 250 seconds to load all the files into memory for Track 1 on a single compute node; on vSMP the loading time is around 400 seconds.

    Method (k)    Time in seconds per epoch
    ALS (200)     4000
    LFL (120)     1200

    Table 4: Run times on a single node.

References
1. Aditya Krishna Menon, Charles Elkan. A log-linear model with latent features for dyadic prediction. In IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 2010.
2. Y. Zhou, D. M. Wilkinson, R. Schreiber, R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In AAIM (2008), 337-348.