Highlights
• Distinctive user and item factors are constructed solely from rating triples.
• A deep feedforward method is used to model the complex relationship between the factors.
• No contextual or demographic information is used; instead, it is learned.
• The system has been evaluated for rating prediction on three real world datasets.
• The proposed method produces predictions comparable with some of the best existing methods.
Article info
Article history:
Received 13 December 2017
Received in revised form 8 October 2018
Accepted 9 November 2018
Available online 16 November 2018

Keywords:
Recommender system
Neural networks
Nonlinear factorization
Matrix factorization
Backpropagation

Abstract
Recommender systems play an important role in quickly identifying and recommending the most acceptable products to the users. The latent user factors and item characteristics determine the degree of user satisfaction with an item. While many of the methods in the literature have assumed that these factors are linear, some other methods treat these factors as nonlinear, but they do so in a more implicit way. In this paper, we have investigated the effect of the true nature (i.e., nonlinearity) of the user factors and item characteristics, and of their complex layered relationship, on rating prediction. We propose a new deep feedforward network that learns both the factors and their complex relationship concurrently. The aim of our study was to automate the construction of user profiles and item characteristics without using any demographic information, and then to use these constructed features to predict the degree of acceptability of an item to a user. We constructed the user and item factors by using separate learner weights at the lower layers, and modeled their complex relationship in the upper layers. The construction of the user profiles and the item characteristics, solely based on rating triples (i.e., user id, item id, rating), removes the requirement that explicit demographic information be given to the system. We have tested our model on three real world datasets: Jester, Movielens, and Yahoo music. Our model produces better rating predictions than some of the state-of-the-art methods which use demographic information. The root mean squared errors incurred by our model on these datasets are 4.0873, 0.8110, and 0.9408 respectively. These errors are smaller than the current best existing models' errors on these datasets. The results show that our system can be integrated into any web store where the development of hand-engineered features for recommending products is less feasible due to huge traffic and where there is a lack of demographic information about the users and the items.
© 2018 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.asoc.2018.11.018
B. Purkaystha, T. Datta, Md.S. Islam et al. / Applied Soft Computing Journal 75 (2019) 310–322 311
item must be inferred directly or indirectly. Most of the methods in the literature have tried to interpolate the user preferences and the item characteristics to produce a prediction on their interaction [4]. The existing methods are mostly based on the assumption that users who have demonstrated similar preferences in the past are likely to have similar opinions on a new product [5]. While this assumption is partially correct, it requires improvements in terms of the subtleties in both user preferences and item characteristics. This assumption eclipses the nuances of individuality of both users and items. It is likely that capturing these little dissimilarities in users and products may lead to better predictions. Moreover, this assumption necessitates a lot of ratings to find a similar set of users. Another discrepancy is that most of these methods take a linear combination of these factors to predict the interaction, which does not reflect the real world scenario. In this paper, we treat the preferences of each user and the characteristics of each item as discriminative, learnable nonlinear factors, and take a nonlinear combination of these factors to interpolate the interaction between a user and an item. Our study is important for a number of reasons. The first reason is that we treat the actors (i.e., the users and the items in the dataset) in a more granular way. We do not impose the aforementioned statistical assumption on our system; rather, it is up to the model itself how it models each user and each item; we do not force it to find a set of similar users or a set of similar items. The second reason is that we construct distinctive features without using any demographic information. This is important where demographic information about the actors is not available, or where the number of actors is so huge that manual tagging is almost impossible. The third reason is that we treat the factors in their true nature, i.e., we treat them as nonlinear entities. This should enable our system to model the interaction between the factors better and, consequently, produce better rating predictions.

The interaction between any user and any item can be modeled as a rating matrix R_{m×n}, for m users and n items, most of whose entries are empty (see Fig. 1). The entries that contain a numeric value are the known interactions for those users and those items. The advantage of modeling the problem as a rating matrix is that algebraic techniques [6,7] or other factorization frameworks [8] can easily be applied to it. The matrix representation is preferable wherever the distinctive user and item factors are involved. It is assumed that the numbers of factors for each user are the same, and so are the numbers of factors for each item.

Fig. 1. A sample rating matrix. There is a total of m users and n items. Each row provides the personalization for the users, while each column provides the same for the items. The binary value (although a real value in practical cases) in the cell c_{i,j} (i ≤ m; j ≤ n) shows the rating given by user i on item j. The missing values are the ones that a recommender system seeks to fill in.

There have been many approaches to filling in the missing values. Collaborative filtering (CF) is the most common framework for completing rating matrices [9–12]. The framework assumes that similar users will have similar interactions with the same item. Good et al. [10] used information filtering (IF) to analyze the item content and to build user profiles. After building each user profile, they identified similar user groups and their opinions. Finally, the opinions and item content analyses were combined to make a recommendation to an appropriate set of the users. A slightly different framework, known as content-based filtering (CBF), exploits the similarity among items and recommends items similar to those on which a user has shown preference [13–15]. Sarwar et al. in [14] evaluated different item-based filtering techniques. They showed a comparative performance analysis of two similarity metrics: the Pearson correlation coefficient (PCC), a statistical correlation measuring technique, and the cosine similarity metric. One severe drawback of both PCC and cosine similarity is that they are extremely sensitive to outliers. Therefore, in larger datasets where a significant number of outliers are present, the prediction tends to degrade. Similar sets of users can also be found by other clustering techniques such as K-means clustering [16]. But the segmentation quality of the users in such techniques largely depends on how the considered factors are initialized. A clustering technique might have to do a huge amount of computation before it can form reasonable groups if the number of ratings is significantly large. A modern recommender system needs to be robust in larger domains. Both CBF and CF face sparsity and scalability problems. The sparsity problem arises where there is a huge number of users and items, but the number of ratings is significantly low. These issues can make any interactive recommender system impractical, and therefore, these frameworks are not adopted alone.

Memory based techniques such as association rule mining [17] or decision trees [18] have been mixed with these frameworks. In [17] the authors used the Apriori algorithm, a collaborative filtering data mining technique, and retrieved similar users and product association rules. A product rule is generated from the transaction information of each similarity group. These rules were then used for user specific services. The authors in [18] used external information (i.e., web usage data from clickstreams) along with the purchasing records of customers to learn user preferences and product associations. They also introduced product taxonomy, background information about the products (such as what range a product's price had fallen in), into their recommendation process. Although these methods did not calculate the user or item factors explicitly, by considering similar users or items they were implicitly working with the factors.

Although most of the research in the literature has addressed the problem as a supervised learning problem, some studies have approached it as an unsupervised learning problem [19]. For example, Kim et al. [20] evaluated the performance of several clustering techniques (such as simple K-means, self organizing map (SOM), and K-means with a genetic algorithm) in the online shopping domain. The GA was used as an initializer in K-means clustering, and the results showed this model significantly outperformed the other two models in segmenting users so that better recommendations could be made. Unsupervised learning has been applied to movie recommendation as well [21]. The authors used SOM with collaborative filtering; they built clusters by segmenting the users with respect to demographic information. The other information they used for clustering is user preferences for items, which they had interpolated using SOM. These unsupervised techniques had the primary job of building similar sets (of users and items), but they did not construct any features of the actors; this was the reason they required demographic information to be available, or required a user to produce a significant number of ratings on items to find its appropriate set (of similar users).

Matrix factorization techniques, on the other hand, directly involve the latent user and item factors. Say the factors F_u are known for any user u and the factors F_i are known for any item i. The missing rating r̂_{u,i} can be calculated by using a dot multiplication as in Eq. (1).
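The dot-product interpolation referred to in Eq. (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the factor dimensionality, the random values, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 8                     # illustrative latent dimensionality

F_u = rng.normal(size=n_factors)  # learned latent factors of one user u
F_i = rng.normal(size=n_factors)  # learned latent factors of one item i

# The missing rating r_hat is interpolated as the dot product of the factors.
r_hat = float(np.dot(F_u, F_i))
print(r_hat)
```

In a full factorization model, the two factor vectors would be fitted to the observed entries of the rating matrix so that this dot product reproduces the known ratings.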
Each unit at layer l (l > 2) is connected to all the units at layer l − 1, i.e., immediate layers are fully connected. At the final layer, we have set only one unit; when the factors from the input layer are propagated forward through the upper layers and the transformed information is relayed to this unit, it produces a real value, which is the rating. The units at each layer (except for the input layer) are nonlinear units, and there are actually two kinds of nonlinear units we have used (but not both at the same time). If the dataset contains negative ratings, then our units should be able to produce negative values, and in that case we opt for tanh(·) activation units (Eq. (2)). However, if there is no negative rating involved, we opt for logistic units σ(·) (Eq. (3)). In both equations, z is the input to the units.

tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})    (2)

σ(z) = 1 / (1 + e^{−z})    (3)

2.2. Network analysis

2.2.1. Forward pass
Before training could be started, we had to identify each user and each item in the dataset uniquely. After identifying them, we initialized a factor vector for each user and each item randomly. Say we decided to have Nu factors for each user and Ni factors for each item. The user factor vectors were initialized with uniform random values in the range [Umin, Umax] (see Eq. (4)), and similarly the item factor vectors were initialized in the range [Imin, Imax] (see Eq. (5)) [38].

Umin = −√(6/Nu),   Umax = √(6/Nu)    (4)

Imin = −√(6/Ni),   Imax = √(6/Ni)    (5)

After initializing the factor vectors randomly, we had to feed them to the network. The user factor vectors (Fu) for each mini-batch were attached to the first sublayer of the bottommost layer (i.e., l^(1,u)). The item factor vectors (Fi) for each mini-batch, similarly, were attached to the second sublayer of the bottommost layer (i.e., l^(1,i)).

z^(2,1) = W^(2,1) · Fu + b^(2,1),   z^(2,2) = W^(2,2) · Fi + b^(2,2)    (6)

Here b^(2,1) and b^(2,2) are the biases for the first hidden layer, and z^(2,1) and z^(2,2) are the weighted inputs to the first sublayer and the second sublayer of the first hidden layer respectively. The output of the first hidden layer is given by applying either tanh(·) or σ(·) to z^(2).

a^(2) = activation(z^(2))    (7)

For the subsequent discussion it is assumed that the activation function is the σ(·) function (Eq. (3)). So Eq. (7) can be rewritten as Eq. (8).

a^(2) = σ(z^(2))    (8)

Similarly, the output of each layer can be calculated by first taking the weighted input from the immediately lower layer and then applying the activation function to it (Eq. (9)).

z^(l) = W^(l) · a^(l−1) + b^(l)
a^(l) = σ(z^(l))    (9)

The output unit o, at the final layer, produces one real-valued output (i.e., the rating).

output_o = σ(W^(L) · a^(L−1) + b^(L))    (10)

2.2.2. Learning process of the network: δ-rule
Our network learns by back-propagating the errors using the δ-rule. Say our network produced ratings y while the actual ratings given were t. Let there be N training examples in the batch; we calculated the sum of squared errors (SSE) (Eq. (11)) and back-propagated it.

E = (1/2) Σ_{i=1}^{N} (y^(i) − t^(i))²    (11)

The error gradient with respect to the output was calculated using Eq. (12).

∂E/∂y = y − t    (12)

The error due to the output unit was calculated as:

δ^(L) = ∂E/∂z^(L) = (∂E/∂y) · (∂y/∂z^(L)) = (∂E/∂y) ⊙ σ′(z^(L))    (13)

The errors in the intermediate layers were calculated as:

δ^(l) = (δ^(l+1) · W^(l+1)) ⊙ σ′(z^(l))    (14)

σ′(·) is the derivative of the activation function. For the logistic units, σ′(z^(l)) = σ(z^(l))(1 − σ(z^(l))), and for the tanh units, tanh′(z^(l)) = 1 − (tanh(z^(l)))². The error gradients for each weight and each bias were calculated using Eqs. (15) and (16) respectively.

∂E/∂W^(l) = a^(l−1) · δ^(l)    (15)

∂E/∂b^(l) = δ^(l)    (16)

We updated the weights using Eq. (17) and the biases using Eq. (18) at each epoch. This is a straightforward gradient descent update, where η is the learning rate.

W^(l) ← W^(l) − η ∂E/∂W^(l)    (17)

b^(l) ← b^(l) − η ∂E/∂b^(l)    (18)

Our network learns two different things concurrently because of the wiring that we chose. At the output layer there is only one unit, which receives layered, complex nonlinear information about the factors from the units situated at the layer just below. Thus, whenever the weights between layer L − 1 and layer L get updated according to the error gradient, our network becomes better at processing the already (L − 1)-layered interaction (between factors) to make it an L-layered interaction. This generalizes to the other layers l (l > 2) as well.

2.2.3. Specialization with the separate learners
As the training progresses, the weights at the first hidden layer are forced to update themselves to minimize their contribution to the error (in Eq. (11)). When they update themselves (using the gradient descent optimizer), the current configuration goes down towards a ravine in the error surface (see Fig. 3). At the ravine, the parameters (including the weights of the first hidden layer) have values that are locally optimized. Under these circumstances, if we change the values of the weights at the first hidden layer, the position of the current configuration would go up the error surface (and the learning will obviously constrict the change). In other words, the current configuration contains one of the best sets of values for the weights at the first hidden layer.

The error gradients incurred by the first sublayer of the first hidden layer (i.e., l^(2,1)) are only due to the user container (i.e., l^(1,u)) (as there was no connection from the item container). The units
Fig. 3. The same error surface, with two local minima A and B, is shown on both the left and the right. The configuration (filled circle) on the left tends to go downhill under the optimizer with the current settings of the parameters (i.e., biases and weights). On the right, the network tends to stay at ravine A because it is locally optimized.
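As a compact sketch of Eqs. (2)–(18), the following code initializes the factor vectors in the ranges of Eqs. (4) and (5), runs the split first layer of Eq. (6) followed by fully connected logistic layers, and applies the δ-rule updates of Eqs. (12)–(18), including the factor updates of Section 2.2.4. The layer sizes, learning rate, target rating, and variable names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigma(z):                       # logistic unit, Eq. (3)
    return 1.0 / (1.0 + np.exp(-z))

# --- factor initialization, Eqs. (4) and (5) (counts are illustrative) ---
Nu, Ni = 6, 6
Fu = rng.uniform(-np.sqrt(6 / Nu), np.sqrt(6 / Nu), size=Nu)
Fi = rng.uniform(-np.sqrt(6 / Ni), np.sqrt(6 / Ni), size=Ni)

# --- separate learner weights for the two first-layer sublayers, Eq. (6) ---
H = 4                               # illustrative sublayer width
W21 = rng.normal(0, 0.1, size=(H, Nu)); b21 = np.zeros(H)
W22 = rng.normal(0, 0.1, size=(H, Ni)); b22 = np.zeros(H)
# upper layers: one hidden layer, then the single output unit
W3 = rng.normal(0, 0.1, size=(3, 2 * H)); b3 = np.zeros(3)
WL = rng.normal(0, 0.1, size=(1, 3));     bL = np.zeros(1)

t = np.array([0.8])                 # illustrative target rating in (0, 1)
eta = 0.5                           # illustrative learning rate

for _ in range(200):
    # forward pass, Eqs. (6)-(10): the two sublayer outputs are concatenated
    a2 = sigma(np.concatenate([W21 @ Fu + b21, W22 @ Fi + b22]))
    a3 = sigma(W3 @ a2 + b3)
    y = sigma(WL @ a3 + bL)                # Eq. (10), predicted rating

    # backward pass (delta-rule), Eqs. (12)-(14) with logistic derivatives
    dL = (y - t) * y * (1 - y)             # Eq. (13)
    d3 = (WL.T @ dL) * a3 * (1 - a3)       # Eq. (14)
    d2 = (W3.T @ d3) * a2 * (1 - a2)

    # gradient descent updates, Eqs. (15)-(18)
    WL -= eta * np.outer(dL, a3); bL -= eta * dL
    W3 -= eta * np.outer(d3, a2); b3 -= eta * d3
    W21 -= eta * np.outer(d2[:H], Fu); b21 -= eta * d2[:H]
    W22 -= eta * np.outer(d2[H:], Fi); b22 -= eta * d2[H:]
    # the factor vectors themselves are also learned (Section 2.2.4)
    Fu -= eta * (W21.T @ d2[:H])
    Fi -= eta * (W22.T @ d2[H:])

print(round(float(y[0]), 3))  # after training, y should approach the target 0.8
```

Note how the item factors never pass through W21 and the user factors never pass through W22, which is the separation of learners described in Section 2.2.3.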
at this sublayer had processed the user information only and forwarded it to the next layer. So reducing the error by updating the weights at this sublayer directly amounts to specialization in dealing with the user factors. These weights do not recognize the item factors; they specialize in dealing with the user factors only. They discriminatingly learn the user factors alone. Similarly, the weights of the second sublayer (i.e., W^(2,2)) specialize in dealing with the item factors alone.

2.2.4. Learning the user and item factors
The error gradients are further propagated from the first sublayer of the first hidden layer to the user container (and this is unlike what is done in traditional neural networks; the inputs are not updated in them, but in our case they are not the real inputs, rather spurious ones drawn from a uniform distribution). The user factor vectors were attached to the units of this container. The error incurred by the first sublayer of the first hidden layer is only due to the factors of this container, and therefore, any subsequent update to these factor vectors gives the network a better fit to the data. This part of training amounts to learning the preferences of each user separately.

Similarly, the item factor vectors in a mini-batch were attached to the units of the second sublayer of the bottommost layer (i.e., l^(1,i)). The error gradients received through the second sublayer of the first hidden layer (l^(2,2)) are the corrections to the random initialization of the item factors. Hence, the subsequent updates also cause the model to learn each item factor individually.

3. Experimental Results and Discussion

3.1. Datasets
We have tested our model on real world datasets [11,36]. The datasets are summarized in Table 1. As can be seen, they constitute ratings of movies, jokes, and music. The datasets contain different rating ranges and sparsity levels. There are different sizes of data in the same domains. The collection periods of the ratings show that these datasets test the system's ability to model the dynamic behaviors of both users and items. For example, the ratings of Movielens 100k were collected over a span of more than 20 years. Four of the datasets (i.e., the Movielens and Yahoo datasets) have categorical ratings, whereas the other two (i.e., the Jester datasets) have continuous ratings. But in our work we have produced continuous ratings for each dataset.

The Jester datasets do not contain any demographic information about the users, but the jokes on which the ratings were produced are available. Each user in the Jester datasets has rated at least 15 jokes. Both of the Movielens datasets contain demographic information (such as age, sex, occupation, etc.) about the users. They also contain information (e.g., genre, imdb url, release date, movie title, etc.) about the movies. Although this information is available, we did not use it in our system. In both of these datasets each user has rated at least 20 movies. For the Yahoo music datasets, each user has rated at least 10 music tracks.

As all the users in all the datasets are chosen based on the number of ratings they have produced, they are all bound to be the successful users (i.e., the users who did not enjoy rating movies or simply did not participate in the ratings were omitted) [36]. Any system that learns from these ratings will be fundamentally unaware of the behaviors of such users. Another limitation is that, as collecting the ratings spanned several years, recommendation algorithms, user interfaces, and available facilities changed from time to time. This might have impinged on users' way of rating movies, jokes, or music.

3.2. Evaluation metrics

3.2.1. RMSE
We have measured the accuracy of the model in terms of root mean squared error (RMSE), as it is very common in the literature. Let there be N samples in the test set. Say our network produced rating y^(i) for sample i (1 ≤ i ≤ N) while the actual rating was t^(i). Then Eq. (19) gives the RMSE incurred by our model.

RMSE = √( Σ_{i=1}^{N} (y^(i) − t^(i))² / N )    (19)

3.2.2. Precision and recall
We have used these two accuracy metrics for comparison among models. Let there be a threshold θ applied on a subset of rating triples R. For all the ratings r_{u,i} ∈ R, we say the item i is relevant to the user u if and only if r_{u,i} ≥ θ. Say w is the number of items retrieved by the model as relevant, whereas only x of them were actually relevant. Precision is then calculated by Eq. (20).

precision = x / w    (20)

Recall is simply the ratio of the number of relevant items retrieved to the number of actually relevant items present in the inventory. As, for any user, there are only two categories (i.e., either to be recommended or not to be recommended) for each product, the recall is the same as the sensitivity. If there are originally z relevant items after the threshold is applied, then recall or sensitivity is calculated as follows.

recall or sensitivity = x / z    (21)

These two error metrics are used in Section 3.6.3 to compare the recommendations produced by the methods.
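The metrics of Eqs. (19)–(21) can be computed directly from predicted and actual ratings. The toy ratings and the threshold value below are illustrative assumptions, not values from the paper.

```python
import math

# toy predicted and actual ratings (illustrative values only)
y = [4.1, 3.6, 3.4, 1.2, 4.8]
t = [4.0, 1.0, 4.0, 1.0, 5.0]

# Eq. (19): root mean squared error over the N = 5 samples
rmse = math.sqrt(sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y))

# Eqs. (20) and (21): an item is "relevant" iff its rating >= theta
theta = 3.5
retrieved = {i for i, yi in enumerate(y) if yi >= theta}  # w = |retrieved|
relevant  = {i for i, ti in enumerate(t) if ti >= theta}  # z = |relevant|
x = len(retrieved & relevant)          # correctly retrieved relevant items
precision = x / len(retrieved)         # Eq. (20): x / w
recall    = x / len(relevant)          # Eq. (21): x / z
print(round(rmse, 3), precision, recall)
```

Sweeping theta over a range of values and recording (recall, precision) pairs produces precision-versus-recall curves like those discussed in Section 3.6.3.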
Table 1
Datasets.

Dataset           #Ratings  #Users  #Items  Sparsity (%)  Rating scale     Time       Domain
Movielens 100k    100k      943     1 682   93.7          [1,5]            1995–2016  Movies
Movielens 1M      1M        6 040   3 952   95.81         [1,5]            2000–2003  Movies
Jester 1.8M       >1.8M     24 938  100     27.40         [−10.00,10.00]   1999–2003  Jokes
Jester 4.1M       >4.1M     73 421  100     43.67         [−10.00,10.00]   1999–2003  Jokes
Yahoo music 220k  >220k     7 643   11 916  99.76         [1,5]            2002–2006  Music
Yahoo music 365k  >365k     15 400  1 000   99.77         [1,5]            2002–2006  Music
Fig. 4. The bar diagrams show the impact of batch size on rating predictions. For the Movielens 1M and Jester datasets, modeling the relationship in the factors is more important than (or at least as important as) factorizing the users and items. On the other hand, for Yahoo 365k, the factorization is more important than modeling the relationship.
Fig. 5. The impact of factorizing the users and the items is shown here in the subplots for four datasets. The other two datasets (Yahoo 220k and Movielens 100k) have been
skipped because they are too small for this experiment.
The plot reveals very important characteristics of how the system learns. The model learns the most during the initial increments of the dataset (i.e., when the training data ratio is increased from 20% to 50%). In these increments, the model sees more novel users and items than in any other increments, and this drastically brings down the test set errors. The error lines flatten out after the stage where the model no longer sees enough novel users or items. The only room for improvement left after this stage is adjusting the factors and modeling the relationship in the opposite direction of the error gradient. It should be noted that the different training data ratios have been discussed here only to show how these affect the learning of the model. When the final RMSE incurred on any dataset is reported in this paper, the dataset is split 80:10:10 for training, validation, and test purposes respectively.

3.5. Factorization and rating analysis

In this section, we analyze the ratings produced by our model for two users from the Movielens 100k dataset, and based on the findings we seek to answer our first research question: whether it is possible to construct user profiles and item characteristics solely based on the rating triples. The dataset contains the
links to the movie information: its releasing year, genre, etc. These are very good features for a movie and certainly have an impact on the ratings. The users are anonymized; the two users we have picked have ids 450 and 405 respectively. Let us call them user a and user b.

User a has rated a total of 488 movies. Only 31 movies have been rated below 2 (on the discrete scale 1–5) by him. So he rates movies generously. But there are some genres which he mostly dislikes; these are comedy, adventure, and drama. He mostly gave very low ratings to these genres. We therefore picked the movie Up Close and Personal and fed the already learned preferences of a and the characteristics of this movie to our trained network. The genre of the movie is romance. Our network produced a rating of 3.876, whereas he had actually rated the movie 4. We then picked two movies, Beverly Hillbillies and Half Baked. Both of them are of genre comedy. The network produced ratings of 1.531 and 1.508 respectively, whereas he had rated both of them 1. Clearly, the network has learned the useful features of the movies (although we told it nothing about them) and the preferences of user a. Another movie was picked, Brassed Off; the genres of this movie are comedy, romance, and drama. The network produced a rating of 2.994, whereas the actual rating he had given is 5. One possible explanation is that, although the movie was of a genre he did not like, he might have been impressed by some other accidental factors (e.g., the acting of the lead roles, the romantic background, and others).

User b has rated 657 movies. On average, he rated each movie 1.85. But there are genres (e.g., comedy, drama, science fiction) on which he rated generously. We picked the movie Philadelphia and fed its factors and b's preferences to our network. The network produced a rating of 3.739, whereas b had actually rated it 5. The network was able both to factorize user b and the movie Philadelphia and to model their relationship to some extent. We then picked the movie 2001: A Space Odyssey (of genre science fiction), and the network produced a rating of 3.71 for this movie, whereas b actually had rated it 5. Finally, we picked the movie A Stranger in the House (of genre thriller). The network produced a rating of 1.65, whereas the actual rating was 1.

The network was not given any prior information about users and movies. Still, it has managed to find out some distinctive features of the users and items so that it could produce a better fit to the data. This answers our first research question. It is indeed possible to construct the user profiles and the item characteristics reasonably well based on the rating triples.

3.6. Comparison among models

We now give comparisons of our model with some of the best existing models on these datasets. The results of the other models have been reproduced. The models that we compare our model with are described now.

Funk-SVD. This model is implemented using the method of [39]. It is a slightly different version of standard SVD.

SVD++. This model is based on an extensive usage of SVD, and it modifies the standard SVD a lot [6]. It expects a temporal dimension in the dataset, and we gave this information to it. Our model does not require such temporal information.

Multiple kernelized matrix factorization (MKMF). This model [8] is similar to the other matrix factorization techniques employing collaborative filtering. It combines multiple kernels and produces slightly better results.

Pearson vector similarity (PVS) and Cosine vector similarity (CVS). Both of these models are essentially the same [14]. The only difference is in how they measure similarity among items.

Matrix factorization using user-generated content (MF-UGC). This model [40] exploits user generated content on top of matrix factorization, and makes predictions.

Robust convolutional matrix factorization (R-CONVMF). This model [35] uses contextual information and detects useful user and item features in lower convolutional layers.

Table 3
Comparison with other models in terms of RMSE.

Model                                            J 1.8M  J 4.1M  ML 100k  ML 1M   YM 220k  YM 365k
Pearson vector similarity                        4.6307  4.6280  1.0427   1.0003  1.2765   1.2761
Cosine vector similarity                         4.6100  4.6215  1.0385   0.9983  1.2752   1.2752
Funk-SVD                                         4.2829  4.2221  0.9538   0.9243  1.0005   0.9997
SVD++                                            4.1472  4.1316  0.9096   0.8427  0.9676   0.9670
Matrix factorization using user generated content 4.1465 4.1379  0.9008   0.8470  0.9701   0.9723
Multiple kernelized matrix factorization         4.1369  4.1282  0.9091   0.8297  0.9502   0.9479
Robust convolutional matrix factorization        4.2279  4.1951  0.9003   0.8470  0.9618   0.9582
NUIF                                             4.1128  4.0873  0.8965   0.8110  0.9470   0.9408

*ML: Movielens, *YM: Yahoo music, *J: Jester. The models have been reproduced.

3.6.1. Comparison in terms of RMSE
The comparison of the models in terms of RMSE is given in Table 3. The RMSEs of the models compared here were produced when each dataset was split 80:10:10 for training, validation, and test purposes. On average, our model has predicted more accurate ratings on all datasets compared to the other models. The improvement made by our model is marginal for smaller datasets like Movielens 100k or Yahoo music 220k. But the improvement scales with the size of the dataset. For example, Jester 4.1M contains more than 4 million ratings, and our model has incurred an RMSE that is decisively smaller than the previous best RMSE.

In Table 3, PVS and CVS are two methods which are based solely on similarity measurements and on the statistical assumption that similar users will have similar interactions with similar items, which we have already discussed in Section 1. In our method, we have emphasized the nuances in user behaviors as well as in item characteristics, and it is clearly visible in Table 3 that our method has incurred significantly lower RMSEs than each of these two methods. This corroborates our second research question: indeed, rating predictions were more accurate as we treated each user and each item individually.

In the same table, there are a number of matrix factorization methods (e.g., SVD++, robust convolutional matrix factorization, multiple kernelized matrix factorization, etc.) which treat the factors linearly. We have incurred lower RMSEs than each of them. Although for smaller datasets (e.g., Movielens 100k) the improvement is not very significant, the improvement made on a larger dataset is strongly visible (e.g., on Jester 4.1M we have incurred an RMSE of ∼4.087, whereas the second best, incurred by the multiple kernelized matrix factorization method, is ∼4.128). Therefore, we can answer our third research question: although for smaller datasets the improvement made by our method over the linear methods is insignificant, for larger datasets the improvement is significant.

3.6.2. Comparison in terms of temporal modeling
The architecture of our system does not accommodate explicit temporal modeling. Temporal modeling means capturing the dynamic behavior of the users and the changing characteristics of the items. As we have not designed it explicitly, it is very hard to conclude whether the system has captured the dynamism in
Looking into the learned factors and the bottommost-layer weights alone is not conclusive, so we picked the Movielens 1M dataset to analyze the final output produced by the various methods on the basis of temporal modeling. This dataset was picked because its ratings span more than 20 years, which is greater than any other dataset. The ratings in the dataset have an associated column of timestamps (i.e., the time when the ratings were given). We divided the test-set users into a number of categories. For any category c we set a threshold tc so that each user in c has a rating span of at least tc days. This is summarized in Table 4.

Table 4
Movielens 1M dataset categorized according to tc values.

#    Minimum spanning days tc    # Test users
1    200                          96
2    300                          75
3    400                          60
4    500                          49
5    600                          40
6    700                          37
7    800                          22
8    900                          12
9    1000                          5

Although our system is not provided with any temporal information, if the dataset has the property that its users' behaviors or items' characteristics change radically through time, the system has to adjust the weights between the bottommost layer and the first hidden layer (i.e., the weights W(2)) and the factors in the factor layer (i.e., Fu and Fi) in such a way that it can minimize the error (see the optimization objectives in Sections 2.2.3 and 2.2.4). This, although very implicitly, amounts to temporal modeling. The effectiveness of our temporal modeling is shown in Fig. 6, where the RMSE of each of the models is plotted against each category. For the portion of the test set where each user had at least two hundred days of ratings, NUIF incurred a comparatively lower RMSE. But as the users stayed longer in the dataset, the rating predictions made by NUIF did not scale as well as those of some other methods. For larger values of tc (say, 900) our model incurred a higher RMSE than the R-CONVMF model (which used contextual information).

Fig. 6. RMSE incurred by different methods with respect to users having different rating spans.

3.6.3. Comparison in terms of recommendations

The final comparative analysis of the models in this section is given in terms of precision versus recall plots in Fig. 7. We have selected three datasets from different domains (i.e., Jester 4.1M, Movielens 1M, and Yahoo music 365k). The plots are not based on the whole test set; instead, we randomly picked 1000 test samples T from each domain (i.e., jokes, movies, and music) and plotted the points. For any dataset d, we tried out several values of the threshold θd. Each value of θd was applied on T, and precision and recall were calculated using Eqs. (20) and (21). Precision shows the ratio of the number of correct items recommended by the model to the total number of items recommended. Recall, on the other hand, shows the ratio of the number of correct items recommended to the number agreed by θd.

Fig. 7a shows the precision versus recall curves of all the models on 1000 random test examples from the Jester 4.1M dataset. Our model provides a better improvement in finding more relevant products than the neighborhood models (e.g., Pearson vector similarity (PVS) and Cosine vector similarity (CVS) are the bottommost two curves). Other factorization models like MKMF provided very similar performance when the recall was small. But once the threshold θjester divided the test examples into roughly equal numbers of more relevant and less relevant items, our model always produced better recommendations (e.g., see the NUIF curve in the recall range [0.3–0.8]).

Fig. 7b also shows that our model almost always produced better recommendations. There are some large recall values for which other factorization models produced similar or better recommendations than our model. This indicates that the improvements achieved by our model on Movielens are not as good as those on Jester. This makes sense: the Jester dataset is more than four times larger than Movielens, so our deep model has specialized more on Jester than on Movielens. Still, in most cases it produced better recommendations than the other models.

Fig. 7c shows the precision versus recall of the models on Yahoo 365k. This dataset is significantly smaller than the previous two. As expected, although our model produced better recommendations in most cases, there are more instances where other factorization models produced similar or better recommendations (when recall > 0.8) as compared to the previous two datasets.

3.7. Impact of rating predictions and recommendation quality on obtained results

The obtained results show a broad range of characteristics of our approach. The prediction accuracy improved for each dataset when more hidden units were incorporated and dropout (i.e., randomly omitting some of the units at each iteration) was enabled. This implies that, with a proper regularization technique present, increasing the capacity of our network lets it model the relation between users and items better. This observation is important, mainly because it hints that, at the lower layers (i.e., the input layer and first hidden layer), the separate learner connections were helping the model factorize the users and the items well (as shown with examples in Section 3.5).
Fig. 7. Precision versus recall for three different datasets. Moving right along the horizontal axis, the number of relevant items increases after the threshold θ is applied. When the value of θ is moderate (i.e., near the middle of the rating range), the number of more relevant items roughly equals the number of less relevant items, and precision drops. Every model in each of the plots finds it easy to recommend items when θ is either strong or weak.
Meanwhile, the upper layer connections were modeling the relationship among the factors.

Another important insight is that, for all of the data domains, our prediction accuracy improves as the dataset gets larger. For example, on the Movielens 100k dataset the RMSE incurred by our model was ∼0.89 (see Table 2), but when the size of the dataset increased ten times (Movielens 1M), the RMSE dropped to as low as ∼0.81. Such improvement is also visible for the other datasets. This shows that our method produces ratings with better accuracy as the size of the dataset grows. With a larger amount of ratings, we are able to factorize the users and the items in a more granular way in the lower layers, which helps the upper layers utilize the learned factors to produce better ratings.

In Section 2.2.3 we analytically justified why we split the connections in the first hidden layer. We now give an empirical justification of the impact of separate learners. To see it in practice, we ran our system on all six datasets with the same configurations, but instead of using separate learners, we fully connected the first hidden layer to the factor layer. The results are summarized in Table 5. For each dataset, the training RMSE is significantly lower when separate learning is not enabled. The test error, on the other hand, is clearly lower for separate learning. The fully connected configurations have more connections with which to model accidental regularities in the ratings. Capturing such accidental regularities causes the model to overfit the data; hence, while the training error decreases, the test error increases. Moreover, the connections at this layer are not allowed to specialize exclusively in either type of factor. Thus, the units at the first hidden layer cannot distinguish between the two types of units (i.e., user factors and item factors) at the input layer. Therefore, when the units at the first hidden layer propagate the error back to the input layer, they provide a blended gradient: the factors at the input layer do not receive their actual gradient but an error gradient counter-balanced by the other type of factors. This hampers the construction of the user profiles and the item characteristics, and causes the users and the items to be factorized improperly. The network does not learn each user and each item individually; rather, it is compelled to learn the interaction of each individual user-item pair (which we already do in the higher layers; the full connection between the factor layer and the first hidden layer just adds an extra layer when modeling the relationship). These empirical results hint that separate learning helps the model generalize well and enables the system to model the user profiles and the item characteristics more precisely.

In Section 3.6.3, we showed the recommendations made by our method and the other models, and observed that the improvement which was strongly visible in terms of ratings does not hold as strongly in the case of recommendations; for some higher recall values our method even produced worse recommendations than some other models. This is because, as observed in several studies, improvements in terms of RMSE do not translate into recommendation quality as much as expected [41]. Our study had the objective of bringing down the error in rating predictions, and therefore this discrepancy exists in the recommendation quality of our system.

4. Conclusion

In this paper, we proposed a method that emphasizes factorizing each user and each item individually to overcome the limitation of the statistical assumption that similar users will have similar opinions on an item.
Table 5
Impact of separate learning. R1 and R2 are the test RMSEs without and with separate learners; Improvement = (R1 − R2) / R1 × 100%.

                 No separate learners            Separate learners
Dataset          Training RMSE   Test RMSE R1    Training RMSE   Test RMSE R2    Improvement (%)
Jester 1.8M      2.0138          4.4304          2.1252          4.1128          7.16
Jester 4.1M      1.9632          4.4557          2.0160          4.0873          8.27
Movielens 100k   0.5076          0.9208          0.5530          0.8965          2.64
Movielens 1M     0.4810          0.8553          0.5680          0.8110          5.17
Yahoo 220k       0.4622          0.9731          0.3801          0.9470          2.69
Yahoo 365k       0.4597          0.9710          0.4048          0.9408          3.19
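Table 5 compares the two connectivity patterns discussed in Section 3.7; concretely, the difference lies only in how the first hidden layer is wired to the factor layer. The following is a minimal forward-pass sketch in plain Python; the ReLU activation and the dimensions are illustrative assumptions, not our exact configuration:

```python
def forward_separate(f_u, f_i, W_u, W_i, b=0.0):
    """Separate learners: each first-hidden-layer unit sees either the user
    factors (through W_u) or the item factors (through W_i), never both."""
    h_user = [max(0.0, sum(w * x for w, x in zip(row, f_u)) + b) for row in W_u]
    h_item = [max(0.0, sum(w * x for w, x in zip(row, f_i)) + b) for row in W_i]
    return h_user + h_item  # concatenated for the upper layers

def forward_full(f_u, f_i, W, b=0.0):
    """Fully connected alternative: every unit sees the concatenation of
    both factor types (the configuration that overfits in Table 5)."""
    x = f_u + f_i
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row in W]
```

The improvement column then follows directly from the two test RMSEs: for Jester 4.1M, (4.4557 − 4.0873)/4.4557 × 100 ≈ 8.27, matching the table.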
The main contribution of our work lies in how we model each user profile and each item characteristic without using any demographic information. This way of factorization is particularly helpful in scenarios where only a huge amount of rating data exists, without any explicit information about the users or the items. We have also contributed to eliminating a limitation of the traditional linear factorization models (i.e., that the number of user factors and the number of item factors must be the same), as our deep model allows any number of factor units at the bottommost layer. We constructed the user behaviors and the item characteristics in the factor layer (i.e., the bottommost layer) and modeled their relationship in the upper layers. We have shown that our model produces better predictions than the other factorization models and the models which use contextual or content information.

Although we have shown the quality of the recommendations produced by our model to compare it with others, our main objective was to minimize the error in rating predictions. It is a well known fact that recommendation quality does not necessarily scale with root mean squared error. A further investigation could seek to measure the effectiveness of the given architecture in terms of the recommendations it makes rather than the ratings. Another investigation could use demographic information about the users and the items along with the constructed factors, to observe how such a hybrid factorization model performs in terms of both rating predictions and recommendation quality. Additionally, as in this work we have not confronted the cold start problem (a serious dilemma that arises when a new user or item not involved in any rating is introduced into the system), a study could be run on how to confront this problem while keeping the other advantages of our system intact.

References

[1] P. Clerkin, C. Hayes, P. Cunningham, Automated case generation for recommender systems using knowledge discovery techniques, Genre (1994).
[2] P. Resnick, H.R. Varian, Recommender systems, Commun. ACM 40 (1997) 56–58.
[3] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng. 17 (2005) 734–749.
[4] D.H. Park, H.K. Kim, I.Y. Choi, J.K. Kim, A literature review and classification of recommender systems research, Expert Syst. Appl. 39 (2012) 10059–10072.
[5] R. Burke, Integrating knowledge-based and collaborative-filtering recommender systems, in: Proceedings of the Workshop on AI and Electronic Commerce, 1999, pp. 69–72.
[6] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009).
[7] B.-W. Chen, W. Ji, S. Rho, Y. Gu, Supervised collaborative filtering based on ridge alternating least squares and iterative projection pursuit, IEEE Access (2017).
[8] X. Liu, C. Aggarwal, Y.-F. Li, X. Kong, X. Sun, S. Sathe, Kernelized matrix factorization for collaborative filtering, in: Proceedings of the 2016 SIAM International Conference on Data Mining, SIAM, 2016, pp. 378–386.
[9] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, J. Riedl, GroupLens: applying collaborative filtering to Usenet news, Commun. ACM 40 (1997) 77–87.
[10] N. Good, J.B. Schafer, J.A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, J. Riedl, et al., Combining collaborative filtering with personal agents for better recommendations, in: AAAI/IAAI, 1999, pp. 439–446.
[11] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: A constant time collaborative filtering algorithm, Inf. Retr. 4 (2001) 133–151.
[12] X. Yang, Y. Guo, Y. Liu, H. Steck, A survey of collaborative filtering based social recommender systems, Comput. Commun. 41 (2014) 1–10.
[13] M. Balabanović, Y. Shoham, Fab: content-based, collaborative recommendation, Commun. ACM 40 (1997) 66–72.
[14] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, ACM, 2001, pp. 285–295.
[15] P. Pirasteh, J.J. Jung, D. Hwang, Item-based collaborative filtering with attribute correlation: a case study on movie recommendation, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2014, pp. 245–252.
[16] B.M. Sarwar, G. Karypis, J. Konstan, J. Riedl, Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering, in: Proceedings of the Fifth International Conference on Computer and Information Technology, Vol. 1, 2002.
[17] C.-H. Lee, Y.-H. Kim, P.-K. Rhee, Web personalization expert with combining collaborative filtering and association rule mining technique, Expert Syst. Appl. 21 (2001) 131–137.
[18] Y.H. Cho, J.K. Kim, S.H. Kim, A personalized recommender system based on web usage mining and decision tree induction, Expert Syst. Appl. 23 (2002) 329–342.
[19] K. Bryan, M. O'Mahony, P. Cunningham, Unsupervised retrieval of attack profiles in collaborative recommender systems, in: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM, 2008, pp. 155–162.
[20] K.-j. Kim, H. Ahn, A recommender system using GA k-means clustering in an online shopping market, Expert Syst. Appl. 34 (2008) 1200–1209.
[21] M. Lee, P. Choi, Y. Woo, A hybrid recommender system combining collaborative filtering with neural network, in: International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Springer, 2002, pp. 531–534.
[22] M. Nilashi, O. Ibrahim, K. Bagherifard, A recommender system based on collaborative filtering using ontology and dimensionality reduction techniques, Expert Syst. Appl. 92 (2018) 507–520.
[23] P. Symeonidis, Matrix and tensor decomposition in recommender systems, in: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, 2016, pp. 429–430.
[24] G. Adomavicius, Y. Kwon, Multi-criteria recommender systems, in: Recommender Systems Handbook, Springer, 2015, pp. 847–880.
[25] X. Zhou, J. He, G. Huang, Y. Zhang, SVD-based incremental approaches for recommender systems, J. Comput. System Sci. 81 (2015) 717–733.
[26] J.R. Noriega, H. Wang, A direct adaptive neural-network control for unknown nonlinear systems and its application, IEEE Trans. Neural Netw. 9 (1998) 27–34.
[27] T.K. Paradarami, N.D. Bastian, J.L. Wightman, A hybrid recommender system using artificial neural networks, Expert Syst. Appl. 83 (2017) 300–313.
[28] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 791–798.
[29] A. Van den Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: Advances in Neural Information Processing Systems, 2013, pp. 2643–2651.
[30] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 1235–1244.
[31] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, 2016, pp. 7–10.
[32] J. Wei, J. He, K. Chen, Y. Zhou, Z. Tang, Collaborative filtering and deep learning based recommendation system for cold start items, Expert Syst. Appl. 69 (2017) 29–39.
[33] J. Liu, C. Wu, J. Wang, Gated recurrent units based neural network for time heterogeneous feedback recommendation, Inform. Sci. 423 (2018) 50–65.
[34] X. Wang, Y. Wang, Improving content-based and hybrid music recommendation using deep learning, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 627–636.
[35] D. Kim, C. Park, J. Oh, H. Yu, Deep hybrid recommender systems via exploiting document context and statistics of items, Inform. Sci. 417 (2017) 72–87.
[36] F.M. Harper, J.A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. (TiiS) 5 (2016) 19.
[37] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (1991) 251–257.
[38] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[39] S. Funk, Netflix update: Try this at home, 2006.
[40] Y. Xu, Z. Chen, J. Yin, Z. Wu, T. Yao, Learning to recommend with user generated content, in: International Conference on Web-Age Information Management, Springer, 2015, pp. 221–232.
[41] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, 2010, pp. 39–46.