
Applied Soft Computing Journal 75 (2019) 310–322

Contents lists available at ScienceDirect

Applied Soft Computing Journal


journal homepage: www.elsevier.com/locate/asoc

Rating prediction for recommendation: Constructing user profiles and item characteristics using backpropagation

Bishwajit Purkaystha∗, Tapos Datta, Md. Saiful Islam, Marium-E-Jannat
Computer Science and Engineering, Shahjalal University of Science & Technology, Sylhet-3114, Bangladesh

highlights

• Distinctive user and item factors are constructed solely from rating triples.
• A deep feedforward method is used to model the complex relationship between the factors.
• No contextual or demographic information is used; instead, it is learned.
• The system has been evaluated for rating prediction on three real-world datasets.
• The proposed method produces predictions comparable to some of the best existing methods.

article info

Article history:
Received 13 December 2017
Received in revised form 8 October 2018
Accepted 9 November 2018
Available online 16 November 2018

Keywords:
Recommender system
Neural networks
Nonlinear factorization
Matrix factorization
Backpropagation

abstract

Recommender systems play an important role in quickly identifying and recommending the most acceptable products to users. The latent user factors and item characteristics determine the degree of user satisfaction with an item. While many of the methods in the literature have assumed that these factors are linear, some other methods treat these factors as nonlinear, but they do it in a more implicit way. In this paper, we have investigated the effect of the true nature (i.e., nonlinearity) of the user factors and item characteristics, and of their complex layered relationship, on rating prediction. We propose a new deep feedforward network that learns both the factors and their complex relationship concurrently. The aim of our study was to automate the construction of user profiles and item characteristics without using any demographic information, and then to use these constructed features to predict the degree of acceptability of an item to a user. We constructed the user and item factors by using separate learner weights at the lower layers, and modeled their complex relationship in the upper layers. The construction of the user profiles and the item characteristics, solely based on rating triples (i.e., user id, item id, rating), removes the requirement that explicit demographic information be given to the system. We have tested our model on three real-world datasets: Jester, Movielens, and Yahoo music. Our model produces better rating predictions than some of the state-of-the-art methods which use demographic information. The root mean squared errors incurred by our model on these datasets are 4.0873, 0.8110, and 0.9408, respectively, which are smaller than the errors of the current best existing models on these datasets. The results show that our system can be integrated into any web store where the development of hand-engineered features for recommending products is infeasible due to huge traffic, and where demographic information about the users and the items is lacking.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The enormity of information across the web has made it extremely difficult for its users to quickly find the most acceptable products. Looking and searching for desired products across web stores is both time consuming and tedious; it significantly impinges on user experiences as well as on the number of e-commerce transactions. The user experiences and the number of transactions can be improved if the right product can be recommended to an appropriate user. Given the enormous numbers of both users and items, it is almost impossible to manually pick a product and recommend it to the right user. Therefore, an automatic recommendation system might significantly improve the user experiences and the number of transactions by recommending the right product to the right customer.

Over a period of two decades, researchers have worked extensively to address this problem [1–3]. Before a product can be recommended to a user, the interaction between the user and the

∗ Corresponding author.
E-mail addresses: iambishwa@student.sust.edu (B. Purkaystha), taposdatta2013@gmail.com (T. Datta), saiful-cse@sust.edu (Md.S. Islam), jannat-cse@sust.edu (Marium-E-Jannat).

https://doi.org/10.1016/j.asoc.2018.11.018
1568-4946/© 2018 Elsevier B.V. All rights reserved.
Fig. 1. A sample rating matrix. There is a total of m users and n items. Each row provides the personalization for the users, while each column provides the same for the items. The binary value (although a real value in practical cases) in the cell c_{i,j} (i ≤ m; j ≤ n) shows the rating given by user i on item j. The missing values are the ones that a recommender system seeks to fill in.

item must be inferred directly or indirectly. Most of the methods in the literature have tried to interpolate the user preferences and the item characteristics to produce a prediction on their interaction [4]. The existing methods are mostly based on the assumption that users who have demonstrated similar preferences in the past are likely to have similar opinions on a new product [5]. While this assumption is partially correct, it needs refinement with respect to the subtleties in both user preferences and item characteristics. This assumption eclipses the nuances of individuality of both users and items. It is likely that capturing these small dissimilarities among users and products may lead to better predictions. Moreover, this assumption necessitates a lot of ratings to find a similar set of users. Another shortcoming is that most of these methods take a linear combination of these factors to predict the interaction, which does not reflect the real-world scenario. In this paper, we treat the preferences of each user and the characteristics of each item as discriminative, learnable nonlinear factors, and take a nonlinear combination of these factors to interpolate the interaction between a user and an item. Our study is important for a number of reasons. The first reason is that we treat the actors (i.e., the users and the items in the dataset) in a more granular way. We do not impose the aforementioned statistical assumption on our system; rather, it is up to the model itself how it models each user and each item; we do not force it to find a set of similar users or a set of similar items. The second reason is that we construct distinctive features without using any demographic information. This is important where demographic information about the actors is not available, or where the number of actors is so huge that manual tagging is almost impossible. The third reason is that we treat the factors in their true nature, i.e., as nonlinear entities. This should enable our system to model the interaction between the factors better and, consequently, produce better rating predictions.

The interaction between any user and any item can be modeled as a rating matrix R_{m×n}, for m users and n items, most of whose entries are empty (see Fig. 1). The entries that contain a numeric value are the known interactions for those users and items. The advantage of modeling the problem as a rating matrix is that algebraic techniques [6,7] or other factorization frameworks [8] can easily be applied to it. The matrix representation is preferable wherever distinctive user and item factors are involved. It is assumed that the number of factors for each user is the same, and so is the number of factors for each item.

There have been many approaches to fill in the missing values. Collaborative filtering (CF) is the most common framework for completing rating matrices [9–12]. The framework assumes that similar users will have similar interactions with the same item. Good et al. [10] used information filtering (IF) to analyze item content and to build user profiles. After building each user profile they identified similar user groups and their opinions. Finally, the opinions and item content analyses were combined to make a recommendation to an appropriate set of users. A slightly different framework, known as content-based filtering (CBF), exploits the similarity between items and recommends items similar to those a user has shown preference for [13–15]. Sarwar et al. [14] evaluated different item-based filtering techniques. They presented a comparative performance analysis of two similarity metrics: the Pearson correlation coefficient (PCC), a statistical correlation measure, and the cosine similarity metric. One severe drawback of both PCC and cosine similarity is that they are extremely sensitive to outliers. Therefore, in larger datasets where a significant number of outliers are present, the predictions tend to degrade. Similar sets of users can also be found by other clustering techniques such as K-means clustering [16]. But the segmentation quality of the users in such techniques largely depends on how the considered factors are initialized. A clustering technique might have to do a huge amount of computation before it can form reasonable groups if the number of ratings is significantly large. A modern recommender system needs to be robust in larger domains. Both CBF and CF face sparsity and scalability problems. The sparsity problem arises when there is a huge number of users and items but the number of ratings is significantly low. These issues can make any interactive recommender system impractical, and therefore these frameworks are not adopted alone.

Memory-based techniques such as association rule mining [17] or decision trees [18] have been mixed with these frameworks. In [17] the authors used the apriori algorithm, a collaborative filtering data mining technique, and retrieved similar users and product association rules. A product rule is generated from the transaction information of each similarity group. These rules were then used for user-specific services. The authors in [18] used external information (i.e., web usage data from clickstreams) along with the purchasing records of customers to learn user preferences and product associations. They also introduced product taxonomy, background information about the products (such as what range a product's price had fallen in), into their recommendation process. Although these methods did not calculate the user or item factors explicitly, by considering similar users or items they were implicitly working with the factors.

Although most of the research in the literature has addressed the problem as a supervised learning problem, some works have approached it as an unsupervised learning problem [19]. For example, Kim et al. [20] evaluated the performance of several clustering techniques (such as simple K-means, self-organizing maps (SOM), and K-means with a genetic algorithm) in the online shopping domain. The GA was used as an initializer for K-means clustering, and the results showed that this model significantly outperformed the other two models in segmenting users so that better recommendations could be made. Unsupervised learning has also been applied to movie recommendation [21]. The authors used SOM with collaborative filtering; they built clusters by segmenting the users with respect to demographic information. The other information they used for clustering is user preferences for items, which they had interpolated using SOM. The primary job of these unsupervised techniques was to build similar sets (of users and items), but they did not construct any features of the actors; this is the reason they required demographic information to be available, or required a user to produce a significant number of ratings on items before its appropriate set (of similar users) could be found.

Matrix factorization techniques, on the other hand, directly involve the latent user and item factors. Say the factors F_u are known for any user u and the factors F_i are known for any item i. The missing rating r̂_{u,i} can then be calculated by a dot product, as in Eq. (1).
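The dot-product interpolation of Eq. (1) can be sketched in a few lines of NumPy. This is a hypothetical toy setup (4 users, 3 items, 2 latent factors each); the sizes and the random factor values are illustrative only, not settings from the paper:

```python
import numpy as np

# Hypothetical toy setup: 4 users and 3 items, each described by
# k = 2 latent factors; sizes and values are illustrative only.
rng = np.random.default_rng(0)
F_user = rng.uniform(-1.0, 1.0, size=(4, 2))  # one factor vector F_u per user
F_item = rng.uniform(-1.0, 1.0, size=(3, 2))  # one factor vector F_i per item

def predict(u: int, i: int) -> float:
    """Interpolate the missing rating r̂(u,i) as the dot product F_u · F_i (Eq. (1))."""
    return float(np.dot(F_user[u], F_item[i]))

# Filling in one missing cell of the rating matrix:
r_hat = predict(0, 2)
```

Note that once the latent factors are fixed, every missing cell of the rating matrix is determined by this single bilinear rule; this is exactly the linearity that the paper argues is too restrictive.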
r̂_{u,i} = F_u · F_i    (1)

Alternating least squares (ALS) [7] or singular value decomposition (SVD) [22–25] can be used to find a lower-rank approximation of the rating matrix while keeping the original entries fixed, thus ensuring that the filled-in entries best conform with the initial conditions. The authors in [22] tried to mitigate the effects of the scalability and sparsity problems of recommender systems. They confronted the scalability problem by dimensionality reduction (using SVD to find the most suitable items or users in related clusters). Zhou et al. [25] introduced an incremental algorithm on top of SVD to produce better prediction accuracy and also showed that their system ran in slightly less time. Koren [6] introduced a better version of SVD, SVD++, which can capture dynamic factors. These factorization techniques require the number of latent user factors and the number of latent item factors to be the same. This imposition may be ideal but is less practical. Taking such a linear combination (as in Eq. (1)) is a straightforward thing to do, but it does not reflect the real-world scenario. The obligation that a factor F_{u,j} for user u can only intermingle with the factor F_{i,j} for item i to contribute to the prediction constricts potential inter-factor relations. This implies that a nonlinear combination of the user and item factors should be a better approximator than these linear techniques.

On the other hand, neural networks have been very successful in modeling complex nonlinear relationships in various domains [26,27]. They have also been applied in the recommender system literature and have shown promise: Salakhutdinov et al. [28] used a restricted Boltzmann machine (RBM), a special kind of neural network, to improve Netflix's own rating prediction accuracy by more than 6%. Although such networks can be used with a CF [21] or a CBF framework [29], they can be used independently too. They are not constrained by similarity metrics, and are able to treat each individual user and item separately on the merit of its characteristics. It is possible to have any number of units at the input layer of a network; thus, it removes the obligation of traditional linear algebraic techniques that the number of user factors and the number of item factors must be the same (see Section 2.1). A modern recommender system has to deal with users who come from all walks of life and culture. The items are also very different. So it is fairly obvious that a complex, layered, nonlinear relationship exists between the users and items.

Recently, the use of neural networks on large datasets has been very successful [30–33]. These deep factor models can even learn to exploit audio features in a music file to recommend it to a user [29,34]. Although these deep models can capture the nonlinear relationship between user and item factors, they require these factors to be given to them. They require demographic information; e.g., they need audio features to recommend a music track to a user. Popular neural network architectures like convolutional neural networks have also turned out to be successful in movie rating prediction [35]. The authors generated latent factors by convolutional matrix factorization using contextual information. They detected distinctive features in convolutional layers, which is opposed to our motivation, as we are interested in constructing useful features rather than using any contextual information.

The degree of satisfaction of a user with an item is directly dependent on a set of user factors and item factors. Therefore, the more precisely the factors can be interpolated, the more accurate the predictions that can be generated. The motivation behind our research was to find answers to a few questions.

• Is it possible to construct the user profiles and the item characteristics reasonably well based only on rating triples (i.e., without using any demographic information)?
• Will rating predictions be more accurate if nuances in user behaviors and in item characteristics are emphasized, as opposed to the statistical assumption that similar users will have similar interactions with the same items?
• How much better ratings would a method predict if it treats the factors as nonlinear than the methods which consider them linear?

Our finding based on this work is that for a large number of rating triples (e.g., the Movielens 1 million dataset [36]) this useful set of factors can be interpolated to the degree that it outperforms systems which use explicit factors. It turns out that a deep factorizing model can be very effective in constructing (or learning) a distinctive set of features of both users and items just from the rating triples. These learned features, in turn, can be further processed to produce a rating. The research questions are further discussed in Sections 3.5 and 3.6.

Although there are many neural network architectures in the literature (such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), restricted Boltzmann machines (RBMs), etc.), we picked the feedforward neural network for a number of reasons. All the former architectures seek to model relations in the data given to them. But our approach to the problem was different; as we did not use any external information other than the rating triples, we had to create some random spurious factors (consistent with the rating triples) as pluggable data for our network. These factors were then updated regularly after each iteration of the training period. Normally, RNNs are useful for sequence learning problems, but here we were not dealing with sequences. CNNs model relationships by using replicated feature detectors, which is in contrast to our motivation, as we were constructing the features themselves. Therefore, these special neural network architectures are useful for learning sequences or detecting features in the data. We did not have any fixed data, and we were neither learning sequences nor detecting features. As our objective was to model a complex nonlinear layered relationship in the ratings, and a simple feedforward neural network is able to approximate any nonlinear function [37], we opted for the simpler neural network architecture.

The central idea of our deep feedforward model is not to use any external information other than the rating triples. Learning, therefore, has to achieve two specializations. The first specialization is to infer the factors of each user and each item that are assumed to be the determinants of the visible ratings (and the missing ratings). The second specialization is learning how to work with these factors. This specialization is achieved in two logical stages: the first stage is done by learning how to deal with the user factors and the item factors separately (i.e., detecting already learned user and item factors) by using separate learner connections, and once the network learns how to capture the factors separately, the second stage is done by flowing these learned factors through several layers of nonlinear units of the network. All these learnings take place at the same time.

The separate learning helps in factorization. As we did not use any demographic information and were only left with the rating triples, we had to start with some bipartite factors (each drawn from a uniform distribution): one group represented user factors while the other represented item factors. Connecting a partisan factor to an upper-layer unit exclusively (i.e., that upper-layer unit would not connect to the other group at the lower layer) made it possible to receive an error gradient solely for that partisan factor. Had there been no separate connections, the error gradient received for a group would have been incorrect, as the other group also contributed to the error. The analysis is elaborated in Sections 2.2.3 and 2.2.4.

We are mainly interested in factorizing users and items, and their relationship. We have not considered the cold start problem,
a very common problem that arises when a new user or a new item is introduced. Our assumptions and architecture do not allow us to emphasize it. To deal with the cold start problem practically, prior information about the product must be known, but this is a complete antithesis to our motivation. We emphasize constructing distinctive features of users and items, and it is impossible to construct the features of a user or an item if the system has not seen it previously. If there is no rating available for a user or an item, its factors, drawn from the uniform distribution, are not updated at all. Thus it is not possible to predict the behavior of a user or the characteristics of an item if it was not involved in any rating. Although we have not dealt with the cold start explicitly, if the system encounters an item that has not been rated before, the interaction does not have to be totally random. Say the system finds an item which has not been rated before. Let us further assume that the network has seen much data, and that its weights have settings for which it is currently in, or close to, a local minimum. The learned user factors, the understanding of how to deal with the factors, and the modeling of the relationship between factors counterbalance the cold start problem to some extent. The process is less compact, and this indirect mitigation does not work very well in practice.

Although for new users or items the system does not work satisfactorily, our work makes several contributions.

• Firstly, this nonlinear factorization method overcomes the barrier of other linear factor methods (like SVD++ [6]) in terms of linear modeling of the relationship. The methods which do a straightforward dot multiplication between user factors and item factors require the number of user factors and the number of item factors to be the same. Our model imposes no such constraint. The variance in user factors and the variance in item factors may be comparable for a balanced dataset which has similar numbers of users and items. But for any dataset with a significant disproportion in the numbers of users and items [11], the variances are incomparable. Thus, the imposition that the number of user factors and the number of item factors be the same is not ideal. Our deep factorization model allows the number of user factors and the number of item factors to be different. In fact, it is almost always the case that the optimal results are found when the number of user factors and the number of item factors are not the same (see Section 3.4).
• Secondly, the user and item factors are constructed automatically to make predictions. This automatic learning is especially desirable in scenarios where no information is available other than the rating triples. We also show how our model captures each user profile and each item characteristic with two examples in Section 3.5.
• Finally, we measure the effectiveness of the variant models based on our proposed method by using datasets from three different domains under various conditions. We also provide extensive comparative analyses of our models with some of the state-of-the-art models.

The rest of the paper is organized as follows. Section 2 describes our methodology with a detailed analysis and ends with an analytical justification of how the separate learning works. Section 3 describes the findings of our experiments, discusses the quality of the ratings predicted by our model, evaluates the performance of our proposed method, provides a comparative analysis with other best existing methods, and finally ends with an empirical justification of the separate learning. In Section 4 we summarize our findings and contributions, and briefly discuss potential future investigations.

Fig. 2. The full network architecture. This network differs from traditional fully connected feedforward neural networks in that the bottommost two layers of the network are not fully connected. Another exception is that the units of the input layer (representing the factors) are also updated to facilitate factor construction.

2. Methodology: Nonlinear User and Item Factorization (NUIF)

The NUIF model learns the user and item factors, and their complex nonlinear relationship, simultaneously using backpropagation. The model employs a deep feedforward neural network which forward-propagates the factors to produce a real-valued output. The error at the output layer is then backward-propagated until the factor layer is reached. The factors in this layer are then updated accordingly.

2.1. The network architecture

The network architecture is shown in Fig. 2. This is a traditional feedforward network with one exception: the connection from the bottommost layer (l^(1)) to the first hidden layer (l^(2)). Layer l^(1) is actually a container for the user factors (F_u) and the item factors (F_i). It can be thought of as logically divided into two sublayers, say sublayer l^(1,u) and sublayer l^(1,i). The units at l^(1,u) always contain F_u exclusively, whereas the units at l^(1,i) always contain F_i. The number of user factors (i.e., the number of units at l^(1,u)) and the number of item factors (i.e., the number of units at l^(1,i)) vary across data domains and require empirical settings (discussed in Section 3.4). Similarly, we have logically divided the first hidden layer l^(2) into two sublayers, l^(2,1) and l^(2,2). We have not wired any connection from sublayer l^(1,u) to sublayer l^(2,2). Similarly, no unit at l^(1,i) (i.e., F_i) has any connection to the units at sublayer l^(2,1). On the other hand, the units at l^(1,u) are fully connected to the units at l^(2,1), and the units at l^(1,i) are connected to all the units at l^(2,2).

Let W^(l) (l > 1) be the weights (i.e., connections) from layer l − 1 to layer l. So W^(2) is logically divided into W^(2,1) and W^(2,2). Obviously, W^(2,1) denotes all the connections from sublayer l^(1,u) to sublayer l^(2,1), while W^(2,2) denotes the connections from sublayer l^(1,i) to sublayer l^(2,2).
314 B. Purkaystha, T. Datta, Md.S. Islam et al. / Applied Soft Computing Journal 75 (2019) 310–322

Each unit at layer l (l > 2) is connected to all the units at layer 2.2.2. Learning process of the network: δ -rule
l − 1, i.e., immediate layers are fully connected. At the final layer, Our network learns by back-propagating the errors using δ -
we have set only one unit; when the factors from the input layer rule. Say, our network produced ratings y while the actual ratings
are propagated forward through upper layers and the transformed were given t. Let there be N training examples in the batch, and
information relayed to this unit, it produces a real value which therefore, we calculated the sum of squared errors (SSE) (Eq. (11))
is the rating. The units at each layer (except for the input layer) and back-propagated it.
are nonlinear units and there are actually two kinds of nonlinear
units we have used (but not both at the same time). If the dataset N
contains negative ratings then our units should be able to produce 1 ∑ (i)
E= (y − t (i) )2 (11)
negative values and in that case we would opt for tanh(·) activation 2
i=1
units (Eq. (2)). However, if there is no negative rating involved we
The error gradient with respect to the output was calculated using
would opt for logistic units σ (·) (Eq. (3)). In both the equations z is
Eq. (12).
the input to the units.
ez − e−z ∂E
tanh(z) = (2) =y−t (12)
ez + e−z ∂y
1 The error due to the output unit was calculated as such:
σ (z) = (3) ∂E ∂E ∂y ∂E
1 + e−z δ (L) = = · = ⊙ σ ′ (z (L) ) (13)
∂ z (L) ∂ y ∂ z (L) ∂y
2.2. Network analysis
The error in the immediate layers were calculated as:
2.2.1. Forward pass
δ (l) = (δ (l+1) · W (l+1) ) ⊙ σ ′ (z (l) ) (14)
Before training could be started, we had to identify each user
and each item in the dataset uniquely. After identifying them we σ (·) is the derivative of the activation function. For the logis-

initialized factor vector for each user and each item randomly. Say, tic units, σ ′ (z (l) ) = σ (z (l) )(1 − σ (z (l) )), and for the tanh units,
we decided to have Nu factors for each user and Ni factors for each tanh′ (z (l) ) = (1 − (tanh(z (l) ))2 ). The error gradients for each weight
item. The user factor vectors were initialized with the uniform and each bias were calculated using Eqs. (15) and (16) respectively.
random values in range [Umin , Umax ] (see Eq. (4)) and similarly item
factor vectors were initialized in range [Imin , Imax ] (see Eq. (5)) [38]. ∂E
= a(l−1) · δ (l) (15)
√ √ ∂ W (l)
6 6 ∂E
Umin = − Umax = (4) = δ (l) (16)
Nu Nu ∂ b(l)
√ √ We updated the weights using Eq. (17) and biases using Eq. (18)
6 6 at each epoch. This is a straightforward gradient descent update
Imin = − Imax = (5) where η is the learning rate.
Ni Ni
After initializing the factor vectors randomly, we had to feed ∂E
W (l) ←− W (l) − η (17)
them to the network. The user factor vectors (Fu ) for each mini- ∂ W (l)
batch were clung to the first sublayer of the bottommost layer ∂E
(i.e., l(1,u) ). The item factor vectors (Fi ) for each mini-batch, sim- b(l) ←− b(l) − η (l) (18)
∂b
ilarly, were clung to second sublayer of the bottommost layer
Our network learns two different things concurrently because
(i.e., l(1,i) ).
z(2,1) = W(2,1) · Fu + b(2,1),  z(2,2) = W(2,2) · Fi + b(2,2)   (6)

Here b(2,1) and b(2,2) are the biases for the first hidden layer, and z(2,1) and z(2,2) are the weighted inputs to the first and the second sublayers of the first hidden layer respectively. The output of the first hidden layer is given by applying either tanh(·) or σ(·) to z(2):

a(2) = activation(z(2))   (7)

For the subsequent discussion it is assumed that the activation function is the σ(·) function (Eq. (3)), so Eq. (7) can be rewritten as Eq. (8):

a(2) = σ(z(2))   (8)

Similarly, the output of each layer is calculated by first taking the weighted input from the immediate lower layer and then applying the activation function to it (Eq. (9)):

z(l) = W(l) · a(l−1) + b(l),  a(l) = σ(z(l))   (9)

The output unit o, at the final layer, produces one real-valued output (i.e., the rating):

output_o = σ(W(L) · a(L−1) + b(L))   (10)

… of the wiring that we chose. At the output layer there is only one unit, which receives layered, complex, nonlinear information about the factors from the units situated at the layer just below. Thus, whenever the weights between layer L − 1 and layer L are updated according to the error gradient, the network becomes better at turning the already (L − 1)-layered interaction between the factors into an L-layered interaction. This generalizes to the other layers l (l > 2) as well.

2.2.3. Specialization with the separate learners

As training progresses, the weights at the first hidden layer are forced to update themselves to minimize their contribution to the error (Eq. (11)). When they update themselves (using a gradient descent optimizer), the current configuration moves down towards a ravine in the error surface (see Fig. 3). At the ravine, the parameters (including the weights of the first hidden layer) have locally optimal values. Under these circumstances, changing the weights at the first hidden layer would move the current configuration up the error surface (and learning will obviously constrict such a change). In other words, the current configuration contains one of the best sets of values for the weights at the first hidden layer.

The error gradients incurred by the first sublayer of the first hidden layer (i.e., l(2,1)) are due only to the user container (i.e., l(1,u)), as there is no connection from the item container.
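For concreteness, Eqs. (6)-(10) can be sketched in NumPy. The factor counts, hidden size, initialization ranges, and the rescaling by 5 below are illustrative assumptions, not the paper's exact configuration (the paper tunes these per dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes only; the paper tunes these per dataset.
n_user_factors, n_item_factors, hidden = 50, 100, 150

# Factor vectors: spurious inputs drawn from a uniform distribution,
# later updated by backpropagation like any other parameter.
F_u = rng.uniform(-0.1, 0.1, n_user_factors)
F_i = rng.uniform(-0.1, 0.1, n_item_factors)

# Eq. (6): two separate sublayers, one wired only to the user factors,
# the other only to the item factors.
W21 = rng.normal(0.0, 0.1, (hidden, n_user_factors)); b21 = np.zeros(hidden)
W22 = rng.normal(0.0, 0.1, (hidden, n_item_factors)); b22 = np.zeros(hidden)
z2 = np.concatenate([W21 @ F_u + b21, W22 @ F_i + b22])
a2 = sigmoid(z2)                               # Eqs. (7)-(8)

# Eq. (9): a fully connected upper layer mixes the two specializations.
W3 = rng.normal(0.0, 0.1, (hidden, 2 * hidden)); b3 = np.zeros(hidden)
a3 = sigmoid(W3 @ a2 + b3)

# Eq. (10): a single logistic output unit; because sigma(.) lies in (0, 1),
# ratings are scaled down during training and scaled back for prediction.
w_out = rng.normal(0.0, 0.1, hidden)
rating01 = sigmoid(w_out @ a3)
print(5.0 * rating01)                          # a rating on a 1-5 style scale
```

A tanh output would be used instead for datasets with negative ratings, as discussed in Section 3.3.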
B. Purkaystha, T. Datta, Md.S. Islam et al. / Applied Soft Computing Journal 75 (2019) 310–322 315
Fig. 3. The same error surface, with two local minima A and B, is shown in both the left and right panels. In the left panel, the configuration (filled circle) tends to move downhill under the optimizer with the current settings of the parameters (i.e., biases and weights). In the right panel, the network tends to stay at ravine A because it is locally optimal.
The units at this sublayer had processed the user information only and forwarded it to the next layer. So reducing the error by updating the weights at this sublayer directly amounts to specialization in dealing with the user factors. These weights do not recognize the item factors; they specialize in, and discriminatingly learn, the user factors only. Similarly, the weights of the second sublayer (i.e., W(2,2)) specialize in dealing with the item factors alone.

2.2.4. Learning the user and item factors

The error gradients are further propagated from the first sublayer of the first hidden layer to the user container. This is unlike what is done in traditional neural networks, where the inputs are not updated; in our case the inputs are not real inputs but spurious ones drawn from a uniform distribution. The user factor vectors are clung to the units of this container. The error incurred by the first sublayer of the first hidden layer is due only to the factors of this container; therefore, any subsequent update to these factor vectors gives the network a better fit to the data. This part of training amounts to learning the preferences of each user separately.

Similarly, the item factor vectors in a mini-batch are clung to the units of the second sublayer of the bottommost layer (i.e., l(1,i)). The error gradients received through the second sublayer of the first hidden layer (l(2,2)) are the corrections to the random initialization of the item factors. Hence, the subsequent updates also cause the model to learn each item factor individually.

3. Experimental results and discussion

3.1. Datasets

We have tested our model on real world datasets [11,36]; they are summarized in Table 1. They constitute ratings of movies, jokes, and music, with different rating ranges, sparsity levels, and different sizes of data in the same domains. The collection periods of the ratings show that these datasets test the system's ability to model the dynamic behaviors of both users and items; for example, the ratings of Movielens 100k were collected over a span of more than 20 years. The four Movielens and Yahoo datasets have categorical ratings whereas the two Jester datasets have continuous ratings, but in our work we have produced continuous ratings for every dataset.

The Jester datasets do not contain any demographic information about the users, but the jokes on which the ratings were produced are available. Each user in the Jester datasets has rated at least 15 jokes. Both Movielens datasets contain demographic information about the users (such as age, sex, and occupation) as well as information about the movies (e.g., genre, IMDb URL, release date, and movie title). Although this information is available, we did not use it in our system. In both of these datasets each user has rated at least 20 movies. In the Yahoo music datasets, each user has rated at least 10 pieces of music.

As all the users in all the datasets were chosen based on the number of ratings they produced, they are all bound to be the successful users (i.e., users who did not enjoy rating movies, or simply did not participate in rating, were omitted) [36]. Any system that learns from these ratings will be fundamentally unaware of the behaviors of such users. Another limitation is that, as collecting the ratings spanned several years, recommendation algorithms, user interfaces, and available facilities changed from time to time. This might have impinged on users' way of rating movies, jokes, or music.

3.2. Evaluation metrics

3.2.1. RMSE

We have measured the accuracy of the model in terms of root mean squared error (RMSE), as is very common in the literature. Let there be N samples in the test set, and say our network produced rating y(i) for sample i (1 ≤ i ≤ N) while the actual rating was t(i). Then Eq. (19) gives the RMSE incurred by our model.

RMSE = √( (1/N) · Σ i=1..N (y(i) − t(i))² )   (19)

3.2.2. Precision and recall

We have used these two accuracy metrics for comparison among the models. Let a threshold θ be applied on a subset of rating triples R. For all ratings r(u,i) ∈ R, we say item i is relevant to user u if and only if r(u,i) ≥ θ. Say w is the number of items retrieved by the model as relevant, of which only x were actually relevant. Precision is then calculated by Eq. (20).

precision = x / w   (20)

Recall is simply the ratio of the number of relevant items retrieved to the number of actually relevant items present in the inventory. As, for any user, there are only two categories for each product (i.e., to be recommended or not to be recommended), recall is the same as sensitivity. If there are originally z relevant items after the threshold is applied, then recall (or sensitivity) is calculated as follows.

recall (sensitivity) = x / z   (21)

These two metrics are used in Section 3.6.3 to compare the recommendations produced by the methods.
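Eqs. (19)-(21) translate directly into code. A minimal sketch (the sample ratings below are made up for illustration):

```python
import numpy as np

def rmse(y, t):
    """Eq. (19): root mean squared error over the N test samples."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return float(np.sqrt(np.mean((y - t) ** 2)))

def precision_recall(pred, actual, theta):
    """Eqs. (20)-(21): an item counts as relevant iff its rating >= theta."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    retrieved = pred >= theta          # the w items the model calls relevant
    relevant = actual >= theta         # the z items that truly are relevant
    x = int(np.sum(retrieved & relevant))
    precision = x / max(int(retrieved.sum()), 1)
    recall = x / max(int(relevant.sum()), 1)
    return precision, recall

predicted = [4.2, 4.1, 1.5, 3.0]
actual = [4, 3, 2, 5]
print(rmse(predicted, actual))                  # ~1.1726
print(precision_recall(predicted, actual, 4))   # (0.5, 0.5)
```

Here items 1 and 4 are truly relevant at θ = 4, but the model retrieves items 1 and 2, so exactly one of two retrieved items is correct and one of two relevant items is found.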
Table 1
Datasets.
Dataset #Ratings #Users #Items Sparsity (%) Rating scale Time Domain
Movielens 100k 100k 943 1 682 93.7 [1,5] 1995–2016 Movies
Movielens 1M 1M 6 040 3 952 95.81 [1,5] 2000–2003 Movies
Jester 1.8M >1.8M 24 938 100 27.40 [−10.00,10.00] 1999–2003 Jokes
Jester 4.1M >4.1M 73 421 100 43.67 [−10.00,10.00] 1999–2003 Jokes
Yahoo music 220k >220k 7 643 11 916 99.76 [1,5] 2002–2006 Music
Yahoo music 365k >365k 15 400 1 000 99.77 [1,5] 2002–2006 Music
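The sparsity column of Table 1 is the percentage of empty cells in the user-item rating matrix. A quick sanity check of that relation against the Movielens rows:

```python
# Sparsity as implied by Table 1: the percentage of the user-item matrix
# that carries no rating.
def sparsity(n_ratings, n_users, n_items):
    return 100.0 * (1.0 - n_ratings / (n_users * n_items))

print(round(sparsity(100_000, 943, 1_682), 1))     # Movielens 100k -> 93.7
print(round(sparsity(1_000_000, 6_040, 3_952), 2)) # Movielens 1M   -> 95.81
```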
Table 2
Different parameters for different datasets.

Dataset          #User factors  #Item factors  Hidden sizes   Batch size  Dropout  RMSE
Jester 1.8M      200            200            (300,300,300)  250         0.0      4.1824
Jester 1.8M      100            300            (300,300,300)  250         0.0      4.1526
Jester 1.8M      50             300            (600,600,450)  400         0.5      4.1128
Jester 4.1M      200            200            (450,450,450)  500         0.0      4.1520
Jester 4.1M      50             400            (450,450,500)  500         0.5      4.0873
Movielens 100k   50             50             (100,500)      100         0.0      0.9101
Movielens 100k   50             50             (200,1000)     100         0.5      0.8965
Movielens 1M     120            200            (200,1000)     500         0.0      0.8532
Movielens 1M     150            250            (200,1000)     300         0.0      0.8463
Movielens 1M     150            250            (200,1000)     500         0.5      0.8110
Yahoo music 220k 50             100            (150,150)      100         0.0      0.9649
Yahoo music 220k 50             100            (150,150)      100         0.15     0.9470
Yahoo music 365k 30             50             (150,150)      100         0.0      0.9678
Yahoo music 365k 30             50             (100,50)       100         0.15     0.9408

3.3. Experiments

Different datasets require different activation functions. For example, Jester contains continuous ratings from −10.00 to +10.00, so our units should be able to produce negative ratings. That is why we have employed tanh units at each layer for Jester. As there are no negative ratings in the other datasets (i.e., Movielens and Yahoo music), we used logistic units at each hidden layer for them.

Each dataset required different settings of the parameters. Although it was not mandatory for our model to have the same number of units at the first hidden layer as the number of factors at the input layer, it turned out that when both these numbers were the same, learning was smoother and the network produced better results. Table 2 shows the sensitivity of our network to different settings of the parameters. The Hidden sizes column in the table excludes the first hidden layer of the network, as it is mandatory to have this layer in the network; the weights at this layer are the factor detectors. We had only one output unit, so it could only produce a real-valued rating r with 0 ≤ r ≤ 1 (for the logistic activation) or −1 ≤ r ≤ 1 (for the tanh activation). Therefore, we scaled down the ratings for the Jester datasets by a factor of 10, and for the others by a factor of 5. We fed the user and item factors into the network in mini-batches. Mini-batch learning is very important in our case; in addition to faster convergence, it helps the model generalize well on factorization (see Section 3.4.1).

For each of the datasets we set the initial learning rate η to 0.005. If the validation loss had not lowered for any four consecutive epochs, we decreased η by a factor of 1.1. For the entries in Table 2 where dropout is 0.0, we stopped learning when the validation error started to increase. When we incorporated the regularizer (i.e., dropout set to non-zero), we stopped learning once the value of η fell below 10⁻⁵. All the RMSEs reported in Table 2 were achieved when the dataset was divided in an 80:10:10 ratio for training, validation, and test purposes respectively.

3.4. Parameter settings and training data ratio

One observable thing is that the parameter settings vary largely across the datasets, and this is explicable. For larger datasets, it is reasonable to increase the capacity of the network by introducing more hidden layers, incorporating more hidden units, and introducing more user and item factors. But the numbers of user factors and item factors require a little more insight. For Jester 4.1M, we achieved optimal results with #user factors = 50 and #item factors = 400. On average, each user has rated 56 jokes, and each joke has been rated by 41,363 users. So it is very natural that the items have more variance than the users, and we need more units to confront this larger variance in the items. Therefore, it is reasonable to have more item factors than user factors. The number of factors itself required empirical settings; in any dataset where the number of users is less than the number of items, we had fewer user factors than item factors. The capacity of the network also needs to be larger when working on Jester, as its rating range is greater than that of any other dataset.

3.4.1. The impact of batch size

The optimal batch size differs extensively across the datasets; the batch size reveals the character of the dataset. Our optimization objective amounts to two specializations: factorizing the users and items, and modeling their relationship. For a dataset, if an optimal (or a suboptimal) parameter setting is found at a smaller batch size, then factorizing the users and the items at a more granular level is more important than (or at least as important as) modeling the relationship, and vice versa. This is so because the gradient received for a larger batch size carries a moderate update rule for the randomly initialized factors. The variance in the rating triples of larger batches is typically greater than that of smaller batches, and these greater variances in the user behaviors and item characteristics tend to balance each other out, yielding a moderate update. For a smaller batch size, on the other hand, the balancing out is comparatively primitive and the update rule for the factors is stronger. The most extreme scenario for this experiment is a batch size of one: even after days of training (on a machine containing an Intel Core i7 octa-core processor and 32 GB of memory), the learning curve does not converge; instead, it seems to zigzag in the parameter space. The impact of batch size on a dataset from each domain is shown in Fig. 4.

3.4.2. Training data ratio

Factorizing the users and the items distinctively greatly impacts the rating predictions. This is shown empirically in Fig. 5. If the dataset is split into a smaller training set and a larger test set, the system sees individual users and items very little. Thus the randomly initialized user and item factors are not adjusted properly, which affects prediction.

In Fig. 5 we have plotted the relative RMSE with respect to the training data ratio. For example, if for any dataset d our system incurred maximum RMSE Rmax and minimum RMSE Rmin, then the normalized RMSE of R for dataset d, Rnorm(d), is calculated as follows.

Rnorm(d) = R / (Rmax − Rmin)   (22)
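The stopping and decay rule of Section 3.3 (initial η = 0.005, η divided by 1.1 after four consecutive epochs without validation improvement, training halted once η < 10⁻⁵) can be sketched as follows. The toy epoch function is a stand-in for actual training, included only to exercise the schedule:

```python
def train_with_decay(epoch_fn, eta=0.005, patience=4, min_eta=1e-5):
    """Divide eta by 1.1 whenever the validation loss has not improved for
    `patience` consecutive epochs; stop once eta falls below min_eta.
    `epoch_fn(eta)` runs one training epoch and returns the validation loss."""
    best, stale = float("inf"), 0
    while eta >= min_eta:
        val_loss = epoch_fn(eta)
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                eta /= 1.1       # decay, then start counting afresh
                stale = 0
    return best

# Toy epoch: the loss improves quickly, then plateaus, forcing the decay path.
history = []
def fake_epoch(eta):
    loss = max(1.0 - 0.2 * len(history), 0.2)
    history.append((eta, loss))
    return loss

print(train_with_decay(fake_epoch))   # -> 0.2 (the plateau loss)
```

Note that the paper also uses plain early stopping (halting when validation error rises) for the dropout-free configurations of Table 2; the sketch covers only the η-decay variant.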
Fig. 4. The bar diagrams show the impact of batch size on rating predictions. For the Movielens 1M and Jester datasets, modeling the relationship between the factors is more important than (or as important as) factorizing the users and items. For Yahoo 365k, on the other hand, factorization is more important than modeling the relationship.
Fig. 5. The impact of factorizing the users and the items, shown in subplots for four datasets. The other two datasets (Yahoo 220k and Movielens 100k) have been skipped because they are too small for this experiment.
The plot reveals very important characteristics of how the system learns. The model learns the most during the initial increments of the training data (i.e., when the training data ratio is increased from 20% to 50%). In these increments, the model sees more novel users and items than in any later increment, and this drastically brings down the test set errors. The error lines flatten out after the stage at which the model no longer sees enough novel users or items; the only room for improvement left after this stage is adjusting the factors and the relationship model in the direction opposite to the error gradient. It should be noted that the different training data ratios have been discussed here only to show how they affect the learning of the model. Whenever the final RMSE incurred on any dataset is reported in this paper, the dataset is split 80:10:10 for training, validation, and test purposes respectively.

3.5. Factorization and rating analysis

In this section, we analyze the ratings produced by our model for two users from the Movielens 100k dataset, and based on the findings we seek to answer our first research question: whether it is possible to construct user profiles and item characteristics solely from the rating triples. The dataset contains the
links to the movie information: its release year, genre, etc. These are very good features for a movie and will certainly have an impact on the ratings. The users are anonymized; the two users we have picked have ids 450 and 405 respectively. Let us call them user a and user b.

User a has rated a total of 488 movies, only 31 of which he rated below 2 (on the discrete scale 1–5). So he rates movies generously. But there are some genres which he mostly dislikes: comedy, adventure, and drama. He mostly gave very low ratings in these genres. We therefore picked a movie, Up Close and Personal, and fed the already learned preferences of a and the characteristics of this movie to our trained network. The genre of the movie is romance. Our network produced a rating of 3.876 whereas he had actually rated the movie 4. We again picked two movies, Beverly Hillbillies and Half Baked, both of genre comedy. The network produced ratings of 1.531 and 1.508 respectively whereas he had rated both of them 1. Clearly, the network has learned useful features of the movies (although we told it nothing about them) and the preferences of user a. Another movie was picked, Brassed Off; its genre is comedy, romance, and drama. The network produced a rating of 2.994 whereas the actual rating he had given is 5. One possible explanation is that, although the movie was of a genre he did not like, he might have been impressed by some other accidental factors (e.g., the acting of the lead roles, the romantic background, and others).

User b has rated 657 movies, with an average rating of 1.85. But there are genres (e.g., comedy, drama, science fiction) on which he rated generously. We picked the movie Philadelphia and fed its factors and b's preferences to our network. The network produced a rating of 3.739 whereas b had actually rated it 5. The network was able both to factorize user b and the movie Philadelphia and to model their relationship to some extent. We then picked the movie 2001: A Space Odyssey (of genre science fiction), and the network produced a rating of 3.71 whereas b had actually rated it 5. Finally, we picked the movie A Stranger in the House (of genre thriller). The network produced a rating of 1.65 whereas the actual rating was 1.

The network was not given any prior information about the users and movies. Still, it has managed to find distinctive features of the users and items that let it produce a better fit to the data. This answers our first research question: it is indeed possible to construct the user profiles and the item characteristics reasonably well based solely on the rating triples.

3.6. Comparison among models

We now compare our model with some of the best existing models on these datasets. The results of the other models have been reproduced. The models we compare against are described below.

Funk-SVD. This model is implemented using the method of [39]. It is a slightly different version of standard SVD.

SVD++. This model is based on an extensive usage of SVD, and it modifies standard SVD considerably [6]. It expects a temporal dimension in the dataset, and we gave this information to it. Our model does not require such temporal information.

Multiple kernelized matrix factorization (MKMF). This model [8] is similar to the other matrix factorization techniques employing collaborative filtering. It combines multiple kernels and produces slightly better results.

Pearson vector similarity (PVS) and Cosine vector similarity (CVS). These two models are essentially the same [14]; the only difference is in how they measure similarity among items.

Matrix factorization using user-generated content (MF-UGC). This model [40] exploits user generated content on top of matrix factorization to make predictions.

Robust convolutional matrix factorization (R-CONVMF). This model [35] uses contextual information and detects useful user and item features in lower convolutional layers.

Table 3
Comparison with other models in terms of RMSE.

Model                                              J 1.8M  J 4.1M  ML 100k  ML 1M   YM 220k  YM 365k
Pearson vector similarity                          4.6307  4.6280  1.0427   1.0003  1.2765   1.2761
Cosine vector similarity                           4.6100  4.6215  1.0385   0.9983  1.2752   1.2752
Funk-SVD                                           4.2829  4.2221  0.9538   0.9243  1.0005   0.9997
SVD++                                              4.1472  4.1316  0.9096   0.8427  0.9676   0.9670
Matrix factorization using user generated content  4.1465  4.1379  0.9008   0.8470  0.9701   0.9723
Multiple kernelized matrix factorization           4.1369  4.1282  0.9091   0.8297  0.9502   0.9479
Robust convolutional matrix factorization          4.2279  4.1951  0.9003   0.8470  0.9618   0.9582
NUIF                                               4.1128  4.0873  0.8965   0.8110  0.9470   0.9408

*ML: Movielens, *YM: Yahoo music, *J: Jester. The models have been reproduced.

3.6.1. Comparison in terms of RMSE

The comparison of the models in terms of RMSE is given in Table 3. The RMSEs of the compared models were produced with the dataset split 80:10:10 for training, validation, and test purposes. On average, our model has predicted more accurate ratings on all the datasets than the other models. The improvement made by our model is marginal for smaller datasets like Movielens 100k or Yahoo music 220k, but it scales with the size of the dataset. For example, Jester 4.1M contains more than 4 million ratings, and on it our model has incurred an RMSE that is decisively smaller than the previous best.

In Table 3, PVS and CVS are two methods based solely on similarity measurements and on the statistical assumption, already discussed in Section 1, that similar users will have similar interactions with similar items. In our method, we have emphasized the nuances in the user behaviors as well as in the item characteristics, and it is clearly visible in Table 3 that our method has incurred significantly lower RMSEs than either of these two methods. This corroborates our second research question: rating predictions were indeed more accurate because we treated each user and each item individually.

In the same table, there are a number of matrix factorization methods (e.g., SVD++, robust convolutional matrix factorization, and multiple kernelized matrix factorization) which treat the factors linearly. We have incurred lower RMSEs than each of them. Although for smaller datasets (e.g., Movielens 100k) the improvement is not very significant, the improvement made on a larger dataset is strongly visible (e.g., on Jester 4.1M we have incurred an RMSE of ∼4.087 whereas the second best, incurred by the multiple kernelized matrix factorization method, is ∼4.128). Therefore, we can answer our third research question: although for smaller datasets the improvement made by our method over the linear methods is insignificant, for larger datasets the improvement is significant.

3.6.2. Comparison in terms of temporal modeling

The architecture of our system does not accommodate explicit temporal modeling. Temporal modeling means capturing the dynamic behavior of the users and the changing characteristics of the items. As we have not designed for it explicitly, it is very hard to conclude whether the system has captured the dynamism in
the user behaviors and the item characteristics just by looking into the learned factors and the bottommost layer weights. We picked the Movielens 1M dataset to analyze the final output produced by the various methods on the basis of temporal modeling. This dataset has been picked because its ratings span more than 20 years, which is longer than any other dataset. The ratings in the dataset have an associated column of timestamps (i.e., the time when each rating was given). We divided the test set users into a number of categories: for any category c we set a threshold tc so that each user in c has a rating span of at least tc days. This is summarized in Table 4.

Table 4
Movielens 1M dataset categorized according to tc values.

#  Minimum spanning days tc  # Test users
1  200                       96
2  300                       75
3  400                       60
4  500                       49
5  600                       40
6  700                       37
7  800                       22
8  900                       12
9  1000                      5

Although our system is not provided with any temporal information, if the dataset has the property that its users' behaviors or items' characteristics change radically through time, the system has to adjust the weights between the bottommost layer and the first hidden layer (i.e., the weights W(2)) and the factors in the factor layer (i.e., Fu and Fi) in such a way that it can minimize the error (see the optimization objectives in Sections 2.2.3 and 2.2.4). This, although very implicitly, amounts to temporal modeling. The effectiveness of our temporal modeling is shown in Fig. 6, where the RMSE of each of the models is plotted against each category. For the portion of the test set where each user had at least two hundred days of ratings, NUIF incurred a comparatively lower RMSE. But as the users stayed longer in the dataset, the rating predictions made by NUIF did not scale as well as those of some other methods; for larger values of tc (say, 900) our model incurred more RMSE than the R-CONVMF model (which had used contextual information).

3.6.3. Comparison in terms of recommendations

The final comparative analysis of the models in this section is given in terms of the precision versus recall plots in Fig. 7. We have selected three datasets from different domains (i.e., Jester 4.1M, Movielens 1M, and Yahoo music 365k). The plots are not based on the whole test set; instead, we have randomly picked 1000 test samples T from each domain (i.e., jokes, movies, and music) and plotted the points. For any dataset d, we tried out several values of a threshold θd. Each value of θd was applied on T, and precision and recall were calculated using Eqs. (20) and (21). Precision shows the ratio of the number of correct items recommended by the model to the total number of items recommended, while recall shows the ratio of the number of correct items recommended to the number of items deemed relevant by θd.

Fig. 7a shows the precision versus recall curves of all the models on 1000 random test examples from the Jester 4.1M dataset. Our model finds more relevant products than the neighborhood models (Pearson vector similarity (PVS) and Cosine vector similarity (CVS) give the bottommost two curves). Other factorization models like MKMF provided very similar performance when the recall was small. But once the threshold θjester divided the test examples into roughly equal numbers of more relevant and less relevant items, our model always produced better recommendations (e.g., see the NUIF curve for recall in the range [0.3–0.8]).

Fig. 7b also shows that our model almost always produced better recommendations, although there are some large recall values for which other factorization models produced similar or better recommendations. This indicates that the improvement achieved by our model on Movielens is not as large as on Jester, which makes sense: the Jester dataset is more than four times larger than Movielens, so our deep model has specialized more on Jester. Still, in most cases it produced better recommendations than the other models.

Fig. 7c shows precision versus recall on Yahoo 365k. This dataset is significantly smaller than the previous two. As expected, although our model produced better recommendations in most cases, there are more instances (when recall > 0.8) where other factorization models produced similar or better recommendations than for the previous two datasets.

3.7. Impact of rating predictions and recommendation quality on obtained results

The obtained results show a broad range of characteristics of our approach. The prediction accuracy improved for each dataset when more hidden units were incorporated and dropout (i.e., randomly omitting some of the units at each iteration) was enabled. That implies that, with a proper regularization technique present, increasing the capacity of our network lets it model the relation between the users and the items better. This observation is important mainly because it hints that, at the lower layers (i.e., the input layer and the first hidden layer), the separate learner connections were helping the model factorize the users and the items well (as shown with examples in Section 3.5) while the
Fig. 6. RMSE incurred by the different methods for users with different minimum rating spans (in days).
Fig. 7. Precision versus recall for three different datasets. Moving right along the horizontal axis, the number of relevant items increases after the threshold θ is applied. When the value of θ is moderate (i.e., near the middle of the rating range), the number of more relevant items roughly equals the number of less relevant items, and precision drops. Every model in each of the plots finds it easy to recommend items when θ is either strong or weak.
upper layer connections were modeling the relationship among the factors.

Another important insight is that, for all of the data domains, our prediction accuracy improves as the size of the dataset grows. For example, on the Movielens 100k dataset the RMSE incurred by our model was ∼0.89 (see Table 2), but when the size of the dataset increased tenfold (Movielens 1M), the RMSE dropped to as low as ∼0.81. Such improvement is also visible for the other datasets. This shows that our method produces ratings with better accuracy as the size of the dataset grows: with a larger number of ratings, we are able to factorize the users and the items in a more granular way in the lower layers, which helps the upper layers utilize these learned factors to produce better ratings.

In Section 2.2.3 we analytically justified why we split the connections in the first hidden layer. We now give an empirical justification of the impact of the separate learners. To see it in practice, we ran our system on all six datasets with the same configurations, but instead of using separate learners, we fully connected the first hidden layer to the factor layer. The results are summarized in Table 5. For each dataset, the training RMSE is significantly lower when separate learning is not enabled; the test error, on the other hand, is ostensibly lower with separate learning.

The fully connected configurations have more connections with which to model accidental regularities in the ratings. Capturing such accidental regularities causes the model to overfit the data; hence, while the training error decreases, the test error increases. Moreover, the connections at this layer are not allowed to specialize exclusively in either type of factor. Thus, the units at the first hidden layer cannot distinguish between the two types of units (i.e., user factors and item factors) at the input layer. Therefore, when the units at the first hidden layer propagate the error back to the input layer, they provide a balanced gradient. The factors at the input layer then do not receive the actual gradient but an error gradient that has been counter-balanced by the other type of factors. This disrupts the construction of the user profiles and the item characteristics, and causes the users and the items to be factorized improperly. The network does not learn each user and each item individually; rather, it is compelled to learn the interaction of each individual user-item pair (which we already do in the higher layers; in fact, the full connection between the factor layer and the first hidden layer just adds an extra layer to the relationship model). These empirical results hint that separate learning helps the model generalize well and enables the system to model the user profiles and the item characteristics more precisely.

In Section 3.6.3, we showed the recommendations made by our method and by the other models, and observed that the improvement which had been strongly visible in terms of ratings does not hold as strongly for recommendations; for some higher recall values our method even produced worse recommendations than some other models. This is because, as several studies have observed, improvements in terms of RMSE do not translate into recommendation quality as much as expected [41]. Our study had the objective of bringing down the error in rating predictions, and therefore this discrepancy exists in the recommendation quality of our system.
Table 5
Impact of separate learning.

                 No separate learners          Separate learners             Improvement =
Dataset          Training RMSE  Test RMSE R1   Training RMSE  Test RMSE R2   (R1 − R2)/R1 × 100%
Jester 1.8M      2.0138         4.4304         2.1252         4.1128         7.16
Jester 4.1M      1.9632         4.4557         2.0160         4.0873         8.27
Movielens 100k   0.5076         0.9208         0.5530         0.8965         2.64
Movielens 1M     0.4810         0.8553         0.568          0.8110         5.17
Yahoo 220k       0.4622         0.9731         0.3801         0.9470         2.69
Yahoo 365k       0.4597         0.9710         0.4048         0.9408         3.19
limitation of the statistical assumption that similar users will have similar opinions on an item. The main contribution of our work lies in how we model each user profile and each item characteristic without using any demographic information. This way of factorization is particularly helpful in scenarios where only a huge amount of rating data exists, without any explicit information about the users or the items. We have also contributed to eliminating a limitation of traditional linear factorization models (i.e., that the number of user factors and the number of item factors must be the same), as our deep model allows any number of factor units at the bottommost layer. We constructed the user behaviors and the item characteristics in the factor layer (i.e., the bottommost layer) and modeled their relationship in the upper layers. We have shown that our model produces better predictions than the other factorization models and the models that use contextual or content information.

Although we have reported the quality of recommendations produced by our model to compare it with others, our main objective was to minimize the error in rating predictions. It is well known that recommendation quality does not necessarily scale with root mean squared error. A further investigation could measure the effectiveness of the given architecture in terms of the recommendations it makes rather than the ratings it predicts. Another investigation could use demographic information about the users and the items along with the constructed factors, to observe how such a hybrid factorization model performs in terms of both rating prediction and recommendation quality. Additionally, as this work has not confronted the cold start problem (a serious dilemma that arises when a new user or item, not involved in any rating, is introduced into the system), a study can be run on how to confront this problem while keeping the other advantages of our system intact.

References

[1] P. Clerkin, C. Hayes, P. Cunningham, Automated case generation for recommender systems using knowledge discovery techniques, Genre (1994).
[2] P. Resnick, H.R. Varian, Recommender systems, Commun. ACM 40 (1997) 56–58.
[3] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng. 17 (2005) 734–749.
[4] D.H. Park, H.K. Kim, I.Y. Choi, J.K. Kim, A literature review and classification of recommender systems research, Expert Syst. Appl. 39 (2012) 10059–10072.
[5] R. Burke, Integrating knowledge-based and collaborative-filtering recommender systems, in: Proceedings of the Workshop on AI and Electronic Commerce, 1999, pp. 69–72.
[6] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009).
[7] B.-W. Chen, W. Ji, S. Rho, Y. Gu, Supervised collaborative filtering based on ridge alternating least squares and iterative projection pursuit, IEEE Access (2017).
[8] X. Liu, C. Aggarwal, Y.-F. Li, X. Kong, X. Sun, S. Sathe, Kernelized matrix factorization for collaborative filtering, in: Proceedings of the 2016 SIAM International Conference on Data Mining, SIAM, 2016, pp. 378–386.
[9] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, J. Riedl, GroupLens: Applying collaborative filtering to Usenet news, Commun. ACM 40 (1997) 77–87.
[10] N. Good, J.B. Schafer, J.A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, J. Riedl, et al., Combining collaborative filtering with personal agents for better recommendations, in: AAAI/IAAI, 1999, pp. 439–446.
[11] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: A constant time collaborative filtering algorithm, Inf. Retr. 4 (2001) 133–151.
[12] X. Yang, Y. Guo, Y. Liu, H. Steck, A survey of collaborative filtering based social recommender systems, Comput. Commun. 41 (2014) 1–10.
[13] M. Balabanović, Y. Shoham, Fab: Content-based, collaborative recommendation, Commun. ACM 40 (1997) 66–72.
[14] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, ACM, 2001, pp. 285–295.
[15] P. Pirasteh, J.J. Jung, D. Hwang, Item-based collaborative filtering with attribute correlation: A case study on movie recommendation, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2014, pp. 245–252.
[16] B.M. Sarwar, G. Karypis, J. Konstan, J. Riedl, Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering, in: Proceedings of the Fifth International Conference on Computer and Information Technology, Vol. 1, 2002.
[17] C.-H. Lee, Y.-H. Kim, P.-K. Rhee, Web personalization expert with combining collaborative filtering and association rule mining technique, Expert Syst. Appl. 21 (2001) 131–137.
[18] Y.H. Cho, J.K. Kim, S.H. Kim, A personalized recommender system based on web usage mining and decision tree induction, Expert Syst. Appl. 23 (2002) 329–342.
[19] K. Bryan, M. O'Mahony, P. Cunningham, Unsupervised retrieval of attack profiles in collaborative recommender systems, in: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM, 2008, pp. 155–162.
[20] K.-J. Kim, H. Ahn, A recommender system using GA K-means clustering in an online shopping market, Expert Syst. Appl. 34 (2008) 1200–1209.
[21] M. Lee, P. Choi, Y. Woo, A hybrid recommender system combining collaborative filtering with neural network, in: International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Springer, 2002, pp. 531–534.
[22] M. Nilashi, O. Ibrahim, K. Bagherifard, A recommender system based on collaborative filtering using ontology and dimensionality reduction techniques, Expert Syst. Appl. 92 (2018) 507–520.
[23] P. Symeonidis, Matrix and tensor decomposition in recommender systems, in: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, 2016, pp. 429–430.
[24] G. Adomavicius, Y. Kwon, Multi-criteria recommender systems, in: Recommender Systems Handbook, Springer, 2015, pp. 847–880.
[25] X. Zhou, J. He, G. Huang, Y. Zhang, SVD-based incremental approaches for recommender systems, J. Comput. System Sci. 81 (2015) 717–733.
[26] J.R. Noriega, H. Wang, A direct adaptive neural-network control for unknown nonlinear systems and its application, IEEE Trans. Neural Netw. 9 (1998) 27–34.
[27] T.K. Paradarami, N.D. Bastian, J.L. Wightman, A hybrid recommender system using artificial neural networks, Expert Syst. Appl. 83 (2017) 300–313.
[28] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 791–798.
[29] A. Van den Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: Advances in Neural Information Processing Systems, 2013, pp. 2643–2651.
[30] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 1235–1244.
[31] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, 2016, pp. 7–10.
[32] J. Wei, J. He, K. Chen, Y. Zhou, Z. Tang, Collaborative filtering and deep learning based recommendation system for cold start items, Expert Syst. Appl. 69 (2017) 29–39.
[33] J. Liu, C. Wu, J. Wang, Gated recurrent units based neural network for time heterogeneous feedback recommendation, Inform. Sci. 423 (2018) 50–65.
[34] X. Wang, Y. Wang, Improving content-based and hybrid music recommendation using deep learning, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 627–636.
[35] D. Kim, C. Park, J. Oh, H. Yu, Deep hybrid recommender systems via exploiting document context and statistics of items, Inform. Sci. 417 (2017) 72–87.
[36] F.M. Harper, J.A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. (TiiS) 5 (2016) 19.
[37] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (1991) 251–257.
[38] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[39] S. Funk, Netflix update: Try this at home, 2006.
[40] Y. Xu, Z. Chen, J. Yin, Z. Wu, T. Yao, Learning to recommend with user generated content, in: International Conference on Web-Age Information Management, Springer, 2015, pp. 221–232.
[41] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-N recommendation tasks, in: Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, 2010, pp. 39–46.