
2022 IEEE International Conference on Big Data (Big Data)

Collaborative Filtering Guided Deep Reinforcement Learning for Sequential Recommendations
Vahid Azizi (Adobe Inc., San Jose, USA) vazizi@adobe.com
Saayan Mitra (Adobe Research, San Jose, USA) smitra@adobe.com
Xiang Chen (Adobe Research, San Jose, USA) xiangche@adobe.com

Abstract—Earlier recommendation techniques, such as Collaborative Filtering (CF), assume that users' preferences do not change over time and strive to maximize the immediate reward. In recent studies, Reinforcement Learning (RL) has been used to build interactive recommendation systems that capture users' preferences over time and maximize the long-term reward. However, these methods have two limitations. First, they assume that items are independently distributed and do not consider the relations between items. This assumption ignores the power of item relations for recommendation systems, as demonstrated by CF. Second, RL-based methods rely primarily on users' positive feedback to understand their preferences, and sampling is used to incorporate their negative feedback. In a practical setting, users' negative feedback is just as crucial as their positive feedback for understanding their preferences. We present a novel Deep Reinforcement Learning (DRL) recommendation framework to address the limitations above. We specifically utilize the actor-critic paradigm, which treats the recommendation problem as a sequential decision-making process, to adapt to users' behaviors and maximize the long-term reward. Motivated by the intuition that similar users like similar items, we extract the relations between items using CF and integrate them into our framework to boost overall performance. Instead of negative sampling, our proposed framework relies on all of the users' positive and negative feedback to understand users' preferences more accurately. Extensive experiments with our dataset and two public datasets demonstrate the effectiveness of our proposed framework.

Index Terms—Deep Reinforcement Learning, Actor-Critic Framework, Collaborative Filtering, Negative Feedback, Interactive Recommendation Systems

I. INTRODUCTION

Recommender systems have garnered much attention in the last decade due to their importance in digital marketing. Typically, methods such as content-based filtering [1], collaborative filtering [2], matrix factorization [3], factorization machines [4], and deep learning models [5] have been proven to work well, especially when rich historical user-item interactions are available. However, these methods assume that users' preferences do not change over time. Hence, they cannot adapt to evolving trends and changing users' tastes. Moreover, these methods are often greedy and maximize the immediate reward, ignoring the long-term (cumulative) reward [6], [7]. Typically, recommender systems aim to optimize Key Performance Indicators (KPIs) such as clicks and conversions on a web page. Thus, looking at the immediate reward may not always benefit KPI maximization in the long run.

To address the limitations above, the recommendation problem has been cast as a sequential decision-making problem. Recent advances in RL have been leveraged to model interactive and dynamic recommender systems with long-term planning. Methods including but not limited to Partially Observable Markov Decision Processes (POMDPs) [8], Q-learning [9], value-based methods [7], [10], and policy-based methods [11], [12] have been proposed. As users interact with these methods, the strategies can be continually updated to make recommendations based on users' dynamic preferences. However, these methods assume that items are independent and identically distributed, and the recommendations are based only on users' preferences. Based on the intuition behind item-based CF [13], [14], similar users will probably like similar items. Existing RL-based methods do not consider the relations between items in many recommendation scenarios. Besides, when presented with recommendations, users typically choose a few items, representing positive interactions characterized by views, clicks, and purchases. The vast majority of items are skipped or ignored, which captures the users' implicit negative feedback and gives valuable insight into the users' preferences. Current RL-based methods either do not use the negative feedback or use it through sampling [10]. Moreover, users' preferences are only updated if the users have a positive interaction.

In this paper, we propose an actor-critic framework for capturing the dynamic changes in users' interests and the long-term impact of items on the cumulative reward. We boost overall recommendation performance by leveraging relations between items generated using CF. To make the users' states more representative, we integrate the item-item relations into the users' historical interactions instead of using them directly to make recommendations. Furthermore, we update the users' preferences based on both positive and negative feedback. We train the proposed model using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm due to its superiority in training a stable model. The main contributions of this paper can be summarized as follows:
• We propose a recommendation framework with a novel
actor-critic architecture.

• We leverage relations between items through CF and integrate them with the users' historical interactions.
• We integrate all the negative and positive feedback to holistically capture the users' behaviors and dynamic interests.
• We narrow the item search space using CF, making the proposed model more practical in recommender systems with many items.

II. RELATED WORK

Recently, RL has been used to capture users' dynamic interests in recommender systems and optimize the cumulative reward. In [12], Zhao et al. propose an actor-critic framework for generating pages of items. Similarly, Liu et al. adopt an actor-critic framework in [15] and experiment with state representation for better recommendation performance using direct and complex interactions between users and items. Zhao et al. [10] include negative feedback by sampling alongside positive feedback in their RL-based algorithm. In [7], Zheng et al. propose a deep Q-learning recommendation framework for explicitly modeling future rewards. Wang et al. in [16] present an actor-critic framework for sequential recommendations based on knowledge graph structures. The direct relation between items is not used in any of these prior works. Additionally, they typically focus on the users' positive interactions, while negative feedback is either ignored or used through sampling.

III. THE PROPOSED FRAMEWORK

This section provides an overview of the proposed framework with notations (Table I) and describes each component's technical details in the following sections.

TABLE I: Notations.
Notation | Explanation
s | The user's state, s = (s^+, s^-)
s^+ | The user's positive state
s^- | The user's negative state
h | The user's interactions history, h = (h^+, h^-)
h^+ | The user's positive interactions history
h^- | The user's negative interactions history
I_set^+ | The positive set, representing the user's positive interactions
I_set^- | The negative set, representing the user's negative interactions
n | Number of recommended items in each session, n >= 1
d | Maximum number of items in the historical sets, d >= n
m | Dimension of the items' representation vectors
e_all | All items' representation vectors
l | Number of prior interaction sets used to generate the user's state

A. Framework Overview

We model the recommendation problem as a Markov Decision Process (MDP) M where the recommendation system (agent) interacts with the users (environment) over a sequence of time steps by recommending n items in each step, where n >= 1. M is defined by a tuple of five elements (S, A, T, R, γ) as follows:

State space S: s_i ∈ S is defined as the user's preferences at time t_i. As the user's preferences can be obtained from her historical activities, s_i is a function of the history of her interactions h_i, s_i = f(h_i). We break down the state into positive and negative states, s_i^+ and s_i^-, where s_i^+ represents her positive preferences at t_i and s_i^- represents her negative preferences at t_i. Therefore, s_i = (s_i^+, s_i^-). Consequently, the user's interactions history is divided into a positive interactions history h_i^+ and a negative interactions history h_i^-, and we have s_i^+ = f^+(h_i^+) and s_i^- = f^-(h_i^-). We define the historical interactions as h_i^+ = {I_set_0^+, I_set_1^+, ..., I_set_{i-1}^+} and h_i^- = {I_set_0^-, I_set_1^-, ..., I_set_{i-1}^-}, where I_set_i^+ and I_set_i^- are the corresponding positive and negative sets built from the user's feedback to the n recommended items at session t_i (see section III-C and Fig. 1 (agent view)). In our model, we keep the dimension of states m equal to the dimension of the items' representation vectors. Thus, s_i^+, s_i^- ∈ R^{1×m}, which facilitates matrix operations between states and item embedding vectors. Moreover, I_set^+, I_set^- ∈ R^{d×m}, where d is the hyperparameter for the maximum number of items in each set, with d >= n.

Action space A: a_i ∈ A is a vector representation of the n recommended items at t_i. The action vector is generated as the mean of the recommended items' representation vectors; therefore, a_i ∈ R^{1×m}.

Reward R: r_i ∈ R is the immediate reward received based on the user's response to the recommended items, r((s_i^+, s_i^-), a_i). The user can click, skip, or purchase the recommended items.

Transition function T: the transition function is the probability of transitioning from state s_i = (s_i^+, s_i^-) to s_{i+1} = (s_{i+1}^+, s_{i+1}^-) after the agent takes action a_i, T(s_i, a_i, s_{i+1}) = Pr(s_{i+1} | s_i, a_i). We further assume the Markov property of the transition function: Pr(s_{i+1} | s_i, a_i) = Pr(s_{i+1} | s_i, a_i, s_{i-1}, a_{i-1}, ..., s_0, a_0).

Discount parameter γ: γ ∈ [0, 1] is the decay parameter that determines the importance of future rewards. If γ = 0, the agent considers only the immediate reward and ignores future rewards; on the other hand, when γ = 1, the agent weighs the immediate reward and future rewards equally.

With so many items, the state and action spaces are enormous. This makes conventional RL methods (e.g., Q-learning) infeasible due to computational challenges [17]. Here, we use DRL to approximate the optimal policy and value function with neural networks. Moreover, DRL does not estimate transition probabilities or store a Q-value table, so it is feasible to support many items in recommender systems. The goal is to find an optimized policy and an approximator for the value function. Our problem is well suited to the actor-critic paradigm since it combines the computation of the value function with an explicit representation of the policy [17]–[19]. We continuously learn the optimal recommendation policy π_θ : S → A. The optimal recommendation policy maximizes the expected long-term cumulative reward from any state-action pair (s_i ∈ S, a_i ∈ A) in Eq. 1, where E is the expectation under the policy π_θ and r_{t+k} is the immediate reward at a future time step t + k.

Q*(s_i, a_i) = max_{π_θ} E_{π_θ} { Σ_{k=0}^{∞} γ^k r_{t+k} | s_i, a_i }    (1)
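To make Eq. 1 concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the discounted cumulative reward that the policy maximizes; the discount factor γ = 0.99 follows Table IV and the reward values 5 and 0 follow the experimental setup in section IV.

```python
# Sketch of the discounted cumulative reward inside Eq. 1.
# Rewards of 5 (positive interaction) and 0 (skip) follow section IV;
# gamma = 0.99 follows Table IV. Illustrative only.
def discounted_return(rewards, gamma=0.99):
    """Sum_{k=0}^{K-1} gamma^k * r_{t+k} for a finite trajectory."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# A session trajectory: positive, skip, skip, positive, skip.
print(discounted_return([5, 0, 0, 5, 0]))  # 5 + 0.99**3 * 5 ≈ 9.851
```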
In an actor-critic framework, the actor (policy network) learns from the critic (value function). The user's interaction history is the input to the actor, which generates the current state s_i = (s_i^+, s_i^-) and provides an action a_i. Instead of considering all possible state-action pairs, the critic uses the (s_i, a_i) pair as input to compute the Q-value function Q(s, a) = E_{s'}[r + γ Q(s', a') | s, a]. The critic evaluates the generated action, and the actor updates its parameters to optimize the actions in subsequent iterations [17], [18].
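As a small illustration of this interplay (our own sketch under simplifying assumptions, not the paper's implementation), the critic's learning signal is the one-step bootstrapped target shown below; the TD3-specific refinements actually used for training are sketched later in section IV. The `actor_target`, `critic`, and `critic_target` networks are assumed to exist.

```python
import torch
import torch.nn.functional as F

# One-step bootstrapped critic target: Q(s, a) ~ r + gamma * Q'(s', a'),
# with a' = actor_target(s'). Networks are assumed to be given modules.
def critic_td_loss(batch, actor_target, critic, critic_target, gamma=0.99):
    s, a, r, s_next = batch  # tensors sampled from an experience buffer
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    return F.mse_loss(critic(s, a), y)
```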
B. State Representation

Positive feedback from the users is essential to understand their interests and changes in their preferences. However, in practice, users skip many items, which is akin to implicit negative feedback. Negative feedback provides insights into what users are not interested in, which helps capture users' interests from a different perspective. Generally, we have more negative observations than positive ones in recommendation systems. Hence, random negative sampling is typically utilized [10], [20].

We incorporate all the negative and positive feedback into the model's state space. Besides, the model can be updated even if only negative feedback is observed. Users see the recommended items and give feedback (Fig. 1 (user view)). However, in the agent view (Fig. 1), each session is represented by a positive set I_set^+ and a negative set I_set^-. I_set^+ keeps the items with positive feedback, and I_set^- keeps the items with negative feedback.

Fig. 1: Views from the user's and agent's perspectives. Users' interactions at each timestamp are converted into positive and negative sets (section III-C).

C. Collaborative Filtering Integration

We trained a state-of-the-art CF model, DeepFM [21], with our data to capture the similarity between items. We chose DeepFM because it can integrate the items' features and user-item interactions, leading to a better item embedding space. Among items embedded in this space, those with stronger relations are closer to each other. Using the relations between items, we augment the users' interactions history. In each session, we recommend n items. Based on the user's feedback, we generate two sets, I_set^+ and I_set^- (Fig. 1 (agent view)). We introduce the hyperparameter d, which is the maximum number of items (rows) per set. Note that d >= n since each set needs to hold at least the n items from the user's direct feedback. Items are represented by their corresponding embedding vectors; therefore, the dimension of each set is d × m. However, when d >= n, the two sets cannot both be completed by direct feedback alone. We fill the empty rows based on the similarity and dissimilarity between items. There are three possibilities, as depicted in Fig. 2. In the first case (Fig. 2(a)), since the user interacts positively with all n recommended items, the positive items set is complete and the negative items set is empty. In the second case (Fig. 2(b)), the user interacts negatively with all n recommended items, making the negative items set full and the positive items set empty. In the third case (Fig. 2(c)), the user has a mix of positive and negative interactions, which partially fills both the positive and negative items' sets. We believe that users will like items similar to what they already like and dislike items similar to what they already dislike. In the first case, we fill the negative set using the farthest neighbors of the items that appear in the positive set. In the second case, we fill the positive set with the farthest neighbors of those in the negative set. In the third case, both the positive and negative sets are partially filled; we then fill the remaining rows of each set with the nearest neighbors of the items already in that set. When d = n, we may encounter the first and second scenarios; otherwise, when d > n, the third scenario always occurs. We calculate each item's nearest and farthest neighbors offline to reduce the framework's latency.

Fig. 2: Filling the positive and negative feedback slots under different scenarios: (a) all positive feedback, (b) all negative feedback, (c) partially positive and partially negative feedback.
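A minimal sketch of this filling logic is given below (our own illustration, not the released implementation). It assumes the DeepFM item embeddings have already been used to precompute, offline, each item's nearest and farthest neighbors, exposed here through hypothetical lookup tables `nearest` and `farthest`.

```python
# Sketch of building one session's positive/negative sets (section III-C).
# nearest[i] / farthest[i] are assumed precomputed offline from the DeepFM
# embedding space: item ids sorted by similarity / dissimilarity to item i.
def build_session_sets(pos_items, neg_items, d, nearest, farthest):
    pos, neg = list(pos_items), list(neg_items)

    def pad(target, seeds, table):
        # Append neighbors of the seed items until `target` holds d rows.
        for rank in range(d):
            for item in seeds:
                if len(target) >= d:
                    return
                cand = table[item][rank]
                if cand not in target:
                    target.append(cand)

    if pos and not neg:        # case (a): only positive feedback observed
        pad(neg, list(pos), farthest)   # dissimilar items stand in as negatives
    elif neg and not pos:      # case (b): only negative feedback observed
        pad(pos, list(neg), farthest)   # dissimilar items stand in as positives
    # case (c), plus any rows still empty: nearest neighbors of the set's own items
    if pos:
        pad(pos, list(pos), nearest)
    if neg:
        pad(neg, list(neg), nearest)
    return pos[:d], neg[:d]
```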

D. Architecture of Actor

The actor generates n recommendations based on the user's current state s_i = (s_i^+, s_i^-). There are two sets for each user at each timestamp, I_set^+ and I_set^-. In order to generate a representative state that accurately reflects the current user preference, we must reason spatially over both item sets and capture the user's interests over time. As a result, we need a model that can perform spatial and temporal reasoning simultaneously.

CONVGRU: Deep Neural Networks (DNNs) are designed as feed-forward models; after a forward pass, they do not store information (they are memory-less). In a recurrent unit, the memory of previous inputs is stored as a hidden state (h). Earlier versions of Recurrent Neural Networks (RNNs) are challenging to train due to the vanishing gradient problem [22]. Long Short-Term Memory (LSTM) [23] and the Gated Recurrent Unit (GRU) [24] were designed to overcome the limitations of previous models. LSTM and GRU were initially conceived as fully connected layers and could be represented by linear transformations. However, this causes spatial relationships to be lost in 2D inputs, and it is advantageous to preserve this information when dealing with 2D inputs such as images. A convolutional form of the Gated Recurrent Unit (CONVGRU) was proposed in [25] for learning the temporal relations between video frames and the spatial information within each frame. Our positive and negative sets are represented by 2D matrices of size d × m. To capture the within-session correlation between items and the users' interest evolution across sessions, we use CONVGRU (Fig. 3). There are two independent layers, one for each of the sets; they are independent in the sense that their weights are not shared. The inputs to the CONVGRU at timestamp t_i are the previous positive and negative states (s_{i-1}^+, s_{i-1}^-) and the last l interaction sets.

Fig. 3: State generation is accomplished by two independent CONVGRU layers.
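For readers unfamiliar with the convolutional GRU, the following is a minimal single-cell sketch in PyTorch (our simplified illustration of the idea in [25], not the authors' exact layer). The d × m session matrix is treated as a one-channel 2D input; the kernel size and number of hidden channels are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Gates are computed with convolutions instead of dense matrix products,
    so the 2D (d x m) layout of each session's item set is preserved."""
    def __init__(self, in_ch=1, hid_ch=4, k=3):
        super().__init__()
        pad = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)  # z, r
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # h~

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Example: batch of 2 users, one 1-channel d x m interaction set per step (d=5, m=8).
cell = ConvGRUCell()
h = torch.zeros(2, 4, 5, 8)
for _ in range(3):                      # roll over 3 previous interaction sets
    x = torch.randn(2, 1, 5, 8)
    h = cell(x, h)
print(h.shape)  # torch.Size([2, 4, 5, 8])
```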
Recommendation: For recommending items, we assume that the positive state is a compound vector of the user's previous positive interactions and indicates her region of interest in the item embedding space. The negative state is the compound vector of the user's negative interactions and indicates the region she is disinterested in within the embedding space. We aim to find items closer to the user's region of interest and farther from her disinterested region. We calculate the cosine similarity between all items and the positive state (A) and between all items and the negative state (B), where e_all denotes the embedding vectors of all items. We then calculate A − B; essentially, A and B give us interest and disinterest scores for each item, while their difference shows the user's level of interest. Finally, we select the top n items for recommendation. We designed the model architecture experimentally to maintain accuracy while keeping the time complexity reasonable. Fig. 4 shows the actor's architecture.
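A hedged sketch of this scoring step is shown below (our own NumPy illustration, not the paper's code): the interest score of every item is its cosine similarity to the positive state minus its cosine similarity to the negative state, and the n highest-scoring items are returned.

```python
import numpy as np

def recommend_top_n(e_all, s_pos, s_neg, n):
    """Score every item by cos(item, s+) - cos(item, s-) and return the top-n ids.
    e_all: (num_items, m) item embeddings; s_pos, s_neg: (m,) state vectors."""
    def cos(mat, vec):
        return (mat @ vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-8)

    A = cos(e_all, s_pos)          # interest scores
    B = cos(e_all, s_neg)          # disinterest scores
    scores = A - B
    top = np.argsort(-scores)[:n]  # indices of the n highest-scoring items
    return top, scores[top]

# Toy example with 6 items in an m = 4 dimensional embedding space.
rng = np.random.default_rng(0)
e_all = rng.normal(size=(6, 4))
ids, vals = recommend_top_n(e_all, s_pos=e_all[0], s_neg=e_all[5], n=2)
print(ids, vals)
```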
E. Architecture of Critic

The critic is designed to measure the merit of the generated action with an action-value function Q(s, a). In practice, the action space and state space are enormous, making it infeasible for the critic to estimate Q(s, a) for each state-action pair. Besides, many state-action pairs are never seen by the agent in the environment, making it hard to update their values. Hence, we use a neural network to approximate this function, also called a deep Q-network. According to Q(s, a), the actor's parameters are updated to improve its performance for actions in the following iterations. The critic's inputs are the current positive state s_i^+ ∈ R^{1×m}, the current negative state s_i^- ∈ R^{1×m}, and the generated action vector a_i ∈ R^{1×m}. s_i^+ and s_i^- are each concatenated with the action vector to make two vectors of dimension 2 × m. We use a linear layer for each of them and a linear layer on top for estimating the Q-value. The activation function is ReLU for all layers. The architecture of the critic is shown in Fig. 4.
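A minimal PyTorch sketch consistent with this description follows (an illustration under assumed hidden sizes, not the released model). Each state-action concatenation goes through its own linear layer with ReLU; how the two branch outputs are combined before the final layer, and the hidden width, are our assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, m=8, hidden=32):
        super().__init__()
        self.pos_branch = nn.Linear(2 * m, hidden)  # [s+, a] -> features
        self.neg_branch = nn.Linear(2 * m, hidden)  # [s-, a] -> features
        self.head = nn.Linear(2 * hidden, 1)        # combined features -> Q(s, a)
        self.act = nn.ReLU()

    def forward(self, s_pos, s_neg, a):
        p = self.act(self.pos_branch(torch.cat([s_pos, a], dim=-1)))
        q = self.act(self.neg_branch(torch.cat([s_neg, a], dim=-1)))
        return self.head(torch.cat([p, q], dim=-1))

# Example with m = 8 (Table IV) and a batch of 4 users.
critic = Critic(m=8)
s_pos, s_neg, a = (torch.randn(4, 8) for _ in range(3))
print(critic(s_pos, s_neg, a).shape)  # torch.Size([4, 1])
```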

IV. EXPERIMENTS

We conducted extensive experiments on our dataset by training our framework using TD3 [26]. Additionally, two public datasets were used to demonstrate our proposed method: Amazon 2014 [27], [28] and MovieLens 1M [29]. In the Amazon dataset, we selected the CDs and Vinyl category since it has sparse interactions between users and items. Table II shows the statistics for each dataset. These datasets include interactions in the form of ratings on different scales. Our private dataset has views, clicks, and purchases on a scale of [0, 2], while the other two datasets have a scale of [1, 5]. In our dataset, any rating > 1 is considered a positive interaction, whereas in the other two datasets, any rating >= 4 is considered a positive interaction. We considered ratings below these thresholds to be negative feedback. In this experiment, we assigned rewards of 5 and 0 to positive and negative interactions, respectively. To prepare the datasets, we grouped interactions by user; then, each user's interactions were sorted by timestamp. Training used the earliest 80 percent of each user's interactions, and testing used the remaining 20 percent. The first l interactions of each user's training data are used to generate the initial states, s_init^+ and s_init^-. These interactions are converted to positive and negative sets as described in section III-C.

TABLE II: Statistics of datasets.
Dataset | # of users | # of items | # of interactions
Adobe | 1,047,428 | 65,451 | 23,515,109
MovieLens 1M | 6,040 | 3,669 | 1,000,209
CDs and Vinyl | 1,578,597 | 232,376 | 3,749,004
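The preprocessing just described can be summarized by the short sketch below (our own pandas illustration, not the paper's pipeline); the column names and the way the positive mask is passed in are assumptions.

```python
import pandas as pd

def prepare(df, positive_mask):
    """Assign rewards and split each user's history 80/20 in time.
    df columns assumed: user_id, item_id, rating, timestamp.
    positive_mask: boolean Series marking positive interactions."""
    df = df.assign(positive=positive_mask,
                   reward=positive_mask.astype(int) * 5)   # 5 for positive, 0 otherwise
    df = df.sort_values(["user_id", "timestamp"])
    rank = df.groupby("user_id").cumcount()                 # chronological rank per user
    size = df.groupby("user_id")["item_id"].transform("size")
    return df[rank < 0.8 * size], df[rank >= 0.8 * size]    # train, test

# Adobe-style [0, 2] scale: positive means rating > 1.
# train, test = prepare(logs, positive_mask=logs["rating"] > 1)
# MovieLens / CDs and Vinyl [1, 5] scale: positive means rating >= 4.
# train, test = prepare(ratings, positive_mask=ratings["rating"] >= 4)
```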

Fig. 4: Our actor-critic framework. The actor architecture is on the left, and the critic architecture is on the right.

We make three different variations of our model to judge the effectiveness of each component. In the basic model, Ours (-N)(-F), we do not integrate negative feedback and do not use the items' relations; there is only one CONVGRU layer, fed only by the positive interactions. In the second variation, Ours (+N)(-F), we add the negative feedback and use one CONVGRU layer for each of the positive and negative interactions. In these two models, we have only one item at each timestamp. In the third variation, we integrate both negative feedback and the items' relations, denoted Ours (+N)(+F)(d), where d is the maximum number of items allowed to be integrated, i.e., the number of rows in the positive and negative sets in each session. Table III lists the variations of our model. For training our framework, we used Twin Delayed Deep Deterministic Policy Gradient (TD3) [30], and Table IV shows the parameters of the model, which are tuned experimentally.

TABLE III: Variations of our model.
Models | Negative feedback | Items' relations
Ours (-N)(-F) | No | No
Ours (+N)(-F) | Yes | No
Ours (+N)(+F)(d) | Yes | Yes

TABLE IV: Training parameters.
Parameter | Value | Explanation
σ1 | 0.1 | Std of Gaussian exploration noise
γ | 0.99 | Discount factor
τ | 0.005 | Target network update rate
σ2 | 0.2 | Noise added to the target policy during the critic's update
c | 0.5 | Range to clip the target policy noise
Policy frequency | 2 | Frequency of the delayed policy updates
Batch size | 256 | The batch size
m | 8 | The dimension of items' representation vectors
l | 5 | The number of last interaction sets
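To connect these hyperparameters to the algorithm, here is a hedged sketch of one TD3 critic/actor update in PyTorch (a generic TD3 step in the spirit of [26], [30], not the authors' training code). The `actor`, twin critics, their target copies, and the optimizers are assumed to already exist; σ1 would be used separately as exploration noise when collecting experience.

```python
import torch
import torch.nn.functional as F

def td3_update(step, batch, nets, opts, gamma=0.99, tau=0.005,
               sigma2=0.2, clip_c=0.5, policy_freq=2):
    actor, critic1, critic2, t_actor, t_critic1, t_critic2 = nets
    actor_opt, critic_opt = opts
    s, a, r, s_next = batch

    # Target policy smoothing: clipped Gaussian noise on the target action.
    with torch.no_grad():
        noise = (torch.randn_like(a) * sigma2).clamp(-clip_c, clip_c)
        a_next = t_actor(s_next) + noise
        # Clipped double-Q target: the smaller of the two target critics.
        target_q = r + gamma * torch.min(t_critic1(s_next, a_next),
                                         t_critic2(s_next, a_next))
    critic_loss = (F.mse_loss(critic1(s, a), target_q)
                   + F.mse_loss(critic2(s, a), target_q))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy and target updates (every `policy_freq` critic updates).
    if step % policy_freq == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, target in [(actor, t_actor), (critic1, t_critic1), (critic2, t_critic2)]:
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)   # Polyak averaging
```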
A. Baselines

We compared our method's performance with the following baseline methods, which are widely used in modern recommender systems: FM [31], Wide&Deep [32], DeepFM [21], xDeepFM [33], MLR [34], LinUCB [35], and HLinUCB [36]. Due to differences in data format or the absence of metadata in our dataset (the Adobe dataset), we could not compare with the methods mentioned in the related work section (section II), especially the pair-wise [10] and page-wise [12] methods.

B. Offline Evaluation

To accommodate the time complexity [37] of LinUCB and HLinUCB, we conducted two sets of experiments. We ran the first experiment on the entire data but excluded LinUCB and HLinUCB due to their increased complexity (Table II). In the second experiment, we used a subset of the data with a limited number of items: we selected the top 200 items and pruned the data accordingly (Table VIII). For measuring performance, we used Precision@k, Recall@k, F1-score@k, Normalized Discounted Cumulative Gain (NDCG@k), and Mean Average Precision (MAP@k) with k = 10. In our dataset, we have only one item per session (n = 1). We predict only one item for each timestamp, up to k sessions, when using LinUCB, HLinUCB, and our method.

TABLE VIII: Statistics of sampled datasets.
Dataset | # users | # items | # interactions
Adobe | 4,057 | 200 | 959,942
MovieLens 1M | 4,296 | 200 | 414,893
CDs and Vinyl | 2,655 | 200 | 146,474
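For reference, a minimal sketch of two of these top-k metrics is given below (standard textbook definitions in our own code, not the paper's evaluation script); `recommended` is an ordered list of item ids and `relevant` is the set of ground-truth positives, with binary relevance assumed for NDCG.

```python
import math

def precision_at_k(recommended, relevant, k=10):
    top = recommended[:k]
    return sum(i in relevant for i in top) / k

def ndcg_at_k(recommended, relevant, k=10):
    dcg = sum(1.0 / math.log2(rank + 2)                     # rank is 0-based
              for rank, i in enumerate(recommended[:k]) if i in relevant)
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

print(precision_at_k([3, 7, 1, 9], {1, 3}, k=4))        # 0.5
print(round(ndcg_at_k([3, 7, 1, 9], {1, 3}, k=4), 3))
```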

TABLE V: Performance of our model in comparison to baselines excluding LinUCB and HLinUCB.
Dataset Adobe Movielens CDs and Vinyl (Amazon)
Metrics(@10)(%) Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG
FM 0.26197 0.16061 0.17409 0.09564 0.26771 0.82723 0.39112 0.48486 0.38621 0.83281 0.92496 0.56759 0.66059 0.57248 0.93324
DeepFM 0.26098 0.15969 0.17347 0.09586 0.26931 0.82716 0.39122 0.4851 0.38708 0.833 0.92463 0.56642 0.65974 0.57226 0.93202
xDeepFM 0.25434 0.15629 0.16899 0.09141 0.26818 0.827164 0.39174 0.48535 0.38723 0.83287 0.92487 0.56649 0.65994 0.57138 0.93123
WDL 0.25525 0.15676 0.16967 0.09266 0.269 0.82669 0.39144 0.48508 0.3884 0.83463 0.92456 0.5669 0.65999 0.57167 0.93251
MLR 0.25088 0.15042 0.16543 0.09057 0.26944 0.83042 0.39203 0.48629 0.3884 0.83463 0.92474 0.56679 0.66005 0.57273 0.93086
Ours (-N)(-F) 0.41113 0.2863 0.29556 0.0828 0.44096 0.72029 0.35873 0.44046 0.33274 0.7023 0.91802 0.6078 0.68651 0.60275 0.92741
Ours (+N)(-F) 0.40992 0.29898 0.30086 0.08403 0.44135 0.76032 0.3728 0.45907 0.36466 0.74939 0.91497 0.60536 0.68389 0.59977 0.92268
Ours (+N)(+F)(3) 0.63733 0.44572 0.46152 0.1584 0.71264 0.84335 0.39918 0.49491 0.39166 0.84771 0.92133 0.60925 0.68827 0.60694 0.93022
Ours (+N)(+F)(5) 0.6405 0.45514 0.46572 0.15968 0.71648 0.85301 0.40552 0.50174 0.4683 0.85678 0.92542 0.61196 0.69121 0.61114 0.93796

TABLE VI: Performance of our model in comparison to LinUCB and HLinUCB.


Dataset Adobe Movielens CDs and Vinyl (Amazon)
Metrics(@10)(%) Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG
LinUCB 0.0165 0.0757 0.0241 0.0004 0.0419 0.0546 0.0162 0.0238 0.0006 0.0546 0.0785 0.0207 0.0309 0.0006 0.0546
HLinUCB 0.0055 0.0103 0.006 0.0001 0.0085 0.0674 0.0274 0.0373 0.0011 0.0674 0.0269 0.0089 0.0135 0.0005 0.0396
Ours (+N)(+F)(3) 0.099 0.0763 0.0735 0.0531 0.1173 0.8924 0.2429 0.359 0.8211 0.8823 0.9217 0.1902 0.2799 0.8639 0.9139
Ours (+N)(+F)(5) 0.1003 0.0775 0.0744 0.0548 0.12 0.9007 0.2446 0.3617 0.8374 0.894 0.926 0.1902 0.2801 0.8819 0.9141

TABLE VII: Performance comparison with d = {3, 5, 7, 10}.


Dataset Adobe Movielens CDs and Vinyl (Amazon)
Metrics(@10)(%) Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG
Ours (+N)(+F)(3) 0.63733 0.44572 0.46152 0.1584 0.71264 0.84335 0.39918 0.49491 0.78466 0.84771 0.92133 0.60925 0.68827 0.60694 0.93022
Ours (+N)(+F)(5) 0.6405 0.45514 0.46572 0.15968 0.71648 0.85301 0.40552 0.50174 0.4683 0.85678 0.92542 0.61196 0.69121 0.61114 0.93796
Ours (+N)(+F)(7) 0.64055 0.45544 0.46579 0.15995 0.71687 0.85445 0.40614 0.50253 0.468454 0.85845 0.92629 0.61231 0.69165 0.6119 0.93962
Ours (+N)(+F)(10) 0.64018 0.4537 0.46504 0.15953 0.71587 0.838 0.39772 0.49257 0.40242 0.84311 0.92757 0.61298 0.69243 0.61273 0.94119

Table V depicts the results over the whole data. Intra-comparison between the variations of our model shows that our first model, Ours (-N)(-F), has the lowest performance. However, in our second variant, Ours (+N)(-F), adding negative feedback improves the results, which confirms the importance of integrating negative feedback. Our full model, Ours (+N)(+F)(d), obtained by adding negative feedback and integrating the items' relations, is trained with d = 3 and d = 5. The results show that performance is improved by integrating the items' relations with the users' interactions history. The results also show that our first two derivative models do not perform better than the baselines. However, our full model always performs better than the baselines, demonstrating the effectiveness of using negative feedback and the items' relations. Table VI depicts the results over a subset of the data with a limited number of items. As the results show, our method always performs better than LinUCB and HLinUCB.

C. Replicating Online Evaluation With a Simulator

To predict users' ratings, we trained a simulator for each dataset. In order to improve their accuracy at predicting unknown ratings, the simulators were trained and fine-tuned over the entire dataset. We considered two session lengths, 10 and 50, and calculated the accumulated reward of our model's variations, normalized to [1, 2]. Fig. 5 and Fig. 6 show the average accumulated reward over all users for 10 and 50 consecutive sessions, respectively. In both cases, the variations considering negative feedback and the items' relations (namely, Ours (+N)(+F)(3) and Ours (+N)(+F)(5)) achieve higher performance compared to the models without, or with only part of, such auxiliary information. These results confirm the effectiveness of using all negative feedback and the items' relations in our proposed framework.

Fig. 5: Comparison of online performance among our model's variants with 10 sessions.

Fig. 6: Comparison of online performance among our model's variants with 50 sessions.

D. Effect of the Number of Items in Historical Sets (d)

We examined the effect of d on our model's performance. Table VII shows the results with varying d = {3, 5, 7, 10}. Based on the results, this parameter is dataset-dependent. We achieve the best results with d = 7 on the Adobe dataset, and we see a slight decrease in performance with d = 10. The same behavior is observed for the MovieLens dataset. However, on CDs and Vinyl, increasing d improves performance.

V. CONCLUSION

This paper presents a novel recommendation framework that leverages deep RL. In particular, we model the recommendation problem as a sequential decision-making problem and solve it using the actor-critic paradigm. Our model captures dynamic changes in users' interests and considers items' long-term contribution to the cumulative reward. In contrast to previous methods, our method effectively integrates the direct relations between items, encoded in CF's embedding vectors, by incorporating similar and dissimilar items into users' interactions. Additionally, we use all positive and negative feedback to capture the users' dynamic interests. We have demonstrated that our method can improve recommendation performance on our dataset and two public datasets.

REFERENCES

[1] R. J. Mooney and L. Roy, "Content-based book recommending using learning for text categorization," in Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 195–204.
[2] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, 2009.
[3] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[4] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, "Field-aware factorization machines for CTR prediction," in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 43–50.
[5] S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep learning based recommender system: A survey and new perspectives," ACM Computing Surveys, vol. 52, no. 1, Feb. 2019.
[6] X. Wang, Y. Wang, D. Hsu, and Y. Wang, "Exploration in interactive personalized music recommendation: A reinforcement learning approach," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 11, no. 1, pp. 1–22, 2014.
[7] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. Yuan, X. Xie, and Z. Li, "DRN: A deep reinforcement learning framework for news recommendation," 2018, pp. 167–176.
[8] G. Shani, D. Heckerman, R. I. Brafman, and C. Boutilier, "An MDP-based recommender system," Journal of Machine Learning Research, vol. 6, no. 9, 2005.
[9] N. Taghipour and A. Kardan, "A hybrid web recommender system based on Q-learning," in Proceedings of the 2008 ACM Symposium on Applied Computing, 2008, pp. 1164–1168.
[10] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, "Recommendations with negative feedback via pairwise deep reinforcement learning," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1040–1048.
[11] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu, "Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 368–377.
[12] X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for page-wise recommendations," in RecSys, 2018.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285–295.
[14] J. Wang, A. P. De Vries, and M. J. Reinders, "Unifying user-based and item-based collaborative filtering approaches by similarity fusion," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 501–508.
[15] F. Liu, R. Tang, X. Li, Y. Ye, H. Chen, H. Guo, and Y. Zhang, "Deep reinforcement learning based recommendation with explicit user-item interactions modeling," CoRR, vol. abs/1810.12027, 2018. [Online]. Available: http://arxiv.org/abs/1810.12027
[16] P. Wang, Y. Fan, L. Xia, W. X. Zhao, S. Niu, and J. Huang, "KERL: A knowledge-guided reinforcement learning model for sequential recommendation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 209–218.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[18] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[19] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[20] G. E. Dupret and B. Piwowarski, "A user browsing model to predict search engine click data from past observations," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 331–338.
[21] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: A factorization-machine based neural network for CTR prediction," arXiv preprint arXiv:1703.04247, 2017.
[22] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[23] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," 1999.
[24] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[25] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," arXiv preprint arXiv:1511.06432, 2015.
[26] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[27] R. He and J. McAuley, "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering," in Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 507–517.
[28] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel, "Image-based recommendations on styles and substitutes," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015, pp. 43–52.
[29] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, 2015.
[30] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018, pp. 1582–1591.
[31] S. Rendle, "Factorization machines," in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995–1000.
[32] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir et al., "Wide & deep learning for recommender systems," in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 7–10.
[33] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, "xDeepFM: Combining explicit and implicit feature interactions for recommender systems," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1754–1763.
[34] K. Gai, X. Zhu, H. Li, K. Liu, and Z. Wang, "Learning piece-wise linear models from large scale data for ad click prediction," arXiv preprint arXiv:1704.05194, 2017.
[35] W. Chu, L. Li, L. Reyzin, and R. Schapire, "Contextual bandits with linear payoff functions," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 208–214.
[36] H. Wang, Q. Wu, and H. Wang, "Learning hidden features for contextual bandits," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 1633–1642.
[37] J. Bento, S. Ioannidis, S. Muthukrishnan, and J. Yan, "A time and space efficient algorithm for contextual linear bandits," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 257–272.

