Abstract—Earlier recommendation techniques, such as Collaborative Filtering (CF), assume that users' preferences do not change over time and strive to maximize the immediate reward. In recent studies, Reinforcement Learning (RL) has been used to build interactive recommendation systems that capture users' preferences over time and maximize the long-term reward. However, these methods have two limitations. First, they assume that items are independently distributed and do not consider the relations between items. This assumption ignores the power of item-item relations for recommendation systems, as demonstrated by CF. Second, RL-based methods rely primarily on users' positive feedback to understand their preferences, and sampling is used to incorporate their negative feedback. In a practical setting, users' negative feedback is just as crucial as their positive feedback for understanding their preferences. We present a novel Deep Reinforcement Learning (DRL) recommendation framework to address the limitations above. Specifically, we utilize the actor-critic paradigm, which treats the recommendation problem as a sequential decision-making process, to adapt to users' behaviors and maximize the long-term reward. Motivated by the intuition that similar users like similar items, we extract the relations between items using CF and integrate them into our framework to boost overall performance. Instead of negative sampling, our proposed framework relies on all of the users' positive and negative feedback to understand their preferences more accurately. Extensive experiments with our dataset and two public datasets demonstrate the effectiveness of our proposed framework.

Index Terms—Deep Reinforcement Learning, Actor-Critic Framework, Collaborative Filtering, Negative Feedback, Interactive Recommendation Systems

I. INTRODUCTION

Recommender systems have garnered much attention in the last decade due to their importance in digital marketing. Typically, methods such as content-based filtering [1], collaborative filtering [2], matrix factorization [3], factorization machines [4], and deep learning models [5] have been proven to work well, especially when rich historical user-item interactions are available. However, these methods assume that users' preferences do not change over time; hence, they cannot adapt to evolving trends and changing user tastes. Moreover, these methods are often greedy and maximize the immediate reward while ignoring the long-term (cumulative) reward [6], [7]. Typically, recommender systems aim to optimize Key Performance Indicators (KPIs) such as clicks and conversions on a web page. Thus, looking only at the immediate reward may not benefit KPI maximization in the long run.

To address the limitations above, the recommendation problem has been cast as a sequential decision-making problem. Recent advances in RL have been leveraged to model interactive and dynamic recommender systems with long-term planning. Methods including, but not limited to, Partially Observable Markov Decision Processes (POMDP) [8], Q-learning [9], value-based methods [7], [10], and policy-based methods [11], [12] have been proposed. As users interact with these methods, the strategies can be continually updated to make recommendations based on users' dynamic preferences. However, these methods assume that items are independent and identically distributed, and the recommendations are based only on users' preferences. Based on the intuition behind item-based CF [13], [14], similar users will probably like similar items; yet existing RL-based methods do not consider the relations between items in many recommendation scenarios. Besides, when presented with recommendations, users typically choose only a few items, representing positive interactions characterized by views, clicks, and purchases. The vast majority of items are skipped or ignored, which captures the users' implicit negative feedback and gives valuable insight into their preferences. Current RL-based methods either do not use the negative feedback or use it by sampling [10]. Moreover, users' preferences are only updated if the users have a positive interaction.

In this paper, we propose an actor-critic framework for capturing the dynamic changes in users' interests and the long-term impact of items on the cumulative reward. We boost overall recommendation performance by leveraging relations between items generated using CF. To make the users' states more representative, we integrate the item-item relations into the users' historical interactions instead of using them directly to make recommendations. Furthermore, we update the users' preferences based on both positive and negative feedback. We train the proposed model using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm due to its superiority in training a stable model. The main contributions of this paper can be summarized as follows:
• We propose a recommendation framework with a novel actor-critic architecture.
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on December 04,2023 at 18:54:42 UTC from IEEE Xplore. Restrictions apply.
In an actor-critic framework, the actor (policy network) learns from the critic (value function). The user's interaction history is the input to the actor, which generates the current state s_i = (s+_i, s−_i) and provides an action a_i. Instead of considering all possible state-action pairs, the critic uses the pair (s_i, a_i) as input to compute the Q-value function Q(s, a) = E_{s'}[r + γQ(s', a') | s, a]. The critic evaluates the generated action, and the actor updates its parameters to optimize the actions in subsequent iterations [17], [18].

B. State Representation

Positive feedback from the users is essential to understand their interests and the changes in their preferences. However, in practice, users skip many items, which acts as implicit negative feedback. Negative feedback provides insight into what users are not interested in, which helps capture users' interests from a different perspective. Generally, we have more negative observations than positive ones in recommendation systems; hence, random negative sampling is typically utilized [10], [20].

We incorporate all the negative and positive feedback into the model's state space. Besides, the model can be updated even if only negative feedback is observed. Users see the recommended items and give feedback (Fig. 1, user view). In the agent view (Fig. 1), each session is presented as a positive set I+_set and a negative set I−_set: I+_set keeps the items with positive feedback, and I−_set keeps the items with negative feedback.

Fig. 1: (a) User view: recommended items across timestamps t_0, ..., t_n. (b) Agent view: the corresponding positive (I+_set) and negative (I−_set) sets.

Each set has d rows, the maximum number of items (rows) per set. Note that d >= n, since each set needs to hold at least n items from the users' direct feedback. Items are represented by their corresponding embedding vectors; therefore, the dimension of each set is d × m. However, even when d >= n, the two sets cannot both be completed simultaneously from direct feedback. We fill the empty rows based on the similarity and dissimilarity between items. There are three possibilities, as depicted in Fig. 2. In the first case (Fig. 2(a)), the user interacts positively with all n recommended items, so the positive items set is complete and the negative items set is empty. In the second case (Fig. 2(b)), the user interacts negatively with all n recommended items, making the negative items set full and the positive items set empty. In the third case (Fig. 2(c)), the user has a mix of positive and negative interactions, which partially fills both sets. We believe that users will like items similar to those they already like and dislike items similar to those they already dislike. Accordingly, in the first case we fill the negative set with the farthest neighbors of the items in the positive set, and in the second case we fill the positive set with the farthest neighbors of the items in the negative set. In the third case, both sets are partially filled, and we fill the remaining rows of each set with the nearest neighbors of the items already in that set. When d = n, we may encounter the first and second scenarios; otherwise, when d > n, the third scenario always occurs. We calculate each item's nearest and farthest neighbors offline to reduce the framework's latency.
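The set-completion rule above (nearest neighbors within a partially filled set, farthest neighbors across the two sets) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fill_sets` and its arguments are hypothetical names, and ranking candidates by mean cosine similarity to a set is one plausible reading of the neighbor selection.

```python
import numpy as np

def fill_sets(pos_items, neg_items, embeddings, d):
    """Illustrative sketch of the set-completion rule described above.

    pos_items / neg_items: item indices the user liked / disliked in the
    current session; embeddings: (num_items, m) CF item-embedding matrix;
    d: number of rows per set. All names here are hypothetical.
    """
    # Pairwise cosine similarity between all items (computed offline in
    # the paper to reduce the framework's latency).
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T

    def expand(seed, other, count):
        # Rank candidates by similarity to the seed set (nearest
        # neighbors); if the seed set is empty, rank by dissimilarity to
        # the other set instead (farthest neighbors).
        if seed:
            scores = sim[seed].mean(axis=0)        # nearest to own set
        else:
            scores = -sim[other].mean(axis=0)      # farthest from other set
        scores[seed + other] = -np.inf             # never re-pick direct feedback
        extra = np.argsort(scores)[::-1][:count]
        return list(seed) + list(extra)

    pos = expand(list(pos_items), list(neg_items), d - len(pos_items))
    neg = expand(list(neg_items), list(pos_items), d - len(neg_items))
    return embeddings[pos], embeddings[neg]        # two d x m state inputs
```

With d greater than the number of direct interactions, each returned matrix always has d rows, matching the fixed d × m state shape used by the rest of the framework.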
To generate a representative state that accurately reflects the current user's preference, we must reason spatially over both items' sets and capture the user's interests over time. As a result, we need a model that can perform spatial and temporal reasoning simultaneously.

D. CONVGRU

Typically, Deep Neural Networks (DNNs) are designed linearly and, after a forward pass, do not store information (they are memory-less). In a recurrent unit, the memory of previous inputs is stored as a hidden state (h). Earlier versions of Recurrent Neural Networks (RNNs) are challenging to train due to the vanishing gradient problem [22]. Long Short-Term Memory (LSTM) [23] and the Gated Recurrent Unit (GRU) [24] were designed to overcome the limitations of previous models. LSTM and GRU were initially conceived as fully connected layers and can be represented by linear transformations; however, this causes spatial relationships to be lost in 2D inputs. It is advantageous to preserve this information when dealing with 2D inputs such as images. A convolutional form of the Gated Recurrent Unit (CONVGRU) was proposed in [25] for learning the temporal relations between video frames and the spatial information within each frame. Our positive and negative sets are represented by 2D matrices of dimension d × m. To capture the within-session correlation between items and the evolution of users' interests across sessions, we use CONVGRU (Fig. 3). There are two independent layers, one for each of the sets; they are independent in the sense that their weights are not shared. The inputs to CONVGRU at timestamp t_i are the current positive and negative states (s+_{i−1}, s−_{i−1}) and the last l interaction sets.

Fig. 3: State generation is accomplished by two independent CONVGRU layers.

Recommendation: For recommending items, we assume that the positive state is a compound vector of the user's previous positive interactions and shows us her region of interest in the items' embedding space. The negative state is the compound vector of the user's negative interactions and shows us the region she is disinterested in. We aim to find items closer to the user's regions of interest and farther from her disinterested regions. We calculate the cosine similarity between all items and the positive and negative states, where e_all denotes the embedding vectors of all items. We then calculate A − B; essentially, A and B give us interest and disinterest scores for each item, while the difference shows the user's level of interest. Finally, we select the top n items for recommendation. We designed the model architecture experimentally to maintain accuracy while keeping the time complexity reasonable. Fig. 4 shows the actor's architecture.

E. Architecture of Critic

The critic is designed to measure the merit of the generated action with an action-value function Q(s, a). In practice, the action and state spaces are enormous, making it infeasible for the critic to estimate Q(s, a) for each state-action pair. Besides, many state-action pairs are never seen by the agent in the environment, making it hard to update their values. Hence, we use a neural network, also called a deep Q-network, to approximate this function. According to Q(s, a), the actor's parameters are updated to improve its performance for actions in the following iterations. The critic's inputs are the current positive state s+_i ∈ R^{1×m}, the current negative state s−_i ∈ R^{1×m}, and the generated action vector a_i ∈ R^{1×m}. s+_i and s−_i are each concatenated with the action vector to make two vectors of dimension 2 × m. We use a linear layer for each of them and a linear layer on top for estimating the Q value. The activation function is ReLU for all layers. The architecture of the critic is shown in Fig. 4.

Fig. 4: Our actor-critic framework. The actor architecture is on the left, and the critic architecture is on the right.

IV. EXPERIMENTS

We conducted extensive experiments on our dataset by training our framework using TD3 [26]. Additionally, two public datasets were used to demonstrate our proposed method: Amazon 2014 [27], [28] and MovieLens 1M [29]. In the Amazon dataset, we selected the CDs and Vinyl category since it has sparse interactions between users and items. Table II shows the statistics for each dataset. These datasets include interactions in the form of ratings on different scales. Our private dataset has views, clicks, and purchases on a scale of [0, 2], while the other two datasets have a scale of [1, 5]. In our dataset, any rating > 1 is considered a positive interaction, whereas in the other two datasets any rating >= 4 is considered a positive interaction. We considered ratings below these thresholds to be negative feedback. In this experiment, we assigned rewards of 5 and 0 to positive and negative interactions, respectively. To prepare the datasets, we grouped interactions by user; then, each user's interactions were sorted by timestamp. Training used the earliest 80 percent of each user's interactions, and testing used the remaining 20 percent. The first l interactions of each user's training data are used to generate the initial states, s+_init and s−_init. These interactions are converted to positive and negative sets as described in section III-C.

We make three different variations of our model to judge the effectiveness of each component. In the basic model, Ours (-N)(-F), we do not integrate negative feedback and do not use the items' relations; we have only one CONVGRU layer, fed only by the positive interactions. In the second variation, Ours (+N)(-F), we add the negative feedback and have one CONVGRU layer for each of the positive and negative interactions. In these two models, we have only one item at each timestamp. In the third variation, we integrate both negative feedback and the items' relations, denoted Ours (+N)(+F)(d), where d is the maximum number of items allowed to integrate, i.e., the number of rows in the positive and negative sets in each session. Table III depicts the variations of our model. For training our framework, we used Twin Delayed Deep Deterministic Policy Gradients (TD3) [30], and Table IV shows the model parameters, which are tuned experimentally.

TABLE III: Variations of our model.
Models            Negative feedback   Items' relations
Ours (-N)(-F)     No                  No
Ours (+N)(-F)     Yes                 No
Ours (+N)(+F)(d)  Yes                 Yes

TABLE IV: Training parameters.
Parameters        Value   Explanation
σ1                0.1     Std of Gaussian exploration noise
γ                 0.99    Discount factor
τ                 0.005   Target network update rate
σ2                0.2     Noise added to the target policy during the critic's update
c                 0.5     Range to clip the target policy noise
Policy frequency  2       Frequency of the delayed policy updates
Batch size        256     The batch size
m                 8       The dimension of items' representation vectors
l                 5       The number of last interactions

A. Baselines

We compared our method's performance with the following baseline methods, widely used in modern recommender systems: FM [31], Wide&Deep [32], DeepFM [21], xDeepFM [33], MLR [34], LinUCB [35], and HLinUCB [36]. Due to differences in data format and the absence of metadata in our dataset (the Adobe dataset), we could not compare with the methods mentioned in the related work section (section II), especially the pair-wise [10] and page-wise [12] methods.

B. Offline Evaluation

To accommodate the time complexity [37] of LinUCB and HLinUCB, we conducted two sets of experiments. We ran the first experiment on the entire data but excluded LinUCB and HLinUCB due to their increased complexity (Table II). In the second experiment, we used a subset of the data with a limited number of items: we selected the top 200 items and pruned the data accordingly (Table VIII). To measure performance, we used Precision@k, Recall@k, F1-score@k, Normalized Discounted Cumulative Gain (NDCG@k), and Mean Average Precision (MAP@k) with k = 10. In our dataset, we have only one item per session (n = 1); we predict one item for each timestamp up to k sessions when using LinUCB, HLinUCB, and our method.

TABLE VIII: Statistics of sampled datasets.
Dataset        # users   # items   # interactions
Adobe          4,057     200       959,942
MovieLens 1M   4,296     200       414,893
CDs and Vinyl  2,655     200       146,474

Table V depicts the results over the whole data. An intra-comparison between the variations of our model shows that our first model, Ours (-N)(-F), has the lowest performance. In our second variant, Ours (+N)(-F), adding negative feedback improves the results, which confirms the importance of integrating negative feedback. Our full model, Ours (+N)(+F)(d), obtained by adding negative feedback and integrating the items' relations, is trained with d = 3 and d = 5. The results show that performance is improved by integrating the items' relations with the users' interaction history. The results also show that our first two derivative models do not perform better than the baselines.
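The rank metrics reported in the tables can be computed as in the following minimal sketch. Binary relevance is assumed, and the helper name is hypothetical (the paper does not give its evaluation code).

```python
from math import log2

def precision_recall_ndcg_at_k(recommended, relevant, k=10):
    """Hypothetical helper mirroring the offline metrics above.

    recommended: ranked list of item ids; relevant: set of ground-truth
    positive items for the user.
    """
    topk = recommended[:k]
    hits = [1 if item in relevant else 0 for item in topk]
    precision = sum(hits) / k
    recall = sum(hits) / max(len(relevant), 1)
    # DCG with binary gains, normalized by the ideal DCG.
    dcg = sum(h / log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return precision, recall, ndcg
```

Per-user values of these quantities are averaged over the test users to produce table entries such as Precision@10 and NDCG@10.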
TABLE V: Performance of our model in comparison to baselines excluding LinUCB and HLinUCB.
Dataset Adobe MovieLens 1M CDs and Vinyl (Amazon)
Metrics(@10)(%) Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG Precision Recall F1-score MAP NDCG
FM 0.26197 0.16061 0.17409 0.09564 0.26771 0.82723 0.39112 0.48486 0.38621 0.83281 0.92496 0.56759 0.66059 0.57248 0.93324
DeepFM 0.26098 0.15969 0.17347 0.09586 0.26931 0.82716 0.39122 0.4851 0.38708 0.833 0.92463 0.56642 0.65974 0.57226 0.93202
xDeepFM 0.25434 0.15629 0.16899 0.09141 0.26818 0.827164 0.39174 0.48535 0.38723 0.83287 0.92487 0.56649 0.65994 0.57138 0.93123
WDL 0.25525 0.15676 0.16967 0.09266 0.269 0.82669 0.39144 0.48508 0.3884 0.83463 0.92456 0.5669 0.65999 0.57167 0.93251
MLR 0.25088 0.15042 0.16543 0.09057 0.26944 0.83042 0.39203 0.48629 0.3884 0.83463 0.92474 0.56679 0.66005 0.57273 0.93086
Ours (-N)(-F) 0.41113 0.2863 0.29556 0.0828 0.44096 0.72029 0.35873 0.44046 0.33274 0.7023 0.91802 0.6078 0.68651 0.60275 0.92741
Ours (+N)(-F) 0.40992 0.29898 0.30086 0.08403 0.44135 0.76032 0.3728 0.45907 0.36466 0.74939 0.91497 0.60536 0.68389 0.59977 0.92268
Ours (+N)(+F)(3) 0.63733 0.44572 0.46152 0.1584 0.71264 0.84335 0.39918 0.49491 0.39166 0.84771 0.92133 0.60925 0.68827 0.60694 0.93022
Ours (+N)(+F)(5) 0.6405 0.45514 0.46572 0.15968 0.71648 0.85301 0.40552 0.50174 0.4683 0.85678 0.92542 0.61196 0.69121 0.61114 0.93796
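All model variants above are trained with the TD3 algorithm [26], [30], whose critic target uses clipped double-Q learning with the noise parameters from Table IV. The following is a minimal numpy sketch under simplifying assumptions: the target actor and twin target critics are stood in by random linear maps, and the function name is hypothetical.

```python
import numpy as np

# Sketch of the TD3 critic target (Fujimoto et al. [26], [30]).
# Linear stand-ins replace the real target networks for illustration.
rng = np.random.default_rng(1)

gamma, sigma2, c = 0.99, 0.2, 0.5        # values from Table IV
m = 8                                    # item-embedding dimension (Table IV)

W_actor = rng.normal(size=(2 * m, m))    # target actor: state (s+, s-) -> action
w_q1 = rng.normal(size=3 * m)            # twin target critics: (s+, s-, a) -> scalar
w_q2 = rng.normal(size=3 * m)

def td3_target(s_next, reward):
    # Target action with clipped Gaussian smoothing noise (sigma2, c).
    a_next = s_next @ W_actor
    a_next = a_next + np.clip(rng.normal(0.0, sigma2, size=m), -c, c)
    x = np.concatenate([s_next, a_next])
    # Clipped double-Q: take the minimum of the twin target critics.
    q_min = min(x @ w_q1, x @ w_q2)
    return reward + gamma * q_min

y = td3_target(rng.normal(size=2 * m), reward=5.0)  # reward 5 for a positive interaction
```

The critics regress toward this target y, while the actor is updated only every "policy frequency" steps (2 in Table IV) and the target networks track the online ones with rate τ, which is what stabilizes training.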
V. CONCLUSION

This paper presents a novel recommendation framework that leverages deep RL. In particular, we model the recommendation problem as a sequential decision-making problem and solve it using the actor-critic paradigm. Our model captures dynamic changes in users' interests and considers items' long-term contribution to the cumulative reward. In contrast to previous methods, our method effectively integrates the direct relations between items encoded in CF's embedding vectors by incorporating similar and dissimilar items into users' interactions. Additionally, we use all positive and negative feedback to capture the users' dynamic interests. We have demonstrated that our method can improve recommendation performance on our dataset and two public datasets.

REFERENCES

[1] R. J. Mooney and L. Roy, "Content-based book recommending using learning for text categorization," in Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 195–204.
[2] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, 2009.
[3] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[4] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, "Field-aware factorization machines for CTR prediction," in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 43–50.
[5] S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep learning based recommender system: A survey and new perspectives," ACM Comput. Surv., vol. 52, no. 1, Feb. 2019.
[6] X. Wang, Y. Wang, D. Hsu, and Y. Wang, "Exploration in interactive personalized music recommendation: a reinforcement learning approach," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 11, no. 1, pp. 1–22, 2014.
[7] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. Yuan, X. Xie, and Z. Li, "DRN: A deep reinforcement learning framework for news recommendation," 04 2018, pp. 167–176.
[8] G. Shani, D. Heckerman, R. I. Brafman, and C. Boutilier, "An MDP-based recommender system," Journal of Machine Learning Research, vol. 6, no. 9, 2005.
[9] N. Taghipour and A. Kardan, "A hybrid web recommender system based on Q-learning," in Proceedings of the 2008 ACM Symposium on Applied Computing, 2008, pp. 1164–1168.
[10] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, "Recommendations with negative feedback via pairwise deep reinforcement learning," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1040–1048.
[11] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu, "Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 368–377.
[12] X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang, "Deep reinforcement learning for page-wise recommendations," RecSys, 2018.
[13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285–295.
[14] J. Wang, A. P. De Vries, and M. J. Reinders, "Unifying user-based and item-based collaborative filtering approaches by similarity fusion," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 501–508.
[15] F. Liu, R. Tang, X. Li, Y. Ye, H. Chen, H. Guo, and Y. Zhang, "Deep reinforcement learning based recommendation with explicit user-item interactions modeling," CoRR, vol. abs/1810.12027, 2018. [Online]. Available: http://arxiv.org/abs/1810.12027
[16] P. Wang, Y. Fan, L. Xia, W. X. Zhao, S. Niu, and J. Huang, "KERL: A knowledge-guided reinforcement learning model for sequential recommendation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 209–218.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[18] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[19] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[20] G. E. Dupret and B. Piwowarski, "A user browsing model to predict search engine click data from past observations," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 331–338.
[21] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: a factorization-machine based neural network for CTR prediction," arXiv preprint arXiv:1703.04247, 2017.
[22] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[23] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," 1999.
[24] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[25] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," arXiv preprint arXiv:1511.06432, 2015.
[26] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[27] R. He and J. McAuley, "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering," in Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 507–517.
[28] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel, "Image-based recommendations on styles and substitutes," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015, pp. 43–52.
[29] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, 2015.
[30] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018, pp. 1582–1591.
[31] S. Rendle, "Factorization machines," in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995–1000.
[32] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir et al., "Wide & deep learning for recommender systems," in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 7–10.
[33] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, "xDeepFM: Combining explicit and implicit feature interactions for recommender systems," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1754–1763.
[34] K. Gai, X. Zhu, H. Li, K. Liu, and Z. Wang, "Learning piece-wise linear models from large scale data for ad click prediction," arXiv preprint arXiv:1704.05194, 2017.
[35] W. Chu, L. Li, L. Reyzin, and R. Schapire, "Contextual bandits with linear payoff functions," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 208–214.
[36] H. Wang, Q. Wu, and H. Wang, "Learning hidden features for contextual bandits," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016, pp. 1633–1642.
[37] J. Bento, S. Ioannidis, S. Muthukrishnan, and J. Yan, "A time and space efficient algorithm for contextual linear bandits," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 257–272.