Anisio Lacerda
PII: S0925-2312(17)30228-X
DOI: 10.1016/j.neucom.2016.12.076
Reference: NEUCOM 18025
Please cite this article as: Anisio Lacerda, Multi-Objective Ranked Bandits for Recommender Systems,
Neurocomputing (2017), doi: 10.1016/j.neucom.2016.12.076
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Anisio Lacerda
Centro Federal de Educação Tecnológica de Minas Gerais – CEFET-MG
Av. Amazonas, 7675 – Belo Horizonte – Minas Gerais – Brazil
Abstract

This paper is interested in recommender systems that work with implicit feedback in dynamic scenarios providing online recommendations, such as news article and ad recommendation in Web portals. In these dynamic scenarios, user feedback to the system is given through clicks, and this feedback needs to be quickly exploited to improve subsequent recommendations. For this scenario, we propose an algorithm named multi-objective ranked bandits, which, in contrast with current methods in the literature, is able to recommend lists of items that are accurate, diverse and novel. The algorithm relies on four main components: a scalarization function, a set of recommendation quality metrics, a dynamic prioritization scheme for weighting these metrics, and a base multi-armed bandit algorithm. Results show that our algorithm provides improvements of 7.8% and 10.4% in click-through rate on two real-world large-scale datasets when compared to the single-objective state-of-the-art algorithm.

Keywords: Recommender systems, online recommendation, multi-armed bandits
In a world of information overload, having systems capable of filtering out content according to users' preferences is paramount. In this scenario, recommender systems play an important role in recommending relevant content. In this paper, we are interested in recommender systems that work with implicit feedback in dynamic scenarios providing online recommendations. One example within this scenario is to recommend news articles in a Web portal, illustrated in Figure 1, where content changes dynamically, with frequent insertions and deletions of news.

In dynamic scenarios such as the aforementioned, user feedback to the system (given through clicks) needs to be quickly exploited to improve subsequent recommendations. The most popular methods to deal with the problem of online interactive recommender systems follow a multi-armed bandit (MAB) approach, where, in each trial, the algorithm chooses from a set of alternatives and receives a reward associated with its choice. When using MABs for recommendation, each arm is associated with an item. Ranked Bandits [31, 26] were proposed as an alternative approach to recommend item lists within the context of Web search engines. In contrast with the traditional MAB approach, this method associates a full MAB algorithm to each position of the ranking of retrieved items. Hence, the user receives a list of items, where the item in each position of the ranking was recommended by a different MAB.

Ranked Bandits, as well as other classical MAB approaches, focus mostly on improving recommendation accuracy, which is usually measured by click-through rate. However, it has become a consensus in the literature that successful recommender systems need to consider other dimensions of information utility besides accuracy [8]. Notably, diversity and novelty of the suggestions performed by the system are key objectives to measure recommendation quality [27, 28]. In other words, even being accurate, obvious and monotonous recommendations are generally of little use, since they do not expose users to unprecedented experiences.

Email address: anisio@decom.cefetmg.br (Anisio Lacerda)
Preprint submitted to NEUROCOMPUTING, February 9, 2017

This paper goes further: besides testing non-linear scalarization schemes, it proposes a new method that dynamically prioritizes different recommendation quality metrics during the life cycle of the user in the system. This is done by optimizing weights for the different metrics online, as the user interacts with the system, while accounting for his feedback. This is important because the method adapts the system's responses to the user's needs (e.g., new system users may benefit from more accurate suggestions, whereas older users may require more diversified and novel recommendations).

We conducted an extensive evaluation of the proposed method on two large datasets from real-world recommendation tasks. The first task, namely news article recommendation, uses data from the Yahoo! Today News dataset and contains over 45,811,883 user visits. The second task deals with online advertisement (ad) recommendation, and uses data from the KDD Cup 2012 Online Ads (track 2) dataset, which contains 149,639,105 views of advertisements. Our algorithm provides improvements of 7.8% and 10.4% in click-through rate for the Yahoo! Today News and KDD Cup 2012 Online Ads data, respectively, when compared to the state-of-the-art single-objective algorithm.

The main contributions of this paper are:

• We propose and test two scalarization methods for weighting accuracy, diversity and novelty of the produced rankings in the context of Ranked Bandits.

• We propose a scheme that dynamically adapts the weight of multiple recommendation quality metrics under three different perspectives: the user, the system and the ranking position.

• We use two real-world, large-scale datasets to evaluate the performance of our method and a representative set of bandit algorithms.

There is an extensive literature of methods conceived to perform personalized offline and online recommendation. Currently, the state-of-the-art methods for offline recommendation follow a Collaborative Filtering (CF) approach [29]. CF approaches recommend an item i to a user u in two situations: (i) if the item is liked by other users whose preferences are similar to those of u (user-based CF), or (ii) if the item is similar to other items user u liked in the past (item-based CF). Since they rely on the historical ratings or preferences provided by the users, their performance is poor in online recommendation tasks.

Instead of going over the literature of offline recommendation methods exhaustively, we refer the interested reader to a recent comprehensive survey of Recommender Systems [5]. In this section, we focus on the studies that are most relevant to our approach, i.e., multi-objective recommender systems.

In the early days, the main drive of the recommender systems field was to accurately predict users' preferences. Nowadays, however, a wider perspective towards recommendation utility has been taking place.
The importance of diversity and novelty, among other properties, is becoming an important part of evaluation practice [11, 36].

Figure 2: The online recommendation task modeled as a Multi-objective Ranked Bandit problem. (Diagram: the recommender system shows a recommendation list to the target user, who examines it and acts on it; the resulting reward is fed back to the recommender.)

Multi-objective recommender systems have been successfully applied in several scenarios, aiming to improve user experience by considering more than the accuracy of suggestions. In [40], the authors present a method to improve recommendation diversity, which is based on an optimization procedure that considers two objective functions reflecting the preference for similarity and diversity of recommended items.

In this same direction, in [28], the authors propose hybrid approaches to multi-objective recommender systems based on the concept of Pareto-efficiency. Their method aggregates the output lists of basic recommender systems into a single list. In [43], the authors propose a genetic programming algorithm to simultaneously consider recommendation accuracy and diversity of suggested items.

We highlight that all the aforementioned methods run over an offline setting, i.e., the recommendation models are produced before user interactions with the system. To the best of our knowledge, no previous work presents an online multi-objective recommendation algorithm that dynamically prioritizes accuracy, diversity, and novelty when suggesting items to users.

When dealing with online scenarios, approaches based on Multi-objective Multi-armed Bandit algorithms (MO-MAB) deserve our attention. In MO-MAB, several arms (items) are candidates to be optimal depending on the set of objectives [9, 39, 38]. The standard approach to MO-MABs consists of modifying classical multi-armed bandit algorithms to consider more than one objective. For instance, in [9] the authors extend the classical UCB1 algorithm to propose the Pareto-UCB1. Since these methods recommend a single item per trial, we consider it unfair to compare MO-MAB algorithms and the methods proposed here. Note that, even if we consider the use of MO-MAB algorithms for recommending a single item per iteration to users, we restrict the possible objectives that can be considered. For instance, the diversity of a recommendation list is a metric that makes no sense if a single item is returned for each recommendation.

As we are interested in suggesting multiple items to users considering several objectives, in this work we follow the approach presented in [31, 26], named Ranked Bandits. Ranked Bandits is an online learning algorithm that learns a ranking of arms. Both works have as their main goal to minimize query abandonment, which happens when the user does not click on any of the presented query results.

There are two crucial differences between the works aforementioned and the one proposed here. First, they are interested only in minimizing the query abandonment of search engine rankings; hence, they ignore other important metrics of these rankings, such as diversity and novelty. Second, since they ignore user features, they cannot provide personalized recommendations, which is a crucial characteristic of our problem.
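In a MO-MAB, an arm is a candidate to be optimal whenever no other arm dominates it in every objective. A minimal sketch of this Pareto-optimality test (the toy score vectors and all names are ours, not from the paper):

```python
def dominates(a, b):
    """True if score vector a Pareto-dominates b: >= in every objective, > in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(score_vectors):
    """Indices of arms whose (accuracy, diversity, novelty) vectors are non-dominated."""
    return [i for i, s in enumerate(score_vectors)
            if not any(dominates(t, s) for j, t in enumerate(score_vectors) if j != i)]

arms = [(0.2, 1.0, 0.2), (0.5, 1.0, 0.4), (0.3, 1.0, 0.6)]   # toy score vectors
print(pareto_front(arms))  # [1, 2]: arm 0 is dominated by arm 1
```

Every arm on this front is a defensible single recommendation, which is precisely why a scalarization step is needed to pick one of them per trial.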
A MAB algorithm is responsible for balancing exploration and exploitation. Specifically, the algorithm A exploits past experience to select the arm that seems best. On the other hand, this seemingly optimal arm may be suboptimal, due to imprecisions in A's knowledge. To avoid this undesired situation, A has to explore by choosing seemingly suboptimal arms so as to gather more information about them.

In the context of online recommender systems, each arm represents an item to be recommended to a user, and the reward corresponds to whether the user clicked the suggested item or not. In Figure 2, we show the interaction cycle between a target user and an online recommender system. The user arrives and the recommender system produces an ordered list of items, which is presented to the user. The user interacts with the list by clicking on items that are relevant to him. The interaction is interpreted as feedback about the quality of the recommendation.

Formally, let I = {1, . . . , M} be a set of M arms, where each arm corresponds to an item. Let E be a distribution over tuples (u, r), where u ∈ U corresponds to the target user, i.e., the user we are interested in recommending items to, and r ∈ {0, 1}^M is a vector of rewards. For each pair (u, r), the jth component of r is the reward associated to arm i_j. The target user u can be described using features such as demographic information and gender.

In Algorithm 1, we detail a MAB algorithm. A MAB algorithm iterates in discrete trials t = 1, 2, . . . (Line 1). In each trial t, the algorithm chooses an item i(t) ∈ I based on user u_t and on the knowledge it collected from the previous trials (Line 2), and shows the chosen item to the target user (Line 3). After that, the reward r_t is revealed (Lines 4–8)¹. The algorithm then improves its arm-selection strategy with the new observation (u_t, i(t), r_t) (Line 9). Note that the algorithm selects a single item per trial, i.e., it is unable to return a recommendation list that maximizes the expected number of clicks from users.

¹ We use r_t as the reward for item i(t) in trial t.

Algorithm 1 Multi-armed bandit algorithm A
 1: for t = 1, 2, . . . do
 2:   choose item i(t) ∈ I based on u_t and the previous trials
 3:   show i(t) to the target user
 4:   if user clicked i(t) then
 5:     r_t = 1
 6:   else
 7:     r_t = 0
 8:   end if
 9:   update-mab(u_t, i(t), r_t, A)
10: end for

Recently, works in the multi-armed bandit literature have investigated methods for learning a ranking of arms instead of a single arm per trial [26], generating the Ranked Bandits. In Algorithm 2, we present the Ranked Bandits algorithm, which focuses on optimizing the choice of multiple arms. It runs an MAB instance A_i for each ranking position i. We consider that users scroll the ranking from top to bottom. Hence, algorithm A_1 chooses which item is suggested at rank 1. Next, algorithm A_2 selects which item is shown at ranking position 2, unless the item was already selected at a position above.

Algorithm 2 Ranked Bandits
 1: for t = 1, 2, . . . do
 2:   for j = 1, . . . , k do                ▷ Select item for rank j
 3:     ı̂_j(t) ← select-arm(A_j, u_t)
 4:     . . .
 5:     . . .
 6:     if ı̂_j(t) ∈ {i_1(t), . . . , i_{j−1}(t)} then
 7:       i_j(t) ← arbitrary unselected item
 8:     else
 9:       i_j(t) ← ı̂_j(t)
10:     end if
11:     end if
12:   end for
13:   show {i_1(t), . . . , i_k(t)} to user; register clicks
14:   for j = 1, . . . , k do                ▷ Find feedback for A_j
15:     if user clicked i_j(t) and ı̂_j(t) = i_j(t) then
16:       r_jt = 1
17:     else
18:       r_jt = 0
19:     end if
20:     update-mab(u_t, ı̂_j(t), r_jt, A_j)
21:   end for
22: end for
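The loop just described — one independent MAB per ranking position, a deduplicated list, and reward 1 for A_j only when its own choice was displayed and clicked — can be sketched as follows. The ε-greedy base bandit and the simulated click set are illustrative assumptions, not part of the paper:

```python
import random

class EpsGreedyMAB:
    """Minimal epsilon-greedy base bandit over n_arms items (illustrative)."""
    def __init__(self, n_arms, eps=0.1):
        self.eps = eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
    def select_arm(self):
        if random.random() < self.eps:
            return random.randrange(len(self.counts))      # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def ranked_bandit_trial(mabs, clicked_items):
    """One trial: each position's MAB picks an item; duplicates are replaced by
    an arbitrary unselected item; a MAB earns reward 1 only if its own choice
    was actually displayed and clicked."""
    n_items = len(mabs[0].counts)
    chosen, shown = [], []
    for mab in mabs:                                   # top-to-bottom positions
        pick = mab.select_arm()
        chosen.append(pick)
        if pick in shown:                              # duplicate at a higher rank
            pick = random.choice([a for a in range(n_items) if a not in shown])
        shown.append(pick)
    for mab, want, got in zip(mabs, chosen, shown):
        mab.update(want, 1 if (want == got and got in clicked_items) else 0)
    return shown

random.seed(0)
mabs = [EpsGreedyMAB(n_arms=5) for _ in range(3)]      # k = 3 ranks, M = 5 items
ranking = ranked_bandit_trial(mabs, clicked_items={2})
```

Run over many trials, each position's bandit converges on the item most likely to be clicked at that rank, which is the behavior the multi-objective extension below modifies.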
Table 1: In this toy example, we compare the final rankings of a single-objective (SO) and a multi-objective (MO-RANK) recommender. The last two columns give the final ranking.

Rank | MAB instance | Weights         | Item | Accuracy | Diversity | Novelty | SO | MO-RANK
 1   | A1           | (0.3, 0.2, 0.5) | i1   | 0.2      | 1.0       | 0.2     | i2 | i3
     |              |                 | i2   | 0.5      | 1.0       | 0.4     |    |
     |              |                 | i3   | 0.3      | 1.0       | 0.6     |    |
 2   | A2           | (0.4, 0.5, 0.1) | i1   | 0.7      | 0.5       | 0.2     | i3 | i1
     |              |                 | i2   | 0.1      | 0.4       | 0.4     |    |
     |              |                 | i3   | 0.9      | 0.1       | 0.2     |    |
 3   | A3           | (0.6, 0.1, 0.3) | i1   | 0.8      | 0.2       | 0.1     | i1 | i2
     |              |                 | i2   | 0.1      | 0.7       | 0.4     |    |
     |              |                 | i3   | 0.1      | 0.1       | 0.2     |    |
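The rank-1 row of the toy example can be checked numerically: under the weights (0.3, 0.2, 0.5), a linear combination of the objectives scores i3 highest, while accuracy alone favours i2 (variable names are ours):

```python
scores = {"i1": (0.2, 1.0, 0.2),    # (accuracy, diversity, novelty), Table 1 rank 1
          "i2": (0.5, 1.0, 0.4),
          "i3": (0.3, 1.0, 0.6)}
w = (0.3, 0.2, 0.5)                  # weight vector of MAB instance A1

linear = {i: sum(wo * so for wo, so in zip(w, s)) for i, s in scores.items()}
mo_pick = max(linear, key=linear.get)                 # multi-objective choice
so_pick = max(scores, key=lambda i: scores[i][0])     # accuracy-only choice
print(mo_pick, so_pick)  # i3 i2
```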
If the item chosen for a position was already selected at a higher position, an arbitrary unselected item is shown instead, and the corresponding algorithms A∗ receive a reward of 0.

Different from the previously mentioned ranked bandits algorithm, we address the problem of learning an optimal multi-objective ranking of items D = {d_1, d_2, . . . , d_k} for a target user. Specifically, we consider the multi-objective ranked bandit problem, where we have k ranking positions, k ≥ 1; M items, M ≥ 2; and O objectives, O ≥ 1. The goal is to simultaneously consider all objectives when recommending items to users. Note that each user may consider a subset of items as relevant according to their preferences and past experience, and the remainder of the items as non-relevant. Hence, users with different preferences will have different relevant rankings, while users with similar preferences will have similar relevant rankings.

For each item to be recommended, a vector s with its accuracy, diversity and novelty is stored. The accuracy corresponds to the predicted reward obtained by A_k. Note that each independent A_k is associated with a weight vector w, which will assign different weights to different objectives (quality metrics) at different positions of the ranking. In this way, more diverse and novel items may be privileged in higher positions, while accuracy may be considered more important in lower ranks. Although the example illustrated in Table 1 associates the weights to ranking positions, other weighting strategies will be discussed later in this section.

At each trial, A_1 returns an item to u by combining the weighted objectives using a function (which in this example is linear). A_1 returns i_3, while a single-objective recommender returns i_2. If we compare the final rankings D_MO-RANK = {i_3, i_1, i_2} and D_SO = {i_2, i_3, i_1} in Table 1, we can notice that the positions of items change according to the values and weights given to the different quality metrics.

The process we just described is composed of four main components: (i) a scalarization function, (ii) a set of recommendation quality metrics, (iii) a weighting scheme, and (iv) a base MAB algorithm A. The scalarization function defines how we weight each objective to compute a relevance score for each item. In Section 4.1, we detail the two scalarization functions considered in this paper. The recommendation quality metrics define the objectives that we are interested in maximizing on recommendation rankings, as described in Section 4.2. As we assume that different scenarios require different objectives' prioritization, we propose different weighting schemes to set these weights, as discussed in Section 4.3. Finally, in Section 4.4, we detail the set of base MAB algorithms used in this work.

4.1. Scalarization functions

A traditional way to deal with the multi-objective scenario is to transform it into a single-objective scenario using scalarization functions [9]. Here we consider both linear and non-linear scalarization functions.

As previously explained, in the multi-objective scenario, for each arm (item) i, there is a score vector s_i = (s_i^1, . . . , s_i^O), where O is a fixed number of objectives. We use a scalarization function to compute the estimation of rewards for all items. In other words, we compute a combined score using the scalarization function to predict the reward vector r ∈ {0, 1}^M, for all M items. The two scalarization functions used here receive the weighted components of the score vector s and return a scalar value [10]. We consider that the score vectors of all arms are ordered using the partial order on multi-objective vectors.

Linear scalarization. This function assigns to each objective o of an arm i a weight w_o, and returns a scalar equal to the weighted sum of the means:

    f_w(s_i) = w_1 s_i^1 + · · · + w_O s_i^O.    (1)

A known drawback of linear scalarization is its incapacity to potentially find all the points in a non-convex Pareto set [9].

Chebyshev scalarization. This function takes a weight vector as the function above, and also an O-dimensional reference point z = (z_1, . . . , z_O)^T, which has to be dominated by all elements in the Pareto set, i.e., all optimal reward vectors s_i^∗. The Chebyshev scalarization function is defined as:

    f_w(s_i) = min_{1 ≤ j ≤ O} w_j (s_i^j − z_j).    (2)

The vector z corresponds to the minimum of the current scores in each objective, decreased by a small value τ_j > 0:

    z_j = min_{1 ≤ i ≤ M} s_i^j − τ_j.    (3)

A key advantage of the Chebyshev scalarization function is that, under certain conditions, it encounters all points in a non-convex Pareto set [39].

Now, given that we are able to convert a multi-objective MAB problem into a single-objective problem, we can select the best arm i∗ that maximizes the function f_w:

    i∗ = argmax_{1 ≤ i ≤ M} f_w(s_i).    (4)

4.2. Recommendation quality metrics

In this section, we detail the recommendation quality metrics used in this work, namely, (i) accuracy, (ii) diversity, and (iii) novelty. Besides accuracy, which has been the typical goal of recommender systems, suggesting diverse and novel items has become a key concern.
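Equations (1), (2) and (4) can be sketched together, with the reference point z built as in Eq. (3); the toy score vectors and the τ value are illustrative assumptions:

```python
def linear_scalarization(w, s):
    # Eq. (1): weighted sum of one arm's objective scores
    return sum(wj * sj for wj, sj in zip(w, s))

def chebyshev_scalarization(w, s, z):
    # Eq. (2): smallest weighted deviation from the reference point z
    return min(wj * (sj - zj) for wj, sj, zj in zip(w, s, z))

def best_arm(scores, f):
    # Eq. (4): index of the arm maximizing the scalarized score
    return max(range(len(scores)), key=lambda i: f(scores[i]))

scores = [(0.2, 0.5, 0.2), (0.5, 0.4, 0.4), (0.3, 0.1, 0.6)]  # toy (acc, div, nov)
w = (0.3, 0.2, 0.5)
tau = 0.01
z = [min(s[j] for s in scores) - tau for j in range(3)]        # Eq. (3)
print(best_arm(scores, lambda s: linear_scalarization(w, s)))          # 1
print(best_arm(scores, lambda s: chebyshev_scalarization(w, s, z)))    # 1
```

On these toy vectors both functions agree; they differ precisely on the non-convex Pareto sets discussed above.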
The diversity of an item is computed with respect to the set of items above its ranking position j. Specifically, we consider all items with ranking position l, such that 1 ≤ l < j, and compare them to the current item i_j. The diversity of item i_j is given by:

    div(i_j | S) = (1 / |S|) Σ_{1 ≤ l < j} d(i_j, i_l),    (5)

where S is the set of items with ranking position l smaller than j, and d(a, b) is the cosine similarity between items a and b. When S is empty, i.e., we are considering the first position in the ranking, we initialize all item diversities with equal values, e.g., 1.

In order to measure the novelty of item i_j, in turn, we compute a popularity-based item novelty metric, as presented in [36]. Let the probability of an item i_j being seen be:

    p(seen | i_j) = plays_l(i_j) / Σ_{m=1}^{M} plays_l(i_m).

Summarizing, diversity is related to the internal differences within a ranking, whereas novelty can be understood as the difference between present and past experiences [36]. Both may be promoted to enrich the user experience or even to minimize the inherent biases of recommendation.

Next, we detail the proposed methods for weighting each recommendation quality metric, i.e., accuracy, diversity and novelty. Since the motivations for enhancing each metric are themselves diverse, we propose three different weighting schemes by considering the user, system, and ranking position perspectives. In Algorithm 3, we detail the algorithm used to select the best arm (i.e., best item) to recommend to users. Note that the only difference among the proposed weighting schemes (i.e., per User, per System, and per Rank) is that we need to instantiate a different weight vector w depending on the chosen scheme.

Algorithm 3 Multi-Objective Arm Selection
Require: Weight vector w, score of objectives matrix S_{M×O}, scalarization function f
1: function select-arm
2:   for i = 1, . . . , M do
3:     final-scores_i = f_w(s_i)    ▷ Note: s_i is the vector of objective scores for item i
4:   end for
5:   best-arm = argmax(final-scores)
6:   return best-arm
7: end function

Dynamic Multi-Objective Weighting per User (DMO-USER): From the user perspective, both novelty and diversity are desirable and act as a direct source of user satisfaction. They help users to expand their horizon and, as a consequence, enrich their experience. The appropriate balance among the objectives depends on each user and his tastes. We propose to have a distinct weight vector for each user, which will be adapted according to the feedback he provides. Looking at the example in Table 1, instead of associating weights with ranking positions, a distinct weight vector is associated with each user.

Dynamic Multi-Objective Weighting per System (DMO-SYSTEM): From the system perspective, diversity and novelty help the recommender to discover what user preferences really are, given the limitation of the user actions. For instance, in this paper we focus on implicit feedback, such as clicks on items. Although clicks are driven by user interests, it is still difficult to identify what motivates a user to click on an item. As the recommender has access to a limited number of user activities, it has incomplete knowledge about the users. In order to alleviate this limitation, a healthy level of diversity and novelty may help the system to reach a better understanding of the users and, as a consequence, improve the quality of recommendations. Another motivation behind providing more diverse and novel suggestions is related to business motivations. For instance, suggesting items in the long tail is a strategy to draw profit from market niches by selling less of more [2]. Moreover, product diversification is a way to expand the business and mitigate risk [22, 35]. In this case, a global weighting vector is used, and the weights are the same for all users and all positions of the ranking. In this weighting scheme, we have a global weight vector w common to all users and ranking positions.

Dynamic Multi-Objective Weighting per Rank (DMO-RANK): We also consider an inherent bias involved when users choose items from ordered lists, i.e., the positional bias. This bias states that items in the top positions of the ranking are more likely to receive clicks than items at lower positions [14]. This bias has been investigated in the Information Retrieval [12, 13] and Recommender Systems [14, 19] literature. Here, we propose a weighting scheme that dynamically assigns different weights for accuracy, diversity and novelty to different ranking positions (see Table 1). Given that higher ranking positions receive more attention from users, a valid strategy is to diversify recommendations on these positions for new items, to understand how attractive they are to users. Meanwhile, if users reach lower positions of the ranking, the recommender may prioritize accuracy on these positions to benefit users' effort. In this weighting scheme, we have a different weight vector w for each ranking position.

The dynamic schemes update their weight vectors with a state-space model:

    z_t = g(z_{t−1}, ν_t),    (8)
    f_t = h(z_t, c_t, δ_t),    (9)

where z_t is the hidden state, c_t is an optional input, ν_t is the system noise at time t, δ_t is the observation noise at time t, f_t is the observation, g is the transition model and h is the observation model.

We define the hidden state to represent the weights, i.e., z_t ≡ w_t, and let the time-varying observation model represent the current score vector s. Hence, we apply the Kalman filter to the observation model to update our posterior beliefs about the weights w as the recommender interacts with the users. Finally, the update of the model parameters is given by:

    w_t = w_{t−1} + (1/σ²) Σ_{t|t} (r_t − s_t^T w_{t−1}) s_t,    (10)

where r_t is the reward associated to the suggested item at time t and σ² is the noise variance. We compute the weighting values using a weighting regressor W, which uses Eq. 10 to update weights according to users' feedback. We detail this procedure in terms of input and output in Algorithm 4.

Algorithm 4 Update Regressor Method
Require: Weight vector w, reward r, and score of objectives vector s
1: function update-regressor
2:   ▷ Compute new weights for each objective (as a computation over input vectors)
3:   w_new = w + (1/σ²) Σ (r − s^T w) s
4:   return w_new
5: end function

4.4. Algorithm formalization

Algorithm 5 presents the proposed Multi-Objective Ranked Bandit algorithm, which receives as input a set of bandit algorithms A, a scalarization function S, and a weighting regressor W. The scalarization function refers to how we weight each objective to compute a relevance score for each item. We also assume that different scenarios require different objectives' prioritization. A weighting regressor is used to dynamically update the weights for each recommendation quality metric.

The individual bandit algorithms are initialized to properly consider the set of items I, where |I| = M. If the selected item has already been chosen on a higher rank, we select a new item arbitrarily (Lines 6–10). The chosen items are displayed to the user and a click (or non-click) is registered (Line 12). After the target user scrolls the recommended ranking, the algorithm processes his feedback for each jth ranking position (Lines 13–20).

We consider only binary feedback, i.e., the reward is 1 if the user clicked on the suggested item, and 0 otherwise (Lines 14–18). Hence, we update each multi-armed bandit instance A_j that corresponds to the jth ranking position according to the given reward (Line 19). Finally, we update the online linear regressor W to consider the given reward and corresponding weighting scheme (Line 20). Note that the main differences from Algorithm 2 refer to lines 4–5 and 20: in these lines, we consider the computed weights and update them, respectively.
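The update of Eq. (10), which Algorithm 4 encapsulates, can be sketched as follows; as simplifying assumptions of this sketch, the posterior covariance Σ is taken as the identity and σ is a fixed constant:

```python
def update_regressor(w, r, s, sigma=1.0):
    """One step of Eq. (10): nudge the weights so that s^T w better explains
    the observed reward r (posterior covariance simplified to the identity)."""
    pred = sum(wj * sj for wj, sj in zip(w, s))          # s^T w
    return [wj + (r - pred) * sj / sigma ** 2 for wj, sj in zip(w, s)]

w = [1 / 3, 1 / 3, 1 / 3]                                # equal initial priorities
w = update_regressor(w, r=1.0, s=[0.5, 0.4, 0.4])        # a click: weights grow
```

A click on an item whose predicted reward was below 1 pulls every weight up in proportion to that objective's score, which is how the scheme drifts toward the objectives that actually earn clicks.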
Algorithm 5 Multi-Objective Ranked Bandit
 1: for t = 1, 2, . . . do
 2:   for j = 1, . . . , k do
 3:     . . .
 4:     w ← current weights from regressor W        ▷ computed weights
 5:     ı̂_j(t) ← select-arm(w, S, f)                ▷ Alg. 3
 6:     . . .
 7:     if ı̂_j(t) ∈ {i_1(t), . . . , i_{j−1}(t)} then
 8:       i_j(t) ← arbitrary unselected item
 9:     else
10:       i_j(t) ← ı̂_j(t)
11:     end if
12:     end if
13:   end for
14:   show {i_1(t), . . . , i_k(t)} to user; register clicks
15:   for j = 1, . . . , k do                         ▷ Find feedback for A_j
16:     if user clicked i_j(t) and ı̂_j(t) = i_j(t) then
17:       r_jt = 1
18:     else
19:       r_jt = 0
20:     end if
21:     update-mab(u_t, ı̂_j(t), r_jt, A_j)
22:     w_t = update-regressor(W, w_t, r_jt)         ▷ Alg. 4
23:   end for
24: end for

The base bandit algorithms are described in Section 5.3. Finally, in Section 5.4 we discuss experimental results. We emphasize that the datasets used in our experiments are publicly available on the Web [1], and that the code for the method and baselines can be found in a Github repository³.

³ www.github.com/to-be-announced (available after publication).

5.1. Datasets

We used two real-world, large-scale datasets to evaluate our algorithm: Yahoo! Today News and KDD Cup 2012 Online Advertising.

The first dataset was made available by Yahoo! as part of the Yahoo! Webscope program [1]. We use the R6A version of the dataset, which contains more than 45 million user view/click log events for articles displayed on the Yahoo! Today Module of the Yahoo! Front Page during a period of ten contiguous days. We used 4.5 million events as a warm-up for the bandit algorithms (10%), and the remaining log events as test data.

Each user/system interaction event consists of four components: (i) the random article displayed to the user, (ii) user/item feature vectors, (iii) whether the user clicked on the displayed article, and (iv) the timestamp. In this dataset, displayed articles were randomly selected from an article pool. This makes the dataset ideal for unbiased, offline evaluation of multi-armed bandit algorithms. Detailed information on the data collection process and the dataset properties can be found in [21].

We use an advertisement dataset published by KDD Cup 2012, track 2. This dataset refers to a click log on www.soso.com (a large-scale search engine serviced by Tencent), which contains 149 million ad impressions (views of advertisements). Each instance is an ad impression that consists of: (i) the user profile (i.e., gender and age range), (ii) search keywords, (iii) displayed ad information, and (iv) the click count. We processed the data as follows. The dimension of the feature vector of the displayed ad information is 1,070,866. We first conducted a dimensionality reduction using the Latent Dirichlet Allocation topic model [4] on the keywords, title and description of ads. We used 6 topics to represent ads. Another issue related to this dataset
is that click information is extremely sparse due to the large number of ads. To tackle this problem, we selected the top 50 ads that have the most impressions for the evaluation, as proposed by the authors in [32]. The generated dataset contains 9 million user visit events, of which 0.9 million events were used as warm-up for the bandit algorithms (10%) and the remaining events as test data. Note that the competition originally refers to a batch scenario. Hence, it would be unfair to consider the competition-winning solution [37] as a baseline, given that here we investigate an online learning recommendation problem.

5.2. Evaluation Protocol

In an ideal scenario, the evaluation of a new multi-armed bandit algorithm should be conducted using a bucket test, where we run the algorithm to serve a fraction of live user traffic in a real recommender system. This evaluation protocol has several drawbacks. Hence, we use the state-of-the-art method for unbiased and offline evaluation of bandit algorithms, namely BRED (Bootstrapped Replay on Expanded Data) [23]. The method is based on bootstrapping techniques and enables an unbiased evaluation by utilizing historical data. The use of bootstrapping allows us to estimate the distribution of predictions while reducing their variability.

We are considering a recommendation task of top-k items, i.e., we recommend, for each target user, a list of the most relevant items with size k. Hence, we also evaluate the performance of our algorithm for a variable list size. The averaged reward is equivalent to the metric CTR, which is the total reward divided by the total number of trials:

    CTR@k = (1/T) Σ_{t=1}^{T} r_t,    (11)

where r_t is the sum of rewards on a ranking of size k, and T is the total number of trials. We report the algorithm's relative CTR, which is the algorithm's CTR divided by that of the random recommender.

5.3. Base bandit algorithms

The proposed Multi-Objective Ranked Bandit algorithm must balance acting on its current knowledge (exploitation) against trying options to explore uncertainties in the representation of the world (exploration). Different algorithms work in different ways to find the best trade-off between these two objectives. For background, we refer the reader to [7] and a recent survey on regret-minimization bandits [6]. In the following, we detail the base MAB algorithms used to instantiate our algorithm.
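As a concrete example of such a base bandit, a Beta-Bernoulli Thompson Sampling sketch (a standard construction under a Beta(1, 1) prior; the toy click model is an assumption of ours, not data from the paper):

```python
import random

class ThompsonBernoulli:
    """Beta-Bernoulli Thompson Sampling: sample a click-rate per arm from its
    posterior and play the arm with the highest sample."""
    def __init__(self, n_arms):
        self.wins = [1] * n_arms       # Beta(1, 1) prior per arm
        self.losses = [1] * n_arms
    def select_arm(self):
        samples = [random.betavariate(w, l) for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)
    def update(self, arm, clicked):
        if clicked:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1

random.seed(1)
mab = ThompsonBernoulli(n_arms=4)
for _ in range(200):                   # toy loop: arm 2 clicks 60% of the time
    a = mab.select_arm()
    mab.update(a, random.random() < (0.6 if a == 2 else 0.1))
```

Sampling from the posterior makes exploration self-regulating: uncertain arms occasionally win the sample draw, while clearly inferior arms are played less and less.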
posterior distribution. It chooses the arm proportionally to its probability of being optimal under this posterior [33].

• Bayes UCB: an adaptation of UCB to the Bayesian setting, where the upper confidence bound score is given by an upper quantile on the posterior mean [16].

• LinUCB (α): a classical contextual bandit algorithm that computes the expected reward of each arm by finding a linear combination of the previous rewards of the arm [20]. The contextual aspect of this algorithm refers to its capability of considering user features, such as gender and age, to recommend items. The α parameter is used to control the balance between exploration and exploitation.

Note that we are unable to use multi-objective multi-armed bandit algorithms (MO-MAB) as base algorithms (Section 2). First, MO-MAB algorithms consider the selection of a single item at each trial, and we are interested in returning a ranking of items to users. Even if we consider MO-MAB algorithms for the case where k = 1, it is not simple to understand the behavior of these algorithms as base MABs. The challenge is that MO-MAB algorithms try to consider multiple objectives by con-[…]

[…] policies to define the priorities of different quality metrics. The results are compared with the state-of-the-art method of the literature for single-objective optimization, referred to in this section as SO (Ranked Bandits with […]).

• SMO-EQ [w = (0.30, 0.30, 0.30)]: establishes equal priorities for each objective.

• SMO-ACC [w = (0.50, 0.25, 0.25)]: prioritizes accuracy over diversity and novelty.

• SMO-DIV [w = (0.25, 0.50, 0.25)]: prioritizes diversity over accuracy and novelty.

• SMO-NOV [w = (0.25, 0.25, 0.50)]: prioritizes novelty over accuracy and diversity.

5.4.1. On overall CTR

In Tables 2 and 3, we present the results for the Yahoo! and KDD datasets, respectively. We test the algorithms with different values of CTR@k, where k ∈ {1, 3, 5, 10} is the number of recommendations displayed (and, consequently, the number of MABs instantiated). Since we reach similar conclusions for all values, we present only CTR@5 in this section. Sections 5.4.2 and 5.4.3 analyze the impact of the parameter k on the methods' performance. Note that for the base bandit algorithms ε-greedy (EG) and LinUCB, we enumerate different values for the parameters ε and α, respectively. We highlight gains in relative CTR@5 over the Single-Objective baseline (SO) in bold.

We start by comparing the overall CTR values for the linear and Chebyshev scalarization functions. The […] happens for the combinations of the Chebyshev function with Thompson Sampling and a dynamic multi-objective weighting scheme. When considering the Yahoo! dataset, the best results were obtained using the […]
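The two scalarization functions compared throughout this section can be sketched as follows. The linear form is a weighted sum of the per-objective scores (accuracy, diversity, novelty); the Chebyshev form shown here is the formulation commonly used in multi-objective bandits (the worst weighted gain over a reference point z), which is what lets it reach points on a non-convex Pareto front. The paper defines its exact forms earlier, so treat this as an illustrative sketch; the weight vector follows the SMO-ACC scheme above:

```python
def linear_scalarize(w, scores):
    """Weighted sum of objective scores, e.g. (accuracy, diversity, novelty)."""
    return sum(wi * si for wi, si in zip(w, scores))

def chebyshev_scalarize(w, scores, z=None):
    """Weighted Chebyshev scalarization (maximization form): the scalarized
    reward is the smallest weighted gain over a reference point z, so the
    lagging objective dominates and improving it pays off."""
    if z is None:
        z = [0.0] * len(scores)
    return min(wi * (si - zi) for wi, si, zi in zip(w, scores, z))

SMO_ACC = (0.50, 0.25, 0.25)  # static scheme prioritizing accuracy
```

Under the linear form an arm can compensate a poor objective with a strong one; under the Chebyshev form it cannot, which matches the discussion of conflicting metrics later in this section.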
Table 2: Relative CTR@5 for the Yahoo! dataset. (SMO-* are static weighting schemes; DMO-* are dynamic weighting schemes.)

Linear scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.5291  1.5336  1.4660  1.5197  1.4348  1.5009   1.5984   1.4885
EG(0.05)       1.5565  1.5538  1.5227  1.4957  1.5565  1.4840   1.5800   1.5114
EG(0.10)       1.5443  1.5125  1.5314  1.5085  1.5190  1.5025   1.5287   1.5163
EG(0.20)       1.5060  1.5403  1.5825  1.4812  1.4870  1.5197   1.5422   1.5307
EG(0.30)       1.4948  1.5132  1.5204  1.4854  1.5643  1.4848   1.5942   1.5045
LinUCB(0.01)   1.5379  1.5148  1.4833  1.5276  1.5235  1.4875   1.5619   1.5294
LinUCB(0.10)   1.5271  1.5264  1.5006  1.4675  1.5243  1.4844   1.5322   1.5170
LinUCB(0.30)   1.5471  1.5358  1.4692  1.4752  1.5005  1.4583   1.5138   1.5146
TS             1.5581  1.4661  1.4568  1.5082  1.5191  1.4825   1.6269   1.6112
UCB1           1.4880  1.4144  1.3887  1.4309  1.4347  1.4306   1.5811   1.5381
UCB2           1.5187  1.5525  1.5390  1.5581  1.5365  1.4991   1.6052   1.6198
BayesUCB       1.5339  1.4411  1.4091  1.4366  1.4377  1.4264   1.6224   1.5915

Chebyshev scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.5291  1.5919  1.5506  1.6141  1.6503  1.6179   1.5091   1.4887
EG(0.05)       1.5565  1.6485  1.5126  1.5966  1.6522  1.6227   1.5296   1.5121
EG(0.10)       1.5443  1.6365  1.5708  1.5910  1.6644  1.6265   1.5535   1.5082
EG(0.20)       1.5060  1.6305  1.5116  1.5828  1.6386  1.5921   1.5404   1.5476
EG(0.30)       1.4948  1.5842  1.4906  1.5917  1.6450  1.5722   1.5617   1.4600
[…]
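As a sanity check on how the relative values in Table 2 translate into the improvements quoted in the text: the 7.8% Yahoo! gain reported later corresponds to EG(0.10) under Chebyshev scalarization with the SMO-NOV scheme versus its SO column:

```python
so_ctr = 1.5443  # Table 2, EG(0.10), SO column
mo_ctr = 1.6644  # Table 2, EG(0.10), Chebyshev + SMO-NOV
lift = mo_ctr / so_ctr - 1.0
print(f"{lift:.1%}")  # prints 7.8%
```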
schemes, whereas in the KDD dataset the improvement is of 2.5%.

For the Yahoo! dataset, the best performance of the static weighting scheme is achieved when prioritizing novelty over the other recommendation metrics (i.e., SMO-NOV). For the KDD dataset, in turn, the prioritization of diversity (i.e., SMO-DIV) reaches better performance in terms of relative CTR. Since the weights are static, the recommendation of diverse and novel rankings is more likely to find relevant items than just prioritizing accuracy, for instance.

Considering dynamic weighting schemes, in both datasets the best scheme is to consider a specific prioritization for each ranking position (DMO-RANK). When we consider the dynamic multi-objective weighting per user (DMO-USER), we are able to improve baseline performance mostly considering Chebyshev scalarization. It seems hard to find good recommendation metric weighting schemes when considering a linear relationship between these metrics. Furthermore, the dynamic multi-objective weighting per system (DMO-SYSTEM) presents the worst performance when compared to the other weighting schemes. Since it considers a common prioritization scheme for all users and ranking positions, we believe it is harder for it to correctly prioritize the recommendation metrics.

Also observe that, for all base algorithms, there is a combination of weighting scheme and scalarization function that achieves better CTR performance than the state-of-the-art single-objective baseline. The best performance for the Yahoo! dataset is obtained by the ε-greedy algorithm [EG(0.10)], with a CTR improvement of 7.8%. For the KDD dataset, the best performance is obtained by the Thompson Sampling algorithm, with
Table 3: Relative CTR@5 for the KDD dataset. (SMO-* are static weighting schemes; DMO-* are dynamic weighting schemes.)

Linear scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.7504  1.6444  1.4809  1.5681  1.6351  1.5406   1.8756   1.5867
EG(0.05)       1.6702  1.6297  1.8049  1.5532  1.7044  1.6264   1.7112   1.5364
EG(0.10)       1.5525  1.7390  1.5511  1.7796  1.6323  1.4538   1.7901   1.4300
EG(0.20)       1.5997  1.7662  1.5939  1.4301  1.7022  1.4578   1.7498   1.7998
EG(0.30)       1.8229  1.5608  1.6831  1.4801  1.4996  1.7413   1.6511   1.6796
LinUCB(0.01)   1.8719  1.6399  1.5800  1.5747  1.5612  1.4865   1.6069   1.5834
LinUCB(0.10)   1.7580  1.6798  1.5675  1.5340  1.7559  1.6096   1.6919   1.4677
LinUCB(0.30)   1.9600  1.4623  1.5262  1.5713  1.4506  1.5624   1.6790   1.6760
TS             1.8472  1.4284  1.7169  1.6447  1.5593  1.4457   1.9494   1.9081
UCB1           1.6841  1.4890  1.4037  1.3775  1.4787  1.4941   1.8884   1.6600
UCB2           1.5925  1.5065  1.8330  1.6817  1.5109  1.6799   1.9039   1.7978
BayesUCB       1.7012  1.4583  1.4412  1.6279  1.4945  1.5782   1.6030   1.6006

Chebyshev scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.7504  1.8296  1.6705  1.7794  2.0273  1.7734   1.9141   1.7002
EG(0.05)       1.6702  1.8499  1.9400  1.7724  1.9430  1.9100   1.7773   1.6401
EG(0.10)       1.5525  2.0298  1.6972  2.0300  1.9175  1.6726   1.9660   1.5086
EG(0.20)       1.5997  2.0166  1.6233  1.6239  2.0228  1.6221   1.8832   1.9673
EG(0.30)       1.8229  1.7457  1.7740  1.6896  1.6747  1.9929   1.7285   1.7535
[…]
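Thompson Sampling [33], among the strongest base MABs in both tables, can be combined with the ranked-bandit construction (one MAB per ranking position, as noted in Section 5.4.1) as in the following minimal Bernoulli sketch; this is a generic illustration, not the paper's exact implementation:

```python
import random

class BetaTS:
    """Bernoulli Thompson Sampling: sample from each arm's Beta posterior
    and play the arm whose sample is largest [33]."""
    def __init__(self, n_arms):
        self.a = [1.0] * n_arms  # successes + 1 (Beta prior alpha)
        self.b = [1.0] * n_arms  # failures  + 1 (Beta prior beta)

    def select(self, exclude=()):
        candidates = [i for i in range(len(self.a)) if i not in exclude]
        return max(candidates,
                   key=lambda i: random.betavariate(self.a[i], self.b[i]))

    def update(self, arm, reward):
        self.a[arm] += reward
        self.b[arm] += 1.0 - reward

def recommend(bandits):
    """Ranked bandit: the MAB at position p picks an item not yet ranked,
    so k instantiated MABs yield a length-k ranking."""
    ranking = []
    for mab in bandits:  # one independent MAB per ranking position
        ranking.append(mab.select(exclude=set(ranking)))
    return ranking
```

After each trial, each position's MAB would be updated with the (scalarized) reward of the item it placed, which is where the weighting schemes of Tables 2 and 3 plug in.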
gains of 10.4% over the baseline in terms of CTR. On average, the gains over the single-objective algorithm are 4.7% and 15.8% for the Yahoo! and KDD datasets, respectively. […] considering diversity and novelty, which leads ε-greedy to achieve the best CTR. We have a similar pattern in the KDD dataset, where the best single-objective base algorithm is LinUCB, whereas the best multi-objective base algorithm is Thompson Sampling. Once we have gains in both situations, we may conclude that considering multiple objectives simultaneously is important to provide better recommendations to users.

Moreover, the choices of weighting prioritization scheme and scalarization function that provide better performance are dataset dependent. For instance, for the KDD dataset, the multi-objective versions of LinUCB and Bayes UCB exhibit no gains over single-objective […]

In conclusion: first, we present gains in CTR performance over baselines when considering multiple objectives simultaneously; second, we show improvements using dynamic weighting schemes over static ones. This shows that our state-space model is able to correctly prioritize among the recommendation metrics. It is well known that recommendation quality metrics may conflict among themselves [24]. For instance, it is relatively easy to build a highly diversified ranking that is poor in accuracy. In our experiments, we show that
the Chebyshev scalarization function deals better with these conflicts, providing better results. Possibly, since the Chebyshev scalarization function is able to find all points in a non-convex Pareto set, it is more likely to select the set of items that is best for each trial.

If we were to recommend a single algorithm, scalarization function and weighting scheme, we would suggest, for the Yahoo! dataset, the ε-greedy [EG(0.10)] algorithm, the Chebyshev scalarization function and the SMO-NOV scheme. For the KDD dataset, the suggestion is different: the Thompson Sampling algorithm, the Chebyshev scalarization function and the DMO-RANK scheme.

5.4.2. On the scalarization function

We also analyze the performance considering the Thompson Sampling algorithm as the MAB algorithm to suggest k items per trial, where k ∈ {1, 3, 5, 10}, for different scalarization functions. We chose these values of k to approximate the usual number of items suggested by real recommender systems, and the Thompson Sampling algorithm because it obtained the best overall results in the experiments reported in the previous section. In Figures 3(a–d) and Figures 3(e–h), we show the results for the Yahoo! and KDD datasets, respectively. We report the relative CTR of the multi-objective weighting schemes divided by the CTR of the […]. For instance, in the KDD dataset we have gains in all weighting schemes for k = 1, 3, and 10 (Figures 3(e,g,h)). A similar pattern occurs in the Yahoo! dataset. However, it seems that when we have dynamic weighting prioritization schemes for each ranking position (DMO-RANK), the linear scalarization function has a slightly better performance (see Figures 3(a–d)).

[…] In Figures 4(a–b), we present the best CTR values for the Yahoo! dataset, whereas in Figures 4(c–d), we show the CTR values for the KDD dataset. To help understand the graphs, we select a representative subset of base bandit algorithms, namely, (i) ε-greedy with ε = 0.05 [EG(0.05)], (ii) LinUCB with α = 0.10 [LinUCB(0.10)], (iii) Thompson Sampling (TS), (iv) UCB1, (v) UCB2, and (vi) Bayes UCB.

As expected, for almost all algorithms (except EG(0.10) in Figure 4(a) and Bayes UCB in Figures 4(b,d)), it is more difficult to recommend relevant items in the first ranking position (i.e., k = 1). As the size of the ranking increases, the relative CTR improves (i.e., k = 3, 5, 10). Figure 4 also shows that no single base bandit algorithm has the best performance in all evaluated scenarios, although TS and EG(0.1) are overall more consistent in their results. For instance, in Figure 4(a), we show the results for the Yahoo! dataset considering the linear scalarization. Here, the maximum CTR@1 is given by the ε-greedy algorithm, whereas in Figure 4(b), the Bayes UCB algorithm has the best performance. Finally, we also see that we have different ranges of CTR values depending on the dataset used. Specifically, in Figures 4(a–b), for the Yahoo! dataset, the relative CTR values range from approximately 1.3 to 1.7. Meanwhile, in Figures 4(c–d), for the KDD dataset, the relative CTR values range from approximately 1.2 to 2.1.

6. Conclusions and Future Work

This paper proposed a multi-objective ranked bandits algorithm for online recommendation. The algorithm relies on four main components: a scalarization function, a set of recommendation quality metrics, a dynamic prioritization scheme for weighting these metrics […] points in a non-convex Pareto set, and proposed a method to adapt the weighting scheme online for three different recommendation quality metrics, namely accuracy, novelty and diversity, based on a state-space model. Three variants of the dynamic model weighting method were proposed, considering the perspectives of the user, the recommender system and the ranking. The algorithm is flexible enough to consider different multi-armed bandit […] multi-objective version of the method considering different variations of the four components aforementioned. The results showed that the Chebyshev scalarization function together with a dynamic weighting scheme provides the best results for both datasets when compared to the single-objective version and other scalarization/weighting strategies. The results also show that when using the Thompson Sampling and ε-greedy algorithms we obtain more consistent results considering recommendation lists of different sizes. Particularly, our
[Figure 3: bar charts of relative CTR for each weighting scheme (SMO-EQ, SMO-ACC, SMO-DIV, SMO-NOV, DMO-USER, DMO-RANK, DMO-SYSTEM); panels (a)–(d) show k = 1, 3, 5, 10 for the Yahoo! dataset and panels (e)–(h) the same for the KDD dataset.]
Figure 3: Relative CTRs of the multi-objective weighting schemes for different ranking lengths (i.e., different k values). Figures (a–d): Yahoo! Today News data; Figures (e–h): KDD Cup 2012 Online Ads data.
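The DMO variants compared in Figure 3 adapt the weight vector w online; the paper derives these weights from a state-space model defined in an earlier section. As a rough, purely illustrative stand-in for that model, a DMO-RANK-style per-position update could look like the following sketch (the class and its exponential update rule are our assumptions, not the paper's equations):

```python
class PerPositionWeights:
    """Illustrative DMO-RANK-style weighting: one weight vector per ranking
    position, nudged toward the objectives that have been paying off least,
    then renormalized. A stand-in for the paper's state-space update."""
    def __init__(self, k, n_obj=3, lr=0.1):
        self.w = [[1.0 / n_obj] * n_obj for _ in range(k)]  # uniform start
        self.lr = lr

    def update(self, pos, objective_rewards):
        w = self.w[pos]
        for j, r in enumerate(objective_rewards):
            # raise the weight of under-served objectives (low reward)
            w[j] = (1 - self.lr) * w[j] + self.lr * (1.0 - r)
        total = sum(w)
        self.w[pos] = [wj / total for wj in w]
```

Keeping a separate vector per position matches the intuition discussed above: top positions can favor accuracy while lower positions drift toward diversity and novelty.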
approach provides a 7.8% and 10.4% lift in CTR compared to the state-of-the-art approach, for the Yahoo! Today News dataset and the KDD Cup 2012 Online Ads dataset, respectively.

As future work, a promising extension is to analyze the impact of other relevant objectives on recommendation quality, such as coverage and serendipity [29]. Another possible extension is to consider different base MABs for each ranking position. In this case, an analysis of both the empirical performance and the theoretical properties of this mixed ranked bandit algorithm would be of interest. Since we were successful in considering multiple objectives in recommender systems, it would also be interesting to adapt the algorithm to account for queries and documents in a search engine within the online learning-to-rank setting. We also intend to investigate situations where we have no usage history of users and items, known as the cold-start problem [30], by modeling it as an exploration/exploitation problem. Finally, in order to deal with sparsity, we believe another good line of investigation is adding sparsity as an objective to the proposed algorithm.

Acknowledgments

The authors gratefully acknowledge the financial support of CNPq; CEFET-MG under Proc. PROPESQ-10314/14; FAPEMIG under Proc. FAPEMIG-
[Figure 4: line charts of relative CTR versus ranking length k ∈ {1, 3, 5, 10} for EG(0.10), LinUCB(0.10), TS, UCB1, UCB2 and BayesUCB; panels (a)/(c) use linear scalarization and panels (b)/(d) Chebyshev scalarization.]
Figure 4: Relative CTRs for different ranking lengths. Figures (a–b): Yahoo! Today News data; Figures (c–d): KDD Cup 2012 Online Ads data.
[…] for the Web, process number APQ-01400-14; and Microsoft Azure Sponsorship–CEFET-MG.

References

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[5] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.
[6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
[7] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[8] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46. ACM, 2010.
[9] M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: a study. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–8. IEEE, 2013.
[10] G. Eichfelder. Adaptive Scalarization Methods in Multiobjective Optimization. Springer, 2008.
[13] […] in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131. ACM, 2009.
[14] K. Hofmann, A. Schuth, A. Bellogin, and M. De Rijke. Effects of position bias on click-based recommender evaluation. In Advances in Information Retrieval, pages 624–630. Springer, 2014.
[15] B. Horsburgh, S. Craw, and S. Massie. Learning pseudo-tags to augment sparse tagging in hybrid music recommender systems. Artificial Intelligence, 219:25–39, 2015.
[16] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012.
[17] A. Lacerda. Contextual bandits with multi-objective payoff functions. In Proceedings of the 4th Brazilian Conference on Intelligent Systems (BRACIS), pages 1–8. IEEE, 2015.
[18] A. Lacerda, A. Veloso, and N. Ziviani. Exploratory and interactive daily deals recommendation. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 439–442, New York, NY, USA, 2013. ACM.
[19] K. Lerman and T. Hogg. Leveraging position bias to improve peer recommendation. PLoS ONE, 9(6):e98914, 2014.
[20] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670, 2010.
[21] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.
[22] M. Lubatkin and S. Chatterjee. Extending modern portfolio theory into the domain of corporate diversification: does it apply? Academy of Management Journal, 37(1):109–136, 1994.
[23] J. Mary, P. Preux, and O. Nicol. Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 172–180, 2014.
[24] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, pages 1097–1101. ACM, 2006.
[25] K. P. Murphy. Machine Learning: A Probabilistic Perspective. 2012.
[26] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
[27] M. T. Ribeiro, A. Lacerda, A. Veloso, and N. Ziviani. Pareto-efficient hybridization for multi-objective recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, […]
[…] ACM, 2014.
[33] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[34] M. Tokic. Adaptive ε-greedy exploration in reinforcement learning based on value differences. In KI 2010: Advances in Artificial Intelligence, pages 203–210. Springer, 2010.
[35] S. Vargas. Novelty and Diversity Enhancement and Evaluation in Recommender Systems. PhD thesis, Universidad Autónoma de Madrid, 2015.
[36] S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In RecSys, pages 109–116, 2011.
[37] K.-W. Wu, C.-S. Ferng, C.-H. Ho, A.-C. Liang, C.-H. Huang, W.-Y. Shen, J.-Y. Jiang, M.-H. Yang, T.-W. Lin, C.-P. Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012. In ACM SIGKDD KDD-Cup Workshop, 2012.
[38] S. Yahyaa, M. M. Drugan, and B. Manderick. Scalarized and Pareto knowledge gradient for multi-objective multi-armed bandits. In Transactions on Computational Collective Intelligence XX, pages 99–116. Springer, 2015.
[39] S. Q. Yahyaa, M. M. Drugan, and B. Manderick. The scalarized multi-objective multi-armed bandit problem: an empirical study of its exploration vs. exploitation tradeoff. In Neural Networks (IJCNN), 2014 International Joint Conference on, pages 2290–2297. IEEE, 2014.
[40] M. Zhang and N. Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys, pages 123–130, 2008.
[41] C. Ziegler, S. McNee, J. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.
[42] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. Da Fonseca. Performance assessment of multiobjective optimizers: an analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003.
[43] Y. Zuo, M. Gong, J. Zeng, L. Ma, and L. Jiao. Personalized recommendation based on evolutionary multi-objective optimization. IEEE Computational Intelligence Magazine, 10(1):52–62, 2015.