
Accepted Manuscript

Multi-Objective Ranked Bandits for Recommender Systems

Anisio Lacerda

PII: S0925-2312(17)30228-X
DOI: 10.1016/j.neucom.2016.12.076
Reference: NEUCOM 18025

To appear in: Neurocomputing

Received date: 16 March 2016


Revised date: 24 September 2016
Accepted date: 3 December 2016

Please cite this article as: Anisio Lacerda, Multi-Objective Ranked Bandits for Recommender Systems,
Neurocomputing (2017), doi: 10.1016/j.neucom.2016.12.076

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Multi-Objective Ranked Bandits for Recommender Systems

Anisio Lacerda
Centro Federal de Educação Tecnológica de Minas Gerais – CEFET-MG
Av. Amazonas, 7675 – Belo Horizonte – Minas Gerais – Brazil
Email address: anisio@decom.cefetmg.br (Anisio Lacerda)

Abstract

This paper is interested in recommender systems that work with implicit feedback in dynamic scenarios providing online recommendations, such as news article and ad recommendation in Web portals. In these dynamic scenarios, user feedback to the system is given through clicks, and feedback needs to be quickly exploited to improve subsequent recommendations. In this scenario, we propose an algorithm named multi-objective ranked bandits, which, in contrast with current methods in the literature, is able to recommend lists of items that are accurate, diverse and novel. The algorithm relies on four main components: a scalarization function, a set of recommendation quality metrics, a dynamic prioritization scheme for weighting these metrics, and a base multi-armed bandit algorithm. Results show that our algorithm provides improvements of 7.8% and 10.4% in click-through rate in two real-world large-scale datasets when compared to the single-objective state-of-the-art algorithm.
Keywords: Recommender systems, online recommendation, multi-armed bandits

1. Introduction

In a world of information overload, having systems capable of filtering out content according to users' preferences is paramount. In this scenario, recommender systems play an important role in recommending relevant items to users. The recommendations generated by these systems may account for explicit feedback information, such as the rating a user gives to a movie or a book, or implicit feedback, e.g., whether a user clicks an ad in a portal or listens to a song in a music streaming service.

This paper is interested in recommender systems that work with implicit feedback in dynamic scenarios providing online recommendations. One example within this scenario is to recommend news articles in a Web portal, illustrated in Figure 1, where content changes dynamically, with frequent insertions and deletions of news.

In dynamic scenarios such as the aforementioned, user feedback to the system (given through clicks) needs to be quickly exploited to improve subsequent recommendations. The most popular methods to deal with the problem of online interactive recommender systems follow a multi-armed bandit (MAB) approach, where an algorithm makes a sequence of choices (each choice represented by an arm) according to the algorithm's policy. In each trial the algorithm chooses from a set of alternatives and receives a reward associated with its choice. When using MABs for recommendation, each arm is modeled as an item to be recommended to the user. At each trial, one item is recommended (one arm is chosen) and the algorithm receives as feedback whether the user clicked to see the recommended item.

Traditionally, MAB algorithms are limited to recommending one item per trial to the user. However, there are scenarios where a list of items is expected instead of a single item. In this direction, Ranked Bandits [31, 26] were proposed as an alternative approach to recommend item lists within the context of Web search engines. In contrast with the traditional MAB approach, this method associates a full MAB algorithm with each position of the ranking of retrieved items. Hence, the user receives a list of items, where the item in each position of the ranking was recommended by a different MAB.

Ranked Bandits, as well as other classical MAB approaches, focus mostly on improving recommendation accuracy, which is usually measured by click-through rate. However, it has become a consensus in the literature that successful recommender systems need to consider other dimensions of information utility besides accuracy [8].

Notably, diversity and novelty of the suggestions performed by the system are key objectives to measure recommendation quality [27, 28]. In other words, even when accurate, obvious and monotonous recommendations are generally of little use, since they do not expose users to unprecedented experiences.

Figure 1: Example of news articles recommendation in the "Main" tab of the Yahoo! Front Page. The highlighted recommended item refers to a sports article. [17]

In this direction, in [17] we introduced the idea of building upon Ranked Bandits to account for the optimization of multiple recommendation quality metrics, namely accuracy, diversity, and novelty of the ranking. We transformed this multi-objective optimization problem into a single-objective optimization by using scalarization functions to weight different objectives to predict the probability of a user clicking on a given recommended item.

This paper goes further and, besides testing non-linear scalarization schemes, it proposes a new method that dynamically prioritizes different recommendation quality metrics during the life cycle of the user in the system. This is done by optimizing weights for different metrics online, as the user interacts with the system, while accounting for his feedback. This is important because the method adapts the system's responses to the user's needs (e.g., new system users may benefit from more accurate suggestions, whereas older users may require more diversified and novel recommendations).

We conducted an extensive evaluation of the proposed method in two large datasets from real-world recommendation tasks. The first task, namely news article recommendation, uses data from the Yahoo! Today News dataset and contains over 45,811,883 user visits. The second task deals with online advertisement (ad) recommendation, and uses data from the KDD Cup 2012 Online Ads (track 2) dataset. This dataset contains 149,639,105 views of advertisements. Our algorithm provides improvements of 7.8% and 10.4% in click-through rate for the Yahoo! Today News and KDD Cup 2012 Online Ads data, respectively, when compared to the state-of-the-art single-objective algorithm.

The main contributions of this paper are:

• We propose and test two scalarization methods for weighting accuracy, diversity and novelty of the produced rankings in the context of Ranked Bandits.

• We propose a method that dynamically changes the weight of multiple recommendation quality metrics under three different perspectives: the user, the system and the ranking position.

• We conduct experiments in two real-world datasets to evaluate the performance of our method and a representative set of bandit algorithms.

The remainder of this paper is organized as follows. In Section 2, we review related approaches in the literature. In Section 3, we present the formulation of the online interactive recommendation problem. We then detail our method to predict arm reward considering multiple objectives in Section 4. In Section 5, we analyze the performance of our method comparing with traditional bandit algorithms proposed in the literature. Finally, Section 6 draws conclusions and discusses extensions of this work.

2. Related work

There is an extensive literature of methods conceived to perform personalized offline and online recommendation. Currently, the state-of-the-art methods for offline recommendation follow a Collaborative Filtering (CF) approach [29]. CF approaches recommend an item i to a user u in two situations: (i) if the item is liked by other users whose preferences are similar to those of u (user-based CF), or (ii) if the item is similar to other items user u liked in the past (item-based CF). Since they rely on the historical ratings or preferences provided by the users, their performance is poor in online recommendation tasks.

Instead of going over the literature of offline recommendation methods exhaustively, we refer the interested reader to a recent comprehensive survey of Recommender Systems [5]. In this section, we focus on the studies that are most relevant to our approach, i.e., Multi-Objective Recommender Systems.

In the early days, the main drive of the recommender systems field was to accurately predict users' preferences. Nowadays, however, a wider perspective towards recommendation utility has been taking place, and the importance of diversity and novelty, among other properties, is becoming an important part of evaluation practice [11, 36].

Multi-objective recommender systems have been successfully applied in several scenarios, such as movies [27], tags [15], and advertisements [18]. For example, in [41] the authors concluded that user satisfaction does not always correlate with high recommendation accuracy. Hence, different multi-objective algorithms have been proposed to enhance user experience considering more than the accuracy of suggestions. In [40], the authors present a method to improve recommendation diversity, which is based on an optimization procedure that considers two objective functions reflecting the preference for similarity and diversity of recommended items.

In this same direction, in [28] the authors propose hybrid approaches to multi-objective recommender systems based on the concept of Pareto-efficiency. Their method aggregates the output lists of basic recommender systems into a single list. In [43], the authors propose a genetic programming algorithm to simultaneously consider recommendation accuracy and diversity of suggested items.

We highlight that all the aforementioned methods run over an offline setting, i.e., the recommendation models are produced before user interactions with the system. On the other hand, our approach focuses on an online setting by considering users' feedback to improve recommendation quality. As far as we know, this is the first work that presents an online multi-objective recommendation algorithm that dynamically prioritizes accuracy, diversity, and novelty when suggesting items to users.

When dealing with online scenarios, approaches based on Multi-objective Multi-armed Bandit algorithms (MO-MAB) deserve our attention. In MO-MAB, several arms (items) are candidates to be optimal, depending on the set of objectives [9, 39, 38]. The standard approach to MO-MABs consists of modifying classical multi-armed bandit algorithms to consider more than one objective. For instance, in [9] the authors extend the classical UCB1 algorithm to propose the Pareto-UCB1, its multi-objective counterpart.

The current research line in MO-MAB algorithms investigates multi-objective versions of classical algorithms, which keep selecting a single arm (item) per interaction. Differently, we are interested in suggesting multiple items to users considering several objectives, which is equivalent to selecting many arms simultaneously, one from each MAB associated with a ranking position. Given the different nature of the results of the recommendation (a single item versus a list of items), we consider it unfair to compare MO-MAB algorithms and the methods proposed here. Note that, even if we consider the use of MO-MAB algorithms for recommending a single item per iteration to users, we restrict the possible objectives that can be considered. For instance, the diversity of a recommendation list is a metric that makes no sense if a single item is returned for each recommendation.

As we are interested in suggesting multiple items to users considering several objectives, in this work we follow the approach presented in [31, 26], named Ranked Bandits. Ranked Bandits is an online learning algorithm that learns a ranking of items based on users' behavior [26]. The authors in [31] build upon [26] and present variants of Ranked Bandits that consider side information on the similarity between items returned by search engines. Both works have as their main goal to minimize query abandonment, which happens when the user does not click on any of the presented query results.

There are two crucial differences between the works aforementioned and the one proposed here. First, they are interested only in minimizing the query abandonment of search engine rankings. Hence, they ignore other important metrics of these rankings, such as diversity and novelty. Second, since they ignore the user features, they cannot provide personalized recommendations, which is a crucial characteristic of our problem.

Figure 2: The online recommendation task modeled as a Multi-objective Ranked Bandit problem. (The figure shows the agent/environment interaction cycle: the recommender system, as the agent, generates a recommendation list; the target user, as the environment, examines the list; and the implicit feedback is returned as the reward.)

3. Problem formulation

Online recommendation is usually modeled as a Multi-armed bandit (MAB) problem, which originates from the reinforcement learning paradigm. The MAB problem is defined as a sequential Markov Decision Process (MDP) of an agent that tries to optimize its actions while improving its knowledge on the arms.

The fundamental challenge in bandit problems is the need for balancing exploration and exploitation. Specifically, the algorithm A exploits past experience to select the arm that seems best. On the other hand, this seemingly optimal arm may be suboptimal, due to imprecisions in A's knowledge. To avoid this undesired situation, A has to explore by choosing seemingly suboptimal arms so as to gather more information about them.

In the context of online recommender systems, each arm represents an item to be recommended to a user, and the reward corresponds to whether the user clicked the suggested item or not. In Figure 2, we show the interaction cycle between a target user and an online recommender system. The user arrives and the recommender system produces an ordered list of items, which is presented to the user. The user interacts with the list by clicking on items that are relevant to him. The interaction is interpreted as feedback about the quality of the suggested item list. This feedback is then used to update the recommendation model, aiming to provide better recommendations in the future. The cycle finishes and the next user is considered. This formulation translates to the Reinforcement Learning problem (see the terminology in Figure 2).

Formally, let U be an arbitrary input space and I = {1, . . . , M} be a set of M arms, where each arm corresponds to an item. Let E be a distribution over tuples (u, r), where u ∈ U corresponds to the target user, i.e., the user we are interested in recommending items to, and r ∈ {0, 1}^M is a vector of rewards. For each pair (u, r), the jth component of r is the reward associated with arm i_j. The target user u can be described using features such as demographic information and gender.

Algorithm 1 Multi-armed Bandits
 1: for t = 1, . . . , T do
 2:     i(t) ← select-arm(u_t, A)
 3:     show i(t) to user; register click
 4:     if user clicked i(t) then
 5:         r_t = 1
 6:     else
 7:         r_t = 0
 8:     end if
 9:     update-mab(u_t, i(t), r_t, A)
10: end for

In Algorithm 1, we detail a MAB algorithm. A MAB algorithm iterates in discrete trials t = 1, 2, . . . (Line 1). In each trial t, the algorithm chooses an item i(t) ∈ I based on user u_t and on the knowledge it collected from the previous trials (Line 2), and shows the chosen item to the target user (Line 3). After that, the reward r_t is revealed (Lines 4–8); here r_t denotes the reward for item i(t) in trial t. The algorithm then improves its arm-selection strategy with the new observation (u_t, i(t), r_t) (Line 9). Note that the algorithm selects a single item per trial, i.e., it is unable to return a recommendation list of items.

Also observe that the reward is revealed only for the chosen item. When the suggested item is clicked, the reward is 1; otherwise, the reward is 0. The expected reward of an item is the percentage of clicks it receives considering all trials, a.k.a. the Click-Through Rate (CTR). Hence, the algorithm needs to choose the item with maximum CTR, which maximizes the expected number of clicks from users.

Recently, works in the multi-armed bandit literature have investigated methods for learning a ranking of arms instead of a single arm per trial [26], generating the Ranked Bandits. In Algorithm 2, we present the Ranked Bandits algorithm, which focuses on optimizing the choice of multiple arms. It runs an MAB instance A_i for each ranking position i. We consider that users scroll the ranking from top to bottom. Hence, algorithm A_1 chooses which item is suggested at rank 1. Next, algorithm A_2 selects which item is shown at ranking position 2, unless the item was already selected at a higher ranking position. If that is the case, the second item is chosen arbitrarily. This procedure is repeated sequentially for all top-k ranking positions. If a user clicks on an item in ranking position i, then algorithm A_i receives a reward of 1, and all higher (i.e., skipped) algorithms A_j, where j < i, receive a reward of 0. For algorithms A_j with j > i, the state is rolled back as if this trial had never happened (as if the user never considered these items). If no item is clicked, then all algorithms receive a reward of 0.

Algorithm 2 Single-Objective Ranked Bandits
Require: Bandits A
 1: Initialize A_1(M), . . . , A_k(M)                       ▷ Initialize MABs
 2: for t = 1, . . . , T do
 3:     for j = 1, . . . , k do                             ▷ Sequentially choose items
 4:         î_j(t) ← select-arm(u_t, A_j)
 5:         if j > 1 then
 6:             if î_j(t) ∈ {i_1(t), . . . , i_{j-1}(t)} then
 7:                 i_j(t) ← arbitrary unselected item
 8:             else
 9:                 i_j(t) ← î_j(t)
10:             end if
11:         end if
12:     end for
13:     show {i_1(t), . . . , i_k(t)} to user; register clicks
14:     for j = 1, . . . , k do                             ▷ Find feedback for A_j
15:         if user clicked i_j(t) and î_j(t) = i_j(t) then
16:             r_jt = 1
17:         else
18:             r_jt = 0
19:         end if
20:         update-mab(u_t, î_j(t), r_jt, A_j)
21:     end for
22: end for
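To make the feedback rule above concrete, the following minimal Python sketch simulates one trial of the single-objective Ranked Bandits loop. The ε-greedy base bandit, the function names and the click handling are illustrative stand-ins, not code from the paper; it assumes k is not larger than the number of items.

```python
import random

class EpsilonGreedyMAB:
    """Toy base MAB, one instance per ranking position (stand-in for any base bandit)."""
    def __init__(self, n_items, epsilon=0.1):
        self.epsilon = epsilon
        self.clicks = [0.0] * n_items      # reward sums per arm
        self.pulls = [1e-6] * n_items      # tiny value avoids division by zero

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.pulls))
        ctr = [c / p for c, p in zip(self.clicks, self.pulls)]
        return max(range(len(ctr)), key=ctr.__getitem__)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.clicks[arm] += reward

def ranked_bandits_trial(mabs, n_items, clicked_position=None):
    """One trial of Algorithm 2 (sketch). `clicked_position` is the 1-based rank
    the simulated user clicked, or None for query abandonment."""
    proposed, shown = [], []
    for mab in mabs:                        # one MAB per ranking position
        arm = mab.select_arm()
        proposed.append(arm)
        if arm in shown:                    # duplicate: arbitrary unselected item
            arm = next(i for i in range(n_items) if i not in shown)
        shown.append(arm)
    # feedback: reward 1 at the clicked rank (only if the MAB's own proposal was shown),
    # reward 0 for the skipped ranks above it, and no update below the click
    for j, mab in enumerate(mabs, start=1):
        if clicked_position is None or j < clicked_position:
            mab.update(proposed[j - 1], 0)
        elif j == clicked_position:
            r = 1 if shown[j - 1] == proposed[j - 1] else 0
            mab.update(proposed[j - 1], r)
        # j > clicked_position: state is rolled back, i.e., no update
    return shown
```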

Different from the previously mentioned ranked bandits algorithm, we address the problem of learning an optimal multi-objective ranking of items D = {d_1, d_2, . . . , d_k} for a target user. Specifically, we consider the multi-objective ranked bandit problem, where we have k ranking positions, k ≥ 1; M items, M ≥ 2; and O objectives, O ≥ 1. The goal is to simultaneously consider all objectives when recommending items to users. Note that each user may consider a subset of items as relevant according to their preferences and past experience, and the remainder of the items as non-relevant. Hence, users with different preferences will have different relevant rankings, while users with similar preferences will have similar relevant rankings.

4. Multi-objective Ranked Bandits

In this section, we detail the multi-objective algorithm proposed to iteratively suggest items to users considering multiple recommendation quality metrics. We focus on learning an optimally relevant ranking of items for a target user, and model the recommendation task as a ranked bandit problem, as described in Section 3.

Before going into detail on the proposed algorithm, we illustrate the rationale behind it using the example in Table 1. In this example, we want to recommend 3 items (k = 3) to a user u considering three recommendation quality metrics: accuracy, diversity and novelty (the definition of these metrics is given in Eqs. 5 and 7). The Multi-Objective Ranked Bandit works as follows. For each position of the ranking of recommended items, a MAB A_k is instantiated. Each arm of A_k represents an item to be recommended. For each item, a vector s with its accuracy, diversity and novelty is stored. The accuracy corresponds to the predicted reward obtained by A_k. Note that each independent A_k is associated with a weight vector w, which assigns different weights to different objectives (quality metrics) at different positions of the ranking. In this way, more diverse and novel items may be privileged into higher positions, while accuracy may be considered more important into lower ranks. Although the example illustrated in Table 1 associates the weights to ranking positions, other weighting strategies will be discussed later in this section.

Table 1: In this toy example, we compare the final rankings of a single-objective (SO) and a multi-objective (MO-RANK) recommender.

Rank 1: MAB instance A1, weights (0.3, 0.2, 0.5)
    Item   Accuracy   Diversity   Novelty
    i1     0.2        1.0         0.2
    i2     0.5        1.0         0.4
    i3     0.3        1.0         0.6
    Final ranking: SO selects i2; MO-RANK selects i3

Rank 2: MAB instance A2, weights (0.4, 0.5, 0.1)
    i1     0.7        0.5         0.2
    i2     0.1        0.4         0.4
    i3     0.9        0.1         0.2
    Final ranking: SO selects i3; MO-RANK selects i1

Rank 3: MAB instance A3, weights (0.6, 0.1, 0.3)
    i1     0.8        0.2         0.1
    i2     0.1        0.7         0.4
    i3     0.1        0.1         0.2
    Final ranking: SO selects i1; MO-RANK selects i2

At each trial, A_1 returns an item to u by combining the weighted objectives using a function (which in this example is linear). A_1 returns i3, while a single-objective baseline (column SO) accounting only for accuracy would return i2. Then, A_2 updates the diversity metric for all items in relation to the current ranking D. Next, A_2 recommends an item to u by combining the weighted objectives, returning item i1. Since the novelty metric depends only on the selected item and not on the entire list, it is updated in the next trial. These same steps are followed to recommend the remaining items. At the end of this process, we have a ranking that was generated using a set of objectives simultaneously.
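As a quick check of the toy example, the short sketch below recomputes the MO-RANK column of Table 1 with a plain linear combination of the three metrics (weights and scores copied from the table); restricting the score to accuracy alone reproduces the SO column instead.

```python
# Worked sketch of Table 1: at each rank, score = w . (accuracy, diversity, novelty)
# and the highest-scoring item not yet ranked is recommended.
weights = {1: (0.3, 0.2, 0.5), 2: (0.4, 0.5, 0.1), 3: (0.6, 0.1, 0.3)}
scores = {  # rank -> item -> (accuracy, diversity, novelty), copied from Table 1
    1: {"i1": (0.2, 1.0, 0.2), "i2": (0.5, 1.0, 0.4), "i3": (0.3, 1.0, 0.6)},
    2: {"i1": (0.7, 0.5, 0.2), "i2": (0.1, 0.4, 0.4), "i3": (0.9, 0.1, 0.2)},
    3: {"i1": (0.8, 0.2, 0.1), "i2": (0.1, 0.7, 0.4), "i3": (0.1, 0.1, 0.2)},
}

ranking = []
for rank in (1, 2, 3):
    w = weights[rank]
    candidates = {i: s for i, s in scores[rank].items() if i not in ranking}
    best = max(candidates,
               key=lambda i: sum(wj * sj for wj, sj in zip(w, candidates[i])))
    ranking.append(best)

print(ranking)  # ['i3', 'i1', 'i2'], the MO-RANK column of Table 1
```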

If we compare the final rankings D_MO-RANK = {i3, i1, i2} and D_SO = {i2, i3, i1} in Table 1, we can notice that the positions of items change according to the values and weights given to the different quality metrics.

The process we just described is composed of four main components: (i) a scalarization function, (ii) a set of recommendation quality metrics, (iii) a weighting scheme, and (iv) a base MAB algorithm A. The scalarization function defines how we weight each objective to compute a relevance score for each item. In Section 4.1, we detail the two scalarization functions considered in this paper. The recommendation quality metrics define the objectives that we are interested in maximizing in recommendation rankings, as described in Section 4.2. As we assume that different scenarios require different objective prioritizations, we propose different weighting schemes to set these weights, as discussed in Section 4.3. Finally, in Section 4.4, we detail the set of base MAB algorithms used in this work.

4.1. Scalarization functions

A traditional way to deal with the multi-objective scenario is to transform it into a single-objective scenario using scalarization functions [9]. Here we consider both linear and non-linear scalarization functions.

As previously explained, in the multi-objective scenario, for each arm (item) i, there is a score vector s_i = (s_i^1, . . . , s_i^O), where O is a fixed number of objectives. We use a scalarization function to compute the estimation of rewards for all items. In other words, we compute a combined score using the scalarization function to predict the reward vector r ∈ {0, 1}^M, for all M items. The two scalarization functions used here receive the weighted components of the score vector s and return a scalar value [10].

We consider that the score vectors of all arms are ordered using the partial order on multi-objective spaces [42]. A score vector s is considered better than, or dominates, another score vector v, s ≺ v, if and only if there exists at least one objective j for which v^j < s^j, and for all other objectives l we have v^l ≤ s^l. In contrast, s is non-dominated by v, v ⊀ s, if and only if there exists at least one dimension j for which v^j < s^j. The Pareto optimal score set (or Pareto set for short) is the set of score vectors that are non-dominated by any other score vectors [9]. In the following, we detail the scalarization functions used in this work.

Linear scalarization. This function takes as input the vector of weights \vec{w} = (w^1, . . . , w^O), where \sum_{o=1}^{O} w^o = 1, assigns to each component s_i^o of the score vector \vec{s}_i of an arm i the weight w^o, and returns a scalar equal to the weighted sum:

    f_{\vec{w}}(\vec{s}_i) = w^1 s_i^1 + \cdots + w^O s_i^O.    (1)

A known drawback of linear scalarization is its incapacity to potentially find all the points in a non-convex Pareto set [9].

Chebyshev scalarization. This function takes the same weight vector as the function above and also an O-dimensional reference point z = (z^1, . . . , z^O)^T, which has to be dominated by all elements in the Pareto set, i.e., by all optimal reward vectors s_i^*. The Chebyshev scalarization function is defined as:

    f_{\vec{w}}(\vec{s}_i) = \min_{1 \le j \le O} w^j \, (s_i^j - z^j).    (2)

The vector z corresponds to the minimum of the current rewards minus a small positive value τ > 0 (we used τ = 0.001 in our experiments). Hence, for each objective l, we have:

    z^l = \min_{1 \le i \le M} s_i^l - \tau^l.    (3)

A key advantage of the Chebyshev scalarization function is that, under certain conditions, it encounters all points in a non-convex Pareto set [39].

Now, given that we are able to convert a multi-objective MAB problem into a single-objective problem, we can select the best arm i^*_{f_w} that maximizes the function f_w:

    i^*_{f_{\vec{w}}} = \operatorname*{argmax}_{1 \le i \le M} f_{\vec{w}}(\vec{s}_i).    (4)
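A compact sketch of Eqs. 1–4 follows, assuming NumPy and normalized score vectors; the function names and the toy score matrix (taken from rank 1 of Table 1) are illustrative, not part of the paper's implementation.

```python
import numpy as np

def linear_scalarization(s, w):
    """Eq. (1): weighted sum of the per-objective scores of one arm."""
    return float(np.dot(w, s))

def chebyshev_scalarization(s, w, z):
    """Eq. (2): min over objectives of w_j * (s_j - z_j), with z a reference
    point dominated by all optimal score vectors (Eq. 3)."""
    return float(np.min(w * (np.asarray(s) - z)))

def select_arm(S, w, scalarize, **kwargs):
    """Eq. (4) / Algorithm 3: pick the arm whose scalarized score is maximal.
    S is an (M arms x O objectives) matrix of normalized scores."""
    scores = [scalarize(S[i], w, **kwargs) for i in range(len(S))]
    return int(np.argmax(scores))

# toy usage: 3 arms, objectives = (accuracy, diversity, novelty), rank-1 scores of Table 1
S = np.array([[0.2, 1.0, 0.2], [0.5, 1.0, 0.4], [0.3, 1.0, 0.6]])
w = np.array([0.3, 0.2, 0.5])
z = S.min(axis=0) - 0.001                     # reference point, one entry per objective
print(select_arm(S, w, linear_scalarization))           # 2, i.e., item i3
print(select_arm(S, w, chebyshev_scalarization, z=z))   # index chosen under Chebyshev
```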

4.2. Recommendation quality metrics

In this section, we detail the recommendation quality metrics used in this work, namely (i) accuracy, (ii) diversity, and (iii) novelty. Besides accuracy, which has been the typical goal of recommender systems, suggesting items that are not easily discovered by the users is essential, and it can be measured by the diversity and novelty of the recommendations [28].

We interpret the estimated reward computed by the base MAB algorithms for the set of arms as an estimation of accuracy (see details in Section 5.3). Next, we detail how we compute diversity and novelty scores for each candidate item from our item set. Note that we consider normalized scores for all metrics (i.e., accuracy, diversity, and novelty).

In order to measure the diversity of an item i_j, we adapt the distance-based model from [36] to consider the set of items above ranking position j. Specifically, we consider all items with ranking position l, such that 1 ≤ l < j, and compare them to the current item i_j. The diversity of item i_j is given by:

    \mathrm{div}(i_j \mid S) = \frac{1}{|S|} \sum_{1 \le l < j} d(i_j, i_l),    (5)

where S is the set of items in ranking positions smaller than j, and d(a, b) is the cosine similarity between items a and b. When S is empty, i.e., when we are considering the first position in the ranking, we initialize all item diversities with equal values, e.g., 1.

In order to measure the novelty of item i_j, in turn, we compute a popularity-based item novelty metric, as presented in [36]. Let the probability of an item i_j being seen be:

    p(\mathrm{seen} \mid i_j) = \frac{\mathrm{plays}_l(i_j)}{\sum_{m=1}^{M} \mathrm{plays}_l(i_m)},    (6)

where plays_l(x) is the number of times the A_l instance chooses item x. The novelty of item i_j chosen by A_l is given by:

    \mathrm{nov}(i_j) = 1 - p(\mathrm{seen} \mid i_j).    (7)

Summarizing, diversity is related to the internal differences within a ranking, whereas novelty can be understood as the difference between present and past experiences [36].
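The two metrics can be sketched as follows, taking d(a, b) literally as the cosine similarity of Eq. 5 and estimating p(seen | i_j) from per-position play counts as in Eq. 6. The item feature vectors and the count bookkeeping are assumptions of this sketch.

```python
import numpy as np

def cosine(a, b):
    """d(a, b) in Eq. (5): cosine similarity between two item feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diversity(candidate, items_above):
    """Eq. (5): average of d(candidate, item) over the items already ranked above;
    initialized to 1 when the ranking is still empty."""
    if not items_above:
        return 1.0
    return sum(cosine(candidate, it) for it in items_above) / len(items_above)

def novelty(item_id, play_counts):
    """Eqs. (6)-(7): 1 minus the probability of the item having been seen, estimated
    from how often the position's MAB instance has chosen each item."""
    total = sum(play_counts.values())
    p_seen = play_counts.get(item_id, 0) / total if total else 0.0
    return 1.0 - p_seen
```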

4.3. Weighting schemes

In Section 4.2 we discussed the importance of considering a wider perspective towards the added value of recommendation, going beyond prediction accuracy. Here, we focus on different weighting schemes that can dynamically adapt to the needs of the user, the system, or even minimize the inherent biases of recommendation.

Next, we detail the proposed methods for weighting each recommendation quality metric, i.e., accuracy, diversity and novelty. Since the motivations for enhancing each metric are themselves diverse, we propose three different weighting schemes, considering the user, system, and ranking position perspectives. In Algorithm 3, we detail the procedure used to select the best arm (i.e., the best item) to recommend to users. Note that the only difference among the proposed weighting schemes (i.e., per User, per System, and per Rank) is that we need to instantiate a different weight vector w depending on the chosen scheme.

Algorithm 3 Multi-Objective Arm Selection
Require: Weight vector w, score-of-objectives matrix S_{M×O}, scalarization function f
1: function select-arm
2:     for i = 1, . . . , M do
3:         final-scores_i = f_w(s_i)            ▷ s_i holds the objective scores of item i
4:     end for
5:     best-arm = argmax(final-scores)
6:     return best-arm
7: end function

Dynamic Multi-Objective Weighting per User (DMO-USER): From the user perspective, both novelty and diversity are desirable and act as a direct source of user satisfaction. They help users to expand their horizon and, as a consequence, enrich their experience. Besides that, according to [24], new users have different needs from experienced users in a recommender system. Since new users need to establish trust and rapport with a recommender, they may benefit from more accurate suggestions, whereas older users may require more novel and diversified suggestions. However, how accurate, diverse and novel the recommendations are should depend on the user's experience with the system and his tastes. We propose to have a distinct weight vector for each user, which is adapted according to the feedback he provides. Looking at the example in Table 1, instead of associating weights with ranking positions, we would have the same weights for all positions, but they would vary from one user to another. We emphasize that, in this weighting scheme, we have a different weight vector w for each user.

Dynamic Multi-Objective Weighting per System (DMO-SYSTEM): From the system perspective, there is a great extent of uncertainty on what the actual user preferences really are, given the limitation of the user actions. For instance, in this paper we focus on implicit feedback, such as clicks on items. Although clicks are driven by user interests, it is still difficult to identify what motivates a user to click on an item. As the recommender has access to a limited number of user activities, it has an incomplete knowledge about the users. In order to alleviate this limitation, a healthy level of diversity and novelty may help the system to reach a better understanding of the users and, as a consequence, improve the quality of recommendations. Another motivation behind providing more diverse and novel suggestions is related to business motivations. For instance, suggesting items in the long tail is a strategy to draw profit from market niches by selling less of more [2]. Moreover, product diversification is a way to expand business and mitigate risk [22, 35]. In this case, a global weighting vector is used, and the weights are the same for all users and all positions of the ranking. In this weighting scheme, we have a global weight vector w common to all users and ranking positions.

Dynamic Multi-Objective Weighting per Rank (DMO-RANK): We also consider an inherent bias involved when users choose items from ordered lists, i.e., the positional bias. This bias states that items in the top positions of the ranking are more likely to receive clicks than items at lower positions [14]. This bias has been investigated in the Information Retrieval [12, 13] and Recommender Systems [14, 19] literature. Here, we propose a weighting scheme that dynamically assigns different weights for accuracy, diversity and novelty to different ranking positions (see Table 1). Given that higher ranking positions receive more attention from users, a valid strategy is to diversify recommendations on these positions for new items, to understand how attractive they are to users. Meanwhile, if users reach lower positions of the ranking, the recommender may prioritize accuracy on these positions to reward the users' effort. In this weighting scheme, we have a different weight vector w for each ranking position.

For all of the aforementioned weighting schemes, we want to dynamically learn the values of the weights using the users' feedback. Hence, we model the online inference of the weights as a state-space model [25]. A generic form of the model is:

    \vec{z}_t = g(\vec{c}_t, \vec{z}_{t-1}, \nu_t),    (8)
    \vec{f}_t = h(\vec{z}_t, \vec{c}_t, \delta_t),    (9)

where \vec{z}_t is the hidden state, \vec{c}_t is an optional input, \nu_t is the system noise at time t, \delta_t is the observation noise at time t, \vec{f}_t is the observation, g is the transition model and h is the observation model.

We define the hidden state to represent the weights, i.e., \vec{z}_t ≡ \vec{w}_t, and let the time-varying observation model represent the current score vector \vec{s}. Hence, we apply the Kalman filter to the observation model to update our posterior beliefs about the weights w as the recommender interacts with the users. Finally, the update of the model parameters is given by:

    \vec{w}_t = \vec{w}_{t-1} + \frac{1}{\sigma^2} \, \Sigma_{t|t} \, (r_t - \vec{s}_t^{\,T} \vec{w}_{t-1}) \, \vec{s}_t,    (10)

where r_t is the reward associated with the suggested item at time t and σ² is the observation noise variance. We compute the weighting values using a weighting regressor W, which uses Eq. 10 to update the weights according to the users' feedback. We detail this procedure in terms of input and output in Algorithm 4.

Algorithm 4 Update Regressor Method
Require: Weight vector w, reward r, and score-of-objectives vector s
1: function update-regressor
2:     ▷ Compute new weights for each objective (as a computation over input vectors)
3:     w_new = w + (1/σ²) Σ (r − s^T w) s
4:     return w_new
5: end function
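A minimal sketch of the update in Eq. 10 / Algorithm 4 is given below. How the covariance Σ and the noise variance σ² are initialized and maintained is not spelled out in the text, so the fixed values used here are assumptions of the sketch.

```python
import numpy as np

def update_regressor(w, s, r, Sigma, sigma2=1.0):
    """Eq. (10): shift the objective weights toward explaining the observed reward.
    `Sigma` is the posterior covariance of the weights and `sigma2` the observation
    noise variance (both assumed and kept fixed in this sketch)."""
    w = np.asarray(w, dtype=float)
    s = np.asarray(s, dtype=float)
    residual = r - float(s @ w)            # prediction error for this feedback
    return w + (Sigma @ s) * residual / sigma2

# toy usage: three objectives (accuracy, diversity, novelty)
w = np.array([1 / 3, 1 / 3, 1 / 3])
Sigma = np.eye(3) * 0.1                    # assumed covariance
s = np.array([0.5, 1.0, 0.4])              # scores of the item that was shown
w = update_regressor(w, s, r=1, Sigma=Sigma)   # user clicked, so weights move up
print(w)
```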

4.4. Algorithm formalization

Algorithm 5 presents the proposed Multi-Objective Ranked Bandit algorithm, which receives as input a set of bandit algorithms A, a scalarization function f, and a weighting regressor W. The scalarization function defines how we weight each objective to compute a relevance score for each item. We also assume that different scenarios require different objective prioritizations. The weighting regressor is used to dynamically update the weights for each recommendation quality metric.

The individual bandit algorithms are initialized to properly consider the set of items I, where |I| = M (Line 1). At each trial t, the algorithm interacts with target user u_t (Line 2). For each ranking position j, the base bandit algorithm instance chooses an item from a fixed item set according to the current weight vector w_t (Lines 3–5). Recall that for each rank j there is a separate bandit algorithm instance A_j, which is responsible for recommending an item for the jth position.

If the selected item has already been chosen at a higher rank, we select a new item arbitrarily (Lines 6–10). The chosen items are displayed to the user and a click (or non-click) is registered (Line 12). After the target user scrolls the recommended ranking, the algorithm processes his feedback for each jth ranking position (Lines 13–20).

We consider only binary feedback, i.e., the reward is 1 if the user clicked on the suggested item, and 0 otherwise (Lines 14–18). Hence, we update each multi-armed bandit instance A_j that corresponds to the jth ranking position according to the given reward (Line 19). Finally, we update the online linear regressor W to consider the given reward and the corresponding weighting scheme (Line 20). Note that the main differences from Algorithm 2 refer to lines 4–5 and 20. In these lines, we consider the computed weights and update them, respectively.

Algorithm 5 Multi-Objective Ranked Bandits
Require: Bandits A, scalarization function f, and weighting regressor W
 1: Initialize A_1(M), . . . , A_k(M)                       ▷ Initialize MABs
 2: for t = 1, . . . , T do
 3:     for j = 1, . . . , k do                             ▷ Sequentially choose items
 4:         w_t = get-weights(W)
 5:         î_j(t) ← select-arm(u_t, w_t, f, A_j)           ▷ Eq. 4
 6:         if j > 1 then
 7:             if î_j(t) ∈ {i_1(t), . . . , i_{j-1}(t)} then
 8:                 i_j(t) ← arbitrary unselected item
 9:             else
10:                 i_j(t) ← î_j(t)
11:             end if
12:         end if
13:     end for
14:     show {i_1(t), . . . , i_k(t)} to user; register clicks
15:     for j = 1, . . . , k do                             ▷ Find feedback for A_j
16:         if user clicked i_j(t) and î_j(t) = i_j(t) then
17:             r_jt = 1
18:         else
19:             r_jt = 0
20:         end if
21:         update-mab(u_t, î_j(t), r_jt, A_j)
22:         w_t = update-regressor(W, w_t, r_jt)            ▷ Alg. 4
23:     end for
24: end for
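Putting the pieces together, the skeleton below mirrors the control flow of Algorithm 5 for one trial under linear scalarization with a single weight vector. The callables passed in are deliberately abstract stand-ins for Algorithm 3, the chosen DMO weighting scheme and Algorithm 4; their names and signatures are assumptions of this sketch, not the paper's interface.

```python
import numpy as np

def mo_ranked_trial(score_fn, weight_fn, update_mab, update_regressor, k, clicked):
    """One trial of Algorithm 5 (sketch).
    score_fn(pos, ranking) -> (n_items x n_objectives) score matrix for that position,
    weight_fn(pos)         -> weight vector under the chosen weighting scheme,
    update_mab(pos, arm, r) and update_regressor(pos, r) apply the observed feedback,
    clicked                -> item id the user clicked, or None."""
    ranking, proposed = [], []
    for pos in range(k):                               # lines 3-13: choose items
        S = score_fn(pos, ranking)
        w = weight_fn(pos)
        order = np.argsort(S @ w)[::-1]                # scalarized scores, Eq. (4)
        best = int(order[0])
        proposed.append(best)
        shown = best if best not in ranking else int(
            next(i for i in order if i not in ranking))   # arbitrary unselected item
        ranking.append(shown)
    for pos in range(k):                               # lines 15-23: process feedback
        r = 1 if (clicked == ranking[pos] and ranking[pos] == proposed[pos]) else 0
        update_mab(pos, proposed[pos], r)
        update_regressor(pos, r)
    return ranking
```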

5. Experimental Results

In this section, we investigate the performance of our method in two real-world datasets related to news article recommendation (Yahoo! Today News) and online advertisement recommendation (KDD Cup 2012 Online Ads) (Section 5.1). Next, we describe the evaluation protocol (Section 5.2) and the several multi-armed bandit algorithms used to instantiate our algorithm (Section 5.3). Finally, in Section 5.4 we discuss the experimental results.

We emphasize that the datasets used in our experiments are publicly available on the Web [1], and that the code for the method and baselines can be found in the Github repository (www.github.com/to-be-announced, available after publication).

5.1. Datasets

We used two real-world, large-scale datasets to evaluate our algorithm: Yahoo! Today News and KDD Cup 2012 Online Advertising.

Yahoo! Today News

Personalized news recommendation is the problem of displaying relevant news articles to target users on web pages based on the prediction of their individual tastes. The recommender system uses feedback from users to build a prediction model of the users' tastes. However, there is a distinct feature in this application: its "partial-label" nature, in which the user feedback (click or not) for an article is observed only when this article is shown.

We evaluate our method on clickstream data made available by Yahoo! as part of the Yahoo! Webscope program [1]. We use the R6A version of the dataset, which contains more than 45 million user view/click log events for articles displayed on the Yahoo! Today Module of the Yahoo! Front Page during a period of ten contiguous days. We used 4.5 million events as a warm-up for the bandit algorithms (10%), and the remaining log events as test data.

Each user/system interaction event consists of four components: (i) the random article displayed to the user, (ii) user/item feature vectors, (iii) whether the user clicked on the displayed article, and (iv) the timestamp. In this dataset, displayed articles were randomly selected from an article pool. This makes the dataset ideal for unbiased, offline evaluation of multi-armed bandit algorithms. Detailed information on the data collection process and the dataset properties can be found in [21].

KDD Cup 2012 Online Advertising

Online advertising systems present relevant advertisements (ads) to target users in order to maximize the click-through rate (see Eq. 11) on recommended ads. Sponsored search is a well-known example of online advertising, where the search engine selects the most relevant ads based on the user profile and a set of search keywords. The system needs to be able to select new ads for users and learn from their feedback to improve the overall CTR estimation.

We use an advertisement dataset published by KDD Cup 2012, track 2. This dataset refers to a click log on www.soso.com (a large-scale search engine serviced by Tencent), which contains 149 million ad impressions (views of advertisements). Each instance is an ad impression that consists of: (i) the user profile (i.e., gender and age range), (ii) search keywords, (iii) displayed ad information, and (iv) the click count. We processed the data as follows. The dimension of the feature vector of the displayed ad information is 1,070,866. We first conducted a dimensionality reduction using the Latent Dirichlet Allocation topic model [4] over the keywords, title and description of ads. We used 6 topics to represent ads. Another issue related to this dataset is that click information is extremely sparse due to the large number of ads. To tackle this problem, we selected the top 50 ads that have the most impressions in the evaluation, as proposed by the authors in [32]. The generated dataset contains 9 million user visit events, from which 0.9 million events were used as warm-up for the bandit algorithms (10%) and the remaining events as test data. Note that the competition originally refers to a batch scenario. Hence, it would be unfair to consider the competition winner's solution [37] as a baseline, given that here we investigate an online learning recommendation problem.

5.2. Evaluation Protocol

In an ideal scenario, the evaluation of a new multi-armed bandit algorithm should be conducted using a bucket test, where we run the algorithm to serve a fraction of live user traffic in a real recommender system. This evaluation protocol has several drawbacks. First, not only is it an expensive method, since it requires substantial engineering effort for deployment in the real system, but it may also have a negative impact on user experience. Moreover, there is no guarantee of replicable comparison using bucket tests, as online metrics vary significantly over time [21].

On the other hand, evaluating online learning algorithms with past data is a challenging task. Here, we use the state-of-the-art method for unbiased, offline evaluation of bandit algorithms, namely BRED (Bootstrapped Replay on Expanded Data) [23]. The method is based on bootstrapping techniques and enables an unbiased evaluation by utilizing historical data. The use of bootstrapping allows us to estimate the distribution of predictions while reducing their variability.

The BRED method assumes that the set of available items in the dataset is static to perform the bootstrap resamples. We follow the same evaluation protocol presented in [23], and assume a static world on small portions of the Yahoo! Front Page dataset. We took the smallest number of portions such that a given portion has a fixed number of items, resulting in 433 portions. For the KDD Cup 2012 online ads data, since we selected the top 50 most clicked ads, we have only a single portion, which contains just the 50 mentioned ads.

Let us consider each portion as a single dataset D of size T with M possible choices (items). Then, from D, at each recommendation step, we generate P new datasets D_1, D_2, ..., D_P of size M × T by sampling with replacement.

For each dataset D_i with M_i items, we compute the estimated click-through rate (CTR) by averaging the CTR of each MAB over 100 random permutations of the data. We are considering a top-k recommendation task, i.e., we recommend, for each target user, a list of the most relevant items with size k. Hence, we also evaluate the performance of our algorithm for a variable list size. The averaged reward is equivalent to the metric CTR, which is the total reward divided by the total number of trials:

    \mathrm{CTR}@k = \frac{1}{T} \sum_{t=1}^{T} r_t,    (11)

where r_t is the sum of rewards on a ranking of size k, and T is the total number of trials. We report the relative CTR of each algorithm, which is the algorithm's CTR divided by that of the random recommender.
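For reference, the reported metric can be computed as below; the per-trial reward lists for the evaluated algorithm and for the random-recommender baseline are assumed inputs of this sketch.

```python
def ctr_at_k(rewards_per_trial):
    """Eq. (11): average, over trials, of the number of clicks collected by the
    size-k ranking shown in each trial."""
    return sum(rewards_per_trial) / len(rewards_per_trial)

def relative_ctr(algorithm_rewards, random_rewards):
    """CTR of the evaluated algorithm divided by the CTR of the random recommender,
    as reported in Tables 2 and 3."""
    return ctr_at_k(algorithm_rewards) / ctr_at_k(random_rewards)
```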

5.3. Base bandit algorithms

The proposed Multi-Objective Ranked Bandit algorithm runs a MAB instance A_i for each ranking position i. Each of the k copies of the multi-armed bandit algorithm is independent of the others. Recall that the scores computed by these algorithms are used to estimate the accuracy term in Equation 4.

MAB algorithms are faced with the challenge of sequentially balancing exploitation and exploration. This is because, while it is important to maximize rewards (exploitation), it is also crucial to better understand the options and explore uncertainties in the representation of the world (exploration). Different algorithms work in different ways to find the best trade-off between these two objectives. For background, we refer the reader to [7] and to a recent survey on regret-minimization bandits [6] (note that minimizing regret is equivalent to maximizing reward). In the following, we detail the base MAB algorithms used to instantiate our algorithm.

• Random: it randomly chooses an arm to pull.

• ε-greedy (ε): it chooses a random arm with probability ε, and chooses the arm with the highest CTR estimate with probability (1 − ε) [34].

• UCB1: it chooses the arm with the highest UCB (Upper Confidence Bound) score, according to [3].

• UCB2: the same as the above, but with a different UCB score computation [3].

• Thompson Sampling (TS): a Bayesian MAB algorithm where the (unknown) reward values are inferred from past data and summarized using a posterior distribution. It chooses each arm proportionally to its probability of being optimal under this posterior [33].

• Bayes UCB: an adaptation of UCB to the Bayesian setting, where the upper confidence bound score is given by an upper quantile of the posterior [16].

• LinUCB (α): a classical contextual bandit algorithm that computes the expected reward of each arm by finding a linear combination of the previous rewards of the arm [20]. The contextual aspect of this algorithm refers to its capability of considering user features, such as gender and age, to recommend items. The α parameter is used to control the balance of exploration and exploitation.

Note that we are unable to use multi-objective multi-armed bandit algorithms (MO-MAB) as base algorithms (Section 2). First, MO-MAB algorithms consider the selection of a single item in each trial, and we are interested in returning a ranking of items to users. Even if we consider MO-MAB algorithms for the case where k = 1, it is not simple to understand the behavior of these algorithms as base MABs. The challenge is that MO-MAB algorithms try to consider multiple objectives by construction, and so does our multi-objective ranked bandit algorithm. In this case, it turns out to be really difficult to separate the contributions of each multi-objective strategy (MO-MAB and MO ranked bandits) in the final results if we combine them. Investigating modifications in the multi-objective ranked bandits algorithm to consider MO-MABs is an interesting line of research.

5.4. Experimental Results

In this section we empirically analyze the performance of our multi-objective algorithm with different policies to define the priorities of the different quality metrics. The results are compared with the state-of-the-art method of the literature for single-objective optimization, referred to in this section as SO (Ranked Bandits with optimized accuracy).

We compare static multi-objective prioritization schemes, presented in [17], with the dynamic schemes introduced in Section 4, namely DMO-USER, DMO-RANK, and DMO-SYSTEM. In the static scheme, we fix the priority of each quality metric of interest beforehand using a weight vector w = (w_a, w_d, w_n), where we set fixed weights for the priority of accuracy (w_a), diversity (w_d), and novelty (w_n). We evaluate the following four priority schemes:

• SMO-EQ [w = (0.30, 0.30, 0.30)]: establishes equal priorities for each objective.

• SMO-ACC [w = (0.50, 0.25, 0.25)]: prioritizes accuracy over diversity and novelty.

• SMO-DIV [w = (0.25, 0.50, 0.25)]: prioritizes diversity over accuracy and novelty.

• SMO-NOV [w = (0.25, 0.25, 0.50)]: prioritizes novelty over accuracy and diversity.

5.4.1. On overall CTR

In Tables 2 and 3, we present the results for the Yahoo! and KDD datasets, respectively. We test the algorithms with different values of CTR@k, where k ∈ {1, 3, 5, 10} is the number of recommendations displayed (and, consequently, the number of MABs instantiated). Since we reach similar conclusions for all values, we present only CTR@5 in this section. Sections 5.4.2 and 5.4.3 analyze the impact of the parameter k on the methods' performance. Note that for the base bandit algorithms ε-greedy (EG) and LinUCB, we enumerate different values for the parameters ε and α, respectively. We highlight gains in relative CTR@5 over the Single-Objective baseline (SO) in bold.

Table 2: Relative CTR@5 on Yahoo! Today News data. (Columns SMO-* use the static weighting schemes; columns DMO-* use the dynamic schemes.)

Linear scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.5291  1.5336  1.4660  1.5197  1.4348  1.5009   1.5984   1.4885
EG(0.05)       1.5565  1.5538  1.5227  1.4957  1.5565  1.4840   1.5800   1.5114
EG(0.10)       1.5443  1.5125  1.5314  1.5085  1.5190  1.5025   1.5287   1.5163
EG(0.20)       1.5060  1.5403  1.5825  1.4812  1.4870  1.5197   1.5422   1.5307
EG(0.30)       1.4948  1.5132  1.5204  1.4854  1.5643  1.4848   1.5942   1.5045
LinUCB(0.01)   1.5379  1.5148  1.4833  1.5276  1.5235  1.4875   1.5619   1.5294
LinUCB(0.10)   1.5271  1.5264  1.5006  1.4675  1.5243  1.4844   1.5322   1.5170
LinUCB(0.30)   1.5471  1.5358  1.4692  1.4752  1.5005  1.4583   1.5138   1.5146
TS             1.5581  1.4661  1.4568  1.5082  1.5191  1.4825   1.6269   1.6112
UCB1           1.4880  1.4144  1.3887  1.4309  1.4347  1.4306   1.5811   1.5381
UCB2           1.5187  1.5525  1.5390  1.5581  1.5365  1.4991   1.6052   1.6198
BayesUCB       1.5339  1.4411  1.4091  1.4366  1.4377  1.4264   1.6224   1.5915

Chebyshev scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.5291  1.5919  1.5506  1.6141  1.6503  1.6179   1.5091   1.4887
EG(0.05)       1.5565  1.6485  1.5126  1.5966  1.6522  1.6227   1.5296   1.5121
EG(0.10)       1.5443  1.6365  1.5708  1.5910  1.6644  1.6265   1.5535   1.5082
EG(0.20)       1.5060  1.6305  1.5116  1.5828  1.6386  1.5921   1.5404   1.5476
EG(0.30)       1.4948  1.5842  1.4906  1.5917  1.6450  1.5722   1.5617   1.4600
LinUCB(0.01)   1.5379  1.5288  1.4676  1.4379  1.5265  1.5356   1.5058   1.5099
LinUCB(0.10)   1.5271  1.5347  1.4837  1.4856  1.5371  1.5201   1.5527   1.5517
LinUCB(0.30)   1.5471  1.5199  1.5011  1.4770  1.5181  1.5338   1.5357   1.5322
TS             1.5581  1.5463  1.5230  1.5607  1.5920  1.5900   1.5714   1.5169
UCB1           1.4880  1.5192  1.4669  1.5661  1.5221  1.5030   1.4782   1.4760
UCB2           1.5187  1.4970  1.4933  1.5469  1.5814  1.5540   1.4490   1.4542
BayesUCB       1.5339  1.5574  1.5233  1.5405  1.5306  1.5651   1.5430   1.4832
We start by comparing the overall CTR values for the linear and Chebyshev scalarization functions. The results for Chebyshev scalarization are better than those for linear scalarization on both datasets. Considering the overall relative CTR, we have gains of 9.3% and 11.2% for the linear and Chebyshev scalarization functions, respectively. We believe this happens because, under certain conditions, the Chebyshev scalarization function is able to find all the points in a non-convex Pareto set [9], which makes it able to find a better compromise between the different objectives (metrics), recommending more relevant items to users. Note that, for the KDD dataset, the best overall improvement in terms of CTR happens for the combination of the Chebyshev function with Thompson Sampling and a dynamic multi-objective weighting scheme. When considering the Yahoo! dataset, the best results were obtained using the combination of the Chebyshev function with the ε-greedy algorithm (ε = 0.10) and a dynamic multi-objective weighting scheme.

Considering the overall values of CTR obtained when using static or dynamic prioritization of the recommendation quality metrics (accuracy, diversity, and novelty), for both datasets the best CTR performance happens when using dynamic weighting. Specifically, for the Yahoo! dataset, we reach an 8.0% overall CTR improvement of dynamic weighting over static weighting schemes, whereas in the KDD dataset the improvement is 2.5%.

For the Yahoo! dataset, the best performance of the static weighting scheme is achieved when prioritizing novelty over the other recommendation metrics (i.e., SMO-NOV). For the KDD dataset, in turn, the prioritization of diversity (i.e., SMO-DIV) reaches the best performance in terms of relative CTR. Since the weights are static, the recommendation of diverse and novel rankings is more likely to find relevant items than just prioritizing accuracy, for instance.

Considering dynamic weighting schemes, in both datasets the best scheme is to consider a specific prioritization for each ranking position (DMO-RANK). When we consider the dynamic multi-objective weighting per user (DMO-USER), we are able to improve baseline performance mostly under Chebyshev scalarization. It seems hard to find good recommendation-metric weighting schemes when considering a linear relationship between these metrics. Furthermore, the dynamic multi-objective weighting per system (DMO-SYSTEM) presents the worst performance compared to the other weighting schemes. Since it considers a common prioritization scheme for all users and ranking positions, we believe it is harder for it to correctly prioritize the recommendation metrics.

Table 3: Relative CTR@5 on KDD Cup 2012 Online Ads data. (Columns SMO-* use the static weighting schemes; columns DMO-* use the dynamic schemes.)

Linear scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.7504  1.6444  1.4809  1.5681  1.6351  1.5406   1.8756   1.5867
EG(0.05)       1.6702  1.6297  1.8049  1.5532  1.7044  1.6264   1.7112   1.5364
EG(0.10)       1.5525  1.7390  1.5511  1.7796  1.6323  1.4538   1.7901   1.4300
EG(0.20)       1.5997  1.7662  1.5939  1.4301  1.7022  1.4578   1.7498   1.7998
EG(0.30)       1.8229  1.5608  1.6831  1.4801  1.4996  1.7413   1.6511   1.6796
LinUCB(0.01)   1.8719  1.6399  1.5800  1.5747  1.5612  1.4865   1.6069   1.5834
LinUCB(0.10)   1.7580  1.6798  1.5675  1.5340  1.7559  1.6096   1.6919   1.4677
LinUCB(0.30)   1.9600  1.4623  1.5262  1.5713  1.4506  1.5624   1.6790   1.6760
TS             1.8472  1.4284  1.7169  1.6447  1.5593  1.4457   1.9494   1.9081
UCB1           1.6841  1.4890  1.4037  1.3775  1.4787  1.4941   1.8884   1.6600
UCB2           1.5925  1.5065  1.8330  1.6817  1.5109  1.6799   1.9039   1.7978
BayesUCB       1.7012  1.4583  1.4412  1.6279  1.4945  1.5782   1.6030   1.6006

Chebyshev scalarization
Algorithm      SO      SMO-EQ  SMO-ACC SMO-DIV SMO-NOV DMO-USER DMO-RANK DMO-SYSTEM
Random         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000   1.0000   1.0000
EG(0.01)       1.7504  1.8296  1.6705  1.7794  2.0273  1.7734   1.9141   1.7002
EG(0.05)       1.6702  1.8499  1.9400  1.7724  1.9430  1.9100   1.7773   1.6401
EG(0.10)       1.5525  2.0298  1.6972  2.0300  1.9175  1.6726   1.9660   1.5086
EG(0.20)       1.5997  2.0166  1.6233  1.6239  2.0228  1.6221   1.8832   1.9673
EG(0.30)       1.8229  1.7457  1.7740  1.6896  1.6747  1.9929   1.7285   1.7535
LinUCB(0.01)   1.8719  1.7756  1.6747  1.5834  1.6702  1.6350   1.6547   1.6706
LinUCB(0.10)   1.7580  1.8146  1.6576  1.6609  1.9105  1.7686   1.8427   1.5955
LinUCB(0.30)   1.9600  1.5360  1.6668  1.6854  1.5595  1.7612   1.8312   1.8226
TS             1.8472  1.6020  1.9409  1.8270  1.7450  1.6489   2.0395   1.9435
UCB1           1.6841  1.7116  1.5813  1.6017  1.6759  1.6788   1.9117   1.7085
UCB2           1.5925  1.5441  1.9254  1.7908  1.6548  1.8742   1.8597   1.7355
BayesUCB       1.7012  1.6809  1.6633  1.8807  1.7010  1.8614   1.6229   1.5902
Also observe that, for all base algorithms, there is a combination of weighting scheme and scalarization function that achieves better CTR performance than the state-of-the-art single-objective baseline. The best performance for the Yahoo! dataset is obtained by the ε-greedy algorithm (EG 0.10), with a CTR improvement of 7.8%. For the KDD dataset, the best performance is obtained by the Thompson Sampling algorithm, with gains of 10.4% over the baseline in terms of CTR. On average, the gains over the single-objective algorithm are 4.7% and 15.8% for the Yahoo! and KDD datasets, respectively.

Furthermore, for the Yahoo! dataset, we observe that the best single-objective base algorithm is Thompson Sampling. However, this pattern changes as we start considering diversity and novelty, which leads ε-greedy to achieve the best CTR. We have a similar pattern in the KDD dataset, where the best single-objective base algorithm is LinUCB, whereas the best multi-objective base algorithm is Thompson Sampling. Since we have gains in both situations, we may conclude that considering multiple objectives simultaneously is important to provide better recommendations to users.

Moreover, the choice of weighting prioritization scheme and scalarization function that provides the best performance is dataset dependent. For instance, for the KDD dataset, the multi-objective versions of LinUCB and Bayes UCB exhibit no gains over the single-objective baseline using linear scalarization. However, when considering Chebyshev scalarization, the performance achieves gains of up to 8.7% for the same algorithms (i.e., LinUCB and Bayes UCB).

In conclusion, we present gains in CTR performance over the baselines when considering multiple objectives simultaneously. Second, we show improvements using dynamic weighting schemes over static ones. This shows that our state-space model is able to correctly prioritize among the recommendation metrics. It is well known that recommendation quality metrics may conflict among themselves [24]. For instance, it is relatively easy to produce a highly diversified ranking that is poor in accuracy.

the Chebyshev scalarization function deals better with these conflicts, providing better results. Possibly, since the Chebyshev scalarization function is able to find all points in a non-convex Pareto set, it is more likely to select the set of items that is best for each trial.

If we were to recommend a single combination of base algorithm, scalarization function and weighting scheme, we would suggest, for the Yahoo! dataset, the ε-greedy algorithm [EG(0.10)], the Chebyshev scalarization function and the SMO-NOV scheme, whereas for the KDD dataset the suggestion is different: the Thompson Sampling algorithm, the Chebyshev scalarization function and the DMO-RANK scheme.
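To illustrate how such a configuration is assembled, the sketch below wires one Thompson Sampling bandit per ranking position to a Chebyshev-scalarized reward over accuracy, diversity and novelty, with a DMO-RANK-style per-position weight schedule. All interfaces here (ThompsonSamplingBandit, the metric stubs and dmo_rank_weights) are hypothetical placeholders under simplifying assumptions, not the implementation evaluated in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson Sampling over a fixed set of candidate items."""
    def __init__(self, n_items):
        self.alpha = np.ones(n_items)
        self.beta = np.ones(n_items)

    def select(self, excluded):
        samples = rng.beta(self.alpha, self.beta)
        samples[list(excluded)] = -np.inf  # never repeat an item already ranked
        return int(np.argmax(samples))

    def update(self, item, reward):
        # Fractional (non-binary) rewards are a simplification of the usual
        # Bernoulli update; enough for a sketch.
        self.alpha[item] += reward
        self.beta[item] += 1.0 - reward

# Stubs for the per-objective quality metrics; in practice these are computed
# from usage data and item content.
def diversity_score(item, ranking): return rng.random()
def novelty_score(item, ranking):   return rng.random()

def dmo_rank_weights(position, n_positions):
    """Illustrative DMO-RANK-style prioritization: accuracy dominates at the
    top of the ranking, diversity and novelty gain weight further down."""
    t = position / max(n_positions - 1, 1)
    w = np.array([1.0 - 0.5 * t, 0.25 + 0.25 * t, 0.25 + 0.25 * t])
    return w / w.sum()

def chebyshev(scores, weights, eps=1e-3):
    z = scores.min() - eps
    return float(np.min(weights * (scores - z)))

N_ITEMS, K = 50, 5
bandits = [ThompsonSamplingBandit(N_ITEMS) for _ in range(K)]  # one bandit per rank

def recommend_and_learn(clicked_item=None):
    ranking = []
    for bandit in bandits:  # build the list top-down
        ranking.append(bandit.select(excluded=set(ranking)))
    for pos, (bandit, item) in enumerate(zip(bandits, ranking)):
        accuracy = 1.0 if item == clicked_item else 0.0
        scores = np.array([accuracy,
                           diversity_score(item, ranking),
                           novelty_score(item, ranking)])
        bandit.update(item, chebyshev(scores, dmo_rank_weights(pos, K)))
    return ranking

print(recommend_and_learn(clicked_item=7))
```

Swapping ThompsonSamplingBandit for an ε-greedy or LinUCB selection rule, or dmo_rank_weights for a static SMO-* weight vector, corresponds to the other configurations compared above.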
5.4.2. On the scalarization function

We also analyze the performance considering the Thompson Sampling algorithm as the MAB algorithm to suggest k items per trial, where k ∈ {1, 3, 5, 10}, for the different scalarization functions. We chose these values of k to approximate the usual number of items suggested by real recommender systems, and the Thompson Sampling algorithm because it obtained the best overall results in the experiments reported in the previous section. In Figures 3(a–d) and Figures 3(e–h), we show the results for the Yahoo! and KDD datasets, respectively. We report the relative CTR of the multi-objective weighting schemes, i.e., their CTR divided by the CTR of the single-objective baseline.
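As a concrete reading of this quantity, the snippet below computes a relative CTR as the ratio between the CTR of a configuration and the CTR of the chosen baseline (the single-objective version in Figure 3); the click and impression counts are made-up illustrative numbers, not figures from our logs.

```python
def relative_ctr(clicks, impressions, baseline_clicks, baseline_impressions):
    """Relative CTR: the CTR of a configuration divided by the CTR of the
    baseline; values above 1.0 indicate an improvement over the baseline."""
    return (clicks / impressions) / (baseline_clicks / baseline_impressions)

# A relative CTR of 1.078 corresponds to the 7.8% improvement reported above.
print(relative_ctr(clicks=539, impressions=10_000,
                   baseline_clicks=500, baseline_impressions=10_000))
```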
Again, the Chebyshev scalarization function reaches a better performance than the linear scalarization function. For instance, in the KDD dataset we have gains in all weighting schemes for k = 1, 3, and 10 (Figures 3(e,g,h)). A similar pattern occurs in the Yahoo! dataset. However, it seems that when we have a dynamic weighting prioritization scheme for each ranking position (DMO-RANK) the linear scalarization function has a slightly better performance (see Figures 3(a–d)).

5.4.3. On the number of items recommended

In this section, we investigate the impact on CTR performance of varying the number k of displayed recommendations for different MAB algorithms. In Figures 4(a–b), we present the best CTR values for the Yahoo! dataset, whereas in Figures 4(c–d), we show the CTR values for the KDD dataset. To help in understanding the graphs, we select a representative subset of base bandit algorithms, namely, (i) ε-greedy with ε = 0.05 [EG(0.05)], (ii) LinUCB with α = 0.10 [LinUCB(0.10)], (iii) Thompson Sampling (TS), (iv) UCB1, (v) UCB2, and (vi) Bayes UCB.

As expected, for almost all algorithms (except for EG(0.10) in Figure 4(a) and Bayes UCB in Figures 4(b,d)), it is more difficult to recommend relevant items in the first ranking position (i.e., k = 1). As the size of the ranking increases (i.e., k = 3, 5, 10), the relative CTR improves. Figure 4 also shows that no single base bandit algorithm has the best performance in all evaluated scenarios, although TS and EG(0.1) are overall more consistent in their results. For instance, in Figure 4(a), we show the results for the Yahoo! dataset considering the linear scalarization; here, the maximum CTR@1 is given by the ε-greedy algorithm, whereas in Figure 4(b) the Bayes UCB algorithm has the best performance. Finally, we also see that the range of CTR values depends on the dataset used. Specifically, in Figures 4(a–b), for the Yahoo! dataset, the relative CTR values range from approximately 1.3 to 1.7, while in Figures 4(c–d), for the KDD dataset, they range from approximately 1.2 to 2.1.

6. Conclusions and Future Work

This paper proposed a multi-objective ranked bandits algorithm for online recommendation. The algorithm relies on four main components: a scalarization function, a set of recommendation quality metrics, a dynamic prioritization scheme for weighting these metrics and a base MAB algorithm.

We have built upon previous work to come up with non-linear scalarization functions, capable of finding all points in a non-convex Pareto set, and proposed a method to adapt the weighting scheme online for three different recommendation quality metrics, namely accuracy, novelty and diversity, based on a state-space model. Three variants of the dynamic weighting method were proposed, considering the perspectives of the user, the recommender system and the ranking. The algorithm is flexible enough to accommodate different multi-armed bandit algorithms proposed in the literature.

Extensive experiments were performed with two real-world large-scale datasets, where the results in terms of click-through rate were compared with the single-objective version of the method, considering different variations of the four aforementioned components. The results showed that the Chebyshev scalarization function together with a dynamic weighting scheme provides the best results for both datasets when compared to the single-objective version and other scalarization/weighting strategies. The results also show that the Thompson Sampling and ε-greedy algorithms yield more consistent results across recommendation lists of different sizes. In particular, our approach provides a 7.8% and a 10.4% lift in CTR compared to the state-of-the-art approach for the Yahoo! Today News dataset and the KDD Cup 2012 Online Ads dataset, respectively.
Figure 3: Relative CTRs of the multi-objective weighting schemes (SMO-EQ, SMO-ACC, SMO-DIV, SMO-NOV, DMO-USER, DMO-RANK, DMO-SYSTEM; x-axis) under linear and Chebyshev scalarization, for different ranking lengths (i.e., different k values). Panels (a–d) show the Yahoo! Today News data and panels (e–h) the KDD Cup 2012 Online Ads data, for k = 1, 3, 5 and 10, respectively; the y-axis is the relative CTR.
As future work, a promising extension is to analyze the impact of other relevant objectives on recommendation quality, such as coverage and serendipity [29]. Another possible extension is to consider different base MABs for each ranking position; in this case, an analysis of both the empirical performance and the theoretical properties of this mixed ranked bandit algorithm would be of interest. Since we were successful in considering multiple objectives in recommender systems, it would also be interesting to adapt the algorithm to account for queries and documents in a search engine within the online learning to rank setting. We also intend to investigate situations where we have no usage history of users and items, known as the cold-start problem [30], by modeling it as an exploration/exploitation problem. Finally, in order to deal with sparsity, we believe another good line of investigation is adding sparsity as an objective to the proposed algorithm.

Acknowledgments

The authors gratefully acknowledge the financial support of CNPq; CEFET-MG, under Proc. PROPESQ-10314/14; FAPEMIG, under Proc. FAPEMIG-PRONEX-MASWeb (Models, Algorithms and Systems for the Web), process number APQ-01400-14; and the Microsoft Azure Sponsorship–CEFET-MG.
Figure 4: Relative CTRs for different ranking lengths (x-axis: rank k = 1, 3, 5, 10; y-axis: relative CTR), for the base bandits EG(0.10), LinUCB(0.10), TS, UCB1, UCB2 and BayesUCB. Panels (a–b) show the Yahoo! Today News data and panels (c–d) the KDD Cup 2012 Online Ads data, under linear scalarization (a, c) and Chebyshev scalarization (b, d).
References

[1] Yahoo! Webscope Program. http://webscope.sandbox.yahoo.com.
[2] C. Anderson. The long tail: how endless choice is creating unlimited demand. Random House, 2007.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[5] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.
[6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
[7] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[8] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46. ACM, 2010.
[9] M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: a study. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–8. IEEE, 2013.
[10] G. Eichfelder. Adaptive scalarization methods in multiobjective optimization. Springer, 2008.
[11] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. In RecSys, pages 257–260. ACM, 2010.
[12] L. A. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 478–479. ACM, 2004.
[13] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 124–131. ACM, 2009.
[14] K. Hofmann, A. Schuth, A. Bellogin, and M. De Rijke. Effects of position bias on click-based recommender evaluation. In Advances in Information Retrieval, pages 624–630. Springer, 2014.
[15] B. Horsburgh, S. Craw, and S. Massie. Learning pseudo-tags to augment sparse tagging in hybrid music recommender systems. Artificial Intelligence, 219:25–39, 2015.
[16] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012.
[17] A. Lacerda. Contextual bandits with multi-objective payoff functions. In Proceedings of the 4th Brazilian Conference on Intelligent Systems (BRACIS), pages 1–8. IEEE, 2015.
[18] A. Lacerda, A. Veloso, and N. Ziviani. Exploratory and interactive daily deals recommendation. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 439–442, New York, NY, USA, 2013. ACM.
[19] K. Lerman and T. Hogg. Leveraging position bias to improve peer recommendation. PLoS ONE, 9(6):e98914, 2014.
[20] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670, 2010.
[21] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.
[22] M. Lubatkin and S. Chatterjee. Extending modern portfolio theory into the domain of corporate diversification: does it apply? Academy of Management Journal, 37(1):109–136, 1994.
[23] J. Mary, P. Preux, and O. Nicol. Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 172–180, 2014.
[24] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI '06 Extended Abstracts on Human Factors in Computing Systems, pages 1097–1101. ACM, 2006.
[25] K. P. Murphy. Machine learning: a probabilistic perspective. 2012.
[26] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
[27] M. T. Ribeiro, A. Lacerda, A. Veloso, and N. Ziviani. Pareto-efficient hybridization for multi-objective recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, pages 19–26. ACM, 2012.
[28] M. T. Ribeiro, N. Ziviani, E. S. D. Moura, I. Hata, A. Lacerda, and A. Veloso. Multiobjective pareto-efficient approaches for recommender systems. ACM Transactions on Intelligent Systems and Technology, 5(4):53:1–53:20, 2015.
[29] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. Recommender systems handbook, volume 1. Springer, 2011.
[30] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253–260. ACM, 2002.
[31] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: learning diverse rankings over large document collections. The Journal of Machine Learning Research, 14(1):399–436, 2013.
[32] L. Tang, Y. Jiang, L. Li, and T. Li. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80. ACM, 2014.
[33] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[34] M. Tokic. Adaptive ε-greedy exploration in reinforcement learning based on value differences. In KI 2010: Advances in Artificial Intelligence, pages 203–210. Springer, 2010.
[35] S. Vargas. Novelty and diversity enhancement and evaluation in recommender systems. PhD thesis, Universidad Autónoma de Madrid, 2015.
[36] S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In RecSys, pages 109–116, 2011.
[37] K.-W. Wu, C.-S. Ferng, C.-H. Ho, A.-C. Liang, C.-H. Huang, W.-Y. Shen, J.-Y. Jiang, M.-H. Yang, T.-W. Lin, C.-P. Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012. In ACM SIGKDD KDD-Cup Workshop, 2012.
[38] S. Yahyaa, M. M. Drugan, and B. Manderick. Scalarized and Pareto knowledge gradient for multi-objective multi-armed bandits. In Transactions on Computational Collective Intelligence XX, pages 99–116. Springer, 2015.
[39] S. Q. Yahyaa, M. M. Drugan, and B. Manderick. The scalarized multi-objective multi-armed bandit problem: an empirical study of its exploration vs. exploitation tradeoff. In Neural Networks (IJCNN), 2014 International Joint Conference on, pages 2290–2297. IEEE, 2014.
[40] M. Zhang and N. Hurley. Avoiding monotony: improving the diversity of recommendation lists. In RecSys, pages 123–130, 2008.
[41] C. Ziegler, S. McNee, J. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.
[42] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. Da Fonseca. Performance assessment of multiobjective optimizers: an analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003.
[43] Y. Zuo, M. Gong, J. Zeng, L. Ma, and L. Jiao. Personalized recommendation based on evolutionary multi-objective optimization. IEEE Computational Intelligence Magazine, 10(1):52–62, 2015.
Biography

Anísio M. Lacerda received the Bachelor degree in computer science and the MS and PhD degrees in computer science from the Universidade Federal de Minas Gerais (UFMG), Brazil, in 2005, 2008, and 2013, respectively. He has been an associate professor of computer engineering at the Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG) since 2015. His research interests are in computer science, with emphasis on Information Retrieval and Recommender Systems, acting on the following topics: algorithm design, web advertising and knowledge representation.