Towards the D-Optimal Online Experiment Design for Recommender Selection
Da Xu (Walmart Labs, Sunnyvale, California, USA), Da.Xu@walmartlabs.com
Chuanwei Ruan* (Walmart Labs, Sunnyvale, California, USA), RuanChuanwei@gmail.com
Evren Korpeoglu (Walmart Labs, Sunnyvale, California, USA), EKorpeoglu@walmart.com
The production scenario that motivates our work is to choose from several candidate recommenders who have shown comparable performances in offline evaluation. By the segment analysis, we find each recommender more favorable to specific customer groups, but again the conclusion cannot be fully drawn due to the exposure and selection bias in the history data. In other words, while it is safe to launch each candidate online, we still need randomized experiments to explore each candidate's real-world performance for different customer groups. We want to design the experiment by accounting for the customer features (e.g. their segmentation information) to minimize the cost of trying suboptimal recommenders on a customer group. Notice that our goal deviates from the traditional controlled experiments because we care more about minimizing the cost than drawing rigorous inference. In the sequel, we characterize our mission as a recommender-wise exploration-exploitation problem, a novel application to the best of our knowledge. Before we proceed, we illustrate the fundamental differences between our problem and learning the ensembles of recommenders [24]. The business metrics, such as GMV, are random quantities that depend on the recommended contents as well as the distributions that govern customers' decision-making. Even if we had access to those distributions, we would never know in advance the conditional distribution given the recommended contents. Therefore, the problem cannot be appropriately described by any fixed objective for learning the recommender ensembles.

In our case, the exploration-exploitation strategy can be viewed as a sequential game between the developer and the customers. In each round t = 1, ..., n, where the role of n will be made clear later, the developer chooses a recommender a_t ∈ {1, ..., k} that produces the content c_t, e.g. the top-k recommendations, according to the front-end request r_t, e.g. customer id, user features, page type, etc. Then the customer reveals the reward y_t, such as click or not-click. The problem setting resembles that of the multi-armed bandit (MAB) by viewing each recommender as an action (arm). The front-end request r_t, together with the recommended content c_t = a_t(r_t), can be thought of as the context. Obviously, the context is informative of the reward because clicking will depend on how well the content matches the request. On the other hand, a (randomized) experiment design can be characterized by a distribution π over the candidate recommenders, i.e. $0 < \pi_t(a) < 1$ and $\sum_{i=1}^{k} \pi_t(i) = 1$ for t = 1, ..., n. We point out that a formal difference between our setting and the classical contextual bandit is that the context here depends on the candidate actions. Nevertheless, its impact becomes negligible if choosing the best set of contents is still equivalent to choosing the optimal action. Consequently, the goal of finding the optimal experimentation can be readily converted to optimizing π_t, which is aligned with the bandit problems. The intuition is that by optimizing π_t, we refine the estimation of the structures between context and reward, e.g. via supervised learning, at a low exploration cost.

The critical concern of doing exploration in e-commerce, perhaps more worrying than in other domains, is that irrelevant recommendations can severely harm user experience and stickiness, which directly relates to GMV. Therefore, it is essential to leverage the problem-specific information, both the contextual structures and the prior knowledge, to further design the randomized strategy for higher efficiency. We use the following toy example to illustrate our argument.

Example 1. Suppose that there are six items in total, and the front-end request consists of a uni-variate user feature $r_t \in \mathbb{R}$. The reward mechanism is given by the linear model:
$$Y_t = \theta_1 \cdot I_1 \times X_t + \dots + \theta_6 \cdot I_6 \times X_t,$$
where $I_j$ is the indicator variable of whether item $j$ is recommended. Consider the top-3 recommendations from four candidate recommenders as follows (in the format of one-hot encoding):
$$a_1(r_t) = [1, 1, 1, 0, 0, 0]; \quad a_2(r_t) = [0, 0, 0, 1, 1, 1];$$
$$a_3(r_t) = [0, 0, 1, 0, 1, 1]; \quad a_4(r_t) = [0, 0, 0, 1, 1, 1]. \tag{1}$$
If each recommender is explored with equal probability, the role of $a_1$ is underrated, since it is the only recommender that provides information about $\theta_1$ and $\theta_2$. Also, $a_2$ and $a_4$ give the same outputs, so their exploration probabilities should be discounted by half. Similarly, the information provided by $a_3$ can be recovered by $a_1$ and $a_4$ (or $a_2$) combined, so there is a linear dependency structure we may leverage.

The example is representative of the real-world scenario, where the one-hot encodings and user features may simply be replaced by pre-trained embeddings. So far, we have provided an intuitive understanding of the benefits of good online experiment designs. In Section 2, we introduce the notation and the formal background of bandit problems. We then summarize the relevant literature in Section 3. In Section 4, we present our optimal design methods and describe the corresponding online infrastructure. Both the simulation studies and the real-world deployment analysis are provided in Section 5. We summarize the major contributions as follows.
• We provide a novel setting for online recommender selection via the lens of exploration-exploitation.
• We present an optimal experiment approach and describe the infrastructure and implementation.
• We provide both open-source simulation studies and real-world deployment results to illustrate the efficiency of the approaches studied.

2 BACKGROUND
We start by summarizing our notation in Table 1. By convention, we use lower- and upper-case letters to denote scalars and random variables, and bold-font lower- and upper-case letters to denote vectors and matrices. We use [k] as a shorthand for the set {1, 2, ..., k}. The randomized experiment strategy (policy) is a mapping from the collected data to the recommenders, and it should maximize the overall reward $\sum_{t=1}^{n} Y_t$. The interactive process of the online recommender selection can be described as follows.
1. The developer receives a front-end request $r_t \sim P_{\text{request}}$.
2. The developer computes the feature representation that combines the request and the outputs from all candidate recommenders: $X_t := \phi\big(r_t, \{a_i(r_t)\}_{i=1}^{k}\big)$.
3. The developer chooses a recommender $a_t$ according to the randomized experiment strategy $\pi(a \mid X_t, \vec{h}_t)$.
4. The customer reveals the reward $y_t$.
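The structural claims in Example 1 (the duplicated outputs of a2 and a4, the unique role of a1 for θ1 and θ2, and the linear dependency among the four outputs) can be checked numerically. Below is a minimal sketch, assuming NumPy; the matrix `A` hard-codes the four one-hot outputs from the example:

```python
import numpy as np

# One-hot top-3 outputs of the four candidate recommenders in Example 1.
A = np.array([
    [1, 1, 1, 0, 0, 0],  # a1
    [0, 0, 0, 1, 1, 1],  # a2
    [0, 0, 1, 0, 1, 1],  # a3
    [0, 0, 0, 1, 1, 1],  # a4
])

# a2 and a4 produce identical outputs, so exploring both is redundant.
duplicated = np.array_equal(A[1], A[3])

# Without a1, no recommender ever exposes items 1 and 2, so a design that
# under-samples a1 collects no information about theta_1 and theta_2.
exposure_without_a1 = A[1:, :2].sum()

# The four outputs span only three independent directions in R^6, which is
# the linear dependency structure an optimal design can exploit.
rank = np.linalg.matrix_rank(A)

print(duplicated, exposure_without_a1, rank)  # True 0 3
```

A uniform exploration policy ignores all three facts; the optimal designs developed in Section 4 exploit them explicitly.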
Towards the D-Optimal Online Experiment Design for Recommender Selection KDD ’21, August 14–18, 2021, Virtual Event, Singapore
In particular, the selected recommender a_t depends on the request, the candidate outputs, as well as the history data:
$$a_t \sim \pi\Big(a \;\Big|\; r_t,\; \phi\big(r_t, \{a_i(r_t)\}_{i=1}^{k}\big),\; \vec{h}_t\Big).$$

... the collected data. Suppose we have an optimization oracle that returns $\hat{\theta}$ by fitting the empirical observations; then all the above algorithms can be converted to the context-aware setting.

In terms of real-world applications, online advertisement and news recommendation [3, 30, 31] are perhaps the two major domains where contextual bandits are investigated. Bandits have also been applied to our related problems such as item recommendation [26, 32, 55] and recommender ensembles [9, 51]. To the best of our knowledge, none of the previous work studies contextual bandits for recommender selection.

4 METHODOLOGIES
As we discussed in the literature review, stage-wise (phased) exploration and exploitation appeals to our problem because of its computational advantage and deployment flexibility. To apply stage-wise exploration-exploitation to online recommender selection, we describe a general framework in Algorithm 1.

Algorithm 1: Stage-wise exploration and exploitation
Input: Reward model $f_{\theta}(\cdot)$; the restart criteria; the initialized history data $\vec{h}_t$.
1 while total rounds ≤ n do
2   if the restart criteria is satisfied then
3     Reset $\vec{h}_t$;
4   end
5   Play $n_1$ rounds of random exploration, for instance $\pi(a \mid \cdot) = 1/k$, and collect the observations into $\vec{h}_t$;
6   Find the optimal $\hat{\theta}$ based on $\vec{h}_t$ (e.g. via empirical-risk minimization);
7   Play $n_2$ rounds of exploitation with $a_t = \arg\max_a f_{\hat{\theta}}\big(r_t, a(r_t)\big)$;
8 end

The algorithm is deployment-friendly because Step 5 only involves front-end and cache operations, Step 6 is essentially a batch-wise training on the back-end, and Step 7 applies directly to the standard front-end inference. Hence, the algorithm requires little modification of the existing infrastructure that supports real-time model inference. Several additional advantages of the stage-wise algorithms include:
• the numbers of exploration and exploitation rounds, which decide the proportion of traffic for each task, can be adaptively adjusted by the resource availability and the response-time service level agreements;
• the non-stationary environment, which is often detected via the hypothesis testing methods described in [5, 8, 33], can be handled by setting the restart criteria accordingly.

4.1 Optimal designs for exploration
This section is dedicated to improving the efficiency of the exploration in Step 5. Throughout this paper, we emphasize the importance of leveraging the case-specific structures to minimize the number of exploration steps it may take to collect equal information for estimating θ. Recall from Example 1 that one particular structure is the relation among the recommended contents, whose role can be thought of as the design matrix in linear regression. Towards that end, our goal is aligned with the optimal design in the classical statistics literature [37], since both tasks aim at optimizing how the design matrix is constructed. Following the previous buildup, the reward model has one of the following forms:
$$y_t = \begin{cases} \theta^{\top}\phi\big(r_t, a(r_t)\big), & \text{linear model,} \\ f_{\theta}\big(\phi(r_t, a(r_t))\big), & \text{for some nonlinear } f_{\theta}(\cdot). \end{cases} \tag{2}$$
We start with the frequentist setting, i.e. θ does not admit a prior distribution. In each round t, we try to find an optimal design $\pi(\cdot \mid \cdot)$ such that the action sampled from π leads to the maximum information for estimating θ. For statistical estimators, the Fisher information is a key quantity for evaluating the amount of information in the observations. For the general $f_{\theta}$, the Fisher information under (2) is given by:
$$M(\pi) = \sum_{a_i=1}^{k} \pi(a_i)\, \nabla_{\theta} f_{\theta}\big(\phi(r_t, a_i(r_t))\big) \cdot \nabla_{\theta} f_{\theta}\big(\phi(r_t, a_i(r_t))\big)^{\top}, \tag{3}$$
where $\pi(a_i)$ is a shorthand for the designed policy. For the linear reward model, the Fisher information simplifies to:
$$M(\pi) = \sum_{a_i=1}^{k} \pi(a_i)\, \phi\big(r_t, a_i(r_t)\big) \cdot \phi\big(r_t, a_i(r_t)\big)^{\top}. \tag{4}$$
To understand the role of the Fisher information in evaluating the underlying uncertainty of a model, according to the textbook derivations for linear regression, we have:
• $\mathrm{var}(\hat{\theta}) \propto M(\pi)^{-1}$;
• the prediction variance for $\phi_i := \phi\big(r_t, a_i(r_t)\big)$ is given by $\mathrm{var}(\hat{y}_i) \propto \phi_i^{\top} M(\pi)^{-1} \phi_i$.
Therefore, the goal of the optimal online experiment design can be explained as minimizing the uncertainty in the reward model, either for parameter estimation or for prediction. In statistics, a D-optimal design minimizes $\det\big(M(\pi)^{-1}\big)$ from the perspective of estimation variance, and a G-optimal design minimizes $\max_{i \in \{1,\dots,k\}} \phi_i^{\top} M(\pi)^{-1} \phi_i$ from the perspective of prediction variance. A celebrated result states the equivalence between D-optimal and G-optimal designs.

Theorem 1 (Kiefer-Wolfowitz [27]). For an optimal design $\pi^*$, the following statements are equivalent:
• $\pi^* = \arg\max_{\pi} \log \det M(\pi)$;
• $\pi^*$ is D-optimal;
• $\pi^*$ is G-optimal.

Theorem 1 suggests that we use convex optimization to find the optimal design for both parameter estimation and prediction:
$$\max_{\pi}\ \log \det M(\pi) \quad \text{s.t.} \quad \sum_{a_i=1}^{k} \pi(a_i) = 1.$$
However, a drawback of the above formulation is that it does not involve the observations collected in the previous exploration rounds. Also, the optimization problem does not apply to the Bayesian setting if we wish to use Thompson sampling. Luckily, we find that the optimal design for the Bayesian setting has a nice connection to the above problem, and it also leads to a straightforward solution that utilizes the history data as a prior for the optimal design.

We still assume the linear reward setting, and the prior for θ is given by $\theta \sim N(0, \mathbf{R})$, where R is the covariance matrix. Unlike in the frequentist setting, the Bayesian design focuses on the design
optimality in terms of a certain utility function $U(\pi)$. A common choice is the expected gain in Shannon information, or equivalently, the Kullback-Leibler divergence between the prior and the posterior distribution of θ. The intuition is that the larger the divergence, the more information there is in the observations. Let y be the hypothetical rewards for $\phi\big(r_t, a_1(r_t)\big), \dots, \phi\big(r_t, a_k(r_t)\big)$. Then the gain in Shannon information is given by:
$$U(\pi) = \int \log p(\theta \mid y, \pi)\, p(y, \theta \mid \pi)\, \mathrm{d}\theta\, \mathrm{d}y = C + \frac{1}{2} \log \det\big(M(\pi) + \mathbf{R}^{-1}\big), \tag{5}$$
where C is a constant. Therefore, maximizing $U(\pi)$ is equivalent to maximizing $\log \det\big(M(\pi) + \mathbf{R}^{-1}\big)$.

Compared with the objective for the frequentist setting, there is now an additive $\mathbf{R}^{-1}$ term inside the determinant. Notice that R is the covariance of the prior, so given the previous history data, we can simply plug in the empirical estimation of R. In particular, let $\vec{\phi}_{t-1}$ be the collection of feature vectors from the previous exploration rounds: $\phi\big(x_1, a_1(x_1)\big), \dots, \phi\big(x_{t-1}, a_{t-1}(x_{t-1})\big)$. Then R is simply estimated by $\big(\vec{\phi}_{t-1}^{\top} \vec{\phi}_{t-1}\big)^{-1}$. Therefore, the objective for the Bayesian optimal design, after integrating the prior from the history data, is given by:
$$\text{maximize}_{\pi}\ \log \det\big(M(\pi) + \lambda\, \vec{\phi}_{t-1}^{\top} \vec{\phi}_{t-1}\big) \quad \text{s.t.} \quad \sum_{a_i=1}^{k} \pi(a_i) = 1, \tag{6}$$
where we introduce the hyperparameter λ to control the influence of the history data. We refer the interested readers to [13, 16, 25, 34] for the historical development of this topic.

Moving beyond the linear setting, the results stated in the Kiefer-Wolfowitz theorem also hold for nonlinear reward models [46]. Unfortunately, it is very challenging to find the exact optimal design when the reward model is nonlinear [53]. The difficulty lies in the fact that $\nabla_{\theta} f_{\theta}\big(\phi(r_t, a(r_t))\big)$ now depends on θ, so the Fisher information (3) is also a function of the unknown θ. One solution which we find computationally feasible is to consider a local linearization of $f_{\theta}(\cdot)$ using the Taylor expansion:
$$f_{\theta}\big(\phi(r_t, a(r_t))\big) \approx f_{\theta_0}\big(\phi(r_t, a(r_t))\big) + \nabla_{\theta} f_{\theta}\big(\phi(r_t, a(r_t))\big)\big|_{\theta=\theta_0}^{\top}\, (\theta - \theta_0), \tag{7}$$
where $\theta_0$ is some local approximation. In this way, we have:
$$\nabla_{\theta} f_{\theta} \approx \nabla_{\theta} f_{\theta}\big(\phi(r_t, a(r_t))\big)\big|_{\theta=\theta_0}, \tag{8}$$
and the local Fisher information is given by $M(\pi; \theta_0)$ according to (3). The effectiveness of the local linearization depends completely on the choice of $\theta_0$. Using the data gathered from the previous exploration rounds is a reasonable way to estimate $\theta_0$. We plug in $\theta_0$ to obtain the optimal designs for the following exploration rounds:
$$\text{maximize}_{\pi}\ \log \det M(\pi; \theta_0) \quad \text{s.t.} \quad \sum_{a_i=1}^{k} \pi(a_i) = 1. \tag{9}$$
We do not study the Bayesian optimal design under the nonlinear setting because even the linearization trick becomes complicated. Moreover, by the way we construct $\theta_0$ above, we already pass a certain amount of prior information to the design.

Remark 1. Before we proceed, we briefly discuss what to expect from the optimal design in theory. Optimizing the $\log\det(\cdot)$ objective is essentially finding the minimum-volume ellipsoid, also known as the John ellipsoid [19]. According to the previous results from the geometric studies, using the proposed optimal design will do no worse than the uniform exploration if the reward model is misspecified. Also, we can expect an average $\sqrt{d}$ improvement in the frequentist's linear reward setting [7], which means it only takes $o(1/\sqrt{d})$ of the previous exploration steps to estimate θ to the same precision.

4.2 Algorithms
In this section, we introduce an efficient algorithm to solve the optimal designs in (9) and (6). We then couple the optimal designs to the stage-wise exploration-exploitation algorithms. The infrastructure for our real-world production is also discussed. We have shown earlier that finding the optimal design requires solving a convex optimization program. Since the problem is often of moderate size, as we do not expect the number of recommenders k to be large, we find the Frank-Wolfe algorithm highly efficient [17, 23]. We outline the solution for the most general nonlinear reward case in Algorithm 2. The solutions for the other scenarios are included as special cases, e.g. by replacing $M(\pi)$ with $M(\pi) + \mathbf{R}^{-1}$ for the Bayesian setting.

Algorithm 2: The optimal design solver
Input: A subroutine for computing $M(\pi; \theta_0)$ in (4) or (8); the estimation $\theta_0 = \hat{\theta}$ and $\eta_a := \nabla_{\theta} f_{\theta}\big(\phi(r_t, a(r_t))\big)\big|_{\theta=\theta_0}$; the convergence criteria.
1 Initialize $\pi^{\text{old}}$: $\pi^{\text{old}}(a) = 1/k$, for $a = 1, \dots, k$;
2 while the convergence criteria are not met do
3   find $\tilde{a} = \arg\max_a\ \eta_a^{\top} M(\pi^{\text{old}}; \theta_0)^{-1} \eta_a$;
4   compute $\lambda_{\tilde{a}} = \dfrac{\eta_{\tilde{a}}^{\top} M(\pi^{\text{old}}; \theta_0)^{-1} \eta_{\tilde{a}}\,/\,d - 1}{\eta_{\tilde{a}}^{\top} M(\pi^{\text{old}}; \theta_0)^{-1} \eta_{\tilde{a}} - 1}$;
5   for $a = 1, \dots, k$ do
6     $\pi^{\text{new}}(a) = (1 - \lambda_{\tilde{a}})\, \pi^{\text{old}}(a) + \lambda_{\tilde{a}}\, \mathbb{1}[a = \tilde{a}]$;
7   end
8   $\pi^{\text{old}} = \pi^{\text{new}}$;
9 end

Referring to the standard analysis of the Frank-Wolfe algorithm [23], we show that it takes the solver at most $O(d \log\log k + d/\epsilon)$ updates to achieve a multiplicative $(1+\epsilon)$-optimal solution. Each update has an $O(kd^3)$ computation complexity, but d is usually small in practice (e.g. d = 6 in Example 1), which we will illustrate in more detail in Section 5.

By treating the optimal design solver as a subroutine, we now present the complete picture of the stage-wise exploration-exploitation with optimal design. To avoid unnecessary repetition, we describe the algorithms for the nonlinear reward model under the frequentist setting (Algorithm 3) and for the linear reward model under Thompson sampling. They include the other scenarios as special cases.

To adapt the optimal design to the Bayesian setting, we only need to make a few changes to the above algorithm:
KDD ’21, August 14–18, 2021, Virtual Event, Singapore Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan
Algorithm 3 (continued):
4 end
5 Compute the optimal exploration policy $\pi^*$ using the optimal design solver under $\theta_0 = \hat{\theta}$; play $n_1$ rounds of exploration under $\pi^*$ and collect the observations into $\vec{h}_t$;
6 Optimize and update $\hat{\theta}$ based on $\vec{h}_t$;
7 Play $n_2$ rounds of exploitation with $a_t = \arg\max_{a \in [k]} f_{\hat{\theta}}\big(r_t, a(r_t)\big)$;
8 end

• the optimal design solver is now specified for solving (6);
• instead of optimizing $\hat{\theta}$ via empirical-risk minimization, we update the posterior of θ using the history data $\vec{h}_t$;
• in each exploitation round, we execute Algorithm 4 instead.

Algorithm 4: Optimal design for Thompson sampling at the exploitation rounds.
1 Compute the posterior distribution $P_{\theta}$ according to the prior and the collected data;
2 Sample $\hat{\theta}$ from $P_{\theta}$;
3 Select $a_t = \arg\max_a\ \hat{\theta}^{\top} \phi\big(r_t, a(r_t)\big)$;
4 Collect the observation into $\vec{h}_t$;

For Thompson sampling, the computation complexity of the exploration is the same as in Algorithm 3. On the other hand, even with a conjugate prior distribution, Bayesian linear regression has an unfriendly complexity for the posterior computations. Nevertheless, under our stage-wise setup, the heavy lifting can be done at the back-end in a batch-wise fashion, so the delay will not be significant. In our simulation studies, we observe comparable performances from Algorithms 3 and 4. Nevertheless, each algorithm may experience specific tradeoffs in the stage-wise setting, and we leave it to future work to characterize their behaviors rigorously.

Figure 1: High-level overview of the system bandit service.

... keeping the resource availability and response-time agreements, since the optimal-design computations can cause stress during the peak time. Also, we initiate the scoring (model inference) for the candidate recommenders in parallel to reduce the latency whenever there are spare resources. The pre-computations occur in the back-end training clusters, and their results (updated parameters, posteriors of parameters, prior distributions) are stored in, for instance, the mega cache for the front end.

The logging system is another crucial component, which maintains storage of the past reward signals, contexts, policy values, etc. The logging system works interactively with the training cluster to run the scheduled jobs for pre-computation. We mention that off-policy learning, which takes place at the back-end using the logged data, should be made robust to runtime uncertainties, as suggested recently by [52]. Another detail is that the rewards are often not immediately available, e.g. for the conversion rate, so we set up an event stream to collect the data. The system bandit service listens to the event streams and determines the rewards after each recommendation.

For our deployment, we treat the system bandit service as a middleware between the online and offline services. The details are presented in Figure 2, where we put together the relevant components from the above discussion. It is not unusual these days to leverage "near-line computation", and our approach takes full advantage of the current infrastructure to support the optimal online experiment design for recommender selection.
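To make the solver described in Section 4.2 concrete, here is a minimal sketch of the Frank-Wolfe-style updates of Algorithm 2, specialized to the linear reward model so that $\eta_a$ is simply the feature vector of action a. NumPy is assumed, and the function name `d_optimal_design`, the `ridge` stabilizer, and the fixed iteration budget are our illustrative choices rather than the paper's specification:

```python
import numpy as np

def d_optimal_design(Phi, n_iter=300, ridge=1e-6):
    """Frank-Wolfe-style solver for max_pi log det sum_a pi(a) phi_a phi_a^T.

    Phi: (k, d) array, one feature vector per candidate action.
    Returns the design pi as a length-k probability vector.
    """
    k, d = Phi.shape
    pi = np.full(k, 1.0 / k)  # step 1: uniform initialization
    for _ in range(n_iter):
        # Information matrix M(pi), with a small ridge for invertibility.
        M = (Phi.T * pi) @ Phi + ridge * np.eye(d)
        # g[a] = phi_a^T M(pi)^-1 phi_a for every action at once.
        g = np.einsum("ad,de,ae->a", Phi, np.linalg.inv(M), Phi)
        a_tilde = int(np.argmax(g))  # step 3: most informative action
        lam = (g[a_tilde] / d - 1.0) / (g[a_tilde] - 1.0)  # step 4
        lam = float(np.clip(lam, 0.0, 1.0))
        e = np.zeros(k)
        e[a_tilde] = 1.0
        pi = (1.0 - lam) * pi + lam * e  # steps 5-7: convex update
    return pi
```

On a toy instance with features [1, 0], [0, 1], and the dominated point [0.5, 0.5], the returned design should concentrate on the first two actions and starve the third, mirroring how Example 1 discounts redundant recommenders.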
5 EXPERIMENTS
We first provide simulation studies to examine the effectiveness
of the proposed optimal design approaches. We then discuss the
relevant testing performance on Walmart.com.
5.1 Simulation
For illustration and reproducibility purposes, we implement the proposed online recommender selection under a semi-synthetic setting with a benchmark movie recommendation dataset. To fully reflect the exploration-exploitation dilemma in real-world production, we convert the benchmark dataset to an online setting such that it mimics the interactive process between the recommender and the user behavior. A similar setting is also found in [9], which studies non-contextual bandits as model ensemble methods and with which we also compare in our experiments. We consider the linear reward model setting for our simulation¹.

¹ https://github.com/StatsDLMathsRecomSys/D-optimal-recommender-selection.

Data-generating mechanism. In the beginning stage, 10% of the full data is selected as the training data to fit the candidate recommendation models, and the rest of the data is treated as the testing set, which generates the interaction data adaptively. The procedure can be described as follows. In each epoch, we recommend one item to each user. If the item has received a non-zero rating from that particular user in the testing data, we move it to the training data and endow it with a positive label if the rating is high, e.g. ≥ 3 under the five-point scale. Otherwise, we add the item to the rejection list and will not recommend it to this user again. After each epoch, we retrain the candidate models with both the past and the newly collected data. Similar to [9], we also use the cumulative recall as the performance metric, which is the ratio of the total number of successful recommendations (up to the current epoch) against the total number of positive ratings in the testing data. The reported results are averaged over ten runs.

Dataset. We use the MovieLens 1M² dataset, which consists of the ratings from 6,040 users for 3,706 movies. Each user rates the movies from zero to five. The movie ratings are binarized to {0, 1}, i.e. ≥ 2.5 or < 2.5, and we use the metadata of movies and users as the contextual information for the reward model. In particular, we perform the one-hot transformation for the categorical data to obtain the feature mappings φ(·). For text features such as the movie title, we train a word embedding model [36] with 50 dimensions. The final representation is obtained by concatenating all the one-hot encodings and embeddings.

² https://grouplens.org/datasets/movielens/1m/

Candidate recommenders. We employ four classical recommendation models: user-based collaborative filtering (CF) [56], item-based CF [42], popularity-based recommendation, and a matrix factorization model [21]. To train the candidate recommenders during our simulations, we further split the 10% initial training data into equal-sized training and validation datasets for grid-searching the best hyperparameters. The validation is conducted by running the same generation mechanism for 20 epochs and examining the performance at the last epoch. For user-based collaborative filtering, we set the number of nearest neighbors to 30. For item-based collaborative filtering, we compute the cosine similarity using the vector representations of the movies. For the relative item popularity model, the ranking is determined by the popularity of movies compared with the most-rated movies. For the matrix factorization model, we adopt the same setting from [9].

Figure 3: The comparisons of the cumulative recall (reward) per epoch for different bandit algorithms.

Baselines. To elaborate the performance of the proposed methods, we employ the widely-acknowledged exploration-exploitation algorithms as the baselines:
• The multi-armed bandit (MAB) algorithms without context: ε-greedy and Thompson sampling.
• Contextual bandits with the exploration conducted in the LinUCB fashion (Linear+UCB) and in the Thompson sampling fashion (Linear+Thompson).
We denote our algorithms by Linear+Frequentist optimal design and Linear+Bayesian optimal design.

Ablation studies. We conduct ablation studies with respect to the contexts and the optimal design component to show the effectiveness of the proposed algorithms. Firstly, we experiment on removing the user context information. Secondly, we experiment with our algorithm without using the optimal designs.

Results. The results on the cumulative recall per epoch are provided in Figure 3. It is evident that the proposed algorithms with optimal design outperform the other bandit algorithms by significant margins. In general, even though ε-greedy gives the worst performance, the fact that it improves over the epochs suggests the validity of our simulation setup. The Thompson sampling under MAB performs better than ε-greedy, which is expected. The usefulness of context in the simulation is suggested by the slightly better performances of Linear+UCB and Linear+Thompson. However, they are outperformed by our proposed methods by significant margins, which suggests the advantage of leveraging the optimal design in the exploration phase. Finally, we observe that among the optimal design methods, the Bayesian setting gives a slightly better performance, which may suggest the usefulness of the extra steps in Algorithm 4.

The results for the ablation studies are provided in Figure 4. The left-most plot shows the improvements from including contexts for the bandit algorithms, and suggests that our approaches are indeed capturing and leveraging the signals of the user context. In the middle and right-most plots, we observe a clear advantage of conducting the optimal design, especially in the beginning phases of exploration, as the methods with optimal design outperform their counterparts. We conjecture that this is because the optimal designs aim at maximizing the information from the limited options,
which is more helpful when the majority of the options have not been explored, such as in the beginning stage of the simulation.

Finally, we present a case study to fully illustrate the effectiveness of the optimal design, shown in Figure 5. It appears that in our simulation studies, the matrix factorization and the popularity-based recommendation are found to be more effective. With the optimal design, the traffic concentrates more quickly on the two promising candidate recommenders than without the optimal design. The observations are in accordance with our previous conjecture that the optimal design gives the algorithms more advantage in the beginning phases of the exploration.

Figure 5: The percentage of traffic routed to each candidate recommender with and without using the optimal design under the frequentist setting.

Figure 7: The testing results for the proposed stage-wise exploration-exploitation with optimal design.

We conduct a posthoc analysis by examining the proportion of traffic directed to the frequent and infrequent user groups by the online recommender selection system (Figure 8). Interestingly, we observe different patterns, where the brand- and flavor-adjusted models serve the frequent customers more often, and the unadjusted baseline and the price-adjusted model get more appearances for the infrequent customers. The results indicate that our online selection approach is actively exploring and exploiting the user-item features that eventually benefit the online performance. The simulation studies, on the other hand, reveal the superiority over the standard exploration-exploitation methods.
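The epoch-wise data-generating mechanism behind these simulation results (Section 5.1) can be sketched as a plain feedback loop. This is a toy stand-in rather than the released implementation: `run_epoch`, `recommend`, and `rating` are hypothetical callables standing in for a candidate model and the held-out MovieLens ratings, and the branch structure reflects one reading of the rejection-list rule:

```python
def run_epoch(users, recommend, rating, train, rejected, hits):
    """One simulation epoch: recommend one item per user and grade it.

    recommend(u, excluded): hypothetical candidate model, returns one item.
    rating(u, item): held-out test rating, or None if the pair is unrated.
    """
    for u in users:
        item = recommend(u, rejected[u])
        r = rating(u, item)
        if r is not None and r >= 3:
            # Rated highly in the test data: move it to the training data
            # with a positive label (>= 3 on the five-point scale).
            train.append((u, item, 1))
            hits += 1
        else:
            # Otherwise the item goes on the rejection list and is never
            # recommended to this user again.
            rejected[u].add(item)
    return hits
```

After each epoch, the candidate models would be retrained on `train`, and the cumulative recall is `hits` divided by the total number of positive ratings in the testing data.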
REFERENCES
[1] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. 2009. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, Vol. 92. 236.
[2] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning. 1638–1646.
[3] Shipra Agrawal and Navin Goyal. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory. 39–1.
[4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32, 1 (2002), 48–77.
[5] Peter Auer and Chao-Kai Chiang. 2016. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory. 116–120.
[6] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. 2011. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 19–26.
[7] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M. Kakade. 2012. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 41–1.
[8] Sébastien Bubeck and Aleksandrs Slivkins. 2012. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory. 42–1.
[9] Rocío Cañamares, Marcos Redondo, and Pablo Castells. 2019. Multi-armed recommender system bandit ensembles. In Proceedings of the 13th ACM Conference on Recommender Systems. 432–436.
[10] Olivier Chapelle and Lihong Li. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems. 2249–2257.
[11] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. 2020. Bias and Debias in Recommender System: A Survey and Future Directions. arXiv preprint arXiv:2010.03240 (2020).
[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 208–214.
[27] Jack Kiefer and Jacob Wolfowitz. 1960. The equivalence of two extremum problems. Canadian Journal of Mathematics 12 (1960), 363–366.
[28] John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems. 817–824.
[29] Minyong R. Lee and Milan Shen. 2018. Winner's Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 491–499.
[30] Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2. 19–36.
[31] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. 661–670.
[32] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. 2016. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 539–548.
[33] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. 2018. Efficient contextual bandits in non-stationary worlds. In Conference on Learning Theory. 1739–1776.
[34] Lawrence S. Mayer and Arlo D. Hendrickson. 1973. A method for constructing an optimal regression design after an initial set of input values has been selected. Communications in Statistics-Theory and Methods 2, 5 (1973), 465–477.
[35] H. Brendan McMahan and Matthew Streeter. 2009. Tighter bounds for multi-armed bandits with expert advice. (2009).
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 [cs.CL]
[37] Timothy E. O'Brien and Gerald M. Funk. 2003. A gentle introduction to optimal design for regression models. The American Statistician 57, 4 (2003), 265–267.
[38] Herbert Robbins. 1952. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58, 5 (1952), 527–535.
[39] Paat Rusmevichientong and John N. Tsitsiklis. 2010. Linearly parameterized
[13] PAK Covey-Crump and SD Silvey. 1970. Optimal regression designs with previous bandits. Mathematics of Operations Research 35, 2 (2010), 395–411.
observations. Biometrika 57, 3 (1970), 551–566. [40] Daniel Russo and Benjamin Van Roy. 2016. An information-theoretic analysis
[14] da Xu, chuanwei ruan, evren korpeoglu, sushant kumar, and kannan achan. 2020. of Thompson sampling. The Journal of Machine Learning Research 17, 1 (2016),
Inductive representation learning on temporal graphs. In International Conference 2442–2471.
on Learning Representations (ICLR). [41] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen.
[15] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, 2017. A tutorial on thompson sampling. arXiv preprint arXiv:1707.02038 (2017).
Lev Reyzin, and Tong Zhang. 2011. Efficient optimal learning for contextual [42] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-
bandits. arXiv preprint arXiv:1106.2369 (2011). Based Collaborative Filtering Recommendation Algorithms. In Proceedings of
[16] Otto Dykstra. 1971. The augmentation of experimental data to maximize [X’ X]. the 10th International Conference on World Wide Web (Hong Kong, Hong Kong)
Technometrics 13, 3 (1971), 682–688. (WWW ’01). Association for Computing Machinery, New York, NY, USA, 285–295.
[17] Marguerite Frank, Philip Wolfe, et al. 1956. An algorithm for quadratic program- https://doi.org/10.1145/371920.372071
ming. Naval research logistics quarterly 3, 1-2 (1956), 95–110. [43] Anne Schuth, Katja Hofmann, and Filip Radlinski. 2015. Predicting search satisfac-
[18] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, tion metrics with interleaved comparisons. In Proceedings of the 38th International
and Simon Dollé. 2018. Offline a/b testing for recommender systems. In Proceed- ACM SIGIR Conference on Research and Development in Information Retrieval. 463–
ings of the Eleventh ACM International Conference on Web Search and Data Mining. 472.
198–206. [44] Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems.
[19] Elad Hazan and Zohar Karnin. 2016. Volumetric spanners: an efficient exploration In Recommender systems handbook. Springer, 257–297.
basis for learning. The Journal of Machine Learning Research 17, 1 (2016), 4062– [45] William R Thompson. 1933. On the likelihood that one unknown probability
4095. exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933),
[20] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. 285–294.
2004. Evaluating collaborative filtering recommender systems. ACM Transactions [46] Lynda V White. 1973. An extension of the general equivalence theorem to
on Information Systems (TOIS) 22, 1 (2004), 5–53. nonlinear models. Biometrika 60, 2 (1973), 345–348.
[21] Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative Filtering for Implicit [47] Yuxiang Xie, Nanyu Chen, and Xiaolin Shi. 2018. False discovery rate controlled
Feedback Datasets. In 2008 Eighth IEEE International Conference on Data Mining. heterogeneous treatment effect detection for online controlled experiments. In
263–272. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
[22] Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, Discovery & Data Mining. 876–885.
and biomedical sciences. Cambridge University Press. [48] Da Xu, RUAN Chuanwei, Kamiya Motwani, Evren Korpeoglu, Sushant Kumar,
[23] Martin Jaggi. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex opti- and Kannan Achan. 2020. Methods and apparatus for item substitution. US
mization. In Proceedings of the 30th international conference on machine learning. Patent App. 16/424,799.
427–435. [49] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan.
[24] Michael Jahrer, Andreas Töscher, and Robert Legenstein. 2010. Combining 2020. Adversarial Counterfactual Learning and Evaluation for Recommender
predictions for accurate recommender systems. In Proceedings of the 16th ACM System. Advances in Neural Information Processing Systems 33 (2020).
SIGKDD international conference on Knowledge discovery and data mining. 693– [50] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan.
702. 2020. Product knowledge graph embedding for e-commerce. In Proceedings of
[25] Mark E Johnson and Christopher J Nachtsheim. 1983. Some guidelines for the 13th international conference on web search and data mining. 672–680.
constructing exact D-optimal designs on convex design spaces. Technometrics 25, [51] Da Xu and Bo Yang. 2022. On the Advances and Challenges of Adaptive Online
3 (1983), 271–277. Testing. arXiv preprint arXiv:2203.07672 (2022).
[26] Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay [52] Da Xu, Yuting Ye, Chuanwei Ruan, and Bo Yang. 2022. Towards Robust Off-policy
Chawla. 2015. Efficient Thompson Sampling for Online Matrix-Factorization Learning for Runtime Uncertainty. arXiv preprint arXiv:2202.13337 (2022).
Recommendation. In Advances in neural information processing systems. 1297– [53] Min Yang, Stefanie Biedermann, and Elina Tang. 2013. On optimal designs for
1305. nonlinear models: a general and efficient algorithm. J. Amer. Statist. Assoc. 108,
504 (2013), 1411–1420.
KDD ’21, August 14–18, 2021, Virtual Event, Singapore Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan
[54] Xuan Yin and Liangjie Hong. 2019. The identification and estimation of direct & Knowledge Management. 1411–1420.
and indirect effects in A/B tests through causal mediation analysis. In Proceedings [56] Z. Zhao and M. Shang. 2010. User-Based Collaborative-Filtering Recommendation
of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Algorithms on Hadoop. In 2010 Third International Conference on Knowledge
Mining. 2989–2999. Discovery and Data Mining. 478–481.
[55] Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative
filtering. In Proceedings of the 22nd ACM international conference on Information