
Fully Dynamic Online Selection through Online Contention Resolution Schemes

Vashist Avadhanula*, Andrea Celli^1, Riccardo Colini-Baldeschi^2, Stefano Leonardi^3, Matteo Russo^3

^1 Department of Computing Sciences, Bocconi University, Milan, Italy
^2 Core Data Science, Meta, London, UK
^3 Department of Computer, Control and Management Engineering, Sapienza University, Rome, Italy
vas1089@gmail.com, andrea.celli2@unibocconi.it, rickuz@fb.com, leonardi@diag.uniroma1.it, mrusso@diag.uniroma1.it

arXiv:2301.03099v1 [cs.AI] 8 Jan 2023

Abstract

We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical "incremental" version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), which uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is providing a general method for constructing an OCRS for fully dynamic online selection problems. Then, we show how to employ such an OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.

1 Introduction

Consider the case where a financial service provider receives multiple operations every hour/day. These operations might be malicious. The provider needs to assign them to human reviewers for inspection. The time required by each reviewer to file a reviewing task and the reward (weight) that is obtained with the review follow some distributions. The distributions can be estimated from historical data, as they depend on the type of transaction that needs to be examined and on the expertise of the employed reviewers. To efficiently solve the problem, the platform needs to compute a matching between tasks and reviewers based on the a priori information that is available. However, the time needed for a specific review, and the realized reward (weight), are often known only after the task/reviewer matching is decided.

A multitude of variations on this setting are possible. For instance, if a cost is associated with each reviewing task, the total cost for the reviewing process might be bounded by a budget. Moreover, there might be various kinds of restrictions on the subset of reviewers that are assigned at each time step. Finally, the objective function might not only be the sum of the rewards (weights) we observe if, for example, the decision maker has a utility function with a "diminishing returns" property.

To model the general class of sequential decision problems described above, we introduce fully dynamic online selection problems. This model generalizes online selection problems (Chekuri, Vondrák, and Zenklusen 2011), where elements arrive online in an adversarial order and algorithms can use a priori information to maximize the weight of the selected subset of elements, subject to combinatorial constraints (such as matroid, matching, or knapsack).

In the classical version of the problem (Chekuri, Vondrák, and Zenklusen 2011), once an element is selected, it affects the combinatorial constraints throughout the entire input sequence. This is in sharp contrast with the fully dynamic version, where an element affects the combinatorial constraint only for a limited time interval, which we name the activity time of the element. For example, a new task can be matched to a reviewer as soon as she is done with previously assigned tasks, or an agent can buy a new good as soon as the previously bought goods have perished. A large class of Bayesian online selection (Kleinberg and Weinberg 2012), prophet inequality (Hajiaghayi, Kleinberg, and Sandholm 2007), posted price mechanism (Chawla et al. 2010), and stochastic probing (Gupta and Nagarajan 2013) problems that have been studied in the classical version of online selection can therefore be extended to the fully dynamic setting. Note that in the dynamic algorithms literature, fully dynamic algorithms are algorithms that deal with both adversarial insertions and deletions (Demetrescu et al. 2010). We could also interpret our model in a similar sense, since elements arrive online (are inserted) according to an adversarial order, and cease to exist (are deleted) according to adversarially established activity times.

A successful approach to online selection problems is based on Online Contention Resolution Schemes (OCRSs) (Feldman, Svensson, and Zenklusen 2016). OCRSs use a priori information on the values of the elements to formulate a linear relaxation whose optimal fractional solution upper bounds the performance of the integral offline optimum. Then, an online rounding procedure is used to produce a solution whose value is as close as possible to the fractional relaxation solution's value, for any adversarial order of the input sequence. The OCRS approach allows us to obtain good approximations of the expected optimal solution for linear and submodular objective functions. The existence of OCRSs for fully dynamic online selection problems is therefore a natural research question that we address in this work.

The OCRS approach is based on the availability of a priori information on weights and activity times. However, in real-world scenarios, these might be missing or might be expensive to collect. Therefore, in the second part of our work, we study the fully dynamic online selection problem with partial information, where the main research question is whether the OCRS approach is still viable if a priori information on the weights is missing. In order to answer this question, we study a repeated version of the fully dynamic online selection problem, in which at each stage weights are unknown to the decision maker (i.e., no a priori information on weights is available) and chosen adversarially. The goal in this setting is the design of an online algorithm with performance (i.e., cumulative sum of weights of selected elements) close to that of the best static selection strategy in hindsight.

* Research performed while the author was working at Meta.

Our Contributions

First, we introduce the fully dynamic online selection problem, in which elements arrive following an adversarial ordering, and reveal their weights and activity times one by one either at the time of arrival (i.e., prophet model) or after the element has been selected (i.e., probing model). Our model describes temporal packing constraints (i.e., downward-closed constraints), where elements are active only within their activity time interval. The objective is to maximize the weight of the selected set of elements subject to temporal packing constraints. We provide two black-box reductions for adapting classical OCRSs for online (non-dynamic) selection problems to the fully dynamic setting under full and partial information.

Black-box reduction 1: from OCRS to temporal OCRS. Starting from a (b, c)-selectable greedy OCRS in the classical setting, we use it as a subroutine to build a (b, c)-selectable greedy OCRS in the more general temporal setting (see Algorithm 1 and Theorem 1). This means that competitive ratio guarantees in one setting determine the same guarantees in the other. Such a reduction implies the existence of algorithms with constant competitive ratio for online optimization problems with linear or submodular objective functions subject to matroid, matching, and knapsack constraints, for which we give explicit constructions. We also extend the framework to elements arriving in batches, which can have correlated weights or activity times within the batch, as described in the appendix of the paper.

Black-box reduction 2: from temporal OCRS to no-α-regret algorithm. Following the recent work by Gergatsouli and Tzamos (2022) in the context of Pandora's box problems, we define the following extension of the problem to the partial-information setting. For each of the T stages, the algorithm is given as input a new instance of the fully dynamic online selection problem. Activity times are fixed beforehand and known to the algorithm, while weights are chosen by an adversary, and revealed only after the selection at the current stage has been completed. In such a setting, we show that an α-competitive temporal OCRS can be exploited in the adversarial partial-information version of the problem in order to build no-α-regret algorithms with polynomial per-iteration running time. Regret is measured with respect to the cumulative weights collected by the best fixed selection policy in hindsight. We study three different settings: in the first setting, we study the full-feedback model (i.e., the algorithm observes the entire utility function at the end of each stage). Then, we focus on the semi-bandit-feedback model, in which the algorithm only receives information on the weights of the elements it selects. In such a setting, we provide a no-α-regret framework with an Õ(T^{1/2}) upper bound on cumulative regret in the case in which we have a "white-box" OCRS (i.e., we know the exact procedure run within the OCRS, and we are able to simulate it ex post). Moreover, we also provide a no-α-regret algorithm with an Õ(T^{2/3}) regret upper bound for the case in which we only have oracle access to the OCRS (i.e., the OCRS is treated as a black-box, and the algorithm does not require knowledge about its internal procedures).

Related Work

In the first part of the paper, we deal with a setting where the algorithm has complete information over the input but is unaware of the order in which elements arrive. In this context, contention resolution schemes (CRSs) were introduced by Chekuri, Vondrák, and Zenklusen (2011) as a powerful rounding technique in the context of submodular maximization. The CRS framework was extended to online contention resolution schemes (OCRSs) for online selection problems by Feldman, Svensson, and Zenklusen (2016), who provided constant-competitive OCRSs for different problems, e.g., intersections of matroids, matchings, and prophet inequalities. We generalize the OCRS framework to a setting where elements are timed and cease to exist right after their activity time.

In the second part, we lift the complete-knowledge assumption and work in an adversarial bandit setting, where at each stage the entire set of elements arrives, and we seek to select the "best" feasible subset. This is similar to the problem of combinatorial bandits (Cesa-Bianchi and Lugosi 2012), but, unlike it, we aim to deal with combinatorial selection of timed elements. In this respect, blocking bandits (Basu et al. 2019) model situations where played arms are blocked for a specific number of stages. Despite their contextual (Basu et al. 2021), combinatorial (Atsidakou et al. 2021), and adversarial (Bishop et al. 2020) extensions, recent work on blocking bandits only addresses specific cases of the fully dynamic online selection problem (Dickerson et al. 2018), which we solve in entire generality, i.e., adversarially and for all packing constraints.

Our problem is also related to sleeping bandits (Kleinberg, Niculescu-Mizil, and Sharma 2010), in that the adversary decides which actions the algorithm can perform at each stage t. Nonetheless, a sleeping bandit adversary has to communicate all available actions to the algorithm before a stage starts, whereas our adversary sets arbitrary activity times for each element, choosing in what order elements arrive.
2 Preliminaries

Given a finite set X ⊆ R^n and Y ⊆ 2^X, let 1_Y ∈ {0,1}^{|X|} be the characteristic vector of set Y, and co X be the convex hull of X. We denote vectors by bold fonts. Given a vector x, we denote by x_i its i-th component. The set {1, 2, …, n}, with n ∈ N_{>0}, is compactly denoted as [n]. Given a set X and a scalar α ∈ R, let αX := {αx : x ∈ X}. Finally, given a discrete set X, we denote by Δ_X the |X|-simplex.

We start by introducing a general selection problem in the standard (i.e., non-dynamic) case, as studied by Kleinberg and Weinberg (2012) in the context of prophet inequalities. Let E be the ground set and let m := |E|. Each element e ∈ E is characterized by a collection of parameters z_e. In general, z_e is a random variable drawn according to an element-specific distribution ζ_e, supported over the joint set of possible parameters. In the standard (i.e., non-dynamic) setting, z_e just encodes the weight associated with element e, that is, z_e = (w_e) for some w_e ∈ [0,1].^1 In such a case, the distributions ζ_e are supported over [0,1]. Random variables {z_e : e ∈ E} are independent, and z_e is distributed according to ζ_e. An input sequence is an ordered sequence of elements and weights such that every element in E occurs exactly once in the sequence. The order is specified by an arrival time s_e for each element e. Arrival times are such that s_e ∈ [m] for all e ∈ E, and for two distinct e, e′ we have s_e ≠ s_{e′}. The order of arrival of the elements is a priori unknown to the algorithm, and can be selected by an adversary. In the standard full-information setting, the distributions ζ_e can be chosen by an adversary, but they are known to the algorithm a priori. We consider problems characterized by a family of packing constraints.

Definition 1 (Packing Constraint). A family of constraints F = (E, I), for ground set E and independence family I ⊆ 2^E, is said to be packing (i.e., downward-closed) if, taking A ∈ I and B ⊆ A, then B ∈ I.

Elements of I are called independent sets. Such a family of constraints is closed under intersection, and encompasses matroid, knapsack, and matching constraints.

Fractional LP formulation. Even in the offline setting, in which the ordering of the input sequence (s_e)_{e∈E} is known beforehand, determining an independent set of maximum cumulative weight may be NP-hard in the worst case (Feige 1998). Then, we consider the relaxation of the problem in which we look for an optimal fractional solution. The value of such a solution is an upper bound on the value of the true offline optimum. Therefore, any algorithm guaranteeing a constant approximation to the offline fractional optimum immediately yields the same guarantees with respect to the offline optimum. Given a family of packing constraints F = (E, I), in order to formulate the problem of computing the best fractional solution as a linear program (LP), we introduce the notion of the packing constraint polytope P_F ⊆ [0,1]^m, which is such that P_F := co({1_S : S ∈ I}). Given a non-negative submodular function f : [0,1]^m → R_{≥0} and a family of packing constraints F, an optimal fractional solution can be computed via the LP max_{x∈P_F} f(x). If the goal is maximizing the cumulative sum of weights, the objective of the optimization problem is ⟨x, w⟩, where w := (w_1, …, w_m) ∈ [0,1]^m is a vector specifying the weight of each element. If we assume access to a polynomial-time separation oracle for P_F, such an LP yields an optimal fractional solution in polynomial time.

Online selection problem. In the online version of the problem, given a family of packing constraints F, the goal is selecting an independent set whose cumulative weight is as large as possible. In such a setting, the elements reveal their realized z_e one by one, following a fixed prespecified order unknown to the algorithm. Each time an element reveals z_e, the algorithm has to choose whether to select it or discard it before the next element is revealed. Such a decision is irrevocable. Computing the exact optimal solution to such online selection problems is intractable in general (Feige 1998), and the goal is usually to design approximation algorithms with a good competitive ratio.^2 In the remainder of the section we describe one well-known framework for such an objective.

Online contention resolution schemes. Contention resolution schemes were originally proposed by Chekuri, Vondrák, and Zenklusen (2011) in the context of submodular function maximization, and later extended to online selection problems by Feldman, Svensson, and Zenklusen (2016) under the name of online contention resolution schemes (OCRSs). Given a fractional solution x ∈ P_F, an OCRS is an online rounding procedure yielding an independent set in I guaranteeing a value close to that of x. Let R(x) be a random set containing each element e independently and with probability x_e. The set R(x) may not be feasible according to the constraints F. An OCRS essentially provides a procedure to construct a good feasible approximation starting from the random set R(x). Formally,

Definition 2 (OCRS). Given a point x ∈ P_F and the set of elements R(x), elements e ∈ E reveal one by one whether they belong to R(x) or not. An OCRS chooses irrevocably whether to select an element in R(x) before the next element is revealed. An OCRS for P_F is an online algorithm that selects S ⊆ R(x) such that 1_S ∈ P_F.

We will focus on greedy OCRSs, which were defined by Feldman, Svensson, and Zenklusen (2016) as follows.

Definition 3 (Greedy OCRS). Let P_F ⊆ [0,1]^m be the feasibility polytope for constraint family F. An OCRS π for P_F is called a greedy OCRS if, for every ex-ante feasible solution x ∈ P_F, it defines a packing subfamily of feasible sets F_{π,x} ⊆ F, and an element e is selected upon arrival if, together with the set of already selected elements, the resulting set is in F_{π,x}.

A greedy OCRS is randomized if, given x, the choice of F_{π,x} is randomized, and deterministic otherwise.

^1 This is for notational convenience. In the dynamic case z_e will contain other parameters in addition to weights.
^2 The competitive ratio is computed as the worst-case ratio between the value of the solution found by the algorithm and the value of an optimal solution.
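Operationally, a greedy OCRS samples R(x) once and then runs a simple accept test on each arrival. The following minimal sketch is our own illustration (not code from the paper); it instantiates Definition 3 with a rank-1 (single-choice) subfamily, in which at most one element may ever be kept:

```python
import random

def greedy_ocrs(x, in_subfamily, rng):
    """Greedy OCRS sketch (Definition 3): sample R(x), then irrevocably
    accept an arriving element e in R(x) iff the already selected set,
    together with e, still lies in the subfamily F_{pi,x}."""
    R = {e for e, xe in x.items() if rng.random() < xe}  # the random set R(x)
    selected = set()
    for e in sorted(x):  # arrival order; adversarial in general
        if e in R and in_subfamily(selected | {e}):
            selected.add(e)
    return selected

# Rank-1 (single-choice) subfamily: at most one element may be kept.
def single_choice(S):
    return len(S) <= 1

S = greedy_ocrs({"a": 0.5, "b": 0.3, "c": 0.2}, single_choice, random.Random(0))
assert len(S) <= 1  # the output is always feasible, whatever R(x) is
```

Note that the accept test only ever consults the subfamily, never the weights: this is what makes the scheme robust to any adversarial arrival order.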
For b, c ∈ [0,1], we say that a greedy OCRS π is (b, c)-selectable if, for each e ∈ E, and given x ∈ bP_F (i.e., belonging to a down-scaled version of P_F),

  Pr_{π,R(x)}[ S ∪ {e} ∈ F_{π,x}  ∀ S ⊆ R(x), S ∈ F_{π,x} ] ≥ c.

Intuitively, this means that, with probability at least c, the random set R(x) is such that an element e is selected no matter which other elements I of R(x) have been selected so far, as long as I ∈ F_{π,x}. This guarantees that an element is selected with probability at least c against any adversary, which implies a bc competitive ratio with respect to the offline optimum (see Appendix A for further details). Now, we provide an example, due to Feldman, Svensson, and Zenklusen (2016), of a feasibility constraint family where OCRSs guarantee a constant competitive ratio against the offline optimum. We will build on this example throughout the paper in order to provide intuition for the main concepts.

Example 1 (Theorem 2.7 in (Feldman, Svensson, and Zenklusen 2016)). Given a graph G = (V, E) with |E| = m edges, we consider the matching feasibility polytope

  P_F = { x ∈ [0,1]^m : Σ_{e∈δ(u)} x_e ≤ 1, ∀u ∈ V },

where δ(u) denotes the set of all edges adjacent to u ∈ V. Given b ∈ [0,1], the OCRS takes as input x ∈ bP_F, and samples each edge e with probability x_e to build R(x). Then, it selects each edge e ∈ R(x), upon its arrival, with probability (1 − e^{−x_e})/x_e, only if it is feasible. Then, the probability of selecting any edge e = (u, v) (conditioned on being sampled) is

  (1 − e^{−x_e})/x_e · Π_{e′∈δ(u)∪δ(v)\{e}} e^{−x_{e′}}
    = (1 − e^{−x_e})/x_e · e^{−Σ_{e′∈δ(u)∪δ(v)\{e}} x_{e′}}
    ≥ (1 − e^{−x_e})/x_e · e^{−2b}
    ≥ e^{−2b},

where the inequality follows from x ∈ bP_F, i.e., Σ_{e′∈δ(u)\{e}} x_{e′} ≤ b − x_e, and similarly for δ(v). Note that, in order to obtain an unconditional probability, we need to multiply the above by a factor x_e.

We remark that this example closely resembles our introductory motivating application, where financial transactions need to be assigned to reviewers upon their arrival. Moreover, Feldman, Svensson, and Zenklusen (2016) give explicit constructions of (b, c)-selectable greedy OCRSs for knapsack, matching, and matroidal constraints, and their intersection. We include a discussion of their feasibility polytopes in Appendix B. Ezra et al. (2020) generalize the above online selection procedure to a setting where elements arrive in batches rather than one at a time; we provide a discussion of such a setting in Appendix C.

3 Fully Dynamic Online Selection

The fully dynamic online selection problem is characterized by the definition of temporal packing constraints. We generalize the online selection model (Section 2) by introducing an activity time d_e ∈ [m] for each element. Element e arrives at time s_e and, if it is selected by the algorithm, it remains active up to time s_e + d_e and "blocks" other elements from being selected. Elements arriving after that time can be selected by the algorithm. In this setting, each element e ∈ E is characterized by a tuple of attributes z_e := (w_e, d_e). Let F^d := (E, I^d) be the family of temporal packing feasibility constraints, where elements block other elements in the same independent set according to the activity time vector d = (d_e)_{e∈E}. The goal of fully dynamic online selection is selecting an independent set in I^d whose cumulative weight is as large as possible (i.e., as close as possible to the offline optimum). We can naturally extend the expression for packing polytopes in the standard setting to the temporal one for every feasibility constraint family by exploiting the following notion of active elements.

Definition 4 (Active Elements). For element e ∈ E and given {z_e}_{e∈E}, we denote the set of active elements as E_e := {e′ ∈ E : s_{e′} ≤ s_e ≤ s_{e′} + d_{e′}}.^3

In this setting, we do not need to select an independent set S ∈ I but, less restrictively, we only require that for each incoming element we select a feasible subset of the set of active elements.

Definition 5 (Temporal packing constraint polytope). Given F = (E, I), a temporal packing constraint polytope P_F^d ⊆ [0,1]^m is such that P_F^d := co({1_S : S ∩ E_e ∈ I, ∀e ∈ E}).

Observation 1. For a fixed element e, the temporal polytope is the convex hull of the collection containing all the sets S such that S ∩ E_e is feasible. This needs to be true for all e ∈ E, meaning that we can rewrite the polytope and the feasibility set as P_F^d = co( ∩_{e∈E} {1_S : S ∩ E_e ∈ I} ) and I^d = ∩_{e∈E} {S : S ∩ E_e ∈ I}. Moreover, when d and d′ differ in at least one element e, that is, d_e < d′_e, then E_e ⊆ E′_e. Then, P_F^d ⊇ P_F^{d′} and I^d ⊇ I^{d′}.

We now extend Example 1 to account for activity times. In Appendix B we also work out the reduction from standard to temporal packing constraints for a number of examples, including rank-1 matroids (single-choice), knapsack, and general matroid constraints.

Example 2. We consider the temporal extension of the matching polytope presented in Example 1, that is,

  P_F^d = { x ∈ [0,1]^m : Σ_{e′∈δ(u)∩E_e} x_{e′} ≤ 1, ∀u ∈ V, ∀e ∈ E }.

Let us use the same OCRS as in the previous example, but where "feasibility" only concerns the subset of active edges in δ(u) ∪ δ(v). The probability of selecting an edge e = (u, v) is

  (1 − e^{−x_e})/x_e · Π_{e′∈(δ(u)∪δ(v))∩E_e\{e}} e^{−x_{e′}} ≥ (1 − e^{−x_e})/x_e · e^{−2b} ≥ e^{−2b},

which is obtained in a similar way to Example 1.

The above example suggests looking for a general reduction that maps an OCRS for the standard setting to an OCRS for the temporal setting, while achieving at least the same competitive ratio.

^3 Note that, since for distinct elements e, e′ we have s_{e′} ≠ s_e, we can equivalently define the set of active elements as E_e := {e′ ∈ E : s_{e′} < s_e ≤ s_{e′} + d_{e′}} ∪ {e}.
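The selection rule of Examples 1 and 2 can be made concrete with a short simulation. The sketch below is our own illustration (the function name and the tiny instance are ours, not the paper's): a sampled edge is accepted with probability (1 − e^{−x_e})/x_e, but only when no currently active selected edge shares one of its endpoints.

```python
import math
import random

def temporal_matching_ocrs(edges, x, s, d, rng):
    """Sketch of Example 2: edges arrive ordered by s_e; an edge in R(x)
    is accepted with probability (1 - exp(-x_e))/x_e, but only if no
    *active* previously selected edge (activity window [s_f, s_f + d_f])
    shares an endpoint with it."""
    selected = []  # pairs (edge, arrival time)
    for e in sorted(edges, key=lambda f: s[f]):
        u, v = e
        if rng.random() >= x[e]:  # e was not sampled into R(x)
            continue
        active = [f for f, t in selected if t + d[f] >= s[e]]
        if any(u in f or v in f for f in active):
            continue  # blocked by an active neighboring edge
        if rng.random() < (1 - math.exp(-x[e])) / x[e]:
            selected.append((e, s[e]))
    return [e for e, _ in selected]

# Tiny instance: a triangle, each edge with fractional mass 0.5.
edges = [("u", "v"), ("v", "w"), ("u", "w")]
x = {e: 0.5 for e in edges}
s = {("u", "v"): 1, ("v", "w"): 2, ("u", "w"): 3}
d = {("u", "v"): 2, ("v", "w"): 1, ("u", "w"): 1}
picked = temporal_matching_ocrs(edges, x, s, d, random.Random(0))
```

Whatever the randomness, any two selected edges that share a vertex end up with disjoint activity windows, which is exactly the temporal matching constraint.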
Algorithm 1: Greedy OCRS Black-box Reduction

  Input: feasibility families F and F^d, polytopes P_F and P_F^d, OCRS π for F, a point x ∈ bP_F^d
  Initialize S^d ← ∅
  Sample R(x) such that Pr[e ∈ R(x)] = x_e
  for e ∈ E do
    Upon arrival of element e, compute the set of currently active elements E_e
    if (S^d ∩ E_e) ∪ {e} ∈ F_{π,x} then
      Execute the original greedy OCRS π(x)
      Update S^d accordingly
    else
      Discard element e
  return set S^d

4 OCRS for Fully Dynamic Online Selection

The first black-box reduction we provide consists in showing that a (b, c)-selectable greedy OCRS for standard packing constraints implies the existence of a (b, c)-selectable greedy OCRS for temporal constraints. In particular, we show that the original greedy OCRS, working for y ∈ bP_F, can be used to construct another greedy OCRS for x ∈ bP_F^d. To this end, Algorithm 1 provides a way of exploiting the original OCRS π in order to manage temporal constraints. For each element e, and given the induced subfamily of packing feasible sets F_{π,x}, the algorithm checks whether the set of previously selected elements S^d which are still active in time, together with the new element e, is feasible with respect to F_{π,x}. If that is the case, the algorithm calls the OCRS π. Then, if the OCRS π for input x decided to select the current element e, the algorithm adds it to S^d; otherwise, the set remains unaltered. We remark that such a procedure is agnostic to whether the original greedy OCRS is deterministic or randomized. We observe that, due to a larger feasibility constraint family, the number of independent sets has increased with respect to the standard setting. However, we show that this does not constitute a problem, and an equivalence between the two settings can be established through the use of Algorithm 1. The following result shows that Algorithm 1 yields a (b, c)-selectable greedy OCRS for temporal packing constraints.

Theorem 1. Let F, F^d be the standard and temporal packing constraint families, respectively, and let their corresponding polytopes be P_F and P_F^d. Let y ∈ bP_F and x ∈ bP_F^d, and consider a (b, c)-selectable greedy OCRS π for F_{π,y}. Then, Algorithm 1 equipped with π is a (b, c)-selectable greedy OCRS for F^d_{π,x}.

Proof. Let us denote by π̂ the procedure described in Algorithm 1. First, we show that π̂ is a greedy OCRS for F^d.

Greediness. It is clear from the setting and the construction that elements arrive one at a time, and that π̂ irrevocably selects an incoming element only if it is feasible, and before seeing the next element. Indeed, in the if statement of Algorithm 1, we check that the active subset of the elements selected so far, together with the newly arriving element e, is feasible against the subfamily F_{π,x} ⊆ F. The constraint subfamily F_{π,x} is induced by the original OCRS π, and the point x belongs to the polytope bP_F^d. Note that we do not necessarily add element e to the running set S^d, even when feasible, but act as the original greedy OCRS would have acted. All that is left to show is that such a procedure defines a subfamily of feasibility constraints F^d_{π,x} ⊆ F^d. By construction, on the arrival of each element e, we guarantee that S^d is a set whose subset of active elements is feasible. This means that S^d ∩ E_e ∈ F_{π,x} ⊆ F. Then,

  S^d ∈ F^d_{π,x} := ∩_{e∈E} {S : S ∩ E_e ∈ F_{π,x}}.

Finally, F_{π,x} ⊆ F implies that F^d_{π,x} ⊆ F^d, which shows that π̂ is greedy. With the above, we can now turn to demonstrating (b, c)-selectability.

Selectability. Upon arrival of element e ∈ E, let S and S^d be the sets of elements already selected by π and π̂, respectively. By the way in which the constraint families are defined, and by construction of π̂, we can observe that, given x ∈ bP_F^d and y ∈ bP_F, for all S ⊆ R(y) such that S ∪ {e} ∈ F_{π,y}, there always exists a set S^d ⊆ R(x) such that (S^d ∩ E_e) ∪ {e} ∈ F_{π,x}. This establishes an injection between the selected set under standard constraints and its counterpart under temporal constraints. We observe that, for all e ∈ E and x ∈ bP_F^d,

  Pr[ S^d ∪ {e} ∈ F^d_{π,x}  ∀ S^d ⊆ R(x), S^d ∈ F^d_{π,x} ]
    = Pr[ (S^d ∩ E_e) ∪ {e} ∈ F_{π,x}  ∀ S^d ⊆ R(x), S^d ∩ E_e ∈ F_{π,x} ].

Hence, since for the greedy OCRS π and y ∈ bP_F we have that Pr[ S ∪ {e} ∈ F_{π,y} ∀ S ⊆ R(y), S ∈ F_{π,y} ] ≥ c, we can conclude by the injection above that

  Pr[ (S^d ∩ E_e) ∪ {e} ∈ F_{π,x}  ∀ S^d ⊆ R(x), S^d ∩ E_e ∈ F_{π,x} ] ≥ c.

The theorem follows.

We remark that the above reduction is agnostic to the weight scale, i.e., we need not assume that w_e ∈ [0,1] for all e ∈ E. In order to further motivate the significance of Algorithm 1 and Theorem 1, in the Appendix we explicitly reduce the standard setting to the fully dynamic one for single-choice, and provide a general recipe for all packing constraints.
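A minimal executable rendering of Algorithm 1 may help fix ideas. The sketch below is our own illustration, and the base OCRS is a deliberately trivial stub (a rank-1 subfamily whose greedy rule always accepts), not one of the paper's explicit constructions: each arrival is filtered by active-subset feasibility, and the accept/reject decision is then delegated to the original greedy rule.

```python
import random

def temporal_greedy_ocrs(elements, x, s, d, base_feasible, base_accept, rng):
    """Sketch of Algorithm 1: maintain S_d so that, at every arrival, its
    subset of currently active elements stays feasible for the base
    subfamily; the actual selection is delegated to the original OCRS."""
    R = {e for e in elements if rng.random() < x[e]}  # sample R(x)
    S_d = []  # pairs (element, arrival time)
    for e in sorted(elements, key=lambda f: s[f]):
        if e not in R:
            continue
        active = {f for f, t in S_d if t + d[f] >= s[e]}  # S_d ∩ E_e
        if base_feasible(active | {e}) and base_accept(e, active, rng):
            S_d.append((e, s[e]))
    return [e for e, _ in S_d]

# Base OCRS stub for a rank-1 (single-choice) constraint: the subfamily
# keeps at most one element, and the greedy rule always accepts.
picked = temporal_greedy_ocrs(
    ["a", "b", "c"],
    {"a": 0.9, "b": 0.9, "c": 0.9},
    {"a": 1, "b": 2, "c": 3},      # arrival times s_e
    {"a": 2, "b": 0, "c": 0},      # activity times: "a" blocks up to time 3
    lambda S: len(S) <= 1,
    lambda e, active, rng: True,
    random.Random(0),
)
assert picked == ["a"]  # with this seed all three are sampled; "a" blocks the rest
```

Under the rank-1 base constraint, any two selected elements necessarily have disjoint activity windows, mirroring the temporal feasibility family F^d of the theorem.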
5 Fully Dynamic Online Selection under Partial Information

In this section, we study the case in which the decision-maker has to act under partial information. In particular, we focus on the following online sequential extension of the full-information problem: at each stage t ∈ [T], a decision maker faces a new instance of the fully dynamic
online selection problem. An unknown vector of weights w_t ∈ [0,1]^{|E|} is chosen by an adversary at each stage t, while the feasibility set F^d is known and fixed across all T stages. This setting is analogous to the one recently studied by Gergatsouli and Tzamos (2022) in the context of Pandora's box problems. A crucial difference with the online selection problem with full information studied in Section 4 is that, at each step t, the decision maker has to decide whether to select or discard an element before observing its weight. In particular, at each t, the decision maker takes an action a_t := 1_{S^d_t}, where S^d_t ∈ F^d is the feasible set selected at stage t. The choice of a_t is made before observing w_t. The objective of maximizing the cumulative sum of weights is encoded in the reward function f : [0,1]^{2m} ∋ (a, w) ↦ ⟨a, w⟩ ∈ [0,1], which is the reward obtained by playing a with weights w = (w_e)_{e∈E}.^4

In this setting, we can think of F^d as the set of super-arms in a combinatorial online optimization problem. Our goal is designing online algorithms whose performance is close to that of the best fixed super-arm in hindsight.^5 In the analysis, as is customary when the online optimization problem has an NP-hard offline counterpart, we resort to the notion of α-regret. In particular, given a set of feasible actions X, we define an algorithm's α-regret up to time T as

  Regret_α(T) := α max_{x∈X} { Σ_{t=1}^T f(x, w_t) } − E[ Σ_{t=1}^T f(x_t, w_t) ],

where α ∈ (0,1] and x_t is the strategy output by the online algorithm at time t. We say that an algorithm has the no-α-regret property if Regret_α(T)/T → 0 for T → ∞.

The main result of the section is a black-box reduction that yields a no-α-regret algorithm for any fully dynamic online selection problem admitting a temporal OCRS. We provide no-α-regret frameworks for three scenarios:

• full-feedback model: after selecting a_t, the decision-maker observes the exact reward function f(·, w_t).
• semi-bandit feedback with white-box OCRS: after taking a decision at time t, the algorithm observes w_{t,e} for each element e ∈ S^d_t (i.e., each element selected at t). Moreover, the decision-maker has exact knowledge of the procedure employed by the OCRS, which can be easily simulated.
• semi-bandit feedback with oracle access to the OCRS: the decision maker has semi-bandit feedback and the OCRS is given as a black-box which can be queried once per step t.

Full-feedback Setting

In this setting, after selecting a_t, the decision-maker gets x_t computed by considering the weights selected by the adversary up to time t − 1.^6

Algorithm 2: FULL-FEEDBACK ALGORITHM

  Input: T, F^d, temporal OCRS π̂, subroutine RM
  Initialize RM for strategy space P_F^d
  for t ∈ [T] do
    x_t ← RM.RECOMMEND()
    a_t ← execute OCRS π̂ with input x_t
    Play a_t, and subsequently observe f(·, w_t)
    RM.UPDATE(f(·, w_t))

Let us assume we have at our disposal a no-α-regret algorithm for decision space P_F^d. We denote such a regret minimizer by RM, and we assume it offers two basic operations: i) RM.RECOMMEND() returns a vector in P_F^d; ii) RM.UPDATE(f(·, w)) updates the internal state of the regret minimizer using feedback received from the environment in the form of a reward function f(·, w). Notice that the availability of such a component is not enough to solve our problem, since at each t we can only play a super-arm a_t ∈ {0,1}^m feasible for F^d, and not the strategy x_t ∈ P_F^d ⊆ [0,1]^m returned by RM. The decision-maker can exploit the subroutine RM together with a temporal greedy OCRS π̂ by following Algorithm 2. We can show that, if the algorithm employs a regret minimizer for P_F^d with a sublinear cumulative regret upper bound of R_T, the following result holds.

Theorem 2. Given a regret minimizer RM for decision space P_F^d with cumulative regret upper bound R_T, and an α-competitive temporal greedy OCRS, Algorithm 2 provides

  α max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ R_T.

Since we are assuming the existence of a polynomial-time separation oracle for the set P_F^d, the LP arg max_{x∈P_F^d} f(x, w) can be solved in polynomial time for any w. Therefore, we can instantiate a regret minimizer for P_F^d by using, for example, follow-the-regularized-leader, which yields R_T ≤ Õ(m√T) (Orabona 2019).

Semi-Bandit Feedback with White-Box OCRS

In this setting, given a temporal OCRS π̂, it is enough to show that we can compute the probability that a certain super-arm a is selected by π̂ given a certain order of arrivals at stage t and a vector of weights w. If that is the case, we can build a no-α-regret algorithm with regret upper
to observe the reward function f (·, wt ). In order to achieve bound of Õ(m T ) by employing Algorithm 2 and by in-
performance close to that of the best fixed super-harm in stantiating the regret minimizer RM as the online stochastic
hindsight the idea is to employ the α-competitive OCRS de- mirror descent (OSMD) framework by Audibert, Bubeck,
signed in Section 4 by feeding it with a fractional solution and Lugosi (2014). We observe that the regret bound ob-
tained is this way is tight in the semi-bandit setting (Audib-
4
The analysis can be easily extended to arbitrary functions lin- ert, Bubeck, and Lugosi 2014). Let qt (e) be the probability
ear in both terms.
5 6
As we argue in Appendix D it is not possible to be competitive We remark that a (b, c)-selectable OCRS yields a bc competi-
with respect to more powerful benchmarks. tive ratio. In the following, we let α := bc.
with which our algorithm selects element e at time t. Then, we can equip OSMD with the following unbiased estimator of the vector of weights: ŵ_{t,e} := w_{t,e} · a_{t,e} / q_t(e).^7 In order to compute q_t(·) we need to have observed the order of arrival at stage t and the weights corresponding to super-arm a_t, and we need to be able to compute the probability with which the OCRS selected e at t. This is the reason why we talk about a "white-box" OCRS: we need to simulate ex post the procedure followed by the OCRS in order to compute q_t(·). When we know the procedure followed by the OCRS, we can always compute q_t(e) for any element e selected at stage t, since at the end of stage t we know the order of arrival, the weights of the selected elements, and the initial fractional solution x_t. We provide further intuition on how to compute such probabilities through the running example of matching constraints.

Example 3. Consider Algorithm 2 initialized with the OCRS of Example 1. Given stage t, we can safely limit our attention to selected edges (i.e., elements e such that a_{t,e} = 1). Indeed, all other edges will either be unfeasible (which implies that the probability of selecting them is 0), or they were not selected despite being feasible. Consider an arbitrary element e among those selected. Conditioned on the past choices up to element e, we know that e ∈ a_t will be feasible with certainty, and thus the (unconditional) probability it is selected is simply q_t(e) = 1 − e^{−y_{t,e}}.

___
^7 We observe that ŵ_{t,e} is equal to 0 when e has not been selected at stage t because, in that case, a_{t,e} = 0.

Semi-Bandit Feedback and Oracle Access to OCRS
As in the previous case, at each stage t the decision maker can only observe the weights associated with the edges selected by a_t. Therefore, they have no counterfactual information on the reward they would have obtained had they selected a different feasible set. On top of that, we assume that the OCRS is given as a black-box, and therefore we cannot compute ex post the probabilities q_t(e) for the selected elements. However, we show that it is possible to tackle this setting by exploiting a reduction from the semi-bandit feedback setting to the full-information feedback one. In doing so, we follow the approach first proposed by Awerbuch and Kleinberg (2008). The idea is to split the time horizon T into a given number of equally-sized blocks. Each block allows the decision maker to simulate a single stage of the full-information setting. We denote the number of blocks by Z, and each block τ ∈ [Z] is composed of a sequence of consecutive stages I_τ. Algorithm 3 describes the main steps of our procedure. In particular, the algorithm employs a procedure RM, an algorithm for the full-feedback setting such as the one described in the previous section, which exposes an interface with the two operations of a traditional regret minimizer. During each block τ, the full-information subroutine is used to compute a vector x_τ. Then, in most stages of the window I_τ, the decision a_t is computed by feeding x_τ to the OCRS. A few stages are chosen uniformly at random to estimate the utilities provided by other feasible sets (i.e., an exploration phase). After the execution of all the stages in the window I_τ, the algorithm computes estimated reward functions and uses them to update the full-information regret minimizer.

Algorithm 3: SEMI-BANDIT-FEEDBACK ALGORITHM WITH ORACLE ACCESS TO OCRS
  Input: T, F^d, temporal OCRS π̂, full-feedback algorithm RM for decision space P_F^d
  Let Z be initialized as in Theorem 3, and initialize RM appropriately
  for τ = 1, ..., Z do
    I_τ ← {(τ − 1) · T/Z + 1, ..., τ · T/Z}
    Choose a random permutation p : [m] → E, and t_1, ..., t_m stages at random from I_τ
    x_τ ← RM.RECOMMEND()
    for t = (τ − 1) · T/Z + 1, ..., τ · T/Z do
      if t = t_j for some j ∈ [m] then
        x_t ← 1_{S^d} for a feasible set S^d containing p(j)
      else
        x_t ← x_τ
      Play a_t obtained from the OCRS π̂ executed with fractional solution x_t
    Compute estimators f̃_τ(e) of f_τ(e) := (1/|I_τ|) Σ_{t∈I_τ} f(1_e, w_t) for each e ∈ E
    RM.UPDATE(f̃_τ(·))

Let p : [m] → E be a random permutation of the elements in E. Then, for each e ∈ E, by letting j be the index such that p(j) = e in the current block τ, an unbiased estimator f̃_τ(e) of f_τ(e) := (1/|I_τ|) Σ_{t∈I_τ} f(1_e, w_t) can be easily obtained by setting f̃_τ(e) := f(1_e, w_{t_j}). Then, it is possible to show that our algorithm provides the following guarantees.

Theorem 3. Given a temporal packing feasibility set F^d and an α-competitive OCRS π̂, let Z = T^{2/3}, and let the full-feedback subroutine RM be defined as per Theorem 2. Then Algorithm 3 guarantees that

  α · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ Õ(T^{2/3}).

6 Conclusion and Future Work
In this paper we introduce fully dynamic online selection problems, in which selected items affect the combinatorial constraints during their activity times. We present a generalization of the OCRS approach that provides near-optimal competitive ratios in the full-information model, and no-α-regret algorithms with polynomial per-iteration running time under both full and semi-bandit feedback. Our framework opens various future research directions. For example, it would be particularly interesting to understand whether a variation of Algorithms 2 and 3 can be extended to the case in which the adversary changes the constraint family at each stage. Moreover, the study of the bandit-feedback model remains open, and no regret bound is known for that setting.
Acknowledgements
The authors from Sapienza are supported by the Meta Research grant on "Fairness and Mechanism Design", the ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets", and the MIUR PRIN project ALGADIMAR "Algorithms, Games, and Digital Markets".

References
Abernethy, J. D.; Hazan, E.; and Rakhlin, A. 2009. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT.
Atsidakou, A.; Papadigenopoulos, O.; Basu, S.; Caramanis, C.; and Shakkottai, S. 2021. Combinatorial Blocking Bandits with Stochastic Delays. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 404–413.
Audibert, J.-Y.; Bubeck, S.; and Lugosi, G. 2014. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1): 31–45.
Awerbuch, B.; and Kleinberg, R. 2008. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1): 97–114.
Basu, S.; Papadigenopoulos, O.; Caramanis, C.; and Shakkottai, S. 2021. Contextual Blocking Bandits. In The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, 271–279.
Basu, S.; Sen, R.; Sanghavi, S.; and Shakkottai, S. 2019. Blocking Bandits. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Bishop, N.; Chan, H.; Mandal, D.; and Tran-Thanh, L. 2020. Adversarial Blocking Bandits. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Cesa-Bianchi, N.; and Lugosi, G. 2012. Combinatorial bandits. Journal of Computer and System Sciences, 78(5): 1404–1422.
Chawla, S.; Hartline, J. D.; Malec, D. L.; and Sivan, B. 2010. Multi-Parameter Mechanism Design and Sequential Posted Pricing. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, 311–320. New York, NY, USA: Association for Computing Machinery. ISBN 9781450300506.
Chekuri, C.; Vondrák, J.; and Zenklusen, R. 2011. Submodular Function Maximization via the Multilinear Relaxation and Contention Resolution Schemes. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 783–792. New York, NY, USA: Association for Computing Machinery. ISBN 9781450306911.
Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, 151–159. PMLR.
Demetrescu, C.; Eppstein, D.; Galil, Z.; and Italiano, G. F. 2010. Dynamic Graph Algorithms, 9. Chapman & Hall/CRC, 2 edition. ISBN 9781584888222.
Dickerson, J.; Sankararaman, K.; Srinivasan, A.; and Xu, P. 2018. Allocation problems in ride-sharing platforms: Online matching with offline reusable resources. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Ezra, T.; Feldman, M.; Gravin, N.; and Tang, Z. G. 2020. Online Stochastic Max-Weight Matching: Prophet Inequality for Vertex and Edge Arrival Models. In EC'20, 769–787.
Feige, U. 1998. A Threshold of Ln n for Approximating Set Cover. J. ACM, 45(4): 634–652.
Feldman, M.; Svensson, O.; and Zenklusen, R. 2016. Online Contention Resolution Schemes. In Krauthgamer, R., ed., Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, 1014–1033. SIAM.
Gergatsouli, E.; and Tzamos, C. 2022. Online Learning for Min Sum Set Cover and Pandora's Box. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, 7382–7403.
Gupta, A.; and Nagarajan, V. 2013. A Stochastic Probing Problem with Applications. In Proceedings of the 16th International Conference on Integer Programming and Combinatorial Optimization, IPCO'13, 205–216. Berlin, Heidelberg: Springer-Verlag. ISBN 9783642366932.
György, A.; Linder, T.; Lugosi, G.; and Ottucsák, G. 2007. The On-Line Shortest Path Problem Under Partial Monitoring. Journal of Machine Learning Research, 8(10).
Hajiaghayi, M. T.; Kleinberg, R.; and Sandholm, T. 2007. Automated Online Mechanism Design and Prophet Inequalities. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, AAAI'07, 58–65. AAAI Press. ISBN 9781577353232.
Kesselheim, T.; and Mehlhorn, K. 2016. Lecture 2: Yao's Principle and the Secretary Problem. Randomized Algorithms and Probabilistic Analysis of Algorithms, Max Planck Institute for Informatics, Saarbrücken, Germany.
Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret Bounds for Sleeping Experts and Bandits. Mach. Learn., 80(2–3): 245–272.
Kleinberg, R.; and Weinberg, S. M. 2012. Matroid prophet inequalities. In STOC'12, 123–136.
Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvari, C. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, 535–543. PMLR.
Livanos, V. 2021. A Simple and Tight Greedy OCRS. CoRR, abs/2111.13253.
McMahan, H. B.; and Blum, A. 2004. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, 109–123. Springer.
Orabona, F. 2019. A modern introduction to online learning. arXiv preprint arXiv:1912.13213.
A Contention Resolution Schemes and Online Contention Resolution Schemes
As explained at length in Section 2, our goal in general is that of finding the independent set of maximum weight for a
given feasibility constraint family. However, doing this directly might be intractable in general and we need to aim for a
good approximation of the optimum. In particular, given a non-negative submodular function f : [0, 1]m → R≥0 , and a family
of packing constraints F , we start from an ex ante feasible solution to the linear program maxx∈PF f (x), which upper bounds
the optimal value achievable. An ex ante feasible solution is simply a distribution over the independent sets of F , given by
a vector x in the packing constraint polytope of F . A key observation is that we can interpret the ex ante optimal solution
to the above linear program as a vector x* of fractional values, which induces a distribution over the elements such that x*_e is the marginal probability that element e ∈ E is included in the optimum. Then, we use this solution to obtain a feasible solution that
suitably approximates the optimum. The random set R(x∗ ) constructed by ex ante selecting each element independently with
probability x∗e can be infeasible. Contention Resolution Schemes (Chekuri, Vondrák, and Zenklusen 2011) are procedures that,
starting from the random set of sampled elements R(x∗ ), construct a feasible solution with good approximation guarantees
with respect to the optimal solution of the original integer linear program.
Definition 6 (Contention Resolution Schemes (CRSs) (Chekuri, Vondrák, and Zenklusen 2011)). For b, c ∈ [0, 1], a (b, c)-
balanced Contention Resolution Scheme (CRS) π for F = (E, I) is a procedure such that, for every ex-ante feasible solution
x ∈ bPF (i.e., the down-scaled version of polytope PF ), and every subset S ⊆ E, returns a random set π(x, S) ⊆ S satisfying
the following properties:
1. Feasibility: π(x, S) ∈ I.
2. c-balancedness: Prπ,R(x) [e ∈ π(x, R(x)) | e ∈ R(x)] ≥ c, ∀e ∈ E.
When elements arrive in an online fashion, Feldman, Svensson, and Zenklusen (2016) extend CRSs to the notion of OCRSs, where R(x) is obtained in the same manner, but elements are revealed one by one in adversarial order. The procedure has to decide irrevocably whether or not to add the current element to the final solution set, which needs to be feasible and competitive against the offline optimum. The idea is that adding a sampled element e ∈ E to the set of already selected elements S ⊆ R(x) maintains feasibility with at least constant probability, regardless of the element and the set. This gives rise to Definition 3 and the subsequent discussion.
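To make the sampling-then-rounding pipeline concrete, here is a minimal Python sketch (ours, not from the paper) of the simplest CRS instance: for a rank-1 matroid (at most one element may be selected), sample R(x) and return a uniformly random element of it. A standard computation shows this scheme is (1, 1 − 1/e)-balanced whenever Σ_e x_e ≤ 1; the snippet estimates the balancedness condition of Definition 6 by Monte Carlo. The function name and the instance are illustrative assumptions.

```python
import random

def crs_uniform_rank1(x, rng):
    """Offline CRS sketch for a rank-1 matroid: sample R(x) by including
    each element e independently with probability x_e, then return a
    uniformly random element of R(x) (or None if R(x) is empty)."""
    R = [e for e in range(len(x)) if rng.random() < x[e]]
    chosen = rng.choice(R) if R else None
    return R, chosen

# Monte Carlo estimate of Pr[e = pi(x, R(x)) | e in R(x)] for element 3
rng = random.Random(0)
x = [0.25, 0.25, 0.25, 0.25]     # ex-ante feasible: the entries sum to 1
in_R = picked = 0
for _ in range(100_000):
    R, chosen = crs_uniform_rank1(x, rng)
    if 3 in R:
        in_R += 1
        picked += (chosen == 3)
print(picked / in_R)              # roughly 0.68, above 1 - 1/e ~ 0.632
```

For this instance the exact balancedness is E[1/(1 + Bin(3, 1/4))] ≈ 0.684. The online schemes discussed above must achieve their guarantee without seeing the whole of R(x) in advance, which is why their constants are worse.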
B Examples
In this section, we provide some clarifying examples for the concepts introduced in Section 2 and 3.
Polytopes
Example 4 provides the definition of the constraint polytopes of some standard problems, while Example 5 describes their temporal version. For a set S ⊆ E and x ∈ R^m, we define, with a slight abuse of notation, x(S) := Σ_{e∈S} x_e.
Example 4 (Standard Polytopes). Given a ground set E,
• Let K = (E, I) be a knapsack constraint. Then, given budget B > 0 and a vector of elements' sizes c ∈ R^m_{≥0}, its feasibility polytope is defined as
  P_K = {x ∈ [0,1]^m : ⟨c, x⟩ ≤ B}.
• Let G = (E, I) be a matching constraint. Then, its feasibility polytope is defined as
PG = {x ∈ [0, 1]m : x(δ(u)) ≤ 1, ∀u ∈ V } ,
where δ(u) denotes the set of all adjacent edges to u ∈ V. Note that the ground set in this case is the set of all edges of
graph G = (V, E).
• Let M = (E, I) be a matroid constraint. Then, its feasibility polytope is defined as
PM = {x ∈ [0, 1]m : x(S) ≤ rank(S), ∀S ⊆ E} .
Here, rank(S) := max {|I| : I ⊆ S, I ∈ I}, i.e., the cardinality of the maximum independent set contained in S.
We can now rewrite the above polytopes under temporal packing constraints.
Example 5 (Temporal Polytopes). For ground set E,
• Let K = (E, I) be a knapsack constraint. Then, for B > 0 and cost vector c ∈ R^m_{≥0}, its feasibility polytope is defined as
  P_K^d = {x ∈ [0,1]^m : Σ_{e′∈E_e} c_{e′} · x_{e′} ≤ B, ∀e ∈ E},
i.e., the budget constraint is enforced on every set of simultaneously active elements E_e.
• Let G = (E, I) be a matching constraint. Then, its feasibility polytope is defined as
PGd = {x ∈ [0, 1]m : x(δ(u) ∩ Ee ) ≤ 1, ∀u ∈ V, ∀e ∈ E} .
• Let M = (E, I) be a matroid constraint. Then, its feasibility polytope is defined as
  P_M^d = {x ∈ [0,1]^m : x(S ∩ E_e) ≤ rank(S), ∀S ⊆ E, ∀e ∈ E}.
We also note that, for general packing constraints, if de = ∞ for all e ∈ E, then Ee = E, PF∞ = PF , and similarly for the
constraint family F ∞ = F .
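As a quick sanity check on these definitions, the Python sketch below (ours; the explicit encoding of E_e as a set of edge indices is an assumption made for illustration) tests membership in the temporal matching polytope P_G^d of Example 5: for every element e and every vertex u, the x-mass of the edges adjacent to u that are active together with e must not exceed 1.

```python
def in_temporal_matching_polytope(x, edges, active, tol=1e-9):
    """Membership test sketch for P_G^d (Example 5). edges[i] = (u, v) is
    the i-th element of the ground set; active[i] is the set E_{e_i} of
    edge indices active together with edge i (assumed to contain i)."""
    if any(xe < -tol or xe > 1 + tol for xe in x):
        return False
    vertices = {u for uv in edges for u in uv}
    for i in range(len(edges)):        # one constraint per element e_i ...
        for u in vertices:             # ... and per vertex u
            mass = sum(x[j] for j in active[i] if u in edges[j])
            if mass > 1 + tol:
                return False
    return True

# Two edges sharing vertex 0: with disjoint activity windows both can be
# fully selected; with a common window their total mass is capped at 1.
edges = [(0, 1), (0, 2)]
print(in_temporal_matching_polytope([1.0, 1.0], edges, [{0}, {1}]))        # True
print(in_temporal_matching_polytope([1.0, 1.0], edges, [{0, 1}, {0, 1}]))  # False
```

When d_e = ∞ for every edge, each active[i] is the whole edge set and the test reduces to the standard matching polytope of Example 4.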
From Standard OCRS to Temporal OCRS for Rank-1 Matroids, Matchings, Knapsacks, and General
Matroids
In this section, we explicitly derive a (1, 1/e)-selectable (randomized) temporal greedy OCRS for the rank-1 matroid feasibility
constraint, from a (1, 1/e)-selectable (randomized) greedy OCRS in the standard setting (Livanos 2021), which is also tight.
Let us denote this standard OCRS as πM , where M is a rank-1 matroid.
Corollary 1. For the rank-1 matroid feasibility constraint family under temporal constraints, Algorithm 1 produces a (1, 1/e)-
selectable (randomized) temporal greedy OCRS π̂M from πM .
Proof. Since it is clear from context, we drop the dependence on M and write π, π̂. We will proceed by comparing side-by-side
what happens in π and in π̂. Let us recall from Examples 4 and 5 that the polytopes can respectively be written as
  P_M = {x ∈ [0,1]^m : x(S) ≤ 1, ∀S ⊆ E},
  P_M^d = {y ∈ [0,1]^m : y(S ∩ E_e) ≤ 1, ∀S ⊆ E, ∀e ∈ E}.
The two OCRSs perform the following steps, on the basis of Algorithm 1. On one hand, π defines a subfamily of constraints F_{π,x} := {{e} : e ∈ H(x)}, where e ∈ E is included in the random subset H(x) ⊆ E with probability (1 − e^{−x_e})/x_e. Then, it selects the first sampled element e ∈ R(x) such that {e} ∈ F_{π,x}. On the other hand, π̂ defines a subfamily of constraints F^d_{π,y} := {{e} : e ∈ H(y)}, where e ∈ E is included in the random subset H(y) ⊆ E with probability q_e(y) = (1 − e^{−y_e})/y_e. The feasibility family F^d_{π,y} induces, as per Observation 1, a sequence of feasibility families F_{π,y}(e) := {{e′} : e′ ∈ H(y) ∩ E_e}, for each e ∈ E. The OCRS selects a sampled element e ∈ R(y) whenever {e} ∈ F_{π,y}(e) and no conflict arises. In other words, the temporal OCRS selects a sampled element that is active only if no other element in its active-elements set has been selected earlier. It is clear that both are randomized greedy OCRSs.
We now show that each element e is selected with probability at least 1/e under both π and π̂. In π, element e is selected if it is sampled and no earlier element has been selected (i.e., its singleton set belongs to the subfamily F_{π,x}). An earlier element e′ is not selected with probability 1 − x_{e′} · (1 − e^{−x_{e′}})/x_{e′} = e^{−x_{e′}}. This means that the probability of e being selected is

  (1 − e^{−x_e})/x_e · Π_{s_{e′}<s_e} e^{−x_{e′}} = (1 − e^{−x_e})/x_e · e^{−Σ_{s_{e′}<s_e} x_{e′}} ≥ (1 − e^{−x_e}) · e^{x_e−1} / x_e ≥ 1/e,

where the first inequality is justified by Σ_{s_{e′}<s_e} x_{e′} + x_e ≤ 1 (feasibility of x in the rank-1 matroid polytope), and the second follows because the expression is minimized for x_e = 0. Similarly, in π̂, element e is selected if it is sampled and no earlier element that is still active has been selected (i.e., its singleton set belongs to the subfamily F_{π,y}(e)). We have that the probability of e being selected is

  (1 − e^{−y_e})/y_e · Π_{s_{e′}<s_e : e′∈E_e} e^{−y_{e′}} = (1 − e^{−y_e})/y_e · e^{−Σ_{s_{e′}<s_e : e′∈E_e} y_{e′}} ≥ (1 − e^{−y_e}) · e^{y_e−1} / y_e ≥ 1/e.

Again, the first inequality is justified by Σ_{s_{e′}<s_e : e′∈E_e} y_{e′} + y_e ≤ 1, which holds by the temporal feasibility constraints, and the second follows because the expression is minimized for y_e = 0. Selectability is thus shown.
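The 1/e bound above can also be checked numerically. The Python sketch below (illustrative only; the marginals and activity windows are assumptions chosen for the example) simulates the temporal greedy OCRS π̂ on a rank-1 instance in which elements {0, 1} and {2, 3} occupy two disjoint activity windows, and estimates the selectability of the last element.

```python
import math
import random

def temporal_greedy_ocrs_run(y, windows, rng):
    """One run of the rank-1 temporal greedy OCRS sketched above: element
    e enters H(y) with probability (1 - e^{-y_e}) / y_e, and a sampled
    element e in R(y) is selected if no previously selected element lies
    in its window E_e. windows[e] is the set of elements active with e."""
    m = len(y)
    H = {e for e in range(m)
         if y[e] > 0 and rng.random() < (1 - math.exp(-y[e])) / y[e]}
    R = {e for e in range(m) if rng.random() < y[e]}
    selected = set()
    for e in range(m):                   # adversarial order = index order
        if e in R and e in H and not (selected & windows[e]):
            selected.add(e)
    return R, selected

rng = random.Random(0)
y = [0.6, 0.4, 0.6, 0.4]                 # each window's marginals sum to 1
windows = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
sampled = chosen = 0
for _ in range(100_000):
    R, sel = temporal_greedy_ocrs_run(y, windows, rng)
    if 3 in R:
        sampled += 1
        chosen += (3 in sel)
print(chosen / sampled, 1 / math.e)      # the estimate stays above 1/e
```

Note that elements 0 and 1 never block element 3 here, precisely because they are no longer active when it arrives; under the standard (non-temporal) rank-1 constraint the same marginals would be infeasible, since they sum to 2.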
Remark 1. Adapting the OCRSs in Theorem 1.8 of Feldman, Svensson, and Zenklusen (2016) for general matroids, matchings, and knapsacks, by following Algorithm 1 step-by-step, we get the same selectability guarantees in the temporal settings as in the standard ones: respectively, (b, 1 − b), (b, e^{−2b}), and (b, (1 − 2b)/(2 − 2b)). There are two crucial steps to map a standard OCRS into a temporal one, as exemplified by Corollary 1:
1. We first need to define the temporal constraints based on the standard ones. This is done simply by enforcing the constraint in the standard setting only for the current set of active elements, i.e., transforming F_{π,x} into F_{π,y}(e) for all elements e ∈ E. Such a transformation is analogous to the one used to go from Example 4 to Example 5.
2. When proving selectability, the probability of feasibility is only calculated on the elements e′ belonging to the same independent set as e (which arrives later) that are still active. This means that the probability computation is confined to e′ ∈ E_e such that s_{e′} < s_e, rather than all e′ ∈ E such that s_{e′} < s_e.
C Batched Arrival: Matching Constraints
As mentioned in Section 1, Ezra et al. (2020) generalize the one-by-one online selection problem to a setting where elements
arrive in batches. The existence of batched greedy OCRSs implies a number of results, as for instance Prophet Inequalities
under matching constraints where, rather than edges, vertices with all the edges adjacent to them arrive one at a time. This can
be viewed as an incoming batch of edges, for which Ezra et al. (2020) explicitly construct a (1, 1/2)-selectable batched greedy
OCRS.
Indeed, we let the ground set E be partitioned into k disjoint subsets (batches) arriving in the order B_1, ..., B_k, where elements in each batch appear at the same time. Such batches need to belong to a feasible family of batches B: for example, all batches could be required to be singletons, or they could be required to be all edges incident to a given vertex in a graph, and so on. Similarly to the traditional OCRS, we sample a random subset R_j(x) ⊆ B_j, for all j ∈ [k], so as to form R(x) := ∪_{j∈[k]} R_j(x) ⊆ E, where the R_j's are mutually independent. The fundamental difference with greedy OCRSs is that, within a given batch, weights are allowed to be correlated.
Definition 7 (Batched Greedy OCRSs (Ezra et al. 2020)). For b, c ∈ [0,1], let P_F ⊆ [0,1]^m be F's feasibility polytope. An OCRS π for bP_F is called a batched greedy OCRS with respect to R if, for every ex-ante feasible solution x ∈ bP_F, π defines a packing subfamily of feasible sets F_{π,x} ⊆ F, and it selects a sampled element e ∈ B_j when, together with the set of already selected elements, the resulting set is in F_{π,x}. We say that a batched greedy OCRS π is (b, c)-selectable if Pr_{π,R(x)}[S_j ∪ {e} ∈ F_{π,x}, ∀S_j ⊆ R_j(x), S_j ∈ F_{π,x}] ≥ c, for each j ∈ [k] and e ∈ B_j. The output feasible set will be S := ∪_{j∈[k]} S_j ∈ F_{π,x}.
Naturally, Theorem 1 extends to batched OCRSs.
Corollary 2. Let F, F^d be respectively the standard and temporal packing constraint families, with their corresponding polytopes P_F, P_F^d. Let x ∈ bP_F and y ∈ bP_F^d, and consider a (b, c)-selectable batched greedy OCRS π for F_{π,x}, with batches B_1, ..., B_k ∈ B. We can construct a batched greedy OCRS π̂ that is also (b, c)-selectable for F^d_{π,y}, with batches B_1, ..., B_k ∈ B.
The proof of this corollary is identical to that of Theorem 1: we can indeed define a set of active elements E_j for each batch B_j, and π̂ is essentially Algorithm 1, but with incoming batches rather than single elements, and with the necessary modifications in the sets. We will demonstrate the use of batched greedy OCRSs in the graph matching setting, where vertices come one at a time together with their contiguous edges. This allows us to solve the problem of dynamically assigning tasks to reviewers for the reviewing time, and to eventually match new tasks to the same reviewers, so as to maximize the throughput of this procedure. Details are presented below.
By Corollary 2 together with Theorem 4.1 in Ezra et al. (2020), which gives an explicit construction of a (1, 1/2)-selectable batched greedy OCRS under matching constraints, we immediately have that a (1, 1/2)-selectable batched greedy OCRS exists even under temporal constraints. For clarity, and in the spirit of Appendix B, we work out how to derive from scratch an online algorithm that is 1/2-competitive with respect to the offline optimal matching when the graph is bipartite and temporal constraints are imposed. We do not make use of Corollary 2, but we follow the proof of this general statement for the specific setting
of bipartite graph matching. Batched OCRSs in the non-temporal case are not specific to the bipartite matching case but extend
in principle to arbitrary packing constraints. Nevertheless, the only known constant competitive batched OCRS is the one for
general graph matching by Ezra et al. (2020). Finally, we note that our results closely resemble the ones of Dickerson et al.
(2018), with the difference that their arrival order is assumed to be stochastic, whereas ours is adversarial.
This is motivated for instance by the following real-world scenario: there are |U | = m “offline” beds (machines) in an
hospital, and |V | = n “online” patients (jobs) that arrive. Once a patient v ∈ V comes, the hospital has to irrevocably assign it
to one of the beds, say u ∈ U , and occupy it for a stochastic time equal to duv := dv [u], for dv ∼ Dv , i.e., the u-th component
of random vector dv . The sequence of arrivals is adversarial, but with known ex-ante distributions (Wv , Dv ). Moreover, the
patient’s healing can be thought of as a positive reward/weight equal to wuv := wv [u], for wv ∼ Wv , i.e., the u-th component
of random vector wv , whose distributions are known to the algorithm. The hospital’s goal is that of maximizing the sum of the
healing weights over time, i.e., over a discrete time period of length |V | = n. Across v’s, both wv ’s and dv ’s are independent.
However, within the vector itself, components duv and du′ v could be correlated, and the same holds for wv ’s.
Linear Programming Formulation
First, we construct a suitable linear-programming formulation whose fractional solution yields an upper bound on the expected offline optimum. Then, we devise an online algorithm that achieves an α-competitive ratio with respect to the linear-programming fractional solution. We follow the temporal LP definition, and let f(x) := ⟨w̄, x⟩, for x ∈ P_G^d a feasible fractional solution in the matching polytope. Since the matching polytope is P_G^d = {x ∈ [0,1]^m : x(δ(u) ∩ E_e) ≤ 1, ∀u ∈ V, ∀e ∈ E}, we can equivalently write the temporal linear program as

  max_{x∈[0,1]^m}  Σ_{u∈U} Σ_{v∈V} w̄_{uv} · x_{uv}                                          ⊳ Objective
  s.t.   Σ_{u∈U} x_{uv} ≤ 1,                                          ∀v ∈ V                ⊳ Constr. 1
         Σ_{v′: s_{v′}<s_v} x_{uv′} · Pr[d_{uv′} ≥ s_v − s_{v′}] + x_{uv} ≤ 1,  ∀u ∈ U, v ∈ V   ⊳ Constr. 2
         x_{uv} ≥ 0,                                                  ∀u ∈ U, v ∈ V         ⊳ Constr. 3
                                                                                              (1)

where w̄_{uv} := E_{w_v∼W_v}[w_{uv}] when w_{uv} is a random variable; when instead it is deterministic, we simply have w̄_{uv} = w_{uv}. Furthermore, as we argued in Section 2, we can think of x_{uv} as the probability that edge uv is inserted in the offline (fractional) optimal matching. We now show why the above linear program yields an upper bound on the offline optimal matching.
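Before proving the lemma, the constraints of LP (1) can be made concrete in code. The Python sketch below (ours; the single-machine restriction and all names are assumptions for illustration) checks Constraints 2 and 3 for one machine u, taking the survival probabilities Pr[d_{uv′} ≥ s_v − s_{v′}] as input; with a single machine, Constraint 1 reduces to x_v ≤ 1.

```python
def lp1_feasible(x, arrival, surv, tol=1e-9):
    """Feasibility check for LP (1) restricted to a single machine u.
    x[v]: candidate marginal that u is matched to job v;
    arrival[v]: arrival time s_v;
    surv[vp][v]: the survival probability Pr[d_{u,vp} >= s_v - s_{vp}]."""
    n = len(x)
    for v in range(n):
        if not (-tol <= x[v] <= 1 + tol):     # Constr. 3 (and Constr. 1)
            return False
        # Constr. 2: expected mass of earlier jobs still occupying u, plus x_v
        load = sum(x[vp] * surv[vp][v]
                   for vp in range(n) if arrival[vp] < arrival[v])
        if load + x[v] > 1 + tol:
            return False
    return True

# Job 0 (arrival 0) surely frees u before job 1 (arrival 5): both fit fully.
print(lp1_feasible([1.0, 1.0], [0, 5], [[0, 0], [0, 0]]))   # True
# Job 0 always blocks job 1: the load on u would exceed 1.
print(lp1_feasible([1.0, 1.0], [0, 5], [[0, 1], [0, 0]]))   # False
```

The first instance illustrates the "reusable machine" effect of the temporal setting: both jobs can be matched with probability 1 because the first activity time expires before the second arrival.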
Lemma 1. Consider a solution x* to linear program (1). Then, x* is such that ⟨w̄, x*⟩ ≥ E_{w,d}[⟨w, 1_OPT⟩], where 1_OPT ∈ {0,1}^m is the vector denoting which elements have been selected by the integral offline optimum.
Proof. The proof follows from analyzing the constraints. The meaning of Constraint 1 is that, upon the arrival of vertex v, v must be matched at most once in expectation. In fact, for each job v ∈ V, at most one machine u ∈ U can be selected by the optimum, which yields

  Σ_{u∈U} x_{uv} ≤ 1.
This justifies Constraint 1. Constraint 2, on the other hand, has the following simple interpretation: machine u is unavailable when job v arrives if it has been matched earlier to a job v′ whose activity time is longer than the difference between the arrival times of v and v′. Otherwise, u can in fact be matched to v, and this probability is of course lower than the probability of being available. This implies that, for each machine u ∈ U and each job v ∈ V,

  Σ_{v′: s_{v′}<s_v} x_{uv′} · Pr[d_{uv′} ≥ s_v − s_{v′}] + x_{uv} ≤ 1.

We have shown that all constraints are less restrictive for the linear program than they would be for the offline optimum. Since the objective function is the same for both, a solution for the integral optimum is also a solution for the linear program, while the converse does not necessarily hold. The statement follows.
A simple algorithm
Inspired by the algorithm by Dickerson et al. (2018) (which deals with stochastic rather than adversarial arrivals), we propose
Algorithm 4. In the remainder, let Navail (v) denote the set of available vertices u ∈ U when v ∈ V arrives.

Algorithm 4: Bipartite Matching Temporal OCRS
  Data: Machine set U, job set V, and distributions W_v, D_v
  Result: Matching M ⊆ U × V
  Solve LP (1) and obtain fractional solution x*;
  M ← ∅;
  for v ∈ V do
    if N_avail(v) = ∅ then
      Reject v;
    else
      Select u ∈ N_avail(v) with probability α · x*_{uv} / Pr[u ∈ N_avail(v)];
      M ← M ∪ {uv};
Lemma 2. Algorithm 4 makes every vertex u ∈ U available with probability at least α. Moreover, such probability is maximized
for α = 1/2.
Proof. We will prove the claim by induction. For the first incoming job v = 1, Pr[u ∈ N_avail(v)] = 1 ≥ α for all machines u ∈ U, no matter what the values of w_{uv}, d_{uv} are. To complete the base case, we only need to check that the probability of selecting one machine is in fact no larger than one. For this purpose, let us denote the event "u is selected by Algorithm 4 when v arrives" by u ∈ ALG(v). Then,

  Pr[∃u ∈ N_avail(v) : u ∈ ALG(v)] = Σ_{u∈U} α · x*_{uv} / Pr[u ∈ N_avail(v)] ≤ α,

where the equality follows from the fact that the events within the existential quantifier are disjoint, and recalling that N_avail(v) = U for the first job. Now consider all vertices v′ arriving before vertex v (s_{v′} < s_v), and assume inductively that Pr[u ∈ N_avail(v′)] ≥ α. This means that the algorithm makes each u available with probability at least α for all vertex arrivals before v. This, in turn,
implies that each u is selected with probability α · x∗uv′ . Let us observe that a machine u ∈ U will not be available for the
incoming job v ∈ V only if the algorithm has matched it to an earlier job v ′ with activity time larger than sv − sv′ . Formally,
the probability that u is available for v is
    Pr[u ∈ N_avail(v)] = 1 − Pr[u ∉ N_avail(v)]
                       = 1 − Pr[∃v′ ∈ V : s_{v′} < s_v, u ∈ ALG(v′), d_{uv′} > s_v − s_{v′}]
                       ≥ 1 − α · Σ_{v′ : s_{v′} < s_v} x^*_{uv′} · Pr[d_{uv′} ≥ s_v − s_{v′}]
                       ≥ α + α · x^*_{uv}
                       ≥ α.
The second-to-last inequality follows from Constraint (2), together with the following simple implication for all r, z ∈ ℝ: if
r + z ≤ 1, then 1 − αr ≥ α + αz, so long as α ≤ 1/2. Since we would like to choose α as large as possible, we choose α = 1/2.
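For completeness, the implication can be verified in one line: assuming r + z ≤ 1,

```latex
1 - \alpha r \;\ge\; 1 - \alpha(1 - z) \;=\; (1 - \alpha) + \alpha z \;\ge\; \alpha + \alpha z,
```

where the last inequality is equivalent to 1 − α ≥ α, i.e., α ≤ 1/2.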
What is left to show is that the probability of selecting some machine is at most one:

    Pr[∃u ∈ N_avail(v) : u ∈ ALG(v)] = Σ_{u∈U} α · x^*_{uv} / Pr[u ∈ N_avail(v)] ≤ Σ_{u∈U} x^*_{uv} ≤ 1,

where the first inequality uses Pr[u ∈ N_avail(v)] ≥ α.

The statement, thus, follows.


A direct consequence of the above two lemmata is the following theorem. Indeed, if every u is available with probability at
least 1/2, then the algorithm selects it with the prescribed probability, regardless of its previous actions. In turn, the optimum
is approximated within the same factor.
Theorem 4. Algorithm 4 is 1/2-competitive with respect to the expected optimum E_{w,d}[⟨w, 1_OPT⟩].
Various applications, such as prophet and probing inequalities for the batched temporal setting, can be derived from the
above theorem. Solving them with a constant competitive ratio yields a solution for the review problem illustrated in the
introduction, where multiple financial transactions arriving over time are assigned to one of many potential reviewers, and
reviewers can be “reused” once they have completed their review.

D Benchmarks
The need for stages
We argue that, for the results in Section 5, stages are necessary in order for us to be able to compare our algorithm against any
meaningful benchmark. Suppose, in contrast, that we chose to compare against the optimum (or an approximation of it) within
a single stage where n jobs arrive to a single arm. A non-adaptive adversary could simply run the following procedure, with
each job having weight 1: with probability 1/2, jobs with odd arrival order have activity time 1 and jobs with even arrival order
have activity time ∞; with probability 1/2, the opposite holds. To be precise, recall that ∞ is just shorthand notation to
mean that all future jobs would be blocked: indeed, the activity time of a job arriving at time s_e is not unbounded, but can be at
most n − s_e. As activity times are revealed only after the algorithm has made a decision for the current job, the algorithm does not
know whether taking the current job will prevent it from selecting anything in the entire future. The best the algorithm can
do is to pick the first job with probability 1/2. Indeed, if the algorithm is lucky and the activity time is 1, then it knows it is
in the first scenario and gets n. Otherwise, it only gets 1. Hence, the regret would be R_n = n − (n+1)/2 ∈ Ω(n), which is linear.
Note that n and T here represent two different concepts: the first is the number of elements sent within a stage; the second is the
number of stages. In the case outlined above, T = 1, since it is a single stage scenario. Thus, there is no hope that in a single
stage we could do anything meaningful, and we turn to the framework where an entire instance of the problem is sent at each
stage t ∈ [T ].

Choosing the right benchmark


Now, we motivate why the Best-in-Hindsight policy introduced at the beginning of Section 5 is a strong and realistic benchmark
for an algorithm that knows the feasibility polytopes a priori. When measuring regret, we need a benchmark which is neither
too weak nor unrealistically powerful compared to the information we have at hand. Below, we provide explicit lower bounds
showing that the dynamic optimum is too powerful a benchmark even when the polytope is known. In particular, the next
examples prove that it is impossible to achieve sublinear (α-)Regret against the dynamic optimum. In the remainder, we always
assume full feedback and that the adversary is non-adaptive.
We denote by a^OPT_t and a^ALG_t the actions chosen at time t by the optimum and by the algorithm, respectively.

Lemma 3. Every algorithm has R_T = Σ_{t∈[T]} E[f_t(a^OPT_t)] − Σ_{t∈[T]} E[f_t(a^ALG_t)] ∈ Ω(T) against the dynamic optimum.
Proof. Consider the case of a single arm and the arrival of 3 jobs at each stage (one at a time within the stage, revealed from
top to bottom), with the constraint that at most 1 active job can be selected. The (non-adaptive) adversary simply tosses T fair
coins, independently at each stage: if the t-th coin lands heads, then all 3 jobs at the t-th stage have activity time 1 and weight
1; otherwise, all jobs have activity time ∞, the first job has weight ε and the last two have weight 1 (recall that ∞ is just
shorthand notation to mean that all future jobs would be blocked). Figure 1 shows a possible realization of the T stages: at each
stage the expected reward of the optimal policy is 3/2, since the optimal value is 1 or 2 with equal probability. By linearity of
expectation, Σ_{t∈[T]} E[f_t(a^OPT_t)] = T · E[f(a^OPT)] ≥ (3/2) · T.

Figure 1: Three jobs per stage, shown as (weight, activity time) pairs: w.p. 1/2 the stage is {(1, 1), (1, 1), (1, 1)}, and w.p. 1/2 it is {(ε, ∞), (1, ∞), (1, ∞)}.

On the other hand, the algorithm discovers which scenario it is in only after the value of the first job has been
revealed. If it does not pick the first job and that job turns out to have weight 1, then the algorithm can get at most 1 from the
remaining jobs. If instead it picks it and the job turns out to have weight ε, it only gets ε. Even if the algorithm is aware of such
a stochastic input beforehand, it knows that stages are independent and, hence, cannot adapt before a given stage begins. It
could observe the first job's weight without taking it, but by then it may already be too late. Any algorithm in this setting can be
described by the probability p with which it accepts the first job (rejecting it with probability 1 − p), acting adaptively thereafter.
Then, again by linearity of expectation,

    Σ_{t∈[T]} E[f_t(a^ALG_t)] = T · E[f(a^ALG)] = T · ( (1/2) · (2p + (1 − p)) + (1/2) · (εp + (1 − p)) )
                              = ((2 + εp)/2) · T ≤ ((2 + ε)/2) · T.

Thus, R_T ≥ ((1 − ε)/2) · T ∈ Ω(T).
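As a sanity check (our own numeric illustration, not part of the proof), one can verify that no acceptance probability p beats the per-stage bound used above:

```python
# Per-stage rewards in the Lemma 3 construction: two equally likely scenarios;
# p is the probability of accepting the first job of the stage.
eps = 0.01
opt = 0.5 * 2 + 0.5 * 1  # optimum gets 2 (all-ones scenario) or 1 (eps scenario)

def alg(p):
    # heads: accept first -> 2, skip it -> at most 1; tails: accept -> eps, skip -> 1
    return 0.5 * (2 * p + (1 - p)) + 0.5 * (eps * p + (1 - p))

best_alg = max(alg(i / 100) for i in range(101))  # = (2 + eps) / 2, attained at p = 1
# per-stage regret is therefore at least (1 - eps) / 2
assert opt - best_alg >= (1 - eps) / 2 - 1e-12
```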
Now, we ask whether there exists a similar lower bound on approximate regret. Similarly to the previous lemma, we denote
by a^OCRS_t the action chosen at time t by the OCRS.

Lemma 4. Every algorithm has R_T = α · Σ_{t∈[T]} E[f_t(a^OPT_t)] − Σ_{t∈[T]} E[f_t(a^ALG_t)] ∈ Ω(T) against an α-approximation of
the dynamic optimum, for α ∈ (0, 1].
Proof. Let all the activity times be infinite, and define (for a given stage) the constraint to be picking a single job irrevocably.
We know that, for the single-choice problem, a tight OCRS achieves competitive ratio α = 1/2. However, such an OCRS is not
greedy. Livanos (2021) constructs a tight greedy OCRS for single-choice, which is α = 1/e competitive. For our purposes,
nonetheless, we only require the trivial inequality α ≤ 1. The non-adaptive adversary can run the following a priori procedure
for each of the T stages: let δ = (α − 1/n)/2 be a constant, sample k ∼ [n] uniformly at random, and send jobs in order of weights
δ^k, δ^{k−1}, . . . , δ, 0, . . . , 0 (ascending until δ, and then all 0s); this construction is inspired by the notes of Kesselheim and
Mehlhorn (2016). We know that, by Theorem 1,

    Σ_{t∈[T]} E[f_t(a^OCRS_t)] ≥ α · Σ_{t∈[T]} E[f_t(a^OPT_t)].

This is possible because the greedy OCRS has access to full information about the current stage a priori (it knows δ and the
sampled k at each stage), unlike the algorithm, which is unaware of how the T stages are going to be presented. It is easy to
see that the best the algorithm can do within a given stage is to randomly guess the drawn k, i.e., where δ will land. We now
divide the time horizon into T/n intervals, each composed of n stages. In each interval, since no stage is predictive of the next,
the algorithm can be adaptive neither across stages nor within a stage, since all possible sequences share the same prefix. By
construction, we expect the algorithm to catch δ once per interval, and otherwise get at most δ² in each of the remaining n − 1
stages, optimistically. In other words, let us index each interval by I ∈ [T/n] and bound the
algorithm's and the OCRS's expected rewards as
 
    Σ_{t∈[T]} E[f_t(a^ALG_t)] ≤ Σ_{I∈[T/n]} ( δ + (n − 1)δ² ) ≤ (δ/n + δ²) · T,
    Σ_{t∈[T]} E[f_t(a^OCRS_t)] ≥ Σ_{I∈[T/n]} (αδ · n) = αδ · T.

Hence,

    R_T = Σ_{t∈[T]} E[f_t(a^OCRS_t)] − Σ_{t∈[T]} E[f_t(a^ALG_t)]
        ≥ ( αδ − δ/n − δ² ) · T
        = ((α − 1/n)² / 4) · T ∈ Ω(T).

The last step follows from the fact that α is a constant and 1/n ∈ o(α).
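The constant δ = (α − 1/n)/2 is exactly the maximizer of the gap αδ − δ/n − δ² appearing above; a quick numeric check (our own illustration):

```python
# Verify that delta = (alpha - 1/n) / 2 maximizes
# g(delta) = alpha*delta - delta/n - delta^2 and gives the stated gap.
alpha, n = 0.5, 100
g = lambda d: alpha * d - d / n - d * d
delta = (alpha - 1.0 / n) / 2
# g is a downward parabola with vertex at (alpha - 1/n)/2 ...
assert all(g(delta) >= g(delta + s * 1e-3) for s in (-1, 1))
# ... and its value there is exactly (alpha - 1/n)^2 / 4
assert abs(g(delta) - (alpha - 1.0 / n) ** 2 / 4) < 1e-12
```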

E Omitted proofs from Section 5


Theorem 2. Given a regret minimizer RM for decision space P_F^d with cumulative regret upper bound R^T, and an α-competitive
temporal greedy OCRS, Algorithm 2 provides

    α · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ R^T.

Proof. We assume access to a regret minimizer for the set P_F^d guaranteeing an upper bound R^T on the cumulative regret
up to time T. Then,

    E[ Σ_{t=1}^T f(a_t, w_t) ] ≥ α · Σ_{t=1}^T f(x_t, w_t)
                               ≥ α · ( max_{x∈P_F^d} Σ_{t=1}^T f(x, w_t) − R^T )
                               = α · ( max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − R^T ),

where the first inequality follows from the fact that Algorithm 2 employs a suitable temporal OCRS π̂ to select a_t: for each
e ∈ E, the probability with which the OCRS selects e is at least α · x_{t,e}, and since f is a linear mapping (in particular, it is
defined as the scalar product between a vector of weights and the choice at time t) the above inequality holds. The second
inequality is by the no-regret property of the regret minimizer for decision space P_F^d, and the final equality holds because the
maximum of a linear function over the polytope is attained at a vertex. This concludes the proof.
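The composition in Theorem 2 (a regret minimizer proposing fractional points, rounded online by an OCRS) can be sketched on a toy instance. The following is our illustration, not the paper's Algorithm 2: it uses single-choice feasibility (pick at most one of m elements per stage), Hedge as the full-feedback regret minimizer over the simplex, and the trivial OCRS that takes element e with probability α · x_e.

```python
import math
import random

def hedge_ocrs_single_choice(weights_seq, alpha=0.5, eta=0.1, rng=random):
    """Toy pipeline: Hedge proposes x_t on the simplex, a trivial single-choice
    OCRS rounds it (element e taken w.p. alpha * x_{t,e}, events disjoint),
    then the full weight vector w_t is fed back to Hedge."""
    m = len(weights_seq[0])
    logw = [0.0] * m                     # unnormalized log-weights of Hedge
    reward = 0.0
    for w_t in weights_seq:
        mx = max(logw)                   # subtract max to avoid overflow
        p = [math.exp(l - mx) for l in logw]
        s = sum(p)
        x_t = [pi / s for pi in p]       # fractional point on the simplex
        r, acc, chosen = rng.random(), 0.0, None
        for e in range(m):               # OCRS rounding: one uniform draw
            acc += alpha * x_t[e]
            if r < acc:
                chosen = e
                break
        if chosen is not None:
            reward += w_t[chosen]
        for e in range(m):               # full-feedback Hedge update (gains)
            logw[e] += eta * w_t[e]
    return reward
```

On a sequence where one element is always best, Hedge concentrates on it and the realized reward approaches α times the best fixed choice, matching the α-competitive guarantee.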

Theorem 3. Given a temporal packing feasibility set F^d and an α-competitive OCRS π̂, let Z = T^{2/3}, and let the full-feedback
subroutine RM be defined as per Theorem 2. Then, Algorithm 3 guarantees

    α · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ Õ(T^{2/3}).

Proof. We start by computing a lower bound on the average reward the algorithm gets. Algorithm 3 splits its decisions into Z
blocks, and, at each τ ∈ [Z], chooses the action x_τ suggested by the RM, unless the stage is one of the randomly sampled
exploration steps. Then, we can write

    (1/T) · E[ Σ_{t=1}^T f(a_t, w_t) ] ≥ (α/T) · Σ_{τ∈[Z]} Σ_{t∈I_τ} f(x_t, w_t)
        ≥ (α/T) · Σ_{τ∈[Z]} Σ_{t∈I_τ} f(x_τ, w_t) − α · m²Z/T
        = (α/T) · Σ_{τ∈[Z]} Σ_{e∈E} Σ_{t∈I_τ} x_{τ,e} · f(1_e, w_t) − α · m²Z/T
        = (α/Z) · Σ_{τ∈[Z]} Σ_{e∈E} x_{τ,e} · E[f̃_τ(e)] − α · m²Z/T,

where the first inequality is by the use of a temporal OCRS to select a_t, and the second inequality is obtained by subtracting
the worst-case costs incurred during exploration; note that the m² factor in the second inequality is due to the fact that at each
of the m exploration stages, we can lose at most m. The last equality is by definition of the unbiased estimator, since the value
of f is observed T/Z times (once per block) in expectation.
We can now bound from below the rightmost expression we just obtained by using the guarantees of the regret minimizer:

    (α/Z) · Σ_{τ∈[Z]} Σ_{e∈E} x_{τ,e} · E[f̃_τ(e)] − α · m²Z/T
        ≥ (α/Z) · E[ max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} x_e · f̃_τ(e) − R^Z ] − α · m²Z/T
        ≥ (α/Z) · max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} x_e · E[f̃_τ(e)] − (α/Z) · R^Z − α · m²Z/T
        = (α/T) · max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} Σ_{t∈I_τ} x_e · f(1_e, w_t) − (α/Z) · R^Z − α · m²Z/T
        = (α/T) · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − (α/Z) · R^Z − α · m²Z/T,

where we used the unbiasedness of f̃_τ(e), and the fact that the value of the optimal fractional vector in the polytope is the same
value provided by the best superarm (i.e., the best vertex of the polytope) by convexity. The first equality follows from expanding
the expectation of the unbiased estimator (i.e., E[f̃_τ(e)] := (Z/T) · Σ_{t∈I_τ} f(1_e, w_t)). Let us now rearrange the last expression
and compute the cumulative regret:
    α · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ (α/Z) · R^Z · T + α · m²Z ≤ Õ(T/√Z) + α · m²Z ≤ Õ(T^{2/3}),

where we used R^Z ≤ Õ(√Z), and in the last step we set Z = T^{2/3} and obtain the desired upper bound on regret (the term
α · m² is incorporated in the Õ notation). The theorem follows.
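The exponent 2/3 comes from balancing the two terms on the right-hand side, (αT/Z) · R^Z ≈ T/√Z and α · m²Z; a small numeric illustration (ours, with constants dropped):

```python
# Balance exploration cost ~ Z against regret-minimizer cost ~ T / sqrt(Z):
T = 10 ** 6
Z = round(T ** (2 / 3))       # Z = 10_000
cost_explore = Z              # ~ alpha * m^2 * Z, constants dropped
cost_rm = T / Z ** 0.5        # ~ (alpha / Z) * Otilde(sqrt(Z)) * T = T / sqrt(Z)
# at Z = T^{2/3}, both terms are Theta(T^{2/3})
assert abs(cost_explore - cost_rm) / cost_rm < 0.01
```

Any other choice of Z makes one of the two terms asymptotically larger, which is why Z = T^{2/3} is the right block count.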

F Further Related Works


CRS and OCRS. Contention resolution schemes (CRS) were introduced by Chekuri, Vondrák, and Zenklusen (2011) as a
powerful rounding technique in the context of submodular maximization. The CRS framework was extended to online con-
tention resolution schemes (OCRS) for online selection problems by Feldman, Svensson, and Zenklusen (2016), who provided
OCRSs for different problems, including intersections of matroids, matchings, and prophet inequalities. Ezra et al. (2020) re-
cently extended OCRS to batched arrivals, providing a constant competitive ratio for stochastic max-weight matching in vertex
and edge arrival models.
Combinatorial Bandits. The problem of combinatorial bandits was first studied in the context of online shortest paths (Awer-
buch and Kleinberg 2008; György et al. 2007), and the general version of the problem is due to Cesa-Bianchi and Lugosi (2012).
Improved regret bounds can be achieved in the case of combinatorial bandits with semi-bandit feedback (see, e.g., (Chen, Wang,
and Yuan 2013; Kveton et al. 2015; Audibert, Bubeck, and Lugosi 2014)). A related problem is that of linear bandits (Awer-
buch and Kleinberg 2008; McMahan and Blum 2004), which admit computationally efficient algorithms in the case in which
the action set is convex (Abernethy, Hazan, and Rakhlin 2009).
Blocking bandits. In blocking bandits (Basu et al. 2019), the arm that is played is blocked for a specific number of stages.
Blocking bandits have recently been studied in contextual (Basu et al. 2021), combinatorial (Atsidakou et al. 2021), and
adversarial (Bishop et al. 2020) settings. Our bandit model differs from blocking bandits since we consider each instance of the
problem confined within each stage. In addition, the online full-information problems that are solved in most blocking bandits
papers (Atsidakou et al. 2021; Basu et al. 2021; Dickerson et al. 2018) only address specific cases of the fully dynamic online
selection problem, which we solve in full generality.
Sleeping bandits. As mentioned, our problem is similar to that of sleeping bandits (see (Kleinberg, Niculescu-Mizil, and
Sharma 2010) and follow-up papers), but at the same time the two models differ in a number of ways. Just like the sleeping
bandits case, the adversary in our setting decides which actions we can perform by setting arbitrary activity times at each t.
The crucial difference between the two settings is that, in sleeping bandits, once an adversary has chosen the available actions
for a given stage, they have to communicate them all at once to the algorithm. In our case, instead, the adversary can choose
the available actions within a given stage as the elements arrive, so it is, in some sense, “more dynamic”. In particular, in the
temporal setting there are two levels of adaptivity for the adversary: on one hand, the adversary may or may not be adaptive
across stages (this is the classic bandit notion of adaptivity). On the other hand, the adversary may or may not be adaptive within
the same stage (which is the notion of online algorithms adaptivity).
