Fully Dynamic Online Selection Through Online Contention Resolution Schemes
Abstract

We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical "incremental" version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and the sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), which uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is a general method for constructing an OCRS for fully dynamic online selection problems. We then show how to employ such an OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.

* Research performed while the author was working at Meta.

1 Introduction

Consider the case where a financial service provider receives multiple operations every hour/day. These operations might be malicious. The provider needs to assign them to human reviewers for inspection. The time required by each reviewer to file a reviewing task and the reward (weight) that is obtained with the review follow some distributions. The distributions can be estimated from historical data, as they depend on the type of transaction that needs to be examined and on the expertise of the employed reviewers. To efficiently solve the problem, the platform needs to compute a matching between tasks and reviewers based on the a priori information that is available. However, the time needed for a specific review, and the realized reward (weight), is often known only after the task/reviewer matching is decided.

A multitude of variations to this setting are possible. For instance, if a cost is associated with each reviewing task, the total cost for the reviewing process might be bounded by a budget. Moreover, there might be various kinds of restrictions on the subset of reviewers that are assigned at each time step. Finally, the objective function might not only be the sum of the rewards (weights) we observe, if, for example, the decision maker has a utility function with a "diminishing return" property.

To model the general class of sequential decision problems described above, we introduce fully dynamic online selection problems. This model generalizes online selection problems (Chekuri, Vondrák, and Zenklusen 2011), where elements arrive online in an adversarial order and algorithms can use a priori information to maximize the weight of the selected subset of elements, subject to combinatorial constraints (such as matroid, matching, or knapsack).

In the classical version of the problem (Chekuri, Vondrák, and Zenklusen 2011), once an element is selected, it affects the combinatorial constraints throughout the entire input sequence. This is in sharp contrast with the fully dynamic version, where an element affects the combinatorial constraint only for a limited time interval, which we name the activity time of the element. For example, a new task can be matched to a reviewer as soon as she is done with previously assigned tasks, or an agent can buy a new good as soon as the previously bought goods have perished. A large class of Bayesian online selection (Kleinberg and Weinberg 2012), prophet inequality (Hajiaghayi, Kleinberg, and Sandholm 2007), posted price mechanism (Chawla et al. 2010), and stochastic probing (Gupta and Nagarajan 2013) problems that have been studied in the classical version of online selection can therefore be extended to the fully dynamic setting. Note that in the dynamic algorithms literature, fully dynamic algorithms are algorithms that deal with both adversarial insertions and deletions (Demetrescu et al. 2010). We can interpret our model in a similar sense, since elements arrive online (are inserted) in an adversarial order, and cease to exist (are deleted) according to adversarially established activity times.

A successful approach to online selection problems is based on Online Contention Resolution Schemes (OCRSs) (Feldman, Svensson, and Zenklusen 2016). OCRSs use a priori information on the values of the elements to formulate a linear relaxation whose optimal fractional solution upper bounds the performance of the integral offline optimum. Then, an online rounding procedure is used to produce a solution whose value is as close as possible to the fractional relaxation solution's value, for any adversarial order of the input sequence. The OCRS approach allows us to obtain good approximations of the expected optimal solution for linear and submodular objective functions. The existence of OCRSs for fully dynamic online selection problems is therefore a natural research question that we address in this work.

The OCRS approach is based on the availability of a priori information on weights and activity times. However, in real-world scenarios, these might be missing or expensive to collect. Therefore, in the second part of our work, we study the fully dynamic online selection problem with partial information, where the main research question is whether the OCRS approach is still viable if a priori information on the weights is missing. In order to answer this question, we study a repeated version of the fully dynamic online selection problem, in which at each stage weights are unknown to the decision maker (i.e., no a priori information on weights is available) and chosen adversarially. The goal in this setting is the design of an online algorithm whose performance (i.e., cumulative sum of weights of selected elements) is close to that of the best static selection strategy in hindsight.

Weights are chosen by an adversary, and revealed only after the selection at the current stage has been completed. In such a setting, we show that an α-competitive temporal OCRS can be exploited in the adversarial partial-information version of the problem, in order to build no-α-regret algorithms with polynomial per-iteration running time. Regret is measured with respect to the cumulative weights collected by the best fixed selection policy in hindsight. We study three different settings: in the first setting, we study the full-feedback model (i.e., the algorithm observes the entire utility function at the end of each stage). Then, we focus on the semi-bandit-feedback model, in which the algorithm only receives information on the weights of the elements it selects. In this setting, we provide a no-α-regret framework with an Õ(T^{1/2}) upper bound on cumulative regret in the case in which we have a "white-box" OCRS (i.e., we know the exact procedure run within the OCRS, and we are able to simulate it ex post). Moreover, we also provide a no-α-regret algorithm with an Õ(T^{2/3}) regret upper bound for the case in which we only have oracle access to the OCRS (i.e., the OCRS is treated as a black-box, and the algorithm does not require knowledge of its internal procedures).
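To make the relaxation-and-rounding viewpoint concrete, the following self-contained sketch (a toy knapsack instance with made-up numbers, not an example taken from the paper) compares the integral offline optimum with the optimum of its linear relaxation; the fractional value always upper bounds the integral one, and it is this fractional solution that an OCRS then rounds online:

```python
from itertools import combinations

# Toy knapsack instance (illustrative numbers): rewards w, sizes c, budget B.
w = [6.0, 5.0, 4.0]
c = [3.0, 2.0, 2.0]
B = 3.5

# Integral offline optimum, by brute force over all feasible subsets.
best_int = max(
    sum(w[i] for i in S)
    for r in range(len(w) + 1)
    for S in combinations(range(len(w)), r)
    if sum(c[i] for i in S) <= B
)

# Fractional optimum of max <w, x> s.t. <c, x> <= B, x in [0, 1]^m.
# For a single knapsack constraint, greedy by reward density is LP-optimal.
best_frac, cap = 0.0, B
for i in sorted(range(len(w)), key=lambda i: w[i] / c[i], reverse=True):
    take = min(1.0, cap / c[i])
    best_frac += take * w[i]
    cap -= take * c[i]
    if cap <= 0:
        break

# The ex-ante (fractional) solution upper bounds the integral offline optimum.
assert best_frac >= best_int
```

Here the integral optimum is 6 (take the first item alone) while the fractional optimum is 8, illustrating the gap that the online rounding step must then control.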
4 OCRS for Fully Dynamic Online Selection

The first black-box reduction we provide consists in showing that a (b, c)-selectable greedy OCRS for standard packing constraints implies the existence of a (b, c)-selectable greedy OCRS for temporal constraints. In particular, we show that the original greedy OCRS working for x ∈ bP_F can be used to construct another greedy OCRS for y ∈ bP_{F^d}. To this end, Algorithm 1 provides a way of exploiting the original OCRS π in order to manage temporal constraints. For each element e, and given the induced subfamily of packing feasible sets F_{π,y}, the algorithm checks whether the set of previously selected elements S^d which are still active in time, together with the new element e, is feasible with respect to F_{π,y}. If that is the case, the algorithm calls the OCRS π. Then, if the OCRS π for input y decided to select the current element e, the algorithm adds it to S^d; otherwise, the set remains unaltered. We remark that such a procedure is agnostic to whether the original greedy OCRS is deterministic or randomized. We observe that, due to a larger feasibility constraint family, the number of independent sets has increased with respect to the standard setting. However, we show that this does not constitute a problem, and an equivalence between the two settings can be established through the use of Algorithm 1. The following result shows that Algorithm 1 yields a (b, c)-selectable greedy OCRS for temporal packing constraints.

Theorem 1. Let F, F^d be the standard and temporal packing constraint families, respectively, and let their corresponding polytopes be P_F and P_{F^d}. Let x ∈ bP_F and y ∈ bP_{F^d}, and consider a (b, c)-selectable greedy OCRS π for F_{π,x}. Then, Algorithm 1 equipped with π is a (b, c)-selectable greedy OCRS for F^d_{π,y}.

Proof. Let us denote by π̂ the procedure described in Algorithm 1. First, we show that π̂ is a greedy OCRS for F^d.

Greediness. It is clear from the setting and the construction that elements arrive one at a time, and that π̂ irrevocably selects an incoming element only if it is feasible, and before seeing the next element. Indeed, in the if statement, an element is selected only after verifying feasibility with respect to the induced constraint family. Finally, F_{π,x} ⊆ F implies that F^d_{π,x} ⊆ F^d, which shows that π̂ is greedy. With the above, we can now turn to demonstrate (b, c)-selectability.

Selectability. Upon arrival of element e ∈ E, let us consider S and S^d to be the sets of elements already selected by π and π̂, respectively. By the way in which the constraint families are defined, and by construction of π̂, we can observe that, given x ∈ bP_{F^d} and y ∈ bP_F, for all S ⊆ R(y) such that S ∪ {e} ∈ F_{π,y}, there always exists a set S^d ⊆ R(x) such that (S^d ∩ E_e) ∪ {e} ∈ F_{π,x}. This establishes an injection between the selected set under standard constraints and its counterpart under temporal constraints. We observe that, for all e ∈ E and x ∈ bP_{F^d},

Pr[ S^d ∪ {e} ∈ F^d_{π,x}  ∀ S^d ⊆ R(x), S^d ∈ F^d_{π,x} ] = Pr[ (S^d ∩ E_e) ∪ {e} ∈ F^d_{π,x}  ∀ S^d ⊆ R(x), S^d ∩ E_e ∈ F^d_{π,x} ].

Hence, since for the greedy OCRS π and y ∈ bP_F we have that Pr[ S ∪ {e} ∈ F_{π,y}  ∀ S ⊆ R(y), S ∈ F_{π,y} ] ≥ c, we can conclude by the injection above that

Pr[ (S^d ∩ E_e) ∪ {e} ∈ F_{π,x}  ∀ S^d ⊆ R(x), S^d ∩ E_e ∈ F_{π,x} ] ≥ c.

The theorem follows.

We remark that the above reduction is agnostic to the weight scale, i.e., we need not assume that w_e ∈ [0, 1] for all e ∈ E. In order to further motivate the significance of Algorithm 1 and Theorem 1, in the Appendix we explicitly reduce the standard setting to the fully dynamic one for single-choice, and provide a general recipe for all packing constraints.
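The wrapper described around Algorithm 1 can be sketched in a few lines. This is a minimal illustration, where `standard_ocrs` (the decision rule of π), the feasibility oracle for the induced family, and the explicit activity windows are hypothetical stand-ins for the paper's abstract objects:

```python
def make_temporal_ocrs(standard_ocrs, is_feasible):
    """Wrap a standard greedy OCRS into a temporal one (Algorithm 1 sketch).

    standard_ocrs(e) -> bool: selection decision of the original scheme pi.
    is_feasible(S)   -> bool: membership oracle for the induced subfamily.
    Elements arrive as (e, t, end) with activity window [t, end).
    """
    selected = []  # pairs (element, end of activity)

    def on_arrival(e, t, end):
        # Keep only previously selected elements that are still active.
        selected[:] = [(x, d) for (x, d) in selected if d > t]
        S = {x for (x, _) in selected}
        # Feasibility check of S together with e: the "if statement" above.
        if not is_feasible(S | {e}):
            return False
        # Defer the actual selection decision to the original OCRS.
        if standard_ocrs(e):
            selected.append((e, end))
            return True
        return False

    return on_arrival
```

For instance, with a rank-1 matroid oracle (`len(S) <= 1`) and a scheme that always accepts, at most one element is active at any time, and a new element can be selected as soon as the previous one expires.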
5 Fully Dynamic Online Selection under Partial Information

In this section, we study the case in which the decision maker has to act under partial information. In particular, we focus on the following online sequential extension of the full-information problem: at each stage t ∈ [T], a decision maker faces a new instance of the fully dynamic online selection problem. An unknown vector of weights w_t ∈ [0, 1]^{|E|} is chosen by an adversary at each stage t, while the feasibility set F^d is known and fixed across all T stages. This setting is analogous to the one recently studied by Gergatsouli and Tzamos (2022) in the context of Pandora's box problems. A crucial difference with the online selection problem with full information studied in Section 4 is that, at each step t, the decision maker has to decide whether to select or discard an element before observing its weight. In particular, at each t, the decision maker takes an action a_t := 1_{S^d_t}, where S^d_t ∈ F^d is the feasible set selected at stage t. The choice of a_t is made before observing w_t. The objective of maximizing the cumulative sum of weights is encoded in the reward function f : [0, 1]^{2m} ∋ (a, w) ↦ ⟨a, w⟩ ∈ [0, 1], which is the reward obtained by playing a with weights w = (w_e)_{e∈E}.⁴ In this setting, we can think of F^d as the set of super-arms in a combinatorial online optimization problem. Our goal is to design online algorithms whose performance is close to that of the best fixed super-arm in hindsight.⁵ In the analysis, as is customary when the online optimization problem has an NP-hard offline counterpart, we resort to the notion of α-regret. In particular, given a set of feasible actions X, we define an algorithm's α-regret up to time T as

Regret_α(T) := α max_{x∈X} Σ_{t=1}^T f(x, w_t) − E[ Σ_{t=1}^T f(x_t, w_t) ],

where α ∈ (0, 1] and x_t is the strategy output by the online algorithm at time t. We say that an algorithm has the no-α-regret property if Regret_α(T)/T → 0 as T → ∞.

The main result of the section is a black-box reduction that yields a no-α-regret algorithm for any fully dynamic online selection problem admitting a temporal OCRS. We provide no-α-regret frameworks for three scenarios:

• full-feedback model: after selecting a_t, the decision maker observes the exact reward function f(·, w_t).
• semi-bandit feedback with white-box OCRS: after taking a decision at time t, the algorithm observes w_{t,e} for each element e ∈ S^d_t (i.e., each element selected at t). Moreover, the decision maker has exact knowledge of the procedure employed by the OCRS, which can be easily simulated.
• semi-bandit feedback with oracle access to the OCRS: the decision maker has semi-bandit feedback and the OCRS is given as a black-box which can be queried once per step t.

Full-Feedback Setting

In this setting, after selecting a_t, the decision maker gets to observe the reward function f(·, w_t). In order to achieve performance close to that of the best fixed super-arm in hindsight, the idea is to employ the α-competitive OCRS designed in Section 4 by feeding it with a fractional solution x_t computed by considering the weights selected by the adversary up to time t − 1.⁶

Let us assume we have at our disposal a no-α-regret algorithm for decision space P_{F^d}. We denote such a regret minimizer as RM, and we assume it offers two basic operations: i) RM.RECOMMEND() returns a vector in P_{F^d}; ii) RM.UPDATE(f(·, w)) updates the internal state of the regret minimizer using feedback received from the environment in the form of a reward function f(·, w). Notice that the availability of such a component is not enough to solve our problem, since at each t we can only play a super-arm a_t ∈ {0, 1}^m feasible for F^d, and not the strategy x_t ∈ P_{F^d} ⊆ [0, 1]^m returned by RM. The decision maker can exploit the subroutine RM together with a temporal greedy OCRS π̂ by following Algorithm 2.

Algorithm 2: FULL-FEEDBACK ALGORITHM
Input: T, F^d, temporal OCRS π̂, subroutine RM
Initialize RM for strategy space P_{F^d}
for t ∈ [T] do
    x_t ← RM.RECOMMEND()
    a_t ← execute OCRS π̂ with input x_t
    Play a_t, and subsequently observe f(·, w_t)
    RM.UPDATE(f(·, w_t))

We can show that, if the algorithm employs a regret minimizer for P_{F^d} with a sublinear cumulative regret upper bound of R_T, the following result holds.

Theorem 2. Given a regret minimizer RM for decision space P_{F^d} with cumulative regret upper bound R_T, and an α-competitive temporal greedy OCRS, Algorithm 2 provides

α max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ R_T.

Since we are assuming the existence of a polynomial-time separation oracle for the set P_{F^d}, the LP arg max_{x∈P_{F^d}} f(x, w) can be solved in polynomial time for any w. Therefore, we can instantiate a regret minimizer for P_{F^d} by using, for example, follow-the-regularized-leader, which yields R_T ≤ Õ(m√T) (Orabona 2019).

Semi-Bandit Feedback with White-Box OCRS

In this setting, given a temporal OCRS π̂, it is enough to show that we can compute the probability that a certain super-arm a is selected by π̂ given a certain order of arrivals at stage t and a vector of weights w. If that is the case, we can build a no-α-regret algorithm with a regret upper bound of Õ(m√T) by employing Algorithm 2 and by instantiating the regret minimizer RM as the online stochastic mirror descent (OSMD) framework by Audibert, Bubeck, and Lugosi (2014). We observe that the regret bound obtained in this way is tight in the semi-bandit setting (Audibert, Bubeck, and Lugosi 2014).

⁴ The analysis can be easily extended to arbitrary functions linear in both terms.
⁵ As we argue in Appendix D, it is not possible to be competitive with respect to more powerful benchmarks.
⁶ We remark that a (b, c)-selectable OCRS yields a bc competitive ratio. In the following, we let α := bc.
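The interaction between RM and the OCRS in Algorithm 2 can be sketched as follows. The regret minimizer and the rounding step are abstracted as plain callables, so the names below are illustrative placeholders rather than the FTRL/OSMD instantiations discussed in the text:

```python
def full_feedback_loop(T, recommend, update, ocrs_round, weights):
    """One run of the full-feedback scheme, in the spirit of Algorithm 2.

    recommend()      -> fractional point x_t in the polytope P_{F^d}.
    update(f)        -> feed the observed reward function f(., w_t) to RM.
    ocrs_round(x, w) -> integral super-arm a_t obtained by rounding x_t online.
    weights          -> list of adversarial weight vectors w_1, ..., w_T.
    """
    total_reward = 0.0
    for t in range(T):
        x_t = recommend()                  # fractional recommendation from RM
        a_t = ocrs_round(x_t, weights[t])  # online rounding via the OCRS
        total_reward += sum(a * w for a, w in zip(a_t, weights[t]))
        # Full feedback: RM observes the entire reward function f(., w_t).
        update(lambda a, w=weights[t]: sum(ai * wi for ai, wi in zip(a, w)))
    return total_reward
```

With a single element, a constant recommender, and a threshold rounding rule, two stages with weights 0.5 and 1.0 accumulate a reward of 1.5, matching the linear reward function f(a, w) = ⟨a, w⟩ defined above.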
Let q_t(e) be the probability with which our algorithm selects element e at time t. Then, we can equip OSMD with the following unbiased estimator of the vector of weights: ŵ_{t,e} := w_{t,e} a_{t,e}/q_t(e).⁷ In order to compute q_t(·), we need to have observed the order of arrival at stage t and the weights corresponding to super-arm a_t, and we need to be able to compute the probability with which the OCRS selected e at t. This is the reason for which we talk about a "white-box" OCRS, as we need to simulate ex post the procedure followed by the OCRS in order to compute q_t(·). When we know the procedure followed by the OCRS, we can always compute q_t(e) for any element e selected at stage t, since at the end of stage t we know the order of arrival, the weights of selected elements, and the initial fractional solution x_t. We provide further intuition as to how to compute such probabilities through the running example of matching constraints.

Example 3. Consider Algorithm 2 initialized with the OCRS of Example 1. Given stage t, we can safely limit our attention to selected edges (i.e., elements e such that a_{t,e} = 1). Indeed, all other edges will either be unfeasible (which implies that the probability of selecting them is 0), or they were not selected despite being feasible. Consider an arbitrary element e among those selected. Conditioned on the past choices up to element e, we know that e ∈ a_t will be feasible with certainty, and thus the (unconditional) probability it is selected is simply q_t(e) = 1 − e^{−y_{t,e}}.

⁷ We observe that ŵ_{t,e} is equal to 0 when e has not been selected at stage t because, in that case, a_{t,e} = 0.

Semi-Bandit Feedback and Oracle Access to OCRS

As in the previous case, at each stage t the decision maker can only observe the weights associated with the edges selected by a_t. Therefore, they have no counterfactual information on their reward had they selected a different feasible set. On top of that, we assume that the OCRS is given as a black-box, and therefore we cannot compute ex post the probabilities q_t(e) for selected elements. However, we show that it is possible to tackle this setting by exploiting a reduction from the semi-bandit feedback setting to the full-information feedback one. In doing so, we follow the approach first proposed by Awerbuch and Kleinberg (2008). The idea is to split the time horizon T into a given number of equally-sized blocks. Each block allows the decision maker to simulate a single stage of the full-information setting. We denote the number of blocks by Z, and each block τ ∈ [Z] is composed of a sequence of consecutive stages I_τ. Algorithm 3 describes the main steps of our procedure. In particular, the algorithm employs a procedure RM, an algorithm for the full-feedback setting such as the one described in the previous section, which exposes an interface with the two operations of a traditional regret minimizer. During each block τ, the full-information subroutine is used to compute a vector x_τ. Then, in most stages of the window I_τ, the decision a_t is computed by feeding x_τ to the OCRS. A few stages are chosen uniformly at random to estimate the utilities provided by other feasible sets (i.e., exploration phase). After the execution of all the stages in the window I_τ, the algorithm computes estimated reward functions and uses them to update the full-information regret minimizer.

Algorithm 3: SEMI-BANDIT-FEEDBACK ALGORITHM WITH ORACLE ACCESS TO OCRS
Input: T, F^d, temporal OCRS π̂, full-feedback algorithm RM for decision space P_{F^d}
Let Z be initialized as in Theorem 3, and initialize RM appropriately
for τ = 1, . . . , Z do
    I_τ ← {(τ − 1)T/Z + 1, . . . , τT/Z}
    Choose a random permutation p : [m] → E, and t_1, . . . , t_m stages at random from I_τ
    x_τ ← RM.RECOMMEND()
    for t = (τ − 1)T/Z + 1, . . . , τT/Z do
        if t = t_j for some j ∈ [m] then
            x_t ← 1_{S^d} for a feasible set S^d containing p(j)
        else
            x_t ← x_τ
        Play a_t obtained from the OCRS π̂ executed with fractional solution x_t
    Compute estimators f̃_τ(e) of f_τ(e) := (1/|I_τ|) Σ_{t∈I_τ} f(1_e, w_t) for each e ∈ E
    RM.UPDATE(f̃_τ(·))

Let p : [m] → E be a random permutation of the elements in E. Then, for each e ∈ E, by letting j be the index such that p(j) = e in the current block τ, an unbiased estimator f̃_τ(e) of f_τ(e) := (1/|I_τ|) Σ_{t∈I_τ} f(1_e, w_t) can be easily obtained by setting f̃_τ(e) := f(1_e, w_{t_j}). Then, it is possible to show that our algorithm provides the following guarantees.

Theorem 3. Given a temporal packing feasibility set F^d and an α-competitive OCRS π̂, let Z = T^{2/3}, and let the full-feedback subroutine RM be defined as per Theorem 2. Then Algorithm 3 guarantees that

α max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[ Σ_{t=1}^T f(a_t, w_t) ] ≤ Õ(T^{2/3}).

6 Conclusion and Future Work

In this paper we introduce fully dynamic online selection problems, in which selected items affect the combinatorial constraints during their activity times. We presented a generalization of the OCRS approach that provides near-optimal competitive ratios in the full-information model, and no-α-regret algorithms with polynomial per-iteration running time with both full and semi-bandit feedback. Our framework opens various future research directions. For example, it would be particularly interesting to understand whether a variation of Algorithms 2 and 3 can be extended to the case in which the adversary changes the constraint family at each stage. Moreover, the study of the bandit-feedback model remains open, and no regret bound is known for that setting.
Acknowledgements

The authors of Sapienza are supported by the Meta Research grant on "Fairness and Mechanism Design", the ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets", and the MIUR PRIN project ALGADIMAR "Algorithms, Games, and Digital Markets".

References

Abernethy, J. D.; Hazan, E.; and Rakhlin, A. 2009. Competing in the dark: An efficient algorithm for bandit linear optimization. COLT.
Atsidakou, A.; Papadigenopoulos, O.; Basu, S.; Caramanis, C.; and Shakkottai, S. 2021. Combinatorial Blocking Bandits with Stochastic Delays. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 404–413.
Audibert, J.-Y.; Bubeck, S.; and Lugosi, G. 2014. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1): 31–45.
Awerbuch, B.; and Kleinberg, R. 2008. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1): 97–114.
Basu, S.; Papadigenopoulos, O.; Caramanis, C.; and Shakkottai, S. 2021. Contextual Blocking Bandits. In The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, 271–279.
Basu, S.; Sen, R.; Sanghavi, S.; and Shakkottai, S. 2019. Blocking Bandits. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Bishop, N.; Chan, H.; Mandal, D.; and Tran-Thanh, L. 2020. Adversarial Blocking Bandits. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Cesa-Bianchi, N.; and Lugosi, G. 2012. Combinatorial bandits. Journal of Computer and System Sciences, 78(5): 1404–1422.
Chawla, S.; Hartline, J. D.; Malec, D. L.; and Sivan, B. 2010. Multi-Parameter Mechanism Design and Sequential Posted Pricing. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, 311–320. New York, NY, USA: Association for Computing Machinery. ISBN 9781450300506.
Chekuri, C.; Vondrák, J.; and Zenklusen, R. 2011. Submodular Function Maximization via the Multilinear Relaxation and Contention Resolution Schemes. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 783–792. New York, NY, USA: Association for Computing Machinery. ISBN 9781450306911.
Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, 151–159. PMLR.
Demetrescu, C.; Eppstein, D.; Galil, Z.; and Italiano, G. F. 2010. Dynamic Graph Algorithms, 9. Chapman & Hall/CRC, 2 edition. ISBN 9781584888222.
Dickerson, J.; Sankararaman, K.; Srinivasan, A.; and Xu, P. 2018. Allocation problems in ride-sharing platforms: Online matching with offline reusable resources. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Ezra, T.; Feldman, M.; Gravin, N.; and Tang, Z. G. 2020. Online Stochastic Max-Weight Matching: Prophet Inequality for Vertex and Edge Arrival Models. In EC '20, 769–787.
Feige, U. 1998. A Threshold of Ln n for Approximating Set Cover. J. ACM, 45(4): 634–652.
Feldman, M.; Svensson, O.; and Zenklusen, R. 2016. Online Contention Resolution Schemes. In Krauthgamer, R., ed., Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, 1014–1033. SIAM.
Gergatsouli, E.; and Tzamos, C. 2022. Online Learning for Min Sum Set Cover and Pandora's Box. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, 7382–7403.
Gupta, A.; and Nagarajan, V. 2013. A Stochastic Probing Problem with Applications. In Proceedings of the 16th International Conference on Integer Programming and Combinatorial Optimization, IPCO '13, 205–216. Berlin, Heidelberg: Springer-Verlag. ISBN 9783642366932.
György, A.; Linder, T.; Lugosi, G.; and Ottucsák, G. 2007. The On-Line Shortest Path Problem Under Partial Monitoring. Journal of Machine Learning Research, 8(10).
Hajiaghayi, M. T.; Kleinberg, R.; and Sandholm, T. 2007. Automated Online Mechanism Design and Prophet Inequalities. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, AAAI '07, 58–65. AAAI Press. ISBN 9781577353232.
Kesselheim, T.; and Mehlhorn, K. 2016. Lecture 2: Yao's Principle and the Secretary Problem. Randomized Algorithms and Probabilistic Analysis of Algorithms, Max Planck Institute for Informatics, Saarbrücken, Germany.
Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret Bounds for Sleeping Experts and Bandits. Mach. Learn., 80(2–3): 245–272.
Kleinberg, R.; and Weinberg, S. M. 2012. Matroid prophet inequalities. In STOC '12, 123–136.
Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvari, C. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, 535–543. PMLR.
Livanos, V. 2021. A Simple and Tight Greedy OCRS. CoRR, abs/2111.13253.
McMahan, H. B.; and Blum, A. 2004. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, 109–123. Springer.
Orabona, F. 2019. A modern introduction to online learning. arXiv preprint arXiv:1912.13213.
A Contention Resolution Schemes and Online Contention Resolution Schemes
As explained at length in Section 2, our goal in general is that of finding the independent set of maximum weight for a
given feasibility constraint family. However, doing this directly might be intractable, and we need to aim for a
good approximation of the optimum. In particular, given a non-negative submodular function f : [0, 1]^m → R_{≥0}, and a family
of packing constraints F , we start from an ex ante feasible solution to the linear program maxx∈PF f (x), which upper bounds
the optimal value achievable. An ex ante feasible solution is simply a distribution over the independent sets of F , given by
a vector x in the packing constraint polytope of F . A key observation is that we can interpret the ex ante optimal solution
to the above linear program as a vector x* of fractional values, which induces a distribution over elements such that x*_e is the
marginal probability that element e ∈ E is included in the optimum. Then, we use this solution to obtain a feasible solution that
suitably approximates the optimum. The random set R(x∗ ) constructed by ex ante selecting each element independently with
probability x∗e can be infeasible. Contention Resolution Schemes (Chekuri, Vondrák, and Zenklusen 2011) are procedures that,
starting from the random set of sampled elements R(x∗ ), construct a feasible solution with good approximation guarantees
with respect to the optimal solution of the original integer linear program.
Definition 6 (Contention Resolution Schemes (CRSs) (Chekuri, Vondrák, and Zenklusen 2011)). For b, c ∈ [0, 1], a (b, c)-
balanced Contention Resolution Scheme (CRS) π for F = (E, I) is a procedure that, for every ex-ante feasible solution
x ∈ bP_F (i.e., the down-scaled version of polytope P_F), and every subset S ⊆ E, returns a random set π(x, S) ⊆ S satisfying
the following properties:
1. Feasibility: π(x, S) ∈ I.
2. c-balancedness: Prπ,R(x) [e ∈ π(x, R(x)) | e ∈ R(x)] ≥ c, ∀e ∈ E.
When elements arrive in an online fashion, Feldman, Svensson, and Zenklusen (2016) extend CRS to the notion of OCRS,
where R(x) is obtained in the same manner, but elements are revealed one by one in adversarial order. The procedure has to
decide irrevocably whether or not to add the current element to the final solution set, which needs to be feasible and competitive
against the offline optimum. The idea is that adding a sampled element e ∈ E to the set of already selected elements S ⊆ R(x)
maintains feasibility with at least constant probability, regardless of the element and the set. This motivates Definition 3 and
the subsequent discussion.
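The (b, c)-balancedness property lends itself to a quick numerical sanity check. The sketch below uses a deliberately naive CRS for a rank-1 matroid (keep one uniformly random element of the sampled set), with made-up marginals x; it only illustrates the definition, not the schemes constructed in this paper:

```python
import random

random.seed(0)

def sample_R(x):
    """Ex-ante sampling: include each element e independently w.p. x[e]."""
    return {e for e, xe in enumerate(x) if random.random() < xe}

def crs_rank1(S):
    """A naive CRS for a rank-1 matroid: keep one uniform element of S."""
    return {random.choice(sorted(S))} if S else set()

x = [0.4, 0.3, 0.3]  # ex-ante feasible: sum(x) <= 1
in_R, kept = [0] * 3, [0] * 3
for _ in range(100_000):
    R = sample_R(x)
    out = crs_rank1(R)
    assert len(out) <= 1  # feasibility: independent in the rank-1 matroid
    for e in R:
        in_R[e] += 1
        kept[e] += e in out

# Empirical balancedness: Pr[e in pi(x, R(x)) | e in R(x)] per element.
balance = [kept[e] / in_R[e] for e in range(3)]
```

On this instance the empirical ratios come out around 0.69-0.73, i.e., even this naive scheme is c-balanced for a constant c bounded away from zero; the point of the schemes discussed in the text is to achieve such guarantees uniformly over all ex-ante feasible x.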
B Examples
In this section, we provide some clarifying examples for the concepts introduced in Section 2 and 3.
Polytopes
Example 4 provides the definition of the constraint polytopes of some standard problems, while Example 5 describes their
temporal version. For a set S ⊆ E and x ∈ R^m, we define, with a slight abuse of notation, x(S) := Σ_{e∈S} x_e.
Example 4 (Standard Polytopes). Given a ground set E,
• Let K = (E, I) be a knapsack constraint. Then, given budget B > 0 and a vector of elements' sizes c ∈ R^m_{≥0}, its feasibility
polytope is defined as
P_K = {x ∈ [0, 1]^m : ⟨c, x⟩ ≤ B}.
• Let G = (E, I) be a matching constraint. Then, its feasibility polytope is defined as
P_G = {x ∈ [0, 1]^m : x(δ(u)) ≤ 1, ∀u ∈ V},
where δ(u) denotes the set of all adjacent edges to u ∈ V. Note that the ground set in this case is the set of all edges of
graph G = (V, E).
• Let M = (E, I) be a matroid constraint. Then, its feasibility polytope is defined as
P_M = {x ∈ [0, 1]^m : x(S) ≤ rank(S), ∀S ⊆ E}.
Here, rank(S) := max {|I| : I ⊆ S, I ∈ I}, i.e., the cardinality of the maximum independent set contained in S.
We can now rewrite the above polytopes under temporal packing constraints.
Example 5 (Temporal Polytopes). For ground set E,
• Let K = (E, I) be a knapsack constraint. Then, for B > 0 and cost vector c ∈ R^m_{≥0}, its feasibility polytope is defined as
P_K^d = {x ∈ [0, 1]^m : Σ_{e′∈E_e} c_{e′} x_{e′} ≤ B, ∀e ∈ E}.
• Let G = (E, I) be a matching constraint. Then, its feasibility polytope is defined as
P_G^d = {x ∈ [0, 1]^m : x(δ(u) ∩ E_e) ≤ 1, ∀u ∈ V, ∀e ∈ E}.
• Let M = (E, I) be a matroid constraint. Then, its feasibility polytope is defined as
P_M^d = {x ∈ [0, 1]^m : x(S ∩ E_e) ≤ rank(S), ∀S ⊆ E, ∀e ∈ E}.
We also note that, for general packing constraints, if d_e = ∞ for all e ∈ E, then E_e = E and P_F^∞ = P_F, and similarly for the constraint family F^∞ = F.
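The effect of the sets E_e can be illustrated on the temporal matching polytope. In the sketch below we assume E_e collects the elements whose activity window [s_{e′}, s_{e′} + d_{e′}) covers the arrival time of e (this reading of E_e, and all helper names, are our own):

```python
def active_set(e, s, d):
    # E_e under our assumption: elements still active when e arrives (e included).
    return {e2 for e2 in s if s[e2] <= s[e] < s[e2] + d[e2]}

def in_temporal_matching_polytope(x, vertices, s, d):
    """x in P_G^d: x(delta(u) ∩ E_e) <= 1 for every vertex u and element e."""
    if not all(0.0 <= xe <= 1.0 for xe in x.values()):
        return False
    for e in x:
        Ee = active_set(e, s, d)
        for u in vertices:
            if sum(x[e2] for e2 in Ee if u in e2) > 1.0:
                return False
    return True

# Two edges sharing machine "u": the second arrives after the first expires,
# so both can be fully selected temporally, although x(delta(u)) = 2 > 1.
e1, e2 = ("u", "v1"), ("u", "v2")
x = {e1: 1.0, e2: 1.0}
s = {e1: 0.0, e2: 2.0}
assert in_temporal_matching_polytope(x, {"u", "v1", "v2"}, s, {e1: 1.0, e2: 1.0})
assert not in_temporal_matching_polytope(x, {"u", "v1", "v2"}, s, {e1: 5.0, e2: 1.0})
```

The last line shows that once the first edge stays active long enough to overlap the second arrival, the temporal constraint at u binds again, recovering the standard matching constraint.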
From Standard OCRS to Temporal OCRS for Rank-1 Matroids, Matchings, Knapsacks, and General
Matroids
In this section, we explicitly derive a (1, 1/e)-selectable (randomized) temporal greedy OCRS for the rank-1 matroid feasibility
constraint, from a (1, 1/e)-selectable (randomized) greedy OCRS in the standard setting (Livanos 2021), which is also tight.
Let us denote this standard OCRS as πM , where M is a rank-1 matroid.
Corollary 1. For the rank-1 matroid feasibility constraint family under temporal constraints, Algorithm 1 produces a (1, 1/e)-
selectable (randomized) temporal greedy OCRS π̂M from πM .
Proof. Since it is clear from context, we drop the dependence on M and write π, π̂. We will proceed by comparing side-by-side
what happens in π and in π̂. Let us recall from Examples 4, 5 that the polytopes can respectively be written as
P_M = {x ∈ [0, 1]^m : x(S) ≤ 1, ∀S ⊆ E},
P_M^d = {y ∈ [0, 1]^m : y(S ∩ E_e) ≤ 1, ∀S ⊆ E, ∀e ∈ E}.
The two OCRSs perform the following steps, on the basis of Algorithm 1. On one hand, π defines a subfamily of constraints F_{π,x} := {{e} : e ∈ H(x)}, where e ∈ E is included in the random subset H(x) ⊆ E with probability (1 − e^{−x_e})/x_e. Then, it selects the first sampled element e ∈ R(x) such that {e} ∈ F_{π,x}. On the other hand, π̂ defines a subfamily of constraints F^d_{π̂,y} := {{e} : e ∈ H(y)}, where e ∈ E is included in the random subset H(y) ⊆ E with probability q_e(y) = (1 − e^{−y_e})/y_e. The feasibility family F^d_{π̂,y} induces, as per Observation 1, a sequence of feasibility families F_{π̂,y}(e) := {{e′} : e′ ∈ H(y) ∩ E_e}, for each e ∈ E. Upon each arrival, the temporal OCRS selects the first sampled element e ∈ R(y) such that {e} ∈ F_{π̂,y}(e). In other words, the temporal OCRS selects a sampled active element only if no other element in its active set has been selected earlier. It is clear that both are randomized greedy OCRSs.
We will now proceed by showing that each element e is selected with probability at least 1/e in both π and π̂. In π, element e is selected if it is sampled and no earlier element has been selected before (i.e., its singleton set belongs to the subfamily F_{π,x}). An element e′ is not selected with probability 1 − x_{e′} · (1 − e^{−x_{e′}})/x_{e′} = e^{−x_{e′}}. This means that the probability of e being selected, conditioned on e ∈ R(x), is at least
((1 − e^{−x_e})/x_e) · Π_{e′ arriving before e} e^{−x_{e′}} ≥ ((1 − e^{−x_e})/x_e) · e^{−(1−x_e)} = (e^{x_e} − 1)/(x_e · e) ≥ 1/e,
where the first inequality uses x(E) ≤ 1 and the last uses e^{x_e} − 1 ≥ x_e.
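The per-element guarantee of π can also be checked numerically in closed form: conditioned on e ∈ R(x), e is selected iff it entered H(x) (probability (1 − e^{−x_e})/x_e) and every earlier element survived (probability e^{−x_{e′}} each), giving ((1 − e^{−x_e})/x_e) · exp(−Σ_{e′ before e} x_{e′}). The sketch below (our own helper) verifies the 1/e bound over every adversarial arrival order of a feasible x:

```python
import itertools
import math

def conditional_selection_prob(xs, i):
    # Pr[e_i selected | e_i sampled]: e_i enters H with prob (1 - exp(-x_i)) / x_i,
    # and each earlier element is left unselected with prob exp(-x_j).
    in_H = (1.0 - math.exp(-xs[i])) / xs[i]
    earlier_survive = math.exp(-sum(xs[:i]))
    return in_H * earlier_survive

# For any x in P_M (here sum(x) = 1) and any arrival order, the bound is 1/e.
x = [0.2, 0.5, 0.3]
for order in itertools.permutations(range(len(x))):
    xs = [x[j] for j in order]
    for i in range(len(xs)):
        assert conditional_selection_prob(xs, i) >= 1.0 / math.e - 1e-12
```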
Proof. The proof follows from analyzing the constraints. The meaning of Constraint 1 is that, upon the arrival of vertex v, v must be matched at most once in expectation. In fact, for each job v ∈ V, at most one machine u ∈ U can be selected by the optimum, which yields
Σ_{u∈U} x_{uv} ≤ 1.
This justifies Constraint 1. Constraint 2, on the other hand, has the following simple interpretation: machine u is unavailable when job v arrives if it has been matched earlier to a job v′ whose activity time exceeds the difference between the arrival times of v and v′. Otherwise, u can in fact be matched to v, and this probability is of course at most the probability of u being available. This implies that, for each machine u ∈ U and each job v ∈ V,
Σ_{v′: s_{v′} < s_v} x_{uv′} · Pr[d_{uv′} ≥ s_v − s_{v′}] + x_{uv} ≤ 1.
We have shown that all constraints are less restrictive for the linear program than they are for the offline optimum. Since the objective function is the same for both, any feasible solution for the integral optimum is also feasible for the linear program, while the converse does not necessarily hold. The statement follows.
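The two constraints can be checked mechanically. The sketch below (our own helper names) restricts attention to deterministic activity times, so that Pr[d_{uv′} ≥ s_v − s_{v′}] degenerates to an indicator:

```python
def lp_feasible(x, s, d, U, V, tol=1e-9):
    """Check Constraints 1 and 2 for deterministic activity times d[(u, v)]."""
    # Constraint 1: each job v is matched at most once in expectation.
    if any(sum(x[u, v] for u in U) > 1 + tol for v in V):
        return False
    # Constraint 2: mass from earlier, still-active matches of u plus x_uv <= 1.
    for u in U:
        for v in V:
            load = sum(x[u, vp] for vp in V
                       if s[vp] < s[v] and d[u, vp] >= s[v] - s[vp])
            if load + x[u, v] > 1 + tol:
                return False
    return True

# One machine, two jobs arriving at times 0 and 1; the first match stays
# active for 2 time units, so it still blocks the machine at time 1.
U, V = ["u"], ["v1", "v2"]
s = {"v1": 0.0, "v2": 1.0}
d = {("u", "v1"): 2.0, ("u", "v2"): 1.0}
assert lp_feasible({("u", "v1"): 0.5, ("u", "v2"): 0.5}, s, d, U, V)
assert not lp_feasible({("u", "v1"): 0.6, ("u", "v2"): 0.6}, s, d, U, V)
```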
A simple algorithm
Inspired by the algorithm by Dickerson et al. (2018) (which deals with stochastic rather than adversarial arrivals), we propose Algorithm 4. In the remainder, let N_avail(v) denote the set of available vertices u ∈ U when v ∈ V arrives.
Lemma 2. Algorithm 4 makes every vertex u ∈ U available with probability at least α. Moreover, α = 1/2 is the largest value of α for which this guarantee holds.
Proof. We will prove the claim by induction. For the first incoming job v = 1, Pr[u ∈ N_avail(v)] = 1 ≥ α for all machines u ∈ U, no matter what the values of w_{uv}, d_{uv} are. To complete the base case, we only need to check that the probability of selecting one machine is in fact no larger than one; for this purpose, let us denote the event "u is selected by Algorithm 4 when v arrives" as u ∈ ALG(v).
Pr[∃u ∈ N_avail(v) : u ∈ ALG(v)] = Σ_{u∈U} α · x*_{uv}/Pr[u ∈ N_avail(v)] ≤ α,
where the first equality follows from the fact that the events within the existential quantifier are disjoint, and recalling that N_avail(v) = U for the first job. Consider all vertices v′ arriving before vertex v (s_{v′} < s_v), and assume that Pr[u ∈ N_avail(v′)] ≥ α always. This means that the algorithm makes each u available with probability at least α for all vertex arrivals before v. This, in turn, implies that each u is selected at such an arrival v′ with probability α · x*_{uv′}. Let us observe that a machine u ∈ U will not be available for the incoming job v ∈ V only if the algorithm has matched it to an earlier job v′ with activity time larger than s_v − s_{v′}. Formally, the probability that u is available for v is
Pr[u ∈ N_avail(v)] = 1 − Pr[u ∉ N_avail(v)]
= 1 − Pr[∃v′ ∈ V : s_{v′} < s_v, u ∈ ALG(v′), d_{uv′} > s_v − s_{v′}]
≥ 1 − α · Σ_{v′: s_{v′} < s_v} x*_{uv′} · Pr[d_{uv′} ≥ s_v − s_{v′}]
≥ α + α · x*_{uv}
≥ α.
The second-to-last inequality follows from Constraint 2, and by observing the following simple implication for all r, z ∈ R: if r + z ≤ 1, then 1 − αr ≥ α + αz, so long as α ≤ 1/2. Since we would like to choose α as large as possible, we choose α = 1/2.
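The implication used above can be spot-checked on a grid (a sketch; nonnegative r, z suffice here, since both quantities are sums of probabilities):

```python
def implication_holds(alpha, steps=201):
    # For all r, z >= 0 with r + z <= 1: 1 - alpha*r >= alpha + alpha*z.
    grid = [i / (steps - 1) for i in range(steps)]
    return all(1 - alpha * r >= alpha + alpha * z - 1e-12
               for r in grid for z in grid if r + z <= 1 + 1e-12)

assert implication_holds(0.5)        # alpha = 1/2 works ...
assert not implication_holds(0.51)   # ... and no larger alpha does
```

Indeed, 1 − αr ≥ α + αz rearranges to α(1 + r + z) ≤ 1, and with r + z = 1 this forces α ≤ 1/2.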
What is left to be shown is that the probability of selecting one machine is at most one:
Pr[∃u ∈ N_avail(v) : u ∈ ALG(v)] = Σ_{u∈U} α · x*_{uv}/Pr[u ∈ N_avail(v)] ≤ 1,
which holds because Pr[u ∈ N_avail(v)] ≥ α and Σ_{u∈U} x*_{uv} ≤ 1 by Constraint 1.
D Benchmarks
The need for stages
We argue that, for the results in Section 5, stages are necessary in order for us to be able to compare our algorithm against any
meaningful benchmark. Suppose, in contrast, that we chose to compare against the optimum (or an approximation of it) within
a single stage where n jobs arrive to a single arm. A non-adaptive adversary could simply run the following procedure, with
each job having weight 1: with probability 1/2, jobs with odd arrival order have activity time 1, and jobs with even arrival order
have activity time ∞, with probability 1/2 the opposite holds. To be precise, let us recall that ∞ is just a shorthand notation to
mean that all future jobs would be blocked: indeed, the activity time of a job arriving at time se is not unbounded but can be at
most n − se . As activity times are revealed after the algorithm has made a decision for the current job, the algorithm does not
know whether taking the current job will prevent it from being blocked for the entire future. The best thing the algorithm can
do is to pick the first job with probability 1/2. Indeed, if the algorithm is lucky and the activity time is 1, then it knows it is in the first scenario and gets n. Otherwise, it only gets 1. Hence, the regret would be R_n = n − (n + 1)/2 ∈ Ω(n), which is linear.
Note that n and T here represent two different concepts: the first is the number of elements sent within a stage; the second is the
number of stages. In the case outlined above, T = 1, since it is a single stage scenario. Thus, there is no hope that in a single
stage we could do anything meaningful, and we turn to the framework where an entire instance of the problem is sent at each
stage t ∈ [T ].
Figure 1: Three jobs per stage: w.p. 1/2, either {(1, 1), (1, 1), (1, 1)} or {(ε, ∞), (1, ∞), (1, ∞)}.
On the other hand, the algorithm will discover which scenario it has landed in only after the value of the first job has been revealed. If it does not pick it and the job turns out to have weight 1, then the algorithm can get at most 1 from the remaining jobs. If instead it decides to pick it but the job turns out to have value ε, it will only get ε. Even if the algorithm is aware of such a stochastic input beforehand, it knows that stages are independent and, hence, cannot be adaptive before a given stage begins. It could then observe the first job's weight without taking it, but by then it may already be too late. Any algorithm in this setting can be described by deciding to accept the first job with probability p (and reject it with probability 1 − p), and then acting adaptively. Then, again by linearity of expectation,
Σ_{t∈[T]} E[f_t(a_t^ALG)] = T · E[f(a^ALG)] = T · (1/2 · (2p + (1 − p)) + 1/2 · (εp + (1 − p))) = ((2 + εp)/2) · T ≤ (1 + ε) · T.
Thus, R_T ≥ (1 − ε) · T ∈ Ω(T).
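The per-stage computation can be replayed numerically; `expected_stage_reward` is our own name for E[f(a^ALG)] as a function of the acceptance probability p:

```python
def expected_stage_reward(p, eps):
    # W.p. 1/2 the stage is {(1,1),(1,1),(1,1)}: accepting the first job yields 2,
    # rejecting yields 1. W.p. 1/2 it is {(eps,inf),(1,inf),(1,inf)}: accepting
    # yields eps, rejecting yields 1.
    return 0.5 * (2 * p + (1 - p)) + 0.5 * (eps * p + (1 - p))

eps = 0.01
for p in [i / 100 for i in range(101)]:
    r = expected_stage_reward(p, eps)
    assert abs(r - (2 + eps * p) / 2) < 1e-12  # matches the closed form
    assert r <= 1 + eps                        # hence E[ALG] <= (1 + eps) * T
```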
Now, we ask whether there exists a similar lower bound on approximate regret. Similarly to the previous lemma, we denote by a_t^OCRS the action chosen at time t by the OCRS.
Lemma 4. Every algorithm has R_T = α · Σ_{t∈[T]} E[f_t(a_t^OPT)] − Σ_{t∈[T]} E[f_t(a_t^ALG)] ∈ Ω(T) against an α-approximation of the dynamic optimum, for α ∈ (0, 1].
Proof. Let all the activity times be infinite, and define (for a given stage) the constraint to be picking a single job irrevocably. We know that, for the single-choice problem, a tight OCRS achieves a competitive ratio of α = 1/2. However, such an OCRS is not greedy. Livanos (2021) constructs a tight greedy OCRS for single-choice, which is α = 1/e competitive. For our purposes, nonetheless, we only require the trivial inequality α ≤ 1. The non-adaptive adversary could run the following a priori procedure, for each of the T stages: let δ = (α − 1/n)/2 be a constant, sample k ∼ [n] uniformly at random, and send jobs in order of weights δ^k, δ^{k−1}, . . . , δ, 0, . . . , 0 (ascending until δ and then all 0s).⁸ We know that, by Theorem 1,
Σ_{t∈[T]} E[f_t(a_t^OCRS)] ≥ α · Σ_{t∈[T]} E[f_t(a_t^OPT)].
This is possible because the greedy OCRS has full a priori information about the current stage (it knows δ and the sampled k at each stage), unlike the algorithm, which is unaware of how the T stages are going to be presented. The best the algorithm can do within a given stage is to randomly guess the drawn k, i.e., where δ will land. We now divide the time horizon into T/n intervals, each composed of n stages. In each interval, since no stage is predictive of the next, the algorithm cannot be adaptive across stages, nor can it be within a stage, since all possible sequences have the same prefix. By construction, we expect the algorithm to catch δ once per time interval, and otherwise get at most δ², optimistically, for all remaining n − 1 stages. In other words, let us index each interval by I ∈ [T/n] and rewrite the algorithm's and the OCRS's expected rewards as

⁸ This construction is inspired by the notes of Kesselheim and Mehlhorn (2016).
Σ_{t∈[T]} E[f_t(a_t^ALG)] ≤ Σ_{I∈[T/n]} (δ + (n − 1)δ²) ≤ (δ/n + δ²) · T,
Σ_{t∈[T]} E[f_t(a_t^OCRS)] ≥ Σ_{I∈[T/n]} αδ · n = αδ · T.
Hence,
R_T = Σ_{t∈[T]} E[f_t(a_t^OCRS)] − Σ_{t∈[T]} E[f_t(a_t^ALG)]
≥ (αδ − δ/n − δ²) · T
= ((α − 1/n)²/4) · T ∈ Ω(T).
The last step follows from the fact that 1/n ∈ o(α) and n ∈ o(T).
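The algebra in the last two steps can be confirmed numerically with δ = (α − 1/n)/2:

```python
def per_stage_regret(alpha, n):
    # alpha*delta - delta/n - delta^2 with delta = (alpha - 1/n)/2.
    delta = (alpha - 1.0 / n) / 2.0
    return alpha * delta - delta / n - delta ** 2

# The expression collapses to (alpha - 1/n)^2 / 4 for any alpha and n.
for alpha in (1.0, 0.5, 1 / 2.718281828459045):
    for n in (10, 100, 10_000):
        assert abs(per_stage_regret(alpha, n) - (alpha - 1 / n) ** 2 / 4) < 1e-12
```

Indeed, αδ − δ/n − δ² = δ(α − 1/n − δ) = δ · (α − 1/n)/2 = (α − 1/n)²/4.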
Proof. We assume to have access to a regret minimizer for the set P_F^d guaranteeing an upper bound R^T on the cumulative regret up to time T. Then,
E[Σ_{t=1}^T f(a_t, w_t)] ≥ α · Σ_{t=1}^T f(x_t, w_t)
≥ α · (max_{x∈P_F^d} Σ_{t=1}^T f(x, w_t) − R^T)
= α · (max_a Σ_{t=1}^T f(a, w_t) − R^T),
where the first inequality follows from the fact that Algorithm 2 employs a suitable temporal OCRS π̂ to select a_t: for each e ∈ E, the probability with which the OCRS selects e is at least α · x_{t,e}, and since f is a linear mapping (in particular, it is defined as the scalar product between a vector of weights and the choice at time t), the inequality holds in expectation. The second inequality follows from the no-regret property of the regret minimizer for the decision space P_F^d. This concludes the proof.
Theorem 3. Given a temporal packing feasibility set F^d and an α-competitive OCRS π̂, let Z = T^{2/3}, and let the full-feedback subroutine RM be defined as per Theorem 2. Then Algorithm 3 guarantees that
α · max_{S∈I^d} Σ_{t=1}^T f(1_S, w_t) − E[Σ_{t=1}^T f(a_t, w_t)] ≤ Õ(T^{2/3}).
Proof. We start by computing a lower bound on the average reward the algorithm gets. Algorithm 3 splits its decisions into Z blocks and, at each τ ∈ [Z], chooses the action x_τ suggested by the RM, unless the stage is one of the randomly sampled exploration steps. Then, we can write
(1/T) · E[Σ_{t=1}^T f(a_t, w_t)] ≥ (α/T) · Σ_{τ∈[Z]} Σ_{t∈I_τ} f(x_t, w_t)
≥ (α/T) · Σ_{τ∈[Z]} Σ_{t∈I_τ} f(x_τ, w_t) − α · m²Z/T
= (α/T) · Σ_{τ∈[Z]} Σ_{e∈E} Σ_{t∈I_τ} x_{τ,e} · f(1_e, w_t) − α · m²Z/T
= (α/Z) · Σ_{τ∈[Z]} Σ_{e∈E} x_{τ,e} · E[f̃_τ(e)] − α · m²Z/T,
where the first inequality is by the use of a temporal OCRS to select a_t, and the second inequality is obtained by subtracting the worst-case costs incurred during exploration; note that the m² factor in the second inequality is due to the fact that, at each of the m exploration stages of a block, we can lose at most m. The last equality is by definition of the unbiased estimator, since the value of f is observed, in expectation, once for every block of T/Z stages.
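One concrete estimator consistent with the stated expectation: within each block, explore one uniformly chosen stage and report the value observed there. Its expectation, computed by enumerating the T/Z equally likely choices, matches E[f̃_τ(e)] = (Z/T) · Σ_{t∈I_τ} f(1_e, w_t). This is a sketch of one consistent choice, not necessarily the paper's exact estimator:

```python
def estimator_expectation(block_values):
    # Enumerate the uniformly chosen explored stage and average the outcomes.
    return sum(block_values) / len(block_values)

# A block I_tau of T/Z = 4 stages with rewards f(1_e, w_t).
vals = [0.3, 0.9, 0.0, 0.6]
Z_over_T = 1.0 / len(vals)  # Z/T for a block of this length
assert abs(estimator_expectation(vals) - Z_over_T * sum(vals)) < 1e-12
```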
We can now bound from below the rightmost expression we just obtained by using the guarantees of the regret minimizer:
(α/Z) · Σ_{τ∈[Z]} Σ_{e∈E} x_{τ,e} · E[f̃_τ(e)] − α · m²Z/T
≥ (α/Z) · E[max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} x_e · f̃_τ(e)] − (α/Z) · R^Z − α · m²Z/T
≥ (α/Z) · max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} x_e · E[f̃_τ(e)] − (α/Z) · R^Z − α · m²Z/T
= (α/T) · max_{x∈P_F^d} Σ_{τ∈[Z]} Σ_{e∈E} Σ_{t∈I_τ} x_e · f(1_e, w_t) − (α/Z) · R^Z − α · m²Z/T
= (α/T) · max_{a∈F^d} Σ_{t=1}^T f(a, w_t) − (α/Z) · R^Z − α · m²Z/T,
where the first inequality is the guarantee of the regret minimizer, the second follows from unbiasedness of f̃_τ(e) together with convexity (an expected maximum dominates the maximum of expectations), the first equality follows from expanding the expectation of the unbiased estimator (i.e., E[f̃_τ(e)] := (Z/T) · Σ_{t∈I_τ} f(1_e, w_t)), and the last equality holds because, by linearity, the optimal fractional vector in the polytope attains the same value as the best superarm (i.e., the best vertex of the polytope). Let us now rearrange the last expression and
compute the cumulative regret:
α · max_{a∈F^d} Σ_{t=1}^T f(a, w_t) − E[Σ_{t=1}^T f(a_t, w_t)] ≤ (α/Z) · R^Z · T + α · m²Z ≤ Õ(T^{2/3}),
where, in the last step, we use R^Z ≤ Õ(√Z) and set Z = T^{2/3} to obtain the desired upper bound on the regret (the term αm² is incorporated in the Õ notation). The theorem follows.
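The balancing of the two regret terms can be checked numerically: with Z = T^{2/3} and R^Z of order √Z, the bound (α/Z) · R^Z · T + α · m²Z grows as T^{2/3}. In the sketch below we take R^Z = √Z exactly and pick sample values of α and m ourselves:

```python
import math

def regret_bound(T, alpha=1.0, m=3):
    # (alpha/Z) * R^Z * T + alpha * m^2 * Z with Z = T^(2/3), R^Z = sqrt(Z).
    Z = T ** (2 / 3)
    return (alpha / Z) * math.sqrt(Z) * T + alpha * m * m * Z

# Normalized by T^(2/3), the bound is a constant (here 1 + m^2 = 10).
for T in (10 ** 3, 10 ** 6, 10 ** 9):
    assert abs(regret_bound(T) / T ** (2 / 3) - 10.0) < 1e-6
```

The first term is T/√Z = T^{2/3} and the second is m² · T^{2/3}, confirming that Z = T^{2/3} equalizes the exploration and no-regret contributions.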