Offline Imitation Learning from Multiple Baselines
the entire function. The offline setting we study is extremely well motivated here, where each interaction with the RL environment involves the expensive operation of compiling and linking the program, and integrating this within the RL loop engenders a significant engineering overhead. With this setup, our paper makes the following contributions:

• A natural behavior cloning algorithm, BC-MAX, that combines the multiple policies by executing each policy in every starting state in our dataset, and then imitating the trajectory of the policy with the highest reward in that state. We give an upper bound on the expected regret of the learned policy to the maximal reward obtained in each starting state by choosing the best baseline policy for that state.

• We complement our analysis with a lower bound showing that the result is unimprovable beyond polylogarithmic factors in our setting.

• We apply BC-MAX to two different real-world datasets for the task of optimizing compiler inlining for binary size, and show that we outperform strong baselines in both cases. In both cases we start with a single baseline policy, a prior model trained using online RL, which already has strong performance on this task. We demonstrate the versatility of BC-MAX by iteratively applying it to the initial expert, along with all prior policies trained in previous BC-MAX iterations, as the next set of baselines. We show that with a limited amount of interaction with the environment (to collect trajectories using each successive set of baselines), we obtain strong policies in a small number of iterations, creating a promising practical recipe for challenging real-world settings.

We now define the problem setting formally, and then discuss some lines of prior work which are relevant to this setting.

2.1. Problem setting

Contextual MDP setting. We consider a contextual Markov Decision Process (MDP) with state space S and action space A. We denote the initial state distribution by D, and the sampled context (initial state) is x ∼ D. Once the context is sampled, the transition kernel P_x is deterministic, that is, P_x(·|s, a) is a point-mass distribution. The reward function is r_x(s, a) and we assume deterministic rewards. We note that both the transition kernel and the reward function are context-dependent. We will omit the context subscript from our notation whenever it does not introduce ambiguity.

We work in the finite-horizon setting and denote the horizon as H. The value function of a policy π is

    V_π(x) = Σ_{h=1}^H E_{a_h ∼ π(·|s_h)}[r_x(s_h, a_h)],

where s_h is the state at step h, such that P_x(s_h | s_{h−1}, a_{h−1}) = 1 and s_1 ≡ x. Importantly, the policy class is such that the action distribution at state s depends only on s and not on the context, that is, π(·|s, x) = π(·|s).

Goal. We assume that we are given a set of K baseline policies {π_i}_{i∈[K]} together with n trajectories for each policy, which we denote by {τ_{i,j}}_{i∈[K], j∈[n]}, where x_j ∼ D and a trajectory for policy π consists of {(s_h, a_h)}_{h∈[H]}, with a_h ∼ π(·|s_h). For an arbitrary policy π and context x, we use τ_π(x) to denote the trajectory generated by following π on context x. We also assume that we see only the total reward for each trajectory and policy; that is, for all i ∈ [K], j ∈ [n] we observe

    r(τ_{i,j}) = Σ_{(s_{j,h}, a_{j,h}) ∈ τ_{i,j}} r(s_{j,h}, a_{j,h}),

instead of a dense reward across all the time steps of the trajectory. We assume that the rewards are bounded and non-negative, that is, r(τ) ∈ [0, B] for all trajectories τ and some constant B.

We have access to a policy class Π and seek to find a policy in Π which ideally competes with each of the baselines, and is able to combine their strengths. Let V_i(x) = V_{π_i}(x) denote the expected cumulative reward of baseline i, conditioned on the context x. Then we seek to minimize the regret:

    Reg(π) := E_{x∼D}[ max_{i∈[K]} V_i(x) − V_π(x) ].    (1)

Learning setup. We assume access to the policy class Π, but do not assume any other function approximators, such as for modeling value functions. This is partly because the typical training of value functions using Bellman backups is not feasible in our sparse-reward setting. Furthermore, typical actor-critic techniques make strong completeness and realizability assumptions on the value function class, which are not realistic with the restricted notion of state we encounter in our motivating problem of compiler optimization. This necessitates the development of purely policy-based methods.
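To make the objective concrete, the benchmark inside Equation 1 can be estimated directly from the observed episode returns, with no value-function modeling. A minimal sketch in plain Python (the data layout and names are illustrative, not from the paper): `returns[i][j]` is the observed return r(τ_{i,j}) of baseline i on context x_j, and `policy_returns[j]` is the return of a candidate policy on the same context.

```python
def empirical_regret(returns, policy_returns):
    """Estimate Reg(pi) = E_x[max_i V_i(x) - V_pi(x)] by averaging, over
    contexts, the gap between the best baseline's observed return and the
    candidate policy's return."""
    K = len(returns)
    n = len(policy_returns)
    gaps = []
    for j in range(n):
        best = max(returns[i][j] for i in range(K))  # best-per-context benchmark
        gaps.append(best - policy_returns[j])
    return sum(gaps) / n
```

Because the environment is deterministic given the context, a single rollout per (policy, context) pair suffices to evaluate these returns exactly.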
2.2. Related work

Vanilla behavior cloning. Behavior cloning (Widrow, 1964; Pomerleau, 1988) refers to the approach of learning a policy that matches the mapping from states to actions observed in the data. This is typically solved as a classification problem for deterministic policies, or by maximizing the log-likelihood of the chosen actions in the observed states for stochastic policies. It is unclear how to apply vanilla behavior cloning in the presence of multiple baselines. We will present a natural formulation to behavior clone the best baseline policy per context in the following section.

Value-based improvement upon multiple baselines (MAMBA). Cheng et al. (2020) show how to simultaneously improve upon multiple baseline policies to compete with the best policy at each state in the MDP, which is a significantly stronger notion than competing with the best baseline in each context only. However, this comes with two caveats. Their method requires value function estimation [...]

3. Algorithm and Regret Bound

We now describe our algorithm, BC-MAX, and give an upper bound on the regret it incurs to the best per-context baseline.

Algorithm. We describe BC-MAX in Algorithm 1. The basic idea of the algorithm is quite simple. For each context x_j in our dataset, we first choose the trajectory with the highest cumulative reward across all the baselines. Then we use a standard behavior cloning loss to mimic the choice of actions in this trajectory. For a context x_j, j ∈ [n], we denote i_j = argmax_{i∈[K]} r(τ_{i,j}), and BC-MAX tries to find a policy π̂ ∈ Π that optimizes the following intuitive objective:

    π̂ = argmin_{π∈Π} Σ_{j=1}^n Σ_{(s_{j,h}, a_{j,h}) ∈ τ_{i_j, j}} 1(π(s_{j,h}) ≠ a_{j,h}).    (2)

Algorithm 1 BC-MAX for cloning the best per-context baseline.
  Input: Base policies {π_i}_{i∈[K]} and policy class Π.
  Output: Policy π̂ ∈ Π.
  for j ∈ [n] do
    Sample x_j ∼ D and collect trajectories {τ_{i,j}}_{i∈[K]}.
    Compute the highest-reward policy index i_j = argmax_{i∈[K]} Σ_{(s_{j,h}, a_{j,h}) ∈ τ_{i,j}} r(s_{j,h}, a_{j,h}).
  end for
  Return π̂ = argmin_{π∈Π} Σ_{j=1}^n Σ_{(s_{j,h}, a_{j,h}) ∈ τ_{i_j, j}} 1(π(s_{j,h}) ≠ a_{j,h}).
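The selection and cloning steps of Algorithm 1 can be sketched as follows, assuming a small finite policy class so that the argmin in Equation 2 can be taken by enumeration (the data layout and names are illustrative, not from the paper's implementation): `trajectories[i][j]` is the (state, action) list of baseline i on context j, and `returns[i][j]` its observed episode return.

```python
def select_best_trajectories(trajectories, returns):
    """For each context j, keep the trajectory of the baseline with the
    highest observed episode return (the selection step of Algorithm 1)."""
    K, n = len(trajectories), len(trajectories[0])
    best = []
    for j in range(n):
        i_star = max(range(K), key=lambda i: returns[i][j])
        best.append(trajectories[i_star][j])
    return best

def bc_max(policy_class, trajectories, returns):
    """Return the policy in the class minimizing the 0-1 mismatch count of
    Equation 2 against the selected best-per-context trajectories."""
    best = select_best_trajectories(trajectories, returns)
    def mismatches(policy):
        return sum(1 for traj in best for s, a in traj if policy(s) != a)
    return min(policy_class, key=mismatches)
```

In the experiments the policy class is a neural network, so the 0-1 objective is replaced by the cross-entropy surrogate discussed under the implementation details.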
Assumption 3.1. For each context x, let τ∗(x) denote the trajectory with maximum return over all policies π_i, i ∈ [K]. There exists π∗ ∈ Π such that

    P_{x∼D}(τ_{π∗}(x) ≠ τ∗(x)) ≤ ϵ.

Here P_{x∼D}(A) denotes the probability of an event A under the distribution D, which we recall is the distribution over the contexts x. The assumption is natural, as BC-MAX cannot do a good job of approximating the best per-context baseline when no policy in the policy class has a small error in achieving this task. Note that the assumption does not take rewards into account, as BC-MAX only matches the actions of τ∗(x), and does not reason about the reward sub-optimality of other actions, as is common in behavior cloning setups. Indeed this assumption is unavoidable in our problem setting, as we illustrate in the next section.

Theorem 3.2. Under Assumption 3.1, after collecting n trajectories from each of the K base policies, Algorithm 1 returns a policy π̂ with regret at most

    Reg(π̂) ≤ O( ϵH + H² log(H|Π|/δ) / n ),

with probability at least 1 − δ.

Proof. Recall the definition of τ∗(x) from Assumption 3.1, and let π∗ = argmin_{π∈Π} E_{x∼D}( Σ_{h=1}^H 1(π(s_h(x)) ≠ a_h(x)) ), where (s_h(x), a_h(x))_{h=1}^H = τ∗(x) form the best trajectory for x among the baseline policies. Under Assumption 3.1, we know that E_{x∼D}( Σ_{h=1}^H 1(π∗(s_h(x)) ≠ a_h(x)) ) ≤ ϵH. Let us define

    Â(π) = Σ_{j=1}^n Σ_{h=1}^H 1(π(s_{j,h}) ≠ a_{j,h}),
    A(π) = E_x( Σ_{h=1}^H 1(π(s_h(x)) ≠ a_h(x)) ).

Clearly we have that E[Â(π)] = nA(π) for any fixed policy π, and Â(π) = Σ_{j=1}^n Z_j(π) with Z_j(π) = Σ_{h=1}^H 1(π(s_{j,h}) ≠ a_{j,h}) ≥ 0. We note that the Z_j are i.i.d., with E[Z_j(π)] = A(π) and E[Z_j(π)²] ≤ H E[Z_j(π)] = H A(π). Then by Bernstein's inequality combined with a union bound over policies, we have with probability at least 1 − δ, for all π ∈ Π:

    |Â(π) − nA(π)| ≤ √( nH A(π) log(2|Π|/δ) ) + H log(2|Π|/δ)
                   ≤ nA(π)/2 + (3/2) H log(2|Π|/δ).

Applying the inequality with π = π̂ and π = π∗, we obtain

    A(π̂) ≤ (2/n) Â(π̂) + 3H log(2|Π|/δ)/n,
    (1/n) Â(π∗) ≤ (3/2) A(π∗) + (3/2) H log(2|Π|/δ)/n.

Scaling the second inequality by 2, adding the two, and using that Â(π̂) ≤ Â(π∗) (since π̂ minimizes Â over Π) yields

    A(π̂) ≤ 3A(π∗) + 6H log(2|Π|/δ)/n ≤ 3ϵ + 6H log(2|Π|/δ)/n,    (3)

where the second inequality follows by Assumption 3.1. Now we note that for any policy π:

    Reg(π) = E_x[ max_i V_i(x) − V_π(x) ] = E_x[ r(τ∗(x)) − r(τ_π(x)) ]
           ≤ Σ_{h=1}^H (H − h) E_x[ 1(π(s_h(x)) ≠ a_h(x)) ].

Plugging the bound from Equation 3 into the inequality above completes the proof.

Implementation details. In practice we cannot directly compute π̂ as defined in Equation 2. Instead we solve a proxy to the optimization problem, replacing the indicator function 1(π(s_{j,h}) ≠ a_{j,h}) with the cross-entropy loss. Let y_{j,h} ∈ {0,1}^A be the indicator vector whose only entry equal to 1 is the one corresponding to the action a_{j,h}, with all other entries equal to 0. Further, we assume that all π ∈ Π are such that π(s) ∈ Δ^{A−1}, that is, each π(s) represents a distribution over the actions that policy π plays when in state s. We then use a first-order method to minimize the loss

    min_{π∈Π} Σ_{j=1}^n w_j Σ_{h=1}^H Σ_a −y_{j,h}(a) log( π(s_{j,h})(a) ),

where w_j ∈ [0, 1] are example weights which we find helpful in our practical implementation. We refer the reader to our experimental evaluation for more details on how the weights are chosen.
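This surrogate is a standard weighted cross-entropy loss; a minimal sketch in plain Python (here `policy_probs(s)` stands in for the distribution π(s) ∈ Δ^{A−1}; in practice this would be a differentiable network head minimized with a first-order method, and the names are illustrative):

```python
import math

def weighted_ce_loss(policy_probs, dataset, weights):
    """Weighted cross-entropy surrogate for the 0-1 objective in Equation 2.

    dataset[j] is the (state, action) list of the best trajectory selected
    for context j; weights[j] is the per-example weight w_j in [0, 1]."""
    total = 0.0
    for j, traj in enumerate(dataset):
        for s, a in traj:
            p = policy_probs(s)  # distribution over actions at state s
            total += -weights[j] * math.log(p[a])
    return total
```

A policy that puts more mass on the demonstrated actions strictly lowers this loss, which is what makes it a usable proxy for the (non-differentiable) indicator objective.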
4. Lower bounds

In this section, we show a series of lower bounds which illustrate the necessity of various aspects of our guarantee in Theorem 3.2. We start with the necessity of Assumption 3.1.

Necessity of Assumption 3.1. Let us consider a contextual multi-armed bandit problem, meaning that we fix H = 1. For any ϵ, we choose the context space S = [M] for M = ⌈1/ϵ⌉, and choose D to be the uniform distribution on S. We fix A = {a_1, a_2, a_3}, and K = 1 with the data collection policy π_1 choosing a = a_1 for each context x ∈ S. We consider two possible environments, defined through rewards r_1, r_2. For a_1, we have r_1(x, a_1) = r_2(x, a_1) = 1. For the other two actions, we have r_1(x, a_2) = 0 and r_1(x, a_3) = 1, while the second environment has r_2(x, a_2) = 1 and r_2(x, a_3) = 0. We design a policy class Π with two policies {π_1, π_2} such that π_1(x) = π_2(x) = a_1 for all x ≠ 1, and π_1(1) = a_2, π_2(1) = a_3. Clearly this policy class satisfies Assumption 3.1. But it also contains an optimal policy for both the rewards r_1, r_2, with a regret equal to 0. However, since the data contains no information about which one of r_1 or r_2 generated it, the best we can do is to pick between π_1 and π_2 uniformly at random, and incur a regret of at least 0.5ϵ. This argument shows that we cannot replace the 0-1 loss for measuring the accuracy of a policy in Assumption 3.1 with a more reward-aware quantity. It is also evident from the example that we cannot avoid incurring a regret of Ω(ϵ).

Necessity of the comparator choice. As discussed earlier, we define regret relative to the best baseline per-context, but the broader literature on offline RL allows stronger benchmarks, such as the best policy covered by the data generating policies. To understand the difference between the two, we consider the case of H = 2 and K = 2, with S = {x} being a singleton. Suppose we have two actions A = {a_1, a_2} and the baselines choose the trajectories π_1(x) = a_1 and π_2(x) = a_2 at both h = 1, 2. There are two possible reward functions, given by r_1((a_1, a_1)) = r_1((a_2, a_2)) = 0.5, r_1((a_1, a_2)) = 1, r_1((a_2, a_1)) = 0, and r_2(τ) = 1 − r_1(τ) for all trajectories τ. Now we observe that, in the sense of the coverage studied in the offline RL literature, all four trajectories are covered by the dataset, since we get to observe both the actions at both the steps in the episode. However, since the data contains no useful information to distinguish between r_1 and r_2, no learning method can pick a covered policy which is better than our benchmark of best baseline per-context. This challenge arises in our scenario because we only observe the aggregate reward over a trajectory, which makes the per-step credit assignment used in standard offline RL methods through the use of Bellman errors infeasible.

Tightness of the horizon factor. The tightness of the horizon factor follows from a simple reduction to Theorem 6.1 in Rajaraman et al. (2020), who show that in a finite-horizon episodic MDP, there is no algorithm which only observes n optimal-policy trajectories and returns a policy with regret better than Ω(SH²/n). The MDP constructed by Rajaraman et al. (2020) for the lower bound has a non-deterministic transition kernel P. Since we assume that the transitions are deterministic, we instead use the randomness in sampling contexts, x ∼ D, to simulate P. Concretely, let ξ_1, ..., ξ_H be the random bits sampled at the H steps in a fixed episode from the construction in Rajaraman et al. (2020). We define x = (ξ_1, ..., ξ_H) and set P_x(s_{h+1}|s_h, a_h) = P(s_{h+1}|s_h, a_h, ξ_h) to be the transition taken for the realization of ξ_h. The size of the state space of the contextual MDP with the above transition kernel can be set to S = Θ(log_2(|Π|)) and the action space to A = {a_1, a_2}, so that each policy in Π is encoded by how it acts on the state space. This construction directly leads to the following lower bound.

Theorem 4.1. For any number of samples n, there exists a family of contextual MDPs with distribution over contexts given by D, such that the policy π̂ = A({τ_{π∗}(x_i)}_{i=1}^n) returned by any algorithm A satisfies

    E_x[V_{π∗}(x)] − E_x[V_{π̂}(x)] ≥ min( H, log_2(|Π|) H² / n ).

5. Case study: Optimizing a compiler's inlining policy

5.1. The inlining for size problem

In short, the inlining problem which we study in our experiments consists of deciding whether or not to inline a callsite in a program, with the goal of minimizing the size of the final program binary. We omit most compilation details and just give a brief overview which should be sufficient for understanding the problem from an RL perspective. In our specific scenario, compilation is split into a frontend (fe) and a backend (be). In our setup the frontend consists of translating the program into an Intermediate Representation (IR) and doing some frontend optimizations; then a (thin) link step follows, which re-organizes functions in the various modules to improve inlining opportunities. For more details on the linking step see Johnson et al. (2017). The backend compilation follows the thin link step and is applied to the updated modules. It consists of further optimizations, including inlining decisions, final linking, and lowering the IR to machine code, e.g., x86, ARM, etc. The IRs our RL algorithms work with are post frontend linking and pre backend optimization; that is, we work only on backend optimizations. We note that a program is made up of multiple modules. In the fe, a module corresponds to a single C/C++ source file, after all the preprocessor directives have been applied. In the be, a module would consist of a mix of IRs from different fe modules. The inlining decisions will be taken at callsites in the IRs of each module, where the callee is also in the same module. We note that each module is ultimately compiled to a machine code-specific binary, with its own size, that will still need to be linked into the final executable. Hence we treat each module as a context x in our contextual MDP setting, and the value V_i(x) of the baseline policy π_i is the size of the binary we get for module x when we make inlining decisions for the module according to π_i.

    Program → IRs → fe optimizations → ThinLinking¹ —Collecting IRs→ be optimization → Final linking → x86    (4)

¹See Johnson et al. (2017).

In Equation 4 we outline the compilation process, together with the step at which we collect IRs from the respective modules to be used in our RL algorithms. It is important to note that the learned RL policy will make inlining decisions both at the fe optimization and be optimization parts in
Equation 4; however, the IRs for training are only collected after the fe optimization step.

The contextual MDP setting can now be tied together with the compilation process as follows. Each context is a module as mentioned above, with state space defined by its IR. Each state corresponds to a callsite in the IR of the module x, and the action set is {inline, don't inline}, or {1, 0} respectively. Each π_i is some base inlining policy, and V_i(x) is defined by the size of the compiled stand-alone module x. It is important to note that there is a mismatch between trying to maximize E_{x∼D}[V_π(x)] and the overall goal of minimizing the binary size, as it is not necessarily true that the sum of module sizes equals the size of the binary. In fact, part of the post-inlining be and linker optimizations may introduce a significant distribution shift between the sum of module sizes and the size of the final binary. In our experiments, we try to minimize this distribution shift by turning off certain optimizations. For more details on the compilation pipeline we refer to Trofin et al. (2021).

We note that the entire process is fully deterministic, as we assumed in our theoretical setup, since the compiler is a deterministic program.

5.2. Dataset collection

We train and evaluate on two sets of binaries. In the first experiment we train on a proprietary search binary and evaluate the model on a different proprietary set of targets that are part of a cloud systems infrastructure. These targets need to be installed on a fixed-size partition of each cloud machine and hence are size-constrained. In the second experiment we train and evaluate on the Chrome binary on Android. Training proceeds in two separate steps, which we repeat over several iterations. The two steps can be summarized as follows: first, we collect a training dataset which consists of the trajectories with smallest size over all base policies available at the current iteration; next, we train a new base model using the objective defined in Equation 2. This conceptually applies Algorithm 1 repeatedly, where the set of baseline policies is updated at each iteration to include the new policy obtained from the previous iteration. We now describe each step carefully.

Training dataset collection. The dataset collection begins by creating a corpus of IRs of the modules which make up the final binary. The corpus creation follows the work of Trofin et al. (2021); Grossman et al. (2023), and the tools for extracting the corpus are available on GitHub². The corpus is created at the beginning of training and remains the same throughout every iteration. Training begins under the assumption that there exists at least one base policy. In the first iteration a training dataset is collected from this initial base policy π_1; next, π_1 is behavior cloned by solving the optimization problem in Equation 2. We specify how the weights for the objective are computed in the following sections, as they are different for the different targets. Let π̂_2 denote the resulting policy after solving Equation 2. This policy is non-deterministic, and so we construct the base policy π_2 by setting π_2(s) = argmax_{a∈{0,1}} π̂_2(s, a), so that π_2 always plays the most likely action according to π̂_2. This concludes the first iteration. More generally, if we have a larger initial set of baseline policies than just the singleton {π_1}, the iterations proceed similarly; however, instead of just using π_1, we use the full set {π_i}_{i∈[K]} of baseline policies at the first iteration.

²A detailed example can be found at https://github.com/google/ml-compiler-opt/blob/main/docs/inlining-demo/demo.md

Proceeding this way, at the t-th iteration the set of base policies is taken as a subset of {π_1, ..., π_{t−1}} which always contains π_1 (or the larger set of all initial baselines). Then we again invoke BC-MAX with these baseline policies and obtain a new randomized policy π̂_t, and we refer to π_t as the corresponding deterministic greedy policy. When collecting a new training dataset, we not only collect trajectories with the chosen subset of base policies, but we also may force exploration by using π̂_{t−1} in the way discussed next.

Exploration in training dataset collection. For a fixed module x and a policy π, we choose a ceiling on the number of exploration steps as a hyperparameter, which is a fraction of the length of the trajectory |τ_{π_1}(x)|. The call-sites at which exploration occurs are selected as follows. The first exploration call-site is selected as h̃ = argmin_{h: s_h ∈ τ_π(x)} |π̂(s_h)(0) − π̂(s_h)(1)|, i.e., the call-site where the exploration policy π̂ is least confident about the action to choose. The exploration step is then played at s_h̃ by taking the action 1 − π(s_h̃) (recall that π(s) ∈ {0, 1}), and the remaining steps in the trajectory are completed by playing according to π. Let τ̂ denote the trajectory from the last round of exploration. In the following exploration round, the exploration step is selected as the step h at which the gap |π̂(s_h)(0) − π̂(s_h)(1)| is smallest among all h > h̃, where h̃ is the exploration step in the previous round. Once the maximum number of exploration rounds is reached, or the exploration step reaches the end of the trajectory, we return the trajectory which results in the smallest module size among all explored trajectories. Pseudo-code is presented in Algorithm 2. The exploration strategy is governed by π̂_{t−1}; however, it could be updated by using the non-deterministic policy π̂ which induces π. We leave such approaches as future work, as we have already observed significant benefit from using only π̂_{t−1} as the exploration policy.

Online versus offline learning. Our theoretical setup frames the problem in an offline learning scenario, yet Algo- [...]
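The exploration loop above can be sketched as follows, under illustrative assumptions (a binary action space, a hypothetical `rollout(module, overrides)` helper that compiles the module following the greedy policy except at the overridden call-sites and returns the resulting trajectory and size, and `policy_probs` giving the stochastic policy's action probabilities; none of these names come from the paper's implementation):

```python
def explore(module, policy_probs, max_rounds, rollout):
    """Sketch of the exploration procedure: in each round, flip the greedy
    action at the least-confident call-site after the previous flip, and
    keep the trajectory with the smallest resulting module size."""
    best_traj, best_size = rollout(module, {})  # greedy baseline rollout

    def gap(s):
        p = policy_probs(s)
        return abs(p[0] - p[1])  # small gap = low confidence

    h_prev = -1
    for _ in range(max_rounds):
        candidates = range(h_prev + 1, len(best_traj))
        if not candidates:
            break  # exploration step reached the end of the trajectory
        h = min(candidates, key=lambda i: gap(best_traj[i][0]))
        s, a = best_traj[h]
        traj, size = rollout(module, {h: 1 - a})  # flip one action
        if size < best_size:
            best_traj, best_size = traj, size
        h_prev = h
    return best_traj, best_size
```

This is only a sketch of the control flow described in the text; the paper's Algorithm 2 operates inside the actual compilation pipeline, where each `rollout` corresponds to compiling the module.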
Figure 4. Savings in MBs from PPO on binary size.

[...] where L^t_m denotes the m-th coordinate of the loss vector L^t. The Hedge update is then

    p̃^{t+1}_m = p^t_m exp(−η L^t_m),
    p^{t+1}_m = p̃^{t+1}_m / Σ_{m′} p̃^{t+1}_{m′}.
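The update above is the standard Hedge / multiplicative-weights rule (Littlestone & Warmuth, 1994); a minimal sketch in plain Python, with illustrative loss values:

```python
import math

def hedge_update(p, losses, eta):
    """One Hedge step: exponentially down-weight each expert m by its loss
    L^t_m, then renormalize back to a probability vector."""
    tilde = [p_m * math.exp(-eta * l_m) for p_m, l_m in zip(p, losses)]
    total = sum(tilde)
    return [w / total for w in tilde]
```

Experts that repeatedly incur low loss accumulate most of the probability mass, which is the behavior the learning-rate parameter η trades off against stability.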
In Figures 3 and 4, we plot the savings of our learned policies across iterations, relative to the initial PPO policy, measured in two different ways. For Figure 3, we simply add up the sizes of the binaries produced by compiling each module in our training dataset. This is a clean metric, as the distribution shift between training and evaluation is small, and no artifacts from linker or post-inlining be optimizations are introduced in the evaluation. As we see, we improve rapidly beyond the PPO policy with the iterative applications of BC-MAX. Note that even the sum of module sizes suffers from the typical distribution shift between online and offline RL, since the data used for behavior cloning is collected using a different policy than the one we apply in evaluation. For the sum of module sizes metric, we can study the effect of this distribution shift rather carefully by [...]

Acknowledgements

We would like to thank Ziteng Sun for great discussions on the lower bound.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on machine learning, pp. 1, 2004.

Barreto, A., Hou, S., Borsa, D., Silver, D., and Precup, D. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079–30087, 2020.

Cheng, C.-A., Kolobov, A., and Agarwal, A. Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33:5587–5598, 2020.

Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning, pp. 3852–3878. PMLR, 2022.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005.
Grossman, A., Paehler, L., Parasyris, K., Ben-Nun, T., Hegna, J., Moses, W., Diaz, J. M. M., Trofin, M., and Doerfert, J. ComPile: A large IR dataset from production sources. arXiv preprint arXiv:2309.15432, 2023.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al. Deep Q-learning from demonstrations. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.

Johnson, T., Amini, M., and Li, X. D. ThinLTO: scalable and incremental LTO. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 111–121. IEEE, 2017.

Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings, 2011.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. PMLR, 2014.

Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718, 2022.

Trofin, M., Qian, Y., Brevdo, E., Lin, Z., Choromanski, K., and Li, D. MLGO: a machine learning guided compiler optimizations framework. arXiv preprint arXiv:2101.04808, 2021.

Widrow, B. Pattern recognition and adaptive control. IEEE Transactions on Applications and Industry, 83(74):269–277, 1964.

Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 34:6683–6694, 2021.

Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pp. 2730–2775. PMLR, 2022.