JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

arXiv:2006.03552v2 [cs.RO] 7 Dec 2020

Abstract—This paper develops KL-Ergodic Exploration from Equilibrium (KL-E3), a method for robotic systems to integrate stability into actively generating informative measurements through ergodic exploration. Ergodic exploration enables robotic […] on the dynamic area coverage in the search space (as opposed to directly generating samples from the most informative regions). With this approach, we are able to integrate known equilibria and feedback policies (we refer to these as equilibrium policies) which provide attractiveness guarantees—that the robot will eventually return to an equilibrium—while providing the control authority that allows the robot to actively seek out and collect informative data in order to later solve a learning task. Our contributions are summarized as follows:

• Developing a method which extends ergodic exploration to higher-dimensional state-spaces using a sample-based measure.
• Synthesizing a control signal that exploits known equilibrium policies.
• Presenting theoretical results on the Lyapunov attractiveness of our method, centered around equilibrium policies.
• Illustrating our method for improving the sample efficiency and quality of data collected for example learning goals.

We structure the paper as follows: Section II provides a list of related work, Section III defines the problem statement for this work, Section IV introduces our approximation to an ergodic metric, and Section V formulates the ergodic exploration algorithm for active learning and exploration from equilibrium. Section VI provides examples where our method is applicable. Last, Section VII provides concluding remarks on our method and future directions.

[…] for data-driven actions for collecting informative data. What makes this problem even more difficult is that generating data for robotic systems is often an unstable process. It involves generating measurements dependent upon the physical motion of […]

II. RELATED WORK

Active Exploration: Existing work generally formulates problems of active exploration as information maximization with respect to a known parameterized model [15], [16]. The problem with this approach is the abundance of local optima [2], [16] which the robotic system needs to overcome, resulting in insufficient data collection. Other approaches have sought to solve this problem by viewing information maximization as an area coverage problem [2], [4]. Ergodic exploration, in particular, has remedied the issue of local optima by using the ergodic metric to minimize the Sobolev distance [17] from the time-averaged statistics of the robot's trajectory to the expected information in the explored region. This enables both exploration (quickly, in low-information regions) and exploitation (spending a significant amount of time in highly informative regions) in order to avoid local optima and harvest informative measurements. Our work utilizes this concept of ergodicity to improve how robotic systems explore and learn from data.

Ergodicity and Ergodic Exploration: A downside of current methods for generating ergodic exploration in robots is that they assume that the model of the robot is fully known. Moreover, there is little guarantee that the robot will not destabilize during the exploration process. This becomes an issue when the robot must explore part of its own state-space (i.e., velocity space) in order to generate informative data. Another issue is that these methods do not scale well with the dimensionality of the search space, making experimental applications with this approach challenging due to computational limitations. Our approach overcomes these issues by using a sample-based KL-divergence measure [4] as a replacement for the ergodic metric. This form of measure has been used previously; however, it relied on motion primitives in order to compute control actions [4]. We show that we can generate a continuous control signal that minimizes this ergodic measure using hybrid systems theory. The same approach is then shown to be readily amenable to existing equilibrium policies. As a result, we can use approximate models of dynamical systems instead of complete dynamic reconstructions in order to actively generate data while ensuring safety in the exploration process through a notion of attractiveness.

Off-Policy Learning: In this work, we utilize concepts from off-policy learning methods [18], [19], which are a set of methods that divide the learning and data-generation phases. With these methods, data is first generated using some random policy and a value function is learned from the rewards gathered. The value function is then used to create a policy which then updates the value function [20]. Generating more data does not require directly using the policy; however, the most common practice is to use the learned policy with added noise to guide policy learning. These methods often rely on samples from a buffer of prior data during the training process rather than learning directly through the application of the policy. As such, they are more sample-efficient and can reuse existing data. A disadvantage of off-policy methods is that they are highly dependent on the distribution of data, resulting in an often unstable learning process. Our approach focuses on improving the data-generation process through ergodic exploration to generate an informed distribution of data. As a result, the learning process is able to attain improved results from the generated distribution of data and retain its sample efficiency.

Bayesian Optimization: Our work is most related to the structure of Bayesian optimization [21]–[23]. In Bayesian optimization, the goal is to find the maximum of an objective function which is unknown. At each iteration of Bayesian optimization, the unknown objective is sampled and a probabilistic model is generated. An acquisition function is then used as a metric for an "active learner" to find the next best sample. This loop repeats until a maximum is found. In our work, the "active learner" is the robot itself, which must abide by the physics that governs its motion. As a result, the assumption that the active learner has the ability to sample anywhere in the search space is lost. Another difference is that instead of using a sample-based method to find the subsequent sampling position, as done in Bayesian optimization, we use ergodic exploration to generate a set of samples proportional to some spatial distribution. We pose the spatial distribution as an acquisition function, which we show in Section VI-A. Thus, the active learner is able to sample quickly from regions which have lower probability densities and spend time in regions which are likely to produce an optimum. Note that it is possible for one to directly use the derivatives of the acquisition function and a model of the robot's dynamics to search for the best actions that a robot can take to sample from the objective; however, it is likely that similar issues with information maximization and local optima will occur.

Information Maximization: Last, we review work that addresses the data-inefficiency problem through information maximization [24].
These methods work by either direct maximization of an information measure or by pruning a data-set based on some information measure [25]. These methods still suffer from the problem of local minima due to a lack of exploration or non-convex information objectives [26]. Our work uses ergodicity as a way to gauge how much a robot should be sampling from the exploration space. As a result, the motion of the robot obtained by minimizing an ergodic measure will automatically optimize where and for how long the robot should sample, avoiding the need to prune from a data-set and sample from multiple highly informative peaks.

The following section introduces ergodicity and ergodic exploration and defines the problem statement for this work.

III. PRELIMINARIES: ERGODICITY AND THE ERGODIC METRIC FOR EXPLORATION AND EXPLOITATION

This section serves as a preliminary section that introduces conceptual background important to the rest of the paper. We first motivate ergodicity as an approach to the exploration vs. exploitation problem and the need for an ergodic measure. Then, we introduce ergodicity and ergodic exploration as the resulting outcome of optimizing an ergodic metric.

The exploration vs. exploitation problem is a problem in robot learning where the robot must deal with choosing to exploit what it already knows or explore for more information, which entails running the risk of damaging itself or collecting bad data. Ergodic exploration treats the problem of exploration and exploitation as a problem of matching a spatial distribution to a time-averaged distribution—that is, the probability of a positive outcome given a state is directly related to the time spent at that state. Thus, more time is spent in regions where there are positive outcomes (exploitation), and states where there is a low probability of a positive outcome are explored quickly (exploration).

Definition 1. Ergodicity, in robotics, is achieved when a robot's time-averaged statistics over its states are equivalent to the spatial statistics of an arbitrarily defined target distribution that intersects those states.

The exact specification of a spatial statistic varies depending on the underlying task and is defined for different tasks in Section VI. For now, let us define the time-averaged statistics of a robot by considering its trajectory x(t) : R → R^n ∀t ∈ [t_0, t_f] generated through an arbitrary control process u(t) : R → R^m.

Definition 2. Given a search domain S^v ⊂ R^{n+m} where v ≤ n + m, the time-averaged statistics (i.e., the time the robot spends in regions of the search domain S^v) of a trajectory x(t) is defined as

    c(s \mid x(t)) = \frac{1}{t_f - t_0} \int_{t_0}^{t_f} \delta[s - \bar{x}(t)] \, dt,    (1)

where s ∈ S^v is a point in the search domain, x̄(t) is the component of the robot's state x(t) that intersects the search domain, and δ[·] is the Dirac delta function.

In general, the target spatial statistics are defined through a probability density function p(s) where p(s) > 0 and ∫_{S^v} p(s) ds = 1. Given a target spatial distribution p(s), we can calculate an ergodic metric as the distance between the Fourier decompositions of p(s) and c(s | x(t)):²

    E(x(t)) = \sum_{k \in \mathbb{N}^v} \Lambda_k (c_k - p_k)^2    (2)

where Λ_k is a weight on the harmonics defined in [27],

    c_k = \frac{1}{t_f - t_0} \int_{t_0}^{t_f} F_k(x(t)) \, dt, \qquad p_k = \int_{S^v} p(s) F_k(s) \, ds,

and F_k(s) is the k-th Fourier basis function. Minimizing the ergodic metric results in the time-averaged statistics of x(t) matching the arbitrarily defined target spatial distribution p(s) as well as possible on a finite time horizon, and in the robotic system sampling measurements in high-utility regions specified by p(s).

[Fig. 1—panels: (a) Ergodic Trajectory, (b) p(s), (c) c(s|x(t)), (d) q(s|x(t)), (e) approx. q(s|x(t)). Caption: (a) Illustration of an ergodic trajectory x(t) with respect to (b) the target distribution p(s). Time-averaged distribution reconstructions of x(t) are shown using Definitions 2 and 3, and Eq. (18). The Fourier decomposition approach often has residual artifacts due to the cosine approximation; 20 basis functions are used to approximate the time-averaged distribution. The Σ-approximation to the time-averaged statistics minimizes residual artifacts. (e) Moment matching of the Σ-approximation provides a simplification for computing the time-averaged statistics.]

As in [2], [28], one can calculate a controller that optimizes the ergodic metric such that the trajectory of the robot is ergodic with respect to the distribution p(s) (see Fig. 1 for an illustration). However, this approach scales as O(|k|^n), where |k| is the maximum integer-valued Fourier term. As a result, this method is ill-suited for high-dimensional learning tasks whose exploration states are often the full state-space of the robot (often n > 3 for most mobile robots). Furthermore, the resulting time-averaged distribution reconstruction will often have residual artifacts which require additional conditioning to remove. This motivates the following section, which defines an ergodic measure.³

² This distance is known as the Sobolev distance.
³ We make note of the use of the word "measure" as opposed to metric, as we allude to using the KL-divergence, which itself is not a metric but a measure.

IV. KL-DIVERGENCE ERGODIC MEASURE

As an alternative to computing the ergodic metric, we present an ergodic measure which circumvents the scalability issues mentioned in the previous section.
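For reference, the Fourier-coefficient construction in (2) can be sketched numerically. The following is a minimal 1-D illustration, not the paper's implementation: it assumes a cosine basis F_k(s) = cos(kπs/L) on S = [0, L] and uniform weights Λ_k (rather than the Sobolev-style weights of [27]); all names and values are hypothetical.

```python
import numpy as np

def ergodic_metric(traj, p, L=1.0, K=10):
    """Sketch of (2): sum_k Lambda_k (c_k - p_k)^2 with a cosine basis.

    traj: 1-D array of visited states in [0, L]; p: target density (callable).
    Uniform weights Lambda_k = 1 are an assumption of this sketch."""
    s = np.linspace(0.0, L, 200)
    ds = s[1] - s[0]
    ps = p(s)
    ps = ps / (ps.sum() * ds)                         # normalize the target density
    metric = 0.0
    for k in range(K + 1):
        Fk = lambda x, k=k: np.cos(k * np.pi * x / L)  # k-th basis function F_k
        ck = Fk(traj).mean()                           # time-averaged coefficient c_k
        pk = (ps * Fk(s)).sum() * ds                   # spatial coefficient p_k
        metric += (ck - pk) ** 2                       # Lambda_k = 1 (assumed)
    return metric

# a target concentrated at 0.5, one trajectory that dwells there, one that does not
p = lambda s: np.exp(-0.5 * ((s - 0.5) / 0.1) ** 2)
dwell = 0.5 + 0.1 * np.sin(np.linspace(0.0, 20.0 * np.pi, 2000))
parked = np.full(2000, 0.9)
m_dwell = ergodic_metric(dwell, p)
m_parked = ergodic_metric(parked, p)
```

A trajectory that spends its time where p(s) has mass scores a lower metric than one parked away from the target mass, which is exactly the behavior the metric rewards.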
To do this, we utilize the Kullback–Leibler divergence [4], [29], [30] (KL-divergence) as a measure for ergodicity. Let us first define the approximation to the time-averaged statistics of a trajectory x(t):

Definition 3. Given a search domain S^v ⊂ R^{n+m}, the Σ-approximated time-averaged statistics of the robot is defined by

    q(s \mid x(t)) = \frac{1}{t_f - t_0} \int_{t_0}^{t_f} \psi(s \mid x(t)) \, dt    (3)

where \psi(s \mid x(t)) = \frac{1}{\eta} \exp\left( -\frac{1}{2} \| s - \bar{x}(t) \|^2_{\Sigma^{-1}} \right), Σ ∈ R^{v×v} is a positive definite matrix parameter that specifies the width of the Gaussian, and η is a normalization constant.

We call this an approximation because the true time-averaged statistics, as described in [2] and Definition 2, is a collection of delta functions parameterized by time. We approximate the delta function as a Gaussian distribution with variance Σ, converging as ‖Σ‖ → 0. As an aside, one can treat Σ as a function of x̄(t) if there is uncertainty in the position of the robot.

With this approximation, we are able to relax the ergodic objective in [2] and use the following KL-divergence objective [4]:

    D_{KL}(p \| q) = \int_{S^v} p(s) \log \frac{p(s)}{q(s)} \, ds
                   = \int_{S^v} p(s) \log p(s) \, ds - \int_{S^v} p(s) \log q(s) \, ds
                   = -\int_{S^v} p(s) \log q(s) \, ds
                   = -\mathbb{E}_{p(s)}[\log q(s)]

where E is the expectation operator, q(s) = q(s | x(t)), and p(s) is an arbitrary spatial distribution. Note that we drop the first term in the expanded KL-divergence because it does not depend on the trajectory of the robot x(t). Rather than computing the integral over the exploration space S^v (partly because of intractability), we approximate the expectation operator as

    D_{KL}(p \| q) = -\mathbb{E}_{p(s)}[\log q(s)]
                   \approx -\sum_{i=1}^{N} p(s_i) \log q(s_i)
                   \propto -\sum_{i=1}^{N} p(s_i) \log \int_{t_0}^{t_f} \exp\left( -\frac{1}{2} \| s_i - \bar{x}(t) \|^2_{\Sigma^{-1}} \right) dt,    (4)

where N is the number of samples in the search domain drawn uniformly.⁴ Through this formulation, we still obtain the benefits of indirectly sampling from the spatial distribution p(s) without having to directly compute derivatives to generate an optimal control signal for the robot. Furthermore, this formulation prevents the computation of the measure from scaling drastically with the number of exploration states. Figure 1 illustrates the resulting reconstruction of the time-averaged statistics of a trajectory x(t) using Definition 3. The following section uses the KL-divergence ergodic measure and derives a controller which optimizes (4) while directly incorporating learned models and policies.

⁴ We can always use importance sampling to interchange which distribution we sample from.

V. KL-E3: KL-ERGODIC EXPLORATION FROM EQUILIBRIUM

In this section, we derive KL-Ergodic Exploration from Equilibrium (KL-E3), which locally optimizes and improves (4). As an additional constraint, we impose an equilibrium policy and an approximate transition model of the robot's dynamics on the algorithm. By synthesizing KL-E3 with existing policies that allow the robot to return to an equilibrium state (i.e., a local linear quadratic regulator (LQR) controller), we can take advantage of approximate transition models for planning the robot's motion while providing a bound on how unstable the robot can become. We then show how this method is Lyapunov attractive [31], [32], allowing the robot to become unstable so long as we can ensure the robot will eventually return to an equilibrium.

A. Model and Policy Assumptions for Equilibrium:

We assume that we have a robot whose approximate dynamics can be modeled using the continuous-time transition model

    \dot{x}(t) = f(x(t), \mu(x(t))) = g(x(t)) + h(x(t)) \mu(x(t))    (5)

where ẋ(t) ∈ R^n is the rate of change of the state x(t) : R → R^n of the robot at time t, and f(x, u) : R^{n×m} → R^n is the (possibly nonlinear) transition model of the robot as a function of state x and control u, which we partition into g(x) : R^n → R^n, the free unactuated dynamics, and h(x) : R^n → R^{n×m}, the actuated dynamics. In our modeling assumption, we consider a policy µ(x) : R^n → R^m which provides the control signal to the robotic system such that there exists a continuous Lyapunov function V(x) [32], [33] satisfying the following conditions:

    V(0) = 0    (6a)
    V(x) > 0 \quad \forall x \in \mathcal{B} \setminus \{0\}    (6b)
    \nabla V \cdot f(x, \mu(x)) < 0 \quad \forall x \in \mathcal{B} \setminus \{0\}    (6c)

where B ⊂ R^n is a compact and connected set, and

    \dot{V}(x(t)) = \frac{\partial}{\partial x} V(x) \cdot f(x, u) = \nabla V \cdot f(x, u).    (7)

Thus, a trajectory x(t) with initial condition at time t = t_0 subject to (5) and µ(x) is defined as

    x(t) = x(t_0) + \int_{t_0}^{t} f(x(t), \mu(x(t))) \, dt.    (8)

For the rest of the paper, we will refer to µ(x) as an equilibrium policy, which is tied to an objective that returns the robot to an equilibrium state. In our prior work [34] we show how one can include any objective into the synthesis of KL-E3.
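The assumptions in (5)–(7) can be made concrete on a toy system. The sketch below is an illustration, not the paper's setup: it instantiates the control-affine model with a double integrator, an assumed stabilizing gain for the equilibrium policy µ(x) = −Kx, and a quadratic Lyapunov function V(x) = xᵀPx obtained from the continuous Lyapunov equation, then checks the decrease condition (6c) at random states.

```python
import numpy as np

# Control-affine model (5): xdot = g(x) + h(x) mu(x), on a double integrator.
# The gain K, cost Q, and resulting P are assumed values for this sketch.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[2.0, 3.0]])             # stabilizing feedback gain (assumed)

g = lambda x: A @ x                     # free, unactuated dynamics g(x)
h = lambda x: B                         # actuated dynamics h(x)
mu = lambda x: -K @ x                   # equilibrium policy mu(x)

# Lyapunov function V(x) = x^T P x from Acl^T P + P Acl = -Q (Q = I),
# solved via the equivalent Kronecker-product linear system.
Acl = A - B @ K
Q = np.eye(2)
M = np.kron(np.eye(2), Acl.T) + np.kron(Acl.T, np.eye(2))
P = np.linalg.solve(M, -Q.flatten()).reshape(2, 2)

def Vdot(x):
    """Decrease condition (6c): grad V . f(x, mu(x)), with grad V = 2 P x."""
    return (2.0 * P @ x) @ (g(x) + h(x) @ mu(x))

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 2))
decreasing = all(Vdot(x) < 0.0 for x in states)   # (6c) holds away from the origin
```

By construction, V̇(x) = −xᵀQx along the closed loop, so the policy satisfies (6a)–(6c) on any compact set around the origin.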
B. Synthesizing a Schedule of Exploratory Actions:

Given the assumptions of a known approximate model and an equilibrium policy, our goal is to generate a control signal that augments µ(x) and minimizes (4) while ensuring x remains within the compact set B, which will allow the robot to return to an equilibrium state within the time t ∈ [t_0, t_f]. Our approach starts by quantifying how sensitive (4) is to switching from the policy µ(x(t)) to an arbitrary control vector µ⋆(t) at any time τ ∈ [t_0, t_f] for an infinitesimally small duration of time λ. We will later use this sensitivity to calculate a closed-form solution for the most influential control signal µ⋆(t).

Proposition 1. The sensitivity of (4) with respect to the duration time λ of switching from the policy µ(x) to an arbitrary control signal µ⋆(t) at time τ is

    \frac{\partial D_{KL}}{\partial \lambda} = \rho(\tau)^\top (f_2 - f_1)    (9)

where f_2 = f(x(τ), µ⋆(τ)) and f_1 = f(x(τ), µ(x(τ))), and ρ(t) ∈ R^n is the adjoint, or co-state, variable, which is the solution of the differential equation

    \dot{\rho}(t) = -\sum_i \frac{p(s_i)}{q(s_i)} \frac{\partial \psi}{\partial x} - \left( \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u} \frac{\partial \mu}{\partial x} \right)^\top \rho(t)    (10)

subject to the terminal constraint ρ(t_f) = 0, where ∂ψ/∂x is evaluated at each sample s_i.

Proof. See Appendix A.

The sensitivity ∂D_{KL}/∂λ is known as the mode insertion gradient [1]. Note that the second term in (10) encodes how the dynamics will change subject to the policy µ(x). We can directly compute the mode insertion gradient for any control µ⋆(t) that we choose. However, our goal is to find a schedule of µ⋆(t) which minimizes (4) while still being bounded by the equilibrium policy µ(x). We solve for this augmented control signal by formulating the following optimization problem:

    \mu^\star(t) = \operatorname*{arg\,min}_{\mu(t) \, \forall t \in [t_0, t_f]} \int_{t_0}^{t_f} \frac{\partial D_{KL}}{\partial \lambda}\bigg|_{\tau = t} + \frac{1}{2} \| \mu^\star(t) - \mu(x(t)) \|^2_R \, dt    (11)

where R ∈ R^{m×m} is a positive definite matrix that penalizes the deviation from the policy µ(x), and ∂D_{KL}/∂λ|_{τ=t} is (9) evaluated at time t.

Proposition 2. The augmented control signal µ⋆(t) that minimizes (11) is given by

    \mu^\star(t) = -R^{-1} h(x(t))^\top \rho(t) + \mu(x(t)).    (12)

Proof. Taking the derivative of (11) with respect to µ⋆(t) at each instance in time t ∈ [t_0, t_f] gives

    \int_{t_0}^{t_f} \frac{\partial}{\partial \mu^\star} \left( \frac{\partial D_{KL}}{\partial \lambda}\bigg|_{\tau = t} + \frac{1}{2} \| \mu^\star(t) - \mu(x(t)) \|^2_R \right) dt = \int_{t_0}^{t_f} h(x(t))^\top \rho(t) + R \left( \mu^\star(t) - \mu(x(t)) \right) dt    (13)

where we expand f(x, u) using (5). Since the expression under the integral in (11) is convex in µ⋆(t) and is at an optimizer when (13) is equal to 0 ∀t ∈ [t_0, t_f], we set the expression in (13) to zero and solve for µ⋆(t) at each instant in time, giving us

    \mu^\star(t) = -R^{-1} h(x(t))^\top \rho(t) + \mu(x(t)),

which is the schedule of exploratory actions that reduces the objective for time t ∈ [t_0, t_f] and is bounded by µ(x).

In practice, the first term in (12) is calculated and applied to the robot using a true measurement of the state x̂(t) for the policy µ(x). We refer to this first term as δµ⋆(t) = −R⁻¹ h(x(t))⊤ ρ(t), yielding µ⋆(t) = δµ⋆(t) + µ(x̂(t)).

Given the derivation of the augmented control signal that can generate ergodic exploratory motions, we verify the following through theoretical analysis in the next section:
• that (12) does in fact reduce (4);
• that (12) imposes a bound on the conditions in (6);
• and that a robotic system subject to (12) has a notion of Lyapunov attractiveness.

Theoretical Analysis: We first illustrate that our approach for computing (12) does reduce (4).

Corollary 1. Let us assume that ∂H/∂µ ≠ 0 ∀t ∈ [t_0, t_f], where H is the control Hamiltonian. Then

    \frac{\partial}{\partial \lambda} D_{KL} = -\| h(x(t))^\top \rho(t) \|^2_{R^{-1}} < 0    (14)

∀t ∈ [t_0, t_f] subject to µ⋆(t).

Proof. Inserting (12) into (9) and dropping the dependency on time for clarity gives

    \frac{\partial}{\partial \lambda} D_{KL} = \rho(t)^\top (f_2 - f_1)
        = \rho^\top \left( g(x) + h(x) \mu^\star - g(x) - h(x) \mu(x) \right)
        = \rho^\top \left( -h(x) R^{-1} h(x)^\top \rho + h(x) \mu(x) - h(x) \mu(x) \right)
        = -\rho^\top h(x) R^{-1} h(x)^\top \rho
        = -\| h(x(t))^\top \rho(t) \|^2_{R^{-1}} \le 0.    (15)

Thus, ∂D_{KL}/∂λ is always negative subject to (12).

For λ > 0 we can approximate the reduction in D_{KL} as ∆D_{KL} ≈ (∂D_{KL}/∂λ) λ ≤ 0. Thus, by applying (12), we are generating exploratory motions that minimize the ergodic measure defined by (4).

Our next set of analyses involves searching for a bound on the conditions in (6) when (12) is applied at any time τ ∈ [0, t − λ] for a duration λ ≤ t.

Theorem 1. Given the conditions in (6) for a policy µ(x), then V(x_λ^τ(t)) − V(x(t)) ≤ λβ, where x_λ^τ(t) is the solution to (29) subject to (12) for τ ∈ [0, t − λ], λ ≤ t, and

    \beta = \sup_{t \in [\tau, \tau + \lambda]} -\nabla V \cdot h(x(t)) R^{-1} h(x(t))^\top \rho(t).    (16)

Proof. See Appendix A.

We can choose any time τ ∈ [0, t − λ] to apply µ⋆(t) and provide an upper bound quantifying the change of the Lyapunov function described in (6) by fixing the maximum value of λ during active exploration. In addition, we can tune µ⋆(t) using the regularization value R such that as ‖R‖ → ∞, β → 0 and µ⋆(t) → µ(x(t)).
Given this bound, we can guarantee Lyapunov attractiveness [8], where there exists a time t such that the system (8) is guaranteed to return to a region of attraction (from which the system can be guided towards a stable equilibrium state x_0).

Definition 4. A robotic system defined by (5) is Lyapunov attractive if at some time t, the trajectory of the system x(t) ∈ C(t) ⊂ B where C(t) = { x(t) | V(x) ≤ β⋆, ∇V · f(x(t), µ(x(t))) < 0 }, β⋆ > 0 is the maximum level set of V(x) where ∇V · f(x, µ(x)) < 0, and lim_{t→∞} x(t) → x_0 such that x_0 is an equilibrium state.

Theorem 2. Given the schedule of exploratory actions (12) ∀t ∈ [τ, τ + λ], a robotic system governed by (5) is Lyapunov attractive such that lim_{t→∞} x_λ^τ(t) → x_0.

Proof. Using Theorem 1, the integral form of the Lyapunov function (37), and the identity (38), we can write

    V(x_\lambda^\tau(t)) = V(x(0)) + \int_0^t \nabla V \cdot f(x(s), \mu(x(s))) \, ds - \int_\tau^{\tau + \lambda} \nabla V \cdot h(x(s)) R^{-1} h(x(s))^\top \rho(s) \, ds
                         \le V(x(0)) - \gamma t + \beta \lambda < \beta^\star,

where

    -\gamma = \sup_{s \in [0, t]} \nabla V \cdot f(x(s), \mu(x(s))) < 0.    (17)

Since λ is fixed and β can be tuned by the matrix weight R, we can choose a t such that γt ≫ βλ. Thus, lim_{t→∞} V(x_λ^τ(t)) → V(x_0) and lim_{t→∞} x_λ^τ(t) → x_0 implies Lyapunov attractiveness, where V(x_0) is the minimum of the Lyapunov function at the equilibrium state x_0.

Proving Lyapunov attractiveness allows us to make the claim that a robot will return to a region where V̇(x) < 0 subject to the policy µ(x). This enables the robot to actively explore states which would not naturally have a safe recovery. Moreover, this analysis shows that we can choose the values of λ and R when calculating µ⋆(t) such that attractiveness always holds, giving us an algorithm that is safe for active learning.

So far, we have shown that (12) is a method that generates approximate ergodic exploration from equilibrium policies. We prove that this approach does reduce (4) and show the ability to quantify and bound how much the active exploration process will deviate the robotic system from equilibrium. Last, it is shown that generating data for learning does not require constant guaranteed Lyapunov stability of the robotic system, but instead introduces the notion of attractiveness, where we allow the robot to explore the physical realm so long as the time to explore is finite and the magnitude of the exploratory actions is restrained. In the following section, we extend our previous work in [34] by further approximating the time-averaged statistics so that computing the adjoint variable can be done more efficiently.

C. KL-E3 for Efficient Planning and Exploration

We extend our initial work by providing a further approximation to computing the time-averaged statistics which improves the computation time of our implementation. Taking note of (4), we can see that we have to evaluate q(s_i) at each sample point s_i, where q(s) = q(s | x(t)) has to evaluate the stored trajectory at each time. In real robot experiments, the underlying spatial statistics p(s) often change significantly over time. In addition, most robot experiments require replanning, which results in lost information over repeated iterations. Thus, rather than trying to compute the whole time-averaged trajectory in Definition 3, we opt to approximate the distribution by applying Jensen's inequality:

    q(s \mid x(t)) \propto \int_{t_0}^{t_f} \exp\left( -\frac{1}{2} \| s - \bar{x}(t) \|^2_{\Sigma^{-1}} \right) dt \ge \exp\left( -\frac{1}{2} \int_{t_0}^{t_f} \| s - \bar{x}(t) \|^2_{\Sigma^{-1}} \, dt \right).    (18)

Using this expression in (4), we can write

    D_{KL} \propto -\int_{S^v} p(s) \log q(s) \, ds
          = -\int_{S^v} p(s) \log \exp\left( -\frac{1}{2} \int_{t_0}^{t_f} \| s - \bar{x}(t) \|^2_{\Sigma^{-1}} \, dt \right) ds
          \propto \int_{S^v} p(s) \int_{t_0}^{t_f} \| s - \bar{x}(t) \|^2_{\Sigma^{-1}} \, dt \, ds
          \approx \sum_{i}^{N} p(s_i) \int_{t_0}^{t_f} \| s_i - \bar{x}(t) \|^2_{\Sigma^{-1}} \, dt.    (19)

Following the results to compute (9), we can show that (12) remains the same, where the only modification is in the adjoint differential equation:

    \dot{\rho}(t) = -\sum_i p(s_i) \frac{\partial \ell}{\partial x} - \left( \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u} \frac{\partial \mu}{\partial x} \right)^\top \rho(t)    (20)

such that ℓ = ℓ(s, x) = ‖s − x‖²_{Σ⁻¹}. This formulation has no need to compute q(s) and instead only evaluates p(s) at sampled points. We reserve this implementation for robotic systems that have a high-dimensional state space or for when calculating the derivatives can be costly due to over-parameterization (i.e., multi-layer networks). Note that all the theoretical analysis still holds because the fundamental theory relies on the construction through hybrid systems theory rather than the KL-divergence itself. The downside to this approach is that one now loses the ability to generate consistent ergodic exploratory movements. The effect can be seen in Figure 1(e) where the trajectory is approximated by a wide Gaussian—rather than the bi-modal distribution found in Fig. 1(d). However, for non-stationary p(s), having exact ergodic behavior is not necessary and such approximations at the time-averaged distribution level are sufficient.

In the following subsection, we provide the base algorithm and implementation details for KL-E3 and present variations based on the learning goals in Section VI in the Appendix.
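The effect of the Jensen-inequality approximation (18) can be seen in a few lines. The sketch below (hypothetical trajectory and parameters) compares the Σ-approximated statistics (3) with the bound in (18) for a trajectory that visits two regions: the bound never exceeds the time-averaged value, and it collapses the bimodal statistics into a single wide mode, as in Fig. 1(e).

```python
import numpy as np

def q_sigma(s, traj, sigma=0.1):
    """Sigma-approximated statistics (3) at point s, up to normalization."""
    return np.exp(-0.5 * ((s - traj) / sigma) ** 2).mean()

def q_jensen(s, traj, sigma=0.1):
    """Jensen lower bound (18): exponential of the time-averaged exponent."""
    return np.exp((-0.5 * ((s - traj) / sigma) ** 2).mean())

# a trajectory that splits its time between two regions (hypothetical values)
traj = np.concatenate([np.full(100, 0.2), np.full(100, 0.8)])
s_grid = np.linspace(0.0, 1.0, 11)
exact = np.array([q_sigma(s, traj) for s in s_grid])
bound = np.array([q_jensen(s, traj) for s in s_grid])
# bound <= exact everywhere; exact peaks at 0.2 and 0.8, bound peaks near 0.5
```

This is the trade-off described above: the bound is cheaper (no stored q(s) evaluations) but gives up exact ergodic coverage of multimodal targets.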
[Fig. 2: "Learning Task" panels over the search space (approximately [−1.5, 1.5]); legend: objective function, GP prediction, GP standard deviation, dynamic constraint, and measurements (∗).]
[…] a Gaussian process) of the objective function is built from sampled data x_k ∈ R^n and the posterior of the model is used to construct an acquisition function [22]. The acquisition function maps the sample space x to a value which indicates the utility of the sample (in other words, how likely the sample is to provide information about the objective given the previous samples in x). The acquisition function is often simpler and easier to calculate for selecting where to sample next than the objective function itself. Assuming one can freely sample the space x, Bayesian optimization takes a sample based on the acquisition function and a posterior is computed. The process then repeats until some terminal number of evaluations of the objective is reached or the optimizer is found. We provide pseudo-code for Bayesian optimization in the Appendix, Alg. 2.

In many examples of Bayesian optimization, the assumption is that the learning algorithm can freely sample anywhere in the sample space x ∈ R^n; however, this is not always true. Consider an example where a robot must collect a sample from a Bayesian optimization step where the search space of this sample intersects the state-space of the robot itself. The robot is constrained by its dynamics in terms of how it can sample the objective. Thus, the Bayesian optimization step becomes a constrained optimization problem where the goal is to reach the optimal value of the acquisition function subject to the dynamic constraints of the robot. Furthermore, assume that the motion of the robot is restricted to maintain the robot at an equilibrium (such as maintaining the inverted equilibrium of the cart double pendulum). The problem statement is then to enable a robot to execute a sample step of Bayesian optimization by taking into account the constraints of the robot. We use this example to emphasize the effectiveness of our method for exploiting the local dynamic information using a cart double pendulum where the equilibrium state is the upright inverted state and a policy maintains the double pendulum upright.

Here, we use a Gaussian process with a radial basis function (RBF) kernel to build a model of the objective function shown in Fig. 2. Using the Gaussian process predictive posterior mean µ̄(x) and variance σ(x), the upper confidence bound (UCB) [22] acquisition function is defined as

    UCB(x) = \bar{\mu}(x) + \kappa \sigma(x)    (21)

where κ > 0. We augment Alg. 1 for Bayesian optimization by setting the UCB acquisition function as the target distribution, which we define through the Boltzmann softmax function, a common method of converting functions that indicate regions of high value into distributions [37], [38]:⁶

    p(s) = \frac{\exp(c \, UCB(s))}{\int_{S^v} \exp(c \, UCB(\bar{s})) \, d\bar{s}}    (22)

where c > 0 is a scaling constant. Note that the denominator is approximated as a sum over the samples that our method generates. An approximate linear model of the cart double pendulum dynamics centered around the unstable equilibrium is used along with an LQR policy that maintains the double pendulum upright. We provide brief pseudo-code of the base Algorithm 1 in the Appendix, Alg. 3.

We first illustrate that our method generates ergodic exploration through an execution of our method for Bayesian optimization in Figure 2. Here, the time-series evolution of KL-E3 is shown to sample proportional to the acquisition function. As a result, our method generates samples near each of the peaks of the objective function. Furthermore, we can see that our method exploits the dynamics as well as the equilibrium policy, maintaining Lyapunov attractiveness with respect to the inverted equilibrium (we will later discuss numerical results in Fig. 4).

Next, our method is compared against three variants of Bayesian optimization: the first is Bayesian optimization with no dynamics constraint (i.e., no robot is used); the second is a linear quadratic regulator (LQR) variation of Bayesian optimization where the maximum of the acquisition function is used as a target for an LQR controller; and the last is a direct maximization of the acquisition using the stabilizing equilibrium policy (see [39] for detail).

⁶ Other distribution forms are possible, but the analysis of their effects is left for future work; we choose the Boltzmann softmax formulation for consistency throughout each example.
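The conversion from acquisition function to target distribution in (21)–(22) can be sketched as follows. This is a minimal stand-in (toy GP posterior values, assumed κ and c) in which the integral in the denominator of (22) is approximated by a sum over a discretized search space, as described above.

```python
import numpy as np

def ucb(mean, std, kappa=2.0):
    """Upper confidence bound acquisition (21): mu_bar + kappa * sigma."""
    return mean + kappa * std

def target_distribution(mean, std, c=5.0):
    """Boltzmann softmax (22) over a discretized search space; the integral
    in the denominator is approximated by a sum."""
    a = c * ucb(mean, std)
    a = a - a.max()                    # shift for numerical stability
    w = np.exp(a)
    return w / w.sum()

# toy GP posterior over a 1-D search space (illustrative values only)
s = np.linspace(0.0, 1.0, 50)
mean = np.sin(2.0 * np.pi * s)
std = 0.2 + 0.3 * np.abs(s - 0.5)      # more uncertain away from the center
p = target_distribution(mean, std)
```

Because the softmax is strictly increasing in UCB, the target density places the most mass exactly where the acquisition function is largest, while keeping nonzero mass everywhere for exploration.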
4
namical system by exploring the state-space of the quadcopter
0.0000
10 20 30 while remaining at a stable hover. Our goal is to show that
2 our method can efficiently and effectively explore the state-
space of the quadcopter (including body linear and angular
velocities) in order to generate data for learning a transition
0
0 5 10 15 20 25 30 model of the quadcopter for model-based control. In addition,
time (s) we show that the exploratory motions improve the quality of
data generated for learning while exploiting and respecting the
Fig. 4: Normalized Lyapunov function for the cart double stable hover equilibrium in a single execution of the robotic
pendulum with upright equilibrium. Our method (shown in system [39].
blue) is roughly 6 times more stable than LQR-Bayesian opti- An LQR policy is used to keep the vehicle hovering while
mization. The large initial values indicate the sweeping shown a local linear model (centered around the hover) is used for
in Fig. 2(a) when the cart double pendulum moves across planning exploratory motions. The stochastic model of the
the search space. Subsequent application of the exploratory quadcopter is of the form [41]
motion refine the exploration process. The Lyapunov attrac-
tiveness property is enforced through automatic switching of dx ∼ N (fθ (x, u), σθ (x)) (23)
the exploration process.
where N is a normal distribution with mean fθ , and variance
σθ (x), and the change in the state is given by dx ∈ Rn .
for detail) is used. A total of 10 trials for each method are Here, f (x, u; θ) = fθ (x, u) : Rn×m → Rn specifies a neural-
collected with the agent starting at the same location uniformly network model of the dynamics and σ(x; θ) = σθ (x) is a
sampled between −0.8 and 0.8 throughout the sample space. diagonal Gaussian σθ (x) : Rn → Rn which defines the
In Fig. 3 we show that our method not only performs compa- uncertainty of the transition model at state x all parameterized
rably to Bayesian optimization without dynamic constraints7 , by the parameters θ.
but outperforms both LQR and direct maximization variants
We use KL-E3 to enable the quadcopter to explore with
7 This may change if the dynamics of the robot are slower or the exploration
respect to the variance of the model (that is, exploration in the
space is sufficiently larger. Note that the other methods would also be equally state-space is generated based on how uncertain the transition
affected. model is at that state). In a similar manner as done in the
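The transition model in Eq. (23), a neural-network mean with a state-dependent diagonal variance, can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the layer sizes, the tanh/exp choices, and the random (rather than learned) weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the quadcopter's n and m are larger.
n, m, hidden = 4, 2, 32

# Random weights stand in for learned parameters theta.
W1 = rng.normal(0, 0.1, (hidden, n + m)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (n, hidden));     b2 = np.zeros(n)
Ws = rng.normal(0, 0.1, (n, n))           # variance head (diagonal)

def f_mean(x, u):
    """Neural-network mean f_theta(x, u) -> R^n."""
    h = np.tanh(W1 @ np.concatenate([x, u]) + b1)
    return W2 @ h + b2

def sigma_diag(x):
    """State-dependent diagonal variance sigma_theta(x), kept positive."""
    return np.exp(Ws @ x).clip(1e-4, 10.0)

def sample_dx(x, u):
    """Draw dx ~ N(f_theta(x, u), sigma_theta(x)) as in Eq. (23)."""
    return f_mean(x, u) + np.sqrt(sigma_diag(x)) * rng.normal(size=n)

x, u = np.zeros(n), np.zeros(m)
dx = sample_dx(x, u)
```

Because σθ(x) enters the exploration objective, states where the variance head outputs large values attract the most exploratory motion.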
TABLE I: Comparison of our method against various methods for state-space exploration using a quadcopter. Each method uses the same base stabilization policy, which maintains hover height, and is instantiated and run once for 1200 time steps. Data from a single execution is used to generate a neural network dynamics model. This is repeated 20 times to estimate the performance of each method. Methods with (*) were unable to generate a dynamics model that completed the tracking objective. [Column headers not recovered from the extraction; values are mean ± standard deviation.]

    Normal   0.1*    1.41 ± 0.0121    0.72 ± 0.0016
    OU       0.3     2.73 ± 0.0228    1.17 ± 0.0152
    OU       0.1     0.97 ± 0.0096    0.73 ± 0.0033
    OU       0.01*   0.10 ± 0.0007    0.67 ± 0.0002
    Uniform  0.1*    0.84 ± 0.0090    0.69 ± 0.0004

[Figure: tracking cost versus time (s) plots; only the axis labels and the legend entry "OU 0.3" were recoverable.]
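The OU baselines in Table I refer to Ornstein-Uhlenbeck exploration noise [43]. A minimal generator is sketched below; the mean-reversion rate `theta` and step size `dt` are assumed values, with `sigma` playing the role of the noise levels (0.3, 0.1, 0.01) in the table.

```python
import numpy as np

def ou_noise(steps, dim, theta=0.15, sigma=0.3, dt=1.0, seed=0):
    """Ornstein-Uhlenbeck process: dX = -theta * X dt + sigma dW.

    Mean-reverting toward zero, so the noise is temporally correlated
    rather than independent at each step (unlike the Normal/Uniform
    baselines in Table I).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    out = np.empty((steps, dim))
    for t in range(steps):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.normal(size=dim)
        out[t] = x
    return out

# 1200 steps, matching the single-execution length used in Table I.
noise = ou_noise(1200, 4, sigma=0.1)
```

The temporal correlation is what distinguishes the OU rows from the Normal and Uniform rows: consecutive control perturbations push in similar directions, producing smoother exploratory motion.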
Fig. 6: Control signal ‖u(t)‖ and resulting body linear and angular velocities ω, v for the quadcopter system using our method, information maximization, and OU noise with 0.1 maximum noise level. (a) Control magnitude. (b) Achieved body velocity. Our method generates smoother control signals while exploring the necessary regions of the state-space without destabilizing the system.

Fig. 7: Comparison of KL-E3 enhanced DDPG against DDPG using the cart pole inversion and the half cheetah running tasks. KL-E3 provides a more informative distribution of data which assists the learning process, improves the overall performance, and achieves better performance faster for DDPG.
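The target distribution in Eq. (28) normalizes exp(cQ(s)) over states and actions; over a finite set of sampled points the construction reduces to a softmax. A small sketch follows, where the values of `q` and the temperature `c` are arbitrary illustrative choices.

```python
import numpy as np

def boltzmann_target(q_values, c=1.0):
    """Discretized form of Eq. (28): p(s) proportional to exp(c * Q(s)),
    normalized over a finite set of sampled state-action pairs s."""
    z = c * q_values - np.max(c * q_values)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5, 2.0])  # Q-values at four sampled points
p = boltzmann_target(q, c=2.0)
```

Larger `c` concentrates the distribution on the highest-utility state-action pairs, while `c -> 0` flattens it toward uniform coverage; this is the knob that trades off exploiting the learned skill against exploring around it.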
…explore within the vicinity of the robot skill in an intentional manner, improving the learning process.
In many examples of robot skill learning, a common mode of failure is that the resulting learned skill is highly dependent on the quality of the distribution of the data used for learning. Typically, these methods use the current iteration of the learned skill (often referred to as a policy) with added noise (or use a stochastic policy) to explore the action space. Often the added noise is insufficient for gaining informative experience that improves the quality of the policy. Here, we show that our method can improve robot skill learning by generating dynamic coverage and exploration around the learned skill, reducing the likelihood of suboptimal solutions, and improving the efficiency of these methods.
We use deep deterministic policy gradient (DDPG) [19] as our choice of off-policy method. Given a set of data, DDPG calculates a Q-function defined as

Q(x, u) = E[r(x, u) + γQ(x′, µ(x′))]   (27)

where r(x, u) : R^n × R^m → R is a reward function, x′ is the next state subject to the control u, the expectation E is taken with respect to the states, 0 < γ < 1 is known as a discounting factor [38], and the function Q(x, u) : R^n × R^m → R maps the utility of a state and how the action at that state will perform in the next state given the policy µ(x). DDPG simultaneously learns Q(x, u) and a policy µ(x) by sampling from a set of collected states, actions, rewards, and their resulting next states. We refer the reader to the pseudo-code of DDPG in [19].
Our method uses the learned Q(x, u) and µ(x) as the target distribution and the equilibrium policy, respectively. We modify the Q-function such that it becomes a distribution using a Boltzmann softmax⁶

p(s) = exp(cQ(s)) / ∫_{S_v} exp(cQ(s̄)) ds̄   (28)

where s ∈ R^{n+m} includes both states and actions. This form of Equation (28) has been used previously for inverse reinforcement learning [45], [46]. Here, Eq. (28) is used as a guide for the ergodic exploration, where our exploration is centered around the learned policy and the utility of the learned skill (the Q-function). Since most reinforcement learning deals with large state-spaces, we use the approximation to the time-averaged statistics in (18) to improve the computational efficiency of our algorithm. A parameterized dynamics model is built using the first 200 points of each simulation (see Appendix for more detail) and updated as each trial continues. OU noise is used for exploration in the DDPG comparison with the same parameters shown in [19]. We provide pseudo-code of KL-E3 enhanced DDPG in the Appendix, Alg. 5.
Our method is tested on the cart pole inversion and the half-cheetah running task (see Figure 7 for results). For both robotic systems, KL-E3 is shown to improve the overall learning process, making learning a complex robot skill more sample efficient. Specifically, inverting the cart pole starts to occur within 50 episodes, and the half cheetah begins generating running gaits within 250 episodes (each episode consists of 200 time steps of the environment). In contrast, DDPG alone generates suboptimal running (as shown at https://sites.google.com/view/kle3/home) and unstable cart inversion attempts. Because our method is able to explore within the vicinity of the learned skill in an intentional, ergodic manner, it is able to quickly learn skills and improve the overall quality of the exploration.

VII. CONCLUSION

We present KL-E3, a method which is shown to enable robots to actively generate informative data for various learning goals from equilibrium policies. Our method synthesizes ergodic coverage using a KL-divergence measure which generates data by exploiting dynamic movement proportional to the utility of the data. We show that hybrid systems theory can be used to synthesize a schedule of exploration actions that can incorporate learned policies and models in a systematic manner. Last, we present examples that illustrate the effectiveness of our method for collecting and generating data in an ergodic manner and provide theoretical analysis which bounds our method through Lyapunov attractiveness.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant CNS 1837515. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the aforementioned institutions.
APPENDIX A
PROOFS

A. Proof of Proposition 1

Proof. Let us define the trajectory x(t) switching from µ(x(τ)) → µ*(τ) for a duration of λ as

x(t) = x(t0) + ∫_{t0}^{τ} f(x, µ(x)) dt + ∫_{τ}^{τ+λ} f(x, µ*) dt + ∫_{τ+λ}^{t} f(x, µ(x)) dt   (29)

where we drop the dependence on time for clarity. Taking the derivative of (4), using (29), with respect to the duration time λ gives us the following expression:

∂/∂λ D_KL = − Σ_i [p(s_i)/q(s_i)] ∫_{τ+λ}^{tf} (∂ψ/∂x)^T (∂x/∂λ) dt.   (30)

We obtain ∂x/∂λ by using Leibniz's rule to evaluate the derivative of (29) with respect to λ at the integration boundary conditions, which gives the expression

∂x(t)/∂λ = (f2 − f1) + ∫_{τ+λ}^{t} (∂f/∂x + (∂f/∂u)(∂µ/∂x)) (∂x(s)/∂λ) ds   (31)

where s is a placeholder variable for time, f2 = f(x(τ+λ), µ*(τ+λ)), and f1 = f(x(τ+λ), µ(x(τ+λ))). Noting that ∂x/∂λ is a repeated term under the integral, (31) is a linear convolution with initial condition ∂x(τ+λ)/∂λ = f2 − f1. As a result, we can rewrite (31) using a state-transition matrix [47]

Φ(t, τ+λ) = exp((∂f/∂x + (∂f/∂u)(∂µ/∂x))(t − τ))

with initial condition f2 − f1 as

∂x(t)/∂λ = Φ(t, τ+λ)(f2 − f1).   (32)

Using (32) in (30) gives the following expression:

∂/∂λ D_KL = (− Σ_i [p(s_i)/q(s_i)] ∫_{τ+λ}^{tf} (∂ψ/∂x)^T Φ(t, τ+λ) dt)(f2 − f1).   (33)

Taking the limit as λ → 0 we then set

ρ(τ)^T = − Σ_i [p(s_i)/q(s_i)] ∫_{τ}^{tf} (∂ψ/∂x)^T Φ(t, τ) dt   (34)

in (33), which results in

∂/∂λ D_KL = ρ(τ)^T (f2 − f1).   (35)

Taking the derivative of (34) with respect to time τ yields the following:

∂/∂τ ρ(τ)^T = Σ_i [p(s_i)/q(s_i)] (∂ψ/∂x)^T Φ(τ, τ) − Σ_i [p(s_i)/q(s_i)] ∫_{τ}^{tf} (∂ψ/∂x)^T (∂/∂τ)Φ(t, τ) dt.

Since Φ(τ, τ) = 1, and

(∂/∂τ)Φ(t, τ) = −Φ(t, τ)(∂f/∂x + (∂f/∂u)(∂µ/∂x)),

we can show that

∂/∂τ ρ(τ)^T = Σ_i [p(s_i)/q(s_i)] (∂ψ/∂x)^T − (− Σ_i [p(s_i)/q(s_i)] ∫_{τ}^{tf} (∂ψ/∂x)^T Φ(t, τ) dt)(∂f/∂x + (∂f/∂u)(∂µ/∂x)),

where the parenthesized term is ρ(τ)^T. Taking the transpose, we can show that ρ(t) can be solved backwards in time with the differential equation

ρ̇(t) = Σ_i [p(s_i)/q(s_i)] ∂ψ/∂x − (∂f/∂x + (∂f/∂u)(∂µ/∂x))^T ρ(t)   (36)

with final condition ρ(tf) = 0.

B. Proof of Theorem 1

Proof. The integral form of the Lyapunov function switching between µ(x(t)) and µ*(t) at time τ for a duration of time λ, starting at x(0), can be written as

V(x_λ^τ(t)) = V(x(0)) + ∫_0^{τ} ∇V · f(x(s), µ(x(s))) ds + ∫_{τ}^{τ+λ} ∇V · f(x(s), µ*(s)) ds + ∫_{τ+λ}^{t} ∇V · f(x(s), µ(x(s))) ds   (37)

where s is a placeholder for time. Expanding f(x, u) as g(x) + h(x)u and using (12), we can show the following identity:

∇V · f(x, µ*) = ∇V · g(x) + ∇V · h(x)µ*
             = ∇V · g(x) + ∇V · h(x)µ(x) − ∇V · h(x)R⁻¹h(x)^T ρ
             = ∇V · f(x, µ(x)) − ∇V · h(x)R⁻¹h(x)^T ρ.   (38)

Using (38) in (37), we can show that

V(x_λ^τ(t)) = V(x(0)) + ∫_0^{t} ∇V · f(x(s), µ(x(s))) ds − ∫_{τ}^{τ+λ} ∇V · h(x(s))R⁻¹h(x(s))^T ρ(s) ds
           = V(x(t)) − ∫_{τ}^{τ+λ} ∇V · h(x(s))R⁻¹h(x(s))^T ρ(s) ds   (39)

where x(t) is given by (8).
Letting the largest value of ∇V · h(x(t))R⁻¹h(x(t))^T ρ(t) be given by

β = sup_{s∈[τ,τ+λ]} −∇V · h(x(s))R⁻¹h(x(s))^T ρ(s) > 0,   (40)
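The adjoint equation (36) is integrated backwards in time from its final condition ρ(tf) = 0. A toy numerical sketch on a linear system is given below; the matrix `A` and vector `g` are placeholders for the trajectory-dependent terms (the linearization ∂f/∂x + (∂f/∂u)(∂µ/∂x) and the weighted sum of ∂ψ/∂x), and backward Euler stands in for whatever integrator one prefers.

```python
import numpy as np

# Constant stand-ins for the trajectory-dependent terms in Eq. (36),
# used for illustration only.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])  # df/dx + (df/du)(dmu/dx)
g = np.array([0.3, -0.1])                 # sum_i p(s_i)/q(s_i) dpsi/dx

def solve_adjoint(tf=1.0, steps=1000):
    """Integrate rho_dot = g - A^T rho backwards from rho(tf) = 0."""
    dt = tf / steps
    rho = np.zeros(2)                     # final condition rho(tf) = 0
    traj = [rho.copy()]
    for _ in range(steps):                # step backwards in time
        rho = rho - dt * (g - A.T @ rho)
        traj.append(rho.copy())
    return np.array(traj)[::-1]           # reorder to forward time

rho_traj = solve_adjoint()
```

The resulting ρ(τ) is exactly the sensitivity (35) used to schedule exploratory control actions in the main text.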
TABLE II: Parameters used for each method presented in Section VI. I indicates an identity matrix of size m × m, or n × n for the quadcopter model learning, and v × v where v = n + m in both DDPG examples. Neural net model parameters θ are learned by minimizing the error ‖ẋ − f(x, u; θ)‖² over a subset of K data points, where ẋ ≈ (x(t1) − x(t0))/dt and dt is the time step of the system. A time-decaying R is used for both DDPG examples so that the largest exploration occurs earlier in the episode to better assist the skill learning.
[36] W. Zhong and H. Rock, "Energy and passivity based control of the double inverted pendulum on a cart," in IEEE International Conference on Control Applications, 2001, pp. 896-901.
[37] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[38] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[39] I. Abraham and T. D. Murphey, "Active learning of dynamics for data-driven control using Koopman operators," IEEE Transactions on Robotics, vol. 35, no. 5, pp. 1071-1083, 2019.
[40] T. Fan and T. Murphey, "Online feedback control for input-saturated robotic systems on Lie groups," arXiv preprint arXiv:1709.00376, 2017.
[41] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, 2016.
[42] A. D. Wilson, J. A. Schultz, A. R. Ansari, and T. D. Murphey, "Dynamic task execution using active parameter identification with the Baxter research robot," Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 391-397, 2017.
[43] G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the Brownian motion," Physical Review, vol. 36, no. 5, p. 823, 1930.
[44] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, "Aggressive driving with model predictive path integral control," in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 1433-1440.
[45] E. Bıyık and D. Sadigh, "Batch active preference-based learning of reward functions," arXiv preprint arXiv:1810.04303, 2018.
[46] D. S. Brown, Y. Cui, and S. Niekum, "Risk-aware active inverse reinforcement learning," arXiv preprint arXiv:1901.02161, 2019.
[47] B. D. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007.

Ian Abraham received the B.S. degree in Mechanical and Aerospace Engineering from Rutgers University and the M.S. degree in Mechanical Engineering from Northwestern University. He is currently a Ph.D. candidate at the Center for Robotics and Biosystems at Northwestern University. His Ph.D. work focuses on developing formal methods for robot sensing and runtime active learning. He is also the recipient of the 2019 King-Sun Fu IEEE Transactions on Robotics Best Paper award.

Ahalya Prabhakar received the B.S. degree in mechanical engineering from the California Institute of Technology, Pasadena, CA, USA, in 2013, and the M.S. degree in mechanical engineering from Northwestern University, Evanston, IL, USA, in 2016. She is a Ph.D. candidate at the Center for Robotics and Biosystems at Northwestern University. Her work focuses on developing compressible representations through active exploration for complex task performance and efficient robot learning.