Abstract— We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blank slate using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system in the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks.

I. INTRODUCTION

Recent advances in bipedal walking technology have produced robots capable of leaving the laboratory environment to interact with the unknown and uncertain environments of the real world. Despite our best efforts, it is unlikely that we will be able to preprogram these robots for every possible situation without sacrificing performance. Endowing our robots with the ability to learn from experience and adapt to their environment seems critical for the success of any real world robot.

Dynamic bipedal walking is difficult to learn for a number of reasons. First, walking robots typically have many degrees of freedom, which can cause a combinatorial explosion for learning systems that attempt to optimize performance in every possible configuration of the robot. Second, details of the robot dynamics such as uncertainties in the ground contact and nonlinear friction in the joints are difficult to model well in simulation, making it unlikely that a controller optimized in a simulation will perform optimally on the real robot. Since it is only practical to run a small number of learning trials on the real robot, the learning algorithms must perform well after obtaining a very limited amount of data. Finally, learning algorithms for dynamic walking must deal with dynamic discontinuities caused by collisions with the ground and with the problem of delayed reward: torques applied at one time may have an effect on the performance many steps into the future.

Although there is a great deal of literature on learning control for dynamically walking bipedal robots, there are relatively few examples of learning algorithms actually implemented on the robot or which work quickly enough to allow the robot to adapt online to changing terrain. Some researchers attempt to learn a controller in simulation that is robust enough to run on the real robot [1], treating differences between the simulation and the robot as disturbances. The University of New Hampshire bipeds were two early examples of online learning which acquired a basic gait by tuning parameters in a hand-designed controller ([2], [3]). In this paper we generalize these results to obtaining a controller from scratch instead of tuning an existing controller. Learning control has also been successfully implemented on Sony's quadrupedal robot AIBO (i.e., [4]). The learned controllers for AIBO are open-loop trajectories, but trajectory feedback is essential for robust, dynamic, bipedal walking.

In order to study learning feedback control for walking, we performed our initial experiments on a simplified robot which captures the essence of dynamic walking but which minimizes many of the complications. Our robot has only 6 internal degrees of freedom and 4 actuators¹. The mechanical design of our robot is based on a passive dynamic walker ([5], [6]). This allows us to solve a portion of the control problem in the mechanical design, and makes the robot mechanically very stable; most policies in our search space result in either stable walking or failed walking where the robot ends up simply standing still.

The learning on our robot is performed by a policy gradient reinforcement learning algorithm ([7], [8], [9]). The goal of this paper is to describe our formulation of the learning problem and the algorithm that we use to solve it. We include our experimental results on this simplified biped, and discuss the possibility of applying the same algorithm to a more complicated walking system.

¹The standard for 3D bipeds is to have at least 12 internal degrees of freedom and 12 actuators in the legs.
II. THE ROBOT

Fig. 1. The robot on the left is a simple passive dynamic walker. The robot on the right is our actuated version of the same robot.

The passive dynamic walker shown on the left in Figure 1 represents the simplest machine that we could build which captures the essence of stable dynamic walking in three dimensions. It has only a single passive pin joint at the hip. When placed at the top of a small ramp and given a push sideways, the walker will begin falling down the ramp and eventually converge to a stable limit cycle trajectory that has been compared to the waddling gait of a penguin [10]. The energetics of this passive walker are common to all passive walkers: energy lost due to friction and collisions when the swing leg returns to the ground is balanced by the gradual conversion of potential energy into kinetic energy as the walker moves down the slope. The mechanical design of this robot and some experimental stability results are presented in [11].

We designed our learning robot by adding a small number of actuators to this passive design. The robot shown on the right in Figure 1, which is also described in [11], has passive joints at the hip and 2 degrees of actuation (roll and pitch) at each ankle. The ankle actuators are position controlled servo motors which, when commanded to hold their zero position, allow the actuated robot to walk stably down a small ramp, "simulating" the passive walker. The shape of the large, curved feet is designed to make the robot walk passively at 0.8 Hz, and to take steps of approximately 6.5 cm when walking down a ramp of 0.03 radians. The robot stands 44 cm tall and weighs approximately 2.9 kg, which includes the CPU and batteries that are carried on-board. The most recent additions to this robot are the passive arms, which are mechanically coupled to the opposite leg to provide mechanical yaw compensation.

When placed on flat terrain, the passive walker waddles back and forth, slowly losing energy, until it comes to rest standing still. In order to achieve stable walking on flat terrain, the actuators on our learning robot must restore energy into the system that would have been restored by gravity when walking down a slope.

III. THE LEARNING PROBLEM

The goal of learning is to acquire a feedback control policy which makes the robot's gait invariant to small slopes. In total, the system has 9 degrees of freedom², and the equations of motion can be written in the form

    H(q)q̈ + C(q, q̇)q̇ + G(q) = τ + D(t),    (1)

where

    q = [θ_yaw, θ_lPitch, θ_bPitch, θ_rPitch, θ_roll, θ_raRoll, θ_laRoll, θ_raPitch, θ_laPitch]^T,
    τ = [0, 0, 0, 0, 0, τ_raRoll, τ_laRoll, τ_raPitch, τ_laPitch]^T.

H is the state dependent inertial matrix, C contains interaction torques between the links, G represents the effect of gravity, τ are the motor torques, and D are random disturbances to the system. Our shorthand lPitch, bPitch, and rPitch refer to left leg pitch, body pitch, and right leg pitch, respectively. raRoll, laRoll, raPitch, and laPitch are short for right and left ankle roll and pitch. The actual output of the controller is a motor command vector

    u = [u_raRoll, u_laRoll, u_raPitch, u_laPitch]^T,

which generates torques

    τ = h(q, q̇, u).

The function h describes the linear feedback controller implemented by the servo boards and the nonlinear kinematic transformation into joint torques.

The robot uses a deterministic feedback control policy which is represented using a linear function approximator parameterized by vector w and using nonlinear features φ:

    u = π_w(x̂) = Σ_i w_i φ_i(x̂),  with  x = [q, q̇]^T.    (2)

The notation x̂ represents a noisy estimate of the state x. Before learning, w is initialized to all zeros, making the policy outputs zero everywhere, so that the robot simulates the passive walker.

To quantify the stability of our nonlinear, stochastic, periodic trajectory, we consider the dynamics on the return map, taken around the point where θ_roll = 0 and θ̇_roll > 0. The return map dynamics are a Markov random sequence with the probability at the (n + 1)th crossing of the return map given by

    f_w(x′, x) = P{X̂(n + 1) = x′ | X̂(n) = x, W(n) = w}.    (3)

f_w(x′, x) represents the probability density function over the state space which contains the dynamics in equations 1 and 2 integrated over one cycle. We do not make any assumptions about its form, except that it is Markov. Note that the element of f_w representing θ_roll is the delta function, independent of x.

²6 internal DOFs and 3 DOFs for the robot's orientation. We assume that the robot is always in contact with the ground at a single point, and infer the robot's absolute (x, y) position in space directly from the remaining variables.
The return map dynamics are represented as a Markov chain that depends on the parameter vector w instead of the equivalent Markov decision process for simplification, because the feedback controller is evaluated many times during a single step (our controller runs at 100 Hz and our robot steps at around 0.8 Hz). The stochasticity in f_w comes from the random disturbances D(t) and the state estimation error, x̂ − x.

The cost function for learning uses a constant desired value, x_d, on the return map:

    g(x) = ½ |x − x_d|².    (4)

This desired value can be considered a reference trajectory on the return map, and is taken from the gait of the walker down a slope of 0.03 radians; no reference trajectory is required for the limit cycle between steps. For a given trajectory x̂ = [x̂(0), x̂(1), ..., x̂(N)], we define the average cost

    G(x̂) = (1/N) Σ_{n=0}^{N} g(x̂(n)).    (5)

Our goal is to find the parameter vector w which minimizes

    lim_{N→∞} E{G(x̂)}.    (6)

By minimizing this error, we are effectively minimizing the eigenvalues of the return map, and maximizing the stability of the desired limit cycle.

IV. THE LEARNING ALGORITHM

The learning algorithm is a statistical algorithm which makes small changes to the control parameters w on each step and uses correlations between changes in w and changes in the return map error to climb the performance gradient. This can be accomplished with a very simple online learning rule which changes w with each step that the robot takes. The particular algorithm that we present here was originally proposed by [7]. We present a thorough derivation of this algorithm in the next section.

The algorithm makes use of an intermediate representation which we call the value function, J(x). The value of state x is the expected average cost to be incurred by following policy π_w starting from state x:

    J(x) = lim_{N→∞} (1/N) Σ_{n=0}^{N} g(x(n)),  with  x(0) = x.

Ĵ_v(x) is an estimate of the value function parameterized by vector v. This value estimate is represented in another function approximator:

    Ĵ_v(x̂) = Σ_i v_i ψ_i(x̂).    (7)

The updates make use of an eligibility trace e_i(n). We begin with w(0) = e(0) = 0. At the end of the nth step, we make the updates:

    δ(n) = g(x̂(n)) + γ Ĵ_v(x̂(n + 1)) − Ĵ_v(x̂(n))    (8)
    e_i(n) = γ e_i(n − 1) + b_i(n) z_i(n)    (9)
    Δw_i(n) = −η_w δ(n) e_i(n)    (10)
    Δv_i(n) = η_v δ(n) ψ_i(x̂(n)).    (11)

η_w ≥ 0 and η_v ≥ 0 are the learning rates and γ is the discount factor of the eligibility trace, which will be discussed in more detail in the algorithm derivation. b_i(n) is a boolean one step eligibility, which is 1 if the parameter w_i is activated (φ_i(x̂) > 0) at any point during step n and 0 otherwise. δ(n) is called the one step temporal difference error.

The algorithm can be understood intuitively. On each step the robot receives some cost g(x̂(n)). This cost is compared to the cost that we expect to receive, as estimated by Ĵ_v(x). If the cost is lower than expected, then −ηδ(n) is positive, so we add a scaled version of the noise terms, z_i, into w_i. Similarly, if the cost is higher than expected, then we move in the opposite direction. This simple online algorithm performs approximate stochastic gradient descent on the expected value of the average infinite-horizon cost.

V. ALGORITHM DERIVATION

The expected value of the average cost, G, is given by:

    E{G(x̂)} = ∫ G(x̂) P_w′{X̂ = x̂} dx̂.

The probability of trajectory x̂ is

    P_w′{X̂ = x̂} = P{X̂(0) = x̂(0)} ∏_{n=0}^{N−1} f_w′(x̂(n + 1), x̂(n)).

Taking the gradient of E{G(x̂)} with respect to w we find

    ∂/∂w_i E{G(x̂)} = ∫ G(x̂) ∂/∂w_i P_w′{X̂ = x̂} dx̂
                    = ∫ G(x̂) P_w′{X̂ = x̂} ∂/∂w_i log P_w′{X̂ = x̂} dx̂
                    = E{ G(x̂) ∂/∂w_i log P_w′{X̂ = x̂} }
                    = E{ G(x̂) Σ_{m=0}^{N−1} ∂/∂w_i log f_w′(x̂(m + 1), x̂(m)) }.

Recall that f_w′(x′, x) is a complicated function which includes the integrated dynamics of the controller and the robot. Nevertheless, ∂/∂w_i log f_w′ takes a simple form.
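The per-step updates (8)-(11) of Section IV translate directly into code. A minimal sketch follows, assuming Gaussian exploration noise z added to the parameters during the step, and checking the activation condition φ_i(x̂) > 0 only at the return-map crossing (the paper checks it at any point during the step):

```python
import numpy as np

def learning_step(x_hat_n, x_hat_next, z, w, v, e, phi, psi, g,
                  gamma=0.2, eta_w=1e-2, eta_v=1e-2):
    """One return-map crossing of the updates in Eqs. (8)-(11)."""
    J = lambda x: float(np.dot(v, psi(x)))                     # value estimate, Eq. (7)
    delta = g(x_hat_n) + gamma * J(x_hat_next) - J(x_hat_n)    # TD error, Eq. (8)
    b = (phi(x_hat_n) > 0).astype(float)                       # one-step eligibility b_i(n)
    e = gamma * e + b * z                                      # eligibility trace, Eq. (9)
    w = w - eta_w * delta * e                                  # policy update, Eq. (10)
    v = v + eta_v * delta * psi(x_hat_n)                       # value update, Eq. (11)
    return w, v, e
```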
Fig. 2. A typical learning curve, plotted as the average error on each step.

Fig. 3. θ_roll trajectory of the robot starting from standing still.

Fig. 4. Learned feedback control policy u_raRoll = π_w(x̂).

In Figure 5 we plot the return maps of the system before and after learning.

Fig. 5. Experimental return maps, before (left) and after (right) learning. Fixed points exist at the intersections of the return map (blue) and the line of slope one (red).

To quantify the stability of the learned controller, we measure the eigenvalues of the return map. Linearizing around the fixed point in Figure 5 suggests that the system has a single eigenvalue of 0.5. To obtain the eigenvalues of the return map when the robot is walking, we run the robot from a large number of initial conditions and record the return map trajectories x̂_i(n), 9 × 1 vectors which represent the state of the system (with fixed ankles) on the nth crossing of the ith trial. For each trial we estimate x̂_i(∞), the equilibrium of the return map. Finally, we perform a least squares fit of the matrix A to satisfy the relation

    [x̂_i(n + 1) − x̂_i(∞)] = A [x̂_i(n) − x̂_i(∞)].

The eigenvalues of A for the learned controller and for our hand-designed controllers (described in [11]) are:

    Controller                               Eigenvalues
    Passive walking (63 trials)              0.88 ± 0.01i, 0.75, 0.66 ± 0.03i, 0.54, 0.36, 0.32 ± 0.13i
    Hand-designed feed-forward (89 trials)   0.80, 0.60, 0.49 ± 0.04i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
    Hand-designed feedback (58 trials)       0.78, 0.69 ± 0.03i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
    Learned feedback (42 trials)             0.74 ± 0.05i, 0.53 ± 0.09i, 0.43, 0.30 ± 0.02i, 0.15, 0.07

All of these experiments were on flat terrain except the passive walking, which was on a slope of 0.027 radians.
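The least squares fit described above is straightforward to reproduce. A sketch, assuming each trial is stored as an array of return-map states together with its estimated equilibrium:

```python
import numpy as np

def return_map_eigenvalues(trajectories, equilibria):
    """Fit [x_i(n+1) - x_i(inf)] = A [x_i(n) - x_i(inf)] by least squares
    and return the eigenvalues of A sorted by magnitude."""
    X, Y = [], []
    for traj, x_inf in zip(trajectories, equilibria):
        dev = traj - x_inf            # deviations from the trial's fixed point
        X.append(dev[:-1])            # crossings n
        Y.append(dev[1:])             # crossings n + 1
    X, Y = np.vstack(X), np.vstack(Y)
    # lstsq solves Y ~= X @ B, i.e. x(n+1) ~= B.T @ x(n), so A = B.T.
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    eigvals = np.linalg.eigvals(B.T)
    return eigvals[np.argsort(-np.abs(eigvals))]
```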
The convergence of the system to the nominal trajectory is largely governed by the largest eigenvalues. This analysis suggests that our learned controller converges to the steady state trajectory more quickly than the passive walker on a ramp and more quickly than any of our hand-designed controllers.

Our stochastic policy gradient algorithm solves the temporal credit assignment problem by accumulating the eligibility within a step and discounting eligibility between steps. Interestingly, our algorithm performs best with heavy discounting between steps (0 ≤ γ ≤ 0.2). This suggests that our one dimensional value estimate does a good job of isolating the credit assignment to a single step.

While it took a few minutes to learn a controller from a blank slate, adjusting the learned controller to adapt to small changes in the terrain appears to happen very quickly. The non-learning controllers require constant attention and small manual changes to the parameters as the robot walks down the hall, on tiles, and on carpet. The learning controller easily adapts to these situations.

VIII. DISCUSSION

Designing our robot like a passive dynamic walker changes the learning problem in a number of ways. It allows us to learn a policy with only a single output which controls a 9 DOF system, and allows us to formulate the problem on the return map dynamics. It also dramatically increases the number of policies in the search space which could generate stable walking. The learning algorithm works extremely well on this simple robot, but will the technique scale to more complicated robots?

One factor in our success was the formulation of the learning problem on the discrete dynamics of the return map instead of the continuous dynamics along the entire trajectory. This formulation relies on the fact that our passive walker produces periodic trajectories even before the learning begins. It is possible for passive walkers to have knees and arms [13], or on a more traditional humanoid robot this algorithm could be used to augment and improve an existing walking controller which produces nominal walking trajectories.

As the number of degrees of freedom increases, the stochastic policy gradient algorithm may have problems with scaling. The algorithm correlates changes in the policy parameters with changes in the performance on the return map. As we add degrees of freedom, the assignment of credit to a particular actuator will become more difficult, requiring more learning trials to obtain a good estimate of the correlation. This scaling problem is an open and interesting research question and a primary focus of our current research.

IX. CONCLUSIONS

We have presented a learning formulation and learning algorithm which works very well on our simplified 3D dynamic biped. The robot begins to walk after only one minute of learning from a blank slate, and the learning converges to the desired trajectory in less than 20 minutes. This learned controller is quantifiably more stable, using the eigenvalues of the return map, than any controller we were able to derive for the same robot by hand. Once the controller is learned, the robot is able to quickly adapt to small changes in the terrain.

Building a robot to simplify the learning allowed us to gain some practical insights into the learning problem for dynamic bipedal locomotion. Implementing these algorithms on the real robot proved to be a very different problem than working in simulation. We would like to take two basic directions to continue this research. First, we are removing many of the simplifying assumptions used in this paper (such as the decomposed control policy) to better approximate optimal walking on this simple platform and to test our learning controller's ability to compensate for rough terrain. Second, we are scaling these results up to more sophisticated bipeds, including a passive dynamic walker with knees and humanoids that already have a basic control system in place.

ACKNOWLEDGMENTS

This work was supported by the David and Lucile Packard Foundation (contract 99-1471) and the National Science Foundation (grant CCR-0122419). Special thanks to Ming-fai Fong and Derrick Tan for their help with designing and building the experimental platform.

REFERENCES

[1] J. Morimoto and C. Atkeson, "Minimax differential dynamic programming: An application to robust biped walking," Neural Information Processing Systems, 2002.
[2] W. T. Miller, III, "Real-time neural network control of a biped walking robot," IEEE Control Systems Magazine, vol. 14, no. 1, pp. 41–48, Feb 1994.
[3] H. Benbrahim and J. A. Franklin, "Biped dynamic walking using reinforcement learning," Robotics and Autonomous Systems, vol. 22, pp. 283–302, 1997.
[4] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," IEEE International Conference on Robotics and Automation, 2004.
[5] T. McGeer, "Passive dynamic walking," International Journal of Robotics Research, vol. 9, no. 2, pp. 62–82, April 1990.
[6] M. J. Coleman and A. Ruina, "An uncontrolled toy that can walk but cannot stand still," Physical Review Letters, vol. 80, no. 16, pp. 3658–3661, April 1998.
[7] H. Kimura and S. Kobayashi, "An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions," International Conference on Machine Learning (ICML '98), 1998, pp. 278–286.
[8] J. Baxter and P. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
[9] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, 1999.
[10] J. E. Wilson, "Walking toy," United States Patent Office, Tech. Rep., October 15, 1936.
[11] R. Tedrake, T. W. Zhang, M. Fong, and H. S. Seung, "Actuating a simple 3D passive dynamic walker," IEEE International Conference on Robotics and Automation, 2004.
[12] R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[13] S. H. Collins, M. Wisse, and A. Ruina, "A three-dimensional passive-dynamic walking robot with two legs and knees," International Journal of Robotics Research, vol. 20, no. 7, pp. 607–615, July 2001.
A Simple Reinforcement Learning Algorithm For Biped Walking

Jun Morimoto, Gordon Cheng, Christopher G. Atkeson, and Garth Zeglin

Department of Humanoid Robotics and Computational Neuroscience, ATR Computational Neuroscience Labs
xmorimo@atr.co.jp, gordon@atr.co.jp, http://www.cns.atr.co.jp/hrcn

The Robotics Institute, Carnegie Mellon University
cga@cs.cmu.edu, garthz@ri.cmu.edu, http://www.ri.cmu.edu
Abstract— We propose a model-based reinforcement learning algorithm for biped walking in which the robot learns to appropriately place the swing leg. This decision is based on a learned model of the Poincare map of the periodic walking pattern. The model maps from a state at the middle of a step and foot placement to a state at the next middle of a step. We also modify the desired walking cycle frequency based on online measurements. We present simulation results, and are currently implementing this approach on an actual biped robot.

I. INTRODUCTION

Our long-term goal is to understand how humans learn biped locomotion and adapt their locomotion pattern. In this paper, we propose and explore the feasibility of a candidate learning algorithm for biped walking. Our algorithm has two elements, learning appropriate foot placement, and estimating appropriate walking cycle timing. We are using model-based reinforcement learning, where we learn a model of a Poincare map and then choose control actions based on a computed value function. Alternative approaches applying reinforcement learning to biped locomotion include [1], [13], [2].

An important issue in applying our approach is matching the desired walking cycle timing to the natural dynamics of the biped. In this study, we use phase oscillators to estimate appropriate walking cycle timing [19], [14], [15].

To evaluate our proposed method, we use simulated 3 link and 5 link biped robots (Figs. 1 and 2). Physical parameters of the 3 link simulated robot are in Table I. Physical parameters of the 5 link simulated robot in Table II are selected to model an actual biped robot fixed to a boom that keeps the robot in the sagittal plane (Fig. 2). Our bipeds have a short torso and point or round feet without ankle joints. For these bipeds, controlling biped walking trajectories with the popular ZMP approach [20], [8], [22], [12] is difficult or not possible, and thus an alternative method for controller design must be used.

In section II-A, we introduce an estimation method of natural biped walking timing by using the measured walking period and an adaptive phase resetting method. In section III, we introduce our reinforcement learning method for biped walking. The robot learns appropriate foot placement through trial and error. In section IV-B, we propose using the estimation method for natural biped walking timing to assist the learned controller. In section IV-C, we analyze the stability of the learned controller.

Fig. 1. Three link robot model

Fig. 2. Five link biped robot

TABLE I
PHYSICAL PARAMETERS OF THE THREE LINK ROBOT MODEL

                       trunk    leg
    mass [kg]          2.0      0.8
    length [m]         0.01     0.4
    inertia [kg·m²]    0.0001   0.01

II. ESTIMATION OF NATURAL BIPED WALKING TIMING

In order for our foot placement algorithm to place the foot at the appropriate time, we must estimate the natural biped walking period, or equivalently, frequency. This timing changes, for example, when walking down slopes. Our goal is to adapt the walking cycle timing to the dynamics of the robot and environment.
TABLE II
PHYSICAL PARAMETERS OF THE FIVE LINK ROBOT MODEL

                               trunk   thigh   shin
    mass [kg]                  2.0     0.64    0.15
    length [m]                 0.01    0.2     0.2
    inertia (×10⁻⁴ [kg·m²])    1.0     6.9     1.4

A. Estimation method

We derive the target walking frequency ω* from the walking period T, which is measured from an actual half-cycle period (one foot fall to another):

    ω* = π / T.    (1)

The update rule for the walking frequency is

    ω̂_{n+1} = ω̂_n + K_ω (ω* − ω̂_n),    (2)

where K_ω is the frequency adaptation gain, and ω̂_n is the estimated frequency after n steps. An interesting feature of this method is that the simple averaging (low-pass filtering) method (Eq. 2) can estimate appropriate timing of the walking cycle for the given robot dynamics. This method was also adopted in [14], [15].

Several studies suggest that phase resetting is effective to match walking cycle timing to the natural dynamics of the biped [19], [24], [14], [15]. Here we propose an adaptive phase resetting method. Phase φ is reset when the swing leg touches the ground:

    φ̄ ← φ̄ + K_φ (φ − φ̄)    (3)
    φ ← φ̄,    (4)

where φ̄ is the average phase, and K_φ is the phase adaptation gain.

B. A simple example of timing estimation

We use the simulated three link biped robot (Fig. 1) to demonstrate the timing estimation method. A target biped walking trajectory is generated using sinusoidal functions with amplitude a = 10°, and a simple controller is designed to follow the target trajectories for each leg, where τ_l denotes the left hip torque, τ_r denotes the right hip torque, k = 5.0 is a position gain, b = 0.1 is a velocity gain, and θ_l and θ_r are left and right hip joint angles. Estimated phase φ is given by φ = ω̂_n t, where t is the current time.

For comparison, we apply this controller to the simulated robot without using the timing estimation method, so ω̂ is fixed and φ increases linearly with time (the walking period was set to T = 0.63 sec and the frequency to ω̂ = 10 rad/sec). The initial average phase was set to φ̄ = 1.0 for the right leg and φ̄ = π + 1.0 for the left leg, the frequency adaptation gain was set to K_ω = 0.3, and the phase adaptation gain was set to K_φ = 0.3.

With an initial condition which has a body velocity of 0.2 m/s, the simulated 3 link robot walked stably on a 1.0° downward slope (Fig. 3 (Top)). However, the robot could not walk stably on a 4.0° downward slope (Fig. 3 (Bottom)). When we used the online estimate of ω̂ and the adaptive phase resetting method, the robot walked stably on the two test slopes: 1.0° downward slope (Fig. 4 (Top)) and 4.0° downward slope (Fig. 4 (Bottom)). In Figure 5, we show the estimated walking frequency.

Fig. 3. Biped walking pattern without timing adaptation: (Top) 1.0° downward slope, (Bottom) 4.0° downward slope

Fig. 4. Biped walking pattern with timing adaptation: (Top) 1.0° downward slope, (Bottom) 4.0° downward slope

Fig. 5. Estimated walking frequency
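The timing-adaptation rules of Eqs. (1)-(4) amount to a few lines of code. A minimal sketch, using the gains of the example above; the touchdown-detection interface is an assumption:

```python
import math

class WalkingTimer:
    """Frequency estimate of Eq. (2) plus adaptive phase resetting, Eqs. (3)-(4)."""

    def __init__(self, omega_hat=10.0, phi_bar=1.0, k_omega=0.3, k_phi=0.3):
        self.omega_hat = omega_hat     # estimated walking frequency [rad/sec]
        self.phi_bar = phi_bar         # average phase at touchdown
        self.k_omega = k_omega         # frequency adaptation gain K_omega
        self.k_phi = k_phi             # phase adaptation gain K_phi
        self.phi = phi_bar             # current phase

    def on_touchdown(self, half_cycle_period, phase_at_touchdown):
        omega_star = math.pi / half_cycle_period                          # Eq. (1)
        self.omega_hat += self.k_omega * (omega_star - self.omega_hat)    # Eq. (2)
        self.phi_bar += self.k_phi * (phase_at_touchdown - self.phi_bar)  # Eq. (3)
        self.phi = self.phi_bar                                           # Eq. (4)

    def advance(self, dt):
        self.phi += self.omega_hat * dt    # phase advances at the estimated frequency
        return self.phi
```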
III. MODEL-BASED REINFORCEMENT LEARNING FOR BIPED LOCOMOTION

To walk stably we need to control the placement as well as the timing of the next step. Here, we propose a learning method to acquire a stabilizing controller.
A. Model-based reinforcement learning

We use a model-based reinforcement learning framework [4], [17]. Reinforcement learning requires a source of reward. We learn a Poincare map of the effect of foot placement, and then learn a corresponding value function for states at phase φ = π/2 and φ = 3π/2 (Fig. 6), where we define phase φ = 0 as the right foot touchdown.

1) Learning the Poincare map of biped walking: We learn a model that predicts the state of the biped a half cycle ahead, based on the current state and the foot placement at touch down. We are predicting the location of the system in a Poincare section at phase φ = 3π/2 based on the system's location in a Poincare section at phase φ = π/2. We use the same model to predict the location at phase φ = π/2 based on the location at phase φ = 3π/2 (Fig. 6):

    x̂_{3π/2} = f̂(x_{π/2}, u_{π/2}; w_m).    (7)

TABLE III
TARGET POSTURES AT EACH PHASE φ. θ_act IS PROVIDED BY THE OUTPUT OF THE CURRENT POLICY. THE UNIT FOR NUMBERS IN THIS TABLE IS DEGREES.

                right hip   right knee   left hip   left knee
    φ = 0         −10.0       θ_act        10.0        0.0
    φ = 0.5π                               θ_act       60.0
    φ = 0.7π                               10.0       −10.0
    φ = π          10.0        0.0        −10.0       θ_act
    φ = 1.5π       θ_act      60.0
    φ = 1.7π      −10.0       10.0
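Learning the Poincare map of Eq. (7) reduces to regression on (state, action, next-state) triples collected at the two sections. A sketch, using a linear-in-features least squares fit; the quadratic feature choice is an illustrative assumption, not the paper's approximator:

```python
import numpy as np

def features(x, u):
    z = np.concatenate([x, np.atleast_1d(u)])
    return np.concatenate([[1.0], z, z ** 2])   # bias, linear, quadratic terms

def fit_poincare_map(states, actions, next_states):
    """states[k]: state at one section; next_states[k]: state a half cycle later."""
    Phi = np.stack([features(x, u) for x, u in zip(states, actions)])
    W_m, *_ = np.linalg.lstsq(Phi, np.stack(next_states), rcond=None)
    return W_m                                  # model parameters w_m of Eq. (7)

def predict(W_m, x, u):
    return features(x, u) @ W_m                 # predicted state a half cycle ahead
```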
[Plots of actual and target joint-angle trajectories, 0-6 sec]
C. Stability analysis of the acquired policy

We analyzed the stability of the acquired policy in terms of the Poincare map, mapping from a Poincare section at phase φ = π/2 to the section at phase φ = 3π/2.

Fig. 11. Shape of acquired value function (axes: d position [m], d velocity [m/sec])

Fig. 13. Averaged eigenvalue of Jacobian matrix at each trial
Reinforcement Learning: An Introduction
by Sutton, R.S. and Barto, A.G., MIT Press (1998). £31.95 (xi + 322 pages) ISBN 0 262 19398 1
Reinforcement is a term with different meanings for different people. In the psychological lexicon, it conjures up the ominous images of Pavlov, Thorndike, Skinner and their intellectual brethren. Although these behaviorists helped to operationalize experimental psychology, their insistence on the non-existence of internal mental states provided a roadblock to modern cognitive science. In fact, the rise of cognitive science since the 1950s could be viewed as a rejection of the stultifying behaviorist views that declared the mind to be a vacuous construct. With such a salvo on behaviorists, what is a review of a book on reinforcement learning doing in Trends in Cognitive Sciences? The short answer is that reinforcement, in the context of the new book by Sutton and Barto, is not what it seems. 'Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal', according to the introduction of the book.

The primary aim here is to cast learning as a problem involving agents that interact with an environment, sense their state and the state of the environment, and choose actions based on these interactions (which sounds very much like a bug or a rat moving about in some territory in search of food or mates). The twist in reinforcement learning is that the agent comes pre-equipped with goals that it seeks to satisfy. These goals are embodied in the influence of a 'numerical reward signal' on the way that the agent chooses actions, categorizes its sensations and changes its internal model of the environment. Despite the obvious connection of these terms to behavioral psychology, some of the more impressive applications of reinforcement learning have been in computer science and engineering applications. For example, Tesauro's TD-gammon, a reinforcement-learning system, is now one of the best backgammon players in the world¹.

Reinforcement learning typically divides a problem into four parts: (1) a policy; (2) a reward function; (3) a value function; and (4) an internal model of the environment. In this context, a policy is similar to an association in psychological terms; it maps states to actions (behavioral choices). One interesting part of reinforcement-learning problems is the complementary concepts of goals and evaluation. A reward function provides a numerical evaluation of a state, and therefore embodies the agent's definition of what is immediately 'good' and what is immediately 'bad'. By contrast, value functions evaluate a state in terms of the total amount of reward an agent can expect from that state into the distant future; that is, they represent long-term evaluations. In this sense, value functions represent something more akin to judgements on the likely payoffs that will follow the current state.

Some of the most exciting work in reinforcement learning has taken place in the past 10 years with the discovery of several mathematical connections between separate methods for solving reinforcement-learning problems. These connections showed that apparently disparate mathematical techniques for solving reinforcement-learning problems were related in fundamental ways. This book provides the best historical details of these mathematical connections found anywhere, and frames clearly the ideas underlying this history.

What is the direct conceptual payoff of reinforcement learning for cognitive science? The descriptions so far show that reinforcement-learning problems could arise in a number of settings. Why should we expect this framework to enrich our understanding of cognition or the connection of the brain to cognition? I think that the direct benefit is twofold. The first benefit is that the lexicon of reinforcement learning is appropriate for describing the problems faced by mobile creatures in a complex, stochastic environment, in which the evaluation of a sequence of decisions might be significantly delayed. Consonant with the appropriateness of the lexicon, a number of modern efforts have successfully used reinforcement learning to describe biological systems related to motor learning in the cerebellum, and reward learning by dopaminergic systems²,³.

The second benefit is the emphasis that reinforcement learning places on representation. This emphasis emerges from the two serious complaints about reinforcement learning as a framework for artificial intelligence or models of brain function: (1) speed, and (2) the size of the state space⁴. For even modest problems, the state space can be huge (e.g. for backgammon, the state space is ~10²⁰ states). If any sizeable fraction of this state space must be explored for a reinforcement-learning system to converge to an answer, then one might have to wait an unacceptably long time for a suitable answer to emerge. These problems were a likely source of discouragement for early work in reinforcement learning. However, more modern work has shown that if careful consideration is given to the representations of states or actions, then reinforcement-learning systems can be a powerful way of learning certain problems.

The present book is an excellent entry point for someone who wants to understand intuitively the ideas of reinforcement learning and the general connection between its parts. It is not, however, a mathematical 'how-to' book, replete with proofs and pointers to unsolved problems in the field (as are, for example, Refs 3,5).

The end of each chapter contains a scholarly set of biographical and historical notes. These sections are particularly pleasing because they provide an easy-to-read review of the history of papers and ideas that contributed to the chapter in question. The authors go above and beyond the call of duty in these sections by providing their own perspective on how and why subfields developed in particular ways. Their effort is useful because this kind of perspective is very difficult to come by, yet it often provides conceptual insights by demonstrating which paths of investigation resulted from historical accident or the prevailing biases of the day. Furthermore, these sections are accessible to the casual peruser as well as the serious student seeking a historical record of publication on the subject.

Anyone interested in the internal representation of goals should read this book. In particular, the success of TD-gammon, and the connection of reinforcement-learning algorithms to the function of identified neural systems, suggests that reinforcement learning might have a lot more yet to say about cognition. That possibility awaits future evaluation.

P. Read Montague
Division of Neuroscience, Baylor College of Medicine, Houston, TX 77030, USA.
tel: +1 713 798 3134
fax: +1 713 798 3130
e-mail: read@bohr.neusc.bcm.tmc.edu

References
1 Tesauro, G.J. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219
2 Schultz, W., Dayan, P. and Montague, P.R. (1997) A neural substrate of prediction and reward. Science 275, 1593–1599
3 Bertsekas, D.P. and Tsitsiklis, J.N. (1996) Neuro-Dynamic Programming, Athena Scientific
4 Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285
5 Bellman, R. (1957) Dynamic Programming, Princeton University Press
Simple random search of static linear policies is
competitive for reinforcement learning
Abstract
Model-free reinforcement learning aims to offer off-the-shelf solutions for con-
trolling dynamical systems without requiring models of the system dynamics. We
introduce a model-free random search algorithm for training static, linear policies
for continuous control problems. Common evaluation methodology shows that our
method matches state-of-the-art sample efficiency on the benchmark MuJoCo loco-
motion tasks. Nonetheless, more rigorous evaluation reveals that the assessment
of performance on these benchmarks is optimistic. We evaluate the performance
of our method over hundreds of random seeds and many different hyperparameter
configurations for each benchmark task. This extensive evaluation is possible
because of the small computational footprint of our method. Our simulations
highlight a high variability in performance in these benchmark tasks, indicating
that commonly used estimations of sample efficiency do not adequately evaluate
the performance of RL algorithms. Our results stress the need for new baselines,
benchmarks and evaluation methodology for RL algorithms.
1 Introduction
Model-free reinforcement learning (RL) aims to offer off-the-shelf solutions for controlling dynamical
systems without requiring models of the system dynamics. Such methods have successfully produced
RL agents that surpass human players in video games and games such as Go [16, 28]. Although
these results are impressive, model-free methods have not yet been successfully deployed to control
physical systems, outside of research demos. There are several factors prohibiting the adoption of
model-free RL methods for controlling physical systems: the methods require too much data to
achieve reasonable performance, the ever-increasing assortment of RL methods makes it difficult to
choose what is the best method for a specific task, and many candidate algorithms are difficult to
implement and deploy [11].
Unfortunately, the current trend in RL research has put these impediments at odds with each other.
In the quest to find methods that are sample efficient (i.e. methods that need little data) the general
trend has been to develop increasingly complicated methods. This increasing complexity has led to a
reproducibility crisis. Recent studies demonstrate that many RL methods are not robust to changes in
hyperparameters, random seeds, or even different implementations of the same algorithm [11, 12].
Algorithms with such fragilities cannot be integrated into mission critical control systems without
significant simplification and robustification.
Furthermore, it is common practice to evaluate and compare new RL methods by applying them to
video games or simulated continuous control problems and measure their performance over a small
number of independent trials (i.e., fewer than ten random seeds) [8–10, 14, 17, 19, 21–27, 31, 32].
The most popular continuous control benchmarks are the MuJoCo locomotion tasks [3, 29], with
the Humanoid model being considered “one of the most challenging continuous control problems
solvable by state-of-the-art RL techniques [23].” In principle, one can use video games and simulated
control problems for beta testing new ideas, but simple baselines should be established and thoroughly
evaluated before moving towards more complex solutions.
To this end, we aim to determine the simplest model-free RL method that can solve standard
benchmarks. Recently, two different directions have been proposed for simplifying RL. Salimans
et al. [23] introduced a derivative-free policy optimization method, called Evolution Strategies. The
authors showed that, for several RL tasks, their method can easily be parallelized to train policies
faster than other methods. While the method of Salimans et al. [23] is simpler than previously
proposed methods, it employs several complicated algorithmic elements, which we discuss at the end
of Section 3. As a second simplification to model-free RL, Rajeswaran et al. [22] have shown that
linear policies can be trained via natural policy gradients to obtain competitive performance on the
MuJoCo locomotion tasks, showing that complicated neural network policies are not needed to solve
these continuous control problems. In this work, we combine ideas from the work of Salimans et al.
[23] and Rajeswaran et al. [22] to obtain the simplest model-free RL method yet, a derivative-free
optimization algorithm for training static, linear policies. We demonstrate that a simple random
search method can match or exceed state-of-the-art sample efficiency on the MuJoCo locomotion
tasks, included in the OpenAI Gym.
Henderson et al. [11] and Islam et al. [12] pointed out that standard evaluation methodology does not accurately capture the performance of RL methods by showing that existing RL algorithms exhibit high sensitivity to both the choice of random seed and the choice of hyperparameters. We show similar limitations of common evaluation methodology through a different lens. We exhibit a simple derivative-free optimization algorithm, Augmented Random Search (ARS), which matches or surpasses the performance of more complex methods when using the same evaluation methodology. However, a more thorough evaluation of ARS reveals worse performance. Moreover, our method uses static linear policies and a simple local exploration scheme, which might be limiting for more difficult RL tasks. Therefore, better evaluation schemes are needed for determining the benefits of more complex RL methods. Our contributions are as follows:
2 Problem setup

Problems in reinforcement learning require finding policies for controlling dynamical systems that maximize an average reward. Such problems can be abstractly formulated as

    max_{θ ∈ ℝ^d} E_ξ[r(π_θ, ξ)],    (1)

where θ parametrizes a policy π_θ : ℝⁿ → ℝᵖ. The random variable ξ encodes the randomness of the environment, i.e., random initial states and stochastic transitions. The value r(π_θ, ξ) is the reward achieved by the policy π_θ on one trajectory generated from the system. In general one could use stochastic policies π_θ, but our proposed method uses deterministic policies.
Basic random search. Note that the problem formulation (1) aims to optimize reward by directly optimizing over the policy parameters θ. We consider methods which explore in the parameter space rather than the action space. This choice renders RL training equivalent to derivative-free optimization with noisy function evaluations. One of the simplest and oldest optimization methods for derivative-free optimization is random search [15].

A primitive form of random search, which we call basic random search (BRS), simply computes a finite difference approximation along the random direction and then takes a step along this direction without using a line search. Our method ARS, described in Section 3, is based on this simple strategy. For updating the parameters θ of a policy π_θ, BRS and ARS exploit update directions of the form:

    [r(π_{θ+νδ}, ξ₁) − r(π_{θ−νδ}, ξ₂)] / ν · δ,    (2)

for two i.i.d. random variables ξ₁ and ξ₂, ν a positive real number, and δ a zero mean Gaussian vector. It is known that such an update increment is an unbiased estimator of the gradient with respect to θ of E_δ E_ξ[r(π_{θ+νδ}, ξ)], a smoothed version of the objective (1) which is close to the original objective when ν is small [20]. When the function evaluations are noisy, minibatches can be used to reduce the variance in this gradient estimate. Evolution Strategies is a version of this algorithm with several complicated algorithmic enhancements [23]. Another version of this algorithm is called Bandit Gradient Descent by Flaxman et al. [6]. The convergence of random search methods for derivative free optimization has been understood for several types of convex optimization [1, 2, 13, 20]. Jamieson et al. [13] offer an information theoretic lower bound for derivative free convex optimization and show that a coordinate based random search method achieves the lower bound with nearly optimal dependence on the dimension.

The rewards r(π_{θ+νδ}, ξ₁) and r(π_{θ−νδ}, ξ₂) in Eq. (2) are obtained by collecting two trajectories from the dynamical system of interest, according to the policies π_{θ+νδ} and π_{θ−νδ}, respectively. The random variables ξ₁, ξ₂, and δ are mutually independent, and independent from previous trajectories. One trajectory is called an episode or a rollout. The goal of RL algorithms is to approximately solve problem (1) by using as few rollouts from the dynamical system as possible.
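A minimal sketch of one BRS iteration built on the update direction of Eq. (2); `rollout_reward` stands in for collecting one episode reward r(π_θ, ξ) from the environment:

```python
import numpy as np

def brs_step(theta, rollout_reward, nu=0.02, alpha=0.01, num_directions=8):
    """One basic random search step: average Eq. (2) over a minibatch
    of random directions and ascend the estimated gradient."""
    grad_est = np.zeros_like(theta)
    for _ in range(num_directions):
        delta = np.random.randn(*theta.shape)     # zero-mean Gaussian direction
        r_plus = rollout_reward(theta + nu * delta)
        r_minus = rollout_reward(theta - nu * delta)
        grad_est += (r_plus - r_minus) / nu * delta
    return theta + alpha * grad_est / num_directions
```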
Scaling by the standard deviation σ_R. As the training of policies progresses, random search in the parameter space of policies can lead to large variations in the rewards observed across iterations. As a result, it is difficult to choose a fixed step-size α which does not allow harmful variations in the size of the update steps. Salimans et al. [23] address this issue by transforming the rewards into rankings and then using the adaptive optimization algorithm Adam for computing the update step.
Algorithm 1 Augmented Random Search (ARS): four versions V1, V1-t, V2 and V2-t

1: Hyperparameters: step-size α, number of directions sampled per iteration N, standard deviation of the exploration noise ν, number of top-performing directions to use b (b < N is allowed only for V1-t and V2-t)
2: Initialize: M_0 = 0 ∈ ℝ^{p×n}, μ_0 = 0 ∈ ℝⁿ, and Σ_0 = I_n ∈ ℝ^{n×n}, j = 0.
3: while ending condition not satisfied do
4:   Sample δ_1, δ_2, ..., δ_N in ℝ^{p×n} with i.i.d. standard normal entries.
5:   Collect 2N rollouts of horizon H and their corresponding rewards using the 2N policies
       V1:  π_{j,k,+}(x) = (M_j + ν δ_k) x
            π_{j,k,−}(x) = (M_j − ν δ_k) x
       V2:  π_{j,k,+}(x) = (M_j + ν δ_k) diag(Σ_j)^{−1/2} (x − μ_j)
            π_{j,k,−}(x) = (M_j − ν δ_k) diag(Σ_j)^{−1/2} (x − μ_j)
     for k ∈ {1, 2, ..., N}.
6:   V1-t, V2-t: Sort the directions δ_k by max{r(π_{j,k,+}), r(π_{j,k,−})}, denote by δ_{(k)} the k-th largest direction, and by π_{j,(k),+} and π_{j,(k),−} the corresponding policies.
7:   Make the update step:
       M_{j+1} = M_j + (α / (b σ_R)) Σ_{k=1}^{b} [r(π_{j,(k),+}) − r(π_{j,(k),−})] δ_{(k)},
     where σ_R is the standard deviation of the 2b rewards used in the update step.
8:   V2: Set μ_{j+1}, Σ_{j+1} to be the mean and covariance of the 2NH(j + 1) states encountered from the start of training.¹
9:   j ← j + 1
10: end while
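Condensed into Python, version V1-t of Algorithm 1 looks roughly as follows; a sketch only (the state-normalization bookkeeping of V2, steps 2 and 8, is omitted, and `rollout` stands in for one episode under a linear policy x ↦ Mx):

```python
import numpy as np

def ars_v1t(rollout, n, p, alpha=0.01, nu=0.02, N=8, b=4, num_iters=100):
    M = np.zeros((p, n))                                       # step 2
    for _ in range(num_iters):                                 # step 3
        deltas = [np.random.randn(p, n) for _ in range(N)]     # step 4
        rewards = [(rollout(M + nu * d), rollout(M - nu * d))  # step 5
                   for d in deltas]
        # Step 6: keep the b directions with the largest max(r+, r-).
        top = sorted(range(N), key=lambda k: -max(rewards[k]))[:b]
        used = [(rewards[k][0], rewards[k][1], deltas[k]) for k in top]
        sigma_r = np.std([r for rp, rm, _ in used for r in (rp, rm)])
        # Step 7: update scaled by the standard deviation of the 2b rewards.
        step = sum((rp - rm) * d for rp, rm, d in used)
        M = M + alpha / (b * sigma_r) * step
    return M
```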
Both of these techniques change the direction of the updates, obfuscating the behavior of the algorithm and making it difficult to ascertain the objective Evolution Strategies is actually optimizing. Instead, to address the large variations of the differences r(π_{M+νδ}) − r(π_{M−νδ}), we scale the update steps by the standard deviation σ_R of the 2N rewards collected at each iteration (see Line 7 of Algorithm 1).

While training a policy for Humanoid-v1, we observed that the standard deviations σ_R have an increasing trend; see Figure 2 in Appendix A.2. This behavior occurs because perturbations of the policy weights at high rewards can cause Humanoid-v1 to fall early, yielding large variations in the rewards collected. Without scaling the update steps by σ_R, eventually random search would take update steps which are a thousand times larger than in the beginning of training. Therefore, σ_R adapts the step sizes according to the local sensitivity of the rewards to perturbations of the policy parameters. The same training performance could probably be obtained by tuning a step size schedule. However, one of our goals was to minimize the amount of tuning required.
Normalization of the states. The normalization of states used by ARS V2 is akin to data whitening for regression tasks. Intuitively, it ensures that policies put equal weight on the different components of the states. To see why this might help, suppose that a state coordinate only takes values in the range [90, 100] while another state component takes values in the range [−1, 1]. Then, small changes in the control gain with respect to the first state coordinate would lead to larger changes in the actions than the same sized changes with respect to the second state component. Hence, state normalization allows different state components to have equal influence during training.

Previous work has also implemented such state normalization for fitting a neural network model for several MuJoCo environments [19]. A similar normalization is used by ES as part of the virtual batch normalization of the neural network policies [23]. In the case of ARS, the state normalization can be seen as a form of non-isotropic exploration in the parameter space of linear policies.

¹Of course, we implement this in an efficient way that does not require the storage of all the states. Also, we only keep track of the diagonal of Σ_{j+1}. Finally, to ensure that the ratio 0/0 is treated as 0, if a diagonal entry of Σ_j is smaller than 10⁻⁸ we make it equal to +∞.

The main empirical motivation for ARS V2 comes from the Humanoid-v1 task. We were not able to train a linear policy for this task without the normalization of the states described in Algorithm 1. Moreover, ARS V2 performs better than ARS V1 on other MuJoCo tasks as well, as shown in Section 4. However, the usefulness of state normalization is likely to be problem specific.
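A sketch of the online statistics behind the V2 normalization, using Welford's streaming update for the mean and diagonal variance; the 10⁻⁸ floor follows the footnote above:

```python
import numpy as np

class RunningStat:
    """Streaming mean and diagonal variance of observed states (Welford)."""

    def __init__(self, n_dims):
        self.count = 0
        self.mean = np.zeros(n_dims)
        self.m2 = np.zeros(n_dims)         # running sum of squared deviations

    def push(self, x):
        self.count += 1
        d = x - self.mean
        self.mean += d / self.count
        self.m2 += d * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        # Entries with variance below 1e-8 are treated as infinite, so the
        # corresponding normalized coordinate becomes 0 (the 0/0 := 0 rule).
        std = np.sqrt(np.where(var < 1e-8, np.inf, var))
        return (x - self.mean) / std
```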
Using top performing directions. To further improve the performance of ARS on the MuJoCo locomotion tasks, we propose ARS V1-t and V2-t. In the update steps used by ARS V1 and V2, each perturbation direction δ_k is weighted by the difference of the rewards r(π_{j,k,+}) and r(π_{j,k,−}). If r(π_{j,k,+}) > r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of δ_k. If r(π_{j,k,+}) < r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of −δ_k. However, since r(π_{j,k,+}) and r(π_{j,k,−}) are noisy evaluations of the performance of the policies parametrized by M_j + νδ_k and M_j − νδ_k, ARS V1 and V2 might push the weights M_j in the direction δ_k even when −δ_k is better, or vice versa. Moreover, there can be perturbation directions δ_k such that updating the policy weights M_j in either the direction δ_k or −δ_k would lead to sub-optimal performance. To address these issues, ARS V1-t and V2-t sort the perturbation directions δ_k in decreasing order of max{r(π_{j,k,+}), r(π_{j,k,−})}, and then use only the top b directions for updating the policy weights; see Line 7 of Algorithm 1.

This algorithmic enhancement intuitively improves the performance of ARS because it ensures that the update steps are an average over directions that obtained high rewards. However, without theoretical investigation we cannot be certain of the effect of using this algorithmic enhancement, i.e., choosing b < N. When b = N, versions V1-t and V2-t are equivalent to V1 and V2. Therefore, it is certain that after tuning ARS V1-t and V2-t, they will not perform any worse than ARS V1 and V2.
Comparison to Salimans et al. [23]. ARS simplifies Evolution Strategies in several ways. First, ES feeds the gradient estimate into the Adam algorithm. Second, instead of using the actual reward values r(θ ± ϵ_i), ES transforms the rewards into rankings and uses the ranks to compute update steps. The rankings are used to make training more robust. Instead, our method scales the update steps by the standard deviation of the rewards. Third, ES bins the action space of Swimmer-v1 and Hopper-v1 to encourage exploration. Our method surpasses ES without such binning. Fourth, ES relies on policies parametrized by neural networks with virtual batch normalization, while we show that ARS achieves state-of-the-art performance with linear policies.
Three random seeds evaluation: We compare the different versions of ARS to the following
methods: Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG),
Natural Gradients (NG), Evolution Strategies (ES), Proximal Policy Optimization (PPO), Soft
Actor Critic (SAC), Soft Q-Learning (SQL), A2C, and the Cross Entropy Method (CEM). For the
performance of these methods we used values reported by Rajeswaran et al. [22], Salimans et al. [23],
Schulman et al. [26], and Haarnoja et al. [9]. In light of well-documented reproducibility issues of
reinforcement learning methods [11, 12], reporting the values listed in papers rather than rerunning
these algorithms casts prior work in the most favorable light possible.
Rajeswaran et al. [22] and Schulman et al. [26] evaluated the performance of RL algorithms on three
random seeds, while Salimans et al. [23] and Haarnoja et al. [9] used six and five random seeds
respectively. To put all methods on equal footing, for the evaluation of ARS, we sampled three
random seeds uniformly from the interval [0, 1000) and fixed them. For each of the six OpenAI Gym
MuJoCo locomotion tasks we chose a grid of hyperparameters², shown in Appendix A.6, and for
each set of hyperparameters we ran ARS V1, V2, V1-t, and V2-t three times, once for each of the
three fixed random seeds.
Table 1 shows the average number of episodes required by ARS, NG, and TRPO to reach a prescribed
reward threshold, using the values reported by Rajeswaran et al. [22] for NG and TRPO. For each
version of ARS and each MuJoCo task we chose the hyperparameters which minimize the average
number of episodes required to reach the reward threshold. The corresponding training curves of
ARS are shown in Figure 3 of Appendix A.2. For all MuJoCo tasks, except Humanoid-v1, we used
the same reward thresholds as Rajeswaran et al. [22]. Our choice to increase the reward threshold for
Humanoid-v1 is motivated by the presence of the survival bonuses, as discussed in Appendix A.1.
Table 1 shows that ARS V1 can train policies for all tasks except Humanoid-v1, which is successfully
solved by ARS V2. Secondly, we note that ARS V2 reaches the prescribed thresholds for Swimmer-v1, Hopper-v1, and HalfCheetah-v1 faster than NG or TRPO, and matches the performance of NG
on the Humanoid-v1. On Walker2d-v1 and Ant-v1, ARS V2 is outperformed by NG. Nonetheless,
ARS V2-t surpasses the performance of NG on these two tasks. Although TRPO hits the reward
threshold for Walker2d-v1 faster than ARS, our method either matches or surpasses TRPO in the
metrics reported by Haarnoja et al. [9] and Schulman et al. [26].
Precise comparisons to more RL methods are provided in Appendix A.2. Here we offer a summary.
Salimans et al. [23] reported the average number of episodes required by ES to reach a prescribed
reward threshold, on four of the locomotion tasks. ARS surpassed ES on all of those tasks. Haarnoja
et al. [9] reported the maximum reward achieved by SAC, DDPG, SQL, and TRPO after a prescribed
number of timesteps, on four of the locomotion tasks. With the exception of SAC on HalfCheetah-v1
and Ant-v1, ARS outperformed competing methods. Schulman et al. [26] reported the maximum
reward achieved by PPO, A2C, CEM, and TRPO after a prescribed number of timesteps, on four of
the locomotion tasks. With the exception of PPO on Walker2d-v1, ARS matched or surpassed the performance of competing methods.

²Recall that ARS V1 and V2 take in only three hyperparameters: the step-size α, the number of perturbation directions N, and the scale of the perturbations ν. ARS V1-t and V2-t take in an additional hyperparameter, the number of top directions used b (b ≤ N).

³N/A means that the method did not reach the reward threshold. ⁴UNK stands for unknown.
A hundred seeds evaluation: For a more thorough evaluation of ARS, we sampled 100 distinct
random seeds uniformly at random from the interval [0, 10000). Then, using the hyperparameters
selected for Table 1, we ran ARS for each of the six MuJoCo locomotion tasks and the 100 random
seeds. The results are shown in Figure 1. Such a thorough evaluation was feasible because ARS
has a small computational footprint. As discussed in Appendix A.3, ARS is at least 15 times more
computationally efficient on the MuJoCo benchmarks than competing methods.
Figure 1 shows that ARS trains successful policies on at least 70% of the random seeds for all the
MuJoCo locomotion tasks, with the exception of Walker2d-v1, for which it succeeds only 20% of the
time. Moreover, ARS succeeds at training policies a large fraction of the time while using a
competitive number of episodes.
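The protocol itself is short; a sketch follows, in which the helper train_ars (returning one training curve of average rewards per seed) is hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
# 100 distinct seeds sampled uniformly from [0, 10000).
seeds = rng.choice(10000, size=100, replace=False)

# train_ars(task, seed) -> 1-D array of average rewards per iteration (hypothetical helper).
curves = np.stack([train_ars("Walker2d-v1", int(s)) for s in seeds])

median = np.median(curves, axis=0)                       # the dotted lines in Figure 1
bands = np.percentile(curves, [0, 20, 80, 100], axis=0)  # shaded percentile regions
```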
[Figure 1: six panels plotting average reward against episodes, for Swimmer-v1, Hopper-v1, and HalfCheetah-v1 (top row) and Walker2d-v1, Ant-v1, and Humanoid-v1 (bottom row), with shaded bands grouped by percentile ranges over the 100 seeds.]
Figure 1: An evaluation of ARS over 100 random seeds on the MuJoCo locomotion tasks. The
dotted lines represent median rewards and the shaded regions represent percentiles. For Swimmer-v1
we used ARS V1. For Hopper-v1, Walker2d-v1, and Ant-v1 we used ARS V2-t. For HalfCheetah-v1
and Humanoid-v1 we used ARS V2.
Figure 1 represents two types of random seeds on which ARS does not reach high rewards: seeds on
which ARS eventually finds high-reward policies once sufficiently many iterations are performed,
and seeds which lead ARS to discover locally optimal behaviors. For the Humanoid model, ARS
found numerous distinct gaits, including ones
during which the Humanoid hops only on one leg, walks backwards, or moves in a swirling motion.
Such gaits were found by ARS on the random seeds which cause slower training. While multiple
gaits for Humanoid models have been previously observed [10], our evaluation better emphasizes
their prevalence. The presence of local optima is inherent to non-convex optimization, and our results
show that RL algorithms should be evaluated on many random seeds to determine how frequently
local optima are found. Finally, we remark that ARS is least sensitive to the choice of random
seed on HalfCheetah-v1, a task often used to evaluate the sensitivity of algorithms to the choice of
random seed.
Linear policies are sufficiently expressive for MuJoCo: We discussed above how linear policies can
produce diverse gaits for the MuJoCo models, demonstrating that they are expressive enough to
capture a wide range of behaviors. Table 2 shows that linear policies can also achieve high rewards on all the MuJoCo
locomotion tasks. In particular, for Humanoid-v1 and Walker2d-v1, ARS found policies that achieve
significantly higher rewards than any other results we encountered in the literature. These results
show that linear policies are perfectly adequate for the MuJoCo locomotion tasks, reducing the need
for more expressive and more computationally expensive policies.
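For concreteness, such a policy is a single matrix applied to a normalized state; the sketch below, a V2-style linear policy with a Welford-style online normalizer, is our illustration rather than the authors' code.

```python
import numpy as np

class LinearPolicy:
    """pi(s) = M @ normalize(s), with running estimates of the state mean and variance."""

    def __init__(self, state_dim, action_dim):
        self.M = np.zeros((action_dim, state_dim))
        self.n = 0
        self.mean = np.zeros(state_dim)
        self.m2 = np.zeros(state_dim)

    def observe(self, s):
        # Welford's online update of the running mean and (unnormalized) variance.
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def act(self, s):
        std = np.sqrt(self.m2 / max(self.n, 1)) + 1e-8
        return self.M @ ((s - self.mean) / std)
```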
Maximum reward achieved by ARS

Task              ARS      Task              ARS      Task              ARS
Swimmer-v1        365      HalfCheetah-v1    6722     Ant-v1            5146
Hopper-v1         3909     Walker2d-v1       11389    Humanoid-v1       11600

Table 2: Maximum average reward achieved by ARS, where we took the maximum over all sets of
hyperparameters considered and the three fixed random seeds.
5 Discussion
With a few algorithmic augmentations, basic random search of static, linear policies achieves state-
of-the-art sample efficiency on the MuJoCo locomotion tasks. Surprisingly, no special nonlinear
controllers are needed to match the performance recorded in the RL literature. Moreover, since
our algorithm and policies are simple, we were able to perform extensive sensitivity analysis. This
analysis brings us to the uncomfortable conclusion that the current evaluation methods adopted in the
deep RL community are insufficient for judging whether proposed methods are actually solving the
studied problems.
The choice of benchmark tasks and the small number of random seeds do not represent the only issues
of current evaluation methodology. Though many RL researchers are concerned about minimizing
sample complexity, it does not make sense to optimize the running time of an algorithm on a single
problem instance. The running time of an algorithm is only a meaningful notion if either (a) evaluated
on a family of problem instances, or (b) when clearly restricting the class of algorithms.
Common RL practice, however, does not follow either (a) or (b). Instead, researchers run an algorithm
A on a task T with a given hyperparameter configuration, and plot a “learning curve” showing that
the algorithm reaches a target reward after collecting X samples. Then the “sample complexity” of the
method is reported as the number of samples required to reach a target reward threshold, with the
given hyperparameter configuration. However, any number of hyperparameter configurations can
be tried, and any number of algorithmic enhancements can be added or discarded and then tested in
simulation. For a fair measurement of sample complexity, should we not count the number of rollouts
used for all tested hyperparameters?
Through optimal hyperparameter tuning one can artificially improve the perceived sample efficiency
of a method. Indeed, this is what we see in our work. By adding a third algorithmic enhancement to
basic random search (i.e., enhancing ARS V2 to V2-t), we are able to improve the sample efficiency of
an already highly performing method. Considering that most of the prior work in RL uses algorithms
with far more tunable parameters and neural nets whose architectures themselves are hyperparameters,
the significance of the reported sample complexities for those methods is not clear. This issue is
important because a meaningful sample complexity of an algorithm should inform us on the number
of samples required to solve a new, previously unseen task.
In light of these issues and of our empirical results, we make several suggestions for future work:
• Simple baselines should be established before moving forward to more complex benchmarks
and methods. We propose the Linear Quadratic Regulator as a reasonable testbed for RL
algorithms: LQR is well understood when the model is known, problem instances can be
easily generated with a variety of difficulty levels, and little overhead is required
for replication; see Appendix A.4 for more details and the sketch following this list.
• When games and physics simulators are used for evaluation, separate problem instances
should be used for tuning and evaluating RL methods. Moreover, large numbers of random
seeds should be used for statistically significant evaluations.
• Rather than trying to develop general purpose algorithms, it might be better to focus on
specific problems of interest and find targeted solutions.
• More emphasis should be put on the development of model-based methods. For many
problems, such methods have been observed to require fewer samples than model-free
methods. Moreover, the physics of the systems should inform the parametric classes of
models used for different problems. Model-based methods incur many computational
challenges themselves, and it is quite possible that tools from deep RL, such as improved
tree search, can provide new paths forward for tasks that require the navigation of complex
and uncertain environments.
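As an illustration of how cheaply LQR problem instances can be generated and evaluated (the dimensions, spectral-radius scaling, and horizon below are arbitrary choices, not a prescribed benchmark):

```python
import numpy as np

def random_lqr_instance(n=4, m=2, seed=0):
    """Sample a discrete-time LQR instance x' = A x + B u with quadratic costs."""
    rng = np.random.RandomState(seed)
    A = rng.randn(n, n)
    A *= 0.99 / max(abs(np.linalg.eigvals(A)))  # tune difficulty via the spectral radius
    B = rng.randn(n, m)
    Q, R = np.eye(n), np.eye(m)
    return A, B, Q, R

def policy_cost(K, A, B, Q, R, horizon=1000):
    """Average finite-horizon cost of the static state feedback u = -K x."""
    x, cost = np.ones(A.shape[0]), 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost / horizon
```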
Acknowledgments
We thank Orianna DeMasi, Moritz Hardt, Eric Jonas, Robert Nishihara, Rebecca Roelofs, Esther
Rolf, Vaishaal Shankar, Ludwig Schmidt, Nilesh Tripuraneni, and Stephen Tu for many helpful comments
and suggestions. HM thanks Robert Nishihara and Vaishaal Shankar for sharing their expertise
in parallel computing. As part of the RISE lab, HM is generally supported in part by NSF CISE
Expeditions Award CCF-1730628, DHS Award HSHQDC-16-3-00083, and gifts from Alibaba,
Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM,
Microsoft, Scotiabank, Splunk and VMware. BR is generously supported in part by NSF award
CCF-1359814, ONR awards N00014-14-1-0024 and N00014-17-1-2191, the DARPA Fundamental
Limits of Learning (Fun LoL) Program, and an Amazon AWS AI Research Award.
References
[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point
bandit feedback. Conference on Learning Theory, pages 28–40, 2010.
[2] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. Conference on Learning Theory,
2016.
[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI
Gym, 2016.
[4] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic
regulator. arXiv:1710.01688, 2017.
[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning
for continuous control. Proceedings of the International Conference on Machine Learning, pages 1329–1338, 2016.
[6] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient
descent without a gradient. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
[7] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient
with an off-policy critic. International Conference on Learning Representations, 2016.
[8] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies.
Proceedings of the International Conference on Machine Learning, 2017.
[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. arXiv:1801.01290, 2018.
[10] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller,
et al. Emergence of locomotion behaviours in rich environments. arXiv:1707.02286, 2017.
[11] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that
matters. arXiv:1709.06560, 2017.
[12] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup. Reproducibility of benchmarked deep reinforcement
learning tasks for continuous control. arXiv:1708.04133, 2017.
[13] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. Advances in
Neural Information Processing Systems, pages 2672–2680, 2012.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous
control with deep reinforcement learning. International Conference on Learning Representations, 2016.
[15] J. Matyas. Random optimization. Automation and Remote Control, 26(2):246–253, 1965.
[17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu.
Asynchronous methods for deep reinforcement learning. Proceedings of the International Conference on
Machine Learning, pages 1928–1937, 2016.
[18] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica.
Ray: A distributed framework for emerging AI applications. arXiv:1712.05889, 2017.
[19] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep
reinforcement learning with model-free fine-tuning. arXiv:1708.02596, 2017.
[20] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of
Computational Mathematics, 17(2):527–566, 2017.
[21] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and
M. Andrychowicz. Parameter space noise for exploration. arXiv:1706.01905, 2017.
[22] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade. Towards generalization and simplicity in continuous
control. Advances in Neural Information Processing Systems, 2017.
[23] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement
learning. arXiv:1703.03864, 2017.
[24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. Proceedings
of the International Conference on Machine Learning, pages 1889–1897, 2015.
[25] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using
generalized advantage estimation. International Conference on Learning Representations, 2015.
[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.
arXiv:1707.06347, 2017.
[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient
algorithms. Proceedings of the International Conference on Machine Learning, 2014.
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural
networks and tree search. Nature, 529(7587):484–489, 2016.
[29] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
[30] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator.
arXiv:1712.08642, 2017.
[31] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient
actor-critic with experience replay. International Conference on Learning Representations, 2016.
[32] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement
learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems,
2017.
SUNRISE: A Simple Unified Framework
for Ensemble Learning in Deep Reinforcement Learning
which selects actions with highest UCB for efficient exploration. This inference method can encourage exploration by providing a bonus for visiting unseen state-action pairs, where ensembles produce high uncertainty, i.e., high variance (see Figure 1(b)). By enforcing the diversity between agents using Bootstrap with random initialization (Osband et al., 2016a), we find that these different ideas can be fruitfully integrated, and they are largely complementary (see Figure 1(a)).

We demonstrate the effectiveness of the proposed method using Soft Actor-Critic (SAC; Haarnoja et al. 2018) for continuous control benchmarks (specifically, OpenAI Gym (Brockman et al., 2016) and DeepMind Control Suite (Tassa et al., 2018)) and Rainbow DQN (Hessel et al., 2018) for discrete control benchmarks (specifically, Atari games (Bellemare et al., 2013)). In our experiments, SUNRISE consistently improves the performance of existing off-policy RL methods. Furthermore, we find that the proposed weighted Bellman backups yield improvements in environments with noisy rewards, which have a low signal-to-noise ratio.

2. Related work

Off-policy RL algorithms. Recently, various off-policy RL algorithms have provided large gains in sample-efficiency by reusing past experiences (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018). Rainbow DQN (Hessel et al., 2018) achieved state-of-the-art performance on the Atari games (Bellemare et al., 2013) by combining several techniques, such as double Q-learning (Van Hasselt et al., 2016) and distributional DQN (Bellemare et al., 2017). For continuous control tasks, SAC (Haarnoja et al., 2018) achieved state-of-the-art sample-efficiency results by incorporating the maximum entropy framework. Our ensemble method brings orthogonal benefits and is complementary to and compatible with these existing state-of-the-art algorithms.

Stabilizing Q-learning. It has been empirically observed that instability in Q-learning can be caused by applying the Bellman backup on the learned value function (Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018; Song et al., 2019; Kim et al., 2019; Kumar et al., 2019; 2020). By following the principle of double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016), the twin-Q trick (Fujimoto et al., 2018) was proposed to handle the overestimation of value functions for continuous control tasks. Song et al. (2019) and Kim et al. (2019) proposed to replace the max operator with Softmax and Mellowmax, respectively, to reduce the overestimation error. Recently, Kumar et al. (2020) handled the error propagation issue by reweighting the Bellman backup based on cumulative Bellman errors. However, our method is different in that we propose an alternative way that also utilizes ensembles to estimate uncertainty and provide more stable, higher-signal-to-noise backups.

Ensemble methods in RL. Ensemble methods have been studied for different purposes in RL (Wiering & Van Hasselt, 2008; Osband et al., 2016a; Anschel et al., 2017; Agarwal et al., 2020; Lan et al., 2020). Chua et al. (2018) showed that modeling errors in model-based RL can be reduced using an ensemble of dynamics models, and Kurutach et al. (2018) accelerated policy learning by generating imagined experiences from the ensemble of dynamics models. For efficient exploration, Osband et al. (2016a) and Chen et al. (2017) also leveraged the ensemble of Q-functions. However, most prior works have studied the various axes of improvement from ensemble methods in isolation, while we propose a unified framework that handles various issues in off-policy RL algorithms.

Exploration in RL. To balance exploration and exploitation, several methods, such as the maximum entropy frameworks (Ziebart, 2010; Haarnoja et al., 2018), exploration bonus rewards (Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017; Choi et al., 2019) and randomization (Osband et al., 2016a;b), have been proposed. Despite the success of these exploration methods, a potential drawback is that agents can focus on irrelevant aspects of the environment because these methods do not depend on the rewards. To handle this issue, Chen et al. (2017) proposed an exploration strategy that considers both best estimates (i.e., mean) and uncertainty (i.e., variance) of Q-functions for discrete control tasks. We further extend this strategy to continuous control tasks and show that it can be combined with other techniques.

3. Background

Reinforcement learning. We consider a standard RL framework where an agent interacts with an environment in discrete time. Formally, at each timestep t, the agent receives a state s_t from the environment and chooses an action a_t based on its policy π. The environment returns a reward r_t and the agent transitions to the next state s_{t+1}. The return R_t = Σ_{k=0}^{∞} γ^k r_{t+k} is the total accumulated reward from timestep t with a discount factor γ ∈ [0, 1). RL then maximizes the expected return.

Soft Actor-Critic. SAC (Haarnoja et al., 2018) is an off-policy actor-critic method based on the maximum entropy RL framework (Ziebart, 2010), which encourages robustness to noise and exploration by maximizing a weighted objective of the reward and the policy entropy (see Appendix A for further details). To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters θ, is updated by minimizing a soft Bellman residual.
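A direct transcription of the return defined above, over a finite reward sequence and truncating the infinite sum at the episode length, reads as follows (a sketch):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, accumulated backwards over a finite episode."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```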
[Figure 1(a) schematic: N independent actors and critics with UCB exploration over actions, sharing a single replay buffer.]
Figure 1. (a) Illustration of our framework. We consider N independent agents (i.e., no shared parameters between agents) with one replay
buffer. (b) Uncertainty estimates from an ensemble of neural networks on a toy regression task (see Appendix C for more experimental
details). The black line is the ground truth curve, and the red dots are training samples. The blue lines show the mean and variance of
predictions over ten ensemble models. The ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples.
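The effect shown in Figure 1(b) can be reproduced in a few lines; the sketch below uses scikit-learn MLPs whose data, network sizes, and iteration counts are illustrative stand-ins for the setup described in the paper's Appendix C.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(20, 1))             # sparse training inputs
y = np.sin(X).ravel() + 0.1 * rng.randn(20)      # noisy samples of a ground-truth curve

# Ten identical networks that differ only in their random initialization.
ensemble = [MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=i).fit(X, y) for i in range(10)]

X_test = np.linspace(-6, 6, 200).reshape(-1, 1)
preds = np.stack([net.predict(X_test) for net in ensemble])
mean, var = preds.mean(axis=0), preds.var(axis=0)  # variance grows away from the data
```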
We enforce the diversity between agents through two simple ideas: First, we initialize the model parameters of all agents with random parameter values for inducing an initial diversity in the models. Second, we apply different samples to train each agent. Specifically, for each SAC agent i in each timestep t, we draw the binary masks m_{t,i} from the Bernoulli distribution with parameter β ∈ (0, 1], and store them in the replay buffer. Then, when updating the model parameters of agents, we multiply the bootstrap mask to each objective function, such as: m_{t,i} L_π(s_t, φ_i) and m_{t,i} L_WQ(τ_t, θ_i) in (4) and (5), respectively. We remark that Osband et al. (2016a) applied this simple technique to train an ensemble of DQN (Mnih et al., 2015) only for discrete control tasks, while we apply it to SAC (Haarnoja et al., 2018) and Rainbow DQN (Hessel et al., 2018) for both continuous and discrete tasks with additional techniques.

UCB exploration. The ensemble can also be leveraged for efficient exploration (Chen et al., 2017; Osband et al., 2016a) because it can express higher uncertainty on unseen samples. Motivated by this, and following the idea of Chen et al. (2017), we consider an optimism-based exploration that chooses the action that maximizes

a_t = argmax_a { Q_mean(s_t, a) + λ Q_std(s_t, a) },   (7)

where Q_mean(s, a) and Q_std(s, a) are the empirical mean and standard deviation of all Q-functions {Q_{θ_i}}_{i=1}^N, and λ > 0 is a hyperparameter. This inference method can encourage exploration by adding an exploration bonus (i.e., the standard deviation Q_std) for visiting unseen state-action pairs, similar to the UCB algorithm (Auer et al., 2002). We remark that this inference method was originally proposed in Chen et al. (2017) for efficient exploration in discrete action spaces. However, in continuous action spaces, finding the action that maximizes the UCB is not straightforward. To handle this issue, we propose a simple approximation scheme, which first generates a candidate action set of size N from the ensemble policies {π_i}_{i=1}^N, and then chooses the action that maximizes the UCB (Line 4 in Algorithm 1). For evaluation, we approximate the maximum a posteriori action by averaging the means of the Gaussian distributions modeled by each ensemble policy.

The full procedure of our unified framework, coined SUNRISE, is summarized in Algorithm 1; a sketch of the UCB action selection follows the algorithm.

Algorithm 1 SUNRISE: SAC version
 1: for each iteration do
 2:   for each timestep t do
 3:     // UCB EXPLORATION
 4:     Collect N action samples: A_t = {a_{t,i} ∼ π_i(a|s_t) | i ∈ {1, . . . , N}}
 5:     Choose the action that maximizes UCB: a_t = argmax_{a_{t,i} ∈ A_t} Q_mean(s_t, a_{t,i}) + λ Q_std(s_t, a_{t,i})
 6:     Collect state s_{t+1} and reward r_t from the environment by taking action a_t
 7:     Sample bootstrap masks M_t = {m_{t,i} ∼ Bernoulli(β) | i ∈ {1, . . . , N}}
 8:     Store transitions τ_t = (s_t, a_t, s_{t+1}, r_t) and masks in replay buffer B ← B ∪ {(τ_t, M_t)}
 9:   end for
10:   // UPDATE AGENTS VIA BOOTSTRAP AND WEIGHTED BELLMAN BACKUP
11:   for each gradient step do
12:     Sample random minibatch {(τ_j, M_j)}_{j=1}^B ∼ B
13:     for each agent i do
14:       Update the Q-function by minimizing (1/B) Σ_{j=1}^B m_{j,i} L_WQ(τ_j, θ_i) in (5)
15:       Update the policy by minimizing (1/B) Σ_{j=1}^B m_{j,i} L_π(s_j, φ_i) in (4)
16:     end for
17:   end for
18: end for
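A minimal sketch of the candidate-based UCB rule and of the bootstrap masks, assuming each ensemble member exposes hypothetical sample_action(s) and q(s, a) methods (an illustration of Lines 4, 5, and 7, not the authors' code):

```python
import numpy as np

def ucb_action(s, agents, lam=1.0):
    """a_t = argmax over candidates of Q_mean + lam * Q_std (Lines 4-5 of Algorithm 1)."""
    candidates = [agent.sample_action(s) for agent in agents]  # one candidate per policy
    scores = []
    for a in candidates:
        qs = np.array([agent.q(s, a) for agent in agents])     # all agents score each candidate
        scores.append(qs.mean() + lam * qs.std())
    return candidates[int(np.argmax(scores))]

def bootstrap_masks(num_agents, beta=0.5, rng=np.random):
    """Bernoulli(beta) masks stored with each transition (Line 7 of Algorithm 1)."""
    return rng.binomial(1, beta, size=num_agents)
```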
5. Experimental results

We designed our experiments to answer the following questions:

• Can SUNRISE improve off-policy RL algorithms, such as SAC (Haarnoja et al., 2018) and Rainbow DQN (Hessel et al., 2018), for both continuous (see Table 1 and Table 2) and discrete (see Table 3) control tasks?
• How crucial are the proposed weighted Bellman backups in (5) for improving the signal-to-noise ratio in Q-updates (see Figure 2)?
• Can UCB exploration be useful for solving tasks with sparse rewards (see Figure 3(b))?
• Is SUNRISE better than a single agent with more updates and parameters (see Figure 3(c))?
• How does ensemble size affect the performance (see Figure 3(d))?

5.1. Setups

Continuous control tasks. We evaluate SUNRISE on several continuous control tasks using simulated robots from OpenAI Gym (Brockman et al., 2016) and DeepMind Control Suite (Tassa et al., 2018). For OpenAI Gym experiments with proprioceptive inputs (e.g., positions and velocities), we compare to PETS (Chua et al., 2018), a state-of-the-art model-based RL method based on ensembles of dynamics models; POPLIN-P (Wang & Ba, 2020), a state-of-the-art model-based RL method which uses a policy network to generate actions for planning; POPLIN-A (Wang & Ba, 2020), a variant of POPLIN-P which adds noise in the action space; METRPO (Kurutach et al., 2018), a hybrid RL method which augments TRPO (Schulman et al., 2015) using ensembles of dynamics models; and two state-of-the-art model-free RL methods, TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). For our method, we consider a combination of SAC and SUNRISE, as described in Algorithm 1. Following the setup in Wang & Ba (2020) and Wang et al. (2019), we report the mean and standard deviation across ten runs after 200K timesteps on five complex environments: Cheetah, Walker, Hopper, Ant, and SlimHumanoid with early termination (ET). More experimental details and learning curves with 1M timesteps are in Appendix D.

For DeepMind Control Suite with image inputs, we compare to PlaNet (Hafner et al., 2019), a model-based RL method which learns a latent dynamics model and uses it for planning; Dreamer (Hafner et al., 2020), a hybrid RL method which utilizes the latent dynamics model to generate synthetic roll-outs; SLAC (Lee et al., 2020), a hybrid RL method which combines the latent dynamics model with SAC; and three state-of-the-art model-free RL methods which apply contrastive learning (CURL; Srinivas et al. 2020) or data augmentation (RAD (Laskin et al., 2020) and DrQ (Kostrikov et al., 2021)) to SAC. For our method, we consider a combination of RAD (i.e., SAC with random crop) and SUNRISE. Following the setup in RAD, we report the mean and standard deviation across five runs after 100K (i.e., low sample regime) and 500K (i.e., asymptotically optimal regime) environment steps on six environments: Finger-spin, Cartpole-swing, Reacher-easy, Cheetah-run, Walker-walk, and Cup-catch. More experimental details and learning curves are in Appendix F.

Discrete control benchmarks. For discrete control tasks, we demonstrate the effectiveness of SUNRISE on several Atari games (Bellemare et al., 2013). We compare to SimPLe (Kaiser et al., 2020), a hybrid RL method which updates the policy only using samples generated by the learned dynamics model; Rainbow DQN (Hessel et al., 2018) with modified hyperparameters for sample-efficiency (van Hasselt et al., 2019); a Random agent (Kaiser et al., 2020); two state-of-the-art model-free RL methods which apply contrastive learning (CURL; Srinivas et al. 2020) and data augmentation (DrQ; Kostrikov et al. 2021) to Rainbow DQN; and Human performance as reported in Kaiser et al. (2020) and van Hasselt et al. (2019). Following the setups in SimPLe, we report the mean across three runs after 100K interactions (i.e., 400K frames with an action repeat of 4). For our method, we consider a combination of the sample-efficient version of Rainbow DQN and SUNRISE (see Algorithm 3 in Appendix B). More experimental details and learning curves are in Appendix G.

For our method, we do not alter any hyperparameters of the original RL algorithms and train five ensemble agents. There are only three additional hyperparameters β, T, and λ for bootstrap, weighted Bellman backup, and UCB exploration, for which we provide details in Appendix D, F, and G.

5.2. Comparative evaluation

OpenAI Gym. Table 1 shows the average returns of evaluation roll-outs for all methods. SUNRISE consistently improves the performance of SAC across all environments and outperforms the model-based RL methods, such as POPLIN-P and PETS, on all environments except Ant and SlimHumanoid-ET. Even though we focus on performance after small numbers of samples because of the recent emphasis on making RL more sample-efficient, we find that the gain from SUNRISE becomes even more significant when training longer (see Figure 3(c) and Appendix D). We remark that SUNRISE is more compute-efficient than modern model-based RL methods, such as POPLIN and PETS, because they also utilize ensembles (of dynamics models) and perform planning to select actions. Namely, SUNRISE is simple to implement, computationally efficient, and readily parallelizable.

DeepMind Control Suite. As shown in Table 2, SUNRISE also consistently improves the performance of RAD (i.e., SAC with random crop) on all environments from DeepMind Control Suite. This implies that the proposed method can be useful for high-dimensional and complex input observations. Moreover, our method outperforms existing pixel-based RL methods in almost all environments. We remark that SUNRISE can also be combined with DrQ, and we expect that it can achieve better performance on Cartpole-swing and Cup-catch at 100K environment steps.

Atari games. We also evaluate SUNRISE on discrete control tasks from the Atari benchmark using Rainbow DQN. Table 3 shows that SUNRISE improves the performance of Rainbow in almost all environments, and outperforms the state-of-the-art CURL and SimPLe on 11 out of 26 Atari games. Here, we remark that SUNRISE is also compatible with CURL, which could enable even better performance. These results demonstrate that SUNRISE is a general approach.
            Cheetah            Walker             Hopper            Ant               SlimHumanoid-ET
PETS        2288.4 ± 1019.0    282.5 ± 501.6      114.9 ± 621.0     1165.5 ± 226.9    2055.1 ± 771.5
POPLIN-A    1562.8 ± 1136.7    -105.0 ± 249.8     202.5 ± 962.5     1148.4 ± 438.3    -
POPLIN-P    4235.0 ± 1133.0    597.0 ± 478.8      2055.2 ± 613.8    2330.1 ± 320.9    -
METRPO      2283.7 ± 900.4     -1609.3 ± 657.5    1272.5 ± 500.9    282.2 ± 18.0      76.1 ± 8.8
TD3         3015.7 ± 969.8     -516.4 ± 812.2     1816.6 ± 994.8    870.1 ± 283.8     1070.0 ± 168.3
SAC         4474.4 ± 700.9     299.5 ± 921.9      1781.3 ± 737.2    979.5 ± 253.2     1371.8 ± 473.4
SUNRISE     4501.8 ± 443.8     1236.5 ± 1123.9    2643.2 ± 472.3    1502.4 ± 483.5    1926.6 ± 375.0

Table 1. Performance on OpenAI Gym at 200K timesteps. The results show the mean and standard deviation averaged over ten runs. For baseline methods, we report the best number in prior works (Wang & Ba, 2020; Wang et al., 2019).
5.3. Ablation study

Effects of weighted Bellman backups. To verify the effectiveness of the proposed weighted Bellman backup (5) in improving the signal-to-noise ratio in Q-updates, we evaluate on modified OpenAI Gym environments with noisy rewards. Following Kumar et al. (2019), we add Gaussian noise to the reward function: r′(s, a) = r(s, a) + z, where z ∼ N(0, 1), only during training, and report the deterministic ground-truth reward during evaluation. For our method, we also consider a variant of SUNRISE which updates Q-functions without the proposed weighted Bellman backup to isolate its effect. We compare to DisCor (Kumar et al., 2020), which improves SAC by reweighting the Bellman backup based on estimated cumulative Bellman errors (see Appendix E for more details).

Figure 2 shows the learning curves of all methods on OpenAI Gym with noisy rewards. The proposed weighted Bellman backup significantly improves both the sample-efficiency and the asymptotic performance of SUNRISE, and outperforms baselines such as SAC and DisCor. One can note that the performance gain due to our weighted Bellman backup becomes more significant in complex environments, such as SlimHumanoid-ET. We remark that DisCor still suffers from error propagation issues in complex environments like SlimHumanoid-ET and Ant because there are some approximation errors in estimating cumulative Bellman errors (see Section 6.1 for a more detailed discussion). These results imply that errors in the target Q-function can be characterized effectively by the proposed confidence weight in (6).

We also consider another variant of SUNRISE, which updates Q-functions with random weights sampled from [0.5, 1.0] uniformly at random. In order to evaluate the performance of SUNRISE, we increase the noise rate by adding Gaussian noise with a larger standard deviation to the reward function: r′(s, a) = r(s, a) + z, where z ∼ N(0, 5). Figure 3(a) shows the learning curves of all methods on the SlimHumanoid-ET environment over 10 random seeds. First, one can note that SUNRISE with random weights (red curve) is worse than SUNRISE with the proposed weighted Bellman backups (blue curve). Additionally, even without UCB exploration, SUNRISE with the proposed weighted Bellman backups (purple curve) outperforms all baselines. This implies that the proposed weighted Bellman backups can handle the error propagation effectively even when there is large noise in the reward function.
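This reward corruption amounts to a simple environment wrapper; the class below is our illustration, assuming the classic four-tuple gym step API, not the authors' code.

```python
import numpy as np
import gym

class NoisyRewardWrapper(gym.Wrapper):
    """Train on r'(s, a) = r(s, a) + z with z ~ N(0, sigma); evaluate on the true reward."""

    def __init__(self, env, sigma=1.0):
        super().__init__(env)
        self.sigma = sigma

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["true_reward"] = reward  # keep the ground truth for evaluation
        return obs, reward + self.sigma * np.random.randn(), done, info
```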
Effects of UCB exploration. To verify the advantage of UCB exploration in (7), we evaluate on Cartpole-swing with sparse rewards from DeepMind Control Suite. For our method, we consider a variant of SUNRISE which selects actions without UCB exploration. As shown in Figure 3(b), SUNRISE with UCB exploration (blue curve) significantly improves the sample-efficiency on the environment with sparse rewards.

Table 3. Performance on Atari games at 100K interactions. The results show the scores averaged over three runs. For baseline methods, we report the best numbers reported in prior works (Kaiser et al., 2020; van Hasselt et al., 2019).

Comparison with a single agent with more updates/parameters. One concern in utilizing the ensemble method is that its gains may come from more gradient updates and parameters. To address this concern, we compare SUNRISE (5 ensembles using 2-layer MLPs with 256 hidden units each) to a single agent, which consists of 2-layer MLPs with 1024 (and 256) hidden units, with 5 updates using different random minibatches. Figure 3(c) shows the learning curves on SlimHumanoid-ET, where SUNRISE outperforms all baselines. This implies that the gains from SUNRISE cannot be achieved by simply increasing the number of updates/parameters. More experimental results on other environments are also available in Appendix D.

Effects of ensemble size. We analyze the effects of the ensemble size N on the Ant environment from OpenAI Gym. Figure 3(d) shows that the performance can be improved by increasing the ensemble size, but the improvement saturates around N = 5. Thus, we use five ensemble agents for all experiments. More experimental results on other environments are also available in Appendix D, where the overall trend is similar.

6. Discussion

6.1. Intuition for weighted Bellman backups

Kumar et al. (2020) show that naive Bellman backups can suffer from slow learning in certain environments, requiring exponentially many updates. To handle this problem, they propose weighted Bellman backups, which make steady learning progress by inducing some optimal data distribution (see Kumar et al. (2020) for more details). Specifically, in addition to standard Q-learning, DisCor trains an error model Δ(s, a), which approximates the cumulative sum of discounted Bellman errors over the past iterations of training. Then, using the error model, DisCor reweights the Bellman backups based on a confidence weight defined as follows: w(s, a) ∝ exp(−γΔ(s, a)/T), where γ is a discount factor and T is a temperature. However, we remark that DisCor can still suffer from error propagation issues because there is also an approximation error in estimating the cumulative Bellman errors.
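The two weighting schemes can be contrasted in a short sketch. The DisCor form follows the formula above; the SUNRISE form assumes the sigmoid rule w(s, a) = σ(−Q_std(s, a) · T) + 0.5, which is our reading of (6) (the equation itself is not reproduced in this excerpt) and is consistent with the random-weight baseline drawn from [0.5, 1.0].

```python
import numpy as np

def discor_weight(delta, gamma=0.99, temperature=10.0):
    """DisCor: w(s, a) proportional to exp(-gamma * Delta(s, a) / T)."""
    return np.exp(-gamma * delta / temperature)

def sunrise_weight(q_std, temperature=10.0):
    """Ensemble-based confidence weight, bounded in (0.5, 1.0]: low ensemble
    disagreement yields weights near 1, high disagreement near 0.5."""
    return 1.0 / (1.0 + np.exp(q_std * temperature)) + 0.5
```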
Figure 2. Learning curves on OpenAI Gym with noisy rewards. To verify the effects of the weighted Bellman backups (WBB), we consider
SUNRISE with WBB and without WBB. The solid line and shaded regions represent the mean and standard deviation, respectively, across
four runs.
[Figure 3 panels compare SAC, DisCor, RAD, and SUNRISE variants (with and without WBB, UCB, and random weights; ensemble sizes N = 2, 5, 10).]
Figure 3. (a) Learning curves of SUNRISE with random weights (RW) and the proposed weighted Bellman backups (WBB) on the SlimHumanoid-ET environment with noisy rewards. (b) Effects of UCB exploration on the Cartpole environment with sparse reward. (c) Learning curves of SUNRISE and a single agent with more hidden units and five gradient updates per timestep on the SlimHumanoid-ET environment. (d) Learning curves of SUNRISE with varying values of the ensemble size N on the Ant environment.
Acknowledgements

This research is supported in part by ONR PECASE N000141612723, Open Philanthropy, the Darpa LEARN project, the Darpa LwLL project, NSF NRI #2024675, Tencent, and Berkeley Deep Drive. We would like to thank Hao Liu for improving the presentation and giving helpful feedback. We would also like to thank Aviral Kumar and Kai Arulkumaran for providing tips on the implementation of DisCor and Rainbow.

References

Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

Amos, B., Stanton, S., Yarats, D., and Wilson, A. G. On the model-based stochastic value gradient for continuous reinforcement learning. arXiv preprint arXiv:2008.12775, 2020.

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, 2017.

Audibert, J.-Y., Munos, R., and Szepesvári, C. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chen, R. Y., Sidor, S., Abbeel, P., and Schulman, J. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.

Choi, J., Guo, Y., Moczulski, M., Oh, J., Wu, N., Norouzi, M., and Lee, H. Contingency-aware exploration in reinforcement learning. In International Conference on Learning Representations, 2019.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Efron, B. The Jackknife, the Bootstrap, and Other Resampling Plans, volume 38. SIAM, 1982.

Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for Atari. In International Conference on Learning Representations, 2020.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
Kim, S., Asadi, K., Littman, M., and Konidaris, G. DeepMellow: Removing the need for a target network in deep Q-learning. In International Joint Conference on Artificial Intelligence, 2019.

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021.

Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.

Kumar, A., Gupta, A., and Levine, S. DisCor: Corrective feedback in reinforcement learning via distribution correction. In Advances in Neural Information Processing Systems, 2020.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In International Conference on Learning Representations, 2020.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, 2020.

Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In Advances in Neural Information Processing Systems, 2020.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016a.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016b.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

Song, Z., Parr, R., and Carin, L. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning, 2019.

Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Universal planning networks. In International Conference on Machine Learning, 2018.

Srinivas, A., Laskin, M., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, 2020.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, 2016.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems, 2017.

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In International Joint Conferences on Artificial Intelligence Organization, 2018.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.

van Hasselt, H., Hessel, M., and Aslanides, J. When to use parametric models in reinforcement learning? In Advances in Neural Information Processing Systems, 2019.

Wang, T. and Ba, J. Exploring model-based planning with policy networks. In International Conference on Learning Representations, 2020.

Wang, T., Bao, X., Clavera, I., Hoang, J., Wen, Y., Langlois, E., Zhang, S., Zhang, G., Abbeel, P., and Ba, J. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.

Wiering, M. A. and Van Hasselt, H. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2008.
SUNRISE
Yarats, D., Zhang, A., Kostrikov, I., Amos, B., Pineau, J.,
and Fergus, R. Improving sample efficiency in model-
free reinforcement learning from images. arXiv preprint
arXiv:1910.01741, 2019.
Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.