
Stochastic Policy Gradient Reinforcement Learning on a Simple 3D Biped


Russ Tedrake
Computer Science & Artificial Intelligence Lab, Center for Bits & Atoms
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: russt@ai.mit.edu

Teresa Weirui Zhang
Department of Mechanical Engineering, Brain & Cognitive Sciences
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: resa@mit.edu

H. Sebastian Seung
Howard Hughes Medical Institute, Brain & Cognitive Sciences, Center for Bits & Atoms
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: seung@mit.edu

Abstract— We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blank slate using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system in the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks.

I. INTRODUCTION

Recent advances in bipedal walking technology have produced robots capable of leaving the laboratory environment to interact with the unknown and uncertain environments of the real world. Despite our best efforts, it is unlikely that we will be able to preprogram these robots for every possible situation without sacrificing performance. Endowing our robots with the ability to learn from experience and adapt to their environment seems critical for the success of any real world robot.

Dynamic bipedal walking is difficult to learn for a number of reasons. First, walking robots typically have many degrees of freedom, which can cause a combinatorial explosion for learning systems that attempt to optimize performance in every possible configuration of the robot. Second, details of the robot dynamics such as uncertainties in the ground contact and nonlinear friction in the joints are difficult to model well in simulation, making it unlikely that a controller optimized in a simulation will perform optimally on the real robot. Since it is only practical to run a small number of learning trials on the real robot, the learning algorithms must perform well after obtaining a very limited amount of data. Finally, learning algorithms for dynamic walking must deal with dynamic discontinuities caused by collisions with the ground and with the problem of delayed reward: torques applied at one time may have an effect on the performance many steps into the future.

Although there is a great deal of literature on learning control for dynamically walking bipedal robots, there are relatively few examples of learning algorithms actually implemented on the robot or which work quickly enough to allow the robot to adapt online to changing terrain. Some researchers attempt to learn a controller in simulation that is robust enough to run on the real robot [1], treating differences between the simulation and the robot as disturbances. The University of New Hampshire bipeds were two early examples of online learning which acquired a basic gait by tuning parameters in a hand-designed controller ([2], [3]). In this paper we generalize these results to obtaining a controller from scratch instead of tuning an existing controller. Learning control has also been successfully implemented on Sony's quadrupedal robot AIBO (i.e., [4]). The learned controllers for AIBO are open-loop trajectories, but trajectory feedback is essential for robust, dynamic, bipedal walking.

In order to study learning feedback control for walking, we performed our initial experiments on a simplified robot which captures the essence of dynamic walking but which minimizes many of the complications. Our robot has only 6 internal degrees of freedom and 4 actuators¹. The mechanical design of our robot is based on a passive dynamic walker ([5], [6]). This allows us to solve a portion of the control problem in the mechanical design, and makes the robot mechanically very stable; most policies in our search space result in either stable walking or failed walking where the robot ends up simply standing still.

The learning on our robot is performed by a policy gradient reinforcement learning algorithm ([7], [8], [9]). The goal of this paper is to describe our formulation of the learning problem and the algorithm that we use to solve it. We include our experimental results on this simplified biped, and discuss the possibility of applying the same algorithm to a more complicated walking system.

¹The standard for 3D bipeds is to have at least 12 internal degrees of freedom and 12 actuators in the legs.
II. THE ROBOT

Fig. 1. The robot on the left is a simple passive dynamic walker. The robot on the right is our actuated version of the same robot.

The passive dynamic walker shown on the left in Figure 1 represents the simplest machine that we could build which captures the essence of stable dynamic walking in three dimensions. It has only a single passive pin joint at the hip. When placed at the top of a small ramp and given a push sideways, the walker will begin falling down the ramp and eventually converge to a stable limit cycle trajectory that has been compared to the waddling gait of a penguin [10]. The energetics of this passive walker are common to all passive walkers: energy lost due to friction and collisions when the swing leg returns to the ground is balanced by the gradual conversion of potential energy into kinetic energy as the walker moves down the slope. The mechanical design of this robot and some experimental stability results are presented in [11].

We designed our learning robot by adding a small number of actuators to this passive design. The robot shown on the right in Figure 1, which is also described in [11], has passive joints at the hip and 2 degrees of actuation (roll and pitch) at each ankle. The ankle actuators are position controlled servo motors which, when commanded to hold their zero position, allow the actuated robot to walk stably down a small ramp, "simulating" the passive walker. The shape of the large, curved feet is designed to make the robot walk passively at 0.8 Hz, and to take steps of approximately 6.5 cm when walking down a ramp of 0.03 radians. The robot stands 44 cm tall and weighs approximately 2.9 kg, which includes the CPU and batteries that are carried on-board. The most recent additions to this robot are the passive arms, which are mechanically coupled to the opposite leg to provide mechanical yaw compensation. When placed on flat terrain, the passive walker waddles back and forth, slowly losing energy, until it comes to rest standing still. In order to achieve stable walking on flat terrain, the actuators on our learning robot must restore energy into the system that would have been restored by gravity when walking down a slope.

III. THE LEARNING PROBLEM

The goal of learning is to acquire a feedback control policy which makes the robot's gait invariant to small slopes. In total, the system has 9 degrees of freedom², and the equations of motion can be written in the form

H(q)q̈ + C(q, q̇)q̇ + G(q) = τ + D(t),   (1)

where

q = [θ_yaw, θ_lPitch, θ_bPitch, θ_rPitch, θ_roll, θ_raRoll, θ_laRoll, θ_raPitch, θ_laPitch]^T,
τ = [0, 0, 0, 0, 0, τ_raRoll, τ_laRoll, τ_raPitch, τ_laPitch]^T.

H is the state dependent inertial matrix, C contains interaction torques between the links, G represents the effect of gravity, τ are the motor torques, and D are random disturbances to the system. Our shorthand lPitch, bPitch, and rPitch refer to left leg pitch, body pitch, and right leg pitch, respectively. raRoll, laRoll, raPitch, and laPitch are short for right and left ankle roll and pitch. The actual output of the controller is a motor command vector

u = [u_raRoll, u_laRoll, u_raPitch, u_laPitch]^T,

which generates torques

τ = h(q, q̇, u).

The function h describes the linear feedback controller implemented by the servo boards and the nonlinear kinematic transformation into joint torques.

The robot uses a deterministic feedback control policy which is represented using a linear function approximator parameterized by vector w and using nonlinear features φ:

u = π_w(x̂) = Σ_i w_i φ_i(x̂),   with x = [q^T, q̇^T]^T.   (2)

The notation x̂ represents a noisy estimate of the state x. Before learning, w is initialized to all zeros, making the policy outputs zero everywhere, so that the robot simulates the passive walker.

To quantify the stability of our nonlinear, stochastic, periodic trajectory, we consider the dynamics on the return map, taken around the point where θ_roll = 0 and θ̇_roll > 0. The return map dynamics are a Markov random sequence with the probability at the (n+1)th crossing of the return map given by

f_w(x', x) = P{X̂(n+1) = x' | X̂(n) = x, W(n) = w}.   (3)

f_w(x', x) represents the probability density function over the state space which contains the dynamics in equations 1 and 2 integrated over one cycle. We do not make any assumptions about its form, except that it is Markov. Note that the element of f_w representing θ_roll is the delta function, independent of x.

²6 internal DOFs and 3 DOFs for the robot's orientation. We assume that the robot is always in contact with the ground at a single point, and infer the robot's absolute (x, y) position in space directly from the remaining variables.
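To make the policy parameterization of Equation 2 and the return-map sampling of Equation 3 concrete, the sketch below shows one way the pieces fit together in code. It is a minimal illustration under assumptions, not the authors' implementation: the feature layout, the state ranges, and the helper names are invented for the example (Section VI describes the actual non-overlapping tile coding used on the robot).

```python
import numpy as np

# Assumed tile-coding features over (theta_roll, theta_roll_dot); grid sizes and
# ranges are placeholders chosen to mirror the 5 x 7 layout described in Section VI.
N_ROLL, N_ROLLDOT = 5, 7
ROLL_RANGE, ROLLDOT_RANGE = (-0.3, 0.3), (-2.0, 2.0)

def features(roll, roll_dot):
    """phi(x_hat): one-hot vector indicating which tile the estimated state falls in."""
    phi = np.zeros(N_ROLL * N_ROLLDOT)
    i = np.clip(np.digitize(roll, np.linspace(*ROLL_RANGE, N_ROLL - 1)), 0, N_ROLL - 1)
    j = np.clip(np.digitize(roll_dot, np.linspace(*ROLLDOT_RANGE, N_ROLLDOT - 1)), 0, N_ROLLDOT - 1)
    phi[i * N_ROLLDOT + j] = 1.0
    return phi

def policy(w, roll, roll_dot):
    """Eq. 2: u = pi_w(x_hat) = sum_i w_i * phi_i(x_hat), a linear function approximator."""
    return float(w @ features(roll, roll_dot))

def crossed_return_map(prev_roll, roll, roll_dot):
    """The return map is sampled where theta_roll crosses 0 with theta_roll_dot > 0 (Eq. 3)."""
    return prev_roll < 0.0 <= roll and roll_dot > 0.0
```

Each detected crossing produces one sample x̂(n) of the Markov random sequence in Equation 3; the learning updates described in Section IV are applied once per crossing.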
The return map dynamics are represented as a Markov chain that depends on the parameter vector w instead of the equivalent Markov decision process for simplification, because the feedback controller is evaluated many times during a single step (our controller runs at 100 Hz and our robot steps at around 0.8 Hz). The stochasticity in f_w comes from the random disturbances D(t) and the state estimation error, x̂ − x.

The cost function for learning uses a constant desired value, x_d, on the return map:

g(x) = (1/2)|x − x_d|².   (4)

This desired value can be considered a reference trajectory on the return map, and is taken from the gait of the walker down a slope of 0.03 radians; no reference trajectory is required for the limit cycle between steps. For a given trajectory x̂ = [x̂(0), x̂(1), ..., x̂(N)], we define the average cost

G(x̂) = (1/N) Σ_{n=0}^{N} g(x̂(n)).   (5)

Our goal is to find the parameter vector w which minimizes

lim_{N→∞} E{G(x̂)}.   (6)

By minimizing this error, we are effectively minimizing the eigenvalues of the return map, and maximizing the stability of the desired limit cycle.

IV. THE LEARNING ALGORITHM

The learning algorithm is a statistical algorithm which makes small changes to the control parameters w on each step and uses correlations between changes in w and changes in the return map error to climb the performance gradient. This can be accomplished with a very simple online learning rule which changes w with each step that the robot takes. The particular algorithm that we present here was originally proposed by [7]. We present a thorough derivation of this algorithm in the next section.

The algorithm makes use of an intermediate representation which we call the value function, J(x). The value of state x is the expected average cost to be incurred by following policy π_w starting from state x:

J(x) = lim_{N→∞} (1/N) Σ_{n=0}^{N} g(x(n)),   with x(0) = x.

Ĵ_v(x) is an estimate of the value function parameterized by vector v. This value estimate is represented in another function approximator:

Ĵ_v(x̂) = Σ_i v_i ψ_i(x̂).   (7)

During learning, we add stochasticity to our deterministic control policy by varying w. Let Z(n) be a Gaussian random vector with E{Z_i(n)} = 0 and E{Z_i(n)Z_j(n')} = σ² δ_ij δ_nn'. During the nth step that the robot takes, we evaluate the controller using the parameter vector w'(n) = w(n) + z(n). The algorithm uses a storage variable, e(n), which we call the eligibility trace. We begin with w(0) = e(0) = 0. At the end of the nth step, we make the updates:

δ(n) = g(x̂(n)) + γ Ĵ_v(x̂(n+1)) − Ĵ_v(x̂(n))   (8)
e_i(n) = γ e_i(n−1) + b_i(n) z_i(n)   (9)
Δw_i(n) = −η_w δ(n) e_i(n)   (10)
Δv_i(n) = η_v δ(n) ψ_i(x̂(n)).   (11)

η_w ≥ 0 and η_v ≥ 0 are the learning rates and γ is the discount factor of the eligibility trace, which will be discussed in more detail in the algorithm derivation. b_i(n) is a boolean one step eligibility, which is 1 if the parameter w_i is activated (φ_i(x̂) > 0) at any point during step n and 0 otherwise. δ(n) is called the one step temporal difference error.

The algorithm can be understood intuitively. On each step the robot receives some cost g(x̂(n)). This cost is compared to the cost that we expect to receive, as estimated by Ĵ_v(x). If the cost is lower than expected, then −ηδ(n) is positive, so we add a scaled version of the noise terms, z_i, into w_i. Similarly, if the cost is higher than expected, then we move in the opposite direction. This simple online algorithm performs approximate stochastic gradient descent on the expected value of the average infinite-horizon cost.

V. ALGORITHM DERIVATION

The expected value of the average cost, G, is given by:

E{G(x̂)} = ∫ G(x̂) P_{w'}{X̂ = x̂} dx̂.

The probability of trajectory x̂ is

P_{w'}{X̂ = x̂} = P{X̂(0) = x̂(0)} ∏_{n=0}^{N−1} f_{w'}(x̂(n+1), x̂(n)).

Taking the gradient of E{G(x̂)} with respect to w we find

∂/∂w_i E{G(x̂)} = ∫ G(x̂) ∂/∂w_i P_{w'}{X̂ = x̂} dx̂
              = ∫ G(x̂) P_{w'}{X̂ = x̂} ∂/∂w_i log P_{w'}{X̂ = x̂} dx̂
              = E{ G(x̂) ∂/∂w_i log P_{w'}{X̂ = x̂} }
              = E{ G(x̂) Σ_{m=0}^{N−1} ∂/∂w_i log f_{w'}(x̂(m+1), x̂(m)) }.

Recall that f_{w'}(x', x) is a complicated function which includes the integrated dynamics of the controller and the robot. Nevertheless, ∂/∂w_i log f_{w'} is simply:

∂/∂w_i log f_{w'}(x'(m+1), x(m)) = ∂/∂w_i log [ P{X̂' = x' | X̂ = x, W' = w'} P_w{W' = w'} ]
                                 = ∂/∂w_i log P_w{W'(m) = w'} = z_i(m)/σ².
Substituting, we have

∂/∂w_i E{G(x̂)} = E{ (1/(Nσ²)) ( Σ_{n=1}^{N} g(x̂(n)) ) ( Σ_{m=0}^{N−1} z_i(m) ) }
              = E{ (1/(Nσ²)) Σ_{n=0}^{N} g(x̂(n)) Σ_{m=0}^{n} b_i(m) z_i(m) }.

This final reduction is based on the observation that E{ g(x̂(n)) Σ_{m=n}^{N} z_i(m) } = 0 (noise added to the controller on or after step n is not correlated to the cost at step n). Similarly, random changes to a weight that is not used during the nth step (b_i(m) = 0) have zero expectation, and can be excluded from the sum.

Observe that the variance of this gradient estimate grows without bound as N → ∞ [8]. To bound the variance, we use a biased estimate of this gradient which artificially discounts the eligibility trace:

∂/∂w_i E{G(x̂)} ≈ E{ (1/(Nσ²)) Σ_{n=0}^{N} g(x̂(n)) Σ_{m=0}^{n} γ^{n−m} b_i(m) z_i(m) }
              = E{ (1/(Nσ²)) Σ_{n=0}^{N} g(x̂(n)) e_i(n) },

with 0 ≤ γ ≤ 1. The discount factor γ parameterizes the bias-variance trade-off.

Next, observe that we can subtract any mean zero baseline from this quantity without affecting the expected value of our estimate [12]. Including this baseline can dramatically improve the performance of our algorithm because it can reduce the variance of our gradient estimate. In particular, we subtract a mean-zero term containing an estimate of the value function as recommended by [7]:

lim_{N→∞} ∂/∂w_i E{G(x̂)} ≈ lim_{N→∞} (1/N) E{ Σ_{n=0}^{N} g(x(n)) e_i(n) − Σ_{n=0}^{N} V̂^π(x(n)) z_i(n) }
  = lim_{N→∞} (1/N) E{ Σ_{n=0}^{N} g(x(n)) e_i(n) + Σ_{n=0}^{N} z_i(n) Σ_{m=n}^{N−1} γ^{m−n} [ γ V̂^π(x(m+1)) − V̂^π(x(m)) ] }
  = lim_{N→∞} (1/N) E{ Σ_{n=0}^{N} g(x(n)) e_i(n) + Σ_{n=0}^{N} [ γ V̂^π(x(n+1)) − V̂^π(x(n)) ] e_i(n) }
  = lim_{N→∞} (1/N) E{ Σ_{n=0}^{N} δ(n) e_i(n) }.

By this derivation, we can see that the average of the weight update given in equations 8-11 is in the direction of the performance gradient:

lim_{N→∞} E{ (1/N) Σ_{n=0}^{N} Δw_i(n) } ≈ −η lim_{N→∞} ∂/∂w_i E{G(x̂)}.

VI. LEARNING IMPLEMENTATION

In our initial implementation of the algorithm, we decided to further simplify the problem by decomposing the control in the frontal and sagittal planes. In this decomposition, the ankle roll actuators are responsible for stabilizing the oscillations of the robot in the frontal plane. The ankle pitch actuators cause the robot to lean forward or backward, which moves the position of the center of mass relative to the ground contact point on the foot. Because the hip joint on our robot is passive, if the center of mass is in front of the ground contact when the swing foot leaves the ground, then the robot will begin to walk forward. The distance between the center of mass and the ground contact is monotonically related to the step size and to the walking speed.

Due to the simplicity of the sagittal plane control, we only need to learn a control policy for the two ankle roll actuators which stabilize the roll oscillation in the frontal plane. This strategy will change as the robot walks at different speeds, but we hope the learning algorithm will adapt quickly enough to compensate for those differences.

With these simplifications in mind, we constrain the feedback policy to be a function of only two variables: θ_roll and θ̇_roll. The choice of these two variables is not arbitrary; they are the only variables that we use when writing a non-learning feedback controller that stabilizes the oscillation. We also constrain the policy to be symmetric: the controller for the left ankle is simply a mirror image of the controller for the right ankle. Therefore, the learned control policy only has a single output. The value function is approximated as a function of only a single variable: θ̇_roll. This very low dimensionality allows the algorithm to train very quickly.

The control policy and value functions are both represented using linear function approximators of the form described in Equations 2 and 7, which are fast and very convenient to initialize. We use a non-overlapping tile-coding for our approximator basis functions: 35 tiles for the policy (5 in θ_roll × 7 in θ̇_roll) and 11 tiles for the value function.

In order to make the robot explore the state space during learning, we hand-designed a simple controller to place the robot in random initial conditions on the return map. The random distribution is biased according to the distribution of points that the robot has already experienced on the return map: the most likely initial condition is the state that the robot has experienced least often. We use this controller to randomly reinitialize the robot every time that it comes to a halt standing still, or every 10 seconds, whichever comes first. This heuristic makes the distribution on the return map more uniform, and increases the likelihood of the algorithm converging on the same policy each time that it learns from a blank slate.
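As a concrete illustration of how Equations 8-11 and the implementation choices above fit together, the sketch below runs the per-step update with the reduced parameterization described in this section. It is a hedged reconstruction rather than the authors' code: the learning rates, noise level, and the `walk_one_step`, `policy_features`, and `value_features` callables are assumptions standing in for the robot (or a simulator) and the tile-coded features.

```python
import numpy as np

def learn(walk_one_step, policy_features, value_features, x_desired,
          n_steps, x0, n_policy=35, n_value=11,
          sigma=0.05, gamma=0.2, eta_w=0.01, eta_v=0.1):
    """Online stochastic policy gradient update of Eqs. 8-11.

    walk_one_step(w_prime, x) executes one robot step with the perturbed
    parameters and returns the next return-map state; policy_features(x) and
    value_features(x) return phi(x) and psi(x). Tile counts follow Section VI;
    the remaining constants are assumed values.
    """
    w, v, e = np.zeros(n_policy), np.zeros(n_value), np.zeros(n_policy)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        z = sigma * np.random.randn(n_policy)          # Gaussian parameter noise z(n)
        x_next = walk_one_step(w + z, x)               # evaluate controller with w'(n) = w(n) + z(n)
        phi, psi = policy_features(x), value_features(x)
        b = (phi > 0).astype(float)                    # one-step eligibility b_i(n)
        g = 0.5 * np.sum((x - x_desired) ** 2)         # Eq. 4
        delta = g + gamma * (v @ value_features(x_next)) - v @ psi   # Eq. 8
        e = gamma * e + b * z                          # Eq. 9
        w = w - eta_w * delta * e                      # Eq. 10
        v = v + eta_v * delta * psi                    # Eq. 11
        x = x_next
    return w, v
```

One update is applied per crossing of the return map, so this loop runs at roughly the 0.8 Hz step rate of the robot rather than at the 100 Hz servo rate.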
VII. EXPERIMENTAL RESULTS

When the learning begins, the policy parameters, w, are set to 0 and the baseline parameters, v, are initialized so that Ĵ_v(x) ≈ g(x)/(1 − γ). We typically train the robot on flat terrain using short trials with random initial conditions. During the first few trials, the policy does not restore sufficient energy into the system, and the robot comes to a halt standing still. Within a minute of training, the robot achieves foot clearance on nearly every step; this is the minimal definition of walking on this system. The learning easily converges to a robust gait with the desired fixed point on the return map within 20 minutes (approximately 960 steps at 0.8 Hz). Error obtained during learning depends on the random initial conditions of each trial, and is therefore a very noisy stochastic variable. For this reason, in Figure 2 we plot a typical learning curve in terms of the average error per step. Figure 3 plots a typical trajectory of the learned controller walking on flat terrain. Figure 4 displays the final policy.

Fig. 2. A typical learning curve, plotted as the average error on each step.

Fig. 3. θ_roll trajectory of the robot starting from standing still (plotted over seconds).

Fig. 4. Learned feedback control policy u_raRoll = π_w(x̂).

In Figure 5 we plot the return maps of the system before learning (w = 0) and after 1000 steps. In general, the return map for our 9 DOF robot is 17 dimensional (9 states + 9 derivatives − 1), and the projection of these dynamics onto a single dimension is difficult to interpret. The plots in Figure 5 were made with the robot walking in place on flat terrain. In this particular situation, most of the return map variables are close to zero throughout the dynamics, and a two dimensional return map captures the desired dynamics. As expected, before learning the return map illustrates a single fixed point at θ̇_roll = 0, which means the robot is standing still. After learning, we obtain a single fixed point at the desired value (θ̇_roll = 1.0 radians/second), and the basin of attraction of this fixed point extends over the entire domain that we tested. On the rare occasion that the robot falls over, the system does not return to the map and stops producing points on this graph.

Fig. 5. Experimental return maps, before (left) and after (right) learning. Fixed points exist at the intersections of the return map (blue) and the line of slope one (red).

To quantify the stability of the learned controller, we measure the eigenvalues of the return map. Linearizing around the fixed point in Figure 5 suggests that the system has a single eigenvalue of 0.5. To obtain the eigenvalues of the return map when the robot is walking, we run the robot from a large number of initial conditions and record the return map trajectories x̂_i(n), 9 × 1 vectors which represent the state of the system (with fixed ankles) on the nth crossing of the ith trial. For each trial we estimate x̂_i(∞), the equilibrium of the return map. Finally, we perform a least squares fit of the matrix A to satisfy the relation

[x̂_i(n+1) − x̂_i(∞)] = A[x̂_i(n) − x̂_i(∞)].

The eigenvalues of A for the learned controller and for our hand-designed controllers (described in [11]) are:

Controller                            | Eigenvalues
Passive walking (63 trials)           | 0.88 ± 0.01i, 0.75, 0.66 ± 0.03i, 0.54, 0.36, 0.32 ± 0.13i
Hand-designed feed-forward (89 trials)| 0.80, 0.60, 0.49 ± 0.04i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
Hand-designed feedback (58 trials)    | 0.78, 0.69 ± 0.03i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
Learned feedback (42 trials)          | 0.74 ± 0.05i, 0.53 ± 0.09i, 0.43, 0.30 ± 0.02i, 0.15, 0.07

All of these experiments were on flat terrain except the passive walking, which was on a slope of 0.027 radians.
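The least-squares fit of the return-map matrix A described above is straightforward to reproduce; the sketch below shows one way to do it, assuming the per-trial crossing states have already been collected. The variable names, the equilibrium estimate, and the use of numpy's lstsq are illustrative choices, not the authors' code.

```python
import numpy as np

def return_map_eigenvalues(trials):
    """Fit A in  x_i(n+1) - x_i(inf) = A (x_i(n) - x_i(inf))  and return its eigenvalues.

    `trials` is a list of arrays, each of shape (N_i, 9): the return-map states
    recorded on successive crossings of one trial. The equilibrium x_i(inf) is
    estimated here, crudely, by the last recorded crossing of that trial.
    """
    deviations, next_deviations = [], []
    for x in trials:
        x_inf = x[-1]
        deviations.append(x[:-1] - x_inf)       # rows: x_i(n)   - x_i(inf)
        next_deviations.append(x[1:] - x_inf)   # rows: x_i(n+1) - x_i(inf)
    X = np.vstack(deviations)
    Y = np.vstack(next_deviations)
    # Least squares for A with X @ A.T ~= Y, i.e. A = (pinv(X) @ Y).T
    A_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.linalg.eigvals(A_T.T)
```

The eigenvalue magnitudes act as step-to-step convergence rates: the closer the largest magnitude is to zero, the faster disturbances on the return map die out.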
The convergence of the system to the nominal trajectory is largely governed by the largest eigenvalues. This analysis suggests that our learned controller converges to the steady state trajectory more quickly than the passive walker on a ramp and more quickly than any of our hand-designed controllers.

Our stochastic policy gradient algorithm solves the temporal credit assignment problem by accumulating the eligibility within a step and discounting eligibility between steps. Interestingly, our algorithm performs best with heavy discounting between steps (0 ≤ γ ≤ 0.2). This suggests that our one dimensional value estimate does a good job of isolating the credit assignment to a single step.

While it took a few minutes to learn a controller from a blank slate, adjusting the learned controller to adapt to small changes in the terrain appears to happen very quickly. The non-learning controllers require constant attention and small manual changes to the parameters as the robot walks down the hall, on tiles, and on carpet. The learning controller easily adapts to these situations.

VIII. DISCUSSION

Designing our robot like a passive dynamic walker changes the learning problem in a number of ways. It allows us to learn a policy with only a single output which controls a 9 DOF system, and allows us to formulate the problem on the return map dynamics. It also dramatically increases the number of policies in the search space which could generate stable walking. The learning algorithm works extremely well on this simple robot, but will the technique scale to more complicated robots?

One factor in our success was the formulation of the learning problem on the discrete dynamics of the return map instead of the continuous dynamics along the entire trajectory. This formulation relies on the fact that our passive walker produces periodic trajectories even before the learning begins. It is possible for passive walkers to have knees and arms [13], or on a more traditional humanoid robot this algorithm could be used to augment and improve an existing walking controller which produces nominal walking trajectories.

As the number of degrees of freedom increases, the stochastic policy gradient algorithm may have problems with scaling. The algorithm correlates changes in the policy parameters with changes in the performance on the return map. As we add degrees of freedom, the assignment of credit to a particular actuator will become more difficult, requiring more learning trials to obtain a good estimate of the correlation. This scaling problem is an open and interesting research question and a primary focus of our current research.

IX. CONCLUSIONS

We have presented a learning formulation and learning algorithm which works very well on our simplified 3D dynamic biped. The robot begins to walk after only one minute of learning from a blank slate, and the learning converges to the desired trajectory in less than 20 minutes. This learned controller is quantifiably more stable, using the eigenvalues of the return map, than any controller we were able to derive for the same robot by hand. Once the controller is learned, the robot is able to quickly adapt to small changes in the terrain.

Building a robot to simplify the learning allowed us to gain some practical insights into the learning problem for dynamic bipedal locomotion. Implementing these algorithms on the real robot proved to be a very different problem than working in simulation. We would like to take two basic directions to continue this research. First, we are removing many of the simplifying assumptions used in this paper (such as the decomposed control policy) to better approximate optimal walking on this simple platform and to test our learning controller's ability to compensate for rough terrain. Second, we are scaling these results up to more sophisticated bipeds, including a passive dynamic walker with knees and humanoids that already have a basic control system in place.

ACKNOWLEDGMENTS

This work was supported by the David and Lucille Packard Foundation (contract 99-1471) and the National Science Foundation (grant CCR-0122419). Special thanks to Ming-fai Fong and Derrick Tan for their help with designing and building the experimental platform.

REFERENCES

[1] J. Morimoto and C. Atkeson, "Minimax differential dynamic programming: An application to robust biped walking," Neural Information Processing Systems, 2002.
[2] W. T. Miller, III, "Real-time neural network control of a biped walking robot," IEEE Control Systems Magazine, vol. 14, no. 1, pp. 41–48, Feb 1994.
[3] H. Benbrahim and J. A. Franklin, "Biped dynamic walking using reinforcement learning," Robotics and Autonomous Systems, vol. 22, pp. 283–302, 1997.
[4] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," IEEE International Conference on Robotics and Automation, 2004.
[5] T. McGeer, "Passive dynamic walking," International Journal of Robotics Research, vol. 9, no. 2, pp. 62–82, April 1990.
[6] M. J. Coleman and A. Ruina, "An uncontrolled toy that can walk but cannot stand still," Physical Review Letters, vol. 80, no. 16, pp. 3658–3661, April 1998.
[7] H. Kimura and S. Kobayashi, "An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions," International Conference on Machine Learning (ICML '98), 1998, pp. 278–286.
[8] J. Baxter and P. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
[9] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, 1999.
[10] J. E. Wilson, "Walking toy," United States Patent Office, Tech. Rep., October 15, 1936.
[11] R. Tedrake, T. W. Zhang, M. Fong, and H. S. Seung, "Actuating a simple 3D passive dynamic walker," IEEE International Conference on Robotics and Automation, 2004.
[12] R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[13] S. H. Collins, M. Wisse, and A. Ruina, "A three-dimensional passive-dynamic walking robot with two legs and knees," International Journal of Robotics Research, vol. 20, no. 7, pp. 607–615, July 2001.
A Simple Reinforcement Learning Algorithm For Biped Walking

Jun Morimoto, Gordon Cheng, Christopher G. Atkeson, and Garth Zeglin

Department of Humanoid Robotics and Computational Neuroscience, ATR Computational Neuroscience Labs
xmorimo@atr.co.jp, gordon@atr.co.jp, http://www.cns.atr.co.jp/hrcn

The Robotics Institute, Carnegie Mellon University
cga@cs.cmu.edu, garthz@ri.cmu.edu, http://www.ri.cmu.edu

Abstract— We propose a model-based reinforcement learning algorithm for biped walking in which the robot learns to appropriately place the swing leg. This decision is based on a learned model of the Poincare map of the periodic walking pattern. The model maps from a state at the middle of a step and foot placement to a state at the next middle of a step. We also modify the desired walking cycle frequency based on online measurements. We present simulation results, and are currently implementing this approach on an actual biped robot.

I. INTRODUCTION

Our long-term goal is to understand how humans learn biped locomotion and adapt their locomotion pattern. In this paper, we propose and explore the feasibility of a candidate learning algorithm for biped walking. Our algorithm has two elements, learning appropriate foot placement, and estimating appropriate walking cycle timing. We are using model-based reinforcement learning, where we learn a model of a Poincare map and then choose control actions based on a computed value function. Alternative approaches applying reinforcement learning to biped locomotion include [1], [13], [2].

An important issue in applying our approach is matching the desired walking cycle timing to the natural dynamics of the biped. In this study, we use phase oscillators to estimate appropriate walking cycle timing [19], [14], [15].

To evaluate our proposed method, we use simulated 3 link and 5 link biped robots (Figs. 1 and 2). Physical parameters of the 3 link simulated robot are in Table I. Physical parameters of the 5 link simulated robot in Table II are selected to model an actual biped robot fixed to a boom that keeps the robot in the sagittal plane (Fig. 2). Our bipeds have a short torso and point or round feet without ankle joints. For these bipeds, controlling biped walking trajectories with the popular ZMP approach [20], [8], [22], [12] is difficult or not possible, and thus an alternative method for controller design must be used.

In section II-A, we introduce an estimation method of natural biped walking timing by using the measured walking period and an adaptive phase resetting method. In section III, we introduce our reinforcement learning method for biped walking. The robot learns appropriate foot placement through trial and error. In section IV-B, we propose using the estimation method for natural biped walking timing to assist the learned controller. In section IV-C, we analyze the stability of the learned controller.

Fig. 1. Three link robot model

Fig. 2. Five link biped robot

TABLE I
PHYSICAL PARAMETERS OF THE THREE LINK ROBOT MODEL

                   trunk    leg
mass [kg]          2.0      0.8
length [m]         0.01     0.4
inertia [kg·m²]    0.0001   0.01

II. ESTIMATION OF NATURAL BIPED WALKING TIMING

In order for our foot placement algorithm to place the foot at the appropriate time, we must estimate the natural biped walking period, or equivalently, frequency. This timing changes, for example, when walking down slopes. Our goal is to adapt the walking cycle timing to the dynamics of the robot and environment.
TABLE II
PHYSICAL PARAMETERS OF THE FIVE LINK ROBOT MODEL

                          trunk   thigh   shin
mass [kg]                 2.0     0.64    0.15
length [m]                0.01    0.2     0.2
inertia (×10⁻⁴ kg·m²)     1.0     6.9     1.4

A. Estimation method

We derive the target walking frequency ω* from the walking period T, which is measured from an actual half-cycle period (one foot fall to another):

ω* = π / T.   (1)

The update rule for the walking frequency is

ω̂_{n+1} = ω̂_n + K_ω(ω* − ω̂_n),   (2)

where K_ω is the frequency adaptation gain, and ω̂_n is the estimated frequency after n steps. An interesting feature of this method is that the simple averaging (low-pass filtering) method (Eq. 2) can estimate appropriate timing of the walking cycle for the given robot dynamics. This method was also adopted in [14], [15].

Several studies suggest that phase resetting is effective to match walking cycle timing to the natural dynamics of the biped [19], [24], [14], [15]. Here we propose an adaptive phase resetting method. Phase φ is reset when the swing leg touches the ground:

φ̄ ← φ̄ + K_φ(φ − φ̄),   (3)
φ ← φ̄,   (4)

where φ̄ is the average phase, and K_φ is the phase adaptation gain.

B. A simple example of timing estimation

We use the simulated three link biped robot (Fig. 1) to demonstrate the timing estimation method. A target biped walking trajectory is generated using sinusoidal functions with amplitude a = 10° and a simple controller is designed to follow the target trajectories for each leg:

τ_l = k(a sin φ − θ_l) − b θ̇_l,   (5)
τ_r = k(−a sin φ − θ_r) − b θ̇_r,   (6)

where τ_l denotes the left hip torque, τ_r denotes the right hip torque, k = 5.0 is a position gain, b = 0.1 is a velocity gain, and θ_l and θ_r are left and right hip joint angles. Estimated phase φ is given by φ = ω̂_n t, where t is the current time.

For comparison, we apply this controller to the simulated robot without using the timing estimation method, so ω̂ is fixed and φ increases linearly with time (the walking period was set to T = 0.63 sec and frequency ω̂ = 10 rad/sec). The initial average phase was set to φ̄ = 1.0 for the right leg and φ̄ = π + 1.0 for the left leg, the frequency adaptation gain was set to K_ω = 0.3, and the phase adaptation gain was set to K_φ = 0.3.

With an initial condition which has a body velocity of 0.2 m/s, the simulated 3 link robot walked stably on a 1.0° downward slope (Fig. 3(Top)). However, the robot could not walk stably on a 4.0° downward slope (Fig. 3(Bottom)). When we used the online estimate of ω̂ and the adaptive phase resetting method, the robot walked stably on the two test slopes: 1.0° downward slope (Fig. 4(Top)) and 4.0° downward slope (Fig. 4(Bottom)). In Figure 5, we show the estimated walking frequency.

Fig. 3. Biped walking pattern without timing adaptation: (Top) 1.0° downward slope, (Bottom) 4.0° downward slope

Fig. 4. Biped walking pattern with timing adaptation: (Top) 1.0° downward slope, (Bottom) 4.0° downward slope

Fig. 5. Estimated walking frequency (ω [rad/s] versus steps, for the 1.0° and 4.0° slopes)
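A compact way to see how Equations 1-4 interact is to code the update that runs once per footfall. The sketch below is an interpretation of Section II-A, not the authors' implementation; the class name and method names are invented, and the gains default to the values used in the example above.

```python
import math

class WalkingTimingEstimator:
    """Online walking-frequency estimate (Eqs. 1-2) with adaptive phase resetting (Eqs. 3-4)."""

    def __init__(self, omega_init=10.0, phase_bar_init=1.0, k_omega=0.3, k_phi=0.3):
        self.omega_hat = omega_init       # estimated walking frequency [rad/s]
        self.phase_bar = phase_bar_init   # average phase at touchdown
        self.k_omega = k_omega            # frequency adaptation gain K_omega
        self.k_phi = k_phi                # phase adaptation gain K_phi
        self._phase_at_reset = phase_bar_init
        self._t_reset = 0.0
        self._t_last_touchdown = None

    def phase(self, t):
        """Current phase estimate: phi grows at rate omega_hat since the last reset."""
        return self._phase_at_reset + self.omega_hat * (t - self._t_reset)

    def on_touchdown(self, t):
        """Update omega_hat and reset the phase when the swing leg touches the ground."""
        if self._t_last_touchdown is not None:
            T = t - self._t_last_touchdown       # measured half-cycle period (one footfall to the next)
            omega_star = math.pi / T             # Eq. 1
            self.omega_hat += self.k_omega * (omega_star - self.omega_hat)   # Eq. 2
        phi = self.phase(t)
        self.phase_bar += self.k_phi * (phi - self.phase_bar)                # Eq. 3
        self._phase_at_reset = self.phase_bar                                # Eq. 4: phi <- phi_bar
        self._t_reset = t
        self._t_last_touchdown = t
```

The phase returned here would drive the sinusoidal targets of Equations 5-6 (for example τ_l = k(a sin φ − θ_l) − b θ̇_l); with both adaptation gains set to zero it reduces to the fixed-frequency comparison case described above.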
III. MODEL-BASED REINFORCEMENT LEARNING FOR BIPED LOCOMOTION

To walk stably we need to control the placement as well as the timing of the next step. Here, we propose a learning method to acquire a stabilizing controller.

A. Model-based reinforcement learning

We use a model-based reinforcement learning framework [4], [17]. Reinforcement learning requires a source of reward. We learn a Poincare map of the effect of foot placement, and then learn a corresponding value function for states at phase φ = π/2 and φ = 3π/2 (Fig. 6), where we define phase φ = 0 as the right foot touchdown.

Fig. 6. Biped walking trajectory using four via-points: we update parameters and select actions at Poincare sections on phase φ = π/2 and φ = 3π/2. L: left leg, R: right leg

1) Learning the Poincare map of biped walking: We learn a model that predicts the state of the biped a half cycle ahead, based on the current state and the foot placement at touchdown. We are predicting the location of the system in a Poincare section at phase φ = 3π/2 based on the system's location in a Poincare section at phase φ = π/2. We use the same model to predict the location at phase φ = π/2 based on the location at phase φ = 3π/2 (Fig. 6). Because the state of the robot drastically changes at foot touchdown (φ = 0, π), we select the phases φ = π/2 and φ = 3π/2 as Poincare sections. We approximate this Poincare map using a function approximator with a parameter vector w_m,

x̂_{3π/2} = f̂(x_{π/2}, u_{π/2}; w_m),   (7)

where the input state is defined as x = (d, ḋ). d denotes the horizontal distance between the stance foot position and the body position (Fig. 7). Here, we use the hip position as the body position because the center of mass is almost at the same position as the hip position (Fig. 2). The action of the robot u = θ_act is the target knee joint angle of the swing leg, which determines the foot placement (Fig. 7).

2) Representation of biped walking trajectories and the low-level controller: One cycle of biped walking is represented by four via-points for each joint (Fig. 6). The output of the current policy θ_act is used to specify via-points (Table III). We interpolated trajectories between target postures by using the minimum jerk criteria [6], [21] except for pushing off at the stance knee joint. For pushing off at the stance knee, we instantaneously change the desired joint angle to deliver a pushoff to a fixed target to accelerate the motion.

Zero desired velocity and acceleration are specified at each via-point. To follow the generated target trajectories, the torque output at each joint is given by a PD servo controller:

τ_j = k(θ_j^d(φ) − θ_j) − b θ̇_j,   (8)

where θ_j^d(φ) is the target joint angle for the j-th joint (j = 1 ··· 4), the position gain k is set to k = 2.0 except for the knee joint of the stance leg (we used k = 8.0 for the knee joint of the stance leg), and the velocity gain b is set to b = 0.05. Table III shows the target postures.

TABLE III
TARGET POSTURES AT EACH PHASE φ: θ_act IS PROVIDED BY THE OUTPUT OF THE CURRENT POLICY. THE UNIT FOR NUMBERS IN THIS TABLE IS DEGREES

            right hip   right knee   left hip   left knee
φ = 0       −10.0       θ_act        10.0       0.0
φ = 0.5π                θ_act                   60.0
φ = 0.7π    10.0                     −10.0
φ = π       10.0        0.0          −10.0      θ_act
φ = 1.5π                60.0                    θ_act
φ = 1.7π    −10.0                    10.0

3) Rewards: The robot gets a reward if it successfully continues walking and gets punishment (negative reward) if it falls down. On each transition from phase φ = π/2 (or φ = 3π/2) to phase φ = 3π/2 (or φ = π/2), the robot gets a reward of 0.1 if the height of the body remains above 0.35 m during the past half cycle. If the height of the body goes below 0.35 m, the robot is given a negative reward (−1) and the trial is terminated.

4) Learning the value function: In a reinforcement learning framework, the learner tries to create a controller which maximizes expected total return. Here, we define the value function for the policy µ

V^µ(x(t)) = E[r(t+1) + γ r(t+2) + γ² r(t+3) + ...],   (9)

where r(t) is the reward at time t, and γ (0 ≤ γ ≤ 1) is the discount factor¹. In this framework, we evaluate the value function only at φ(t) = π/2 and φ(t) = 3π/2. Thus, we consider our learning framework as model-based reinforcement learning for a semi-Markov decision process (SMDP) [18]. We used a function approximator with a parameter vector w_v to estimate the value function:

V̂(t) = V̂(x(t); w_v).   (10)

By considering the deviation from equation (9), we can define the temporal difference error (TD-error) [17], [18]:

δ(t) = Σ_{k=t+1}^{t_T} γ^{k−t−1} r(k) + γ^{t_T−t} V̂(t_T) − V̂(t),   (11)

where t_T is the time when φ(t_T) = π/2 or φ(t_T) = 3π/2. The update rule for the value function can be derived as

V̂(x(t)) ← V̂(x(t)) + βδ(t),   (12)

where β = 0.2 is a learning rate. The parameter vector w_v is updated by equation (19).

¹We followed the definition of the value function in [17].
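The SMDP-style update of Equations 9-12 is evaluated only at the two Poincare sections, so each update spans half a walking cycle. The sketch below shows that bookkeeping in code; it is an illustrative reconstruction, with the value function kept abstract behind `v_hat` and `update_v_hat` callables (in the paper these are the RFWR approximator of Section III-6, and the parameter update is Equation 19).

```python
def half_cycle_td_update(v_hat, update_v_hat, x_t, x_tT, rewards, gamma, beta=0.2):
    """TD update at a Poincare section (Eqs. 11-12).

    x_t      : state at the previous section (phase pi/2 or 3*pi/2)
    x_tT     : state at the current section, half a cycle later
    rewards  : list of rewards r(t+1) ... r(tT) collected in between
    v_hat(x) : current value estimate; update_v_hat(x, target) moves v_hat(x) toward target
    """
    # Discounted sum of the rewards collected during the half cycle
    discounted_r = sum(gamma ** k * r for k, r in enumerate(rewards))
    # Eq. 11: delta(t) = sum_k gamma^(k-t-1) r(k) + gamma^(tT-t) V(tT) - V(t)
    delta = discounted_r + gamma ** len(rewards) * v_hat(x_tT) - v_hat(x_t)
    # Eq. 12: V(x(t)) <- V(x(t)) + beta * delta(t)
    update_v_hat(x_t, v_hat(x_t) + beta * delta)
    return delta
```

With the reward scheme above, a successful half cycle contributes the single reward 0.1 and a fall contributes −1 and ends the trial, so in practice delta is dominated by the change in the value estimate between the two sections.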
Fig. 7. (left) Input state, (right) Output of the controller

5) Learning a policy for biped locomotion: We use a stochastic policy to generate exploratory action. The policy is represented by a probabilistic model:

µ(u(t)|x(t)) = (1 / (√(2π) σ)) exp( −(u(t) − A(x(t); w_a))² / (2σ²) ),   (13)

where A(x(t); w_a) denotes the mean of the model, which is represented by a function approximator, where w_a is a parameter vector. We changed the variance σ according to the trial as σ = (100 − N_trial)/100 + 0.1 for N_trial ≤ 100 and σ = 0.1 for N_trial > 100, where N_trial denotes the number of trials. The output of the policy is

u(t) = A(x(t); w_a) + σ n(t),   (14)

where n(t) ~ N(0, 1). N(0, 1) indicates a normal distribution which has mean 0 and variance 1.

We derive the update rule for a policy by using the value function and the estimated Poincare map.

1) Derive the gradient of the value function ∂V/∂x at the current state x(t_T).
2) Derive the gradient of the dynamics model ∂f/∂u at the previous state x(t) and the nominal action u = A(x(t); w_a).
3) Update the policy µ:

A(x; w_a) ← A(x; w_a) + α (∂V(x)/∂x) (∂f(x, u)/∂u),   (15)

where α = 0.2 is the learning rate. The parameter vector w_a is updated by equation (19). We can consider the output u(t) as an option in the SMDP [18] initiated in state x(t) at time t when φ(t) = π/2 (or φ = 3π/2), and it terminates at time t_T when φ = 3π/2 (or φ = π/2).

6) Function approximator: We used Receptive Field Weighted Regression (RFWR) [16] as the function approximator for the policy, the value function and the estimated dynamics model. Here, we approximate the target function g(x) with

ĝ(x) = ( Σ_{k=1}^{N_b} a_k(x) h_k(x) ) / ( Σ_{k=1}^{N_b} a_k(x) ),   (16)

h_k(x) = w_k^T x̃_k,   (17)

a_k(x) = exp( −(1/2)(x − c_k)^T D_k (x − c_k) ),   (18)

where c_k is the center of the k-th basis function, D_k is the distance metric of the k-th basis function, N_b is the number of basis functions, and x̃_k = ((x − c_k)^T, 1)^T is the augmented state. The update rule for the parameter w_k is given by:

Δw_k = a_k P_k x̃_k (g(x) − h_k(x)),   (19)

where

P_k ← (1/λ) [ P_k − (P_k x̃_k x̃_k^T P_k) / (λ/a_k + x̃_k^T P_k x̃_k) ],   (20)

and λ = 0.999 is the forgetting factor.

We align basis functions a_k(x) at even intervals in each dimension of the input space x = (d, ḋ) (Fig. 7) [−0.2 m ≤ d ≤ 0.2 m and −1.0 m/s ≤ ḋ ≤ 1.0 m/s]. We used 400 (= 20 × 20) basis functions for approximating the policy and the value function. We also align 20 basis functions at even intervals in the output space −0.7 rad ≤ θ_act ≤ 0.7 rad (Fig. 7). We used 8000 (= 20 × 20 × 20) basis functions for approximating the Poincare map. We set the distance metric D_k to D_k = diag{2256, 90} for the policy and the value function, and D_k = diag{2256, 90, 185} for the Poincare map. The centers of the basis functions c_k and the distance metrics of the basis functions D_k are fixed during learning.

IV. RESULTS

A. Learning foot placement

We applied the proposed method to the 5 link simulated robot (Fig. 2). We used a manually generated initial step to get the pattern started. We set the walking period to T = 0.79 sec (ω = 8.0 rad/sec).

A trial terminated after 50 steps or after the robot fell down. Figure 8(Top) shows the walking pattern before learning, and Figure 8(Middle) shows the walking pattern after 30 trials. Target knee joint angles for the swing leg were varied because of exploratory behavior (Fig. 8(Middle)).

Figure 10 shows the accumulated reward at each trial. We defined a successful trial as one in which the robot achieved 50 steps. A stable biped walking controller was acquired after 80 trials (averaged over 5 experiments). The shape of the value function is shown in Fig. 11. The maximum value of the value function is located at positive d (around d = 0.05 m) and negative ḋ (around ḋ = −0.5 m/sec).

Figure 9 shows joint angle trajectories of stable biped walking after learning. Note that the robot added energy to its initially slow walk by choosing θ_act appropriately, which affects both foot placement and the subsequent pushoff. The acquired walking pattern is shown in Fig. 8(Bottom).

B. Estimation of biped walking period

The estimated phase of the cycle φ plays an important role in this controller. It is essential that the controller phase matches the dynamics of the mechanical system. We applied our timing estimation method described in section II-A to the learned biped controller. The initial average phase was set to φ̄ = 1.0 for the left leg and φ̄ = π + 1.0 for the right leg, the frequency adaptation gain was set to K_ω = 0.3, and the phase adaptation gain was set to K_φ = 0.3.
Fig. 8. Acquired biped walking pattern: (Top) Before learning, (Middle) After 30 trials, (Bottom) After learning

Fig. 9. Joint angle trajectories after learning (actual and target left/right hip and knee angles [deg] over 6 seconds)

Fig. 10. Accumulated reward at each trial: Results of five experiments

We evaluated the combined method on a 1.0° downward slope. The simulated robot with the controller acquired in the previous section could not walk stably on the downward slope (Fig. 12(Top)). However, when we used the online estimate of the walking period and the adaptive phase resetting method with the learned controller, the robot walked stably on the 1.0° downward slope (Fig. 12(Bottom)).

C. Stability analysis of the acquired policy

We analyzed the stability of the acquired policy in terms of the Poincare map, mapping from a Poincare section at phase φ = π/2 to phase φ = 3π/2 (Fig. 6). We estimated the Jacobian matrix of the Poincare map at the Poincare sections, and checked whether |λ_i(J)| < 1 or not, where λ_i (i = 1, 2) are the eigenvalues [3], [7]. Because we used differentiable functions as function approximators, we can estimate the Jacobian matrix J based on:

J = df/dx = ∂f/∂x + (∂f/∂u)(∂u(x)/∂x).   (21)

Figure 13 shows the average eigenvalues at each trial. The eigenvalues decreased as the learning proceeded, and became stable, i.e. |λ_i(J)| < 1.
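Because both the learned Poincare map f̂ and the policy mean A(x; w_a) are smooth function approximators, the stability check of Equation 21 can also be carried out numerically. The sketch below uses simple central finite differences; the function signatures and step size are assumptions made for the example, not part of the paper.

```python
import numpy as np

def poincare_jacobian(f_hat, policy_mean, x, eps=1e-4):
    """Estimate J = df/dx + (df/du)(du/dx) at state x (Eq. 21) by finite differences.

    f_hat(x, u)    : learned Poincare map, mapping (section state, action) to the next section state
    policy_mean(x) : deterministic part A(x; w_a) of the learned policy (scalar knee-angle action)
    x              : 2D section state (d, d_dot)
    """
    x = np.asarray(x, dtype=float)
    u = policy_mean(x)
    n = x.size
    df_dx = np.zeros((n, n))
    du_dx = np.zeros(n)
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        df_dx[:, i] = (f_hat(x + dx, u) - f_hat(x - dx, u)) / (2 * eps)
        du_dx[i] = (policy_mean(x + dx) - policy_mean(x - dx)) / (2 * eps)
    df_du = (f_hat(x, u + eps) - f_hat(x, u - eps)) / (2 * eps)
    return df_dx + np.outer(df_du, du_dx)

def is_locally_stable(f_hat, policy_mean, x_fixed):
    """The acquired gait is locally stable if all eigenvalues of J lie inside the unit circle."""
    eigvals = np.linalg.eigvals(poincare_jacobian(f_hat, policy_mean, x_fixed))
    return bool(np.all(np.abs(eigvals) < 1.0))
```

This is the same criterion plotted in Fig. 13: as learning proceeds, the eigenvalue magnitudes drop below one and the walking cycle becomes locally stable.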
V. DISCUSSION

In this study, we used the swing leg knee angle θ_act to decide foot placement because the lower leg has smaller mass and tracking the target joint angle at the knee is easier than using the hip joint. However, using hip joints or using different variables for the output of the policy are interesting topics for future work. We also are considering using captured data of a human walking pattern [23] as a nominal trajectory instead of using a hand-designed walking pattern. We are currently applying the proposed approach to the physical biped robot.

In previous work, we have proposed a trajectory optimization method for biped locomotion [10], [11] based on differential dynamic programming [5], [9]. We are considering combining this trajectory optimization method with the proposed reinforcement learning method.

ACKNOWLEDGMENT

We would like to thank Mitsuo Kawato, Jun Nakanishi, Gen Endo at ATR Computational Neuroscience Laboratories, Japan, and Seiichi Miyakoshi of the Digital Human Research Center, AIST, Japan for helpful discussions. Atkeson is partially supported by NSF award ECS-0325383.

Fig. 11. Shape of acquired value function (value over d position [m] and d velocity [m/sec])

Fig. 12. Biped walking pattern with timing adaptation on downward slope: (Top) Without timing adaptation, (Bottom) With timing adaptation

Fig. 13. Averaged eigenvalue of Jacobian matrix at each trial (|λ1| and |λ2| over 150 trials)

REFERENCES

[1] H. Benbrahim and J. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22:283–302, 1997.
[2] C. Chew and G. A. Pratt. Dynamic bipedal walking assisted by learning. Robotica, 20:477–491, 2002.
[3] R. Q. Van der Linde. Passive bipedal walking with phasic muscle contraction. Biological Cybernetics, 82:227–237, 1999.
[4] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
[5] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.
[6] T. Flash and N. Hogan. The coordination of arm movements: An experimentally confirmed mathematical model. The Journal of Neuroscience, 5:1688–1703, 1985.
[7] M. Garcia, A. Chatterjee, A. Ruina, and M. J. Coleman. The simplest walking model: stability, complexity, and scaling. ASME Journal of Biomechanical Engineering, 120(2):281–288, 1998.
[8] K. Hirai, M. Hirose, and T. Takenaka. The development of Honda humanoid robot. In Proceedings of the 1998 IEEE International Conference on Robotics and Automation, pages 160–165, 1998.
[9] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.
[10] J. Morimoto and C. G. Atkeson. Robust low-torque biped walking using differential dynamic programming with a minimax criterion. In Philippe Bidaud and Faiz Ben Amar, editors, Proceedings of the 5th International Conference on Climbing and Walking Robots, pages 453–459. Professional Engineering Publishing, Bury St Edmunds and London, UK, 2002.
[11] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1563–1570. MIT Press, Cambridge, MA, 2003.
[12] K. Nagasaka, M. Inaba, and H. Inoue. Stabilization of dynamic walk on a humanoid using torso position compliance control. In Proceedings of the 17th Annual Conference of the Robotics Society of Japan, pages 1193–1194, 1999.
[13] Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for biped robot. In Proceedings of the 2nd International Symposium on Adaptive Motion of Animals and Machines, pages ThP–II–5, 2003.
[14] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion with dynamical movement primitives. In Workshop on Robot Programming by Demonstration, IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 2003.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems (to appear), 2004.
[16] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047–2084, 1998.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[18] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[19] K. Tsuchiya, S. Aoi, and K. Tsujita. Locomotion control of a biped locomotion robot using nonlinear oscillators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1745–1750, Las Vegas, NV, USA, 2003.
[20] M. Vukobratovic, B. Borovac, D. Surla, and D. Stokic. Biped Locomotion: Dynamics, Stability, Control and Applications. Springer-Verlag, Berlin, 1990.
[21] Y. Wada and M. Kawato. A theory for cursive handwriting based on the minimization principle. Biological Cybernetics, 73:3–15, 1995.
[22] J. Yamaguchi, A. Takanishi, and I. Kato. Development of a biped walking robot compensating for three-axis moment by trunk motion. Journal of the Robotics Society of Japan, 11(4):581–586, 1993.
[23] K. Yamane and Y. Nakamura. Dynamics filter – concept and implementation of on-line motion generator for human figures. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, pages 688–693, 2000.
[24] T. Yamasaki, T. Nomura, and S. Sato. Possible functional roles of phase resetting during walking. Biological Cybernetics, 88(6):468–496, 2003.
Books
etcetera

Reinforcement Learning:
An Introduction
by Sutton, R.S. and Barto, A.G., MIT Press (1998). £31.95 (xi + 322 pages)
ISBN 0 262 19398 1

Reinforcement is a term with different can expect from that state into the dis-
meanings for different people. In the tant future; that is. they represent long-
psychological lexicon, it conjures up the term evaluations. In this sense, value
ominous images of Pavlov, Thorndike, functions represent something more
Skinner and their intellectual brethren. akin to judgements on the likely payoffs The present book is an excellent
Although these behaviorists helped to operationalize experimental psychology, their insistence on the non-existence of internal mental states provided a roadblock to modern cognitive science. In fact, the rise of cognitive science since the 1950s could be viewed as a rejection of the stultifying behaviorist views that declared the mind to be a vacuous construct. With such a salvo on behaviorists, what is a review of a book on reinforcement learning doing in Trends in Cognitive Sciences? The short answer is that reinforcement, in the context of the new book by Sutton and Barto, is not what it seems. ‘Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal’, according to the introduction of the book.

The primary aim here is to cast learning as a problem involving agents that interact with an environment, sense their state and the state of the environment, and choose actions based on these interactions (which sounds very much like a bug or a rat moving about in some territory in search of food or mates). The twist in reinforcement learning is that the agent comes pre-equipped with goals that it seeks to satisfy. These goals are embodied in the influence of a ‘numerical reward signal’ on the way that the agent chooses actions, categorizes its sensations and changes its internal model of the environment. Despite the obvious connection of these terms to behavioral psychology, some of the more impressive applications of reinforcement learning have been in computer science and engineering applications. For example, Tesauro’s TD-gammon, a reinforcement-learning system, is now one of the best backgammon players in the world¹.

Reinforcement learning typically divides a problem into four parts: (1) a policy; (2) a reward function; (3) a value function; and (4) an internal model of the environment. In this context, a policy is similar to an association in psychological terms; it maps states to actions (behavioral choices). One interesting part of reinforcement-learning problems is the complementary concepts of goals and evaluation. A reward function provides a numerical evaluation of a state, and therefore embodies the agent’s definition of what is immediately ‘good’ and what is immediately ‘bad’. By contrast, value functions evaluate a state in terms of the total amount of reward an agent can expect to accumulate over the states that will follow the current state.

Some of the most exciting work in reinforcement learning has taken place in the past 10 years with the discovery of several mathematical connections between separate methods for solving reinforcement-learning problems. These connections showed that apparently disparate mathematical techniques for solving reinforcement-learning problems were related in fundamental ways. This book provides the best historical details of these mathematical connections found anywhere, and frames clearly the ideas underlying this history.

What is the direct conceptual payoff of reinforcement learning for cognitive science? The descriptions so far show that reinforcement-learning problems could arise in a number of settings. Why should we expect this framework to enrich our understanding of cognition or the connection of the brain to cognition? I think that the direct benefit is twofold. The first benefit is that the lexicon of reinforcement learning is appropriate for describing the problems faced by mobile creatures in a complex, stochastic environment, in which the evaluation of a sequence of decisions might be significantly delayed. Consonant with the appropriateness of the lexicon, a number of modern efforts have successfully used reinforcement learning to describe biological systems related to motor learning in the cerebellum, and reward learning by dopaminergic systems²,³.

The second benefit is the emphasis that reinforcement learning places on representation. This emphasis emerges from the two serious complaints about reinforcement learning as a framework for artificial intelligence or models of brain function: (1) speed, and (2) the size of the state space⁴. For even modest problems, the state space can be huge (e.g. for backgammon, the state space is ~10^20 states). If any sizeable fraction of this state space must be explored for a reinforcement-learning system to converge to an answer, then one might have to wait an unacceptably long time for a suitable answer to emerge. These problems were a likely source of discouragement for early work in reinforcement learning. However, more modern work has shown that if careful consideration is given to the representations of states or actions, then reinforcement-learning systems can be a powerful way of learning certain problems.

This book is an excellent entry point for someone who wants to understand intuitively the ideas of reinforcement learning and the general connection between its parts. It is not, however, a mathematical ‘how-to’ book, replete with proofs and pointers to unsolved problems in the field (as are, for example, Refs 3,5).

The end of each chapter contains a scholarly set of biographical and historical notes. These sections are particularly pleasing because they provide an easy-to-read review of the history of papers and ideas that contributed to the chapter in question. The authors go above and beyond the call of duty in these sections by providing their own perspective on how and why subfields developed in particular ways. Their effort is useful because this kind of perspective is very difficult to come by, yet it often provides conceptual insights by demonstrating which paths of investigation resulted from historical accident or the prevailing biases of the day. Furthermore, these sections are accessible to the casual peruser as well as the serious student seeking a historical record of publication on the subject.

Anyone interested in the internal representation of goals should read this book. In particular, the success of TD-gammon, and the connection of reinforcement-learning algorithms to the function of identified neural systems, suggests that reinforcement learning might have a lot more yet to say about cognition. That possibility awaits future evaluation.

P. Read Montague
Division of Neuroscience, Baylor College of Medicine, Houston, TX 77030, USA.
tel: +1 713 798 3134
fax: +1 713 798 3130
e-mail: read@bohr.neusc.bcm.tmc.edu

References
1 Tesauro, G.J. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219
2 Schultz, W., Dayan, P. and Montague, P.R. (1997) A neural substrate of prediction and reward. Science 275, 1593–1599
3 Bertsekas, D.P. and Tsitsiklis, J.N. (1996) Neuro-Dynamic Programming, Athena Scientific
4 Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285
5 Bellman, R. (1957) Dynamic Programming, Princeton University Press
Trends in Cognitive Sciences – Vol. 3, No. 9, September 1999
Simple random search of static linear policies is
competitive for reinforcement learning

Horia Mania Aurelia Guy Benjamin Recht


hmania@berkeley.edu lia@berkeley.edu brecht@berkeley.edu

Department of Electrical Engineering and Computer Science


University of California, Berkeley

Abstract
Model-free reinforcement learning aims to offer off-the-shelf solutions for con-
trolling dynamical systems without requiring models of the system dynamics. We
introduce a model-free random search algorithm for training static, linear policies
for continuous control problems. Common evaluation methodology shows that our
method matches state-of-the-art sample efficiency on the benchmark MuJoCo loco-
motion tasks. Nonetheless, more rigorous evaluation reveals that the assessment
of performance on these benchmarks is optimistic. We evaluate the performance
of our method over hundreds of random seeds and many different hyperparameter
configurations for each benchmark task. This extensive evaluation is possible
because of the small computational footprint of our method. Our simulations
highlight a high variability in performance in these benchmark tasks, indicating
that commonly used estimations of sample efficiency do not adequately evaluate
the performance of RL algorithms. Our results stress the need for new baselines,
benchmarks and evaluation methodology for RL algorithms.

1 Introduction
Model-free reinforcement learning (RL) aims to offer off-the-shelf solutions for controlling dynamical
systems without requiring models of the system dynamics. Such methods have successfully produced
RL agents that surpass human players in video games and games such as Go [16, 28]. Although
these results are impressive, model-free methods have not yet been successfully deployed to control
physical systems, outside of research demos. There are several factors prohibiting the adoption of
model-free RL methods for controlling physical systems: the methods require too much data to
achieve reasonable performance, the ever-increasing assortment of RL methods makes it difficult to
choose what is the best method for a specific task, and many candidate algorithms are difficult to
implement and deploy [11].
Unfortunately, the current trend in RL research has put these impediments at odds with each other.
In the quest to find methods that are sample efficient (i.e. methods that need little data) the general
trend has been to develop increasingly complicated methods. This increasing complexity has led to a
reproducibility crisis. Recent studies demonstrate that many RL methods are not robust to changes in
hyperparameters, random seeds, or even different implementations of the same algorithm [11, 12].
Algorithms with such fragilities cannot be integrated into mission critical control systems without
significant simplification and robustification.
Furthermore, it is common practice to evaluate and compare new RL methods by applying them to
video games or simulated continuous control problems and measure their performance over a small
number of independent trials (i.e., fewer than ten random seeds) [8–10, 14, 17, 19, 21–27, 31, 32].
The most popular continuous control benchmarks are the MuJoCo locomotion tasks [3, 29], with

the Humanoid model being considered “one of the most challenging continuous control problems
solvable by state-of-the-art RL techniques [23].” In principle, one can use video games and simulated
control problems for beta testing new ideas, but simple baselines should be established and thoroughly
evaluated before moving towards more complex solutions.
To this end, we aim to determine the simplest model-free RL method that can solve standard
benchmarks. Recently, two different directions have been proposed for simplifying RL. Salimans
et al. [23] introduced a derivative-free policy optimization method, called Evolution Strategies. The
authors showed that, for several RL tasks, their method can easily be parallelized to train policies
faster than other methods. While the method of Salimans et al. [23] is simpler than previously
proposed methods, it employs several complicated algorithmic elements, which we discuss at the end
of Section 3. As a second simplification to model-free RL, Rajeswaran et al. [22] have shown that
linear policies can be trained via natural policy gradients to obtain competitive performance on the
MuJoCo locomotion tasks, showing that complicated neural network policies are not needed to solve
these continuous control problems. In this work, we combine ideas from the work of Salimans et al.
[23] and Rajeswaran et al. [22] to obtain the simplest model-free RL method yet, a derivative-free
optimization algorithm for training static, linear policies. We demonstrate that a simple random
search method can match or exceed state-of-the-art sample efficiency on the MuJoCo locomotion
tasks, included in the OpenAI Gym.
Henderson et al. [11] and Islam et al. [12] pointed out that standard evaluation methodology does not
accurately capture the performance of RL methods by showing that existing RL algorithms exhibit
high sensitivity to both the choice of random seed and the choice of hyperparameters. We show
similar limitations of common evaluation methodology through a different lens. We exhibit a simple
derivative-free optimization algorithm which matches or surpasses the performance of more complex
methods when using the same evaluation methodology. However, a more thorough evaluation of
ARS reveals worse performance. Moreover, our method uses static linear policies and a simple local
exploration scheme, which might be limiting for more difficult RL tasks. Therefore, better evaluation
schemes are needed for determining the benefits of more complex RL methods. Our contributions are
as follows:

• In Section 3, for applications to continuous control, we augment a basic random search method with three simple features. First, we scale each update step by the standard deviation
of the rewards collected for computing that update step. Second, we normalize the system’s
states by online estimates of their mean and standard deviation. Third, we discard from
the computation of the update steps the directions that yield the least improvement of the
reward. We refer to this method as Augmented Random Search (ARS).
• In Section 4, we evaluate the performance of ARS on the benchmark MuJoCo locomotion
tasks, included in the OpenAI Gym. Our method learns static, linear policies that achieve
high rewards on all MuJoCo tasks. No neural networks are used, and yet state-of-the-art
average rewards are achieved. For example, for Humanoid-v1 ARS finds linear policies
which achieve average rewards of over 11500, the highest value reported in the literature.
To put ARS on equal footing with competing methods, we evaluate its sample complexity
over three random seeds and compare it to results reported in the literature [9, 22, 23, 26].
ARS matches or exceeds state-of-the-art sample efficiency on the locomotion tasks when
using standard evaluation methodology.
• For a more thorough evaluation, we measured the performance of ARS over a hundred
random seeds and also evaluated its sensitivity to hyperparameter choices. Though ARS
successfully trains policies for the MuJoCo tasks a large fraction of the time when hyper-
parameters and random seeds are varied, ARS exhibits large variance. We measure the
frequency with which ARS finds policies that yield suboptimal locomotion gaits.

2 Problem setup

Problems in reinforcement learning require finding policies for controlling dynamical systems that
maximize an average reward. Such problems can be abstractly formulated as
    max_{θ ∈ ℝ^d}  E_ξ[ r(π_θ, ξ) ],                      (1)

where θ parametrizes a policy π_θ : ℝ^n → ℝ^p. The random variable ξ encodes the randomness of the
environment, i.e., random initial states and stochastic transitions. The value r(π_θ, ξ) is the reward
achieved by the policy π_θ on one trajectory generated from the system. In general one could use
stochastic policies π_θ, but our proposed method uses deterministic policies.

Basic random search. Note that the problem formulation (1) aims to optimize reward by directly
optimizing over the policy parameters ✓. We consider methods which explore in the parameter
space rather than the action space. This choice renders RL training equivalent to derivative-free
optimization with noisy function evaluations. One of the simplest and oldest optimization methods
for derivative-free optimization is random search [15].
A primitive form of random search, which we call basic random search (BRS), simply computes a
finite difference approximation along the random direction and then takes a step along this direction
without using a line search. Our method ARS, described in Section 3, is based on this simple strategy.
For updating the parameters θ of a policy π_θ, BRS and ARS exploit update directions of the form:

    r(π_{θ+νδ}, ξ_1) − r(π_{θ−νδ}, ξ_2),                      (2)

for two i.i.d. random variables ξ_1 and ξ_2, ν a positive real number, and δ a zero-mean Gaussian
vector. It is known that such an update increment is an unbiased estimator of the gradient with
respect to θ of E_δ E_ξ[r(π_{θ+νδ}, ξ)], a smoothed version of the objective (1) which is close to the
original objective when ν is small [20]. When the function evaluations are noisy, minibatches
can be used to reduce the variance in this gradient estimate. Evolution Strategies is a version of
this algorithm with several complicated algorithmic enhancements [23]. Another version of this
algorithm is called Bandit Gradient Descent by Flaxman et al. [6]. The convergence of random search
methods for derivative free optimization has been understood for several types of convex optimization
[1, 2, 13, 20]. Jamieson et al. [13] offer an information theoretic lower bound for derivative free
convex optimization and show that a coordinate based random search method achieves the lower
bound with nearly optimal dependence on the dimension.
The rewards r(π_{θ+νδ}, ξ_1) and r(π_{θ−νδ}, ξ_2) in Eq. (2) are obtained by collecting two trajectories
from the dynamical system of interest, according to the policies π_{θ+νδ} and π_{θ−νδ}, respectively. The
random variables ξ_1, ξ_2, and δ are mutually independent, and independent from previous trajectories.
One trajectory is called an episode or a rollout. The goal of RL algorithms is to approximately solve
problem (1) by using as few rollouts from the dynamical system as possible.
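To make the BRS update (2) concrete, here is a minimal Python sketch; the Gym-style environment interface and the helper names are our own illustrative assumptions, not code from this paper.

```python
import numpy as np

def rollout_reward(env, theta, horizon=1000):
    """Run one episode with the deterministic linear policy a = theta @ s
    and return its total reward (assumes a Gym-style reset/step interface)."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state, reward, done, _ = env.step(theta @ state)
        total += reward
        if done:
            break
    return total

def brs_step(env, theta, step_size=0.02, nu=0.03, num_directions=8):
    """One basic random search update: average the finite-difference
    increment of Eq. (2) over several random Gaussian directions."""
    update = np.zeros_like(theta)
    for _ in range(num_directions):
        delta = np.random.randn(*theta.shape)
        r_plus = rollout_reward(env, theta + nu * delta)
        r_minus = rollout_reward(env, theta - nu * delta)
        update += (r_plus - r_minus) * delta
    return theta + step_size / num_directions * update
```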

3 Our proposed algorithm


We now introduce the Augmented Random Search (ARS) method, which relies on three augmenta-
tions of BRS that build on successful heuristics employed in deep reinforcement learning. Throughout
the rest of the paper we use M to denote the parameters of policies because our method uses linear
policies, and hence M is a p × n matrix. The different versions of ARS are detailed in Algorithm 1.
The first version, ARS V1, is obtained from BRS by scaling the update steps by the standard deviation
σ_R of the rewards collected at each iteration; see Line 7 of Algorithm 1. As shown in Section 4,
ARS V1 can train linear policies, which achieve the reward thresholds previously proposed in the
literature, for five MuJoCo benchmarks. However, ARS V1 requires a larger number of episodes, and
it cannot train policies for the Humanoid-v1 task. To address these issues in Algorithm 1 we also
propose ARS V2. This version of ARS trains policies which are linear maps of states normalized
by a mean and standard deviation computed online. Finally, to further enhance the performance of
ARS, we introduce a third algorithmic enhancement, shown in Algorithm 1 as ARS V1-t and ARS
V2-t. These versions of ARS can drop perturbation directions that yield the least improvement of the
reward. Now, we motivate and offer intuition for each of these algorithmic elements.

Scaling by the standard deviation σ_R. As the training of policies progresses, random search in the
parameter space of policies can lead to large variations in the rewards observed across iterations. As
a result, it is difficult to choose a fixed step-size α which does not allow harmful variations in the size
of the update steps. Salimans et al. [23] address this issue by transforming the rewards into rankings
and then using the adaptive optimization algorithm Adam for computing the update step. Both of

Algorithm 1 Augmented Random Search (ARS): four versions V1, V1-t, V2 and V2-t
1: Hyperparameters: step-size α, number of directions sampled per iteration N, standard deviation of the exploration noise ν, number of top-performing directions to use b (b < N is allowed only for V1-t and V2-t)
2: Initialize: M_0 = 0 ∈ ℝ^{p×n}, μ_0 = 0 ∈ ℝ^n, and Σ_0 = I_n ∈ ℝ^{n×n}, j = 0.
3: while ending condition not satisfied do
4:   Sample δ_1, δ_2, ..., δ_N in ℝ^{p×n} with i.i.d. standard normal entries.
5:   Collect 2N rollouts of horizon H and their corresponding rewards using the 2N policies

         V1:  π_{j,k,+}(x) = (M_j + ν δ_k) x
              π_{j,k,−}(x) = (M_j − ν δ_k) x

         V2:  π_{j,k,+}(x) = (M_j + ν δ_k) diag(Σ_j)^{−1/2} (x − μ_j)
              π_{j,k,−}(x) = (M_j − ν δ_k) diag(Σ_j)^{−1/2} (x − μ_j)

     for k ∈ {1, 2, ..., N}.
6:   V1-t, V2-t: Sort the directions δ_k by max{r(π_{j,k,+}), r(π_{j,k,−})}, denote by δ_{(k)} the k-th largest direction, and by π_{j,(k),+} and π_{j,(k),−} the corresponding policies.
7:   Make the update step:

         M_{j+1} = M_j + (α / (b σ_R)) Σ_{k=1}^{b} [ r(π_{j,(k),+}) − r(π_{j,(k),−}) ] δ_{(k)},

     where σ_R is the standard deviation of the 2b rewards used in the update step.
8:   V2: Set μ_{j+1}, Σ_{j+1} to be the mean and covariance of the 2NH(j + 1) states encountered from the start of training.¹
9:   j ← j + 1
10: end while

these techniques change the direction of the updates, obfuscating the behavior of the algorithm and
making it difficult to ascertain the objective Evolution Strategies is actually optimizing. Instead, to
address the large variations of the differences r(π_{M+νδ}) − r(π_{M−νδ}), we scale the update steps by
the standard deviation σ_R of the 2N rewards collected at each iteration (see Line 7 of Algorithm 1).
While training a policy for Humanoid-v1, we observed that the standard deviations σ_R have an
increasing trend; see Figure 2 in Appendix A.2. This behavior occurs because perturbations of the
policy weights at high rewards can cause Humanoid-v1 to fall early, yielding large variations in the
rewards collected. Without scaling the update steps by σ_R, eventually random search would take
update steps which are a thousand times larger than in the beginning of training. Therefore, σ_R
adapts the step sizes according to the local sensitivity of the rewards to perturbations of the policy
parameters. The same training performance could probably be obtained by tuning a step size schedule.
However, one of our goals was to minimize the amount of tuning required.
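For concreteness, the following sketch implements one V2-t iteration of Algorithm 1 in NumPy, including the σ_R scaling discussed above; the run_episode callable and all names are our own assumptions rather than the released ARS implementation.

```python
import numpy as np

def ars_v2t_step(M, mu, sigma_diag, run_episode, alpha=0.02, nu=0.03, N=16, b=8):
    """One ARS V2-t iteration (a sketch of Lines 4-8 of Algorithm 1).
    run_episode(policy) rolls out one episode with policy(state) -> action
    and returns its total reward."""
    whiten = 1.0 / np.sqrt(np.maximum(sigma_diag, 1e-8))

    def make_policy(W):
        # V2 policies act on whitened states: pi(x) = W diag(Sigma)^(-1/2) (x - mu)
        return lambda x: W @ (whiten * (x - mu))

    deltas = [np.random.randn(*M.shape) for _ in range(N)]
    r_plus = [run_episode(make_policy(M + nu * d)) for d in deltas]
    r_minus = [run_episode(make_policy(M - nu * d)) for d in deltas]

    # Keep only the b directions with the largest max(r+, r-): the "-t" truncation.
    top = sorted(range(N), key=lambda k: max(r_plus[k], r_minus[k]), reverse=True)[:b]
    sigma_R = np.std([r_plus[k] for k in top] + [r_minus[k] for k in top]) + 1e-8

    step = sum((r_plus[k] - r_minus[k]) * deltas[k] for k in top)
    return M + alpha / (b * sigma_R) * step
```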

Normalization of the states. The normalization of states used by ARS V2 is akin to data whitening
for regression tasks. Intuitively, it ensures that policies put equal weight on the different components
of the states. To see why this might help, suppose that a state coordinate only takes values in the
range [90, 100] while another state component takes values in the range [−1, 1]. Then, small changes
in the control gain with respect to the first state coordinate would lead to larger changes in the actions
than the same sized changes with respect to the second state component. Hence, state normalization
allows different state components to have equal influence during training.
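The online mean and standard deviation used for this whitening can be maintained with a standard Welford-style accumulator over the diagonal; the sketch below is our own minimal version, not the authors' code.

```python
import numpy as np

class RunningStat:
    """Diagonal running mean/variance of observed states (a Welford-style sketch)."""
    def __init__(self, n):
        self.count = 0
        self.mean = np.zeros(n)
        self.m2 = np.zeros(n)          # sum of squared deviations from the running mean

    def push(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Fall back to unit variance until at least two states have been seen.
        return self.m2 / self.count if self.count > 1 else np.ones_like(self.mean)
```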
Previous work has also implemented such state normalization for fitting a neural network model for
several MuJoCo environments [19]. A similar normalization is used by ES as part of the virtual batch
normalization of the neural network policies [23]. In the case of ARS, the state normalization can be
seen as a form of non-isotropic exploration in the parameter space of linear policies.

¹ Of course, we implement this in an efficient way that does not require the storage of all the states. Also, we only keep track of the diagonal of Σ_{j+1}. Finally, to ensure that the ratio 0/0 is treated as 0, if a diagonal entry of Σ_j is smaller than 10⁻⁸ we make it equal to +∞.
The main empirical motivation for ARS V2 comes from the Humanoid-v1 task. We were not able to
train a linear policy for this task without the normalization of the states described in Algorithm 1.
Moreover, ARS V2 performs better than ARS V1 on other MuJoCo tasks as well, as shown in
Section 4. However, the usefulness of state normalization is likely to be problem specific.

Using top performing directions. To further improve the performance of ARS on the MuJoCo
locomotion tasks, we propose ARS V1-t and V2-t. In the update steps used by ARS V1 and V2,
each perturbation direction δ_k is weighted by the difference of the rewards r(π_{j,k,+}) and r(π_{j,k,−}).
If r(π_{j,k,+}) > r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of δ_k. If r(π_{j,k,+}) <
r(π_{j,k,−}), ARS pushes the policy weights M_j in the direction of −δ_k. However, since r(π_{j,k,+})
and r(π_{j,k,−}) are noisy evaluations of the performance of the policies parametrized by M_j + νδ_k
and M_j − νδ_k, ARS V1 and V2 might push the weights M_j in the direction −δ_k even when δ_k is
better, or vice versa. Moreover, there can be perturbation directions δ_k such that updating the policy
weights M_j in either the direction δ_k or −δ_k would lead to sub-optimal performance. To address
these issues, ARS V1-t and V2-t order the perturbation directions δ_k decreasingly, according to
max{r(π_{j,k,+}), r(π_{j,k,−})}, and then use only the top b directions for updating the policy weights;
see Line 7 of Algorithm 1.
This algorithmic enhancement intuitively improves the performance of ARS because it ensures
that the update steps are an average over directions that obtained high rewards. However, without
theoretical investigation we cannot be certain of the effect of using this algorithmic enhancement, i.e.,
choosing b < N . When b = N versions V1-t and V2-t are equivalent to V1 and V2. Therefore, it is
certain that after tuning ARS V1-t and V2-t, they will not perform any worse than ARS V1 and V2.

Comparison to Salimans et al. [23]. ARS simplifies Evolution Strategies in several ways. First,
ES feeds the gradient estimate into the Adam algorithm. Second, instead of using the actual reward
values r(θ ± σε_i), ES transforms the rewards into rankings and uses the ranks to compute update
steps. The rankings are used to make training more robust. Instead, our method scales the update
steps by the standard deviation of the rewards. Third, ES bins the action space of the Swimmer-v1
and Hopper-v1 to encourage exploration. Our method surpasses ES without such binning. Fourth, ES
relies on policies parametrized by neural networks with virtual batch normalization, while we show
that ARS achieves state-of-the-art performance with linear policies.

4 Empirical results on the MuJoCo locomotion tasks

Implementation details. We implemented a parallel version of Algorithm 1 using the Python
library Ray [18]. To avoid the computational bottleneck of communicating perturbations δ, we
created a shared noise table which stores independent standard normal entries. Then, instead of
communicating perturbations δ, the workers communicate indices in the shared noise table. This
approach has been used in the implementation of Evolution Strategies by Moritz et al. [18] and is
similar to the approach proposed by Salimans et al. [23]. Our code sets the random seeds for the
random generators of all the workers and for all copies of the OpenAI Gym environments held by
the workers. All these random seeds are distinct and are a function of a single integer to which we
refer as the random seed. Furthermore, we made sure that the states and rewards produced during the
evaluation rollouts were not used in any form during training.
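To illustrate the shared-noise-table idea (our own sketch, assuming a Ray-style setup in which every worker holds a read-only copy of the table; this is not the released code), workers exchange integer indices instead of perturbation matrices:

```python
import numpy as np

class SharedNoiseTable:
    """A large block of standard normal entries shared (read-only) by all workers."""
    def __init__(self, size=25_000_000, seed=123):
        self.noise = np.random.RandomState(seed).randn(size)

    def sample_index(self, dim, rng):
        # A perturbation is identified by its starting index, not by its entries.
        return rng.randint(0, len(self.noise) - dim + 1)

    def get(self, index, shape):
        p, n = shape
        return self.noise[index:index + p * n].reshape(p, n)
```

With such a table, a worker only sends back the sampled index and the two episode rewards, which keeps the communication cost per perturbation constant.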
We evaluate the performance of ARS on the MuJoCo locomotion tasks included in the OpenAI Gym-
v0.9.3 [3, 29]. The OpenAI Gym provides benchmark reward functions for the different MuJoCo
locomotion tasks. We used these default reward functions for evaluating the performance of the
linear policies trained with ARS. The reported rewards obtained by a policy were averaged over
100 independent rollouts. For the Hopper-v1, Walker2d-v1, Ant-v1, and Humanoid-v1 tasks the
default reward functions include a survival bonus, which rewards RL agents with a constant reward at
each timestep, as long as a termination condition (i.e., falling over) has not been reached. During
training, we removed these survival bonuses, a choice we motivate in Appendix A.1. We also defer to
Appendix A.3 the sensitivity analysis of ARS to the choice of hyperparameters.

Three random seeds evaluation: We compare the different versions of ARS to the following
methods: Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG),
Natural Gradients (NG), Evolution Strategies (ES), Proximal Policy Optimization (PPO), Soft
Actor Critic (SAC), Soft Q-Learning (SQL), A2C, and the Cross Entropy Method (CEM). For the
performance of these methods we used values reported by Rajeswaran et al. [22], Salimans et al. [23],
Schulman et al. [26], and Haarnoja et al. [9]. In light of well-documented reproducibility issues of
reinforcement learning methods [11, 12], reporting the values listed in papers rather than rerunning
these algorithms casts prior work in the most favorable light possible.
Rajeswaran et al. [22] and Schulman et al. [26] evaluated the performance of RL algorithms on three
random seeds, while Salimans et al. [23] and Haarnoja et al. [9] used six and five random seeds
respectively. To put all methods on equal footing, for the evaluation of ARS, we sampled three
random seeds uniformly from the interval [0, 1000) and fixed them. For each of the six OpenAI Gym
MuJoCo locomotion tasks we chose a grid of hyperparameters², shown in Appendix A.6, and for
each set of hyperparameters we ran ARS V1, V2, V1-t, and V2-t three times, once for each of the
three fixed random seeds.
Table 1 shows the average number of episodes required by ARS, NG, and TRPO to reach a prescribed
reward threshold, using the values reported by Rajeswaran et al. [22] for NG and TRPO. For each
version of ARS and each MuJoCo task we chose the hyperparameters which minimize the average
number of episodes required to reach the reward threshold. The corresponding training curves of
ARS are shown in Figure 3 of Appendix A.2. For all MuJoCo tasks, except Humanoid-v1, we used
the same reward thresholds as Rajeswaran et al. [22]. Our choice to increase the reward threshold for
Humanoid-v1 is motivated by the presence of the survival bonuses, as discussed in Appendix A.1.

Average # episodes to reach reward threshold

Task            Threshold   ARS V1   ARS V1-t   ARS V2   ARS V2-t   NG-lin    NG-rbf    TRPO-nn
Swimmer-v1      325         100      100        427      427        1450      1550      N/A³
Hopper-v1       3120        89493    51840      3013     1973       13920     8640      10000
HalfCheetah-v1  3430        10240    8106       2720     1707       11250     6000      4250
Walker2d-v1     4390        392000   166133     89600    24000      36840     25680     14250
Ant-v1          3580        101066   58133      60533    20800      39240     30000     73500
Humanoid-v1     6000        N/A      N/A        142600   142600     ≈130000   ≈130000   UNK⁴
Table 1: A comparison of ARS, NG, and TRPO on the MuJoCo locomotion tasks. For each task we
show the average number of episodes required to achieve a prescribed reward threshold, averaged
over three random seeds. We estimated the number of episodes required by NG to reach a reward of
6000 for Humanoid-v1 based on the learning curves presented by Rajeswaran et al. [22].

Table 1 shows that ARS V1 can train policies for all tasks except Humanoid-v1, which is successfully
solved by ARS V2. Secondly, we note that ARS V2 reaches the prescribed thresholds for Swimmer-
v1, Hopper-v1, and HalfCheetah-v1 faster than NG or TRPO, and matches the performance of NG
on the Humanoid-v1. On Walker2d-v1 and Ant-v1, ARS V2 is outperformed by NG. Nonetheless,
ARS V2-t surpasses the performance of NG on these two tasks. Although TRPO hits the reward
threshold for Walker2d-v1 faster than ARS, our method either matches or surpasses TRPO in the
metrics reported by Haarnoja et al. [9] and Schulman et al. [26].
Precise comparisons to more RL methods are provided in Appendix A.2. Here we offer a summary.
Salimans et al. [23] reported the average number of episodes required by ES to reach a prescribed
reward threshold, on four of the locomotion tasks. ARS surpassed ES on all of those tasks. Haarnoja
et al. [9] reported the maximum reward achieved by SAC, DDPG, SQL, and TRPO after a prescribed
number of timesteps, on four of the locomotion tasks. With the exception of SAC on HalfCheetah-v1
and Ant-v1, ARS outperformed competing methods. Schulman et al. [26] reported the maximum
reward achieved by PPO, A2C, CEM, and TRPO after a prescribed number of timesteps, on four of

² Recall that ARS V1 and V2 take in only three hyperparameters: the step-size α, the number of perturbation directions N, and the scale of the perturbations ν. ARS V1-t and V2-t take in an additional hyperparameter, the number of top directions used b (b ≤ N).
³ N/A means that the method did not reach the reward threshold.
⁴ UNK stands for unknown.

the locomotion tasks. With the exception of PPO on Walker2d-v1, ARS matched or surpassed the
performance of competing methods.

A hundred seeds evaluation: For a more thorough evaluation of ARS, we sampled 100 distinct
random seeds uniformly at random from the interval [0, 10000). Then, using the hyperparameters
selected for Table 1, we ran ARS for each of the six MuJoCo locomotion tasks and the 100 random
seeds. The results are shown in Figure 1. Such a thorough evaluation was feasible because ARS
has a small computational footprint. As discussed in Appendix A.3, ARS is at least 15 times more
computationally efficient on the MuJoCo benchmarks than competing methods.
Figure 1 shows that 70% of the time ARS trains policies for all the MuJoCo locomotion tasks, with
the exception of Walker2d-v1 for which it succeeds only 20% of the time. Moreover, ARS succeeds
at training policies a large fraction of the time while using a competitive number of episodes.

[Figure 1: "Average reward evaluated over 100 random seeds, shown by percentile." Six panels (Swimmer-v1, Hopper-v1, HalfCheetah-v1, Walker2d-v1, Ant-v1, Humanoid-v1) plot average reward against the number of episodes, with percentile bands.]
Figure 1: An evaluation of ARS over 100 random seeds on the MuJoCo locomotion tasks. The
dotted lines represent median rewards and the shaded regions represent percentiles. For Swimmer-v1
we used ARS V1. For Hopper-v1, Walker2d-v1, and Ant-v1 we used ARS V2-t. For HalfCheetah-v1
and Humanoid-v1 we used ARS V2.

There are two types of random seeds represented in Figure 1 that cause ARS to not reach high rewards.
There are random seeds on which ARS eventually finds high reward policies when sufficiently many
iterations of ARS are performed, and there are random seeds which lead ARS to discover locally
optimal behaviors. For the Humanoid model, ARS found numerous distinct gaits, including ones
during which the Humanoid hops only on one leg, walks backwards, or moves in a swirling motion.
Such gaits were found by ARS on the random seeds which cause slower training. While multiple
gaits for Humanoid models have been previously observed [10], our evaluation better emphasizes
their prevalence. The presence of local optima is inherent to non-convex optimization, and our results
show that RL algorithms should be evaluated on many random seeds for determining the frequency
with which local optima are found. Finally, we remark that ARS is the least sensitive to the choice of
random seed used when applied to HalfCheetah-v1, a task which is often used for the evaluation of
sensitivity of algorithms to the choice of random seeds.

Linear policies are sufficiently expressive for MuJoCo: We discussed how linear policies can
produce diverse gaits for the MuJoCo models, showing that they are sufficiently expressive to capture
diverse behaviors. Table 2 shows that linear policies can also achieve high rewards on all the MuJoCo
locomotion tasks. In particular, for Humanoid-v1 and Walker2d-v1, ARS found policies that achieve
significantly higher rewards than any other results we encountered in the literature. These results
show that linear policies are perfectly adequate for the MuJoCo locomotion tasks, reducing the need
for more expressive and more computationally expensive policies.

Maximum reward achieved

Task            ARS     Task              ARS     Task           ARS
Swimmer-v1      365     HalfCheetah-v1    6722    Ant-v1         5146
Hopper-v1       3909    Walker2d-v1       11389   Humanoid-v1    11600
Table 2: Maximum average reward achieved by ARS, where we took the maximum over all sets of
hyperparameters considered and the three fixed random seeds.

5 Discussion

With a few algorithmic augmentations, basic random search of static, linear policies achieves state-
of-the-art sample efficiency on the MuJoCo locomotion tasks. Surprisingly, no special nonlinear
controllers are needed to match the performance recorded in the RL literature. Moreover, since
our algorithm and policies are simple, we were able to perform extensive sensitivity analysis. This
analysis brings us to an uncomfortable conclusion that the current evaluation methods adopted in the
deep RL community are insufficient to evaluate whether proposed methods are actually solving the
studied problems.
The choice of benchmark tasks and the small number of random seeds do not represent the only issues
of current evaluation methodology. Though many RL researchers are concerned about minimizing
sample complexity, it does not make sense to optimize the running time of an algorithm on a single
problem instance. The running time of an algorithm is only a meaningful notion if either (a) evaluated
on a family of problem instances, or (b) when clearly restricting the class of algorithms.
Common RL practice, however, does not follow either (a) or (b). Instead, researchers run an algorithm
A on a task T with a given hyperparameter configuration, and plot a “learning curve” showing the
algorithm reaches a target reward after collecting X samples. Then the “sample complexity” of the
method is reported as the number of samples required to reach a target reward threshold, with the
given hyperparameter configuration. However, any number of hyperparameter configurations can
be tried. Any number of algorithmic enhancements can be added or discarded and then tested in
simulation. For a fair measurement of sample complexity, should we not count the number of rollouts
used for all tested hyperparameters?
Through optimal hyperparameter tuning one can artificially improve the perceived sample efficiency
of a method. Indeed, this is what we see in our work. By adding a third algorithmic enhancement to
basic random search (i.e., enhancing ARS V2 to V2-t), we are able to improve the sample efficiency of
an already highly performing method. Considering that most of the prior work in RL uses algorithms
with far more tunable parameters and neural nets whose architectures themselves are hyperparameters,
the significance of the reported sample complexities for those methods is not clear. This issue is
important because a meaningful sample complexity of an algorithm should inform us on the number
of samples required to solve a new, previously unseen task.
In light of these issues and of our empirical results, we make several suggestions for future work:

• Simple baselines should be established before moving forward to more complex benchmarks
and methods. We propose the Linear Quadratic Regulator as a reasonable testbed for RL
algorithms. LQR is well-understood when the model is known, problem instances can be
easily generated with a variety of different levels of difficulty, and little overhead is required
for replication; see Appendix A.4 for more details (a minimal instance generator is sketched after this list).
• When games and physics simulators are used for evaluation, separate problem instances
should be used for tuning and evaluating RL methods. Moreover, large numbers of random
seeds should be used for statistically significant evaluations.
• Rather than trying to develop general purpose algorithms, it might be better to focus on
specific problems of interest and find targeted solutions.
• More emphasis should be put on the development of model-based methods. For many
problems, such methods have been observed to require fewer samples than model-free
methods. Moreover, the physics of the systems should inform the parametric classes of
models used for different problems. Model-based methods incur many computational
challenges themselves, and it is quite possible that tools from deep RL, such as improved

tree search, can provide new paths forward for tasks that require the navigation of complex
and uncertain environments.
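As a hint of how little overhead the LQR baseline proposed in the first suggestion requires, a random problem instance and the cost of a static linear policy can be generated in a few lines. This is a hypothetical sketch with our own names and scaling choices, not the construction of Appendix A.4.

```python
import numpy as np

def random_lqr_instance(n=3, p=2, seed=0):
    """Generate a random discrete-time LQR instance x_{t+1} = A x_t + B u_t
    with quadratic stage cost x'Qx + u'Ru."""
    rng = np.random.RandomState(seed)
    A = rng.randn(n, n)
    A *= 0.99 / max(abs(np.linalg.eigvals(A)))   # rescale so the open loop is (marginally) stable
    B = rng.randn(n, p)
    return A, B, np.eye(n), np.eye(p)

def policy_cost(K, A, B, Q, R, horizon=300, rng=None):
    """Average finite-horizon cost of the static linear policy u = -K x from a random start."""
    rng = rng or np.random.RandomState(1)
    x, cost = rng.randn(A.shape[0]), 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost / horizon
```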

Acknowledgments

We thank Orianna DeMasi, Moritz Hardt, Eric Jonas, Robert Nishihara, Rebecca Roelofs, Esther
Rolf, Vaishaal Shankar, Ludwig Schmidt, Nilesh Tripuraneni, Stephen Tu for many helpful comments
and suggestions. HM thanks Robert Nishihara and Vaishaal Shankar for sharing their expertise
in parallel computing. As part of the RISE lab, HM is generally supported in part by NSF CISE
Expeditions Award CCF-1730628, DHS Award HSHQDC-16-3-00083, and gifts from Alibaba,
Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM,
Microsoft, Scotiabank, Splunk and VMware. BR is generously supported in part by NSF award
CCF-1359814, ONR awards N00014-14-1-0024 and N00014-17-1-2191, the DARPA Fundamental
Limits of Learning (Fun LoL) Program, and an Amazon AWS AI Research Award.

References
[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point
bandit feedback. pages 28–40, 2010.

[2] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. Conference on Learning Theory,
2016.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI
gym, 2016.

[4] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic
regulator. arxiv:1710.01688, 2017.

[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning
for continuous control. Proceedings of the International Conference on Machine Learning, pages 1329–
1338, 2016.

[6] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient
descent without a gradient. Proceedings of the ACM-SIAM symposium on Discrete algorithms, pages
385–394, 2005.

[7] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient
with an off-policy critic. International Conference on Learning Representations, 2016.

[8] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies.
Proceedings of the International Conference on Machine Learning, 2017.

[9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. arxiv:1801.01290, 2018.

[10] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller,
et al. Emergence of locomotion behaviours in rich environments. arxiv:1707.02286, 2017.

[11] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that
matters. arXiv:1709.06560, 2017.

[12] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup. Reproducibility of benchmarked deep reinforcement
learning tasks for continuous control. arxiv:1708.04133, 2017.

[13] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. pages
2672–2680, 2012.

[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous
control with deep reinforcement learning. International Conference on Learning Representations, 2016.

[15] J. Matyas. Random optimization. Automation and Remote control, 26(2):246–253, 1965.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu.
Asynchronous methods for deep reinforcement learning. Proceedings of the International Conference on
Machine Learning, pages 1928–1937, 2016.

[18] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica.
Ray: A distributed framework for emerging ai applications. arxiv:1712.05889, 2017.

[19] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep
reinforcement learning with model-free fine-tuning. arxiv:1708.02596, 2017.
[20] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of
Computational Mathematics, 17(2):527–566, 2017.

[21] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and
M. Andrychowicz. Parameter space noise for exploration. arxiv:1706.01905, 2017.

[22] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade. Towards generalization and simplicity in continuous
control. Advances in Neural Information Processing Systems, 2017.

[23] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement
learning. arxiv:1703.03864, 2017.

[24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. pages
1889–1897, 2015.
[25] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using
generalized advantage estimation. International Conference on Learning Representations, 2015.

[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.
arxiv:1707.06347, 2017.

[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient
algorithms. Proceedings of the International Conference on Machine Learning, 2014.

[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural
networks and tree search. Nature, 529(7587):484–489, 2016.

[29] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

[30] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator.
arxiv:1712.08642, 2017.
[31] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient
actor-critic with experience replay. Interational Conference on Learning Representations, 2016.

[32] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement
learning using kronecker-factored approximation. Advances in Neural Information Processing Systems,
2017.

SUNRISE: A Simple Unified Framework
for Ensemble Learning in Deep Reinforcement Learning

Kimin Lee¹  Michael Laskin¹  Aravind Srinivas¹  Pieter Abbeel¹

¹University of California, Berkeley. Correspondence to: Kimin Lee <kiminlee@berkeley.edu>.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from several issues, such as instability in Q-learning and balancing exploration and exploitation. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. By enforcing the diversity between agents using Bootstrap with random initialization, we show that these different ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.

1. Introduction

Model-free reinforcement learning (RL), with high-capacity function approximators, such as deep neural networks (DNNs), has been used to solve a variety of sequential decision-making problems, including board games (Silver et al., 2017; 2018), video games (Mnih et al., 2015; Vinyals et al., 2019), and robotic manipulation (Kalashnikov et al., 2018). It has been well established that the above successes are highly sample inefficient (Kaiser et al., 2020). Recently, a lot of progress has been made in more sample-efficient model-free RL algorithms through improvements in off-policy learning both in discrete and continuous domains (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018; Amos et al., 2020). However, there are still substantial challenges when training off-policy RL algorithms. First, Q-learning often converges to sub-optimal solutions due to error propagation in the Bellman backup, i.e., the errors induced in the target value can lead to an increase in overall error in the Q-function (Kumar et al., 2019; 2020). Second, it is hard to balance exploration and exploitation, which is necessary for efficient RL (Chen et al., 2017; Osband et al., 2016a) (see Section 2 for further details).

One way to address the above issues with off-policy RL algorithms is to use ensemble methods, which combine multiple models of the value function and (or) policy (Chen et al., 2017; Lan et al., 2020; Osband et al., 2016a; Wiering & Van Hasselt, 2008). For example, double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016) addressed the value overestimation by maintaining two independent estimators of the action values and was later extended to continuous control tasks in TD3 (Fujimoto et al., 2018). Bootstrapped DQN (Osband et al., 2016a) leveraged an ensemble of Q-functions for more effective exploration, and Chen et al. (2017) further improved it by adapting upper-confidence bounds algorithms (Audibert et al., 2009; Auer et al., 2002) based on uncertainty estimates from ensembles. However, most prior works have studied the various axes of improvements from ensemble methods in isolation and have ignored the error propagation aspect.

In this paper, we present SUNRISE, a simple unified ensemble method that is compatible with most modern off-policy RL algorithms, such as Q-learning and actor-critic algorithms. Our main idea is to reweight sample transitions based on uncertainty estimates from a Q-ensemble. Because prediction errors can be characterized by uncertainty estimates from ensembles (i.e., variance of predictions) as shown in Figure 1(b), we find that the proposed method significantly improves the signal-to-noise in the Q-updates and stabilizes the learning process. Additionally, for efficient exploration, we define an upper-confidence bound (UCB) based on the mean and variance of Q-functions similar to Chen et al. (2017), and introduce an inference method,

which selects actions with highest UCB for efficient exploration. This inference method can encourage exploration by providing a bonus for visiting unseen state-action pairs, where ensembles produce high uncertainty, i.e., high variance (see Figure 1(b)). By enforcing the diversity between agents using Bootstrap with random initialization (Osband et al., 2016a), we find that these different ideas can be fruitfully integrated, and they are largely complementary (see Figure 1(a)).

We demonstrate the effectiveness of the proposed method using Soft Actor-Critic (SAC; Haarnoja et al. 2018) for continuous control benchmarks (specifically, OpenAI Gym (Brockman et al., 2016) and DeepMind Control Suite (Tassa et al., 2018)) and Rainbow DQN (Hessel et al., 2018) for discrete control benchmarks (specifically, Atari games (Bellemare et al., 2013)). In our experiments, SUNRISE consistently improves the performance of existing off-policy RL methods. Furthermore, we find that the proposed weighted Bellman backups yield improvements in environments with noisy reward, which have a low signal-to-noise ratio.

2. Related work

Off-policy RL algorithms. Recently, various off-policy RL algorithms have provided large gains in sample-efficiency by reusing past experiences (Fujimoto et al., 2018; Haarnoja et al., 2018; Hessel et al., 2018). Rainbow DQN (Hessel et al., 2018) achieved state-of-the-art performance on the Atari games (Bellemare et al., 2013) by combining several techniques, such as double Q-learning (Van Hasselt et al., 2016) and distributional DQN (Bellemare et al., 2017). For continuous control tasks, SAC (Haarnoja et al., 2018) achieved state-of-the-art sample-efficiency results by incorporating the maximum entropy framework. Our ensemble method brings orthogonal benefits and is complementary and compatible with these existing state-of-the-art algorithms.

Stabilizing Q-learning. It has been empirically observed that instability in Q-learning can be caused by applying the Bellman backup on the learned value function (Hasselt, 2010; Van Hasselt et al., 2016; Fujimoto et al., 2018; Song et al., 2019; Kim et al., 2019; Kumar et al., 2019; 2020). By following the principle of double Q-learning (Hasselt, 2010; Van Hasselt et al., 2016), the twin-Q trick (Fujimoto et al., 2018) was proposed to handle the overestimation of value functions for continuous control tasks. Song et al. (2019) and Kim et al. (2019) proposed to replace the max operator with Softmax and Mellowmax, respectively, to reduce the overestimation error. Recently, Kumar et al. (2020) handled the error propagation issue by reweighting the Bellman backup based on cumulative Bellman errors. However, our method is different in that we propose an alternative way that also utilizes ensembles to estimate uncertainty and provide more stable, higher-signal-to-noise backups.

Ensemble methods in RL. Ensemble methods have been studied for different purposes in RL (Wiering & Van Hasselt, 2008; Osband et al., 2016a; Anschel et al., 2017; Agarwal et al., 2020; Lan et al., 2020). Chua et al. (2018) showed that modeling errors in model-based RL can be reduced using an ensemble of dynamics models, and Kurutach et al. (2018) accelerated policy learning by generating imagined experiences from the ensemble of dynamics models. For efficient exploration, Osband et al. (2016a) and Chen et al. (2017) also leveraged the ensemble of Q-functions. However, most prior works have studied the various axes of improvements from ensemble methods in isolation, while we propose a unified framework that handles various issues in off-policy RL algorithms.

Exploration in RL. To balance exploration and exploitation, several methods, such as the maximum entropy frameworks (Ziebart, 2010; Haarnoja et al., 2018), exploration bonus rewards (Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017; Choi et al., 2019) and randomization (Osband et al., 2016a;b), have been proposed. Despite the success of these exploration methods, a potential drawback is that agents can focus on irrelevant aspects of the environment because these methods do not depend on the rewards. To handle this issue, Chen et al. (2017) proposed an exploration strategy that considers both best estimates (i.e., mean) and uncertainty (i.e., variance) of Q-functions for discrete control tasks. We further extend this strategy to continuous control tasks and show that it can be combined with other techniques.

3. Background

Reinforcement learning. We consider a standard RL framework where an agent interacts with an environment in discrete time. Formally, at each timestep t, the agent receives a state s_t from the environment and chooses an action a_t based on its policy π. The environment returns a reward r_t and the agent transitions to the next state s_{t+1}. The return R_t = Σ_{k=0}^{∞} γ^k r_{t+k} is the total accumulated reward from timestep t with a discount factor γ ∈ [0, 1). RL then maximizes the expected return.

Soft Actor-Critic. SAC (Haarnoja et al., 2018) is an off-policy actor-critic method based on the maximum entropy RL framework (Ziebart, 2010), which encourages the robustness to noise and exploration by maximizing a weighted objective of the reward and the policy entropy (see Appendix A for further details). To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters θ, is updated by minimizing the following soft

Bootstrap with Actor 1


random Critic 1
Weighted Bellman
initialization
Actor 2 backup
Critic 2

UCB
Action
Actor N exploration
Replay buffer Critic N

(a) SUNRISE: actor-critic version (b) Uncertainty estimates

Figure 1. (a) Illustration of our framework. We consider N independent agents (i.e., no shared parameters between agents) with one replay
buffer. (b) Uncertainty estimates from an ensemble of neural networks on a toy regression task (see Appendix C for more experimental
details). The black line is the ground truth curve, and the red dots are training samples. The blue lines show the mean and variance of
predictions over ten ensemble models. The ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples.

Bellman residual:

$L^{\text{SAC}}_{\text{critic}}(\theta) = \mathbb{E}_{\tau_t \sim \mathcal{B}}\left[ L_Q(\tau_t, \theta) \right],$  (1)

$L_Q(\tau_t, \theta) = \left( Q_\theta(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \right)^2,$  (2)

with $\bar{V}(s_t) = \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_{\bar{\theta}}(s_t, a_t) - \alpha \log \pi_\phi(a_t \mid s_t) \right]$, where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $\mathcal{B}$ is a replay buffer, $\bar{\theta}$ are the delayed parameters, and $\alpha$ is a temperature parameter. At the soft policy improvement step, the policy $\pi_\phi$ with parameter $\phi$ is updated by minimizing the following objective:

$L^{\text{SAC}}_{\text{actor}}(\phi) = \mathbb{E}_{s_t \sim \mathcal{B}}\left[ L_\pi(s_t, \phi) \right],$  (3)

$L_\pi(s_t, \phi) = \mathbb{E}_{a_t \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right].$  (4)

Here, the policy is modeled as a Gaussian with mean and covariance given by neural networks to handle continuous action spaces.

4. SUNRISE

In this section, we propose the ensemble-based weighted Bellman backups, and then introduce SUNRISE: Simple UNified framework for ReInforcement learning using enSEmbles, which combines various ensemble methods. In principle, our method can be used in conjunction with most modern off-policy RL algorithms, such as SAC (Haarnoja et al., 2018) and Rainbow DQN (Hessel et al., 2018). For the exposition, we describe only the SAC version in the main body. The Rainbow DQN version follows the same principles and is fully described in Appendix B.

4.1. Weighted Bellman backups

Formally, we consider an ensemble of N SAC agents, i.e., $\{Q_{\theta_i}, \pi_{\phi_i}\}_{i=1}^N$, where $\theta_i$ and $\phi_i$ denote the parameters of the i-th soft Q-function and policy (each Q-function $Q_{\theta_i}(s, a)$ has its own target Q-function $Q_{\bar{\theta}_i}(s, a)$). Since conventional Q-learning is based on the Bellman backup in (2), it can be affected by error propagation: error in the target Q-function $Q_{\bar{\theta}}(s_{t+1}, a_{t+1})$ is propagated into the Q-function $Q_\theta(s_t, a_t)$ at the current state. In other words, errors in the previous Q-function inject "noise" into the learning "signal" (i.e., the true Q-value) of the current Q-function. Recently, Kumar et al. (2020) showed that this error propagation can cause inconsistency and unstable convergence. To mitigate this issue, for each agent i, we consider a weighted Bellman backup:

$L_{WQ}(\tau_t, \theta_i) = w(s_{t+1}, a_{t+1}) \left( Q_{\theta_i}(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \right)^2,$  (5)

where $\tau_t = (s_t, a_t, r_t, s_{t+1})$ is a transition, $a_{t+1} \sim \pi_{\phi_i}(a \mid s_{t+1})$, and $w(s, a)$ is a confidence weight based on the ensemble of target Q-functions:

$w(s, a) = \sigma\left( -\bar{Q}_{\text{std}}(s, a) \cdot T \right) + 0.5,$  (6)

where $T > 0$ is a temperature, $\sigma$ is the sigmoid function, and $\bar{Q}_{\text{std}}(s, a)$ is the empirical standard deviation of all target Q-functions $\{Q_{\bar{\theta}_i}\}_{i=1}^N$. Note that the confidence weight is bounded in [0.5, 1.0] because the standard deviation is always positive (we find it empirically stable to set the minimum value of the weight $w(s, a)$ to 0.5). The proposed objective $L_{WQ}$ down-weights sample transitions with high variance across the target Q-functions, resulting in a loss function for the Q-updates that has a better signal-to-noise ratio (see Section 6.1 for a more detailed discussion).
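For concreteness, the following PyTorch-style sketch illustrates how the confidence weight (6) and the weighted Bellman backup (5) can be computed for one ensemble member. It is illustrative only: it assumes each critic is a callable network Q(obs, act) returning one value per sample, and that policy_i.sample(obs) returns an action together with its log-probability.

    import torch

    def confidence_weight(target_critics, next_obs, next_act, temperature):
        # Eq. (6): sigmoid(-Q_std * T) + 0.5, bounded in (0.5, 1.0).
        with torch.no_grad():
            q_vals = torch.stack([q(next_obs, next_act) for q in target_critics])  # (N, batch, ...)
            q_std = q_vals.std(dim=0)  # empirical std across the N target critics
            return torch.sigmoid(-q_std * temperature) + 0.5

    def weighted_bellman_loss(critic_i, target_critic_i, target_critics, policy_i,
                              batch, alpha, gamma, temperature):
        # Eq. (5) for ensemble member i.
        obs, act, rew, next_obs = batch
        with torch.no_grad():
            next_act, next_logp = policy_i.sample(next_obs)      # a_{t+1} ~ pi_i(.|s_{t+1})
            v_next = target_critic_i(next_obs, next_act) - alpha * next_logp
            target = rew + gamma * v_next                        # r_t + gamma * V_bar(s_{t+1})
            w = confidence_weight(target_critics, next_obs, next_act, temperature)
        td_error = critic_i(obs, act) - target
        return (w * td_error.pow(2)).mean()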
4.2. Combination with additional techniques

We integrate the proposed weighted Bellman backup with UCB exploration into a single framework by utilizing the bootstrap with random initialization.

Bootstrap with random initialization. To train the ensemble of agents, we use the bootstrap with random initialization (Efron, 1982; Osband et al., 2016a), which enforces diversity between agents through two simple ideas. First, we initialize the model parameters of all agents with random values to induce initial diversity among the models. Second, we apply different samples to train each agent. Specifically, for each SAC agent i at each timestep t, we draw a binary mask $m_{t,i}$ from a Bernoulli distribution with parameter $\beta \in (0, 1]$ and store it in the replay buffer. Then, when updating the model parameters of the agents, we multiply each objective by its bootstrap mask, i.e., $m_{t,i} L_\pi(s_t, \phi_i)$ and $m_{t,i} L_{WQ}(\tau_t, \theta_i)$ in (4) and (5), respectively. We remark that Osband et al. (2016a) applied this simple technique to train an ensemble of DQN agents (Mnih et al., 2015) only for discrete control tasks, while we apply it to SAC (Haarnoja et al., 2018) and Rainbow DQN (Hessel et al., 2018) for both continuous and discrete tasks, together with the additional techniques described below.
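A minimal sketch of this masking step is given below; it is illustrative, assuming per-sample losses of shape (B,) for a minibatch of size B, and is not tied to any particular implementation.

    import torch

    def sample_bootstrap_masks(num_agents, beta):
        # m_{t,i} ~ Bernoulli(beta): one binary mask per ensemble member for the
        # current transition, stored in the replay buffer alongside the transition.
        return torch.bernoulli(torch.full((num_agents,), beta))

    def masked_objective(per_sample_loss, mask):
        # (1/B) * sum_j m_{j,i} * L(tau_j): zero out samples this agent does not train on.
        return (mask * per_sample_loss).mean()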
UCB exploration. The ensemble can also be leveraged for efficient exploration (Chen et al., 2017; Osband et al., 2016a) because it can express higher uncertainty on unseen samples. Motivated by this, and following the idea of Chen et al. (2017), we consider an optimism-based exploration scheme that chooses the action maximizing an upper-confidence bound:

$a_t = \arg\max_a \left\{ Q_{\text{mean}}(s_t, a) + \lambda\, Q_{\text{std}}(s_t, a) \right\},$  (7)

where $Q_{\text{mean}}(s, a)$ and $Q_{\text{std}}(s, a)$ are the empirical mean and standard deviation of all Q-functions $\{Q_{\theta_i}\}_{i=1}^N$, and $\lambda > 0$ is a hyperparameter. This inference method encourages exploration by adding an exploration bonus (i.e., the standard deviation $Q_{\text{std}}$) for visiting unseen state-action pairs, similar to the UCB algorithm (Auer et al., 2002). We remark that this inference method was originally proposed by Chen et al. (2017) for efficient exploration in discrete action spaces. However, in continuous action spaces, finding the action that maximizes the UCB is not straightforward. To handle this issue, we propose a simple approximation scheme, which first generates a candidate set of N actions from the ensemble policies $\{\pi_{\phi_i}\}_{i=1}^N$ and then chooses the candidate that maximizes the UCB (Lines 4-5 in Algorithm 1). For evaluation, we approximate the maximum a posteriori action by averaging the means of the Gaussian distributions modeled by the ensemble policies.
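The sketch below illustrates this approximation for continuous actions; the interfaces are assumed for illustration (each policy exposes a Gaussian via policy.dist(obs), and each critic is callable as Q(obs, act) returning a scalar value).

    import torch

    def ucb_action(obs, policies, critics, lam):
        # Candidate set A_t: one sampled action per ensemble policy.
        candidates = [p.dist(obs).sample() for p in policies]       # N candidate actions
        scores = []
        for a in candidates:
            q_vals = torch.stack([q(obs, a) for q in critics])      # (N,) Q-values
            scores.append(q_vals.mean() + lam * q_vals.std())       # UCB score, Eq. (7)
        return candidates[int(torch.stack(scores).argmax())]

    def eval_action(obs, policies):
        # Approximate MAP action: average of the ensemble policies' Gaussian means.
        return torch.stack([p.dist(obs).mean for p in policies]).mean(dim=0)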
The full procedure of our unified framework, coined SUNRISE, is summarized in Algorithm 1.

Algorithm 1 SUNRISE: SAC version
 1: for each iteration do
 2:   for each timestep t do
 3:     // UCB exploration
 4:     Collect N action samples: A_t = {a_{t,i} ~ pi_{phi_i}(a|s_t) | i in {1, ..., N}}
 5:     Choose the action that maximizes the UCB: a_t = argmax_{a_{t,i} in A_t} Q_mean(s_t, a_{t,i}) + lambda * Q_std(s_t, a_{t,i})
 6:     Collect the next state s_{t+1} and reward r_t from the environment by taking action a_t
 7:     Sample bootstrap masks M_t = {m_{t,i} ~ Bernoulli(beta) | i in {1, ..., N}}
 8:     Store the transition tau_t = (s_t, a_t, s_{t+1}, r_t) and masks in the replay buffer: B <- B U {(tau_t, M_t)}
 9:   end for
10:   // Update agents via bootstrap and weighted Bellman backup
11:   for each gradient step do
12:     Sample a random minibatch {(tau_j, M_j)}_{j=1}^B ~ B
13:     for each agent i do
14:       Update the Q-function by minimizing (1/B) sum_{j=1}^B m_{j,i} L_WQ(tau_j, theta_i) in (5)
15:       Update the policy by minimizing (1/B) sum_{j=1}^B m_{j,i} L_pi(s_j, phi_i) in (4)
16:     end for
17:   end for
18: end for

5. Experimental results

We designed our experiments to answer the following questions:

• Can SUNRISE improve off-policy RL algorithms, such as SAC (Haarnoja et al., 2018) and Rainbow DQN (Hessel et al., 2018), for both continuous (see Table 1 and Table 2) and discrete (see Table 3) control tasks?
• How crucial are the proposed weighted Bellman backups in (5) for improving the signal-to-noise ratio in the Q-updates (see Figure 2)?
• Can UCB exploration be useful for solving tasks with sparse rewards (see Figure 3(b))?
• Is SUNRISE better than a single agent with more updates and parameters (see Figure 3(c))?
• How does the ensemble size affect performance (see Figure 3(d))?

5.1. Setups

Continuous control tasks. We evaluate SUNRISE on several continuous control tasks using simulated robots from OpenAI Gym (Brockman et al., 2016) and DeepMind Control Suite (Tassa et al., 2018). For OpenAI Gym experiments with proprioceptive inputs (e.g., positions and velocities), we compare to PETS (Chua et al., 2018), a state-of-the-art model-based RL method based on ensembles of dynamics models; POPLIN-P (Wang & Ba, 2020), a state-of-the-art model-based RL method which uses a policy network to
generate actions for planning; POPLIN-A (Wang & Ba, For our method, we do not alter any hyperparameters of
2020), variant of POPLIN-P which adds noise in the ac- the original RL algorithms and train five ensemble agents.
tion space; METRPO (Kurutach et al., 2018), a hybrid RL There are only three additional hyperparameters , T , and
method which augments TRPO (Schulman et al., 2015) us- for bootstrap, weighted Bellman backup, and UCB ex-
ing ensembles of dynamics models; and two state-of-the-art ploration, where we provide details in Appendix D, F, and
model-free RL methods, TD3 (Fujimoto et al., 2018) and G.
SAC (Haarnoja et al., 2018). For our method, we consider
a combination of SAC and SUNRISE, as described in Al- 5.2. Comparative evaluation
gorithm 1. Following the setup in Wang & Ba (2020) and
Wang et al. (2019), we report the mean and standard devia- OpenAI Gym. Table 1 shows the average returns of eval-
tion across ten runs after 200K timesteps on five complex uation roll-outs for all methods. SUNRISE consistently
environments: Cheetah, Walker, Hopper, Ant and SlimHu- improves the performance of SAC across all environments
manoid with early termination (ET). More experimental and outperforms the model-based RL methods, such as
details and learning curves with 1M timesteps are in Ap- POPLIN-P and PETS, on all environments except Ant and
pendix D. SlimHumanoid-ET. Even though we focus on performance
after small samples because of the recent emphasis on mak-
For DeepMind Control Suite with image inputs, we com- ing RL more sample efficient, we find that the gain from
pare to PlaNet (Hafner et al., 2019), a model-based RL SUNRISE becomes even more significant when training
method which learns a latent dynamics model and uses it longer (see Figure 3(c) and Appendix D). We remark that
for planning; Dreamer (Hafner et al., 2020), a hybrid RL SUNRISE is more compute-efficient than modern model-
method which utilizes the latent dynamics model to gener- based RL methods, such as POPLIN and PETS, because
ate synthetic roll-outs; SLAC (Lee et al., 2020), a hybrid they also utilize ensembles (of dynamics models) and per-
RL method which combines the latent dynamics model form planning to select actions. Namely, SUNRISE is sim-
with SAC; and three state-of-the-art model-free RL methods ple to implement, computationally efficient, and readily
which apply contrastive learning (CURL; Srinivas et al. parallelizable.
2020) or data augmentation (RAD (Laskin et al., 2020) and
DrQ (Kostrikov et al., 2021)) to SAC. For our method, we DeepMind Control Suite. As shown in Table 2, SUNRISE
consider a combination of RAD (i.e., SAC with random also consistently improves the performance of RAD (i.e.,
crop) and SUNRISE. Following the setup in RAD, we re- SAC with random crop) on all environments from Deep-
port the mean and standard deviation across five runs after Mind Control Suite. This implies that the proposed method
100k (i.e., low sample regime) and 500k (i.e., asymptotically can be useful for high-dimensional and complex input obser-
optimal regime) environment steps on six environments: vations. Moreover, our method outperforms existing pixel-
Finger-spin, Cartpole-swing, Reacher-easy, Cheetah-run, based RL methods in almost all environments. We remark
Walker-walk, and Cup-catch. More experimental details and that SUNRISE can also be combined with DrQ, and expect
learning curves are in Appendix F. that it can achieve better performances on Cartpole-swing
and Cup-catch at 100K environment steps.
Discrete control benchmarks. For discrete control tasks,
we demonstrate the effectiveness of SUNRISE on several Atari games. We also evaluate SUNRISE on discrete con-
Atari games (Bellemare et al., 2013). We compare to Sim- trol tasks from the Atari benchmark using Rainbow DQN.
PLe (Kaiser et al., 2020), a hybrid RL method which up- Table 3 shows that SUNRISE improves the performance of
dates the policy only using samples generated by learned Rainbow in almost all environments, and outperforms the
dynamics model; Rainbow DQN (Hessel et al., 2018) with state-of-the-art CURL and SimPLe on 11 out of 26 Atari
modified hyperparameters for sample-efficiency (van Has- games. Here, we remark that SUNRISE is also compatible
selt et al., 2019); Random agent (Kaiser et al., 2020); two with CURL, which could enable even better performance.
state-of-the-art model-free RL methods which apply the con- These results demonstrate that SUNRISE is a general ap-
trastive learning (CURL; Srinivas et al. 2020) and data aug- proach.
mentation (DrQ; Kostrikov et al. 2021) to Rainbow DQN;
and Human performances reported in Kaiser et al. (2020) 5.3. Ablation study
and van Hasselt et al. (2019). Following the setups in Sim- Effects of weighted Bellman backups. To verify the effec-
PLe, we report the mean across three runs after 100K in- tiveness of the proposed weighted Bellman backup (5) in
teractions (i.e., 400K frames with action repeat of 4). For improving signal-to-noise in Q-updates, we evaluate on a
our method, we consider a combination of sample-efficient modified OpenAI Gym environments with noisy rewards.
versions of Rainbow DQN and SUNRISE (see Algorithm 3 Following Kumar et al. (2019), we add Gaussian noise to the
in Appendix B). More experimental details and learning reward function: r0 (s, a) = r(s, a) + z, where z ⇠ N (0, 1)
curves are in Appendix G. only during training, and report the deterministic ground-
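As a concrete illustration of this protocol (a sketch with our own naming, not the benchmark code), the perturbation can be implemented as a thin Gym reward wrapper that is applied only to the training environment, while the unwrapped environment is used for evaluation; the environment id below is illustrative.

    import gym
    import numpy as np

    class NoisyRewardWrapper(gym.RewardWrapper):
        """Training-time reward noise: r'(s, a) = r(s, a) + z, z ~ N(0, sigma)."""
        def __init__(self, env, sigma=1.0):
            super().__init__(env)
            self.sigma = sigma

        def reward(self, reward):
            return reward + np.random.normal(0.0, self.sigma)

    # train_env = NoisyRewardWrapper(gym.make("Hopper-v2"), sigma=1.0)  # noisy rewards
    # eval_env = gym.make("Hopper-v2")                                  # ground-truth rewards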
Cheetah Walker Hopper Ant SlimHumanoid-ET
PETS 2288.4 ± 1019.0 282.5 ± 501.6 114.9 ± 621.0 1165.5 ± 226.9 2055.1 ± 771.5
POPLIN-A 1562.8 ± 1136.7 -105.0 ± 249.8 202.5 ± 962.5 1148.4 ± 438.3 -
POPLIN-P 4235.0 ± 1133.0 597.0 ± 478.8 2055.2 ± 613.8 2330.1 ± 320.9 -
METRPO 2283.7 ± 900.4 -1609.3 ± 657.5 1272.5 ± 500.9 282.2 ± 18.0 76.1 ± 8.8
TD3 3015.7 ± 969.8 -516.4 ± 812.2 1816.6 ± 994.8 870.1 ± 283.8 1070.0 ± 168.3
SAC 4474.4 ± 700.9 299.5 ± 921.9 1781.3 ± 737.2 979.5 ± 253.2 1371.8 ± 473.4
SUNRISE 4501.8 ± 443.8 1236.5 ± 1123.9 2643.2 ± 472.3 1502.4 ± 483.5 1926.6 ± 375.0
Table 1. Performance on OpenAI Gym at 200K timesteps. The results show the mean and standard deviation averaged over ten runs. For
baseline methods, we report the best number in prior works (Wang & Ba, 2020; Wang et al., 2019).

500K step PlaNet Dreamer SLAC CURL DrQ RAD SUNRISE


Finger-spin 561 ± 284 796 ± 183 673 ± 92 926 ± 45 938 ± 103 975 ± 16 983 ±1
Cartpole-swing 475 ± 71 762 ± 27 - 845 ± 45 868 ± 10 873 ± 3 876 ± 4
Reacher-easy 210 ± 44 793 ± 164 - 929 ± 44 942 ± 71 916 ± 49 982 ± 3
Cheetah-run 305 ± 131 570 ± 253 640 ± 19 518 ± 28 660 ± 96 624 ± 10 678 ± 46
Walker-walk 351 ± 58 897 ± 49 842 ± 51 902 ± 43 921 ± 45 938 ± 9 953 ± 13
Cup-catch 460 ± 380 879 ± 87 852 ± 71 959 ± 27 963 ± 9 966 ± 9 969 ± 5
100K step
Finger-spin 136 ± 216 341 ± 70 693 ± 141 767 ± 56 901 ± 104 811 ± 146 905 ± 57
Cartpole-swing 297 ± 39 326 ± 27 - 582 ± 146 759 ± 92 373 ± 90 591 ± 55
Reacher-easy 20 ± 50 314 ± 155 - 538 ± 233 601 ± 213 567 ± 54 722 ± 50
Cheetah-run 138 ± 88 235 ± 137 319 ± 56 299 ± 48 344 ± 67 381 ± 79 413 ± 35
Walker-walk 224 ± 48 277 ± 12 361 ± 73 403 ± 24 612 ± 164 641 ± 89 667 ± 147
Cup-catch 0±0 246 ± 174 512 ± 110 769 ± 43 913± 53 666 ± 181 633 ± 241
Table 2. Performance on DeepMind Control Suite at 100K and 500K environment steps. The results show the mean and standard deviation averaged over five runs. For baseline methods, we report the best numbers reported in prior works (Kostrikov et al., 2021).

For our method, we also consider a variant of SUNRISE which updates the Q-functions without the proposed weighted Bellman backup, to isolate its effect. We compare to DisCor (Kumar et al., 2020), which improves SAC by reweighting the Bellman backup based on estimated cumulative Bellman errors (see Appendix E for more details).

Figure 2 shows the learning curves of all methods on OpenAI Gym with noisy rewards. The proposed weighted Bellman backup significantly improves both the sample-efficiency and the asymptotic performance of SUNRISE, and outperforms baselines such as SAC and DisCor. One can note that the performance gain due to our weighted Bellman backup becomes more significant in complex environments, such as SlimHumanoid-ET. We remark that DisCor still suffers from error propagation issues in complex environments like SlimHumanoid-ET and Ant because there are approximation errors in estimating the cumulative Bellman errors (see Section 6.1 for a more detailed discussion). These results imply that errors in the target Q-function can be characterized effectively by the proposed confidence weight in (6).

We also consider another variant of SUNRISE, which updates the Q-functions with random weights drawn uniformly from [0.5, 1.0]. To further evaluate SUNRISE, we increase the noise level by adding Gaussian noise with a larger standard deviation to the reward function: r'(s, a) = r(s, a) + z with z ~ N(0, 5). Figure 3(a) shows the learning curves of all methods on the SlimHumanoid-ET environment over 10 random seeds. First, one can note that SUNRISE with random weights (red curve) is worse than SUNRISE with the proposed weighted Bellman backups (blue curve). Additionally, even without UCB exploration, SUNRISE with the proposed weighted Bellman backups (purple curve) outperforms all baselines. This implies that the proposed weighted Bellman backups can handle error propagation effectively even when the reward function is very noisy.

Effects of UCB exploration. To verify the advantage of UCB exploration in (7), we evaluate on Cartpole-swing with sparse rewards from DeepMind Control Suite. For our method, we consider a variant of SUNRISE which selects actions without UCB exploration. As shown in Figure 3(b), SUNRISE with UCB exploration (blue curve) significantly improves the sample-efficiency on this sparse-reward environment.
SUNRISE with UCB exploration (blue curve) significantly

Game Human Random SimPLe CURL DrQ Rainbow SUNRISE


Alien 7127.7 227.8 616.9 558.2 761.4 789.0 872.0
Amidar 1719.5 5.8 88.0 142.1 97.3 118.5 122.6
Assault 742.0 222.4 527.2 600.6 489.1 413.0 594.8
Asterix 8503.3 210.0 1128.3 734.5 637.5 533.3 755.0
BankHeist 753.1 14.2 34.2 131.6 196.6 97.7 266.7
BattleZone 37187.5 2360.0 5184.4 14870.0 13520.6 7833.3 15700.0
Boxing 12.1 0.1 9.1 1.2 6.9 0.6 6.7
Breakout 30.5 1.7 16.4 4.9 14.5 2.3 1.8
ChopperCommand 7387.8 811.0 1246.9 1058.5 646.6 590.0 1040.0
CrazyClimber 35829.4 10780.5 62583.6 12146.5 19694.1 25426.7 22230.0
DemonAttack 1971.0 152.1 208.1 817.6 1222.2 688.2 919.8
Freeway 29.6 0.0 20.3 26.7 15.4 28.7 30.2
Frostbite 4334.7 65.2 254.7 1181.3 449.7 1478.3 2026.7
Gopher 2412.5 257.6 771.0 669.3 598.4 348.7 654.7
Hero 30826.4 1027.0 2656.6 6279.3 4001.6 3675.7 8072.5
Jamesbond 302.8 29.0 125.3 471.0 272.3 300.0 390.0
Kangaroo 3035.0 52.0 323.1 872.5 1052.4 1060.0 2000.0
Krull 2665.5 1598.0 4539.9 4229.6 4002.3 2592.1 3087.2
KungFuMaster 22736.3 258.5 17257.2 14307.8 7106.4 8600.0 10306.7
MsPacman 6951.6 307.3 1480.0 1465.5 1065.6 1118.7 1482.3
Pong 14.6 -20.7 12.8 -16.5 -11.4 -19.0 -19.3
PrivateEye 69571.3 24.9 58.3 218.4 49.2 97.8 100.0
Qbert 13455.0 163.9 1288.8 1042.4 1100.9 646.7 1830.8
RoadRunner 7845.0 11.5 5640.6 5661.0 8069.8 9923.3 11913.3
Seaquest 42054.7 68.4 683.3 384.5 321.8 396.0 570.7
UpNDown 11693.2 533.4 3350.3 2955.2 3924.9 3816.0 5074.0

Table 3. Performance on Atari games at 100K interactions. The results show the scores averaged over three runs. For baseline methods, we report the best numbers reported in prior works (Kaiser et al., 2020; van Hasselt et al., 2019).

Comparison with a single agent with more updates/parameters. One concern with ensemble methods is that their gains may simply come from more gradient updates and more parameters. To address this concern, we compare SUNRISE (five ensemble members, each a 2-layer MLP with 256 hidden units) to a single agent consisting of 2-layer MLPs with 1024 (and 256) hidden units, trained with 5 updates per timestep using different random minibatches. Figure 3(c) shows the learning curves on SlimHumanoid-ET, where SUNRISE outperforms all baselines. This implies that the gains from SUNRISE cannot be achieved by simply increasing the number of updates and parameters. More experimental results on other environments are available in Appendix D.

Effects of ensemble size. We analyze the effect of the ensemble size N on the Ant environment from OpenAI Gym. Figure 3(d) shows that performance improves with larger ensembles, but the improvement saturates around N = 5. Thus, we use five ensemble agents for all experiments. More experimental results on other environments are available in Appendix D, where the overall trend is similar.

6. Discussion

6.1. Intuition for weighted Bellman backups

Kumar et al. (2020) show that naive Bellman backups can suffer from slow learning in certain environments, requiring exponentially many updates. To handle this problem, they propose weighted Bellman backups that make steady learning progress by inducing an approximately optimal data distribution (see Kumar et al. (2020) for more details). Specifically, in addition to standard Q-learning, DisCor trains an error model Delta(s, a), which approximates the cumulative sum of discounted Bellman errors over the past iterations of training. Then, using the error model, DisCor reweights the Bellman backups with a confidence weight defined as w(s, a) proportional to exp(-gamma * Delta(s, a) / T), where gamma is a discount factor and T is a temperature.

However, we remark that DisCor can still suffer from error propagation issues because there is also an approximation error in estimating the cumulative Bellman errors.

[Figure 2 plots: five panels (SlimHumanoid-ET, Hopper, Walker, Cheetah, Ant) showing average return versus timesteps for SAC, DisCor, SUNRISE (without WBB), and SUNRISE (with WBB).]

Figure 2. Learning curves on OpenAI Gym with noisy rewards. To verify the effects of the weighted Bellman backups (WBB), we consider
SUNRISE with WBB and without WBB. The solid line and shaded regions represent the mean and standard deviation, respectively, across
four runs.
[Figure 3 plots: four panels, (a) Large noise, (b) Sparse reward, (c) Gradient update, and (d) Ensemble size, showing average return or score versus timesteps/environment steps for the methods and SUNRISE variants described in the caption.]

Figure 3. (a) Learning curves of SUNRISE with random weights (RW) and with the proposed weighted Bellman backups (WBB) on the SlimHumanoid-ET environment with noisy rewards. (b) Effects of UCB exploration on the Cartpole environment with sparse reward. (c) Learning curves of SUNRISE and of a single agent with h hidden units and five gradient updates per timestep on the SlimHumanoid-ET environment. (d) Learning curves of SUNRISE with varying values of the ensemble size N on the Ant environment.

Therefore, we consider an alternative approach that utilizes the uncertainty from ensembles. Because it has been observed that an ensemble can produce well-calibrated uncertainty estimates (i.e., variance) on unseen samples (Lakshminarayanan et al., 2017), we expect the ensemble-based weighted Bellman backups to handle error propagation more effectively. Indeed, in our experiments, we find that ensemble-based weighted Bellman backups give rise to more stable training and improve the data-efficiency of various off-policy RL algorithms.
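To make the contrast concrete, the sketch below (illustrative names, not either method's released code) computes both weights for the same batch: the DisCor-style weight requires a separately learned error model Delta(s, a), whereas the ensemble weight (6) only needs the standard deviation of the target critics that SUNRISE already maintains.

    import torch

    def discor_weight(error_model, obs, act, gamma, temperature):
        # DisCor-style weight: w(s, a) proportional to exp(-gamma * Delta(s, a) / T),
        # where Delta is a learned estimate of the cumulative Bellman error.
        delta = error_model(obs, act)
        return torch.exp(-gamma * delta / temperature)

    def ensemble_weight(target_critics, obs, act, temperature):
        # SUNRISE weight, Eq. (6): sigmoid(-Q_std(s, a) * T) + 0.5; no extra model to train.
        q_std = torch.stack([q(obs, act) for q in target_critics]).std(dim=0)
        return torch.sigmoid(-q_std * temperature) + 0.5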
6.2. Computation overhead

One can expect an additional computation overhead from introducing ensembles. With N ensemble agents, our method requires N x inferences for the weighted Bellman backups and 2N x inferences (N for the actors and N for the critics). However, we remark that our method can still be computationally efficient because these inferences are independent and therefore parallelizable. Also, as shown in Figure 3(c), the gains from SUNRISE cannot be achieved by simply increasing the number of updates and parameters.
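One way to realize this parallelism on a single GPU is to evaluate the N critics in a single vectorized call using PyTorch's torch.func utilities, following the standard model-ensembling recipe; this is a sketch under that assumption rather than a description of our implementation.

    import copy
    import torch
    from torch.func import stack_module_state, functional_call

    def ensemble_q_values(critics, obs, act):
        # Evaluate all N critics at once instead of looping in Python.
        params, buffers = stack_module_state(critics)    # stack weights along a new leading dim
        base = copy.deepcopy(critics[0]).to("meta")      # structure only; weights come from params
        def one_critic(p, b, o, a):
            return functional_call(base, (p, b), (o, a))
        return torch.vmap(one_critic, in_dims=(0, 0, None, None))(params, buffers, obs, act)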
7. Conclusion

In this paper, we introduce SUNRISE, a simple unified ensemble method that integrates the proposed weighted Bellman backups with bootstrap with random initialization and UCB exploration to handle various issues in off-policy RL algorithms. In particular, we present ensemble-based weighted Bellman backups, which stabilize and improve the learning process by re-weighting target Q-values based on uncertainty estimates. Our experiments show that SUNRISE consistently improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, and outperforms state-of-the-art RL algorithms for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. We hope that SUNRISE will also prove useful for related topics such as sim-to-real transfer (Tobin et al., 2017), imitation learning (Torabi et al., 2018), understanding the connection between on-policy and off-policy RL (Schulman et al., 2017), offline RL (Agarwal et al., 2020), and planning (Srinivas et al., 2018; Tamar et al., 2016).

Acknowledgements

This research is supported in part by ONR PECASE N000141612723, Open Philanthropy, Darpa LEARN project, Darpa LwLL project, NSF NRI #2024675, Tencent, and Berkeley Deep Drive. We would like to thank Hao Liu for improving the presentation and giving helpful feedback. We would also like to thank Aviral Kumar and Kai Arulkumaran for providing tips on the implementation of DisCor and Rainbow.

References

Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

Amos, B., Stanton, S., Yarats, D., and Wilson, A. G. On the model-based stochastic value gradient for continuous reinforcement learning. arXiv preprint arXiv:2008.12775, 2020.

Anschel, O., Baram, N., and Shimkin, N. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, 2017.

Audibert, J.-Y., Munos, R., and Szepesvári, C. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Chen, R. Y., Sidor, S., Abbeel, P., and Schulman, J. Ucb exploration via q-ensembles. arXiv preprint arXiv:1706.01502, 2017.

Choi, J., Guo, Y., Moczulski, M., Oh, J., Wu, N., Norouzi, M., and Lee, H. Contingency-aware exploration in reinforcement learning. In International Conference on Learning Representations, 2019.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Efron, B. The jackknife, the bootstrap, and other resampling plans, volume 38. Siam, 1982.

Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

Hasselt, H. V. Double q-learning. In Advances in Neural Information Processing Systems, 2010.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for atari. In International Conference on Learning Representations, 2020.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.

Kim, S., Asadi, K., Littman, M., and Konidaris, G. Deepmellow: removing the need for a target network in deep q-learning. In International Joint Conference on Artificial Intelligence, 2019.

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021.

Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.

Kumar, A., Gupta, A., and Levine, S. Discor: Corrective feedback in reinforcement learning via distribution correction. In Advances in Neural Information Processing Systems, 2020.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin q-learning: Controlling the estimation bias of q-learning. In International Conference on Learning Representations, 2020.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, 2020.

Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In Advances in Neural Information Processing Systems, 2020.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, 2016a.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016b.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

Song, Z., Parr, R., and Carin, L. Revisiting the softmax bellman operator: New benefits and new perspective. In International Conference on Machine Learning, 2019.

Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Universal planning networks. In International Conference on Machine Learning, 2018.

Srinivas, A., Laskin, M., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, 2020.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, 2016.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems, 2017.

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In International Joint Conferences on Artificial Intelligence Organization, 2018.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI Conference on Artificial Intelligence, 2016.

van Hasselt, H. P., Hessel, M., and Aslanides, J. When to use parametric models in reinforcement learning? In Advances in Neural Information Processing Systems, 2019.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Wang, T. and Ba, J. Exploring model-based planning with policy networks. In International Conference on Learning Representations, 2020.

Wang, T., Bao, X., Clavera, I., Hoang, J., Wen, Y., Langlois, E., Zhang, S., Zhang, G., Abbeel, P., and Ba, J. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.

Wiering, M. A. and Van Hasselt, H. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 38(4):930–936, 2008.

Yarats, D., Zhang, A., Kostrikov, I., Amos, B., Pineau, J., and Fergus, R. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019.

Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
