

A3C-GS: Adaptive Moment Gradient Sharing With Locks for Asynchronous Actor–Critic Agents

Alfonso B. Labao, Mygel Andrei M. Martija, and Prospero C. Naval, Jr., Member, IEEE

Abstract— We propose an asynchronous gradient sharing mechanism for parallel actor–critic algorithms with improved exploration characteristics. The proposed algorithm (A3C-GS) has the property of automatically diversifying worker policies in the short term for exploration, thereby reducing the need for entropy loss terms. Despite policy diversification, the algorithm converges to the optimal policy in the long term. We show in our analysis that the gradient sharing operation is a composition of two contractions. The first contraction performs gradient computation, while the second contraction is a gradient sharing operation coordinated by locks. From these two contractions, certain short- and long-term properties result. For the short term, gradient sharing induces temporary heterogeneity in policies for performing needed exploration. In the long term, under a suitably small learning rate and gradient clipping, convergence to the optimal policy is theoretically guaranteed. We verify our results with several high-dimensional experiments and compare A3C-GS against other on-policy policy-gradient algorithms. Our proposed algorithm achieved the highest weighted score. Despite lower entropy weights, it performed well in high-dimensional environments that require exploration due to sparse rewards and those that need navigation in 3-D environments for long survival tasks. It consistently performed better than the base asynchronous advantage actor–critic (A3C) algorithm.

Index Terms— Actor–critic agents, deep neural networks, deep reinforcement learning (RL), policy gradient.

Manuscript received March 31, 2019; revised October 20, 2019 and January 21, 2020; accepted March 4, 2020. Date of publication April 10, 2020; date of current version March 1, 2021. (Corresponding author: Prospero C. Naval, Jr.) The authors are with the Computer Vision and Machine Intelligence Group, Department of Computer Science, University of the Philippines Diliman, Quezon City 1101, Philippines (e-mail: alfonso.labao@up.edu.ph; mmmartija@up.edu.ph; pcnaval@up.edu.ph).

I. INTRODUCTION

REINFORCEMENT learning (RL) algorithms solve problems that seek to maximize cumulative rewards attained in an environment [1]. RL is related to dynamic programming (DP) and optimal control. However, RL algorithms are more general and can approximate solutions for infinite horizon problems. RL problems are cast as Markov decision processes of agents in an environment. The environment provides agents with states and rewards depending on the agent's actions, while the agent searches for the optimal policy using these rewards.

Many recent RL algorithms fall under the policy gradient family, which directly estimates the agent's decision-making process (policy) [2]–[5]. Policy gradient methods converge faster than Q-learning, particularly its on-policy variants. However, several challenges remain in policy gradient algorithm design. Among these are bias and variance control in function approximation [2], [6], where recent advances are made possible by deep neural networks [6] that have variance-controlling mechanisms [1], [7], [8]. Another pervasive problem is the exploration mechanism crucial for RL algorithms to discover optimal policies [9]–[11].

Similar to other RL algorithms, policy gradient methods are prone to converge to local optima when exploration mechanisms are not integrated. This problem occurs especially when the environment provides sparse rewards or requires navigation. Prior works promote exploration by importance sampling [11], which controls the states encountered by the agent. Other works focus on variance control, such as trust region policy optimization (TRPO) [12], an early policy gradient algorithm. TRPO controls the neighborhood space of predicted policies and uses line-search optimization that can be hard to implement. A recent work, i.e., proximal policy optimization (PPO) [7], provides a practical mechanism to approximate the variance control properties of TRPO and performs comparably better than TRPO. Another algorithm, i.e., the soft actor–critic (SAC) [13], provides an off-policy mechanism for policy gradients by means of Q-networks to estimate value/policy functions. Other works employ large numbers of parallel agents, such as the popular asynchronous advantage actor–critic (A3C) algorithm [14], where several workers explore asynchronously and update a global set of parameters. A3C has been applied to Atari [14] and VizDoom [10], [15] environments and produced state-of-the-art results. Several works, including [14], promote exploration by including entropy terms in loss functions. Another work [10] shows how entropy can provide more exploration. To date, A3C is among the top-performing RL algorithms, rivaled by PPO. However, current A3C-based solutions for exploration have the following limitations.

1) Entropy loss terms result in biased estimates.
2) Update procedures result in near homogenous policies for large numbers of workers, which increases resource consumption while relying on statistical randomization to encounter rewarding trajectories.

We propose A3C-GS, an improvement over A3C, which approaches the exploration problem using gradient sharing over asynchronous agents with the Adam optimization. Our proposed algorithm addresses the two issues raised earlier through a twofold approach.


1) Encourage exploration via a gradient sharing operation that automatically diversifies policies among A3C workers to induce a larger search space but without compromising long-term convergence to optimal policies.
2) Lower weights in entropy loss terms for closer long-term convergence to the optimal policy without compromising exploration.

The main component of the A3C-GS algorithm is the gradient sharing operation F that induces staggered updates, propagates temporary biases, and prevents worker parameters from being equal in the short term, resulting in automatic diversification of policy estimates. With diversified policies, the algorithm's search space is widened, thereby increasing the probability to encounter rewarding paths. However, if the sharing operation is not controlled, long-term convergence may not occur since parameter differences among copies become "too large." A3C-GS provides a procedure to control parameter differences despite sharing. In our analysis, given two conditions, 1) a suitably small learning rate and 2) gradient clipping, differences resulting from the sharing operation are small enough such that the operation remains a contraction. Thus, A3C-GS exhibits worker policy diversification for automatic exploration but, at the same time, converges closely to the optimal policy in the long term due to its contraction properties and lower entropy weights.

We test our method on high-dimensional environments against other state-of-the-art on-policy algorithms, such as A3C [14], PPO [7], an asynchronous PPO, and generalized advantage estimation (GAE) [1]. Testing our system with PPO variants provides a comparison between variance control and exploration. From our results, A3C-GS is the most consistent, placing in the top three in almost all games. A3C-GS is comparable with PPO in games with large action spaces and 2-D navigation. A3C-GS is superior on games that require exploration due to sparse rewards or navigation in 3-D environments for long survival tasks. We also present several ablation tests that compare A3C-GS with standard A3C along with several experiments to verify our claimed properties. A3C-GS shows faster convergence rates and higher attained cumulative rewards than A3C even with a smaller number of workers and lower entropy loss weights.

II. POLICY GRADIENT ALGORITHMS

A. Notation

For this article, we use the notations in [1]. Given environment E, the state provided by E to the RL (policy-gradient) agent is s_t, where t = 0, 1, 2, . . . , T and T denotes a terminal time. The agent maintains policy π(s_t, a_t) over state–action pairs that determines the action a_t it performs in s_t. The environment responds to a_t with reward r_t and transitions to the next state s_{t+1}. The agent maximizes a cumulative (γ-discounted) reward Σ_{t=0}^{∞} γ^t r_t. Maximization of rewards can be formulated as searching for the optimal policy π∗(s_t, a_t) that provides the maximizing a_t for each s_t. Policy gradients directly approximate π∗ using function approximators. In particular, deep policy gradient algorithms use deep networks to compute the optimal policy. In many of these models, the policy gradient's loss function L takes the form of a softmax cross entropy (1), where Ψ serves as a weight that allocates more importance to rewarding actions

L = E[Ψ log π(a_t | s_t)].   (1)

The policy network's gradient g is expressed in (2), where policies π(a_t | s_t) are distributions over softmax probabilities P(a_t | s_t)

g = E[Ψ ∇_θ log π(a_t | s_t)] ∼ E[Ψ ∇_θ log P(a_t | s_t)].   (2)

B. Actor–Critic With Advantage Weights

Empirical rewards can be assigned to Ψ in (2), but this results in high variances [1]. As an alternative, actor–critic algorithms use value functions v(s_t) for Ψ for lower variance. In this article, we use modified advantages A(s_t, a_t) from [1] as Ψ, as shown in (3). By subtracting a baseline in (3), A(s_t, a_t) variances are reduced while preserving unbiasedness. Actual rewards are in the second term Σ_{t'=t+1}^{T} γ^{t'} r_{t'}

A(s_t, a_t) = r_t + Σ_{t'=t+1}^{T} γ^{t'} r_{t'} − v(s_t).   (3)

In terms of deep network implementation, we follow [16], where policy and value networks share parameters.

C. Exploration and Asynchronous Advantage Actor–Critic

Exploration performs a principal role in RL. To show the need for exploration, we write (2) as follows with Ψ^A(s_t) as advantage weights for state s_t:

g = (1/T) Σ_{t=1}^{T} [Ψ^A(s_t, a_t) ∇_θ log π(a_t | s_t)].   (4)

From (4), if action trajectories [a_1, a_2, . . . , a_T] ∼ π(a_t | s_t) are deterministic for each s_t, there is no variation in Ψ^A(s_t), and total cumulative rewards do not increase. Exploration provides the needed variation. In standard A3C [14], exploration is done by parallel workers that search E while asynchronously updating parameters. In order to further encourage exploration, [14] included an entropy term in its loss function with a weight of 0.10 (5). Entropy is defined in (6), where P(ζ_i) corresponds to the probability of choosing action a_t = ζ_i

L_e = E[Ψ log π(a_t | s_t)] + E[H_t(ρ)]   (5)

H_t(ρ) = − Σ_{i}^{n} P_t(ζ_i) log P_t(ζ_i).   (6)

However, as we show in Lemma 11, (5) is biased due to the E[H_t(ρ)] term. A3C-GS, therefore, attempts to lessen the weight of E[H_t(ρ)] while maintaining exploration.
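To make these quantities concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the advantage weights in (3), the Ψ-weighted log-likelihood term of (1), and the entropy term of (5) and (6). The toy rollout, the network outputs, and the 0.005 entropy weight are illustrative assumptions.

# Minimal NumPy sketch of (1)-(6); illustrative only, not the paper's code.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def advantages(rewards, values, gamma=0.99):
    # A(s_t, a_t) = r_t + discounted future rewards - v(s_t), as in (3)
    # (here discounted from t onward, one common convention)
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values

def actor_loss(logits, actions, psi, entropy_weight=0.005):
    # Psi-weighted cross entropy of (1) plus the entropy bonus of (5)-(6),
    # written as a quantity to minimize.
    probs = softmax(logits)                                  # pi(a|s)
    logp = np.log(probs[np.arange(len(actions)), actions])
    pg_term = -(psi * logp).mean()                           # policy-gradient part
    entropy = -(probs * np.log(probs)).sum(axis=1).mean()    # H_t in (6)
    return pg_term - entropy_weight * entropy

# toy rollout: 5 steps, 3 actions
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
actions = rng.integers(0, 3, size=5)
rewards = rng.normal(size=5)
values = rng.normal(size=5)
psi = advantages(rewards, values)
print(actor_loss(logits, actions, psi))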


III. PROPOSED ALGORITHM OVERVIEW

A. Algorithm With Adaptive Moment Gradient Sharing

We present, in this section, an overview of A3C-GS (Algorithm 1). A3C-GS maintains M sets of global parameters and asynchronously shares gradients among N workers using the Adam optimization [17] for policy diversification. On the other hand, standard A3C maintains a single set of global parameters, with an update procedure that produces near homogenous policies among workers. We write down the procedures for A3C-GS as follows, with a more detailed description in the Appendix.

A3C-GS Summary (M Global Copies and N Workers):
1) M copies of global actor and critic parameters are initialized with a common set of parameters.
2) Each asynchronous worker thread i ∈ N is assigned a global actor/critic parameter copy m ∈ M. Worker i relies on m's actor parameters for its policy network and on m's critic for computing advantages A(a_t, s_t) (i.e., weight Ψ^A) during weight update steps.
3) Each parameter copy m ∈ M is assigned a lock, which are all initially deactivated.
4) During an update routine, worker i computes gradients g_i from its experience buffer B_i. It locally updates the parameters of its assigned parameter copy m using the computed gradient.
5) Worker i checks if all locks L_j (j ∈ M) of the other copies are deactivated. If true, it proceeds to the next step and activates all locks j ∈ M; otherwise, it waits.
6) Using its own gradients g_i, worker i updates the parameters of all other global copies j ∈ M, j ≠ m. This is the gradient sharing operation (termed operation F) among the global copies ∈ M.
7) All locks L_j (j ∈ M) are released, and worker i resumes training and exploration.

A3C-GS Properties:
1) Short Term:
a) Exploration Type I: Exploration as a result of staggered parameter updates from locking.
b) Exploration Type II: Exploration due to diversified policies induced by sharing operation F.
2) Long Term:
a) Stabilized training due to "tolerable" O(1) variations among parameters of global copies, independent of gradient magnitudes |g|.
b) Convergence to the optimal policy under a sufficiently small learning rate α and clipped gradients.
c) Lower bias due to lower entropy weights in loss function L.

These properties are illustrated in Fig. 1. Here, A3C-GS has a wider search space in the short term than A3C. Moreover, in the long term, it approaches the optimal policy more closely since it has lower entropy weights.

Fig. 1. Illustration showing the proposed algorithm's behavior. Due to diversified worker policies, A3C-GS (blue, left) has wider exploration at the start but automatically converges closely to the optimal policy (red) in the long term since it has a lower bias. Standard A3C (green, right) uses randomization for exploration and requires entropy terms, which potentially retains biases (shown in the gap between policy trajectories and the optimal policy). [Best viewed in color]

Algorithm 1 Adaptive Moment Gradient Sharing (A3C-GS)
1: Initialize:
   E_N ← N environment copies, t ← 0
   A_N ← N actor threads, B_N ← N buffers
   θ^g_M ← M global actor/critic parameters
   L_M ← M global locks for θ^g_M
2: L_M ← unlock, all locks are initially deactivated
3: θ^i_m ← θ^g_m for i ∈ N, m ∈ M: copy the M global parameters to the assigned N worker threads
4: while Σ_{t=0}^{T} γ^t r_t is below optimal criteria do
5:   for all i ∈ N worker threads A^i
6:     t ← t + 1
7:     m ← index of assigned global parameters for A^i
8:     E^i provides A^i with state s_t
9:     A^i performs action a_t ∼ π^i(a_t, s_t)
10:    E^i provides reward r_t, B^i appends tuple [s_t, a_t, r_t]
11:    if training for actor i == True then
12:      Compute bootstrap v_T for latest s_T
13:      Collect tuples [s_t, a_t, r_t] for t = 1..T from B^i
14:      Compute Σ_{t=0}^{T−1} γ^t r_t + γ^T v_T as actual rewards
15:      Estimate value function v(s_t) using θ^i_m critic parameters
16:      Compute A(s_t, a_t) from Σ_{t=0}^{T} γ^t r_t and v_t (Eq. 3)
17:      Compute gradients g for assigned global actor parameters using the cross-entropy loss (Eq. 2) with Ψ_t = A(s_t, a_t) for t = 1..T
18:      Compute gradients g for assigned global critic parameters using an MSE loss with targets Σ_{t=0}^{T−1} γ^t r_t + γ^T v_T
19:      Update assigned global parameters θ^g_m with accumulated gradients θ̂_i
20:      Update local parameters of thread θ^g_m → θ_m
21:      if ∃ L_j with j ∈ {M − m} that is locked then
22:        Wait for all other threads to unlock
23:      else
24:        L_i ← lock, all locks j ∈ M are locked
25:        Sharing Operation F: Update the other global parameter copies θ^g_j with j ∈ {M − m} using own accumulated gradients θ̂_i
26:        L_M ← unlock, all locks are deactivated
27:      Reinitialize B^i

IV. A3C-GS EXAMPLE AND EXPLORATION TYPE I

In this section, we provide examples of behavior under standard A3C and compare this with a specific form of A3C-GS given a common environment. In the demonstrations that follow, we use the environment E (see Fig. 2) for both standard A3C and A3C-GS. Here, {a_l, a_r} stand for left/right turn, θ_0 are initial parameters, and g represents gradients. We first show the behavior of standard A3C in E, where it is seen that its update procedure results in near homogenous policies among workers.


Fig. 2. Environment E with rewards inside nodes.

1) Standard A3C Update Procedure on E: Our simulation is illustrated in Fig. 3. After step 3, the resulting policies under standard A3C are

E[P(a_l | s)] = 0.50 − 4δ : probability to turn left from S
E[P(a_r | s)] = 0.50 + 4δ : probability to turn right from S.

The expected probability gap is large (8δ). We can also see that standard A3C closely ties local parameters to the global ones due to immediate local updates after global updates.

2) Proposed A3C-GS Update Procedure on E: In contrast, we show A3C-GS (see Fig. 4) following the same trajectories for each worker, where A3C-GS is given a specific setup with three global copies and three workers. We state that, given E and the same worker trajectories, local parameter variances in this A3C-GS model are larger than in standard A3C, assuming that gradient magnitudes are comparable. This exploration describing A3C-GS's behavior is classified as Type I exploration.

We now compute parameter variances among workers. Without loss of generality, let gradient g_2 = g_1 and g_3 = −g_2, following the given condition on comparable gradients. For standard A3C, we have W̄ = (θ_0 + g_1) + (g_2/3). This leads to Var_{A3C} W = 2g_2²/3. In contrast, for A3C-GS, we have W̄ = θ_0 + (2g_1/3) + (g_3/3) + (g_2/3), and its variance is Var_{A3C-GS} W = 8g_2²/3. From this, we have Var_{A3C} W ≤ Var_{A3C-GS} W, implying that A3C-GS has larger local parameter variances among workers. Moreover, the expected conditional probabilities for A3C-GS are

E[P(a_l | s)] = 0.50 − 3δ : probability to turn left from S
E[P(a_r | s)] = 0.50 + 3δ : probability to turn right from S.

With this policy, the gap between turning right and turning left is 6δ, implying that A3C-GS is more likely than A3C to turn left under E and rediscover the rewarding trajectory. This is an example of Type I exploration, which is an effect of locking that diversifies worker parameters, i.e., [θ_0 + g_1], [θ_0 + g_2], and [θ_0 + g_1 + g_3], in this example. However, we note that this behavior is dependent on the specific A3C-GS structure, the environment type, and the trajectories experienced. In other settings, however, Type I exploration is not guaranteed.

V. A3C-GS GRADIENT SHARING AND PROPERTIES

From Section IV, it can be observed that diversified worker parameters induce diversified worker policies, which, in turn, increase the search space and the probability of encountering rewarding trajectories. While Type I exploration produces this effect, the problem with Type I is its robustness since it is merely a result of staggered locking. This lack of robustness provides the motive for requiring an additional exploration behavior, i.e., Type II, whereby worker policies are automatically diversified in a more robust manner such that it applies to any environment and is not dependent on the model structure, i.e., it occurs for any arbitrary number of M global copies and N workers. Type II exploration is induced by the gradient sharing procedure that produces parameter differences. However, an uncontrolled form of gradient sharing may not result in long-term convergence. The analysis that follows shows the needed conditions to ensure that diversified worker policies under A3C-GS attain convergence in an automatic manner. These are expressed in Properties 2a–2c and 1b. The gradient sharing operation in A3C-GS is denoted as F. As mentioned, the purpose of F is to produce parameter differences for worker policy diversification, and it does this by staggering updates (see Lemma 8) and propagating biases (see Lemma 9). The entire A3C-GS algorithm can be viewed as composed of two parts: 1) gradient computation (Λ) and 2) gradient sharing (F), such that A3C-GS computes a composition Φ = F ∘ Λ. For Φ to work, its two components have to contract to a fixed point such that computed gradients (both shared and local) modify initial parameters toward an optimal state of parameters that compute optimal policies. Λ is shown to be a contraction in Lemma 7 (intuitively, gradients approach zero near convergence). However, for F to become a contraction, it needs specific conditions for stability since it propagates biases among copies. If the resulting parameter differences among copies become "too large," contraction may not occur. As a preliminary, we first state the Adam optimization (Algorithm 2) for reference:

Algorithm 2 Adam Optimization Algorithm
1: procedure ADAM
2: while θ_t not converged do
   (using notations m̄ = m_{t−1}, v̄ = v_{t−1}, θ̄ = θ_{t−1})
3:   t ← t + 1; g_t ← ∇_θ f_t(θ_{t−1})
4:   m_t ← β_1 m̄ + (1 − β_1) g_t; v_t ← β_2 v̄ + (1 − β_2) g_t²
5:   m̂_t ← m_t/(1 − β_1^t); v̂_t ← v_t/(1 − β_2^t)
6:   θ_t ← θ̄ − α m̂_t/(√v̂_t + ε)

Given the Adam algorithm, the contraction conditions are listed in Proposition 1 and apply to any pair of distinct global copies i and j ∈ M. We note that Proposition 1 does not necessarily require equal initial parameters among global copies, but rather that they stay "tolerably" close (condition 4).

Proposition 1: Let i and j refer to any pair of distinct global copies. Let m̄ refer to the first-moment moving average in the Adam optimizer prior to F, and let v̄ refer to the second-moment moving average. Let Adam parameters β_1, β_2 ∈ (0, 1) with β_2 > β_1. Given m̄, v̄, and g at update step s, if the following conditions hold (where c_1 and c_2 are constants), F(θ̂) is a contraction in the absolute value norm |x|.
1) Gradients are controlled subject to |g_i| < 1.0 and |g_j| < 1.0.
2) If m̄ > 0 for i or j, we have |g_i − g_j| < c_1 |m̄_i − m̄_j|.
3) The v̄'s are comparable subject to v̄_i/v̄_j ∼ 1.0.
4) The difference Δθ = |θ̄_i − θ̄_j| in weights prior to F is O(1) and independent of gradients subject to Δθ < c_2 |m̄_i − m̄_j|.

Under these four conditions, F is a contraction despite its bias-inducing properties, and F can be repeated until parameters converge (see Corollary 3).


Fig. 3. Standard A3C sample update procedure. (a) Step 0: initialization with one global actor and three worker threads. All conditional probabilities are uniform. (b) Step 1: worker 1 finishes with [S-right] and r = 1. Global parameters are updated to [θ_0 + g_1] and copied to local parameters. The probability to choose a_r from S is higher. (c) Step 2: worker 2 finishes with [S-right]. Global parameters are updated to [θ_0 + g_1 + g_2] and copied to local parameters. Probabilities are updated. (d) Step 3: worker 3 finishes with [S-left-left] and r = 2. It updates the global parameters, and gradient g_3 reduces the probability to turn right from S. [Best viewed in color]

Fig. 4. A3C-GS sample update procedure. (a) Step 0: initialization with three global copies, one for each worker. All conditional probabilities are uniform. (b) Step 1: simultaneous updates for workers. Workers 1 and 2 finish with trajectory [S-right] and update their global copies with gradients g_1 and g_2 followed by local updates. Worker 3 is still computing with trajectory [S-left-left]. (c) Step 2: worker 1 locks first and propagates g_1 to the other global copies (workers 2 and 3 retain local parameters). Locks are afterward released. (d) Step 3: worker 3 finishes computing and updates its global copy with g_3 (after the left worker released the locks). Local parameters are updated as well. (e) Step 4: since worker 2 finished before worker 3, it locks first, forcing worker 3 to wait. Worker 2 propagates g_2 to the other global copies and then releases all locks. (f) Step 5: locks are released by worker 2, and worker 3 propagates g_3 to the other global copies using the same lock–unlock procedure. [Best viewed in color]

This result forms the basis for constructing an "ideal" algorithm Φ∗ (Algorithm 3) that strictly ensures that all four conditions are satisfied. However, Algorithm 3 is not practical in terms of time complexity, and its tendency to keep all parameters almost equal may nullify exploration arising from parameter differences. Thus, we present A3C-GS, which is a practical approximation of Algorithm 3. A3C-GS has the property of approximating the conditions for F to become a contraction, with only a single application of F at each step. This is done, however, by requiring additional conditions that keep the biases of F small enough such that the parameters of the M copies remain "close" to each other and guarantee eventual convergence as described in Fig. 5. Specifically, parameter differences |θ_i − θ_j| between global copies i and j are considered tolerable if |θ_i − θ_j| = O(1), i.e., differences are independent of gradient magnitude |g|. The conditions for A3C-GS to remain a contraction are a sufficiently small learning rate α and gradient clipping.

Given the abovementioned two conditions, we can derive the properties stated in Section IV. The properties are listed in the following, with references to their corresponding lemma, corollary, and proposition numbers in the Appendix, where they are derived in nonsequential order.

Corollary 1 (Property 1b) Exploration Type II: Differences in policy distributions among workers lead to exploration, enabling A3C-GS to perform exploration despite lower weights for entropy terms in its loss function.
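In practice, the "tolerable" difference condition can be monitored directly during training. The helper below is a hypothetical sketch for tracking |θ_i − θ_j| between two global copies and checking a condition-4-style bound; the copies, the moment vectors, and the constant c_2 are placeholders, and the elementwise reading of the bound is an assumption for illustration.

# Hypothetical monitoring of parameter drift between two global copies.
import numpy as np

def max_param_gap(theta_i, theta_j):
    """Largest absolute parameter difference |theta_i - theta_j|."""
    return float(np.max(np.abs(theta_i - theta_j)))

def condition4_ok(theta_i, theta_j, m_i, m_j, c2=0.01):
    """Check delta_theta < c2 * |m_i - m_j| elementwise (condition 4 of Prop. 1)."""
    return bool(np.all(np.abs(theta_i - theta_j) < c2 * np.abs(m_i - m_j) + 1e-12))

rng = np.random.default_rng(1)
theta_i = rng.normal(size=5); theta_j = theta_i + 1e-4 * rng.normal(size=5)
m_i = rng.normal(size=5); m_j = m_i + 0.1 * rng.normal(size=5)
print(max_param_gap(theta_i, theta_j), condition4_ok(theta_i, theta_j, m_i, m_j))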


Fig. 5. Convergence behavior of A3C-GS. The true gradients (blue arrows) draw the worker's policy (blue curved line) toward the optimal policy (red solid arrow) since they are unbiased. F shares biased gradients (green arrows) that may or may not draw the worker's policy toward the correct one. However, as long as there is clipping and a sufficiently small learning rate, gradient magnitudes are controlled and policies move toward the optimal policy. Shared gradients induce less deterministic policies for exploration and widen the search space. If the optimal policy is instead the red dotted line, it would not be discovered without the sharing operation. [Best viewed in color]

Lemma 1 (Property 2a): Using a sufficiently small α, parameter differences Δθ = |θ_i − θ_j| between global copies i and j are "tolerable" (or O(1), independent of gradients) under operation F at any step s. In addition, the parameter difference Δθ can be made arbitrarily small subject to Δθ < c_2 |m̄_i − m̄_j|.

Proposition 2 (Property 2b): Suppose that the optimal trajectory has been discovered by using Property 1b. Given gradient clipping and a sufficiently small learning rate α, the algorithm converges, and the parameters for all workers estimate the optimal policy π∗.

Corollary 2 (Property 2c) Lower Bias: Lower weights in loss entropy terms result in lower bias for A3C-GS.

A. Further Details on Proposition 1's Conditions

The abovementioned properties rely on Proposition 1's convergence properties. Given its importance, we state the key lemmas that describe how gradient clipping and small learning rates manage to approximate the four conditions of Proposition 1. Conditions 1–3 can be implemented by gradient clipping, as shown in Lemmas 2 and 3, while condition 4 is implemented by small learning rates, as shown in Lemma 1.

Lemma 2: For any distinct pair of global copies i and j, gradients g_i and g_j can be clipped subject to: 1) c_1 |m̂_i − m̂_j| > |g_j − g_i| for any constant c_1; 2) |g_i| < 1.0 and |g_j| < 1.0 (conditions 1–2 of Proposition 1); and 3) |g_j|/|g_i| ∼ 1.

Lemma 3: Given Lemma 2, there exists a sufficiently small α subject to v̄_i/v̄_j ∼ 1.0 (condition 3 of Proposition 1), given any constant difference between parameters θ_i and θ_j.

Complying with condition 4 takes a bit more work and requires Lemma 4 as a prerequisite for Lemma 1, which holds that a sufficiently small learning rate can keep parameter differences controlled and tolerable at O(1) (Property 2a).

Lemma 4: Suppose that there are two global copies i and j with common initial parameters θ̄ at s = 0. For any update step s, let Δ denote the difference in their parameters, i.e., Δ = θ_{i,s} − θ_{j,s}. There exists a learning rate α such that |Δ| is small (at any step s) relative to differences in the second moment of the gradient |g²| and the first- and second-moment Adam moving averages (|m| and |v|).

Proposition 3: F creates short-term divergences in probability distributions π for any pair of workers i and j if the global copies are still far from estimating the optimal policy π∗.

TABLE I
RANKINGS OF ON-POLICY ALGORITHMS PER GAME

TABLE II
RANKING OF ON-POLICY ALGORITHMS BASED ON A WEIGHTED SCORE, WHERE FIRST PLACE = 3.0 POINTS, SECOND PLACE = 2.0 POINTS, AND THIRD PLACE = 1.0 POINT

VI. EXPERIMENTAL RESULTS

A. Comparison of A3C-GS With Other On-Policy Algorithms

In this section, we compare the following on-policy algorithms.
1) A3C-GS: Three global copies with two workers each.
2) A3C-Standard: Six workers.
3) GAE: Generalized advantage estimation [1].
4) PPO: Proximal policy optimization [7].
5) Asynchronous PPO: Six workers.

For these algorithms, we use the deep network trunk in [18] and a learning rate of 1e-5, except for Pong, which uses 1e-4. GAE uses the advantage shown in Section II-B. Entropy weights for A3C-standard and the PPO variants are 0.10, while we use a lower entropy weight of 0.005 for A3C-GS. A3C-GS implements norm gradient clipping with a clip value of 40.0. These algorithms are implemented in several high-dimensional environments derived from the VizDoom platform [15] and the recent retro game environment [19]. The numbers of actions in the VizDoom and retro environments are 3 and 12, respectively.

We group the games into the following three: 1) Group A is games that require little exploration and whose rewards are easy to achieve; 2) Group B is games that require exploration due to a larger action space and a need for 2-D navigation, but rewards are frequent; and 3) Group C is games that require exploration due to sparse rewards or require 3-D navigation for long survival tasks (where, generally, 3-D environments have more degrees of freedom for navigation than 2-D).


Fig. 6. Performance of on-policy algorithms on games that require less exploration with frequent rewards (Group A). (a) Shooter (doom basic). (b) Shooter (doom rocket). (c) Doom defend center. (d) Doom defend line. [Best viewed in color]

Fig. 7. Performance of on-policy algorithms on games that require exploration due to larger action spaces (except Pong) and the need for 2-D navigation, with relatively frequent rewards (Pong has the sparsest rewards among these) (Group B). (a) Samurai Showdown. (b) Airstriker Genesis. (c) AeroFighters. (d) Zero Wing. (e) Earth Defense Force. (f) Mortal Kombat. (g) Twin Cobra. (h) Alpha Mission. (i) Pong. [Best viewed in color]

Among the games, those in Group A are simple VizDoom games that are easily predictable. Group B is retro games that have 12 actions and require 2-D navigation with frequent rewards. Group C is predict-position games with moving targets and rare rewards, along with navigation in 3-D environments (health/take cover). We summarize the results in Table I, along with a simple weighted score in Table II.

We can observe in Table I and Fig. 6 that the PPO algorithms converge faster in Group A games, where stability is more important than exploration. In Group B, with 2-D navigation, A3C-GS and asynchronous PPO are comparable, as shown in Fig. 7. For instance, in Zero Wing, PPO outperformed A3C-GS, while A3C-GS outperformed PPO in Earth Defense Force and AirStriker Genesis. However, in games that belong to Group C (shown in Fig. 8), A3C-GS is superior, particularly in the Doom Predict scenario where the timeout is limited to 300, i.e., the agent has to quickly explore to obtain a reward. The same applies to Doom Health, where navigation is in 3-D with more degrees of freedom along with a chance to get stuck in the environment's walls.

From Tables I and II, A3C-GS and PPO-async performed the best. However, A3C-GS is more consistent than PPO-async, with more top-three positions and the highest weighted score of 37. Moreover, A3C-GS outperformed standard A3C in all games (except for comparable performance in Doom Health), indicating performance gains. GAE is outperformed by the other algorithms. During training time, we note that PPO seems sensitive to initialization. For instance, under one initialization in Pong, PPO converged quickly. On a different initialization, it converged slowly. A3C-GS, meanwhile, delivered similar performance across all training repetitions.

B. Experimental Verification of A3C-GS Properties

Here, we compare A3C-GS and standard A3C with six and 12 workers. A3C-GS has three variants, each having six workers. For all models, we use the deep network trunk in [18] and implement a dueling network [16]. We use Atari game suite environments [20] following [14], [21]. For standard A3C, we follow [14] and provide an entropy weight of 0.1. For all A3C-GS variants, we reduce the entropy weight to 0.001.
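The policy-diversification statistic reported in this verification (the KL divergences of Fig. 10) can be computed per state from the workers' categorical action distributions. A minimal version follows; the probability vectors are made-up examples, not values from the experiments.

# KL divergence between two workers' action distributions at one state.
import numpy as np

def kl_categorical(p, q, eps=1e-8):
    """KL(p || q) for action-probability vectors p and q."""
    p = np.clip(p, eps, 1.0); q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

pi_worker_a = np.array([0.7, 0.2, 0.1])     # pi(a|s) of one worker at a state
pi_worker_b = np.array([0.4, 0.4, 0.2])     # another worker, same state
print(kl_categorical(pi_worker_a, pi_worker_b))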


Fig. 8. Performance of on-policy algorithms on games that require exploration due to sparse rewards (doom predict) or the need for navigation in 3-D environments for long survival tasks (doom health/take cover) (Group C). (a) Shorter doom predict position with Timeout = 300. (b) Normal doom predict position. (c) Doom health. (d) Doom take cover. [Best viewed in color]

Fig. 9. Experiments in hyperparameters: A3C-GS (three versions, blue lines) versus standard A3C [six workers (red) and 12 workers (orange)]. α ranges from 1E-4 to 1E-6 with either gradient or value gradient clips. α is similar for all models at each game. In Assault, α in A3C-GS-2 and A3C-GS-3 is lowered 10X for convergence. (a) Pong. (b) Assault (kills/game). (c) Space Invaders. The number of lives of all agents for Assault and Space Invaders is set to 1. [Best viewed in color]

Fig. 10. A3C-GS training statistics and convergence against standard A3C in Pong. (a) Entropy in policies. (b) KL divergences. (c) Absolute weight differences. (d) Average gradients. [Best viewed in color]

Fig. 11. A3C-GS ablation test on Assault. Labels refer to clip and learning rate, respectively. To converge, α has to be set to 1E-6 with at least a gradient clip of 0.10. [Best viewed in color]

Experimental Models:
1) A3C-GS-1: Two global copies with three workers each.
2) A3C-GS-2: Three global copies with two workers each.
3) A3C-GS-3: Six global copies with one worker each.
4) Standard A3C-6: One global copy with six workers.
5) Standard A3C-12: One global copy with 12 workers.

From Fig. 9, the three types of A3C-GS have better results than Standard A3C-6 and Standard A3C-12 in Pong and Space Invaders despite having fewer workers. A3C-GS in Assault learned to stay alive longer. Hence, increasing the worker count may not necessarily result in more exploration compared with diversifying worker policies. This is seen in Fig. 10(a), where A3C-GS has a larger decrease in entropy than standard A3C. Fig. 10 shows several statistics comparing A3C-GS with standard A3C. From Fig. 10(a), A3C-GS has lower (less negative) entropy than standard A3C, supporting Corollary 1 and Property 2c. However, as shown in the higher KL divergences in Fig. 10(b), A3C-GS has more diversified policies than standard A3C. This shows Property 1b (Exploration Type II), supporting Proposition 3 and Corollary 1. The differences in worker parameters are shown in Fig. 10(c), where A3C-GS has more differences in worker parameters than standard A3C. We also observe that these absolute differences are small or "tolerable," verifying Property 2a (Lemma 1).

The convergence of A3C-GS depends on clipped gradients and a sufficiently small learning rate to meet the contraction conditions of Proposition 1. Using global norm clipping, average gradients are close to zero [see Fig. 10(d)]. In addition, weight differences in Fig. 10(c) are kept within the ±0.0005 range at α = 0.0001, despite comparable gradient magnitudes. As shown in Fig. 11, A3C-GS has to have the right learning rate and clip; otherwise, it will not converge. Regarding convergence (i.e., Propositions 2 and 4), we see, in Fig. 10(b), that KL divergences exhibit a downward trend as training progresses. This indicates a contraction in policies toward the optimal policy. Moreover, despite the parameter differences in Fig. 10(c), we see that the differences taper off eventually.

VII. CONCLUSION

We presented the asynchronous A3C-GS algorithm that uses gradient sharing to promote exploration under temporary biases.


Being a composition of contractions, A3C-GS converges to the optimal policy given a suitably small learning rate and gradient clipping. From experiments, A3C-GS produces more diversified policy distributions among workers for exploration, which, nonetheless, converge to the optimal policy in the long run. A3C-GS reports higher performance than standard A3C despite maintaining a smaller pool of workers and consistently gained top positions in several games, particularly in environments that require exploration due to sparse rewards or 3-D navigation for long survival tasks.

APPENDIX

In the following, we derive and analyze Properties 2a and 2b, followed by Property 1b. The flow of analysis is as follows:
1) construction of Φ = F ∘ Λ (see Sections A–C);
2) construction of the ideal algorithm Φ∗ (see Section C);
3) approximation of Φ∗ by A3C-GS (see Section D);
4) convergence of A3C-GS (see Sections E–G).

The main difference between A3C-GS and standard A3C lies in F. F propagates biases (see Lemma 9) but remains a contraction under some conditions (see Proposition 1). If we combine F with the gradient computation Λ, we construct a contraction Φ = F ∘ Λ (see Proposition 4). Φ can be computed by an "ideal" algorithm Φ∗ that is guaranteed to converge but is impractical. For a practical algorithm, A3C-GS in Section D approximates Φ∗ using a suitably small learning rate and gradient clipping. These conditions keep tolerable O(1) parameter variations among global copies and result in convergence (as shown in Section E). Finally, we use the O(1) variations among parameters to show the exploration and bias properties in Sections F and G.

A. Construction of Contraction Λ (Gradient Computation)

To construct Λ, we present T and Λ′, which follow from SGD [17]. For Lemmas 5 and 6, we suppress the input state s_t in the loss function L [as done in (1)] for ease of notation. We note that T and Λ′ use SGD, whereas the actual contraction used in A3C-GS is Λ, which uses Adam. T and Λ′, however, are more tractable and approximate Λ [17].

Lemma 5: Let i and j refer to any pair of global copies. Let T ∼ L, where L is a convex loss function computed from unbiased targets for any input state. Policy estimates for L are computed from initial parameters θ̄_i and θ̄_j and accumulated gradients ĝ. Given any initial parameter set, T is a contraction under ||x||_∞ and a sufficiently small learning rate α ∈ (0, 1) with a fixed point of 0.

Proof: Let ĝ_i and ĝ_j refer to the accumulated gradients for global copies i and j, respectively, over update steps 1 · · · s, i.e., for i: ĝ_i = α Σ_s ∂L/∂θ_{i,s} (given initial parameter set θ̄_i). The input g is the current gradient computed using L and unbiased targets. With ĝ and g, T performs the following process for i:

T(L, θ̄_i, ĝ_i, g) = L(θ̄ − ĝ_i − α[g])   where g = ∂L(θ̄ − ĝ_i)/∂(θ̄ − ĝ_i).

Let T(L_ĝi(g)) = T(L, θ̄_i, ĝ_i, g) (similarly for j). Under the sup-norm, we have the following for d[T(L_ĝi(g)), T(L_ĝj(g))]:

= || L(θ̄_i − ĝ_i − α ∂L(θ̄_i − ĝ_i)/∂(θ̄_i − ĝ_i)) − L(θ̄_j − ĝ_j − α ∂L(θ̄_j − ĝ_j)/∂(θ̄_j − ĝ_j)) ||_∞.

With α sufficiently small, we see that for i, we have T(L_ĝi(g)) ∼ max∗_g(L_ĝi(g)), where max∗ signifies maximization due to SGD. Recall that SGD T brings (negative) convex loss functions closer to zero as training proceeds, which is a maximization operation over the loss. The same holds true for j. Let L_ĝi(g) = L(θ̄ − ĝ_i − α g) and L_ĝj(g) = L(θ̄ − ĝ_j − α g) (i.e., the loss under an arbitrary g). We have the following:

||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ ∼ | max∗_g(L_ĝi(g)) − max∗_g(L_ĝj(g)) |
and α ||L_ĝi(g) − L_ĝj(g)||_∞ ∼ max∗_g |L_ĝi(g) − L_ĝj(g)|
⇒ | max∗_g(L_ĝi(g)) − max∗_g(L_ĝj(g)) | ≤ max∗_g |L_ĝi(g) − L_ĝj(g)|
⇒ ||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ ≤ α ||L_ĝi(g) − L_ĝj(g)||_∞.

Hence, T is a contraction. From SGD, we have L → 0, and therefore, ||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ → 0. □

Lemma 5 shows the limiting behavior of loss functions given arbitrary initial parameters θ̄_i and θ̄_j along with the contraction properties of SGD. We now state Lemma 6, which extends Lemma 5 to construct contraction Λ′. This is followed by Lemma 7, which extends Lemma 6 to Adam gradient updates.

Lemma 6: Suppose that for any input state, gradients are computed such that they modify parameters toward estimating the optimal policy π∗. From contraction T in Lemma 5, we can derive another contraction Λ′ whose output is the gradient g at each step in A3C-GS, which implies that |g_i − g_j| of global copies i and j converges to zero for any input state.

Proof: Using similar notations as in Lemma 5, we construct Λ′ as a transformation of T, where Λ′ is an updated gradient computed right after the application of T with prior gradient g = ∂L(θ̄ − ĝ)/∂(θ̄ − ĝ),

Λ′[T(L_ĝ(g))] = ∂L(θ̄ − ĝ − α[g]) / ∂(θ̄ − ĝ − α[g]) = g̃.

Λ′ is a monotonic transformation of T and takes the gradient with respect to an updated weight [θ̄ − ĝ − α[g]]. Since Λ′ is a monotonic transformation of T(L), and T is a contraction [T → 0], we have Λ′ → 0 as well since g → 0 and T → 0 as L → 0. From the given, suppose that there are global copies i and j. At its fixed point of 0, we have |g̃_i − g̃_j| → 0 ⇒ |g_i − g_j| → 0 since g̃ = g on the next computation of T. At this point, parameters θ_i and θ_j estimate the optimal policy π∗. □

Lemma 7: Given the same conditions as Lemma 6, suppose that instead of SGD gradient updates, we use Adam gradient updates. We can construct another contraction Λ that contracts to the same fixed point as Λ′.

Proof: Under the said assumption, the Adam gradient updates on parameters also serve to minimize loss functions, similar to SGD. As shown in [17], Adam has faster convergence properties relative to SGD. Hence, Lemma 6 holds, and Λ behaves similar to Λ′, i.e., it converges to a fixed point. □
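A toy numeric check of the contraction idea behind Lemmas 5 and 6: two parameter copies trained by SGD on the same convex loss approach the same minimizer, so the gap between their gradients shrinks geometrically. The quadratic loss and step size below are arbitrary illustrative choices, not quantities from the paper.

# Gradient-gap contraction under SGD on a shared convex loss; illustrative only.
import numpy as np

def grad(theta):                  # gradient of L(theta) = 0.5 * ||theta - 1||^2
    return theta - 1.0

alpha = 0.1
theta_i = np.array([5.0, -3.0])
theta_j = np.array([-2.0, 4.0])
for step in range(50):
    g_i, g_j = grad(theta_i), grad(theta_j)
    theta_i, theta_j = theta_i - alpha * g_i, theta_j - alpha * g_j
    if step % 10 == 0:
        print(step, np.abs(g_i - g_j).max())
# the printed gap |g_i - g_j| decreases toward zero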



B. Construction of Contraction F (Sharing Operation)

With the first contraction Λ completed, we construct contraction F, i.e., the gradient sharing operation. Let F refer to the gradient sharing process in A3C-GS that shares g_i and g_j between any pair of global copies i and j after all locks are released. For i, let A(θ̂_i, g_j) refer to the Adam gradient update with θ̂ as the updated global parameter of i (i.e., the parameters right after the worker updated its assigned global copy and prior to sharing in F). We denote the output of F as the reupdated parameters θ̂_i∗ using the shared gradient g_j and Adam

F(θ̂_i) = A(θ̂_i, g_j) → θ̂_i∗.

We note, in Lemma 9, that F does not trivially render parameters equal and that it propagates biases. Despite these biases, Proposition 1 shows that, under certain conditions, F is a contraction. However, first, we state Lemma 8 to show how F induces parameter differences among copies.

Lemma 8: Using Adam, parameters of global copies in A3C-GS during training are not equal with very large probability if there is a difference in parameter update ordering under F.

Proof: Suppose that there are two global copies, one for each worker. We need a scenario with a difference in update ordering in F. For instance, let a, b, c, and d be gradients. Two copies have different parameter update orderings if one is updated using a different permutation than the other (for instance, one is updated [a + b + c + d], while the other is updated [a + c + b + d]). Let this be the following.
1) Worker i computes gradient g_i.
2) Worker j computes gradient g_j.
3) Worker i updates global copy i and its local parameters.
4) Worker j updates global copy j and its local parameters.
5) Worker i activates all locks.
6) Worker i updates global copy j and then releases all locks.
7) Worker j activates all locks.
8) Worker j updates global copy i and then releases all locks.

This scenario provides an update ordering g_i → g_j for copy i and g_j → g_i for copy j. Let m̄, v̄, and θ̄ refer to the initial values of the moving average mean, variance, and parameters for global copies i and j. Following Algorithm 2, for steps 1 and 2:

g_i = ∇_θ̄ L,   g_j = ∇_θ̄ L
m_i = β_1 m̄ + (1 − β_1) g_i,   m_j = β_1 m̄ + (1 − β_1) g_j
v_i = β_2 v̄ + (1 − β_2) g_i²,   v_j = β_2 v̄ + (1 − β_2) g_j²
δ_i = α m_i √(1 − β_2) / ((1 − β_1) v_i),   δ_j = α m_j √(1 − β_2) / ((1 − β_1) v_j)
θ_i = θ̄ − δ_i,   θ_j = θ̄ − δ_j.   (7)

After computing θ_i and θ_j, we move to steps 5 and 6 of the update schedule to arrive at the final parameters θ_{i→j} and θ_{j→i} for the first and second global copies at the end of training step s = 1

m_{i→j} = β_1(β_1 m̄ + (1 − β_1) g_i) + (1 − β_1) g_j
v_{i→j} = β_2(β_2 v̄ + (1 − β_2) g_i²) + (1 − β_2) g_j²
δ_{i→j} = α m_{i→j} √(1 − β_2²) / ((1 − β_1²) v_{i→j});   θ_{i→j} = θ̄ − δ_i − δ_{i→j}
m_{j→i} = β_1(β_1 m̄ + (1 − β_1) g_j) + (1 − β_1) g_i
v_{j→i} = β_2(β_2 v̄ + (1 − β_2) g_j²) + (1 − β_2) g_i²
δ_{j→i} = α m_{j→i} √(1 − β_2²) / ((1 − β_1²) v_{j→i});   θ_{j→i} = θ̄ − δ_j − δ_{j→i}.   (8)

To show that the difference in parameters between i and j is not trivially equal to zero, we write the difference θ_{i→j} − θ_{j→i} = θ̄ − δ_i − δ_{i→j} − θ̄ + δ_j + δ_{j→i}. This is also = δ_j + δ_{j→i} − δ_i − δ_{i→j} = [δ_j − δ_i] + [δ_{j→i} − δ_{i→j}]. Here, θ_{i→j} − θ_{j→i} is composed of two components: 1) [δ_j − δ_i] and 2) [δ_{j→i} − δ_{i→j}]. There are some scenarios where both components are equal to zero. For the first component, we have zero if m_i − m_j = 0.0 and v_i/v_j = 1.0. Similarly, we have [δ_{j→i} − δ_{i→j}] = 0 if (1 − β_1)²(g_i − g_j) = 0 and v_{j→i}/v_{i→j} = 1.0. This happens only when all gradients are equal (a rare occurrence). Now suppose that [δ_j − δ_i] ≠ 0 and [δ_{j→i} − δ_{i→j}] ≠ 0 but that [δ_j − δ_i] + [δ_{j→i} − δ_{i→j}] = 0. This arises iff

α m_j √(1 − β_2) / ((1 − β_1) v_j) − α m_i √(1 − β_2) / ((1 − β_1) v_i)
+ α m_{j→i} √(1 − β_2²) / ((1 − β_1²) v_{j→i}) − α m_{i→j} √(1 − β_2²) / ((1 − β_1²) v_{i→j}) = 0.   (9)

The roots of (9) are involved, but it can be simplified if we substitute β_1 = 0.9 and β_2 = 0.99, which are the standard Adam parameters. Eventually, we have

[(9) = 0] ⇒ [v_j m_i − v_i m_j = 0]   and   [v_{i→j} m_{j→i} + v_{j→i} m_{i→j} = 0].

It can be shown that [v_j m_i − v_i m_j] = 0 if g_i = g_j. Similarly, it can be shown that [v_{i→j} m_{j→i} + v_{j→i} m_{i→j}] = 0 if g_i = −g_j. Hence, for the parameters of i and j to be equal, gradients have to be g_i = g_j or g_i = −g_j ⇒ |g_i| = |g_j|. Given the large randomization involved in training, it is very unlikely that all gradients are equal in magnitude for all workers. □

Lemma 9: Using the Adam optimization, suppose that the prior moving variance v̄ and the prior moving mean m̄ of global copies i and j are not equal, i.e., v̄_i ≠ v̄_j and m̄_i ≠ m̄_j, due to variances in gradients brought about by different parameters (see Lemma 8). A single application of F in A3C-GS results in bias for the updated global parameters θ̂_i∗ and θ̂_j∗, and θ̂_i∗ ≠ θ̂_j∗.

Proof: The proof follows from the way F is constructed. For i, we see that F(θ̂_i) performs the following for θ̂_i:

F(θ̂_i) = A(θ̂_i, g_j), using g_j as the gradient from copy j
       = θ̂_i − α [β_1(β_1 m̄_i + (1 − β_1) g_i) + (1 − β_1) g_j] / [β_2(β_2 v̄_i + (1 − β_2) g_i²) + (1 − β_2) g_j²].

From the definition of g_j, we have the following:

g_j = ∇_{θ̂_j} L, with θ̂_j as j's parameters prior to F.

Hence, an operation of F propagates onto global copy i's parameters (θ̂_i) the gradient g_j, which is a gradient of the loss L with respect to θ̂_j and not θ̂_i. This generates errors, and F(θ̂_i) = [F(θ̂_j) + ε] ⇒ θ̂_i∗ ≠ θ̂_j∗. □
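Lemma 8 can be checked numerically: applying the same two gradients through Adam in opposite orders (g_i then g_j for one copy, g_j then g_i for the other) leaves the two copies with different parameters. The gradient values and hyperparameters below are arbitrary illustrative choices; adam_step follows Algorithm 2.

# Numeric illustration of Lemma 8: update ordering matters under Adam.
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-2, beta1=0.9, beta2=0.99, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

g_i = np.array([0.8, -0.3])
g_j = np.array([-0.5, 0.4])
theta0 = np.zeros(2)

def run(order):
    theta, m, v = theta0.copy(), np.zeros(2), np.zeros(2)
    for t, g in enumerate(order, start=1):
        theta, m, v = adam_step(theta, g, m, v, t)
    return theta

copy_i = run([g_i, g_j])   # update ordering g_i -> g_j
copy_j = run([g_j, g_i])   # update ordering g_j -> g_i
print(copy_i, copy_j, np.abs(copy_i - copy_j).max())   # nonzero gap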


Proposition 1 (See Section V):

Proof: Suppose that workers have updated their global copies i and j and are now under F at update step s. We have the following for θ̂_i and θ̂_j under the Adam optimization algorithm (see Algorithm 2) and prior parameters θ̄_i and θ̄_j. Note that these are the parameters involved in local Adam gradient updates prior to F, i.e., the parameters involved during the computation of Λ

θ̂_i = θ̄_i − α m_i √(1 − β_2^s) / ((1 − β_1^s) v_i),   θ̂_j = θ̄_j − α m_j √(1 − β_2^s) / ((1 − β_1^s) v_j).

Using the absolute norm |x − y|, let k = α ∈ (0, 1) and compute k[d(θ̂_i, θ̂_j)] as

k[d(θ̂_i, θ̂_j)] = k | [θ̄_i − θ̄_j] − α (√(1 − β_2^s)/(1 − β_1^s)) (m_i/v_i − m_j/v_j) |.   (10)

Here, the last term in k[d(θ̂_i, θ̂_j)] is expanded as

m_i/v_i − m_j/v_j = (β_1 m̄_i + (1 − β_1) g_i)/(β_2 v̄_i + (1 − β_2) g_i²) − (β_1 m̄_j + (1 − β_1) g_j)/(β_2 v̄_j + (1 − β_2) g_j²).   (11)

Next, we compute d(F(θ̂_i), F(θ̂_j)) as follows (i.e., the difference after application of F):

d(F(θ̂_i), F(θ̂_j)) = | [θ̄_i − θ̄_j] − α (√(1 − β_2^{s+1})/(1 − β_1^{s+1})) ξ |.   (12)

Here, ξ is equal to

ξ = [β_1(β_1 m̄_i + (1 − β_1) g_i) + (1 − β_1) g_j] / [β_2(β_2 v̄_i + (1 − β_2) g_i²) + (1 − β_2) g_j²]
  − [β_1(β_1 m̄_j + (1 − β_1) g_j) + (1 − β_1) g_i] / [β_2(β_2 v̄_j + (1 − β_2) g_j²) + (1 − β_2) g_i²].   (13)

To prove that F is a contraction, we need to show that (12) is less than (10). This depends on several parameters m̄_i, m̄_j, v̄_i, v̄_j, and β_1, β_2. We can use the conditions to simplify the analysis and let β_1 = 0.9 and β_2 = 0.99 as done in most standard Adam parameter implementations. Let Δ_1 = k[d(θ̂_i, θ̂_j)] and Δ_2 = d(F(θ̂_i), F(θ̂_j)). Similarly, let Δθ = θ̄_i − θ̄_j, and define ω_1 and ω_2 as (substituting β_1 and β_2)

ω_1 = √(1 − 0.99^s)/(1 − 0.9^s),   ω_2 = √(1 − 0.99^{s+1})/(1 − 0.9^{s+1}).

We compute Δ_1 in the following, where we use conditions 1 and 2 in Proposition 1 to factor out the g² terms and to come up with a common second-moment moving average v̄_i = v̄_j = v̄

Δ_1 ∼ k | Δθ − α ω_1 [v_j(0.9 m̄_i + 0.1 g_i) − v_i(0.9 m̄_j + 0.1 g_j)] / (0.99 v̄_i v̄_j) |
    ∼ k | Δθ − α ω_1 [0.9 (m̄_i − m̄_j)/v̄ + 0.1 (g_i − g_j)/v̄] |.   (14)

Similarly, we compute Δ_2 in the following, reusing conditions 1 and 2 in Proposition 1 to simplify the equations as done in Δ_1. In the last line in the following, ε denotes a very small number:

Δ_2 ∼ | Δθ − α ω_2 [ (0.9(0.9 m̄_i + 0.1 g_i) + 0.1 g_j)/(0.99(0.99 v̄_i + 0.01 g_i²) + 0.01 g_j²)
      − (0.9(0.9 m̄_j + 0.1 g_j) + 0.1 g_i)/(0.99(0.99 v̄_j + 0.01 g_j²) + 0.01 g_i²) ] |
    ∼ | Δθ − α ω_2 [0.83 (m̄_i − m̄_j)/v̄ + ε (g_i − g_j)/v̄] |.   (15)

We can observe that |ω_1 − ω_2| → 0 as the update step s → ∞. However, for any update step s, the second term of Δ_1 is bigger than the second term of Δ_2 under condition 2, where |g_i − g_j| < c_1 |m̄_i − m̄_j|. Moreover, from condition 4, we see that the contribution of the Δθ terms can be rendered negligible subject to |Δθ| < c_2 |m̄_i − m̄_j|, i.e., for these to be valid, we can set c_1 and c_2 to be 0.01, for instance. Hence, using conditions 1–4 and a suitable k ∈ (0, 1), we have [(12) < (10)], showing that F is a contraction. Under the four conditions, we can observe that while a single application of F leads to bias (Lemma 9), additional applications of F using g_i and g_j lower the difference between g_i and g_j and move parameters closer (due to the ε term in Δ_2). This is a property of F being a contraction. □

From this, if parameter differences Δθ are large (i.e., condition 4 is violated), contraction may not occur. In Corollary 3, we describe an ideal scenario where initial parameters are equal (not required in Proposition 1).

Corollary 3: For any update step s, given the conditions in Proposition 1 and common initial parameters θ̄_i = θ̄_j, repeated application of F using gradients g_i and g_j lowers the loss and asymptotically leads to equal parameters.

C. Construction of Contraction Φ and Ideal Algorithm Φ∗

From the prior discussion, we compose another contraction Φ = F ∘ Λ that has both gradient computation and sharing. There exists an ideal algorithm Φ∗ that computes Φ and meets all the conditions in Proposition 1. We start with Proposition 4 on the existence of Φ, followed by Proposition 5.

Proposition 4: Given conditions 1–3 in Proposition 1, common initial parameters for all global copies, and a sufficiently small learning rate α, there exists a contraction Φ = F ∘ Λ.

Proof: From (13) and Proposition 1, F takes gradients g, which can be set as the outputs of Λ. This stepwise operation is a composition, where, at step s, Λ computes gradients g and feeds all inputs to F. F iterates until convergence of θ̂∗ to ensure condition 4 along with the other conditions, ensuring that F is a contraction. Afterward, Λ recomputes g for step s + 1. From Φ = F ∘ Λ, a composition of contractions is a contraction. Hence, Φ is a contraction with fixed point θ∗. □

Proposition 5: Assume that condition 3 of Proposition 1 holds with full information on all trajectories under E. There exists an algorithm Φ∗ such that Φ∗ computes Φ, converges to fixed point θ∗, and computes the optimal policy π∗.

Proof: We construct a pseudocode for Φ∗ as follows. The algorithm is ideal since it ensures that the conditions in Proposition 1 are met at each training update step s. Let i and j refer to any pair of distinct global copies under Algorithm 3.
tions 1 and 2 in Proposition 1 to simplify the equations as j refer to any pair of distinct global copies under Algorithm 3.


Algorithm 3 Algorithm A∗ (All Parameters Are Initially Equal)
 1: while ∀M, N: π not converged to π∗ do
 2:   for each worker i do
 3:     worker i explores the environment under copy j
 4:     if training == true then
 5:       Compute gradients g using G with L
 6:       Clip g (conditions 1 and 3 of Proposition 1)
 7:       Update assigned copy j with g
 8:       Wait until all workers i′ ≠ i have updated
 9:   if all workers have done local updates then
10:     while ∀θ not converged do
11:       Apply F to all global copies i ≠ j
12:       Compute θ̂∗ for all copies i ≠ j
13:       If θ̂i∗ ≠ θ̂j∗ for any pair i ≠ j, repeat
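As a rough illustration only, the following self-contained toy mirrors the control flow of Algorithm 3 on a quadratic stand-in for the loss L: local (clipped) updates by every worker, a synchronization point, and repeated application of a sharing contraction until the global copies agree. The quadratic loss, the averaging form used for the sharing step, and all constants are assumptions made for this sketch; they are not the A3C-GS operators themselves.

```python
import numpy as np

# Toy sketch of the control flow of Algorithm 3 (ideal algorithm A*); the quadratic
# loss, the averaging form of the sharing step, and all constants are illustrative.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])              # stands in for the optimal parameters
copies = [rng.normal(size=3) for _ in range(4)]  # global copies, one per worker
alpha = 0.05                                     # small learning rate

def grad(theta):                                 # stands in for the operator G with loss L
    return theta - target                        # gradient of 0.5 * ||theta - target||^2

def clip(g, max_norm=1.0):                       # clipping step (conditions of Proposition 1)
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)

def apply_F(copies):                             # sharing step: pull copies toward each other
    mean = np.mean(copies, axis=0)
    return [0.5 * (c + mean) for c in copies]    # a contraction toward the common mean

for step in range(200):
    copies = [c - alpha * clip(grad(c)) for c in copies]      # local updates by all workers
    while max(np.linalg.norm(c - copies[0]) for c in copies) > 1e-8:
        copies = apply_F(copies)                 # iterate the sharing step until copies agree
print(np.round(copies[0], 3))                    # approaches the target parameters
```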
At initialization, parameters are equal (i.e., θi = θj). Workers assigned to i and j act on the environment E and compute respective gradients gi and gj (gi ≠ gj) from the computation of G for i and j. With full information (from the given), these gradients are computed with respect to targets derived from the optimal policy π∗. Gradients are clipped at this point, meeting conditions 1 and 2. Using Lemma 1 and Corollary 3, we apply F (using gradients g from G) iteratively until convergence. This results in equal weights for i and j, with Δθ = 0, meeting condition 4. After F converges, we let the updated workers i and j resume (no need for exploration given full information). Thus, all conditions for F's contraction are satisfied. Since the parameters for all copies are equal after repeated applications of F, at the next update, the resulting gradients and first/second moments computed by G in Lemma 7 are equal for all global copies (i.e., G's Adam parameters are equally modified by F for all global copies, which is similar to applying G to several batches). Using Lemma 7 and Proposition 1, this process converges to a fixed point θ̂i = θ̂j = Λ∗. This is because both the loss (see Corollary 3) and the gradients (see Lemma 7) decrease if F converges at the end of every update step subject to θ̄i = θ̄j. Thus, for G, we have [gi → 0 and gj → 0] subject to [gi = gj = 0]. An application of F in this case results in the fixed-point parameters Λ∗ (see Corollary 3). Now, we only have gi = gj, i.e., |gi − gj| = 0, if the loss L of both global copies is 0.0 or they estimate the optimal policy, i.e., f(θ̂i∗) = f(θ̂j∗) = f(Λ∗) = π∗. □

Corollary 4: Our proposed A3C-GS ≠ A∗.
Proof: Unlike A∗, A3C-GS applies F only once despite θ̄i ≠ θ̄j. Thus, condition 4 in Proposition 1 is not met. □

D. Approximation of A∗ by A3C-GS: Properties 2a and 2b

E. Lemmas for Proving Property 2a

From Lemma 8, F results in parameter differences among global copies, so the ideal case of Corollary 3 does not hold. However, from Proposition 1, F remains a contraction as long as the parameter differences are not too large (condition 4). We show that this problem can be solved in a practical way using Lemmas 1–4.

Lemma 2 (See Section V):
Proof: Condition 1 can be easily implemented by clipping all gradient magnitudes |gi| and |gj| to be < 1.0. Similarly, one can proportionally decrease both gradients to meet condition 2. For condition 3, if |gi| > |gj|, simply clip |gi| until |gi|/|gj| ∼ 1.0 (and vice versa). □

Lemma 3 (See Section V):
Proof: Let the gradients (for copy i) be expressed as gi,s = γi,s θi,s−1, where γi,s is the proportion of the gradient at update step s derived from the current parameter. Using this, we express the gradients for all s as a function of θ0, γi,s, and the initial gradient gi,0 through this recursion. For i, we have

gi,0 = γi,0 θi,0
θi,1 = θi,0 + α gi,0,   gi,1 = γi,1 θi,1 = γi,1(θi,0 + α gi,0)
θi,2 = θi,1 + α gi,1 = [θi,0 + α gi,0] + α[γi,1(θi,0 + α gi,0)] ···

The same recursion holds for copy j, except that we have θj,0 = θi,0 + Δ, where Δ is the parameter difference. We express the second-moment moving average v̂i for copy i as follows:

v̂i,0 = [γi,0 θi,0]²
v̂i,1 = 0.9[γi,0 θi,0]² + 0.1[γi,1 θi,1]²
     = 0.9[γi,0 θi,0]² + 0.1[γi,1(θi,0 + α γi,0 θi,0)]² ···

The same holds for j. However, applying Lemma 2, we see that we can write j's version of v̂j,1 as

v̂j,1 = 0.9[γj,0 θj,0]² + 0.1[γj,1(θj,0 + α γj,0 θj,0)]²
     = 0.9[γj,0(θi,0 + Δ)]² + 0.1[γj,1(θi,0 + Δ) + α γj,0(θi,0 + Δ)]²
     ∼ 0.9[γi,0 θi,0]² + 0.1[γi,1(θi,0 + α γi,0 θi,0)]².

The last line follows since the procedure in Lemma 2 specifically sets the gradients |gi,s| ∼ |gj,s|; since all terms in v̂j,1 are squared, only the magnitude of the gradients is important, not their direction. Moreover, the above equations show that clipping as per Lemma 2 renders each term in the recursion of v̂i,s similar in magnitude to its corresponding term in the recursion of v̂j,s, independent of Δ. For more detail, we could write v̂i,s/v̂j,s algebraically, but the equations are too involved. Instead, we follow a similar trick and use the Taylor series to approximate v̂i,s/v̂j,s centered at α = 0 to check its behavior relative to α. Let τi,s = Σ_{l=0}^{s} cl γi,l² and τj,s = Σ_{l=0}^{s} cl γj,l². The Taylor series centered at α = 0 is

τi,s/τj,s + α f1(τi,s, θi,0²)/f2(τj,s, θi,0²) + α² f3(τi,s, θj,0²)/f4(τj,s, θj,0²) + O(α³).

In this equation, all the f's are additive functions of the form f(τi,s, θ0²) = Σ_s cs γi,s θ0², where the c's are constants. From Lemma 2, we have τi,s ∼ τj,s since, as mentioned, at each step s, clipping renders all terms in the recursion similar. We can then write the Taylor series as ∼ 1 + αC + α²C + α³C + ···, where C is a constant close to 1. By letting α be small enough, we can keep the condition v̂i/v̂j ∼ 1.0 satisfied. □

Lemma 4 (See Section V):
Proof: For an informal proof of Lemma 4, we use similar equations as in Lemma 5. At any update step s, we write the parameter θi,s for global copy i (and analogously for j) as θi,s = θ̄ − α Σs gs^i and θj,s = θ̄ − α Σs gs^j. We then have
θi,s − θj,s = −α(Σs gs^i − Σs gs^j). Here, we see that |θi,s − θj,s| ≥ 0 is a monotonic function of α. Given information on the magnitude |Σs gs^i − Σs gs^j|, we can always set α to a desired low value such that |θi,s − θj,s| is minimized. However, for future update steps under a fixed α, suppose that we are at update step s + 1, where the policy Pi(a) under i is computed via softmax, and the respective weight for action a is θi,s(a). Let ζ denote the set of possible actions and Âs(a) the advantage weight for action a [as per (1)]. The gradient for i is

gi,s+1(a) = (∂L/∂θa,i,s) I[a],   where I is an indicator function
          = −Âs(a) log Pi(a) = −Âs(a) log [ e^{θi,s(a)} / Σ_{a′∈ζ} e^{θi,s(a′)} ]
          = −Âs(a) [ Σ_{a′∈ζ−a} e^{θi,s(a′)} / Σ_{a′∈ζ} e^{θi,s(a′)} ]
          = −Âs(a) [ 1 − e^{θ̄(a) − α Σs gs^i(a)} / Σ_{a′∈ζ} e^{θ̄(a′) − α Σs gs^i(a′)} ].   (16)
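To make the form of (16) concrete, the short snippet below evaluates a softmax policy and the corresponding per-action gradient magnitude Â·(1 − Pi(a)) for a small action set; the weights and the advantage value are made-up numbers used only to illustrate the expression.

```python
import numpy as np

theta = np.array([0.2, 0.1, -0.1, 0.0])      # illustrative action weights theta_{i,s}(a)
advantage = 0.7                               # made-up advantage weight for action a = 0
probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy P_i(a)

# Gradient of -advantage * log P_i(a) with respect to theta(a), as in (16):
# -advantage * (1 - P_i(a)), i.e., minus the advantage times the other actions' softmax mass.
a = 0
grad_a = -advantage * (1.0 - probs[a])
print(round(float(probs[a]), 4), round(float(grad_a), 4))
```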
For worker j, the gradient is similar. From (16), α is a term in the exponential of the numerator and denominator, and its contribution affects all actions, as seen in Σ_{a′∈ζ} e^{θ̄(a′) − α Σs gs^i(a′)}. We can compute the derivative ∂[∂L/∂θa,i,s]/∂α to see the effect of α on (16), but it is involved. To approximate the effect of α, we compute the Taylor series of ∂[∂L/∂θa,i,s]/∂α centered at α = 0. Let |ζ| denote the cardinality of the action set, and let |ζ′| = |ζ| − 1. The Taylor series centered at α = 0 is

(|ζ′|/|ζ|²) Âs(a) + α (|ζ′|/|ζ|²) Âs(a) [ |ζ′| θa − Σ_{a′∈ζ−a} θa′ ] + O(α²).   (17)

From (17), we have a nonzero gradient |ζ′|/(|ζ| Âs(a)) independent of α. Hence, if we set α to be arbitrarily small, i.e., α → 0, we still have gs(a) = |ζ′|/(|ζ| Âs(a)) ≥ 0. We can see that the first term of gi,s+1(a) is not a function of α, while the other terms are functions of α through the prior θa. Hence, we can set α small enough such that there are low parameter differences (at step s), Δ = θi,s − θj,s = −α(Σs gs^i − Σs gs^j), relative to gi,s+1(a) ∼ |ζ′|/(|ζ| Âs(a)) (at step s + 1). It also follows that Δ can be set small relative to the first and second moments of gi,s+1(a) by setting it small compared with |ζ′|/(|ζ| Âs(a)). For future update steps, i.e., s + 2, s + 3, . . ., a sufficiently small α likewise keeps Δ low. □

Lemma 1 (See Section V):
Proof: Let gi and gj denote gradients. Similarly, let there be common initial parameters θ̄ for i and j. We write the gradient for θj as gj = gi + ϕ(Δ), where ϕ is a monotonic function of the parameter difference. For brevity, we will not show the form of ϕ, but it is a computable function that expresses the gradient gj in terms of gi along with a "shift" expressed in the parameter difference Δ = |θi − θj| after SGD updates. We decompose the parameter differences θi→j − θj→i into [δj − δi] (differences due to local updates prior to sharing) and [δj→i − δi→j] (differences due to sharing after local updates). The key result is that θi→j − θj→i has the following structure, given f1 and f2 that are increasing functions of g², m, and v:

[θi→j − θj→i] = ϕ(Δ) f1(g², m, v)/f2(g², m, v).   (18)

With (18), if |Δ| ≪ {|gi²|, |m|, |v|} (using a sufficiently small α under Lemma 4), we get to prove the lemma, since |ϕ(Δ)| is a monotonic function of |Δ|. We write a finer decomposition of [δj − δi] and [δj→i − δi→j] as follows, with Ψ1 as [δj − δi] and Ψ2 as [δj→i − δi→j]. Let m̄ and v̄ denote any initial moving mean and variance before F in Ψ1 and Ψ2, which are given by

Ψ1 = α [ ((β1 m̄ + (1 − β1)(gi + ϕ(Δ)))/(1 − β1)) / √((β2 v̄ + (1 − β2)(gi + ϕ(Δ))²)/(1 − β2))
       − ((β1 m̄ + (1 − β1)gi)/(1 − β1)) / √((β2 v̄ + (1 − β2)gi²)/(1 − β2)) ]

Ψ2 = α [ ((β1(β1 m̄ + (1 − β1)(gi + ϕ(Δ))) + (1 − β1)gi)/(1 − β1²)) / √((β2(β2 v̄ + (1 − β2)(gi + ϕ(Δ))²) + (1 − β2)gi²)/(1 − β2²))
       − ((β1(β1 m̄ + (1 − β1)gi) + (1 − β1)(gi + ϕ(Δ)))/(1 − β1²)) / √((β2(β2 v̄ + (1 − β2)gi²) + (1 − β2)(gi + ϕ(Δ))²)/(1 − β2²)) ].

It is involved to transform the abovementioned equations into the form of (18). However, we can simplify by substituting β1 = 0.9 and β2 = 0.99, i.e., the standard Adam parameters, and compute a Taylor series centered around Δ = 0. The Taylor series are as follows, with c1 = −10, c2 = −180, c3 = 990, c4 = 99, c5 = −0.302, c6 = 0.304, c7 = −18.38, and c8 = 49.2513:

Ψ1∗ ∼ (c1 gi² + c2 gi m̄ + c3 v̄) Δ / (gi² + c4 v̄)² + O(Δ²/(gi² + c4 v̄)³)
Ψ2∗ ∼ (c5 gi² + c6 gi m̄ + c7 v̄) Δ / (gi² + c8 v̄)² + O(Δ²/(gi² + c8 v̄)³).   (19)

We see that (19) follows the structure of (18), where Δ and Δ² are in the numerators (f1), and their contributions are bounded by the denominators, which are functions (f2) of the gradient's second moment gi² and moving variance v̄. In the numerators, we see that the coefficients c2 and c6 of the middle term gi m̄ have values of −180 and 0.304, respectively. These are lower compared with c3 and c7 of the third term v̄, which are 990 and −18.38, respectively. Similarly, c4 and c8 are relatively larger than c1, c2, c5, and c6. Given these coefficients, if m̄ is close to v̄, we see that for Ψ1, there is more likelihood that |f2| > |f1|. For Ψ2, we have more likelihood that |f1| ≤ |f2|. In either case, controlling for other variables aside from the coefficients, we have |f1| ≤ |f2|, indicating that [θi→j − θj→i] is likely to be smaller. Nonetheless, we have shown that (19) has a form similar to (18), and we can always set Δ [and, consequently, ϕ(Δ)] to be tolerable using a sufficiently small α, as per Lemma 4.
With this and |Δ| ≪ {|gi²|, |m|, |v|}, we have

[θi→j − θj→i] = O(Δ/(gi² + c4 v̄)²) + O(Δ/(gi² + c8 v̄)²) + O(Δ²/(gi² + c4 v̄)³) + O(Δ²/(gi² + c8 v̄)³).

This shows that [θi→j − θj→i] = 4·O(1) = O(1). Moreover, the difference is a monotonic function of Δ and can be made arbitrarily small to meet the condition of Lemma 1. □
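For intuition, the following numerical sketch mimics the kind of moment sharing analyzed above: two copies start from a common moving mean and variance, each applies its own gradient and then the other copy's gradient, and the difference between the resulting updates is proportional to the learning rate. The two-step update rule shown is a simplified stand-in consistent with the Ψ1/Ψ2 expressions, not the exact A3C-GS sharing operation.

```python
import numpy as np

def two_step_adam_direction(m, v, g_first, g_second, beta1=0.9, beta2=0.99, eps=1e-8):
    # Two successive Adam moment updates: the copy's own gradient, then the shared one
    # (a simplified stand-in for the local update followed by the sharing step F).
    for g in (g_first, g_second):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return m / (np.sqrt(v) + eps)            # direction of the resulting parameter update

m_bar, v_bar = 0.3, 0.5                      # common initial moving mean and variance
g_i, g_j = 0.8, 0.6                          # clipped gradients of the two copies

d_i = two_step_adam_direction(m_bar, v_bar, g_i, g_j)   # copy i: local g_i, then shared g_j
d_j = two_step_adam_direction(m_bar, v_bar, g_j, g_i)   # copy j: local g_j, then shared g_i

for alpha in (0.1, 0.01, 0.001):
    # The post-sharing parameter gap scales with the learning rate, as the lemma argues.
    print(alpha, abs(alpha * d_i - alpha * d_j))
```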

F. Property 2b (Asymptotic Convergence)

Proposition 2 (See Section V):
Proof: This proposition states that, for a sufficiently small α, A3C-GS approximates Algorithm A∗. As mentioned earlier, A3C-GS performs a single F operation in each update step, which may cause Proposition 1's conditions to be violated. However, under a sufficiently small α (see Lemma 1), we have |Δ| = |θi − θj| = O(1), with |Δ| < |m̄i − m̄j| to keep condition 4. From Lemma 3, the small value of α keeps the second-moment moving variances ∼ 1.0 for condition 3. From Lemma 2, A3C-GS performs gradient clipping for conditions 1 and 2. Hence, all conditions of Proposition 1 are met, and F contracts.
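The clipping mentioned here can be sketched as follows for a pair of scalar gradients; the 1.0 bound follows Lemma 2, while the ratio tolerance is an illustrative choice (condition 2's proportional rescaling of both gradients is omitted for brevity).

```python
import numpy as np

def clip_pair(g_i, g_j, ratio_tol=1.05):
    # Condition 1: keep both gradient magnitudes below 1.0.
    g_i = float(np.clip(g_i, -1.0, 1.0))
    g_j = float(np.clip(g_j, -1.0, 1.0))
    # Condition 3: shrink the larger gradient until |g_i| / |g_j| is close to 1.0.
    a, b = abs(g_i) + 1e-12, abs(g_j) + 1e-12
    if a / b > ratio_tol:
        g_i *= (b * ratio_tol) / a
    elif b / a > ratio_tol:
        g_j *= (a * ratio_tol) / b
    return g_i, g_j

print(clip_pair(2.5, 0.4))   # both gradients end up bounded and of comparable magnitude
```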
From this, suppose that the optimal trajectory has been discovered from exploration (see Property 1b), subject to the requirement that, for any input state, gradients are computed with respect to optimal targets. From [1] and [12], these conditions allow the convergence of policy gradients needed for Lemmas 6–7. However, we note that G computes its first and second gradient moments locally, with respect to the global copy's parameters and targets. In contrast, F's sharing modifies G's Adam parameters. Nonetheless, conditions 1–4 of Proposition 1 sufficiently ensure that G's contraction is preserved. Informally, conditions 1 and 2 keep differences in gradients g within a constant c1 of the first moments m. Condition 3 ensures that the second moments v of the copies are ∼ 1. Condition 4 ensures that θ for all copies is kept close. With these controls, the first and second moments of gradients under F ◦ G are roughly comparable to the first and second gradient moments computed under G, subject to variations that can be treated as noise, which Adam tolerates [17]. Moreover, at each update step, given the gradient controls of conditions 1–4, G is able to correct any biases induced by F from prior update steps since gradients do not "explode." We note that such biases result from parameter θ differences among copies and varying exploration trajectories. However, given that the optimal trajectory has been discovered (from the given) and that F is a contraction (the conditions of Proposition 1 are satisfied), parameter differences will eventually be reduced after repeated applications of F (since these differences are bounded by a constant factor under contraction). With these, the effect of biases from F on G's Adam parameters (particularly the first moment m) is reduced in the long term as parameters converge. Informally, this shows that G contracts, given that at each update step, it rectifies its Adam parameters to lower the loss, while the amount of rectification needed decreases as F's biases gradually reduce in the long term. Hence, given that F ◦ G is a composition of contractions, F ◦ G converges.

However, as F is not applied repeatedly until convergence, there is no guarantee that all global copies achieve similar parameters, i.e., if the gradients computed by G converge to zero faster than the contraction rate of F. However, we note that workers i and j can estimate a similar policy, i.e., πi ∼ πj for i ≠ j, even under different parameters. If i and j both estimate π∗ despite unequal parameters, then the gradients g = 0 for both i and j, resulting in zero gradients shared by F. In this case, both i and j estimate π∗ despite having different weights. □

G. Suboptimality in Including Entropy Terms in Loss Functions

Lemma 10: The second component of (5) for the entropy loss is nonzero if the optimal policy distribution π∗ is nonuniform.
Proof: Given a Bernoulli distribution of actions ζ0 and ζ1, from (6) and (5), the entropy loss's first total derivative D′ is −Σζi ∂Ht(ρ)/∂P(ζi) = log(P(1 − ζ0)) − log(P(ζ0)). When D′ = 0, the loss is optimal. The second derivative D″ is [−1/(x(1 − x))], which is negative within P(ζi) ∈ [0, 1], implying that the entropy loss is strictly concave. However, D′ is 0 at P(ζ0) = 0.5 under the Bernoulli distribution, implying that π∗ = {P(ζ0), P(ζ1)} = {0.50, 0.50}, or that π∗ is uniform. Thus, −Σζi ∂Ht(ρ)/∂P(ζi) = 0 for the gradient of (5) iff the estimated policy π is uniform. This proof can be generalized to [ζ0, ζ1, . . . , ζn]. □

Lemma 11: If π∗ is nonuniform, the computed actor policies π under a loss function with entropy terms are suboptimal.
Proof: From (2) and (6), the gradient g^e under the said loss gives (20); g^e is backpropagated to the parameters w ∈ W at update step s:

gs^e = (1/T) Σ_{t=1}^{T} [ Ât ∇θ log π(at | st) − Σ_{i}^{n} log Pt(ζi) ].   (20)

Suppose that the weight Ât is unbiased for all update steps t, i.e., E[Ât + εt] = Σ_{t′=t}^{T} γ^{t′} rt′, with zero-mean noise E[εt] = 0. The unbiased gradient for all update steps s is gs∗ = (1/T) Σ_{t=1}^{T} Ât ∇θ log π(at | st). Let gm∗ denote the unbiased gradient vector for w ∈ W. With α ∈ (0, 1), the updates are Ws = Ws−1 + α gs∗. Given α ∈ (0, 0.99) < 1.0 and the unbiasedness of gt∗ ∀{t, s}, repeated updates result in the optimal W∗ (from SGD properties [22]). We telescope this to W∗ = W0 + α Σ_{s=1}^{∞} gs∗ with gs∗ → 0. However, if we use gs^e in the procedure, we have Ws^e = W0 + α Σ_{s=1}^{∞} gs^e with gs^e → 0. Since gs^e → 0 does not imply gs∗ → 0, it is not guaranteed that we arrive at W∗ if we use gs^e and π∗ is not uniform. Under Lemma 10, the second term in (20) is a nonzero gradient, giving suboptimal parameters, and π ≠ π∗. □

Corollary 2 for Property 2c is implied by Lemma 11, i.e., lower entropy terms result in lower bias.

H. Property 1b, Exploration Type II, and 2c (Lower Bias)

For the proof of Proposition 3, we recall Lemma 8. Corollary 1 (see Property 1b) follows from this.

Proposition 3 (See Section V):
Proof: From Lemmas 8 and 9, we see that θi ≠ θj with large probability for any pair of global copies i ≠ j due to F. Differences in θ imply differences in the softmax policy distributions. However, these differences disappear as G nears convergence, as per Proposition 2. □
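As a simple illustration of this behavior, the snippet below compares the softmax policies induced by two parameter vectors: while the parameter gap is nonzero, the action distributions differ (short-term policy diversity), and as the gap shrinks toward convergence, the policies coincide. The parameter values are arbitrary.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_i = np.array([0.4, 0.1, -0.2])
for gap in (0.5, 0.05, 0.0):                          # shrinking parameter difference
    theta_j = theta_i + gap * np.array([1.0, -1.0, 0.5])
    print(gap, np.round(softmax(theta_i) - softmax(theta_j), 4))
```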

REFERENCES

[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[2] S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine, “Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3846–3855.
[3] A. B. Labao and P. C. Naval, “AC2: A policy gradient actor with primary and secondary critics,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2018, pp. 1–8.
[4] L. Li, D. Li, T. Song, and X. Xu, “Actor-critic learning control based on ℓ2-regularized temporal-difference prediction with gradient correction,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 5899–5909, Dec. 2018.
[5] W. Shi, S. Song, C. Wu, and C. L. P. Chen, “Multi pseudo Q-learning-based deterministic policy gradient for tracking control of autonomous underwater vehicles,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 12, pp. 3534–3546, Dec. 2019.
[6] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[7] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347. [Online]. Available: http://arxiv.org/abs/1707.06347
[8] A. B. Labao and P. C. Naval, “Stabilizing actor policies by approximating advantage distributions from K critics,” in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 1253–1258.
[9] P. Mirowski et al., “Learning to navigate in complex environments,” 2016, arXiv:1611.03673. [Online]. Available: http://arxiv.org/abs/1611.03673
[10] A. B. Labao, C. R. Raquel, and P. C. Naval, Jr., “Induced exploration on policy gradients by increasing actor entropy using advantage target regions,” in Proc. Int. Conf. Neural Inf. Process. Cham, Switzerland: Springer, 2018, pp. 655–667.
[11] Z. Wang et al., “Sample efficient actor-critic with experience replay,” 2016, arXiv:1611.01224. [Online]. Available: http://arxiv.org/abs/1611.01224
[12] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” in Proc. 32nd Int. Conf. Mach. Learn., vol. 37, Jul. 2015, pp. 1889–1897.
[13] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. 35th Int. Conf. Mach. Learn., vol. 80, 2018, pp. 1861–1870.
[14] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. 33rd Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[15] C. Schulze and M. Schulze, “ViZDoom: DRQN with prioritized experience replay, double-Q learning and snapshot ensembling,” in Proc. SAI Intell. Syst. Conf., 2018, pp. 1–17.
[16] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, 2016, pp. 1995–2003.
[17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[18] M. Hessel et al., “Rainbow: Combining improvements in deep reinforcement learning,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 3215–3222.
[19] T. Beysolow II, “Custom OpenAI reinforcement learning environments,” in Applied Reinforcement Learning With Python. Basel, Switzerland: Springer, 2019, pp. 95–112.
[20] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, Jun. 2013.
[21] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70, Aug. 2017, pp. 449–458.
[22] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 2595–2603.

Alfonso B. Labao received the M.S. degree in computer science from the University of the Philippines Diliman, Quezon City, Philippines, in 2017, where he is currently pursuing the Ph.D. degree in computer science.
He was a Researcher with the Computer Vision and Machine Intelligence Group, Computer Science Department, University of the Philippines Diliman. His research on reinforcement learning algorithms focuses on policy gradients for continuous control. His current research involves algorithmics and automata theory.

Mygel Andrei M. Martija received the B.S. degree in management engineering from the Ateneo de Manila University, Quezon City, Philippines, in 2016. He is currently pursuing the M.S. degree in computer science with the University of the Philippines Diliman, Quezon City.
He is currently a Researcher with the Computer Vision and Machine Intelligence Group, Computer Science Department, University of the Philippines Diliman. His current research interests include underwater computer vision, reinforcement learning, and object tracking.

Prospero C. Naval, Jr. (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of the Philippines Diliman, Quezon City, Philippines, and the M.Eng. degree in computer science from Kyoto University, Kyoto, Japan.
He is currently the Dado and Maria Banatao Professor of artificial intelligence with the Department of Computer Science, University of the Philippines Diliman, where he teaches courses on computer vision, probabilistic machine learning, and reinforcement learning. He is also the Founder and the current Laboratory Head of the Computer Vision and Machine Intelligence Group (CVMIG), Department of Computer Science, University of the Philippines Diliman, which focuses on the use of machine learning to solve problems in healthcare, environment, and education. He has authored or coauthored more than 100 articles in journals and conferences. His current research interests include underwater computer vision, intelligent control of underwater autonomous vehicles, swarm robotics, and computation.
Dr. Naval, Jr., has served as the Chair of the IEEE Philippine Section from 2015 to 2016.