Abstract— We propose an asynchronous gradient sharing mechanism for parallel actor–critic algorithms with improved exploration characteristics. The proposed algorithm (A3C-GS) has the property of automatically diversifying worker policies in the short term for exploration, thereby reducing the need for entropy loss terms. Despite policy diversification, the algorithm converges to the optimal policy in the long term. We show in our analysis that the gradient sharing operation is a composition of two contractions. The first contraction performs gradient computation, while the second contraction is a gradient sharing operation coordinated by locks. From these two contractions, certain short- and long-term properties result. For the short term, gradient sharing induces temporary heterogeneity in policies for performing needed exploration. In the long term, under a suitably small learning rate and gradient clipping, convergence to the optimal policy is theoretically guaranteed. We verify our results with several high-dimensional experiments and compare A3C-GS against other on-policy policy-gradient algorithms. Our proposed algorithm achieved the highest weighted score. Despite lower entropy weights, it performed well in high-dimensional environments that require exploration due to sparse rewards and those that need navigation in 3-D environments for long survival tasks. It consistently performed better than the base asynchronous advantage actor–critic (A3C) algorithm.

Index Terms— Actor–critic agents, deep neural networks, deep reinforcement learning (RL), policy gradient.

I. INTRODUCTION

…faster than Q-learning, particularly its on-policy variants. However, several challenges remain in policy gradient algorithm design. Among these are bias and variance control in function approximation [2], [6], where recent advances are made possible by deep neural networks [6] that have variance-controlling mechanisms [1], [7], [8]. Another pervasive problem is the exploration mechanism crucial for RL algorithms to discover optimal policies [9]–[11].

Similar to other RL algorithms, policy gradient methods are prone to converge to local optima when exploration mechanisms are not integrated. This problem occurs especially when the environment provides sparse rewards or requires navigation. Prior works promote exploration by importance sampling [11], which controls the states encountered by the agent. Other works focus on variance control, such as trust region policy optimization (TRPO) [12], an early policy gradient algorithm. TRPO controls the neighborhood space of predicted policies and uses line-search optimization that can be hard to implement. A recent work, i.e., proximal policy optimization (PPO) [7], provides a practical mechanism to approximate the variance control properties of TRPO and performs comparably better than TRPO. Another algorithm, i.e., the soft actor–critic (SAC) [13], provides an off-policy mechanism for policy gradients by means of Q-networks to estimate value/policy functions. Other works employ large…
Authorized licensed use limited to: Birla Institute of Technology & Science. Downloaded on January 28,2022 at 13:38:17 UTC from IEEE Xplore. Restrictions apply.
LABAO et al.: A3C-GS: ADAPTIVE MOMENT GRADIENT SHARING WITH LOCKS FOR ASYNCHRONOUS ACTOR–CRITIC AGENTS 1163
1) Encourage exploration via a gradient sharing operation that automatically diversifies policies among A3C workers to induce a larger search space but without compromising long-term convergence to optimal policies.
2) Lower weights in entropy loss terms for closer long-term convergence to the optimal policy without compromising exploration.

The main component of the A3C-GS algorithm is the gradient sharing operation F that induces staggered updates, propagates temporary biases, and prevents worker parameters from being equal in the short term, resulting in automatic diversification of policy estimates. With diversified policies, the algorithm's search space is widened, thereby increasing the probability to encounter rewarding paths. However, if the sharing operation is not controlled, long-term convergence may not occur since parameter differences among copies become "too large." A3C-GS provides a procedure to control parameter differences despite sharing. In our analysis, given two conditions: 1) a suitably small learning rate and 2) gradient clipping, the differences resulting from the sharing operation are small enough such that the operation remains a contraction. Thus, A3C-GS exhibits worker policy diversification for automatic exploration but, at the same time, converges closely to the optimal policy in the long term due to its contraction properties and lower entropy weights.

We test our method on high-dimensional environments against other state-of-the-art on-policy algorithms, such as A3C [14], PPO [7], an asynchronous PPO, and generalized advantage estimation (GAE) [1]. Testing our system against PPO variants provides a comparison between variance control and exploration. From our results, A3C-GS is the most consistent, placing in the top three in almost all games. A3C-GS is comparable with PPO in games with large action spaces and 2-D navigation. A3C-GS is superior on games that require exploration due to sparse rewards or navigation in 3-D environments for long survival tasks. We also present several ablation tests that compare A3C-GS with standard A3C, along with several experiments to verify our claimed properties. A3C-GS shows faster convergence rates and higher attained cumulative rewards than A3C even with a smaller number of workers and lower entropy loss weights.

II. POLICY GRADIENT ALGORITHMS

A. Notation

For this article, we use the notations in [1]. Given environment E, the state provided by E to the RL agent (policy gradient) is s_t, where t = 0, 1, 2, …, T and T denotes a terminal time. The agent maintains a policy π(s_t, a_t) over state–action pairs that determines the action a_t it performs in s_t. The environment responds to a_t with reward r_t and transitions to the next state s_{t+1}. The agent maximizes a cumulative (γ-discounted) reward Σ_{t=0}^{∞} γ^t r_t. Maximization of rewards can be formulated as searching for the optimal policy π*(s_t, a_t) that provides the maximizing a_t for each s_t. Policy gradients directly approximate π* using function approximators. In particular, deep policy gradient algorithms use deep networks to compute the optimal policy. In many of these models, the policy gradient's loss function L takes the form of a softmax cross entropy (1), where ϱ_t serves as a weight that allocates more importance to rewarding actions

L = E[ϱ_t log π(a_t | s_t)].  (1)

The policy network's gradient g is expressed in (2), where policies π(a_t | s_t) are distributions over softmax probabilities P(a_t | s_t)

g = E[ϱ_t ∇_θ log π(a_t | s_t)] ∼ E[ϱ_t ∇_θ log P(a_t | s_t)].  (2)

B. Actor–Critic With Advantage Weights

Empirical rewards can be assigned to ϱ_t in (2), but this results in high variances [1]. As an alternative, actor–critic algorithms use value functions v(s_t) for ϱ_t for lower variance. In this article, we use modified advantages A(s_t, a_t) from [1] as ϱ_t, as shown in (3). By subtracting the baseline v(s_t) in (3), A(s_t, a_t) variances are reduced while preserving unbiasedness. Actual rewards are in the second term Σ_{t'=t+1}^{T} γ^{t'} r_{t'}

A(s_t, a_t) = r_t + Σ_{t'=t+1}^{T} γ^{t'} r_{t'} − v(s_t).  (3)

In terms of deep network implementation, we follow [16], where the policy and value networks share parameters.

C. Exploration and Asynchronous Advantage Actor–Critic

Exploration performs a principal role in RL. To show the need for exploration, we write (2) as follows with A(s_t, a_t) as advantage weights for state s_t:

g = (1/T) Σ_{t=1}^{T} [A(s_t, a_t) ∇_θ log π(a_t | s_t)].  (4)

From (4), if action trajectories [a_1, a_2, …, a_T] ∼ π(a_t | s_t) are deterministic for each s_t, there is no variation in A(s_t, a_t), and total cumulative rewards do not increase. Exploration provides the needed variation. In standard A3C [14], exploration is done by parallel workers that search E while asynchronously updating parameters. To further encourage exploration, [14] included an entropy term in its loss function with a weight of 0.10 (5). Entropy is defined in (6), where P_t(ζ_i) corresponds to the probability of choosing action a_t = ζ_i

L_e = E[ϱ_t log π(a_t | s_t)] + E[H_t(ρ)]  (5)

H_t(ρ) = − Σ_i^n P_t(ζ_i) log P_t(ζ_i).  (6)

However, as we show in Lemma 11, (5) is biased due to the E[H_t(ρ)] term. A3C-GS, therefore, attempts to lessen the weight of E[H_t(ρ)] while maintaining exploration.
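The advantage weights of (3) and the entropy of (6) can be transcribed directly in NumPy. The following is our own illustrative transcription of the formulas as written, including the absolute-time discount γ^{t'} of (3); it is not the authors' implementation:

```python
import numpy as np

def advantage(rewards, values, gamma=0.99):
    # Eq. (3): A(s_t, a_t) = r_t + sum_{t'=t+1}^{T} gamma^{t'} r_{t'} - v(s_t)
    T = len(rewards)
    A = np.empty(T)
    for t in range(T):
        future = sum(gamma ** tp * rewards[tp] for tp in range(t + 1, T))
        A[t] = rewards[t] + future - values[t]
    return A

def entropy(p):
    # Eq. (6): H = -sum_i P(zeta_i) log P(zeta_i) for one action distribution p
    return -np.sum(p * np.log(p))
```

For a uniform distribution over n actions, `entropy` returns log n, its maximum; the entropy bonus in (5) therefore pushes policies toward uniformity, which is the bias that A3C-GS reduces by lowering the bonus weight.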
1164 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 32, NO. 3, MARCH 2021
…is the gradient sharing operation (termed operation F) among the global copies in M.
7) All locks L_j (j ∈ M) are released, and worker i resumes training and exploration.

Algorithm 2 (excerpt):
16: Estimate value function v(s_t) using θ_m critic parameters
18: Compute A(s_t, a_t) from Σ_{t=0}^{T} γ^t r_t and v_t (Eq. 3)
19: Compute gradients g for assigned global actor parameters using cross-entropy loss (Eq. 2) with ϱ_t = A(s_t, a_t) for t = 1..T
22: Compute gradients g for assigned global critic parameters using MSE loss with targets Σ_{t=0}^{T−1} γ^t r_t + γ^T v_T
25: Update assigned global parameters θ_m^g with accumulated gradients θ̂_i
27: Update local parameters of thread θ_m^g → θ_m
28: if ∃ L_j with j ∈ {M − m} that is locked then
29:   Wait for all other threads to unlock
30: else
31:   L_i ← lock; all locks j ∈ M are locked
32:   Sharing Operation F: update other global parameter copies θ_j^g with j ∈ {M − m} using own accumulated gradients θ̂_i
35:   L_M ← unlock; all locks are deactivated
36:   Reinitialize B_i

A3C-GS Properties:
1) Short Term:
   a) Exploration Type I: exploration as a result of staggered parameter updates from locking.
   b) Exploration Type II: exploration due to diversified policies induced by sharing operation F.
2) Long Term:
   a) Stabilized training due to "tolerable" O(1) variations among parameters of global copies, independent of gradient magnitudes |g|.
   b) Convergence to the optimal policy under a sufficiently small learning rate α and clipped gradients.
   c) Lower bias due to lower entropy weights in loss function L.

These properties are illustrated in Fig. 1. Here, A3C-GS has a wider search space in the short term than A3C. Moreover, in the long term, it approaches the optimal policy more closely since it has lower entropy weights.

IV. A3C-GS EXAMPLE AND EXPLORATION TYPE I

In this section, we provide examples of behavior under standard A3C and compare this with a specific form of A3C-GS given a common environment. In the demonstrations that follow, we use the environment E (see Fig. 2) for both standard A3C and A3C-GS. Here, {a_l, a_r} stand for left/right turns, θ_0 are the initial parameters, and g represents gradients. We first show the behavior of standard A3C in E, where it is seen that its update procedure results in near-homogeneous policies among workers.
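The lock–share–unlock discipline of operation F (lines 28–35 of Algorithm 2) can be illustrated with Python threads. This is a schematic of the locking protocol only: it uses plain SGD-style updates in place of Adam, and the `GlobalCopy` class and the two-worker setup are ours for illustration, not the paper's code:

```python
import threading

class GlobalCopy:
    # One global parameter copy with its own lock L_j.
    def __init__(self, theta):
        self.theta = theta
        self.lock = threading.Lock()

def worker(idx, copies, grad, alpha=0.01):
    # Update the assigned global copy with the accumulated gradient.
    with copies[idx].lock:
        copies[idx].theta -= alpha * grad
    # Sharing operation F: acquire every lock (in a fixed order, to avoid
    # deadlock), push own gradient to all other copies, then release.
    for c in copies:
        c.lock.acquire()
    for j, c in enumerate(copies):
        if j != idx:
            c.theta -= alpha * grad
    for c in copies:
        c.lock.release()

copies = [GlobalCopy(1.0), GlobalCopy(1.0)]
threads = [threading.Thread(target=worker, args=(i, copies, g))
           for i, g in enumerate([0.5, -0.25])]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every copy has now been updated with both workers' gradients exactly once.
```

With plain SGD, both copies end at the same value (1.0 − 0.01·0.5 + 0.01·0.25 = 0.9975) because the updates commute; Lemma 8 in the Appendix shows that under Adam's moment normalization, the different update orderings leave the copies unequal.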
Fig. 3. Standard A3C Sample Update Procedure. (a) Step 0: initialization with one global actor and three worker threads. All conditional probabilities are uniform. (b) Step 1: worker 1 finishes with [S-right] and r = 1. Global parameters are updated to [θ0 + g1] and copied to local parameters. The probability to choose ar from S is higher. (c) Step 2: worker 2 finishes with [S-right]. Global parameters are updated to [θ0 + g1 + g2] and copied to local parameters. Probabilities are updated. (d) Step 3: worker 3 finishes with [S-left-left] and r = 2. It updates the global parameters, and gradient g3 reduces the probability to turn right from S. [Best viewed in color]
Fig. 4. A3C-GS Sample Update Procedure. (a) Step 0: initialization with three global copies, one for each worker. All conditional probabilities are uniform.
(b) Step 1: simultaneous updates for workers. Workers 1 and 2 finish with trajectory [S-right] and update global copies with gradients g1 and g2 followed by local
updates. Worker 3 is still computing with trajectory [S-left-left]. (c) Step 2: worker 1 locks first and propagates g1 to other global copies (workers 2 and 3 retain
local parameters). Locks are afterward released. (d) Step 3: worker 3 finishes computing and updates its global copy with g3 (after the locks are released).
Local parameters are updated as well. (e) Step 4: since worker 2 finished before worker 3, it locks first, forcing worker 3 to wait. Worker 2 propagates g2
to other global copies then releases all locks. (f) Step 5: locks are released by worker 2, and worker 3 propagates g3 to other global copies using the same
lock–unlock procedure. [Best viewed in color]
parameters converge (see Corollary 3). This result forms the basis for constructing an "ideal" algorithm Φ* (Algorithm 3) that strictly ensures that all four conditions are satisfied. However, Algorithm 3 is not practical in terms of time complexity, and its tendency to keep all parameters almost equal may nullify the exploration arising from parameter differences. Thus, we present A3C-GS, which is a practical approximation of Algorithm 3. A3C-GS has the property of approximating the conditions for F to become a contraction, with only a single application of F at each step. This is done, however, by requiring additional conditions that keep the biases of F small enough such that the parameters of the M copies remain "close" to each other and guarantee eventual convergence, as described in Fig. 5. Specifically, parameter differences |θ_i − θ_j| between global copies i and j are considered tolerable if |θ_i − θ_j| = O(1), i.e., the differences are independent of the gradient magnitude |g|. The conditions for A3C-GS to remain a contraction are a sufficiently small learning rate α and gradient clipping.

Given the abovementioned two conditions, we can derive the properties stated in Section IV. The properties are listed in the following, with references to their corresponding lemma, corollary, and proposition numbers in the Appendix, where they are derived in nonsequential order.

Corollary 1 (Property 1b, Exploration Type II): Differences in policy distributions among workers lead to exploration, enabling A3C-GS to perform exploration despite lower weights for entropy terms in its loss function.
TABLE I
R ANKINGS OF O N -P OLICY A LGORITHMS PER G AME
Fig. 6. Performance of on-policy algorithms on games that require less exploration with frequent rewards (Group A). (a) Shooter (doom basic). (b) Shooter
(doom rocket). (c) Doom defend center. (d) Doom defend line. [Best viewed in color]
Fig. 7. Performance of on-policy algorithms on games that require exploration due to larger action spaces (except Pong), need for 2-D navigation, with
relatively frequent rewards (Pong has the sparsest rewards among these) (Group B). (a) Samurai Showdown. (b) Airstriker Genesis. (c) AeroFighters. (d) Zero
Wing. (e) Earth Defense Force. (f) Mortal Kombat. (g) Twin Cobra. (h) Alpha Mission. (i) Pong. [Best viewed in color]
long survival tasks (where, generally, 3-D environments have more degrees of freedom for navigation than 2-D ones). Among the games, those in Group A are simple VizDoom games that are easily predictable. Group B consists of retro games that have 12 actions and require 2-D navigation with frequent rewards. Group C consists of predict-position games with moving targets and rare rewards, along with navigation in 3-D environments (health/take cover). We summarize the results in Table I, along with a simple weighted score in Table II.

We can observe in Table I and Fig. 6 that PPO algorithms converge faster in Group A games, where stability is more important than exploration. In Group B with 2-D navigation, A3C-GS and PPO asynchronous are comparable, as shown in Fig. 7. For instance, in Zero Wing, PPO outperformed A3C-GS, while A3C-GS outperformed PPO in Earth Defense Force and AirStriker Genesis. However, in games that belong to Group C (shown in Fig. 8), A3C-GS is superior, particularly in the Doom Predict scenario, where the timeout is limited to 300, i.e., the agent has to explore quickly to obtain a reward. The same applies to Doom Health, where navigation is in 3-D with more degrees of freedom, along with a chance to get stuck in the environment's walls.

From Tables I and II, A3C-GS and PPO-async performed the best. However, A3C-GS is more consistent than PPO-async, with more top-three positions and the highest weighted score of 37. Moreover, A3C-GS outperformed the standard A3C in all games (except for comparable performance in Doom Health), indicating performance gains. GAE is outperformed by the other algorithms. During training, we note that PPO seems sensitive to initialization. For instance, under one initialization in Pong, PPO converged quickly; on a different initialization, it converged slowly. A3C-GS, meanwhile, delivered similar performance across all training repetitions.

B. Experimental Verification of A3C-GS Properties

Here, we compare A3C-GS and standard A3C with six and 12 workers. A3C-GS has three variants, each having six workers. For all models, we use the deep network trunk in [18] and implement a dueling network [16]. We use Atari game suite environments [20] following [14], [21]. For standard A3C, we follow [14] and provide an entropy weight of 0.1. For all A3C-GS variants, we reduce the entropy weight to 0.001.
Fig. 8. Performance of on-policy algorithms on games that require exploration due to sparse rewards (doom predict) or need for navigation in 3-D environments
for long survival tasks (doom health/take cover) (Group C). (a) Shorter doom predict position with Timeout = 300. (b) Normal doom predict position. (c) Doom
health. (d) Doom take cover. [Best viewed in color]
Fig. 9. Experiments in hyperparameters: A3C-GS (three versions, blue lines) versus standard A3C [six workers (red) and 12 workers (orange)]. α ranges
from 1E-4 to 1E-6 with either gradient or value gradient clips. α is similar for all models at each game. In Assault, α in A3C-GS-2 and A3C-GS-3 is lowered
10X for convergence. (a) Pong. (b) Assault (kills/game). (c) Space Invaders. The number of lives of all agents for Assault and Space Invaders is set to 1.
[Best viewed in color]
Fig. 10. A3C-GS training statistics and convergence against Standard A3C in Pong. (a) Entropy in policies. (b) KL divergences. (c) Absolute weight
differences. (d) Average gradients. [Best viewed in color]
Experimental Models:
1) A3C-GS-1: Two global copies with three workers each.
2) A3C-GS-2: Three global copies with two workers each.
3) A3C-GS-3: Six global copies with one worker each.
4) Standard A3C-6: One global copy with six workers.
5) Standard A3C-12: One global copy with 12 workers.
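The five configurations differ only in how workers are mapped to global copies. As a quick tabulation of the list above (our own notation, not the paper's):

```python
# (number of global copies, workers per copy) for each model
configs = {
    "A3C-GS-1": (2, 3),
    "A3C-GS-2": (3, 2),
    "A3C-GS-3": (6, 1),
    "Standard A3C-6": (1, 6),
    "Standard A3C-12": (1, 12),
}
totals = {name: copies * per_copy
          for name, (copies, per_copy) in configs.items()}
# All A3C-GS variants use six workers in total; only Standard A3C-12 uses more.
```

This makes the comparison in Fig. 9 a controlled one: the three A3C-GS variants and Standard A3C-6 all run six workers, so differences come from policy diversification rather than worker count.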
Fig. 11. A3C-GS ablation test on Assault. Labels refer to clip and learning rate, respectively. To converge, α has to be set to 1E-6 with at least a gradient clip of 0.10. [Best viewed in color]

From Fig. 9, the three types of A3C-GS have better results than Standard A3C-6 and Standard A3C-12 in Pong and Space Invaders despite having fewer workers. A3C-GS in Assault learned to stay alive longer. Hence, increasing the worker count may not necessarily result in more exploration compared with diversifying worker policies. This is seen in Fig. 10(a), where A3C-GS has a larger decrease in entropy than Standard A3C. Fig. 10 shows several statistics comparing A3C-GS with Standard A3C. From Fig. 10(a), A3C-GS has lower (less negative) entropy than standard A3C, supporting Corollary 1 and Property 2c. However, as shown by the higher KL divergences in Fig. 10(b), A3C-GS has more diversified policies than standard A3C. This shows Property 1b (Exploration Type II), supporting Proposition 3 and Corollary 1. The differences in worker parameters are shown in Fig. 10(c), where A3C-GS has larger differences in worker parameters than standard A3C. We also observe that these absolute differences are small or "tolerable," verifying Property 2a (Lemma 1). The convergence of A3C-GS depends on clipped gradients and a sufficiently small learning rate to meet the contraction conditions of Proposition 1. Using global norm clipping, average gradients are close to zero [see Fig. 10(d)]. In addition, weight differences in Fig. 10(c) are kept within the ±0.0005 range at α = 0.0001, despite comparable gradient magnitudes. As shown in Fig. 11, A3C-GS has to have the right learning rate and clip; otherwise, it will not converge. Regarding convergence (i.e., Propositions 2 and 4), we see in Fig. 10(b) that KL divergences exhibit a downward trend as training progresses. This indicates a contraction in policies toward the optimal policy. Moreover, despite the parameter differences in Fig. 10(c), we see that the differences taper off eventually.

VII. CONCLUSION

We presented the asynchronous A3C-GS algorithm that uses gradient sharing to promote exploration under temporary biases. Being a composition of contractions, A3C-GS converges to the optimal policy given a suitably small learning rate and gradient clipping. From experiments, A3C-GS produces more diversified policy distributions among workers for exploration, which, nonetheless, converge to the optimal policy in the long run. A3C-GS reports higher performance than standard A3C despite maintaining a smaller pool of workers and consistently gained top positions in several games, particularly in environments that require exploration due to sparse rewards or 3-D navigation for long survival tasks.

APPENDIX

In the following, we derive and analyze Properties 2a and 2b, followed by Property 1b. The flow of analysis is as follows:
1) construction of Φ = F ∘ Γ (see Sections A–C);
2) construction of the ideal algorithm Φ* (see Section C);
3) approximation of Φ* by A3C-GS (see Section D);
4) convergence of A3C-GS (see Sections E–G).

The main difference between A3C-GS and standard A3C lies in F. F propagates biases (see Lemma 9) but remains a contraction under some conditions (see Proposition 1). If we combine F with the gradient computation Γ, we construct a contraction Φ = F ∘ Γ (see Proposition 4). Φ can be computed by an "ideal" algorithm Φ* that is guaranteed to converge but is impractical. For a practical algorithm, A3C-GS in Section D approximates Φ* using a suitably small learning rate and gradient clipping. These conditions keep tolerable O(1) parameter variations among global copies and result in convergence (as shown in Section E). Finally, we use the O(1) variations among parameters to show the exploration and bias properties in Sections F and G.

A. Construction of Contraction Γ (Gradient Computation)

To construct Φ, we present T and Γ, which follow from SGD [17]. For Lemmas 5 and 6, we suppress the input state s_t in the loss function L [as done in (1)] for ease of notation. We note that T and Γ use SGD, whereas the actual contraction used in A3C-GS is Γ_A, which uses Adam. T and Γ, however, are more tractable and approximate Γ_A [17].

Lemma 5: Let i and j refer to any pair of global copies. Let T ∼ L, where L is a convex loss function computed from unbiased targets for any input state. Policy estimates for L are computed from initial parameters θ̄_i and θ̄_j and accumulated gradients ĝ. Given any initial parameter set, T is a contraction under ||·||_∞ and a sufficiently small learning rate α ∈ (0, 1) with a fixed point of 0.

Proof: Let ĝ_i and ĝ_j refer to the accumulated gradients for global copies i and j, respectively, over update steps 1···s, i.e., for i: ĝ_i = α Σ_s ∂L/∂θ_i^s (given initial parameter set θ̄_i). The input g is the current gradient computed using L and unbiased targets. With ĝ and g, T performs the following process for i:

T(L, θ̄_i, ĝ_i, g) = L(θ̄ − ĝ_i − α g), where g = ∂L(θ̄ − ĝ_i)/∂(θ̄ − ĝ_i).

Let T(L_ĝi(g)) = T(L, θ̄_i, ĝ_i, g) (similarly for j). Under the sup-norm, we have the following for d[T(L_ĝi(g)), T(L_ĝj(g))]:

d[T(L_ĝi(g)), T(L_ĝj(g))] = || L(θ̄_i − ĝ_i − α ∂L(θ̄_i − ĝ_i)/∂(θ̄_i − ĝ_i)) − L(θ̄_j − ĝ_j − α ∂L(θ̄_j − ĝ_j)/∂(θ̄_j − ĝ_j)) ||_∞.

With α sufficiently small, we see that for i, we have T(L_ĝi(g)) ∼ max*_g(L_ĝi(g)), where max* signifies maximization due to SGD. Recall that SGD brings (negative) convex loss functions closer to zero as training proceeds, which is a maximization operation over the loss. The same holds true for j.

Let L_ĝi(g) = L(θ̄ − ĝ_i − α g) and L_ĝj(g) = L(θ̄ − ĝ_j − α g) (i.e., the loss under an arbitrary g). We have the following:

||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ ∼ | max*_g(L_ĝi(g)) − max*_g(L_ĝj(g)) |
and α||L_ĝi(g) − L_ĝj(g)||_∞ ∼ max*_g | L_ĝi(g) − L_ĝj(g) |
⇒ | max*_g(L_ĝi(g)) − max*_g(L_ĝj(g)) | ≤ max*_g | L_ĝi(g) − L_ĝj(g) |
⇒ ||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ ≤ α||L_ĝi(g) − L_ĝj(g)||_∞.

Hence, T is a contraction. From SGD, we have L → 0 and, therefore, ||T(L_ĝi(g)) − T(L_ĝj(g))||_∞ → 0.

Lemma 5 shows the limiting behavior of the loss functions given arbitrary initial parameters θ̄_i and θ̄_j, along with the contraction properties of SGD. We now state Lemma 6, which extends Lemma 5 to construct the contraction Γ. This is followed by Lemma 7, which extends Lemma 6 to Adam gradient updates.

Lemma 6: Suppose that for any input state, gradients are computed such that they modify parameters toward estimating the optimal policy π*. From the contraction T in Lemma 5, we can derive another contraction Γ(·) = g, where Γ outputs the gradient g at each step in A3C-GS and implies that |g_i − g_j| of global copies i and j converges to zero for any input state.

Proof: Using similar notations as in Lemma 5, we construct Γ as a transformation of T, where Γ outputs an updated gradient computed right after the application of T with prior gradient g:

Γ[T(L_ĝ(g))] = ∂L(θ̄ − ĝ − α[g]) / ∂(θ̄ − ĝ − α[g]) = g̃.

Γ is a monotonic transformation of T and takes the gradient with respect to an updated weight [θ̄ − ĝ − α[g]]. Since Γ is a monotonic transformation of T(L) and T is a contraction [T → 0], we have Γ → 0 as well, since g → 0 and T → 0 as L → 0. From the given, suppose that there are global copies i and j. At its fixed point of 0, we have |g̃_i − g̃_j| → 0 ⇒ |g_i − g_j| → 0, since g̃ = g on the next computation of T. At this point, parameters θ_i and θ_j estimate the optimal policy π*.

Lemma 7: Given the same conditions as Lemma 6, suppose that instead of SGD gradient updates, we use Adam gradient updates. We can construct another contraction Γ_A that contracts to the same fixed point as Γ.

Proof: Under this assumption, the Adam gradient updates on parameters also serve to minimize the loss functions, similar to SGD. As shown in [17], Adam has faster convergence properties relative to SGD. Hence, Lemma 6 holds,
and behaves similar to , i.e., it converges to a fixed for the first and second global copies at the end of training
point. step s = 1
m i→ j = β1 (β1 m̄ + (1 − β1 )gi ) + (1 − β1 )g j
B. Construction of Contraction F (Sharing Operation) v i→ j = β2 (β2 v̄ + (1 − β2 )gi2 ) + (1 − β2 )g 2j
With the first contraction completed, we construct con-
αm i→ j 1 − β22
traction F, i.e., the gradient sharing operation. δi→ j = ; θi→ j = θ̄ − δi − δi→ j
Let F refer to the gradient sharing process in A3C-GS that 1 − β12 v i→ j
shares gi and g j between any pair of global copies i and j m j →i = β1 (β1 m̄ + (1 − β1 )g j ) + (1 − β1 )gi
after all locks are released. For i , let A(θˆi , g j ) refer to the
v j →i = β2 β2 v̄ + (1 − β2 )g 2j + (1 − β2 )gi2
Adam gradient update with θ̂ as the updated global parameter
of i (i.e., parameters right after the worker updated its assigned αm j →i 1 − β22
global copy and prior to sharing in F). We denote the output δ j →i = ; θ j →i = θ̄ − δ j − δ j →i . (8)
∗ 1 − β12 v j →i
of F as the reupdated parameters θˆi using shared gradient g j
and Adam To show that the difference in parameters between i and j
∗
is not trivially equal to zero, we write the difference θi→ j −
F(θˆi ) = A(θˆi , g j ) → θˆi . θi→ j = θ̄ −δi −δi→ j − θ̄ +δ j +δ j →i . This is also = δ j +δ j →i −
δi − δi→ j = [δ j − δi ] + [δ j →i − δi→ j ]. Here, θi→ j − θ j →u is
We note, in Lemma 9, that F does not trivially render parameters equal and that it propagates biases. Despite these biases, Proposition 1 shows that, under certain conditions, F is a contraction. First, however, we state Lemma 8 to show how F induces parameter differences among copies.

Lemma 8: Using Adam, the parameters of the global copies in A3C-GS during training are not equal with very large probability if there is a difference in parameter update ordering under F.

Proof: Suppose that there are two global copies, one for each worker. We need a scenario with a difference in update ordering in F. For instance, let a, b, c, and d be gradients. Two copies have different parameter update orderings if one is updated using a different permutation than the other (for instance, one is updated [a + b + c + d], while the other is updated [a + c + b + d]). Let this be the following.
1) Worker i computes gradient g_i.
2) Worker j computes gradient g_j.
3) Worker i updates global copy i and its local parameters.
4) Worker j updates global copy j and its local parameters.
5) Worker i activates all locks.
6) Worker i updates global copy j and then releases all locks.
7) Worker j activates all locks.
8) Worker j updates global copy i and then releases all locks.

This scenario provides an update ordering g_i → g_j for i and g_j → g_i for j. Let m̄, v̄, and θ̄ refer to the initial values of the moving-average mean, variance, and parameters for global copies i and j. Following Algorithm 2, for steps 1 and 2 (with g_i and g_j computed from the two workers' respective trajectories, so that g_i ≠ g_j):

$$g_i = \nabla_{\bar\theta} L, \qquad g_j = \nabla_{\bar\theta} L$$
$$m_i = \beta_1\bar m + (1-\beta_1)g_i, \qquad m_j = \beta_1\bar m + (1-\beta_1)g_j$$
$$v_i = \beta_2\bar v + (1-\beta_2)g_i^2, \qquad v_j = \beta_2\bar v + (1-\beta_2)g_j^2$$
$$\delta_i = \frac{\alpha m_i\sqrt{1-\beta_2}}{(1-\beta_1)\sqrt{v_i}}, \qquad \delta_j = \frac{\alpha m_j\sqrt{1-\beta_2}}{(1-\beta_1)\sqrt{v_j}}$$
$$\theta_i = \bar\theta - \delta_i, \qquad \theta_j = \bar\theta - \delta_j. \tag{7}$$

After computing θ_i and θ_j, we move to steps 5 and 6 of the update schedule to arrive at the final parameters θ_{i→j} and θ_{j→i}. The difference θ_{i→j} − θ_{j→i} is composed of two components: 1) [δ_j − δ_i] and 2) [δ_{j→i} − δ_{i→j}]. There are some scenarios where both components are equal to zero. For the first component, we have zero if m_i − m_j = 0 and v_i/v_j = 1.0. Similarly, we have [δ_{j→i} − δ_{i→j}] = 0 if (1 − β_1)²(g_i − g_j) = 0 and v_{j→i}/v_{i→j} = 1.0. This happens only when all gradients are equal (a rare occurrence). Now suppose that [δ_j − δ_i] ≠ 0 and [δ_{j→i} − δ_{i→j}] ≠ 0 but that [δ_j − δ_i] − [δ_{j→i} − δ_{i→j}] = 0. This arises iff

$$\frac{\alpha m_j\sqrt{1-\beta_2}}{(1-\beta_1)\sqrt{v_j}} - \frac{\alpha m_i\sqrt{1-\beta_2}}{(1-\beta_1)\sqrt{v_i}} - \frac{\alpha m_{j\to i}\sqrt{1-\beta_2^2}}{(1-\beta_1^2)\sqrt{v_{j\to i}}} + \frac{\alpha m_{i\to j}\sqrt{1-\beta_2^2}}{(1-\beta_1^2)\sqrt{v_{i\to j}}} = 0. \tag{9}$$

The roots of (9) are involved, but the expression can be simplified if we substitute β_1 = 0.9 and β_2 = 0.99, which are the standard Adam parameters. Eventually, we have

$$[(9) = 0] \;\Rightarrow\; [v_j m_i - v_i m_j = 0] \text{ and } [v_{i\to j} m_{j\to i} + v_{j\to i} m_{i\to j} = 0].$$

It can be shown that [v_j m_i − v_i m_j] = 0 if g_i = g_j. Similarly, it can be shown that [v_{i→j} m_{j→i} + v_{j→i} m_{i→j}] = 0 if g_i = −g_j. Hence, for the parameters of i and j to be equal, the gradients have to satisfy g_i = g_j or g_i = −g_j ⇒ |g_i| = |g_j|. Given the large randomization involved in training, it is very unlikely that all gradients are equal in magnitude for all workers. □

Lemma 9: Using the Adam optimization, suppose that the prior moving variance v̄ and prior moving mean m̄ of global copies i and j are not equal, i.e., v̄_i ≠ v̄_j and m̄_i ≠ m̄_j, due to variances in gradients brought about by different parameters (see Lemma 8). A single application of F in A3C-GS results in bias for the updated global parameters θ̂*_i and θ̂*_j, and θ̂*_i ≠ θ̂*_j.

Proof: The proof follows from the way F is constructed. For i, we see that F(θ̂_i) performs the following for θ̂_i, using g_j as the gradient from copy j:

$$F(\hat\theta_i) = A(\hat\theta_i, g_j) = \hat\theta_i - \alpha\,\frac{\beta_1(\beta_1\bar m_i + (1-\beta_1)g_i) + (1-\beta_1)g_j}{\sqrt{\beta_2(\beta_2\bar v_i + (1-\beta_2)g_i^2) + (1-\beta_2)g_j^2}}.$$

From the definition of g_j, we have g_j = ∇_{θ̂_j} L, with θ̂_j as j's parameters prior to F.
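The update-ordering argument of Lemma 8 can be checked numerically. The following sketch (ours, not from the paper; a single scalar parameter with arbitrary illustrative values) applies two Adam-style updates as in (7) in the orders g_i → g_j and g_j → g_i from a shared initial state and confirms that the resulting parameters differ:

```python
import math

def adam_step(theta, m, v, g, lr=0.01, b1=0.9, b2=0.99, eps=1e-8):
    """One Adam update without bias correction, mirroring (7)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    theta = theta - lr * m / (math.sqrt(v) + eps)
    return theta, m, v

def apply_in_order(theta, grads):
    """Apply Adam updates sequentially from a shared initial state."""
    m, v = 0.0, 0.0
    for g in grads:
        theta, m, v = adam_step(theta, m, v, g)
    return theta

g_i, g_j = 0.5, -0.2                         # unequal gradients, |g_i| != |g_j|
copy_i = apply_in_order(1.0, [g_i, g_j])     # ordering g_i -> g_j
copy_j = apply_in_order(1.0, [g_j, g_i])     # ordering g_j -> g_i
print(abs(copy_i - copy_j) > 0.0)            # orderings yield unequal parameters
```

Because Adam's moving statistics m and v are order-dependent, the two permutations leave the copies at different points, exactly the divergence the lemma describes; the parameters would coincide only in the degenerate case |g_i| = |g_j|.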
Authorized licensed use limited to: Birla Institute of Technology & Science. Downloaded on January 28,2022 at 13:38:17 UTC from IEEE Xplore. Restrictions apply.
1172 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 32, NO. 3, MARCH 2021
Hence, an operation of F propagates on global copy i's parameters (θ̂_i) the gradient g_j, which is a gradient of the loss L with respect to θ̂_j and not θ̂_i. This generates errors, and, for some small error ε,

$$F(\hat\theta_i) = F(\hat\theta_j) + \epsilon \;\Rightarrow\; \hat\theta_i^{*} \neq \hat\theta_j^{*}. \qquad\square$$

Proposition 1 (see Section V):
Proof: Suppose that workers have updated their global copies i and j and are now under F at update step s. We have the following for θ̂_i and θ̂_j under the Adam optimization algorithm (see Algorithm 2) and prior parameters θ̄_i and θ̄_j. Note that these are the parameters involved in local Adam gradient updates prior to F, i.e., the parameters involved during the computation of

$$\hat\theta_i = \bar\theta_i - \frac{\alpha m_i\sqrt{1-\beta_2^s}}{(1-\beta_1^s)\sqrt{v_i}}, \qquad \hat\theta_j = \bar\theta_j - \frac{\alpha m_j\sqrt{1-\beta_2^s}}{(1-\beta_1^s)\sqrt{v_j}}.$$

Using the absolute norm d(x, y) = |x − y|, let k = α ∈ (0, 1) and compute k·d(x, y) as

$$k\,d(\hat\theta_i,\hat\theta_j) = k\left|[\bar\theta_i-\bar\theta_j] - \frac{\alpha\sqrt{1-\beta_2^{s}}}{1-\beta_1^{s}}\left[\frac{m_i}{\sqrt{v_i}}-\frac{m_j}{\sqrt{v_j}}\right]\right|. \tag{10}$$

Here, the last term in k·d(θ̂_i, θ̂_j) is expanded as

$$\frac{m_i}{\sqrt{v_i}}-\frac{m_j}{\sqrt{v_j}} = \frac{\beta_1\bar m_i + (1-\beta_1)g_i}{\sqrt{\beta_2\bar v_i + (1-\beta_2)g_i^2}} - \frac{\beta_1\bar m_j + (1-\beta_1)g_j}{\sqrt{\beta_2\bar v_j + (1-\beta_2)g_j^2}}. \tag{11}$$

Next, we compute d(F(θ̂_i), F(θ̂_j)), i.e., the difference after application of F:

$$d(F(\hat\theta_i), F(\hat\theta_j)) = \left|[\bar\theta_i-\bar\theta_j] - \frac{\alpha\sqrt{1-\beta_2^{s+1}}}{1-\beta_1^{s+1}}\,\Omega\right|. \tag{12}$$

Here, Ω is equal to

$$\Omega = \frac{\beta_1(\beta_1\bar m_i + (1-\beta_1)g_i) + (1-\beta_1)g_j}{\sqrt{\beta_2(\beta_2\bar v_i + (1-\beta_2)g_i^2) + (1-\beta_2)g_j^2}} - \frac{\beta_1(\beta_1\bar m_j + (1-\beta_1)g_j) + (1-\beta_1)g_i}{\sqrt{\beta_2(\beta_2\bar v_j + (1-\beta_2)g_j^2) + (1-\beta_2)g_i^2}}. \tag{13}$$

To prove that F is a contraction, we need to show that (12) is less than (10). This depends on several parameters: m̄_i, m̄_j, v̄_i, v̄_j, and β_1, β_2. We can use the conditions to simplify the analysis and let β_1 = 0.9 and β_2 = 0.99, as done in most standard Adam implementations. Let Δ_1 = k·d(θ̂_i, θ̂_j) and Δ_2 = d(F(θ̂_i), F(θ̂_j)). Similarly, let Δθ = θ̄_i − θ̄_j, and define ω_1 and ω_2 (substituting β_1 and β_2) as

$$\omega_1 = \frac{\sqrt{1-0.99^{s}}}{1-0.9^{s}}, \qquad \omega_2 = \frac{\sqrt{1-0.99^{s+1}}}{1-0.9^{s+1}}.$$

We compute Δ_1 in the following, where we use conditions 1 and 2 in Proposition 1 to factor out the g² terms and to come up with a common second-moment moving average v̄_i = v̄_j = v̄:

$$\Delta_1 \sim k\left|\Delta\theta - \alpha\omega_1\,\frac{\sqrt{v_j}\,(0.9\bar m_i + 0.1 g_i) - \sqrt{v_i}\,(0.9\bar m_j + 0.1 g_j)}{0.99\sqrt{\bar v_i\bar v_j}}\right| \sim k\left|\Delta\theta - \alpha\omega_1\left[0.9\,\frac{\bar m_i - \bar m_j}{\sqrt{\bar v}} + 0.1\,\frac{g_i - g_j}{\sqrt{\bar v}}\right]\right|. \tag{14}$$

Similarly, we compute Δ_2 in the following, reusing conditions 1 and 2 in Proposition 1 to simplify the equations as done in Δ_1. In the last line in the following, ε denotes a very small number:

$$\Delta_2 \sim \left|\Delta\theta - \alpha\omega_2\left[\frac{0.9(0.9\bar m_i + 0.1 g_i) + 0.1 g_j}{\sqrt{0.99(0.99\bar v_i + 0.01 g_i^2) + 0.01 g_j^2}} - \frac{0.9(0.9\bar m_j + 0.1 g_j) + 0.1 g_i}{\sqrt{0.99(0.99\bar v_j + 0.01 g_j^2) + 0.01 g_i^2}}\right]\right| \sim \left|\Delta\theta - \alpha\omega_2\left[0.83\,\frac{\bar m_i - \bar m_j}{\sqrt{\bar v}} + \epsilon\,\frac{g_i - g_j}{\sqrt{\bar v}}\right]\right|. \tag{15}$$

We can observe that |ω_1 − ω_2| → 0 as the update step s → ∞. However, for any update step s, the second term of Δ_1 is bigger than the second term of Δ_2 under condition 3, where |g_i − g_j| < c_1|m̄_i − m̄_j|. Moreover, from condition 4, we see that the contribution of the Δθ terms can be rendered negligible subject to |Δθ| < c_2|m̄_i − m̄_j|; for these to be valid, we can set c_1 and c_2 to be 0.01, for instance. Hence, using conditions 1–4 and a suitable k ∈ (0, 1), we have [(12) < (10)], showing that F is a contraction. Under the four conditions, we can observe that while a single application of F leads to bias (Lemma 9), additional applications of F using g_i and g_j lower the difference between g_i and g_j and move the parameters closer (due to the ε term in Δ_2). This is a property of F being a contraction. □

From this, if the parameter differences Δθ are large (i.e., condition 4 is violated), contraction may not occur. In Corollary 3, we describe an ideal scenario where the initial parameters are equal (not required in Proposition 1).

Corollary 3: For any update step s, given the conditions in Proposition 1 and common initial parameters θ̄_i = θ̄_j, repeated application of F using gradients g_i and g_j lowers the loss and asymptotically leads to equal parameters.

C. Construction of Contraction T and Ideal Algorithm A*

From the prior discussion, we compose another contraction T = F ∘ G that has both gradient correction and sharing, where G denotes the gradient computation operation. There exists an ideal algorithm A* that computes T and meets all the conditions in Proposition 1. We start with Proposition 4 on the existence of T, followed by Proposition 5.

Proposition 4: Given conditions 1–3 in Proposition 1, common initial parameters for all global copies, and a sufficiently small learning rate α, there exists a contraction T = F ∘ G.
Proof: From (13) and Proposition 1, F takes gradients g, which can be set as the outputs of G. This stepwise operation is a composition where, at step s, G computes the gradients g and feeds all inputs to F. F iterates until convergence of θ̂* to ensure condition 4 along with the other conditions, ensuring that F is a contraction. Afterward, G recomputes g for step s + 1. From T = F ∘ G, a composition of contractions is a contraction. Hence, T is a contraction with fixed point θ*. □

Proposition 5: Assume that condition 3 of Proposition 1 holds with full information on all trajectories under E. There exists an algorithm A* such that A* computes T, converges to the fixed point θ*, and computes the optimal policy π*.
Proof: We construct a pseudocode for A* as follows. The algorithm is ideal since it ensures that the conditions in Proposition 1 are met at each training update step s. Let i and j refer to any pair of distinct global copies under Algorithm 3.
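The inequality (12) < (10) can be spot-checked numerically. The sketch below (ours, with illustrative values chosen so that |g_i − g_j| and |Δθ| are small relative to |m̄_i − m̄_j|, in the spirit of conditions 1–4, and taking k = 1) evaluates the pre-sharing distance via (10)–(11) and the post-sharing distance via (12)–(13):

```python
import math

B1, B2, LR = 0.9, 0.99, 0.001

def delta(theta_i, theta_j, ratio_i, ratio_j, s):
    """|[theta_i - theta_j] - LR*sqrt(1-B2^s)/(1-B1^s)*(ratio_i - ratio_j)|."""
    bias = math.sqrt(1 - B2 ** s) / (1 - B1 ** s)
    return abs((theta_i - theta_j) - LR * bias * (ratio_i - ratio_j))

# Illustrative Adam states for two global copies (hypothetical values).
mi, mj, vi, vj = 0.50, 0.20, 0.30, 0.30
gi, gj = 0.100, 0.101          # nearly equal gradients (condition 3)
ti, tj, s = 1.0000, 1.0001, 10  # tiny parameter gap (condition 4)

# Before sharing: the last term of (10), expanded as in (11).
r_i = (B1 * mi + (1 - B1) * gi) / math.sqrt(B2 * vi + (1 - B2) * gi ** 2)
r_j = (B1 * mj + (1 - B1) * gj) / math.sqrt(B2 * vj + (1 - B2) * gj ** 2)
d1 = delta(ti, tj, r_i, r_j, s)

# After one application of F: cross-shared gradients, as in (12)-(13).
fr_i = (B1 * (B1 * mi + (1 - B1) * gi) + (1 - B1) * gj) / \
       math.sqrt(B2 * (B2 * vi + (1 - B2) * gi ** 2) + (1 - B2) * gj ** 2)
fr_j = (B1 * (B1 * mj + (1 - B1) * gj) + (1 - B1) * gi) / \
       math.sqrt(B2 * (B2 * vj + (1 - B2) * gj ** 2) + (1 - B2) * gi ** 2)
d2 = delta(ti, tj, fr_i, fr_j, s + 1)

print(d2 < d1)  # sharing moved the two copies' updates closer together
```

With these values, the post-sharing distance comes out smaller than the pre-sharing one, illustrating (not proving) the contraction claim for one configuration satisfying the conditions.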
LABAO et al.: A3C-GS: ADAPTIVE MOMENT GRADIENT SHARING WITH LOCKS FOR ASYNCHRONOUS ACTOR–CRITIC AGENTS 1173
Algorithm 3 Algorithm A* (All Parameters Are Initially Equal)
1: while ∀M, N: π not converged to π* do
2:   for each worker i do
3:     worker i explores environment under copy j
4:     if training == true then
5:       Compute gradients g using G with loss L
6:       Clip g (conditions 1 & 3 of Proposition 1)
7:       Update assigned copy j with g
8:       Wait until all workers i′ ≠ i have updated
9:   if all workers have done local updates then
10:    while ∀θ not converged do
11:      Apply F to all global copies i ≠ j
12:      Compute θ̂* for all copies i ≠ j
13:      If θ̂*_i ≠ θ̂*_j for any pair i ≠ j, repeat

At initialization, the parameters are equal (i.e., θ_i = θ_j). Workers assigned to i and j act on environment E and compute respective gradients g_i and g_j (g_i ≠ g_j) from the computation of G for i and j. With full information (from the given), these gradients are computed with respect to targets derived from the optimal policy π*. Gradients are clipped at this point, meeting conditions 1 and 2. Using Lemma 1 and Corollary 3, we apply F (using gradients g from G) iteratively until convergence. This results in equal weights for i and j, with Δθ = 0, meeting condition 4. After F converges, we let the updated workers i and j resume (there is no need for exploration given full information). Thus, all conditions for F's contraction are satisfied. Since the parameters for all copies are equal after repeated applications of F, at the next update, the resulting gradients and first/second moments computed by G in Lemma 7 are equal for all global copies (i.e., G's Adam parameters are equally modified by F for all global copies, similar to applying G to several batches). Using Lemma 7 and Proposition 1, this process converges to a fixed point θ̂_i = θ̂_j = θ*. This is because both the loss (see Corollary 3) and the gradients (see Lemma 7) decrease if F converges at the end of every update step subject to θ̄_i = θ̄_j. Thus, for G, we have [g_i → 0 and g_j → 0] subject to [g_i = g_j = 0]. An application of F in this case results in the fixed-point parameters θ* (see Corollary 3). Now, we only have g_i = g_j = |g_i − g_j| = 0 if the losses L of both global copies are 0 or both copies estimate the optimal policy, i.e., f(θ̂*_i) = f(θ̂*_j) = f(θ*) = π*. □

Corollary 4: Our proposed A3C-GS ≠ A*.
Proof: Unlike A*, A3C-GS applies F only once despite θ̄_i ≠ θ̄_j. Thus, condition 4 in Proposition 1 is not met. □

D. Approximation of A* by A3C-GS: Properties 2a and 2b

E. Lemmas for Proving Property 2a

From Lemma 8, F results in parameter differences among global copies, and the ideal scenario of Corollary 3 does not hold. However, from Proposition 1, F remains a contraction as long as parameter differences are not too large (condition 4). We show that this problem can be solved in a practical way using Lemmas 1–4.

Lemma 2 (see Section V):
Proof: Condition 1 can be easily implemented by clipping all gradient magnitudes |g_i| and |g_j| to be < 1.0. Similarly, one can proportionally decrease both gradients to meet condition 2. For condition 3, if |g_i| > |g_j|, simply clip |g_i| until |g_i|/|g_j| ∼ 1.0 (and vice versa). □

Lemma 3 (see Section V):
Proof: Let the gradients (for copy i) be expressed as g_{i,s} = γ_{i,s} θ_{i,s−1}, where γ_{i,s} is the proportion of the gradient at update step s derived from parameter θ_{i,s−1}. Using this, we express the gradients ∀s as a function of θ_0, γ_{i,s}, and the initial gradient g_{i,0} through this recursion. For i, we have

$$g_{i,0} = \gamma_{i,0}\theta_{i,0}$$
$$\theta_{i,1} = \theta_{i,0} + \alpha g_{i,0}, \qquad g_{i,1} = \gamma_{i,1}\theta_{i,1} = \gamma_{i,1}(\theta_{i,0} + \alpha g_{i,0})$$
$$\theta_{i,2} = \theta_{i,1} + \alpha g_{i,1} = [\theta_{i,0} + \alpha g_{i,0}] + \alpha[\gamma_{i,1}(\theta_{i,0} + \alpha g_{i,0})]\;\cdots$$

The same recursion holds for copy j, except that we have θ_{j,0} = θ_{i,0} + Δ, where Δ is the parameter difference. We express the second-moment moving average v̂_i for copy i as follows:

$$\hat v_{i,0} = [\gamma_{i,0}\theta_{i,0}]^2$$
$$\hat v_{i,1} = 0.9[\gamma_{i,0}\theta_{i,0}]^2 + 0.1[\gamma_{i,1}\theta_{i,1}]^2 = 0.9[\gamma_{i,0}\theta_{i,0}]^2 + 0.1[\gamma_{i,1}(\theta_{i,0} + \alpha\gamma_{i,0}\theta_{i,0})]^2\;\cdots$$

The same holds for j. However, applying Lemma 2, we see that we can write j's version of v̂_{j,1} as

$$\hat v_{j,1} = 0.9[\gamma_{j,0}\theta_{j,0}]^2 + 0.1[\gamma_{j,1}(\theta_{j,0} + \alpha\gamma_{j,0}\theta_{j,0})]^2 = 0.9[\gamma_{j,0}(\theta_{i,0} + \Delta)]^2 + 0.1[\gamma_{j,1}(\theta_{i,0} + \Delta) + \alpha\gamma_{j,0}(\theta_{i,0} + \Delta)]^2 \sim 0.9[\gamma_{i,0}\theta_{i,0}]^2 + 0.1[\gamma_{i,1}(\theta_{i,0} + \alpha\gamma_{i,0}\theta_{i,0})]^2.$$

The last line follows since the procedure in Lemma 2 specifically sets the gradients |g_{i,s}| ∼ |g_{j,s}|; since all terms in v̂_{j,1} are squared, only the magnitude of the gradients matters, not their direction. Moreover, the above equations show that clipping as per Lemma 2 renders each term in the recursion of v̂_{i,s} similar in magnitude to its corresponding term in the recursion of v̂_{j,s}, independent of Δ. For more detail, we could write v̂_{i,s}/v̂_{j,s} algebraically, but the equations are too involved. Instead, we follow a similar trick and use a Taylor series to approximate v̂_{i,s}/v̂_{j,s} centered at α = 0 to check its behavior relative to α. Let τ_{i,s} = Σ_{l=0}^{s} c_l γ²_{i,l} and τ_{j,s} = Σ_{l=0}^{s} c_l γ²_{j,l}. The Taylor series centered at α = 0 is

$$\frac{\tau_{i,s}}{\tau_{j,s}} + \alpha\,\frac{f_1(\tau_{i,s},\,\theta_{i,0}^2)}{f_2(\tau_{j,s},\,\theta_{i,0}^2)} + \alpha^2\,\frac{f_3(\tau_{i,s},\,\theta_{j,0}^2)}{f_4(\tau_{j,s},\,\theta_{j,0}^2)} + O(\alpha^3).$$

In this equation, all f's are additive functions of the form f(τ_{i,s}, θ_0²) = Σ_s c_s γ_{i,s} θ_0², where the c's are constants. From Lemma 2, we have τ_{i,s} ∼ τ_{j,s} since, as mentioned, at each step s, clipping renders all terms in the recursion similar. We can then write the Taylor series as ∼ 1 + αC + α²C + α³C + ⋯, where C is a constant close to 1. By letting α be small enough, we can keep the condition v̂_i/v̂_j ∼ 1.0 satisfied. □

Lemma 4 (see Section V):
Proof: For an informal proof of Lemma 4, we use similar equations as Lemma 5. At any update step s, we write parameter θ_{i,s} for global copy i (and likewise for j) as θ_{i,s} = θ̄ − α Σ_s g^i_s and θ_{j,s} = θ̄ − α Σ_s g^j_s. We then have
$$\theta_{i,s} - \theta_{j,s} = -\alpha\left[\sum_s g_s^i - \sum_s g_s^j\right].$$

Here, we see that |θ_{i,s} − θ_{j,s}| ≥ 0 is a monotonic function of α. Given information on the magnitude |Σ_s g^i_s − Σ_s g^j_s|, we can always set α to a desired low value such that |θ_{i,s} − θ_{j,s}| is minimized. However, for future update steps under a fixed α, suppose that we are at update step s + 1, where the policy P_i(a) under i is computed via softmax, and the respective weight for action a is θ_{i,s}(a). Let ζ denote the set of possible actions and Â_s(a) the advantage weight for action a [as per (1)]. The gradient for i is

$$g_{i,s+1}(a) = \frac{\partial L}{\partial\theta_{a,i,s}}\,I[a], \quad \text{where } I[a] \text{ is an indicator function},$$

with L = −Â_s(a) log P_i(a) and P_i(a) the softmax probability of a, so that

$$g_{i,s+1}(a) = -\hat A_s(a)\,\frac{\sum_{a'\in\zeta-a} e^{\theta_{i,s}(a')}}{\sum_{a'\in\zeta} e^{\theta_{i,s}(a')}} = -\hat A_s(a)\left[1 - \frac{e^{\,\bar\theta(a)-\alpha\sum_s g_s^i(a)}}{\sum_{a'\in\zeta} e^{\,\bar\theta(a')-\alpha\sum_s g_s^i(a')}}\right]. \tag{16}$$

For worker j, the gradient is similar. From (16), α is a term in the exponentials of the numerator and denominator, and its contribution affects all actions, as seen in Σ_{a′∈ζ} e^{θ̄(a′)−αΣ_s g^i_s(a′)}. Expanding (16) around α = 0 gives

$$-\hat A_s(a)\,\frac{|\zeta|-1}{|\zeta|} + \alpha\,\frac{\hat A_s(a)}{|\zeta|^2}\Big[|\zeta|\,\theta_a - \sum_{a'\in\zeta-a}\theta_{a'}\Big] + O(\alpha^2). \tag{17}$$

From (17), we have a nonzero gradient −Â_s(a)(|ζ| − 1)/|ζ| independent of α. Hence, if we set α to be arbitrarily small, i.e., α → 0, we still have |g_s(a)| = Â_s(a)(|ζ| − 1)/|ζ| ≥ 0. We can see that the first term of g_{i,s+1}(a) is not a function of α, while the other terms are functions of α through the prior θ_a. Hence, we can set α small enough such that the prior parameter differences (at step s), Δ = θ_{i,s} − θ_{j,s} = −α[Σ_s g^i_s − Σ_s g^j_s], are low relative to g_{i,s+1}(a) ∼ Â_s(a)(|ζ| − 1)/|ζ| (at step s + 1). It also follows that Δ can be set small relative to the first and second moments of g_{i,s+1}(a) by setting it small compared with Â_s(a)(|ζ| − 1)/|ζ|. For future update steps, i.e., s + 2, s + 3, …, a sufficiently small α likewise keeps Δ low. □

Lemma 1 (see Section V):
Proof: Let g_i and g_j denote gradients. Similarly, let there be common initial parameters θ̄ for i and j. We write the gradient for θ_j as g_j = g_i + ϕ(Δ), where ϕ is a monotonic function of the parameter difference. For brevity, we will not show the form of ϕ, but it is a computable function that expresses the gradient g_j in terms of g_i along with a "shift" expressed in the parameter difference Δ = |θ_i − θ_j| after SGD updates. We decompose the parameter differences θ_{i→j} − θ_{j→i} into [δ_j − δ_i] (differences due to local updates prior to sharing) and [δ_{j→i} − δ_{i→j}] (differences due to sharing after local updates). The key result is that θ_{i→j} − θ_{j→i} has the following structure, given f_1 and f_2 that are increasing functions of g², m, and v:

$$[\theta_{i\to j} - \theta_{j\to i}] = \varphi(\Delta)\,\frac{f_1(g^2, m, v)}{f_2(g^2, m, v)}. \tag{18}$$

With (18), if |Δ| ≪ {|g_i²|, |m|, |v|} (using a sufficiently small α under Lemma 4), we get to prove the lemma, since |ϕ(Δ)| is a monotonic function of |Δ|. We write a finer decomposition of [δ_j − δ_i] and [δ_{j→i} − δ_{i→j}] as follows, with Δ_1 as [δ_j − δ_i] and Δ_2 as [δ_{j→i} − δ_{i→j}], and with m̄ and v̄ denoting any initial moving mean and variance before F:

$$\Delta_1 = \alpha\left[\frac{\big(\beta_1\bar m + (1-\beta_1)(g_i + \varphi(\Delta))\big)/(1-\beta_1)}{\sqrt{\big(\beta_2\bar v + (1-\beta_2)(g_i + \varphi(\Delta))^2\big)/(1-\beta_2)}} - \frac{\big(\beta_1\bar m + (1-\beta_1)g_i\big)/(1-\beta_1)}{\sqrt{\big(\beta_2\bar v + (1-\beta_2)g_i^2\big)/(1-\beta_2)}}\right]$$

$$\Delta_2 = \alpha\left[\frac{\big(\beta_1(\beta_1\bar m + (1-\beta_1)(g_i + \varphi(\Delta))) + (1-\beta_1)g_i\big)/(1-\beta_1^2)}{\sqrt{\big(\beta_2(\beta_2\bar v + (1-\beta_2)(g_i + \varphi(\Delta))^2) + (1-\beta_2)g_i^2\big)/(1-\beta_2^2)}} - \frac{\big(\beta_1(\beta_1\bar m + (1-\beta_1)g_i) + (1-\beta_1)(g_i + \varphi(\Delta))\big)/(1-\beta_1^2)}{\sqrt{\big(\beta_2(\beta_2\bar v + (1-\beta_2)g_i^2) + (1-\beta_2)(g_i + \varphi(\Delta))^2\big)/(1-\beta_2^2)}}\right].$$

It is involved to transform the abovementioned equations into the form of (18). However, we can simplify by substituting β_1 = 0.9 and β_2 = 0.99, i.e., the standard Adam parameters:

$$\Delta_1 \sim \frac{\Delta\,(c_1 g_i^2 + c_2 g_i\bar m + c_3\bar v)}{(g_i^2 + c_4\bar v)^2} + O\!\left(\frac{\Delta^2}{(g_i^2 + c_4\bar v)^3}\right), \qquad \Delta_2 \sim \frac{\Delta\,(c_5 g_i^2 + c_6 g_i\bar m + c_7\bar v)}{(g_i^2 + c_8\bar v)^2} + O\!\left(\frac{\Delta^2}{(g_i^2 + c_8\bar v)^3}\right). \tag{19}$$

We see that (19) follows the structure of (18), where Δ and Δ² are in the numerators (f_1) and their contributions are bounded by the denominators, which are functions (f_2) of the gradient's second moment g_i² and the moving variance v̄. In the numerators, we see that the coefficients c_2 and c_6 of the middle term g_i m̄ have values of −180 and 0.304, respectively. These are lower compared with c_3 and c_7 of the third term v̄, which are 990 and −18.38, respectively. Similarly, c_4 and c_8 are relatively larger than c_1, c_2, c_5, and c_6. Given these coefficients, if m̄ is close to v̄, we see that for Δ_1, there is more likelihood that |f_2| > |f_1|. For Δ_2, we have more likelihood for |f_1| ≤ |f_2|. In either case, controlling for other variables aside from the coefficients, we have |f_1| ≤ |f_2|, indicating that [θ_{i→j} − θ_{j→i}] is likely to be small. Nonetheless, we have shown that (19) has a form similar to (18), and we can always set Δ [and consequently ϕ(Δ)] to be tolerable using a sufficiently small α
as per Lemma 4. With this and |Δ| ≪ {|g_i²|, |m|, |v|}, we have

$$[\theta_{i\to j} - \theta_{j\to i}] = O\!\left(\frac{\Delta}{(g_i^2 + c_4\bar v)^2}\right) + O\!\left(\frac{\Delta}{(g_i^2 + c_8\bar v)^2}\right) + O\!\left(\frac{\Delta^2}{(g_i^2 + c_4\bar v)^3}\right) + O\!\left(\frac{\Delta^2}{(g_i^2 + c_8\bar v)^3}\right).$$

This shows that [θ_{i→j} − θ_{j→i}] = 4O(1) = O(1). Moreover, the difference is a monotonic function of Δ and can be made arbitrarily small to meet the condition of Lemma 1. □

Despite parameter differences among copies and varying exploration trajectories, given that the optimal trajectory has been discovered (from the given) and that F is a contraction (the conditions of Proposition 1 are satisfied), parameter differences will eventually be reduced after repeated applications of F (since these differences are bounded by a constant factor under contraction). With these, the effect of biases from F on G's Adam parameters (particularly the first moment m) is reduced in the long term as parameters converge. Informally, this shows that T contracts, given that at each update step, it rectifies its Adam parameters to lower the loss, while the amount of rectification needed falls as F's biases gradually reduce in the long term. Hence, given that F ∘ G is a composition of contractions, F ∘ G converges.

However, as F is not applied repeatedly until convergence, there is no guarantee that all global copies achieve similar parameters, i.e., if the gradients computed by G converge to zero faster than the contraction rate of F. However, we note that workers i and j can estimate a similar policy, i.e., π_i ∼ π_j for i ≠ j, even under different parameters. If i and j both estimate π* despite unequal parameters, then Δ = g = 0 for both i and j, resulting in zero gradients shared by F. In this case, both i and j estimate π* despite having different weights. □

Since g*_s → 0, it is not guaranteed to arrive at W* if we use g^e_s and π is not uniform. Under Lemma 10, the second term in (20) is a nonzero gradient, giving suboptimal parameters, and π ≠ π*. □

Corollary 2 for Property 2c is implied by Lemma 11, i.e., lower entropy terms result in lower bias.

H. Property 1b, Exploration Type II, and 2c (Lower Bias)

For the proof of Proposition 3, we recall Lemma 8. Corollary 1 (see Property 1b) follows from this.

Proposition 3 (see Section V):
Proof: From Lemmas 8 and 9, we see that θ_i ≠ θ_j with large probability for any pair of global copies i ≠ j due
to F. Differences in θ imply differences in the softmax policy distributions. However, these differences disappear as training nears convergence, as per Proposition 2. □
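Proposition 3's short-term exploration mechanism — unequal parameters inducing different softmax policies — can be illustrated with a small sketch (the parameter values below are ours, purely for illustration):

```python
import math

def softmax(theta):
    """Softmax policy over action weights theta."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

# Two global copies whose parameters were nudged apart by F (cf. Lemma 8).
theta_i = [0.10, 0.20, 0.30]
theta_j = [0.12, 0.18, 0.30]   # small parameter difference

pi_i, pi_j = softmax(theta_i), softmax(theta_j)
gap = max(abs(a - b) for a, b in zip(pi_i, pi_j))
print(gap > 0.0)   # different theta -> different policy distributions
```

Even a small parameter gap yields workers that sample actions under slightly different distributions, which is the temporary policy diversification the proposition attributes to F; as the gap shrinks near convergence, the policies coincide again.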
Alfonso B. Labao received the M.S. degree in computer science from the University of the Philippines Diliman, Quezon City, Philippines, in 2017, where he is currently pursuing the Ph.D. degree in computer science.
He was a Researcher with the Computer Vision and Machine Intelligence Group, Computer Science Department, University of the Philippines Diliman. His research on reinforcement learning algorithms focuses on policy gradients for continuous control. His current research involves algorithmics and automata theory.

Mygel Andrei M. Martija received the B.S. degree in management engineering from the Ateneo de Manila University, Quezon City, Philippines, in 2016. He is currently pursuing the M.S. degree in computer science with the University of the Philippines Diliman, Quezon City.
He is currently a Researcher with the Computer Vision and Machine Intelligence Group, Computer Science Department, University of the Philippines Diliman. His current research interests include underwater computer vision, reinforcement learning, and object tracking.

Prospero C. Naval, Jr. (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of the Philippines Diliman, Quezon City, Philippines, and the M.Eng. degree in computer science from Kyoto University, Kyoto, Japan.
He is currently the Dado and Maria Banatao Professor of artificial intelligence with the Department of Computer Science, University of the Philippines Diliman, where he teaches courses on computer vision, probabilistic machine learning, and reinforcement learning. He is also the Founder and the current Laboratory Head of the Computer Vision and Machine Intelligence Group (CVMIG), Department of Computer Science, University of the Philippines Diliman, which focuses on the use of machine learning to solve problems in healthcare, environment, and education. He has authored or coauthored more than 100 articles in journals and conferences. His current research interests include underwater computer vision, intelligent control of underwater autonomous vehicles, swarm robotics, and computation.
Dr. Naval, Jr., has served as the Chair of the IEEE Philippine Section from 2015 to 2016.