
Acta Astronautica 171 (2020) 265–279


Path planning for asteroid hopping rovers with pre-trained deep reinforcement learning architectures
Jianxun Jiang a, Xiangyuan Zeng a,*, Davide Guzzetti b, Yuyang You a
a Beijing Institute of Technology, Beijing, 100081, China
b Auburn University, Auburn, AL, 36849, United States

ARTICLE INFO

Keywords:
Asteroid surface exploration
Hopping rover
Path planning
Deep reinforcement learning

ABSTRACT

Asteroid surface exploration is challenging due to complex terrain topology and an irregular gravity field. A hopping rover is considered a promising mobility solution to explore the surface of small celestial bodies. Conventional path planning tasks, such as traversing a given map to reach a known target, may become particularly challenging for hopping rovers if the terrain displays sufficiently complex 3-D structures. As an alternative to traditional path-planning approaches, this work explores the possibility of applying deep reinforcement learning (DRL) to plan the path of a hopping rover across a highly irregular surface. The 3-D terrain of the asteroid surface is converted into a level matrix, which is used as an input of the reinforcement learning algorithm. A deep reinforcement learning architecture with good convergence and stability properties is presented to solve the rover path-planning problem. Numerical simulations are performed to validate the effectiveness and robustness of the proposed method with applications to two different types of 3-D terrains.

1. Introduction

Various space agencies around the world are intensifying the exploration of solar system small bodies [1], and the number of observed asteroids is increasing [2]. Asteroid exploration is of great significance because: (1) the geological structure of an asteroid surface reveals details about the formation and evolution of the solar system [3]; (2) asteroids may harbor clues about the origin of life [4]; (3) mineral resources could be available at asteroids [5]; (4) asteroid impact represents a threat to human society [6].

Asteroid exploration may be conducted via ground-based observations, spacecraft flyby or rendezvous, and in-situ exploration with surface landers. Ground-based observations and spacecraft rendezvous are currently the most advanced approaches. In the last two decades, the United States, Japan and European countries have successfully implemented various asteroid flyby or rendezvous programs [7]. On December 3rd, 2014, the Japanese Space Agency (JAXA) successfully launched a spacecraft, Hayabusa 2, to rendezvous with the near-Earth asteroid 162173 Ryugu [8]. On December 3, 2018, NASA's "Origins, Spectral Interpretation, Resource Identification, Security-Regolith Explorer" (OSIRIS-REx) spacecraft [9] completed its 1.2 billion-mile (2 billion-kilometer) journey and arrived at the asteroid 101955 Bennu. Obtaining access to asteroid surfaces is the natural continuation of the current exploration effort. Different from major planetary bodies, the irregular and weak gravity field [10] of asteroids drives the development of alternative forms of mobility. In particular, hopping may be one of the most efficient ways to traverse an asteroid surface [11]. On May 9, 2003, the Hayabusa spacecraft attempted the deployment of a hopping rover, MINERVA, on the surface of asteroid 25143 Itokawa [12]. Unfortunately, the lander missed the target and was lost into interplanetary space. Taking that experience into account, Hayabusa 2 has recently succeeded in deploying a similar hopping robot, called MASCOT,1 on Ryugu. Hayabusa 2 aims to study the asteroid Ryugu and return surface samples to Earth by 2020. In the meantime, Stanford University, JPL and the California Institute of Technology have developed a robotic hedgehog for the exploration of Phobos [13].

With increasing missions to less known asteroids in the future, new methods are required to guide and control hopping rovers over uncharted asteroid surfaces. Specifically, hopping rovers may be required to autonomously traverse an asteroid surface with complex terrain and irregular gravity. One option to automate the path planning process may be to create traversability maps from visual observations of surface features. Then, the terrain surrounding the rover can be classified and mapped to a cost function that is used to determine the optimal path for a given task [14,15]. However, it may be problematic for a small, cost-effective rover to independently acquire and process in-situ the


* Corresponding author. E-mail address: zeng@bit.edu.cn (X. Zeng).
1 https://mascot.cnes.fr/en/MASCOT/index.htm [Retrieved 2020-02-25].

https://doi.org/10.1016/j.actaastro.2020.03.007
Received 17 July 2019; Received in revised form 28 February 2020; Accepted 7 March 2020
Available online 13 March 2020
0094-5765/ © 2020 IAA. Published by Elsevier Ltd. All rights reserved.

Nomenclature

dn  shortest distance from the surface to the center of mass of the asteroid
df  longest distance from the surface to the center of mass of the asteroid
l  maximum bouncing height
L  maximum altitude level difference
t, n  current step
Vπ(s), Qπ(s,a)  value function
st, sn  current state
st+1, sn+1, s_, s′  next state
rt, rj, rn  current reward
Rt  cumulative reward before time step t
at  current action
γ  discount factor of Q function
Qold  previous Q table
Qnew  updated Q table
yi  target Q value
θ  parameters of the Q evaluation network
θ⁻  parameters of the Q target network
ε  probability of adopting random actions over optimal actions
π  motion policy (or strategy)
α  learning rate
Qtarget  Q target value
Qeval  Q evaluation value
Yt^DoubleDQN  Q target value of double Q-learning
Yt^DQN  Q target value of original Q-learning
V(s)  state value function
A(a)  action advantage function
batch  set of training episodes

sufficient global, visual data that enable autonomous surface exploration. In addition, the continuous change of light and shadow that characterizes a small body surface may critically impact the quality of the visual information that is accessible.

As an alternative, a hopping rover may move independently without any local visual information using a pre-trained control architecture. First assume that partial topo-surface information of the target asteroid is available, perhaps from instruments onboard a mother spacecraft or a landing probe, as illustrated in Fig. 1. The topo-surface information may be converted into regional height-level difference data. Before being deployed, the hopping rover's movement trajectory and control system may be pre-trained using simulated surface level differences. The processing and calculation of the terrain data can be performed from the landing probe, the mother spacecraft or the ground station. The hopping rover(s) will not be released until the optimal path is obtained. Although the topo-surface data are assumed as known information, such data may not be sufficiently accurate to implement path planning solutions as open-loop guidance. A closed-loop control mechanism may be required to ensure that the rover adjusts its course to the target and offsets perturbations that are due to finite modeling accuracy.

Fig. 1. Sketch for a landing probe releasing the hopping rover.

This paper takes a first step in developing a path planning algorithm for hopping rovers that is based on deep reinforcement learning (DRL). To date, there exist a number of mature path-planning methods, including but not limited to artificial potential fields [16], fuzzy logics [17], and genetic algorithms [18]. Nevertheless, when these methods are applied to real-world problems, they typically result in point-design solutions that easily fall into a local optimum and require extensive numerical experimentation for each specific application. Reinforcement learning (RL) [19] is emerging as a more general framework for adaptive decision making, as it leverages the ability of an agent to autonomously interact with the environment and improve its decision making strategy. However, RL in its original form is typically not able to solve dimensionally large problems, such as end-to-end path planning.

In 2015, Mnih et al. presented the Deep Q Network algorithm [20], a pioneering work in the field of reinforcement learning applied to dimensionally large problems. In 2016, by combining deep reinforcement learning with Monte Carlo tree search (MCTS), Silver developed an artificial Go player, 'AlphaGo' [21], which defeated worldwide recognized Go champions. Both success stories are based on the idea of combining RL with deep neural networks (DNN) [22], which supply the capability to perceive and extract critical information from very large datasets. Recently, reinforcement learning techniques have gained increasing attention from scientists and engineers in the space community [23]. In 2019, Gaudet, Linares, and Furfaro developed an adaptive integrated guidance, navigation, and control system using reinforcement meta-learning that can complete different landing maneuvers in an environment with unknown dynamics, with initial conditions spanning a large deployment region, and without a shape model of the asteroid [24]. The resulting DRL framework [25] may enable perceptual decision-making tasks within complex systems, including path planning for a hopping rover on a poorly known asteroid surface.

In this work, we explore the application of deep reinforcement learning to the problem of path planning on an asteroid surface. The main contribution of this paper is the design and training of a DRL architecture, which uses level differences to determine a series of actions for the hopping rover. Provided partial surface topology information, the objective is to determine the optimal trajectory from a known initial area to the target area that minimizes energy consumption. This architecture may serve as a stepping stone to develop next-generation guidance schemes for autonomous hopping rovers. In fact, pre-trained DRL architectures are naturally preconfigured to learn from different topo-surface terrain data.

2. Problem statement

This section describes a method to model terrain features and actions for hopping rover motion within asteroid surface environments.


Fig. 2. Selected movement mechanisms.
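The movement mechanisms of Fig. 2 reduce to a simple classification of the altitude-level difference between adjacent grid cells. A minimal Python sketch of that classification (the function name and return labels are our own, not from the paper):

```python
def classify_move(level_from: int, level_to: int) -> str:
    """Classify a move between adjacent grid cells by altitude-level
    difference, following the movement mechanisms of Fig. 2."""
    diff = level_to - level_from
    if diff == 0:
        return "Walking"    # small bounce or roll on the same level
    if diff == -1:
        return "Rolling"    # slide or free fall one level down
    if diff == 1:
        return "Hopping"    # jump one level up
    return "Hazardous"      # |difference| > 1: infeasible for the rover
```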

2.1. Terrain representation for asteroid surface navigation

Asteroid surface path planning requires a virtual model of the terrain to traverse. A representation of the terrain that is convenient for path planning describes terrain features in terms of height levels over a predetermined reference altitude, called sea level [26]. Given information about a connected region of the asteroid surface, the shortest distance dn and the longest distance df from the surface region to the center of mass of the asteroid can be calculated. The average of the dn and df values defines the radius of an equivalent sphere that renders the local sea level. The altitude of any point at the local sea level is zero.

Let L represent the maximum altitude level difference throughout the selected asteroid surface region, which is approximated by the difference between dn and df. Also assume that the motion of the hopping rover is constrained by a maximum bouncing height, l. Then, the maximum altitude level difference, L, can be divided into N equally spaced intervals, with N = L/l. An altitude value in the interval [0, l) is considered the first positive level (recall that zero defines the sea level), while a height value in the interval [l, 2l) is considered the second positive level, and so on. The same classification applies to altitude values below the sea level, commencing with the first negative level for altitude values in the interval (−l, 0].

Given an asteroid terrain in the form of a two-dimensional contour map, we assign each contour plane to a fixed and independent level-value. For example, a value of 1 might be given to the first positive level, a value of 0 to the sea-level altitude, and a value of −2 to the second negative level. Then, we rasterize the 2-D contour map with a 30 × 30 grid. In other words, the 2-D contour map can be transformed into 900 quadrilateral regions (grid cells) of various altitude levels. The horizontal distance that a rover can traverse with a single bounce may vary significantly with the initial state and surface conditions. To simplify the problem at hand, the horizontal distance for each movement in this paper is constrained to a unit cell of the grid. The length of each grid cell is approximately 0.1∼1 m for a 100-m-sized asteroid. If we consider a square area with a side of 30 cells, then the side of such an area measures approximately 3∼30 m. If a grid cell contains multiple contour planes, the level-value of the cell center point is taken as a reference. If the center point of the grid cell is at the intersection of two contour planes, the lower level-value is taken as a reference. Eventually, a discretized level-value map can be obtained.

2.2. Surface motion scheme

Due to the irregular and weak gravity field, hopping rovers are a preferable option for asteroid surface exploration, one that is different from planetary surface exploration robots. The motion of a hopping rover can be treated as a combination of hopping segments and tumbling segments. Commanding the rover to move (bounce or roll) over a unit grid length, that is, the distance between the midpoint of the current grid cell and the midpoint of the next grid cell, is considered as an action.

In this paper, the set of rover actions is restricted to a unit grid length movement in four independent directions. Each direction is separated by a 90° angle: front, back, left and right. Three types of rover movement are possible, as shown in Fig. 2. When moving from a higher level to a lower level, a tumbling/rolling motion is adopted to let the rover slide or freely fall, which is referred to as "Rolling" in this study. A small bounce or roll can be used to move forward on the same level between two adjacent locations, referred to as "Walking". Going from a lower level to a higher one, hopping/jumping is required to move the rover, which is referred to as "Hopping". The aforementioned mechanisms of locomotion are based on the principle of momentum exchange and have been demonstrated by existing rovers, such as the Cubli,2 the Hedgehog,3 and the MINERVA-II.4

2 https://idsc.ethz.ch/research-dandrea/research-projects/archive/cubli.html [Retrieved 2020-02-25].
3 https://www.jpl.nasa.gov/news/news.php?feature=4712 [Retrieved 2020-02-25].
4 http://www.hayabusa2.jaxa.jp/en/topics/20180919e/ [Retrieved 2020-02-25].

It is assumed that rover rolling and hopping movements are limited to a maximum altitude change of one level (positive or negative) when moving across two adjacent grid cells. Such an assumption is made under the intuition that large hopping movements may trigger surface escape trajectories and large rolling movements may damage the rover. If the absolute altitude difference between adjacent cells is greater than one level, the terminal cell is considered hazardous for the current state. Fig. 3 displays a two-dimensional top view and a 3-D graph of a hypothetical terrain. The number within each grid cell of the top view represents the altitude level. Assume that the rover is initially located in the cell at the center of the grid. From this location, it is only feasible to move left or up. Down and right movements are considered hazardous as the altitude level difference of the adjacent cells is greater than one.

3. Deep reinforcement learning for traversing the asteroid surface

Deep Reinforcement Learning (DRL) may be a convenient framework to develop autonomous path planning algorithms for hopping rovers. DRL architectures can be pre-trained with complex asteroid surface information. After obtaining the optimal trajectory, the hopping rover is released to explore the environment. A DRL architecture may also allow for learning end-to-end path-planning policies, ones that map sensory input about local altitude level variations to rover movement commands.

3.1. Reinforcement learning

Deep reinforcement learning is an extension of reinforcement learning. Reinforcement learning is based on the idea that an agent interacts with an environment over a number of discrete time steps. Note that interactions may be real or simulated. At each time step t, the agent explores a state st and selects an action at from some set of possible actions A according to a policy π. Afterwards, the agent reaches the next state st+1 and obtains a reward rt which guides the selection of future actions. (Note: this is called a learning iteration.) The process continues multiple times until a terminal state, one that defines failure or success, is reached. Fig. 4 summarizes the basic model for


Fig. 3. A hypothetical terrain model (left: top view, right: 3-D view).
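The level discretization of Sect. 2.1 and the one-level feasibility rule illustrated in Fig. 3 can be sketched in a few lines of Python. The 3 × 3 grid in the usage note is a hypothetical example in the spirit of Fig. 3, not the terrain actually shown there, and the function names are our own:

```python
import math

def altitude_to_level(h: float, l: float) -> int:
    """Map altitude h (relative to the local sea level) to an integer level.

    [0, l) is the first positive level, [l, 2l) the second, and so on;
    (-l, 0] is the first negative level. Exactly zero is treated as the
    sea level (level 0), as in Sect. 2.1.
    """
    if h == 0.0:
        return 0
    if h > 0.0:
        return math.floor(h / l) + 1
    return -(math.floor(-h / l) + 1)

def feasible_actions(levels, row, col):
    """Return the moves from (row, col) whose altitude-level change does
    not exceed one level in magnitude; larger changes are hazardous."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    feasible = []
    for name, (dr, dc) in moves.items():
        r, c = row + dr, col + dc
        if 0 <= r < len(levels) and 0 <= c < len(levels[0]) \
                and abs(levels[r][c] - levels[row][col]) <= 1:
            feasible.append(name)
    return feasible
```

For a rover at the center of the hypothetical grid `[[1, 1, 3], [1, 1, 3], [3, 3, 3]]`, only "up" and "left" survive the one-level rule, mirroring the situation described for Fig. 3.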

reinforcement learning.

The goal of an agent is to maximize the expected cumulative sum of rewards, or return, from each state st. The expected return is captured by a state-action value function. Processing from the initial to the terminal state constitutes a learning episode. (Note: one learning episode has multiple learning iterations.) In our problem, target and hazardous cells define a terminal state. In order to obtain the optimal path, a state-action value function Q is introduced as:

Q^π(s, a) = E[Rt | st = s, a]    (1)

where the return Rt is the total accumulation of rewards from time step t with discount factor γ ∈ (0, 1]:

Rt = Σ_{k=0}^{∞} γ^k r_{t+k}    (2)

The parameter γ is a discount factor that weights the importance of immediate and later rewards. The Q value is the actual expected return for selecting action a in state s by exploiting policy π [27]. The optimal function Q*(s, a) = max_π Q^π(s, a) gives the maximum expected action value for state s. Similarly, an optimal policy is derived by selecting the highest-valued action in each state.

3.1.1. Policy π

Observations from the environment are mapped to actions by a policy π = P(a|s). In this paper, an ε-greedy strategy [28] is used as policy π for the reinforcement learning architecture. The value of ε represents the probability of adopting random actions over optimal actions, which enables exploration of the solution space.

3.1.2. State-action value function update

In each learning iteration, assume a virtual rover moving across a discrete terrain map by choosing one action from a collection of actions at every time step. At step n, the rover reaches the state sn ∈ X, and chooses the action an ∈ A accordingly. The agent receives a reward rn, whose value depends only on the current state and action. The transition from the current state sn to the next state sn+1 is stochastic and described by a probability distribution:

Prob[s_{n+1} = s′ | sn, an] = P_{sn s′}(an)    (3)

For the given policy π, the expected value of each state is rendered by the state value function, V^π(s), which is defined as the accumulation of reward signals that a rover expects to receive from that state on, until the terminal state:

V^π(s) = E[r(s0, a0) + γ r(s1, a1) + γ² r(s2, a2) + … | s0 = s, π(a|s)]    (4)

Under the policy π, the state value function V^π(s) satisfies the Bellman equation:

V^π(s) = r(s) + γ Σ_{s′∈S} P_{s s′}(π(s)) V^π(s′)    (5)

where s represents the current state, and s′ is the next state. Equation (5) shows that the state value function of a policy consists of two parts:

Fig. 4. The basic model of reinforcement learning.
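The agent-environment loop of Fig. 4, the ε-greedy rule of Sect. 3.1.1, and the discounted return of Eq. (2) admit a compact sketch. This is a toy illustration under our own naming, not the paper's implementation:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the
    highest-valued (greedy) action, as in Sect. 3.1.1."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k}, Eq. (2), evaluated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With ε = 0 the policy is purely greedy; during training ε is annealed (see the exploration hyper-parameters in Table 2) so early episodes favor exploration.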


One is the current reward r given directly by the environment. The other is the expected value of the discounted reward in the future. Thus, the optimal policy π* can be described as:

V*(s) ≥ V^π(s),  V*(s) = max_a [ r_s(a) + γ Σ_{s′} P_{ss′}[a] V*(s′) ]    (6)

Watkins discusses a representation of the Bellman equation that is more convenient for incremental dynamic programming when the model is unknown to the agent [29]. This representation rewrites the Q value (or action-value) in Eq. (1) as:

Q^π(s, a) = r_s(a) + γ Σ_{s′} P_{ss′}[a] V^π(s′)    (7)

The Q value is the expected discounted return for executing an action at the current state s within the policy π. Incremental dynamic programming that is based on the Q-value representation of the Bellman equation is also known as Q-learning [30].

The target of Q-learning is to determine the Q value for the optimal policy. Note that V*(s) = max_a Q*(s, a); if a* is the action at which the maximum is attained, the optimal policy takes the form π*(s) ≡ a*. If the agent learns the policy π*, it can then find optimal actions for the environment. The updating formula of the basic Q-learning algorithm is:

Q(st, at)_new = (1 − α) Q(st, at)_old + α [ rt + γ max_a Q(st+1, a) ]    (8)

where Q_old and Q_new represent the previous and the updated Q value function, respectively. Here, α is the learning rate that determines the learning speed. The max operation in Eq. (8) is trivial to implement if the Q value function can be rendered in tabular form. However, the performance of tabular Q-learning methods is typically unsatisfactory in problems with larger dimensions.

3.1.3. Deep learning

Deep neural networks may serve as function approximators for the state-action value function Q. A network with parameters θi approximating the state-action value function Q at iteration i is trained by minimizing a sequence of loss functions Li(θi) [31]:

Li(θi) = E_{s,a∼ρ(·)}[ (yi − Q(s, a; θi))² ]    (9)

where yi represents the target value. The target value yi is calculated as:

yi = E_{s′∼E}[ r + γ max_{a′} Q(s′, a′; θi⁻) | s, a ]    (10)

This network is referred to as the Q evaluation network. The parameters of the Q evaluation network are adjusted at each iteration to obtain an accurate function approximation of the state-action value function. A typical method to update the Q evaluation network parameters is stochastic gradient descent [32]. Stochastic gradient descent requires computing the gradient of the loss function:

∇_{θi} Li(θi) = E_{s,a∼ρ(·); s′∼E}[ (r + γ max_{a′} Q(s′, a′; θi⁻) − Q(s, a; θi)) ∇_{θi} Q(s, a; θi) ]    (11)

Specifically, Q(s, a; θi) in Eq. (11) represents the output of the current evaluation network and renders the value of the current state-action pair. Deep reinforcement learning may employ an additional neural network to generate the target value yi, which is called the Q target network. In Eq. (10), Q(s′, a′; θi⁻) denotes the output of the target network. The target network parameters θi⁻ are only updated every N steps and kept constant between individual updates. Therefore, there are two networks in the DRL architecture: the Q target network and the Q evaluation network (see Figs. 5 and 6 in Sect. 3.2.2). The Q target network predicts the value of Qtarget, and the evaluation network is used to compute the value of Qeval. The structure of these two networks may be identical, while internal parameters can take different values. Typically, the Q target network is an earlier version of the Q evaluation network, holding the internal parameters for a number of iterations through the update process. In other words, the set of parameters that defines the Q target network is fixed for a given number of iterations, N, before being replaced by the current parameters of the Q evaluation network. Fixing the Q target for a certain number of update iterations reduces correlations of the adjacent states and improves convergence properties [20].

Deep neural networks are used to build progressively more abstract representations of the data. The combination of deep neural networks and Q-learning constitutes a Deep Q Network (DQN) that may learn from high-dimensional inputs. A final key idea is necessary to build an efficient DQN: experience replay.

Experience replay consists in the randomization of training data. Experience replay smooths changes in the data distribution and removes correlations in the observation sequence. Transfer samples (st, at, rt, st+1) from the interaction between the agent and the environment at each time step are stored into a replay memory unit. At each time step within a learning iteration, transfer samples are randomly selected from the replay memory unit to update the Q value networks. Without loss of

Fig. 5. Interior structure of the Q evaluation network.
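Before a neural approximation such as the network in Fig. 5 becomes necessary, the update rule of Eq. (8) can be applied directly to a table. A minimal sketch, with our own function naming:

```python
def q_update(q_table, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step, Eq. (8):
    Q_new(s, a) = (1 - alpha) * Q_old(s, a)
                  + alpha * (r + gamma * max_a' Q(s', a'))
    """
    target = r + gamma * max(q_table[s_next])
    q_table[s][a] = (1.0 - alpha) * q_table[s][a] + alpha * target
```

Starting from Q(s, a) = 0 with α = 0.5 and γ = 0.9, a reward of 1 and a next state whose best entry is 2 moves the entry halfway toward the target 1 + 0.9 · 2 = 2.8.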


generality, 32 groups of samples are taken at a time in our study. This method has many advantages over the traditional Q-learning method. For example, randomization of the sample distribution breaks strong correlations between closely related samples, and prevents convergence on local optima. For more details, readers are referred to Ref. [20].

3.2. Experimental setting

This section illustrates the configuration and pre-training of a DQN architecture. A DQN architecture is used to guide a rover traversing the surface of a small asteroid from an initial position to a target location within a given region.

3.2.1. Reward setting

Path planning on an asteroid surface can be enabled by using a DQN architecture. Rewards are given as a function of the energy consumed by rover movements as well as terminal conditions. Three types of movement ("Rolling", "Walking" and "Hopping") are allowed across the same or different altitude levels, as described in Sect. 2.2. We assume that the rover consumes the least energy when transferring between adjacent locations of the same altitude with "Walking". As a result, "Walking" is given a reward value of 0, while "Rolling" and "Hopping" (recall Fig. 2) are given a small penalty (i.e., a negative reward). During a training episode, a large positive reward is given if the rover reaches the target area. On the contrary, a large penalty is given if the rover path terminates at a hazardous zone. The reward setting is summarized in Table 1.

Table 1
Reward settings.

Action or state    Reward
Walking            0
Rolling            (0, −0.1]
Hopping            (0, −0.1]
Target area        10
Hazardous zone     −5

To facilitate convergence of the learning process, the reward values in Table 1 for movement actions are scaled relative to the reward values for the terminal states. The reward values for "Rolling" and "Hopping" movements that facilitate convergence of the learning process may vary with terrain topology.

3.2.2. Model architecture

DQN architectures utilize deep neural networks to approximate the state-action value function Q. The input of the neural network is the state where the rover is currently located. The output layer has four neurons. Each neuron renders the Q value corresponding to one of the four actions at the given state. Given an input state, this architecture has the ability to calculate Q values for each action with a single forward pass through the network.

In Fig. 5, the structure of the Q evaluation network is displayed. The inputs of the neural network architecture are the four vertex coordinates of the current state (current square). Fully-connected layers are used as hidden layers consisting of 15 linear rectified units. The number of network layers is related to the complexity of the asteroid terrain: there is a positive proportional relationship between the number of network layers and the number of height levels. Advantage and value branches in a dueling network form can be added to improve the performance of the Q evaluation network, which will be discussed in Sect. 3.2.3 in detail. The output layer predicts the Q value for each action. Every link of the architecture will be trained during the learning process to optimize network weights and thresholds.

Fig. 6 describes the interior structure of the Q target network, which is similar to that of the Q evaluation network. The main difference between the target and evaluation networks is the choice of input vector (s, s_) (where s_ is the next state) and the update period (1, N). The evaluation network takes as an input the current state and is updated at each iteration. The target network takes as an input the next state and is updated every N iterations. The Q values that are predicted by the evaluation and target networks are employed in the calculation of the loss function as defined in Eq. (9).

Fig. 6. Interior structure of the Q target network.

Fig. 7 summarizes the operant conditioning mechanisms underlying the DQN architecture. The basic environment information is the initial and terminal locations. Observational state and reward value sets are created by using the method in Sect. 2. The learning strategy module maps the state space to the action space. Following is a "Memory" unit that stores a large number of observed relationships between states and actions. Random samples out of the "Memory" unit are selected to train each Q neural network. Finally, the output of each Q network is employed to update the mapping between state and action space within the learning strategy module. The well-informed combination of state-action mapping, memory unit and Q networks produces an operant conditioning learning process that can identify optimal path-planning strategies for rovers.

3.2.3. Improving the DQN algorithm

Three methods for improving the DQN algorithm are adopted and reproduced: Double DQN [33] (DDQN), Dueling Network [34], and Prioritized Experience Replay [35].

A. Owing to state-action value estimation errors during the learning process, Q-learning algorithms are occasionally overoptimistic within high-dimensional environments. Consider the following state-action value estimate Yt:

Yt^DQN ≡ rt+1 + γ max_a Q(St+1, a; θt)    (12)

For traditional Q-learning methods, the parameters θt in γ max_a Q(St+1, a; θt) are those of the Q target network. In the presence of estimation errors, using a greedy strategy to select the maximum Q value may cause the error to be maximized, resulting in overestimation. To increase prediction accuracy, a Double Q-learning procedure may be adopted. Double Q-learning is based on the idea that the Q value corresponding to St+1 is determined by the target network, where the


Fig. 7. Architecture that enables operant conditioning learning.
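The loop of Fig. 7 — store transitions in a memory unit, sample random minibatches, and form targets with a periodically synchronized target network (Eq. (10)) — can be sketched with lookup tables standing in for the two networks. The class and parameter names are ours, and the gradient step of Eq. (11) is replaced by a simple tabular move toward the target; the defaults only loosely echo the hyper-parameters of Table 2:

```python
import random
from collections import deque

class DQNSketch:
    """Minimal sketch of the Fig. 7 loop: replay memory plus a Q target
    table synchronized with the Q evaluation table every `sync_every`
    updates. Tables stand in for the two neural networks."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 memory_size=20000, batch_size=32, sync_every=300, seed=0):
        self.q_eval = [[0.0] * n_actions for _ in range(n_states)]
        self.q_target = [row[:] for row in self.q_eval]
        self.memory = deque(maxlen=memory_size)
        self.alpha, self.gamma = alpha, gamma
        self.batch_size, self.sync_every = batch_size, sync_every
        self.updates = 0
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        """Keep the transfer sample (s, a, r, s') in the memory unit."""
        self.memory.append((s, a, r, s_next, done))

    def learn(self):
        """One learning step: sample a random minibatch and move the
        evaluation table toward the targets of Eq. (10)."""
        if len(self.memory) < self.batch_size:
            return
        batch = self.rng.sample(list(self.memory), self.batch_size)
        for s, a, r, s_next, done in batch:
            # y = r at terminal states; otherwise Eq. (10):
            # y = r + gamma * max_a' Q_target(s', a')
            y = r if done else r + self.gamma * max(self.q_target[s_next])
            # gradient-free stand-in for the SGD step of Eq. (11)
            self.q_eval[s][a] += self.alpha * (y - self.q_eval[s][a])
        self.updates += 1
        if self.updates % self.sync_every == 0:
            # replace the target parameters with the evaluation parameters
            self.q_target = [row[:] for row in self.q_eval]
```

Keeping the target table frozen between synchronizations is what decorrelates consecutive targets, which is the convergence argument given above for the two-network design.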

Fig. 8. SumTree structure example.
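A SumTree of the kind shown in Fig. 8 can be implemented in a few lines. This sketch uses our own array layout, and the leaf priorities in the usage note are chosen for illustration rather than copied from the figure:

```python
class SumTree:
    """Binary SumTree for prioritized experience replay: leaves store the
    sample priorities p_j and each parent stores the sum of its children,
    so the root holds the total priority and sampling a value v in
    [0, total) retrieves leaf j with probability p_j / sum(p)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes + leaves

    def update(self, leaf_index, priority):
        """Set a leaf priority and propagate the change to the root."""
        i = leaf_index + self.capacity - 1
        change = priority - self.tree[i]
        self.tree[i] = priority
        while i > 0:
            i = (i - 1) // 2
            self.tree[i] += change

    def total(self):
        return self.tree[0]

    def retrieve(self, v):
        """Descend from the root: go left if v falls inside the left-child
        sum, otherwise subtract it and go right; return the leaf index."""
        i = 0
        while 2 * i + 1 < len(self.tree):
            left = 2 * i + 1
            if v < self.tree[left]:
                i = left
            else:
                v -= self.tree[left]
                i = left + 1
        return i - (self.capacity - 1)
```

Stratified sampling then divides the root total into batch-sized intervals and draws one value per interval, so samples with larger TD-error (larger p_j) are retrieved proportionally more often.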

Table 2
List of hyper-parameters.

    Hyper-parameter            Value
1   Minibatch size             32
2   Replay memory size         20000
3   Discount factor            0.9
4   Learning rate              0.0025
5   Initial exploration        1
6   Final exploration          0.1
7   Epsilon increment          0.00025
8   Replace target iteration   300

In the double-DQN update of Eq. (13), the selection of the action a relies on the evaluation network, while its value is supplied by the target network:

Y_t^DoubleDQN = r_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t′)    (13)

Double Q-learning [36] has been observed to produce more reliable and stable learning processes. Double Q-learning can be implemented within the DQN framework without altering the neural network architectures for approximating the state-action value function. Only the definition of Y_t^DQN needs to be updated with the definition of Y_t^DoubleDQN that is supplied in Eq. (13). As an alternative, online action selection can be employed to mitigate over-estimation [33].

B. A dueling architecture divides the output of the network into two parts: one is the function V(s) predicting the state value, and the other is the function A(a) rendering the advantage that each action provides. Here, a dueling DQN can be treated as a single network with two streams at the penultimate layer. During policy evaluation, this architecture may identify optimal actions more rapidly. The final Q value can be defined as:

Q(s, a) = V(s) + A(s, a)    (14)
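Eqs. (13) and (14) can be sketched numerically as follows. The function names and the four-action example values are ours; θ_t and θ_t′ correspond to the evaluation and target networks, respectively.

```python
def double_dqn_target(r_next, q_eval_next, q_target_next, gamma=0.9):
    # Eq. (13): the evaluation network selects the greedy action (argmax),
    # while the target network supplies that action's value.
    a_star = max(range(len(q_eval_next)), key=lambda a: q_eval_next[a])
    return r_next + gamma * q_target_next[a_star]

def dueling_q(v, advantages):
    # Eq. (14): each action value is the shared state value V(s)
    # plus that action's advantage A(s, a).
    return [v + a for a in advantages]

# Example with four actions (one per movement direction):
q_eval_next = [1.0, 3.0, 2.0, 0.5]    # evaluation net at S_{t+1}
q_target_next = [0.8, 2.5, 2.9, 0.4]  # target net at S_{t+1}
y = double_dqn_target(1.0, q_eval_next, q_target_next)  # 1.0 + 0.9 * 2.5 = 3.25
```

Note that a plain DQN target would instead take max(q_target_next) = 2.9 here; decoupling selection from evaluation is precisely what mitigates over-estimation.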


Fig. 9. Three views of a hypothetical terrain (axes are dimensionless).

Fig. 10. Optimal path (axes are dimensionless).

With explicit stream parameters, Eq. (14) reads:

Q(s, a; α_L, β_L) = V(s; β_L) + A(s, a; α_L)    (15)

where α_L and β_L are the parameters of the network layers.

C. In training a DQN, it may be beneficial to reuse past memories by utilizing experience replay. In a natural DQN, past experience is drawn by random sampling, which ignores the importance of each individual memory. Prioritized experience replay attempts to replay important memories more often; as such, the learning process becomes more efficient. A temporal-difference error [37] ('TD-error' for short) is introduced to weight the importance of the memory samples:

TD-error: δ_j = r_j + γ_j Q_target(S_j, argmax_a Q(S_j, a)) − Q(S_{j−1}, A_{j−1})    (16)

where δ_j is the value of the j-th TD-error. Here, the subscript j = {1, 2, 3, …} represents the index of each sample from the memory.

Samples with a higher TD-error hold a higher training priority. Owing to the high computational cost of arranging memory priorities, a SumTree algorithm may be introduced to increase the efficiency of the memory sampling process. This algorithm is based on a tree-structured search. Each tree node value represents the size of an interval. Each branch node of the tree has only two bifurcations. Nodes of the bottom layer in the SumTree structure store the sample priorities p_j, each proportional to the absolute value of the corresponding TD-error δ_j. The value of each parent node is the sum of the values of its two children. The value stored at the single node of the top layer of the SumTree is the sum of all priorities p. Then, the probability of extracting the sample corresponding to δ_j is p_j/∑p_j.

When sampling, the total priority p of all samples is divided by the batch size into n intervals, with:

n = p / batch    (17)

Referring to Fig. 8, when extracting 5 sets of data (i.e., n = 5), the priority distribution can be represented with priority value intervals such as [1-28] and [29-35]. After obtaining the interval of each branch size, a number is randomly selected in each interval during sampling, and the corresponding priority is obtained by utilizing the SumTree. For instance, {3, 9, 15, 26, 34} are randomly selected during a given iteration. Taking the number 15 as an example, we traverse the tree from the top layer to the bottom layer: '15' belongs to "[1-19] → [11-19] → [13-19]", which determines that its priority is 7. Accordingly, the sample corresponding to priority 7 will be trained. The SumTree structure makes the samples corresponding to larger TD-errors more likely to be relearned. More detailed discussion of the TD-error method can be found in Ref. [35].

4. Numerical analysis

Pre-training of a DQN architecture for hopping rover path planning without visual sensory input is performed. Both discrete and continuous terrains are considered. The computer configuration employed for pre-training comprises an Intel Core i7 CPU @ 2.2 GHz (8 cores, multi-threaded) and 16 GB of RAM. Throughout the simulations, the reward setting is consistent with that given in Table 1. Training of the Q networks depends on the following hyper-parameters:

1) Minibatch size, i.e., the number of samples for each training iteration;
2) Replay memory size, i.e., the size of the memory bank for data storage;


Fig. 11. Number of steps per episode during the learning process.

Fig. 12. Magnification of the early learning stage.

3) Discount factor, i.e., the attenuation factor γ used in the Q-learning update;
4) Learning rate, i.e., the learning rate used for RMSProp (a neural-network optimizer);
5) Initial exploration, i.e., the initial value of ε in the ε-greedy policy;
6) Final exploration, i.e., the final value of ε in the ε-greedy policy;
7) Epsilon increment, i.e., the rate at which the initial value of ε is linearly annealed to its final value;
8) Replace target iteration, i.e., the number of iterations before the target network parameters are replaced.

Values for the hyper-parameters are listed in Table 2.

4.1. Application to 3-D discrete surface

4.1.1. Trajectory evolution
Fig. 9 illustrates three views of a representative level map as a hypothetical asteroid surface, which is used to approximate complex asteroid terrain. Approximate level maps for testing path-planning algorithms may be created to be a conservative representation of the final operational environment. For a preliminary demonstration of the DQN path-planning algorithm, a level map with 30 × 30 cells is selected. In Fig. 9, light-green areas denote the sea level of the terrain. Warm-color cells correspond to areas higher than the sea-level plane; as the color deepens, the height of the corresponding area increases. Cool-color grid cells denote regions below the sea-level plane; as the cell color deepens, the height of the corresponding area decreases.
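Before analyzing the first experiment, the prioritized sampling of Sect. 3.2.2 can be sketched in code. This is a deliberately simplified stand-in: the binary SumTree is replaced by a linear prefix-sum search (a real SumTree performs the same lookup in O(log n) by descending from the root, which stores the total priority, to a leaf, which stores one p_j), and all names are ours.

```python
import random

def sample_prioritized(priorities, batch, rng=None):
    """Stratified priority sampling (cf. Eq. (17)): split the total priority
    into `batch` equal intervals, draw one point per interval, and map each
    point back to a sample index by prefix-sum search."""
    rng = rng or random.Random()
    total = float(sum(priorities))   # value held at the SumTree root
    segment = total / batch          # width of each sampling interval
    indices = []
    for k in range(batch):
        point = rng.uniform(k * segment, (k + 1) * segment)
        cumulative = 0.0
        for i, p in enumerate(priorities):  # linear stand-in for tree descent
            cumulative += p
            if point <= cumulative:
                indices.append(i)
                break
        else:  # guard against floating-point spill past the last leaf
            indices.append(len(priorities) - 1)
    return indices
```

Because each draw lands in a sample's priority interval with probability p_j/∑p_j, transitions with larger TD-errors are replayed more often, as described above.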


Fig. 13. Accumulation of Qtarget over training episodes.

Fig. 14. Hypothetical asteroid terrain samples A and B (axes are dimensionless).

Fig. 15. Contour map of the asteroid surface samples A and B (axes are dimensionless).


In this numerical example, the departure point of the rover is arbitrarily set as [2.5, 27.5]T, while the target destination is [26.5, 3.5]T. Fig. 10 displays the optimal path connecting these two points (orange). To travel to the destination, the rover takes 48 steps in total, with a combination of "Walking-Rolling-Hopping" movements. It should be pointed out that there are a number of alternative trajectories with the same cost as the current optimal solution for this problem. Alternative optimal cases corresponding to the same cost are not discussed in this work.

When the pre-training process starts, the rover has minimal prior knowledge of the environment, including its initial position, the directions of movement, and which locations are accessible. As part of the pre-training process, the rover virtually explores thousands of action choices and trajectories to build a sufficiently rich memory of experiences. Constantly expanding the memory samples is a critical process for solving complex path-planning problems.

4.1.2. Discussion of the results
The number of steps taken to reach the end of an episode reduces as a result of the learning process. Fig. 11 displays rapid convergence of the number of steps with training episodes. A magnification of the first 400 episodes of the learning process is shown in Fig. 12. Observing Figs. 11 and 12, the number of steps per episode changes sharply during early training episodes. A large number of states, up to 200,000, are explored initially during pre-training to enrich the memory. At this stage, the algorithm acquires knowledge of the approximate location of the target. Subsequent training episodes are used to determine the optimal path. The inset in the top-right corner of Fig. 11 magnifies the number of steps per episode at the final stage of the learning process. When the number of steps per episode no longer changes, this may indicate convergence on a solution.

The hyper-parameters listed in Table 2 represent values that are tuned to improve learning performance on this first experiment. In the following experiments, better learning performance may be possible with additional tuning of the parameters. Each point in Fig. 13 denotes an accumulation of Qtarget (i.e., the converged cumulative return) per episode. The distribution of data points demonstrates a gradual rise in the accumulation of Qtarget as the training time increases. After sufficient training, the accumulation of Qtarget approaches a stable moving-average value, which usually indicates that the optimal path has been obtained.

Fig. 16. Optimal path for terrain A (axes are dimensionless).

Fig. 17. Optimal path for terrain B (axes are dimensionless).

Fig. 18. The accumulation of Qtarget for terrain A (left) and B (right).

4.2. Application to 3-D continuous surface

Next, the DQN path-planning algorithm for hopping rovers is tested on a refined representation of the asteroid terrain. Fig. 14 shows two hypothetical 3-D terrains that are modelled using the conversion method described in Sect. 2.1. The difference between map levels is equal to the maximum height of the rover hopping movements.
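One plausible reading of this level-map construction is a simple quantization: if adjacent levels differ by exactly the maximum hop height, each cell's level is its height divided by that hop height, rounded down. The sketch below makes this explicit; it is our own illustrative reading, and the exact conversion in Sect. 2.1 (not reproduced in this excerpt) may differ.

```python
import math

def height_to_level(height, hop_height):
    """Quantize a terrain height into an integer map level, assuming the
    level spacing equals the maximum height of a hopping movement."""
    return math.floor(height / hop_height)

def level_map(heights, hop_height):
    """Convert a 2-D grid of heights into a matrix of integer levels
    (negative levels correspond to cells below the sea-level plane)."""
    return [[height_to_level(h, hop_height) for h in row] for row in heights]
```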


Fig. 19. Number of steps per episode during the learning process for terrain A.

Fig. 20. Number of steps per episode during the learning process for terrain B.

Thus, the resulting contour maps are displayed in Fig. 15, where each map is a grid of 30 × 30 cells.

4.2.1. Trajectory evolution
Figs. 16 and 17 show a discretized version of the terrain maps in Figs. 14 and 15. The color scale is interpreted as in the previous example: light-green grid cells denote cells at the terrain sea level; warm-color grid cells are higher than the sea level, while cold-color grid cells are below the sea level; and the depth of the color is proportional to the absolute altitude value. The full lines are the optimal trajectories determined by the DQN algorithm. The blue and red circles are the respective target points, [25.5, 27.5]T and [23.5, 3.5]T. The two starting points are [1.5, 1.5]T and [4.5, 27.5]T, respectively.

The same learning mechanism and neural network architecture as in Sect. 4.1 are adopted in these two experiments. These preliminary results show that DRL may be sufficiently robust to guide the rover within a variety of rough asteroid terrains with adequate pre-training. In these experiments, the RMSProp optimizer [38] is used. During the learning process, the applied strategy is ε-greedy, with ε annealed from 1.0 to 0.1 over the first 3600 frames and fixed at 0.1 thereafter. In Fig. 18, the accumulation of Qtarget is shown for training on terrain A and terrain B.

Figs. 19 and 20 render the trajectory learning process of the DRL architecture as the number of steps per episode. The results show that knowledge of the surface topology is acquired during pre-training and that the DRL architecture may converge on an optimal trajectory. Overall, the trend in Fig. 19 is similar to that in Fig. 20. Differences in terrain topography and the inherent random nature of the ε-greedy policy are responsible for major variations in the early stages of the learning process. Occasional large variations in the number of episode steps in the middle and late learning stages may also be caused by bad updates of the DQN's memory bank.
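The ε-annealing schedule just described (from 1.0 to 0.1 over the first 3600 frames, fixed thereafter) can be sketched as a small helper; the function name is ours. Note that (1.0 − 0.1) / 3600 = 0.00025, which matches the "epsilon increment" entry of Table 2.

```python
def epsilon(frame, eps_start=1.0, eps_final=0.1, anneal_frames=3600):
    """Linearly anneal the exploration rate over the first `anneal_frames`
    frames, then hold it at its final value. The per-frame decrement is
    (eps_start - eps_final) / anneal_frames = 0.00025 for these defaults."""
    if frame >= anneal_frames:
        return eps_final
    return eps_start - (eps_start - eps_final) * frame / anneal_frames
```

With probability ε the agent would pick a random movement direction, and with probability 1 − ε the greedy action under the current Q estimate.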


Fig. 21. Optimal paths for an increasing proportion of perturbed cells (axes are dimensionless).

Table 3
Three key indicators in the training process.

Proportion of perturbed cells   Indicator 1   Indicator 2   Indicator 3
1%                              123982        261 s         4296
2%                              109978        236 s         4720
3%                              116895        240 s         6238
4%                              180504        450 s         7665

4.2.2. Discussion for static terrain variations
The 3-D terrain shown in Fig. 14 is a simulated asteroid terrain. Compared with real asteroid terrains, the surface in Fig. 14 is relatively flat and smooth. In addition, the topography of the simulated terrain is assumed known. The surface topography of target small bodies is usually obtained after rendezvous via multiple fly-bys, as demonstrated, for example, by the Rosetta⁵ and Hayabusa2⁶ missions. Due to the limited duration and number of possible fly-bys, as well as the finite accuracy of the on-board sensors, differences between the mapped and actual surface are to be expected. Therefore, the rover's ability to adapt to uncertain ground is warranted.

In order to preliminarily verify the generality and robustness of a DRL path-planning architecture under slight variations of the surface topography, additional static maps are generated by randomly perturbing the altitude of selected unit cells of surface B. The level difference between the value of a perturbed cell and its original value is ±1. The symbol '☆' in Fig. 21 marks a perturbed grid cell. The number of perturbed grid cells from left to right, top down, in Fig. 21 is 9, 18, 27 and 36 among the total 900 cells (representing 1%, 2%, 3% and 4% of the total number of grid cells on the map, respectively). Throughout the following simulations, the departure point and the target point are the same as those specified in Sect. 4.2.1, i.e., from [4.5, 27.5]T to [23.5, 3.5]T.

In the top two maps of Fig. 21, the introduction of perturbed grid cells leads to a modification of the terrain, but with little or no variation of the states near the original, converged trajectory. Therefore, the new converged path is the same as the one in the original, unperturbed map. In the bottom two maps of Fig. 21, the proportion of perturbed grid cells increases to 3% and 4%. In these latter maps, a fraction of the perturbed grid cells are located near the original optimal path (Fig. 14). As a consequence, there is a notable variation of the final, converged trajectory to reach the target. A large number of simulation experiments shows that a nearly optimal path from a fixed initial point to a fixed target point may still be found for small variations of the terrain to traverse.

Table 3 summarizes three key indicators in the training process that are analyzed to understand the effect of cell perturbations:

⁵ http://www.esa.int/Science_Exploration/Space_Science/Rosetta [Retrieved 2020-02-25].
⁶ http://www.hayabusa2.jaxa.jp/en/ [Retrieved 2020-02-25].


Indicator 1: the total number of steps that the rover takes to move from the initial position to the target area in the first successful training episode.
Indicator 2: the time it takes for the rover to reach the target point from the initial position in the first successful training episode.
Indicator 3: the number of episodes required to reach a stable reward accumulation value.

As shown in Table 3, when the proportion of perturbed grid cells increases from 1% to 3%, the change in the terrain features may be negligible. Therefore, the variation in step number and time for optimization training is relatively small. When the proportion of perturbed grid cells is 4%, terrain variations become more important. The corresponding number of exploration steps increases by 50% relative to the unperturbed value. The exploration time of the first episode is also about 200 s longer than its original value. Introducing a small fraction of randomly perturbed cells in the 2-D map may not prevent the algorithm from identifying a nearly optimal trajectory, but the computational demand may increase.

The 3-D terrain shown in Fig. 14 is a simulated and idealized asteroid terrain. Much work remains to employ the current algorithm on-board an actual rover. One particular challenge is that the validity of the learned path-planning policy is limited by the accuracy of the terrain map that is utilized during training. The current algorithm learns a path-planning strategy over a static map (i.e., the terrain map is fixed for all training episodes). Therefore, the algorithm determines a nearly-optimal path that is specific to the terrain map provided and does not generalize to different maps autonomously. A natural extension of our work would be training the RL agent on a distribution of maps (i.e., at each episode the terrain map is drawn from a probability distribution). That might enable the RL agent to learn an adaptive path-planning strategy over the selected distribution of terrain maps. Another possible extension of this work involves a more realistic model of the rover's locomotion. A higher-fidelity model may include an irregular gravity field, stochastic bouncing on highly irregular surfaces, active outgassing regions, and irregular step lengths.

5. Conclusion

In this paper, the path-planning problem for hopping rovers on asteroid surfaces with complex terrain is investigated using deep reinforcement learning. The equivalent conversion method reduces the dimension of the state variables by converting arbitrarily complex three-dimensional terrains into a reward matrix. An improved neural network based on the DQN architecture is developed to determine the optimal actions for the hopping rover. Three advanced architecture elements (i.e., DDQN, the dueling network, and prioritized experience replay) are combined to improve the performance of the DRL algorithm and make the learning more efficient. Offline pre-training for different terrain complexities is demonstrated. Results show a good generality of the DRL method. On a topographic map of 30 × 30 size, the DQN ensures that the optimal path may be found successfully between any two points within approximately 9000 episodes. The robustness of the proposed method is evaluated by considering a static uncertainty of the asteroid terrain map with an error up to 4% of the total number of map cells. Optimal paths are successfully obtained in the tested perturbed-map scenarios. Pre-trained DRL architectures may be a stepping stone in the development of hopping rovers for the autonomous exploration of asteroid surfaces.

Declaration of competing interest

We declare that we have no conflict of interest.

Acknowledgments

The authors from the Beijing Institute of Technology acknowledge the support of the National Natural Science Foundation of China (No. 11972075). All the authors appreciate the valuable suggestions made by the two reviewers of this paper.

References

[1] W.R. Wu, W.W. Liu, D. Qiao, D.G. Jie, Investigation on the development of deep space exploration, Sci. China Technol. Sci. 55 (4) (2012) 1086–1091.
[2] V. Carruba, N. David, V. David, Detection of the YORP effect for small asteroids in the Karin cluster, Astron. J. 151 (6) (2016) 164.
[3] T. Yoshimitsu, T. Kubota, I. Nakatani, T. Adachi, H. Saito, Micro-hopping robot for asteroid exploration, Acta Astronaut. 52 (2–6) (2003) 441–446.
[4] H. Cottin, J. Michelle Kotler, K. Bartik, H. James Cleaves II, C.S. Cockell, J.P. de Vera, P. Ehrenfreund, S. Leuko, I. Loes Ten Kate, Z. Martins, R. Pascal, R. Quinn, P. Rettberg, F. Westall, ... , Astrobiology and the possibility of life on Earth and elsewhere, Space Sci. Rev. 209 (1–4) (2017) 1–42.
[5] M. Martínez-Jiménez, C.E. Moyano-Cambero, J.M. Trigo-Rodríguez, J. Alonso-Azcárate, J. Llorca, Asteroid mining: mineral resources in undifferentiated bodies from the chemical composition of carbonaceous chondrites, Assessment and Mitigation of Asteroid Impact Hazards, Springer, Cham, 2017, pp. 73–101.
[6] L.D. Mathias, L.F. Wheeler, J.L. Dotson, A probabilistic asteroid impact risk model: assessment of sub-300 m impacts, Icarus 289 (2017) 106–119.
[7] P. Ehrenfreund, C. McKay, J.D. Rummel, B.H. Foing, C.R. Neal, T. Masson-Zwaan, M. Ansdell, N. Peter, J. Zarnecki, S. Mackwell, M. Antionetta Perino, L. Billings, J. Mankins, M. Race, Toward a global space exploration program: a stepping stone approach, Adv. Space Res. 49 (1) (2012) 2–48.
[8] S. Watanabe, Y. Tsuda, M. Yoshikawa, S. Tanaka, T. Saiki, S. Nakazawa, Hayabusa2 mission overview, Space Sci. Rev. 208 (1–4) (2017) 3–16.
[9] M. Crombie, S. Selznick, M. Loveridge, B. Rizk, D.N. DellaGiustina, D.R. Golish, B.J. Bos, P.R. Christensen, V.E. Hamilton, D. Reuter, A.A. Simon, O.S. Barnouin, M.G. Daly, R.C. Espiritu, B. Clark, M.C. Nolan, D.S. Lauretta, Schedule of origins, spectral interpretation, resource identification, security-regolith explorer data product releases to the planetary data system, AGU Fall Meeting Abstracts, 2018.
[10] Y. Takahashi, D.J. Scheeres, R.A. Werner, Surface gravity fields for asteroids and comets, J. Guid. Contr. Dynam. 36 (2) (2013) 362–374.
[11] M. Pavone, J.C. Castillo-Rogez, I.A.D. Nesnas, J.A. Hoffman, N.J. Strange, Spacecraft/rover hybrids for the exploration of small solar system bodies, 2013 IEEE Aerospace Conference, IEEE, 2013, pp. 1–11.
[12] R.P. Binzel, A.S. Rivkin, S.J. Bus, J.M. Sunshine, T.H. Burbine, MUSES-C target asteroid (25143) 1998 SF36: a reddened ordinary chondrite, Meteoritics Planet. Sci. 36 (8) (2001) 1167–1172.
[13] T.C. Duxbury, A.V. Zakharov, H. Hoffmann, E.A. Guinness, Spacecraft exploration of Phobos and Deimos, Planet. Space Sci. 102 (2014) 9–17.
[14] S. Zhou, J. Xi, M.W. McDaniel, T. Nishihata, P. Salesses, K. Iagnemma, Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain, J. Field Robot. 29 (2) (2012) 277–297.
[15] R. Valencia-Murillo, N. Arana-Daniel, C. López-Franco, A.Y. Alanís, Rough terrain perception through geometric entities for robot navigation, 2nd International Conference on Advances in Computer Science and Engineering (CSE 2013), Atlantis Press, 2013.
[16] Y. Chen, G. Luo, Y. Mei, J. Yu, X. Su, UAV path planning using artificial potential field method updated by optimal control theory, Int. J. Syst. Sci. 47 (6) (2016) 1407–1420.
[17] A. Pandey, R. Kumar Sonkar, K. Kant Pandey, D.R. Parhi, Path planning navigation of mobile robot with obstacles avoidance using fuzzy logic controller, 2014 IEEE 8th International Conference on Intelligent Systems and Control (ISCO), IEEE, 2014.
[18] M. Elhoseny, A. Tharwat, A.E. Hassanien, Bezier curve based path planning in a dynamic field using modified genetic algorithm, J. Comput. Sci. 25 (2018) 339–350.
[19] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[21] F. Wang, J. Jason Zhang, X. Zheng, X. Wang, Y. Yuan, X. Dai, J. Zhang, Y. Yuan, Where does AlphaGo go: from Church-Turing thesis to AlphaGo thesis and beyond, IEEE/CAA J. Autom. Sinica 3 (2) (2016) 113–120.
[22] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (2015) 85–117.
[23] D. Izzo, A survey on artificial intelligence trends in spacecraft guidance dynamics and control, Astrodynamics 3 (4) (2019) 287–299.
[24] B. Gaudet, R. Linares, R. Furfaro, Seeker based adaptive guidance via reinforcement meta-learning applied to asteroid close proximity operations, (2019) arXiv preprint arXiv:1907.06098.
[25] Y. Zhu, R. Mottaghi, E. Kolve, J.J. Lim, A. Gupta, L. Fei-Fei, A. Farhadi, Target-driven visual navigation in indoor scenes using deep reinforcement learning, 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017.


[26] H. Osmaston, Estimates of glacier equilibrium line altitudes by the Area×Altitude, the Area×Altitude Balance Ratio and the Area×Altitude Balance Index methods and their validation, Quat. Int. 138 (2005) 22–31.
[27] M. Tokic, Adaptive ε-greedy exploration in reinforcement learning based on value differences, Annual Conference on Artificial Intelligence, Springer, Berlin, Heidelberg, 2010.
[28] R. Kumar, B. Moseley, S. Vassilvitskii, A. Vattani, Fast greedy algorithms in MapReduce and streaming, ACM Trans. Parallel Comput. (TOPC) 2 (3) (2015) 14.
[29] C.J.C.H. Watkins, Learning from Delayed Rewards, (1989).
[30] C.J.C.H. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[31] M.I. Jordan, D.E. Rumelhart, Forward models: supervised learning with a distal teacher, Cognit. Sci. 16 (3) (1992) 307–354.
[32] L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT 2010, Physica-Verlag HD, 2010, pp. 177–186.
[33] H. van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[34] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, N. de Freitas, Dueling network architectures for deep reinforcement learning, (2015) arXiv preprint arXiv:1511.06581.
[35] T. Schaul, J. Quan, I. Antonoglou, Prioritized experience replay, (2015) arXiv preprint arXiv:1511.05952.
[36] H. van Hasselt, Double Q-learning, Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
[37] K. Arulkumaran, M.P. Deisenroth, M. Brundage, A.A. Bharath, Deep reinforcement learning: a brief survey, IEEE Signal Process. Mag. 34 (6) (2017) 26–38.
[38] I. Bello, B. Zoph, V. Vasudevan, Q. Le, Neural optimizer search with reinforcement learning, Proceedings of the 34th International Conference on Machine Learning, vol. 70, JMLR, 2017, pp. 459–468.

