Acta Astronautica
journal homepage: www.elsevier.com/locate/actaastro
Keywords:
Asteroid surface exploration
Hopping rover
Path planning
Deep reinforcement learning

Abstract: Asteroid surface exploration is challenging due to complex terrain topology and an irregular gravity field. A hopping rover is considered a promising mobility solution to explore the surface of small celestial bodies. Conventional path planning tasks, such as traversing a given map to reach a known target, may become particularly challenging for hopping rovers if the terrain displays sufficiently complex 3-D structures. As an alternative to traditional path-planning approaches, this work explores the possibility of applying deep reinforcement learning (DRL) to plan the path of a hopping rover across a highly irregular surface. The 3-D terrain of the asteroid surface is converted into a level matrix, which is used as an input to the reinforcement learning algorithm. A deep reinforcement learning architecture with good convergence and stability properties is presented to solve the rover path-planning problem. Numerical simulations are performed to validate the effectiveness and robustness of the proposed method with applications to two different types of 3-D terrains.
1. Introduction

Various space agencies around the world are intensifying the exploration of solar system small bodies [1], and the number of observed asteroids is increasing [2]. Asteroid exploration is of great significance because: (1) the geological structure of an asteroid surface reveals details about the formation and evolution of the solar system [3]; (2) asteroids may harbor clues about the origin of life [4]; (3) mineral resources could be available at asteroids [5]; (4) asteroid impact represents a threat to human society [6].

Asteroid exploration may be conducted via ground-based observations, spacecraft flyby or rendezvous, and in-situ exploration with surface landers. Ground-based observations and spacecraft rendezvous are currently the most advanced approaches. In the last two decades, the United States, Japan and European countries have also successfully implemented various asteroid flyby or rendezvous programs [7]. On December 3, 2014, the Japan Aerospace Exploration Agency (JAXA) successfully launched a spacecraft, Hayabusa 2, to rendezvous with the near-Earth asteroid 162173 Ryugu [8]. On December 3, 2018, NASA's "Origins, Spectral Interpretation, Resource Identification, Security-Regolith Explorer" (OSIRIS-REx) spacecraft [9] completed its 1.2 billion-mile (2 billion-kilometer) journey and arrived at the asteroid 101955 Bennu.

Obtaining access to asteroid surfaces is the natural continuation of the current exploration effort. Different from major planetary bodies, the irregular and weak gravity field [10] of asteroids drives the development of alternative forms of mobility. In particular, hopping may be one of the most efficient ways to traverse an asteroid surface [11]. On November 12, 2005, the Hayabusa spacecraft attempted the deployment of a hopping rover, MINERVA, on the surface of asteroid 25143 Itokawa [12]. Unfortunately, the lander missed the target and was lost into interplanetary space. Taking that experience into account, Hayabusa 2 has recently succeeded in deploying a similar hopping robot, called MASCOT,¹ on Ryugu. Hayabusa 2 aims to study the asteroid Ryugu and return surface samples to Earth by 2020. In the meantime, Stanford University, JPL and the California Institute of Technology have developed a robotic hedgehog for the exploration of Phobos [13].

With increasing missions to less known asteroids in the future, new methods are required to guide and control hopping rovers over uncharted asteroid surfaces. Specifically, hopping rovers may be required to autonomously traverse an asteroid surface with complex terrains and irregular gravity. One option to automate the path planning process may be to create traversability maps from visual observations of surface features. Then, the terrain surrounding the rover can be classified and mapped to a cost function that is used to determine the optimal path for a given task [14,15]. However, it may be problematic for a small, cost-effective rover to independently acquire and process in-situ the
∗ Corresponding author. E-mail address: zeng@bit.edu.cn (X. Zeng).
¹ https://mascot.cnes.fr/en/MASCOT/index.htm [Retrieved 2020-02-25].
https://doi.org/10.1016/j.actaastro.2020.03.007
Received 17 July 2019; Received in revised form 28 February 2020; Accepted 7 March 2020
Available online 13 March 2020
0094-5765/© 2020 IAA. Published by Elsevier Ltd. All rights reserved.
J. Jiang, et al. Acta Astronautica 171 (2020) 265–279
sufficient global, visual data that enable autonomous surface exploration. In addition, the continuous change of light and shadow that characterizes a small body surface may critically impact the quality of the visual information that is accessible.

As an alternative, a hopping rover may move independently without any local visual information using a pre-trained control architecture. First assume that partial topo-surface information of the target asteroid is available, perhaps from instruments onboard a mother spacecraft or a landing probe, as illustrated in Fig. 1. The topo-surface information may be converted into regional height-level difference data. Before being deployed, the hopping rover's movement trajectory and control system may be pre-trained using simulated surface level differences. The processing and calculation of the terrain data can be performed from the landing probe, the mother spacecraft or the ground station. The hopping rover(s) will not be released until the optimal path is obtained. Although the topo-surface data are assumed as known information, such data may not be sufficiently accurate to implement path planning solutions as open-loop guidance. A closed-loop control mechanism may be required to ensure that the rover adjusts its course to the target and offsets perturbations that are due to finite modeling accuracy.

This paper takes a first step in developing a path planning algorithm for hopping rovers that is based on deep reinforcement learning (DRL). To date, there exist a number of mature path-planning methods, including but not limited to artificial potential fields [16], fuzzy logics [17], and genetic algorithms [18]. Nevertheless, when these methods are applied to real-world problems, they typically result in point-design solutions that easily fall into a local optimum and require extensive numerical experimentation for each specific application. Reinforcement learning (RL) [19] is emerging as a more general framework for adaptive decision making, as it leverages the ability of an agent to autonomously interact with the environment and improve its decision making strategy. However, RL in its original form is typically not able to solve dimensionally large problems, such as end-to-end path planning.

In 2015, Mnih et al. presented the Deep Q Network algorithm [20], a pioneering work in the field of reinforcement learning applied to dimensionally large problems. In 2016, by combining deep reinforcement learning with Monte Carlo tree search (MCTS), Silver et al. developed an artificial Go player, 'AlphaGo' [21], which defeated world-class Go champions. Both success stories are based on the idea of combining RL with deep neural networks (DNN) [22], ones that supply the capability to perceive and extract critical information from very large datasets. Recently, reinforcement learning techniques have gained increasing attention from scientists and engineers in the space community [23]. In 2019, Gaudet, Linares, and Furfaro developed an adaptive integrated guidance, navigation, and control system using reinforcement meta-learning that can complete different landing maneuvers in an environment with unknown dynamics, with initial conditions spanning a large deployment region, and without a shape model of the asteroid [24]. The resulting DRL framework [25] may enable perceptual decision-making tasks within complex systems, including path-planning for a hopping rover on a poorly known asteroid surface.

In this work, we explore the application of deep reinforcement learning to the problem of path planning on an asteroid surface. The main contribution of this paper is the design and training of a DRL architecture, which uses level differences to determine a series of actions for the hopping rover. Provided partial surface topology information, the objective is to determine the optimal trajectory from an initial known area to the target area that minimizes energy consumption. This architecture may serve as a stepping stone to develop next-generation guidance schemes for autonomous hopping rovers. In fact, pre-trained DRL architectures are naturally preconfigured to learn from different topo-surface terrain data.

Fig. 1. Sketch for a landing probe releasing the hopping rover.

2. Problem statement

This section describes a method to model terrain features and actions for hopping rover motion within asteroid surface environments.
2.1. Terrain representation for asteroid surface navigation

Asteroid surface path planning requires a virtual model of the terrain to traverse. A representation of the terrain that is convenient for path planning describes terrain features in terms of height levels over a predetermined reference altitude, called sea-level [26]. Given information about a connected region of the asteroid surface, the shortest distance dn and the longest distance df from the surface region to the center of mass of the asteroid can be calculated. The average between the dn and df values defines the radius of an equivalent sphere that renders the local sea level. The altitude of any point at the local sea level is zero.

Let L represent the maximum altitude level difference throughout the selected asteroid surface region, which is approximated by the difference between dn and df. Also assume that the motion of the hopping rover is constrained by a maximum bouncing height, l. Then, the maximum altitude level difference, L, can be divided into N equally spaced intervals, with N = L/l. An altitude value in the interval [0, l) is considered the first positive level (recall that zero defines the sea level), while a height value in the interval [l, 2l) is considered the second positive level, and so on. The same classification applies to altitude values below the sea level, commencing with the first negative level for altitude values in the interval (−l, 0].

Given an asteroid terrain in the form of a two-dimensional contour map, we assign each contour plane to a fixed and independent level-value. For example, a value of 1 might be given to the first positive level, a value of 0 to the sea-level altitude, and a value of −2 to the second negative level. Then, we rasterize the 2-D contour map with a 30 × 30 grid. In other words, the 2-D contour map can be transformed into 900 quadrilateral regions (grid cells) of various altitude levels. The horizontal distance that a rover can traverse with a single bounce may vary significantly with the initial state and surface conditions. To simplify the problem at hand, the horizontal distance for each movement in this paper is constrained to a unit cell of the grid. The length of each grid cell is approximately 0.1∼1 m for a 100-m-sized asteroid. If we consider a square area with a side of 30 cells, then the side of such an area measures approximately 3∼30 m. If a grid cell contains multiple contour planes, the level-value of the cell center point is taken as a reference. If the center point of the grid cell is at the intersection of two contour planes, the lower level-value is taken as a reference. Eventually, a discretized level-value map can be obtained.

2.2. Surface motion scheme

Due to the irregular and weak gravity field, hopping rovers are a preferable option for asteroid surface exploration, one that is different from planetary surface exploration robots. The motion of a hopping rover can be treated as a combination of hopping segments and tumbling segments. Commanding the rover to move (bounce or roll) over a unit grid length, that is, the distance between the midpoint of the current grid cell and the midpoint of the next grid cell, is considered as an action.

In this paper, the set of rover actions is restricted to a unit grid length movement in four independent directions. Each direction is separated by a 90° angle: front, back, left and right. Three types of rover movement are possible, as shown in Fig. 2. When moving from a higher level to a lower level, a tumbling/rolling motion is adopted to let the rover slide or freely fall, which is referred to as "Rolling" in this study. A small bounce or roll can be used to move forward on the same level between two adjacent locations, referred to as "Walking". Going from a lower level to a higher one, hopping/jumping is required to move the rover, which is referred to as "Hopping". The aforementioned mechanisms of locomotion are based on the principle of momentum exchange and have been demonstrated by existing rovers, such as the Cubli,² the Hedgehog,³ and the MINERVA-II.⁴

It is assumed that rover rolling and hopping movements are limited to a maximum altitude change of one level (positive or negative) when moving across two adjacent grid cells. Such an assumption is made under the intuition that large hopping movements may trigger surface escape trajectories and large rolling movements may damage the rover. If the absolute altitude difference between adjacent cells is greater than one level, the terminal cell is considered to be hazardous for the current state. Fig. 3 displays a two-dimensional top view and a 3-D graph of a hypothetical terrain. The number within each grid cell of the top view represents the altitude level. Assume that the rover is initially located in the cell at the center of the grid. From this location, it is only feasible to move left or up. Down and right movements are considered hazardous as the altitude level difference of the adjacent cells is greater than one.

3. Deep reinforcement learning for traversing asteroid surface

Deep Reinforcement Learning (DRL) may be a convenient framework to develop autonomous path planning algorithms for hopping rovers. DRL architectures can be pre-trained with complex asteroid surface information. After obtaining the optimal trajectory, the hopping rover is released to explore the environment. A DRL architecture may also allow for learning end-to-end path-planning policies, ones that map sensory input about local altitude level variations to rover movement commands.

3.1. Reinforcement learning

Deep reinforcement learning is an extension of reinforcement learning. Reinforcement learning is based on the idea that an agent interacts with an environment over a number of discrete time steps. Note that interactions may be real or simulated. At each time step t, the agent explores a state st and selects an action at from some set of possible actions A according to a policy π. Afterwards, the agent reaches the next state st+1 and obtains a reward rt which guides the selection of future actions. (Note: this is called a learning iteration.) The process continues multiple times until a terminal state, one that defines failure or success, is reached. Fig. 4 summarizes the basic model for

² https://idsc.ethz.ch/research-dandrea/research-projects/archive/cubli.html [Retrieved 2020-02-25].
³ https://www.jpl.nasa.gov/news/news.php?feature=4712 [Retrieved 2020-02-25].
⁴ http://www.hayabusa2.jaxa.jp/en/topics/20180919e/ [Retrieved 2020-02-25].
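As an illustration, the level banding of Sect. 2.1 and the one-level movement constraint of Sect. 2.2 can be sketched as follows. This is a minimal interpretation of the scheme described in the text; the function and variable names are ours, and the paper does not prescribe an implementation.

```python
import numpy as np

def altitude_to_level(h, l):
    """Map an altitude h (relative to the local sea level) to an integer level,
    following Sect. 2.1: [0, l) -> +1, [l, 2l) -> +2, (-l, 0) -> -1, and so on;
    exactly zero is the sea level (level 0)."""
    if h == 0:
        return 0
    if h > 0:
        return int(h // l) + 1
    return -(int((-h) // l) + 1)

# Unit-grid movement directions: front/back/left/right on the 2-D level map.
MOVES = {"front": (-1, 0), "back": (1, 0), "left": (0, -1), "right": (0, 1)}

def feasible_moves(levels, row, col):
    """Return the feasible moves from (row, col) with their locomotion type.
    A move is excluded as hazardous if the level difference exceeds one."""
    out = {}
    n_rows, n_cols = levels.shape
    for name, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            continue
        diff = levels[r, c] - levels[row, col]
        if abs(diff) > 1:
            continue  # hazardous cell for the current state
        out[name] = "Hopping" if diff == 1 else ("Rolling" if diff == -1 else "Walking")
    return out
```

For the center cell of a hypothetical 3 × 3 patch with levels [[1, 1, 3], [0, 0, 2], [-2, -3, 4]], only "front" (one level up, hopping) and "left" (same level, walking) survive the hazard check, mirroring the Fig. 3 example in which only up and left moves are feasible.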
Fig. 3. A hypothetical terrain model (left: top view, right: 3-D view).
reinforcement learning.

The goal of an agent is to maximize the expected cumulative sum of rewards, or return, from each state st. The expected return is captured by a state-action value function. Processing from the initial to the terminal state constitutes a learning episode. (Note: one learning episode has multiple learning iterations.) In our problem, target and hazardous cells define a terminal state. In order to obtain the optimal path, a state-action value function Q is introduced as:

Qπ(s, a) = E[Rt | st = s, a]    (1)

where the return Rt is the total accumulation of rewards from time step t with discount factor γ ∈ (0, 1]:

Rt = Σ_{k=0}^{∞} γ^k r_{t+k}    (2)

The parameter γ is a discount factor that weights the importance of immediate and later rewards. The Q value is the actual expected return for selecting action a in state s by exploiting policy π [27]. The optimal function Q*(s, a) = maxπ Qπ(s, a) gives the maximum expected action value for state s. Similarly, an optimal policy is derived by selecting the highest-valued action in each state. Under an ε-greedy policy, ε represents the probability of adopting random actions over optimal actions, which enables exploration of the solution space.

3.1.2. State-action value function update

In each learning iteration, assume a virtual rover moving across a discrete terrain map by choosing one action from a collection of actions at every time step. At step n, the rover reaches the state sn ∈ X, and chooses the action an ∈ A accordingly. The agent receives a reward rn, whose value depends only on the current state and action. The transition from the current state sn to the next state sn+1 is stochastic and described by a probability distribution:

Prob[s_{n+1} = s′ | sn, an] = P_{ss′}[a]    (3)

For the given policy π, the expected value of each state is rendered by the state value function, Vπ(s), which is defined as the accumulation of reward signals that a rover expects to receive from that state on, until the terminal state:

Vπ(s) = E[r(s0, a0) + γ r(s1, a1) + γ² r(s2, a2) + ⋯ | s0 = s, π(a|s)]    (4)

Under the policy π, the state value function, Vπ(s), satisfies the Bellman equation:
One is the current reward r given directly by the environment. The other is the expected value of the discounted reward in the future. Thus, the optimal policy π* can be described as:

V*(s) ≡ V_{π*}(s) = max_a { R_s(a) + γ Σ_{s′} P_{ss′}[a] V*(s′) }    (6)

Watkins discusses a representation of the Bellman equation that is more convenient for incremental dynamic programming when the model is unknown to the agent [29]. This representation rewrites the Q value (or action-value) in Eq. (1) as:

Qπ(s, a) = R_s(a) + γ Σ_{s′} P_{ss′}[π(s)] Vπ(s′)    (7)

The Q value is the expected discounted return for executing an action at the current state s within the policy π. Incremental dynamic programming that is based on the Q-value representation of the Bellman equation is also known as Q-learning [30].

The target of Q-learning is to determine the Q value for the optimal policy. Note that V*(s) = max_a Q*(s, a); if a* is the action at which the maximum is attained, the optimal policy takes the form π*(s, a) ≡ a*. If the agent learns the policy π*, it can then find optimal actions for the environment. The updating formula of the basic Q-learning algorithm is:

Q(st, at)_new = (1 − α) Q(st, at)_old + α [rt + γ max_a Q(st+1, a)]    (8)

where Q_old and Q_new represent the previous and the updated Q value function, respectively. Here, α is the learning rate that determines the learning speed. The max operation in Eq. (8) is trivial to implement if the Q value function can be rendered in tabular form. However, the performance of tabular Q-learning methods is typically unsatisfactory in problems with larger dimensions.

3.1.3. Deep learning

Deep neural networks may serve as a function approximation for the state-action value function Q. A network with parameters θi approximating the state-action value function Q at iteration i is trained by minimizing a sequence of loss functions Li(θi) [31]:

Li(θi) = E_{s,a∼ρ(·)} [(yi − Q(s, a; θi))²]    (9)

where yi represents the target value. The target value yi is calculated as:

yi = E_{s′∼E} [r + γ max_{a′} Q(s′, a′; θi⁻) | s, a]    (10)

This network is referred to as the Q evaluation network. The parameters of the Q evaluation network are adjusted at each iteration to obtain an accurate function approximation of the state-action value function. A typical method to update the Q evaluation network parameters is stochastic gradient descent [32]. Stochastic gradient descent requires computing the gradient of the loss function:

∇_{θi} Li(θi) = E_{s,a∼ρ(·); s′∼E} [(r + γ max_{a′} Q(s′, a′; θi⁻) − Q(s, a; θi)) ∇_{θi} Q(s, a; θi)]    (11)

Specifically, Q(s, a; θi) in Eq. (11) represents the output of the current evaluation network and renders the value of the current state-action pair. Deep reinforcement learning may employ an additional neural network to generate the target value yi, which is called the Q target network. In Eq. (10), Q(s′, a′; θi⁻) denotes the output of the target network. The target network parameters θi⁻ are only updated every N steps and kept constant between individual updates. Therefore, there are two networks in the DRL architecture: the Q target network and the Q evaluation network (see Figs. 5 and 6 in Sect. 3.2.2). The Q target network predicts the value of Qtarget, and the evaluation network is used to compute the value of Qeval. The structure of these two networks may be identical, while internal parameters can take different values. Typically, the Q target network is an earlier version of the Q evaluation network, holding the internal parameters for a number of iterations through the update process. In other words, the set of parameters that defines the Q target network is fixed for a given number of iterations, N, before being replaced by the current parameters of the Q evaluation network. Fixing the Q target for a certain number of update iterations reduces correlations of adjacent states and improves convergence properties [20].

Deep neural networks are used to build progressively more abstract representations of the data. The combination of deep neural networks and Q-learning constitutes a Deep Q Network (DQN) that may learn from high-dimensional inputs. A final key idea is necessary to build an efficient DQN: experience replay.

Experience replay consists in the randomization of training data. Experience replay smooths changes in the data distribution and removes correlations in the observation sequence. Transfer samples (st, at, rt, st+1) from the interaction between the agent and the environment at each time step are stored into a replay memory unit. At each time step within a learning iteration, transfer samples are randomly selected from the replay memory unit to update the Q value networks. Without loss of
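The tabular update of Eq. (8), the replay memory of transfer samples, and the periodic target-network replacement described in Sect. 3.1.3 can be sketched together as follows. This is a toy illustration with our own names; tabular dictionaries stand in for the paper's neural networks, and the four-action assumption follows Sect. 2.2.

```python
import random
from collections import deque

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Basic Q-learning update, Eq. (8):
    Q_new = (1 - alpha) * Q_old + alpha * (r + gamma * max_a' Q(s', a')).
    Assumes four movement actions (front/back/left/right), indexed 0..3."""
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (
        r + gamma * max(Q.get((s_next, b), 0.0) for b in range(4))
    )

class ReplayMemory:
    """Fixed-size memory of transfer samples (s, a, r, s_next); the oldest
    samples are evicted once the capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, sample):
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Uniform random minibatch, as in plain (non-prioritized) replay.
        return random.sample(self.buffer, batch_size)

def maybe_sync_target(step, replace_iter, q_eval, q_target):
    """Copy the evaluation parameters into the target network every
    `replace_iter` steps; between syncs the target stays fixed."""
    if step % replace_iter == 0:
        q_target.clear()
        q_target.update(q_eval)
```

With Table-2-like settings this would mean minibatches of 32 samples drawn from a 20000-sample memory and a target-network replacement every 300 iterations.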
Table 2
List of hyper-parameters.

    Hyper-parameter             Value
1   Minibatch size              32
2   Replay memory size          20000
3   Discount factor             0.9
4   Learning rate               0.0025
5   Initial exploration         1
6   Final exploration           0.1
7   Epsilon increment           0.00025
8   Replace target iteration    300

Double Q-learning [36] has been observed to produce more reliable and stable learning processes. Double Q-learning can be implemented within the DQN framework without altering the neural network architectures for approximating the state-action value function. Only the definition of YtDQN needs to be updated with the definition of YtDoubleDQN that is supplied in Eq. (13), where the execution of a depends on the evaluation network. As an alternative, online action selection can be employed to mitigate over-estimation [33].

B. A dueling architecture divides the output of the network into two parts: one is the function V(s) predicting the state value, and the other is the function A(a) rendering the advantage that each action provides. Here, a dueling DQN can be treated as a single network with two streams at the penultimate layer. During policy evaluation, this architecture may identify optimal actions more rapidly. The final Q value can be defined as:
Q(s, a; αL, βL) = V(s; βL) + A(s, a; αL)    (15)

where αL and βL are the parameters of the network layers.

C. In training a DQN, it may be beneficial to reuse past memories by utilizing experience replay. In a natural DQN, past experience is drawn by random sampling, where the importance of each individual memory is ignored. Prioritized experience replay attempts to replay important memories more often. As such, the learning process becomes more efficient. A temporal difference error [37] (shortened as 'TD-error') is introduced to weight the importance of memory samples:

δj = rj + γ Qtarget(Sj, argmax_a Q(Sj, a)) − Q(Sj−1, Aj−1)    (16)

where δj is the value of the j-th TD-error. Here, the subscript j = {1, 2, 3, …} represents the index of each sample from the memory.

Samples with a higher TD-error hold a higher training priority. Owing to the high computational cost of arranging memory priorities, a SumTree algorithm may be introduced to increase the efficiency of the memory sampling:

n = p/batch    (17)

Referring to Fig. 8, when extracting 5 sets of data (i.e., n = 5), the priority distribution can be represented with the priority value intervals [1–7], [8–14], [15–21], [22–28], and [29–35]. After obtaining the interval of each branch size, a number is randomly selected in each interval during sampling, and the corresponding priority is obtained by utilizing the SumTree. For instance, {3, 9, 15, 26, 34} are randomly selected during a given iteration. Taking the number 15 as an example, we traverse the tree from the top to the bottom layer: '15' belongs to "[1–19] → [11–19] → [13–19]", which determines that its priority is 7. Accordingly, the sample corresponding to priority 7 will be trained. The structure of the SumTree makes the samples corresponding to larger TD-errors more likely to be relearned. More detailed discussions about the TD-error method can be found in Ref. [35].

Fig. 10. Optimal path (axes are dimensionless).

4. Numerical analysis

Pre-training of a DQN architecture for hopping rover path planning without visual sensory input is performed. Both discrete and continuous terrains are considered. The computer configuration employed for pre-training comprises an Intel Core i7 CPU @ 2.2 GHz (8 cores, multi-thread) and 16 GB of RAM. Throughout the simulations, the reward setting is consistent with that given in Table 1. Training of the Q networks depends on the following hyper-parameters:

1) Minibatch size, i.e., the number of samples for each training iteration;
2) Replay memory size, i.e., the size of the memory bank for data storage;
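The interval-based sampling just described can be approximated with a flat cumulative-sum array, where an O(log n) `searchsorted` lookup plays the role of the SumTree traversal. This is a simplified stand-in, not the paper's implementation; the names and structure are ours.

```python
import numpy as np

def stratified_priority_sample(priorities, n, rng):
    """Split the total priority mass into n equal intervals, draw one
    uniform number per interval, and map each draw back to a sample index,
    so samples with larger TD-error are replayed more often."""
    p = np.asarray(priorities, dtype=float)
    cum = np.cumsum(p)                       # flat stand-in for the SumTree
    width = cum[-1] / n                      # width of each priority interval
    lows = width * np.arange(n)
    draws = rng.uniform(lows, lows + width)  # one draw per interval
    return np.searchsorted(cum, draws)       # equivalent of the tree descent

rng = np.random.default_rng(0)
indices = stratified_priority_sample([1.0, 1.0, 8.0], 5, rng)
```

Here the third sample holds 80% of the total priority mass, so most of the five intervals map to it; over many iterations it is replayed far more often than the other two.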
Fig. 11. Number of steps per episode during the learning process.
3) Discount factor, i.e., the attenuation factor γ used in the Q-learning update;
4) Learning rate, i.e., the learning rate used for RMSProp (an optimizer of neural networks);
5) Initial exploration, i.e., the initial value of ε in the ε-greedy policy;
6) Final exploration, i.e., the final value of ε in the ε-greedy policy;
7) Epsilon increment, i.e., the per-frame step by which the initial value of ε is linearly annealed to its final value;
8) Replace target iteration, i.e., the number of iterations before the parameters of the Q target network are replaced.

Values for the hyper-parameters are listed in Table 2.

4.1. Application to 3-D discrete surface

4.1.1. Trajectory evolution
Fig. 9 illustrates three views of a representative level map as a hypothetical asteroid surface, which is used to approximate complex asteroid terrain. Approximate level maps for testing path planning algorithms may be created to be a conservative representation of the final operational environment. For a preliminary demonstration of the DQN path-planning algorithm, a level map with 30 × 30 cells is selected. In Fig. 9, light-green areas denote the sea level of the terrain. Warm-color cells correspond to areas higher than the sea-level plane. As the color deepens, the height of the corresponding area increases. The cool-color grid cells denote regions that are below the sea-level plane. As the cell color deepens, the height of the corresponding area decreases.
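Hyper-parameters 5-7 define a linear annealing schedule for ε: with the Table 2 values (initial 1, final 0.1, increment 0.00025), ε reaches its final value after (1 − 0.1)/0.00025 = 3600 frames. A minimal sketch (the function name is ours):

```python
def epsilon_at(frame, eps_init=1.0, eps_final=0.1, increment=0.00025):
    """Linearly anneal epsilon from eps_init down to eps_final,
    decreasing by `increment` per frame, then hold it constant."""
    return max(eps_final, eps_init - increment * frame)
```

This matches the schedule used in the experiments of Sect. 4.2, where ε is annealed from 1.0 to 0.1 over the first 3600 frames and fixed at 0.1 thereafter.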
Fig. 14. Hypothetical asteroid terrain samples A and B (axes are dimensionless).
Fig. 15. Contour map of the asteroid surface samples A and B (axes are dimensionless).
Next, the DQN path planning algorithm for hopping rovers is tested on a refined representation of the asteroid terrain. Fig. 14 shows two hypothetical 3-D terrains that are modelled using the conversion method described in Sect. 2.1. The difference between map levels is equal to the maximum height of the rover hopping movements. Thus, the resulting contour maps are displayed in Fig. 15, where each map is a grid of 30 × 30 cells.

Fig. 17. Optimal path for terrain B (axes are dimensionless).
Fig. 18. The accumulation of Qtarget for terrain A (left) and B (right).
Fig. 19. Number of steps per episode during the learning process for terrain A.
Fig. 20. Number of steps per episode during the learning process for terrain B.

4.2.1. Trajectory evolution
Figs. 16 and 17 show a discretized version of the terrain maps in Figs. 14 and 15. The color scale is interpreted as in the previous example. Light-green grid cells denote cells at the terrain sea level. Warm-color grid cells are higher than the sea level, while cold-color grid cells are below the sea level. The depth of the color is proportional to the absolute altitude value. The full lines are the optimal trajectories determined by the DQN algorithm. The blue and red circles are the respective target points [25.5, 27.5]T and [23.5, 3.5]T. The two starting points are [1.5, 1.5]T and [4.5, 27.5]T, respectively.

The same learning mechanism and neural network architecture of Sect. 4.1 are adopted in these two experiments. These preliminary results show that DRL may be sufficiently robust to guide the rover within a variety of rough asteroid terrains with adequate pre-training. In these experiments, the RMSProp optimizer [38] is used. During the learning process, the applied strategy is ε-greedy with ε annealed from 1.0 to 0.1 over the first 3600 frames, and fixed at 0.1 thereafter. In Fig. 18, the accumulation of Qtarget is shown for training on terrain A and terrain B.

Figs. 19 and 20 render the trajectory learning process of the DRL architecture as the number of steps per episode. The results show that during pre-training knowledge of the surface topology is acquired and that the DRL architecture may converge on an optimal trajectory. Overall, the trend in Fig. 19 is similar to that in Fig. 20. Differences in terrain topography and the inherent random nature of an ε-greedy policy are responsible for major variations in the early stages of the learning process. Occasional, large variations in the number of episode steps in middle and late learning stages may also be caused by bad
Fig. 21. Optimal paths for an increasing proportion of perturbed cells (axes are dimensionless).
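The robustness study summarized in Fig. 21 and Table 3 perturbs an increasing proportion of grid cells of the level map. A sketch of one such perturbation is given below; the sampling scheme (uniformly chosen cells, level shifts of at most ±1) is our assumption, as the paper does not specify how cells are re-levelled.

```python
import numpy as np

def perturb_levels(levels, fraction, rng, max_shift=1):
    """Randomly shift the level of a given fraction of grid cells by up to
    +/- max_shift levels, returning a perturbed copy of the level map."""
    out = levels.copy()
    n_cells = out.size
    n_perturb = int(round(fraction * n_cells))
    cells = rng.choice(n_cells, size=n_perturb, replace=False)  # distinct cells
    shifts = rng.integers(-max_shift, max_shift + 1, size=n_perturb)
    out.flat[cells] += shifts
    return out
```

For a 30 × 30 map, fraction = 0.04 perturbs 36 of the 900 cells; a policy pre-trained on the nominal map can then be re-evaluated on the perturbed one.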
Indicator 1: the total number of steps that the rover takes to move from the initial position to the target area in the first successful training episode.
Indicator 2: the time it takes for the rover to reach the target point from the initial position in the first successful training episode.
Indicator 3: the number of episodes required for the accumulated reward to stabilize.
As shown in Table 3, when the proportion of perturbed grid cells increases from 1% to 3%, the change of the terrain features may be negligible. Therefore, the variation of step number and time for optimization training is relatively small. When the proportion of perturbed grid cells is 4%, terrain variations become more important. The corresponding number of exploration steps increases by 50% relative to the unperturbed value. The exploration time of the first episode is also about 200 s longer than its original value. Introducing a small fraction of randomly perturbed cells in the 2-D map may not prevent the algorithm from identifying a nearly optimal trajectory, but the computational demand may increase.
The 3-D terrain shown in Fig. 14 is a simulated and idealized asteroid terrain. Much work remains to employ the current algorithm on-board an actual rover. One particular challenge is that the validity of the learned path-planning policy is limited by the accuracy of the terrain map utilized during training. The current algorithm learns a path-planning strategy over a static map (i.e., the terrain map is fixed for all training episodes). Therefore, the algorithm determines a nearly-optimal path that is specific to the terrain map provided and does not generalize to different maps autonomously. A natural extension of our work would be training the RL agent on a distribution of maps (i.e., at each episode the terrain map is drawn from a probability distribution). That might enable the RL agent to learn an adaptive path-planning strategy over the selected distribution of terrain maps. Another possible extension of this work involves a more realistic model of the rover's locomotion. A higher-fidelity model of the rover's locomotion may include an irregular gravity field, stochastic bouncing on highly irregular surfaces, active outgassing regions, and irregular step lengths.

5. Conclusion

In this paper, the path-planning problem for hopping rovers on asteroid surfaces with complex terrain is investigated using deep reinforcement learning. The equivalent conversion method reduces the dimension of the state variables by converting arbitrarily complex three-dimensional terrains into a reward matrix. An improved neural network based on the DQN architecture is developed to determine the optimal actions for the hopping rover. Three advanced architecture elements (i.e., DDQN, Dueling Network, and Prioritized Experience Replay) are combined to improve the performance of the DRL algorithm and make the learning more efficient. Offline pre-training for different terrain complexities is demonstrated. Results show a good generality of the DRL method. In a topographic map of 30 × 30 size, DQN ensures that the optimal path may be found successfully between any two points within approximately 9000 episodes. The robustness of the proposed method is evaluated by considering a static uncertainty of the asteroid terrain map with an error of up to 4% of the total number of map cells. Optimal paths are successfully obtained in the tested perturbed-map scenarios. Pre-trained DRL architectures may be a stepping stone in the development of hopping rovers for the autonomous exploration of asteroid surfaces.

Declaration of competing interest

We declare that we have no conflict of interest.

Acknowledgments

The authors from the Beijing Institute of Technology acknowledge the support by the National Natural Science Foundation of China (No. 11972075). All the authors appreciate the valuable suggestions made by the two reviewers of this paper.

References

[1] W.R. Wu, W.W. Liu, D. Qiao, D.G. Jie, Investigation on the development of deep space exploration, Sci. China Technol. Sci. 55 (4) (2012) 1086–1091.
[2] V. Carruba, D. Nesvorný, D. Vokrouhlický, Detection of the YORP effect for small asteroids in the Karin cluster, Astron. J. 151 (6) (2016) 164.
[3] T. Yoshimitsu, T. Kubota, I. Nakatani, T. Adachi, H. Saito, Micro-hopping robot for asteroid exploration, Acta Astronaut. 52 (2–6) (2003) 441–446.
[4] H. Cottin, J. Michelle Kotler, K. Bartik, H. James Cleaves II, C.S. Cockell, J.P. de Vera, P. Ehrenfreund, S. Leuko, I. Loes Ten Kate, Z. Martins, R. Pascal, R. Quinn, P. Rettberg, F. Westall, ..., Astrobiology and the possibility of life on Earth and elsewhere, Space Sci. Rev. 209 (1–4) (2017) 1–42.
[5] M. Martínez-Jiménez, C.E. Moyano-Cambero, J.M. Trigo-Rodríguez, J. Alonso-Azcárate, J. Llorca, Asteroid mining: mineral resources in undifferentiated bodies from the chemical composition of carbonaceous chondrites, Assessment and Mitigation of Asteroid Impact Hazards, Springer, Cham, 2017, pp. 73–101.
[6] L.D. Mathias, L.F. Wheeler, J.L. Dotson, A probabilistic asteroid impact risk model: assessment of sub-300 m impacts, Icarus 289 (2017) 106–119.
[7] P. Ehrenfreund, C. McKay, J.D. Rummel, B.H. Foing, C.R. Neal, T. Masson-Zwaan, M. Ansdell, N. Peter, J. Zarnecki, S. Mackwell, M. Antionetta Perino, L. Billings, J. Mankins, M. Race, Toward a global space exploration program: a stepping stone approach, Adv. Space Res. 49 (1) (2012) 2–48.
[8] S. Watanabe, Y. Tsuda, M. Yoshikawa, S. Tanaka, T. Saiki, S. Nakazawa, Hayabusa2 mission overview, Space Sci. Rev. 208 (1–4) (2017) 3–16.
[9] M. Crombie, S. Selznick, M. Loveridge, B. Rizk, D.N. DellaGiustina, D.R. Golish, B.J. Bos, P.R. Christensen, V.E. Hamilton, D. Reuter, A.A. Simon, O.S. Barnouin, M.G. Daly, R.C. Espiritu, B. Clark, M.C. Nolan, D.S. Lauretta, Schedule of Origins, Spectral Interpretation, Resource Identification, Security-Regolith Explorer data product releases to the Planetary Data System, AGU Fall Meeting Abstracts, 2018.
[10] Y. Takahashi, D.J. Scheeres, R.A. Werner, Surface gravity fields for asteroids and comets, J. Guid. Contr. Dynam. 36 (2) (2013) 362–374.
[11] M. Pavone, J.C. Castillo-Rogez, I.A.D. Nesnas, J.A. Hoffman, N.J. Strange, Spacecraft/rover hybrids for the exploration of small solar system bodies, 2013 IEEE Aerospace Conference, IEEE, 2013, pp. 1–11.
[12] R.P. Binzel, A.S. Rivkin, S.J. Bus, J.M. Sunshine, T.H. Burbine, MUSES-C target asteroid (25143) 1998 SF36: a reddened ordinary chondrite, Meteoritics Planet. Sci. 36 (8) (2001) 1167–1172.
[13] T.C. Duxbury, A.V. Zakharov, H. Hoffmann, E.A. Guinness, Spacecraft exploration of Phobos and Deimos, Planet. Space Sci. 102 (2014) 9–17.
[14] S. Zhou, J. Xi, M.W. McDaniel, T. Nishihata, P. Salesses, K. Iagnemma, Self-supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain, J. Field Robot. 29 (2) (2012) 277–297.
[15] R. Valencia-Murillo, N. Arana-Daniel, C. López-Franco, A.Y. Alanís, Rough terrain perception through geometric entities for robot navigation, 2nd International Conference on Advances in Computer Science and Engineering (CSE 2013), Atlantis Press, 2013.
[16] Y. Chen, G. Luo, Y. Mei, J. Yu, X. Su, UAV path planning using artificial potential field method updated by optimal control theory, Int. J. Syst. Sci. 47 (6) (2016) 1407–1420.
[17] A. Pandey, R. Kumar Sonkar, K. Kant Pandey, D.R. Parhi, Path planning navigation of mobile robot with obstacle avoidance using fuzzy logic controller, 2014 IEEE 8th International Conference on Intelligent Systems and Control (ISCO), IEEE, 2014.
[18] M. Elhoseny, A. Tharwat, A.E. Hassanien, Bezier curve based path planning in a dynamic field using modified genetic algorithm, J. Comput. Sci. 25 (2018) 339–350.
[19] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[21] F. Wang, J. Jason Zhang, X. Zheng, X. Wang, Y. Yuan, X. Dai, J. Zhang, Y. Yuan, Where does AlphaGo go: from Church-Turing thesis to AlphaGo thesis and beyond, IEEE/CAA J. Autom. Sinica 3 (2) (2016) 113–120.
[22] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
[23] D. Izzo, A survey on artificial intelligence trends in spacecraft guidance dynamics and control, Astrodynamics 3 (4) (2019) 287–299.
[24] B. Gaudet, R. Linares, R. Furfaro, Seeker based adaptive guidance via reinforcement meta-learning applied to asteroid close proximity operations, arXiv preprint arXiv:1907.06098, 2019.
[25] Y. Zhu, R. Mottaghi, E. Kolve, J.J. Lim, A. Gupta, L. Fei-Fei, A. Farhadi, Target-driven visual navigation in indoor scenes using deep reinforcement learning, 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017.
[26] H. Osmaston, Estimates of glacier equilibrium line altitudes by the Area × Altitude, the Area × Altitude Balance Ratio and the Area × Altitude Balance Index methods and their validation, Quat. Int. 138 (2005) 22–31.
[27] M. Tokic, Adaptive ε-greedy exploration in reinforcement learning based on value differences, Annual Conference on Artificial Intelligence, Springer, Berlin, Heidelberg, 2010.
[28] R. Kumar, B. Moseley, S. Vassilvitskii, A. Vattani, Fast greedy algorithms in MapReduce and streaming, ACM Trans. Parallel Comput. (TOPC) 2 (3) (2015) 14.
[29] C.J.C.H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, 1989.
[30] C.J.C.H. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[31] M.I. Jordan, D.E. Rumelhart, Forward models: supervised learning with a distal teacher, Cognit. Sci. 16 (3) (1992) 307–354.
[32] L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT 2010, Physica-Verlag HD, 2010, pp. 177–186.
[33] H. van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[34] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, N. de Freitas, Dueling network architectures for deep reinforcement learning, arXiv preprint arXiv:1511.06581, 2015.
[35] T. Schaul, J. Quan, I. Antonoglou, Prioritized experience replay, arXiv preprint arXiv:1511.05952, 2015.
[36] H. van Hasselt, Double Q-learning, Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
[37] K. Arulkumaran, M.P. Deisenroth, M. Brundage, A.A. Bharath, Deep reinforcement learning: a brief survey, IEEE Signal Process. Mag. 34 (6) (2017) 26–38.
[38] I. Bello, B. Zoph, V. Vasudevan, Q. Le, Neural optimizer search with reinforcement learning, Proceedings of the 34th International Conference on Machine Learning, vol. 70, JMLR, 2017, pp. 459–468.