
Acta Astronautica 186 (2021) 87–97


Review article

Survey of machine learning techniques in spacecraft control design


Maksim Shirobokov ∗, Sergey Trofimov, Mikhail Ovchinnikov
Moscow Center for Fundamental and Applied Mathematics, 4 Miusskaya Pl., Moscow 125047, Russia

ARTICLE INFO

Keywords:
Spacecraft control
Machine learning
Neural network
Supervised learning
Neurodynamics
Reinforcement learning

ABSTRACT

In this paper, a survey of machine learning techniques in spacecraft control design is given. Among the applications of machine learning on the subject are the design of optimal interplanetary trajectories, the synthesis of controllers to stabilize orbital or angular motion, formation control, and the design of control laws for landing on the surface of a celestial body. All the works are classified into two almost equal groups — supervised learning (stochastic and deterministic methods) and reinforcement learning (direct and value-based approaches). Stochastic supervised learning methods are characterized by stochastic optimization procedures, random initialization of neural network weights, and the stochastic nature of the obtained results. Deterministic methods are based on the Lyapunov theory; the network training is a deterministic process. The division of reinforcement learning methods into direct and value-based approaches is similar to the separation into direct and indirect methods in optimal control theory. We discuss the main ideas, advantages, and drawbacks of the techniques and give some recommendations for future investigations. We also highlight interesting ideas and approaches in the application of machine learning methods that can be used in a broad variety of astrodynamical problems.

∗ Corresponding author.
E-mail address: shirobokov@keldysh.ru (M. Shirobokov).

https://doi.org/10.1016/j.actaastro.2021.05.018
Received 8 September 2020; Received in revised form 20 April 2021; Accepted 7 May 2021
Available online 17 May 2021
0094-5765/© 2021 IAA. Published by Elsevier Ltd. All rights reserved.

1. Introduction

Nowadays, there is a strong interest in machine learning techniques and, more generally, in artificial intelligence. It is also clear that in some cases, these methods may provide researchers and engineers with good solutions that are difficult or even impossible to implement using classical approaches. Applications involve robotics, autonomous vehicles, healthcare, business, and marketing. Following the interest in machine learning methods and being inspired by their effectiveness, researchers have started paying attention to space technology as well. Many questions arise here: Are machine learning methods (especially, neural networks) actually necessary in spaceflight applications? Are they more effective than traditional approaches, and if ‘yes’, then how much and why? Are these methods sufficiently reliable?

In this paper, we consider a part of the many questions raised. We review some of the ideas and methods of machine learning that can be used to develop orbital and attitude spacecraft control. Currently, these methods are mainly based on artificial neural networks and help researchers and mission designers to find optimal trajectories and create adaptive feedback controllers that can potentially be used on board a spacecraft. However, such a classification of the studies appears to be inadequate since the number of papers on control design significantly prevails over those on the search for optimal trajectories. This indicates the high relevance of and interest in adaptive and autonomous controllers for spacecraft motion control.

At the same time, literature analysis demonstrates that all the works on the subject fall into two approximately equal groups: those that use methods of supervised learning [1] and those that exploit methods of reinforcement learning [2]. Mathematically speaking, supervised learning is a machine learning technique of approximating a map from input to output based on a set of examples of input–output pairs. For instance, one can pose the task of approximating the state–control mapping by a neural network based on state–control pairs calculated by some method that is able to provide each ‘state’ with the correct ‘control’. Note that supervised learning problems imply a training sample, even if it consists of a single example. This sample is always formed by some other reliable and ‘correct’ method. Supervised learning is mainly used in regression, approximation, classification, pattern recognition, and prediction problems.
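As a minimal illustration of this supervised formulation (our own sketch, not code from any of the reviewed papers), one can fit a small feedforward network to precomputed optimal state–control pairs; the training data and the mapping below are purely hypothetical placeholders.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training sample: each row of X is a spacecraft state,
# each row of U is the 'correct' control provided by a reliable solver
# (e.g., an indirect optimization method); here they are placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))          # states (position and velocity)
U = np.tanh(X[:, :3] - X[:, 3:])        # stand-in for precomputed optimal controls

# One-hidden-layer network approximating the state -> control map.
net = MLPRegressor(hidden_layer_sizes=(64,), activation='tanh',
                   max_iter=3000, random_state=0)
net.fit(X, U)

# The trained network now plays the role of a feedback control law.
u = net.predict(X[:1])                  # control suggested for a given state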


Reinforcement learning is a type of machine learning task that implies an agent interacting with an environment and receiving a scalar signal (reward) from the latter. The agent’s goal is to take actions in the environment in order to maximize the reward. Unlike supervised learning methods, in this case, the agent does not know the examples of correct actions; in other words, the reward signal is not supposed to give information about the correct actions to the agent. The agent should deduce the correct behavior by trial and error. Thus, there are no training samples in reinforcement learning, and the learning of correct answers is replaced by a search for the right actions through a rationally organized trial-and-error procedure. Reinforcement learning can be used for adaptive optimal control synthesis.

A quick literature review shows that machine learning methods in astrodynamics are applied for various purposes: creating an adaptive controller for the main thruster or a controller for the angular motion of the spacecraft with known or unknown inertial properties, controlling a spacecraft formation, landing on the surface of a celestial body, maintaining the spacecraft motion in complex gravitational fields, and making fast decisions under uncertain conditions. They also help when a preliminary analysis of the mission requires rapid consideration of multiple solutions and the selection of the best one. Moreover, they also assist global optimization methods in the search for global extrema of functions taking into account nonlinear constraints.

The most recent survey on artificial intelligence trends in spacecraft motion control is presented in [3]. The paper focuses on evolutionary optimization, tree searches, and machine learning and highlights synergies between these technologies that could provide researchers with new interesting results. Yet, the paper omits many of the classical studies on the application of machine learning techniques in astrodynamics, avoids Lyapunov-based approaches, lacks discussions and generalizations, and concentrates mainly on the authors’ contribution to the field.

In the present paper, we give a much more detailed survey of machine learning applications in spacecraft control, both the classical and the most recent ones, provide detailed discussions of some useful notions and interesting ideas that can be used to solve the problems, and develop some recommendations for future research. This work does not aim at reviewing all the studies on this subject and focuses mainly on peer-reviewed publications, specifically on those that solve exact problems rather than merely describe general concepts.

The paper consists of several sections. Section 2 focuses on the studies based on supervised learning methods for finding optimal trajectories and control. These are generally regression problems where the learning model is represented by neural networks. The supervised learning methods found here are classified into two categories. The first is classic ‘stochastic’ learning, i.e. learning based on stochastic optimization methods, random initialization of neural network weights, and the attempt to obtain the best possible model parameter values. Training performance is estimated by the results of model testing on a part of the sample. The second category covers deterministic learning techniques. In this case, training is a deterministic process where the target neural network weights are obtained by integrating a certain system of differential equations. In the literature, this is also called neurodynamics [4]. The network quality is then guaranteed by the results of the Lyapunov control theory.

Section 3 is entirely dedicated to reinforcement learning methods. They are commonly classified into two categories: policy-based and value-based methods. The first one consists of direct policy search, i.e. a search for a mapping from the state set to the action set. The second category includes the methods based on the value function and Bellman’s principle of optimality. Note that this classification is equivalent to the division into direct and indirect methods in the search for optimal control: under direct methods, the control function is discretized and the search for optimal control reduces to a nonlinear programming problem, while by indirect methods, one derives a control function satisfying Pontryagin’s maximum principle or Bellman’s principle. While Sections 2 and 3 contain mainly short descriptions of the methods, Section 4 comprises the detailed discussion and our recommendations. In Section 5, we place a list of the reviewed ideas provided with the corresponding bibliographical references for convenience. In the conclusion section, we sum up the results of the survey.

2. Supervised learning

This section assembles the studies on the search for optimal spacecraft control by supervised learning methods, which include regression, approximation, classification, pattern recognition, and prediction tasks. The section is divided into two parts: in the first one, the model is trained in a random (stochastic) way, while in the second part, learning is a nonrandom, deterministic process.

2.1. Stochastic learning

Let us begin the review with paper [5]. This study deals with adaptive neurocontrol for spacecraft attitude control. The controller includes two neural networks: one network predicts the spacecraft motion and the second one provides the control. The authors aim not only at generating nonlinear control but also at making it adaptive. This purpose is profoundly discussed in the paper.

Another study where the controller includes the predictive and control neural networks has been recently published [6]. This paper considers the problem of fuel-optimal attitude control for a 12U CubeSat using four attitude thrusters. For this purpose, three controller types are developed: (1) one based on the Lyapunov and sliding mode control, (2) one based on the Lyapunov and projective control, and (3) one using neural networks. The third controller includes two neural networks. One recurrent network predicts the spacecraft state at the end of flight as well as the target function value (as in the linear–quadratic regulator); the input variables are discrete (bang–bang control). The input for this network is sought in such a way as to hit the target and to minimize fuel consumption. The authors made an extensive comparison of the method to a simple sliding mode control and the projection-based control and demonstrated that the neural control is more fuel-efficient for detumbling maneuvers, with up to 25% fuel savings, or even 36% for finite-time slew motions.

The study [7] gives a brief overview of the attitude control system of the Solar and Heliospheric Observatory (SOHO) spacecraft [8]. The system includes six hydrazine thrusters controlled using a pulse width modulation scheme with a proportional–derivative controller (PD-controller). In cases where several thrusters are involved, there appears a disturbance torque, the direction and magnitude of which are uncertain. Therefore, it is desirable that the control system be adaptive. The author describes the general scheme of adaptation of these parameters and suggests using neural networks, as it is necessary to approximate some nonlinear mappings. For this purpose, the author introduces two neural networks: the predictive and the estimating. The PD-controller plays the role of the basic controller, while the neural networks are used only as assistants for complex pattern approximation and for the option ‘leave things as they are’. The study emphasizes that it is too early to use neural networks as basic controllers (at least in 1995), since otherwise it is required to prove the reliability of the system based only on neurocontrollers.

In [9], two studies on improving the performance of the pitch axis attitude control system are conducted. In the first study, the main controller is the linear–quadratic regulator, and the neural network is used as an assistant. The linear–quadratic controller does not take into account the errors in the inertial parameters of the spacecraft or the external disturbances while it provides a certain control for the system. The residual between the spacecraft’s target state and the actual one is used for neural network training.
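A minimal sketch of how such an assistant scheme can be arranged (our own illustration with placeholder dynamics and gains; [9] does not publish code, and the exact training target there is the target-state residual, whereas the sketch below uses the model-prediction residual as one possible realization of the idea):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical double-integrator pitch dynamics with an unmodeled disturbance.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.005],
              [0.1]])
K = np.array([[3.2, 2.5]])                     # placeholder LQR gain
disturbance = lambda x: np.array([0.0, 0.05 * np.sin(x[0])])

x = np.array([1.0, 0.0])
X_hist, err_hist = [], []
assistant = None

for step in range(400):
    u = (-K @ x).item()                        # baseline LQR control
    if assistant is not None:                  # NN compensation of the learned residual
        err_pred = assistant.predict(x.reshape(1, -1))[0]
        u -= (np.linalg.pinv(B) @ err_pred).item()
    x_pred = A @ x + B[:, 0] * u               # what the linear model expects
    x_next = x_pred + disturbance(x)           # what actually happens
    X_hist.append(x.copy())
    err_hist.append(x_next - x_pred)           # residual used as the training signal
    x = x_next
    if step == 199:                            # offline-style training halfway through
        assistant = MLPRegressor(hidden_layer_sizes=(16,), activation='tanh',
                                 max_iter=3000, random_state=0)
        assistant.fit(np.array(X_hist), np.array(err_hist))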


The second study introduces a linear plant uncertainty model and a linear external disturbance model. These linear models (with unknown coefficients) are interpreted as a single neural network without hidden layers. The search for the coefficients of such a model is equivalent to the search for network weights. They are obtained so that the residual between the prediction and the measurements is minimal. Thus, it is shown that neural networks can be trained not only in the traditional way, i.e. by designing the objective functional as the mean-square deviation of the network outputs from the correct answers, but also indirectly, through observing the system dynamics.

Neural network assistance to the controller also appears in [10]. To provide the spacecraft attitude control, the authors propose to use two instruments. The first one is a linear feedback controller which uses a simple model and therefore operates satisfactorily only in the linear approximation. The second instrument is a neural network that compensates for the output of the first instrument and corrects the control provided by the linear controller. It is known how the system should evolve in the simple model and what the spacecraft’s orientation will be in a short time, so the discrepancy in the state in the linear approximation can be transferred into a control correction. The latter is the ‘right answer’ or an ‘example’ for the neural network.

In [11], attitude control of a large sail with uncertainty in its flexibility is considered. Though neural networks are used in this paper only for sail parameter refinement and are not related directly to attitude control, they predict the value of a certain buffer angle, thereby excluding the errors in orientation after the maneuver. On the computational side, angle approximation by a neural network proved much more effective than the use of exact formulae.

In [12], the spacecraft attitude control problem for the purpose of high-precision observation of coronal mass ejections from the Sun in the Coronal Diagnostic Experiment (CODEX, [13]) mission is considered. The study has shown that using a neural network architecture in the pointing system significantly improves pointing performance as compared to the baseline proportional–integral–derivative controller. The developed algorithm is relatively simple to implement and incurs a smaller computational overhead.

Papers [14] and [15] contain many unusual and interesting ideas (see the Discussion section). These studies are devoted to the development of an onboard optimizer for low-thrust spacecraft trajectories. The authors are interested in developing an approximator for the optimal solution of the optimization problem. The problem of generating the optimal control is formulated in terms of Pontryagin’s maximum principle.

Similar ideas are applied in [16], where deep neural networks are used to approximate the gravitational field of asteroids and the developed control laws. Each control parameter is approximated by its own deep neural network. Simulation results of the time-optimal landing on 433 Eros are given to illustrate the effectiveness of the developed controller in solution optimality, terminal guidance accuracy, and robustness under uncertainties.

An indirect optimization method as a ‘teacher’ is also considered in [17], where the pinpoint landing problem is solved using Pontryagin’s maximum principle. The models are trained offline, and the elements of the training sample are generated by a random walk method in the input variable space. As a result, a curve appears in the input space, and for every point on that curve, the correct output has been found. It is interesting to note that while the curve is drawn via a continuation procedure, the optimization problem for designing optimal trajectories for every point of the curve is also solved by a continuation method. As a result, the sampling process turns out to be two-level: at the internal level, the continuation method solves the boundary-value problem arising from Pontryagin’s maximum principle, and at the external level, the space of input variables is covered through ‘small steps’. Verification of the results is carried out through tests.

In [18], deep neural networks assist in generating optimal low-thrust trajectories by providing intelligent initial guesses for the conjugate variables defined by Pontryagin’s maximum principle. Transfer problems with energy-optimal and fuel-optimal performance indices are considered. The network sizes, learning rates, and activation functions are carefully selected based on the results obtained on a test set.

In [19], the authors investigate the developed autonomous low-thrust controller to cope with missed thrust events. Deep neural networks approximate the state–action mapping; the training sample pairs are generated by using Pontryagin’s maximum principle and a simple direct shooting method. The results show that the developed neural networks autonomously correct trajectories for the majority of the missed thrust events and that the final cost is on average less than 3% higher than that for the optimal trajectory.

The approximation of optimal control in low-thrust orbital motion is studied profoundly in the recently published paper [20]. The authors consider three types of optimal trajectories: time-optimal low-thrust transfers, fuel-optimal low-thrust transfers, and optimal $J_2$-perturbed multi-impulse near-Earth trajectories. The peculiarity of this study in comparison to the other ones is that the method parameters (number of neurons, activation functions, batch size, optimization method, learning rate, and even the input data type used — Cartesian coordinates or orbital elements) are also optimized, i.e. they are treated as optimization variables in some external optimization procedure. The best parameter sets are determined both by exhaustive search and by other methods, for instance, using the tree-structured Parzen estimator [21]. Different parameter sets are also compared.

In [22], the guidance for Earth–Venus fuel-optimal low-thrust transfers is represented by deep neural networks. To create the training data, the authors proposed a novel method called ‘‘backward generation of optimal examples’’ to train the networks on massively large trajectory databases. Several schemes to train representations of either the optimal policy (thrust profile) or the value function (optimal fuel mass) are proposed and tested. It is stated that the best-performing networks predict the optimal thrust value and direction within an error of 5% and 1 degree, respectively.

In [23], a learning-based optimal control method is developed that provides the onboard law for the powered descent problem. Deep neural networks are used to approximate the map between the initial state of the spacecraft and the so-called critical parameters (switching times for the bang–bang control, duration of the powered descent phase, and some other parameters). As stated, one of the main results is that the method requires less training data compared with the existing learning-based methods.

In [24], an algorithm for image-based powered descent guidance via deep learning is proposed. The images are the inputs to convolutional neural networks that output the command acceleration. The networks are trained within the supervised learning framework and applied to the problem of lunar landing.

Finally, in [25] an adaptive neurocontroller is designed for formation flying in low-Earth near-circular orbits. Two spacecraft are considered: one is controllable and the other is not. The controllable spacecraft is able to change its cross section within known limits, as well as apply velocity impulses in any direction. Its goal is to keep the projected circular relative orbit. The ballistic coefficient of the uncontrollable spacecraft is unknown, so an adaptation to its value is required. Two neural networks are presented: one for approximating the area-to-mass ratio and the other for approximating the optimal velocity impulses. The controllable spacecraft adapts to the unknown ballistic coefficient of the uncontrollable spacecraft in real time. The idea is to make the unknown ballistic coefficient an input to the neural networks.
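A minimal sketch of this last idea (our own illustration, not code from [25]; all data and mappings are synthetic placeholders): an estimator network supplies the unknown ballistic coefficient, which is simply appended to the input of the control network, so that adaptation reduces to updating that single input.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Hypothetical training data: relative state (4 values), the true ballistic
# coefficient of the uncontrolled spacecraft, and the corresponding impulse.
rel_state = rng.normal(size=(2000, 4))
ballistic = rng.uniform(0.01, 0.05, size=(2000, 1))
impulse = 0.1 * rel_state[:, :2] * ballistic           # placeholder 'optimal' impulses

# Estimator network: observed drift (a stand-in signal) -> ballistic coefficient.
drift_obs = ballistic + 0.001 * rng.normal(size=ballistic.shape)
estimator = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0).fit(drift_obs, ballistic.ravel())

# Control network: [relative state, ballistic coefficient] -> velocity impulse.
controller = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                          random_state=0).fit(np.hstack([rel_state, ballistic]), impulse)

# In flight, the current estimate is appended to the controller input.
b_hat = estimator.predict(drift_obs[:1])
dv = controller.predict(np.hstack([rel_state[:1], b_hat.reshape(1, 1)]))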


2.2. Deterministic learning

One of the applications of deterministic learning can be found in [26]. The aim of the study is to design a nonlinear controller to provide the spacecraft formation with submillimeter-accuracy control. In this paper, a formation keeping problem rather than a station keeping problem is considered. The authors do not linearize the equations of motion near any trajectory; the method proposed is not bound to any specific type of orbital motion and is described for the most general case. The control action consists of three terms: the linear–quadratic regulator, a compensator of the difference between the motion obtained due to the first term and the motion that is prescribed to the second spacecraft while maintaining the formation, and the neurocontroller used for real-time compensation of the disturbances, which are assumed to be uniformly bounded. The neural network weights are adapted according to a certain system of differential equations obtained by the Lyapunov function method and by considering the weights as free parameters that can be changed in real time [27]. In this study, the neurocontroller is an assistant to conventional controllers.

In [28], a neurocontroller is developed to correct the attitude control of a spacecraft with uncertain properties such as mass, inertial characteristics, and misalignment of gyroscope axes. The nonlinear feedback controller serves as a basis while the neural network is its assistant which adaptively estimates and takes into account the disturbances caused by inaccurate simulation of uncertain system parameters. Neural network weights are corrected according to a certain differential equation system, and the system stability is derived from the Lyapunov stability theory. Unlike the previous study, the neurocontroller synthesis contains explicit assumptions on the functions used as well as formulations and proofs of lemmas and theorems which guarantee proper operation of the system. A strict analysis of the system stability is also performed.

In [29], the neural network assists the main controller, which is based on sliding mode control, to compensate for model errors and external disturbances in formation control. Neural network weights are adjusted according to a certain system of differential equations and converge exponentially quickly to the optimal values. The optimality is understood in the sense that the deviation of the neural network values from the function estimated is uniformly bounded by a small number. The system stability is analyzed using the Lyapunov stability theory. The authors note that, for sliding mode control, it is especially important to use the model error compensator, since precise sliding mode control is impossible without estimating the disturbances; otherwise, chattering occurs and the tracking error increases.

In the above-mentioned studies, the neurocontrollers act as assistants for the main controllers. In [30] and [31], the neurocontrollers play the role of the main controllers. In [30], the controller is used for stabilization of the angular motion of an underactuated spacecraft with unknown dynamics. As in the other studies, the neural network weights are adjusted by the numerical integration of a differential equation system; the system stability is derived strictly in the framework of certain assumptions, and the tracking error converges exponentially fast to some neighborhood of zero.

Paper [31] develops a neurocontroller for attitude control of a spacecraft using gyroscopes. The neural network compensates not only for unmodeled external disturbances but also for the disturbances caused by the gyroscopes. The neural network estimate of the external disturbances is used in the control law to compensate for their effects. The controller adapts to time-dependent inertial parameters of the spacecraft, the inertia matrix uncertainty, and input torque uncertainty due to electromechanical uncertainties in the gimbal servo loop gains and rotor inertias. The authors emphasize that learning can be conducted online (in flight mode) without preliminary offline training.

The idea of using a neural network to estimate unmodeled disturbances while entering the atmosphere of Mars is proposed in [32]. A controller based on sliding mode control was chosen as the main one. One of the terms of the corresponding control function is represented by the external disturbances, which are assumed to be compensated. These disturbances are estimated online using a neural network. The neural network parameters are adjusted according to a certain differential equation system. As a result of the synthesis of the sliding mode control and the disturbance-estimating neural network, a stable closed-loop control system appears. The system robustness is also investigated numerically through a run of Monte-Carlo simulations.

3. Reinforcement learning

The present section assembles the studies considering spacecraft control in terms of decision theory. The controlled system is considered as an agent interacting with the environment and receiving scalar reward signals. In these problems, there is no teacher to suggest correct actions in specific states to the agent. Instead, the agent ought to deduce the desirable actions by trial and error and through interaction with the environment to maximize the total reward. The mapping from the state space to the action space is called the ‘agent policy’; the agent ought to find the policy optimizing the total reward (for details see [2]). There are two classes of methods that can solve this problem. Firstly, there are methods wherein the agent policy is parameterized in some way, and then the task of obtaining the strategy is reduced to the search for the optimal values of these parameters. Secondly, there are methods based on solving the Bellman equation and the value or critic function, a state function that assigns a value to each state within the framework of a given policy (how valuable it is to be in a given state from the point of view of the subsequent reward, acting within the current policy). The optimal policy is defined as a greedy policy with respect to this value function. The examples of application of reinforcement learning methods are therefore classed under two corresponding categories.

3.1. Policy-based methods

Let us begin with the studies done by Dachwald and his colleagues. Their studies [33–36] combine machine learning methods and evolutionary programming, i.e. the use of genetic algorithms for the optimization of the target functions appearing in learning.

In [33], a solution of the problem of optimal interplanetary trajectory design for a spacecraft equipped with a solar sail is proposed. It is shown that neural networks together with evolutionary algorithms (evolutionary neurocontrol) provide a solution closer to the global optimum than those obtained by traditional methods.

Assume that the policy (mapping from state to action) is parameterized using a neural network whose parameters are optimized by a genetic algorithm. For each fixed policy, there is a certain spacecraft trajectory and a value of the chosen objective functional that depends on the flight time and on the phase state residuals at the end of flight. In terms of the genetic algorithm, this is a fitness function. In terms of reinforcement learning, it is the negated reward that the agent (the spacecraft) receives at the end of the flight. Thus, the genetic algorithm is a method of solving the reinforcement learning problem (a minimal sketch of this scheme is given below).

Note also that in [33], the agent receives the reward only at the end of its motion, so training the agent in flight mode is impossible; it ought to be performed offline before the flight. This is a common situation in reinforcement learning.

The approach was realized in the InTrance (Intelligent Trajectory optimization using neurocontroller evolution) software mentioned in [34], where it is used for low-thrust trajectory optimization. As in the previous study, the author emphasizes that his approach in some cases provides better solutions than the ones obtained earlier by traditional methods.

The results of the method’s implementation are also given in [35]. In that paper, a solution for the Global Trajectory Optimization Competition (GTOC 2005) problem is described. The competition task was to eliminate the asteroid hazard by colliding the spacecraft with the asteroid at the highest possible speed and with the lowest possible fuel consumption. The authors managed to solve the problem without the use of gravitational maneuvers, which could potentially reduce the fuel consumption. According to the authors, at that time, their method was not suitable for designing trajectories with gravitational maneuvers and general multiphase flight trajectories.
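To make the evolutionary neurocontrol scheme described above concrete, here is a minimal sketch (our own illustration with a toy one-dimensional ‘trajectory’; the implementations in [33–35] use full orbital dynamics and more elaborate genetic operators). The policy is a small network whose flattened weight vector is evolved so that the terminal-state residual, i.e. the negated reward, is minimized.

import numpy as np

rng = np.random.default_rng(0)
N_STATE, N_HIDDEN = 2, 8
N_PARAMS = N_STATE * N_HIDDEN + N_HIDDEN + N_HIDDEN + 1   # weights of a tiny 2-8-1 network

def policy(params, x):
    """Map the state x to a scalar control using a one-hidden-layer network."""
    w1 = params[:N_STATE * N_HIDDEN].reshape(N_STATE, N_HIDDEN)
    b1 = params[N_STATE * N_HIDDEN:N_STATE * N_HIDDEN + N_HIDDEN]
    w2 = params[-N_HIDDEN - 1:-1]
    b2 = params[-1]
    return np.tanh(np.tanh(x @ w1 + b1) @ w2 + b2)

def fitness(params, x_target=np.array([1.0, 0.0])):
    """Simulate a toy double integrator; return the terminal residual (negated reward)."""
    x = np.zeros(2)
    for _ in range(50):
        u = policy(params, x)
        x = x + 0.1 * np.array([x[1], u])      # placeholder dynamics
    return np.linalg.norm(x - x_target)

# Simple generational evolution of the policy parameters.
population = rng.normal(size=(40, N_PARAMS))
for generation in range(200):
    scores = np.array([fitness(p) for p in population])
    parents = population[np.argsort(scores)[:10]]                 # selection
    children = parents[rng.integers(0, 10, size=30)] + \
               0.1 * rng.normal(size=(30, N_PARAMS))              # mutation
    population = np.vstack([parents, children])

best = population[np.argmin([fitness(p) for p in population])]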


This limitation of the method was removed afterwards in [36], where a problem of gravity-assisted trajectory design for a spacecraft with a low-thrust engine is considered. In this case, the input of the policy neural network receives not only the spacecraft’s current state vector $\boldsymbol{x}_{sc}$ and the target state vector $\boldsymbol{x}_{T}$, as it was in [33], but also the state vector $\boldsymbol{x}_{pl}$ of the planet and the difference $\boldsymbol{x}_{sc} - \boldsymbol{x}_{pl}$ between the spacecraft’s state vector and the planet’s one. A noteworthy detail is that feeding the difference $\boldsymbol{x}_{sc} - \boldsymbol{x}_{pl}$ to the neural network seems redundant since the vectors $\boldsymbol{x}_{sc}$ and $\boldsymbol{x}_{pl}$ are sent to the input separately. Though, technically, there is no need for transferring $\boldsymbol{x}_{sc} - \boldsymbol{x}_{pl}$, it sometimes speeds up the convergence of the training procedure.

Similarly to [33–35] and [36], the reinforcement learning approach is described in [37], where precision formation control is designed. The authors propose to use two neural networks: the first network gives the velocity of a spacecraft based on the relative positions between the spacecraft, and the second network gives the angular velocity for a spacecraft based on its attitude. These networks are implemented for each spacecraft, and they are identical. This approach was implemented on the Synchronized Position Hold Engage Reorient Experimental Satellites (SPHERES) platform [38] and tested on board the International Space Station. The tests showed millimeter and microradian accuracy for position and orientation, respectively, although no strict facts to guarantee the performance of this system have been presented.

The issue of lacking performance guarantees for neurocontrollers is tackled in [39]. The test problem of time-optimal transfer between two near-Earth orbits is solved. Instead of using a direct or indirect optimization method, a stabilizing Lyapunov control is designed. The Lyapunov function is a weighted sum of squared deviations of the current state (represented in equinoctial elements) from the target state. These weights can be chosen constant, but in this case, the solutions obtained are not necessarily optimal. Therefore, the weights are proposed to be optimized and interpreted as functions of the current state, target state, and spacecraft mass. These functions are modeled as standard neural networks with one hidden layer. The network weights are found such that the flight time is optimal, and they are constant. The optimization problem that arises here is solved by the particle swarm optimization method. As a result, an asymptotically stable system is obtained. This system cannot learn online since the reward (the fitness function) is derived at the end of the flight.

In [40], the proximal policy optimization (PPO) method [41] is used to find a feedback control law for docking maneuvers. As stated, the reinforcement learning approach is expected to be robust with respect to the initial conditions, perturbations, noise, and uncertainties of the dynamics. The method is compared with the standard optimal control method using the GPOPS-II software suite [42]. The comparison demonstrates that the reinforcement learning policy sub-optimally approximates the optimal (obtained by GPOPS-II) solution. The benefit of the approximation is that the policy is quickly implementable in a closed-loop fashion.

In [43], the Multi-Reward Proximal Policy Optimization (MRPPO) reinforcement learning method is developed to reconstruct the Pareto front in a multi-criteria transfer optimization problem. The essence of the approach is that state–action–state transitions are independent of the control law and can be shared between multiple reinforcement learning agents, thus accelerating the generation of optimal trajectories for different objectives and constraints.

Paper [44] seems to be the first study in which reinforcement learning is used for landing on the surface of a celestial body (Mars, in this case). The problem is interpreted in terms of a Markov decision process, and a policy mapping the spacecraft state into control actions is developed. The policy is approximated via a neural network. The agent receives the reward at the end of the flight, and it is formed by the residual between the real and target state vectors. Hence, only offline learning is possible.

The study in [45] proposes an algorithm for pinpoint landing on the surface of a celestial body, such as the Moon or Mars. The authors report that the algorithms of landing on Mars usually involve the calculation of the nominal trajectory on board the spacecraft. For landing, the authors propose to use the previously developed Multiple Surface Sliding Guidance (MSSG) method, which does not require a nominal descent trajectory. This method has a number of parameters that are optimized by reinforcement learning methods. Note that the ‘action’ here is not a control action but the set of MSSG method parameters. These parameters are not interpreted as state functions; instead, they are fixed and do not change during the flight. As a result, the parameters are optimized to maximize the average reward received by the spacecraft at the end of the flight. It appears, in particular, that the learning is possible only offline, before the flight, though the authors state in the conclusion that the method can be generalized and the MSSG method parameters can be made adaptive.

In [46], a joint guidance and control system of a Mars landing module is proposed. The mapping from the spacecraft state space to the space of control actions is obtained using the PPO approach. The results of Monte Carlo simulation tests show that the control is able to achieve high landing accuracy (several meters in position and about 10 cm/s in velocity when descending from an altitude of approximately 2500 m) and to be resistant to noise and parameter uncertainties. The authors also pay attention to possible implementation of the algorithms on board the spacecraft. Mapping an observation to an action takes less than 1 ms on a 2.3 GHz processor or approximately 23 ms on a 100 MHz flight processor. This is acceptable considering that the navigation system provides updates to the guidance system every 0.2 s. The authors note that a further speed-up would be possible by coding the guidance and control system using the fast TensorFlow Python API designed in particular for use on various devices.

Finally, in [47], an adaptive guidance strategy around asteroids by means of reinforcement meta-learning [48–52] is developed. It is assumed that the lander is equipped with an optical seeker that targets a terrain feature or an active beacon; the control policy maps the sensor output to actuators directly. High-fidelity tests show that the developed system is able to provide landing on an asteroid with pinpoint accuracy. The policy adapts in real time both to environmental forces acting on the agent and to internal disturbances such as actuator failure and center of mass variation.

3.2. Value-based methods

In [53], a sample mission around a celestial body is considered, and the problem of maximizing the number of scientific tasks taking into account constraints and failure probabilities is stated. The paper considers a sample science mission where the spacecraft collects data from celestial objects viewable only within a certain true anomaly window. Scientific data collection requires the spacecraft to slew its instruments toward each target and continue pointing in the direction of the target while orbiting. The problem is defined in terms of the Markov decision process and is solved by a subclass of reinforcement learning methods known as approximate dynamic programming (see the many bibliographical references in [53]). Both the set of states and the set of actions are finite.

In [54], autonomous navigation and pointing control for a CubeSat operating in close proximity to a binary asteroid system is proposed. The control maximizes the mission scientific return with autonomous decision making of the next best image acquisition time. The general problem is formulated in terms of the Partially Observable Markov Decision Process, and the policy is optimized by the Neural Fitted Q [55] and the Deep Q-Network (DQN) [56] algorithms.
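For reference, the value-based methods mentioned here rely on the standard action-value formulation (textbook form, see [2,56]; the notation below is generic and not specific to any of the reviewed papers):

\begin{align*}
Q^{\pi}(s,a) &= \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a,\; \pi\right],\\
\pi^{*}(s) &= \arg\max_{a} Q^{*}(s,a),\\
L(\theta) &= \mathbb{E}\!\left[\Bigl(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\Bigr)^{2}\right],
\end{align*}

where $\gamma$ is the discount factor, $\theta$ are the weights of the Q-network, and $\theta^{-}$ are the weights of a periodically updated target network.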


The authors of [57] apply the actor–critic approach [2] and approximate dynamic programming to a problem of attitude control of an object with unknown properties. There are two spacecraft, one of them approaching the other and docking to it. The inertial properties of the second spacecraft are unknown, and it is necessary for the whole system to follow a given angular trajectory (uncooperative docking). The authors propose a control function consisting of two parts: one of them is the prescribed performance control method [58], and the second one is based on adaptive control. The strength of the study is that, under the stated assumptions, the authors prove that the control system evolves as required, even taking into account bounded perturbations. In addition, it is proved that the adaptability and robustness of the system increase when this adaptive controller is enabled.

In [59], reinforcement learning methods are used in the problem of shape reconstruction of an unknown and uncooperative space object by image processing. The objective of the reinforcement learning agent is to control a spacecraft around the uncooperative object to get images of the target and obtain the highest map quality taking into account the visibility conditions of the target. Both on-policy and off-policy advantage actor–critic methods are exploited and compared to a value-based method (DQN). The results obtained are promising, though more extensive training is needed to facilitate the policy generalization properties.

The reinforcement learning framework used in [60] aims at developing an optimal closed-loop feedback-driven control law to solve low-thrust many-revolution trajectory design problems. For that, the authors take a Lyapunov-based method and make its parameters state-dependent and approximated by neural networks. The search for transfers from the geostationary transfer orbit to the geostationary orbit is performed, and the stability and robustness of the designed controller are demonstrated. The authors exploit the actor–critic approach and extreme learning machines [61,62] to update the weights of the critic.

In [63], a guidance strategy for spacecraft proximity operations based on deep reinforcement learning methods is proposed. The essence of the study is that the reinforcement learning agent is not responsible for the entire guidance, navigation, and control process; it provides the desired spacecraft velocity to some controller that is able to handle dynamic uncertainties and modeling errors. Thus, the authors address the problem of accommodating the uncertainties of the real world by the reinforcement learning agent. The learning method used is the distributed distributional deep deterministic policy gradient (D4PG) [64].

Paper [65] is mostly focused on developing a comprehensive tool for generating trajectories in the Earth–Moon system with the help of reinforcement learning methods. Essentially, dynamical structures (periodic orbits and the related invariant manifolds) are defined and discretized, i.e. they are represented as a finite number of points. A transfer in the Earth–Moon system is interpreted as a sequence of transfers (possibly powered arcs) between these points. One needs only to calculate the fuel consumption for the transfers between them taking into account the requirements of the particular mission, as well as to obtain the optimal graph path. The authors propose to use the Heuristically Accelerated Reinforcement Learning (HARL) algorithm [66] and compare its performance with the Dijkstra algorithm. It is assumed that the output of the HARL algorithm can be used as an initial approximation for subsequent more accurate optimization procedures.

In [67], the authors investigate deep reinforcement learning methods in a problem of propulsionless planar phasing in the Earth’s atmosphere. Two methods are studied — the advantage actor–critic (A2C) [68] and the PPO. The importance of the novel dynamics propagator and the choice of the reward function are discussed.

Let us conclude the subsection with the studies on the synthesis of planetary landing control.

The study in [69] is focused on combining the entry and powered descent stages of a Mars landing mission, since they are usually separated and the overall obtained solution appears to be not optimal. Since the effectiveness of the powered descent stage depends on the terminal conditions of the entry stage, it is proposed to optimize these terminal conditions. This leads to a two-level optimization procedure: for each terminal condition of the entry stage, optimization problems are solved separately at the entry and powered descent stages, the total costs are calculated, and these costs are optimized at a higher level of optimization. The terminal conditions of the entry stage play the role of the optimized variables of the high-level optimization. Low-level optimization problems are solved by pseudospectral methods (the hp-adaptive pseudospectral method, [70]). High-level optimization problems are formulated as reinforcement learning ones. The results show that the joint optimization saves several tons of fuel.

In [71], the authors apply the reinforcement meta-learning approach; the value function and the policy are represented in the form of recurrent neural networks (see the learning scheme in Fig. 1). As before, the mapping from the spacecraft state space to the space of control actions is under training. Learning is performed in offline mode, though the controlled system is able to adapt in real-time mode to changing errors of spacecraft state determination, to variations in the spacecraft mass estimate, and to external disturbances. The methods proposed in this paper are verified both for Mars and asteroid landing, even under conditions of incomplete observability and possible engine failure. Note that this study demonstrates that the control method proposed by the authors surpasses the conventional fuel-optimal feedback guidance algorithm developed independently by Battin [72] and D’Souza [73].

In [74], the standard ZEM/ZEV algorithm [75] used for planetary landing is generalized and made adaptive. The idea is to learn the ZEM/ZEV method parameters during the powered descent phase to satisfy constraints while maintaining fuel quasi-optimality. This is done by an actor–critic reinforcement learning algorithm and by using extreme learning machine neural networks. The resulting adaptive algorithm improves the performance of the original algorithm in terms of fuel consumption and allows path constraints to be incorporated directly into the guidance law.

In [76], a reinforcement learning framework for identifying safe landing locations is proposed. The learned model evaluates and selects landing sites with consideration of the terrain features, the quality of future observations, and control in order to achieve a robust landing trajectory. The learning is done by the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm [77]. The results of simulations in a lunar landing scenario are demonstrated.

4. Discussion and recommendations

In this section, we list and discuss the common ideas and approaches in the above-mentioned studies. Their advantages and critical issues are emphasized, and some recommendations are given at the end of the section.

4.1. The predictive and control networks

While analyzing the literature, we found that the introduction of two neural networks – the predictive one and the control one – is a natural and common idea for neurocontroller design. In our opinion, the reason for this is as follows. Assume that the spacecraft is moving under uncertainty, and the target spacecraft state and the actual one are given. The residual between these states may be due to inaccurate knowledge of the spacecraft/environment model or inaccurate/incorrect control actions. For the control to be adaptive, this residual should be translated into a correction of the control actions. The authors of [5] propose approximating the dynamics by a predictive neural network and translating the output residual into the input one. This idea works in offline mode (before the flight) at the stage of network design. When implemented in real motion, it experiences some difficulties. The authors correctly note that in real-time mode, the predictive network should also be adjusted — and much more often than the control network. To solve this problem, the authors suggest using the known linear adaptive controller.
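A minimal sketch of this residual-translation idea (our own illustration; [5] does not publish code): given a predictive model of the dynamics, the state residual can be pushed back to a control correction by adjusting the control input until the predicted state matches the target. Here the predictive model is a placeholder function standing in for a trained network, and the correction is found by a simple finite-difference descent.

import numpy as np

def predict(x, u):
    """Placeholder one-step predictive model (in practice, a trained network)."""
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.005], [0.1]])
    return A @ x + B[:, 0] * u

def control_correction(x, u_nominal, x_target, iters=100, lr=2.0, eps=1e-5):
    """Translate the predicted state residual into a correction of the control input."""
    u = float(u_nominal)
    for _ in range(iters):
        r0 = x_target - predict(x, u)
        r1 = x_target - predict(x, u + eps)
        grad = (r1 @ r1 - r0 @ r0) / eps       # finite-difference gradient of the residual norm
        u -= lr * grad
    return u

x = np.array([0.2, -0.1])
u_corrected = control_correction(x, u_nominal=0.0, x_target=np.zeros(2))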


Fig. 1. Training of the actor and critic networks for adaptive landing on celestial bodies [71].

Finally, the authors mention that the control developed should be verified for stability. Note that for stochastic supervised learning, this problem is a centerpiece of many studies. In general, the result of stochastic learning depends on the random realization of the initial values of the neural network parameters. Also, the training sample may not be sufficiently representative to obtain a good enough approximator in the form of a neural network. Note that this early paper already states the reliability issue of models trained by stochastic methods, which in our opinion is rather serious.

We see this approach also in [10]. This method is criticized, as at an early stage of learning the neural networks are trained on non-representative examples of motion that have not become regular yet. Besides, the control neural network is trained on predictions provided by the predictive one, which may not be accurate at the early stage of learning. Thus, the generalization capability of the control neural network is based on the experience gained in improper regions of the space of input variables.

4.2. Individual networks

The authors of [14] and [15] introduce neural networks for approximating the initial values of the conjugate variables, and each conjugate variable has its own neural network. Note that this is not the same as having a single neural network whose output is a vector containing all the conjugate variables. Each neural network can have its own architecture at the user’s convenience.

4.3. Indirect optimization loop for training

Another idea [14,15] is that the neural network outputs go to the input of the indirect optimization method, which takes them as an initial guess and generates the improved solutions. The latter are used to obtain the residual between the neural network output and the true solution (see the schematic diagram in Fig. 2). Thus, the indirect optimization method acts here as a ‘teacher’ while the neural network plays the role of a ‘student’ studying in real time. During the learning process, the output of the student improves, and the subsequent guesses for the optimization method become closer to the true value. Thus, sampling and training are not two separate processes (as is typically the case), but a single one.

4.4. Control modification when near the target

In addition, it is noted in both papers that while approaching the target state of the spacecraft, the control should be modified. In [14], this modification consists of strengthening the neural network input, whereas in [15], it is proposed to consider the control as a weighted sum of what the neural network gives and some nonlinear feedback control.

4.5. Supervised/reinforcement learning

From our point of view, in [15], the learning method is mistakenly called ‘reinforcement learning’. Actually, the technique proposed by the authors is supervised learning since the learning scheme contains an explicitly defined teacher (the indirect optimization block) and learning is based not on a scalar signal but on the discrepancy between the answer provided by the neural network and the ‘correct’ answer.

4.6. Combinations of functions are approximated better than individual functions

Another interesting observation is given in [14]. It turns out that, despite the fact that the conjugate variables are hardly approximated by neural networks, their combinations, which give the direct control function, are approximated with high accuracy. Therefore, the spacecraft control is carried out not by the approximate values of the conjugate variables but by the approximate control function. The approximate conjugate variables are used only for training the networks.

4.7. Neural controllers as assistants

Note that the use of stochastic learning methods for designing assistants seems to be a more reasonable idea (if not the only possible one) than creating neurocontrollers whose stability or reliability is not strictly proven. Though the performance of neural networks is verified both during and after training (on random subsets of the sample), it is still not safe to say that there are no inputs on which the neural network behaves inadequately. This problem disappears if neural networks are treated only as secondary controllers which propose but do not insist on their behavior. Precisely this view of the neural network application on board a real spacecraft is presented in [7] and in some other studies (for instance, to improve the efficiency of spacecraft attitude control using the gyroscope [9]).

4.8. The co-operative approach

In [10], the authors apply the so-called co-operative approach and list its advantages. One of the advantages is that the control is based on a proven and good method which incorporates the major complexity of the task; the neural network needs only to slightly adjust the control. In addition, the linear controller maintains the spacecraft trajectory at the early stages of the neural network training when it is not sufficiently trained yet.


Fig. 2. Closing a neural network with a regular optimization method [14,15].
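A minimal sketch of the loop shown in Fig. 2 (our own illustration; the interfaces and the toy boundary-value residual below are hypothetical placeholders): the network proposes an initial guess, an indirect optimization routine refines it, and the refined solution is immediately used as a new training example, so that sampling and training form a single process.

import numpy as np
from scipy.optimize import fsolve
from sklearn.neural_network import MLPRegressor

def shooting_residual(costate0, target):
    """Placeholder boundary-value residual of an 'indirect' formulation."""
    return np.tanh(costate0) + 0.3 * costate0 - target

student = MLPRegressor(hidden_layer_sizes=(32,), activation='tanh',
                       warm_start=True, max_iter=500, random_state=0)
inputs, answers = [], []

rng = np.random.default_rng(0)
for k in range(200):
    target = rng.uniform(-1.0, 1.0, size=2)              # new problem instance
    if k < 10:
        guess = np.zeros(2)                              # cold start
    else:
        guess = student.predict(target.reshape(1, -1))[0]  # student's initial guess
    # 'Teacher': the indirect method refines the guess to the true solution.
    solution = fsolve(shooting_residual, guess, args=(target,))
    inputs.append(target)
    answers.append(solution)
    # Student learns from the teacher; a real implementation would update incrementally.
    student.fit(np.array(inputs), np.array(answers))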

4.9. Supervised learning followed by reinforcement learning

Note that in [44], the agent is trained first on the examples by standard methods (supervised learning), and then in the reinforcement learning regime. This approach accelerates convergence since reinforcement learning in complex systems may require considerable time. Supervised learning provides a good initial approximation for the policy. However, this type of learning still gives unsatisfactory output, as at the stage of training, the researcher should compromise between the accuracy of a neural network and its generalization capability. Such a neural network may generate outliers for some input values. Therefore, after the supervised learning regime, offline reinforcement learning starts in a model that includes the uncertainty of some parameters. For this purpose, the neural network weights are optimized by a non-gradient optimization method. As the authors note, the use of non-gradient methods instead of gradient ones is quite reasonable because the reward does not depend on a single action but on their sequence; moreover, the relation between the reward received and the weights of the neural network that approximates the policy turns out to be rather complicated. In addition, non-gradient optimization methods are usually less likely to ‘get stuck’ in local minima than gradient ones. After training, the algorithm proposed by the authors can be applied in real flight mode.

4.10. Model-based reinforcement learning method

A useful aspect of the study in [53] is that it describes in detail the Markov decision process and the spaces of states and actions, and formalizes the model of the environment in which the agent operates: the transition probabilities between states are explicitly specified. In general, an explicit introduction of the model is an advantage of this work since training with explicitly given models using dynamic programming methods is usually faster than training without a model. On the other hand, it is necessary to have good guesses about the transition probability values.

Another advantage of this study is that the agent may be trained both offline and online. If the spaces of states and actions are too large, the agent can be trained offline. Online learning may be necessary when it is difficult or impossible to load on board the spacecraft an optimal policy calculated on the ground.

4.11. Reinforcement meta-learning

A few years ago, an approach to accelerate online learning was developed and called reinforcement meta-learning or meta-reinforcement learning; descriptions and implementations of the meta-learning technique can be found in [48–52]. Meta-learning can be considered as an implementation of the idea of learning to learn. The agent learns to act effectively not just in one task, but in a series of tasks. Thus, when facing a new task that has never been encountered during training (e.g., in the online mode), it will still be familiar with the situation and learn faster. Recently, this approach has been applied in [47,71,78].

4.12. Extreme learning machines

It is also worth highlighting an interesting technique, the extreme learning machine, for training neural networks [61,62]. The essence of the idea is that for one-hidden-layer feedforward neural networks it is possible to replace computationally heavy training procedures by a single computation of the inverse of some matrix. The authors of the method state its fast performance, good generalization properties, absence of the local minima problem, and the possibility to work with non-differentiable activation functions.
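A minimal sketch of the extreme learning machine idea (our own illustration based on the description in [61,62]; the regression data are placeholders): the input-to-hidden weights are random and fixed, and only the hidden-to-output weights are computed, in a single step, via the Moore–Penrose pseudo-inverse.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: inputs X and targets Y.
X = rng.uniform(-1.0, 1.0, size=(500, 3))
Y = np.sin(X[:, :1]) + X[:, 1:2] * X[:, 2:3]

n_hidden = 100
W_in = rng.normal(size=(3, n_hidden))      # random, fixed input weights
b_in = rng.normal(size=n_hidden)           # random, fixed biases

H = np.tanh(X @ W_in + b_in)               # hidden-layer activations
beta = np.linalg.pinv(H) @ Y               # output weights: single pseudo-inverse

Y_pred = np.tanh(X @ W_in + b_in) @ beta   # the trained one-hidden-layer network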
Another advantage of this study is that the agent may be trained 10. supervised learning to obtain the initial approximation for rein-
both offline and online. If the spaces of states and actions are too large, forcement learning methods [44],
the agent can be trained offline. Online learning may be necessary when 11. reinforcement meta-learning [47,71,78],
it is difficult or impossible to load on board the spacecraft an optimal 12. application of the extreme learning machines [60,74].
policy calculated on ground.
4.14. Recommendations

4.11. Reinforcement meta-learning

A few years ago, an approach to accelerating online learning was developed, called reinforcement meta-learning or meta-reinforcement learning; descriptions and implementations of the meta-learning technique can be found in [48–52]. Meta-learning can be considered as an implementation of the idea of learning to learn. The agent learns to act effectively not just in one task, but in a series of tasks. Thus, when facing a new task that has never been encountered during training (e.g., in the online mode), it will still be familiar with the situation and will learn faster. Recently, this approach has been applied in [47,71,78].
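The learning-to-learn idea itself is not tied to reinforcement learning. The following Python sketch shows a Reptile-style first-order meta-update over a family of toy sine-fitting tasks; it illustrates only the outer loop over a distribution of tasks and is not the meta-RL machinery of [48–52].

    # Minimal sketch of 'learning to learn': a Reptile-style meta-update
    # over a family of related toy regression tasks (sine fitting).
    import numpy as np

    rng = np.random.default_rng(2)
    n_hid = 32
    meta = [rng.standard_normal((n_hid, 1)) * 0.5, np.zeros(n_hid),
            rng.standard_normal((1, n_hid)) * 0.5, np.zeros(1)]

    def forward(p, x):                       # one-hidden-layer regressor
        W1, b1, W2, b2 = p
        h = np.tanh(W1 @ x[None, :] + b1[:, None])
        return (W2 @ h + b2[:, None]).ravel(), h

    def inner_sgd(p, x, y, lr=0.01, steps=20):   # task-specific adaptation
        p = [w.copy() for w in p]
        for _ in range(steps):
            W1, b1, W2, b2 = p
            pred, h = forward(p, x)
            err = pred - y                   # gradients below are d(MSE)/dθ up to a factor of 2
            gW2 = err[None, :] @ h.T / len(x)
            gb2 = np.array([err.mean()])
            dh = (W2.T @ err[None, :]) * (1 - h ** 2)
            gW1 = dh @ x[:, None] / len(x)
            gb1 = dh.mean(axis=1)
            p = [W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2]
        return p

    for _ in range(1000):                    # outer (meta) loop over sampled tasks
        amp, phase = rng.uniform(0.5, 2.0), rng.uniform(0, np.pi)
        x = rng.uniform(-3, 3, 32)
        y = amp * np.sin(x + phase)          # a freshly sampled task
        adapted = inner_sgd(meta, x, y)
        meta = [m + 0.1 * (a - m) for m, a in zip(meta, adapted)]  # Reptile step
    # After meta-training, 'meta' adapts to a new sine task in only a few inner steps.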
4.12. Extreme learning machines

It is also worth highlighting another interesting technique for training neural networks, the extreme learning machine [61,62]. The essence of the idea is that for one-hidden-layer feedforward neural networks it is possible to replace computationally heavy training procedures by a single computation of the (pseudo-)inverse of a matrix. The authors of the method report fast performance, good generalization properties, the absence of the local minima problem, and the possibility to work with non-differentiable activation functions.
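A minimal Python sketch of the extreme learning machine on a toy regression problem is given below (the data and layer size are illustrative assumptions): the hidden layer is random and fixed, and the output weights are obtained in a single step from the Moore–Penrose pseudo-inverse.

    # Minimal sketch of an extreme learning machine [61,62] on toy data.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(-3, 3, 200)[:, None]          # inputs, shape (N, 1)
    y = np.sin(2 * x) + 0.05 * rng.standard_normal(x.shape)

    n_hidden = 50
    W = rng.standard_normal((1, n_hidden))        # random input weights (fixed)
    b = rng.standard_normal(n_hidden)             # random hidden biases (fixed)
    H = np.tanh(x @ W + b)                        # hidden-layer output matrix

    beta = np.linalg.pinv(H) @ y                  # output weights in one step
    y_hat = H @ beta
    print("training RMSE:", float(np.sqrt(np.mean((y_hat - y) ** 2))))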

4.13. List of machine learning approaches in space flight control

For the reader’s convenience, we group numerous interesting machine learning ideas and approaches proposed in application to space flight control and supplement them with references:

1. the control and predictive neural networks in a control loop [5,6],
2. adding a teacher for a neural network in a control loop; the teacher corrects the output of the neural network, while the network supplies initial guesses to the teacher [14,15],
3. the use of an individual neural network for each approximated value instead of a single neural network for all values [14,15],
4. some inputs to a neural network are functions of other inputs [36],
5. adaptation to unknown model parameters as optimization of the inputs to the control neural network [25],
6. sampling by random walk [17],
7. deterministic learning, neurodynamics, the Lyapunov method [26,28–32,39],
8. neuroevolution [33–37],
9. the Lyapunov method in conjunction with reinforcement learning [39],
10. supervised learning to obtain the initial approximation for reinforcement learning methods [44],
11. reinforcement meta-learning [47,71,78],
12. application of the extreme learning machines [60,74].

4.14. Recommendations

Now, let us suggest some recommendations for the use of machine learning methods in control problems of spaceflight mechanics.

Unfortunately, there are a large number of studies that omit the stability analysis of the developed neural networks for spacecraft control, in particular, their input data sensitivity. Sometimes, the choice of neural networks instead of any other control law approximators is explained by the networks’ useful properties (good generalization abilities, robustness, the ability to approximate any continuous function, etc.), but these properties are not actually demonstrated. Most commonly, a neural network trained by stochastic methods is verified on random tests, which cannot guarantee that there is no input for which the network behaves inappropriately for the problem being solved. Such verification may be omitted only in cases where neural networks play the role of assistants to the main controller or are used in the optimal trajectory search procedure. If one plans to use neural networks on board the spacecraft in real-time mode, it is recommended to consider the synthesis of the neurocontroller with the help of the Lyapunov theory (deterministic learning).

There is another important issue specific to astrodynamics applications. In all cases of neural network design, the stability of neural network operation should be verified in the event of failure of one or more neurons. For instance, this situation can happen when a micrometeoroid or radiation damages memory devices on board the spacecraft so that a number of weights are no longer available or take random values. Unfortunately, such verification is almost absent from the published studies. Therefore, it is recommended to pay attention to proving and demonstrating the stability of neural network operation and to move from model problems to realistic ones. Recommendations and some technical aspects of implementation can be found in [79–81].
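One possible form of such a verification is sketched below in Python: a random subset of the weights of a stand-in neurocontroller is corrupted, as if the corresponding memory cells had failed, and the resulting closed-loop trajectory is compared with the nominal one. The plant, the controller, and all numbers are illustrative assumptions rather than a setup taken from the cited studies.

    # Minimal sketch: Monte Carlo check of a controller network under
    # weight failures (stuck-at-random-value memory cells).
    import numpy as np

    rng = np.random.default_rng(4)
    W1, b1 = rng.standard_normal((8, 2)) * 0.3, np.zeros(8)
    W2, b2 = rng.standard_normal((1, 8)) * 0.3, np.zeros(1)

    def control(params, x):
        W1, b1, W2, b2 = params
        return float(W2 @ np.tanh(W1 @ x + b1) + b2)

    def simulate(params, dt=0.1, steps=200):
        x, traj = np.array([1.0, 0.0]), []
        for _ in range(steps):
            u = np.clip(control(params, x), -1, 1)
            x = x + dt * np.array([x[1], -0.5 * x[0] + u])   # toy plant
            traj.append(x.copy())
        return np.array(traj)

    nominal = simulate([W1, b1, W2, b2])
    for trial in range(5):                  # sample a few failure patterns
        Wf = W1.copy()
        idx = rng.integers(0, Wf.size, size=2)            # two damaged weights
        Wf.flat[idx] = rng.uniform(-1, 1, size=2)         # stuck at random values
        damaged = simulate([Wf, b1, W2, b2])
        drift = np.max(np.linalg.norm(damaged - nominal, axis=1))
        print(f"failure pattern {trial}: max state deviation = {drift:.3f}")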
Further, there are few studies in which an offline-trained neural network is retrained in real flight mode and adapts to unmodeled external disturbances or model uncertainties. For this, deterministic learning methods are suitable, as are reinforcement learning methods in which the agent receives rewards not only at the end of the flight but throughout it.
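A minimal Python sketch of such in-flight adaptation is given below: an offline-trained network keeps its hidden layer fixed and updates only its output weights from the instantaneous regulation error while an unmodeled disturbance acts. The plant, the adaptation law, and the gains are illustrative assumptions.

    # Minimal sketch: online adaptation of the output layer of a
    # neurocontroller from the instantaneous regulation error.
    import numpy as np

    rng = np.random.default_rng(5)
    W1, b1 = rng.standard_normal((16, 2)) * 0.5, np.zeros(16)
    w_out = np.zeros(16)                      # 'offline-trained' output weights

    def features(x):                          # fixed random hidden layer
        return np.tanh(W1 @ x + b1)

    x, dt, eta = np.array([1.0, 0.0]), 0.05, 0.5
    for k in range(400):
        phi = features(x)
        u = w_out @ phi - 2.0 * x[0] - 1.5 * x[1]     # baseline PD law + NN term
        d = 0.4 * np.sin(0.05 * k)                    # unmodeled disturbance
        x = x + dt * np.array([x[1], u + d])
        e = x[0]                                      # regulation error
        w_out -= eta * dt * e * phi                   # online adaptation step
    print("final state:", x, " |w_out| =", float(np.linalg.norm(w_out)))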
Over the last years, a significant number of papers devoted to the application of reinforcement learning methods have appeared. We found these studies highly interesting and important since, even being model-free, in some cases they are able to compete with classic model-based methods [46]. It is also desirable that the program code be made publicly available (at least for reproducing the results)1 for active experience sharing between investigators. Code sharing is a critical requirement if one wants to fairly compare the results with those obtained by other methods or other implementations of the same method [82].

1 For example, see Brian Gaudet’s repository: https://github.com/Aerospace-AI.

It is also worth paying attention to other forms of bioinspired control systems on board the spacecraft, which are used, for example, for angular motion control [83]. The cornerstone here is the development of a mathematical model of the nervous system which would be both practically useful and grounded in biological ideas about organisms. Neural-like networks (not to be confused with traditional artificial neural networks) underlie the control system, and their topology is optimized by genetic algorithms. Note also that ‘rough’ control systems based on traditional algorithms act here as assistants for training the neural-like system of adaptive control.

5. Conclusion

The conducted review leads to the following conclusions. Machine learning methods appear in attempts to solve diverse well-known spacecraft control problems. Among those are the design of optimal interplanetary trajectories, the synthesis of controllers to stabilize orbital or angular motion, formation control, and, especially recently, the design of control laws for landing on the surface of a celestial body. Whatever the problem is, the authors propose various models of the control system (usually based on the use of neural networks) which can be trained, i.e., adjust their parameters according to the incoming information. Two ways of learning are possible, based on either incoming behavior examples (supervised learning) or incoming reinforcement signals (reinforcement learning).

We infer that learning methods are truly useful for creating intelligent controllers and searching for the optimal control. Some of the studies show by computer simulations that this sort of methods can be more effective than traditional control theory approaches and sufficiently reliable. The reason behind this is that new powerful learning algorithms have recently appeared which make it possible to train controllers’ models in stochastic environments and naturally take into account perturbations, uncertainties, and failures in control. Of course, much remains to be done.

Taking into account all the reviewed studies, we have proposed some recommendations for future work. These recommendations mostly concern the need for analysis of the stability and reliability properties of neural networks since these properties are crucial in space applications. It is worth moving from consideration of model problems to realistic ones by considering real-life requirements on both the hardware and the software on board the spacecraft. It is also interesting to investigate more deeply the cases in which model-free reinforcement learning methods give better solutions than well-known model-based approaches since this could assist in developing brand new methods with far-reaching consequences.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by Moscow Center of Fundamental and Applied Mathematics, Agreement with the Ministry of Science and Higher Education of the Russian Federation, No. 075-15-2019-1623.

References

[1] S.J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2010.
[2] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, 2018.
[3] D. Izzo, M. Märtens, B. Pan, A survey on artificial intelligence trends in spacecraft guidance dynamics and control, Astrodynamics 3 (4) (2019) 287–299, http://dx.doi.org/10.1007/s42064-018-0053-6.
[4] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.
[5] K. KrishnaKumar, S. Rickard, S. Bartholomew, Adaptive neuro-control for spacecraft attitude control, Neurocomputing 9 (2) (1995) 131–148, http://dx.doi.org/10.1016/0925-2312(94)00062-W.
[6] J.D. Biggs, H. Fournier, Neural-network-based optimal attitude control using four impulsive thrusters, J. Guid. Control Dyn. 43 (2) (2020) 299–309, http://dx.doi.org/10.2514/1.G004226.
[7] W.A. Wright, Stochastic tuning of a spacecraft controller using neural networks, Eng. Appl. Artif. Intell. 8 (6) (1995) 651–656, http://dx.doi.org/10.1016/0952-1976(95)00043-7.
[8] V. Domingo, B. Fleck, A.I. Poland, SOHO: The solar and heliospheric observatory, Space Sci. Rev. 72 (1995) 81–84, http://dx.doi.org/10.1007/BF00768758.
[9] R.R. Kumar, H. Seywald, S.M. Deshpande, Z. Rahman, Artificial neural networks in space station optimal attitude control, Acta Astronaut. 35 (2–3) (1995) 107–117, http://dx.doi.org/10.1016/0094-5765(94)00153-D.
[10] B. Apolloni, F. Battini, C. Lucisano, A co-operating neural approach for spacecrafts attitude control, Neurocomputing 16 (4) (1997) 279–307, http://dx.doi.org/10.1016/S0925-2312(97)00035-0.
[11] O. Eldad, E.G. Lightsey, C. Claudel, Minimum-time attitude control of deformable solar sails with model uncertainty, J. Spacecr. Rockets 54 (4) (2017) 863–870, http://dx.doi.org/10.2514/1.A33713.
[12] P. Galchenko, H. Pernicka, S.N. Balakrishnan, Pointing system design for the coronal diagnostic experiment (CODEX) using a modified state observer and a neural network controller, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[13] K. Cho, S. Bong, S. Choi, H. Yang, J. Kim, J. Baek, J. Park, E. Lim, R.-S. Kim, S. Kim, Y.-H. Kim, Y. Park, S. Clarke, J. Davila, N. Gopalswamy, V. Nakariakov, B. Li, R. Pinto, Toward a next generation solar coronagraph: Development of a compact diagnostic coronagraph on the ISS, J. Korean Astronom. Soc. 50 (5) (2017) 139–149, http://dx.doi.org/10.5303/JKAS.2017.50.5.139.
[14] L. Cheng, Z. Wang, F. Jiang, C. Zhou, Real-time optimal control for spacecraft orbit transfer via multiscale deep neural networks, IEEE Trans. Aerosp. Electron. Syst. 55 (5) (2018) 2436–2450, http://dx.doi.org/10.1109/taes.2018.2889571.
[15] L. Cheng, Z. Wang, F. Jiang, Real-time control for fuel-optimal Moon landing based on an interactive deep reinforcement learning algorithm, Astrodynamics 3 (4) (2019) 375–386, http://dx.doi.org/10.1007/s42064-018-0052-2.
[16] L. Cheng, Z. Wang, Y. Song, F. Jiang, Real-time optimal control for irregular asteroid landings using deep neural networks, Acta Astronaut. 170 (2020) 66–79, http://dx.doi.org/10.1016/j.actaastro.2019.11.039.
[17] C. Sánchez-Sánchez, D. Izzo, Real-time optimal control via deep neural networks: Study on landing problems, J. Guid. Control Dyn. 41 (5) (2018) 1122–1135, http://dx.doi.org/10.2514/1.G002357.
[18] S. Yin, J. Li, L. Cheng, Low-thrust spacecraft trajectory optimization via a DNN-based method, Adv. Space Res. 66 (7) (2020) 1635–1646, http://dx.doi.org/10.1016/j.asr.2020.05.046.
[19] A. Rubinsztejn, R. Sood, F.E. Laipert, Neural network optimal control in astrodynamics: Application to the missed thrust problem, Acta Astronaut. 176 (2020) 192–203, http://dx.doi.org/10.1016/j.actaastro.2020.05.027.
[20] H. Li, S. Chen, D. Izzo, H. Baoyin, Deep networks as approximators of optimal low-thrust and multi-impulse cost in multitarget missions, Acta Astronaut. 166 (2020) 469–481, http://dx.doi.org/10.1016/j.actaastro.2019.09.023.
[21] J.S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, Curran Associates Inc., 2011.
[22] D. Izzo, E. Öztürk, Real-time guidance for low-thrust transfers using deep neural networks, J. Guid. Control Dyn. 44 (2) (2021) http://dx.doi.org/10.2514/1.G005254.
[23] S. You, C. Wan, R. Dai, J.R. Rea, Learning-based onboard guidance for fuel-optimal powered descent, J. Guid. Control Dyn. 44 (3) (2021) http://dx.doi.org/10.2514/1.G004928.
[24] L. Ghilardi, A. D’Ambrosio, A. Scorsoglio, R. Furfaro, R. Linares, F. Curti, Image-based optimal powered descent guidance via deep recurrent imitation learning, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[25] M. Shirobokov, S. Trofimov, Adaptive neural formation-keeping control in low-earth orbits, Cosm. Res. (2021) in press.
[26] P. Gurfil, M. Idan, N.J. Kasdin, Adaptive neural control of deep-space formation flying, J. Guid. Control Dyn. 26 (3) (2003) 491–501, http://dx.doi.org/10.2514/2.5072.
[27] F.L. Lewis, A. Yesildirek, K. Liu, Multilayer neural-net robot controller with guaranteed tracking performance, IEEE Trans. Neural Netw. 7 (2) (1996) 388–399, http://dx.doi.org/10.1109/72.485674.
[28] H. Leeghim, Y. Choi, H. Bang, Adaptive attitude control of spacecraft using neural networks, Acta Astronaut. 64 (7–8) (2009) 778–786, http://dx.doi.org/10.1016/j.actaastro.2008.12.004.
[29] J. Bae, Y. Kim, Adaptive controller design for spacecraft formation flying using sliding mode controller and neural networks, J. Frank. Inst. 349 (2) (2012) 578–603, http://dx.doi.org/10.1016/j.jfranklin.2011.08.009.
[30] W. Zeng, Q. Wang, Learning from adaptive neural network control of an underactuated rigid spacecraft, Neurocomputing 168 (2015) 690–697, http://dx.doi.org/10.1016/j.neucom.2015.05.055.
[31] W. MacKunis, F. Leve, P.M. Patre, N. Fitz-Coy, W.E. Dixon, Adaptive neural network-based satellite attitude control in the presence of CMG uncertainty, Aero. Sci. Technol. 54 (2016) 218–228, http://dx.doi.org/10.1016/j.ast.2016.04.022.
[32] S. Li, X. Jiang, RBF neural network based second-order sliding mode guidance for Mars entry under uncertainties, Aero. Sci. Technol. 43 (2015) 226–235, http://dx.doi.org/10.1016/j.ast.2015.03.006.
[33] B. Dachwald, Optimization of interplanetary solar sailcraft trajectories using evolutionary neurocontrol, J. Guid. Control Dyn. 27 (1) (2004) 66–72, http://dx.doi.org/10.2514/1.9286.
[34] B. Dachwald, Optimization of very-low-thrust trajectories using evolutionary neurocontrol, Acta Astronaut. 57 (2–8) (2005) 175–185, http://dx.doi.org/10.1016/j.actaastro.2005.03.004.
[35] B. Dachwald, A. Ohndorf, 1st ACT global trajectory optimisation competition: Results found at DLR, Acta Astronaut. 61 (9) (2007) 742–752, http://dx.doi.org/10.1016/j.actaastro.2007.03.011.
[36] I. Carnelli, B. Dachwald, M. Vasile, Evolutionary neurocontrol: A novel method for low-thrust gravity-assist trajectory optimization, J. Guid. Control Dyn. 32 (2) (2009) 616–625, http://dx.doi.org/10.2514/1.32633.
[37] D. Izzo, L.F. Simões, G.C.H.E. de Croon, An evolutionary robotics approach for the distributed control of satellite formations, Evol. Intell. 7 (2014) 107–118, http://dx.doi.org/10.1007/s12065-014-0111-9.
[38] S. Mohan, A. Saenz-Otero, S. Nolet, D.W. Miller, S. Sell, SPHERES flight operations testing and execution, Acta Astronaut. 65 (7–8) (2009) 1121–1132, http://dx.doi.org/10.1016/j.actaastro.2009.03.039.
[39] Y. Dalin, X. Bo, G. Youtao, Optimal strategy for low-thrust spiral trajectories using Lyapunov-based guidance, Adv. Space Res. 56 (5) (2015) 865–878, http://dx.doi.org/10.1016/j.asr.2015.05.030.
[40] C.E. Oestreich, R. Linares, R. Gondhalekar, Autonomous six-degree-of-freedom spacecraft docking maneuvers via reinforcement learning, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347, 2017.
[42] M.A. Patterson, A.V. Rao, GPOPS-II: A MATLAB software for solving multiple-phase optimal control problems using hp-adaptive Gaussian quadrature collocation methods and sparse nonlinear programming, ACM Trans. Math. Software 41 (1) (2014) http://dx.doi.org/10.1145/2558904.
[43] C.J. Sullivan, N. Bosanac, Using multi-objective deep reinforcement learning to uncover a Pareto front in multi-body trajectory design, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[44] B. Gaudet, R. Furfaro, Adaptive pinpoint and fuel efficient mars landing using reinforcement learning, IEEE/CAA J. Autom. Sin. 1 (4) (2014) 397–411, http://dx.doi.org/10.1109/JAS.2014.7004667.
[45] R. Furfaro, D.R. Wibben, B. Gaudet, J. Simo, Terminal multiple surface sliding guidance for planetary landing: Development, tuning and optimization via reinforcement learning, J. Astronaut. Sci. 62 (2015) 73–99, http://dx.doi.org/10.1007/s40295-015-0045-1.
[46] B. Gaudet, R. Linares, R. Furfaro, Deep reinforcement learning for six degree-of-freedom planetary landing, Adv. Space Res. 65 (7) (2020) 1723–1741, http://dx.doi.org/10.1016/j.asr.2019.12.030.
[47] B. Gaudet, R. Linares, R. Furfaro, Terminal adaptive guidance via reinforcement meta-learning: Applications to autonomous asteroid close-proximity operations, Acta Astronaut. 171 (2020) 1–13, http://dx.doi.org/10.1016/j.actaastro.2020.02.036.
[48] Y. Duan, J. Schulman, X. Chen, P.L. Bartlett, I. Sutskever, P. Abbeel, RL²: Fast reinforcement learning via slow reinforcement learning, arXiv preprint arXiv:1611.02779, 2016.
[49] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, arXiv preprint arXiv:1703.03400, 2017.
[50] N. Mishra, M. Rohaninejad, X. Chen, P. Abbeel, A simple neural attentive meta-learner, arXiv preprint arXiv:1707.03141, 2017.
[51] K. Frans, J. Ho, X. Chen, P. Abbeel, J. Schulman, Meta learning shared hierarchies, arXiv preprint arXiv:1710.09767, 2017.
[52] J.X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J.Z. Leibo, R. Munos, C. Blundell, D. Kumaran, M. Botvinick, Learning to reinforcement learn, arXiv preprint arXiv:1611.05763, 2016.
[53] A. Nasir, E.M. Atkins, I. Kolmanovsky, Robust science-optimal spacecraft control for circular orbit missions, IEEE Trans. Syst. Man Cybern.: Syst. 50 (3) (2017) 923–934, http://dx.doi.org/10.1109/tsmc.2017.2767077.
[54] M. Piccinin, G. Zanotti, S. Silvestrini, A. Capannolo, A. Pasquale, M. Lavagna, Cubesat exploration missions to binary asteroids: on board autonomy and intelligent imaging towards science return enhancement, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[55] M. Riedmiller, 10 steps and some tricks to set up neural reinforcement controllers, in: G. Montavon, G.B. Orr, K.R. Müller (Eds.), Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, Vol. 7700, Springer, Berlin, Heidelberg, http://dx.doi.org/10.1007/978-3-642-35289-8_39.
[56] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533, http://dx.doi.org/10.1038/nature14236.
[57] C. Wei, J. Luo, H. Dai, Z. Bian, J. Yuan, Learning-based adaptive prescribed performance control of postcapture space robot-target combination without inertia identifications, Acta Astronaut. 146 (2018) 228–242, http://dx.doi.org/10.1016/j.actaastro.2018.03.007.
[58] C.P. Bechlioulis, G.A. Rovithakis, Robust adaptive control of feedback linearizable MIMO nonlinear systems with prescribed performance, IEEE Trans. Automat. Control 53 (9) (2008) 2090–2099.
[59] A. Brandonisio, M. Lavagna, D. Guzzetti, Deep reinforcement learning to enhance fly-around guidance for uncooperative space objects smart imaging, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[60] H. Holt, R. Armellin, N. Baresi, A. Scorsoglio, R. Furfaro, Low-thrust trajectory design using state-dependent closed-loop control laws and reinforcement learning, in: AIAA/AAS Astrodynamics Specialist Conference, 2020.
[61] G.-B. Huang, Q.-Y. Zhu, Ch.-Kh. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (1–3) (2006) 489–501, http://dx.doi.org/10.1016/j.neucom.2005.12.126.
[62] G.-B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2011) 107–122, http://dx.doi.org/10.1007/s13042-011-0019-y.
[63] K. Hovell, S. Ulrich, Deep reinforcement learning for spacecraft proximity operations guidance, J. Spacecr. Rockets 58 (2) (2021) http://dx.doi.org/10.2514/1.A34838.
[64] G. Barth-Maron, M.W. Hoffman, D. Budden, W. Dabney, D. Horgan, T.B. Dhruva, A. Muldal, N. Heess, T. Lillicrap, Distributed distributional deterministic policy gradients, in: International Conference on Learning Representations, Vancouver, Canada, 2018.
[65] A. Das-Stuart, K.C. Howell, D.C. Folta, Rapid trajectory design in complex environments enabled by reinforcement learning and graph search strategies, Acta Astronaut. 171 (2020) 172–195, http://dx.doi.org/10.1016/j.actaastro.2019.04.037.
[66] R.A.C. Bianchi, C.H.C. Ribeiro, A.H.R. Costa, Accelerating autonomous learning by using heuristic selection of actions, J. Heuristics 14 (2008) 135–168, http://dx.doi.org/10.1007/s10732-007-9031-5.
[67] B. Smith, R. Abay, J. Abbey, S. Balage, M. Brown, R. Boyce, Propulsionless planar phasing of multiple satellites using deep reinforcement learning, Adv. Space Res. 67 (11) (2021) 3667–3682, http://dx.doi.org/10.1016/j.asr.2020.09.025.
[68] T. Degris, P.M. Pilarski, R.S. Sutton, Model-free reinforcement learning with continuous action in practice, in: American Control Conference, 2012.
[69] X. Jiang, S. Li, R. Furfaro, Integrated guidance for mars entry and powered descent using reinforcement learning and pseudospectral method, Acta Astronaut. 163 (Part B) (2019) 114–129, http://dx.doi.org/10.1016/j.actaastro.2018.12.033.
[70] C.L. Darby, W.W. Hager, A.V. Rao, Direct trajectory optimization using a variable low-order adaptive pseudospectral method, J. Spacecr. Rockets 48 (3) (2011) 433–445, http://dx.doi.org/10.2514/1.52136.
[71] B. Gaudet, R. Linares, R. Furfaro, Adaptive guidance and integrated navigation with reinforcement meta-learning, Acta Astronaut. 169 (2020) 180–190, http://dx.doi.org/10.1016/j.actaastro.2020.01.007.
[72] R.H. Battin, An Introduction to the Mathematics and Methods of Astrodynamics, American Institute of Aeronautics and Astronautics, Inc., Reston, 1999, pp. 558–561.
[73] C.S. D’Souza, An optimal guidance law for planetary landing, in: Guidance, Navigation, and Control Conference, AIAA Paper 1997-3709, 1997, http://dx.doi.org/10.2514/6.1997-3709.
[74] R. Furfaro, A. Scorsoglio, R. Linares, M. Massari, Adaptive generalized ZEM-ZEV feedback guidance for planetary landing via a deep reinforcement learning approach, Acta Astronaut. 171 (2020) 156–171, http://dx.doi.org/10.1016/j.actaastro.2020.02.051.
[75] Y. Guo, M. Hawkins, B. Wie, Applications of generalized zero-effort-miss/zero-effort-velocity feedback guidance algorithm, J. Guid. Control Dyn. 36 (3) (2013) 810–820, http://dx.doi.org/10.2514/1.58099.
[76] K. Iiyama, K. Tomita, B.A. Jagatia, T. Nakagawa, K. Ho, Deep reinforcement learning for safe landing site selection with concurrent consideration of divert maneuvers, in: AAS/AIAA Astrodynamics Specialist Conference, 2020.
[77] S. Fujimoto, H. Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: International Conference on Machine Learning, 2018.
[78] A. Scorsoglio, A. D’Ambrosio, L. Ghilardi, R. Furfaro, B. Gaudet, R. Linares, F. Curti, Safe lunar landing via images: A reinforcement meta-learning application to autonomous hazard avoidance and landing, in: AAS/AIAA Astrodynamics Specialist Conference, 2020.
[79] A. McGovern, K.L. Wagstaff, Machine learning in space: Extending our reach, Mach. Learn. 84 (2011) 335–340, http://dx.doi.org/10.1007/s10994-011-5249-4.
[80] L.V. Savkin, V.G. Dmitriev, E.A. Fedorov, V.I. Filatov, P.A. Gusenkov, Neuroregulators in spacecraft onboard systems, Promyshlennye ASU i kontrollery (4) (2016) 31–39 (in Russian).
[81] V.V. Yefimov, Neural intellectualization of on-board complexes for control of surveillance spacecraft, Mekh. Avtom. upravleniye (10) (2006) 2–15 (in Russian).
[82] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, arXiv:1709.06560.
[83] A.A. Zhdanov, L.V. Zemskikh, B.B. Belyaev, A system of stabilizing the angular motion of a spacecraft based on a neuron-like system of autonomous adaptive control, Cosm. Res. 42 (2004) 269–282, http://dx.doi.org/10.1023/B:COSM.0000033301.16194.fc.