You are on page 1of 14

JOURNAL OF GUIDANCE, CONTROL, AND DYNAMICS

Vol. 44, No. 8, August 2021

Reinforcement Learning for Robust Trajectory Design


of Interplanetary Missions

Alessandro Zavoli∗ and Lorenzo Federici†


Sapienza University of Rome, 00184 Rome, Italy
https://doi.org/10.2514/1.G005794
This paper investigates the use of reinforcement learning for the robust design of low-thrust interplanetary
trajectories in presence of severe uncertainties and disturbances, alternately modeled as Gaussian additive process
noise, observation noise, and random errors in the actuation of the thrust control, including the occurrence of a missed
thrust event. The stochastic optimal control problem is recast as a time-discrete Markov decision process to comply
with the standard formulation of reinforcement learning. An open-source implementation of the state-of-the-art
algorithm Proximal Policy Optimization is adopted to carry out the training process of a deep neural network, used to
map the spacecraft (observed) states to the optimal control policy. The resulting guidance and control network
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

provides both a robust nominal trajectory and the associated closed-loop guidance law. Numerical results are
presented for a typical Earth–Mars mission. To validate the proposed approach, the solution found in a
(deterministic) unperturbed scenario is first compared with the optimal one provided by an indirect technique.
The robustness and optimality of the obtained closed-loop guidance laws is then assessed by means of Monte Carlo
campaigns performed in the considered uncertain scenarios.

Nomenclature Subscript
Aπ = advantage function k = value at kth time step
a = action vector
Ev = expected value of a vector v with respect to τ
τ Superscripts
f, h, g = nonlinear functions identifying the time-discrete
dynamic, observation, and control models ctr = control errors
J = merit index mte, 1 = presence of a single-step missed thrust event
m = spacecraft mass, kg mte, 2 = presence of a multiple-step missed thrust event
N = number of time steps per training episode obs = observation uncertainty
N μ; Σ = Gaussian distribution with mean μ and covariance Σ st = state uncertainty
o = observation vector T = transpose
Qπ = action-value function unp = unperturbed environment
R = reward  = optimal value
r = position vector, km
s = state vector
T = total number of training steps I. Introduction
t = time, s
u

=
=
control vector
value function
I N RECENT years, the possibility of using small- or micro-
spacecraft in interplanetary missions is drawing the attention of
scientists and engineers around the world interested in reducing both
v = velocity vector, km∕s development time and cost of the mission, without affecting signifi-
γ = discount factor cantly its scientific return. The first deep-space micro-spacecraft,
Δv = impulsive velocity variation, km∕s PROCYON [1], was developed in little more than a year in 2014
ε = terminal constraints satisfaction tolerance by the University of Tokyo and JAXA, at a very low cost if compared
θ = neural network parameters with standard-size spacecraft. Despite the malfunctioning of the main
π = control policy thruster, the PROCYON mission has been ubiquitously called a
σx = standard deviation of Gaussian random variable x success, paving the way for similar mission concepts by other space
τ = trajectory agencies, such as the Mars Cube One (MarCO) mission by NASA [2]
ϕ, ϑ, ψ = Euler angles (roll, pitch, and yaw), rad and ESA’s first stand-alone CubeSat mission for deep-space,
ω = random vector that indicates a disturbance/perturbation MArgo [3].
Low-thrust electric propulsion is a key technology for enabling
small- and micro-spacecraft interplanetary missions, as it provides
the spacecraft with high-specific-impulse thrust capability. However,
because of the limited budget, micro-spacecraft generally mount
components with a low technological readiness level (TRL), which
Received 16 November 2020; revision received 26 February 2021; increase the risk of incurring in control execution errors and/or
accepted for publication 7 May 2021; published online 30 June 2021. Copy- missed thrust events (MTEs) during any of the long thrusting periods
right © 2021 by Alessandro Zavoli and Lorenzo Federici. Published by the that characterize low-thrust trajectories. Furthermore, micro-space-
American Institute of Aeronautics and Astronautics, Inc., with permission. All craft have limited access to ground stations on Earth, and greater
requests for copying and permission to reprint should be submitted to CCC at navigation uncertainties (i.e., errors in knowledge of the state) are to
www.copyright.com; employ the eISSN 1533-3884 to initiate your request.
See also AIAA Rights and Permissions www.aiaa.org/randp. be expected with respect to standard missions.
*Researcher, Department of Mechanical and Aerospace Engineering, Via With the spacecraft operating in a relatively well-known environ-
Eudossiana 18; alessandro.zavoli@uniroma1.it. ment, free of major disturbances, traditional optimal control methods
† (e.g., indirect methods based on Pontryagin’s principle [4] or direct
Ph.D. Student, Department of Mechanical and Aerospace Engineering,
Via Eudossiana 18; lorenzo.federici@uniroma1.it. methods based on collocation [5,6]) proved to be rather efficient
1440
ZAVOLI AND FEDERICI 1441

means to design time-optimal or minimum-propellant low-thrust robust trajectory design. In fact, BC accuracy rapidly worsens when
transfers performed by standard-size spacecraft. More recently, con- the trained network is asked to solve problems that fall outside of the
vex-optimization-based approaches have also been successfully set of expert demonstrations it was trained on. As a consequence,
employed for the numerical solution of low-thrust orbital transfer when dealing with SOCPs, a drop in performance (or even diver-
problems [7,8]. Such methods are computationally light enough for gence) may occur when, because of uncertainty, the flight trajectory
potential onboard guidance applications, following the general starts moving away from the training set domain, typically populated
framework of model predictive control (MPC), which consists in by solutions coming from deterministic OCPs. To cover previously
recursively solving an optimal control problem (OCP) with initial unknown situations, a Dataset Aggregation (DAGGER) approach
condition updated at each guidance step, given the available state can be used [30]. This approach has been effectively exploited to
measurements [9]. However, except for naturally convex problems, improve the network accuracy in controlling a lander during a
sequential optimization methods must be implemented to make powered descent on the lunar surface [31]. In that case, after the first
OCPs characterized by general nonconvex constraints computation- training of the network, an expert (i.e., an OCP solver) is called to
ally tractable. This can significantly slow down the optimization relabel with the correct control laws all the trajectories of the test set in
process and limit the closed-loop update frequency, thus hindering which the surrogate model error is not acceptable. Then, the network
the algorithm robustness to uncertainties [10,11]. is retrained using this new data set, in order to achieve a correct
Typically, when designing a real space mission, engineers check behavior also in previously unsuccessful scenarios. However, the
and improve the trajectory robustness to uncertainties a posteriori effectiveness of BC for robust trajectory design remains doubtful,
[12], by means of time-consuming iterative procedures, which often especially when solutions from deterministic OCPs are used as expert
bring to suboptimal solutions and overconservative margins. This demonstrations. Recently, an attempt to train a network by BC using a
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

design methodology is particularly unsuitable for micro-spacecraft precomputed set of trajectories perturbed by random MTEs has been
missions, where the possibility to have large propellant margins and performed, showing interesting results [32]. However, the possibility
hardware redundancy is not a viable option. In this respect, recent of having other types of state and control uncertainties has not been
research papers attempted to address the robust design of interplan- thoroughly investigated yet.
etary trajectories by using novel optimization techniques. As an A different approach is represented by reinforcement learning
example, the problem of designing optimal risk-aware trajectories, (RL), which involves learning from experience rather than from
which guarantee the safety of the spacecraft when it operates in expert demonstrations. In RL, a software agent (i.e., the G&CNet)
uncertain environments, was addressed by applying chance-con- autonomously learns how to behave in a (possibly unknown and
strained optimal control combined with a convex optimization stochastic) environment, modeled as a Markov Decision Process
approach to deal with impulsive maneuvers [13], or with a direct/ (MDP), so as to maximize some utility-based function, which plays
indirect hybrid optimization method to deal with continuous thrust the role of the merit index in traditional OCPs. Differently from BC,
[14]. Stochastic Differential Dynamic Programming (SDDP) was no pre-assigned data set of observation-control pairs to learn from is
applied to interplanetary trajectory design in presence of Gaussian- considered, that is, the agent is not told in advance what actions
modeled state uncertainties [15,16]. Also, the robust design of a low- are best to take in a given set of states. Instead, the agent is left free
thrust interplanetary transfer to a near-Earth asteroid was performed to explore by itself the whole solution space, by repeatedly interact-
by using evidence theory to model epistemic uncertainties in the ing with a sufficiently large number of realizations of the environ-
performance of the main thruster and in the magnitude of the depar- ment. The only feedback the agent receives back is a numerical
ture hyperbolic excess velocity [17]. Belief-based transcription pro- reward collected at each time step, which helps the agent in under-
cedures for stochastic optimal control problems (SOCPs) have been standing how good or how bad its current performance is. In this
also proposed for the robust design of space trajectories under framework, the goal of the RL agent is to learn the control policy
stochastic and epistemic uncertainties [18,19], incorporating also that maximizes the expected cumulative sum of the rewards over a
navigation analysis in the formulation to update the knowledge of trajectory.
the spacecraft state in presence of observations [20,21]. The main drawback of RL is that it typically involves much longer
Interest in using deep learning techniques to solve optimal and training times than BC to achieve comparable results, as the explora-
robust control problems has grown rapidly in recent years, especially tion must always start from scratch, even when partial knowledge of
for space applications. In this context, the term “G&CNet” (namely, the system is available in advance. Also, the correlation existing
guidance and control network) was coined at the European Space between the collected data can lead to vicious circles: if the agent
Agency [22] to refer to an onboard system that provides real-time starts collecting poor quality data (e.g., trajectories with no rewards),
guidance and control functionalities to the spacecraft by means of a it has little to no chance of knowing how to improve its performance,
deep neural network (DNN), which replaces traditional control and and it will continue to accumulate bad trajectories. As an additional
guidance architectures. DNNs are among the most versatile and downside, because MDPs allow only scalar reward functions, a
powerful machine learning tools, thanks to their unique capability careful choice, or shaping, of the reward is often mandatory to
of accurately approximating complex nonlinear input–output func- efficiently guide the agent during training, while ensuring compli-
tions, when provided with a sufficiently large amount of data con- ance with all problem constraints. In this respect, an inverse
sisting of sample input–output pairs (i.e., a training set) [23]. reinforcement learning (IRL) approach could be adopted to autono-
Two alternative, and quite different, approaches can be used to mously extract a reward function from observed expert demonstra-
train a G&CNet to solve an OCP, depending on what training data are tions [33], thus improving the effectiveness of the training process on
used and how they are collected. In behavioral cloning (BC), given a the considered problem. However, as other imitation learning proce-
set of trajectories from an expert (i.e., sequences of observation- dures, such as BC and the DAGGER method, an IRL approach would
control pairs), the network is trained to reproduce (or clone) the require the a priori computation of a huge number of optimal trajec-
observed expert behavior. Usually, these trajectories are obtained tories for populating the expert data set, leading to an additional
as the solution of a deterministic OCP with randomized boundary computational burden associated to the solution of the OCPs.
conditions. Recently, BC has been used to train a fast-execution Recently, deep RL methods have been successfully applied to the
G&CNet to control a spacecraft during a fuel-optimal low-thrust real-time guidance of a spacecraft during a 6-DoF powered descent
Earth–Venus transfer [24] and during a free landing maneuver in a on Mars surface [34], orbital transfers in the cislunar environment
simplified dynamic model [25]. Also, this technique was proposed [35–37], and autonomous rendezvous and docking maneuvers [38].
for fast and reliable generation of optimal asteroid landing trajecto- Deep learning technologies also proved to be capable of handling
ries [26,27] as well as for onboard guidance of a hypersonic reentry even complex spaceflight mechanics problems, as the optimization of
vehicle [28]. low-thrust spacecraft transfers. Behavioral cloning approaches have
Despite being computationally efficient, as it largely relies on been proposed to obtain minimum-time trajectories of a solar sail
state-of-the-art implementations of supervised learning algorithms spacecraft [39] and mass-optimal trajectories of a low-thrust spacecraft
[29], BC presents a number of downsides that make it unsuitable for [40], by generalizing the optimal solutions provided by an indirect
1442 ZAVOLI AND FEDERICI

approach. RL-based approach also achieved high-quality results in a Table 1 Problem data
number of low-thrust transfer problems, as the optimal guidance of
Variable Value
a spacecraft during low-thrust transfers to geosynchronous orbits
[41,42], and low-thrust interplanetary trajectory design [43]. tf , days 358.79
When dealing with robust trajectory design, RL allows to effi- m0 , kg 1,000
ciently handle a few key challenges typical of stochastic OCPs that T max , N 0.50
are of primary concern to other solution methods, such as chance- ueq , km∕s 19.6133
constraint optimization and belief-based optimization, including μ⊙ , km3 ∕s2 132,712,440,018
1) an arbitrary form of the feedback control laws, 2) the uncertainty
rL , km −140;699;693; −51;614;428; 980T
propagation through complex system dynamics (e.g., nonlinear sys-
tems), and 3) the general intractability of multiple joint stochastic vL , km∕s 9.7746; −28.0783; 4.3377 × 10−4 T
constraints, which often lead to overconservative solutions when r♂ , km −172;682;023; 176;959;469; 7;948;912T
tackled by traditional methods. v♂ , km∕s −16.4274; −14.8605; 9.2149 × 10−2 T
This paper aims at investigating the use of RL for the robust design
of a low-thrust interplanetary trajectory in presence of various
sources of uncertainty, which are 1) dynamic uncertainty, due to
possible unmodeled forces acting on the spacecraft; 2) navigation
errors, which bring to an inaccurate knowledge of the spacecraft state;
3) control errors, due to erroneous actuation of the commanded the spacecraft engine performance (maximum thrust T max and effective
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

control; and 4) an MTE, related to the unexpected occurrence of a exhaust velocity ueq ) are reported in Table 1. In all simulations, the
safe mode during a thrusting period. The present analysis signifi- physical quantities have been made nondimensional by using as refer-
pr  149.6 × 10 km, the
ence values the Earth–Sun mean distance 6
cantly extends prior works in the literature that either consider only
deterministic transfers [43] or the moderate presence of safe mode corresponding circular velocity v  μ⊙ ∕r, and the initial spacecraft
events [44], by proposing a solution approach capable of dealing with mass m   m0 .
arbitrary non-Gaussian, and possibly nonalgebraic, stochastic uncer- Even though the spacecraft propulsive characteristics are slightly
tainty distributions, which would be difficult, if not impossible, to optimistic given the current technological limits, the same data used
handle with current direct or indirect methods for SOCPs. As a further in Refs. [16] and [47] have been considered in this paper for the sake
contribution to the existing literature, this paper presents a solution of a fair comparison. This relatively simple case study, where the
approach of general validity, which does not rely on a problem- departure date, transfer duration, and initial excess of hyperbolic
dependent reward shaping to produce results of quality comparable velocity are fixed, serves as a basis to validate the effectiveness of
with those provided by an indirect method in a deterministic scenario. the studied technique in a known mission scenario. Yet, the proposed
To this aim, a constraints relaxation technique named “ε-constraint,” methodology can be easily extended in the future to deal with more
first proposed by Takahama and Sakai [45] for parametric optimiza- complex missions, by introducing the choice of the mission depar-
tion problems and then successfully exploited by the authors for
ture/arrival date and of a possible nonnull initial excess of hyperbolic
the solution of complex real-world spaceflight problems by means
velocity with only minor implementation changes.
of evolutionary methods [46], is here adopted in the framework
The stochastic effects considered in this study are state uncertain-
of RL.
ties, which refer to the presence of unmodeled accelerations and/or
The paper is organized as follows. First, the SOCP is formulated as
badly estimated parameters in the considered dynamic model, obser-
a time-discrete MDP. The mathematical models used to describe the
vation uncertainties, i.e., navigation errors, related to measurement
state and observation uncertainties, the control errors, and the occur-
rence of an MTE are detailed. The expression of the reward function, noise and/or inaccuracies in the orbital determination that lead to
which includes both the merit index and the problem constraints imperfect knowledge of the spacecraft state, and control errors, which
(i.e., fixed final spacecraft position and velocity), is given as well. account for both random actuation errors in thrust pointing and
Next, after a brief introduction of the basic concepts and notation of modulation, and a single MTE with variable duration, which corre-
RL, the RL algorithm used in this work, named “Proximal Policy sponds to a null thrust occurrence triggered by a safe mode event.
Optimization” (PPO), is described in detail. Furthermore, the con-
figuration selected for the DNN and the values used for the algorithm A. Markov Decision Process
hyperparameters are reported. Then, numerical results are presented
for the case study of the paper, that is, a time-fixed low-thrust Earth– Let us briefly introduce the mathematical formulation of a generic
Mars rendezvous mission. Specifically, the effect of each source of time-discrete MDP, which is required to properly set up the math-
uncertainty on the system dynamics is analyzed independently, and ematical framework of deep RL algorithms. Let sk ∈ S ⊂ Rn be a
the obtained results are compared in terms of trajectory robustness vector that completely identifies the state of the system (e.g., the
and optimality. Eventually, the reliability of the obtained solutions is spacecraft) at time tk . In general, the complete system state at time tk
assessed by means of Monte Carlo simulations. A section of con- is not available to the controller, which instead relies on an obser-
clusions ends the paper. vation vector ok ∈ O ⊂ Rm. Even though raw observations could be
used directly to control the spacecraft [48], in the present paper the
observations are to be intended as an uncertain (e.g., noisy) version
II. Problem Statement of the true state. So, the observation ok can be written as a function h
A three-dimensional, time-fixed Earth–Mars rendezvous mission, of the current state sk , of a random vector ωo;k ∈ Ωo ⊂ Rmω , repre-
already analyzed in other research papers [16,47], is considered as a senting noise or uncertainty in the state estimation, and of time tk ,
test case. The spacecraft, equipped with a low-thrust engine, leaves since a time-fixed mission is considered. The commanded action
Earth at a predetermined departure date with zero excess of hyper- ak ∈ A ⊂ Rl at time tk is the output of a state-feedback control
bolic velocity. The objective is to design a minimum-propellant policy π:O → A, that is, ak  πok . The actual control uk ∈ A
trajectory that allows the spacecraft to rendezvous with Mars at a differs from the commanded action due to possible execution errors;
prescribed arrival date, in presence of navigation and control uncer- for this reason, it is modeled as a function g of the commanded action
tainties. During the transfer, the spacecraft is assumed to move ak and of a stochastic control disturbance vector ωa;k ∈ Ωa ⊂ Rlω .
according to a Keplerian dynamic model, under the sole gravitational A stochastic, time-discrete dynamic model f describes the evolution
influence of the Sun. of the system state. The uncertainty on the system dynamics at time
The values of the initial position rL and velocity vL of the Earth, the tk is modeled as a random vector ωs;k ∈ Ωs ⊂ Rnω . As a result, the
final position r♂ and velocity v♂ of Mars, the gravitational parameter of dynamic system evolution over time is described by the following
the sun μ⊙ , the total transfer time tf , the initial spacecraft mass m0 , and equations
ZAVOLI AND FEDERICI 1443

sk1  fsk ; uk ; ωs;k1  (1) v♂ − vN


ΔvN  minjv♂ − vN j; Δvmax;N  (10)
jv♂ − vN j
ok  hsk ; tk ; ωo;k  (2)
and the final spacecraft state is defined by
ak  πok  (3)
rf  rN (11)
uk  gak ; ωa;k  (4)
vf  vN  ΔvN (12)
An episodic formulation is considered; each episode is divided
into N time steps (i.e., k ∈ 0; N). The problem goal is to find the mf  mN exp−jΔvN j∕ueq  (13)
optimal control policy π  that maximizes a performance index J, that
is, the expected value of the discounted sum of rewards obtained
The deterministic observation vector at time tk is
along a trajectory
" # ok  rTk ; vTk ; mk ; tk T ∈ R8 (14)
X
N−1
J E γ k Rk (5)
τ∼π The following sections introduce the mathematical models used
k0
for the different sources of uncertainty considered in this study,
where Rk  Rsk ; uk ; sk1  is the reward associated with transition- namely, uncertainties on the dynamic model, noise on the observa-
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

ing from state sk to state sk1 due to control uk , and γ ∈ 0; 1 is a tions, and control execution errors, including the possible occurrence
discount factor that is used to either encourage long-term planning of an MTE. It is worth to remark that additional sources of uncertainty
(γ  1) or short-term rewards (γ ≪ 1). Eτ∼π denotes the expectation (e.g., on the vehicle mass) may be considered too, by changing the
taken over a trajectory τ, that is, a sequence of state-action pairs τ  dynamics and/or observation model accordingly. The solution meth-
fs0 ; a0 ; : : : ; sN−1 ; aN−1 g sampled according to the closed-loop odology is independent of the specific source of uncertainty con-
dynamics in Eqs. (1–4). sidered.

2. State Uncertainties
B. Low-Thrust Earth–Mars Mission
For the sake of simplicity, uncertainties on the spacecraft dynamics
This general MDP model is now specified for the Earth–Mars
are modeled as an additive Gaussian noise on position and velocity at
transfer problem at hand. During the mission, the spacecraft state sk at
time tk , k ∈ 0; N, that is,
any time step tk  ktf ∕N; k ∈ 0; N is identified by its inertial
position r  x; y; zT and velocity v  vx ; vy ; vz T with respect to  
δrk
Sun, and by its mass m ωs;k  ∼ N 06 ; Σs  ∈ R6 (15)
δvk
sk  rTk ; vTk ; mk T ∈ R7 (6)
where Σs  diagσ 2r;s I3 ; σ 2v;s I3  is the covariance matrix, with σ r;s
In the present paper, the total number of time steps N has been set and σ v;s the standard deviations on position and velocity, and In
equal to 40, as in the original paper by Lantoine and Russell [47]. The (respectively, 0n ) indicates an identity (respectively, null) matrix of
low-thrust trajectory is approximated as a series of ballistic arcs dimension n × n (respectively, n × 1). Thus, the stochastic dynamic
connected by impulsive Δvs, similarly to what is done in the well- model in Eq. (1) can be expanded as
known Sims–Flanagan model [49]. The magnitude of the kth impulse 2 3 2 3
Δvk is limited by the amount of Δv that could be accumulated over rk1 δrk1
the corresponding trajectory segment (from time tk−1 to time tk ) by 4 vk1 5  fsk ; uk ; ωs;k1   fdet sk ; uk   4 δvk1 5 (16)
operating the spacecraft engine at maximum thrust T max : mk1 0

T max tf
Δvmax;k  (7)
mk N
3. Observation Uncertainties
where mk denotes the spacecraft mass just before the kth Δv is
The uncertainty in the knowledge of spacecraft position and
performed.
velocity, which may arise due to errors in the orbital determination,
So, the commanded action at time tk corresponds to an impulsive Δv:
is modeled as an additive Gaussian noise on the deterministic obser-
vation at time tk
ak  Δvk ∈ −Δvmax;k ; Δvmax;k 3 ⊂ R3 (8)
2 3 2 3
rk δro;k
6 vk 7 6 δvo;k 7
1. Deterministic Model ok  6 7 6
4 mk 5  4 0 5
7 (17)
Because the spacecraft moves under Keplerian dynamics between
tk 0
any two impulses, in a deterministic scenario the spacecraft state can be
propagated analytically with a closed-form transition function fdet
being
2 3
2 3 f^k rk  g^ k vk  Δvk   
rk1 6 7 δro;k
6 7 6 7 ωo;k  ∼ N 06 ; Σo  ∈ R6 (18)
4 vk1 5  fdet sk ; ak   6 f_^k rk  g_^ k vk  Δvk  7 (9) δvo;k
4 5
mk1
mk exp−jΔvk j∕ueq  where Σo  diagσ 2r;o I3 ; σ 2v;o I3  is the covariance matrix, with σ r;o
and σ v;o the standard deviations on the observed spacecraft position
for k  0; : : : ; N − 1, where f^k and g^ k are the Lagrange’s coefficients and velocity.
at k-th step, defined as in Ref. [50], and the mass update is obtained
through Tsiolkovsky’s equation. 4. Control Errors
At the final time tf  tN , the last Δv is computed so as to match Control execution errors are modeled as a small three-dimensional
Mars velocity rotation of the commanded Δv, defined by Euler angles δϕ; δϑ; δψ,
1444 ZAVOLI AND FEDERICI

and a slight variation δu of its modulus. Random variables (


0 if k < N
δϕ; δϑ; δψ, and δu are sampled from a Gaussian distribution with es;k  (25)
standard deviation σ ϕ , σ ϑ , σ ψ , and σ u , respectively. So, the control maxfer ; ev g if k  N
disturbance vector at time tk is
Here μk is the cost function, that is, the (nondimensional) consumed
2 3 propellant mass at time tk , eu;k is the (nondimensional) violation of
δϕk
6 δϑ 7 the constraint on the maximum Δv magnitude for the kth segment
6 k7 [see Eq. (7)], and es;N is the maximum value between the final position
ωa;k 6 7 ∼ N 04 ; Σa  ∈ R4 (19)
4 δψ k 5 error er  jrf − r♂ j∕jr♂ j and velocity error ev  jrf − r♂ j∕jv♂ j.
δuk Finally, ε is a tolerance on terminal constraint violation.
Please note that, in the framework of MDPs, only scalar rewards
where Σa  diagσ 2ϕ ; σ 2ϑ ; σ 2ψ ; σ 2u  is the covariance matrix. are allowed. Thus, in the present work, constraints are accounted for
The actual control u can be written as a function of the commanded by introducing the L1 -penalty functions eu;k and es;k for control path
action a at time tk , k ∈ 0; N, as constraints and terminal constraints, respectively, with penalty
weights λeu and λes . The rationale behind the selection of suitable
uk  gak ; ωa;k   1  δuk Ak ak (20) values for the weights λeu and λes , as well as for the allowed tolerance
ε, is discussed in Sec. III.C.
where the rotation matrix Ak is evaluated, under the small-angle
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

assumption, as
2 3 III. Reinforcement Learning
1 −δψ k δϑk This section introduces the mathematical backbone of policy-
6 7
Ak  4 δψ k 1 −δϕk 5 (21) gradient RL algorithms. Without loss of generality, throughout
−δϑk δϕk 1 the whole section the MDP is supposed to be perfectly observable
(i.e., with ok  sk ) to conform with the standard notation of RL.
Anyway, the same equations can be applied by using the observation
It is worth noticing that this disturbance cannot be modeled as an ok in place of the state sk when a perfect knowledge of the state is not
additive Gaussian noise and most of the techniques in the literature available, or whenever the observations differ from the state.
would not be directly applicable. The RL algorithm adopted in this work is PPO [52], which is a
model-free, policy-gradient, actor-critic method widely recognized
5. Missed Thrust Event for the high performance demonstrated on a number of continuous
The effect of occurrence of an MTE of unknown duration is also and high-dimensional control problems. A model-free approach is
investigated. A Weibull distribution was recently proposed by Imken particularly suited for the proposed investigation. Indeed, an algo-
et al. [51] to fit the occurrence of safe mode events of known deep rithm that does not rely on the explicit knowledge of the MDP
space missions, and used by Rubinsztejn et al. [44] to design missed equations for returning the optimal control policy π  can be applied
thrust resilient trajectories. A more severe distribution is considered seamlessly to different mission scenarios. Also, model-free
here, both for research purposes, i.e., to investigate the algorithm approaches support the use of complex, and possibly nonanalytical,
capability in extreme circumstances, and with the aim of exploring expressions for the transition function f, control model g, observa-
less optimistic scenarios, which should be more representative of tion model h, and/or disturbance distributions, which may be not
small- and micro-spacecraft missions, where a greater number of allowed by traditional solution methods for SOCPs previously
unexpected issues may arise as a result of using low-TRL com- adopted in the context of robust design of spacecraft trajectories.
ponents.
The MTE is supposed to start at time tk^ , where index k^ ∈ N is A. Policy Gradient Algorithms
sampled from a uniform probability distribution in 0; N at the Policy-gradient algorithms are a common choice in RL. Compared
beginning of each episode, and it is modeled as a complete lack of with value-based methods, such as deep Q-learning [53], which gen-
thrust, hence uk^  03 . With probability 1 − pmte, the miss-thrust erally work well when the agent can choose between a finite set of
condition is recovered (i.e., the engine returns to be fully operational) actions, policy-gradient methods are better suited to continuous state
at the next time tk1
^ , and no further MTE arises. Otherwise, the MTE and action spaces. The underlying idea of policy gradient is to directly
persists for an additional time step, and the procedure is repeated. The learn the policy πs that maximizes the performance index J. A
MTE can last at most nmte successive time steps, that is, from time tk^ stochastic policy is generally considered, as deemed more robust and
to time tkn
^ mte −1 . It is worth noticing that, because exactly one MTE more oriented toward exploration than a deterministic one. In this sense,
occurs per mission, but neither the MTE starting time nor its duration πs should be intended as a shorthand for πajs, that is, the proba-
is made available to the agent, the resulting stochastic process is non- bility of choosing action a, conditioned by being the system in the state
Markov. s. To cope with complex environments and possibly large (and con-
tinuous) state and action spaces, the policy π is usually approximated by
6. Reward Function a DNN with parameters θ, and referred to as π θ ajs. A DNN is an
input–output computing system composed by a sequence of layers,
The goal of the optimization procedure is to maximize the each made up of a number of basic mathematical units called neurons. A
(expected) final mass of the spacecraft, while ensuring the compli- multilayer perceptron (MLP) network is adopted in the present work:
ance with terminal rendezvous constraints on position and velocity. each neuron receives an input (numerical) signal from each unit in the
For this reason, the reward Rk obtained at time tk is defined as previous layer, generates an output signal as the result of applying a
(typically nonlinear) activation function to a linear combination of all
Rk  −μk − λeu eu;k−1 − λes maxf0; es;k − εg (22) the inputs, and sends this signal to all the units in the next layer. The
weights and biases used in the linear combinations constitute the net-
where work parameters θ. The network inputs are passed directly to the
( neurons of the first (or input) layer; the outputs generated by the neurons
mk−1 − mk ∕m
 if k < N of the last (or output) layer are returned as network output. The
μk  (23)
mN−1 − mf ∕m if k  N intermediate layers are defined as hidden layers. More complex archi-
tectures, such as recurrent neural networks (RNNs), are also used in
deep RL to cope with partially observable or non-Markov environ-
eu;k  maxf0; juk j − Δvmax;k ∕vg
 (24) ments, but they are not considered in this work. In the case of a
ZAVOLI AND FEDERICI 1445

deterministic policy, the network directly outputs the action, given the computational point of view, the advantage function at time tk is
current system state as input; instead, when a stochastic policy is usually approximated by using the so-called generalized advantage
considered, the network returns some parameters of the underlying estimator (GAE)
probability distribution, such as the mean values of a multivariate
diagonal Gaussian distribution, which is the most frequently adopted X
N −1
0
one. The standard deviations instead are usually considered as stand- A^ k  γλk −k δk 0 (30)
alone parameters (i.e., not dependent on the current state). The actual k 0 k
action is then sampled from this probability distribution.
Once the network architecture (i.e., number of layers, layer den- where
sity, and activation functions) is assigned, the problem reduces to the
search of the parameters θ that maximize the merit index J δk  Rk  γV πk1
θ
sk1  − V πk θ sk  (31)
0 1
" # is the temporal-difference error, which is an unbiased estimate of the
B X
N−1
C advantage function, being Eτ∼πθ δk   Aπk θ sk ; ak . So, the policy
θ  arg max Jθ  arg max@ E γ k Rk A (26) gradient can be evaluated as
θ θ τ∼π θ
k0
" #
X
N −1
where Rk  Rsk ; uk ; sk1  is the immediate reward collected after ∇θ Jθ  E ∇θ log π θ ak jsk A^ k (32)
τ∼π θ
taking action ak ∼ π θ ⋅jsk . k0
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

To solve Eq. (26), policy-gradient algorithms perform a stochastic


gradient ascent (SGA) update on θ, that is, θ←θ  α∇θ Jθ, where α This leads to the advantage actor-critic (A2C) method [55], whose
is the learning rate, which is an user-defined positive constant, or basic architecture is reported in Fig. 1. In this case, the critic returns a
slowly decreasing, hyperparameter that controls the step size of the parameterized value function V ϕ , necessary for the advantage esti-
gradient update. mation through Eqs. (30) and (31).
The policy gradient theorem [54] is used to rewrite the gradient
∇θ Jθ in a more suitable form B. Proximal Policy Optimization
" # PPO can be seen as a further evolution of the (advantage) actor-
X
N −1 critic method. PPO introduces a “clipped surrogate objective func-
∇θ Jθ  E ∇θ log π θ ak jsk Qπk θ sk ; ak  (27) tion” Jclip in place of J, so as to force the updated policy π θ to stay in a
τ∼π θ
k0 small range ε, named “clip range,” around the previous policy π θold .
This is obtained by clipping the probability that the new policy moves
where outside of the interval 1 − ϵπ θold ; 1  ϵπ θold . In this way, the
" # algorithm avoids having too large policy updates during gradient
X
N −1
Qπk θ s; a ascent when using Eq. (32), which are usually the cause of a perfor-
0
 E γ k −k R k 0 jsk  s; ak  a (28)
τ∼π θ mance collapse.
k 0 k
Let r~k θ be the probability ratio
is the action-value function, or simply Q-function, at step k, which is
the expected discounted return obtained by starting at time tk in state π θ ak jsk 
r~k θ  (33)
s, taking action a and then acting according to policy π θ. π θold ak jsk 
The Q-function could be estimated by a Monte Carlo approach,
using the average return over several episode samples. While unbiased, The clipped surrogate objective function can be written as
this estimate has a high variance, which makes this approach unsuit- " #
able for practical purposes. An improved solution relies on approxi- 1XN−1
mating the Q-function by a second DNN with parameters ϕ, leading to J clip
θ  E minr~k θAk ;clipr~k θ;1 − ϵ;1  ϵAk  (34)
^ ^
τ∼π θ N k0
the so-called actor-critic method. The two neural networks run in
parallel and are concurrently updated: the actor, which returns the
parameterized policy π θ, is updated by gradient ascent on the policy Empirical evidence suggests that using a clip range ϵ that linearly
gradient [Eq. (27)]; the critic, which returns the parameterized decreases with the training steps can improve the learning effective-
Q-function Qϕ , is recursively updated using temporal difference ness. Typically, in PPO the policy and value function updates are
[54]. Intuitively, the actor controls the agent behavior, whereas the carried out all at once by gathering in the set of parameters θ the
critic evaluates the agent performance and gives a feedback to the actor weights and biases of both the actor and critic networks, and by
in order to efficiently update the policy. including in the objective function a mean-squared value function
To further improve the stability of the learning process and to error term H, defined as
reduce the variance in the estimate of the policy gradient, it is possible
to subtract a function of the system state only, called baseline bs,
from the expression (27) of the policy gradient, without changing its
value in expectation. The most common choice of baseline is the
value function
" #
X
N −1
V πk θ s
0
 E γ k −k R k 0 jsk s (29)
τ∼π θ
k 0 k

which represents the expected discounted return obtained by starting


at time tk in state s and acting according to policy π θ until the end of
the episode. When the value function is subtracted from the action-
value function Qπk θ , one obtains the definition of the advantage
function Aπk θ s; a  Qπk θ s; a − V πk θ s, which quantifies by how
much the total reward improves by taking a specific action a in state s,
instead of randomly selecting the action according to π θ ⋅js. From a Fig. 1 Schematic of the advantage actor-critic RL process.
1446 ZAVOLI AND FEDERICI

2 0 12 3 Table 2 PPO hyperparameters


1 61 X
N −1 X
N −1
7
Hθ  E 4 @V θ sk  − γ k −k Rk 0 A 5
0
(35) Hyperparameter Symbol Value Eligibility interval
2 τ∼πθ N k0 0
k k Discount factor γ 0.9999 f0.9; 0.95; 0.98; 0.99; 0.999; 0.9999; 1g
GAE factor λ 0.99 f0.8; 0.9; 0.92; 0.95; 0.98; 0.99; 1.0g
and an entropy term S, defined as Initial learning
α0 2.5 × 10−4 [1 × 10−5 , 1]
rate
" # Initial clip range ϵ0 0.3 f0.1; 0.2; 0.3; 0.4g
1XN−1
Sθ  E Ea∼πθ ⋅jsk  − log π θ ajsk  (36) Value function
c1 0.5 f0.5; 1.0; 5.0g
τ∼π θ N k0 coefficient
Entropy
c2 4.75 × 10−8 [1 × 10−8 , 0.1]
coefficient
the latter being added to ensure sufficient exploration, and prevent Number of SGA
premature convergence to a suboptimal deterministic policy [55]. epochs per nopt 30 f1; 5; 10; 20; 30; 50g
Eventually, the final surrogate objective function is update

Jppo θ  Jclip θ − c1 Hθ  c2 Sθ (37)

where c1 and c2 are two hyperparameters, named “value function Both the learning rate α and the clip range ϵ decrease linearly during
coefficient” and “entropy coefficient,” which control the relative training, starting from their initial values α0 , ϵ0 , according to the
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

importance of the various terms. following rules


The overall learning process consists of two well-distinguished
phases that are repeated iteratively: 1) policy rollout, during which α  α0 1 − t∕T (38)
the current policy π θ is run in nenv independent (e.g., parallel)
realizations of the environment for nb training episodes, each of ϵ  ϵ0 1 − t∕T (39)
length N, collecting a set of nenv nb trajectories τi , and 2) policy
update, which is performed by running nopt epochs of SGA over nb where t indicates the current training-step number.
sampling mini-batches of on-policy data, that is, the trajectories Astrodynamics routines from the scientific library pykep [60],
coming from the last rollout only. The algorithm terminates after a developed at the European Space Agency, were used for simulating
total number of training steps equal to T, that is, after the mission environment. The whole Python code used to produce the
N upd  T∕nenv nb N policy updates. results here presented is made freely available on GitHub‡ for
research purpose.
C. Implementation Details Despite the theoretical guarantee of monotonic improvement of
The results presented in this work were obtained by using the PPO the expected cumulative reward over the infinite-dimensional policy
implementation by stable baselines [56], an open-source library space, the global optimality of the attained policy, its rate of con-
containing a set of improved implementations of RL algorithms vergence, as well as the impact of the policy and value function
based on OpenAI Baselines. According to the available literature, parameterization, remain unclear [61]. Indeed, the performance of
PPO performance is significantly impacted by the so-called “code- the policy may oscillate during the training due to the intrinsic
level optimizations,” that is, small implementation details seemingly stochastic nature of the algorithm. To ensure that the best control
of secondary importance; hence, the use of fully tested libraries is policy found during the whole training is retained, an independent
recommended for the sake of reproducibility [57]. The agent archi- assessment of the policy performance in terms of the average cumu-
tecture consists of an eight-neuron input layer, followed by two lative reward over a 100-run Monte Carlo campaign is carried out
separate head networks: one for the control policy estimation (actor) every nt  4 × 104 nenv training steps. The best found policy accord-
and the other for the value function estimation (critic). The output ing to this assessment is then returned by the algorithm as putative
layer has size nact o  3 for the actor, and size no  1 for the critic.
crit
optimal robust policy.
Each network
p is composed of three hidden layers of size h1  10ni , As mentioned in Sec. II.B.6, when dealing with constrained opti-
h2  h1 h3 , and h3  10no , respectively, where ni is the number of mization problems within a MDP framework, constraint violations
input neurons, and no is the number of output neurons of the are typically included as penalty terms inside the reward function [see
considered network. The activation function is tanh for all neurons Eq. (22)]. In these cases, a proper selection of the constraint weight-
in the hidden layers and linear for the neurons in the output layer. This ing factors is crucial to correctly balance the relative importance of
three-layer DNN topology has been devised heuristically, by decid- the various terms inside the reward function, and avoid the creation of
ing the width of the different layers on the basis of the number of a number of spurious suboptimal solution the search could get
inputs and outputs, as proposed in similar research studies on deep trapped into. To make the solution method less sensitive on the
RL for space trajectory optimization [34]. This architecture has specific values selected for the penalty weights, a constraints relax-
shown improved performance over the default MLP configuration ation method named ε-constraint, similar to those sometimes used in
suggested by stable baselines for control problems of comparable stochastic global optimization [46], has been adopted to enforce
size [56], that is, a 2-layer DNN with 64 neurons per layer, as well as constraints more gradually during optimization, thus allowing the
over an augmented-width variant with 81 neurons per layer, and it agent to explore to a greater extent the solution space at the beginning
was thus adopted in what follows. of the training process. For this reason, as a modest original contri-
The PPO hyperparameters were tuned once, and the same values bution of this paper, the terminal constraints satisfaction tolerance ε
used for all investigations. The values were tuned by means of an also varies during the training, according to a piecewise constant
open source framework for automated hyperparameter search, decreasing trend
Optuna [58], which uses a Tree-Structured Parzen Estimator (TPE)
algorithm [59]. TPE is a Bayesian hyperparameter optimization 
ε0 t ≤ T∕2
algorithm that uses the history of already evaluated hyperparameters εt  (40)
ε∞ t > T∕2
to build probabilistic models, which are then used to sample the next
set of hyperparameters to evaluate. The tuning was performed on a
deterministic (unperturbed) mission scenario (i.e., without uncertain- This minor modification proved to be able to significantly improve
ties and control errors), with a budget of 500 trials, and a maximum of the success of the presented method.
2.4 × 106 training steps per trial. Tuned values of the hyperpara-

meters, together with their eligibility intervals, are reported in Table 2. https://github.com/LorenzoFederici/RobustTrajectoryDesignbyRL.
ZAVOLI AND FEDERICI 1447

Finally, the values of the penalty weights (λeu  100 and λes  50) method, even when it is applied to simple problems as the deterministic
and the parameters of the ε-constraint method, i.e., the number of rendezvous mission here considered. This is mainly motivated by the
levels (2) and the intermediate values (ε0  0.01), have been chosen fact that PPO is a model-free algorithm; hence the knowledge of the
after a manual trial-and-error tuning procedure on the deterministic underlying (analytical) dynamic model is not exploited at all. The only
mission scenario, by looking at both the final cumulative reward and way the agent can obtain satisfactory results is to acquire as much
constraint violation. The final value of the tolerance ε∞  0.001 has experience (i.e., samples) as possible about the environment. Indeed,
been selected to be the same as the tolerance on terminal conditions the solution by RL of the deterministic problem here reported was
adopted in similar studies [43]. obtained after a training consisting of 200 × 106 steps, which took 12 h
with 8 parallel workers on a computer equipped with an 8-core Intel
Core i7-9700 K CPU @3.60 GHz, whereas the indirect method took
IV. Numerical Results just a few seconds, with a trivial initial guess.
This section presents some preliminary results obtained by training
the agent on different environments (i.e., mission scenarios), generated B. Robust Trajectory Design
by considering separately each uncertainty source defined in the prob-
Besides the unperturbed, deterministic mission scenario (labeled
lem statement. To validate the proposed approach, the solution found in
unp), the following stochastic case studies have been analyzed sep-
a deterministic unperturbed scenario is first compared with the optimal
arately in the present manuscript: 1) state uncertainties (st), 2) obser-
one provided by an indirect technique. Then, the robustness and
vation uncertainties (obs), 3) control errors (ctr), 4) single-step MTE
optimality of the closed-loop control policies obtained in the considered
(mte, 1), and 5) multiple-step MTE (mte, 2). Training the agent in one
uncertain scenarios are assessed by means of Monte Carlo campaigns.
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

of these environments leads to a correspondingly defined control


policy, named π st , π obs , π ctr , π mte;1 , and π mte;2 , respectively. The
A. Deterministic Optimal Trajectory values used for the parameters describing each of the uncertainty
The ability of the presented methodology to deal with traditional, models defined in Sec. II.B are reported in Table 3. For each mission
deterministic OCPs is investigated first, by comparing the trajectory scenario, the reference trajectory, which has to be intended as robust
corresponding to a control policy π unp, obtained by training the agent against the considered uncertainty, is obtained by running a deter-
in the deterministic (unperturbed) environment Eq. (9), with the ministic version of the corresponding policy (i.e., it always takes the
solution of the original Earth–Mars low-thrust transfer problem, most probable action, instead of sampling from the probability dis-
found by using a well-established indirect method, routinely used tribution π θ ⋅js) in the unperturbed environment, and recording the
by the authors on interplanetary missions [62]. Δvs and the spacecraft states along the trajectory.
Figure 2 shows the optimal trajectories found by using the two Figure 3 shows the best reference robust trajectory obtained with
approaches. The light-blue arrows indicate the Δvs along the RL each policy at the end of a training that lasted 200 × 106 steps. The
trajectory, whereas the light-orange area shows the optimal thrust
vector found by using the indirect method. The two solutions are very Table 3 Parameters of the different
close to each other, in terms of both trajectory and thrust direction. uncertainty models
Also, the final spacecraft mass of the RL solution (606.34 kg) is in
good agreement with the (true) optimal mass obtained by the indirect Uncertainty Parameter Value
method (603.91 kg). This slight difference is partly because RL σ r;s , km 1.0
State
satisfies the terminal constraints with a lower accuracy (10−3 vs σ v;s , km∕s 0.05
10−9 in the indirect method), and partly due to the approximated, σ r;o , km 1.0
time-discrete, impulsive dynamic model adopted in the MDP tran- Observation
σ v;o , km∕s 0.05
scription. It is important to remark here that, unlike previous works in
the literature, this high-quality result was obtained without the use of σ ϕ , deg 1.0
any ad-hoc reward shaping, but just leveraging on the ε-constraint σ ϑ , deg 1.0
method introduced in Sec. III.C. Control
σ ψ , deg 1.0
However, when applying RL to the solution of deterministic OCPs, σu 0.05
two major downsides arise. First, because the MDP formulation relies
on a scalar reward function, terminal constraints cannot be explicitly pmte 0
Single-step MTE
accounted for, and all constraint violations must be introduced in the nmte 1
reward function as weighted penalty terms. As a result, the accuracy on pmte 0.1
constraint satisfaction is generally looser than in traditional methods Multiple-step MTE
nmte 3
for solving OCPs. Second, RL is a rather computationally expensive

Fig. 3 Earth–Mars trajectories corresponding to different robust


Fig. 2 Optimal Earth–Mars trajectories in the deterministic scenario. policies.
1448 ZAVOLI AND FEDERICI

differences with respect to the deterministic optimal trajectory (in blue) and the cumulative reward J, as well as some environment-specific
are upscaled by a factor 5 for illustration purposes. One can notice that training settings. The solutions corresponding to a policy trained in a
the robust trajectories tend to approach Mars orbit slightly in advance stochastic environment with perturbations on either state (π st ), obser-
with respect to the optimal deterministic solution, in order to improve vations (π obs ), or control direction and magnitude (π ctr ) satisfy the
their capability of meeting the terminal constraints even in presence of terminal constraints within the prescribed tolerance (10−3 ). In these
uncertainties or control errors in the latter part of the mission. cases, robustness is obtained by sacrificing less than 3% of the final
Table 4 summarizes the main features of these trajectories: the final spacecraft mass. Instead, the reference solutions obtained with pol-
spacecraft mass mf , the final position error er and velocity error ev, icies trained in the MTE-perturbed environments do not satisfy the
terminal position error requirement. This unintended behavior is
Table 4 Robust trajectories overview probably due to the exaggerated failure distribution here considered,
Settings Results which always assumes one MTE per mission. As a consequence,
when the policy is run in the unperturbed environment, it tends to
Policy nenv nb mf , kg er 10−3  ev J
perform worse than in the perturbed environment it was trained on.
πunp
8 4 606.34 0.25 0 −0.3937 The final spacecraft mass obtained in these two cases is also consid-
π st 8 16 592.11 0.60 0 −0.4079 erably lower than in the previous scenarios.
π obs 8 16 592.56 0.68 0 −0.4074 In all presented cases, the error on the final velocity is zero. This
π ctr 8 8 590.85 0.60 0 −0.4092 result should not surprise the reader. Indeed, the last Δv is computed
π mte;1 16 16 576.47 4.94 0 −0.6205 algebraically as a difference between the final spacecraft velocity and
Downloaded by Bibliothek der TU Muenchen on February 2, 2024 | http://arc.aiaa.org | DOI: 10.2514/1.G005794

π mte;2 16 32 570.31 2.63 0 −0.5114


Mars velocity [see Eq. (10)]. Thus, the terminal velocity constraint is
automatically satisfied once the computed Δv has a magnitude lower

a) unp b) st

c) obs d) ctr

e) mte,1 f) mte,2

Fig. 4 Magnitude of the Δvs along the robust trajectories.


ZAVOLI AND FEDERICI 1449

than the maximum admissible for the last trajectory segment, accord- responsible for the additional propellant consumption with respect
ing to the Sims–Flanagan model here adopted. to the deterministic solution. As a final remark, the terminal Δv of the
Figure 4 shows the distribution of the magnitude of the Δvs over the flight time for the different robust trajectories. Dashed lines indicate the maximum allowed Δv at each time step, computed using Eq. (7). First, it is worth noticing that the applied Δvs are always lower than or equal to the maximum admissible value in all cases; hence, constraint Eq. (24) is always satisfied. In the unperturbed scenario (Fig. 4a), an almost bang-off-bang pattern is obtained, similar to the one present in the optimal indirect solution (see Fig. 2). In the remaining scenarios (Figs. 4b–4f), the Δv magnitude is more uniformly spread over the whole transfer and always considerably lower than its maximum value. This is a distinctive feature of robust trajectories, which must satisfy the constraint on the maximum admissible value of Δv while leaving room for efficient correction maneuvers. Indeed, this suboptimal distribution of the thrust is responsible for the additional propellant consumption with respect to the deterministic solution. As a final remark, the terminal Δv of the solutions obtained with policies π^mte,1 and π^mte,2 is considerably smaller than the one in the other solutions. The reason is probably that the two policies seek to ensure compliance with the final velocity constraint even in the worst-case scenario, that is, when the MTE occurs next to the arrival.

Fig. 4  Magnitude of the Δvs along the robust trajectories. Panels: a) unp, b) st, c) obs, d) ctr, e) mte,1, f) mte,2.

C. Closed-Loop Mission Analysis

Besides returning a reference robust trajectory, the trained network can also be used to provide the spacecraft with a computationally inexpensive and robust closed-loop guidance. To the best of the authors' knowledge, no general proof of stability exists for DNN-based GNC. Even though some encouraging results have been proposed in the literature for specific systems using a Lyapunov approach [63], Monte Carlo analysis is still the best means to assess both the stability and the performance of the proposed methodology. Therefore, the closed-loop performance of the policies in their respective stochastic mission scenario is investigated by a Monte Carlo campaign, which consisted of running each policy over a set of 1000 randomly generated environment realizations, or test episodes. The state-perturbed environment was used as the test case for the unperturbed policy π^unp.
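As a rough illustration of how one such closed-loop test episode can be exercised (a sketch, not the paper's code: a Gym-style environment interface and a Stable Baselines-style predict call are assumed), the guidance network is simply queried at every control node:

```python
def fly_closed_loop(env, policy, n_steps):
    """Roll out one guidance episode: at each control node the (possibly noisy)
    observation is mapped by the trained network to the next Delta-v command.

    env    : episodic environment exposing reset()/step() (Gym-style interface, assumed)
    policy : trained model exposing predict(obs, deterministic=True) (Stable Baselines style)
    """
    obs = env.reset()
    total_reward, info = 0.0, {}
    for _ in range(n_steps):
        action, _ = policy.predict(obs, deterministic=True)  # a few matrix products only
        obs, reward, done, info = env.step(action)            # propagate to the next control node
        total_reward += reward
        if done:
            break
    return total_reward, info
```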
The spacecraft trajectories that result from the Monte Carlo campaigns are shown in Fig. 5. Specifically, in each figure, the dark-blue line represents the robust reference trajectory, light-blue arrows indicate the nominal Δvs, and the gray lines represent the trajectories of the randomly generated test episodes. The differences between each Monte Carlo sample trajectory and the corresponding reference trajectory are upscaled by a factor of 5 for illustration purposes. One can notice that, for all but the multiple-step MTE scenario (Figs. 5a–5d), the Monte Carlo–generated trajectories have a greater dispersion in the central part of the mission, which progressively reduces, and almost entirely disappears, while approaching Mars, as a direct consequence of the enforcement of the terminal constraints. Conversely, in the case of multiple-step MTEs (Fig. 5e), a limited number of trajectories (12 out of 1000) clearly miss the target (i.e., e_{s,N} > 10^-2).

Fig. 5  Monte Carlo trajectories in the stochastic mission scenarios. Panels: a) st, b) obs, c) ctr, d) mte,1, e) mte,2, f) unp.

Figure 6 highlights the distribution of the error e_r = (r_f − r_♂)/|r_♂| = [e_x, e_y, e_z]^T on the spacecraft position at Mars encounter in the presence of state uncertainties (Fig. 6a) and control errors (Fig. 6b). Similar plots can be realized for the remaining cases, but they are here omitted for the sake of conciseness. The three dashed lines identify the radii ε_B of the balls containing 68.3, 95.5, and 99.7% of the solutions, respectively (i.e., the sigma contours).

Fig. 6  Distribution of the position errors at Mars encounter for policies π^st and π^ctr. Panels: a) st, b) ctr.

The results of the Monte Carlo analysis are summarized in Table 5, which reports, for each environment, the mean value (mean) and standard deviation (std) of the final spacecraft mass m_f, terminal position error e_r, and velocity error e_v; the success rate SR, that is, the percentage of trajectories whose terminal constraint violation e_{s,N} is lower than the final prescribed tolerance ε_T; and the tolerance ε_B that would be required to obtain a success rate equal to 68.3% (1σ), 95.5% (2σ), and 99.7% (3σ), respectively.

Table 5  Results of the Monte Carlo simulations

              m_f, kg          e_r [10^-3]      e_v [10^-3]             ε_B [10^-3]
  Policy      Mean     Std     Mean     Std     Mean     Std     SR, %   1σ      2σ      3σ
  π^st        581.34   8.41    0.63     0.25    0.022    0.15    91.5    0.74    1.10    1.85
  π^obs       582.50   9.34    0.66     0.26    0.044    0.26    89.1    0.78    1.15    2.65
  π^ctr       590.61   0.80    0.66     0.32    0.016    0.12    84.4    0.80    1.24    1.57
  π^mte,1     583.55   7.12    0.76     0.57    0.10     0.47    81.9    0.67    2.27    3.19
  π^mte,2     573.88   4.22    1.09     1.27    0.46     2.49    62.8    1.24    5.87    18.5
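For reference, these statistics can be computed from the raw Monte Carlo outcomes along the following lines (a minimal sketch; the array of terminal violations e_s and the tolerance eps_T are the only assumed inputs):

```python
import numpy as np

def monte_carlo_stats(e_s, eps_T=1e-3):
    """Aggregate terminal constraint violations from a Monte Carlo campaign.

    e_s   : terminal constraint violations e_{s,N}, one entry per test episode
    eps_T : prescribed terminal tolerance
    Returns the success rate SR (percent) and the radii eps_B of the balls
    containing 68.3%, 95.5%, and 99.7% of the solutions (the sigma contours).
    """
    e_s = np.asarray(e_s)
    success_rate = 100.0 * np.mean(e_s < eps_T)          # episodes meeting the tolerance
    eps_B = np.percentile(e_s, [68.3, 95.5, 99.7])       # 1-, 2-, and 3-sigma contour radii
    return success_rate, eps_B
```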
According to the obtained results, RL seems able to cope with all the proposed stochastic scenarios quite effectively. Indeed, despite the severity of the considered perturbations/uncertainties, the success rate is rather high, over 81% in most cases and up to 91.5% when only additive Gaussian state perturbations are considered. These results were obtained with a uniform control grid, with maneuvering points spaced roughly 9 days apart. Further improvements are expected by introducing additional control points immediately before Mars arrival (e.g., at −5, −3, and −1 days before the encounter), as done in more traditional guidance schemes. On the other hand, the preliminary results found with policy π^mte,2 indicate that RL performance deteriorates substantially in the presence of the considered multiple-step distribution of MTEs. By looking at Table 5, it is clear that in most cases (62.8%) the policy manages to recover from the engine failure and meets the final constraints within the imposed tolerance. However, in the remaining (unfortunate) scenarios, the MTE occurs at some crucial point of the trajectory, that is, near the final time, and/or it lasts for three consecutive time steps, which correspond to one month of complete absence of thrust. In these cases, the policy is not able to compensate for the missing Δvs in any way, and, consequently, the terminal constraints cannot be met. This fact is confirmed by the high variance on the violation of the terminal constraints, as well as by the high value of the 2σ and 3σ contour radii compared with the other solutions.

For the sake of completeness, the results obtained by running policy π^unp in the state-perturbed stochastic environment are reported in Fig. 5f. Even though the differences between the two reference trajectories corresponding to policies π^unp and π^st seem minimal (see Fig. 3), the effects on the closed-loop simulations are apparent. Indeed, policy π^unp fails to reach Mars with the required accuracy in all simulations, whereas policy π^st succeeds in 91.5% of the cases. Similar pictures are obtained by running policy π^unp in any of the other proposed stochastic environments and are here omitted for the sake of brevity. In particular, the success rate obtained by π^unp is zero in both the state- and observation-uncertain environments, is 18.1% in the case of control errors on thrust magnitude and direction, and is less than twice that value in the single-step (30.6%) and multiple-step (28.2%) MTE environments. The increase in robustness obtained by training the agent on the corresponding stochastic environments is noticeable.
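For concreteness, the kind of missed-thrust distribution used in the single-step and multiple-step MTE environments just discussed can be sketched as follows (illustrative only; the exact distribution adopted in training is defined in the earlier sections of the paper):

```python
import numpy as np

def sample_mte(n_steps, multi_step=False, rng=None):
    """Draw one missed-thrust event per episode, as described in the text:
    a single MTE always occurs, lasting one control step in the mte,1 case
    and up to three consecutive steps (about one month) in the mte,2 case."""
    rng = rng or np.random.default_rng()
    start = int(rng.integers(0, n_steps))                      # node at which the engine shuts off
    duration = int(rng.integers(1, 4)) if multi_step else 1    # number of thrust-free control steps
    return start, duration
```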

As a limitation of the current approach, each robust policy was obtained by training the agent repeatedly in one specific environment only, in order to learn the underlying uncertainty distribution. Hence, the agent may not be able to react correctly to unexpected events, which correspond to unseen environments. This is apparent if one looks at the robust trajectory obtained with policy π^mte,1: in the last few points the agent has not experienced the MTE yet and, expecting it, forces a suboptimal trajectory, as, during training, it has learned that at least one MTE should always occur. As a result, the final error in position is greater than the imposed accuracy. To address this issue, a meta-learning approach, where the agent is concurrently trained on multiple environments with different underlying uncertainty distributions [48], could be useful and will be considered in future work.

As a final comment, the computational burden in RL is almost entirely concentrated in the training phase, which is performed on the ground before the start of the mission and can thus be carried out on high-performance hardware. The evaluation of the control law is instead extremely fast, as it mostly involves sequences of matrix multiplications, and can be safely carried out at a high sample rate even on the low-performance flight hardware that is expected to be mounted on micro-spacecraft due to budgetary constraints. This is a major advantage over computationally intensive guidance methods, such as MPC, where, at each time step, the whole problem solution must be recomputed onboard. On the other hand, MPC excels at addressing OCPs involving multiple terminal and/or path constraints, and provides the best-quality solutions. In this respect, it is also worth noticing that the presence of multiple control/state path constraints is not a concern for the effectiveness of RL. In fact, the agent receives an immediate (negative) reward whenever it does not meet the imposed constraints. Instead, teaching the agent to effectively cope with terminal constraints is a much more challenging task. In this case, the cumulative effect of all actions can be judged only at the end of the episode, as the agent receives a delayed (negative) reward in response to the constraint violation. The combined effect of delayed rewards and random initialization of the DNN's parameters may significantly slow down the convergence and, consequently, degrade the quality of the solution attained after a fixed number of training steps. The use of the ε-constraint relaxation technique is a first attempt at improving the efficiency of the learning process in the case of problems with terminal constraints. Yet, further investigations need to be targeted at studying this aspect in more detail.
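As a purely illustrative sketch of the idea (the actual reward shaping and tolerance-tightening schedule used in this work are defined in the earlier sections and are not reproduced here; the penalty weight and decay factor below are assumptions), an ε-relaxed terminal reward can be written as:

```python
def terminal_reward(e_s, eps, mass_fraction, k_penalty=10.0):
    """Illustrative epsilon-relaxed terminal reward (not the paper's exact shaping).

    e_s           : terminal constraint violation at the end of the episode
    eps           : current relaxed tolerance, progressively tightened toward eps_T
    mass_fraction : final-to-initial mass ratio, rewarding propellant savings
    """
    violation = max(0.0, e_s - eps)          # violations below eps are not penalized
    return mass_fraction - k_penalty * violation

def tighten(eps, eps_T=1e-3, decay=0.99):
    """One possible schedule: geometric decay of the relaxed tolerance toward eps_T."""
    return max(eps_T, decay * eps)
```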
V. Conclusions

This paper presented a deep Reinforcement Learning (RL) framework to deal with the robust design of low-thrust interplanetary trajectories in the presence of different sources of uncertainty. To this aim, the Stochastic Optimal Control Problem (SOCP) was first reformulated as a time-discrete Markov Decision Process (MDP). Then, a state-of-the-art RL algorithm, named "Proximal Policy Optimization" (PPO), was adopted for the problem solution, outlining those features that made it a strong candidate over similar policy-gradient methods. Preliminary numerical results were reported for a three-dimensional time-fixed Earth–Mars mission, by considering separately the effect of different types of uncertainties, namely, uncertainties on the dynamic model and on the observations, errors in the applied control, as well as the presence of a variable-duration Missed-Thrust Event (MTE).

The obtained results show the ability of PPO to solve simple interplanetary transfer problems, such as the time-fixed minimum-propellant mission considered here, in both deterministic and stochastic scenarios. The solution found in the deterministic case is in good agreement with the optimal solution provided by an indirect method. However, the high computational cost necessary to train the neural network discourages the use of a model-free RL algorithm in this simple circumstance. The power of RL becomes apparent when dealing with SOCPs, where traditional methods are cumbersome, impractical, or simply impossible to apply. Although the reported results are only preliminary, the presented solutions seem very promising in terms of trajectory robustness and total propellant consumption. Terminal constraints are effectively enforced by using the ε-constraint relaxation strategy. The methodology proposed here is quite general and can be used, with the appropriate changes, to cope with a variety of spacecraft missions and state/control constraints. As an example, time-optimal and/or multi-revolution orbital transfers could be investigated too. However, some care is required when dealing with time-free missions so as to retain the episodic setting. Instead, extension to arbitrary stochastic dynamic models (e.g., with possibly complex non-Gaussian perturbations) is straightforward. This is a major advantage with respect to other techniques presented in the literature, which are based on ad hoc extensions of traditional optimal control methods.

The preliminary results proposed here pave the way for RL approaches in the field of robust design of interplanetary trajectories. Of course, additional work is required in order to increase both the efficiency of the learning process and the reliability of the obtained solutions. Specifically, the high computational cost associated with the training procedure calls for the use of asynchronous algorithms, where the two processes of policy rollout (for collecting experience) and policy update (for learning) run in parallel, aiming at exploiting as best as possible the massive parallelization allowed by high-performance computing clusters and graphics processing units. Also, the use of recurrent neural networks should be investigated when dealing with non-Markov dynamic processes, as in the case of partial observability and multiple, correlated MTEs. However, the most crucial point seems to be enhancing the constraint-handling capability of current RL algorithms. The adoption of the ε-constraint relaxation is a modest contribution that goes in that direction. More advanced formulations of the problem, such as constrained Markov decision processes, should be investigated in the future for this purpose.

References

[1] Campagnola, S., Ozaki, N., Sugimoto, Y., Yam, C. H., Chen, H., Kawabata, Y., Ogura, S., Sarli, B., Kawakatsu, Y., Funase, R., and Nakasuka, S., "Low-Thrust Trajectory Design and Operations of PROCYON, the First Deep-Space Micro-Spacecraft," 25th International Symposium on Space Flight Dynamics, German Space Operations Center (GSOC) and the European Space Operations Centre (ESOC), Paper 72, 2015.
[2] Asmar, S., and Matousek, S., "Mars Cube One (MarCO): The First Planetary Cubesat Mission," Proceedings of the Mars CubeSat/NanoSat Workshop, Vol. 20, NASA JPL, Pasadena, CA, 2014, pp. 1–21.
[3] Walker, R., Koschny, D., Bramanti, C., Carnelli, I., and ESA CDF Study Team, "Miniaturised Asteroid Remote Geophysical Observer (M-ARGO): A Stand-Alone Deep Space CubeSat System for Low-Cost Science and Exploration Missions," 6th Interplanetary CubeSat Workshop, Paper 2017.A.3.1, Cambridge, England, U.K., 2017.
[4] Casalino, L., and Colasurdo, G., "Optimization of Variable-Specific-Impulse Interplanetary Trajectories," Journal of Guidance, Control, and Dynamics, Vol. 27, No. 4, 2004, pp. 678–684. https://doi.org/10.2514/1.11159
[5] Scheel, W. A., and Conway, B. A., "Optimization of Very-Low-Thrust, Many-Revolution Spacecraft Trajectories," Journal of Guidance, Control, and Dynamics, Vol. 17, No. 6, 1994, pp. 1185–1192. https://doi.org/10.2514/3.21331
[6] Graham, K. F., and Rao, A. V., "Minimum-Time Trajectory Optimization of Multiple Revolution Low-Thrust Earth-Orbit Transfers," Journal of Spacecraft and Rockets, Vol. 52, No. 3, 2015, pp. 711–727. https://doi.org/10.2514/1.A33187
[7] Wang, Z., and Grant, M. J., "Minimum-Fuel Low-Thrust Transfers for Spacecraft: A Convex Approach," IEEE Transactions on Aerospace and Electronic Systems, Vol. 54, No. 5, 2018, pp. 2274–2290. https://doi.org/10.1109/TAES.2018.2812558
[8] Wang, Z., and Grant, M. J., "Optimization of Minimum-Time Low-Thrust Transfers Using Convex Programming," Journal of Spacecraft and Rockets, Vol. 55, No. 3, 2018, pp. 586–598. https://doi.org/10.2514/1.A33995
[9] Eren, U., Prach, A., Koçer, B. B., Raković, S. V., Kayacan, E., and Açıkmeşe, B., "Model Predictive Control in Aerospace Systems: Current State and Opportunities," Journal of Guidance, Control, and Dynamics, Vol. 40, No. 7, 2017, pp. 1541–1566. https://doi.org/10.2514/1.G002507
[10] Benedikter, B., Zavoli, A., Colasurdo, G., Pizzurro, S., and Cavallini, E., "Autonomous Upper Stage Guidance Using Convex Optimization and Model Predictive Control," ASCEND 2020, AIAA Paper 2020-4268, 2020. https://doi.org/10.2514/6.2020-4268
[11] Federici, L., Benedikter, B., and Zavoli, A., "Machine Learning Techniques for Autonomous Spacecraft Guidance During Proximity Operations," AIAA Scitech 2021 Forum, AIAA Paper 2021-0668, 2021. https://doi.org/10.2514/6.2021-0668
[12] Laipert, F. E., and Longuski, J. M., "Automated Missed-Thrust Propellant Margin Analysis for Low-Thrust Trajectories," Journal of Spacecraft and Rockets, Vol. 52, No. 4, 2015, pp. 1135–1143. https://doi.org/10.2514/1.A33264

[13] Oguri, K., and McMahon, J. W., "Risk-Aware Trajectory Design with Impulsive Maneuvers: Convex Optimization Approach," Advances in the Astronautical Sciences, Vol. 171, 2020, pp. 1985–2004.
[14] Oguri, K., and McMahon, J. W., "Risk-Aware Trajectory Design with Continuous Thrust: Primer Vector Theory Approach," Advances in the Astronautical Sciences, Vol. 171, 2020, pp. 2049–2067.
[15] Ozaki, N., Campagnola, S., Funase, R., and Yam, C. H., "Stochastic Differential Dynamic Programming with Unscented Transform for Low-Thrust Trajectory Design," Journal of Guidance, Control, and Dynamics, Vol. 41, No. 2, 2018, pp. 377–387. https://doi.org/10.2514/1.G002367
[16] Ozaki, N., Campagnola, S., and Funase, R., "Tube Stochastic Optimal Control for Nonlinear Constrained Trajectory Optimization Problems," Journal of Guidance, Control, and Dynamics, Vol. 43, No. 4, 2020, pp. 645–655. https://doi.org/10.2514/1.G004363
[17] Di Carlo, M., Vasile, M., Greco, C., and Epenoy, R., "Robust Optimisation of Low-Thrust Interplanetary Transfers Using Evidence Theory," Advances in the Astronautical Sciences, Vol. 168, 2019, pp. 339–358.
[18] Greco, C., Di Carlo, M., Vasile, M., and Epenoy, R., "An Intrusive Polynomial Algebra Multiple Shooting Approach to the Solution of Optimal Control Problems," Proceedings of the International Astronautical Congress (IAC), International Astronautical Federation (IAF), Paris, France, Oct. 2018, pp. 1–11.
[19] Greco, C., Di Carlo, M., Vasile, M., and Epenoy, R., "Direct Multiple Shooting Transcription with Polynomial Algebra for Optimal Control Problems Under Uncertainty," Acta Astronautica, Vol. 170, May 2020, pp. 224–234. https://doi.org/10.1016/j.actaastro.2019.12.010
[20] Greco, C., Campagnola, S., and Vasile, M. L., "Robust Space Trajectory Design Using Belief Stochastic Optimal Control," AIAA Scitech 2020 Forum, AIAA Paper 2020-1471, 2020. https://doi.org/10.2514/6.2020-1471
[21] Greco, C., and Vasile, M., "Closing the Loop Between Mission Design and Navigation Analysis," 71st International Astronautical Congress (IAC)—The CyberSpace Edition, International Astronautical Federation (IAF), Paris, France, Oct. 2020, pp. 1–13.
[22] Izzo, D., Märtens, M., and Pan, B., "A Survey on Artificial Intelligence Trends in Spacecraft Guidance Dynamics and Control," Astrodynamics, Vol. 3, No. 4, 2019, pp. 287–299. https://doi.org/10.1007/s42064-018-0053-6
[23] Hornik, K., Stinchcombe, M., and White, H., "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks, Vol. 3, No. 5, 1990, pp. 551–560. https://doi.org/10.1016/0893-6080(90)90005-6
[24] Izzo, D., Öztürk, E., and Märtens, M., "Interplanetary Transfers via Deep Representations of the Optimal Policy and/or of the Value Function," Proceedings of the Genetic and Evolutionary Computation Conference Companion, Assoc. for Computing Machinery, New York, 2019, pp. 1971–1979. https://doi.org/10.1145/3319619.3326834
[25] Sanchez-Sanchez, C., Izzo, D., and Hennes, D., "Learning the Optimal State-Feedback Using Deep Networks," 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Inst. of Electrical and Electronics Engineers, New York, 2016, pp. 1–8. https://doi.org/10.1109/ssci.2016.7850105
[26] Cheng, L., Wang, Z., Jiang, F., and Li, J., "Fast Generation of Optimal Asteroid Landing Trajectories Using Deep Neural Networks," IEEE Transactions on Aerospace and Electronic Systems, Vol. 56, No. 4, 2020, pp. 2642–2655. https://doi.org/10.1109/TAES.2019.2952700
[27] Cheng, L., Wang, Z., Song, Y., and Jiang, F., "Real-Time Optimal Control for Irregular Asteroid Landings Using Deep Neural Networks," Acta Astronautica, Vol. 170, May 2020, pp. 66–79. https://doi.org/10.1016/j.actaastro.2019.11.039
[28] Shi, Y., and Wang, Z., "A Deep Learning-Based Approach to Real-Time Trajectory Optimization for Hypersonic Vehicles," AIAA Scitech 2020 Forum, AIAA Paper 2020-0023, 2020. https://doi.org/10.2514/6.2020-0023
[29] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015, https://www.tensorflow.org/.
[30] Ross, S., Gordon, G., and Bagnell, D., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Assoc. for Computing Machinery, New York, 2011, pp. 627–635.
[31] Furfaro, R., Bloise, I., Orlandelli, M., Di Lizia, P., Topputo, F., and Linares, R., "Deep Learning for Autonomous Lunar Landing," Advances in the Astronautical Sciences, Vol. 167, 2018, pp. 3285–3306.
[32] Rubinsztejn, A., Sood, R., and Laipert, F. E., "Neural Network Optimal Control in Astrodynamics: Application to the Missed Thrust Problem," Acta Astronautica, Vol. 176, Nov. 2020, pp. 192–203. https://doi.org/10.1016/j.actaastro.2020.05.027
[33] Ng, A. Y., and Russell, S. J., "Algorithms for Inverse Reinforcement Learning," Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000, pp. 663–670. https://doi.org/10.5555/645529.657801
[34] Gaudet, B., Linares, R., and Furfaro, R., "Deep Reinforcement Learning for Six Degree-of-Freedom Planetary Landing," Advances in Space Research, Vol. 65, No. 7, 2020, pp. 1723–1741. https://doi.org/10.1016/j.asr.2019.12.030
[35] Scorsoglio, A., Furfaro, R., Linares, R., and Massari, M., "Actor-Critic Reinforcement Learning Approach to Relative Motion Guidance in Near-Rectilinear Orbit," Advances in the Astronautical Sciences, Vol. 168, 2019, pp. 1737–1756.
[36] Sullivan, C. J., and Bosanac, N., "Using Reinforcement Learning to Design a Low-Thrust Approach into a Periodic Orbit in a Multi-Body System," AIAA Scitech 2020 Forum, AIAA Paper 2020-1914, 2020. https://doi.org/10.2514/6.2020-1914
[37] LaFarge, N. B., Miller, D., Howell, K. C., and Linares, R., "Guidance for Closed-Loop Transfers Using Reinforcement Learning with Application to Libration Point Orbits," AIAA Scitech 2020 Forum, AIAA Paper 2020-0458, 2020. https://doi.org/10.2514/6.2020-0458
[38] Broida, J., and Linares, R., "Spacecraft Rendezvous Guidance in Cluttered Environments via Reinforcement Learning," Advances in the Astronautical Sciences, Vol. 168, 2019, pp. 1777–1788.
[39] Cheng, L., Wang, Z., Jiang, F., and Zhou, C., "Real-Time Optimal Control for Spacecraft Orbit Transfer via Multiscale Deep Neural Networks," IEEE Transactions on Aerospace and Electronic Systems, Vol. 55, No. 5, 2019, pp. 2436–2450. https://doi.org/10.1109/TAES.2018.2889571
[40] Izzo, D., and Öztürk, E., "Real-Time Guidance for Low-Thrust Transfers Using Deep Neural Networks," Journal of Guidance, Control, and Dynamics, Vol. 44, No. 2, 2021, pp. 315–327. https://doi.org/10.2514/1.G005254
[41] Holt, H., Armellin, R., Scorsoglio, A., and Furfaro, R., "Low-Thrust Trajectory Design Using Closed-Loop Feedback-Driven Control Laws and State-Dependent Parameters," AIAA Scitech 2020 Forum, AIAA Paper 2020-1694, 2020. https://doi.org/10.2514/6.2020-1694
[42] Arora, L., and Dutta, A., "Reinforcement Learning for Sequential Low-Thrust Orbit Raising Problem," AIAA Scitech 2020 Forum, AIAA Paper 2020-2186, 2020. https://doi.org/10.2514/6.2020-2186
[43] Miller, D., Englander, J. A., and Linares, R., "Interplanetary Low-Thrust Design Using Proximal Policy Optimization," Advances in the Astronautical Sciences, Vol. 171, 2020, pp. 1575–1592.
[44] Rubinsztejn, A., Bryan, K., Sood, R., and Laipert, F., "Using Reinforcement Learning to Design Missed Thrust Resilient Trajectories," 2020 AAS/AIAA Astrodynamics Specialist Conference, AAS Paper 20-453, 2020.
[45] Takahama, T., and Sakai, S., "Constrained Optimization by the ε Constrained Differential Evolution with an Archive and Gradient-Based Mutation," 2010 IEEE Congress on Evolutionary Computation (CEC), IEEE Publ., Piscataway, NJ, 2010, pp. 1–9. https://doi.org/10.1109/CEC.2010.5586484
[46] Federici, L., Benedikter, B., and Zavoli, A., "EOS: A Parallel, Self-Adaptive, Multi-Population Evolutionary Algorithm for Constrained Global Optimization," 2020 IEEE Congress on Evolutionary Computation (CEC), IEEE Publ., Piscataway, NJ, 2020, pp. 1–10. https://doi.org/10.1109/CEC48606.2020.9185800
[47] Lantoine, G., and Russell, R. P., "A Hybrid Differential Dynamic Programming Algorithm for Constrained Optimal Control Problems. Part 2: Application," Journal of Optimization Theory and Applications, Vol. 154, No. 2, 2012, pp. 418–442. https://doi.org/10.1007/s10957-012-0038-1

[48] Gaudet, B., Linares, R., and Furfaro, R., "Adaptive Guidance and Integrated Navigation with Reinforcement Meta-Learning," Acta Astronautica, Vol. 169, April 2020, pp. 180–190. https://doi.org/10.1016/j.actaastro.2020.01.007
[49] Sims, J. A., and Flanagan, S. N., "Preliminary Design of Low-Thrust Interplanetary Missions," Advances in the Astronautical Sciences, Vol. 103, No. 1, 2000, pp. 583–592.
[50] Bate, R. R., Mueller, D. D., and White, J. E., Fundamentals of Astrodynamics, Dover, New York, 1971.
[51] Imken, T., Randolph, T., DiNicola, M., and Nicholas, A., "Modeling Spacecraft Safe Mode Events," 2018 IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, 2018, pp. 1–13. https://doi.org/10.1109/AERO.2018.8396383
[52] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O., "Proximal Policy Optimization Algorithms," arXiv preprint arXiv:1707.06347, 2017.
[53] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M., "Playing Atari with Deep Reinforcement Learning," arXiv preprint arXiv:1312.5602, 2013.
[54] Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Boston, MA, 2018.
[55] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K., "Asynchronous Methods for Deep Reinforcement Learning," International Conference on Machine Learning, 2016, pp. 1928–1937.
[56] Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y., "Stable Baselines," 2018, https://github.com/hill-a/stable-baselines.
[57] Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A., "Implementation Matters in Deep RL: A Case Study on PPO and TRPO," 7th International Conference on Learning Representations (ICLR), New Orleans, LA, 2019.
[58] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M., "Optuna: A Next-Generation Hyperparameter Optimization Framework," Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Assoc. for Computing Machinery, New York, 2019, pp. 2623–2631. https://doi.org/10.1145/3292500.3330701
[59] Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B., "Algorithms for Hyper-Parameter Optimization," Proceedings of the 24th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, 2011, pp. 2546–2554. https://doi.org/10.5555/2986459.2986743
[60] Izzo, D., "PyKEP: Open Source Tools for Massively Parallel Optimization in Astrodynamics (the Case of Interplanetary Trajectory Optimization)," Proceedings of the 5th International Conference on Astrodynamics Tools and Techniques (ICATT), European Space Agency (ESA), Noordwijk, Netherlands, 2012.
[61] Liu, B., Cai, Q., Yang, Z., and Wang, Z., "Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy," Advances in Neural Information Processing Systems, Vol. 32, 2019, pp. 1–42.
[62] Colasurdo, G., Zavoli, A., Longo, A., Casalino, L., and Simeoni, F., "Tour of Jupiter Galilean Moons: Winning Solution of GTOC6," Acta Astronautica, Vol. 102, Sept. 2014, pp. 190–199. https://doi.org/10.1016/j.actaastro.2014.06.003
[63] Tanaka, K., "An Approach to Stability Criteria of Neural-Network Control Systems," IEEE Transactions on Neural Networks, Vol. 7, No. 3, 1996, pp. 629–642. https://doi.org/10.1109/72.501721
