Nonlinear Dyn (2023) 111:17037–17077

https://doi.org/10.1007/s11071-023-08725-y

ORIGINAL PAPER

A deep reinforcement learning control approach for high-performance aircraft

Agostino De Marco · Paolo Maria D'Onza · Sabato Manfredi

Received: 26 September 2022 / Accepted: 25 May 2023 / Published online: 2 August 2023
© The Author(s) 2023

Abstract  This research introduces a flight controller for a high-performance aircraft, able to follow randomly generated sequences of waypoints, at varying altitudes, in various types of scenarios. The study assumes a publicly available six-degree-of-freedom (6-DoF) rigid aeroplane flight dynamics model of a military fighter jet. Consolidated results in artificial intelligence and deep reinforcement learning (DRL) research are used to demonstrate the capability to make certain manoeuvres AI-based fully automatic for a high-fidelity nonlinear model of a fixed-wing aircraft. This work investigates the use of a deep deterministic policy gradient (DDPG) controller agent, based on the successful applications of the same approach to other domains. In the particular application to flight control presented here, the effort has been focused on the design of a suitable reward function used to train the agent to achieve some given navigation tasks. The trained controller is successful on highly coupled manoeuvres, including rapid sequences of turns, at both low and high flight Mach numbers, in simulations reproducing a prey–chaser dogfight scenario. Robustness to sensor noise, atmospheric disturbances, different initial flight conditions and varying reference signal shapes is also demonstrated.

Keywords  Deep reinforcement learning · Flight dynamics · UCAV · Aeroplane controllability · Nonlinear control

A. De Marco · P. M. D'Onza
Department of Industrial Engineering (DII), Università degli Studi di Napoli Federico II, Via Claudio, 21, 80125 Naples, Italy
e-mail: agostino.demarco@unina.it

P. M. D'Onza
e-mail: p.donza@studenti.unina.it

S. Manfredi
Department of Electrical Engineering and Information Technology (DIETI), Università degli Studi di Napoli Federico II, Via Claudio, 21, 80125 Naples, Italy
e-mail: sabato.manfredi@unina.it

1 Introduction

Within the next decade, we will witness an exponential increase in the use of artificial intelligence to support the operations of all sorts of flying machines. The interest will range from military applications to the growing civil aviation market, which will also include the new category of personal flying vehicles. This research makes use of consolidated results in artificial intelligence and deep reinforcement learning research to demonstrate the AI-based flight control of a high-performance fixed-wing aircraft, taken as an example of an aerial platform exhibiting nonlinear behaviour.

1.1 Nonlinear dynamics in fixed-wing aircraft and established engineering approach to flight control

Many current fixed-wing aerial vehicles display some nonlinear dynamics, such as inertial coupling, or aerodynamic and propulsive nonlinearities. Future aircraft are likely to have even more significant nonlinearities in their flight dynamics. For example, to enhance their agility, some unmanned aerial vehicle (UAV) designs will feature bio-inspired morphing wings that significantly change their shapes. High-altitude long-endurance UAVs, as well as future commercial airliner designs, will have high-aspect-ratio, highly flexible lifting surfaces to improve their efficiency. Airbreathing hypersonic vehicles (AHV) are yet another example of plants exhibiting significant cross-coupling, nonlinearity, non-minimum phase behaviour, and model uncertainty, even more so than traditional aircraft. All these designs make flight control a challenging problem.

An established engineering approach to designing flight controllers is to rely on gain scheduling, switching between different linear controllers that were tuned for a known set of specific operating points [1]. Because of their dependence on linearized plant dynamics, these gain scheduling techniques must be carefully designed to accomplish their tasks across the whole mission envelope. In practice, the classical linear controllers are generally used in a conservative fashion, by setting constraints on the flight envelope and limiting the operability of the aerial vehicles. On the other hand, nonlinear control algorithms can be designed to attain the advantage of using the full flight envelope, especially for unconventional and disruptive aircraft configurations. Model-free adaptive control as well as intelligent control techniques provide the possibility to replace the classical controller structure with a more general approach, which turns out to be also fault-tolerant to changes in model dynamics [2].

1.2 Intelligent control approaches

Many authors propose intelligent control approaches that combine artificial intelligence and automatic control methodologies such as adaptive neural control [3] and fuzzy control [4]. Examples of application to flying vehicles include adaptive neural control of AHVs, integral sliding mode control, and stochastic adaptive attitude control [5], as well as prescribed performance dynamic neural network control of flexible hypersonic vehicles [6]. Yet these methods assume a simplified motion dynamic model that considers only the longitudinal dynamics and an affine form of the control input. An alternative adaptive neural control method has been proposed in [7] in order to better reflect the characteristics of the actual plant model, ensuring the reliability of the designed control laws, while still considering a reduced 3-DoF motion model.

Reinforcement learning (RL) is a branch of machine learning that offers an approach to intelligent control inspired by the way biological organisms learn their tasks. RL is quite a large subject and provides the means to design a wide range of control strategies. In essence, an RL-based control task is performed by an agent—that is, a piece of software—able to learn a control policy through trial and error. The agent learns by interacting with its environment, which includes the plant under study [8]. In its early implementations, RL was based on tabular methods with discrete action and state spaces and gained quite a lot of popularity for its applications to gaming environments. One notable early usage example of RL in the aviation domain has been the control of glider soaring [9]. That study demonstrated that a flight control policy able to effectively gain lift from ascending thermal plumes could be successfully learnt.

1.3 Deep reinforcement learning as a promising field for nonlinear control

To address the curse of dimensionality encountered in RL approaches when making the action and state spaces larger, function approximators (typically neural networks, NNs) have been used in the past decade in various fields to enable continuous control, with agents implemented as structures known as actor-critic. The controller agent has a policy function and a value function to interact with its environment. By observing the state of the environment, the agent performs an action based on its current policy to change the environment state and, consequently, receives a reward. Both the policy function and the value function are represented by neural networks that have to be made optimal by the end of the learning process. The policy function is updated in a gradient ascent manner according to the estimated advantage of performing the action; the value function is updated in a supervised fashion to estimate the advantage based on the collected rewards.
The nonlinear activation functions in neural networks can represent a highly nonlinear transfer from inputs to outputs, and deep reinforcement learning (DRL) is in fact a promising field for nonlinear control. Through the actor-critic scheme, an agent can learn to control an aircraft model by maximizing the expected reward, and the reward calculation function is a crucial aspect of the agent training process. Particular forms of possible reward functions for the aircraft control scenarios studied in this work are discussed in the article.

Several novel implementations of the actor-critic paradigm have been recently introduced, and most of them fall under the category of DRL approaches. Their ability to perform end-to-end offline learning has made it possible to design agents that surpass human performance, as demonstrated with different Atari games [10]. In that study, the agent observes an image of the game, a high-dimensional space, and learns a policy using a deep Q-network (DQN) for its discrete action space. DRL was extended to continuous-time control applications with the off-policy deep deterministic policy gradient (DDPG) algorithm [11]. For its actor-network policy update, the DDPG algorithm uses a critic network that provides a value function approximator. This method also estimates the temporal difference of the action-state value function as well as the state value function during learning and is capable of reducing the variance of the gradient estimation.

DDPG flight control applications have been limited to small-scale flying wings [12] and quadcopters [13]. An on-policy RL algorithm known as proximal policy optimization (PPO) [14] has been introduced in order to reduce DDPG's learning instability. The PPO showed improved policy convergence when applied to the flight control of an unmanned flying-wing aircraft [15]. A refined algorithm was proposed in [16] to control underwater glider attitude based on a combination of an active disturbance rejection control (ADRC) algorithm and a natural actor-critic (NAC) algorithm. A recently developed actor-critic agent structure for closed-loop attitude control of a UAV is discussed in [17], which addresses the optimal tracking problem of continuous-time nonlinear systems with input constraints. The work introduces a novel tuning law for the critic's neural network parameters that can improve the learning rate.

1.4 Applications to unmanned combat aerial vehicles (UCAVs)

DRL has a distinct advantage over other approaches to flight control in its ability to perceive information and learn effectively. This makes it an ideal method for solving complex, high-dimensional sequential decision-making problems, such as those encountered in air combat [18]. In modern aerial combat with unmanned combat aerial vehicles (UCAVs), short-range aerial combat (dogfight) is still considered an important topic [19,20]. The USA has planned a roadmap by 2030 for the formation flight of UCAVs [21]. In the roadmap, such a UAV mission is expected to be complex and multifaceted, requiring precise waypoint-line tracking and advanced formation flight algorithms to enable UCAVs to execute it effectively. In complex formation flights such as aerial combat situations, the ability of a UCAV to quickly adjust its flight attitude and path is crucial. This ability is often referred to as flight agility, which involves the rapid and seamless transition between different flight states.

To simulate aerial combat scenarios involving unmanned vehicles, Wang et al. [22] extended their basic manoeuvre library to account for the enhanced manoeuvring capabilities of UCAVs. They also proposed a robust decision-making method for manoeuvring under incomplete target information. Additionally, reinforcement learning [19,23] was explored as a machine learning approach to generate strategies, and McGrew [24] utilized the approximate dynamic programming (ADP) approach to solve a two-dimensional aerial combat problem involving a single UCAV acting as an agent and learning near-optimal actions through interaction with the combat environment.

Researchers have also been exploring the potential of combining DRL with UCAVs' air combat decision-making [18,20,25]. For instance, Li [20] proposed a model for UCAV autonomous manoeuvre decision-making in short-range air combat, based on the multi-step double deep Q-network (MS-DDQN) algorithm. To address the limitations of traditional methods, such as poor flexibility and weak decision-making ability, some researchers have proposed the use of deep learning for manoeuvring [18]. Liu et al. [25] proposed a multi-UCAV cooperative decision-making method based on a multi-agent proximal policy optimization (MAPPO) algorithm. DRL has been used by Hu et al. [26] to plan beyond-visual-range air combat tactics and by Wang et al. [27] to quantify the relationship between the flight agility of a UCAV and its short-range aerial combat effectiveness. A similar approach has also been presented by Yang et al. [28].

While the aforementioned approaches have demonstrated some success in simulating aerial combat scenarios, they have primarily focused on generating action strategies, and only a few have applied a six-degree-of-freedom (6-DoF) nonlinear aircraft model to the simulations. The mass point model that was used in some studies fails to accurately represent the flight performance of a high-order UCAV. Additionally, directly applying these complex algorithms to a 6-DoF nonlinear model has often been discarded because of its demanding computational requirements. To address this limitation, Shin et al. [29] proposed a basic flight manoeuvre (BFM)-based approach to guide the attacker and simulate a 6-DoF nonlinear UCAV model in an aerial combat environment. These approaches combine a 3D point mass model with a control law design based on nonlinear dynamic inversion (NDI) to control the three-dimensional flight path of a 6-DoF nonlinear UCAV model [22].

1.5 Research contribution

This research introduces an effective nonlinear control application based on the DDPG actor-critic approach to the full-blown nonlinear, high-fidelity flight dynamics model (FDM) of a fighter jet. The use of this type of controller has been documented only for small-scale UAVs, and the present article addresses the application to the three-dimensional navigation of a high-performance UCAV. For this kind of application, the research does not assume a reduced-order model to train the agent, unlike the methods presented in [22] and [29]. In addition, unlike the BFM-based approaches, the proposed AI-based strategy generates the piloting commands to control directly the nonlinear 6-DoF UCAV model. Finally, extensive validation is carried out with a well-known realistic simulator. A high-fidelity FDM of the General Dynamics F-16 Fighting Falcon has been used, paving the way for future applications to other similar aircraft. The chosen FDM is publicly available and implemented for the JSBSim flight dynamics software library [30]. The simulation results presented in this work were obtained using JSBSim within the MATLAB/Simulink computational environment.

Moreover, the present effort deals with some modelling aspects in the aircraft state equations that are not commonly found in the literature (when DRL is applied to flight control). The underlying mathematical formulation of the 6-DoF controlled aircraft dynamics presented here is based on the attitude quaternion; hence, it is general and suitable for all kinds of flight task simulations.

This work validates an AI-based flight control approach in a complex, high-performance scenario, such as that of target following. The control is accomplished by an agent trained with a DRL algorithm, whose applications have been documented in the literature so far only for simpler models and less dynamic operative scenarios. In fact, the FDM implementation of the chosen aeroplane is highly nonlinear and realistic. The application examples shown at the end of the article demonstrate simulation scenarios where significant variations occur in the state space (for instance, altitude and flight Mach number changes) and multiple nonlinear effects come into play.

Finally, an additional contribution to the works available in the literature is the use of an established engineering simulation environment to validate the proposed approach. This, combined with the fact that the FDM includes realistic aerodynamics, propulsion, and FCS models, makes the agent validation closer to those learning applications occurring in experimental environments.

1.6 Article organization

The next section summarizes the main concepts regarding deep reinforcement learning and discusses how this approach has been employed to control an aircraft in atmospheric flight.

The mathematical background of the article is summarized in Sect. 2. The foundations of DRL applied to flight control and the chosen control approach are explained in Sect. 3, where the details about the reward shaping are also presented. Simulation test scenarios that validate the proposed control strategy are presented in Sect. 4. A discussion of the presented results is given in Sect. 5. Finally, the conclusions are presented in Sect. 6. The article also includes three appendices, A, B, and C, with further details on the aerodynamic, propulsive, and flight control system models of the F-16 aircraft.

Fig. 1  Earth-based NED frame FE and aircraft body-fixed frame FB. Velocity vector V of the aircraft gravity centre G with respect to the inertial observer. Ground track point PGT of the instantaneous gravity centre position (xE,G, yE,G, zE,G). Standard definitions of aircraft Euler angles (ψ, θ, φ) and flight path angles (γ, χ)

2 Mathematical background

2.1 Rigid aeroplane nonlinear 6-DoF flight dynamics model

The 6-DoF atmospheric motion of a rigid aeroplane is governed by a set of nonlinear ordinary differential equations. These equations are standard; therefore, the complete derivation is herein omitted for the sake of brevity—deferring the interested reader to the related references [1]. The derivation of such a system assumes a clear identification of a ground-based reference frame and an aircraft-based reference frame. The former is treated as an inertial frame, neglecting the effects of the rotational velocity of the Earth, which is a good approximation when describing the motion of aeroplanes in the lower regions of the Earth's atmosphere. This approximation is acceptable for subsonic as well as supersonic flight and fits the fighter jet control example considered in this research. These two main reference frames, besides other auxiliary frames, are shown in Fig. 1. The Earth-based inertial frame is named here FE = {OE; xE, yE, zE}, having its origin fixed to a convenient point OE on the ground and its plane xE yE tangent to the assumed Earth geometric model; the axis xE points towards the geographic North, the axis yE points towards the East, and the axis zE points downwards, towards the centre of the Earth. For this reason, the frame FE is also called tangent NED frame (North-East-Down). The aircraft body-fixed frame FB = {G; xB, yB, zB} has its origin located at the centre of gravity G of the aircraft; the roll axis xB runs along the fuselage and points out of the nose; the pitch axis yB points towards the right wing tip; the yaw axis zB points towards the belly of the fuselage.

The motion equations are derived from Newton's second law applied to the flight of an air vehicle, leading to six core scalar equations (the conservation laws of the linear and angular momentum projected onto the moving frame FB), followed by the flight path equations (used for navigation purposes, for tracking the aircraft flight in terms of centre-of-gravity position with respect to the Earth-based frame FE), and by the rigid-body motion kinematic equations (providing a relationship for the aircraft attitude quaternion, used for expressing the orientation of the body axes with respect to the inertial ground frame).

The full set of equations in closed form is introduced in this section.
The JSBSim flight dynamics library implements them in a more general form, which is beyond the scope of this article. However, the following presentation serves to highlight the intricacies and the nonlinearities inherent to the control problem explored by the present research.

2.1.1 Conservation of the linear momentum equations (CLMEs)

The conservation of the linear momentum equations (CLMEs) for a rigid aeroplane of constant mass provide the following three core scalar equations [1]:

\dot{u} = r\,v - q\,w + \frac{1}{m}\left(W_x + F_x^{(A)} + F_x^{(T)}\right)   (1a)
\dot{v} = -r\,u + p\,w + \frac{1}{m}\left(W_y + F_y^{(A)} + F_y^{(T)}\right)   (1b)
\dot{w} = q\,u - p\,v + \frac{1}{m}\left(W_z + F_z^{(A)} + F_z^{(T)}\right)   (1c)

where W is the aircraft weight force, F^{(A)} is the resultant aerodynamic force, and F^{(T)} is the resultant thrust force. Their components in the body frame FB are conveniently expressed to obtain a closed form of Eqs. (1a)–(1b)–(1c).

The aircraft weight is a vertical force vector, i.e. always aligned to the inertial axis zE, of constant magnitude mg, that is expressed in terms of its components in body axes as follows:

\{W_x,\; W_y,\; W_z\}^T = [T_{BE}]\,\{0,\; 0,\; mg\}^T = \{2(q_z q_x - q_0 q_y),\; 2(q_y q_z + q_0 q_x),\; q_0^2 - q_x^2 - q_y^2 + q_z^2\}^T\, mg   (2)

where [TBE] is the direction cosine matrix representing the instantaneous attitude of frame FB with respect to frame FE. The entries of [TBE] are functions of the aircraft attitude quaternion components (q0, qx, qy, qz) [1]:

[T_{BE}] = \begin{bmatrix} q_0^2+q_x^2-q_y^2-q_z^2 & 2(q_x q_y + q_0 q_z) & 2(q_z q_x - q_0 q_y) \\ 2(q_x q_y - q_0 q_z) & q_0^2-q_x^2+q_y^2-q_z^2 & 2(q_y q_z + q_0 q_x) \\ 2(q_z q_x + q_0 q_y) & 2(q_y q_z - q_0 q_x) & q_0^2-q_x^2-q_y^2+q_z^2 \end{bmatrix}   (3)

The instantaneous resultant aerodynamic force F^{(A)} acting on the air vehicle, when projected onto FB, is commonly expressed as follows [1]:

\{F_x^{(A)},\; F_y^{(A)},\; F_z^{(A)}\}^T = [T_{BW}]\,\{-D,\; -C,\; -L\}^T = \{-D\cos\alpha\cos\beta + L\sin\alpha + C\cos\alpha\sin\beta,\; -C\cos\beta - D\sin\beta,\; -D\sin\alpha\cos\beta - L\cos\alpha + C\sin\alpha\sin\beta\}^T   (4)

where the aerodynamic drag D, the aerodynamic cross force C and the aerodynamic lift L are involved to conveniently model the effect of the external airflow, and where

[T_{BW}] = \begin{bmatrix} \cos\alpha & 0 & -\sin\alpha \\ 0 & 1 & 0 \\ \sin\alpha & 0 & \cos\alpha \end{bmatrix} \begin{bmatrix} \cos\beta & \sin(-\beta) & 0 \\ -\sin(-\beta) & \cos\beta & 0 \\ 0 & 0 & 1 \end{bmatrix}   (5)

is the coordinate transformation matrix from the standard wind frame FW = {G; xW, yW, zW}, commonly used by aerodynamicists (see Fig. 2), to FB.

Equations (1a)–(1b)–(1c) are actually set in closed form because the aerodynamic angles (α, β) and the aerodynamic force components (D, C, L) appearing in expressions (4) are modelled as functions of aircraft state variables and of external inputs. According to Fig. 2, the state variables (u, v, w), being defined as components in FB of the aircraft gravity centre velocity vector V, are expressed in terms of (α, β) as follows:

u = V\cos\beta\cos\alpha, \quad v = V\sin\beta, \quad w = V\cos\beta\sin\alpha   (6)

with

V = \sqrt{u^2 + v^2 + w^2}   (7)

Consequently, the instantaneous angles of attack and of sideslip are given by

\alpha = \tan^{-1}\frac{w}{u}, \quad \beta = \sin^{-1}\frac{v}{\sqrt{u^2 + v^2 + w^2}}   (8)

The aerodynamic force components are expressed in terms of their aerodynamic coefficients according to the conventional formulas:

D = \tfrac{1}{2}\rho V^2 S\, C_D, \quad C = \tfrac{1}{2}\rho V^2 S\, C_C, \quad L = \tfrac{1}{2}\rho V^2 S\, C_L   (9)

where the external air density ρ is a known function of the flight altitude h = −zE,G (along with other gas properties, such as the sound speed a) [31], S is a constant reference area, and the coefficients (CD, CC, CL) are commonly modelled as functions of aircraft state variables and external inputs. Appendix A presents the details of a nonlinear aerodynamic model for the F-16 fighter jet.
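As a purely illustrative complement to Eqs. (4) and (6)–(9), the following Python/NumPy sketch (not part of the paper's MATLAB/Simulink toolchain; the sample numbers are assumptions) computes the airspeed and aerodynamic angles from the body-axis velocity components and rotates the wind-axis forces (D, C, L) into the body frame.

import numpy as np

def aero_angles(u, v, w):
    """Airspeed, angle of attack and sideslip angle from body-axis velocities, Eqs. (7)-(8)."""
    V = np.sqrt(u**2 + v**2 + w**2)
    alpha = np.arctan2(w, u)            # angle of attack
    beta = np.arcsin(v / V)             # sideslip angle
    return V, alpha, beta

def aero_force_body(D, C, L, alpha, beta):
    """Body-frame components of the aerodynamic force, Eqs. (4)-(5)."""
    sa, ca = np.sin(alpha), np.cos(alpha)
    sb, cb = np.sin(beta), np.cos(beta)
    T_BW = np.array([[ca, 0.0, -sa],
                     [0.0, 1.0, 0.0],
                     [sa, 0.0,  ca]]) @ np.array([[cb, -sb, 0.0],
                                                  [sb,  cb, 0.0],
                                                  [0.0, 0.0, 1.0]])
    return T_BW @ np.array([-D, -C, -L])

# Example with an assumed flight condition and assumed aerodynamic data (SI units)
V, alpha, beta = aero_angles(u=200.0, v=2.0, w=15.0)
qbar_S = 0.5 * 1.0 * V**2 * 28.0        # dynamic pressure times an assumed reference area
F_A = aero_force_body(D=0.02 * qbar_S, C=0.001 * qbar_S, L=0.30 * qbar_S, alpha=alpha, beta=beta)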

Fig. 2  (Left) Aerodynamic angles, aerodynamic (or stability) frame FA. (Right) Wind frame FW and aerodynamic forces (D, C, L)

Fig. 3  Thrust vector, thrust magnitude T, thrust line angle μT, thrust line eccentricity eT

Finally, according to Fig. 3, the resultant thrust force F^{(T)}, which is a vector of magnitude T, can be expressed in terms of its body-frame components, in the case of symmetric propulsion (zero y-component), as follows:

\{F_x^{(T)},\; F_y^{(T)},\; F_z^{(T)}\}^T = \delta_T\, T_{\max}(h, M)\,\{\cos\mu_T,\; 0,\; \sin\mu_T\}^T   (10)

where μT is a known constant angle formed by the thrust line in the aircraft symmetry plane with the reference axis xB, T = δT Tmax(h, M), δT is the throttle setting (an external input to the system), and Tmax(h, M) is the maximum available thrust, i.e. a known function of altitude and flight Mach number M = V/a. Appendix B presents the details of a nonlinear thrust model for the F-16 fighter jet.

2.1.2 Conservation of the angular momentum equations (CAMEs)

The conservation of the angular momentum equations (CAMEs) for a rigid aeroplane of constant mass are the following [1]:

\dot{p} = (C_1 r + C_2 p)\,q + C_3 \mathcal{L} + C_4 \mathcal{N}   (11a)
\dot{q} = C_5\, p\, r - C_6\,(p^2 - r^2) + C_7 \mathcal{M}   (11b)
\dot{r} = (C_8 p - C_2 r)\,q + C_4 \mathcal{L} + C_9 \mathcal{N}   (11c)

where

C_1 = \frac{1}{\Delta}\bigl[(I_{yy} - I_{zz}) I_{zz} - I_{xz}^2\bigr], \quad C_2 = \frac{1}{\Delta}\bigl[(I_{xx} - I_{yy} + I_{zz}) I_{xz}\bigr]   (12a)
C_3 = \frac{I_{zz}}{\Delta}, \quad C_4 = \frac{I_{xz}}{\Delta}, \quad C_5 = \frac{I_{zz} - I_{xx}}{I_{yy}}   (12b)
C_6 = \frac{I_{xz}}{I_{yy}}, \quad C_7 = \frac{1}{I_{yy}}, \quad C_8 = \frac{1}{\Delta}\bigl[(I_{xx} - I_{yy}) I_{xx} + I_{xz}^2\bigr], \quad C_9 = \frac{I_{xx}}{\Delta}   (12c)

and Δ = Ixx Izz − Ixz^2 are constants of the model, known from the rigid aeroplane inertia matrix calculated with respect to the axes of FB.

The instantaneous resultant external moment M about the pole G acting on the air vehicle is the sum of the resultant aerodynamic moment M^{(A)} and of the resultant moment M^{(T)} due to the thrust line eccentricity with respect to G. When M is projected onto FB, it is commonly expressed as:

\{\mathcal{L},\; \mathcal{M},\; \mathcal{N}\}^T = \{\mathcal{L}^{(A)},\; \mathcal{M}^{(A)},\; \mathcal{N}^{(A)}\}^T + \{0,\; \mathcal{M}^{(T)},\; 0\}^T   (13)

in the case of symmetric propulsion (thrust line in the aircraft symmetry plane).

Equations (11) are actually set in closed form because the aerodynamic moments about the roll, pitch and yaw axes (L^{(A)}, M^{(A)}, N^{(A)}) are modelled as functions of aircraft state variables and of external inputs. The same applies to the pitching moment M^{(T)}. The body-axis components of M^{(A)} are commonly expressed in terms of their coefficients according to the conventional formulas:

\{\mathcal{L}^{(A)},\; \mathcal{M}^{(A)},\; \mathcal{N}^{(A)}\}^T = \tfrac{1}{2}\rho V^2 S\,\{b\, C_\mathcal{L},\; \bar{c}\, C_\mathcal{M},\; b\, C_\mathcal{N}\}^T   (14)

where b and c̄ are reference lengths known from the aeroplane's geometry. Appendix A presents a high-fidelity nonlinear model of the aerodynamic roll-, pitch-, and yaw-moment coefficients (CL, CM, CN) for the F-16 fighter jet.

The pitching moment due to thrust M^{(T)} is given by the direct action of the thrust vector:

\mathcal{M}^{(T)} = T\, e_T   (15)

where the eccentricity eT is a known parameter, positive when the thrust line is located beneath the centre of gravity (eT < 0 as shown in Fig. 3).

2.1.3 Flight path equations (FPEs)

Systems (1a)–(1b)–(1c)–(11a)–(11b)–(11c) of CLMEs and CAMEs projected onto the moving frame FB must necessarily be augmented with two additional systems of equations to solve the aircraft dynamics and propagate its state in time. One such system is needed for expressing the trajectory of the aircraft with respect to the Earth-based inertial frame. The flight path equations (FPEs) are specifically used for this purpose. The outputs of this system of differential equations form the instantaneous position {xE,G(t), yE,G(t), zE,G(t)} of the aircraft centre of gravity G in the space FE. The reduced 2D version {xE,G(t), yE,G(t)} of the FPEs provides the so-called ground track relative to the aircraft flight.

The FPEs are simply derived from the component transformation of vector V from frame FB to frame FE:

\{\dot{x}_{E,G},\; \dot{y}_{E,G},\; \dot{z}_{E,G}\}^T = [T_{EB}]\,\{u,\; v,\; w\}^T   (16)

knowing that [TEB] = [TBE]^T and accounting for definition (3). The FPEs are then written in matrix format as follows:

\begin{Bmatrix} \dot{x}_{E,G} \\ \dot{y}_{E,G} \\ \dot{z}_{E,G} \end{Bmatrix} = \begin{bmatrix} q_0^2+q_x^2-q_y^2-q_z^2 & 2(q_x q_y - q_0 q_z) & 2(q_z q_x + q_0 q_y) \\ 2(q_x q_y + q_0 q_z) & q_0^2-q_x^2+q_y^2-q_z^2 & 2(q_y q_z - q_0 q_x) \\ 2(q_z q_x - q_0 q_y) & 2(q_y q_z + q_0 q_x) & q_0^2-q_x^2-q_y^2+q_z^2 \end{bmatrix} \begin{Bmatrix} u \\ v \\ w \end{Bmatrix}   (17)

The inputs for the FPEs are the aircraft attitude quaternion components along with the components (u, v, w), which are provided by the solution of the combined (CLMEs)–(CAMEs) system.

The quaternion components instead are derived from the body-frame components (p, q, r) of the aircraft angular velocity vector Ω through the solution of another set of equations, to be introduced next.

2.1.4 Kinematic equations (KEs)

The rigid-body kinematic equations (KEs) based on the aircraft attitude quaternion components [1] are written in matrix format as follows:

\begin{Bmatrix} \dot{q}_0 \\ \dot{q}_x \\ \dot{q}_y \\ \dot{q}_z \end{Bmatrix} = \frac{1}{2} \begin{bmatrix} 0 & -p & -q & -r \\ p & 0 & r & -q \\ q & -r & 0 & p \\ r & q & -p & 0 \end{bmatrix} \begin{Bmatrix} q_0 \\ q_x \\ q_y \\ q_z \end{Bmatrix}   (18)

The inputs to the KEs are the angular velocity components (p, q, r) in FB. Their solution provides the numerical values of the kinematic state variables (q0, qx, qy, qz).

The above system is the set of differential equations of choice, nowadays, in large-scale simulations because it prevents the singularity known as 'gimbal lock' associated with the alternative formulation based on aircraft Euler angles. However, at each instant t when a quadruplet of quaternion components is known, the Euler angles (ψ, θ, φ), shown in Fig. 4, are calculated according to a standard algorithm [32].
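A minimal sketch of how Eqs. (3), (17) and (18) fit together is given below; the explicit Euler step and the step size are assumptions chosen for brevity, not the integration scheme used by JSBSim.

import numpy as np

def dcm_body_to_earth(quat):
    """[T_EB] = [T_BE]^T built from the attitude quaternion (q0, qx, qy, qz), Eqs. (3) and (17)."""
    q0, qx, qy, qz = quat
    T_BE = np.array([
        [q0**2 + qx**2 - qy**2 - qz**2, 2*(qx*qy + q0*qz),             2*(qz*qx - q0*qy)],
        [2*(qx*qy - q0*qz),             q0**2 - qx**2 + qy**2 - qz**2, 2*(qy*qz + q0*qx)],
        [2*(qz*qx + q0*qy),             2*(qy*qz - q0*qx),             q0**2 - qx**2 - qy**2 + qz**2]])
    return T_BE.T

def quaternion_rates(quat, p, q_rate, r):
    """Quaternion kinematic equations, Eq. (18); q_rate is the pitch rate q."""
    omega = np.array([[0.0,    -p, -q_rate,      -r],
                      [p,     0.0,       r, -q_rate],
                      [q_rate, -r,     0.0,       p],
                      [r,  q_rate,      -p,     0.0]])
    return 0.5 * omega @ quat

# One explicit-Euler step with an assumed step size, followed by re-normalization
quat = np.array([1.0, 0.0, 0.0, 0.0])                              # level attitude
quat = quat + 0.01 * quaternion_rates(quat, p=0.2, q_rate=0.05, r=0.0)
quat /= np.linalg.norm(quat)                                       # keep the quaternion unit-norm
v_earth = dcm_body_to_earth(quat) @ np.array([200.0, 0.0, 5.0])    # Eq. (16): ground-frame velocity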

Fig. 4  Aircraft Euler angles and attitude with respect to the Earth frame

2.1.5 Summary of the equations and of system inputs

The system of (CLMEs)–(CAMEs)–(FPEs)–(KEs), that is, (1)–(11)–(17)–(18), is the full set of 13 coupled nonlinear differential equations governing the 6-DoF, rigid-body dynamics of atmospheric flight. They are in a closed form once the aerodynamic as well as the propulsive external forces and moments are completely modelled as functions of the 13 state variables

x = \bigl[u, v, w, p, q, r, x_{E,G}, y_{E,G}, z_{E,G}, q_0, q_x, q_y, q_z\bigr]^T   (19)

i.e. of a state vector x, and of a number of external inputs grouped into an input vector, commonly known as vector u.

The F-16 public domain model used for this research features a quite articulated and high-fidelity flight control system (FCS), whose simplified scheme is depicted in Fig. 5.

Fig. 5  Simplified scheme of a General Dynamics F-16 flight control system. See Appendixes B and C for more details on the flight controller and engine controller

The FCS, which receives state feedback from the aircraft dynamics block, includes the following channels: (i) Roll command δ̃a (acting on the right aileron deflection angle δa and on the antisymmetric left aileron deflection), (ii) Pitch command δ̃e (acting on the elevon deflection angle δe), (iii) Yaw command δ̃r (acting on the rudder deflection angle δr), (iv) Throttle lever command δ̃T (acting on the throttle setting δT, with the possibility to trigger the jet engine afterburner), (v) Speed brake command δ̃sb (acting on the speed brake deflection angle δsb), (vi) Wing trailing-edge flap command δ̃f,TE (acting on the trailing-edge flap deflection angle δf,TE) and (vii) Wing leading-edge flap command δ̃f,LE (acting on the leading-edge flap deflection angle δf,LE).

Some of these channels are associated with actual pilot's input command signals, such as the primary flight controls δ̃a, δ̃e, δ̃r, δ̃T. Some other signals are mainly actuated and controlled by the FCS, for instance, δ̃f,TE, δ̃f,LE and δ̃sb. Additional details on the FCS, including a summary of the control logic, are reported in Appendix C.

A selected number of the above inputs form the vector ũagt of normalized commands used by the agent to interact with the system in all training sessions:

\tilde{u}_{agt} = \bigl[\tilde{\delta}_a, \tilde{\delta}_e, \tilde{\delta}_r, \tilde{\delta}_T\bigr]^T   (20)

The full input vector to the nonlinear aircraft flight dynamics model during all simulations performed in this work—resulting from the action of the agent in combination with the FCS logics—is then the following:

u = \bigl[\delta_a, \delta_e, \delta_r, \delta_T, \delta_{f,TE}, \delta_{f,LE}, \delta_{sb}\bigr]^T   (21)
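To make the interface of Eqs. (19)–(21) concrete, here is a minimal layout sketch in Python; the placeholder numbers and the normalized [-1, 1] and [0, 1] command ranges are assumptions for illustration, not specifications of the F-16 model or of the paper.

import numpy as np

# State vector x of Eq. (19), filled with placeholder values (assumed units)
x = np.array([200.0, 0.0, 5.0,       # u, v, w           body-axis velocities
              0.0, 0.0, 0.0,         # p, q, r           body-axis angular rates
              0.0, 0.0, -3000.0,     # xE_G, yE_G, zE_G  centre-of-gravity position
              1.0, 0.0, 0.0, 0.0])   # q0, qx, qy, qz    attitude quaternion

def clip_agent_action(u_agt):
    """Agent action of Eq. (20): normalized roll, pitch, yaw and throttle commands.
    The [-1, 1] and [0, 1] ranges below are assumptions, not the model's specification."""
    lo = np.array([-1.0, -1.0, -1.0, 0.0])
    hi = np.array([ 1.0,  1.0,  1.0, 1.0])
    return np.clip(u_agt, lo, hi)

u_agt = clip_agent_action(np.array([0.2, -0.1, 0.0, 0.8]))
# The FCS would complete the vector u of Eq. (21) by driving the flap and speed-brake channels itself
u_fcs = np.concatenate([u_agt, np.zeros(3)])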

Fig. 6  General Dynamics F-16 Fighting Falcon. Aerodynamic control surface deflections

The deflection angles of the F-16 aerodynamic control surfaces are depicted in Fig. 6.

3 Deep reinforcement learning approach applied to flight control

There is a growing interest, nowadays, in AI-based pilot models, which are going to augment the manned legacy fighters' capabilities and might bring them to compete with the next-generation air dominance systems. This is the main motivation for the present research. A related motivation lies in the possibility of implementing more effective AI-assisted pilot training procedures for high-precision tasks, such as dogfights and formation flights [20,25–27].

DRL is a promising approach to be combined with established flight control design techniques, and it provides the following advantages: (i) it can make a flight control system learn optimal policies, by interacting with the environment, when the aircraft nonlinear dynamics are not completely known or are difficult to model; (ii) it can deal with changing situations in uncertain and strongly dynamic environments. In this section, we recall the basic concepts to explain how DRL, in particular the DDPG idea, is applicable to the chosen flight control scenarios presented later in the article.

3.1 The reinforcement learning framework

RL is a significant branch of machine learning that is concerned with how to learn control laws and policies from experience, which make a dynamical system interact with a complex environment according to a given task. Both control theory and machine learning fundamentally rely on optimization, and likewise, RL involves a set of optimization techniques within an observational framework for learning how to interact with the environment. In this sense, RL stands at the intersection of control theory and machine learning. This is explained rigorously in a well-known introductory book by Sutton and Barto [8]. On the other hand, in their recent textbook on data-driven science and engineering, Brunton and Kutz [33] use a modern and unified notation to present an overview of state-of-the-art RL techniques applied to various fields, including a mathematical formulation for DRL. The reader may refer to these bibliographical references for a comprehensive explanation of the theory behind all variants of RL approaches.

RL methods apply to problems where an agent interacts with its environment in discrete time steps [8,33], as depicted in Fig. 7. At time step k, the agent senses the state sk of its environment and decides to perform an action ak that brings the environment to a new state sk+1 at the next time step, obtaining a reward rk. The scheme is reiterated with advancing time steps, k ← k + 1, and the RL agent learns to take appropriate actions to achieve optimal immediate or delayed rewards.

The agent's behaviour over time is given by the finite sequence of states and actions e = ((s0, a0), (s1, a1), ..., (sT, aT)), also called an episode (or trajectory), where T is the episode's final time step. An episode may end when a target terminal state is reached, or else when T becomes the maximum allowed number of time steps.
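The interaction loop just described can be sketched as follows; the environment and policy below are toy placeholders (a real environment would wrap the 6-DoF flight dynamics model), intended only to show the episode structure.

import numpy as np

class DummyEnv:
    """Placeholder environment: a real one would wrap the aircraft flight dynamics model."""
    def reset(self):
        return np.zeros(4)                              # initial observed state s0
    def step(self, action):
        s_next = np.random.randn(4)                     # next state s_{k+1}
        reward = -np.linalg.norm(s_next)                # immediate reward r_k
        done = np.linalg.norm(s_next) > 3.0             # terminal-state test
        return s_next, reward, done

def run_episode(env, policy, max_steps=200):
    """One episode e = ((s0, a0), ..., (sT, aT)) and its sequence of rewards."""
    s, trajectory, rewards = env.reset(), [], []
    for k in range(max_steps):
        a = policy(s)                                   # action from the current policy
        s_next, r, done = env.step(a)
        trajectory.append((s, a))
        rewards.append(r)
        s = s_next
        if done:                                        # target terminal state reached
            break
    return trajectory, rewards

trajectory, rewards = run_episode(DummyEnv(), policy=lambda s: np.random.uniform(-1, 1, size=4))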

Fig. 7  Schematic of reinforcement learning, in the particular case of an agent's policy π represented by a deep neural network (adapted from [33]). In the case of flight control, the action a is the input vector ũagt defined by Eq. (20); the observed state is the vector s = [x, ẋ, u, u̇]^T, where x and u are defined by Eqs. (19) and (21), respectively

In RL, the environment and the agent are dynamic. The goal of learning is to find an optimal policy, i.e. a control law (or behaviour) that generates an action sequence able to obtain the best possible outcome. A behaviour is optimal when it collects the highest cumulated reward. This is accomplished by allowing a piece of software—that is, the agent—to explore the environment, interact with it, and learn by observing how it evolves over time. Whenever the agent takes an action, it affects the environment, which transitions to a new state; even when the agent takes no action, the environment still might change. Within an entire episode, the evolutionary environment produces a sequence of rewards, and using this information, the agent can adjust its future behaviour and learn from the process. In the particular scheme of Fig. 7, for the sake of example, the agent is a deep neural network that receives a large number of input signals and then adjusts its parameters ϑ during training so that it can control the environment as requested. In the specific environment of an aeroplane in controlled flight, the action a is the control input vector ũagt defined by Eq. (20), while the observed state s = [x, ẋ, u, u̇]^T incorporates vectors x and u (Eqs. (19) and (21), respectively) and their time derivatives. More details are given in Sect. 3.7 and in Fig. 9.

An important point to observe is that in RL the environment is everything that exists outside of the agent. Practically speaking, it is where the agent sends actions and what generates rewards and observations. In this context, the environment is different from what is usually meant by control engineers, who tend to think of the environment as everything outside of the controller and the plant. In classical flight control approaches, things like wind gusts and other disturbances that impact the system represent the environment. In RL, the environment is everything outside the controller, as shown in Fig. 5, and this includes the plant dynamics as well. The agent is just a bit of software that is generating the actions and updating the control laws through learning. In the present case study, the agent acts like the 'brain' of the pilot that learns to control the aircraft.

The state s is a collection of variables observed by the agent while interacting with the environment, which is taken from the list (x, u) defined by (19) and (21). For the agent that controls the F-16 flight dynamics, the action a is given by the inputs ũagt defined by (20).

An important characteristic of reinforcement learning is that the agent is not required to know in advance the environment's dynamics. It will learn on its own only by observation, provided that it has access to the states and that rewards are appropriately calculated. This makes RL a powerful approach that can lead to an effective model-free control design. In flight control, for instance, the learning agent does not need to initially know anything about a flying vehicle. It will still figure out how to collect rewards without knowing the aircraft weight, how the aerosurfaces move, or how effective they are. The training technique used in this research applies a method designed originally for model-free applications to a problem of flight control where a high-fidelity model of the environment is known.

An RL framework can be applied to a simulated environment, which must be modelled, or even to a physical environment. In the present study, the agent is trained within a high-fidelity model of the environment; hence, the learning experience is carried out with flight simulations.
Learning is a process that often requires millions or tens of millions of samples, which means a huge number of trial simulations, error calculations and corrections. The obvious advantage of a simulated environment is that the learning process can be run faster than real time and that simulations may be executed in parallel on multiprocessor machines or GPU clusters.

An important step in RL-based control design workflows is the deployment of the control algorithm on the target hardware or within a chosen simulation environment. When learning ends successfully, the agent's policy can be considered optimal and frozen, and a static instance of the policy can be deployed onto the target as any other form of control law. Examples of use of an optimized agent deployed in different simulation environments are presented in Sect. 4.

3.2 Rewards

Given the environment, an RL problem has to specify how the agent should behave and how it will be rewarded for doing what the assigned task prescribes. The reward can be earned at every time step (immediate), or be sparse (delayed), or can only come at the very end of an episode after long periods of time. In the F-16 flight simulations, an immediate reward is available at each environment transition from time tk to tk+1 as a function r(sk, ak, sk+1) of the two consecutive states and of the action that causes the transition.

In RL, there is no restriction on creating a reward function, unlike LQR in the theory of optimal control, where the cost function is quadratic. The reward can be calculated from a nonlinear function or using thousands of parameters. It completely depends on what it takes to effectively train the agent. For instance, if a flight attitude is requested, with a prescribed heading ψc, then one should intuitively tend to give more rewards to the agent as the heading angle ψ gets closer to the commanded direction. If one wants to take controller effort into account, then rewards should be subtracted as actuator use increases.
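This intuition can be written as a simple shaping function. The Python sketch below is only an illustration of the idea; the exponential form, the weights and the effort penalty are assumptions, not the reward function actually adopted in Sect. 3.11.

import numpy as np

def heading_altitude_reward(psi, psi_c, h, h_c, u_agt, w_psi=1.0, w_h=1.0, w_u=0.1):
    """Illustrative shaping: larger reward as heading and altitude approach the commanded
    values, minus a penalty that grows with actuator use. All weights are assumed."""
    e_psi = np.arctan2(np.sin(psi - psi_c), np.cos(psi - psi_c))   # wrapped heading error
    r_track = w_psi * np.exp(-abs(e_psi)) + w_h * np.exp(-abs(h - h_c) / 100.0)
    r_effort = w_u * float(np.sum(np.square(u_agt)))               # controller-effort penalty
    return r_track - r_effort

r = heading_altitude_reward(psi=0.3, psi_c=0.0, h=3000.0, h_c=3200.0,
                            u_agt=np.array([0.2, -0.1, 0.0, 0.7]))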
A reward function can really be any function one can think of. But making an effective reward function, on the other hand, requires ingenuity. Unfortunately, there is no straightforward way of crafting a reward to guarantee the agent will converge on the desired control. The definition of a reward calculation function, appropriate to a prescribed control task, is called reward shaping and is one of the most difficult tasks in RL. Section 3.11 introduces a convenient reward calculation function, which can be adopted in a flight control task where a target altitude and heading are commanded.

3.3 Policy function

Besides the environment that provides the rewards, an RL-based control must have a well-designed agent. The agent comprises its policy and the learning algorithm, two things that are closely interlaced. Many learning algorithms require a specific policy structure, and choosing an algorithm depends on the nature of the environment.

A policy π(s, a) defines the behaviour of the agent, in order to determine which action should be taken in each state. Formally speaking, it is a function π : (s, a) → [0, 1] that maps a state–action pair to a scalar value between 0 and 1, regarded as a conditional probability density function of taking the action a while observing the state s (random policy). But a policy function can also be deterministic; in that case, given a state s, the action a is nonrandom, i.e. π : s → a.

The term 'policy' means that the agent has to make a decision based on the current state. Training an agent to accomplish a given task in the simulated environment means generating several simulations and updating the policy parameters while episodes are generated by maximizing the policy performance. The policy function tells the agent what to do, and learning a policy function is the most important part of an RL algorithm.

As shown in Fig. 7, the policy is a function that takes in state observations and outputs actions; therefore, any function with that input and output relationship can work. Environments with a continuous state–action space have a continuous policy function, and it makes sense to represent π with a general-purpose function approximator, i.e. something that can handle a continuous state s(t) and a continuous action a(t), without having to set the nonlinear function structure ahead of time. This is accomplished with deep neural networks, a function approximation approach which forms the basis of DRL. The training algorithm selected for this research, known as DDPG and available in the MATLAB/Simulink Reinforcement Learning Toolbox, adopts a deterministic policy, i.e. a function πϑ(s) that is parameterized by a finite vector ϑ.
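A deterministic policy πϑ(s) of this kind can be sketched as a small feed-forward network; the layer sizes, the initialization and the tanh output squashing below are illustrative assumptions, not the actor architecture used by the authors.

import numpy as np

class DeterministicActor:
    """Minimal feed-forward policy s -> a with parameters theta = (W1, b1, W2, b2).
    Sizes and output squashing are assumptions for illustration only."""
    def __init__(self, n_obs, n_act, n_hidden=64, rng=np.random.default_rng(0)):
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_obs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal((n_act, n_hidden))
        self.b2 = np.zeros(n_act)

    def __call__(self, s):
        h = np.tanh(self.W1 @ s + self.b1)        # hidden nonlinear layer
        return np.tanh(self.W2 @ h + self.b2)     # normalized commands in [-1, 1]

pi = DeterministicActor(n_obs=24, n_act=4)        # 4 outputs: roll, pitch, yaw, throttle commands
a = pi(np.zeros(24))                              # assumed observation size, for illustration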

With this structure, hundreds of time samples of the environment state are used, gathering several multi-dimensional observations of the F-16 in simulated flight as the input into πϑ, which outputs the actuator commands that drive the aerosurface movements and the thrust lever. Even though the function might be extremely complex, there will be a neural network of some kind that can achieve it.

3.4 Value function

The policy is used to explore the environment to generate episodes and calculate rewards. A performance index Rk of the policy at time step k is also called discounted return, i.e. the weighted sum of all rewards that the agent is expected to receive from time tk onwards:

R_k = \sum_{i=0}^{T} \gamma^i\, r_{k+i}   (22)

where γ ∈ ]0, 1] is called the discount rate. The discount rate is a hyperparameter tuned by the user, and it represents how much future events lose their value in terms of policy performance according to how far away in time they are. Future rewards are discounted, reflecting the economic principle that current rewards are more valuable than future rewards. In general, Rk is a random variable resulting from future states and actions that are unknown at time tk during learning.
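For a finite episode, the discounted return of Eq. (22) can be accumulated directly, as in this short sketch (the reward values and γ are arbitrary).

def discounted_return(rewards, gamma=0.99):
    """R_k of Eq. (22): weighted sum of the rewards collected from step k onwards."""
    R = 0.0
    for i, r in enumerate(rewards):
        R += (gamma ** i) * r
    return R

print(discounted_return([1.0, 0.5, 0.0, -0.2, 2.0], gamma=0.95))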
V , and the quality function Q. The latter is introduced
Given a policy π , a value function is defined as the
in the next subsection.
function that quantifies the desirability of being in a
given state:
Vπ (s) = Eπ Rk | sk = s (23) 3.5 Quality function
where Eπ indicates the expected reward over the time
steps from k to T , when the state at time step k is s. The quality function, or Q-value, of a state–action pair
When the subscript π is omitted from the notation Vπ , (s, a) is defined as the expected value
 
one refers to the value function for the best possible Q(s, a) = E r (s, a, s ) + γ V (s ) (27)
policy:
when at time step k the state is s, the agent is assumed
V (s) = max Vπ (s) (24) to follow the optimal policy, and the generic action a
π
One of the most important properties of the value is taken. This brings to the next state s , resulting in the
function is that the value V (sk ) at a time step k—also immediate reward r (s, a, s ) and a discounted future
called V-value—is given by an elegant recursive for- cumulated reward γ V (s ).
mula known as Bellman equation for V : While the value function V tells what is the value of
  being in a current state s, the function Q—also called
V (s) = max Eπ r0 + γ V (s ) (25)
π action-value function—is a joint quality or value of
where s = sk+1 is the next state after s = sk given taking an action a given a current state s. The quality
the action ak rewarded with r0 = rk , and the expecta- function is somehow richer and it contains more infor-
tion is over actions selected by the optimal policy π . mation about what is the quality of being in a given state

While the value function V tells what is the value of being in a current state s, the function Q—also called action-value function—is a joint quality or value of taking an action a given a current state s. The quality function is somehow richer and contains more information about what is the quality of being in a given state for any action it might take. For this reason, Q is also sometimes called 'critic', because it can look at the possible actions and be used to criticize the agent's choices. This function can be approximated by a deep neural network, as well.

The optimal policy π(s, a) and the optimal value function V(s) contain redundant information, as one can be determined from the other via the quality function Q(s, a):

\pi(s, a) = \arg\max_a Q(s, a), \quad V(s) = \max_a Q(s, a)   (28)

This formulation is used to define the Q-learning strategy, which is recalled in the next subsection.

3.6 Temporal difference and Q-learning

In an approach to learning from trial-and-error experience, the value function V or quality function Q is learned through a repeated evaluation of many policies [8,33]. In this work, the chosen learning process is not episodic (it does not wait for the end of a control trajectory to update the policy); instead, it is implemented in such a way as to learn continuously by bootstrapping. This technique is based on current estimates of V or Q, which are then repeatedly updated by scanning the successive states in the same control trajectory. In the simplest case, at each iteration of this learning technique, the value function is updated by means of a one-step look ahead, namely a value prediction for the next state s′ given the current state s and action a. This approach relies on Bellman's principle of optimality, which states that a large multi-step control policy must also be locally optimal in every subsequence of steps [33].

The temporal difference learning method known as the TD(0) method simply approximates the expected discounted return with an estimation based on the reward immediately received summed to the value of the next state. Given a control trajectory generated through an optimal policy π, by Bellman's optimality condition the V-value of state sk is given by:

V(s_k) = \mathbb{E}_\pi\bigl[r_k + \gamma V(s_{k+1})\bigr]   (29)

where rk + γV(sk+1) acts as an unbiased estimator for V(sk), in the language of Bayesian statistics. For non-optimal policies π, this same idea is used to update the value function based on the value function one step ahead in the future, thus approximating the expected return as:

R_k \approx r(s_k, a_k, s_{k+1}) + \gamma V_\pi(s_{k+1})   (30)

or, using the Q-value, as:

R_k \approx r(s_k, a_k, s_{k+1}) + \gamma Q_\pi(s_{k+1}, a_{k+1})   (31)

These approximations lead to the following definitions of the learning rules, or update equations:

V_\pi(s_k) \leftarrow V_\pi(s_k) + \eta\, \delta_{RPE}(s_k, a_k, s_{k+1}), \quad Q_\pi(s_k, a_k) \leftarrow Q_\pi(s_k, a_k) + \eta\, \delta_{TDE}(s_k, a_k, s_{k+1})   (32)

where η is a learning rate between 0 and 1, and the quantities

\delta_{RPE} = r(s_k, a_k, s_{k+1}) + \gamma V_\pi(s_{k+1}) - V_\pi(s_k), \quad \delta_{TDE} = r(s_k, a_k, s_{k+1}) + \gamma Q_\pi(s_{k+1}, a_{k+1}) - Q_\pi(s_k, a_k)   (33)

are called reward-prediction error (RPE) and temporal difference error (TDE), respectively. If the error is positive, the transition was positively surprising: one obtains more reward or lands in a better state than expected, so the initial state or action was actually underrated and its estimated value must be increased. Similarly, if the error is negative, the transition was negatively surprising: the initial state or action was overrated, and its value must be decreased. All methods based on the update equations (32) are called value-iteration learning algorithms.

TD-based learning offers the advantage that, after each state transition, the V-value or Q-value updates can be immediately applied; that is, there is no need to wait until an entire episode is completed. This process allows very fast learning and is called online learning. The policy updates may be applied at every single transition (TD(0), or 1-step look ahead) or the learning may proceed from batches of consecutive state transitions (TD(n), or n-step look ahead).

Q-learning is a technique that derives from TD learning and is particularly suitable for model-free RL. In Q-learning, the Q function is learned directly only by observing the evolutionary environment, in an approach that post-processes the generated control trajectories. It can be seen as a generalization of the many available model-based learning strategies, applicable to all those control scenarios that are difficult or impossible to model. As seen from (27), the Q function incorporates in its definition the very concept of a one-step look ahead, and does not need an environment's model.
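The update rules (32)–(33) translate almost literally into code. The sketch below uses dictionaries as tabular estimates of V and Q over discretized states, a simplification for illustration rather than the function-approximation setting used later in the paper.

from collections import defaultdict

V = defaultdict(float)          # tabular estimate of V_pi
Q = defaultdict(float)          # tabular estimate of Q_pi, keyed by (state, action)
eta, gamma = 0.1, 0.99          # learning rate and discount rate (assumed values)

def td_update(s, a, r, s_next, a_next):
    """One bootstrapped update after the transition (s, a) -> (s_next, a_next) with reward r."""
    delta_rpe = r + gamma * V[s_next] - V[s]                    # reward-prediction error, Eq. (33)
    delta_tde = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]     # temporal difference error, Eq. (33)
    V[s] += eta * delta_rpe                                     # value update, Eq. (32)
    Q[(s, a)] += eta * delta_tde                                # Q update, Eq. (32)

td_update(s=0, a=1, r=0.5, s_next=2, a_next=0)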

Fig. 8  The Q-learning algorithm for estimating the optimal policy π

Therefore, the learned Q function, the optimal policy, and the value function may be extracted as in (28). In Q-learning, the Q function update equation is [8,33]:

Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \eta\, \hat{\delta}_{TDE}(s_k, a_k, s_{k+1})   (34)

where

\hat{\delta}_{TDE}(s_k, a_k, s_{k+1}) = r(s_k, a_k, s_{k+1}) + \gamma \max_a Q(s_{k+1}, a) - Q(s_k, a_k)   (35)

is called the off-policy TDE. Because the optimal a is used in (35) to determine the correction δ̂TDE based on the current estimate for Q, while a different action ak+1, chosen by a different behaviour policy, is taken for the next state transition, Q-learning is called an off-policy technique. Thus, Q-learning takes sub-optimal actions to explore future states but uses the optimal action a to improve the Q function at each state transition. The Q-learning algorithm for estimating the optimal policy π is summarized in Fig. 8. In practice, rather than finding the true Q-value of the state–action pair in one go at each time step tk, the agent improves the approximation of the function Q over time through the Bellman equation.

Q-learning was formalized for solving problems with discrete action and state spaces governed by a finite Markov decision process. Yet, when the action and state spaces are continuous, as occurs in flight control problems, time is discretized with an appropriate frequency and Q-learning becomes equally applicable. Its off-policy character has made Q-learning particularly well suited to many real-world applications, enabling the RL agent to improve when its policy is sub-optimal. In deep Q-learning, the Q function is conveniently represented as a neural network.

In all RL approaches, the concept of learning rate is very important: it determines to what extent, for each episode during agent training, newly acquired information overrides old information. A sufficient level of exploration has to be ensured in order to make sure that the estimates converge to the optimal values: this is known as the exploration–exploitation problem. At successive time steps within the generic episode, if the agent always selects the same action policy from the beginning (exploitation), it will never discover better (or worse) alternatives. On the other hand, if the policy is updated in such a way that it picks random actions (exploration), this random sub-optimal policy might have a chance to bring new information to the learning process. A trade-off between exploitation and exploration is ensured by the available Q-learning techniques: usually, a lot of exploration happens at the beginning of the training, to accumulate knowledge about the environment and the control task, and less towards the end, to actually use the acquired knowledge and perform optimally. Generally, Q-learning will learn a more optimal solution faster than alternative techniques.
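In a discretized setting, the algorithm of Fig. 8 and the update (34)–(35) correspond to the classic tabular loop sketched below; the ε-greedy behaviour policy and the toy environment are assumptions added only to make the illustration runnable.

import numpy as np

n_states, n_actions = 10, 3
Q = np.zeros((n_states, n_actions))
eta, gamma, eps = 0.1, 0.95, 0.2          # learning rate, discount rate, exploration rate (assumed)
rng = np.random.default_rng(0)

def toy_step(s, a):
    """Placeholder environment transition that rewards reaching the last state."""
    s_next = s + (1 if a == 2 else -1 if a == 0 else 0)
    s_next = min(max(s_next, 0), n_states - 1)
    return s_next, float(s_next == n_states - 1)

for episode in range(500):
    s = 0
    for k in range(50):
        # epsilon-greedy behaviour policy (exploration versus exploitation)
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = toy_step(s, a)
        # off-policy TD error of Eq. (35) and update of Eq. (34)
        delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += eta * delta
        s = s_next

greedy_policy = np.argmax(Q, axis=1)      # extract the learned policy as in Eq. (28)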

Fig. 9 Actor-critic architecture scheme

The actor-critic scheme works for continuous action spaces because the critic—which is supposedly the optimal Q function evaluator in the current condition—only needs to consider a single action, the one that the actor just took. In fact, when the actor selects an action, it is applied to the environment, the critic estimates the value of that state–action pair, and then it uses the reward from the environment as a metric to determine how good its Q-value prediction was. The error is the difference between the new estimated value of the previous state and the old value of the previous state from the critic network. The critic uses the error to update itself so that it has a less sub-optimal prediction the next time it is in that state. The actor network also updates its parameters with the response from the critic and the error term so that it can improve its future behaviour. This research uses the DDPG training algorithm, which is based on the actor-critic architecture [12]. The algorithm can learn from environments with continuous state and action spaces, and since it estimates a deterministic policy, it learns much faster than those techniques based on stochastic policies.

To compute the prediction errors, usually many successive samples (single transitions) are gathered and concatenated in mini-batches, so that the critic's neural network can learn to minimize the prediction error from these chunks of data. Yet, the successive transitions concatenated inside a mini-batch are not independent of each other but correlated: (sk, ak, rk+1, sk+1) will be followed by (sk+1, ak+1, rk+2, sk+2), and so on, which is not a distribution of random samples. This is a major problem that promotes the tendency of the involved neural networks to overfit and fall into local minima.

Another major problem occurs with the actor and critic implemented as neural networks because their loss functions have non-steady targets. As opposed to classification or regression problems where the desired values are fixed throughout the network update iterations, in Q-learning the target r(sk, ak, sk+1) + γ max_a Qϑ(sk+1, a) will change because the function approximator Qϑ depends on the weights ϑ. This circumstance can make the actor-critic pair particularly inefficient, especially if they are implemented as feed-forward networks and the control task features a moving reference.

3.8 Deep Q-networks

The problem in DRL caused by correlated samples within mini-batches has been solved by introducing a learning technique called the deep Q-network (DQN) algorithm [34]. The approach relies on data structures called experience replay memory (ERM), or replay buffers, which are huge buffers where hundreds of thousands of successive transitions are stored. The agent is trained by randomly sampling mini-batches from the ERM, which is cyclically emptied and refilled with new samples.

On the other hand, to fix the problem related to the inherent unsteadiness of the loss function target, a DQN algorithm uses a cyclically frozen version of the agent, called the target network, which computes the transitions intended to feed the ERM. Only every few thousand iterations is the target network updated, so that the loss function targets remain stationary.
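A minimal Python sketch of the two ingredients recalled above, namely an experience replay memory sampled at random and a periodically refreshed (frozen) target network, is shown below; the capacity, batch size and refresh period are arbitrary illustrative values, not those of the training runs reported later.

    import copy
    import random
    from collections import deque

    class ReplayMemory:
        """Experience replay memory (ERM): stores transitions, returns random mini-batches."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest samples are discarded automatically
        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))
        def sample(self, batch_size=256):
            return random.sample(self.buffer, batch_size)   # breaks temporal correlation
        def __len__(self):
            return len(self.buffer)

    def refresh_target(step, critic, target_critic, period=5_000):
        # Copy the trained critic into the target network only every `period` steps,
        # so that the regression targets stay stationary between refreshes.
        return copy.deepcopy(critic) if step % period == 0 else target_critic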


The learning speed of DQNs is known to be lower than that of other available approaches. This is due to the sample complexity, i.e. to the fact that the agent must inevitably enact a huge number, in the order of millions, of transitions to obtain a satisfying policy. The function of the DQN within the agent's behaviour is depicted in the left part of Fig. 9, showing how the DQN trains the critic network to estimate the future rewards and to update the actor's policy (see also Fig. 8). In the case of flight control, the updated policy determines the action on aerosurfaces and engine throttle that maximizes the expected reward. The reward shaping approach for the control problem introduced by this work is presented in Sect. 3.11.

3.9 Policy gradient methods

In policy-based function approximation—which is the part of the actor-critic scheme that updates the actor—when the policy π is parameterized by ϑ, it is possible to use gradient optimization on the parameters to improve the policy much faster than other iterative methods.

The objective is to learn a policy that maximizes the expected return of each transition. The goal of the neural network is to maximize an objective function given by the return (22) calculated over a set of trajectories E selected by the policy:

J(ϑ) = E_E [ Rk ] = E_E [ Σ_{i=0}^{T} γ^i rk+i ]    (36)

The algorithm known as the policy gradient method applies a gradient ascent technique to the weights in order to maximize J in the space of ϑ's. All that is actually necessary with this technique is the gradient of the objective function with respect to the weights, ∇ϑ J = ∂J/∂ϑ. Once a suitable estimation of this policy gradient is obtained, the gradient ascent formula is straightforward:

ϑ ← ϑ + η ∇ϑ J(ϑ)    (37)

The reader is referred to the article by Peters and Schaal [35] for an overview of policy gradient methods, with details on how to estimate the policy gradient and how to improve the sample complexity.

3.10 Deep deterministic policy gradient

In the application presented here, the DRL agent learns a deterministic policy. Deterministic policy gradient (DPG) algorithms form a family of methods developed to approximate the gradient of the objective function when the policy is deterministic [36]. The DPG approach has been improved in order to work with nonlinear function approximators [11], resulting in the method known as deep deterministic policy gradient. DDPG combines concepts from DQN and DPG, obtaining an algorithm able to effectively solve continuous control problems with an off-policy method. As in DQN, an ERM (to store the past transitions and then learn off-policy) and target networks (to stabilize the learning) are used. However, in DQN the target networks are updated only every couple of thousands of steps: they change significantly between two updates, but not very often. During the DDPG development, it turned out to be more efficient to make the target networks slowly track the trained networks. For this reason, both the target networks and the trained networks are updated together, by using the following update rule:

ϑk+1 ← τ ϑk + (1 − τ) ϑk+1,   with τ ≤ 1    (38)

where τ is a smoothing factor (usually much less than 1) that defines how much delay exists between target networks and trained networks. The above update strategy improves the stability of the Q function learning process.

The key idea taken from DPG is the policy gradient for the actor. The critic is learned by using regular Q-learning. However, as the policy is deterministic, it can quickly end up always producing the same actions: this is an exploration problem. Some environments are naturally noisy, improving the exploration, but this cannot be assumed in the general case. To deal with that, DDPG perturbs the deterministic action with an additive noise ξ generated with a stochastic noise model, such that ak ← ak + ξk, in order to force the exploration of the environment.
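The target-tracking update of Eq. (38) and the additive exploration noise ak ← ak + ξk can be sketched as follows in Python; a simple Gaussian noise model stands in for the stochastic noise process, and the numerical values are illustrative rather than those of the MATLAB Reinforcement Learning Toolbox implementation used for training.

    import numpy as np

    def soft_update(target_params, trained_params, tau=1e-3):
        # With tau << 1 the target parameters slowly track the trained ones (Eq. 38)
        return [tau * w + (1.0 - tau) * w_bar
                for w, w_bar in zip(trained_params, target_params)]

    def explore(action, sigma=0.1, rng=np.random.default_rng()):
        # Perturb the deterministic action with additive noise to force exploration
        xi = rng.normal(0.0, sigma, size=np.shape(action))
        return np.clip(action + xi, -1.0, 1.0)    # keep normalized commands in [-1, 1]

    # Example: a two-array parameter set and a four-dimensional normalized action
    target  = [np.zeros((3, 3)), np.zeros(3)]
    trained = [np.ones((3, 3)),  np.ones(3)]
    target  = soft_update(target, trained)
    a_k     = explore(np.array([0.8, 0.0, -0.1, 0.0]))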


Fig. 10 Penalty functions tested in this research. (Left) Logarithmic barrier function. See Eq. (39), with xmin = −2, xmax = 2,
rmin = −5, C = 1. (Right) Hyperbolic function. See Eq. (40), with xmin = −2, xmax = 2, λ = 1, τ = 0.5

3.11 Reward function shaping

Finally, we define a basic reward calculation function r that determines the agent's reward rk at time tk. Generally speaking, the reward is based on current values of the aircraft state and observable variables and is defined for specific control tasks. The formulation proposed here, which might appear limited to a given particular control scenario, in practice has proven to be flexible and widely applicable.

The reward signal is the way of communicating to the agent what has to be achieved, not how it has to be achieved. Therefore, the reward function is constructed in such a way as to properly guide the agent to the desired state, making sure not to impart a priori knowledge about how to achieve what the agent is supposed to accomplish. Moreover, a reward should not be set in such a way that the agent can achieve subgoals and prefer them to the ultimate control aim. A properly rewarded agent does not find ways to reach a subgoal without achieving the real end target. The reward calculation is therefore accomplished by evaluating some properly designed simple functions of one or more scalar variables. These have the role of penalty functions because their output is usually much higher when their independent variables lie in given ranges.

A logarithmic barrier penalty reward function was first used in this work to test the default settings of the MATLAB Reinforcement Learning Toolbox. This function is defined as follows:

r(x) = max{ rmin, C [ log( (x − xmin)(xmax − x) ) − log( ¼ (xmax − xmin)² ) ] }   if x ∈ [xmin, xmax]
r(x) = rmin                                                                       if x ∉ [xmin, xmax]    (39)

where rmin < 0 is the minimum allowed reward, C is a nonnegative curvature parameter, and [xmin, xmax] is the range where the reward cannot be as low as rmin.

After some investigation, a second basic reward calculation method, named the hyperbolic penalty function, turned out to give better results in terms of flight control, as opposed to the logarithmic barrier penalty. The hyperbolic penalty evaluates to nearly constant values inside a given range of the independent variable and exhibits a nearly linear behaviour outside that range. It is defined as follows:

r(x) = λ(x − xmin) − √( λ²(x − xmin)² + τ² ) + λ(xmax − x) − √( λ²(xmax − x)² + τ² )    (40)

where τ and λ are nonnegative shape parameters. In particular, ±λ are the slopes of the linear segments outside of the interval [xmin, xmax]. Two examples of logarithmic and hyperbolic penalty functions are shown in Fig. 10.

The above-defined functions are used to construct a reward rk in terms of all or a subset of variables in the triplet (sk, ak, sk+1). Care is taken when linking a higher reward to good behaviour because, for a poorly defined rk, the agent may prefer to maximize its reward at the cost of not reaching the desired state. For instance, for a waypoint following task assigned to an aircraft controller, the agent might approach the target point in space and fly around the given waypoint in order to accumulate as much reward as possible instead of passing through the target and then finishing the episode (which would of course result in a lower total reward). For this reason, the maximum of the chosen function r(x) will not be positive.

Therefore, assuming we have an array χ = (χ1, χ2, …, χn) of n scalars, where each χi is a function of variables in (sk, ak, sk+1), the total reward of a transition from time step k to k + 1 will be:

rk = Σ_{i=1}^{n} r(χi)    (41)

where r can be the logarithmic (39) or the hyperbolic barrier penalty function (40). The number n will depend on the type of control task under study.
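The two penalty functions and the cumulative reward of Eqs. (39)–(41) can be sketched in Python as below. The logarithmic barrier is written in the form implied by the description above (close to zero at the centre of the admissible range, clipped at rmin), so it should be read as an illustrative reconstruction; the hyperbolic penalty follows Eq. (40) directly.

    import numpy as np

    def log_barrier_penalty(x, x_min, x_max, r_min=-5.0, C=1.0):
        # Eq. (39): ~0 near the centre of [x_min, x_max], clipped at r_min, r_min outside
        if x <= x_min or x >= x_max:
            return r_min
        value = C * (np.log((x - x_min) * (x_max - x))
                     - np.log(0.25 * (x_max - x_min) ** 2))
        return max(r_min, value)

    def hyperbolic_penalty(x, x_min, x_max, lam=1.0, tau=0.5):
        # Eq. (40): nearly constant inside [x_min, x_max], nearly linear outside
        return (lam * (x - x_min) - np.sqrt(lam**2 * (x - x_min)**2 + tau**2)
                + lam * (x_max - x) - np.sqrt(lam**2 * (x_max - x)**2 + tau**2))

    def total_reward(chi, bounds, penalty=hyperbolic_penalty):
        # Eq. (41): sum of the penalty evaluated on each scalar error variable
        return sum(penalty(x, lo, hi) for x, (lo, hi) in zip(chi, bounds))

    # Example with the ranges of Fig. 10 applied to three error channels
    r_k = total_reward([0.1, -1.5, 3.0], [(-2.0, 2.0)] * 3)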

Section 4 introduces the flight dynamics model used in all simulations and then presents the main learning experiment, with its set of hyperparameters, that trains an agent to follow a given combination of heading and altitude. Successively, some additional test cases are also reported that validate this DRL-based control strategy.

4 Control strategy validation

The flight dynamics model used for this research is provided by the JSBSim software library. JSBSim is a multi-platform, general-purpose, object-oriented FDM written in C++. The FDM is essentially the physics/math model that defines the 6-DoF movement of a flying vehicle, subject to the interaction with its natural environment, under the forces and moments applied to it using the various control mechanisms. The mathematical engine of JSBSim solves the nonlinear system (1)–(11)–(17)–(18) of differential equations starting from a given set of initial conditions, with prescribed input laws or determined by a pilot-in-the-loop operating mode. The FDM implements fully customizable, data-driven flight control systems, aerodynamic models, propulsive models, and landing gear arrangement through a set of configuration text files in XML format.

The software can be run as an engineering flight simulator in a standalone batch mode (no graphical displays), or it can be integrated with other simulation environments. JSBSim includes a MATLAB S-function that makes it usable as a simulation block in Simulink. This feature, besides the MATLAB/Simulink Reinforcement Learning Toolbox, has made possible all simulations and learning strategies performed in this research.

A validation of JSBSim as a flight dynamics software library has been reported by several authors [30,37].

4.1 Agent training

Several learning experiments were carried out with various control goals, in order to assess the appropriate tuning of hyperparameters related to the DDPG algorithm and to the reward calculation function. A representative example of agent training is presented here, whose overall Simulink scheme is reported in Fig. 11, while a selection of successful control examples in different scenarios is presented in the next section.

The agent training process was tuned for a flight control task where a reference F-16 FDM was: (i) set in flight at 30,000 ft of altitude and at an assigned speed, with a randomly generated initial heading, and (ii) required to reach a target commanded heading ψc = 0 deg and a target commanded altitude hc = 27,000 ft within a time tT = 30 s, (iii) with a final wings-level and horizontal fuselage attitude—that is, following zero commanded roll and elevation angles φc = θc = 0 deg. The target flight condition is a translational flight, i.e. a motion with zero commanded angular speeds pc = qc = rc = 0 deg/s.

Thousands of simulations were performed to train the agent within the MATLAB/Simulink environment and to reach a fine-tuned control for the assigned task. In the initial trials, the agent was trained with the two different types of reward functions presented in Sect. 3.11, and finally it was determined that the hyperbolic penalty function (40) was the one that gave the best results.

By defining the error variables:

Δh = hc − h,   Δφ = φc − φ,   Δθ = θc − θ,   Δψ = ψc − ψ,   Δp = pc − p,   Δq = qc − q    (42)

the observation vector χ in this scenario is defined as follows:

χ = ( Δh, Δφ, Δθ, Δψ, Δp, Δq, α, β, δ̃T, δ̃a, δ̃e, δ̃r, δ̃f )    (43)

All simulations reaching a terminal state at the final time tT with an altitude error |Δh| > 2000 ft were marked with a final cumulative reward RT = −1000 (control target unattained).

Table 1 summarizes the initial conditions of all simulations required to train the agent to follow a given heading and altitude. The main hyperparameters of the fine-tuned training process with the DDPG algorithm are reported in Table 2, while the reward function hyperparameters are listed in Table 3. The major training setup options are reported in Table 4. The computational cost to run the simulations on a personal computer equipped with an Intel i7-9750H CPU, a DDR4 RAM of 32 GB and an Nvidia GPU RTX2060 is summarized in Table 5. Finally, the cumulative reward history Re, as a function of the number of episodes, is plotted in Fig. 12.
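A sketch of how the error variables of Eq. (42), the observation vector of Eq. (43) and the terminal-reward rule just described can be assembled is given below; the dictionary keys are hypothetical placeholders, not JSBSim property names or Simulink signal labels.

    import numpy as np

    def observation(state, cmd):
        # Errors of Eq. (42) followed by aerodynamic angles and normalized controls, Eq. (43)
        errors = [cmd["h"] - state["h"],         cmd["phi"] - state["phi"],
                  cmd["theta"] - state["theta"], cmd["psi"] - state["psi"],
                  cmd["p"] - state["p"],         cmd["q"] - state["q"]]
        controls = [state["dT"], state["da"], state["de"], state["dr"], state["df"]]
        return np.array(errors + [state["alpha"], state["beta"]] + controls)

    def terminal_reward(state, cmd, t, t_final=30.0, alt_tol_ft=2000.0):
        # Episodes ending at t_T with |Delta h| > 2000 ft get R_T = -1000 (target unattained)
        if t >= t_final and abs(cmd["h"] - state["h"]) > alt_tol_ft:
            return -1000.0
        return 0.0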


Fig. 11 Simulation scheme in Simulink

Table 1 Initial conditions for the heading and altitude control training

Parameter | Symbol | Value
Altitude | h | 30,000.0 ft (9144.0 m)
First aircraft velocity component | u | 750.0 ft/s (228.6 m/s)
Second aircraft velocity component | v | 0.0 ft/s (0.0 m/s)
Third aircraft velocity component | w | 0.0 ft/s (0.0 m/s)
Latitude | μ | 47.0 deg
Longitude | l | 122.0 deg
Roll angle | φ | 0.0 deg
Pitch angle | θ | 0.0 deg
Heading angle | ψ | Random ∈ (0, 360) deg

Table 2 Agent hyperparameters for the heading and altitude control training

Parameter | Value
Sample time | 0.2
Batch size | 256
Noise standard deviation decay | 1 × 10⁻⁵
Actor learning rate | 1 × 10⁻³
Actor gradient threshold | 1
Critic learning rate | 1 × 10⁻³
Critic gradient threshold | 1

Table 5 Statistics for the heading and altitude control agent training scenario (hardware: Intel i7-9750H CPU, DDR4 32 GB of RAM, Nvidia RTX2060 GPU)

Performed episodes | 2889
Maximum obtained reward | −24.60
Elapsed time | 6 h 51 min

4.2 Simulation scenarios

This section presents the results of various test case simulations where different control tasks are successfully accomplished by the same agent presented in Sect. 4.1. With reference to the scheme of Fig. 5, the pilot's commanded inputs are essentially replaced by the agent's action on the primary controls—i.e. stick and pedals (δ̃a, δ̃e, δ̃r) and the throttle lever, δ̃T—forming the four-dimensional input vector ũagt defined in (20). These signals are filtered by the FCS, whose output is then converted into aerosurface deflections and the actual throttle setting, and passed to the aircraft dynamics simulation block (directly interfaced to JSBSim), together with other control effector signals produced by the FCS logics. These form the full input vector u defined in (21).
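The signal flow just described (agent action on the four primary controls, filtering by the FCS, conversion into deflections and throttle setting, integration of the aircraft dynamics) can be summarized by the following schematic loop. All the callables passed as arguments, such as fcs_filter and plant_step, are placeholders standing in for the corresponding Simulink/JSBSim blocks, not actual library interfaces.

    def run_episode(agent, fcs_filter, plant_step, get_observation, compute_reward,
                    n_steps=150, dt=0.2):
        # One simulated episode: u_agt -> FCS -> aircraft FDM -> observation/reward -> agent
        transitions = []
        obs = get_observation()                  # initial observation chi, Eq. (43)
        for _ in range(n_steps):
            u_agt = agent.act(obs)               # throttle, aileron, elevator, rudder commands
            u_full = fcs_filter(u_agt)           # FCS output, including scheduled effectors
            plant_step(u_full, dt)               # advance the 6-DoF flight dynamics model
            next_obs = get_observation()
            r = compute_reward(next_obs)         # reward shaping of Sect. 3.11
            transitions.append((obs, u_agt, r, next_obs))
            obs = next_obs
        return transitions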


Table 3 Hyperbolic penalty parameters for the heading and altitude control scenario

Parameter | Lower bound (xmin) | Upper bound (xmax) | λ | τ
Altitude error (Δh) | −5 | 5 | 1/398 | 0
Roll rate (Δp) | −0.1 | 0.1 | 125/78 | 0
Pitch rate (Δq) | −0.1 | 0.1 | 125/78 | 0
Roll angle error (Δφ) | −0.02 | 0.02 | 125/78 | 0
Pitch angle error (Δθ) | −0.02 | 0.02 | 125/78 | 0
Heading angle error (Δψ) | −0.02 | 0.02 | 125/78 | 0

Table 4 Training options for the heading and altitude control scenario

Parameter | Value
Maximum number of episodes | 5000
Maximum number of steps per episode | 150
Stop training criteria | Average reward (of five successive episodes) ≥ −50

4.2.1 Heading and altitude following

The heading and altitude following scenario is the one that was actually set up to train the agent and introduced in Sect. 4.1. The details of a representative simulation with control inputs provided by the trained agent are presented here.

The task of achieving a zero heading angle and an assigned new flight attitude is accomplished within the prescribed 30 s. The agent's inputs as well as the FCS outputs are plotted as normalized flight command time histories in Fig. 13, where throttle setting values above 1 mean that the jet engine afterburner is being used. The time histories of the primary aerosurface deflections, as actual inputs to the aircraft dynamics model, are shown in Fig. 14. Time histories of aircraft state variables, such as attitude angles and aerodynamic angles, velocity components, normal load factor, Mach number and angular velocity components, are reported in Figs. 16 and 17.

4.2.2 Waypoint following

This test case generalizes the previous scenario by introducing a number of sequentially generated random waypoints during the simulation, which the aircraft is required to reach under the agent's control. As shown in Fig. 18, the same agent that was trained to accomplish the simpler task presented in the previous example is now deployed in a new simulation scenario where the reference values ψc and hc change over time. The additional logic with respect to the previous case receives a multi-dimensional reference signal, calculates the error terms, and injects them into a reward calculation block. The reward thus calculated directs the agent to follow a given flight path marked by multiple waypoints.

Fig. 12 Episode reward history during the training for the heading and altitude control scenario


Fig. 13 F-16
agent-controlled heading
and altitude following
simulation scenario.
Normalized flight
commands histories, as
provided by the agent and
filtered by the FCS


Fig. 14 F-16
agent-controlled heading
and altitude following
simulation scenario. Actual
primary aerosurface
deflections correspond to
the command inputs coming
from the FCS. See also
Fig. 5

Fig. 15 F-16
agent-controlled heading
and altitude following
simulation scenario.
Aerosurface deflections
δf,TE and δf,LE generated by
the FCS logics. See also
Fig. 5


The waypoints form a discrete sequence {(lc, μc, hc)i | i = 1, 2, …, n} of n locations in space, of geographic coordinates (l, μ) and altitude h, generated at subsequent time instants t1, t2, …, tn. In the example presented here, a sequence of n = 10 waypoints has been considered.

Starting from a random initial flight condition, with random heading, speed and altitude, the goal is to reach the next random waypoint, and successively, one by one, all the other waypoints as they reveal themselves to the agent along the way. For 0 = t0 ≤ t < t1 the aircraft points to waypoint (lc, μc, hc)1; when at t = t1 the vehicle is labelled as sufficiently close to the first target, the waypoint (lc, μc, hc)2 is generated and pursued for t1 ≤ t < t2; the scheme repeats itself until the last waypoint is reached after time tn. The geographic coordinates and the general aeroplane position tracking are handled through the JSBSim internal Earth model.

At the generic instant t of the simulation, with the aircraft gravity centre having geographic longitude l(t) and latitude μ(t), once the vehicle is commanded to fly towards the next ith waypoint, the heading ψc,i that the agent is required to follow after time ti−1 is calculated as:

ψc,i(t) = atan2( sin( lc,i − l(t) ) cos μ(t),  cos μ(t) sin μc,i − sin μ(t) cos μc,i cos( lc,i − l(t) ) )    (44)

for ti−1 ≤ t < ti. Therefore, the altitude and heading errors considered in the reward calculation function become:

Δh(t) = hc,i − h(t),   Δψ(t) = ψc,i − ψ(t)    (45)

for ti−1 ≤ t < ti and i = 1, …, n. The above formulas represent the time-varying references provided to the control agent.

The agent looks at one waypoint at a time in this particular test case. When the aircraft reaches an assigned threshold distance from the currently pointed waypoint, the next one is generated and passed to the agent. The instantaneous distance from the aircraft gravity centre to the current waypoint is calculated with the Haversine formula [38].

The behaviour of the agent in the multiple waypoint following task can be figured out from the simulation results presented below.
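For illustration, the commanded bearing towards the active waypoint and the Haversine distance used in the switching test can be sketched as follows (angles in radians, Earth radius in metres). The bearing is written here in the standard initial great-circle form, of which Eq. (44) is the instance used in this work, and the code is not the JSBSim Earth model employed in the simulations.

    import math

    def commanded_heading(lon, lat, lon_wp, lat_wp):
        # Standard initial great-circle bearing from (lon, lat) towards the waypoint
        d_lon = lon_wp - lon
        x = math.sin(d_lon) * math.cos(lat_wp)
        y = math.cos(lat) * math.sin(lat_wp) - math.sin(lat) * math.cos(lat_wp) * math.cos(d_lon)
        return math.atan2(x, y)

    def haversine_distance(lon1, lat1, lon2, lat2, R=6_371_000.0):
        # Great-circle distance used to decide when a waypoint has been reached [38]
        d_lat, d_lon = lat2 - lat1, lon2 - lon1
        a = (math.sin(d_lat / 2.0) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(d_lon / 2.0) ** 2)
        return 2.0 * R * math.asin(math.sqrt(a))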


Fig. 16 F-16
agent-controlled heading
and altitude following
simulation scenario. Time
histories of altitude, attitude
angles and aerodynamic
angles

The agent's inputs as well as the FCS outputs are plotted as normalized flight command time histories in Fig. 19. The time histories of the primary aerosurface deflections, as actual inputs to the aircraft dynamics model, are shown in Fig. 20. Time histories of aircraft state variables, such as attitude angles and aerodynamic angles, velocity components, normal load factor, Mach number and angular velocity components, are reported in Figs. 21 and 22. Figure 23 reports the actual and commanded values of heading ψ, longitude l and latitude μ. This scenario is also represented on the map of Fig. 24, reporting the ground track of the aircraft trajectory, the sequence of 10 waypoints and the commanded waypoint altitudes. Finally, the three-dimensional flight path and aircraft body attitude evolution are shown in Fig. 25.

4.2.3 Varying target with sensor noise

This test case is set up to investigate the agent's behaviour when external disturbances are injected into the environment. The simulation scenario is similar to the heading and altitude following example presented in Sect. 4.2.1, but in this case the commanded altitude hc and heading ψc do change in time according to assigned stepwise constant functions. In addition, the
assigned stepwise constant functions. In addition, the


state signals that define the agent’s observation vector


(43) and that are sent also as feedback to the FCS are
perturbed with prescribed noisy disturbances.
In particular, a set of random (Gaussian) normally
distributed noise signals are generated and then added
to some state variables in order to simulate an uncer-
tainty on data available to the controllers. The results
of a simulation case with zero-mean perturbations—
assuming an accurately calibrated set of sensors, which
is typically expected from the class of aircraft consid-
ered in this study—are reported in Fig. 26. Table 6 lists
the main parameters of the additive noisy signals.
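A minimal sketch of this perturbation, i.e. zero-mean Gaussian noise with the variances of Table 6 added to the measured states, is the following; the state names are illustrative labels only.

    import numpy as np

    rng = np.random.default_rng()

    # Variances from Table 6: angular rates (rad/s)^2, altitude (m)^2, angles (rad)^2
    NOISE_VARIANCE = {"p": 0.01, "q": 0.01, "r": 0.01, "h": 1.0,
                      "psi": 0.01, "theta": 0.01, "phi": 0.01,
                      "alpha": 0.01, "beta": 0.01}

    def perturb(measured):
        # Add zero-mean Gaussian noise to each measured state (well-calibrated sensors)
        return {name: value + rng.normal(0.0, np.sqrt(NOISE_VARIANCE.get(name, 0.0)))
                for name, value in measured.items()}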
The results of this simulation case are similar to
those reported in previous examples. The outcome of
the agent’s control actions is represented by the time
histories plotted in Fig. 26, which shows the instanta-
neous aircraft altitude and heading beside their cor-
responding commanded values. The same figure also
reports the aerosurface deflection angles as provided to
the aircraft FDM.

4.2.4 Prey–chaser scenario

This test case evolves from the waypoint following sce-


nario presented in Sect. 4.2.2 and provides a basis for
possible applications of the present agent-based con-
trol approach to the field of military fighter pilot train-
ing (to dogfight and formation flight, for instance). In
a prey–chaser scenario, two aeroplane models coexist
and share the same flight environment where the first aircraft acts as the prey, being chased by the second one, i.e. the chaser.

Fig. 17 F-16 agent-controlled heading and altitude following simulation scenario. Time histories of velocity components, normal load factor, Mach number and angular velocity components

Fig. 18 Control scheme of a waypoint following scenario. The references given by successive waypoints and altitudes are injected into the reward calculation


Fig. 19 F-16
agent-controlled waypoint
following simulation
scenario. Normalized flight
commands histories, as
provided by the agent and
filtered by the FCS


Fig. 20 F-16
agent-controlled waypoint
following simulation
scenario. Actual primary
aerosurface deflections
corresponding to the
command inputs coming
from the FCS

In the particular test case presented here, both the chaser and the prey are piloted by an RL-based agent. In the simulation environment, there are two replicas of the same F-16 model, both of them piloted by two identical trained agents (Agent vs. Agent, the same presented in Sect. 4.1). The first agent controls the prey aircraft to follow a given sequence of random waypoints, much as in the previous multiple waypoint following example. The second agent controls the chaser aircraft by acquiring the successive positions of the prey with a given frequency and following them as if they were virtual waypoints.

The prey–chaser interaction within the waypoint subsequence 1 to 8 is shown in detail by the maps of Fig. 27. The full picture for the full waypoint sequence 1 to 10 is shown in Fig. 28, where also the random initial positions of the two aeroplanes are marked.

This particular simulation demonstrates the ability of the chaser agent to tighten its trajectory when appropriate, for instance, when the prey passes from waypoint 5 to waypoint 6. Another interesting behaviour is observed when the chaser aircraft overtakes the prey when waypoint 8 is reached and surpassed. In this case, the agent-controlled chaser aircraft performs a complete turn to position itself behind the prey and continue the target following task.

5 Discussion

All simulation examples introduced in the previous section demonstrate the validity of the trained agent when it is directed to execute different control tasks with a progressive level of difficulty.

5.1 Path control with fixed reference

The simplest example showing a case of path control with a fixed reference is presented in Sect. 4.2.1. The assigned control task is precisely what the agent was trained for, i.e. a case of exploitation. The aircraft is able to reach a target heading angle and an assigned new flight altitude, starting from a randomly generated initial flight condition (random heading, altitude, and speed), within a flight time of 30 s. The agent achieves this result by providing the input actions—normalized flight commands—shown in the time histories of Fig. 13. The agent inputs are filtered by the FCS, whose output signals are also plotted in the same figure. In particular, it is seen that to reach the assigned goal as quickly as possible the agent demands a high thrust level, thus requiring the use of the jet engine afterburner (see the topmost plot of Fig. 13, where the output δ̃T from the FCS, for 12 s ≤ t ≤ 22 s, becomes higher than 1). Time histories of the other primary and secondary controls as well as of the aircraft state variables clearly show an initial left turn, combined with a dive, to reach a prescribed lower-altitude, North-pointing flight path. A typical FCS behaviour for this type of high-performance fighter jet is observed in the plot at the bottom of Fig. 13: the output δ̃r from the FCS for 12 s ≤ t ≤ 22 s exhibits severe filtering of the agent's rudder command in order to keep the sideslip angle as low as possible (see β within the same time interval in Fig. 16). The manoeuvre is confirmed by the time histories of primary aerosurface deflections shown in Fig. 14 and of wing leading-edge/trailing-edge flap deflections reported in Fig. 15. The left turn results from the negative (right) aileron deflection δa (right aileron down, left aileron up), for 1 s ≤ t ≤ 6 s.
it is directed to execute different control tasks with a
results from the negative (right) aileron deflection δa
progressive level of difficulty.


Fig. 21 F-16
agent-controlled waypoint
following simulation
scenario. Time histories of
altitude, attitude angles and
aerodynamic angles

The initial dive results from the combined action of tail and wing leading-edge flap deflections, δe and δf,LE, respectively, for 0 s ≤ t ≤ 6 s, leading to a pitch-down manoeuvre (see negative θ within the same time interval in Fig. 16). Figure 15 reveals an inherent complexity of such a controlled flight example, showing time-varying deflections of leading-edge and trailing-edge flaps mounted on the main wing. These deflections are due to the high-fidelity implementation of FCS control logics that can trigger the actuation of high-lift devices when some aircraft state variables fall within prescribed ranges (see Appendix C). Time histories of aircraft state variables—such as altitude, attitude angles and aerodynamic angles reported in Fig. 16, and velocity components, normal load factor, flight Mach number and angular velocity components reported in Fig. 17—confirm the left turn/dive manoeuvre to reach the prescribed terminal state. From Fig. 16, in particular, the errors Δh = hc − h, Δψ = −ψ, Δθ = −θ, Δφ = −φ and their vanishing behaviour with time are easily deduced and confirm the effectiveness of the


Fig. 22 F-16
agent-controlled waypoint
following simulation
scenario. Time histories of
velocity components,
normal load factor, Mach
number and angular
velocity components


Fig. 23 F-16
agent-controlled waypoint
following simulation
scenario. Actual and
commanded values of
heading ψ, longitude l and
latitude μ. See also Eq. (44)

agent's control actions. This test case clearly shows a valid AI-based control in the presence of nonlinear effects. For instance, the significant variations of altitude, flight Mach number and angle of attack do trigger all those nonlinearities accurately modelled in the aircraft FDM (see Appendices A and B).

5.2 Path control with varying reference

A case of multiple waypoint following task is introduced in Sect. 4.2.2. This scenario provides an interesting example of controlled flight with a moving reference. The simulation is performed by using the same control agent that was trained for the simpler one-reference heading/altitude following task. Interestingly, the simulation results presented in Figs. 19 to 23 demonstrate that also the multiple waypoint following exercise is successfully achieved. The agent reacts appropriately to each randomly generated new reference and effectively enacts its policy to control the plant dynamics. The ground track reported in Fig. 24 and the three-dimensional trajectory presented in Fig. 25 show the sequence of turns, dives and climbs that are flown to accomplish the assigned task.


Fig. 25 F-16 agent-controlled waypoint following simulation


scenario. Three-dimensional flight path and aircraft body attitude
evolution. Aircraft geometry not in scale (magnified 1250 times
for clarity)
Table 6 Aircraft state variables affected by additive zero-mean noisy disturbances in the test case of Sect. 4.2.3

Perturbed states | Unit | Variance (σ²)
p, q, r | rad/s | 0.01
h | m | 1.0
ψ, θ, φ | rad | 0.01
α, β | rad | 0.01
Fig. 24 F-16 agent-controlled waypoint following simulation
scenario. Ground track of the aircraft trajectory, sequence of
assigned waypoints and commanded waypoint altitudes
5.3 Control validation in the presence of noise

Section 4.2.3 reports an example of how the trained agent behaves when the state observations are perturbed by additive noisy signals. This test case, with the simulation results shown in Fig. 26, proves the robustness of the agent's control actions with respect to external disturbances, within the assumption of well-calibrated sensors.

5.4 Simulation of a prey–chaser scenario

Finally, Sect. 4.2.4 presents an interesting prey–chaser simulation scenario, demonstrated by the ground tracks reported in Figs. 27 and 28. Two instances of the same trained agent direct two instances, respectively, of the same model of flying vehicle, simulating an air engagement. A virtual simulation environment based on the FlightGear flight simulation software¹ has been set up as a means to visualize the air combat in a proper scenery. The scheme of Fig. 29 depicts how the two aircraft instances and their states are represented in the chosen airspace. Four successive screen captures of the virtual simulation environment are represented in Fig. 30, taken while the prey aircraft, after having reached waypoint 3, is pursuing waypoint 4. The figure shows two camera views on the left, following the prey and the chaser aircraft at fixed distances, respectively. The views are synchronized to the evolving ground tracks shown on the right. As seen in this excerpt of the simulation, the chasing fighter enters the field of view of the first camera as it flies at a higher Mach number, while the leading aircraft is still pursuing waypoint 4. Eventually, the chaser reaches its target, arriving at the prey's tail. This test case is one example of several other interesting simulation possibilities. In fact, assuming, for example, that the chaser is piloted by the agent discussed here, the prey can be piloted by a completely different agent instructed to accomplish a prescribed manoeuvre or, as an alternative, the prey can be a human-in-the-loop piloted model. These are possible applications of the flight control approach presented in this study that have the potential to enhance pilot training procedures by means of AI-augmented simulation environments.

¹ www.flightgear.org.


Fig. 26 F-16
agent-controlled simulation
scenario with varying
commanded heading and
altitude, and zero-mean
sensor noise. Time histories
of altitude, heading and
aerosurface deflection
angles

6 Conclusion

This research presents a high-performance aircraft flight control technique based on reinforcement learning and provides an example of how AI can generate a valid controller. The proposed approach is validated by using a reference simulation environment where the nonlinear, high-fidelity flight dynamics model of a military fighter jet is used to train an agent for a selected set of controlled flight tasks. The simulation results show the effectiveness of the control in making certain manoeuvres fully automatic in highly dynamic scenarios, even in the presence of sensor noise and atmospheric disturbances. A future research direction that should evolve from this study is the comparison of the proposed AI-based controller to other standard types of flight control.

Supplementary information

This article has no accompanying supplementary file.


Fig. 27 Details of the waypoint sequence 1–8, and of the two ground tracks in the prey–chaser simulation scenario

Fig. 28 Map projection of aircraft trajectory, and assigned waypoints, by a prey–chaser scenario simulation

Fig. 29 Virtual simulation environment based on JSBSim, MATLAB/Simulink and FlightGear


Fig. 30 Screen captures at


four successive instants of
the one-to-one air combat
virtual simulation (times 1
and 2) and (times 3 and 4)


Fig. 30 continued

Acknowledgements The authors would like to thank the editors and reviewers of Nonlinear Dynamics for their valuable efforts in the review of this paper.

Funding Open access funding provided by Università degli Studi di Napoli Federico II within the CRUI-CARE Agreement. No funding was received for conducting this study.

Data availability The data that support the findings of this study are available from the corresponding authors, ADM or SM, upon reasonable request.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

Consent for publication The authors grant the Journal of Nonlinear Dynamics the authority to publish this work.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Appendix A: Aerodynamic model of the General δ̃gear : normalized gear position (0: retracted; 1: fully
Dynamics F-16 Fighting Falcon deployed)

The high-performance aircraft FDM selected for this


research incorporates the nonlinear JSBSim aero- Cross force coefficient
dynamic model [30], which we summarize in this
appendix. With reference to the definitions given by The cross force coefficient CC that appears in Eq. (9) is
Eqs. (4), (9), (13) and (14), the aerodynamic coef- also called CY by many authors. For the F-16, the side
ficients are implemented as functions of the aircraft force coefficient is modelled as follows:
state and input variables, x and u. Each coefficient  
CY = CYβ + CYβ (M) β + CYδa δa + CYδr δr
is expressed as a sum of nonlinear terms, commonly
  b
known as the aerodynamic build-up formula. The + CY p(α) p + CYr(α) r (A3)
detailed build-up formulas for the F-16 fighter are 2V
reported below, where all angles are in radians.
Roll moment coefficient
Lift coefficient

  CL = CL
bas
β
(α, β) + CLβ(M) β
h
C L = kC L ,ge C Lbas (α, δe ) + C L δf,LE(α) δf,LE   b
b + CL p(α) p + CLr(α) r
 2V 
+C L δf,TE δf,TE + C L δsb(α) δsb
+ CLδa(α, β) + CLδa(M) α δa
  qc   
+ C L q(α) + kδsb(α) δsb (A1) + CLδr (α, β) + CLδr (M) α δr (A4)
2V
where the functions kC L ,ge (h/b), C Lbas (α, δe ), C L δf,LE (α),
C L δsb (α), C L q (α), kδsb (α) are reported, for the sake of Pitch moment coefficient
example, in Figs. 31 and 32.

Drag coefficient CM = CM
bas
(α, δe ) + CMα(M) α
qc
+CMδsb(α) δsb + CMq(α) (A5)
2V
CD = CD
bas
(α, δe ) + C D (M) + C Dδf,LE(α) δf,LE
+C Dδf,TE δf,TE Yaw moment coefficient
+C Dδsb(α) δsb + C Dδ̃ δ̃gear
gear
  qc
+ C Dq(α) + kδf,LE(α) δf,LE (A2) CN = CN
bas
(α, β) + CNβ(M)β
2V

Fig. 31 Mapped lift


coefficient terms of General
Dynamic F-16 JSBSim
model


Fig. 32 Mapped lift coefficient terms of General Dynamic F-16 JSBSim model

  b Appendix B: Thrust model of the General Dynamics


+ CN p(α) p + CNr(α) r
 2V F-16 Fighting Falcon
+ CNδa(α, β) + CNδa(M) δa
 
+ CNδr(α, β) + CNδr(M) α δr (A6) With reference to definitions (10), (13) and (15), the
instantaneous thrust is modelled in JSBSim [30,39] as
For a complete description of the F-16 aerody- follows:
namics model, the reader can refer to the official
T = δT Tmax,SL T̃max h DA , M (B7)
JSBSim repository https://github.com/JSBSim-Team/
jsbsim exploring the folder aircraft/f16 and the where the normalized thrust T̃max is a function of den-
<aerodynamics/> block in the configuration file sity altitude h DA and flight Mach number M (see also
aircraft/f16/f16.xml. Table 7). The density altitude h DA is one of the quanti-


Table 7 Pratt and Whitney F100-PW-229 JSBSim model property


Property Symbol Value

Maximum static thrust at sea level Tmax,SL 17,800.0 lbf (79,178.3 N)


Maximum static thrust, with afterburner, at sea level Tmax,ab,SL 29,000.0 lbf (128,998 N)
Bypass ratio BPR 0.4
 
lb kg
Thrust-specific fuel consumption at cruise TSFC 0.74 lbf h 0.075 N h
 
lb kg
Thrust-specific fuel consumption at cruise, with afterburner ATSFC 2.05 lbf h 0.209 N h
Angle of incidence of the thrust axis μT 0.0 deg
Thrust line eccentricity eT 0.0 m

ties defined by the ICAO International Standard Atmo- The error (C9) is a difference between the commanded
sphere (ISA) model as functions of the altitude h (m) roll rate and the actual roll rate, which is named
above the mean sea level (MSL) [31]. fcs/roll-trim-error in JSBSim input/configu-
For known values of altitude h and airspeed V , ration meta-language and implemented by the follow-
the density altitude h DA and the flight Mach number ing XML fragment:
M = V /a(h) allow the calculation of a normalized Listing 1 F-16 roll channel FCS logic, PID error function.
maximum available thrust in various engine conditions
< p u r e _ g a i n name = " fcs / roll - rate - norm " >
by interpolating the data points shown in Fig. 33. → p̂(t) := . . .
< input > v e l o c i t i e s / p - aero - r a d _ s e c </ input >
For a complete description of the F-16 propulsive → p(t)
model, the reader can refer to the official JSBSim repos- < gain > 0 . 3 1 8 2 1 </ gain > → G p (s/rad)
</ p u r e _ g a i n >
itory https://github.com/JSBSim-Team/jsbsim explor-
ing the folder aircraft/f16, the <propulsion/> < s u m m e r name = " fcs / roll - trim - error " >
→ eroll (t) := . . .
block in the configuration file aircraft/f16/f16. < input > fcs / aileron - cmd - norm </ input > → δ̃a (t)
xml, and the engine configuration file engine/F < input > - fcs / roll - rate - norm </ input > → − p̂(t)
</ s u m m e r >
100-PW-229.xml.
This roll channel controller is named fcs/roll-
rate-pid in Listing 2 and becomes active in flight
Appendix C: Flight control system of the General whenever the calibrated airspeed VCAS ≥ 20 kts
Dynamics F-16 Fighting Falcon (an excerpt) (fcs/aileron
-pid-trigger set to 1).
The FCS within the JSBSim FDM [30,39] of the F-16
Listing 2 F-16 roll channel FCS logic, as defined in JSBSim
implements several ideal/parallel PID controllers act-
input/configuration meta-language in XML format.
ing on different input channels (see Fig. 5). Each con-
< s w i t c h name = " fcs / aileron - pid - t r i g g e r " >
troller attempts to minimize a given error function e(t) < d e f a u l t value = " 1 " / >
over time by adjusting a control variable u(t) to a new < test value = " 0 " >
v e l o c i t i e s / vc - kts lt 20.0
value determined by the following weighted sum: </ test >
! t </ s w i t c h >
de(t)
u(t) = K p e(t) + K i e(τ ) dτ + K d (C8) < pid name = " fcs / roll - rate - pid " > → u roll (t) := . . .
0 dt < t r i g g e r > fcs / aileron - pid - t r i g g e r </ t r i g g e r >
For the roll channel, the FCS features a PID con- 0 or 1
< input > fcs / roll - trim - error </ input >
troller that minimizes the following nondimensional < kp > 3 . 0 0 0 0 0 </ kp > → K p
error function: < ki > 0 . 0 0 0 5 0 </ ki > → K i
< kd > -0.00125 </ kd > → K d
eroll (t) = δ̃a (t) − p̂(t) (C9) </ pid >

where δ̃a is the aileron normalized command input (in Finally, for the roll channel, the closed-loop input com-
the interval [−1, 1]), and p̂ = G p p is a nondimen- mand δ̃a is defined and then converted into an aerosur-
sional roll rate, with G p a constant gain (in s/rad). face deflection angle δa (t) ∈ [δa,min , δa,max ], named


Fig. 33 Mapped normalized thrust of General Dynamic F-16 JSBSim model

fcs/aileron-pos-rad as shown by the follow- < max > 0.375 </ max > → δa,max
</ range >
ing XML input fragment: < o u t p u t > fcs / aileron - pos - rad </ o u t p u t >
→ δa (t) ∈ [δa,min , δa,max ]
Listing 3 F-16 roll channel FCS logic, as defined in JSBSim </ a e r o s u r f a c e _ s c a l e >
input/configuration meta-language in XML format.
< s u m m e r name = " fcs / roll - rate - c o m m a n d " > The FCS control laws for the remaining input chan-
→ δ̃a := δ̃a + u roll nels are defined in a similar manner. For the pitch chan-
< input > fcs / roll - rate - pid </ input >
< input > fcs / aileron - cmd - norm </ input > nel, the FCS control logic features a cascade of inter-
< clipto >
< min > -1 </ min >
connected blocks. The high-level control logic is given
< max >1 </ max > by a PID controller assuming the following error func-
</ c l i p t o >
</ s u m m e r > tion:
< aerosurface_scale epitch (t) = q̂0 − G δe (t) − q̂(t) − n z B (t) (C10)
name = " fcs / aileron - c o n t r o l " >
< input > fcs / roll - rate - c o m m a n d </ input > that is, the difference between a commanded normal-
→ δ̃a ∈ [−1, 1]
< range >
ized pitch rate q̂0 , a nondimensional elevator scheduler
< min > -0.375 </ min > → δa,min gain G δe (as a function of the instantaneous angle of


attack α), the instantaneous aircraft normalized pitch References


rate q̂(t), and the normal load factor n z B = (gz B −
az B )/g, also known as g-load. The pitch channel PID 1. Stevens, B.L., Lewis, F.L.: Aircraft Control and Simulation.
Wiley-Interscience, Hoboken (2003)
controller is active whenever VCAS ≥ 5 kts. 2. Dally, K., Kampen, E.-J.V.: Soft actor-critic deep reinforce-
A low-level FCS logic for the pitch channel defines a ment learning for fault tolerant flight control. In: AIAA
limiter on the elevator command. The F-16, in fact, has SCITECH 2022 Forum. American Institute of Aeronautics
a maximum g-load allowable limit of n z B ,max = 9, and and Astronautics, Reston, VA, USA (2022). https://doi.org/
10.2514/6.2022-2078
a minimum limit of n z B ,min = −4. Moreover, the pitch
3. Wang, H., Liu, S., Yang, X.: Adaptive neural control for
control logic defines a particular behaviour of the ele- non-strict-feedback nonlinear systems with input delay. Inf.
vator according to the instantaneous value of the angle Sci. 514, 605–616 (2020)
of attack α(t). If α approaches 30 deg, the FCS will 4. Huo, X., Ma, L., Zhao, X., Niu, B., Zong, G.: Observer-based
adaptive fuzzy tracking control of mimo switched nonlinear
command full down elevator deflection, overriding the
systems preceded by unknown backlash-like hysteresis. Inf.
pilot’s commanded deflection, to prevent stalling. This Sci. 490, 369–386 (2019)
articulated behaviour determines the elevator scheduler 5. Xia, R., Chen, M., Wu, Q., Wang, Y.: Neural network based
gain G δe (t) in definition (C10). integral sliding mode optimal flight control of near space
hypersonic vehicle. Neurocomputing 379, 41–52 (2020)
For the yaw channel, the FCS features a PID con- 6. Zhao, H.-W., Liang, Y.: Prescribed performance dynamic
troller that minimizes the following error function: neural network control for a flexible hypersonic vehicle
with unknown control directions. Adv. Mech. Eng. 11(4),
1687814019841489 (2019)
1 7. Luo, C., Lei, H., Li, J., Zhou, C.: A new adaptive neural
eyaw (t) = δ̃r (t) + r̂ (t) + n y (t) (C11) control scheme for hypersonic vehicle with actuators multi-
4 B ple constraints. Nonlinear Dyn. 100(4), 3529–3553 (2020).
https://doi.org/10.1007/s11071-020-05707-2
8. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An
Introduction, 2nd edn. MIT Press, Cambridge (2018)
where δ̃r is the rudder normalized command input (in 9. Reddy, G., Wong-Ng, J., Celani, A., Sejnowski, T.J., Ver-
the interval [−1, 1]), r̂ = G r (t) r is a nondimensional gassola, M.: Glider soaring via reinforcement learning in
yaw rate, G r (t) is a variable gain (in s/rad, and func- the field. Nature 562(7726), 236–239 (2018)
tion of the instantaneous airspeed), and n yB = (g yB − 10. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari
a yB )/g is the lateral load factor. The PID controller is with deep reinforcement learning. In: Neural Information
active whenever the aircraft calibrated VCAS ≥ 10 kts. Processing Systems Deep Learning Workshop (2013). arXiv
For the throttle input channel, the FCS simply maps https://doi.org/10.48550/ARXIV.1312.5602
the normalized throttle command δ̃T ∈ [0, 1] into a sig- 11. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., Wierstra, D.: Continuous control with
nal δT ∈ [0, 2] assuming that when δT ≥ 1 the jet deep reinforcement learning. In: International Conference
engine afterburner becomes active. on Learning Representations (2015). arXiv https://doi.org/
The speed brake channel is used by the FCS to pre- 10.48550/ARXIV.1509.02971
vent deep stall. This control commands speed brake 12. Tsourdos, A., Dharma Permana, I.A., Budiarti, D.H., Shin,
H.-S., Lee, C.-H.: Developing flight control policy using
deflections at high angles of attack and low speeds. deep deterministic policy gradient. In: 2019 IEEE Interna-
This will provide just enough pitch-down moment to tional Conference on Aerospace Electronics and Remote
keep the aircraft under control. Moreover, speed brake Sensing Technology (ICARES), pp. 1–7 (2019). https://doi.
maximum deflection is of δsb,max = 60 deg and is lim- org/10.1109/ICARES.2019.8914343
13. Koch, W., Mancuso, R., West, R., Bestavros, A.: Reinforce-
ited to 43 deg when the gear is extended to prevent ment learning for UAV attitude control. ACM Trans. Cyber-
physical speed brake damage on touchdown. Phys. Syst. 3(2), 3301273 (2019). https://doi.org/10.1145/
For a complete description of the F-16 flight dynam- 3301273
ics model and its FCS, including control laws for the 14. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov,
O.: Proximal Policy Optimization Algorithms (2017). arXiv
wing trailing-edge and leading-edge flap deflections, https://doi.org/10.48550/ARXIV.1707.06347
the reader can refer to the official JSBSim reposi- 15. Bøhn, E., Coates, E.M., Moe, S., Johansen, T.A.: Deep
tory https://github.com/JSBSim-Team/jsbsim explor- reinforcement learning attitude control of fixed-wing UAVs
ing the folder aircraft/f16, the <flight_ using proximal policy optimization. In: 2019 International
Conference on Unmanned Aircraft Systems (ICUAS). IEEE,
control/> block in the configuration file aircraft
/f16/f16. xml and the folder systems.


Atlanta, GA, USA (2019). https://doi.org/10.1109/icuas. AIAA Modeling and Simulation Technologies Conference,
2019.8798254 10–13 August 2009, Chicago, Illinois. American Institute
16. Su, Z.-q, Zhou, M., Han, F.-f, Zhu, Y.-w, Song, D.-l, Guo, of Aeronautics and Astronautics, Reston, VA, USA (2009).
T.-t: Attitude control of underwater glider combined rein- https://doi.org/10.2514/6.2009-5699
forcement learning with active disturbance rejection control. 31. United States Committee on Extension to the Standard
J. Mar. Sci. Technol. 24(3), 686–704 (2019). https://doi.org/ Atmosphere, National Aeronautics and Space Administra-
10.1007/s00773-018-0582-y tion, National Oceanic and Atmospheric Administration,
17. Mishra, A., Ghosh, S.: Variable gain gradient descent-based U.S. Air Force: U.S. Standard Atmosphere, 1976. NOAA-
reinforcement learning for robust optimal tracking control SIT 76-1562. National Oceanic and Amospheric Adminis-
of uncertain nonlinear system with input constraints. Non- tration, Washington, DC, USA (1976)
linear Dyn. 107(3), 2195–2214 (2022). https://doi.org/10. 32. Janota, A., Šimák, V., Nemec, D., Hrbček, J.: Improving the
1007/s11071-021-06908-z precision and speed of Euler angles computation from low-
18. Zhang, H., Huang, C.: Maneuver decision-making of deep cost rotation sensor data. Sensors 15(3), 7016–7039 (2015).
learning for UCAV thorough azimuth angles. IEEE Access https://doi.org/10.3390/s150307016
8, 12976–12987 (2020) 33. Brunton, S.L., Kutz, J.N.: Data-Driven Science and Engi-
19. Lee, D., Kim, S., Suk, J.: Formation flight of unmanned neering: Machine Learning, Dynamical Systems, and Con-
aerial vehicles using track guidance. Aerosp. Sci. Technol. trol, 2nd edn. Cambridge University Press, Cambridge
76, 412–420 (2018). https://doi.org/10.1016/j.ast.2018.01. (2022). https://doi.org/10.1017/9781009089517
026 34. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness,
20. Li, Y.-f, Shi, J.-p, Jiang, W., Zhang, W.-g, Lyu, Y.-x: J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidje-
Autonomous maneuver decision-making for a UCAV in land, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik,
short-range aerial combat based on an MS-DDQN algo- A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D.,
rithm. Def. Technol. 18(9), 1697–1714 (2022) Legg, S., Hassabis, D.: Human-level control through deep
21. Cambone, S.A., Krieg, K., Pace, P., Linton, W.: Unmanned reinforcement learning. Nature 518(7540), 529–533 (2015).
aircraft systems roadmap 2005–2030. Off. Secr. Def. 8, 4–15 https://doi.org/10.1038/nature14236
(2005) 35. Peters, J., Schaal, S.: Reinforcement learning of motor
22. Wang, H., Liu, P.X., Bao, J., Xie, X.-J., Li, S.: Adaptive skills with policy gradients. Neural Networks 21(4), 682–
neural output-feedback decentralized control for large-scale 697 (2008). https://doi.org/10.1016/j.neunet.2008.02.003.
nonlinear systems with stochastic disturbances. IEEE Trans. (Robotics and Neuroscience)
Neural Netw. Learn. Syst. 31(3), 972–983 (2019) 36. Hafner, R., Riedmiller, M.: Reinforcement learning in feed-
23. Yuksek, B., Inalhan, G.: Reinforcement learning based back control. Mach. Learn. 84, 137–169 (2011). https://doi.
closed-loop reference model adaptive flight control system org/10.1007/s10994-011-5235-x
design. Int. J. Adapt. Control Signal Process. 35(3), 420–440 37. Nicolosi, F., De Marco, A., Sabetta, V., Della Vecchia, P.:
(2021). https://doi.org/10.1002/acs.3181 Roll performance assessment of a light aircraft: flight sim-
24. McGrew, J.S., How, J.P., Williams, B., Roy, N.: Air-combat ulations and flight tests. Aerosp. Sci. Technol. 76, 471–483
strategy using approximate dynamic programming. J. Guid. (2018). https://doi.org/10.1016/j.ast.2018.01.041
Control. Dyn. 33(5), 1641–1654 (2010). https://doi.org/10. 38. The Cosine-Haversine formula: American Mathematical
2514/1.46815 Monthly 64(1), 38 (1957). https://doi.org/10.2307/2309088
25. Liu, X., Yin, Y., Su, Y., Ming, R.: A multi-UCAV cooperative 39. Snell, S., Enns, D., Garrard, W., Jr.: Nonlinear control of
decision-making method based on an MAPPO algorithm a supermaneuverable aircraft. J. Guid. Control. Dyn. 15(4),
for beyond-visual-range air combat. Aerospace 9(10), 563 976–984 (1992). https://doi.org/10.2514/6.1989-3486
(2022). https://doi.org/10.3390/aerospace9100563
26. Hu, D., Yang, R., Zuo, J., Zhang, Z., Wu, J., Wang, Y.: Appli-
cation of deep reinforcement learning in maneuver planning Publisher’s Note Springer Nature remains neutral with regard
of beyond-visual-range air combat. IEEE Access 9, 32282– to jurisdictional claims in published maps and institutional affil-
32297 (2021) iations.
27. Wang, M., Wang, L., Yue, T., Liu, H.: Influence of unmanned
combat aerial vehicle agility on short-range aerial com-
bat effectiveness. Aerosp. Sci. Technol. 96, 105534 (2020).
https://doi.org/10.1016/j.ast.2019.105534
28. Yang, Q., Zhu, Y., Zhang, J., Qiao, S., Liu, J.: UAV air com-
bat autonomous maneuver decision based on DDPG algo-
rithm. In: 2019 IEEE 15th International Conference on Con-
trol and Automation (ICCA), pp. 37–42 (2019). IEEE
29. Shin, H., Lee, J., Kim, H., Hyunchul Shim, D.: An
autonomous aerial combat framework for two-on-two
engagements based on basic fighter maneuvers. Aerosp. Sci.
Technol. 72, 305–315 (2018). https://doi.org/10.1016/j.ast.
2017.11.014
30. Berndt, J., De Marco, A.: Progress on and usage of the open
source flight dynamics model software library, JSBSim. In:
