
Expert Systems With Applications 225 (2023) 120146

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Quadcopter neural controller for take-off and landing in windy environments


Xabier Olaz a, Daniel Alaez a, Manuel Prieto a, Jesús Villadangos a,b, José Javier Astrain a,b,∗

a Department of Statistics, Computer Science and Mathematics, Universidad Pública de Navarra, 31006 Pamplona, Spain
b Institute of Smart Cities, Universidad Pública de Navarra, 31006 Pamplona, Spain

ARTICLE INFO ABSTRACT

Keywords: This paper proposes the design of a quadcopter neural controller based on Reinforcement Learning (RL) for
Quadcopter controlling the complete maneuvers of landing and take-off, even in variable windy conditions. To facilitate
Take-off RL training, a wind model is designed, and two RL algorithms, Deep Deterministic Policy Gradient (DDPG)
Landing
and Proximal Policy Optimization (PPO), are adapted and compared. The first phases of the learning process
Deep reinforcement learning
consider extended exploration states as a warm-up, and a novel neural network controller architecture is
Wind
PPO
proposed with the addition of an adaptation layer. The neural network’s output is defined as the forces
DDPG and momentum desired for the UAV, and the adaptation layer transforms forces and momentum into motor
velocities. By decoupling attitude from motor velocities, the adaptation layer enhances a more straightforward
interpretation of the neural network output and helps refine the rewards. The successful neural controller
training has been tested up to 36 km/h wind speed.

∗ Corresponding author at: Department of Statistics, Computer Science and Mathematics, Universidad Pública de Navarra, 31006 Pamplona, Spain.
E-mail addresses: xabier.olaz@unavarra.es (X. Olaz), daniel.alaez@unavarra.es (D. Alaez), manuel.prieto@unavarra.es (M. Prieto), jesusv@unavarra.es (J. Villadangos), josej.astrain@unavarra.es (J.J. Astrain).

https://doi.org/10.1016/j.eswa.2023.120146
Received 9 December 2022; Received in revised form 23 March 2023; Accepted 11 April 2023; Available online 18 April 2023.
0957-4174/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

An unmanned aerial vehicle (UAV), commonly known as a remotely piloted or unmanned aircraft, can be remotely operated with or without a human operator. The UAV comprises four main components: the flight control unit (FCU), payload, actuators, and sensors. Unmanned aerial systems (UAS) include the UAV, a ground control station (GCS), and the corresponding datalink (Gupta, Ghonge, Jawandhiya, et al., 2013).

The flight control unit is responsible for maintaining UAV stability during flight, and its design is a complex and challenging task due to the system's intrinsic non-linear dynamics and parameter uncertainties (Nguyen et al., 2021). In fact, UAVs have fewer control inputs than moving capabilities, resulting in an underactuated system. Inaccurate measurements of inertia moments and aerodynamic coefficients further exacerbate these difficulties.

Although UAV stability is crucial for controlling UAV attitude, it is also essential to control maneuvers such as take-off and landing. Moreover, non-controlled external disturbances such as wind gusts can impact the UAV's dynamics during concrete maneuvers in flight operations. These external disturbances are particularly critical during take-off and landing procedures, where even minor deviations in angle or position can lead to catastrophic events. Thus, designing a flight control unit that considers maneuvers and can operate in challenging external conditions remains a significant challenge.

This paper proposes the design of a neural controller based on Reinforcement Learning (RL) for controlling the complete maneuvers of landing and take-off, even in variable windy conditions. The rest of the paper is organized as follows: Section 2 is devoted to analyzing the related works; Section 3 describes the contribution of the paper; Section 4 presents the physical model of the quadcopter, the reinforcement learning method used, and the reward function; Section 5 describes the wind model proposed for training in adverse conditions; Section 6 presents the neural flight control proposal and its architecture; Section 7 describes the neural controller training; Section 8 presents and discusses the results obtained. Finally, Conclusions, Glossary, Acknowledgments, and References end the paper.

2. Related works

Various techniques have been developed to address UAV stability issues, including both linear and non-linear algorithms. Linear algorithms, such as PID controllers, have been widely used to stabilize UAVs (Gautam & Ha, 2013; Kendoul, 2007; Pounds, Bersak, & Dollar, 2012). On the other hand, non-linear algorithms, like State Dependent Riccati Equation (SDRE) techniques, are commonly implemented in most multi-copter systems (Gheorghiţă, Vîntu, Mirea, & Brăescu, 2015). Other alternatives for stability control include adaptive control, Linear Quadratic Regulator (LQR) control, model predictive control, and sliding mode control (SMC) (Alexis, Nikolakopoulos, &

Table 1
Comparison of different approaches.

Reference | General approach/Control model | Landing and take-off | Challenging wind conditions | RL algorithms compared
Beck et al. (2016) | Robust control (Backstepping) | Yes, both | No | No
Ridlwan, Nugraha, Riansyah, and Trilaksono (2017) | PID, Vision-based | Yes, both | No | No
Polvara et al. (2018) | Deep reinforcement learning (DRL) | No. Landing only | No | DQN vs. DDQN
Massé, Gougeon, Nguyen, and Saussié (2018) | LQR and Structured H∞ control (Optimal control vs. Robust control) | No. Attitude control | Yes. Up to 14 m/s | No. Robust control comparison
Rodriguez-Ramos, Sampedro, Bavle, De La Puente, and Campoy (2019) | DRL | No. Landing only | No | DDPG only
Koch, Mancuso, West, and Bestavros (2019) | DRL | No. Attitude control | No | TRPO, DDPG, PPO
Legovich, Rusakov, and Diane (2019) | Vision-based classical control | Yes, both | No | No
Todeschini, Fagiano, Micheli, and Cattano (2019) | Model-based, hierarchical | Yes, both | No. Up to 3 m/s | No
Xuan-Mung, Hong, Nguyen, Le, et al. (2020) | Model-based control | No. Landing only | – | –
Xie, Peng, Wang, Niu, and Zheng (2020) | DRL | No. Landing only | No | DDPG vs. DQN
Theile, Bayerlein, Nai, Gesbert, and Caccamo (2020) | DRL | No. Landing only | No | DDQN only
Nguyen et al. (2021) | Sliding mode control (Robust control) | No. Attitude control | Yes | No
Jiang and Song (2022) | RL | No. Landing only | No | DDPG, TD3, SAC
Abo Mosali et al. (2022) | Multi-Level DRL | No. Landing only | No | Adapted DQN
Kownacki et al. (2023) | Extended Kalman filter | Yes, both | No. 3.1 m/s | No
Our contribution | DRL + Adaptation layer | Yes, both | Yes. Up to 10 m/s | DDPG vs. PPO

Tzes, 2012; Alexis, Papachristos, Nikolakopoulos, & Tzes, 2011; Beikzadeh & Liu, 2018; Cohen, Abdulrahim, & Forbes, 2020; Guardeño, López, & Sánchez, 2019; Mei & Ding, 2018; Nadda & Swarup, 2018; Sastry, Bodson, & Bartram, 1990; Xiao-Zheng, Guang-Hong, CHANG, & Wei-Wei, 2013; Yuan, Ding, & Mei, 2020). These robust control techniques compensate for unknown variations in the system or the environment and can cope with actuation constraints (Navabi & Mirzaei, 2016; Uswarman, Istiqphara, Yunmar, & Rakhman, 2019).

However, these automatic controllers require accurate knowledge of the UAV dynamics and are not designed to withstand external forces. These forces, usually modeled as noisy inputs, tend to negatively affect their performance, particularly in unknown environments with unknown external forces and actuation constraints.

Artificial neural networks have been involved in flight controller design to avoid the need for precise knowledge of UAV dynamics and external forces characterization (Nguyen et al., 2021; Ulus & Eski, 2021). NASA has also researched adaptive control for manned and unmanned aircraft or spacecraft, such as space shuttle docking and neural network-based sensor failure detection (Gupta, Loparo, Mackall, Schumann, & Soares, 2004; Schumann, Gupta, & Nelson, 2003).

Various neural network controller design approaches have been developed for aircraft, which differ in control method, neural network architecture, and implementation details (Emami, Castaldi, & Banazadeh, 2022; Gu, Valavanis, Rutherford, & Rizzo, 2020; Gupta et al., 2004). Examples of different neural network architectures include feed-forward networks (Norgaard, Ravn, Poulsen, & Hansen, 2000), dynamic cell structures (DCS) (Jorgensen, 1997; Totah & Totah, 1997), or SigmaPi (Kaneshige & Gundy-Burlet, 2001).

Due to the absence of UAV dynamics knowledge and external forces characterization, Reinforcement Learning (Gautam & Ha, 2013; Gautam, Singh, Sujit, & Saripalli, 2022; Koch et al., 2019) has emerged as a leading approach in designing alternative control systems for UAVs. This approach involves using model-free systems to design a neural controller and integrating unknown external forces. In this method, the controller explores actions to maximize a reward. After the learning process, the controller selects the control action that maximizes the reward based on the current and previous states of the system.

Although flight controller design has been extensively considered a stability problem, few papers have considered the entire maneuvers, such as take-off and landing, as the process to be controlled (Beck et al., 2016; Polvara et al., 2018; Xuan-Mung et al., 2020). Some studies have shown that both procedures can be executed autonomously on stationary landing platforms.

Although some RL solutions in the literature address attitude control to maintain UAV stability at each time instant, these techniques have yet to be applied to risky maneuvers such as the control of the take-off or landing processes, despite the emphasis of other works on their importance (Legovich et al., 2019). Classical control techniques mentioned above focus on stabilizing the UAV but not on maneuvers (Koch et al., 2019) in windy conditions.

To our knowledge, most papers have not considered including adverse external conditions, such as a realistic wind model or other physical aspects like the ground effect, for flight controller design when using RL techniques (Abo Mosali et al., 2022; Jiang & Song, 2022; Legovich et al., 2019; Polvara et al., 2018; Ridlwan et al., 2017; Rodriguez-Ramos et al., 2019; Theile et al., 2020; Xie et al., 2020).

On the other hand, the few studies that have considered these external conditions (Kownacki et al., 2023; Todeschini et al., 2019) have not designed a controller capable of coping with challenging wind speeds and are barely trained with wind speeds up to 3.1 m/s, which corresponds to a "light breeze" (Huler, 2007) on the Beaufort Scale, which is not a challenging scenario. This scale is used to measure different wind speeds. According to the scale, wind speeds above ten m/s represent "a strong breeze" and, therefore, are not recommended for human piloting. For this reason, setting this strong breeze as the maximum possible speed during take-off and landing was mandatory to prove the capabilities of the proposed neural controller and a challenge for the drone to cope with.

Table 1 summarizes the related works described above.

3. Contributions

This paper presents a novel approach for controlling the complete process of take-off and landing maneuvers for Unmanned Aerial Vehicles (UAVs) in variable windy conditions. The proposed method


Fig. 1. The body frame $x_b, y_b, z_b$, motor angular speeds $\omega_i$, motor actuating forces $f_i$ and momentum $M_{M_i}$ of our quadcopter.

uses a neural flight controller based on Reinforcement Learning (RL) techniques, which combines attitude and trajectory control to ensure a stable and safe take-off and landing. The contributions of this paper can be summarized as follows:

A neural flight controller based on RL techniques is proposed to handle the complete take-off and landing maneuvers under variable windy conditions. The proposed controller architecture learns to adjust the attitude and trajectory of the UAV simultaneously to ensure a stable and safe take-off and landing in variable windy conditions. A novel neural network controller architecture is proposed with the addition of an adaptation layer that transforms forces and momentum into motor velocities. This adaptation layer decouples attitude from motor velocities, allowing a more straightforward interpretation of the neural network output and helping to refine the rewards afterward.

A wind model variable in speed and direction is designed to evaluate the neural controller's performance in a realistic environment. UAV aerodynamic parameters (lift, drag, and lateral forces) are simulated using computational fluid dynamics (CFD) to evaluate the neural controller's performance in a wide range of windy conditions (Aláez, Olaz, Prieto, Villadangos, & Astrain, 2023). The physics model used is extended to provide a more realistic UAV response by integrating the aerodynamic parameters corresponding to the actual UAV attitude.

Our proposed neural flight controller is implemented in the Gazebo physics engine, which provides a realistic environment for our RL design. Two different RL algorithms are adapted and compared to design the neural controller: Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO). Extended state exploration is proposed as a warm-up during the early stages of the learning process, yielding a more robust and reliable neural controller due to the maximization of state exploration.

The use of neural controllers allows automation of the final UAV configuration process after its manufacture, avoiding the development of classical controllers, such as those based on PIDs or non-linear approaches. This neural controller development mechanism allows each UAV to be configured to perform complex maneuvers, such as the landing and take-off of quadcopters, as it finishes its manufacturing and assembly. In conclusion, using RL techniques, the proposed approach shows promising results for controlling the complete take-off and landing maneuvers for UAVs in variable windy conditions. Implementing this neural flight controller has potential applications in the automation of UAVs in the aerospace industry.

4. Foundations

4.1. Quadcopter physical model

The quadcopter structure for which the neural controller is designed is illustrated in Fig. 1, along with the corresponding forces, torques, and angular velocities created by the four motors. The quadcopter dynamics, as described here, are based on Luukkonen (2011) and exhibit the 6-DOF underactuation problem mentioned previously. The quadcopter has four symmetric arms along the x and y body axes. The angular velocity of motor $i$, denoted by $\omega_i$, generates a force $f_i$ in the direction of the motor axis (Eq. (1)). The angular velocity and acceleration of the motor also produce a torque $M_{M_i}$ around the motor axis (Eq. (2)). The lift constant is denoted by $k$, the drag constant by $b$, and the motor's inertia moment by $I_M$.

$f_i = k\,\omega_i^2$  (1)

$M_{M_i} = b\,\omega_i^2 + I_M\,\dot{\omega}_i$  (2)

The gyroscopic effect, represented by the term $I_M\,\dot{\omega}_i$ in Eq. (2), is usually considered small compared to $b\,\omega_i^2$ and is thus omitted. The combined forces of the motors create thrust $T$ in the direction of the body $z$-axis. Torque $M_B$ around the body consists of the torques $M_\phi$, $M_\theta$ and $M_\psi$ in the direction of the corresponding body frame angles. $l$ is the distance between the motor and the center of mass of the quadcopter (Eqs. (3)–(4)).

$T = \sum_{i=1}^{4} f_i = k \sum_{i=1}^{4} \omega_i^2$  (3)

$T^B_{xyz} = \begin{bmatrix} 0 \\ 0 \\ T \end{bmatrix}, \qquad M_B = \begin{bmatrix} M_\phi \\ M_\theta \\ M_\psi \end{bmatrix} = \begin{bmatrix} lk(-\omega_2^2 + \omega_4^2) \\ lk(-\omega_1^2 + \omega_3^2) \\ \sum_{i=1}^{4} M_{M_i} \end{bmatrix}$  (4)
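As an illustration of Eqs. (1)–(4), the short Python sketch below maps the four motor angular speeds to the total thrust and the body torques. The constants k, b and l, the assumed rotor spin directions (motors 1 and 3 opposite to 2 and 4, consistent with the inverse mapping of Eq. (12) later in the paper), and the example speeds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative constants (not the paper's values): lift constant k, drag constant b, arm length l.
K_LIFT = 2.98e-6   # N / (rad/s)^2
B_DRAG = 1.14e-7   # N*m / (rad/s)^2
ARM_L = 0.23       # m

def motor_forces_and_body_torques(omega):
    """Map the four motor angular speeds to total thrust T and body torques
    (M_phi, M_theta, M_psi), following Eqs. (1), (3) and (4); the gyroscopic
    term of Eq. (2) is neglected, as in the paper."""
    omega = np.asarray(omega, dtype=float)
    f = K_LIFT * omega**2                                     # Eq. (1): per-motor thrust
    thrust = f.sum()                                          # Eq. (3): total body-z thrust
    m_phi = ARM_L * K_LIFT * (-omega[1]**2 + omega[3]**2)     # roll torque, Eq. (4)
    m_theta = ARM_L * K_LIFT * (-omega[0]**2 + omega[2]**2)   # pitch torque, Eq. (4)
    # Yaw torque: signed sum of motor drag torques, assuming alternating spin directions.
    m_psi = B_DRAG * (omega[0]**2 - omega[1]**2 + omega[2]**2 - omega[3]**2)
    return thrust, np.array([m_phi, m_theta, m_psi])

# Example: equal motor speeds give pure thrust and zero roll/pitch/yaw torque.
T, M = motor_forces_and_body_torques([400.0, 400.0, 400.0, 400.0])
print(T, M)
```

The conversion layer described in Section 6.2 (Eq. (12)) performs the inverse of this mapping, recovering motor speeds from a target thrust and momentum.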


The roll movement is performed by decreasing motor 1's speed and increasing motor 4's speed. Similarly, pitch rotation can be achieved by reducing the speed of motor 1 and increasing motor 3. Lastly, yaw movement is done by increasing the angular velocities of two opposite motors while decreasing the other two.

The above equations model the physical behavior of the UAV, which is used to evaluate its attitude ($att_i(t)$) at each timestep $t$. Note that the physical behavior depends on the motor velocities, because the other parameters are constants that depend on the UAV structure.

4.2. Reinforcement learning

Neural networks are computational models that consist of interconnected artificial neurons, which aim to simulate the learning and decision-making capabilities of the human brain. They have been widely used in various applications, from computer vision to natural language processing. In decision-making scenarios, neural networks are employed to learn and adapt to complex environments, making optimal decisions based on available information. Reinforcement learning is one of the most popular machine learning techniques (Sutton & Barto, 2018). It works as an imitation of how creatures in nature tend to learn. First, an agent performs an action in a given environment. Then, depending on the action taken, it is put into a new state and receives a reward indicating whether the action was right or wrong, that is, whether it brought the agent closer to its goal (see Fig. 2).

Fig. 2. Simple representation of reinforcement learning.

However, there are plenty of RL algorithms for learning, so choosing the right ones, or at least those for which better results are expected in advance, was crucial. OpenAI (Brockman et al., 2016) divides these algorithms according to whether they are based on a model (see Fig. 3).

Fig. 3. OpenAI's taxonomy of modern RL algorithms.

The model in RL is the knowledge we have in advance of the learning environment. If the environment is fully observable, which means it is a predictable environment and we know the exact reward functions, then we can use a model-based algorithm. However, wind behavior at each time is unknown in an unpredictable real-world problem like the one presented in this paper. Therefore, after executing an action, it might be unclear where (in what state) the agent will end up. This unpredictability can happen because of the uncertain nature of wind, with different gusts varying in direction, time, and strength that cannot be foreseen. For this reason, the safest option is to use a model-free approach based on Policy Optimization (Brockman et al., 2016) or Q-learning.

Both families are based on Markov Decision Processes (Puterman, 1990). However, they differ in ways that make them optimal for different tasks. The most significant distinction is whether they update their $Q$ values according to the policy being executed (on-policy) or not (off-policy). $Q$ keeps the goodness of an action ($a$) taken at a particular state ($s$), noted by $Q(s, a)$.

Q-learning (Watkins & Dayan, 1992) is off-policy because it learns one policy while executing another. On the other hand, in on-policy learning the agent executes the same policy it is learning. This gives Q-learning some advantages, such as its simplicity: Q-functions can be implemented as simple discrete tables, giving a high chance of solving the training (converging).

On the other hand, Policy Optimization accounts for possible penalties from exploratory moves, something Q-learning would ignore. This may turn out to be more conservative, as Q-learning would pursue the reward while exploring regardless of a dangerous path. Given our current task, comparing both approaches might enrich the experiment by giving different solutions to the problem considered in this paper.

To sum up, Q-learning may be a good choice for training the continuous landing and take-off maneuvers in a fast-iterating environment. However, caring about the rewards gained while learning is critical, so also considering a different approach such as Policy Optimization may be the better choice. At the same time, off-policy algorithms such as DDPG are known for being more sample efficient, so considering an alternative to PPO, which is nowadays a standard for OpenAI, might be interesting. Because of this, the PPO and DDPG algorithms have been chosen for the landing and take-off tasks.
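Before detailing PPO and DDPG, the generic model-free interaction loop of Fig. 2 can be summarized in a few lines. This is a minimal, Gym-style sketch; the `env` and `agent` objects and their method names are placeholders and do not correspond to the actual code used in the paper.

```python
# A minimal agent-environment interaction loop (Fig. 2), Gym-style interface assumed.
def run_episode(env, agent, max_steps=450, train=True):
    state = env.reset()                                  # start from a (random) initial state
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                        # the policy chooses an action
        next_state, reward, done, _ = env.step(action)   # the environment reacts with a new state and a reward
        if train:
            agent.observe(state, action, reward, next_state, done)  # learning update hook
        total_reward += reward
        state = next_state
        if done:                                         # terminal state (e.g., astray, landed, crashed)
            break
    return total_reward
```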


Table 2
Comparison of PPO and DDPG hyperparameters.

(a) PPO hyperparameters
Name | Variable | Value | Unit
Batch size | N | 4096 | tts
Training epochs | | 80 |
Max episode steps | | 450 | tts
Environment rate | | 30 | Hz
Action noise | a_σ | 0.25 |
Actor learning rate | lr_a | 0.0003 |
Critic learning rate | lr_c | 0.001 |
Clipping factor | ε | 0.2 |
Discount factor | γ | 0.99 |

(b) DDPG hyperparameters
Name | Variable | Value | Unit
Training timesteps | tts | 50 | tts
Minimal buffer size | | 1024 | tts
Batch size | N | 4096 | tts
Warm-up steps | | 10,000 | tts
Training epochs | | 30 |
Max episode steps | | 450 | tts
Environment rate | | 30 | Hz
Discount factor | γ | 0.99 |
Buffer size | | 3E+6 | tts
Learning rate | lr | 0.001 |
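For reference, the hyperparameters of Table 2 can be gathered into plain Python configuration dictionaries such as the ones below. The key names are illustrative; how the values are consumed depends on the specific PPO/DDPG implementation.

```python
# Hyperparameters transcribed from Table 2 (key names are illustrative).
PPO_HPARAMS = {
    "batch_size": 4096,          # timesteps per update batch (N)
    "training_epochs": 80,
    "max_episode_steps": 450,
    "environment_rate_hz": 30,
    "action_noise": 0.25,        # a_sigma
    "actor_lr": 3e-4,
    "critic_lr": 1e-3,
    "clip_factor": 0.2,          # epsilon
    "gamma": 0.99,               # discount factor
}

DDPG_HPARAMS = {
    "training_timesteps": 50,    # env steps between gradient updates
    "min_buffer_size": 1024,
    "batch_size": 4096,
    "warmup_steps": 10_000,
    "training_epochs": 30,
    "max_episode_steps": 450,
    "environment_rate_hz": 30,
    "gamma": 0.99,
    "buffer_size": 3_000_000,
    "learning_rate": 1e-3,
}
```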

4.2.1. Proximal Policy Optimization (PPO)

Proximal Policy Optimization, or PPO (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017), is an on-policy algorithm released in 2017. It belongs to a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region methods (Schulman, Levine, Abbeel, Jordan, & Moritz, 2015). However, they are simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, they are applicable in more general settings (for example, when using a joint architecture for the policy and value function), and they have better overall performance.

PPO is known to outperform other state-of-the-art methods in challenging environments with UAVs (Koch et al., 2019), making it ideal for the complex environment of unpredictable wind.

The PPO algorithm (see Algorithm 1) is an on-policy algorithm that uses both an actor and a critic network as a method of learning the environment and develops a well-behaving agent based on the reward function. The algorithm works by experimenting in the current environment, applying some noise over the current agent policy and evaluating whether this added noise resulted in better or worse returns compared to the mean return of that specific state.

Algorithm 1 The PPO Algorithm
1: for Epoch in Epochs do
2:   Observe state s_i
3:   for Timestep in Timesteps do
4:     Calculate action a_i = π_θ(s_i)
5:     Execute action, observe reward R_i and state s_{i+1}
6:     Save transitions (a_i, s_i, P_{a_i}, Q(a_i, s_i), D_i, R_i)_old
7:     if Terminal then
8:       Break
9:     end if
10:  end for
11:  Compute episode returns
12:  Compute advantage estimates
13:  if time to update then
14:    for epoch in epochs do
15:      Get all transitions from memory
16:      Evaluate expected returns Q
17:      Evaluate log(P_{a,i}) and S(σ)
18:      Compute L_{ri}
19:      Normalize the advantages
20:      Compute Total Loss
21:      Compute episode returns
22:      Compute advantage estimates
23:      if time to update then
24:        for epoch in epochs do
25:          Get all transitions from memory
26:          Evaluate expected returns
27:          Evaluate log(P_{a,i}) and S(σ)
28:          Normalize the advantages
29:          Compute Total Loss
30:        end for
31:      end if
32:    end for
33:  end if
34: end for

4.2.2. Deep Deterministic Policy Gradient (DDPG)

On the other hand, DDPG (Lillicrap et al., 2015), as an off-policy algorithm, learns a Q-function and a policy. In other words, the Bellman equation and off-policy data are used to learn the Q-function, while this Q-function is used to learn the policy. Being an off-policy algorithm, it can be considered the deep Q-learning algorithm for continuous action spaces, which might favor the landing and take-off tasks performed by the UAV. It has achieved good performance in many simulated continuous control problems by integrating the deterministic policy gradient algorithm (Hou, Liu, Wei, Xu, & Chen, 2017). The DDPG algorithm (see Algorithm 2) used in this paper has the relevant hyperparameters listed in Table 2.

Algorithm 2 The DDPG Algorithm
1: Initialize a Critic Network Q and an Actor-Network π
2: Initialize a copy of the Critic Q′ and the Actor networks
3:   (called target networks)
4: Initialize Sampling Memory Buffer
5: for episode in episodes do
6:   Reset the environment into a random state
7:   Read initial state s_i
8:   j = 0
9:   for i in timesteps do
10:    j = j + 1
11:    if episode > warm-up episodes then
12:      Select action a_i = π(s_i) + 𝒩_i
13:    else
14:      Select action a_i = 𝒩_i
15:    end if
16:    Execute action, observe reward R_i and state s_{i+1}
17:    Store the transition (s_i, a_i, R_i, s_{i+1}) into the buffer
18:    if j > training timesteps then
19:      j = 0
20:      Sample a random minibatch from the buffer
21:      Update the critic
22:      Update the actor
23:      Transfer the knowledge to the target networks
24:    end if
25:    if Terminal state reached then
26:      Break
27:    end if
28:  end for
29: end for

4.2.3. Reward function

An essential part of any reinforcement learning environment is the design of the reward function, as the training performance is heavily impacted by how the reward is shaped. Therefore, two reward functions were developed, one for each proposed maneuver. Both reward functions are written in a paradigm that rewards positively the actions that guide the environment toward the solved state and rewards negatively when the opposite occurs. The reward function for each maneuver is further explained in the Training section.

5. Wind model

Quadcopter physics models usually consider ideal conditions. However, effects that can alter system dynamics should be considered in
ever, effects that can alter system dynamics should be considered in


Fig. 4. The proposed FCU architecture.

Fig. 5. The architecture of Actor and Critic Network of the neural network. The layers are interleaved by a ReLu activation function.

real conditions. For example, the wind is critical in landing and take-off maneuvers. Because of this, a physical model of the wind is proposed. Each time a take-off or landing is performed, the wind direction is determined randomly to ensure variability in the environment. As a result, the wind speed varies smoothly during the maneuver, following a uniform distribution. In all cases, wind directions are restricted horizontally to the ground, functioning similarly to a windsock. We based our wind system on the approach presented in Massé et al. (2018).

Under a uniform distribution, all possible directions have the same likelihood of happening, from 0 to $2\pi$ (Eq. (5)). The same applies to the wind speed (Eq. (6)), which can assume values from 0 to 10 m/s, where $\theta_W$ is the direction of the wind and $S_W$ is the wind speed.

$\theta_W = \mathcal{U}(0, 2\pi)$  (5)

$S_W = \mathcal{U}(0, 10)$  (6)

A linear ramp curve is applied to the wind (Eq. (7)) to ensure the continuity of the wind function. This continuity prevents giant leaps in the wind. For example, suppose the wind is previously blowing at 5 m/s, and the new wind speed has a value of 10 m/s. Instead of just jumping to 10 m/s directly, the wind will slowly increase to the desired velocity, giving a more continuous and accurate representation of the wind and its changes in direction and speed.

$W_\theta = S_{W,i-1} + (S_{W,i} - S_{W,i-1})\dfrac{p_i}{P} + \mathcal{N}(0, 0.5)$  (7)

The proposed final wind model is given in Eq. (7). $W_\theta$ is the speed of the wind in the direction $\theta$, while $S_W$ is the wind speed sampled from Eq. (6), with $S_{W,i}$ being the current sampled wind speed and $S_{W,i-1}$ the last sampled wind speed. $p_i$ is the number of timesteps into the current wind sample and $P$ is the total number of timesteps before a new wind speed is sampled.

Both the direction and the speed of the wind are sampled in the same time step. Speeds $S_{W,i}$ and $S_{W,i-1}$ might be in different directions. In this case, the linear ramp will also smoothly transition between both directions $\theta_i$ and $\theta_{i-1}$.

Finally, a Gaussian noise $\mathcal{N}(0, 0.5)$ in speed and direction is added, so there are little gusts in different directions and speeds, just like the actual wind, making it not constant but adding some noise or slight variations to it. In this way, the wind comprises random characteristics that resemble a real wind gust. Finally, the wind on the inertial axes $XYZ$ is calculated in Eq. (8).

$W_{XYZ} = [W_\theta \sin(\theta_W),\; W_\theta \cos(\theta_W),\; 0]$  (8)

6. Neural Flight Controller Unit architecture

The proposed Neural Flight Controller Unit (N-FCU) comprises the neural network, the adaptation layer, and the conversion layer. The N-FCU is responsible for controlling the UAV maneuvers. Its architecture is depicted in Fig. 4. The current altitude, angular velocity, linear velocity, and position of the UAV constitute its current state ($S_i$), which is the N-FCU input. On the other hand, the N-FCU outputs the four motor velocities to perform any possible maneuver.

6.1. Neural network

The Neural Network (NN) consists of two deep networks: Actor and Critic (see Fig. 5). This actor-critic-based framework allows combining policy learning and value learning. On the one hand, the actor-network provides an adequate action for obtaining the desired state. On the other hand, the external entity (the critic) tracks whether the neural controller is ahead or behind in the training. This external feedback validates the whole process (whether the UAV is approaching the target, how it stabilizes whenever receiving a wind gust, and whether there is stable learning overall).

The NN has a certain degree of memory. For this purpose, it considers the current state of the UAV and the three previous states (State Buffer). The reason for considering the last three values is the need to know the UAV's location, velocity and acceleration. Once the training is completed, the actor-network works as the NN-controller. It contains the policy learned in training to produce the most adapted action for the input data received.
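A minimal PyTorch sketch of the Actor and Critic of Fig. 5 is given below. The hidden-layer widths, the Tanh bound on the actor output, and the action-conditioned critic (DDPG-style; a PPO value network would omit the action input) are assumptions of this sketch, since the text does not fix these details. The input stacks the current state with the three previous ones (the State Buffer), and the actor outputs the four values of Eq. (9).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: stacked states -> [T, M_phi, M_theta, M_psi] (Eq. (9))."""
    def __init__(self, state_dim, n_stack=4, action_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * n_stack, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded action a_i, rescaled by Eqs. (10)-(11)
        )

    def forward(self, stacked_states):
        return self.net(stacked_states)

class Critic(nn.Module):
    """Value network: (stacked states, action) -> scalar Q estimate (DDPG-style)."""
    def __init__(self, state_dim, n_stack=4, action_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * n_stack + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, stacked_states, action):
        return self.net(torch.cat([stacked_states, action], dim=-1))
```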


6.2. Mapping and forces conversion

Usually, any controller would directly provide motor velocities as output. However, we propose using adaptation and conversion layers to reduce the training time by considering the momentum and target forces and reducing the possible saturation of the motor velocities. So, the final output of the complete N-FCU also results in motor velocities, but it goes first through all the mapping and conversion. The adaptation layer considers that the UAV during the flight is in a precarious balance and tries to ensure that the UAV flies continuously. It can be interpreted that this layer adjusts thrust and momentum to ensure the UAV is continuously fighting against gravity. Finally, a conversion layer transforms the target thrust and momentum into motor velocities.

For each maneuver (landing and take-off), a different mapping function was developed, focusing on the performance and speed of the training. It is possible to consider the mapping function as part of the neural network model itself, as it is an integral part of the training and ensures safe and reasonable motor inputs. The mapping function's input is the neural network's output ($NN_o$), as described in Eq. (9). The output of the mapping function is the target thrust ($T$) of the motors and the body momentum of the quadcopter, $M_\phi$, $M_\theta$, and $M_\psi$, being the momentum over the x, y, and z axes (pitch, roll, and yaw) of the quadcopter in the body frame of reference. This mapping function ensures that the Actor-network outputs $a_i$ within the task's domain and, more importantly, can ensure that the output is sensible, i.e., within reason of the specific problem it is facing.

The adaptation layer decouples the movement and attitude of the UAV (the output of the neural network are target forces and momentum, Eq. (9)) and the motor velocities. This layer allows the training of the neural network considering the rationale of the maneuver. For example, yaw should be constant during the maneuver, and roll and its variation should be shallow. However, this could result in motor velocity saturation when transforming the above values within the conversion layer. In current proposals of neural network controllers, that mapping is learned directly on the motor velocities, and saturation is rarer to occur.

$NN_o = \begin{bmatrix} T & M_\phi & M_\theta & M_\psi \end{bmatrix}$  (9)

This mapping enables a more straightforward output from the network, as depicted in Eq. (10). It is more straightforward because it maps the network output directly to actions of body thrust and momentum rather than to the motors' direct actions. This removes a coupling element from the training, as there is heavy coupling between motor actions and body acceleration in the quadcopter dynamics problem.

$C_{ld} = \begin{bmatrix} 4.3 & 0 & 0 & 0 \end{bmatrix} + a_i \cdot \begin{bmatrix} 2 & 0.1 & 0.1 & 0.05 \end{bmatrix}$  (10)

$C_{to} = \begin{bmatrix} 5.0 & 0 & 0 & 0 \end{bmatrix} + a_i \cdot \begin{bmatrix} 3 & 0.1 & 0.1 & 0.05 \end{bmatrix}$  (11)

Eqs. (10) and (11) show the existence of a neutral thrust that is added to the output of the network $NN_o$. This neutral thrust ensures that if the neural network outputs a neutral $\begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix}$ signal, the resulting thrust is kept consistent with the task. So, considering the above, the adaptation layer maps the neural network output to the target forces and momentum.

Considering a neutral output from the policy, in the landing task the quadcopter would be slowly moving down. In contrast, in the take-off task, a neutral policy output would result in a slightly lower thrust than necessary to lift the aircraft from the ground. It was determined empirically that the thrust should be much higher than the torque generated in the $x$, $y$, and $z$ body directions. A lower torque range also ensures faster training and a stabler flight than a broader domain of possible torques.

Motors and forces are coupled as presented in Eq. (4). Given that the UAV arm's length $l$, the lift factor $k$, and the momentum factor $b$ are known, it is possible to solve the above equations for the motor angular velocities $\omega_i$. There is a linear correlation between the forces $T$, $M_\phi$, $M_\theta$ and $M_\psi$ and the motor angular velocities $\omega_1, \omega_2, \omega_3, \omega_4$, which is shown in Eq. (12). Some assumptions need to be made so that $\omega_i \in \mathbb{R}$, the most important being that the right-hand side of Eq. (12) should be positive at all times. Otherwise, $\omega_i \in \mathbb{C}$ and not applicable to the motors.

$\begin{bmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{bmatrix} = \begin{bmatrix} \dfrac{M_\psi}{4b} - \dfrac{M_\theta}{2kl} + \dfrac{T_H}{4k} \\ -\dfrac{M_\phi}{2kl} - \dfrac{M_\psi}{4b} + \dfrac{T_H}{4k} \\ \dfrac{M_\psi}{4b} + \dfrac{M_\theta}{2kl} + \dfrac{T_H}{4k} \\ \dfrac{M_\phi}{2kl} - \dfrac{M_\psi}{4b} + \dfrac{T_H}{4k} \end{bmatrix}$  (12)

Fig. 6 shows the test results performed to check how the UAV would train without the adaptation and mapping. In this case, the neural network acted directly on each motor, outputting the desired angular speed according to $\omega_i = 80 a_i$.

Fig. 6. Proportion of solved evaluation episodes of the PPO landing task without adaptation layer.

The test performed with this landing task without the proposed action mapping took over 14k episodes. As a result, it did not provide more than 60% solved episodes. As explained later in Section 7, the proposed mapping provides a higher success rate (near 100%) with fewer episodes.

7. Neural controller training

Both take-off and landing maneuvers are very technical tasks for pilots. During the take-off, the UAV must leave the ground and stabilize itself while reaching the target height. At target height, the UAV must remain inside a specified zone for a given time while keeping its velocity below a certain threshold. Similarly, the landing maneuver considers the UAV stabilized at a given height. Then, it reduces its altitude slowly until it reaches the ground and stops its motors. The landing point is considered to be inside a reference circle centered at the crossing point of the vertical with the ground, measured at the position of the UAV before it initiates the landing operation.

A total amount of 12,480 trainings have been performed. Each training lasts for a set of episodes, depending on how successful the agent becomes in completing the given task and how the reward function is enhanced over time by trial and error. Episodes are evaluated based on the MSE (Mean Square Error). The first trainings lasted for a low number of episodes (with the MSE being a high value). This is because the reward function was not correctly modeled yet, or the UAV was encountering terminal errors very often (at the beginning the UAV controller had not learned to cope with strong winds, often ending upside down, so the whole environment and training had to be reset). The number of episodes per training increases over time while enhancing both the


adaptation layer and reward function. As the MSE highly depends on the training number, we present the results of the final training for each maneuver and algorithm.

7.1. Take-off maneuver

The goal was to, starting from a landed random position $P_{i,t}$ (Eq. (13), with values depicted in meters), be able to take off on a stable path and reach a volume sphere with two meters of diameter and its central location at position [0, 0, 5] m.

The starting position of the take-off maneuver is determined by Eq. (13), where it is defined by a uniform distribution with average $P_{avg}$ and standard deviation $P_{std}$.

$P_{i,t} = \mathcal{U}\left(P_{avg}, P_{std}\right) = \mathcal{U}([0, 0, 0], [0.2, 0.2, 0.1])$  (13)

The episode is completed when the quadcopter has remained within the volume sphere for at least 50 steps, which is a reasonable time to receive at least one wind gust once it arrives at the target position. It should also have a final velocity lower than 0.3 meters per second. This velocity is considered negligible while coping with such strong winds, and keeping it under this velocity considers the episode a success.

Several endpoints were designed for each task, focusing on reducing the time spent in unwanted and nearly unsolvable states. The most important endpoint is the astray condition. This condition typically triggers when the UAV is too far away from the objective, too fast, or in an extreme attitude. This is considered a wrong endpoint. Once such a failure state is reached, the episode is aborted.

Eq. (11) shows the take-off task mapping function, with $N_{c,out}$ being the neural network output. Note that the constant thrust $T$ is higher on the take-off than on the landing function, ensuring a slight training bias in the direction of the goal.

7.1.1. Take-off reward function

The take-off reward function is written in Eq. (17), with $D_r^i$ being the distance reward, $D_z^i$ the $z$-axis distance to the goal, $D_{x,y}^i$ being the modulus of the XY-axis distance to the goal, and $V_r^i$ the velocity-related reward, with $P^i$ being the current position, $T^i$ the target sphere position, and $V^i$ the current velocity of the UAV; every parameter is considered at timestep $i$.

$D_r^i = 0.6\,D_z^i + 0.4\,D_{x,y}^i$  (14)

$V_r^i = \left| -0.5(P^i - T^i) - V^i \right|$  (15)

$R_{s,i} = -2\left(10\,D_r^i + 10\,V_r^i + 20\,A_r^i\right) + 10 \cdot \min(S_{in}, 50) + 50\,T_{ach}$  (16)

$R = R_{s,i} - R_{s,i-1}$  (17)

The reward shaping is $R_{s,i}$ at the timestep $i$, with the counter $S_{in}$ being the number of consecutive timesteps that the UAV is inside the target sphere, and $T_{ach}$ a trigger that is set to 1 when the UAV reaches the sphere for the first time. After that, the counter is restarted if the UAV flies outside the target sphere.

Eq. (14) is designed to ensure more rewards for the vertical axis of movement than the horizontal axis. This is done as the UAV must travel a more significant distance on the $Z$-axis than just little X-Y-axis corrections. Eq. (15) shapes the target velocity that the UAV is intended to reach, based on its position and the direction of the target. For example, if the current position of the UAV is [0 0 0] and the target position is [0 0 5], the target velocity $V_r$ would be 2.5 m/s. Eq. (16) adds all the different reward terms into one variable, while Eq. (17) is the reward shaping function. The reward shaping function aims to return positive values if the quadcopter constantly improves its current rewards. In contrast, if the current reward is lower than the previous reward, the reward shaping function returns a negative value.

In addition to the reward given in each time step, there is also a reward at the end of each episode, known as the terminal reward. This reward ensures that the agent knows whether that terminal state is good. For example, the agent is rewarded with 100 points for a perfect landing or take-off, and for a wrong state, the agent is rewarded with −100 points. In addition, there are other terminal states in between to continuously guide the UAV to improve its terminal state.

There is also the possibility that the UAV reaches the maximum time state. This state is troublesome for the PPO algorithm, as it relies entirely on the calculation of the returns of each episode individually. Hence, a workaround has to be made to ensure that the maximum-time terminal reward does not introduce bias into the training process. This is done by establishing a negative reward for taking too long.

7.2. Landing maneuver

The landing maneuver goal was to, starting from a random flight position $P_{i,l}$ (Eq. (18)), be able to stabilize the UAV and move it downwards towards the ground. The landing should be done within a circle with its center at [0, 0, 0] and a diameter of 0.8 m (a sphere visually big enough to contain the UAV inside and check whether it holds the position). The UAV should also reach the ground with a velocity lower than 0.8 meters per second, with an angular velocity lower than 0.4 radians per second, and with an attitude norm lower than 0.5 radians (a safe landing speed to ensure no parts are damaged). Eq. (10) describes the mapping output function in the landing task.

$P_{i,l} = \mathcal{U}\left(P_{avg}, P_{std}\right) = \mathcal{U}([0, 0, 5], [0.2, 0.2, 0.1])$  (18)

There are several endpoints for the landing maneuver, including an astray state similar to the one implemented on the take-off task. The perfect landing state corresponds to a UAV landing on the target with a low velocity while keeping a flat attitude to the ground. Finally, there is also the skewed landing terminal state, where the UAV can land on target but not within the margins of error of distance and attitude.

All these different terminal states were developed to ensure better training for both algorithms, enabling a more granular return function than just a perfect landing and an astray terminal state. This kind of reward shaping enables training that encourages constant learning. It is significantly related to how humans learn to land the UAV: learning to land it slowly, within a certain attitude, within a specific area, etc.

7.2.1. Landing reward function

The landing reward function is described in Eq. (21), with $D_r^i$ being the distance reward (Eq. (19)), calculated exactly as in the previous task, $D_z^i$ the $z$-axis distance to the goal, and $D_{x,y}^i$ being the modulus of the XY-axis distance to the goal. $R_{s,i}$ is the reward shaping at the time step $i$ and $R$ is the time step reward.

$D_r^i = 0.6\,D_z^i + 0.4\,D_{x,y}^i$  (19)

$R_{s,i} = -10\left(2\,D_r^i + V_r^i + 4\,A_r^i\right)$  (20)

$R = R_{s,i} - R_{s,i-1}$  (21)

Eq. (20) encapsulates the position reward $D_r^i$ similar to the take-off task, the velocity reward $V_r^i$ that is identical to the take-off task, and the attitude reward $A_r^i$, which is also identical to the take-off task. In addition, Eq. (21) enables a reward-shaping function similar to the take-off task.

As in the take-off maneuver, a perfect landing is rewarded with 100 points (established as a maximum reward in our system), and for an utterly erroneous state, the astray terminal state is rewarded with −100 points. There are other terminal states between them to continuously guide the UAV to improve its terminal state.


Fig. 7. The Kwad quadcopter model.

7.3. Training methodology

The aircraft used in this work is a quadcopter, with the environment being a ROS-Gazebo simulation of a quadcopter and its surroundings, modeling as closely as possible the physical behavior of the aircraft and its interaction with the surrounding wind. The Actor network receives the state of the aircraft from the environment at the time step $i$ and produces control actions mapped to each of the four motors at the next timestep $i + 1$.

The agent used in the simulator is a quadcopter (for further reference, see Kwad on GitHub, depicted in Fig. 7). The states of the quadcopter are available and include position, velocity, angular position, and angular velocity information provided at a 100 Hz rate. In addition, the control actions are mapped to the four aircraft motors' target velocity (rpm) at every sample time (100 Hz).

The training was performed with a single drone, gathering samples ranging from 2 to 10 episodes per training batch. Hyperparameters based on similar problems were initially applied for the training. From there, based on empiric trial and error, parameters were fine-tuned until the training converged to a good solution.

In the PPO training, a large number of samples ensures that the training is stable and fast, while for the DDPG training, a large number of samples ensures that the training converges faster to a good solution. In addition, the training was conducted with the quadcopter starting at a random position, with the addition of a random wind algorithm, ensuring a large diversity of environmental uncertainties and a faster convergence of the policy. Finally, neural controller evaluation was also carried out to measure how the training was progressing. An evaluation step is done with the policy noise reduced to zero, enabling the visualization of the policy actions in their totality and reducing the variance of the episode.

7.4. Learning environment

The physics of the quadcopter and the wind behavior are integrated into the Gazebo 3D physics simulator environment. Reinforcement learning algorithms are programmed in Python alongside the PyTorch framework (Ketkar & Moolayil, 2021). Gazebo simulates the environment, and the algorithms run on the agent. The communication between the agent and the environment is based on the Robot Operating System (ROS) (Quigley et al., 2009), using the rospy Python package (Sarkar, Patel, Ram, & Capoor, 2016). Fig. 8 shows how the algorithm flow works between ROS, the environment within Gazebo, and the learning strategies.

Fig. 8. Diagram showing the integration between agent and environment.

8. Results and discussion

For better understanding, time-based heatmaps for both landing and take-off maneuvers with the two algorithms have been generated so the performance in the early and last episodes of each training can be easily understood and compared. For this, two viewports are shown, a top view and a lateral view with normalization of both XY axes, so the training can be transferred from a 3D to a 2D representation. This section will show and discuss the results, comparing both DDPG and PPO algorithms on the landing and take-off maneuvers.

8.1. Landing

Fig. 9(left) shows the cumulative episode reward throughout the last training for both DDPG and PPO algorithms. The PPO algorithm learns very quickly, reaching top rewards within 200–300 episodes of training, while the DDPG algorithm cannot reach the same reward level within 7500 episodes. It is important to note that DDPG could reach the same reward level, as the reward curve shows steady growth, but it requires more time to converge to such a state. Fig. 9(right) shows the mean final velocity of the quadcopter in the evaluation steps. Note that with the PPO reinforcement learning algorithm, the model quickly notices that lower final velocities yield better rewards. In contrast, with the DDPG algorithm, the final velocity has plateaued at 0.5 meters per second.

Figs. 10 and 11 show the time-based heatmaps of the landing maneuvers with the two algorithms. They allow understanding and comparing the performance in the early and final episodes of each training.

Fig. 12(left) shows the proportion of solved episodes on the evaluation pass of the algorithm. Note that with PPO, within 200 episodes, most of the samples are solved episodes, showing around 99% of successful landings, compared to the DDPG algorithm, which has around 92% of them. Regarding the landing maneuver, the PPO algorithm surpasses DDPG, having better performance in the environment and much faster training, showing promising results with only 7% of the training episodes compared to the DDPG algorithm, and also showing more success on the landing characteristics, with lower final velocity and better target accuracy. Fig. 12(right) shows the MSE for the landing task. PPO reduces the MSE faster than DDPG does. DDPG requires a notably high number of episodes to reach a similar error.

We compared our proposal to a published paper by Xie et al. (2020) that focuses on the landing maneuver using the DDPG approach. Their success rate was 93% compared to our 98.7%, even when we took into account windy conditions. We attribute this difference to the adaptation layer we developed alongside the neural controller. With the adaptation layer, PPO training in the worst-case scenario achieves a solved proportion of 60% within 500 episodes. In contrast, without the proposed layer, it takes up to 12,000 episodes to reach the same success.


Fig. 9. Accumulated reward of the final training on the landing task (left). Episodes of the final training regarding velocity on the landing task (right).

Fig. 10. PPO for landing maneuver in the last training. The upper images show the zenithal heatmap, while the bottom images show the lateral one.

8.2. Take-off

Regarding the take-off maneuver, both algorithms succeeded similarly. The PPO algorithm showed better performance, converging to a reasonable solution within 850 steps, while the DDPG took 4000 episodes to reach a good solution, as noted in Fig. 13.

The PPO training was unstable, as noted from episode 2000 onwards. PPO does not achieve steady growth, showing an overtraining


Fig. 11. DDPG for landing maneuver in the last training. The upper images show the zenithal heatmap, while the bottom images show the lateral one.

Fig. 12. Proportion of solved evaluation episodes of the landing task (left). MSE of the landing training (right).


Table 3
Metrics of the last training (episode intervals of 2000; decimal values).

Take-off (PPO)
Metric | 0–1999 | 2000–3999 | 4000–5999 | 6000–7999
Mean reward value | 23.810 | 18.340 | 31.560 | 70.580
Mean deviation (xy) | 0.857 | 2.432 | 3.533 | 0.243
Mean height (z) | 0.674 | 1.434 | 2.346 | 4.564
Mean distance (goal) | 3.577 | 3.446 | 1.643 | 0.531
Mean attitude error | 0.368 | 1.365 | 1.144 | 0.061

Take-off (DDPG)
Metric | 0–1999 | 2000–3999 | 4000–5999 | 6000–7999
Mean reward value | 28.540 | 26.210 | 40.650 | 90.150
Mean deviation (xy) | 1.245 | 0.644 | 0.426 | 0.252
Mean height (z) | 1.154 | 2.345 | 3.565 | 4.235
Mean distance (goal) | 2.425 | 2.515 | 0.525 | 0.226
Mean attitude error | 0.532 | 0.752 | 0.661 | 0.126

Landing (PPO)
Metric | 0–1999 | 2000–3999 | 4000–5999 | 6000–7999
Mean reward value | 32.520 | 125.430 | – | –
Mean deviation (xy) | 0.563 | 0.145 | – | –
Mean landing velocity (m/s) | 0.705 | 0.235 | – | –
Mean distance (goal) | 0.536 | 0.242 | – | –
Mean attitude error | 1.175 | 0.156 | – | –

Landing (DDPG)
Metric | 0–1999 | 2000–3999 | 4000–5999 | 6000–7999
Mean reward value | 15.250 | 60.540 | 55.680 | 108.650
Mean deviation (xy) | 1.664 | 1.246 | 0.643 | 0.135
Mean landing velocity (m/s) | 3.531 | 1.454 | 0.652 | 0.662
Mean distance (goal) | 4.253 | 1.255 | 0.538 | 0.243
Mean attitude error | 0.422 | 0.374 | 0.785 | 0.251

Fig. 13. Accumulated reward on the take-off task.

from episode 4000 onwards. Similarly, Fig. 14 illustrates that the trajectory followed by the UAV during the training was very unstable, even during the last episodes of the final training epoch. Fig. 15 depicts the trajectory of the UAV during the training, showing a clear steady growth as episodes increase. Note the orange circle on the final episodes representing the time the UAV remains close to the goal point.

Fig. 16(left) shows that DDPG did exceed the PPO algorithm in solved proportion, reaching 98% of solved evaluation episodes at 4200 training episodes. In comparison, the maximum of solved episodes by the PPO algorithm was around 85% at 1000 episodes. Regarding the MSE, Fig. 16(right) shows that PPO achieves a low error, but it is bigger than DDPG's. And, as episodes increase, PPO remains around the same error while DDPG reduces it.

Table 3 summarizes the metrics of the different trainings performed. It describes four different stages of the training (intervals of 2000 episodes) for the take-off and landing maneuvers, for both the PPO and DDPG algorithms. One can note the absence of data when considering the PPO algorithm for landing for the interval [4000, 7999], since it reaches the best reward earlier and avoids overtraining.

For the take-off, DDPG shows a slower learning pace, but the results end up being a little more consistent than with PPO. For the landing maneuver, PPO training does not need to go up to 7000 episodes while DDPG does, as it converges around the 2000-episode mark, even with lower performance errors.

This performance might be due to a non-optimal reward shaping function, where the reinforcement learning algorithm chooses instead to exploit the function in a way that does not solve the environment but overfits that policy.

As in the landing maneuver with the adaptation layer, in the PPO training worst-case scenario a solved proportion of 60% is achieved in 100 episodes, whereas without the proposed layer, it takes up to 12,000 episodes to reach the same success.

8.3. Computational cost

Each training epoch consists of episodes with a fixed number of timesteps, usually 2400. An episode can end in two ways: when the agent has completed 2400 timesteps (performs 2400 actions), or when the UAV reaches a terminal state. The agent can perform actions at a rate of 100 Hz (100 actions per second), which means that each episode lasts a maximum of 24 s. The final training lasts for about 7000 episodes, which means around 168,000 s (46.6 h) for the longest training. The training was run for 832 h using an Nvidia 3080 GPU and Tensorflow.

9. Conclusions

Both neural network controllers developed using different reinforcement learning approaches have shown remarkable results in controlling landing and take-off maneuvers by successfully maneuvering in unrecommended wind situations based on the Beaufort scale. This performance is due mainly to using reinforcement algorithms dealing with continuous domains.

On the other hand, the necessary domain knowledge was much simpler than the domain study needed to develop a fully functional classical controller, such as PID or non-linear approaches, showing that developing similar neural controllers is feasible for complex quadcopter maneuvers.


Fig. 14. PPO for landing maneuver in the last training. The upper images show the zenithal heatmap, while the bottom images show the lateral one.

The proposed controller adds an adaptation layer, which shows an important improvement by reducing the training times for both algorithms in the maneuvers. Training in the same conditions, including the adaptation layer reduces the training time by more than 3 times.

The PPO algorithm has shown better performance on control and navigation in the landing maneuver, with 98.7% of the episodes achieving the goal. In addition, the training time for maximum reward was significantly less for the PPO algorithm, showing that the algorithm is a better choice for balancing performance and training time.

On the other hand, the DDPG algorithm outperforms PPO in control and navigation in the take-off maneuver, showing 98.3% of successful maneuvers.

Now that the help of a neural network in critical tasks such as a landing or take-off has been proven successful in the simulations, ongoing research should be focused on improving the neural controller by trying newer learning methods.

CRediT authorship contribution statement

Xabier Olaz: Investigation, Formal analysis, Experimentation. Daniel Alaez: Data curation, Software development, Writing – original draft, Writing – review & editing. Manuel Prieto: Investigation, Formal analysis, Methodology, Verification, Writing – review & editing. Jesús Villadangos: Conceptualization, Investigation, Writing – review & editing, Resources, Project management, Funding acquisition. José Javier Astrain: Conceptualization, Investigation, Formal analysis, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work has been supported in part by the Ministerio de Ciencia e Innovación (Spain) and European Union NextGenerationEU, Spain under the research grant TED2021-131716B-C21 SARA (Data processing by superresolution algorithms); in part by Agencia Estatal de Investigación (AEI), Spain and European Union NextGenerationEU/PRTR, Spain PLEC2021-007997: Holistic power lines predictive maintenance system; and in part by the Government of Navarre (Departamento de Desarrollo Económico), Spain under the research grants 0011-1411-2021-000021 EMERAL: Emergency UAVs for long range operations, 0011-1365-2020-000078 DIVA, and 0011-1411-2021-000025 MOSIC: Plataforma logística de largo alcance, eléctrica y conectada. Xabier Olaz wants to thank PhD José Ramón González de Mendívil Moreno for his valuable comments and suggestions.
Formal analysis, Writing – review & editing. valuables comments and suggestions.


Fig. 15. DDPG for landing maneuver in the last training. The upper images show the zenithal heatmap, while the bottom images show the lateral one.

Fig. 16. Proportion of solved evaluation episodes of the take-off task (left). MSE of the take-off training (right).
