
Vertical Take-Off and Landing System Control Using Deep Reinforcement Learning


Haitham M. Al-radhi
Electrical Power Engineering Dept., Cairo University
Giza, Egypt
alradhi.haitham@gmail.com

Khaled A. El-Metwally
Electrical Power Engineering Dept., Cairo University
Giza, Egypt
kmetwally@eng.cu.edu.eg

Abstract— This paper introduces the utilization of a controller based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for Vertical Take-Off and Landing (VTOL) system control. The TD3 algorithm combines deep neural networks and actor-critic reinforcement learning techniques, enabling it to learn optimal policies directly from interactions with the environment. MATLAB and Simulink are used to provide an efficient platform for training and analyzing the performance of the TD3-based controller. This study compares the performance of the TD3-based controller to that of the PID controller subject to step changes and external disturbances. Results obtained by applying the TD3-based controller to a one-degree-of-freedom VTOL system demonstrate the effectiveness of the proposed control. The results also indicate that the proposed controller performs better than the PID controller, with no overshoot and an approximately four times faster settling time. The proposed controller also shows robustness against disturbances under various conditions.

Keywords— VTOL system control, TD3-based controller, AI, deep reinforcement learning, PID controller, MATLAB.

I. INTRODUCTION

Recently, interest in applying DRL techniques to set-point tracking problems in control systems has grown. DRL is a form of machine learning that employs a trial-and-error approach to learn the optimal policy. DRL involves an agent interacting with the environment, receiving feedback through the reward function, and using it to learn and improve its knowledge. The DRL-based controller learns from the system's states and feedback signals to make control actions instead of relying on a predefined model or set of parameters. This approach significantly reduces the probability of system-model mismatch, yielding performance improvement [1]. DRL has demonstrated potential in various domains, such as robotics, and might present advantages over traditional control methods.

DRL has been used widely in the control field for speed [2], temperature [3], and process control [4]. It has also been used to compute the gains of PID controllers for different systems [5], [6]. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is a modern DRL algorithm [7] that can exhibit better performance than other DRL algorithms. In this study, the TD3 algorithm is used to learn the optimal control of the VTOL system.

The VTOL aircraft is a nonlinear system that exhibits high complexity. During flight, its aerodynamic parameters undergo significant variations. Aircraft modeling is challenging because of the variable aerodynamic parameters and the environmental factors that affect them during flight. Aerospace vehicles such as helicopters, rockets, and balloons are real-world examples of the VTOL system [8]. Three motions determine stability and control during flight: yaw, roll, and pitch [9].

The model used in this study is a laboratory prototype introduced by the QUANSER Company [10]. It is a VTOL system with one degree of freedom (1-DOF), the pitch angle. Designing a controller based on a reduced-order model of the VTOL system simplifies the implementation on the real system [11]. The proposed DRL-based controller is applied to this system to examine its response compared to the conventional PID controller.

In [11], a Sliding Mode Control (SMC) with linear quadratic integral control was implemented to control the VTOL system. The controller performance was measured using the integral of the square error index. In [9], two control methods, Proportional Integral Velocity (PIV) and SMC, were used to find the response of the VTOL system. In [12], the authors applied a DRL algorithm called Deep Deterministic Policy Gradient (DDPG) to the VTOL system model to control the pitch angle. Despite the good results, the performance still suffers from overshoot, a relatively high settling time, or both. The overall performance, especially settling time and overshoot, can be improved using the controller proposed in this paper.

This study presents the design of the DRL components needed to train the TD3 agent to learn VTOL system control. This control method is compared to the PID controller to investigate the performance. Designing an effective reward is a critical issue because the reward directly affects the learning process. This study presents a distinct choice of reward function to make the agent effectively achieve the desired performance. While most DRL algorithms used for control systems, like [12], design the reward function based only on the absolute or square value of the error signal, the proposed reward here is more comprehensive, as explained extensively in subsection IV.D.

The remainder of the study is organized as follows: Section II gives a detailed summary of the TD3 algorithm. The VTOL model and the derivation of its transfer functions are presented in Section III. In Section IV, the Simulink environment that is used to train the TD3 agent is explained and defined. The training hyperparameter selection, the training results, and the performance of the TD3-based controller compared with the PID controller are discussed in Section V. In Section VI, a conclusion and suggested future work are provided.

II. THEORETICAL BACKGROUND

A. Reinforcement Learning

Reinforcement learning (RL) is a machine learning branch. Unlike supervised learning, where the algorithm learns from labeled data, and unsupervised learning, where the algorithm learns from unlabeled data, reinforcement learning involves learning from interactions with an environment.

Fig. 1 shows an agent (the controller, in the language of control systems) interacting with its environment (the plant) to learn a policy that maximizes the expected return, which is the sum of discounted rewards it expects to receive over time. At each time step, the agent observes the current state (s ∈ S) of the environment, selects an action (a ∈ A) based on its policy, and then receives a reward and a new state s′. The reward and the new state are used to update the agent's knowledge of the environment and its policy [7], [13].

Fig. 1. RL blocks and the corresponding control system elements (agent = controller, environment = plant and sensors, action = control action, state = measurements, reward = error signal).

B. TD3 Algorithm

The TD3 algorithm is an improvement of the Deep Deterministic Policy Gradient (DDPG) algorithm. It is a model-free, off-policy reinforcement learning approach. The TD3 agent is an actor-critic RL agent that seeks an optimal policy π*φ by tuning its parameters (φ) [14]. The tuning is typically accomplished by updating these parameters using the gradient of an objective function J(φ), which is the expected return V_πφ(s). The policy π(s; φ) (the actor) is updated via the deterministic policy gradient algorithm described by equation (1) [7]:

\nabla_{\phi} J(\phi) = \nabla_{a} \min_{k} \big( Q(s, a; \theta_k) \big) \Big|_{a = \pi(s;\phi)} \; \nabla_{\phi} \pi(s; \phi)    (1)

The TD3 algorithm includes creating two Q-value networks (two critics), Q(s, a; θ_k)|k=1,2, i.e., it uses the idea of double Q-learning to compute the Q-value [15]. For this, the algorithm is called "twin."

The critics are updated using the temporal difference error with secondary frozen target networks Q_t(s′, a′; θ_tk) to maintain a stable objective across multiple updates [5]. The temporal difference (TD) error is calculated as the difference (ΔQ) between the Q-value target (y) and the current Q-value, i.e., y − Q(s, a; θ_k) [14].

The Q-value target (y), described by equation (3), is the immediate reward r added to the discounted minimum Q-value from the two target critics. With clipped (taking the minimum) double Q-learning, the Q-value target cannot introduce any additional overestimation over the standard Q-learning target. To compute y, the agent first picks the next action a′ using the target policy π_t and the next state s′. Then, the agent adds a clipped noise ε to the computed action, ε ~ clip(N(0, σ̃), −c, c), where σ̃ is the standard deviation of the Gaussian noise and c is the upper noise limit. Adding noise to every action acts as a regularization that smooths the Q-value estimation and thus prevents overfitting, while the clipping keeps the target in a small range [7], [14]:

a' = \pi_t(s'; \phi_t) + \epsilon    (2)

Finally, the agent finds y by passing the next action to the target critics [14]:

y = r + \gamma \, \min_{k} Q_{t_k}(s', a'; \theta_{t_k})    (3)

The idea behind using target networks for the actor and the critics, with the same architecture and initialization but a different update frequency than the original networks, is to achieve stability in deep reinforcement learning and make the learning less prone to overfitting. This is accomplished by updating the target network more slowly than the main network. Thus, the target network has more stable and accurate estimates of the Q-value function, which can lead to a more stable and effective policy [15].

Deep neural networks can take multiple gradient updates to converge, and training them on high-error states too frequently can lead to divergent behavior. To prevent this, TD3 updates the policy network less frequently than the Q-value networks, so that the error in the Q-value function is minimized before a policy update is introduced. This technique can reduce the variance in the Q-value estimates and improve the quality of the learned policy. The delayed updates also reduce the correlation between successive updates of the same network, resulting in a more stable and efficient learning process [15]. For this, the algorithm is called "delayed DDPG."

The target networks update their parameters either periodically, to exactly match the current networks, or by a soft factor τ at each time step: φ_t = τφ + (1 − τ)φ_t [7].

The network parameters are updated using random samples from a replay buffer B. The replay buffer is a memory (data structure) used to store and manage the past experiences (transitions) collected by the agent. By sampling from a replay buffer, the agent can avoid biasing its updates towards recent experiences and instead learn from a diverse set of past experiences. It also minimizes correlations between samples, which leads to more stable and robust learning [15]. Fig. 3 shows an illustrative representation of the algorithm.

III. 1 DOF VTOL MODEL

The QUANSER QNET VTOL system includes a high-speed DC fan with a protective guard mounted on an arm. The arm has an adjustable counterweight at its opposite end, which can be altered to change the system's dynamics. The arm assembly rotates around a rotary encoder shaft, from which the VTOL pitch position can be determined. Usually, flight systems are divided into smaller subsystems due to their complexity. Each subsystem can be addressed separately, then combined for the overall solution [10].

The VTOL system is divided into two components, necessitating two controller loops to handle the system. The first component represents the relationship between the voltage and the current of the actuator (the motor). The second one represents the relationship between the current and the position of the VTOL body (the pitch angle θ). The cascade control system used in the VTOL is shown in Fig. 2. The inner (or secondary) loop controls the motor's current based on a reference signal generated by the outer loop. The outer (or primary) loop regulates the VTOL's pitch by adjusting the reference point of the inner loop [9].

Fig. 2. QNET VTOL cascade control system (the outer pitch controller sets the current reference for the inner current controller, which drives the actuator model and the VTOL pitch model).
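To make the update rules of Section II.B concrete, the following minimal MATLAB sketch reproduces the target-policy smoothing of eq. (2), the clipped double-Q target of eq. (3), and the soft target update for a single hand-written transition. The placeholder networks (actorT, critic1, critic2) and the numerical values are illustrative assumptions; the agent in this paper is actually built and trained with the MATLAB Reinforcement Learning Toolbox.

% Placeholder "networks" (assumptions for illustration, not the paper's networks)
actorT   = @(s) tanh(0.5*sum(s));            % target policy pi_t(s')
critic1  = @(s,a) -sum(s.^2) - 0.1*a^2;      % current critic Q1(s,a)
critic2  = @(s,a) -sum(s.^2) - 0.2*a^2;      % current critic Q2(s,a)
criticT1 = critic1;  criticT2 = critic2;     % frozen target critics

% One sampled transition (s, a, r, sNext), as if drawn from the replay buffer
s = [0.10; 0.00; 0.00];   a = 0.2;   r = 1.0;   sNext = [0.05; 0.00; -0.10];

gamma  = 0.99;    % discount factor (Table II)
sigmaT = 0.2;     % target-smoothing noise standard deviation (example value)
c      = 0.5;     % noise clip limit (example value)

% Eq. (2): next action from the target policy plus clipped Gaussian noise
epsN  = min(max(sigmaT*randn, -c), c);
aNext = actorT(sNext) + epsN;

% Eq. (3): clipped double-Q target using the minimum of the two target critics
y = r + gamma*min(criticT1(sNext, aNext), criticT2(sNext, aNext));

% Temporal-difference errors that drive the two critic updates
dQ1 = y - critic1(s, a);
dQ2 = y - critic2(s, a);

% Soft (Polyak) update of a target parameter vector with factor tau
tau    = 0.005;
theta  = randn(4, 1);     % stand-in for main-network parameters
thetaT = zeros(4, 1);     % stand-in for target-network parameters
thetaT = tau*theta + (1 - tau)*thetaT;

In the full algorithm these steps run on mini-batches sampled from the replay buffer, and the actor and target networks are updated only every few critic updates, as described above.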

Fig. 3. TD3 algorithm workflow.

A. The Actuator Model

The following transfer function provides the voltage-current relationship of the system's actuator [10]:

\frac{I_m(s)}{V(s)} = \frac{1}{R + L s}    (4)

where V is the voltage applied to the actuator (the input), I_m is the actuator's current (the output), R is the motor's resistance, and L is the motor's inductance. The parameters' values are shown in Table I.

The inner loop can be controlled using a proportional-integral (PI) controller, which computes the voltage needed for the actuator to produce the wanted current [11].

TABLE I. VTOL MODEL PARAMETERS

Description                    Symbol   Value
Motor's resistance             R        3 Ω
Motor's inductance             L        0.0583 H
Moment of equivalent inertia   J        0.00347 kg·m²
Viscous friction constant      B        0.002 N·m·s/rad
Stiffness constant             K        0.0373 N·m/rad
Torque constant                K_t      0.0108 N·m/A

B. The VTOL Pitch Model

The second-order system that describes the relationship between the pitch angle of the VTOL system (θ) and the actuator current (I_m) can be derived as follows [11]:

J\ddot{\theta} + B\dot{\theta} + K\theta = K_t I_m    (5)

Then, the transfer function can be written as:

\theta(s) = \frac{K_t / J}{s^2 + (B/J)\,s + (K/J)} \, I_m(s)    (6)

IV. TD3-BASED CONTROLLER DESIGN

The outer loop described by the transfer function (6) is controlled using a TD3-based controller. The input is the motor's current, I_m, whereas the output is the pitch angle, θ. The actuator dynamics are ignored by assuming that the motor current equals its reference value. Matlab and Simulink are used for designing, implementing, and evaluating the controller.

The TD3 agent is designed using the RL Toolbox in Matlab. Fig. 4 shows the TD3 agent block and its interactions with its environment through the observation (states) and reward blocks. The agent controls the VTOL system using the action signal (the control signal u). The "is done" block provides a binary signal that ends the episode if certain conditions are exceeded.

A. The states selection

The states should describe all the relevant information required for the agent to make a good decision while excluding unnecessary observations to reduce computational complexity. Selecting appropriate states for the RL agent is usually an iterative process that involves experimentation and testing to refine the state representation based on the performance of the RL agent. In this study, the selected states are the error between the reference pitch angle and the actual pitch angle (e), together with the integral and the derivative of that error: [e, ∫e dt, de/dt].

B. The action setup

The action is the control signal that the agent generates to control the environment and bring the output pitch to the reference. It is important to set the action limits when defining the action information according to the physically allowed range. The control action in our example is the actuator's current (I_m). According to [11], the allowed current limits for the QNET VTOL model are ±3 A.

Fig. 4. The outer loop with the TD3 controller.
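As a brief illustration of Sections IV.A and IV.B, the observation and action information can be defined with Reinforcement Learning Toolbox specification objects roughly as follows. This is a sketch under the assumption that the environment is built programmatically; the names are illustrative and the paper's Simulink setup may differ.

% Observation: pitch error, its integral, and its derivative (Section IV.A)
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';

% Action: commanded motor current, limited to the physical range of +/- 3 A
actInfo = rlNumericSpec([1 1], 'LowerLimit', -3, 'UpperLimit', 3);
actInfo.Name = 'motor current command';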

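The outer-loop plant seen by the agent, i.e., the pitch model of eq. (6) with the Table I parameters, can also be reproduced numerically for quick checks (Control System Toolbox assumed; Gtheta is an illustrative variable name):

Kt = 0.0108;     % torque constant        [N*m/A]
J  = 0.00347;    % equivalent inertia     [kg*m^2]
B  = 0.002;      % viscous friction       [N*m*s/rad]
K  = 0.0373;     % stiffness constant     [N*m/rad]

s      = tf('s');
Gtheta = (Kt/J) / (s^2 + (B/J)*s + (K/J));   % theta(s)/Im(s), eq. (6)
step(Gtheta)                                 % open-loop pitch response to a 1 A step

This transfer function is what the next subsection treats as the environment.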
C. The environment setup

The model of the VTOL system represented by the transfer function (6) is considered the environment. The environment also includes the error calculation and the reference signal. For the generalization purpose, the agent needs to deal with different setpoints (references) during training. A random reference is applied at each episode to achieve that task.

D. The reward designing

The reward design is a significant issue when training the TD3 agent. The reward function should encourage the agent to learn the desired behavior and avoid undesired behavior. The rewards could be positive or negative scalars, discrete, continuous, or hybrid functions. The distinct reward used for this study is divided into five parts as follows:

The first part encourages the agent to minimize the error (e) while allowing for some tolerance. The agent gets rewards based on the magnitude of the error:

r_1 = \begin{cases} 3, & |e| = 0 \\ 2, & |e| < 0.01 \\ 1, & |e| < 0.1 \\ -1, & |e| > 0.1 \end{cases}    (7)

The second part encourages the agent to decrease the error over time and penalizes it if the error increases, thus reducing the response oscillations:

r_2 = \begin{cases} -5, & |e| \text{ increases} \\ 2, & |e| \text{ decreases} \end{cases}    (8)

The third part encourages the agent to take actions that reduce the error by providing a penalty for higher errors. The penalty becomes more intense for larger errors:

r_3 = -e^2    (9)

The fourth part encourages the agent to take smoother actions instead of sudden ones by adding a penalty for large actions. This part may help the agent avoid aggressive actions that could cause instability or oscillations in the response:

r_4 = -0.1 \, u(t-1)^2    (10)

The fifth part encourages the agent to take actions that keep the system within the allowed operation range by adding a penalty if the environment exceeds its limits:

r_5 = -10 \quad \text{if } \theta_{actual} \le -\pi/6 \ \text{or} \ \theta_{actual} \ge \pi/6    (11)

The overall reward is the summation of the five parts.

E. The agent creation

As mentioned earlier, the TD3 agent is an actor-critic RL agent. In this section, the actor network, the critic networks, and their target networks are constructed.

1) The actor (policy) network

The actor network determines the appropriate action (a) for the states (s) the agent receives from the environment. The actor works as the controller; it outputs the control action u. The proposed architecture for the actor neural network is shown in Fig. 5. The rectified linear unit (ReLU) activation function is commonly used in DNNs. Using tanh and scaling layers can help ensure the actions are within the valid range.

Fig. 5. Actor (policy) neural network architecture (an input layer for the states, fully connected layers with ReLU activation functions, and an output layer with a tanh activation function followed by a scaling layer with scale and bias).

2) The critic network

On the other hand, the critic network determines the action-value (Q) for the action (a) taken by the actor and the environment's state (s). While the actor explores the environment, the critic guides the actor to help it learn an optimal policy. The proposed architecture for the critic neural network is shown in Fig. 6. The critic network is composed of two paths for the two sets of inputs; a concatenation layer then joins the two paths for further processing. The TD3 algorithm uses two critics with the same architecture but different weights initialization.

Fig. 6. Critic (Q-value) neural network architecture (a 3-neuron state input path and a 1-neuron action input path, each processed by fully connected layers with ReLU activation functions, joined by a 96-neuron concatenation layer and followed by a further fully connected layer and a single-neuron Q-value output).

3) Networks' weights initialization

The network initialization is done using a Glorot (Xavier) initializer. The Glorot initializer samples the weights from a normal distribution with zero mean and a variance of 2/(n_i + n_o), where n_i is the number of the layer's inputs and n_o is the number of the layer's outputs [14]. The target actor and critic networks have the same architecture and weights initialization as their original networks but a different updating mechanism, as discussed in Section II.B.
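The five reward terms of eqs. (7)-(11) map directly to a small function. The sketch below is an illustrative MATLAB equivalent of the reward logic, not the paper's Simulink implementation; the previous error ePrev and previous action uPrev are assumed to be stored by the caller, and, as an assumption, an error that does not decrease is grouped with the penalty branch of eq. (8).

function r = vtolReward(e, ePrev, uPrev, theta)
% Illustrative five-part reward of Section IV.D (a sketch, not the paper's blocks).

    % r1: graded reward on the error magnitude, eq. (7)
    if abs(e) == 0
        r1 = 3;
    elseif abs(e) < 0.01
        r1 = 2;
    elseif abs(e) < 0.1
        r1 = 1;
    else
        r1 = -1;
    end

    % r2: reward a shrinking error, penalize a growing one, eq. (8)
    if abs(e) < abs(ePrev)
        r2 = 2;
    else
        r2 = -5;    % assumption: a non-decreasing error takes the penalty branch
    end

    r3 = -e^2;              % eq. (9): quadratic error penalty
    r4 = -0.1*uPrev^2;      % eq. (10): penalty on large previous actions

    % r5: penalty for leaving the allowed pitch range, eq. (11)
    if theta <= -pi/6 || theta >= pi/6
        r5 = -10;
    else
        r5 = 0;
    end

    r = r1 + r2 + r3 + r4 + r5;    % overall reward: sum of the five parts
end

For example, vtolReward(0.05, 0.08, 0.2, pi/12) evaluates to about 2.99: the error is small and shrinking, the action penalty is mild, and the pitch is inside the allowed range.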

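A network in the spirit of the actor in Fig. 5, with ReLU hidden layers and a tanh output rescaled to the ±3 A action range, could be assembled from Deep Learning Toolbox and RL Toolbox layers as sketched below. The hidden-layer widths are assumptions chosen for illustration; the paper's exact sizes are those drawn in Figs. 5 and 6.

actorLayers = [
    featureInputLayer(3, 'Name', 'state')        % observations: e, integral of e, de/dt
    fullyConnectedLayer(64, 'Name', 'fc1')       % width assumed for illustration
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(32, 'Name', 'fc2')       % width assumed for illustration
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(1, 'Name', 'fc3')
    tanhLayer('Name', 'tanh')                    % bounds the raw action to (-1, 1)
    scalingLayer('Name', 'scale', 'Scale', 3)    % rescales the output to +/- 3 A
    ];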
4) Exploration noise

The TD3 algorithm adds exploration noise (ε ~ N(0, σ)) to the action chosen by the actor neural network to help the agent explore the state and action spaces. The noise used in the TD3 algorithm is drawn from a Gaussian distribution with zero mean and a standard deviation determined by the designer. The designer should consider the tradeoff between exploration and exploitation by choosing a standard-deviation decay rate. A guideline provided by [14] is to choose the standard deviation as

\sigma = r / \sqrt{T_s}    (12)

where r is a value between 1% and 10% of the action range, and T_s is the sampling time.

V. TRAINING, RESULTS AND DISCUSSION

A. Training the TD3 agent

Training an RL agent is an iterative process. Results obtained in later steps can require returning to an earlier stage of the learning workflow [14]. The training is an exhausting operation based on trial and error. Tens of attempts were made before getting satisfactory results. The final tuning of the TD3 hyperparameter values, which led to those results, is shown in Table II. The TD3 agent was trained, as illustrated in Fig. 3, through two stages. During the first stage, the agent was trained for approximately 2700 episodes, and the best agent from this stage, the 709th episode's agent, was chosen to continue training through the second stage. Over more than 5000 episodes in the second stage, the most suitable agent, which showed a proper response, was the 3736th episode's agent.

Once the training is completed, the actor neural network (the policy) becomes responsible for computing the optimal action given any environment state; see Fig. 9.

TABLE II. FINAL SET OF THE TD3 AGENT PARAMETERS

Hyperparameter                              Value
Discount factor (γ)                         0.99
Experience buffer length (B)                10^6
Mini-batch size (M)                         128
Actor learning rate                         10^-5
Critic learning rate                        10^-4
Sample time (T_s)                           0.01 s
Exploration noise standard deviation (σ)    0.2 for the 1st stage, 0.05 for the 2nd stage
Soft update factor (τ)                      0.005
Max. no. of steps in an episode             1000

B. Set-point tracking

In this section, experimental simulations are made to assess the efficacy of the proposed TD3-based controller. Using the policy of the trained TD3 agent, the system step response is compared to the response of the PID controller (tuned using Matlab auto-tuning). The PID parameters were tuned using the Simulink tuner app. Fig. 7 shows the system response (the pitch angle θ and the control action u) with the PID and the TD3-based controller for a constant reference signal (θ = π/6). Fig. 8 shows another test using a sinusoidal reference signal (θ = 0.1 sin(2πt)).

A comparison of the TD3-based controller and the PID controller based on the step-response characteristics and the sum of squared errors (SSE) is shown in Table III.

It is clear that the TD3-based controller performs better than the PID controller in settling time, overshoot, and SSE. The TD3-based control action might initially exhibit an inconvenient behavior, but it settles before the PID control action does.

Fig. 7. TD3-based controller vs PID controller (a constant reference).

Fig. 8. TD3-based controller vs PID controller (a sinusoidal reference).

TABLE III. STEP RESPONSE CHARACTERISTICS AND SSE PERFORMANCE METRIC

Controller   Rise Time (s)   Settling Time (s)   Overshoot (%)   SSE, θ = π/6   SSE, θ = 0.1 sin(2πt)
TD3-based    0.3526          0.5941              0               33.6318        0.0610
PID          0.2588          2.1605              16.23           34.1598        0.7659

The overall system (the inner and outer loops) with the proposed controller is shown in Fig. 9. The figure demonstrates that the effective agent part after training is the policy, which acts as the controller.

The TD3-based controller can still handle the system well even after adding the actuator effect, owing to the generalization feature of DRL. Fig. 10 shows the system response and the control action after adding the inner loop effect.

The resulting performance is better than that of the DDPG-based controller proposed by [12]. The DDPG-based controller suffered from overshoot, although it was less than that of the PID controller.
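For reference, metrics of the kind reported in Table III can be extracted from a logged pitch response with standard Control System Toolbox calls. The sketch below uses a generic second-order system only as a stand-in for the logged Simulink signal; it is not the paper's closed loop.

thetaRef = pi/6;                               % constant reference of Fig. 7
t = (0:0.01:5)';
y = thetaRef*step(tf(25, [1 8 25]), t);        % placeholder response for illustration

info = stepinfo(y, t, thetaRef);               % RiseTime, SettlingTime, Overshoot, ...
sse  = sum((thetaRef - y).^2);                 % sum-of-squared-errors index of Table III
fprintf('ts = %.3f s, OS = %.2f %%, SSE = %.4f\n', ...
        info.SettlingTime, info.Overshoot, sse);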

The authors of [12] did not discuss the control action of the DDPG-based controller or the overall system performance (after adding the inner loop effect).

Fig. 9. QNET VTOL cascade control system with the TD3-based controller.

Fig. 10. System response and control action under the effect of the actuator.

C. Anti-disturbance performance

In actual situations, the system is prone to external disturbances from the outer environment, external interference, or sensor noise. The system's performance under the disturbance effect indicates the robustness of the system. An external disturbance, a step signal with an amplitude equal to 1 rad added to the system's output at time t = 3 s for 0.5 s, is imposed on the VTOL system to test its performance.

In Fig. 11, the TD3-based controller shows a faster settling time and a lower overshoot than the PID-based control. Generally, the proposed controller exhibits robustness in the face of disturbances under various conditions.

Fig. 11. TD3-based vs. PID controllers under the output disturbance.

VI. CONCLUSION

This paper presented the training of a TD3 DRL agent for controlling a 1-DOF VTOL system. An overview of the training algorithm and the workflow was outlined. The simulation environment was designed, using Simulink, to train the TD3 agent based on a well-designed reward function.

Through simulation, the TD3-based controller demonstrated its capability to learn to control the VTOL system effectively. The results were compared with a PID controller and with a DDPG-based controller. The simulation results showed that the TD3-based controller successfully tracked different desired references with a faster settling time, no overshoot, and a low SSE performance index. The proposed controller also exhibited effective disturbance rejection capabilities, indicating its robustness and ability to maintain the desired performance in the presence of external disturbances.

REFERENCES

[1] D. Dutta and S. R. Upreti, "A survey and comparative evaluation of actor-critic methods in process control," Can. J. Chem. Eng., vol. 100, no. 9, pp. 2028–2056, 2022.
[2] P. Chen, Z. He, C. Chen, and J. Xu, "Control strategy of speed servo systems based on deep reinforcement learning," Algorithms, vol. 11, no. 5, p. 65, 2018.
[3] S. Brandi, M. S. Piscitelli, M. Martellacci, and A. Capozzoli, "Deep reinforcement learning to optimise indoor temperature control and heating energy consumption in buildings," Energy Build., vol. 224, p. 110225, 2020.
[4] S. Spielberg and P. Kumar, "Deep Reinforcement Learning Approaches for Process Control," in 2017 6th International Symposium on Advanced Control of Industrial Processes (AdCONIP), 2017, pp. 201–206.
[5] S. Tufenkci, B. B. Alagoz, G. Kavuran, C. Yeroglu, N. Herencsar, and S. Mahata, "A theoretical demonstration for reinforcement learning of PI control dynamics for optimal speed control of DC motors by using Twin Delay Deep Deterministic Policy Gradient Algorithm," Expert Syst. Appl., vol. 213, p. 119192, 2023.
[6] Mohanty and Schneider, "Tuning of an Aircraft Pitch PID Controller with Reinforcement Learning and Deep Neural Net." Accessed: Feb. 27, 2023. [Online]. Available: http://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26643693.pdf
[7] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning, PMLR, 2018, pp. 1587–1596.
[8] S. Mondal and C. Mahanta, "Observer based sliding mode control strategy for vertical take-off and landing (VTOL) aircraft system," in IEEE 8th Conference on Industrial Electronics and Applications (ICIEA), IEEE, 2013, pp. 1–6.
[9] J. Jacob and S. S. Kumar, "Sliding Mode Control of VTOL System," in IOP Conference Series: Materials Science and Engineering, IOP Publishing, 2019, p. 012069.
[10] P. Martin, M. Krug, and J. Apkarian, "QNET VTOL Workbook - Student," 2014. Accessed: Feb. 26, 2023. [Online]. Available: https://www.quanser.com/products/qnet-2-0-vtol-board/
[11] M. Herrera, P. Leica, D. Chavez, and O. Camacho, "A Blended Sliding Mode Control with Linear Quadratic Integral Control based on Reduced Order Model for a VTOL System," in ICINCO (1), 2017, pp. 606–612.
[12] M. Ağrali, M. U. Soydemir, A. Gökçen, and S. Sahin, "Deep Reinforcement Learning Based Controller Design for Model of The Vertical Take off and Landing System," European Journal of Science and Technology, no. 26, pp. 358–363, 2021.
[13] R. Siraskar, "Reinforcement learning for control of valves," Machine Learning with Applications, vol. 4, p. 100030, 2021.
[14] MathWorks Inc., "Reinforcement Learning Toolbox™ User's Guide R2022," 2022. Accessed: Jan. 10, 2023. [Online]. Available: https://www.mathworks.com/help/pdf_doc/reinforcement-learning/rl_ug.pdf
[15] H. Zhang, T. Yu, and R. Huang, "Combine Deep Q-Networks with Actor-Critic," in Deep Reinforcement Learning, Singapore: Springer Singapore, 2020, pp. 213–245.
