Expert Systems With Applications 137 (2019) 292–307

Double Q-PID algorithm for mobile robot control


Ignacio Carlucho, Mariano De Paula∗, Gerardo G. Acosta
INTELYMEC, Centro de Investigaciones en Física e Ingeniería del Centro CIFICEN-UNICEN-CICpBA-CONICET, Olavarría 7400, Argentina

Article history: Received 1 November 2018; Revised 26 June 2019; Accepted 26 June 2019; Available online 27 June 2019

Keywords: Reinforcement learning; Double Q-learning; Incremental learning; Double Q-PID; Mobile robots; Multi-platforms

Abstract: Many expert systems have been developed for self-adaptive PID controllers of mobile robots. However, the high computational requirements of the expert system layers developed for the tuning of the PID controllers still require previous expert knowledge and high efficiency in algorithmic and software execution for real-time applications. To address these problems, in this paper we propose an expert agent-based system, based on a reinforcement learning agent, for self-adapting multiple low-level PID controllers in mobile robots. For the formulation of the artificial expert agent, we develop an incremental model-free version of the double Q-learning algorithm for fast on-line adaptation of multiple low-level PID controllers. Fast learning and high on-line adaptability of the artificial expert agent are achieved by means of a proposed incremental active-learning exploration-exploitation procedure, for a non-uniform state space exploration, along with an experience replay mechanism for multiple value function updates in the double Q-learning algorithm. A comprehensive comparative simulation study and experiments in a real mobile robot demonstrate the high performance of the proposed algorithm for real-time simultaneous tuning of multiple adaptive low-level PID controllers of mobile robots in real world conditions.

1. Introduction

Mobile robots have a prevailing role in the development of several technological systems and currently they are used in a vast number of industrial applications, such as agro, process, aerospace, off-shore and others. In addition, in the new era of the fourth industrial revolution, also called Industry 4.0, there is an increasing participation of these mobile robots in industrial applications and they are essential parts of their manufacturing execution systems too. However, nowadays it is of paramount importance that these types of robots perform their tasks in an autonomous way and, usually, they must operate in multiple and uncertain environments, which pose a big challenge for autonomous robot control strategies.

From an expert system point of view, many proposals were made for controlling mobile robots for direct applications in industrial environments. However, the real-time adaptability needed for this type of applications is still an open issue for the majority of these proposals. Under these conditions, the expert system layer has to be able to adapt to changes in the environment and the operative conditions, as could happen for instance in a robotized warehouse application.

The simplicity and fast computation of the Proportional Integral Derivative (PID) controller made it the most popular low-level control strategy and it is currently implemented in several mobile robots' control systems (Åström & Hägglund, 2006). A PID control strategy requires the tuning of its parameters according to the controlled system, usually done for a certain operating point. Classical control theory has contributed with many tuning methods (Åström & Hägglund, 2004; Ziegler & Nichols, 1993). However, in autonomous robots with complex dynamics, often coupled and highly non-linear, and with multiple degrees of freedom, the PID tuning techniques tend to have a poor performance, requiring an exhaustive expert knowledge about the system behavior and, often, about its operative environments too.

The main motivation of this work lies in having a fast intelligent expert system for the simultaneous on-line tuning of multiple low-level PID controllers in real-time. Particularly, mobile robots exhibit highly non-linear, uncertain and coupled behavior and they are very difficult, even impossible, to be analytically modeled with enough accuracy. In addition, when multiple PID controllers are implemented they suffer a considerable performance degradation in the face of variable and uncertain conditions. For these reasons, adaptive PID control techniques have arisen as a way to set the controller gains according to the desired system behavior.

∗ Corresponding author. E-mail addresses: ignacio.carlucho@fio.unicen.edu.ar (I. Carlucho), mariano.depaula@fio.unicen.edu.ar (M. De Paula), ggacosta@fio.unicen.edu.ar (G.G. Acosta).
https://doi.org/10.1016/j.eswa.2019.06.066

Classical PID tuning methods have been extended with artificial intelligence techniques to achieve better performance and higher adaptability of the PID controllers (Tandan & Swarnkar, 2015). Adaptability implies the capability of an entity (i.e. an intelligent expert agent) to self-adapt, driving and improving its behavior to carry out a specific task in a partially (or fully) unknown and uncertain environment to ultimately achieve a certain goal. In this sense, reinforcement learning (RL) is one of the most powerful learning paradigms (Sutton & Barto, 1998) for adaptive control formulations, in which an artificial expert agent learns to solve a control task only by means of its experience. In the context of adaptive PID control for mobile robots, an RL expert agent must learn to adapt the gains of the PID controllers in an on-line way, whilst the real-time interactions between the robot and its environment occur.

In this paper, we propose an RL agent-based expert system for adapting on-line the gains of the PID controllers, one of the most widely used low-level control techniques in autonomous mobile robots. Our present formulation is based on some ideas outlined in our previous work (Carlucho, De Paula, Villar, & Acosta, 2017). However, unlike the previous formulation, in this work a more efficient algorithm for an RL agent-based expert system for on-line autonomous adaptation of low-level PID controllers of mobile robots is presented. The new formulation combines a new and improved incremental active learning exploration-exploitation procedure with the double Q-learning algorithm. As another novelty, we have also incorporated the definition of a support data dictionary for the experience replay mechanism, which uses this data set for fast learning of a low-level control policy. In addition, a temporal memory is defined to manage the proposed incremental active learning mechanism for a non-uniform state space exploration, while the experience replay mechanism is used for the double Q-learning updates of function values. The proposed approach was initially tested in different mobile robots, i.e. terrestrial, aerial and underwater robots, for which different benchmark simulation platforms were used. Also, a comprehensive comparative simulation study was made demonstrating the higher performance of the expert Double Q-PID algorithm. Finally, experiments in a real-time mobile robot demonstrate the high performance of the proposed algorithm for real-time simultaneous on-line tuning of multiple adaptive low-level PID controllers.

The article is structured as follows. Section 2 provides an overview of related works about adaptive PID formulations for different mobile robots. Section 3 provides the necessary background. In Section 4 the proposed Double Q-PID algorithm is explained, followed by the obtained results in Section 5. Finally, in Section 6 the final remarks are presented.

2. Related work

Since the original works of Ziegler and Nichols (1942, 1993), many techniques were developed for the tuning of PID controllers (Bansal, Sharma, & Shreeraman, 2012; Cohen & Coon, 1953; Åström & Hägglund, 2004; Wang, Fung, & Zhang, 1999) and most of them optimize merit functions such as integral absolute error and integral time squared error (Ogata, 2010; Pessen, 1994; Zhuang & Atherton, 1993). Usually mobile robots, used in several industrial applications, are complex systems with changing operative conditions and they have further requirements for tuning, mostly due to errors in the models and changes in the operative conditions. For this reason, the attention has turned towards the development of controllers able to auto-tune in order to compensate for those issues. In this sense, a number of adaptive control formulations were proposed to self-adapt the PID gains, using different strategies, for a variety of mobile robots. To name some relevant works, Hägglund and Åström (1991) proposed a model-based adaptive formulation using frequency response analysis by means of tracking a point on a Nyquist curve to initialize the adaptive algorithm to build up PID gain schedules. Myungsoo, Jun and Safonov (1999) use unfalsified control techniques (Safonov & Tung-Ching Tsao, 1997) to propose an iterative procedure to adjust the PID gains. In this formulation, an initial discretized set of candidate gains and an operative specification must be supplied, and then, if the algorithm cannot find a set of gains, it must be run again, relaxing the performance specification and/or changing the initial set of possible gains. Chang and Jin (2013) proposed a PID trajectory tracking controller based on a non-linear kinematic model of the mobile robots, used to predict the robot trajectory; however, only simulation results were reported. Antonelli (2007); Antonelli, Caccavale, Chiaverini, and Fusco (2003) formulated different adaptive versions based on PID control laws with an adaptive compensation of the dynamics for controlling autonomous underwater vehicles (AUVs) in six degrees-of-freedom. However, in such proposals the control gains must be adjusted manually, first in simulation and then with the real system during its operation (Antonelli, Chiaverini, Sarkar, & West, 2001). In the work of Barbalata, De Carolis, Dunnigan, Petillot, and Lane (2015) an adaptive on-line tuning method for a coupled two-close-loop proportional controller of four degrees-of-freedom of an AUV is presented, where the gains of each controller must be determined on-line according to the error signals. Rout and Subudhi (2017) developed an adaptive tuning method for a PID controller using an inverse optimal control technique based on a NARMAX model (Chen & Billings, 1989) for the representation of the non-linear dynamics in path following control design for an AUV.

Notice that the aforementioned methods require prior knowledge about the models of the systems and, often, an intensive simulation phase is needed to set the initial algorithm conditions. The ideas of auto-tuning methods have inspired the researchers of the artificial intelligence community to formulate proposals of adaptive PID controllers (Tandan & Swarnkar, 2015). Mainly, these proposals were based on the idea of incorporating an extra layer of decision making for on-line tuning of the PID gains, allowing the controllers to solve more challenging tasks.

Fuzzy logic theory (Zadeh, 1973) has been combined with PID control techniques in such a way as to dynamically adjust the PID gains. In the work of Acosta, Mayosky, and Catalfo (1994) a fuzzy system was implemented to manage a fuzzy conjunction originated by a heuristic method for the system parameters estimation, based on experimental measurements of closed-loop performance. Also, robot motion control systems based on PID (Sood, 2011) and PI (Sood & Kumar, 2014) control architectures with embedded fuzzy rules were developed to tune the controller gains aiming to decrease rise time, remove steady state error quickly and avoid overshoot. Similarly, Li, He, and Yin (2012) proposed a trajectory tracking control for a four-wheeled differential mobile robot using a closed loop fuzzy PID control architecture with path and heading error feedback. Similarly, Sheikhlar, Fakharian, and Adhami-Mirhosseini (2013) implemented a fuzzy PI controller for a four wheeled omni-directional soccer robot using the tracking error and its derivative as inputs for the fuzzy inference rules. Analogously, Salhi and Alimi (2013) proposed a hybrid fuzzy PID controller, fed with the outputs of a Kinect sensor, for trajectory planning and obstacle avoidance of a terrestrial mobile robot. On the other hand, Geder, Palmisano, Ramamurti, Sandberg, and Ratna (2008) have presented and discussed two different approaches of a fuzzy logic PID controller for a simulated unmanned underwater vehicle, one using weighted gait combination and the other using a modification of mean bulk angle bias. Recently, for AUV control, Khodayari and Balochian (2015) have proposed a self-adaptive fuzzy PID controller for the attitude control of the robot, based on a dynamic model obtained from mechanical principles.

Zhao, Yi, Peng, and Peng (2017) have proposed a self-adaptive fuzzy-PID controller for a simulated AUV, which obtains the required adjustment of the PID gains according to the error and the change of error, in both distance and yaw. Similarly, a number of works with fuzzy PID formulations have been proposed for unmanned aerial vehicles (Abu Rmilah, Hassan, & Bin Mardi, 2016; Ahmed, Kushsairy, Bakar, Hazry, & Joyo, 2015; Cohen, Bossert, & Col, 2002; Kuantama, Vesselenyi, Dzitac, & Tarca, 2017; Wei & Cohen, 2015). Also, fuzzy fractional calculus (Takači, Takači, & Takači, 2014) has been used for fractional fuzzy PID control formulations in mobile robot applications (Liu, Pan, & Xue, 2015; Shah & Agashe, 2016; Sharma, Rana, & Kumar, 2014).

However, the design of fuzzy self-tuning PID control strategies requires some prior knowledge of the system behavior. This is a big disadvantage since it means that an expert's prior knowledge of the technological system is still required in order to set up the controller, or at least to define the inference rules as well as the initial controller gains. This is not a trivial task since the range of the PID gains is crucial for achieving an optimal performance. Moreover, these problems are aggravated in several mobile robots, such as AUVs, that have multiple degrees of freedom and highly coupled dynamics.

Analogously to fuzzy PID formulations, artificial neural networks (ANN) (Haykin, 1998) have been combined with PID control methods to give rise to adaptive PID formulations for controlling mobile robots. Mostly, the ANNs have been used to learn the dynamic models of the systems based on collected data coming from real or simulated interactions. Calvo-Rolle, Casteleiro-Roca, Quintián, and Del Carmen Meizoso-Lopez (2013) have proposed an intelligent hybrid system which combines a rule based system with artificial neural networks and support vector machines for tuning a PID controller in an open loop way according to the operation point of the system. Rossomando and Soria (2015) proposed a neural PID formulation to control a mobile robot based on the implementation of a dual control strategy through a kinematic controller and a discrete-time adaptive neural PID controller, which compensates the dynamic nonlinearities of the robot. Also, adaptive PID approaches have been developed based on the analogy with feed forward neural networks, resulting in a continuous auto-tuning PID method whose effectiveness was only demonstrated for velocity and orientation tracking control of simulated nonholonomic terrestrial mobile robots (Singh, Bisht, & Padhy, 2013; Ye, 2008). For underwater vehicles (Dyda & Os'kin, 2004), neural network based self-tuning PID control formulations have been proposed, as in the work of Dong, Guo, Lin, Li, and Wang (2012), where a self-tuning PID controller for an AUV with spherical body was presented. The proposed approach consists in a neural network identifier and a self-tuning PID controller which is realized by a two-layer neural network. The weights of the two-layer neural network represent the PID gains and its output represents the control commands. However, only simulation results were presented where only the forward speed of the AUV was controlled; therefore it is unclear if the method is suitable for more complex systems with higher degrees of freedom and coupled dynamics. A similar proposal has been made in (Hernández-Alvarado, García-Valdovinos, Salgado-Jiménez, Gómez-Espinosa, & Fonseca-Navarro, 2016), where a neural network adjusts the PID gains on-line according to the position tracking error of a mini AUV. In spite of the obtained results, it is worth noting that the range of available PID constants is limited by the neural network activation function of the output layer. Moreover, the initial values of the PID constants are determined by the randomly initialized weights of the neural network. In Pan, Lai, Yang, and Wu (2013) a comparison of a neural network controller against a PID as a tracking control strategy for an autonomous surface vehicle with unknown dynamics is made, showing that the neural network controller was able to perform with an overall lower error. This fact reveals the need for autonomous expert system layers that are capable of on-line adapting the PIDs' constants. Also, for unmanned aerial vehicles, an implementation of the PID structure as an artificial neural network was made in Fatan, Sefidgari, and Barenji (2013) for controlling the altitude of a quadcopter robot, and in (Önder Efe, 2011) a fractional PID formulation (PIλDμ) with a neural network aided finite impulse response approximator was formulated to control an unmanned aerial quadrotor vehicle. Overall, although the proposed neural PID controllers achieve a better performance than the standard PID formulation, the estimation range of the PID parameters is quite limited, due mostly to the activation function of the neural network output layer.

Real adaptive control strategies require a continuous learning capacity which is proper of the reinforcement learning paradigm (Sutton & Barto, 1998). In an RL framework an artificial agent learns an optimal control policy based on the successive interactions said agent has with the environment. In this sense, low-level PID controllers of mobile robots could be continuously adjusted by an optimal control policy, which maps robot states to optimal PID parameters according to the current specified operative conditions (Carlucho et al., 2017). When the low-level control is accomplished by two or more PID controllers, say for example one for each degree of freedom of a mobile robot, the simultaneous tuning is a challenging task, such that only a few proposals have been formulated using an RL approach for the adjustment of multiple PID controllers of mobile robots. Gloye, Göktekin, Egorova, Tenchio, and Rojas (2005) proposed three low-level PID controllers (one for the forward direction, one for the sidewards direction and the other for the angle of rotation) for an autonomous mobile robot, where the parameters of the PID controllers are selected by a policy gradient based RL method which learns from the observed experience, summarized in the collected rewards/punishments. However, the presented formulation was only tested in simulation in a robot with uncoupled dynamics. Moreover, in order to achieve a successful performance, this formulation requires a human expert with knowledge to set the initial PID constants that are then further adapted by the algorithm. Similarly, in the work of el Hakim, Hindersah, and Rijanto (2013) the standard Q-learning method (Watkins & Dayan, 1992) was used to self-tune the PID controllers of a soccer robot. In this particular case, the entire set of possible finite state transitions was given in advance, something not practical for real robotic applications. Moreover, the action domain was discretized a priori, within a narrow range, requiring human knowledge for tuning what can be considered as hyper-parameters of the presented algorithm. Carlucho et al. (2017) proposed an incremental version of the Q-learning algorithm for adaptive low-level PID control of mobile robots. In this work, only terrestrial robots were taken into account for the experiments.

Particularly, the Q-learning algorithm (Watkins & Dayan, 1992) is probably one of the most widespread RL algorithms and it has been used in many RL formulations. However, the newly developed double Q-learning algorithm (Hasselt, 2010) has significantly improved the learning performance, reducing the overestimation typical of the original Q-learning. Double Q-learning was first developed by Hasselt (2010) and has since been used in different branches of RL, as in the work of Zongzhang Zhang (2017) who formulated the weighted double Q-learning showing empirical simulation results. Also, the double Q-learning algorithm has been adopted for recent developments in the emergent field of deep learning (LeCun et al., 2015; Mnih et al., 2015) and these formulations have also been enhanced with experience replay mechanisms using past experience (Pan, Zaheer, White, Patterson, & White, 2018b; van Seijen & Sutton, 2015; Zhang & Sutton, 2017), demonstrating better performance by smoothing over the data and eliminating correlation on the samples.

Fig. 1. Low level PID control module.

Van Hasselt, Guez, and Silver (2016) popularized the double Q-learning algorithm in the deep reinforcement learning field, successfully solving the ATARI domain problems. Also, recent advances in fields like macroeconomics have been reported, as in Goumagias, Hristu-Varsakelis, and Assael (2018) who used double deep RL to determine the tax evasion behavior of taxpayer entities.

3. Background

In this section we briefly outline the basic elements necessary for our double Q-PID formulation, which will be exposed in the next section.

3.1. RL for adaptive PID

The proposed control architecture is based on a classical PID controller, whose control law is (Kuo, 1995):

u(kT) = u(kT − T) + k1 e(kT) + k2 e(kT − T) + k3 e(kT − 2T)    (1)

where:

k1 = kp (1 + kd/T)    (2)

k2 = −kp (1 + 2 kd/T − T/ki)    (3)

k3 = kp kd/ki    (4)

being kp, ki and kd the parameters (gains) of the PID controller, T the sampling time and e(kT) the error term used to form the derivative, proportional and integral terms that in turn yield the control action u(kT) of Eq. (1) at time instant kT, with k = 0, 1, 2, . . ..
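To make the control law above concrete, the short Python sketch below evaluates Eqs. (1)–(4) as reconstructed here; the function names and argument layout are our own illustration, not part of the released implementation.

```python
def pid_coefficients(kp, ki, kd, T):
    """Coefficients k1, k2, k3 of Eqs. (2)-(4) for gains (kp, ki, kd) and sampling time T."""
    k1 = kp * (1.0 + kd / T)
    k2 = -kp * (1.0 + 2.0 * kd / T - T / ki)
    k3 = kp * kd / ki
    return k1, k2, k3


def pid_step(u_prev, e0, e1, e2, gains, T):
    """Incremental PID law of Eq. (1): u(kT) from u(kT - T) and the last three errors."""
    k1, k2, k3 = pid_coefficients(*gains, T)
    return u_prev + k1 * e0 + k2 * e1 + k3 * e2
```

One such update runs per manipulated variable, so a robot with D degrees of freedom simply evaluates this step D times per control cycle.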
Commonly, mobile robots, whether terrestrial, aerial, aquatic or underwater, have at least two or more degrees of freedom to be controlled. In this sense, a control module such as the one depicted in Fig. 1 is used for low-level control. Notice that in the low-level control module there could be as many PID controllers as manipulated variables. Thus, each PID should determine the magnitude of each low-level control action (uti) to be executed on the robot according to the PID control law of Eq. (1). On the other hand, we can think of the parameters of the implemented PID controllers as arranged in an action vector k = (k1, k2, . . . , kD)T, with D being the number of PID controllers, such that an RL agent needs to select the action vector seeking to achieve a certain operative goal. Thus, we can use a two-layer control architecture as in Fig. 2, where the lowest layer is the PID control module and the highest layer is composed by an artificial agent. The RL agent is continuously learning by interacting with the environment and fixing the PID gains k1 = (k1p, k1i, k1d), . . . , kD = (kDp, kDi, kDd) through the learned control policy π.

Into the RL framework, this decision problem can be formulated as a Markov Decision Process (MDP) (Monahan, 1982), where the aim of the RL agent is to find an optimal policy π∗ that defines the optimal controller parameters (kt) for different system states (xt) such that the total expected reward is maximized:

J∗ = maxπ Jπ = maxπ Eπ{Rt | xt = x}    (5)

Then, we can define the optimal state-action value function (Q(·,·)) as (Sutton & Barto, 1998):

Q∗(xt, kt) = E{rt+1 + γ maxk Q∗(xt+1, kt+1) | xt, kt}    (6)

where rt+1 is the instantaneous reward obtained by the system transition from the state xt to xt+1 and γ is the discount rate. Then, when the optimal state-action function is learned by interactions, the optimal policy can be directly obtained as:

π∗ = arg maxk Q∗(xt, k)    (7)

One of the most widespread RL algorithms to find the optimal policy π∗ is the well-known off-policy model-free Q-learning algorithm (Watkins & Dayan, 1992). Its simplest one-step update rule is:

Q(xt, kt) ← Q(xt, kt) + α [rt+1 + γ maxk Q(xt+1, kt+1) − Q(xt, kt)]    (8)

where α ∈ (0, 1] is the learning rate and γ ∈ (0, 1] is the discount factor. According to Eq. (8), the state-action function Q directly approximates Q∗, regardless of the policy being followed. Thus, the optimal policy can then be easily obtained as described in Eq. (7), with probability of convergence of 1, regardless of the policy followed (Bhatnagar, Sutton, Ghavamzadeh, & Lee, 2009; Sutton & Barto, 1998).

Fig. 2. Structure of the QPID approach.

3.2. Double Q-learning

Double Q-learning was first developed by Hasselt (2010) and has since been applied successfully to solve a number of problems (Goumagias et al., 2018; Pan, Wang, Cheng, & Yu, 2018a; Van Hasselt et al., 2016; Zongzhang Zhang, 2017).

Double Q-learning was developed in order to reduce the overestimation present in the original Q-learning algorithm, and this is achieved by a double estimator for the Q function, whereby two functions QA and QB are used. Each Q function must be updated with a value from the other, i.e., if the policy uses the function QA to choose the action according to Eq. (7), then the updating rule of QA, as in Eq. (8), is:

QA(xt, kt) ← QA(xt, kt) + α [rt+1 + γ maxk QB(xt+1, kt+1) − QA(xt, kt)]    (9)

Alternatively, when the action selection according to Eq. (7) is made using QB, the value function updating rule is directly:

QB(xt, kt) ← QB(xt, kt) + α [rt+1 + γ maxk QA(xt+1, kt+1) − QB(xt, kt)]    (10)

Since the update of QA and QB is done with its opposite, each update takes into consideration a different subset of the data set, thus removing the bias in the value estimation. Due to the characteristics of the update, the computational burden is not increased, maintaining the design characteristics of the original Q-learning. In this sense, RL proposals using the Q-learning algorithm can be easily enhanced using double Q-learning to reduce overestimation.
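A minimal tabular sketch of the double estimator, following the select-with-one, evaluate-with-the-other scheme of Hasselt (2010) that Eqs. (9) and (10) summarize; the dictionary-based tables and helper names are illustrative assumptions.

```python
import random


def double_q_update(QA, QB, x, k, r, x_next, actions, alpha, gamma):
    """One double Q-learning step: randomly pick which table to update."""
    if random.random() < 0.5:
        # update QA: greedy action chosen with QA, evaluated with QB (Eq. (9))
        k_star = max(actions, key=lambda a: QA[(x_next, a)])
        QA[(x, k)] += alpha * (r + gamma * QB[(x_next, k_star)] - QA[(x, k)])
    else:
        # update QB: greedy action chosen with QB, evaluated with QA (Eq. (10))
        k_star = max(actions, key=lambda a: QB[(x_next, a)])
        QB[(x, k)] += alpha * (r + gamma * QA[(x_next, k_star)] - QB[(x, k)])
```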
3.3. Incremental discretization of the state and action spaces

Normally, in the standard Q-learning application a state-action discretization is made in advance. In this case, if the discretization is coarse the results could be poor; on the other hand, if it is too fine the Q-learning algorithm could become intractable. For this reason, many proposals incorporating function approximators were made (Buçoniu, Babuska, Schutter, & Ernst, 2010; Timmer & Riedmiller, 2007), but the action selection step introduces an optimization bias which is often problematic and can cause an unstable learning process (Neumann & Peters, 2008) or require a lot of computational load (LeCun et al., 2015; Mnih et al., 2015), not suitable for robotic systems operating in real time. For this reason we include a mechanism for incremental discretization of the state and action spaces. Particularly, in this section we only present the main ideas of incremental discretization of the state space and action space developed in Carlucho et al. (2017). These concepts are the basis of our current proposal. However, here we just describe the original concepts, while later, in Section 4, we will adapt and formalize these ideas according to our current proposal.

Let X be a data set for the state space representation; η(·,·) a membership function; ρ a scalar that represents a threshold value; and xt the current system state at time t. So, as the system evolves over time, step by step, a state transition occurs from one state xt to another xt+1. We drop the time subindex t to simplify the notation, yielding x = xt and x' = xt+1. Then, the incremental conformation process of the set X is as follows: each time that a state transition occurs, we must verify if x' provides relevant information regarding X. For this, we use the membership function, such that if η(x, x') > ρ ∀ x ∈ X, then x' contributes with relevant information for the state space representation. Therefore, x' must be incorporated to X, i.e. X = {X ∪ x'}. Note that the discretization could be done until a maximum level Z, being ρz the discretization threshold for each discretization level z of the state space, i.e. we have the threshold set (ρ0, ρ1, . . . , ρZ).
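The incremental conformation of X can be sketched as below; following the authors' applications (see Section 5), η is taken as the Euclidean distance, and the function name is ours.

```python
import numpy as np


def incorporate_state(X, x_new, rho):
    """Add x_new to the state set X only if it lies farther than the threshold rho
    from every stored state, i.e. if it carries relevant new information."""
    if all(np.linalg.norm(np.asarray(x_new) - np.asarray(x)) > rho for x in X):
        X.append(x_new)
    return X
```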
As the interactions progress, the learning process does too. So, in our formulation, an agent must fix the PID gains according to the performance requirements. In a Q-learning algorithm, at each decision step, an action k must be chosen from a given set K of possible actions by means of a control policy π. In the proposed approach, we initially have a fairly coarse discretization of the action space, summarized in the set K. Then, as the agent interactions progress, the degree of discretization of the set K is properly augmented in those regions where the best actions lie. This idea is schematically represented in its simplest form in Fig. 3. The process represented in this figure is as follows.

Fig. 3. Incremental action discretization process.

First, suppose that the system at a time t is in a state xt and the agent chooses the action kt (represented by the shaded cell in Fig. 3) from a set of possible actions K (represented by all cells of a row in Fig. 3). Then, suppose that after the selected action is applied on the system, it evolves to a state xt+1 = xt, i.e., the system remains invariant. It is expected that the agent will, in the next decision step, by means of the policy in Eq. (7), choose the same action if K remains unchanged. Then, if after N time steps the agent selects the same action from K and consequently the system evolves towards the same state configuration, the set K will be changed to offer a higher discretization. This is called the incremental discretization process of K, carried out as indicated in Fig. 3. That is, the shaded cell in the top row will be discretized with a step δj, as shown in the second row of Fig. 3, and the new configuration of the set K is obtained.

Again, suppose that the agent selects the same action during N times and the system remains invariant; the splitting process is repeated again, and the new K is given as in Fig. 3. Consequently, this incremental procedure must be systematically repeated each time that the system remains invariant and, as the learning progresses, the action space will be discretized as depicted in Fig. 3. Analogously to the state space discretization, the discretization process of the action space could be done until a maximum level J is achieved, being δj the discretization step for each discretization level j, i.e. the step set (δ1, . . . , δJ).

4. Double Q-PID algorithm

4.1. Incremental active learning

In this section we present the incremental active learning mechanism for simultaneous exploration of the state and action space.

On that subject, the exploration-exploitation dilemma is a crucial one for RL agents. One potential problem encountered is that a large amount of robot-environment interactions may be necessary to learn a successful policy. In that sense, the RL agent has to exploit its current knowledge of the environment and at the same time it needs to explore the environment sufficiently to learn more about its reward-relevant structure. To tackle these issues we developed an incremental active learning procedure which involves the ideas presented in the previous section.

As we have exposed, the interaction between the artificial agent and its environment starts from a coarse discretization of the state and action space. From this point on, what could happen is that the agent chooses an action (Eq. (7)) and the system evolves repeatedly to the same successor state. Under these circumstances, we previously stated that the system remains invariant and a finer discretization of the action space is required to improve the control policy.

Let Mt be a temporal memory defined as a tuple Mt = (xt, kt) containing information about the system state at time t and the action taken at that moment. Then, to determine whether or not the system stays invariant at a time t, we compare the last N memories; if these memories are equal it means that the system has not experienced a change from the perspective of the agent, i.e., Mt = Mt−1 = . . . = Mt−N. If this is the case, the agent needs to explore new possible actions. Therefore a new subset of possible actions is generated through an incremental action discretization (explained above) of the action space around kt (see Fig. 3). For the next decision step, the agent will have to choose its action from this recently created subset of actions. In addition, it is worth noting that while the incremental discretization of the action space is being made, a discretization of the state space must also be done (as described in the previous section). In summary, both procedures, the incremental discretization of the action space and of the state space, must be made each time that the system remains invariant, i.e., each time that Mt = Mt−1 = . . . = Mt−N.

The incremental active learning mechanism, by means of the incremental discretization of both the state and action space, furthers the exploration into a promising region of the feasible solution space. In this way, the agent avoids the unnecessary exploration of large regions, concentrating on a more relevant space, going from a coarse discretization to a more refined and focused one. This idea can be seen in Fig. 4, which shows a promising initial region defined by the surrounding regions around the shaded reachable states. Within this region there are reachable states, all of them representing sub-regions of the state space. Thus, the set of all reachable states defines what we call a reachable region of the state space. In this way, working with large data sets is avoided and, as a consequence, the computational cost is lowered, a desirable goal when working with autonomous mobile robots.

Fig. 4. Schematic representation for the incremental discretization of the state space.

4.2. Stochastic experience replay

Typical of deep RL formulations, stochastic experience replay mechanisms have been developed for faster off-line updates of the value function approximators in continuous domains (Mnih et al., 2015; Zhang & Sutton, 2017). This is a simple but efficient mechanism which repeatedly samples from a buffer R of recent experience, recorded from real transitions τi = (xt, ut, rt, xt+1).

This strategy behaves like model-based RL paradigms, which simulate one-step transitions repeatedly to update the model and action-value functions. However, in our proposal, unlike model-based formulations, no model is being constructed. We simply randomly select, in each algorithm iteration, a number Γ of transitions from the replay buffer R to update the state-action value functions. The replay buffer must be bounded and it can only store a maximum number m of state transitions, i.e. R = (τ1, τ2, . . . , τi, . . . , τm). When R reaches its maximum size, the oldest information is removed in order to store the newly experienced transitions.

When the experience replay mechanism is being used, it is beneficial to initialize the state-action value functions, QA and QB, randomly. In practice, the random initialization of value functions in RL formulations is not a trivial task; however, it has a double benefit. On the one hand, it is desirable since it contributes to exploration, which is crucial in RL. On the other hand, and more importantly, the possibility of bias in the value functions decreases. As an additional advantage, randomly initializing the functions means one less input parameter for the algorithm. This represents a significant advantage since the smaller the number of hyper-parameters, the lower the influence of the designer on the behavior of the algorithm. This simplifies re-implementation of the algorithm and the ability to easily switch platforms.
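A minimal sketch of the bounded buffer and the stochastic replay step described above; Γ is the number of transitions replayed per iteration, and the class and method names are our own.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of transitions tau = (x_t, k_t, x_next, r_next); oldest entries are dropped first."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, n_replay):
        """Randomly draw up to n_replay (the Gamma of the text) stored transitions."""
        return random.sample(list(self.buffer), min(n_replay, len(self.buffer)))
```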
4.3. Algorithmic statement

In Algorithm 1 we outline the pseudo-code of the proposed Double Q-PID algorithm in a compact form. This pseudo-code shows the main steps of our proposed approach, which we will explain in detail in the current subsection. In addition, to clarify the computational procedure and facilitate the reproducibility of the Double Q-PID algorithm, we make the code available in an open repository¹.

Algorithm 1 Pseudo-code for the Double Q-PID algorithm.
1: Inputs: α, γ, r(·,·), ε0, K, η(·,·), n, m, N, the state-space thresholds (ρ0, . . . , ρZ) and the action-space steps (δ1, . . . , δJ)
2: x0 ← get sensor readings
3: Initialize replay buffer R
4: Initialize X
5: Initialize QA, QB
6: loop
7:   ε-greedy action selection → kt
8:   Set the current temporal memory as Mt = (xt, kt)
9:   Interact with the system, observe the transition state xt+1 and the obtained reward rt+1
10:  if the system is variant then
11:    Store the current transition τ = (xt, kt, xt+1, rt+1) in the replay buffer R
12:    Identify the nearest state xi ∈ X to xt+1
13:    if xt+1 is in the vicinity of xi then
14:      Double Q-learning updating → update QA and QB according to Eqs. (9) and (10)
15:    else
16:      Incorporate xt+1 in X
17:      Set up QA and QB for incorporating the new point xt+1
18:    end if
19:  else
20:    Incremental active learning
21:  end if
22:  Stochastic experience replay
23:  Set the system state for the new execution loop
24: end loop

¹ https://github.com/IgnacioCarlucho/Double_QPID

In Line 1 the input parameters of the algorithm are indicated. Some of these parameters are common in Q-learning applications, such as the learning rate α, the discount rate γ, the reward function r(·,·) and the initial exploration probability ε0. However, some are specific to the proposed algorithm, such as the initial coarse set of possible actions K, the membership function η(·,·) for incremental state discretization, the number n of initial interactions for the replay buffer initialization and the maximum admissible size of the data set X. The parameter N is the number of temporal memories (Mt) that must be compared for determining the invariability of the system. Finally, the last two inputs contain the thresholds for the different levels of incremental discretization of the state and action spaces, respectively.

From Line 2 to 5 the algorithm components are initialized. In robotic applications the robot initial state, x0, commonly cannot be fixed in advance. So, the initial state is composed by sensor measurements, as indicated in Line 2. In Line 3 the replay buffer (R) is initialized. Following, in Line 4, the data set X is initialized with the n + 1 states experienced during the initial interactions, i.e. X = {x0, x1, . . . , xn+1}. Once we have the initial representations of the state and action spaces, X and K, in Line 5 the state-action values QA and QB are randomly initialized.

The Double Q-PID algorithm can run indefinitely, if it were necessary, executing the loop between Line 6 and Line 24 continuously. In our formulation we propose to use an ε-greedy scheme for action selection (Line 7) based on QA and QB. First it is necessary to choose, e.g. randomly, between updating the QA or QB value function according to Eq. (9) or Eq. (10), as appropriate. Secondly, with probability 1 − εt the greedy action is obtained from QA, if updating A was chosen, or from QB, if updating B was chosen. With probability εt a non-greedy action is selected. Then, in Line 8, with the current state (xt) and the currently selected action (kt), the temporal memory is updated as Mt = (xt, kt). Once the action kt has been determined, the gains of the low-level PID controllers (in the PID module of Fig. 2) are set. Thus, this low-level controller interacts with the robot (Fig. 1) and, after one step-ahead interaction, we observe the transition state xt+1 and compute the reward signal rt+1 (Line 9) using the reward function.

After the interaction has occurred, we must decide whether the system remains invariant or not, by comparing the last N temporal memories (Mt, Mt−1, . . . , Mt−N) according to the procedure explained in Section 4.1. This process is carried out along the conditional statement from Line 10 to Line 18. When the system "is variant" this means that the last N temporal memories are not equal. Therefore, we must incorporate the experienced transition τ in the replay buffer R (Line 11). Also, as was previously developed in Section 4.1, we have to identify the nearest² state xi ∈ X to the current system state xt+1. If xt+1 is inside the vicinity of any state xi ∈ X (Section 3.3), we have to update the state-action value functions QA(xi, kt) and QB(xi, kt) according to the double Q-learning rules (Section 3.2). In the case that xt+1 gives relevant information with respect to X, i.e. η(xt+1, xi) > ρt ∀ xi ∈ X, xt+1 has to be incorporated in X (Line 16). The incorporation of xt+1 also means the augmentation of QA and QB with initial (random) state-action values for all (xt+1 × K) (Line 17). On the other hand, when the system "is not variant" (Line 19) the incremental active learning procedure (Section 4.1) must be carried out, as indicated in Line 20.

² Note that, to simplify the reading, we say "nearest" because we use the euclidean distance as membership function η(xt+1, xi) in our applications (Section 5), but keep in mind that η is an algorithm input and it can be defined in the most convenient way.
conditional statement from Line 10 to Line 18. When the system 5.1.1. Experimental platforms
“is variant” this mean that the last N temporal memories are not The proposed approach was first tested in different simulated
equal. Therefore, we must incorporate the experienced transition τ environments using three different robotics platforms depicted in
in the replay buffer R (Line 11). Also, as was previously developed Fig. 5. Then experiments using a real robot were made. The plat-
in Section 4.1, we have to identify the nearest2 state xi ∈ X to the forms used for the simulated experiments were the Pioneer 3-
current system state xt+1 . If xt+1 is inside the vicinity of any sate AT (Fig. 5a), a Quadrotor (Fig. 5b) and the ICTIOBOT (Fig. 5c), an
xi ∈ X (Section 3.3), we have to update the state-action value func- autonomous underwater vehicle (AUV). For the real-time experi-
tions Q A (xi , kt ) and Q B (xi , kt ) according to the double Q-learning ments an implementation on a Pioneer 3-AT, available in our labo-
rules (Section 3.2). In the case that xt+1 gives relevant information ratories, was carried out.
with respect to X, i.e. η (xt+1 , xi ) > ρt ∀xi ∈ X, xt+1 has to be in- Notice that all used platforms are completely different in term
corporated in X (Line 16). The incorporation of xt+1 , means also of dynamic behavior and also have distinctive characteristics which
the augmentation of QA and QB with initial (random) state-actions serve well as to test the adaptive performance of the proposed
values for all (xt+1 × K) (Line 17). On the other hand, when the algorithm in different conditions. In this way, we aim to demon-
system “is not variant” (Line 19) the incremental active learning strate the adaptive ability of the proposed approach for adaptive
procedure (Section 4.1) must be carried out as is indicated in Line. low-level PID control of mobile robots operating in different envi-
ronments.
The first robot used is the Pioneer 3-AT, a mobile robot with
2
four differential drive wheels, extensively used in several research
Note that to simplify the reading, we say “nearest” because we use the eu-
clidean distance as membership function η (xt+1 , xi ) in our applications (Section 5),
works as an experimental platform. The independent turning ac-
but keep in mind that η is an algorithm input and it can be defined in the most tion of the wheels allows it to simultaneously control both the
convenient way. forward and the turning speed. In addition, the manufacturer pro-

In addition, the manufacturer provides an open source model suitable for implementation in Gazebo with ROS. The other simulation platform used is the Hector-quadrotor (Alejo, Cobano, Heredia, & Ollero, 2016; Arokiasami, Vadakkepa, Tan, & Srinivasan, 2016), which is a small flying robot. This unmanned aerial vehicle (UAV) also has a simulation platform available for Gazebo and ROS. Finally, the proposed algorithm is tested for the adaptive low-level PID control of an underwater robot, the ICTIOBOT. It is an autonomous underwater vehicle (AUV) equipped with four thrusters that allow for the control of 4 degrees of freedom (DOF), namely: surge, sway, pitch and yaw. A particularity of AUVs is their coupled and highly nonlinear dynamics, and therefore the tuning of low-level PID controllers is not a simple task. For the ICTIOBOT simulation, a dynamic model developed by Petit, Paulo, Carlucho, Menna, and Paula (2016) and a simulated environment with ROS are used to test the proposed algorithm.

All the simulation environments used for the experiments are available to the public and we provide a tutorial on how to set up the environments in the corresponding github repository where the code of the algorithm is hosted too³.

³ https://github.com/IgnacioCarlucho/Double_QPID

5.1.2. RL setup

Controller setup: In the series of experiments conducted, the actions taken by the incremental Q-learning agent are the gains of the PID controllers. We employed one PID controller for each degree of freedom, thus:

kt = (k1p, k1i, k1d, k2p, k2i, k2d, . . . , kDp, kDi, kDd)    (11)

where D varies with the application and the number of degrees of freedom that are being controlled.

For both the Pioneer 3-AT and the Hector-quadrotor, we control the movements on the horizontal plane. So, here we control the linear and angular speed only; therefore we only have two degrees of freedom, i.e. D = 2. In these cases, kt is a gains vector, while the control action vector to be executed on the robot is ut = (u1, u2). Here, u1 is the control command for the linear instantaneous speed and u2 is the control command for the rotational instantaneous speed with respect to the vertical z axis. In the case of the ICTIOBOT, we only control the movements in three degrees of freedom using four thrusters; therefore the composition of the manipulated control variable is ut = (u1, u2, u3, u4). Note that while the ICTIOBOT has four degrees of freedom (surge, heave, yaw, pitch), we consider the case where the immersion maneuver is made with zero pitch. Therefore the control must be made on the horizontal plane, resulting in a control problem in three degrees of freedom, i.e. surge, heave and yaw are controlled.

It is important to note that the initial values of kt are taken from a uniform random distribution such that kimin ≤ ki ≤ kimax. By randomly initializing the values of kt we encourage the algorithm to explore, while at the same time we reduce the number of hyper-parameters that the user needs to supply, therefore simplifying the user requirements when working with real-time platforms.

Reward function: The reward function is defined based on the operative requirements of the system. So, in our experiments these operative requirements are velocity setpoints. In this sense, we use a Gaussian function to shape the reward, of the form:

rt(xt, xreq) = (1/(2πσ)) exp(−‖xt − xreq‖² / (2σ²))    (12)

with σ being a free parameter that determines the width of the Gaussian function and the double bar operator (‖·‖) denoting the euclidean distance between the current state xt and the setpoint xreq. It is worth noting that both the current state, xt, and the requested state, xreq, are vectors with dimensions varying depending on the degrees of freedom, while the reward is a scalar value. Notice that the Gaussian function always gives a high positive reward signal when the current system state (xt) is close to its fixed target (xreq) and a lower reward is obtained when the distance between xt and xreq is larger.

Exploration/Exploitation setup: Another important aspect of the RL formulation is how the exploration-exploitation of the agent knowledge is implemented. If the learned representation is exploited too early, the agent performance may reach a local maximum, failing to see better policies due to the lack of exploration of the state-space. On the other hand, if the exploration is too high, the learned policy is never implemented since the agent randomizes its behavior. A common practice is a middle ground, with a high exploration policy during the first stages of learning, when the agent has little knowledge of the problem, and over time changing the behavior to that of exploiting the learned representation, something that commonly happens in later stages of the learning process. Thus, to this end we utilize an ε-greedy exploration scheme with an exponential decay of the parameter εt, which represents the probability of the agent taking an exploratory action based on a random policy rather than an action based on the current policy. Then, the decay is defined as:

εt = ε0 + ε1 e−t    (13)

where, unless otherwise stated, ε0 = 0.05 and ε1 = 0.3 for all experiments.
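Eq. (12) and Eq. (13) translate directly into code; the defaults below are the values reported in this subsection, and the function names are ours.

```python
import numpy as np


def gaussian_reward(x, x_req, sigma=1.0):
    """Reward of Eq. (12): highest when the current state equals the requested setpoint."""
    d2 = float(np.sum((np.asarray(x) - np.asarray(x_req)) ** 2))
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma)


def epsilon_t(t, eps0=0.05, eps1=0.3):
    """Exponentially decaying exploration probability of Eq. (13)."""
    return eps0 + eps1 * np.exp(-t)
```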
5.2. Simulated experiments

In the following, the simulation experiments and the obtained results are presented. To organize the presentation, we show the results in different subsections, one for each simulated robot platform.

In all cases, unless otherwise stated, we defined the remaining inputs of the algorithm as follows: the learning rate is fixed at α = 0.20; the discount factor is γ = 0.95; the shape parameter of the reward function is σ = 1.0; to initialize the replay buffer we fix the number of initial interactions at n = 8; for the temporal memories comparison we use N = 7 memories; for the incremental discretization of the state space we use thresholds (ρ0, ρ1, . . .) decreasing by (ρmax − ρmin)/zmax per level, from ρmax = 0.25 down to ρmin = 0.005, with zmax = 20 levels; and for the incremental discretization of the action space we set δ0 = 0.25 and δl = δl−1/2, with lmax = 8, for each action dimension.

Also, it is worth noting that during all preliminary trials the Double Q-PID algorithm quickly learned to cancel the derivative terms, i.e. the derivative terms were almost always 0, or very close to 0, and in practice they can be ignored. Consequently, in order to show the results as clearly as possible, and unless stated otherwise, we do not include the derivative terms; in consequence, the action vector of the RL agent is kt = (k1p, k1i, . . . , kDp, kDi).
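For reference, the experiment-wide settings listed above can be gathered into a single configuration object; this merely restates the reported values with illustrative key names and adds nothing new.

```python
DEFAULT_CONFIG = {
    "alpha": 0.20,      # learning rate
    "gamma": 0.95,      # discount factor
    "sigma": 1.0,       # width of the Gaussian reward, Eq. (12)
    "n": 8,             # initial interactions used for initialization
    "N": 7,             # temporal memories compared in the invariance test
    "rho_max": 0.25,    # largest state-space threshold
    "rho_min": 0.005,   # smallest state-space threshold
    "z_max": 20,        # number of state-space discretization levels
    "delta_0": 0.25,    # initial action-space step (halved at each level)
    "l_max": 8,         # maximum action-space discretization level
    "eps_0": 0.05,      # asymptotic exploration probability, Eq. (13)
    "eps_1": 0.3,       # initial additional exploration, Eq. (13)
}
```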

Fig. 6. Results of Test 1. Simulated Pioneer 3-AT with vx_ref = 0.4 m/s and wz_ref = −0.3 rad/s.

Fig. 7. Results of Test 2. Simulated Pioneer 3-AT with vx_ref = −0.3 m/s and wz_ref = −0.4 rad/s.

Fig. 8. Results of Test 3. Simulated Hector-quadcopter with v_ref = −0.5 m/s and w_ref = −0.2 rad/s.

5.2.1. Mobile terrestrial robot
As we have stated, to test the proposed algorithm in a terrestrial robot, simulation experiments were performed using the Gazebo simulator for the Pioneer 3-AT. As stated earlier, there are two manipulated variables ut = (u1, u2) and the system state is defined as xt = (vx, wz), being vx and wz the controlled variables, i.e. the linear and angular velocity respectively⁴. Therefore, in the low-level PID module (Fig. 2) there are two low-level controllers (D = 2). Unless clarified, for all the tests carried out in the remainder of this work, the initial values for the gains (kdj) in the vector kt are randomly selected, such that 0 ≤ kdj ≤ 2.

Fig. 6 shows the results obtained for the experiment Test 1. In this trial, the reference for the controlled variables was set as vx_ref = 0.4 m/s and wz_ref = −0.3 rad/s. In Fig. 6a the obtained controlled velocities are shown. Here, we can see how in less than a few seconds the desired velocity setpoint is achieved. Fig. 6b shows a schedule of the selected controller gains. As can be seen, after about 100 seconds the agent finds a suitable combination of controller gains, and therefore it continues to seek a solution around these values. This fact can also be seen in Fig. 6c, where the evolution of the incremental action discretization level is shown. Here we can see how the incremental discretization of the action space evolves over time, finally reaching the fixed maximum level lmax = 8. It can be seen that after about 100 seconds the agent is starting to find a set of gains to successfully drive the robot according to its desired operative setpoint. This means that the action selection has stabilized, allowing the incremental discretization of the state and action spaces to occur by means of the incremental active learning procedure developed in Section 4.1. In this sense, we can think that the agent successfully adapts to the required conditions and, therefore, specializes in the search for actions (controller gains) within a certain region. It is worth noting that in under 300 s the algorithm is able to reach the maximum discretization level for the action space, obtaining the optimal set of controller gains.

A second independent trial, Test 2, was also carried out using this platform and the obtained results are shown in Fig. 7. Analogously to the previous trial, Fig. 7a shows the robot instantaneous velocities, Fig. 7b shows the controller gains and Fig. 7c the evolution of the incremental action discretization level. In this case we use a completely different setpoint, with vx_ref = −0.4 m/s and wz_ref = −0.3 rad/s, meaning that the vehicle must perform a left turn while moving backward. The objective of this is to test the algorithm under different operating conditions, since the dynamic behavior of the Pioneer 3-AT is different according to the direction of motion.

⁴ To simplify the notation we omit the subindex t for vx and wz.

Fig. 9. Simulation experiment of Hector-quadcopter with v_ref = 0.3 m/s and w_ref = 0.2 rad/s under wind disturbance.

Fig. 10. Results of Test 4. Simulated ICTIOBOT with vx_ref = −0.2 m/s, vz_ref = 0.0 m/s and wz_ref = 0.2 rad/s.

Fig. 11. Results of Test 5. Simulated ICTIOBOT with vx_ref = 0.15 m/s, vz_ref = 0.05 m/s and wz_ref = −0.25 rad/s.

So, the different velocity requirements for Test 2 with respect to Test 1 show the adaptive performance of the algorithm. It can be seen that the performance of the algorithm is similar to the previous case and, again, the required velocities are obtained almost immediately, reaching the maximum discretization level in under 300 s.

5.2.2. Quadcopter

Similar to the previous experiments, in the following we present simulation results for the quadcopter robot.

Fig. 8 shows the results for the trial Test 3, where we aimed to control the flying robot in a horizontal plane. In this way, the controlled magnitudes are the forward and turning speed, v and w respectively. Therefore, for this control task, once the robot is hovering, our proposed algorithm starts to run with the following velocity requests: vx_ref = −0.5 m/s and wz_ref = −0.2 rad/s. As can be seen in Fig. 8a, the linear and turning velocities achieve their reference values, both after an early overshoot (around the first five seconds); however, the Double Q-PID algorithm rapidly stabilizes the controlled magnitudes. Fig. 8b shows the gains schedule for the low-level controllers. Similar to the trials in Section 5.2.1, in Fig. 8b it is clear that as the learning process progresses the agent focuses on improving its policy and only actions in a bounded region of the action space are executed, that is, a higher discretization level (Fig. 8c) of the state and action spaces is achieved by the incremental active learning mechanism (Section 4.1).
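The discretization-level curves just discussed (Figs. 6c, 7c and 8c) reflect an increasingly fine action grid. The toy sketch below illustrates one plausible refinement rule, assuming that each level halves the width of the interval explored around the currently preferred gain until the maximum level (l_max = 8 in these trials) is reached; the actual rule is the incremental active learning procedure of Section 4.1 and is not reproduced here.

    import numpy as np

    def refine_actions(center, half_width, level, n_points=5):
        """Return a grid of candidate gains around `center`.

        Each increment of `level` halves the width of the explored interval,
        so exploration concentrates around the currently preferred gain.
        """
        width = half_width / (2 ** level)
        return np.linspace(center - width, center + width, n_points)

    l_max = 8            # maximum discretization level used in the trials
    center_gain = 1.0    # currently preferred kp value (illustrative)
    for level in range(l_max + 1):
        grid = refine_actions(center_gain, half_width=1.0, level=level)
        print(f"level {level}: candidate kp values {np.round(grid, 4)}")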

Fig. 12. Simulation experiment of the ICTIOBOT with vx_ref = 0.20 m/s, vz_ref = 0.0 m/s and wz_ref = 0.20 rad/s under current.

Fig. 13. Simulation experiment using QPID to control the ICTIOBOT with vx_ref = −0.2 m/s, vz_ref = 0.0 m/s and wz_ref = 0.2 rad/s.

Fig. 14. Heat maps of Q-values for the experiments of the trial Test 4 using DQPID and QPID.
Aiming to demonstrate the robustness and adaptability of the Double Q-PID algorithm, in Fig. 9 the results for a more demanding trial are shown. In this case the quadcopter faces a medium-intensity wind force with the following velocity requests: vx_ref = 0.3 m/s and wz_ref = 0.2 rad/s. As can be seen, the proposed algorithm successfully manages the robot, achieving its reference requests in a short time period; as shown in Fig. 9a, there is only a negligible swing in the velocities due to the effect of the variable, medium-intensity wind force.

5.2.3. Underwater vehicle

In this subsection we present the results obtained when controlling the underwater vehicle ICTIOBOT. Unlike the previous cases, we face a 3DOF control problem, since we control the surge (vx), heave (vz) and yaw (wz) velocities. For this reason, we use a control module with three low-level controllers for this 3DOF control task. Moreover, the highly coupled dynamics of the vehicle entails a more challenging control task for the control system.

Fig. 10 shows the results for a trial, Test 4, with vx_ref = −0.2 m/s, vz_ref = 0.0 m/s and wz_ref = 0.2 rad/s. As can be seen, the proposed algorithm successfully controls the system, with only a small initial overshoot in the controlled magnitudes, and the AUV achieves the requested setpoint in almost 10 s (Fig. 10a). Notice that during the first 100 seconds the system was varying notably and consequently the system memories (M_t) were varying too; for this reason only the initial coarse discretized action space was available for the selection of the controller gains.
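The system memories M_t referred to above store past transitions that are replayed to update the value estimates. As a generic point of reference, the sketch below shows a tabular double Q-learning update driven by uniformly sampled transitions from such a memory (in the spirit of Hasselt, 2010); the states, actions, reward and sampling scheme are placeholders, and the exact update used by Double Q-PID is the one defined in Section 4.

    import random
    from collections import defaultdict

    alpha, gamma = 0.1, 0.95          # assumed learning rate and discount factor
    QA = defaultdict(float)           # two independent tabular estimators
    QB = defaultdict(float)
    memory = [                        # M_t: (state, action, reward, next_state) tuples
        (("s0",), 0, 1.0, ("s1",)),
        (("s1",), 1, 0.5, ("s0",)),
    ]
    actions = [0, 1]

    def replay_update(batch_size=2):
        for s, a, r, s_next in random.sample(memory, k=min(batch_size, len(memory))):
            if random.random() < 0.5:
                # Update QA: select the greedy action with QA, evaluate it with QB.
                a_star = max(actions, key=lambda b: QA[(s_next, b)])
                target = r + gamma * QB[(s_next, a_star)]
                QA[(s, a)] += alpha * (target - QA[(s, a)])
            else:
                # Symmetric update of QB, using QA for the evaluation.
                b_star = max(actions, key=lambda b: QB[(s_next, b)])
                target = r + gamma * QA[(s_next, b_star)]
                QB[(s, a)] += alpha * (target - QB[(s, a)])

    replay_update()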

Fig. 15. DQPID for a real-time implementation in the Pioneer 3-AT with vx_ref = −0.2 m/s and wz_ref = 0.1 rad/s.

Fig. 16. Results of the controlled variables for different real-time implementations of DQPID in the Pioneer 3-AT.
Another trial, Test 5, was made with different operative references. In this case, the fixed setpoint was vx_ref = 0.15 m/s, vz_ref = 0.05 m/s and wz_ref = −0.25 rad/s. With this operative reference, we aim to test the algorithm in a more complex control task, which implies moving forward while turning and submerging (the z axis points downwards). Moreover, the dynamic behavior of the vehicle is different from that required in Test 4 for multiple reasons; for example, the behavior of the thrusters is not the same when moving forwards as when moving backwards, and the robot geometry is not symmetrical, therefore the drag coefficients are not the same. The results of Test 5 are shown in Fig. 11 and, despite the facts mentioned, we can see that the algorithm successfully solves this control task. We can see in Fig. 11a that again a small overshoot is present; nonetheless, the velocities are achieved quite fast (around 12 s). Similar to Test 4, in Fig. 11b and c it can be seen that during the first moments, due to the variability of the system memories, only the initial coarse discretized action space was available for the selection of the controller gains.

In addition, analogously to what was done for the aerial robot, here we also present a more demanding simulation experiment for the ICTIOBOT involving medium-intensity ocean currents, as described by Fossen (2002) for marine control systems. Fig. 12 shows the obtained results for this trial, where we can see that the Double Q-PID algorithm successfully controls the requested variables. Also, in Fig. 12a it can be seen that the AUV reaches its specification in a short time with an insignificant variation around the setpoint.

5.3. Performance comparison

In order to make a performance comparison with another comparable methodology, in this section we present a comparison between the proposed Double Q-PID (DQPID) and the original Incremental Q-learning (QPID) strategy (Carlucho et al., 2017). For this purpose, we repeat all trials of the previous section but using the QPID algorithm.

In order to perform a comparative evaluation, different indexes were calculated to compare the performance of the algorithms. As performance metrics we use the Mahalanobis distance (MAHA), the Euclidean distance (ED), the mean squared error (MSE) and the settling time (ST), given in seconds. Table 1 summarizes the different performance metrics computed for all the performed trials. As can be seen, the DQPID algorithm always outperforms the QPID algorithm.

Table 1
Performance indexes summary for the comparative study between DQPID and QPID.

Trial    Algorithm   MAHA     ED        MSE       ST (s)
Test 1   DQPID       0.2271   0.00282   0.01438    5.07
         QPID        0.4904   0.00351   0.01813    7.51
Test 2   DQPID       0.2141   0.00276   0.01497    5.04
         QPID        0.4752   0.00337   0.01864    7.03
Test 3   DQPID       0.1501   0.00240   0.01690    4.81
         QPID        0.1797   0.00274   0.01906    5.07
Test 4   DQPID       0.2561   0.00409   0.01674   10.12
         QPID        1.0858   0.02813   0.02441      –
Test 5   DQPID       0.3957   0.00368   0.01171   18.09
         QPID        0.7357   0.01395   0.01844   40.16

Notice in Table 1 that when the QPID algorithm is used to control the underwater vehicle, in Test 4, no settling time is reported. This means that the QPID algorithm did not achieve the required operative conditions, while DQPID settled in just over ten seconds. In Fig. 13 we show the results obtained for Test 4 when the QPID algorithm is used. Comparing these results against those obtained when the DQPID algorithm was used (Fig. 10) brings out the far superior performance of the DQPID algorithm. In addition, when analyzing the velocity responses in Fig. 13a, we found that the incremental QPID was unable to stabilize these controlled variables. This poor performance contrasts with the successful results obtained when the DQPID algorithm is used, shown in Fig. 10a, highlighting even more the improvement of the proposed DQPID algorithm with respect to the original QPID.

In Fig. 14, a representation of the state-action function Q in a two-dimensional plot is shown. The horizontal axis represents the forward velocity vx, the vertical axis represents the turning velocity wz and a heat map represents the Q-values. It can be seen that the highest Q-values are in the proximity of the required setpoint. However, if we compare the heat map shown in Fig. 14a with that shown in Fig. 14b, it is clear that when the DQPID is used the Q-values are much more sharply defined. This can be attributed to the fact that the DQPID algorithm implements a stochastic experience replay mechanism that takes advantage of the past experience seen by the agent to continuously update the state-action function, which allows the agent a faster improvement of its performance.
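For reference, the indexes reported in Table 1 can be computed from recorded velocity traces along the lines of the sketch below. The 2% settling band, the use of the error covariance for the Mahalanobis distance and the per-sample Euclidean error norm are our assumptions for illustration; the authors' exact definitions may differ.

    import numpy as np

    def performance_indexes(t, refs, meas, band=0.02):
        """Return (MAHA, ED, MSE, ST) for multi-DOF velocity traces.

        t: (N,) time stamps in seconds; refs, meas: (N, D) reference and
        measured velocities. These are plausible definitions for illustration.
        """
        err = refs - meas
        # Mahalanobis distance of the mean error from zero, using the error covariance.
        cov = np.cov(err, rowvar=False) + 1e-9 * np.eye(err.shape[1])
        mean_err = err.mean(axis=0)
        maha = float(np.sqrt(mean_err @ np.linalg.inv(cov) @ mean_err))
        # Mean per-sample Euclidean error norm and mean squared error.
        ed = float(np.linalg.norm(err, axis=1).mean())
        mse = float((err ** 2).mean())
        # Settling time: first instant after which every DOF stays within the band.
        tol = band * np.maximum(np.abs(refs), 1e-6)
        inside = np.all(np.abs(err) <= tol, axis=1)
        settled = next((i for i in range(len(t)) if inside[i:].all()), None)
        st = float(t[settled]) if settled is not None else float("nan")
        return maha, ed, mse, st

    # Tiny synthetic example: two DOFs converging to (0.4, -0.3).
    t = np.linspace(0.0, 20.0, 201)
    refs = np.tile([0.4, -0.3], (t.size, 1))
    meas = refs * (1.0 - np.exp(-t / 2.0))[:, None]
    print(performance_indexes(t, refs, meas))

A trial that never settles, such as QPID in Test 4, simply yields a NaN settling time under this convention, matching the dash reported in Table 1.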

Fig. 17. DQPID for a real-time implementation in the Pioneer 3-AT with setpoint changes.
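Heat maps such as those of Figs. 14 and 18 can be rendered directly from a learned tabular value function. The sketch below assumes a dictionary q_max mapping discretized (vx, wz) states to their maximum Q-value and fills it with synthetic values peaked near a setpoint; the binning and plotting details are illustrative only.

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumed structure: max-Q value per discretized (vx, wz) state,
    # here filled with synthetic values peaked near a setpoint (0.2, 0.1).
    vx_bins = np.linspace(-0.6, 0.6, 25)
    wz_bins = np.linspace(-0.6, 0.6, 25)
    q_max = {
        (vx, wz): -((vx - 0.2) ** 2 + (wz - 0.1) ** 2)
        for vx in vx_bins
        for wz in wz_bins
    }

    # Arrange the table into a grid and render it as a heat map.
    grid = np.array([[q_max[(vx, wz)] for vx in vx_bins] for wz in wz_bins])
    plt.imshow(
        grid,
        origin="lower",
        extent=[vx_bins[0], vx_bins[-1], wz_bins[0], wz_bins[-1]],
        aspect="auto",
        cmap="hot",
    )
    plt.xlabel("forward velocity vx (m/s)")
    plt.ylabel("turning velocity wz (rad/s)")
    plt.colorbar(label="max Q-value")
    plt.title("Q-value heat map over the discretized state space")
    plt.show()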

5.4. Real-time implementation

In this subsection we validate the proposed algorithm on a real mobile robot. In the following, we show the results obtained when implementing the proposed algorithm on a real-time platform, the Pioneer 3-AT. In this case, the input parameters for the DQPID algorithm are the same as those used in the previous trials. Outdoor trials were performed in a field with uneven ground, making the control problem a more difficult task.

In Fig. 15 the results obtained for a trial with a fixed setpoint vx_ref = −0.2 m/s and wz_ref = 0.1 rad/s are shown. As can be seen, the DQPID algorithm is able to reach the desired reference conditions in a short time period while successfully stabilizing the PID constants. Next, in Fig. 16, we present results for different trials carried out for other operative references. In all those cases it can be seen that the DQPID algorithm is consistently successful in reaching the requested setpoints.

Finally, in order to further test the adaptive ability of the algorithm, we set up a trial where the reference is suddenly changed a number of times. Initially we set the setpoint as vx_ref = 0.2 m/s, wz_ref = 0.1 rad/s; after 200 s we change the request to vx_ref = −0.4 m/s, wz_ref = 0.3 rad/s, and then finally go back to the initial request. Fig. 17 shows the obtained results for this trial. As can be seen, the algorithm is able to reach all the desired velocities quickly while achieving a smooth behavior of the robot. In addition, it is noteworthy that there is no significant overshoot or unstable behavior at the times of the reference changes, while the algorithm also successfully stabilizes the PID constants. Fig. 18 shows the resulting heat map of Q-values for this trial, where we can clearly see that the hotter regions correspond to the regions where the setpoints lie.

Fig. 18. Heat map of Q-values for the real-time trial with setpoint changes.

6. Concluding remarks

In this work we have proposed a model-free algorithm for an RL agent-based expert system for adaptive low-level PID control of mobile robotic systems, namely the Double Q-PID algorithm, which can: (i) adaptively drive the parameter selection of the low-level PID controllers of mobile robots, thus improving adaptability and real-time performance; (ii) make a controlled exploration through the most promising regions of the state and action spaces by means of a sparse incremental exploration/exploitation process; and (iii) carry out an incremental discretization of the state and action spaces by means of the proposed incremental active learning mechanism, thus obtaining more up-to-date information and better data efficiency, and consequently improving the accuracy of the environmental information used for action decision-making.

The proposed Double Q-PID algorithm demonstrated a successful performance, obtaining satisfactory results in multiple trials carried out on different (simulated and real) mobile robotic platforms, as shown in the results section. Furthermore, comparative studies with a different algorithm were provided, showing how our proposal surpasses it in multiple domains and operative conditions. The extensive set of results is indicative of the adaptive capabilities of our proposal. Moreover, it is important to note that the optimal PID parameters were obtained in an extremely short amount of time, within seconds, making the implementation of the algorithm feasible for real-time robotic applications.

The proposed methodology works without the need of a model, unlike most classical control methods, and with a low requirement of transition data, a problem faced by most modern reinforcement learning algorithms under research. In this sense, our approach takes the best of both worlds, model-free approaches and data-efficient algorithms, achieving a successful performance. The data efficiency can be appreciated in terms of the learning speed achieved in the performed trials; it can be quantified by the number of interactions needed to reach an acceptable performance. Clearly, this is an advantage for real-world applications, for example with respect to other recent proposals such as deep reinforcement learning approaches.

Another important point is that the proposed methodology is robust in terms of hyper-parameter tuning, which lowers the requirements on human operators when deploying it on different robotic platforms. That is, in addition to the common setup parameters of any RL application, perhaps the most demanding requirement is to provide an approximate initial coarse discretization of the action space. Instead, in our work we use a wide, random initialization of the initial PID gains, since our algorithm has proved to be robust to the initialization, as can be seen in the results. The algorithm's robustness was demonstrated in the implementations for the different platforms.

This highlights that there is no need to fine-tune the initialization of the hyper-parameters for the different controlled systems. In real-time robotic applications this is a paramount advantage, since no intensive user knowledge and intervention is needed in the setup phase.

A potential improvement may be to extend this proposal to work also in higher layers of decision-making systems for controlling mobile robots. For example, the Double Q-PID algorithm could be used as part of autonomous trajectory tracking control systems to solve end-to-end control problems typical of mobile robots. As future work, the authors suggest extending the methodology by exploring the possibility of extending the action discretization to a continuous space, effectively breaking loose of the constraints of space discretization. Briefly, this discretization could be done sparsely and complemented by a continuous space representation obtained by combining different techniques. In addition, the estimation of the PID parameters could be extended with other types of continuous RL techniques that use, for example, neural networks as policy approximators, such as policy gradient methods. Furthermore, it should also be possible to extend the proposal to solve the navigation problem in mobile robots by providing the self-tuning of a series of cascaded PID controllers. Nonetheless, the proposed agent-based expert system formulation is general, in such a way that it could be implemented on other real platforms, such as autonomous underwater vehicles, quadcopters or drones, and others. Also, the proposal could be used in problems from different domains, to name a few, such as the decision-making system for maximum power point tracking of photovoltaic sources, wind energy conversion systems, electric engine control systems and others.

Finally, we have to mention that the stability of our proposed Double Q-PID algorithm was demonstrated through exhaustive experimentation. However, a theoretical and analytic stability analysis (as in classical control formulations, for example by means of the Routh-Hurwitz stability criterion or Lyapunov analysis) of model-free RL-based algorithms like our proposal is still an open issue (Busoniu, de Bruin, Toli, Kober, & Palunko, 2018). Therefore the stability of the overall Double Q-PID algorithm poses an even greater challenge for future research works.

Disclosure of conflicts of interest

The authors have no conflicts of interest to declare.

CRediT authorship contribution statement

Ignacio Carlucho: Conceptualization, Data curation, Formal analysis, Writing - original draft. Mariano De Paula: Conceptualization, Data curation, Formal analysis, Writing - original draft. Gerardo G. Acosta: Conceptualization, Formal analysis, Writing - original draft.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. In particular, we would like to thank the National Research and Technology Council of Argentina (CONICET) for the economic support of Eng. Ignacio Carlucho with a Ph.D. fellowship.

References

Abu Rmilah, M. H. Y., Hassan, M. A., & Bin Mardi, N. A. (2016). A PC-based simulation platform for a quadcopter system with self-tuning fuzzy PID controllers. Computer Applications in Engineering Education, 24(6), 934–950. doi:10.1002/cae.21769.
Acosta, G. G., Mayosky, M. A., & Catalfo, J. M. (1994). An expert PID controller uses refined Ziegler and Nichols rules and fuzzy logic ideas. Applied Intelligence, 4(1), 53–66. doi:10.1007/BF00872055.
Ahmed, S. F., Kushsairy, K., Bakar, M. I. A., Hazry, D., & Joyo, M. K. (2015). Attitude stabilization of quad-rotor (UAV) system using fuzzy PID controller (an experimental test). In Proceedings of the second international conference on computing technology and information management (ICCTIM) (pp. 99–104). IEEE. doi:10.1109/ICCTIM.2015.7224600.
Alejo, D., Cobano, J. A., Heredia, G., & Ollero, A. (2016). A reactive method for collision avoidance in industrial environments. Journal of Intelligent & Robotic Systems, 84(1), 745–758. doi:10.1007/s10846-016-0359-7.
Antonelli, G. (2007). On the use of adaptive/integral actions for six-degrees-of-freedom control of autonomous underwater vehicles. IEEE Journal of Oceanic Engineering, 32(2), 300–312. doi:10.1109/JOE.2007.893685.
Antonelli, G., Caccavale, F., Chiaverini, S., & Fusco, G. (2003). A novel adaptive control law for underwater vehicles. IEEE Transactions on Control Systems Technology, 11(2), 221–232. doi:10.1109/TCST.2003.809244.
Antonelli, G., Chiaverini, S., Sarkar, N., & West, M. (2001). Adaptive control of an autonomous underwater vehicle: experimental results on ODIN. IEEE Transactions on Control Systems Technology, 9(5), 756–765. doi:10.1109/87.944470.
Arokiasami, W. A., Vadakkepa, P., Tan, K. C., & Srinivasan, D. (2016). Vector directed path generation and tracking for autonomous unmanned aerial/ground vehicles. In Proceedings of the IEEE congress on evolutionary computation (CEC) (pp. 1375–1381). doi:10.1109/CEC.2016.7743949.
Bansal, H. O., Sharma, R., & Shreeraman, P. R. (2012). PID controller tuning techniques: A review. Journal of Control Engineering and Technology, 2(4), 168–176.
Barbalata, C., De Carolis, V., Dunnigan, M. W., Petillot, Y., & Lane, D. (2015). An adaptive controller for autonomous underwater vehicles. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1658–1663). Hamburg: IEEE. doi:10.1109/IROS.2015.7353590.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11), 2471–2482. doi:10.1016/j.automatica.2009.07.008.
Buçoniu, L., Babuska, R., Schutter, B. D., & Ernst, D. (2010). Reinforcement learning and dynamic programming using function approximators (1st ed.). CRC Press.
Busoniu, L., de Bruin, T., Toli, D., Kober, J., & Palunko, I. (2018). Reinforcement learning for control: Performance, stability, and deep approximators. Annual Reviews in Control, 46, 8–28. doi:10.1016/j.arcontrol.2018.09.005.
Calvo-Rolle, J. L., Casteleiro-Roca, J. L., Quintián, H., & Del Carmen Meizoso-Lopez, M. (2013). A hybrid intelligent system for PID controller using in a steel rolling process. Expert Systems with Applications, 40(13), 5188–5196. doi:10.1016/j.eswa.2013.03.013.
Carlucho, I., De Paula, M., Villar, S. A., & Acosta, G. G. (2017). Incremental Q-learning strategy for adaptive PID control of mobile robots. Expert Systems with Applications, 80, 183–199. doi:10.1016/j.eswa.2017.03.002.
Chang, H., & Jin, T. (2013). Adaptive tracking controller based on the PID for mobile robot path tracking. In J. Lee, M. C. Lee, H. Liu, & J.-H. Ryu (Eds.), Intelligent robotics and applications (pp. 540–549). Berlin, Heidelberg: Springer Berlin Heidelberg.
Chen, S., & Billings, S. A. (1989). Representations of non-linear systems: the NARMAX model. International Journal of Control, 49(3), 1013–1032. doi:10.1080/00207178908559683.
Cohen, G., & Coon, G. (1953). Theoretical consideration of retarded control. Transactions of ASME, 75(1), 827–834.
Cohen, K., Bossert, D. E., & Col, L. (2002). PID and fuzzy logic pitch attitude hold systems for a fighter jet. In Proceedings of the AIAA guidance, navigation, and control conference and exhibit, Monterey, California. doi:10.2514/6.2002-4646.
Dong, E., Guo, S., Lin, X., Li, X., & Wang, Y. (2012). A neural network-based self-tuning PID controller of an autonomous underwater vehicle. In Proceedings of the IEEE international conference on mechatronics and automation (pp. 898–903). IEEE. doi:10.1109/ICMA.2012.6283262.
Dyda, A. A., & Os'kin, D. A. (2004). Neural network control system for underwater robots. IFAC Proceedings Volumes, 37(10), 427–432. doi:10.1016/S1474-6670(17)31769-X.
Fatan, M., Sefidgari, B. L., & Barenji, A. V. (2013). An adaptive neuro PID for controlling the altitude of quadcopter robot. In Proceedings of the 18th international conference on methods & models in automation & robotics (MMAR) (pp. 662–665). IEEE. doi:10.1109/MMAR.2013.6669989.
Fossen, T. I. (2002). Marine control systems: guidance, navigation and control of ships, rigs and underwater vehicles. Marine Cybernetics.
Geder, J. D., Palmisano, J., Ramamurti, R., Sandberg, W. C., & Ratna, B. (2008). Fuzzy logic PID based control design and performance for a pectoral fin propelled unmanned underwater vehicle. In Proceedings of the international conference on control, automation and systems (pp. 40–46). IEEE. doi:10.1109/ICCAS.2008.4694526.
Gloye, A., Göktekin, C., Egorova, A., Tenchio, O., & Rojas, R. (2005). Learning to drive and simulate autonomous mobile robots. In Proceedings of the RoboCup (pp. 160–171). Springer, Berlin, Heidelberg. Lecture notes in computer science.
Goumagias, N. D., Hristu-Varsakelis, D., & Assael, Y. M. (2018). Using deep Q-learning to understand the tax evasion behavior of risk-averse firms. Expert Systems with Applications, 101, 258–270. doi:10.1016/j.eswa.2018.01.039.
Hägglund, T., & Åström, K. J. (1991). Industrial adaptive controllers based on frequency response techniques. Automatica, 27(4), 599–609. doi:10.1016/0005-1098(91)90052-4.
el Hakim, A., Hindersah, H., & Rijanto, E. (2013). Application of reinforcement learning on self-tuning PID controller for soccer robot multi-agent system. In Proceedings of the joint international conference on rural information & communication technology and electric-vehicle technology (RICT & ICEV-T) (pp. 1–6). IEEE. doi:10.1109/rICT-ICeVT.2013.6741546.

Hasselt, H. v. (2010). Double Q-learning. In Proceedings of the 23rd international conference on neural information processing systems - volume 2, NIPS'10 (pp. 2613–2621). USA: Curran Associates Inc.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Prentice Hall.
Hernández-Alvarado, R., García-Valdovinos, L. G., Salgado-Jiménez, T., Gómez-Espinosa, A., & Fonseca-Navarro, F. (2016). Neural network-based self-tuning PID control for underwater vehicles. Sensors (Basel, Switzerland), 16(9). doi:10.3390/s16091429.
Khodayari, M. H., & Balochian, S. (2015). Modeling and control of autonomous underwater vehicle (AUV) in heading and depth attitude via self-adaptive fuzzy PID controller. Journal of Marine Science and Technology (Japan), 20(3), 559–578. doi:10.1007/s00773-015-0312-7.
Kuantama, E., Vesselenyi, T., Dzitac, S., & Tarca, R. (2017). PID and fuzzy-PID control model for quadcopter attitude with disturbance parameter. International Journal of Computers Communications & Control, 12(4), 519. doi:10.15837/ijccc.2017.4.2962.
Kuo, B. C. (1995). Automatic control systems. Prentice Hall.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. doi:10.1038/nature14539. arXiv:1312.6184v5.
Li, D. L., He, Z. S., & Yin, Y. F. (2012). Trajectory tracking control of mobile robot based on self-adaptive PID combined with preview and fuzzy logic. Advanced Materials Research, 590, 268–271. doi:10.4028/www.scientific.net/AMR.590.268.
Liu, L., Pan, F., & Xue, D. (2015). Variable-order fuzzy fractional PID controller. ISA Transactions, 55, 227–233. doi:10.1016/j.isatra.2014.09.012.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. doi:10.1038/nature14236. arXiv:1312.5602.
Monahan, G. E. (1982). State of the art—A survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1), 1–16. doi:10.1287/mnsc.28.1.1.
Myungsoo Jun, & Safonov, M. (1999). Automatic PID tuning: an application of unfalsified control. In Proceedings of the IEEE international symposium on computer aided control system design (pp. 328–333). IEEE. doi:10.1109/CACSD.1999.808669.
Neumann, G., & Peters, J. R. (2008). Fitted Q-iteration by advantage weighted regression. In Advances in neural information processing systems 21 (pp. 1177–1184).
Ogata, K. (2010). Modern control engineering. Prentice-Hall.
Önder Efe, M. (2011). Neural network assisted computationally simple PIλDμ control of a quadrotor UAV. IEEE Transactions on Industrial Informatics, 7(2). doi:10.1109/TII.2011.2123906.
Pan, C.-Z., Lai, X.-Z., Yang, S. X., & Wu, M. (2013). An efficient neural network approach to tracking control of an autonomous surface vehicle with unknown dynamics. Expert Systems with Applications, 40(5), 1629–1635. doi:10.1016/j.eswa.2012.09.008.
Pan, J., Wang, X., Cheng, Y., & Yu, Q. (2018). Multisource transfer double DQN based on actor learning. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2227–2238. doi:10.1109/TNNLS.2018.2806087.
Pan, Y., Zaheer, M., White, A., Patterson, A., & White, M. (2018). Organizing experience: A deeper look at replay mechanisms for sample-based planning in continuous state domains. Technical Report. arXiv:1806.04624v1.
Pessen, D. W. (1994). A new look at PID-controller tuning. Journal of Dynamic Systems, Measurement, and Control, 116(3), 553. doi:10.1115/1.2899252.
Petit, A., Paulo, C., Carlucho, I., Menna, B., & Paula, M. D. (2016). Prediction of the hydrodynamic coefficients of an autonomous underwater vehicle. In Proceedings of the 3rd IEEE/OES South American international symposium on oceanic engineering (SAISOE) (pp. 1–6). doi:10.1109/SAISOE.2016.7922474.
Åström, K. J., & Hägglund, T. (2004). Revisiting the Ziegler–Nichols step response method for PID control. Journal of Process Control, 14(6), 635–650. doi:10.1016/j.jprocont.2004.01.002.
Åström, K. J., & Hägglund, T. (2006). Advanced PID control. ISA - The Instrumentation, Systems and Automation Society.
Rossomando, F. G., & Soria, C. M. (2015). Identification and control of nonlinear dynamics of a mobile robot in discrete time using an adaptive technique based on neural PID. Neural Computing and Applications, 26(5), 1179–1191. doi:10.1007/s00521-014-1805-8.
Rout, R., & Subudhi, B. (2017). Inverse optimal self-tuning PID control design for an autonomous underwater vehicle. International Journal of Systems Science, 48(2), 367–375. doi:10.1080/00207721.2016.1186238.
Safonov, M., & Tung-Ching Tsao (1997). The unfalsified control concept and learning. IEEE Transactions on Automatic Control, 42(6), 843–847. doi:10.1109/9.587340.
Salhi, K., & Alimi, A. M. (2013). Fuzzy-PID hybrid controller for mobile robot using point cloud and low cost depth sensor. In Proceedings of the international conference on individual and collective behaviors in robotics (ICBR) (pp. 92–97). IEEE. doi:10.1109/ICBR.2013.6729280.
van Seijen, H., & Sutton, R. (2015). A deeper look at planning as learning from replay. In Proceedings of the 2nd multidisciplinary conference on reinforcement learning and decision making.
Shah, P., & Agashe, S. (2016). Review of fractional PID controller. Mechatronics, 38, 29–41. doi:10.1016/j.mechatronics.2016.06.005.
Sharma, R., Rana, K., & Kumar, V. (2014). Performance analysis of fractional order fuzzy PID controllers applied to a robotic manipulator. Expert Systems with Applications, 41(9), 4274–4289. doi:10.1016/j.eswa.2013.12.030.
Sheikhlar, A., Fakharian, A., & Adhami-Mirhosseini, A. (2013). Fuzzy adaptive PI control of omni-directional mobile robot. In Proceedings of the 13th Iranian conference on fuzzy systems (IFSC) (pp. 1–4). IEEE. doi:10.1109/IFSC.2013.6675667.
Singh, A., Bisht, G., & Padhy, P. K. (2013). Neural network based adaptive non linear PID controller for non-holonomic mobile robot. In Proceedings of the international conference on control, automation, robotics and embedded systems (CARE) (pp. 1–6). IEEE. doi:10.1109/CARE.2013.6733732.
Sood, V. (2011). Autonomous robot motion control using fuzzy PID controller. In High performance architecture and grid computing (pp. 385–390). Springer Berlin Heidelberg.
Sood, V., & Kumar, P. (2014). Adaptive fuzzy controller for controlling the dynamics of robot motion. International Journal of Fuzzy Computation and Modelling, 1(1), 37. doi:10.1504/IJFCM.2014.064230.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Takači, D., Takači, A., & Takači, A. (2014). On the operational solutions of fuzzy fractional differential equations. Fractional Calculus and Applied Analysis, 17(4). doi:10.2478/s13540-014-0216-y.
Tandan, N., & Swarnkar, K. K. (2015). PID controller optimization by soft computing techniques–A review. International Journal of Hybrid Information Technology, 8(7), 357–362.
Timmer, S., & Riedmiller, M. (2007). Fitted Q iteration with CMACs. In Proceedings of the IEEE international symposium on approximate dynamic programming and reinforcement learning (pp. 1–8). IEEE. doi:10.1109/ADPRL.2007.368162.
Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI conference on artificial intelligence, AAAI (pp. 2094–2100). arXiv:1712.01275.
Wang, Q.-G., Fung, H.-W., & Zhang, Y. (1999). PID tuning with exact gain and phase margins. ISA Transactions, 38(3), 243–249. doi:10.1016/S0019-0578(99)00020-8.
Watkins, C. J., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3/4), 279–292. doi:10.1023/A:1022676722315.
Wei, W., & Cohen, K. (2015). Development of a model based fuzzy-PID controller for the AeroQuad Cyclone quad-copter. In Proceedings of the AIAA Infotech @ Aerospace. Reston, Virginia: American Institute of Aeronautics and Astronautics. doi:10.2514/6.2015-2029.
Ye, J. (2008). Adaptive control of nonlinear PID-based analog neural networks for a nonholonomic mobile robot. Neurocomputing, 71(7), 1561–1565. doi:10.1016/j.neucom.2007.04.014.
Zadeh, L. A. (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(1), 28–44. doi:10.1109/TSMC.1973.5408575.
Zhang, S., & Sutton, R. S. (2017). A deeper look at experience replay. In Proceedings of the NIPS deep reinforcement learning symposium.
Zhao, J., Yi, W., Peng, Y., & Peng, X. (2017). Design and simulation of a self-adaptive fuzzy-PID controller for an autonomous underwater vehicle (pp. 867–878). Cham: Springer.
Zhuang, M., & Atherton, D. (1993). Automatic tuning of optimum PID controllers. IEE Proceedings D Control Theory and Applications, 140(3), 216. doi:10.1049/ip-d.1993.0030.
Ziegler, J. G., & Nichols, N. B. (1942). Optimum settings for automatic controllers. Transactions of the ASME, 64, 759–768.
Ziegler, J. G., & Nichols, N. B. (1993). Optimum settings for automatic controllers. Journal of Dynamic Systems, Measurement, and Control, 115(2B), 220. doi:10.1115/1.2899060.
Zhang, Z., Zhiyuan, P., & Kochenderfer, M. J. (2017). Weighted double Q-learning. In Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI (pp. 3455–3461). doi:10.24963/ijcai.2017/483.
