
Accepted Manuscript

Data-Driven Model-Free Slip Control of Anti-lock Braking Systems Using Reinforcement Q-Learning

Mircea-Bogdan Radac, Radu-Emil Precup

PII: S0925-2312(17)31457-1
DOI: 10.1016/j.neucom.2017.08.036
Reference: NEUCOM 18828

To appear in: Neurocomputing

Received date: 20 October 2016


Revised date: 27 March 2017
Accepted date: 22 August 2017

Please cite this article as: Mircea-Bogdan Radac, Radu-Emil Precup, Data-Driven Model-Free Slip Control of Anti-lock Braking Systems Using Reinforcement Q-Learning, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.08.036

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Highlights
• Model-free slip control using reinforcement Q-learning is proposed.
• Neural network controller tuning strategies with Q-learning are presented.
• Comparisons with model-free relay and model-based PI control are offered.
• A comparison with Approximate Dynamic Programming control is offered.
• Controller learning scenarios with/without supervisory control are presented.

Graphical abstract
Braking results for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue
lines) and OPI controller (red lines): a) normalized upper wheel speed; b) normalized lower wheel speed; c)
controlled slip and reference (black dotted); d) control signals and actuator dead zone activation threshold (black
dotted).

Data-Driven Model-Free Slip Control of Anti-lock Braking Systems Using Reinforcement Q-Learning

Mircea-Bogdan Radac a, Radu-Emil Precup a,b,*

a Department of Automation and Applied Informatics, Politehnica University of Timisoara, Bd. V. Parvan 2, 300223 Timisoara, Romania
b School of Engineering, Edith Cowan University, 270 Joondalup Dr., Joondalup, WA 6027, Australia

Abstract

This paper proposes the design and implementation of a model-free tire slip control for a fast and highly nonlinear Anti-lock Braking System (ABS). A reinforcement Q-learning optimal control approach is inserted in a batch neural fitted scheme using two neural networks to approximate the value function and the controller, respectively. The transition samples required for learning high performance control can be collected by interacting with the process either by online exploiting the current iteration controller (or policy) under an ε-greedy exploration strategy, or by using data collected under any other controller that is capable of ensuring efficient exploration of the state-action space. Both approaches are highlighted in the paper. Fortunately, the ABS process fits this type of learning-by-interaction because it does not need an initial stabilizing controller. The validation case studies conducted on a real laboratory setup reveal that high control system performance can be achieved using the proposed approaches. Insightful comments on the observed control behavior are offered along with performance comparisons with several types of model-based and model-free controllers, including relay, model-based optimal PI, an original model-free neural network state-feedback VRFT controller and a model-free neural network adaptive actor-critic one. With the ability to improve control performance starting from different supervisory controllers or to learn high performance controllers from scratch, the proposed Q-learning optimal control approach proves its performance in a wide operating range and is therefore recommended for industrial application on ABS.

Keywords: Anti-lock Braking System, model-free control, neural networks, Q-learning.

* Corresponding author. Tel.: +402564032-29, -30, -26; fax: +40256403214.
E-mail addresses: mircea.radac@upt.ro (M.-B. Radac), radu.precup@upt.ro (R.-E. Precup).



1. Introduction
The Anti-lock Braking System (ABS) plays a major role among the safety subsystems of modern cars, as it prevents wheel locking during braking via direct tire slip control [1]. It has been addressed by a diversity of model-based control approaches such as linear [2]–[4] or nonlinear control [5], [6], fuzzy control [7], [8], sliding mode control [9], switching control [10] and neural network control [11].
In model-based control approaches, and especially for the fast and highly nonlinear ABS, the mismatch between the model used for controller design and the real process leads to serious performance degradation due to un-modeled dynamics and parametric uncertainty. The interest in straightforward model-free control design is therefore justified. Model-free control techniques use (usually input-output) data for control design, hence they are also labeled data-driven. Several such techniques emerging from classical control theory are Virtual Reference Feedback Tuning (VRFT) [12], Iterative Feedback Tuning (IFT) [13], Simultaneous Perturbation Stochastic Approximation (SPSA) [14], Model-Free Iterative Learning Control (MFILC) [15], [16], and Model-Free Adaptive Control [17], [18]. These techniques usually exploit the process structure for designing performant control rather than using identified process models. Having different underlying paradigms (iterative/non-iterative/adaptive), their performance depends on the mismatch between the assumed structure and the reality. For most of these techniques, it is difficult to prove stability of the learned control, owing to the lack of explicit process models.

Bordering the fields of machine learning and control, Reinforcement Learning (RL) [19] and Adaptive (Approximate) Dynamic Programming (ADP) [20] have established themselves as representative data-driven techniques, some of their implementations being also regarded as model/modeling-free. Q-learning [21] is a model-free RL approach and it can be used to solve optimal control problems in environments where analytical solutions are intractable, covering linear/nonlinear, discrete/continuous time and deterministic/stochastic systems. RL is known in the control community as ADP, whereas Q-learning used with function approximators (FAs) such as Neural Networks (NNs) is equivalent to Action Dependent Heuristic Dynamic Programming (ADHDP) [22]–[26] when used without a process model. Q-learning needs only the transition samples in the state-action space collected by interacting with the unknown process. In this regard, Q-learning with FAs is preferable over many of the control solutions proposed in the ADP field, where at least knowledge of the system's input function is usually needed (e.g., in a nonlinear input-affine state-space representation) [22]–[26]. The downside of not having a process model (or a part of that model) is that the convergence of Q-learning used with FAs is not ensured straightforwardly; it depends strongly on the FAs used and on good exploration of the state-action space (sometimes implying large data sets required for learning), which is not an easy task.
Most of the RL control results assume a fully observable process state, while partial state observation leads to Partially Observed Markov Decision Processes (POMDPs). Process state knowledge requires more insight than the pure input-output representation used by other classical control model-free Policy Iteration (PI) approaches such as VRFT [27], [28], IFT [29], SPSA [30] and MFILC [15], [16], [30]. State knowledge, although more expensive, is expected to lead to increased control system (CS) performance via the resulting nonlinear state-feedback controllers. When the state is not measurable but observable, completely model-free RL-like solutions using only input-output data-based state observers can still be found for systems having exploitable structure such as the Linear Time-Invariant (LTI) one, under data-based observability and controllability assumptions [31]–[34].
In the context of ADP [35] and batch RL [36] applied to automotive control systems, a recent slip control via ADP has been proposed in [11], but the Value Iteration (VI) algorithm used there with FAs relies on the state transition function and is thus not model-free, whereas in this work no transition function (or process model) is used for learning control. To the best of the authors' knowledge, this is the first time in the literature that model-free Q-learning is applied to ABS control.
The paper is organized as follows: the mathematical model of the ABS controlled process is presented in Section 2, together with the control problem formulation and a model transformation serving as justification for Q-learning control. A novel batch fitted Q-learning algorithm called Neural Fitted Q-Learning with Continuous Discrete Actions (NFQCDA) is defined and discussed in Section 3, where it is adapted to the slip regulation control problem under two learning approaches. The case studies given in Section 4 describe the application of the two NFQCDA learning approaches to the slip control problem to prove their effectiveness, offering implementation details, comparisons with several model-based and model-free control strategies and insightful comments. Section 5 concludes the paper.

2. ABS dynamics and control problem

The laboratory equipment consists of two wheels, shown in Fig. 1. The lower wheel emulates the car and is accelerated to gain speed. Reaching the speed threshold value causes the upper wheel to initiate the braking sequence. Direct Current (DC) motors are employed to drive the acceleration of the lower wheel and the brake on the upper wheel, the motors being controlled using the Pulse Width Modulation (PWM) technique.

Fig. 1. The laboratory equipment setup of the wheels [37].



The braking system contains a cable wound on the upper wheel motor's shaft that acts upon a hand-brake-like lever mechanism, which presses the plates of the braking system against the upper wheel's braking disk, thus slowing both wheels. The movement of the lever requires positioning of a mechanical moving part (i.e., the motor shaft), which introduces a pseudo-variable dead time with each different set point. This is due to the elastic behavior of the cable during its winding and negatively affects the controller design and subsequently the control system performance. This actuator is thus technically inferior to the usual electro-hydraulic actuators, making it difficult to obtain high performance slip control.
The state-space nonlinear dynamics of the controlled process as part of the ABS laboratory equipment, obtained from first principles modeling, are [37]

$\dot{x}_1 = (F_n r_1 \mu(\lambda) - d_1 x_1 - M_{10} - M_1)/J_1,$
$\dot{x}_2 = (-F_n r_2 \mu(\lambda) - d_2 x_2 - M_{20})/J_2,$
$\dot{M}_1 = 20.83\,(b(u) - M_1), \qquad (1)$
$b(u) = \begin{cases} 15.24\,u - 6.21, & \text{if } u \ge 0.415, \\ 0, & \text{if } u < 0.415, \end{cases}$

where $r_1$ and $r_2$ are the radii of the wheels, $F_n$ is the normal force with which the upper wheel pushes upon the lower wheel, $\mu(\lambda)$ is the friction coefficient, the activation threshold of the actuator dead zone is 0.415, and $\lambda \in [0,1]$ is the slip calculated as

$\lambda = \min\{\max[(r_2 x_2 - r_1 x_1)/(r_2 x_2 + 0.001), 0], 1\}, \qquad (2)$

where a small constant is added to the denominator in order to avoid division by zero, $x_1$ and $x_2$ are the upper and lower wheels' angular speeds, respectively, and $u\,[\%] \in [0,1]$ is the control signal interpreted as the PWM duty cycle. Further process identification leads to the parameters $L$ – the length of the arm whose right end supports the upper wheel and can rotate around point A, and $\varphi$ – the angle between the normal direction in the contact point of the wheels and the direction of $L$. $F_n$ is not usually constant during braking, since it depends on the bumping effect occurring with violent braking commands, but here it is used as the approximately constant expression $F_n = [\operatorname{atanh}(x_1) M_1 + M_{10} + M_g + d_1 x_1]/[L \sin\varphi - \mu(\lambda) L \cos\varphi]$. The identified process parameters are $\varphi = 65.61^{\circ}\ (1.14\ \mathrm{rad})$, $L = 0.37\ \mathrm{m}$, $r_1 = r_2 = 0.099\ \mathrm{m}$, $J_1 = 7.53\cdot10^{-3}\ \mathrm{kg\,m^2}$, $J_2 = 25.6\cdot10^{-3}\ \mathrm{kg\,m^2}$, $d_1 = 1.1874\cdot10^{-4}\ \mathrm{kg\,m^2/s}$, $d_2 = 2.1468\cdot10^{-4}\ \mathrm{kg\,m^2/s}$, $M_{10} = 0.0032\ \mathrm{N\,m}$, $M_{20} = 0.0925\ \mathrm{N\,m}$, $M_g = 19.61\ \mathrm{N\,m}$.

The low-quality braking system is reflected in a nonlinear dead zone in the actuator dynamics, which together with the first two nonlinear state equations in (1) forms a deterministic nonlinear dynamic system.
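For concreteness, the slip computation (2) and the actuator dead zone $b(u)$ in (1) translate directly into code; the short sketch below is only illustrative (the function names are chosen here, not taken from the paper) and assumes the wheel speeds are given in rad/s.

```python
def slip(x1, x2, r1=0.099, r2=0.099):
    """Tire slip per (2), saturated to [0, 1]; 0.001 in the denominator avoids division by zero."""
    lam = (r2 * x2 - r1 * x1) / (r2 * x2 + 0.001)
    return min(max(lam, 0.0), 1.0)


def dead_zone(u):
    """Actuator nonlinearity b(u) from (1): no braking torque is built below the 0.415 duty-cycle threshold."""
    return 15.24 * u - 6.21 if u >= 0.415 else 0.0
```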
The experimentally obtained static slip-friction curve $\mu(\lambda)$ of Fig. 2 is approximated by

$\mu(\lambda) = w_4 \lambda^{p}/(a + \lambda^{p}) + w_3 \lambda^{3} + w_2 \lambda^{2} + w_1 \lambda, \qquad (3)$

with $a = 25.72\cdot10^{-5}$, $p = 2.099$, $w_1 = 42.4\cdot10^{-3}$, $w_2 = 29.37\cdot10^{-11}$, $w_3 = 35.08\cdot10^{-3}$, $w_4 = 29.37\cdot10^{-11}$. Iron (lower wheel) and plastic (upper wheel) surfaces are forced into contact. However, the slip-friction characteristic depends on the surfaces coming into contact, which in real-world applications correspond to tires coming into contact with different surfaces such as dry/wet asphalt, ice, mud, gravel, etc. The calculation of the slip $\lambda$ requires the vehicle speed and the speed of the braking wheel to be available. Both $x_1$ and $x_2$ are obtained by means of simple observers that numerically differentiate the positions provided by optical encoders. In real vehicles, the car speed may be obtained by different sensor fusion algorithms in order to estimate the slip [1].

Fig. 2. Slip-friction curve [2], [37].



The slip control problem to be solved in this paper in an optimal control framework amounts to regulating the slip to a desired setpoint. This paper focuses on solving this problem without using the mathematical process model (1), i.e., in a model-free setting, using a Q-learning scheme.

2.1. From POMDP to fully observable MDP

Denoting the state vector by $\mathbf{x} = [x_1\ x_2\ M_1]^T$ and using, e.g., forward Euler discretization (FED) leads to the deterministic discrete-time MDP form of (1), which will be needed for data collection when designing digital data-driven control ($k$ denotes the discrete time index)

$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k, u_k), \qquad (4)$

where $\mathbf{f}: \mathbb{R}^4 \to \mathbb{R}^3$ is a nonlinear function.

To achieve nonlinear state-feedback control via Q-learning, the MDP (1) should usually be fully state-observable, but the equipment does not have a sensor for the hidden state braking torque $M_1$; therefore it is only a POMDP. It is customary to use delayed past samples of the measurable states (or outputs) and of the control signal to estimate non-measurable states and to cope with time delays. This approach in turn requires observability assumptions about the process, which are difficult if not impossible to assess for general unknown nonlinear processes. In the linear state-space case the approach leads to input-output data-based observers [31]–[34], whereas in the nonlinear case the approach has also been used in [39], [41]. The observability of nonlinear systems is also tightly connected to the concepts of accessibility and flatness [42], where the internal system states are expressed in terms of only high-order derivatives of the system's outputs and inputs. Next, a transformation of the POMDP (1) into a higher order fully observable MDP is described.

The elimination of $F_n \mu(\lambda)$ between the first two equations in (1) and solving for $M_1$ in terms of $x_1$, $x_2$ and their derivatives leads to

$M_1 = -\frac{r_1}{r_2}\,(J_2 \dot{x}_2 + d_2 x_2 + M_{20}) - J_1 \dot{x}_1 - d_1 x_1 - M_{10}. \qquad (5)$

It has also been shown in [1], [7], [37] that (1) is equivalent to the state-space form

$\dot{x}_1 = f_1(M_1, x_1),$
$\dot{x}_2 = f_2(M_1, x_1, x_2), \qquad (6)$
$\dot{M}_1 = f_3(M_1, u),$

with $f_1$, $f_2$, $f_3$ nonlinear functions in their arguments.



Applying FED to (6) and (5) altogether leads to

$x_{1,k+1} = F_1(M_{1,k}, x_{1,k}),$
$x_{2,k+1} = F_2(M_{1,k}, x_{1,k}, x_{2,k}),$
$M_{1,k+1} = F_3(M_{1,k}, u_k), \qquad (7)$
$M_{1,k} = G(x_{1,k+1}, x_{1,k}, x_{2,k+1}, x_{2,k}),$

where $F_1$, $F_2$, $F_3$ corresponding to $f_1$, $f_2$, $f_3$ in (6) and $G$ corresponding to (5) are nonlinear functions depending on their arguments and on the sampling period. Substituting $M_{1,k}$ from the last equation of (7) into the third one, delayed one step behind, results in the relationship $M_{1,k} = F_3(x_{1,k}, x_{1,k-1}, x_{2,k}, x_{2,k-1}, u_{k-1})$, which substituted in the first two relations in (7) produces

x1,k 1  F1 ( x1,k , x1,k 1 , x2,k , x2,k 1 , u k 1 ),


(8)
x2,k 1  F2 ( x1,k , x1,k 1 , x2,k , x2,k 1 , u k 1 ).

x 2,k  x 2,k 1 and u~k  u k 1 , the two equations in (8) result in the final
x1,k  x1,k 1 , ~
Using the notations ~

extended state-space system


 x1,k 1   F1 ( x1,k , ~
x1,k , x2,k , ~
x2,k , u~k ) 
   ~ ~ ~ 
 x2,k 1   F2 ( x1,k , x1,k , x2,k , x2,k , uk )  (9)
x e,k 1   ~
x  x1,k   F(x , u ),
 1,k 1    e ,k k
~
 2,k 1  
x x 
2 ,k
 u~   u 
 k 1   k 

which is a discrete-time deterministic MDP with full state measurement. Therefore, the use of delayed measured states (or outputs) and inputs to deal with integer delays in both states and control, and with non-measurable but observable states, is fully justified according to [31], [34], [39], [41]. The model-free approach suggested in this paper is applicable when a process model is not available for controllability and observability analysis and the order of the system cannot be proven; the designer can simply build an extended state vector of delayed measurable outputs and control signals and then try to learn the control.
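To make the construction of the extended state (9) concrete, the sketch below shows one way of assembling it online from measured wheel speeds and the previously applied control signal. It is a minimal illustration under the assumptions of this section (one-step delays; the signals can be normalized as discussed in the following sub-section), with class and method names chosen here for readability rather than taken from the paper.

```python
class ExtendedState:
    """Maintains x_e,k = [x1_k, x2_k, x1_{k-1}, x2_{k-1}, u_{k-1}] as defined in (9)."""

    def __init__(self):
        self.x1_prev = self.x2_prev = self.u_prev = 0.0

    def observe(self, x1, x2):
        """Call at sample k with the measured wheel speeds; returns the extended state x_e,k."""
        x_e = [x1, x2, self.x1_prev, self.x2_prev, self.u_prev]
        self.x1_prev, self.x2_prev = x1, x2
        return x_e

    def applied(self, u):
        """Call once the control u_k has been chosen, so that it becomes u_{k-1} at the next sample."""
        self.u_prev = u
```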

2.2. Further preparing the model for data collection


Since the extended state vector in (9) will be used as an input to the neural network-based Q-learning scheme, it is useful for training purposes to normalize the collected data. The control signal is known to be bounded inside $[0,1]$ because of its physical interpretation (the duty cycle of the PWM driving the DC motor). In the acceleration phase before braking, the upper threshold value of both the car angular speed and the car wheel angular speed (before braking, throughout the accelerating phase, their speeds are considered equal) is 180 rad/s; therefore both $x_1, x_2 \in [0,180]$ rad/s are normalized inside $[0,1]$ to be used in the recording phase. Scaling the state variables and/or the control signal does not alter the MDP property of the system.

3. Neural batch fitted Q-learning



Starting with a discrete-time nonlinear deterministic and unknown state-space system of the form (9), $\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k, u_k)$ (the extended-state subscript is omitted for notation simplicity), where $\mathbf{F}$ is the unknown system function, an optimal control problem can be formulated so as to minimize the following infinite-horizon cost function (c.f.) $J(\mathbf{x}_i)$ starting with $\mathbf{x}_i$

$J(\mathbf{x}_i) = \sum_{k=i}^{\infty} \gamma^{k-i} U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}), \qquad (10)$

with the starting time index $i$; the discount factor $0 < \gamma \le 1$ ensures the convergence of $J(\mathbf{x}_i)$, and the utility function $U$ depends on the transition from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ under action $u_k$. The control signals $u_k, u_{k+1}, \ldots$ are the unknowns that should minimize $J(\mathbf{x}_i)$. If the optimal c.f. value starting with the time index $k+1$ is known as $J^*(\mathbf{x}_{k+1})$, then the optimal control $u_k^*$ fulfills Bellman's optimality equation [43]

$u_k^* = \arg\min_{u_k} \left( U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma J^*(\mathbf{x}_{k+1}) \right). \qquad (11)$

With available system dynamics, for a finite-time-horizon version of the c.f., numerical dynamic programming solutions can be employed backwards in time. For infinite-horizon c.f.s, PI and VI algorithms can be used, but all aforementioned solutions are computationally feasible only with discrete state and action spaces of moderate size, an issue commonly known as the "curse of dimensionality". For large or continuous state and action spaces, FAs such as NNs are generally used.
If the system's dynamics is unknown, the minimization of (10) becomes an RL problem. A more informative c.f. evaluated for each state-action pair, known as the Q-function, is defined and solved for; its optimal value also fulfills Bellman's optimality equation

$Q^*(\mathbf{x}_k, u_k^*) = \min_{u_k} \left( U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma Q^*(\mathbf{x}_{k+1}, u_{k+1}^*) \right). \qquad (12)$

Then $J^*(\mathbf{x}_k) = Q^*(\mathbf{x}_k, u_k^*)$. Q-learning solves for $Q^*(\mathbf{x}_k, u_k)$ in (12) forward in time, through a typically VI-type algorithm that updates the value of a state-action pair after each observation of a transition under a mixed exploration/exploitation control law (or policy)

$Q_{k+1}(\mathbf{x}_k, u_k) = (1-\alpha) Q_k(\mathbf{x}_k, u_k) + \alpha \left( U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma \min_{\mathbf{u} \in \Omega} Q_k(\mathbf{x}_{k+1}, \mathbf{u}) \right), \qquad (13)$

where $0 < \alpha \le 1$ is the learning rate parameter and $\Omega$ is the continuous/discrete action space over which the minimization of $Q_k(\mathbf{x}_{k+1}, \mathbf{u})$ is carried out. Q-learning can be used to learn both episodic and non-episodic tasks. In its simplest implementation, for discrete state and action spaces, a table is used to represent the Q-function. The formulation is also valid in the stochastic case. Q-learning is insensitive to the control law under which the transition samples are collected, as long as good exploration is ensured. This practically leads to different learning-by-interaction implementation styles for Q-learning, as shown next.
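As a side illustration of the basic update (13) in its simplest tabular form (for discrete state and action spaces only, not the batch neural scheme proposed in this paper), one observed transition would be processed as follows; the names are generic.

```python
def q_update(Q, s, a, s_next, utility, actions, alpha=0.5, gamma=0.95):
    """One application of (13) on a dictionary-based Q-table (cost formulation, minimized over actions)."""
    best_next = min(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (utility + gamma * best_next)
```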
For the dynamic discrete-time state-space nonlinear system (9) with continuous state and action spaces, one possible batch implementation of the Q-learning algorithm is Neural Fitted Q-Learning with Continuous Actions (NFQCA) [38], [39], [28], which belongs to the family of fitted Q iteration algorithms [40] and uses two networks: one to approximate the Q-function (the critic), abbreviated Q-NN, and one to approximate the controller (the actor), namely the controller FA called C-NN.
In NFQCA, the update of the Q-NN and the C-NN is done batch-wise using the entire dataset of collected transition samples. This is considered in [38] and [39] to improve the learning convergence. The learning of the control can be carried out either offline, between the interacting collection phases of mini-batches of transition samples, or at the end of a single large collection phase. In each case, good exploration of the state-action space must be ensured. This observation leads to two versions of the Q-learning implementation: one in which the control strategy learned offline from the data collected so far is actively used for further collecting transition samples, and another in which an already existing controller acting as a supervisor is used to collect the transition samples, after which the Q-learning controller is learned at the end of the collection phase.
Let D  {(x k , u k , x k 1 )} be the entire dataset of collected state-action transition samples. The first version of

the Q-learning algorithm to be presented is an NFQCA algorithm, where a discrete set of actions is used to
minimize the current iteration estimate of the Q-NN called Neural Fitted Q-Learning with Continuous Discrete
Actions (NFQCDA-1) and referred to as the NFQCDA-1 algorithm. This leads to the nonlinear state-feedback
controller referred to as the NFQCDA-1 controller with the notation u k  C (x k ) . The steps of the NFQCDA-1

algorithm are
S1. Initialize the Q-NN and the controller NN, set iter  1 .

S2. Prepare the training set for the Q-NN consisting of the input data pairs $\{\mathbf{x}_k, u_k\}$ and the target output data $U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma Q_{iter}(\mathbf{x}_{k+1}, C_{iter}(\mathbf{x}_{k+1}))$, where $Q_{iter}(\mathbf{x}_k, u_k)$ is the current iteration estimate of the Q-function.
S3. Batch train the Q-NN (i.e., the critic NN FA) using the input-output patterns obtained at step S2 in a supervised learning framework, to obtain $Q_{iter+1}(\mathbf{x}_k, u_k)$.
S4. Prepare the training set for the controller (actor) C-NN using the input data $\{\mathbf{x}_k\}$ from the dataset $D$ and the target data $\{u_k^*\}$, where $u_k^* = \arg\min_{\mathbf{u} \in \Omega} Q_{iter}(\mathbf{x}_k, \mathbf{u})$ for each $\mathbf{x}_k \in D$ and $\Omega$ is a discrete action space.
S5. Batch train the controller/actor C-NN using the training data obtained at step S4 and obtain $C_{iter+1}(\mathbf{x}_k)$.
S6. Run the newest control policy $C_{iter+1}(\mathbf{x}_k)$ using, for example, an ε-greedy strategy, collect the transition samples and add them to the existing database $D$. If the stopping condition is met, terminate the algorithm; else set $iter = iter + 1$ and return to step S2.

The stopping condition is set by the user as the level of tracking performance achieved with the current iteration controller and can be checked between two consecutive iterations of the NFQCDA-1 algorithm.
The current iteration controller $C_{iter}(\mathbf{x}_k)$ is used in step S2 of the NFQCDA-1 algorithm because it is assumed to minimize the current iteration Q-function $Q_{iter}(\mathbf{x}_{k+1}, u_{k+1})$. In step S4, the minimization of the current $Q_{iter}$ is done by uniformly sampling $Q_{iter}$ over a discrete action space, which provides the target samples for the continuous-action C-NN; this is different from the NFQCA given in [38]. Minimization of $Q_{iter}$ over a continuous action space can also be done as in [43], using a cascaded NN comprised of the controller NN connected to the Q-NN, by setting the target of the Q-NN to 0 and using gradient back-propagation training. However, this approach is more computationally demanding. With the proposed discrete action space approach, the controller will learn to interpolate between the target controls minimizing $Q_{iter}$ for a given fixed state. The controller training approach presented above is simpler than the popular gradient-descent back-propagation approach, which requires the calculation of $\partial Q_{iter}/\partial u$ evaluated at each pair $\{\mathbf{x}_k, u_k\}$ in $D$ in order to update the parameters of the C-NN that minimizes the Q-function. It is also computationally feasible with a small number of controls and small corresponding discrete action spaces. Avoiding the online minimization of $Q_{iter}$ at each sampling instant, and obtaining the control signal just by evaluating an NN instead, allows for high-frequency sampling control, which is crucial for fast real-time implementations. The NFQCDA-1 algorithm ensures good exploration of the state-action space: as a stabilizing control policy is learned, it allows for a controlled exploration which would otherwise be difficult to attain just by using random control actions.
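A minimal sketch of the target-preparation steps S2 and S4 is given below, assuming generic Q-NN and C-NN regressors with fit/predict interfaces and a dataset that stores the utility alongside each transition; the helper names are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np

def nfqcda1_targets(D, q_nn, c_nn, actions, gamma=0.95):
    """Build the Q-NN targets (step S2) and the greedy C-NN action targets (step S4) from the dataset D."""
    q_in, q_tgt, c_in, c_tgt = [], [], [], []
    for (x, u, x_next, utility) in D:
        # Step S2: bootstrapped critic target using the current controller estimate.
        u_next = float(c_nn.predict(x_next))
        q_tgt.append(utility + gamma * float(q_nn.predict(np.r_[x_next, u_next])))
        q_in.append(np.r_[x, u])
        # Step S4: greedy action over the discrete action set Omega, used as target for the actor.
        c_tgt.append(min(actions, key=lambda a: float(q_nn.predict(np.r_[x, a]))))
        c_in.append(np.asarray(x))
    return (np.array(q_in), np.array(q_tgt)), (np.array(c_in), np.array(c_tgt))
```

The two returned training sets are then used in steps S3 and S5 to batch-train the Q-NN and the C-NN, respectively.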
The second version of Q-learning, which uses an already designed controller acting as a supervisor, does not need the interaction steps of the NFQCDA-1 algorithm to attain good exploration, since this can be ensured in a controlled environment. This algorithm, called the K-step NFQCDA-2 algorithm and leading to the NFQCDA-2 controller, has the following structure:
S1. Initialize the Q-NN and the C-NN, set $iter = 1$, select the number of steps $K$. Use a supervisor controller (possibly perturbed under an ε-greedy strategy to achieve a persistent excitation condition and subsequently better exploration) to build the dataset $D$.

S2. Prepare the training set for the Q-NN consisting of the input data pairs $\{\mathbf{x}_k, u_k\}$ and the target output data $U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma \min_{\mathbf{u} \in \Omega} Q_{iter}(\mathbf{x}_{k+1}, \mathbf{u})$, where $Q_{iter}(\mathbf{x}_k, u_k)$ is the current iteration estimate of the Q-function and $\Omega$ is a discrete action space.
S3. Batch train the critic Q-NN using the input-output patterns obtained at step S2 in a supervised learning framework, to obtain $Q_{iter+1}(\mathbf{x}_k, u_k)$.
S4. Set $iter = iter + 1$. If $iter \le K$ jump to S2, else jump to S5.
S5. Prepare the training set for the controller (actor) C-NN using the input data $\{\mathbf{x}_k\}$ in the dataset $D$ and the target data $\{u_k^*\}$, where $u_k^* = \arg\min_{\mathbf{u} \in \Omega} Q_{iter}(\mathbf{x}_k, \mathbf{u})$ for each $\mathbf{x}_k \in D$ and $\Omega$ is a discrete action space.
S6. Batch train the controller/actor C-NN using the training data obtained at step S5 and obtain $C(\mathbf{x}_k)$. The algorithm is terminated.
Unlike the NFQCDA-1 algorithm, the NFQCDA-2 algorithm learns the controller in one supervised learning step at the end of the Q-NN training, namely after the value function estimate has been approximated. This approach will be shown to result in control system performance improvement starting with a simple supervisor controller used to collect the transition samples.
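A compact sketch of the K-step fitted Q iteration at the core of NFQCDA-2 is shown below; it assumes the same generic fit/predict regressor interface as the previous sketch, a dataset collected beforehand under the supervisor controller, and a stored utility per transition.

```python
import numpy as np

def nfqcda2(D, q_nn, c_nn, actions, K=40, gamma=0.95):
    """K-step NFQCDA-2: fit the Q-NN K times on greedy bootstrapped targets, then fit the C-NN once."""
    for _ in range(K):                                            # steps S2-S4
        q_in, q_tgt = [], []
        for (x, u, x_next, utility) in D:
            best_next = min(float(q_nn.predict(np.r_[x_next, a])) for a in actions)
            q_in.append(np.r_[x, u])
            q_tgt.append(utility + gamma * best_next)
        q_nn.fit(np.array(q_in), np.array(q_tgt))
    # Steps S5-S6: greedy action targets for the controller, learned in one supervised step.
    c_in = np.array([np.asarray(x) for (x, _, _, _) in D])
    c_tgt = np.array([min(actions, key=lambda a: float(q_nn.predict(np.r_[x, a]))) for (x, _, _, _) in D])
    c_nn.fit(c_in, c_tgt)
    return c_nn
```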

4. Validation case studies


4.1. First case study

The sampling period is 0.01 s in the simulated case study described as follows. To fit the optimal regulation problem to the form of the c.f. given in (10), aiming at slip setpoint regulation to a constant reference input $\lambda^*$, the utility function in (10) is set to

$U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) = \min\left( (\lambda_k - \lambda^*)^2, 0.05 \right), \qquad (14)$

with the slip $\lambda_k$ calculated as in (2); it depends only on the current state $\mathbf{x}_k$. The upper bound on the utility is used to force $U(\mathbf{x}_k, u_k, \mathbf{x}_{k+1}) + \gamma Q_{iter}(\mathbf{x}_{k+1}, C_{iter}(\mathbf{x}_{k+1}))$ in the NFQCDA-1 algorithm to stay inside $[0,1]$, since it will be the target value for an NN whose output is also constrained within $[0,1]$, with the NN architecture details to be introduced shortly. This scheme is used in [38] and [39] as well and is important to ensure the learning convergence. In this case a single goal/objective is targeted in terms of the constant reference input, which is encoded in the utility function, but NFQCDA-1 can deal with multiple objectives (even time-varying ones, as required by typical reference trajectory tracking problems). The changing reference input is then treated as an additional input to the Q-learning controller and a minor adaptation scheme is needed [38], [39]. The feedback control structure is presented in Fig. 3, with the (possibly time-varying) reference slip $\lambda_k^*$ set as an optional input to the controller, which will be used in some of the following case studies. The slip output does not enter the NFQCDA-1 controller explicitly, although it could be used as an additional input state like the slip reference input. Instead, the NFQCDA-1 controller accounts for it intrinsically through the utility function.
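For reference, the bounded utility (14) is straightforward to express in code; the sketch below assumes a constant setpoint and a slip value already computed from the current state via (2), and the 0.05 saturation keeps the discounted return within $[0,1]$ for $\gamma = 0.95$.

```python
def utility(lam_k, lam_ref=0.2, bound=0.05):
    """Bounded quadratic regulation utility per (14)."""
    return min((lam_k - lam_ref) ** 2, bound)
```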

Fig. 3. NFQCDA-1 and NFQCDA-2 feedback control system.

To achieve practical convergence, the discount factor is selected as $\gamma = 0.95$. The reference input in this case study is $\lambda^* = 0.2$, which is at the maximum of the slip-friction curve in Fig. 2, aiming at the maximum friction. As outlined in [2] and [7], the linearized dynamics of the process in the vicinity of the maximum point switch from stable ($\lambda < 0.2$) to unstable ($\lambda > 0.2$).

Many transition samples have to be collected from the process to populate the dataset $D$ for learning control using NFQCDA-1. The braking process is extremely favorable for data collection since it is an episodic one, in which the braking is initiated after accelerating to a threshold car speed. In addition, an initial stabilizing controller is not necessary, whereas for many other processes a destabilizing controller may damage the process. In a braking scenario set to last a pre-specified amount of time, the car can either brake to a full stop or not, after which it can re-accelerate to start the next braking sequence. From a real-life perspective, given a track to run the car, the braking learning control process requires only successive episodes of accelerating and then braking, until high performance control is achieved.
Two NNs will be used: one to approximate the value function, called the Q-NN, and the other to approximate the controller, called the C-NN. The Q-NN is a 6–12–12–1 feed-forward fully connected NN with biases. Its six inputs are the five components of the extended state vector in (9) plus the scalar control action. Each hidden layer has 12 tansig activation function neurons and the output neuron has a logsig activation function, which bounds the NN output within $[0,1]$. The weights are initialized randomly from a zero-mean normal distribution. The standard Levenberg-Marquardt gradient back-propagation algorithm is used for training, for a maximum number of 100 epochs. The C-NN is similar to the Q-NN in most aspects, with the dimension 5–12–12–1, having only the extended state vector in (9) as input and a linear activation function of the output neuron.
For data collection purposes, each braking episode, occurring after an acceleration up to 180 rad/s, is set to last for 2 s, meaning 200 transition samples are collected each time. During each braking episode, the NFQCDA-1 algorithm interacts with the process by exploiting the current controller under an ε-greedy strategy. The discrete action space is $\Omega = \{0, 0.1, \ldots, 1\}$. The exploration factor of the ε-greedy strategy is set to 2 initially, meaning that a uniformly random control action from $\Omega$ is sent to the process every 2 sampling instants (50% of the time) out of the instants when a control action is applied to the process. After every 10 interaction episodes, the exploration factor is increased by 1; an exploration factor of 3 means that random control is sent to the process every third sampling instant (33% of the time), and so on. At the rest of the sampling instants, the control signal is obtained by passing the state through the C-NN. After each interaction episode, the NFQCDA-1 algorithm trains both the Q-NN and the C-NN at the current iteration.
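The episode-indexed exploration schedule described above can be sketched as follows, using the deterministic "every n-th sampling instant" reading of the exploration factor given in the text; the helper names are illustrative.

```python
import random

def explore_action(k, episode, c_nn, x_e, actions):
    """Exploration factor n = 2 + episode // 10: a random action every n-th instant, the C-NN otherwise."""
    n = 2 + episode // 10
    if k % n == 0:
        return random.choice(actions)
    return float(c_nn.predict(x_e))
```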
The learning is stopped after 6000 transition samples have been collected from the process, corresponding to 30 NFQCDA-1 iterations (and also trials/interactions). The total interaction time with the process (measured only in braking time) is 60 s, corresponding to 30 braking trials, which suffices to learn high performance control. The offline learning time increases with the size of the dataset $D$, with the size of the NNs and with the size of the discrete action space $\Omega$, and amounts to about 15 minutes on a laptop computer with an Intel i5 2.3 GHz processor and 8 GB of RAM. The final NN controller evaluation time measured on the computer is 0.31 ms, well below the sampling period.
The performance of the control system with the NFQCDA-1 controller on the 2 s scenario is displayed in Fig. 4, along with the results obtained with a model-free relay feedback controller with the outputs 0 and 1 switching at zero control error, and a model-based optimized proportional-integral (OPI) controller. The relay feedback controller can be considered a model-free controller as well because it only requires processing the control error and setting the ON/OFF thresholds, which in turn requires only knowledge of the actuator. The relay feedback controller thresholds can also be subjected to tuning. The OPI controller with the transfer function $C(z) = (5.2640 - 4.9170\,z^{-1})/(1 - z^{-1})$ was obtained to minimize the sum of squared errors criterion $J_{SSE}^{150} = \sum_{k=1}^{150} \min[(\lambda_k - \lambda^*)^2, 0.05]$ on a 1.5 s scenario with braking starting at 180 rad/s; the optimal controller parameters were found using an optimization approach with the process model in simulation, for a constant slip reference setpoint of $\lambda^* = 0.2$. The optimization is carried out using the Nelder-Mead simplex search algorithm. Minimizing the same criterion over 2 s results in a very poor, marginally unstable linear OPI controller, since after 1.5 s the linear controller (and also the relay feedback controller and the NFQCDA-1 controller) quickly loses performance owing to the wheels' low angular speeds, which makes it impossible to recover the slippage effect, which becomes unstable [1]–[10]. In practice, the ABS control may be deactivated at low speeds to allow for a full car stop, but this solution is not considered in this case study. The performance criterion $J_{SSE}^{200}$, as a time-truncated version of the original c.f. (10), is 1.84 for the control system with the NFQCDA-1 controller, 3.17 for the control system with the relay feedback controller and 1.47 for the control system with the OPI controller, the NFQCDA-1 controller balancing the setpoint tracking with low oscillations in both the output and the control. However, $J_{SSE}^{200}$ is measured over the entire scenario, including the unstable behavior of lost slippage effect after 1.2 s, which is not really relevant for performance assessment but nevertheless affects the measured performance.
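The truncated performance criterion used throughout the comparisons can be computed from a recorded slip trajectory as sketched below (an illustrative helper, not the authors' evaluation script); with the 0.01 s sampling period, a horizon of 200 samples corresponds to the 2 s scenario.

```python
def j_sse(slip_traj, ref_traj, horizon, bound=0.05):
    """Time-truncated sum of squared, bounded slip errors, e.g. J_SSE^200."""
    return sum(min((lam - ref) ** 2, bound)
               for lam, ref in zip(slip_traj[:horizon], ref_traj[:horizon]))
```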
The NFQCDA-1 controller inherits the time-optimal character of both the relay feedback controller and the OPI controller, as seen in Fig. 4 c), where the reference slip is quickly reached in about 0.2 s, with the control signal in Fig. 4 d) initially saturating at its maximum value, which can also be observed for the relay feedback controller. Afterwards, the continuous action of the NFQCDA-1 controller is smoother and reaches about the same steady-state value as the OPI controller and, presumably, the mean value of the control signal of the relay feedback controller. The controlled slip in Fig. 4 c) is smooth with the NFQCDA-1 controller and not oscillatory, unlike the output with the relay feedback controller. Hence the NFQCDA-1 controller inherits the best features of the other two controllers, owing to its nonlinear character and to the use of full state-feedback information, as compared with only output feedback for the other two controllers. As for the wheels' angular speeds, it is noticed in Figs. 4 a) and 4 b) that all three control systems exhibit similar overall performance in terms of wheel and car angular speeds. However, the car is not fully stopped in the case of the control system with the NFQCDA-1 controller, which apparently decides not to actuate the brake after 1.25 s. This significantly reduces the oscillations in the slip output at the end of the learning control scenario.

Fig. 4. Braking results for the control systems with NFQCDA-1 controller (green lines), relay feedback
controller (blue lines) and OPI controller (red lines): a) normalized upper wheel speed; b) normalized lower
wheel speed; c) controlled slip and reference (black dotted); d) control signals and actuator dead zone activation
threshold (black dotted).

4.2. Second case study

In the previous case study, the braking learning control scenario started every time with the same initial condition. When starting with different initial conditions, slip control is no longer achieved by the NFQCDA-1 controller, whereas the relay feedback and OPI controllers can still ensure slip control. To overcome this issue, the learning control scenario is changed to start with different initial conditions spanning the entire range of initial car speeds before braking, i.e., $x_{10} = x_{20} \in [0,180]$ rad/s.
The same learning architectures are used as in Sub-section 4.1, the only differences being the uniformly random initial conditions imposed on the car speed (and the braked wheel speed) at each interaction episode, the discrete uniform action space that minimizes the Q-NN, which now has 7 actions, $\Omega = \{0, 1/6, 1/3, 1/2, 2/3, 2.5/3, 1\}$, and the ε-greedy exploration factor, which is constant and equal to 4 throughout the NFQCDA-1 algorithm's interaction with the process. The NFQCDA-1 learning is now stopped after 12000 transition samples are collected, corresponding to 120 s and 60 braking trials.
The response curves for the control systems with the NFQCDA-1, relay feedback and OPI controllers, considering two scenarios where braking starts at 180 rad/s and 90 rad/s, are illustrated in Figs. 5 and 6, respectively, this time without $x_1$, $x_2$ displayed.

Fig. 5. Braking results at 180 rad/s for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).


Fig. 6. Braking results at 90 rad/s for the control systems with NFQCDA-1 controller (green lines), relay
feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b)
control signals and actuator dead zone activation threshold (black dotted).
In the braking scenario at high speed (180 rad/s), $J_{SSE}^{200}$ is 1.79, 3.17 and 1.47 for the control systems with the NFQCDA-1, relay feedback and OPI controllers, respectively. For the braking scenario at 90 rad/s, $J_{SSE}^{200}$ is 2.40, 4.75 and 2.28 for the control systems with the NFQCDA-1, relay feedback and OPI controllers, respectively. These measurements include the setpoint tracking error that occurs after 0.6 s, where the slippage effect is lost, which is not very relevant for performance assessment. Measuring $J_{SSE}^{100}$ in high-speed braking (after 1 s the slippage effect is lost) results in 0.30, 0.33 and 0.25 for the NFQCDA-1, relay and OPI controllers, respectively, while measuring $J_{SSE}^{60}$ in low-speed braking (90 rad/s initial speed, with the slippage effect lost after 0.6 s) results in 0.41, 0.52 and 0.54 for the NFQCDA-1, relay and OPI controllers, respectively. The OPI controller behaves the best in the high-speed braking since it was optimized particularly for this scenario, while behaving the worst in low initial speed braking. The NFQCDA-1 controller is the best in low-speed braking, where even the relay controller is better than the OPI controller. Remarkably, the NFQCDA-1 controller has learned to brake at various initial speeds and it can generalize better than the two other controllers.

4.3. Third case study

In both previous case studies, the control system with the NFQCDA-1 controller has not been able to reject a slip output disturbance or to track a changing slip reference, as opposed to the other two controllers. A change in the slip reference setpoint (equivalent to a slip output disturbance in a slip feedback error-based control architecture) can model a change in the surfaces coming into contact during braking; this should first be detected by a higher-level supervisory controller, after which the reference setpoint should be adjusted for the respective surface, since the slip-friction curve is surface dependent.
To add reference tracking and disturbance rejection capabilities to the control system with the NFQCDA-1 controller, the time-varying slip reference setpoint is set as an additional input to the controller. The learning can be done by starting with different initial car speeds as in the second case study, and during the braking process the slip setpoint is allowed to vary in time. This is necessary because the slip always starts from 0 when braking is initiated. For the learning process to be presented with various combinations of slip reference input changes, throughout the braking process the slip reference change is modeled by a succession of two steps switching at 0.3 s, i.e., $\lambda_k^* = R_1 \sigma(k) + (R_2 - R_1)\sigma(k - 30)$, where $\sigma(\cdot)$ is the unit step and $R_1$, $R_2$ are uniform random numbers inside $[0.15, 0.35]$. Normally, this approach requires slight adaptations of the NFQCDA-1 algorithm so that it can deal with time-varying goals [38], and it would not be very data-efficient as it is only able to learn from half of the transition samples collected. However, since the slip reference is step-wise constant except for the switching sample, the adaptation scheme is not considered.
The learning process is further particularized to slightly different settings than in the first two case studies, to check how the learning control is influenced. Namely, the initial car speed varies within [50,180] rad/s, because below 50 rad/s the braking process is very short and the slippage effect is not recoverable (see the behavior of the slip trajectories at the end of the scenarios in the first two case studies). As such, the transition samples used for learning are collected only from those whose normalized car speed satisfies $x_2 \ge 0.1$, which corresponds to $x_2 \ge 18.8$ rad/s. This is different from the previous case studies and leads to fewer usable transition samples collected at each interaction episode lasting 2 s. The exploration factor is now constant and equal to 5 for all the interaction episodes, and the discrete action space $\Omega$ is the same as that used in the case study given in Sub-section 4.2, having 7 actions. The Q-NN is of size 7–15–15–1 (now the extended state $\mathbf{x}_k \in \mathbb{R}^6$ additionally contains the slip reference input, see Fig. 3):

$\mathbf{x}_k = [\mathbf{x}_{e,k}^T \ \ \lambda_k^*]^T, \qquad (15)$

and the C-NN is 6–10–10–1, whereas the rest of the learning process and its training settings are the same. Concluding, data collection now covers the state-action space with random initial car speeds, random transitions between random setpoints and random control actions via the ε-greedy exploration strategy.
The NFQCDA-1 algorithm is run for 130 iterations corresponding to 130 interaction episodes of 2 s each, out of which 11800 transition samples were collected, about the same amount as in the second case study. The total offline training time amounts to about 1 hour on the laptop computer. The controller NN evaluation time measured on the computer is now 0.233 ms, since the C-NN has fewer parameters than the one used in the first two case studies. The control system with the final NFQCDA-1 controller is compared against the control system with the same relay feedback controller and the control system with the same OPI controller considered in the previous case studies. Two scenarios are recorded and analyzed, corresponding to high-speed braking and low-speed braking, illustrated in Figs. 7 and 8.

Fig. 7. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).
Fig. 8. Braking results at 70 rad/s with setpoint tracking for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).

The performance of the control systems with all controllers is now measured by a modified sum of squared errors criterion, recorded from the start of the braking until the normalized car speed drops below $x_2 = 0.1$ at the $T_F$-th sampling instant. The performance criterion is expressed as

$J_{SSE}^{T_F(x_2 \le 0.1)} = \sum_{k=1}^{T_F} \min[(\lambda_k - \lambda_k^*)^2, 0.05].$

Analyzing the high-speed braking scenario in Fig. 7, starting with the initial car speed of 180 rad/s, it is revealed that all controllers are able to track setpoint changes. A reference slip setpoint change from 0.2 to 0.3 occurs at 0.5 s in the braking process in this scenario. The measured performance is $J_{SSE}^{119(x_2 \le 0.1)} = 0.65$ for the NFQCDA-1 controller, $J_{SSE}^{119(x_2 \le 0.1)} = 0.75$ for the relay feedback controller and $J_{SSE}^{119(x_2 \le 0.1)} = 0.82$ for the OPI controller. The same number of $T_F = 119$ samples is observed for all controllers. After this point in time, the controlled slip and the control signal of the NFQCDA-1 controller behave unstably, given the poor generalization property of the C-NN controller for slips outside the range $[0.15, 0.35]$, which were not seen during the training phase, and owing to the C-NN's linear output activation function, which does not bound the controller output. This behavior was not observed in the first two case studies, where all the transition samples, including those corresponding to low speeds where the slip behaved unstably, were used for training NFQCDA-1's NN controller. However, the slip control can be switched to a different logic, for example full braking, after the car speed drops below the threshold, to avoid this "unstable" effect.

In the low-speed braking scenario shown in Fig. 8, which starts at 70 rad/s, the performance indices are $J_{SSE}^{42(x_2 \le 0.1)} = 0.77$ for the NFQCDA-1 controller, $J_{SSE}^{43(x_2 \le 0.1)} = 0.92$ for the relay feedback one and $J_{SSE}^{43(x_2 \le 0.1)} = 1.03$ for the OPI controller. The NFQCDA-1 controller is the best, borrowing the time-optimal behavior from the relay feedback controller and the smooth steady-state behavior from the OPI controller, which now performs poorly at low speed. The NFQCDA-1, relay feedback and OPI controllers bring the normalized car speed to $x_2 = 0.1$ in 0.42 s, 0.43 s and 0.43 s, respectively (42, 43 and 43 samples, respectively). The reference slip setpoint decrease from 0.3 to 0.2 occurs at 0.3 s in the braking process in this scenario, to accommodate the short braking time for this low initial car speed. The performance loss of the control system with the OPI controller follows from its testing in regimes different from the one used for its optimization.
The output disturbance rejection is not ensured since the controlled output $\lambda$ is not used for feedback (as shown in Fig. 3) and this disturbance cannot be translated in terms of the wheel speeds $x_1$ and $x_2$. If the slip feedback error were used as an extra input (belonging to the extended state vector) to the NFQCDA-1 controller, then the output disturbance could have been conceptually regarded as an equivalent disturbance in the slip reference setpoint. This differs from the relay and OPI controllers, which use the control error to elaborate the control signal (action).

4.4. Fourth case study



In the previous case studies, the NFQCDA-1 controller is learned while interacting with the process. However, it is possible to improve the control by learning from a supervisor controller using the NFQCDA-2 algorithm, benefiting from the good exploratory actions brought by a prior stabilizing controller. The case study described in this sub-section illustrates such an approach, where the transition samples are collected under the model-free relay controller described in the previous case studies. The learning settings are as presented next. The relay feedback controller for transition sample collection is used in successive braking episodes starting each time at 180 rad/s. The slip reference is modeled as a series of steps changing every 0.2 s, with the magnitude in the range $[0.15, 0.35]$. The exploration factor of the ε-greedy strategy is set to 3, meaning that one out of three control actions is uniformly randomly selected from the discrete action space $\Omega = \{0, 1/6, 1/3, 1/2, 2/3, 2.5/3, 1\}$. The transition samples are collected for all the normalized car speeds, ranging down from the top initial speed to zero, differently from the case study presented in Sub-section 4.3, leading to 12000 samples being collected to learn a new NFQCDA-2 controller. The collecting strategy leads to good exploration owing to the stabilizing relay feedback controller.
The 40-step offline NFQCDA-2 algorithm is run on this dataset until convergence to an acceptable controller is observed. Each iteration of the algorithm minimizes the current iteration Q-NN estimate over the discrete action space $\Omega$. The C-NN is learned only at the end, according to steps S5 and S6 of the NFQCDA-2 algorithm. The Q-NN is of size 7–15–15–1 (the extended state also contains the slip reference input as in (15), see Fig. 3) and the C-NN is 6–10–10–1. The NN training settings are the same as in the previous case studies.
The initial unperturbed relay feedback controller and the resulting NN controller are displayed in Fig. 9 for a changing slip reference scenario starting from the high initial car speed of 180 rad/s. The NFQCDA-2 model-free controller shows smoother action compared with the supervisor relay feedback one. The result confirms that the control system performance can be improved starting with an already existing model-free controller: $J_{SSE}^{100} = 0.717$ for the relay controller and $J_{SSE}^{100} = 0.6524$ for the NFQCDA-2 model-free controller.

For additional comparison purposes, we add an original NN state-feedback nonlinear controller designed using the popular model-free VRFT concept, which has been combined with NNs before [44], [45], but only with input-output process and controller models. It appears to be a first-time application of VRFT to nonlinear state-feedback control design. In VRFT, the aim is to design a controller such that the closed-loop CS behavior from the reference input to the undisturbed output matches a reference model (usually a linear one). The reference model $M(z)$ used in this case study is a discretized version of $M(s) = 25^2/(s^2 + 2\cdot 0.9\cdot 25\cdot s + 25^2)$, selected in terms of natural frequency and damping factor. Input-state-output data tuples $\{\tilde{u}_k, \tilde{\mathbf{x}}_{e,k}, \tilde{\lambda}_k\}$, $k = 0 \ldots N-1$, are collected from the process in an open-loop setting, from successive braking episodes starting with an initial car speed of 180 rad/s, in which the process input is a sequence of uniform random step signals with amplitude $u\,[\%] \in [0,1]$ and duration 0.05 s. Once again we note that open-loop data collection for the unstable ABS process is not problematic in practice, whereas VRFT is commonly used on open-loop stable processes. The virtual reference input is calculated offline as $\tilde{\lambda}_k^* = M^{-1}(z)\tilde{\lambda}_k$. The infinite-horizon undiscounted VRFT model reference matching performance index is expressed as $J_{MR}(\mathbf{x}_0) = \sum_{k=0}^{\infty} (\lambda_k - M(z)\lambda_k^*)^2$, with $M(z) \ne 1$, in which the utility slightly differs from the one used in (14); therefore it is an approximation of the control objective used with the NFQCDA-1 and NFQCDA-2 controllers.
Then a state-feedback controller $\tilde{u}_k = C([\tilde{\mathbf{x}}_{e,k}\ \tilde{\lambda}_k^*]^T)$ minimizing the finite-time controller identification performance index $J_{VR} = \sum_{k=0}^{N-1} (\tilde{u}_k - C([\tilde{\mathbf{x}}_{e,k}\ \tilde{\lambda}_k^*]^T))^2$ is the one achieving closed-loop model reference matching. It was shown in [44] that, for a rich controller parameterization (such as an NN) and sufficiently persistently exciting collected data, minimization of $J_{VR}$ translates into minimizing $J_{MR}(\mathbf{x}_0)$. Note that the output dependence on the state, $\tilde{\lambda}_k = h(\tilde{\mathbf{x}}_{e,k})$, is not explicitly needed; knowledge of $\tilde{\lambda}_k$ is necessary for the virtual reference $\tilde{\lambda}_k^*$ calculation, while the controller uses only $\tilde{\mathbf{x}}_{e,k}$ for feedback.
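A minimal sketch of this VRFT design is given below; it assumes a zero-order-hold discretization of the reference model, a generic NN regressor with a fit method, and that the improperness of $M^{-1}(z)$ is handled simply by dropping the leading zero numerator coefficients (i.e., accepting a small time advance). It only illustrates the virtual-reference construction and the supervised controller fit, not the authors' implementation.

```python
import numpy as np
from scipy import signal

def vrft_fit(u_data, xe_data, lam_data, c_nn, Ts=0.01):
    """Fit a state-feedback VRFT controller: inputs [x~_e,k, lambda~*_k], targets u~_k (minimizes J_VR)."""
    wn, zeta = 25.0, 0.9                                       # M(s) = wn^2 / (s^2 + 2*zeta*wn*s + wn^2)
    num_d, den_d, _ = signal.cont2discrete(([wn**2], [1.0, 2*zeta*wn, wn**2]), Ts)
    num_d = np.trim_zeros(np.squeeze(num_d), 'f')              # make the inverse filter den_d/num_d realizable
    lam_virtual_ref = signal.lfilter(den_d, num_d, lam_data)   # lambda~* = M^{-1}(z) lambda~
    inputs = np.column_stack([xe_data, lam_virtual_ref])       # controller input patterns [x~_e,k, lambda~*_k]
    c_nn.fit(inputs, u_data)                                   # supervised fit of the C-NN to the recorded u~_k
    return c_nn
```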

In our case, we collect N = 10000 input-state-output tuples to train a 6–12–1 feed-forward fully connected C-NN with biases, with tansig activation function in the hidden layer and linear activation function in the output. This controller uses $[\tilde{\mathbf{x}}_{e,k}\ \lambda^*_k]^T$ as input patterns and $\tilde{u}_k$ as targets, and it minimizes $J_{VR}$ above. The VRFT C-NN control performance measure is $J_{SSE}^{100} = 0.5058$, and the result is shown in Fig. 9 together with the relay and NFQCDA-2 controllers. The VRFT controller is slightly better in terms of constant steady-state behavior, while not as fast in response as the other two, since $M(z) \neq 1$ prohibits a time-optimal behavior. Theoretically, $M(z) = 1$ could be used as well, but this usually fails to provide a successful VRFT design. A similar data size as for the NFQCDA-2 controller is used for learning efficient control. The NN VRFT controller displays a consistent level of performance even at different initial braking speeds, proving generalization ability owing to the complex NN structure. VRFT controllers may serve as supervisory controllers as well, even for the adaptive critics design to be presented next.
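A sketch of the controller identification step follows; it assumes scikit-learn is available and uses placeholder arrays in place of the recorded $[\tilde{\mathbf{x}}_{e,k}\ \lambda^*_k]^T$ patterns and $\tilde{u}_k$ targets, so it only illustrates how a 6–12–1 fit minimizing $J_{VR}$ can be set up, not the authors' training code.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

N = 10000
X = np.random.rand(N, 6)          # placeholder for the [x_e,k, lambda*_k] input patterns
y = np.random.rand(N)             # placeholder for the recorded control inputs u_k

# 6 inputs -> 12 tanh (tansig) hidden neurons -> 1 linear output, i.e. a 6-12-1 C-NN;
# the squared-error regression loss coincides with J_VR up to a constant factor
vrft_cnn = MLPRegressor(hidden_layer_sizes=(12,), activation='tanh',
                        solver='lbfgs', max_iter=2000)
vrft_cnn.fit(X, y)

u_k = vrft_cnn.predict(X[:1])     # controller output for one state/virtual-reference pair
```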
Fig. 9. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-2 controller
(green lines), supervisor relay feedback controller (blue lines) and NN VRFT controller (red lines): a) controlled
slip and reference (black dotted); b) control signals.

The idea of using a supervisor controller in a supervised actor-critic (SAC) learning context has also been used in [35], [46], [47], even combined with model-free VRFT to play the supervisor role [48]. In the following, a comparison with an SAC NN controller trained using the ADP actor-critic (AC) framework is presented. The proposed architecture uses two NNs to approximate the value function Q (the critic NN) and the controller function (the actor NN), respectively. Without using a process model, this is in fact an adaptive model-free ADHDP scheme. The convergence and stability of ADP algorithms have been thoroughly approached in [49]–[52], and particularly for Q-learning schemes in [53]. The two NNs are three-layer feedforward ones with one hidden layer, fully connected with biases, modeled in terms of

$Q_k = W_{c,n_{hc}+1} + \sum_{i=1}^{n_{hc}} W_{c,i} \tanh_i \left( \sum_{j=1}^{n_{ic}+1} V_c^{j,i} I_j \right)$,   (16)

$u_k = W_{a,n_{ha}+1} + \sum_{i=1}^{n_{ha}} W_{a,i} \tanh_i \left( \sum_{j=1}^{n_{ia}+1} V_a^{j,i} I_j \right)$,   (17)

with $\mathbf{W}_c = [W_{c,1} \ldots W_{c,n_{hc}+1}]^T$ – the output layer weights of the critic having $n_{hc}$ hidden neurons, $\mathbf{V}_c = [V_c^{j,i}]_{i=1 \ldots n_{hc},\ j=1 \ldots n_{ic}+1}$ – the hidden layer weights of the critic, $\mathbf{I}_c = [I_1 \ldots I_{n_{ic}+1}]^T = [\mathbf{x}_k^T\ u_k\ 1]^T$ – the critic input vector of size $n_{ic}+1$ (the bias input is included as the constant 1), $\mathbf{W}_a = [W_{a,1} \ldots W_{a,n_{ha}+1}]^T$ – the output layer weights of the actor having $n_{ha}$ hidden neurons, $\mathbf{V}_a = [V_a^{j,i}]_{i=1 \ldots n_{ha},\ j=1 \ldots n_{ia}+1}$ – the actor's hidden layer weights, $\mathbf{I}_a = [I_1 \ldots I_{n_{ia}+1}]^T = [\mathbf{x}_k^T\ 1]^T$ – the actor input vector of size $n_{ia}+1$ (the bias input is included), and the superscript $T$ indicates matrix transposition. $\tanh_i(\cdot)$ stands for the hyperbolic tangent activation function at the output of the $i$-th hidden neuron. The actor and critic are cascaded, with the actor output $u_k$ being the $n_{ic}$-th critic input.

Parameterizing the actor and critic weight vectors generically as $\boldsymbol{\pi}_k = [(\mathbf{W}_a^k)^T\ (\mathbf{V}_a^k)^T]^T$ and $\boldsymbol{\theta}_k = [(\mathbf{W}_c^k)^T\ (\mathbf{V}_c^k)^T]^T$, respectively, with the superscript $k$ indicating the index of the current sampling instant, the critic is trained to minimize the temporal difference error measured by

$\delta_k = U(\mathbf{x}_{k-1}, u_{k-1}, \mathbf{x}_k) + \gamma Q(\mathbf{x}_k, u_k, \boldsymbol{\theta}_k) - Q(\mathbf{x}_{k-1}, u_{k-1}, \boldsymbol{\theta}_{k-1})$,   (18)

by online minimization of the critic cost function $E_{c,k} = 0.5\,\delta_k^2$.

Considering the actor trained to minimize the Q-function estimate at every sampling instant, the gradient descent training rules for the critic and the actor are

$\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} + \alpha_c\, \delta_k \left. \dfrac{\partial Q(\mathbf{x}, u, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right|_{(\mathbf{x}_{k-1}, u_{k-1}, \boldsymbol{\theta}_{k-1})}$,   (19)

$\boldsymbol{\pi}_k = \boldsymbol{\pi}_{k-1} - \alpha_a \left. \dfrac{\partial Q(\mathbf{x}, u, \boldsymbol{\theta})}{\partial u} \right|_{(\mathbf{x}_{k-1}, u_{k-1}, \boldsymbol{\theta}_{k-1})} \left. \dfrac{\partial u(\mathbf{x}, \boldsymbol{\pi})}{\partial \boldsymbol{\pi}} \right|_{(\mathbf{x}_{k-1}, \boldsymbol{\pi}_{k-1})}$,   (20)

with $\alpha_a$, $\alpha_c$ – the step size magnitudes of the actor and critic training rules, respectively; (19) performs descent on $E_{c,k}$ (the target term in (18) is held fixed), while (20) performs descent on the Q-function estimate with respect to the control action.
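A compact numerical sketch of this adaptive critic scheme is given below. It is our illustration rather than the authors' implementation: the critic and actor weights are packed into flat vectors $\boldsymbol{\theta}$ and $\boldsymbol{\pi}$, finite-difference gradients stand in for the analytic back-propagation expressions, and the 7–50–1 / 6–12–1 sizes, step sizes and random initialization follow the case study parameters given in the next paragraph (a 5-dimensional state vector is assumed so that the actor sees 6 inputs and the critic 7, bias included).

```python
import numpy as np

N_HC, N_HA = 50, 12                                     # hidden neurons: critic 7-50-1, actor 6-12-1

def q_fun(x, u, theta):
    """Critic (16): theta packs Vc (7 x N_HC) followed by Wc (N_HC + 1, last entry = output bias)."""
    Vc, Wc = theta[:7 * N_HC].reshape(7, N_HC), theta[7 * N_HC:]
    i_c = np.concatenate([x, [u, 1.0]])                 # critic input [x^T u 1]^T
    return Wc[-1] + np.tanh(i_c @ Vc) @ Wc[:-1]

def u_fun(x, pi):
    """Actor (17): pi packs Va (6 x N_HA) followed by Wa (N_HA + 1, last entry = output bias)."""
    Va, Wa = pi[:6 * N_HA].reshape(6, N_HA), pi[6 * N_HA:]
    i_a = np.concatenate([x, [1.0]])                    # actor input [x^T 1]^T
    return Wa[-1] + np.tanh(i_a @ Va) @ Wa[:-1]

def num_grad(f, v, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to vector v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        d = np.zeros_like(v); d[i] = eps
        g[i] = (f(v + d) - f(v - d)) / (2.0 * eps)
    return g

def adhdp_step(x_prev, u_prev, x, U, theta, pi, gamma=0.95, a_c=1e-2, a_a=1e-6):
    """One online update: TD error (18), critic rule (19), actor rule (20)."""
    u = u_fun(x, pi)
    delta = U + gamma * q_fun(x, u, theta) - q_fun(x_prev, u_prev, theta)          # (18)
    dq_dtheta = num_grad(lambda th: q_fun(x_prev, u_prev, th), theta)
    dq_du = (q_fun(x_prev, u_prev + 1e-6, theta)
             - q_fun(x_prev, u_prev - 1e-6, theta)) / 2e-6
    du_dpi = num_grad(lambda p: u_fun(x_prev, p), pi)
    theta = theta + a_c * delta * dq_dtheta             # (19): semi-gradient descent on 0.5*delta^2
    pi = pi - a_a * dq_du * du_dpi                      # (20): descent on the Q estimate
    return theta, pi

# example call with uniformly random initialization in [-1.5, 1.5], as in the case study
rng = np.random.default_rng(0)
theta = rng.uniform(-1.5, 1.5, 7 * N_HC + N_HC + 1)    # 7-50-1 critic parameters
pi = rng.uniform(-1.5, 1.5, 6 * N_HA + N_HA + 1)       # 6-12-1 actor parameters
x_prev, x = np.zeros(5), np.zeros(5)                    # assumed 5-dimensional state
theta, pi = adhdp_step(x_prev, 0.0, x, U=0.02, theta=theta, pi=pi)
```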

The parameters used in this case study are $n_{hc} = 50$, $n_{ic} = 7$, $n_{ha} = 12$, $n_{ia} = 6$, $\alpha_a = 10^{-6}$, $\alpha_c = 10^{-2}$, $\gamma = 0.95$. Both the actor (6-12-1) and the critic (7-50-1) NN weights/parameters are initialized uniformly at random in $[-1.5, 1.5]$. First, the OPI controller used in the previous sub-sections is employed as a supervisor to collect the transition samples. Braking trials are conducted from the same initial car speed of 180 rad/s until 6000 transitions are generated. To ensure good exploration, the slip reference is modeled as a sequence of two steps of magnitudes in $[0.15, 0.35]$. Each braking scenario lasts 2 s and generates 200 transition samples. During this data collection, the actor NN is not trained and therefore not used, while the critic NN is trained online to estimate the Q-function of the stabilizing OPI controller. In addition, the dataset collected under the OPI controller is used to train an initial stable actor NN controller. The role of the supervisor in this learning process is to bring the actor NN and the critic NN close to their optimal values.

With the actor NN initialized from the dataset gathered in terms of the OPI controller, both the actor and critic NNs are trained online, in this case on a fixed braking scenario starting with the initial car speed of 180 rad/s, the initial slip output set to zero, and the slip reference input set to $\lambda^*_k = 0.17(1(k) - 1(k-40)) + 0.26 \cdot 1(k-40)$, with $1(\cdot)$ the discrete-time unit step. To ensure good exploration, zero-mean normal random noise of variance $\sigma^2 = 0.014$ is added to the C-NN output at each sampling instant. The simultaneous adaptation of the critic and actor NNs is then performed in each braking episode as long as the normalized car speed is $x_2 > 0.1$. The following braking episode starts with the final actor and critic NN parameters from the preceding trial, and the learning process is stopped after 30 braking trials.
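As an illustration of the supervised data-collection phase just described, a sketch is given below. Here `opi_supervisor`, `abs_reset` and `abs_step` are hypothetical placeholders standing in for the OPI control law and the laboratory ABS interface (they do not correspond to any real API), the switching instant of the two reference steps is assumed to be mid-episode, and the utility is an assumed SSE-type stage cost.

```python
import numpy as np

rng = np.random.default_rng(2)

def opi_supervisor(x, lam_ref):
    # hypothetical placeholder proportional law standing in for the real OPI controller
    return 0.5 * (lam_ref - x[0])

def abs_reset(car_speed):
    # hypothetical placeholder for starting a braking episode on the laboratory ABS
    return np.zeros(5)

def abs_step(u):
    # hypothetical placeholder returning the next state and the measured slip
    return np.zeros(5), 0.2

transitions, EP_LEN = [], 200                    # 2 s episodes at a 10 ms sampling period
while len(transitions) < 6000:                   # collect 6000 transition samples
    refs = rng.uniform(0.15, 0.35, size=2)       # two-step slip reference for exploration
    x = abs_reset(car_speed=180.0)
    for k in range(EP_LEN):
        lam_ref = refs[0] if k < EP_LEN // 2 else refs[1]   # assumed mid-episode switch
        u = opi_supervisor(x, lam_ref)
        x_next, lam = abs_step(u)
        utility = (lam - lam_ref) ** 2           # assumed SSE-type stage cost
        transitions.append((x, u, x_next, utility))
        x = x_next
```

The subsequent 30-trial online adaptation would then apply the `adhdp_step` update from the previous sketch to each new transition, with the exploration noise added to the actor output.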
The learned controller performance is compared with the NFQCDA-2 controller designed previously in this section and with the original OPI controller, all displayed in Fig. 10. The performance of the control system with the AC controller is improved over the OPI controller in terms of fewer oscillations in both the control signal and the controlled slip output, and it is comparable with the performance of the control system with the NFQCDA-2 controller, specifically $J_{SSE}^{100} = 0.6672$ for the SAC controller and $J_{SSE}^{100} = 0.6524$ for the NFQCDA-2 controller. A similar amount of data was necessary for learning the SAC controller and the NFQCDA-2 one. The former starts learning from a stabilizing controller that learns to mimic the OPI controller and already delivers a suboptimal performance, while the latter learns the control from scratch just using data collected with the relay controller, so the NFQCDA-2 is more data-efficient. In fact, the batch fitted Q-learning schemes are known to be more data-efficient than their adaptive counterparts. Extensive simulations show that the same performance improvement of the AC controller over the initial OPI one is observed for a faster sampling period (1 msec instead of 10 msec), but with much higher control performance in terms of better tracking of the slip setpoint. Concluding, the AC architecture control performance is dependent on the NN architectures, the sampling period and the learning parameters. The approximation capability of the three-layer NNs used by the AC controller is inferior to that of the four-layer NNs used in NFQCDA-1 and NFQCDA-2, but the powerful approximation capability of the latter is not always needed, since overfitting should generally be avoided. The advantage of the AC adaptive online Q-learning scheme over typical ADP adaptive schemes resides in the critic FA learning the Q-function estimate, which allows direct minimization over the control action and does not require the back-propagation of the correction through the value function and next through some process model. This could also be generalized to other illustrative nonlinear control applications [54]–[62].
Fig. 10. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-2 controller (green lines) and supervisor OPI controller (blue lines) and the actor-critic controller (red line): a) controlled slip and reference (black dotted); b) control signals.

5. Conclusion

The paper has suggested a model-free Q-learning approach to nonlinear state-feedback ABS slip control and has shown its effectiveness and generalization capability over two widely encountered output-feedback control strategies, namely a model-free relay feedback controller and a model-based OPI controller. The Q-learning controller inherits the best features of the other two controllers; it is more complex, with more parameters to be tuned (NN architectures with parameters and learning settings, Q-learning settings, cost function, etc.), but overall it exhibits higher performance, comparable with that obtained by another model-free NN VRFT controller. The number of transition samples necessary for learning increases when advanced features such as setpoint tracking are required at various speeds. The Q-learning controller training uses transition samples obtained either by exploiting the current iteration controller or from a supervisory controller of a different type.

The Q-learning control process for ABS is extremely amenable to practical implementation since it does not require an initial stabilizing controller and it is episodic in nature. These special features make it attractive for straightforward industrial applications.

Acknowledgement
This work was supported by grants of the Romanian National Authority for Scientific Research – the
Executive Agency for Higher Education, Research, Development and Innovation Funding, CNCS – UEFISCDI,
project numbers PN-II-RU-TE-2014-4-0207, PN-II-PT-PCCA-2013-4-0544 and PN-II-PT-PCCA-2013-4-0070.

References
[1] U. Kiencke, L. Nielsen, Automotive Control Systems, Springer-Verlag, New York, 2000.
[2] M.-B. Radac, R.-E. Precup, S. Preitl, J. K. Tar, E. M. Petriu, Linear and fuzzy control solutions for a laboratory anti-lock braking system, in: Proc. 6th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia, 2008, pp. 1–6.
[3] M.-B. Radac, R.-E. Precup, S. Preitl, J. K. Tar, J. Fodor, E. M. Petriu, Gain-scheduling and iterative feedback tuning of PI controllers for longitudinal slip control, in: Proc. 6th IEEE International Conference on Computational Cybernetics, Stara Lesna, Slovakia, 2008, pp. 183–188.
[4] M. Martinez-Gardea, I. J. M. Guzman, S. di Gennaro, C. A. Lua, Experimental comparison of linear and nonlinear controllers applied to an Antilock Braking System, in: Proc. 2014 IEEE Conference on Control Applications, Antibes, France, 2014, pp. 71–76.
[5] C. A. Lua, B. C. Toledo, S. di Gennaro, M. Martinez-Gardea, Dynamic control applied to a laboratory Antilock Braking System, Mathematical Problems in Engineering 2015 (2015), article ID 896859, 1–11.
[6] M. Martinez-Gardea, I. J. Mares-Guzman, C. A. Lua, I. Vazquez-Alvarez, Comparison of linear and nonlinear controller applied to an Antilock Braking System, in: Proc. 2014 IEEE International Autumn Meeting on Power Electronics and Computing, Ixtapa, Mexico, 2014, pp. 1–6.
[7] M.-B. Radac, R.-E. Precup, S. Preitl, J. K. Tar, K. J. Burnham, Tire slip fuzzy control of a laboratory Anti-lock Braking System, in: Proc. 2009 IEEE European Control Conference, Budapest, Hungary, 2009, pp. 940–945.
[8] A. Manivanna Boopathi, A. Abudhahir, Adaptive fuzzy sliding mode controller for wheel slip control in antilock braking system, Journal of Engineering Research 4 (2) (2016) 132–150.
[9] D. Antic, V. Nikolic, D. Mitic, M. Milojkovic, S. Peric, Sliding mode control of anti-lock braking system: an overview, Facta Universitatis: Automatic Control and Robotics 9 (1) (2010) 41–58.
[10] M. Dousti, S. C. Baslamisli, E. T. Onder, S. Solmaz, Design of a multiple-model switching controller for ABS braking dynamics, Trans. of the Inst. of Measurement and Control 37 (5) (2015) 582–595.
[11] T. Sardarmehni, A. Heydari, Optimal switching in anti-lock brake systems of ground vehicles based on approximate dynamic programming, in: Proc. 2015 ASME Dynamic Systems and Control Conference, Columbus, OH, 2015, pp. 1–10.
[12] M. C. Campi, A. Lecchini, S. M. Savaresi, Virtual reference feedback tuning: a direct method for the design of feedback controllers, Automatica 38 (8) (2002) 1337–1346.
[13] H. Hjalmarsson, Iterative feedback tuning – an overview, International Journal of Adaptive Control and Signal Processing 16 (5) (2002) 373–395.
[14] J. C. Spall, J. A. Cristion, Model-free control of nonlinear stochastic systems with discrete-time measurements, IEEE Transactions on Automatic Control 43 (9) (1998) 1198–1210.
[15] M.-B. Radac, R.-E. Precup, Data-based two-degree-of-freedom iterative control approach to constrained non-linear systems, IET Control Theory & Applications 9 (7) (2015) 1000–1010.
[16] M.-B. Radac, R.-E. Precup, E. M. Petriu, Model-free primitive-based iterative learning control approach to trajectory tracking of MIMO systems with experimental validation, IEEE Transactions on Neural Networks and Learning Systems 26 (11) (2015) 2925–2938.
[17] Z. S. Hou, S. Jin, A novel data-driven control approach for a class of discrete-time nonlinear systems, IEEE Transactions on Control Systems Technology 19 (6) (2011) 1549–1558.
[18] M. Fliess, C. Join, Model-free control, International Journal of Control 86 (2013) 2228–2252.
[19] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[20] P. J. Werbos, Approximate dynamic programming for real-time control and neural modeling, in: Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds., Van Nostrand Reinhold, New York, 1992, ch. 13.
[21] C. Watkins, P. Dayan, Q-learning, Machine Learning 8 (3–4) (1992) 279–292.
[22] F. Lewis, D. Vrabie, K. G. Vamvoudakis, Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers, IEEE Control Systems Magazine 32 (6) (2012) 76–105.
[23] F.-Y. Wang, H. Zhang, D. Liu, Adaptive dynamic programming: an introduction, IEEE Computational Intelligence Magazine 4 (2) (2009) 39–47.
[24] D. V. Prokhorov, D. C. Wunsch, Adaptive critic designs, IEEE Transactions on Neural Networks 8 (5) (1997) 997–1007.
[25] Z. Ni, H. He, X. Zhong, Experimental studies on data-driven heuristic dynamic programming for POMDP, in: Frontiers of Intelligent Control and Information Processing, D. Liu et al., Eds., World Scientific Publishing, Singapore, 2015, pp. 83–105.
[26] Z. Ni, H. He, Z. Zhong, D. V. Prokhorov, Model-free dual heuristic dynamic programming, IEEE Transactions on Neural Networks and Learning Systems 26 (8) (2015) 1834–1839.
[27] S. Formentin, S. M. Savaresi, L. Del Re, Non-iterative direct data-driven controller tuning for multivariable systems: theory and application, IET Control Theory & Applications 6 (9) (2012) 1250–1257.
[28] M.-B. Radac, R.-E. Precup, R.-C. Roman, Data-driven virtual reference feedback tuning and reinforcement Q-learning for model-free position control of an aerodynamic system, in: Proc. 24th IEEE Mediterranean Conference on Control and Automation, Athens, Greece, 2016, pp. 1126–1132.
[29] M.-B. Radac, R.-E. Precup, E. M. Petriu, S. Preitl, Iterative data-driven tuning of controllers for nonlinear systems with constraints, IEEE Transactions on Industrial Electronics 61 (11) (2014) 6360–6368.
[30] M.-B. Radac, R.-E. Precup, Model-free constrained data-driven iterative reference input tuning algorithm with experimental validation, International Journal of General Systems 45 (4) (2016) 455–476.
[31] F. Lewis, K. G. Vamvoudakis, Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 41 (1) (2011) 14–25.
[32] Z. Wang, D. Liu, Data-based controllability and observability analysis of linear discrete-time systems, IEEE Transactions on Neural Networks 22 (12) (2011) 2388–2392.
[33] J. Zhang, H. Zhang, B. Wang, T. Cai, Nearly data-based optimal control for linear discrete model-free systems with delays via reinforcement learning, International Journal of Systems Science 47 (7) (2016) 1563–1573.
[34] Y. Zhang, S. X. Ding, Y. Yang, L. Li, Data-driven design of two-degree-of-freedom controllers using reinforcement learning techniques, IET Control Theory & Applications 9 (7) (2015) 1011–1021.
[35] D. Zhao, Z. Hu, Z. Xia, C. Alippi, Y. Zhu, D. Wang, Full-range adaptive cruise control based on supervised adaptive dynamic programming, Neurocomputing 125 (2014) 57–67.
[36] S. Tognetti, S. M. Savaresi, C. Spelta, M. Restelli, Batch reinforcement learning for semi-active suspension control, in: Proc. 2009 IEEE Conference on Control Applications, St. Petersburg, Russia, 2009, pp. 582–587.
[37] The Laboratory Anti-lock Braking System Controlled from PC, User's Manual, Inteco Ltd., Krakow, Poland, 2007.
[38] R. Hafner, M. Riedmiller, Reinforcement learning in feedback control. Challenges and benchmarks from technical process control, Machine Learning 84 (1) (2011) 137–169.
[39] R. Hafner, M. Riedmiller, Neural reinforcement learning controllers for a real robot application, in: Proc. 2007 IEEE Conference on Robotics and Automation, Rome, Italy, 2007, pp. 2098–2103.
[40] D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research 6 (2005) 503–556.
[41] F. Ruelens, B. Claessens, S. Quaiyum, B. de Schutter, R. Babuska, R. Belmans, Reinforcement learning applied to an electric water heater: from theory to practice, IEEE Transactions on Smart Grid (2016), doi:10.1109/TSG.2016.2640184.
[42] M. Fliess, J. Levine, P. Martin, P. Rouchon, Flatness and defect of non-linear systems: introductory theory and examples, International Journal of Control 61 (6) (1995) 1327–1361.
[43] D. Liu, H. Javaherian, O. Kovalenko, T. Huang, Adaptive critic learning techniques for engine torque and air-fuel ratio control, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 38 (4) (2008) 988–993.
[44] P. Yan, D. Liu, D. Wang, H. Ma, Data-driven controller design for general MIMO nonlinear systems via virtual reference feedback tuning and neural networks, Neurocomputing 171 (2016) 815–825.
[45] M.-B. Radac, R.-E. Precup, Three-level hierarchical model-free learning approach to trajectory tracking control, Engineering Applications of Artificial Intelligence 55 (2016) 103–118.
[46] D. Zhao, B. Wang, D. Liu, A supervised actor-critic approach for adaptive cruise control, Soft Computing 17 (11) (2013) 2089–2099.
[47] G. K. Venayagamoorthy, R. G. Harley, D. C. Wunsch, Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Transactions on Neural Networks 13 (3) (2002) 764–773.
[48] M.-B. Radac, R.-E. Precup, R.-C. Roman, Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning, International Journal of Systems Science 48 (5) (2017) 1071–1083.
[49] Y. Huang, D. Liu, Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm, Neurocomputing 125 (2014) 46–56.
[50] R. Song, W. Xiao, H. Zhang, Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm, Neurocomputing 119 (2013) 212–221.
[51] D. Wang, D. Liu, Neuro-optimal control for a class of unknown nonlinear dynamic systems using SN-DHP technique, Neurocomputing 121 (2013) 218–225.
[52] J. Zhang, H. Zhang, Y. Luo, T. Feng, Model-free optimal control design for a class of linear discrete-time systems with multiple delays using adaptive dynamic programming, Neurocomputing 135 (2014) 163–170.
[53] B. Luo, D. Liu, T. Huang, D. Wang, Model-free optimal tracking control via critic-only Q-learning, IEEE Transactions on Neural Networks and Learning Systems 27 (10) (2016) 2134–2144.
[54] I. Škrjanc, S. Blažič, S. Oblak, J. Richalet, An approach to predictive control of multivariable time-delayed plant: Stability and design issues, ISA Transactions 43 (4) (2004) 585–595.
[55] R. E. Haber, J. R. Alique, Fuzzy logic-based torque control system for milling process optimization, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews 37 (5) (2007) 941–950.
[56] F. G. Filip, Decision support and control for large-scale complex systems, Annual Reviews in Control 32 (1) (2008) 61–70.
[57] J. Vaščák, K. Hirota, Integrated decision-making system for robot soccer, Journal of Advanced Computational Intelligence and Intelligent Informatics 15 (2) (2011) 156–163.
[58] R.-E. Precup, M.-B. Radac, M. L. Tomescu, E. M. Petriu, S. Preitl, Stable and convergent iterative feedback tuning of fuzzy controllers for discrete-time SISO systems, Expert Systems with Applications 40 (1) (2013) 188–199.
[59] Z. Wang, J. Hu, H. Dong, On general systems with randomly occurring incomplete information, International Journal of General Systems 45 (5) (2016) 479–485.
[60] S. Bouaziz, H. Dhahri, A. M. Alimi, A. Abraham, Evolving flexible beta basis function neural tree using extended genetic programming & hybrid artificial bee colony, Applied Soft Computing 47 (2016) 653–668.
[61] C. Candea, F. G. Filip, Towards intelligent collaborative decision support platforms, Studies in Informatics and Control 25 (2) (2016) 143–152.
[62] Ö. Türkşen, M. Tez, An application of Nelder-Mead heuristic-based hybrid algorithms: Estimation of compartment model parameters, International Journal of Artificial Intelligence 14 (1) (2016) 112–129.

Radu-Emil Precup received the Dipl.Ing. (with honors) degree in automation and computers from the

“Traian Vuia” Polytechnic Institute of Timisoara, Timisoara, Romania, the Dipl. degree in mathematics from

the West University of Timisoara, Timisoara, and the Ph.D. degree in automatic systems from the Politehnica

University of Timisoara (UPT), Timisoara, Romania, in 1987, 1993, and 1996, respectively. He is currently with

the UPT, Timisoara, Romania, where he became a Professor with the Department of Automation and Applied

Informatics in 2000, and is currently a Doctoral Supervisor of Automation and Systems Engineering. He is also

an Adjunct Professor within the School of Engineering, Edith Cowan University, Joondalup, WA, Australia, and

an Honorary Professor and a Member of the Doctoral School of Applied Informatics with the Óbuda University,

Budapest, Hungary. From 1999 to 2009, he held research and teaching positions with the Université de Savoie,

Chambéry and Annecy, France, Budapest Tech Polytechnical Institution, Budapest, Hungary, Vienna University

of Technology, Vienna, Austria, and Budapest University of Technology and Economics, Budapest, Hungary.
He has been an Editor-in-Chief of the International Journal of Artificial Intelligence since 2008, and he is also

on the editorial board of several other prestigious journals.

He has published more than 300 papers in refereed journals, refereed conference proceedings, and

contributions to books. His current research interests cover intelligent control systems, data-driven control, and

nature-inspired algorithms for optimization.



He is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of the

Technical Committee (TC) on Virtual Systems in Measurements of the IEEE Instrumentation & Measurement

Society, the Task Force on Autonomous Learning Systems within the Neural Networks TC of the IEEE

Computational Intelligence Society, the Subcommittee on Computational Intelligence as part of the TC on



Control, Robotics and Mechatronics in the IEEE Industrial Electronics Society, the Task Force on Educational

Aspects of Standards of Computational Intelligence as part of the TC on Standards in the IEEE Computational

Intelligence Society, the International Federation of Automatic Control (IFAC) Technical Committee on

Computational Intelligence in Control (previously named Cognition and Control), the Working Group WG 12.9

on Computational Intelligence of the Technical Committee TC12 on Artificial Intelligence of the International

Federation for Information Processing (IFIP), the European Society for Fuzzy Logic and Technology

(EUSFLAT), the Hungarian Fuzzy Association, and the Romanian Society of Control Engineering and

Technical Informatics.

Prof. Precup received the “Grigore Moisil” Prize from the Romanian Academy, two times, in 2005 and

2016, for his contribution on fuzzy control and the optimization of fuzzy systems, the Spiru Haret Award from

the National Grand Lodge of Romania in partnership with the Romanian Academy in 2016 for education,

environment and IT, the Excellency Diploma of the International Conference on Automation, Quality &

Testing, Robotics AQTR 2004 (THETA 14, Cluj-Napoca, Romania), two Best Paper Awards in the Intelligent

Control Area of the 2008 Conference on Human System Interaction HSI 2008, Krakow (Poland), the Best Paper

Award of 16th Online World Conference on Soft Computing in Industrial Applications WSC16 (Loughborough

University, UK) in 2011, the Certificate of Appreciation for the Best Paper in the Session TT07 1 Control

Theory of 39th Annual Conference of the IEEE Industrial Electronics Society IECON 2013 (Vienna, Austria),

and a Best Paper Nomination at 12th International Conference on Informatics in Control, Automation and

Robotics ICINCO 2015 (Colmar, France).

Mircea-Bogdan Radac received the Dipl.Ing. degree in systems and computer engineering and the Ph.D.

degree in systems engineering from the Politehnica University of Timisoara (UPT), Timisoara, Romania, in

2008 and 2011, respectively. He is currently with the UPT, Timisoara, Romania, where he became a Lecturer in

the Department of Automation and Applied Informatics in 2014.



He has published more than 100 papers in refereed journals, conference proceedings, and contributions to

books. His current research interests cover control structures and algorithms and data-driven control with focus

on iterative methods in control design and optimization.



He is a member of the Institute of Electrical and Electronics Engineers (IEEE), the Technical Committees on

Computational Cybernetics and Cyber-Medical Systems of the IEEE Systems, Man, and Cybernetics Society,

and the Romanian Society of Control Engineering and Technical Informatics.

Dr. Radac received the “Grigore Moisil” Prize from the Romanian Academy in 2016 for his contribution on

the optimization of fuzzy systems, and the Certificate of Appreciation for the Best Paper in the Session TT07 1

Control Theory of 39th Annual Conference of the IEEE Industrial Electronics Society IECON 2013 (Vienna,

Austria).
Figure Captions

Fig. 1. The laboratory equipment setup of the wheels [37].

Fig. 2. Slip-friction curve [2], [37].

Fig. 3. NFQCDA-1 and NFQCDA-2 feedback control system.

Fig. 4. Braking results for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) normalized upper wheel speed; b) normalized lower wheel speed; c) controlled slip and reference (black dotted); d) control signals and actuator dead zone activation threshold (black dotted).

Fig. 5. Braking results at 180 rad/s for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).

Fig. 6. Braking results at 90 rad/s for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).

Fig. 7. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).

Fig. 8. Braking results at 70 rad/s with setpoint tracking for the control systems with NFQCDA-1 controller (green lines), relay feedback controller (blue lines) and OPI controller (red lines): a) controlled slip and reference (black dotted); b) control signals and actuator dead zone activation threshold (black dotted).

Fig. 9. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-2 controller (green lines), supervisor relay feedback controller (blue lines) and NN VRFT controller (red lines): a) controlled slip and reference (black dotted); b) control signals.

Fig. 10. Braking results at 180 rad/s with setpoint tracking for the control systems with NFQCDA-2 controller (green lines) and supervisor OPI controller (blue lines) and the actor-critic controller (red line): a) controlled slip and reference (black dotted); b) control signals.