Continuous State Space Q-Learning
for Control of Nonlinear Systems
ACADEMIC DISSERTATION
to obtain the degree of doctor
at the Universiteit van Amsterdam,
under the authority of the Rector Magnificus
prof. dr. J.J.M. Franse,
before a committee appointed by the board of promotions,
to be defended in public in the Aula of the University
on Wednesday, 21 February 2001, at 12:00
by
Stephanus Hendrikus Gerhardus ten Hagen
born in Neede
Promotor: Prof. dr. ir. F.C.A. Groen
Co–promotor: Dr. ir. B.J.A. Kröse
Commissie: Prof. dr. H.B. Verbruggen
Prof. dr. P. Adriaans
Prof. dr. J.N. Kok
Dr. M.A. Wiering
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
This work was part of STW project AIF.3595: “Robust Control of Nonlinear Systems
using Neural Networks”. It was carried out in graduate school ASCI.
Advanced School for Computing and Imaging
Copyright © 2001 by Stephan H.G. ten Hagen. All rights reserved.
Contents
1 Introduction 1
1.1 Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Designing the state feedback controller . . . . . . . . . . . . . . . . 3
1.1.2 Unknown systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Overview of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Reinforcement Learning 11
2.1 A Discrete Deterministic Optimal Control Task . . . . . . . . . . . . . . . 11
2.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 The solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 The Stochastic Optimization Tasks . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 The Markov Decision Process . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Model based solutions . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Reinforcement Learning solutions . . . . . . . . . . . . . . . . . . . 16
2.2.4 Advanced topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 RL for Continuous State Spaces . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Continuous state space representations . . . . . . . . . . . . . . . . 22
2.3.2 Learning in continuous domains . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 LQR using Q-Learning 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 LQR with an Unknown System . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Linear Quadratic Regulation . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 System Identification . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 The Q-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.5 The Performance Measure . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 The Influence of Exploration . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 The estimation reformulated . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 The System Identification approach . . . . . . . . . . . . . . . . . . 38
3.3.3 The LQRQL approach . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 The Exploration Characteristics . . . . . . . . . . . . . . . . . . . . 45
3.4 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Exploration Characteristic . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 LQRQL for Nonlinear Systems 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 The nonlinear system . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 Nonlinear approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 The Extended LQRQL Approach . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 SI and LQRQL for nonlinear systems . . . . . . . . . . . . . . . . . 56
4.3.2 The extension to LQRQL . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.3 Estimating the new quadratic Q-function . . . . . . . . . . . . . . . 58
4.4 Exploration Characteristic for Extended LQRQL . . . . . . . . . . . . . . 59
4.5 Simulation Experiments with a Nonlinear System . . . . . . . . . . . . . . 63
4.5.1 The nonlinear system . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.2 Experiment with a nonzero average w. . . . . . . . . . . . . . . . . 66
4.5.3 Experiments for different average w . . . . . . . . . . . . . . . . . . 68
4.6 Experiments on a Real Nonlinear System . . . . . . . . . . . . . . . . . . . 70
4.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6.2 The experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Neural Q-Learning using LQRQL 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Neural Nonlinear Q-functions . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Neural LQRQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Deriving the feedback function . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Training the Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Simulation Experiments with a Nonlinear System . . . . . . . . . . . . . . 83
5.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.2 The experimental procedure . . . . . . . . . . . . . . . . . . . . . . 85
5.5.3 The performance of the global feedback functions . . . . . . . . . . 87
5.5.4 Training with a larger train set . . . . . . . . . . . . . . . . . . . . 88
5.6 Experiment on a Real Nonlinear System . . . . . . . . . . . . . . . . . . . 90
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6 Conclusions and Future work 95
6.1 LQR and Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Extended LQRQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Neural Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 General Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A The Least Squares Estimation 101
A.1 The QR-Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
A.2 The Least Squares Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 101
A.3 The SI Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B The Mobile Robot 105
B.1 The robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.2 The model of the robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
C Notation and Symbols 107
C.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
C.2 Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography 110
Summary 117
Samenvatting 118
Acknowledgments 120
Chapter 1
Introduction
In everyday life we control many systems. We do this when riding a bicycle, changing
the temperature in a room or driving through a city to a certain destination. Although these
activities are quite different, they have in common that we try to influence some physical
properties of a system on the basis of certain measurements. In other words, we try to
control a system.
Figure 1.1 shows the configuration of a control task. The system represents the physical
system we have. Suppose we want to change the temperature in a room; then the room is
the system. The properties of the system that do not change form the parameters of the
system. The height of the room is such a parameter. The properties of the system that
are not constant form the state of the system. So in our example the temperature is the
state of the system.
The state can change, depending on the current state value and the parameters of
the system. Because we want to control the system, we should also be able to influence
the state changes. The influence we can have on the state changes is called the control
action. So by applying the right control action at each time step, we try to change the state to
Figure 1.1. The control configuration. (Diagram: the Controller maps the State to a Control Action applied to the System, which is characterized by its Parameters and subject to a Disturbance.)
some desired value. In our example this means we have a heater that can be switched on
and off. The disturbance in figure 1.1 indicates a change in the system or in the state that
cannot be controlled. The control actions have to be chosen such that they reduce the effect
of the disturbance. So if the temperature in our room drops because a window is opened,
the heater should be switched on.
Instead of applying all actions “by hand”, it is also possible to design an algorithm
that computes the appropriate control actions given the state. This algorithm is called the
controller. So based on the measurement of the temperature, the controller has to decide
whether to switch the heater on or off. Before designing a controller, one has to specify the
desired behavior of the system and controller. Then, based on the desired behavior and all
available information about the system, a controller is designed according to a certain design
procedure. This design procedure provides a recipe to construct a controller. After the
design one checks whether the design specifications are met.
A different approach to obtain a controller is to design a control architecture in which
the controller itself learns how to control the system. In order to do this, the architecture
should incorporate a mechanism to interact with the system and store the gained experiences.
Based on these experiences the controller should infer which action is most appropriate
for the control task. This seems a very natural approach, resembling the way we
learn how to control.
Machine Learning (ML) is a subfield of Artificial Intelligence research in which
inference mechanisms are studied. From an ML point of view, the inference mechanism
in the control architecture described above is called Reinforcement Learning (RL). The
control architecture receives a scalar evaluation, called the reinforcement, for each action
applied to the system. In this way the consequences of actions can be associated with the
states in which these actions were applied. In this thesis we will focus on Q-Learning, a
particular kind of RL in which the consequences of actions are associated with each state
and action combination. Then, for each state, the action with the most desirable
consequences can be selected.
In this chapter we will first formalize our control task and give an overview of different
controller design approaches. We present some problems and show how they are solved.
Then we give a short description of RL. Finally we present our problem statement
and give an overview of the remainder of this thesis.
1.1 Control

If the control of a system has to be automated, a controller has to be designed. This is
normally the task of control engineers. Control theory provides the mathematical framework
in which the controller design task can be formulated. In this framework the real physical
system is described by an abstract representation called the model.
In order to find the controller, the model should indicate the influence of the control
action on the state change. The function describing the state change is called the state
transition. In this thesis we will only consider discrete-time systems. If the state is
represented as a vector x ∈ ℝ^{n_x}, the control action as a vector u ∈ ℝ^{n_u} and the noise as
a vector v ∈ ℝ^{n_v}, the state transition is given by:¹

x_{k+1} = f(x_k, u_k, v_k).    (1.1)
This is the model of the system. The subscript k in (1.1) represents the time step. The
function f is a mapping ℝ^{n_x + n_u + n_v} → ℝ^{n_x} that represents the state transition. The
parameters of the system are included in f.
In (1.1) the vector v can represent external disturbances that influence the state
transition. But the model forms an abstract approximation of the system that does not
necessarily have to be complete. So v can also be used to take into account the
consequences of the unmodeled dynamics of the system. In the design of the controller, v is
regarded as a stochastic variable.
Note that the model in (1.1) represents a Markov process. This means that the next
state x_{k+1} does not depend on past states and actions. Throughout this thesis we will
assume a model like (1.1). This is a reasonable assumption because all higher-order Markov
processes can be represented as first-order Markov processes as in (1.1). This is possible
by combining the past states that influence the state transition and using this combination
as the new state. Then the model in (1.1) can be used to describe the change of the new state.
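The state augmentation described above can be sketched in a few lines; the second-order system below and its coefficients are purely illustrative, not taken from the thesis:

```python
import numpy as np

# A hypothetical second-order system: the next state value depends on
# both the current and the previous state value.
def f_second_order(x_k, x_km1, u_k):
    return 0.8 * x_k - 0.2 * x_km1 + u_k

# Augmented state z_k = (x_k, x_{k-1}): the model becomes a first-order
# Markov process of the form z_{k+1} = f(z_k, u_k), as in (1.1).
def f_augmented(z_k, u_k):
    x_k, x_km1 = z_k
    return np.array([f_second_order(x_k, x_km1, u_k), x_k])

z = np.array([1.0, 0.5])       # (x_1, x_0)
z = f_augmented(z, u_k=0.1)    # (x_2, x_1)
```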
Another issue not reflected in the model of (1.1) is the measurements. In a real system
the value of the state is not always available through measurements by sensors. Still,
measurements are made because they are needed to control the system. These measurements
form an extra output of the system. To take this into account in the model in (1.1), an
extra function has to be specified that maps x, u and v to the measurements. We will not
include it in the model, and so we will assume that the state values are measured directly.
The controller has not been described yet. Since we will only consider models like
(1.1), we have to assume that the state value is available to the controller. We will also
assume that the controller only uses the present state value. This leaves only one possible
candidate: a state feedback controller. This means that the controller is formed by a
functional mapping from state to control action, so u = g(x). With these assumptions,
designing the controller becomes a matter of finding the appropriate g. The function g
will be called the feedback.
1.1.1 Designing the state feedback controller

The goal of the design is to find the best g to control the system. This means that the
desired behavior of the system plus controller should be specified. There are two main
design principles:
• Optimality
A design method based on optimal control optimizes some criterion. This criterion
gives a scalar evaluation of all state values and control actions. Traditionally the
evaluation represents costs, so the task is to find the feedback that leads to the
lowest costs.

¹ For continuous-time systems the model represents the change of the state, rather than the new state value.
• Stability
A design method based on stability prevents the system from becoming unstable.
The state value of an unstable system can grow beyond any bounds and potentially
damage the real system.
The main difference between these design principles is that an optimality-based design aims
at the best possible behavior, while a stability-based design aims at preventing the
worst possible behavior. Note that these principles do not exclude each other. For an
unstable system the costs would be very high, so an optimality-based design would also try
to stabilize the system. And if the controller stabilizes the system, then it is still possible
to choose the controller that performs best given an optimal control criterion.
The two design principles still do not indicate how the feedback is computed. To explain
this it is convenient to first introduce an important class of systems: the linear systems.
The model of a linear system is given by (1.1) where f represents a linear function of the
state, the control action and the noise:

x_{k+1} = A x_k + B u_k + v_k,    (1.2)

where the matrices A and B have the proper dimensions and represent the parameters of the
linear system.
When a stability-based design method is applied, a quantification of the stability can
be used to select the parameters of the feedback. Suppose that the feedback is also linear:

u = Lx,    (1.3)

where L ∈ ℝ^{n_u × n_x} represents the linear feedback. This makes it possible to describe the
state transition based on the closed loop. Applying (1.3) to (1.2) gives² x_{k+1} = (A + BL) x_k = D x_k, where D represents the parameters of the closed loop. A state value in the
future can be computed by multiplying the matrices D, so x_{k+N} = D^N x_k. It is now easy
to see that the eigenvalues of D give an indication of the stability. If their magnitudes are larger than
one the system is unstable, and if they are all smaller than one the system is stable. The
closer the eigenvalues of D are to zero, the faster the state value will approach zero. If
we include the noise v_k, then this will disturb the state transition. The eigenvalues of D
determine how many time steps it will take before the effect of the noise at time step k
can be neglected.
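The eigenvalue test above is easy to check numerically. A minimal sketch, in which the matrices A, B and the feedback L are illustrative values, not an example from this thesis:

```python
import numpy as np

# Illustrative linear system x_{k+1} = A x_k + B u_k with linear feedback u = L x.
A = np.array([[1.1, 0.3],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
L = np.array([[-0.5, -0.6]])

D = A + B @ L                                  # closed-loop transition matrix
radius = max(abs(np.linalg.eigvals(D)))        # largest eigenvalue magnitude

# The closed loop is stable when all eigenvalues of D lie inside the unit
# circle; the closer to zero, the faster x_k decays to zero.
print("spectral radius:", radius, "stable:", radius < 1.0)
```

Here the open-loop system is unstable (A has an eigenvalue 1.1), but the feedback places both closed-loop eigenvalues inside the unit circle.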
For an optimality-based design method, first an evaluation criterion has to be chosen
that indicates the desired behavior of the system. This criterion usually represents the
cost of being in a certain state or applying certain control actions. Since the system is
dynamic, the future consequences of the actions on the costs should also be taken into
² For simplicity we ignore the noise.
account. This makes the optimization less trivial. Different methods are available to find
the best controller:
• Linear Quadratic Regulation
This applies when the system is linear and the costs are given by a quadratic function of the
state and control action. It can be shown that the optimal control action is a linear
function of the state [11][3]. The total future cost for each state forms a quadratic
function of the state value. The parameters of this function can be found by solving
an algebraic equation. The optimal linear function then follows from the system and
the quadratic function.
• Model Based Predictive Control
Using the model of the system, the future of the system can be predicted to give
an indication of the future costs. The system does not have to be linear and the
costs do not have to be given by a quadratic function. The optimal control sequence
for a finite horizon is computed, and the first action from this sequence is applied to
the system. In the next time step the whole procedure is repeated, making this
a computationally expensive method. For linear systems with constraints there are
several optimization methods available to efficiently compute the optimal action at
each time step [58][42].
• Dynamic Programming
The state and action space can be quantized, and then Dynamic Programming can
be used [10][11][57]. Starting at a finite point in the future, the optimal control sequence
is computed backwards using the model of the system.
Dynamic Programming will be used in chapter 2 to introduce reinforcement learning. In
chapter 3 we will use Linear Quadratic Regulation.
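For the linear-quadratic case, the optimal linear feedback can be computed by iterating the discrete-time Riccati recursion backwards until it converges. A minimal sketch; the system matrices A, B and the cost weights Q, R below are illustrative choices, not an example from this thesis:

```python
import numpy as np

# Illustrative discrete-time double integrator and quadratic cost weights.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)          # state cost  x' Q x
R = np.eye(1)          # action cost u' R u

# Riccati iteration: P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA.
# At convergence, x' P x is the total future cost of starting in state x.
P = Q.copy()
for _ in range(1000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

L_opt = -K                                        # optimal feedback u = L x
radius = max(abs(np.linalg.eigvals(A + B @ L_opt)))
```

As the text states, the total future cost is quadratic in the state (parameterized by P) and the optimal feedback is linear; the resulting closed loop is stable, so the spectral radius of A + B L is below one.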
The linear system in (1.2) was introduced to simplify the explanation. Controller design
methods are best understood when applied to linear systems. The main reason for this is
that models of linear systems are mathematically convenient and the properties of these
systems are well defined. Since not all systems are linear, controller design approaches
for nonlinear systems as in (1.1) were developed; for an overview of these methods see
[37]. An alternative approach is to formalize the representation of the nonlinear system in
such a way that many well-defined properties of linear systems can be used to describe the
properties of nonlinear systems [54]. These properties do not have to be globally valid, so
the state values for which they hold should also be given.
1.1.2 Unknown systems

The controller design methods described above can only be applied if the model of the system
is known exactly. This is generally not the case, for example when the system is not modeled
entirely or the specific values of the parameters are not known. Therefore system identification
methods were developed to approximate the model of the system based on measurements.
The measurements are generated by applying control actions to the system. The control
actions have to be chosen such that they sufficiently excite the system. In that case the
properties of the state transition function can be estimated from the data set containing all
measurements and control actions. There are different possibilities available for identifying
the model of the system:
• Linear models
If the system is assumed to be linear, the function describing the state transition is
given; only the parameters of the model have to be estimated. The identification
methods for linear systems are well understood [46][22]. The parameters are usually
estimated using a (recursive) linear least squares estimation method.
• Local models
One global model can be built up from many smaller local models. These local
models are only valid in a restricted part of the state space, or their contribution to
the output of the global model is restricted. In Gain Scheduling (GS) [3] the output
of the global model is computed based on only one local model; the value of the
state determines which local model to use.
In locally weighted representations [4][5] all local models can contribute to the output
of the global model. The contribution of the outputs of the local models depends on
the weighting, and the weighting depends on the value of the state. The local models
can be linear. By combining more than one linear model, a nonlinear function can be
approximated by the global model.
An alternative approach is to use Fuzzy logic, where reasoning determines the local
models to use. In case of overlapping membership functions, the combination of
local models is used to compute the control action [73]. If the system is unknown,
the membership functions are hard to determine. In that case it is possible to use
adaptive membership functions that are trained using data from the system.
• Nonlinear models
When the system is nonlinear, it is not possible to use a linear model if the deviation
of this model from the system is too large. This means the parameters of a nonlinear
model have to be found. If the function class of the system is unknown, general function
approximators like neural networks can be used [53][36][86][63]. The parameters
of the model are found by using a supervised learning method to train the networks.
This means that the quadratic error between the real next state value and the value given by
the model is minimized.
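For the linear case above, the least squares identification can be sketched in a few lines. A minimal example under illustrative assumptions: the "true" system, the noise level, and the amount of data are all made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" linear system x_{k+1} = A x_k + B u_k + v_k, unknown to the estimator.
A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
B_true = np.array([[0.0], [0.5]])

# Generate data by sufficiently exciting the system with random actions.
X, U, X_next = [], [], []
x = np.zeros(2)
for _ in range(200):
    u = rng.normal(size=1)
    x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); X_next.append(x_next)
    x = x_next

# Stack regressors [x_k, u_k] and solve for [A B] by linear least squares.
Phi = np.hstack([np.array(X), np.array(U)])
Theta, *_ = np.linalg.lstsq(Phi, np.array(X_next), rcond=None)
A_est, B_est = Theta.T[:, :2], Theta.T[:, 2:]
```

With enough excitation, the estimated parameters approach the true ones despite the noise v.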
Once the model of the system is found, it can be used to design a controller. It is also
possible to do this online, in which case it is called indirect adaptive control. This can be
applied when the system varies in time or when the model forms a simplified local
description of the system. The identification is used to find the parameters of the model,
and this model is used to tune the parameters of the controller.
If we adapt the controller, it does not necessarily have to be based on the estimated model
of the system. It is also possible that the design specification can be estimated directly from
the measurements. This is called direct adaptive control. In this case different possibilities
exist:
• Self-tuning regulator
The self-tuning regulator as described in [3] directly estimates the parameters required
to adjust the feedback. This can be based on the difference between the state value
and some reference signal.
• Model modification
For a nonlinear system it is possible to use an approximator to change
the current model into a model that makes control easier. An example is using a
feedforward network to model the inverse kinematics of a robot, as in [64]. Another
example is feedback linearization, where a nonlinear feedback function is adapted to
compensate for the nonlinearity in the system [78].
1.2 Reinforcement Learning

Reinforcement Learning [40][12][71][30], from a Machine Learning [48] point of view, is a
collection of algorithms that can be used to optimize a control task. Initially it was
presented as a “trial and error” method to improve the interaction with a dynamical
system [9]. Later it was established that it can also be regarded as a heuristic kind of
Dynamic Programming (DP) [80][83][8]. The objective is to find a policy, a function that
maps the states of the system to control actions, that optimizes a performance criterion.
Compared with control, we can say that the policy represents the feedback function. In
RL, the feedback function is optimized during the interaction using an evaluation from the
system. Therefore we can also regard RL as a form of adaptive optimal control. During the
interaction each state transition is evaluated, resulting in a scalar reinforcement. The
performance criterion to optimize is usually given by the expected sum of all reinforcements.
This criterion has to be maximized if the reinforcements represent rewards and minimized
if they represent costs.
The value function represents the expected sum of future reinforcements for each state.
In RL the optimization is performed by first estimating the value function. The value
function can be approximated by using Temporal Difference learning. In this learning rule,
the approximated value of a state is updated based on the difference with the approximated
value of the next state. This difference should agree with the reinforcement received during
that state transition.
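The Temporal Difference update can be sketched as a tabular TD(0) rule. The chain environment below and all parameters are illustrative; reinforcements represent costs, matching the convention above:

```python
# A small chain of states 0..4; each step moves one state to the right at a
# cost of 1, until the terminal state 4 is reached.
n_states, terminal = 5, 4
V = [0.0] * n_states          # approximated value of each state
alpha, gamma = 0.1, 1.0       # learning rate and discount factor

for _ in range(2000):
    s = 0
    while s != terminal:
        s_next = s + 1
        r = 1.0               # reinforcement received during the transition
        # TD(0): move V(s) toward r + gamma * V(s_next), i.e. the difference
        # between successive values should agree with the reinforcement.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

Here V[s] converges to the expected sum of future costs from state s, i.e. 4, 3, 2, 1, 0 along the chain.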
When the (approximated) future performance is known for each state, then given the
present state it is possible to select the “preferred” next state. But this still does not
indicate what action should be taken. For this a model of the system is required. With this
model, the action that has the highest probability of bringing the system to the “preferred”
next state can be selected. Because such a model is not always available, model-free RL
techniques were developed. Q-Learning is model-free RL [80][81]. The idea is that the
sum of future reinforcements can be approximated as a function of the state and the
action. This function is called the Q-function, and it has a value for each state and action
combination. The Q-function can be used to select the action for which the value of the
Q-function is optimal, given the present state. This model-free optimization is only possible
because the optimization is performed online by interacting with the system. Therefore
we can see Q-Learning as a direct adaptive control scheme.
Approximations are made for the visited states and the actions taken in these states. The
use of the approximation to choose the best action can only rely on the actions that
are tried often enough in these states. A deterministic policy, which always maps the
state to the same action, cannot be used to generate the data for the approximation. So
the actions taken during the interaction should not only depend on the existing policy,
but should also depend on a random “trial” process. This process is referred to as the
exploration. Exploration is of utmost importance because it determines the search space
in which the optimal policy is searched for. If the search space is large enough, then it can
be proven that RL converges to the optimal solution [68][25][24][39]. But these proofs rely
on perfect backups of visited states and applied actions in lookup tables. So guarantees
for convergence to the optimal solution can only be given when there is a finite number
of states and actions.
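Tabular Q-Learning with a random "trial" process for exploration can be sketched as follows. The task (a line of states with left/right actions and unit step costs) and the ε-greedy exploration scheme are illustrative choices, not the specific algorithm of this thesis:

```python
import random

random.seed(1)

# States 0..5 on a line; actions move left (-1) or right (+1).
# Each step costs 1 until the terminal state 5 is reached.
n_states, terminal, actions = 6, 5, (-1, +1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.2, 1.0, 0.1

for _ in range(3000):
    s = 0
    while s != terminal:
        # Exploration: with probability eps take a random trial action,
        # otherwise the action with the lowest Q-value (costs are minimized).
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = min(actions, key=lambda b: Q[(s, b)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0
        # Q-Learning backup: the value of (s, a) is moved toward the cost
        # plus the best (lowest) Q-value available in the next state.
        target = r + gamma * min(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
```

After learning, the greedy action in every state is "right", and Q[(s, +1)] approximates the remaining cost to reach the terminal state.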
That the convergence proofs rely on a discrete state and action space does not mean
that RL cannot be applied to systems with a continuous state and action space as in (1.1).
In [65] RL is used to obtain a controller for a manufacturing process for thermoplastic
structures, followed by a molding process and chip fabrication in [66]. In [2] and [59] a
thermostat controller is obtained. A controller for satellite positioning is found in [61]. The
use of RL for chemical plants and reactors is described in [47][59] and [60]. In [26] RL is
used to control a “fuzzy” ball–and–beam system. The approaches used for these applications
were primarily based on heuristics, so there is hardly any theoretical understanding of the
closed-loop performance of these approaches.
1.3 Problem Statement

The reinforcement learning approach has been successfully applied as a design approach
for finding controllers. In the previous section we gave some examples. These
approaches were all, except for [26], performed on simulated systems. In most cases the
resulting controllers were used afterwards to control the real system.
The advantage of the reinforcement learning approach is that it does not require any
information about the system. Only the measurements, generated by “trial and error”,
are used to estimate the performance. When these approaches are applied in simulation,
this advantage is not exploited: the perfect model of the system is still required. Still, the
successful results suggest that RL can be used to find good controllers.
There are two reasons why most RL approaches are applied in simulation and not
directly on the real system. The first reason is that learning can be very slow. This is
because the process of interacting with the system has to be repeated several times. The
duration of one time step for the real system is given by the system itself, and for a slow
process like a chemical plant learning will take a very long time. In simulation, the duration
of one time step is given by the time it takes to simulate one time step. If learning takes
too much time, a faster computer can be used to speed up the learning process.
A more serious problem with RL applied to a real system is that most algorithms are
based on heuristics. This implies that the outcome will not always be completely
understood. For a real system it is very important that properties, like stability, can be
guaranteed to hold. Instability may damage the system. This means that
without a better understanding, RL algorithms cannot be applied directly to a real system,
so their advantage cannot be exploited.
The goal in this thesis is to make reinforcement learning more applicable as controller
design approach for real systems. This means we want to derive methods that:
• are able to deal with continuous state space problems. Most reinforcement learning
approaches are based on systems with discrete state and action space conﬁgurations.
However, controllers for real systems often have continuous state values as input and
continuous actions as output.
• are able to deal with nonlinear systems.
• do not require large disturbances on the system. The excitation of the system
involves adding a random component to the control action. For real systems this is
not desirable, so we want to minimize the amount of excitation required.
• do not require too much data. The data is obtained by controlling the system, which
means that the sample time of the system determines how long it takes to generate
one data point. For a large sample time this means that generating a lot of data takes
a long time, during which the system cannot be used.
• provide ways to interpret the results. This means that we want to have an indication
of the quality of the resulting controller without testing it on the real system. Also
we want to be able to see how we can change the settings to achieve better results.
A method that has none of these properties will not be very useful. Therefore all the
methods we derive will have at least some of these properties.
1.4 Overview of this Thesis
This thesis is about RL, and therefore we start with a more in-depth introduction to
RL in chapter 2. Since most theoretical results in RL address configurations with discrete
state and action spaces, these configurations will be used to explain the basic idea behind
RL. Then we describe RL methods for continuous state space configurations, because
these are the systems we want to control. After that we reﬁne our problem statement and
choose a direction for our investigation.
In chapter 3 we choose the Linear Quadratic Regulation (LQR) framework as the optimal
control task. We introduce LQRQL as a Q-learning approach to obtain the feedback for
this framework. We compare the results with an indirect approach that first estimates
the parameters of the system and uses these to derive the feedback. By introducing the
exploration characteristics we reveal the influence of the amount of exploration on the
resulting feedbacks of both methods. Based on this we can prove whether the resulting
feedback will be an improvement, given the amount of exploration used.
In chapter 4 we investigate the consequences of applying LQRQL on nonlinear systems.
We will indicate a possible shortcoming and introduce Extended LQRQL as a solution.
This will result in a linear feedback with an additional constant. This approach and the
two approaches from chapter 3 will be tested on a nonlinear system in simulation and on
the real system.
In chapter 5 we will introduce Neural Q-Learning. The difference with the approach
in [65] is that it only uses one neural network. The controller is derived in the same way
as in LQRQL. It can be used to obtain a linear and a nonlinear feedback function. The
conclusion and suggestions for future work are described in chapter 6.
Chapter 2
Reinforcement Learning
In the previous chapter we described Reinforcement Learning (RL). In this chapter we
will give an introduction to RL in which we explain the basic RL concepts necessary
to understand the remainder of this thesis. For a more complete overview of RL see
[40][12][71].
Most theoretical results on RL apply to systems with discrete state and action spaces.
Therefore we will start our introduction based on these systems. First we describe the
optimal control task for a deterministic system and present a systematic procedure to
compute the optimal policy. Then we introduce the Markov Decision Process (MDP) and
describe the classical solution approaches that use the model of the system. In RL the
optimization is based on the interaction with the system. We will describe two important
RL algorithms: Temporal Difference learning and Q-learning.
We want to use RL for systems as described in the previous chapter, which are systems
with continuous state and action spaces. One way of using RL methods in the continuous
state and action space task is changing the problem into a discrete state and action problem
by quantization of the state and action spaces. Then the original discrete algorithms can
be used. The other way is to modify the algorithms so that they can deal with general
function approximators operating in the continuous domain. We will focus on the latter.
We will conclude with a discussion in which we refine our problem statement.
2.1 A Discrete Deterministic Optimal Control Task
To introduce the basics of an optimal control task we will restrict ourselves to a determin
istic discrete system.
2.1.1 The problem
A discrete deterministic optimal control task consists of:¹

¹ We denote the discrete states and actions by s and a to make a clear distinction between the continuous states x and actions u described in chapter 1.
• A finite set of states {s_1, s_2, ..., s_{n_s}}
The indices indicate the labels of the states of the system.
• A set of actions {a_1, a_2, ..., a_{n_a}}
The set of actions can depend on the state, because not all actions need to be
available in every state.
• A dynamic system
A state transition changes the current state of the system into a possibly different
state. To denote the state at a certain time step we use the time as an index, so
s_k is a particular element of the set of states at time step k. For the deterministic
system the state transition maps the present state and action to the next state.
• Reinforcements r_k
These indicate the reinforcement received at time step k. The reinforcement does not
have to depend on time directly; in general it depends on the state transition and
the action taken at time step k.
• The criterion
The criterion indicates the desired behavior. It describes how all reinforcements at
diﬀerent time steps are combined to give a scalar indication of the performance.
This can be the sum over all reinforcements that have to be either minimized or
maximized, depending on whether the reinforcements are interpreted as costs or
rewards. Another possible criterion is based on the average of reinforcements [67].
As the state transitions are deterministic, an action taken in one state always results in
the same next state. The policy π is the function that maps states to actions, so
that action a = π(s) is taken in state s. Given a policy for a deterministic system, the
entire future sequence of states and actions is determined.
We have to specify our criterion before we can solve this optimization task. We take
as criterion the minimization of the cost to go to a certain goal state. For a given policy
π the future is already determined, so we can compute the total costs to the goal state for
each state. We will call this the value function² V^π, and V^π(s) is the value of state s:

    V^\pi(s) = \sum_{i=k}^{N} r_i ,    (2.1)

where s = s_k and N is the final time step, when the goal state is reached. So now the
optimization task is to find an optimal policy³ π^* for which the values of all states are
minimal. Note that the corresponding optimal value function V^* = V^{π^*} is unique, but the
optimal policy does not have to be unique.

² If the number of discrete states is finite, the value function can be regarded as a vector V^π ∈ \mathbb{R}^{n_s}. Then V^π(s) is the element of the vector V^π corresponding to state s. A function/vector like this is often called a lookup table.
³ We will use ^* to denote optimal solutions.
2.1.2 The solution
Computing the optimal value function can help determine the optimal policy. A systematic
procedure is to compute the optimal value function backwards, starting at the goal
state. For all states that can reach the goal state, store the minimal costs as the value
of these states. Also store the actions that lead to these minimal costs, because they will
form the optimal policy. The policy we get is the optimal policy for going in one step to
the goal state.
Now we consider all states that can reach the goal state in two steps. For each state,
select the action for which the received cost plus the stored value of the next state is
minimal. We can store this as the new value for each state, but first we have to check
whether this state already has a value stored. This is the case when the goal state can also
be reached in one step from this state. Only if the new value is lower than the already
stored value do we change the stored value and action. This gives the optimal policy for
going in two steps to the goal state. This procedure can be repeated until each state that
can reach the goal state has an optimal action stored. Then we have the optimal policy.
The underlying idea of this procedure is that for the optimal value function, the
following relation should hold:

    V^*(s) = \min_a \{ r_k + V^*(s') \}.    (2.2)

Here s' is any state that can be reached from s in one time step, and r_k represents
the cost received at time step k when action a is applied and the state changes from s to
s'. For all possible actions in state s the right-hand side is computed, and the action a is
determined for which this is minimal. The minimal value is stored as V^*(s).
The procedure described is called Dynamic Programming (DP) [10], and a more general
form of (2.2) is called the Bellman equation. It defines optimal actions as actions for which
the values of two successive states differ only by the cost received during that
time step. As illustrated by DP, the Bellman equation can also be used to find algorithms
that derive the optimal actions.
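The backward procedure just described can be sketched in code. The small graph, its costs, and the function names below are illustrative assumptions, not taken from the thesis; the sketch keeps sweeping over the states until no stored value improves, which is exactly the point at which (2.2) holds everywhere.

```python
# Backward dynamic programming on a small deterministic task (illustrative).
# next_state[s][a] gives the deterministic successor, cost[s][a] the cost r_k.
def dp_shortest_path(states, actions, next_state, cost, goal):
    V = {goal: 0.0}            # optimal cost-to-go, built backwards
    policy = {}
    changed = True
    while changed:             # sweep until no state improves: (2.2) holds
        changed = False
        for s in states:
            if s == goal:
                continue
            for a in actions(s):
                s2 = next_state[s][a]
                if s2 not in V:
                    continue   # successor has no value stored yet
                v = cost[s][a] + V[s2]          # r_k + V*(s')
                if s not in V or v < V[s] - 1e-12:
                    V[s], policy[s] = v, a      # store value and action
                    changed = True
    return V, policy

# Tiny example: 0 -> 1 -> 2 (goal), plus a costly direct jump 0 -> 2.
ns = {0: {"step": 1, "jump": 2}, 1: {"step": 2}}
c = {0: {"step": 1.0, "jump": 5.0}, 1: {"step": 1.0}}
V, pi = dp_shortest_path([0, 1, 2], lambda s: list(ns.get(s, {})), ns, c, goal=2)
# The two-step route is cheaper: V[0] == 2.0 and pi[0] == "step".
```

Note how the first sweep stores the one-step cost of the direct jump, and the second sweep replaces it by the cheaper two-step route, mirroring the textual procedure above.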
2.2 The Stochastic Optimization Tasks
The solution to the deterministic optimal control task described before is quite straightforward.
But this is just a restricted class of optimal control tasks. A more general class of
optimization tasks is given by Markov Decision Processes (MDPs). This will be the
framework in which we introduce RL algorithms, but first we describe some model-based
solutions.
2.2.1 The Markov Decision Process
The MDP consists of the same elements as the deterministic optimal control task. The
difference is that the state transitions and reinforcements no longer have to be deterministic.
The state transitions of the process are given by all transition probabilities for going from
some state s to some other state s' when action a is taken:

    P^a_{ss'} = \Pr\{ s_{k+1} = s' \mid s_k = s, a_k = a \}.    (2.3)

All n_s · n_s · n_a probabilities together form the model of the system. The deterministic
system is just a special case of (2.3), for which all values of P^a_{ss'} are either one or zero. The
process described in (2.3) is a Markov process, because the state transitions are conditionally
independent of the past. This is a necessary property to express the value function as a
function of the current state alone.
The value function for the MDP can be defined similarly to (2.1). To take into account
the stochastic state transitions according to (2.3), the expectation has to be taken
over the sum:

    V^\pi(s) = E_\pi\{ \sum_{i=k}^{N} r_i \mid s_k = s \}.    (2.4)

The subscript π indicates that actions in the future are always selected according to
policy π.
Because the state transitions are probabilistic, the ﬁnal time step N (at which the goal
state is reached) can vary a lot. This is because one state transition can lead to diﬀerent
next states. This means that all state transitions after that can be different, and that it
may take a diﬀerent number of time steps to reach the goal state. It is even possible that
N becomes inﬁnite.
The consequence is that the sum inside the expectation in (2.4) can take many
different values, so that the variance of the underlying probability density function is very
high. A discount factor γ ∈ [0, 1] can be introduced to weigh future reinforcements. This
reduces the variance and makes the value function easier to approximate. So a more
general definition of the value function for a stochastic system becomes:

    V^\pi(s) = E_\pi\{ \sum_{i=k}^{N} \gamma^{i-k} r_i \mid s_k = s \}.    (2.5)
Note that there can be another reason to include a discount factor. This is the situation
where all reinforcements are zero except at the goal state. In this situation (2.4) reduces
to V^π(s) = r_N, so that all policies that eventually reach the goal state are equally
good. A discount factor γ < 1 in (2.5) makes states closer to the goal state have higher
values. Then the optimal policy is the one that reaches the goal state in the smallest
number of steps.
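To see the latter, substitute r_i = 0 for all i < N into (2.5):

    V^\pi(s) = E_\pi\{ \gamma^{N-k} r_N \mid s_k = s \},

so the weight γ^{N-k} on the goal reinforcement is largest for states from which few time steps remain to the goal.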
2.2.2 Model based solutions
We will now describe some solution methods that are based on computing the value function
using the model of the system. So all probabilities (2.3) have to be known. If they are not
known, these probabilities have to be estimated by interacting with the system. Once the
model is available, the system itself will no longer be used to compute the value function.
The value function is computed to solve the stochastic optimal control task. A solution
method we already described is DP, which uses backward induction from a final time step in
the future. For the stochastic optimization task, a more general form of the Bellman equation
(2.2) has to be used:
    V^*(s) = \sum_{s'} P^{\pi^*}_{ss'} ( R^{\pi^*}_{ss'} + \gamma V^*(s') ).    (2.6)

Here P^{π^*}_{ss'} is the state transition probability from s to s' when actions are taken
according to the optimal policy π^*. The quantity R^{π^*}_{ss'} represents the expected
reinforcement assigned to the state transition when the optimal action is taken:

    R^{\pi^*}_{ss'} = E\{ r_k \mid s_k = s, s_{k+1} = s', a_k = \pi^*(s_k) \}.    (2.7)

The main difference between (2.6) and (2.2) is that now all possible next states are
taken into account, weighted by the probabilities of their occurrence. So the optimal
policy is the policy for which the expected sum of future reinforcements is minimal.
We can define an n_s × n_s matrix P^{π^*} with the state transition probabilities when policy π^* is
used. Here the rows indicate all the present states and the columns all the next states. We
can also define a vector R^{π^*} with the expected reinforcements for each state. Then (2.6) can
be written more compactly as V^* = P^{π^*}(R^{π^*} + γV^*). The compact notation of the Bellman
equation indicates that DP requires all state transitions with nonzero probability
to be taken into account. This makes it a difficult and computationally expensive
method.
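For a fixed policy the compact form also shows how the value function can be computed exactly: V = P(R + γV) rearranges to the linear system (I − γP)V = PR. The sketch below solves this system for a made-up two-state example; the helper names, the numbers, and the reading of R as the expected reinforcement on arriving in each state are assumptions for illustration, not taken from the thesis.

```python
# Exact evaluation of a fixed policy from the compact Bellman form
# V = P (R + gamma V)  =>  (I - gamma P) V = P R,
# solved by a small hand-rolled Gaussian elimination (illustrative numbers).
def solve_linear(A, b):
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for i in range(n):                      # forward elimination with pivoting
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p], b[i], b[p] = A[p], A[i], b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution
        x[i] = (b[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))) / A[i][i]
    return x

def evaluate_policy(P, R, gamma):
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    b = [sum(P[i][j] * R[j] for j in range(n)) for i in range(n)]  # P R
    return solve_linear(A, b)

P = [[0.9, 0.1],                 # transition probabilities under the policy
     [0.0, 1.0]]                 # state 1 is absorbing
R = [1.0, 0.0]                   # expected reinforcement on arriving in a state
V = evaluate_policy(P, R, 0.5)   # V[0] = 0.9/0.55, V[1] = 0.0
```

Such a direct solve replaces the repeated sweeps of the iterative policy evaluation described next, at the price of a cost cubic in n_s.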
Policy Iteration
The Bellman equation (2.6) provides a condition that has to hold for the optimal value
function. Based on this equation some iterative methods were developed to compute the
optimal policy. One of them is Policy Iteration [11][57], which consists of two basic steps.
The ﬁrst step is the policy evaluation, in which for a given policy the value function is
computed. The second step is the policy improvement step in which the new policy is
computed. To compute the new policy, the greedy policy is determined. This is the policy
for which in each state the action is chosen that will result in the next state with the lowest
value.
The policy evaluation is an iterative algorithm based on (2.6), with the difference that
the actions are taken according to the policy that is being evaluated. The expectation in
the iteration can be calculated using the known transition probabilities P^{π_m(s)}_{ss'} and their
expected reinforcements R^{π_m(s)}_{ss'}. Here π_m is the policy at iteration step m that is being
evaluated. Starting with an initial V_0, the values for each state are updated according to

    V_{l+1}(s) = \sum_{s'} P^{\pi_m}_{ss'} ( R^{\pi_m}_{ss'} + \gamma V_l(s') ),    (2.8)

where l indicates the policy evaluation step. Note that one policy evaluation step is
completed when (2.8) has been applied to all states.
Similar to DP, the policy evaluation can be regarded as computing the value function
backwards. To see this, take the expected reinforcement at the final time step N as the
initial V_0. Then (2.8) makes V_1 the expected cost of the final state transition plus
the discounted cost at the final state. V_2 becomes the future cost of the last two state
transitions, and this goes on for every iteration of (2.8). V_N equals V^π in (2.5),
so that the correct value function is found.
The practical problem of the policy evaluation is that N can be unknown or infinite.
Fortunately the difference between V_{l+1} and V_l decreases as l increases. The effect is
that beyond a certain value of l the greedy policy corresponding to the value function in (2.8)
no longer changes. Then the policy evaluation can stop and the policy can be improved by
taking the greedy policy. The greedy policy is given by

    \pi'_m(s) = \arg\min_a \left\{ \sum_{s'} P^a_{ss'} ( R^a_{ss'} + \gamma V_l(s') ) \right\},    (2.9)

where π'_m is the greedy policy. The V_l on the right-hand side of (2.9) indicates that the
greedy policy is computed after l evaluation steps. The number of evaluation steps is
increased until π'_m no longer changes. Then the policy is improved by taking as new policy
π_{m+1} = π'_m. Because the number of policies is finite, the optimal policy π^* will be found
after a finite number of policy iteration steps.
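The two alternating steps can be sketched on a made-up two-state, two-action task with costs to be minimized; for brevity the evaluation below runs a fixed number of sweeps of (2.8) instead of monitoring when the greedy policy stops changing, and all numbers are illustrative.

```python
# Policy iteration sketch: evaluation sweeps (eq. 2.8) followed by a greedy
# improvement step (eq. 2.9).  P[a][s][s2] is the transition probability and
# R[a][s][s2] the expected reinforcement (a cost) of that transition.
P = {0: [[1.0, 0.0], [1.0, 0.0]],   # action 0 always leads to state 0
     1: [[0.0, 1.0], [0.0, 1.0]]}   # action 1 always leads to state 1
R = {0: [[2.0, 0.0], [2.0, 0.0]],   # action 0 costs 2 per step
     1: [[0.0, 1.0], [0.0, 1.0]]}   # action 1 costs 1 per step
gamma, n_states, actions = 0.9, 2, [0, 1]

def evaluate(policy, V, sweeps=100):          # repeated application of (2.8)
    for _ in range(sweeps):
        V = [sum(P[policy[s]][s][s2] * (R[policy[s]][s][s2] + gamma * V[s2])
                 for s2 in range(n_states)) for s in range(n_states)]
    return V

def greedy(V):                                # (2.9), minimizing expected cost
    return [min(actions,
                key=lambda a: sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                                  for s2 in range(n_states)))
            for s in range(n_states)]

policy = [0, 0]                               # initial policy: always action 0
for _ in range(10):                           # policy iteration loop
    V = evaluate(policy, [0.0, 0.0])
    new_policy = greedy(V)
    if new_policy == policy:                  # no change: optimal policy found
        break
    policy = new_policy
# The cheaper action 1 wins in both states, with V close to 1/(1 - 0.9) = 10.
```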
Value Iteration
The diﬀerence between policy iteration and DP is that policy iteration does not use the
optimal action during policy evaluation. It takes the actions according to the current
policy. In DP only the actions that are optimal are considered. A combination of these
two methods is called value iteration.
Value iteration is a one-step approach in which the two policy iteration steps are
combined. This is done by replacing π(s) in P^{π(s)}_{ss'} and R^{π(s)}_{ss'} by the greedy policy of V_m.
So it still takes into account all possible state transitions, but now the policy is changed
immediately into the greedy policy. Under the same conditions as policy iteration, value
iteration can be proven to converge to the optimal policy.
Note that there is still a difference with DP. Value iteration does not use the optimal
action, but actions that are assumed to be optimal, and this assumption does not have to
hold. Especially in the beginning of value iteration it is very unlikely that the approximated
value function is correct. Therefore it is also very likely that the actions used by value
iteration are not optimal.
2.2.3 Reinforcement Learning solutions
The model-based solution methods use only the model of the system to compute the optimal
policy. The system itself is not used once the model is available. This is the main difference
with reinforcement learning solutions. The reinforcement learning solution methods use
interactions with the system to optimize the policy. To enable the interaction with the
system, an initial policy has to be available. The system starts in the initial state and the
interaction will go on until the ﬁnal time step N. This process can be repeated several
times.
Temporal Diﬀerence learning
The policy π is available, and the data obtained by interacting with the system is
used to evaluate the policy. With the interactions, the value function is only updated for
the present state s = s_k, using only the actual next state s' = s_{k+1}. Also the received
reinforcement r_k is used, and not the expected reinforcement R^π_{ss'}. This suggests an update
according to:

    V_{l+1}(s_k) = r_k + \gamma V_l(s_{k+1}).    (2.10)
Clearly this is not the same as (2.8), because it is based on one state transition that just
happens with a certain probability. In (2.8) the value function is updated for all states
simultaneously using all possible next states.
The interaction with the system can be repeated several times by restarting the
experiment after the goal state is reached. For one particular state, the next state and received
reinforcement can differ between visits. So the right-hand side of (2.10) can take different
values in repeated experiments. Computing the average of these values before updating the
value function in (2.10) results in the same update as (2.8) if the interactions are repeated
infinitely often.
A drawback of this approach is that the right-hand sides of (2.10) have to be stored
for each state. It is also possible to incrementally combine the right-hand side of (2.10) with
an existing approximation V_l. This leads to an update like

    V_{l+1}(s_k) = (1 - \alpha_l) V_l(s_k) + \alpha_l ( r_k + \gamma V_l(s_{k+1}) ).    (2.11)
Here α_l ∈ [0, 1] is the learning rate. For a correct approximation of the value function
the learning rate should decrease; therefore the learning rate carries the index l. If (2.11) is
rewritten as

    V_{l+1}(s_k) = V_l(s_k) + \alpha_l ( r_k + \gamma V_l(s_{k+1}) - V_l(s_k) ),    (2.12)

we see that V_l(s_k) is changed proportionally to the difference between the value of the current
state and what it should be according to the received cost and the value of the next state.
This difference is called the Temporal Difference (TD), and the update (2.12) is called
temporal difference learning [68].
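As a sketch, the update (2.12) can be applied along episodes of a toy deterministic chain; the chain, the cost of one per transition, and the learning-rate schedule are illustrative choices, not taken from the thesis.

```python
# Tabular TD(0) evaluation (eq. 2.12) of a fixed policy on a 3-state chain:
# state 0 -> 1 -> 2, where state 2 is the goal and every transition costs 1.
def td0_episode(V, alpha, gamma):
    s = 0
    while s != 2:
        s2 = s + 1                                  # deterministic transition
        r = 1.0                                     # received cost r_k
        V[s] += alpha * (r + gamma * V[s2] - V[s])  # temporal difference update
        s = s2
    return V

V = [0.0, 0.0, 0.0]
for episode in range(2000):
    alpha = 1.0 / (1 + episode)                     # decreasing learning rate
    V = td0_episode(V, alpha, gamma=1.0)
# With gamma = 1 the values approach the true costs-to-go [2, 1, 0].
```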
Exploration
There is one problem with approximating the value function based on visited states alone.
It is possible that some states are never visited, so that no value can be computed for
these states. So in spite of repeating the interaction, it is still possible that (2.12) never
equals (2.8). To prevent this, an additional random trial process is included, referred
to as exploration. The exploration has the same function as the excitation in system
identification, described in chapter 1.
Exploration means that sometimes a different action is tried than the action
according to the policy. There are different strategies to explore. In [76] a distinction is made
between directed and undirected exploration. In the case of undirected exploration, the tried
actions are truly random and are selected based on a predefined scheme.
Undirected exploration does not take information about the system into account, but it can
use experiences gathered during previous interactions. One popular undirected exploration
strategy is given by the ε-greedy policy⁴ [70]. With probability 1 − ε the action according
to the greedy policy is taken; with probability ε a non-greedy action is tried. In this
way the parameter ε can be used to specify the amount of exploration. Usually the initial
value of ε is taken high, so that many actions are tried. With each policy improvement
step the value of ε is decreased. The effect is that as the policy becomes better,
the exploration focuses more and more around the improved policies. The result is that
fewer interactions with the system are required.
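A minimal sketch of this selection rule for a cost-minimizing Q-function; the Q-values and probabilities below are illustrative.

```python
import random

# epsilon-greedy action selection: with probability 1 - eps take the greedy
# (here: cost-minimizing) action, with probability eps try a non-greedy one.
def eps_greedy(q_row, eps, rng):
    greedy_a = min(range(len(q_row)), key=q_row.__getitem__)
    others = [a for a in range(len(q_row)) if a != greedy_a]
    if others and rng.random() < eps:
        return rng.choice(others)       # exploratory, non-greedy action
    return greedy_a                     # greedy action

rng = random.Random(0)
q_row = [2.0, 1.0, 3.0]                 # Q-values of one state; action 1 greedy
picks = [eps_greedy(q_row, 0.1, rng) for _ in range(1000)]
# Roughly 90% of the picks should be the greedy action 1.
```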
Exploration can also help to converge faster to the optimal policy. This happens when an
action is tried that leads to a state with a lower value, one that would not be reached if the
action according to the policy had been taken. The consequence is that the value of the state
where the action was tried is decreased. This increases the probability that the improved
policy visits that state. This can be seen as finding a shortcut through exploration.
QLearning
The value function for a policy π can be approximated without a model of the system.
However, computing the greedy policy using (2.9) still requires P^a_{ss'} and R^a_{ss'}. A model-free
RL method that does not use P^a_{ss'} and R^a_{ss'} was introduced, called Q-learning
[80][81]. It is based on the idea of estimating a value for each state and action combination.
The Bellman equation in (2.6) uses P^{π^*}_{ss'} and R^{π^*}_{ss'}, so it is only valid if in each state the
optimal action π^*(s) is taken. It is also possible to compute the right-hand side of (2.6)
for each possible action using P^a_{ss'} and R^a_{ss'}. This gives the value for each state and action
combination, when all future actions are optimal. Define

    Q^*(s, a) = \sum_{s'} P^a_{ss'} ( R^a_{ss'} + \gamma V^*(s') )    (2.13)

as the optimal Q-function. For the optimal actions this agrees with the Bellman equation.
For all other actions the value will be higher, so:

    V^*(s) = \min_a Q^*(s, a).    (2.14)

For each state the optimal action is given by:

    \pi^*(s) = \arg\min_a Q^*(s, a).    (2.15)

⁴ This is also known under the names pseudo-stochastic, max-random, and utility-drawn-distribution.
This implies that the optimal Q-function alone is sufficient to compute the optimal policy.
The Q-function for policy π can be defined in the same way as (2.5):

    Q^\pi(s, a) = E_\pi\{ \sum_{i=k}^{N} \gamma^{i-k} r_i \mid s_k = s, a_k = a \}.    (2.16)

Now temporal difference learning can be used to approximate Q^π(s, a).
The temporal difference update in (2.12) uses the values of the present state s_k and
the next state s_{k+1}. For the Q-function the update should use the present state-action
combination and the next state-action combination. Here we have to be careful, because
(2.16) requires that all future actions are taken according to policy π. This means that the
next action a_{k+1} has to be π(s_{k+1}). The current action can be π(s_k), but it can also be
another action that is being explored. So the temporal difference update for the Q-function
becomes

    Q_{l+1}(s_k, a_k) = Q_l(s_k, a_k) + \alpha_l ( r_k + \gamma Q_l(s_{k+1}, \pi(s_{k+1})) - Q_l(s_k, a_k) ).    (2.17)
The greedy policy can be determined similarly to (2.15), according to

    \pi'(s) = \arg\min_a Q_l(s, a).    (2.18)
This represents one policy iteration step. If the process is repeated, eventually the optimal
policy will be found. Note that the original Q-learning in [80] was based on value iteration.
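A tabular sketch combining the update (2.17) with the greedy policy (2.18) and ε-greedy exploration, on a toy chain with the goal at the right end. Because the policy is taken greedy after every update, the term Q_l(s_{k+1}, π(s_{k+1})) is computed as min_a Q_l(s_{k+1}, a); the task, the fixed learning rate, and the exploration settings are illustrative assumptions.

```python
import random

# Tabular Q-learning sketch on a 3-state chain: states 0, 1, 2, goal = 2.
# Action 1 moves right, action 0 moves left; every transition costs 1.
rng = random.Random(1)
n_states, n_actions, gamma = 3, 2, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):                              # illustrative system dynamics
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s2, 1.0                           # next state and cost

for episode in range(500):
    s, t = 0, 0
    while s != 2 and t < 50:                 # episode ends at the goal state
        greedy_a = min(range(n_actions), key=Q[s].__getitem__)
        # exploration: with probability 0.2 try a uniformly random action
        a = rng.randrange(n_actions) if rng.random() < 0.2 else greedy_a
        s2, r = step(s, a)
        target = r + gamma * min(Q[s2])      # next action taken greedily
        Q[s][a] += 0.1 * (target - Q[s][a])  # update (2.17)
        s, t = s2, t + 1
# The greedy policy (2.18) should select "right" (action 1) in states 0 and 1,
# with Q(1, right) near 1 and Q(0, right) near 1 + 0.9 * 1 = 1.9.
```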
Convergence
If the system can be represented by an MDP and the value of each state is estimated
separately, the convergence of temporal difference learning can be proven [68][38][24][25]. Also,
if the Q-value for each state-action combination is estimated separately, the convergence of
Q-learning can be proven [80][81]. The guaranteed convergence requires that the learning
rate decreases and that the entire state-action space is explored sufficiently.
The learning rate α in (2.12) has to decrease in such a way that:

    \sum_{k=0}^{\infty} \alpha_k = \infty,    (2.19)

    \sum_{k=0}^{\infty} \alpha_k^2 < \infty    (2.20)

holds. Condition (2.20) is required to make sure that the update eventually
converges to a fixed solution. Condition (2.19) is required to make sure that the state
space can be explored sufficiently. If α decreases faster, it is possible that not all states
are visited often enough before the update of the value function becomes very small. This
also requires that sufficient exploration is used, so that in theory all states are visited
infinitely often.
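A standard schedule satisfying both conditions is α_k = 1/(k + 1):

    \sum_{k=0}^{\infty} \frac{1}{k+1} = \infty, \qquad \sum_{k=0}^{\infty} \frac{1}{(k+1)^2} = \frac{\pi^2}{6} < \infty.

A constant learning rate violates (2.20), while a geometrically decreasing one such as α_k = 0.9^k violates (2.19).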
2.2.4 Advanced topics
The beneﬁts of RL
Why should RL be used? An obvious reason for using RL would be that no model
of the system is available. However, such a model can be obtained by interacting with the
system and estimating all probabilities. Then all n_s · n_s · n_a probabilities have to be
estimated. It is clear that the number of probabilities to estimate grows very fast when n_s
increases.
When a model is available and used, all state transition probabilities are considered, so
the computational complexity of one evaluation step is quadratic in the number of states
and linear in the number of actions. The convergence rate of the iteration also depends
on the number of states. There are n_a^{n_s} possible policies, so this is the maximum number
of policy improvement steps required [57]. In general it is unknown how many policy
evaluation steps are required.
The reinforcement learning methods can be beneﬁcial when the state space becomes
large. This is because updates are made only for the state transitions that actually take
place. So the estimated values are primarily based on state transitions that occur with
a high probability. The state transitions with low probabilities hardly occur and will not
have a large inﬂuence on the estimated value function. In the worst case however, all states
should be visited often enough. So in the worst case RL is not beneﬁcial, but on average it
will be able to ﬁnd the optimal solution faster for problems with a large number of states.
Increasing the speed
Although the RL algorithms require on average less computation than classical approaches,
they still need to run the system very often to obtain suﬃcient data. Diﬀerent solutions
were proposed to speed up the learning algorithms:
• Generalizing over time
The temporal diﬀerence update (2.12) changes the value according to the diﬀerence
between the value of the current state and what it should be according to the received
reinforcement during the state transition and the value of the next state. This update
is based on one state transition. Since the value represents the sum of future costs,
the value can also be updated to agree with previous states and the reinforcements
received since.
The update (2.12) can be written as V_{l+1}(s_k) = V_l(s_k) + α_l ΔV_k. The change ΔV_k is
in this case the temporal difference. To take past states into account as well, the update
can be based on a sum of previous temporal differences. This is called TD(λ) learning
[68], where the "discount factor" λ ∈ [0, 1] weighs the previous updates. The update
of the value becomes:

    \Delta V_k = ( r_k + \gamma V_l(s_{k+1}) - V_l(s_k) ) \tau(s_k),    (2.21)
where τ represents the eligibility trace. For the present state the eligibility is updated
according to τ(s_k) := λγτ(s_k) + 1, while for all other states s the eligibility is updated
according to τ(s) := λγτ(s). There are other possible updates for the eligibility
[62], but the update in (2.12) always corresponds to TD(0). This can be further
enhanced as in Truncated Temporal Difference (TTD) [23], where the updates are
postponed until sufficient data is available. For Q-learning it is also possible to use
TD(λ) learning, resulting in Q(λ) learning [55]. This can be further enhanced
by using fast online Q(λ) [84].
• Using a model
Although a model is not required for all RL techniques, a model can be used to
speed up learning when it is available. If a model is not available, it can be estimated
simultaneously with the RL algorithm. This does not have to be a complete model
with estimates of all state transition probabilities. It can also consist of trajectories stored
from previous runs.
The model can be used to generate simulated experiments [45]. In DYNA [69] a
similar approach is taken. Updates of the estimated value function are based on the
real system or based on a simulation of the system. A diﬀerent approach is to use
the model to determine for which states the estimated values should be updated. In
Prioritized Sweeping [49] a priority queue is maintained that indicates how promising
the states are. Then the “important” states in the past are also updated based on
the present state transition.
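The accumulating eligibility-trace updates of the first item above can be sketched as follows, again on a toy deterministic chain with unit costs; the chain, λ, and the constant learning rate are illustrative assumptions.

```python
# TD(lambda) with accumulating eligibility traces: every state visited earlier
# in the episode shares in each temporal difference, weighted by its trace.
def td_lambda_episode(V, lam, alpha, gamma):
    n = len(V)
    tau = [0.0] * n                       # eligibility trace per state
    s = 0
    while s != n - 1:                     # the last state is the goal
        s2, r = s + 1, 1.0                # deterministic transition, cost 1
        delta = r + gamma * V[s2] - V[s]  # temporal difference
        tau[s] = lam * gamma * tau[s] + 1.0     # trace of the present state
        for i in range(n):
            if i != s:
                tau[i] *= lam * gamma           # decay of all other traces
        for i in range(n):
            V[i] += alpha * delta * tau[i]      # update (2.21) for each state
        s = s2
    return V

V = [0.0, 0.0, 0.0, 0.0]
for episode in range(200):
    V = td_lambda_episode(V, lam=0.8, alpha=0.1, gamma=1.0)
# With gamma = 1 the values approach the true costs-to-go [3, 2, 1, 0].
```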
2.2.5 Summary
We introduced reinforcement learning for Markov decision processes with discrete state and
action spaces. The goal is to optimize the mapping from states to control actions, called
the policy. Because of the discrete state spaces, the value of the expected sum of future
reinforcements can be stored for each state separately. These values can be computed
oﬀline using the model of the system. When RL is used, these values are estimated based
on the interaction with the system.
The estimated values for two successive states should agree with the reinforcement that
is received during that state transition. Temporal Diﬀerence learning is based on this
observation. The estimated value of the present state is updated so that it agrees more
with the estimated value of the next state and the received reinforcement. To
get a good approximation of the value for all states, all states have to be visited often
enough. For this an additional random trial process, called exploration, is included in the
interaction with the system. Also the whole interaction process itself should be repeated
several times by restarting the system.
Once all values are estimated correctly the new policy can be determined. For this
the actions are selected that have the highest probability of bringing the system into the
most desirable next state. To compute these actions, the model of the system should be
available. A diﬀerent approach is QLearning. The expected sum of future reinforcements
22 CHAPTER 2. REINFORCEMENT LEARNING
is estimated for each state and action combination. Once this Qfunction is estimated
correctly the action can be selected based on the estimated values.
2.3 RL for Continuous State Spaces
2.3.1 Continuous state space representations
The RL approaches described in the previous section were based on estimating a value for each state or state-action combination. This is only possible if the number of states and possible actions is finite, so these RL approaches can only be used for systems with discrete state and action spaces. We want to use RL algorithms to find controllers for systems with continuous state and action spaces: the systems described in chapter 1, with state $x \in \mathbb{R}^{n_x}$ and control action $u \in \mathbb{R}^{n_u}$. This means that we cannot apply the algorithms directly. We will give two different approaches to apply RL to optimize the controllers for these systems.
State space quantization
An obvious way to get from a continuous state space to a discrete one is to quantize the state space. This is a form of state aggregation, in which all states in a region of the state space are grouped into one discrete state. In [9] the continuous state space of an inverted pendulum was partitioned into a finite number of discrete states. The advantage of this method is that the standard RL algorithms can be used, but there are also some drawbacks:
• If the continuous state space system is a Markov process, the discretized system may no longer be Markov [50]. This means that the convergence proofs for the standard RL algorithms need no longer be valid.
• The solution found is probably suboptimal. The quantization reduces the set of possible policies considerably, and the optimal policy for the continuous state space problem may not be in this set. In that case even the optimal solution for the discrete system may perform very badly on the continuous system.
Because the choice of the quantization influences the result, algorithms were developed that use adaptive quantization. There are methods based on unsupervised learning, like k-nearest neighbor [28] or self-organizing maps [43]. There is a method that uses triangularization of the state space based on data [51], which was improved in [52]. There are divide-and-conquer methods, like the Parti-Game algorithm [49], where large states are split when necessary. The advantage of these adaptive quantization methods is that they can result in a better policy than the fixed quantization methods. On the other hand, when a large state is split, the probability of being in one of the smaller states becomes smaller, so the estimated values for these states become less reliable.
Figure 2.1. The actor-critic configuration: the actor maps the state x to the control action u, the system produces the reinforcement r, and the critic takes x and u as inputs and outputs Q(x, u).
Function approximations
Function approximators are parameterized representations that can be used to represent a function; often approximators are chosen that can represent any function. In a RL context function approximators are often used in the actor-critic configuration shown in figure 2.1: one approximator represents the policy (the actor) and the other represents the value function (the critic).
If we look at the critic, we see that it has as input the state x and the action u.[5] This indicates that it is forming a Q-function rather than a value function. However, Q-learning using function approximators is hardly possible: when the critic is a general function approximator, computing the action for which the approximated Q-function is minimal can be very hard. It is unlikely that this can be done analytically, so the minimum has to be approximated numerically. In the actor-critic architecture the gradient of the critic is used instead to compute the update of the actor.
Feed-forward neural networks were often used as function approximators. This approach was successfully demonstrated in [74] and [75], which resulted in a program for Backgammon that plays at world champion level. Note, however, that [75] applies to a system with a finite number of states; the neural network was only used to generalize over the states.
The applications mentioned in section 1.2 were all based on function approximators. Feed-forward neural networks were used in [65][59][60][26], CMACs in [70][65][66], and radial basis functions in [1], in combination with a stabilizing controller.
In [14] examples are given where the use of RL with function approximators fails. In [70] the same examples were used and the experimental setup was modified to make them work. Proofs of convergence for function approximation only exist for approximators linear in the weights when applied to MDPs [77][79]. For systems with a continuous state space there are no proofs when general function approximators are used, nor are there general recipes that make the use of function approximators successful. It still might require
[5] Note that it is also possible to have a critic with only the state x as input.
some hand tuning of the settings of the experiment. Proofs of convergence for continuous state-action space problems do exist for the linear quadratic regulation task [82][17][44], but these do not use a general function approximator; they use an approximator that is appropriate for this specific task.
2.3.2 Learning in continuous domains
We will now describe how these function approximators can be trained. The critic is the function V(ξ, w), where ξ represents the input of the approximator: the state when approximating the value function, or the state and action when approximating the Q-function. The vector w contains the parameters of the network.
Training the Critic
Several methods can be used to train the critic. They are all based on the temporal difference in (2.12) or (2.16).
• TD(λ)
The TD(λ) update was introduced in section 2.2.4 as an extension to the update
(2.12). For a function approximator the update is based on the gradient with respect
to the parameters of the function approximator. This leads to an update like [68]:[6]
$$w' = w + \alpha \sum_{k=0}^{N-1} \Delta w_k, \qquad (2.22)$$

with

$$\Delta w_k = \bigl(r_k + \gamma V(\xi_{k+1}, w) - V(\xi_k, w)\bigr) \sum_{i=1}^{k} \lambda^{k-i}\, \nabla_w V(\xi_i, w). \qquad (2.23)$$
Here α is the learning rate and λ is a weighting factor for past updates. The weighted sequence of past gradients forms the eligibility trace and is used to make the update depend on the trajectory. This can speed up the training of the network. The value of 0 < λ < 1 for which training is fastest is not known in advance and depends on the problem; it can only be found empirically by repeating the training for different λ.
• Minimize the quadratic temporal difference error
Most learning methods for function approximators are based on minimizing the summed squared error between a target value and the network output. In a similar way the temporal difference can be used to express an error that is minimized by
[6] Note that the term "temporal difference" was introduced in [68], together with the TD(λ) learning rule, for function approximators and not for the discrete case described in section 2.2.4.
standard steepest descent methods [83]. The training of the critic then amounts to minimizing the quadratic temporal difference error:
$$E = \frac{1}{2} \sum_{k=0}^{N-1} \bigl(r_k + \gamma V(\xi_{k+1}, w) - V(\xi_k, w)\bigr)^2, \qquad (2.24)$$
where V represents the critic with weights w and γ is the discount factor. Note that this error does not completely specify the function to approximate: it only constrains the difference in output for two points in the input space.
The steepest descent update rule can now be based on this error:
$$w' = w - \alpha \nabla_w E = w + \alpha \sum_{k=0}^{N-1} \Delta w_k, \qquad (2.25)$$

with

$$\Delta w_k = -\bigl(r_k + \gamma V(\xi_{k+1}, w) - V(\xi_k, w)\bigr)\bigl(\gamma \nabla_w V(\xi_{k+1}, w) - \nabla_w V(\xi_k, w)\bigr). \qquad (2.26)$$
If the learning rate is small enough, convergence to a suboptimal solution can be guaranteed. That this solution cannot be guaranteed to be optimal is due to the possibility of converging to a local minimum of (2.24).
• Residual Algorithms
The temporal difference error described above can be regarded as the residual of the Bellman equation: if the Bellman equation held exactly, this error would be zero. The advantage of steepest descent on the Bellman residual is that it can always be made to converge, simply by making the learning rate small enough, although the risk of ending in a local minimum remains. On the other hand, this learning converges very slowly compared to the TD(λ) update rule, while for TD(λ) the convergence with a function approximator cannot be guaranteed. Based on this observation Residual Algorithms were proposed [6]; the main idea is to combine TD(λ) learning with the minimization of the Bellman residual.
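The batch update (2.22)-(2.23) is easiest to see for a critic that is linear in its weights, $V(\xi, w) = w^T \xi$, so that $\nabla_w V = \xi$. The following sketch makes that assumption; the trajectory layout and function name are illustrative, and the eligibility trace here starts at $i = 0$ rather than $i = 1$:

```python
import numpy as np

def train_critic_td_lambda(xi, r, w, alpha=0.05, gamma=0.95, lam=0.7):
    """One pass of the TD(lambda) update (2.22)-(2.23) over a trajectory,
    for a critic linear in its weights: V(xi, w) = w @ xi, so that
    grad_w V(xi, w) = xi.

    xi : array of shape (N+1, n) with inputs xi_0 .. xi_N
    r  : array of shape (N,) with reinforcements r_0 .. r_{N-1}
    """
    trace = np.zeros_like(w)          # eligibility trace: decayed sum of past gradients
    dw = np.zeros_like(w)
    for k in range(len(r)):
        trace = lam * trace + xi[k]   # sum_i lambda^{k-i} grad_w V(xi_i, w)
        td = r[k] + gamma * w @ xi[k + 1] - w @ xi[k]   # temporal difference
        dw += td * trace              # accumulate Delta w_k of (2.23)
    return w + alpha * dw             # batch update (2.22)
```

With λ = 0 this reduces to the plain temporal difference update; larger λ spreads each temporal difference backwards over the visited inputs.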
Training the Actor
The second function approximator is the actor, which represents the policy or feedback function $u_k = g(x_k, w_a)$, where $w_a$ represents the weights of the actor. In the discrete case the policy that is greedy with respect to the value function is chosen. Here the critic approximates the value function, and the parameters of the actor should be chosen such that the actor is greedy with respect to the critic.
Since the critic is represented by a function approximator, it can have any form. This means that the actor cannot be derived directly from the critic. When the actor is represented by a continuously differentiable function, a steepest descent approach can be used to adapt its weights. There are two possibilities, depending on whether the critic represents a Q-function or a value function:
• Backpropagation with respect to the critic
If the control action is an input of the critic, the gradient with respect to the action can be computed. This indicates how a change in action influences the output of the critic. The critic can thus be regarded as an error function that indicates how the outputs of the actor should change, so the weights of the actor can be updated according to

$$\Delta w_a = -\nabla_{w_a} g(x_k, w_a)\, \nabla_u Q(x_k, u_k, w). \qquad (2.27)$$

Note that the input $u_k$ of the critic is the output of the network representing the actor.[7]
• Backpropagation based on the temporal difference
The critic can be used to determine the temporal difference. The actor can then be updated such that the temporal difference becomes smaller:

$$\Delta w_a = -\nabla_{w_a} g(\xi_k, w_a)\bigl(r_k + \gamma V(\xi_{k+1}, w) - V(\xi_k, w)\bigr). \qquad (2.28)$$

In this case the critic and the actor are trained based on the same temporal difference.
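A concrete instance of update (2.27) is easiest with a linear actor $u = W_a x$ and a critic whose action gradient is available. Both of these choices, and the function names, are assumptions made for illustration:

```python
import numpy as np

def actor_update(Wa, x, grad_u_Q, alpha=0.1):
    """Update (2.27) for a linear actor u = Wa @ x.

    grad_u_Q(x, u) returns the critic's gradient with respect to the
    action.  For the linear actor, d u_i / d Wa[i, j] = x[j], so the
    chain rule turns (2.27) into an outer product.
    """
    u = Wa @ x
    return Wa - alpha * np.outer(grad_u_Q(x, u), x)

# example: a critic Q = x^2 + u^2 has action gradient 2u, which
# pushes the actor's output towards zero
Wa = np.array([[0.5]])
Wa_new = actor_update(Wa, np.array([2.0]), lambda x, u: 2.0 * u, alpha=0.1)
```

The actor descends the critic's surface in action space without ever inverting or minimizing the critic explicitly, which is the point of the actor-critic configuration.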
The backpropagation based on the temporal difference was first used in [9] and improved in [85]. Recently these approaches have gained renewed interest as solution methods for problem domains that are non-Markov [41][7][72]. The actor then represents the probability distribution from which the actions are drawn, so the result is not a deterministic policy. A drawback of these approaches is that they learn very slowly.
The general idea of the actor-critic approaches has been formalized in Heuristic Dynamic Programming (HDP) [83], which describes a family of approaches based on backpropagation and the actor-critic configuration. HDP and action dependent HDP can be regarded as temporal difference learning and Q-learning, respectively. An alternative approach is to approximate the gradient of the critic that is used to update the actor. This has been extended to a more general framework in which the value of the critic and its gradient are trained simultaneously [83][56]. The main drawback of these approaches is that they become so complicated that even the implementation of some of the algorithms is not trivial [56].
2.3.3 Summary
In this section we showed that there are two different ways to apply RL algorithms to obtain a feedback function for systems with continuous state and action spaces. One approach is based on discretizing the state space so that the original RL algorithms can be applied. The other uses function approximators to estimate the value or Q-function; in this case two approximators are required, one to represent the feedback and one to represent the value or Q-function.
[7] Therefore we could not express this with a single ξ as input.
2.4 Discussion
Now that we have described both RL and control, we can refine our problem statement. First look at what RL and control have in common:
• RL is used to solve an optimal control task. This means that an optimal control task can be formulated as a RL problem.
• The system identification requires that the system is excited in order to estimate the parameters of the system. The RL algorithms require exploration in order to estimate the parameters of the value function.
In a sense RL can be seen as adaptive control, where parameters are estimated and used to tune the parameters of the controller.
There are also some differences:
• Stability plays an important role in control theory, while in RL the set of feasible actions is assumed to be known in advance.
• The state transitions in the model are the actual values of the state, while in the MDP framework the state transitions are given by probabilities. This influences how the parameters are estimated.
Since we are interested in controlling systems with continuous state and action spaces, we have to choose one of the two main approaches. State space quantization results in a feedback function that is not well defined: it is based on many local estimations, so it is not easy to see what the consequences are when the resulting feedback is used to control a continuous time system.
The function approximator approaches are better suited for understanding the resulting feedback, because the feedback function is already determined by the actor. The main problem with these approaches is that they rely on training two different approximators, which makes it hard to see how well they are trained and how the result can be improved.
The only function approximator approach that comes close to the requirements presented in section 1.3 is the one applied to the linear quadratic regulation task. The only problem is that it does not apply to nonlinear systems. Still, this approach will be a good starting point for our investigation.
Chapter 3
LQR using Q-Learning
3.1 Introduction
In this chapter we present a theoretical framework that enables us to analyze the use of RL in problem domains with continuous state and action spaces. The first theoretical analysis and proof of convergence of RL applied to such problems can be found in [82]. It shows convergence to the correct value function for a linear quadratic regulation task, where the weights of a carefully chosen function approximator were adjusted to minimize the temporal difference error. Based on the same idea, convergence was proven for other RL approaches, including Q-learning [44].
In [16][19][20][17], a policy iteration based Q-learning approach was introduced to solve the LQR task, based on a Recursive Least Squares (RLS) estimation of a quadratic Q-function. These approaches do not use the model of the linear system and can be applied when the system is unknown. If the data is generated with sufficient exploration, convergence to the optimal linear feedback can be proven.
Both [17] and [44] indicate that the practical applicability of the results is limited by the absence of noise in the analysis. A scalar example in [17] shows that noise introduces a bias in the estimation, so that the proofs of convergence no longer hold. We are interested in the use of RL on real, practical problem domains with continuous state and action spaces, which means that we have to include the noise in our analysis.
We are also interested in how well the RL approach performs compared to alternative solution methods. We can solve the LQR task with an unknown system using an indirect approach, in which data is used to estimate the parameters of the system and these estimates are then used to compute the optimal feedback. Because we want to compare the results, it is important to replace the RLS by a batch linear least squares estimation. This has the advantage that no initial parameters have to be specified, so the resulting solution only depends on the data and the solution method. Both solution methods are then off-line optimization methods: first all data is generated, and then the new feedback is computed.
According to the convergence proofs, sufficient exploration is required to find the optimal solution. This means that random actions have to be applied to the system. In a practical control task this is not desirable, so we need an indication of the minimal amount of exploration that is sufficient. Our analysis therefore has to show how the performance of the two solution methods depends on the amount of exploration used to generate the data.
In the next section we specify the linear quadratic regulation task in which the linear system is assumed to be unknown. We then present the two solution methods and give an overview of how to compare their performance. Section 3.3 focuses on the influence of the exploration on the comparison. We will show that the noise determines the amount of exploration required for guaranteed convergence, and that this amount differs for the two solution methods. The experimental confirmation of the results is given in section 3.4, followed by the discussion and conclusions in sections 3.5 and 3.6.
3.2 LQR with an Unknown System
In this section we describe the LQR task and show how to obtain the optimal feedback when everything is known. We then present a direct and an indirect solution method for the situation where the parameters of the system are unknown. We also define a performance measure and give an overview of the comparison of the two solution methods.
3.2.1 Linear Quadratic Regulation
In the Linear Quadratic Regulation (LQR) framework, the system is linear and the direct cost is quadratic. Let a linear time invariant discrete time system be given by:

$$x_{k+1} = A x_k + B u_k + v_k, \qquad (3.1)$$

with $x_k \in \mathbb{R}^{n_x}$ the state, $u_k \in \mathbb{R}^{n_u}$ the control action and $v_k \in \mathbb{R}^{n_x}$ the system noise at time step k. All elements of the system noise v are assumed to be $\mathcal{N}(0, \sigma_v^2)$ distributed and white. The matrices $A \in \mathbb{R}^{n_x \times n_x}$ and $B \in \mathbb{R}^{n_x \times n_u}$ are the parameters of the system.
The direct cost r is a quadratic function of the state and the control action at time k:

$$r_k = x_k^T S x_k + u_k^T R u_k, \qquad (3.2)$$

where $S \in \mathbb{R}^{n_x \times n_x}$ and $R \in \mathbb{R}^{n_u \times n_u}$ are design choices. The objective is to find the mapping from state to control action ($\mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$) that minimizes the total costs J, given by:

$$J = \sum_{k=0}^{\infty} r_k. \qquad (3.3)$$
The value of J is finite if (3.2) approaches zero fast enough. This is the case when (3.1) is controlled using (1.3) and the closed loop is stable. The total costs J become infinite if the closed loop is unstable. It is possible to include a discount factor γ < 1 in (3.3), but then the total costs can be finite even for an unstable system. We will use (3.3) without a discount factor (γ = 1), so that a finite J always implies that the closed loop is stable.
The optimal control action $u^*$ is a linear function of the state:

$$u_k^* = L^* x_k \quad \text{with} \quad L^* = -(B^T K^* B + R)^{-1} B^T K^* A, \qquad (3.4)$$

where $K^* \in \mathbb{R}^{n_x \times n_x}$ is the unique symmetric positive definite solution of the Discrete Algebraic Riccati Equation (DARE):

$$K^* = A^T \bigl(K^* - K^* B (B^T K^* B + R)^{-1} B^T K^*\bigr) A + S. \qquad (3.5)$$
This solution exists if (A, B) is controllable, $(A, S^{1/2})$ is observable, $S \geq 0$ (positive semi-definite) and $R > 0$ (positive definite) [11]. Equation (3.5) can only be solved with perfect knowledge of A, B, S and R. This restricts the practical applicability of LQR, because in practice perfect knowledge of A and B is not available.
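When A and B are known, (3.5) can be solved by fixed-point iteration of the Riccati recursion. The following NumPy sketch does so and then applies (3.4); the iteration scheme and the scalar example are illustrative, and a production implementation would rather call a dedicated solver such as scipy.linalg.solve_discrete_are.

```python
import numpy as np

def solve_dare(A, B, S, R, tol=1e-10, max_iter=10000):
    """Solve the DARE (3.5) by fixed-point iteration and return (K*, L*),
    with L* the optimal feedback of (3.4)."""
    K = np.eye(A.shape[0])
    K_next = K
    for _ in range(max_iter):
        G = np.linalg.inv(B.T @ K @ B + R)
        K_next = A.T @ (K - K @ B @ G @ B.T @ K) @ A + S   # right-hand side of (3.5)
        if np.max(np.abs(K_next - K)) < tol:
            break
        K = K_next
    K = K_next
    L = -np.linalg.inv(B.T @ K @ B + R) @ B.T @ K @ A      # optimal feedback (3.4)
    return K, L

# scalar example: x_{k+1} = x_k + u_k with r_k = x_k^2 + u_k^2
K, L = solve_dare(np.array([[1.0]]), np.array([[1.0]]),
                  np.array([[1.0]]), np.array([[1.0]]))
```

For this scalar example the DARE reduces to $K^2 = K + 1$, so $K^*$ is the golden ratio and the closed loop $A + BL^*$ is well inside the unit disc.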
3.2.2 System Identification
In indirect adaptive control the parameters of the system have to be estimated; we will refer to this as the System Identification (SI) approach.[1] This is our first method to solve the LQR problem with unknown A and B. The estimations are based on measurements generated by controlling the system using:

$$u_k = L x_k + e_k, \qquad (3.6)$$

where L is the existing feedback[2] and $e_k \in \mathbb{R}^{n_u}$ represents the excitation (or exploration) noise. The main difference between e and v is that v is an unknown property of the system, while e is a random process that is added to the control action on purpose. All elements of e are chosen to be $\mathcal{N}(0, \sigma_e^2)$ distributed and white. Although the value of e is always known, the two methods presented in this chapter will not use this knowledge.
Controlling the system for N time steps results in a set $\{x_k\}_{k=0}^{N}$ with:

$$x_k = D^k x_0 + \sum_{i=0}^{k-1} D^{k-i-1} (B e_i + v_i), \qquad (3.7)$$

where D = A + BL represents the closed loop. The sets $\{x_k\}_{k=0}^{N}$ and $\{u_k\}_{k=0}^{N}$ (computed with (3.6)) form a data set that depends on the parameters of the system, the feedback, the initial state and both noise sequences. To estimate the parameters of the system, rewrite (3.1) as:

$$x_{k+1}^T = \begin{bmatrix} x_k^T & u_k^T \end{bmatrix} \begin{bmatrix} A^T \\ B^T \end{bmatrix} + v_k^T. \qquad (3.8)$$
[1] Actually closed-loop identification or identification for control would be more appropriate, but system identification stresses the main difference with the reinforcement learning method more.
[2] Note that this controller already has the form of (3.4), only the value of L is not optimal.
So for the total data set

$$Y_{SI} = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix} = \begin{bmatrix} x_0^T & u_0^T \\ \vdots & \vdots \\ x_{N-1}^T & u_{N-1}^T \end{bmatrix} \begin{bmatrix} A^T \\ B^T \end{bmatrix} + \begin{bmatrix} v_0^T \\ \vdots \\ v_{N-1}^T \end{bmatrix} = X_{SI}\, \theta_{SI} + V_{SI} \qquad (3.9)$$
should hold. Since $V_{SI}$ is not known, only a least squares estimate of $\theta_{SI}$ can be given:

$$\hat{\theta}_{SI} = (X_{SI}^T X_{SI})^{-1} X_{SI}^T Y_{SI}. \qquad (3.10)$$
The estimated parameters of the system, $\hat{A}$ and $\hat{B}$, can be derived from $\hat{\theta}_{SI}$. When $\hat{A}$ and $\hat{B}$ are used, the solution of (3.5) will be $\hat{K}$. Then a feedback $\hat{L}_{SI}$ can be computed using (3.4). This feedback $\hat{L}_{SI}$ is the resulting approximation of $L^*$ by the SI approach.
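The data generation (3.6) and the estimate (3.10) can be sketched end to end in a few lines of NumPy. The particular system, feedback and noise levels below are illustrative choices, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, N = 2, 1, 200
A = np.array([[1.0, 0.1], [0.0, 0.9]])   # true system, unknown to the estimator
B = np.array([[0.0], [0.1]])
L = np.array([[-1.0, -1.0]])             # existing (suboptimal) stabilizing feedback
sigma_e, sigma_v = 0.5, 0.01

# generate the data set {x_k}, {u_k} by controlling the system with (3.6)
x = np.zeros((N + 1, nx))
u = np.zeros((N, nu))
for k in range(N):
    u[k] = L @ x[k] + sigma_e * rng.standard_normal(nu)     # exploration noise e_k
    x[k + 1] = A @ x[k] + B @ u[k] + sigma_v * rng.standard_normal(nx)

# batch least squares estimate (3.10) of theta_SI = [A^T; B^T]
X_si = np.hstack([x[:-1], u])            # rows [x_k^T  u_k^T], shape (N, nx + nu)
Y_si = x[1:]                             # rows x_{k+1}^T
theta_hat, *_ = np.linalg.lstsq(X_si, Y_si, rcond=None)
A_hat, B_hat = theta_hat[:nx].T, theta_hat[nx:].T
```

Feeding the resulting A_hat and B_hat into a DARE solver then yields the SI feedback, exactly as described above.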
3.2.3 The Q-function
Reinforcement learning is our second method for solving the LQR task with unknown A and B. This means that the costs in (3.2) will be regarded as the reinforcements. As explained in chapter 2, the main idea behind RL is to approximate the future costs and find a feedback that minimizes these costs. In LQR the solution of the DARE (3.5) can be used to express the future costs as a function of the state when the optimal feedback is used [11]:

$$V^*(x_k) = \sum_{i=k}^{\infty} r_i = x_k^T K^* x_k, \qquad (3.11)$$

with $V^* : \mathbb{R}^{n_x} \to \mathbb{R}$. The feedback that minimizes (3.11) is given by (3.4), which requires knowledge of A and B. So it is not very useful[3] to estimate the parameters $K^*$ of (3.11). Q-learning is more appropriate, since it does not require knowledge of the system to obtain the feedback.
In Q-learning the feedback is derived from the Q-function, which represents the future costs as a function of the state and action. So $Q : \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \to \mathbb{R}$ is the function to approximate based on the measurements. If we know what function we have to approximate, then we only have to estimate its parameters. According to (2.14),

$$V^*(x) = \min_u Q^*(x, u) = Q^*(x, u^*) \qquad (3.12)$$
should hold. It can be shown [19][44] that $Q^*(x_k, u_k^*)$ is given by:

$$Q^*(x_k, u_k^*) = \sum_{i=k}^{\infty} r_i = \begin{bmatrix} x_k \\ u_k^* \end{bmatrix}^T \begin{bmatrix} S + A^T K^* A & A^T K^* B \\ B^T K^* A & R + B^T K^* B \end{bmatrix} \begin{bmatrix} x_k \\ u_k^* \end{bmatrix} \qquad (3.13)$$

$$= \begin{bmatrix} x_k \\ u_k^* \end{bmatrix}^T \begin{bmatrix} H^*_{xx} & H^*_{xu} \\ H^*_{ux} & H^*_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k^* \end{bmatrix} = \phi_k^{*T} H^* \phi_k^*. \qquad (3.14)$$
[3] In optimization tasks other than LQR the computation of the future costs may be intractable, so that an approximation of the future costs may be useful.
The vector $\phi_k^{*T} = \begin{bmatrix} x_k^T & u_k^{*T} \end{bmatrix}$ is the concatenation of the state and the optimal control action, and the matrix $H^*$ contains the parameters of the optimal Q-function. This shows that the optimal Q-function for the LQR task is a quadratic function of the state and action.
The Q-function in (3.14) can be used to compute the optimal control action without the use of the system model. According to (2.15),

$$u_k^* = \arg\min_u Q^*(x_k, u) \qquad (3.15)$$

should be computed for all states to get the optimal feedback function. The Q-function in (3.14) is quadratic and $H^*$ is a symmetric positive definite matrix, so it can easily be minimized by setting the derivative with respect to the control action to zero: $\nabla_{u_k^*} Q^*(x_k, u_k^*) = 2 H^*_{ux} x_k + 2 H^*_{uu} u_k^* = 0$, resulting in:

$$u_k^* = -(H^*_{uu})^{-1} H^*_{ux} x_k = L^* x_k \quad \text{with} \quad L^* = -(H^*_{uu})^{-1} H^*_{ux}. \qquad (3.16)$$

With $H^*_{ux} = B^T K^* A$ and $H^*_{uu} = R + B^T K^* B$ from (3.14), this result is identical to (3.4).
It is not the optimal Q-function that is being approximated, but the function representing the future costs of the feedback that generated the data. This is the function $Q^L(x, u) = \phi^T H^L \phi$ (with $\phi^T = \begin{bmatrix} x^T & u^T \end{bmatrix}$),[4] because all measurements are generated using some feedback L. The matrix $H^L$ is symmetric and positive definite, so that $L' = -(H^L_{uu})^{-1} H^L_{ux}$[5] is the feedback that minimizes $Q^L$. The feedback $L'$ does not have to be the optimal feedback, but it will have lower future costs than L. Estimating the parameters of $Q^L$ forms a policy evaluation step and computing $L'$ forms a policy improvement step, so this Q-learning approach is based on policy iteration. If $L'$ is not good enough, the whole procedure can be repeated by generating measurements using $L'$. If the parameters of $Q^L$ are always estimated correctly, the sequence of new values of $L'$ forms a contraction towards the optimal solution $L^*$ [44]. This means we only have to verify the correctness of one policy improvement step; convergence to the optimal solution then follows by induction.
3.2.4 Q-Learning
In (2.16) the update rule for Q-learning is given. It is based on repeatedly restarting the system and generating new data in each run. The update also has a learning rate that has to decrease according to (2.19) and (2.20). This makes it impossible to compare the result with that of the SI approach. We therefore change the Q-learning algorithm such that it uses one single data set and does not use a learning rate.
The parameters $H^L$ of the Q-function are estimated in the same way as the parameters of the system in section 3.2.2, so the same data set with $\{x_k\}_{k=0}^{N}$ and $\{u_k\}_{k=0}^{N}$ is used. Q-learning also uses scalar reinforcements; these are the direct costs
[4] If there is also noise: $Q^L(x, u) = \phi^T H^L \phi + v^T K^L v$.
[5] The prime indicates the feedback that is optimal according to the Q-function, so $L'$ optimizes $Q^L$. This is equivalent to the greedy policy described in chapter 2.
computed with (3.2).[6] The parameters of $Q^L$ are estimated based on the data set generated using feedback L. The function $Q^L$ can be estimated by writing its definition recursively:

$$Q^L(x_k, u_k) = \sum_{i=k}^{\infty} r_i = r_k + \sum_{i=k+1}^{\infty} r_i = r_k + Q^L(x_{k+1}, L x_{k+1}). \qquad (3.17)$$
Note that this definition implies that the data is generated using a stabilizing feedback. If the feedback L is not stable, the function $Q^L(x_k, u_k)$ is not defined because the sum of future reinforcements is not bounded; the correct values of $H^L$ then do not exist and cannot be estimated.
From (3.17) it follows that:

$$r_k + Q^L(x_{k+1}, L x_{k+1}) - Q^L(x_k, u_k) = 0. \qquad (3.18)$$
If in this equation $Q^L$ is replaced by its approximation $\hat{Q}^L$, the left hand side is the Temporal Difference (TD). Because both functions are quadratic, the left hand side of (3.18) is only zero for all data if $\hat{Q}^L$ has the same parameters as $Q^L$. The parameters of $\hat{Q}^L$ can therefore be estimated by reducing the distance between the TD and zero. This can be formulated as a least squares estimation as in (3.10). We define:
$$\phi_k^T = \begin{bmatrix} x_k^T & u_k^T \end{bmatrix}, \quad \phi_{k+1}^T = \begin{bmatrix} x_{k+1}^T & x_{k+1}^T L^T \end{bmatrix} \quad \text{and} \quad w_k = v_k^T K^L v_k - v_{k+1}^T K^L v_{k+1}. \qquad (3.19)$$

Note that the definition of $\phi_{k+1}^T$ is slightly different from that of $\phi_k^T$. It is possible to write (3.18) as:[7]
$$r_k = Q^L(x_k, u_k) - Q^L(x_{k+1}, L x_{k+1}) \qquad (3.20)$$
$$= \phi_k^T H^L \phi_k - \phi_{k+1}^T H^L \phi_{k+1} + w_k \qquad (3.21)$$
$$= \mathrm{vec}(\phi_k \phi_k^T)^T \mathrm{vec}(H^L) - \mathrm{vec}(\phi_{k+1} \phi_{k+1}^T)^T \mathrm{vec}(H^L) + w_k \qquad (3.22)$$
$$= \mathrm{vec}(\phi_k \phi_k^T - \phi_{k+1} \phi_{k+1}^T)^T \mathrm{vec}(H^L) + w_k = \mathrm{vec}(\Phi_k)^T \mathrm{vec}(H^L) + w_k. \qquad (3.23)$$

Note that the matrix $\Phi_k$ also depends on L. For all time steps the following holds:
$$Y_{QL} = \begin{bmatrix} r_0 \\ \vdots \\ r_{N-1} \end{bmatrix} = \begin{bmatrix} \mathrm{vec}(\Phi_0)^T \\ \vdots \\ \mathrm{vec}(\Phi_{N-1})^T \end{bmatrix} \mathrm{vec}(H^L) + \begin{bmatrix} w_0 \\ \vdots \\ w_{N-1} \end{bmatrix} = X_{QL}\, \theta_{QL} + V_{QL}, \qquad (3.24)$$

so that

$$\hat{\theta}_{QL} = (X_{QL}^T X_{QL})^{-1} X_{QL}^T Y_{QL} \qquad (3.25)$$
[6] Note that for the SI approach the parameters of the system are unknown, but still the weighting with the design matrices can be made. Conceptually this does not make any sense. In a practical situation it is more likely that some scalar indication of performance is available, like for instance the energy consumption of the system. We compute the direct cost using (3.2) for a fair comparison.
[7] Define vec(A) as the function that stacks the upper triangle elements of matrix A into a vector.
gives an estimate of $\mathrm{vec}(H^L)$. Since $\hat{H}$ should be symmetric, it can be derived from the estimated $\mathrm{vec}(H^L)$. By applying (3.16) to the matrix $\hat{H}$, the resulting feedback $\hat{L}_{QL}$ of the Q-learning approach is computed. This should be an approximation of $L'$.
This variant of Q-learning only applies to the LQR framework, so we will refer to it as the Linear Quadratic Regulation Q-Learning (LQRQL) approach.[8] The main difference with standard Q-learning is that it does not require restarting the system several times to generate sufficient data for the correct estimation of all Q-values. Instead it uses prior knowledge about the function class of the Q-function, chosen to fit the LQR task. This can be seen as generalization over the state-action space, which for the LQR task is globally correct. This also holds for the approaches in [17][44]. The only difference between these approaches and ours using (3.25) is that we choose to learn in one step using the entire data set. The main reason for doing so is to make the result comparable with the SI approach; the analysis of the outcome is also easier when the estimation is not performed recursively.
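For the noise-free case the whole LQRQL estimation fits in a short script. This is an illustrative sketch, with vec_u standing in for the vec operator of footnote 7 (upper-triangle stacking, with off-diagonal pairs folded together so the dot product reproduces the quadratic form):

```python
import numpy as np

def vec_u(M):
    """Stack the upper-triangle elements of M into a vector, folding the
    off-diagonal pairs first, so that vec_u(phi phi^T) dotted with the
    plain upper triangle of a symmetric H reproduces phi^T H phi."""
    return (M + M.T - np.diag(np.diag(M)))[np.triu_indices(M.shape[0])]

def lqrql_feedback(x, u, r, L, nx, nu):
    """One LQRQL improvement step: estimate H^L from a single data set via
    the least squares problem (3.24)-(3.25), then apply (3.16)."""
    rows = []
    for k in range(len(r)):
        phi = np.concatenate([x[k], u[k]])                 # phi_k of (3.19)
        phi1 = np.concatenate([x[k + 1], L @ x[k + 1]])    # phi_{k+1} of (3.19)
        rows.append(vec_u(np.outer(phi, phi) - np.outer(phi1, phi1)))
    theta, *_ = np.linalg.lstsq(np.array(rows), np.asarray(r), rcond=None)
    n = nx + nu
    H = np.zeros((n, n))
    H[np.triu_indices(n)] = theta      # unfold the estimate ...
    H = H + np.triu(H, 1).T            # ... into the symmetric matrix H-hat
    return -np.linalg.inv(H[nx:, nx:]) @ H[nx:, :nx]       # L' of (3.16)
```

Without system noise the regression is exact, so a single data set generated with sufficient exploration already yields the improved feedback L'.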
3.2.5 The Performance Measure
We have described two different methods that use measurements to optimize a feedback, resulting in $\hat{L}_{SI}$ and $\hat{L}_{QL}$. For the comparison a scalar performance measure is required that indicates which of these feedbacks performs best. There are three ways to measure the performance:
• Experimental: Run the system with the resulting feedbacks and compute the total costs. The performances of both approaches can only be compared for one specific setup, and this does not indicate how "optimal" the result is. For a real system, however, this is the only possible way to compare the results.
• Optimal Feedback: In a simulation there is knowledge of A and B, so the optimal feedback $L^*$ can be computed using (3.4). A norm[9] $\|L^* - \hat{L}\|$ will not be a good performance measure, because feedbacks with similar $\|L^* - \hat{L}\|$ can have different future costs. It is even possible that if $\|L^* - \hat{L}_1\| < \|L^* - \hat{L}_2\|$, $\hat{L}_1$ results in an unstable closed loop while $\hat{L}_2$ results in a stable closed loop. This measure can therefore be used to show that the resulting feedback approaches the optimal feedback, but not that this results in lower total costs.
• DARE Solution: Knowledge of A and B can be used in (3.5) to compute the solution $K^*$ of the DARE. This gives the future costs (3.11) when starting in $x_0$ and using $L^*$. The costs when using an approximated feedback $\hat{L}$ can also be expressed as in (3.11), but with a matrix $K^{\hat{L}}$. Comparing the matrices $K^*$ and $K^{\hat{L}}$ results in a performance indication that only depends on the initial state.
⁸ This should not be confused with the Least Squares TD approach [18][13], which applies to MDPs.
⁹ Here L̂ indicates the resulting feedback of either approach.
36 CHAPTER 3. LQR USING QLEARNING
We will define a performance measure based on the DARE solution, because it is the least
sensitive to the settings of the experiment.
When using L̂, the value function V_L̂(x₀) gives the total costs J_L̂ when starting in
state x₀. It is given by:
V_{\hat L}(x_0) = \sum_{k=0}^{\infty} r_k = x_0^T \sum_{k=0}^{\infty} (A^T + \hat L^T B^T)^k (S + \hat L^T R \hat L)(A + B \hat L)^k \, x_0 = x_0^T K_{\hat L} \, x_0   (3.26)
where K_L̂ is again a symmetric matrix. (Clearly this matrix only exists when the
closed loop A + BL̂ has all its eigenvalues in the unit disc.) When L* is used, the total
costs V*(x₀) can be computed using (3.11). Let the relative performance (RP) ρ_L̂(x₀) be
the quotient of V_L̂(x₀) and V*(x₀), so:
\rho_{\hat L}(x_0) = \frac{V_{\hat L}(x_0)}{V^*(x_0)} = \frac{x_0^T K_{\hat L} x_0}{x_0^T K^* x_0} = \frac{x_0^T (K^*)^{-1} K_{\hat L} x_0}{x_0^T x_0} = \frac{x_0^T \Gamma_{\hat L} x_0}{x_0^T x_0},   (3.27)
where Γ_L̂ = (K*)⁻¹K_L̂. Only when L̂ = L* is Γ_L̂ the unit matrix. The RP ρ_L̂(x₀) is
bounded below and above by the minimal and maximal eigenvalues of Γ_L̂:
\rho^{\hat L}_{\min} = \lambda_{\min}(\Gamma_{\hat L}) \le \rho_{\hat L}(x_0) \le \lambda_{\max}(\Gamma_{\hat L}) = \rho^{\hat L}_{\max} \quad \forall x_0 \ne 0   (3.28)
Note that ρ^L̂_min = ρ^L̂_max = 1 if and only if L̂ = L*, and that ρ^L̂_min ≥ 1 for all L̂.
According to (3.28) three possible measures for the RP can be used: ρ^L̂_min, ρ^L̂_max or
ρ_L̂(x₀). It does not matter for the comparison which measure is used, so in general we
will use ρ_L̂ to indicate one of these measures. Note that ρ^L̂_min and ρ^L̂_max only depend on the
feedback L̂ and the four matrices A, B, S and R that define the problem. In a practical
situation ρ^L̂_max seems the best choice because it represents the worst case RP with respect
to x₀. In this chapter we will call feedback L₁ better than feedback L₂ if ρ_{L₁} < ρ_{L₂}.
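As an illustration (not part of the thesis implementation), the relative performance measure can be computed numerically. The sketch below assumes the convention x_{k+1} = Ax_k + Bu_k with u_k = Lx_k and a stable closed loop; it solves the DARE by value iteration and obtains K_L̂ from the series in (3.26):

```python
import numpy as np

def solve_dare(A, B, S, R, iters=2000):
    # Value-iteration fixed point of the DARE (3.5).
    K = S.copy()
    for _ in range(iters):
        G = R + B.T @ K @ B
        K = A.T @ (K - K @ B @ np.linalg.solve(G, B.T @ K)) @ A + S
    return K

def optimal_feedback(A, B, S, R):
    # (3.4): L* = -(R + B^T K* B)^{-1} B^T K* A
    K = solve_dare(A, B, S, R)
    return -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A), K

def cost_matrix(A, B, S, R, L, iters=2000):
    # K_L from the series in (3.26); requires the closed loop A + B L to be stable.
    D, K = A + B @ L, np.zeros_like(S)
    for _ in range(iters):
        K = D.T @ K @ D + S + L.T @ R @ L
    return K

def relative_performance(A, B, S, R, L_hat):
    # rho_min and rho_max are the extreme eigenvalues of Gamma = (K*)^{-1} K_Lhat.
    _, K_star = optimal_feedback(A, B, S, R)
    Gamma = np.linalg.solve(K_star, cost_matrix(A, B, S, R, L_hat))
    eig = np.sort(np.linalg.eigvals(Gamma).real)
    return eig[0], eig[-1]

# example system (the one used later in (3.67)); any stabilizing feedback
# gives rho_min >= 1, and the optimal feedback gives rho_min = rho_max = 1
A = np.array([[-0.6, 0.4], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.array([[1.0]])
L_star, _ = optimal_feedback(A, B, S, R)
rho_min_opt, rho_max_opt = relative_performance(A, B, S, R, L_star)
rho_min_pert, rho_max_pert = relative_performance(A, B, S, R, L_star + 0.01)
```

The small perturbation of L* keeps the closed loop stable while moving both eigenvalues of Γ away from one, matching the bound in (3.28).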
3.2.6 Overview
The schematic overview in figure 3.1 summarizes this section. The setup at the left shows
the system parameters and noise, but also the feedback L and exploration noise e used to
generate the measurements indicated with x, u and r (note that r is computed using S
and R). The SI approach is shown at the top and the Q-Learning approach at the bottom.
The computation of the optimal feedback using A and B is shown in the middle.
For LQRQL figure 3.1 shows no explicit optimization, because this is implicitly included
in the estimation of the Q-function. This is the difference between the SI and the QL
approach: the SI approach is a two step method, where the estimation and the optimization
are performed independently. LQRQL is a one step method, where estimation and
optimization are performed at once. Figure 3.1 also shows that no additional information is
required to derive L̂_QL from Ĥ_L.
The question mark at the very right of figure 3.1 indicates the comparison between ρ_L̂SI
and ρ_L̂QL. In the next section we will relate this comparison to the amount of exploration,
indicated with the e at the very left of figure 3.1.
[Figure 3.1: diagram relating the setup (e, L, v, x₀, A, B, S, R), the estimations (Â, B̂ for SI and Ĥ_L for QL), the optimization (L̂_SI, L̂_QL, L*, K̂, K*) and the comparison of ρ_L̂SI and ρ_L̂QL.]
Figure 3.1. The Overview. The boxes indicate the parameters and the symbols next to the
arrows indicate the required information to compute the next result.
3.3 The Influence of Exploration
In this section we will investigate the influence of the exploration on the relative
performances and the comparison. We will start in 3.3.1 by reformulating the method of the
estimation to make it possible to express the estimation error of the linear least squares
estimation. In 3.3.2 and 3.3.3 the influence of the exploration is investigated for both
methods. In 3.3.4 the exploration characteristic will be introduced to describe the influence of
the exploration on the performance of the resulting feedbacks.
3.3.1 The estimation reformulated
Both approaches described in the previous section are based on a linear least squares
estimation. The equations (3.10) and (3.25) can be written as:

\hat\theta = (X^T X)^{-1} X^T Y.   (3.29)
This solution θ̂ depends only on the matrices X and Y, so no additional parameters
influence the result. For a fair comparison this is an important property. In practice (3.29)
is hardly ever used because of its poor numerical performance. Different decomposition
methods exist to overcome numerical problems. This is important for the implementation
of the simulations, but we will also use it to investigate the influence of the exploration.
The matrix inversion in (3.29) is the main problem for our analysis, because it makes it
very hard to see how the exploration influences the estimation.
In QR-decomposition¹⁰ X is decomposed into an upper triangular square matrix M
and a unitary matrix Z, so that (3.29) can be written as:

\hat\theta = ((ZM)^T ZM)^{-1} (ZM)^T Y = M^{-1} Z^T Y.   (3.30)

This is sufficient for an efficient implementation but it still uses a matrix inversion, making
it hard to see how this solution depends on the exploration. To see the influence of the
exploration, this solution has to be rearranged even more.
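As a small numerical illustration (not from the thesis), the normal-equation solution (3.29) and the QR-based solution (3.30) coincide on well-conditioned data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))   # regressor matrix: N = 20 samples, 6 parameters
Y = rng.normal(size=(20, 2))

# normal equations (3.29): numerically fragile for ill-conditioned X
theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

# QR decomposition (3.30): X = Z M with Z unitary and M upper triangular square
# (numpy calls the factors Q and R; those symbols are reserved here for the costs)
Z, M = np.linalg.qr(X)
theta_qr = np.linalg.solve(M, Z.T @ Y)

assert np.allclose(theta_ne, theta_qr)
```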
The definition of Z and M in appendix A.1 makes use of projection matrices P. Let
P_i be the projection matrix corresponding to X_{*i}, which represents the i-th column of X.
Then P_i can be defined recursively according to:

P_i = P_{i-1} - \frac{P_{i-1} X_{*i-1} X_{*i-1}^T P_{i-1}^T}{\|P_{i-1} X_{*i-1}\|_2^2} \quad \text{and} \quad P_1 = I.   (3.31)
So P_i depends on all columns of X from X_{*1} to X_{*i−1}. Multiplying these columns with P_i
results in a zero vector, so the part of X_{*i} that is a linear combination of the columns X_{*1}
to X_{*i−1} does not contribute to the outcome of P_i X_{*i}.
Appendix A.2 shows that the matrices P can be used to solve (3.30) without matrix
inversion. Let θ̂_{n*} be the last row and θ̂_{i*} the i-th row; then they are given by:¹¹

\hat\theta_{n*} = \frac{X_{*n}^T P_n^T}{\|P_n X_{*n}\|_2^2} Y   (3.32)

\hat\theta_{i*} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \Big(Y - \sum_{j=i+1}^{n} X_{*j} \hat\theta_{j*}\Big) \quad \text{for } i < n.   (3.33)
So θ̂ can be obtained recursively by starting at the last row. If one of the columns of X
is a linear combination of all other columns, then XᵀX is singular. In this situation
P_i X_{*i} in (3.33) will become zero, resulting in a singularity as well.¹² We will use (3.33)
only for the theoretical analysis of the exploration and not for the implementation, because
its numerical performance is even worse than that of (3.29).
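A direct transcription of the recursion (3.31)–(3.33) — for theoretical illustration only, in line with the remark about its poor numerics — reproduces the ordinary least squares solution:

```python
import numpy as np

def lls_by_projection(X, Y):
    """Linear least squares via the projection recursion (3.31)-(3.33)."""
    N, n = X.shape
    P = np.eye(N)                      # P_1 = I
    Ps = []
    for i in range(n):                 # build P_1 ... P_n from (3.31)
        Ps.append(P)
        x = X[:, i:i + 1]
        Px = P @ x
        P = P - (Px @ x.T @ P.T) / (np.linalg.norm(Px) ** 2)
    theta = np.zeros((n, Y.shape[1]))
    for i in range(n - 1, -1, -1):     # start at the last row, work upward
        x = X[:, i:i + 1]
        resid = Y - X[:, i + 1:] @ theta[i + 1:]
        theta[i] = (x.T @ Ps[i].T @ resid) / (np.linalg.norm(Ps[i] @ x) ** 2)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(30, 2))
assert np.allclose(lls_by_projection(X, Y), np.linalg.lstsq(X, Y, rcond=None)[0])
```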
3.3.2 The System Identification approach
We first rewrite the estimation for the SI approach and show how the resulting feedback
depends on the estimation. We then express the estimation error and show how it depends
on the exploration. Finally we show the consequences of the estimation error for the
resulting feedback and its performance.
¹⁰ The name refers to matrices Q and R, but we will use Z and M because we already use the symbols
Q and R.
¹¹ In the rest of the chapter we will ignore the absence of the sum term for the n-th row by defining a
dummy θ̂_{n+1*} that equals zero.
¹² The scalar ‖P_i X_{*i}‖₂² in (3.33) is squared, so it will go to zero faster than the elements of the vector
X_{*i}ᵀ P_iᵀ.
The Estimation
To show the influence of the exploration, the matrix X_SI in (3.9) can be split into a part
that depends on x and a part that depends on u:

X_{SI} = \begin{pmatrix} \mathcal{A} & \mathcal{U} \end{pmatrix} = \begin{pmatrix} \mathcal{A} & \mathcal{A} L^T + c \end{pmatrix}, \quad \mathcal{A} = \begin{pmatrix} x_0^T \\ \vdots \\ x_{N-1}^T \end{pmatrix}, \quad \mathcal{U} = \begin{pmatrix} u_0^T \\ \vdots \\ u_{N-1}^T \end{pmatrix}, \quad c = \begin{pmatrix} e_0^T \\ \vdots \\ e_{N-1}^T \end{pmatrix}.   (3.34)
The control actions are thus split into a feedback part and an exploration part, but
according to (3.9) some exploration is still contained in 𝒜 and Y_SI.
Appendix A.3 shows that the columns of B̂ and Â are given by:
\hat B_{*i} = \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i}\, c_{*i}\|_2^2} \Big(Y_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} \mathcal{U}_{*j} \hat B_{*j}\Big)   (3.35)

\hat A_{*i} = \frac{\mathcal{A}_{*i}^T P_i^T}{\|P_i\, \mathcal{A}_{*i}\|_2^2} \Big(Y_{SI} - \sum_{j=i+1}^{n_x} \mathcal{A}_{*j} \hat A_{*j} - \mathcal{U} \hat B^T\Big).   (3.36)
Without exploration the value of B̂ becomes infinite, because ‖P_{n_x+i} c_{*i}‖₂² approaches zero
faster than c_{*i}ᵀ P_{n_x+i}ᵀ. This also makes Â infinite. For low exploration the term 𝒰B̂ᵀ
dominates the outcome of (3.36). Then Â becomes more linearly dependent on B̂. So for
low exploration (Â, B̂) is more likely to be uncontrollable.
Appendix A.3 also shows that the columns of D̂ = Â + B̂L are given by:

\hat D_{*i} = \frac{\mathcal{A}_{*i}^T P_i^T}{\|P_i\, \mathcal{A}_{*i}\|_2^2} \Big(Y_{SI} - c \hat B^T - \sum_{j=i+1}^{n_x} \mathcal{A}_{*j} \hat D_{*j}\Big).   (3.37)
Because B̂ is multiplied with c, D̂ does not become very large for low amounts of
exploration. Therefore we will use B̂ and D̂ to obtain the resulting feedback L̂_SI.
The Feedback
To compute the feedback, (3.5) should be solved using Â and B̂, so:

\hat K = \hat A^T \big(\hat K - \hat K \hat B (\hat B^T \hat K \hat B + R)^{-1} \hat B^T \hat K\big) \hat A + S.   (3.38)
A unique solution K̂ of (3.38) does not have to exist, especially when Â and B̂ are too
large due to insufficient exploration. In that case the right hand side of (3.38) becomes
very small, making K̂ ≈ S. So we will assume that K* − K̂ is not too large. The feedback
is computed according to (3.4) using the estimated matrices:
\hat L_{SI} = -(R + \hat B^T \hat K \hat B)^{-1} \hat B^T \hat K \hat A = (R + \hat B^T \hat K \hat B)^{-1} \hat B^T \hat K (\hat B L - \hat D).   (3.39)
By replacing Â with B̂L − D̂ (absorbing the leading minus sign), two possible outcomes
can already be given:
• Too low exploration: L̂_SI ≈ L
If the amount of exploration is much too low, B̂ in (3.39) becomes much too large
because of the low value of c in (3.35). Then D̂ and R can be neglected, resulting in
L̂_SI ≈ (B̂ᵀK̂B̂)⁻¹B̂ᵀK̂B̂L = L. This means that the outcome will approximately be
the feedback that was used to generate the data!
• High exploration: L̂_SI ≈ L*
For very high amounts of exploration the system noise V_SI in (3.9) can be neglected.
The least squares estimation will be almost perfect, so solving the DARE and
computing the feedback will approximately have the optimal feedback L* as outcome.
We can conclude that for insufficient exploration the relative performance does not change
and for abundant exploration the relative performance approaches one. We will determine
the minimal amount of exploration required to obtain the second outcome.
The Estimation Error
By defining y_k = x_{k+1} and using (3.7), it is possible to write:

y_k = D^{k+1} x_0 + \sum_{i=0}^{k} D^{k-i} (B e_i + v_i).   (3.40)

So Y_SI can be written as:

Y_{SI} = \mathcal{A} D^T + c B^T + V_{SI}.   (3.41)

This can be used to get an expression for the error in the estimations of B and D.
\hat B_{*i} = \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i}\, c_{*i}\|_2^2} \Big(\mathcal{A} D^T + c B^T + V_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} \mathcal{U}_{*j} \hat B_{*j}\Big)   (3.42)

= \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i}\, c_{*i}\|_2^2} \Big(c B^T + V_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} c_{*j} \hat B_{*j}\Big)   (3.43)

= B_{*i} + \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i}\, c_{*i}\|_2^2} \Big(V_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} c_{*j} (\hat B_{*j} - B_{*j})\Big).   (3.44)
So the estimation error B̄_{*i} = B̂_{*i} − B_{*i} is given by:

\bar B_{*i} = \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i}\, c_{*i}\|_2^2} \Big(V_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} c_{*j} \bar B_{*j}\Big)   (3.45)
In the same way:

\hat D_{*i} = \frac{\mathcal{A}_{*i}^T P_i^T}{\|P_i\, \mathcal{A}_{*i}\|_2^2} \Big(\mathcal{A} D^T + c B^T + V_{SI} - c \hat B^T - \sum_{j=i+1}^{n_x} \mathcal{A}_{*j} \hat D_{*j}\Big)   (3.46)

\bar D_{*i} = \frac{\mathcal{A}_{*i}^T P_i^T}{\|P_i\, \mathcal{A}_{*i}\|_2^2} \Big(V_{SI} - c \bar B^T - \sum_{j=i+1}^{n_x} \mathcal{A}_{*j} \bar D_{*j}\Big).   (3.47)
The estimation errors depend on the exploration noise in c and the system noise in V_SI.
The estimations B̂ and D̂ can be written as the sum of the correct value and the
estimation error, so B̂ = B + B̄ and D̂ = D + D̄. These expressions can be used in (3.39)
to see how the resulting feedback depends on the exploration and system noise, because
the correct values B and D do not depend on c and V_SI.
The Minimal Exploration
Expressions (3.45) and (3.47) hold for any c and V_SI. To focus on the amounts of
exploration and system noise, the estimation errors should be expressed using σ_e and σ_v. These
errors depend on the configuration, so only an indication of the order of magnitude of these
errors can be given.
We have to make the following assumptions:

E\{c_{*i}^T c_{*i}\} \sim c_1 N \sigma_e^2   (3.48)

E\{c_{*i}^T V_{SI*i}\} \sim c_2 N \sigma_e \sigma_v   (3.49)

E\{\|P_i \mathcal{A}_{*i}\|_2^2\} \sim c_3 + c_4 \sigma_v^2 + c_5 \sigma_v \sigma_e + c_6 \sigma_e^2,   (3.50)
with c₁ ... c₆ constants depending on n_x, n_u and B. Note that these assumptions are crude
approximations that indicate the expectation value over a time interval of size N.
Assumption (3.49) indicates that the order of magnitude of the expectation value is proportional
to the cross correlation between c_{*i} and V_{SI*i}. Given the fixed time interval, this value
may vary a lot depending on the particular noise sequences. This means that the constant
c₂ depends very much on these noise sequences. The same holds for c₅ in (3.50). The
constant c₃ is included to incorporate the dependency on the initial state x₀.
Now it is possible to give an approximation of the expected estimation errors (3.45)
and (3.47). They are proportional to:

E\{\bar B\} \sim \frac{c_2 \sigma_v}{c_1 \sigma_e}   (3.51)

E\{\bar D\} \sim \frac{\sigma_v}{c_3 + c_4 \sigma_v^2 + c_5 \sigma_v \sigma_e + c_6 \sigma_e^2}.   (3.52)

Note that E{D̄} has a maximum for σ_e = −c₅σ_v/(2c₆) ∼ σ_v, which in general is slightly less
than σ_v.
The errors in (3.51) and (3.52) are zero if σ_v = 0, so in that case exploration is only
required to prevent singularity in the computations of the least squares estimate.
For σ_v ≠ 0 it is possible to neglect D̂ and R in (3.39) if B̄ makes B̂ much too large.
The maximum of E{D̄} is less than one and E{B̄} is also less than one, so for σ_e ≈ σ_v
the D̂ and R in (3.39) cannot be neglected. This is the minimal amount of exploration
that is required; for more exploration the estimations will be almost correct. So as a rule
of thumb:
The amount of exploration should be larger than the amount of system noise!
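This rule of thumb can be checked in simulation. The sketch below is a hypothetical setup (the feedback is just some stabilizing choice, not the one used in section 3.4): it generates closed-loop data and solves the SI least squares problem; with σ_e well above σ_v the estimates of (A, B) are accurate, while with σ_e well below σ_v the bias dominates:

```python
import numpy as np

def si_estimate(A, B, L, sigma_e, sigma_v, N=2000, seed=0):
    # Generate closed-loop data x_{k+1} = A x_k + B u_k + v_k with u_k = L x_k + e_k,
    # then solve the SI least squares problem for [A B].
    rng = np.random.default_rng(seed)
    nx, nu = B.shape
    x = np.zeros(nx)
    Xs, Ys = [], []
    for _ in range(N):
        u = L @ x + sigma_e * rng.normal(size=nu)
        y = A @ x + B @ u + sigma_v * rng.normal(size=nx)
        Xs.append(np.concatenate([x, u]))
        Ys.append(y)
        x = y
    theta, *_ = np.linalg.lstsq(np.array(Xs), np.array(Ys), rcond=None)
    return theta[:nx].T, theta[nx:].T          # A_hat, B_hat

A = np.array([[-0.6, 0.4], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.1, 0.35]])      # a hypothetical stabilizing feedback

# sigma_e >> sigma_v: the estimates are close to the true (A, B)
A_hat, B_hat = si_estimate(A, B, L, sigma_e=1.0, sigma_v=1e-4)

# sigma_e << sigma_v: the bias dominates and the estimates are typically far off
A_bad, B_bad = si_estimate(A, B, L, sigma_e=1e-8, sigma_v=1e-4)
```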
3.3.3 The LQRQL approach
Analogous to the SI approach, we will determine the dependency of the estimation errors
on the exploration and system noise. We will also show the consequences of the estimation
error for the resulting feedback and its performance.
The Estimation
To show the influence of the exploration, the matrix X_QL in (3.24) should be rearranged in
such a way that linear dependencies between the columns of X_QL can be used. Write Φ_k
as:

\Phi_k = \phi_k \phi_k^T - \phi_{k+1} \phi_{k+1}^T = \begin{pmatrix} x_k \\ u_k \end{pmatrix} \begin{pmatrix} x_k^T & u_k^T \end{pmatrix} - \begin{pmatrix} x_{k+1} \\ L x_{k+1} \end{pmatrix} \begin{pmatrix} x_{k+1}^T & x_{k+1}^T L^T \end{pmatrix} = \begin{pmatrix} \Phi^{xx}_k & \Phi^{xu}_k \\ \Phi^{ux}_k & \Phi^{uu}_k \end{pmatrix}   (3.53)
with:

\Phi^{xx}_k = x_k x_k^T - x_{k+1} x_{k+1}^T \qquad \Phi^{ux}_k = (\Phi^{xu}_k)^T = L \Phi^{xx}_k + e_k x_k^T \qquad \Phi^{uu}_k = \Phi^{ux}_k L^T + u_k e_k^T.   (3.54)
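These identities can be verified numerically; a quick self-contained check (illustrative only, with arbitrary random values):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, nu = 2, 1
L = rng.normal(size=(nu, nx))
x_k = rng.normal(size=(nx, 1))
x_k1 = rng.normal(size=(nx, 1))          # stands in for x_{k+1}
e_k = rng.normal(size=(nu, 1))
u_k = L @ x_k + e_k                      # exploring control action

Phi_xx = x_k @ x_k.T - x_k1 @ x_k1.T
Phi_ux = u_k @ x_k.T - (L @ x_k1) @ x_k1.T
Phi_uu = u_k @ u_k.T - (L @ x_k1) @ (L @ x_k1).T

# the decompositions in (3.54)
assert np.allclose(Phi_ux, L @ Phi_xx + e_k @ x_k.T)
assert np.allclose(Phi_uu, Phi_ux @ L.T + u_k @ e_k.T)
```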
The rows vec(Φ_k)ᵀ of X_QL do not have the elements arranged according to (3.54), so we
redefine X_QL as:¹³ ¹⁴

X_{QL} = \begin{pmatrix} \mathrm{vec}(\Phi^{xx}_0)^T & \mathrm{vec}(\Phi^{ux}_0)^T & \mathrm{vec}(\Phi^{uu}_0)^T \\ \vdots & \vdots & \vdots \\ \mathrm{vec}(\Phi^{xx}_{N-1})^T & \mathrm{vec}(\Phi^{ux}_{N-1})^T & \mathrm{vec}(\Phi^{uu}_{N-1})^T \end{pmatrix} = \begin{pmatrix} \Psi^{xx} & \Psi^{ux} & \Psi^{uu} \end{pmatrix}.   (3.55)
The submatrices Ψ^xx, Ψ^ux and Ψ^uu correspond to Ĥ^xx, Ĥ^ux and Ĥ^uu, which are
rearrangements of the vectors θ̂^xx, θ̂^ux and θ̂^uu. Vector θ̂^xx has n_xx = ½ n_x(n_x + 1) elements,
vector θ̂^ux has n_ux = n_u n_x elements and vector θ̂^uu has n_uu = ½ n_u(n_u + 1) elements.
Let L_v be the lifted feedback matrix such that vec(LΦ^xx_k) = L_v vec(Φ^xx_k), and (with
a slight abuse of notation) also write L_v for the lifting that satisfies vec(Φ^ux_k Lᵀ) =
vec(Φ^ux_k) L_vᵀ. Define the matrix Υ with rows vec(e_k x_kᵀ)ᵀ and the matrix T with rows
vec(u_k e_kᵀ)ᵀ. Then

\Psi^{ux} = L_v \Psi^{xx} + \Upsilon \quad \text{and} \quad \Psi^{uu} = \Psi^{ux} L_v^T + T   (3.56)

can be used to find expressions for θ̂^xx, θ̂^ux and θ̂^uu using (3.33).
¹³ For the implementation this is not required, only for the analysis.
¹⁴ The function vec(A) stacks all columns of A into one vector. Note that here vec(Φ^ux_k) is used instead
of vec(Φ^xu_k). Only the order of the elements is different, because Φ_k is symmetric and so vec(Φ^ux_k) =
vec((Φ^xu_k)ᵀ). The reason for doing this is that the calculation of the feedback according to (3.16) makes
use of Ĥ^ux and not Ĥ^xu.
The Feedback
To compute the feedback, (3.16) should be solved using Ĥ_ux and Ĥ_uu, so:

\hat L_{QL} = -\hat H_{uu}^{-1} \hat H_{ux}.   (3.57)
Ĥ_uu and Ĥ_ux are obtained by rearranging the vectors θ̂^ux and θ̂^uu. The difference with
the SI approach is that the feedback directly follows from the estimations, so it can only
be investigated by looking at θ̂^ux and θ̂^uu. According to (3.25), these estimations are a
solution to:

Y_{QL} = \begin{pmatrix} \Psi^{xx} & \Psi^{ux} & \Psi^{uu} \end{pmatrix} \begin{pmatrix} \theta^{xx} \\ \theta^{ux} \\ \theta^{uu} \end{pmatrix} + V_{QL}.   (3.58)
The estimation of θ̂^uu using (3.33) is given by:

\hat\theta_{uu,i} = \frac{(\Psi^{uu}_{*i})^T P_{n_{xx}+n_{ux}+i}^T}{\|P_{n_{xx}+n_{ux}+i}\, \Psi^{uu}_{*i}\|_2^2} \Big(Y_{QL} - \sum_{j=n_{xx}+n_{ux}+i+1}^{n_{xx}+n_{ux}+n_{uu}} \Psi^{uu}_{*j} \hat\theta_{uu,j}\Big)

= \frac{(T^{ee}_{*i})^T P_{n_{xx}+n_{ux}+i}^T}{\|P_{n_{xx}+n_{ux}+i}\, T^{ee}_{*i}\|_2^2} \Big(Y_{QL} - \sum_{j=n_{xx}+n_{ux}+i+1}^{n_{xx}+n_{ux}+n_{uu}} \Psi^{uu}_{*j} \hat\theta_{uu,j}\Big).   (3.59)
T^ee has vec(e_k e_kᵀ)ᵀ as rows, because vec(L x_k e_kᵀ)ᵀ has no effect on the multiplication
with the matrices P_{n_xx+n_ux+i}.¹⁵ Equation (3.59) is similar to (3.35), and without exploration
T^ee becomes zero, causing a singularity (just like c in (3.35)). The estimation of θ̂^ux has a
similar form as (3.36):
\hat\theta_{ux,i} = \frac{\Upsilon_{*i}^T P_{n_{xx}+i}^T}{\|P_{n_{xx}+i}\, \Upsilon_{*i}\|_2^2} \Big(Y_{QL} - \sum_{j=n_{xx}+i+1}^{n_{xx}+n_{ux}} \Psi^{ux}_{*j} \hat\theta_{ux,j} - \Psi^{uu} \hat\theta_{uu}\Big).   (3.60)
The linear relation Ψ^uu = Ψ^ux L_vᵀ + T resembles 𝒰 = 𝒜Lᵀ + c. So it is also possible to
define a θ̂_d, equivalent to the closed loop (3.37) in the SI approach, according to:
\hat\theta_{d,i} = \frac{\Upsilon_{*i}^T P_{n_{xx}+i}^T}{\|P_{n_{xx}+i}\, \Upsilon_{*i}\|_2^2} \Big(Y_{QL} - T^{ee} \hat\theta_{uu} - \sum_{j=n_{xx}+i+1}^{n_{xx}+n_{ux}} \Psi^{ux}_{*j} \hat\theta_{d,j}\Big).   (3.61)
Since θ̂_d = θ̂^ux + θ̂^uu Lᵀ can be rearranged to Ĥ_d = Ĥ_ux + Ĥ_uu L, (3.57) can be written as:

\hat L_{QL} = \hat H_{uu}^{-1} (\hat H_{uu} L - \hat H_d) = L - \hat H_{uu}^{-1} \hat H_d.   (3.62)
With this result two possible outcomes can be given:
• Too low exploration: L̂_QL ≈ L
If the amount of exploration is much too low, Ĥ_uu is much larger than Ĥ_d, so the second
term in (3.62) can be ignored. The outcome will approximately be the feedback that
was used to generate the data!
• High exploration: L̂_QL ≈ L′
For very high exploration, the value of V_QL in (3.58) can be neglected. Then Ĥ will
be an almost perfect estimation of H_L, and solving (3.57) will approximately yield L′,
the result of one exact policy improvement step on L.
We can conclude that for insufficient exploration the relative performance does not change.
For high amounts of exploration the estimation will be almost correct, resulting in a relative
performance that corresponds to L′.
¹⁵ This is because the matrix Φ^uu is symmetric.
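To make the one-step nature of LQRQL concrete, here is an illustrative sketch (not the thesis implementation): θ is fitted by a single batch least squares over quadratic features and the greedy feedback (3.57) is read off. On noise-free data with ample exploration the result matches the exact policy improvement step computed from the policy's cost matrix (the feedback, horizon and seed below are arbitrary choices):

```python
import numpy as np

def lqr_ql_step(xs, us, rs, L):
    # Fit theta for Q(x, u) = phi^T H phi by one batch least squares over
    # r_k = phi_k^T H phi_k - phi_{k+1}^T H phi_{k+1} with u_{k+1} = L x_{k+1},
    # then return the greedy feedback -H_uu^{-1} H_ux of (3.57).
    nx, n = xs.shape[1], xs.shape[1] + us.shape[1]
    rows, ys = [], []
    for k in range(len(rs) - 1):
        phi = np.concatenate([xs[k], us[k]])
        phi1 = np.concatenate([xs[k + 1], L @ xs[k + 1]])
        rows.append(np.outer(phi, phi).ravel() - np.outer(phi1, phi1).ravel())
        ys.append(rs[k])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
    H = theta.reshape(n, n)
    H = (H + H.T) / 2                  # only the symmetric part is identified
    return -np.linalg.solve(H[nx:, nx:], H[nx:, :nx])

# noise-free data with ample exploration on the system of (3.67)
A = np.array([[-0.6, 0.4], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.array([[1.0]])
L = np.array([[0.1, 0.35]])            # a hypothetical stabilizing feedback

rng = np.random.default_rng(3)
x = np.array([1.0, 1.0])
xs, us, rs = [], [], []
for _ in range(200):
    u = L @ x + rng.normal(size=1)
    xs.append(x); us.append(u); rs.append(x @ S @ x + u @ R @ u)
    x = A @ x + B @ u
L_ql = lqr_ql_step(np.array(xs), np.array(us), np.array(rs), L)

# the analytic one-step policy improvement L' computed from K_L
D, K = A + B @ L, np.zeros((2, 2))
for _ in range(5000):
    K = D.T @ K @ D + S + L.T @ R @ L
L_prime = -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)
```

Note that L_ql reproduces L′ and not the optimal feedback: repeating the step with fresh data generated under L′ is what drives the feedback toward L*.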
The Estimation Error
To find the minimal exploration we adapt the expressions for the estimation errors of the
SI approach with the values for the Q-Learning approach. The error in the estimation θ̂^uu
is given by:
\bar\theta_{uu,i} = \frac{(T^{ee}_{*i})^T P_{n_{xx}+n_{ux}+i}^T}{\|P_{n_{xx}+n_{ux}+i}\, T^{ee}_{*i}\|_2^2} \Big(V_{QL} - \sum_{j=n_{xx}+n_{ux}+i+1}^{n_{xx}+n_{ux}+n_{uu}} T^{ee}_{*j} \bar\theta_{uu,j}\Big).   (3.63)
In the same way for the estimation θ̂_d:

\bar\theta_{d,i} = \frac{\Upsilon_{*i}^T P_{n_{xx}+i}^T}{\|P_{n_{xx}+i}\, \Upsilon_{*i}\|_2^2} \Big(V_{QL} - T^{ee} \bar\theta_{uu} - \sum_{j=n_{xx}+i+1}^{n_{xx}+n_{ux}} \Upsilon_{*j} \bar\theta_{d,j}\Big).   (3.64)
The order of magnitude of the errors θ̄_uu,i and θ̄_d,i remains the same after rearranging
them to H̄_uu and H̄_d.
The Minimal Exploration
To get an approximation of the minimal amount of exploration, we start again with some
assumptions. Since T^ee has vec(e_k e_kᵀ)ᵀ as rows, we will assume that E{(T^ee_{*i})ᵀ T^ee_{*i}} ∼ c₁Nσ_e⁴.
Using the definition of w_k in (3.19) we will assume that E{(T^ee_{*i})ᵀ V_QL*i} ∼ c₂Nσ_e²σ_v². We
further assume that E{‖P_{n_xx+i} Υ_{*i}‖₂²} ∼ σ_e² E{‖P_i 𝒜_{*i}‖₂²}, so that the orders of magnitude
of the expected errors are:
E\{\bar H_{uu}\} \sim \frac{c_2 \sigma_v^2}{c_1 \sigma_e^2}   (3.65)

E\{\bar H_d\} \sim \frac{\sigma_v^2}{(c_3 + c_4 \sigma_v^2)\sigma_e + c_5 \sigma_v \sigma_e^2 + c_6 \sigma_e^3}.   (3.66)
Both errors, (3.65) and (3.66), will be zero if σ_v = 0. This corresponds with the noise free
situation in [16] and [44]. In this situation the only purpose of the exploration is to prevent
singularity in the computations of the least squares estimate.
The maximum of E{H̄_d} can be expressed as σ_v(1 + κ), where κ > 0 is some value that
depends on the constants and σ_v. Without specifying the constants it is impossible to get
an idea of the size of κ. The only thing we can conclude is:
The amount of exploration required by the LQRQL approach is larger than the amount
of exploration required by the SI approach!
3.3.4 The Exploration Characteristics
Our main contribution in this chapter is the comparison of the performances of the SI and
LQRQL approaches. In particular we focused on the influence of the amount of exploration
on these performances. We will summarize our results in this section. For this we define the
exploration characteristic as the expected performance as a function of the amount of
exploration σ_e. As performance measure we will use the relative performance introduced
in section 3.2.5. First we will give a description of the similarities and then of the differences
between the two solution methods.
The outcome of both methods depends on two estimations: B̂ and D̂ for the SI approach,
Ĥ_uu and Ĥ_d for LQRQL. These estimations can be viewed as the sum of the correct
result and the estimation error (i.e. B̂ = B + B̄), where the exploration and system
noise only affect the estimation error. Based on the influence of the exploration on the
estimation errors, we can distinguish four types of outcomes. With the increase of the level
of exploration we will have the following types of outcome:
I Singularity: No exploration will result in a singularity, so there is no outcome.
II Error Dominance: If the amount of exploration is much too low, but a feedback
can be computed, the estimation errors will dominate the outcome. The resulting
feedback will approximately be the feedback that was used to generate the data, so
the relative performance does not improve.
III Sequence Dependent Outcome: For a certain amount of exploration, the
estimation errors do not dominate the outcome but are still too large to be neglected. So
the resulting feedback is partly based on the estimation error and partly on the
correct value. The outcome will depend on the particular realization of the exploration
noise sequence and system noise sequence. Therefore the relative performance can
be anything, although it is bounded from below due to the error D̄ or H̄_d.
IV Correct Estimation: For sufficient exploration the estimation errors can be
neglected. The SI approach will result in L* because the system's parameters are
estimated correctly. The LQRQL approach will result in L′ because the parameters of the
Q-function are estimated correctly.
Two exploration characteristics for the SI and LQRQL approaches are sketched in figure 3.2.
To stress the differences, log(ρ_L − 1) is shown instead of ρ_L. The differences in the four
types of outcomes are:
I Singularity: The amount of exploration appears quadratically in the estimations for
the QL approach, so it requires more exploration to prevent singularity. This means
that the lines in figure 3.2(b) start at higher values of σ_e than in figure 3.2(a).
II Error Dominance: Only the value of σ_e for which this is the outcome differs.
III Sequence Dependent Outcome: D̄ has a maximum value for σ_e < σ_v and H̄_d for
σ_e > σ_v. Also the maximum values are different. So this outcome only differs in the
lower bound for the relative performance.
IV Correct Estimation: For the SI approach the relative performance will approach
one with the increase in the exploration level. For LQRQL the feedback will approach
L′, so that the relative performance depends on the feedback that was used to generate
the data. The only way to let the relative performance approach one is to do more
policy improvement steps. In figure 3.2(b) we see that L₂ = L₁′, so L₂′ is the result
after two policy improvement steps when starting with L₁.
The main differences between the exploration characteristics in figure 3.2 are the amounts
of exploration for which these outcomes occur. It is clear that there is a difference in the
type IV outcome, because the SI approach will approximate L* and the LQRQL approach
L′. The outcome of the LQRQL approach therefore depends on L, while that of the SI
approach does not.
The dashed lines in figure 3.2(a) indicate the lower bound on the relative performance
due to D̄ (the dashed line continues under the solid and bold line). The relative performance
will not go below this line, even when an almost optimal feedback like L₂ is used. So
(a) The SI exploration characteristic. The arrow with L* indicates that the characteristic
will approach the optimal feedback if the exploration is increased.
(b) The LQRQL exploration characteristic.
Figure 3.2. The Exploration Characteristics. Both figures show log(ρ_L − 1) as a function
of σ_e when data was generated using feedback L₁ (bold line) and feedback L₂ (solid line). The
symbols next to these lines indicate the value that is being approximated. The feedback L₂ is
almost optimal, so that ρ_{L₂} is almost one. The dashed line indicates the lower bound on the
relative performances due to the errors D̄ or H̄_d. The grey area indicates outcome III, where any
relative performance above the lower bound is possible.
taking σ_e = σ_v will not "improve" the feedback. (This amount of exploration will give
an improvement for L₁.) Feedback L₂ can only be improved by increasing the amount of
exploration even more. Figure 3.2(b) shows the same effect for the LQRQL approach due
to H̄_d. It takes more exploration to guarantee L₂′ than it takes to guarantee L₁′. Both
approaches have in common that for near optimal feedbacks more exploration is required
to guarantee an improvement than just avoiding outcome III.
3.4 Simulation Experiments
The purpose of the simulation experiments is to verify the results presented in the previous
sections and to show the exploration characteristics.
3.4.1 Setup
We take a system according to (3.1) with the following parameters:

A = \begin{pmatrix} -0.6 & 0.4 \\ 1 & 0 \end{pmatrix} \qquad B = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \qquad x_0 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.   (3.67)
For the direct cost (3.2) we take S to be a unit matrix and R = 1. For this system
the number of parameters that has to be estimated for both approaches equals 6, so
taking N = 20 measurements should be enough for the estimation. The measurements are
generated according to (3.6) with

L = \begin{pmatrix} 0.2 & -0.2 \end{pmatrix} \qquad \sigma_v = 10^{-4}.   (3.68)
For this value of L the closed loop is stable. The solution of the DARE and the optimal
feedback are given by:

K^* = \begin{pmatrix} 2.302 & 0.149 \\ 0.149 & 1.112 \end{pmatrix} \qquad L^* = \begin{pmatrix} 0.373 & 0.279 \end{pmatrix}.   (3.69)
Also the relative performances for L can be computed:

\rho^{L}_{\min} = 1.469 \qquad \rho^{L}_{\max} = 2.093 \qquad \rho_{L}(x_0) = 1.832.   (3.70)
3.4.2 Exploration Characteristic
We compute the exploration characteristic by doing the same simulation experiment for
different values of σ_e. To make sure that σ_e is the only parameter that differs, we always
use the same realizations of exploration noise and system noise by using the same seeds
for the random generator. We vary σ_e from 10⁻¹² to 10⁵.
Figure 3.3(a) shows ρ^L̂_min, ρ^L̂_max and ρ_L̂(x₀) for the SI and LQRQL approach for one
realization. The exploration intervals for the four types of outcomes can be seen in
figure 3.3(a):
(a) ρ^L̂_min, ρ^L̂_max and ρ_L̂(x₀) for one realization.
(b) log(ρ^L̂_max − 1) for five realizations.
Figure 3.3. Simulation Results. The dotted vertical line indicates the system noise level
σ_v = 10⁻⁴. The dashed lines are the results for the SI approach and the solid lines are the results
for the LQRQL approach.
I SI: σ_e < 10⁻¹² (not shown in figure 3.3(a)); LQRQL: σ_e < 10⁻⁷.
II SI: 10⁻¹² < σ_e < 10⁻⁷; LQRQL: 10⁻⁷ < σ_e < 10⁻⁴ = σ_v. The values of ρ^L_min, ρ^L_max
and ρ_L(x₀) in figure 3.3(a) agree with the values in (3.70).
III SI: 10⁻⁷ < σ_e < 10⁻⁴; LQRQL: 10⁻⁴ < σ_e < 10⁻². This particular realization results
in an increase of the RP for both methods.
IV SI: σ_e > 10⁻⁴; LQRQL: σ_e > 10⁻². The RP for both approaches seem to be one.
In the results in figure 3.3(b) we want to focus on the type III and IV outcomes. In
figure 3.3(a) the outcomes of type IV are not very clear; therefore figure 3.3(b) shows
log(ρ^L̂_max − 1) and not ρ^L̂_max. The outcomes are shown for five different realizations.
III For some realizations ρ^L̂_max < ρ^L_max and for other realizations ρ^L̂_max > ρ^L_max. This
holds for both approaches, but not always for the same realizations (not shown in
figure 3.3(b), where the lines are not labeled). So if ρ^L̂_max is high for one approach, it
does not imply that it is also high for the other approach.
IV The RP for both approaches are not equal to one, but they are close to one! For the SI
approach the value of ρ^L̂_max gets closer to one if the amount of exploration is increased.
For LQRQL the value of ρ^L̂_max does not approach one if the amount of exploration is
increased. Instead it approaches ρ^{L′}_max > 1.
3.5 Discussion
In this chapter we continued investigating RL in the context of LQR as in [16][44]. In
order to make it more realistic we included system noise in our analysis. The proofs of
convergence in [16][44] are only valid if there is no noise and sufficient exploration is used.
In this chapter we showed that the system noise determines the amount of exploration
that is sufficient. There is a hard threshold level for the amount of exploration required.
Just below that threshold the resulting feedback can be anything and may even result in an
unstable closed loop. This is the amount of exploration that has to be avoided. For the
SI approach an alternative method was proposed to deal with the bias resulting from the
system noise [21]. The idea is to add a bias towards the optimal solution. The effect of
such an approach is that it may reduce the probability of an unstable closed loop for
the type III outcome, but this still cannot be guaranteed. Therefore avoiding type III
outcomes is safer.
If we reduce the amount of exploration even more, we will find the feedback used to
generate the data as the "optimal" solution. Although this is not dangerous, this result is
not very useful. If the amount of exploration is reduced even more, no feedback can be
computed because of a singularity in the least squares estimation. This effect is also present
without noise, so the purpose of exploration in [16][44] is to prevent numerical problems
with the recursive least squares estimation. The amount of exploration that is sufficient in
that case is determined by the machine precision of the computer.
We also compared the result of LQRQL with an indirect approach. We observed that
the performance of this approach as a function of the amount of exploration is very similar
to that of LQRQL. The main difference is that the threshold level of the amount of
exploration required is lower. This means that under the circumstances under which convergence
of LQRQL can be proven, it is wiser to use the indirect approach.
In [32] some additional experiments are described. There, two almost identical data
sets are shown, where one data set did not change the feedback and where the other gave
an improvement. This indicates that visual inspection of the data does not reveal whether
suﬃcient exploration was used. For the LQRQL approach we can look at the eigenvalues
of
ˆ
H. If it has negative eigenvalues, the quadratic Qfunction is not positive deﬁnite and
therefore insuﬃcient exploration was used. For the SI approach such an indication is not
available.
We did not look at the inﬂuence of the feedback itself, but this can only have an eﬀect for
the higher amounts of exploration. Just below the threshold level the feedback determines
the probability of an unstable closed loop. Since this situation has to be avoided, this is
not of interest. For suﬃcient exploration the feedback will determine how many policy
iteration steps are required. When starting with a good performing feedback only a few
steps are required.
Our analysis was based on estimating the magnitudes of the estimation errors. These
errors still depend on the number of time steps used. The contribution of the number of
time steps to the performance of the indirect approach is described in [27]. The results
presented there are very conservative and indicate that a large number of time steps is
required for a guaranteed improvement of the performance. Based on our experiments
we see that the number of time steps required is just a couple of times the number of
parameters to estimate.
3.6 Conclusion
We have shown a fair comparison between two diﬀerent approaches to optimize the feedback
for an unknown linear system. For the system identiﬁcation approach the estimation and
optimization are performed separately. For the QLearning approach the optimization is
implicitly included in the estimation. The comparison is fair because both approaches
used the same data, and no other parameters had to be chosen. So the diﬀerences in
performance are due to the approaches.
The ﬁrst conclusion is that for insuﬃcient exploration the result of the optimization
will be the same as the feedback that was used to generate the data. So no change in
feedback does not imply that the feedback is already optimal. This result is a consequence
of the noise in the system. This noise introduces a bias in the estimation and when using
insuﬃcient exploration, this bias dominates the estimated outcome.
If the exploration is insufficient, but large enough that the resulting feedback will not
be the same as the initial feedback, then the resulting performance becomes very
unpredictable. The closed loop can be stable and the performance can be improved, but it is
also possible that the closed loop becomes unstable. These results have to be avoided and
therefore it is very important that sufficient exploration is used.
The second conclusion is that the LQRQL approach can be guaranteed to optimize the
feedback. This is the first continuous state space problem with noise for which a
reinforcement learning approach can be guaranteed to work. This is the good news. The bad
news is that when the conditions hold and a good outcome can be guaranteed, an alternative
approach based on system identification will perform better. So we cannot recommend
using the LQRQL approach for the linear quadratic regulation task.
Chapter 4
LQRQL for Nonlinear Systems
4.1 Introduction
In this chapter we will study the applicability of linear approximations for nonlinear systems.
First we will show that other approaches to design controllers for nonlinear systems are
often based on local linear approximations. We will rewrite the nonlinear system as a
linear system with a nonlinear correction. This allows us to study the effect of the nonlinear
correction on the estimations of the SI and LQRQL approaches as described in chapter 3.
We can show that these two approaches estimate the parameters of the wrong function if
this correction is not zero.
In a local part of the state space of a smooth nonlinear function, the correction can be
assumed to have a constant value. For this situation we will introduce the extended LQRQL
approach. In this approach the parameters of a more general quadratic Q-function
are estimated, so that more parameters have to be estimated. The resulting feedback function
no longer has to go through the origin. Therefore this approach is better suited to a local
part of the state space of a nonlinear function.
The eﬀect of the extension of LQRQL is shown in experiments on a nonlinear system.
The choice of system was such that we were able to vary the eﬀect of nonlinearity by the
choice of the initial state. In this way we were able to show how the extended approach
compares with the other two approaches for diﬀerent sizes of the nonlinear correction. The
experiments were performed in simulation and on the real system.
4.2 Nonlinearities
4.2.1 The nonlinear system
In this chapter we will only consider systems that can be represented by first order
vectorized difference equations. This is the class of time discrete Markov systems. The systems
are described as in (1.1):

x_{k+1} = f(x_k, u_k, v_k),    (4.1)
where f is a functional mapping that maps the present state, control action and noise
to the next state value. We will also assume that f is a smooth, continuously differentiable
mapping, for which the gradients with respect to x_k and u_k are bounded.
One class of systems that agrees with (4.1) is the class of linear systems according
to (3.1). This is an important class. Due to the linearity, the mathematics of analyzing
the system's behavior becomes tractable. For a controller design strategy this is important,
because it makes it possible to verify whether control design specifications are met.
Therefore many controller design strategies are based on linear systems.
To illustrate how linear systems simplify the controller design, take a task for which the
objective is to keep the state value fixed at a value x_s. The state value does not change if
x_{k+1} equals x_k. This means that there has to exist a constant control action u_s for which

x_s = A x_s + B u_s    (4.2)

holds. The state that is unchanged under control action u_s is given by:

x_s = (I − A)^{−1} B u_s,    (4.3)

where we will call x_s the set point. This shows that if a solution exists, the state value
x_s is uniquely determined by the control action u_s. It is also important to note that the
existence of this solution is completely determined by the parameters of the linear system.
So if it exists, it exists for all possible control actions u_s.
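The set point relation (4.3) is easy to check numerically. The sketch below (assuming numpy; the values of A, B and u_s are hypothetical illustrations, not from the thesis) computes x_s and verifies that it is unchanged under u_s:

```python
import numpy as np

# Hypothetical stable linear system x_{k+1} = A x_k + B u_k
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
B = np.array([[0.0],
              [1.0]])

u_s = np.array([[2.0]])                           # a constant control action
x_s = np.linalg.solve(np.eye(2) - A, B @ u_s)     # (4.3): x_s = (I - A)^{-1} B u_s

# the set point is indeed unchanged under u_s, cf. (4.2)
assert np.allclose(x_s, A @ x_s + B @ u_s)
```

Note that the solve fails exactly when I − A is singular, i.e. when A has an eigenvalue 1; this is the existence condition mentioned above.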
The result of the design of the controller is that we get a system with a feedback
controller. In chapter 1 we already showed that for a linear system with a linear feedback,
the behavior of the closed loop can be expressed using the parameters of the closed loop.
If the closed loop is stable the system will always end up in the origin. If the closed loop is
unstable, the system will never end up in the origin.
If we take a general nonlinear function as in (4.1), then instead of (4.2) we have

x_s = f(x_s, u_s, 0).    (4.4)

In this case it is possible to have more solutions. More solutions means that for one x_s
more values of u_s are possible, but also that for one u_s more values of x_s are possible.
This implies that the correctness of a control action can only be verified when considering
the exact value of the state and control action. The same holds for the equilibrium state
when a feedback function u = g(x) is used. The equilibrium state is the solution to

x_eq = f(x_eq, g(x_eq), 0).    (4.5)
Again it is possible that multiple solutions exist. For some solutions the system will
approach the equilibrium while for others it will not. The consequence is that the stability
of the system also depends on part of the state space. In some parts of the state space
the closed loop can be stable, while in other parts it is unstable. It is clear that this will
make the design of a controller very complicated. For the linear case it is just a matter of
making the closed loop stable, for the nonlinear case it may also include deﬁning the part
of the state space where the closed loop has to be stable.
We can conclude that the main diﬀerence between linear and nonlinear systems is that
for linear systems the properties are globally valid. They are given by the parameters of
the linear system and do not depend on the state value. For nonlinear systems, properties
can be only locally valid. They are given not only by the parameters but also depend on
the value of the state.
4.2.2 Nonlinear approaches
We will give a short overview of some methods for obtaining feedback functions for
nonlinear systems. We will also show how these methods rely on techniques derived for linear
systems.
Fixed Set Point
In (4.4) the noise was set to zero to define the set point x_s and the corresponding constant
control action u_s. In practice, however, there is always noise. So even if we have a nonlinear
system whose state value equals x_s when u_s is applied, due to the noise the next state
value can be different. Then the system is no longer in its set point, so the state value will
change again. For a general function f this change can be anything and does not have to
bring the system back to its set point.
To make sure the system will go back to the set point, a feedback can be added. The
input of this feedback is the diﬀerence between the state value and the set point. So the
task of this feedback is to control its input to zero. To design a linear feedback that will do
this, a local linearization of the nonlinear function has to be made around the set point.
If the mapping f of the nonlinear system (4.1) is continuously differentiable, the system
can be assumed to be locally linear around the set point. Let x̄_k = x_k − x_s and ū_k = u_k − u_s.
Then (4.1) can be rewritten according to:

x_{k+1} − x_s = f(x_k, u_k, v_k) − x_s    (4.6)
x̄_{k+1} = f(x_s + x̄_k, u_s + ū_k, v_k) − x_s    (4.7)
        ≈ f(x_s, u_s, 0) + f_x(x_s, u_s, 0) x̄_k + f_u(x_s, u_s, 0) ū_k + f_v(x_s, u_s, 0) v_k − x_s    (4.8)
x̄_{k+1} = Ã x̄_k + B̃ ū_k + B̃_v v_k.    (4.9)

In (4.6) the set point x_s is subtracted on both sides of (4.1). This is then expressed in
the new state and control action x̄_k and ū_k in (4.7). Note that we assume that v has zero
mean. In (4.8) the mapping f is replaced by its first order Taylor expansion around the
set point, where the mappings f_x, f_u and f_v contain the appropriate derivatives. If u_s is
chosen in such a way that f(x_s, u_s, 0) = x_s and the mappings are replaced by the matrices
Ã, B̃ and B̃_v, then this forms the linear system (4.9).
In (4.9) we have a linear system for which a linear feedback can be designed, but care
has to be taken. The linear system is only valid near the set point so its properties are not
globally valid.
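When the mapping f is only available as a black box, the derivative mappings f_x and f_u in (4.8) can be approximated by finite differences. A minimal sketch, assuming numpy and a hypothetical nonlinear f (noise omitted):

```python
import numpy as np

def f(x, u):
    # hypothetical smooth nonlinear system (noise omitted for the sketch)
    return np.array([0.8 * x[0] + 0.1 * x[1] ** 2,
                     0.2 * x[1] + u[0]])

def jacobians(f, x_s, u_s, eps=1e-6):
    """Central finite-difference approximations of A~ = f_x and B~ = f_u at (x_s, u_s)."""
    n, m = len(x_s), len(u_s)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x_s + dx, u_s) - f(x_s - dx, u_s)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x_s, u_s + du) - f(x_s, u_s - du)) / (2 * eps)
    return A, B

# linearize around the origin as set point, with u_s = 0
A_tilde, B_tilde = jacobians(f, np.array([0.0, 0.0]), np.array([0.0]))
```

The resulting Ã and B̃ are only valid near the chosen (x_s, u_s), exactly as the text warns.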
Gain Scheduling
The ﬁxed set point approach only gives a feedback that is valid near the set point. To get
a feedback function that is valid in a larger part of the state space, multiple linear models
can be used. This means that locally linear models are computed for many diﬀerent set
points. As explained in chapter 2 we can form one global nonlinear feedback by combining
all the local linear models.
One approach is to form one global feedback where the control action is determined by
only one local linear model that is valid at the current state value. This is Gain Scheduling.
A possible implementation of Gain Scheduling is to partition the state space and compute
for each partition a local linearized model with respect to the center of the partition.
These linear models can be used to compute the appropriate local linear feedbacks. The
only diﬀerence with the ﬁxed set point approach is that now the local linear feedback
does not have to result in an equilibrium state around the center of the partitioning. The
feedback may change the state of the system to a part of the state space where a diﬀerent
local model will determine the control action.
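The dispatch logic of Gain Scheduling can be sketched in a few lines; the partition boundaries and local gains below are hypothetical placeholders for a scalar state space:

```python
import numpy as np

# Hypothetical schedule: one local feedback (L, l) per region of a scalar state space
schedule = [
    (-np.inf, -1.0, (0.5, 0.3)),    # (lower, upper, (L, l)) for x < -1
    (-1.0, 1.0, (0.2, 0.0)),        # region around the origin
    (1.0, np.inf, (0.5, -0.3)),     # x >= 1
]

def gain_scheduled_control(x):
    """Pick the local linear feedback u = L x + l that is valid at the current state."""
    for lo, hi, (L, l) in schedule:
        if lo <= x < hi:
            return L * x + l
    raise ValueError("state outside schedule")

u_mid = gain_scheduled_control(0.5)    # uses the middle model
u_high = gain_scheduled_control(2.0)   # uses the third model
```

The switching between models is exactly what the text describes: the feedback applied in one partition may drive the state into another partition, where a different local model takes over.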
Feedback Linearization
In some cases it is possible to change a nonlinear system into a globally linear system.
This is called feedback linearization. The objective is to change the nonlinear system into
a linear system of our choice. This is only possible for a restricted class of nonlinear systems
for which the mapping f of the nonlinear system in (4.1) can be written as
f(x, u, v) = g(x) +h(x)u +v, (4.10)
with g and h mappings of the appropriate dimensions.
Suppose we want to have a linear system with parameters Ã and B̃ and control input
ũ. Then the control action can be computed according to

u = h(x)^{−1} (Ã x − B̃ ũ − g(x)),    (4.11)

where ũ represents a new control action. Applying (4.11) to a nonlinear system with (4.10)
will transform the nonlinear system into a linear system:

x_{k+1} = g(x_k) + h(x_k) u_k + v_k
        = g(x_k) + h(x_k) h(x_k)^{−1} (Ã x_k − B̃ ũ_k − g(x_k)) + v_k
        = Ã x_k − B̃ ũ_k + v_k.    (4.12)
The result is a linear system, so the appropriate new control actions ũ can be determined
using conventional linear techniques. The parameters of the new linear system are chosen
to simplify the design of the controller for the new linear system.
It is clear that this approach is only possible for a restricted class of nonlinear systems.
Not only does (4.10) have to hold, but (4.11) also has to exist. In case of a scalar state
and action it is clear that h(x) ≠ 0 is a sufficient condition for the existence of (4.11).
For more general situations the existence has to be verified using Lie brackets. For more
detailed information about this approach see [37].
The existence of (4.11) is important, but it also has to be found. When the physics
of the system is known this can be computed. When the physics is not known (4.11)
has to be approximated by a general function approximator. In that case it is no longer
possible to guarantee global properties of the system, since it relies on the quality of the
approximation. For a continuous time conﬁguration, conditions for suﬃcient quality of the
approximation for feedback linearization can be given [78]. One of these conditions is that
the system is excited enough, so that the state space is explored suﬃciently.
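For a scalar system of the form (4.10) with known g and h, the transformation (4.11) can be written out directly. In this sketch g, h and the target parameters Ã, B̃ are hypothetical choices; the assertion checks that one transformed step matches (4.12):

```python
import numpy as np

g = lambda x: 0.5 * np.sin(x)    # hypothetical nonlinear drift
h = lambda x: 2.0 + np.cos(x)    # h(x) != 0 everywhere, so the inverse in (4.11) exists
A_t, B_t = 0.5, 1.0              # chosen parameters of the target linear system

def linearizing_control(x, u_new):
    # (4.11): u = h(x)^{-1} (A~ x - B~ u_new - g(x))
    return (A_t * x - B_t * u_new - g(x)) / h(x)

# One transformed step should match x_{k+1} = A~ x_k - B~ u~_k, cf. (4.12)
x, u_new = 0.7, 0.2
x_next = g(x) + h(x) * linearizing_control(x, u_new)
assert np.isclose(x_next, A_t * x - B_t * u_new)
```

When g and h are not known but approximated, this cancellation is no longer exact, which is precisely why the global properties then depend on the quality of the approximation.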
Linear Matrix Inequalities
Linear Matrix Inequality (LMI) techniques [15] are used to analyze nonlinear systems and
prove properties, like stability. These techniques are based on the idea that a nonlinear
system can be described by a set of linear systems. If we look at only one state transition
of a nonlinear system, then a linear system can be defined that generates the same state
transition:¹

x_{k+1} = f(x_k, u_k) = A(x_k, u_k) x_k + B(x_k, u_k) u_k.    (4.13)

Here A(x_k, u_k) and B(x_k, u_k) represent matrices whose parameters depend on x_k and
u_k. So for the current state and control action there is a linear system for which the state
transition will also lead to x_{k+1}.
To say something about the stability of the nonlinear system, all possible values of x
and u should be considered. The parameter matrices A(x, u) and B(x, u) for all possible
x and u form a set. If all linear systems corresponding to the matrices in the set are stable,
then also the nonlinear system must be stable. This is because for any state transition
of the nonlinear system there exists a stable linear system that will have the same state
transition.
The set can consist of an inﬁnite number of linear systems, making it impossible to
prove the stability of all linear systems. One solution is to use a polytope LMI. In the
parameter space of the linear system a polytope is selected that encloses the complete set.
If the linear systems corresponding to the edges of the polytope are stable then all linear
systems within the polytope are stable as well.
Another approach is to use a norm bounded LMI. One stable linear system with
parameters Ã and B̃ is selected, and all linear systems are considered to have parameters Ã
and B̃ plus an extra feedback. The state transitions of the linear systems are described as
x_{k+1} = Ã x_k + B̃ u_k + w(x_k, u_k), where the vector w(x_k, u_k) represents the extra feedback.
If it can be proven that the norm of w is less than one for all values of x and u, then the
nonlinear system is stable.

¹ For simplicity we assume there is no noise.
Since LMIs provide the opportunity to guarantee stability for nonlinear systems, they
can be used for stability based controller design. The LMI technique is used to specify the
set of feasible feedbacks and a controller design approach is used to select one feedback from
that set. The drawback of the LMI approaches is that the set of feasible feedbacks depends
on the choice of polytope or
˜
A and
˜
B. In case of an unfortunate choice it is possible that
many good feedbacks are rejected. In the worst case the set of feasible feedbacks is empty.
4.2.3 Summary
We described some approaches to obtain feedback functions for nonlinear systems. Except
for feedback linearization, all were based on local linear approximations of the nonlinear
feedback. These approaches assume that the mapping f is available, but when it is
unknown the local linear system can be approximated by the SI approach from chapter 3.
The LQRQL approach in chapter 3 was also derived for linear systems. We are interested
in using LQRQL to obtain a local linear approximation of the optimal feedback function.
4.3 The Extended LQRQL Approach
In this section we will first show that the SI and LQRQL approaches from chapter 3 have
limitations when they are applied to nonlinear systems. This is especially the case when
the estimations are based on data generated in a small part of the state space. We will
show that LQRQL can be extended to overcome these limitations, resulting in a solution
that is more appropriate in a small part of the state space.
4.3.1 SI and LQRQL for nonlinear systems
The SI and LQRQL approaches were derived for linear systems. In order to see how they
would perform when applied to nonlinear systems, we write (4.1) as

x_{k+1} = f(x_k, u_k, v_k)    (4.14)
        = Ã x_k + B̃ u_k + (f(x_k, u_k, v_k) − Ã x_k − B̃ u_k)
        = Ã x_k + B̃ u_k + w(x_k, u_k, v_k).    (4.15)

This describes a linear system with parameters Ã and B̃ and an additional nonlinear
correction function w.² Equation (4.15) strongly resembles a norm bounded LMI. In our
case the mapping f is unknown, so we cannot use (4.15) as an LMI.

The SI and LQRQL approaches obtain a feedback based on a generated train set. If
we assume that the train set is generated in a small local part of the state space, we can
simplify (4.15) by replacing the nonlinear function w(x, u, v) with its average value for
the train set. So we look at the system:

x_{k+1} = A x_k + B u_k + w,    (4.16)

where the vector w has a fixed constant value.³

² The linear system (3.1) is a special case, for which w(x, u, v) = v for all x, u, v.
We can apply the SI approach, which will result in the estimated linear system x_{k+1} =
Â x_k + B̂ u_k. In this linear system the correction is not present. This implies that Â and
B̂ are estimated such that the average value of w is zero. So when the average value of w
is far from zero, the estimated linear system with Â and B̂ is not a good approximation of
f in (4.14). The resulting feedback L̂_SI will not approximate the optimal feedback at the
part of the state-action space where the train set was generated.
In order to see the consequences of w on the result of LQRQL, we can look at the
resulting linear feedback function u = Lx. When we apply the feedback function u_k = L x_k
to control (4.16), the equilibrium state x_eq is no longer at the origin. The equilibrium state
is given by:

x_eq = (I − A − BL)^{−1} w.    (4.17)

This shows that the equilibrium state depends on L! The LQRQL approach was derived
for a linear system with w = 0, so the equilibrium state was always in the origin. Only
the trajectory towards the equilibrium state can be optimized.
At the equilibrium state in (4.17) the direct costs (3.2) are not zero. So LQRQL applied
to (4.16) will not only optimize the trajectory toward the equilibrium state, but also the
value of the equilibrium state. If both the equilibrium state and the trajectory towards the
equilibrium state have to be optimized, then the feedback function u = Lx has insuﬃcient
degrees of freedom.
4.3.2 The extension to LQRQL
Another way to see why the linear feedback does not have enough degrees of freedom is by
looking at its approximation of a nonlinear optimal feedback. Let us assume an optimal
nonlinear feedback function g* as shown in figure 4.1. The local linear approximation according
to u = Lx is not very good. The figure also shows that the addition of an extra constant
l to the feedback function will allow for a better local linear approximation.

If the feedback function u = Lx + l is used for (4.16), then the equilibrium state is given
by

x_eq = (I − A − BL)^{−1} (B l + w).    (4.18)

Here we see that if the feedback L is chosen such that the trajectory towards the equilibrium
state is optimal, the value of l can be used to select the optimal equilibrium state.⁴ The l
can be interpreted as the constant u_s in the Fixed Set Point approach. The difference is
that the value of l is not chosen, but optimized using the extended LQRQL approach.

³ Strictly speaking (4.16) does not have to represent a nonlinear system, because the value of w is also
globally valid. We can also view (4.16) as a linear system where the mean of the system noise is not zero.
The w then represents the mean of the system noise.
⁴ Whether this is possible depends on B and w.

Figure 4.1. The local linear approximation of the optimal nonlinear feedback function. The solid
lines indicate the region of approximation. Because the feedback u = Lx has to go through the
origin, the local linear approximation cannot be a good approximation of the nonlinear function.
The approximation u = Lx + l does not have to go through the origin and is able to match
the nonlinear function more closely.
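The role of l can be illustrated numerically: the equilibrium state of (4.16) under u = Lx + l is the solution of x = Ax + B(Lx + l) + w, and changing l moves it. The values of A, B, L, w and l below are hypothetical:

```python
import numpy as np

# Hypothetical values for (4.16) with feedback u = L x + l
A = np.array([[0.5, 0.1], [0.0, 0.3]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.1, -0.2]])
w = np.array([[0.5], [0.5]])

def equilibrium(l):
    # solve the fixed point x = A x + B (L x + l) + w
    return np.linalg.solve(np.eye(2) - A - B @ L, B * l + w)

x0 = equilibrium(0.0)    # equilibrium under u = L x (plain linear feedback)
x1 = equilibrium(-0.4)   # the constant l shifts the equilibrium state
assert not np.allclose(x0, x1)
assert np.allclose(x1, A @ x1 + B @ (L @ x1 + (-0.4)) + w)
```

With L fixed, varying l traces out the equilibrium states that can be reached, which is the extra degree of freedom the extension provides.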
The resulting feedback function of LQRQL is a consequence of the choice of the
quadratic Q-function. For the new feedback function, a new Q-function is needed. The
Q-function used in the previous chapter was of the form Q(φ) = φᵀ H φ, where φ_kᵀ =
[ x_kᵀ  u_kᵀ ]. This is not a very general quadratic function. A general quadratic function is
given by:

Q(φ) = φᵀ H φ + Gᵀ φ + c.    (4.19)
By including a term with a vector G and a constant c, any quadratic function can be
represented by (4.19). If (4.19) has the optimal parameters H*, G* and c*, the greedy feedback
can be found by taking the derivative with respect to u and setting it to zero. This results in

u* = −(H*_uu)^{−1} (H*_ux x + G*ᵀ)    (4.20)
   = L* x + l*.    (4.21)
This indicates that the Q-function (4.19) will result in the feedback we want. Compared to
(3.16) we see that again L* = −(H*_uu)^{−1} H*_ux. This shows that L* optimizes the trajectory
to the equilibrium state. The difference with (3.16) is that a constant l* = −(H*_uu)^{−1} G*ᵀ is
added to the feedback function. The purpose of this constant is to determine the optimal
equilibrium state. Note that the scalar c in (4.19) does not appear in (4.21), so it does not
have to be included in the estimation.
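Given estimates of H and G partitioned into state and action blocks, the greedy feedback (4.20)-(4.21) follows in a few lines. The numbers below are hypothetical, and this sketch uses the action part of G for the constant term:

```python
import numpy as np

n_x, n_u = 2, 1
# Hypothetical estimated parameters of Q(phi) = phi^T H phi + G^T phi
H = np.array([[2.0, 0.0, 0.3],
              [0.0, 2.0, 0.1],
              [0.3, 0.1, 1.0]])      # symmetric, with positive definite H_uu
G = np.array([[0.2], [0.1], [0.4]])

H_uu = H[n_x:, n_x:]                 # action-action block
H_ux = H[n_x:, :n_x]                 # action-state block
G_u = G[n_x:]                        # action part of G

L = -np.linalg.solve(H_uu, H_ux)     # L* = -(H_uu)^{-1} H_ux, as in (4.20)
l = -np.linalg.solve(H_uu, G_u)      # l* = -(H_uu)^{-1} G_u, the added constant

u_star = L @ np.array([[1.0], [1.0]]) + l   # greedy action at x = [1, 1]^T
```

Using `solve` instead of an explicit inverse is the standard numerically safer choice; the positive definiteness of H_uu is exactly the condition that the eigenvalue check mentioned in chapter 3 verifies.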
4.3.3 Estimating the new quadratic Q-function

The estimation method for the new quadratic Q-function will be slightly different because
more parameters have to be estimated. In the previous chapter the estimation was based
on the temporal difference (3.18). A similar approach will be used for the Q-function
according to (4.19).⁵ The difference with (3.19) is that φ_{k+1}ᵀ = [ x_{k+1}ᵀ  (L x_{k+1} + l)ᵀ ] and the
consequences of the noise are represented by ν_k. Then, in the same way as in (3.20):

r_k = Σ_{i=k}^{∞} r_i − Σ_{i=k+1}^{∞} r_i = Q(x_k, u_k) − Q(x_{k+1}, L x_{k+1} + l)    (4.22)
    = φ_kᵀ H φ_k + Gᵀ φ_k − φ_{k+1}ᵀ H φ_{k+1} − Gᵀ φ_{k+1} + ν_k    (4.23)
    = vec(φ_k φ_kᵀ) vec(H) − vec(φ_{k+1} φ_{k+1}ᵀ) vec(H) + (φ_kᵀ − φ_{k+1}ᵀ) G + ν_k    (4.24)
    = [ vec(φ_k φ_kᵀ − φ_{k+1} φ_{k+1}ᵀ)   φ_kᵀ − φ_{k+1}ᵀ ] [ vec(H) ; G ] + ν_k.    (4.25)
This again can be used to express the estimation as a linear least squares estimation:

Y_EX = [ r_0 ; r_1 ; … ; r_{N−1} ]
     = [ vec(φ_0 φ_0ᵀ − φ_1 φ_1ᵀ)              φ_0ᵀ − φ_1ᵀ     ]
       [ vec(φ_1 φ_1ᵀ − φ_2 φ_2ᵀ)              φ_1ᵀ − φ_2ᵀ     ]  θ_EX + [ ν_0 ; ν_1 ; … ; ν_{N−1} ]    (4.26)
       [ ⋮                                     ⋮               ]
       [ vec(φ_{N−1} φ_{N−1}ᵀ − φ_N φ_Nᵀ)      φ_{N−1}ᵀ − φ_Nᵀ ]
     = X_EX θ_EX + V_EX.    (4.27)
The estimation θ̂_EX = (X_EXᵀ X_EX)^{−1} X_EXᵀ Y_EX gives an estimation of θ_EX, and because
vec(H) and G are included in θ_EX, it also gives the estimations Ĥ and Ĝ. The Ĥ and
Ĝ can be used in the same way as in (4.21) to obtain the resulting L̂_EX and l̂.

The absence of the constant c in (4.25) indicates that c does not influence the outcome
of the estimation.⁶ So the actual function that is estimated is

Q̂(φ) = φᵀ Ĥ φ + Ĝᵀ φ.    (4.28)
This is the function we will use as Q-function, and we will call this the Extended LQRQL
approach. For this function Q̂(0) = 0, so it is not a general quadratic function anymore.
Still, this is general enough, because the value of c also does not influence (4.21). The reason
that this constant c can be ignored completely is that it represents the costs that will be
received anyway, regardless of the optimization. This indicates that the optimization is only
based on avoidable costs. Costs that will be received anyway do not influence the resulting
feedback function.
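The construction of X_EX and Y_EX in (4.26)-(4.27) can be sketched directly from a recorded trajectory of φ vectors and direct costs. The data below is synthetic and hypothetical; note that the symmetric redundancy in vec(φφᵀ) makes X_EX rank deficient, which the least squares solver tolerates:

```python
import numpy as np

def build_regressors(phis, rs):
    """Stack the rows [vec(phi_k phi_k^T - phi_{k+1} phi_{k+1}^T), phi_k - phi_{k+1}]
    so that Y_EX = X_EX theta_EX + V_EX as in (4.26)-(4.27)."""
    rows = []
    for k in range(len(rs)):
        d_outer = np.outer(phis[k], phis[k]) - np.outer(phis[k + 1], phis[k + 1])
        d_lin = phis[k] - phis[k + 1]
        rows.append(np.concatenate([d_outer.ravel(), d_lin]))
    X = np.array(rows)
    Y = np.asarray(rs)
    # least squares estimate of theta_EX = [vec(H); G]; lstsq handles the
    # rank deficiency caused by the symmetry of phi phi^T
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X, Y, theta

# synthetic trajectory of phi = [x; u] vectors and direct costs r
rng = np.random.default_rng(0)
phis = rng.standard_normal((31, 3))   # N = 30 transitions, phi in R^3
rs = rng.standard_normal(30)
X, Y, theta = build_regressors(phis, rs)
```

The first nine entries of theta reshape into Ĥ and the last three give Ĝ, from which L̂_EX and l̂ follow as in (4.21).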
4.4 Exploration Characteristic for Extended LQRQL
The estimation (4.27) shows how the parameters are estimated, but does not indicate how
well these parameters are estimated. The results from the previous chapter suggest that
the correctness of the estimation depends on the amount of exploration used for generating
the train set. The exploration characteristic was introduced in section 3.3.4 to present an
overview of the quality of the outcome as a function of the exploration. This will now be
used to investigate the results of the extended LQRQL approach. Two questions have to
be answered:

• Are the extra parameters Ĝ estimated correctly?
• How does the estimation of Ĝ influence the estimation of Ĥ?

⁵ The Q-function depends on L and l, but for clarity these indices are omitted.
⁶ Note that if a discount factor γ < 1 is used in (4.22), the constant c does influence the outcome.
This last question is relevant since G and H are estimated simultaneously in (4.27).

The estimation for the extended approach includes n_x extra parameters and can be
written similarly to (3.58). The linear equation that has to hold is given by

Y_EX = [ Ψ_xx  Ψ_ux  Ψ_uu  Ψ_g ] [ θ_xx ; θ_ux ; θ_uu ; θ_g ] + V_EX,    (4.29)
where V_EX represents the noise and θ_g represents the extra parameters that have to be
estimated. Similar to (3.63), the estimation error of a row of θ̂_g can be written as

θ̄_{g,i} = (Ψ^g_{*i})ᵀ Pᵀ_{n_xx+n_xu+n_uu+i} P_{n_xx+n_xu+n_uu+i} / ‖Ψ^g_{*i}‖²_2 · ( V_EX − Σ_{j=n_xx+n_xu+n_uu+1+i}^{n_xx+n_xu+n_uu+n_x} Ψ^g_{*j} θ̄_{g,j} ).    (4.30)
The estimation errors θ̄_uu in (3.63) and θ̄_ux change according to

θ̄_{uu,i} = (T^ee_{*i})ᵀ Pᵀ_{n_xx+n_xu+i} P_{n_xx+n_xu+i} / ‖T^ee_{*i}‖²_2 · ( V_EX − Ψ_g θ̄_g − Σ_{j=n_xx+n_xu+1+i}^{n_xx+n_xu+n_uu} Ψ^uu_{*j} θ̄_{uu,j} ),    (4.31)

θ̄_{ux,i} = (Υ_{*i})ᵀ Pᵀ_{n_xx+i} P_{n_xx+i} / ‖Υ_{*i}‖²_2 · ( V_EX − Ψ_g θ̄_g − Ψ_uu θ̄_uu − Σ_{j=n_xx+1+i}^{n_xx+n_xu} Ψ^ux_{*j} θ̄_{ux,j} ).    (4.32)
The main difference here is the extra term Ψ_g θ̄_g in the estimation errors θ̄_{uu,i} and
θ̄_{ux,i}. The dependency of the estimation error θ̄_g on the exploration is comparable to that
of the SI approach, so the influence of the θ̄_{uu,i} and θ̄_{ux,i} can be neglected. The V_EX is
different from V_QL from the previous chapter in the sense that it now includes the value
of w. This means that in the expected estimation errors (3.66) the value of σ²_e should
be replaced by (σ_e + ‖w‖)². The consequence is that even for low noise in the system, if
there is a large extra w, still a large amount of exploration is required. In other words,
the minimum of the Q-function should be included in the area that is being explored.
Simulation of the Extended LQRQL characteristics
We did some simulation experiments to show the reliability of the extended LQRQL
approach. To be able to show the correctness of the parameter estimation we took as system
the model given by (4.16):

x_{k+1} = A x_k + B u_k + v_k + w,    u_k = L x_k + e_k + l.    (4.33)

In (4.33) the w is constant, so it is the same for all time steps. In the experiments we
compared the results with the standard LQRQL approach.
We ﬁrst did an experiment to see whether the correct value of l can be found. Because
all parameters are estimated simultaneously, we also did an experiment to see whether
the estimation of G has any positive eﬀects on the estimation of H. Finally we did an
experiment to see whether an arbitrary set point can be learned, when this set point is
determined by the direct costs r.
We used the same parameter settings as in the previous chapter:

A = [ −0.6  0.4 ; 1  0 ],   B = [ 0 ; 1 ],   x_0 = [ 1 ; 1 ],   L = [ 0.2  −0.2 ],   σ_v = 10^{−4}.    (4.34)

The value of w depends on the experiment. Each experiment was performed 5 times and
we took N = 30 time steps.
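The train set generation for these experiments can be reproduced with a short simulation of (4.33) using the settings from (4.34). This is a sketch of the setup, not the authors' exact code; the random seed and the values of w, l and σ_e are arbitrary illustrative choices:

```python
import numpy as np

# Settings from (4.34); w, l and sigma_e vary per experiment.
A = np.array([[-0.6, 0.4], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.2, -0.2]])
w = np.array([[0.0], [0.0]])     # constant offset (zero, as in experiment I)
l = 1.0                          # initial constant term of the feedback
sigma_v, sigma_e = 1e-4, 1.0     # system noise and exploration noise levels
N = 30

rng = np.random.default_rng(1)   # arbitrary seed
x = np.array([[1.0], [1.0]])     # x_0
xs, us = [x], []
for k in range(N):
    u = L @ x + sigma_e * rng.standard_normal((1, 1)) + l   # u_k = L x_k + e_k + l
    x = A @ x + B @ u + sigma_v * rng.standard_normal((2, 1)) + w
    us.append(u)
    xs.append(x)
```

The recorded pairs (x_k, u_k) and the corresponding direct costs form the train set to which both the standard and the extended estimation can then be applied.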
Experiment I: Can a correct value of l be found? This requires knowledge of the
correct value of l. In this experiment we took w equal to zero, so that no additional
l is required. We took as initial value l = 1. The correct value is l = 0, so the
estimated l̂ should be zero.

Figure 4.2(a) shows how the value of l changes. For low values of σ_e the value of l
does not change. If the value of σ_e is larger than σ_v, the correct value of l is obtained.
Experiment II: Does the extension of LQRQL improve the results? In this
experiment we used wᵀ = [ 1  1 ], so that an l ≠ 0 was required. Both the LQRQL and
the Extended LQRQL were applied on the same train sets. We asked the following
questions: Does the nonzero w influence the estimation of L̂? And does the nonzero
w influence the amount of exploration that is required?

Figure 4.2(b) shows ‖L̂ − L*‖ for both approaches, where L* is the optimal feedback
computed with (3.5) and (3.4). For low values of σ_e the value of L̂ equals L in (4.34).
The value of ‖L̂ − L*‖ decreased around σ_e ≈ 1, which is about the same scale as
w. We did a similar experiment using wᵀ = [ 10  10 ], and there this happened at
σ_e ≈ 10. This indicates that the sufficient amount of exploration depends on w
if ‖w‖ > σ_v. Figure 4.2(b) shows that the improvement for the extended approach
requires less exploration. For high values of σ_e the value of l becomes −0.5, and this
makes the total costs for the extended LQRQL approach lower than the total costs
of the standard LQRQL approach.
Experiment III: Can the set point be learned? The set point was introduced in
section 4.2.2 to indicate a desired equilibrium state. So learning the set point means
optimizing the stationary behavior of the system. The extension of LQRQL was
motivated by the inability of LQRQL to deal with the simultaneous optimization of
the transient and stationary behavior of the system. We did this experiment to see
whether this is possible using the extended LQRQL. We changed the costs in (3.2)
to r_k = (x_k − x_s)ᵀ S (x_k − x_s) + u_kᵀ R u_k, to specify a preference for an equilibrium
point at x_s.
Figure 4.3(a) shows the total costs when x_sᵀ = [ 1  −1 ]. It is clear that the cost
reduction after σ_e ≈ 1 is much larger for the extended approach. The σ_e ≈ 1 is on the
same scale as x_s, and doing the same experiment for a larger x_s shows that more
exploration is required (not shown).

For one result obtained using σ_e = 5.6 the state values are shown as a function of time
in figure 4.3(b). The values for the standard approach always go to zero, while the
values for the extended approach go to two different values. The value of l̂ = −0.55
brings the equilibrium state closer to x_s. Note that the equilibrium state will not be
equal to x_s because of the costs assigned to the control action that keeps it at x_s.
The simulation experiments have shown that the correct value of l will be found. They
have shown that in case of w ≠ 0, the resulting feedbacks L̂_QL and L̂_EX will have the same
outcome. However, the extended approach requires less exploration to obtain this result.
Finally, the experiments showed that the extended approach is also able to learn the set
point.
[Figure 4.2: (a) Experiment I: l̂ as a function of σ_e. (b) Experiment II: ‖L̂ − L*‖² as a function of σ_e.]
Figure 4.2. Results of the simulation experiments I and II. The solid lines are the results for the extended approach, the dashed lines the results of the standard approach. The amount of system noise σ_v = 10^−4 is indicated by the dotted line.
[Figure 4.3: (a) The total costs J as a function of σ_e. (b) The state values x_1 and x_2 as functions of time k for one result using σ_e = 5.6.]
Figure 4.3. Results of simulation experiment III. The solid lines are the results for the extended approach, the dashed lines the results of the standard approach.
4.5 Simulation Experiments with a Nonlinear System
For a nonlinear system (4.15) the value of w(x, u, v) does not have to be constant. In the previous section we showed that the extended LQRQL approach performs better than the standard approach when a constant w is not zero. If the nonlinearity is smooth, then in a small part of the state space the value of w(x, u, v) does not vary much. If the average value of w(x, u, v) is not zero, the extended LQRQL approach should also perform better. To verify this, we did some experiments with a nonlinear system, in which we compared the performance of the extended LQRQL approach with the SI and standard LQRQL approaches from chapter 3.
4.5.1 The nonlinear system
The mobile robot
As nonlinear system we used a mobile robot. A description of the robot can be found in appendix B, together with a model that describes how the robot changes its position and orientation in the world. The change of orientation does not depend on the position, but the change in position does depend on the orientation. This suggests that the robot can be used as a nonlinear system, where the effect of the orientation on the change of position introduces the nonlinearity.
The task of the robot is to follow a straight line that is defined in the world. The state is given by the distance δ to that line and the orientation α with respect to that line, as shown in figure 4.4(a). So the state is given by x^T = (α δ), where δ is given in meters and α in radians.

[Figure 4.4: (a) The task. (b) The implementation.]
Figure 4.4. The top view of the robot. The left figure illustrates the general task. The right figure shows our implementation, where the line to follow is the positive x-axis. This figure also indicates the robot's movement, given a positive, negative or zero u.

Using (B.6) we can express the state transition for the robot given this task:

α_{k+1} = α_k + ωT,   (4.35)
δ_{k+1} = δ_k + (v_t/ω)(cos(α_k) − cos(α_k + Tω)).   (4.36)
The T is the sample time. According to (4.36) the trajectory of the robot describes a part of a circle with radius v_t/ω. There are two control actions: the traversal speed v_t and the rotation speed ω. We gave the traversal speed a fixed value of 0.1 meters per second. As control action u we took the rotation speed ω. Without loss of generality we can take the x-axis of the world as the line to follow. Our simulation setup is shown in figure 4.4(b), where we take δ = y and α = φ.
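As an illustration, the transition (4.35) and (4.36) can be written down in a few lines. The sketch below is not code from the thesis; the helper name and the straight-line fallback for ω → 0 are our own, and v_t = 0.1 and T = 0.35 are the values used in the experiments below.

```python
import math

V_T = 0.1   # fixed traversal speed in meters per second
T = 0.35    # sample time in seconds, as used in section 4.5.2

def robot_step(alpha, delta, omega):
    """One discrete step of (4.35)-(4.36): the robot drives an arc of
    radius v_t/omega while rotating with speed omega."""
    alpha_next = alpha + omega * T
    if abs(omega) > 1e-9:
        delta_next = delta + (V_T / omega) * (math.cos(alpha) - math.cos(alpha + T * omega))
    else:
        # limit omega -> 0: the robot drives straight, so delta changes by v_t*T*sin(alpha)
        delta_next = delta + V_T * T * math.sin(alpha)
    return alpha_next, delta_next
```

Note that the change of α is linear in the action, while the change of δ depends on α through the cosine terms: this is exactly the nonlinearity discussed above.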
We used quadratic direct costs, so

r_k = x^T ( S_α 0 ; 0 S_δ ) x + uRu,   (4.37)

with

S_α = 0.1, S_δ = 1 and R = 1.   (4.38)
This indicates that we want to minimize the distance to the line without steering too much. The S_α is not so important. The only reason to give it a small positive value is to assign costs to following the line in the wrong direction.
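With the diagonal weighting of (4.37) and the values (4.38), the direct costs reduce to a simple scalar expression. A hypothetical helper (the function name is ours):

```python
S_ALPHA, S_DELTA, R = 0.1, 1.0, 1.0  # weights from (4.38)

def direct_cost(alpha, delta, u):
    """Quadratic direct costs (4.37): penalize the distance to the line,
    a wrong heading (lightly), and large steering actions."""
    return S_ALPHA * alpha**2 + S_DELTA * delta**2 + R * u**2
```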
Comparing the results
Before starting the experiments we will give an indication of the correct feedbacks. We will also show that the value of δ can be used to vary the value of w(x, u, v). To compare the results of the three approaches we have to test their performance; we describe below how we test the results.
The task is to find a linear feedback L̂ by applying the SI approach and both LQRQL approaches. The extended LQRQL approach also has to find an extra term l̂ from the train set. So the resulting feedback for the SI and standard LQRQL approach is:

u = L̂x = ( L̂_α L̂_δ ) ( α ; δ ).   (4.39)

For the extended LQRQL approach there is an extra term:

u = L̂x + l̂ = ( L̂_α L̂_δ ) ( α ; δ ) + l̂.   (4.40)
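Both feedback laws fit one hypothetical helper, with l̂ = 0 recovering the standard form (4.39):

```python
def control_action(alpha, delta, L_alpha, L_delta, l=0.0):
    """Linear feedback u = L_alpha*alpha + L_delta*delta + l.
    With l = 0 this is (4.39); a nonzero l gives the extended form (4.40)."""
    return L_alpha * alpha + L_delta * delta + l
```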
Now we can already determine what the correct feedback should look like. If δ > 0 and α = 0, the robot has to turn right. This means that for u < 0, L̂_δ < 0 is correct. If α is also positive, the robot has to turn right even more. This means that also L̂_α < 0 is correct. The l̂ can be used in that state to steer to the right even more, so this should also be negative. However, if δ < 0 the value of l̂ should be positive.
The model in (4.36) describes a smooth nonlinear function. This function is symmetric around the origin of the state space, which indicates that the optimal feedback function will go through the origin. So it is possible to use (4.39) as a local linear approximation of the optimal feedback function. When the robot is closely following the line, the value of w(x, u, v) will be very small. The extended LQRQL approach then has to result in l̂ = 0.
For large values of δ, the optimal feedback will move the robot in the direction of the line. This implies that α is no longer very small, so the nonlinearity in (4.36) will have an effect on the change of δ. The value of w(x, u, v) is no longer zero, so (4.39) will not be able to form a good local linear approximation of the optimal feedback function. The extended LQRQL approach can provide a better approximation for a large δ and should result in l̂ ≠ 0.
In order to compare the results of the different approaches, we need a criterion. The relative performance from the previous chapter cannot be used, because we do not know the optimal solution. We used the total costs over a fixed time interval,

J = Σ_{k=0}^{N} r_k,   (4.41)

as criterion. The resulting feedback is obtained based on a train set that was generated by starting at a certain state. To test the feedback we started the robot in the same state. As time interval we took the same number of time steps as used during the generation of the train set. By taking only a short time interval we made sure that we tested the feedback in the same local part of the state-action space where the train set was generated.
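The test criterion (4.41) is just the sum of direct costs over a fixed horizon. A minimal rollout sketch, assuming the simulated kinematics (4.36) and the costs (4.37); the function name and the ω → 0 fallback are our own:

```python
import math

def total_costs(alpha0, delta0, feedback, n_steps=57, v_t=0.1, T=0.35,
                S_alpha=0.1, S_delta=1.0, R=1.0):
    """Total costs J = sum_k r_k over a fixed time interval (4.41),
    rolling the simulated robot out from the test start state."""
    alpha, delta, J = alpha0, delta0, 0.0
    for _ in range(n_steps + 1):            # k = 0..N
        u = feedback(alpha, delta)
        J += S_alpha * alpha**2 + S_delta * delta**2 + R * u**2
        if abs(u) > 1e-9:                   # arc of radius v_t/u, cf. (4.36)
            delta += (v_t / u) * (math.cos(alpha) - math.cos(alpha + T * u))
        else:                               # straight-line limit of (4.36)
            delta += v_t * T * math.sin(alpha)
        alpha += T * u
    return J
```

For example, a feedback that never steers accumulates the full distance penalty at every step, while even a weak feedback toward the line lowers J.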
4.5.2 Experiment with a nonzero average w
We first did an experiment with a nonzero average w. We could do this by focusing on a part of the state space where δ is large. Under these conditions the extended LQRQL approach should be able to perform better than the SI and standard LQRQL approaches.
The setting
We took as settings:
• Sample time T = 0.35 seconds.
• Number of time steps N = 57. So with T = 0.35, the robot drives for approximately 20 seconds.
• Gaussian exploration noise with σ_e = 0.1. A higher level of exploration would make it impossible to use similar settings on the real robot, because the tolerated rotation on the real robot is limited.
• Initial feedback L = (−10^−3 −10^−3). This feedback makes the robot move towards the line. On the other hand, it is small enough that hardly any prior knowledge is included.
• Initial orientation α = 0.
Because of the exploration we were not able to exactly determine the local part of the state space in which the train set is generated. The robot drives for approximately 20 seconds at a speed of 0.1 meters per second, so it traverses approximately 2 meters. As initial distance to the line we used δ_0 = 1.5 meters, for which the resulting l̂ should be negative. In the worst case the robot stops at δ = −0.5, for which l̂ should be positive. On average most of the training samples were obtained in the part of the state space for which l̂ should be negative, so we knew that the extended LQRQL had to result in l̂ < 0. This means that the average w was not zero. For the SI and standard LQRQL approach l = 0, so they should perform worse than the extended LQRQL approach.
We generated one train set and applied the SI, the standard LQRQL and the extended LQRQL approach. The resulting feedbacks of the three approaches were tested by starting the robot in the same initial state. For each test we computed the total costs according to (4.41). Table 4.1 shows the resulting feedbacks and the total costs of the test runs.
                  L̂_α     L̂_δ     l̂       J
SI                0.2794  0.7224          155.5828
Standard LQRQL    0.5861  0.0330          115.3245
Extended LQRQL    0.5855  0.0133  0.5865   45.8943

Table 4.1. Results of the experiment with nonzero average w.
The resulting value of l̂ is negative, just as we expected. The values of L̂ for both LQRQL approaches are almost the same, but the L̂ for the SI approach is different. The total costs of the SI approach are the highest and the extended LQRQL approach has the lowest total costs.
In figure 4.5 five trajectories are shown. The first shows that using the initial feedback without exploration makes the robot move slowly towards the line. The second trajectory shows that the generated train set primarily depends on the exploration and not on the initial feedback. The other three trajectories are the test runs with the three resulting feedbacks.
The trajectory of the SI approach is the “curly” line in figure 4.5 that moves to the left. Because the value of L̂_δ is too large, the robot rotates too much for large values of δ. At the end of the trajectory the value of δ is so small that the robot no longer rotates around its axis. The trajectory of the standard LQRQL approach resembles that of the initial feedback L, although L and L̂ are clearly different. The reason is that L̂_δ is too small, so the robot hardly rotates. Therefore the orientation α remains very small, so the higher value of L̂_α does not contribute to the control action. Although the extended LQRQL approach has approximately the same L̂, because of l̂ it turns faster in the direction of the line. This explains the lower total costs. The nonzero value of l̂ suggests that the average w is not zero. Since the extended approach can deal with this situation, it outperforms the other approaches.
[Figure 4.5: trajectories in the (X, Y) plane for “No explo.”, “Data”, SI, standard LQRQL and extended LQRQL.]
Figure 4.5. The trajectories in the world. The line with “No Explo” is the trajectory when
using the initial feedback L. For the line with “Data” the exploration is added to generate the
train set. The other three lines are the trajectories of the test runs.
4.5.3 Experiments for different average w
The previous experiment showed that for a large value of δ the extended LQRQL approach performs best. This is the case for a nonzero average w. We also indicated that around the line the average value of w is zero. Here the SI and standard LQRQL approaches already have the correct value of l = 0. The extended approach has to estimate the correct l̂, which can result in an l̂ ≠ 0. This suggests that the extended LQRQL approach will perform slightly worse when the train set is generated near the line. We did some experiments to see how the performance of the three approaches depends on the size of the average w.
In the same way as in the previous experiment we used the initial distance δ_0 to determine the average w. We varied the initial distance δ_0 from 0 to 2 in steps of 1/4. The estimation results were based on train sets generated with random exploration noise. For this reason we generated 500 train sets for each δ_0, each with a different exploration noise sequence. The SI, standard LQRQL and extended LQRQL approaches were applied to all train sets.
All resulting feedbacks can be tested, but we first looked at the reliability of the outcome. For all three approaches the resulting feedback function depends only on the train set. If the resulting feedback is not good, this is because the train set is not good enough. In chapter 3 we showed that this is the case when the amount of exploration is not sufficient. If sufficient exploration is used, the estimated quadratic Q-function is positive definite. In that case the estimated matrix Ĥ only has positive eigenvalues.
We used the same σ_e for all train sets. When we have a nonlinear system, we can see w_k = w(x_k, u_k, v_k) as system noise that does not have to be Gaussian or white. Since the amount of exploration must be higher than the amount of system noise, it can turn out that for some train sets insufficient exploration was used. Another problem is that the contribution of the feedback function to the control action is very small when the train set is generated close to the line. The trajectory is then mainly determined by the exploration. In these situations it can be very hard to evaluate the feedback function, so the estimated Q-function may not be correct. We know for sure that if Ĥ has negative eigenvalues, the Q-function is not correct. For both LQRQL approaches we rejected the results with negative eigenvalues for Ĥ.
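The rejection test only needs the sign of the eigenvalues of Ĥ. Since Ĥ is symmetric, positive definiteness can equivalently be checked with Sylvester's criterion on the leading principal minors, which avoids an eigenvalue solver. A pure-Python sketch (the helper names are ours, not from the thesis):

```python
def leading_minor(H, k):
    """Determinant of the k-by-k leading principal submatrix of H,
    by Laplace expansion along the first row."""
    if k == 1:
        return H[0][0]
    det = 0.0
    for j in range(k):
        sub = [[H[r][c] for c in range(k) if c != j] for r in range(1, k)]
        det += (-1) ** j * H[0][j] * leading_minor(sub, k - 1)
    return det

def is_reliable(H):
    """Accept a train set only if the estimated H-hat is positive definite,
    i.e. all eigenvalues (equivalently all leading minors) are positive."""
    return all(leading_minor(H, k) > 0 for k in range(1, len(H) + 1))
```

In practice a library eigenvalue routine would be used for larger matrices; the point here is only the accept/reject decision.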
The extended LQRQL approach also estimates l̂, whose reliability does not depend on the eigenvalues of Ĥ. If we look at figure 4.1, we see that a small change in the optimal feedback function may lead to a large change in the value of l. This means that a small difference in the train set may lead to a large difference in the estimated l̂. To be able to compare the results with the other two approaches, the situations for which l̂ was clearly wrong were also removed. For the extended LQRQL approach we rejected the results for which |l̂| > 1. For these values of l̂, the L̂ hardly contributed to the motion of the robot.
For the SI approach we do not have a criterion that indicates the reliability of the resulting feedback, so we used all train sets. Figure 4.6(a) shows the fractions of the 500 runs that were used for both LQRQL approaches. For the standard LQRQL approach about 75 percent was used. For the extended approach the percentage without negative eigenvalues for Ĥ was already below 50 percent. When the unreliable l̂ results were also removed, about one third of the train sets were kept. This indicates that the extended LQRQL approach is the most sensitive to the particular train set.

[Figure 4.6: (a) Fraction of the train sets used. (b) The J as a function of δ_0.]
Figure 4.6. The results for different initial δ.
The resulting feedback functions that were not rejected were tested. In figure 4.6(b) we see the average total costs J for different initial δ for the three approaches. When figure 4.6(b) is plotted with error bars the figure becomes unclear, therefore we omitted the error bars. The maximal standard deviation was 40.8, for δ = 2 with the extended LQRQL approach. The figure clearly shows that the total costs increase as δ increases, because δ is part of the direct costs in (4.37). For δ_0 = 0 the SI and the standard LQRQL approach have J = 0. This is because the robot is already on the line and facing the right direction, so x = 0 and therefore also u = 0. Because l̂ ≠ 0, the total costs for the extended LQRQL approach are not zero.
For small values of δ_0, the SI approach has the lowest costs, followed by the standard LQRQL approach. For high values of δ_0 the situation is reversed: the extended approach has the lowest total costs and the SI approach the highest. This is also what we observed in the preliminary experiment.
We can conclude that if the train set is generated close to the line, the average w is almost zero and the SI and standard LQRQL approaches perform best. Further away from the line, the average value of w becomes larger. There the total costs for the SI approach are much higher than those of the LQRQL approaches, and the extended LQRQL approach has the lowest costs, as expected.
4.6 Experiments on a Real Nonlinear System
4.6.1 Introduction
We did experiments with a real robot to see whether the simulation results also apply to real systems. The state value was derived from the odometry: the robot keeps track of its position in the world by measuring the speed of both wheels. We then translated this position into values for δ and α.
There are some important differences with the simulated robot:
• The time steps do not have to be constant. State information is not directly available and is obtained via a communication protocol between the two processing boards in the robot. The duration of the communication varies a little. This introduces some noise, because at discrete time steps the state changes are influenced by the duration of the time step.
• The size of the control action that can be applied is bounded. Too high values of u may ruin the engine of the robot. When generating the train set it was possible that actions were tried that could not be tolerated. For safety we introduced the action bounds u_min = −0.18 ≤ u ≤ u_max = 0.18. When a u outside this interval had to be applied to the robot, we replaced it by the value of the closest action bound. This value was also used for the train set. The consequence is that for high values of σ_e, the exploration noise is no longer Gaussian.
• The robot has a finite acceleration. This part of the dynamics of the system was not included in the model used for the simulations. This means that the trajectory during a time step is not exactly a circle. A consequence is that the robot might not respond quickly to fast-varying control actions when exploring. So effectively the amount of exploration contributing to the movement of the robot is lower than the amount of exploration that is added to the control action.
• There is wheel spin. The effect of wheel spin is that the robot does not move according to the rotation of the wheels. This can mean that the real position of the robot does not change while the wheels are rotating. Since we do not use the robot's real position, this does not affect us. But the odometry is based on measuring the speed of the wheels, and they accelerate faster during wheel spin. This has the effect that similar control actions can lead to different state transitions.
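The safety bound described in the second point amounts to a clipping step applied both to the robot and to the stored train set. A hypothetical helper, using the bounds ±0.18 given above:

```python
U_MIN, U_MAX = -0.18, 0.18  # tolerated rotation speed on the real robot

def bounded_action(u):
    """Replace an action outside [u_min, u_max] by the closest bound.
    The clipped value is applied to the robot AND stored in the train set,
    so for large sigma_e the effective exploration noise is no longer Gaussian."""
    return min(max(u, U_MIN), U_MAX)
```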
4.6.2 The experiments
It was infeasible to generate the same number of data sets as in the simulation experiments. We varied the initial δ from 0.25 to 1.75 in steps of 0.5. We generated data for four different sequences of exploration noise, so in total we generated 16 data sets. For one exploration sequence the four generated data sets are shown in figure 4.7(a).
[Figure 4.7: (a) The generation of four train sets, using the same exploration noise sequence. (b) The average total costs J as a function of the initial δ.]
Figure 4.7. The experiment with the real robot.
In figure 4.7(b) we see the average total costs for each test of the three different methods. We did not remove any train set for the LQRQL approaches. The results in figure 4.7(b) show that on average all three approaches perform almost the same. Only the extended approach has higher total costs for δ_0 = 0.25. The latter agrees with the simulation experiments. If we compare figure 4.7(b) with figure 4.6(b), we see that the performances of both LQRQL approaches are a little worse in practice than in simulation. The SI approach seems to perform better in practice than in simulation.
In figure 4.8(a) we see, for all four exploration noise sequences, the total costs as a function of the initial δ for the SI approach. For one sequence the total costs were very high (for δ_0 = 1.75 the total costs were more than 600). This is because the resulting feedback function is wrong: it made the robot move away from the line. For the three other sequences we see that the total costs are very low. These low total costs are misleading. What happens in these situations is that the feedbacks found by the SI approach are much too high. This would lead to rotations as in figure 4.5, but the action bounds prevented the robot from rotating. Instead the action applied to the robot alternated between u_min and u_max for the first 15 to 20 time steps. After that the robot was very close to the line and started following the line as it should. In these cases the low total costs were caused by the action bounds, which indicates that the SI approach did not find the optimal linear feedback!
The actions taken by the resulting feedbacks of both LQRQL approaches were always between u_min and u_max. In figure 4.8(b) we see, for all four exploration noise sequences, the total costs as a function of the initial δ for the extended LQRQL approach. We noticed that the plot for the standard LQRQL approach looks quite similar. We see that for some exploration sequences the costs are quite low, while for others the costs are higher. This indicates that the performance depends on the particular sequence of exploration noise. For one sequence the total costs are as low as for the three sequences of the SI approach. Unlike the SI approach, however, the extended approach did not exploit the action bounds. This indicates that the extended LQRQL approach optimized the local linear feedback function for the real nonlinear system.
In the simulation experiments we rejected train sets that resulted in negative eigenvalues for Ĥ because we considered them unreliable. We did not do this for the experiments on the real robot. For the standard LQRQL approach 9 of the 16 train sets resulted in a positive definite Ĥ, and for the extended LQRQL approach only 5. The total costs for these feedbacks were not always the lowest, because in some cases an “unreliable” feedback performed better. This was caused by the acceleration of the robot. The reliable feedbacks made the robot rotate faster in the direction of the line than the unreliable feedbacks. But due to the acceleration, the robot did not always stop rotating fast enough. Then the robot was facing the wrong direction and had to rotate back. Many unreliable feedbacks turn more slowly towards the line and are therefore less affected by the acceleration. This implies that the reliability indication based on the eigenvalues of Ĥ is not appropriate for the real robot if the acceleration is ignored.
[Figure 4.8: (a) J for SI as a function of δ_0 for all four exploration noise sequences. (b) J for extended LQRQL as a function of δ_0 for all four exploration noise sequences.]
Figure 4.8. All 16 performances for SI and extended LQRQL.
In summary, the experiments have shown that the SI approach does not optimize a local linear feedback correctly. Instead it either finds a wrong feedback, or it finds feedbacks that are too large, so that most control actions are limited by the action bounds. The performances of the standard and extended LQRQL approaches are similar to those in the simulation experiments. So we can conclude that they correctly optimize a local linear feedback.
4.7 Discussion
We started this chapter with a presentation of a number of methods for the control of
nonlinear systems. Some of these methods were based on local linear approximations
of the system. We were interested whether we could use QLearning to ﬁnd a feedback
which is locally optimal. We showed that the standard LQRQL approach as presented
in chapter 3 would not lead to an optimal linear feedback, and introduced the extended
LQRQL approach. In this approach we do not estimate the parameters of a global quadratic
function through the origin, but we use the data to estimate the parameters of a more
general quadratic function. The consequence for the feedback function is that an extra
constant was introduced. In this way the resulting feedback is no longer restricted to a
linear function through the origin.
We tested the extended LQRQL approach on a nonlinear system in simulation and on
a real nonlinear system. We compared it with the SI and standard LQRQL approach from
chapter 3. The results indicate that if we explore in a restricted part of the state space, by
sampling for a few time steps, the extended approach performs much better. This means
that if a feedback function is based on local linear approximations, the extended approach
has to be used. The standard LQRQL should only be used if it is known that the optimal
feedback function goes through the origin.
It is possible to use multiple linear models to construct one global nonlinear feedback
function. In [44] bumptrees were used to form a locally weighted feedback function. In
[17] local linear feedbacks were used for diﬀerent partitions of the state space of a nonlinear
system. The result is equivalent to Gain Scheduling as described in chapter 2. In both
cases the resulting local linear feedback functions were obtained by the standard LQRQL
approach.
However, for standard LQRQL the local linear models are not only based on the train set but also on the position of the partition with respect to the origin. This means the models are not completely local. Because the feedback has to go through the origin, the linear feedbacks for partitions far away from the origin will become smaller, so it is unlikely that linear models far away from the origin are optimal. This implies that if we want to form a nonlinear feedback based on local linear feedbacks, we have to use the extended approach. The extended LQRQL approach is able to estimate the appropriate set point for each linear model.
4.8 Conclusions
Many control design techniques for nonlinear systems are based on local linear approximations. We studied the use of LQRQL for obtaining a local linear approximation of a nonlinear feedback function. A nonlinear system can be regarded as a linear system with an additional nonlinear correction. If in a local part of the state space the average correction is not zero, the SI and standard LQRQL approaches from chapter 3 will approximate the wrong function. We introduced the extended LQRQL approach, which results in a linear feedback plus an additional constant. The experiments on a nonlinear system have shown that if the additional constant is not zero, the extended approach performs better.
Chapter 5
Neural Q-Learning using LQRQL
5.1 Introduction
The LQRQL approach was derived for linear systems with quadratic costs. Although this approach can also be applied to nonlinear systems or other cost functions, the resulting feedback will always be linear. This is a consequence of the restriction that the Q-function is quadratic. The use of a quadratic Q-function was motivated by the possibility of using a linear least squares estimation to obtain the parameters of the Q-function. Then the parameters of the linear feedback follow directly from the parameters of the Q-function.
If the system is nonlinear, it is very likely that the Q-function is not quadratic. In such situations it is possible to use a general approximator, like a feed-forward network, to approximate the Q-function. The feedback can be obtained by finding, for all states, the control action that minimizes the Q-function. We will show that this can have the effect that the feedback function is not a continuous function, and we will explain why it is desirable to have a continuous feedback function. In order to have a continuous feedback function, another network can be used to represent the feedback function. This is the actor/critic configuration as described in section 2.3. It has the drawback that two networks have to be trained, where the training of one network depends on the other. This makes the training very hard.
The choice is either to have a discontinuous feedback or to use two networks. In this chapter we propose a method in which we use only one network and still have a continuous feedback function. The idea is to combine LQRQL with a carefully chosen network; we call this Neural Q-Learning. The feedback can be derived in the same way as in LQRQL.
The simulation and real experiments were based on the same nonlinear system as in chapter 4. We applied the Neural Q-Learning method and compared the result with Gain Scheduling based on the extended LQRQL approach. The resulting nonlinear feedback functions have to be globally valid, so instead of focusing on a local part of the state space, we looked at larger parts of the state space.
5.2 Neural Nonlinear Q-functions
Instead of a quadratic Q-function, a general approximator like a feed-forward neural network can be used. It describes the function

Q(ξ, w) = Γ_o(w_o^T Γ_h(Wξ + b_h) + b_o),   (5.1)

with Γ_o and Γ_h the activation functions of the units in the output and hidden layer. The rows of matrix W contain the weight vectors w_{h,i} of the hidden units, and the vector b_h contains the corresponding biases. The weights of the output unit are given by w_o and the bias by b_o. The network has only one output Q(ξ, w), where the vector w indicates all weights of the network. The ξ represents the input of the Q-function.
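Equation (5.1) is an ordinary one-hidden-layer network with a single output. A minimal sketch with tanh hidden units and a linear output unit, matching figure 5.1; the function name is ours and the weight values in the test are arbitrary illustrations:

```python
import math

def q_network(xi, W, b_h, w_o, b_o):
    """Q(xi, w) = Gamma_o(w_o^T Gamma_h(W xi + b_h) + b_o), cf. (5.1),
    with Gamma_h = tanh and Gamma_o the identity (a linear output unit)."""
    hidden = [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, xi)) + b)
              for row, b in zip(W, b_h)]
    return sum(wo_i * h_i for wo_i, h_i in zip(w_o, hidden)) + b_o
```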
If a network is used to represent a Q-function, we might obtain a typical function as in figure 5.2(a). This is drawn for a scalar state and control vector, with the weights of the network picked randomly. It is clear that this Q-function is a smooth continuous function. Given this Q-function, the feedback has to be determined.
According to (2.15) and (3.15), the greedy feedback is computed by taking the minimum value of the Q-function for each state. Figure 5.2(b) shows the top view of the Q-function, with lines indicating equal Q-values. This plot also shows the greedy feedback function, computed by taking for each x the value of u for which the Q-value is minimal. Using the greedy feedback from the network has two drawbacks:
• The computation of the greedy action involves finding the extremum of a nonlinear function. In general this is not a trivial task.
• Even when the Q-function is smooth as in figure 5.2(a), it is still possible that the greedy feedback function (shown in figure 5.2(b)) is not a continuous function. This can cause problems for real applications when there is noise in the system. Near the discontinuities a little noise can have a large influence on the control action, so the behavior of the system becomes very unpredictable.

[Figure 5.1: the network, with input ξ, hidden units Γ_1, Γ_2 (tanh) with weights w_h1, w_h2 and biases b_h1, b_h2, and a linear output unit Γ_o with weights w_o producing Q(ξ, w).]
Figure 5.1. The network with two hidden units.

[Figure 5.2: (a) An arbitrary Q-function. (b) The greedy feedback.]
Figure 5.2. The Q-function is formed by a feed-forward network. In (b) we see the top view of the Q-function with lines indicating the height. The bold line indicates the greedy feedback.
The second drawback can be overcome by introducing a Dynamic Output Element [60]. The static feedback is followed by a low-pass filter, removing the effects of noise at discontinuities of the feedback function. In this way the control action is not only based on the current state value, but also on the previous control actions. This approach is not a solution to the first drawback.
One approach that removes both drawbacks is the Actor/Critic approach described in chapter 2. A second function approximator is introduced to represent the feedback function. If a continuously differentiable function is used as actor, the second drawback is overcome. By training the actor based on the critic, the first drawback is removed as well.
The main difference between the actor/critic configuration and Q-Learning is shown in figure 5.3 and is the way in which the feedback is derived from the Q-function. In the actor/critic configuration the actor is trained based on the critic. In the Q-Learning configuration the Q-function implies the feedback, so that the Q-function and feedback are represented by one and the same network.
The two networks in the actor/critic configuration have to be trained, and this is the major problem of this configuration. Training two networks means that training parameters and initial settings have to be selected for two networks. It is very hard to determine beforehand the appropriate settings for the training of the networks. Since the settings influence the result, the interpretation of the results is very hard.
78 CHAPTER 5. NEURAL QLEARNING USING LQRQL
Figure 5.3. The actor/critic configuration and Q-Learning. (a) Actor/critic: a critic network represents Q(x, u) and an actor network represents the feedback g(x) applied to the system; the dashed arrow indicates the training of the actor based on the critic. (b) Q-Learning: the implication arrow indicates that the feedback function g(x) directly follows from the Q-function Q(x, u).
In the LQRQL approach the greedy feedback is computed by setting the derivative of the Q-function with respect to the control action to zero. So for LQRQL the quadratic Q-function implies the linear feedback function, and the parameters of the feedback function can be expressed as functions of the parameters of the Q-function. This is a property we want to keep for the neural Q-function.
We will propose a method that uses LQRQL to keep this property for a neural Q-function. In this method the feedback function can be expressed using the weights of the neural Q-function. No second network is required. The method also guarantees that there are no discontinuities in the feedback function.
5.3 Neural LQRQL
Neural LQRQL¹ is based on the standard LQRQL described in chapter 3, but with a feedforward network to represent the Q-function. The standard LQRQL approach is based on three steps. First the least squares estimation is applied to the train set, resulting in an estimate $\hat{\theta}$. Then the parameters $\hat{H}$ of the quadratic Q-function are derived from the estimated $\hat{\theta}$. Finally the value of the linear feedback $\hat{L}$ is computed based on $\hat{H}$ using (3.16). This indicates that if it is possible to derive $\hat{\theta}$ from the neural Q-function, the feedback can be computed from $\hat{\theta}$ analogously to the LQRQL approach.
In LQRQL the quadratic Qfunction was represented as a linear multiplication of the
quadratic combinations of the state and action with the estimated parameters. This was
introduced to make it possible to use a linear least squares estimation for estimating θ
¹ A very similar approach in [33] is called Pseudo-Parametric Q-Learning.
based on the measurements. In section 3.2.4 the Q-function is written as
$$Q^L(x_k, u_k) = \mathrm{vec}(\phi_k\phi_k^T)^T \mathrm{vec}(H^L) = \Omega_k^T\theta. \qquad (5.2)$$
Here $H^L$ represents the parameters of the quadratic Q-function and $\phi_k$ the vector with the state and control action. This can be regarded as just a multiplication of two vectors $\Omega_k$ and $\theta$. We see that writing the quadratic Q-function as a linear function requires that the input vector is “quadratic”. Instead of having a quadratic function with input $\phi$, we have a linear function with input $\Omega$. Input $\Omega$ contains all quadratic combinations of elements of $\phi$ and vector $\theta$ contains the corresponding values from $H^L$.
The representation of the Q-function in (5.2) can also be viewed as a one-layer feedforward network with a linear transfer function and input $\Omega_k$. We can extend this representation by writing it similarly to (5.1):²
$$Q(x_k, u_k) = \Gamma_o\left(w_o^T\,\Gamma_h(W\Omega_k + b_h) + b_o\right). \qquad (5.3)$$
If we take $\Gamma_o$ and $\Gamma_h$ as linear functions and the biases $b_h$ and $b_o$ all zero, then there exist values for $W$ and $w_o$ such that (5.3) equals (5.2). Note that these weights are not unique, because the number of weights is higher than the number of parameters of (5.2). Since we want to stay as close as possible to the original LQRQL approach, we will only consider activation functions that resemble linear functions.
The neural representation is introduced to deal with non-quadratic Q-functions. The network in (5.3) with linear transfer functions cannot represent these functions, therefore we have to use nonlinear transfer functions. Let the output of hidden unit $i$ be given by:
$$\Gamma_{h,i}(w_{h,i}^T\Omega_k + b_{h,i}) = \tanh(w_{h,i}^T\Omega_k + b_{h,i}). \qquad (5.4)$$
The hyperbolic tangent is nonlinear, but in the origin it is zero and its derivative is one. So for small values of $w_{h,i}$, (5.4) still resembles a unit with a linear transfer function. Only when the weights of the hidden units become large will the Q-function no longer be a quadratic function.
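The construction above can be sketched in a few lines for a scalar state and control. The one-hidden-unit setup, the small-weight trick and all numeric values below are illustrative assumptions, not taken from this chapter:

```python
import math

def make_omega(x, u):
    # Omega contains all quadratic combinations of phi = (x, u),
    # i.e. vec(phi phi^T) for a scalar state and control: [x^2, xu, ux, u^2].
    phi = [x, u]
    return [pi * pj for pi in phi for pj in phi]

def q_neural(omega, W, b_h, w_o, b_o):
    """The network of eq. (5.3): tanh hidden units fed with Omega, linear output."""
    hidden = [math.tanh(sum(wij * oj for wij, oj in zip(w_row, omega)) + bj)
              for w_row, bj in zip(W, b_h)]
    return sum(wo * h for wo, h in zip(w_o, hidden)) + b_o

# With one hidden unit, tiny hidden weights eps*theta and output weight 1/eps,
# the network reproduces the quadratic Q-function Omega^T theta of eq. (5.2),
# because tanh(z) is approximately z near the origin:
theta = [2.0, 0.5, 0.5, 1.0]            # placeholder vec(H) values
eps = 1e-4
W, b_h, w_o, b_o = [[eps * t for t in theta]], [0.0], [1.0 / eps], 0.0

omega = make_omega(0.7, -0.4)
q_lin = sum(t * o for t, o in zip(theta, omega))   # eq. (5.2)
q_net = q_neural(omega, W, b_h, w_o, b_o)          # eq. (5.3), nearly identical
```

This mirrors the argument in the text: as long as the hidden weights stay small, the neural Q-function stays close to a quadratic one.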
5.3.1 Deriving the feedback function
Given the Q-function and its input, the values of $\hat{\theta}$ should be obtained. In the case of the standard LQRQL these are just the parameters that are estimated, so they are immediately available. For the neural Q-function this is different. The weights are the parameters that are estimated based on the train set, so the value of $\hat{\theta}$ should be derived from the weights.
Given $Q^L(\Omega_k)$ from (5.2), the parameters $\theta$ can also be obtained by computing the derivative of this function with respect to the input $\Omega$. This means that parameter $\theta_i$ can be computed according to:
$$\theta_i = \frac{\partial Q^L(\Omega_k)}{\partial \Omega_i}. \qquad (5.5)$$
² This Q-function still depends on the feedback function used to generate the data, which can also be a nonlinear function. Therefore we will omit the superscript L.
This shows that $\theta_i$ does not depend on $\Omega_k$, so the values of $\theta_i$ do not depend on the state and control action. When this is computed for all parameters of $\theta$, then $H^L$ can be derived and therefore $\tilde{L}$ can be computed.
In the same way $\hat{\theta}_i(\Omega)$³ can be computed for the neural Q-function in (5.3):
$$\hat{\theta}_i(\Omega_k) = \frac{\partial Q(\Omega_k)}{\partial \Omega_i} = \sum_{j=1}^{n_h} w_{o,j}\, w_{h,i,j}\left(1 - \tanh^2(w_{h,j}^T\Omega_k + b_{h,j})\right) \qquad (5.6)$$
where $w_{h,i,j}$ represents the weight from input $\Omega_i$ to unit $j$ and $n_h$ indicates the number of hidden units.
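The derivative in (5.6) can be sketched as follows for a scalar state and control. The weights, biases and input below are arbitrary small numbers chosen for illustration; the finite-difference comparison is only a sanity check, not part of the method:

```python
import math

def theta_hat(omega, W, b_h, w_o):
    """theta_hat(Omega) of eq. (5.6): the derivative of the network output of
    eq. (5.3) with respect to each input Omega_i, for a linear output unit."""
    acts = [sum(wij * oj for wij, oj in zip(w_row, omega)) + bj
            for w_row, bj in zip(W, b_h)]
    n_h = len(W)
    return [sum(w_o[j] * W[j][i] * (1.0 - math.tanh(acts[j]) ** 2)
                for j in range(n_h))
            for i in range(len(omega))]

def q_neural(omega, W, b_h, w_o):
    # forward pass of eq. (5.3), repeated here to keep the sketch self-contained
    hidden = [math.tanh(sum(wij * oj for wij, oj in zip(w_row, omega)) + bj)
              for w_row, bj in zip(W, b_h)]
    return sum(wo * h for wo, h in zip(w_o, hidden))

# Sanity check against a central finite difference:
W = [[0.3, -0.2, 0.1, 0.4], [0.05, 0.2, -0.3, 0.1]]   # two hidden units
b_h = [0.1, -0.2]
w_o = [0.7, -0.5]
omega = [0.49, -0.28, -0.28, 0.16]                    # quadratic input vector

th = theta_hat(omega, W, b_h, w_o)
h = 1e-6
for i in range(len(omega)):
    op = list(omega); op[i] += h
    om = list(omega); om[i] -= h
    fd = (q_neural(op, W, b_h, w_o) - q_neural(om, W, b_h, w_o)) / (2 * h)
    assert abs(th[i] - fd) < 1e-6   # analytic derivative matches
```

Unlike in the linear case of (5.5), the result depends on $\Omega$: evaluating `theta_hat` at a different input gives different values.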
When $\hat{\theta}(\Omega_k)$ is available, it can be rearranged into $\hat{H}(\Omega)$ so that $L(\Omega)$ can be computed. The control action can be computed according to $u_k = L(\Omega_k)x_k$. This is still a linear multiplication of the state with $L(\Omega_k)$, but it is a nonlinear feedback function because $\Omega_k$ contains $x_k$. The problem is that it also contains $u_k$, which is the control action that has to be computed.
In order to solve this problem (5.6) can be split into two parts:
$$\hat{\theta}_i = \underbrace{\sum_{j=1}^{n_h} w_{o,j}\, w_{h,i,j}}_{\text{linear}} \;-\; \underbrace{\sum_{j=1}^{n_h} w_{o,j}\, w_{h,i,j}\tanh^2(w_{h,j}^T\Omega_k + b_{h,j})}_{\text{nonlinear}}. \qquad (5.7)$$
The first part is indicated with “linear”, because it corresponds to the $\hat{\theta}_i$ that would be computed when all hidden units are linear. The resulting feedback function derived from this network would also be linear. The second part is indicated with “nonlinear” because it is the effect of the nonlinearities introduced by the hyperbolic tangent transfer functions.
A linear feedback can be obtained from the network by just ignoring the hyperbolic tangent functions. Define $\tilde{\theta}_i$ by:
$$\tilde{\theta}_i = \sum_{j=1}^{n_h} w_{o,j}\, w_{h,i,j}. \qquad (5.8)$$
This leads to a vector $\tilde{\theta}$. Analogously to $\hat{\theta}$ we can rearrange $\tilde{\theta}$ to form a quadratic function with parameters $\tilde{H}$. From $\tilde{H}$ we can derive a linear feedback $\tilde{L}$.
The feedback $\tilde{L}$ can be used to compute the control action $\tilde{u} = \tilde{L}x$. This control action can be used to control the system, but it can also be used to obtain the vector $\tilde{\Omega}$. The vector $\tilde{\Omega}$ represents the quadratic combination of the state $x$ with the control action $\tilde{u}$. This can be used to obtain the nonlinear feedback
$$u_k = (H_{uu}(\tilde{\Omega}_k))^{-1} H_{ux}(\tilde{\Omega}_k)\,x_k = L(\tilde{\Omega}_k)\,x_k. \qquad (5.9)$$
Here the linear feedback $L(\tilde{\Omega}_k)$ is a function of $x_k$, so that the resulting feedback function is nonlinear. If we compute this feedback function for the Q-function in figure 5.2(a), this results in the feedback function shown in figure 5.4.
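The two-step computation, a linear feedback from (5.8) followed by the state-dependent gain of (5.9), might look as follows for a scalar state and control. The vec-ordering of θ, the sign convention of (5.9) as printed, and all numeric values are assumptions made for the sketch:

```python
import math

def theta_tilde(W, w_o):
    # eq. (5.8): drop the tanh terms -> the globally valid linear part
    return [sum(w_o[j] * W[j][i] for j in range(len(W)))
            for i in range(len(W[0]))]

def theta_hat(omega, W, b_h, w_o):
    # eq. (5.6): state-dependent parameters of the quadratic Q-function
    acts = [sum(wij * oj for wij, oj in zip(w_row, omega)) + bj
            for w_row, bj in zip(W, b_h)]
    return [sum(w_o[j] * W[j][i] * (1.0 - math.tanh(acts[j]) ** 2)
                for j in range(len(W)))
            for i in range(len(omega))]

def gain(theta):
    # Scalar case: theta = vec(H) = [Hxx, Hux, Hxu, Huu]; the gain of
    # eq. (5.9) with the sign convention as printed there.
    return theta[1] / theta[3]

def nonlinear_feedback(x, W, b_h, w_o):
    """The two-step scheme: u_tilde = L_tilde * x from the linear part, build
    Omega_tilde from (x, u_tilde), then evaluate the gain L(Omega_tilde)."""
    u_t = gain(theta_tilde(W, w_o)) * x
    phi = [x, u_t]
    omega_t = [pi * pj for pi in phi for pj in phi]
    return gain(theta_hat(omega_t, W, b_h, w_o)) * x

# For small hidden weights the nonlinear feedback collapses to the linear one:
theta = [2.0, 0.5, 0.5, 1.0]           # placeholder vec(H), so L = 0.5
eps = 1e-4
W, b_h, w_o = [[eps * t for t in theta]], [0.0], [1.0 / eps]
u = nonlinear_feedback(0.3, W, b_h, w_o)   # close to 0.5 * 0.3
```

With larger hidden weights the factor $(1-\tanh^2(\cdot))$ varies with the state, and the returned action deviates smoothly from the linear feedback, as in figure 5.4.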
³ Strictly speaking, $\hat{\theta}_i(\Omega)$ is not an estimate. For consistency with the previous chapters we will keep indicating it with a hat.
Figure 5.4. Nonlinear feedback example. The feedforward network forms the same Q-function as in figure 5.2(a). The figure shows the resulting linear feedback $\tilde{L}$ and the nonlinear feedback function according to (5.9). For comparison the greedy feedback from figure 5.2(b) is also plotted.
5.3.2 Discussion
If we want to interpret the resulting feedback function, we have to look at (5.6). In case $w_{h,i}^T\Omega_k + b_{h,i}$ is large for a hidden unit $i$, the factor $(1 - \tanh^2(w_{h,i}^T\Omega_k + b_{h,i}))$ will be very small, so the contribution of this hidden unit to $\hat{\theta}_i$ will be very small. In case $w_{h,i}^T\Omega_k + b_{h,i}$ is zero for a hidden unit $i$, $\tanh(w_{h,i}^T\Omega_k + b_{h,i})$ will be very small. The hyperbolic tangent can then be ignored and the contribution of this hidden unit is the same as for the computation of the linear feedback $\tilde{L}$. We can therefore interpret the feedback function as a locally weighted combination of linear feedback functions.
Each hidden unit leads to a linear feedback function, and the value of the state $x$ determines how much the feedback function of each hidden unit contributes to the total feedback. When we compute $\tilde{L}$ all hidden units are weighted equally. This results in a linear feedback function that is globally valid, because it no longer depends on the state value. When the neural approach is applied to the LQR task, the resulting feedback should not depend on the state value. The weights of the hidden units should then become very small, so that the feedback function becomes linear.
The linear feedback $\tilde{L}$ does not have to be the same as the result $\hat{L}_{QL}$ of LQRQL when applied to the same data set. In standard LQRQL the parameters of a quadratic Q-function are estimated. When the Q-function to approximate is not quadratic, a small part of the train set can have a large influence on the estimated parameters. This is the case for training samples obtained in parts of the state space where the Q-function deviates a lot from a quadratic function. If (5.3) is used to approximate the Q-function, the influence of these points is much smaller. One hidden unit can be used to get a better approximation of the true Q-function for that part of the state space. Since $\tilde{L}$ is based on the average linear feedback formed by the hidden units, the contribution of these training samples to the resulting linear feedback $\tilde{L}$ is much smaller. This means that if the true Q-function is not quadratic, the linear feedback derived from the network is more reliable than the result of the standard LQRQL approach.
The linear feedback $\tilde{L}$ can always be derived from the network. This does not hold for the nonlinear $L(\tilde{\Omega}_k)$, because it depends on the weighting of the linear feedbacks based on the state value. It is possible that for some state values the contributions of all hidden units to $\hat{\theta}$ are zero. Then no feedback can be computed. This can only occur when all weights of the hidden units are large. This means that it is important to prevent the weights from becoming too large. One way of doing this is by incorporating a form of regularization in the learning rule. Another way is to scale down the reinforcements $r$, so that the Q-function to approximate becomes smoother. The easiest way is to make sure that the number of hidden units is not too large, so that it becomes very unlikely that the weights become too large.
5.4 Training the Network
The weights of the network in (5.3) can be found by training the network according to one of the methods described in section 2.3. If the method that minimizes the quadratic temporal difference error (2.24) using (2.25) and (2.26) with a discount factor γ = 1 is used, then the training is based on minimizing the same criterion as (3.25). The only difference is that the network starts with some random initial weights that are incrementally updated at each iteration step, while the least squares estimation gives the global minimum of the error at once. A consequence of this difference is that the weights of the network always have a value, while the least squares estimation gives no solution in case of a singularity. On the other hand, the training of the network can fail to find the global minimum of the error.
Because the network approximates a Q-function, (2.24) is written using $Q(\Omega, w)$:⁴
$$E = \frac{1}{2}\sum_{k=0}^{N-1}\left(r_k + Q(\Omega_{k+1}, w) - Q(\Omega_k, w)\right)^2, \qquad (5.10)$$
and (2.26) becomes
$$\Delta w_k = (r_k + Q(\Omega_{k+1}, w) - Q(\Omega_k, w))\left(\nabla_w Q(\Omega_{k+1}, w) - \nabla_w Q(\Omega_k, w)\right). \qquad (5.11)$$
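A minimal sketch of this training rule, assuming the simplest possible network $Q(\Omega, w) = w^T\Omega$ (so that $\nabla_w Q = \Omega$) and a learning rate that is not specified in the text:

```python
def td_update(w, omega_k, omega_k1, r_k, eta=0.05):
    """One gradient-descent step on the criterion (5.10) for a single
    transition, using the gradient of eq. (5.11). For Q(Omega, w) = w^T Omega
    the gradient of Q with respect to w is simply Omega."""
    q_k = sum(wi * oi for wi, oi in zip(w, omega_k))
    q_k1 = sum(wi * oi for wi, oi in zip(w, omega_k1))
    delta = r_k + q_k1 - q_k                               # temporal difference
    grad = [o1 - o0 for o0, o1 in zip(omega_k, omega_k1)]  # grad of (Q_{k+1} - Q_k)
    return [wi - eta * delta * gi for wi, gi in zip(w, grad)]

# A single transition with placeholder quadratic inputs and reinforcement;
# repeating the update shrinks the temporal difference error:
omega_k = [0.49, -0.28, -0.28, 0.16]
omega_k1 = [0.36, 0.12, 0.12, 0.04]
r_k = 0.5

def td_err(w):
    return r_k + sum(wi * oi for wi, oi in zip(w, omega_k1)) \
               - sum(wi * oi for wi, oi in zip(w, omega_k))

w = [0.0, 0.0, 0.0, 0.0]
e0 = abs(td_err(w))
for _ in range(200):
    w = td_update(w, omega_k, omega_k1, r_k)
e1 = abs(td_err(w))          # much smaller than e0
```

For the full network of (5.3) the same loop applies with $\nabla_w Q$ computed by backpropagation instead of being equal to $\Omega$.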
Before the network can be trained, the initial weights have to be selected. If the weights of the hidden units are chosen very small, the feedback function derived from this initial network is linear. If the true Q-function is quadratic, the weights change such that the feedback remains linear. Only when it is necessary, when the true Q-function is not quadratic, do the weights of the hidden units become larger. Then the resulting feedback will be nonlinear. In this way the resulting feedback is prevented from becoming very nonlinear when the simpler linear feedback is also possible.
⁴ When the discount factor γ = 1.
There are some differences between the estimation of the parameters of a quadratic function and the training of the network. The important differences are:
• No singularity
With too little exploration, no feedback can be computed in the standard approach. The neural approach starts with an initialized network and only changes its parameters. This means that there is no singularity for too little exploration. A singularity can only occur when the weights of the hidden units are so large that $\hat{\theta}$ is almost zero.
• Different results for low exploration
The training of the network is such that the temporal difference is minimized. Even if the global minimum is found, this does not necessarily result in an improved feedback function. If insufficient exploration is used, LQRQL results in the feedback used to generate the train set. It is very unlikely that the feedback derived from the network will be the same as the feedback used to generate the train set. Even when the same train set is used, the resulting feedbacks will not be the same when the networks are initialized differently.
• Improvement for high exploration
Sufficient exploration is achieved if the exploration contributes to the temporal difference error. Only then will the minimization of the temporal difference error lead to an improvement of the feedback function. The exploration makes the temporal difference error much higher. This means that a network trained on a set with insufficient exploration will have a lower error than one trained on a set with sufficient exploration. For a train set with sufficient data, different networks initialized with small weights will eventually result in similar feedback functions.
5.5 Simulation Experiments with a Nonlinear System
In chapter 4 we used a mobile robot to experiment with the diﬀerent approaches. We
used the same system to test the Neural QLearning approach. The main diﬀerence with
chapter 4 is that the resulting feedback function should be globally valid. This means that
we test the feedback function in a larger part of the state space. In order to compare the
results with a diﬀerent nonlinear feedback, we also used Gain Scheduling based on the
Extended QLearning approach.
5.5.1 Introduction
The system was identical to the one in chapter 4; we only have to specify the settings for the two approaches. To be able to compare the results, both approaches used the same train sets. This means that the train set was generated with one linear feedback $L = (10^{-3}\ \ 10^{-3})$.
Gain Scheduling
For the Gain Scheduling approach we had to divide the state space into separate partitions.
For each partition one local linear feedback was obtained by using extended LQRQL. For
each estimation a train set was selected consisting of all training samples in that particular
partition. This means that we had to make sure that for each partition suﬃcient training
samples were available to obtain a local feedback. We did this by generating more train
sets by starting the system from diﬀerent initial states.
In chapter 4 we observed that diﬀerent feedbacks were found when train sets were
generated with a diﬀerent initial δ. This implies that the feedback should be diﬀerent for
diﬀerent values of δ. We therefore divide the state space into three partitions according to:
• Negative partition: −∞ < δ < −1
• Middle partition: −1 < δ < 1
• Positive partition: 1 < δ < ∞
The resulting feedback function consists of three local linear functions. At the borders of the partitions these functions can differ, so the feedback function is not continuous. It is possible to make it a continuously differentiable function by smoothing it near the borders of the partitions. We did not do this because we wanted to see the local linear feedbacks as clearly as possible for the comparison with the Neural Q-Learning approach.
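The resulting gain-scheduled feedback can be sketched as a simple dispatch over the three partitions listed above. The local gains and the scalar regulated state are placeholders, not values estimated in this chapter, and δ and the regulated state are treated separately here for simplicity:

```python
def gain_scheduled_action(delta, x, gains):
    """Gain Scheduling sketch: pick one local linear gain depending on which
    partition delta falls in (delta < -1, -1 <= delta <= 1, delta > 1)."""
    l_neg, l_mid, l_pos = gains
    if delta < -1:
        l = l_neg
    elif delta <= 1:
        l = l_mid
    else:
        l = l_pos
    return l * x   # the action can jump when delta crosses a partition border

gains = (0.8, -0.003, -1.0)                           # placeholder local gains
u_left = gain_scheduled_action(0.999, 1.0, gains)     # middle partition
u_right = gain_scheduled_action(1.001, 1.0, gains)    # positive partition
```

Evaluating the action just on either side of a border (as with `u_left` and `u_right`) exposes the discontinuity discussed above.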
Neural QLearning
For the Neural QLearning approach, all train sets used for Gain Scheduling were combined
into one train set. This train set was used to train the network, by minimizing the quadratic
temporal difference error using (5.10). Because the result also depends on the initial weights, we trained the network for ten different initial weights. Then we used the network with the lowest quadratic temporal difference error. To prevent overfitting it is possible to split the train set in two and use one set for training and the other for testing. We did not do this because we wanted the resulting feedback to be based on the same train set as the Gain Scheduling approach. Instead we made sure that we did not have too many hidden units.
We chose the following settings for the network:
• Number of hidden units: 3.
This was to prevent overfitting and to keep the number of weights close to the number of parameters of the Gain Scheduling approach.
• The value of the initial output weights: 0.
• Values of initial weights and biases of hidden units: random between $-10^{-4}$ and $10^{-4}$.
Figure 5.5. The trajectories (in the X–Y plane) while generating the train set.
These weights are taken small so that the initially resulting feedbacks are linear and can become nonlinear during training.
5.5.2 The experimental procedure
The purpose of this experiment is to point out the major differences with the experiments in chapter 4. Different train sets are generated by starting the system from different initial δ. For one sequence of exploration the resulting train sets are shown in figure 5.5.
Figure 5.6. The control action as a function of the state: (a) Gain Scheduling; (b) Neural Q-Learning.

All train sets are combined to form one train set that is used to obtain the two global
nonlinear feedback functions. The resulting feedback functions of the train sets in ﬁgure 5.5
are shown in figure 5.6. In figure 5.6(a) the Gain Scheduling result is shown. The feedback function consists of three local linear feedbacks, but in this case two of the linear feedbacks are quite similar. The third feedback is different, and the “jump” at the boundary between the two partitions is very clear. In figure 5.6(b) the resulting feedback based on Neural Q-Learning is shown. It looks like one linear feedback with a smooth nonlinear correction.
The feedback functions were tested by starting the robot for different initial δ. In figure 5.7(a) four trajectories are shown for the Gain Scheduling result. The trajectory that starts in δ = 1.5 clearly shows a jump when it crosses the boundary between the partitions of the state space. After the jump the trajectory moves, similarly to the other trajectories in that partition, towards the line.
In ﬁgure 5.7(b) the resulting trajectories for Neural QLearning are shown. It is clear
that there are no jumps in the trajectories. Also it can be noticed that for large initial
distances to the line, δ = −1.5 and δ = 1.5, the robot initially moves faster towards the
line. This is a consequence of the nonlinear correction.
The main diﬀerence with the experiments in chapter 4 is that now two global nonlinear
feedbacks are used. This means that the tests from diﬀerent initial δ are performed with
the same feedback function. In chapter 4, each initial δ was tested with the corresponding
local linear feedback.
Figure 5.7. The trajectories: (a) Gain Scheduling; (b) Neural Q-Learning.
5.5.3 The performance of the global feedback functions
We did 30 simulation experiments to compare the performance of the two approaches.
Each experiment was performed as described in the preliminary experiments. As initial δ
we used: −1.75, −0.75, 0.75 and 1.75. For the Neural QLearning approach, the network
was trained with 5 diﬀerent initial weights. The network for which the lowest temporal
diﬀerence error was reached was tested.
The tests were performed by running the robot for 302 time steps. This is equivalent
to driving for approximately 2 minutes. The reason for using a long test period is that we
wanted the robot to visit a large part of the state space.
In figure 5.8(a) we plotted the total costs as a function of the initial δ for the best feedbacks of the two approaches. We see that the values of the resulting total costs of the Neural Q-Learning approach are symmetric around δ = 0. This is because of the quadratic input of the network. We also see that they are lower than those of the Gain Scheduling approach. The result of the Gain Scheduling approach is not symmetric, which indicates that the robot does not approach the line exactly. Instead it drives at a small distance parallel to the line, which also explains why the total costs are a little higher than for the Neural Q-Learning approach.
In figure 5.8(b) we see the average total costs of both approaches for all 30 experiments. Note that we plotted the average log(J), because the value of J varies between 10 and 10^5. We see that the Gain Scheduling approach performs very badly on average. This is because the results are either good, as shown in figure 5.8(a), or very bad. These very
Figure 5.8. The performances of the global feedback functions: (a) the best performance (minimum J versus δ_0) for Gain Scheduling and Neural Q-Learning; (b) the average performance (average log(J) versus δ_0).
bad results are consequences of bad local feedbacks at the middle partition. If the local
feedback around the line makes the robot move towards the outer partitions, the total
costs will always be very high. If in this case the linear feedback in that outer partition
is good, it will make the robot move towards the middle partition. As a consequence the
robot gets stuck at the boundary of these two partitions.
The reason for the bad performance of Gain Scheduling is that the resulting feedback function of the extended LQRQL approach is completely determined by the train set. In chapter 4 we already showed that not all resulting feedback functions will be good. In Gain Scheduling the state space is partitioned and for each partition the train set should lead to a good function. The partitioning makes it less likely that all partitions will have a good performance. The Neural Q-Learning approach uses the complete train set. It therefore does not have this problem and performs much better on average.
5.5.4 Training with a larger train set
The previous experiments showed that Gain Scheduling did not perform well. The only way to improve this result is to make sure that for each partition the train set is good. We
did an experiment to see whether the result of Gain Scheduling can be improved by using a
larger train set for each partition. We used all the train sets from the previous experiment
and combined them into one large train set. Then we applied the Gain Scheduling and
Neural QLearning approach to this larger train set.
Figure 5.9. The control action as a function of the state (δ, α) when a larger train set was used: (a) Gain Scheduling; (b) Neural Q-Learning.

In figure 5.9(a) we see the resulting nonlinear feedback for Gain Scheduling. For the middle partition $\hat{l} = -0.0032$, which shows that it is getting closer to the correct value of
l = 0. The value of $\hat{l}$ is 0.8085 for the negative and −1.0222 for the positive partition. This agrees with the discussion in section 4.5.1, where we indicated that L > 0 if δ < 0 and L < 0 if δ > 0. In figure 5.9(b) we see the resulting nonlinear feedback for Neural Q-Learning. This feedback is steeper when δ ≈ 0, compared to the feedback in figure 5.6(b).

δ_0      −1.75      −0.75       0.75       1.75
J_GS     219.3274   105.8559    91.3872    201.2774
J_NQ     121.8163    16.8742    16.8742    121.8163
Table 5.1. The total costs for the feedback function based on the large train set.
We tested the resulting feedback functions by starting the robot in the same initial states. Table 5.1 shows the resulting total costs for both approaches for different δ_0. The total costs J_GS are much lower than the average value in figure 5.8(b). This shows that the increase of the train set has a huge impact on the performance of Gain Scheduling. The total costs of the Neural Q-Learning approach are again lower than those of Gain Scheduling. The performance in figure 5.8(b) was already good based on the smaller train sets; therefore the increase in train set does not have such a huge impact.
The trajectories of the test runs are shown in figure 5.10 and figure 5.11. In figure 5.10 we see that for the outer partitions the robot rotates ½π in the direction of the line within one step. Then it arrives at a state where the control action is almost zero, so the robot moves straight ahead until it enters the middle partition. In the middle partition the robot slowly approaches the line. This is because the linear function of the middle partition in figure 5.9(a) is not steep enough. The reason is that for the middle partition the feedback L hardly contributes to the control action, so that it is very difficult to estimate the quadratic Q-function. This implies that the total costs are higher than those of the Neural Q-Learning approach because of the feedback in the middle partition. The trajectories of the Neural Q-Learning approach in figure 5.11 show that the robot moves faster to the line than the trajectories in figure 5.7(b). This is because the feedback in figure 5.9(b) is steeper near δ ≈ 0.
Figure 5.10. The trajectories for the Gain Scheduling feedback based on the large train set.
Figure 5.11. The trajectories for the Neural Q-Learning feedback based on the large train set.
5.6 Experiment on a Real Nonlinear System
We did an experiment with the real robot, where we used the exploration noise sequence
that gave the best results for both LQRQL approaches in chapter 4. The resulting feedback
functions are shown in figure 5.12. In figure 5.12(a) it is clear that the local linear feedback for the partition with the negative initial δ is wrong. This indicates that the exploration noise sequence used only results in a useful train set for positive initial δ. Since the resulting feedback in this partition is completely determined by the train set, the only way to solve this is by generating a new train set for this partition.
In figure 5.12(b) we see that the result of the Neural Q-Learning approach is again linear with a smooth nonlinear correction. We also see that the control actions are large (we observed the same in chapter 4 for the SI approach). In fact, the linear feedback derived from the network is too large. We see that in some parts of the state space the control action is reduced by the nonlinear correction. At these states the control actions are within the action bounds of the real robot, so these actions can be applied.
Because of the space limitation of the room, we tested for only one minute. The resulting total costs are shown in Table 5.2. Again we see that the resulting total costs for the Gain Scheduling approach are not symmetric. Also we see that the cost for δ_0 = −1.75 is very high. This corresponds to the partition for which the local linear feedback is wrong. The results of the Neural Q-Learning approach show that it performs very well for all initial δ.
In order to understand the results in Table 5.2, we plotted the trajectories.

δ_0      −1.75     −0.75     0.75      1.75
J_GS     634.007   89.008    29.993    103.707
J_NQ      82.362   11.600    11.471     82.189
Table 5.2. The total costs for the real nonlinear system.

In figure 5.13(a)
Figure 5.12. The control action as a function of the state (δ, α) for the real nonlinear system: (a) Gain Scheduling; (b) Neural Q-Learning.
we see that for the negative partition the robot moves in the wrong direction. For this partition the local feedback is wrong. For the other partition the robot moves towards a line parallel to the line to follow. This indicates that it will never follow the line exactly. This is a consequence of the local feedback in the middle partition.
In figure 5.13(b) we see that the two trajectories that start close to the line approach the line. We also see that the trajectories far from the line first rotate very fast towards the line. For the first few time steps the action is higher than the safety limit, so we only let the robot rotate at maximum speed. After that all actions are within the safety limit and the robot starts following the line. The high linear feedback from the network makes the robot move to the line very efficiently when it is close to the line. Far from the line the linear feedback would result in control actions that are too large; due to the nonlinear correction the size of these actions is reduced. In figure 5.12(b) we see that this correction occurs for state values that were part of the train set. For other parts of the state space the control actions are still too large.
These experiments have shown that the Gain Scheduling approach depends strongly on the training sets for each of the local linear feedbacks. It is possible that for some parts of the state space the feedback function found is not appropriate. The Neural Q-Learning approach results in a feedback function that is linear in most parts of the state space. In parts of the state space where the training set indicates that a correction is required, the control action deviates from the linear feedback. The resulting feedback function of Neural Q-Learning is a smooth function, unlike the feedback function of the Gain Scheduling approach, which can still contain discontinuities.
Figure 5.13. The trajectories for the real nonlinear system in the X-Y plane: (a) trajectories using Gain Scheduling, (b) trajectories using Neural LQRQL.
5.7 Discussion
In this chapter we proposed a method to apply Q-Learning to general Q-functions. We described how a feedforward network can be used to approximate the Q-function, so that the resulting feedbacks can also be nonlinear. We showed how the LQRQL approach can be used to obtain the feedback from the approximated Q-function. Then only one function has to be approximated: the feedback follows directly from the Q-function.
The advantage of using only one network is that only one network has to be trained, which makes this approach more convenient to use. More important is that the feedback function directly follows from the approximated Q-function. This is similar to Q-Learning, where the greedy policy is obtained by selecting the greedy action for each state.
A second advantage of our approach is that the resulting feedback function is smooth. This means that there are no sudden jumps as in Gain Scheduling. These jumps can also appear when the greedy action is computed directly from the function approximator. The problem with these jumps is that the behavior of the system becomes more unpredictable in the presence of noise. The effect of the jumps can be overcome, but for Neural Q-Learning this is not necessary.
Both the Gain Scheduling approach and the Neural Q-Learning approach derive a global nonlinear feedback function from a set of linear feedback functions. In Gain Scheduling only one local linear feedback determines the control action, so it is essential that the feedback function is good for every partition. This is not always the case, sometimes resulting in very poor performance. The only way to solve this is to generate more data.
The feedback function of the Neural Q-Learning approach can be seen as a locally weighted combination of linear feedback functions. Therefore the result is less dependent on the correctness of each individual linear feedback function. This means that a performance as poor as that of Gain Scheduling becomes very unlikely. Because the training is based on the complete train set, Neural Q-Learning yields a good feedback function from a smaller train set than the Gain Scheduling approach requires.
5.8 Conclusions
There are different ways to obtain a feedback when using a feedforward network as a Q-function. Our method is based on the idea that there should be a direct relation between the Q-function and the feedback. The method uses LQRQL to obtain a linear feedback from the Q-function. Then the linear feedback is used to compute the nonlinear feedback function. The result is a global linear function with local nonlinear corrections.
Chapter 6
Conclusions and Future work
The objective of the research described in this thesis is to answer the question whether Reinforcement Learning (RL) methods are viable for obtaining controllers for systems with a continuous state and action space. RL is normally applied to systems with a finite number of discrete states and possible actions. Many real systems have continuous state and action spaces, so algorithms for discrete state and action spaces cannot be applied directly.
Several applications of RL to the control of real systems have been described [65][59][2][61][47]. This indicates that RL can be used as a controller design method. For these applications the training was performed mainly in simulation, which means that the system has to be known. However, one of the characteristics of RL is that it optimizes the controller based on interactions with the system, for which no knowledge about the system is needed. When the training is performed in simulation, the potential of RL is not fully exploited.
There are two main reasons for not applying RL directly to optimize the controller
of a system. The ﬁrst reason is that for real control tasks a reliable controller is needed.
The RL algorithms used for the control of continuous state space tasks are either based
on heuristics or on discrete RL methods. This means that the closed loop performance
when RL is applied to a real system is not very well understood. Stability during learning
cannot be guaranteed, which explains the hesitation in applying these algorithms directly
on real systems.
The second reason is that RL may train very slowly. In simulation the time required
to obtain a suﬃciently large train set depends on the computing speed. The computation
of one state transition can be much faster than the actual sample time on the real system.
On a real system, the time required to obtain a suﬃciently large train set is determined by
the system. The only way to speed this up is to use algorithms that require less training.
In this thesis we looked at Q-Learning, a model-free RL method. We investigated the applicability of Q-Learning to real systems with continuous state and action spaces. We therefore focused on the two issues mentioned above: how can we guarantee convergence in the presence of system noise, and how can we learn from a small train set. We developed three methods: LQRQL for linear systems with quadratic costs, Extended LQRQL for local approximations for nonlinear systems, and Neural Q-Learning for finding a nonlinear feedback for nonlinear systems.
6.1 LQR and Q-Learning
A well-known framework in control theory is the Linear Quadratic Regulation (LQR) task. It is an optimal control task in which the system is linear and the direct costs are given by a quadratic function of the state and action. The objective is to minimize the total future cost. When the parameters of the system are known, the optimal feedback function can be obtained by solving the Discrete Algebraic Riccati Equation. This solution is used to compute the optimal linear feedback.
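As an illustration of this computation, the following sketch (our own code, not part of the thesis; the system matrices and cost weights are arbitrary examples) iterates the Riccati recursion to a fixed point and extracts the linear feedback gain, using the thesis notation S, R for the cost parameters and K for the DARE solution:

```python
import numpy as np

def lqr_feedback(A, B, S, R, iters=500):
    # Iterate the Riccati recursion to (an approximation of) the DARE solution K,
    # then compute the feedback gain L = (R + B^T K B)^{-1} B^T K A.
    K = np.array(S, dtype=float)
    for _ in range(iters):
        BtKA = B.T @ K @ A
        K = S + A.T @ K @ A - BtKA.T @ np.linalg.solve(R + B.T @ K @ B, BtKA)
    L = np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)
    return L, K

# Example: a double integrator with quadratic costs S = I, R = 1
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
L, K = lqr_feedback(A, B, np.eye(2), np.array([[1.0]]))
```

With u_k = -L x_k the closed loop A - BL is stable, i.e. all of its eigenvalues lie inside the unit circle.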
When the parameters of the linear system are unknown, System Identification (SI) can be used to estimate them, and the estimated parameters can then be used to compute the optimal feedback. It has been shown [16][44] that the LQR task can also be solved by Q-Learning. For this the original Q-Learning algorithm is adapted so that it can deal with the continuous state and action space of the linear system. We called this approach LQR Q-Learning (LQRQL).
It can be proven that the resulting linear feedback will eventually converge to the optimal feedback when sufficient exploration is used. These results only apply when there is no noise. In practice system noise will always be present, so we investigated the influence of the system noise on the performance of the resulting linear feedbacks. We aimed at determining the amount of exploration required for a guaranteed improvement and compared this with the more traditional SI approach. For a fair comparison between these methods, we used a batch least squares estimation. So unlike the recursive least squares approach in [16], our results depend only on the train sets used.
To show the influence of the system noise on the performance we introduced the exploration characteristic, in which four types of outcomes can be distinguished for the SI and LQRQL approaches. If the amount of exploration is:
I Much too low: no feedback can be computed.
II Too low: the resulting feedback will be the same as the one used for the generation of the train set.
III Almost sufficient: the resulting feedback can be anything, depending on the sequence of system noise and exploration.
IV Sufficient: the parameters of the Q-function and system approach the correct values; the new feedback will be an improvement.
Only the type IV outcome is useful, so sufficient exploration is required. We derived that the SI approach requires an amount of exploration at least as large as the system noise, and furthermore that the LQRQL approach requires more exploration than the SI approach.
6.1.1 Future work
Eigenvalues: The Q-function is given by a sum of positive definite quadratic functions (the reinforcements), so the correct Q-function is a positive definite quadratic function as well. This can be verified by looking at the eigenvalues of the matrix that represents the parameters of the Q-function. Negative eigenvalues are only possible if the parameters of the Q-function are not estimated correctly, which can only be the case for the type II and III outcomes. So negative eigenvalues imply that insufficient exploration was used to generate the train set. In our experiments we always found negative eigenvalues for insufficient exploration, but we cannot guarantee that the estimated Q-function will never be positive definite for insufficient exploration. In order to use the eigenvalues as a reliability measure, it still has to be proven that insufficient exploration will never result in a positive definite Q-function.
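As a minimal sketch of such a reliability check (our own illustration, not code from the thesis), the symmetric part of an estimated parameter matrix H can be tested for positive definiteness:

```python
import numpy as np

def q_estimate_reliable(H, tol=1e-9):
    # A type IV outcome requires a positive definite quadratic Q-function;
    # any non-positive eigenvalue flags insufficient exploration (type II/III).
    Hs = 0.5 * (H + H.T)  # symmetrize the estimate
    return bool(np.all(np.linalg.eigvalsh(Hs) > tol))
```

For example, `q_estimate_reliable(np.eye(3))` returns True, while any estimate with a negative eigenvalue returns False.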
Exploration: The LQRQL approach requires more exploration than the SI approach. For real systems adding disturbances to the control action is not desirable, so to enhance practical applicability the required amount of exploration has to be reduced. One simple way of doing this is to use the SARSA variant of Q-Learning. Q-Learning uses the next action according to the feedback to compute the temporal difference, which does not include the exploration. SARSA instead uses the action actually taken at the next time step, which includes the exploration at that time step. So each exploration added to the control action is used twice. Experiments have shown that this approach requires only the same amount of exploration as the SI approach. However, SARSA introduces a bias in the estimation, making it perform worse for very high amounts of exploration. These results are only experimental and should be verified theoretically.
6.2 Extended LQRQL
LQRQL was developed for linear systems, not for nonlinear systems. Many control design approaches for nonlinear systems are based on local linear approximations of the nonlinear system. LQRQL can be used to obtain a local linear approximation of the optimal feedback function. To investigate the consequences of the nonlinearity, the model of the nonlinear system can be written as a linear system with a nonlinear correction.
The linear feedbacks of the SI and LQRQL approaches are not always appropriate feedback functions to approximate the optimal feedback function. An additional offset can be included in the feedback function. This allows for a better local approximation of the optimal feedback function in case the average nonlinear correction is large in the region of interest. To obtain the value of the offset, the parameters of a more general quadratic Q-function have to be estimated. We showed that these parameters can be estimated in the same way as in the standard LQRQL approach. We called this new approach the extended LQRQL approach. Our experiments on a simulated and on a real nonlinear system confirmed that the extended LQRQL approach results in a better local approximation of the optimal feedback function. We can conclude that the extended LQRQL approach has to be applied if a nonlinear feedback function is based on multiple local linear approximations.
6.2.1 Future work
On-line learning: Throughout this thesis we used batch learning, which means that learning starts when the complete train set is available. We did this to make fair comparisons between the different methods possible. Usually RL is used as an on-line learning method. The recursive least squares approach for the LQR task in [16][44] is an on-line approach. Although our Q-function is quadratic, it can be seen as a function approximator that is linear in the parameters. For these function approximators the conditions are known under which convergence is guaranteed when on-line TD(λ) learning is used [29][31]. How the on-line TD(λ) approach compares with the recursive least squares approach is not known and should be investigated.
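For reference, a bare-bones sketch of on-line TD(λ) with a linear-in-parameters approximator (a generic value prediction update, not the thesis's implementation; the feature vectors, rewards, and step sizes are placeholders):

```python
import numpy as np

def td_lambda_linear(features, rewards, gamma=0.95, lam=0.7, alpha=0.01):
    # features[k] is the feature vector xi_k; rewards[k] the reinforcement at step k.
    w = np.zeros(len(features[0]))
    e = np.zeros_like(w)  # eligibility trace
    for k in range(len(rewards)):
        xi, xi_next = features[k], features[k + 1]
        delta = rewards[k] + gamma * (w @ xi_next) - (w @ xi)  # temporal difference
        e = gamma * lam * e + xi   # accumulate trace
        w = w + alpha * delta * e  # on-line parameter update
    return w
```

The linearity in w is what makes the convergence conditions of [29][31] applicable.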
6.3 Neural Q-Learning
Function approximators have been used in RL. One approach uses the actor/critic configuration. One function, called the critic, is used to approximate the Q-function. The other function, called the actor, is used to approximate the feedback function. Both approximators have to be trained, where the actor is trained based on the critic. This makes the training procedure rather tedious and the outcome hard to analyze.
The LQRQL approach can be combined with a feedforward neural network approximation of the Q-function. In this case there is no actor, because in LQRQL the linear feedback function follows directly from the parameters of the Q-function. So only one function approximator has to be trained. We called this approach Neural Q-Learning.
To obtain a nonlinear feedback function for a nonlinear system using Neural Q-Learning, first a linear feedback has to be determined. The derivatives with respect to the inputs of the network give the parameters that would be estimated by the LQRQL approach. These parameters depend on the state and control action, and therefore it is not possible to directly derive a nonlinear feedback function. It is, however, possible to ignore the hyperbolic tangent activation functions of the hidden units to obtain a globally valid linear feedback function. This linear feedback function can be used to compute the control action that is necessary to determine the nonlinear feedback function. The resulting nonlinear feedback function can be regarded as a locally weighted function, where each hidden unit results in a local linear feedback and the state value determines the weighting of these local linear functions.
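The contrast with hard switching can be illustrated by a toy sketch (our own simplification using Gaussian weights, not the tanh-based construction of the thesis): a state-dependent smooth blend of local linear gains produces a continuous feedback function.

```python
import numpy as np

def blended_feedback(x, gains, centers, width=1.0):
    # Smoothly weight local linear gains L_i by the distance of x to each center,
    # instead of hard-switching between partitions as in Gain Scheduling.
    d2 = np.array([np.dot(x - c, x - c) for c in centers])
    w = np.exp(-d2 / width**2)
    w = w / w.sum()  # normalized weights
    L = sum(wi * Li for wi, Li in zip(w, gains))
    return L @ x     # blended control action
```

Because the weights vary continuously with the state, the resulting feedback has no discontinuities at partition boundaries.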
Experiments were performed on a simulated and on a real nonlinear system, where the goal of the experiments was to obtain global nonlinear feedback functions. We compared Neural Q-Learning with Gain Scheduling. Gain Scheduling can be performed by making a local approximation for each partition of the state space using the extended LQRQL approach. The experiments have shown that the Neural Q-Learning approach requires less training data than the Gain Scheduling approach. Therefore Neural Q-Learning is better suited for real systems.
6.3.1 Future Work
Value Iteration: The Neural Q-Learning approach we presented is still based on policy iteration, whereas the original Q-Learning approach is based on value iteration. Because the feedback function directly follows from the approximated Q-function, it is now also possible to apply value iteration, which can then be combined with an on-line learning approach. Whether Neural Q-Learning using value iteration performs better than the policy iteration variant still has to be investigated.
Regularization: The weights of the hidden units should not become too large. In supervised learning there exist approaches for regularization; one way is to assign costs to the size of the weights. The consequences of this modification of the learning for Neural Q-Learning are unknown. It could have the effect that the resulting nonlinear feedback function no longer performs well. This means that further research is necessary to determine whether regularization can be applied.
6.4 General Conclusion
Reinforcement Learning can be used to optimize controllers for real systems with continuous state and action spaces. To exploit the benefits of reinforcement learning, the approach should be based on Q-Learning, where the feedback function directly follows from the approximated Q-function. The Q-function should be represented by a function approximator, of which the parameters are estimated from the train set. If it is known that the system is linear, then the parameters of a quadratic Q-function can be estimated: this is the LQRQL approach. In case there is no knowledge about the system, the more general Neural Q-Learning approach can be applied. This will give a globally valid linear feedback function with local nonlinear corrections.
Appendix A
The Least Squares Estimation
A.1 The QRDecomposition
Two matrices Z and M have to be found for which ZM = X and Z^T Z = I. Let X have n columns and N-1 rows; then Z also has n columns and N-1 rows, and M is an n x n upper triangular matrix. Z and M can be found by Gram-Schmidt orthogonalization, for which the result can be written using projection matrices P. Let z_{*i} be the i-th column of Z and X_{*i} the i-th column of X. Then z_{*i} is given by:
\[ z_{*i} = \frac{P_i X_{*i}}{\| P_i X_{*i} \|_2}. \tag{A.1} \]
Matrix M is given by:
\[ M = \begin{pmatrix} \|P_1 X_{*1}\|_2 & z_{*1}^T X_{*2} & \cdots & z_{*1}^T X_{*n} \\ 0 & \|P_2 X_{*2}\|_2 & \cdots & z_{*2}^T X_{*n} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \|P_n X_{*n}\|_2 \end{pmatrix}. \tag{A.2} \]
The projection matrices can be defined recursively. Let P_1 = I; then for every j > 1:
\[ P_j = \prod_{i=1}^{j-1} \left( I - z_{*i} z_{*i}^T \right) = I - \sum_{i=1}^{j-1} z_{*i} z_{*i}^T \tag{A.3} \]
\[ \phantom{P_j} = I - \sum_{i=1}^{j-1} \frac{P_i X_{*i} X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} = P_{j-1} - \frac{P_{j-1} X_{*j-1} X_{*j-1}^T P_{j-1}^T}{\|P_{j-1} X_{*j-1}\|_2^2}. \tag{A.4} \]
The projection matrix has the following properties: P_i^T = P_i, P_i^2 = P_i and P_i P_j = P_j for all j > i. Each P is an (N-1) x (N-1) matrix.
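The construction above can be checked numerically. The following sketch (our own illustration, not code from the thesis) builds Z and M column by column with the projection recursion of (A.3)-(A.4):

```python
import numpy as np

def qr_gram_schmidt(X):
    # Factor X = Z M with orthonormal columns Z (Z^T Z = I) and
    # upper triangular M, via the projection matrices P_i.
    N1, n = X.shape
    Z = np.zeros((N1, n))
    M = np.zeros((n, n))
    P = np.eye(N1)                             # P_1 = I
    for i in range(n):
        Px = P @ X[:, i]                       # P_i X_{*i}
        M[i, i] = np.linalg.norm(Px)           # ||P_i X_{*i}||_2
        Z[:, i] = Px / M[i, i]                 # equation (A.1)
        M[i, i + 1:] = Z[:, i] @ X[:, i + 1:]  # off-diagonal entries z_{*i}^T X_{*j}
        P = P - np.outer(Z[:, i], Z[:, i])     # recursion (A.3)
    return Z, M
```

For a full column rank X this reproduces ZM = X with Z^T Z = I and M upper triangular.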
A.2 The Least Squares Solution
The main difficulty in solving (3.30) is the computation of M^{-1}. But M is upper triangular, so the inverse can be found by backward substitution. Let m_{i,j} indicate the element of M at (i, j) and let m^{(-1)}_{i,j} indicate the element of M^{-1} at (i, j). The elements of M^{-1} are given by:
\[ m^{(-1)}_{i,j} = \begin{cases} -\dfrac{1}{m_{j,j}} \displaystyle\sum_{k=i}^{j-1} m_{k,j}\, m^{(-1)}_{i,k} & \text{for } i < j \\[1ex] \dfrac{1}{m_{i,i}} & \text{for } i = j \\[1ex] 0 & \text{for } i > j. \end{cases} \tag{A.5} \]
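A direct transcription of (A.5) into code (an illustrative sketch of our own):

```python
import numpy as np

def upper_tri_inverse(M):
    # Backward substitution for the inverse of an upper triangular matrix,
    # filling each row of M^{-1} from the diagonal outward as in (A.5).
    n = M.shape[0]
    Minv = np.zeros_like(M, dtype=float)
    for i in range(n):
        Minv[i, i] = 1.0 / M[i, i]
        for j in range(i + 1, n):
            Minv[i, j] = -sum(M[k, j] * Minv[i, k] for k in range(i, j)) / M[j, j]
    return Minv
```

Multiplying the result by M recovers the identity matrix.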
With this result, the value of M^{-1} Z^T can be written as:
\[ F = M^{-1} Z^T = \begin{pmatrix} m^{(-1)}_{1,1} z_{*1}^T + m^{(-1)}_{1,2} z_{*2}^T + \cdots + m^{(-1)}_{1,n} z_{*n}^T \\ m^{(-1)}_{2,2} z_{*2}^T + \cdots + m^{(-1)}_{2,n} z_{*n}^T \\ \vdots \\ m^{(-1)}_{n,n} z_{*n}^T \end{pmatrix}. \tag{A.6} \]
Now the values of (A.5) and (A.2) are filled in to find the expression for the rows of F. Let F_{i*} be the i-th row of F (with i < n); then this can be written as:
\[ F_{i*} = m^{(-1)}_{i,i} z_{*i}^T + \sum_{j=i+1}^{n} m^{(-1)}_{i,j} z_{*j}^T \tag{A.7} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} - \sum_{j=i+1}^{n} \frac{1}{m_{j,j}} \left( \sum_{k=i}^{j-1} m_{k,j}\, m^{(-1)}_{i,k} \right) z_{*j}^T \tag{A.8} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} + m^{(-1)}_{i,i+1} z_{*i+1}^T + m^{(-1)}_{i,i+2} z_{*i+2}^T + \cdots + m^{(-1)}_{i,n} z_{*n}^T \tag{A.9} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} - \frac{m_{i,i+1}\, z_{*i+1}^T}{m_{i,i}\, m_{i+1,i+1}} - \frac{z_{*i+2}^T}{m_{i+2,i+2}} \left( \frac{m_{i,i+2}}{m_{i,i}} - \frac{m_{i,i+1}\, m_{i+1,i+2}}{m_{i,i}\, m_{i+1,i+1}} \right) + \cdots \tag{A.10} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} - \frac{z_{*i}^T X_{*i+1}\, z_{*i+1}^T}{m_{i,i}\, m_{i+1,i+1}} - \frac{1}{m_{i+2,i+2}} \left( \frac{z_{*i}^T X_{*i+2}}{m_{i,i}} - \frac{z_{*i}^T X_{*i+1}\, z_{*i+1}^T X_{*i+2}}{m_{i,i}\, m_{i+1,i+1}} \right) z_{*i+2}^T + \cdots \tag{A.11} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} \left( I - \frac{X_{*i+1} z_{*i+1}^T}{m_{i+1,i+1}} - \frac{X_{*i+2} z_{*i+2}^T}{m_{i+2,i+2}} + \frac{X_{*i+1} z_{*i+1}^T X_{*i+2} z_{*i+2}^T}{m_{i+1,i+1}\, m_{i+2,i+2}} + \cdots \right) \tag{A.12} \]
\[ = \frac{z_{*i}^T}{m_{i,i}} \prod_{j=i+1}^{n} \left( I - \frac{X_{*j} z_{*j}^T}{m_{j,j}} \right) \tag{A.13} \]
\[ = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \prod_{j=i+1}^{n} \left( I - \frac{X_{*j} X_{*j}^T P_j^T}{\|P_j X_{*j}\|_2^2} \right) \tag{A.14} \]
\[ = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \prod_{j=i+1}^{n} \left( I - X_{*j} F_{j*} \right). \tag{A.15} \]
For i = n there is no sum in (A.7), so the resulting expression does not have the product. The last row of the least squares solution, \(\hat{\theta}_{*n}\), is given by F_{n*} Y, which results in:
\[ \hat{\theta}_{*n} = \frac{X_{*n}^T P_n^T}{\|P_n X_{*n}\|_2^2}\, Y. \tag{A.16} \]
Starting with the last row, all other rows can be computed recursively:
\[ \hat{\theta}_{*i} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left( Y - \sum_{j=i+1}^{n} X_{*j}\, \hat{\theta}_{*j} \right). \tag{A.17} \]
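The recursion (A.16)-(A.17) can be verified against a standard least squares solver. A sketch (our own illustration, for a vector-valued Y):

```python
import numpy as np

def lsq_by_projection(X, Y):
    # Least squares via the backward recursion (A.16)-(A.17):
    # each theta_i regresses the remaining residual of Y on P_i X_{*i}.
    N1, n = X.shape
    P = [np.eye(N1)]                           # P_1 = I
    for i in range(n - 1):
        Px = P[i] @ X[:, i]
        P.append(P[i] - np.outer(Px, Px) / (Px @ Px))  # recursion (A.4)
    theta = np.zeros(n)
    for i in range(n - 1, -1, -1):             # start with the last row (A.16)
        Px = P[i] @ X[:, i]
        resid = Y - X[:, i + 1:] @ theta[i + 1:]
        theta[i] = (Px @ resid) / (Px @ Px)    # uses P_i^T P_i = P_i
    return theta
```

For a full column rank X the result coincides with the ordinary least squares solution.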
A.3 The SI Estimation
The solution according to (3.33) starts with the last row of \(\hat{\theta}\), so first \(\hat{\theta}_B = \hat{B}^T\) will be shown:
\[ \hat{\theta}_{B,n_u*} = \frac{_{*n_u}^T P_{n_x+n_u}^T}{\|P_{n_x+n_u} _{*n_u}\|_2^2}\, Y_{SI} = \frac{c_{*n_u}^T P_{n_x+n_u}^T}{\|P_{n_x+n_u} c_{*n_u}\|_2^2}\, Y_{SI}, \quad \text{and} \tag{A.18} \]
\[ \hat{\theta}_{B,i*} = \frac{c_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i} c_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=n_x+i+1}^{n_x+n_u} _{*j}\, \hat{\theta}_{B,j*} \right). \tag{A.19} \]
Here we used \(P_{n_x+i} _{*i} = P_{n_x+i}(AL^T + c)_{*i} = P_{n_x+i} c_{*i}\), because \(P_{n_x+i}\) removes the part that is linearly dependent on the previous columns of X_{SI}. Then the rows of \(\hat{\theta}_A\) can be expressed as:
\[ \hat{\theta}_{A,i*} = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=i+1}^{n_x} A_{*j}\, \hat{\theta}_{A,*j} - \sum_{j=1}^{n_u} _{*j}\, \hat{\theta}_{B,j*} \right) \tag{A.20} \]
\[ = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=i+1}^{n_x} A_{*j}\, \hat{\theta}_{A,*j} - \, \hat{\theta}_B \right). \tag{A.21} \]
(The expression for \(\hat{\theta}_{A,n_x*}\) does not have the first sum.) The matrices \(\hat{B}\) and \(\hat{A}\) are the transposes of \(\hat{\theta}_B\) and \(\hat{\theta}_A\), so \(\hat{\theta}_{B,i*} = \hat{B}_{*i}\) and \(\hat{\theta}_{A,i*} = \hat{A}_{*i}\).
The matrices A and  can be used to find an estimate of the feedback, because  = AL^T + c. It is also possible to write the feedback L using A, , and c, because L^T = (A^T A)^{-1} A^T ( - c). So it is also possible to write:
\[ L_{*i} = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left(  - c - \sum_{j=i+1}^{n_x} A_{*j} L_{j*} \right). \tag{A.22} \]
This can be used to write (A.21) as:
\[ \hat{\theta}_{A,i*} = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=i+1}^{n_x} A_{*j}\, \hat{\theta}_{A,*j} \right) - \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2}\, \, \hat{\theta}_B \tag{A.23} \]
\[ = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=i+1}^{n_x} A_{*j}\, \hat{\theta}_{A,*j} - c\, \hat{\theta}_B - \sum_{j=i+1}^{n_x} A_{*j} L_{j*}\, \hat{\theta}_B \right) - L_{*i}\, \hat{\theta}_B \tag{A.24} \]
\[ \hat{\theta}_{A,i*} + L_{*i}\, \hat{\theta}_B = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - \sum_{j=i+1}^{n_x} A_{*j} \left( \hat{\theta}_{A,*j} + L_{j*}\, \hat{\theta}_B \right) - c\, \hat{\theta}_B \right). \tag{A.25} \]
Define \(\hat{\theta}_{D,i*} = \hat{\theta}_{A,i*} + L_{*i}\, \hat{\theta}_B\), so that:
\[ \hat{\theta}_{D,i*} = \frac{A_{*i}^T P_i^T}{\|P_i A_{*i}\|_2^2} \left( Y_{SI} - c\, \hat{\theta}_B - \sum_{j=i+1}^{n_x} A_{*j}\, \hat{\theta}_{D,*j} \right), \tag{A.26} \]
which represents the estimation of the closed loop.
Appendix B
The Mobile Robot
B.1 The robot
The robot we used in the experiments is a Nomad Super Scout II, see figure B.1. This is a mobile robot with a two-wheel differential drive at its geometric center. The drive motors of both wheels are independent and the width of the robot is 41 cm. The maximum speed is 1 m/s at an acceleration of 2 m/s².
The robot has an MC68332 processor board for the low-level processes. These include sending the control commands to the drive motors, but also keeping track of the position and orientation of the robot by means of odometry. All other software runs on a second board equipped with a Pentium II 233 MHz processor.
Figure B.1. The Nomad Super Scout II
B.2 The model of the robot
The mobile robot has three degrees of freedom. These define the robot's position x and y in the world and the robot's orientation φ. The speeds of the left and right wheel, v_l and v_r, are the control actions that make the robot move.
Because the geometric center of the robot is right between the wheels, the control actions can also be expressed as a traversal speed v_t and a rotational speed ω. The traversal speed is given by:
\[ v_t = \tfrac{1}{2} (v_l + v_r). \tag{B.1} \]
The rotational speed is given by:
\[ \omega = \frac{1}{W} (v_r - v_l), \tag{B.2} \]
where W indicates the width of the robot (41 cm).
The change of position and orientation is given by:
\[ \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{\phi} \end{pmatrix} = \begin{pmatrix} \sin(\phi)\, v_t \\ \cos(\phi)\, v_t \\ \omega \end{pmatrix}. \tag{B.3} \]
We can notice here that the change of the orientation is independent of the position, but the change of position depends on the orientation. We also have to note that (B.3) is a simplified model of the robot, since it does not take the acceleration into account.
By taking the integral over a fixed sample time interval T, the discrete-time state transition can be derived:
\[ x_{k+1} = x_k + \frac{v_t}{\omega} \left( \sin(\phi_k + T\omega) - \sin(\phi_k) \right) \tag{B.4} \]
\[ y_{k+1} = y_k + \frac{v_t}{\omega} \left( \cos(\phi_k) - \cos(\phi_k + T\omega) \right) \tag{B.5} \]
\[ \phi_{k+1} = \phi_k + \omega T \tag{B.6} \]
This holds for any ω ≠ 0. If ω = 0, the orientation does not change and x_{k+1} = x_k + T v_t sin(φ_k) and y_{k+1} = y_k + T v_t cos(φ_k).
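The kinematic model above can be collected in a small sketch (our own code, not from the thesis; the numerical threshold for the ω = 0 branch is an assumption):

```python
import math

W = 0.41  # width of the robot in meters

def wheel_to_body(v_l, v_r):
    # (B.1)-(B.2): left/right wheel speeds to traversal and rotational speed.
    return 0.5 * (v_l + v_r), (v_r - v_l) / W

def step(x, y, phi, v_t, omega, T):
    # Discrete-time state transition (B.4)-(B.6); the omega = 0 limit is a
    # straight-line move in the current heading.
    if abs(omega) < 1e-9:
        return x + T * v_t * math.sin(phi), y + T * v_t * math.cos(phi), phi
    x1 = x + (v_t / omega) * (math.sin(phi + T * omega) - math.sin(phi))
    y1 = y + (v_t / omega) * (math.cos(phi) - math.cos(phi + T * omega))
    return x1, y1, phi + omega * T
```

Note the thesis convention in (B.3): at φ = 0 the robot drives in the y direction.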
Appendix C
Notation and Symbols
C.1 Notation
Conventions to indicate properties of variables; A is used as an example variable.
A′     The accent indicates a new result of an iteration.
A      Bold indicates a vector.
Â      The hat indicates the result of an estimation.
Ã      The tilde indicates a dummy variable.
A*     The star indicates an optimal solution.
Ā      The bar indicates an error or deviation between the real value and the desired value.
A_i    A subscript can be an indication of the time step, an element from A, a submatrix, or the method applied. Multiple subscripts are separated by a comma.
A_{*i} Indicates column i of matrix A.
A_{j*} Indicates row j of matrix A.
Operations on matrices; A is used as an example matrix:
\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad A^T = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}, \]
\[ \mathrm{vec}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{pmatrix}, \qquad \overline{\mathrm{vec}}(A) = \begin{pmatrix} a_{11} \\ a_{12} \\ a_{22} \end{pmatrix}. \]
Indices indicating the applied method:
SI   System Identification
QL   Standard LQRQL
EX   Extended LQRQL
NQ   Neural Q-Learning
GS   Gain Scheduling
C.2 Symbols
Chapter 1
x        Continuous state vector
u        Continuous control action vector
v        Continuous noise vector; all elements are zero-mean Gaussian and white
n_x      Dimension of the state space
n_u      Dimension of the control action space
k        Time step
f        State transition function
g        State feedback function
A, B     Parameter matrices of a linear system
L        Parameters of a linear state feedback
D        Parameters of the closed loop of a linear system

Chapter 2
s        Discrete state
a        Discrete action
r        Reinforcement
π        Policy
V        Value function
N        Number of time steps
P        Probability matrix
E{·}     Expectation value
R        Expected reinforcements
γ        Discount factor
m, l     Iteration indices
α        Learning rate
Q        Q-function
τ        Eligibility trace
ξ        Input of the function approximator
w        Weights of the function approximator
E        Error function

Chapter 3
σ_v      Standard deviation of system noise v
S, R     Parameters of the quadratic direct cost
J        Total costs
K        Solution of the Discrete Algebraic Riccati Equation
e        Exploration
σ_e      Standard deviation of exploration noise e
X, Y, V  Matrices to form the least squares estimation
θ        Parameters to be estimated by the linear least squares estimation
H        Parameters of the quadratic Q-function; subscripts indicate submatrices of H
φ        Concatenation of state vector x and control vector u
ρ        Relative performance
P        Projection matrix
A, , c  Submatrices of X for the SI approach
c        Some constants
Φ        Matrix with differences in quadratic state/action values
w        Noise contribution for the LQRQL approach
L, L_v   Different representations of the linear feedback
κ        A constant
Ψ        Submatrices of X for the LQRQL approach
T, T_ee, Υ  Quadratic combinations of actions and exploration

Chapter 4
w        Nonlinear correction
x_s      Set point
u_s      Action to keep the system at its set point
x_eq     Equilibrium state
u_eq     Action at the equilibrium state
l        Additional constant in the feedback function
G        Extra parameters for the extended Q-function
x, y, φ  Position and orientation coordinates of the robot in the world
δ        Distance to the line
α        Orientation with respect to the line

Chapter 5
w_o, w_h, b_h  Weights and biases of the network
Γ_o, Γ_h       Activation functions of the units
Ω              The quadratic combination of state and control action, the input of the network for Neural Q-Learning
Bibliography
[1] P.E. An, S. AslamMir, M. Brown, and C.J. Harris. A reinforcement learning approach
to online optimal control. In Proceedings of the International Conference on Neural
Networks, 1994.
[2] C.G. Anderson, D. Hittle, A. Katz, and R. Kretchmar. Synthesis of reinforcement
learning, neural networks, and pi control applied to a simulated heating coil. Journal
of Artiﬁcial Intelligence in Engineering, 1996.
[3] K.J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley, 1989.
[4] C.G. Atkeson, S.A. Schaal, and A.W. Moore. Locally weighted learning. AI Review,
1997.
[5] C.G. Atkeson, S.A. Schaal, and A.W. Moore. Locally weighted learning for control.
AI Review, 1997.
[6] L. Baird. Residual algorithms: Reinforcement learning with function approximation.
In Machine Learning: Proceedings of the Twelfth International Conference, 1995.
[7] L. Baird and A.G. Moore. Gradient descent for general reinforcement learning. In
Advances in Neural Information processing Systems 11, 1999.
[8] A.G. Barto, S.J. Bradtke, and S.P. Singh. Learning to act using realtime dynamic
programming. Artiﬁcial Intelligence, 1995.
[9] A.G. Barto, R.S. Sutton, and C.W. Anderson. Neuronlike adaptive elements that can
solve diﬃcult learning control problems. IEEE Transactions on Systems, Man, and
Cybernetics, 1983.
[10] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[11] D.P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models.
PrenticeHall, 1987.
[12] D.P. Bertsekas and J.N. Tsitsiklis. NeuroDynamic Programming. Athena Scientiﬁc,
Belmont, Massachusetts, 1997.
[13] J.A. Boyan. Leastsquares temporal diﬀerence learning. In Machine Learning: Pro
ceedings of the Sixteenth International Conference (ICML), 1999.
[14] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approx
imating the value function. In Advances in Neural Information Processing Systems 7
(NIPS), 1995.
[15] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in
system and control theory. S.I.A.M., 1994.
[16] S.J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In Ad
vances in Neural Information Processing Systems, 1993.
[17] S.J. Bradtke. Incremental dynamic programming for online adaptive optimal control.
PhD thesis, University of Massachusetts, 1994.
[18] S.J. Bradtke and A.G Barto. Linear least–squares algorithms for temporal diﬀerence
learning. Machine Learning, 1996.
[19] S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using
policy iteration. Technical Report CMPSCI 9449, University of Massachusetts, 1994.
[20] S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using
policy iteration. In Proceedings of the American Control Conference, 1994.
[21] M. Campi and P.R. Kumar. Adaptive linear quadratic gaussian control: The cost
biased approach revisited. SIAM Journal on Control and Optimization, 1998.
[22] H.F. Chen and L. Guo. Identiﬁcation and stochastic adaptive control. Birkhuser, 1991.
[23] P. Cichosz. Truncated temporal difference: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research, 1995.
[24] P. Dayan. The Convergence of TD(λ) for general λ. Machine Learning, 1992.
[25] P. Dayan and T.J. Sejnowski. TD(λ) converges with probability 1. Machine Learning,
1994.
[26] P.H. Eaton, D.V. Prokhorov, and D.C. Wunsch II. Neurocontroller alternatives for "fuzzy" ball-and-beam systems with nonuniform nonlinear friction. IEEE Transactions on Neural Networks, 2000.
[27] C.N. Fiechter. PAC adaptive control of linear systems. In Proceedings of the Tenth
Annual Conference on Computational Learning Theory, 1997.
[28] G.J. Gordon. Stable function approximation in dynamic programming. In Machine
Learning: Proceedings of the Twelfth International Conference, 1995.
[29] S.H.G. ten Hagen and B.J.A. Kröse. Generalizing in TD(λ) learning. In Proceedings of the third Joint Conference of Information Sciences, Durham, NC, USA, volume 2, 1997.
[30] S.H.G. ten Hagen and B.J.A. Kröse. A short introduction to reinforcement learning. In Proc. of the 7th Belgian-Dutch Conf. on Machine Learning, 1997.
[31] S.H.G. ten Hagen and B.J.A. Kröse. Towards a reactive critic. In Proc. of the 7th Belgian-Dutch Conf. on Machine Learning, 1997.
[32] S.H.G. ten Hagen and B.J.A. Kröse. Linear quadratic regulation using reinforcement learning. In Proc. of the 8th Belgian-Dutch Conf. on Machine Learning, 1998.
[33] S.H.G. ten Hagen and B.J.A. Kröse. Pseudo-parametric Q-learning using feedforward neural networks. In ICANN'98, Proceedings of the International Conference on Artificial Neural Networks. Springer-Verlag, 1998.
[34] S.H.G. ten Hagen and B.J.A. Kröse. Reinforcement learning for realistic manufacturing processes. In CONALD 98, Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, 1998.
[35] S.H.G. ten Hagen, D. l'Ecluse, and B.J.A. Kröse. Q-learning for mobile robot control. In BNAIC'99, Proc. of the 11th Belgium-Netherlands Conference on Artificial Intelligence, 1999.
[36] K.J. Hunt, D. Sbarbaro, R.
˙
Zbikowski, and P.J. Gawthrop. Neural networks for control
systems—a survey. Automatica, 1992.
[37] A. Isidori. Nonlinear control systems: An introduction. Springer, 1989.
[38] T. Jaakkola, M.I. Jordan, and S.P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Technical Report 9307, MIT Computational Cognitive Science, 1993.
[39] T. Jaakkola, M.I. Jordan, and S.P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 1994.
[40] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.
[41] H. Kimura and S. Kobayashi. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In Proceedings of the 15th International Conference on Machine Learning, 1998.
[42] M.V. Kothare, V. Balakrishnan, and M. Morari. Robust constrained model predictive control using linear matrix inequalities. Automatica, 1996.
[43] B.J.A. Kröse and J.W.M. van Dam. Adaptive state space quantisation for reinforcement learning of collision-free navigation. In Proceedings of the 1992 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Piscataway, NJ, 1992.
[44] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, 1997.
[45] L.J. Lin and T.M. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CMU-CS-92-138, School of Computer Science, Carnegie Mellon University, 1992.
[46] L. Ljung. System Identification: Theory for the User. Prentice Hall, 1987.
[47] S. Miller and R.J. Williams. Applications of Artificial Neural Networks, chapter Temporal Difference Learning: A Chemical Process Control Application. Kluwer, 1995.
[48] T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[49] A.W. Moore and C.G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 1993.
[50] R. Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. In Proceedings of the International Joint Conference on Artificial Intelligence, 1997.
[51] R. Munos. Finite-element methods with local triangulation refinement for continuous reinforcement learning problems. In European Conference on Machine Learning, 1997.
[52] R. Munos. A study of reinforcement learning in the continuous case by means of viscosity solutions. Machine Learning, 2000.
[53] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1990.
[54] H. Nijmeijer and A. van der Schaft. Nonlinear dynamical control systems. Springer, 1990.
[55] J. Peng and R.J. Williams. Incremental multi-step Q-learning. Machine Learning, 1996.
[56] D.V. Prokhorov and D.C. Wunsch II. Adaptive critic designs. IEEE Transactions on Neural Networks, 1997.
[57] M.L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley, 1994.
[58] S.J. Qin and T. Badgwell. An overview of industrial model predictive control technology. In AIChE Symposium Series 316, 1996.
[59] M. Riedmiller. Application of sequential reinforcement learning to control dynamical systems. In Proceedings of the IEEE International Conference on Neural Networks, 1996.
[60] M. Riedmiller. Concepts and facilities of a neural reinforcement learning control architecture for technical process control. Neural Computing and Applications, 1999.
[61] G. Schram, B.J.A. Kröse, R. Babuska, and A.J. Krijgsman. Neurocontrol by reinforcement learning. Journal A (Journal on Automatic Control), Special Issue on Neurocontrol, 37, 1996.
[62] S.P. Singh and R.S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 1996.
[63] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 1995.
[64] P.P. van der Smagt, F.C.A. Groen, and B.J.A. Kröse. Robot hand-eye coordination using neural networks. Technical Report CS-93-10, Dept. of Computer Systems, University of Amsterdam, 1993.
[65] D.A. Sofge and D.A. White. Neural network based process optimization and control. In Proceedings of the 29th Conference on Decision and Control, 1990.
[66] D.A. Sofge and D.A. White. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, chapter Applied learning: optimal control for manufacturing. Van Nostrand Reinhold, 1992.
[67] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, Special Issue on Reinforcement Learning, 1996.
[68] R.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
[69] R.S. Sutton. Dyna, an integrated architecture for learning, planning and reacting. In Working Notes of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures and SIGART Bulletin 2, 1991.
[70] R.S. Sutton. Generalizing in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, 1996.
[71] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[72] R.S. Sutton, D. McAllester, S.P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 2000.
[73] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 1985.
[74] G. Tesauro. Practical issues in temporal difference learning. In Advances in Neural Information Processing Systems 4, 1992.
[75] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 1995.
[76] S.B. Thrun. The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, 1992.
[77] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
[78] E. Tzirkel-Hancock and F. Fallside. A direct control method for a class of nonlinear systems using neural networks. Technical Report CUED/F-INFENG/TR65, Cambridge University, 1991.
[79] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998.
[80] C.J.C.H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.
[81] C.J.C.H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 1992.
[82] P.J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 1990.
[83] P.J. Werbos. Approximate dynamic programming for real-time control and neural modeling. In D.A. White and D.A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, 1992.
[84] M.A. Wiering and J.H. Schmidhuber. Fast online Q(λ). Machine Learning, 1998.
[85] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[86] R. Żbikowski, K.J. Hunt, A. Dzieliński, R. Murray-Smith, and P.J. Gawthrop. A review of advances in neural adaptive control systems. Technical Report ESPRIT III Project 8039: NACT, University of Glasgow/Daimler-Benz AG, 1994.
Summary
The topic of this thesis is the use of Reinforcement Learning (RL) for the control of real systems. In RL the controller is optimized on the basis of a scalar evaluation called the reinforcement. For systems with discrete states and actions there is a solid theoretical basis, and convergence to the optimal controller can be proven. Real systems, however, often have continuous states and control actions. For these systems the consequences of applying RL are less clear. To enhance the applicability of RL for such systems, more understanding is required.
One problem when RL is applied to real continuous control tasks is that it is no longer possible to guarantee that the closed loop remains stable throughout the learning process. This makes it a dangerous way to obtain a controller, especially since a random process called "exploration" has to be included during training. Another problem is that most RL algorithms train very slowly, which makes the application of RL to a slow process very time consuming.
When the system is linear and the costs are given by a quadratic function of the state and control action, Linear Quadratic Regulation (LQR) can be applied. This leads to the optimal linear feedback function. System Identification (SI) can be used in case the system is unknown. The LQR task can also be solved by Q-Learning, a model-free RL approach in which not the system but its cost function is modeled explicitly. We called this method LQRQL. To find the optimal feedback it is necessary to use sufficient exploration. We expressed the performance of the resulting feedback as a function of the amount of exploration and noise. Based on this we derived that a guaranteed improvement of the performance requires more exploration than the amount of noise in the system. We also derived that the LQRQL approach requires more exploration than the SI approach.
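The LQRQL procedure summarized above can be sketched in a few lines of code. The sketch below is a minimal, hypothetical illustration for a scalar linear system, not the thesis's exact algorithm; all numerical values (system parameters, cost weights, noise levels) are assumptions chosen for the example. The parameters of a quadratic Q-function are estimated by least squares from an explored trajectory, after which an improved linear feedback is read off from the estimates.

```python
# Minimal LQRQL-style sketch for a scalar linear system (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.9, 1.0                       # system: x' = a*x + b*u + noise
q, r = 1.0, 1.0                       # cost:   c(x, u) = q*x^2 + r*u^2
L = 0.0                               # initial (suboptimal) feedback u = L*x
sigma_noise, sigma_expl = 0.01, 0.3   # exploration should dominate the noise

def features(x, u):
    # Quadratic Q-model: Q(x, u) = hxx*x^2 + 2*hxu*x*u + huu*u^2.
    return np.stack([x**2, 2 * x * u, u**2], axis=1)

for _ in range(4):                    # a few policy-iteration sweeps
    # Collect a trajectory while exploring around the current feedback.
    X, U, Xn = [], [], []
    x = 1.0
    for _ in range(200):
        u = L * x + sigma_expl * rng.standard_normal()
        xn = a * x + b * u + sigma_noise * rng.standard_normal()
        X.append(x); U.append(u); Xn.append(xn)
        x = xn
    X, U, Xn = map(np.array, (X, U, Xn))

    # The Bellman identity Q(x,u) - Q(x', L*x') = c(x,u) is linear in the
    # Q-parameters, so they follow from a least-squares fit.
    Phi = features(X, U) - features(Xn, L * Xn)
    cost = q * X**2 + r * U**2
    hxx, hxu, huu = np.linalg.lstsq(Phi, cost, rcond=None)[0]

    # Policy improvement: argmin over u of Q(x, u) gives a new linear feedback.
    L = -hxu / huu

print(L)
```

The sketch also illustrates the exploration condition mentioned above: if sigma_expl is made much smaller than sigma_noise, the least-squares estimate of huu becomes unreliable and an improvement of L is no longer guaranteed.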
Most practical systems are nonlinear, so nonlinear feedback functions are required. Existing techniques for nonlinear systems are often based on local linear approximations. The linear feedbacks of the LQRQL and SI approaches are not always able to give a good local linear approximation of a nonlinear feedback function. It is, however, possible to extend the LQRQL approach: the extended LQRQL approach estimates more parameters and results in a linear feedback plus a constant. In an experiment on a nonlinear system we showed that the extended LQRQL approach gives a better local approximation of the optimal nonlinear feedback function.
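The extension can be illustrated with a small simulation (all numerical values below are illustrative assumptions, not taken from the thesis). For a scalar system with a constant drift term, extending the quadratic Q-model with linear and constant terms yields, after a least-squares fit and a greedy improvement step, exactly a linear feedback plus a constant. A discount factor is used here to keep the Q-values finite despite the drift.

```python
# Hypothetical sketch of the extended-LQRQL idea on a scalar drifting system.
import numpy as np

rng = np.random.default_rng(1)
a, b, d = 0.9, 1.0, 0.5               # system: x' = a*x + b*u + d + noise
q, r = 1.0, 1.0                       # cost:   c(x, u) = q*x^2 + r*u^2
gamma = 0.95                          # discount keeps Q finite despite drift d
L, l = 0.0, 0.0                       # affine feedback u = L*x + l
sigma_noise, sigma_expl = 0.01, 0.3

def features(x, u):
    # Extended Q-model: quadratic terms plus linear terms and a constant.
    return np.stack([x**2, 2 * x * u, u**2, 2 * x, 2 * u,
                     np.ones_like(x)], axis=1)

for _ in range(4):
    X, U, Xn = [], [], []
    x = 0.0
    for _ in range(300):
        u = L * x + l + sigma_expl * rng.standard_normal()
        xn = a * x + b * u + d + sigma_noise * rng.standard_normal()
        X.append(x); U.append(u); Xn.append(xn)
        x = xn
    X, U, Xn = map(np.array, (X, U, Xn))

    # Discounted Bellman residual, linear in the extended parameter vector.
    Phi = features(X, U) - gamma * features(Xn, L * Xn + l)
    cost = q * X**2 + r * U**2
    hxx, hxu, huu, gx, gu, c0 = np.linalg.lstsq(Phi, cost, rcond=None)[0]

    # Minimizing Q over u now gives a linear feedback plus a constant.
    L, l = -hxu / huu, -gu / huu

print(L, l)
```

The resulting constant l is negative here, counteracting the positive drift d; a purely linear feedback would have to leave this offset uncompensated.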
For nonlinear systems, function approximators can be used to approximate the Q-function. We showed that it is possible to combine the LQRQL approach with a feedforward neural network approximation of the Q-function. The nonlinear feedback function can be determined directly from the approximated Q-function, by first computing a linear feedback for which a nonlinear correction can then be determined. This results in a global nonlinear feedback function. Experiments have shown that this approach requires less training data than an approach based on partitioning of the state space, where each partition approximates a linear feedback.
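The step from an approximated Q-function to a feedback can be illustrated with a toy example. The Q-function below is a hand-made stand-in for a trained network (its particular form is purely an assumption for illustration); the nonlinear feedback is obtained by numerically minimizing Q over the control action for each state, here with a simple grid search.

```python
# Toy illustration: deriving a nonlinear feedback from an approximated
# Q-function. q_approx is a hypothetical stand-in for a trained network.
import numpy as np

def q_approx(x, u):
    # Illustrative Q-approximation; its minimizer over u is u = -0.5*sin(x),
    # so the induced feedback is nonlinear in the state.
    return (u + 0.5 * np.sin(x))**2 + x**2

def feedback(x, u_grid=np.linspace(-2.0, 2.0, 401)):
    # Nonlinear feedback: pick the action that minimizes the approximated
    # Q-function in state x (grid search; a gradient method would also work).
    return u_grid[np.argmin(q_approx(x, u_grid))]

for x in (-1.0, 0.0, 1.0):
    print(x, feedback(x))
```

For this toy Q-function the extracted feedback closely follows u = -0.5*sin(x), up to the resolution of the action grid.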
Reinforcement Learning can thus be used as a method to find a controller for real systems. The LQRQL approach can be applied to an unknown linear system, and the Neural Q-Learning approach can be used when the system is nonlinear.
Samenvatting
The subject of this thesis is the use of Reinforcement Learning (RL) for the control of real systems. In RL the controller is optimized on the basis of a scalar evaluation called the reinforcement. For systems with discrete states and control actions a solid theoretical basis has been developed, and convergence to the optimal controller can be guaranteed. In practice, systems often have continuous states and control actions. For these systems the consequences of applying RL are less clear. To increase the applicability of RL for these systems, more insight is required.
One problem that arises when RL is applied to real continuous systems is that it is no longer possible to guarantee that the closed loop remains stable during learning. This makes it a dangerous way to obtain a controller, especially because a random process called "exploration" has to be added during learning. Another problem is that RL algorithms learn very slowly. This means that for a slow system the use of RL is very time consuming.
When a system is linear and the costs are given by a quadratic function of the state and control action, Linear Quadratic Regulation (LQR) can be used. This yields the optimal linear feedback function. System Identification (SI) can be used when the system is unknown. The LQR task can also be solved with Q-Learning, a model-free RL approach in which not the system but the cost function is modeled explicitly. We have called this method LQRQL. To find the optimal feedback it is necessary to use sufficient exploration. We have described the quality of the resulting feedback as a function of the amount of exploration and the amount of system noise. From this we derived that a guaranteed improvement of the quality requires more exploration than there is noise in the system. Moreover, we showed that the LQRQL method requires more exploration than the SI approach.
Most practical systems are nonlinear, for which nonlinear feedback functions are needed. Existing techniques for nonlinear systems are often based on locally linear approximations. The linear feedbacks of the LQRQL and SI approaches are not always able to form a good local approximation of the optimal nonlinear feedback function. It is possible to extend the LQRQL approach: in the extended LQRQL approach more parameters are estimated, resulting in a linear feedback plus a constant. In an experiment with a nonlinear system we showed that the extended LQRQL approach yields a better locally linear approximation of the optimal nonlinear feedback function.
For nonlinear systems, general function approximators can be used to represent the Q-function. We showed that it is possible to combine LQRQL with a feedforward neural network approximation of the Q-function. The nonlinear feedback function can be determined directly from the approximated Q-function, by first computing a linear feedback for which a nonlinear correction can then be determined. The result is a global nonlinear function. The experiments showed that this approach requires fewer training examples than an approach based on partitioning the state space, in which a separate linear feedback is determined for each partition.
Reinforcement Learning can be used as a methodology for finding controllers for real systems. For an unknown linear system the LQRQL method can be used. The Neural Q-Learning approach can be used when the system is nonlinear.
Acknowledgments
The first people to thank are Frans Groen and Ben Kröse. Frans I want to thank for his clarity in feedback and for giving hope that this thesis would eventually be finished. Ben I want to thank for remaining a nice guy, in spite of me being sometimes a little bit stubborn, sloppy and annoying.
The project members from Delft, Bart Wams and Ton van den Boom, were very helpful with the control theoretic aspects of this thesis. The user group members of our STW project also contributed to this thesis by indicating the importance of reliability for the practical applicability of control approaches. The work in chapter 3 was possible because Walter Hoffmann explained how the QR-decomposition can make things easier. The work on the real robot would not have been possible without the help of Edwin Steffens.
I was lucky to have very entertaining roommates. Joris van Dam was always loud and funny, Anuj Dev helped me improve my ping-pong skills, Nikos Massios never was angry about my bicycle helmet jokes and Joris Portegies Zwart never stopped bragging about his favorite operating system.
I also had colleagues with whom I enjoyed spending time. Nikos Vlassis, which is originally a Greek word, Leo Dorst, who often tried to be funnier than me, and Rien van Leeuwen, Roland Bunschoten, Sjaak Verbeek, Bas Terwijn and Tim Bouma made sure that conversations during lunch were never too serious. I have to thank the many students as well, for providing enough distractions in the last few years. Especially Danno l'Ecluse and Kirsten ten Tusscher provided sufficient opportunities for not working on this thesis.
Last but not least, I want to thank my mother for all the lasagnas with vitamins.
we try to control a system. So by applying each time the right control actions we try to change the state to Disturbance System Parameters Control Action State ? Controller Figure 1. The properties of the system that are not constant form the state of the system. Suppose we want to change the temperature in a room.1 shows the conﬁguration of a control task. The system represents the physical system we have. Because we want to control the system. then the room is the system. we also should be able to inﬂuence the state changes. 1 .Chapter 1 Introduction In everyday life we control many systems. The control conﬁguration. The height of the room is such a parameter. The inﬂuence we can have on the state changes are called the control actions. Figure 1. Although these activities are quite diﬀerent they have in common that we try to inﬂuence some physical properties of a system on the basis of certain measurements.1. The properties of the system that do not change form the parameters of the system. The state can change. In other words. We do this when riding a bicycle. depending on the current state value and the parameters of the system. changing the temperature in a room or driving in a city to a certain destination. So in our example the temperature is the state of the system.
In this framework the real physical system is described by an abstract representation called the model. So if the temperature in our room drops because a window is opened. a controller has to be designed. in which the consequences of actions are associated for each state and action combination.1 indicates a change in system or state change that can not be controlled. Machine Learning (ML) is a subﬁeld within Artiﬁcial Intelligence research. Finally we will present our problem statement and give an overview of the remainder of this thesis. Then for each state that action. the model should indicate the inﬂuence of the control action on the state change. In order to ﬁnd the controller. it is also possible to design an algorithm that computes the appropriate control actions given the state. After the design one checks whether the design speciﬁcations are met. Before designing a controller. INTRODUCTION some desired value. If the state is . 1. In our example this means we have a heater that can be switched on and oﬀ. In this way the consequence of actions can be associated with the states in which these actions were applied. The control architecture receives a scalar evaluation.2 CHAPTER 1. A diﬀerent approach to obtain a controller is to design a control architecture in which the controller itself learns how to control the system. called the reinforcement. Instead of applying all actions “by hand”. This design procedure provides a recipe to construct a controller. a particular kind of RL. the heater should be switched on. In this chapter we will ﬁrst formalize our control task and give an overview of diﬀerent controller design approaches. in which inference mechanisms are studied. Based on these experiences the controller should infer what for the control task the most appropriate action is.1 Control If the control of a system has to be automated. 
In order to do this the architecture should incorporate a mechanism to interact with the system and store the gained experiences. one has to specify the desired behavior of the system and controller. The control actions have to be chosen such that they reduce the eﬀect of the disturbance. This algorithm is called the controller. In this thesis we will focus on QLearning. for which the consequences are most desirable. The disturbance in ﬁgure 1. This seems a very natural approach. Then based on the desired behavior and all available information of the system. Then we will give a short description of RL. the controller has to decide whether to switch the heater on or oﬀ. can be selected. Control theory provides the mathematical framework in which the controller design task can be formulated. for each action applied to the system. This normally is the task of control engineers. a controller is designed according to a certain design procedure. We present some problems and show how they are solved. resembling the way we learn how to control. So based on the measurement of the temperature. In this thesis we will only consider time discrete systems. From a ML point of view the inference mechanism in the above described control architecture is called Reinforcement Learning (RL). The function describing the state change is called the state transition.
1) represents the time step. In (1. There are two main design principles: • Optimality A design method based on optimal control optimizes some criterion. This criterion For continuous time systems the model represents the change of the state. So v can also be used to take into account the consequences of the unmodeled dynamics of the system. This means that the controller is formed by a functional mapping from state to control action.1).1) are the measurements. that does not necessarily have to be complete.1) the vector v can represent external disturbances that inﬂuence the state transition. Then the model in (1. Note that the model in (1. (1. The function g will be called the feedback.1) can be used to describe the change of the new state. This means that the desired behavior of the system plus controller should be speciﬁed. v k ).1). rather then the new state value. Throughout this thesis we will assume a model like (1. With these assumptions.1). The f forms a functional mapping from IRnx +nu +nv → IRnx that represents the state transition. so u = g(x). 1. designing the controller becomes a matter of ﬁnding the appropriate g.1) This is the model of the system. This means that the next state xk+1 does not depend on past states and actions. the state transition is given by:1 xk+1 = f (xk . The controller has not been described yet. 1 . CONTROL 3 represented as vector x ∈ IRnx . Still measurements are made because they are needed to control the system. The parameters of the system are included in f . uk . The k in (1. Since we will only consider models like (1. an extra function has to be speciﬁed that maps x.1. u and v to the measurements. This leaves as only possible candidate: a state feedback controller. But the model forms an abstract approximation of the system. we have to assume that the state value is available to the controller.1) represents a Markov process. Also we will assume that the controller only uses the present state value.1).1. 
In the design of the controller v is regarded as a stochastic variable. This is possible by combining the past states that inﬂuence the state transition and use this as the new state. In a real system the value of the state is not always available through measurements by sensors. the control action as vector u ∈ IRnu and the noise as vector v ∈ IRnv . These measurements form an extra output of the system. We will not include it in the model and so we will assume that the state values are measured directly.1. This is a reasonable assumption because all higher order Markov processes can be represented as zeroorder Markov processes as in (1.1 Designing the state feedback controller The goal of the design is to ﬁnd the best g to control the system. Another issue not reﬂected in the model of (1. To take this into account in the model in (1.
This criterion gives a scalar evaluation of all state values and control actions. This criterion usually represents the cost of being at a certain state or applying certain control actions. Traditionally the evaluation represents costs, so the task is to find the feedback that leads to the lowest costs. Since the system is dynamic, the future consequences of an action on the costs should also be taken into account.

• Stability: A design method based on stability prevents the system from becoming unstable. The state value of an unstable system can grow beyond any bounds and potentially damage the real system. When a stability based design method is applied, a quantification of the stability can be used to select the parameters of the feedback.

Note that these principles do not exclude each other. For an unstable system the costs would be very high, so an optimality based design would also try to stabilize the system. And if the controller stabilizes the system, then it is still possible to choose the controller that performs best given an optimal control criterion. The main difference between these design principles is that optimality based design aims at the best possible behavior, while a stability based design aims at the prevention of the worst possible behavior.

The two design principles still do not indicate how the feedback is computed. To explain this it is convenient to first introduce an important class of systems: the linear systems. The model of a linear system is given by (1.1) where f represents a linear function of the state, the control action and the noise:

x_{k+1} = Ax_k + Bu_k + v_k.    (1.2)

The matrices A and B have the proper dimensions and represent the parameters of the linear system. Suppose that the feedback is also linear:

u = Lx,    (1.3)

where L ∈ IR^{nu×nx} represents the linear feedback. Applying (1.3) to (1.2) gives (for simplicity we ignore the noise)

x_{k+1} = (A + BL)x_k = Dx_k,    (1.4)

where D represents the parameters of the closed loop. This makes it possible to describe the state transition based on the closed loop. A state value in the future can be computed by repeatedly multiplying with the matrix D, so x_{k+N} = D^N x_k. It is now easy to see that the eigenvalues of D give an indication of the stability: if their magnitudes are larger than one the system is unstable, and if they are smaller than one the system is stable. The closer the eigenvalues of D are to zero, the faster the state value will approach zero. If we include the noise v_k, it will disturb the state transition. The eigenvalues of D then determine how many time steps it takes before the effect of the noise at time step k can be neglected.
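This eigenvalue test is easy to check numerically. In the sketch below the matrices A and B and the feedback L are invented for illustration; the closed loop D = A + BL is formed and all eigenvalue magnitudes are compared against one:

```python
import numpy as np

# Illustrative 2-state, 1-input linear system (matrices are invented for the example).
A = np.array([[1.1, 0.4],
              [0.0, 0.9]])   # open loop has an eigenvalue 1.1 > 1, so it is unstable
B = np.array([[0.0],
              [1.0]])
L = np.array([[-0.5, -0.6]])  # a stabilizing linear feedback u = L x

D = A + B @ L                 # closed-loop matrix, as in (1.4)

def is_stable(D):
    """Discrete-time stability: all eigenvalues of D lie inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(D)) < 1.0))

print(is_stable(A))  # open loop:  False
print(is_stable(D))  # closed loop: True
```

With these numbers the closed-loop eigenvalues are 0.7 ± 0.2i, with magnitude about 0.73, so the feedback L stabilizes the system.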
For an optimality based design method, first an evaluation criterion has to be chosen that indicates the desired behavior of the system. Different methods are available to find the best controller:

• Linear Quadratic Regulation: In case the system is linear and the costs are given by a quadratic function of the state and control action, it can be shown that the optimal control action is a linear function of the state [11][3]. The total future cost for each state forms a quadratic function of the state value. The parameters of this function can be found by solving an algebraic equation, and the optimal linear feedback then follows from the system and the quadratic function. In chapter 3 we will use Linear Quadratic Regulation.

• Dynamic Programming: The state and action space can be quantized, and then Dynamic Programming can be used [10][11][57]. Starting at a finite point in the future, the optimal control sequence is computed backwards using the model of the system. The system does not have to be linear and the costs do not have to be given by a quadratic function, but taking all quantized states and actions into account makes this a computationally expensive method. Dynamic Programming will be used in chapter 2 to introduce reinforcement learning.

• Model Based Predictive Control: Using the model of the system, the future of the system can be predicted to give an indication of the future costs. The optimal control sequence for a finite horizon is computed, and the first action from this sequence is applied to the system. In the next time step the whole procedure is repeated. For linear systems with constraints there are several optimization methods available to efficiently compute the optimal action at each time step [58][42].

Controller design methods are best understood when applied to linear systems. The main reason for this is that models of linear systems are mathematically convenient and the properties of these systems are well defined. The linear system in (1.2) was introduced to simplify the explanation. Since not all systems are linear, controller design approaches for nonlinear systems as in (1.1) were developed; for an overview of these methods see [37]. An alternative approach is to formalize the representation of the nonlinear system in such a way that many well-defined properties of linear systems can be used to describe the properties of nonlinear systems [54]. These properties do not have to be globally valid, so the state values for which they hold should also be given. This makes the optimization less trivial.

1.1.2 Unknown systems

The control design methods described above can only be applied if the model of the system is known exactly. This is generally not the case, for example when not the entire system is modeled or the specific values of the parameters are not known. So system identification methods were developed to approximate the model of the system based on measurements.
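As a minimal sketch of such an identification step, the following estimates A and B of the linear model (1.2) by least squares on simulated measurements. The true system matrices, noise level and amount of data are invented for illustration:

```python
import numpy as np

# Identify A and B of x_{k+1} = A x_k + B u_k + v_k from data by linear least squares.
rng = np.random.default_rng(0)
A_true = np.array([[0.8, 0.1], [0.0, 0.9]])   # illustrative "real" system
B_true = np.array([[0.0], [0.5]])

# Generate data with sufficiently exciting (random) control actions.
X, U, X_next = [], [], []
x = np.zeros(2)
for k in range(200):
    u = rng.normal(size=1)                       # random excitation
    x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); X_next.append(x_next)
    x = x_next

# Minimize the quadratic error ||X_next - [X U] Theta||^2; Theta stacks A^T and B^T.
Phi = np.hstack([np.array(X), np.array(U)])
Theta, *_ = np.linalg.lstsq(Phi, np.array(X_next), rcond=None)
A_est, B_est = Theta[:2].T, Theta[2:].T
print(np.round(A_est, 2))
print(np.round(B_est, 2))
```

With enough excitation the estimates come close to the true parameters; without the random component in u, the regression matrix Phi would become ill-conditioned and the parameters could not be recovered.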
The measurements are generated by applying control actions to the system. The control actions have to be chosen such that they sufficiently excite the system. In that case the properties of the state transition function can be estimated from the data set containing all measurements and control actions. It is also possible to do this online, for example when the deviation of the model from the system becomes too large. This can be applied when the system is varying in time or when the model forms a simplified local description of the system. There are different possibilities for identifying the model of the system:

• Linear models: If the system is assumed to be linear, the function describing the state transition is given. Only the parameters of the model have to be estimated. The identification methods for linear systems are well understood [46][22]. The parameters are usually estimated using a (recursive) linear least squares estimation method. This means that the quadratic error between the real next state value and the one given by the model is minimized.

• Nonlinear models: When the system is nonlinear it is not possible to use a linear model. This means the parameters of a nonlinear model have to be found. If the function class of the system is unknown, general function approximators like neural networks can be used [53][36][86][63], which are trained using data from the system. The parameters of the model are found by using a supervised learning method to train the networks.

• Local models: One global model can be built up from many smaller local models. These local models are only valid in a restricted part of the state space, or their contribution to the output of the global model is restricted. The local models can be linear; by combining more than one linear model, a nonlinear function can be approximated by the global model. In Gain Scheduling (GS) [3] the output of the global model is computed based on only one local model, and the value of the state determines which local model to use. In locally weighted representations [4][5] all local models can contribute to the output of the global model. The contribution of the outputs of the local models depends on a weighting, and the weighting depends on the value of the state. An alternative approach is to use fuzzy logic, where reasoning determines the local models to use. In case of overlapping membership functions, the combination of local models is used to compute the control action [73]. If the system is unknown, the membership functions are hard to determine. In that case it is possible to use adaptive membership functions.

Once the model of the system is found, it can be used to design a controller. The identification is used to find the parameters of the model, and this model is used to tune the parameters of the controller. This is called indirect adaptive control.
If we adapt the controller, it does not necessarily have to be based on the estimated model of the system. In this case different possibilities exist:

• Self tuning regulator: The self tuning regulator as described in [3] directly estimates the parameters required to adjust the feedback. This is called direct adaptive control.

• Model modification: It is possible, in case of a nonlinear system, that an approximator is used to change the current model into a model that makes control easier. An example is using a feed forward network to model the inverse kinematics of a robot as in [64]. Another example is feedback linearization, where a nonlinear feedback function is adapted to compensate for the nonlinearity in the system [78].

It is also possible that the design specification can be estimated directly from the measurements. The feedback function is then optimized during the interaction, using an evaluation from the system. This can be based on the difference between the state value and some reference signal.

1.2 Reinforcement Learning

Reinforcement Learning (RL) [40][12][71][30] from a Machine Learning [48] point of view is a collection of algorithms that can be used to optimize a control task. Initially it was presented as a "trial and error" method to improve the interaction with a dynamical system [9]. Later it has been established that it can also be regarded as a heuristic kind of Dynamic Programming (DP) [80][83][8]. The objective is to find a policy, a function that maps the states of the system to control actions, that optimizes a performance criterion. Compared with control, we can say that the policy represents the feedback function. Therefore we can also regard RL as a form of adaptive optimal control.

During the interaction each state transition is evaluated, resulting in a scalar reinforcement. The performance criterion to optimize is usually given by the expected sum of all reinforcements. This criterion has to be maximized if the reinforcements represent rewards, and minimized if they represent costs. In RL the optimization is performed by first estimating the value function. The value function represents the expected sum of future reinforcements for each state. The value function can be approximated by using Temporal Difference learning. In this learning rule the approximated value of a state is updated based on the difference with the approximated value of the next state. This difference should agree with the reinforcement received during that state transition.

When the (approximated) future performance is known for each state, then given the present state it is possible to select the "preferred" next state. But this still does not indicate what action should be taken. With a model of the system, the action that has the highest probability of bringing the system to the "preferred" next state can be selected. Because such a model is not always available, model-free RL techniques were developed.
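As a preview of the model-free techniques discussed in the following chapters, the sketch below applies tabular Q-learning with ε-greedy exploration to a toy deterministic chain task. All states, costs and learning parameters here are invented for illustration:

```python
import random

# Toy deterministic chain (illustrative): states 0..4, goal state 4.
# Actions: 0 = left, 1 = right. Each transition costs 1 until the goal is reached,
# so the optimal policy in every non-goal state is to move right.
N_STATES, GOAL = 5, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    cost = 0.0 if s_next == GOAL else 1.0
    return s_next, cost

Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2
random.seed(0)

for episode in range(500):
    s = 0
    while s != GOAL:
        # Exploration: with probability epsilon take a random "trial" action,
        # otherwise take the greedy (lowest expected cost) action.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = min((0, 1), key=lambda b: Q[s][b])
        s_next, cost = step(s, a)
        # Q-learning update: costs are minimized, so the target uses min over actions.
        Q[s][a] += alpha * (cost + gamma * min(Q[s_next]) - Q[s][a])
        s = s_next

policy = [min((0, 1), key=lambda b: Q[s][b]) for s in range(N_STATES)]
print(policy[:GOAL])  # greedy action per non-goal state
```

Note that only the observed transitions and costs are used; the function `step` plays the role of the real system and is never inspected by the learner.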
Q-Learning is model-free RL [80][81]. The idea is that the sum of future reinforcements can be approximated as a function of the state and the action. This function is called the Q-function, and it has a value for each state and action combination. Approximations are made for the visited states and the actions taken in these states. Only the measurements, generated by "trial and error", are used to estimate the performance. The Q-function can be used to select the action for which the value of the Q-function is optimal, given the present state. This model-free optimization is only possible because the optimization is performed online, by interacting with the system. Therefore we can see Q-Learning as a direct adaptive control scheme.

A deterministic policy, which always maps a state to the same action, cannot be used to generate the data for the approximation. So the actions taken during the interaction should not only depend on the existing policy, but also on a random "trial" process. This process is referred to as the exploration. Exploration is of utmost importance because it determines the search space in which the optimal policy is searched for. The use of the approximation to choose the best action can only rely on the actions that are tried often enough. If the search space is large enough, it can be proven that RL converges to the optimal solution [68][25][24][39]. But these proofs rely on perfect backups of visited states and applied actions in lookup tables. So guarantees for convergence to the optimal solution can only be given when there are a discrete number of states and actions. That the convergence proofs rely on a discrete state and action space does not mean that RL cannot be applied to systems with a continuous state and action space as in (1.1).

RL has been applied to a variety of control tasks. In [65] RL is used to obtain a controller for a manufacturing process for thermoplastic structures, followed by a molding process and chip fabrication in [66]. A controller for satellite positioning is found in [61]. In [2] and [59] a thermostat controller is obtained. The use of RL for chemical plants and reactors is described in [47][59] and [60]. In [26] RL is used to control a "fuzzy" ball-and-beam system. The approaches used for these applications were primarily based on heuristics, so there hardly is any theoretic understanding of the closed loop performance of these approaches. These approaches were all, except for [26], performed on simulated systems. In most cases the resulting controllers were used afterwards to control the real system. Still, the successful results suggest that RL can be used to find good controllers.

1.3 Problem Statement

The reinforcement learning approach has been successfully applied as a control design approach for finding controllers. In the previous section we gave some examples. The advantage of the reinforcement learning approach is that it does not require any information about the system. When these approaches are applied in simulation, this advantage is not exploited: the perfect model of the system is still required.

There are two reasons why most RL approaches are applied in simulation and not directly on the real system. The first reason is that RL learning can be very slow, because the process of interacting with the system has to be repeated several times.
In simulation, the duration of one time step is given by the time it takes to simulate one time step. If learning takes too much time, a faster computer can be used to speed up the learning process. The duration of one time step for the real system is given by the system itself, and for a slow process like a chemical plant learning will take a very long time. The data is obtained by controlling the system, which means that the sample time of the system determines how long it takes to generate one data point. For a large sample time this means that generating a lot of data takes a long time, during which the system cannot be used.

A more serious problem with RL applied to real systems is that most algorithms are based on heuristics. This implies that the outcome will not always be completely understood. For a real system it is very important that properties, like stability, can be guaranteed to hold. Instability may cause damage to the system. This means that without a better understanding, RL algorithms cannot be applied directly to a real system, so that their advantage cannot be exploited.

The goal in this thesis is to make reinforcement learning more applicable as a controller design approach for real systems. Most reinforcement learning approaches are based on systems with discrete state and action space configurations, but controllers for real systems often have continuous state values as input and continuous actions as output. The excitation of the system involves adding a random component to the control action; for real systems this is not desirable, so we want to minimize the amount of excitation required. We also want to have an indication of the quality of the resulting controller without testing it on the real system, and we want to be able to see how we can change the settings to achieve better results. This means we want to derive methods that:

• are able to deal with continuous state space problems;
• are able to deal with nonlinear systems, because these are the systems we want to control;
• do not require too much data;
• do not require too high disturbances on the system;
• provide ways to interpret the results.

Methods that do not have any of these properties will not be very useful. Therefore all the methods we derive will have some of these properties.

1.4 Overview of this Thesis

This thesis is about RL, and therefore we will start with a more in-depth introduction of RL in chapter 2. Since most theoretic results in RL address configurations with discrete state and action spaces, these configurations will be used to explain the basic idea behind RL. Then we will describe RL methods for continuous state space configurations.
After that we refine our problem statement and choose a direction for our investigation.

In chapter 3 we choose the Linear Quadratic Regulation (LQR) framework as optimal control task. We introduce LQRQL as a Q-learning approach to obtain the feedback for this framework. We compare the results with an indirect approach that first estimates the parameters of the system and uses these to derive the feedback. By introducing the exploration characteristics we reveal the influence of the amount of exploration on the resulting feedbacks of both methods. Based on this we can prove whether the resulting feedback will be an improvement, given the amount of exploration used.

In chapter 4 we investigate the consequences of applying LQRQL to nonlinear systems. We will indicate a possible shortcoming and introduce Extended LQRQL as a solution. This will result in a linear feedback with an additional constant.

In chapter 5 we will introduce Neural Q-Learning. The difference with the approach in [65] is that it only uses one neural network. The controller is derived in the same way as in LQRQL. It can be used to obtain a linear and a nonlinear feedback function. This approach and the two approaches from chapter 3 will be tested on a nonlinear system, both in simulation and on the real system.

The conclusions and suggestions for future work are described in chapter 6.
Chapter 2

Reinforcement Learning

In the previous chapter we described Reinforcement Learning (RL). In RL the optimization is based on the interaction with the system. In this chapter we will give an introduction to RL in which we explain the basic RL concepts necessary to understand the remainder of this thesis. For a more complete overview of RL see [40][12][71].

Most theoretical results on RL apply to systems with discrete state and action spaces. Therefore we will start our introduction based on these systems. First we describe the optimal control task for a deterministic system and present a systematic procedure to compute the optimal policy. Then we introduce the Markov Decision Process (MDP) and describe the classical solution approaches that use the model of the system. We will describe two important RL algorithms: Temporal Difference learning and Q-Learning.

We want to use RL for systems as described in the previous chapter, which are systems with continuous state and action spaces. One way of using RL methods in a continuous state and action space task is to change the problem into a discrete state and action problem by quantization of the state and action spaces. Then the original discrete algorithms can be used. The other way is to modify the algorithms so that they can deal with general function approximators operating in the continuous domain. We will focus on the latter. We will conclude with a discussion in which we refine our problem statement.

2.1 A Discrete Deterministic Optimal Control Task

To introduce the basics of an optimal control task we will restrict ourselves to a deterministic discrete system. (We denote the discrete states and actions by s and a, to make a clear distinction between the continuous states x and actions u described in chapter 1.)

2.1.1 The problem

A discrete deterministic optimal control task consists of:
• A finite set of states {s1, s2, ..., s_ns}. The indices indicate the labels of the states of the system. To denote the state at a certain time step we will use the time as an index, so s_k is a particular element from the set of states at time step k.

• A set of actions {a1, a2, ..., a_na}. The set of actions can depend on the state, because it is possible that not all actions are possible in all states.

• A dynamic system. The state transition of the system changes the current state of the system to a possibly different state. For the deterministic system the state transition maps the present state and action to the next state, so an action taken in one state always results in the same next state.

• Reinforcements r_k. These indicate the reinforcement received at time step k. In general it depends on the state transition and action that take place at time step k; there does not have to be a dependency on time itself.

• The criterion. The criterion indicates the desired behavior. It describes how all reinforcements at different time steps are combined to give a scalar indication of the performance. This can be the sum over all reinforcements, which has to be either minimized or maximized, depending on whether the reinforcements are interpreted as costs or rewards. Another possible criterion is based on the average of reinforcements [67].

The policy π is the function that maps the states to actions, so that action a = π(s) is taken in state s. Given a policy for a deterministic system, the entire future sequence of states and actions is determined.

We have to specify our criterion before we can solve this optimization task. We take as criterion the minimization of the cost to go to a certain goal state. For a given policy π the future is already determined, so we can compute the total costs to the goal state for each state. We will call this the value function V^π, and V^π(s) is the value of state s:

V^π(s) = Σ_{i=k}^{N} r_i,    (2.1)

where s = s_k and N is the final time step, at which the goal state is reached. If the number of discrete states is finite, the value function can be regarded as a vector V^π ∈ IR^ns. Then V^π(s) is the element from the vector V^π corresponding to state s. A function/vector like this is often called a lookup table.

The optimization task is now to find an optimal policy π* for which the values of all states are minimal. (We will use * to denote optimal solutions.) Note that the corresponding optimal value function V* = V^{π*} is unique, but the optimal policy does not have to be unique.
2.1.2 The solution

Computing the optimal value function can help in determining the optimal policy. A systematic procedure is to compute the optimal value function backwards, starting in the goal state. For all states that can reach the goal state in one step, store the minimal costs as the values of these states. Store also the actions that lead to these minimal costs, because they will form the optimal policy. The policy we get is the optimal policy for going in one step to the goal state.

Next we consider all states that can reach the goal state in two steps. For each state select the action for which the received cost plus the stored value of the next state is minimal. We can store this as the new value for each state, but first we have to check whether the state already has a value stored. This is the case when the goal state can also be reached in one step from this state. Only if the new value is lower than the already stored value do we change the stored value and action. This gives the optimal policy for going in two steps to the goal state. This procedure can be repeated until each state that can reach the goal state has an optimal action stored. Then we have the optimal policy.

The underlying idea of this procedure is that for the optimal value function the following relation should hold:

V*(s) = min_a {r_k + V*(s′)},    (2.2)

where s′ is a state that can be reached from s in one time step. Here r_k represents the cost received at time step k when action a is applied and the state changes from s to s′. It defines optimal actions as actions for which the values of two successive states differ only in the cost received during that time step. For all possible actions in state s the expression r_k + V*(s′) is computed, and the action a is determined for which it is minimal. The minimal value is stored as V*(s).

The procedure described is called Dynamic Programming (DP) [10], and a more general form of (2.2) is called the Bellman equation. As illustrated by DP, the Bellman equation can also be used to find algorithms to derive the optimal actions.
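The backward procedure can be sketched in a few lines. The tiny task below (state names, transitions and costs are invented for illustration) sweeps the Bellman condition (2.2) until no value improves, storing the minimal cost-to-goal and the corresponding action for each state:

```python
# A small deterministic task: each action in a state maps to (next_state, cost).
transitions = {
    'A': {'right': ('B', 1.0), 'down': ('C', 4.0)},
    'B': {'right': ('G', 5.0), 'down': ('C', 1.0)},
    'C': {'right': ('G', 2.0)},
    'G': {},                       # goal state
}

V = {s: float('inf') for s in transitions}
V['G'] = 0.0                       # the goal costs nothing
policy = {}

changed = True
while changed:                     # repeat sweeps until no stored value improves
    changed = False
    for s, acts in transitions.items():
        for a, (s_next, cost) in acts.items():
            if cost + V[s_next] < V[s]:    # Bellman condition (2.2)
                V[s] = cost + V[s_next]    # store the minimal cost as V*(s)
                policy[s] = a              # store the action: it forms the policy
                changed = True

print(V)       # minimal cost-to-goal per state
print(policy)  # optimal action per non-goal state
```

For this task the direct route from B to G (cost 5) is beaten by the detour through C (cost 1 + 2), which the sweeps discover automatically.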
2.2 The Stochastic Optimization Tasks

The solution to the deterministic optimal control task described before is quite straightforward. But this is just a restricted class of optimal control tasks. A more general class of optimization tasks is given by Markov Decision Processes (MDPs). This will be the framework in which we will introduce RL algorithms, but first we will describe some model based solutions.

2.2.1 The Markov Decision Process

The MDP consists of the same elements as the deterministic optimal control task. The difference is that the state transitions and reinforcements no longer have to be deterministic. The state transitions of the process are given by all transition probabilities for going from some state s to some other state s′ when action a is taken:

P^a_{ss′} = Pr{s_{k+1} = s′ | s_k = s, a_k = a}.    (2.3)

All the ns × ns × na probabilities together form the model of the system. The deterministic system is just a special case of (2.3), for which all values of P^a_{ss′} are either one or zero. The process described by (2.3) is a Markov process, because the state transitions are conditionally independent of the past. This is a necessary property to express the value function as a function of the current state alone. So all probabilities in (2.3) have to be known. If they are not known, these probabilities have to be estimated by interacting with the system. Once the model is available, the system itself will no longer be used to compute the value function.

The value function for the MDP can be defined similarly to (2.1). To take into account the stochastic state transitions according to (2.3), the expectation value has to be taken over the sum:

V^π(s) = E_π { Σ_{i=k}^{N} r_i | s_k = s }.    (2.4)

The subscript π is used to indicate that actions in the future are always selected according to policy π. Because the state transitions are probabilistic, the final time step N (at which the goal state is reached) can vary a lot. It is even possible that N becomes infinite. This is because one state transition can lead to different next states; all state transitions after that can then be different, so it may take a different number of time steps to reach the goal state. The consequence is that the sum in the expectation value of (2.4) can have many different values, so that the variance of the underlying probability density function is very high.

A discount factor γ ∈ [0, 1] can be introduced to weigh future reinforcements. So a more general definition of the value function for a stochastic system becomes:

V^π(s) = E_π { Σ_{i=k}^{N} γ^{i−k} r_i | s_k = s }.    (2.5)

This will reduce the variance and make the value function easier to approximate. Note that there can be another reason to include a discount factor. Consider the situation where all reinforcements are zero except for the goal state. In this situation (2.4) reduces to V^π(s) = r_N, so that all policies that eventually will reach the goal state will be equally good. A discount factor γ < 1 in (2.5) will make states closer to the goal state have higher values. Then the optimal policy is the one that will reach the goal state in the smallest number of steps.

2.2.2 Model based solutions

We will now describe some solution methods that are based on computing the value function using the model of the system.
The value function is computed to solve the stochastic optimal control task. So the optimal policy is the policy for which the expected sum of future reinforcements is minimal. For the stochastic optimization task a more general form of the Bellman equation (2.2) has to be used:

V*(s) = Σ_{s′} P^{π*}_{ss′} (R^{π*}_{ss′} + γ V*(s′)).    (2.6)

The P^{π*}_{ss′} are the transition probabilities from s to s′ when actions are taken according to the optimal policy π*. The R^{π*}_{ss′} represents the expected reinforcement assigned to the state transition when the optimal action is taken:

R^{π*}_{ss′} = E { r_k | s_k = s, a_k = π*(s_k), s_{k+1} = s′ }.    (2.7)

The main difference between (2.6) and (2.2) is that now all possible next states are taken into account, weighed with the probabilities of their occurrence. We can define an ns × ns matrix P^{π*} with the state transition probabilities when policy π* is used, where the rows indicate all present states and the columns all next states. We can also define a vector R^{π*} with the expected reinforcements for each state. Then the Bellman equation (2.6) can be written more compactly as V* = P^{π*}(R^{π*} + γV*).

A solution method we already described is DP, which uses backwards induction from a final time step in the future. The compact notation of the Bellman equation indicates that DP requires that all state transitions with nonzero probabilities are taken into account. This makes it a very difficult and computationally expensive method.

Policy Iteration

The Bellman equation (2.6) provides a condition that has to hold for the optimal value function. Based on this equation some iterative methods were developed to compute the optimal policy. One of them is Policy Iteration [11][57], which consists of two basic steps. The first step is the policy evaluation, in which for a given policy the value function is computed. The second step is the policy improvement step, in which the new policy is computed.

The policy evaluation is an iterative algorithm based on (2.6), with the difference that the action is taken according to the policy that is being evaluated. Starting with an initial V_0, the values for each state are updated according to

V_{l+1}(s) = Σ_{s′} P^{πm(s)}_{ss′} (R^{πm(s)}_{ss′} + γ V_l(s′)),    (2.8)

where l indicates the policy evaluation step and πm is the policy at iteration step m, that is being evaluated. The expectation in the iteration can be calculated using the known transition probabilities P^{πm(s)}_{ss′} and their expected reinforcements R^{πm(s)}_{ss′}. Note that one policy evaluation step is completed when (2.8) has been applied to all states.

To compute the new policy, the greedy policy is determined.
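The two steps can be sketched compactly on a toy MDP. The transition probabilities, costs and discount factor below are invented for illustration; costs are minimized, as in the text:

```python
import numpy as np

# Toy MDP with 2 states and 2 actions (all numbers illustrative).
# P[a, s, s'] are transition probabilities, R[a, s] the expected immediate costs.
P = np.array([
    [[0.9, 0.1],
     [0.2, 0.8]],      # action 0
    [[0.1, 0.9],
     [0.7, 0.3]],      # action 1
])
R = np.array([
    [2.0, 1.0],        # action 0
    [0.5, 3.0],        # action 1
])
gamma = 0.9

def evaluate(policy, n_sweeps=500):
    """Policy evaluation: iterate update (2.8) for the fixed policy."""
    V = np.zeros(2)
    for _ in range(n_sweeps):
        V = np.array([R[policy[s], s] + gamma * P[policy[s], s] @ V
                      for s in range(2)])
    return V

def greedy(V):
    """Policy improvement: in each state pick the action with lowest expected cost."""
    return np.argmin(R + gamma * P @ V, axis=0)

policy = np.array([0, 0])
while True:
    V = evaluate(policy)
    improved = greedy(V)
    if np.array_equal(improved, policy):   # policy no longer changes: optimal
        break
    policy = improved

print(policy, np.round(V, 3))
```

Because there are only finitely many policies and each improvement step can only lower the values, the outer loop terminates after a few iterations.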
Then (2. This is the main diﬀerence with reinforcement learning solutions. REINFORCEMENT LEARNING Similar to DP. To enable the interaction with the . In DP only the actions that are optimal are considered. 2. So it still takes into account all possible state transitions.9) where πm is the greedy policy. the optimal policy π ∗ will be found after a ﬁnal number of policy iteration steps. This does not have to be true. Value Iteration does not use the optimal action. Especially in the beginning of value iteration. but actions that are assumed to be optimal. so that the correct value function is found.8) will not change.9) indicates that the greedy policy is computed after l evaluation steps.2. The Vl in the right hand side of (2. The VN equals the V π in (2. The reinforcement learning solution methods use interactions with the system to optimize the policy.8) makes V1 become the expected cost of the ﬁnal state transition plus the discounted cost at the ﬁnal state. This is done by replacing π(s) in Pss and Rss by the greedy policy of Vm . The eﬀect is that after a certain value of l the greedy policy corresponding to the value function in (2. it is very unlikely that the approximated value function is correct. Then the policy is improved by taking as new policy πm+1 = πm .16 CHAPTER 2. take the expected reinforcement at the ﬁnite time step N as the initial V0 . The number of evaluation steps is increased until πm no longer changes. Value Iteration The diﬀerence between policy iteration and DP is that policy iteration does not use the optimal action during policy evaluation. Note that there still is a diﬀerence with DP. To see this.8). Fortunately the diﬀerence between Vl+1 and Vl will decrease as l increases. the policy evaluation can be regarded as computing the value function backwards.3 Reinforcement Learning solutions The model based solution methods only use the model of the system to compute the optimal policy. 
The V2 becomes the future costs of the last two state transitions and this goes on for every iteration using (2. s (2. Because the number of policies is ﬁnite. It takes the actions according to the current policy. but now the policy is changed immediately into the greedy policy. value iteration can be proven to converge to the optimal policy. A combination of these two methods is called value iteration. Therefore it is also very likely that the actions used by Value Iteration are not optimal. Then the policy evaluation can stop and the policy can be improved by taking the greedy policy.5). The system itself is not used once the model is available. The greedy policy is given by πm (s) = argmin a a a Pss (Rss + γVl (s )) . The practical problem of the policy evaluation is that N can be unknown or inﬁnite. Under the same conditions as policy iteration. Value iteration is a one step approach in which the two policy iteration steps are π(s) π(s) combined.
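The value iteration scheme described above can be sketched in a few lines. This is a minimal illustration, not from the thesis: the two-state MDP, the function name and all numbers are invented, and costs are minimized (argmin) as in the greedy policy (2.9).

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP with costs.

    P[a][s, s'] : transition probabilities, R[a][s, s'] : expected costs.
    Returns the value function V and the greedy (cost-minimizing) policy.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = expected cost of taking a in s, then following V
        Q = np.array([(P[a] * (R[a] + gamma * V)).sum(axis=1)
                      for a in range(n_actions)])
        V_new = Q.min(axis=0)              # evaluate with the greedy action
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)
        V = V_new

# Two-state, two-action toy example: action 1 reaches the cheap state.
P = [np.array([[1.0, 0.0], [1.0, 0.0]]),   # action 0: go to state 0
     np.array([[0.0, 1.0], [0.0, 1.0]])]   # action 1: go to state 1
R = [np.full((2, 2), 1.0),                 # action 0 costs 1
     np.full((2, 2), 0.5)]                 # action 1 costs 0.5
V, policy = value_iteration(P, R)
print(policy)  # both states prefer action 1
```

Because the greedy step and the evaluation step are merged into one sweep, each iteration touches all states and all nonzero transition probabilities, exactly as the compact Bellman notation requires.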
Temporal Difference learning

The policy π is available and the data obtained by the interaction with the system is used to evaluate the policy. The system starts in the initial state and the interaction goes on until the final time step N. The interaction with the system can be repeated several times by restarting the experiment after the goal state is reached.

In (2.8) the value function is updated for all states simultaneously using all possible next states. With the interactions, the value function is only updated for the present state s = s_k using only the actual next state s' = s_{k+1}. Also the received reinforcement r_k is used and not the expected reinforcement R^{π(s)}_{ss'}. This suggests an update according to:

V_{l+1}(s_k) = r_k + γV_l(s_{k+1}). (2.10)

Clearly this is not the same as (2.8), because it is based on one state transition that just happens with a certain probability. For one particular state, the next state and received reinforcement can be different in repeated experiments, so the right hand side of (2.10) can have different values. Computing the average of these values before updating the value function will result in the same update as (2.8) if the interactions are repeated infinitely often. A drawback of this approach is that all right hand sides of (2.10) for each state have to be stored. It is also possible to incrementally combine the right hand side of (2.10) with an existing approximation of V_l. This leads to an update like

V_{l+1}(s_k) = (1 − α_l)V_l(s_k) + α_l(r_k + γV_l(s_{k+1})). (2.11)

The α_l ∈ [0, 1] is the learning rate. For a correct approximation of the value function, the learning rate should decrease; therefore the learning rate has the index l. If (2.11) is rewritten to

V_{l+1}(s_k) = V_l(s_k) + α_l(r_k + γV_l(s_{k+1}) − V_l(s_k)), (2.12)

we see that V_l(s_k) is changed proportionally to the difference between the value of the current state and what it should be according to the received cost and the value of the next state. This difference is called the Temporal Difference (TD) and the update (2.12) is called temporal difference learning [68].

Exploration

There is one problem with approximating the value function based on visited states alone. In spite of repeating the interaction, it is possible that some states are never visited, so that no value can be computed for these states. To prevent this an additional random trial process is included, referred to as the exploration.
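Update (2.12) can be sketched as follows. The two-state chain, the `step` interface and the harmonic learning-rate schedule are illustrative assumptions, not from the thesis; reinforcements are treated as costs.

```python
import numpy as np

def td0_evaluate(step, n_states, episodes=2000, gamma=0.9):
    """Tabular TD(0) policy evaluation, update (2.12).

    `step(s)` returns (next_state, cost, done) under the fixed policy.
    """
    V = np.zeros(n_states)
    for ep in range(episodes):
        alpha = 1.0 / (1 + ep)        # decreasing learning rate
        s, done = 0, False
        while not done:
            s_next, r, done = step(s)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # temporal difference update
            s = s_next
    return V

# Toy chain 0 -> 1 -> goal, each transition costs 1 (deterministic).
def step(s):
    return s + 1, 1.0, s + 1 == 2

V = td0_evaluate(step, n_states=2)
print(V)   # V(1) -> 1, V(0) -> 1 + 0.9*1 = 1.9
```

Only the visited state s_k is changed, and only the actually observed next state and cost are used, which is exactly the difference with the full backup (2.8).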
The exploration means that sometimes a different action is tried than the action according to the policy. The exploration has the same function as the excitation in system identification, described in chapter 1. In [76] a distinction is made between directed and undirected exploration. In case of undirected exploration, the tried actions are truly random and are selected based on a predefined scheme. Directed exploration takes experiences gathered during previous interactions into account. The result is that fewer interactions with the system are required.

One popular undirected exploration strategy is given by the ε-greedy policy, also known under the name pseudo-stochastic, max-random or utility-drawn-distribution [70]. With a probability 1 − ε the action according to the greedy policy is taken; with a probability ε a non-greedy action is tried. Usually the initial value of ε is taken high, so that many actions are tried. With each policy improvement step the value of ε is decreased. The effect is that when the policy becomes better, the exploration focuses more and more around the improved policies. In this way the parameter ε can be used to specify the amount of exploration.

Exploration can also help to converge faster to the optimal policy. This is the case when an explored action leads to a state with a lower value, a state that would not be reached when the action according to the policy was taken. The consequence is that the value of the state where the action was tried is decreased, while for all other actions the value will be higher. This increases the probability that the improved policy visits that state. This can be seen as finding a shortcut because of the exploration.

Q-Learning

The value function for a policy π can be approximated without a model of the system. However, computing the greedy policy using (2.9) still requires P^a_{ss'} and R^a_{ss'}. A model free RL method that does not use P^a_{ss'} and R^a_{ss'} was introduced in [80][81]. It was called Q-learning. It is based on the idea of estimating the value for each state and action combination.

The Bellman equation in (2.6) uses P^{π*(s)}_{ss'} and R^{π*(s)}_{ss'}, so it is only valid if in each state the optimal action π*(s) is taken. It is also possible to compute the right hand side of (2.6) for each possible action using P^a_{ss'} and R^a_{ss'}. This gives the value for each state and action combination. Define

Q*(s, a) = Σ_{s'} P^a_{ss'}(R^a_{ss'} + γV*(s')) (2.13)

as the optimal Q-function. For the optimal actions this agrees with the Bellman equation, when all actions in the future are optimal, so:

V*(s) = min_a Q*(s, a). (2.14)

For each state the optimal action is given by:

π*(s) = argmin_a Q*(s, a). (2.15)
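The ε-greedy selection over a table of Q-values can be sketched as below. The Q-table contents are invented for illustration, and the cost-minimizing convention (argmin) of the thesis is used.

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(Q, s, epsilon):
    """Pick the cost-minimizing action with prob. 1-epsilon, else random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmin(Q[s]))                # exploit: greedy w.r.t. costs

# With epsilon = 0 the choice is purely greedy.
Q = np.array([[0.2, 1.0],
              [3.0, 0.5]])
print(epsilon_greedy(Q, 0, 0.0))  # 0
print(epsilon_greedy(Q, 1, 0.0))  # 1
```

Decreasing `epsilon` over the policy improvement steps reproduces the annealing scheme described above: broad random trials at first, then exploration concentrated around the improved policy.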
This implies that the optimal Q-function alone is sufficient to compute the optimal policy. The Q-function for policy π can be defined in the same way as (2.5):

Q^π(s, a) = E_π{ Σ_{i=k}^{N} γ^{i−k} r_{i+1} | s_k = s, a_k = a }. (2.16)

The condition in (2.16) requires that all future actions are taken according to policy π. Now temporal difference learning can be used to approximate Q^π(s, a). The temporal difference update in (2.12) uses the value for the present state s_k and the next state s_{k+1}. For the Q-function the update should use the present state-action combination and the next state-action combination. So the temporal difference update for the Q-function becomes

Q_{l+1}(s_k, a_k) = Q_l(s_k, a_k) + α_l(r_k + γQ_l(s_{k+1}, π(s_{k+1})) − Q_l(s_k, a_k)). (2.17)

Here we have to be careful. The current action can be π(s_k), but it can also be another action that is being explored. The next action a_{k+1}, however, has to be π(s_{k+1}), because (2.16) requires that all future actions are taken according to policy π. The greedy policy can be determined similar to (2.15) according to

π'(s) = argmin_a Q_l(s, a). (2.18)

This represents one policy iteration step. If the process is repeated, eventually the optimal policy will be found. This also requires that sufficient exploration is used. Note that the original Q-learning in [80] was based on value iteration.

Convergence

If the system can be represented by a MDP and for each state the value is estimated separately, the convergence of temporal difference learning can be proven [68][38][24][25]. Also, if the Q-value for each state-action combination is estimated separately, the convergence of Q-learning can be proven [80][81]. The guaranteed convergence requires that the learning rate decreases and that the entire state-action space is explored sufficiently. The learning rate α in (2.12) has to decrease in such a way that:

Σ_{k=0}^{∞} α_k = ∞, (2.19)

Σ_{k=0}^{∞} α_k² < ∞. (2.20)

The condition (2.19) is required to make sure that the state space can be explored sufficiently, so that in theory all states are visited infinitely often before the update of the value function becomes very small. If α decreases faster, it is possible that not all states are visited often enough. The condition (2.20) is required to make sure that the update eventually converges to a fixed solution.
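The value-iteration form of Q-learning, with ε-greedy exploration and a per-pair decaying learning rate satisfying (2.19) and (2.20), can be sketched as follows. The chain task, the `step` interface and all constants are illustrative assumptions; costs are minimized.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_learning(step, n_states, n_actions, episodes=3000, gamma=0.9, eps=0.2):
    """Tabular Q-learning (value-iteration form) with epsilon-greedy exploration.

    `step(s, a)` returns (next_state, cost, done). Costs are minimized.
    """
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps \
                else int(np.argmin(Q[s]))
            s2, r, done = step(s, a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]        # alpha_k = 1/k meets (2.19)-(2.20)
            target = r + (0.0 if done else gamma * np.min(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

# Chain: in every state, action 1 (cost 0.5) advances, action 0 (cost 2) stays.
def step(s, a):
    if a == 1:
        return s + 1, 0.5, s + 1 == 3
    return s, 2.0, False

Q = q_learning(step, n_states=3, n_actions=2)
print(np.argmin(Q, axis=1))   # greedy policy: action 1 in every state
```

No transition probabilities or expected reinforcements appear anywhere in the update; only sampled transitions are used, which is what makes the method model free.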
2.2.4 Advanced topics

The benefits of RL

Why should RL be used? An obvious reason for using RL would be if there is no model of the system available. However, such a model can be obtained by interacting with the system and estimating all probabilities. Then all the n_s × n_s × n_a probabilities have to be estimated. It is clear that the number of probabilities to estimate grows very fast when n_s increases.

When a model is available and used, all state transition probabilities are considered, so the computational complexity of one evaluation step is quadratic in the number of states and linear in the number of actions. The convergence rate of the iteration also depends on the number of states. In general it is unknown how many policy evaluation steps are required. There are n_a^{n_s} possible policies, so this is the maximum number of policy improvement steps required [57].

The reinforcement learning methods can be beneficial when the state space becomes large. This is because updates are made only for the state transitions that actually take place. The state transitions with low probabilities hardly occur and will not have a large influence on the estimated value function. So the estimated values are primarily based on state transitions that occur with a high probability. In the worst case, however, all states should still be visited often enough, so in the worst case RL is not beneficial. But on average RL will be able to find the optimal solution faster for problems with a large number of states.

Increasing the speed

Although the RL algorithms require on average less computation than classical approaches, they still need to run the system very often to obtain sufficient data. Different solutions were proposed to speed up the learning algorithms:

• Generalizing over time
The temporal difference update (2.12) changes the value according to the difference between the value of the current state and what it should be according to the received reinforcement during the state transition and the value of the next state. This update is based on one state transition. Since the value represents the sum of future costs, the value can also be updated to agree with previous states and the reinforcements received since. To take past states into account, the update can be based on a sum of previous temporal differences. This is called TD(λ) learning [68], where the "discount factor" λ ∈ [0, 1] weighs the previous updates. The temporal difference update (2.12) can be written as

V_{l+1}(s_k) = V_l(s_k) + α_l ∆V_k. (2.21)

The change ∆V_k is in this case the temporal difference. The update of the value becomes:

∆V_k = (r_k + γV_l(s_{k+1}) − V_l(s_k))τ(s_k),
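A minimal sketch of TD(λ) with an eligibility trace follows, using the accumulating trace update given in the text (decay all eligibilities by λγ, then add one for the present state). The three-state chain and the constant learning rate are illustrative assumptions, not from the thesis.

```python
import numpy as np

def td_lambda(step, n_states, episodes=500, gamma=0.9, lam=0.8, alpha=0.1):
    """TD(lambda) with accumulating eligibility traces.

    The temporal difference of the current transition also updates past
    states, weighted by their eligibility tau.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        tau = np.zeros(n_states)          # eligibility trace
        s, done = 0, False
        while not done:
            s2, r, done = step(s)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            tau *= lam * gamma            # decay all eligibilities
            tau[s] += 1.0                 # bump the present state
            V += alpha * delta * tau      # update all eligible states
            s = s2
    return V

# Deterministic chain 0 -> 1 -> 2 -> goal, unit cost per step.
def step(s):
    return s + 1, 1.0, s + 1 == 3

V = td_lambda(step, n_states=3)
print(V)   # approaches [1 + 0.9*1.9, 1 + 0.9, 1] = [2.71, 1.9, 1]
```

With λ = 0 the trace is nonzero only for the present state and the code reduces exactly to the one-step update (2.12).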
where τ represents the eligibility trace. For the present state the eligibility is updated according to τ(s_k) := λγτ(s_k) + 1, while for all other states s the eligibility is updated according to τ(s) := λγτ(s). There are other possible updates for the eligibility [62]. Note that the update in (2.12) always corresponds with TD(0). This can be further enhanced as in Truncated Temporal Difference (TTD) [23], where the updates are postponed until sufficient data is available. For Q-learning it is also possible to use TD(λ) learning, resulting in Q(λ) learning [55]. Also this can be further enhanced by using fast online Q(λ) [84].

• Using a model
Although a model is not required for all RL techniques, a model can be used to speed up learning when it is available. If a model is not available it can be estimated simultaneously with the RL algorithm. This does not have to be a complete model with estimation of all state transition probabilities. It can also be trajectories stored from previous runs. The model can be used to generate simulated experiments [45]. In DYNA [69] a similar approach is taken. A different approach is to use the model to determine for which states the estimated values should be updated. In Prioritized Sweeping [49] a priority queue is maintained that indicates how promising the states are. The "important" states in the past are then also updated based on the present state transition.

2.2.5 Summary

We introduced reinforcement learning for Markov decision processes with discrete state and action spaces. The goal is to optimize the mapping from states to control actions, called the policy. For this the actions are selected that have the highest probability of bringing the system into the most desirable next state. To compute these actions, the model of the system should be available. Because of the discrete state spaces, the value of the expected sum of future reinforcements can be stored for each state separately. These values can be computed offline using the model of the system. When RL is used, these values are estimated based on the interaction with the system. The estimated values for two successive states should agree with the reinforcement that is received during that state transition. Temporal Difference learning is based on this observation: the estimated value of the present state is updated in a way that it will agree more with the estimated value of the next state and the received reinforcement. Updates of the estimated value function are based on the real system or based on a simulation of the system.

To get a good approximation of the value for all states, all states have to be visited often enough. For this an additional random trial process, called exploration, is included in the interaction with the system. Also the whole interaction process itself should be repeated several times by restarting the system. Once all values are estimated correctly the new policy can be determined.

A different approach is Q-learning, in which the expected sum of future reinforcements is estimated for each state and action combination.
Once this Q-function is estimated correctly, the action can be selected based on the estimated values.

2.3 RL for Continuous State Spaces

2.3.1 Continuous state space representations

The RL approaches described in the previous section were based on estimating the value for each state or state-action combination. This is only possible if the number of states and possible actions is finite. So these RL approaches can only be used for systems with discrete state and action spaces. We want to use RL algorithms to find controllers for systems with continuous state and action spaces. These are the systems as described in chapter 1, with state x ∈ IR^{n_x} and control action u ∈ IR^{n_u}. This means that we cannot directly apply the algorithms to these systems. We will give two different approaches to apply RL to optimize the controllers for these systems.

State space quantization

A very obvious solution to get from a continuous state space to a discrete one is to quantize the state space, where all states in a part of the state space are grouped to form one discrete state. This is a form of state aggregation. In [9] the continuous state space of an inverted pendulum was partitioned into a finite number of discrete states. The advantage of this method is that the standard RL algorithms can be used, but there are also some drawbacks:

• If the continuous state space system is a Markov process, then it is possible that the discretized system is no longer Markov [50]. This means that the convergence proofs that apply for the standard RL algorithms no longer have to be valid.

• The solution found is probably suboptimal. Due to the quantization the set of possible policies is reduced considerably. The optimal policy for the continuous state space problem may not be in the set. In that case the optimal solution found for the discrete system may still perform very badly on the continuous system.

Because the choice of the quantization influences the result, algorithms were developed that use adaptive quantization. There are divide and conquer methods, like the Parti-Game algorithm [49], where large states are split when necessary. There are methods based on unsupervised learning, like k-Nearest Neighbor [28] or self organizing maps [43]. There is also a method that uses triangularization of the state space based on data [51], which has been improved in [52]. The advantage of these adaptive quantization methods is that they can result in a more optimal policy than the fixed quantization methods. On the other hand, when a large state is split, the probability of being in one of the smaller states becomes smaller. The consequence is that the estimated values for these states become less reliable.
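A fixed grid quantization, the simplest of the schemes above, can be sketched as a mapping from a continuous state vector to one discrete state index. The bin edges and the mixed-radix indexing are illustrative assumptions.

```python
import numpy as np

def quantize(x, edges):
    """Map a continuous state vector to one discrete state index.

    `edges` lists, per dimension, the bin boundaries of a fixed grid.
    All continuous states inside one cell are aggregated into one state.
    """
    idx = 0
    for xi, e in zip(x, edges):
        idx = idx * (len(e) + 1) + int(np.searchsorted(e, xi))
    return idx

# Example: 2-D state, 3 bins per dimension -> 9 discrete states.
edges = [np.array([-1.0, 1.0]), np.array([-0.5, 0.5])]
print(quantize([-2.0, 0.0], edges))  # cell (0, 1) -> index 1
print(quantize([0.0, 0.7], edges))   # cell (1, 2) -> index 5
```

The resulting index can be fed directly to the tabular algorithms of section 2.2; the drawbacks listed above (loss of the Markov property, a restricted policy set) are properties of this mapping, not of the RL algorithm.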
Function approximation

Function approximators are parameterized representations that can be used to represent a function. Often function approximators are used that can represent any function. As function approximators, often feed-forward neural networks are used. In a RL context function approximators are often used in the actor-critic configuration shown in figure 2.1. Two function approximators are used: one representing the policy, called the actor, and one representing the value function, called the critic.

[Figure 2.1: The Actor-Critic Configuration. The actor maps the state x to the action u applied to the system; the critic receives the state x, the action u and the reinforcement r, and outputs Q(x, u).]

If we look at the critic we see that it has as input the state x and the action u. (Note that it is also possible to have a critic with only the state x as input.) This indicates that it is forming a Q-function rather than a value function. In the actor-critic architecture the gradient of the critic is used to compute the update of the actor. This is because when the critic is a general function approximator, the computation of the action for which the approximated Q-function is minimal can be very hard. It is unlikely that an analytical expression can be used for this, so the minimum has to be numerically approximated. As a consequence, Q-learning using general function approximators is hardly possible.

The applications mentioned in section 1.2 were all based on function approximators. Feed-forward neural networks were used in [65][59][60][26]. CMACs were used in [70][65][66]. Radial basis functions were used in [1] in combination with a stabilizing controller. This approach has been successfully demonstrated in [74] and [75], which resulted in a program for Backgammon that plays at world champion level. Note, however, that [75] applies to a system with a finite number of states; the neural network was only used to generalize over the states.

There are no general recipes to make the use of function approximators successful. In [14] examples are given where the use of RL with function approximators can fail. In [70] the same examples were used and the experimental setup was modified to make them work. Proofs of convergence for function approximators only exist for approximators linear in the weights when applied to MDPs [77][79]. For systems with a continuous state space there are no proofs when general function approximators are used. It still might require some hand tuning in the settings of the experiment.
Proofs of convergence for continuous state-action space problems do exist for the linear quadratic regularization task [82][17][44], but then no general function approximator is used but an approximator that is appropriate for this specific task.

2.3.2 Learning in continuous domains

The networks represent function approximators. We will describe how they can be trained. The critic is the function V(ξ, w), where ξ represents the input of the approximator. This can represent the state when approximating the value function, or it can represent the state and action, so that the approximator represents the Q-function. The w are the parameters of the network.

Training the Critic

Different methods are possible to train the critic. They are all based on the temporal difference in (2.12) or (2.16).

• TD(λ)
The TD(λ) update was introduced in section 2.2.4 as an extension to the update (2.12). For a function approximator the update is based on the gradient with respect to the parameters of the function approximator. This leads to an update like [68] (note that the term "temporal difference" was introduced in [68], together with the TD(λ) learning rule for function approximators and not for the discrete case as described in section 2.2):

w := w + α Σ_{k=0}^{N−1} ∆w_k, (2.22)

with

∆w_k = (r_k + γV(ξ_{k+1}, w) − V(ξ_k, w)) Σ_{i=1}^{k} λ^{k−i} ∇_w V(ξ_i, w). (2.23)

The α is the learning rate and the λ is a weighting factor for past updates. The sequence of past gradients indicates the eligibility trace and is used to make the update depend on the trajectory. This can speed up the training of the network. The value of 0 < λ < 1 for which it trains fastest is not known and depends on the problem. It can only be found empirically by repeating the training for different λ.

• Minimize the quadratic temporal difference error
Most learning methods for function approximators are based on minimizing the summed squared error between the target value and the network output. In a similar way the temporal difference can be used to express an error that is minimized by standard steepest descent methods [83].
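For a critic linear in the weights, V(ξ, w) = w·φ(ξ), the gradient in (2.23) is simply φ(ξ), and the weighted sum of past gradients can be kept as a running trace. The sketch below illustrates (2.22)-(2.23) under that linearity assumption; the one-hot features and the toy trajectory are invented for illustration.

```python
import numpy as np

def td_lambda_linear(trajs, phi, n_w, alpha=0.05, gamma=0.9, lam=0.7):
    """TD(lambda) for a critic linear in the weights: V(xi, w) = w . phi(xi).

    Implements the gradient form of updates (2.22)-(2.23); the running
    sum of past gradients plays the role of the eligibility trace.
    """
    w = np.zeros(n_w)
    for traj in trajs:                    # traj: [(xi, r, xi_next), ...]
        z = np.zeros(n_w)                 # trace: sum of lam^(k-i) gradients
        dw = np.zeros(n_w)
        for xi, r, xi_next in traj:
            z = lam * z + phi(xi)         # gradient of a linear critic is phi
            delta = r + gamma * w @ phi(xi_next) - w @ phi(xi)
            dw += delta * z
        w += alpha * dw                   # batch update (2.22) per trajectory
    return w

# Toy chain with one-hot features; state 2 is terminal so w[2] stays 0.
phi = lambda s: np.eye(3)[s]
traj = [(0, 1.0, 1), (1, 1.0, 2)]
w = td_lambda_linear([traj] * 400, phi, n_w=3)
print(w[:2])   # approaches [1 + 0.9*1, 1] = [1.9, 1]
```

This linear-in-the-weights case is exactly the setting in which the convergence proofs [77][79] mentioned above apply.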
This means the training of the critic becomes minimizing the quadratic temporal difference error:

E = (1/2) Σ_{k=0}^{N−1} (r_k + γV(ξ_{k+1}, w) − V(ξ_k, w))², (2.24)

where V represents the critic with weights w and γ is the discount factor. We see here that this error does not completely specify the function to approximate: it only gives the difference in output for two points in the input. The steepest descent update rule can now be based on this error:

w := w − α ∇_w E = w − α Σ_{k=0}^{N−1} ∆w_k, (2.25)

with

∆w_k = (r_k + γV(ξ_{k+1}, w) − V(ξ_k, w))(γ∇_w V(ξ_{k+1}, w) − ∇_w V(ξ_k, w)). (2.26)

If the learning rate is small enough, convergence to a suboptimal solution can be guaranteed. That this solution cannot be guaranteed to be optimal is due to the possibility that it converges to a local minimum of (2.24).

• Residual Algorithms
The temporal difference method described above can be regarded as minimizing the residue of the Bellman equation, because if the Bellman equation would hold this error would be zero. The TD(λ) update, however, has the disadvantage that convergence for a function approximator cannot be guaranteed. The advantage of using steepest descent on the Bellman residual is that it can always be made to converge; this is just a matter of making the learning rate small enough. Of course there remains the risk of ending in a local minimum. On the other hand, the learning converges very slowly compared to the TD(λ) update rule. Based on this observation Residual Algorithms were proposed [6]. The main idea is to combine TD(λ) learning with the minimization of the Bellman residual.

Training the Actor

The second function approximator is the actor, which represents the policy or feedback function u_k = g(x_k, w_a). Here w_a represents the weights of the actor. In the discrete case the policy that is greedy with respect to the value function is chosen. Since the critic is represented by a function approximator, it can have any form. This means that the policy of the actor cannot be derived directly from the critic. When the actor network is represented by a continuously differentiable function, a steepest descent approach can be used to adapt the weights. For this there are two possibilities, depending on whether the critic represents a Q-function or a value function:
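The residual-gradient descent of (2.25)-(2.26) can be sketched for a critic linear in the weights, for which ∇_w V(ξ, w) = φ(ξ). The two-state transitions and features below are invented for illustration.

```python
import numpy as np

def residual_gradient(transitions, phi, n_w, alpha=0.1, gamma=0.9, iters=2000):
    """Steepest descent on the squared Bellman residual, updates (2.25)-(2.26).

    For a critic linear in w, grad_w V(xi, w) = phi(xi), so each step moves
    w along (gamma*phi(xi') - phi(xi)) scaled by the temporal difference.
    """
    w = np.zeros(n_w)
    for _ in range(iters):
        dw = np.zeros(n_w)
        for xi, r, xi_next, done in transitions:
            v_next = 0.0 if done else w @ phi(xi_next)
            delta = r + gamma * v_next - w @ phi(xi)
            grad = (0.0 if done else gamma * phi(xi_next)) - phi(xi)
            dw += delta * grad
        w -= alpha * dw                 # descend the summed squared residual
    return w

phi = lambda s: np.eye(2)[s]            # one-hot features for two states
transitions = [(0, 1.0, 1, False), (1, 1.0, 1, True)]
w = residual_gradient(transitions, phi, n_w=2)
print(w)   # the residual vanishes at w = [1.9, 1.0]
```

Unlike the TD(λ) update, both the current and the next input contribute to the gradient here, which is what makes the iteration a true descent on (2.24) and hence guaranteed to converge for a small enough learning rate.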
• Backpropagation with respect to the critic
If the control action is an input of the critic, the gradient with respect to the action can be computed. This indicates how a change in action influences the output of the critic. The critic can be regarded as an error function that indicates how the outputs of the actor should change. So the update of the weights of the actor can be performed according to

∆w_a = −∇_{w_a} g(x_k, w_a) ∇_u Q(x_k, u_k, w). (2.27)

Note that the input u_k of this network is the output of the network representing the actor; therefore we could not express this with a single ξ as input.

• Backpropagation based on temporal difference
The critic can be used to determine the temporal difference. Then the actor can be updated such that the temporal difference becomes smaller:

∆w_a = −∇_{w_a} g(ξ_k, w_a)(r_k + γV(ξ_{k+1}, w) − V(ξ_k, w)). (2.28)

In this case the critic and actor are trained based on the same temporal difference. The backpropagation based on the temporal difference was first used in [9]. Then it was improved in [85]. Recently these approaches have gained more interest as solution methods for problem domains that are non-Markov [41][7][72]. The actor then represents the probability distribution from which the actions are drawn. The result is therefore not a deterministic policy.

The general idea of the actor-critic approaches has been formalized in Heuristic Dynamic Programming (HDP) [83]. This describes a family of approaches that are based on backpropagation and the actor-critic configuration. The HDP and action dependent HDP can be regarded as temporal difference learning and Q-learning. An alternative approach is to approximate the gradient of the critic that is used to update the actor. This has been extended to a more general framework in which the value of the critic and the gradient are trained simultaneously [83][56]. The main drawback of these approaches is that they become so complicated that even the implementation of some of these algorithms is not trivial [56].

2.3.3 Summary

In this section we showed that there are two different ways to apply RL algorithms to obtain a feedback function for systems with a continuous state and action space. One solution approach is based on discretizing the state space so that the original RL algorithms can be applied. The other solution is that function approximators are used to estimate the value or Q-function. In this case two approximators are required: one to represent the feedback and one to represent the value or Q-function. A drawback of these approaches is that they learn very slowly.
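The backpropagation-through-the-critic update (2.27) can be sketched in one dimension. Everything below is an invented toy: the critic Q(x, u) is assumed fixed and known in closed form, and the actor is a single gain u = g(x, w_a) = w_a·x, so the chain rule of (2.27) is visible in a single line.

```python
import numpy as np

# Toy 1-D sketch of update (2.27): the (assumed) critic Q(x, u) = x**2 +
# (u + x)**2 is fixed; the actor u = g(x, w_a) = w_a * x is trained by
# backpropagating the critic's action gradient dQ/du through the actor.
def dQ_du(x, u):
    return 2.0 * (u + x)              # action gradient of the toy critic

def train_actor(w_a=0.0, alpha=0.05, steps=500):
    rng = np.random.default_rng(3)
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)            # sampled state
        u = w_a * x                           # actor output
        # chain rule: dQ/dw_a = dQ/du * dg/dw_a, with dg/dw_a = x
        w_a -= alpha * dQ_du(x, u) * x        # descend the critic's output
    return w_a

w_a = train_actor()
print(w_a)   # approaches -1: u = -x minimizes (u + x)**2
```

With a trained rather than fixed critic, the same gradient flows through the critic network instead of an analytic formula, which is precisely the actor-critic configuration of figure 2.1.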
2.4 Discussion

Now that we have described RL and control, we can refine our problem statement. First look at what RL and control have in common:

• RL is used to solve an optimal control task. This means that an optimal control task can be formulated as a RL problem.

• In a sense RL can be seen as adaptive control, where parameters are estimated and used to tune the parameters of the controller. The system identification requires that the system is excited in order to estimate the parameters of the system; the RL algorithms require exploration in order to estimate the parameters of the value function.

There are some differences:

• Stability plays an important role in control theory, while in RL the set of feasible actions is assumed to be known in advance.

• The state transitions in the model are the actual values of the state, while in the MDP framework the state transitions are given by the probabilities. This has influence on how the parameters are estimated.

Since we are interested in controlling systems with continuous state and action spaces, we have to choose one of the two main approaches. The state space quantization results in a feedback function that is not well defined. It is based on many local estimations, so that it is hard to see how well they are trained. For the resulting feedback function it is therefore not easy to see what the consequences are when this feedback is used to control a continuous time system. Also it is hard to see how the result can be improved.

The function approximator approaches are better suited for understanding the resulting feedback. This is because the resulting feedback function is already determined by the actor. The main problem with these approaches is that they rely on training two different approximators. The only function approximator approach that comes close to our requirements presented in section 1.3 is the approach that is applied to the linear quadratic regularization task. The only problem is that it does not apply to nonlinear systems. Still, this approach will be a good starting point for our investigation.
Chapter 3

LQR using Q-Learning

3.1 Introduction

In this chapter we will present a theoretic framework that enables us to analyze the use of RL in problem domains with continuous state and action spaces. We are interested in the use of RL on real practical problem domains with continuous state and action spaces. We are also interested in how well the RL approach performs compared to alternative solution methods.

The first theoretical analysis and proof of convergence of RL applied to such problems can be found in [82]. It shows the convergence to the correct value function for a Linear Quadratic Regularization task, where the weights of a carefully chosen function approximator were adjusted to minimize the temporal difference error. In [16][19][20][17], a policy iteration based Q-learning approach was introduced to solve a LQR task. This was based on a Recursive Least Squares (RLS) estimation of a quadratic Q-function. Based on the same idea the convergence was proven for other RL approaches, including Q-learning [44]. These approaches do not use the model of the linear system and can be applied if the system is unknown. If data is generated with sufficient exploration, the convergence to the optimal linear feedback can be proven. Both [17] and [44] indicate that the practical applicability of the results is limited by the absence of noise in the analysis. A scalar example in [17] shows that the noise introduces a bias in the estimation, so that the proofs of convergence no longer hold. This means that we have to include the noise in our analysis.

We can solve the LQR task with an unknown system using an indirect approach, where data is used to estimate the parameters of the system. Then these estimated parameters are used to compute the optimal feedback. Because we want to compare the results it is important to replace the RLS by a batch linear least squares estimation. This has the advantage that no initial parameters have to be specified, so that the resulting solution only depends on the data and the solution method. The result is that both solution methods are off-line optimization methods, because first all data is generated and then the new feedback is computed.

According to the convergence proofs, sufficient exploration is required to find the optimal solution.
This means that random actions have to be applied to the system. In a practical control task this is not desirable, so we need an indication of the minimal amount of exploration that is sufficient. We will show that the noise determines the amount of exploration required for a guaranteed convergence. Also we will show that this amount of exploration differs for the two solution methods. This means that our analysis has to show how the performance of the two solution methods depends on the amount of exploration used to generate the data.

In the next section we will specify the Linear Quadratic Regularization task, where the linear system is assumed to be unknown. We will then present a direct and an indirect solution method for the situation where the parameters of the system are unknown. Also we will define a performance measure and give an overview of the comparison of the two solution methods. Section 3.3 will focus on the influence of the exploration on the comparison. The experimental confirmation of the results will be given in section 3.4, followed by the discussion and conclusion in sections 3.5 and 3.6.

3.2 LQR with an Unknown System

In this section we will describe the LQR task and show how to obtain the optimal feedback when everything is known. We then present the two solution methods and give an overview on how to compare the performance of these two methods.

3.2.1 Linear Quadratic Regulation

In the Linear Quadratic Regulation (LQR) framework, the system is linear and the direct cost is quadratic. Let a linear time invariant discrete time system be given by:

x_{k+1} = A x_k + B u_k + v_k, (3.1)

with x_k ∈ IR^{n_x} the state, u_k ∈ IR^{n_u} the control action and v_k ∈ IR^{n_x} the system noise at time step k. Matrix A ∈ IR^{n_x × n_x} and B ∈ IR^{n_x × n_u} are the parameters of the system. All elements of the system noise v are assumed to be N(0, σ_v²) distributed and white. The direct cost r is a quadratic function of the state and the control action at time k:

r_k = x_k^T S x_k + u_k^T R u_k, (3.2)

where S ∈ IR^{n_x × n_x} and R ∈ IR^{n_u × n_u} are the design choices. The objective is to find the mapping from state to control action (IR^{n_x} → IR^{n_u}) that minimizes the total costs J, which is given by:

J = Σ_{k=0}^{∞} r_k. (3.3)

The value of J is finite if (3.2) approaches zero fast enough. This is the case when (3.1) is controlled using (1.3) and the closed loop is stable. The total costs J become infinite if the closed loop is unstable. It is possible to include a discount factor γ < 1 in (3.3),
It is possible to include a discount factor γ < 1 in (3.3), but then the total costs can be finite for an unstable system. We will use (3.3) without a discount factor (or γ = 1), so that a finite J always implies that the system is stable.

The optimal control action u*_k is a linear function of the state:

    u*_k = L* x_k    with    L* = -(B^T K* B + R)^{-1} B^T K* A              (3.4)

where K* ∈ IR^{n_x × n_x} is the unique symmetric positive definite solution to the Discrete Algebraic Riccati Equation (DARE):

    K* = A^T (K* - K* B (B^T K* B + R)^{-1} B^T K*) A + S.                   (3.5)

This solution exists if (A, B) is controllable, (A, S^{1/2}) is observable, S ≥ 0 (positive semidefinite) and R > 0 (positive definite) [11]. Only with perfect knowledge about A, B, S and R can equation (3.5) be solved. This restricts the practical applicability of LQR, because in practice perfect knowledge about A and B is not available. The two methods presented in this chapter will not use this knowledge.

3.2.2 System Identification

In indirect adaptive control the parameters of the system have to be estimated. We will refer to this as the System Identification (SI) approach. (Actually "closed-loop identification" or "identification for control" would be more appropriate, but "system identification" stresses the main difference with the reinforcement learning method more.) This is our first method to solve the LQR problem with unknown A and B.

The estimations are based on measurements generated by controlling the system using:

    u_k = L x_k + e_k                                                        (3.6)

where L is the existing feedback and e_k ∈ IR^{n_u} represents the excitation (or exploration) noise. (Note that this controller already has the form of the optimal feedback (3.4), only the value of L is not optimal.) All elements of e are chosen to be N(0, σ_e²) distributed and white. The main difference between e and v is that v is an unknown property of the system, while e is a random process that is added on purpose to the control action, so the value of e is always known.

Controlling the system for N time steps results in a set {x_k}_{k=0}^{N} with:

    x_k = D^k x_0 + Σ_{i=0}^{k-1} D^{k-i-1} (B e_i + v_i)                    (3.7)

where D = A + BL represents the closed loop. The sets {x_k}_{k=0}^{N} and {u_k}_{k=0}^{N} (computed with (3.6)) form a data set that depends on the parameters of the system, the feedback, the initial state and both noise sequences. To estimate the parameters of the system, rewrite (3.1) to:

    x_{k+1}^T = [x_k^T  u_k^T] [A^T; B^T] + v_k^T.                           (3.8)
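The computation of (3.4) and (3.5) can be sketched numerically. The sketch below is not part of the original analysis: it iterates the right hand side of the DARE (3.5) as a fixed point (value iteration, starting from K = S) and then evaluates (3.4); the matrices A, B, S and R are a hypothetical example system.

```python
import numpy as np

def solve_dare(A, B, S, R, iters=2000):
    # fixed-point (value) iteration on (3.5), starting from K = S
    K = np.copy(S)
    for _ in range(iters):
        K = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + S
    return K

def optimal_feedback(A, B, S, R):
    # L* = -(B^T K* B + R)^{-1} B^T K* A, eq. (3.4)
    K = solve_dare(A, B, S, R)
    L = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
    return L, K

# hypothetical example: a discretized double integrator
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), np.eye(1)
L_star, K_star = optimal_feedback(A, B, S, R)
# the closed loop A + B L* should have spectral radius below one (stable)
spectral_radius = np.max(np.abs(np.linalg.eigvals(A + B @ L_star)))
```

With controllable (A, B) and observable (A, S^{1/2}), this iteration converges to the unique positive definite K*.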
So for the total data set:

    Y_SI = [x_1^T; ...; x_N^T],    X_SI = [x_0^T u_0^T; ...; x_{N-1}^T u_{N-1}^T],
    V_SI = [v_0^T; ...; v_{N-1}^T],    θ_SI = [A^T; B^T],

so that

    Y_SI = X_SI θ_SI + V_SI                                                  (3.9)

should hold. Since V_SI is not known, only a least squares estimate of θ_SI can be given:

    θ̂_SI = (X_SI^T X_SI)^{-1} X_SI^T Y_SI.                                  (3.10)

The estimated parameters of the system, Â and B̂, can be derived from θ̂_SI. When Â and B̂ are used, the solution of (3.5) will be K̂. Then a feedback L̂_SI can be computed using (3.4). This feedback L̂_SI is the resulting approximation of L* by the SI approach.

3.2.3 The Q-function

Reinforcement learning is our second method for solving the LQR task with unknown A and B. As explained in chapter 2, the main idea behind RL is to approximate the future costs and find a feedback that minimizes these costs. In optimization tasks other than LQR the computation of the future costs may be intractable, so that an approximation of the future costs may be useful. In LQR the solution of the DARE (3.5) can be used to express the future costs as a function of the state when the optimal feedback is used [11]:

    V*(x_k) = Σ_{i=k}^{∞} r_i = x_k^T K* x_k                                 (3.11)

with V*: IR^{n_x} → IR. If we know what function we have to approximate, then we only have to estimate the parameters. However, it is not very useful to estimate the parameters K* of (3.11), because the feedback that minimizes (3.11) is given by (3.4), which requires knowledge about A and B. Q-Learning is more appropriate, since it does not require knowledge about the system to obtain the feedback. In Q-Learning the feedback is derived from the Q-function, which represents the future costs as a function of the state and action. This means that the costs in (3.2) will be regarded as the reinforcements. According to the definition in chapter 2:

    V*(x) = min_u Q*(x, u) = Q*(x, u*)                                       (3.12)

should hold. It can be shown [19][44] that Q*(x_k, u*_k) is given by:

    Q*(x_k, u*_k) = Σ_{i=k}^{∞} r_i
                  = [x_k^T  u*_k^T] [ S + A^T K* A    A^T K* B     ] [x_k ]
                                    [ B^T K* A        R + B^T K* B ] [u*_k]  (3.13)

                  = [x_k^T  u*_k^T] [ H*_{xx}  H*_{xu} ] [x_k ]  = φ*_k^T H* φ*_k.   (3.14)
                                    [ H*_{ux}  H*_{uu} ] [u*_k]
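The structure (3.13)–(3.14) can be checked numerically: assemble H* from a DARE solution K* and verify that the minimizer of the positive definite quadratic form, u = -(H*_{uu})^{-1} H*_{ux} x, coincides with the optimal feedback (3.4). This sketch uses a hypothetical example system.

```python
import numpy as np

# hypothetical example system and design choices
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), np.eye(1)

# K* by fixed-point iteration on the DARE (3.5)
K = np.copy(S)
for _ in range(2000):
    K = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + S

# H* assembled as in (3.13)
H = np.block([[S + A.T @ K @ A, A.T @ K @ B],
              [B.T @ K @ A, R + B.T @ K @ B]])

# minimizing phi^T H* phi over u gives u = -(H*_uu)^{-1} H*_ux x,
# which should coincide with L* = -(B^T K* B + R)^{-1} B^T K* A of (3.4)
H_ux, H_uu = H[2:, :2], H[2:, 2:]
L_from_H = -np.linalg.solve(H_uu, H_ux)
L_star = -np.linalg.solve(B.T @ K @ B + R, B.T @ K @ A)
```

The two feedbacks are identical because H*_{ux} = B^T K* A and H*_{uu} = R + B^T K* B.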
The vector φ*_k^T = [x_k^T  u*_k^T] is the concatenation of the state and the optimal control action, and the matrix H* contains the parameters of the optimal Q-function. This shows that the optimal Q-function for the LQR task is a quadratic function of the state and action. So Q*: IR^{n_x} × IR^{n_u} → IR is the function to approximate based on the measurements.

The Q-function in (3.14) is a quadratic function and H* is a symmetric positive definite matrix. So this function can easily be minimized by setting the derivative with respect to the control action to zero:

    ∂/∂u Q*(x_k, u_k) = 2 H*_{ux} x_k + 2 H*_{uu} u_k = 0,                   (3.15)

resulting in:

    u*_k = -(H*_{uu})^{-1} H*_{ux} x_k = L* x_k    with    L* = -(H*_{uu})^{-1} H*_{ux}.   (3.16)

With H*_{ux} = B^T K* A and H*_{uu} = R + B^T K* B in (3.16), this result is identical to (3.4). So (3.14) can be used to compute the optimal control action without the use of the system model. In optimization tasks other than LQR, u*_k = arg min_u Q*(x_k, u) should be computed for all states to get the optimal feedback function. This is equivalent to the greedy policy described in chapter 2.

3.2.4 Q-Learning

In (2.16) the update rule for Q-Learning is given. It is based on repeatedly restarting the system and generating new data in each run. The update also has a learning rate that has to decrease according to (2.19) and (2.20). This makes it impossible to compare the result with that of the SI approach. We therefore have to change the Q-Learning algorithm such that it uses one single data set and does not use a learning rate. So the same data set with {x_k}_{k=0}^{N} and {u_k}_{k=0}^{N} is used. Q-Learning also uses scalar reinforcements; these are the direct costs computed with (3.2).

It is not the optimal Q-function that is being approximated, but the function representing the future costs under the feedback L, because all measurements are generated using some feedback L. This is the function Q^L(x, u) = φ^T H^L φ, with φ^T = [x^T  u^T]. (If there is also noise: Q^L(x, u) = φ^T H^L φ + v^T K^L v.) The parameters H^L of the Q-function should be estimated in the same way as the parameters of the system in paragraph 3.2.2. The matrix H^L is symmetric and positive definite, so that L' = -(H^L_{uu})^{-1} H^L_{ux} is the feedback that minimizes Q^L, so L' optimizes Q^L. (The prime indicates the feedback that is optimal according to the Q-function.)

Estimating the parameters of Q^L forms a policy evaluation step and computing L' forms a policy improvement step. This means that this Q-Learning approach is based on policy iteration. The L' does not have to be the optimal feedback, but it will have lower future costs than L. If L' is not good enough, the whole procedure can be repeated by generating measurements using L'. If the parameters of Q^L are always estimated correctly, the sequence of new values of L forms a contraction towards the optimal solution L* [44]. Then the convergence to the optimal solution follows from induction. This means we only have to verify the correctness of one policy improvement step.
The function Q^L can be estimated by writing its definition recursively:

    Q^L(x_k, u_k) = Σ_{i=k}^{∞} r_i = r_k + Σ_{i=k+1}^{∞} r_i = r_k + Q^L(x_{k+1}, L x_{k+1}).   (3.17)

Note that this definition implies that the data is generated using a stabilizing feedback. In case the feedback L is not stable, the function Q^L(x_k, u_k) is not defined because the sum of future reinforcements is not bounded. Therefore the correct values of H^L do not exist and cannot be estimated. From (3.17) it follows that:

    r_k + Q^L(x_{k+1}, L x_{k+1}) - Q^L(x_k, u_k) = 0.                       (3.18)

If in this equation Q^L is replaced by its approximation Q̂^L, the left hand side is the Temporal Difference (TD). The parameters of Q^L can be estimated by reducing the distance between the TD and zero. Because both functions are quadratic, the left hand side of (3.18) is only zero if Q̂^L has the same parameters as Q^L. This can be formulated as a least squares estimation as in (3.10). We define:

    φ_k^T = [x_k^T  u_k^T],    φ_{k+1}^T = [x_{k+1}^T  x_{k+1}^T L^T]
    and    w_k = v_k^T K^L v_k - v_{k+1}^T K^L v_{k+1}.                      (3.19)

Note that the definition of φ_{k+1} is slightly different from that of φ_k: the control action at time step k+1 is taken to be L x_{k+1}. Define vec(A) as the function that stacks the upper triangle elements of matrix A into a vector. Then it is possible to write (3.18) as:

    r_k = Q^L(x_k, u_k) - Q^L(x_{k+1}, L x_{k+1})                            (3.20)
        = φ_k^T H^L φ_k - φ_{k+1}^T H^L φ_{k+1} + w_k                        (3.21)
        = vec(φ_k φ_k^T)^T vec(H^L) - vec(φ_{k+1} φ_{k+1}^T)^T vec(H^L) + w_k   (3.22)
        = vec(φ_k φ_k^T - φ_{k+1} φ_{k+1}^T)^T vec(H^L) + w_k = vec(Φ_k)^T vec(H^L) + w_k.   (3.23)

Note that the matrix Φ_k also depends on L. The parameters of Q^L are estimated based on the data set generated using feedback L. (We compute the direct cost using (3.2) for a fair comparison. In a practical situation it is more likely that some scalar indication of performance is available, like for instance the energy consumption of the system. Note that for the SI approach the parameters of the system are unknown, but still the weighting with the design matrices can be made.) For all time steps the following holds:

    Y_QL = [r_0; ...; r_{N-1}] = [vec(Φ_0)^T; ...; vec(Φ_{N-1})^T] vec(H^L) + [w_0; ...; w_{N-1}]
         = X_QL θ_QL + V_QL,                                                 (3.24)

so that

    θ̂_QL = (X_QL^T X_QL)^{-1} X_QL^T Y_QL.                                  (3.25)
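The estimation (3.17)–(3.25) and one policy improvement step can be sketched as follows. This is not the thesis implementation: for simplicity the sketch regresses r_k on the full outer products φ_k φ_k^T - φ_{k+1} φ_{k+1}^T and symmetrizes afterwards, instead of stacking only the upper triangle as vec(·) does; both parameterize the same quadratic form. The system, feedback and noise level are hypothetical, and σ_v = 0 is taken so that (3.18) holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # hypothetical system
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
L = np.array([[-0.1, -0.3]])             # stabilizing, non-optimal feedback

N, sigma_e = 400, 0.5
x = np.array([1.0, -1.0])
rows, targets = [], []
for _ in range(N):
    u = L @ x + rng.normal(0, sigma_e, 1)            # eq. (3.6), with sigma_v = 0
    r = x @ S @ x + u @ R @ u                        # direct cost (3.2)
    x_next = A @ x + B @ u
    phi = np.concatenate([x, u])                     # phi_k = [x_k; u_k]
    phi_next = np.concatenate([x_next, L @ x_next])  # action at k+1 is L x_{k+1}
    rows.append((np.outer(phi, phi) - np.outer(phi_next, phi_next)).ravel())
    targets.append(r)
    x = x_next

# least squares on r_k = phi_k^T H phi_k - phi_{k+1}^T H phi_{k+1}, cf. (3.20)-(3.25)
theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
H = theta.reshape(3, 3)
H = 0.5 * (H + H.T)                                  # Q^L is a symmetric quadratic form

# policy improvement: L' = -(H_uu)^{-1} H_ux
H_ux, H_uu = H[2:, :2], H[2:, 2:]
L_new = -np.linalg.solve(H_uu, H_ux)
```

In the noise-free case the regression recovers H^L exactly (up to numerical precision), and L_new is the policy-improved feedback L'.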
The least squares estimate (3.25) gives an estimation of vec(H^L). Since H^L should be symmetric, Ĥ^L can be derived from vec(Ĥ^L). By applying (3.16) to matrix Ĥ^L, the resulting feedback L̂_QL for the Q-Learning approach is computed. This should be an approximation of L'.

This variant of Q-Learning only applies to the LQR framework, therefore we will refer to it as the Linear Quadratic Regulation Q-Learning (LQRQL) approach. (This should not be confused with the Least Squares TD approach [18][13], which applies to MDPs.) The main difference is that it does not require restarting the system several times to generate sufficient data for the correct estimation of all Q-values. Instead it uses prior knowledge about the function class of the Q-function that is chosen to fit the LQR task, which for the LQR task is globally correct. This can be seen as generalization over the state-action space. This also holds for the approaches in [17][44]. The only difference between these approaches and our approach using (3.25) is that we choose to learn in one step using the entire data set. The main reason for doing this is to make it possible to compare the result with the SI approach. Also the analysis of the outcome is easier when the estimation is not performed recursively.

3.2.5 The Performance Measure

We have described two different methods that use measurements to optimize a feedback, resulting in L̂_SI and L̂_QL. Here L̂ indicates the resulting feedback of either approach. For the comparison a scalar performance measure is required to indicate which of these feedbacks performs best. There are three ways to measure the performance:

• Experimental: Run the system with the resulting feedbacks and compute the total costs. For a real system this is the only possible way to compare the results. However, the performances of both approaches can only be compared for one specific setup, and it does not indicate how "optimal" the result is.

• Optimal Feedback: In a simulation there is knowledge of A and B, so the optimal feedback L* can be computed using (3.4). A norm ||L* - L̂|| will not be a good performance measure, because feedbacks with similar ||L* - L̂|| can have very different future costs. It is even possible that, although ||L* - L̂_1|| < ||L* - L̂_2||, L̂_1 results in an unstable closed loop while L̂_2 results in a stable closed loop.

• DARE Solution: Knowledge about A and B can be used in (3.5) to compute the solution of the DARE, K*. This gives the future costs (3.11) when starting in x_0 and using L*. The costs when using an approximated feedback L̂ can be expressed like (3.11) as well, but then with a matrix K^L̂. Comparing the matrices K* and K^L̂ results in a performance indication that only depends on the initial state. This means that this measure can be used to show that the resulting feedback approaches the optimal feedback.

It does not matter for the comparison which measure is used.
We will define a performance measure based on the DARE solution, because it is the least sensitive to the settings of the experiment. When using L̂, the value function V^L̂(x_0) gives the total costs J^L̂ when starting in state x_0. It is given by:

    V^L̂(x_0) = Σ_{k=0}^{∞} r_k = x_0^T [Σ_{k=0}^{∞} (A^T + L̂^T B^T)^k (S + L̂^T R L̂)(A + B L̂)^k] x_0 = x_0^T K^L̂ x_0   (3.26)

where K^L̂ is again a symmetric matrix. (It is clear that this matrix only exists when the closed loop A + B L̂ has all its eigenvalues in the unit disc.) When L* is used, the total costs V*(x_0) can be computed using (3.11). Let the relative performance (RP) ρ^L̂(x_0) be the quotient between V^L̂(x_0) and V*(x_0), so:

    ρ^L̂(x_0) = V^L̂(x_0) / V*(x_0) = (x_0^T K^L̂ x_0) / (x_0^T K* x_0) = (x_0^T Γ^L̂ x_0) / (x_0^T x_0)   (3.27)

where Γ^L̂ = (K*)^{-1} K^L̂. Only when L̂ = L* is Γ^L̂ the unit matrix. The RP ρ^L̂(x_0) is bounded below and above by the minimal and maximal eigenvalues of Γ^L̂:

    ρ^L̂_min = λ_min(Γ^L̂) ≤ ρ^L̂(x_0) ≤ λ_max(Γ^L̂) = ρ^L̂_max    ∀ x_0 ≠ 0.   (3.28)

Note that ρ^L̂_min = ρ^L̂_max = 1 if and only if L̂ = L*, and that ρ^L̂_max ≥ 1 for all L̂. Also note that ρ^L̂_min and ρ^L̂_max only depend on the feedback L̂ and the four matrices A, B, S and R that define the problem. According to (3.28) three possible measures for the RP can be used: ρ^L̂_min, ρ^L̂_max or ρ^L̂(x_0). In a practical situation ρ^L̂_max seems the best choice because it represents the worst case RP with respect to x_0, so in general we will use ρ^L̂ to indicate one of these measures. In this chapter we will call feedback L̂_1 better than feedback L̂_2 if ρ^L̂_1 < ρ^L̂_2.

3.2.6 Overview

The schematic overview in figure 3.1 summarizes this section. The setup at the left shows the system parameters and noise, but also the feedback L and the exploration noise e (indicated at the very left of figure 3.1) used to generate the measurements, indicated with x, u and r (note that r is computed using S and R). The computation of the optimal feedback using A and B is shown in the middle. The SI approach is shown at the top and the Q-Learning approach at the bottom. For LQRQL, figure 3.1 shows no explicit optimization, because this is implicitly included in the estimation of the Q-function. Figure 3.1 also shows that no additional information is required to derive L̂_QL from Ĥ^L. This is the difference between the SI and the QL approach: the SI approach is a two step method, where the estimation and the optimization are performed independently, while LQRQL is a one step method, where estimation and optimization are performed at once. The question mark at the very right of figure 3.1 indicates the comparison between ρ^L̂_SI and ρ^L̂_QL. In the next section we will relate this comparison to the amount of exploration.
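The relative performance bounds (3.28) can be sketched numerically: K^L̂ is obtained from the series (3.26), computed as the fixed point of K = S + L̂^T R L̂ + (A + B L̂)^T K (A + B L̂); K* by iterating (3.5); and ρ_min, ρ_max as the extreme eigenvalues of Γ^L̂ = (K*)^{-1} K^L̂. The example system and the second, deliberately suboptimal feedback (an LQR design for a much larger control penalty) are hypothetical choices.

```python
import numpy as np

def dare(A, B, S, R, iters=2000):
    # fixed-point iteration on (3.5) for K*
    K = np.copy(S)
    for _ in range(iters):
        K = A.T @ (K - K @ B @ np.linalg.solve(B.T @ K @ B + R, B.T @ K)) @ A + S
    return K

def cost_matrix(A, B, S, R, L, iters=4000):
    # K^Lhat of (3.26) as the fixed point of K = S + L^T R L + D^T K D, D = A + B L
    D = A + B @ L
    K = np.zeros_like(S)
    for _ in range(iters):
        K = S + L.T @ R @ L + D.T @ K @ D
    return K

def rp_bounds(A, B, S, R, L_hat, K_star):
    # rho_min and rho_max of (3.28): extreme eigenvalues of Gamma = (K*)^{-1} K^Lhat
    Gamma = np.linalg.solve(K_star, cost_matrix(A, B, S, R, L_hat))
    eig = np.sort(np.linalg.eigvals(Gamma).real)
    return eig[0], eig[-1]

# hypothetical example system
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S, R = np.eye(2), np.eye(1)
K_star = dare(A, B, S, R)
L_star = -np.linalg.solve(B.T @ K_star @ B + R, B.T @ K_star @ A)

rho_opt = rp_bounds(A, B, S, R, L_star, K_star)   # both bounds equal one for L*

# a stabilizing but suboptimal feedback: LQR for a 100x larger control penalty
K2 = dare(A, B, S, 100.0 * R)
L_slow = -np.linalg.solve(B.T @ K2 @ B + 100.0 * R, B.T @ K2 @ A)
rho_slow = rp_bounds(A, B, S, R, L_slow, K_star)  # rho_max > 1
```

Since K^L̂ ⪰ K* for every stabilizing L̂, the bounds are never below one, which matches the remark after (3.28).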
R x.3 the inﬂuence of the exploration is investigated for both methods. In 3.29) is hardly ever used because of its poor numerical performance. u ˆ K ˆ LSI ρLSI ˆ v e. R ˆ ˆ A. but we will also use it to investigate the inﬂuence of the exploration. B x. r.3. 3. B. are based on a linear least squares estimation.1 The estimation reformulated Both approaches described in the previous section.1 by reformulating the method of the estimation to make it possible to express the estimation error of the linear least squares estimation. Diﬀerent decomposition methods exist to overcome numerical problems. because this makes it very hard to see how the exploration inﬂuences the estimation.3. x0 S.3.3.3. In practice (3.2 and 3.4 the exploration characteristic will be introduced to describe the inﬂuence of the exploration on the performance of the resulting feedbacks. L A. R ˆ LQL Optimization Result ρLQL ˆ Comparison Figure 3. For a fair comparison this is an important property. B. L ˆ HL Estimation K∗ A. The Overview. The equations (3. We will start in 3.29) ˆ This solution θ depends only on the matrices X and Y . R L∗ ? QL Setup A.29) is the main problem for our analysis.3. R SI ˆ ˆ A. (3. The boxes indicate the parameters and the symbols next to the arrows indicate the required information to compute the next result. This is important for the implementation of the simulations.10) and (3. . THE INFLUENCE OF EXPLORATION 37 S. u. The matrix inversion in (3.3 The Inﬂuence of Exploration In this section we will investigate the inﬂuence of the exploration on the relative performances and the comparison. S. S.3.25) can be written as: ˆ θ = (X T X)−1 X T Y. B. 3.1. B. so no additional parameters inﬂuence the result. R A. B. In 3.
In QR decomposition, X is decomposed into an upper triangular square matrix M and a unitary matrix Z. (The name refers to the matrices Q and R, but we will use Z and M because we already use the symbols Q and R.) So (3.29) can be written as:

    θ̂ = ((ZM)^T ZM)^{-1} (ZM)^T Y = M^{-1} Z^T Y.                           (3.30)

This is sufficient for an efficient implementation, but it still uses a matrix inversion, making it hard to see how this solution depends on the exploration. To see the influence of the exploration, this solution has to be rearranged even more. The definition of Z and M in appendix A.1 makes use of projection matrices P. Let P_i be the projection matrix corresponding to X_{*i}, the ith column of X; then P_i can be defined recursively according to:

    P_i = P_{i-1} - (P_{i-1} X_{*i-1} X_{*i-1}^T P_{i-1}) / ||P_{i-1} X_{*i-1}||_2²    and    P_1 = I.   (3.31)

So P_i depends on all columns of X from X_{*1} to X_{*i-1}. Appendix A.2 shows that the matrices P can be used to solve (3.30) without matrix inversion. Let θ̂_{n*} be the last row and θ̂_{i*} the ith row of θ̂; then they are given by:

    θ̂_{n*} = X_{*n}^T P_n^T Y / ||P_n X_{*n}||_2²                           (3.32)

    θ̂_{i*} = X_{*i}^T P_i^T (Y - Σ_{j=i+1}^{n} X_{*j} θ̂_{j*}) / ||P_i X_{*i}||_2²    for i < n.   (3.33)

(In the rest of the chapter we will ignore the absence of the sum term for the nth row by defining a dummy θ̂_{n+1*} that equals zero.) So θ̂ can be obtained recursively by starting at the last row. We will use (3.33) only for the theoretical analysis of the exploration and not for the implementation, because its numerical performance is even worse than (3.29).

If one of the columns of X is a linear combination of all other columns, then (X^T X)^{-1} is singular. Multiplying such a column with P_i results in a zero vector, so the part of X_{*i} that is a linear combination of the columns X_{*1} to X_{*i-1} does not contribute to the outcome of P_i X_{*i}. In this situation P_i X_{*i} in (3.33) will become zero, resulting in a singularity as well. Note that the scalar ||P_i X_{*i}||_2² in (3.33) is squared, so it will go to zero faster than the elements of the vector X_{*i}^T P_i^T.
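The recursion (3.31)–(3.33) can be verified numerically against the direct least squares solution (3.29). The sketch below uses random data (and, as noted, this form is used for analysis only; its numerical behaviour on ill-conditioned data is poor):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                       # full-rank regressor matrix
Y = X @ rng.normal(size=4) + 0.01 * rng.normal(size=50)

n = X.shape[1]
# projection matrices (3.31): P_1 = I, then project out one column at a time
P = [np.eye(50)]
for i in range(1, n):
    v = P[-1] @ X[:, i - 1]
    P.append(P[-1] - np.outer(v, v) / (v @ v))

# back-substitution (3.32)-(3.33), starting from the last coefficient
theta = np.zeros(n)
for i in reversed(range(n)):
    resid = Y - X[:, i + 1:] @ theta[i + 1:]       # subtract later columns' contributions
    Pi_x = P[i] @ X[:, i]
    theta[i] = (Pi_x @ resid) / (Pi_x @ Pi_x)

# should match the ordinary least squares solution of (3.29)
theta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The agreement follows from the Gram–Schmidt interpretation of (3.30): P_i X_{*i} is the part of column i orthogonal to the earlier columns.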
3.3.2 The System Identification approach

We first rewrite the estimation for the SI approach and show how the resulting feedback depends on the estimation. We then express the estimation error and show how it depends on the exploration. Finally we show the consequences of the estimation error for the resulting feedback and its performance.

The Estimation

To show the influence of the exploration, the matrix X_SI in (3.9) can be split into a part that depends on x and a part that depends on u:

    X_SI = [X  U] = [X  (X L^T + E)]                                         (3.34)

with

    X = [x_0^T; ...; x_{N-1}^T],    U = [u_0^T; ...; u_{N-1}^T],    E = [e_0^T; ...; e_{N-1}^T].   (3.35)

So the control actions are split into a feedback part and an exploration part. Note that after this split some exploration is still contained in X and Y_SI, because the exploration also excites the states. Appendix A.3 shows that the columns of B̂ and Â are given by:

    B̂_{*i} = E_{*i}^T P_{n_x+i}^T (Y_SI - Σ_{j=n_x+i+1}^{n_x+n_u} U_{*j} B̂_{*j}) / ||P_{n_x+i} E_{*i}||_2²,
    Â_{*i} = X_{*i}^T P_i^T (Y_SI - Σ_{j=i+1}^{n_x} X_{*j} Â_{*j} - U B̂^T) / ||P_i X_{*i}||_2².   (3.36)

Without exploration the value of B̂ becomes infinite, because ||P_{n_x+i} E_{*i}||_2² approaches zero faster than E_{*i}^T P_{n_x+i}^T. For low exploration the term U B̂^T dominates the outcome of Â in (3.36), so this also makes Â infinite. Then Â becomes more linearly dependent on B̂, so for low exploration (Â, B̂) is more likely to be uncontrollable. Appendix A.3 also shows that the columns of D̂ = Â + B̂L are given by:

    D̂_{*i} = X_{*i}^T P_i^T (Y_SI - E B̂^T - Σ_{j=i+1}^{n_x} X_{*j} D̂_{*j}) / ||P_i X_{*i}||_2².   (3.37)

Because B̂ is multiplied with E, D̂ does not become very large for low amounts of exploration. Therefore we will use B̂ and D̂ to obtain the resulting feedback L̂_SI.

The Feedback

To compute the feedback, (3.5) should be solved using Â and B̂, so:

    K̂ = Â^T (K̂ - K̂ B̂ (B̂^T K̂ B̂ + R)^{-1} B̂^T K̂) Â + S.                (3.38)

A unique solution K̂ to (3.38) does not have to exist, especially when Â and B̂ are too large due to insufficient exploration. In that case the right hand side of (3.38) will become very small, making K̂ ≈ S. So we will assume that K* - K̂ is not too large. By replacing Â by -(B̂L - D̂), the feedback is computed according to (3.4) using the estimated matrices:

    L̂_SI = -(R + B̂^T K̂ B̂)^{-1} B̂^T K̂ Â = (R + B̂^T K̂ B̂)^{-1} B̂^T K̂ (B̂L - D̂).   (3.39)

With this, two possible outcomes can already be given:

• Too low exploration: L̂_SI ≈ L. If the amount of exploration is much too low, B̂ in (3.39) becomes much too large because of the low value of E in (3.36). So D̂ and R can be neglected, resulting in L̂_SI ≈ (B̂^T K̂ B̂)^{-1} B̂^T K̂ B̂ L = L. This means that the outcome will approximately be the feedback that was used to generate the data!

• High exploration: L̂_SI ≈ L*. For very high amounts of exploration the system noise V_SI in (3.9) can be neglected. The least squares estimation will almost be perfect, so solving the DARE and computing the feedback will approximately have the optimal feedback L* as outcome.

We can conclude that for insufficient exploration the relative performance does not change, and for abundant exploration the relative performance approaches one. We will determine the minimal amount of exploration required to obtain the second outcome.

The Estimation Error

By defining y_k = x_{k+1} and using (3.7), it is possible to write:

    y_k = D^{k+1} x_0 + Σ_{i=0}^{k} D^{k-i} (B e_i + v_i).                   (3.40)

So Y_SI can be written as:

    Y_SI = X D^T + E B^T + V_SI.                                             (3.41)

This can be used to get an expression for the error in the estimations of B and D:

    B̂_{*i} = E_{*i}^T P_{n_x+i}^T (X D^T + E B^T + V_SI - Σ_{j=n_x+i+1}^{n_x+n_u} U_{*j} B̂_{*j}) / ||P_{n_x+i} E_{*i}||_2²   (3.42)
           = E_{*i}^T P_{n_x+i}^T (E B^T + V_SI - Σ_{j=n_x+i+1}^{n_x+n_u} E_{*j} B̂_{*j}) / ||P_{n_x+i} E_{*i}||_2²   (3.43)
           = B_{*i} + E_{*i}^T P_{n_x+i}^T (V_SI - Σ_{j=n_x+i+1}^{n_x+n_u} E_{*j} (B̂_{*j} - B_{*j})) / ||P_{n_x+i} E_{*i}||_2².   (3.44)

So the estimation error B̄_{*i} = B̂_{*i} - B_{*i} is given by:

    B̄_{*i} = E_{*i}^T P_{n_x+i}^T (V_SI - Σ_{j=n_x+i+1}^{n_x+n_u} E_{*j} B̄_{*j}) / ||P_{n_x+i} E_{*i}||_2².   (3.45)

In the same way:

    D̂_{*i} = X_{*i}^T P_i^T (X D^T + E B^T + V_SI - E B̂^T - Σ_{j=i+1}^{n_x} X_{*j} D̂_{*j}) / ||P_i X_{*i}||_2²   (3.46)
    D̄_{*i} = X_{*i}^T P_i^T (V_SI - E B̄^T - Σ_{j=i+1}^{n_x} X_{*j} D̄_{*j}) / ||P_i X_{*i}||_2².   (3.47)

The estimation errors depend on the exploration noise in E and the system noise in V_SI, because the correct values B and D do not depend on E and V_SI. So the estimations B̂ and D̂ can be written as the sum of the correct value and the estimation error: B̂ = B + B̄ and D̂ = D + D̄.

The Minimal Exploration

Expressions (3.45) and (3.47) hold for any E and V_SI. To focus on the amounts of exploration and system noise, the estimation errors should be expressed using σ_e and σ_v. We have to make the following assumptions:

    E{E_{*i}^T E_{*i}} ~ c_1 N σ_e²                                          (3.48)
    E{E_{*i}^T V_SI,*i} ~ c_2 N σ_e σ_v                                      (3.49)
    E{||P_i X_{*i}||_2²} ~ c_3 + c_4 σ_v² + c_5 σ_v σ_e + c_6 σ_e²           (3.50)

with c_1 ... c_6 constants depending on n_x, n_u and B. The constant c_3 is included to incorporate the dependency on the initial state x_0. Note that these assumptions are crude approximations that indicate the expectation value over a time interval of size N. Given the fixed time interval, this value may vary a lot depending on the particular noise sequences. Assumption (3.49) indicates that the level of magnitude of the expectation value is proportional to the cross correlation between E_{*i} and V_SI,*i. This means that the constant c_2 depends very much on these noise sequences. The same holds for c_5 in (3.50).

Now it is possible to give an approximation of the expected estimation errors (3.45) and (3.47). These errors depend on the configuration, so only an indication of their level of magnitude can be given. They are proportional to:

    E{B̄} ~ (c_2 σ_v) / (c_1 σ_e)                                            (3.51)
    E{D̄} ~ (c_2 σ_v σ_e) / (c_3 + c_4 σ_v² + c_5 σ_v σ_e + c_6 σ_e²).       (3.52)

Note that E{D̄} has a maximum for σ_e = -c_5 σ_v / (2 c_6) ~ σ_v, which in general is slightly less than σ_v. The maximum of E{D̄} is less than one and E{B̄} is also less than one. Both errors are zero if σ_v = 0: in that case exploration is only required to prevent singularity in the computations of the least squares estimate, and for more exploration the estimations will almost be correct.

These expressions can be used in (3.39) to see how the resulting feedback depends on the exploration and system noise. The error B̄ dominates (3.39) if it makes B̂ much too large, which happens when σ_e is well below σ_v. For σ_e ≈ σ_v the D̄ and R in (3.39) cannot be neglected, so the resulting feedback is still influenced by the estimation errors. This gives an indication of the minimal amount of exploration that is required. So, as a rule of thumb: the amount of exploration should be larger than the amount of system noise!
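The rule of thumb can be illustrated with a small Monte Carlo sketch (the system, feedback and noise levels are hypothetical choices, not taken from the experiments): the estimation error of B in the SI approach is large when σ_e is well below σ_v, and small when σ_e is well above σ_v.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # hypothetical system
B = np.array([[0.0], [1.0]])
L = np.array([[-0.1, -0.3]])             # stabilizing feedback used to generate data
sigma_v, N = 0.1, 400

def b_error(sigma_e):
    # one run of (3.6), (3.9) and (3.10); returns the estimation error of B
    x = np.zeros(2)
    X, Y = [], []
    for _ in range(N):
        u = L @ x + rng.normal(0, sigma_e, 1)
        x_next = A @ x + B @ u + rng.normal(0, sigma_v, 2)
        X.append(np.concatenate([x, u]))
        Y.append(x_next)
        x = x_next
    theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return np.linalg.norm(theta[2:].T - B)

# average error of B-hat for exploration well below and well above the system noise
err_low = np.mean([b_error(0.01 * sigma_v) for _ in range(10)])
err_high = np.mean([b_error(10.0 * sigma_v) for _ in range(10)])
```

With σ_e << σ_v the column of control actions is nearly a linear combination of the state columns, so the estimate of B is dominated by the estimation error, consistent with (3.51).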
3.3.3 The LQRQL approach

Analogous to the SI approach, we will determine the dependency of the estimation errors on the exploration and system noise. We will also show the consequences of the estimation error for the resulting feedback and its performance.

The Estimation

To show the influence of the exploration, the rows vec(Φ_k) of X_QL in (3.24) should be rearranged in such a way that linear dependencies between the columns of X_QL can be used. (For the implementation this is not required, only for the analysis.) Write Φ_k as:

    Φ_k = φ_k φ_k^T - φ_{k+1} φ_{k+1}^T = [ Φ^xx_k  Φ^xu_k ]
                                          [ Φ^ux_k  Φ^uu_k ]                 (3.53)

with:

    Φ^xx_k = x_k x_k^T - x_{k+1} x_{k+1}^T,
    Φ^ux_k = (Φ^xu_k)^T = L Φ^xx_k + e_k x_k^T,
    Φ^uu_k = Φ^ux_k L^T + u_k e_k^T.

So we redefine X_QL as:

    X_QL = [ vec(Φ^xx_0)^T      vec(Φ^ux_0)^T      vec(Φ^uu_0)^T     ]
           [ ...                ...                ...               ]
           [ vec(Φ^xx_{N-1})^T  vec(Φ^ux_{N-1})^T  vec(Φ^uu_{N-1})^T ].      (3.54)

(Note that here vec(Φ^ux) is used instead of vec(Φ^xu). Only the order of the elements is different, because Φ_k is symmetric and so vec(Φ^ux) = vec(Φ^xu). The reason for doing this is that the calculation of the feedback according to (3.16) makes use of H_ux and not H_xu.) Vector θ_xx has n_xx = ½ n_x(n_x + 1) elements, vector θ_ux has n_ux = n_u n_x elements and vector θ_uu has n_uu = ½ n_u(n_u + 1) elements. The submatrices Ψ^xx, Ψ^ux and Ψ^uu of X_QL correspond to H_xx, H_ux and H_uu. Define the matrix Υ with rows vec(e_k x_k^T)^T and the matrix T with rows vec(u_k e_k^T)^T. Let L_v be the matrix such that vec(L Φ^xx_k) = L_v vec(Φ^xx_k) and let L_△ be the matrix such that vec(Φ^ux_k L^T) = L_△ vec(Φ^ux_k); both are rearrangements of L. Then:

    Ψ^ux = L_v Ψ^xx + Υ    and    Ψ^uu = Ψ^ux L_△^T + T.                     (3.55)

The estimations are, according to (3.25), a solution to:

    Y_QL = [Ψ^xx  Ψ^ux  Ψ^uu] [θ_xx; θ_ux; θ_uu] + V_QL.                     (3.56)

This can be used to find expressions for θ̂_xx, θ̂_ux and θ̂_uu using (3.33).

The Feedback

To compute the feedback, (3.16) should be solved using Ĥ_ux and Ĥ_uu, so:

    L̂_QL = -Ĥ_uu^{-1} Ĥ_ux.                                                (3.57)

Ĥ_uu and Ĥ_ux are obtained by rearranging the vectors θ̂_ux and θ̂_uu. The difference with the SI approach is that the feedback directly follows from the estimations, so it can only be investigated by looking at θ̂_ux and θ̂_uu. The estimation of θ_uu using (3.33) is given by:

    θ̂_uu,i = Ψ^uu_{*i}^T P_{n_xx+n_ux+i}^T (Y_QL - Σ_{j=n_xx+n_ux+i+1}^{n_xx+n_ux+n_uu} Ψ^uu_{*j} θ̂_uu,j) / ||P_{n_xx+n_ux+i} Ψ^uu_{*i}||_2²
            = T^ee_{*i}^T P_{n_xx+n_ux+i}^T (Y_QL - Σ_{j=n_xx+n_ux+i+1}^{n_xx+n_ux+n_uu} Ψ^uu_{*j} θ̂_uu,j) / ||P_{n_xx+n_ux+i} T^ee_{*i}||_2²   (3.59)

where T^ee has vec(e_k e_k^T)^T as rows. This replacement is possible because the rows vec((L x_k) e_k^T)^T have no effect on the multiplication with the matrices P_{n_xx+n_ux+i}, and because the matrix Φ^uu is symmetric. The linear relation Ψ^uu = Ψ^ux L_△^T + T resembles U = X L^T + E in the SI approach, and without exploration T^ee becomes zero, causing a singularity (just like E there). The estimation of θ_ux has a form similar to (3.36):

    θ̂_ux,i = Υ_{*i}^T P_{n_xx+i}^T (Y_QL - Ψ^uu θ̂_uu - Σ_{j=n_xx+i+1}^{n_xx+n_ux} Ψ^ux_{*j} θ̂_ux,j) / ||P_{n_xx+i} Υ_{*i}||_2².   (3.60)

So it is also possible to define a θ̂_d, equivalent to the closed loop in the SI approach, according to:

    θ̂_d,i = Υ_{*i}^T P_{n_xx+i}^T (Y_QL - T^ee θ̂_uu - Σ_{j=n_xx+i+1}^{n_xx+n_ux} Υ_{*j} θ̂_d,j) / ||P_{n_xx+i} Υ_{*i}||_2².   (3.61)

Since θ̂_d = θ̂_ux + θ̂_uu L^T can be rearranged to Ĥ_d = Ĥ_ux + Ĥ_uu L, equation (3.57) can be written as:

    L̂_QL = Ĥ_uu^{-1} (Ĥ_uu L - Ĥ_d) = L - Ĥ_uu^{-1} Ĥ_d.                 (3.62)

With this result two possible outcomes can be given:

• Too low exploration: L̂_QL ≈ L. If the amount of exploration is much too low, Ĥ_uu is much larger than Ĥ_d, so the second term in (3.62) can be ignored. The outcome will approximately be the feedback that was used to generate the data!

• High exploration: L̂_QL ≈ L'. For very high exploration, the value of V_QL in (3.56) can be neglected. So Ĥ^L will be an almost perfect estimation of H^L, and solving (3.57) will approximately have L' as outcome.

We can conclude that for insufficient exploration the relative performance does not change. For high amounts of exploration the estimation will almost be correct, resulting in a relative performance that corresponds to L'.

The Estimation Error

To find the minimal exploration, we adapt the expressions for the estimation errors of the SI approach with the values for the Q-Learning approach. The error in the estimation θ̂_uu is given by:

    θ̄_uu,i = T^ee_{*i}^T P_{n_xx+n_ux+i}^T (V_QL - Σ_{j=n_xx+n_ux+i+1}^{n_xx+n_ux+n_uu} T^ee_{*j} θ̄_uu,j) / ||P_{n_xx+n_ux+i} T^ee_{*i}||_2².   (3.63)

In the same way for the estimation θ̂_d:

    θ̄_d,i = Υ_{*i}^T P_{n_xx+i}^T (V_QL - T^ee θ̄_uu - Σ_{j=n_xx+i+1}^{n_xx+n_ux} Υ_{*j} θ̄_d,j) / ||P_{n_xx+i} Υ_{*i}||_2².   (3.64)

The level of magnitude of the errors θ̄_uu,i and θ̄_d,i remains the same after rearranging them to H̄_uu and H̄_d.

The Minimal Exploration

To get an approximation of the minimal amount of exploration, we start again with some assumptions. Since T^ee has vec(e_k e_k^T)^T as rows, we will assume that E{T^ee_{*i}^T T^ee_{*i}} ~ c_1 N σ_e⁴. Using the definition of w_k in (3.19), we will assume that E{T^ee_{*i}^T V_QL,*i} ~ c_2 N σ_e² σ_v². We further assume that E{||P_{n_xx+n_ux+i} Υ_{*i}||_2²} ~ σ_e² E{||P_i X_{*i}||_2²}, so that the levels of magnitude of the expected errors are:

    E{H̄_uu} ~ (c_2 σ_v²) / (c_1 σ_e²)                                       (3.65)
    E{H̄_d} ~ σ_v² / ((c_3 + c_4 σ_v²) σ_e + c_5 σ_v σ_e² + c_6 σ_e³).       (3.66)

Both errors in (3.65) and (3.66) will be zero if σ_v = 0. This corresponds with the noise free situation in [16] and [44]. In this situation the only purpose of the exploration is to prevent singularity in the computations of the least squares estimate. The maximum of E{H̄_d} can be expressed as σ_v(1 + κ), where κ > 0 is some value that depends on the constants and σ_v. Without specifying the constants it is impossible to get an idea about the size of κ. The only thing we can conclude is that this maximum lies at an amount of exploration larger than the system noise.
3.3.4 The Exploration Characteristics

Our main contribution in this chapter is comparing the performances for the SI and LQRQL approach. Especially we focused on the influence of the amount of exploration on these performances. We will summarize our results in this section. First we will give a description of the similarities and then of the differences between the two solution methods. As performance measure we will use the relative performance introduced in section 3.2.5. For this we define the exploration characteristic as the expected performance as a function of the amount of exploration σ_e. Two exploration characteristics for the SI and LQRQL approaches are sketched in figure 3.2. To stress the differences, log(ρ_L − 1) is shown instead of ρ_L.

The outcome of both methods depends on two estimations: B̂ and D̂ for the SI approach, Ĥ_uu and Ĥ_d for LQRQL. These estimations can be viewed as the sum of the correct result and the estimation error (i.e. B̂ = B + B̄), where the exploration and system noise only affect the estimation error. Based on the influence of the exploration on the estimation errors, we can distinguish four types of outcomes. With the increase of the level of exploration we will have the following types of outcome:

I Singularity: No exploration will result in a singularity, so there is no outcome.

II Error Dominance: If the amount of exploration is much too low, the estimation errors will dominate the outcome, but a feedback can be computed. The resulting feedback will approximately be the feedback that was used to generate the data, so the relative performance does not improve.

III Sequence Dependent Outcome: For a certain amount of exploration, the estimation errors do not dominate the outcome but are still too high to be neglected. So the resulting feedback is partly based on the estimation error and partly on the correct value. The outcome will depend on the particular realization of the exploration noise sequence and system noise sequence. Therefore the relative performance can be anything, although it is bounded from below due to the error D̄ or H̄_d.

IV Correct Estimation: For sufficient exploration the estimation errors can be neglected. The SI approach will result in L∗ because the system's parameters are estimated correctly. The LQRQL approach will result in L̂ because the parameters of the Q-function are estimated correctly. It is clear that there is a difference in the type IV outcome.

The differences in the four types of outcomes are:

I Singularity: The amount of exploration appears quadratically in the estimations for the QL approach, so it requires more exploration to prevent singularity. This means that the lines in figure 3.2(b) start for higher values of σ_e than in figure 3.2(a).
II Error Dominance: Only the value of σ_e for which this is the outcome differs.

III Sequence Dependent Outcome: D̄ has a maximum value for σ_e < σ_v and H̄_d for σ_e > σ_v. Also the maximum values are different. So this outcome only differs in the lower bound for the relative performance.

IV Correct Estimation: For the SI approach the relative performance will approach one with the increase in the exploration level. For LQRQL the feedback will approach L̂, so that the relative performance depends on the feedback that was used to generate the data. Therefore the outcome of the LQRQL approach depends on L, and for the SI approach it does not. The only way to let the relative performance approach one is to do more policy improvement steps, even when an almost optimal feedback like L2 is used.

[Figure 3.2 shows two panels, each plotting log(ρ − 1) against σ_e with the regions I, II, III and IV and the levels L1, L2, L∗ and σ_v marked: (a) The SI exploration characteristic; (b) The LQRQL exploration characteristic.]

The symbols next to these lines indicate the value that is being approximated. The feedback L2 is almost optimal so that ρ_{L2} is almost one; L2 is the result after two policy improvement steps when starting with L1. The arrow with L∗ indicates that the characteristic will approach the optimal feedback if the exploration is increased. The dashed lines in figure 3.2(a) indicate the lower bound on the relative performance due to D̄ (the dashed line continues under the solid and bold line); the relative performance will not go below this line. The dashed line indicates the lower bound on the relative performances due to the errors D̄ or H̄_d. The grey area indicates outcome III, where any relative performance above the lower bound is possible.
This is because the SI approach will approximate L∗ and the LQRQL approach L̂. The main differences between the exploration characteristics in figure 3.2 are the amounts of exploration for which these outcomes occur.

Figure 3.2. The Exploration Characteristics. Both figures show log(ρ_L − 1) as a function of σ_e when data was generated using feedback L1 (bold line) and feedback L2 (solid line).
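The type IV outcome, where LQRQL recovers the exact policy improvement of the data-generating feedback, can be sketched on a hypothetical scalar system. All values below are our own illustrative choices, not the thesis's experimental setup; the run is noise-free except for exploration, so the least squares fit is exact:

```python
import numpy as np

# One policy-improvement step of LQR Q-learning on a hypothetical scalar
# system (our own a, b and initial feedback L0).
a, b = 0.9, 0.5           # system: x_{k+1} = a x_k + b u_k
L0 = -0.3                 # initial stabilizing feedback u = L0 x
rng = np.random.default_rng(0)

def simulate(L, sigma_e, N=200):
    """Generate (x, u, r, x_next) tuples with exploration noise on u."""
    x, data = 1.0, []
    for _ in range(N):
        u = L * x + sigma_e * rng.standard_normal()
        r = x * x + u * u                 # direct cost with S = R = 1
        x_next = a * x + b * u
        data.append((x, u, r, x_next))
        x = x_next
    return data

def q_learning_feedback(L, data):
    """Fit Q(x,u) = t_xx x^2 + t_xu x u + t_uu u^2 and return the greedy feedback."""
    phi = lambda x, u: np.array([x * x, x * u, u * u])
    M = np.array([phi(x, u) - phi(xn, L * xn) for x, u, r, xn in data])
    y = np.array([r for _, _, r, _ in data])
    t_xx, t_xu, t_uu = np.linalg.lstsq(M, y, rcond=None)[0]
    return -t_xu / (2.0 * t_uu)           # minimizer of the quadratic Q in u

L1 = q_learning_feedback(L0, simulate(L0, sigma_e=0.1))
```

With sufficient exploration, L1 equals the analytic policy improvement of L0; letting sigma_e shrink toward zero makes the regression matrix singular, which is the type I outcome.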
1 We take a system according to (3. So taking N = 20 measurements should be enough for the estimation.3. Both approaches have in common that for near optimal feedbacks more exploration is required to guarantee an improvement than just avoiding outcome III. ρL and ρL (x0 ) for the SI and LQRQL approach for one min max realization.4 1 0 0 1 1 . We vary σe from 10−12 to 105 .69) Also the relative performances for L can be computed: ρL = 1. (3.832. ˆ ˆ ˆ Figure 3. It takes more exploration to guarantee L2 than it takes to guarantee L1 .1 Setup −0.149 0. (This amount of exploration will give an improvement for L1 ).68) For this value of L the closed loop is stable. we always use the same realizations of exploration noise and system noise by using the same seeds for the random generator.3(a): . (3.2 σv = 10−4 .4.112 L∗ = 0.2 Exploration Characteristic We compute the exploration characteristic by doing the same simulation experiment for diﬀerent values of σe . 3.302 0.093 max ρL (x0 ) = 1.149 1.1) with the following parameters: A= B= x0 = (3. The exploration intervals for the four types of outcomes can be seen in ﬁgure 3.70) 3.373 0. Feedback L2 can only be improved by increasing the amount of exploration even more. For this system the number of parameters that has to be estimated for both approaches equals 6.6 0.279 . Figure 3.469 min ρL = 2. 3. (3.2) we take S to be a unit matrix and R = 1.4. The solution of the DARE and the optimal feedback are given by: K∗ = 2.3(a) shows ρL .6) with L = 0. To make sure that σe is the only parameter that diﬀers.4 Simulation Experiments The purpose of the simulation experiments is to verify the results presented in previous sections and show the exploration characteristics. The measurements are generated according to (3.2(b) shows the same eﬀect for the LQRQL approach due ¯ to Hd . SIMULATION EXPERIMENTS 47 taking σe = σv will not “improve” the feedback.4.2 −0.67) For the direct cost (3.
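The quantities involved, K∗ from the DARE, the optimal feedback L∗ and the relative performance, can be computed along the following lines. The matrices are illustrative stand-ins, since the exact experimental values are garbled in this copy; S = I and R = 1 as in the text:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Illustrative 2x2 system (stand-in values, not the thesis experiment).
A = np.array([[0.6, -0.2], [0.4, 1.0]])
B = np.array([[0.0], [1.0]])
S = np.eye(2)
R = np.array([[1.0]])

K = solve_discrete_are(A, B, S, R)                      # K*, solution of the DARE
L_opt = -np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)  # optimal feedback L*

def cost_matrix(L):
    """P_L with J_L(x0) = x0' P_L x0 for the closed loop A + B L."""
    Acl = A + B @ L
    return solve_discrete_lyapunov(Acl.T, S + L.T @ R @ L)

x0 = np.array([1.0, 1.0])
rho = (x0 @ cost_matrix(L_opt) @ x0) / (x0 @ K @ x0)    # relative performance, 1 at L*
```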
3(b) shows ˆ ˆ log(ρL − 1) and not ρL . The values of ρL . In ﬁgure 3. max ˆ Figure 3. In the results in ﬁgure 3. Therefore ﬁgure 3. it max does not imply that it is also high for the other approach. LQRQL: 10−7 < σe < 10−4 = σv .5 2 10 5 ρ 1.3(b). III SI: 10−7 < σe < 10−4 . IV The RP for both approaches are not equal to one. Simulation Results.48 CHAPTER 3.3(a) agree with the values in (3. I SI: σe < 10−12 (not shown in ﬁgure 3.70). The outcomes are shown for ﬁve diﬀerent realizations. max max III For some realizations ρL < ρL and for other realizations ρL > ρL . So if ρL is high for one approach. they are close to one! For the SI ˆ approach the value of ρL gets closer to one if the amount of exploration is increased. This max max max max hold for both approaches but not always for the same realizations (not shown in ˆ ﬁgure 3. ρmax and ρL (x0 ) for one realization. max ˆ For LQRQL the value of ρL does not approach one if the amount of exploration is max increased.3(a) the outcomes of type IV are not very clear. LQRQL: 10−4 < σe < 10−2 . IV SI: σe > 10−4 .1) 10 5 0 5 min 0 2.5 1 0. This particular realization results in an increase of the RP for both methods. LQRQL: σe > 10−2 .3(a)). where the lines are not labeled). max ˆ ˆ .5 10 10 10 10 σ 10 10 10 15 10 10 10 5 e σ 10 0 10 5 e L (a) ρL . The dashed lines are the results for the SI approach and the solid lines are the results for the LQRQL approach. II SI: 10−12 < σe < 10−7 .3(b) we want to focus on type III and IV outcomes.3. ρL min max and ρL (x0 ) in ﬁgure 3. min ˆ ˆ ˆ (b) log(ρL − 1) for ﬁve realizations. The dotted vertical line indicates the system noise level σv = 10−4 . LQRQL: σe < 10−7 . The RP for both approaches seem to be one. ρ(x ) and ρ 3 log(ρmax . LQR USING QLEARNING 4.5 max .5 4 10 0 3. Instead it approaches ρL > 1.
3.5 Discussion

In this chapter we continued investigating RL in the context of LQR as in [16][44]. In order to make it more realistic we included system noise in our analysis. The proofs of convergence in [16][44] are only valid if there is no noise and sufficient exploration is used, so the purpose of exploration in [16][44] is to prevent numerical problems with the recursive least squares estimation. In this chapter we showed that the system noise determines the amount of exploration that is sufficient. Our analysis was based on estimating the magnitudes of the estimation errors. These errors still depend on the number of time steps used. We did not look at the influence of the feedback itself.

There is a hard threshold level for the amount of exploration required. Just below that threshold the resulting feedback can be anything and can even result in an unstable closed loop; the feedback determines the probability of an unstable closed loop. This is the amount of exploration that has to be avoided. If we reduce the amount of exploration even more, we will find the feedback used to generate the data as the optimal solution. Although this is not dangerous, this result is not very useful. If the amount of exploration is reduced even more, no feedback can be computed because of a singularity in the least squares estimation. The amount of exploration that is sufficient in that case is determined by the machine precision of the computer. This effect is also present without noise. Since this situation has to be avoided, this is not of interest.

For the LQRQL approach we can look at the eigenvalues of Ĥ. If it has negative eigenvalues, the quadratic Q-function is not positive definite and therefore insufficient exploration was used. For the SI approach such an indication is not available. In [32] some additional experiments are described. There, two almost identical data sets are shown, where one data set did not change the feedback and where the other gave an improvement. This indicates that visual inspection of the data does not reveal whether sufficient exploration was used. Therefore avoiding type III outcomes is safer.

For the SI approach an alternative method was proposed to deal with the bias resulting from the system noise [21]. The idea is to add a bias towards the optimal solution. The effect of such an approach is that this may reduce the probability of an unstable closed loop for the type III outcome, but this can only have an effect for the higher amounts of exploration.

We also compared the result of LQRQL with an indirect approach. We observed that the performance of this approach as a function of the amount of exploration is very similar to that of LQRQL. The main difference is that the threshold level of the amount of exploration required is lower. This means that under the circumstances under which convergence of LQRQL can be proven, it is wiser to use the indirect approach. For sufficient exploration the feedback will determine how many policy iteration steps are required. When starting with a good performing feedback only a few steps are required, but this can still not be guaranteed. The contribution of the number of time steps on the performance for the indirect approach is described in [27]. The results presented there are very conservative and indicate that a large number of time steps are required for a guaranteed improvement of the performance. Based on our experiment we see that the amount of time steps required is just a couple of times the number of parameters to estimate.
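The positive-definiteness test on the estimated Ĥ mentioned above can be sketched as follows (both H matrices are hypothetical estimates, purely for illustration):

```python
import numpy as np

def q_function_is_positive_definite(H):
    """Indicator from the text: negative eigenvalues of the (symmetrized)
    estimate mean the quadratic Q-function is not positive definite,
    which signals insufficient exploration."""
    eigs = np.linalg.eigvalsh((H + H.T) / 2.0)
    return bool(np.all(eigs > 0))

# Hypothetical estimates:
H_sufficient = np.array([[2.0, 0.3], [0.3, 1.0]])
H_insufficient = np.array([[2.0, 0.3], [0.3, -0.1]])
```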
3.6 Conclusion

We have shown a fair comparison between two different approaches to optimize the feedback for an unknown linear system. The comparison is fair because both approaches used the same data, and no other parameters had to be chosen. So the differences in performance are due to the approaches. For the system identification approach the estimation and optimization are performed separately. For the Q-Learning approach the optimization is implicitly included in the estimation.

The first conclusion is that for insufficient exploration the result of the optimization will be the same as the feedback that was used to generate the data. So no change in feedback does not imply that the feedback is already optimal. This result is a consequence of the noise in the system. This noise introduces a bias in the estimation, and when using insufficient exploration this bias dominates the estimated outcome. If the exploration is insufficient, but large enough that the resulting feedback will not be the same as the initial feedback, then the resulting performance becomes very unpredictable. The closed loop can be stable and the performance can be improved, but it is also possible that the closed loop becomes unstable. These results have to be avoided and therefore it is very important that sufficient exploration is used.

The second conclusion is that the LQRQL approach can be guaranteed to optimize the feedback. This is the first continuous state space problem with noise, for which a reinforcement learning approach can be guaranteed to work. This is the good news. The bad news is that if the conditions hold and a good outcome can be guaranteed, an alternative approach based on system identification will perform better. So we can not recommend to use the LQRQL approach for the linear quadratic regulation task.
Chapter 4

LQRQL for Nonlinear Systems

4.1 Introduction

In this chapter we will study the applicability of linear approximations for nonlinear systems. First we will show that other approaches to design controllers for nonlinear systems are often based on local linear approximations. We will rewrite the nonlinear system as a linear system with a nonlinear correction. This allows us to study the effect of the nonlinear correction on the estimations of the SI and LQRQL approach as described in chapter 3. We can show that these two approaches estimate the parameters of the wrong function if this correction is not zero. For this situation we will introduce the extended LQRQL approach. In this approach the parameters of a more general quadratic Q-function are estimated, so that more parameters have to be estimated. The resulting feedback function no longer has to go through the origin. In a local part of the state space of a smooth nonlinear function, the correction can be assumed to have a constant value. Therefore this approach is more suited in a local part of the state space of a nonlinear function.

The effect of the extension of LQRQL is shown in experiments on a nonlinear system. The experiments were performed in simulation and on the real system. The choice of system was such that we were able to vary the effect of nonlinearity by the choice of the initial state. In this way we were able to show how the extended approach compares with the other two approaches for different sizes of the nonlinear correction.

4.2 Nonlinearities

4.2.1 The nonlinear system

In this chapter we will only consider systems that can be represented by first order vectorized difference equations. This is the class of time discrete Markov systems. The systems are described as (1.1):

    x_{k+1} = f(x_k, u_k, v_k),    (4.1)
where f is a functional mapping that maps the present state, control action and noise to the next state value. We will also assume that f is a smooth continuous differentiable mapping, for which the gradient to x_k and u_k is bounded. One class of systems that agrees with (4.1) is the class of linear systems according to (3.1). This is an important class. Due to the linearity, the mathematics of analyzing the system's behavior becomes tractable. Therefore many controller design strategies are based on linear systems.

To illustrate how linear systems simplify the controller design, take a task for which the objective is to keep the state value fixed at a value x_s, where we will call x_s the set point. The state value does not change if x_{k+1} equals x_k. The state that is unchanged under control action u_s is given by:

    x_s = (I - A)^{-1} B u_s.    (4.2)

It is also important to note that the existence of this solution is completely determined by the parameters of the linear system. So if it exists, it exists for all possible control actions u_s, and the state value x_s is uniquely determined by the control action u_s. For a linear system with a linear feedback, the behavior of the closed loop can be expressed using the parameters of the closed loop. If the closed loop is stable the system will always end up in the origin.

If we take a general nonlinear function as in (4.1) then instead of (4.2) we have

    x_s = f(x_s, u_s, 0).    (4.4)

In this case it is possible to have more solutions, but also that for one u_s more values for x_s are possible. The equilibrium state is the solution to

    x_eq = f(x_eq, g(x_eq), 0).    (4.5)

Again it is possible that multiple solutions exist. For some solutions the system will approach the equilibrium while for others it will not. It is clear that this will make the design of a controller very complicated. For the linear case it is just a matter of making the closed loop stable.
More solutions means that for one xs more values of us are possible. the system will never end up in the origin. The result of the design of the controller is that we get a system with a feedback controller. (4. This implies that the correctness of a control action can only be veriﬁed when considering the exact value of the state and control action. control action and noise to the next state value. because this makes it possible to verify whether control design speciﬁcation are met.2) we have xs = f (xs . The same holds for the equilibrium state when a feedback function u = g(x) is used. g(xeq ). This means that there has to exist a constant control action us for which xs = Axs + Bus holds. (4. This shows that if a solution exists. It is clear that this will make the design of a controller very complicated.1) then instead of (4. One class of systems that agrees with (4.1) is the class of linear systems according to (3. us .
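The set point relation for the linear case can be checked numerically; a minimal sketch with illustrative A, B and u_s (not taken from the thesis):

```python
import numpy as np

# Set point of a linear system x_{k+1} = A x_k + B u_k under a constant
# input u_s: x_s = (I - A)^{-1} B u_s, provided I - A is invertible.
A = np.array([[0.5, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
u_s = np.array([2.0])

x_s = np.linalg.solve(np.eye(2) - A, B @ u_s)
```

Applying u_s at x_s leaves the state unchanged, which is exactly the set point property.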
For the nonlinear case it may also include defining the part of the state space where the closed loop has to be stable. For nonlinear systems, properties can be only locally valid. They are given not only by the parameters but also depend on the value of the state.

Fixed Set Point

In (4.4) the noise was set to zero to define the set point x_s and the corresponding constant control action u_s. In practice however there is always noise, so the state value will change again. Then the system is no longer in its set point. If the mapping f of the nonlinear system (4.1) is continuously differentiable, the system can be assumed to be locally linear around the set point. Let \bar{x}_k = x_k - x_s and \bar{u}_k = u_k - u_s. Then (4.1) can be rewritten according to:

    \bar{x}_{k+1} = x_{k+1} - x_s = f(x_k, u_k, v_k) - x_s    (4.6)
                  = f(x_s + \bar{x}_k, u_s + \bar{u}_k, v_k) - x_s    (4.7)
                  \approx f(x_s, u_s, 0) + f_x(x_s, u_s, 0)\bar{x}_k + f_u(x_s, u_s, 0)\bar{u}_k + f_v(x_s, u_s, 0)v_k - x_s    (4.8)
                  = \tilde{A}\bar{x}_k + \tilde{B}\bar{u}_k + \tilde{B}_v v_k.    (4.9)

In (4.8) the mapping f is replaced by its first order Taylor expansion around the set point, where the mappings f_x, f_u and f_v contain the appropriate derivatives. Note that we assume that v has zero mean. If u_s is chosen in such a way that f(x_s, u_s, 0) = x_s and the mappings are replaced by the matrices \tilde{A}, \tilde{B} and \tilde{B}_v, then this forms the linear system (4.9). To make sure the system will go back to the set point, a feedback can be added.
The input of this feedback is the diﬀerence between the state value and the set point.2 Nonlinear approaches We will give a short overview of some methods for obtaining feedback functions for nonlinear systems.1) can be rewritten according to: xk+1 − xs = f (xk . us . so the state value will change again. 4. us . for the nonlinear case it may also include deﬁning the part of the state space where the closed loop has to be stable. So the task of this feedback is to control its input to zero. due to the noise the next state value can be diﬀerent. uk . Then this is expressed in ¯ ¯ the new state and control action xk and uk in (4.2. then this forms the linear system (4. 0)¯ k + f u(xs . For nonlinear systems. For a general function f this change can be anything and does not have to bring the system back in its set point.
LQRQL FOR NONLINEAR SYSTEMS In (4. A possible implementation of Gain Scheduling is to partition the state space and compute for each partition a local linearized model with respect to the center of the partition.54 CHAPTER 4. multiple linear models can be used.12) ˜ The result is a linear system.1) can be written as f (x. The linear system is only valid near the set point so its properties are not globally valid. One approach is to form one global feedback where the control action is determined by only one local linear model that is valid at the current state value. Feedback Linearization In some cases it is possible to change a nonlinear system into a globally linear system. The feedback may change the state of the system to a part of the state space where a diﬀerent local model will determine the control action.9) we have a linear system for which a linear feedback can be designed. The choice of the parameters of the new linear system are made to simplify the design of the controller for the new linear system. ˜ ˜ Suppose we want to have a linear system with parameters A and B and control input ˜ u. Applying (4. . This means that locally linear models are computed for many diﬀerent set points.11) to a nonlinear system with (4. As explained in chapter 2 we can form one global nonlinear feedback by combining all the local linear models. u. Gain Scheduling The ﬁxed set point approach only gives a feedback that is valid near the set point. These linear models can be used to compute the appropriate local linear feedbacks. The only diﬀerence with the ﬁxed set point approach is that now the local linear feedback does not have to result in an equilibrium state around the center of the partitioning. but care has to be taken. so the appropriate new control actions u can be determined using conventional linear techniques. v) = g(x) + h(x)u + v. The objective is to change the nonlinear system into a linear system of our choice. 
(4.10) will transform the nonlinear system into a linear system: xk+1 = g(xk ) + h(xk )uk + v k ˜ ˜˜ = g(xk ) + h(xk )h(xk )−1 (Axk − B uk − g(xk )) + v k ˜ ˜˜ = Axk − B uk + v k .11) ˜ where u represents a new control action. (4. To get a feedback function that is valid in a larger part of the state space. This is Gain Scheduling. This is only possible for a restricted class of nonlinear systems for which the mapping f of the nonlinear system in (4. Then the control action can be computed according to ˜ ˜˜ u = h(x)−1 (Ax − B u − g(x)). This is called feedback linearization. (4.10) with g and h mappings of the appropriate dimensions.
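A minimal sketch of the feedback linearization idea on a scalar system, using our own g and h (with h(x) nonzero everywhere so the inverse exists) and the cancelling control law u = h(x)^{-1}(ã x + b̃ ũ − g(x)):

```python
import numpy as np

# Feedback linearization sketch for a scalar system x_{k+1} = g(x) + h(x) u.
# g and h are illustrative choices, with h(x) bounded away from zero.
g = lambda x: np.sin(x)
h = lambda x: 2.0 + np.cos(x)

a_t, b_t = 0.5, 1.0                    # desired linear dynamics in u_tilde

def control(x, u_tilde):
    """u = h(x)^{-1} (a_t x + b_t u_tilde - g(x)) cancels the nonlinearity."""
    return (a_t * x + b_t * u_tilde - g(x)) / h(x)

def step(x, u_tilde):
    u = control(x, u_tilde)
    return g(x) + h(x) * u             # equals a_t x + b_t u_tilde
```

The new control action u_tilde now acts on an exactly linear system, so conventional linear design techniques apply to it.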
Here A(x_k, u_k) and B(x_k, u_k) represent matrices whose parameters depend on x_k and u_k. The existence of (4.11) is important: not only does (4.11) have to exist, but it also has to be found. When the physics of the system is known this can be computed. When the physics is not known, (4.11) has to be approximated by a general function approximator. In that case it is no longer possible to guarantee global properties of the system, since it relies on the quality of the approximation. One of these conditions is that the system is excited enough, so that the state space is explored sufficiently. For more general situations the existence has to be verified using Lie brackets. For more detailed information about this approach see [37].

Linear Matrix Inequalities

Linear Matrix Inequalities (LMI) techniques [15] are used to analyze nonlinear systems and prove properties, like stability. These techniques are based on the idea that a nonlinear system can be described by a set of linear systems. If we look at only one state transition of a nonlinear system then a linear system can be defined that generates the same state transition:¹

    x_{k+1} = f(x_k, u_k) = A(x_k, u_k) x_k + B(x_k, u_k) u_k.    (4.13)

So for the current state and control action there is a linear system for which the state transition will also lead to x_{k+1}. The parameter matrices A(x, u) and B(x, u) for all possible x and u form a set. To say something about the stability of the nonlinear system, all possible values of x and u should be considered. If all linear systems corresponding to the matrices in the set are stable, then also the nonlinear system must be stable. This is because for any state transition of the nonlinear system there exists a stable linear system that will have the same state transition. The set can consist of an infinite number of linear systems, making it impossible to prove the stability of all linear systems one by one.

One solution is to use a polytope LMI. In the parameter space of the linear system a polytope is selected that encloses the complete set. If the linear systems corresponding to the edges of the polytope are stable then all linear systems within the polytope are stable as well. Another approach is to use a norm bounded LMI. One stable linear system with parameters \tilde{A} and \tilde{B} is selected and all linear systems are considered to have parameters \tilde{A} and \tilde{B} plus an extra feedback. The state transitions of the linear systems are described as

    x_{k+1} = \tilde{A} x_k + \tilde{B} u_k + w(x_k, u_k),

where the vector w(x_k, u_k) represents the extra feedback.

¹ For simplicity we assume there is no noise.

For a continuous time configuration,
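A simplified numerical flavor of the polytope check (this only inspects the spectral radius at the vertices; a genuine LMI certificate would instead search for a common Lyapunov matrix P with A_i' P A_i − P < 0 at every vertex A_i):

```python
import numpy as np

# Illustrative vertex systems of a polytope of A-matrices.
vertices = [
    np.array([[0.5, 0.1], [0.0, 0.7]]),
    np.array([[0.6, -0.2], [0.1, 0.5]]),
]

def vertices_schur_stable(As):
    """Check Schur stability (spectral radius < 1) of each vertex system."""
    return all(np.max(np.abs(np.linalg.eigvals(A))) < 1.0 for A in As)
```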
conditions for sufficient quality of the approximation for feedback linearization can be given [78]. In case of a scalar state and action it is clear that h(x) ≠ 0 is a sufficient condition for the existence of (4.
uk . but when it is unknown the local linear system can be approximated by the SI approach from chapter 3. In order to see how they would perform when applied to nonlinear systems. 4. 4. then the nonlinear system is stable. The SI and LQRQL approach obtain a feedback based on a generated train set. If we assume that the train set is generated in a small local part of the state space. v. v) = v ∀x. The LMI technique is used to specify the set of feasible feedbacks and a controller design approach is used to select one feedback from that set.1) as xk+1 = f (xk .3. uk . In the worst case the set of feasible feedbacks is empty. The drawback of the LMI approaches is that the set of feasible feedbacks depends ˜ ˜ on the choice of polytope or A and B. v k ) − Axk − Buk ) ˜ ˜ = Axk + Buk + w(xk .3 The Extended LQRQL Approach In this section we will ﬁrst show that the SI and LQRQL approach from chapter 3. so we cannot use (4. do have limitations when they are applied to nonlinear systems. These approaches assume that the mapping f is available. Since LMIs provide the opportunity to guarantee stability for nonlinear systems.2.1) is a special case for which w(x. Except for the feedback linearization. In our case the mapping f is unknown.14) (4. we can 2 The linear system (3. all were based on local linear approximations of the nonlinear feedback. v k ).15) as a LMI.56 CHAPTER 4. u. . 4. v k ) ˜ ˜ ˜ ˜ = Axk + Buk + (f (xk . In case of an unfortunate choice it is possible that many good feedbacks are rejected.1 SI and LQRQL for nonlinear systems The SI and LQRQL approach were derived for linear systems. we write (4. Equation (4. resulting in a solution that is more appropriate in a small part of the state space. This is especially the case when the estimations are based on data generated in a small part of the state space. 
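The effect on the SI approach can be sketched with a toy one-dimensional system of our own choosing: a least squares fit of a linear model to data from a nonlinear system yields parameters that describe the data set, not the mapping f itself:

```python
import numpy as np

# Toy 1-D illustration: the least squares fit of x_{k+1} = a x_k + b u_k
# absorbs the average of the nonlinear correction into the estimates.
rng = np.random.default_rng(1)
f = lambda x, u: 0.8 * x + 0.5 * u + 0.2 * np.sin(3 * x)  # correction: 0.2 sin(3x)

xs, us, xn = [], [], []
x = 0.5
for _ in range(300):
    u = -0.4 * x + 0.1 * rng.standard_normal()   # feedback plus exploration
    xs.append(x)
    us.append(u)
    x = f(x, u)
    xn.append(x)

Phi = np.column_stack([xs, us])
theta, *_ = np.linalg.lstsq(Phi, np.array(xn), rcond=None)
a_hat, b_hat = theta
```

Where the data stay in a small part of the state space, the fitted pair (a_hat, b_hat) describes the local behavior well, but it is not a good approximation of the true parameters of f when the correction is not zero on average.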
uk .15) ˜ ˜ This describes a linear system with parameters A and B and an extra additional nonlinear 2 correction function w.15) strongly resembles a norm bounded LMI. The LQRQL approach in chapter 3 was also derived for linear systems.3 Summary We described some approaches to obtain feedback functions for nonlinear systems. they can be used for stability based controller design. (4. u. We will show that LQRQL can be extended to overcome these limitations. We are interested in using LQRQL to obtain a local linear approximation of the optimal feedback function. LQRQL FOR NONLINEAR SYSTEMS If it can be proven that the norm of w is less than one for all values of x and u.
If both the equilibrium state and the trajectory towards the equilibrium state have to be optimized. (4.15) by replacing the nonlinear function w(x. the equilibrium state xeq is no longer at the origin. The ﬁgure also shows that the addition of an extra constant l to the feedback function will allow for a better local linear approximation.4 The l Strictly speaking this does not have to represent a nonlinear system because the value of w is also globally valid. but also the value of the equilibrium state.16) as a linear system.16). The resulting feedback LSI will not approximate the optimal feedback at the part of the stateaction space where the train set was generated. THE EXTENDED LQRQL APPROACH 57 simplify (4.1. If the feedback function u = Lx + l is used for (4. v) with its average value for the train set.16) then the equilibrium state is given by xeq = (I − A + BL)−1 (Bl + w). the value of l can be used to select the optimal equilibrium state. The local linear approximation according to u = Lx is not very good.3. u. we can look at the resulting linear feedback function u = Lx. At the equilibrium state in (4. So when the average value of w ˆ ˆ is far from zero. The equilibrium state is given by: xeq = (I − A − BL)−1 w.3 We can apply the SI approach. so the equilibrium state was always in the origin. (4. The w then represents the mean of the system noise. So we look at the system: xk+1 = Axk + Buk + w (4.2) are not zero. 4 Whether this is possible depends on B and w.17) This shows that the equilibrium state depends on L! The LQRQL approach was derived for a linear system with w = 0. where the mean of system noise is not zero.14).16) will not only optimize the trajectory toward the equilibrium state. So LQRQL applied to (4. the estimated linear system with A and B is not a good approximation of ˆ f in (4. 
In order to see the consequences of w on the result of LQRQL.18) Here we see that if the feedback L is chosen such that the trajectory towards the equilibrium state is optimal.2 The extension to LQRQL Another way to see why the linear feedback does not have enough degrees of freedom is by looking at its approximation of a nonlinear optimal feedback. When we apply the feedback function uk = Lxk to control (4.16) where the value of vector w has a ﬁxed constant value.4. We can also view (4.3. In this linear system the correction is not present. This implies that A and ˆ ˆ Ax ˆ B are estimated such that the average value of w is zero.17) the direct costs (3. which will result in the estimated linear system: xk+1 = ˆ k + Buk . Let us assume an optimal linear feedback function g ∗ as shown in ﬁgure 4. Only the trajectory towards the equilibrium state can be optimized. then the feedback function u = Lx has insuﬃcient degrees of freedom. 4. 3 .
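How the constant l selects the equilibrium can be illustrated numerically, assuming the fixed-point relation x_eq = (I − A − BL)^{-1}(Bl + w) that follows from u = Lx + l (all values below are illustrative; w plays the role of the constant nonlinear correction):

```python
import numpy as np

# How the constant l shifts the equilibrium of x_{k+1} = A x + B u + w
# under the feedback function u = L x + l.
A = np.array([[0.5, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
L = np.array([[0.2, -0.5]])
w = np.array([0.1, -0.3])

def equilibrium(l):
    # x_eq solves x = A x + B (L x + l) + w, i.e. (I - A - B L) x_eq = B l + w
    return np.linalg.solve(np.eye(2) - A - B @ L, B @ np.atleast_1d(l) + w)

x_eq0 = equilibrium(0.0)   # equilibrium without the offset
x_eq1 = equilibrium(1.0)   # a different l selects a different equilibrium
```

With l = 0 only the trajectory toward the (fixed) equilibrium can be shaped by L; varying l moves the equilibrium itself, which is the extra degree of freedom the extension provides.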
Figure 4.1. The local linear approximation of the optimal nonlinear feedback function u* = g*(x): u = Lx versus u = Lx + l. The solid lines indicate the region of approximation.

Because the feedback u = Lx has to go through the origin, the local linear approximation cannot be a good approximation of the nonlinear function. The approximation of u = Lx + l does not have to go through the origin and is able to match the nonlinear function more closely. Note that l can be interpreted as the constant u_s in the Fixed Set Point approach. The difference is that the value of l is not chosen, but optimized using the extended LQRQL approach.

The resulting feedback function of LQRQL is a consequence of the choice of the quadratic Q-function. The Q-function used in the previous chapter was of the form Q(φ) = φ^T H φ, where φ_k^T = (x_k^T  u_k^T). This is not a very general quadratic function. A general quadratic function is given by:

    Q(φ) = φ^T H φ + G^T φ + c.   (4.19)

By including a term with vector G and a constant c, any quadratic function can be represented by (4.19). If (4.19) has the optimal parameters H*, G* and c*, the greedy feedback can be found by taking the derivative with respect to u and setting this to zero. This results in

    u* = −(H*_uu)^{-1}(H*_ux x + G*_u)   (4.20)
       = L* x + l*,   (4.21)

where G*_u denotes the part of G* that corresponds to u. Compared to (3.16) we see that again L* = −(H*_uu)^{-1} H*_ux, which shows that L* optimizes the trajectory to the equilibrium state. The difference with (3.16) is that a constant l* = −(H*_uu)^{-1} G*_u is added to the feedback function. The purpose of this constant is to determine the optimal equilibrium state. This indicates that the Q-function (4.19) will result in the feedback we want. Note that the scalar c in (4.19) does not appear in (4.21), so it does not have to be included in the estimation.

4.3.3 Estimating the new quadratic Q-function

The estimation method for the new quadratic Q-function will be slightly different, because more parameters have to be estimated.
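The greedy feedback (4.21) can be extracted from any estimated H and G. A sketch with randomly generated, purely illustrative parameters; note the factor 1/2 that appears when differentiating the linear term G^T φ of (4.19) — the thesis' definition of l* may absorb this factor into G:

```python
import numpy as np

# Greedy feedback for a general quadratic Q(phi) = phi^T H phi + G^T phi + c,
# phi = [x; u].  Setting dQ/du = 2(H_ux x + H_uu u) + G_u = 0 gives
# u* = -H_uu^{-1} (H_ux x + G_u / 2) = L* x + l*.   (illustrative values)
rng = np.random.default_rng(0)
nx, nu = 2, 1
M = rng.standard_normal((nx + nu, nx + nu))
H = M.T @ M + np.eye(nx + nu)          # symmetric, H_uu > 0 so Q is convex in u
G = rng.standard_normal(nx + nu)
H_uu, H_ux, G_u = H[nx:, nx:], H[nx:, :nx], G[nx:]

L_star = -np.linalg.solve(H_uu, H_ux)
l_star = -np.linalg.solve(H_uu, G_u / 2.0)

def Q(x, u):
    phi = np.concatenate([x, u])
    return phi @ H @ phi + G @ phi     # the constant c is irrelevant here

x = np.array([0.7, -1.2])
u_star = L_star @ x + l_star
# u* should beat perturbations around it for this fixed x.
assert all(Q(x, u_star) <= Q(x, u_star + d)
           for d in (np.array([0.01]), np.array([-0.01])))
```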
In the previous chapter the estimation was based on the temporal difference of the Q-function. A similar approach will be used for the Q-function (4.19). (The Q-function depends on L and l, but for clarity these indices are omitted.) This gives

    r_k = Q(x_k, u_k) − Q(x_{k+1}, L x_{k+1} + l)   (4.22)
        = φ_k^T H φ_k + G^T φ_k − φ_{k+1}^T H φ_{k+1} − G^T φ_{k+1} + ν_k   (4.23)
        = vec(φ_k φ_k^T)^T vec(H) − vec(φ_{k+1} φ_{k+1}^T)^T vec(H) + (φ_k^T − φ_{k+1}^T) G + ν_k   (4.24)
        = ( vec(φ_k φ_k^T − φ_{k+1} φ_{k+1}^T)^T   φ_k^T − φ_{k+1}^T ) ( vec(H) ; G ) + ν_k.   (4.25)

The difference with the previous chapter is that φ_{k+1}^T = ( x_{k+1}^T  (L x_{k+1} + l)^T ) and that the consequences of the noise are represented by ν_k. This again can be used to express the estimation as a linear least squares estimation:

    Y_EX = ( r_0 ; r_1 ; … ; r_{N−1} )
         = ( vec(φ_0 φ_0^T − φ_1 φ_1^T)^T     φ_0^T − φ_1^T ;
             vec(φ_1 φ_1^T − φ_2 φ_2^T)^T     φ_1^T − φ_2^T ;
             …
             vec(φ_{N−1} φ_{N−1}^T − φ_N φ_N^T)^T   φ_{N−1}^T − φ_N^T ) θ_EX + ( ν_0 ; ν_1 ; … ; ν_{N−1} )
         = X_EX θ_EX + V_EX.   (4.26)

The estimation θ̂_EX = (X_EX^T X_EX)^{-1} X_EX^T Y_EX (4.27) gives an estimation of θ_EX, and because vec(H) and G are included in θ_EX, it also gives the estimations Ĥ and Ĝ. The Ĥ and Ĝ can be used in the same way as in (4.21) to obtain the resulting L̂_EX and l̂.

The absence of the constant c in (4.25) indicates that c does not influence the outcome of the estimation. So the actual function that is estimated is

    Q̂(φ) = φ^T Ĥ φ + Ĝ^T φ.   (4.28)

This is the function we will use as Q-function, and we will call this the Extended LQRQL approach. For this function Q̂(0) = 0, so it is not a general quadratic function anymore. Still this is general enough, because the value of c also does not influence (4.21). The reason that this constant c can be ignored completely is that it represents the costs that will be received anyway, regardless of the optimization. (Note that if a discount factor γ < 1 is used, the constant c does influence the outcome.) Costs that will be received anyway do not influence the resulting feedback function. This indicates that the optimization is only based on avoidable costs.

4.4 Exploration Characteristic for Extended LQRQL

The estimation (4.27) shows how the parameters are estimated, but does not indicate how well these parameters are estimated. The results from the previous chapter suggest that the correctness of the estimation depends on the amount of exploration used for generating the train set.
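As a sanity check of the least-squares estimation (4.26)–(4.27), the sketch below regresses noiseless temporal differences generated from a known H and G and recovers them; the sample trajectory and parameter values are made up:

```python
import numpy as np

# Noiseless check of (4.26)-(4.27):
#   r_k = phi_k^T H phi_k + G^T phi_k - phi_{k+1}^T H phi_{k+1} - G^T phi_{k+1}
rng = np.random.default_rng(1)
n = 3                                   # dimension of phi = [x; u]
H = np.diag([2.0, 1.0, 3.0])            # illustrative "true" parameters
G = np.array([0.5, -1.0, 0.2])

phis = rng.standard_normal((50, n))     # stand-in state-action samples
rows, ys = [], []
for k in range(49):
    p0, p1 = phis[k], phis[k + 1]
    d = np.outer(p0, p0) - np.outer(p1, p1)
    rows.append(np.concatenate([d.ravel(), p0 - p1]))   # [vec(...)  phi_k - phi_{k+1}]
    ys.append(p0 @ H @ p0 + G @ p0 - (p1 @ H @ p1 + G @ p1))
X_EX, Y_EX = np.array(rows), np.array(ys)

theta, *_ = np.linalg.lstsq(X_EX, Y_EX, rcond=None)     # (4.27), min-norm solution
H_hat = theta[: n * n].reshape(n, n)
G_hat = theta[n * n:]
assert np.allclose((H_hat + H_hat.T) / 2, H, atol=1e-6)
assert np.allclose(G_hat, G, atol=1e-6)
```

Because vec(H) duplicates the symmetric off-diagonal entries, the minimum-norm least-squares solution splits them evenly, so symmetrizing H_hat recovers H exactly in the noiseless case.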
In other words, the minimum of the Q-function should be included in the area that is being explored. The exploration characteristic was introduced in section 3.4 to present an overview of the quality of the outcome as a function of the exploration. This will now be used to investigate the results of the extended LQRQL approach. Two questions have to be answered:

• Are the extra parameters Ĝ estimated correctly?
• How does the estimation of Ĝ influence the estimation of Ĥ?

The last question is relevant since G and H are estimated simultaneously in (4.27). The estimation for the extended approach includes n_x extra parameters and can be written similar to (3.58) as

    Y_EX = ( Ψ_xx  Ψ_ux  Ψ_uu  Ψ_g ) ( θ_xx ; θ_ux ; θ_uu ; θ_g ) + V_EX,   (4.29)

where V_EX represents the noise and θ_g represents the extra parameters that have to be estimated. The V_EX is different from V_QL from the previous chapter in the sense that it now includes the value of w. Similar to (3.63), the estimation error of a row of θ̄_g can be written as

    θ̄_g,i = Ψ_g,*i^T P_{n_xx+n_xu+n_uu+i} ( V_EX − Σ_{j=n_xx+n_xu+n_uu+1+i}^{n_xx+n_xu+n_uu+n_x} Ψ_g,*j θ̄_g,j ) / ‖P_{n_xx+n_xu+n_uu+i} Ψ_g,*i‖²,   (4.30)

and the estimation errors θ̄_uu and θ̄_ux in (3.63) change according to

    θ̄_uu,i = Ψ_uu,*i^T P_{n_xx+n_xu+i} ( V_EX − Ψ_g θ̄_g − Σ_{j=n_xx+n_xu+1+i}^{n_xx+n_xu+n_uu} Ψ_uu,*j θ̄_uu,j ) / ‖P_{n_xx+n_xu+i} Ψ_uu,*i‖²,   (4.31)

    θ̄_ux,i = Υ_*i^T P_{n_xx+i} ( V_EX − Ψ_g θ̄_g − Ψ_uu θ̄_uu − Σ_{j=n_xx+1+i}^{n_xx+n_xu} Υ_*j θ̄_ux,j ) / ‖P_{n_xx+i} Υ_*i‖².   (4.32)

The main difference is the extra term Ψ_g θ̄_g in the estimation errors θ̄_uu,i and θ̄_ux,i. The dependency of the estimation error θ̄_g on the exploration is comparable to that of the SI approach. If sufficient exploration is used, the influence of the θ̄_g,i can be neglected. This means that in the expected estimation errors (3.65) and (3.66) the value of σ_e² should be replaced by (σ_e + w)². The consequence is that even for low noise in the system, a large amount of exploration is still required if there is a large extra w.

Simulation of the Extended LQRQL characteristics

We did some simulation experiments to show the reliability of the extended LQRQL approach. To be able to show the correctness of the parameter estimation we took as system the model given by (4.16):

    x_{k+1} = A x_k + B u_k + v_k + w   (4.33)
    u_k = L x_k + e_k + l.   (4.34)
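Generating a train set from (4.33)–(4.34) can be sketched as follows; the A matrix below is an illustrative stable choice, not necessarily the settings of chapter 3:

```python
import numpy as np

# Sketch of train-set generation per (4.33)-(4.34):
#   x_{k+1} = A x_k + B u_k + v_k + w,   u_k = L x_k + e_k + l.
rng = np.random.default_rng(2)
A = np.array([[0.6, 0.6], [-0.4, 0.0]])   # assumed stable A (illustrative only)
B = np.array([[0.0], [1.0]])
L = np.array([[0.2, -0.2]])
w = np.array([1.0, 1.0])                   # nonzero-mean "noise", as in experiment II
l, sigma_v, sigma_e, N = 0.0, 1e-4, 1.0, 30

x = np.array([1.0, 1.0])
states, actions = [x], []
for k in range(N):
    e = rng.normal(0.0, sigma_e)                 # exploration noise e_k
    u = (L @ x).item() + e + l                   # (4.34)
    v = rng.normal(0.0, sigma_v, size=2)         # system noise v_k
    x = A @ x + B.ravel() * u + v + w            # (4.33)
    states.append(x)
    actions.append(u)
assert len(states) == N + 1 and len(actions) == N
```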
The value of w depends on the experiment. In (4.33) the w is constant, so it is the same for all time steps. We used the same parameter settings as in the previous chapter, with B = (0 1)^T, initial state x_0^T = (1 1), initial feedback L = (0.2 −0.2) and system noise σ_v = 10^{-4}. Each experiment was performed 5 times and we took N = 30 time steps. In the experiments we compared the results with the standard LQRQL approach; both the LQRQL and the extended LQRQL approach were applied to the same train sets.

We first did an experiment to see whether the correct value of l can be found. Because all parameters are estimated simultaneously, we also did an experiment to see whether the estimation of G has any positive effects on the estimation of H. Finally we did an experiment to see whether an arbitrary set point can be learned.

Experiment I: Can a correct value of l be found?

This requires knowledge about the correct value of l. In this experiment we took w equal to zero, so that no additional l is required and the estimated l̂ should be zero. We took as initial value l = 1. Figure 4.2(a) shows how the value of l̂ changes. For low values of σ_e the value of l̂ does not change. If the value of σ_e is larger than σ_v, the correct value of l is obtained. We did a similar experiment using w^T = (10 10), so that an l ≠ 0 was required; there the value of l̂ only became correct around σ_e ≈ 10. This indicates that the sufficient amount of exploration depends on w if w > σ_v.

Experiment II: Does the extension of LQRQL improve the results?

In this experiment we used w^T = (1 1). Figure 4.2(b) shows ‖L̂ − L*‖ for both approaches, where L* is the optimal feedback computed as in chapter 3. For low values of σ_e the value of L̂ equals the initial L in (4.34). The value of ‖L̂ − L*‖ decreased around σ_e ≈ 1, which is about the same scale as w. Figure 4.2(b) shows that the improvement for the extended approach requires less exploration. For high values of σ_e the value of l̂ becomes −0.5, and this makes the total costs for the extended LQRQL approach lower than the total costs of the standard LQRQL approach.

Figure 4.2. Results of the simulation experiments I and II. (a) Experiment I: l̂ as a function of σ_e. (b) Experiment II: ‖L̂ − L*‖ as a function of σ_e. The solid lines are the results for the extended approach, the dashed lines the results of the standard approach. The amount of system noise σ_v = 10^{-4} is indicated by the dotted line.

Experiment III: Can the set point be learned?

The set point was introduced in section 4.2 to indicate a desired equilibrium state. The extension of LQRQL was motivated by the inability of LQRQL to deal with the simultaneous optimization of the transient and stationary behavior of the system. Learning the set point means optimizing the stationary behavior of the system, when this set point is determined by the direct costs r. We did this experiment to see whether this is possible using the extended LQRQL. We changed the costs in (3.2) to

    r_k = (x_k − x_s)^T S (x_k − x_s) + u_k^T R u_k,

to specify a preference for an equilibrium point at x_s. Note that the equilibrium state will not be equal to x_s, because of the costs assigned to the control action that keeps it at x_s.

Figure 4.3(a) shows the total costs when x_s^T = (1 −1). It is clear that the cost reduction after σ_e ≈ 1 is much larger for the extended approach. The σ_e ≈ 1 is of the same scale as x_s, and doing the same experiment for a larger x_s shows that more exploration is required (not shown). For one result obtained using σ_e = 5, the state values are shown as a function of time in figure 4.3(b). The state values for the standard approach go to zero, while the state values for the extended approach go to two different values. The value of l̂ = −0.55 brings the equilibrium state closer to x_s.

The simulation experiments have shown that in case of w = 0 the resulting feedbacks L̂_QL and L̂_EX have the same outcome, and that when w ≠ 0 the extended approach requires less exploration to obtain a good result. They have also shown that the correct value of l will be found. Finally, the experiments showed that the extended approach is able to learn the set point.
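The changed direct costs of experiment III can be sketched as a small helper; S and R are illustrative, with x_s = (1, −1)^T as in figure 4.3(a):

```python
import numpy as np

# Set-point costs of experiment III:
#   r_k = (x_k - x_s)^T S (x_k - x_s) + u_k^T R u_k   (illustrative S, R)
S = np.eye(2)
R = np.array([[1.0]])
x_s = np.array([1.0, -1.0])

def direct_costs(x, u):
    d = x - x_s
    return float(d @ S @ d + u @ R @ u)

# At the set point with zero action the costs vanish ...
assert direct_costs(np.array([1.0, -1.0]), np.array([0.0])) == 0.0
# ... but holding the system at x_s generally needs u != 0, so the actual
# equilibrium trades state error against action costs.
assert direct_costs(np.array([1.0, -1.0]), np.array([0.5])) == 0.25
```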
Figure 4.3. Results of simulation experiment III. (a) The total costs J as a function of σ_e. (b) The state values x1 and x2 as functions of time k for one result using σ_e = 5. The solid lines are the results for the extended approach, the dashed lines the results of the standard approach.

4.5 Simulation Experiments with a Nonlinear System

In the previous section we showed that the extended LQRQL approach performs better than the standard approach when a constant w is not zero. For a nonlinear system (4.15) the value of w(x, u, v) does not have to be constant. If the nonlinearity is smooth, then in a small part of the state space the value of w(x, u, v) does not vary too much. If the average value of w(x, u, v) is not zero, the extended LQRQL approach should also perform better. To verify this we did some experiments with a nonlinear system, where we compared the performance of the extended LQRQL approach with the SI and standard LQRQL approach from chapter 3.

4.5.1 The nonlinear system

The mobile robot

As nonlinear system we used a mobile robot system. A description of the robot can be found in appendix B; there a model is also given that describes how the robot changes its position and orientation in the world. The change of orientation does not depend on the position, but the change in position depends on the orientation. The effect of the orientation on the change of position introduces the nonlinearity, which suggests that this system can be used as a nonlinear system.

The task of the robot is to follow a straight line that is defined in the world. The state is given by the distance δ to that line and the orientation α with respect to that line, as shown in figure 4.4(a). Without loss of generality we can take the x-axis of the world as the line to follow, where we take δ = y and α = φ. So the state is given by x^T = (α δ), where δ is given in meters and α in radians. Our simulation setup is shown in figure 4.4(b).

Figure 4.4. The top view of the robot. (a) The general task. (b) Our implementation, where the line to follow is the positive x-axis. This figure also indicates the robot's movement, given a positive, negative or zero u.

There are two control actions: the traversal speed v_t and the rotation speed ω. We gave the traversal speed a fixed value of 0.1 meters per second. As control action u we took the rotation speed ω. Using (B.6) we can express the state transition for the robot given this task:

    α_{k+1} = α_k + ωT   (4.35)
    δ_{k+1} = δ_k + (v_t/ω)(cos(α_k) − cos(α_k + Tω)).   (4.36)

The T is the sample time. According to (4.36), during one time step the trajectory of the robot describes a part of a circle with radius v_t/ω.

We used quadratic direct costs,

    r_k = x_k^T ( S_α 0 ; 0 S_δ ) x_k + u_k R u_k,   (4.37)

with S_δ = 1, R = 1 and a small positive S_α. This indicates that we want to minimize the distance to the line without steering too much. The S_α is not so important; the only reason to give it a small positive value is to assign costs to following the line in the wrong direction.
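The state transition (4.35)–(4.36) can be written down directly; the straight-line branch below is the ω → 0 limit of (4.36), added here to avoid division by zero:

```python
import math

# Sketch of the transition (4.35)-(4.36) for the line-following robot:
# state (alpha, delta), control u = omega, fixed traversal speed v_t.
V_T = 0.1          # traversal speed [m/s]
T = 0.35           # sample time [s], as in the experiments

def step(alpha, delta, omega):
    if abs(omega) < 1e-9:              # omega -> 0 limit of (4.36)
        return alpha, delta + V_T * T * math.sin(alpha)
    new_alpha = alpha + omega * T                                            # (4.35)
    new_delta = delta + (V_T / omega) * (math.cos(alpha)
                                         - math.cos(alpha + T * omega))      # (4.36)
    return new_alpha, new_delta

# Driving parallel to the line (alpha = 0) with no rotation leaves delta unchanged.
a, d = step(0.0, 1.0, 0.0)
assert (a, d) == (0.0, 1.0)
# A negative rotation (turning right, towards the line) decreases alpha.
a, d = step(0.0, 1.0, -0.1)
assert a < 0.0
```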
Comparing the results

Before starting the experiments we will give an indication of what the correct feedbacks should look like, and we will describe how we test the results. The task is to find a linear feedback L̂ by applying the SI approach and both LQRQL approaches. So the resulting feedback for the SI and standard LQRQL approach is

    u = L̂x = ( L̂_α  L̂_δ ) ( α ; δ ).   (4.39)

For the extended LQRQL approach there is an extra term l̂:

    u = L̂x + l̂ = ( L̂_α  L̂_δ ) ( α ; δ ) + l̂.   (4.40)

The extended LQRQL approach has to find this extra term l̂ from the train set.

Now we can already determine what the correct feedback should look like. If δ > 0 and α = 0, the robot has to turn right. This means that u < 0, so L̂_δ < 0 is correct. If α is also positive, the robot has to turn right even more, so L̂_α < 0 is also correct.

When the robot is closely following the line, the value of w(x, u, v) will be very small, because (4.35) and (4.36) describe a smooth nonlinear function. This function is symmetric around the origin of the state space, which indicates that the optimal feedback function will go through the origin. So it is possible to use (4.39) as a local linear approximation of the optimal feedback function, and the extended LQRQL approach has to result in l̂ = 0.

For large values of δ, the optimal feedback will move the robot in the direction of the line. This implies that α is no longer very small, so that the nonlinearity in (4.36) will have an effect on the change of δ. The value of w(x, u, v) is then no longer zero, and (4.39) will not be able to form a good local linear approximation of the optimal feedback function. The extended LQRQL approach can provide a better approximation for a large δ and should result in l̂ ≠ 0: the l̂ can be used to steer to the right even more, so for large δ > 0 the l̂ should be negative, while for δ < 0 the value of l̂ should be positive. This shows that the value of δ can be used to vary the value of w(x, u, v).

To compare the results of the three approaches we have to test the performance, so we need a criterion. The relative performance from the previous chapter cannot be used, because we do not know the optimal solution. We used the total costs over a fixed time interval,

    J = Σ_{k=0}^{N} r_k,   (4.41)

as criterion. The resulting feedback is obtained based on a train set that was generated by starting at a certain state. To test the feedback we started the robot in the same state. As time interval we took the same number of time steps as used during the generation of the train set. By taking only a short time interval we made sure that we tested the feedback in the same local part of the state–action space where the train set was generated.
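The test criterion (4.41) amounts to rolling out a candidate feedback on the model (4.35)–(4.36) and summing the direct costs. A self-contained sketch, with assumed cost weights (the exact S_α is not fixed here):

```python
import math

# Test criterion (4.41): run u = L_alpha*alpha + L_delta*delta + l on the
# robot model for N steps and sum direct costs (assumed weights below).
V_T, T = 0.1, 0.35
S_ALPHA, S_DELTA, R = 0.1, 1.0, 1.0     # S_alpha small, as the text suggests

def total_costs(L_alpha, L_delta, l, alpha=0.0, delta=1.5, N=57):
    J = 0.0
    for _ in range(N):
        u = L_alpha * alpha + L_delta * delta + l
        J += S_ALPHA * alpha**2 + S_DELTA * delta**2 + R * u**2
        if abs(u) < 1e-9:                               # omega -> 0 limit
            delta += V_T * T * math.sin(alpha)
        else:
            delta += (V_T / u) * (math.cos(alpha) - math.cos(alpha + T * u))
            alpha += u * T
    return J

# A feedback that steers towards the line should beat doing nothing.
assert total_costs(-0.5, -0.5, 0.0) < total_costs(0.0, 0.0, 0.0)
```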
4.5.2 Experiment with a nonzero average w

We first did an experiment with a nonzero average w. We could do this by focusing on a part of the state space where δ is large. As initial distance to the line we used δ0 = 1.5 meters, so that we knew that the extended LQRQL approach had to result in an l̂ < 0. Under these conditions the extended LQRQL approach should be able to perform better than the SI and standard LQRQL approach.

The setting

We took as settings:

• Sample time T = 0.35 seconds.
• Number of time steps N = 57. With T = 0.35 seconds, the robot drives for approximately 20 seconds at a speed of 0.1 meters per second; thus the robot traverses approximately 2 meters.
• Initial orientation α = 0.
• Initial feedback L = (−10^{-3}  −10^{-3}), so that hardly any prior knowledge is included. This feedback makes the robot move towards the line.
• Gaussian exploration noise with σ_e = 0.35. This is small enough, because the tolerated rotation on the real robot is limited: a higher level of exploration would make it impossible to use similar settings on the real robot.

Because of the exploration we were not able to exactly determine the local part of the state space in which the train set was generated. On average most of the training samples were obtained in the part of the state space for which l̂ should be negative. In the worst case the robot stops at δ = −0.5 meters, for which l̂ should be positive. This means that the average w was not zero.

We generated one train set and used the SI, the standard LQRQL and the extended LQRQL approach. The resulting feedbacks of the three approaches were tested by starting the robot in the same initial state. For each test we computed the total costs according to (4.41). In table 4.1 the resulting feedbacks and the total costs of the test runs are shown.

Table 4.1. Results of the experiment with nonzero average w: the resulting feedbacks L̂_α and L̂_δ (and l̂ for the extended approach; "×" for the approaches without it) and the total costs J of the test runs, for the SI, standard LQRQL and extended LQRQL approaches.
The resulting value of l̂ is negative, just as we expected. The nonzero value of l̂ suggests that the average w is not zero. Since the extended approach can deal with this situation, it outperforms the other approaches. The values of L̂ for both LQRQL approaches are almost the same, while the L̂ of the SI approach is clearly different; the extended LQRQL approach has the lowest total costs.

In figure 4.5 five trajectories are shown. The first shows that using the initial feedback without exploration makes the robot move slowly towards the line. The second trajectory shows that the generated train set primarily depends on the exploration and not on the initial feedback. The other three trajectories are the test runs with the three resulting feedbacks.

The trajectory of the SI approach is the "curly" line in figure 4.5 that moves to the left. Because the value of L̂_δ is too large, the robot rotates too much for large values of δ. At the end of the trajectory the value of δ is so small that the robot no longer rotates around its axis.

The trajectory of the standard LQRQL approach resembles that of the initial feedback L. The reason is that L̂_δ is too small, so the robot hardly rotates. Therefore the orientation α remains very small, so the higher value of L̂_α does not contribute to the control action. Although the extended LQRQL approach has approximately the same L̂, because of l̂ it turns faster in the direction of the line. This explains the lower total costs.

Figure 4.5. The trajectories in the world. The line with "No explo." is the trajectory when using the initial feedback L. For the line with "Data" the exploration is added to generate the train set. The other three lines are the trajectories of the test runs.
4.5.3 Experiments for different average w

The previous experiment has shown that for a large value of δ the extended LQRQL approach performs best. We also indicated that around the line the average value of w is zero. There the SI and standard LQRQL approach already have the correct value of l = 0, while the extended approach has to estimate the correct l̂, which can result in an l̂ ≠ 0. If we look at figure 4.1, we see that a small change in the optimal feedback function may lead to a large change in the value of l. This means that a small difference in the train set may lead to a large difference in the estimated l̂. This suggests that the extended LQRQL approach will perform slightly less well when the train set is generated near the line. Another problem is that the contribution of the feedback function to the control action is very small when the train set is generated close to the line: the trajectory is mainly determined by the exploration. In these situations it can be very hard to evaluate the feedback function.

We did some experiments to see how the performance of the three approaches depends on the size of the average w. In the same way as in the previous experiment we used the initial distance δ0 to determine the average w: further away from the line, the average value of w becomes larger. When we have a nonlinear system, we can see w_k = w(x_k, u_k, v_k) as system noise that does not have to be Gaussian or white.

We varied the initial distance δ0 from 0 to 2 in steps of 1/4. For each δ0 we generated 500 train sets, each with a different random exploration noise sequence; we used the same σ_e for all train sets. The SI, standard LQRQL and extended LQRQL approaches were applied to all train sets.

All resulting feedbacks can be tested, but we first looked at the reliability of the outcome. Since the amount of exploration must be higher than the amount of system noise, it can turn out that in some train sets insufficient exploration was used, so the estimated Q-function may not be correct. In chapter 3 we showed that this is the case when the amount of exploration is not enough. If sufficient exploration is used, the estimated quadratic Q-function is positive definite; in that case the estimated matrix Ĥ only has positive eigenvalues. We know for sure that if Ĥ has negative eigenvalues, the Q-function is not correct. For both LQRQL approaches we therefore rejected the results with negative eigenvalues for Ĥ. For the extended LQRQL approach, the situations for which l̂ was clearly wrong were also removed: we rejected the results for which |l̂| > 1, because for these values of l̂ the term L̂x hardly contributed to the motion of the robot. For the SI approach we do not have a criterion that indicates the reliability of the resulting feedback, so we used all train sets. If the resulting feedback is not good, then this is because the train set is not good enough; for all three approaches the resulting feedback function depends only on the train set.

Figure 4.6(a) shows the fractions of the 500 runs that were used for both LQRQL approaches. For the standard LQRQL approach about 75 percent was used.
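The two reliability checks can be sketched as a small filter; the eigenvalue test follows the text, and the |l̂| > 1 threshold is the one quoted for the extended approach:

```python
import numpy as np

# Reliability filter for estimated Q-functions: reject H_hat with
# non-positive eigenvalues and (extended approach) offsets |l_hat| > 1.
def reliable(H_hat, l_hat=None):
    sym = (H_hat + H_hat.T) / 2.0            # symmetrize before the eigentest
    if np.any(np.linalg.eigvalsh(sym) <= 0.0):
        return False                          # Q-function cannot be correct
    if l_hat is not None and abs(l_hat) > 1.0:
        return False                          # clearly wrong offset
    return True

assert reliable(np.eye(3))
assert not reliable(np.diag([1.0, -0.5, 2.0]))
assert not reliable(np.eye(3), l_hat=1.5)
```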
For the extended approach the percentage without negative eigenvalues for Ĥ was already below 50 percent; when also the unreliable l̂ results were removed, about one third of the train sets were kept. This indicates that the extended LQRQL approach is the most sensitive to the particular train set.

The resulting feedback functions that were not rejected were tested. Figure 4.6(b) shows the average total costs J for the different initial δ0 for the three approaches. When figure 4.6(b) is plotted with error bars the figure becomes unclear, therefore we omitted the error bars; the maximal standard deviation was 40.8, for δ0 = 2 for the extended LQRQL approach.

Figure 4.6. The results for different initial δ. (a) Fraction of the train sets used. (b) The J as a function of δ0.

The figure clearly shows that the total costs increase as δ0 increases, because δ is part of the direct costs in (4.37). For small values of δ0 the SI and standard LQRQL approach perform best. For δ0 = 0 the SI and the standard LQRQL approach have J = 0. This is because the robot is already on the line and facing the right direction, so x = 0 and therefore also u = 0. Because l̂ ≠ 0, the total costs for the extended LQRQL approach are not zero. We can conclude that if the train set is generated close to the line, the SI approach has the lowest costs.

For high values of δ0 the situation is reversed. The total costs for the SI approach are much higher than those of the LQRQL approaches. The extended LQRQL approach has the lowest costs, as expected, followed by the standard LQRQL approach. This is also what we observed in the preliminary experiment.
4.6 Experiments on a Real Nonlinear System

4.6.1 Introduction

We did experiments with a real robot to see whether the simulation results also apply to real systems. There are some important differences with the simulated robot:

• The time steps do not have to be constant. State information is not directly available and is obtained via a communication protocol between the two processing boards in the robot. The duration of the communication varies a little. This introduces some noise, because at discrete time steps the state changes are influenced by the duration of the time step. This has the effect that similar control actions can lead to different state transitions.

• The size of the control action that can be applied is bounded. Too high values for u may ruin the engine of the robot, so for safety we introduced the action bounds u_min = −0.18 ≤ u ≤ u_max = 0.18. When generating the train set it was possible that actions were tried that could not be tolerated. When a u outside this interval had to be applied to the robot, we replaced it by the value of the closest action bound; this value was also used for the train set. The consequence is that for high values of σ_e, the exploration noise is no longer Gaussian.

• The robot has a finite acceleration. A consequence is that the robot might not respond so quickly to fast varying control actions when exploring. So effectively the amount of exploration contributing to the movement of the robot is lower than the amount of exploration that is added to the control action. This also means that the trajectory during a time step is not exactly a circle. This part of the dynamics of the system was not included in the model used for the simulations.

• There is wheel spin. The effect of wheel spin is that the robot does not move according to the rotation of the wheels. This can mean that the real position of the robot does not change while the wheels are rotating. The odometry is based on measuring the speed of the wheels, and the wheels accelerate faster during wheel spin. Since we do not use the robot's real position, this does not affect us.

The state value was derived from the odometry. This means that the robot keeps track of its position in the world by measuring the speed of both wheels. We then translated this into a value for δ and α.

4.6.2 The experiments

It was infeasible to generate the same amount of data sets as in the simulation experiments. We varied the initial δ from 0.25 to 1.75 in steps of 0.5, and we generated data for four different sequences of exploration noise, so in total we generated 16 data sets. For one exploration sequence the four data sets generated are shown in figure 4.7(a).
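The effect of the action bounds on the exploration noise can be illustrated directly; for a σ_e well above u_max, a large fraction of the applied actions sits exactly on a bound, so the applied noise is clearly non-Gaussian:

```python
import numpy as np

# Action bounds on the real robot: desired actions outside [u_min, u_max]
# are replaced by the closest bound (near the line the feedback term is ~0,
# so the desired action is essentially the exploration noise).
U_MIN, U_MAX = -0.18, 0.18

rng = np.random.default_rng(3)
sigma_e = 0.35
u_desired = rng.normal(0.0, sigma_e, size=1000)
u_applied = np.clip(u_desired, U_MIN, U_MAX)

assert u_applied.min() >= U_MIN and u_applied.max() <= U_MAX
# A substantial fraction of applied actions sits exactly on the bounds,
# which a Gaussian distribution never does.
assert np.mean(np.isin(u_applied, [U_MIN, U_MAX])) > 0.2
```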
Figure 4.7. The experiment with the real robot. (a) The generation of four train sets, using the same exploration noise sequence. (b) The average total costs J as a function of the initial δ.

In figure 4.7(b) we see the average total costs for each test of the three different methods; here we did not remove any train set for the LQRQL approaches. The results in figure 4.7(b) show that on average all three approaches perform almost the same; only the extended approach has higher total costs for δ0 = 0.25. The latter agrees with the simulation experiments. If we compare figure 4.7(b) with figure 4.6(b), we see that the performances of both LQRQL approaches are a little less in practice than in simulation, while the SI approach seems to perform better in practice than in simulation.

These low total costs for the SI approach are misleading. In figure 4.8(a) we see, for all four exploration noise sequences, the total costs as a function of the initial δ for the SI approach. For one sequence the total costs were very high (for δ0 = 1.75 the total costs were more than 600): there the resulting feedback made the robot move away from the line. For the three other sequences we see that the total costs are very low. What happens in these situations is that the feedbacks found by the SI approach are much too high. This would lead to rotations as in figure 4.5, but the action bounds prevented the robot from rotating. Instead the action applied to the robot alternated between u_min and u_max for the first 15 to 20 time steps. After that the robot was very close to the line and it started following the line as it should. In these cases the low total costs were caused by the action bounds, while the resulting feedback function itself is wrong — which indicates that the SI approach did not find the optimal linear feedback!
Many unreliable feedbacks turn slower towards the line and are therefore less aﬀected by the acceleration. For the standard LQRQL approach 9 of the 16 train sets resulted in ˆ a positive deﬁnite H and for the extended LQRQL approach only 5. (b) J for extended LQRQL as function of δ0 for all four exploration noise sequences.2 1. LQRQL FOR NONLINEAR SYSTEMS The actions taken by the resulting feedbacks of both LQRQL approaches were always between umin and umax .8.72 CHAPTER 4. In ﬁgure 4. Then the robot was facing the wrong direction and had to rotate back. Only the extended approach did not exploit the action bounds. the robot did not always stop rotating fast enough.4 1. All 16 performances for SI and extended LQRQL.8 1 δ 1.4 1.6 0.8 1 δ 1.6 0.4 0. We did not do this for the experiments on the real robot. because in some cases an “unreliable” feedback performed better. 700 300 600 250 500 200 400 150 J J 300 100 200 100 50 0 0. Figure 4.6 1.8 (a) J for SI as function of δ0 for all four exploration noise sequences.8 0 0. In the simulation experiments we rejected train sets that resulted in negative eigenvalues ˆ for H because we considered them unreliable.4 0.2 1. But due to the acceleration.6 1.2 0.2 0. For one sequence the total costs are as low as for the three sequences of the SI approach. This indicates that the extended LQRQL approach optimized the local linear feedback function for the real nonlinear system. The total costs for these feedbacks were not always the lowest. This indicates that the performance depends on the particular sequence of exploration noise. The reliable feedbacks made the robot rotate faster in the direction to the line than the unreliable feedbacks. We noticed that the plot for the standard LQRQL approach looks quite similar.8(b) we see for all four exploration noise sequences the total costs as a function of the initial δ for the extended LQRQL approach. This was caused by the acceleration of the robot. . 
We see that for some exploration sequences the costs are quite low while for others the costs are higher. This implies that the reliability indication based on the eigenvalues of Ĥ is not appropriate for the real robot if the acceleration is ignored.
The performances of the standard and extended LQRQL approach are similar to those of the simulation experiments.

4.7 Discussion

We started this chapter with a presentation of a number of methods for the control of nonlinear systems, and introduced the extended LQRQL approach. Some of these methods were based on local linear approximations of the system. In [17] local linear feedbacks were used for different partitions of the state space of a nonlinear system. In [44] bumptrees were used to form a locally weighted feedback function, by sampling for a few time steps. This means the models are not completely local. In both cases the resulting local linear feedback functions were obtained by the standard LQRQL approach. This is because the feedback has to go through the origin. A consequence is that the linear feedbacks for partitions far away from the origin will become smaller. The consequence is that it is unlikely that linear models far away from the origin are optimal.

In the extended approach we do not estimate the parameters of a global quadratic function through the origin, but we use the data to estimate the parameters of a more general quadratic function. In this way the resulting feedback is no longer restricted to a linear function through the origin. We tested the extended LQRQL approach on a nonlinear system in simulation and on a real nonlinear system, and compared it with the SI and standard LQRQL approach from chapter 3. The SI approach either finds a wrong feedback or it finds feedbacks that are too large, so that most control actions are limited by the action bound.

It is possible to use multiple linear models to construct one global nonlinear feedback function. The result is equivalent to Gain Scheduling as described in chapter 2. This means that if a feedback function is based on local linear approximations, the extended approach has to be used. The standard LQRQL should only be used if it is known that the optimal feedback function goes through the origin.
In summary, the experiments have shown that the SI approach does not optimize a local linear feedback correctly. For the standard and extended LQRQL approaches we can conclude that they correctly optimize a local linear feedback. We were interested whether we could use Q-learning to find a feedback which is locally optimal. We showed that the standard LQRQL approach as presented in chapter 3 would not lead to an optimal linear feedback: for standard LQRQL the local linear models are not only based on the train set but also on the position of the partition with respect to the origin. The extended LQRQL approach is able to estimate the appropriate set point for each linear model. The consequence for the feedback function is that an extra constant was introduced. The results indicate that if we explore in a restricted part of the state space, we have to use the extended approach. This implies that if we want to form a nonlinear feedback based on local linear feedbacks, the extended approach has to be used.
4.8 Conclusions

Many control design techniques for nonlinear systems are based on local linear approximations. We studied the use of LQRQL for obtaining a local linear approximation of a nonlinear feedback function. A nonlinear system can be regarded as a linear system with an additional nonlinear correction. If in a local part of the state space the average correction is not zero, the SI and standard LQRQL approach from chapter 3 will approximate the wrong function. We introduced the extended LQRQL approach, that will result in a linear feedback plus an additional constant. The experiments on a nonlinear system have shown that if the additional constant is not zero, the extended approach will perform better.
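This conclusion can be illustrated with a toy example (not from the thesis; the system parameters a, b, c and the gains below are made up): a scalar system x_{k+1} = a x_k + b u_k + c, where the constant c plays the role of the local average nonlinear correction. A purely linear feedback u = l x leaves a steady-state offset, while the extended form u = l x + u0 can cancel it.

```python
# Hedged toy illustration (not the thesis system): a locally linearized
# system x_{k+1} = a*x + b*u + c, where c acts as the average nonlinear
# correction. The extended feedback u = l*x + u0 carries the additional
# constant that a purely linear feedback u = l*x lacks.
def simulate(l, u0, steps=200):
    a, b, c = 0.9, 1.0, 0.5       # illustrative local model parameters
    x = 2.0                       # initial state
    for _ in range(steps):
        u = l * x + u0            # linear feedback plus constant
        x = a * x + b * u + c
    return x                      # approximate steady state

x_linear = simulate(l=-0.5, u0=0.0)     # constant ignored: offset remains
x_extended = simulate(l=-0.5, u0=-0.5)  # constant cancels c: x -> 0
```

With these numbers the closed loop is x_{k+1} = 0.4 x_k + (u0 + 0.5), so the linear feedback settles at 0.5/0.6 ≈ 0.83 away from the origin while the extended feedback settles at 0.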
Chapter 5

Neural Q-Learning using LQRQL

5.1 Introduction

The LQRQL approach was derived for linear systems with quadratic costs. The use of a quadratic Q-function was motivated by the possibility to use a linear least squares estimation to obtain the parameters of the Q-function. Then the parameters of the linear feedback follow directly from the parameters of the Q-function. Although this approach can also be applied to nonlinear systems or other cost functions, the resulting feedback will always be linear. This is a consequence of the restriction that the Q-function is quadratic.

If the system is nonlinear, it is very likely that the Q-function is not quadratic. In such situations it is possible to use a general approximator, like a feed forward network, to approximate the Q-function. The feedback can be obtained by finding for all states the control action that minimizes the Q-function. In order to have a continuous feedback function, another network can be used to represent the feedback function. This is the actor/critic configuration as described in section 2.3. This has the drawback that two networks have to be trained, where the training of one network depends on the other. This makes the training very hard. The choice is either to have a discontinuous feedback or to use two networks.

In this chapter we propose a method in which we only use one network and still have a continuous feedback function. The idea is to combine LQRQL with a carefully chosen network, and we called this Neural Q-Learning. The feedback can be derived in the same way as in LQRQL. So instead of focusing on a local part of the state space, we looked at larger parts of the state space. The resulting nonlinear feedback functions have to be globally valid. We applied the Neural Q-Learning method and compared the result with Gain Scheduling based on the extended LQRQL approach. The simulation and real experiments were based on the same nonlinear system as in chapter 4.
We will show that this can have the eﬀect that the feedback function is not a continuous function and we will explain why it is desirable to have a continuous feedback function.
5.2 Neural Nonlinear Q-functions

Instead of a quadratic Q-function, a general approximator like a feed forward neural network can be used. It describes the function Q(ξ, w), where the vector w indicates all weights of the network and ξ represents the input of the Q-function. The network has only one output:

Q(ξ, w) = Γ_o(w_o^T Γ_h(W ξ + b_h) + b_o),   (5.1)

with Γ_o and Γ_h the activation functions of the units in the output and hidden layer. The rows of matrix W contain the weight vectors w_{h,i} of the hidden units and the vector b_h contains the corresponding biases. The weights of the output unit are given by w_o and the bias by b_o.

Figure 5.1. The network with two hidden units (Γ_1, Γ_2: tanh; Γ_o: linear).

If a network is used to represent a Q-function we might obtain a typical function as in figure 5.2(a). This is drawn for a scalar state and control vector, and the weights of the network are picked randomly. It is clear that this Q-function is a smooth continuous function. In figure 5.2(b) the top view of the Q-function is shown with lines indicating similar Q-values. In this plot also the greedy feedback function is shown, which is computed by taking for each x the value of u for which the Q-value is minimal.

Given this Q-function the feedback has to be determined. According to (2.15) the greedy feedback is computed by taking the minimum value of the Q-function for each state. Using the greedy feedback from the network has two drawbacks:

• The computation of the greedy action involves finding the extremum of a nonlinear function. In general this is not a trivial task.

• Even when the Q-function is smooth as in figure 5.2(a), it is still possible that the greedy feedback function (shown in figure 5.2(b)) is not a continuous function.
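The network in (5.1) can be sketched as follows (weights and sizes are made up for illustration):

```python
import numpy as np

# Sketch of the Q-network in (5.1): one hidden tanh layer, linear output.
# The rows of W are the hidden weight vectors w_{h,i}; b_h the hidden
# biases; w_o, b_o the output weights and bias. Sizes are illustrative.
rng = np.random.default_rng(0)

def make_network(n_in, n_hidden):
    return {
        "W": rng.normal(scale=0.1, size=(n_hidden, n_in)),
        "b_h": np.zeros(n_hidden),
        "w_o": rng.normal(scale=0.1, size=n_hidden),
        "b_o": 0.0,
    }

def q_value(net, xi):
    # Q(xi, w) = Gamma_o(w_o^T Gamma_h(W xi + b_h) + b_o), Gamma_o linear
    hidden = np.tanh(net["W"] @ xi + net["b_h"])
    return float(net["w_o"] @ hidden + net["b_o"])

net = make_network(n_in=2, n_hidden=2)  # two hidden units as in figure 5.1
q = q_value(net, np.array([0.5, -0.2]))
```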
Figure 5.2. (a) An arbitrary Q-function, formed by a feed forward network. (b) The top view of the Q-function with lines indicating the height; the bold line indicates the greedy feedback.

A discontinuous feedback can cause problems for real applications when there is noise in the system. Near the discontinuities a little noise can have a high influence on the control action, so the behavior of the system becomes very unpredictable. The second drawback can be overcome by introducing a Dynamic Output Element [60]. The static feedback is followed by a low pass filter, removing the effects of noise at discontinuities of the feedback function. This approach is not a solution to the first drawback.

One approach that removes both drawbacks is the Actor/Critic approach described in chapter 2. A second function approximator is introduced to represent the feedback function. If a continuous differential function is used as actor, the second drawback is overcome. In the actor/critic configuration the actor is trained based on the critic. By training the actor based on the critic also the first drawback is removed.

The two networks in the actor/critic configuration have to be trained, and this is the major problem of this configuration. Training two networks means that training parameters and initial settings have to be selected for two networks. It is very hard to determine beforehand the appropriate settings for the training of the networks. Since the settings influence the result, the interpretation of the results is very hard.

The main difference between the actor/critic configuration and Q-learning is shown in figure 5.3 and is the way in which the feedback is derived from the Q-function. In the Q-learning configuration the Q-function implicates the feedback, so that the Q-function and feedback are represented by one and the same network.
In this way the control action is not only based on the current state value.2. The bold line indicates the greedy feedback.4 0. but also on the previous control actions.
Figure 5.3 panels: (a) actor/critic configuration: critic network Q(x, u), actor network g(x), system; (b) Q-learning: Q-function Q(x, u) ⇒ feedback g(x), system.
Figure 5.3. The actor/critic conﬁguration and Qlearning. The dashed arrow indicates the training of the actor based on the critic. The implication arrow indicates that the feedback function directly follows from the Qfunction.
In the LQRQL approach the greedy feedback is computed by setting the derivative of the Q-function with respect to the control action to zero. So for LQRQL the quadratic Q-function implicates the linear feedback function, and the parameters of the feedback function can be expressed as functions of the parameters of the Q-function. This is a property we want to keep for the neural Q-function. We will propose a method that uses LQRQL to keep this property for a neural Q-function. In this method the feedback function can be expressed using the weights of the neural Q-function. No second network is required. The method will also guarantee that there are no discontinuities in the feedback function.
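This derivative computation can be sketched for a quadratic Q-function (the matrix H below is made up; in the thesis the blocks of Ĥ come from the least squares estimation): writing Q(x, u) = [x; u]^T H [x; u] and setting ∂Q/∂u = 0 gives u = −H_uu^{-1} H_ux x, so the feedback is a function of the Q-function parameters alone.

```python
import numpy as np

# Sketch of the LQRQL property: the quadratic Q-function implies a linear
# feedback. With Q(x,u) = [x;u]^T H [x;u] and H_uu positive definite,
# dQ/du = 0 yields u = -H_uu^{-1} H_ux x, i.e. a linear feedback L.
# The matrix H below is illustrative, not an estimate from data.
def greedy_feedback(H, n_x):
    H_ux = H[n_x:, :n_x]               # lower-left block of H
    H_uu = H[n_x:, n_x:]               # lower-right block of H
    return -np.linalg.solve(H_uu, H_ux)

H = np.array([[2.0, 0.0, 0.5],
              [0.0, 2.0, 0.2],
              [0.5, 0.2, 1.0]])        # symmetric, scalar action
L = greedy_feedback(H, n_x=2)
```

For x = [x1; x2] the greedy action is then u = L x; with this H the result is L = [−0.5 −0.2].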
5.3
Neural LQRQL
Neural LQRQL¹ is based on the standard LQRQL described in chapter 3, but then with a feed forward network to represent the Q-function. The standard LQRQL approach is based on three steps. First the least squares estimation is applied to the train set, resulting in an estimation θ̂. Then the parameters of the quadratic Q-function Ĥ are derived from the estimated θ̂. Finally the value of the linear feedback L̂ is computed based on Ĥ using (3.16). This indicates that if it is possible to derive θ̂ from the neural Q-function, the feedback can be computed from θ̂ analogous to the LQRQL approach.

In LQRQL the quadratic Q-function was represented as a linear multiplication of the quadratic combinations of the state and action with the estimated parameters. This was introduced to make it possible to use a linear least squares estimation for estimating θ based on the measurements. In section 3.2.4 the Q-function is written as

Q^L(x_k, u_k) = vec(φ_k φ_k^T)^T vec(H^L) = Ω_k^T θ.   (5.2)

¹ An almost similar approach in [33] is called Pseudo-Parametric Q-Learning.
Here H^L represents the parameters of the quadratic Q-function and φ_k the vector with the state and control action. This can be regarded as just a multiplication of two vectors Ω_k and θ. We see that writing the quadratic Q-function as a linear function requires that the input vector is "quadratic". Instead of having a quadratic function with input φ, we have a linear function with input Ω. Input Ω contains all quadratic combinations of elements of φ and vector θ contains the corresponding values from H^L. The representation of the Q-function in (5.2) can also be viewed as a one layer feedforward network with a linear transfer function and input Ω_k. We can extend this representation by writing it similar to (5.1):²

Q(x_k, u_k) = Γ_o(w_o^T Γ_h(W Ω_k + b_h) + b_o).   (5.3)
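The "quadratic input" construction in (5.2) can be sketched as follows (the state, action and H values are illustrative): Ω is built from all quadratic combinations of φ, and the quadratic form φ^T H φ then becomes a linear function of Ω.

```python
import numpy as np

# Sketch of the input construction behind (5.2): phi stacks state and
# action; Omega = vec(phi phi^T) holds all quadratic combinations; the
# quadratic Q-function is then linear in Omega with theta = vec(H).
def quadratic_input(x, u):
    phi = np.concatenate([x, u])
    return np.outer(phi, phi).flatten()

x = np.array([1.0, 2.0])             # illustrative state
u = np.array([0.5])                  # illustrative action
H = np.diag([1.0, 2.0, 3.0])         # illustrative quadratic parameters
omega = quadratic_input(x, u)
theta = H.flatten()                  # theta = vec(H)
q_linear_form = omega @ theta        # Omega^T theta ...
phi = np.concatenate([x, u])
q_quadratic_form = phi @ H @ phi     # ... equals phi^T H phi
```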
If we take Γ_o and Γ_h as linear functions and the biases b_h and b_o all zero, then there exist values for W and w_o such that (5.3) equals (5.2). Note that these weights are not unique, because the number of weights in (5.3) is higher than the number of parameters of (5.2). Since we want to stay as close as possible to the original LQRQL approach, we will only consider activation functions that resemble linear functions. The neural representation is introduced to deal with non quadratic Q-functions. The network in (5.3) with linear transfer functions can not represent these functions, therefore we have to use nonlinear transfer functions. Let the output of hidden unit i be given by:

Γ_{h,i}(w_{h,i}^T Ω_k + b_{h,i}) = tanh(w_{h,i}^T Ω_k + b_{h,i}).   (5.4)
The hyperbolic tangent is nonlinear, but in the origin it is zero and its derivative is one. So for small values of w_{h,i}, (5.4) still resembles a unit with a linear transfer function. Only when the weights of the hidden units become large will the Q-function no longer be a quadratic function.
5.3.1
Deriving the feedback function
Given the Q-function and its input, the values of θ̂ should be obtained. In the case of the standard LQRQL these are just the parameters that are estimated, so they are immediately available. For the neural Q-function this is different. The weights are the parameters that are estimated based on the train set, so the value of θ̂ should be derived from the weights. Given Q^L(Ω_k) from (5.2), the parameters θ can also be obtained by computing the derivative of this function with respect to the input Ω. This means that parameter θ_i can be computed according to:

θ_i = ∂Q^L(Ω_k) / ∂Ω_i.   (5.5)
This Q function still depends on the feedback function used to generate the data, which can also be a nonlinear function. Therefore we will omit the superscript L.
2
This shows that θ_i does not depend on Ω_k, so the values of θ_i do not depend on the state and control action. When this is computed for all parameters of θ, then H^L can be derived and therefore L can be computed. In the same way θ̂_i(Ω)³ can be computed for the neural Q-function in (5.3):

θ̂_i(Ω_k) = ∂Q(Ω_k)/∂Ω_i = Σ_{j=1}^{n_h} w_{o,j} w_{h,i,j} (1 − tanh²(w_{h,j}^T Ω_k + b_{h,j})),   (5.6)
where w_{h,i,j} represents the weight from input Ω_i to unit j and n_h indicates the number of hidden units.

When θ̂(Ω_k) is available, it can be rearranged into Ĥ(Ω) so that L̂(Ω) can be computed. The control action can be computed according to u_k = L̂(Ω_k) x_k. This is still a linear multiplication of the state with L̂(Ω_k), but it is a nonlinear feedback function because Ω_k contains x_k. The problem is that it also contains u_k, which is the control action that has to be computed. In order to solve this problem (5.6) can be split into two parts:

θ̂_i = Σ_{j=1}^{n_h} w_{o,j} w_{h,i,j} (linear)  −  Σ_{j=1}^{n_h} w_{o,j} w_{h,i,j} tanh²(w_{h,j}^T Ω_k + b_{h,j}) (nonlinear).   (5.7)
The first part is indicated with "linear", because this corresponds to the θ̂_i that would be computed when all hidden units are linear. The resulting feedback function derived from this network would also be linear. The second part is indicated with "nonlinear" because this is the effect of the nonlinearities introduced by the hyperbolic tangent transfer functions. A linear feedback can be obtained from the network by just ignoring the hyperbolic tangent functions. Define θ̃_i by:

θ̃_i = Σ_{j=1}^{n_h} w_{o,j} w_{h,i,j}.   (5.8)
This leads to a vector θ̃. Analogous to θ̂ we can rearrange θ̃ to form a quadratic function with parameters H̃. From H̃ we can derive a linear feedback L̃.

The feedback L̃ can be used to compute the control action ũ = L̃x. This control action can be used to control the system, but it can also be used to obtain the vector Ω̃. The vector Ω̃ represents the quadratic combination of the state x with the control action ũ. This can be used to obtain the nonlinear feedback

u_k = −(H̃_{uu}(Ω̃_k))^{-1} H̃_{ux}(Ω̃_k) x_k = L̃(Ω̃_k) x_k.   (5.9)

Here the linear feedback L̃(Ω̃_k) is a function of x_k, so that the resulting feedback function is nonlinear. If we compute this feedback function for the Q-function in figure 5.2(a) then this results in the feedback function shown in figure 5.4.
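The computation of θ̃ from the weights, and its agreement with θ̂(Ω) for small weights, can be sketched as follows (the weight values are random and small, purely for illustration):

```python
import numpy as np

# Sketch of (5.6) and (5.8): theta_hat_i(Omega) is the derivative of the
# network output with respect to input Omega_i; dropping the tanh^2 terms
# leaves the "linear" part theta_tilde. Sizes and weights are illustrative.
rng = np.random.default_rng(1)
n_in, n_h = 4, 3
W = rng.normal(scale=0.01, size=(n_h, n_in))   # small hidden weights
b_h = np.zeros(n_h)
w_o = rng.normal(size=n_h)                     # linear output unit

def theta_hat(omega):
    act = np.tanh(W @ omega + b_h)
    # (5.6) for all i at once: sum_j w_o[j] * (1 - tanh^2) * W[j, i]
    return (w_o * (1.0 - act ** 2)) @ W

theta_tilde = w_o @ W                          # (5.8): tanh ignored
```

At Ω = 0 (with zero biases) the tanh terms vanish, so θ̂(0) equals θ̃ exactly; for small hidden weights the two stay close everywhere, which is why the linear feedback L̃ can always be read off from the network.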
³ Strictly speaking, θ̂_i(Ω) is not an estimation. For consistency with the previous chapters we will keep indicating it with a hat.
Figure 5.4. Nonlinear feedback example. The feed forward network forms the same Q-function as in figure 5.2(a). This figure shows the resulting linear feedback L̃ and the nonlinear feedback function according to (5.9). For comparison also the greedy feedback from figure 5.2(b) is plotted.
5.3.2
Discussion
If we want to interpret the resulting feedback function, we have to look at (5.6). In case w_{h,i}^T Ω_k + b_{h,i} is large for a hidden unit i, the factor (1 − tanh²(w_{h,i}^T Ω_k + b_{h,i})) will be very small. The contribution of this hidden unit to θ̂_i will be very small. In case w_{h,i}^T Ω_k + b_{h,i} is zero for a hidden unit i, the tanh(w_{h,i}^T Ω_k + b_{h,i}) will be very small. The hyperbolic tangent can be ignored and the contribution of this hidden unit is the same as for the computation of the linear feedback L̃. We therefore can interpret the feedback function as a locally weighted combination of linear feedback functions. Each hidden unit will lead to a linear feedback function and the value of the state x determines how much the feedback function of each hidden unit contributes to the total feedback.

When we compute L̃ all hidden units are weighted equally. This results in a linear feedback function that is globally valid, because it no longer depends on the state value. When the neural approach is applied to the LQR task, then the resulting feedback should not depend on the state value. The weights of the hidden units should be very small, so that the feedback function should become linear.

The linear feedback L̃ does not have to be the same as the result of LQRQL, when applied to the same data set. In standard LQRQL the parameters of a quadratic Q-function are estimated. When the Q-function to approximate is not quadratic, then a small part of the train set can have a large influence on the estimated parameters. This is the case for training samples obtained in parts of the state space where the Q-function deviates a lot from a quadratic function. If (5.3) is used to approximate the Q-function, the influence of these points is much less. One hidden unit can be used to get a better approximation of the true Q-function for that part of the state space. Since L̃ is based on the average linear feedback formed by the hidden units, the contribution of these training samples on the resulting linear feedback L̃ is much smaller. This means that if the true Q-function is not quadratic, the linear feedback derived from the network is more reliable than the result of the standard LQRQL approach. This does not hold for the nonlinear L̃(Ω_k), because it depends on the weighting of the linear feedbacks based on the state value.

The linear feedback L̃ can always be derived from the network. It might be possible that for some state values the contribution of all hidden units to θ̃ is zero. Then no feedback can be computed. This can only occur when all weights of the hidden units are large. This means that it is important to prevent the weights from becoming too large. One way of doing this is by incorporating a form of regularization in the learning rule. In this way it is prevented that the resulting feedback becomes very nonlinear, while the simpler linear feedback is also possible. Another way is to scale down the reinforcements r, so that the Q-function to approximate becomes smoother. The easiest way is to make sure that the number of hidden units is not too large, so that it becomes very unlikely that the weights become too large.

5.4 Training the Network

The weights of the network in (5.3) can be found by training the network according to one of the methods described in section 2.3. Because the network approximates a Q-function, (2.24) is written using Q(Ω, w):⁴

E = (1/2) Σ_{k=0}^{N−1} (r_k + Q(Ω_{k+1}, w) − Q(Ω_k, w))²   (5.10)

and (2.26) becomes

Δw_k = (r_k + Q(Ω_{k+1}, w) − Q(Ω_k, w)) (∇_w Q(Ω_{k+1}, w) − ∇_w Q(Ω_k, w)).   (5.11)

The least squares estimation gives the global minimum of the error at once, while the training of the network can fail to find the global minimum of the error. The only difference is that the network starts with some random initial weights that are incrementally updated at each iteration step. A consequence of this difference is that the weights of the network always have a value, while the least squares estimation gives no solution in case of a singularity.

Before the network can be trained, the initial weights have to be selected. If the weights of the hidden units are chosen very small, the feedback function derived from this initial network is linear. If the true Q-function is quadratic, the weights change such that the feedback remains linear. Only when it is necessary, when the true Q-function is not quadratic, the weights of the hidden units become larger.

⁴ If the method that minimizes the quadratic temporal difference error (2.24) using (2.25) and (2.26) with a discount factor γ = 1 is used, then the training is based on minimizing the same criterion as in chapter 3.
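For the simplest case of (5.3), a Q-function linear in its parameters (Q(Ω, w) = w^T Ω), the update rule can be sketched as follows (the learning rate and data are made up; note ∇_w Q(Ω, w) = Ω in this case):

```python
import numpy as np

# Sketch of minimizing the temporal difference error (5.10) with an
# update of the form (5.11), for the linear-in-parameters case
# Q(Omega, w) = w^T Omega. A gradient descent step on e^2/2 is then
# w <- w - lr * e * (Omega_{k+1} - Omega_k).
def td_update(w, omega_k, omega_k1, r_k, lr=0.1):
    e = r_k + w @ omega_k1 - w @ omega_k       # temporal difference error
    return w - lr * e * (omega_k1 - omega_k)

w = np.zeros(2)                                # illustrative parameters
omega_k = np.array([1.0, 0.0])                 # illustrative transition
omega_k1 = np.array([0.0, 1.0])
for _ in range(100):
    w = td_update(w, omega_k, omega_k1, r_k=1.0)
e_final = 1.0 + w @ omega_k1 - w @ omega_k
```

Unlike the least squares estimation, which gives the minimum at once, the weights only approach it incrementally: here the error shrinks by a factor 0.8 per step.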
There are some differences between the estimation of the parameters of a quadratic function and the training of the network. The important differences are:

• No singularity. For too low exploration no feedback can be computed for the standard approach. Singularity can only happen when the weights of the hidden units are too large so that θ̃ is almost zero. This means that there is no singularity for too low exploration.

• Different results for low exploration. If insufficient exploration is used then LQRQL results in the feedback used to generate the train set. The neural approach already starts with an initialized network and the network only changes its parameters. It is very unlikely that the feedback derived from the network will be the same as the feedback used to generate the train set. Even when the same train set is used, the resulting feedbacks will not be the same when the networks are initialized differently.

• Improvement for high exploration. The training of the network is such that the temporal difference is minimized, and the exploration contributes to the temporal difference error. The exploration will make the temporal difference error much higher. This means that a network trained on a set with insufficient exploration will have a lower error than one trained on a set with sufficient exploration. Even if the global minimum is found this does not necessarily result in an improved feedback function. Only when sufficient exploration is achieved, the minimization of the temporal difference error will lead to an improvement of the feedback function.

5.5 Simulation Experiments with a Nonlinear System

In chapter 4 we used a mobile robot to experiment with the different approaches. We used the same system to test the Neural Q-Learning approach. The main difference with chapter 4 is that the resulting feedback function should be globally valid. This means that we test the feedback function in a larger part of the state space. In order to compare the results with a different nonlinear feedback, we also used Gain Scheduling based on the extended LQRQL approach. To be able to compare the results both approaches used the same train sets, so we only have to specify the settings for the two approaches.

5.5.1 Introduction

The system was identical to the one in chapter 4. This means that the train set was generated with one linear feedback L = [10⁻³ 10⁻³].
For a train set with suﬃcient data. The exploration will make the temporal diﬀerence error much higher. 5. diﬀerent networks initialized with small weights will eventually result in similar feedback functions.5. This means that we test the feedback function in a larger part of the state space. trained on a set with insuﬃcient exploration. In order to compare the results with a diﬀerent nonlinear feedback. 5. The main diﬀerence with chapter 4 is that the resulting feedback function should be globally valid. This means that there is no singularity for too low exploration.
Gain Scheduling

For the Gain Scheduling approach we had to partition the state space in separate partitions. For each partition one local linear feedback was obtained by using extended LQRQL. In chapter 4 we observed that different feedbacks were found when train sets were generated with a different initial δ. This implies that the feedback should be different for different values of δ. We therefore divide the state space into three partitions according to:

• Negative partition: −∞ < δ < −1
• Middle partition: −1 < δ < 1
• Positive partition: 1 < δ < ∞

This means that we had to make sure that for each partition sufficient training samples were available to obtain a local feedback. We did this by generating more train sets by starting the system from different initial states. For each estimation a train set was selected consisting of all training samples in that particular partition.

The resulting feedback function consists of three local linear functions. At the border of the partitions these functions can be different, so that the feedback function is not continuous. It is possible to make this a continuous differential function by smoothing it near the borders of the partitions. We did not do this because we wanted to see the local linear feedbacks as clearly as possible for the comparison with the Neural Q-Learning approach.

Neural Q-Learning

For the Neural Q-Learning approach, all train sets used for Gain Scheduling were combined into one train set. This train set was used to train the network, by minimizing the quadratic temporal difference error using (5.10). We chose the following settings for the network:

• Number of hidden units: 3.
• Values of initial weights and biases of hidden units: random between −10⁻⁴ and 10⁻⁴.
• The value of the initial output weights: 0.

The number of hidden units was kept small to prevent over fitting and to keep the number of weights close to the number of parameters of the Gain Scheduling approach. To prevent over fitting it is also possible to split the train set in two and use only one set for training and the other for testing. We did not do this because we wanted the resulting feedback to be based on the same train set as the Gain Scheduling approach. Instead we made sure that we did not have too many hidden units. Because the result also depends on the initial weights, we trained the network for ten different initial weights. Then we used the network with the lowest quadratic temporal difference error.
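The three-partition construction can be sketched as follows (the local gains are placeholders, not the estimated feedbacks from the experiment):

```python
import math

# Sketch of the Gain Scheduling feedback: one local linear feedback per
# partition of delta, selected by the current state. At a partition
# border the selected gain changes, so the resulting feedback function
# need not be continuous. Gains are illustrative placeholders.
PARTITIONS = [
    (-math.inf, -1.0, -0.30),   # negative partition: delta < -1
    (-1.0, 1.0, -0.10),         # middle partition: -1 <= delta < 1
    (1.0, math.inf, -0.30),     # positive partition: delta >= 1
]

def scheduled_action(delta):
    for lo, hi, gain in PARTITIONS:
        if lo <= delta < hi:
            return gain * delta
    raise ValueError("delta outside all partitions")
```

At δ just below and just above 1 the selected gain switches from −0.10 to −0.30, which is exactly the kind of jump at the partition borders discussed above.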
These weights are taken small so that the initial resulting feedbacks are linear and perhaps become nonlinear during training.

5.5.2 The experimental procedure

The purpose of this experiment is to point out the major differences with the experiments in chapter 4. Different train sets are generated by starting the system for different initial δ. For one sequence of exploration the train sets are shown in figure 5.5. All train sets are combined to form one train set that is used to obtain the two global nonlinear feedback functions.

Figure 5.5. The trajectories while generating the train set.

The resulting feedback functions of the train sets in figure 5.5 are shown in figure 5.6. In figure 5.6(a) the Gain Scheduling result is shown. The feedback function consists of three local linear feedbacks, but in this case two linear feedbacks are quite similar. The third feedback is different and the "jump" between the boundary of the two partitions is very clear. In figure 5.6(b) the resulting feedback based on Neural Q-Learning is shown. This looks like one linear feedback with a smooth nonlinear correction.

Figure 5.6. The control action as function of the state: (a) Gain Scheduling; (b) Neural Q-Learning.

The feedback functions were tested by starting the robot for different initial δ. The main difference with the experiments in chapter 4 is that now two global nonlinear feedbacks are used. This means that the tests from different initial δ are performed with the same feedback function. In chapter 4, each initial δ was tested with the corresponding local linear feedback.

The trajectories for δ = −1.5 and δ = 1.5 are shown in figure 5.7. In figure 5.7(a) four trajectories are shown for the Gain Scheduling result. The trajectory that starts in δ = 1.5 clearly shows a jump when it crosses the boundary between the partitions of the state space. After the jump the trajectory moves similar to the other trajectories in that partition towards the line. In figure 5.7(b) the resulting trajectories for Neural Q-Learning are shown. It is clear that there are no jumps in the trajectories. Also it can be noticed that for large initial distances to the line, the robot initially moves faster towards the line. This is a consequence of the nonlinear correction.

Figure 5.7. The trajectories: (a) Gain Scheduling; (b) Neural Q-Learning.
The tests were performed by running the robot for 302 time steps, which is equivalent to driving for approximately 2 minutes. The reason for using a long test period is that we wanted the robot to visit a large part of the state space.

5.5.3 The performance of the global feedback functions

We did 30 simulation experiments to compare the performance of the two approaches. Each experiment was performed as described in the preliminary experiments. As initial δ we used: −1.75, −0.75, 0.75 and 1.75. For the Neural QLearning approach, the network was trained with 5 different initial weights, and the network for which the lowest temporal difference error was reached was tested.

In figure 5.8(a) we plotted the total costs as a function of the initial δ for the best feedbacks of the two approaches. We see that the values of the resulting total costs of the Neural QLearning approach are symmetric around δ = 0. This is because of the quadratic input of the network. The result of the Gain Scheduling approach is not symmetric. The resulting Gain Scheduling feedback indicates that the robot will not approach the line exactly; instead it will drive at a small distance parallel to the line, which also explains why its total costs are a little higher than for the Neural QLearning approach.

In figure 5.8(b) we see the average total costs of both approaches for all 30 experiments. Note that we plotted the average log(J), because the value of J varies between 10 and 10^5. We see that the Gain Scheduling approach performs very badly on average. This is because the results are either good, as shown in figure 5.8(a), or very bad. These very bad results are consequences of bad local feedbacks at the middle partition.

Figure 5.8. The performances of the global feedback functions: (a) the best performance for Gain Scheduling and Neural QLearning, (b) the average performance for Gain Scheduling and Neural QLearning.
If the local feedback around the line makes the robot move towards the outer partitions, and the linear feedback in that outer partition is good, it will make the robot move back towards the middle partition. As a consequence the robot gets stuck at the boundary of these two partitions, and the total costs will always be very high.

5.5.4 Training with a larger train set

The previous experiments showed that Gain Scheduling did not perform so well. In chapter 4 we already showed that not all resulting feedback functions will be good. The reason for the bad performance of Gain Scheduling is that the resulting feedback function of the extended LQRQL approach is completely determined by the train set. In Gain Scheduling the state space is partitioned, and for each partition the train set should lead to a good function. The partitioning makes it less likely that all partitions will have a good performance. The Neural QLearning approach uses the complete train set. Therefore it does not have this problem, and it performs much better on average.

We did an experiment to see whether the result of Gain Scheduling can be improved by using a larger train set for each partition. We used all the train sets from the previous experiment and combined them into one large train set. Then we applied the Gain Scheduling and Neural QLearning approach to this larger train set. In figure 5.9(a) we see the resulting nonlinear feedback for Gain Scheduling. For the middle partition l̂ = −0.0032, which shows that it is getting closer to the correct value of 0.

Figure 5.9. The control action as function of the state when a larger train set was used: (a) Gain Scheduling, (b) Neural QLearning.
The only way to improve this result is to make sure that for each partition the train set is good.
75 91. The total costs for the feedback function based on the large train set.9(b) is steeper near δ ≈ 0.5.2774 121.5. The value for ˆ is 0. The total costs JGS are much lower than the average value in ﬁgure 5. This feedback is steeper when δ ≈ 0. l l = 0. This agrees with the discussion in section 4.3274 121.6(b).8163 δ JGS JNQ 1. This is because the linear function of the middle partition in ﬁgure 5.8163 0. because of the feedback in the middle partition.8742 Table 5. therefore the increase in train set does not have such a huge impact. In ﬁgure 5. The total costs of the neural QLearning approach is again lower that that of Gain Scheduling. so the robot moves straight ahead until it enters the middle partition.75 219.9(b) we see the resulting nonlinear feedback for Neural QLearning.8559 16.75 201. In the middle partition the robot will slowly approach the line.8742 0.1. the feedback L hardly contributes to the control action.75 105.10 we see that for the outer partitions the robot rotates 1 π in the direction of the line within 2 one step. . We tested the resulting feedback functions by starting the robot in the same initial states.8(b). The performance in ﬁgure 5. The trajectories for the Gain Scheduling feedback based on the large train set. The trajectories of the Neural QLearning approach in ﬁgure 5.7(b).11 shows that the robot moves faster to the line than the trajectories in ﬁgure 5. compared to the feedback in ﬁgure 5.5.11.10.8(b) was already good based on smaller train sets. Then it arrives at a state where the control action is almost zero.3872 16.10 and ﬁgure 5. so that it is very diﬃcult to estimate the quadratic Qfunction. The trajectories of the test runs are shown in ﬁgure 5. The reason is that for the middle partition.0222 for the positive partition. Table 5. In ﬁgure 5. This shows that the increase of the train set has a huge impact on the performance of Gain Scheduling.8085 for the negative and −1. 2 1 0 Y 1 2 0 1 2 3 4 5 X 6 7 8 9 10 Figure 5. 
SIMULATION EXPERIMENTS WITH A NONLINEAR SYSTEM 89 1. This implies that the total costs are higher than those of the neural QLearning approach.1 shows the resulting total costs for both approaches for diﬀerent δ0 .9(a) is not steep enough.1 where we indicated that L > 0 if δ < 0 and L < 0 if δ > 0. This is because the feedback in ﬁgure 5.
Figure 5.11. The trajectories for the Neural QLearning feedback based on the large train set.

5.6 Experiment on a Real Nonlinear System

We did an experiment with the real robot, where we used the exploration noise sequence that gave the best results for both LQRQL approaches in chapter 4. Because of the space limitation of the room, we tested only for one minute. The resulting feedback functions are shown in figure 5.12 and the resulting total costs in Table 5.2.

In figure 5.12(a) it is clear that the local linear feedback for the partition with the negative initial δ is wrong. This indicates that the exploration noise sequence used only results in a useful training set for positive initial δ. Since the resulting feedback in this partition is completely determined by the train set, the only way to solve this is by generating a new training set for this partition.

In figure 5.12(b) we see that the result of the Neural QLearning approach is again linear with a smooth nonlinear correction. Also we see that the control actions are large (we observed the same in chapter 4 for the SI approach). In fact, the linear feedback derived from the network is too large: far from the line it results in control actions that are too large. We see that in some parts of the state space the control action is reduced by the nonlinear correction. In figure 5.12(b) we see that this correction occurs for state values that were part of the train set. At these states the control actions are between the action bounds of the real robot, so these actions can be applied. For other parts of the state space the control actions are still too large.

Figure 5.12. The control action as function of the state for the real nonlinear system: (a) Gain Scheduling, (b) Neural QLearning.

Table 5.2. The total costs for the real nonlinear system.

Again we see that the resulting total costs for the Gain Scheduling approach are not symmetric. Also we see that the costs for δ0 = −1.75 are very high. This corresponds to the partition for which the local linear feedback is wrong. The results of the Neural QLearning approach show that it performs very well for all initial δ. In order to understand the results in Table 5.2, we plotted the trajectories.

In figure 5.13(a) we see that for the negative partition the robot moves in the wrong direction. For this partition the local feedback is wrong. For the other partition the robot moves toward a line parallel to the line to follow. This is a consequence of the local feedback at the middle partition, and it indicates that the robot will never follow the line exactly.

In figure 5.13(b) we see that the two trajectories that start close to the line will approach the line. The high linear feedback from the network makes the robot move to the line very efficiently when it is close to the line. We also see that the trajectories far from the line first start to rotate very fast towards the line. The first few time steps the action is higher than the safety limit, so we only let the robot rotate with maximum speed. After that all actions are within the safety limit and the robot starts following the line.

These experiments have shown that the Gain Scheduling approach depends very much on the training sets for each of the local linear feedbacks. It is possible that for some parts of the state space the feedback function found is not appropriate. The Neural QLearning approach results in a feedback function that is linear in most parts of the state space. In parts of the state space where the training set indicates that a correction is required, the control action deviates from the linear feedback. The resulting feedback function of Neural QLearning is a smooth function, unlike the feedback function of the Gain Scheduling approach, where there can still be discontinuities.
5 0 0. The advantage of using only one network is that only one network has to be trained. .5 2 2.5 1 1.13. the feedback follows directly from the Qfunction.5 1 0. We showed how the LQRQL approach can be used to obtain the feedback from the approximated Qfunction.5 1 1. 5.5 1 1. More important is that the feedback function directly follows from the approximated Qfunction. We described how a feedforward network can be used to approximate the Qfunction.7 Discussion In this chapter we proposed a method to apply QLearning for general Qfunctions. The result is that the resulting feedbacks can also be nonlinear. Then only one function has to be approximated.92 CHAPTER 5.5 4 4.5 0 Y 0.5 0 0. The trajectories for the real nonlinear system.5 (b) Trajectories using Neural LQRQL Figure 5. NEURAL QLEARNING USING LQRQL 1. This is similar to Qlearning where the greedy policy is obtained by selection the greedy action for each state.5 2 5 4 3 2 1 X 0 1 2 3 4 5 (a) Trajectories using Gain Scheduling 1. This makes this approach more convenient to use.5 3 3.5 1 0.
5.8 Conclusions

There are different ways to obtain a feedback when using a feedforward network as a Qfunction. Our method is based on the idea that there should be a direct relation between the Qfunction and the feedback. The method uses LQRQL to obtain a linear feedback from the Qfunction. Then the linear feedback is used to compute the nonlinear feedback function. The feedback function of the Neural QLearning approach can be seen as a locally weighted combination of linear feedback functions. The result is a global linear function with local nonlinear corrections.

Both the Gain Scheduling approach and the Neural QLearning approach derive a global nonlinear feedback function from a set of linear feedback functions. In Gain Scheduling only one local linear feedback determines the control action, so it is essential that for every partition the feedback function is good. This is not always the case, resulting sometimes in very poor performances. The only way to solve this is to generate more data. For Neural QLearning this is not necessary. Because the training is based on the complete train set, the result is less dependent on the correctness of each individual linear feedback function. This means that a poor performance like that of Gain Scheduling becomes very unlikely. It will also result in a good feedback function for a smaller train set than the Gain Scheduling approach.

A second advantage of our approach is that the resulting feedback function is smooth. This means that there are no sudden jumps as in Gain Scheduling. Such jumps can also appear when the greedy action is directly computed from the function approximator. The problem with these jumps is that the behavior of the system becomes more unpredictable in the presence of noise. The effect of the jumps can be overcome, but with a smooth feedback function this is not needed.
Chapter 6

Conclusions and Future Work

The objective of the research described in this thesis is to give an answer to the question whether Reinforcement Learning (RL) methods are viable methods to obtain controllers for systems with a continuous state and action space. One of the characteristics of RL is that it optimizes the controller based on interactions with the system. Some applications of RL to control real systems have been described [65][59][2][61][47]. For these applications the training was performed mainly in simulation. Simulation requires a model, which means that the system has to be known, so when the training is performed in simulation the potential of RL is not fully exploited.

There are two main reasons for not applying RL directly to optimize the controller of a real system. The first reason is that for real control tasks a reliable controller is needed. Stability during learning cannot be guaranteed, and the closed loop performance when RL is applied to a real system is not very well understood, which explains the hesitation in applying these algorithms directly on real systems. The second reason is that RL may train very slowly. In simulation the time required to obtain a sufficiently large train set depends on the computing speed: the computation of one state transition can be much faster than the actual sample time on the real system. On a real system, the time required to obtain a sufficiently large train set is determined by the system itself. The only way to speed this up is to use algorithms that require less training.

Furthermore, RL is normally applied to systems with a finite number of discrete states and possible actions. Many real systems have continuous state and action spaces, so algorithms for discrete state and action spaces cannot be applied directly. The RL algorithms used for the control of continuous state space tasks are either based on heuristics or on discrete RL methods.

In this thesis we looked at QLearning, a modelfree RL method, for which no knowledge about the system is needed. This indicates that RL can be used as a controller design method. We investigated the applicability of QLearning for real systems with continuous state and action spaces, and therefore focused on the two issues mentioned above: how can we guarantee convergence in the presence of system noise, and how can we learn from a small train set.
We developed three methods: LQRQL for linear systems with quadratic costs, extended LQRQL for local approximations for nonlinear systems, and Neural QLearning for finding a nonlinear feedback for nonlinear systems.

6.1 LQR and QLearning

A well known framework in control theory is the Linear Quadratic Regularization (LQR) task. It is an optimal control task in which the system is linear and the direct costs are given by a quadratic function of the state and action. The objective is to minimize the total future cost. When the parameters of the system are known, the optimal feedback function can be obtained by solving the Discrete Algebraic Riccati Equation. When the parameters of the linear system are unknown, System Identification (SI) can be used to estimate them. The estimated parameters of the system can then be used to compute the optimal feedback.

It has been shown [16][44] that the LQR task can also be solved by QLearning. For this the original QLearning algorithm is adapted so that it can deal with the continuous state and action space of the linear system. We called this approach LQR QLearning (LQRQL). It can be proven that the resulting linear feedback will eventually converge to the optimal feedback when sufficient exploration is used. These results only apply when there is no noise. In practice the system noise will always be present, so we investigated the influence of the system noise on the performance of the resulting linear feedbacks. For a fair comparison between the methods, we used a batch least squares estimation. So unlike the recursive least squares approach in [16], our results only depend on the train sets used.

To show the influence of the system noise on the performance we introduced the exploration characteristic, in which four types of outcomes can be distinguished for the SI and LQRQL approach. If the amount of exploration is:

I Much too low: No feedback can be computed.

II Too low: The resulting feedback will be the same as that used for the generation of the train set.

III Almost sufficient: The resulting feedback can be anything, depending on the sequence of system noise and exploration.

IV Sufficient: The parameters of the Qfunction and system approach the correct values. This solution is used to compute the optimal linear feedback. The new feedback will be an improvement.

Only the type IV outcome is useful, so sufficient exploration is required. We aimed at determining the amount of exploration required for a guaranteed improvement, and compared this with the more traditional SI approach. We derived that the SI approach requires at least more exploration than there is noise in the system. Furthermore we derived that the LQRQL approach requires more exploration than the SI approach.
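The LQR background above can be made concrete with a short numerical sketch (the system matrices A, B and the cost parameters S, R below are hypothetical example values, following the notation list of appendix C): the Discrete Algebraic Riccati Equation is solved by fixed-point iteration, and the optimal linear feedback follows from its solution K.

```python
import numpy as np

# Hypothetical LQR task: x_{k+1} = A x_k + B u_k, direct cost x'Sx + u'Ru.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
S = np.eye(2)
R = np.array([[1.0]])

# Solve the Discrete Algebraic Riccati Equation by fixed-point iteration.
K = np.eye(2)
for _ in range(500):
    K = S + A.T @ K @ A \
        - A.T @ K @ B @ np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A

# Optimal linear state feedback u = -L x (sign convention assumed here).
L = np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A

# The closed loop A - B L should be stable (spectral radius < 1).
rho = max(abs(np.linalg.eigvals(A - B @ L)))
print(rho)
```

For well-posed problems the iteration converges to the stabilizing solution; in practice a dedicated solver would be used instead of the naive loop.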
6.1.1 Future work
Eigenvalues: The Qfunction is given by a sum of positive definite quadratic functions (the reinforcements), so the correct Qfunction is a positive definite quadratic function as well. This can be verified by looking at the eigenvalues of the matrix that represents the parameters of the Qfunction. Negative eigenvalues are only possible if the parameters of the Qfunction are not estimated correctly. This can only be the case for the type II and III outcomes. So negative eigenvalues imply that insufficient exploration was used to generate the train set. In our experiments we always found negative eigenvalues for insufficient exploration, but we cannot guarantee that the estimated Qfunction will never be positive definite for insufficient exploration. In order to use the eigenvalues as a reliability measure, it still has to be proven that insufficient exploration will never result in a positive definite Qfunction.

Exploration: The LQRQL approach requires more exploration than the SI approach. For real systems adding disturbance to the control action is not desirable, so to enhance practical applicability the required amount of exploration has to be reduced. One simple way of doing this is by using the SARSA variant of QLearning. QLearning uses the next action according to the feedback to compute the temporal difference; this does not include the exploration. In SARSA the action taken at the next time step is used, and this also includes the exploration at that time step. So each exploration added to the control action is used twice. Experiments have shown that this approach only requires the same amount of exploration as the SI approach. However, SARSA introduces a bias in the estimation, making it perform worse for very high amounts of exploration. These results are only experimental and should be verified theoretically.
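The eigenvalue test described above can be sketched as follows (the two parameter matrices are hypothetical estimates for illustration, not results from the thesis):

```python
import numpy as np

def sufficient_exploration_indicator(H_hat):
    """Reliability check: the correct Q-function is a positive definite
    quadratic, so any non-positive eigenvalue of the (symmetrized)
    parameter matrix signals an unreliable estimate."""
    H_sym = 0.5 * (H_hat + H_hat.T)   # only the symmetric part matters
    eig = np.linalg.eigvalsh(H_sym)
    return bool(np.all(eig > 0.0)), eig

# Hypothetical estimates: a well-estimated H and one distorted by
# insufficient exploration (indefinite: negative determinant).
H_good = np.array([[2.0, 0.3], [0.3, 1.0]])
H_bad = np.array([[2.0, 3.0], [3.0, 1.0]])

ok_good, _ = sufficient_exploration_indicator(H_good)
ok_bad, _ = sufficient_exploration_indicator(H_bad)
print(ok_good, ok_bad)  # → True False
```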
6.2 Extended LQRQL
LQRQL was developed for linear systems and not for nonlinear systems. Many control design approaches for nonlinear systems are based on local linear approximations of the nonlinear system. LQRQL can be used to obtain a local linear approximation of the optimal feedback function. To investigate the consequences of the nonlinearity, it is possible to write the model of the nonlinear system as a linear system with a nonlinear correction. The linear feedbacks of the SI and LQRQL approach are not always appropriate feedback functions to approximate the optimal feedback function. An additional offset can be included in the feedback function. This will allow for a better local approximation of the optimal feedback function in case the average nonlinear correction is large in the region of interest. To obtain the value of the offset, the parameters of a more general quadratic Qfunction have to be estimated. We showed that these parameters can be estimated in the same way as in the original standard LQRQL approach. We called this new approach the extended LQRQL approach. Our experiments on a simulated and on a real nonlinear system confirmed that the extended LQRQL approach results in a better local approximation of the optimal feedback function. We can conclude that the extended LQRQL approach has to be applied if a nonlinear feedback function is based on multiple local linear approximations.
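The benefit of the additional offset can be illustrated with a toy fit (the set point and the "optimal" feedback below are hypothetical, chosen so that the average nonlinear correction around the set point is nonzero): a feedback with an offset then approximates the optimal action better than a purely linear one.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical optimal feedback around a set point x_s = 1:
# u*(x) = -x - 0.5, i.e. a linear part plus a nonzero average correction.
xs = 1.0 + 0.2 * rng.standard_normal(200)
us = -xs - 0.5

# Fit without offset: u = l x (standard LQRQL form of the feedback).
l_no = (xs @ us) / (xs @ xs)
err_no = np.mean((l_no * xs - us) ** 2)

# Fit with offset: u = l x + c (extended LQRQL form of the feedback).
A = np.column_stack([xs, np.ones_like(xs)])
coef, *_ = np.linalg.lstsq(A, us, rcond=None)
err_off = np.mean((A @ coef - us) ** 2)

print(err_no, err_off)  # the offset fit has a much smaller error
```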
6.2.1 Future work
On–line learning: Throughout this thesis we used batch learning, which means that learning starts only when the complete train set is available. We did this to make fair comparisons between the different methods possible. Usually RL is used as an on–line learning method, and the recursive least squares approach for the LQR task in [16][44] is an on–line approach. Although our Qfunction is quadratic, it can be seen as a function approximator that is linear in the parameters. For such function approximators the conditions are known under which convergence is guaranteed when on–line TD(λ) learning is used [29][31]. How the on–line TD(λ) approach compares with the recursive least squares approach is not known and should be investigated.
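A minimal sketch of on-line TD(λ) learning for a function approximator that is linear in the parameters, in the spirit of the discussion above (the scalar system, the feature map and all constants are hypothetical, chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, u):
    """Quadratic features of state and action: Q(x, u) = theta . phi(x, u)
    is linear in the parameters theta (hypothetical 1-D parameterization)."""
    return np.array([x * x, x * u, u * u])

# Hypothetical scalar system x' = a x + b u + noise with cost x^2 + u^2;
# on-line TD(lambda) evaluation of a fixed stabilizing feedback u = -L x.
a, b, gamma, lam, alpha, L = 0.9, 0.5, 0.95, 0.7, 0.01, 0.3
theta = np.zeros(3)
trace = np.zeros(3)
x = 1.0
for _ in range(20000):
    u = -L * x + 0.3 * rng.standard_normal()    # exploration noise
    cost = x * x + u * u
    x_next = a * x + b * u + 0.05 * rng.standard_normal()
    u_next = -L * x_next                        # action under the feedback
    # Temporal difference and accumulating eligibility trace.
    delta = cost + gamma * phi(x_next, u_next) @ theta - phi(x, u) @ theta
    trace = gamma * lam * trace + phi(x, u)
    theta += alpha * delta * trace
    x = x_next
print(theta)
```

With costs instead of rewards, the learned quadratic should assign positive cost-to-go to large states, so the coefficient of x² trends positive.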
6.3 Neural QLearning
Function approximators have been used in RL before. One approach uses the actor/critic configuration. One function, called the critic, is used to approximate the Qfunction. The other function, called the actor, is used to approximate the feedback function. Both approximators have to be trained, where the actor is trained based on the critic. This makes the training procedure rather tedious and the outcome hard to analyze. The LQRQL approach can be combined with a feedforward neural network approximation of the Qfunction. In this case there is no actor, because in LQRQL the linear feedback function follows directly from the parameters of the Qfunction. So only one function approximator has to be trained. We called this approach Neural QLearning.

To obtain a nonlinear feedback function for a nonlinear system using Neural QLearning, first a linear feedback has to be determined. The derivatives with respect to the inputs of the network give the parameters that would be estimated by the LQRQL approach. These parameters depend on the state and control action, and therefore it is not possible to directly derive a nonlinear feedback function from them. It is, however, possible to ignore the hyperbolic tangent activation functions of the hidden units to obtain a globally valid linear feedback function. The linear feedback function can then be used to compute the control action that is necessary to determine the nonlinear feedback function. The resulting nonlinear feedback function can be regarded as a locally weighted function, where each hidden unit results in a local linear feedback. The state value determines the weighting of these local linear functions.

Experiments were performed on a simulated and on a real nonlinear system, where the goal of the experiments was to obtain global nonlinear feedback functions. We compared Neural QLearning with Gain Scheduling.
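The linearization step described above can be sketched for a small hypothetical network (shapes and weights are illustrative only): near the origin the derivatives of the network with respect to its inputs agree with the globally valid linear map obtained by ignoring the hyperbolic tangent activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-hidden-layer Q-network: Q(xi) = w_o . tanh(W_h xi + b_h),
# where xi stacks (quadratic combinations of) state and action.
n_in, n_hid = 3, 5
W_h = 0.1 * rng.standard_normal((n_hid, n_in))
b_h = 0.1 * rng.standard_normal(n_hid)
w_o = rng.standard_normal(n_hid)

def q_net(xi):
    return w_o @ np.tanh(W_h @ xi + b_h)

# Ignoring the tanh activations gives a globally valid linear map of the
# inputs; its coefficients play the role of the LQRQL parameter estimates.
theta_global = w_o @ W_h

# The derivatives of the network with respect to its inputs are state and
# action dependent; near the origin they approach the global linear part.
xi = 0.01 * rng.standard_normal(n_in)
eps = 1e-6
local_theta = np.array([(q_net(xi + eps * e) - q_net(xi - eps * e)) / (2 * eps)
                        for e in np.eye(n_in)])
print(local_theta, theta_global)
```

Away from the origin the tanh saturation makes the local derivatives deviate from the global map, which is exactly the local nonlinear correction described in the text.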
Gain Scheduling can be performed by making, for local partitions of the state space, a local approximation using the extended LQRQL approach. The experiments have shown that the Neural QLearning approach requires less training data than the Gain Scheduling approach. Therefore Neural QLearning is better suited for real systems.
6.3.1 Future Work
Value Iteration: The Neural QLearning approach we presented is still based on policy iteration. The original QLearning approach is based on value iteration. Because the feedback function directly follows from the approximated Qfunction, it is now also possible to apply value iteration. This can then be combined with an on-line learning approach. Whether Neural QLearning using value iteration is better than policy iteration should still be investigated.

Regularization: The weights of the hidden units should not become too large. In supervised learning there exist approaches for regularization; one way is to assign costs to the size of the weights. The consequences of this modification of the learning for Neural QLearning are unknown. It could have the effect that the resulting nonlinear feedback function no longer performs well. This means that further research is necessary to determine whether regularization can be applied.
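The weight-size penalty mentioned under regularization can be sketched as a weight-decay term added to the gradient step (the learning rate and decay constant below are hypothetical):

```python
import numpy as np

def sgd_step(w, grad_E, lr=0.01, weight_decay=0.1):
    """One gradient step on the regularized error
    E_reg(w) = E(w) + weight_decay * ||w||^2,
    whose gradient adds 2 * weight_decay * w to the error gradient."""
    return w - lr * (grad_E + 2.0 * weight_decay * w)

# With a zero error gradient the penalty alone shrinks the weights, which
# is the intended effect of keeping the hidden weights from growing large.
w = np.array([5.0, -3.0])
for _ in range(1000):
    w = sgd_step(w, grad_E=np.zeros(2))
print(np.linalg.norm(w))
```

Whether such a penalty harms the resulting feedback function is exactly the open question raised in the text.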
6.4 General Conclusion
Reinforcement Learning can be used to optimize controllers for real systems with continuous state and action spaces. To exploit the benefits of reinforcement learning, the approach should be based on QLearning, where the feedback function directly follows from the approximated Qfunction. The Qfunction should be represented by a function approximator whose parameters are estimated from the train set. If it is known that the system is linear, then the parameters of a quadratic Qfunction can be estimated; this is the LQRQL approach. In case there is no knowledge about the system, the more general Neural QLearning approach can be applied. This will give a globally valid linear feedback function with a local nonlinear correction.
Appendix A

The Least Squares Estimation

A.1 The QR-Decomposition

Two matrices $Z$ and $M$ have to be found for which $ZM = X$ and $Z^T Z = I$. Let $X$ have $n$ columns and $N-1$ rows; then $Z$ also has $n$ columns and $N-1$ rows, and $M$ is an $n \times n$ upper triangular matrix. The matrices $Z$ and $M$ can be found by Gram-Schmidt orthogonalization, for which the result can be written using projection matrices $P$ of size $(N-1) \times (N-1)$. Let $z_{*i}$ be the $i$th column of $Z$ and $X_{*i}$ the $i$th column of $X$. Then $z_{*i}$ is given by:
\[ z_{*i} = \frac{P_i X_{*i}}{\|P_i X_{*i}\|_2} \tag{A.1} \]
Matrix $M$ is given by:
\[ M = \begin{pmatrix} \|P_1 X_{*1}\|_2 & z_{*1}^T X_{*2} & \cdots & z_{*1}^T X_{*n} \\ 0 & \|P_2 X_{*2}\|_2 & \cdots & z_{*2}^T X_{*n} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \|P_n X_{*n}\|_2 \end{pmatrix} \tag{A.2} \]
The projection matrices can be defined recursively. Let $P_1 = I$; then for every $j > 1$:
\[ P_j = \prod_{i=1}^{j-1}\left(I - z_{*i} z_{*i}^T\right) = I - \sum_{i=1}^{j-1} z_{*i} z_{*i}^T \tag{A.3} \]
\[ P_j = I - \sum_{i=1}^{j-1} \frac{P_i X_{*i} X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} = P_{j-1} - \frac{P_{j-1} X_{*j-1} X_{*j-1}^T P_{j-1}^T}{\|P_{j-1} X_{*j-1}\|_2^2} \tag{A.4} \]
The projection matrix has the following properties: $P_i^T = P_i$, $P_i^2 = P_i$ and $P_i P_j = P_j$ for all $j > i$.

A.2 The Least Squares Solution

The main difficulty in solving (3.30) is the computation of $M^{-1}$. But $M$ is upper triangular, so the inverse can be found by backward substitution. Let $m_{i,j}$ indicate the element of $M$ at position $i,j$ and let $m^{(-1)}_{i,j}$ indicate the element of $M^{-1}$ at position $i,j$. The elements of $M^{-1}$ are given by:
\[ m^{(-1)}_{i,j} = \begin{cases} -\dfrac{1}{m_{j,j}} \sum_{k=i}^{j-1} m^{(-1)}_{i,k} m_{k,j} & \text{for } i < j \\[2mm] \dfrac{1}{m_{i,i}} & \text{for } i = j \\[2mm] 0 & \text{for } i > j \end{cases} \tag{A.5} \]
With this result, the value of $F = M^{-1} Z^T$ can be written as:
\[ F = M^{-1} Z^T = \begin{pmatrix} m^{(-1)}_{1,1} z_{*1}^T + m^{(-1)}_{1,2} z_{*2}^T + \cdots + m^{(-1)}_{1,n} z_{*n}^T \\ \vdots \\ m^{(-1)}_{n,n} z_{*n}^T \end{pmatrix} \tag{A.6} \]
Let $F_{i*}$ be the $i$th row of $F$ (with $i < n$); then this can be written as:
\[ F_{i*} = m^{(-1)}_{i,i} z_{*i}^T + m^{(-1)}_{i,i+1} z_{*i+1}^T + \cdots + m^{(-1)}_{i,n} z_{*n}^T \tag{A.7} \]
Filling in the values of (A.5), (A.1) and (A.2) and expanding the resulting products term by term gives:
\[ F_{i*} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \prod_{j=i+1}^{n} \left(I - X_{*j} F_{j*}\right) \tag{A.15} \]
For $i = n$ there is no product in (A.15), so the resulting expression does not have the product term. The last row of the least squares solution $\hat\theta$ is given by $F_{n*} Y$, which results in:
\[ \hat\theta_{*n} = \frac{X_{*n}^T P_n^T}{\|P_n X_{*n}\|_2^2}\, Y \tag{A.16} \]
Starting with the last row, all other rows can be computed recursively:
\[ \hat\theta_{*i} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left(Y - \sum_{j=i+1}^{n} X_{*j} \hat\theta_{*j}\right) \tag{A.17} \]

A.3 The SI Estimation

The solution according to (3.33) starts with the last row of $\hat\theta$, so first $\hat\theta_B = \hat B^T$ will be shown:
\[ \hat\theta_{B,n_u*} = \frac{U_{*n_u}^T P_{n_x+n_u}^T}{\|P_{n_x+n_u} U_{*n_u}\|_2^2}\, Y_{SI} = \frac{E_{*n_u}^T P_{n_x+n_u}^T}{\|P_{n_x+n_u} E_{*n_u}\|_2^2}\, Y_{SI} \tag{A.18} \]
Here we used $P_{n_x+i} U_{*i} = P_{n_x+i}(X L^T + E)_{*i} = P_{n_x+i} E_{*i}$, because $P_{n_x+i}$ removes the part that is linearly dependent on the previous columns of $X_{SI}$. The other rows follow recursively:
\[ \hat\theta_{B,i*} = \frac{E_{*i}^T P_{n_x+i}^T}{\|P_{n_x+i} E_{*i}\|_2^2} \left(Y_{SI} - \sum_{j=n_x+i+1}^{n_u+n_x} U_{*j} \hat\theta_{B,j*}\right) \tag{A.19} \]
(The expression for $\hat\theta_{A,n_x*}$ does not have the first sum.) The matrices $\hat B$ and $\hat A$ are the transposes of $\hat\theta_B$ and $\hat\theta_A$, so $\hat\theta_{B,i*} = \hat B_{*i}$ and $\hat\theta_{A,i*} = \hat A_{*i}$.

The matrices $X$ and $U$ can be used to find an estimation of the feedback, because $U = X \hat L^T + E$ and thus $\hat L^T = (X^T X)^{-1} X^T (U - E)$. So it is also possible to write:
\[ \hat L_{*i} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left(U - E - \sum_{j=i+1}^{n_x} X_{*j} \hat L_{j*}\right) \tag{A.22} \]
Then the rows of $\hat\theta_A$ can be expressed as:
\[ \hat\theta_{A,i*} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left(Y_{SI} - \sum_{j=i+1}^{n_x} X_{*j} \hat\theta_{A,j*}\right) - \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2}\, U \hat\theta_B \tag{A.24} \]
This can be used, with $U = X \hat L^T + E$, to write:
\[ \hat\theta_{A,i*} + \hat L_{*i} \hat\theta_B = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left(Y_{SI} - E \hat\theta_B - \sum_{j=i+1}^{n_x} X_{*j} \left(\hat\theta_{A,j*} + \hat L_{j*} \hat\theta_B\right)\right) \tag{A.25} \]
Define $\hat\theta_{D,i*} = \hat\theta_{A,i*} + \hat L_{*i} \hat\theta_B$, so that:
\[ \hat\theta_{D,i*} = \frac{X_{*i}^T P_i^T}{\|P_i X_{*i}\|_2^2} \left(Y_{SI} - E \hat\theta_B - \sum_{j=i+1}^{n_x} X_{*j} \hat\theta_{D,j*}\right) \tag{A.26} \]
which represents the estimation of the closed loop.
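The construction in this appendix can be checked numerically: Gram-Schmidt orthogonalization gives Z and M with ZM = X, and backward substitution on the upper triangular M recovers the least squares solution. A sketch (classical Gram-Schmidt for clarity; it is not the numerically preferred variant, and the random data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)

def gram_schmidt_qr(X):
    """Z M = X with Z'Z = I and M upper triangular, as in the appendix."""
    N, n = X.shape
    Z = np.zeros((N, n))
    M = np.zeros((n, n))
    for i in range(n):
        v = X[:, i].copy()
        for j in range(i):
            M[j, i] = Z[:, j] @ X[:, i]   # z_{*j}' X_{*i}
            v -= M[j, i] * Z[:, j]        # project out earlier directions
        M[i, i] = np.linalg.norm(v)       # ||P_i X_{*i}||
        Z[:, i] = v / M[i, i]
    return Z, M

def back_substitute(M, c):
    """Solve M theta = c for upper triangular M, starting at the last row."""
    n = len(c)
    theta = np.zeros(n)
    for i in range(n - 1, -1, -1):
        theta[i] = (c[i] - M[i, i + 1:] @ theta[i + 1:]) / M[i, i]
    return theta

X = rng.standard_normal((50, 4))
Y = rng.standard_normal(50)
Z, M = gram_schmidt_qr(X)
theta = back_substitute(M, Z.T @ Y)        # theta = M^{-1} Z' Y
theta_ref = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(theta, theta_ref))
```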
Appendix B

The Mobile Robot

B.1 The robot

The robot we used in the experiments is a Nomad Super Scout II, see figure B.1. This is a mobile robot with a two wheel differential drive at its geometric center. The drive motors of both wheels are independent, and the width of the robot is 41 cm. The maximum speed is 1 m/s at an acceleration of 2 m/s². The robot has a MC68332 processor board for the low level processes. These include sending the control commands to the drive motors, but also keeping track of the position and orientation of the robot by means of odometry. All other software runs on a second board equipped with a Pentium II 233Mhz processor.

Figure B.1. The Nomad Super Scout II.
B.2 The model of the robot

The mobile robot has three degrees of freedom. These define the robot's position x and y in the world and the robot's orientation φ. The speeds of the left and right wheel, $v_l$ and $v_r$, are the control actions that make the robot move. Because the geometric center of the robot is right between the wheels, the control actions can also be indicated by a traversal speed $v_t$ and a rotational speed $\omega$. The traversal speed is given by:
\[ v_t = \tfrac{1}{2}(v_l + v_r) \tag{B.1} \]
The rotational speed is given by:
\[ \omega = \frac{1}{W}(v_r - v_l) \tag{B.2} \]
where $W$ indicates the width of the robot (41 cm). The change of position and orientation is given by:
\[ \begin{pmatrix} \dot x \\ \dot y \\ \dot\varphi \end{pmatrix} = \begin{pmatrix} \sin(\varphi)\, v_t \\ \cos(\varphi)\, v_t \\ \omega \end{pmatrix} \tag{B.3} \]
We can notice here that the change of the orientation is independent of the position, but the change of position depends on the orientation. By taking the integral over a fixed sample time interval $T$, the discrete time state transition can be derived:
\[ x_{k+1} = x_k + \frac{v_t}{\omega}\left(\sin(\varphi_k + T\omega) - \sin(\varphi_k)\right) \tag{B.4} \]
\[ y_{k+1} = y_k + \frac{v_t}{\omega}\left(\cos(\varphi_k) - \cos(\varphi_k + T\omega)\right) \tag{B.5} \]
\[ \varphi_{k+1} = \varphi_k + \omega T \tag{B.6} \]
This holds for any $\omega \neq 0$. If $\omega = 0$, the orientation does not change and $x_{k+1} = x_k + T v_t \sin\varphi_k$ and $y_{k+1} = y_k + T v_t \cos\varphi_k$. We also have to note that (B.3) is a simplified model of the robot, since it does not take into account the acceleration.
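The discrete-time transition (B.4)-(B.6) can be sketched directly (the sample time T below is a hypothetical value; the ω = 0 case is handled separately, as in the text):

```python
import numpy as np

W = 0.41  # width of the robot in meters

def step(state, v_l, v_r, T=0.1):
    """One discrete-time step of the kinematic model (B.4)-(B.6);
    the acceleration of the real robot is ignored, as in the text."""
    x, y, phi = state
    v_t = 0.5 * (v_l + v_r)      # traversal speed (B.1)
    omega = (v_r - v_l) / W      # rotational speed (B.2)
    if abs(omega) < 1e-9:        # straight driving: omega = 0 case
        return (x + T * v_t * np.sin(phi), y + T * v_t * np.cos(phi), phi)
    return (x + v_t / omega * (np.sin(phi + T * omega) - np.sin(phi)),
            y + v_t / omega * (np.cos(phi) - np.cos(phi + T * omega)),
            phi + omega * T)

# Equal wheel speeds keep the orientation constant and move the robot
# along its heading; unequal speeds make it turn.
s = step((0.0, 0.0, 0.0), 0.5, 0.5)
print(s)
```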
Indicates row j of matrix A. The tilde indicates a dummy variable.Appendix C Notation and Symbols C. a submatrix or the method applied. The A is used as an example matrix. A subscript can be an indication of the time step. The star indicates an optimal solution. A∗i Aj∗ Operations on matrices. The bar indicates an error or deviation between the real value and the desired value.1 A A ˆ A ˜ A A∗ ¯ A Ai Notation The accent indicates a new result of an iteration. Conventions to indicate properties of variables. Multiple subscripts are separated by a comma. a11 a12 a21 a22 a11 a21 a12 a22 a11 a11 a vec(A) = 21 vec (A) = a12 a12 a22 a22 A= AT = Indices indicating the applied method. SI QL EX NQ GS System Identiﬁcation Standard LQRQL Extended LQRQL Neural QLearning Gain Scheduling 107 . Indicates column i of matrix A. an element form A. The A is used as an example variable. Bold indicates a vector. The hat indicates the result of an estimation.
C.2 Symbols

Chapter 1

    x      Continuous state vector
    u      Continuous control action vector
    v      Continuous noise vector; all elements are zero mean, Gaussian and white
    nx     Dimension of the state space
    nu     Dimension of the control action space
    k      Time step
    f      State transition function
    g      State feedback function
    A, B   Parameter matrices of a linear system
    L      Parameters of a linear state feedback
    D      Parameters of the closed loop of a linear system

Chapter 2

    s      Discrete state
    a      Discrete action
    r      Reinforcement
    π      Policy
    V      Value function
    N      Number of time steps
    P      Probability matrix
    E{}    Expectation value
    R      Expected reinforcements
    γ      Discount factor
    m, l   Iteration indices
    α      Learning rate
    Q      Qfunction
    τ      Eligibility trace
    ξ      Input of the function approximator
    w      Weights of the function approximator
    E      Error function

Chapter 3

    σv     Standard deviation of system noise v
    S, R   Parameters of quadratic direct cost
    J      Total costs
    K      Solution of the Discrete Algebraic Riccati Equation
    e      Exploration
    σe     Standard deviation of exploration noise e
Chapter 3 (continued)

    X, Y, U  Matrices to form the least squares estimation
    θ        Parameters to be estimated by the linear least squares estimation
    H        Parameters of the quadratic Qfunction; subscripts indicate submatrices of H
    φ        Concatenation of state vector x and control vector u
    ρ        Relative performance
    P        Projection matrix
    Φ, Υ     Submatrices of X for the SI approach
    c        Some constants
    T        Matrix with difference in quadratic state/action values
    V        Noise contribution for the LQRQL approach
    L, Lv    Different representations of the linear feedback
    κ        A constant
    Ψ        Submatrices of X for the LQRQL approach
    ee       Quadratic combinations of actions and exploration

Chapter 4

    w        Nonlinear correction
    xs       Set point
    us       Action to keep system at its set point
    xeq      Equilibrium state
    ueq      Action at equilibrium state
    l        Addition constant in feedback function
    G        Extra parameters for the extended Qfunction
    x, y, φ  Position and orientation coordinates of the robot in the world
    δ        Distance to the line
    α        Orientation with respect to the line

Chapter 5

    wo, wh, bo, bh  Weights and biases of the network
    Γo, Γh          Activation functions of the units
    Ω               The quadratic combination of state and control action; the input for the network for Neural QLearning
Summary

The topic of this thesis is the use of Reinforcement Learning (RL) for the control of real systems. In RL the controller is optimized based on a scalar evaluation called the reinforcement. For systems with discrete states and actions there is a solid theoretic base, and convergence to the optimal controller can be proven. Real systems, however, often have continuous states and control actions. For these systems the consequences of applying RL are less clear.

One problem when RL is applied to real continuous control tasks is that it is no longer possible to guarantee that the closed loop remains stable throughout the learning process, especially since a random process called "exploration" has to be included during training. This makes it a dangerous way to obtain a controller. Another problem is that most RL algorithms train very slowly. This means that for a slow process the application of RL is very time consuming. To enhance the applicability of RL for these systems, more understanding is required.

When the system is linear and the costs are given by a quadratic function of the state and control action, Linear Quadratic Regulation (LQR) can be applied. This leads to the optimal linear feedback function. System Identification (SI) can be used in case the system is unknown. The LQR task can also be solved by QLearning, a model-free RL approach in which not the system but its cost function is explicitly modeled. We called this method LQRQL. The LQRQL approach can be applied to an unknown linear system.

To find the optimal feedback it is necessary that sufficient exploration is used. We expressed the performance of the resulting feedback as a function of the amount of exploration and noise. Based on this we derived that a guaranteed improvement of the performance requires that more exploration is used than the amount of noise in the system. Also we derived that the LQRQL approach requires more exploration than the SI approach.

Most practical systems are nonlinear systems, for which nonlinear feedback functions are required. Existing techniques for nonlinear systems are often based on local linear approximations. The linear feedbacks of the LQRQL and SI approaches are not always able to give a good local linear approximation of a nonlinear feedback function. It is possible to extend the LQRQL approach, by first computing a linear feedback for which a nonlinear correction can be determined. The Extended LQRQL approach estimates more parameters and results in a linear feedback plus a constant. In an experiment on a nonlinear system we showed that the Extended LQRQL approach gives a better local approximation of the optimal nonlinear feedback function.

For nonlinear systems, function approximators can be used to approximate the Qfunction. The Neural QLearning approach can be used when the system is nonlinear. We showed that it is possible to combine the LQRQL approach with a feedforward neural network approximation of the Qfunction. The nonlinear feedback function can be determined directly from the approximated Qfunction. This results in a global nonlinear feedback function. Experiments have shown that this approach requires less training data than an approach based on partitioning of the state space, where each partition approximates a linear feedback.

Reinforcement Learning can be used as a method to find a controller for real systems.
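The step shared by the LQRQL approaches, determining a feedback directly from a quadratic Q-function by minimizing it over the control action, can be illustrated with a minimal sketch. It assumes a scalar state and action, so Q(x, u) = [x; u]' H [x; u] with H = [[Hxx, Hxu], [Hux, Huu]]; the function name and the example numbers are illustrative assumptions, not the thesis's exact algorithm:

```python
def greedy_feedback(H_uu, H_ux):
    """Greedy linear feedback u = L*x from a quadratic Q-function.

    For Q(x, u) = [x; u]' H [x; u] with scalar state and action,
    setting dQ/du = 0 gives L = -H_ux / H_uu (assuming H_uu > 0,
    i.e. Q is convex in the action).
    """
    assert H_uu > 0, "Q must be convex in the action"
    return -H_ux / H_uu

# Example Q with H = [[2, 1], [1, 4]]: Q(x, u) = 2x^2 + 2xu + 4u^2
L = greedy_feedback(H_uu=4.0, H_ux=1.0)
print(L)   # -0.25: u = -0.25 x minimizes Q for every x
```

Because the minimizer depends only on the estimated H, no model of the system itself is needed at this step, which is what makes the Q-learning formulation model-free.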
Samenvatting

Het onderwerp van dit proefschrift is het gebruik van Reinforcement Learning (RL) voor het regelen van werkelijke systemen. Bij RL wordt de regelaar geoptimaliseerd op basis van een scalaire evaluatie, die reinforcement wordt genoemd. Voor systemen met discrete toestanden en regelacties is er een solide theoretische basis ontwikkeld en kan convergentie naar de optimale regelaar worden gegarandeerd. In de praktijk hebben systemen vaak continue toestanden en regelacties. Voor deze systemen zijn de consequenties van het toepassen van RL minder duidelijk.

Een probleem dat optreedt wanneer RL wordt toegepast op werkelijke continue systemen is dat het niet meer mogelijk is om te garanderen dat het gesloten systeem stabiel blijft tijdens het leren, vooral omdat ook een random proces, "exploratie" genaamd, moet worden toegevoegd tijdens het leren. Dit maakt het een gevaarlijke manier om een regelaar te verkrijgen. Een ander probleem is dat RL algoritmes erg langzaam leren. Dit betekent dat voor een langzaam systeem het gebruik van RL erg tijdrovend is. Om de toepasbaarheid van RL voor deze systemen te vergroten is meer inzicht vereist.

Wanneer een systeem lineair is en de kosten worden gegeven door een kwadratische functie van de toestand en regelactie kan Linear Quadratic Regulation (LQR) worden gebruikt. Dit levert de optimale lineaire feedback functie. Systeem Identificatie (SI) kan worden gebruikt als het systeem niet bekend is. De LQR taak kan ook opgelost worden met QLearning, een modelvrije RL aanpak waarin niet het systeem expliciet wordt gemodelleerd maar de kostenfunctie. We hebben deze methode LQRQL genoemd. Voor een onbekend lineair systeem kan de LQRQL methode worden gebruikt.

Om de optimale feedback te vinden is het noodzakelijk om voldoende exploratie te gebruiken. We hebben de kwaliteit van de resulterende feedback beschreven als functie van de hoeveelheid exploratie en de hoeveelheid systeemruis. Op basis hiervan hebben we afgeleid dat voor een gegarandeerde verbetering van de kwaliteit het noodzakelijk is dat er meer wordt geëxploreerd dan er ruis is in het systeem. Bovendien hebben we aangetoond dat de LQRQL methode meer exploratie vereist dan de SI aanpak.

De meeste praktische systemen zijn nietlineaire systemen, waarvoor nietlineaire feedback functies nodig zijn. Bestaande technieken voor nietlineaire systemen zijn vaak gebaseerd op lokaal lineaire benaderingen. De lineaire feedback van de LQRQL en de SI aanpak zijn niet altijd in staat om een goede lokale benadering te vormen van de optimale nietlineaire feedback functie. Het is mogelijk om de LQRQL aanpak uit te breiden, door eerst een lineaire feedback te berekenen waarop een nietlineaire correctie kan worden bepaald. Bij de extended LQRQL aanpak worden meer parameters geschat en het resulteert in een lineaire feedback plus een constante. In een experiment met een nietlineair systeem hebben we laten zien dat de extended LQRQL aanpak een betere lokaal lineaire benadering oplevert van de optimale nietlineaire feedback functie.

Voor nietlineaire systemen kunnen algemene functieschatters worden gebruikt om de Qfunctie te representeren. De Neurale QLearning aanpak kan worden gebruikt indien het systeem nietlineair is. We hebben aangetoond dat het mogelijk is om LQRQL te combineren met een feedforward neuraal netwerk benadering van de Qfunctie. De nietlineaire feedback functie kan direct uit de benaderde Qfunctie worden bepaald. Het resultaat is een globale nietlineaire functie. De experimenten hebben laten zien dat deze aanpak minder leervoorbeelden vereist dan een aanpak gebaseerd op het partitioneren van de toestandsruimte, waarin voor iedere toestand een afzonderlijke lineaire feedback wordt bepaald.

Reinforcement Learning kan gebruikt worden als methodiek voor het vinden van regelaars voor werkelijke systemen.
Acknowledgments

The first people to thank are Frans Groen and Ben Kröse. Frans I want to thank for his clarity in feedback and for giving hope that this thesis would eventually be finished. Ben I want to thank for remaining a nice guy, in spite of me being sometimes a little bit stubborn, sloppy and annoying. The project members from Delft, Bart Wams and Ton van den Boom, were very helpful with the control theoretic aspects in this thesis. Also the user group members of our STW project contributed to this thesis by indicating the importance of reliability on the practical applicability of control approaches. The work in chapter 3 was possible because Walter Hoffmann explained how the QR-decomposition can make things easier. The work on the real robot would not be possible without the help of Edwin Steffens, Sjaak Verbeek, and Rien van Leeuwen. I have to thank the many students as well, who often tried to be funnier than me, for providing enough distractions in the last few years.

I was lucky to have very entertaining room mates. Joris van Dam was always loud and funny. Anuj Dev helped me improve my pingpong skills. Nikos Massios never was angry about my bicycle helmet jokes and Joris Portegies Zwart never stopped bragging about his favorite operating system. Also I had colleagues with whom I enjoyed spending time: Leo Dorst, Nikos Vlassis and Roland Bunschoten. Bas Terwijn and Tim Bouma made sure that conversations during lunch were never too serious. Especially Danno l'Ecluse and Kirsten ten Tusscher provided sufficient opportunities for not working on this thesis.

Last but not least, I want to thank my mother for all the lasagnas, which is originally a Greek word, with vitamins.