

International Journal of Tomography & Statistics
Summer 2007, Vol. 6, No. S07; Int. J. Tomogr. Stat.; 122-127
ISSN 0972-9976; Copyright © 2007 by IJTS, ISDER

MODEL PREDICTIVE CONTROL AND REINFORCEMENT LEARNING AS TWO COMPLEMENTARY FRAMEWORKS

Damien Ernst 1, Mevludin Glavic, Florin Capitanescu, Louis Wehenkel

Dept of Electrical Engineering and Computer Science, University of Liège, Belgium
Email: {ernst,glavic,capitane,lwh}@montefiore.ulg.ac.be

Abstract: Model predictive control (MPC) and reinforcement learning (RL) are two popular families of methods for controlling system dynamics. In their traditional setting, both formulate the control problem as a discrete-time optimal control problem and compute a suboptimal control policy. In this paper we present these two families of methods in a unified framework. We run simulations with an MPC and an RL algorithm on a benchmark control problem taken from the power system literature and discuss the results obtained.

Keywords: Model predictive control, reinforcement learning, interior-point method, fitted Q iteration

1. INTRODUCTION

Reinforcement learning and model predictive control are two families of control techniques which tackle control problems by formalizing them as optimal control problems. While MPC techniques assume that a model of the optimal control problem is available, reinforcement learning techniques assume that the only information available about the model is the one gathered from interaction with the system. Both families of techniques usually compute a suboptimal control policy for the control problem. For MPC techniques, the suboptimalities come mainly from the simplifications made to the original optimal control problem in order to lighten the computational burden, while for RL techniques they also originate from the lack of information on the model.

This paper is an attempt to present some MPC and RL algorithms in a unified framework and reports, for the first time, simulation results obtained by both families of techniques on the same optimal control problem.

Over the last decades, researchers from both the MPC and the RL fields have proposed numerous algorithms that can be applied to various types of systems (systems of differential equations, Markov decision processes, etc.). We have decided to focus in this paper on discrete-time deterministic systems and we have selected, to ease the comparison, one particular type of algorithm for each field. The MPC algorithm selected solves the optimal control problem by computing open-loop policies through the resolution of a non-linear optimization problem in which the system dynamics is represented by equality constraints (Morari and Lee, 1999). Such a choice was in some sense natural, since this is the dominant approach to MPC. As RL algorithm, we have chosen fitted Q iteration, which computes, from the information gathered from interaction with the system, an approximation of the optimal control policy by solving a sequence of batch-mode supervised learning regression problems (Ernst et al., 2005). Two main reasons have motivated this choice. First, this algorithm has been shown to clearly outperform other RL algorithms, which makes it a good candidate to become the standard approach to RL. Secondly, in its essence, and contrary to many other existing RL algorithms which have been designed specifically to solve infinite time horizon control problems, the fitted Q iteration algorithm solves finite time horizon problems. Since such a feature is also shared by MPC techniques, the presentation of RL and MPC in a unified framework gets simpler.

1 Damien Ernst is a postdoctoral researcher of the Belgian FNRS (Fonds National de la Recherche Scientifique), of which he acknowledges the financial support.

2. MPC AND RL IN THE FINITE HORIZON CASE

2.1 The optimal control problem

We consider a discrete-time system whose dynamics over T stages is described by a time-invariant equation:

x_{t+1} = f(x_t, u_t),   t = 0, 1, ..., T − 1   (1)

where, ∀t, the state x_t is an element of the state space X and the action u_t is an element of the action space U. T ∈ ℕ_0 is referred to as the optimization horizon.

The transition from t to t + 1 is associated with an instantaneous cost signal c_t = c(x_t, u_t) ∈ ℝ which we assume bounded by a constant B_c, and for every initial state x_0 and for every sequence of actions we define the discounted cost over T stages by

C_T^{u_0, u_1, ..., u_{T−1}}(x_0) = Σ_{t=0}^{T−1} γ^t c(x_t, u_t)   (2)

where γ ∈ [0, 1[ is the discount factor.

In this context, an optimal control sequence u*_0, u*_1, ..., u*_{T−1} is a sequence of actions which minimizes the cost over T stages. 2 We denote by C_T^π(x) the cost over T stages associated with a policy π when the initial state is x_0 = x. A T-stage optimal policy is a policy that leads, for every initial state x_0 = x, to a minimal C_T^π(x). This is the kind of policy we are looking for.

2 In this problem statement, we did not explicitly consider constraints other than those implied by the system dynamics. Note however that static constraints (e.g. operating limits) can be modeled in this formulation by penalizing the cost function, and dynamic ones (e.g. energy and time limitations) by first introducing additional state variables and adding penalizing terms on these latter.

We consider three different types of policies: open-loop policies, which select at time t the action u_t based only on the initial state x_0 of the system and the current time (u_t = π_o(t, x_0)); closed-loop policies, which select the action u_t based on the current time and the current state (u_t = π_c(t, x_t)); and closed-loop stationary policies, for which the action is selected only based on the knowledge of the current state (u_t = π_s(x_t)).

To characterize optimality of these T-stage policies, we define iteratively the sequence of functions Q_N : X × U → ℝ, N = 1, ..., T, as follows:

Q_N(x, u) = c(x, u) + γ inf_{u' ∈ U} Q_{N−1}(f(x, u), u')   (3)

with Q_0(x, u) = 0, ∀(x, u) ∈ X × U.

We have the following theorem (see e.g. Bertsekas (2000) for a proof):

Theorem 1. A sequence of actions u*_0, u*_1, ..., u*_{T−1} is optimal if and only if Q_{T−t}(x_t, u*_t) = inf_{u' ∈ U} Q_{T−t}(x_t, u'), ∀t ∈ {0, ..., T − 1}.

Under various sets of additional assumptions (e.g. U finite, or see Hernández-Lerma and Lasserre (1996) when U is infinite), the existence of an optimal closed-loop (open-loop) policy which is a T-stage optimal policy is guaranteed. We use the notation π*_{c,T} (π*_{o,T}) to refer to a closed-loop (open-loop) T-stage optimal policy. From Theorem 1, we see that every policy π*_{c,T} is such that π*_{c,T}(x, t) ∈ {u ∈ U | Q_{T−t}(x, u) = inf_{u' ∈ U} Q_{T−t}(x, u')}. Similarly, for every policy π*_{o,T} we have π*_{o,T}(x_0, t) ∈ {u ∈ U | Q_{T−t}(x_t, u) = inf_{u' ∈ U} Q_{T−t}(x_t, u')} with x_t = f(x_{t−1}, π*_{o,T}(x_0, t − 1)), ∀t = 1, ..., T − 1.
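To make the recursion (3) and the characterization of Theorem 1 concrete, the following sketch (ours, not part of the original paper) computes the functions Q_N exactly when X and U are finite collections; the dynamics f, the cost c and the discount factor are placeholders to be supplied by the reader, and f must map back into X.

```python
# Minimal sketch (not from the paper): exact computation of the Q_N functions
# of Eqn (3) for finite X and U, and of a closed-loop policy as characterized
# by Theorem 1. X, U, f, c and gamma are placeholders.

def compute_Q_functions(X, U, f, c, gamma, T):
    """Return [Q_0, Q_1, ..., Q_T], each a dict mapping (x, u) to a cost."""
    Q = [{(x, u): 0.0 for x in X for u in U}]            # Q_0 = 0 everywhere
    for _ in range(T):
        Q_prev = Q[-1]
        Q.append({(x, u): c(x, u)
                  + gamma * min(Q_prev[(f(x, u), up)] for up in U)
                  for x in X for u in U})
    return Q

def closed_loop_policy(Q, U, T):
    """pi*_{c,T}(x, t): an action minimizing Q_{T-t}(x, .), cf. Theorem 1."""
    return lambda x, t: min(U, key=lambda u: Q[T - t][(x, u)])
```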
2.2 Model predictive control

Model predictive control techniques target an optimal open-loop policy π*_{o,T}, and assume that the system dynamics and the cost function are available in analytical form.

For a given initial state x_0, the terms π*_{o,T}(x_0, t), t = 0, 1, ..., T − 1, of the optimal open-loop policy may then be computed by solving the minimization problem:

inf_{(u_0, u_1, ..., u_{T−1}, x_1, x_2, ..., x_{T−1}) ∈ U × ··· × U × X × ··· × X}  Σ_{t=0}^{T−1} γ^t c(x_t, u_t)   (4)

subject to the T equality constraints (1).

Under appropriate assumptions, the minimization problem (4) can be tackled by classical convex programming algorithms. However, for many practical problems its resolution may be a difficult task, with no guarantee that the solution found by the optimizer used is indeed optimal. Therefore, MPC techniques actually rather produce an approximation π̂*_{o,T} of a T-stage optimal control policy. The better the approximation, the smaller C_T^{π̂*_{o,T}}(x_0) − C_T^{π*_{o,T}}(x_0).
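As a hedged illustration of problem (4) (our own sketch, not the formulation the authors use in Section 4.3), the equality constraints (1) can also be eliminated by substitution: simulate the model inside the objective and minimize over the action sequence only with a generic optimizer. The names f, c, x0 and the action bounds below are placeholders, and a scalar action is assumed.

```python
# Sketch (ours): approximate resolution of (4) by substituting the dynamics
# into the objective and minimizing over u_0, ..., u_{T-1} only.
import numpy as np
from scipy.optimize import minimize

def open_loop_cost(u_seq, x0, f, c, gamma):
    """Discounted cost (2) of the action sequence u_seq applied from x0."""
    x, total = x0, 0.0
    for t, u in enumerate(u_seq):
        total += (gamma ** t) * c(x, u)
        x = f(x, u)
    return total

def mpc_open_loop(x0, f, c, gamma, T, u_min, u_max):
    """Return an approximation of the optimal open-loop sequence u*_0..u*_{T-1}."""
    res = minimize(open_loop_cost, np.zeros(T), args=(x0, f, c, gamma),
                   bounds=[(u_min, u_max)] * T)
    return res.x
```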
2.3 Reinforcement learning

Reinforcement learning techniques do not assume that the system dynamics and the cost function are given in analytical (or even algorithmic) form. The sole information they assume available about the system dynamics and the cost function is the one that can be gathered from the observation of system trajectories. Reinforcement learning techniques compute from this an approximation π̂*_{c,T} of a T-stage optimal (closed-loop) policy since, except under very special conditions, the exact optimal policy cannot be determined from such a limited amount of information.

The fitted Q iteration algorithm on which we focus in this paper actually relies on a slightly weaker assumption, namely that a set of one-step system transitions is given, each one providing the knowledge of a new sample of information (x_t, u_t, c_t, x_{t+1}) that we name a four-tuple.

We denote by F the set {(x^l_t, u^l_t, c^l_t, x^l_{t+1})}_{l=1}^{#F} of available four-tuples. Fitted Q iteration computes from F the functions Q̂_1, Q̂_2, ..., Q̂_T, approximations of the functions Q_1, Q_2, ..., Q_T defined by Eqn (3), by solving a sequence of batch-mode supervised learning problems. From these, a policy which satisfies

π̂*_{c,T}(t, x) ∈ {u ∈ U | Q̂_{T−t}(x, u) = inf_{u' ∈ U} Q̂_{T−t}(x, u')}

is taken as approximation of an optimal control policy.

Posing Q̂_0(x, u) = 0, ∀(x, u) ∈ X × U, the training set for the Nth supervised learning problem of the sequence (N ≥ 1) is

{(i^l, o^l) = ((x^l_t, u^l_t), c^l_t + γ inf_{u ∈ U} Q̂_{N−1}(x^l_{t+1}, u))}_{l=1}^{#F}

where the i^l (resp. o^l) denote the inputs (resp. outputs). The supervised learning regression algorithm produces from this sample of inputs and outputs the function Q̂_N.
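The following sketch (ours) makes the iteration explicit. It uses scikit-learn's ExtraTreesRegressor as a stand-in for the Extra-Trees method the authors use in Section 4.2, and a finite grid U_grid of candidate actions for the inner minimization; states are assumed to be numeric vectors and actions scalars.

```python
# Sketch of fitted Q iteration on a set F of four-tuples (x, u, c, x_next).
# ExtraTreesRegressor stands in for the Extra-Trees method cited by the
# authors; U_grid is a finite set of candidate actions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, U_grid, gamma, T):
    """F: list of (x, u, c, x_next) with x a 1-D array; returns [Q_hat_1..Q_hat_T]."""
    X_in = np.array([np.append(x, u) for x, u, c, x_next in F])   # inputs (x, u)
    costs = np.array([c for _, _, c, _ in F])
    x_next = np.array([xn for _, _, _, xn in F])
    Q_hats, targets = [], costs.copy()            # targets for Q_hat_1: c + gamma*0
    for _ in range(T):
        model = ExtraTreesRegressor(n_estimators=50).fit(X_in, targets)
        Q_hats.append(model)
        # targets for the next iteration: c + gamma * min_u Q_hat_N(x_next, u)
        preds = np.column_stack([
            model.predict(np.column_stack([x_next, np.full(len(F), u)]))
            for u in U_grid])
        targets = costs + gamma * preds.min(axis=1)
    return Q_hats

def greedy_action(Q_hat, U_grid, x):
    """Action minimizing Q_hat(x, .) over the grid of candidate actions."""
    q = [Q_hat.predict(np.append(x, u).reshape(1, -1))[0] for u in U_grid]
    return U_grid[int(np.argmin(q))]
```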
Pm − Pe EV
ω̇ = with Pe = sin δ
M u + Xsystem
3. TIME HORIZON TRUNCATION

where Pm , M , E, V and Xsystem are parameters
Let πs,T denote a stationary policy such that equal respectively to 1, 0.03183, 1, 1 and 0.4.

πs,T (x) ∈ {u ∈ U |QT (x, u) = inf

QT (x, u )} and Pm represents the mechanical power of the ma-
u ∈U
let πT∗ denote a T -stage optimal control policy. chine, M its inertia, E its terminal voltage, V the
voltage of the terminal bus system and Xsystem
We have the following theorem (proof in Ernst
the overall system reactance. When the uncon-
et al. (2006)):
trolled system is driven away from its equilib-
X Pm
rium point (δe , ωe ) = (arcsin( system ), 0), un-
Theorem 2. For any T, T  ∈ 0 such that T  < T , EV
damped electrical power (Pe ) oscillations appear
we have the following bound on the suboptimality
∗ in the line (Fig. 1b). A variable reactance u has
of πs,T  when used as solution of a T -stage optimal
been installed in series with the overall reactance
control problem:
Xsystem . The control problem consists in finding
 ∗  
a control policy for this variable reactance which
π  π∗ γ T (4 − 2γ)Bc
sup CT s,T (x) − CT T (x) ≤ .(5) damps the electrical power oscillations.
x∈X (1 − γ)2
From this continuous time control problem, we
Theorem 2 shows that the suboptimality of the define a discrete-time optimal control problem

policy πs,T  when used as solution of an optimal
with infinite time optimization horizon (T →
control problem with an optimization horizon T ∞) such that policies π leading to small costs
(T > T  ) can be upper bounded by an expression lim CTπ (x), also tend to produce good damping
T →∞
which decreases exponentially with T  . of Pe .
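To give a rough feel for the numbers (our own back-of-the-envelope computation, using the value γ = 0.95 of Section 4 and taking B_c = 1000, the largest instantaneous cost defined there), the right-hand side of (5) is γ^{T'} (4 − 2γ) B_c / (1 − γ)² = 0.95^{T'} · 2.1 · 1000 / 0.0025 = 8.4 × 10^5 · 0.95^{T'}, a quantity that is roughly halved every 14 steps since 0.95^{14} ≈ 0.49. The bound is loose, but it captures the exponential effect of the truncated horizon T' that the experiments of Section 4 exhibit.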
4. EXPERIMENTS

In this section, we present simulation results obtained by using MPC and RL techniques. The section starts with a description of the optimal control problem considered. Then, we run experiments on this optimal control problem with the reinforcement learning algorithm and, afterwards, with the MPC algorithm.

4.1 Control problem

We consider the problem of controlling the benchmark power system represented in Fig. 1a. This academic power system example is composed of a generator connected to a machine of infinite size through a transmission line. On the transmission line is installed a variable reactance which influences the amount of electrical power transmitted through the line. The system has two state variables, the angle δ and the speed ω of the generator. The dynamics of these two state variables is given by the set of differential equations:

δ̇ = ω
ω̇ = (P_m − P_e) / M,   with   P_e = (EV / (u + X_system)) sin δ

where P_m, M, E, V and X_system are parameters equal respectively to 1, 0.03183, 1, 1 and 0.4. P_m represents the mechanical power of the machine, M its inertia, E its terminal voltage, V the voltage of the terminal bus system and X_system the overall system reactance. When the uncontrolled system is driven away from its equilibrium point (δ_e, ω_e) = (arcsin(X_system P_m / (EV)), 0), undamped electrical power (P_e) oscillations appear in the line (Fig. 1b). A variable reactance u has been installed in series with the overall reactance X_system. The control problem consists in finding a control policy for this variable reactance which damps the electrical power oscillations.

From this continuous-time control problem, we define a discrete-time optimal control problem with an infinite optimization horizon (T → ∞) such that policies π leading to small costs lim_{T→∞} C_T^π(x) also tend to produce good damping of P_e.

The discrete-time dynamics is obtained by discretizing time, with the time between t and t + 1 chosen equal to 0.050 s. The value of u is kept constant during each 0.050 s interval. If δ_{t+1} and ω_{t+1} are such that they do not belong anymore to the stability domain of the uncontrolled (u = 0) system (Fig. 1c), that is if
(1/2) M ω²_{t+1} − P_m δ_{t+1} − (EV / X_system) cos(δ_{t+1}) > −0.438788,

then a terminal state is supposed to be reached. A terminal state is a state in which the system remains stuck, i.e. term. state = f(term. state, u), ∀u ∈ U.

The state space X is thus composed of the uncontrolled system's stability domain (Fig. 1c) plus one terminal state. The action space U is chosen such that U = {u ∈ ℝ | u ∈ [−0.16, 0]}. The decay factor γ is equal to 0.95. The cost function c(x, u) was chosen to penalize deviations of the electrical power from its steady-state value (P_e at steady state = P_m). Furthermore, since control actions should not drive the system towards the terminal state, a very large cost is associated with such actions. The value of this latter cost was chosen such that a policy which drives the system outside of the stability domain, whatever the optimization horizon T, is necessarily suboptimal. The exact cost function is:

c(x_t, u_t) = (P_{e,t+1} − P_m)²   if x_{t+1} ≠ term. state
            = 0                    if x_t = term. state
            = 1000                 otherwise   (6)

where P_{e,t+1} = (EV / (X_system + u_t)) sin(δ_{t+1}).

Fig. 1. Some insights into the control problem: (a) representation of the power system; (b) electrical power oscillations when at t = 0, (δ, ω) = (0, 8) and u = 0; (c) stability domain of the uncontrolled (u = 0) power system.
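For readers who wish to reproduce this discrete-time problem, the sketch below (ours; the paper does not specify how the continuous dynamics is integrated) advances the swing equation over one 0.050 s interval with a classical Runge-Kutta scheme, applies the stability-domain test above and evaluates the cost (6).

```python
# Sketch (our own) of the discrete-time benchmark of Section 4.1: one 0.050 s
# transition under a constant action u, the stability-domain test and the
# cost (6). The RK4 integration with 10 sub-steps is our own choice.
import math

PM, M, E, V, XSYS = 1.0, 0.03183, 1.0, 1.0, 0.4
TERMINAL = None                       # marker for the terminal state

def deriv(delta, omega, u):
    pe = E * V / (u + XSYS) * math.sin(delta)
    return omega, (PM - pe) / M

def step(state, u, h=0.05, n_sub=10):
    """f(x_t, u_t): integrate over h seconds, return the next state (or TERMINAL)."""
    if state is TERMINAL:
        return TERMINAL
    delta, omega = state
    dt = h / n_sub
    for _ in range(n_sub):            # classical RK4 sub-steps
        k1 = deriv(delta, omega, u)
        k2 = deriv(delta + dt/2*k1[0], omega + dt/2*k1[1], u)
        k3 = deriv(delta + dt/2*k2[0], omega + dt/2*k2[1], u)
        k4 = deriv(delta + dt*k3[0], omega + dt*k3[1], u)
        delta += dt/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
        omega += dt/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
    energy = 0.5*M*omega**2 - PM*delta - E*V*math.cos(delta)/XSYS
    return TERMINAL if energy > -0.438788 else (delta, omega)

def cost(state, u):
    """Cost function (6)."""
    if state is TERMINAL:
        return 0.0
    nxt = step(state, u)
    if nxt is TERMINAL:
        return 1000.0
    pe_next = E * V / (XSYS + u) * math.sin(nxt[0])
    return (pe_next - PM) ** 2
```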

Table 1. Value of lim_{T→∞} C_T^{π̂*_{s,T'}}((0, 8)) estimated by simulating the system with π̂*_{s,T'} from (δ, ω) = (0, 8) over 1000 time steps. The suffix † indicates that the policy drives the system to the terminal state before t = 1000. The non-linear optimization algorithm did not converge at time t = 164 when T' = 20.

                      Value of T'
                   1       5       20      100
RL, #F = 10^5    44.3†   39.3    20.5    20.9
RL, #F = 10^3    43.9†   47.2    27.7    26.9
MPC              55.2†   38.2     –      20.3

4.2 Reinforcement learning

Fitted Q iteration algorithm. The fitted Q iteration algorithm computes π̂*_{s,T'} by solving sequentially T' batch-mode supervised learning problems. As batch-mode supervised learning algorithm, we have chosen the Extra-Trees algorithm (Ernst et al., 2005; Geurts et al., 2006). At each iteration of the fitted Q iteration algorithm, one also needs to compute inf_{u ∈ U} Q̂(x^l_{t+1}, u) for each l ∈ {1, 2, ..., #F}. To compute this minimum, we have considered a subset U' = {0, −0.016, ..., −0.16} of U and computed the minimum over this subset. Similarly, after T' iterations of the algorithm, the policy π̂*_{s,T'} output by the RL algorithm is chosen equal to π̂*_{s,T'}(x) = arg inf_{u ∈ U'} Q̂_{T'}(x, u).

Four-tuples generation. To collect the four-tuples, we have considered 100,000 one-step episodes with x_0 and u_0 for each episode drawn at random in X × U.
Results. In Figs 2a-d, we have represented the policy π̂*_{s,T'} computed for increasing values of T'. As we may observe, the policy changes considerably with T'. To assess the influence of T' on the ability of the policy π̂*_{s,T'} to approximate an optimal policy over an infinite time horizon, we have computed, for different values of T', the value of lim_{T→∞} C_T^{π̂*_{s,T'}}((δ, ω) = (0, 8)). The results are reported on the first line of Table 1. We see that the cost tends to decrease when T' increases. That means that the suboptimality of the policy π̂*_{s,T'} tends to decrease with T', as suggested by Theorem 2.

The performance of the fitted Q iteration algorithm is influenced by the information it has on the optimal control problem, represented by the set F. Usually, the less information, the larger lim_{T→∞} C_T^{π̂*_{s,T'}}(x) − lim_{T→∞} C_T^{π*_{s,T'}}(x), assuming that T' is sufficiently large. To illustrate this, we have run the RL algorithm by considering this time a 1000-element set of four-tuples, with the four-tuples generated in the same conditions as before. The resulting policies π̂*_{s,T'} are drawn in Figs 2e-h. As shown on the second line of Table 1, these policies indeed tend to lead to higher costs than those observed by considering 100,000 four-tuples.

We mentioned earlier, when defining the optimal control problem, that a policy leading to small costs also leads to good damping of P_e. This is illustrated in Fig. 3a.

Fig. 2. Representation of π̂*_{s,T'}(x). The evaluation is carried out for {(δ, ω) ∈ X | ∃ i, j ∈ ℤ : (δ, ω) = (0.1 i, 0.5 j)}. The grey level of a bullet associated with a state x gives information about the magnitude of |π̂*_{s,T'}(x)|: black bullets correspond to the action of largest magnitude (−0.16), white bullets to the smallest one (0) and grey bullets to intermediate values, with the larger the magnitude of |π̂*_{s,T'}|, the darker the grey. Panels (a)-(d) give, for T' = 1, 5, 20, 100, the policies π̂*_{s,T'}(x) obtained by the RL algorithm with 100,000 four-tuples, panels (e)-(h) those obtained by the RL algorithm with 1000 four-tuples, and panels (i)-(l) those obtained by the MPC algorithm. In these latter four panels, states for which the MPC algorithm failed to converge are not represented.

Fig. 3. Evolution of P_e when the system starts from (δ_0, ω_0) = (0, 8) and is controlled by the policy π̂*_{s,100}: (a) π̂*_{s,100} computed by RL with #F = 10^5; (b) π̂*_{s,100} computed by MPC.

4.3 Model predictive control

Non-linear optimization problem. At the core of the MPC approach, there is the resolution of a non-linear programming problem. It has been stated as follows:

inf_{(δ_1, ..., δ_{T'}, ω_1, ..., ω_{T'}, u_0, ..., u_{T'−1}) ∈ ℝ^{3T'}}  Σ_{t=0}^{T'−1} γ^t ( (EV / (X_system + u_t)) sin(δ_{t+1}) − P_m )²   (7)

subject to the 2T' equality constraints (t = 0, 1, ..., T' − 1):

δ_{t+1} − δ_t − (h/2) ω_t − (h/2) ω_{t+1} = 0   (8)

ω_{t+1} − ω_t − (h/2) (1/M) (P_m − EV sin δ_t / (u_t + X_system)) − (h/2) (1/M) (P_m − EV sin δ_{t+1} / (u_t + X_system)) = 0   (9)

with h = 0.05 s, and the 3T' inequality constraints (t = 0, 1, ..., T' − 1):

u_t ≤ 0   (10)
−u_t ≤ 0.16   (11)
(1/2) M ω²_{t+1} − P_m δ_{t+1} − (EV / X_system) cos(δ_{t+1}) + 0.438788 ≤ 0   (12)

Several choices have been made to state the non-linear optimization problem. First, the cost of 1000 that occurs when the system reaches a terminal state does not appear in the cost functional (7); it has been replaced by the inequality constraints (12). These constraints limit the search for an optimal policy to open-loop policies that guarantee that the system stays inside the stability domain of the uncontrolled system. This does not lead to any suboptimality since we know that
policies driving the system to a terminal state are suboptimal policies. Another choice that deserves explanation is the way the equality constraints x_{t+1} = f(x_t, u_t) are modeled. To model them, we have relied on a trapezoidal method with a step size of 0.05 s, the time between t and t + 1 (Eqns (8)-(9)).

Non-linear optimization algorithm. As non-linear optimizer, we have used the primal-dual Interior-Point Method (Kortanek et al., 1991).
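As a reproducible stand-in for this optimizer (our own sketch; it uses SciPy's SLSQP solver rather than the primal-dual interior-point code of Kortanek et al. (1991), and may therefore exhibit different convergence behaviour), problem (7)-(12) can be written as follows for a horizon T'.

```python
# Sketch (ours): the NLP (7)-(12) for a horizon Tp, solved with SciPy's SLSQP.
# Decision vector z = [delta_1..delta_Tp, omega_1..omega_Tp, u_0..u_{Tp-1}].
import numpy as np
from scipy.optimize import minimize

PM, M, E, V, XSYS, H, GAMMA = 1.0, 0.03183, 1.0, 1.0, 0.4, 0.05, 0.95

def solve_mpc(delta0, omega0, Tp):
    def unpack(z):
        return z[:Tp], z[Tp:2*Tp], z[2*Tp:]

    def objective(z):                       # cost functional (7)
        d, w, u = unpack(z)
        pe = E*V/(XSYS + u) * np.sin(d)     # P_e at t+1 for t = 0..Tp-1
        return np.sum(GAMMA**np.arange(Tp) * (pe - PM)**2)

    def eq_constraints(z):                  # trapezoidal dynamics (8)-(9)
        d, w, u = unpack(z)
        dp = np.concatenate(([delta0], d[:-1]))    # delta_t
        wp = np.concatenate(([omega0], w[:-1]))    # omega_t
        acc_t  = (PM - E*V*np.sin(dp)/(u + XSYS)) / M
        acc_t1 = (PM - E*V*np.sin(d) /(u + XSYS)) / M
        return np.concatenate([d - dp - (H/2)*(wp + w),
                               w - wp - (H/2)*(acc_t + acc_t1)])

    def stab_constraints(z):                # (12), written as g(z) >= 0
        d, w, u = unpack(z)
        energy = 0.5*M*w**2 - PM*d - E*V*np.cos(d)/XSYS
        return -(energy + 0.438788)

    z0 = np.concatenate([np.full(Tp, delta0), np.full(Tp, omega0), np.zeros(Tp)])
    res = minimize(objective, z0, method="SLSQP",
                   bounds=[(None, None)]*2*Tp + [(-0.16, 0.0)]*Tp,  # (10)-(11)
                   constraints=[{"type": "eq",   "fun": eq_constraints},
                                {"type": "ineq", "fun": stab_constraints}])
    return res.x[2*Tp]                      # u_0, the first action of the sequence
```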
Results. Figures 2i-l represent the different policies π̂*_{s,T'} computed for increasing values of T'. To compute the value of π̂*_{s,T'} for a specific state x, we need to run the optimization algorithm. As reported in these figures, for some states the algorithm fails to converge. Also, the larger the value of T', the more frequently the algorithm fails to converge, which is not surprising since the complexity of the optimization problem tends to increase with T'. For example, of the 1304 states x for which the policy π̂*_{s,T'} was estimated, 1 failed to converge for T' = 1, 2 for T' = 5, 340 for T' = 20 and 474 for T' = 100. One may reasonably wonder whether the convergence problems are not due to the fact that, for some values of x_0 and T', there is no solution to the optimization problem. If the equality constraints x_{t+1} = f(x_t, u_t) were perfectly represented, this could not be the case since, when u_t = 0, t = 0, 1, ..., T' − 1, we know that the trajectory stays inside the stability domain. However, with the discrete dynamics approximated by Eqns (8)-(9), this may no longer be true. To know whether most of these convergence problems were indeed caused by approximating the equality constraints, we have simulated the system modeled by Eqns (8)-(9) over 100 time steps with u_t = 0, ∀t. We found out that of the 1304 states for which the policy was estimated, 20 were indeed leading to a violation of the constraints when chosen as initial state. This is however a small number compared to the 474 states for which the non-linear optimization algorithm failed to converge.

It is interesting to notice that, if we set aside the states for which convergence did not occur, MPC policies look similar to the RL policies computed with a large number of samples. This is not surprising, since both methods aim to approximate π*_{s,T'} policies. Table 1 reports the costs over an infinite time horizon obtained by using MPC policies for various values of T'. When T' = 20, the MPC algorithm failed to converge for some of the states met, which is why the value of lim_{T→∞} C_T^{π̂_{s,20}} is not given. We observe that for T' = 5 and T' = 100 the policies computed by MPC outperform those computed by RL. Figure 3 shows that, while the π̂_{s,100} policy computed by the RL algorithm produces some residual oscillations, the MPC π̂_{s,100} policy is able to damp them completely. This is explained, among other things, by the ability of the MPC algorithm to exploit the continuous nature of the action space, while the RL algorithm discretizes it into a finite number of values.

5. CONCLUSIONS

We have presented in this paper reinforcement learning and model predictive control in a unified framework and have run, for both techniques, experiments on the same optimal control problem. From these experiments, two main conclusions can be drawn. Firstly, the RL algorithm considered, known as fitted Q iteration, offered good performance. This suggests that it may be a good alternative to MPC approaches when the system dynamics and/or the cost function are totally or partially unknown. Indeed, rather than first using the information gathered from interaction to identify the system dynamics and/or the cost function, one could directly extract the control policy from this information by using RL algorithms. Secondly, MPC approaches usually compute from the model a suboptimal policy by solving a non-linear optimization problem. However, we found out that non-linear optimizers may fail to find a good solution, especially when the optimization horizon grows. In some sense, we could have circumvented these convergence problems by using the fitted Q iteration algorithm together with a procedure to generate the four-tuples from the model. More generally, we may question whether it would not be preferable, when the model is known, to rely more on algorithms exploiting the dynamic programming principle, as fitted Q iteration does, than to compute immediately an optimal sequence of actions by solving a non-linear optimization problem in which the system dynamics is represented through equality constraints.

REFERENCES

D.P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, Belmont, MA, 2nd edition, 2000.
D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, April 2005.
D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel. Reinforcement learning and model predictive control: a comparison. Submitted, 2006.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. To appear in Machine Learning, 2006.
O. Hernández-Lerma and J.B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York, 1996.
K. Kortanek, F. Potra, and Y. Ye. On some efficient interior point methods for nonlinear convex programming. Linear Algebra and its Applications, 169–189, April 1991.
M. Morari and J.H. Lee. Model predictive control: past, present and future. Computers and Chemical Engineering, 23:667–682, 1999.
