Abstract: Model predictive control (MPC) and reinforcement learning (RL) are two popular families of methods for controlling system dynamics. In their traditional setting, both formulate the control problem as a discrete-time optimal control problem and compute a suboptimal control policy. In this paper we present these two families of methods in a unified framework. We run simulations with MPC and RL algorithms on a benchmark control problem taken from the power system literature and discuss the results obtained.
2. MPC AND RL IN THE FINITE HORIZON CASE

2.1 The optimal control problem

Consider a system whose dynamics is described by

  x_{t+1} = f(x_t, u_t), t = 0, 1, ...    (1)

The cost over T stages associated with a sequence of actions (u_0, u_1, ..., u_{T-1}) is defined as

  C_T^{(u_0,u_1,...,u_{T-1})}(x_0) = Σ_{t=0}^{T-1} γ^t c(x_t, u_t)    (2)

where γ ∈ [0, 1[ is the discount factor.

In this context, an optimal control sequence u*_0, u*_1, ..., u*_{T-1} is a sequence of actions which minimizes the cost over T stages. We denote by C_T^π(x) the cost over T stages associated with a policy π when the initial state is x_0 = x. A T-stage optimal policy is a policy that leads, for every initial state x_0 = x, to a minimal C_T^π(x). This is the kind of policy we are looking for.

We consider three different types of policies: open-loop policies, which select at time t the action u_t based only on the initial state x_0 of the system and the current time (u_t = π_o(t, x_0)); closed-loop policies, which select the action u_t based on the current time and the current state (u_t = π_c(t, x_t)); and closed-loop stationary policies, for which the action is selected based only on the knowledge of the current state (u_t = π_s(x_t)).

To characterize optimality of these T-stage policies, we define iteratively the sequence of functions Q_N : X × U → ℝ, N = 1, ..., T as follows:

  Q_1(x, u) = c(x, u),
  Q_N(x, u) = c(x, u) + γ inf_{u′∈U} Q_{N-1}(f(x, u), u′), N = 2, ..., T.    (3)

Theorem 1. A sequence of actions u*_0, u*_1, ..., u*_{T-1} is optimal if and only if Q_{T-t}(x_t, u*_t) = inf_{u′∈U} Q_{T-t}(x_t, u′) ∀t ∈ {0, ..., T - 1}.

Model predictive control techniques target an optimal open-loop policy π*_{o,T}, and assume that the system dynamics and cost function are available in analytical form. For a given initial state x_0, the terms π*_{o,T}(x_0, t), t = 0, 1, ..., T - 1 of the optimal open-loop policy may then be computed by solving the minimization problem:

  inf_{(u_0,u_1,...,u_{T-1},x_1,x_2,...,x_{T-1}) ∈ U×···×U×X×···×X}  Σ_{t=0}^{T-1} γ^t c(x_t, u_t)    (4)

subject to the T equality constraints (1).

Under appropriate assumptions, the minimization problem (4) can be tackled by classical convex programming algorithms. However, for many practical problems its resolution may be a difficult task, with no guarantee that the solution found by the optimizer used is indeed optimal. Therefore, MPC techniques actually produce an approximation π̂*_{o,T} of a T-stage optimal control policy. The better the approximation, the smaller C_T^{π̂*_{o,T}}(x_0) - C_T^{π*_{o,T}}(x_0).
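To make the recursion (3) and the optimality condition of Theorem 1 concrete, here is a minimal sketch that computes the functions Q_N by backward induction on a toy problem with finite state and action spaces; the dynamics f, the cost c and the problem sizes are arbitrary stand-ins, not the benchmark studied later in the paper.

```python
# Toy illustration of the Q_N recursion (3) and Theorem 1.
# f, c and the sizes below are arbitrary stand-ins, not the paper's benchmark.
import numpy as np

n_states, n_actions, T, gamma = 5, 3, 10, 0.95
rng = np.random.default_rng(0)
f = rng.integers(n_states, size=(n_states, n_actions))  # x_{t+1} = f(x_t, u_t)
c = rng.random((n_states, n_actions))                    # c(x_t, u_t)

# Q_1(x, u) = c(x, u);  Q_N(x, u) = c(x, u) + gamma * min_{u'} Q_{N-1}(f(x, u), u')
Q = np.zeros((T + 1, n_states, n_actions))
Q[1] = c
for N in range(2, T + 1):
    Q[N] = c + gamma * Q[N - 1][f].min(axis=-1)

# Theorem 1: at stage t, an action is optimal iff it minimizes Q_{T-t}(x_t, .).
x, total = 0, 0.0
for t in range(T):
    u = int(Q[T - t][x].argmin())
    total += gamma**t * c[x, u]
    x = int(f[x, u])
print(total, Q[T][0].min())  # both equal the optimal T-stage cost from state 0
```

The same recursion is what fitted Q iteration approximates when f and c are only known through samples.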
If

  (1/2) M ω²_{t+1} - P_m δ_{t+1} - (EV/X_system) cos(δ_{t+1}) > -0.438788,

then a terminal state is supposed to be reached. A terminal state is a state in which the system remains stuck, i.e. term. state = f(term. state, u) ∀u ∈ U.

The state space X is thus composed of the uncontrolled system's stability domain (Fig. 1c) plus one terminal state. The action space U is chosen such that U = {u ∈ ℝ | u ∈ [-0.16, 0]}. The discount factor γ is equal to 0.95. The cost function c(x, u) was chosen to penalize deviations of the electrical power from its steady-state value (P_e at steady state = P_m). Furthermore, since control actions should not drive the system towards the terminal state, a very large cost is associated with such actions. The value of this latter cost was chosen such that a policy which drives the system outside of the stability domain, whatever the optimization horizon T, is necessarily suboptimal. The exact cost function is:

  c(x_t, u_t) = (P_{e,t+1} - P_m)²   if x_{t+1} ≠ term. state
              = 0                     if x_t = term. state          (6)
              = 1000                  otherwise

where P_{e,t+1} = (EV / (X_system + u_t)) sin(δ_{t+1}).
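For concreteness, the benchmark can be sketched as a small simulator. The excerpt gives neither the numerical values of M, P_m, E, V and X_system nor the integration scheme used between t and t + 1, so the values and the Euler sub-stepping below are assumptions for illustration only; the terminal-state test and the cost implement the condition above and Eqn (6).

```python
# Sketch of the benchmark as a simulator. The excerpt gives neither the values
# of M, Pm, E, V, Xsystem nor the integrator used between t and t+1, so the
# numbers and the Euler sub-stepping below are assumptions for illustration.
import numpy as np

M, Pm, E, V, Xsys = 0.03, 1.0, 1.0, 1.0, 0.5  # assumed parameter values
h = 0.05                                       # assumed time between t and t+1 (s)
TERMINAL = None                                # the absorbing terminal state

def in_stability_domain(delta, omega):
    # Complement of the terminal condition above: terminal is reached when
    # (1/2) M w^2 - Pm d - (E V / Xsys) cos(d) > -0.438788.
    return 0.5 * M * omega**2 - Pm * delta - (E * V / Xsys) * np.cos(delta) <= -0.438788

def step(x, u):
    """One transition x_{t+1} = f(x_t, u_t) with the cost c(x_t, u_t) of Eqn (6)."""
    if x is TERMINAL:
        return TERMINAL, 0.0                   # c = 0 once the terminal state is reached
    delta, omega = x
    for _ in range(10):                        # swing equation: d' = w, M w' = Pm - Pe
        pe = E * V * np.sin(delta) / (Xsys + u)
        delta, omega = delta + (h / 10) * omega, omega + (h / 10) * (Pm - pe) / M
    if not in_stability_domain(delta, omega):
        return TERMINAL, 1000.0                # large penalty for leaving the domain
    pe_next = E * V * np.sin(delta) / (Xsys + u)
    return (delta, omega), (pe_next - Pm) ** 2
```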
[Fig. 1. Some insights into the control problem: (a) representation of the power system, a generator (E∠δ) connected to an infinite-size generator (V∠0) through the system reactance X_system and the variable reactance u, the transmitted electrical power being P_e; (b) electrical power oscillations when at t = 0 (δ, ω) = (0, 8) and u = 0; (c) stability domain of the uncontrolled (u = 0) power system.]

To compute this minimum (the infimum of Q̂_T(x, ·) over U), we have considered a subset U_d = {0, -0.016, ..., -0.16} of U and computed the minimum over this subset. Similarly, after T iterations of the algorithm, the policy π̂*_{s,T} output by the RL algorithm is chosen equal to π̂*_{s,T}(x) = arg inf_{u∈U_d} Q̂_T(x, u).

Four-tuples generation. To collect the four-tuples, we have considered 100,000 one-step episodes, with x_0 and u_0 for each episode drawn at random in X × U.

                  T = 1    T = 5    T = 20   T = 100
  RL, #F = 10^5   44.3†    39.3     20.5     20.9
  RL, #F = 10^3   43.9†    47.2     27.7     26.9
  MPC             55.2†    38.2     20.3     20.3

Table 1. Value of lim_{T→∞} C_T^{π̂*_{s,T}}((0, 8)) estimated by simulating the system with π̂*_{s,T} from (δ, ω) = (0, 8) over 1000 time steps. The suffix † indicates that the policy drives the system to the terminal state before t = 1000. The non-linear optimization algorithm did not converge at time t = 164 when T = 20.

Results. On Figs 2a-d, we have represented the policy π̂*_{s,T} computed for increasing values of T. As we may observe, the policy changes considerably with T. To assess the influence of T on the ability of the policy π̂*_{s,T} to approximate an optimal policy over an infinite time horizon, we have computed, for different values of T, the value of lim_{T→∞} C_T^{π̂*_{s,T}}((δ, ω) = (0, 8)). The results are reported on the first line of Table 1. We see that the cost tends to decrease when T increases. That means that the suboptimality of the policy π̂*_{s,T} tends to decrease with T, as suggested by Theorem 2.

The performance of the fitted Q iteration algorithm is influenced by the information it has on the optimal control problem, represented by the set of four-tuples F. Usually, the less information, the larger lim_{T→∞} C_T^{π̂*_{s,T}}(x) - lim_{T→∞} C_T^{π*_{s,T}}(x), assuming that T is sufficiently large. To illustrate this, we have run the RL algorithm with a 1000-element set of four-tuples, generated in the same conditions as before. The resulting policies π̂*_{s,T} are drawn on Figs 2e-h.
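A compact sketch of the fitted Q iteration loop just described, using the discretized action set and the policy extraction rule above. scikit-learn's ExtraTreesRegressor stands in for the extremely randomized trees of Geurts et al. (2006); the hyper-parameters and the handling of terminal states are simplifications.

```python
# Sketch of fitted Q iteration on the four-tuples F, with an ensemble of
# extremely randomized trees (here sklearn's ExtraTreesRegressor as a
# stand-in). Hyper-parameters are illustrative, not the paper's.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

U_D = np.arange(0.0, -0.176, -0.016)   # the subset {0, -0.016, ..., -0.16}
gamma, T = 0.95, 100                   # values from the text

def fitted_q_iteration(F):
    """F: list of four-tuples (x, u, c, x_next), with x = (delta, omega).
    Terminal-state transitions (where Q is identically 0) are not
    special-cased here, for brevity."""
    X = np.array([[x[0], x[1], u] for x, u, _, _ in F])
    cost = np.array([c for _, _, c, _ in F])
    nxt = np.array([x1 for _, _, _, x1 in F])
    q = None
    for _ in range(T):
        if q is None:
            target = cost                               # Q_1 = c
        else:                                           # Bellman backup over U_D
            q_next = np.stack(
                [q.predict(np.column_stack([nxt, np.full(len(F), u)]))
                 for u in U_D], axis=1)
            target = cost + gamma * q_next.min(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, target)
    return q

def policy(q, x):
    """pi-hat*_{s,T}(x) = arg min_{u in U_D} Q-hat_T(x, u)."""
    vals = q.predict(np.array([[x[0], x[1], u] for u in U_D]))
    return U_D[int(np.argmin(vals))]
```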
[Fig. 2. Policies π̂*_{s,T} in the (δ, ω) plane, for T = 1, 5, 20, 100: (a)-(d) RL with #F = 10^5; (e)-(h) RL with #F = 10^3; (i)-(l) MPC.]
[Fig. 3. Evolution of P_e when the system starts from (δ_0, ω_0) = (0, 8) and is controlled by the policy π̂*_{s,100}: (a) π̂*_{s,100}, RL, #F = 10^5; (b) π̂*_{s,100}, MPC.]

Computing the value of π̂*_{s,T} for a given state requires the solution of a non-linear programming problem. It has been stated as follows:

  inf_{(δ_1,...,δ_T,ω_1,...,ω_T,u_0,...,u_{T-1}) ∈ ℝ^{3T}}  Σ_{t=0}^{T-1} γ^t (P_{e,t+1} - P_m)²    (7)

subject to the equality constraints (t = 0, 1, ..., T - 1):

  δ_{t+1} - δ_t - (h/2) ω_t - (h/2) ω_{t+1} = 0    (8)

  ω_{t+1} - ω_t - (h/2)(1/M)(P_m - EV sin(δ_t)/(u_t + X_system))
                - (h/2)(1/M)(P_m - EV sin(δ_{t+1})/(u_t + X_system)) = 0    (9)

with h = 0.05 s, and the 3T inequality constraints (t = 0, 1, ..., T - 1):

  u_t ≤ 0    (10)

  -u_t ≤ 0.16    (11)

  (1/2) M ω²_{t+1} - P_m δ_{t+1} - (EV/X_system) cos(δ_{t+1}) + 0.438788 ≤ 0    (12)

Several choices have been made to state the non-linear optimization problem. First, the cost of 1000 that occurs when the system reaches a terminal state does not appear in the cost functional (7): the inequality constraints (12) already confine the trajectory to the stability domain, and policies driving the system to a terminal state are suboptimal policies. Another choice that deserves explanation is the way the equality constraints x_{t+1} = f(x_t, u_t) are modeled. To model them, we have relied on a trapezoidal method with a step size of 0.05 s, the time between t and t + 1 (Eqns (8)-(9)).
Non-linear optimization algorithm. As non-linear optimizer, we have used the primal-dual Interior-Point Method (Kortanek et al., 1991).
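As an illustration of how problem (7)-(12) can be stated for an off-the-shelf solver, the sketch below uses SciPy's SLSQP method rather than the primal-dual interior-point method of the paper; the parameter values are the same placeholders as in the earlier sketches.

```python
# Sketch of the non-linear program (7)-(12): trapezoidal collocation solved
# with SciPy's SLSQP (a stand-in for the paper's interior-point method).
import numpy as np
from scipy.optimize import minimize

M, Pm, E, V, Xsys = 0.03, 1.0, 1.0, 1.0, 0.5   # assumed parameter values
h, gamma = 0.05, 0.95

def mpc_action(delta0, omega0, T):
    def unpack(z):
        d = np.concatenate([[delta0], z[:T]])       # delta_0 .. delta_T
        w = np.concatenate([[omega0], z[T:2*T]])    # omega_0 .. omega_T
        u = z[2*T:]                                 # u_0 .. u_{T-1}
        return d, w, u

    def acc(d, u):                                  # (1/M)(Pm - EV sin(d)/(u+Xsys))
        return (Pm - E * V * np.sin(d) / (u + Xsys)) / M

    def cost(z):                                    # Eqn (7)
        d, w, u = unpack(z)
        pe = E * V * np.sin(d[1:]) / (Xsys + u)
        return np.sum(gamma ** np.arange(T) * (pe - Pm) ** 2)

    def dyn(z):                                     # Eqns (8)-(9), residuals = 0
        d, w, u = unpack(z)
        r1 = d[1:] - d[:-1] - (h / 2) * (w[:-1] + w[1:])
        r2 = w[1:] - w[:-1] - (h / 2) * (acc(d[:-1], u) + acc(d[1:], u))
        return np.concatenate([r1, r2])

    def stab(z):                                    # Eqn (12), in SciPy's >= 0 form
        d, w, u = unpack(z)
        return -(0.5 * M * w[1:]**2 - Pm * d[1:]
                 - (E * V / Xsys) * np.cos(d[1:]) + 0.438788)

    z0 = np.concatenate([np.full(T, delta0), np.full(T, omega0), np.zeros(T)])
    bounds = [(None, None)] * (2 * T) + [(-0.16, 0.0)] * T   # Eqns (10)-(11)
    res = minimize(cost, z0, method="SLSQP", bounds=bounds,
                   constraints=[{"type": "eq", "fun": dyn},
                                {"type": "ineq", "fun": stab}])
    return res.x[2 * T]                             # u_0 of the open-loop plan
```

Re-solving this problem at every time step from the current state, and applying only the first action u_0 of the plan, yields the stationary MPC policy π̂*_{s,T}.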
Results. Figures 2i-l represent the different policies π̂*_{s,T} computed for increasing values of T. To compute the value of π̂*_{s,T} for a specific state x, we need to run the optimization algorithm. As reported on these figures, for some states the algorithm fails to converge. Also, the larger the value of T, the more frequently the algorithm fails to converge, which is not surprising since the complexity of the optimization problem tends to increase with T. For example, out of the 1304 states x for which the policy π̂*_{s,T} was estimated, the algorithm failed to converge for 1 state when T = 1, for 2 states when T = 5, for 340 states when T = 20 and for 474 states when T = 100. One may reasonably wonder whether these convergence problems are due to the fact that, for some values of x_0 and T, the optimization problem has no solution. If the equality constraints x_{t+1} = f(x_t, u_t) were perfectly represented, this could not be the case, since with u_t = 0, t = 0, 1, ..., T - 1, the trajectory is known to stay inside the stability domain. However, with the discrete dynamics approximated by Eqns (8)-(9), this may no longer hold. To determine whether most of these convergence problems were indeed caused by approximating the equality constraints, we have simulated the system modeled by Eqns (8)-(9) over 100 time steps with u_t = 0, ∀t. We found that, of the 1304 states for which the policy was estimated, 20 were indeed leading to a violation of the constraints when chosen as initial state. This is however a small number compared to the 474 states for which the non-linear optimization algorithm failed to converge.
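The feasibility check just described can be reproduced directly: simulate the trapezoidal model (8)-(9) with u_t = 0 and test constraint (12) along the trajectory. The implicit trapezoidal step is solved here with scipy.optimize.fsolve, and the parameter values remain the placeholders used earlier.

```python
# Roll the trapezoidal model (8)-(9) forward with u_t = 0 and flag initial
# states whose trajectory violates constraint (12). Parameters are placeholders.
import numpy as np
from scipy.optimize import fsolve

M, Pm, E, V, Xsys, h = 0.03, 1.0, 1.0, 1.0, 0.5, 0.05  # assumed values

def violates(delta0, omega0, n_steps=100, u=0.0):
    def acc(dd):
        return (Pm - E * V * np.sin(dd) / (u + Xsys)) / M
    d, w = delta0, omega0
    for _ in range(n_steps):
        def residual(s):                    # implicit trapezoidal step (8)-(9)
            dn, wn = s
            return [dn - d - (h / 2) * (w + wn),
                    wn - w - (h / 2) * (acc(d) + acc(dn))]
        d, w = fsolve(residual, [d + h * w, w])
        if 0.5 * M * w**2 - Pm * d - (E * V / Xsys) * np.cos(d) + 0.438788 > 0:
            return True                     # constraint (12) violated
    return False
```

Run over the grid of candidate initial states, this is the kind of test that identifies the 20 infeasible states mentioned above.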
It is interesting to notice that, if we set aside the states for which convergence did not occur, MPC policies look similar to RL policies computed with a large number of samples. This is not surprising, since both methods aim to approximate π*_{s,T} policies. Table 1 reports the costs over an infinite time horizon obtained by using MPC policies for various values of T. When T = 20, the MPC algorithm failed to converge for some states met (see the caption of Table 1), which is why the value of lim_{T→∞} C_T^{π̂*_{s,20}} is not taken into account in the comparison. We observe that for T = 5 and T = 100, the policies computed by MPC outperform those computed by RL. Figure 3 shows that, while the π̂*_{s,100} policy computed by the RL algorithm produces some residual oscillations, the MPC π̂*_{s,100} policy is able to damp them completely. This is explained, among other things, by the ability of the MPC algorithm to exploit the continuous nature of the action space, while the RL algorithm discretizes it into a finite number of values.

5. CONCLUSIONS

We have presented in this paper reinforcement learning and model predictive control in a unified framework, and have run, for both techniques, experiments on the same optimal control problem. From these experiments, two main conclusions can be drawn. Firstly, the RL algorithm considered, known as fitted Q iteration, offered good performance. This suggests that it may be a good alternative to MPC approaches when the system dynamics and/or the cost function are totally or partially unknown. Indeed, rather than first using the information gathered from interaction to identify the system dynamics and/or the cost function, one could directly extract the control policy from this information by using RL algorithms. Secondly, MPC approaches usually compute from the model a suboptimal policy by solving a non-linear optimization problem. However, we found that non-linear optimizers may fail to find a good solution, especially when the optimization horizon grows. In some sense, we could have circumvented these convergence problems by using the fitted Q iteration algorithm together with a procedure to generate the four-tuples from the model. More generally, one may question whether, when the model is known, it would not be preferable to rely on algorithms exploiting the dynamic programming principle, as fitted Q iteration does, rather than to compute immediately an optimal sequence of actions by solving a non-linear optimization problem in which the system dynamics is represented through equality constraints.

REFERENCES

D.P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, Belmont, MA, 2nd edition, 2000.

D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, April 2005.

D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel. Reinforcement learning and model predictive control: a comparison. Submitted, 2006.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. To appear in Machine Learning, 2006.

O. Hernández-Lerma and J.B. Lasserre. Discrete-Time Markov Control Processes. Basic Optimality Criteria. Springer, New York, 1996.

K. Kortanek, F. Potra, and Y. Ye. On some efficient interior point methods for nonlinear convex programming. Linear Algebra and its Applications, 152:169-189, 1991.

M. Morari and J.H. Lee. Model predictive control: past, present and future. Computers and Chemical Engineering, 23:667-682, 1999.