Augmenting MPC schemes with active learning:
Intuitive tuning and guaranteed performance
Raffaele Soloperto, Johannes Köhler, Frank Allgöwer

Abstract—A framework to augment an existing model predictive control (MPC) design/implementation with active learning is proposed. Active learning is achieved by employing a user-defined learning cost function (e.g. enforcing persistence of excitation or using exploration terms from reinforcement learning), with the aim to improve model knowledge and reduce uncertainty through model adaptation. The framework is applicable to a general class of nonlinear MPC design procedures and ensures desired performance bounds for the resulting closed loop, which can be intuitively tuned compared to the initial MPC design. The performance bounds are obtained by coupling the active learning objective with performance bounds of the primary MPC, using tools from multi-objective MPC and average constraints from economic MPC. The overall framework can be easily implemented¹ and is intuitive to tune. The resulting computational demand is typically comparable to the original MPC scheme. We demonstrate the practicality of the proposed framework using a numerical example involving a nonlinear uncertain model and active learning.

I. INTRODUCTION

Model predictive control (MPC) [1] is an optimization-based control method that can deal with general nonlinear systems subject to hard state and input constraints. The performance of MPC schemes depends largely on the accuracy of the prediction model. Thus, online model refinement in MPC, e.g., using machine learning approaches [2], [3], or (robust) adaptive methods [4], [5], [6], [7], is an active research topic. Passive learning, i.e. without any active system excitation, is often not sufficient to achieve satisfactory performance [8]. In this paper, we present a framework to augment an existing MPC implementation with a user-defined active learning cost, while retaining the original properties regarding closed-loop constraint satisfaction and performance.

Related works: In reinforcement learning approaches [9], a random excitation input is applied in order to explore unknown dynamics. Since this might lead to constraint violation, in [10] a predictive safety filter is proposed with the aim to ensure safety, without considering performance.

Optimizing for performance while ensuring safety is the goal of many applications. The ideal performance under model uncertainty is obtained using dual control [11], which is in general computationally intractable. As a result, dual MPC schemes [12], [13], [14], [15] try to approximately solve the dual control problem. However, these approaches (i) fail to provide theoretical guarantees regarding closed-loop safety and/or performance [13], [14], [15], or (ii) are limited to linear system dynamics [12], [15]. To overcome the drawbacks associated with existing dual MPC schemes, we focus on the problem of augmenting an existing MPC implementation to enhance learning/system excitation, while preserving the safety and performance guarantees of the original MPC scheme.

Contributions: We present a framework to augment an existing nonlinear MPC scheme with a learning cost function to incentivize active learning. The framework is applicable to a wide variety of nonlinear MPC designs: general economic MPC schemes [16], MPC schemes without terminal constraints [17], robust MPC schemes [18], and combinations thereof [19], [20], [21]. For this general setup, we derive closed-loop performance guarantees depending on intuitively tunable constants, which allow for a simple trade-off between potential performance relaxation and freedom to explore the system. These performance bounds are enabled using techniques from average constraint MPC [22], which extend existing approaches for multi-objective MPC [23], [24]. In addition, the proposed framework poses no restrictions on the predicted learning cost function, allowing for the incorporation of most existing formulations, e.g., persistence of excitation costs [5], covariance predictions using recursive least squares (RLS) [12], variance maximization/exploration in Gaussian processes (GPs) [25], and general reinforcement learning based functions [10].

To summarize, we propose a comprehensive framework to extend existing MPC implementations in order to incorporate a learning cost function. The desired performance and learning can be easily tuned with intuitive scalar constants. In addition, the computational demand of the proposed learning-augmented MPC framework is typically only moderately increased compared to the baseline MPC implementation. We showcase the practicality of the proposed framework with a nonlinear uncertain system, involving reference tracking and active learning.

Outline: Section II presents the basic setup and discusses preliminaries regarding the existing MPC framework. The proposed framework is presented in Section III, including a theoretical derivation of performance bounds. Section IV demonstrates the results with a numerical example and Section V concludes the paper.

This work was supported by the German Research Foundation under Grants GRK 2198/1 - 277536708, AL 316/12-2, and MU 3929/1-2 - 279734922. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Raffaele Soloperto.

Raffaele Soloperto, Johannes Köhler, and Frank Allgöwer are with the Institute for Systems Theory and Automatic Control, University of Stuttgart, 70550 Stuttgart, Germany (email: raffaele.soloperto, johannes.koehler, frank.allgower@ist.uni-stuttgart.de).

¹An exemplary implementation can be found at: https://www.ist.uni-stuttgart.de/dokumente/public/Soloperto_20.m
II. PROBLEM FORMULATION

A. Problem setup

We consider a nonlinear, time-invariant, perturbed discrete-time system

    x_{t+1} = f_w(x_t, u_t, d_t),    (1)

with state x ∈ ℝⁿ, control input u ∈ ℝᵐ, disturbance d ∈ D ⊂ ℝ^q, time t ∈ ℕ, and perturbed (unknown) system f_w. We impose point-wise in time state and input constraints

    (x_t, u_t) ∈ Z,  ∀t ≥ 0,    (2)

where Z ⊆ ℝ^{n+m} is a constraint set.

We consider the case where the model can be learned online to ensure an improvement in the performance and therefore to reduce conservatism. This gives rise to a time-varying nominal model f_t(x, u) and some uncertainty bounds W_t(x, u), satisfying the following condition for all t ≥ 0:

    f_w(x, u, d) − f_t(x, u) ∈ W_t(x, u).

Such a condition is standard in robust MPC [26, Ass. 1] and in robust adaptive MPC using set-membership estimation [4], [5], [6], [7], and is enforced in recent non-parametric machine learning approaches [27]. To ensure constraint satisfaction despite the uncertainty W_t, tube-based robust MPC approaches predict sets X_{k|t} ⊂ ℝⁿ that contain the uncertain predicted trajectories. For linear models, the sets X_{k|t} can often be obtained using vertex enumeration [6], while computationally more efficient approaches parametrize X_{k|t} with a nominal trajectory x_{k|t} and some scaled set, e.g. using fixed robust positive invariant (RPI) sets [18], [21], polytopic tubes [4], [5], or incremental Lyapunov functions [7], [10].

To reduce the conservatism of this tube propagation, a parametrized control law π_{k|t}: ℝⁿ → ℝᵐ [6], [7] is typically included in the prediction, e.g., π_{k|t}(x) = Kx + u_{k|t} with a nominal input u_{k|t} and a stabilizing feedback K ∈ ℝ^{m×n}, compare [4], [5]. In the special case of nominal MPC (i.e., with no uncertainty), we have f_t(x, u) := f_w(x, u, d), W_t(x, u) := {0}, and the predictions reduce to standard state and input trajectories π_{k|t} = u_{k|t}, X_{k|t} = x_{k|t}.
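For illustration, the following minimal sketch (ours, not part of the original design) shows one common tube parameterization for a linear model with a fixed feedback and a scalar tube radius; the system matrices, contraction rate ρ, and disturbance bound w̄ are illustrative assumptions.

import numpy as np

# Minimal sketch (illustrative): tube prediction for a linear model
# x+ = A x + B u + w with ||w|| <= w_bar, error feedback
# pi_k(x) = K (x - z_k) + v_k, and a scalar tube radius s_k such that
# X_k = {x : ||x - z_k|| <= s_k}.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed nominal model
B = np.array([[0.0], [0.1]])
K = np.array([[-5.0, -3.0]])             # assumed stabilizing feedback
rho = 0.9                                # assumed contraction: ||A + B K|| <= rho
w_bar = 0.01                             # assumed disturbance bound

def tube_prediction(z0, v_seq, s0=0.0):
    """Propagate nominal trajectory z_k and tube radius s_k over the horizon."""
    z, s = z0, s0
    tube = [(z.copy(), s)]
    for v in v_seq:
        z = A @ z + B @ v                # nominal dynamics
        s = rho * s + w_bar              # radius: contraction plus new disturbance
        tube.append((z.copy(), s))
    return tube

# usage: 10-step tube around a zero nominal input sequence
tube = tube_prediction(np.array([1.0, 0.0]), [np.zeros(1)] * 10)

The scalar radius recursion is the simplest instance of the scaled-set parameterizations with fixed RPI sets mentioned above; polytopic tubes replace the scalar s_k by several scaling variables.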
B. Existing MPC framework

In this paper, we consider the case where an existing MPC scheme is already implemented. Such a scheme ensures constraint satisfaction and an acceptable level of closed-loop performance, according to some user-defined criteria. In Section III, we show how this scheme can be augmented in order to incentivize learning of system (1), and thus reduce the uncertainty W_t.

Primary cost function: Based on the definition of the sets X_{·|t} and the parametrized control law π_{·|t}, we define a general cost function as follows:

    E(X_{·|t}, π_{·|t}) := Σ_{k=0}^{N−1} ℓ(X_{k|t}, π_{k|t}) + V_f(X_{N|t}),    (3)

where N ∈ ℕ is the MPC prediction horizon, ℓ is the bounded stage cost, and V_f is the bounded terminal cost. By ℓ_min ∈ ℝ we denote the lower bound on ℓ satisfying

    ℓ_min ≤ ℓ(X, π),    (4)

for all X, π satisfying (x, π(x)) ∈ Z for all x ∈ X.

In most MPC implementations, the cost function (3) is simply based on some nominal trajectory, ℓ(x_{k|t}, u_{k|t}) [4], [5], [6], [7], [20]. In a robust tube framework, one can also use a worst-case cost satisfying ℓ(X, π) ≥ ℓ(x, π(x)), ∀x ∈ X, which provides stronger performance guarantees [18], [21].

Existing MPC scheme: Given the nominal model f_t(x, u), the description of the uncertainty W_t, the primary cost function E, and possibly a suitable terminal set X_f ⊂ ℝⁿ, the resulting optimization problem is given by:

    E*_t := min_{X_{·|t}, π_{·|t}}  E(X_{·|t}, π_{·|t})    (5a)
    s.t.  f_t(x_{k|t}, π_{k|t}(x_{k|t})) + w_{k|t} ∈ X_{k+1|t},    (5b)
          (x_{k|t}, π_{k|t}(x_{k|t})) ∈ Z,    (5c)
          ∀w_{k|t} ∈ W_t(x_{k|t}, π_{k|t}(x_{k|t})), ∀x_{k|t} ∈ X_{k|t},    (5d)
          {x_t} ∈ X_{0|t},  X_{N|t} ⊆ X_f,    (5e)
          k = 0, …, N − 1.

The solution of (5) consists of the optimal cost E*_t, sets X*_{·|t}, and control laws π*_{·|t}. A discussion of various tractable formulations for the sets X_{·|t}, the control laws π_{·|t}, and the tube propagation (5b) can be found in [7], [26, Rk. 1]. The resulting closed-loop system is given by

    x_{t+1} = f_w(x_t, u_t, d_t) ∈ X*_{1|t},  u_t = π*_{0|t}(x_t).    (6)
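To illustrate the structure of Problem (5), the following is a minimal sketch of its simplest instance — nominal MPC with W_t = {0} and a point terminal set — using CasADi's Opti stack. The dynamics f, horizon, weights, and box constraints are illustrative assumptions, not the paper's implementation (see footnote 1 for the authors' example).

import casadi as ca
import numpy as np

# Minimal sketch of a *nominal* instance of Problem (5) (no tube, W_t = {0}).
N, n, m = 10, 2, 1
Q, R = np.diag([10.0, 1.0]), np.eye(1)

def f(x, u):  # assumed nominal model f_t (discrete time)
    return ca.vertcat(x[0] + 0.1 * x[1], x[1] + 0.1 * u[0])

opti = ca.Opti()
X = opti.variable(n, N + 1)          # state trajectory x_{k|t}
U = opti.variable(m, N)              # input trajectory u_{k|t}
x0 = opti.parameter(n)               # measured state x_t

cost = 0
for k in range(N):
    opti.subject_to(X[:, k + 1] == f(X[:, k], U[:, k]))   # (5b), nominal
    opti.subject_to(opti.bounded(-5, X[:, k], 5))         # (5c), state part of Z
    opti.subject_to(opti.bounded(-5, U[:, k], 5))         # (5c), input part of Z
    cost += ca.bilin(Q, X[:, k], X[:, k]) + ca.bilin(R, U[:, k], U[:, k])  # (3)
opti.subject_to(X[:, 0] == x0)       # (5e), initial condition
opti.subject_to(X[:, N] == 0)        # (5e), terminal set X_f = {0} (assumed)

opti.minimize(cost)
opti.solver('ipopt')
opti.set_value(x0, [1.0, 0.0])
sol = opti.solve()                   # E*_t = sol.value(cost), u_t = sol.value(U[:, 0])

Robust variants replace the single trajectory X by a tube parameterization as in Section II-A, which adds tightening terms but does not change the overall structure.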
The following assumption characterizes the fact that the existing MPC scheme (5) is properly designed, i.e., it ensures recursive feasibility and provides a suitable performance bound on the primary objective ℓ.

Assumption 1. Problem (5) is feasible for all t ≥ 0 and the constraints (2) are satisfied for the resulting closed-loop system (6). Furthermore, there exist constants c ≥ ℓ_min, α ∈ (0, 1], such that for any trajectory (X_{·|t}, π_{·|t}) satisfying the constraints in (5), and for any x_{t+1} ∈ X_{1|t}, there exists a candidate trajectory (X_{·|t+1}, π_{·|t+1}) which satisfies the constraints in (5) and satisfies the following performance bound:

    E(X_{·|t+1}, π_{·|t+1}) ≤ E(X_{·|t}, π_{·|t}) − αℓ(X_{0|t}, π_{0|t}) + c.    (7)

The following remark summarizes different MPC design procedures that satisfy Assumption 1 and the corresponding constants α, c.

Remark 1. Recursive feasibility and constraint satisfaction are standard properties of MPC schemes, which can be guaranteed with suitable conditions on the tube propagation (X_{·|t}, (5b)), the model update (f_t, W_t), and the terminal set X_f; compare [7, Thm. 1] for general conditions. In the following, we discuss how α and c vary according to the considered scenario.

• MPC with terminal ingredients (V_f, X_f): The standard MPC design [1] uses a (robust) positive invariant terminal set X_f combined with a control Lyapunov function V_f, which is used as a terminal cost. In the nominal case, such a design directly implies satisfaction of Assumption 1 with α = 1 and

    c := ℓ(x_s, u_s) = min_{(x,u)∈Z} ℓ(x, u)  s.t.  f(x, u) = x,    (8)

compare [16]. Furthermore, in case of tracking MPC (ℓ positive definite), (8) holds with c = ℓ_min = 0. In the robust case with a nominal stage cost ℓ(x_{k|t}, π_{k|t}), the value of c increases by a factor depending on the magnitude of the model mismatch, compare e.g. [5, Thm. 7], [26, Thm. 1]. In case a worst-case stage cost ℓ is used, we have c = ℓ(Ω, π_f) with some RPI set Ω, compare [18].

• MPC without terminal ingredients² (V_f := 0, X_f := ℝⁿ): In (nominal) tracking MPC without terminal ingredients, inequality (7) is satisfied with some sub-optimality index α ∈ (0, 1] and c = 0 for a sufficiently large prediction horizon N [17]. In the economic context, the bound (7) holds with α = 1 and c = ℓ(x_s, u_s) + ε_N with some ε_N > 0 [19]. In the robust case, similar bounds hold with a larger constant c, compare [20], [21].

²In MPC without terminal constraints, the performance bound (7) and even recursive feasibility might only hold if E(X_{·|t}, π_{·|t}) ≤ C holds with some constant C, which then needs to be recursively ensured.
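For intuition, we sketch the standard candidate-solution argument (compare [1], [16]; this derivation is added here for completeness) that yields (7) with α = 1 in the nominal terminal-ingredient case. Shift the previous trajectory one step and append the terminal control law κ_f on X_f; using the terminal cost condition V_f(f(x, κ_f(x))) ≤ V_f(x) − ℓ(x, κ_f(x)) + ℓ(x_s, u_s) for all x ∈ X_f, the candidate cost satisfies

    E(x_{·|t+1}, u_{·|t+1}) = E(x_{·|t}, u_{·|t}) − ℓ(x_{0|t}, u_{0|t})
        + ℓ(x_{N|t}, κ_f(x_{N|t})) + V_f(f(x_{N|t}, κ_f(x_{N|t}))) − V_f(x_{N|t})
        ≤ E(x_{·|t}, u_{·|t}) − ℓ(x_{0|t}, u_{0|t}) + ℓ(x_s, u_s),

which is exactly (7) with α = 1 and c = ℓ(x_s, u_s) as in (8).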
III. PROPOSED MPC FRAMEWORK

This section contains the proposed learning MPC framework for nonlinear uncertain systems. A motivating scenario is described in Section III-A, and a general description of the learning cost function together with the proposed MPC framework is presented in Section III-B. In Section III-C, we state the main theorem, an interpretation of the theoretical result, and the related proof. Section III-D discusses different existing learning formulations that can be incorporated in the proposed MPC framework.

A. Motivating scenario

Consider the case where system (1) is controlled with an MPC scheme defined by Problem (5). Moreover, assume that the MPC scheme (5) provides an acceptable, but not ideal, level of performance, according to some user-defined criteria. In particular, due to the presence of uncertainty, a further improvement of the performance would require a significant effort from the user, if at all possible.

In such a setup, a simple and appealing way to reduce the uncertainty, and therefore to increase performance, is to augment the primary cost E with a learning cost H_t, which has the goal to induce excitation in (1). In this case, the cost E in Problem (5) is replaced by E + λH_t, where λ ≥ 0 is a tuning parameter, compare e.g. [5], [12]. However, combining two potentially conflicting cost functions might lead to one or more of the following shortcomings:

• arbitrary deterioration of the primary performance,
• permanent steady-state error,
• generation of undesired oscillatory behaviours,
• instability.

These issues arise mainly because finding an appropriate trade-off between the two costs is, in general, not intuitive; tuning is therefore not only expensive in terms of time and resources, but might also generate unsafe behaviours during experiments. Note that even an arbitrarily small λ allows the learning cost to have dominant effects when the primary cost is close to zero.

To avoid the above-mentioned shortcomings resulting from such a naive augmentation, we propose a framework for learning MPC that is easy to implement, allows for intuitive tuning, and gives guaranteed performance bounds on the primary cost.
B. Proposed MPC scheme

In the following, we introduce the proposed MPC framework, which is based on Problem (5) augmented with active learning. For the latter, we consider a general learning cost function H_t(X_{·|t}, π_{·|t}). Frequent choices for such a function are discussed in Section III-D. The proposed MPC scheme is defined as follows:

    min_{X_{·|t}, π_{·|t}, Δ^E_t}  H_t(X_{·|t}, π_{·|t})    (9a)
    s.t.  E(X_{·|t}, π_{·|t}) = E^B_t + Δ^E_t,    (9b)
          Δ^E_t ≤ β̄ max{E^+_t, 0} + γ̄ + Y_{t−1},    (9c)
          Δ^E_t ≤ β^max max{E^+_t, 0} + γ^max,    (9d)
          f_t(x_{k|t}, π_{k|t}(x_{k|t})) + w_{k|t} ∈ X_{k+1|t},    (9e)
          (x_{k|t}, π_{k|t}(x_{k|t})) ∈ Z,    (9f)
          ∀w_{k|t} ∈ W_t(x_{k|t}, π_{k|t}(x_{k|t})), ∀x_{k|t} ∈ X_{k|t},    (9g)
          {x_t} ∈ X_{0|t},  X_{N|t} ⊆ X_f,    (9h)
          k = 0, …, N − 1.

The solution of Problem (9) consists of the optimal sets X̂_{·|t}, control laws π̂_{·|t}, the relaxation factor Δ̂^E_t, and the primary cost function Ê_t := E(X̂_{·|t}, π̂_{·|t}). The resulting closed-loop system is given by

    x_{t+1} = f_w(x_t, u_t, d_t) ∈ X̂_{1|t},  u_t = π̂_{0|t}(x_t).    (10)

Compared to Problem (5), the proposed scheme (9) minimizes the learning cost H_t, while the primary cost E only appears in the additional constraints (9b)–(9d), which are explained in the following.

The value E^B_t used in (9b) represents the desired primary cost, satisfying E*_t ≤ E^B_t by definition. In the following, we discuss three different design choices for E^B_t, which all satisfy E^B_{t+1} ≤ Ê_t − αℓ(X̂_{0|t}, π̂_{0|t}) + c, with α, c ∈ ℝ from bound (7) in Assumption 1.

1) A simple choice for the desired cost E^B_t is given by E^B_t := Ê_{t−1} − αℓ(X̂_{0|t−1}, π̂_{0|t−1}) + c, which requires the explicit knowledge of the constants α ∈ (0, 1] and c ∈ ℝ for the implementation. Remark 1 and the references therein discuss in detail how valid constants can be computed offline for different design procedures.

2) Since offline computed constants α, c may be conservative, an intuitive alternative is to use the feasible candidate solution (X_{·|t}, π_{·|t}) from Assumption 1, which satisfies

    E^B_t := E(X_{·|t}, π_{·|t}) ≤ Ê_{t−1} − αℓ(X̂_{0|t−1}, π̂_{0|t−1}) + c,

where the inequality follows from (7). This approach is particularly appealing in case terminal ingredients are employed (due to the simple construction of the candidate solution).

3) The desired performance bound E^B_t can also be defined by additionally solving the original MPC scheme (5), using

    E^B_t := E*_t ≤ Ê_{t−1} − αℓ(X̂_{0|t−1}, π̂_{0|t−1}) + c,

where again the inequality follows from (7).

The variable Δ^E_t in (9b) represents how much we allow the primary cost function Ê_t to deviate from the desired bound E^B_t. The proposed scheme allows for a performance degradation/relaxation Δ^E_t if this helps to minimize the learning cost H_t. However, as stated above, deteriorating the performance might lead to unsafe or undesired behaviour of the closed-loop system (10). For this reason, the term Δ^E_t is upper-bounded using (9c) and (9d).

In particular, (9c) can be seen as an average constraint with storage Y_t, as employed in economic MPC schemes, compare [22]. The parameter γ̄ ≥ 0 represents the average absolute performance relaxation we allow for active learning (i.e. minimization of H_t), while β̄ ∈ [0, 1) represents a percentage of the relative relaxation in the performance requirements. The variable E^+_t, used in (9c)–(9d), is defined as follows:

    E^+_t := Ê_{t−1} − E^B_t ≥ αℓ(X̂_{0|t−1}, π̂_{0|t−1}) − c,    (11)

where the inequality holds for all three choices 1)–3) above. The operation max{E^+_t, 0} in (9c)–(9d) is used to ensure that we allow for a performance relaxation only if it is possible to guarantee a potential decrease of the primary cost function, i.e., E^+_t ≥ 0. Finally, the storage Y_t is updated as follows:

    Y_t := Y_{t−1} + β̄ max{E^+_t, 0} + γ̄ − Δ^E_t,    (12)

with Y_0 ≥ 0. The update rule in (12) is needed to enforce average bounds on Δ^E_t, similar to average constraint MPC [22]. The parameters β^max ≥ 0 and γ^max ≥ 0 in (9d) have an analogous meaning to β̄ and γ̄, respectively, but are used to impose hard (non-averaged) bounds on Δ^E_t.

Similar tools are used in multi-objective MPC to obtain desired bounds [23], [24], which are less flexible than the proposed framework due to the lack of tuning variables γ̄, β̄, γ^max, β^max. Thus, the proposed framework can also be applied to general multi-objective MPC, where H_t is not necessarily related to active learning.
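The scalar bookkeeping around (9b)–(9d) and (12) is simple to implement. The following minimal sketch (ours, not the authors' code) uses design choice 1) with assumed constants α, c; the call solve_problem_9 is a placeholder for the augmented MPC problem (9).

# Minimal sketch of the bookkeeping in (9b)-(9d) and (12); `solve_problem_9`
# is a placeholder for the augmented MPC solver, and alpha, c are assumed
# constants from Assumption 1 (design choice 1)).
def learning_mpc_step(E_hat_prev, ell_prev, Y_prev,
                      alpha, c, beta_bar, gamma_bar, beta_max, gamma_max,
                      solve_problem_9):
    E_B = E_hat_prev - alpha * ell_prev + c                       # desired cost, choice 1)
    E_plus = E_hat_prev - E_B                                     # (11)
    relax_avg = beta_bar * max(E_plus, 0.0) + gamma_bar + Y_prev  # bound (9c)
    relax_hard = beta_max * max(E_plus, 0.0) + gamma_max          # bound (9d)
    dE_max = min(relax_avg, relax_hard)                           # admissible Delta^E_t

    # solve (9): minimize H_t s.t. E = E_B + dE, dE <= dE_max, tube constraints
    E_hat, ell_0, dE = solve_problem_9(E_B, dE_max)

    Y = Y_prev + beta_bar * max(E_plus, 0.0) + gamma_bar - dE     # storage update (12)
    return E_hat, ell_0, Y

Only the two scalar constraints and the storage update are added on top of the existing MPC problem, which is why the computational demand remains comparable to the original scheme.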
C. Theoretical Analysis and Interpretation

Theorem 1. Let Assumption 1 hold. Then Problem (9) is feasible for all t ≥ 0 and the following performance bound holds for all T ≥ 0:

    Ê_T − Ê_0 + Y_T − Y_0 ≤ T(γ̄ + c − β̄αℓ_min) − Σ_{t=0}^{T−1} ᾱℓ(X̂_{0|t}, π̂_{0|t}),    (13)

where ᾱ := (1 − β̄)α ≥ 0. Furthermore, if β^max ∈ [0, 1), then the following performance bound holds for all T ≥ 0:

    Ê_T − Ê_0 ≤ T(γ^max + c − β^max αℓ_min) − Σ_{t=0}^{T−1} α^max ℓ(X̂_{0|t}, π̂_{0|t}),    (14)

where α^max := (1 − β^max)α > 0.

The proof of Theorem 1 is detailed at the end of this section.

Interpretation of Theorem 1 and special cases: In the following, we show how (13) and (14) can be interpreted in economic and tracking MPC schemes.

In an economic MPC setup (α = 1), where the infinite-horizon average performance is typically of interest, choosing β̄ = 0 yields the following bound:

    lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} ℓ(X̂_{0|t}, π̂_{0|t}) ≤ γ̄ + c.    (15)

Hence, one can choose the parameter γ̄ to easily tune the primary performance, which corresponds to an additional average suboptimality in the cost. Moreover, by setting γ̄ = 0, the bound (15) is equivalent to the average performance bound usually obtained for the economic MPC scheme (5) under Assumption 1, compare [16], [18], [21]. In addition, one can also consider a time-varying relaxation γ̄_t ≥ 0 satisfying lim_{T→∞} Σ_{t=0}^{T} γ̄_t / T ≤ γ̄, which leads to the same bound (15), while additionally giving the freedom to distribute active learning over time.
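To make the step from (13) to (15) explicit (a short derivation added here for completeness): with α = 1 and β̄ = 0 we have ᾱ = 1, so dividing (13) by T gives

    (1/T) Σ_{t=0}^{T−1} ℓ(X̂_{0|t}, π̂_{0|t}) ≤ γ̄ + c + (Ê_0 − Ê_T + Y_0 − Y_T)/T
                                              ≤ γ̄ + c + (Ê_0 + Y_0 − Ê_T)/T,

where the last step uses Y_T ≥ 0. Since ℓ and V_f are bounded, Ê_T is bounded below, so the trailing term vanishes as T → ∞, which yields (15).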
On the other side, in a tracking MPC setup, we have c ≥ 0, V_f(X) ≥ 0, and ℓ(X, π) ≥ ℓ_min = 0 for all X, π satisfying (x, π(x)) ∈ Z, ∀x ∈ X. In such a case, the transient behaviour and convergence are crucial. Based on (13), the following non-averaged performance bound holds:

    Σ_{t=0}^{T−1} ᾱℓ(X̂_{0|t}, π̂_{0|t}) ≤ Ê_0 + Y_0 + T(c + γ̄),  ∀T ≥ 0.    (16)

Note that for c = γ̄ = Y_0 = 0, the bound (16) reflects the usual suboptimality estimate in MPC without terminal constraints [17], with the relaxed suboptimality index ᾱ ≤ α. Moreover, based on (14), the following holds:

    Ê_{t+1} − Ê_t ≤ −α^max ℓ(X̂_{0|t}, π̂_{0|t}) + c + γ^max.    (17)

Hence, if ℓ(X, π) ≥ α_l(‖x‖) and Ê_0 ≤ α_u(‖x‖) + p, with some α_l, α_u ∈ K_∞³ and p ≥ 0, then (17) implies practical asymptotic stability (w.r.t. c + γ^max, p) of the closed-loop system with a practical Lyapunov function Ê_t.

³By K_∞ we denote the class of functions α: ℝ_{≥0} → ℝ_{≥0} which are continuous, strictly increasing, unbounded, and satisfy α(0) = 0.
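For completeness, (16) follows from (13) by rearranging and dropping sign-definite terms (with β̄αℓ_min = 0 in the tracking case):

    Σ_{t=0}^{T−1} ᾱℓ(X̂_{0|t}, π̂_{0|t}) ≤ Ê_0 − Ê_T + Y_0 − Y_T + T(γ̄ + c)
                                        ≤ Ê_0 + Y_0 + T(c + γ̄),

where the last inequality uses Ê_T ≥ 0 (since ℓ ≥ 0 and V_f ≥ 0) and Y_T ≥ 0.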
Proof of Theorem 1

Recursive feasibility: We show feasibility of Problem (9) for all t ≥ 0, using the recursive feasibility conditions for Problem (5) in Assumption 1. Firstly, every trajectory (X_{·|t}, π_{·|t}) satisfying the constraints in Problem (9) trivially satisfies the constraints in Problem (5). In the following, we show that the candidate trajectory (X_{·|t+1}, π_{·|t+1}) from Assumption 1 satisfies the constraints in Problem (9). First, the different bounds E^B_{t+1} are such that the candidate solution satisfies, using (9b), Δ^E_{t+1} = E(X_{·|t+1}, π_{·|t+1}) − E^B_{t+1} ≤ 0. Given Y_0 ≥ 0, one can recursively show, using (12) and (9c),

    Y_t = Y_{t−1} + β̄ max{E^+_t, 0} + γ̄ − Δ^E_t ≥ 0,  ∀t ≥ 1.

This implies that Δ^E_{t+1} ≤ 0 satisfies constraint (9c). Satisfaction of (9d) is trivially ensured since β^max, γ^max ≥ 0 by construction.

Satisfaction of (13): Consider the following chain, where the invoked equation is indicated after each relation:

    Ê_{t+1} − Ê_t  =(9b)  E^B_{t+1} + Δ^E_{t+1} − Ê_t  =(11)  −E^+_{t+1} + Δ^E_{t+1}
                   =(12)  −E^+_{t+1} + Y_t + β̄ max{E^+_{t+1}, 0} + γ̄ − Y_{t+1}.    (18)

If E^+_{t+1} ≥ 0, we have

    Ê_{t+1} − Ê_t  =(18)  −(1 − β̄)E^+_{t+1} + Y_t + γ̄ − Y_{t+1}.    (19)

Similarly, if E^+_{t+1} < 0, the following holds:

    Ê_{t+1} − Ê_t  =(18)  −E^+_{t+1} + Y_t + γ̄ − Y_{t+1}.    (20)

Let us now define the set τ as follows:

    τ(T) := {t ∈ ℕ_{[0,T]} | E^+_t < 0},    (21)

where ℕ_{[0,T]} is the set of all integers between 0 and T. Combining both cases, we have for all T ≥ 0

    Ê_T − Ê_0 + Y_T − Y_0
      =(19),(20),(21)  −Σ_{t=0}^{T−1} (1 − β̄)E^+_{t+1} + Tγ̄ − Σ_{t∈τ(T)} β̄E^+_{t+1}
      ≤(4),(11)        −Σ_{t=0}^{T−1} (1 − β̄)E^+_{t+1} + Tγ̄ + Tβ̄c − Tβ̄αℓ_min
      ≤(11)            −Σ_{t=0}^{T−1} (1 − β̄)αℓ(X̂_{0|t}, π̂_{0|t}) + T(γ̄ + c − β̄αℓ_min).

Similarly, condition (14) can be shown using a reasoning analogous to the proof of (13), by using (9d) in (18) instead of (12) to bound Δ^E_t. Satisfaction of (16) can be proved based on (13) and on the non-negativity of c, ℓ, and Y_T. ∎

D. Learning formulations

A standard approach in learning MPC schemes is to augment the primary cost function with a term that incentivizes learning [5], [12]. Any learning cost H_t can be directly incorporated in the proposed framework, with the main difference that we guarantee suitable performance bounds. The different learning cost functions H_t typically depend on the considered model description (f_t, W_t) and the corresponding update rule.

In particular, if the model (f_t, W_t) is obtained through a Gaussian process (GP), then similar to [25], the learning cost H_t can be chosen as Σ_{k=0}^{N−1} −Σ_t(X_{k|t}, π_{k|t}) to seek operations with maximum uncertainty/covariance Σ_t.

In case of nonlinear models that are linear in the uncertain parameters, i.e., a term G(x, u)θ appearing in the model, a recursive least squares (RLS) filter can be used to obtain a Gaussian distribution θ ∼ N(θ̄_t, Σ_t), where the covariance is predicted using

    Σ^{−1}_{k|t} = Σ^{−1}_{k−1|t} + σ^{−2} G(x_{k|t}, u_{k|t}) G^⊤(x_{k|t}, u_{k|t}),  σ > 0.

Correspondingly, the learning cost H_t can be chosen to minimize tr(Σ_{k|t}) or the expected cost, as done in [12].

Persistence of excitation (PE) can be incentivized by choosing H_t = −β and including

    Σ_{k=0}^{N−1} G(x_{k|t}, π_{k|t}(x_{k|t})) G^⊤(x_{k|t}, π_{k|t}(x_{k|t})) ⪰ βI,  ∀x_{k|t} ∈ X_{k|t},

as an additional constraint in Problem (9), which allows for convex relaxations in the linear polytopic case, compare [5].

Similar to [10], the learning cost can also be defined based on some external exciting input u^L_t (e.g. resulting from a reinforcement learning algorithm) using H_t := ‖u_t − u^L_t‖². Compared to a predictive safety filter [10], the proposed approach also provides suitable bounds on the transient and average performance, which is crucial in many applications. Moreover, one can define H_t := E(x̃_{·|t}, π_{·|t}), where x̃ is based on a different online adapted/learned model f̂_t, e.g. using least-mean-squares parameter estimation [4], [7] or policy gradients from reinforcement learning [3].
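As a numerical illustration of the RLS covariance prediction and the PE measure above, the following minimal sketch (ours) propagates Σ_{k|t} along a candidate trajectory and evaluates the excitation level β; the regressor G, noise level σ, and the trajectory are assumptions.

import numpy as np

# Minimal sketch: predicted RLS covariance and PE measure; G(x, u), sigma,
# and the trajectory are assumed (here a single uncertain parameter).
def G(x, u):
    """Assumed regressor of a model term G(x, u) * theta."""
    return np.array([[x[1]]])   # e.g. the velocity multiplies an unknown damping

def predicted_covariance(Sigma0, traj, sigma=0.1):
    """Propagate Sigma_{k|t} via the information-matrix recursion above."""
    info = np.linalg.inv(Sigma0)                 # Sigma^{-1}_{0|t}
    for x, u in traj:
        g = G(x, u)
        info = info + (g @ g.T) / sigma**2       # Sigma^{-1}_{k|t} update
    return np.linalg.inv(info)                   # Sigma_{N|t}

def pe_level(traj):
    """Largest beta with sum_k G G^T >= beta * I (smallest eigenvalue)."""
    M = sum(G(x, u) @ G(x, u).T for x, u in traj)
    return np.linalg.eigvalsh(M).min()

# usage on an assumed 3-step trajectory of (state, input) pairs
traj = [(np.array([0.5, 1.0]), 0.0), (np.array([0.6, 0.8]), 0.1),
        (np.array([0.7, 0.5]), 0.2)]
Sigma_N = predicted_covariance(np.eye(1) * 0.01, traj)
beta = pe_level(traj)   # a learning cost could be H_t = -beta or tr(Sigma_N)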
IV. NUMERICAL EXAMPLE

In the following example, we illustrate how the tuning variables β̄ and γ̄ can be used to intuitively tune the primary performance and active learning. We consider a mass-spring-damper system subject to additive and parametric uncertainty

    m ẍ₁ = −k₀ exp(−x₁)x₁ − d ẋ₁ + u + w,    (22)

with mass m = 1, uncertain damping constant d = 1, spring constant k₀ = 0.33, disturbance w with |w| ≤ 0.01, and constraint set Z = [−0.1, 1.1] × [−5, 5] × [−5, 5]. The model (22) is discretized with an Euler discretization with a sampling time of T_s = 0.1 s. At time t = 0, we start with a nominal estimate d̂₀ = 1.1, which is then learned through a least-mean-squares (LMS) algorithm. We make use of the robust MPC approach proposed in [26], and we employ a standard quadratic tracking stage cost with weight matrices Q = diag(10, 1) and R = 1. The variable E^B_t is defined according to case 3), i.e., E^B_t = E*_t. The goal is to induce excitation in system (22) while steering it toward the desired reference x^ref_t, u^ref_t. We consider the learning cost H_t = Σ_{k=0}^{N−1} ‖u_{k|t} − u^L_{t+k}‖², where u^L_t = u^ref_t + 3 sin(0.2t) is an external probing signal that excites system (22), similar to [10]. In Fig. 1, we show four different scenarios. In particular, in all four scenarios, we set Y_0 = β^max = 0 and γ^max = ∞, so that we focus on how the variables β̄ and γ̄ influence the closed-loop behaviour.

[Fig. 1: Comparison of closed-loop trajectories. The value e₂₀ indicates the accuracy of the estimate d̂₂₀ at time 20 s and is defined via e_t := |d̂_t − d|/d. Note that e₀ = 10%.]
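A minimal sketch of this simulation setup (our reconstruction, not the authors' implementation from footnote 1): the Euler-discretized model (22), the LMS update of d̂_t, and the probing signal u^L_t. The LMS gain μ, the reference, and the direct use of u^L_t in place of the MPC input (10) are assumptions for illustration only.

import numpy as np

# Minimal reconstruction of the simulation setup (not the authors' code).
Ts, m, k0, d_true = 0.1, 1.0, 0.33, 1.0
rng = np.random.default_rng(0)

def f(x, u, d, w=0.0):
    """Euler discretization of m*x1'' = -k0*exp(-x1)*x1 - d*x1' + u + w."""
    x1, x2 = x
    acc = (-k0 * np.exp(-x1) * x1 - d * x2 + u + w) / m
    return np.array([x1 + Ts * x2, x2 + Ts * acc])

d_hat, mu = 1.1, 0.5        # d_hat_0 = 1.1; mu is an assumed LMS gain
x = np.array([0.0, 0.0])
u_ref = 0.0                 # assumed reference input

for t in range(200):        # 20 s horizon as in Fig. 1
    u_L = u_ref + 3.0 * np.sin(0.2 * t)   # external probing signal
    u = u_L                               # placeholder for the MPC input (10)
    w = rng.uniform(-0.01, 0.01)          # |w| <= 0.01
    x_next = f(x, u, d_true, w)
    # LMS: update d_hat along the regressor -Ts*x2/m of the prediction error
    pred = f(x, u, d_hat)
    g = -Ts * x[1] / m
    if abs(g) > 1e-9:
        d_hat += mu * g * (x_next[1] - pred[1]) / (g * g)
    x = x_next

e = abs(d_hat - d_true) / d_true          # estimation error e_t as in Fig. 1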
First, we consider the two extreme cases, i.e., the case of pure tracking (passive learning) and of pure (active) learning. In the case of pure tracking (β̄ = 0, γ̄ = 0), the solution of Problem (9) is equivalent to the one from Problem (5). In the case of pure learning (β̄ = 0, γ̄ = ∞), the proposed MPC scheme reduces to the safety filter from [10], i.e., maximal learning while guaranteeing constraint satisfaction. In this case, we see that the external input u^L_t generates high oscillations in closed loop, leading to a high tracking error. Such a behaviour can be considered safe in the sense that the constraints are satisfied, but due to the large tracking error it may not be desirable in practical applications.

In order to generate the desired trade-off between learning and tracking, we consider the case where β̄ = 0.8 (80%) and γ̄ = 0, where we see that the system shows some probing only during the transient phase and then converges to the desired reference point, as discussed in Section III-B. In the case where β̄ = 0 and γ̄ = 5.0, we see that the controller induces a constantly bounded amount of active probing, which is mainly visible when the system is close to the steady state. This implies a negligible relaxation of the convergence rate, and a visible probing when the system is close to the desired steady state. To conclude, the results shown in Fig. 1 confirm the intuitive meaning of the parameters β̄ and γ̄ and how they influence the closed loop, and therefore the accuracy of the estimated parameter d̂.

V. CONCLUSION

We proposed a framework for enhancing an existing MPC design with active learning to improve model quality and thus performance. We proved suitable transient and average performance bounds on the resulting MPC scheme with intuitively tunable constants. The practicality of the proposed approach is demonstrated using an example involving tracking and active learning. Designing active learning cost functions that guarantee closed-loop reduction of uncertainty is part of future work.
REFERENCES

[1] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Nob Hill Publishing, 2009.
[2] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” IEEE Transactions on Control Systems Technology, 2019.
[3] M. Zanon and S. Gros, “Safe reinforcement learning using robust MPC,” arXiv preprint arXiv:1906.04005, 2019.
[4] M. Lorenzen, M. Cannon, and F. Allgöwer, “Robust MPC with recursive model update,” Automatica, vol. 103, pp. 461–471, 2019.
[5] X. Lu, M. Cannon, and D. Koksal-Rivet, “Robust adaptive model predictive control: Performance and parameter estimation,” arXiv preprint arXiv:1911.00865, 2019.
[6] M. Bujarbaruah, X. Zhang, M. Tanaskovic, and F. Borrelli, “Adaptive MPC under time varying uncertainty: Robust and stochastic,” arXiv preprint arXiv:1909.13473, 2019.
[7] J. Köhler, P. Kötting, R. Soloperto, F. Allgöwer, and M. A. Müller, “A robust adaptive model predictive control framework for nonlinear uncertain systems,” arXiv preprint arXiv:1911.02899, 2019.
[8] A. Mesbah, “Stochastic model predictive control with active uncertainty learning: A survey on dual control,” Annual Reviews in Control, vol. 45, pp. 107–117, 2018.
[9] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[10] K. P. Wabersich and M. N. Zeilinger, “Safe exploration of nonlinear dynamical systems: A predictive safety filter for reinforcement learning,” arXiv preprint arXiv:1812.05506, 2018.
[11] A. A. Feldbaum, “Dual control theory. I,” Automation and Remote Control, vol. 21, no. 9, pp. 1240–1249, 1960.
[12] R. Soloperto, J. Köhler, M. A. Müller, and F. Allgöwer, “Dual adaptive MPC for output tracking of linear systems,” in Proc. 58th IEEE Conf. Decision and Control (CDC), 2019, pp. 1377–1382.
[13] S. Thangavel, S. Lucia, R. Paulen, and S. Engell, “Dual robust nonlinear model predictive control: A multi-stage approach,” Journal of Process Control, vol. 72, pp. 39–51, 2018.
[14] E. Arcari, L. Hewing, and M. N. Zeilinger, “An approximate dynamic programming approach for dual stochastic model predictive control,” arXiv preprint arXiv:1911.03728, 2019.
[15] A. Iannelli, M. Khosravi, and R. S. Smith, “Structured exploration in the finite horizon linear quadratic dual control problem,” arXiv preprint arXiv:1910.14492, 2019.
[16] D. Angeli, R. Amrit, and J. B. Rawlings, “On average performance and stability of economic model predictive control,” IEEE Transactions on Automatic Control, vol. 57, pp. 1615–1626, 2012.
[17] A. Boccia, L. Grüne, and K. Worthmann, “Stability and feasibility of state constrained MPC without stabilizing terminal constraints,” Systems & Control Letters, vol. 72, pp. 14–21, 2014.
[18] F. A. Bayer, M. A. Müller, and F. Allgöwer, “Min-max economic model predictive control approaches with guaranteed performance,” in Proc. 55th IEEE Conf. Decision and Control (CDC), 2016, pp. 3210–3215.
[19] L. Grüne and M. Stieler, “Asymptotic stability and transient optimality of economic MPC without terminal conditions,” Journal of Process Control, vol. 24, no. 8, pp. 1187–1196, 2014.
[20] J. Köhler, M. A. Müller, and F. Allgöwer, “A novel constraint tightening approach for nonlinear robust model predictive control,” in Proc. American Control Conf. (ACC), 2018, pp. 728–734.
[21] L. Schwenkel, J. Köhler, M. A. Müller, and F. Allgöwer, “Robust economic model predictive control without terminal conditions,” in Proc. 21st IFAC World Congress, 2020, accepted.
[22] M. A. Müller, D. Angeli, F. Allgöwer, R. Amrit, and J. B. Rawlings, “Convergence in economic model predictive control with average constraints,” Automatica, vol. 50, pp. 3100–3111, 2014.
[23] D. He, L. Wang, and J. Sun, “On stability of multiobjective NMPC with objective prioritization,” Automatica, vol. 57, pp. 189–198, 2015.
[24] L. Grüne and M. Stieler, “Multiobjective model predictive control for stabilizing cost criteria,” Discrete & Continuous Dynamical Systems - B, pp. 2823–2830, 2019.
[25] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.
[26] J. Köhler, R. Soloperto, M. A. Müller, and F. Allgöwer, “A computationally efficient robust model predictive control framework for uncertain nonlinear systems,” IEEE Transactions on Automatic Control, 2020, to appear.
[27] E. Maddalena and C. Jones, “Learning non-parametric models with guarantees: A smooth Lipschitz interpolation approach,” 2019. [Online]. Available: http://infoscience.epfl.ch/record/265764
