Abstract— We study the problem of determining the optimal exploration strategy in an unconstrained scalar optimization problem depending on an unknown parameter to be learned from online collected noisy data. An optimal trade-off between exploration and exploitation is crucial for effective optimization under uncertainties, and to achieve this we consider a cumulative regret minimization approach over a finite horizon, with each time instant in the horizon characterized by a stochastic exploration signal whose variance has to be designed. In this setting, under an idealized assumption on an appropriately defined information function associated with the excitation, we are able to show that the optimal exploration strategy is either to use no exploration at all (called lazy exploration) or to add an exploration excitation only at the first time instant of the horizon (called immediate exploration). A quadratic numerical example is used to illustrate the results.

*This work was supported by VINNOVA Competence Center AdBIOPRO, contract [2016-05181], by the Swedish Research Council through the research environment NewLEADS (New Directions in Learning Dynamical Systems), contract [2016-06079], and by Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.

The authors are with the Division of Decision and Control Systems, KTH Royal Institute of Technology, 10044 Stockholm, Sweden. Mirko Pasquini, Kévin Colin and Håkan Hjalmarsson are also with the Center for Advanced Bio Production AdBIOPRO, KTH Royal Institute of Technology, 10044 Stockholm, Sweden (e-mail: yinwang@kth.se; pasqu@kth.se; kcolin@kth.se; hjalmars@kth.se).

I. INTRODUCTION

As many systems are too complex to be modelled (only) by physical relationships, control and system optimization procedures often need to employ data, making data-driven decision making a central topic. This topic can be considered as old as control itself. It is not hard to envisage that the flyball governor controlling Watt's steam engine was adjusted based on observations of the closed-loop operation of the engine. A more modern but still early example is the MIT rule, proposed for adaptive control of aircraft traversing different flight conditions [1]. Since then several prominent research directions have been pursued, focusing on different aspects of data-driven decision making in a control context. In adaptive control, different controller structures were proposed, such as the self-tuning regulator (STR) and model reference adaptive control (MRAC). These two principles employ data in somewhat different ways. In the STR a model is updated which subsequently is used to update the controller employing the certainty equivalence principle, i.e. the model is assumed to be correct, while in MRAC data directly affects the controller parameters. Establishing closed-loop stability of adaptive control schemes (e.g. [2], [3]) is considered one of the breakthroughs in control. We refer to [4], [5], [6] for treatments of adaptive control.

Robust control [7] put the spotlight on the approximative nature of models, and the field "identification for control (I4C)" evolved in the 1990s as a response to the dichotomy between the leading paradigm of the time, of using data-driven models as complex as the true system, and the fact that simple controllers may successfully control a complex system. Control-relevant models (Examples 2.10 and 2.11 in [4] are striking illustrations) and procedures for generating data such that, despite a systematic error (bias), the identified models fulfil their task in control design were developed. We refer to [8] for a survey of this work. Another line of research has been to develop systematic experiment design procedures such that the generated data is maximally informative for data-driven model-based control design [9], [10], [11]. Interestingly, in [11] it is shown that even if such a design is made for a full-order model, the experiments become control-relevant, allowing for simplified models to be used as long as they can capture the system properties relevant for control. Other observations made in [11] are that the experimental cost and the required model complexity are highly dependent on the desired control performance. Also relevant to our study is that, to cope with the inherent catch that the optimal experiment design depends on the unknown system, adaptive experiment design procedures have been developed, supported by theory showing that such procedures do not result in a loss of performance asymptotically [12], [13]. Control performance and the cost of acquiring data have in most work been treated separately. A seminal step towards a comprehensive treatment of data-driven decision making for control was taken by Lai and Wei in the mid 1980s when, in an adaptive control context, they integrated exploration (purposely adding excitation to obtain informative data) and exploitation (achieving good control performance) into one single criterion [14], [15]. The difference between the actual control performance, including both these contributions, and the optimal control performance was called regret. While dormant for long, lately significant developments have been made on the problem of regret minimization for the Linear Quadratic Regulator (LQR) problem. One of the main outcomes of these efforts has been to establish the rate of growth of the minimal regret as a function of the control horizon T. The early work [14] indicated an asymptotic lower bound of O(log(T)) for minimum variance control of ARX systems. For LQR, the rate O(√T) was established in [16] and proven in [17] to be the optimal lower bound rate when both the state matrix and the input matrix are unknown. This rate can be obtained when systems are excited with white Gaussian noise whose variance decays as 1/√t [18]. We will refer to this type of exploration as decaying.
Recently, in [19], it is demonstrated that when both the state matrix and the input matrix are unknown, the regret can be upper-bounded as O(√T); when either the state matrix or the input matrix is known, it can be upper-bounded as O(log(T)). The exploration cost is also accounted for in [20], which studies a general control problem for a general class of linear systems in a batch-wise setting. The numerical study there shows that it is optimal to devote all exploration to the first batch. Even though this setting is somewhat different from the adaptive LQR setting studied in the works referenced above, it is interesting to note that this conclusion is at odds with the decaying exploration proposed in, e.g., [18], [19]. Moreover, in [21] it is shown that for a given finite horizon T, immediate excitation is optimal also in an LQR setting. By immediate we mean that all exploration takes place at the beginning. The optimal regret is still O(√T), but the proportionality constant is smaller than with decaying exploration.

In this contribution we broaden this line of research by turning to the optimization of non-linear static systems where the performance measure can be a non-linear function of the input. To study essential aspects of the problem, we limit our attention to static input-output relationships described by one unknown parameter and assume that we have access to noisy measurements of the output response. This does not preclude that the system may be dynamic, only that we are interested in optimizing its static behaviour and that we assume that the response to a constant input can be measured (for a stable system the latter may be achieved after transients have died out). The setting thus closely relates to the steady-state optimization problems faced in Real-Time Optimization (RTO) [22], [23], where the goal is to optimize a plant operating condition despite the plant's input-output steady-state map being partially unknown, e.g. because it depends on an unknown parameter to be learned from noisy measurements. Although the concept of an exploration-exploitation trade-off has been central in the context of decision making under uncertainty, only a few works in the RTO literature account for it. A notable exception is [24], which employs a Bayesian Optimization framework, so that exploration is accounted for by means of an acquisition function. To the best of our knowledge, a regret minimization framework was, for this type of setting, first studied in [25]. However, there the long-term effect of the additional excitation is neglected, as only the regret at the next time instant is considered. Contrary to this, we derive an approximation of the cumulative regret over the entire time horizon T, based on which we are able to derive an optimal exploration strategy. Aligning with the numerical results of [20] for batch-wise regret minimization and the theoretical results of [26], [21] for the LQR problem, we show that optimal exploration is either lazy (no exploration at all) or immediate (only explore at the first time instant), depending on the problem at hand.

In Section II, we present the regret minimization problem, while we derive an approximate expression of the regret, together with an upper bound for this approximation, in Section III. In Section IV, we conclude that either lazy or immediate exploration is optimal for a class of regret minimization problems. In Section V we consider a numerical example to illustrate that our theoretical results hold despite the approximations made in Section III. Finally, we provide a conclusion and future perspectives in Section VI.

II. PROBLEM STATEMENT

Many important control problems can be phrased as iterative optimization of the input. One example is medium optimization in bioprocessing applications for pharmaceutical production [27]. More broadly, this is the scope of real-time optimization, where the operating condition of a partially unknown plant is optimized by iteratively solving a sequence of optimization problems. Here we consider the problem of unconstrained optimization of a scalar cost function Φ where the dependency on the underlying system is modeled by a scalar parameter θ0, i.e. Φ = Φ(u, θ0), where u denotes the scalar input to be optimized and θ0 denotes the true parameter. The optimal input is given by

    u∗0 = arg min_{u∈R} Φ(u, θ0)    (1)

We use the function U : R → R to explicitly describe the relationship between the system parameter and the optimal input defined by (1), i.e. u∗0 = U(θ0). A key aspect of our problem is that θ0 is not known exactly; information about it is only available indirectly, by way of measurements of some output of the system subject to measurement errors. Formally, we express this by the measurement equation

    yt = h(ut, θ0) + et    (2)

where yt and ut are the output and input at time instant t, respectively, and et is zero-mean Gaussian noise with variance σ². We will base our method on the Certainty Equivalence Principle (CEP), i.e. after having collected {u1, . . . , u_{t−1}} and {y1, . . . , y_{t−1}} from (2), we approximate u∗0 by replacing the unknown θ0 by an estimate θ̂t computed from those measurements, resulting in the input

    u∗t = arg min_u Φ(u, θ̂t)    (3)

We remark that the use of the CEP is very common in data-driven control [4], [18], [16].
From (1) and (3) we see that u∗t = U(θ̂t). Thus, attaining an accurate estimate is instrumental for the CEP to give a satisfactory result. However, the CEP may not work well, as the input may not be very informative. Consider for example the extreme case for which u = u∗0 gives y = 0, i.e. no information on θ0 is obtained.

Fig. 1. The iterative framework for the input optimization problem, based on the CEP as well as the exploitation and exploration idea.

To address this, following e.g. [18], exploration is achieved by incorporating an additional excitation term αt into the input u∗t to improve the accuracy of estimation, as illustrated in Fig. 1. For future reference we will refer to u∗t and αt as the exploitation and exploration inputs, respectively. However, the introduction of αt presents a dilemma, as it tends to perturb u∗t, the input that minimizes Φ(u, θ̂t). This perturbation leads to an increase in the cost. Thus, a trade-off exists between exploitation and exploration when we design αt.
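To make the loop of Fig. 1 concrete, the following sketch simulates one run of it for a hypothetical quadratic instance, Φ(u, θ) = (u − θ)² (so U(θ) = θ), with measurement model h(u, θ) = θu and a least-squares estimator standing in for θ̂t. The instance, the estimator, and the chosen exploration variances are illustrative assumptions, not the paper's specification.

    import numpy as np

    # One simulated run of the CEP loop of Fig. 1 on a hypothetical quadratic
    # instance (illustrative assumptions): Phi(u, theta) = (u - theta)^2, so
    # U(theta) = theta, with measurements y_t = theta0 * u_t + e_t and a
    # least-squares estimate in the role of theta_hat_t.
    theta0, sigma, T = 2.0, 0.5, 20
    x = np.zeros(T)
    x[0] = 1.0                                 # exploration variances x_t (immediate)
    rng = np.random.default_rng(1)

    # One preliminary measurement at u = 1 provides initial information so
    # that theta_hat is well defined from the start.
    su2, suy = 1.0, theta0 + rng.normal(0.0, sigma)
    for t in range(T):
        theta_hat = suy / su2                  # estimate from data up to t-1
        u_exploit = theta_hat                  # u*_t = U(theta_hat_t), cf. (3)
        alpha = rng.normal(0.0, np.sqrt(x[t])) # exploration input
        u = u_exploit + alpha                  # applied input u_t = u*_t + alpha_t
        y = theta0 * u + rng.normal(0.0, sigma)  # measurement (2)
        su2, suy = su2 + u * u, suy + u * y    # update least-squares statistics
        print(f"t={t:2d}  theta_hat={theta_hat:6.3f}  u={u:6.3f}")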
Regret is a widely used criterion for obtaining an optimal balance between exploitation and exploration. It is defined as the cumulative performance degradation from the optimal cost achieved with u∗0, when the input ut = u∗t + αt is employed instead. In a finite horizon T, the cumulative regret is expressed as

    Σ_{t=1}^{T} [Φ(u∗t + αt, θ0) − Φ(u∗0, θ0)]

which can be used to guide the design of the exploration αt.

The presence of the random noise et, as described in equation (2), makes both θ̂t and the subsequently generated u∗t stochastic. Inspired by the literature on regret minimization for linear quadratic adaptive control [18], [19], [21], we assume that the exploration sequence {αt} is zero-mean white noise with a time-varying variance denoted by xt. Thus, our further analysis will focus on the expected cumulative regret (in what follows we will use regret for short for this quantity), defined as follows, taking into consideration the randomness in both the noise and the exploration action:

    R̄ = Σ_{t=1}^{T} E[Φ(u∗t + αt, θ0) − Φ(u∗0, θ0)]    (4)

Here the expectation E is taken with respect to (w.r.t.) both the noise {et} and the exploration signal {αt}. This reflects the average performance loss of the exploration strategy when the same experiment is repeated multiple times with the same way of generating the stochastic et and αt.

In general it is very difficult to compute the regret in (4). In the next section we will develop an approximate expression of R̄ which will allow us to conclude on the structure of the optimal exploration sequence {αt}.
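The expectation in (4) can be approximated by averaging over repeated simulated experiments. Reusing the hypothetical quadratic instance of the previous sketch, the code below compares Monte Carlo regret estimates for a lazy schedule (xt = 0 for all t) and an immediate one (all exploration at the first time instant); both the instance and the schedules are illustrative assumptions.

    import numpy as np

    def estimate_regret(x, theta0=2.0, sigma=0.5, runs=4000, seed=0):
        """Monte Carlo estimate of the expected cumulative regret (4) for the
        hypothetical instance Phi(u, theta) = (u - theta)^2, y = theta0*u + e."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(runs):
            # one preliminary measurement at u = 1 provides prior information
            su2, suy = 1.0, theta0 + rng.normal(0.0, sigma)
            for xt in x:
                theta_hat = suy / su2
                u = theta_hat + rng.normal(0.0, np.sqrt(xt))  # u*_t + alpha_t
                y = theta0 * u + rng.normal(0.0, sigma)
                su2, suy = su2 + u * u, suy + u * y
                total += (u - theta0) ** 2   # Phi(u_t, theta0) - Phi(u*_0, theta0)
        return total / runs

    T = 50
    lazy = np.zeros(T)                       # x_t = 0 for all t
    immediate = np.zeros(T)
    immediate[0] = 1.0                       # all exploration at t = 1
    print("lazy:     ", estimate_regret(lazy))
    print("immediate:", estimate_regret(immediate))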
III. REGRET APPROXIMATION AND MODEL UNCERTAINTY

A. Regret approximation

Since u∗0 minimizes the (twice differentiable) cost Φ(·, θ0), a second-order Taylor expansion of Φ(u∗t + αt, θ0) − Φ(u∗0, θ0) around u∗0 shows that each term of (4) is, to leading order, proportional to (u∗t + αt − u∗0)² through a constant curvature factor. As it is a constant factor we will omit it and study the scaled expected regret Σ_{t=1}^{T} E[(u∗t + αt − u∗0)²], which can be expanded into

    Σ_{t=1}^{T} E[(u∗t − u∗0)² + 2(u∗t − u∗0)αt + αt²]    (5)

Furthermore, both the actual optimal input u∗0 and its approximate counterpart u∗t are determined by the function U. The former requires the unknown actual parameter θ0, while the latter utilizes the estimated parameter θ̂t. We here approximate the difference between u∗0 and u∗t with the Taylor approximation

    u∗t − u∗0 ≈ (θ̂t − θ0) Jθ    (6)

where Jθ = dU(θ)/dθ |_{θ=θ0}. This gives the following approximation of the regret in (5):

    R̃ := Σ_{t=1}^{T} E[(θ̂t − θ0)² Jθ² + 2(θ̂t − θ0) Jθ αt + αt²]
       = Jθ² Σ_{t=1}^{T} E[(θ̂t − θ0)²] + Σ_{t=1}^{T} xt    (7)

where the equality holds because the current exploration action αt and the estimation error θ̂t − θ0 (determined by the inputs and outputs before time t) are independent, together with the fact that E[αt] = 0 and E[αt²] = xt. The approximation (7) indicates that the performance degradation over the horizon T is incurred by the model uncertainty and the exploration input in an additive way. Next we study the model uncertainty and its relation to the exploration input.

B. Model uncertainty

The Cramér-Rao inequality establishes a lower bound for the variance of an unbiased estimator in terms of the inverse of the Fisher information [28]. Here we will use the idealized assumption that we have an estimator which is unbiased and efficient, meaning that it attains this lower bound¹. Thus, the quantity E[(θ̂t − θ0)²] in (7) is given by the inverse of the Fisher information, I_{t−1}⁻¹, where the index is t − 1 since the model uncertainty depends on all the inputs up to, but not including, time t. As shown in Appendix A, the Fisher information at time t is given by

    It = I0 + (1/σ²) Σ_{s=1}^{t} E[ (∂h/∂θ)² |_{θ=θ0, us=u∗s+αs} ]    (8)
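Given a model of h, the information recursion (8) and the idealized uncertainty E[(θ̂t − θ0)²] = I_{t−1}⁻¹ are simple to propagate numerically. The sketch below does so for the hypothetical model h(u, θ) = θu, for which ∂h/∂θ = u; treating the exploitation input as the constant u∗ = U(θ0) is an additional illustrative simplification, under which E[(∂h/∂θ)²] = (u∗)² + xs.

    import numpy as np

    # Propagation of the Fisher information (8) and of the CRLB-based
    # uncertainty E[(theta_hat_t - theta0)^2] = 1 / I_{t-1} for the
    # hypothetical model h(u, theta) = theta * u, so dh/dtheta = u.
    # Approximating the exploitation input by the constant u* = U(theta0)
    # is an illustrative simplification: E[(dh/dtheta)^2] = u*^2 + x_s.
    sigma, I0, T, u_star = 0.5, 1.0, 10, 2.0
    x = np.zeros(T)
    x[0] = 1.0                               # immediate exploration schedule

    I = I0
    for t in range(1, T + 1):
        crlb = 1.0 / I                       # uncertainty entering the regret at time t
        I += (u_star**2 + x[t - 1]) / sigma**2   # information added by the input at time t
        print(f"t={t:2d}  E[(theta_hat-theta0)^2] >= {crlb:.4f}")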
Since {et} is zero-mean white noise, es and er are independent for every s ≠ r. Recalling in addition that {et} is Gaussian with variance σ², we get

    It = Σ_{s=1}^{t} (1/σ²) E[ (∂h/∂θ)² |_{θ=θ0, us=u∗s+αs} ]

which is the second term in the expression (8).

³From a statistical perspective, αt is an ancillary statistic, meaning that it contains no information about θ. Still, such a statistic may heavily influence the properties of an estimator, and the Fisher information should be conditioned on such a statistic (see, e.g., [29], [30]).

B. Proof of Theorem 1

We start by proving the following lemma.

Lemma 1. Consider the problem (17) and denote by x∗ = [x∗1, . . . , x∗T−1] its solution. Assume that the function i is non-negative and monotonically increasing on the domain [0, ∞). Then x∗1 ≥ x∗2 ≥ · · · ≥ x∗T−1 ≥ 0.

Proof: Consider the vector x = [x1, . . . , xj, xj+1, . . . , xT−1] and the vector x̃ = [x1, . . . , xj+1, xj, . . . , xT−1] built from x by swapping xj and xj+1, and assume xj ≥ xj+1. Denote by C : R₊^{T−1} → R₊ the objective function of Problem (17). By comparing the costs induced by x and x̃ we get that C(x) − C(x̃) equals

    1/(i0 + Σ_{s=1}^{j−1} i(xs) + i(xj)) − 1/(i0 + Σ_{s=1}^{j−1} i(xs) + i(xj+1)) ≤ 0

due to the fact that xj ≥ xj+1 and that the information function i is monotonically increasing on the domain [0, ∞). Hence swapping two consecutive components so that the larger one comes first never increases the cost, from which the ordering x∗1 ≥ x∗2 ≥ · · · ≥ x∗T−1 ≥ 0 of the optimal solution follows. □

where λ∗k is the optimal KKT multiplier. Due to x∗1 ≥ x∗k for k = 2, . . . , T − 1 (from Lemma 1) and the convexity of the information function i, which has an increasing derivative, we obtain

    i′(x∗k) ≤ i′(x∗1)  ⇒  −i′(x∗k) ≥ −i′(x∗1)    (23)

Also, for k = 2, . . . , T − 1, it holds that

    Σ_{t=k}^{T−1} 1/[i0 + Σ_{s=1}^{t} i(x∗s)]² < Σ_{t=1}^{T−1} 1/[i0 + Σ_{s=1}^{t} i(x∗s)]²    (24)

since all the terms in the sum on the right-hand side of (24) are positive. From (23) and (24) it follows that

    1 − Σ_{t=k}^{T−1} i′(x∗k)/[i0 + Σ_{s=1}^{t} i(x∗s)]² > 1 − Σ_{t=1}^{T−1} i′(x∗1)/[i0 + Σ_{s=1}^{t} i(x∗s)]²

which together with (20) implies that the KKT multipliers satisfy λ∗k > λ∗1 ≥ 0 for all k ≥ 2, which together with the complementary slackness condition in (22) implies that x∗2 = · · · = x∗T−1 = 0. Thus we have proved that the optimal solution x∗ is either a lazy or an immediate excitation.

Now assume that the solution x∗ is a lazy excitation. Then, according to (20) and (21), it should hold that

    1 − Σ_{t=1}^{T−1} i′(0)/[i0 + t · i(0)]² ≥ 0    (25)

which proves that the violation of (25), i.e. condition (18), is sufficient for immediate excitation, since we know that the solution must be either immediate or lazy. □
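Problem (17) and conditions (18)-(22) are not shown in this excerpt. Consistent with the objective C compared in the proof of Lemma 1 and with the regret approximation (7), a plausible form of (17) is, up to constant scaling, min over x ≥ 0 of Σ_{t=1}^{T−1} 1/(i0 + Σ_{s=1}^{t} i(xs)) + Σ_{t=1}^{T−1} xt. Under this assumption, and for a hypothetical affine information function i(x) = c·x, the sketch below solves the allocation numerically (the minimizer concentrates all excitation at t = 1 or is identically zero) and evaluates the lazy-exploration condition (25).

    import numpy as np
    from scipy.optimize import minimize

    # Numerical illustration of Theorem 1 under an ASSUMED form of problem
    # (17), which is not shown in this excerpt: minimize over x >= 0
    #   C(x) = sum_{t=1}^{T-1} 1 / (i0 + sum_{s<=t} i(x_s)) + sum_t x_t,
    # with a hypothetical affine information function i(x) = c * x.
    T, i0, c = 10, 1.0, 0.8
    i = lambda x: c * x
    di0 = c                                  # i'(0)

    def C(x):
        info = i0 + np.cumsum(i(x))          # i0 + sum_{s<=t} i(x_s)
        return np.sum(1.0 / info) + np.sum(x)

    # With bounds, minimize defaults to the L-BFGS-B method.
    res = minimize(C, x0=np.full(T - 1, 0.1), bounds=[(0.0, None)] * (T - 1))
    print("optimal exploration variances:", np.round(res.x, 4))

    # Lazy-exploration condition (25): 1 - sum_{t=1}^{T-1} i'(0)/(i0 + t*i(0))^2 >= 0.
    # With i(0) = 0 this reduces to i0**2 >= (T-1)*c; its violation is (18).
    lazy_ok = 1.0 - sum(di0 / (i0 + t * i(0.0)) ** 2 for t in range(1, T)) >= 0
    print("lazy exploration optimal?", lazy_ok)

For these constants the condition is violated, and the optimizer indeed returns an immediate excitation, with all mass on the first component.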