Department of Automatic Control, Lund University
Abstract
The strategy of pre-training a large model on a diverse dataset, then fine-tuning for a particular
application has yielded impressive results in computer vision, natural language processing, and
robotic control. This strategy has vast potential in adaptive control, where it is necessary to rapidly
adapt to changing conditions with limited data. Toward concretely understanding the benefit of pre-
training for adaptive control, we study the adaptive linear quadratic control problem in the setting
where the learner has prior knowledge of a collection of basis matrices for the dynamics. This
basis is misspecified in the sense that it cannot perfectly represent the dynamics of the underlying
data generating process. We propose an algorithm that uses this prior knowledge, and prove upper
bounds on the expected regret after T interactions with the system. In the regime where T is small,
the upper bounds are dominated by a term that scales with either poly(log T) or √T, depending on
the prior knowledge available to the learner. When T is large, the regret is dominated by a term
that grows with δT , where δ quantifies the level of misspecification. This linear term arises due
to the inability to perfectly estimate the underlying dynamics using the misspecified basis, and is
therefore unavoidable unless the basis matrices are also adapted online. However, it only dominates
for large T , after the sublinear terms arising due to the error in estimating the weights for the basis
matrices become negligible. We provide simulations that validate our analysis. Our simulations
also show that offline data from a collection of related systems can be used as part of a pre-training
stage to estimate a misspecified dynamics basis, which is in turn used by our adaptive controller.
1. Introduction
Transfer learning, whereby a model is pre-trained on a large dataset, and then fine-tuned for a specific
application, has exhibited great success in computer vision (Dosovitskiy et al., 2020) and natural
language processing (Devlin et al., 2018). Efforts to apply these methods to control have shown
exciting preliminary results, particularly in robotics (Dasari et al., 2019). The principle underpin-
ning the success of transfer learning is to use diverse datasets to extract compressed, broadly useful
features, which can be used in conjunction with comparatively simple models for downstream objectives.
These simple models can be fine-tuned with relatively little data from the downstream task.
However, errors in the pre-training stage may cause this two-step strategy to underperform learn-
ing from scratch when ample task-specific data is available. Such a tradeoff can be acceptable in
settings such as adaptive control, where the learner must rapidly adapt to changes with limited data.
Driven by the potential of pre-training in adaptive control, we study the adaptive linear quadratic
regulator (LQR) in a setting where imperfect prior information about the system is available. The
adaptive linear LQR problem consists of a learner interacting with an unknown linear system
xt+1 = A⋆ xt + B ⋆ ut + wt , (1)
with state xt , input ut , and noise wt assuming values in RdX , RdU , and RdX , respectively. The learner
is evaluated by its ability to minimize its regret, which compares the cost incurred by playing the
learner for T time steps with the cost attained by the optimal LQR controller. Prior work has studied
the adaptive LQR problem in the settings where A⋆ , B ⋆ , or both A⋆ and B ⋆ are fully unknown to
the learner and with either stochastic or bounded adversarial noise processes (Abbasi-Yadkori and
Szepesvári, 2011; Hazan et al., 2020; Cassel et al., 2020). In this work, we analyze a setting where
the learner has a set of basis matrices for the system dynamics; however, [A⋆ B⋆] is not in the
subspace spanned by this basis, leading to misspecification.
1.2. Contributions
We introduce a notion of misspecification between an estimate for the basis of the dynamics and the
data generating process. We then propose an adaptive control algorithm that uses this misspecified
basis, and subsequently analyze the regret of this algorithm. This leads to the following insights:
• Our results generalize the understanding of when logarithmic regret is possible in adaptive LQR
beyond the cases of known A⋆ or B ⋆ studied by Cassel et al. (2020); Jedra and Proutiere (2022).
• When misspecification is present, the regret incurs a term that is linear in T . The coefficient for
this term decays gracefully as the level of misspecification diminishes. As a result, small misspec-
ification means the regret is dominated by sublinear terms in the low data regime of interest for
adaptive control. These terms compare favorably with those attainable in the absence of prior knowledge.
We validate our theory with numerical experiments, and show the benefit of using a dynamics
representation determined by pre-training on offline data from related systems for adaptive control.
1.3. Notation
The Euclidean norm of a vector x is denoted ∥x∥. For a matrix A, the spectral norm is denoted ∥A∥,
and the Frobenius norm is denoted ∥A∥F . We use † to denote the Moore-Penrose pseudo-inverse.
The spectral radius of a square matrix is denoted ρ(A). A symmetric, positive semi-definite (psd)
matrix A = A⊤ is denoted A ⪰ 0. Similarly A ⪰ B denotes that A − B is positive semidefinite.
The minimum eigenvalue of a symmetric, positive definite matrix A is denoted λmin(A). For f, g :
D → R, we write f ≲ g if for some c > 0, f (x) ≤ cg(x) ∀x ∈ D. We denote the solutions
to the discrete Lyapunov equation by dlyap(A, Q) and the discrete algebraic Riccati equation by
DARE(A, B, Q, R). We use x ∨ y to denote the maximum of x and y.
2. Problem Formulation
2.1. System model
We consider the system (1) where the noise wt has independent identically distributed elements that
are mean zero and σ²-sub-Gaussian for some σ² ∈ R with σ² ≥ 1. We additionally assume that the
noise has identity covariance: E[wt wt⊤] = I.¹ We suppose the dynamics admit the decomposition

[A⋆ B⋆] = vec⁻¹(Φ⋆ θ⋆),   (2)

where Φ⋆ ∈ R^{dX(dX+dU) × dθ} specifies the model structure, and has orthonormal columns. Meanwhile, θ⋆ ∈ R^{dθ} specifies the parameters. The operator vec⁻¹ maps a vector in R^{dX(dX+dU)} into a matrix in R^{dX × (dX+dU)} by stacking length-dX blocks of the vector into columns of the matrix, working top to bottom and left to right. We can write this as a linear combination of basis matrices:

[A⋆ B⋆] = Σ_{i=1}^{dθ} θ⋆_i [Φ_i^{A,⋆} Φ_i^{B,⋆}],   where [Φ_i^{A,⋆} Φ_i^{B,⋆}] = vec⁻¹(Φ⋆_i),
and Φ⋆i is the ith column of Φ⋆ . This decomposition of the data generating process is a natural ex-
tension of the low-dimensional linear representations considered in Du et al. (2020) to the setting of
multiple related dynamical systems with shared structure determined by Φ⋆ . A version of this model
for autonomous systems was considered by Modi et al. (2021) for multi-task system identification.
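As a concrete illustration of (2), the sketch below builds a random orthonormal basis and checks the two equivalent views of the dynamics against each other. The dimensions, the basis, and the weights are all illustrative choices of ours, not those used in the paper's experiments.

```python
import numpy as np

d_x, d_u, d_theta = 3, 2, 4
rng = np.random.default_rng(0)

# Orthonormal model structure Phi* via QR of a random matrix.
Phi_star, _ = np.linalg.qr(rng.standard_normal((d_x * (d_x + d_u), d_theta)))
theta_star = rng.standard_normal(d_theta)

def unvec(v, d_x, d_cols):
    # vec^{-1}: stack length-d_x blocks of v as columns, top to bottom, left to right.
    return v.reshape(d_cols, d_x).T

AB = unvec(Phi_star @ theta_star, d_x, d_x + d_u)  # [A* B*]
A_star, B_star = AB[:, :d_x], AB[:, d_x:]

# Equivalent view: a linear combination of the basis matrices vec^{-1}(Phi*_i).
AB_sum = sum(theta_star[i] * unvec(Phi_star[:, i], d_x, d_x + d_u)
             for i in range(d_theta))
assert np.allclose(AB, AB_sum)
```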
We assume that both Φ⋆ and θ⋆ are unknown; however, we have an estimate Φ̂ ∈ RdX (dX +dU )×dθ ,
also with orthonormal columns, for Φ⋆. Such an estimate may be obtained by performing a pre-
training step on offline data from a collection of systems related to (1) by the shared matrix Φ⋆
1. Noise that enters the process through a non-singular matrix H can be addressed by rescaling the dynamics by H −1 .
in (2). Due to the noise present in the offline data, this estimate will be imperfect, resulting in
misspecification.² To quantify the level of misspecification between this estimate and the underlying
data generating process, we use the following subspace distance metric.
Definition 1 ((Stewart and Sun, 1990)) Let Φ⋆⊥ complete the basis of Φ⋆ such that [Φ⋆ Φ⋆⊥] is
an orthogonal matrix. Then the subspace distance between Φ⋆ and Φ̂ is d(Φ⋆, Φ̂) ≜ ∥Φ̂⊤ Φ⋆⊥∥.
Note that the above distance is small when the ranges of the matrices Φ̂ and Φ⋆ are similar. In
particular, the above distance may be small even when ∥Φ̂ − Φ⋆∥ is large.
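Definition 1 is straightforward to compute with standard linear algebra. The sketch below (with an illustrative random basis of our choosing) also demonstrates the remark above: a basis spanning the same range as Φ⋆ has distance near zero even when Φ̂ − Φ⋆ itself is large.

```python
import numpy as np

def subspace_distance(Phi_star, Phi_hat):
    """d(Phi*, Phi_hat) = ||Phi_hat^T Phi*_perp|| (Definition 1)."""
    n, k = Phi_star.shape
    # Complete Phi* to an orthogonal matrix; the trailing n - k columns of Q
    # span the orthogonal complement Phi*_perp.
    Q, _ = np.linalg.qr(Phi_star, mode='complete')
    return np.linalg.norm(Phi_hat.T @ Q[:, k:], ord=2)

rng = np.random.default_rng(1)
Phi_star, _ = np.linalg.qr(rng.standard_normal((12, 3)))

# A basis with the same range has distance ~0 even though Phi_hat - Phi_star
# is large: rotate the columns by a random orthogonal matrix.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Phi_hat = Phi_star @ R
dist = subspace_distance(Phi_star, Phi_hat)
```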
As long as d(Φ⋆, Φ̂) is sufficiently small and the dimension dθ < dX(dX + dU), the estimate Φ̂
allows the learner to fit a model with less data than would be required to estimate [A⋆ B⋆] from
scratch. This benefit comes at the cost of a bias in the learner's model that grows with d(Φ⋆, Φ̂).
To define an algorithm that keeps the cost small, we first introduce the infinite horizon LQR cost:

J(K) ≜ lim sup_{T→∞} (1/T) E^K[CT],   (4)
where the superscript of K on the expectation denotes that the cost is evaluated under the state
feedback controller ut = Kxt . To ensure that there exists a controller such that (4) has finite cost,
we assume (A⋆ , B ⋆ ) is stabilizable. Under this assumption, (4) is minimized by the LQR controller
K∞ (A⋆ , B ⋆ ), where K∞ (A, B) ≜ −(B ⊤ P∞ (A, B)B + R)−1 B ⊤ P∞ (A, B)A, and P∞ (A, B) ≜
DARE(A, B, Q, R). We define the shorthands P ⋆ ≜ P∞ (A⋆ , B ⋆ ) and K ⋆ ≜ K∞ (A⋆ , B ⋆ ). To
characterize the infinite horizon LQR cost of an arbitrary stabilizing controller K, we additionally
define the solution PK to the Lyapunov equation for the closed loop system under an arbitrary K
such that ρ(A⋆ +B ⋆ K) < 1: PK ≜ dlyap(A⋆ +B ⋆ K, Q+K ⊤ RK). For a controller K satisfying
ρ(A⋆ + B ⋆ K) < 1, J (K) = tr(PK ). We have that PK ⋆ = P ⋆ .
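The quantities K∞, P∞, and J(K) = tr(PK) map directly onto standard solvers. A minimal sketch using scipy, on a hypothetical system of our choosing, with Q = I and R = I as in the normalization used here:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def lqr_gain(A, B, Q, R):
    """K_inf(A,B) = -(B^T P B + R)^{-1} B^T P A with P = DARE(A, B, Q, R)."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return K, P

def infinite_horizon_cost(A, B, K, Q, R):
    """J(K) = tr(P_K) with P_K = dlyap(A + B K, Q + K^T R K), valid when rho(A+BK) < 1."""
    A_cl = A + B @ K
    assert max(abs(np.linalg.eigvals(A_cl))) < 1, "K must stabilize (A, B)"
    # scipy solves a x a^T - x + q = 0, so pass A_cl^T to get A_cl^T x A_cl - x + q = 0.
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return np.trace(P_K)

A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

K_opt, P = lqr_gain(A, B, Q, R)
# The optimal controller attains J(K*) = tr(P*), since P_{K*} = P*.
assert np.isclose(infinite_horizon_cost(A, B, K_opt, Q, R), np.trace(P))
```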
The infinite horizon LQR controller provides a baseline level of performance that our learner
cannot surpass in the limit as T → ∞. Borrowing the notion of regret from online learning, as in
Abbasi-Yadkori and Szepesvári (2011), we quantify the performance of our learning algorithm by
comparing the cumulative cost CT to the scaled infinite horizon cost attained by the LQR controller
if the system matrices [A⋆ B⋆] were known:
RT ≜ CT − T J (K ⋆ ). (5)
2. Sample complexity bounds for learning Φ̂ are provided by Zhang et al. (2023b), so this step is not studied here.
3. Generalizing to arbitrary Q ≻ 0 and R ≻ 0 can be performed by scaling the cost and changing the input basis.
Algorithm 1 Certainty Equivalent Control with Continual Exploration
1: Input: Stabilizing controller K0, initial epoch length τ1, number of epochs kfin, exploration sequence σ1², σ2², σ3², . . . , σ²_{kfin}, state bound xb, controller bound Kb
2: Initialize: K̂1 ← K0, τ0 ← 0, T ← τ1 2^{kfin−1}.
3: for k = 1, 2, . . . , kfin do
4:   for t = τ_{k−1} + 1, . . . , τk do
5:     if ∥xt∥² ≥ xb² log T or ∥K̂k∥ ≥ Kb then
6:       Abort and play K0 forever
7:     Play ut = K̂k xt + σk gt, where gt ∼ N(0, I)
8:   θ̂k, Λ ← LS(Φ̂, x_{τ_{k−1}+1:τ_k+1}, u_{τ_{k−1}+1:τ_k})   ▷ Algorithm 2
9:   [Âk B̂k] ← vec⁻¹(Φ̂ θ̂k)
10:  K̂_{k+1} ← K∞(Âk, B̂k)
11:  τ_{k+1} ← 2 τk
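The epoch-doubling loop of Algorithm 1 can be sketched as follows. This is our illustrative reading of the pseudocode, on a hypothetical system: the LS subroutine of Algorithm 2 is condensed into an ordinary least-squares regression over the span of Φ̂, and all system parameters, dimensions, and noise levels are assumptions of ours, not a released implementation.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ce_gain(A, B):
    """Certainty-equivalent LQR gain K_inf(A, B) for Q = I, R = I."""
    P = solve_discrete_are(A, B, np.eye(A.shape[0]), np.eye(B.shape[1]))
    return -np.linalg.solve(B.T @ P @ B + np.eye(B.shape[1]), B.T @ P @ A)

def unvec(v, d_x, d_cols):
    """vec^{-1}: stack length-d_x blocks of v as the columns of a d_x x d_cols matrix."""
    return v.reshape(d_cols, d_x).T

def ls_theta(Phi_hat, xs, us):
    """Condensed LS step: argmin_theta sum_s ||x_{s+1} - unvec(Phi_hat theta)[x_s; u_s]||^2."""
    d_x = xs.shape[1]
    D, y = [], []
    for s in range(len(us)):
        z = np.concatenate([xs[s], us[s]])
        D.append(np.kron(z, np.eye(d_x)) @ Phi_hat)  # unvec(Phi theta) z is linear in theta
        y.append(xs[s + 1])
    theta, *_ = np.linalg.lstsq(np.vstack(D), np.concatenate(y), rcond=None)
    return theta

def run_epochs(A_true, B_true, Phi_hat, K0, tau1, k_fin, sigmas, x_b, K_b, rng):
    d_x, d_u = B_true.shape
    T = tau1 * 2 ** (k_fin - 1)
    K, xs, us, aborted = K0, [np.zeros(d_x)], [], False
    cost, tau_prev, tau = 0.0, 0, tau1
    for k in range(k_fin):
        for _ in range(tau_prev, tau):
            x = xs[-1]
            if not aborted and (x @ x >= x_b**2 * np.log(T)
                                or np.linalg.norm(K) >= K_b):
                K, aborted = K0, True          # lines 5-6: abort, play K0 forever
            u = K @ x + sigmas[k] * rng.standard_normal(d_u)
            cost += x @ x + u @ u              # stage cost with Q = I, R = I
            xs.append(A_true @ x + B_true @ u + 0.1 * rng.standard_normal(d_x))
            us.append(u)
        if not aborted:                        # lines 8-10: CE redesign on the data so far
            theta = ls_theta(Phi_hat, np.array(xs), np.array(us))
            AB = unvec(Phi_hat @ theta, d_x, d_x + d_u)
            K = ce_gain(AB[:, :d_x], AB[:, d_x:])
        tau_prev, tau = tau, min(2 * tau, T)
    return cost

A = np.array([[1.0, 0.2], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
cost = run_epochs(A, B, np.eye(6), ce_gain(A, B), tau1=50, k_fin=3,
                  sigmas=[0.5, 0.25, 0.0], x_b=100.0, K_b=100.0,
                  rng=np.random.default_rng(0))
```

Here Φ̂ = I recovers the fully unknown setting; passing a lower-dimensional orthonormal Φ̂ restricts the fit to its span, as in the misspecified case.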
In light of the above reformulation, the goal of the learner is to interact with the system (1) to
maximize the information about the relevant parameters for control while simultaneously regulating
the system to minimize RT . A reasonable strategy to do so is for the learner to use its history of
interaction with the system to construct a model for the dynamics, e.g. by determining estimates Â
and B̂. It may then use these estimates as part of a certainty equivalent (CE) design by synthesizing
controllers K̂ = K∞ (Â, B̂). Prior work has shown that if the model estimate is sufficiently close
to the true dynamics, then the cost of playing the controller K̂ exceeds the cost of playing K ⋆ by a
quantity that is quadratic in the estimation error (Mania et al., 2019; Simchowitz and Foster, 2020).
Lemma 1 (Theorem 3 of Simchowitz and Foster (2020)) Define ε ≜ 1/(2916 ∥P⋆∥¹⁰). If
∥[Â B̂] − [A⋆ B⋆]∥²_F ≤ ε, then J(K̂) − J(K⋆) ≤ 142 ∥P⋆∥⁸ ∥[Â B̂] − [A⋆ B⋆]∥²_F.
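The quadratic dependence on the estimation error in Lemma 1 is easy to observe numerically: below, the certainty-equivalent gain is synthesized from a model perturbed by ε, and the cost gap shrinks roughly as ε². The system is an illustrative choice of ours, not one from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def gain(A_mdl, B_mdl):
    P = solve_discrete_are(A_mdl, B_mdl, Q, R)
    return -np.linalg.solve(B_mdl.T @ P @ B_mdl + R, B_mdl.T @ P @ A_mdl)

def J(K):
    # J(K) = tr(P_K) with P_K = dlyap(A + B K, Q + K^T R K).
    P_K = solve_discrete_lyapunov((A + B @ K).T, Q + K.T @ R @ K)
    return np.trace(P_K)

J_star = J(gain(A, B))
gaps = []
for eps in (1e-1, 1e-2, 1e-3):
    K_hat = gain(A + eps * np.eye(2), B)   # CE gain from a model in error by ~eps
    gaps.append(J(K_hat) - J_star)
# gaps shrink roughly quadratically as eps decreases, as Lemma 1 predicts.
```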
3. Regret Bounds
We now present our bounds on the expected regret⁵ incurred by Algorithm 1. Consider running Algorithm 1 for T = τ1 2^{kfin−1} timesteps, where τ1 is the initial epoch length and kfin is the number of
epochs. To bound the regret incurred by this algorithm, we decompose the regret into that achieved
by the algorithm under a high probability success event, and that incurred during a failure event
under which the state or controller bound in line 5 of Algorithm 1 are violated. To ensure the failure
event occurs with a small probability, we make the following assumption on the state and controller
bounds, which uses the shorthand ΨB ⋆ ≜ max{1, ∥B ⋆ ∥}.
Assumption 1 We assume that xb ≥ 400 ∥PK0∥² ΨB⋆ σ √(dX + dU) and Kb ≥ 2 √∥PK0∥.
We make additional assumptions about the remaining arguments supplied as inputs to the al-
gorithm in two cases: one in which no additional assumptions about the dynamics are made (Sec-
tion 3.1), and one in which we assume that the system structure estimate is such that the initial sta-
bilizing controller and the optimal controller provide sufficient excitation to identify the unknown
parameters (Section 3.2).
The requirement above arises from the way that misspecification enters our bounds on the estimation
error ∥[Â B̂] − [A⋆ B⋆]∥²_F. See the definition of Eest,1 in Section 3.3.
In this setting, we run Algorithm 1 with exploratory inputs injected to ensure identifiability of
the unknown parameters. Doing so provides the regret guarantees in the following theorem.
Theorem 2 Consider applying Algorithm 1 with initial stabilizing controller K0 for T = τ1 2^{kfin−1}
timesteps for some positive integers kfin and τ1. Suppose that for some γ ≥ d(Φ̂, Φ⋆), the exploration sequence is given by σk² = max{ √(dU/dθ) / √(τ1 2^{k−1}), γ } for all k ≥ 1. Suppose the state bound xb and the controller bound Kb satisfy Assumption 1 and that Φ̂ satisfies Assumption 2. There exists a universal positive constant Cwarm up such that if τ1 = τwarm up log² T for some

τwarm up ≥ Cwarm up σ⁴ ∥PK0∥³ max{ Ψ²B⋆ (dX + dU), xb², log ∥P⋆∥ / log((1 − 1/∥P⋆∥)⁻¹), √(dθ dU)/ε² },

5. In contrast to high-probability regret bounds, expected regret provides an understanding of what happens in the unlikely events where the controller performs poorly.
Under the above assumption, we may run Algorithm 1 without an additional exploratory input
injected. As in Section 3.1, we require that the representation error is small enough to guarantee the
closeness condition in Lemma 1 may be satisfied with our estimated model.
Assumption 4 Define β2 ≜ Cbias,2 ε ∥PK0∥⁹ Ψ⁸B⋆ ∥θ⋆∥² (dX + dU) / (dθ min{α², α⁴}) for a sufficiently large universal constant Cbias,2. Our representation error satisfies d(Φ̂, Φ⋆) ≤ √(ε/(2β2)).
This requirement again arises from dependence of the estimation error bounds on the misspecifica-
tion. See the definition of Eest,2 in Section 3.3. Under these assumptions, our regret bound may be
improved to that in the following theorem.
Theorem 3 Consider applying Algorithm 1 with initial stabilizing controller K0 for T = τ1 2^{kfin−1}
timesteps for some positive integers kfin and τ1. Additionally suppose the exploration sequence is
zero for all time: σk² = 0 for k = 1, . . . , kfin. Suppose the state bound xb and the controller
bound Kb satisfy Assumption 1, and that Φ̂ satisfies Assumption 3 and Assumption 4. There exists a
positive universal constant Cwarm up such that if τ1 = τwarm up log T, for

τwarm up ≥ Cwarm up σ⁴ ∥PK0∥³ Ψ⁸B⋆ max{ dX + dU, xb², log ∥P⋆∥ / log((1 − 1/(2∥P⋆∥))⁻¹), √dθ / (2εα²) },
When the misspecification is zero, the expected regret grows with log² T. Prior work (Cassel
et al., 2020; Jedra and Proutiere, 2022) has shown that such rates are possible if either A⋆ or B⋆
is known to the learner. See Appendix E.1 for details about how to obtain logarithmic regret in
these settings using Theorem 3. The above result expands on prior work by generalizing the conditions
on prior knowledge that are sufficient to achieve logarithmic regret.
With misspecification present, the above theorem has a term growing linearly in T. In contrast
to Theorem 2, the coefficient for this term is proportional to the squared level of misspecification,
which improves on the linear dependence on misspecification in Theorem 2. This result shows that
if some coarse system knowledge depending on a few unknown parameters is available in advance
and the unknown parameters are easily identifiable in the sense of Assumption 3, then there exists a
substantial regime of T for which the regret incurred is much smaller than that attained by learning
to control the system from scratch with fully unknown A⋆ and B⋆ (where the regret scales as √T).
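To make this regime concrete, compare only the dominant terms with all constants dropped: the linear bias term δ²T stays below the √T scratch-learning rate until T ≈ δ⁻⁴. This is a back-of-the-envelope sketch of the shapes of the bounds, not a statement of the theorems:

```python
def bias_term(T, delta):
    """Linear regret term from misspecification (shape only, constants dropped)."""
    return delta ** 2 * T

def scratch_rate(T):
    """sqrt(T) regret rate of learning with fully unknown A*, B* (shape only)."""
    return T ** 0.5

# bias_term < scratch_rate  <=>  T < delta**-4: for delta = 0.01 the
# misspecification term stays dominated until T ~ 1e8.
delta = 0.01
T_cross = delta ** -4
assert bias_term(T_cross / 10, delta) < scratch_rate(T_cross / 10)
assert bias_term(T_cross * 10, delta) > scratch_rate(T_cross * 10)
```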
troller to obtain a bound on the cost, which is then multiplied by the small probability of the failure
event. To control R1, we show that under the success event, the closeness condition of Lemma 1 is
satisfied. As a result, the cost of each epoch Jk is (τk − τ_{k−1}) J(K⋆) in addition to a term
proportional to (τk − τ_{k−1}) ( ∥[Âk B̂k] − [A⋆ B⋆]∥²_F + σk² ). Using the estimation error bounds of
the events Eest,1 and Eest,2 along with the choices for σk² in the two settings, we find that the quantity
R1 − T J(K⋆) is proportional to √T log T + d(Φ̂, Φ⋆) T log T in the setting of Section 3.1, and to
log² T + d(Φ̂, Φ⋆)² T in the setting of Section 3.2. Combining terms provides the expected regret
bounds in Theorems 2 and 3.⁶
4. Numerical Example
To validate our bounds, we run Algorithm 1 on the system (1) where A⋆ and B⋆ are obtained by linearizing and discretizing the cartpole dynamics defined by the equations (M + m)ẍ + mℓ(θ̈ cos(θ) −
θ̇² sin(θ)) = u, and m(ẍ cos(θ) + ℓθ̈ − g sin(θ)) = 0, for cart mass M = 1, pole mass m = 1,
pole length ℓ = 1, and gravity g = 1. Discretization uses Euler's approach with step size 0.25. The
disturbance signal is generated as wt ∼ N(0, 0.01 I).
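The linearization and discretization can be reproduced directly. Below is our derivation of the upright-equilibrium linearization under the stated parameters, followed by the Euler step; the resulting matrices can be checked against those reported in Appendix A.

```python
import numpy as np

# Linearize about the upright equilibrium with M = m = l = g = 1:
#   (M + m) x'' + m l th'' = u   ->  2 x'' + th'' = u
#   x'' + l th'' - g th = 0      ->  x'' = th - th''
# Solving the pair:  x'' = u - th,  th'' = 2 th - u.
A_ct = np.array([[0, 1, 0, 0],       # state (x, x', th, th')
                 [0, 0, -1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 2, 0]], dtype=float)
B_ct = np.array([[0], [1], [0], [-1]], dtype=float)

h = 0.25                             # Euler step size
A_star = np.eye(4) + h * A_ct
B_star = h * B_ct
print(A_star)
print(B_star.ravel())                # [0, 0.25, 0, -0.25]
```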
We consider various inputs for the representation estimate Φ̂ and the exploration sequence
σ12 , . . . , σk2fin . The remaining parameters are discussed in Appendix A.
No Misspecification: In Figure 1, we plot the regret of Algorithm 1 in the absence of misspecification, i.e. d(Φ̂, Φ⋆) = 0. We consider several instances for
[Figure 1: Regret of Algorithm 1 versus T]
Artificial Misspecification: In Figure 2, we compare the regret from the fully unknown setting
to the regret with a misspecified lumped parameter representation. Figure 2(a) considers the lumped
parameter representation that is extended such that Assumption 3 is violated. Therefore the learner
must continually inject noise into the system in order to explore. We artificially create misspecification by adding small perturbations to the true representation such that d(Φ̂, Φ⋆) > 0. We see that in
the low data regime, the regret incurred is less than that incurred when the model is fully unknown.
When the misspecification level is 0.05, the bias in the identification due to the misspecification
causes the regret to rapidly overtake the regret from the fully unknown setting. When the misspecification is small, the regret remains less than that from the fully unknown setting for the entire
horizon of T values that are plotted. Figure 2(b) considers the lumped parameter setting without
the extension, for which Assumption 3 is satisfied. As in Figure 2(a), we add a perturbation to the
representation to create misspecification. In this setting, we consider much larger perturbations,
such that d(Φ̂, Φ⋆ ) is 0.1, 0.15, or 0.20. For all three such situations, the regret begins much smaller
than that of the fully unknown model, but overtakes it as T becomes large.
Learned Representation: In Figure 3, we consider a setting motivated by multi-task learning, in which a representation is learned using offline data from several related systems.
[Figure 3: Regret of the fully unknown baseline versus the representation learned from data]
References
Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear
quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages
1–26. JMLR Workshop and Conference Proceedings, 2011.
Karl J Åström and Björn Wittenmark. Adaptive control. Courier Corporation, 2013.
Karl Johan Åström and Björn Wittenmark. On self tuning regulators. Automatica, 9(2):185–199,
1973.
Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regu-
lators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR,
2020.
Daniel Cederberg, Anders Hansson, and Anders Rantzer. Synthesis of minimax adaptive controller
for a finite set of linear systems. In 2022 IEEE 61st Conference on Decision and Control (CDC),
pages 1380–1384. IEEE, 2022.
Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently
with only √T regret. In International Conference on Machine Learning, pages 1300–1309.
PMLR, 2019.
Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper,
Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning.
arXiv preprint arXiv:1910.11215, 2019.
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for
robust adaptive control of the linear quadratic regulator. Advances in Neural Information Pro-
cessing Systems, 31, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning
the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
PC Gregory. Proceedings of the Self Adaptive Flight Control Systems Symposium, volume 59.
Wright Air Development Center, Air Research and Development Command, United . . . , 1959.
Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic
Learning Theory, pages 408–421. PMLR, 2020.
Petros A Ioannou and Jing Sun. Robust adaptive control, volume 1. PTR Prentice-Hall Upper
Saddle River, NJ, 1996.
Yassir Jedra and Alexandre Proutiere. Finite-time identification of stable linear systems optimality
of the least-squares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC),
pages 996–1001. IEEE, 2020.
Yassir Jedra and Alexandre Proutiere. Minimal expected regret in linear quadratic control. In
International Conference on Artificial Intelligence and Statistics, pages 10234–10321. PMLR,
2022.
Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Infor-
mation theoretic regret bounds for online nonlinear control. Advances in Neural Information
Processing Systems, 33:15312–15325, 2020.
Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic re-
gret bound in partially observable linear dynamical systems. Advances in Neural Information
Processing Systems, 33:20876–20888, 2020.
Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Animashree Anandkumar. Reinforce-
ment learning with fast stabilization in linear dynamical systems. In International Conference on
Artificial Intelligence and Statistics, pages 5354–5390. PMLR, 2022.
Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear
quadratic control. Advances in Neural Information Processing Systems, 32, 2019.
Aditya Modi, Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Joint
learning of linear time-invariant dynamical systems. arXiv preprint arXiv:2112.10955, 2021.
Kumpati S Narendra and Anuradha M Annaswamy. Stable adaptive systems. Courier Corporation,
2012.
Victor H de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory
and statistical applications. Springer, 2009.
Anders Rantzer. Minimax adaptive control for a finite set of linear systems. In Learning for Dy-
namics and Control, pages 893–904. PMLR, 2021.
Venkatraman Renganathan, Andrea Iannelli, and Anders Rantzer. An online learning analysis of
minimax adaptive control. arXiv preprint arXiv:2307.07268, 2023.
Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International
Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In
Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
Gunter Stein. Adaptive flight control: A pragmatic view. In Applications of Adaptive Control, pages
291–312. Elsevier, 1980.
Gilbert W Stewart and Ji-guang Sun. Matrix perturbation theory. Academic Press, 1990.
Paolo Tilli. Singular values and eigenvalues of non-Hermitian block Toeplitz matrices. Linear
Algebra and its Applications, 272(1-3):59–89, 1998.
Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance
of task diversity. Advances in neural information processing systems, 33:7852–7862, 2020.
Anastasios Tsiamis, Ingvar M Ziemann, Manfred Morari, Nikolai Matni, and George J Pappas.
Learning to control linear systems can be hard. In Conference on Learning Theory, pages 3820–
3857. PMLR, 2022.
Anastasios Tsiamis, Ingvar Ziemann, Nikolai Matni, and George J Pappas. Statistical learning
theory for control: A finite-sample perspective. IEEE Control Systems Magazine, 43(6):67–97,
2023.
Han Wang, Leonardo F Toso, Aritra Mitra, and James Anderson. Model-free learning with hetero-
geneous dynamical systems: A federated lqr approach. arXiv preprint arXiv:2308.11743, 2023.
Thomas T Zhang, Katie Kang, Bruce D Lee, Claire Tomlin, Sergey Levine, Stephen Tu, and Nikolai
Matni. Multi-task imitation learning for linear dynamical systems. In Learning for Dynamics and
Control Conference, pages 586–599. PMLR, 2023a.
Thomas TCK Zhang, Leonardo F Toso, James Anderson, and Nikolai Matni. Meta-learning opera-
tors to optimality from multi-task non-iid data. arXiv preprint arXiv:2308.04428, 2023b.
Ingvar Ziemann and Henrik Sandberg. Regret lower bounds for learning linear quadratic gaussian
systems. arXiv preprint arXiv:2201.01680, 2022.
Ingvar Ziemann, Anastasios Tsiamis, Bruce Lee, Yassir Jedra, Nikolai Matni, and George J. Pappas.
A tutorial on the non-asymptotic theory of system identification, 2023.
Appendix A. Additional Experimental Details
The matrices A⋆ and B⋆ considered in Section 4 are given as follows:

A⋆ = [ 1  0.25  0     0
       0  1    −0.25  0
       0  0    1      0.25
       0  0    0.5    1 ],   B⋆ = [ 0  0.25  0  −0.25 ]⊤,   (7)
where the superscript (i) indicates which of the five systems the data is collected from.
and a sufficiently large universal constant c > 0. There exists an event Els which holds with probability at least 1 − δ under which the estimation error satisfies

∥Φ̂ θ̂t − Φ⋆ θ⋆∥² ≲ (dθ σ² / (t λmin(∆̄t(σu, K)))) log(1/δ)
  + ( 1 + σ⁴ ∥PK∥⁷ Ψ⁶B⋆ (dX + dU + log(1/δ)) / (t λmin(∆̄t(σu, K))²) ) ∥PK∥² Ψ²B⋆ d(Φ̂, Φ⋆)² ∥θ⋆∥² / λmin(∆̄t(σu, K)),

where

∆̄t(σu, K) ≜ Φ̂⊤ ( (1/t) Σ_{s=0}^{t−2} Σ_{j=0}^{s} [I; K] A_K^j (σu² B⋆(B⋆)⊤ + I) (A_K^j)⊤ [I; K]⊤ + [ 0  0 ; 0  σu² I_{dU} ] ) Φ̂.
Toward proving the above theorem, we first define a mapping from the noise sequence and the
state x1 to the stacked sequence of states that will be used throughout the section. In particular,

[ x1 ; x2 ; . . . ; xt ] = [ F1t(K, σu)  F2t(K, σu) ] [ x1 ; ηt ],   (9)

where ηt ≜ [ w1⊤  g1⊤  . . .  w_{t−1}⊤  g_{t−1}⊤ ]⊤ and

[ F1t(K, σu)  F2t(K, σu) ] ≜
[ I            0                     . . .   0
  A_K          [I  B⋆σu]             . . .   0
  ⋮                                   ⋱      ⋮
  A_K^{t−1}    A_K^{t−2} [I  B⋆σu]   . . .   [I  B⋆σu] ].
Using the fact that [xt; ut] = [I; K] xt + [0; σu gt], we may express the stacked states and inputs as

[ x1 ; u1 ; . . . ; xt ; ut ] = (It ⊗ [I; K]) F1t(K, σu) x1 + ( (It ⊗ [I; K]) F2t(K, σu) + It ⊗ [ 0  0 ; 0  σu I_{dU} ] ) ηt
  ≜ Ξ1t(K, σu) x1 + Ξ2t(K, σu) ηt.   (10)
We additionally define two quantities related to the empirical covariance matrix. In particular, we
define the expected value of the empirical covariance matrix conditioned on the state at the start of
the sequence, x1, along with the centered counterpart:

Σt(K, σu, x1) ≜ E[ (1/t) Σ_{s=1}^{t} [xs; us][xs; us]⊤ | x1 ]   and

Σ̄t(K, σu, x1) ≜ (1/t) Σ_{s=1}^{t} E[ ( [xs; us] − E[[xs; us] | x1] ) ( [xs; us] − E[[xs; us] | x1] )⊤ | x1 ].

4. Σt(K, σu, x1) ⪯ ( 1 + ∥PK∥ ∥x1∥² / (t − 1) ) Σ̄t(K, σu, x1)
Proof
Points 1 and 2: These results follow by expressing the states and inputs in terms of the noise
sequence and x1 , then using the fact that the noise terms are independent, with mean zero and
identity covariance.
Point 3: The first inequality is immediate from point one by the fact that

(1/t) Σ_{s=0}^{t−1} [I; K] A_K^s x1 x1⊤ (A_K^s)⊤ [I; K]⊤ ⪰ 0.

(1/t) Σ_{s=0}^{t−2} Σ_{j=0}^{s} ∥·∥ ≤ 1 + (1 + ∥K∥²)(∥B⋆∥² + 1) ∥ (1/t) Σ_{s=0}^{t−2} Σ_{j=0}^{s} A_K^j (A_K^j)⊤ ∥
  ≤ 1 + (1 + ∥K∥²)(∥B⋆∥² + 1) ∥ Σ_{j=0}^{∞} A_K^j (A_K^j)⊤ ∥
  = 1 + (1 + ∥K∥²)(∥B⋆∥² + 1) ∥dlyap(AK, I)∥
  ≤ 1 + (1 + ∥K∥²)(∥B⋆∥² + 1) ∥PK∥
  ≤ 5 ∥PK∥² Ψ²B⋆,

where the final point follows from the fact that ∥K∥² ≤ ∥Q + K⊤RK∥ ≤ ∥PK∥, ∥PK∥ ≥ 1, and
the definition of ΨB⋆.
To analyze the estimate θ̂, we first verify that it solves the least squares problem of interest.
Lemma 3 Consider applying Algorithm 2 to an estimated model structure Φ̂, state data x1:t+1 and
input data u1:t: θ̂, Λ = LS(Φ̂, x1:t+1, u1:t). As long as Λ ≻ 0, θ̂ is the unique solution to

argmin_θ Σ_{s=1}^{t} ∥ x_{s+1} − vec⁻¹(Φ̂ θ) [xs; us] ∥².   (11)
We now split the estimation error into a part arising from the variance and a part arising from
the bias. The variance decays as t grows sufficiently large. The bias is due to the misspecification
between the model structure estimate Φ̂ and the true model structure Φ⋆.
Lemma 4 Consider applying Algorithm 2 to an estimated model structure Φ̂, state data x1:t+1, and
input data u1:t: θ̂, Λ = LS(Φ̂, x1:t+1, u1:t). Suppose that Λ ≻ 0. We have that

∥Φ̂ θ̂ − Φ⋆ θ⋆∥² ≤ 2 ∥Φ̂ θ̂ − Φ̂ θ̄∥² + 2 ∥Φ̂ θ̄ − Φ⋆ θ⋆∥²,   (13)

where the first term on the right is the variance and the second is the bias.
Proof The lemma follows by application of the triangle inequality.
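The variance/bias split can also be observed numerically. The sketch below fits the least squares problem (11) once with the exact basis and once with a perturbed basis; it uses i.i.d. regressors rather than closed-loop data, purely to isolate the effect, and all dimensions and noise levels are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_u, d_th = 3, 2, 4
n = d_x * (d_x + d_u)

Phi_star, _ = np.linalg.qr(rng.standard_normal((n, d_th)))
theta_star = rng.standard_normal(d_th)
AB_true = Phi_star @ theta_star

def unvec(v):
    return v.reshape(d_x + d_u, d_x).T

def fit(Phi, t):
    # Least squares (11), with i.i.d. regressors z_s standing in for [x_s; u_s].
    M = unvec(AB_true)
    D, y = [], []
    for _ in range(t):
        z = rng.standard_normal(d_x + d_u)
        D.append(np.kron(z, np.eye(d_x)) @ Phi)   # unvec(Phi theta) z is linear in theta
        y.append(M @ z + 0.1 * rng.standard_normal(d_x))
    theta, *_ = np.linalg.lstsq(np.vstack(D), np.concatenate(y), rcond=None)
    return np.linalg.norm(Phi @ theta - AB_true)

# Misspecified basis: perturb Phi_star and re-orthonormalize.
Phi_hat, _ = np.linalg.qr(Phi_star + 0.05 * rng.standard_normal((n, d_th)))

err_exact = fit(Phi_star, 5000)  # variance only: shrinks as t grows
err_missp = fit(Phi_hat, 5000)   # an irreducible bias floor remains
```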
The variance term in the above lemma is bounded in Lemma 10, while the bias term is bounded
in Lemma 11. Before presenting these lemmas, we present several covariance concentration guarantees that will be useful in obtaining the variance bound.
Lemma 5 Let W be a d × d random matrix, and ε ∈ [0, 1/2). Furthermore, let N be an ε-net
of S^{d−1} with minimal cardinality. Then for all ρ > 0,

P[∥W∥ > ρ] ≤ (2/ε + 1)^d max_{x∈N} P[ |x⊤ W x| > (1 − 2ε)ρ ].
The following lemma helps to bound system constants in terms of a few key quantities.
The result on the H∞ norm of AK follows from the previous point by noting that ∥AK∥_{H∞} ≤ Σ_{t=0}^{∞} ∥A_K^t∥.
We now present a result about the concentration of the empirical covariance around the true covariance. The
result is adapted from Lemma 2 of Jedra and Proutiere (2020) and Lemma A.2 of Zhang et al.
(2023a), which generalized the approximate isometries argument in Jedra and Proutiere (2020) to
account for projecting the covariance onto a lower dimensional space by hitting it on either side
with a low rank orthonormal matrix. The primary difference of the result below from that in Zhang
et al. (2023a) is the inclusion of a nonzero initial state, resulting in an additional sub-Gaussian
concentration step.
Lemma 7 (Approximate Isometries) Let data x1:t, u1:t be generated by (8). Let G ∈
R^{dc(dX+dU) × dG} be a matrix with orthonormal columns. Define

M ≜ ( G⊤ (Σt(K, σu, x1) ⊗ I_{dc}) G )^{−1/2}.

Then for some positive universal constant C, and for any ρ < 1, the inequality

∥ M G⊤ ( (1/τ_{k−1}) Σ_{t=τ_{k−1}}^{τ_k − 1} [xt; ut][xt; ut]⊤ ⊗ I_{dc} ) G M − I ∥ > ρ
such that

(1/t) Σ_{s=1}^{t} [xs; us][xs; us]⊤ ⊗ I_{dc} = X⊤ X.
First note that E[G⊤ X⊤ X G | x1] = M⁻². Then

∥M G⊤ X⊤ X G M − I∥ = sup_{v ∈ S^{dG}} | v⊤ ( M G⊤ X⊤ X G M − I ) v |   (14)
  = sup_{v ∈ S^{dG}} | v⊤ M G⊤ X⊤ X G M v − E[ v⊤ M G⊤ X⊤ X G M v | x1 ] |
  = sup_{v ∈ S^{dG}} | ∥X G M v∥² − E[ ∥X G M v∥² | x1 ] |.
Next, we construct matrices Γ1t(K, σu) and Γ2t(K, σu) such that vec(X⊤) = Γ1t(K, σu) x1 +
Γ2t(K, σu) ηt. To construct these matrices, first recall the definitions of Ξ1t(K, σu) and Ξ2t(K, σu)
given in (10) and of F1t(K, σu) and F2t(K, σu) given in (9). To reach vec(X⊤) from the vector of
stacked states and inputs, we can apply the following transformation:

vec(X⊤) = (1/√t) ( It ⊗ [ I_{dX+dU} ⊗ e1 ; . . . ; I_{dX+dU} ⊗ e_{dc} ] ) [ x1 ; u1 ; . . . ; xt ; ut ] ≜ (1/√t) H [ x1 ; u1 ; . . . ; xt ; ut ].

Combining the above sequence of transformations implies that by defining Γ1t(K, σu) ≜
(1/√t) H Ξ1t(K, σu) and Γ2t(K, σu) ≜ (1/√t) H Ξ2t(K, σu), we have vec(X⊤) = Γ1t(K, σu) x1 + Γ2t(K, σu) ηt.
Using the above expression, and abbreviating $\Gamma_i^t \equiv \Gamma_i^t(K, \sigma_u)$, our quantity of interest in (14) may be expressed as
$$\left\|M G^\top X^\top X G M - I\right\|$$
$$= \sup_{v \in \mathcal{S}^{d_G}} \left( \left\|\sigma_{GMv}\left(\Gamma_1^t x_1 + \Gamma_2^t \eta_t\right)\right\|^2 - \mathbb{E}\left[ \left\|\sigma_{GMv}\left(\Gamma_1^t x_1 + \Gamma_2^t \eta_t\right)\right\|^2 \,\middle|\, x_1 \right] \right)$$
$$= \sup_{v \in \mathcal{S}^{d_G}} \Big( \left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 + \left\|\sigma_{GMv}\Gamma_1^t x_1\right\|^2 + 2\left\langle \sigma_{GMv}\Gamma_1^t x_1, \sigma_{GMv}\Gamma_2^t \eta_t\right\rangle$$
$$\qquad - \mathbb{E}\left[ \left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 + \left\|\sigma_{GMv}\Gamma_1^t x_1\right\|^2 + 2\left\langle \sigma_{GMv}\Gamma_1^t x_1, \sigma_{GMv}\Gamma_2^t \eta_t\right\rangle \,\middle|\, x_1\right] \Big)$$
$$\overset{(i)}{=} \sup_{v \in \mathcal{S}^{d_G}} \left( \left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 + 2\left\langle \sigma_{GMv}\Gamma_1^t x_1, \sigma_{GMv}\Gamma_2^t \eta_t\right\rangle - \mathbb{E}\left[\left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 \,\middle|\, x_1\right] \right)$$
$$\overset{(ii)}{=} \sup_{v \in \mathcal{S}^{d_G}} \left( \left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 + 2\left\langle \sigma_{GMv}\Gamma_1^t x_1, \sigma_{GMv}\Gamma_2^t \eta_t\right\rangle - \left\|\sigma_{GMv}\Gamma_2^t\right\|_F^2 \right)$$
$$\le \sup_{v \in \mathcal{S}^{d_G}} \underbrace{\left| \left\|\sigma_{GMv}\Gamma_2^t \eta_t\right\|^2 - \left\|\sigma_{GMv}\Gamma_2^t\right\|_F^2 \right|}_{\text{quadratic form concentration}} + \underbrace{2\left\langle \sigma_{GMv}\Gamma_1^t x_1, \sigma_{GMv}\Gamma_2^t \eta_t\right\rangle}_{\text{cross term}},$$
where (i) follows from linearity of expectation and the substitution rule of conditional expectation, combined with the fact that $\eta_t$ has mean zero and is independent of $x_1$. Equality (ii) follows from the fact that $\eta_t$ is mean zero with identity covariance.
To bound the concentration of the quadratic form, we invoke Theorem 6.3.2 of Vershynin (2020). In particular, there exists a universal constant $C > 0$ such that
$$\Pr\left[\left| \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\|^2 - \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2 \right| \ge \frac{\rho}{2}\right] \le 2\exp\left(-C \frac{\min\left\{\rho^2/\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2,\ \rho\right\}}{\sigma^4 \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2}\right). \quad (16)$$
For the cross term, we invoke a standard sub-Gaussian tail bound (Vershynin, 2020) to show that for any fixed $v \in \mathcal{S}^{d_G}$,
$$\Pr\left[2\left\langle \sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1,\ \sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\rangle > \frac{\rho}{2}\right] \le 2\exp\left(-\frac{1}{16}\, \frac{\rho^2}{\sigma^2 \left\|\Gamma_2^t(K,\sigma_u)^\top \sigma_{GMv}^\top \sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2}\right). \quad (17)$$
We now seek to simplify the denominators appearing in the exponents of each of the probability bounds above. First, note that we have
$$\left\|\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2 + \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2$$
$$= \operatorname{tr}\left(\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1 x_1^\top \Gamma_1^t(K,\sigma_u)^\top \sigma_{GMv}^\top\right) + \operatorname{tr}\left(\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\Gamma_2^t(K,\sigma_u)^\top \sigma_{GMv}^\top\right)$$
$$= \operatorname{tr}\left(\sigma_{GMv}\left(\Gamma_1^t(K,\sigma_u)x_1x_1^\top\Gamma_1^t(K,\sigma_u)^\top + \Gamma_2^t(K,\sigma_u)\Gamma_2^t(K,\sigma_u)^\top\right)\sigma_{GMv}^\top\right)$$
$$= \mathbb{E}\left[\operatorname{tr}\left(\sigma_{GMv}\operatorname{vec}(X^\top)\operatorname{vec}(X^\top)^\top\sigma_{GMv}^\top\right) \,\middle|\, x_1\right],$$
where the final equality follows from the fact that $\eta_t$ is mean zero with identity covariance and independent from $x_1$, combined with the construction of $\Gamma_1^t(K,\sigma_u)$ and $\Gamma_2^t(K,\sigma_u)$. Employing the identity in (15), we find
$$\mathbb{E}\left[\operatorname{tr}\left(\sigma_{GMv}\operatorname{vec}(X^\top)\operatorname{vec}(X^\top)^\top\sigma_{GMv}^\top\right)\,\middle|\, x_1\right] = \mathbb{E}\left[v^\top M G^\top X^\top X G M v \,\middle|\, x_1\right] = 1,$$
where the final equality follows by the definition of $M$ and the fact that $\|v\| = 1$. The above computations provide the identity $\left\|\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2 + \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2 = 1$. We therefore have $\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2 \le 1$ and $\left\|\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2 \le 1$. As a result, the quantity in the denominator of the exponential in (17) is bounded as
$$\left\|\Gamma_2^t(K,\sigma_u)^\top\sigma_{GMv}^\top\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2 \le \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2\left\|\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1\right\|^2 \le \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2.$$
Therefore, the probability bounds in (16) and (17) may be modified to read
$$\Pr\left[\left|\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\|^2 - \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2\right| \ge \frac{\rho}{2}\right] \le 2\exp\left(-C\frac{\min\{\rho,\rho^2\}}{\sigma^4\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2}\right)$$
$$\Pr\left[2\left\langle\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1, \sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\rangle > \frac{\rho}{2}\right] \le 2\exp\left(-\frac{1}{16}\frac{\rho^2}{\sigma^2\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2}\right).$$
Union bounding over the above two events, and using the assumption that $\sigma^2 \ge 1$, implies that for some universal constant $C > 0$,
$$\Pr\left[\left|\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\|^2 - \left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|_F^2\right| + 2\left\langle\sigma_{GMv}\Gamma_1^t(K,\sigma_u)x_1, \sigma_{GMv}\Gamma_2^t(K,\sigma_u)\eta_t\right\rangle > \rho\right]$$
$$\le 4\exp\left(-C\frac{\min\{\rho,\rho^2\}}{\sigma^4\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2}\right).$$
We may now invoke the covering argument in Lemma 5 with $\varepsilon = \frac{1}{4}$ to show that there exists a universal positive constant $C$ such that
$$\Pr\left[\left\|MG^\top X^\top XGM - I\right\| > \rho\right] \le 4 \cdot 9^{d_G}\exp\left(-C\frac{\min\{\rho,\rho^2\}}{\sigma^4\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|^2}\right). \quad (18)$$
To conclude, we must bound the term $\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\|$. To do so, we write
$$\sigma_{GMv}\Gamma_2^t(K,\sigma_u) = \frac{1}{\sqrt{t}}\sigma_{GMv}H\Xi_2^t(K,\sigma_u)$$
$$= \frac{1}{\sqrt{t}}\sigma_{GMv}H\left(\left(I_t\otimes\begin{bmatrix}I\\K\end{bmatrix}\right)F_2^t(K,\sigma_u) + I_t\otimes\begin{bmatrix}0&0\\0&\sigma_uI_{d_U}\end{bmatrix}\right)$$
$$\le \frac{1}{\sqrt{t}}\Bigg(\underbrace{\left\|\sigma_{GMv}H\left(I_t\otimes\begin{bmatrix}I\\K\end{bmatrix}\right)\right\|\left\|F_2^t(K,\sigma_u)\right\|}_{\text{state term}} + \underbrace{\left\|\sigma_{GMv}H\left(I_t\otimes\begin{bmatrix}0&0\\0&\sigma_uI_{d_U}\end{bmatrix}\right)\right\|}_{\text{input noise term}}\Bigg).$$
The second term quantifies the system's amplification of an exogenous input, and is bounded by the $\mathcal{H}_\infty$ norm of the system $\left(A_K, \begin{bmatrix}I & \sigma_uB^\star\end{bmatrix}\right)$ (Tilli, 1998, Corollary 4.2). This, in turn, is bounded by the $\mathcal{H}_\infty$ norm of the matrix $A_K$ multiplied by the spectral norm of the input matrix, yielding the stated bound, where (i) follows by the following Kronecker product identities: $I_{ab}\otimes W = I_a\otimes(I_b\otimes W)$, $(A\otimes B)(C\otimes D) = AC\otimes BD$, and $\|I\otimes W\| = \|W\|$. Equality (ii) follows by definition of the operator norm, and equality (iii) follows by the vectorization identity $\operatorname{vec}(WYZ) = (Z^\top\otimes W)\operatorname{vec}(Y)$. Equality (iv) follows by the definition of the inverse vectorization operation. The quantity in the norm above may now be written as the inner product of a vector with itself. In particular, we have
$$\sup_{w\in\mathcal{S}^{d_X}}\left\|\operatorname{vec}\left(v^\top MG^\top\left(\begin{bmatrix}I\\K\end{bmatrix}w\otimes I_{d_c}\right)\right)\right\|$$
$$= \sup_{w\in\mathcal{S}^{d_X}}\sqrt{v^\top MG^\top\left(\begin{bmatrix}I\\K\end{bmatrix}ww^\top\begin{bmatrix}I\\K\end{bmatrix}^\top\otimes I_{d_c}\right)GMv}$$
$$\overset{(i)}{\le} \sup_{w\in\mathcal{S}^{d_X}}\sqrt{v^\top MG^\top\left(\begin{bmatrix}I\\K\end{bmatrix}\begin{bmatrix}I\\K\end{bmatrix}^\top\otimes I_{d_c}\right)GMv}$$
$$\overset{(ii)}{\le} \sqrt{2\,v^\top MG^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_c}\right)GMv} \le \sqrt{2},$$
where inequality (i) follows by the fact that $ww^\top \preceq I$, and inequality (ii) follows from the fact that $\begin{bmatrix}I\\K\end{bmatrix}\begin{bmatrix}I\\K\end{bmatrix}^\top \preceq 2\Sigma^t(K,\sigma_u,x_1)$, as may be verified by examining the expression for $\bar\Sigma^t(K,\sigma_u,x_1)$ in Lemma 2. We may similarly show that the input term is bounded by one:
$$\left\|\sigma_{GMv}H\left(I_t\otimes\begin{bmatrix}0&0\\0&\sigma_uI_{d_U}\end{bmatrix}\right)\right\| \le 1.$$
Combining results, we have the bound
$$\left\|\sigma_{GMv}\Gamma_2^t(K,\sigma_u)\right\| \le \frac{1 + 2\left\|F_2^t(K,\sigma_u)\right\|}{\sqrt{t}} \le \frac{5}{\sqrt{t}}\|P_K\|^{3/2}\Psi_{B^\star,\sigma_w},$$
where we used the result of Lemma 6 to bound $\|A_K\|_{\mathcal{H}_\infty}$ in terms of $\|P_K\|$.
Substituting this result into the bound in (18), we find that for some universal constant $C$,
$$\Pr\left[\left\|MG^\top X^\top XGM - I\right\| > \rho\right] \le 4\cdot 9^{d_G}\exp\left(-C\frac{t\min\{\rho,\rho^2\}}{\sigma^4\|P_K\|^3\Psi_{B^\star}^2}\right).$$
$$\mathcal{E}_{\mathrm{conc}} = \left\{\frac{1}{2}\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi \preceq \hat\Phi^\top\left(\frac{1}{t}\sum_{s=1}^{t}\begin{bmatrix}x_s\\u_s\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}^\top\otimes I_{d_X}\right)\hat\Phi \preceq 2\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right\}.$$
For a sufficiently large universal constant $c > 0$, the event holds with probability at least $1 - \frac{\delta}{2}$.
Definition 6 (Filtration and Adapted Process) A sequence of sub-σ-algebras {Fs }ts=1 is said to be
a filtration if Fs ⊆ Fk for s ≤ k. A stochastic process {ws }ts=1 is said to be adapted to the filtration
{Fs }ts=1 if for all s ≥ 1, ws is Fs -measurable.
Theorem 7 Let $\{\mathcal{F}_s\}_{s=0}^t$ be a filtration, let $\{z_s\}_{s=1}^t$ be a matrix-valued stochastic process assuming values in $\mathbb{R}^{d_X\times d_\theta}$ which is adapted to $\{\mathcal{F}_{s-1}\}_{s=1}^t$, and let $\{w_s\}_{s=1}^t$ be a vector-valued stochastic process assuming values in $\mathbb{R}^{d_X}$ which is adapted to $\{\mathcal{F}_s\}_{s=1}^t$. Additionally, suppose that for all $1\le s\le t$, $w_s$ is $\sigma^2$-conditionally sub-Gaussian with respect to $\mathcal{F}_s$. Let $\Sigma$ be a positive definite matrix in $\mathbb{R}^{d_\theta\times d_\theta}$. Then for $\delta\in(0,1)$, the following inequality holds with probability at least $1-\delta$:
$$\left\|\left(\Sigma + \sum_{s=1}^{t}z_s^\top z_s\right)^{-1/2}\sum_{s=1}^{t}z_s^\top w_s\right\|^2 \le \sigma^2\log\left(\frac{\det\left(\Sigma+\sum_{s=1}^{t}z_s^\top z_s\right)}{\det(\Sigma)}\right) + 2\sigma^2\log\frac{1}{\delta}.$$
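As an empirical sanity check (not part of the argument), the simplest scalar instance of Theorem 7 can be probed by Monte Carlo: take $z_s = 1$, $w_s \sim \mathcal{N}(0,1)$ so that $\sigma = 1$, and $\Sigma = 1$. All parameter values below are hypothetical.

```python
import numpy as np

# Monte Carlo check of the self-normalized bound in the scalar case
# z_s = 1, w_s ~ N(0, 1), Sigma = 1, delta = 0.1 (illustrative values).
rng = np.random.default_rng(1)
t, sigma, delta, trials = 200, 1.0, 0.1, 2000
Sigma = 1.0

failures = 0
for _ in range(trials):
    w = rng.normal(size=t)            # 1-sub-Gaussian noise sequence
    lhs = w.sum() ** 2 / (Sigma + t)  # ||(Sigma + sum z^2)^(-1/2) sum z w||^2
    rhs = sigma**2 * np.log((Sigma + t) / Sigma) + 2 * sigma**2 * np.log(1 / delta)
    failures += lhs > rhs

# The deviation bound should fail with frequency at most delta.
assert failures / trials <= delta
```

The bound is conservative in this setting, so the observed failure frequency is typically far below $\delta$.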
Proof The proof follows by applying Lemma 4.1 from Ziemann et al. (2023) with $P \leftarrow \sum_{s=1}^{t}\frac{z_s^\top w_s}{\sigma}$ and $Q \leftarrow \sum_{s=1}^{t}z_s^\top z_s$. To verify the conditions for this lemma to hold, we must verify that $\max_{\lambda\in\mathbb{R}^{d_\theta}}\mathbb{E}\exp\left(\lambda^\top P - \frac{1}{2}\lambda^\top Q\lambda\right)\le 1$. To verify this condition, we define the quantity
$$M_s(\lambda) \triangleq \exp\left(\sum_{k=1}^{s}\left(\frac{\lambda^\top z_k^\top w_k}{\sigma} - \frac{1}{2}\lambda^\top z_k^\top z_k\lambda\right)\right).$$
By the tower rule and the fact that $M_{s-1}(\lambda)$ is $\mathcal{F}_{s-1}$-measurable,
$$\mathbb{E}\left[M_s(\lambda)\right] = \mathbb{E}\left[\mathbb{E}\left[M_s(\lambda)\mid\mathcal{F}_{s-1}\right]\right] = \mathbb{E}\left[M_{s-1}(\lambda)\,\mathbb{E}\left[\exp\left(\frac{\lambda^\top z_s^\top w_s}{\sigma} - \frac{1}{2}\lambda^\top z_s^\top z_s\lambda\right)\,\middle|\,\mathcal{F}_{s-1}\right]\right].$$
$$\left\|\hat\Phi\hat\theta - \hat\Phi\bar\theta\right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)^2}{\lambda_{\min}(\Sigma)^2}\left\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\right\|^2,$$
where
$$\mathrm{Conc}_{\mathrm{cov}}(\delta) = \left\|\Sigma^t(K,\sigma_u,x_1)\right\|\max\left\{\sqrt{t}\,\sigma^2\|P_K\|^{3/2}\Psi_{B^\star}\sqrt{d_X+d_U+\log\frac{1}{\delta}},\ \ \sigma^4\|P_K\|^3\Psi_{B^\star}^2\left(d_X+d_U+\log\frac{1}{\delta}\right)\right\}.$$
$$\sum_{s=1}^{t}\left\|\operatorname{vec}^{-1}\left(\hat\Phi\bar\theta - \hat\Phi\hat\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\|^2 \le 2\sum_{s=1}^{t}\left\langle \operatorname{vec}^{-1}\left(\hat\Phi\left(\hat\theta-\bar\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix},\ \operatorname{vec}^{-1}\left(\Phi^\star\theta^\star - \hat\Phi\bar\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\rangle$$
$$\qquad + 2\sum_{s=1}^{t}\left\langle \operatorname{vec}^{-1}\left(\hat\Phi\left(\hat\theta-\bar\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix},\ w_s\right\rangle.$$
Multiplying the above inequality by two, and bringing one of the sums from the right side to the left, leaves us with
$$\sum_{s=1}^{t}\left\|\operatorname{vec}^{-1}\left(\hat\Phi\bar\theta-\hat\Phi\hat\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\|^2 \le \underbrace{4\sum_{s=1}^{t}\left\langle \operatorname{vec}^{-1}\left(\hat\Phi\left(\hat\theta-\bar\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix},\ \operatorname{vec}^{-1}\left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\rangle}_{\text{variance from misspecification}}$$
$$\qquad + \underbrace{4\sum_{s=1}^{t}\left\langle \operatorname{vec}^{-1}\left(\hat\Phi\left(\hat\theta-\bar\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix},\ w_s\right\rangle - \sum_{s=1}^{t}\left\|\operatorname{vec}^{-1}\left(\hat\Phi\left(\bar\theta-\hat\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\|^2}_{\text{self-normalized process}}. \quad (19)$$
The self-normalized process may be bounded as
$$4\sum_{s=1}^{t}\left\langle z_s\left(\hat\theta-\bar\theta\right), w_s\right\rangle - \sum_{s=1}^{t}\left\|z_s\left(\bar\theta-\hat\theta\right)\right\|^2 \le \sup_\theta\left\{4\sum_{s=1}^{t}\langle z_s\theta, w_s\rangle - \sum_{s=1}^{t}\|z_s\theta\|^2\right\} = 4\left\|\Lambda^{-1/2}\sum_{s=1}^{t}z_s^\top w_s\right\|^2.$$
$$4\left\|\Lambda^{-1/2}\sum_{s=1}^{t}z_s^\top w_s\right\|^2 \le 8\left\|(\Lambda+\Sigma)^{-1/2}\sum_{s=1}^{t}z_s^\top w_s\right\|^2.$$
Invoking the self-normalized martingale bound in Theorem 7, we find that there exists an event $\mathcal{E}_{\mathrm{noise}}$ that holds with probability at least $1-\delta/4$ such that under the event $\mathcal{E}_{\mathrm{noise}}$,
$$\left\|(\Lambda+\Sigma)^{-1/2}\sum_{s=1}^{t}z_s^\top w_s\right\|^2 \le \sigma^2\log\left(\frac{16\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right).$$
Bound on Variance from misspecification: We now proceed to bound the term in (19) arising due to variance from the model misspecification. To do so, note that
$$\sum_{s=1}^{t}\left\langle \operatorname{vec}^{-1}\left(\hat\Phi\left(\hat\theta-\bar\theta\right)\right)\begin{bmatrix}x_s\\u_s\end{bmatrix},\ \operatorname{vec}^{-1}\left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\rangle$$
$$= \left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)^\top\left(\sum_{s=1}^{t}\begin{bmatrix}x_s\\u_s\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}^\top\otimes I_{d_X}\right)\hat\Phi\left(\hat\theta-\bar\theta\right)$$
$$= t\left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\left(\hat\theta-\bar\theta\right)$$
$$\quad+ \left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)^\top\left(\left(\sum_{s=1}^{t}\begin{bmatrix}x_s\\u_s\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}^\top - t\Sigma^t(K,\sigma_u,x_1)\right)\otimes I_{d_X}\right)\hat\Phi\left(\hat\theta-\bar\theta\right).$$
Invoking Lemma 7 with $G=I_{d_X+d_U}$, we find that there exists an event $\mathcal{E}_{\mathrm{cov}}$ that holds with probability at least $1-\delta/4$ such that under $\mathcal{E}_{\mathrm{cov}}$,
$$\left\|\sum_{s=1}^{t}\begin{bmatrix}x_s\\u_s\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}^\top - t\Sigma^t(K,\sigma_u,x_1)\right\| \lesssim \mathrm{Conc}_{\mathrm{cov}}(\delta).$$
Combining bounds: In light of the bounds on the self-normalized process and the variance from misspecification above, we define the event $\mathcal{E}_{\mathrm{var}} = \mathcal{E}_{\mathrm{noise}}\cap\mathcal{E}_{\mathrm{cov}}$. This event holds with probability at least $1-\delta/2$ by a union bound. Substituting the bounds arising under these events into (19), we find
$$\sum_{s=1}^{t}\left\|\operatorname{vec}^{-1}\left(\hat\Phi\bar\theta-\hat\Phi\hat\theta\right)\begin{bmatrix}x_s\\u_s\end{bmatrix}\right\|^2 \lesssim \sigma^2\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) + \mathrm{Conc}_{\mathrm{cov}}(\delta)\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|\left\|\hat\Phi\left(\hat\theta-\bar\theta\right)\right\|,$$
where we absorbed the 16 from the log in the first term into a universal constant. The left side of the above inequality may be lower bounded by $\lambda_{\min}(\Sigma)\left\|\bar\theta-\hat\theta\right\|^2$, which may in turn be written as $\lambda_{\min}(\Sigma)\left\|\hat\Phi\bar\theta-\hat\Phi\hat\theta\right\|^2$ by the fact that $\hat\Phi$ has orthonormal columns. We therefore obtain the following bound:
$$\left\|\hat\Phi\bar\theta-\hat\Phi\hat\theta\right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)}{\lambda_{\min}(\Sigma)}\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|\left\|\hat\Phi\left(\hat\theta-\bar\theta\right)\right\|.$$
Now, if we define
$$a \triangleq \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)}{\lambda_{\min}(\Sigma)}\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|,\qquad b \triangleq \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right),\qquad x \triangleq \left\|\hat\Phi\hat\theta-\hat\Phi\bar\theta\right\|,$$
then the above inequalities may be combined to read $x^2 \le ax+b$, from which we can conclude $x\le\frac{1}{2}\left(a+\sqrt{a^2+4b}\right)$, or equivalently,
$$x^2 \le \frac{1}{4}\left(2a^2+4b+2\sqrt{a^4+4a^2b}\right) \le \frac{1}{4}\left(2a^2+4b+4a^2+2b\right) \le 2a^2+2b,$$
where the middle inequality uses $\sqrt{a^4+4a^2b}\le 2a^2+b$, which follows from $2a\sqrt{b}\le a^2+b$. Substituting in the original quantities of interest and absorbing the factor of two into the universal constant concludes the proof.
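The algebraic step above can be checked mechanically. The sketch below verifies, over a coarse grid of nonnegative $a$ and $b$, that the largest solution of $x^2 = ax + b$ indeed satisfies $x^2 \le 2a^2 + 2b$; the grid ranges are arbitrary.

```python
import numpy as np

# Grid check of the step: if x^2 <= a x + b with a, b >= 0, then x is at most
# the larger root (a + sqrt(a^2 + 4b))/2, and hence x^2 <= 2 a^2 + 2 b.
for a in np.linspace(0.0, 5.0, 26):
    for b in np.linspace(0.0, 5.0, 26):
        x_max = (a + np.sqrt(a**2 + 4 * b)) / 2  # larger root of x^2 = a x + b
        assert x_max**2 <= a * x_max + b + 1e-9   # the root satisfies x^2 <= ax + b
        assert x_max**2 <= 2 * a**2 + 2 * b + 1e-9
```

Since every $x$ satisfying $x^2 \le ax + b$ is at most the larger root, checking the root suffices.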
Lemma 10 (Variance Bound) Suppose $t\ge c\tau_{\mathrm{ls}}$ for $c$ and $\tau_{\mathrm{ls}}$ defined in Theorem 5. Then there exists an event $\mathcal{E}_{\mathrm{ls}}$ that holds with probability at least $1-\delta$ under which
$$\left\|\hat\Phi\hat\theta-\hat\Phi\bar\theta\right\|^2 \lesssim \frac{d_\theta\sigma^2}{t\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)}\log\frac{1}{\delta}$$
$$\quad+ \frac{\sigma^4\left\|\Sigma^t(K,\sigma_u,x_1)\right\|^2\|P_K\|^3\Psi_{B^\star}^2\left(d_X+d_U+\log\frac{1}{\delta}\right)}{t\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)^2}\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|^2.$$
Proof By the lower bound on $t$, the event $\mathcal{E}_{\mathrm{conc}}$ defined in Lemma 8 holds with probability at least $1-\delta/2$. Denoting $\Sigma \leftarrow \frac{t}{2}\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi$, we may invoke Lemma 9 to show that there exists an event $\mathcal{E}_{\mathrm{var}}$ that holds with probability at least $1-\delta/2$ such that conditioned on $\mathcal{E}_{\mathrm{ls}} = \mathcal{E}_{\mathrm{conc}}\cap\mathcal{E}_{\mathrm{var}}$,
$$\left\|\hat\Phi\hat\theta-\hat\Phi\bar\theta\right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)^2}{\lambda_{\min}(\Sigma)^2}\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|^2.$$
By a union bound, $\mathcal{E}_{\mathrm{ls}}$ holds with probability at least $1-\delta$. Substituting the definition of $\mathrm{Conc}_{\mathrm{cov}}$ from Lemma 9, we may simplify the above bound to find
$$\left\|\hat\Phi\hat\theta-\hat\Phi\bar\theta\right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) + \frac{\sigma^4\left\|\Sigma^t(K,\sigma_u,x_1)\right\|^2\|P_K\|^3\Psi_{B^\star}^2\,t\left(d_X+d_U+\log\frac{4}{\delta}\right)}{\lambda_{\min}(\Sigma)^2}\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|^2,$$
where we have used the fact that
$$t \ge c\tau_{\mathrm{ls}} \ge \sigma^4\|P_K\|^3\Psi_{B^\star}^2\left(d_X+d_U+\log\frac{1}{\delta}\right),$$
$$\Lambda \preceq 2t\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi \preceq 2t\left(1+\frac{\|P_K\|\|x_1\|^2}{t-1}\right)\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi,$$
where the final inequality follows from Lemma 2, point 4. The log term may then be bounded as
$$\log\left(\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}\right) \le d_\theta\log\left(\frac{4\left(1+\frac{\|P_K\|\|x_1\|^2}{t-1}\right)}{\delta^2}\right).$$
By the fact that $t\ge c\tau_{\mathrm{ls}}\ge \|x_1\|^2\|P_K\|+1$, we have $\frac{\|P_K\|\|x_1\|^2}{t-1}\le 1$. Therefore, the above quantity is bounded by $d_\theta\log\frac{8}{\delta^2}\lesssim d_\theta\log\frac{1}{\delta}$.
Substituting the expression for $\Sigma$ tells us that
$$\lambda_{\min}(\Sigma) = \frac{t}{2}\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right).$$
Substituting this into the bound on $\left\|\hat\Phi\hat\theta-\hat\Phi\bar\theta\right\|^2$ along with the upper bound on $\log\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}$ completes the proof.
Proof We first left multiply the quantity in the norm by the identity, expressed as the matrix $\hat\Phi\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top$ left multiplied by its inverse:
$$\left\|\Phi^\star\theta^\star-\hat\Phi\bar\theta\right\|^2 = \left\|\left(\hat\Phi\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\hat\Phi^\top+\hat\Phi_\perp\hat\Phi_\perp^\top\right)^{-1}\left(\hat\Phi\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\hat\Phi^\top+\hat\Phi_\perp\hat\Phi_\perp^\top\right)\left(\Phi^\star\theta^\star-\hat\Phi\bar\theta\right)\right\|^2,$$
by the fact that the columns of $\hat\Phi$ are orthonormal. We now substitute the optimal projection
$$\bar\theta = \left(\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)^{-1}\hat\Phi^\top\left(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\Phi^\star\theta^\star$$
$$\le \frac{\left\|\Sigma^t(K,\sigma_u,x_1)\right\|}{\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)},$$
where the final inequality follows from submultiplicativity, the fact that $\hat\Phi$ has orthonormal columns, and the fact that $\Sigma^t(K,\sigma_u,x_1)\preceq\bar\Sigma^t(K,\sigma_u,x_1)$.
The lemma follows by substituting the above inequality along with (23) into (22). The result is then substituted into (21), where we recall the definition in Theorem 1.
$$\mathcal{J}(\hat K) - \mathcal{J}(K^\star) \le 142\|P^\star\|^8\left\|\begin{bmatrix}\hat A & \hat B\end{bmatrix}-\begin{bmatrix}A^\star & B^\star\end{bmatrix}\right\|_F^2.$$
Proof The vectors $w_t$ and $g_t$ have $\sigma^2$-sub-Gaussian components. Therefore, a sub-Gaussian tail bound tells us that for any unit vector $v\in\mathbb{R}^{d_X+d_U}$,
$$\mathbb{P}\left[v^\top\begin{bmatrix}w_t\\g_t\end{bmatrix}\ge\rho\right]\le\exp\left(-\frac{\rho^2}{2\sigma^2}\right).$$
By a covering argument, we therefore conclude
$$\mathbb{P}\left[\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|\ge\rho\right]\le 5^{d_X+d_U}\exp\left(-\frac{\rho^2}{8\sigma^2}\right).$$
Union bounding over the $T$ timesteps yields
$$\mathbb{P}\left[\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|\ge\rho\right]\le T\,5^{d_X+d_U}\exp\left(-\frac{\rho^2}{8\sigma^2}\right).$$
Setting $\rho=4\sigma\sqrt{(d_X+d_U)\log\frac{T}{\delta}}$, the above probability becomes
$$T\,5^{d_X+d_U}\exp\left(-2(d_X+d_U)\log\frac{T}{\delta}\right)\le T\,5^{d_X+d_U}\exp\left(-2(d_X+d_U)-\log\frac{T}{\delta}\right)$$
$$\le \delta\,5^{d_X+d_U}\exp(-2)^{d_X+d_U}\le\delta.$$
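An empirical illustration of the resulting high-probability bound (with hypothetical dimensions and standard Gaussian noise, so $\sigma = 1$) is straightforward:

```python
import numpy as np

# Empirical check of the bound on max_t ||[w_t; g_t]|| with standard Gaussian
# entries. Dimensions, horizon, and delta are illustrative.
rng = np.random.default_rng(2)
dX, dU, T, delta, sigma = 3, 2, 100, 0.05, 1.0
rho = 4 * sigma * np.sqrt((dX + dU) * np.log(T / delta))

trials, exceed = 500, 0
for _ in range(trials):
    v = rng.normal(size=(T, dX + dU))  # rows are the stacked vectors [w_t; g_t]
    exceed += np.linalg.norm(v, axis=1).max() >= rho

# The exceedance frequency should be at most delta.
assert exceed / trials <= delta
```

As with most union-bound arguments, the threshold $\rho$ is quite loose in practice, so the empirical exceedance frequency is typically zero.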
Lemma 14 Consider rolling out the system $x_{s+1}=A^\star x_s+B^\star u_s+w_s$ from initial state $x_1$ for $t$ timesteps under the control action $u_s=Kx_s+\sigma_u g_s$, where $K$ is stabilizing and $\sigma_u\le 1$. Then the following bound holds:
$$\|x_t\|\le\sqrt{\left(1-\frac{1}{\|P_K\|}\right)^{t}\|P_K\|}\,\|x_1\|+2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Proof The states may be expressed as $x_t=A_K^{t-1}x_1+\sum_{k=0}^{t-1}A_K^{t-1-k}\begin{bmatrix}I&\sigma_uB^\star\end{bmatrix}\begin{bmatrix}w_k\\g_k\end{bmatrix}$. Therefore, by Lemma 6,
$$\|x_t\|\le\left\|A_K^{t-1}\right\|\|x_1\|+\sum_{k=0}^{t-1}\left\|A_K^{t-1-k}\right\|\left\|\begin{bmatrix}I&\sigma_uB^\star\end{bmatrix}\right\|\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|$$
$$\le\sqrt{\left(1-\frac{1}{\|P_K\|}\right)^{t-1}\|P_K\|}\,\|x_1\|+2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
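A scalar analogue of this decomposition can be simulated directly. The sketch below uses an elementary geometric-series bound, $|x_t|\le|a_K|^{t-1}|x_1|+\frac{1+\sigma_u|b|}{1-|a_K|}\max_t\max(|w_t|,|g_t|)$, as a simplified stand-in for the $\|P_K\|$-based constants of the lemma; all system parameters are hypothetical.

```python
import numpy as np

# Scalar analogue of the rollout: x_{t+1} = a_K x_t + w_t + sigma_u * b * g_t
# with |a_K| < 1. We check the elementary geometric bound at every step.
rng = np.random.default_rng(3)
a_K, b, sigma_u, T, x1 = 0.8, 1.0, 0.5, 200, 5.0
w = rng.normal(size=T)
g = rng.normal(size=T)
noise_max = max(np.abs(w).max(), np.abs(g).max())

x = x1
for t in range(1, T + 1):  # x holds x_t at the top of each iteration
    bound = (abs(a_K) ** (t - 1) * abs(x1)
             + (1 + sigma_u * abs(b)) / (1 - abs(a_K)) * noise_max)
    assert abs(x) <= bound + 1e-9
    x = a_K * x + w[t - 1] + sigma_u * b * g[t - 1]
```

The initial-condition term decays geometrically while the noise term saturates at a constant, mirroring the two terms of the lemma.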
Lemma 15 Consider rolling out the system $x_{s+1}=A^\star x_s+B^\star u_s+w_s$ from initial state $x_1$ for $t$ timesteps under the control action $u_s=Kx_s+\sigma_ug_s$, where $K$ is stabilizing and $\sigma_u\le1$. Suppose
• $\|x_1\|\le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|$
• $\|P_K\|\le 2\|P_{K_0}\|$
• $t\ge\frac{\log\left(4\|P_K\|\right)}{\log\left(\frac{1}{1-1/\|P_K\|}\right)}+1$.
Then for $s=1,\dots,t$,
$$\|x_s\|\le 40\|P_{K_0}\|^2\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Furthermore,
$$\|x_t\|\le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Proof By Lemma 14,
$$\|x_s\|\le\|P_K\|^{1/2}\|x_1\|+2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|\le 2\|P_{K_0}\|^{1/2}\|x_1\|+8\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Denote the event that the above inequality holds by $\mathcal{E}_{\mathrm{w\,bound}}$. We show by induction that the requisite conditions of $\mathcal{E}_{\mathrm{success},1}$ hold for all epochs. A key step in this argument is instantiating the result of Theorem 5 for every epoch. The rate of decay in this bound is determined by the minimum eigenvalue of the state and input covariance, $\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)$. We control this quantity by invoking point 3 of Lemma 2 along with the facts that 1) $\hat\Phi$ has orthonormal columns and 2) $\sigma_u=\sigma_k\le1$ for all epochs due to the lower bound on $\tau_1$. In particular, we may show that
$$\lambda_{\min}\left(\hat\Phi^\top\left(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\right)\hat\Phi\right)\ge\frac{\sigma_u^2}{2\left(2+2\|K\|^2\right)}\ge\frac{\sigma_u^2}{8\|P_K\|},\quad (24)$$
where the final inequality follows by noting that $2+2\|K\|^2\le2+2\|P_K\|\le4\|P_K\|$ by the fact that $\|P_K\|\ge1$.
Base Case: For the first epoch, observe that our lower bound on $\tau_1$ ensures that $\tau_1\ge c\tau_{\mathrm{ls}}\left(K_0,0,\tfrac{1}{2}T^{-3}\right)$ for a sufficiently large constant $c$. Therefore, applying Theorem 5 along with (24) implies that there exists an event $\mathcal{E}_{\mathrm{ls},1}$ which holds with probability at least $1-\frac{1}{2}T^{-3}$ such that under $\mathcal{E}_{\mathrm{ls},1}$,
$$\left\|\begin{bmatrix}\hat A_1&\hat B_1\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\lesssim\frac{d_\theta\sigma^2\|P_{K_0}\|}{\tau_1\sigma_1^2}\log(T)$$
$$\quad+\frac{\sigma^4\|P_{K_0}\|^9\Psi_{B^\star}^6\left(d_X+d_U+\log(T)\right)}{\tau_1\sigma_1^4}\left(1+\frac{\|P_{K_0}\|^3\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\sigma_1^2}\right).$$
Using the fact that $\sigma_1^2\ge\frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_1}}$ to simplify the first two appearances of $\sigma_1$, and $\sigma_1^2\ge d(\hat\Phi,\Phi^\star)$ for the final appearance, the above bound simplifies to
$$\left\|\begin{bmatrix}\hat A_1&\hat B_1\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\lesssim\frac{\sqrt{d_\theta d_U}\,\sigma^2\|P_{K_0}\|}{\sqrt{\tau_1}}\log(T)+\sqrt{\frac{d_\theta}{d_U}}\,\sigma^4\|P_{K_0}\|^{12}\Psi_{B^\star}^8\left(d_X+d_U+\log(T)\right)d(\hat\Phi,\Phi^\star)\|\theta^\star\|^2.$$
By recalling the definition of $\beta_1$ from Assumption 2, the above bound can be simplified to
$$\left\|\begin{bmatrix}\hat A_1&\hat B_1\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},1}\sigma^2\frac{\sqrt{d_\theta d_U}\|P_{K_0}\|}{\sqrt{\tau_1}}\log(T)+\beta_1\log(T)\,d(\hat\Phi,\Phi^\star).$$
Meanwhile, Lemma 15 tells us that under the event $\mathcal{E}_{\mathrm{w\,bound}}$, $\|x_t\|^2\le x_b^2\log T$ for $t=1,\dots,\tau_1$, and that $\|x_{\tau_1}\|\le16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|$. Finally, we have $\|K_0\|\le K_b$.
Induction step: Suppose
$$\|x_{\tau_k}\|\le16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|$$
and
$$\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},1}\sigma^2\frac{\sqrt{d_\theta d_U}\|P_{K_0}\|}{\sqrt{\tau_1 2^{k-1}}}\log(T)+\beta_1\log(T)\,d(\hat\Phi,\Phi^\star).$$
By the assumptions that $\tau_1\ge c\left(\frac{\sigma^2\sqrt{d_\theta d_U}\|P_{K_0}\|}{\varepsilon}\right)^2$ for a sufficiently large constant $c$ and that $d(\hat\Phi,\Phi^\star)\le\frac{\varepsilon}{2\beta_1\log(T)}$, we have that $\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le\varepsilon$. Then the conditions of Lemma 12 are satisfied, and consequently, $\left\|P_{\hat K_{k+1}}\right\|\le1.05\|P^\star\|$. Therefore
$$\tau_{\mathrm{ls}}\left(\hat K_{k+1},x_b^2\log T,\tfrac{1}{2}T^{-3}\right)\le2\tau_{\mathrm{ls}}\left(K^\star,x_b^2\log T,\tfrac{1}{2}T^{-3}\right)\quad\text{and}\quad\left\|P_{\hat K_{k+1}}\right\|\le1.05\|P_{K_0}\|\le2\|P_{K_0}\|.$$
We may use our lower bound on $\tau_1$ to conclude that $\tau_1\ge c\tau_{\mathrm{ls}}\left(K^\star,x_b^2\log T,\tfrac{1}{2}T^{-3}\right)$ for a sufficiently large universal constant $c$. This guarantees that the conditions of Theorem 5 are satisfied. As a result, there exists an event $\mathcal{E}_{\mathrm{ls},k+1}$ that holds with probability at least $1-\frac{1}{2}T^{-3}$ under which
$$\left\|\begin{bmatrix}\hat A_{k+1}&\hat B_{k+1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\lesssim\frac{d_\theta\sigma^2\left\|P_{\hat K_{k+1}}\right\|}{(\tau_{k+1}-\tau_k)\sigma_{k+1}^2}\log(T)$$
$$\quad+\frac{\sigma^4\left\|P_{\hat K_{k+1}}\right\|^9\Psi_{B^\star}^6\left(d_X+d_U+\log(T)\right)}{(\tau_{k+1}-\tau_k)\sigma_{k+1}^4}\left(1+\frac{\left\|P_{\hat K_{k+1}}\right\|^3\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\sigma_{k+1}^2}\right),$$
where we have again used (24) to simplify the bound. Observing that $\tau_{k+1}-\tau_k=\frac{1}{2}\tau_{k+1}$, and using the lower bounds $\sigma_{k+1}^2\ge\frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_12^k}}=\frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_{k+1}/2}}$ and $\sigma_{k+1}^2\ge\gamma\ge d(\hat\Phi,\Phi^\star)$, the above bound simplifies to
$$\left\|\begin{bmatrix}\hat A_{k+1}&\hat B_{k+1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\lesssim\frac{\sqrt{d_\theta d_U}\,\sigma^2\left\|P_{\hat K_{k+1}}\right\|}{\sqrt{\tau_{k+1}}}\log(T)+\sqrt{d_\theta/d_U}\,\sigma^4\left\|P_{\hat K_{k+1}}\right\|^{12}\Psi_{B^\star}^8\left(d_X+d_U+\log(T)\right)d(\hat\Phi,\Phi^\star)\|\theta^\star\|^2.$$
Using the fact that $\left\|P_{\hat K_{k+1}}\right\|\le1.05\|P^\star\|\le1.05\|P_{K_0}\|$, the above bound may be simplified to
$$\left\|\begin{bmatrix}\hat A_{k+1}&\hat B_{k+1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},1}\sigma^2\frac{\sqrt{d_\theta d_U}\|P_{K_0}\|}{\sqrt{\tau_{k+1}}}\log(T)+\beta_1d(\hat\Phi,\Phi^\star)\log(T),$$
where the powers of 1.05 are absorbed into the universal constants, and $\beta_1$ is as defined in Assumption 2.
Furthermore, Lemma 15 (which applies due to the inductive hypothesis, the lower bound on $\tau_1$, and the fact that $\left\|P_{\hat K_{k+1}}\right\|\le2\|P_{K_0}\|$) informs us that under the event $\mathcal{E}_{\mathrm{w\,bound}}$,
$$\|x_t\|^2\le x_b^2\log T\quad\forall t=\tau_k+1,\dots,\tau_{k+1},\qquad\text{and}\qquad\left\|x_{\tau_{k+1}}\right\|\le16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Also observe that $\left\|\hat K_{k+1}\right\|^2\le\left\|P_{\hat K_{k+1}}\right\|\le2\|P_{K_0}\|$, which ensures that $\left\|\hat K_{k+1}\right\|\le2\|K^\star\|\le K_b$. In particular, we have verified that the algorithm does not abort on this epoch, and we have verified that the inductive hypothesis holds to start the next epoch.
Union bounding over the events: The bounds on the norm of the state and the controller gain ensure that the algorithm does not abort. The estimation error bounds ensure that
$$\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},1}\sigma^2\frac{\sqrt{d_\theta d_U}\|P_{K_0}\|}{\sqrt{\tau_1}}\log(T)+\beta_1\log(T)\,d(\hat\Phi,\Phi^\star)\quad\forall k\ge1.$$
Therefore, $\mathcal{E}_{\mathrm{success},1}\subseteq\mathcal{E}_{\mathrm{w\,bound}}\cap\mathcal{E}_{\mathrm{ls},1}\cap\dots\cap\mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$. Union bounding over the good events $\mathcal{E}_{\mathrm{w\,bound}},\mathcal{E}_{\mathrm{ls},1},\dots,\mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$, we find that $\mathcal{E}_{\mathrm{success},1}$ holds with probability at least $1-T^{-2}$.
Lemma 17 Running Algorithm 1 with the arguments defined in Theorem 3, the event $\mathcal{E}_{\mathrm{success},2}$ holds with probability at least $1-T^{-2}$.
Observing that $\tau_{k+1}-\tau_k=\frac{1}{2}\tau_{k+1}$, and again using the fact that $\hat K_{k+1}=K^\star(\hat A_k,\hat B_k)$ where $\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le\varepsilon$, we have that $\left\|P_{\hat K_{k+1}}\right\|\le1.05\|P^\star\|$. Therefore, by absorbing powers of 1.05 into the universal constants, and following the same simplifications taken in the base case, the estimation error bound may be simplified to
$$\left\|\begin{bmatrix}\hat A_{k+1}&\hat B_{k+1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},2}\frac{\sigma^2d_\theta\log T}{\tau_{k+1}\alpha^2}+\beta_2d(\hat\Phi,\Phi^\star)^2.$$
Furthermore, Lemma 15 (which applies due to the inductive hypothesis, the lower bound on $\tau_1$, and the fact that $\left\|P_{\hat K_{k+1}}\right\|\le2\|P_{K_0}\|$) informs us that under the event $\mathcal{E}_{\mathrm{w\,bound}}$,
$$\|x_t\|^2\le x_b^2\log T\quad\forall t=\tau_k+1,\dots,\tau_{k+1},\qquad\text{and}\qquad\left\|x_{\tau_{k+1}}\right\|\le16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|.$$
Also observe that $\left\|\hat K_{k+1}\right\|^2\le\left\|P_{\hat K_{k+1}}\right\|\le2\|P_{K_0}\|$, which ensures that $\left\|\hat K_{k+1}\right\|\le2\|K^\star\|\le K_b$. In particular, we have verified that the algorithm does not abort on this epoch, and we have verified that the inductive hypothesis holds to start the next epoch.
Union bounding over the events: The bounds on the norm of the state and the controller gain ensure that the algorithm does not abort. The estimation error bounds ensure that
$$\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},2}\frac{\sigma^2d_\theta\log T}{\tau_{k+1}\alpha^2}+\beta_2d(\hat\Phi,\Phi^\star)^2\quad\forall k\ge1.$$
Therefore, $\mathcal{E}_{\mathrm{success},2}\subseteq\mathcal{E}_{\mathrm{w\,bound}}\cap\mathcal{E}_{\mathrm{ls},1}\cap\dots\cap\mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$. Union bounding over the good events $\mathcal{E}_{\mathrm{w\,bound}},\mathcal{E}_{\mathrm{ls},1},\dots,\mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$, we find that $\mathcal{E}_{\mathrm{success},2}$ holds with probability at least $1-T^{-2}$.
$$\sum_{s=1}^{t}\mathbb{E}\left[\|x_s\|_Q^2+\|u_s\|_R^2\right]\le tJ(K)+t\sigma_u^2\left(1+d_U\Psi_{B^\star}^2\|P_K\|\right)+\|x_1\|^2\|P_K\|.$$
Proof This result follows from Lemma 18 by noting that $x_1=0$, and therefore
$$\mathbb{E}\left[\sum_{t=1}^{\tau_1}\|x_t\|_Q^2+\|u_t\|_R^2\right]\le\tau_1J(K_0)+2\tau_1d_U\|P_{K_0}\|\sigma_1^2\Psi_{B^\star}^2\le3\tau_1\max\{d_X,d_U\}\|P_{K_0}\|\Psi_{B^\star}^2.$$
Lemma 20 Let $\mathcal{E}_{\mathrm{success}}\leftarrow\mathcal{E}_{\mathrm{success},1}$ or $\mathcal{E}_{\mathrm{success}}\leftarrow\mathcal{E}_{\mathrm{success},2}$ in the settings of Theorem 2 and Theorem 3, respectively. Then in either setting,
$$R_2=\mathbb{E}\left[\mathbf{1}\left(\mathcal{E}_{\mathrm{success}}^c\right)\sum_{t=\tau_1+1}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]$$
$$\le T^{-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T+T^{-1}J(K_0)+24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2T^{-2}\log3T$$
$$\quad+2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T+\sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k-\tau_{k-1})d_U\sigma_k^2.$$
In the setting of Theorem 2, this reduces to
$$R_2\le T^{-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T+T^{-1}J(K_0)+24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2T^{-2}\log3T$$
$$\quad+2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T+6\sqrt{Td_\theta d_U}+2d_U\gamma T.$$
In the setting of Theorem 3, this reduces to
$$R_2\le T^{-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T+T^{-1}J(K_0)+24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2T^{-2}\log3T+2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T.$$
Proof Let $t_{\mathrm{abort}}$ be the first time instance where $\|x_t\|^2\ge x_b^2\log T$ or $\left\|\hat K_k\right\|\ge K_b$.
$$R_2=\mathbb{E}\left[\mathbf{1}\left(\mathcal{E}_{\mathrm{success}}^c\right)\sum_{t=\tau_1+1}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]\le\mathbb{E}\left[\mathbf{1}\left(\mathcal{E}_{\mathrm{success}}^c\right)\sum_{t=1}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]$$
$$\le\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\sum_{t=1}^{t_{\mathrm{abort}}-1}\|x_t\|_Q^2+\|u_t\|_R^2\right]+\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right].$$
By the fact that $u_t=\hat K_kx_t+\sigma_kg_t$ for $g_t\sim\mathcal{N}(0,I)$, we have that $\|u_t\|^2\le2\left\|\hat K_kx_t\right\|^2+2\left\|\sigma_kg_t\right\|^2$. By the definition of $t_{\mathrm{abort}}$ and the fact that $R=I$, for $t<t_{\mathrm{abort}}$,
$$\|x_t\|_Q^2+2\left\|\hat K_kx_t\right\|_R^2\le\left(\|Q\|+2K_b^2\right)x_b^2\log T.$$
Therefore
$$\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\sum_{t=1}^{t_{\mathrm{abort}}-1}\|x_t\|_Q^2+\|u_t\|_R^2\right]$$
$$\le\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\left(\sum_{t=1}^{t_{\mathrm{abort}}-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T+\sum_{k=1}^{k_{\mathrm{fin}}}\sum_{t=\tau_{k-1}+1}^{\tau_k}2\sigma_k^2\|g_t\|^2\right)\right]$$
$$\le\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\sum_{t=1}^{t_{\mathrm{abort}}-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T\right]+\mathbb{E}\left[\sum_{k=1}^{k_{\mathrm{fin}}}\sum_{t=\tau_{k-1}+1}^{\tau_k}2\sigma_k^2\|g_t\|^2\right]$$
$$\le\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\right]T\left(\|Q\|+2K_b^2\right)x_b^2\log T+\sum_{k=1}^{k_{\mathrm{fin}}}\sum_{t=\tau_{k-1}+1}^{\tau_k}2d_U\sigma_k^2$$
$$=\mathbb{P}\left(t_{\mathrm{abort}}\le T\right)T\left(\|Q\|+2K_b^2\right)x_b^2\log T+\sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k-\tau_{k-1})d_U\sigma_k^2$$
$$\le T^{-1}\left(\|Q\|+2K_b^2\right)x_b^2\log T+\sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k-\tau_{k-1})d_U\sigma_k^2.$$
For the second term, we have from Lemma 18
$$\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]=\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\,\middle|\,t_{\mathrm{abort}},x_{t_{\mathrm{abort}}}\right]\right]$$
$$\le\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\left(TJ(K_0)+\|P_{K_0}\|\left\|x_{t_{\mathrm{abort}}}\right\|^2\right)\right].$$
We have that $x_{t_{\mathrm{abort}}}=(A^\star+B^\star K)x_{t_{\mathrm{abort}}-1}+\begin{bmatrix}I&\sigma_uB^\star\end{bmatrix}\begin{bmatrix}w_{t_{\mathrm{abort}}-1}\\g_{t_{\mathrm{abort}}-1}\end{bmatrix}$ for some $K$ satisfying $\|K\|\le K_b$ and $\sigma_u\le1$. Then we may write
$$\left\|x_{t_{\mathrm{abort}}}\right\|^2\le2\|A^\star+B^\star K\|^2\left\|x_{t_{\mathrm{abort}}-1}\right\|^2+2\Psi_{B^\star}^2\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|^2\le2\|\theta^\star\|_F^2K_b^2x_b^2\log T+2\Psi_{B^\star}^2\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|^2.$$
Then
$$\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]\le\mathbb{P}\left(t_{\mathrm{abort}}\le T\right)\left(TJ(K_0)+2\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T\right)+2\|P_{K_0}\|\Psi_{B^\star}^2\,\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|^2\right].$$
We have that $\mathbb{P}\left(t_{\mathrm{abort}}\le T\right)\left(TJ(K_0)+2\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T\right)\le T^{-1}J(K_0)+2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2K_b^2x_b^2\log T$. Furthermore, by Lemma 21,
$$2\|P_{K_0}\|\Psi_{B^\star}^2\,\mathbb{E}\left[\mathbf{1}\left(t_{\mathrm{abort}}\le T\right)\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\g_t\end{bmatrix}\right\|^2\right]\le8\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2T^{-2}\log9T^3.$$
Lemma 22 Let $\mathcal{E}_{\mathrm{success}}\leftarrow\mathcal{E}_{\mathrm{success},1}$ or $\mathcal{E}_{\mathrm{success}}\leftarrow\mathcal{E}_{\mathrm{success},2}$ in the settings of Theorem 2 and Theorem 3, respectively. Then in either setting,
$$R_1=\mathbb{E}\left[\mathbf{1}\left(\mathcal{E}_{\mathrm{success}}\right)\sum_{t=\tau_1+1}^{T}\|x_t\|_Q^2+\|u_t\|_R^2\right]$$
$$\le\sum_{k=2}^{k_{\mathrm{fin}}}\mathbb{E}\left[\mathbf{1}\left(\mathcal{E}_{k-1}\right)142(\tau_k-\tau_{k-1})\|P^\star\|^8\left\|\begin{bmatrix}\hat A_{k-1}&\hat B_{k-1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\right]$$
$$\quad+(\tau_k-\tau_{k-1})\mathcal{J}(K^\star)+4(\tau_k-\tau_{k-1})d_U\|P_{K^\star}\|\sigma_k^2\Psi_{B^\star}^2+2x_b^2\log T\|P_{K^\star}\|,$$
where
$$\mathcal{E}_k=\left\{\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},1}\frac{\sigma^2\sqrt{d_\theta d_U}\|P_{K_0}\|\log T}{\sqrt{\tau_k}}+\beta_1d(\hat\Phi,\Phi^\star)\log T\right\}\quad (26)$$
for $k=1,\dots,k_{\mathrm{fin}}-1$ in the setting of Theorem 2, and
$$\mathcal{E}_k=\left\{\left\|\begin{bmatrix}\hat A_k&\hat B_k\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2\le C_{\mathrm{est},2}\frac{\sigma^2d_\theta\log T}{\tau_k\alpha^2}+\beta_2d(\hat\Phi,\Phi^\star)^2\right\}\quad (27)$$
for $k=1,\dots,k_{\mathrm{fin}}-1$ in the setting of Theorem 3.
where inequality (i) follows by using the bound from $S_{k-1}$ along with the fact that under the event $\mathcal{E}_{k-1}$, the burn-in time and assumptions on $d(\hat\Phi,\Phi^\star)$ ensure that $\left\|P_{\hat K_k}\right\|\le2\|P^\star\|$. Inequality (ii) follows from the fact that under the burn-in time assumptions along with the event $\mathcal{E}_{k-1}$, the condition of Lemma 12 is satisfied, and $\mathcal{J}(\hat K)-\mathcal{J}(K^\star)\le142\|P^\star\|^8\left\|\begin{bmatrix}\hat A_{k-1}&\hat B_{k-1}\end{bmatrix}-\begin{bmatrix}A^\star&B^\star\end{bmatrix}\right\|_F^2$. Taking the expectation and summing over $k=2,\dots,k_{\mathrm{fin}}$ proves the claim.
We have that $k_{\mathrm{fin}}=\log_2(T/\tau_1)\le\log T$, providing the result in the lemma statement.
Lemma 24 In the setting of Theorem 3,
$$R_1-T\mathcal{J}(K^\star)\le C_{\mathrm{est},2}\left(142\|P^\star\|^8\frac{\sigma^2d_\theta}{\alpha^2}+x_b^2\|P_{K^\star}\|\right)\log^2T+142\|P^\star\|^8\beta_2d(\hat\Phi,\Phi^\star)^2T.$$
Proof By substituting the estimation error bound under the events $\mathcal{E}_k$ from (27), and the exploration noise magnitude $\sigma_k^2=0$, Lemma 22 provides
$$R_1-T\mathcal{J}(K^\star)\le k_{\mathrm{fin}}C_{\mathrm{est},2}\left(142\|P^\star\|^8\frac{\sigma^2d_\theta}{\alpha^2}+x_b^2\|P_{K^\star}\|\right)\log T+142\|P^\star\|^8\beta_2d(\hat\Phi,\Phi^\star)^2T.$$
We have that $k_{\mathrm{fin}}=\log_2(T/\tau_1)\le\log T$, providing the result in the lemma statement.
$$\mathbb{E}[R_T]=R_1+R_2+R_3-T\mathcal{J}(K^\star).$$
The first term may be modified by recalling that $\tau_1=\tau_{\mathrm{warm\,up}}\log^2T$. For the second term, note that by the fact that $T^{-1}\le\tau_1^{-1}$, the following bound holds:
$$T^{-1}\left(\|Q\|+K_b^2x_b^2\log T+J(K_0)+\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2\log T+\|P_{K_0}\|\|\theta^\star\|^2K_b^2x_b^2\log T\right)\lesssim\|Q\|+K_b^2\left(1+\|P_{K_0}\|\|\theta^\star\|^2\right).$$
Combining terms, using the fact that $d(\hat\Phi,\Phi^\star)\le\gamma$, and substituting the definition of $\beta_1$ from Assumption 2 leads to the theorem statement.
$$\mathbb{E}[R_T]=R_1+R_2+R_3-T\mathcal{J}(K^\star),$$
with $R_1$, $R_2$, and $R_3$ as in (6). Combining the bounds on $R_1-T\mathcal{J}(K^\star)$ from Lemma 24, on $R_2$ from Lemma 20, and on $R_3$ from Lemma 19, we find
$$\mathbb{E}[R_T]\le\left(\tau_{\mathrm{warm\,up}}\max\{d_X,d_U\}\|P_{K_0}\|\Psi_{B^\star}^2+\|P^\star\|^8\frac{\sigma^2d_\theta}{\alpha^2}+x_b^2\|P_{K^\star}\|\right)\log^2T$$
$$\quad+T^{-1}\left(J(K_0)+\left(\|Q\|+K_b^2+\|P_{K_0}\|\|\theta^\star\|^2K_b^2\right)x_b^2\log T+\|P_{K_0}\|\Psi_{B^\star}(d_X+d_U)\sigma^2\log T\right)+\|P^\star\|^8\beta_2d(\hat\Phi,\Phi^\star)^2T.$$
Combining terms and substituting the definition of $\beta_2$ from Assumption 4 as well as the definition of $\varepsilon$ in Lemma 12 provides the result in the theorem statement.
Appendix E. Generalizations
E.1. Logarithmic Regret with Known A⋆ or B⋆
Cassel et al. (2020) and Jedra and Proutiere (2022) show that it is possible to attain logarithmic regret if either $A^\star$ is known to the learner and $K^\star(K^\star)^\top\succ0$, or $B^\star$ is known to the learner. These settings cannot immediately be embedded into the formulation of (1), because (1) requires that $\begin{bmatrix}A^\star&B^\star\end{bmatrix}$ is a linear function of the unknown parameters. It can therefore only embed the settings where $A^\star$ and $B^\star$ are known up to scale. However, the model can easily be extended to
$$x_{t+1}=\begin{bmatrix}\bar A&\bar B\end{bmatrix}\begin{bmatrix}x_t\\u_t\end{bmatrix}+\operatorname{vec}^{-1}\left(\Phi^\star\theta^\star\right)\begin{bmatrix}x_t\\u_t\end{bmatrix}+w_t.$$
If the learner has access to $\bar A$, $\bar B$, and an estimate $\hat\Phi$ for $\Phi^\star$, then the results of Theorem 2 and Theorem 3 are unchanged by this generalization. In particular, the least squares algorithm may be modified to subtract out the known dynamics $\bar A$ and $\bar B$. This modified algorithm is shown in Appendix E.1. Propagating this change through the analysis does not change the results.
Using the above generalization, we may recover logarithmic regret in the settings considered by Cassel et al. (2020); Jedra and Proutiere (2022).
1: Input: Model structure estimate $\hat\Phi$, state data $x_{1:t+1}$, input data $u_{1:t}$
2: Return: $\hat\theta$, $\Lambda$, where
$$\hat\theta=\Lambda^\dagger\sum_{s=1}^{t}\hat\Phi^\top\left(\begin{bmatrix}x_s\\u_s\end{bmatrix}\otimes I_{d_X}\right)\left(x_{s+1}-\begin{bmatrix}\bar A&\bar B\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}\right),\qquad\Lambda=\sum_{s=1}^{t}\hat\Phi^\top\left(\begin{bmatrix}x_s\\u_s\end{bmatrix}\begin{bmatrix}x_s\\u_s\end{bmatrix}^\top\otimes I_{d_X}\right)\hat\Phi.$$
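A minimal simulation sketch of this modified least-squares step is given below. It assumes a column-major vec convention (so that $\operatorname{vec}^{-1}(m)z=(z^\top\otimes I_{d_X})m$); the basis $\hat\Phi$, dimensions, input policy, and noise scales are all hypothetical choices for illustration.

```python
import numpy as np

# Sketch of least squares with known dynamics [Abar Bbar] subtracted out.
# Convention: vec() stacks columns, so vec^{-1}(m) z = (z^T kron I_dX) m.
rng = np.random.default_rng(4)
dX, dU, t = 2, 1, 2000
Abar = np.array([[0.3, 0.0], [0.0, 0.2]])
Bbar = np.array([[0.0], [1.0]])

d_theta = 2
Phi = np.zeros((dX * (dX + dU), d_theta))  # hypothetical orthonormal basis
Phi[1, 0] = 1.0   # perturbs A[1, 0] (column-major vec index 1)
Phi[4, 1] = 1.0   # perturbs B[0, 0] (column-major vec index 4)
theta_star = np.array([0.15, -0.1])

Delta = (Phi @ theta_star).reshape((dX, dX + dU), order="F")  # vec^{-1}
A_true = Abar + Delta[:, :dX]
B_true = Bbar + Delta[:, dX:]

x = np.zeros(dX)
S = np.zeros(d_theta)
Lam = np.zeros((d_theta, d_theta))
for _ in range(t):
    u = 0.1 * rng.normal(size=dU)             # pure exploration input
    z = np.concatenate([x, u])
    x_next = A_true @ x + B_true @ u + 0.05 * rng.normal(size=dX)
    Zk = np.kron(z[None, :], np.eye(dX))      # (z^T kron I_dX)
    residual = x_next - np.hstack([Abar, Bbar]) @ z  # known part subtracted
    S += Phi.T @ Zk.T @ residual
    Lam += Phi.T @ Zk.T @ Zk @ Phi
    x = x_next

theta_hat = np.linalg.pinv(Lam) @ S
assert np.linalg.norm(theta_hat - theta_star) < 0.1
```

Because the known component $\begin{bmatrix}\bar A&\bar B\end{bmatrix}z_s$ is removed from the regression target, only the low-dimensional coefficient vector $\theta$ must be estimated, which is exactly what enables the improved rates discussed above.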
with $\Phi^\star$ columns having a single element that is one, and the rest zero, such that $\operatorname{vec}^{-1}(\Phi^\star\theta^\star)=\begin{bmatrix}A^\star&0\end{bmatrix}$. As we assume the learner knows $B^\star$ exactly, and $\Phi^\star$ forms a complete basis for $A^\star$, we may simply choose $\hat\Phi=\Phi^\star$, and have $d(\hat\Phi,\Phi^\star)=0$. To verify that we may achieve logarithmic regret in this setting using Theorem 3, we must simply show that Assumption 3 is satisfied. To do so, note that for all $i=1,\dots,d_\theta$, the $\hat\Phi_i^B$ are zero, and the $\hat\Phi_i^A$ are matrices each with a distinct unity element. Therefore, for any $K$, and any unit vector $v\in\mathbb{R}^{d_\theta}$,
$$\left\|\sum_{i=1}^{d_\theta}v_i\left(\hat\Phi_i^A+\hat\Phi_i^BK\right)\right\|_F^2=\left\|\sum_{i=1}^{d_\theta}v_i\hat\Phi_i^A\right\|_F^2=\sum_{i=1}^{d_\theta}v_i^2=\|v\|^2=1\ge\frac{1}{3\|P^\star\|^{3/2}},$$
where the last inequality follows by the fact that $\|P^\star\|\ge1$. This verifies that Assumption 3 holds in this setting.