
Nonasymptotic Regret Analysis of Adaptive Linear Quadratic Control

with Model Misspecification

Bruce D. Lee¹ brucele@seas.upenn.edu
Anders Rantzer² anders.rantzer@control.lth.se
Nikolai Matni¹ nmatni@seas.upenn.edu

¹ Department of Electrical and Systems Engineering, University of Pennsylvania
² Department of Automatic Control, Lund University

arXiv:2401.00073v1 [eess.SY] 29 Dec 2023

Abstract
The strategy of pre-training a large model on a diverse dataset, then fine-tuning for a particular
application has yielded impressive results in computer vision, natural language processing, and
robotic control. This strategy has vast potential in adaptive control, where it is necessary to rapidly
adapt to changing conditions with limited data. Toward concretely understanding the benefit of pre-
training for adaptive control, we study the adaptive linear quadratic control problem in the setting
where the learner has prior knowledge of a collection of basis matrices for the dynamics. This
basis is misspecified in the sense that it cannot perfectly represent the dynamics of the underlying
data generating process. We propose an algorithm that uses this prior knowledge, and prove upper
bounds on the expected regret after T interactions with the system. In the regime where T is small,
the upper bounds are dominated by a term that scales with either poly(log T) or √T, depending on
the prior knowledge available to the learner. When T is large, the regret is dominated by a term
that grows with δT , where δ quantifies the level of misspecification. This linear term arises due
to the inability to perfectly estimate the underlying dynamics using the misspecified basis, and is
therefore unavoidable unless the basis matrices are also adapted online. However, it only dominates
for large T , after the sublinear terms arising due to the error in estimating the weights for the basis
matrices become negligible. We provide simulations that validate our analysis. Our simulations
also show that offline data from a collection of related systems can be used as part of a pre-training
stage to estimate a misspecified dynamics basis, which is in turn used by our adaptive controller.

1. Introduction
Transfer learning, whereby a model is pre-trained on a large dataset, and then finetuned for a specific
application, has exhibited great success in computer vision (Dosovitskiy et al., 2020) and natural
language processing (Devlin et al., 2018). Efforts to apply these methods to control have shown
exciting preliminary results, particularly in robotics (Dasari et al., 2019). The principle underpin-
ning the success of transfer learning is to use diverse datasets to extract compressed, broadly useful
features, which can be used in conjunction with comparatively simple models for downstream ob-
jectives. These simple models can be finetuned with relatively little data from the downstream task.
However, errors in the pre-training stage may cause this two-step strategy to underperform learn-
ing from scratch when ample task-specific data is available. Such a tradeoff can be acceptable in
settings such as adaptive control, where the learner must rapidly adapt to changes with limited data.
Driven by the potential of pre-training in adaptive control, we study the adaptive linear quadratic
regulator (LQR) in a setting where imperfect prior information about the system is available. The
adaptive linear LQR problem consists of a learner interacting with an unknown linear system
xt+1 = A⋆ xt + B ⋆ ut + wt , (1)
with state xt , input ut , and noise wt assuming values in RdX , RdU , and RdX , respectively. The learner
is evaluated by its ability to minimize its regret, which compares the cost incurred by playing the
learner for T time steps with the cost attained by the optimal LQR controller. Prior work has studied
the adaptive LQR problem in the settings where A⋆ , B ⋆ , or both A⋆ and B ⋆ are fully unknown to
the learner and with either stochastic or bounded adversarial noise processes (Abbasi-Yadkori and
Szepesvári, 2011; Hazan et al., 2020; Cassel et al., 2020). In this work, we analyze a setting where
the learner has a set of basis matrices for the system dynamics; however, [A⋆ B⋆] is not in the
subspace spanned by this basis, leading to misspecification.

1.1. Related Work


Adaptive Control Originating from autopilot development for high-performance aircraft in the
1950s (Gregory, 1959), adaptive control addressed the need for controllers to adapt to varying alti-
tude, speed, and flight configuration (Stein, 1980). Interest in adaptive control theory grew over the
subsequent decades, notably advanced by the landmark paper by Åström and Wittenmark (1973),
who studied self tuning regulators. The history of adaptive control is documented in numerous texts
(Åström and Wittenmark, 2013; Ioannou and Sun, 1996; Narendra and Annaswamy, 2012).
Nonasymptotic Adaptive LQR The non-asymptotic study of the adaptive LQR problem was
pioneered by Abbasi-Yadkori and Szepesvári (2011). Subsequent works (Dean et al., 2018; Cohen
et al., 2019; Mania et al., 2019)
√ developed computationally efficient algorithms with upper bounds
on the regretqscaling with T . Simchowitz and Foster (2020) provide lower bounds which show
that the rate d2U dX T is optimal when the system is entirely unknown. The dependence of the lower
bounds on system theoretic constants is refined by Ziemann and Sandberg (2022), enabling Tsiamis
et al. (2022) to show the regret may depend exponentially on the dX for specific classes of systems.
If either A⋆ or B ⋆ are known, then the regret bounds may be improved to poly(log T ) (Cassel et al.,
2020; Jedra and Proutiere, 2022). Alternative formulations of the adaptive LQR problem consider
bounded non-stochastic disturbances (Hazan et al., 2020; Simchowitz et al., 2020) and minimax
settings (Rantzer, 2021; Cederberg et al., 2022; Renganathan et al., 2023). In contrast to existing
work studying the adaptive LQR problem from a nonasymptotic perspective, we consider bounded
misspecification between a representation estimate for the dynamics and the data generating process.
Multi-task Representation Learning The source of misspecification we consider is inspired by
theoretical work studying multi-task representation learning in the linear regression setting (Du
et al., 2020; Tripuraneni et al., 2020). These papers have been followed by a collection of work
studying the use of multi-task representation learning in the presence of data correlated across time,
which arises in system identification (Modi et al., 2021; Zhang et al., 2023b) and imitation learning
(Zhang et al., 2023a). All of these theoretical works show that by pre-training a shared representa-
tion on a set of source tasks, the sample complexity of learning the target task may be reduced.

1.2. Contributions
We introduce a notion of misspecification between an estimate for the basis of the dynamics and the
data generating process. We then propose an adaptive control algorithm that uses this misspecified
basis, and subsequently analyze the regret of this algorithm. This leads to the following insights:
• Our results generalize the understanding of when logarithmic regret is possible in adaptive LQR
beyond the cases of known A⋆ or B ⋆ studied by Cassel et al. (2020); Jedra and Proutiere (2022).
• When misspecification is present, the regret incurs a term that is linear in T . The coefficient for
this term decays gracefully as the level of misspecification diminishes. As a result, small misspec-
ification means the regret is dominated by sublinear terms in the low data regime of interest for
adaptive control. These terms are favorable compared to those attainable in the absence of prior knowledge.
We validate our theory with numerical experiments, and show the benefit of using a dynamics representation
obtained by pre-training on offline data from related systems for adaptive control.

1.3. Notation
The Euclidean norm of a vector x is denoted ∥x∥. For a matrix A, the spectral norm is denoted ∥A∥,
and the Frobenius norm is denoted ∥A∥F . We use † to denote the Moore-Penrose pseudo-inverse.
The spectral radius of a square matrix is denoted ρ(A). A symmetric, positive semi-definite (psd)
matrix A = A⊤ is denoted A ⪰ 0. Similarly A ⪰ B denotes that A − B is positive semidefinite.
The minimum eigenvalue of a symmetric, positive definite matrix A is denoted λmin(A). For f, g :
D → R, we write f ≲ g if for some c > 0, f(x) ≤ c g(x) for all x ∈ D. We denote the solutions
to the discrete Lyapunov equation by dlyap(A, Q) and the discrete algebraic Riccati equation by
DARE(A, B, Q, R). We use x ∨ y to denote the maximum of x and y.

2. Problem Formulation
2.1. System model
We consider the system (1) where the noise wt has independent identically distributed elements that
are mean zero and σ²-sub-Gaussian for some σ² ∈ R with σ² ≥ 1. We additionally assume that the
noise has identity covariance: E[wt wt⊤] = I.¹ We suppose the dynamics admit the decomposition

[A⋆ B⋆] = vec⁻¹(Φ⋆ θ⋆),  (2)

where Φ⋆ ∈ R^{dX(dX+dU)×dθ} specifies the model structure, and has orthonormal columns. Meanwhile,
θ⋆ ∈ R^{dθ} specifies the parameters. The operator vec⁻¹ maps a vector in R^{dX(dX+dU)} into a
matrix in R^{dX×(dX+dU)} by stacking length-dX blocks of the vector into columns of the matrix,
working top to bottom and left to right. We can write this as a linear combination of basis matrices:
[A⋆ B⋆] = Σ_{i=1}^{dθ} θ⋆_i [Φ_i^{A,⋆} Φ_i^{B,⋆}], where [Φ_i^{A,⋆} Φ_i^{B,⋆}] = vec⁻¹(Φ⋆_i),

and Φ⋆_i is the ith column of Φ⋆. This decomposition of the data generating process is a natural
extension of the low-dimensional linear representations considered in Du et al. (2020) to the setting of
multiple related dynamical systems with shared structure determined by Φ⋆ . A version of this model
for autonomous systems was considered by Modi et al. (2021) for multi-task system identification.
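The decomposition above can be checked numerically. The following numpy sketch (with illustrative dimensions dX = 2, dU = 1, a random orthonormal basis, and dθ = 3 — assumptions of ours, not values from the paper) verifies that the two equivalent forms of (2) agree:

```python
import numpy as np

d_x, d_u = 2, 1
n = d_x * (d_x + d_u)          # ambient dimension d_x(d_x + d_u) = 6

def vec_inv(v, d_x, d_u):
    """Stack length-d_x blocks of v into columns of a d_x x (d_x + d_u) matrix."""
    return v.reshape(d_x, d_x + d_u, order="F")  # column-major = top-to-bottom, left-to-right

rng = np.random.default_rng(0)
d_theta = 3
Phi, _ = np.linalg.qr(rng.standard_normal((n, d_theta)))  # orthonormal columns
theta = rng.standard_normal(d_theta)

AB = vec_inv(Phi @ theta, d_x, d_u)              # [A B] = vec^{-1}(Phi theta)
AB_sum = sum(theta[i] * vec_inv(Phi[:, i], d_x, d_u) for i in range(d_theta))
assert np.allclose(AB, AB_sum)                   # the two forms of the decomposition agree
```
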
We assume that both Φ⋆ and θ⋆ are unknown; however, we have an estimate Φ̂ ∈ R^{dX(dX+dU)×dθ},
also with orthonormal columns, for Φ⋆. Such an estimate may be obtained by performing a pre-
training step on offline data from a collection of systems related to (1) by the shared matrix Φ⋆
1. Noise that enters the process through a non-singular matrix H can be addressed by rescaling the dynamics by H⁻¹.
in (2). Due to the noise present in the offline data, this estimate will be imperfect, resulting in
misspecification.² To quantify the level of misspecification between this estimate and the underlying
data generating process, we use the following subspace distance metric.

Definition 1 ((Stewart and Sun, 1990)) Let Φ⋆⊥ complete the basis of Φ⋆ such that [Φ⋆ Φ⋆⊥] is
an orthogonal matrix. Then the subspace distance between Φ⋆ and Φ̂ is d(Φ⋆, Φ̂) ≜ ‖Φ̂⊤ Φ⋆⊥‖.

Note that the above distance is small when the ranges of the matrices Φ̂ and Φ⋆ are similar. In
particular, the above distance may be small even when ‖Φ̂ − Φ⋆‖ is large.
As long as d(Φ⋆, Φ̂) is sufficiently small and the dimension dθ < dX(dX + dU), the estimate Φ̂
allows the learner to fit a model with less data than would be required to estimate [A⋆ B⋆] from
scratch. This benefit comes at the cost of a bias in the learner's model that grows with d(Φ⋆, Φ̂).
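The distance in Definition 1 can be computed without forming Φ⋆⊥ explicitly, since ‖Φ̂⊤ Φ⋆⊥‖ equals the spectral norm of the projection of Φ̂ onto the orthogonal complement of range(Φ⋆). A minimal numpy sketch, using toy subspaces of our own choosing:

```python
import numpy as np

def subspace_distance(Phi_star, Phi_hat):
    """d(Phi_star, Phi_hat) = ||Phi_hat^T Phi_star_perp||, the sine of the largest
    principal angle between the two column spaces."""
    # Project Phi_hat onto the orthogonal complement of range(Phi_star); its spectral
    # norm equals ||Phi_hat^T Phi_star_perp|| because Phi_star_perp is orthonormal.
    P = Phi_star @ Phi_star.T
    return np.linalg.norm((np.eye(P.shape[0]) - P) @ Phi_hat, ord=2)

e = np.eye(4)
Phi_star = e[:, :2]
assert np.isclose(subspace_distance(Phi_star, Phi_star), 0.0)   # identical subspaces
assert np.isclose(subspace_distance(Phi_star, e[:, 2:]), 1.0)   # orthogonal subspaces
# A rotated basis with the same range also has distance 0, even though Phi_hat != Phi_star.
Q = np.array([[np.cos(0.3), -np.sin(0.3)], [np.sin(0.3), np.cos(0.3)]])
assert np.isclose(subspace_distance(Phi_star, Phi_star @ Q), 0.0)
```

The last assertion illustrates the remark above: the distance can be zero even when the matrices themselves differ substantially.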

2.2. Learning objective


The goal of the learner is to interact with system (1) while keeping the cumulative cost small, where
the cumulative cost is defined for matrices Q ⪰ I and R = I as³

CT ≜ Σ_{t=1}^{T} ct, where ct ≜ xt⊤ Q xt + ut⊤ R ut.  (3)

To define an algorithm that keeps the cost small, we first introduce the infinite horizon LQR cost:

J(K) ≜ lim sup_{T→∞} (1/T) E^K[CT],  (4)

where the superscript K on the expectation denotes that the cost is evaluated under the state
feedback controller ut = Kxt . To ensure that there exists a controller such that (4) has finite cost,
we assume (A⋆ , B ⋆ ) is stabilizable. Under this assumption, (4) is minimized by the LQR controller
K∞ (A⋆ , B ⋆ ), where K∞ (A, B) ≜ −(B ⊤ P∞ (A, B)B + R)−1 B ⊤ P∞ (A, B)A, and P∞ (A, B) ≜
DARE(A, B, Q, R). We define the shorthands P ⋆ ≜ P∞ (A⋆ , B ⋆ ) and K ⋆ ≜ K∞ (A⋆ , B ⋆ ). To
characterize the infinite horizon LQR cost of an arbitrary stabilizing controller K, we additionally
define the solution PK to the Lyapunov equation for the closed loop system under an arbitrary K
such that ρ(A⋆ +B ⋆ K) < 1: PK ≜ dlyap(A⋆ +B ⋆ K, Q+K ⊤ RK). For a controller K satisfying
ρ(A⋆ + B ⋆ K) < 1, J (K) = tr(PK ). We have that PK ⋆ = P ⋆ .
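These quantities map directly onto standard solvers. A minimal sketch using scipy (the 2-state system is an illustrative assumption of ours, not the paper's example):

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def lqr(A, B, Q, R):
    """P_inf = DARE(A, B, Q, R) and K_inf = -(B^T P B + R)^{-1} B^T P A."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return P, K

def cost_of(K, A, B, Q, R):
    """J(K) = tr(P_K), with P_K = dlyap(A + B K, Q + K^T R K)."""
    Acl = A + B @ K
    # scipy solves X = a X a^H + q; passing Acl^T matches P = Acl^T P Acl + Q + K^T R K.
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return np.trace(PK)

A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [0.2]])
Q, R = np.eye(2), np.eye(1)
P_star, K_star = lqr(A, B, Q, R)
# Consistency check of P_{K*} = P*: the optimal controller attains J(K*) = tr(P*).
assert np.isclose(cost_of(K_star, A, B, Q, R), np.trace(P_star))
```
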
The infinite horizon LQR controller provides a baseline level of performance that our learner
cannot surpass in the limit as T → ∞. Borrowing the notion of regret from online learning, as in
Abbasi-Yadkori and Szepesvári (2011), we quantify the performance of our learning algorithm by
comparing the cumulative cost CT to the scaled infinite horizon cost attained by the LQR controller
if the system matrices [A⋆ B⋆] were known:

RT ≜ CT − T J (K ⋆ ). (5)

2. Sample complexity bounds for learning Φ̂ are provided by Zhang et al. (2023b), so this step is not studied here.
3. Generalizing to arbitrary Q ≻ 0 and R ≻ 0 can be performed by scaling the cost and changing the input basis.
Algorithm 1 Certainty Equivalent Control with Continual Exploration
1: Input: Stabilizing controller K0, initial epoch length τ1, number of epochs kfin, exploration
   sequence σ1², σ2², σ3², . . . , σ²_{kfin}, state bound xb, controller bound Kb
2: Initialize: K̂1 ← K0, τ0 ← 0, T ← τ1 2^{kfin−1}.
3: for k = 1, 2, . . . , kfin do
4:   for t = τ_{k−1} + 1, . . . , τk do
5:     if ‖xt‖² ≥ xb² log T or ‖K̂k‖ ≥ Kb then
6:       Abort and play K0 forever
7:     Play ut = K̂k xt + σk gt, where gt ∼ N(0, I)
8:   θ̂k, Λ ← LS(Φ̂, x_{τ_{k−1}+1 : τk+1}, u_{τ_{k−1}+1 : τk})  ▷ Algorithm 2
9:   [Âk B̂k] ← vec⁻¹(Φ̂ θ̂k)
10:  K̂_{k+1} ← K∞(Âk, B̂k)
11:  τ_{k+1} ← 2 τk
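A simplified, self-contained sketch of the doubling-epoch loop (with a toy 2-state system of our own choosing, the trivial identity basis for Φ̂, and no abort logic, so this is a sketch rather than a faithful implementation of the full algorithm):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
d_x, d_u = 2, 1
A = np.array([[0.9, 0.2], [0.0, 0.8]])     # open-loop stable, so K0 = 0 stabilizes
B = np.array([[0.0], [0.5]])
Phi_hat = np.eye(d_x * (d_x + d_u))        # trivial (well-specified) basis for illustration

def k_inf(Ah, Bh):
    P = solve_discrete_are(Ah, Bh, np.eye(d_x), np.eye(d_u))
    return -np.linalg.solve(Bh.T @ P @ Bh + np.eye(d_u), Bh.T @ P @ Ah)

def ls(Phi, xs, us):
    # Stack rows M_s = (z_s^T kron I) Phi and targets x_{s+1}; solve in least squares.
    M, y = [], []
    for s in range(len(us)):
        z = np.concatenate([xs[s], us[s]])
        M.append(np.kron(z[None, :], np.eye(d_x)) @ Phi)
        y.append(xs[s + 1])
    theta, *_ = np.linalg.lstsq(np.vstack(M), np.concatenate(y), rcond=None)
    return theta

K, x = np.zeros((d_u, d_x)), np.zeros(d_x)         # start from K0 = 0
tau, k_fin = 50, 4
xs, us = [x], []
for k in range(k_fin):                             # doubling epochs
    sigma = 0.5 / np.sqrt(2.0 ** k)                # decaying exploration noise
    for _ in range(tau * 2 ** k):
        u = K @ x + sigma * rng.standard_normal(d_u)
        x = A @ x + B @ u + 0.05 * rng.standard_normal(d_x)
        us.append(u); xs.append(x)
    theta = ls(Phi_hat, xs, us)                    # identify parameters from this history
    AB = (Phi_hat @ theta).reshape(d_x, d_x + d_u, order="F")
    K = k_inf(AB[:, :d_x], AB[:, d_x:])            # certainty equivalent controller

assert max(abs(np.linalg.eigvals(A + B @ K))) < 1.0   # final CE controller stabilizes
```
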

In light of the above reformulation, the goal of the learner is to interact with the system (1) to
maximize the information about the relevant parameters for control while simultaneously regulating
the system to minimize RT . A reasonable strategy to do so is for the learner to use its history of
interaction with the system to construct a model for the dynamics, e.g. by determining estimates Â
and B̂. It may then use these estimates as part of a certainty equivalent (CE) design by synthesizing
controllers K̂ = K∞ (Â, B̂). Prior work has shown that if the model estimate is sufficiently close
to the true dynamics, then the cost of playing the controller K̂ exceeds the cost of playing K ⋆ by a
quantity that is quadratic in the estimation error (Mania et al., 2019; Simchowitz and Foster, 2020).
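This quadratic dependence can be observed numerically: halving a (synthetic) model error should shrink the excess cost J(K̂) − J(K⋆) by roughly a factor of four. A sketch with an illustrative system and a perturbation direction of our own choosing:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [0.2]])
Q, R = np.eye(2), np.eye(1)

def k_inf(Ah, Bh):
    P = solve_discrete_are(Ah, Bh, Q, R)
    return -np.linalg.solve(Bh.T @ P @ Bh + R, Bh.T @ P @ Ah)

def j(K):
    """Cost of playing K on the TRUE system (A, B)."""
    Acl = A + B @ K
    return np.trace(solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K))

J_star = j(k_inf(A, B))
dA = np.array([[0.0, 1.0], [1.0, 0.0]])   # fixed, illustrative perturbation direction

def gap(h):
    """Excess cost of the CE controller designed for the misestimated A + h*dA."""
    return j(k_inf(A + h * dA, B)) - J_star

# Halving the model error shrinks the excess cost by roughly 4x (quadratic scaling).
ratio = gap(0.05) / gap(0.025)
assert 2.5 < ratio < 6.0
```
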
Lemma 1 (Theorem 3 of Simchowitz and Foster (2020)) Define ε ≜ 1/(2916 ‖P⋆‖¹⁰). If
‖[Â B̂] − [A⋆ B⋆]‖²_F ≤ ε, then J(K̂) − J(K⋆) ≤ 142 ‖P⋆‖⁸ ‖[Â B̂] − [A⋆ B⋆]‖²_F.

2.3. Algorithm description


Our proposed algorithm, Algorithm 1, is a CE algorithm akin to that proposed in Cassel et al.
(2020). The algorithm takes a stabilizing controller K0 as an input.4 Starting from this controller,
Algorithm 1 follows a doubling epochs strategy. At the end of each epoch, it uses the data collected
during the epoch along with the estimate Φ̂ to obtain an estimate θ̂ by solving a least squares
problem (as detailed in Algorithm 2). The estimated parameters θ̂ are combined with the estimate
Φ̂ to obtain the dynamics estimate [Â B̂]. This estimate is used to synthesize a CE controller
K̂ = K∞ (Â, B̂). In the next epoch, the learner plays the resultant controller with exploratory noise
added. Before playing each input, the algorithm checks whether the state or the controller exceeds
bounds determined by algorithm inputs xb and Kb . If they do, it aborts the certainty equivalent
scheme, and plays the initial stabilizing controller for all time. Doing so enables bounding of the
regret during unlikely events where the CE controller fails. The key difference from the CE algo-
rithms proposed in prior work is that the system identification step solves a least squares problem
defined in terms of the estimate Φ̂ to estimate the unknown parameters.
4. While providing a stabilizing controller is common in prior work, not all analyses require it (Lale et al., 2022).
Algorithm 2 Least squares: LS(Φ̂, x_{1:t+1}, u_{1:t})
1: Input: Model structure estimate Φ̂, state data x_{1:t+1}, input data u_{1:t}
2: Return: θ̂, Λ, where

θ̂ = Λ† Σ_{s=1}^{t} Φ̂⊤ ([xs; us] ⊗ I_{dX}) x_{s+1}, and Λ = Σ_{s=1}^{t} Φ̂⊤ ([xs; us][xs; us]⊤ ⊗ I_{dX}) Φ̂.
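A direct numpy transcription of these normal equations (the sanity-check system and dimensions below are illustrative; with a well-specified Φ̂ and noiseless data, the true parameters are recovered exactly):

```python
import numpy as np

def least_squares(Phi_hat, xs, us):
    """Algorithm 2: theta_hat = Lambda^+ sum_s Phi^T (z_s kron I) x_{s+1},
    with Lambda = sum_s Phi^T (z_s z_s^T kron I) Phi and z_s = [x_s; u_s]."""
    d_x = xs[0].shape[0]
    d_theta = Phi_hat.shape[1]
    Lam = np.zeros((d_theta, d_theta))
    b = np.zeros(d_theta)
    for s in range(len(us)):
        z = np.concatenate([xs[s], us[s]])
        M = Phi_hat.T @ np.kron(z[:, None], np.eye(d_x))   # d_theta x d_x
        b += M @ xs[s + 1]
        Lam += Phi_hat.T @ np.kron(np.outer(z, z), np.eye(d_x)) @ Phi_hat
    return np.linalg.pinv(Lam) @ b, Lam

# Sanity check: noiseless data from a known system, Phi_hat = the full identity basis.
rng = np.random.default_rng(1)
d_x, d_u = 2, 1
A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[1.0], [0.5]])
theta_star = np.hstack([A, B]).reshape(-1, order="F")      # vec([A B]), column-major
x = rng.standard_normal(d_x)
xs, us = [x], []
for _ in range(20):
    u = rng.standard_normal(d_u)
    x = A @ x + B @ u
    us.append(u); xs.append(x)
theta_hat, _ = least_squares(np.eye(d_x * (d_x + d_u)), xs, us)
assert np.allclose(theta_hat, theta_star)
```

The identity x_{s+1} = (z_s⊤ ⊗ I) vec([A B]) (for the column-major vec defined in Section 2.1) is what makes the regression linear in θ.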

3. Regret Bounds
We now present our bounds on the expected⁵ regret incurred by Algorithm 1. Consider running Al-
gorithm 1 for T = τ1 2kfin −1 timesteps, where τ1 is the initial epoch length and kfin is the number of
epochs. To bound the regret incurred by this algorithm, we decompose the regret into that achieved
by the algorithm under a high probability success event, and that incurred during a failure event
under which the state or controller bound in line 5 of Algorithm 1 are violated. To ensure the failure
event occurs with a small probability, we make the following assumption on the state and controller
bounds, which uses the shorthand ΨB ⋆ ≜ max{1, ∥B ⋆ ∥}.

Assumption 1 We assume that xb ≥ 400 ‖PK0‖² ΨB⋆ σ √(dX + dU) and Kb ≥ 2 ‖PK0‖.

We make additional assumptions about the remaining arguments supplied as inputs to the al-
gorithm in two cases: one in which no additional assumptions about the dynamics are made (Sec-
tion 3.1), and one in which we assume that the system structure estimate is such that the initial sta-
bilizing controller and the optimal controller provide sufficient excitation to identify the unknown
parameters (Section 3.2).

3.1. Certainty equivalent control with continual exploration


To ensure an estimate satisfying the condition in Lemma 1 is attainable, the gap between the model
structure estimate and the ground truth cannot be too large, leading to the following assumption.
Assumption 2 Define β1 ≜ Cbias,1 σ⁴ ‖PK0‖¹² Ψ⁸_{B⋆} ‖θ⋆‖² (dX + dU) √(dθ/dU) for a sufficiently large
universal constant Cbias,1. We assume our representation error satisfies d(Φ̂, Φ⋆) ≤ ε/(2β1 log T).

The requirement above arises from the way that misspecification enters our bounds on the estimation
error ‖[Â B̂] − [A⋆ B⋆]‖²_F. See the definition of Eest,1 in Section 3.3.
  

In this setting, we run Algorithm 1 with exploratory inputs injected to ensure identifiability of
the unknown parameters. Doing so provides the regret guarantees in the following theorem.

Theorem 2 Consider applying Algorithm 1 with initial stabilizing controller K0 for T = τ1 2^{kfin−1}
timesteps for some positive integers kfin and τ1. Suppose that for some γ ≥ d(Φ̂, Φ⋆), the exploration
sequence is given by σk² = max{ √(dU/dθ) / √(τ1 2^{k−1}), γ } for all k ≥ 1. Suppose the state bound xb and the
controller bound Kb satisfy Assumption 1 and that Φ̂ satisfies Assumption 2. There exists a universal
positive constant Cwarm up such that if τ1 = τwarm up log² T for some

τwarm up ≥ Cwarm up σ⁴ ‖PK0‖³ max{ Ψ²_{B⋆} (dX + dU), xb², log(‖P⋆‖) / log((1 − 1/‖P⋆‖)⁻¹), dθ dU / ε² },

5. In contrast to high probability regret bounds, expected regret provides an understanding of what happens in the
unlikely events where the controller performs poorly.

then the expected regret satisfies

E[RT] ≤ c0 log²(T) + c1 √(dθ dU T) log T + c2 γ T log T,

where c0 = poly(dX, dU, ‖PK0‖, ΨB⋆, τwarm up, xb, Kb), c1 = poly(‖PK0‖, ΨB⋆, σ), and
c2 = poly(dU, dX, dθ, ‖PK0‖, ΨB⋆, ‖θ⋆‖, σ).

First consider the above result in the absence of misspecification: γ = d(Φ̂, Φ⋆) = 0. In this
case, the dominant term grows with √(dθ dU T) log T. As long as dθ ≤ dX dU, this is smaller than
the dependence of √(d²U dX T) which appears in the lower bounds for the regret of learning to control
a system with entirely unknown A⋆ and B⋆ (Simchowitz and Foster, 2020). If the misspecification
is nonzero, then the regret bound incurs an additional term that grows linearly (up to log factors)
with T. However, as long as d(Φ̂, Φ⋆) is sufficiently small, there exists a regime of T for which the
√T term dominates, and using the misspecified basis provides a benefit over learning from scratch.

3.2. Certainty equivalent control without additional exploration


In this section, we analyze the regret attained under the additional assumption that the process noise
fully excites the relevant modes of the system under K ⋆ and K0 . This may be guaranteed as follows.
Assumption 3 Let α ≥ 1/(3‖P⋆‖^{3/2}). Assume that min_{v:‖v‖=1} ‖Σ_{i=1}^{dθ} v_i (Φ̂_i^A + Φ̂_i^B K)‖²_F ≥ α²
for K = K0 and K = K⋆, where [Φ̂_i^A Φ̂_i^B] = vec⁻¹(Φ̂_i), and Φ̂_i is the ith column of Φ̂.
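The left-hand side of Assumption 3 is the smallest eigenvalue of the dθ × dθ Gram matrix with entries ⟨Φ̂_i^A + Φ̂_i^B K, Φ̂_j^A + Φ̂_j^B K⟩_F, so the condition can be checked directly. The sketch below (illustrative bases of our own choosing) also shows why the fully unknown basis fails the assumption: a B-block column multiplied by K is linearly dependent on the A-block columns, so the minimum is zero.

```python
import numpy as np

def excitation_level(Phi_hat, K, d_x, d_u):
    """min_{||v||=1} || sum_i v_i (PhiA_i + PhiB_i K) ||_F^2 = lambda_min(G),
    where G_ij = <PhiA_i + PhiB_i K, PhiA_j + PhiB_j K>_F."""
    Cs = []
    for i in range(Phi_hat.shape[1]):
        M = Phi_hat[:, i].reshape(d_x, d_x + d_u, order="F")   # [PhiA_i PhiB_i]
        Cs.append(M[:, :d_x] + M[:, d_x:] @ K)
    G = np.array([[np.sum(Ci * Cj) for Cj in Cs] for Ci in Cs])
    return np.linalg.eigvalsh(G).min()

d_x, d_u = 2, 1
K = np.array([[0.1, -0.2]])
# Full identity basis (fully unknown A and B): the Gram matrix is singular, alpha^2 = 0.
alpha_sq = excitation_level(np.eye(d_x * (d_x + d_u)), K, d_x, d_u)
assert np.isclose(alpha_sq, 0.0)
# A one-dimensional basis spanning only vec([I 0]) is fully excited by any K.
phi = np.hstack([np.eye(2), np.zeros((2, 1))]).reshape(-1, 1, order="F") / np.sqrt(2.0)
assert np.isclose(excitation_level(phi, K, d_x, d_u), 1.0)
```

This matches the experiments in Section 4, where the fully unknown representation requires continual exploration while small lumped-parameter bases do not.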

Under the above assumption, we may run Algorithm 1 without an additional exploratory input
injected. As in Section 3.1, we require that the representation error is small enough to guarantee the
closeness condition in Lemma 1 may be satisfied with our estimated model.
Assumption 4 Define β2 ≜ Cbias,2 ‖PK0‖⁹ Ψ⁸_{B⋆} ‖θ⋆‖² (dX + dU) dθ / min{α², α⁴} for a sufficiently large universal
constant Cbias,2. Our representation error satisfies d(Φ̂, Φ⋆) ≤ √(ε / (2β2)).

This requirement again arises from the dependence of the estimation error bounds on the misspecification.
See the definition of Eest,2 in Section 3.3. Under these assumptions, our regret bound may be
improved to that in the following theorem.

Theorem 3 Consider applying Algorithm 1 with initial stabilizing controller K0 for T = τ1 2^{kfin−1}
timesteps for some positive integers kfin and τ1. Additionally suppose the exploration sequence is
zero for all time: σk² = 0 for k = 1, . . . , kfin. Suppose the state bound xb and the controller
bound Kb satisfy Assumption 1, and that Φ̂ satisfies Assumption 3 and Assumption 4. There exists a
positive universal constant Cwarm up such that if τ1 = τwarm up log T, for

τwarm up ≥ Cwarm up σ⁴ ‖PK0‖³ Ψ²_{B⋆} max{ (dX + dU), xb², log(√‖P⋆‖) / log((1 − 1/(2‖P⋆‖))⁻¹), dθ / (2εα²) },

then the expected regret satisfies

E[RT] ≤ c1 log²(T) + c2 d(Φ̂, Φ⋆)² T,

where c1 = poly(dX, dU, dθ, ‖PK0‖, ΨB⋆, ‖θ⋆‖, σ, α⁻¹, τwarm up, Kb, xb) and
c2 = poly(dX, dU, dθ, ‖PK0‖, ΨB⋆, ‖θ⋆‖, α⁻¹).

When the misspecification is zero, the expected regret grows with log² T. Prior work (Cassel
et al., 2020; Jedra and Proutiere, 2022) has shown that such rates are possible if either A⋆ or B⋆
is known to the learner. See Appendix E.1 for details about how to obtain logarithmic regret in
these settings using Theorem 3. The above result expands on prior work by generalizing the conditions
on prior knowledge that are sufficient to achieve logarithmic regret.
With misspecification present, the above theorem has a term growing linearly with T. In contrast
to Theorem 2, the coefficient for this term is proportional to the level of misspecification squared,
which is smaller than the linear dependence on misspecification in Theorem 2. This result shows that
if some coarse system knowledge depending on a few unknown parameters is available in advance
and the unknown parameters are easily identifiable in the sense of Assumption 3, then there exists a
substantial regime of T for which the regret incurred is much smaller than that attained by learning
to control the system from scratch with fully unknown A⋆ and B⋆ (where the regret scales as √T).

3.3. Proof sketch


Our main result proceeds by first defining a success event for which the certainty equivalent control
scheme never aborts, and generates dynamics estimates [Âk B̂k] which are sufficiently close to
the true dynamics [A⋆ B⋆] at all times. The success events for Section 3.1 and Section 3.2 are
Esuccess,1 = Ebound ∩ Eest,1 and Esuccess,2 = Ebound ∩ Eest,2, respectively, where

Ebound = { ‖xt‖² ≤ xb² log T ∀t = 1, . . . , T } ∩ { ‖K̂k‖ ≤ Kb ∀k = 1, . . . , kfin },
Eest,1 = { ‖[Âk B̂k] − [A⋆ B⋆]‖²_F ≤ Cest,1 ( (σ² dθ dU ‖PK0‖ / √τk) log T + β1 d(Φ̂, Φ⋆) log T ) },
Eest,2 = { ‖[Âk B̂k] − [A⋆ B⋆]‖²_F ≤ Cest,2 ( (σ² dθ / (τk α²)) log T + β2 d(Φ̂, Φ⋆)² ) },

and Cest,1 and Cest,2 are positive universal constants.
We use the success events to decompose the expected regret as in Cassel et al. (2020): E[RT] =
R1 + R2 + R3 − T J(K⋆), where for Esuccess = Esuccess,1 or Esuccess = Esuccess,2,

R1 = E[ 1(Esuccess) Σ_{k=2}^{kfin} Jk ],  R2 = E[ 1(E^c_success) Σ_{t=τ1+1}^{T} ct ],  and  R3 = E[ Σ_{t=1}^{τ1} ct ],  (6)

are the costs due to success, failure, and the first epoch, respectively. Here, Jk are the epoch costs,
defined as Jk = Σ_{t=τ_{k−1}+1}^{τk} ct. In the settings of both Section 3.1 and Section 3.2, R3 is given by
τ1 tr( PK0 (I + σ1² B⋆(B⋆)⊤) ), while R2 is controlled using the upper bounds on the state and
controller to obtain a bound on the cost, which is then multiplied by the small probability of the failure
event. To control R1, we show that under the success event, the closeness condition of Lemma 1 is
satisfied. As a result, the cost of each epoch Jk is (τk − τ_{k−1}) J(K⋆) in addition to a term
proportional to (τk − τ_{k−1}) ( ‖[Âk B̂k] − [A⋆ B⋆]‖²_F + σk² ). Using the estimation error bounds of
events Eest,1 and Eest,2 along with the choices for σk² in the two settings, we find that the quantity
R1 − T J(K⋆) is proportional to √T log T + d(Φ̂, Φ⋆) T log T in the setting of Section 3.1, and to
log² T + d(Φ̂, Φ⋆)² T in the setting of Section 3.2. Combining terms provides the expected regret
bounds in Theorems 2 and 3.⁶

4. Numerical Example
To validate our bounds, we run Algorithm 1 on the system (1) where A⋆ and B⋆ are obtained by
linearizing and discretizing the cartpole dynamics defined by the equations (M + m)ẍ + mℓ(θ̈ cos(θ) −
θ̇² sin(θ)) = u and m(ẍ cos(θ) + ℓθ̈ − g sin(θ)) = 0, for cart mass M = 1, pole mass m = 1,
pole length ℓ = 1, and gravity g = 1. Discretization uses Euler's approach with stepsize 0.25. The
disturbance signal is generated as wt ∼ N(0, 0.01I).
We consider various inputs for the representation estimate Φ̂ and the exploration sequence
σ1², . . . , σ²_{kfin}. The remaining parameters are discussed in Appendix A.
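For concreteness, the following is our own derivation of the discretized linearization about the upright equilibrium (the state ordering (x, ẋ, θ, θ̇) is an assumption; the parameter values and Euler step are those quoted above):

```python
import numpy as np

M, m, l, g, dt = 1.0, 1.0, 1.0, 1.0, 0.25
# Linearizing the cartpole equations about theta = 0 and eliminating xdd gives
#   xdd      = -(m / M) g * theta + u / M,
#   thetadd  = ((M + m) g * theta - u) / (M l).
# State: (x, xdot, theta, thetadot); input: u.
Ac = np.array([[0.0, 1.0, 0.0,                     0.0],
               [0.0, 0.0, -(m / M) * g,            0.0],
               [0.0, 0.0, 0.0,                     1.0],
               [0.0, 0.0, (M + m) * g / (M * l),   0.0]])
Bc = np.array([[0.0], [1.0 / M], [0.0], [-1.0 / (M * l)]])

A = np.eye(4) + dt * Ac      # forward-Euler discretization with stepsize 0.25
B = dt * Bc

d_x, d_u = A.shape[0], B.shape[1]
assert d_x * (d_x + d_u) == 20    # the ambient dimension quoted in the text
```
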
No Misspecification: In Figure 1, we plot the regret of Algorithm 1 in the absence of misspecification,
i.e. d(Φ̂, Φ⋆) = 0. We consider several instances for the representation: one for fully unknown A⋆ and B⋆,
one which encodes a setting where the A⋆ matrix is known up to an unknown scaling, and one which captures
a lumped parameter model where the discretized and linearized cartpole structure is known up to scale,
but the values of the entries which vary with cart mass, pole mass, and pole length are unknown. We
additionally consider extending the lumped parameter representation by adding a basis vector that ensures
the condition in Assumption 3 is violated. For the representation capturing fully unknown A⋆ and B⋆, and
for the extended lumped parameter representation, the condition in Assumption 3 is not satisfied, so we
run Algorithm 1 with exploration noise scaling as σk² ∝ 1/√(2^k), and incur √T regret, as predicted by
Theorem 2. The extended lumped parameter representation incurs regret at a slower rate than the setting
where the system is fully unknown. This is predicted by Theorem 2 due to the fact that the extended
lumped parameter model has dθ = 6 < 20 = dX(dX + dU), so the coefficient on the √T term is smaller. For
the remaining settings, Assumption 3 is satisfied, so we run the algorithm with no additional exploration
and incur logarithmic regret, as predicted by Theorem 3.

Figure 1: Regret of Algorithm 1 with various choices for Φ̂ (fully unknown; A known up to scale;
lumped parameter; lumped parameter extended).
6. High probability estimation error bounds are proved in Appendix B. These are used to establish high probability
bounds on the success events in Appendix C. The bounds on R1 , R2 and R3 are established in Appendix D.
Figure 2: We plot the regret of Algorithm 1 with Φ̂ describing a lumped parameter model (right, no
exploration, d(Φ̂, Φ⋆) ∈ {0.10, 0.15, 0.20}), or a lumped parameter model extended such that the
condition in Assumption 3 is violated (left, continual exploration, d(Φ̂, Φ⋆) ∈ {0.01, 0.05}). In both
settings the representations are perturbed, resulting in a misspecification between the true
representation Φ⋆ and the representation estimate Φ̂. The regret is compared to that incurred by
running Algorithm 1 with a fully unknown A⋆ and B⋆.

Artificial Misspecification: In Figure 2, we compare the regret from the fully unknown setting
to the regret with a misspecified lumped parameter representation. Figure 2(a) considers the lumped
parameter representation that is extended such that Assumption 3 is violated. Therefore the learner
must continually inject noise to the system in order to explore. We artificially create misspecifica-
tion by adding small perturbations to the true representation such that d(Φ̂, Φ⋆ ) > 0. We see that in
the low data regime, the regret incurred is less than that incurred when the model is fully unknown.
When the misspecification level is 0.05, the bias in the identification due to the misspecification
causes the regret to rapidly overtake the regret from the fully unknown setting. When the misspec-
ification is small, the regret remains less than that from the fully unknown setting for the entire
horizon of T values that are plotted. Figure 2(b) considers the lumped parameter setting without
the extension, for which Assumption 3 is satisfied. As in Figure 2(a), we add a perturbation to the
representation to create misspecification. In this setting, we consider much larger perturbations,
such that d(Φ̂, Φ⋆ ) is 0.1, 0.15, or 0.20. For all three such situations, the regret begins much smaller
than that of the fully unknown model, but overtakes it as T becomes large.
Learned Representation: In Figure 3, we consider a setting motivated by multi-task learning, in which
a representation is learned using offline data from several systems related to the system of interest.
In particular, we collect trajectories of length 1200 from five discretized and linearized cartpole
systems generated with various values of the parameters (M, m, ℓ). The resulting data is in turn used
to fit a representation Φ̂.⁷ Once the representation is obtained, we run Algorithm 1 in the absence
of exploratory input. We see that the regret incurred is much lower than the setting in which the
dynamics are fully unknown for the small data regime, but overtakes it as T becomes large. This aligns
with the results from the artificial misspecification experiment. By computing the distance between the
learned and true lumped parameter representations, we find that d(Φ̂, Φ⋆) = 0.2041.

Figure 3: Regret of Algorithm 1 with Φ̂ learned offline from related systems, compared to the fully
unknown setting.
7. See Appendix A for details.
5. Conclusion
We studied adaptive LQR in the presence of misspecification between the learner's simple model
structure for the dynamics and the true dynamics. Our proposed algorithm performs well in both
experiments and theory as long as this misspecification is sufficiently small. Our analysis shows
a phase shift in the problem that depends on whether the learner’s prior knowledge enables iden-
tification of the unknown parameters without additional exploration, thus allowing regret that is
logarithmic in T . There are many interesting avenues for future work, including extending the anal-
ysis to settings where a collection of nonlinear basis functions for the dynamics are misspecified,
e.g., by modifying the model studied in Kakade et al. (2020). Another direction is to analyze the set-
ting where the learner’s simple model is updated online by sharing data from a collection of related
systems, as in recent work on federated learning in dynamical systems (Wang et al., 2023).

References
Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear
quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages
1–26. JMLR Workshop and Conference Proceedings, 2011.

Karl J Åström and Björn Wittenmark. Adaptive control. Courier Corporation, 2013.

Karl Johan Åström and Björn Wittenmark. On self tuning regulators. Automatica, 9(2):185–199,
1973.

Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regu-
lators efficiently. In International Conference on Machine Learning, pages 1328–1337. PMLR,
2020.

Daniel Cederberg, Anders Hansson, and Anders Rantzer. Synthesis of minimax adaptive controller
for a finite set of linear systems. In 2022 IEEE 61st Conference on Decision and Control (CDC),
pages 1380–1384. IEEE, 2022.

Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only √T regret. In International Conference on Machine Learning, pages 1300–1309. PMLR, 2019.

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper,
Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning.
arXiv preprint arXiv:1910.11215, 2019.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for
robust adaptive control of the linear quadratic regulator. Advances in Neural Information Pro-
cessing Systems, 31, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.

Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning
the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

PC Gregory. Proceedings of the Self Adaptive Flight Control Systems Symposium, volume 59.
Wright Air Development Center, Air Research and Development Command, United . . . , 1959.

Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic
Learning Theory, pages 408–421. PMLR, 2020.

Petros A Ioannou and Jing Sun. Robust adaptive control, volume 1. PTR Prentice-Hall Upper
Saddle River, NJ, 1996.

Yassir Jedra and Alexandre Proutiere. Finite-time identification of stable linear systems optimality
of the least-squares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC),
pages 996–1001. IEEE, 2020.

Yassir Jedra and Alexandre Proutiere. Minimal expected regret in linear quadratic control. In
International Conference on Artificial Intelligence and Statistics, pages 10234–10321. PMLR,
2022.

Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Infor-
mation theoretic regret bounds for online nonlinear control. Advances in Neural Information
Processing Systems, 33:15312–15325, 2020.

Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic re-
gret bound in partially observable linear dynamical systems. Advances in Neural Information
Processing Systems, 33:20876–20888, 2020.

Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Animashree Anandkumar. Reinforce-
ment learning with fast stabilization in linear dynamical systems. In International Conference on
Artificial Intelligence and Statistics, pages 5354–5390. PMLR, 2022.

Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear
quadratic control. Advances in Neural Information Processing Systems, 32, 2019.

Aditya Modi, Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Joint
learning of linear time-invariant dynamical systems. arXiv preprint arXiv:2112.10955, 2021.

Kumpati S Narendra and Anuradha M Annaswamy. Stable adaptive systems. Courier Corporation,
2012.

Victor H de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and statistical applications. Springer, 2009.

Anders Rantzer. Minimax adaptive control for a finite set of linear systems. In Learning for Dy-
namics and Control, pages 893–904. PMLR, 2021.
Venkatraman Renganathan, Andrea Iannelli, and Anders Rantzer. An online learning analysis of
minimax adaptive control. arXiv preprint arXiv:2307.07268, 2023.

Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.

Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In
Conference on Learning Theory, pages 3320–3436. PMLR, 2020.

Gunter Stein. Adaptive flight control: A pragmatic view. In Applications of Adaptive Control, pages
291–312. Elsevier, 1980.

Gilbert W Stewart and Ji-guang Sun. Matrix perturbation theory. Academic Press, 1990.

Paolo Tilli. Singular values and eigenvalues of non-Hermitian block Toeplitz matrices. Linear Algebra and its Applications, 272(1-3):59–89, 1998.

Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance
of task diversity. Advances in neural information processing systems, 33:7852–7862, 2020.

Anastasios Tsiamis, Ingvar M Ziemann, Manfred Morari, Nikolai Matni, and George J Pappas.
Learning to control linear systems can be hard. In Conference on Learning Theory, pages 3820–
3857. PMLR, 2022.

Anastasios Tsiamis, Ingvar Ziemann, Nikolai Matni, and George J Pappas. Statistical learning
theory for control: A finite-sample perspective. IEEE Control Systems Magazine, 43(6):67–97,
2023.

Roman Vershynin. High-dimensional probability. University of California, Irvine, 2020.

Han Wang, Leonardo F Toso, Aritra Mitra, and James Anderson. Model-free learning with heterogeneous dynamical systems: A federated LQR approach. arXiv preprint arXiv:2308.11743, 2023.

Thomas T Zhang, Katie Kang, Bruce D Lee, Claire Tomlin, Sergey Levine, Stephen Tu, and Nikolai
Matni. Multi-task imitation learning for linear dynamical systems. In Learning for Dynamics and
Control Conference, pages 586–599. PMLR, 2023a.

Thomas TCK Zhang, Leonardo F Toso, James Anderson, and Nikolai Matni. Meta-learning opera-
tors to optimality from multi-task non-iid data. arXiv preprint arXiv:2308.04428, 2023b.

Ingvar Ziemann and Henrik Sandberg. Regret lower bounds for learning linear quadratic gaussian
systems. arXiv preprint arXiv:2201.01680, 2022.

Ingvar Ziemann, Anastasios Tsiamis, Bruce Lee, Yassir Jedra, Nikolai Matni, and George J. Pappas.
A tutorial on the non-asymptotic theory of system identification, 2023.
Appendix A. Additional Experimental Details
The matrices A⋆ and B⋆ considered in Section 4 are given as follows:
\[
A^\star = \begin{bmatrix} 1 & 0.25 & 0 & 0 \\ 0 & 1 & -0.25 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0.5 & 1 \end{bmatrix}, \qquad
B^\star = \begin{bmatrix} 0 \\ 0.25 \\ 0 \\ -0.25 \end{bmatrix}. \tag{7}
\]
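As a quick sanity check (an illustration, not a claim made in the paper's text), every eigenvalue of the open-loop matrix A⋆ in (7) sits at 1, so the uncontrolled dynamics are not stable and the learner must actively stabilize the system:

```python
import numpy as np

# The system matrices of Eq. (7). A* - I is nilpotent, so every eigenvalue of
# A* lies at 1 and the open-loop system is not (strictly) stable.
A_star = np.array([[1.0, 0.25, 0.0,  0.0],
                   [0.0, 1.0, -0.25, 0.0],
                   [0.0, 0.0,  1.0,  0.0],
                   [0.0, 0.0,  0.5,  1.0]])
B_star = np.array([[0.0], [0.25], [0.0], [-0.25]])

eigs = np.linalg.eigvals(A_star)
spectral_radius = np.max(np.abs(eigs))
nilpotent_check = np.linalg.matrix_power(A_star - np.eye(4), 3)
```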

We consider Q = I and R = I. We keep the values of K0, τ1, xb, and Kb fixed across all experiments as [0.37  1.64  4.49  3.89], 1024, 50, and 15, respectively. We replace the condition ∥xt∥² ≥ x²b log T from line 5 of Algorithm 1 with ∥xt∥² ≥ xb. These modifications are made because the guarantees in Section 3 are not anytime, but rather require the number of timesteps T for which the algorithm is to be run as an input to the algorithm. For the sake of efficiently plotting the regret against time, we remove the dependence of the algorithm on T, and keep these parameters fixed across time. Despite the discrepancy, we observe that the predicted trends still hold.
Learned Representation The parameter values used for the systems which generate the offline data are (M, m, ℓ) = (0.4, 1.0, 1.0), (1.6, 1.3, 0.3), (1.3, 0.7, 0.65), (0.2, 0.06, 1.36), and (0.2, 0.47, 1.83). The data from these systems is collected with process noise wt ∼ N(0, 0.01I) and inputs ut = K0 xt + ηt, where ηt ∼ N(0, I). The representation is determined by jointly optimizing Φ along with the system-specific parameter values θ to solve the least squares problem
\[
\hat\Phi, \hat\theta^{(1)}, \dots, \hat\theta^{(5)} = \operatorname*{argmin}_{\Phi, \theta^{(1)}, \dots, \theta^{(5)}} \sum_{i=1}^{5} \sum_{t=1}^{1199} \left\| x_{t+1}^{(i)} - \operatorname{vec}^{-1}\left( \Phi \theta^{(i)} \right) \begin{bmatrix} x_t^{(i)} \\ u_t^{(i)} \end{bmatrix} \right\|^2,
\]
where the superscript (i) indicates which of the five systems the data is collected from.
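The joint least squares problem above is nonconvex in (Φ, θ⁽¹⁾, …, θ⁽⁵⁾) jointly. One simple way to approach it, sketched below on synthetic data, is alternating least squares: fix Φ and solve each θ⁽ⁱ⁾ in closed form, then fix the weights and solve for Φ block-row by block-row, repeating until the residual stops improving. This is an illustration with made-up dimensions, not necessarily the procedure used for the paper's experiments.

```python
import numpy as np

# Alternating least squares sketch for the joint problem above, on synthetic
# data. Dimensions and initialization are hypothetical.
rng = np.random.default_rng(0)
d_x, d_u, d_th, n_sys, T = 2, 1, 2, 5, 80
d_phi = d_x * (d_x + d_u)

unvec = lambda v: v.reshape(d_x, d_x + d_u)        # row-major vec^{-1}

Phi_true = np.linalg.qr(rng.standard_normal((d_phi, d_th)))[0]
thetas_true = rng.standard_normal((n_sys, d_th))

Z, Y = [], []                                      # regressors [x; u] and targets
for th in thetas_true:
    AB = 0.3 * unvec(Phi_true @ th)                # scaled to keep states bounded
    x, zs, ys = np.zeros(d_x), [], []
    for _ in range(T):
        z = np.concatenate([x, rng.standard_normal(d_u)])
        x = AB @ z + 0.01 * rng.standard_normal(d_x)
        zs.append(z); ys.append(x)
    Z.append(np.array(zs)); Y.append(np.array(ys))

def fit_thetas(Phi):
    # with Phi fixed, each theta^(i) solves an ordinary least squares problem
    P3 = Phi.reshape(d_x, d_x + d_u, d_th)
    return np.array([np.linalg.lstsq(
        np.einsum('rck,tc->trk', P3, z).reshape(T * d_x, d_th),
        y.ravel(), rcond=None)[0] for z, y in zip(Z, Y)])

def fit_phi(thetas):
    # with the weights fixed, each block-row of Phi solves a least squares problem
    Phi = np.zeros((d_phi, d_th))
    for r in range(d_x):
        rows = np.array([np.kron(th, z[t]) for z, th in zip(Z, thetas)
                         for t in range(T)])
        tgt = np.concatenate([y[:, r] for y in Y])
        sol = np.linalg.lstsq(rows, tgt, rcond=None)[0]
        Phi[r * (d_x + d_u):(r + 1) * (d_x + d_u)] = sol.reshape(d_th, d_x + d_u).T
    return np.linalg.qr(Phi)[0]                    # re-orthonormalize the columns

def resid(Phi, thetas):
    return sum(np.sum((y - z @ unvec(Phi @ th).T) ** 2)
               for z, y, th in zip(Z, Y, thetas))

Phi_hat = np.linalg.qr(rng.standard_normal((d_phi, d_th)))[0]
errs = []
for _ in range(25):
    Phi_hat = fit_phi(fit_thetas(Phi_hat))
    errs.append(resid(Phi_hat, fit_thetas(Phi_hat)))
```

Each alternating step weakly decreases the objective (the QR step only re-parametrizes the column span), so the residual sequence is non-increasing.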

Appendix B. Estimation Error Bounds


We begin by considering a general estimation problem in which the system is excited by an arbitrary stabilizing controller K and excitation level defined by σu. In particular, we consider the evolution of the following system:
\[
\begin{aligned}
x_{t+1} &= A^\star x_t + B^\star u_t + w_t \\
u_t &= K x_t + \sigma_u g_t,
\end{aligned} \tag{8}
\]
where gt ∼ N(0, I) i.i.d., and wt is a random variable with σ²-sub-Gaussian entries satisfying E[wt wt⊤] = I. We assume that σu² ≤ 1 and that x1 is a random variable. We consider generating the estimates θ̂, Λ = LS(Φ̂, x_{1:t+1}, u_{1:t}). We present a bound on the estimation error ∥Φ̂θ̂ − Φ⋆θ⋆∥² in terms of the true system parameters as well as the amount of data, t.
Before presenting our main estimation result, we define the shorthand AK ≜ A⋆ + B ⋆ K. We
also use the H∞ norm, defined below.

Definition 4 The H∞ norm of a stable matrix A is given by
\[
\|A\|_{\mathcal{H}_\infty} = \sup_{z \in \mathbb{C}, |z| = 1} \left\| (zI - A)^{-1} \right\|.
\]
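As a concrete illustration (not from the paper), the H∞ norm of Definition 4 can be approximated numerically by gridding the unit circle; for stable A it is upper bounded by Σₜ ∥Aᵗ∥, since (zI − A)⁻¹ = Σₜ Aᵗ z⁻⁽ᵗ⁺¹⁾ for |z| = 1:

```python
import numpy as np

# Approximate ||A||_{H_inf} of Definition 4 on a unit-circle grid, and compare
# against the geometric-series upper bound sum_t ||A^t||.
A = np.array([[0.5, 0.3],
              [0.0, 0.4]])

def hinf_norm(A, n_grid=2000):
    I = np.eye(A.shape[0])
    return max(np.linalg.norm(np.linalg.inv(np.exp(1j * ang) * I - A), 2)
               for ang in np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False))

h = hinf_norm(A)
series_bound = sum(np.linalg.norm(np.linalg.matrix_power(A, t), 2)
                   for t in range(200))
```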
Theorem 5 (least squares estimation error) Let δ ∈ (0, 1/2). Suppose t ≥ c τ_ls(K, ∥x1∥², δ) for
\[
\tau_{\mathrm{ls}}(K, \bar{x}, \delta) \triangleq \max\left\{ \sigma^4 \|P_K\|^3 \Psi_{B^\star}^2 \left( d_X + d_U + \log\frac{1}{\delta} \right),\; \bar{x}\,\|P_K\| + 1 \right\}
\]
and a sufficiently large universal constant c > 0. There exists an event E_ls which holds with probability at least 1 − δ under which the estimation error satisfies
\[
\left\| \hat\Phi\hat\theta - \Phi^\star\theta^\star \right\|^2 \lesssim \frac{d_\theta \sigma^2}{t\,\lambda_{\min}\left( \bar\Delta_t(\sigma_u, K) \right)} \log\frac{1}{\delta}
+ \left( 1 + \frac{\sigma^4 \|P_K\|^7 \Psi_{B^\star}^6 \left( d_X + d_U + \log\frac{1}{\delta} \right)}{t\,\lambda_{\min}\left( \bar\Delta_t(\sigma_u, K) \right)^2} \right) \frac{\|P_K\|^2 \Psi_{B^\star}^2 \, d(\hat\Phi, \Phi^\star)^2 \|\theta^\star\|^2}{\lambda_{\min}\left( \bar\Delta_t(\sigma_u, K) \right)},
\]
where
\[
\bar\Delta_t(\sigma_u, K) \triangleq \hat\Phi^\top \left( \frac{1}{t} \sum_{s=0}^{t-2} \sum_{j=0}^{s} \begin{bmatrix} I \\ K \end{bmatrix} A_K^j \left( \sigma_u^2 B^\star (B^\star)^\top + I \right) \left( A_K^j \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top + \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u^2 I_{d_U} \end{bmatrix} \right) \hat\Phi.
\]

Toward proving the above theorem, we first define a mapping from the noise sequence and the state x1 to the stacked sequence of states that will be used throughout the section. In particular,
\[
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_t \end{bmatrix} = \begin{bmatrix} F_1^t(K, \sigma_u) & F_2^t(K, \sigma_u) \end{bmatrix} \begin{bmatrix} x_1 \\ \eta_t \end{bmatrix}, \tag{9}
\]
where \( \eta_t \triangleq \begin{bmatrix} w_1^\top & g_1^\top & \dots & w_{t-1}^\top & g_{t-1}^\top \end{bmatrix}^\top \) and
\[
\begin{bmatrix} F_1^t(K, \sigma_u) & F_2^t(K, \sigma_u) \end{bmatrix} \triangleq
\begin{bmatrix}
I & 0 & 0 & \dots & 0 \\
A_K & \begin{bmatrix} I & B^\star \sigma_u \end{bmatrix} & 0 & \dots & 0 \\
\vdots & & \ddots & & \vdots \\
A_K^{t-1} & A_K^{t-2} \begin{bmatrix} I & B^\star \sigma_u \end{bmatrix} & \dots & & \begin{bmatrix} I & B^\star \sigma_u \end{bmatrix}
\end{bmatrix}.
\]
Using the fact that \( \begin{bmatrix} x_t \\ u_t \end{bmatrix} = \begin{bmatrix} I \\ K \end{bmatrix} x_t + \begin{bmatrix} 0 \\ \sigma_u g_t \end{bmatrix} \), we may express the stacked states and inputs as
\[
\begin{bmatrix} x_1 \\ u_1 \\ \vdots \\ x_t \\ u_t \end{bmatrix}
= \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) F_1^t(K, \sigma_u) x_1
+ \left( \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) F_2^t(K, \sigma_u) + I_t \otimes \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u I_{d_U} \end{bmatrix} \right) \eta_t
\triangleq \Xi_1^t(K, \sigma_u) x_1 + \Xi_2^t(K, \sigma_u) \eta_t. \tag{10}
\]
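The structure of (9) can be confirmed on a small numerical example (a sketch with made-up system matrices): assembling F₁ᵗ and F₂ᵗ block by block and applying them to (x₁, ηₜ) reproduces a step-by-step simulation driven by the same noise.

```python
import numpy as np

# Verify the stacked-trajectory map of Eq. (9) on a toy example: build F1, F2
# explicitly and check that F1 x1 + F2 eta matches a direct simulation.
rng = np.random.default_rng(1)
dx, du, t = 2, 1, 6
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.2]])
sigma_u = 0.5
AK = A + B @ K

x1 = rng.standard_normal(dx)
w = rng.standard_normal((t - 1, dx))
g = rng.standard_normal((t - 1, du))

# step-by-step simulation: x_{s+1} = A x_s + B u_s + w_s, u_s = K x_s + sigma_u g_s
xs = [x1]
for s in range(t - 1):
    u = K @ xs[-1] + sigma_u * g[s]
    xs.append(A @ xs[-1] + B @ u + w[s])
X_sim = np.concatenate(xs)

# assemble F1 and F2 following the block structure displayed above
IB = np.hstack([np.eye(dx), sigma_u * B])          # the block [I  B* sigma_u]
F1 = np.vstack([np.linalg.matrix_power(AK, s) for s in range(t)])
F2 = np.zeros((t * dx, (t - 1) * (dx + du)))
for s in range(1, t):
    for k in range(s):
        F2[s * dx:(s + 1) * dx, k * (dx + du):(k + 1) * (dx + du)] = \
            np.linalg.matrix_power(AK, s - 1 - k) @ IB
eta = np.concatenate([np.concatenate([w[k], g[k]]) for k in range(t - 1)])
X_stack = F1 @ x1 + F2 @ eta
```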
We additionally define two quantities related to the empirical covariance matrix. In particular, we define the expected value of the empirical covariance matrix conditioned on the state at the start of the sequence, x1, along with the centered counterpart:
\[
\Sigma^t(K, \sigma_u, x_1) \triangleq \mathbf{E}\left[ \frac{1}{t} \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \Big| x_1 \right] \quad \text{and}
\]
\[
\bar\Sigma^t(K, \sigma_u, x_1) \triangleq \frac{1}{t} \sum_{s=1}^{t} \mathbf{E}\left[ \left( \begin{bmatrix} x_s \\ u_s \end{bmatrix} - \mathbf{E}\left[ \begin{bmatrix} x_s \\ u_s \end{bmatrix} \Big| x_1 \right] \right) \left( \begin{bmatrix} x_s \\ u_s \end{bmatrix} - \mathbf{E}\left[ \begin{bmatrix} x_s \\ u_s \end{bmatrix} \Big| x_1 \right] \right)^\top \Big| x_1 \right].
\]

Lemma 2 (Epoch-wise covariance bounds) For t ≥ 2, we have

1. \( \bar\Sigma^t(K, \sigma_u, x_1) = \frac{1}{t} \sum_{s=0}^{t-2} \sum_{j=0}^{s} \begin{bmatrix} I \\ K \end{bmatrix} A_K^j \left( \sigma_u^2 B^\star (B^\star)^\top + I \right) \left( A_K^j \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top + \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u^2 I_{d_U} \end{bmatrix} \)

2. \( \Sigma^t(K, \sigma_u, x_1) = \bar\Sigma^t(K, \sigma_u, x_1) + \frac{1}{t} \sum_{s=0}^{t-1} \begin{bmatrix} I \\ K \end{bmatrix} A_K^s x_1 x_1^\top \left( A_K^s \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top \)

3. \( \Sigma^t(K, \sigma_u, x_1) \succeq \bar\Sigma^t(K, \sigma_u, x_1) \succeq \frac{\sigma_u^2}{2\left( 1 + 2\|K\|^2 + \sigma_u^2 \right)} I \)

4. \( \Sigma^t(K, \sigma_u, x_1) \preceq \left( 1 + \|P_K\| \frac{\|x_1\|^2}{t-1} \right) \bar\Sigma^t(K, \sigma_u, x_1) \)

5. \( \left\| \bar\Sigma^t(K, \sigma_u, x_1) \right\| \le 5 \|P_K\|^2 \Psi_{B^\star}^2. \)

Proof
Points 1 and 2: These results follow by expressing the states and inputs in terms of the noise sequence and x1, then using the fact that the noise terms are independent, with mean zero and identity covariance.
Point 3: The first inequality is immediate from point 2 by the fact that
\[
\frac{1}{t} \sum_{s=0}^{t-1} \begin{bmatrix} I \\ K \end{bmatrix} A_K^s x_1 x_1^\top \left( A_K^s \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top \succeq 0.
\]

For the second inequality, note that the j = 0 terms of point 1 alone contribute \( \frac{t-1}{t} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \succeq \frac{1}{2} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \) for t ≥ 2, so that
\[
\bar\Sigma^t(K, \sigma_u, x_1) \succeq \frac{1}{2} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top + \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u^2 I_{d_U} \end{bmatrix}.
\]
By Lemma F.6 of Dean et al. (2018),
\[
\lambda_{\min}\left( \frac{1}{2} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top + \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u^2 I_{d_U} \end{bmatrix} \right) \ge \sigma_u^2 \min\left\{ \frac{1}{2}, \frac{1}{4\|K\|^2 + 2\sigma_u^2} \right\}.
\]
To conclude, we pull out \( \frac{1}{2} \), and we see that \( \min\left\{ 1, \frac{1}{2\|K\|^2 + \sigma_u^2} \right\} \ge \frac{1}{1 + 2\|K\|^2 + \sigma_u^2} \).
Point 4: Using the fact that \( x_1 x_1^\top \preceq \|x_1\|^2 I \), we find
\[
\begin{aligned}
\frac{1}{t} \sum_{s=0}^{t-1} \begin{bmatrix} I \\ K \end{bmatrix} A_K^s x_1 x_1^\top \left( A_K^s \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top
&\preceq \|x_1\|^2 \frac{1}{t} \sum_{s=0}^{t-1} \begin{bmatrix} I \\ K \end{bmatrix} A_K^s \left( A_K^s \right)^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top \\
&\preceq \frac{\|x_1\|^2}{t-1} \left\| \operatorname{dlyap}[A_K] \right\| \frac{t-1}{t} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \\
&\preceq \frac{\|x_1\|^2}{t-1} \|P_K\| \, \bar\Sigma^t(K, \sigma_u, x_1),
\end{aligned}
\]
where the final step uses \( \frac{t-1}{t} \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \preceq \bar\Sigma^t(K, \sigma_u, x_1) \) along with \( \|\operatorname{dlyap}[A_K]\| \le \|P_K\| \). The claimed bound then follows from point 2.
Point 5: By the fact that σu² ≤ 1, we have from the triangle inequality and submultiplicativity that
\[
\begin{aligned}
\left\| \bar\Sigma^t(K, \sigma_u, x_1) \right\|
&\le 1 + \left( 1 + \|K\|^2 \right) \left( \|B^\star\|^2 + 1 \right) \left\| \frac{1}{t} \sum_{s=0}^{t-2} \sum_{j=0}^{s} A_K^j \left( A_K^j \right)^\top \right\| \\
&\le 1 + \left( 1 + \|K\|^2 \right) \left( \|B^\star\|^2 + 1 \right) \left\| \sum_{j=0}^{\infty} A_K^j \left( A_K^j \right)^\top \right\| \\
&= 1 + \left( 1 + \|K\|^2 \right) \left( \|B^\star\|^2 + 1 \right) \left\| \operatorname{dlyap}[A_K] \right\| \\
&\le 1 + \left( 1 + \|K\|^2 \right) \left( \|B^\star\|^2 + 1 \right) \|P_K\| \\
&\le 5 \|P_K\|^2 \Psi_{B^\star}^2,
\end{aligned}
\]
where the final point follows from the facts that \( \|K\|^2 \le \left\| Q + K^\top R K \right\| \le \|P_K\| \), \( \|P_K\| \ge 1 \), and the definition of \( \Psi_{B^\star} \).
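Point 1 of Lemma 2 can also be checked by Monte Carlo on a small example (a sketch with made-up system matrices): averaging the per-step covariances of (x_s; u_s), centered at their means given x₁, recovers the closed form.

```python
import numpy as np

# Monte Carlo check of Lemma 2, point 1, on a toy system: the averaged centered
# covariance of (x_s; u_s) given x_1 matches the closed-form expression.
rng = np.random.default_rng(2)
dx, du, t, n_mc = 2, 1, 5, 100000
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.2]])
su = 0.5
AK = A + B @ K
IK = np.vstack([np.eye(dx), K])

# closed form from point 1
Sbar = np.zeros((dx + du, dx + du))
for s in range(t - 1):
    for j in range(s + 1):
        Aj = np.linalg.matrix_power(AK, j)
        Sbar += IK @ Aj @ (su**2 * B @ B.T + np.eye(dx)) @ Aj.T @ IK.T
Sbar /= t
Sbar[dx:, dx:] += su**2 * np.eye(du)

# vectorized Monte Carlo rollouts from a fixed x_1
X = np.tile(np.array([1.0, -0.5]), (n_mc, 1))
Sbar_mc = np.zeros_like(Sbar)
for s in range(t):
    G = rng.standard_normal((n_mc, du))
    U = X @ K.T + su * G
    Zc = np.hstack([X, U])
    Zc = Zc - Zc.mean(axis=0)
    Sbar_mc += Zc.T @ Zc / n_mc
    X = X @ A.T + U @ B.T + rng.standard_normal((n_mc, dx))
Sbar_mc /= t
```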

To analyze the estimate θ̂, we first verify that it solves the least squares problem of interest.
Lemma 3 Consider applying Algorithm 2 to an estimated model structure Φ̂, state data x_{1:t+1}, and input data u_{1:t}: θ̂, Λ = LS(Φ̂, x_{1:t+1}, u_{1:t}). As long as Λ ≻ 0, then θ̂ is the unique solution to
\[
\operatorname*{argmin}_{\theta} \sum_{s=1}^{t} \left\| x_{s+1} - \operatorname{vec}^{-1}(\hat\Phi\theta) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2. \tag{11}
\]

Proof By the vectorization identity vec(XY Z) = (Z⊤ ⊗ X) vec(Y), we may write
\[
\operatorname{vec}^{-1}(\hat\Phi\theta) \begin{bmatrix} x_s \\ u_s \end{bmatrix} = \left( \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_X} \right) \hat\Phi \theta.
\]
Minimizing \( \sum_{s=1}^{t} \left\| x_{s+1} - \left( \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_X} \right) \hat\Phi\theta \right\|^2 \) over θ yields θ̂.

We now split the estimation error into a part arising from the variance and a part arising from the bias. The variance decays as t grows sufficiently large. The bias is due to the misspecification between the model structure estimate Φ̂ and the true model structure Φ⋆.

Lemma 4 Consider applying Algorithm 2 to an estimated model structure Φ̂, state data x_{1:t+1}, and input data u_{1:t}: θ̂, Λ = LS(Φ̂, x_{1:t+1}, u_{1:t}). Suppose that Λ ≻ 0. Let
\[
\bar\theta = \operatorname*{argmin}_{\theta} \left( \hat\Phi\theta - \Phi^\star\theta^\star \right)^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \left( \hat\Phi\theta - \Phi^\star\theta^\star \right). \tag{12}
\]
We have that
\[
\left\| \hat\Phi\hat\theta - \Phi^\star\theta^\star \right\|^2 \le 2 \underbrace{\left\| \hat\Phi\hat\theta - \hat\Phi\bar\theta \right\|^2}_{\text{variance}} + 2 \underbrace{\left\| \hat\Phi\bar\theta - \Phi^\star\theta^\star \right\|^2}_{\text{bias}}. \tag{13}
\]
Proof The lemma follows by applying the triangle inequality together with the elementary bound \( (a + b)^2 \le 2a^2 + 2b^2 \).

The variance term in the above lemma is bounded in Lemma 10, while the bias term is bounded in Lemma 11. Before presenting these lemmas, we present several covariance concentration guarantees that will be useful in obtaining the variance bound.

B.1. Covariance Concentration


The following ε-net argument is standard. A proof may be found in Chapter 4 of Vershynin (2020).

Lemma 5 Let W be a d × d random matrix, and ε ∈ [0, 1/2). Furthermore, let N be an ε-net of S^{d−1} with minimal cardinality. Then for all ρ > 0,
\[
\mathbf{P}\left[ \|W\| > \rho \right] \le \left( \frac{2}{\varepsilon} + 1 \right)^d \max_{x \in \mathcal{N}} \mathbf{P}\left[ x^\top W x > (1 - 2\varepsilon)\rho \right].
\]

The following lemma helps to bound system constants in terms of a few key quantities.

Lemma 6 (Lemma B.11 of Simchowitz and Foster (2020)) For a stabilizing controller K, we have that for any t, \( \|A_K^t\| \le \sqrt{\left( 1 - \frac{1}{\|P_K\|} \right)^t \|P_K\|} \) and \( \sum_{t=0}^{\infty} \|A_K^t\| \le 2\|P_K\|^{3/2} \). As a consequence, we also have \( \|A_K\|_{\mathcal{H}_\infty} \le 2\|P_K\|^{3/2} \).

Proof For the first point, observe that
\[
\|A_K^t\| = \sqrt{\left\| (A_K^t)^\top A_K^t \right\|} \le \sqrt{\left\| (A_K^t)^\top P_K A_K^t \right\|} \le \sqrt{\left( 1 - \frac{1}{\|P_K\|} \right)^t \|P_K\|}.
\]
For the second point, we may use the first point to show that
\[
\begin{aligned}
\sum_{t=0}^{\infty} \|A_K^t\| &\le \|P_K\|^{1/2} \sum_{t=0}^{\infty} \left( \sqrt{1 - \frac{1}{\|P_K\|}} \right)^t
= \|P_K\|^{1/2} \frac{1}{1 - \sqrt{1 - \frac{1}{\|P_K\|}}} \\
&= \|P_K\|^{1/2} \frac{1 + \sqrt{1 - \frac{1}{\|P_K\|}}}{1/\|P_K\|}
\le 2\|P_K\|^{3/2}.
\end{aligned}
\]
The result on the H∞ norm of A_K follows from the previous point by noting that \( \|A_K\|_{\mathcal{H}_\infty} \le \sum_{t=0}^{\infty} \|A_K^t\| \).
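Both bounds of Lemma 6 can be checked numerically for a generic stable matrix (a sketch: here P stands in for P_K, computed from the truncated Lyapunov series P = Σⱼ (Aʲ)⊤Aʲ, which satisfies A⊤PA − P + I = 0):

```python
import numpy as np

# Numerical check of Lemma 6 for a generic stable A, with P solving the
# Lyapunov equation A^T P A - P + I = 0 (truncated series), standing in for P_K.
A = np.array([[0.8, 0.5],
              [0.0, 0.7]])
P = sum(np.linalg.matrix_power(A, j).T @ np.linalg.matrix_power(A, j)
        for j in range(400))
nP = np.linalg.norm(P, 2)

norms = [np.linalg.norm(np.linalg.matrix_power(A, s), 2) for s in range(200)]
bounds = [np.sqrt((1.0 - 1.0 / nP) ** s * nP) for s in range(200)]
```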

We now present a result about the concentration of the empirical covariance around the true covariance. The result is adapted from Lemma 2 of Jedra and Proutiere (2020) and Lemma A.2 of Zhang et al. (2023a), which generalized the approximate isometries argument in Jedra and Proutiere (2020) to account for projecting the covariance onto a lower dimensional space by hitting it on either side with a low rank orthonormal matrix. The primary difference of the below result from that in Zhang et al. (2023a) is the inclusion of a nonzero initial state, resulting in an additional sub-Gaussian concentration step.

Lemma 7 (Approximate Isometries) Let data x_{1:t}, u_{1:t} be generated by (8). Let G ∈ R^{d_c(d_X+d_U)×d_G} be a matrix with orthonormal columns. Define
\[
M \triangleq \left( G^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_c} \right) G \right)^{-1/2}.
\]
Then for some positive universal constant C, and for any ρ < 1, the inequality
\[
\left\| M G^\top \left( \frac{1}{t} \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_c} \right) G M - I \right\| > \rho
\]
holds with probability at most
\[
4 \times 9^{d_G} \exp\left( -C \frac{\min\{\rho, \rho^2\}\, t}{\sigma^4 \|P_K\|^3 \Psi_{B^\star}^2} \right).
\]
Proof Define the shorthand
\[
X \triangleq \frac{1}{\sqrt{t}} \begin{bmatrix} x_1^\top & u_1^\top \\ \vdots & \vdots \\ x_t^\top & u_t^\top \end{bmatrix} \otimes I_{d_c}
\]
such that
\[
\frac{1}{t} \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_c} = X^\top X.
\]
First note that \( \mathbf{E}\left[ G^\top X^\top X G | x_1 \right] = M^{-2} \). Then
\[
\begin{aligned}
\left\| M G^\top X^\top X G M - I \right\|
&= \sup_{v \in \mathcal{S}^{d_G}} \left| v^\top \left( M G^\top X^\top X G M - I \right) v \right| \\
&= \sup_{v \in \mathcal{S}^{d_G}} \left| v^\top M G^\top X^\top X G M v - \mathbf{E}\left[ v^\top M G^\top X^\top X G M v \,|\, x_1 \right] \right| \\
&= \sup_{v \in \mathcal{S}^{d_G}} \left| \left\| X G M v \right\|^2 - \mathbf{E}\left[ \left\| X G M v \right\|^2 \,|\, x_1 \right] \right|.
\end{aligned} \tag{14}
\]

Using the identity vec(W Y Z) = (Z⊤ ⊗ W) vec(Y), we may write
\[
X G M v = \operatorname{vec}\left( v^\top M G^\top X^\top \right) = \left( I_{t d_c} \otimes (GMv)^\top \right) \operatorname{vec}(X^\top) \triangleq \sigma_{GMv} \operatorname{vec}(X^\top). \tag{15}
\]
Next, we construct matrices Γ₁ᵗ(K, σ_u) and Γ₂ᵗ(K, σ_u) such that vec(X⊤) = Γ₁ᵗ(K, σ_u) x₁ + Γ₂ᵗ(K, σ_u) η_t. To construct these matrices, first recall the definitions of Ξ₁ᵗ(K, σ_u) and Ξ₂ᵗ(K, σ_u) given in (10) and of F₁ᵗ(K, σ_u) and F₂ᵗ(K, σ_u) given in (9). To reach vec(X⊤) from the vector of stacked states and inputs, we can apply the following transformation:
\[
\operatorname{vec}(X^\top) = \frac{1}{\sqrt{t}} \left( I_t \otimes \begin{bmatrix} I_{d_X + d_U} \otimes e_1 \\ \vdots \\ I_{d_X + d_U} \otimes e_{d_c} \end{bmatrix} \right) \begin{bmatrix} x_1 \\ u_1 \\ \vdots \\ x_t \\ u_t \end{bmatrix} \triangleq \frac{1}{\sqrt{t}} H \begin{bmatrix} x_1 \\ u_1 \\ \vdots \\ x_t \\ u_t \end{bmatrix}.
\]
Combining the above sequence of transformations implies that by defining \( \Gamma_1^t(K, \sigma_u) \triangleq \frac{1}{\sqrt{t}} H \Xi_1^t(K, \sigma_u) \) and \( \Gamma_2^t(K, \sigma_u) \triangleq \frac{1}{\sqrt{t}} H \Xi_2^t(K, \sigma_u) \), we have \( \operatorname{vec}(X^\top) = \Gamma_1^t(K, \sigma_u) x_1 + \Gamma_2^t(K, \sigma_u) \eta_t \).
Using the above expression, our quantity of interest in (14) may be expressed as
\[
\begin{aligned}
&\left\| M G^\top X^\top X G M - I \right\| \\
&= \sup_{v \in \mathcal{S}^{d_G}} \left| \left\| \sigma_{GMv} \left( \Gamma_1^t(K, \sigma_u) x_1 + \Gamma_2^t(K, \sigma_u) \eta_t \right) \right\|^2 - \mathbf{E}\left[ \left\| \sigma_{GMv} \left( \Gamma_1^t(K, \sigma_u) x_1 + \Gamma_2^t(K, \sigma_u) \eta_t \right) \right\|^2 \Big| x_1 \right] \right| \\
&\overset{(i)}{=} \sup_{v \in \mathcal{S}^{d_G}} \left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 + 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle - \mathbf{E}\left[ \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 \right] \right| \\
&\overset{(ii)}{=} \sup_{v \in \mathcal{S}^{d_G}} \left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 - \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 + 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle \right| \\
&\le \sup_{v \in \mathcal{S}^{d_G}} \underbrace{\left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 - \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \right|}_{\text{quadratic form concentration}} + \underbrace{\left| 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle \right|}_{\text{cross term}},
\end{aligned}
\]
where (i) follows from linearity of expectation and the substitution rule of conditional expectation, combined with the fact that η_t has mean zero and is independent of x_1. Equality (ii) follows from the fact that η_t is mean zero with identity covariance.
To bound the concentration of the quadratic form, we invoke Theorem 6.3.2 of Vershynin (2020). In particular, there exists a universal constant C > 0 such that
\[
\Pr\left[ \left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 - \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \right| \ge \frac{\rho}{2} \right]
\le 2\exp\left( -C \frac{\min\left\{ \rho^2 / \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2, \rho \right\}}{\sigma^4 \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|^2} \right). \tag{16}
\]
For the cross term, we invoke a standard sub-Gaussian tail bound (Vershynin, 2020) to show that for any fixed v ∈ S^{d_G},
\[
\Pr\left[ \left| 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle \right| > \frac{\rho}{2} \right]
\le 2\exp\left( -\frac{1}{16} \rho^2 \frac{1}{\sigma^2 \left\| \Gamma_2^t(K, \sigma_u)^\top \sigma_{GMv}^\top \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\|^2} \right). \tag{17}
\]

We now seek to simplify the denominators appearing in the exponents of each of the probability bounds above. First, note that we have
\[
\begin{aligned}
&\left\| \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\|^2 + \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \\
&= \operatorname{tr}\left( \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 x_1^\top \Gamma_1^t(K, \sigma_u)^\top \sigma_{GMv}^\top \right) + \operatorname{tr}\left( \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \Gamma_2^t(K, \sigma_u)^\top \sigma_{GMv}^\top \right) \\
&= \operatorname{tr}\left( \sigma_{GMv} \left( \Gamma_1^t(K, \sigma_u) x_1 x_1^\top \Gamma_1^t(K, \sigma_u)^\top + \Gamma_2^t(K, \sigma_u) \Gamma_2^t(K, \sigma_u)^\top \right) \sigma_{GMv}^\top \right) \\
&= \mathbf{E}\left[ \operatorname{tr}\left( \sigma_{GMv} \operatorname{vec}(X^\top) \operatorname{vec}(X^\top)^\top \sigma_{GMv}^\top \right) \Big| x_1 \right],
\end{aligned}
\]
where the final equality follows from the fact that η_t is mean zero with identity covariance and independent from x_1, combined with the construction of Γ₁ᵗ(K, σ_u) and Γ₂ᵗ(K, σ_u). Employing the identity in (15), we find
\[
\mathbf{E}\left[ \operatorname{tr}\left( \sigma_{GMv} \operatorname{vec}(X^\top) \operatorname{vec}(X^\top)^\top \sigma_{GMv}^\top \right) \Big| x_1 \right] = \mathbf{E}\left[ v^\top M G^\top X^\top X G M v \Big| x_1 \right] = 1,
\]
where the final equality follows by the definition of M and the fact that ∥v∥² = 1. The above computations provide the identity \( \left\| \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\|^2 + \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 = 1 \). We therefore have \( \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \le 1 \) and \( \left\| \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\|^2 \le 1 \). As a result, the quantity in the denominator of the exponential in (17) is bounded as
\[
\left\| \Gamma_2^t(K, \sigma_u)^\top \sigma_{GMv}^\top \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\| \le \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\| \left\| \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1 \right\| \le \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|.
\]

Therefore, the probability bounds in (16) and (17) may be modified to read
\[
\Pr\left[ \left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 - \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \right| \ge \frac{\rho}{2} \right] \le 2\exp\left( -C \frac{\min\{\rho, \rho^2\}}{\sigma^4 \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|^2} \right),
\]
\[
\Pr\left[ \left| 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle \right| > \frac{\rho}{2} \right] \le 2\exp\left( -\frac{1}{16} \frac{\rho^2}{\sigma^2 \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|^2} \right).
\]
Union bounding over the above two events, and using the assumption that σ² ≥ 1, implies that for some universal constant C > 0,
\[
\Pr\left[ \left| \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\|^2 - \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|_F^2 \right| + \left| 2\left\langle \sigma_{GMv} \Gamma_1^t(K, \sigma_u) x_1, \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \eta_t \right\rangle \right| > \rho \right]
\le 4\exp\left( -C \frac{\min\{\rho, \rho^2\}}{\sigma^4 \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|^2} \right).
\]
We may now invoke the covering argument in Lemma 5 with ε = 1/4 to show that there exists a universal positive constant C such that
\[
\Pr\left[ \left\| M G^\top X^\top X G M - I \right\| > \rho \right] \le 4 \times 9^{d_G} \exp\left( -C \frac{\min\{\rho, \rho^2\}}{\sigma^4 \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|^2} \right). \tag{18}
\]

To conclude, we must bound the term \( \left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\| \). To do so, we write
\[
\begin{aligned}
\left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\|
&= \frac{1}{\sqrt{t}} \left\| \sigma_{GMv} H \Xi_2^t(K, \sigma_u) \right\| \\
&= \frac{1}{\sqrt{t}} \left\| \sigma_{GMv} H \left( \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) F_2^t(K, \sigma_u) + I_t \otimes \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u I_{d_U} \end{bmatrix} \right) \right\| \\
&\le \frac{1}{\sqrt{t}} \Bigg( \underbrace{\left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) F_2^t(K, \sigma_u) \right\|}_{\text{state term}} + \underbrace{\left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u I_{d_U} \end{bmatrix} \right) \right\|}_{\text{input noise term}} \Bigg).
\end{aligned}
\]
For the state term, we have
\[
\left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) F_2^t(K, \sigma_u) \right\| \le \left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) \right\| \left\| F_2^t(K, \sigma_u) \right\|.
\]
The second factor quantifies the system's amplification of an exogenous input, and is bounded by the H∞ norm of the system \( \left( A_K, \begin{bmatrix} I & \sigma_u B^\star \end{bmatrix} \right) \) (Tilli, 1998, Corollary 4.2). This, in turn, is bounded by the H∞ norm of the matrix A_K multiplied by the spectral norm of the input matrix, yielding
\[
\left\| F_2^t(K, \sigma_u) \right\| \le \left\| A_K \right\|_{\mathcal{H}_\infty} \left( \|B^\star\| + 1 \right).
\]


We may bound the first factor as
\[
\begin{aligned}
\left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) \right\|
&= \left\| \left( I_{t d_c} \otimes (GMv)^\top \right) \left( I_t \otimes \begin{bmatrix} I_{d_X + d_U} \otimes e_1 \\ \vdots \\ I_{d_X + d_U} \otimes e_{d_c} \end{bmatrix} \right) \left( I_t \otimes \begin{bmatrix} I \\ K \end{bmatrix} \right) \right\| \\
&\overset{(i)}{=} \left\| \left( I_{d_c} \otimes (GMv)^\top \right) \begin{bmatrix} I_{d_X + d_U} \otimes e_1 \\ \vdots \\ I_{d_X + d_U} \otimes e_{d_c} \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix} \right\| \\
&\overset{(ii)}{=} \sup_{w \in \mathcal{S}^{d_X}} \left\| \left( I_{d_c} \otimes (GMv)^\top \right) \begin{bmatrix} I_{d_X + d_U} \otimes e_1 \\ \vdots \\ I_{d_X + d_U} \otimes e_{d_c} \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix} w \right\| \\
&\overset{(iii)}{=} \sup_{w \in \mathcal{S}^{d_X}} \left\| \operatorname{vec}\left( v^\top M G^\top \operatorname{vec}^{-1}\left( \begin{bmatrix} I_{d_X + d_U} \otimes e_1 \\ \vdots \\ I_{d_X + d_U} \otimes e_{d_c} \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix} w \right) \right) \right\| \\
&\overset{(iv)}{=} \sup_{w \in \mathcal{S}^{d_X}} \left\| \operatorname{vec}\left( v^\top M G^\top \left( \begin{bmatrix} I \\ K \end{bmatrix} w \otimes I_{d_c} \right) \right) \right\|,
\end{aligned}
\]
where (i) follows by the following Kronecker product identities: \( I_{ab} \otimes W = I_a \otimes (I_b \otimes W) \), \( (A \otimes B)(C \otimes D) = AC \otimes BD \), and \( \|I \otimes W\| = \|W\| \). Equality (ii) follows by definition of the operator norm, and equality (iii) follows by the vectorization identity vec(W Y Z) = (Z⊤ ⊗ W) vec(Y). Equality (iv) follows by the definition of the inverse vectorization operation. The quantity in the norm above may now be written as the inner product of a vector with itself. In particular, we have
\[
\begin{aligned}
\sup_{w \in \mathcal{S}^{d_X}} \left\| \operatorname{vec}\left( v^\top M G^\top \left( \begin{bmatrix} I \\ K \end{bmatrix} w \otimes I_{d_c} \right) \right) \right\|
&= \sup_{w \in \mathcal{S}^{d_X}} \sqrt{ v^\top M G^\top \left( \begin{bmatrix} I \\ K \end{bmatrix} w w^\top \begin{bmatrix} I \\ K \end{bmatrix}^\top \otimes I_{d_c} \right) G M v } \\
&\overset{(i)}{\le} \sup_{w \in \mathcal{S}^{d_X}} \sqrt{ v^\top M G^\top \left( \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \otimes I_{d_c} \right) G M v } \\
&\overset{(ii)}{\le} \sqrt{ 2\, v^\top M G^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_c} \right) G M v } \le 2,
\end{aligned}
\]
where inequality (i) follows by the fact that ww⊤ ⪯ I, and inequality (ii) follows from the fact that \( \begin{bmatrix} I \\ K \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix}^\top \preceq 2\Sigma^t(K, \sigma_u, x_1) \), as may be verified by examining the expression for \( \bar\Sigma^t(K, \sigma_u, x_1) \)
in Lemma 2. We may similarly show that the input noise term is bounded by one:
\[
\left\| \sigma_{GMv} H \left( I_t \otimes \begin{bmatrix} 0 & 0 \\ 0 & \sigma_u I_{d_U} \end{bmatrix} \right) \right\| \le 1.
\]
Combining results, we have the bound
\[
\left\| \sigma_{GMv} \Gamma_2^t(K, \sigma_u) \right\| \le \frac{1 + 2\left\| F_2^t(K, \sigma_u) \right\|}{\sqrt{t}} \le \frac{5}{\sqrt{t}} \|P_K\|^{3/2} \Psi_{B^\star},
\]
where we used the result of Lemma 6 to bound \( \|A_K\|_{\mathcal{H}_\infty} \) in terms of \( \|P_K\| \). Substituting this result into the bound in (18), we find that for some universal constant C,
\[
\Pr\left[ \left\| M G^\top X^\top X G M - I \right\| > \rho \right] \le 4 \times 9^{d_G} \exp\left( -C \frac{\min\{\rho, \rho^2\}\, t}{\sigma^4 \|P_K\|^3 \Psi_{B^\star}^2} \right),
\]
which concludes our proof.

Lemma 8 (Covariance Concentration) For a sufficiently large universal positive constant c, we have that as long as \( t \ge c\sigma^4 \|P_K\|^3 \Psi_{B^\star}^2 \left( d_\theta + \log\frac{1}{\delta} \right) \), the event
\[
\mathcal{E}_{\mathrm{conc}} = \left\{ \frac{1}{2} \hat\Phi^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi \preceq \hat\Phi^\top \left( \frac{1}{t} \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_X} \right) \hat\Phi \preceq 2 \hat\Phi^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi \right\}
\]
holds with probability at least 1 − δ/2.

Proof The result follows by invoking Lemma 7 with G ← Φ̂ and ρ ← 1/2. We invert the probability bound by setting \( \frac{\delta}{2} = 4 \times 9^{d_\theta} \exp\left( -C \frac{t}{\sigma^4 \|P_K\|^3 \Psi_{B^\star}^2} \right) \). As long as \( t \ge c\sigma^4 \|P_K\|^3 \Psi_{B^\star}^2 \left( d_\theta + \log\frac{1}{\delta} \right) \) for a sufficiently large universal constant c > 0, the event holds with probability at least 1 − δ/2.

B.2. Self-Normalized Martingales


To present the self-normalized martingale bound we first recall the definitions of a filtration and an
adapted process.

Definition 6 (Filtration and Adapted Process) A sequence of sub-σ-algebras {Fs }ts=1 is said to be
a filtration if Fs ⊆ Fk for s ≤ k. A stochastic process {ws }ts=1 is said to be adapted to the filtration
{Fs }ts=1 if for all s ≥ 1, ws is Fs -measurable.

The following theorem is a modification of Theorem 3.4 in Abbasi-Yadkori and Szepesvári


(2011) and is a consequence of Lemma 14.7 in Peña et al. (2009).

Theorem 7 Let {F_s}_{s=0}^t be a filtration, let {z_s}_{s=1}^t be a matrix valued stochastic process assuming values in R^{d_X×d_θ} which is adapted to {F_{s−1}}_{s=1}^t, and let {w_s}_{s=1}^t be a vector valued stochastic process assuming values in R^{d_X} which is adapted to {F_s}_{s=1}^t. Additionally suppose that for all 1 ≤ s ≤ t, w_s is σ²-conditionally sub-Gaussian with respect to F_{s−1}. Let Σ be a positive definite matrix in R^{d_θ×d_θ}. Then for δ ∈ (0, 1), the following inequality holds with probability at least 1 − δ:
\[
\left\| \left( \Sigma + \sum_{s=1}^{t} z_s^\top z_s \right)^{-1/2} \sum_{s=1}^{t} z_s^\top w_s \right\|^2 \le \sigma^2 \log\left( \frac{\det\left( \Sigma + \sum_{s=1}^{t} z_s^\top z_s \right)}{\det(\Sigma)} \right) + 2\sigma^2 \log\frac{1}{\delta}.
\]

Proof The proof follows by applying Lemma 4.1 from Ziemann et al. (2023) with \( P \leftarrow \sum_{s=1}^{t} \frac{z_s^\top w_s}{\sigma} \) and \( Q \leftarrow \sum_{s=1}^{t} z_s^\top z_s \). To verify the conditions for this lemma to hold, we must verify that \( \max_{\lambda \in \mathbb{R}^{d_\theta}} \mathbf{E}\left[ \exp\left( \lambda^\top P - \frac{1}{2} \lambda^\top Q \lambda \right) \right] \le 1 \). To do so, we define the quantity
\[
M_s(\lambda) \triangleq \exp\left( \sum_{k=1}^{s} \left( \frac{\lambda^\top z_k^\top w_k}{\sigma} - \frac{1}{2} \lambda^\top z_k^\top z_k \lambda \right) \right).
\]
Observe that for any λ ∈ R^{d_θ}, M₀(λ) = 1. For any 0 < s ≤ t,
\[
\mathbf{E}\left[ M_s(\lambda) \right] = \mathbf{E}\left[ \mathbf{E}\left[ M_s(\lambda) \,|\, \mathcal{F}_{s-1} \right] \right] = \mathbf{E}\left[ M_{s-1}(\lambda)\, \mathbf{E}\left[ \exp\left( \frac{\lambda^\top z_s^\top w_s}{\sigma} - \frac{1}{2} \lambda^\top z_s^\top z_s \lambda \right) \Big| \mathcal{F}_{s-1} \right] \right].
\]
Note that by the fact that w_s is σ²-conditionally sub-Gaussian,
\[
\mathbf{E}\left[ \exp\left( \frac{\lambda^\top z_s^\top w_s}{\sigma} - \frac{1}{2} \lambda^\top z_s^\top z_s \lambda \right) \Big| \mathcal{F}_{s-1} \right]
= \mathbf{E}\left[ \exp\left( \frac{\lambda^\top z_s^\top w_s}{\sigma} \right) \Big| \mathcal{F}_{s-1} \right] \exp\left( -\frac{1}{2} \lambda^\top z_s^\top z_s \lambda \right)
\le \exp\left( \frac{1}{2} \lambda^\top z_s^\top z_s \lambda \right) \exp\left( -\frac{1}{2} \lambda^\top z_s^\top z_s \lambda \right) = 1.
\]
Therefore \( \mathbf{E}[M_t(\lambda)] \le \mathbf{E}[M_{t-1}(\lambda)] \le \dots \le \mathbf{E}[M_1(\lambda)] \le \mathbf{E}[M_0(\lambda)] = 1 \), which verifies the requisite condition.
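The guarantee of Theorem 7 can be spot-checked by Monte Carlo with an arbitrary adapted process (the process below is made up for illustration): across many independent runs, the self-normalized statistic should exceed its bound with frequency at most δ, and in practice far less, since the bound is conservative.

```python
import numpy as np

# Monte Carlo check of Theorem 7 with a simple adapted scalar-output process:
# the empirical failure frequency should not exceed delta (typically near zero).
rng = np.random.default_rng(4)
d_th, t, n_trials, delta, sigma = 2, 50, 2000, 0.1, 1.0

fails = 0
for _ in range(n_trials):
    V = np.eye(d_th)               # Sigma + sum_s z_s^T z_s, with Sigma = I
    svec = np.zeros(d_th)          # sum_s z_s^T w_s
    z = np.ones((1, d_th))         # z_1 is deterministic, hence F_0-measurable
    for _ in range(t):
        w = sigma * rng.standard_normal()
        V += z.T @ z
        svec += w * z.ravel()
        z = np.tanh(w) * np.ones((1, d_th)) + 0.1 * z   # adapted to past noise
    lhs = svec @ np.linalg.solve(V, svec)
    rhs = sigma**2 * np.log(np.linalg.det(V)) + 2 * sigma**2 * np.log(1 / delta)
    fails += lhs > rhs
```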

B.3. Variance Bound


Lemma 9 Let Σ ∈ R^{d_θ×d_θ} be positive definite. Suppose data x_{1:t+1}, u_{1:t} is generated by (8). Consider generating the least squares estimate θ̂, Λ = LS(Φ̂, x_{1:t+1}, u_{1:t}). Suppose that Λ ⪰ Σ. There exists an event E_var that holds with probability at least 1 − δ/2 under which the variance satisfies
\[
\left\| \hat\Phi\hat\theta - \hat\Phi\bar\theta \right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)} \log\left( \frac{\det(\Sigma + \Lambda)}{\det(\Sigma)\delta^2} \right) + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)^2}{\lambda_{\min}(\Sigma)^2} \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\|^2,
\]
where
\[
\mathrm{Conc}_{\mathrm{cov}}(\delta) = \left\| \Sigma^t(K, \sigma_u, x_1) \right\| \max\left\{ \sqrt{t}\, \sigma^2 \|P_K\|^{3/2} \Psi_{B^\star} \sqrt{d_X + d_U + \log\frac{1}{\delta}},\; \sigma^4 \|P_K\|^3 \Psi_{B^\star}^2 \left( d_X + d_U + \log\frac{1}{\delta} \right) \right\}.
\]

Proof By Lemma 3, θ̂ solves (11). Therefore,
\[
\sum_{s=1}^{t} \left\| x_{s+1} - \operatorname{vec}^{-1}(\hat\Phi\hat\theta) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2 \le \sum_{s=1}^{t} \left\| x_{s+1} - \operatorname{vec}^{-1}(\hat\Phi\bar\theta) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2.
\]
Inserting \( x_{s+1} = \operatorname{vec}^{-1}(\Phi^\star\theta^\star) \begin{bmatrix} x_s \\ u_s \end{bmatrix} + w_s \), we may rearrange the above inequality to read
\[
\begin{aligned}
\sum_{s=1}^{t} \left\| \operatorname{vec}^{-1}\left( \hat\Phi\bar\theta - \hat\Phi\hat\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2 \le\;&
2\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, \operatorname{vec}^{-1}\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\rangle \\
&+ 2\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, w_s \right\rangle.
\end{aligned}
\]
Multiplying the above inequality by two, and bringing one of the sums from the right side to the left, leaves us with
\[
\begin{aligned}
\sum_{s=1}^{t} \left\| \operatorname{vec}^{-1}\left( \hat\Phi\bar\theta - \hat\Phi\hat\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2 \le\;&
\underbrace{4\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, \operatorname{vec}^{-1}\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\rangle}_{\text{variance from misspecification}} \\
&+ \underbrace{4\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, w_s \right\rangle - \sum_{s=1}^{t} \left\| \operatorname{vec}^{-1}\left( \hat\Phi\left( \bar\theta - \hat\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2}_{\text{self-normalized process}}.
\end{aligned} \tag{19}
\]

Bound on Self-normalized Process: We first bound the self-normalized process above. To simplify notation, we denote \( z_s = \left( \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_X} \right) \hat\Phi \). By using the identity vec(XY Z) = (Z⊤ ⊗ X) vec(Y), we may therefore write the self-normalized process term above as
\[
4\sum_{s=1}^{t} \left\langle z_s\left( \hat\theta - \bar\theta \right), w_s \right\rangle - \sum_{s=1}^{t} \left\| z_s\left( \bar\theta - \hat\theta \right) \right\|^2
\le \sup_{\theta} \left\{ 4\sum_{s=1}^{t} \left\langle z_s\theta, w_s \right\rangle - \sum_{s=1}^{t} \|z_s\theta\|^2 \right\}
= 4\left\| \Lambda^{-1/2} \sum_{s=1}^{t} z_s^\top w_s \right\|^2.
\]
Using the lower bound Λ ⪰ Σ, we have
\[
4\left\| \Lambda^{-1/2} \sum_{s=1}^{t} z_s^\top w_s \right\|^2 \le 8\left\| (\Lambda + \Sigma)^{-1/2} \sum_{s=1}^{t} z_s^\top w_s \right\|^2.
\]
Invoking the self-normalized martingale bound in Theorem 7, we find that there exists an event E_noise that holds with probability at least 1 − δ/4 such that under the event E_noise,
\[
\left\| (\Lambda + \Sigma)^{-1/2} \sum_{s=1}^{t} z_s^\top w_s \right\|^2 \le \sigma^2 \log\left( \frac{16\det(\Sigma + \Lambda)}{\det(\Sigma)\delta^2} \right).
\]
Bound on Variance from Misspecification: We now proceed to bound the term in (19) arising due to variance from the model misspecification. To do so, note that
\[
\begin{aligned}
&\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, \operatorname{vec}^{-1}\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\rangle \\
&= \left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right)^\top \left( \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top \otimes I_{d_X} \right) \hat\Phi\left( \hat\theta - \bar\theta \right) \\
&= t\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right)^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi\left( \hat\theta - \bar\theta \right) \\
&\quad + \left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right)^\top \left( \left( \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top - t\Sigma^t(K, \sigma_u, x_1) \right) \otimes I_{d_X} \right) \hat\Phi\left( \hat\theta - \bar\theta \right).
\end{aligned}
\]
By the optimality of θ̄ in (12),
\[
\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right)^\top \left( \Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi\left( \hat\theta - \bar\theta \right) = 0.
\]
Therefore, we are left with
\[
\begin{aligned}
&\sum_{s=1}^{t} \left\langle \operatorname{vec}^{-1}\left( \hat\Phi\left( \hat\theta - \bar\theta \right) \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix}, \operatorname{vec}^{-1}\left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\rangle \\
&= \left( \Phi^\star\theta^\star - \hat\Phi\bar\theta \right)^\top \left( \left( \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top - t\Sigma^t(K, \sigma_u, x_1) \right) \otimes I_{d_X} \right) \hat\Phi\left( \hat\theta - \bar\theta \right) \\
&\le \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\| \left\| \hat\Phi\left( \hat\theta - \bar\theta \right) \right\| \left\| \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top - t\Sigma^t(K, \sigma_u, x_1) \right\|.
\end{aligned}
\]
Invoking Lemma 7 with G = I_{d_X + d_U}, we find that there exists an event E_cov that holds with probability at least 1 − δ/4 such that under E_cov,
\[
\left\| \sum_{s=1}^{t} \begin{bmatrix} x_s \\ u_s \end{bmatrix} \begin{bmatrix} x_s \\ u_s \end{bmatrix}^\top - t\Sigma^t(K, \sigma_u, x_1) \right\| \lesssim \mathrm{Conc}_{\mathrm{cov}}(\delta).
\]

Combining bounds: In light of the bounds on the self-normalized process and the variance from misspecification above, we define the event E_var = E_noise ∩ E_cov. This event holds with probability at least 1 − δ/2 by a union bound. Substituting the bounds arising under these events into (19), we find
\[
\sum_{s=1}^{t} \left\| \operatorname{vec}^{-1}\left( \hat\Phi\bar\theta - \hat\Phi\hat\theta \right) \begin{bmatrix} x_s \\ u_s \end{bmatrix} \right\|^2 \lesssim \sigma^2 \log\left( \frac{\det(\Sigma + \Lambda)}{\det(\Sigma)\delta^2} \right) + \mathrm{Conc}_{\mathrm{cov}}(\delta) \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\| \left\| \hat\Phi\left( \hat\theta - \bar\theta \right) \right\|,
\]
where we absorbed the 16 from the log in the first term into a universal constant. The left side of the above inequality may be lower bounded by \( \lambda_{\min}(\Sigma) \left\| \bar\theta - \hat\theta \right\|^2 \), which in turn equals \( \lambda_{\min}(\Sigma) \left\| \hat\Phi\bar\theta - \hat\Phi\hat\theta \right\|^2 \) by the fact that Φ̂ has orthonormal columns. We therefore obtain the following bound:
\[
\left\| \hat\Phi\bar\theta - \hat\Phi\hat\theta \right\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)} \log\left( \frac{\det(\Sigma + \Lambda)}{\det(\Sigma)\delta^2} \right) + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)}{\lambda_{\min}(\Sigma)} \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\| \left\| \hat\Phi\left( \hat\theta - \bar\theta \right) \right\|.
\]

Now, if we define
\[
a \triangleq \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)}{\lambda_{\min}(\Sigma)} \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\|, \qquad
b \triangleq \frac{\sigma^2}{\lambda_{\min}(\Sigma)} \log\left( \frac{\det(\Sigma + \Lambda)}{\det(\Sigma)\delta^2} \right), \qquad
x \triangleq \left\| \hat\Phi\hat\theta - \hat\Phi\bar\theta \right\|,
\]
then the above inequalities may be combined to read x² ≤ ax + b, from which we can conclude \( x \le \frac{1}{2}\left( a + \sqrt{a^2 + 4b} \right) \), or equivalently,
\[
x^2 \le \frac{1}{4}\left( 2a^2 + 4b + 2\sqrt{a^4 + 4a^2 b} \right) \le \frac{1}{4}\left( 2a^2 + 4b + 2a^2 + 4a^2 + 4b \right) = 2a^2 + 2b.
\]
Substituting in the original quantities of interest and absorbing the factor of two into the universal constant concludes the proof.
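The last implication (x² ≤ ax + b with a, b ≥ 0 implies x² ≤ 2a² + 2b) can be spot-checked numerically:

```python
import numpy as np

# Spot check: x^2 <= a x + b (a, b >= 0) forces x <= (a + sqrt(a^2 + 4b))/2,
# and the square of that largest root is always at most 2 a^2 + 2 b.
rng = np.random.default_rng(5)
ok = True
for _ in range(10000):
    a, b = rng.uniform(0.0, 10.0, size=2)
    x = 0.5 * (a + np.sqrt(a**2 + 4.0 * b))
    ok = ok and (x**2 <= 2.0 * a**2 + 2.0 * b + 1e-9)
```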

Lemma 10 (Variance Bound) Suppose t ≥ cτ_ls for c and τ_ls defined in Theorem 5. Then there exists an event E_ls that holds with probability at least 1 − δ under which
\[
\begin{aligned}
\left\| \hat\Phi\hat\theta - \hat\Phi\bar\theta \right\|^2 \lesssim\;& \frac{d_\theta\sigma^2}{t\,\lambda_{\min}\left( \hat\Phi^\top \left( \bar\Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi \right)} \log\frac{1}{\delta} \\
&+ \frac{\sigma^4 \left\| \Sigma^t(K, \sigma_u, x_1) \right\|^2 \|P_K\|^3 \Psi_{B^\star}^2}{t\,\lambda_{\min}\left( \hat\Phi^\top \left( \bar\Sigma^t(K, \sigma_u, x_1) \otimes I_{d_X} \right) \hat\Phi \right)^2} \left( d_X + d_U + \log\frac{1}{\delta} \right) \left\| \Phi^\star\theta^\star - \hat\Phi\bar\theta \right\|^2.
\end{aligned}
\]

Proof By the lower bound on $t$, the event $\mathcal{E}_{\mathrm{conc}}$ defined in Lemma 8 holds with probability at least $1 - \delta/2$. Denoting $\Sigma \leftarrow \frac{t}{2}\hat\Phi^\top\big(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi$, we may invoke Lemma 9 to show that there exists an event $\mathcal{E}_{\mathrm{var}}$ that holds with probability at least $1 - \delta/2$ such that conditioned on $\mathcal{E}_{\mathrm{ls}} = \mathcal{E}_{\mathrm{conc}} \cap \mathcal{E}_{\mathrm{var}}$,

$$\big\|\hat\Phi\hat\theta - \hat\Phi\bar\theta\big\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2} + \frac{\mathrm{Conc}_{\mathrm{cov}}(\delta)^2}{\lambda_{\min}(\Sigma)^2}\big\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\big\|^2.$$

By a union bound, $\mathcal{E}_{\mathrm{ls}}$ holds with probability at least $1 - \delta$. Substituting the definition of $\mathrm{Conc}_{\mathrm{cov}}$ from Lemma 9, we may simplify the above bound to find

$$\big\|\hat\Phi\hat\theta - \hat\Phi\bar\theta\big\|^2 \lesssim \frac{\sigma^2}{\lambda_{\min}(\Sigma)}\log\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2} + \frac{\sigma^4\big\|\Sigma^t(K,\sigma_u,x_1)\big\|^2\|P_K\|^3\Psi_{B^\star}^2}{\lambda_{\min}(\Sigma)^2}\Big(d_X + d_U + \log\frac{t}{\delta}\Big)\big\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\big\|^2,$$
where we have used the fact that
$$t \ge c\tau_{\mathrm{ls}} \ge \sigma^4\|P_K\|^3\Psi_{B^\star}^2\Big(d_X + d_U + \log\frac{1}{\delta}\Big),$$
to show that $\mathrm{Conc}_{\mathrm{cov}}$ assumes the quantity that grows with $t$.


The upper bound from the event $\mathcal{E}_{\mathrm{conc}}$ provides
$$\Lambda \preceq 2t\,\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi \preceq 2t\Big(1 + \frac{\|P_K\|\,\|x_1\|^2}{t-1}\Big)\hat\Phi^\top\big(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi,$$
where the final inequality follows from Lemma 2 point 4. The log term may then be bounded as
 2

1 + ∥PK ∥ ∥xt−1
1∥
 
det(Σ + Λ)
log ≤ dθ log4 
det(Σ)δ 2 δ2

2
By the fact that $t \ge c\tau_{\mathrm{ls}} \ge \|x_1\|^2\|P_K\| + 1$, we have $\frac{\|P_K\|\,\|x_1\|^2}{t-1} \le 1$. Therefore, the above quantity is bounded by $d_\theta\log\frac{8}{\delta^2} \lesssim d_\theta\log\frac{1}{\delta}$.
Substituting the expression for $\Sigma$ tells us that
$$\lambda_{\min}(\Sigma) = \frac{t}{2}\,\lambda_{\min}\Big(\hat\Phi^\top\big(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\Big).$$
Substituting this into the bound on $\big\|\hat\Phi\hat\theta - \hat\Phi\bar\theta\big\|^2$ along with the upper bound on $\log\frac{\det(\Sigma+\Lambda)}{\det(\Sigma)\delta^2}$ completes the proof.

B.4. Bias Bound


Lemma 11 The bias of the estimate arising in Lemma 4 is bounded as
$$\big\|\hat\Phi\bar\theta - \Phi^\star\theta^\star\big\|^2 \le 4\,\frac{\big\|\Sigma^t(K,\sigma_u,x_1)\big\|}{\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2.$$

Proof We first left multiply the quantity in the norm by the identity, written as the matrix $\hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top$ left multiplied by its inverse:
$$\big\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\big\|^2 = \Big\|\big(\hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\big(\hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)\big(\Phi^\star\theta^\star - \hat\Phi\bar\theta\big)\Big\|^2.$$

To simplify the above quantity, firstly observe that the term
$$\hat\Phi_\perp\hat\Phi_\perp^\top\big(\Phi^\star\theta^\star - \hat\Phi\bar\theta\big) = \hat\Phi_\perp\hat\Phi_\perp^\top\Phi^\star\theta^\star \qquad(20)$$
by the fact that $\hat\Phi_\perp^\top\hat\Phi = 0$. Second, note that the term
 
$$\hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top\big(\Phi^\star\theta^\star - \hat\Phi\bar\theta\big) = \hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top\Phi^\star\theta^\star - \hat\Phi\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\bar\theta$$
by the fact that the columns of $\hat\Phi$ are orthonormal. We now substitute the optimal projection
$$\bar\theta = \Big(\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi\Big)^{-1}\hat\Phi^\top\big(\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\Phi^\star\theta^\star$$
into the above to find that


 
$$\begin{aligned}
&\hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top\big(\Phi^\star\theta^\star - \hat\Phi\bar\theta\big)\\
&= \hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top\Phi^\star\theta^\star - \hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi\Big(\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi\Big)^{-1}\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\Phi^\star\theta^\star\\
&= \hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi\hat\Phi^\top\Phi^\star\theta^\star - \hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\Phi^\star\theta^\star\\
&= \hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\big(\hat\Phi\hat\Phi^\top - I\big)\Phi^\star\theta^\star\\
&= -\hat\Phi\hat\Phi^\top\big(\Sigma^t\otimes I_{d_X}\big)\hat\Phi_\perp\hat\Phi_\perp^\top\Phi^\star\theta^\star,
\end{aligned}$$
where we abbreviate $\Sigma^t = \Sigma^t(K,\sigma_u,x_1)$.

Combining the above result with (20), we find that
$$\begin{aligned}
\big\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\big\|^2 &= \Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\big(I - \hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\big)\hat\Phi_\perp\hat\Phi_\perp^\top\Phi^\star\theta^\star\Big\|^2\\
&\le \Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\big(I - \hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\big)\hat\Phi_\perp\Big\|^2\,\big\|\hat\Phi_\perp^\top\Phi^\star\big\|^2\,\|\theta^\star\|^2. \qquad(21)
\end{aligned}$$
Application of the triangle inequality and Cauchy-Schwarz yields
$$\begin{aligned}
&\Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\big(I - \hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\big)\hat\Phi_\perp\Big\|^2\\
&\le 2\Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\hat\Phi_\perp\Big\|^2 + 2\Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi_\perp\Big\|^2. \qquad(22)
\end{aligned}$$
The first term may be simplified as follows:
$$\begin{aligned}
\Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\hat\Phi_\perp\Big\|^2 &= \left\|\left(\begin{bmatrix}\hat\Phi & \hat\Phi_\perp\end{bmatrix}\begin{bmatrix}\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi & \\ & I\end{bmatrix}\begin{bmatrix}\hat\Phi & \hat\Phi_\perp\end{bmatrix}^\top\right)^{-1}\hat\Phi_\perp\right\|^2 \qquad(23)\\
&= \left\|\begin{bmatrix}\hat\Phi & \hat\Phi_\perp\end{bmatrix}\begin{bmatrix}\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi & \\ & I\end{bmatrix}^{-1}\begin{bmatrix}\hat\Phi & \hat\Phi_\perp\end{bmatrix}^\top\hat\Phi_\perp\right\|^2\\
&= 1.
\end{aligned}$$
For the second term, we may similarly show that
$$\begin{aligned}
\Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top + \hat\Phi_\perp\hat\Phi_\perp^\top\big)^{-1}\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi_\perp\Big\|^2 &= \Big\|\big(\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi\hat\Phi^\top\big)^\dagger\hat\Phi\hat\Phi^\top(\Sigma^t\otimes I_{d_X})\hat\Phi_\perp\Big\|^2\\
&\le \frac{\big\|\Sigma^t(K,\sigma_u,x_1)\big\|}{\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)},
\end{aligned}$$
where the final inequality follows from submultiplicativity, the fact that $\hat\Phi$ has orthonormal columns, and the fact that $\bar\Sigma^t(K,\sigma_u,x_1) \preceq \Sigma^t(K,\sigma_u,x_1)$.
The lemma follows by substituting the above inequality along with (23) into (22). The result is then substituted into (21), where we recall the definition of $d(\hat\Phi, \Phi^\star)$ from Theorem 1.

B.5. Proof of Theorem 5


Proof Combining the results of Lemma 4 with Lemma 10, we find that under the event $\mathcal{E}_{\mathrm{ls}}$ of Lemma 10,
$$\big\|\hat\Phi\hat\theta - \Phi^\star\theta^\star\big\|^2 \lesssim \frac{d_\theta\sigma^2}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}\log\frac{1}{\delta} + \left(1 + \frac{\sigma^4\big\|\Sigma^t(K,\sigma_u,x_1)\big\|^2\|P_K\|^3\Psi_{B^\star}^2}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)^2}\Big(d_X + d_U + \log\frac{1}{\delta}\Big)\right)\big\|\Phi^\star\theta^\star - \hat\Phi\bar\theta\big\|^2.$$


Using Lemma 11, the above may be bounded as
$$\begin{aligned}
\big\|\hat\Phi\hat\theta - \Phi^\star\theta^\star\big\|^2 &\lesssim \frac{d_\theta\sigma^2}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}\log\frac{1}{\delta}\\
&\quad + \left(1 + \frac{\sigma^4\big\|\Sigma^t(K,\sigma_u,x_1)\big\|^2\|P_K\|^3\Psi_{B^\star}^2}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)^2}\Big(d_X + d_U + \log\frac{1}{\delta}\Big)\right)\frac{\big\|\Sigma^t(K,\sigma_u,x_1)\big\|}{\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2.
\end{aligned}$$
To conclude, we use the upper bounds on $\big\|\Sigma^t(K,\sigma_u,x_1)\big\|$ from Lemma 2:
$$\big\|\Sigma^t(K,\sigma_u,x_1)\big\| \le \Big(1 + \frac{\|P_K\|\,\|x_1\|^2}{t-1}\Big)\big\|\bar\Sigma^t(K,\sigma_u,x_1)\big\| \le 5\Big(1 + \frac{\|P_K\|\,\|x_1\|^2}{t-1}\Big)\|P_K\|^2\Psi_{B^\star}^2.$$
By the fact that $t \ge c\tau_{\mathrm{ls}}$, the above quantity is bounded as $\big\|\Sigma^t(K,\sigma_u,x_1)\big\| \le 10\|P_K\|^2\Psi_{B^\star}^2$. Therefore we find that
$$\big\|\hat\Phi\hat\theta - \Phi^\star\theta^\star\big\|^2 \lesssim \frac{d_\theta\sigma^2}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}\log\frac{1}{\delta} + \left(1 + \frac{\sigma^4\|P_K\|^7\Psi_{B^\star}^6\big(d_X + d_U + \log\frac{1}{\delta}\big)}{t\,\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)^2}\right)\frac{\|P_K\|^2\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)}.$$

Appendix C. High Probability Bounds on the Success Events

We first state a more complete counterpart of Lemma 1 that also characterizes the error in the learned controller gain and the Lyapunov equation solution $P_{\hat K}$ under this controller.
Lemma 12 (Theorem 3 of Simchowitz and Foster (2020)) Define $\varepsilon \triangleq \frac{1}{2916\|P^\star(A^\star,B^\star)\|^{10}}$. As long as $\big\|[\hat A\ \hat B] - [A^\star\ B^\star]\big\|_F^2 \le \varepsilon$, we have that $P_{\hat K} \preceq \frac{21}{20}P^\star$, $\big\|\hat K - K^\star\big\| \le \frac{1}{6\|P^\star\|^{3/2}}$, and
$$\mathcal{J}(\hat K) - \mathcal{J}(K^\star) \le 142\,\|P^\star\|^8\,\big\|[\hat A\ \hat B] - [A^\star\ B^\star]\big\|_F^2.$$


Lemma 13 Let $\delta \in (0,1)$. With probability at least $1 - \delta$,
$$\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le 4\sigma\sqrt{(d_X + d_U)\log\frac{T}{\delta}}.$$


Proof The vectors $w_t$ and $g_t$ have $\sigma^2$-sub-Gaussian components. Therefore, a sub-Gaussian tail bound tells us that for any unit vector $v \in \mathbb{R}^{d_X+d_U}$,
$$\mathbb{P}\left(v^\top\begin{bmatrix}w_t\\ g_t\end{bmatrix} \ge \rho\right) \le \exp\Big(-\frac{\rho^2}{2\sigma^2}\Big).$$
By a covering argument, we therefore conclude
$$\mathbb{P}\left(\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \ge \rho\right) \le 5^{d_X+d_U}\exp\Big(-\frac{\rho^2}{8\sigma^2}\Big).$$
Union bounding over the $T$ timesteps yields
$$\mathbb{P}\left(\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \ge \rho\right) \le T\,5^{d_X+d_U}\exp\Big(-\frac{\rho^2}{8\sigma^2}\Big).$$
Setting $\rho = 4\sigma\sqrt{(d_X+d_U)\log\frac{T}{\delta}}$, the above probability becomes
$$T\,5^{d_X+d_U}\exp\Big(-2(d_X+d_U)\log\frac{T}{\delta}\Big) \le T\,5^{d_X+d_U}\exp\Big(-2(d_X+d_U) - \log\frac{T}{\delta}\Big) \le \delta\big(5\exp(-2)\big)^{d_X+d_U} \le \delta.$$
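The high-probability bound of Lemma 13 can be spot-checked by simulation. The sketch below is illustrative only; the dimension $d$ (standing in for $d_X + d_U$), $\sigma$, $T$, and $\delta$ are arbitrary choices:

```python
import math
import random

random.seed(0)

d, sigma, T, delta = 4, 1.0, 200, 0.05  # d plays the role of d_X + d_U
bound = 4 * sigma * math.sqrt(d * math.log(T / delta))

trials, failures = 500, 0
for _ in range(trials):
    # Max Euclidean norm over T independent Gaussian vectors in R^d.
    max_norm = max(
        math.sqrt(sum(random.gauss(0, sigma) ** 2 for _ in range(d)))
        for _ in range(T)
    )
    failures += max_norm > bound

# The bound should fail with probability at most delta.
assert failures / trials <= delta
print(f"empirical failure rate: {failures / trials:.3f} (allowed: {delta})")
```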

Lemma 14 Consider rolling out the system $x_{s+1} = A^\star x_s + B^\star u_s + w_s$ from initial state $x_1$ for $t$ timesteps under the control action $u_s = Kx_s + \sigma_u g_s$, where $K$ is stabilizing and $\sigma_u \le 1$. Then the following bound holds:
$$\|x_t\| \le \sqrt{\Big(1 - \frac{1}{\|P_K\|}\Big)^{t-1}\|P_K\|}\;\|x_1\| + 2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
 
Proof The states may be expressed as $x_t = A_K^{t-1}x_1 + \sum_{k=0}^{t-1}A_K^{t-1-k}\begin{bmatrix}I & \sigma_u B^\star\end{bmatrix}\begin{bmatrix}w_k\\ g_k\end{bmatrix}$. Therefore, by Lemma 6,
$$\begin{aligned}
\|x_t\| &\le \big\|A_K^{t-1}\big\|\,\|x_1\| + \sum_{k=0}^{t-1}\big\|A_K^{t-1-k}\big\|\,\big\|\begin{bmatrix}I & \sigma_u B^\star\end{bmatrix}\big\|\,\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|\\
&\le \sqrt{\Big(1 - \frac{1}{\|P_K\|}\Big)^{t-1}\|P_K\|}\;\|x_1\| + 2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.
\end{aligned}$$
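The contraction estimate invoked via Lemma 6, $\|A_K^t\| \le \sqrt{\|P_K\|}\,(1 - 1/\|P_K\|)^{t/2}$, can be verified directly in the scalar case, where the Lyapunov solution with $Q = R = 1$ is $P_K = (1+k^2)/(1-a_{\mathrm{cl}}^2)$. A small illustrative check (the scalar systems below are arbitrary test values, not quantities from the paper):

```python
import math

# Scalar closed loop a_cl = a + b*k; P_K solves P = 1 + k^2 + a_cl^2 * P.
for a, b, k in [(0.9, 1.0, -0.5), (1.2, 1.0, -0.9), (0.5, 2.0, -0.3)]:
    a_cl = a + b * k
    assert abs(a_cl) < 1, "controller must be stabilizing"
    P = (1 + k * k) / (1 - a_cl * a_cl)  # scalar Lyapunov solution, Q = R = 1
    for t in range(1, 50):
        # |A_K^t| <= sqrt(P_K) * (1 - 1/P_K)^(t/2)
        assert abs(a_cl) ** t <= math.sqrt(P) * (1 - 1 / P) ** (t / 2) + 1e-12

print("scalar contraction bound holds for all tested systems and horizons")
```

The inequality holds here because $a_{\mathrm{cl}}^2 = 1 - (1+k^2)/P_K \le 1 - 1/P_K$ in the scalar case, mirroring the matrix argument.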

Lemma 15 Consider rolling out the system $x_{s+1} = A^\star x_s + B^\star u_s + w_s$ from initial state $x_1$ for $t$ timesteps under the control action $u_s = Kx_s + \sigma_u g_s$, where $K$ is stabilizing and $\sigma_u \le 1$. Suppose

• $\|x_1\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|$

• $\|P_K\| \le 2\|P_{K_0}\|$

• $t \ge \log_{\frac{1}{1 - 1/\|P_K\|}}\big(4\|P_K\|\big) + 1$.

Then for $s = 1, \dots, t$,
$$\|x_s\| \le 40\|P_{K_0}\|^2\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
Furthermore,
$$\|x_t\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
Proof By Lemma 14,
$$\|x_s\| \le \|P_K\|^{1/2}\|x_1\| + 2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le 2\|P_{K_0}\|^{1/2}\|x_1\| + 8\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
The bound in the first point then follows by substituting in the bound for $\|x_1\|$. For the second point, note that by Lemma 14 and the fact that $t \ge \log_{\frac{1}{1-1/\|P_K\|}}\big(4\|P_K\|\big) + 1$,
$$\|x_t\| \le \frac12\|x_1\| + 2\|P_K\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le \frac12\|x_1\| + 8\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
Lemma 16 Running Algorithm 1 with the arguments defined in Theorem 2, the event Esuccess,1 holds
with probability at least 1 − T −2 .

Proof By Lemma 13, we have that with probability at least $1 - \frac12 T^{-2}$,
$$\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le 4\sigma\sqrt{3(d_X+d_U)\log 2T}.$$


Denote the event that the above inequality holds by $\mathcal{E}_{w\,\mathrm{bound}}$. We show by induction that the requisite conditions of $\mathcal{E}_{\mathrm{success},1}$ hold for all epochs. A key step in this argument is instantiating the result of Theorem 5 for every epoch. The rate of decay in this bound is determined by the minimum eigenvalue of the state and input covariance, $\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)$. We control this quantity by invoking point 3 of Lemma 2 along with the facts that 1) $\hat\Phi$ has orthonormal columns and 2) $\sigma_u = \sigma_k \le 1$ for all epochs due to the lower bound on $\tau_1$. In particular, we may show that
$$\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big) \ge \frac{\sigma_u^2}{2(2 + 2\|K\|^2)} \ge \frac{\sigma_u^2}{8\|P_K\|}, \qquad(24)$$
where the final inequality follows by noting that $2 + 2\|K\|^2 \le 2 + 2\|P_K\| \le 4\|P_K\|$ by the fact that $\|P_K\| \ge 1$.
Base Case: For the first epoch, observe that our lower bound on $\tau_1$ ensures that $\tau_1 \ge c\tau_{\mathrm{ls}}(K_0, 0, \frac12 T^{-3})$ for a sufficiently large constant $c$. Therefore applying Theorem 5 along with (24) implies that there exists an event $\mathcal{E}_{\mathrm{ls},1}$ which holds with probability at least $1 - \frac12 T^{-3}$ such that under $\mathcal{E}_{\mathrm{ls},1}$,
$$\big\|[\hat A_1\ \hat B_1] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{d_\theta\sigma^2\|P_{K_0}\|}{\tau_1\sigma_1^2}\log(T) + \left(1 + \frac{\sigma^4\|P_{K_0}\|^9\Psi_{B^\star}^6(d_X + d_U + \log(T))}{\tau_1\sigma_1^4}\right)\frac{\|P_{K_0}\|^3\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\sigma_1^2}.$$

Using the fact that $\sigma_1^2 \ge \frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_1}}$ to simplify the first two appearances of $\sigma_1$, and $\sigma_1^2 \ge d(\hat\Phi,\Phi^\star)$ for the final appearance, the above bound simplifies to
$$\big\|[\hat A_1\ \hat B_1] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{\sqrt{d_\theta d_U}\,\sigma^2\|P_{K_0}\|}{\sqrt{\tau_1}}\log(T) + \sqrt{d_\theta/d_U}\,\sigma^4\|P_{K_0}\|^{12}\Psi_{B^\star}^8(d_X + d_U + \log(T))\,d(\hat\Phi,\Phi^\star)\|\theta^\star\|^2.$$
By recalling the definition of $\beta_1$ from Assumption 2, the above bound can be simplified to
$$\big\|[\hat A_1\ \hat B_1] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},1}\,\sigma^2\frac{\sqrt{d_\theta d_U}\,\|P_{K_0}\|}{\sqrt{\tau_1}}\log(T) + \beta_1\log(T)\,d(\hat\Phi,\Phi^\star).$$
Meanwhile, Lemma 15 tells us that under the event $\mathcal{E}_{w\,\mathrm{bound}}$, $\|x_t\|^2 \le x_b^2\log T$ for $t = 1,\dots,\tau_1$, and that $\|x_{\tau_1}\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|$. Finally, we have $\|K_0\| \le K_b$.
Induction step: Suppose
$$\|x_{\tau_k}\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|$$
and
$$\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},1}\,\sigma^2\frac{\sqrt{d_\theta d_U}\,\|P_{K_0}\|}{\sqrt{\tau_1 2^{k-1}}}\log(T) + \beta_1\log(T)\,d(\hat\Phi,\Phi^\star).$$

By the assumptions that $\tau_1 \ge c\Big(\frac{\sigma^2\sqrt{d_\theta d_U}\,\|P_{K_0}\|}{2\varepsilon}\Big)^2$ for a sufficiently large constant $c$ and that $d(\hat\Phi,\Phi^\star) \le \frac{\varepsilon}{2\beta_1\log(T)}$, we have that $\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le \varepsilon$. Then the conditions of Lemma 12 are satisfied, and consequently, $\big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P^\star\|$. Therefore
$$\tau_{\mathrm{ls}}\big(\hat K_{k+1}, x_b^2\log T, \tfrac12 T^{-3}\big) \le 2\tau_{\mathrm{ls}}\big(K^\star, x_b^2\log T, \tfrac12 T^{-3}\big) \quad\text{and}\quad \big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P_{K_0}\| \le 2\|P_{K_0}\|.$$
We may use our lower bound on $\tau_1$ to conclude that $\tau_1 \ge c\tau_{\mathrm{ls}}(K^\star, x_b^2\log T, \frac12 T^{-3})$ for a sufficiently large universal constant $c$. This guarantees that the conditions of Theorem 5 are satisfied. As a result, there exists an event $\mathcal{E}_{\mathrm{ls},k+1}$ that holds with probability at least $1 - \frac12 T^{-3}$ under which
$$\big\|[\hat A_{k+1}\ \hat B_{k+1}] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{d_\theta\sigma^2\big\|P_{\hat K_{k+1}}\big\|}{(\tau_{k+1}-\tau_k)\sigma_{k+1}^2}\log(T) + \left(1 + \frac{\sigma^4\big\|P_{\hat K_{k+1}}\big\|^9\Psi_{B^\star}^6(d_X + d_U + \log(T))}{(\tau_{k+1}-\tau_k)\sigma_{k+1}^4}\right)\frac{\big\|P_{\hat K_{k+1}}\big\|^3\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\sigma_{k+1}^2},$$

where we have again used (24) to simplify the bound. Observing that $\tau_{k+1} - \tau_k = \frac12\tau_{k+1}$, and using the lower bounds $\sigma_{k+1}^2 \ge \frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_1 2^k}} = \frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_{k+1}}}$ and $\sigma_{k+1}^2 \ge \gamma \ge d(\hat\Phi,\Phi^\star)$, the above bound simplifies to

$$\big\|[\hat A_{k+1}\ \hat B_{k+1}] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{\sqrt{d_\theta d_U}\,\sigma^2\big\|P_{\hat K_{k+1}}\big\|}{\sqrt{\tau_{k+1}}}\log(T) + \sqrt{d_\theta/d_U}\,\sigma^4\big\|P_{\hat K_{k+1}}\big\|^{12}\Psi_{B^\star}^8(d_X + d_U + \log(T))\,d(\hat\Phi,\Phi^\star)\|\theta^\star\|^2.$$

Using the fact that $\big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P^\star\| \le 1.05\|P_{K_0}\|$, the above bound may be simplified to
$$\big\|[\hat A_{k+1}\ \hat B_{k+1}] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},1}\,\sigma^2\frac{\sqrt{d_\theta d_U}\,\|P_{K_0}\|}{\sqrt{\tau_{k+1}}}\log(T) + \beta_1\,d(\hat\Phi,\Phi^\star)\log(T),$$

where the powers of 1.05 are absorbed into the universal constants, and β1 is as defined in Assump-
tion 2.
Furthermore, Lemma 15 (which applies due to the inductive hypothesis, the lower bound on $\tau_1$, and the fact that $\big\|P_{\hat K_{k+1}}\big\| \le 2\|P_{K_0}\|$) informs us that under the event $\mathcal{E}_{w\,\mathrm{bound}}$,
$$\|x_t\|^2 \le x_b^2\log T \quad \forall t = \tau_k + 1,\dots,\tau_{k+1}, \qquad\text{and}\qquad \big\|x_{\tau_{k+1}}\big\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
Also observe that $\big\|\hat K_{k+1}\big\|^2 \le \big\|P_{\hat K_{k+1}}\big\| \le 2\|P_{K_0}\|$, which ensures that $\big\|\hat K_{k+1}\big\| \le 2\|K^\star\| \le K_b$.
In particular, we have verified that the algorithm does not abort on this epoch, and we have verified
that the inductive hypothesis holds to start the next epoch.
Union bounding over the events: The bounds on the norm of the state and the controller gain ensure that the algorithm does not abort. The estimation error bounds ensure that
$$\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},1}\,\sigma^2\frac{\sqrt{d_\theta d_U}\,\|P_{K_0}\|}{\sqrt{\tau_k}}\log(T) + \beta_1\log(T)\,d(\hat\Phi,\Phi^\star) \quad \forall k \ge 1.$$
Therefore, $\mathcal{E}_{w\,\mathrm{bound}} \cap \mathcal{E}_{\mathrm{ls},1} \cap \cdots \cap \mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}} \subseteq \mathcal{E}_{\mathrm{success},1}$. Union bounding over the good events $\mathcal{E}_{w\,\mathrm{bound}}, \mathcal{E}_{\mathrm{ls},1}, \dots, \mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$, we find that $\mathcal{E}_{\mathrm{success},1}$ holds with probability at least $1 - T^{-2}$.

Lemma 17 Running Algorithm 1 with the arguments defined in Theorem 3, the event Esuccess,2 holds
with probability at least 1 − T −2 .

Proof By Lemma 13, we have that with probability at least $1 - \frac12 T^{-2}$,
$$\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\| \le 4\sigma\sqrt{3(d_X+d_U)\log 2T}.$$
Denote the event that the above inequality holds by $\mathcal{E}_{w\,\mathrm{bound}}$. We show by induction that the requisite conditions of $\mathcal{E}_{\mathrm{success},2}$ hold for all epochs. A key step in this argument is instantiating the result of Theorem 5 for every epoch. The rate of decay in this bound is determined by the minimum eigenvalue of the state and input covariance, $\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big)$. We control this quantity by observing that for any unit vector $v$,
$$\begin{aligned}
v^\top\hat\Phi^\top\big(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X}\big)\hat\Phi v &\ge \frac12 v^\top\hat\Phi^\top\left(\begin{bmatrix}I\\ K\end{bmatrix}\begin{bmatrix}I\\ K\end{bmatrix}^\top\otimes I_{d_X}\right)\hat\Phi v\\
&= \frac12\left(\sum_{i=1}^{d_\theta} v_i\hat\Phi_i\right)^{\!\top}\left(\begin{bmatrix}I\\ K\end{bmatrix}\begin{bmatrix}I\\ K\end{bmatrix}^\top\otimes I_{d_X}\right)\left(\sum_{i=1}^{d_\theta} v_i\hat\Phi_i\right) \qquad(25)\\
&= \frac12\left\|\operatorname{vec}^{-1}\left(\sum_{i=1}^{d_\theta} v_i\hat\Phi_i\right)\begin{bmatrix}I\\ K\end{bmatrix}\right\|_F^2\\
&= \frac12\left\|\sum_{i=1}^{d_\theta} v_i\big(\hat\Phi_i^A + \hat\Phi_i^B K\big)\right\|_F^2.
\end{aligned}$$

We use this result along with Assumption 3 to show a lower bound $\lambda_{\min}\big(\hat\Phi^\top(\bar\Sigma^t(K,\sigma_u,x_1)\otimes I_{d_X})\hat\Phi\big) \ge \frac{\alpha^2}{8}$, and in turn show that the event $\mathcal{E}_{\mathrm{success},2}$ holds with high probability. This argument proceeds by induction.


Base Case: For the first epoch, observe that our lower bound on $\tau_1$ guarantees that $\tau_1 \ge c\tau_{\mathrm{ls}}(K_0, 0, \frac12 T^{-3})$ for a sufficiently large universal constant $c$. Therefore applying Theorem 5 along with (25) and Assumption 3 implies that there exists an event $\mathcal{E}_{\mathrm{ls},1}$ which holds with probability at least $1 - \frac12 T^{-3}$ such that under $\mathcal{E}_{\mathrm{ls},1}$,
$$\big\|[\hat A_1\ \hat B_1] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{d_\theta\sigma^2}{\tau_1\alpha^2}\log T + \left(1 + \frac{\sigma^4\|P_{K_0}\|^7\Psi_{B^\star}^6(d_X + d_U + \log T)}{\tau_1\alpha^4}\right)\frac{\|P_{K_0}\|^2\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\alpha^2}.$$
Using the fact that $\tau_1 = \tau_{\mathrm{warm\,up}}\log T$, we can cancel the $\log T$ in the numerator of the second term. We can further use the fact that $\tau_{\mathrm{warm\,up}} \ge c\frac{\sigma^4 d_\theta}{\epsilon\alpha^2}$ for some universal positive constant $c$ to simplify the second term to $\beta_2\,d(\hat\Phi,\Phi^\star)^2$, where $\beta_2$ is as in Assumption 4. Then the above bound may be simplified to
$$\big\|[\hat A_1\ \hat B_1] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},2}\,\frac{\sigma^2 d_\theta\log T}{\tau_1\alpha^2} + \beta_2\,d(\hat\Phi,\Phi^\star)^2,$$
where $C_{\mathrm{est},2}$ is a sufficiently large universal constant. Meanwhile, Lemma 15 tells us that under the event $\mathcal{E}_{w\,\mathrm{bound}}$, $\|x_t\|^2 \le x_b^2\log T$ for $t = 1,\dots,\tau_1$, and that $\|x_{\tau_1}\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|$. Finally, we have $\|K_0\| \le K_b$.
Induction step: Suppose
$$\|x_{\tau_k}\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|$$
and
$$\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},2}\,\frac{\sigma^2 d_\theta\log T}{\tau_k\alpha^2} + \beta_2\,d(\hat\Phi,\Phi^\star)^2.$$
By the lower bound on $\tau_1$, we have that $\tau_1 \ge c\frac{\sigma^2 d_\theta\log T}{2\varepsilon\alpha^2}$ for a sufficiently large universal constant $c$. Combining this with the condition in Assumption 4, which says $d(\hat\Phi,\Phi^\star) \le \sqrt{\frac{\varepsilon}{2\beta_2}}$, we find $\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le \varepsilon$. Then the conditions of Lemma 12 are satisfied, and consequently, $\big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P^\star\|$. Therefore


$$\tau_{\mathrm{ls}}\big(\hat K_{k+1}, x_b^2\log T, \tfrac12 T^{-3}\big) \le 2\tau_{\mathrm{ls}}\big(K^\star, x_b^2\log T, \tfrac12 T^{-3}\big) \quad\text{and}\quad \big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P_{K_0}\| \le 2\|P_{K_0}\|.$$
We may guarantee from our lower bound on $\tau_1$ that $\tau_1 \ge c\tau_{\mathrm{ls}}(K^\star, x_b^2\log T, \frac12 T^{-3})$ for a sufficiently large universal constant $c$, which ensures that the conditions of Theorem 5 are satisfied. As a result, there exists an event $\mathcal{E}_{\mathrm{ls},k+1}$ that holds with probability at least $1 - \frac12 T^{-3}$ under which
$$\big\|[\hat A_{k+1}\ \hat B_{k+1}] - [A^\star\ B^\star]\big\|_F^2 \lesssim \frac{d_\theta\sigma^2}{(\tau_{k+1}-\tau_k)\alpha^2}\log T + \left(1 + \frac{\sigma^4\big\|P_{\hat K_{k+1}}\big\|^7\Psi_{B^\star}^6(d_X + d_U + \log T)}{(\tau_{k+1}-\tau_k)\alpha^4}\right)\frac{\big\|P_{\hat K_{k+1}}\big\|^2\Psi_{B^\star}^2\,d(\hat\Phi,\Phi^\star)^2\|\theta^\star\|^2}{\alpha^2},$$

 
where we lower bounded $\lambda_{\min}\big(\hat\Phi^\top\big(\bar\Sigma^{\tau_{k+1}-\tau_k}(\hat K_{k+1}, \sigma_{k+1}, x_{\tau_k+1})\otimes I_{d_X}\big)\hat\Phi\big)$ using (25) along with the following reverse triangle inequality combined with Assumption 3 and Lemma 12:
$$\left\|\sum_{i=1}^{d_\theta} v_i\big(\hat\Phi_i^A + \hat\Phi_i^B K^\star\big) + \sum_{i=1}^{d_\theta} v_i\hat\Phi_i^B\big(\hat K_{k+1} - K^\star\big)\right\|_F \ge \alpha - \big\|\hat K_{k+1} - K^\star\big\| \ge \alpha - \frac{1}{6\|P^\star\|^{3/2}} \ge \frac{\alpha}{2}.$$

Observing that $\tau_{k+1} - \tau_k = \frac12\tau_{k+1}$, and again using the fact that $\hat K_{k+1} = K^\star(\hat A_k, \hat B_k)$ where $\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le \varepsilon$, we have that $\big\|P_{\hat K_{k+1}}\big\| \le 1.05\|P^\star\|$. Therefore by absorbing powers of 1.05 into the universal constants, and following the same simplifications taken in the base case, the estimation error bound may be simplified to
$$\big\|[\hat A_{k+1}\ \hat B_{k+1}] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},2}\,\frac{\sigma^2 d_\theta\log T}{\tau_{k+1}\alpha^2} + \beta_2\,d(\hat\Phi,\Phi^\star)^2.$$
Furthermore, Lemma 15 (which applies due to the inductive hypothesis, the lower bound on $\tau_1$, and the fact that $\big\|P_{\hat K_{k+1}}\big\| \le 2\|P_{K_0}\|$) informs us that under the event $\mathcal{E}_{w\,\mathrm{bound}}$,
$$\|x_t\|^2 \le x_b^2\log T \quad\forall t = \tau_k+1,\dots,\tau_{k+1}, \qquad\text{and}\qquad \big\|x_{\tau_{k+1}}\big\| \le 16\|P_{K_0}\|^{3/2}\Psi_{B^\star}\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|.$$
Also observe that $\big\|\hat K_{k+1}\big\|^2 \le \big\|P_{\hat K_{k+1}}\big\| \le 2\|P_{K_0}\|$, which ensures that $\big\|\hat K_{k+1}\big\| \le 2\|K^\star\| \le K_b$. In particular, we have verified that the algorithm does not abort on this epoch, and we have verified that the inductive hypothesis holds to start the next epoch.
Union bounding over the events: The bounds on the norm of the state and the controller gain ensure that the algorithm does not abort. The estimation error bounds ensure that
$$\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},2}\,\frac{\sigma^2 d_\theta\log T}{\tau_k\alpha^2} + \beta_2\,d(\hat\Phi,\Phi^\star)^2 \quad\forall k \ge 1.$$
Therefore, $\mathcal{E}_{w\,\mathrm{bound}} \cap \mathcal{E}_{\mathrm{ls},1} \cap \cdots \cap \mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}} \subseteq \mathcal{E}_{\mathrm{success},2}$. Union bounding over the good events $\mathcal{E}_{w\,\mathrm{bound}}, \mathcal{E}_{\mathrm{ls},1}, \dots, \mathcal{E}_{\mathrm{ls},k_{\mathrm{fin}}}$, we find that $\mathcal{E}_{\mathrm{success},2}$ holds with probability at least $1 - T^{-2}$.

Appendix D. Proof of Main result


Lemma 18 Consider rolling out the system $x_{s+1} = A^\star x_s + B^\star u_s + w_s$ from initial state $x_1$ for $t$ timesteps under the control action $u_s = Kx_s + \sigma_u g_s$, where $K$ is stabilizing and $\sigma_u \le 1$. We have that
$$\mathbb{E}\left[\sum_{s=1}^t \|x_s\|_Q^2 + \|u_s\|_R^2 \,\Big|\, x_1\right] \le tJ(K) + 2t\sigma_u^2 d_U\Psi_{B^\star}^2\|P_K\| + \|x_1\|^2\|P_K\|.$$

Proof Begin by writing
$$\mathbb{E}\left[\sum_{s=1}^t \|x_s\|_Q^2 + \|u_s\|_R^2\,\Big|\, x_1\right] = \sum_{s=1}^t\mathbb{E}\Big[\|x_s\|_{Q+K^\top RK}^2 + \|\sigma_u g_s\|^2\,\Big|\, x_1\Big] = t\sigma_u^2 d_U + \sum_{s=1}^t\operatorname{tr}\Big((Q + K^\top RK)\,\mathbb{E}\big[x_s x_s^\top\,\big|\,x_1\big]\Big).$$
We have that
$$\begin{aligned}
\mathbb{E}\big[x_s x_s^\top\,\big|\,x_1\big] &= \mathbb{E}\Big[(A^\star + B^\star K)x_{s-1}x_{s-1}^\top(A^\star + B^\star K)^\top + \sigma_u^2 B^\star(B^\star)^\top + I\,\Big|\,x_1\Big]\\
&= (A^\star + B^\star K)^{s-1}x_1x_1^\top\big((A^\star + B^\star K)^{s-1}\big)^\top + \sum_{k=0}^{s-2}(A^\star + B^\star K)^k\big(\sigma_u^2 B^\star(B^\star)^\top + I\big)\big((A^\star + B^\star K)^k\big)^\top\\
&\preceq (A^\star + B^\star K)^{s-1}x_1x_1^\top\big((A^\star + B^\star K)^{s-1}\big)^\top + \operatorname{dlyap}\big(A^\star + B^\star K,\; \sigma_u^2 B^\star(B^\star)^\top + I\big).
\end{aligned}$$

Substituting this above, we find
$$\begin{aligned}
\mathbb{E}\left[\sum_{s=1}^t\|x_s\|_Q^2 + \|u_s\|_R^2\,\Big|\,x_1\right] &\le t\sigma_u^2 d_U + t\operatorname{tr}\Big((Q+K^\top RK)\operatorname{dlyap}\big(A^\star + B^\star K,\; \sigma_u^2 B^\star(B^\star)^\top + I\big)\Big)\\
&\quad + \sum_{s=1}^t\operatorname{tr}\Big((Q+K^\top RK)(A^\star + B^\star K)^{s-1}x_1x_1^\top\big((A^\star + B^\star K)^{s-1}\big)^\top\Big)\\
&\le tJ(K) + t\sigma_u^2 d_U\big(1 + \Psi_{B^\star}^2\|P_K\|\big) + \|x_1\|^2\|P_K\|.
\end{aligned}$$
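For a scalar system, Lemma 18 can be checked by Monte Carlo. The sketch below is illustrative only: the scalar values of $A^\star$, $B^\star$, $K$, $\sigma_u$, and $x_1$ are arbitrary choices, $Q = R = 1$, the process noise is standard normal, and $J(K) = P_K$ in this scalar unit-noise setting.

```python
import random

random.seed(0)

a, b, k = 0.8, 1.0, -0.5            # scalar A*, B*, and a stabilizing gain K
sigma_u, x1, t_steps = 0.5, 2.0, 30
a_cl = a + b * k
P = (1 + k * k) / (1 - a_cl * a_cl)  # scalar Lyapunov solution, Q = R = 1
J = P                                # average cost of u = k x under unit noise
psi_B = max(1.0, abs(b))

# Monte Carlo estimate of E[sum_s x_s^2 + u_s^2 | x_1].
trials, total = 20_000, 0.0
for _ in range(trials):
    x, cost = x1, 0.0
    for _ in range(t_steps):
        u = k * x + sigma_u * random.gauss(0, 1)
        cost += x * x + u * u
        x = a * x + b * u + random.gauss(0, 1)
    total += cost
estimate = total / trials

bound = t_steps * J + 2 * t_steps * sigma_u**2 * psi_B**2 * P + x1**2 * P
assert estimate <= bound
print(f"simulated cost {estimate:.1f} <= bound {bound:.1f}")
```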

Lemma 19 In the settings of both Theorem 2 and Theorem 3,
$$R_3 = \mathbb{E}\left[\sum_{t=1}^{\tau_1}\|x_t\|_Q^2 + \|u_t\|_R^2\right] \le 3\tau_1\max\{d_X, d_U\}\,\|P_{K_0}\|\,\Psi_{B^\star}^2.$$

Proof This result follows from Lemma 18 by noting that $x_1 = 0$, and therefore
$$\mathbb{E}\left[\sum_{t=1}^{\tau_1}\|x_t\|_Q^2 + \|u_t\|_R^2\right] \le \tau_1 J(K_0) + 2\tau_1 d_U\|P_{K_0}\|\sigma_1^2\Psi_{B^\star}^2 \le 3\tau_1\max\{d_X, d_U\}\,\|P_{K_0}\|\,\Psi_{B^\star}^2.$$

Lemma 20 Let $\mathcal{E}_{\mathrm{success}} \leftarrow \mathcal{E}_{\mathrm{success},1}$ or $\mathcal{E}_{\mathrm{success}} \leftarrow \mathcal{E}_{\mathrm{success},2}$ in the settings of Theorem 2 and Theorem 3 respectively. Then in either setting
$$\begin{aligned}
R_2 &= \mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}}^c)\sum_{t=\tau_1+1}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right]\\
&\le T^{-1}\big(\|Q\| + 2K_b^2\big)x_b^2\log T + T^{-1}J(K_0) + 24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2 T^{-2}\log 3T\\
&\quad + 2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T + \sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k - \tau_{k-1})d_U\sigma_k^2.
\end{aligned}$$
In the setting of Theorem 2, this reduces to
$$R_2 \le T^{-1}\big(\|Q\| + 2K_b^2\big)x_b^2\log T + T^{-1}J(K_0) + 24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2 T^{-2}\log 3T + 2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T + 6\sqrt{T d_\theta d_U} + 2d_U\gamma T.$$
In the setting of Theorem 3, this reduces to
$$R_2 \le T^{-1}\big(\|Q\| + 2K_b^2\big)x_b^2\log T + T^{-1}J(K_0) + 24\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2 T^{-2}\log 3T + 2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T.$$

Proof Let $t_{\mathrm{abort}}$ be the first time instance where $\|x_t\|^2 \ge x_b^2\log T$ or $\big\|\hat K_k\big\| \ge K_b$.

$$\begin{aligned}
R_2 &= \mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}}^c)\sum_{t=\tau_1+1}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right] \le \mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}}^c)\sum_{t=1}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right]\\
&\le \mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\sum_{t=1}^{t_{\mathrm{abort}}-1}\|x_t\|_Q^2 + \|u_t\|_R^2\right] + \mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right].
\end{aligned}$$

By the fact that $u_t = \hat K_k x_t + \sigma_k g_t$ for $g_t \sim \mathcal{N}(0, I)$, we have that $\|u_t\|^2 \le 2\big\|\hat K_k x_t\big\|^2 + 2\|\sigma_u g_t\|^2$. By the definition of $t_{\mathrm{abort}}$ and the fact that $R = I$, for $t < t_{\mathrm{abort}}$,
$$\|x_t\|_Q^2 + 2\big\|\hat K_k x_t\big\|^2 \le \big(\|Q\| + 2K_b^2\big)x_b^2\log T.$$
Therefore
$$\begin{aligned}
\mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\sum_{t=1}^{t_{\mathrm{abort}}-1}\|x_t\|_Q^2 + \|u_t\|_R^2\right] &\le \mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\left(\sum_{t=1}^{t_{\mathrm{abort}}-1}\big(\|Q\| + 2K_b^2\big)x_b^2\log T + \sum_{k=1}^{k_{\mathrm{fin}}}\sum_{t=\tau_{k-1}+1}^{\tau_k}2\sigma_k^2\|g_t\|^2\right)\right]\\
&\le \mathbb{E}\big[\mathbf{1}(t_{\mathrm{abort}} \le T)\big]\,T\big(\|Q\| + 2K_b^2\big)x_b^2\log T + \mathbb{E}\left[\sum_{k=1}^{k_{\mathrm{fin}}}\sum_{t=\tau_{k-1}+1}^{\tau_k}2\sigma_k^2\|g_t\|^2\right]\\
&= \mathbb{P}(t_{\mathrm{abort}} \le T)\,T\big(\|Q\| + 2K_b^2\big)x_b^2\log T + \sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k - \tau_{k-1})d_U\sigma_k^2\\
&\le T^{-1}\big(\|Q\| + 2K_b^2\big)x_b^2\log T + \sum_{k=1}^{k_{\mathrm{fin}}}2(\tau_k - \tau_{k-1})d_U\sigma_k^2.
\end{aligned}$$
For the second term, we have from Lemma 18
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right] &= \mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\,\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^T\|x_t\|_Q^2 + \|u_t\|_R^2\,\Big|\,t_{\mathrm{abort}}, x_{t_{\mathrm{abort}}}\right]\right]\\
&\le \mathbb{E}\Big[\mathbf{1}(t_{\mathrm{abort}} \le T)\Big(TJ(K_0) + \|P_{K_0}\|\,\|x_{t_{\mathrm{abort}}}\|^2\Big)\Big].
\end{aligned}$$
 
We have that $x_{t_{\mathrm{abort}}} = (A^\star + B^\star K)x_{t_{\mathrm{abort}}-1} + \begin{bmatrix}I & \sigma_u B^\star\end{bmatrix}\begin{bmatrix}w_t\\ g_t\end{bmatrix}$ for some $K$ satisfying $\|K\| \le K_b$ and $\sigma_u \le 1$. Then we may write
$$\|x_{t_{\mathrm{abort}}}\|^2 \le 2\|A^\star + B^\star K\|^2\|x_{t_{\mathrm{abort}}-1}\|^2 + 2\Psi_{B^\star}^2\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2 \le 2\|\theta^\star\|_F^2 K_b^2 x_b^2\log T + 2\Psi_{B^\star}^2\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2.$$

Then
$$\mathbb{E}\left[\sum_{t=t_{\mathrm{abort}}}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right] \le \mathbb{P}(t_{\mathrm{abort}} \le T)\Big(TJ(K_0) + 2\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T\Big) + 2\|P_{K_0}\|\Psi_{B^\star}^2\,\mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2\right].$$
We have that $\mathbb{P}(t_{\mathrm{abort}} \le T)\big(TJ(K_0) + 2\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T\big) \le T^{-1}J(K_0) + 2T^{-2}\|P_{K_0}\|\|\theta^\star\|_F^2 K_b^2 x_b^2\log T$. Furthermore, by Lemma 21,
$$2\|P_{K_0}\|\Psi_{B^\star}^2\,\mathbb{E}\left[\mathbf{1}(t_{\mathrm{abort}} \le T)\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2\right] \le 8\|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2 T^{-2}\log 9T^3.$$

Combining these results proves the first claim. To reduce the bound to the expression specific to Theorem 2, we substitute in the bound $\sigma_k^2 \le \gamma + \frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_k}}$. To reduce the bound to the expression specific to Theorem 3, we substitute $\sigma_k^2 = 0$.

Lemma 21 (Adaptation of Lemma 35 of Cassel et al. (2020)) Suppose $\mathbb{P}(\mathcal{E}) \le \delta$. Then
$$\mathbb{E}\left[\mathbf{1}(\mathcal{E})\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2\right] \le 4(d_X+d_U)\sigma^2\delta\log\frac{9T}{\delta}.$$


Proof Note that
$$\mathbb{P}\left(\mathbf{1}(\mathcal{E})\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2 > x\right) \le \min\left\{\mathbb{P}(\mathcal{E}),\; \mathbb{P}\left(\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2 > x\right)\right\}.$$
Therefore, the tail sum formula provides
$$\begin{aligned}
\mathbb{E}\left[\mathbf{1}(\mathcal{E})\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2\right] &= \int_0^\infty\mathbb{P}\left(\mathbf{1}(\mathcal{E})\max_{1\le t\le T}\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2 > x\right)dx\\
&\le \int_0^{4(d_X+d_U)\sigma^2\log\frac{T}{\delta}}\mathbb{P}(\mathcal{E})\,dx + \int_{4(d_X+d_U)\sigma^2\log\frac{T}{\delta}}^{\infty}2T\exp\Big(-\frac{x}{4(d_X+d_U)\sigma^2}\Big)dx\\
&\le 4(d_X+d_U)\sigma^2\delta\log\frac{T}{\delta} + 8(d_X+d_U)\sigma^2\delta\\
&\le 4(d_X+d_U)\sigma^2\delta\Big(\log\frac{T}{\delta} + \log 9\Big) = 4(d_X+d_U)\sigma^2\delta\log\frac{9T}{\delta},
\end{aligned}$$
where the tail probability bound in the second inequality follows from the fact that $\left\|\begin{bmatrix}w_t\\ g_t\end{bmatrix}\right\|^2$ is sub-exponential with parameter $4(d_X+d_U)\sigma^2$ (see Section 2.8 of Vershynin (2020)).
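Lemma 21 can also be spot-checked by simulation. In the sketch below, the event $\mathcal{E}$ is taken, purely for illustration, to be an independent coin flip with $\mathbb{P}(\mathcal{E}) = \delta$ (which satisfies the lemma's hypothesis); all numerical values are arbitrary:

```python
import math
import random

random.seed(0)

d, sigma, T, delta = 3, 1.0, 50, 0.1   # d plays the role of d_X + d_U
bound = 4 * d * sigma**2 * delta * math.log(9 * T / delta)

trials, total = 5_000, 0.0
for _ in range(trials):
    event = random.random() < delta    # P(E) = delta, independent of the noise
    max_sq = max(
        sum(random.gauss(0, sigma) ** 2 for _ in range(d)) for _ in range(T)
    )
    total += max_sq if event else 0.0
estimate = total / trials

assert estimate <= bound
print(f"E[1(E) max ||(w,g)||^2] ~ {estimate:.2f} <= {bound:.2f}")
```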

Lemma 22 Let $\mathcal{E}_{\mathrm{success}} \leftarrow \mathcal{E}_{\mathrm{success},1}$ or $\mathcal{E}_{\mathrm{success}} \leftarrow \mathcal{E}_{\mathrm{success},2}$ in the settings of Theorem 2 and Theorem 3 respectively. Then in either setting,
$$\begin{aligned}
R_1 &= \mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}})\sum_{t=\tau_1+1}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right]\\
&\le \sum_{k=2}^{k_{\mathrm{fin}}}\Big(\mathbb{E}\Big[\mathbf{1}(E_{k-1})\,142(\tau_k - \tau_{k-1})\|P^\star\|^8\big\|[\hat A_{k-1}\ \hat B_{k-1}] - [A^\star\ B^\star]\big\|_F^2\Big]\\
&\qquad\quad + (\tau_k - \tau_{k-1})\mathcal{J}(K^\star) + 4(\tau_k - \tau_{k-1})d_U\|P_{K^\star}\|\sigma_k^2\Psi_{B^\star}^2 + 2x_b^2\log T\,\|P_{K^\star}\|\Big),
\end{aligned}$$
where
$$E_k = \left\{\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},1}\,\frac{\sigma^2\sqrt{d_\theta d_U}\,\|P_{K_0}\|\log T}{\sqrt{\tau_k}} + \beta_1\,d(\hat\Phi,\Phi^\star)\log T\right\} \qquad(26)$$
for $k = 1,\dots,k_{\mathrm{fin}}-1$ in the setting of Theorem 2 and
$$E_k = \left\{\big\|[\hat A_k\ \hat B_k] - [A^\star\ B^\star]\big\|_F^2 \le C_{\mathrm{est},2}\,\frac{\sigma^2 d_\theta\log T}{\tau_k\alpha^2} + \beta_2\,d(\hat\Phi,\Phi^\star)^2\right\} \qquad(27)$$
for $k = 1,\dots,k_{\mathrm{fin}}-1$ in the setting of Theorem 3.

Proof We have that
$$\mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}})\sum_{t=\tau_1}^T\|x_t\|_Q^2 + \|u_t\|_R^2\right] = \sum_{k=2}^{k_{\mathrm{fin}}}\mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}})\sum_{t=\tau_{k-1}+1}^{\tau_k}\|x_t\|_Q^2 + \|u_t\|_R^2\right].$$
Let $S_k = \big\{\|x_{\tau_k+1}\|^2 \le x_b^2\log T\big\}$. We have that for any $k = 1,\dots,k_{\mathrm{fin}}$, $\mathcal{E}_{\mathrm{success}} \subseteq E_k \cap S_k$, and therefore $\mathbf{1}(\mathcal{E}_{\mathrm{success}}) \le \mathbf{1}(E_k \cap S_k)$. By the tower property, we may write that for any $k = 2,\dots,k_{\mathrm{fin}}$,
$$\mathbb{E}\left[\mathbf{1}(\mathcal{E}_{\mathrm{success}})\sum_{t=\tau_{k-1}+1}^{\tau_k}\|x_t\|_Q^2 + \|u_t\|_R^2\right] \le \mathbb{E}\left[\mathbf{1}(E_{k-1}\cap S_{k-1})\,\mathbb{E}\left[\sum_{t=\tau_{k-1}+1}^{\tau_k}\|x_t\|_Q^2 + \|u_t\|_R^2\,\Big|\,x_{\tau_{k-1}+1}, \hat K_k\right]\right].$$

As $\hat K_k$ is stabilizing under $E_k$, we may invoke Lemma 18 to bound
$$\begin{aligned}
&\mathbf{1}(E_{k-1}\cap S_{k-1})\,\mathbb{E}\left[\sum_{t=\tau_{k-1}+1}^{\tau_k}\|x_t\|_Q^2 + \|u_t\|_R^2\,\Big|\,x_{\tau_{k-1}+1}, \hat K_k\right]\\
&\le \mathbf{1}(E_{k-1}\cap S_{k-1})\Big((\tau_k-\tau_{k-1})\mathcal{J}(\hat K_k) + 2(\tau_k-\tau_{k-1})d_U\big\|P_{\hat K_k}\big\|\sigma_k^2\Psi_{B^\star}^2 + \big\|x_{\tau_{k-1}+1}\big\|^2\big\|P_{\hat K_k}\big\|\Big)\\
&\overset{(i)}{\le} \mathbf{1}(E_{k-1})\Big((\tau_k-\tau_{k-1})\mathcal{J}(\hat K_k) + 4(\tau_k-\tau_{k-1})d_U\|P^\star\|\sigma_k^2\Psi_{B^\star}^2 + 2x_b^2\log T\,\|P_{K^\star}\|\Big)\\
&\overset{(ii)}{\le} \mathbf{1}(E_{k-1})\Big(142(\tau_k-\tau_{k-1})\|P^\star\|^8\big\|[\hat A_{k-1}\ \hat B_{k-1}] - [A^\star\ B^\star]\big\|_F^2 + (\tau_k-\tau_{k-1})\mathcal{J}(K^\star)\\
&\qquad\qquad + 4(\tau_k-\tau_{k-1})d_U\|P_{K^\star}\|\sigma_k^2\Psi_{B^\star}^2 + 2x_b^2\log T\,\|P^\star\|\Big),
\end{aligned}$$
where inequality (i) follows by using the bound from $S_{k-1}$ along with the fact that under the event $E_{k-1}$, the burn-in time and assumptions on $d(\hat\Phi,\Phi^\star)$ ensure that $\big\|P_{\hat K_k}\big\| \le 2\|P^\star\|$. Inequality (ii) follows from the fact that under the burn-in time assumptions along with the event $E_{k-1}$, the condition of Lemma 12 is satisfied, and $\mathcal{J}(\hat K) - \mathcal{J}(K^\star) \le 142\|P^\star\|^8\big\|[\hat A_{k-1}\ \hat B_{k-1}] - [A^\star\ B^\star]\big\|_F^2$. Taking the expectation and summing over $k = 2,\dots,k_{\mathrm{fin}}$ proves the claim.

Lemma 23 In the setting of Theorem 2,
$$R_1 - T\mathcal{J}(K^\star) \le 832\sigma^2\sqrt{d_\theta d_U}\,\|P^\star\|^8 C_{\mathrm{est},1}\|P_{K_0}\|\Psi_{B^\star}^2\sqrt{T}\log T + 4d_U\|P^\star\|\Psi_{B^\star}^2\Big(\gamma + 142\|P^\star\|^8\beta_1\,d(\hat\Phi,\Phi^\star)\log T\Big)T + 2x_b^2\|P_{K^\star}\|\log^2 T.$$

Proof By substituting the estimation error bound under the events $E_k$ in (26), and the bound on the exploration noise magnitude $\sigma_k^2 \le \frac{\sqrt{d_\theta/d_U}}{\sqrt{\tau_1 2^{k-1}}} + \gamma$, Lemma 22 provides
$$R_1 - T\mathcal{J}(K^\star) \le 832\sigma^2\sqrt{d_\theta d_U}\,\|P^\star\|^8 C_{\mathrm{est},1}\|P_{K_0}\|\Psi_{B^\star}^2\sqrt{T}\log T + 4d_U\|P^\star\|\Psi_{B^\star}^2\Big(\gamma + 142\|P^\star\|^8\beta_1\,d(\hat\Phi,\Phi^\star)\log T\Big)T + 2k_{\mathrm{fin}}x_b^2\|P_{K^\star}\|\log T.$$
We have that $k_{\mathrm{fin}} = \log_2(T/\tau_1) \le \log T$, providing the result in the lemma statement.
Lemma 24 In the setting of Theorem 3,
$$R_1 - T\mathcal{J}(K^\star) \le C_{\mathrm{est},2}\Big(142\|P^\star\|^8\frac{\sigma^2 d_\theta}{\alpha^2} + x_b^2\|P_{K^\star}\|\Big)\log^2 T + 142\|P^\star\|^8\beta_2\,d(\hat\Phi,\Phi^\star)^2 T.$$

Proof By substituting the estimation error bound under the events $E_k$ from (27), and the exploration noise magnitude $\sigma_k^2 = 0$, Lemma 22 provides
$$R_1 - T\mathcal{J}(K^\star) \le k_{\mathrm{fin}}\,C_{\mathrm{est},2}\Big(142\|P^\star\|^8\frac{\sigma^2 d_\theta}{\alpha^2} + x_b^2\|P_{K^\star}\|\Big)\log T + 142\|P^\star\|^8\beta_2\,d(\hat\Phi,\Phi^\star)^2 T.$$
We have that $k_{\mathrm{fin}} = \log_2(T/\tau_1) \le \log T$, providing the result in the lemma statement.

D.1. Proof of Theorem 2

We consider the decomposition of the regret as
$$\mathbb{E}[R_T] = R_1 + R_2 + R_3 - T\mathcal{J}(K^\star),$$
with $R_1$, $R_2$, and $R_3$ as in (6). Combining the bounds on $R_1 - T\mathcal{J}(K^\star)$ from Lemma 23, on $R_2$ from Lemma 20, and on $R_3$ from Lemma 19, we find
$$\begin{aligned}
\mathbb{E}[R_T] &\lesssim \tau_1\max\{d_X,d_U\}\|P_{K_0}\|\Psi_{B^\star}^2\\
&\quad + T^{-1}\Big(\big(\|Q\| + K_b^2\big)x_b^2\log T + J(K_0) + \|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2\log T + \|P_{K_0}\|\|\theta^\star\|^2 K_b^2 x_b^2\log T\Big)\\
&\quad + \sigma^2\sqrt{d_\theta d_U}\,\|P^\star\|^8\|P_{K_0}\|\Psi_{B^\star}^2\sqrt{T}\log T\\
&\quad + d_U\|P^\star\|\Psi_{B^\star}^2\Big(\gamma + \|P^\star\|^8\beta_1\,d(\hat\Phi,\Phi^\star)\log T\Big)T + x_b^2\|P_{K^\star}\|\log^2 T.
\end{aligned}$$

The first term may be simplified by recalling that $\tau_1 = \tau_{\mathrm{warm\,up}}\log^2 T$. For the second term, note that by the fact that $T^{-1} \le \tau_1^{-1}$, the following bound holds:
$$T^{-1}\Big(\big(\|Q\| + K_b^2\big)x_b^2\log T + J(K_0) + \|P_{K_0}\|\Psi_{B^\star}^2(d_X+d_U)\sigma^2\log T + \|P_{K_0}\|\|\theta^\star\|^2 K_b^2 x_b^2\log T\Big) \lesssim \big(\|Q\| + K_b^2\big)\big(1 + \|P_{K_0}\|\|\theta^\star\|^2\big).$$
Combining terms, using the fact that $d(\hat\Phi,\Phi^\star) \le \gamma$, and substituting the definition of $\beta_1$ from Assumption 2 leads to the theorem statement.

D.2. Proof of Theorem 3

We consider the decomposition of the regret as
$$\mathbb{E}[R_T] = R_1 + R_2 + R_3 - T\mathcal{J}(K^\star),$$
with $R_1$, $R_2$, and $R_3$ as in (6). Combining the bounds on $R_1 - T\mathcal{J}(K^\star)$ from Lemma 24, on $R_2$ from Lemma 20, and on $R_3$ from Lemma 19, we find
$$\begin{aligned}
\mathbb{E}[R_T] &\lesssim \tau_1\max\{d_X,d_U\}\|P_{K_0}\|\Psi_{B^\star}^2\\
&\quad + T^{-1}\Big(J(K_0) + \big(\|Q\| + K_b^2 + \|P_{K_0}\|\|\theta^\star\|^2 K_b^2\big)x_b^2\log T + \|P_{K_0}\|\Psi_{B^\star}(d_X+d_U)\sigma^2\log T\Big)\\
&\quad + \|P^\star\|^8\frac{\sigma^2 d_\theta\log^2 T}{\alpha^2} + \|P^\star\|^8\beta_2\,d(\hat\Phi,\Phi^\star)^2 T + x_b^2\|P_{K^\star}\|\log^2 T.
\end{aligned}$$
Substituting $\tau_1 = \tau_{\mathrm{warm\,up}}\log T \le \tau_{\mathrm{warm\,up}}\log^2 T$ for $\tau_1$, and grouping terms, we find that
$$\begin{aligned}
\mathbb{E}[R_T] &\le \Big(\tau_{\mathrm{warm\,up}}\max\{d_X,d_U\}\|P_{K_0}\|\Psi_{B^\star}^2 + \|P^\star\|^8\frac{\sigma^2 d_\theta}{\alpha^2} + x_b^2\|P_{K^\star}\|\Big)\log^2 T\\
&\quad + T^{-1}\Big(J(K_0) + \big(\|Q\| + K_b^2 + \|P_{K_0}\|\|\theta^\star\|^2 K_b^2\big)x_b^2\log T + \|P_{K_0}\|\Psi_{B^\star}(d_X+d_U)\sigma^2\log T\Big)\\
&\quad + \|P^\star\|^8\beta_2\,d(\hat\Phi,\Phi^\star)^2 T.
\end{aligned}$$
Using the fact that $T^{-1} \le \tau_1^{-1}$, we have the following bound:
$$T^{-1}\Big(J(K_0) + \big(\|Q\| + K_b^2 + 2\|P_{K_0}\|\|\theta^\star\|^2 K_b^2\big)x_b^2\log T + \|P_{K_0}\|\Psi_{B^\star}(d_X+d_U)\sigma^2\log T\Big) \lesssim \big(\|Q\| + K_b^2\big)\big(1 + \|P_{K_0}\|\|\theta^\star\|^2\big).$$
Combining terms and substituting the definition of $\beta_2$ from Assumption 4 as well as the definition of $\varepsilon$ in Lemma 12 provides the result in the theorem statement.

Appendix E. Generalizations
E.1. Logarithmic Regret with Known A⋆ or B⋆
Cassel et al. (2020) and Jedra and Proutiere (2022) show that it is possible to attain logarithmic regret if either $A^\star$ is known to the learner and $K^\star(K^\star)^\top \succ 0$, or $B^\star$ is known to the learner. These settings cannot immediately be embedded into the formulation of (1), because (1) requires that $[A^\star\ B^\star]$ is a linear function of the unknown parameters. It can therefore only embed the settings where $A^\star$ and $B^\star$ are known up to scale. However, the model can easily be extended to
$$x_{t+1} = \Big(\begin{bmatrix}\bar A & \bar B\end{bmatrix} + \operatorname{vec}^{-1}\big(\Phi^\star\theta^\star\big)\Big)\begin{bmatrix}x_t\\ u_t\end{bmatrix} + w_t.$$

If the learner has access to $\bar A$, $\bar B$, and an estimate $\hat\Phi$ for $\Phi^\star$, then the results of Theorem 2 and Theorem 3 are unchanged by this generalization. In particular, the least squares algorithm may be modified to subtract out the known dynamics $\bar A$ and $\bar B$. This modified algorithm is shown below. Propagating this change through the analysis does not change the results.
Using the above generalization, we may recover logarithmic regret in the settings considered by Cassel et al. (2020); Jedra and Proutiere (2022).
1: Input: Model structure estimate $\hat\Phi$, state data $x_{1:t+1}$, input data $u_{1:t}$
2: Return: $\hat\theta$, $\Lambda$, where
$$\hat\theta = \Lambda^{-1}\sum_{s=1}^t\hat\Phi^\top\left(\begin{bmatrix}x_s\\ u_s\end{bmatrix}\otimes I_{d_X}\right)\left(x_{s+1} - \begin{bmatrix}\bar A & \bar B\end{bmatrix}\begin{bmatrix}x_s\\ u_s\end{bmatrix}\right), \quad\text{and}\quad \Lambda = \sum_{s=1}^t\hat\Phi^\top\left(\begin{bmatrix}x_s\\ u_s\end{bmatrix}\begin{bmatrix}x_s\\ u_s\end{bmatrix}^\top\otimes I_{d_X}\right)\hat\Phi.$$
• $B^\star$ known: We may set $[\bar A\ \bar B] = [0\ B^\star]$ and $\theta^\star = \operatorname{vec}(A^\star)$. We then set $\Phi^\star$ as the matrix with columns having a single element that is one, and the rest zero, such that $\operatorname{vec}^{-1}(\Phi^\star\theta^\star) = [A^\star\ 0]$. As we assume the learner knows $B^\star$ exactly, and $\Phi^\star$ forms a complete basis for $A^\star$, we may simply choose $\hat\Phi = \Phi^\star$, and have $d(\hat\Phi,\Phi^\star) = 0$. To verify that we may achieve logarithmic regret in this setting using Theorem 3, we must simply show that Assumption 3 is satisfied. To do so, note that for all $i = 1,\dots,d_\theta$, the $\Phi_i^B$ are zero, and the $\hat\Phi_i^A$ are matrices each with a distinct unity element. Therefore, for any $K$ and any unit vector $v \in \mathbb{R}^{d_\theta}$,
$$\left\|\sum_{i=1}^{d_\theta} v_i\big(\hat\Phi_i^A + \hat\Phi_i^B K\big)\right\|_F^2 = \left\|\sum_{i=1}^{d_\theta} v_i\hat\Phi_i^A\right\|_F^2 = \sum_{i=1}^{d_\theta}v_i^2 = \|v\|^2 = 1 \ge \frac{1}{3\|P^\star\|^{3/2}},$$
where the last inequality follows by the fact that $\|P^\star\| \ge 1$. This verifies that Assumption 3 holds in this setting.

• $A^\star$ known, $K^\star(K^\star)^\top \succeq \mu I$ and $K_0K_0^\top \succeq \mu I$ for $\mu \ge \frac{1}{3\|P^\star\|^{3/2}}$:$^{8}$ We may set $[\bar A\ \bar B] = [A^\star\ 0]$ and $\theta^\star = \operatorname{vec}(B^\star)$. We then set $\Phi^\star$ as the matrix with columns having a single element that is one, and the rest zero, such that $\operatorname{vec}^{-1}(\Phi^\star\theta^\star) = [0\ B^\star]$. As we assume the learner knows $A^\star$ exactly, and $\Phi^\star$ forms a complete basis for $B^\star$, we may simply choose $\hat\Phi = \Phi^\star$, and have $d(\hat\Phi,\Phi^\star) = 0$. To verify that we achieve logarithmic regret in this setting using Theorem 3, we again show that Assumption 3 is satisfied. To do so, note that for all $i = 1,\dots,d_\theta$, the $\Phi_i^A$ are zero, and the $\hat\Phi_i^B$ are matrices each with a distinct unity element. Therefore, for any $K$ and any unit vector $v \in \mathbb{R}^{d_\theta}$,
$$\left\|\sum_{i=1}^{d_\theta} v_i\big(\hat\Phi_i^A + \hat\Phi_i^B K\big)\right\|_F^2 = \left\|\sum_{i=1}^{d_\theta} v_i\hat\Phi_i^B K\right\|_F^2 = \operatorname{tr}\left(\Big(\sum_{i=1}^{d_\theta} v_i\hat\Phi_i^B\Big)KK^\top\Big(\sum_{i=1}^{d_\theta} v_i\hat\Phi_i^B\Big)^{\!\top}\right) \ge \mu\left\|\sum_{i=1}^{d_\theta} v_i\hat\Phi_i^B\right\|_F^2 = \mu \ge \frac{1}{3\|P^\star\|^{3/2}}.$$
8. The assumption K0 K0⊤ ⪰ µI, and the lower bound on µ is not included in Cassel et al. (2020); Jedra and Proutiere
(2022), but are needed here to verify Assumption 3. The lower bound on α in Assumption 3 is used to find the
probability bound on the success event Esuccess,2 . It may be possible to remove this lower bound, but we leave that for
future work.
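The key computation in both bullets, that a unit-norm combination of distinct canonical basis matrices has Frobenius norm $\|v\| = 1$, is easy to check numerically. The dimensions below are arbitrary illustrative choices:

```python
import math
import random

random.seed(0)

d_x = 3
d_theta = d_x * d_x              # one basis matrix per entry of A

# Phi_i^A is the d_x x d_x matrix with a 1 in position i and zeros elsewhere;
# sum_i v_i Phi_i^A therefore just reshapes v into a matrix.
v = [random.gauss(0, 1) for _ in range(d_theta)]
norm = math.sqrt(sum(vi * vi for vi in v))
v = [vi / norm for vi in v]      # normalize to a unit vector

combo = [[v[i * d_x + j] for j in range(d_x)] for i in range(d_x)]
fro = math.sqrt(sum(c * c for row in combo for c in row))

# ||sum_i v_i Phi_i^A||_F = ||v|| = 1 for any unit vector v.
assert abs(fro - 1.0) < 1e-12
print("Frobenius norm of the combination equals ||v|| = 1")
```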
With a different definition of regret measuring the distance between the incurred cost and the optimal cost of a set of controllers in hindsight, Lale et al. (2020) also show logarithmic regret in a partially observed setting. However, they operate under the assumption that the class of controllers is persistently exciting. Such a condition is not satisfied by the optimal state feedback LQR controller. In particular, the covariates under the policy $u_t = K^\star x_t$ are given by
$$\sum_{s=1}^t\begin{bmatrix}x_s\\ u_s\end{bmatrix}\begin{bmatrix}x_s\\ u_s\end{bmatrix}^\top = \begin{bmatrix}I\\ K^\star\end{bmatrix}\sum_{s=1}^t x_sx_s^\top\begin{bmatrix}I\\ K^\star\end{bmatrix}^\top,$$
which is degenerate. See Tsiamis et al. (2023) for further discussion.
