
A Convex Analytic Approach to System Identification
Venkatesh Saligrama
Department of Electrical and Computer Engineering
Boston University, Boston MA 02215
Abstract—This paper introduces a new concept for system identification in order to account for random and non-random (deterministic/set-membership) uncertainties. While random/stochastic models are natural for modeling measurement errors, non-random uncertainties are well-suited for modeling parametric and non-parametric components. The new concept
introduced is distinct from earlier concepts in many respects.
First, inspired by the concept of uniform convergence of empirical
means developed in machine learning theory, we seek a stronger
notion of convergence in that the objective is to obtain proba-
bilistic uniform convergence of model estimates to the minimum
possible radius of uncertainty. Second, the formulation lends itself
to convex analysis leading to description of optimal algorithms,
which turn out to be well-known instrument-variable methods
for many of the problems. Third, we characterize conditions
on inputs in terms of second-order sample path properties
required to achieve the minimum radius of uncertainty. Finally,
we present fundamental bounds and optimal algorithms for
system identification for a wide variety of standard as well as
non-standard problems that include special structures such as
unmodeled dynamics, positive real conditions, bounded sets and
linear fractional maps.
Keywords: complex systems, convex analysis, input design,
linear algorithms, robust identification
I. INTRODUCTION
System identification deals with mathematical modeling of
an unknown system in a model class from noisy data. The
model class is non-random, i.e., specified without a probability
distribution, while noise, which typically refers to measurement error, is modeled as the realization of a stochastic process.
The model class is deterministic and can be specified in terms
of a finite-dimensional parametrization or more generally as a
non-parametric class of systems. This paper introduces a new
concept for system identification, inspired by the idea of uni-
form convergence of empirical means (UCEM) from machine
learning theory [24], [32], in order to account for both random
and non-random (deterministic/set-membership) uncertainties.
The main objective of our formulation is to minimize the
probability of the worst-case estimation errors, i.e., we seek
probabilistic uniform convergence of model estimates.
The question of dealing with both random and non-random
aspects in an identification problem is arguably as old as sys-
tem identification/estimation theory. Indeed, the well known
maximum-likelihood estimate picks the (deterministic) model
that maximizes the likelihood of the observed data. In the
minimum-prediction-error (MPE) rule [13], which can be thought of as a generalization of maximum likelihood,
one chooses models that minimize the prediction error (or in
general a loss function). The MPE scheme is quite general and
can be used in a wide variety of contexts including situations
where the real system does not belong to the chosen model
class [3], [14], [11], [33]. Nevertheless, as pointed out in [5]
it is in general difficult to assess the model quality of the
estimates obtained through MPE methods with finite data. In
general there is a significant discrepancy—when there is un-
modeled error—between the estimate obtained by minimizing
an empirical cost over finite data and the asymptotic theoretical
estimate.
This latter issue of dealing with unmodeled dynamics over
finite data together with the need for control-oriented models
(i.e., with guarantees on uncertainty), has motivated a number
of researchers to propose numerous non-mixed formulations.
In [9] a stochastic embedding of unmodeled dynamics is
described. Milanese [17], [18] initiated research on a purely
worst-case paradigm with stochastic noise absorbed within a
worst-case framework as uniformly bounded noise (UBN). This
research led to rapid development of worst-case algorithms
for identification [10], [16], [23], its time complexity [20],
[6] and evaluation [8], [7] for a wide variety of problems.
Nevertheless, the UBN model can be conservative particularly
in accounting for stochastic noise. This has motivated various
researchers [30], [27] to consider additional constraints on
sample-paths of worst-case noise processes. There has also
been related research on explicitly modeling feasible sets
through ellipsoidal as well as polytopic constraints [34], [31].
Although such constraints do lead to desirable behavior in
many cases, it is in general difficult to explicitly model both
stochastic noise and unmodeled dynamics within a purely
worst-case setting. Consequently, while deficiencies in the
MPE setup require stronger convergence notions, the purely
worst-case approach can either be conservative or require
extensive modeling in complex situations.
Motivated by these concerns, a mixed approach has been
discussed in [30], [26], [27], where the approach has been
to attempt a separation between unmodeled dynamics and
noise and seek models that minimize the distance between
the model class and the underlying system. In parallel, motivated by persistent identification, Wang [36] has proposed
a new concept for mixed identification, wherein the idea is
to pick model estimates so that the worst-case probability
of “mismatch” error being larger than some threshold is
minimized. Furthermore, a new notion of sample complexity
is developed wherein the idea is to find the smallest time
by which a confidence level for the probabilistic objective
can be reached. These ideas resemble notions of minimax
risk functions employed in the statistical and estimation lit-
erature [12]. There the idea is to choose parameters that minimize the worst-case (over all parameters) expected value of the estimation error. Nevertheless, this problem (in
terms of finding optimal algorithms, optimal inputs, minimum
sample complexity etc.) is in general intractable in both the
estimation and identification contexts. Wang [36] addresses
this issue by deriving upper and lower bounds for a number
of different cases. The upper bounds are usually obtained by
using least-squares algorithms for periodic inputs while the
lower bounds are obtained by considering the noiseless case.
Although these bounds can sometimes be suitable, they can be
overly conservative for situations where Least Squares (LS) is
not a convergent algorithm. This is particularly the case when
there is a separation between model, unmodeled dynamics and
noise as was shown in [27]. In summary, although there are a
number of formulations that have been proposed for dealing
with mixed scenarios, it is generally difficult to characterize
optimal algorithms, inputs and fundamental bounds.
To address these issues we propose a new formulation for
mixed components in Section II. Preliminary work along these
lines has appeared in our earlier publications [25], where
consistent identification of FIR models was described. This
paper generalizes that formulation and deals with a wide vari-
ety of new problems. This formulation is inspired by notions
of uniform convergence of empirical means and approximate
learning in machine learning theory. Our objective is to seek
models/estimates that minimize the probability of the worst-
case errors, i.e., we seek uniform convergence in probability
of the estimation error. Although this notion is stronger than the earlier proposed formulations, we show in Section III that it is also tractable through convex analysis, particularly for problems defined by linear observations, convex subsets of systems and additive random noise under the $\ell_\infty$ norm. This can be extended to other norms by approximating with a finite family of linear cost functions. Specifically, we show
that optimal algorithms can be associated with hyperplane
separation of convex sets, which can further be related to
instrument-variable techniques. In Section IV the paper de-
scribes new insights for parameter estimation in stochastic
noise. In particular, it is well known that a persistency of
excitation condition is required to ensure consistency. By
means of our mixed formulation we show that this condition is
also necessary for achieving uniform consistency. This section
also develops solution techniques for problems with convex
parametric constraints. Section V then develops algorithms for
optimal identification in the presence of unmodeled dynamics
and noise. The analysis is then extended to cover situations where the desired solution can be expressed as a non-linear function of a linear operator. In particular, this strategy is employed to show that uniform convergence of the parametric error to zero can be established in the presence of noise and unmodeled dynamics for stable systems in an $\ell_1$ space. In Section VI we show how this approach also leads to
solutions to problems described by LFT structures. Another
aspect of the paper is in characterizing inputs required for
system identification. We derive conditions on input sequences
in terms of their second order sample path properties.
a) Notation: $A^{*}$ denotes the complex-conjugate transpose of the matrix $A$. $\mathbb{Z}_+$ is the set of positive integers. Sequences are real-valued and defined on the positive integers. $\ell_p$, $p \ge 1$, denotes the space of sequences on $\mathbb{Z}_+$ bounded in the $\ell_p$ norm (in [15] these concepts are defined on the space of natural numbers, but we only consider sequences over the positive integers and denote them as $\ell_p$ with an abuse of notation). For a signal $x \in \ell_p$, $P_m$ denotes the truncation operator: $P_m(x) = (x(0), \ldots, x(m-1), 0, \ldots)$, and $X_n$ is the column vector $(x(0), x(1), \ldots, x(n-1))^{T}$. $\|x\|_p$ denotes the $\ell_p$ norm of the discrete-time signal $x(\cdot)$. $\|A\|_{p,q} = \sup_{x \in \ell_p} \|Ax\|_q / \|x\|_p$ denotes the induced norm of an operator or matrix.

The $n$-point autocorrelation, $r^n_x(\cdot)$, of a signal $x$ is
$$r^n_x(\tau) = \sum_{i=0}^{n-\tau} x(\tau+i)\,x(i), \quad \tau \in \{0, 1, \ldots, n\}$$
The $n$-point cross-correlation between two signals $x$ and $y$ is
$$r^n_{xy}(\tau) = \sum_{i=0}^{n-\tau} x(\tau+i)\,y(i)$$
$N(m, \sigma)$ denotes the Gaussian distribution with mean $m$ and standard deviation $\sigma$. $\mathrm{Prob}\{A\}$ denotes the probability of an event $A$ and $E\{X\}$ denotes the expected value of the random variable $X$. Finally, $z^{-1}$ is the z-transform variable and corresponds to the usual shift operation. LTI refers to linear time-invariant systems and BIBO refers to bounded-input-bounded-output systems, i.e., the space of systems defined by the disc algebra of analytic functions bounded in the closed unit disc. $RH_2$ refers to stable real-rational LTI systems with bounded squared norm. The impulse response of an LTI system $H$ is represented by $\{h(k)\}_{k \in \mathbb{Z}_+}$. The output response, $y(t)$, to an input $u(t)$ is denoted by $y(t) = Hu(t) = (h * u)(t)$, where $*$ denotes the convolution operation. For a real-rational LTI system $H$, we often associate a minimal realization in state space:
$$H \equiv \begin{bmatrix} A & B \\ C & D \end{bmatrix} \tag{1}$$
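The finite-horizon correlation quantities defined above appear throughout the input-design conditions later in the paper. The following short sketch (a hypothetical helper of ours, not part of the original text) shows one direct way to compute them with NumPy.

```python
import numpy as np

def autocorr(x, n):
    """n-point autocorrelation r^n_x(tau) = sum_{i=0}^{n-tau} x(tau+i) x(i)."""
    x = np.asarray(x[:n + 1], dtype=float)
    return np.array([np.dot(x[tau:n + 1], x[:n + 1 - tau]) for tau in range(n + 1)])

def crosscorr(x, y, n):
    """n-point cross-correlation r^n_{xy}(tau) = sum_{i=0}^{n-tau} x(tau+i) y(i)."""
    x = np.asarray(x[:n + 1], dtype=float)
    y = np.asarray(y[:n + 1], dtype=float)
    return np.array([np.dot(x[tau:n + 1], y[:n + 1 - tau]) for tau in range(n + 1)])

# Example: a length-8 Bernoulli (+/-1) sequence
rng = np.random.default_rng(0)
u = rng.choice([-1.0, 1.0], size=8)
print(autocorr(u, 7))        # r^7_u(0), ..., r^7_u(7)
print(crosscorr(u, u, 7))    # identical to the autocorrelation
```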
II. PROBLEM FORMULATION
In this section we formulate a mixed stochastic-
deterministic problem. Mathematically, we are given a se-
quence of real-valued observations, y(k), which are assumed
to be generated according to an underlying model. The model
is uncertain in that it has both random and non-random aspects
that provide information on how the sequence of observations
are generated. More precisely, let
$$y(k) = F_k(\theta, \Delta, w), \quad k = 0, 1, \ldots, n-1 \tag{2}$$
where $F_k$ is a time-dependent real-valued regressor; $w(\cdot)$ is a random noise process; $\theta, \Delta$ are non-random components with the tuple $(\theta, \Delta) \in \mathcal{S}$, where $\mathcal{S}$ is a subset of a separable normed linear space $\mathcal{H}$. The finite-dimensional component, $\theta$, is to be estimated by means of an algorithm, $\hat\theta_n(y)$, from a sequence of $n$ observations. We denote by $\hat\theta$ (with an abuse of notation) the infinite sequence of algorithms, i.e., $\{\hat\theta_n\}$. To clarify the notation, consider an LTI example with a 1-tap FIR with unmodeled dynamics, i.e.,
$$y(t) = \theta u(t) + \sum_{k=0}^{t} \delta(t-k)\,u(k) + w(t) \;\equiv\; F^u_t(\theta, \Delta, w); \quad (\theta, \Delta) \in \mathcal{S} = \mathbb{R} \times \ell_1,\ u(\cdot) \in \ell_\infty$$
where $\delta(t)$ is the impulse response of the unmodeled component, $\Delta$.
Remark: In general, note that the variables $\theta$ and $\Delta$ may not be independent, i.e., the values of $\theta$ and $\Delta$ can be jointly specified by the set $\mathcal{S} \subset \mathcal{H}$. This models situations involving optimal approximations, i.e., $\theta$ characterizes the best approximation in the model class, while $\Delta$ denotes the residual error.
We now introduce the mixed problem. Suppose $\|\cdot\|$ is a suitable norm on $\mathbb{R}^m$. We denote by $\alpha(\hat\theta_n, \gamma, \mathcal{S})$ the probability that the worst-case estimation error is larger than $\gamma$ for $n$ observations, i.e.,
$$\alpha(\hat\theta_n, \gamma, \mathcal{S}) = \mathrm{Prob}\Bigl\{ \sup_{(\theta,\Delta)\in\mathcal{S}} \|\theta - \hat\theta_n(y)\| \ge \gamma \Bigr\}$$
where the probability is taken with respect to the noise distribution. Observe that the set $\mathcal{S}$ is a separable space and so measurability is preserved under a supremum [1]. Note
that the problem is meaningful even when the perturbation, $\Delta$, does not exist. Indeed, this situation is analogous to the concept of uniform convergence of empirical means. In the simplest situation in learning theory, we are given a family of functions, $f(x) \in \mathcal{F}$; $N$ i.i.d. samples $\{x_i\}$ are drawn from a distribution defined by $P$, and the problem is to show that
$$\mathrm{Prob}\Bigl\{ \sup_{f\in\mathcal{F}} \Bigl| E_P(f(x)) - \frac{1}{N}\sum_{i=1}^{N} f(x_i) \Bigr| \ge \gamma \Bigr\} \longrightarrow 0$$
Extensions to this problem include consideration of general loss functions and function approximation with lower-complexity function classes. The principal difficulty in applying learning theory to system identification is that the solutions there depend on observations being independent and do not deal with dynamical scenarios.
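As a concrete (and purely illustrative) picture of the UCEM statement above, the sketch below estimates by Monte Carlo the probability that the worst-case deviation between empirical and true means exceeds $\gamma$ for a small finite family of functions; the family, distribution, and threshold are our own assumptions, chosen only to show the decay with $N$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Finite family of functions on [0, 1]; true means under the uniform law are known.
family = [(lambda x: x, 0.5),
          (lambda x: x**2, 1.0 / 3.0),
          (lambda x: np.sin(np.pi * x), 2.0 / np.pi)]

def worst_case_exceeds(N, gamma, trials=2000):
    """Monte Carlo estimate of Prob{ sup_f |E f - (1/N) sum f(x_i)| >= gamma }."""
    count = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, size=N)
        dev = max(abs(f(x).mean() - mu) for f, mu in family)
        count += dev >= gamma
    return count / trials

for N in (10, 100, 1000):
    print(N, worst_case_exceeds(N, gamma=0.1))  # decays toward 0 as N grows
```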
Returning to our context, we define the asymptotic radius of uncertainty, $R(\hat\theta, \mathcal{S})$, for the sequence of algorithms $\hat\theta$ as:
$$R(\hat\theta, \mathcal{S}) = \inf\Bigl\{ \gamma \in \mathbb{R} \;\Bigm|\; \limsup_n \alpha(\hat\theta_n, \gamma, \mathcal{S}) = 0 \Bigr\}$$
This leads to the notion of consistency:
Definition 1: We say that the sequence of algorithms $\hat\theta$ is uniformly consistent if the radius of uncertainty is equal to zero.
The minimum radius of uncertainty is given by:
$$R_0(\mathcal{S}) = \inf_{\hat\theta} R(\hat\theta, \mathcal{S})$$
Finally, we define the sample complexity and optimality of an algorithm:
Definition 2: The sample complexity of the algorithm $\hat\theta$ is the smallest number of observations, $L(\epsilon, \xi, \hat\theta, \mathcal{S})$, such that the probability that the worst-case estimation error is larger than $\gamma = R_0(\mathcal{S}) + \epsilon$ is at most $\xi$, i.e.,
$$L(\epsilon, \xi, \hat\theta, \mathcal{S}) = \min\Bigl\{ n \in \mathbb{Z}_+ \;\Bigm|\; \alpha(\hat\theta_n, \gamma, \mathcal{S}) \le \xi \Bigr\}$$
An algorithm $\hat\theta^0$ is said to be optimal if its sample complexity is smaller than that of any other algorithm, i.e.,
$$L(\epsilon, \xi, \hat\theta^0, \mathcal{S}) \le L(\epsilon, \xi, \hat\theta, \mathcal{S}), \quad \forall\, \xi > 0,\ \epsilon > 0$$
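To make $\alpha(\hat\theta_n, \gamma, \mathcal{S})$ concrete, the following sketch (our own illustration, with an assumed noise level and a simple correlation estimator) evaluates the worst-case error probability by Monte Carlo for the 1-tap FIR example of Equation 2, with the unmodeled term set to zero so that the supremum over the model set is trivial.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0          # assumed noise standard deviation (illustrative)

def alpha_hat(n, gamma, trials=5000):
    """Monte Carlo estimate of alpha(theta_hat_n, gamma) for y = theta*u + w, Delta = 0.

    The correlation estimator theta_hat = <u, y>/<u, u> has error <u, w>/<u, u>,
    which does not depend on theta, so the supremum over theta is trivial here."""
    u = rng.choice([-1.0, 1.0], size=n)          # persistent (+/-1) input
    errs = np.empty(trials)
    for t in range(trials):
        w = sigma * rng.standard_normal(n)
        errs[t] = abs(np.dot(u, w) / np.dot(u, u))
    return np.mean(errs >= gamma)

for n in (10, 50, 200, 1000):
    print(n, alpha_hat(n, gamma=0.2))   # decreases with n: uniform consistency
```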
These definitions are related to the mixed formulations proposed in [36], with important differences. Wang et al. [36] consider a weaker version in that their formulation is related (but not exactly, since [36] considers simpler structures where the model and unmodeled dynamics are decoupled) to the worst-case probability of error, i.e., their notion is related to the left-hand side, while our notion is related to the right-hand side of the following inequality:
$$\sup_{(\theta,\Delta)\in\mathcal{S}} \mathrm{Prob}\Bigl\{ \|\theta - \hat\theta_n(y)\| \ge \gamma \Bigr\} \;\le\; \mathrm{Prob}\Bigl\{ \sup_{(\theta,\Delta)\in\mathcal{S}} \|\theta - \hat\theta_n(y)\| \ge \gamma \Bigr\}$$
The two problems can be significantly different, as the following examples indicate:
b) Example: Let $X$ be a binary number taking values in the set $\{-1, 1\}$. Suppose $W$ is a binary random variable taking values in the set $\{-1, 1\}$, each value being equally likely. We immediately see that $\max_X \mathrm{Prob}\{WX \ge 1/2\} = 1/2$, while $\mathrm{Prob}\{\max_X (WX) > 1/2\} = 1$.
c) Example: Suppose $Y = \theta + W$ with $\theta \in [-1, 1]$ and $W$ uniformly distributed on $[-1, 1]$. It follows that, for a linear algorithm $\hat\theta(Y) = \alpha Y$,
$$\min_\alpha \mathrm{Prob}\bigl\{ \max_\theta |\alpha Y - \theta| > 1/2 \bigr\} \ge 1/2$$
where we have used the fact that the minimum value is equal to $\min_\alpha \mathrm{Prob}\{\max_\theta |(1-\alpha)\theta + \alpha W| > 1/2\}$. The lower bound now follows by choosing $\theta = W$. It turns out through detailed analysis that the optimal solution for the weaker notion is given by $\alpha = 3/4$, and we get
$$\min_\alpha \max_\theta \mathrm{Prob}\bigl\{ |\alpha Y - \theta| \ge 1/2 \bigr\} \le 1/3$$
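The two probabilities in the example above can be checked numerically; the sketch below (illustrative only) estimates both notions by simulation for $\alpha = 3/4$ and a grid of $\theta$ values.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.75
W = rng.uniform(-1.0, 1.0, size=200000)
thetas = np.linspace(-1.0, 1.0, 201)

# Weaker notion: max over theta of Prob{ |alpha*Y - theta| >= 1/2 }, with Y = theta + W
weak = max(np.mean(np.abs(alpha * (th + W) - th) >= 0.5) for th in thetas)

# Stronger notion: Prob{ max over theta of |(1-alpha)*theta + alpha*W| > 1/2 };
# the maximum of an affine function of theta over [-1, 1] is attained at an endpoint.
worst = np.maximum(np.abs((1 - alpha) + alpha * W), np.abs(-(1 - alpha) + alpha * W))
strong = np.mean(worst > 0.5)

print(weak, strong)  # roughly 1/3 and 2/3 (>= 1/2), matching the example
```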
Although the weaker notion may suffice in many cases, the main problem is that it is difficult to analyze exactly (as seen even in the above example). This issue is well known in statistical estimation [12], and recently the authors have analyzed the weaker notion [2], [29] in the context of composite hypothesis testing; except for simple situations the problem is generally intractable. In [36] this issue is addressed by deriving upper and lower bounds, where the upper bound is computed with a least squares algorithm and periodic inputs. The lower bound is computed by considering
the noiseless case. Nevertheless, for a large number of prob-
lems especially when there is a decomposition between model
and unmodeled dynamics the upper and lower bounds are not
tight. Indeed, [27] provides examples wherein neither periodic
inputs nor LS algorithms lead to consistency, while there exist
specialized inputs and linear algorithms that do result not
only in consistency but also polynomial sample-complexity.
One advantage of our formulation is that it lends itself to
exact analysis through application of convexity theory and by
extension design of optimal inputs and algorithms, particularly
for problems defined by linear observations, convex subsets of
systems and additive random noise under the $\ell_\infty$ norm, as illustrated
in the sequel. We also demonstrate extensions of this approach
to other norms by approximating with a finite family of linear
cost functions.
III. OPTIMAL ALGORITHMS FOR CONVEX PROBLEMS
In this section we state and prove a general theorem, which is applied in subsequent sections. The main result here is that under some technical conditions linear algorithms have optimal sample complexity. The theorem is a generalization of Smolyak's theorem [22] to observations with random noise. Consider a linear space, $\mathcal{H}$, and a convex and balanced subset, $\mathcal{S} \subset \mathcal{H}$. Suppose elements of the set $\mathcal{S}$ are observed through a linear measurement structure with additive random noise, $w(\cdot)$, i.e.,
$$y(k) = \lambda_k(h) + w(k), \quad k = 1, \ldots, n, \quad h \in \mathcal{S} \tag{3}$$
where $\lambda_k$ is a linear functional on the space $\mathcal{H}$. Thus the observations are linear over $\mathcal{H}$. The task is to compute a linear function $\Phi(h)$, where $\Phi$ maps $h$ to $\mathbb{R}^p$, i.e., $\Phi(h) = (\phi_1(h), \ldots, \phi_p(h))^{T}$. The linear function $\Phi(h)$ is often referred to as the solution operator.
d) Example: The measurement structure includes ARMA models as well. This is because, if $\psi(k)$ is the sequence of regressors formed for an ARMA structure (see [13] for details), then
$$y(k) = \psi^{T}(k)\theta + w(k)$$
forms a linear measurement structure on the variable $\theta$.
For the setup described by Equation 3 we have the following
theorem.
Theorem 1: Suppose $w(\cdot)$ is i.i.d. Gaussian noise such that $w(k) \sim N(0, \sigma^2)$. Then linear algorithms are optimal when the norm measure on $\Phi(h)$ is the $\ell_\infty$ norm.
Remark: In general the result holds for more general
noise distributions, which are symmetric and monotone around
zero. These properties can be used to argue the equivalence
between the probabilistic problem and an appropriate worst-
case problem. In this paper we limit our attention to gaussian
processes (the non i.i.d. case is established in the next section)
for the sake of simplicity.
Remark: Although we have assumed Gaussian noise the
theorem holds for other noise distributions as well. The main
property required is that the noise distribution be symmetric
and monotonic around zero. For instance, the uniform distri-
bution satisfies these properties as well.
Proof: If the radius of uncertainty is infinite then linear algorithms are no worse than any other algorithm. Therefore, suppose $\gamma_0 = R_0(\mathcal{S}) < \infty$. For simplicity, consider estimation of a single linear functional, $\phi(h)$. Suppose the sequence of algorithms, $\hat\phi = \{\hat\phi_n(y)\}$, achieves the sample complexity $L(\epsilon, \xi, \hat\phi, \mathcal{S})$. This implies
$$\mathrm{Prob}\Bigl\{ \sup_{h\in\mathcal{S}} |\phi(h) - \hat\phi_n(y)| \ge \gamma_0 + \epsilon \Bigr\} \le \xi, \quad \forall\, n \ge L(\epsilon, \xi, \hat\phi, \mathcal{S}) \tag{4}$$
We are left to establish that there is a linear algorithm that achieves the same performance. The proof of optimality of the linear algorithm then follows because $\hat\phi_n(y)$ is arbitrary and can be replaced by an optimal algorithm.
Now, Equation 4 is equivalent to the existence of a set $\mathcal{A} \subset \mathbb{R}^n$ with $\mathrm{Prob}\{\mathcal{A}\} \le \xi$ such that
$$\sup_{w\in\mathbb{R}^n\setminus\mathcal{A}} \; \sup_{h\in\mathcal{S}} |\phi(h) - \hat\phi_n(y)| \le \gamma_0 + \epsilon \tag{5}$$
where $w$ is now a worst-case noise signal for which the error is uniformly bounded by $\gamma_0 + \epsilon$. Next we establish that the set $\mathbb{R}^n \setminus \mathcal{A}$ can be chosen to be convex without impacting the above inequality. We state this in the form of a proposition below; the details can be found in the appendix.
Proposition 1: If Equation 5 is satisfied then it follows that there exists a convex balanced subset, $\mathcal{W}^n_0 \subset \mathbb{R}^n$, with $\mathrm{Prob}\{\mathcal{W}^n_0\} \ge 1 - \xi$ such that
$$\sup_{w\in\mathcal{W}^n_0} \; \sup_{h\in\mathcal{S}} |\phi(h) - \hat\phi_n(y)| \le \gamma_0 + \epsilon \tag{6}$$
We are now left to prove that a linear algorithm is an alternative optimal estimator as well. For this we follow the proof of Smolyak's theorem [22]. The main idea is illustrated in Figure 1.
Consider the following subset of $\mathbb{R}^{n+1}$:
$$V = \bigl\{ (\phi(h), \lambda_1(h) + w(1), \ldots, \lambda_n(h) + w(n)) \;\bigm|\; w \in \mathcal{W}^n_0,\ h \in \mathcal{S} \bigr\}$$
From Equation 6, it follows that $(\gamma_0 + \epsilon, 0, \ldots, 0)$ is not contained in the interior of the set $V$. This implies the existence of a hyperplane such that
$$c_0\bigl(\phi(h) - (\gamma_0 + \epsilon)\bigr) + \sum_{j=1}^{n} c_j y_j \le 0, \quad \forall\, h \in \mathcal{S},\ w \in \mathcal{W}^n_0$$
Now, the linear functionals $\{\lambda_1(\cdot) + e_1^{T}, \ldots, \lambda_n(\cdot) + e_n^{T}\}$, where $e_j$ is a column vector of length $n$ with $j$th component equal to one and all other components equal to zero, are linearly independent. This in turn implies that $c_0 \ne 0$.
Fig. 1. Illustration of convex separation; the X and Y axes denote the range of measurement and parametric values respectively. With high probability they lie in a bounded convex set.
Now, by using the fact that $V$ is convex and balanced, we see that $(-\gamma_0 - \epsilon, 0, \ldots, 0)$ is also another boundary point. Therefore we obtain the complementary inequality, i.e.,
$$c_0\bigl(\phi(h) + (\gamma_0 + \epsilon)\bigr) + \sum_{j=1}^{n} c_j y_j \ge 0, \quad \forall\, h \in \mathcal{S},\ w \in \mathcal{W}^n_0$$
Consequently, for every $\epsilon, \xi > 0$, it follows that there is a sequence of coefficients $\{q_n(k) = -c_k/c_0\}_{k=1}^{n}$, not all zero, such that
$$\sup_{w\in\mathcal{W}^n_0} \; \sup_{h\in\mathcal{S}} \Bigl| \sum_{k=1}^{n} q_n(k)\,y(k) - \phi(h) \Bigr| \le \gamma_0 + \epsilon, \quad \forall\, n \ge N(\epsilon, \xi)$$
thus establishing the result for a single linear functional. To generalize it to the linear function $\Phi(h)$, we note that
$$\|\Phi(h) - \hat\Phi_n(y)\|_\infty = \max_j \bigl| \phi_j(h) - \hat\phi^n_j(y) \bigr| = \max_j \Bigl| \sum_{k=1}^{n} q^n_j(k, \phi_j)\, y(k) - \phi_j(h) \Bigr|$$
where $q^n_j(k, \phi_j)$ are the optimal linear weights corresponding to the linear functionals $\phi_j(h)$.
A. Extensions
We can generalize the above theorem to squared-error norms as well. Although it is possible to bound the squared norm by the infinity norm, this can be conservative (for example, for a vector $x \in \mathbb{R}^p$ we have $\|x\|_2 \le \sqrt{p}\,\|x\|_\infty$). Instead, our idea is to express the squared-norm value as the maximum over all unit-magnitude linear functionals—the existence of which is guaranteed through duality—and then find estimates for each of these linear functionals. Specifically, let $x \in \mathbb{R}^p$ be endowed with the squared-norm metric and let $\upsilon(\cdot)$ be a linear functional on $\mathbb{R}^p$. Therefore $\upsilon(\cdot)$ can be associated with a vector $(\upsilon_1, \ldots, \upsilon_p) \in \mathbb{R}^p$. We then have
$$\max_{\|\upsilon\|_2 \le 1} \upsilon(x) = \|x\|_2$$
Now, to draw a direct correspondence with the theorem, we proceed as follows. We consider the problem of estimating the linear function $\Phi(h) = (\phi_1(h), \phi_2(h), \ldots, \phi_p(h))$ with an algorithm $\hat\Phi_n(\cdot)$ and the squared norm as the error measure. For ease of exposition we let $x_j = \phi_j(h)$. We then select linear functionals
$$\upsilon(x) = \sum_{k=1}^{p} \upsilon_k x_k$$
Corresponding to each choice of $\upsilon(\cdot) \in \mathbb{R}^p$ we have an optimal linear algorithm, $q^n_\upsilon(k)$, with the minimum radius of uncertainty $\gamma_0 + \epsilon$ (with respect to the squared norm), i.e.,
$$\Bigl| \sum_{k} q^n_\upsilon(k)\, y(k) - \sum_{k=1}^{p} \upsilon_k x_k \Bigr| \le \gamma_0 + \epsilon \tag{7}$$
Clearly, $\upsilon(\cdot)$ ranges over an infinite family and needs discretization. Let $\{\upsilon^k\}$ be such a discretization, with $k$ belonging to some finite index set $I$. We then obtain a finite set of linear inequalities. A feasible point $(x_1, \ldots, x_p) = \Phi(h) = (\phi_1(h), \ldots, \phi_p(h))$ in this finite set forms a good estimate for the squared norm. Specifically, let $\upsilon^k = (\upsilon^k_1, \ldots, \upsilon^k_p) \in \mathbb{R}^p$, $k \in I$, with $\|\upsilon^k\|_2 = 1$. Suppose the minimum angle of separation for the collection of vectors is smaller than $\beta < \pi/4$. Consider a feasible point $x$ belonging to the set defined by Equation 7 with $\upsilon \in \{\upsilon^k\}_{k\in I}$, and let $\hat\Phi_n(y) = x$ be a non-linear estimator that maps the observations to a feasible point. It then follows that:
Corollary 1: If there exists an algorithm that achieves a radius of uncertainty $\gamma_0 + \epsilon$ for a data length $n$ with probability $\alpha$ then:
$$\|\Phi(h) - \hat\Phi_n(y)\|_2 \le \frac{\gamma_0 + \epsilon}{1 - \beta}$$
Proof: See the appendix.
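A small numerical sketch of this discretization idea (ours, with assumed data): each unit vector $\upsilon^k$ yields a scalar estimate of $\upsilon^k \cdot x$ within the radius, and any point feasible for the resulting inequalities is close to $x$ in the Euclidean norm. Here the feasible point is found by a simple least-squares fit rather than an explicit feasibility solver.

```python
import numpy as np

rng = np.random.default_rng(4)
p, radius = 3, 0.05                      # radius plays the role of gamma_0 + eps
x_true = np.array([0.7, -0.2, 0.4])      # Phi(h), unknown to the estimator

# Discretized family of unit-norm functionals with small angular separation
V = rng.standard_normal((200, p))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Scalar estimates of v.x, each within the radius (worst case simulated as uniform error)
est = V @ x_true + rng.uniform(-radius, radius, size=V.shape[0])

# A point (approximately) feasible for |V x - est| <= radius; least squares suffices here
x_hat, *_ = np.linalg.lstsq(V, est, rcond=None)

print(np.linalg.norm(x_hat - x_true))    # small, on the order of the radius
```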
These theorems are readily generalized to the stationary Gaussian noise setting with bounded power spectral density.
Corollary 2: Let $Y_n = \Lambda_n(h) + Q_n W_n$, where $Y_n, W_n$ are column vectors of length $n$, $\|Q_n\|_2 \le 1$, and $\Lambda_n = (\lambda_1(\cdot), \lambda_2(\cdot), \ldots, \lambda_n(\cdot))$ is a sequence of linear functionals. Then a linear algorithm achieves the minimum diameter of uncertainty for estimating a linear function $\Phi(h)$, mapping $\mathcal{H}$ to $\mathbb{R}^p$, from data under the $\ell_\infty$ and squared-norm error measures.
Remark: Note that the results do not provide an explicit expression for computing the minimum radius of uncertainty. However, they do provide a basis for determining this uncertainty through linear algorithms.
IV. PARAMETER ESTIMATION REVISITED
In this section we provide a complementary perspective on parameter estimation in random noise. The motivation for revisiting this well-studied topic is fourfold: (1) First, this topic has not been studied under the new criterion of uniform convergence. In the prediction-error paradigm [13], which unifies many of the results in stochastic identification, the objective is to analyze prediction errors, i.e., $V(\theta) = \frac{1}{n}\sum_{t=1}^{n} E\bigl(y(t) - \hat y(t\mid\theta)\bigr)^2$, as a function of the predictor. Our objective differs in two ways: (a) it directly deals with the estimation error, i.e., $\|\theta - \hat\theta_n(y)\|$; (b) we seek uniform convergence, i.e., control of $\mathrm{Prob}\{\sup_\theta \|\theta - \hat\theta_n(y)\| \ge \gamma\}$. (2) It turns out that under this new criterion the well-known input requirements become transparent and can be shown to be both necessary and sufficient for achieving consistency. We point out that most results in the literature in this connection are sufficient conditions, and necessity is not easy to establish in many cases. (3) We can describe tight lower bounds on the error for finite data without the need for a consistent estimator, which is typically required for Cramer-Rao bounds. (4) Finally, a new aspect is that optimal algorithms can also be developed for identification problems with convex parametric constraints, which model parametric dependencies.
To describe our problem we need the notions of a persistent input and persistency of excitation. A persistent input is an $\ell_\infty$ sequence with unbounded energy, i.e.:
Definition 3: An $\ell_\infty$-bounded sequence, $(x(0), x(1), \ldots, x(k), \ldots)$, is said to be persistent if
$$\liminf_n \|P_n x\|_2 = \infty$$
We point out that this definition is weaker than the conventional notion, where the input signal is required to be quasi-stationary and have finite power, i.e., $\lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n} x^2(t) = C$. We next describe persistency of excitation, which also is weaker than the conventional notion.
Definition 4: An input $u(\cdot)$ is said to be persistently exciting of order $m$ if there exist $m$ instruments (discrete-time signals), $v_1(\cdot), v_2(\cdot), \ldots, v_m(\cdot)$, such that
1) The input $u(\cdot)$ is persistent.
2) The $m \times m$ matrix $\Sigma_n$, with $[\Sigma_n]_{jk} = r^n_{v_j u}(k)$, satisfies $\Sigma_n \ge \rho I_m$ for all $n \ge N_0$, for some $N_0 \in \mathbb{Z}_+$ and $\rho > 0$. (A numerical check of this condition is sketched below.)
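As referenced in the definition, a direct numerical check of condition 2) for a candidate input and instruments is to form $\Sigma_n$ and test the smallest eigenvalue of its symmetric part against $\rho$. The sketch below is our own illustration; the choice of delayed copies of the input as instruments and the lag indexing are assumptions made only for the example.

```python
import numpy as np

def crosscorr(x, y, n, tau):
    """r^n_{xy}(tau) = sum_{i=0}^{n-tau} x(tau+i) y(i)."""
    return float(np.dot(x[tau:n + 1], y[:n + 1 - tau]))

def excitation_matrix(u, m, n):
    """Sigma_n with entries r^n_{v_j u}(k), using delayed copies of u as instruments v_j.

    The instrument choice and lag indexing (k = 0, ..., m-1) are illustrative assumptions."""
    v = [np.concatenate([np.zeros(j), u[:n + 1 - j]]) for j in range(m)]
    return np.array([[crosscorr(v[j], u, n, k) for k in range(m)] for j in range(m)])

rng = np.random.default_rng(6)
n, m, rho = 500, 4, 1.0
u = rng.choice([-1.0, 1.0], size=n + 1)              # Bernoulli (+/-1) input
S = excitation_matrix(u, m, n)
lam_min = np.linalg.eigvalsh(0.5 * (S + S.T)).min()  # test Sigma_n >= rho*I via the symmetric part
print(lam_min >= rho, lam_min)
```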
We first prove the result for FIR parameterizations, i.e.,
$$y(t) = \sum_{k=0}^{p-1} h(k)\, u(t-k) + w(t)$$
We assume that $w(t)$ is i.i.d. Gaussian noise. The following result holds:
Theorem 2: For the setup above, the radius of uncertainty is zero if and only if the input $u(\cdot)$ is persistently exciting of order $p$.
Proof: The sufficiency part of the result can be derived along the lines of [13] and is omitted. The necessity part, to the best of the author's knowledge, is new. Since norms on finite-dimensional spaces are all equivalent, we employ the $\ell_\infty$ norm here. We are then given that there exists an algorithm $\hat h^n_k(y)$ such that
$$\mathrm{Prob}\Bigl\{ \sup_{h\in\mathbb{R}^p} \|h(k) - \hat h^n_k(y)\|_\infty \ge \epsilon \Bigr\} \longrightarrow 0$$
To apply Theorem 1 we first need to define the $\lambda_j$'s:
$$\lambda_j(h) = \sum_{k=0}^{p-1} h(k)\, u(j-k), \quad j = 1, 2, \ldots, n$$
With this association we have a linear observation structure, and since the objective is to estimate the FIR taps, $\Phi(\cdot)$ in this case is the identity operator. Therefore, the assumptions of Theorem 1 are satisfied. Consequently, based on our hypothesis, there exists a linear algorithm that achieves uniform consistency as well. This implies the existence of an instrument $q^n_k$ corresponding to each coefficient $h(k)$ so that
$$\Bigl| \sum_{j=0}^{n} q^n_k(j) w(j) + \sum_{j=0}^{n} \bigl(q^n_k(j)u(j-k) - 1\bigr) h(k) + \sum_{j=0,\, j\ne k}^{p-1} \sum_{l=0}^{n} q^n_k(l)\, u(l-j)\, h(j) \Bigr| \xrightarrow{n\to\infty} 0 \tag{8}$$
Since $h \in \mathbb{R}^p$ is arbitrary, it follows that
$$\sum_{l=0}^{n} q^n_k(l)\, u(l-k) = 1; \qquad \sum_{l=0}^{n} q^n_k(l)\, u(l-j) = 0, \ \forall\, j \ne k \tag{9}$$
and
$$\mathrm{Prob}\Bigl\{ \Bigl| \sum_{j=0}^{n} q^n_k(j) w(j) \Bigr| \ge \epsilon \Bigr\} \xrightarrow{n\to\infty} 0$$
Since $w(\cdot)$ is an i.i.d. Gaussian random process, $\sum_{j=0}^{n} q^n_k(j) w(j)$ is Gaussian with mean zero and variance equal to $\sigma^2\|q^n_k\|_2^2$. Therefore, we need
$$\limsup_n \|q^n_k\|_2 = 0$$
Now, from Equation 9 we have, by the Cauchy-Schwartz inequality,
$$1 = \Bigl| \sum_{l=0}^{n} q^n_k(l)\, u(l-k) \Bigr| \le \|q^n_k\|_2 \Bigl( \sum_{l=0}^{n} u^2(l-k) \Bigr)^{1/2}$$
This implies that $\|P_n u\|_2 \to \infty$. Therefore, the input must be persistently exciting, and from Equation 9 it follows that the persistency of excitation must be of order $p$, as stated.
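One concrete way to build instruments satisfying the de-correlation conditions of Equation 9 (a sketch under our own choices, not prescribed by the paper) is to take the rows of the pseudo-inverse of the input regression matrix: then $q^n_k$ correlates to one with $u(\cdot - k)$, to zero with the other shifts, and its $\ell_2$ norm shrinks as the excitation grows.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 400, 3, 0.5
u = rng.choice([-1.0, 1.0], size=n)

# Regression matrix with columns u(t), u(t-1), ..., u(t-p+1)
U = np.column_stack([np.concatenate([np.zeros(k), u[:n - k]]) for k in range(p)])

Qinstr = np.linalg.pinv(U)            # rows are the instruments q^n_k
print(np.round(Qinstr @ U, 6))        # identity: Equation 9 holds exactly
print(np.linalg.norm(Qinstr, axis=1)) # instrument l2 norms decay roughly like 1/sqrt(n)

# The resulting linear estimate for a random FIR and Gaussian noise
h = rng.standard_normal(p)
y = U @ h + sigma * rng.standard_normal(n)
print(np.abs(Qinstr @ y - h).max())   # small uniform estimation error
```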
The proof is readily extended to the more general case of linear parameterizations, i.e.:
Corollary 3: Let the input-output equation be given by
$$y(t) = \psi^{T}(t)\theta + w(t), \quad \theta \in \mathbb{R}^p \tag{10}$$
where $\psi^{T}(t) = [y(t-1), y(t-2), \ldots, y(t-m_1), u(t), u(t-1), \ldots, u(t-m_2)]$ is the usual regression vector formed from previous inputs and outputs, with $m_1 + m_2 = p$. The radius of uncertainty is equal to zero if and only if the input is persistently exciting of order $p$.
These ideas can also be extended to more general convex parameterizations, which account for parametric dependencies. We consider finitely many parametric constraints modeled as
$$|Z\theta| \le b \tag{11}$$
where $Z \in \mathbb{R}^{l\times p}$. Note that the parametric set is balanced. We describe some relevant situations where such convex constraints naturally arise. Our first example is the case of a positive real condition, which can be modeled by the convex constraints:
$$\sum_k \theta_k \cos(\omega k) \ge 0, \quad \omega \in [-\pi, \pi]$$
To utilize these constraints we can discretize the infinite system of inequalities and, through translation, consider convex and balanced subsets of the parameter space. Another example where these results are applicable is when the set of parameters belongs to a linear manifold, i.e., $Z\theta = b$. We have the
following corollary for convex parameter constraints:
Corollary 4: Consider the setup of Equations 10, 11. Furthermore, let $\Psi_n = [\psi(0), \psi(1), \ldots, \psi(n)]^{T}$ be the data regression matrix. It follows that for a data length $n$ the radius of uncertainty (in the $\ell_\infty$ norm) is smaller than $\delta$ if and only if there exist matrices $Q \in \mathbb{R}^{p\times n}$ and $\Gamma$ such that
$$Q\Psi_n + \Gamma Z = I, \qquad |\Gamma b| + \|Q\|_{2,\infty} \le \delta, \qquad \Gamma \ge 0 \tag{12}$$
where all of the inequalities are to be interpreted pointwise.
Proof: The proof of the result follows from convex duality, and in particular a straightforward application of Farkas' lemma, which is stated here for the sake of completeness [19]:
Lemma 1 (Farkas Lemma): Consider a set of linear functions, $\{\mu_j\}$, on the parameter $\theta$. Then it follows that
$$\Bigl\{ \mu_0^{T}\theta - b_0 > 0 \text{ whenever } \mu_i^{T}\theta - b_i > 0,\ i = 1, 2, \ldots, m \Bigr\} \iff \mu_0 = \sum_{k=1}^{m} \gamma_k \mu_k, \ \ \sum_{k=1}^{m} \gamma_k b_k \le b_0, \ \ \gamma_k \ge 0$$
We now return to the proof of Corollary 4. First, note that the problem setup conforms to the convex setup. Indeed, the input-output data is given by a linear measurement structure, and the parameter set is convex and balanced. The desired solution, $\Phi(\theta) = \theta$, is linear. Therefore, it follows, similarly to the arguments used in the preceding theorems, that if the radius of uncertainty is smaller than $\delta$, then there must exist a matrix $Q$ and an integer $n$, for given $\epsilon > 0$, $\xi > 0$, such that
$$|(Q\Psi_n - I)\theta + Qw| \le \delta + \epsilon, \quad \forall\, |Z\theta| \le b,\ w \in \mathcal{W}^n_0$$
where $\mathcal{W}^n_0$ is the same convex set with probability mass larger than $1 - \xi$ as described in Theorem 1. The proof now follows by a direct application of Farkas' lemma.
The solution consists of finding a $Q$ that satisfies Equation 12, which can be solved through convex programming. An estimate of $\theta$ is then given by $Qy$, where $y$ is the observation vector of length $n$.
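As a sketch of the convex-programming step (ours, with synthetic data and the CVXPY modeling tool), the conditions of Equation 12 can be posed directly as an optimization over $Q$, $\Gamma$, and $\delta$. The row-wise handling of the induced $(2,\infty)$ norm is our interpretation of the pointwise inequalities; this is an illustration, not the paper's own implementation.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
p, n, l = 3, 40, 2
Psi = rng.standard_normal((n, p))          # data regression matrix Psi_n (synthetic)
Z = rng.standard_normal((l, p))            # constraint matrix in |Z theta| <= b
b = np.ones(l)

Q = cp.Variable((p, n))
Gamma = cp.Variable((p, l), nonneg=True)
delta = cp.Variable(nonneg=True)

constraints = [
    Q @ Psi + Gamma @ Z == np.eye(p),                       # Q Psi_n + Gamma Z = I
    cp.abs(Gamma @ b) + cp.norm(Q, 2, axis=1) <= delta,     # |Gamma b| + row-wise ||Q||_2 <= delta
]
prob = cp.Problem(cp.Minimize(delta), constraints)
prob.solve()

print(delta.value)          # smallest certified radius of uncertainty for this regressor
# Given observations y (length n), the parameter estimate is Q.value @ y.
```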
Remark: Observe that when the constraints are given by $Z\theta = 0$, we should expect that the persistency of excitation condition can be relaxed. This is indeed the case, as seen from Equation 12, because in this case, in the dual setting, the positivity constraint $\Gamma \ge 0$ no longer exists. Therefore, we are only left to satisfy $Q\Psi_n + \Gamma Z = I$ for an arbitrary $\Gamma$. It follows that if $Z$ has rank $m$, then $\Psi_n$ need only have rank $p - m$ to ensure that the equation can be satisfied. Therefore, the input needs to be persistently exciting of a smaller order than the parametric dimension.
Our next task is to comment on the issues of optimality and the choice of optimal inputs. From Theorem 1 it follows that linear algorithms are optimal for the $\ell_\infty$ norm. From the results in Corollary 4 we have, for a convex parametric set $\mathcal{S}$, that
$$\mathrm{Prob}\Bigl\{ \max_{\theta\in\mathcal{S}} \|\theta - \hat\theta_n(y)\|_\infty \ge \epsilon \Bigr\} \ge \mathrm{Prob}\bigl\{ \|Qw\|_\infty \ge \epsilon - \|\Gamma b\|_\infty \bigr\} = \mathrm{erfc}\left( \frac{\epsilon - \|\Gamma b\|_\infty}{\|Q\|_{2,\infty}} \right)$$
where $\mathrm{erfc}(\cdot)$ denotes the Gaussian error function [12] and we have assumed that the radius of uncertainty is equal to zero (for simplicity of exposition). These results in turn help in loosely characterizing inputs that would lead to optimal identification. This would follow by first minimizing $\|Q\|_{2,\infty}$ subject to the constraints of Equation 12. The optimal cost of this optimization problem provides an index $J(\Psi_n)$, which is a function of the regressor $\Psi_n$. One can then attempt to minimize the index by choosing different inputs.
A. Uniform Bounded Noise
Although the UBN setup does not strictly fall within the framework described in Section II, the proof technique of Theorem 1 can be used to establish that the radius of uncertainty is bounded away from zero. Let
$$y(t) = \sum_{k=0}^{p-1} h(k)\, u(t-k) + w(t), \quad w(t) \in [-\gamma, \gamma], \ |u(\cdot)| \le 1 \tag{13}$$
In the context of UBN we are interested in an estimate $\hat h_n(y)$, based on input-output data of length $n$, such that the worst-case error over all FIRs and admissible noise realizations is minimized:
$$\min_{\hat h_n} \; \sup_{w\in\mathcal{W}^n} \; \sup_{h\in\mathbb{R}^p} \|h - \hat h_n\|_1$$
where $\mathcal{W}^n = [-\gamma, \gamma]^n$ is the set of all admissible noise sequences up to time length $n$. Similarly, let $\mathcal{U}^n$ be the set of all unit-amplitude-bounded input sequences of length $n$. It is well known [23] that, regardless of the input, estimator and data length, the estimation error is always bounded from below by $\gamma$. We provide a new proof of this result based on Theorem 1, as stated below.
Theorem 3: The minimum radius of uncertainty is larger than $\gamma$ for any algorithm $\hat h_n(y)$.
Proof: We note that
$$\sup_{\mathcal{W}^n} \sup_{h} \|h - \hat h_n\|_1 \ge \sup_{\mathcal{W}^n} \sup_{h} \|h - \hat h_n\|_\infty$$
Now, both $\mathcal{W}^n$ and the set of FIRs are convex balanced sets; therefore it follows from the steps used in the proof of Theorem 1 that there must exist a linear algorithm that achieves the optimal error for the latter problem (the RHS of the inequality above). Thus there is a linear instrument, $q_n(\cdot)$, with the error estimate, $E$, given by
$$E = \max_{0\le m\le p-1} \Bigl| \sum_{t=0}^{n} q^n_m(t) w(t) + \sum_{j=0}^{n} \bigl(q^n_m(j)u(j) - 1\bigr) h(0) + \sum_{j=1}^{p-1} \sum_{l=0}^{n} q^n_m(l)\, u(l-j)\, h(j) \Bigr|$$
This implies that if the error $E < \infty$ then
$$1 = \sum_{j=0}^{n} q^n_m(j)\, u(j) \le \|q^n_m\|_1 \|u\|_\infty = \|q^n_m\|_1, \quad \forall\, 0 \le m \le p-1$$
Now, we see that
$$E \ge \sup_{w} \Bigl| \sum_{t=0}^{n} q^n_m(t) w(t) \Bigr| \ge \|q^n_m\|_1 \|w\|_\infty \ge \gamma, \quad \forall\, m$$
thus establishing the result.
Theorem 1 also points to a situation under which the minimum radius of uncertainty is zero: if a small subset of noise sequences of vanishingly small measure were rendered inadmissible, then the radius of uncertainty would be equal to zero. This can also be seen indirectly from our results in an earlier paper [30].
V. UNMODELED DYNAMICS WITH STOCHASTIC NOISE
In this section we deal with the problem of asymptotic consistency for systems with mixed uncertainties consisting of stochastic measurement noise and worst-case (deterministic) unmodeled dynamics. The problem arises in system identification whenever restricted-complexity models are used for identification. System identification with restricted-complexity models has been explored in [27], [8], [35] with deterministic worst-case noise, and in mixed situations in [30], [27], [36], [4]. Here we study mixed situations and derive fundamental conditions required to achieve uniform convergence of the model estimates to their optimal approximations in the model class. Our setup deals with the following input-output equation:
$$y(t) = Hu(t) + w(t) = G(\theta)u(t) + \Delta u(t) + w(t) \tag{14}$$
where $u$ is a bounded input; $H = G(\theta) + \Delta \in \mathcal{S}$, the subset of systems; $G(\theta) \in \mathcal{G}$ is a model class of interest; and $w(t)$ is a Gaussian stochastic process. The main goal is to estimate $G(\theta)$ from input-output data in a uniform manner as described in Section II. More specifically, we consider model classes that are subspaces, i.e.,
$$\mathcal{G} = \Bigl\{ G \in RH_2 \;\Bigm|\; G = \sum_{k=1}^{m} \theta_k G_k, \ G_k \in RH_2, \ \theta_k \in \mathbb{R} \Bigr\} \tag{15}$$
In the following section we discuss issues that arise when the perturbations $\Delta$ are specified independently of the model class.
A. Overbounding Unmodeled Dynamics
The main goal of this section is to motivate the need for an optimal decomposition of the system into model and unmodeled dynamics. We do this by computing the radius of uncertainty for identifying an FIR model class with unmodeled dynamics. We then generalize these results to the more general case. Suppose $\mathcal{G}_{FIR}$ is the class of finite impulse response (FIR) sequences of order $m$. Let
$$y(t) = Hu(t) + w(t) = Gu(t) + \Delta u(t) + w(t) \tag{16}$$
where $H \in \ell_1$, $G \in \mathcal{G}_{FIR}$, $\Delta \in \boldsymbol{\Delta}$, and $w(0), w(1), \ldots$ is a sequence of i.i.d. jointly Gaussian random variables; inputs are bounded, i.e., $|u(k)| \le 1$, $k \in \mathbb{Z}_+$; the unmodeled class $\boldsymbol{\Delta}$ is given by
$$\boldsymbol{\Delta} = \{\Delta \in \ell_1 \mid \|\Delta\|_1 \le \gamma\} \tag{17}$$
which reflects the fact that the $m$th-order FIR accounts for the underlying system within a worst-case error of size $\gamma$. This is clearly a redundant decomposition of the underlying system: the first $m$ taps of the true system are accounted for by both the model and the unmodeled dynamics. This results in a minimum radius of uncertainty larger than $\gamma$.
Proposition 2: For the setup above it follows that the minimum radius of uncertainty is greater than $\gamma$.
Proof: The convexity of the set of systems, the linearity of the observations, and the linearity of the solution operator $\Phi(h) = (h_0, \ldots, h_{m-1})$ can be readily verified. Now, the radius of uncertainty can only become smaller if we consider a smaller subset of systems, specifically, perturbations that satisfy $\delta(k) = 0$ for all $k \ge m$. By Theorem 1 there must exist instruments $q^n_l(\cdot)$ that are optimal for estimating the FIR model class. Consider estimation of the first tap. It follows that if the radius of uncertainty is bounded then
$$\sum_{k=0}^{n} q^n_0(k)\, u(k) = 1$$
This implies that
$$\sup \Bigl| \sum_{k=0}^{n} q^n_0(k)\bigl((G + \Delta)u\bigr)(k) + \sum_{k=0}^{n} q^n_0(k) w(k) - h(0) \Bigr| \;\ge\; \sup_{|\delta(0)|\le\gamma} \Bigl| \sum_{k=0}^{n} q^n_0(k)\, u(k)\, \delta(0) \Bigr| = \gamma$$
This proof can be generalized to other model classes as well, which we state below without proof.
Proposition 3: Suppose the system decomposition is convex and balanced and is such that, for some $\eta > 0$, the model $\eta G$ is an element of the unmodeled set $\boldsymbol{\Delta}$; then the radius of uncertainty must be larger than $\eta$.
The main problem with these results is the apparent gap between practical reality and the minimum radius of uncertainty achieved through the decomposition of Equation 16. For instance, consider the case when the noise is relatively small. In this case the decomposition suggests that the radius of uncertainty still does not go to zero. However, in practice the first $m$ taps of the impulse response of $H$ can be identified accurately in the noiseless case. In this light, the parametrization in Equation 17 overbounds the existing input-output relationship and leads to a conservative result. This issue was acknowledged by the author in [27], where it was pointed out that once a normed space of systems is chosen, there is an optimal decomposition in the sense that it minimizes the unmodeled error. Once such a decomposition is chosen, the objective is to estimate the optimal model component. For the FIR case, such a decomposition leads to the following description of the unmodeled error:
$$\boldsymbol{\Delta} = \{ z^{-m}\Delta \mid \|\Delta\| \le \gamma \} \tag{18}$$
which is optimal for the $\ell_1$ and $H_2$ norms, because this decomposition leads to the minimum unmodeled error.
B. Identification in Hilbert Spaces
Motivated by the discussion in the previous section, we derive identification results for more general model spaces. The basic structure used in the proof (as seen from Equations 5, 6) is that we require the set $\mathcal{S}$, after the decomposition, to remain convex. On a Hilbert space this convex decomposition is preserved, and we should expect Theorem 1 to apply in this setting as well. The main problem is that not all elements belonging to unit balls in $RH_2$ are necessarily stable. Therefore, there are hardly any inputs that satisfy the requirements for uniform consistency, and the unmodeled perturbations have to be further constrained in some manner. These issues are discussed in this section.
We are now left to describe the unmodeled error. Motivated by Proposition 3, we again seek to ascribe modeled and unmodeled components to different aspects of the data. This can be done naturally since the dual is also a Hilbert space and an orthogonal separation between modeled and unmodeled components can easily be established. Indeed, using Cauchy's theorem we can do more, as in the following proposition:
Proposition 4: Consider the model class given by Equation 15. Then it follows that every real-rational system, $H$, can be written as
$$H = G_0 + F\Delta \tag{19}$$
for some $G_0 = G(\theta_0)$, a real-rational all-pass function $F$, and an arbitrary perturbation $\Delta$. The zeros of $F^{*}$ are the pole locations of $G_1, G_2, \ldots, G_m$.
Proof: The proof follows by applying Cauchy's theorem [21].
e) Example: There are $m$ poles at zero for an FIR model space of order $m$, and therefore the system $F$ is a delay of order $m$. This implies that the unmodeled component is given by the set of all stable real-rational functions of the form $z^{-m}\Delta$, with $\Delta$ an arbitrary element of $RH_2$.
As before, we restrict the size of the perturbation and first consider the following subsets of real systems:
$$\boldsymbol{\Delta}_2 = \{\Delta \mid \|\Delta\|_2 \le \gamma\}, \qquad \boldsymbol{\Delta}_1 = \{\Delta \mid \|\Delta\|_1 \le \gamma\} \tag{20}$$
As remarked earlier, the second subset is considered since the first (although natural) contains unstable systems; consequently, as it turns out, it is difficult to find inputs that lead to uniform consistency for the first set. For notational simplicity we define:
$$\mathcal{S}_2 = \{H \in RH_2 \mid H = G + F\Delta,\ G \in \mathcal{G},\ \Delta \in \boldsymbol{\Delta}_2\} \tag{21}$$
$$\mathcal{S}_1 = \{H \in RH_2 \mid H = G + F\Delta,\ G \in \mathcal{G},\ \Delta \in \boldsymbol{\Delta}_1\}$$
We now state our main result in this section.
Theorem 4: Let $\hat u = Fu$ be the input filtered through the all-pass filter $F$, and let $u_l(\cdot) = G_l u(\cdot)$. A necessary and sufficient condition for uniform consistency over the set $\mathcal{S}_2$ is that the input be persistent and that there exist a sequence of instruments, $q_n(\cdot)$, such that
$$\sum_{k=0}^{m} r^n_{q_n u}(k)\, g_l(k) = r^n_{q_n u_l}(0) = 1, \quad l = 1, 2, \ldots, m \tag{22}$$
$$\sum_{\tau=0}^{n} \bigl| r^n_{q_n \hat u}(\tau) \bigr|^2 \xrightarrow{n\to\infty} 0; \qquad \bigl| r^n_{q_n w}(0) \bigr| \xrightarrow{n\to\infty} 0 \ \text{in probability}$$
On the other hand, if $\mathcal{S}_1$ is considered, then the conditions for uniform consistency remain the same except that the second condition is weaker, i.e.,
$$\max_{1\le\tau\le n} \bigl| r^n_{q_n \hat u}(\tau) \bigr| \xrightarrow{n\to\infty} 0 \tag{23}$$
Furthermore, these conditions imply that the input must be persistently exciting of infinite order.
Proof: First we consider the set $\mathcal{S}_2$. We note that the projection $\Pi H$ of the system $H$ onto the subspace $\mathcal{G}$ is linear and of finite rank. Therefore, the solution operator $\Phi(H) = (\theta_1, \ldots, \theta_m)^{T}$ is linear. The subset $\mathcal{S}_2$ is convex and the observation given by Equation 14 is linear. For the sake of exposition we next consider a one-parameter model class; the general case follows in an identical manner. Now, by Theorem 1, if uniform consistency holds there must exist a number $n$ and an instrument $q_n(\cdot)$ so that, for any $\epsilon > 0$, $\xi > 0$,
$$\Bigl| \sum_{k=0}^{n} q_n(k) w(k) + \sum_{j=0}^{n} \sum_{k=0}^{n-j} q_n(k+j)\, u(k)\, h(j) - \theta_1 \Bigr| \le \epsilon, \quad \forall\, \theta_1 \in \mathbb{R}$$
with $w \in \mathcal{W}^n_0$ and $\mathrm{Prob}\{\mathcal{W}^n_0\} \ge 1 - \xi$. As before, this implies the condition that the instrument must de-correlate the noise, i.e., $|\sum_{k=0}^{n} q_n(k)w(k)| \longrightarrow 0$. We are now left with an instrument that must satisfy
$$\Bigl| \sum_{j=0}^{n} \sum_{k=0}^{n-j} q_n(k+j)\, u(k)\, h(j) - \theta_1 \Bigr| = \Bigl| \sum_{j=0}^{n} \sum_{k=0}^{n-j} q_n(k+j)\, u(k)\bigl(\theta_1 g_1(j) + (f_1 * \delta)(j)\bigr) - \theta_1 \Bigr| \le \epsilon$$
where $g_1(k), f_1(k), \delta(k)$ are the impulse responses of $G_1, F_1, \Delta$ respectively. Observe that we have applied Proposition 4 and substituted $F_1\Delta$ for the perturbation, so that $F_1$ and $G_1$ are orthogonal. By arbitrarily varying $\theta, \Delta$ we see that
$$\sum_{j=0}^{n} r^n_{q_n u}(j)\, g_1(j) = r^n_{q_n u_1}(0) = 1; \qquad \sup_{\Delta\in\boldsymbol{\Delta}_2} \Bigl| \sum_{j=0}^{n} r^n_{q_n \hat u}(j)\, \delta(j) \Bigr| \le \epsilon$$
where, for the first expression, $u_1 = G_1 u$, and in the second expression we have absorbed the filter $F_1$ into the input, i.e., $\hat u = F_1 u$. The condition arising from the noise and the first expression together imply that the input must be persistently exciting. This follows from the fact that
$$1 = \Bigl| \sum_{j=0}^{n} q_n(j)\, u_1(j) \Bigr| \le \|q_n\|_2\, \|G_1\|_{H_\infty}\, \|P_n u\|_2$$
where $\|\cdot\|_{H_\infty}$ is the usual $H_\infty$ norm. But the noise condition implies that $\|q_n\|_2 \longrightarrow 0$. Therefore, $\|P_n u\|_2 \longrightarrow \infty$, since $G_1$ is bounded in $H_\infty$ (it is a stable real-rational transfer function). Next, for the perturbation, we apply the Cauchy-Schwartz inequality to obtain
$$\sup_{\Delta\in\boldsymbol{\Delta}_2} \Bigl| \sum_{j=0}^{n} r^n_{q_n \hat u}(j)\, \delta(j) \Bigr| = \gamma \Bigl( \sum_{j=0}^{n} \bigl(r^n_{q_n \hat u}(j)\bigr)^2 \Bigr)^{1/2} \longrightarrow 0$$
For the set $\boldsymbol{\Delta}_1$ we similarly have
$$\sup_{\Delta\in\boldsymbol{\Delta}_1} \Bigl| \sum_{j=0}^{n} r^n_{q_n \hat u}(j)\, \delta(j) \Bigr| = \gamma \max_{j\le n} \bigl| r^n_{q_n \hat u}(j) \bigr| \longrightarrow 0$$
This establishes the necessity. We will not describe the sufficiency argument, since it is similar to the argument for sufficiency for FIR model classes, which is dealt with in the next section.
Remark: To see how the convex separation results in achieving uniform consistency, note that the condition on the input from the modeled part implies that $G_1 u(\cdot)$ is aligned with the instrument $q_n(\cdot)$. Similarly, the unmodeled part implies that $F_1 u(\cdot)$ is perpendicular to $q_n(\cdot)$. This is possible because $G_1$ and $F_1$ are orthogonal, and so $G_1 u$ and $F_1 u$ are uncorrelated when driven by white noise.
Remark: We point out that the conditions for $\mathcal{S}_2$ are more demanding than those for $\mathcal{S}_1$. For the set $\mathcal{S}_1$ we only require that the normalized worst-case cross-correlation between the input and the instrument go to zero. This condition can easily be accomplished for inputs that have small autocorrelation coefficients, i.e., inputs that are white in their second-order properties; an appropriately filtered input can then be selected as the instrument. In the $RH_2$ case we need a more stringent condition, since we need the sum of the squares of the normalized cross-correlations to also go to zero. This is understandable on account of the fact that $RH_2$ forms a larger class of perturbations than $\ell_1$.
We next discuss issues of optimality and the choice of inputs. First, notice that one could incorporate additional parametric and non-parametric constraints as we did in Section IV and Equation 12. For this set of problems, the fact that the optimal algorithm is linear follows from Theorem 1. The problem now reduces to looking for inputs satisfying Equation 22. The constraints on the input (disregarding the noise constraint) imply that the optimal instrument is the solution to:
$$\min\; (q_n)^{T} \Psi_n \Psi_n^{T} (q_n) \quad \text{subject to} \quad \sum_{j=0}^{n} q_n(j)\, u(j) = 1$$
where $\Psi_n = [\psi(0), \psi(1), \ldots, \psi(n)]^{T}$, with $\psi(t) = [\hat u(t), \hat u(t-1), \ldots, \hat u(0)]$. Let $u = [u(0), \ldots, u(n)]^{T}$. If the solution to this quadratic optimization problem has a value greater than $\beta$ (this is the opposite of what we want) then:
$$(q_n)^{T} \Psi_n \Psi_n^{T} (q_n) = \frac{1}{u^{T} (\Psi_n \Psi_n^{T})^{-1} u} \ge \beta \iff \begin{bmatrix} 1/\beta & u^{T} \\ u & \Psi_n \Psi_n^{T} \end{bmatrix} \ge 0$$
where we have used Schur complementation to derive the last expression. We now argue the difficulty of finding inputs that satisfy this condition by applying it to a 1-tap FIR model class. For this situation the above condition implies that
$$\sum_{j=1}^{n} \bigl(r^n_u(j)\bigr)^2 \ge \beta$$
If the input is chosen i.i.d., then the expected value of the expression on the left is bounded away from zero. Therefore, for a $\beta > 0$ (independent of $n$) the above condition is met with probability one, leading us to believe that few inputs satisfy the opposite condition, i.e., $\sum_{j=1}^{n} (r^n_u(j))^2 \ll \beta$.
On the other hand the weaker condition for ∆
1
can be
met by a number of inputs. In particular we only need to
establish that there exist inputs with small autocorrelations
with high probability. Then, by choosing an appropriately
filtered input as the instrument the conditions of Equation 23
can be satisfied.
Lemma 2: Suppose $u(t)$ is a discrete i.i.d. Bernoulli random process with mean zero and variance 1; then
$$\mathrm{Prob}\Bigl\{ \max_{0<k\le n} \frac{|r^n_u(k)|}{n} \ge \alpha \Bigr\} \le n\exp\bigl(-n\beta(\alpha)\bigr) \tag{24}$$
where $\beta(\alpha) = 1 + \frac{1-\alpha}{2}\log_2\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\log_2\frac{1+\alpha}{2}$.
Proof: The proof of this result can be found in [6].
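An empirical illustration of the bound (our own sketch, not from the paper): for a ±1 Bernoulli input, the largest normalized autocorrelation coefficient over nonzero lags concentrates near zero as the length grows, which is exactly the property used to satisfy the weaker condition of Equation 23 with a filtered-input instrument.

```python
import numpy as np

rng = np.random.default_rng(8)

def max_normalized_autocorr(n):
    u = rng.choice([-1.0, 1.0], size=n + 1)
    full = np.correlate(u, u, mode="full")   # all lags, C-implemented
    return np.abs(full[n + 1:]).max() / n    # max over lags 1, ..., n

for n in (100, 1000, 5000):
    samples = [max_normalized_autocorr(n) for _ in range(20)]
    print(n, max(samples))   # shrinks with n, consistent with the exponential bound (24)
```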
In summary we have characterized conditions for uniform
consistency and optimality for Hilbert spaces, where convex
separation between model class and unmodeled dynamics
proved to be critical. In the next section we consider a situation
where such a convex separation does not exist.
C. Identification in $\ell_1$ and Non-Linear Functionals
We now turn to the problem of identification in the $\ell_1$ norm. In the previous section the Hilbert space structure led to a convex decomposition of the unmodeled error and model class, and Theorem 1 could be applied in a straightforward manner. For FIR model classes, under the $\ell_1$ norm, the decomposition is still convex. This is due to the fact that the optimal FIR approximations in $\ell_1$ and $RH_2$ are identical. However, for more general classes under the $\ell_1$ norm the decomposition through optimal approximations is not convex. In the next section, we present identification of the FIR model class in the $\ell_1$ norm. The results developed for the FIR case will then be used to compute uniformly consistent estimators for more general model classes.
D. FIR Model Class
In this section we derive necessary and sufficient conditions for identification of FIR model classes in the presence of unmodeled dynamics under the $\ell_1$ norm. To this end, consider the optimal decomposition:
$$\mathcal{S} = \{H \in \ell_1 \mid H = G + \Delta,\ G \in \mathcal{G}_{FIR},\ \Delta \in \boldsymbol{\Delta}\} \tag{25}$$
where $\boldsymbol{\Delta} = \{z^{-m}\Delta \mid \|\Delta\|_1 \le \gamma\}$ and $\mathcal{G}_{FIR}$ is the class of $m$-tap FIR sequences.
In the following theorem we derive necessary and sufficient conditions for achieving uniform consistency in estimating the FIR coefficients.
Theorem 5: The radius of uncertainty for the above setup is equal to zero if and only if there is a persistent input and a corresponding family of instrument variables $q_n(\cdot)$ which uniformly de-correlates the input and the noise, i.e., there is a number $N(\rho)$ for every $\rho > 0$ such that for all $n > N(\rho)$:
$$\max_{1\le\tau\le n} \bigl| r^n_{q_n u}(\tau) \bigr| \xrightarrow{n\to\infty} 0, \qquad \bigl| r^n_{q_n u}(0) \bigr| = 1 \tag{26}$$
$$\bigl| r^n_{q_n w}(0) \bigr| \xrightarrow{n\to\infty} 0 \ \text{in probability} \tag{27}$$
Proof: (⟹) We observe that all norms are equivalent on finite-dimensional spaces [15]. Consequently, uniform consistency in the $\ell_1$ norm for the first $m$ FIR taps is identical to uniform consistency in the $\ell_\infty$ norm. This implies, by hypothesis, that the minimum radius of uncertainty is zero for the $\ell_\infty$ norm as well. Therefore, given $\epsilon > 0$, $\xi > 0$, there is an algorithm $\hat h$ and a number $L(\epsilon, \xi, \hat h, \mathcal{S})$ such that $\mathrm{Prob}\bigl\{ \sup_{H\in\mathcal{S}} \|P_m(h - \hat h_n)\|_\infty \ge \epsilon \bigr\} \le \xi$ for all $n \ge L(\epsilon, \xi, \hat h, \mathcal{S})$. The setup now mirrors Theorem 1, since the set $\mathcal{S}$ is convex; $\Phi(H) = (h(0), \ldots, h(m-1))^{T}$; $w$ is i.i.d. Gaussian noise; the measurement structure is linear, i.e., $\lambda_t(h) = Hu(t)$; and the norm measure on $\Phi(\cdot)$ is the $\ell_\infty$ norm. Consequently, by hypothesis, there is a linear algorithm with weights $q^n_l(\cdot)$ for each FIR tap, $0 \le l \le m-1$, such that
$$\sup_{w\in\mathcal{W}^n_0} \; \sup_{H\in\mathcal{S}} \Bigl| \sum_{k=0}^{n} q^n_l(k)\, y(k) - h(l) \Bigr| \le \epsilon, \quad \forall\, n \ge N(\epsilon, \xi)$$
and $\mathrm{Prob}\{\mathcal{W}^n_0\} \ge 1 - \xi$. Upon substitution it follows that
$$\Bigl| \sum_{k=0}^{n} q^n_l(k) w(k) + \sum_{k=0}^{n} \bigl(q^n_l(k)u(k-l) - 1\bigr) h(l) + \sum_{j=m}^{n} \sum_{k=0}^{n-j} q^n_l(k+j)\, u(k)\, h(j) \Bigr| \xrightarrow{n\to\infty} 0$$
Next, by noting that $\Delta$ is arbitrary with $\ell_1$ norm less than $\gamma$, and that $w(\cdot)$ is an arbitrary element of the convex set $\mathcal{W}^n_0$, which also contains the zero element, we have
$$\sum_{k=0}^{n} q^n_l(k)\, u(k-l) = 1; \quad \max_{1<j<n} \Bigl| \sum_{k=0}^{n-j} q^n_l(k+j)\, u(k) \Bigr| \le \epsilon; \quad \Bigl| \sum_{k=0}^{n} q^n_l(k) w(k) \Bigr| \le \epsilon, \ w \in \mathcal{W}^n_0$$
The proof now follows by first observing that, for the hypothesis to be true, we must have $\mathrm{Prob}\{\mathcal{W}^n_0\} \ge 1 - \xi$. The persistency of the input follows along similar lines as in Theorem 2.
(⟸) To prove sufficiency we construct an IV technique that is uniformly consistent. Basically, we let $q_n = q^n_1$ and its delayed copies, up to $m$ delays, serve as instruments. In particular, for data of length $n$, we let $Q_n$ be the matrix composed of the instrument $q_n$ and its $m$ delays, $U_n$ a corresponding matrix of the input sequence with its $m$ delays, and $R_n$ the matrix of the residual delays of the input sequence:
$$Q_n = \begin{bmatrix} q_n(0) & 0 & \cdots & 0 \\ q_n(1) & q_n(0) & 0 & \cdots \\ \vdots & \vdots & \ddots & q_n(0) \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix} \tag{28}$$
$$U_n = \begin{bmatrix} u(0) & 0 & \cdots & 0 \\ u(1) & u(0) & 0 & \cdots \\ \vdots & \vdots & \ddots & u(0) \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}; \qquad R_n = \begin{bmatrix} 0 & 0 & \cdots \\ \vdots & \vdots & \vdots \\ u(0) & 0 & \cdots \\ \vdots & \vdots & \vdots \end{bmatrix}$$
Suppose $g$ is the column vector composed of the first $m$ impulse response coefficients, i.e., $g = P_m H$, and $\delta$ is the column vector of the impulse response of the unmodeled dynamics, i.e., $\delta = (P_n - P_m)H$. Let $y, w$ be the output and noise signal vectors of length $n$ respectively. Then
$$y = U_n g + R_n \delta + w \;\Longrightarrow\; Q_n^{T} y = Q_n^{T} U_n g + Q_n^{T} R_n \delta + Q_n^{T} w \tag{29}$$
It follows from the hypothesis that $Q_n^{T} U_n$ is an invertible matrix with bounded inverse for large enough $n$, since it converges to a lower triangular matrix with identical elements on the diagonal. Again from our hypothesis, together with the fact that $\|\Delta\|_1 \le \gamma$, it follows that $Q_n^{T} R_n \delta$ converges to zero. To show this, note that the $k$th row of $Q_n^{T} R_n$ is given by
$$(Q_n^{T} R_n)_k = \Bigl( \sum_{j=0}^{n+1-(m+k)} q_n(m+(k-1)+j)\, u(j),\ \ \sum_{j=0}^{n-(m+k)} q_n(m+k+j)\, u(j),\ \ \ldots,\ \ q_n(m+(k-1)+j)\, u(0) \Bigr)$$
Consequently, by hypothesis, it follows that
$$\bigl| (Q_n^{T} R_n)_k\, \delta \bigr| \le \|\Delta\|_1 \max_{\tau} \Bigl| \sum_{j=0}^{n+1-(m+k+\tau)} q_n(m+(k-1)+j+\tau)\, u(j) \Bigr| \le \gamma \max_{\tau>0} \bigl| r^n_{q_n u}(\tau) \bigr| \longrightarrow 0$$
Finally, $Q_n^{T} w$ also converges to zero in expectation, by hypothesis and by noting that $(w(0), w(1), \ldots, w(n))$ are i.i.d. random variables.
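The sufficiency construction above translates directly into a short numerical sketch (ours, with an illustrative instrument choice $q_n = u/n$): build the banded input and instrument matrices, form $Q_n^T y$, and invert $Q_n^T U_n$ to recover the first $m$ taps despite a small $\ell_1$-bounded tail perturbation and Gaussian noise.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(9)
n, m, sigma, gamma = 2000, 4, 0.5, 1.0

u = rng.choice([-1.0, 1.0], size=n)
q = u / n                                         # instrument q_n(k) = u(k)/n (illustrative choice)

def banded(seq, cols):
    """Lower-triangular banded matrix whose columns are successive delays of seq."""
    return toeplitz(seq, np.concatenate(([seq[0]], np.zeros(cols - 1))))

U_full = banded(u, n)                             # all delays of the input
Un, Rn = U_full[:, :m], U_full[:, m:]             # modeled (first m) and residual delays
Qn = banded(q, m)

g = rng.standard_normal(m)                        # true FIR taps
delta = gamma * rng.dirichlet(np.ones(n - m)) * rng.choice([-1, 1], size=n - m)  # ||delta||_1 = gamma
y = Un @ g + Rn @ delta + sigma * rng.standard_normal(n)

g_hat = np.linalg.solve(Qn.T @ Un, Qn.T @ y)      # Equation 29 rearranged
print(np.abs(g_hat - g).max())                    # small, and decays further as n grows
```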
We next discuss the issue of sample complexity. The sample complexity of an algorithm $\hat\Phi(y)$ is said to be polynomial [32] in learning theory if
$$L(\epsilon, \xi, \hat\Phi, \mathcal{S}) \le \Bigl(\frac{1}{\epsilon}\Bigr)^{d} \log\Bigl(\frac{1}{\xi}\Bigr)$$
for some integer $d$. We establish polynomial sample complexity of linear algorithms for the FIR model class here. First, we choose instruments $q_n(k) = u(k)/n$, where the input sequence is a Bernoulli process. It is then easy to establish that
$$\mathrm{Prob}\Bigl\{ \sup_{H\in\mathcal{S}} \|Q_n^{T} y - P_m(H)\|_\infty \ge \epsilon \Bigr\} \le \mathrm{Prob}\Bigl\{ \max_{0<\tau\le n} \frac{|r^n_u(\tau)| + |r^n_{uw}(\tau)|}{n} \ge \epsilon \Bigr\}$$
$$\le \mathrm{Prob}\Bigl\{ \max_{0<\tau\le n} \frac{|r^n_u(\tau)|}{n} \ge \frac{\epsilon}{2} \Bigr\} + \mathrm{Prob}\Bigl\{ \max_{0<\tau\le n} \frac{|r^n_{uw}(\tau)|}{n} \ge \frac{\epsilon}{2} \Bigr\} \le n\exp\bigl(-n\beta(\epsilon)\bigr) + n\exp\bigl(-n\epsilon^2/2\bigr)$$
where we have taken the probability over both the input and the noise process. The first term in the last expression uses Lemma 2 and the second term uses the fact that $w$ is Gaussian. Now, it follows directly by substitution that, for some $C > 0$,
$$L(\epsilon, \xi, \hat\Phi, \mathcal{S}) \le C \Bigl(\frac{1}{\epsilon}\Bigr)^{2} \log\Bigl(\frac{1}{\xi}\Bigr)$$
implying polynomial sample complexity of convergence.
E. General Model Classes
In this section we derive algorithms for achieving uniform consistency for general model classes (that are subspaces) in the $\ell_1$ norm. The main difficulty is that the decomposition achieved through optimal approximation is not convex. Hence, Theorem 1 cannot be directly applied. We proceed by first describing the optimal approximation as a known non-linear function of linear functions on the underlying system space. Since linear functions of the system can be efficiently estimated, the non-linear function can be estimated as well.
Our setup consists of
$$y(t) = Hu(t) + w(t) = Gu(t) + \Delta u(t) + w(t), \quad G \in \mathcal{G},\ \Delta \in \boldsymbol{\Delta}$$
where $\mathcal{G} = \bigl\{ \sum_{j=1}^{p} \alpha_j G_j \mid \alpha_j \in \mathbb{R},\ G_j \in RH_2 \bigr\}$ and $\boldsymbol{\Delta} = \{\Delta \mid \|\Delta\| \le \gamma,\ \Delta \text{ aligned with } \mathcal{G}^{\perp}\}$, where $\mathcal{G}^{\perp}$ is the space of annihilators of the model space $\mathcal{G}$. The decomposition follows from well-known results in $\ell_1$ approximation theory [15]. The noise $w(\cdot)$ is assumed to be i.i.d. Gaussian as before. Our task is to determine the model $G_0$ that minimizes the unmodeled error, i.e., $G_0 = \arg\min_{G\in\mathcal{G}} \|H - G\|$. The following example illustrates the issue of non-convexity of $\ell_1$ decompositions.
f) Example: Consider the following convex space of systems,
$$
\mathcal{S} = \bigl\{ H\in\ell_1 \mid H_\alpha = \alpha H_1 + (1-\alpha)H_2;\;\; \alpha\in[0,1] \bigr\},
$$
where $H_1(z^{-1}) = 1$ and $H_2(z^{-1}) = z^{-1} + \bigl(1 - \tfrac{z^{-1}}{2}\bigr)^{-1}$. Our model class is given by $\mathcal{G} = \bigl\{\theta\bigl(1-\tfrac{z^{-1}}{2}\bigr)^{-1} \mid \theta\in\mathbb{R}\bigr\}$. From the alignment conditions for optimality [15] it follows that the unique optimal approximation $G(H_1)$ for the system $H_1$ is $G(H_1) = 0$. Similarly, we deduce that $G(H_2) = \bigl(1-\tfrac{z^{-1}}{2}\bigr)^{-1}$. Moreover, $G(H_\alpha) = G(H_2)$ for all $\alpha \neq 1$ is the unique optimal approximation as well. Therefore, the set of optimal approximants for the convex set of systems $\mathcal{S}$ is not convex.
Motivated by these issues we solve the problem of $\ell_1$ identification indirectly. The idea is that the $\ell_1$ minimizer can be arbitrarily closely approximated by a non-linear function of linear functions of the system impulse response, i.e.,
$$
G_0(H) = \psi(L(H)),
$$
where $\psi(\cdot)$ is a non-linear continuous map, which maps to the model subspace, and $L$ is a finite-rank linear operator mapping the system $H$ to an appropriate finite-dimensional space. Uniform consistency in $\ell_1$ would then follow by estimating the linear functions, whose consistency conditions are in turn characterized by Theorem 5. To this end we have the following theorem.
Theorem 6: The input characterization as in Theorem 5 is sufficient for uniform consistency in the $\ell_1$ norm.
The proof of the theorem relies on a number of results that
are provided below.
Proposition 5: The optimal $\ell_1$ approximation $G_0(H)$ for a given system $H$ can be re-written as
$$
G_0(H) = \Pi(H) + \arg\min_{G\in\mathcal{G}} \|H - \Pi H - G\| = \lim_{m\to\infty} G_0(P_m H),
$$
where $\Pi$ is the $\ell_2$ projection onto the space $\mathcal{G}$.
Proof: The proof follows from the following equality:
$$
\min_{G\in\mathcal{G}} \|H-G\|_1 = \min_{G\in\mathcal{G}} \|H - \Pi H - G\|_1.
$$
The limiting argument follows from the fact that $P_m H \to H$ and the continuity of the projection operation and of the $\ell_1$ norm.
For notational simplicity we denote the modified problem on the RHS by $G_r$, i.e.,
$$
G_r(H) = \arg\min_{G\in\mathcal{G}} \|H - \Pi H - G\|,
$$
and the modified system by $H_r$,
$$
H_r = H - \Pi H.
$$
The main task is now to show how to further simplify the problem to an instantiation of Theorem 1. First, note that $\Pi H$ is linear in $H$ and the range of the projection is a finite-dimensional space. Therefore, under the conditions derived in Theorem 4, we can consistently estimate the projection in a uniform manner. The problem reduces to estimating $G_r(H)$.
In this regard, we have the following proposition:
Proposition 6: If there is $G\in\mathcal{G}$ such that $\|H-G\|_1 \le \gamma$, then there is some constant $L$ such that $\|H - \Pi H\|_1 \le L\gamma$.
Proof: By hypothesis we have that $\|H - \Pi H\|_2 \le \|H - G\|_2 \le \|H-G\|_1 \le \gamma$. We now use the state-space form for $\Pi H$ and $G$, as described by Equation 1. Since $\mathcal{G}$ is a subspace, the state-transition matrix $A$ is fixed. We assume that the control vector $B$ is the free parameter that needs to be estimated. Let $B_1$, $B_2$ be the corresponding control vectors for $G$, $\Pi H$ respectively. By the triangle inequality, $\|G - \Pi H\|_2 \le \|G-H\|_2 + \|H - \Pi H\|_2 \le 2\gamma$, and hence
$$
\bigl\|\begin{bmatrix} C & CA & CA^2 & \dots & CA^n\end{bmatrix}^T (B_2 - B_1)\bigr\|_2 \;\le\; \|G - \Pi H\|_2 \;\le\; 2\gamma. \tag{30}
$$
Since $(A, C)$ is observable, the matrix $\begin{bmatrix} C & CA & CA^2 & \dots & CA^n\end{bmatrix}^T$ has full column rank. Therefore, $\|B_2 - B_1\|_2 \le K_0\gamma$ for some constant $K_0 > 0$ that depends on the system matrices $A$, $C$. Since $B_2$, $B_1$ are finite dimensional and norms on finite-dimensional spaces are equivalent, we also have $\|B_2 - B_1\|_1 \le K_1\gamma$ for some $K_1 > 0$. This implies that $\|G - \Pi H\|_1 \le K_2\gamma$ for some $K_2 > 0$. It now follows that
$$
\|H - \Pi H\|_1 \le \|H - G\|_1 + \|G - \Pi H\|_1 \le L\gamma.
$$
Our final proposition states the following.
Proposition 7: Let $G_m$ denote the best $\ell_1$ approximation of the FIR system $P_m H_r$, i.e., $G_m = \arg\min_{G\in\mathcal{G}} \|P_m(H_r - G)\|_1$. Then it follows that
$$
\lim_{m\to\infty} \|P_m(H_r - G_m)\|_1 = \|H_r - G_r\|_1.
$$
Proof: The proof follows from the following sequence of inequalities:
$$
\|H_r - G_r\|_1 \le \|H_r - G_m\|_1, \qquad \|P_m(H_r - G_m)\|_1 \le \|P_m(H_r - G_r)\|_1.
$$
Therefore,
$$
0 \le \|P_m(H_r - G_r)\|_1 - \|P_m(H_r - G_m)\|_1 \le \|(I-P_m)(H_r - G_m)\|_1 - \|(I-P_m)(H_r - G_r)\|_1 \longrightarrow 0,
$$
where the final limit follows from the fact that $\|G_m\|_1$ and $\|G_r\|_1$ are bounded and have exponentially decaying tails.
The theorem now follows from the fact that $G_0$ can be decomposed as a known non-linear operator acting on a collection of linear functionals of the system $H$, i.e., $\psi\bigl(\Pi H, P_m(H - \Pi H)\bigr)$, where: 1) an estimate of the FIR model $P_m(H - \Pi H)$ for the modified system $H - \Pi H$ can be obtained consistently if the conditions in Theorem 5 are satisfied; 2) $\Pi H$ can be obtained if the conditions of Equation 23 are satisfied, which are equivalent to the conditions in Theorem 5.
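The two-step construction can be sketched numerically. The Python fragment below (the basis impulse responses, the truncation length, and the LP solver are assumptions made only for illustration) takes a truncated impulse-response estimate, removes its $\ell_2$ projection onto the model subspace, and then solves the finite $\ell_1$ fit of Proposition 7 as a linear program.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

m = 40                                           # truncation length P_m (illustrative)
k = np.arange(m)
G_basis = np.column_stack([0.5 ** k, 0.8 ** k])  # assumed basis impulse responses G_1, G_2

# Truncated impulse-response estimate of H (subspace component plus a small residual).
h = 1.2 * G_basis[:, 0] - 0.7 * G_basis[:, 1] + 0.05 * rng.standard_normal(m)

# Step 1: l2 projection Pi*H onto the model subspace (ordinary least squares).
theta_proj, *_ = np.linalg.lstsq(G_basis, h, rcond=None)
h_r = h - G_basis @ theta_proj                   # modified system H_r = H - Pi*H (truncated)

# Step 2: finite l1 fit  G_m = argmin_theta ||h_r - G_basis @ theta||_1  as an LP:
#   minimize sum(t)  subject to  -t <= h_r - G_basis @ theta <= t,  t >= 0.
p = G_basis.shape[1]
c = np.r_[np.zeros(p), np.ones(m)]
A_ub = np.block([[G_basis, -np.eye(m)],
                 [-G_basis, -np.eye(m)]])
b_ub = np.r_[h_r, -h_r]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * p + [(0, None)] * m, method="highs")
theta_l1 = res.x[:p]

print("l2 projection coefficients  :", np.round(theta_proj, 3))
print("l1 residual-fit coefficients:", np.round(theta_l1, 3))
```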
VI. UNCERTAIN LFT STRUCTURES
We consider several different LFT structures for which the tools developed in Theorem 1 can be applied to characterize conditions on inputs that achieve uniform consistency. To focus attention on structured perturbations, we limit our attention to the real-rational space of systems endowed with the $H_2$ norm.
A. Multiplicative Uncertainty
We consider first the multiplicative uncertainty as shown in Figure 2(a) with $H\in H_2$. The input-output equation is given by:
$$
y(t) = Hu(t) + w(t) = G(1+\Delta)u(t) + w(t),
$$
where $\mathcal{G} = \bigl\{\sum_{j=1}^{p}\alpha_j G_j \mid \alpha_j\in\mathbb{R}\bigr\}$, the noise $w$ is random i.i.d. gaussian noise and $\Delta$ is a multiplicative perturbation with bounded $\ell_1$ norm that accounts for the undermodeling of $H$. Upon substitution we have
$$
\begin{aligned}
y(t) &= \sum_{j=1}^{p} \alpha_j G_j (1+\Delta) u(t) + w(t)\\
&= \sum_{j=1}^{p} \alpha_j (1+\Delta) G_j u(t) + w(t)\\
&= \sum_{j=1}^{p} \alpha_j \psi_j(t) + \Delta \sum_{j=1}^{p} \alpha_j \psi_j(t) + w(t),
\end{aligned}
$$
where we have used the commutativity of the convolution operation and substituted $\psi_j(t) = G_j u(t)$.
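In implementation terms, the regressors $\psi_j$ are simply the input filtered through each basis system $G_j$. The sketch below (Python; the first-order basis filters, their poles, the parameter values, and the omission of the perturbation term are illustrative assumptions) forms these filtered signals and the resulting regression in $\alpha$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)

N = 1000
u = rng.uniform(-0.5, 0.5, size=N)            # dither-like input (assumed)

# Basis systems G_j = 1/(1 - a_j z^{-1}); the poles a_j are purely illustrative.
poles = [0.2, 0.5, 0.7]
Psi = np.column_stack([lfilter([1.0], [1.0, -a], u) for a in poles])  # psi_j(t) = (G_j u)(t)

alpha_true = np.array([1.0, -0.4, 0.3])       # assumed parameters
w = 0.05 * rng.standard_normal(N)
y = Psi @ alpha_true + w                      # perturbation-free case (Delta = 0) for brevity

alpha_hat = np.linalg.lstsq(Psi, y, rcond=None)[0]
print("alpha estimate:", np.round(alpha_hat, 3))
```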
The parametrization is non-convex, as shown in the following example.
g) Example: Consider the two elements $H_1 = -(1-\Delta)G$ and $H_2 = (1+\Delta)G$ in the setup above. Let the set of perturbations be bounded, i.e., $\|\Delta\| \le \epsilon$. Then $(H_1 + H_2)/2 = \Delta G$ cannot be expressed as $(1+\tilde\Delta)G$ for some $\|\tilde\Delta\| \le \epsilon$. Therefore, the decomposition is not convex. However, if we restrict $\alpha_j \ge 0$ the decomposition turns out to be convex. In particular, the following subset of systems is convex:
$$
\mathcal{S}_{+} = \Bigl\{\, G(1+\Delta) \;\Bigm|\; G = \sum_j \alpha_j G_j,\;\; \alpha_j \ge 0,\;\; \|\Delta\| \le \epsilon \,\Bigr\}.
$$
Theorem 7: A necessary condition for uniform consistency is that $\Delta$ be orthogonal to the space $\mathcal{G}$ and the inputs satisfy the conditions of Theorem 4.
Proof: We first consider a convex subset, $\mathcal{S}_0$, of systems. Specifically, constrain the parameters $\alpha_j \in [1, 2]$. Then $\mathcal{S}_0 = \bigl\{\sum_j \alpha_j (G_j + \tilde\Delta G_j)\bigr\} \subset \bigl\{\sum_j \alpha_j G_j (1+\Delta)\bigr\} = \mathcal{S}$, where $\|\tilde\Delta\|$, $\|\Delta\|$ are both smaller than $\gamma$. Now, translating the set $\mathcal{S}_0$ by modifying the variables $\tilde\alpha_j = \alpha_j - 3/2$, we get a new set $\mathcal{S}_1$ which is both balanced and convex. We can now apply Proposition 3 and Theorem 1 to argue that $\Delta$ and $\mathcal{G}$ must be orthogonal for achieving uniform consistency. Furthermore, from Theorem 4 it follows that the relevant input conditions must also be satisfied. Now, there is a one-to-one mapping from the collection $\mathcal{S}_1$ to the collection $\mathcal{S}_0$, and therefore the same conditions are necessary for $\mathcal{S}_0$ as well. Since $\mathcal{S}_0 \subset \mathcal{S}$, the necessity holds for the general case as well.

Fig. 2. Different classes of linear fractional structures: (a) multiplicative uncertainty; (b) identification with structured uncertainty. The second figure illustrates a serial connection of two components, one of which is approximately known, while the other is unknown and must be estimated from input-output data.
We next apply the instrument derived for the set $\mathcal{S}_1$ to the general problem. It turns out that the notion of uniform consistency must be modified suitably. Indeed, since the perturbation is scaled by the parameter, the error bound also has a multiplicative structure, i.e., we can only say that
$$
\|\alpha - \alpha_n\| \le \beta \|\alpha\|
$$
with high probability for a large enough $n$. This is weaker than the conditions derived in the previous sections.
B. Serial Connections of Uncertain Systems
In this problem, shown in Figure 2(b), two uncertain systems, $H_0$ and $H_1$, are connected in series. The system $H_1$ is approximately known, i.e., its best approximation $G_1$ (in the class $\mathcal{G}_1$) is known, and the system $H_0$ must be approximately identified in the model class $\mathcal{G}_0$ from data. The interpretation of the setup is that we are given a set of uncertain components connected in series, where it is not possible to isolate any single component. The question arises as to when an approximate model for the uncertain component can be derived. We first write the input-output equation as follows:
$$
y(k) = (G_0 + \Delta_0)(G_1 + \Delta_1)u(k) + w(k) \tag{31}
$$
It is clear from the setup that the unmodeled dynamics enters the equations in a non-linear, non-convex fashion. Nevertheless, necessary and sufficient conditions can still be established.
Theorem 8: A necessary and sufficient condition for uniform consistency of the above setup is that $\Delta$ be orthogonal to both model classes $\mathcal{G}_0$ and $\mathcal{G}_1$ and the input satisfy the conditions of Theorem 4.
Proof: (necessity) Upon expanding Equation 31 we have
$$
y(t) = G_0 G_1 u(t) + G_1\Delta_0 u(t) + G_0\Delta_1 u(t) + \Delta_0\Delta_1 u(t) + w(t) \tag{32}
$$
Now, uniform consistency of $G_0$ implies uniform consistency of the above setup with $\Delta_1 = 0$ or $\Delta_0 = 0$. The former case reduces to an additive perturbation for which $\Delta \perp \mathcal{G}_0$ is a necessary condition, while the latter case is a multiplicative perturbation for which $\Delta \perp \mathcal{G}_1$ is necessary according to Theorem 7. The conditions on the input follow from Theorem 4.
Fig. 3. (a) Block diagram of an uncertain system in a feedback loop; (b) identification error. The unknown system has both parametric and non-parametric components, with the parametric part to be identified from closed-loop data. The setup is unconventional in that the system input is also measured; this arises in a number of applications involving indoor echo-cancellation systems.
(sufficiency) Comparing Equation 32 with Equation 14 we see a potential difference only in the penultimate term, i.e., $\Delta_0\Delta_1 u(t)$. Now, since $\Delta_0$ is perpendicular to $\mathcal{G}_0$, it can be written as $F_0\tilde\Delta$ for an arbitrary bounded perturbation $\tilde\Delta$, where $F_0$ is as described in Proposition 4. By absorbing $\tilde\Delta\Delta_1$ into a single perturbation, $\Delta$, we can write $\Delta_0\Delta_1 = F_0\Delta$. The new expression obtained is now consistent with the conditions of Theorem 4, which then establishes sufficiency.
h) Example: As an instantiation of the theorem, consider the case when $G_0$ and $G_1$ are FIRs of order $m$ and $n$ respectively. Then no consistent algorithm exists when the order, $m$, of $G_0$ is larger than $n$. This is because the unmodeled error $G_0\Delta_1$ and the class of FIR models of order $m$ are no longer orthogonal.
C. Identification of Uncertain Systems in Feedback Loops
Consider the feedback setup shown in Figure 3(a). The goal
is to estimate the optimal approximation to H = G + ∆ in
the model class $G \in \mathcal{G}$ from input-output data. Our task is to
design the input signal u to guarantee robust convergence to
the optimal approximation.
Such problems arise in the context of design of adaptive
acoustic echo-cancellation systems that operate in a feedback
loop [28]. In general the high order acoustic dynamics coupled
with significant time variation severely limits the choice of
echo-canceller order. Moreover, the feedback filter, typically
used to enhance the quality of speech, can impact the stability
of the echo-cancellation system. In this light, a lower order
echo-cancellation system is designed in the hope that robustness
can be ensured at the cost of sacrificing some
performance. A dither input signal, u, is employed as an
input to the loudspeaker and the output, v, of the loudspeaker
is recorded in addition to the microphone signals, y. Thus
the input output data consists of (u, v, y) as illustrated in
Figure 3(a). Our setup consists of:
$$
y(t) = Hv(t) + w(t) = \sum_{j=1}^{m}\alpha_j G_j v(t) + F\Delta v(t) + w(t) \tag{33}
$$
where $w$ is i.i.d. gaussian noise and we have decomposed $H\in RH_2$ into orthogonal subspaces. Furthermore, we assume that the closed-loop system is stable and that the exogenous input $u(\cdot)$, the internal signal $v(\cdot)$, and the output signal $y(\cdot)$ are all measured. As the following theorem shows, the necessary and sufficient condition for uniform consistency remains the same.
Theorem 9: A necessary and sufficient condition for uniform consistency of the optimal restricted complexity model $G\in\mathcal{G}$ is that the input satisfy the conditions in Theorem 4.
Proof: By straightforward algebraic manipulations of the feedback loop we obtain that
$$
v(t) = W_1 u(t) + W_2 w(t), \qquad W_1 = (1 - HK)^{-1}, \quad W_2 = K W_1.
$$
Although the system $W$ is generally unknown (since it is an LFT of the controller with the unknown system), the fact that the controller is stabilizing implies that $W_1$, $W_2$ are stable transfer functions. Therefore, $v(t)$ is bounded. Now consider Equation 33, which is exactly in the form of Equation 14 with the unmodeled dynamics orthogonal to $\mathcal{G}$. These are now consistent with the assumptions of Theorem 4, with $u(t)$ replaced by $v(t)$. We are left to establish the conditions on the input signal.
signal. Uniform consistency holds if and only if there exists
an instrument, q
n
() that decorrelates the internal signal v(t)
and noise w(t), i.e.,
¸
¸
¸
¸
¸
n

k=0
q
n
(k)v(k)
¸
¸
¸
¸
¸
= 1,
¸
¸
¸
¸
¸
n

k=0
q
n
(k +j)v(k)
¸
¸
¸
¸
¸
−→0
and
¸
¸
¸
¸
¸
n

k=0
q
n
(k)w(k)
¸
¸
¸
¸
¸
−→0
This in turn implies that the filtered instrument $\tilde q^n(\cdot) = W_1 q^n(\cdot)$ decorrelates the input signal and the noise:
$$
\Bigl|\sum_{k=0}^{n} \tilde q^n(k)\, u(k)\Bigr| = 1, \qquad \Bigl|\sum_{k=0}^{n} \tilde q^n(k+j)\, u(k)\Bigr| \longrightarrow 0, \qquad \Bigl|\sum_{k=0}^{n} \tilde q^n(k)\, w(k)\Bigr| \longrightarrow 0.
$$
Note that in the above example, since $W_1$ is unknown, a convenient instrument for $v(t)$ is the filtered dither signal $u_1(t) = Gu(t)$. This is because when $u(t)$ is an i.i.d. dither signal:
$$
\frac{1}{n}\Bigl|\sum_{k=0}^{n} u_1(k)\, v(k)\Bigr| \ge \beta > 0; \qquad \frac{1}{n}\Bigl|\sum_{k=0}^{n} u_1(k+j)\, v(k)\Bigr| \longrightarrow 0,
$$
and
$$
\frac{1}{n}\Bigl|\sum_{k=0}^{n} u_1(k)\, w(k)\Bigr| \longrightarrow 0.
$$
i) Example: The model space is given by $\mathcal{G} = \frac{\alpha}{1 - 0.3z^{-1}}$. The unmodeled dynamics is generated by drawing a zero-mean gaussian random sequence of length 1000 and normalizing it to have an $\ell_1$ norm equal to one. The unmodeled dynamics is then filtered with the all-pass filter $F = (z^{-1} - 0.3)/(1 - 0.3z^{-1})$. This ensures orthogonal separation between the unmodeled error and the model subspace. The system $H$ is generated by summing the model for $\alpha = 1$ with the unmodeled dynamics. Next, a controller $K$ is chosen so that the closed-loop system is stable and such that the sensitivity function $W_1$ has $\ell_1$ norm smaller than one. A uniformly distributed i.i.d. input in the interval $[-0.5, 0.5]$ of length 300 is applied, and the measurements $v(t)$ and $y(t)$ are collected in i.i.d. gaussian noise $w(k)$ with mean zero and variance 0.1. For the IV technique we use $(1 - 0.3z^{-1})^{-1}u(t)$ as the instrument in Equation 33 and identify $\alpha$. Figure 3(b) shows the parametric error plotted as a function of data length for the IV scheme. We see that the IV technique converges, while it is well known that the least squares approach leads to a large bias. This is not surprising: the unmodeled output is correlated with the model output, which biases least squares, whereas the IV instrument is chosen to decorrelate both the unmodeled uncertainty and the noise.
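A rough simulation in the spirit of this example is sketched below in Python. Since Figure 3 is not reproduced here, the loop interconnection $v(t) = u(t) + (Ky)(t)$ with $K = 0.2z^{-1}$, the shorter unmodeled tail, and the data lengths are assumptions made purely for illustration; the instrument is the filtered dither $(1-0.3z^{-1})^{-1}u(t)$ as in the text, and the IV estimate of $\alpha$ is printed next to the least-squares estimate for comparison.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)

# True system H = G|_{alpha=1} + F*delta, following the example (shorter tail for speed).
L = 200
g = 0.3 ** np.arange(L)                       # impulse response of G = 1/(1 - 0.3 z^{-1})
delta = rng.standard_normal(L)
delta /= np.sum(np.abs(delta))                # unit l1 norm
unmodeled = lfilter([-0.3, 1.0], [1.0, -0.3], delta)   # all-pass F = (z^{-1}-0.3)/(1-0.3z^{-1})
h = g + unmodeled                             # truncated impulse response of H

def run_loop(N, k_fb=0.2):
    """Assumed loop structure: v(t) = u(t) + k_fb*y(t-1),  y(t) = (H v)(t) + w(t)."""
    u = rng.uniform(-0.5, 0.5, size=N)        # i.i.d. dither input
    w = np.sqrt(0.1) * rng.standard_normal(N) # gaussian noise, variance 0.1
    v = np.zeros(N)
    y = np.zeros(N)
    for t in range(N):
        v[t] = u[t] + (k_fb * y[t - 1] if t > 0 else 0.0)
        n_taps = min(t + 1, L)
        y[t] = np.dot(h[:n_taps], v[t::-1][:n_taps]) + w[t]
    return u, v, y

def estimates(N):
    u, v, y = run_loop(N)
    psi = lfilter([1.0], [1.0, -0.3], v)      # model regressor (1 - 0.3 z^{-1})^{-1} v(t)
    q = lfilter([1.0], [1.0, -0.3], u)        # instrument: filtered dither
    alpha_iv = np.dot(q, y) / np.dot(q, psi)  # instrumental-variable estimate
    alpha_ls = np.dot(psi, y) / np.dot(psi, psi)   # least squares, for comparison
    return alpha_iv, alpha_ls

for N in (300, 1000, 4000):
    a_iv, a_ls = estimates(N)
    print(f"N = {N:5d}   alpha_IV = {a_iv: .3f}   alpha_LS = {a_ls: .3f}   (true alpha = 1)")
```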
Remark: We contrast the IV approach presented in this paper with the well-known MPE approach. The idea in the MPE approach is to find models that minimize the so-called prediction error. In specific circumstances this implies a preference for those models where the prediction error is uncorrelated with the model output. In the presence of complex uncertainties, especially when the error arises due to undermodeling, this approach can lead to significant bias. IV methods can overcome this problem through suitable instruments. These instruments, in specific circumstances, can first decorrelate the uncertainty arising from correlated sources and then proceed to cancel stochastic uncorrelated noise as well.
VII. CONCLUSIONS
We have introduced a new concept for system identification
with mixed stochastic and deterministic uncertainties that
is inspired by machine learning theory. In comparison to
earlier formulations our concept requires a stronger notion
of convergence in that the objective is to obtain probabilistic
uniform convergence of model estimates to the minimum
possible radius of uncertainty. Our formulation lends itself to
convex analysis leading to description of optimal algorithms,
which turn out to be well-known instrument-variable methods
for many of the problems. We have characterized conditions
on inputs in terms of second-order sample path properties
required to achieve the minimum radius of uncertainty. We
have derived optimal algorithms for system identification for
a wide variety of standard as well as non-standard problems
that include special structures such as unmodeled dynamics,
positive real conditions, bounded sets and linear fractional
maps.
VIII. APPENDIX
(Proof of Proposition 1) We deal with the case $\gamma_0 = 0$ for simplicity. The argument for arbitrary $\gamma_0$ proceeds in a similar manner. Let $Y_n$ denote the column vector of the output signal of length $n$ and $T(Y_n)$ be the set of all $h\in\mathcal{S}$ that are consistent with the input-output data and the noise set, i.e.,
$$
T(Y_n) = \{ h\in\mathcal{S} : y(k) = \lambda_k(h) + w(k);\;\; w\in\mathcal{W}_n,\;\; 1\le k\le n \}.
$$
The local diameter of uncertainty is given by:
$$
\mathrm{diam}(Y_n) = \sup_{h_1, h_2 \in T(Y_n)} |\lambda_0(h_1 - h_2)|
$$
From [22] it follows that the estimation error for any estimator is bounded from below by one-half of the diameter of uncertainty. By the hypothesis of our proposition and the preceding argument there exists a noise set, $\mathcal{W}^0_n$, such that the diameter of uncertainty is smaller than $2\epsilon$. We are now left to produce a convex and balanced set that has the same properties. To this end, consider the following noise set:
$$
\mathcal{W}^{\epsilon}_n = \{ w\in\mathbb{R}^n \mid \lambda_k(h) + w(k) = 0,\;\; |\lambda_0(h)| \le \epsilon,\;\; 1\le k\le n \}
$$
On account of the fact that the measurements are linear and the constraints are convex and balanced, it follows that the set $\mathcal{W}^{\epsilon}_n$ is also convex and balanced. We are left to establish two facts:
(A) The noise set $\mathcal{W}^{\epsilon}_n$ is feasible, i.e., the global diameter of uncertainty over all feasible outputs $Y_n$ is bounded by $2\epsilon$.
(B) The probability measure of the noise set $\mathcal{W}^{\epsilon}_n$ is larger than that of the feasible set $\mathcal{W}^0_n$.
These two facts together will establish the proposition, since we will then have found an alternative convex and balanced set to $\mathcal{W}^0_n$ that has a larger probability measure and the same worst-case global diameter of uncertainty.
To establish (A) we observe from [22] that with a linear/convex measurement structure, the worst-case diameter of uncertainty is realized when the output is identically zero. Consequently, for all other output realizations the worst-case uncertainty can only be smaller. Now, by construction the set $\mathcal{W}^{\epsilon}_n$ corresponds to the zero output signal and therefore, by the preceding argument, is feasible.
(B) is established by the following sequence of inequalities:
$$
\begin{aligned}
\mathrm{Prob}\{\mathcal{W}^0_n\}
&\le \sup_{\mathcal{W}_n}\Bigl\{\mathrm{Prob}\{\mathcal{W}_n\} \;\Bigm|\; \sup_{h\in\mathcal{S},\, w\in\mathcal{W}_n} \mathrm{diam}\{Y_n\} \le 2\epsilon\Bigr\}\\
&\overset{(a)}{\le} \sup_{\mathcal{W}_n}\Bigl\{\mathrm{Prob}\{\mathcal{W}_n\} \;\Bigm|\; \sup_{h=0,\, w\in\mathcal{W}_n} \mathrm{diam}\{Y_n\} \le 2\epsilon\Bigr\}\\
&\overset{(b)}{\le} \sup_{\mathcal{W}_n,\, 0\in\mathcal{W}_n}\Bigl\{\mathrm{Prob}\{\mathcal{W}_n\} \;\Bigm|\; \sup_{h=0,\, w\in\mathcal{W}_n} \mathrm{diam}\{Y_n\} \le 2\epsilon\Bigr\}\\
&\overset{(c)}{\le} \sup_{\mathcal{W}_n,\, 0\in\mathcal{W}_n}\Bigl\{\mathrm{Prob}\{\mathcal{W}_n\} \;\Bigm|\; \sup_{h=0,\, w\in\mathcal{W}_n} \mathrm{diam}\{Y_n = 0\} \le 2\epsilon\Bigr\}\\
&\overset{(d)}{\le} \mathrm{Prob}\{\mathcal{W}^{\epsilon}_n\}
\end{aligned}
$$
Inequality (a) follows from the fact that the output $Y_n$ on the RHS is restricted to the case when $h = 0$, which amounts to a relaxation of the constraint. Inequality (c) follows from a similar argument. Inequality (b) follows from the fact that the worst-case diameter of uncertainty is invariant under translations of the noise set $\mathcal{W}_n$. However, the probability measure of the noise set can change. Using the underlying symmetry and monotonicity of the noise distribution around zero, translating the feasible set $\mathcal{W}_n$ so that it contains the zero element can only increase the probability measure. Finally, (d) follows by the definition of $\mathcal{W}^{\epsilon}_n$, since it includes all possible noise elements for which the output is identically zero and the diameter of uncertainty is constrained to be bounded by $2\epsilon$. Consequently, no other noise set containing the zero element can have a larger probability measure.
(Proof of Corollary 1) Consider an element $z\in H_2$, with $\langle\cdot,\cdot\rangle$ denoting the inner product. Let $z_0 = z/\|z\|_2$ be a unit vector aligned with $z$, and let $z_\alpha$ be any other unit vector. It follows that
$$
\|z\|_2 = \langle z_0, z\rangle \le |\langle z_0 - z_\alpha, z\rangle| + |\langle z_\alpha, z\rangle| \le \|z_0 - z_\alpha\|_2\|z\|_2 + |\langle z_\alpha, z\rangle|,
$$
where the first term in the last inequality follows from the Cauchy-Schwarz inequality. Denoting $\upsilon_\alpha(z) = \langle z_\alpha, z\rangle$ and letting $\alpha$ take values in a finite index set $\mathcal{I}$ such that there is at least one $\alpha$ for which $\|z_0 - z_\alpha\|_2 < 1$, we have
$$
\|z\|_2 \le \min_{\|z_0 - z_\alpha\|_2 < 1} \frac{|\upsilon_\alpha(z)|}{1 - \|z_0 - z_\alpha\|_2}.
$$
To relate this to Corollary 1 we let $z = x_1 - x_2$, where $x_1$, $x_2$ are any two feasible elements satisfying Equation 7. It follows that
$$
\|z\|_2 = \|x_1 - x_2\|_2 \le \min_{\|z_0 - z_\alpha\|_2 < 1} \frac{|\upsilon_\alpha(x_1 - x_2)|}{1 - \|z_0 - z_\alpha\|_2}.
$$
The result now follows by direct substitution.
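As a quick numerical sanity check of this bound (a sketch; the dimension and the finite family of unit vectors are arbitrary choices), the fragment below verifies that the minimum of $|\upsilon_\alpha(z)|/(1 - \|z_0 - z_\alpha\|_2)$ over the admissible $\alpha$ indeed upper-bounds $\|z\|_2$.

```python
import numpy as np

rng = np.random.default_rng(5)

p = 8
z = rng.standard_normal(p)
z0 = z / np.linalg.norm(z)                    # unit vector aligned with z

# Finite family of unit vectors clustered near z0, so ||z0 - z_alpha||_2 < 1 for some alpha.
family = [v / np.linalg.norm(v)
          for v in (z0 + 0.2 * rng.standard_normal(p) for _ in range(20))]

bounds = [abs(np.dot(v, z)) / (1.0 - np.linalg.norm(z0 - v))
          for v in family if np.linalg.norm(z0 - v) < 1.0]

print("||z||_2         =", np.linalg.norm(z))
print("corollary bound =", min(bounds))
```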
REFERENCES
[1] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., 1995.
[2] V. Borkar, S. Mitter, and S. Venkatesh. Variations on a Neyman-Pearson
theme. Journal of the Indian Statistical Institute, 66:292–305, 2004.
[3] P. Caines. Linear Stochastic Systems. New York, Wiley, 1988.
[4] M. C. Campi. Exponentially weighted least squares identification of
time-varying systems with white disturbances. IEEE Trans. on Signal
Processing, 42:2906–2914, 1994.
[5] M. C. Campi and E. Weyer. Finite sample properties of system identifica-
tion methods. IEEE Transactions on Automatic Control, 47:1329–1333,
2002.
[6] M. A. Dahleh, T. V. Theodosopoulos, and J. N. Tsitsiklis. The sample
complexity of worst case identification of linear systems. System and
Control Letters, 20:157–166, 1993.
[7] A. Garulli and W. Reinelt. On model error modeling in set membership
identification. In Proceedings IFAC Symposium on System Identification,
Santa Barbara (USA), pages WeMD1–3, 2000.
[8] L. Giarre, B. Z. Kacewicz, and M. Milanese. Model quality evaluation
in set-membership identification. Automatica, 33:1133–1139, 1997.
[9] G. C. Goodwin, M. Gevers, and B. Ninness. Quantifying the error in
estimated transfer functions with application to model-order selection.
IEEE Transactions on Automatic Control, 37:1343–1354, 1992.
[10] G. Gu and P. P. Khargonekar. Linear and nonlinear algorithms for
identification in $H_\infty$ with error bounds. IEEE Transactions on Automatic
Control, 37:953–963, 1992.
[11] P. Van Den Hof, P. Heuberger, and J. Bokor. Identification with
generalized orthonormal basis functions–statistical analysis and error
bounds. Special Topics in Identification Modelling and Control, pages
39–48, 1993.
[12] E. L. Lehmann. Testing Statistical Hypotheses. 2nd edn. Springer Verlag:
New York, 1997.
[13] L. Ljung. System Identification: Theory for the user. Prentice Hall
Englewood Cliffs, NJ, 1987.
[14] L. Ljung and Z. D. Yuan. Asymptotic properties of the least squares
method for estimating transfer functions. IEEE Transactions on Auto-
matic Control, pages 514–530, 1985.
[15] D. Luenberger. Optimization by vector space methods. John Wiley and
Sons, Inc, 1969.
[16] P. M. Makila. Robust identification and Galois sequences. International
Journal of Control, 54:1189–1200, 1991.
[17] M. Milanese and G. Belforte. Estimation theory and uncertainty intervals
evaluation in the presence of unknown but bounded errors: Linear
families of models and estimators. IEEE Transactions on Automatic
Control, 27:408–414, 1982.
[18] M. Milanese and A. Vicino. Optimal estimation theory for dynamic
systems with set membership uncertainty: An overview. Automatica,
27:997–1009, 1991.
[19] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization.
Prentice Hall Englewood Cliffs, NJ, 1997.
[20] P. Poolla and A. Tikku. On the time complexity of worst–case system
identification. IEEE Transactions on Automatic Control, 39:944–50,
1994.
[21] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill Interna-
tional Editions, 1976.
[22] J. F. Traub, G. W. Wasilkowski, and H. Wozniakowski. Information,
uncertainty, complexity. Addison-Wesley Pub. Co., 1983.
[23] D. N. C. Tse, M. A. Dahleh, and J. N. Tsitsiklis. Optimal identification
under bounded disturbances. IEEE Transactions on A-C, 38:1176–90,
1993.
[24] V. Vapnik. The Nature of Statistical Learning Theory. New York:
Springer-Verlag, 1995.
[25] S. Venkatesh. Necessary conditions for identification of uncertain
systems. Systems and Control Letters, 53:117–125, 2004.
[26] S. Venkatesh and M. A. Dahleh. System identification for complex-
systems: Problem formulation and results. In Proceedings of the 36th
IEEE Conf. on Decision and Control, San Diego, CA, 1997.
[27] S. Venkatesh and M. A. Dahleh. On system-identification of complex-
processes with finite data. IEEE Transactions on Automatic Control,
46:235–257, 2001.
[28] S. Venkatesh and A. M. Finn. Cabin communication system without
acoustic echo cancellation. US Patent #6748086, 2004.
[29] S. Venkatesh and S. K. Mitter. Statistical modeling and estimation
with limited data. In International Symposium on Information Theory,
Lausanne, Switzerland, 2002.
[30] S. R. Venkatesh and M. A. Dahleh. Identification in the presence
of unmodeled dynamics and noise. IEEE Transactions on Automatic
Control, 42:1620–1635, 1997.
[31] A. Vicino and G. Zappa. Sequential approximation of feasible parameter
sets for identification with set membership uncertainty. IEEE
Transactions on Automatic Control, 41:774–785, 1996.
[32] M. Vidyasagar. A Theory of Learning and Generalization. New York:
Springer-Verlag, 1997.
[33] B. Wahlberg. System identification using Laguerre models. IEEE
Transactions on Automatic Control, pages 551–562, 1991.
[34] B. Wahlberg and L. Ljung. Hard frequency-domain model error bounds
from least-squares like identification techniques. IEEE Transactions on
Automatic Control, 37:900–912, 1992.
[35] L. Y. Wang. Persistent identification of time-varying systems. IEEE
Transactions on Automatic Control, 42:66–82, 1997.
[36] L. Y. Wang and G. G. Yin. Persistent identification of systems with
unmodeled dynamics and exogenous disturbances. IEEE Transactions
on Automatic Control, 45:1246–1256, 2000.
In Section VI we show how this approach also leads to solutions to problems described by LFT structures. Finally z −1 is the z-transform variable and corresponds to the usual shift operation. optimal inputs. x p denotes the p norm of the discrete time signal x(·). This can be extended to other norms by approximating in a finite family of linear cost functions. Xn is the column vector (x(0). inputs and fundamental bounds. which can further be related to instrument-variable techniques. P ROBLEM F ORMULATION In this section we formulate a mixed stochasticdeterministic problem. a new notion of sample complexity is developed wherein the idea is to find the smallest time by which a confidence level for the probabilistic objective can be reached. the space of systems defined by the disc algebra of analytic functions bounded in the closed unit disc. Specifically. . minimum sample complexity etc. it is well known that a persistency of excitation condition is required to ensure consistency. these bounds can sometimes be suitable. In Section IV the paper describes new insights for parameter estimation in stochastic noise. Preliminary work along these lines has appeared in our earlier publications [25]. Although.. of a signal. This formulation is inspired by notions of uniform convergence of empirical means and approximate learning in machine learning theory. y(k). x ∈ p . we show in Section III. where denotes the convolution operation. wherein the idea is to pick model estimates so that the worst-case probability of “mismatch” error being larger than some threshold is minimized. Our objective is to seek models/estimates that minimize the probability of the worstcase errors. is represented by {h(k)}k∈Z + . y(t). σ) denotes the gaussian distribution with mean m and standard deviation σ. To address these issues we propose a new formulation for mixed components in Section II. The model . . 0. n] The n-point cross-correlation between two signals n−τ n rxy (τ ) = i=0 x(τ + i)y(i) N (m. this problem (in terms of finding optimal algorithms. . These ideas resemble notions of minimax risk functions employed in the statistical and estimation literature [12]. The output response. . For a real-rational LTI system. . In summary. Mathematically. For a signal. 1. unmodeled dynamics and noise as was shown in [27]. We derive conditions on input sequences in terms of their second order sample path properties. .. The upper bounds are usually obtained by using least-squares algorithms for periodic inputs while the lower bounds are obtained by considering the noiseless case.) is in general intractable in both the estimation and identification contexts.). . . H.e. we show that optimal algorithms can be associated with hyperplane separation of convex sets. τ ∈ [0. A p. Z + is the set of positive integers. we are given a sequence of real-valued observations. i. In particular. to an input u(t) is denoted by y(t) = Hu(t) = (h u)(t). x(n − 1))∗ . . is n−τ n rx (τ ) = i=0 x(τ + i)x(i). it is generally difficult to characterize optimal algorithms. Another aspect of the paper is in characterizing inputs required for system identification. Prob{A} denotes the probability of an event A and E{X} denotes the expected value of the random variable X. that it is also tractable through convex analysis particularly for problems defined by linear observations. p . i. This paper generalizes that formulation and deals with a wide variety of new problems. . H. x(1). 
The analysis is then extended to problems to cover situations where the desired solution can be expressed as a non-linear function of a linear operator. LTI refers to linear time invariant systems and BIBO refers to bounded-input-boundedoutput systems. Pm denotes the truncation operator: Pm (x) = (x(0). although there are a number of formulations that have been proposed for dealing with mixed scenarios. . Section V then develops algorithms for optimal identification in the presence of unmodeled dynamics and noise. x. rx (·). There the idea is to choose parameters that minimize the worst-case (over all parameters) the expected value of the estimation error. By means of our mixed formulation we show that this condition is also necessary for achieving uniform consistency. this strategy is employed to show that uniform convergence of parametric error to zero can be established in the presence of noise and unmodeled dynamics for stable systems in an 1 space. convex subsets of systems and additive random noise and under the ∞ norm.q = supx∈ p Ax q / x p denotes the induced norm of operator or matrix. we seek uniform convergence in probability of the estimation error. p ≥ 1 denotes the space of sequences on Z + bounded in the p norm (in [15] these concepts are defined on the space of natural numbers but we only consider sequences over positive integers and denote them as p with an abuse in notation). Wang [36] has proposed a new concept for mixed identification. a) Notation:: A∗ denotes the complex-conjugate transpose of the matrix A. . Nevertheless. which are assumed to be generated according to an underlying model. Although. we often associate a minimal realization in state space: A B H≡ (1) C D II. x(m−1). The impulse response of an LTI system. Wang [36] addresses this issue by deriving upper and lower bounds for a number of different cases. This section also develops solution techniques for problems with convex parametric constraints. .2 tivated by persistent identification. n The n-point autocorrelation. denotes real valued sequence on positive integers. This is particularly the case when there is a separation between model. In particular. this notion is stronger than the earlier proposed formulations. where consistent identification of FIR models was described. . they can be overly conservative for situations where Least Squares (LS) is not a convergent algorithm. . Furthermore.e. RH2 refers to stable real-rational LTI systems with bounded squared norm.

∆) ∈ S. for a linear ˆ algorithm. min Prob{max |αY − θ| > 1/2} ≥ 1/2 α θ f ∈F f (xi ) ≥ γ −→ 0 Extensions to this problem include consideration of general loss functions and function approximation with lower complexity function classes. ∆ are non-random components with the tuple (θ. θn (y) from a ˆ sequence of n observations.. We now introduce the mixed problem. where S is a subset of a separable normed linear space H. It follows that. ξ. ξ. θ. for the sequence of algorithms θ as: ˆ ˆ R(θ.∆)∈S t Fu (θ. i. We denote by θ (with an abuse ˆ of notation) the infinite sequence of algorithms. sup (θ. w(·) is a random noise process.e. S) = 0. consider an LTI example with a 1-tap FIR with unmodeled dynamics. w). Suppose W is a binary random variable taking values in the set {−1. θ. this situation is analogous to the concept of uniform convergence of empirical means. ∀ ξ > 0. θ.. S) = inf γ ∈ IR | limsupn α(θn . Observe that the set. Suppose. S). . i. samples of {xi } are drawn from a distribution defined by P and the problem is to show that. note that the variables θ and ∆. We denote by α(θn . i. The principle difficulty in applying learning theory to system identification is that the solutions there depend on observations being independent and do not deal with dynamical scenarios. i. θ. Note that the problem is meaningful even when the perturbation. θ0 is said to be optimal if its sample complexity is smaller than any other algorithm. Returning to our context. we are given a family of functions. n − 1 (2) This leads to the notion of consistency: ˆ Definition 1: We say that the sequence of algorithms θ is uniformly consistent if the radius of uncertainty is equal to zero. Wang et. may not be independent. while our notion is related to the right-hand-side of the following inequality. γ. i. their notion is related to the left-hand-side. f (x) ∈ F. Y = θ + W with θ ∈ [−1. . the main problem is that it is difficult to analyze exactly (as seen even for the above example). R(θ. Indeed.∆)∈S ˆ θ − θn (y) ≥ γ ˆ θ − θn (y) ≥ γ where the probability is taken with respect to the noise distribution.∆)∈S Prob ˆ θ − θn (y) ≥ γ ≤ Prob sup (θ. The minimum radius of uncertainty is given by: ˆ R0 (S) = inf R(θ. F k is a time dependent real-valued regressor. t y(k) = F k (θ. S is a separable space and so measurability is preserved under a supremum [1]. ˆ ˆ L( . · ˆ is a suitable norm on IRm . θ = W . we define the asymptotic radius of ˆ ˆ uncertainty. θ(Y ) = αY . θ0 ..e. (θ. and we get. the weaker notion may suffice in many cases. S) ≤ L( .d.. al. γ. the probability that the worst-case estimation error is larger than γ = R0 (S) + . S) ≤ ξ y(t) = ≡ θu(t) + k=0 δ(t − k)u(k) + w(t) 1. This issue is well 1 [36] considers simpler structures where the model and unmodeled dynamics are decoupled . is at most α. ∆.e. S) = Prob sup (θ. ˆ α(θn . More precisely. {θn }. This models situations involving optimal approximations. L( . The finite dimensional component. [36] consider a weaker version in that their formulation is related (but not exactly1 ) to the worst-case probability of error. i. where. where we have used the fact that the minimum value is equal to minα Prob{maxθ |(1−α)θ+αW | > 1/2}. To clarify the notation.e. ξ. δ(t) is the impulse response of the unmodeled component. Prob{maxX (W X) > 1/2} = 1 c) Example: Suppose. while ∆ denotes the residual error. The lower bound now follows by choosing. let where. ∆.e. min max Prob{|αY − θ| ≥ 1/2} ≤ 1/3 α θ Although. 
S) the probability that the worst-case estimation error is larger than γ for n observations. such that the smallest number of observations. does not exist. S) = min n ∈ Z + | α(θn . . > 0 These definitions are related to the mixed formulations proposed in [36] with important differences.3 is uncertain in that it has both random and non-random aspects that provide information on how the sequence of observations are generated. γ. the set of values θ and ∆ can be jointly specified by the set S ⊂ H.e. 1}. i.e. 1] and W uniformly distributed in [−1. k = 0..e. ∆. S) ˆ θ Finally. ˆ ˆ L( . N i.. We immediately see that maxX Prob{W X ≥ 1/2} = 1/2. Remark: In general. 1]. ∆. i. Prob sup EP (f (x)) − 1 N N i=1 The two problems can be significantly different as the following examples indicate: b) Example: Let X be a binary number taking values in the set {−1. For the simplest situation in learning theory. γ. 1} each of which is equally likely.. 1. θ characterizes the best approximation in the model class. θ. ∆) ∈ S = IR × u(·) ∈ ∞ ˆ An algorithm. ξ. S). It turns out through detailed analysis that the optimal solution for the weaker notion is given by α = 3/4. we define sample complexity and optimality of an algorithm: ˆ Definition 2: The sample complexity of the algorithm θ is ˆ S). w).i. . ˆ is to be estimated by means of an algorithm.

is now a worst-case noise signal for which the error is uniformly bounded by γ0 + . ej . . . is a column vector of length n with jth component equal to one and all other components equal to zero. which is applied in subsequent sections. In this paper we limit our attention to gaussian processes (the non i. Then linear algorithms are optimal when the norm measure on Φ(h) is the ∞ norm. Gaussian noise such that w(k) ∼ N (0. III. Suppose. φp (h))T . ∀ h ∈ S. 0. elements of the set. for a large number of problems especially when there is a decomposition between model and unmodeled dynamics the upper and lower bounds are not tight. h ∈ S (3) between the probabilistic problem and an appropriate worstcase problem. {λ1 (·) + e1 . For the setup described by Equation 3 we have the following theorem. . We also demonstrate extensions of this approach to other norms by approximating in a finite family of linear cost functions. consider estimation of a single linear functional.i. The proof of optimality of ˆ the linear algorithm would follow because φn (y) is arbitrary and can be replaced by an optimal algorithm. the sequence of ˆ ˆ algorithms. if ψ(k) is the sequence of regressors formed for an ARMA structure (see [13] for details) then. where. y(k) = ψ T (k)θ + w(k) forms a linear measurement structure on the variable θ. w(·). The task is to compute a linear function Φ(h). S are observed through a linear measurement structure with additive random noise.d. . . n. is a linear functional on the space. where the upper bound is computed with a least squares algorithm and periodic inputs. ˆ sup sup |φ(h) − φn (y)| ≤ γ0 + n w∈IR −A h∈S (5) ˆ sup sup |φ(h) − φn (y)| ≤ γ0 + (6) n w∈W 0 h∈S We are now left to prove that a linear algorithm is an alternative optimal estimator as well. particularly for problems defined by linear observations. convex subsets of systems and additive random noise for ∞ norm. ξ. case is established in the next section) for the sake of simplicity. and a convex and balanced subset. Theorem 1: Suppose w(·) is i. Φ(h) = (φ1 (h). Prob h∈S ˆ sup |φ(h) − φn (y)| ≥ γ0 + ˆ ≤ ξ. . S ⊂ H. ξ. k = 1. This implies. . while there exist specialized inputs and linear algorithms that do result not only in consistency but also polynomial sample-complexity. Now . H. . One advantage of our formulation is that it lends itself to exact analysis through application of convexity theory and by extension design of optimal inputs and algorithms. Nevertheless. Thus the observations are linear over H. λn (·) + eT }. Suppose. . ∀ n ≥ L( . . Now. w ∈ W n 0 T Now. ˆ S). which are symmetric and monotone around zero. The lower bound is computed by considering the noiseless case.. the uniform distribution satisfies these properties as well. Indeed. H. Proposition 1: If Equation 5 is satisfied then it follows n that there exists a convex balanced subset. with n Prob{W 0 } ≥ 1 − ξ such that V = {(φ(h). φ = {φn (y)}. σ 2 ). d) Example: The measurement structure includes ARMA models as well. 0) is not contained in the interior of the set V. This implies the existence of a hyperplane such that. . For simplicity. . Remark: Although we have assumed Gaussian noise the theorem holds for other noise distributions as well. Therefore. Remark: In general the result holds for more general noise distributions. φ. For instance. Consider a linear space.d. The main result here is that under some technical conditions linear algorithms have optimal sample complexity. This in turn implies that c0 = 0. it follows that (γ + .e. . 
Equation 4 is equivalent to existence of a set A ⊂ IRn with Prob{A} ≤ ξ such that. W 0 ⊂ IRn . h ∈ S} 0 c0 (φ(h) − (γ + )) + j=0 cj yj ≤ 0. We state this in the form of a proposition below and the details can be found in the appendix. [27] provides examples wherein neither periodic inputs nor LS algorithms lead to consistency. . achieves the sample complexity. φ. as illustrated in the sequel. and except for simple situations the problem is generally intractable.i. i. i. Proof: If the radius of uncertainty is infinite then linear algorithms are no worse than any other algorithm. .4 known in statistical estimation [12] and recently the authors have analyzed the weaker notion [2]. O PTIMAL A LGORITHMS FOR C ONVEX P ROBLEMS In this section we state and prove a general theorem. Next we establish that the set W n − A can be chosen to be convex without impacting the above inequality. Consider the following set of IRn+1 : From Equation 6. λ1 (h) + w(1). λn (h) + w(n)) | w ∈ W n . the linear functionals. where Φ maps h to IRp . L( . . This is because. For this we follow the proof of Smolyak’s theorem [22]. The theorem is a generalization of Smolyak’s theorem [22] to observations with random noise. . λk . are linearly independent. The main idea is illustrated in Figure 1. In [36] this issue is addressed by deriving upper and lower bounds. φ(h). These properties can be used to argue the equivalence y(k) = λk (h) + w(k). The main property required is that the noise distribution be symmetric and monotonic around zero. . . n where. n where.e. . suppose γ0 = R0 (S) < ∞. w. The linear function Φ(h) is often referred to as the solution operator.. S) (4) We are left to establish that there is a linear algorithm that achieves the same performance. [29] in the context of composite hypothesis testing.

. Remark: Note that the results do not provide an explicit expression for computing the minimum radius of uncertainty (and indeed this problem is . We then obtain a finite set of linear inequalities. n qj (k. . . our idea is to express squared norm value as the maximum over all unit magnitude linear functionals—the existence of which is guaranteed through duality—and then find estimates for each of these linear functionals. . V (θ) = n 1 2 ˆ t=1 E(y(t) − y (t/θ)) as a function of the predictor. . . x belonging to the set defined ˆ by Equation 7 with υ ∈ {υ k }k∈I and let Φn (y) = x be a non-linear estimator that maps the observations to a feasible point. To generalize it to the linear function. . . are the optimal linear weights corresponding to the linear functionals φl (h). n c0 (φ(h) + (γ + )) + j=0 cj yj ≥ 0. Yn = Λn (h)+Qn Wn . we note that. . ξ) = ≥ ˆ max φj (h) − φn (y) j j n n qj (k. 0) is also another boundary point. i. . ˆ Φ(h) − Φ(h) ∞ sup sup n w∈W 0 h∈S k=0 q n (k)y(k) − φ(h) ≤ γ0 + . υ(·) can be associated with a vector (υ1 . w ∈ W n 0 Consequently. Corollary 2: Let. the minimum angle of separation for the collection of vectors is smaller than β < π/4. establishing result for a single linear functional. . In the prediction-error paradigm [13]. λ2 (·). this topic has not been studied under the new criterion of uniform convergence. λn (·)) is a sequence of linear functionals. ∀ h ∈ S. qυ (k) with the minimum radius of uncertainty. . Let υ k be such a discretization with k belonging to some finite index set. . let υ k = (υ1 . . Consider a feasible point. .. However.e. . Suppose. υp ) ∈ p k IR . . which unifies many of the results in stochastic identification. 1. . With high probability they lie in a bounded convex set. Specifically. We then select linear functionals.e. ξ > 0.e. . p υ(x) = k=1 υk x k Corresponding to each choice of υ(·) ∈ IRp we have an n optimal linear algorithm. Therefore we obtain the complementary inequality. . . . k n qυ (k)y(k) − k=1 υk x k ≤ γ 0 + (7) by using the fact that V is convex and balanced we see that (−γ − . φ2 (h). φp (h)) in this finite set forms a good estimate k k for the squared norm. the objective is to analyze prediction errors. 0 . it is possible to bound the squared norm with the infinity norm this can be conservative (for example. n thus. Although. . k ∈ I. We then have. where Yn .. . υp ) ∈ IRp . φl )y(k) − φj (h) k=0 where. Extensions We can generalize the above theorem to squared error norms as well. Specifically. ∀ n ≥ N ( . Corollary 1: If there exists an algorithm that achieves a radius of uncertainty γ0 + for a data length n with probability α then: γ0 + ˆ Φ(h) − Φn (y) 2 ≤ 1−β Proof: See the appendix. These theorems are readily generalized to stationary gaussian noise setting with bounded power spectral density. It then follows that. it follows that there are a sequence of coefficients {q n (k) = −ck /c0 }n not all zero k=0 such that. . with an ˆ algorithm Φn (·) with the squared norm as the error measure. n max υ(x) = x 2 ≤1 2 . υ 2 = 1. Instead. Illustration of Convex Separation. φl ) A. PARAMETER E STIMATION R EVISITED In this section we provide a complementary perspective on parameter estimation in random noise. υ(·) consists of an infinite family and needs discretization. X and Y axes denote range of measurement and parametric values respectively. . The motivation for revisiting this well studied topic is threefold: (1) First. √ for a vector x ∈ IRp we have x ∞ ≤ p x 2 ). . Φ(h). IV. A feasible point. p Fig. 
Wn are column vectors of length n and with Qn 2 ≤ 1 and Λn = (λ1 (·). γ0 + (with respect to the squared norm). Therefore. υ Clearly. For ease of exposition we let xj = φj (h). i. let x ∈ IRp endowed with the squared norm metric and let υ(·) be a linear functional on IRp . i. (x1 . for every . We consider the problem of estimating the linear function Φ(h) = (φ1 (h). φp (h)). they do provide a basis for determining this uncertainty through linear algorithms.5 Now to draw a direct correspondence with the theorem we proceed as follows. Then a linear algorithm achieves the minimum diameter of uncertainty for estimating a linear function Φ(h) mapping H to IRp from data under the ∞ and squared norm error measures. xp ) = Φ(h) = (φ1 (h). .

x(1). .e. . n n n qk (l)u(l l=0 1/2 1= − k) ≤ n qk 2 l=0 u (l − k) 2 y(t) = k=0 h(k)u(t − k) + w(t) This implies that (Pn − P−1 )u 2 → ∞.e. Σn ≥ ρIm . which model parametric dependencies. y(t−2). . (3) We can describe tight lower bounds for error for finite data without the need for a consistent estimator. To describe our problem we need the notions of persistent input and persistency of excitation. The proof is readily extended to the more general case with linear parameterizations. n where.d.. . we need n limsupn qk 2 is an i. θ ∈ IRp (10) We assume that w(t) is i. Z ∈ IRl×p . Since norms on finite dimensional spaces are all equivalent we employ the ∞ norm here. Therefore. Σn . h ∈ IR is arbitrary it follows that. .e. The following result follows: Theorem 2: For the setup above the radius of uncertainty is zero if and only if the input u(·) is persistently exciting of order p. v2 (·). gaussian random process. i. the assumptions of Theorem 1 are satisfied. Φ(·) in this case is the identity operator. We next describe persistency of excitation. for some N0 ∈ Z + and ρ > 0. The necessity part to the best of author’s knowledge is new. The radius of uncertainty is equal to zero if and only if the input is persistently exciting of order p. y(t−m1 ). Corollary 3: Let the input-output equation be given by y(t) = ψ T (t)θ + w(t). j=k l=0 p n qk (l)u(l − j)h(j)| −→ 0 Since. i. . which also is weaker than the conventional notion. ∀ j = k (9) and Prob Since. based on our hypothesis. (2) It turns out that under this new criterion the well known input requirements become transparent and can be shown to be both necessary as well as sufficient for achieving consistency. . for all n ≥ N0 . . ψ T (t) = [y(t−1). i. Our first example is the case of . l=0 n qk (l)u(l − j) = 0.. hn (y) k such that. there exists a linear algorithm that achieves uniform consistency as well. x(k). the input must be persistently exciting and from Equation 9 it follows that the persistency of excitation must be of order p as stated. n 2) The m × m matrix. Definition 4: An input u(·) is said to be persistently exciting of order. . vm (·) such that 1) The input u(·) is persistent. .6 Our objective differs in two ways: (a) It directly deals with ˆ the estimation error. v1 (·). . A persistent input is an ∞ sequence with unbounded energy. Prob ˆn sup h(k) − hk (y) p h∈IR p−1 ∞ −→ 0 where. Therefore. is said to be persistent if lim inf Pn x 2 −→ ∞ n We point out that this definition is weaker than the conventional notion. . . . i.e. Definition 3: An bounded sequence. w(·)    n j=0 n qk (j)w(j) ≥    N→∞ −→ 0 n n j=0 qk (j)w(j) is gaussian with n equal to qk 2 . We first prove the result for FIR parameterizations. θ − θn (y) (b) we seek uniform ˆ convergence. This implies existence of an instrument n qk corresponding to each coefficient h(k) so: n n n qk (j)w(j) j=0 | + j=0 n (qk (j)u(j − k) − 1)h(k) n (8) N→∞ p−1 + j=0. We point out that most results that exist in the literature in this connection are sufficient conditions and necessity is not easy to establish in many cases.e. if there exists m instruments (discrete-time signals). which is required typically for Cramer-Rao bounds. u(t). . . 2.d. .. n 1 limn→∞ n t=0 x2 (t) = C. ∞ (x(0). p−1 With this association we have a linear observation structure and since the objective is to estimate the FIR taps. .i. Proof: The sufficiency part of the result can be derived along the lines of [13] and is omitted. Consequently. 
We describe some relevant situations where such convex constraints naturally arise. gaussian noise. can also be developed. with [Σn ]jk = rvj u (k) satisfies. a new aspect is that optimal algorithms for identification problems with convex parametric constraints. mean zero and variance →0 Now from Equation 9 we have from Cauchy-Schwartz inequality. .. Note that the parametric set is balanced. .i. . i. j = 1. u(t − m2 )] is the usual regression vector formed from previous inputs and outputs. .e. where the input signal is required to be quasi-stationary and have finite power. i. Therefore. Prob{supθ θ − θn (y) }. .. u(t− 1). We are then given that there exists an algorithm. n n l=0 n qk (l)u(l − k) = 1. These ideas can also be extended to more general convex parameterizations which account for parametric dependencies.. m. We consider finitely many parametric constraints modeled as: |Zθ| ≤ b (11) To apply Theorem 1 we need to define λi s first: λj (h) = k=0 h(k)u(j − k).). where m1 + m2 = p. (4) Finally.

Then it follows that. |Γb| + Q 2. denotes the gaussian error function [12] and we have assumed that radius of uncertainty is equal to zero (for simplicity of exposition). i. Therefore. E. Similarly. note that the problem setup conforms to the convex setup. Q 2. W 0 is the same convex set with probability mass larger than 1 − ξ as described in Theorem 1. Remark: Observe that when the constraints are given by Zθ = 0. {µj } on the parameter θ. One can then attempt to minimize the index by choosing different inputs. the parameter set is convex and balanced. From Theorem 1 it follows that linear algorithms are optimal for the ∞ norm. is linear.7 a positive real condition. QΨn + ΓZ = I. the UBN setup does not strictly fall within the framework described in Section II the proof technique of Theorem 1 can be used to establish that the radius of uncertainty is bounded away from zero. This is indeed the case as seen from Equation 12. This is because in this case. ψ(n)]T be the data regression matrix. therefore it follows from the steps used in the proof of Theorem 1 that there must exist a linear algorithm that achieves optimal error for the latter (RHS of inequality above) problem. the input-output data is given by a linear measurement structure. Proof: The proof of the result follows from convex duality and in particular a straightforward application of Farkas lemma. let U n be the set of all unit-amplitude-bounded input sequences of length n. it follows similar to the arguments used in the preceding theorems that if the radius of uncertainty is smaller than δ. p−1 µ0 = k=1 γk µk . such that |(QΨn − I)θ + Qw| ≤ δ + . T T µ0 θ − b0 > 0 whenever µi θ − bi > 0. . then there must exist a matrix Q and an integer n. . for a given > 0. m m y(t) = k=0 h(k)u(t − k) + w(t). The proof now follows by a direct application of Farkas lemma. . let Ψn = [ψ(0). . It is well known [23] that regardless of the input. w(t) ∈ [−γ. we should expect that the persistency of excitation condition can be relaxed. ξ > 0.e. From the results in Corollary 4 we have for a convex parametric set S that. erfc(·). . k=1 γk bk ≤ b0 . given ˆ ≥ sup sup h − hn Wn h ∞ . We have the following corollary for convex parameter constraints: Corollary 4: Consider the setup of Equations 10. Our next task is to comment on the issues of optimality and choice of optimal inputs. estimator and data-length the estimation error is always bounded from below by γ. γ]. These results in turn help in loosely characterizing inputs that would lead to optimal identification. 2. Indeed. Furthermore. the positive constraint Γ ≥ 0 no longer exists. which is stated here for the sake of completion [19]: Lemma 1 (Farkas Lemma): Consider a set of linear functions. w ∈ W n 0 n where. where y is the observation vector of length n. Therefore. Φ(θ) = θ. i = 1.∞ ≤ δ. q n (·). ω ∈ [−π. W n = [−γ. First. γk ≥ 0 We now return to the proof of Corollary 4. . ∀ |Zθ| ≤ b. Zθ = b. ˆ sup sup h − hn Wn h 1 Now. hn (y). |u(·)| ≤ 1 (13) ˆ In the context of UBN we are interested in an estimate hn (y) based on input-output data of length n so the worst-case error over all FIRs and admissible noise realizations are minimized: ˆ min → sup sup h − hn n p ˆ hn h∈IR W 1 where. Let. with the error estimate. which can be modeled by convex constraints: k θk cos(ωk) ≥ 0. all of the inequalities are to be interpreted pointwise. The desired solution. 
It follows that for a data length n the radius of uncertainty (in the ∞ norm) is smaller than δ if and only if there exists matrices Q ∈ IRp×n and Γ such that. Therefore. Uniform Bounded Noise Although. A. Prob max θ − θn (y) θ∈S ∞ ≥ ≥ = Prob { Qw erfc ∞ ≥ − Γb ∞} − Γb ∞ Q 2.∞ subject to constraints of Equation 12.. 11. The optimal cost of this optimization problem would provide an index J(Φn ) which is a function of the regressor Φn . We provide a new proof for this result based on Theorem 1 as stated below. we are only left to satisfy QΦn + ΓZ = I for an arbitrary Γ. π] To utilize these constraints we can discretize the infinite system of inequalities and through translation consider convex and balanced subsets of the parameter space. It follows that if Z has rank m then Φn need only have rank p − m to ensure that the equation can be satisfied. Proof: We note that. ψ(1). The solution consists of finding a Q that satisfies Equation 12. γ]n is the set of all admissible noise sequences up to time length n. Theorem 3: The minimum radius of uncertainty is larger ˆ than γ for any algorithm. in the dual setting.∞ the input needs to be persistently exciting of a smaller order than the parametric dimension. An estimate of θ is then given by Qy. which can solved through convex programming. . m where. Another example where these results are applicable is when the set of parameters belong to a linear manifold. This would follow by first minimizing. Thus there is a linear instrument. both W n and h are convex balanced sets. Γ ≥ 0 (12) where.

V. UNMODELED DYNAMICS WITH STOCHASTIC NOISE

In this section we deal with the problem of asymptotic consistency for systems with mixed uncertainties consisting of stochastic measurement noise and worst-case (deterministic) unmodeled dynamics. System identification with restricted complexity models has been explored in [4], [8], [27], [35], [36] for deterministic worst-case noise, and in mixed situations in [30]. Here we study mixed situations and derive fundamental conditions required to achieve uniform convergence of the model estimates to their optimal approximations in the model class.

Our setup deals with the following input-output equation:

$$y(t) = Hu(t) + w(t) = G(\theta)u(t) + \Delta u(t) + w(t), \qquad (14)$$

where $H = G(\theta) + \Delta \in S$, $S$ is the subset of systems, $G(\theta) \in \mathcal{G}$ is a model class of interest, $\Delta$ is the unmodeled component, $u$ is a bounded input with $|u(k)| \leq 1$, and $w(0), w(1), \ldots$ is a sequence of i.i.d. jointly Gaussian random variables. The main goal is to estimate $G(\theta)$ from input-output data in a uniform manner as described in Section II. Specifically, we consider model classes that are subspaces,

$$\mathcal{G} = \Big\{ G \in \mathcal{RH}_2 \;\Big|\; G = \sum_{k=1}^{m} \theta_k G_k, \; \theta_k \in \mathbb{R}, \; G_k \in \mathcal{RH}_2 \Big\}. \qquad (15)$$

A. Overbounding Unmodeled Dynamics

The main goal of this section is to motivate the need for an optimal decomposition of the system into model and unmodeled dynamics. The problem arises whenever restricted complexity models are used for identification. We illustrate it by computing the radius of uncertainty for identifying an FIR model class in the presence of unmodeled dynamics. Consider

$$y(t) = Hu(t) + w(t) = Gu(t) + \Delta u(t) + w(t), \qquad (16)$$

where $H \in \ell_1$, $G \in \mathcal{G}_{FIR}$, the class of finite impulse response (FIR) sequences of order $m$, and the unmodeled class is

$$\boldsymbol{\Delta} = \{ \Delta \in \ell_1 \mid \|\Delta\|_1 \leq \gamma \}, \qquad (17)$$

which indicates that the $m$th order FIR accounts for the underlying system within a worst-case error of size $\gamma$. This is clearly a redundant decomposition of the underlying system: the first $m$ taps of the true system are accounted for both by the model and by the unmodeled dynamics. This results in a minimum radius of uncertainty larger than $\gamma$, as the following proposition shows. Note the contrast with Theorem 1, which provides a situation in which the minimum radius of uncertainty is zero: if a small subset of noise sequences of vanishingly small measure is rendered inadmissible, then the radius of uncertainty is equal to zero.

Proposition 2: For the setup above the minimum radius of uncertainty is greater than $\gamma$.

Proof: The convexity of the set of systems, the linearity of the observations, and the linearity of the solution operator $\Phi(h) = (h_0, \ldots, h_{m-1})$ can be readily verified. By Theorem 1 there must therefore exist instruments $q^n_l(\cdot)$ that are optimal for estimating the FIR model class. Consider estimation of the first tap; the worst-case error satisfies

$$E = \max_{0 \leq m \leq p-1} \Big| \sum_{j=0}^{n} \big(q^n_m(j)u(j) - 1\big)h(0) + \sum_{j=1}^{p-1} \sum_{l=0}^{n} q^n_m(l)u(l-j)h(j) \Big|,$$

and if $E < \infty$ then $\sum_{k=0}^{n} q^n_0(k)u(k) = 1$. Since $|u(k)| \leq 1$, it also follows that $1 = \sum_{j=0}^{n} q^n_m(j)u(j) \leq \|q^n_m\|_1 \|u\|_\infty = \|q^n_m\|_1$ for all $m$. Now consider admissible perturbations that satisfy $\delta(k) = 0$ for all $k \geq m$; for such perturbations

$$E \geq \Big| \sum_{k=0}^{n} q^n_0(k)(G + \Delta)u(k) + \sum_{k=0}^{n} q^n_0(k)w(k) - h(0) \Big| \geq \Big| \sum_{k=0}^{n} q^n_0(k)u(k)\,\delta(0) \Big| = \gamma,$$

thus establishing the result. The radius of uncertainty can only become smaller if we restrict attention to a smaller subset of systems, so the bound holds a fortiori for the original setup.

This argument generalizes beyond the FIR case.

Proposition 3: Suppose the system decomposition is convex and balanced and that for some $\eta > 0$ the model $\eta G$ is an element of the unmodeled set $\boldsymbol{\Delta}$. Then the radius of uncertainty must be larger than $\eta$.

The proof can be generalized to other model classes as well. The main problem with these results is the apparent gap between practical reality and the minimum radius of uncertainty achieved through the decomposition of Equation 16. In the following section we discuss the issues that arise when the perturbations $\Delta$ are specified independently of the model class.

These issues are discussed in this section. For instance, consider the case when the noise is relatively small. In practice the first $m$ taps of the impulse response of $H$ can then be identified accurately, and exactly so in the noiseless case; yet the decomposition of Equation 16 suggests that the radius of uncertainty never drops below $\gamma$. In other words, the parametrization in Equation 17 overbounds the existing input-output relationship and leads to a conservative result. This issue was acknowledged in [27], where it was pointed out that once a normed space of systems is chosen there is an optimal decomposition, in the sense that it minimizes the unmodeled error. Motivated by Proposition 3, we therefore seek to ascribe the modeled and unmodeled components to different aspects of the data, which requires constraining the unmodeled perturbations further in some manner.

e) Example: For an FIR model space of order $m$ there are $m$ poles at zero, so the corresponding all-pass factor $F$ is a delay of order $m$. The unmodeled component is then the set of all stable real-rational functions of the form $z^{-m}\Delta$, and the optimal decomposition leads to the following description of the unmodeled error:

$$\boldsymbol{\Delta} = \{ z^{-m}\Delta \mid \|\Delta\| \leq \gamma \}. \qquad (18)$$

Indeed, using Cauchy's theorem we can do more, as in the following proposition.

Proposition 4: Consider the model class given by Equation 15. Then every real-rational system $H$ can be written as

$$H = G_0 + F\Delta \qquad (19)$$

for some $G_0 = G(\theta_0)$, a real-rational all-pass function $F$ and a perturbation $\Delta$, where the zeros of $F^*$ are the pole locations of $G_1, G_2, \ldots, G_m$. Moreover, this decomposition leads to the minimum unmodeled error.

Proof: The proof follows by applying Cauchy's theorem [21].

Once such a decomposition is chosen, the objective is to estimate the optimal model component.

B. Identification in Hilbert Spaces

Motivated by the discussion above we derive identification results for more general model spaces. As before we restrict the size of the perturbation and consider the following subsets of real systems:

$$\boldsymbol{\Delta}_1 = \{\Delta \mid \|\Delta\|_1 \leq \gamma\}, \qquad \boldsymbol{\Delta}_2 = \{\Delta \mid \|\Delta\|_2 \leq \gamma\}, \qquad (20)$$

$$S^1 = \{H \in \mathcal{RH}_2 \mid H = G + F\Delta,\; G \in \mathcal{G},\; \Delta \in \boldsymbol{\Delta}_1\}, \qquad S^2 = \{H \in \mathcal{RH}_2 \mid H = G + F\Delta,\; G \in \mathcal{G},\; \Delta \in \boldsymbol{\Delta}_2\}. \qquad (21)$$

As we remarked earlier, the second subset is considered because the first, although natural, contains unstable elements: not all elements of unit balls in $\mathcal{RH}_2$ are necessarily stable. We note that the projection $\Pi H$ of the system $H$ onto the subspace $\mathcal{G}$ is linear and of finite rank. On a Hilbert space the convex decomposition is preserved and we should expect Theorem 1 to apply in this setting as well; this is natural since the dual is also a Hilbert space and orthogonal separation between modeled and unmodeled components is easily established. In particular, the solution operator $\Phi(H) = (\theta_1, \ldots, \theta_m)^T$ is linear. We now state the main result of this section; for ease of exposition the proof is given for a one-parameter model class, the general case following in an identical manner.

Theorem 4: Let $\hat u = Fu$ be the input filtered through the all-pass filter $F$, and let $u_l(\cdot) = G_l u(\cdot)$, $l = 1, \ldots, m$. A necessary and sufficient condition for uniform consistency over the set $S^2$ is that the input be persistent and that there exist a sequence of instruments $q^n(\cdot)$ such that

$$\sum_{k=0}^{m} r_{q^n u}(k)\, g_l(k) = r_{q^n u_l}(0) = 1, \quad l = 1, \ldots, m, \qquad (22)$$

$$\sum_{\tau=0}^{n} \big(r_{q^n \hat u}(\tau)\big)^2 \xrightarrow[n\to\infty]{} 0, \qquad \max_{1 \leq \tau \leq n} r_{q^n \hat u}(\tau) \xrightarrow[n\to\infty]{} 0, \qquad |r_{q^n w}(0)| \longrightarrow 0 \text{ in probability}. \qquad (23)$$

If $S^1$ is considered instead, the conditions for uniform consistency remain the same except that the second condition is weaker: only the normalized worst-case cross-correlation between the filtered input and the instrument need go to zero.

Proof: First consider the set $S^2$. The subset $S^2$ is convex and the observation given by Equation 14 is linear, so the setup mirrors Theorem 1. By Theorem 1, if uniform consistency holds there must exist, for every $\epsilon, \xi > 0$, a number $n$ and an instrument $q^n(\cdot)$ such that, with $\mathrm{Prob}\{W^n_0\} \geq 1 - \xi$,

$$\sup_{w \in W^n_0} \sup_{H \in S^2} \Big| \sum_{k=0}^{n} q^n(k)y(k) - \theta_1 \Big| \leq \epsilon.$$

Expanding $y(t) = Hu(t) + w(t)$ with $H = \theta_1 g_1 + f_1\delta$ we obtain

$$\Big| \sum_{k=0}^{n} q^n(k)w(k) + \sum_{j=0}^{n}\sum_{k=0}^{n-j} q^n(k+j)u(k)\big(\theta_1 g_1(j) + (f_1\delta)(j)\big) - \theta_1 \Big| \leq \epsilon, \quad \forall\, \theta_1 \in \mathbb{R},\; \delta,\; w \in W^n_0,$$

where $g_1(k)$, $f_1(k)$, $\delta(k)$ are the impulse responses of $G_1$, $F_1$, $\Delta$ respectively; we have applied Proposition 4 and substituted $F_1\Delta$ for the perturbation, so that $F_1$ and $G_1$ are orthogonal, and in the second expression we have absorbed the filter $F_1$ into the input, $\hat u = F_1 u$. By arbitrarily varying $\theta_1$, $\delta$ and $w$ the three terms must vanish separately. The noise condition together with the modeled term implies that the input must be persistently exciting: with $u_1 = G_1 u$,

$$1 = \sum_{j=0}^{n} q^n(j) u_1(j) \leq \|q^n\|_2 \, \|G_1\|_{H_\infty} \, \|P_n u\|_2,$$

where $\|\cdot\|_{H_\infty}$ is the usual $H_\infty$ norm and $G_1$ is bounded in $H_\infty$ because it is a stable real-rational transfer function; since the noise condition forces $\|q^n\|_2 \to 0$, it follows that $\|P_n u\|_2 \to \infty$. Next, for the perturbation we apply the Cauchy-Schwarz inequality to obtain

$$\sup_{\Delta \in \boldsymbol{\Delta}_2} \sum_{j=0}^{n} r_{q^n \hat u}(j)\,\delta(j) \;\leq\; \gamma \Big( \sum_{j=0}^{n} \big(r_{q^n \hat u}(j)\big)^2 \Big)^{1/2} \longrightarrow 0.$$

For the set $\boldsymbol{\Delta}_1$ we similarly have

$$\sup_{\Delta \in \boldsymbol{\Delta}_1} \sum_{j=0}^{n} r_{q^n \hat u}(j)\,\delta(j) \;=\; \gamma \max_{j \leq n} r_{q^n \hat u}(j) \longrightarrow 0.$$

This establishes necessity. We do not describe the sufficiency argument since it is similar to the sufficiency argument for FIR model classes.

Remark: The conditions for $S^2$ are more demanding than those for $S^1$: in the $\mathcal{RH}_2$ case the sum of squares of the normalized cross-correlations must also go to zero, whereas for $S^1$ only the normalized worst-case cross-correlation between the filtered input and the instrument must vanish. This is understandable since $\mathcal{RH}_2$ forms a larger class of perturbations than $\ell_1$.

Remark: To see how the convex separation yields uniform consistency, note that the conditions arising from the modeled part imply that $G_1 u(\cdot)$ is aligned with the instrument $q^n(\cdot)$, while the unmodeled part implies that $F_1 u(\cdot)$ is perpendicular to $q^n(\cdot)$. Therefore, by choosing an appropriately filtered input as the instrument, the conditions of Equation 23 can be satisfied. This is possible because $G_1$ and $F_1$ are orthogonal, so $G_1 u$ and $F_1 u$ are uncorrelated when driven by white noise.

We next discuss issues of optimality and choice of inputs. For this class of problems the fact that the optimal algorithm is linear follows from Theorem 1. In particular, for a one-tap FIR model class the optimal instrument is the solution to

$$\min_{q^n} \; (q^n)^T \Psi_n \Psi_n^T (q^n) \quad \text{subject to} \quad \sum_{j=0}^{n} q^n(j)u(j) = 1,$$

where $u = [u(0), \ldots, u(n)]^T$ and $\Psi_n = [\psi(0), \psi(1), \ldots, \psi(n)]^T$ is built from the filtered input $\hat u(t)$ and its delays. As it turns out, it is difficult to find inputs that lead to uniform consistency for the set $S^2$: the conditions effectively require the input to be persistently exciting of infinite order, and hardly any inputs satisfy them. To see this, suppose the value of the above quadratic optimization problem exceeds some $\beta > 0$ (the opposite of what we want). Using Schur complementation (and assuming $\Psi_n\Psi_n^T$ is invertible),

$$\min_{q^n} (q^n)^T \Psi_n \Psi_n^T (q^n) = \frac{1}{u^T (\Psi_n \Psi_n^T)^{-1} u} \geq \beta \iff \begin{bmatrix} 1/\beta & u^T \\ u & \Psi_n \Psi_n^T \end{bmatrix} \succeq 0.$$

If the input is chosen i.i.d., the expected value of the left-hand side is bounded away from zero, and for a $\beta > 0$ independent of $n$ the condition is met with probability one, leading us to believe that few inputs satisfy the opposite condition. On the other hand, the weaker condition for $\boldsymbol{\Delta}_1$ can be met by a number of inputs, namely inputs that are white in their second-order properties, since the condition is easily satisfied by inputs with small autocorrelation coefficients, i.e., $\sum_{j=1}^{n} (r^n_u(j))^2 \ll \beta$. In particular, we only need to establish that there exist inputs with small autocorrelations with high probability.

Lemma 2: Suppose $u(t)$ is a discrete i.i.d. Bernoulli random process with mean zero and variance 1. Then

$$\mathrm{Prob}\Big\{ \max_{0 < k \leq n} |r^n_u(k)| \geq \alpha \Big\} \leq n \exp(-n\beta(\alpha)), \qquad (24)$$

where $\beta(\alpha) = 1 + \frac{1-\alpha}{2}\log_2\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\log_2\frac{1+\alpha}{2}$.

Proof: The proof of this result can be found in [6].

In summary, we have characterized conditions for uniform consistency and optimality on Hilbert spaces, where convex separation between the model class and the unmodeled dynamics proved to be critical. In the next section we consider a situation where such a convex separation does not exist.

C. Identification in $\ell_1$ and non-linear functionals

We now turn to the problem of identification in the $\ell_1$ norm. In the previous section the Hilbert space structure led to a convex decomposition of the unmodeled error and the model class, and Theorem 1 could be applied in a straightforward manner; under the $\ell_1$ norm this is no longer automatic, which is dealt with below. As before, notice that one could incorporate additional parametric and non-parametric constraints as we did in Section IV and Equation 12.
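Before turning to the $\ell_1$ problem, the input property that the preceding discussion relies on (Lemma 2) is easy to probe numerically. The following sketch, with arbitrary simulation parameters chosen only for illustration, estimates the probability that the maximum normalized autocorrelation of an i.i.d. Bernoulli input exceeds a threshold and compares it with the stated bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 500
alpha = 0.2
exceed = 0

for _ in range(trials):
    u = rng.choice([-1.0, 1.0], size=n)          # i.i.d. Bernoulli (+/-1), mean 0, variance 1
    # normalized sample autocorrelations r_u(k), k = 1..n-1
    acf = np.correlate(u, u, mode="full")[n:] / n
    if np.max(np.abs(acf)) >= alpha:
        exceed += 1

beta = 1 + (1 - alpha) / 2 * np.log2((1 - alpha) / 2) \
         + (1 + alpha) / 2 * np.log2((1 + alpha) / 2)
print("empirical Prob{max_k |r_u(k)| >= alpha}:", exceed / trials)
print("bound n*exp(-n*beta(alpha))            :", n * np.exp(-n * beta))
```

Both numbers are tiny for these parameters, which is the point: Bernoulli inputs are white in their second-order sample-path properties with overwhelming probability, and are therefore good candidates for the instrument constructions that follow.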

 . a number N (ρ) for every ρ > 0 such that for all n > N (ρ): 1≤τ ≤n n max rqn u (τ ) n→∞    q n (0)   .   .d. .e.. n−j n k=0 ql (k + w ∈ Wn 0 j)u(k) ≤ The proof now follows by first observing that for the hypothesis to be true we must have. 0 q n (0) . consider the optimal decomposition: S = {H ∈ 1 Next. i. . . . . 0 ≤ l ≤ m.  . max1<j<n n n k=0 ql (k)w(k)| | ≤ . i. i. . . . norm measure on Φ(·) is the ∞ norm. Consequently. Let y. 0 The persistency of the input follows along similar lines as in Theorem 2. we present identification of FIR model class in the 1 norm. Qn be the matrix composed of the instrument q n and its m delays.. ∆ ∈ ∆} (25) where. (⇐=) To prove sufficiency we construct an IV technique n that is uniformly consistent.. w be the output and noise signal vectors of length N respectively.. uniform consistency in the 1 norm for the first m FIR taps is identical to uniform consistency in the ∞ norm. .. . ξ. S is convex. S) ˆ such that Prob supH∈S Pm (h − hn ) ∞ ≥ ≤ ξ. . 0 . . . G F IR is the class of m-tap FIR sequences. . the minimum radius of uncertainty is zero for the ∞ norm as well. g = Pm H and δ is the column vector of impulse response of the unmodeled dynamics. under the 1 norm. However for more general classes under the 1 norm the decomposition through optimal approximations is not convex. 0 .11 For FIR model classes.  . . .. . . . . . Then. Therefore. . S). R =  .   . ... The setup now mirrors Theorem 1 since the set. q n (m + (k − 1) + j)u(0)  . G. . for length n of data. . Again from our hypothesis together with the fact that that ∆ ≤ γ. Qn =  .. 0 n n n n k=0 ql (k)w(k) + k=0 (ql (k)u(k) − 1)h(j) n→∞ n n−j n + j=m k=0 ql (k + j)u(k)h(j) −→ 0 (QT Rn )k =  n q n (m + (k − 1) + j)u(j). Prob {W n } ≥ 1 − ξ.. q n = q1 and its delayed copies up to m-delays serve as instruments. the decomposition is still convex. . we let. QT Rn δ converges to zero. Un a corresponding matrix of input sequence with its m delays and Rn the residual delays of the input sequence. . .. 0 . . w is i.. ∀ n ≥ N ( . . −→ 0. by noting that ∆ is arbitrary with 1 norm less than γ and w(·) is an arbitrary element in the convex set W n . Φ(H) = (h(0). .   u(0) u(1) . j=0 j=0 Consequently. which 0 also contains the zero element we have. . .. . . . This is due to the fact that optimal FIR approximation in 1 and RH2 are identical. n ≥ ˆ L( . 0 u(0) . n n→∞ Suppose. . . δ = (Pn − Pm )H. . gaussian noise.   .. h and a number L( . .  .   u(0) .  n . n To show this. . n n k=0 ql (k)u(k) = 1. . n  n+1−(m+k)     Un =       0 0 .i. . This implies by hypothesis that. . . ξ) n−(m+k) and Prob{W n } ≥ 1 − ξ Upon substitution it follows that.          sup sup n w∈W 0 H∈S | n ql (k)y(k) k=0 − h(l) ≤ . Theorem 5: The radius of uncertainty for the above setup is equal to zero if and only if there is a persistent input and a corresponding family of instrument variables q n (·) which uniformly de-correlates the input and noise.  .. there is a linear algorithm with n weights. for each FIR tap.. Basically we let. ∆ = {z −m ∆ | ∆ 1 ≤ γ}. . . λt (h) = Hu(t). . given > ˆ ˆ 0.. q n (0) n  q (1)  . i. .   u(0) . . In the next section. . by hypothesis it follows that. ξ. . q n (m + k + j)u(j). n rqn u (0) = 1 (26) n (27) |rqn w (0)| −→ 0 in probability Proof: (=⇒) We observe that all norms are equivalent on finite dimensional spaces [15].. .e. ξ > 0 there is an algorithm. . h(m − 1))T . D.  (28) | H = G + ∆. .  . . . 0 . 
y = Un g + Rn δ + w =⇒ QT y = QT Un g + QT Rn δ + QT w n n n n (29) It follows from hypothesis that QT Un is an invertible matrix n with bounded inverse for large enough N since it converges to a lower triangular matrix with identical elements on the diagonal. In the following theorem we derive necessary and sufficient conditions for achieving uniform consistency for estimating the FIR coefficients. FIR Model Class In this section we derive necessary and sufficient conditions for identification of FIR model classes in the presence of unmodeled dynamics under the 1 norm. G ∈ G F IR . .. . The results developed for FIR case will be used to compute uniformly consistent estimators for more general model classes. h. g is the column vector composed of the first m impulse response coefficients.e.  . . . it follows that. such that. . 0 .   . measurement structure is linear.  .  . .  .. . In particular. To this end. . . by hypothesis. Consequently. ql (·). . . .e. . . . . note that the kth row of QT Rn is given by. .
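A minimal numerical sketch of the IV construction in Equations 28 and 29 is given below. The system, noise level and model order are placeholders, and the white Bernoulli input itself is used as the instrument (in the spirit of the choice $q^n(k) = u(k)/n$ discussed later in the text), so this is an illustration of the estimator's form rather than the paper's exact experiment.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(5)
n, m = 2000, 4                                  # data length and FIR model order (arbitrary)
g_true = np.array([0.8, -0.5, 0.3, 0.1])        # first m taps (placeholder system)
delta = 0.05 * (0.6 ** np.arange(40))           # unmodeled tail beyond lag m
h = np.concatenate([g_true, delta])

u = rng.choice([-1.0, 1.0], size=n)             # white (Bernoulli) input
y = np.convolve(u, h)[:n] + 0.3 * rng.standard_normal(n)

# Toeplitz data matrix U_n with the input and its m-1 delays (as in Equation 28);
# with q^n = u the instrument matrix Q_n has the same structure.
U = toeplitz(u, np.r_[u[0], np.zeros(m - 1)])
Q = U.copy()

g_iv = np.linalg.solve(Q.T @ U, Q.T @ y)        # Q_n^T y = Q_n^T U_n g  (Equation 29)
print("true taps   :", g_true)
print("IV estimate :", np.round(g_iv, 3))
```

For a white input the estimate of the first $m$ taps is insensitive to the unmodeled tail, which is the content of Theorem 5: the instrument aligns with the input at lag zero and de-correlates it at the residual lags.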

We next discuss the issue of sample complexity and establish polynomial sample complexity of linear algorithms for the FIR model class. The sample complexity of an algorithm $\Phi(y)$ is said to be polynomial [32] in learning theory if $L(\epsilon, \xi, S) \leq C \left(\frac{1}{\epsilon}\right)^d \log\frac{1}{\xi}$ for some constant $C > 0$ and some integer $d$. Suppose the input sequence is a Bernoulli process and choose the instruments $q^n(k) = u(k)/n$. Bounding separately the contribution of the unmodeled dynamics, through the input autocorrelations and Lemma 2, and the contribution of the noise, through the input-noise cross-correlations and the Gaussianity of $w$, gives

$$\mathrm{Prob}\Big\{ \sup_{H \in S} \|Q_n^T y - P_m(H)\|_\infty \geq \epsilon \Big\} \;\leq\; n\exp(-n\beta(\epsilon)) + n\exp(-n\epsilon^2/\sigma^2),$$

where the probability is taken over both the input and the noise process. It follows that, for some $C > 0$,

$$L(\epsilon, \xi, S) \leq C\, \frac{1}{\epsilon^2} \log\frac{1}{\xi},$$

implying polynomial sample complexity of convergence.

E. General Model Classes

In this section we derive algorithms for achieving uniform consistency for general model classes (that are subspaces) in the $\ell_1$ norm. The main difficulty is that the decomposition achieved through optimal approximation is not convex, so Theorem 1 cannot be applied directly. The following example illustrates the issue of non-convexity of $\ell_1$ decompositions.

f) Example: Consider the convex set of systems $S = \{H \in \ell_1 \mid H_\alpha = \alpha H_1 + (1-\alpha)H_2,\; \alpha \in [0,1]\}$, where $H_1(z^{-1}) = 1$ and $H_2(z^{-1}) = z^{-1} + (1 - \frac{z^{-1}}{2})^{-1}$, and let the model class be $\mathcal{G} = \{\theta (1 - \frac{z^{-1}}{2})^{-1} \mid \theta \in \mathbb{R}\}$. From the alignment conditions for optimality [15] it follows that the unique optimal approximation of $H_1$ is $G(H_1) = 0$. Similarly, we deduce that $G(H_2) = (1 - \frac{z^{-1}}{2})^{-1}$, and that $G(H_\alpha) = G(H_2)$ is the unique optimal approximation for all $\alpha \neq 1$. Hence the set of optimal approximants of the convex set $S$ is not convex.

Motivated by these issues we solve the problem of $\ell_1$ identification indirectly. We proceed by first describing the optimal approximation as a known non-linear function of linear functionals on the underlying system space. The idea is that the $\ell_1$ minimizer can be arbitrarily closely approximated by a non-linear function of linear functionals of the system impulse response; since linear functionals of the system can be efficiently estimated, the non-linear function can be estimated as well, and uniform consistency in $\ell_1$ then follows by estimating the linear functionals. Our setup consists of

$$y(t) = Hu(t) + w(t) = Gu(t) + \Delta u(t) + w(t), \qquad G \in \mathcal{G} = \Big\{ \sum_{j=1}^{p} \alpha_j G_j \;\Big|\; \alpha_j \in \mathbb{R} \Big\}, \qquad \Delta \in \boldsymbol{\Delta} = \{ \Delta \in \mathcal{G}^\perp \mid \|\Delta\| \leq \gamma \},$$

where $\mathcal{G}^\perp$ is the space of annihilators of the model space $\mathcal{G}$ and $w(\cdot)$ is i.i.d. Gaussian noise as before. The main goal is to estimate the optimal approximation $G_0 = \arg\min_{G\in\mathcal{G}} \|H - G\|_1$ from input-output data in a uniform manner. To this end we have the following theorem, whose proof relies on a number of results provided below.

Theorem 6: The input characterization of Theorem 5 is sufficient for uniform consistency in the $\ell_1$ norm.

Proposition 5: The optimal $\ell_1$ approximation $G_0(H)$ for a given system $H$ can be written as a known non-linear continuous map $\psi(\cdot)$ of a finite-rank linear operator $L$,

$$G_0(H) = \psi(L(H)) = \Pi(H) + \arg\min_{G \in \mathcal{G}} \|H - \Pi H - G\|_1 = \lim_{m\to\infty} G_0(P_m H),$$

where $\Pi$ is the $\ell_2$ projection onto the space $\mathcal{G}$ and $L$ maps the system $H$ to an appropriate finite dimensional space.

Proof: The proof follows from the equality $\min_{G\in\mathcal{G}} \|H - G\|_1 = \min_{G\in\mathcal{G}} \|H - \Pi H - G\|_1$, which holds because $\mathcal{G}$ is a subspace. The limiting argument follows from the fact that $P_m H \to H$ together with continuity of the projection operation and of the $\ell_1$ norm; the decomposition follows from well-known results in $\ell_1$ approximation theory [15]. For notational simplicity we denote the modified problem on the right-hand side by $G_r$,

$$H_r = H - \Pi H, \qquad G_r(H) = \arg\min_{G\in\mathcal{G}} \|H_r - G\|_1,$$

so that the problem reduces to estimating $G_r(H)$.
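Proposition 5 reduces the $\ell_1$-optimal model to a finite-dimensional $\ell_1$ fit applied to (estimates of) the truncated modified system $P_m H_r$. One standard way to compute such a fit is via linear programming; the sketch below is a generic illustration with placeholder basis responses and data, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical finite-horizon problem: h is a truncated impulse response (P_m H_r)
# and the columns of A are truncated impulse responses of the basis systems G_j.
rng = np.random.default_rng(2)
N, p = 60, 2
A = 0.3 * rng.standard_normal((N, p))
h = A @ np.array([1.0, -0.5]) + 0.05 * rng.standard_normal(N)

# l1 fit: min_g ||h - A g||_1, written as an LP with slack variables t:
#   min sum(t)  subject to  -t <= h - A g <= t
c = np.concatenate([np.zeros(p), np.ones(N)])
A_ub = np.block([[ A, -np.eye(N)],
                 [-A, -np.eye(N)]])
b_ub = np.concatenate([h, -h])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (p + N))
g_l1 = res.x[:p]
print("l1-optimal coefficients:", g_l1)
```

In the identification setting, $h$ and the projection $\Pi H$ would themselves be replaced by their uniformly consistent estimates, and the non-linear map $\psi$ of Proposition 5 is exactly this kind of finite-dimensional $\ell_1$ minimization applied to those estimates.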

The next proposition shows that a model close to $H$ forces its $\ell_2$ projection to be close as well, so that the finite-dimensional quantities above are well behaved.

Proposition 6: If there is $G \in \mathcal{G}$ such that $\|H - G\|_1 \leq \gamma$, then there is some constant $L$ such that $\|H - \Pi H\|_1 \leq L\gamma$.

Proof: By hypothesis, $\|H - \Pi H\|_2 \leq \|H - G\|_2 \leq \|H - G\|_1 \leq \gamma$. Write $G$ and $\Pi H$ in state-space form with a common, fixed state-transition matrix $A$ and output matrix $C$, the control vectors being $B_1$ and $B_2$ respectively, so that the control vector is the free parameter. Comparing impulse responses gives

$$\Big\| \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^n \end{bmatrix} (B_2 - B_1) \Big\|_2 \leq 2\gamma. \qquad (30)$$

Since $(A, C)$ is observable, the matrix $[C;\, CA;\, CA^2;\, \cdots;\, CA^n]$ has full column rank, and therefore $\|B_2 - B_1\|_2 \leq K_0\gamma$ for some constant $K_0 > 0$ that depends on the system matrices. Since $B_1$, $B_2$ are finite dimensional and norms on finite dimensional spaces are equivalent, we also have $\|B_2 - B_1\|_1 \leq K_1\gamma$ for some $K_1 > 0$, which implies $\|G - \Pi H\|_1 \leq K_2\gamma$ for some $K_2 > 0$. The result follows from $\|H - \Pi H\|_1 \leq \|H - G\|_1 + \|G - \Pi H\|_1 \leq L\gamma$.

Our final proposition states that the $\ell_1$-optimal model of the FIR truncation converges to the $\ell_1$-optimal model of the full modified system.

Proposition 7: Let $G_m = \arg\min_{G\in\mathcal{G}} \|P_m(H_r - G)\|_1$ denote the best $\ell_1$ approximation of the FIR system $P_m H_r$. Then $\lim_{m\to\infty} \|P_m(H_r - G_m)\|_1 = \|H_r - G_r\|_1$.

Proof: The proof follows from the sequence of inequalities

$$0 \leq \|H_r - G_r\|_1 - \|P_m(H_r - G_m)\|_1 \leq \|H_r - G_m\|_1 - \|P_m(H_r - G_m)\|_1 \leq \|(I - P_m)(H_r - G_m)\|_1 + \|(I - P_m)(H_r - G_r)\|_1 \longrightarrow 0,$$

where the final limit follows from the fact that $\|G_m\|_1$ and $\|G_r\|_1$ are bounded and have exponentially decaying tails.

It now follows that $G_0$ can be decomposed as a known non-linear operator acting on a collection of linear functionals of the system $H$, namely $G_0 \approx \psi(\Pi H, P_m(H - \Pi H))$, where: 1) an estimate of the FIR model $P_m(H - \Pi H)$ of the modified system $H - \Pi H$ can be obtained consistently if the conditions of Theorem 5 are satisfied; and 2) $\Pi H$ can be obtained if the conditions of Equation 23 are satisfied. In this regard, note that $\Pi H$ is linear and the range of the projection is a finite dimensional space, so we can consistently estimate the projection in a uniform manner. This establishes Theorem 6.

VI. UNCERTAIN LFT STRUCTURES

We consider several different LFT structures for which the tools developed in Theorem 1 can be applied to characterize conditions on inputs that achieve uniform consistency. To focus attention on structured perturbations we limit our attention to the real-rational space of systems endowed with the $H_2$ norm.

A. Multiplicative Uncertainty

We consider first the multiplicative uncertainty shown in Figure 2(a) with $H \in H_2$. The input-output equation is given by

$$y(t) = Hu(t) + w(t) = G(1 + \Delta)u(t) + w(t),$$

where $G \in \mathcal{G}$, $\mathcal{G}$ is a subspace as described by Equation 1, the noise $w$ is random i.i.d. Gaussian noise, and $\Delta$ is a multiplicative perturbation with bounded norm, $\|\Delta\| \leq \epsilon$, which accounts for the undermodeling of $H$. Since the perturbation is scaled by the parameter, the error bound also has a multiplicative structure. The parametrization is non-convex, as the following example shows.

g) Example: Consider the two elements $H_1 = -(1 - \Delta)G$ and $H_2 = (1 + \Delta)G$ in the setup above with $\|\Delta\| \leq \epsilon$. Then $(H_1 + H_2)/2 = \Delta G$ cannot be expressed as $(1 + \tilde\Delta)G$ for any $\|\tilde\Delta\| \leq \epsilon$.

However, if we restrict $\alpha_j \geq 0$ the decomposition turns out to be convex. In particular, the following subset of systems is convex:

$$S^+ = \Big\{ G(1 + \Delta) \;\Big|\; G = \sum_j \alpha_j G_j,\; \alpha_j \geq 0,\; \|\Delta\| \leq \epsilon \Big\}.$$

Theorem 7: A necessary condition for uniform consistency is that $\Delta$ be orthogonal to the space $\mathcal{G}$ and that the inputs satisfy the conditions of Theorem 4.

Proof: We first consider a convex subset $S^0$ of systems; specifically, constrain the parameters $\alpha_j \in [1, 2]$. Upon substitution we have

$$y(t) = \sum_{j=1}^{p} \alpha_j G_j(1+\Delta)u(t) + w(t) = \sum_{j=1}^{p} \alpha_j \psi_j(t) + \Delta \sum_{j=1}^{p} \alpha_j \psi_j(t) + w(t),$$

where we have used the commutativity of the convolution operation and substituted $\psi_j(t) = G_j u(t)$, so that $\tilde S^0 = \{\sum_j (\alpha_j G_j + \Delta\, \alpha_j G_j)\} \subset S$. Translating the set $S^0$ by modifying the variables $\tilde\alpha_j = \alpha_j - 3/2$ yields a new set $S^1$ which is both balanced and convex. We can now apply Proposition 3 and Theorem 1 to argue that $\Delta$ and $\mathcal{G}$ must be orthogonal for achieving uniform consistency, and from Theorem 4 it follows that the relevant input conditions must also be satisfied. Since there is a one-to-one mapping from the collection $S^1$ to the collection $S^0$, the same conditions are necessary for $S^0$ as well, and since $S^0 \subset S$ the necessity holds for the general case.
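As a side note, the linear-algebra step behind Equation 30 in the proof of Proposition 6 is easy to check numerically: when the stacked observability-type matrix has full column rank, a small output mismatch forces a small mismatch of the control vectors. The sketch below uses randomly generated placeholder matrices purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fixed (A, C) pair, assumed observable; B is the free control vector.
n_states, n_rows = 3, 6
A = 0.5 * rng.standard_normal((n_states, n_states))
C = rng.standard_normal((1, n_states))

# Stacked matrix [C; CA; ...; CA^(n_rows-1)] appearing in Equation 30.
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n_rows)])

B1 = rng.standard_normal((n_states, 1))
B2 = rng.standard_normal((n_states, 1))
lhs = np.linalg.norm(B2 - B1)                                  # parameter difference
rhs = np.linalg.norm(np.linalg.pinv(O)) * np.linalg.norm(O @ (B2 - B1))
print(lhs <= rhs + 1e-9)   # True when O has full column rank
```

The constant multiplying the output mismatch plays the role of $K_0$ in the proof; it depends only on the fixed system matrices, not on the data.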

Fig. 2. Different classes of linear fractional structures: (a) multiplicative uncertainty; (b) identification with structured uncertainty. The second figure illustrates a serial connection of two components, one of which is approximately known while the other is unknown and must be estimated from input-output data.

Fig. 3. (a) Block diagram of an uncertain system in a feedback loop; (b) identification error $\|\alpha - \alpha_n\|_\infty$ as a function of data length.

B. Serial Connections of Uncertain Systems

In this problem two uncertain systems $H_0$, $H_1$ are connected in series, as shown in Figure 2(b). The system $H_1$ is approximately known, i.e., its best approximation $G_1$ (in the class $\mathcal{G}^1$) is known, while $H_0$ is unknown and must be approximately identified in the model class $\mathcal{G}^0$ from data. The interpretation of the setup is that we are given a set of uncertain components connected in series, where it is not possible to isolate any single component, and the question arises as to when an approximate model for the uncertain component can be derived. We first write the input-output equation as follows:

$$y(k) = (G_0 + \Delta_0)(G_1 + \Delta_1)u(k) + w(k). \qquad (31)$$

It is clear from the setup that the unmodeled dynamics enters the equations in a non-linear, non-convex fashion. Nevertheless, necessary and sufficient conditions can still be established, although the notion of uniform consistency must be modified suitably: since the perturbation is scaled by the unknown components, we can only say that with high probability, for a large enough $n$, the estimation error is bounded in proportion to the perturbation size.

Theorem 8: A necessary and sufficient condition for uniform consistency of the above setup is that the perturbation be orthogonal to both model classes $\mathcal{G}^0$ and $\mathcal{G}^1$ and that the input satisfy the conditions of Theorem 4.

Proof: (necessity) Upon expanding Equation 31 we have

$$y(t) = G_0G_1u(t) + G_1\Delta_0 u(t) + G_0\Delta_1 u(t) + \Delta_0\Delta_1 u(t) + w(t). \qquad (32)$$

Uniform consistency of the general setup implies uniform consistency of the setup with $\Delta_1 = 0$ or $\Delta_0 = 0$. The former case reduces to an additive perturbation, for which $\Delta \perp \mathcal{G}^0$ is a necessary condition, while the latter case is a multiplicative perturbation, for which $\Delta \perp \mathcal{G}^1$ is necessary according to Theorem 7. The conditions on the input follow from Theorem 4.

(sufficiency) Comparing Equation 32 with Equation 14 we see a potential difference only in the penultimate term, $\Delta_0\Delta_1 u(t)$. Since $\Delta_0$ is perpendicular to $\mathcal{G}^0$, the product can be absorbed into a single perturbation, $\Delta_0\Delta_1 = \tilde F_0\tilde\Delta$, where $\tilde F_0$ is as described in Proposition 4 and $\tilde\Delta$ is an arbitrary bounded perturbation. The resulting expression is then consistent with the conditions of Theorem 4, and applying the instrument derived for the set $S^1$ to the general problem establishes sufficiency.

h) Example: As an instantiation of the theorem, consider the case when $G_0$ and $G_1$ are FIRs of order $m$ and $n$ respectively. Then no consistent algorithm exists when the order $m$ of $G_0$ is larger than $n$, because the unmodeled error $G_0\Delta_1$ and the class of FIR models of order $m$ are no longer orthogonal.

C. Identification of Uncertain Systems in Feedback Loops

Consider the feedback setup shown in Figure 3(a). The unknown system has both parametric and non-parametric components, with the parametric part to be identified from closed-loop data. The setup is unconventional in that the system input is also measured, so the input-output data consists of $(u, v, y)$ as illustrated in Figure 3(a). Such problems arise in the design of adaptive acoustic echo-cancellation systems that operate in a feedback loop [28], typically used to enhance the quality of speech indoors. A dither signal $u$ is employed as an input to the loudspeaker, and the loudspeaker output $v$ is recorded in addition to the microphone signal $y$. In general the high-order acoustic dynamics, coupled with significant time variation, severely limits the choice of echo-canceller order, and the feedback filter can impact the stability of the echo-cancellation system. In this light a lower-order echo-canceller is designed in the hope that robust performance can be ensured at the cost of sacrificing some nominal performance. The goal is to estimate the optimal approximation to $H = G + \Delta$ in the model class $\mathcal{G}$ from the data, and our task is to design the input signal $u$ to guarantee robust convergence to this optimal approximation. We assume that the closed-loop system is stable and that the exogenous input $u(\cdot)$, the internal signal $v(\cdot)$ and the output signal $y(\cdot)$ are all measured. Our setup consists of

$$y(t) = Hv(t) + w(t) = \sum_{j=1}^{m} \alpha_j G_j v(t) + F\Delta v(t) + w(t), \qquad (33)$$

where $w$ is i.i.d. Gaussian noise and we have decomposed $H \in \mathcal{RH}_2$ into orthogonal modeled and unmodeled subspaces.
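The following sketch illustrates how data of the form of Equation 33 can be generated and how an instrument built from the filtered dither separates the parametric part from the unmodeled dynamics; the numbers echo the example discussed below, but the closed-loop filters used to generate the internal signal are assumed stand-ins (the true $W_1$, $W_2$ are unknown in the paper's setup), so this is an illustration rather than a reproduction of the reported experiment.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)
n = 3000
alpha_true = 1.0

# Unmodeled dynamics: random impulse response, normalized to ||.||_1 = 1, then
# filtered by the all-pass F = (z^-1 - 0.3)/(1 - 0.3 z^-1) so it is orthogonal
# to the model subspace spanned by G = 1/(1 - 0.3 z^-1).
delta = rng.standard_normal(200)
delta /= np.sum(np.abs(delta))
f_delta = lfilter([-0.3, 1.0], [1.0, -0.3], delta)

# Dither input and internal loudspeaker signal v (placeholder stable loop filter).
u = rng.uniform(-0.5, 0.5, n)
v = lfilter([1.0], [1.0, -0.5], u + 0.05 * rng.standard_normal(n))

# Output y(t) = alpha * G v(t) + F Delta v(t) + w(t).
psi = lfilter([1.0], [1.0, -0.3], v)                   # model regressor G v
y = alpha_true * psi + lfilter(f_delta, [1.0], v) \
    + np.sqrt(0.1) * rng.standard_normal(n)

# Instrument: filtered dither u1 = (1 - 0.3 z^-1)^{-1} u.
u1 = lfilter([1.0], [1.0, -0.3], u)

alpha_iv = np.dot(u1, y) / np.dot(u1, psi)             # instrument-variable estimate
alpha_ls = np.dot(psi, y) / np.dot(psi, psi)           # least squares on the regressor
print("IV estimate:", alpha_iv, " LS estimate:", alpha_ls)
```

The least squares estimate regresses $y$ on the model output itself, which is correlated with the unmodeled component; the IV estimate correlates the data with the filtered dither instead, which is the mechanism formalized in Theorem 9 below.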

Now consider Equation 33. Although the closed-loop system $W$ is generally unknown (since it is an LFT of the controller with the unknown system), the fact that the controller is stabilizing implies that $W_1 = (1 - HK)^{-1}$ and $W_2 = KW_1$ are stable transfer functions and that the internal signal $v(t)$ is bounded. As the following theorem shows, a necessary and sufficient condition for uniform consistency remains essentially the same: the IV technique is chosen to decorrelate both the uncertainty and the noise, which ensures orthogonal separation between the unmodeled error and the model subspace, so that Equation 33 is exactly in the form of Equation 14 with the unmodeled dynamics orthogonal to $\mathcal{G}$.

Theorem 9: A necessary and sufficient condition for uniform consistency of the optimal restricted complexity model $G \in \mathcal{G}$ is that the input satisfy the conditions of Theorem 4. Equivalently, uniform consistency holds if and only if there exists an instrument $q^n(\cdot)$ that decorrelates the internal signal $v(t)$ and the noise $w(t)$:

$$\sum_{k=0}^{n} q^n(k)v(k) = 1, \qquad \sum_{k=0}^{n-j} q^n(k+j)v(k) \longrightarrow 0, \qquad \sum_{k=0}^{n} q^n(k)w(k) \longrightarrow 0.$$

Proof: By straightforward algebraic manipulation of the feedback loop we obtain

$$v(t) = W_1 u(t) + W_2 w(t), \qquad W_1 = (1 - HK)^{-1}, \qquad W_2 = KW_1.$$

The conditions above then imply that the filtered instrument $\tilde q^n(\cdot) = W_1 q^n(\cdot)$ decorrelates the exogenous input and the noise,

$$\sum_{k=0}^{n} \tilde q^n(k)u(k) = 1, \qquad \sum_{k=0}^{n-j} \tilde q^n(k+j)u(k) \longrightarrow 0, \qquad \sum_{k=0}^{n} \tilde q^n(k)w(k) \longrightarrow 0,$$

which is consistent with the assumptions of Theorem 4 with $u(t)$ replaced by $v(t)$. We are then left to establish the conditions on the exogenous input. Note that since $W_1$ is unknown, a convenient instrument for $v(t)$ is the filtered dither signal $u_1(t) = Gu(t)$: when $u(t)$ is an i.i.d. dither signal,

$$\frac{1}{n}\sum_{k=0}^{n} u_1(k)v(k) \geq \beta > 0, \qquad \frac{1}{n}\sum_{k=0}^{n-j} u_1(k+j)v(k) \longrightarrow 0, \qquad \frac{1}{n}\sum_{k=0}^{n} u_1(k)w(k) \longrightarrow 0.$$

i) Example: The model space is given by $\mathcal{G} = \{\alpha/(1 - 0.3z^{-1})\}$. The system $H$ is generated by summing the model for $\alpha = 1$ with the unmodeled dynamics. The unmodeled dynamics is generated from a zero-mean random Gaussian sequence of length 1000, normalized to have an $\ell_1$ norm equal to one and then filtered with the all-pass filter $F = (z^{-1} - 0.3)/(1 - 0.3z^{-1})$ so that it is orthogonal to the model subspace. A controller $K$ is chosen so that the closed-loop system is stable and the sensitivity function $W_1$ has $\ell_1$ norm smaller than one. A uniformly distributed i.i.d. dither input on the interval $[-0.5, 0.5]$ of length 300 is applied, and the measurements $v(t)$ and $y(t)$ are collected in i.i.d. Gaussian noise $w(k)$ with mean zero and variance 0.1. For the IV technique we use the filtered dither $(1 - 0.3z^{-1})^{-1}u(t)$ as the instrument in Equation 33 and identify $\alpha$. Figure 3(b) shows the parametric error plotted as a function of data length for the IV scheme: the IV technique converges, while it is well known that the least squares approach leads to a large bias. This is not surprising, since the unmodeled output is correlated with the model output; IV methods overcome this problem through suitable instruments.

Remark: We contrast the IV approach presented in this paper with the well-known MPE approach. The idea in the MPE approach is to find models that minimize the so-called prediction error, which in specific circumstances implies a preference for models whose prediction error is uncorrelated with the model output. In the presence of complex uncertainties, especially when the error arises due to undermodeling, this approach can lead to significant bias. IV methods can overcome this problem through suitable instruments, which in specific circumstances can first decorrelate the uncertainty arising from correlated sources and then proceed to cancel the stochastic, uncorrelated noise as well.

VII. CONCLUSIONS

We have introduced a new concept for system identification with mixed stochastic and deterministic uncertainties that is inspired by machine learning theory. In comparison to earlier formulations, our concept requires a stronger notion of convergence in that the objective is to obtain probabilistic uniform convergence of the model estimates to the minimum possible radius of uncertainty. Our formulation lends itself to convex analysis, leading to a description of optimal algorithms which turn out to be well-known instrument-variable methods for many of the problems. We have characterized conditions on inputs, in terms of second-order sample path properties, required to achieve the minimum radius of uncertainty. Finally, we have derived fundamental bounds and optimal algorithms for a wide variety of standard as well as non-standard problems that include special structures such as unmodeled dynamics, positive real conditions, bounded sets and linear fractional maps.

VIII. APPENDIX

(Proof of Proposition 1) We deal with the case $\gamma_0 = 0$ for simplicity; the argument for arbitrary $\gamma_0$ proceeds in a similar manner. Let $Y_n$ denote the column vector of the output signal of length $n$ and let

$$D(Y_n) = \{ h \in S : y(k) = \lambda_k(h) + w(k),\; w \in W^n,\; 1 \leq k \leq n \}$$

be the set of all $h \in S$ consistent with the input-output data and the noise set. The local diameter of uncertainty is given by

$$\mathrm{diam}(Y_n) = \sup_{h_1, h_2 \in D(Y_n)} |\lambda_0(h_1 - h_2)|.$$

From [22] it follows that the estimation error of any estimator is bounded from below by one-half of the diameter of uncertainty. Because the measurements are linear and the constraints are convex and balanced, the worst-case diameter of uncertainty is realized when the output is identically zero; for all other output realizations the worst-case uncertainty can only be smaller. Consequently, the global diameter of uncertainty over all feasible outputs $Y_n$ is bounded by $2\epsilon$. Now consider the noise set

$$W^*_n = \{ w \in \mathbb{R}^n \mid \lambda_k(h) + w(k) = 0,\; |\lambda_0(h)| \leq \epsilon,\; 1 \leq k \leq n \}.$$

We are left to establish two facts: (A) the noise set $W^*_n$ is feasible, and (B) the probability measure of $W^*_n$ is larger than that of the feasible set $W^0_n$. These two facts together establish the proposition, since we have then found an alternative convex and balanced set to $W^0_n$ that has a larger probability measure and the same worst-case global diameter of uncertainty.

To establish (A) we observe from [22] that, with a linear measurement structure and convex balanced constraints, the worst-case diameter of uncertainty is realized when the output is identically zero; by construction the set $W^*_n$ corresponds to the zero output signal and is therefore feasible. It also follows that $W^*_n$ is convex and balanced. (B) is established by the following sequence of inequalities:

$$\mathrm{Prob}\{W^0_n\} \;\leq\; \sup_{\{W^n:\ \sup_{Y_n}\mathrm{diam}(Y_n) \leq 2\epsilon\}} \mathrm{Prob}\{W^n\} \;\overset{(a)}{\leq}\; \sup_{\{W^n:\ \mathrm{diam}(Y_n = 0) \leq 2\epsilon\}} \mathrm{Prob}\{W^n\} \;\overset{(b),(c)}{\leq}\; \sup_{\{W^n:\ 0 \in W^n,\ \mathrm{diam}(Y_n = 0) \leq 2\epsilon\}} \mathrm{Prob}\{W^n\} \;\overset{(d)}{\leq}\; \mathrm{Prob}\{W^*_n\}.$$

Inequality (a) follows because restricting the output $Y_n$ on the right-hand side to the case $h = 0$ amounts to a relaxation of the constraint. Inequalities (b) and (c) follow because the worst-case diameter of uncertainty is invariant under translations of the noise set $W^n$, while the probability measure of the noise set can change: using the underlying symmetry and monotonicity of the noise distribution around zero, translating the feasible set $W^n$ so that it contains the zero element can only increase its probability measure. Finally, (d) follows by the definition of $W^*_n$, since it includes all possible noise elements for which the output is identically zero while the diameter of uncertainty is constrained to be bounded by $2\epsilon$; no other noise set containing the zero element can have a larger probability measure.

(Proof of Corollary 1) Consider an element $z \in H_2$. Let $z_0 = z/\|z\|_2$ be a unit vector aligned with $z$ and let $z_\alpha$ be any other unit vector. Denoting $\upsilon_\alpha(z) = \langle z_\alpha, z\rangle$, with $\langle\cdot,\cdot\rangle$ the inner product, we have

$$\|z\|_2 = \langle z_0, z\rangle \leq |\langle z_0 - z_\alpha, z\rangle| + |\langle z_\alpha, z\rangle| \leq \|z_0 - z_\alpha\|_2 \|z\|_2 + |\upsilon_\alpha(z)|,$$

where the first term in the last inequality follows from the Cauchy-Schwarz inequality. Letting $\alpha$ take values in a finite index set $I$ such that there is at least one $\alpha$ with $\|z_0 - z_\alpha\|_2 < 1$, we obtain

$$\|z\|_2 \;\leq\; \min_{\{\alpha:\ \|z_0 - z_\alpha\|_2 < 1\}} \frac{|\upsilon_\alpha(z)|}{1 - \|z_0 - z_\alpha\|_2}.$$

By the hypothesis of the proposition and the preceding argument there exists a noise set such that the diameter of uncertainty is smaller than $2\epsilon$. Letting $z = x_1 - x_2$, where $x_1$, $x_2$ are any two feasible elements satisfying Equation 7, the result now follows by direct substitution:

$$\|x_1 - x_2\|_2 \;\leq\; \min_{\{\alpha:\ \|z_0 - z_\alpha\|_2 < 1\}} \frac{|\upsilon_\alpha(x_1 - x_2)|}{1 - \|z_0 - z_\alpha\|_2}.$$

REFERENCES

[1] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., New York, 1995.
[2] V. Borkar and S. Mitter. Variations on a Neyman-Pearson theme. Journal of the Indian Statistical Institute, 66:292-305, 2004.
[3] P. Caines. Linear Stochastic Systems. John Wiley & Sons, New York, 1988.
[4] M. Campi. Exponentially weighted least squares identification of time-varying systems with white disturbances. IEEE Trans. on Signal Processing, 42:2906-2914, 1994.
[5] M. Campi and E. Weyer. Finite sample properties of system identification methods. IEEE Transactions on Automatic Control, 47:1329-1334, 2002.
[6] M. Dahleh, T. Theodosopoulos, and J. Tsitsiklis. The sample complexity of worst case identification of linear systems. Systems and Control Letters, 20:157-166, 1993.
[7] A. Garulli and W. Reinelt. On model error modeling in set membership identification. In Proceedings IFAC Symposium on System Identification, pages WeMD1-3, Santa Barbara (USA), 2000.
[8] L. Giarre and M. Milanese. Model quality evaluation in set-membership identification. Automatica, 33:1133-1139, 1997.
[9] G. Goodwin, M. Gevers, and B. Ninness. Quantifying the error in estimated transfer functions with application to model-order selection. IEEE Transactions on Automatic Control, 1992.
[10] G. Gu and P. Khargonekar. Linear and nonlinear algorithms for identification in H-infinity with error bounds. IEEE Transactions on Automatic Control, 37:953-963, 1992.

[11] P. Van Den Hof, P. Heuberger, and J. Bokor. Identification with generalized orthonormal basis functions: statistical analysis and error bounds. In Special Topics in Identification, Modelling and Control, pages 39-48, 1995.
[12] E. Lehmann. Testing Statistical Hypotheses. Springer-Verlag, New York, 2nd edition, 1997.
[13] L. Ljung. System Identification: Theory for the User. Prentice Hall, Englewood Cliffs, NJ, 1987.
[14] L. Ljung and Z. Yuan. Asymptotic properties of the least squares method for estimating transfer functions. IEEE Transactions on Automatic Control, pages 514-530, 1985.
[15] D. Luenberger. Optimization by Vector Space Methods. John Wiley and Sons, 1969.
[16] P. Makila. Robust identification and Galois sequences. International Journal of Control, 54:1189-1200, 1991.
[17] M. Milanese and A. Vicino. Optimal estimation theory for dynamic systems with set membership uncertainty: An overview. Automatica, 27:997-1009, 1991.
[18] M. Milanese and G. Belforte. Estimation theory and uncertainty intervals evaluation in the presence of unknown but bounded errors: Linear families of models and estimators. IEEE Transactions on Automatic Control, 27:408-414, 1982.
[19] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ, 1982.
[20] K. Poolla and A. Tikku. On the time complexity of worst-case system identification. IEEE Transactions on Automatic Control, 39:944-950, 1994.
[21] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill International Editions, 1976.
[22] J. Traub, G. Wasilkowski, and H. Wozniakowski. Information, Uncertainty, Complexity. Addison-Wesley Pub. Co., 1983.
[23] D. Tse, M. Dahleh, and J. Tsitsiklis. Optimal identification under bounded disturbances. IEEE Transactions on Automatic Control, 38:1176-1190, 1993.
[24] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[25] S. Venkatesh. Necessary conditions for identification of uncertain systems. Systems and Control Letters, 53:117-125, 2004.
[26] S. Venkatesh and M. Dahleh. Identification in the presence of unmodeled dynamics and noise. IEEE Transactions on Automatic Control, 42:1620-1635, 1997.
[27] S. Venkatesh and M. Dahleh. On system identification of complex processes with finite data. IEEE Transactions on Automatic Control, 46:235-257, 2001.
[28] S. Venkatesh. Cabin communication system without acoustic echo cancellation. US Patent #6748086, 2004.
[29] S. Venkatesh and S. Mitter. Statistical modeling and estimation with limited data. In International Symposium on Information Theory, Lausanne, Switzerland, 2002.
[30] S. Venkatesh and M. Dahleh. System identification for complex systems: Problem formulation and results. In Proceedings of the 36th IEEE Conference on Decision and Control, 1997.
[31] A. Vicino and G. Zappa. Sequential approximation of feasible parameter sets for identification with set membership uncertainty. IEEE Transactions on Automatic Control, 41:774-785, 1996.
[32] M. Vidyasagar. A Theory of Learning and Generalization. Springer-Verlag, New York, 1997.
[33] B. Wahlberg. System identification using Laguerre models. IEEE Transactions on Automatic Control, 36:551-562, 1991.
[34] B. Wahlberg and L. Ljung. Hard frequency-domain model error bounds from least-squares like identification techniques. IEEE Transactions on Automatic Control, 37:900-912, 1992.
[35] L. Wang. Persistent identification of time-varying systems. IEEE Transactions on Automatic Control, 42:66-82, 1997.
[36] L. Wang and G. Yin. Persistent identification of systems with unmodeled dynamics and exogenous disturbances. IEEE Transactions on Automatic Control, 45:1246-1256, 2000.