Identiﬁcation
Venkatesh Saligrama
Department of Electrical and Computer Engineering
Boston University, Boston MA 02215
Abstract—This paper introduces a new concept for sys
tem identiﬁcation in order to account for random and non
random(deterministic/setmembership) uncertainties. While, ran
dom/stochastic models are natural for modeling measurement
errors, nonrandom uncertainties are wellsuited for modeling
parametric and nonparametric components. The new concept
introduced is distinct from earlier concepts in many respects.
First, inspired by the concept of uniform convergence of empirical
means developed in machine learning theory, we seek a stronger
notion of convergence in that the objective is to obtain proba
bilistic uniform convergence of model estimates to the minimum
possible radius of uncertainty. Second, the formulation lends itself
to convex analysis leading to description of optimal algorithms,
which turn out to be wellknown instrumentvariable methods
for many of the problems. Third, we characterize conditions
on inputs in terms of secondorder sample path properties
required to achieve the minimum radius of uncertainty. Finally,
we present fundamental bounds and optimal algorithms for
system identiﬁcation for a wide variety of standard as well as
nonstandard problems that include special structures such as
unmodeled dynamics, positive real conditions, bounded sets and
linear fractional maps.
Keywords: complex systems, convex analysis, input design,
linear algorithms, robust identiﬁcation
I. INTRODUCTION
System identiﬁcation deals with mathematical modeling of
an unknown system in a model class from noisy data. The
model class is nonrandom, i.e., speciﬁed without a probability
distribution, while noise, which typically refers to measure
ment error is modeled as the realization of stochastic process.
The model class is deterministic and can be speciﬁed in terms
of a ﬁnitedimensional parametrization or more generally as a
nonparametric class of systems. This paper introduces a new
concept for system identiﬁcation, inspired by the idea of uni
form convergence of empirical means (UCEM) from machine
learning theory [24], [32], in order to account for both random
and nonrandom(deterministic/setmembership) uncertainties.
The main objective of our formulation is to minimize the
probability of the worstcase estimation errors, i.e., we seek
probabilistic uniform convergence of model estimates.
The question of dealing with both random and nonrandom
aspects in an identiﬁcation problem is arguably as old as sys
tem identiﬁcation/estimation theory. Indeed, the well known
This research was supported by the ONR Young Investigator Program
and Presidential Early Career Award (PECASE) N0001402100362, NSF
CAREER award ECS 0449194, and NSF Grant CCF 0430983
maximumlikelihood estimate picks the (deterministic) model
that maximizes the likelihood of the observed data. In the
minimumprediction error(MPE) rule [13], which can be
thought of as a generalization to the maximum likelihood,
one chooses models that minimize the prediction error (or in
general a loss function). The MPE scheme is quite general and
can be used in a wide variety of contexts including situations
where the real system does not belong to the chosen model
class [3], [14], [11], [33]. Nevertheless, as pointed out in [5]
it is in general difﬁcult to assess the model quality of the
estimates obtained through MPE methods with ﬁnite data. In
general there is a signiﬁcant discrepancy—when there is un
modeled error—between the estimate obtained by minimizing
an empirical cost over ﬁnite data and the asymptotic theoretical
estimate.
This latter issue of dealing with unmodeled dynamics over
ﬁnite data together with the need for controloriented models
(i.e., with guarantees on uncertainty), has motivated a number
of researchers to propose numerous nonmixed formulations.
In [9] a stochastic embedding of unmodeled dynamics is
described. Milanese [17], [18] initiated research on a purely
worstcase paradigm with stochastic noise absorbed within a
worstcase framework as uniformly bounded noise(UBN). This
research led to rapid development of worstcase algorithms
for identiﬁcation [10], [16], [23], its time complexity [20],
[6] and evaluation [8], [7] for a wide variety of problems.
Nevertheless, the UBN model can be conservative particularly
in accounting for stochastic noise. This has motivated various
researchers [30], [27] to consider additional constraints on
samplepaths of worstcase noise processes. There has also
been related research on explicitly modeling feasible sets
through ellipsoidal as well as polytopic constraints [34], [31].
Although, such constraints do lead to desirable behavior in
many cases, it is in general difﬁcult to explicitly model both
stochastic noise and unmodeled dynamics within a purely
worstcase setting. Consequently, while deﬁciencies in the
MPE setup require stronger convergence notions, the purely
worstcase approach can either be conservative or require
extensive modeling in complex situations.
Motivated by these concerns, a mixed approach has been
discussed in [30], [26], [27], where the approach has been
to attempt a separation between unmodeled dynamics and
noise and seek models that minimize the distance between
the model class and the underlying system. In parallel mo
2
tivated by persistent identiﬁcation, Wang [36] has proposed
a new concept for mixed identiﬁcation, wherein the idea is
to pick model estimates so that the worstcase probability
of “mismatch” error being larger than some threshold is
minimized. Furthermore, a new notion of sample complexity
is developed wherein the idea is to ﬁnd the smallest time
by which a conﬁdence level for the probabilistic objective
can be reached. These ideas resemble notions of minimax
risk functions employed in the statistical and estimation lit
erature [12]. There the idea is to choose parameters that
minimize the worstcase (over all parameters) the expected
value of the estimation error. Nevertheless, this problem (in
terms of ﬁnding optimal algorithms, optimal inputs, minimum
sample complexity etc.) is in general intractable in both the
estimation and identiﬁcation contexts. Wang [36] addresses
this issue by deriving upper and lower bounds for a number
of different cases. The upper bounds are usually obtained by
using leastsquares algorithms for periodic inputs while the
lower bounds are obtained by considering the noiseless case.
Although, these bounds can sometimes be suitable, they can be
overly conservative for situations where Least Squares (LS) is
not a convergent algorithm. This is particularly the case when
there is a separation between model, unmodeled dynamics and
noise as was shown in [27]. In summary, although there are a
number of formulations that have been proposed for dealing
with mixed scenarios, it is generally difﬁcult to characterize
optimal algorithms, inputs and fundamental bounds.
To address these issues we propose a new formulation for
mixed components in Section II. Preliminary work along these
lines has appeared in our earlier publications [25], where
consistent identiﬁcation of FIR models was described. This
paper generalizes that formulation and deals with a wide vari
ety of new problems. This formulation is inspired by notions
of uniform convergence of empirical means and approximate
learning in machine learning theory. Our objective is to seek
models/estimates that minimize the probability of the worst
case errors, i.e., we seek uniform convergence in probability
of the estimation error. Although, this notion is stronger than
the earlier proposed formulations, we show in Section III, that
it is also tractable through convex analysis particularly for
problems deﬁned by linear observations, convex subsets of
systems and additive random noise and under the
∞
norm.
This can be extended to other norms by approximating in a
ﬁnite family of linear cost functions. Speciﬁcally, we show
that optimal algorithms can be associated with hyperplane
separation of convex sets, which can further be related to
instrumentvariable techniques. In Section IV the paper de
scribes new insights for parameter estimation in stochastic
noise. In particular, it is well known that a persistency of
excitation condition is required to ensure consistency. By
means of our mixed formulation we show that this condition is
also necessary for achieving uniform consistency. This section
also develops solution techniques for problems with convex
parametric constraints. Section V then develops algorithms for
optimal identiﬁcation in the presence of unmodeled dynamics
and noise. The analysis is then extended to problems to cover
situations where the desired solution can be expressed as a
nonlinear function of a linear operator. In particular, this
strategy is employed to show that uniform convergence of
parametric error to zero can be established in the presence
of noise and unmodeled dynamics for stable systems in an
1
space. In Section VI we show how this approach also leads to
solutions to problems described by LFT structures. Another
aspect of the paper is in characterizing inputs required for
system identiﬁcation. We derive conditions on input sequences
in terms of their second order sample path properties.
a) Notation:: A
∗
denotes the complexconjugate trans
pose of the matrix A. Z
+
is the set of positive integers.
denotes real valued sequence on positive integers.
p
, p ≥ 1
denotes the space of sequences on Z
+
bounded in the
p
norm (in [15] these concepts are deﬁned on the space of
natural numbers but we only consider sequences over positive
integers and denote them as
p
with an abuse in notation).
For a signal, x ∈
p
, P
m
denotes the truncation operator:
P
m
(x) = (x(0), . . . , x(m−1), 0, . . .), X
n
is the column vec
tor (x(0), x(1), . . . , x(n−1))
∗
. x
p
denotes the
p
norm of
the discrete time signal x(). A
p,q
= sup
x∈p
Ax
q
/x
p
denotes the induced norm of operator or matrix.
The npoint autocorrelation, r
n
x
(), of a signal, x, is
r
n
x
(τ) =
n−τ
i=0
x(τ +i)x(i); τ ∈ [0, 1, . . . , n]
The npoint crosscorrelation between two signals
r
n
xy
(τ) =
n−τ
i=0
x(τ +i)y(i)
N(m, σ) denotes the gaussian distribution with mean m and
standard deviation σ. Prob¦A¦ denotes the probability of an
event A and E¦X¦ denotes the expected value of the random
variable X. Finally z
−1
is the ztransform variable and corre
sponds to the usual shift operation. LTI refers to linear time
invariant systems and BIBO refers to boundedinputbounded
output systems, i.e., the space of systems deﬁned by the disc
algebra of analytic functions bounded in the closed unit disc.
RH
2
refers to stable realrational LTI systems with bounded
squared norm. The impulse response of an LTI system, H, is
represented by ¦h(k)¦
k∈Z
+. The output response, y(t), to an
input u(t) is denoted by y(t) = Hu(t) = (h u)(t), where
denotes the convolution operation. For a realrational LTI
system, H, we often associate a minimal realization in state
space:
H ≡
_
A B
C D
_
(1)
II. PROBLEM FORMULATION
In this section we formulate a mixed stochastic
deterministic problem. Mathematically, we are given a se
quence of realvalued observations, y(k), which are assumed
to be generated according to an underlying model. The model
3
is uncertain in that it has both random and nonrandom aspects
that provide information on how the sequence of observations
are generated. More precisely, let
y(k) = F
k
(θ, ∆, w), k = 0, 1, . . . , n −1 (2)
where, F
k
is a time dependent realvalued regressor; w() is
a random noise process; θ, ∆ are nonrandom components
with the tuple (θ, ∆) ∈ o, where o is a subset of a separable
normed linear space H. The ﬁnite dimensional component, θ,
is to be estimated by means of an algorithm,
ˆ
θ
n
(y) from a
sequence of n observations. We denote by
ˆ
θ (with an abuse
of notation) the inﬁnite sequence of algorithms, i.e., ¦
ˆ
θ
n
¦. To
clarify the notation, consider an LTI example with a 1tap FIR
with unmodeled dynamics, i.e.
y(t) = θu(t) +
t
k=0
δ(t −k)u(k) +w(t)
≡ F
t
u
(θ, ∆, w); (θ, ∆) ∈ o = IR
1
; u() ∈
∞
where, δ(t) is the impulse response of the unmodeled compo
nent, ∆.
Remark: In general, note that the variables θ and ∆, may
not be independent, i.e., the set of values θ and ∆ can be
jointly speciﬁed by the set o ⊂ H. This models situations
involving optimal approximations, i.e., θ characterizes the best
approximation in the model class, while ∆ denotes the residual
error.
We now introduce the mixed problem. Suppose,  
is a suitable norm on IR
m
. We denote by α(
ˆ
θ
n
, γ, o) the
probability that the worstcase estimation error is larger than
γ for n observations, i.e.,
α(
ˆ
θ
n
, γ, o) = Prob
_
sup
(θ,∆)∈o
θ −
ˆ
θ
n
(y) ≥ γ
_
where the probability is taken with respect to the noise
distribution. Observe that the set, o is a separable space and
so measurability is preserved under a supremum [1]. Note
that the problem is meaningful even when the perturbation,
∆, does not exist. Indeed, this situation is analogous to the
concept of uniform convergence of empirical means. For the
simplest situation in learning theory, we are given a family of
functions, f(x) ∈ T. N i.i.d. samples of ¦x
i
¦ are drawn from
a distribution deﬁned by P and the problem is to show that,
Prob
_
sup
f∈F
¸
¸
¸
¸
¸
E
P
(f(x)) −
1
N
N
i=1
f(x
i
)
¸
¸
¸
¸
¸
≥ γ
_
−→0
Extensions to this problem include consideration of general
loss functions and function approximation with lower com
plexity function classes. The principle difﬁculty in applying
learning theory to system identiﬁcation is that the solutions
there depend on observations being independent and do not
deal with dynamical scenarios.
Returning to our context, we deﬁne the asymptotic radius of
uncertainty, R(
ˆ
θ, o), for the sequence of algorithms
ˆ
θ as:
R(
ˆ
θ, o) = inf
_
γ ∈ IR [ limsup
n
α(
ˆ
θ
n
, γ, o) = 0.
_
This leads to the notion of consistency:
Deﬁnition 1: We say that the sequence of algorithms
ˆ
θ is
uniformly consistent if the radius of uncertainty is equal to
zero.
The minimum radius of uncertainty is given by:
R
0
(o) = inf
ˆ
θ
R(
ˆ
θ, o)
Finally, we deﬁne sample complexity and optimality of an
algorithm:
Deﬁnition 2: The sample complexity of the algorithm
ˆ
θ is
the smallest number of observations, L(, ξ,
ˆ
θ, o), such that
the probability that the worstcase estimation error is larger
than γ = R
0
(o) +, is at most α, i.e.,
L(, ξ,
ˆ
θ, o) = min
_
n ∈ Z
+
[ α(
ˆ
θ
n
, γ, o) ≤ ξ
_
An algorithm,
ˆ
θ
0
is said to be optimal if its sample complexity
is smaller than any other algorithm, i.e.,
L(, ξ,
ˆ
θ
0
, o) ≤ L(, ξ,
ˆ
θ, o), ∀ ξ > 0; > 0
These deﬁnitions are related to the mixed formulations pro
posed in [36] with important differences. Wang et. al. [36]
consider a weaker version in that their formulation is related
(but not exactly
1
) to the worstcase probability of error, i.e.,
their notion is related to the lefthandside, while our notion
is related to the righthandside of the following inequality.
sup
(θ,∆)∈S
Prob
_
θ −
ˆ
θ
n
(y) ≥ γ
_
≤ Prob
_
sup
(θ,∆)∈S
θ −
ˆ
θ
n
(y) ≥ γ
_
The two problems can be signiﬁcantly different as the
following examples indicate:
b) Example: Let X be a binary number taking values
in the set ¦−1, 1¦. Suppose W is a binary random variable
taking values in the set ¦−1, 1¦ each of which is equally
likely. We immediately see that max
X
Prob¦WX ≥ 1/2¦ =
1/2; Prob¦max
X
(WX) > 1/2¦ = 1
c) Example: Suppose, Y = θ +W with θ ∈ [−1, 1] and
W uniformly distributed in [−1, 1]. It follows that, for a linear
algorithm,
ˆ
θ(Y ) = αY ,
min
α
Prob¦max
θ
[αY −θ[ > 1/2¦ ≥ 1/2
where we have used the fact that the minimum value is equal
to min
α
Prob¦max
θ
[(1−α)θ+αW[ > 1/2¦. The lower bound
now follows by choosing, θ = W. It turns out through detailed
analysis that the optimal solution for the weaker notion is given
by α = 3/4, and we get,
min
α
max
θ
Prob¦[αY −θ[ ≥ 1/2¦ ≤ 1/3
Although, the weaker notion may sufﬁce in many cases,
the main problem is that it is difﬁcult to analyze exactly
(as seen even for the above example). This issue is well
1
[36] considers simpler structures where the model and unmodeled dynam
ics are decoupled
4
known in statistical estimation [12] and recently the authors
have analyzed the weaker notion [2], [29] in the context of
composite hypothesis testing, and except for simple situations
the problem is generally intractable. In [36] this issue is
addressed by deriving upper and lower bounds, where the
upper bound is computed with a least squares algorithm and
periodic inputs. The lower bound is computed by considering
the noiseless case. Nevertheless, for a large number of prob
lems especially when there is a decomposition between model
and unmodeled dynamics the upper and lower bounds are not
tight. Indeed, [27] provides examples wherein neither periodic
inputs nor LS algorithms lead to consistency, while there exist
specialized inputs and linear algorithms that do result not
only in consistency but also polynomial samplecomplexity.
One advantage of our formulation is that it lends itself to
exact analysis through application of convexity theory and by
extension design of optimal inputs and algorithms, particularly
for problems deﬁned by linear observations, convex subsets of
systems and additive random noise for
∞
norm, as illustrated
in the sequel. We also demonstrate extensions of this approach
to other norms by approximating in a ﬁnite family of linear
cost functions.
III. OPTIMAL ALGORITHMS FOR CONVEX PROBLEMS
In this section we state and prove a general theorem, which
is applied in subsequent sections. The main result here is
that under some technical conditions linear algorithms have
optimal sample complexity. The theorem is a generalization
of Smolyak’s theorem [22] to observations with random noise.
Consider a linear space, H, and a convex and balanced subset,
o ⊂ H. Suppose, elements of the set, o are observed through
a linear measurement structure with additive random noise,
w(), i.e.,
y(k) = λ
k
(h) +w(k), k = 1, . . . , n, h ∈ o (3)
where, λ
k
, is a linear functional on the space, H. Thus the
observations are linear over H. The task is to compute a
linear function Φ(h), where Φ maps h to IR
p
, i.e., Φ(h) =
(φ
1
(h), . . . , φ
p
(h))
T
. The linear function Φ(h) is often re
ferred to as the solution operator.
d) Example: The measurement structure includes
ARMA models as well. This is because, if ψ(k) is the
sequence of regressors formed for an ARMA structure (see
[13] for details) then,
y(k) = ψ
T
(k)θ +w(k)
forms a linear measurement structure on the variable θ.
For the setup described by Equation 3 we have the following
theorem.
Theorem 1: Suppose w() is i.i.d. Gaussian noise such that
w(k) ∼ ^(0, σ
2
). Then linear algorithms are optimal when
the norm measure on Φ(h) is the
∞
norm.
Remark: In general the result holds for more general
noise distributions, which are symmetric and monotone around
zero. These properties can be used to argue the equivalence
between the probabilistic problem and an appropriate worst
case problem. In this paper we limit our attention to gaussian
processes (the non i.i.d. case is established in the next section)
for the sake of simplicity.
Remark: Although we have assumed Gaussian noise the
theorem holds for other noise distributions as well. The main
property required is that the noise distribution be symmetric
and monotonic around zero. For instance, the uniform distri
bution satisﬁes these properties as well.
Proof: If the radius of uncertainty is inﬁnite then linear
algorithms are no worse than any other algorithm. Therefore,
suppose γ
0
= R
0
(o) < ∞. For simplicity, consider estimation
of a single linear functional, φ(h). Suppose, the sequence of
algorithms,
ˆ
φ = ¦
ˆ
φ
n
(y)¦, achieves the sample complexity,
L(, ξ,
ˆ
φ, o). This implies,
Prob
_
sup
h∈o
[φ(h) −
ˆ
φ
n
(y)[ ≥ γ
0
+
_
≤ ξ, ∀ n ≥ L(, ξ,
ˆ
φ, o)
(4)
We are left to establish that there is a linear algorithm that
achieves the same performance. The proof of optimality of
the linear algorithm would follow because
ˆ
φ
n
(y) is arbitrary
and can be replaced by an optimal algorithm.
Now, Equation 4 is equivalent to existence of a set / ⊂ IR
n
with Prob¦/¦ ≤ ξ such that,
sup
w∈IR
n
−/
sup
h∈o
[φ(h) −
ˆ
φ
n
(y)[ ≤ γ
0
+ (5)
where, w, is now a worstcase noise signal for which the error
is uniformly bounded by γ
0
+. Next we establish that the set
J
n
− / can be chosen to be convex without impacting the
above inequality. We state this in the form of a proposition
below and the details can be found in the appendix,
Proposition 1: If Equation 5 is satisﬁed then it follows
that there exists a convex balanced subset, J
n
0
⊂ IR
n
, with
Prob¦J
n
0
¦ ≥ 1 −ξ such that
sup
w∈J
n
0
sup
h∈o
[φ(h) −
ˆ
φ
n
(y)[ ≤ γ
0
+ (6)
We are now left to prove that a linear algorithm is an
alternative optimal estimator as well. For this we follow the
proof of Smolyak’s theorem [22]. The main idea is illustrated
in Figure 1.
Consider the following set of IR
n+1
:
V = {(φ(h), λ1(h) +w(1), . . . , λn(h) +w(n))  w ∈ W
n
0
, h ∈ S}
From Equation 6, it follows that (γ + , 0, . . . , 0) is not
contained in the interior of the set 1. This implies the existence
of a hyperplane such that,
c
0
(φ(h) −(γ +)) +
n
j=0
c
j
y
j
≤ 0, ∀h ∈ o, w ∈ J
n
0
Now, the linear functionals, ¦λ
1
() + e
T
1
, . . . , λ
n
() + e
T
n
¦,
where, e
j
, is a column vector of length n with jth component
equal to one and all other components equal to zero, are
linearly independent. This in turn implies that c
0
,= 0. Now
5
Fig. 1. Illustration of Convex Separation; X and Y axes denote range of
measurement and parametric values respectively. With high probability they
lie in a bounded convex set.
by using the fact that 1 is convex and balanced we see that
(−γ − , 0 . . . , 0) is also another boundary point. Therefore
we obtain the complementary inequality, i.e.,
c
0
(φ(h) + (γ +)) +
n
j=0
c
j
y
j
≥ 0, ∀h ∈ o, w ∈ J
n
0
Consequently, for every , ξ > 0, it follows that there are a
sequence of coefﬁcients ¦q
n
(k) = −c
k
/c
0
¦
n
k=0
not all zero
such that,
sup
w∈J
n
0
sup
h∈o
¸
¸
¸
¸
¸
n
k=0
q
n
(k)y(k) −φ(h)
¸
¸
¸
¸
¸
≤ γ
0
+, ∀ n ≥ N(, ξ)
thus, establishing result for a single linear functional. To
generalize it to the linear function, Φ(h), we note that,
Φ(h) −
ˆ
Φ(h)
∞
= max
j
¸
¸
¸φ
j
(h) −
ˆ
φ
n
j
(y)
¸
¸
¸
≥
¸
¸
¸
¸
¸
n
k=0
q
n
j
(k, φ
l
)y(k) −φ
j
(h)
¸
¸
¸
¸
¸
where, q
n
j
(k, φ
l
) are the optimal linear weights corresponding
to the linear functionals φ
l
(h).
A. Extensions
We can generalize the above theorem to squared error norms
as well. Although, it is possible to bound the squared norm
with the inﬁnity norm this can be conservative (for example,
for a vector x ∈ IR
p
we have x
∞
≤
√
px
2
). Instead,
our idea is to express squared norm value as the maximum
over all unit magnitude linear functionals—the existence of
which is guaranteed through duality—and then ﬁnd estimates
for each of these linear functionals. Speciﬁcally, let x ∈ IR
p
endowed with the squared norm metric and let υ() be a linear
functional on IR
p
. Therefore, υ() can be associated with a
vector (υ
1
, . . . , υ
p
) ∈ IR
p
. We then have,
max
υ2≤1
υ(x) = x
2
Now to draw a direct correspondence with the theorem we
proceed as follows. We consider the problem of estimating the
linear function Φ(h) = (φ
1
(h), φ
2
(h), . . . , φ
p
(h)), with an
algorithm
ˆ
Φ
n
() with the squared norm as the error measure.
For ease of exposition we let x
j
= φ
j
(h). We then select
linear functionals,
υ(x) =
p
k=1
υ
k
x
k
Corresponding to each choice of υ() ∈ IR
p
we have an
optimal linear algorithm, q
n
υ
(k) with the minimum radius of
uncertainty, γ
0
+ (with respect to the squared norm), i.e.,
¸
¸
¸
¸
¸
k
q
n
υ
(k)y(k) −
p
k=1
υ
k
x
k
¸
¸
¸
¸
¸
≤ γ
0
+ (7)
Clearly, υ() consists of an inﬁnite family and needs dis
cretization. Let υ
k
be such a discretization with k belonging
to some ﬁnite index set. We then obtain a ﬁnite set of
linear inequalities. A feasible point, (x
1
, . . . , x
p
) = Φ(h) =
(φ
1
(h), . . . , φ
p
(h)) in this ﬁnite set forms a good estimate
for the squared norm. Speciﬁcally, let υ
k
= (υ
k
1
, . . . , υ
k
p
) ∈
IR
p
, k ∈ 1, υ
k

2
= 1. Suppose, the minimum angle of
separation for the collection of vectors is smaller than β <
π/4. Consider a feasible point, x belonging to the set deﬁned
by Equation 7 with υ ∈ ¦υ
k
¦
k∈I
and let
ˆ
Φ
n
(y) = x be a
nonlinear estimator that maps the observations to a feasible
point. It then follows that,
Corollary 1: If there exists an algorithm that achieves a
radius of uncertainty γ
0
+ for a data length n with probability
α then:
Φ(h) −
ˆ
Φ
n
(y)
2
≤
γ
0
+
1 −β
Proof: See the appendix.
These theorems are readily generalized to stationary gaus
sian noise setting with bounded power spectral density.
Corollary 2: Let, Y
n
= Λ
n
(h)+Q
n
W
n
, where Y
n
, W
n
are
column vectors of length n and with Q
n

2
≤ 1 and Λ
n
=
(λ
1
(), λ
2
(), . . . , λ
n
()) is a sequence of linear functionals.
Then a linear algorithm achieves the minimum diameter of
uncertainty for estimating a linear function Φ(h) mapping Hto
IR
p
from data under the
∞
and squared norm error measures.
Remark: Note that the results do not provide an explicit
expression for computing the minimum radius of uncertainty
(and indeed this problem is . However, they do provide a basis
for determining this uncertainty through linear algorithms.
IV. PARAMETER ESTIMATION REVISITED
In this section we provide a complementary perspective
on parameter estimation in random noise. The motivation
for revisiting this well studied topic is threefold: (1) First,
this topic has not been studied under the new criterion of
uniform convergence. In the predictionerror paradigm [13],
which uniﬁes many of the results in stochastic identiﬁcation,
the objective is to analyze prediction errors, i.e., V (θ) =
1
n
n
t=1
E(y(t) − ˆ y(t/θ))
2
as a function of the predictor.
6
Our objective differs in two ways: (a) It directly deals with
the estimation error, i.e., θ −
ˆ
θ
n
(y) (b) we seek uniform
convergence, i.e., Prob¦sup
θ
θ−
ˆ
θ
n
(y)¦. (2) It turns out that
under this new criterion the well known input requirements
become transparent and can be shown to be both necessary
as well as sufﬁcient for achieving consistency. We point out
that most results that exist in the literature in this connection
are sufﬁcient conditions and necessity is not easy to establish
in many cases. (3) We can describe tight lower bounds for
error for ﬁnite data without the need for a consistent estimator,
which is required typically for CramerRao bounds. (4) Finally,
a new aspect is that optimal algorithms for identiﬁcation
problems with convex parametric constraints, which model
parametric dependencies, can also be developed.
To describe our problem we need the notions of persistent
input and persistency of excitation. A persistent input is an
∞
sequence with unbounded energy, i.e.,
Deﬁnition 3: An
∞
bounded sequence,
(x(0), x(1), . . . , x(k), . . .), is said to be persistent if
liminf
n
P
n
x
2
−→∞
We point out that this deﬁnition is weaker than the
conventional notion, where the input signal is required
to be quasistationary and have ﬁnite power, i.e.,
lim
n→∞
1
n
n
t=0
x
2
(t) = C. We next describe persistency of
excitation, which also is weaker than the conventional notion.
Deﬁnition 4: An input u() is said to be persistently excit
ing of order, m, if there exists m instruments (discretetime
signals), v
1
(), v
2
(), . . . , v
m
() such that
1) The input u() is persistent.
2) The mm matrix, Σ
n
, with [Σ
n
]
jk
= r
n
vju
(k) satisﬁes,
Σ
n
≥ ρI
m
, for all n ≥ N
0
, for some N
0
∈ Z
+
and
ρ > 0.
We ﬁrst prove the result for FIR parameterizations, i.e.,
y(t) =
p−1
k=0
h(k)u(t −k) +w(t)
We assume that w(t) is i.i.d. gaussian noise. The following
result follows:
Theorem 2: For the setup above the radius of uncertainty
is zero if and only if the input u() is persistently exciting of
order p.
Proof: The sufﬁciency part of the result can be derived
along the lines of [13] and is omitted. The necessity part to
the best of author’s knowledge is new. Since norms on ﬁnite
dimensional spaces are all equivalent we employ the
∞
norm
here. We are then given that there exists an algorithm, h
n
k
(y)
such that,
Prob
_
sup
h∈IR
p
h(k) −
ˆ
h
n
k
(y)
∞
_
−→0
To apply Theorem 1 we need to deﬁne λ
i
s ﬁrst:
λ
j
(h) =
p−1
k=0
h(k)u(j −k), j = 1, 2, . . . , n
With this association we have a linear observation structure
and since the objective is to estimate the FIR taps, Φ() in
this case is the identity operator. Therefore, the assumptions
of Theorem 1 are satisﬁed. Consequently, based on our hy
pothesis, there exists a linear algorithm that achieves uniform
consistency as well. This implies existence of an instrument
q
n
k
corresponding to each coefﬁcient h(k) so:
[
n
j=0
q
n
k
(j)w(j) +
n
j=0
(q
n
k
(j)u(j −k) −1)h(k) (8)
+
p−1
j=0, j=k
n
l=0
q
n
k
(l)u(l −j)h(j)[
N→∞
−→ 0
Since, h ∈ IR
p
is arbitrary it follows that,
n
l=0
q
n
k
(l)u(l −k) = 1;
n
l=0
q
n
k
(l)u(l −j) = 0, ∀ j ,= k (9)
and
Prob
_
_
_
¸
¸
¸
¸
¸
¸
n
j=0
q
n
k
(j)w(j)
¸
¸
¸
¸
¸
¸
≥
_
_
_
N→∞
−→ 0
Since, w() is an i.i.d. gaussian random process,
n
j=0
q
n
k
(j)w(j) is gaussian with mean zero and variance
equal to q
n
k

2
. Therefore, we need
limsup
n
q
n
k

2
→0
Now from Equation 9 we have from CauchySchwartz inequal
ity,
1 =
¸
¸
¸
¸
¸
n
l=0
q
n
k
(l)u(l −k)
¸
¸
¸
¸
¸
≤ q
n
k

2
_
n
l=0
u
2
(l −k)
_
1/2
This implies that (P
n
−P
−1
)u
2
→∞. Therefore, the input
must be persistently exciting and from Equation 9 it follows
that the persistency of excitation must be of order p as stated.
The proof is readily extended to the more general case with
linear parameterizations, i.e.,
Corollary 3: Let the inputoutput equation be given by
y(t) = ψ
T
(t)θ +w(t), θ ∈ IR
p
(10)
where, ψ
T
(t) = [y(t−1), y(t−2), . . . , y(t−m
1
), u(t), u(t−
1), . . . , u(t −m
2
)] is the usual regression vector formed from
previous inputs and outputs, where m
1
+m
2
= p. The radius
of uncertainty is equal to zero if and only if the input is
persistently exciting of order p.
These ideas can also be extended to more general convex
parameterizations which account for parametric dependencies.
We consider ﬁnitely many parametric constraints modeled as:
[Zθ[ ≤ b (11)
where, Z ∈ IR
l×p
. Note that the parametric set is balanced.
We describe some relevant situations where such convex
constraints naturally arise. Our ﬁrst example is the case of
7
a positive real condition, which can be modeled by convex
constraints:
k
θ
k
cos(ωk) ≥ 0, ω ∈ [−π, π]
To utilize these constraints we can discretize the inﬁnite
system of inequalities and through translation consider convex
and balanced subsets of the parameter space. Another example
where these results are applicable is when the set of parameters
belong to a linear manifold, i.e., Zθ = b. We have the
following corollary for convex parameter constraints:
Corollary 4: Consider the setup of Equations 10, 11. Fur
thermore, let Ψ
n
= [ψ(0), ψ(1), . . . ψ(n)]
T
be the data
regression matrix. It follows that for a data length n the radius
of uncertainty (in the
∞
norm) is smaller than δ if and only
if there exists matrices Q ∈ IR
p×n
and Γ such that,
QΨ
n
+ ΓZ = I, [Γb[ +Q
2,∞
≤ δ, Γ ≥ 0 (12)
where, all of the inequalities are to be interpreted pointwise.
Proof: The proof of the result follows from convex
duality and in particular a straightforward application of Farkas
lemma, which is stated here for the sake of completion [19]:
Lemma 1 (Farkas Lemma): Consider a set of linear func
tions, ¦µ
j
¦ on the parameter θ. Then it follows that,
_
µ
T
0
θ −b
0
> 0 whenever µ
T
i
θ −b
i
> 0, i = 1, 2, . . . , m
_
¸
µ
0
=
m
k=1
γ
k
µ
k
,
m
k=1
γ
k
b
k
≤ b
0
, γ
k
≥ 0
We now return to the proof of Corollary 4. First, note that
the problem setup conforms to the convex setup. Indeed, the
inputoutput data is given by a linear measurement structure,
the parameter set is convex and balanced. The desired solution,
Φ(θ) = θ, is linear. Therefore, it follows similar to the
arguments used in the preceding theorems that if the radius
of uncertainty is smaller than δ, then there must exist a matrix
Q and an integer n, for a given > 0, ξ > 0, such that
[(QΨ
n
−I)θ +Qw[ ≤ δ +, ∀ [Zθ[ ≤ b; w ∈ J
n
0
where, J
n
0
is the same convex set with probability mass larger
than 1 −ξ as described in Theorem 1. The proof now follows
by a direct application of Farkas lemma.
The solution consists of ﬁnding a Q that satisﬁes Equation 12,
which can solved through convex programming. An estimate
of θ is then given by Qy, where y is the observation vector
of length n.
Remark: Observe that when the constraints are given by
Zθ = 0, we should expect that the persistency of excitation
condition can be relaxed. This is indeed the case as seen from
Equation 12. This is because in this case, in the dual setting,
the positive constraint Γ ≥ 0 no longer exists. Therefore, we
are only left to satisfy QΦ
n
+ ΓZ = I for an arbitrary Γ. It
follows that if Z has rank m then Φ
n
need only have rank
p −m to ensure that the equation can be satisﬁed. Therefore,
the input needs to be persistently exciting of a smaller order
than the parametric dimension.
Our next task is to comment on the issues of optimality and
choice of optimal inputs. From Theorem 1 it follows that linear
algorithms are optimal for the
∞
norm. From the results in
Corollary 4 we have for a convex parametric set o that,
Prob
_
max
θ∈S
θ −θ
n
(y)∞ ≥
_
≥ Prob {Qw∞ ≥ −Γb∞}
= erfc
_
−Γb∞
Q2,∞
_
where, erfc(), denotes the gaussian error function [12]
and we have assumed that radius of uncertainty is equal to
zero (for simplicity of exposition). These results in turn help
in loosely characterizing inputs that would lead to optimal
identiﬁcation. This would follow by ﬁrst minimizing, Q
2,∞
subject to constraints of Equation 12. The optimal cost of this
optimization problem would provide an index J(Φ
n
) which
is a function of the regressor Φ
n
. One can then attempt to
minimize the index by choosing different inputs.
A. Uniform Bounded Noise
Although, the UBN setup does not strictly fall within
the framework described in Section II the proof technique
of Theorem 1 can be used to establish that the radius of
uncertainty is bounded away from zero. Let,
y(t) =
p−1
k=0
h(k)u(t −k) +w(t), w(t) ∈ [−γ, γ], [u()[ ≤ 1
(13)
In the context of UBN we are interested in an estimate
ˆ
h
n
(y)
based on inputoutput data of length n so the worstcase error
over all FIRs and admissible noise realizations are minimized:
min
ˆ
h
n
→ sup
J
n
sup
h∈IR
p
h −
ˆ
h
n

1
where, J
n
= [−γ, γ]
n
is the set of all admissible noise
sequences up to time length n. Similarly, let 
n
be the set
of all unitamplitudebounded input sequences of length n.
It is well known [23] that regardless of the input, estimator
and datalength the estimation error is always bounded from
below by γ. We provide a new proof for this result based on
Theorem 1 as stated below.
Theorem 3: The minimum radius of uncertainty is larger
than γ for any algorithm,
ˆ
h
n
(y).
Proof: We note that,
sup
J
n
sup
h
h −
ˆ
h
n

1
≥ sup
J
n
sup
h
h −
ˆ
h
n

∞
Now, both J
n
and h are convex balanced sets, therefore it
follows from the steps used in the proof of Theorem 1 that
there must exist a linear algorithm that achieves optimal error
for the latter (RHS of inequality above) problem. Thus there
is a linear instrument, q
n
(), with the error estimate, E, given
8
by,
E = max
0≤m≤p−1
¸
¸
¸
¸
¸
n
t=0
q
n
m
(t)w(t) +
n
j=0
(q
n
m
(j)u(j) −1)h(0)
+
p−1
j=1
n
l=0
q
n
m
(l)u(l −j)h(j)
¸
¸
¸
¸
¸
¸
This implies that if the error, E < ∞ then,
1 =
n
j=0
q
n
m
(j)u(j) ≤ q
n
m

1
u
∞
= q
n
m

1
, ∀0 ≤ m ≤ p−1
Now, we see that,
E ≥ sup
w
¸
¸
¸
¸
¸
n
t=0
q
n
m
(t)w(t)
¸
¸
¸
¸
¸
≥ q
n
m

1
w
∞
≥ γ, ∀m
thus establishing the result.
Theorem 1 provides a situation under which the minimum
radius of uncertainty is zero in that if a small subset of
noise sequences of vanishingly small measure were rendered
inadmissible then the radius of uncertainty is equal to zero.
This can also be seen indirectly from our results in an earlier
paper [30].
V. UNMODELED DYNAMICS WITH STOCHASTIC NOISE
In this section we deal with the problem of asymptotic
consistency of systems with mixed uncertainties consisting of
stochastic measurement noise and worstcase (deterministic)
unmodeled dynamics. The problem arises in system identi
ﬁcation whenever restricted complexity models are used for
identiﬁcation. System identiﬁcation with restricted complexity
models have been explored in [27], [8], [35] with deterministic
worstcase noise and in mixed situations in [30], [27], [36],
[4]. Here we study mixed situations and derive fundamental
conditions required to achieve uniform convergence of the
model estimates to their optimal approximations in the model
class. Our setup will deal with the following inputoutput
equation:
y(t) = Hu(t) +w(t) = G(θ)u(t) + ∆u(t) +w(t); (14)
where, u is a bounded input; H = G(θ)+∆ ∈ o is the subset
of systems; G(θ) ∈ ( is a model class of interest; w(t) is a
Guassian stochastic process. The main goal is to estimate G(θ)
from input output data in a uniform manner as described in
Section II. More speciﬁcally, we consider model classes that
are subspaces, i.e.,
( =
_
G ∈ RH
2
[ G =
m
k=1
θ
k
G
k
, G
k
∈ RH
2
, θ
k
∈ IR
m
_
(15)
In the following section we discuss issues that arise when the
perturbations ∆ are speciﬁed independent of the model class.
A. Overbounding Unmodeled Dynamics
The main goal of this section is to motivate the need
for a optimal decomposition of the system into model and
unmodeled dynamics. We do this by computing the radius of
uncertainty for identifying an FIR model class with unmodeled
dynamics. We then generalize these results to the more general
case. Suppose, (
FIR
, is the class of ﬁnite impulse response
sequences (FIR) of order m. Let,
y(t) = Hu(t) +w(t) = Gu(t) + ∆u(t) +w(t) (16)
where, H ∈
1
, G ∈ (
FIR
, ∆ ∈ ∆ and w(0), w(1), . . . is
a sequence of i.i.d. jointly gaussian random variables; inputs
are bounded, i.e., [u(k)[ ≤ 1, k ∈ Z
+
; the unmodeled class
∆ is given by
∆ = ¦∆ ∈
1
[ ∆
1
≤ γ¦ (17)
which indicates the fact that the mth order FIR account
for the underlying system within a worstcase error of size γ.
This is clearly a redundant decomposition of the underlying
system. The ﬁrst mtaps of the true system are accounted for
both by the model and the unmodeled dynamics. This results
in a minimum radius of uncertainty larger than γ.
Proposition 2: For the setup above it follows that the min
imum radius of uncertainty is greater than γ.
Proof: The convexity of the set of systems, linear
observations as well as linearity of the solution operator
Φ(h) = (h
0
, . . . , h
m−1
) can be readily veriﬁed. Now the
radius of uncertainty can only be smaller if we only consider
a smaller subset of systems, speciﬁcally, perturbations that
satisfy δ(k) = 0 for all k ≥ m. By Theorem 1 there must
exist instruments q
n
l
() that are optimal for estimating the FIR
model class. Consider estimation of the ﬁrst tap. It follows
that if the radius of uncertainty is bounded then,
n
k=0
q
n
0
(k)u(k) = 1
This implies that,
¸
¸
¸
¸
¸
n
k=0
q
n
0
(k)(G+ ∆)u(k) +
n
k=0
q
n
0
(k)w(k) −h(0)
¸
¸
¸
¸
¸
≥
¸
¸
¸
¸
¸
n
k=0
q
n
0
(k)u(k)δ(0)
¸
¸
¸
¸
¸
= γ
This proof can be generalized for other model classes as well,
which we state below without proof.
Proposition 3: Suppose the system decomposition is convex
and balanced such that for some η > 0, the model ηG is an
element of the unmodeled set ∆, then the radius of uncertainty
must be larger than η.
The main problem with these results is the apparent gap
between practical reality and the minimum radius of uncer
tainty achieved through the decomposition of Equation 16.
9
For instance, consider the case when noise is relatively small.
In this case the decomposition suggests that the radius of
uncertainty still does not go to zero. However, in practice the
ﬁrst mtaps of the impulse response of H can be identiﬁed
accurately for the noiseless case. In this light, the parametriza
tion in Equation 17 over bounds the existing inputoutput
relationship and leads to a conservative result. This issue
was acknowledged by the author in [27] and it was pointed
out that once a normed space of systems is chosen, there is
an optimal decomposition in the sense that it minimizes the
unmodeled error. Once such a decomposition is chosen the
objective would be to estimate the optimal model component.
For the FIR case, such a decomposition would lead to the
following description for the unmodeled error:
∆ = ¦z
−m
∆ [ ∆ ≤ γ¦ (18)
which is optimal for
1
and H
2
norms. This is because, this
decomposition leads to the minimum unmodeled error.
B. Identiﬁcation in Hilbert Spaces
Motivated by the discussion in the previous section we de
rive identiﬁcation results for more general model spaces. The
basic structure used in the proof (as seen from Equations 5, 6)
is that we require the set o after the decomposition to remain
convex. On a Hilbert space this convex decomposition is pre
served and we should expect Theorem 1 to apply in this setting
as well. The main problem is that all elements belonging to
unit balls in RH
2
are not necessarily stable. Therefore, there
are hardly any inputs that satisfy the requirements for uniform
consistency. Therefore, the unmodeled perturbations have to be
further constrained in some manner. These issues are discussed
in this section.
We are now left to describe the unmodeled error. Motivated
by Proposition 3 we again seek to ascribe modeled and
unmodeled components to different aspects of data. This can
be done naturally since the dual is also a Hilbert space
and orthogonal separation between modeled and unmodeled
components can be easily established. Indeed, using Cauchy’s
theorem we can do more as in the following proposition:
Proposition 4: Consider the model class given by Equa
tion 15. Then, it follows that, every realrational system, H,
can be written as:
H = G
0
+F∆ (19)
for some G
0
= G(θ
0
), a realrational all passfunction F
and an arbitrary perturbation ∆. The zeros of F
∗
are pole
locations of the G
1
, G
2
, . . . , G
m
.
Proof: The proof follows by applying Cauchy’s theo
rem [21].
e) Example: There are m poles at zero for an FIR model
space of order m, and therefore the system F is a delay of
order m. This implies that the unmodeled component is given
by the set of all stable realrational functions of the form
z
−m
∆, with ∆ being an arbitrary element in RH
2
.
As before we restrict the size of the perturbation and consider
ﬁrst the following subsets of real systems:
∆
2
= ¦∆ [ ∆
2
≤ γ¦, ∆
1
= ¦∆ [ ∆
1
≤ γ¦ (20)
As we remarked earlier the second subset is considered
since the ﬁrst (although natural) contains unstable systems.
Consequently, as it turns out it is difﬁcult to ﬁnd inputs that
lead to uniform consistency for the ﬁrst set. For notational
simplicity we deﬁne:
S2 = {H ∈ RH2  H = G+F∆, G ∈ G, ∆ ∈ ∆2} (21)
S1 = {H ∈ RH2  H = G+F∆, G ∈ G, ∆ ∈ ∆1}
We now state our main result in this section.
Theorem 4: Let ˆ u = Fu be the ﬁltered input through the all
pass ﬁlter, F and u
l
() = G
l
u(). A necessary and sufﬁcient
condition for uniform consistency over the set o
2
is that the
input be persistent and there exist a sequence of instruments,
q
n
() such that,
m
k=0
r
q
n
u
(k)g
l
(k) = r
q
n
u
l (0) = 1, l = 1, 2, . . . , m (22)
n
τ=0
¸
¸
¸r
n
q
n
ˆ u
(τ)
¸
¸
¸
2
n→∞
−→ 0; [r
n
q
n
w
(0)[
n→∞
−→ 0 in probability
On the other hand if o
1
is considered then the conditions for
uniform consistency remains the same except that the second
condition is weaker, i.e.,
max
1≤τ≤n
¸
¸
r
n
q
n
ˆ u
(τ)
¸
¸
n→∞
−→ 0 (23)
Furthermore, these conditions imply that the input must be
persistently exciting of inﬁnite order.
Proof: First we consider the set o
2
. We note that the
projection ΠH of the system H onto the subspace ( is
linear and of ﬁnite rank. Therefore, the solution operator
Φ(H) = (θ
1
, . . . , θ
m
)
T
is linear. The subset o
2
is convex and
the observation given by Equation 14 is linear. For the sake
of exposition we next consider oneparameter model class.
The general case follows in an identical manner. Now by
Theorem 1 if uniform consistency holds there must exist a
number n and an instrument, q
n
() so that, for any > 0, ξ >
0
¸
¸
¸
¸
¸
¸
n
k=0
q
n
(k)w(k) +
n
j=0
n−j
k=0
q
n
(k +j)u(k)h(j) −θ
1
¸
¸
¸
¸
¸
¸
≤ , ∀ θ
1
∈ IR
with w ∈ J
n
0
and Prob¦J
n
0
¦ ≥ 1 − ξ. As before, this
implies the condition that the instrument must decorrelate the
noise, i.e., [
n
k=0
q
n
(k)w(k)[ −→0. We are now left with an
instrument that must satisfy,
¸
¸
¸
¸
¸
¸
n
j=0
n−j
k=0
q
n
(k +j)u(k)h(j) −θ
1
¸
¸
¸
¸
¸
¸
=
¸
¸
¸
¸
¸
¸
n
j=0
n−j
k=0
q
n
(k +j)u(k)(θ
1
g
1
(j) + (f
1
δ)(j)) −θ
1
¸
¸
¸
¸
¸
¸
≤
10
where, g
1
(k), f
1
(k), δ(k) are the impulse responses for
G
1
, F
1
, ∆ respectively. Observe that we have applied Propo
sition 4 and substituted F
1
∆ for the perturbation and so F
1
and G
1
are perpendicular. By arbitrarily varying θ, ∆ we see
that,
n
j=0
r
n
q
n
u
(j)g
1
(j) = r
q
n
u
1(0) = 1; sup
∆∈∆2
¸
¸
¸
¸
¸
¸
n
j=0
r
n
q
n
ˆ u
(j)δ(j)
¸
¸
¸
¸
¸
¸
≤ ,
where, for the ﬁrst expression u
1
= G
1
u, and in the second
expression we have absorbed the ﬁlter F
1
into the input, i.e.,
ˆ u = F
1
u. The condition arising from the noise and the ﬁrst
expression together imply that the input must be persistently
exciting. This follows from the fact that,
1 =
¸
¸
¸
¸
¸
¸
n
j=0
q
n
(j)u
1
(j)
¸
¸
¸
¸
¸
¸
≤ q
n

2
G
1

H∞
P
n
u
2
where,  
H∞
is the usual H
∞
norm. But the noise condition
implies that q
n
 −→ ∞. Therefore, P
n
u −→ ∞, since
G
1
is bounded in H
∞
because it is stable real rational transfer
function. Next for the perturbation we apply CauchySchwartz
inequality to obtain:
sup
∆∈∆2
¸
¸
¸
¸
¸
¸
n
j=0
r
n
q
n
ˆ u
(j)δ(j)
¸
¸
¸
¸
¸
¸
2
= γ
¸
¸
¸
¸
¸
¸
n
j=0
(r
n
q
n
ˆ u
(j))
2
¸
¸
¸
¸
¸
¸
−→0
For the set ∆
1
we similarly have,
sup
∆∈∆1
¸
¸
¸
¸
¸
¸
n
j=0
r
n
q
n
ˆ u
(j)δ(j)
¸
¸
¸
¸
¸
¸
= γ max
j≤n
¸
¸
r
n
q
n
ˆ u
(j)
¸
¸
−→0
This establishes the necessity. We will not describe the sufﬁ
ciency conditions since the argument is similar to the argument
for sufﬁciency for FIR model classes, which is dealt in the next
section.
Remark: To see how the convex separation results in achiev
ing uniform consistency note that the conditions on the input
from the modeled part implies that G
1
u() is aligned with the
instrument q
n
(). Similarly, the unmodeled part implies that
F
1
u() is perpendicular to q
n
(). This is possible because G
1
and F
1
are orthogonal and so G
1
u and F
1
u are uncorrelated
when driven by white noise.
Remark: We point out that the conditions for o
2
are more
demanding than o
1
. For the set o
1
we only require that
the normalized worstcase crosscorrelation between input and
instrument to go to zero. This condition can easily be accom
plished for inputs that have small autocorrelation coefﬁcients,
i.e., inputs that are white in their second order properties. Then
an appropriately ﬁltered input can be selected as the instru
ment. In the RH
2
case we need a more stringent condition
since we need the sum of the squares of the normalized cross
correlation also go to zero. This is understandable on account
of the fact that RH
2
forms a larger class of perturbations than
1
.
We next discuss issues of optimality and choice of inputs.
First, notice that one could incorporate additional parametric
and nonparametric constraints as we did in Section IV and
Equation 12. For these set of problems the fact that the optimal
algorithm is linear follows from Theorem 1. The problem
now reduces to looking for inputs satisfying Equation 22.
The constraints on the input (disregarding the noise constraint)
imply that the optimal instrument is the solution to:
min(q
n
)
T
Ψ
n
Ψ
T
n
(q
n
) subject to
n
j=0
q
n
(j)u(j) = 1
where, Ψ
n
= [ψ(0), ψ(1), . . . , ψ(n)]
T
, with ψ(t) =
[ˆ u(t), ˆ u(t−1), . . . , ψ(0)]. Let, u = [u(0), . . . , u(n)]
T
. If the
value of the solution to the quadratic optimization problem has
a value greater β (this is the opposite of what we want) then:
(q
n
)
T
ΨnΨ
T
n
(q
n
) =
1
u
T
(ΨnΨ
T
n
)
−1
u
≥ β ⇐⇒
_
1/β u
T
u ΨnΨ
T
n
_
≥ 0
where, we have used Schur’s complementation to derive the
last expression above. We now argue the difﬁculty of ﬁnding
inputs that satisfy this condition by applying it to a 1tap FIR
model class. For this situation the above condition implies that,
n
j=1
(r
n
u
(j))
2
≥ β
If the input is chosen i.i.d. then the expected value of the
expression on the left is bounded away from zero. Therefore,
for a β > 0 (independent of n) the above condition is met with
probability one leading us to believe that few inputs satisfy the
opposite condition, i.e.,
n
j=1
(r
n
u
(j))
2
<< β.
On the other hand the weaker condition for ∆
1
can be
met by a number of inputs. In particular we only need to
establish that there exist inputs with small autocorrelations
with high probability. Then, by choosing an appropriately
ﬁltered input as the instrument the conditions of Equation 23
can be satisﬁed.
Lemma 2: Suppose u(t) is a discrete i.i.d. bernoulli random
process with mean zero and variance 1, then:
Prob
_
max
0<k≤n
r
n
u
(k)
∞
≥ α
_
≤ nexp(−nβ(α)) (24)
where β(α) = 1 +
1−α
2
log
2
1−α
2
+
1+α
2
log
2
1+α
2
Proof: The proof of this result can be found in [6].
In summary we have characterized conditions for uniform
consistency and optimality for Hilbert spaces, where convex
separation between model class and unmodeled dynamics
proved to be critical. In the next section we consider a situation
where such a convex separation does not exist.
C. Identiﬁcation in
1
and nonlinear functionals
We now turn to the problem of identiﬁcation in the
1
norm. In the previous section the Hilbert space structure led to
convex decomposition of the unmodeled error and model class
and Theorem 1 could be applied in a straightforward manner.
11
For FIR model classes, under the
1
norm, the decomposition
is still convex. This is due to the fact that optimal FIR
approximation in
1
and RH
2
are identical. However for
more general classes under the
1
norm the decomposition
through optimal approximations is not convex. In the next
section, we present identiﬁcation of FIR model class in the
1
norm. The results developed for FIR case will be used
to compute uniformly consistent estimators for more general
model classes.
D. FIR Model Class
In this section we derive necessary and sufﬁcient conditions
for identiﬁcation of FIR model classes in the presence of
unmodeled dynamics under the
1
norm. To this end, consider
the optimal decomposition:
o = ¦H ∈
1
[ H = G+ ∆, G ∈ (
FIR
, ∆ ∈ ∆¦ (25)
where, ∆ = ¦z
−m
∆ [ ∆
1
≤ γ¦; (
FIR
is the class of
mtap FIR sequences.
In the following theorem we derive necessary and sufﬁcient
conditions for achieving uniform consistency for estimating
the FIR coefﬁcients.
Theorem 5: The radius of uncertainty for the above setup
is equal to zero if and only if there is a persistent input and
a corresponding family of instrument variables q
n
() which
uniformly decorrelates the input and noise, i.e., a number
N(ρ) for every ρ > 0 such that for all n > N(ρ):
max
1≤τ≤n
¸
¸
r
n
q
n
u
(τ)
¸
¸
n→∞
−→ 0,
¸
¸
r
n
q
n
u
(0)
¸
¸
= 1 (26)
[r
n
q
n
w
(0)[
n→∞
−→ 0 in probability (27)
Proof: (=⇒) We observe that all norms are equivalent
on ﬁnite dimensional spaces [15]. Consequently, uniform
consistency in the
1
norm for the ﬁrst m FIR taps is
identical to uniform consistency in the
∞
norm. This im
plies by hypothesis that, the minimum radius of uncertainty
is zero for the
∞
norm as well. Therefore, given >
0, ξ > 0 there is an algorithm,
ˆ
h and a number L(, ξ,
ˆ
h, o)
such that Prob
_
sup
H∈o
P
m
(h −
ˆ
h
n
)
∞
≥
_
≤ ξ, n ≥
L(, ξ,
ˆ
G, o). The setup now mirrors Theorem 1 since the
set, o is convex; Φ(H) = (h(0), . . . , h(m − 1))
T
; w is
i.i.d. gaussian noise; measurement structure is linear, i.e.,
λ
t
(h) = Hu(t); norm measure on Φ() is the
∞
norm.
Consequently, by hypothesis, there is a linear algorithm with
weights, q
n
l
(), for each FIR tap, 0 ≤ l ≤ m, such that,
sup
w∈J
n
0
sup
H∈o
¸
¸
¸
¸
¸
n
k=0
q
n
l
(k)y(k) −h(l)
¸
¸
¸
¸
¸
≤ ; ∀ n ≥ N(, ξ)
and Prob¦J
n
0
¦ ≥ 1 −ξ Upon substitution it follows that,
[
n
k=0
q
n
l
(k)w(k) +
n
k=0
(q
n
l
(k)u(k) −1)h(j)
+
n
j=m
n−j
k=0
q
n
l
(k +j)u(k)h(j)
¸
¸
¸
n→∞
−→ 0
Next, by noting that ∆ is arbitrary with
1
norm less than γ
and w() is an arbitrary element in the convex set J
n
0
, which
also contains the zero element we have,
n
k=0
q
n
l
(k)u(k) = 1; max
1<j<n
¸
¸
¸
n−j
k=0
q
n
l
(k +j)u(k)
¸
¸
¸ ≤
[
n
k=0
q
n
l
(k)w(k)[ ≤ , w ∈ J
n
0
The proof now follows by ﬁrst observing that for the hypoth
esis to be true we must have, Prob ¦J
n
0
¦ ≥ 1 −ξ.
The persistency of the input follows along similar lines as in
Theorem 2.
(⇐=) To prove sufﬁciency we construct an IV technique
that is uniformly consistent. Basically we let, q
n
= q
n
1
and
its delayed copies up to mdelays serve as instruments. In
particular, for length n of data, we let, Q
n
be the matrix
composed of the instrument q
n
and its m delays, U
n
a
corresponding matrix of input sequence with its m delays
and R
n
the residual delays of the input sequence.
Qn =
_
¸
¸
¸
¸
¸
¸
¸
_
q
n
(0) 0 . . . 0
q
n
(1) q
n
(0) 0 . . .
.
.
.
.
.
.
.
.
. q
n
(0)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_
¸
¸
¸
¸
¸
¸
¸
_
(28)
Un =
_
¸
¸
¸
¸
¸
¸
¸
_
u(0) 0 . . . 0
u(1) u(0) 0 . . .
.
.
.
.
.
.
.
.
. u(0)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_
¸
¸
¸
¸
¸
¸
¸
_
; Rn =
_
¸
¸
¸
¸
¸
¸
¸
_
0 0 . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
u(0) 0 . . .
.
.
.
.
.
.
.
.
.
_
¸
¸
¸
¸
¸
¸
¸
_
Suppose, g is the column vector composed of the ﬁrst
m impulse response coefﬁcients, i.e., g = P
m
H and δ is
the column vector of impulse response of the unmodeled
dynamics, i.e., δ = (P
n
− P
m
)H. Let y, w be the output
and noise signal vectors of length N respectively. Then,
y = U
n
g +R
n
δ +w =⇒Q
T
n
y = Q
T
n
U
n
g +Q
T
n
R
n
δ +Q
T
n
w
(29)
It follows from hypothesis that Q
T
n
U
n
is an invertible matrix
with bounded inverse for large enough N since it converges
to a lower triangular matrix with identical elements on the
diagonal. Again from our hypothesis together with the fact
that that ∆ ≤ γ, it follows that, Q
T
n
R
n
δ converges to zero.
To show this, note that the kth row of Q
T
n
R
n
is given by,
(Q
T
n
Rn)
k
=
_
_
n+1−(m+k)
j=0
q
n
(m+ (k −1) +j)u(j),
n−(m+k)
j=0
q
n
(m+k +j)u(j), . . . , q
n
(m+ (k −1) +j)u(0)
_
_
Consequently, by hypothesis it follows that,
12
¸
¸
(Q
T
n
R
n
)
k
δ
¸
¸
≤
∆
1
max
τ
¸
¸
¸
¸
¸
¸
n+1−(m+k+τ)
j=0
q
n
(m+ (k −1) +j +τ)u(j)
¸
¸
¸
¸
¸
¸
≤ γ
Finally, Q
T
n
w also converges to zero in expectation by
hypothesis and noting that (w(0), w(1), . . . , w(n)) are i.i.d.
random variables.
We next discuss the issue of sample complexity. The sample
complexity of an algorithm,
ˆ
Φ(y) is said to polynomial [32]
in learning theory if
L(, ξ,
ˆ
Φ, o) ≤
_
1
_
d
log
_
1
ξ
_
for some integer d. We establish polynomial sample complex
ity of linear algorithms for the FIR model class here. First, we
choose instruments q
n
(k) = u(k)/n, where the input sequence
is a bernoulli process. It is then easy to establish that,
Prob
_
sup
H∈o
Q
T
n
y −P
m
(H)
∞
≥
_
≤
Prob
_
max
0<τ≤n
[r
n
u
(τ)[ +[r
n
uw
(τ)[
n
≥
_
≤
Prob
_
max
0<τ≤n
[r
n
u
(τ)[
n
≥
2
_
+ Prob
_
max
0<τ≤n
[r
n
uw
(τ)[
n
≥
2
_
≤
nexp(−nβ()) +nexp(−n
2
/σ
2
)
where we have taken the probability over both the input and
the noise process. The ﬁrst term in the last expression uses
Lemma 2 and the second term uses the fact that w is gaussian.
Now, it follows directly by substitution that, for some C > 0,
L(, ξ,
ˆ
Φ, o) ≤ C
_
1
_
2
log
_
1
ξ
_
implying polynomial samplecomplexity of convergence.
E. General Model Classes
In this section we derive algorithms for achieving uniform
consistency for general model classes (that are subspaces) in
the
1
norm. The main difﬁculty is that the decomposition
achieved through optimal approximation is not convex. Hence,
Theorem 1 cannot be directly applied. We proceed by ﬁrst
describing the optimal approximation as a known nonlinear
function of linear functions on the underlying system space.
Since linear functions of the system can be efﬁciently esti
mated the nonlinear function can be estimated as well.
Our setup consists of,
y(t) = Hu(t)+w(t) = Gu(t)+∆u(t)+w(t), G ∈ (, ∆ ∈ ∆
where, ( =
_
p
j=1
α
j
G
j
[ α
j
∈ IR, G ∈ RH
2
_
and ∆ =
¦∆ ≤ γ [ ∆  (
⊥
¦, where (
⊥
is the space of annihilators
of the model space (. The decomposition follows from well
known results in
1
approximation theory [15]. Noise, w()
is assumed to be i.i.d. gaussian noise as before. Our task
is to determine the model G
0
that minimizes the unmodeled
error, i.e., G
0
= argmin
G∈(
H−G. The following example
illustrates the issue of nonconvexity of
1
decompositions.
f) Example: Consider, the following convex space of
systems,
o = ¦H ∈
1
[ H
α
= αH
1
+ (1 −α)H
2
; α ∈ [0, 1]¦
where H
1
(z
−1
) = 1 and H
2
(z
−1
) = z
−1
+
_
1 −
z
−1
2
_
−1
Our model class is given by ( =
_
θ
_
1 −
z
−1
2
_
−1
[ θ ∈ IR
_
. From the alignment conditions
for optimality [15] it follows that the unique optimal
approximation, G(H
1
), for the system H
1
is G(H
1
) = 0.
Similarly, we deduce that G(H
2
) =
_
1 −
z
−1
2
_
−1
. Moreover,
G(H
α
) = G(H
2
) for all α ,= 1 is the unique optimal
approximation as well. Therefore, the set of optimal
approximants for the convex set of systems o is not convex.
Motivated by these issues we solve the problem of
1
iden
tiﬁcation indirectly. The idea is that
1
minimizer can be
arbitrarily closely approximated by a nonlinear function of
linear functions of the system impulse response, i.e,
G
0
(H) = ψ(L(H))
where, ψ() is nonlinear continuous map, which maps to the
model subspace and L is a ﬁnite rank linear operator mapping
the system H to an appropriate ﬁnite dimensional space.
Uniform consistency in
1
would then follow by estimating
the linear functions, whose consistency conditions are in turn
characterized by Theorem 5. To this end we have the following
theorem.
Theorem 6: The input characterization as in Theorem 5 is
sufﬁcient for uniform consistency in the
1
norm.
The proof of the theorem relies on a number of results that
are provided below.
Proposition 5: The optimal
1
approximation G
0
(H) for a
given system, H, can be rewritten as,
G
0
(H) = Π(H)+argmin
G∈(
H−ΠH−G = lim
m→∞
G
0
(P
m
H)
where, Π is the
2
projection on to the space (.
Proof: The proof follows from the following equality:
min
G∈(
H −G
1
= min
G∈(
H −ΠH −G
1
The limiting argument follows from the fact that P
m
H →H
and continuity of projection operation and the
1
norm.
For notational simplicity we denote the modiﬁed problem on
the RHS by G
r
, i.e.,
G
r
(H) = argmin
G∈(
H −ΠH −G
and the modiﬁed system by, H
r
,
H
r
= H −ΠH
13
The main task is to now show how to further simplify the
problem to an instantiation of Theorem 1. First, note that
ΠH is a linear and the range of the projection is a ﬁnite
dimensional space. Therefore, under the conditions derived in
Theorem 4, we can consistently estimate the projection in a
uniform manner. The problem reduces to estimating G
r
(H).
In this regard, we have the following proposition:
Proposition 6: If there is G ∈ ( such that H −G
1
≤ γ
then there is some constant L such that, H −ΠH
1
≤ Lγ
Proof: By hypothesis we have that H−ΠH
2
≤ H−
G
2
≤ H −G
1
≤ γ. We now use the statespace form for
ΠH, G, as described by Equation 1. Since, ( is a subspace,
the statetransition matrix, A, is ﬁxed. We assume that control
vector B is the free parameter that needs to be estimated.
Let B
1
, B
2
be the corresponding control vectors for G, ΠH
respectively. It follows,
G−ΠH
2
≤
_
_
_
_
C CA CA
2
. . . CA
n
¸
T
(B
2
−B
1
)
_
_
_
2
≤ 2γ
(30)
Since (A, C) is observable the matrix
_
C CA CA
2
. . . CA
n
¸
T
is invertible. Therefore,
B
2
− B
1

2
≤ K
0
γ for some constant K > 0 that
depends on the system matrices, A, C. Since, B
2
, B
1
are
ﬁnite dimensional and norms on ﬁnite dimensional spaces
are equivalent, we also have B
2
− B
1

1
≤ K
1
γ for some
K
1
> 0. This implies that G − ΠH
1
≤ K
2
γ for some
K
2
> 0. It now follows that,
H −ΠH
1
≤ H −G
1
+G−ΠH
1
≤ Lγ
Our ﬁnal proposition states that,
Proposition 7: Let, G
m
denote the best
1
approximation of
the FIR system P
m
H
r
, i.e., G
m
= argmin
G∈(
P
m
(H
r
−G)
1
Then it follows that,
lim
m→∞
P
m
(H
r
−G
m
)
1
= H
r
−G
r

1
Proof: The proof follows by the following sequence of
inequalities:
H
r
−G
r

1
≤ H
r
−G
m

1
P
m
(H
r
−G
m
)
1
≤ P
m
(H
r
−G
r
)
1
Therefore,
0 ≤ Pm(Hr −Gr)1 −Pm(Hr −Gm)1
≤ (I −Pm)(Hr −Gm)1 −(I −Pm)(Hr −Gr)1 −→ 0
where the ﬁnal limit follows from the fact that G
m

1
and
G
r

1
are bounded and have exponentially decaying tails.
The theorem now follows from the fact that, G
0
can be decom
posed as a known nonlinear operator acting on a collection of
linear functionals on the system H, i.e., ψ(ΠH, P
m
(H−ΠH))
where: 1) Estimate of an FIR model P
m
(H − ΠH) for the
modiﬁed system H − ΠH can be obtained consistently if
conditions in Theorem 5 are satisﬁed; 2) ΠH can be obtained
if conditions of Equation 23 are satisﬁed, which are equivalent
to conditions in Theorem 5.
VI. UNCERTAIN LFT STRUCTURES
We consider several different LFT structures for which the
tools developed in Theorem 1 can be applied to characterize
conditions on inputs, which achieve uniform consistency.
To focus attention on structured perturbations we limit our
attention to realrational space of systems endowed with the
H
2
norm.
A. Multiplicative Uncertainty
We consider ﬁrst the multiplicative uncertainty as shown in
Figure 2(a) with H ∈ H
2
. The inputoutput equation is given
by:
y(t) = Hu(t) +w(t) = G(1 + ∆)u(t) +w(t)
where, ( =
_
p
j=1
α
j
G
j
[ α
j
∈ IR
_
, the noise w is random
i.i.d. gaussian noise and ∆ is a multiplicative perturbation with
bounded
1
norm and accounts for the undermodeling of H.
Upon substitution we have,
y(t) =
p
j=1
α
j
G
j
(1 + ∆)u(t) +w(t)
=
α
j
(1 + ∆)G
j
u(t) +w(t)
=
p
j=1
α
j
ψ
j
(t) + ∆
p
j=1
α
j
ψ
j
(t) +w(t)
where we have used the commutativity of the convolution op
eration and substituted ψ
j
(t) = G
j
u(t). The parametrization
is nonconvex as shown in the following example.
g) Example: Consider the two elements H
1
= −(1 −
∆)G and H
2
= (1 + ∆)G in the setup above. Let the set of
perturbations be bounded, i.e., ∆ ≤ then (H
1
+H
2
)/2 =
∆G cannot be expressed as (1 +
˜
∆)G for some 
˜
∆ ≤ .
Therefore, the decomposition is not convex. However, if we
restrict α
j
≥ 0 the decomposition turns out to be convex. In
particular, the following subset of systems is convex.
o
+
=
_
_
_
G(1 + ∆) [ G =
j
α
j
G
j
, α
j
≥ 0, ∆ ≤
_
_
_
Theorem 7: A necessary condition for uniform consistency
is that ∆ be orthogonal to the space ( and the inputs satisfy
conditions of Theorem 4.
Proof: We ﬁrst consider a convex subset, o
0
, of systems.
Speciﬁcally, constrain the parameters α
j
∈ [1, 2]. Then
o
0
=
j
(α
j
G
j
+
˜
∆G
j
) ⊂
α
j
G
j
(1 + ∆) = o, where

˜
∆, ∆ are both smaller than γ. Now translating the set
o
0
by modifying the variables ˜ α
j
= α
j
− 3/2 we get a new
set o
1
which is both balanced and convex. We can now apply
Proposition 3 and Theorem 1 to argue that ∆ and ( must
be orthogonal for achieving uniform consistency. Furthermore,
from Theorem 4 it follows that the relevant input conditions
must also be satisﬁed. Now, there is a onetoone mapping
from the collection o
1
to the collection o
0
and therefore the
14
(a) Multiplicative Uncertainty (b) Identiﬁcation with Structured Uncer
tainty
Fig. 2. Different Classes of Linear Fractional Structures; The second
ﬁgure illustrates a serial connection of two components one of which is
approximately known, while the other is unknown and must be estimated
from inputoutput data
same conditions are necessary for o
0
as well. Since, o
0
⊂ o
the necessity holds for the general case as well.
We next apply the instrument derived for the set o
1
to
the general problem. It turns out that the notion of uniform
consistency must be modiﬁed suitably. Indeed, since the per
turbation is scaled by the parameter the error bound also has
a multiplicative structure, i.e., we can only say that
α −α
n

∞
≤ βα
∞
with high probability for a large enough n. This is weaker
than the conditions derived in the previous sections.
B. Serial Connections of Uncertain Systems
In this problem, shown in Figure 2(b), two uncertain sys
tems, H
0
, H
1
are connected in series. The system H
1
is
approximately known, i.e., its best approximation G
1
(in the
class (
1
) is known, and the system H
0
must be approximately
identiﬁed in the model class, (
0
from data. The interpretation
of the setup is that we are given a set of uncertain components
connected in series, where it is not possible to isolate any sin
gle component. The question arises as to when an approximate
model for the uncertain component can be derived. We ﬁrst
write the inputoutput equation as follows:
y(k) = (G
0
+ ∆
0
)(G
1
+ ∆
1
)u(k) +w(k) (31)
It is clear from the setup that the unmodeled dynamics enters
the equations in a nonlinear nonconvex fashion. Neverthe
less, necessary and sufﬁcient conditions can still be estab
lished.
Theorem 8: A necessary and sufﬁcient condition for uni
form consistency of the above setup is that ∆ be orthogonal to
both model classes (
0
and (
1
and the input satisfy conditions
of Theorem 4.
Proof: (necessity) Upon expanding Equation 31 we have,
y(t) = G
0
G
1
u(t)+G
1
∆
0
u(t)+G
0
∆
1
u(t)+∆
0
∆
1
u(t)+w(t)
(32)
Now, uniform consistency of G
0
implies uniform consistency
of the above setup with ∆
1
= 0 or ∆
0
= 0. The former case
reduces to an additive perturbation for which ∆ ⊥ (
0
is a
necessary condition, while the latter case is a multiplicative
perturbation for which ∆ ⊥ (
1
is necessary according
to Theorem 7. The conditions on the input follow from
Theorems 4.
(a) Block Diagram of Uncertain
System in a Feedback Loop
(b) Identiﬁcation error
Fig. 3. The unknown system has both parametric and nonparametric
components with the parametric part to be identiﬁed from closed loop data;
The setup is unconventional in that the system input is also measured; This
arises in a number of applications involving indoor echocancellation Systems.
(sufﬁciency) Comparing Equation 32 with Equation 14 we
see a potential difference only in the penultimate term, i.e.,
∆
0
∆
1
u(t). Now since ∆
0
is perpendicular to (
0
, it can be
written as F
0
˜
∆, for an arbitrary bounded perturbation
˜
∆,
where F
0
is as described in Proposition 4. By absorbing
˜
∆∆
1
into a single perturbation, ∆, we can write, ∆
0
∆
1
= F
0
∆.
The new expression obtained is now consistent with conditions
of Theorem 4, which then establishes sufﬁciency.
h) Example: As an instantiation of the theorem consider
the case when G
0
and G
1
are FIRs of order m and n
respectively. Then no consistent algorithm exists when the
order, m, of G
0
is larger than n. This is because the unmodeled
error G
0
∆
1
and the class of FIR models of order m are no
longer orthogonal.
C. Identiﬁcation of Uncertain Systems in Feedback Loops
Consider the feedback setup shown in Figure 3(a). The goal
is to estimate the optimal approximation to H = G + ∆ in
the model class G ∈ ( from inputoutput data. Our task is to
design the input signal u to guarantee robust convergence to
the optimal approximation.
Such problems arise in the context of design of adaptive
acoustic echocancellation systems that operate in a feedback
loop [28]. In general the high order acoustic dynamics coupled
with signiﬁcant time variation severely limits the choice of
echocanceller order. Moreover, the feedback ﬁlter, typically
used to enhance the quality of speech, can impact the stability
of the echocancellation system. In this light, a lower order
echocancellation system is designed in the hope that robust
performance can be ensured at the cost of sacriﬁcing some
performance. A dither input signal, u, is employed as an
input to the loudspeaker and the output, v, of the loudspeaker
is recorded in addition to the microphone signals, y. Thus
the input output data consists of (u, v, y) as illustrated in
Figure 3(a). Our setup consists of:
y(t) = Hv(t) +w(t) =
m
j=1
α
j
G
j
v(t) +F∆v(t) +w(t) (33)
where, w is i.i.d. gaussian noise and we have decomposed H ∈
RH
2
into orthogonal subspaces. Furthermore, we assume that
the closed loop system is stable and the exogenous input u()
15
and the internal signal, v(), and the output signal y() are both
measured. As the following theorem shows that a necessary
and sufﬁcient condition for uniform consistency remains the
same, i.e.,
Theorem 9: A necessary and sufﬁcient condition for uni
form consistency of the optimal restricted complexity model
G ∈ ( is that the input satisfy conditions in Theorem 4.
Proof: By straightforward algebraic manipulations of the
feedback loop we obtain that:
v(t) = W
1
u(t) +W
2
w(t), W
1
= (1 −HK)
−1
, W
2
= KW
1
Although the system, W, is generally unknown (since it is
a LFT of the controller with the unknown system), the fact
that the controller is stabilizing implies W
1
, W
2
are stable
transfer functions. Therefore, v(t) is bounded. Now consider
Equation 33, which is exactly in the form of Equation 14
with the unmodeled dynamics orthogonal to (. These are now
consistent with assumptions of Theorem 4 with u(t) replaced
with v(t). We are left to establish the conditions on the input
signal. Uniform consistency holds if and only if there exists
an instrument, q
n
() that decorrelates the internal signal v(t)
and noise w(t), i.e.,
¸
¸
¸
¸
¸
n
k=0
q
n
(k)v(k)
¸
¸
¸
¸
¸
= 1,
¸
¸
¸
¸
¸
n
k=0
q
n
(k +j)v(k)
¸
¸
¸
¸
¸
−→0
and
¸
¸
¸
¸
¸
n
k=0
q
n
(k)w(k)
¸
¸
¸
¸
¸
−→0
This in turn implies that the ﬁltered instrument, ˜ q
n
() =
W
1
q
n
() decorrelates the input signal and noise:
¸
¸
¸
¸
¸
n
k=0
˜ q
n
(k)u(k)
¸
¸
¸
¸
¸
= 1,
¸
¸
¸
¸
¸
n
k=0
˜ q
n
(k +j)u(k)
¸
¸
¸
¸
¸
−→0
¸
¸
¸
¸
¸
n
k=0
˜ q
n
(k)w(k)
¸
¸
¸
¸
¸
−→0
Note that in the above example since W
1
is unknown a
convenient instrument for v(t) is the ﬁltered dither signal
u
1
(t) = Gu(t). This is because when u(t) is a i.i.d. dither
signal:
1
n
¸
¸
¸
¸
¸
n
k=0
u
1
(k)v(k)
¸
¸
¸
¸
¸
≥ β > 0;
1
n
¸
¸
¸
¸
¸
n
k=0
u
1
(k +j)v(k)
¸
¸
¸
¸
¸
−→0
and
1
n
¸
¸
¸
¸
¸
n
k=0
u
1
(k)w(k)
¸
¸
¸
¸
¸
−→0
i) Example: The model space is given by ( =
α
1−0.3z
−1
.
The unmodeled dynamics is generated by normalizing a zero
mean random gaussian sequence of length 1000 and normal
izing it to have an
1
norm equal to one. The unmodeled
dynamics is then ﬁltered with the allpass ﬁlter F = (z
−1
−
0.3)/(1−0.3z
−1
). This ensures orthogonal separation between
the unmodeled error and the model subspace. The system
H is generated by summing the model for α = 1 with the
unmodeled dynamics. Next, a controller K is chosen so that
the closed loop system is stable and such that the sensitivity
function W
1
has
1
norm smaller than one. A uniformly
distributed i.i.d. input in the interval [−0.5, 0.5] of length 300
is applied and the measurements v(t) and y(t) are collected
in i.i.d. gaussian noise w(k) with mean zero and variance
0.1. For the IV technique we use (1 − 0.3z
−1
)
−1
u(t) as the
instrument in Equation 33 and identify α. Figure 3(b) shows
the parametric error plotted as a function of data length for the
IV scheme. We see that the IV technique converges, while it is
well known that the least squares approach leads to large bias.
This is not surprising considering the fact that the unmodeled
output is correlated with the model output and so while the
least squares approach leads to a large bias, the IV technique
is chosen suitably to decorrelate both the uncertainty as well
as the noise.
Remark: We contrast the IV approach presented in this pa
per with the wellknown MPE approach. The idea in the MPE
approach is to ﬁnd models that minimize the so called predic
tion error. In speciﬁc circumstances this implies a preference
for those models, where the prediction error is uncorrelated
with model output. In the presence of complex uncertainties,
especially when the error arises due to under modeling,
this approach can lead to signiﬁcant bias. IV methods can
overcome this problem through suitable instruments. These
instruments in speciﬁc circumstances can ﬁrst decorrelate the
uncertainty arising from correlated sources and then proceed
to cancel stochastic uncorrelated noise as well.
VII. CONCLUSIONS
We have introduced a new concept for system identiﬁcation
with mixed stochastic and deterministic uncertainties that
is inspired by machine learning theory. In comparison to
earlier formulations our concept requires a stronger notion
of convergence in that the objective is to obtain probabilistic
uniform convergence of model estimates to the minimum
possible radius of uncertainty. Our formulation lends itself to
convex analysis leading to description of optimal algorithms,
which turn out to be wellknown instrumentvariable methods
for many of the problems. We have characterized conditions
on inputs in terms of secondorder sample path properties
required to achieve the minimum radius of uncertainty. We
have derived optimal algorithms for system identiﬁcation for
a wide variety of standard as well as nonstandard problems
that include special structures such as unmodeled dynamics,
positive real conditions, bounded sets and linear fractional
maps.
16
VIII. APPENDIX
(Proof of Proposition 1) We deal with the case when γ
0
=
0 for simplicity. The argument for arbitrary γ
0
proceeds in a
similar manner. Let Y
n
denote the column vector of the output
signal of length n and T(Y
n
) be the set of all h ∈ o that are
consistent with the inputoutput data and the noise set, i.e.,
T(Y
n
) = ¦h ∈ o : y(k) = λ
k
(h)+w(k); w ∈ J
n
, 1 ≤ k ≤ n¦
The local diameter of uncertainty is given by:
diam(Y
n
) = sup
h1,h2∈T(Yn)
[λ
0
(h
1
−h
2
)[
From [22] it follows that the estimation error for any estimator
is bounded from below by onehalf of the diameter of uncer
tainty. By hypothesis of our proposition and the preceding
argument there exists a noise set, J
0
n
, such that the diameter
of uncertainty is smaller than 2. We are now left to produce a
convex and balanced set that has the same properties. To this
end, consider the following noise set:
J
∗
n
= ¦w ∈ IR
n
[ λ
k
(h) +w = 0, [λ
0
(h)[ ≤ , 1 ≤ k ≤ n¦
On account of the fact that the measurements are linear and
the constraints are convex and balanced, it follows that the set,
J
∗
n
is also convex and balanced. We are left to establish two
facts:
(A) The noise set J
∗
n
is feasible, i.e., the global diameter of
uncertainty over all feasible outputs Y
n
is bounded by 2.
(B) The probability measure of the noise set J
∗
n
is larger
than the feasible set, J
0
n
.
These two facts together will establish the proposition since
we have then found an alternative convex and balanced set to
J
0
n
that has a larger probability measure and has the same
worstcase global diameter of uncertainty.
To establish (A) we observe from [22] that with a lin
ear/convex measurement structure, the worstcase diameter of
uncertainty is realized when the output is identically zero.
Consequently, for all other output realizations the worstcase
uncertainty can only be smaller. Now, by construction the set
J
∗
n
corresponds to the zero output signal and therefore by the
preceding argument is feasible.
(B) is established by the following sequence of inequalities:
Prob{W
0
n
} ≤ sup
Wn
_
Prob{Wn}
¸
¸
¸
¸
¸
sup
h∈S,w∈Wn
diam{Yn} ≤ 2
_
(a)
≤ sup
Wn
_
Prob{Wn}
¸
¸
¸
¸
¸
sup
h=0,w∈Wn
diam{Yn} ≤ 2
_
(b)
≤ sup
Wn, 0∈Wn
_
Prob{Wn}
¸
¸
¸
¸
¸
sup
h=0,w∈Wn
diam{Yn} ≤ 2
_
(c)
≤ sup
Wn, 0∈Wn
_
Prob{Wn}
¸
¸
¸
¸
¸
sup
h=0,w∈Wn
diam{Yn = 0} ≤ 2
_
(d)
≤ Prob{W
∗
n
}
Inequality (a) follows from the fact that the output Y
n
in the RHS is restricted to the case when h = 0, which
amounts to a relaxation of the constraint. Inequality (c) follows
from a similar argument. Inequality (b) follows from the
fact that the worstcase diameter of uncertainty is invariant
under translations of noise set J
n
. However, the probability
measure of the noise set can change. Using the underlying
symmetry and monotonicity of the noise distribution around
zero, translation of the feasible set J
n
so that it contains
the zero element leads to increasing the probability measure.
Finally, (d) follows by deﬁnition of J
∗
n
since it includes all
possible noise elements for which the output is identically
zero and the diameter of uncertainty is constrained to be
bounded by 2. Consequently, no other noise set containing
zero element can have a larger probability measure.
(Proof of Corollary 1) Consider an element z ∈ H
2
with
¸, ) denoting the inner product. Let z
0
= z/z
2
be a unit
vector aligned with z; z
α
be any other unit vector. It follows
that,
z
2
= ¸z
0
, z) ≤ [¸z
0
−z
α
, z)[+[¸z
α
, z)[ ≤ z
0
−z
α

2
z
2
+[¸z
α
, z)[
where the ﬁrst expression in the last inequality follows cauchy
schwartz inequality. Denoting υ
α
(z) = ¸z
α
, z) and letting α
take on values in a ﬁnite index set 1 such that there is at least
an α for which z
0
−z
α

2
< 1 we have
z
2
≤ min
z0−zα2<1
[υ
α
(z)[
1 −z
0
−z
α

2
To relate it to Corollary 1 we let, z = x
1
−x
2
, where x
1
, x
2
are any two feasible elements satisfying Equation 7. It follows
that,
z
2
= x
1
−x
2

2
≤ min
z0−zα2<1
[υ
α
(x
1
−x
2
)[
1 −z
0
−z
α

2
The result now follows by direct substitution.
REFERENCES
[1] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., 1995.
[2] V. Borkar, S. Mitter, and S. Venkatesh. Variations on a neyman pearson
theme. Journal of the Indian Statistical Institute, 66:292–305, 2004.
[3] P. Caines. Linear Stochastic Systems. New York, Wiley, 1988.
[4] M. C. Campi. Exponentially weighted least squares identiﬁcation of
timevarying systems with white dusturbances. IEEE Trans. on Signal
Processing, 42:2906–2914, 1994.
[5] M. C. Campi and E. Weyer. Finite sample properties of system identiﬁca
tion methods. IEEE Transactions on Automatic Control, 47:1329–1333,
2002.
[6] M. A. Dahleh, T. V. Theodosopoulos, and J. N. Tsitsiklis. The sample
complexity of worst case identiﬁcation of linear systems. System and
Control Letters, 20:157–166, 1993.
[7] A. Garulli and W. Reinelt. On model error modeling in set membership
identiﬁcation. In Proceedings IFAC Symposium on System Identiﬁcation,
Santa Barbara (USA), pages WeMD1–3, 2000.
[8] L. Giarre, B. Z. Kacewicz, and M. Milanese. Model quality evaluation
in setmembership identiﬁcation. Automatica, 33:1133–1139, 1997.
[9] G. C. Goodwin, M. Gevers, and B. Ninness. Quantifying the error in
estimated transfer functions with application to modelorder selection.
IEEE Transactions on Automatic Control, 37:1343–1354, 1992.
[10] G. Gu and P. P. Khargonekar. Linear and nonlinear algorithms for
identiﬁcation in h with error bounds. IEEE Transactions on Automatic
Control, 37:953–963, 1992.
17
[11] P. Van Den Hof, P. Heuberger, and J. Bokor. Identiﬁcation with
generalized orthonormal basis functions–statistical analysis and error
bounds. Special Topics in Identiﬁcation Modelling and Control, pages
39–48, 1993.
[12] E. L. Lehmann. Testing Statistical Hypotheses. 2nd edn. Springer Verlag:
New York, 1997.
[13] L. Ljung. System Identiﬁcation: Theory for the user. Prentice Hall
Englewood Cliffs, NJ, 1987.
[14] L. Ljung and Z. D. Yuan. Asymptotic properties of the least squares
method for estimating transfer functions. IEEE Transactions on Auto
matic Control, pages 514–530, 1985.
[15] D. Luenberger. Optimization by vector space methods. John Wiley and
Sons, Inc, 1969.
[16] P. M. Makila. Robust identiﬁcation and galois sequences. International
Journal of Control, 54:1189–1200, 1991.
[17] M. Milanese and G. Belforte. Estimation theory and uncertainty intervals
evaluation in the presence of unknown but bounded errors: Linear
families of models and estimators. IEEE Transactions on Automatic
Control, 27:408–414, 1982.
[18] M. Milanese and A. Vicino. Optimal estimation theory for dynamic
systems with set membership uncertainty: An overview. Automatica,
27:997–1009, 1991.
[19] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization.
Prentice Hall Englewood Cliffs, NJ, 1997.
[20] P. Poolla and A. Tikku. On the time complexity of worst–case system
identiﬁcation. IEEE Transactions on Automatic Control, 39:944–50,
1994.
[21] W. Rudin. Principles of Mathematical Analysis. McGrawHill Interna
tional Editions, 1976.
[22] J. F. Traub, G. W. Wasilkowski, and H. Wozniakowski. Information,
uncertainty, complexity. AddisonWesley Pub. Co., 1983.
[23] D. N. C. Tse, M. A. Dahleh, and J. N. Tsitsiklis. Optimal identiﬁcation
under bounded disturbances. IEEE Transactions on AC, 38:1176–90,
1993.
[24] V. Vapnik. The Nature of Statistical Learning Theory. New York:
SpringerVerlag, 1995.
[25] S. Venkatesh. Necessary conditions for identiﬁcation of uncertain
systems. Systems and Control Letters, 53:117–125, 2004.
[26] S. Venkatesh and M. A. Dahleh. System identiﬁcation for complex
systems: Problem formulation and results. In Proceedings of the 36th
IEEE Conf. on Decision and Control, San Diego, CA, 1997.
[27] S. Venkatesh and M. A. Dahleh. On systemidentiﬁcation of complex
processes with ﬁnite data. IEEE Transactions on Automatic Control,
46:235–257, 2001.
[28] S. Venkatesh and A. M. Finn. Cabin communication system without
acoustic echo cancellation. US Patent #6748086, 2004.
[29] S. Venkatesh and S. K. Mitter. Statistical modeling and estimation
with limited data. In International Symposium on Information Theory,
Lausanne, Switzerland, 2002.
[30] S. R. Venkatesh and M. A. Dahleh. Identiﬁcation in the presence
of unmodeled dynamics and noise. IEEE Transactions on Automatic
Control, 42:1620–1635, 1997.
[31] A. Vicino and G. Zappa. Sequential approximation of feasible param
eter sets for identiﬁcation with set membership identiﬁcation. IEEE
Transactions on Automatic Control, 41:774–785, 1996.
[32] M. Vidyasagar. A Theory of Learning and Generalization. New York:
SpringerVerlag, 1997.
[33] B. Wahlberg. System identiﬁcation using laguerre models. IEEE
Transactions on Automatic Control, pages 551–562, 1991.
[34] B. Wahlberg and L. Ljung. Hard frequencydomain model error bounds
from leastsquares like identiﬁcation techniques. IEEE Transactions on
Automatic Control, 37:900–912, 1992.
[35] L. Y. Wang. Persistent identiﬁcation of timevarying systems. IEEE
Transactions on Automatic Control, 42:66–82, 1997.
[36] L. Y. Wang and G. G. Yin. Persistent identiﬁcation of systems with
unmodeled dynamics and exogenous disturbances. IEEE Transactions
on Automatic Control, 45:1246–1256, 2000.
In Section VI we show how this approach also leads to solutions to problems described by LFT structures. Finally z −1 is the ztransform variable and corresponds to the usual shift operation. optimal inputs. x p denotes the p norm of the discrete time signal x(·). This can be extended to other norms by approximating in a ﬁnite family of linear cost functions. Xn is the column vector (x(0). inputs and fundamental bounds. which can further be related to instrumentvariable techniques. P ROBLEM F ORMULATION In this section we formulate a mixed stochasticdeterministic problem. a new notion of sample complexity is developed wherein the idea is to ﬁnd the smallest time by which a conﬁdence level for the probabilistic objective can be reached. the space of systems deﬁned by the disc algebra of analytic functions bounded in the closed unit disc. Speciﬁcally. . minimum sample complexity etc. it is well known that a persistency of excitation condition is required to ensure consistency. these bounds can sometimes be suitable. In Section IV the paper describes new insights for parameter estimation in stochastic noise. Preliminary work along these lines has appeared in our earlier publications [25]. Although.. of a signal. This formulation is inspired by notions of uniform convergence of empirical means and approximate learning in machine learning theory. y(k). x ∈ p . we show in Section III. where denotes the convolution operation. wherein the idea is to pick model estimates so that the worstcase probability of “mismatch” error being larger than some threshold is minimized. Our objective is to seek models/estimates that minimize the probability of the worstcase errors. is represented by {h(k)}k∈Z + . y(t). σ) denotes the gaussian distribution with mean m and standard deviation σ. To address these issues we propose a new formulation for mixed components in Section II. The model . . 0. n] The npoint crosscorrelation between two signals n−τ n rxy (τ ) = i=0 x(τ + i)y(i) N (m. this problem (in terms of ﬁnding optimal algorithms. . These ideas resemble notions of minimax risk functions employed in the statistical and estimation literature [12]. The output response. . For a realrational LTI system. . In summary. Mathematically. For a signal. 1. unmodeled dynamics and noise as was shown in [27]. We derive conditions on input sequences in terms of their second order sample path properties. .. The upper bounds are usually obtained by using leastsquares algorithms for periodic inputs while the lower bounds are obtained by considering the noiseless case.) is in general intractable in both the estimation and identiﬁcation contexts.). . . H.e. we show that optimal algorithms can be associated with hyperplane separation of convex sets. τ ∈ [0. A p. Z + is the set of positive integers. we are given a sequence of realvalued observations. i. In particular. to an input u(t) is denoted by y(t) = Hu(t) = (h u)(t). x(n − 1))∗ . . is n−τ n rx (τ ) = i=0 x(τ + i)x(i). it is generally difﬁcult to characterize optimal algorithms. Another aspect of the paper is in characterizing inputs required for system identiﬁcation. Prob{A} denotes the probability of an event A and E{X} denotes the expected value of the random variable X. that it is also tractable through convex analysis particularly for problems deﬁned by linear observations. p . i. This paper generalizes that formulation and deals with a wide variety of new problems. . H. x(1). The analysis is then extended to problems to cover situations where the desired solution can be expressed as a nonlinear function of a linear operator. LTI refers to linear time invariant systems and BIBO refers to boundedinputboundedoutput systems. Pm denotes the truncation operator: Pm (x) = (x(0). although there are a number of formulations that have been proposed for dealing with mixed scenarios. . Section V then develops algorithms for optimal identiﬁcation in the presence of unmodeled dynamics and noise. x. rx (·). There the idea is to choose parameters that minimize the worstcase (over all parameters) the expected value of the estimation error. By means of our mixed formulation we show that this condition is also necessary for achieving uniform consistency. this strategy is employed to show that uniform convergence of parametric error to zero can be established in the presence of noise and unmodeled dynamics for stable systems in an 1 space. convex subsets of systems and additive random noise and under the ∞ norm.q = supx∈ p Ax q / x p denotes the induced norm of operator or matrix. we seek uniform convergence in probability of the estimation error. p ≥ 1 denotes the space of sequences on Z + bounded in the p norm (in [15] these concepts are deﬁned on the space of natural numbers but we only consider sequences over positive integers and denote them as p with an abuse in notation). Wang [36] has proposed a new concept for mixed identiﬁcation. a) Notation:: A∗ denotes the complexconjugate transpose of the matrix A. . Nevertheless. which are assumed to be generated according to an underlying model. Although. we often associate a minimal realization in state space: A B H≡ (1) C D II. x(m−1). The impulse response of an LTI system. Wang [36] addresses this issue by deriving upper and lower bounds for a number of different cases. This section also develops solution techniques for problems with convex parametric constraints. .2 tivated by persistent identiﬁcation. n The npoint autocorrelation. denotes real valued sequence on positive integers. This is particularly the case when there is a separation between model. In particular. this notion is stronger than the earlier proposed formulations. where consistent identiﬁcation of FIR models was described. . they can be overly conservative for situations where Least Squares (LS) is not a convergent algorithm. . Furthermore.e. RH2 refers to stable realrational LTI systems with bounded squared norm.
∆) ∈ S. for a linear ˆ algorithm. min Prob{max αY − θ > 1/2} ≥ 1/2 α θ f ∈F f (xi ) ≥ γ −→ 0 Extensions to this problem include consideration of general loss functions and function approximation with lower complexity function classes. ∆ are nonrandom components with the tuple (θ. θn (y) from a ˆ sequence of n observations.. We now introduce the mixed problem. where S is a subset of a separable normed linear space H. It follows that. ξ. ξ. θ. for the sequence of algorithms θ as: ˆ ˆ R(θ.∆)∈S t Fu (θ. i. We denote by θ (with an abuse ˆ of notation) the inﬁnite sequence of algorithms. sup (θ. w(·) is a random noise process.e. S) = 0. consider an LTI example with a 1tap FIR with unmodeled dynamics. w). Suppose W is a binary random variable taking values in the set {−1. θ. this situation is analogous to the concept of uniform convergence of empirical means. ∀ ξ > 0. θ.. S) = inf γ ∈ IR  limsupn α(θn . Observe that the set. Suppose. S). . i. samples of {xi } are drawn from a distribution deﬁned by P and the problem is to show that. note that the variables θ and ∆. We denote by α(θn . i. The principle difﬁculty in applying learning theory to system identiﬁcation is that the solutions there depend on observations being independent and do not deal with dynamical scenarios. i. θ. Note that the problem is meaningful even when the perturbation. θ0 is said to be optimal if its sample complexity is smaller than any other algorithm. Returning to our context. we are given a family of functions. n − 1 (2) This leads to the notion of consistency: ˆ Deﬁnition 1: We say that the sequence of algorithms θ is uniformly consistent if the radius of uncertainty is equal to zero. Wang et. may not be independent. while our notion is related to the righthandside of the following inequality. γ. i. their notion is related to the lefthandside. f (x) ∈ F. Y = θ + W with θ ∈ [−1. . the main problem is that it is difﬁcult to analyze exactly (as seen even for the above example). R(θ. Indeed.∆)∈S ˆ θ − θn (y) ≥ γ ˆ θ − θn (y) ≥ γ where the probability is taken with respect to the noise distribution.∆)∈S Prob ˆ θ − θn (y) ≥ γ ≤ Prob sup (θ. The minimum radius of uncertainty is given by: ˆ R0 (S) = inf R(θ. F k is a time dependent realvalued regressor. t y(k) = F k (θ. S is a separable space and so measurability is preserved under a supremum [1]. ˆ ˆ L( . · ˆ is a suitable norm on IRm . θ = W . we deﬁne the asymptotic radius of ˆ ˆ uncertainty. θ(Y ) = αY . θ0 ..e. (θ. and we get. the weaker notion may sufﬁce in many cases. S) ≤ L( .d.. al. γ. the probability that the worstcase estimation error is larger than γ = R0 (S) + . S) ≤ ξ y(t) = ≡ θu(t) + k=0 δ(t − k)u(k) + w(t) 1. This issue is well 1 [36] considers simpler structures where the model and unmodeled dynamics are decoupled . is at most α. ∆.e. S) = Prob sup (θ. ˆ α(θn . More precisely. {θn }. This models situations involving optimal approximations. L( . The ﬁnite dimensional component. [36] consider a weaker version in that their formulation is related (but not exactly1 ) to the worstcase probability of error. i. where. where we have used the fact that the minimum value is equal to minα Prob{maxθ (1−α)θ+αW  > 1/2}. To clarify the notation.e. ξ. δ(t) is the impulse response of the unmodeled component. Prob{maxX (W X) > 1/2} = 1 c) Example: Suppose. while ∆ denotes the residual error. The lower bound now follows by choosing. let where. ∆.e. min max Prob{αY − θ ≥ 1/2} ≤ 1/3 α θ Although. S) the probability that the worstcase estimation error is larger than γ for n observations. such that the smallest number of observations. does not exist. S) = min n ∈ Z +  α(θn . . > 0 These deﬁnitions are related to the mixed formulations proposed in [36] with important differences.3 is uncertain in that it has both random and nonrandom aspects that provide information on how the sequence of observations are generated. γ. the set of values θ and ∆ can be jointly speciﬁed by the set S ⊂ H.e. 1}. i.e. 1] and W uniformly distributed in [−1. k = 0..e. ∆. S) ˆ θ Finally. ˆ ˆ L( . N i.. We immediately see that maxX Prob{W X ≥ 1/2} = 1/2. Remark: In general. 1]. ∆. i. Prob sup EP (f (x)) − 1 N N i=1 The two problems can be signiﬁcantly different as the following examples indicate: b) Example: Let X be a binary number taking values in the set {−1. For the simplest situation in learning theory. γ. 1} each of which is equally likely.. 1. θ characterizes the best approximation in the model class. θ. ∆) ∈ S = IR × u(·) ∈ ∞ ˆ An algorithm. ξ. S). It turns out through detailed analysis that the optimal solution for the weaker notion is given by α = 3/4. we deﬁne sample complexity and optimality of an algorithm: ˆ Deﬁnition 2: The sample complexity of the algorithm θ is ˆ S). w).i. . ˆ is to be estimated by means of an algorithm.
is now a worstcase noise signal for which the error is uniformly bounded by γ0 + . ej . . . is a column vector of length n with jth component equal to one and all other components equal to zero. which is applied in subsequent sections. In this paper we limit our attention to gaussian processes (the non i. Then linear algorithms are optimal when the norm measure on Φ(h) is the ∞ norm. Gaussian noise such that w(k) ∼ N (0. III. Suppose. φp (h))T . ∀ h ∈ S. 0. elements of the set. for a large number of problems especially when there is a decomposition between model and unmodeled dynamics the upper and lower bounds are not tight. h ∈ S (3) between the probabilistic problem and an appropriate worstcase problem. {λ1 (·) + e1 . For the setup described by Equation 3 we have the following theorem. . We also demonstrate extensions of this approach to other norms by approximating in a ﬁnite family of linear cost functions. consider estimation of a single linear functional.i. The proof of optimality of ˆ the linear algorithm would follow because φn (y) is arbitrary and can be replaced by an optimal algorithm. the sequence of ˆ ˆ algorithms. if ψ(k) is the sequence of regressors formed for an ARMA structure (see [13] for details) then. where. y(k) = ψ T (k)θ + w(k) forms a linear measurement structure on the variable θ. w(·). The task is to compute a linear function Φ(h). S are observed through a linear measurement structure with additive random noise.d. . . n. is a linear functional on the space. where the upper bound is computed with a least squares algorithm and periodic inputs. ˆ sup sup φ(h) − φn (y) ≤ γ0 + n w∈IR −A h∈S (5) ˆ sup sup φ(h) − φn (y) ≤ γ0 + (6) n w∈W 0 h∈S We are now left to prove that a linear algorithm is an alternative optimal estimator as well. particularly for problems deﬁned by linear observations. convex subsets of systems and additive random noise for ∞ norm. ξ. case is established in the next section) for the sake of simplicity. and a convex and balanced subset. Theorem 1: Suppose w(·) is i. Φ(h) = (φ1 (h). Prob h∈S ˆ sup φ(h) − φn (y) ≥ γ0 + ˆ ≤ ξ. . S ⊂ H. ξ. k = 1. This implies. . while there exist specialized inputs and linear algorithms that do result not only in consistency but also polynomial samplecomplexity. Now . H. . One advantage of our formulation is that it lends itself to exact analysis through application of convexity theory and by extension design of optimal inputs and algorithms. Nevertheless. Thus the observations are linear over H. λn (·) + eT }. Suppose. . ∀ n ≥ L( . . Now. w ∈ W n 0 T Now. ˆ S). which are symmetric and monotone around zero. The lower bound is computed by considering the noiseless case.. the uniform distribution satisﬁes these properties as well. Indeed. H. Proposition 1: If Equation 5 is satisﬁed then it follows n that there exists a convex balanced subset. with n Prob{W 0 } ≥ 1 − ξ such that V = {(φ(h). φ = {φn (y)}. σ 2 ). d) Example: The measurement structure includes ARMA models as well. 0) is not contained in the interior of the set V. This implies the existence of a hyperplane such that. . For simplicity. . Remark: Although we have assumed Gaussian noise the theorem holds for other noise distributions as well. Therefore. Remark: In general the result holds for more general noise distributions. φ. For instance. Consider a linear space.d. The main result here is that under some technical conditions linear algorithms have optimal sample complexity. This in turn implies that c0 = 0. it follows that (γ + .e. . Equation 4 is equivalent to existence of a set A ⊂ IRn with Prob{A} ≤ ξ such that. W 0 ⊂ IRn . h ∈ S} 0 c0 (φ(h) − (γ + )) + j=0 cj yj ≤ 0. We state this in the form of a proposition below and the details can be found in the appendix. [27] provides examples wherein neither periodic inputs nor LS algorithms lead to consistency. . achieves the sample complexity. φ. as illustrated in the sequel. and except for simple situations the problem is generally intractable.i. i. i. Proof: If the radius of uncertainty is inﬁnite then linear algorithms are no worse than any other algorithm. .4 known in statistical estimation [12] and recently the authors have analyzed the weaker notion [2]. O PTIMAL A LGORITHMS FOR C ONVEX P ROBLEMS In this section we state and prove a general theorem. Next we establish that the set W n − A can be chosen to be convex without impacting the above inequality. Consider the following set of IRn+1 : From Equation 6. λ1 (h) + w(1). λn (h) + w(n))  w ∈ W n . the linear functionals. where Φ maps h to IRp . L( . . This is because. For this we follow the proof of Smolyak’s theorem [22]. The theorem is a generalization of Smolyak’s theorem [22] to observations with random noise. . λk . are linearly independent. The main idea is illustrated in Figure 1. In [36] this issue is addressed by deriving upper and lower bounds. φ(h). These properties can be used to argue the equivalence y(k) = λk (h) + w(k). The main property required is that the noise distribution be symmetric and monotonic around zero. . . n where. n where.e. . suppose γ0 = R0 (S) < ∞. w. The linear function Φ(h) is often referred to as the solution operator.. S) (4) We are left to establish that there is a linear algorithm that achieves the same performance. [29] in the context of composite hypothesis testing.
. Remark: Note that the results do not provide an explicit expression for computing the minimum radius of uncertainty (and indeed this problem is . We then obtain a ﬁnite set of linear inequalities. n qj (k. . . our idea is to express squared norm value as the maximum over all unit magnitude linear functionals—the existence of which is guaranteed through duality—and then ﬁnd estimates for each of these linear functionals. . V (θ) = n 1 2 ˆ t=1 E(y(t) − y (t/θ)) as a function of the predictor. . . x belonging to the set deﬁned ˆ by Equation 7 with υ ∈ {υ k }k∈I and let Φn (y) = x be a nonlinear estimator that maps the observations to a feasible point. To generalize it to the linear function. . . are the optimal linear weights corresponding to the linear functionals φl (h). n c0 (φ(h) + (γ + )) + j=0 cj yj ≥ 0. Yn = Λn (h)+Qn Wn . we note that. . ξ) = ≥ ˆ max φj (h) − φn (y) j j n n qj (k. 0) is also another boundary point. i. . ˆ Φ(h) − Φ(h) ∞ sup sup n w∈W 0 h∈S k=0 q n (k)y(k) − φ(h) ≤ γ0 + . υ(·) can be associated with a vector (υ1 . w ∈ W n 0 Consequently. Corollary 2: Let. the minimum angle of separation for the collection of vectors is smaller than β < π/4. establishing result for a single linear functional. . In the predictionerror paradigm [13]. λ2 (·). this topic has not been studied under the new criterion of uniform convergence. λn (·)) is a sequence of linear functionals. ∀ h ∈ S. qυ (k) with the minimum radius of uncertainty. . Let υ k be such a discretization with k belonging to some ﬁnite index set. . let υ k = (υ1 . . Consider a feasible point. .. However.e. . Suppose. υp ) ∈ p k IR . . which uniﬁes many of the results in stochastic identiﬁcation. 1. . With high probability they lie in a bounded convex set. Speciﬁcally. We then select linear functionals.e. ξ > 0.e. . p υ(x) = k=1 υk x k Corresponding to each choice of υ(·) ∈ IRp we have an n optimal linear algorithm. Therefore we obtain the complementary inequality. . . . k n qυ (k)y(k) − k=1 υk x k ≤ γ 0 + (7) by using the fact that V is convex and balanced we see that (−γ − . φ2 (h). φp (h)) in this ﬁnite set forms a good estimate k k for the squared norm. the objective is to analyze prediction errors. 0 . it is possible to bound the squared norm with the inﬁnity norm this can be conservative (for example. n thus. Although. . k ∈ I. We then have. where Yn .. . υp ) ∈ IRp . φl )y(k) − φj (h) k=0 where. Extensions We can generalize the above theorem to squared error norms as well. Speciﬁcally. ∀ n ≥ N ( . Corollary 1: If there exists an algorithm that achieves a radius of uncertainty γ0 + for a data length n with probability α then: γ0 + ˆ Φ(h) − Φn (y) 2 ≤ 1−β Proof: See the appendix. These theorems are readily generalized to stationary gaussian noise setting with bounded power spectral density. It then follows that. it follows that there are a sequence of coefﬁcients {q n (k) = −ck /c0 }n not all zero k=0 such that. . with an ˆ algorithm Φn (·) with the squared norm as the error measure. n max υ(x) = x 2 ≤1 2 . υ 2 = 1. Instead. Illustration of Convex Separation. φl ) A. PARAMETER E STIMATION R EVISITED In this section we provide a complementary perspective on parameter estimation in random noise. υ(·) consists of an inﬁnite family and needs discretization. X and Y axes denote range of measurement and parametric values respectively. . The motivation for revisiting this well studied topic is threefold: (1) First. √ for a vector x ∈ IRp we have x ∞ ≤ p x 2 ). . Φ(h). IV. A feasible point. p Fig. Wn are column vectors of length n and with Qn 2 ≤ 1 and Λn = (λ1 (·). γ0 + (with respect to the squared norm). Therefore. υ Clearly. For ease of exposition we let xj = φj (h). i. let x ∈ IRp endowed with the squared norm metric and let υ(·) be a linear functional on IRp . i. (x1 . for every . We consider the problem of estimating the linear function Φ(h) = (φ1 (h). φp (h)). they do provide a basis for determining this uncertainty through linear algorithms.5 Now to draw a direct correspondence with the theorem we proceed as follows. Then a linear algorithm achieves the minimum diameter of uncertainty for estimating a linear function Φ(h) mapping H to IRp from data under the ∞ and squared norm error measures. xp ) = Φ(h) = (φ1 (h). .
x(1). .e. . n n n qk (l)u(l l=0 1/2 1= − k) ≤ n qk 2 l=0 u (l − k) 2 y(t) = k=0 h(k)u(t − k) + w(t) This implies that (Pn − P−1 )u 2 → ∞.e. Σn ≥ ρIm . which model parametric dependencies. y(t−2). . (3) We can describe tight lower bounds for error for ﬁnite data without the need for a consistent estimator. To describe our problem we need the notions of persistent input and persistency of excitation. The proof is readily extended to the more general case with linear parameterizations. n where.d.. . we need n limsupn qk 2 is an i. θ ∈ IRp (10) We assume that w(t) is i. Z ∈ IRl×p . Since norms on ﬁnite dimensional spaces are all equivalent we employ the ∞ norm here. Therefore. Σn . h ∈ IR is arbitrary it follows that. .e. The following result follows: Theorem 2: For the setup above the radius of uncertainty is zero if and only if the input u(·) is persistently exciting of order p. v2 (·). gaussian random process. i. the assumptions of Theorem 1 are satisﬁed. Φ(·) in this case is the identity operator. We next describe persistency of excitation. for some N0 ∈ Z + and ρ > 0. The necessity part to the best of author’s knowledge is new. The radius of uncertainty is equal to zero if and only if the input is persistently exciting of order p. y(t−m1 ). Corollary 3: Let the inputoutput equation be given by y(t) = ψ T (t)θ + w(t). j=k l=0 p n qk (l)u(l − j)h(j) −→ 0 Since. i. . which also is weaker than the conventional notion. ∀ j = k (9) and Prob Since. based on our hypothesis. (2) It turns out that under this new criterion the well known input requirements become transparent and can be shown to be both necessary as well as sufﬁcient for achieving consistency. . for all n ≥ N0 . . ψ T (t) = [y(t−1). i. Our ﬁrst example is the case of . l=0 n qk (l)u(l − j) = 0.. hn (y) k such that. there exists a linear algorithm that achieves uniform consistency as well. x(k). the input must be persistently exciting and from Equation 9 it follows that the persistency of excitation must be of order p as stated. n 2) The m × m matrix. Deﬁnition 4: An input u(·) is said to be persistently exciting of order. . vm (·) such that 1) The input u(·) is persistent. .6 Our objective differs in two ways: (a) It directly deals with ˆ the estimation error. v1 (·). . A persistent input is an ∞ sequence with unbounded energy. Prob ˆn sup h(k) − hk (y) p h∈IR p−1 ∞ −→ 0 where. Therefore. is said to be persistent if lim inf Pn x 2 −→ ∞ n We point out that this deﬁnition is weaker than the conventional notion. . . . i.e. Deﬁnition 3: An bounded sequence. w(·) n j=0 n qk (j)w(j) ≥ N→∞ −→ 0 n n j=0 qk (j)w(j) is gaussian with n equal to qk 2 . We ﬁrst prove the result for FIR parameterizations. θ − θn (y) (b) we seek uniform ˆ convergence. This implies existence of an instrument n qk corresponding to each coefﬁcient h(k) so: n n n qk (j)w(j) j=0  + j=0 n (qk (j)u(j − k) − 1)h(k) n (8) N→∞ p−1 + j=0. We point out that most results that exist in the literature in this connection are sufﬁcient conditions and necessity is not easy to establish in many cases.e. if there exists m instruments (discretetime signals). which is required typically for CramerRao bounds. u(t). . . 2.d. .. n 1 limn→∞ n t=0 x2 (t) = C. ∞ (x(0). p−1 With this association we have a linear observation structure and since the objective is to estimate the FIR taps. .i. Proof: The sufﬁciency part of the result can be derived along the lines of [13] and is omitted. Consequently. We describe some relevant situations where such convex constraints naturally arise. gaussian noise. can also be developed. with [Σn ]jk = rvj u (k) satisﬁes. a new aspect is that optimal algorithms for identiﬁcation problems with convex parametric constraints. mean zero and variance →0 Now from Equation 9 we have from CauchySchwartz inequality. .. Note that the parametric set is balanced. .i. . i. j = 1. u(t − m2 )] is the usual regression vector formed from previous inputs and outputs. .e. where the input signal is required to be quasistationary and have ﬁnite power. i. Therefore. Prob{supθ θ − θn (y) }. .. u(t− 1). We are then given that there exists an algorithm. n n l=0 n qk (l)u(l − k) = 1. These ideas can also be extended to more general convex parameterizations which account for parametric dependencies.. m. We consider ﬁnitely many parametric constraints modeled as: Zθ ≤ b (11) To apply Theorem 1 we need to deﬁne λi s ﬁrst: λj (h) = k=0 h(k)u(j − k).). where m1 + m2 = p. (4) Finally.
Then it follows that. Γb + Q 2. denotes the gaussian error function [12] and we have assumed that radius of uncertainty is equal to zero (for simplicity of exposition). i. Therefore. E. Similarly. note that the problem setup conforms to the convex setup. Q 2. W 0 is the same convex set with probability mass larger than 1 − ξ as described in Theorem 1. Remark: Observe that when the constraints are given by Zθ = 0. {µj } on the parameter θ. One can then attempt to minimize the index by choosing different inputs. the parameter set is convex and balanced. From Theorem 1 it follows that linear algorithms are optimal for the ∞ norm. is linear.7 a positive real condition. QΨn + ΓZ = I. the UBN setup does not strictly fall within the framework described in Section II the proof technique of Theorem 1 can be used to establish that the radius of uncertainty is bounded away from zero. This is indeed the case as seen from Equation 12. This is because in this case. ψ(n)]T be the data regression matrix. therefore it follows from the steps used in the proof of Theorem 1 that there must exist a linear algorithm that achieves optimal error for the latter (RHS of inequality above) problem. the inputoutput data is given by a linear measurement structure. Proof: The proof of the result follows from convex duality and in particular a straightforward application of Farkas lemma. let U n be the set of all unitamplitudebounded input sequences of length n. it follows similar to the arguments used in the preceding theorems that if the radius of uncertainty is smaller than δ. p−1 µ0 = k=1 γk µk . such that (QΨn − I)θ + Qw ≤ δ + . T T µ0 θ − b0 > 0 whenever µi θ − bi > 0. . then there must exist a matrix Q and an integer n. . for a given > 0. m m y(t) = k=0 h(k)u(t − k) + w(t). The proof now follows by a direct application of Farkas lemma. . let Ψn = [ψ(0). . It is well known [23] that regardless of the input. w(t) ∈ [−γ. we should expect that the persistency of excitation condition can be relaxed. ξ > 0.e. From the results in Corollary 4 we have for a convex parametric set S that. erfc(·). . k=1 γk bk ≤ b0 . given ˆ ≥ sup sup h − hn Wn h ∞ . We have the following corollary for convex parameter constraints: Corollary 4: Consider the setup of Equations 10. Our next task is to comment on the issues of optimality and choice of optimal inputs. estimator and datalength the estimation error is always bounded from below by γ. γ]. These results in turn help in loosely characterizing inputs that would lead to optimal identiﬁcation. 2. Indeed. Furthermore. the positive constraint Γ ≥ 0 no longer exists. which is stated here for the sake of completion [19]: Lemma 1 (Farkas Lemma): Consider a set of linear functions. w ∈ W n 0 n where. where y is the observation vector of length n. Therefore. Φ(θ) = θ. i = 1.∞ ≤ δ. q n (·). ω ∈ [−π. W n = [−γ. First. γk ≥ 0 We now return to the proof of Corollary 4. . ∀ Zθ ≤ b. Zθ = b. ˆ sup sup h − hn Wn h 1 Now. hn (y). u(·) ≤ 1 (13) ˆ In the context of UBN we are interested in an estimate hn (y) based on inputoutput data of length n so the worstcase error over all FIRs and admissible noise realizations are minimized: ˆ min → sup sup h − hn n p ˆ hn h∈IR W 1 where. Let. with the error estimate. which can be modeled by convex constraints: k θk cos(ωk) ≥ 0. all of the inequalities are to be interpreted pointwise. The desired solution. It follows that for a data length n the radius of uncertainty (in the ∞ norm) is smaller than δ if and only if there exists matrices Q ∈ IRp×n and Γ such that. Therefore. Uniform Bounded Noise Although. A. Prob max θ − θn (y) θ∈S ∞ ≥ ≥ = Prob { Qw erfc ∞ ≥ − Γb ∞} − Γb ∞ Q 2.∞ subject to constraints of Equation 12.. 11. The optimal cost of this optimization problem would provide an index J(Φn ) which is a function of the regressor Φn . We provide a new proof for this result based on Theorem 1 as stated below. we are only left to satisfy QΦn + ΓZ = I for an arbitrary Γ. π] To utilize these constraints we can discretize the inﬁnite system of inequalities and through translation consider convex and balanced subsets of the parameter space. It follows that if Z has rank m then Φn need only have rank p − m to ensure that the equation can be satisﬁed. Proof: We note that. ψ(1). The solution consists of ﬁnding a Q that satisﬁes Equation 12. γ]n is the set of all admissible noise sequences up to time length n. Theorem 3: The minimum radius of uncertainty is larger ˆ than γ for any algorithm. in the dual setting.∞ the input needs to be persistently exciting of a smaller order than the parametric dimension. An estimate of θ is then given by Qy. which can solved through convex programming. . m where. Another example where these results are applicable is when the set of parameters belong to a linear manifold. This would follow by ﬁrst minimizing. Thus there is a linear instrument. both W n and h are convex balanced sets. Γ ≥ 0 (12) where.
.i. the model ηG is an element of the unmodeled set ∆. .. (14) which indicates the fact that the mth order FIR account for the underlying system within a worstcase error of size γ. The main goal is to estimate G(θ) from input output data in a uniform manner as described in Section II. which we state below without proof. . Overbounding Unmodeled Dynamics max + j=0 n (qm (j)u(j) − 1)h(0) E= 0≤m≤p−1 p−1 n + j=1 l=0 n qm (l)u(l − j)h(j) This implies that if the error. V. This can also be seen indirectly from our results in an earlier paper [30]. H = G(θ) + ∆ ∈ S is the subset of systems. k=1 θk Gk .. is a sequence of i. U NMODELED DYNAMICS WITH S TOCHASTIC N OISE In this section we deal with the problem of asymptotic consistency of systems with mixed uncertainties consisting of stochastic measurement noise and worstcase (deterministic) unmodeled dynamics. This is clearly a redundant decomposition of the underlying system. E < ∞ then. Gk ∈ RH2 . Our setup will deal with the following inputoutput equation: y(t) = Hu(t) + w(t) = G(θ)u(t) + ∆u(t) + w(t).e. This results in a minimum radius of uncertainty larger than γ. n The main goal of this section is to motivate the need for a optimal decomposition of the system into model and unmodeled dynamics. [8]. k ∈ Z + . speciﬁcally. Consider estimation of the ﬁrst tap. n n q0 (k)(G k=0 n + ∆)u(k) + k=0 n n q0 (k)w(k) − h(0) where. we see that. [35] with deterministic worstcase noise and in mixed situations in [30]. System identiﬁcation with restricted complexity models have been explored in [27]. More speciﬁcally. inputs are bounded. Suppose. n E ≥ sup w t=0 n n qm (t)w(t) ≥ qm where. [4]. Proof: The convexity of the set of systems. Proposition 2: For the setup above it follows that the minimum radius of uncertainty is greater than γ. Proposition 3: Suppose the system decomposition is convex and balanced such that for some η > 0. perturbations that satisfy δ(k) = 0 for all k ≥ m. The ﬁrst mtaps of the true system are accounted for both by the model and the unmodeled dynamics. m ≥ n q0 (k)u(k)δ(0) = γ k=0 G= G ∈ RH2  G = (15) In the following section we discuss issues that arise when the perturbations ∆ are speciﬁed independent of the model class. G(θ) ∈ G is a model class of interest. Here we study mixed situations and derive fundamental conditions required to achieve uniform convergence of the model estimates to their optimal approximations in the model class. . . . G F IR . linear observations as well as linearity of the solution operator Φ(h) = (h0 . ∀ 0 ≤ m ≤ p−1 Now. [27]. n n q0 (k)u(k) = 1 k=0  ∆ 1 ≤ γ} (17) This implies that. u is a bounded input. Let.8 by. y(t) = Hu(t) + w(t) = Gu(t) + ∆u(t) + w(t) (16) 1= j=0 n n qm (j)u(j) ≤ qm 1 u ∞ n = qm 1 . u(k) ≤ 1. . The problem arises in system identiﬁcation whenever restricted complexity models are used for identiﬁcation. H ∈ 1 .e. jointly gaussian random variables. then the radius of uncertainty must be larger than η. Theorem 1 provides a situation under which the minimum radius of uncertainty is zero in that if a small subset of noise sequences of vanishingly small measure were rendered inadmissible then the radius of uncertainty is equal to zero. is the class of ﬁnite impulse response sequences (FIR) of order m. w(1). We then generalize these results to the more general case. w(t) is a Guassian stochastic process. θk ∈ IRm This proof can be generalized for other model classes as well. i. Now the radius of uncertainty can only be smaller if we only consider a smaller subset of systems. the unmodeled class ∆ is given by ∆ = {∆ ∈ 1 1 w ∞ ≥ γ. It follows that if the radius of uncertainty is bounded then. We do this by computing the radius of uncertainty for identifying an FIR model class with unmodeled dynamics. ∀ m thus establishing the result. hm−1 ) can be readily veriﬁed. . ∆ ∈ ∆ and w(0). [36]. we consider model classes that are subspaces. i. The main problem with these results is the apparent gap between practical reality and the minimum radius of uncertainty achieved through the decomposition of Equation 16.d. G ∈ G F IR . By Theorem 1 there must n exist instruments ql (·) that are optimal for estimating the FIR model class. n n n qm (t)w(t) t=0 A.
These issues are discussed in this section. Therefore. q n (·) such that. B.e. rqn w (0) −→ 0 in probability ˆ τ =0 rq n u (τ ) On the other hand if S 1 is considered then the conditions for uniform consistency remains the same except that the second condition is weaker. every realrational system. . q n (·) so that. ∆ ∈ ∆2 } (21) {H ∈ RH2  H = G + F ∆. ∆ ∈ ∆1 } ∆2 = {∆  ∆ 2 ≤ γ}. F and ul (·) = Gl u(·). these conditions imply that the input must be persistently exciting of inﬁnite order.  k=0 q n (k)w(k) −→ 0. consider the case when noise is relatively small. as it turns out it is difﬁcult to ﬁnd inputs that lead to uniform consistency for the ﬁrst set. This is because. in practice the ﬁrst mtaps of the impulse response of H can be identiﬁed accurately for the noiseless case. . i. This issue was acknowledged by the author in [27] and it was pointed out that once a normed space of systems is chosen. with w ∈ and Prob{W n } ≥ 1 − ξ. The basic structure used in the proof (as seen from Equations 5. 6) is that we require the set S after the decomposition to remain convex. m k=0 rq n u (k)gl (k) = rq n ul (0) = 1. The zeros of F ∗ are pole locations of the G1 . n n−j j=0 k=0 Wn 0 q n (k + j)u(k)h(j) − θ1 q n (k + j)u(k)(θ1 g1 (j) + (f1 δ)(j)) − θ1 ≤ n n−j = j=0 k=0 .9 For instance. The general case follows in an identical manner. A necessary and sufﬁcient condition for uniform consistency over the set S 2 is that the input be persistent and there exist a sequence of instruments. . the solution operator Φ(H) = (θ1 . G ∈ G. ξ > 0 n n n−j q (k)w(k) + k=0 j=0 k=0 n q n (k + j)u(k)h(j) − θ1 ≤ . For notational simplicity we deﬁne: S2 S1 = = {H ∈ RH2  H = G + F ∆. ∆1 = {∆  ∆ 1 ≤ γ} (20) which is optimal for 1 and H2 norms. However. a realrational all passfunction F and an arbitrary perturbation ∆. . H. it follows that. . G2 . l = 1. In this light. Consequently. This implies that the unmodeled component is given by the set of all stable realrational functions of the form z −m ∆. Therefore. can be written as: H = G0 + F ∆ (19) We now state our main result in this section. We are now left to describe the unmodeled error. 1≤τ ≤n n max rqn u (τ ) −→ 0 ˆ n→∞ (23) Furthermore. this 0 implies the condition that the instrument must decorrelate the n noise. Identiﬁcation in Hilbert Spaces Motivated by the discussion in the previous section we derive identiﬁcation results for more general model spaces. Proof: First we consider the set S 2 . for any > 0.. . The main problem is that all elements belonging to unit balls in RH2 are not necessarily stable. . there is an optimal decomposition in the sense that it minimizes the unmodeled error. Proof: The proof follows by applying Cauchy’s theorem [21]. this decomposition leads to the minimum unmodeled error. with ∆ being an arbitrary element in RH2 . i. Then. For the sake of exposition we next consider oneparameter model class. . using Cauchy’s theorem we can do more as in the following proposition: Proposition 4: Consider the model class given by Equation 15. such a decomposition would lead to the following description for the unmodeled error: ∆ = {z −m ∆  ∆ ≤ γ} (18) As before we restrict the size of the perturbation and consider ﬁrst the following subsets of real systems: As we remarked earlier the second subset is considered since the ﬁrst (although natural) contains unstable systems. . . 2. Once such a decomposition is chosen the objective would be to estimate the optimal model component. the parametrization in Equation 17 over bounds the existing inputoutput relationship and leads to a conservative result. and therefore the system F is a delay of order m. Indeed. In this case the decomposition suggests that the radius of uncertainty still does not go to zero. m (22) 2 n→∞ n→∞ n n n −→ 0. For the FIR case. θm )T is linear. We are now left with an instrument that must satisfy. On a Hilbert space this convex decomposition is preserved and we should expect Theorem 1 to apply in this setting as well. ∀ θ1 ∈ IR for some G0 = G(θ0 ). there are hardly any inputs that satisfy the requirements for uniform consistency. Gm . e) Example: There are m poles at zero for an FIR model space of order m. Theorem 4: Let u = F u be the ﬁltered input through the all ˆ pass ﬁlter. This can be done naturally since the dual is also a Hilbert space and orthogonal separation between modeled and unmodeled components can be easily established. Therefore. the unmodeled perturbations have to be further constrained in some manner. . . Motivated by Proposition 3 we again seek to ascribe modeled and unmodeled components to different aspects of data.. The subset S 2 is convex and the observation given by Equation 14 is linear. Now by Theorem 1 if uniform consistency holds there must exist a number n and an instrument.e. As before. We note that the projection ΠH of the system H onto the subspace G is linear and of ﬁnite rank. G ∈ G.
f1 (k). then: Prob 0<k≤n max n ru (k) ∞ ≥α ≤ n exp(−nβ(α)) (24) where β(α) = 1 + 1−α log2 1−α + 1+α log2 1+α 2 2 2 2 Proof: The proof of this result can be found in [6]. This follows from the fact that. For the set S 1 we only require that the normalized worstcase crosscorrelation between input and instrument to go to zero. In summary we have characterized conditions for uniform consistency and optimality for Hilbert spaces. which is dealt in the next section. The constraints on the input (disregarding the noise constraint) n rqn u (j)δ(j) ≤ . ∆ respectively. . u(t−1). since G1 is bounded in H∞ because it is stable real rational transfer function. ψ(1). i. the unmodeled part implies that F1 u(·) is perpendicular to q n (·). u = F1 u. We now argue the difﬁculty of ﬁnding inputs that satisfy this condition by applying it to a 1tap FIR model class. In the previous section the Hilbert space structure led to convex decomposition of the unmodeled error and model class and Theorem 1 could be applied in a straightforward manner.i. Remark: We point out that the conditions for S 2 are more demanding than S 1 . for the ﬁrst expression u1 = G1 u. n n rqn u (j)g1 (j) = rqn u1 (0) = 1.e. sup j=0 n ∆∈∆2 j=0 We next discuss issues of optimality and choice of inputs. First. Therefore. Therefore. n min(q n )T Ψn ΨT (q n ) subject to n j=0 q n (j)u(j) = 1 where. This is understandable on account of the fact that RH2 forms a larger class of perturbations than 1. If the u ˆ value of the solution to the quadratic optimization problem has a value greater β (this is the opposite of what we want) then: (q n )T Ψn ΨT (q n ) = n uT (Ψ 1 ≥ β ⇐⇒ T −1 u n Ψn ) 1/β u uT Ψn ΨT n ≥0 1= j=0 q n (j)u1 (j) ≤ q n 2 G1 H∞ Pn u 2 where. Then. By arbitrarily varying θ. Let.. This condition can easily be accomplished for inputs that have small autocorrelation coefﬁcients. u = [u(0).10 where. by choosing an appropriately ﬁltered input as the instrument the conditions of Equation 23 can be satisﬁed. On the other hand the weaker condition for ∆1 can be met by a number of inputs. Ψn = [ψ(0). where convex separation between model class and unmodeled dynamics proved to be critical. Identiﬁcation in 1 and nonlinear functionals We now turn to the problem of identiﬁcation in the 1 norm. ψ(n)]T .. u(n)]T . j=1 (ru (j))2 << β. In particular we only need to establish that there exist inputs with small autocorrelations with high probability. .e. . Then an appropriately ﬁltered input can be selected as the instrument. C. . we have used Schur’s complementation to derive the last expression above. . . The condition arising from the noise and the ﬁrst ˆ expression together imply that the input must be persistently exciting. i. For these set of problems the fact that the optimal algorithm is linear follows from Theorem 1. . i. ˆ imply that the optimal instrument is the solution to: n where. n j=1 n (ru (j))2 ≥ β sup ∆∈∆2 j=0 =γ j=0 n (rqn u (j))2 −→ 0 ˆ For the set ∆1 we similarly have. for a β > 0 (independent of n) the above condition is met with probability one leading us to believe that few inputs satisfy the n n opposite condition.d. δ(k) are the impulse responses for G1 . . Observe that we have applied Proposition 4 and substituted F1 ∆ for the perturbation and so F1 and G1 are perpendicular. .. For this situation the above condition implies that.e. with ψ(t) = [ˆ(t). The problem now reduces to looking for inputs satisfying Equation 22. and in the second expression we have absorbed the ﬁlter F1 into the input. Pn u −→ ∞. Similarly. ψ(0)]. This is possible because G1 and F1 are orthogonal and so G1 u and F1 u are uncorrelated when driven by white noise. ∆ we see that. If the input is chosen i. g1 (k). then the expected value of the expression on the left is bounded away from zero. Lemma 2: Suppose u(t) is a discrete i. . In the RH2 case we need a more stringent condition since we need the sum of the squares of the normalized crosscorrelation also go to zero. But the noise condition implies that q n −→ ∞. . notice that one could incorporate additional parametric and nonparametric constraints as we did in Section IV and Equation 12. In the next section we consider a situation where such a convex separation does not exist. F1 . · H∞ is the usual H∞ norm. n sup ∆∈∆1 j=0 n n rqn u (j)δ(j) = γ max rqn u (j) −→ 0 ˆ ˆ j≤n This establishes the necessity.i. inputs that are white in their second order properties. . Remark: To see how the convex separation results in achieving uniform consistency note that the conditions on the input from the modeled part implies that G1 u(·) is aligned with the instrument q n (·). Next for the perturbation we apply CauchySchwartz inequality to obtain: n 2 n rqn u (j)δ(j) ˆ n where. bernoulli random process with mean zero and variance 1. We will not describe the sufﬁciency conditions since the argument is similar to the argument for sufﬁciency for FIR model classes.d. .
. a number N (ρ) for every ρ > 0 such that for all n > N (ρ): 1≤τ ≤n n max rqn u (τ ) n→∞ q n (0) . .d. .e.. n−j n k=0 ql (k + w ∈ Wn 0 j)u(k) ≤ The proof now follows by ﬁrst observing that for the hypothesis to be true we must have. 0 q n (0) . consider the optimal decomposition: S = {H ∈ 1 Next. i. . . . . 0 ≤ l ≤ m. . max1<j<n n n k=0 ql (k)w(k)  ≤ . i. i. . . . norm measure on Φ(·) is the ∞ norm. Consequently. Let y. 0 The persistency of the input follows along similar lines as in Theorem 2. we present identiﬁcation of FIR model class in the 1 norm. Qn be the matrix composed of the instrument q n and its m delays.. ∆ ∈ ∆} (25) where. (⇐=) To prove sufﬁciency we construct an IV technique n that is uniformly consistent.. w be the output and noise signal vectors of length N respectively.. uniform consistency in the 1 norm for the ﬁrst m FIR taps is identical to uniform consistency in the ∞ norm. .. . ξ. S is convex. S) ˆ such that Prob supH∈S Pm (h − hn ) ∞ ≥ ≤ ξ. . 0 . . . G F IR is the class of mtap FIR sequences. . the minimum radius of uncertainty is zero for the ∞ norm as well. g = Pm H and δ is the column vector of impulse response of the unmodeled dynamics. under the 1 norm. However for more general classes under the 1 norm the decomposition through optimal approximations is not convex. 0 .11 For FIR model classes. . . .. . . . . . Then. Therefore. . S). R = . . ... The setup now mirrors Theorem 1 since the set. q n (m + (k − 1) + j)u(0) . G. . for length n of data. . Again from our hypothesis together with the fact that that ∆ ≤ γ. Qn = .. 0 n n n n k=0 ql (k)w(k) + k=0 (ql (k)u(k) − 1)h(j) n→∞ n n−j n + j=m k=0 ql (k + j)u(k)h(j) −→ 0 (QT Rn )k = n q n (m + (k − 1) + j)u(j). Prob {W n } ≥ 1 − ξ.. q n = q1 and its delayed copies up to mdelays serve as instruments. the decomposition is still convex. . we let. QT Rn δ converges to zero. Un a corresponding matrix of input sequence with its m delays and Rn the residual delays of the input sequence. . .. 0 . . w is i.. ∀ n ≥ N ( . . −→ 0. by noting that ∆ is arbitrary with 1 norm less than γ and w(·) is an arbitrary element in the convex set W n . Φ(H) = (h(0). . u(0) u(1) . j=0 j=0 Consequently. which 0 also contains the zero element we have. . .. . . . This is due to the fact that optimal FIR approximation in 1 and RH2 are identical. n ≥ ˆ L( . 0 u(0) . n n→∞ Suppose. . . δ = (Pn − Pm )H. . gaussian noise. .. h and a number L( . . . u(0) . n . n To show this. . n n k=0 ql (k)u(k) = 1. . n n+1−(m+k) Un = 0 0 .i. . This implies by hypothesis that. . . ξ) n−(m+k) and Prob{W n } ≥ 1 − ξ Upon substitution it follows that. sup sup n w∈W 0 H∈S  n ql (k)y(k) k=0 − h(l) ≤ . Theorem 5: The radius of uncertainty for the above setup is equal to zero if and only if there is a persistent input and a corresponding family of instrument variables q n (·) which uniformly decorrelates the input and noise. .. there is a linear algorithm with n weights. for each FIR tap.. Basically we let. ∆ = {z −m ∆  ∆ 1 ≤ γ}. . . λt (h) = Hu(t). . given > ˆ ˆ 0.. q n (0) n q (1) . i. . u(0) . . In the next section. . by hypothesis it follows that. ξ. . q n (m + k + j)u(j). n rqn u (0) = 1 (26) n (27) rqn w (0) −→ 0 in probability Proof: (=⇒) We observe that all norms are equivalent on ﬁnite dimensional spaces [15].. .e. ξ > 0 there is an algorithm. . h(m − 1))T . D. (28)  H = G + ∆. . . . . 0 . y = Un g + Rn δ + w =⇒ QT y = QT Un g + QT Rn δ + QT w n n n n (29) It follows from hypothesis that QT Un is an invertible matrix n with bounded inverse for large enough N since it converges to a lower triangular matrix with identical elements on the diagonal. In the following theorem we derive necessary and sufﬁcient conditions for achieving uniform consistency for estimating the FIR coefﬁcients. FIR Model Class In this section we derive necessary and sufﬁcient conditions for identiﬁcation of FIR model classes in the presence of unmodeled dynamics under the 1 norm. G ∈ G F IR . .. . The results developed for FIR case will be used to compute uniformly consistent estimators for more general model classes. h. g is the column vector composed of the ﬁrst m impulse response coefﬁcients.e. . . . it follows that. such that. . 0 . . measurement structure is linear. . . . .. . In particular. To this end. . . by hypothesis. Consequently. ql (·). . . .e. . . . . note that the kth row of QT Rn is given by. .
. The following example τ j=0 illustrates the issue of nonconvexity of 1 decompositions. for the system H1 is G(H1 ) = 0.i. ξ. From the alignment conditions −1 + 0<τ ≤n n n ru (τ ) Prob max ≥ 0<τ ≤n n 2 max H∈S ≥ ≥ ≤ ≤ n ruw (τ ) ≥ 0<τ ≤n n 2 n ruw (τ ) Motivated by these issues we solve the problem of 1 identiﬁcation indirectly. First. Noise. Hence. G0 (H) = Π(H)+argminG∈G H−ΠH−G = lim G0 (Pm H) m→∞ + Prob max n exp(−nβ( )) + n exp(−n 2 /σ 2 ) where we have taken the probability over both the input and the noise process. . which maps to the model subspace and L is a ﬁnite rank linear operator mapping the system H to an appropriate ﬁnite dimensional space. The ﬁrst term in the last expression uses Lemma 2 and the second term uses the fact that w is gaussian. Φ. ξ. It is then easy to establish that.. Theorem 1 cannot be directly applied. i. G ∈ G.12 of the model space G. the set of optimal approximants for the convex set of systems S is not convex. The samplez −1 −1 −1 −1 ˆ complexity of an algorithm. The main difﬁculty is that the decomposition achieved through optimal approximation is not convex. we deduce that G(H2 ) = 1 − z 2 . Moreover. ≤ G0 (H) = ψ(L(H)) where. G = { ∆ ≤γ∆ p j=1 αj Gj ⊥ where.e. Theorem 6: The input characterization as in Theorem 5 is sufﬁcient for uniform consistency in the 1 norm. the following convex space of T Finally. The idea is that 1 minimizer can be arbitrarily closely approximated by a nonlinear function of linear functions of the system impulse response. S = {H ∈ 1  Hα = αH1 + (1 − α)H2 . we choose instruments q n (k) = u(k)/n. Φ. To this end we have the following theorem.. hypothesis and noting that (w(0). Since linear functions of the system can be efﬁciently estimated the nonlinear function can be estimated as well. General Model Classes In this section we derive algorithms for achieving uniform consistency for general model classes (that are subspaces) in the 1 norm. We establish polynomial sample complexity of linear algorithms for the FIR model class here. −1 θ 1− z −1 2 −1  θ ∈ IR . α ∈ [0. w(·) T (Qn Rn )k δ ≤ is assumed to be i. i. ∆ ∈ ∆ where. Hr . Proposition 5: The optimal 1 approximation G0 (H) for a given system. G ∈ RH2 and ∆ = and the modiﬁed system by. where the input sequence is a bernoulli process. . G(H1 ). ˆ L( . S) ≤ 1 d log 1 ξ for some integer d. Proof: The proof follows from the following equality: G∈G = min H − ΠH − G 1 G∈G The limiting argument follows from the fact that Pm H → H and continuity of projection operation and the 1 norm. −1 We next discuss the issue of sample complexity.d. Therefore. The decomposition follows from wellknown results in 1 approximation theory [15]. Π is the 2 projection on to the space G. 1]} random variables.e. for some C > 0. Our task n+1−(m+k+τ ) is to determine the model G0 that minimizes the unmodeled ∆ 1 max q n (m + (k − 1) + j + τ )u(j) ≤ γ error. G(Hα ) = G(H2 ) for all α = 1 is the unique optimal approximation as well.i. can be rewritten as. . where G ⊥ is the space of annihilators  αj ∈ IR. gaussian noise as before. Uniform consistency in 1 would then follow by estimating the linear functions. For notational simplicity we denote the modiﬁed problem on the RHS by Gr .e.d. i. it follows directly by substitution that. ψ(·) is nonlinear continuous map. E. Prob Prob sup QT y − Pm (H) n n ru (τ ) ∞ for optimality [15] it follows that the unique optimal approximation. G0 = argminG∈G H − G . Qn w also converges to zero in expectation by systems. Similarly. We proceed by ﬁrst describing the optimal approximation as a known nonlinear function of linear functions on the underlying system space. Our setup consists of. Gr (H) = argminG∈G H − ΠH − G Hr = H − ΠH . Now. H. whose consistency conditions are in turn characterized by Theorem 5. S) ≤ C 1 2 log 1 ξ implying polynomial samplecomplexity of convergence. 1 min H − G G }. f) Example: Consider. w(1). Φ(y) is said to polynomial [32] where H1 (z ) = 1 and H2 (z ) = z + 1 − 2 Our model class is given by G = in learning theory if ˆ L( . y(t) = Hu(t)+w(t) = Gu(t)+∆u(t)+w(t). The proof of the theorem relies on a number of results that are provided below. w(n)) are i.
d. S + = G(1 + ∆)  G = αj Gj . We can now apply Proposition 3 and Theorem 1 to argue that ∆ and G must be orthogonal for achieving uniform consistency. under the conditions derived in Theorem 4. The inputoutput equation is given by: y(t) = Hu(t) + w(t) = G(1 + ∆)u(t) + w(t) where. 0 ≤ ≤ 1 1 ≤ Hr − Gm 1 1 Pm (Hr − Gm ) ≤ Pm (Hr − Gr ) where we have used the commutativity of the convolution operation and substituted ψj (t) = Gj u(t). if we restrict αj ≥ 0 the decomposition turns out to be convex. First. Upon substitution we have. It now follows that. we also have B2 − B1 1 ≤ K1 γ for some K1 > 0. Proof: We ﬁrst consider a convex subset. . Theorem 7: A necessary condition for uniform consistency is that ∆ be orthogonal to the space G and the inputs satisfy conditions of Theorem 4. This implies that G − ΠH 1 ≤ K2 γ for some K2 > 0.13 The main task is to now show how to further simplify the problem to an instantiation of Theorem 1. Now. C. The parametrization is nonconvex as shown in the following example. which are equivalent to conditions in Theorem 5. ΠH respectively. 2) ΠH can be obtained if conditions of Equation 23 are satisﬁed. A. U NCERTAIN LFT STRUCTURES We consider several different LFT structures for which the tools developed in Theorem 1 can be applied to characterize conditions on inputs. Let the set of perturbations be bounded. C) is observable the matrix T C CA CA2 . Speciﬁcally. Furthermore. αj ≥ 0.e. G = j=1 αj Gj  αj ∈ IR . CAn T (B2 − B1 ) y(t) = j=1 αj Gj (1 + ∆)u(t) + w(t) αj (1 + ∆)Gj u(t) + w(t) p p = = j=1 αj ψj (t) + ∆ j=1 αj ψj (t) + w(t) H − ΠH 1 ≤ H −G 1 + G − ΠH 1 ≤ Lγ Our ﬁnal proposition states that. of systems. G is a subspace. It follows. Proposition 7: Let. The theorem now follows from the fact that. ∆ ≤ j Pm (Hr − Gr ) 1 − Pm (Hr − Gm ) 1 (I − Pm )(Hr − Gm ) 1 − (I − Pm )(Hr − Gr ) 1 −→ 0 where the ﬁnal limit follows from the fact that Gm 1 and Gr 1 are bounded and have exponentially decaying tails. is ﬁxed.. However.e. CAn is invertible. which achieve uniform consistency. 2].e. G. Since. we have the following proposition: Proposition 6: If there is G ∈ G such that H − G 1 ≤ γ then there is some constant L such that. the decomposition is not convex. ψ(ΠH. Pm (H −ΠH)) where: 1) Estimate of an FIR model Pm (H − ΠH) for the modiﬁed system H − ΠH can be obtained consistently if conditions in Theorem 5 are satisﬁed. A. Now translating the set ∆ S 0 by modifying the variables αj = αj − 3/2 we get a new ˜ set S 1 which is both balanced and convex. ∆ are both smaller than γ. Therefore. where j (αj Gj + ∆Gj ) ⊂ ˜ . we can consistently estimate the projection in a uniform manner. g) Example: Consider the two elements H1 = −(1 − ∆)G and H2 = (1 + ∆)G in the setup above. i.i. i. gaussian noise and ∆ is a multiplicative perturbation with bounded 1 norm and accounts for the undermodeling of H. Therefore. from Theorem 4 it follows that the relevant input conditions must also be satisﬁed. . i. B1 are ﬁnite dimensional and norms on ﬁnite dimensional spaces are equivalent. G−ΠH 2 2 VI. the noise w is random i. Then ˜ S0 = αj Gj (1 + ∆) = S. The problem reduces to estimating Gr (H). We now use the statespace form for ΠH. lim Pm (Hr − Gm ) 1 = Hr − Gr 1 m→∞ Proof: The proof follows by the following sequence of inequalities: Hr − G r Therefore. B2 be the corresponding control vectors for G. ∆ ≤ then (H1 + H2 )/2 = ˜ ˜ ∆G cannot be expressed as (1 + ∆)G for some ∆ ≤ . p p ≤ C CA CA2 . G0 can be decomposed as a known nonlinear operator acting on a collection of linear functionals on the system H. . In this regard. ≤ 2γ (30) Since (A. there is a onetoone mapping from the collection S 1 to the collection S 0 and therefore the . Multiplicative Uncertainty We consider ﬁrst the multiplicative uncertainty as shown in Figure 2(a) with H ∈ H2 . Therefore. A. Since. In particular. To focus attention on structured perturbations we limit our attention to realrational space of systems endowed with the H2 norm. Gm denote the best 1 approximation of the FIR system Pm Hr . note that ΠH is a linear and the range of the projection is a ﬁnite dimensional space. as described by Equation 1. Let B1 . S 0 . B2 . B2 − B1 2 ≤ K0 γ for some constant K > 0 that depends on the system matrices. the following subset of systems is convex. constrain the parameters αj ∈ [1. We assume that control vector B is the free parameter that needs to be estimated. the statetransition matrix... . Gm = argminG∈G Pm (Hr −G) 1 Then it follows that. H − ΠH 1 ≤ Lγ Proof: By hypothesis we have that H − ΠH 2 ≤ H − G 2 ≤ H − G 1 ≤ γ.
uniform consistency of G0 implies uniform consistency of the above setup with ∆1 = 0 or ∆0 = 0. since the perturbation is scaled by the parameter the error bound also has a multiplicative structure. B. It turns out that the notion of uniform consistency must be modiﬁed suitably. In general the high order acoustic dynamics coupled with signiﬁcant time variation severely limits the choice of echocanceller order. H0 .. ≤β α ∞ (sufﬁciency) Comparing Equation 32 with Equation 14 we see a potential difference only in the penultimate term. v. The goal is to estimate the optimal approximation to H = G + ∆ in the model class G ∈ G from inputoutput data. it can be ˜ ˜ written as F0 ∆. where it is not possible to isolate any single component. The question arises as to when an approximate model for the uncertain component can be derived. The former case reduces to an additive perturbation for which ∆ ⊥ G 0 is a necessary condition. Theorem 8: A necessary and sufﬁcient condition for uniform consistency of the above setup is that ∆ be orthogonal to both model classes G 0 and G 1 and the input satisfy conditions of Theorem 4. Proof: (necessity) Upon expanding Equation 31 we have. This is because the unmodeled error G0 ∆1 and the class of FIR models of order m are no longer orthogonal. Thus the input output data consists of (u. can impact the stability of the echocancellation system. Different Classes of Linear Fractional Structures. Then no consistent algorithm exists when the order. i. The conditions on the input follow from Theorems 4. Serial Connections of Uncertain Systems In this problem. By absorbing ∆∆ into a single perturbation. two uncertain systems. we assume that the closed loop system is stable and the exogenous input u(·) . The interpretation of the setup is that we are given a set of uncertain components connected in series. i. Our setup consists of: m y(t) = Hv(t) + w(t) = j=1 αj Gj v(t) + F ∆v(t) + w(t) (33) where. C. which then establishes sufﬁciency. y(t) = G0 G1 u(t)+G1 ∆0 u(t)+G0 ∆1 u(t)+∆0 ∆1 u(t)+w(t) (32) Now. Since.e. we can only say that with high probability for a large enough n. m. The unknown system has both parametric and nonparametric components with the parametric part to be identiﬁed from closed loop data. In this light. The system H1 is approximately known.e. Furthermore. typically used to enhance the quality of speech. necessary and sufﬁcient conditions can still be established. h) Example: As an instantiation of the theorem consider the case when G0 and G1 are FIRs of order m and n respectively. 2. of the loudspeaker is recorded in addition to the microphone signals. This arises in a number of applications involving indoor echocancellation Systems. ∆0 ∆1 u(t). shown in Figure 2(b). G 0 from data. ˜ 1 where F0 is as described in Proposition 4. for an arbitrary bounded perturbation ∆. Nevertheless.i.. Now since ∆0 is perpendicular to G 0 . S 0 ⊂ S the necessity holds for the general case as well. A dither input signal. w is i. ∆. v. Indeed. The new expression obtained is now consistent with conditions of Theorem 4. This is weaker than the conditions derived in the previous sections. We next apply the instrument derived for the set S 1 to the general problem.d.14 (a) Multiplicative Uncertainty (b) Identiﬁcation with Structured Uncertainty (a) Block Diagram of Uncertain System in a Feedback Loop (b) Identiﬁcation error Fig. α − αn ∞ Fig. and the system H0 must be approximately identiﬁed in the model class. u. Identiﬁcation of Uncertain Systems in Feedback Loops Consider the feedback setup shown in Figure 3(a). y) as illustrated in Figure 3(a). i. we can write. gaussian noise and we have decomposed H ∈ RH2 into orthogonal subspaces.. is employed as an input to the loudspeaker and the output.e. the feedback ﬁlter. We ﬁrst write the inputoutput equation as follows: y(k) = (G0 + ∆0 )(G1 + ∆1 )u(k) + w(k) (31) It is clear from the setup that the unmodeled dynamics enters the equations in a nonlinear nonconvex fashion. while the latter case is a multiplicative perturbation for which ∆ ⊥ G 1 is necessary according to Theorem 7. its best approximation G1 (in the class G 1 ) is known. H1 are connected in series. a lower order echocancellation system is designed in the hope that robust performance can be ensured at the cost of sacriﬁcing some performance. 3. while the other is unknown and must be estimated from inputoutput data same conditions are necessary for S 0 as well. The second ﬁgure illustrates a serial connection of two components one of which is approximately known. y. The setup is unconventional in that the system input is also measured. ∆0 ∆1 = F0 ∆. Moreover. Such problems arise in the context of design of adaptive acoustic echocancellation systems that operate in a feedback loop [28]. of G0 is larger than n. Our task is to design the input signal u to guarantee robust convergence to the optimal approximation.
Now consider Equation 33.d. positive real conditions. gaussian noise w(k) with mean zero and variance 0. v(·). bounded sets and linear fractional maps. W2 = KW1 Although the system.e. the IV technique is chosen suitably to decorrelate both the uncertainty as well as the noise. As the following theorem shows that a necessary and sufﬁcient condition for uniform consistency remains the same.e. The unmodeled dynamics is generated by normalizing a zero mean random gaussian sequence of length 1000 and normalizing it to have an 1 norm equal to one.i. q n (·) that decorrelates the internal signal v(t) and noise w(t).3z −1 ). W2 are stable transfer functions. For the IV technique we use (1 − 0. especially when the error arises due to under modeling.i. VII. Remark: We contrast the IV approach presented in this paper with the wellknown MPE approach. We have characterized conditions on inputs in terms of secondorder sample path properties required to achieve the minimum radius of uncertainty.i. while it is well known that the least squares approach leads to large bias. W1 = (1 − HK)−1 . v(t) is bounded.. k=0 k=0 q n (k + j)v(k) −→ 0 and n k=0 q n (k)w(k) −→ 0 This in turn implies that the ﬁltered instrument.3z −1 )−1 u(t) as the instrument in Equation 33 and identify α. We have derived optimal algorithms for system identiﬁcation for a wide variety of standard as well as nonstandard problems that include special structures such as unmodeled dynamics. Therefore. This is because when u(t) is a i. The idea in the MPE approach is to ﬁnd models that minimize the so called prediction error. this approach can lead to signiﬁcant bias. a controller K is chosen so that the closed loop system is stable and such that the sensitivity function W1 has 1 norm smaller than one. which turn out to be wellknown instrumentvariable methods for many of the problems. Figure 3(b) shows the parametric error plotted as a function of data length for the IV scheme. In the presence of complex uncertainties. In speciﬁc circumstances this implies a preference for those models. C ONCLUSIONS q n (k)w(k) −→ 0 ˜ We have introduced a new concept for system identiﬁcation with mixed stochastic and deterministic uncertainties that is inspired by machine learning theory. We see that the IV technique converges. the fact that the controller is stabilizing implies W1 . input in the interval [−0. We are left to establish the conditions on the input signal. These are now consistent with assumptions of Theorem 4 with u(t) replaced with v(t). This ensures orthogonal separation between the unmodeled error and the model subspace.1. Next. W .d. which is exactly in the form of Equation 14 with the unmodeled dynamics orthogonal to G.15 and the internal signal. q n (·) = ˜ W1 q n (·) decorrelates the input signal and noise: n n q (k)u(k) = 1. Theorem 9: A necessary and sufﬁcient condition for uniform consistency of the optimal restricted complexity model G ∈ G is that the input satisfy conditions in Theorem 4.3)/(1−0. i.3z −1 . The unmodeled dynamics is then ﬁltered with the allpass ﬁlter F = (z −1 − 0. 1 n n k=0 u1 (k + j)v(k) −→ 0 k=0 u1 (k)w(k) −→ 0 . IV methods can overcome this problem through suitable instruments.5. and the output signal y(·) are both measured. where the prediction error is uncorrelated with model output. i. dither signal: 1 n and 1 n n n k=0 u1 (k)v(k) ≥ β > 0. Our formulation lends itself to convex analysis leading to description of optimal algorithms. The system H is generated by summing the model for α = 1 with the unmodeled dynamics. This is not surprising considering the fact that the unmodeled output is correlated with the model output and so while the least squares approach leads to a large bias. Proof: By straightforward algebraic manipulations of the feedback loop we obtain that: v(t) = W1 u(t) + W2 w(t). k=0 Note that in the above example since W1 is unknown a convenient instrument for v(t) is the ﬁltered dither signal u1 (t) = Gu(t).5] of length 300 is applied and the measurements v(t) and y(t) are collected in i. Uniform consistency holds if and only if there exists an instrument. 0.d. A uniformly distributed i.. is generally unknown (since it is a LFT of the controller with the unknown system). In comparison to earlier formulations our concept requires a stronger notion of convergence in that the objective is to obtain probabilistic uniform convergence of model estimates to the minimum possible radius of uncertainty. ˜ k=0 n k=0 n q n (k + j)u(k) −→ 0 ˜ α i) Example: The model space is given by G = 1−0. n n q n (k)v(k) = 1. These instruments in speciﬁc circumstances can ﬁrst decorrelate the uncertainty arising from correlated sources and then proceed to cancel stochastic uncorrelated noise as well.
Goodwin. for all other output realizations the worstcase uncertainty can only be smaller. To this z 2 = z0 . A. the probability measure of the noise set can change. To establish (A) we observe from [22] that with a linear/convex measurement structure. 1 ≤ k ≤ n} the zero element leads to increasing the probability measure. consider the following noise set: VIII. Z. 1997. Campi. Finite sample properties of system identiﬁcaWn h∈S . 37:953–963. Borkar. . w ∈ W n . z and letting α take on values in a ﬁnite index set I such that there is at least an α for which z0 − zα 2 < 1 we have z 2 ≤ z0 −zα min 2 <1 To relate it to Corollary 1 we let. 33:1133–1139. zα be any other unit vector. on Signal Processing. Mitter. Venkatesh. N.e. In Proceedings IFAC Symposium on System Identiﬁcation. Linear and nonlinear algorithms for Inequality (a) follows from the fact that the output Yn identiﬁcation in h with error bounds. 0∈W n Santa Barbara (USA).(Proof of Corollary 1) Consider an element z ∈ H2 with tainty. Probability and Measure. Gu and P. x2 are any two feasible elements satisfying Equation 7. ∗ ≤ Prob{W n } IEEE Transactions on Automatic Control. where x1 . Linear Stochastic Systems. i. We are now left to produce a that. John Wiley & Sons. Consequently. z  ≤ z0 −zα 2 z 2 + zα . M. IEEE Transactions on Automatic Control. translation of the feasible set W n so that it contains D(Yn ) = {h ∈ S : y(k) = λk (h)+w(k). (B) The probability measure of the noise set W ∗ is larger n than the feasible set. IEEE Trans. Quantifying the error in (d) estimated transfer functions with application to modelorder selection. C. (c) [8] L. and S. Let z0 = z/ z 2 be a unit 0 argument there exists a noise set. [2] V. Caines. such that the diameter vector aligned with z. Kacewicz. (b) [7] A. [3] P. h=0. V. 1992.w∈W n W n . Finally. h=0. It follows of uncertainty is smaller than 2 . 2000. It follows that. [4] M. is bounded from below by onehalf of the diameter of uncer. Using the underlying symmetry and monotonicity of the noise distribution around zero. Tsitsiklis. Campi and E. Inc. S. 1992. (d) follows by deﬁnition of W ∗ since it includes all n The local diameter of uncertainty is given by: possible noise elements for which the output is identically diam(Yn ) = sup λ0 (h1 − h2 ) zero and the diameter of uncertainty is constrained to be h1 .. On model error modeling in set membership diam{Yn } ≤ 2 sup Prob{W n } ≤ sup identiﬁcation.w∈W n complexity of worst case identiﬁcation of linear systems. On account of the fact that the measurements are linear and the constraints are convex and balanced. Now. z 2 υα (z) 1 − z 0 − zα 2 = x 1 − x2 2 ≤ z0 −zα min 2 <1 The result now follows by direct substitution. 47:1329–1333. pages WeMD1–3. Giarre. the global diameter of n uncertainty over all feasible outputs Yn is bounded by 2 . Milanese. Garulli and W. Denoting υα (z) = zα . z ≤  z0 −zα . Billingsley. 1994. Variations on a neyman pearson theme.e. By hypothesis of our proposition and the preceding ·.w∈W n W n . P. New York. no other noise set containing From [22] it follows that the estimation error for any estimator zero element can have a larger probability measure. A PPENDIX (Proof of Proposition 1) We deal with the case when γ0 = 0 for simplicity. and B. W ∗ is also convex and balanced. z + zα . z  end. 20:157–166. · denoting the inner product. Wiley. Reinelt.. Model quality evaluation diam{Yn = 0} ≤ 2 Prob{W n } sup sup ≤ in setmembership identiﬁcation. i. 2004. Theodosopoulos. and J. n These two facts together will establish the proposition since we have then found an alternative convex and balanced set to W 0 that has a larger probability measure and has the same n worstcase global diameter of uncertainty. Journal of the Indian Statistical Institute. 42:2906–2914. We are left to establish two n facts: (A) The noise set W ∗ is feasible. 37:1343–1354. the worstcase diameter of uncertainty is realized when the output is identically zero. ≤ sup Prob{W n } diam{Yn } ≤ 2 sup [6] M. Gevers. Automatica. (B) is established by the following sequence of inequalities: W ∗ = {w ∈ IRn  λk (h) + w = 0. (a) 2002. by construction the set W ∗ corresponds to the zero output signal and therefore by the n preceding argument is feasible.w∈W n tion methods. C. W n . which Control. Let Yn denote the column vector of the output signal of length n and D(Yn ) be the set of all h ∈ S that are consistent with the inputoutput data and the noise set. z = x1 − x2 . 1993. it follows that the set. System and Control Letters. Inequality (b) follows from the fact that the worstcase diameter of uncertainty is invariant under translations of noise set W n . Prob{W 0 } ≤ sup Prob{W n } diam{Yn } ≤ 2 sup n [5] M. 1988. convex and balanced set that has the same properties. 0∈W n [9] G. C.. T.h2 ∈D (Yn ) bounded by 2 . Ninness.16 amounts to a relaxation of the constraint. Exponentially weighted least squares identiﬁcation of timevarying systems with white dusturbances. Consequently. Inequality (c) follows from a similar argument. W 0 . The sample Wn h=0. υα (x1 − x2 ) 1 − z 0 − zα 2 R EFERENCES [1] P. 1995. [10] G. Dahleh. B. However. Khargonekar. IEEE Transactions on Automatic in the RHS is restricted to the case when h = 0. 1 ≤ k ≤ n} n where the ﬁrst expression in the last inequality follows cauchyschwartz inequality. and M. The argument for arbitrary γ0 proceeds in a similar manner. λ0 (h) ≤ . 66:292–305. Weyer.
1983. 1996. Ljung. IEEE Transactions on Automatic Control. complexity. 1982. uncertainty. Vicino. 1995. IEEE Transactions on Automatic Control. Heuberger. Switzerland. Automatica. G. Optimal identiﬁcation under bounded disturbances. McGrawHill International Editions. 2001. In Proceedings of the 36th IEEE Conf. [31] A. Ljung. A. Wahlberg. 1991. A. IEEE Transactions on Automatic Control. Rudin. Inc. 1994. and J. N. 2004. A Theory of Learning and Generalization. Tikku. John Wiley and Sons. Dahleh. On systemidentiﬁcation of complexprocesses with ﬁnite data. 1969. M. 1997. Venkatesh and M. Cabin communication system without acoustic echo cancellation. System identiﬁcation for complexsystems: Problem formulation and results. 1987. pages 39–48. Venkatesh and M. 46:235–257. 1991. New York: SpringerVerlag. Makila. Papadimitriou and K. Mitter. 2000. IEEE Transactions on Automatic Control. Robust identiﬁcation and galois sequences. [17] M. Wang. [18] M. 37:900–912. [29] S. 1997. 27:408–414. Dahleh. [35] L. L. 1997. Venkatesh and M. N. Y. R. Van Den Hof. International Journal of Control. [30] S. [13] L. Luenberger. System identiﬁcation using laguerre models. CA. Yuan. IEEE Transactions on Automatic Control. [25] S. [21] W. Finn. and J. Hard frequencydomain model error bounds from leastsquares like identiﬁcation techniques. Asymptotic properties of the least squares method for estimating transfer functions. [23] D. 41:774–785. Lausanne. Milanese and A. [28] S. 1993. [19] C. Yin. IEEE Transactions on Automatic Control. Dahleh. on Decision and Control. D. M. Wang and G. IEEE Transactions on AC. Venkatesh and A. Persistent identiﬁcation of systems with unmodeled dynamics and exogenous disturbances. Dahleh. [33] B. Prentice Hall Englewood Cliffs. 1993. P. Information. The Nature of Statistical Learning Theory. A. . Special Topics in Identiﬁcation Modelling and Control. Poolla and A. Identiﬁcation in the presence of unmodeled dynamics and noise.. Vicino and G. Identiﬁcation with generalized orthonormal basis functions–statistical analysis and error bounds. Estimation theory and uncertainty intervals evaluation in the presence of unknown but bounded errors: Linear families of models and estimators. Wahlberg and L. [20] P. Prentice Hall Englewood Cliffs. 2002. Persistent identiﬁcation of timevarying systems. 27:997–1009. Optimization by vector space methods. 39:944–50. [12] E. [26] S. Ljung and Z. Statistical modeling and estimation with limited data. In International Symposium on Information Theory. NJ. 54:1189–1200.17 [11] P. M. Bokor. 1997. W. Co. [32] M. 1997. Vidyasagar. K. [22] J. AddisonWesley Pub. Y. H. Systems and Control Letters. 1976. 42:66–82. San Diego. Traub. Belforte. 1991. [36] L. pages 514–530. Lehmann. IEEE Transactions on Automatic Control. IEEE Transactions on Automatic Control. New York: SpringerVerlag. [16] P. Milanese and G. [15] D. Vapnik. C. 2nd edn. US Patent #6748086. Sequential approximation of feasible parameter sets for identiﬁcation with set membership identiﬁcation. 1997. Combinatorial Optimization. Testing Statistical Hypotheses. System Identiﬁcation: Theory for the user. 38:1176–90. Tse. 1992. 53:117–125. Steiglitz. [14] L. Principles of Mathematical Analysis. IEEE Transactions on Automatic Control. 1985. 2004. On the time complexity of worst–case system identiﬁcation. Wozniakowski. Venkatesh and S. [34] B. Venkatesh. Tsitsiklis. and H. NJ. [27] S. Springer Verlag: New York. F. Optimal estimation theory for dynamic systems with set membership uncertainty: An overview. Wasilkowski. 45:1246–1256. 42:1620–1635. G. A. Zappa. IEEE Transactions on Automatic Control. pages 551–562. Necessary conditions for identiﬁcation of uncertain systems. [24] V.