
Estimation Theory for Engineers

Roberto Togneri
30th August 2005

1 Applications

Modern estimation theory can be found at the heart of many electronic signal
processing systems designed to extract information. These systems include:

Radar: where the delay of the received pulse echo has to be estimated in the presence of noise

Sonar: where the delay of the received signal from each sensor has to be estimated in the presence of noise

Speech: where the parameters of the speech model have to be estimated in the presence of speech/speaker variability and environmental noise

Image: where the position and orientation of an object from a camera image have to be estimated in the presence of lighting and background noise

Biomedicine: where the heart rate of a fetus has to be estimated in the presence of sensor and environmental noise

Communications: where the carrier frequency of a signal has to be estimated for demodulation to baseband in the presence of degradation noise

Control: where the position of a powerboat has to be estimated for corrective navigation in the presence of sensor and environmental noise

Seismology: where the underground distance of an oil deposit has to be estimated from noisy sound reflections due to the different densities of the oil and rock layers

The majority of applications require estimation of an unknown parameter θ from a
collection of observation data x[n], which also includes artifacts due to
sensor inaccuracies, additive noise, signal distortion (convolutional noise), model
imperfections, unaccounted source variability and multiple interfering signals.

2 Introduction

Define:

x[n] = observation data at sample time n,
x = (x[0] x[1] ... x[N−1])ᵀ = vector of observation samples (N-point data set), and
p(x; θ) = mathematical model (i.e. PDF) of the N-point data set, parametrized by θ.

The problem is to find a function of the N-point data set which provides an
estimate of θ, that is:

θ̂ = g(x = {x[0], x[1], ..., x[N−1]})

where θ̂ is an estimate of θ and g(x) is known as the estimator function.

Once a candidate g(x) is found, then we usually ask:

1. How close will θ̂ be to θ (i.e. how good is our estimator)?

2. Are there better (i.e. closer or optimal) estimators?

A natural optimality criterion is minimisation of the mean square error:

mse(θ̂) = E[(θ̂ − θ)²]

But this does not yield realisable estimator functions which can be written as
functions of the data only:

mse(θ̂) = E[ ( (θ̂ − E(θ̂)) + (E(θ̂) − θ) )² ] = var(θ̂) + [E(θ̂) − θ]²

The bias term, E(θ̂) − θ, is a function of the unknown parameter θ, whereas the
variance of the estimator, var(θ̂), is only a function of the data. Thus an alternative
approach is to assume E(θ̂) = θ and minimise var(θ̂). This produces the Minimum
Variance Unbiased (MVU) estimator.

2.1 Minimum Variance Unbiased (MVU) Estimator


1. The estimator has to be unbiased, that is:

E(θ̂) = θ  for a < θ < b

where [a, b] is the range of interest.

2. The estimator has to have minimum variance:

θ̂_MVU = arg min_θ̂ var(θ̂) = arg min_θ̂ E[(θ̂ − E(θ̂))²]

2.2 EXAMPLE
Consider a fixed signal, A, embedded in a WGN (White Gaussian Noise) signal, w[n]:

x[n] = A + w[n],  n = 0, 1, ..., N−1

where θ = A is the parameter to be estimated from the observed data, x[n].
Consider the sample-mean estimator function:

θ̂ = (1/N) Σ_{n=0}^{N−1} x[n]

Is the sample-mean an MVU estimator for A?

Unbiased?

E(θ̂) = E( (1/N) Σ x[n] ) = (1/N) Σ E(x[n]) = (1/N) Σ A = (1/N) N A = A

Minimum Variance?

var(θ̂) = var( (1/N) Σ x[n] ) = (1/N²) Σ var(x[n]) = (1/N²) N σ² = σ²/N

But is the sample-mean the MVU estimator? It is unbiased but is it minimum
variance? That is, is var(θ̃) ≥ σ²/N for all other unbiased estimator functions θ̃?
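As a quick numerical sanity check (a minimal sketch, not part of the original notes), the following Python/NumPy snippet simulates the DC-level-in-WGN model with assumed illustrative values A = 1, σ² = 4 and N = 100, and compares the empirical mean and variance of the sample-mean estimator with A and σ²/N:

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma2, N, trials = 1.0, 4.0, 100, 20000     # assumed illustrative values

# Generate 'trials' independent N-point data sets x[n] = A + w[n]
x = A + np.sqrt(sigma2) * rng.standard_normal((trials, N))

# Sample-mean estimator for each data set
A_hat = x.mean(axis=1)

print("empirical mean of A_hat:  ", A_hat.mean())   # ~ A (unbiased)
print("empirical var  of A_hat:  ", A_hat.var())    # ~ sigma2 / N
print("theoretical var sigma^2/N:", sigma2 / N)
```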

3 Cramer-Rao Lower Bound (CRLB)

The variance of any unbiased estimator θ̂ must be lower bounded by the CRLB,
with the variance of the MVU estimator attaining the CRLB. That is:

var(θ̂) ≥ 1 / I(θ)

and

var(θ̂_MVU) = 1 / I(θ)

where I(θ) = −E[ ∂² ln p(x;θ) / ∂θ² ] = E[ ( ∂ ln p(x;θ) / ∂θ )² ] is the Fisher information.

Furthermore if, for some functions g and I:

∂ ln p(x;θ)/∂θ = I(θ)( g(x) − θ )

then we can find the MVU estimator as:

θ̂_MVU = g(x)

and the minimum variance is 1/I(θ).

For a p-dimensional vector parameter, θ, the equivalent condition is:

C_θ̂ − I⁻¹(θ) ⪰ 0

i.e. C_θ̂ − I⁻¹(θ) is positive semidefinite, where C_θ̂ = E[(θ̂ − E(θ̂))(θ̂ − E(θ̂))ᵀ] is the covariance
matrix. The Fisher matrix, I(θ), is given as:

[I(θ)]_ij = −E[ ∂² ln p(x;θ) / ∂θ_i ∂θ_j ]

Furthermore if, for some p-dimensional function g and p × p matrix I:

∂ ln p(x;θ)/∂θ = I(θ)( g(x) − θ )

then we can find the MVU estimator as θ̂_MVU = g(x), and the minimum
covariance is I⁻¹(θ).

3.1 EXAMPLE
Consider the case of a signal embedded in noise:

x[n] = A + w[n],  n = 0, 1, ..., N−1

where w[n] is WGN with variance σ², and thus:

p(x; θ) = Π_{n=0}^{N−1} (1/√(2πσ²)) exp( −(1/(2σ²))(x[n]−θ)² )
        = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n]−θ)² )

where θ = A. When considered as a function of the parameter θ (for known x),
p(x;θ) is termed the likelihood function. Taking the first and then second
derivatives:

∂ ln p(x;θ)/∂θ = (1/σ²) Σ_{n=0}^{N−1} (x[n] − θ) = (N/σ²)(x̄ − θ)

∂² ln p(x;θ)/∂θ² = −N/σ²

where x̄ = (1/N) Σ x[n] is the sample mean. For an MVU estimator the lower bound has to apply, that is:

var(θ̂_MVU) = 1 / ( −E[ ∂² ln p(x;θ)/∂θ² ] ) = σ²/N

but we know from the previous example that var(θ̂) = σ²/N for the sample mean, and thus the sample
mean is an MVU estimator. Alternatively we can show this by considering the
first derivative:

∂ ln p(x;θ)/∂θ = (N/σ²)( (1/N) Σ x[n] − θ ) = I(θ)( g(x) − θ )

where I(θ) = N/σ² and g(x) = (1/N) Σ x[n]. Thus the MVU estimator
θ̂ = (1/N) Σ x[n] is indeed the sample mean, with minimum variance I(θ)⁻¹ = σ²/N.

4 Linear Models

If N-point samples of data are observed and modeled as:

x = Hθ + w

where

x = N × 1 observation vector
H = N × p observation matrix
θ = p × 1 vector of parameters to be estimated
w = N × 1 noise vector with PDF N(0, σ²I)

then, using the CRLB theorem, θ̂ = g(x) will be an MVU estimator if:

∂ ln p(x;θ)/∂θ = I(θ)( g(x) − θ )

with C_θ̂ = I⁻¹(θ). So we need to factor:

∂ ln p(x;θ)/∂θ = ∂/∂θ [ −(N/2) ln(2πσ²) − (1/(2σ²))(x − Hθ)ᵀ(x − Hθ) ]

into the form I(θ)( g(x) − θ ). When we do this the MVU estimator for θ is:

θ̂ = (HᵀH)⁻¹Hᵀx

and the covariance matrix of θ̂ is:

C_θ̂ = σ²(HᵀH)⁻¹

4.1 EXAMPLES
4.1.1 Curve Fitting
Consider fitting the data, x(t), by a pth order polynomial function of t:

x(t) = θ₀ + θ₁t + θ₂t² + ... + θ_p t^p + w(t)

Say we have N samples of data, then:

x = [x(t₀), x(t₁), x(t₂), ..., x(t_{N−1})]ᵀ
w = [w(t₀), w(t₁), w(t₂), ..., w(t_{N−1})]ᵀ
θ = [θ₀, θ₁, θ₂, ..., θ_p]ᵀ

so x = Hθ + w, where H is the N × (p+1) matrix:

H = [ 1   t₀        t₀²        ...  t₀^p
      1   t₁        t₁²        ...  t₁^p
      :   :         :               :
      1   t_{N−1}   t_{N−1}²   ...  t_{N−1}^p ]

Hence the MVU estimate of the polynomial coefficients based on the N samples
of data is:

θ̂ = (HᵀH)⁻¹Hᵀx
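A minimal sketch of this curve-fitting estimator in Python/NumPy, using an assumed quadratic model (p = 2) and assumed noise level; the Vandermonde matrix plays the role of H:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative setup: fit a quadratic (p = 2) to noisy samples
p, N, sigma = 2, 50, 0.5
t = np.linspace(0.0, 1.0, N)
theta_true = np.array([1.0, -2.0, 3.0])            # theta_0, theta_1, theta_2
x = np.polyval(theta_true[::-1], t) + sigma * rng.standard_normal(N)

# Observation matrix H with columns 1, t, t^2, ..., t^p
H = np.vander(t, p + 1, increasing=True)

# MVU / least-squares estimate: theta_hat = (H^T H)^{-1} H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
print(theta_hat)                                    # close to theta_true

# Covariance of the estimate: C = sigma^2 (H^T H)^{-1}
C_theta = sigma**2 * np.linalg.inv(H.T @ H)
```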

4.1.2 Fourier Analysis


Consider fitting or representing the N samples of data, x[n], by a linear combination
of sine and cosine functions at different harmonics of the fundamental
with period N samples. This implies that x[n] is a periodic time series with
period N and this type of analysis is known as Fourier analysis. We consider
our model as:

x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

so x = Hθ + w, where:

x = [x[0], x[1], x[2], ..., x[N−1]]ᵀ
w = [w[0], w[1], w[2], ..., w[N−1]]ᵀ
θ = [a₁, a₂, ..., a_M, b₁, b₂, ..., b_M]ᵀ

and H is the N × 2M matrix:

H = [ h₁ᵃ  h₂ᵃ  ...  h_Mᵃ  h₁ᵇ  h₂ᵇ  ...  h_Mᵇ ]

where:

h_kᵃ = [ 1, cos(2πk/N), cos(2πk·2/N), ..., cos(2πk(N−1)/N) ]ᵀ
h_kᵇ = [ 0, sin(2πk/N), sin(2πk·2/N), ..., sin(2πk(N−1)/N) ]ᵀ

Hence the MVU estimate of the Fourier coefficients based on the N samples of
data is:

θ̂ = (HᵀH)⁻¹Hᵀx

After carrying out the simplification the solution can be shown to be:

θ̂ = (2/N) [ (h₁ᵃ)ᵀx, ..., (h_Mᵃ)ᵀx, (h₁ᵇ)ᵀx, ..., (h_Mᵇ)ᵀx ]ᵀ

which is none other than the standard solution found in signal processing textbooks, usually expressed directly as:

â_k = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N)

b̂_k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)

and C_θ̂ = (2σ²/N) I.
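A short sketch of these Fourier-coefficient estimators (assumed values: N = 64 samples, M = 2 harmonics, noise level σ = 0.3):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative setup: two harmonics (M = 2) in WGN
N, M, sigma = 64, 2, 0.3
n = np.arange(N)
a_true, b_true = np.array([1.0, 0.5]), np.array([-0.7, 0.2])

x = sigma * rng.standard_normal(N)
for k in range(1, M + 1):
    x += a_true[k-1]*np.cos(2*np.pi*k*n/N) + b_true[k-1]*np.sin(2*np.pi*k*n/N)

# MVU / Fourier estimates: a_k = (2/N) sum x[n] cos(2 pi k n / N), similarly b_k
a_hat = np.array([2/N * np.sum(x*np.cos(2*np.pi*k*n/N)) for k in range(1, M+1)])
b_hat = np.array([2/N * np.sum(x*np.sin(2*np.pi*k*n/N)) for k in range(1, M+1)])
print(a_hat, b_hat)                                 # close to a_true, b_true
```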

4.1.3 System Identification

We assume a tapped delay line (TDL) or finite impulse response (FIR) model
with p stages or taps of an unknown system with output, x[n]. To identify
the system a known input, u[n], is used to probe the system, which produces the
output:

x[n] = Σ_{k=0}^{p−1} h[k] u[n−k] + w[n]

We assume N input samples are used to yield N output samples and our identification problem is the same as estimation of the linear model parameters for
x = Hθ + w, where:

x = [x[0], x[1], x[2], ..., x[N−1]]ᵀ
w = [w[0], w[1], w[2], ..., w[N−1]]ᵀ
θ = [h[0], h[1], h[2], ..., h[p−1]]ᵀ

and H is the N × p matrix:

H = [ u[0]      0         ...  0
      u[1]      u[0]      ...  0
      :         :              :
      u[N−1]    u[N−2]    ...  u[N−p] ]

The MVU estimate of the system model coefficients is given by:

θ̂ = (HᵀH)⁻¹Hᵀx

where C_θ̂ = σ²(HᵀH)⁻¹. Since C_θ̂ is a function of u[n] we would like to choose u[n]
to achieve minimum variance. It can be shown that the signal we need is a
pseudorandom noise (PRN) sequence which has the property that the autocorrelation function is zero for k ≠ 0, that is:

r_uu[k] = (1/N) Σ_{n=0}^{N−1−k} u[n]u[n+k] = 0   for k ≠ 0

and hence HᵀH = N r_uu[0] I. Define the crosscorrelation function:

r_ux[k] = (1/N) Σ_{n=0}^{N−1−k} u[n]x[n+k]

then the system model coefficients are given by:

ĥ[i] = r_ux[i] / r_uu[0]
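A minimal sketch of this correlation-based identification, using an assumed 4-tap system and a ±1 random probing sequence as a stand-in for a PRN sequence:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed illustrative setup: identify a 4-tap FIR system
h_true = np.array([0.8, -0.4, 0.2, 0.1])
p, N, sigma = len(h_true), 4096, 0.05

u = rng.choice([-1.0, 1.0], size=N)                 # PRN-like probing input
x = np.convolve(u, h_true)[:N] + sigma * rng.standard_normal(N)

# Correlation-based estimate: h[i] ~= r_ux[i] / r_uu[0]
r_uu0 = np.dot(u, u) / N
h_hat = np.array([np.dot(u[:N-k], x[k:]) / N for k in range(p)]) / r_uu0
print(h_hat)                                        # close to h_true
```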

4.2 General Linear Models


In a general linear model two important extensions are:

1. The noise vector, w, is no longer white and has PDF N(0, C) (i.e. general
Gaussian noise).

2. The observed data vector, x, also includes the contribution of known signal
components, s.

Thus the general linear model for the observed data is expressed as:

x = Hθ + s + w

where:

s = N × 1 vector of known signal samples
w = N × 1 noise vector with PDF N(0, C)

Our solution for the simple linear model where the noise is assumed white can
be used after applying a suitable whitening transformation. If we factor the
noise covariance matrix as:

C⁻¹ = DᵀD

then the matrix D is the required transformation since:

E[wwᵀ] = C  ⟹  E[(Dw)(Dw)ᵀ] = DCDᵀ = (DD⁻¹)(D⁻ᵀDᵀ) = I

that is, w′ = Dw has PDF N(0, I). Thus by transforming the general linear
model:

x = Hθ + s + w

to:

x′ = Dx = DHθ + Ds + Dw
x′ = H′θ + s′ + w′

or:

x″ = x′ − s′ = H′θ + w′

we can then write the MVU estimator of θ given the observed data x″ as:

θ̂ = (H′ᵀH′)⁻¹H′ᵀx″ = (HᵀDᵀDH)⁻¹HᵀDᵀD(x − s)

That is:

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹(x − s)

and the covariance matrix is:

C_θ̂ = (HᵀC⁻¹H)⁻¹
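A minimal sketch of the general-linear-model estimator (a generalised least-squares computation), with an assumed exponential noise correlation and an assumed known signal component:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed illustrative setup: x = H theta + s + w, with w ~ N(0, C) (correlated)
N, p = 30, 2
H = rng.standard_normal((N, p))
s = 0.5 * np.ones(N)                                # known signal component
theta_true = np.array([1.0, -2.0])

# Correlated noise covariance (exponential correlation, assumed for illustration)
C = 0.2 * 0.9 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta_true + s + w

# MVU estimate: theta = (H^T C^-1 H)^-1 H^T C^-1 (x - s)
Cinv = np.linalg.inv(C)
theta_hat = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ (x - s))
C_theta = np.linalg.inv(H.T @ Cinv @ H)             # covariance of the estimate
print(theta_hat)
```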

5 General MVU Estimation

5.1 Sufficient Statistic

For the cases where the CRLB cannot be established a more general approach
to finding the MVU estimator is required. We need to first find a sufficient
statistic for the unknown parameter θ.

Neyman-Fisher Factorization: If we can factor the PDF p(x;θ) as:

p(x;θ) = g(T(x), θ) h(x)

where g(T(x), θ) is a function of T(x) and θ only and h(x) is a function of x
only, then T(x) is a sufficient statistic for θ. Conceptually one expects that the
PDF after the sufficient statistic has been observed, p(x|T(x) = T₀; θ), should
not depend on θ since T(x) is sufficient for the estimation of θ and no more
knowledge can be gained about θ once we know T(x).

5.1.1 EXAMPLE
Consider a signal embedded in a WGN signal:

x[n] = A + w[n]

where θ = A is the unknown parameter we want to estimate. Then:

p(x;θ) = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n]−θ)² )

We factor as follows:

p(x;θ) = [ (1/(2πσ²)^{N/2}) exp( −(1/(2σ²))( Nθ² − 2θ Σ_{n=0}^{N−1} x[n] ) ) ] · [ exp( −(1/(2σ²)) Σ_{n=0}^{N−1} x²[n] ) ]
        = g(T(x), θ) · h(x)

where we define T(x) = Σ_{n=0}^{N−1} x[n], which is a sufficient statistic for θ.

5.2 MVU Estimator


There are two different ways one may derive the MVU estimator based on the
sufficient statistic, T(x):

1. Let θ̌ be any unbiased estimator of θ. Then

θ̂ = E(θ̌|T(x)) = ∫ θ̌ p(θ̌|T(x)) dθ̌

is the MVU estimator.

2. Find some function g such that θ̂ = g(T(x)) is an unbiased estimator of θ,
that is E[g(T(x))] = θ. Then θ̂ = g(T(x)) is the MVU estimator.

The Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem tells us that θ̂ = E(θ̌|T(x)) is:

1. A valid estimator for θ

2. Unbiased

3. Of lesser or equal variance than that of θ̌, for all θ

4. The MVU estimator if the sufficient statistic, T(x), is complete.

The sufficient statistic, T(x), is complete if there is only one function g(T(x))
that is unbiased. That is, if h(T(x)) is another unbiased estimator (i.e. E[h(T(x))] = θ)
then we must have g = h if T(x) is complete.

5.2.1 EXAMPLE
Consider the previous example of a signal embedded in a WGN signal:

x[n] = A + w[n]

where we derived the sufficient statistic, T(x) = Σ_{n=0}^{N−1} x[n]. Using
method 2 we need to find a function g such that E[g(T(x))] = θ = A. Now:

E[T(x)] = E[ Σ_{n=0}^{N−1} x[n] ] = Σ_{n=0}^{N−1} E[x[n]] = Nθ

It is obvious that:

E[ (1/N) Σ_{n=0}^{N−1} x[n] ] = θ

and thus θ̂ = g(T(x)) = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean we have
already seen before, is the MVU estimator for θ.

6 Best Linear Unbiased Estimators (BLUEs)

It may occur that the MVU estimator or a sufficient statistic cannot be found or,
indeed, the PDF of the data is itself unknown (only the second-order statistics
are known). In such cases one solution is to assume a functional model of the
estimator, as being linear in the data, and find the linear estimator which is
both unbiased and has minimum variance, i.e. the BLUE.

For the general vector case we want our estimator to be a linear function of the
data, that is:

θ̂ = Ax

Our first requirement is that the estimator be unbiased, that is:

E(θ̂) = AE(x) = θ

which can only be satisfied if:

E(x) = Hθ

i.e. AH = I. The BLUE is derived by finding the A which minimises the
variance, C_θ̂ = ACAᵀ, where C = E[(x − E(x))(x − E(x))ᵀ] is the covariance
of the data x, subject to the constraint AH = I. Carrying out this minimisation
yields the following for the BLUE:

θ̂ = Ax = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

where C_θ̂ = (HᵀC⁻¹H)⁻¹. The form of the BLUE is identical to the MVU
estimator for the general linear model. The crucial difference is that the BLUE
does not make any assumptions on the PDF of the data (or noise) whereas the
MVU estimator was derived assuming Gaussian noise. Of course, if the data is
truly Gaussian then the BLUE is also the MVU estimator.

The BLUE for the general linear model can be stated as follows:

Gauss-Markov Theorem: Consider a general linear model of the form:

x = Hθ + w

where H is known, and w is noise with covariance C (the PDF of w is otherwise
arbitrary), then the BLUE of θ is:

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

where C_θ̂ = (HᵀC⁻¹H)⁻¹ is the minimum covariance.

6.1 EXAMPLE
Consider a signal embedded in noise:

x[n] = A + w[n]

where w[n] is of unspecified PDF with var(w[n]) = σ_n², and the unknown parameter
θ = A is to be estimated. We assume a BLUE estimate and we derive
H by noting:

E[x] = 1θ

where x = [x[0], x[1], x[2], ..., x[N−1]]ᵀ, 1 = [1, 1, 1, ..., 1]ᵀ and we have H ≡ 1.
Also:

C = diag(σ₀², σ₁², ..., σ²_{N−1})

C⁻¹ = diag(1/σ₀², 1/σ₁², ..., 1/σ²_{N−1})

and hence the BLUE is:

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x = (1ᵀC⁻¹x)/(1ᵀC⁻¹1) = ( Σ_{n=0}^{N−1} x[n]/σ_n² ) / ( Σ_{n=0}^{N−1} 1/σ_n² )

and the minimum covariance is:

C_θ̂ = var(θ̂) = 1/(1ᵀC⁻¹1) = 1 / ( Σ_{n=0}^{N−1} 1/σ_n² )

and we note that in the case of white noise where σ_n² = σ² then we get the
sample mean:

θ̂ = (1/N) Σ_{n=0}^{N−1} x[n]

and minimum variance var(θ̂) = σ²/N.
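A minimal sketch of this BLUE (an inverse-variance weighted average), with assumed per-sample noise variances drawn at random:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed illustrative setup: x[n] = A + w[n] with known, unequal noise variances
A, N = 2.0, 200
sigma2_n = rng.uniform(0.5, 4.0, size=N)
x = A + np.sqrt(sigma2_n) * rng.standard_normal(N)

# BLUE: inverse-variance weighted average of the samples
w = 1.0 / sigma2_n
A_blue = np.sum(w * x) / np.sum(w)
var_blue = 1.0 / np.sum(w)

print(A_blue, var_blue)
print(x.mean())                                     # plain sample mean for comparison
```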

7 Maximum Likelihood Estimation (MLE)

7.1 Basic MLE Procedure

In some cases the MVU estimator may not exist or it cannot be found by any of
the methods discussed so far. The MLE approach is an alternative method in
cases where the PDF is known. With MLE the unknown parameter is estimated
by maximising the PDF. That is, define θ̂ such that:

θ̂ = arg max_θ p(x;θ)

where x is the vector of observed data (of N samples). It can be shown that θ̂
is asymptotically unbiased:

lim_{N→∞} E(θ̂) = θ

and asymptotically efficient:

lim_{N→∞} var(θ̂) = CRLB

An important result is that if an MVU estimator exists, then the MLE
procedure will produce it. An important observation is that unlike the
previous estimates the MLE does not require an explicit expression for p(x;θ)!
Indeed, given a histogram plot of the PDF as a function of θ one can
numerically search for the θ that maximises the PDF.

7.1.1 EXAMPLE
Consider the signal embedded in noise problem:

x[n] = A + w[n]

where w[n] is WGN with zero mean but unknown variance which is also A; that
is, the unknown parameter, θ = A, manifests itself both as the unknown signal
and the variance of the noise. Although a highly unlikely scenario, this simple
example demonstrates the power of the MLE approach since finding the MVU
estimator by the previous procedures is not easy. Consider the PDF:

p(x;θ) = (1/(2πθ)^{N/2}) exp( −(1/(2θ)) Σ_{n=0}^{N−1} (x[n]−θ)² )

We consider p(x;θ) as a function of θ, thus it is a likelihood function, and we
need to maximise it wrt θ. For Gaussian PDFs it is easier to find the maximum
of the log-likelihood function:

ln p(x;θ) = ln (1/(2πθ)^{N/2}) − (1/(2θ)) Σ_{n=0}^{N−1} (x[n]−θ)²

Differentiating we have:

∂ ln p(x;θ)/∂θ = −N/(2θ) + (1/θ) Σ_{n=0}^{N−1} (x[n]−θ) + (1/(2θ²)) Σ_{n=0}^{N−1} (x[n]−θ)²

and setting the derivative to zero and solving for θ produces the MLE estimate:

θ̂ = −1/2 + sqrt( (1/N) Σ_{n=0}^{N−1} x²[n] + 1/4 )

where we have assumed θ > 0. It can be shown that:

lim_{N→∞} E(θ̂) = θ

and:

lim_{N→∞} var(θ̂) = CRLB = θ² / ( N(θ + 1/2) )
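A minimal numerical sketch of this example, comparing the closed-form MLE with a direct numerical maximisation of the log-likelihood (assumed true value A = 3; the SciPy bounded scalar minimiser is used on the negative log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)

# Assumed illustrative setup: x[n] = A + w[n], var(w[n]) = A, with A = 3
A, N = 3.0, 500
x = A + np.sqrt(A) * rng.standard_normal(N)

# Closed-form MLE: A_hat = -1/2 + sqrt(mean(x^2) + 1/4)
A_closed = -0.5 + np.sqrt(np.mean(x**2) + 0.25)

# Numerical MLE: maximise the log-likelihood over theta > 0
def neg_log_like(theta):
    return N/2*np.log(2*np.pi*theta) + np.sum((x - theta)**2)/(2*theta)

A_numeric = minimize_scalar(neg_log_like, bounds=(1e-3, 20.0), method="bounded").x
print(A_closed, A_numeric)                          # the two estimates agree
```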

7.1.2 EXAMPLE
Consider a signal embedded in noise:

x[n] = A + w[n]

where w[n] is WGN with zero mean and known variance σ². We know the MVU
estimator for θ is the sample mean. To see that this is also the MLE, we consider
the PDF:

p(x;θ) = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n]−θ)² )

and maximise the log-likelihood function by setting its derivative to zero:

∂ ln p(x;θ)/∂θ = (1/σ²) Σ_{n=0}^{N−1} (x[n]−θ) = 0  ⟹  Σ_{n=0}^{N−1} x[n] − Nθ = 0

thus θ̂ = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean.

7.2 MLE for Transformed Parameters


The MLE of the transformed parameter, α = g(θ), is given by:

α̂ = g(θ̂)

where θ̂ is the MLE of θ. If g is not a one-to-one function (i.e. not invertible)
then α̂ is obtained as the MLE of the transformed likelihood function, p_T(x;α),
which is defined as:

p_T(x;α) = max_{θ: α = g(θ)} p(x;θ)

7.3 MLE for the General Linear Model


Consider the general linear model of the form:

x = Hθ + w

where H is a known N × p matrix, x is the N × 1 observation vector with N
samples, and w is a noise vector of dimension N × 1 with PDF N(0, C). The
PDF is:

p(x;θ) = (1/((2π)^{N/2} det^{1/2}(C))) exp( −(1/2)(x − Hθ)ᵀC⁻¹(x − Hθ) )

and the MLE of θ is found by differentiating the log-likelihood which can be
shown to yield:

∂ ln p(x;θ)/∂θ = (∂(Hθ)ᵀ/∂θ) C⁻¹(x − Hθ)

which upon simplification and setting to zero becomes:

HᵀC⁻¹(x − Hθ) = 0

and thus we obtain the MLE of θ as:

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

which is the same as the MVU estimator.

7.4 EM Algorithm
We wish to use the MLE procedure to find an estimate for the unknown parameter
θ, which requires maximisation of the log-likelihood function, ln p_x(x;θ).
However we may find that this is either too difficult or there are difficulties in
finding an expression for the PDF itself. In such circumstances, where direct
expression and maximisation of the PDF in terms of the observed data x is
difficult or intractable, an iterative solution is possible if another data set, y,
can be found such that the PDF in terms of the y data set is much easier to express in
closed form and maximise. We term the data set y the complete data and the
original data x the incomplete data. In general we can find a mapping from the
complete to the incomplete data:

x = g(y)

however this is usually a many-to-one transformation in that a subset of the
complete data will map to the same incomplete data (e.g. the incomplete data
may represent an accumulation or sum of the complete data). This explains the
terminology: the data, x, is incomplete (or is missing something) relative to the data, y,
which is complete (for performing the MLE procedure). This is usually not
evident, however, until one is able to define what the complete data set is. Unfortunately, defining what constitutes the complete data is usually an arbitrary
procedure which is highly problem specific.

7.4.1 EXAMPLE
Consider spectral analysis where a known signal, x[n], is composed of an unknown
summation of harmonic components embedded in noise:

x[n] = Σ_{i=1}^{p} cos(2πf_i n) + w[n],  n = 0, 1, ..., N−1

where w[n] is WGN with known variance σ² and the unknown parameter vector
to be estimated is the group of frequencies: θ = f = [f₁ f₂ ... f_p]ᵀ. The
standard MLE would require maximisation of the log-likelihood of a multivariate Gaussian distribution which is equivalent to minimising the argument
of the exponential:

J(f) = Σ_{n=0}^{N−1} ( x[n] − Σ_{i=1}^{p} cos(2πf_i n) )²

which is a p-dimensional minimisation problem (hard!). On the other hand if
we had access to the individual harmonic signals embedded in noise:

y_i[n] = cos(2πf_i n) + w_i[n],  i = 1, 2, ..., p;  n = 0, 1, ..., N−1

where w_i[n] is WGN with known variance σ_i², then the MLE procedure would
result in minimisation of:

J(f_i) = Σ_{n=0}^{N−1} ( y_i[n] − cos(2πf_i n) )²,  i = 1, 2, ..., p

which are p independent one-dimensional minimisation problems (easy!). Thus
y is the complete data set that we are looking for to facilitate the MLE procedure. However we do not have access to y. The relationship to the known data,
x = g(y), is:

x[n] = Σ_{i=1}^{p} y_i[n],  n = 0, 1, ..., N−1

and we further assume that:

w[n] = Σ_{i=1}^{p} w_i[n],   σ² = Σ_{i=1}^{p} σ_i²

However since the mapping from y to x is many-to-one we cannot directly form
an expression for the PDF p_y(y;θ) in terms of the known x (since we can't do
the obvious substitution of y = g⁻¹(x)). Once we have found the complete data set,
y, even though an expression for ln p_y(y;θ) can now easily be derived, we can't
directly maximise ln p_y(y;θ) wrt θ since y is unavailable. However we know x, and if we
further assume that we have a good guess estimate for θ, then we can consider
the expected value of ln p_y(y;θ) conditioned on what we know:

E[ln p_y(y;θ)|x; θ] = ∫ ln p_y(y;θ) p(y|x;θ) dy

and we attempt to maximise this expectation to yield, not the MLE, but the next
best-guess estimate for θ. What we do is iterate through both an E-step (find
the expression for the Expectation) and an M-step (Maximisation of the expectation),
hence the name EM algorithm. Specifically:

Expectation (E): Determine the average or expectation of the log-likelihood
of the complete data given the known incomplete or observed data and current
estimate of the parameter vector:

Q(θ, θ_k) = ∫ ln p_y(y;θ) p(y|x;θ_k) dy

Maximisation (M): Maximise the average log-likelihood of the complete data
or Q-function to obtain the next estimate of the parameter vector:

θ_{k+1} = arg max_θ Q(θ, θ_k)

Convergence of the EM algorithm is guaranteed (under mild conditions) in the
sense that the average log-likelihood of the complete data does not decrease at
each iteration, that is:

Q(θ, θ_{k+1}) ≥ Q(θ, θ_k)

with equality when θ_k is the MLE. The three main attributes of the EM algorithm are:

1. An initial value for the unknown parameter is needed and, as with most
iterative procedures, a good initial estimate is required for good convergence.

2. The selection of the complete data set is arbitrary.

3. Although ln p_y(y;θ) can usually be easily expressed in closed form, finding
the closed form expression for the expectation is usually harder.

7.4.2 EXAMPLE
Applying the EM algorithm to the previous example requires finding a closed
expression for the average log-likelihood.

E-step: We start with finding an expression for ln p_y(y;θ) in terms of the
complete data:

ln p_y(y;θ) = h(y) + Σ_{i=1}^{p} (1/σ_i²) Σ_{n=0}^{N−1} y_i[n] cos(2πf_i n)
            = h(y) + Σ_{i=1}^{p} (1/σ_i²) c_iᵀ y_i

where the terms in h(y) do not depend on θ, c_i = [1, cos(2πf_i), cos(2πf_i(2)), ..., cos(2πf_i(N−1))]ᵀ
and y_i = [y_i[0], y_i[1], y_i[2], ..., y_i[N−1]]ᵀ. We write the conditional expectation as:

Q(θ, θ_k) = E[ln p_y(y;θ)|x; θ_k] = E(h(y)|x;θ_k) + Σ_{i=1}^{p} (1/σ_i²) c_iᵀ E(y_i|x;θ_k)

Since we wish to maximise Q(θ, θ_k) wrt θ then this is equivalent to maximising:

Q′(θ, θ_k) = Σ_{i=1}^{p} c_iᵀ E(y_i|x;θ_k)

We note that E(y_i|x;θ_k) can be thought of as an estimate of the y_i[n] data set
given the observed data set x[n] and current estimate θ_k. Since y is Gaussian
then x is a sum of Gaussians and thus x and y are jointly Gaussian and one of
the standard results is:

E(y|x;θ_k) = E(y) + C_yx C_xx⁻¹ (x − E(x))

and application of this yields:

ŷ_i = E(y_i|x;θ_k) = c_i + (σ_i²/σ²)( x − Σ_{i=1}^{p} c_i )

or, per sample:

ŷ_i[n] = cos(2πf_i^k n) + (σ_i²/σ²)( x[n] − Σ_{i=1}^{p} cos(2πf_i^k n) )

Thus:

Q′(θ, θ_k) = Σ_{i=1}^{p} c_iᵀ ŷ_i

M-step: Maximisation of Q′(θ, θ_k) consists of maximising each term in the
sum separately, or:

f_i^{k+1} = arg max_{f_i} c_iᵀ ŷ_i

Furthermore, since we assumed σ² = Σ_{i=1}^{p} σ_i², we still have the problem that we
don't know what the σ_i² are. However as long as:

σ² = Σ_{i=1}^{p} σ_i²   ⟺   Σ_{i=1}^{p} σ_i²/σ² = 1

then we can choose these values arbitrarily.

8 Least Squares Estimation (LSE)

8.1 Basic LSE Procedure

The MVU, BLUE and MLE estimators developed previously required an expression for the PDF p(x;θ) in order to estimate the unknown parameter θ in
some optimal fashion. An alternative approach is to assume a signal model
(rather than probabilistic assumptions about the data) and achieve a design
goal assuming this model. With the Least Squares (LS) approach we assume
that the signal model is a function of the unknown parameter θ and produces a
signal:

s[n] ≡ s(n;θ)

where s(n;θ) is a function of n and parameterised by θ. Due to noise and model
inaccuracies, w[n], the signal s[n] can only be observed as:

x[n] = s[n] + w[n]

Unlike previous approaches no statement is made on the probabilistic distribution
nature of w[n]. We only state that what we have is an error: e[n] = x[n] − s[n],
which with the appropriate choice of θ should be minimised in a
least-squares sense. That is we choose θ = θ̂ so that the criterion:

J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])²

is minimised over the N observation samples of interest and we call this the LSE
of θ. More precisely we have:

θ̂ = arg min_θ J(θ)

and the minimum LS error is given by:

J_min = J(θ̂)

An important assumption to produce a meaningful unbiased estimate is that
the noise and model inaccuracies, w[n], have zero mean. However no other
probabilistic assumptions about the data are made (i.e. LSE is valid for both
Gaussian and non-Gaussian noise), although by the same token we also cannot
make any optimality claims with LSE (since these would depend on the
distribution of the noise and modelling errors).

A problem that arises from assuming a signal model function s(n;θ) rather than
knowledge of p(x;θ) is the need to choose an appropriate signal model. Then
again, in order to obtain a closed form or parametric expression for p(x;θ)
one usually needs to know what the underlying model and noise characteristics
are anyway.

8.1.1 EXAMPLE
Consider observations, x[n], arising from a DC-level signal model, s[n] = s(n;θ) = θ:

x[n] = θ + w[n]

where θ is the unknown parameter to be estimated. Then we have:

J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])² = Σ_{n=0}^{N−1} (x[n] − θ)²

Differentiating wrt θ and setting to zero:

∂J(θ)/∂θ = 0  ⟹  Σ_{n=0}^{N−1} (x[n] − θ̂) = 0  ⟹  Σ_{n=0}^{N−1} x[n] − Nθ̂ = 0

and hence θ̂ = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean. We also have that:

J_min = J(θ̂) = Σ_{n=0}^{N−1} x²[n] − (1/N)( Σ_{n=0}^{N−1} x[n] )²

8.2 Linear Least Squares


We assume the signal model is a linear function of the parameter, that is:

s = Hθ

where s = [s[0], s[1], ..., s[N−1]]ᵀ and H is a known N × p matrix with
θ = [θ₁, θ₂, ..., θ_p]ᵀ. Now with x = [x[0], x[1], ..., x[N−1]]ᵀ and x = Hθ + w we have:

J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])² = (x − Hθ)ᵀ(x − Hθ)

Differentiating and setting to zero:

∂J(θ)/∂θ = −2Hᵀx + 2HᵀHθ̂ = 0

yields the required LSE:

θ̂ = (HᵀH)⁻¹Hᵀx

which, surprise, surprise, is the identical functional form of the MVU estimator
for the linear model.

An interesting extension to the linear LS is the weighted LS where the contribution
to the error from each component of the parameter vector can be weighted
in importance by using a different form of the error criterion:

J(θ) = (x − Hθ)ᵀW(x − Hθ)

where W is an N × N positive definite (symmetric) weighting matrix.

8.3 Order-Recursive Least Squares


In many cases the signal model is unknown and must be assumed. Obviously
we would like to choose the model, s(θ), that minimises J_min, that is:

s_best(θ) = arg min_{s(θ)} J_min

We can do this arbitrarily by simply choosing models, obtaining the LSE θ̂, and
then selecting the model which provides the smallest J_min. However, models
are not arbitrary and some models are more complex (or more precisely have
a larger number of parameters or degrees of freedom) than others. The more
complex a model the lower the J_min one can expect, but also the more likely
the model is to overfit the data or be overtrained (i.e. fit the noise and not
generalise to other data sets).

8.3.1 EXAMPLE

Consider the case of line fitting where we have observations x(t) plotted against
the sample time index t and we would like to fit the best line to the data. But
what line do we fit: s(t;θ) = θ₁, a constant? s(t;θ) = θ₁ + θ₂t, a straight line?
s(t;θ) = θ₁ + θ₂t + θ₃t², a quadratic? etc. Each case represents an increase in the
order of the model (i.e. order of the polynomial fit), or the number of parameters
to be estimated, and a consequent increase in the modelling power. A polynomial
fit represents a linear model, s = Hθ, where s = [s(0), s(1), ..., s(N−1)]ᵀ and:

Constant:  θ = [θ₁]ᵀ and H = [1 1 ... 1]ᵀ is an N × 1 matrix

Linear:    θ = [θ₁, θ₂]ᵀ and H = [ 1  0
                                    1  1
                                    :  :
                                    1  N−1 ] is an N × 2 matrix

Quadratic: θ = [θ₁, θ₂, θ₃]ᵀ and H = [ 1  0    0
                                        1  1    1
                                        1  2    4
                                        :  :    :
                                        1  N−1  (N−1)² ] is an N × 3 matrix

and so on. If the underlying model is indeed a straight line then we would expect
not only that the minimum J_min will result with a straight line model but also that
higher order polynomial models (e.g. quadratic, cubic, etc.) will yield the same
J_min (indeed higher-order models would degenerate to a straight line model,
except in cases of overfitting). Thus the straight line model is the best model
to use.

An alternative to providing an independent LSE for each possible signal model
is a more efficient order-recursive LSE, which is possible if the models are different orders
of the same base model (e.g. polynomials of different degree). In this method
the LSE is updated in order (of increasing number of parameters). Specifically, define θ̂_k as
the LSE of order k (i.e. k parameters to be estimated). Then for a linear model
we can derive the order-recursive LSE as:

θ̂_{k+1} = θ̂_k + UPDATE_k

However the success of this approach depends on proper formulation of the linear
models in order to facilitate the derivation of the recursive update. For example,
if H has orthonormal column vectors then the LSE is equivalent to projecting
the observation x onto the space spanned by the orthonormal column vectors of
H. Since increasing the order implies increasing the dimensionality of the space
by just adding another column to H, this allows a recursive update relationship
to be derived.

8.4 Sequential Least Squares


In most signal processing applications the observation samples arrive as a
stream of data. All our estimation strategies have assumed a batch or block
mode of processing whereby we wait for N samples to arrive and then form our
estimate based on these samples. One problem is the delay in waiting for the
N samples before we produce our estimate; another problem is that as more
data arrives we have to repeat the calculations on the larger blocks of data (N
increases as more data arrives). The latter not only implies a growing computational burden but also the fact that we have to buffer all the data we have seen;
both will grow linearly with the number of samples we have. Since in signal
processing applications samples arise from sampling a continuous process, our
computational and memory burden will grow linearly with time! One solution
is to use a sequential mode of processing where the parameter estimate for n
samples, θ̂[n], is derived from the previous parameter estimate for n−1 samples,
θ̂[n−1]. For linear models we can represent sequential LSE as:

θ̂[n] = θ̂[n−1] + K[n]( x[n] − ŝ[n|n−1] )

where ŝ[n|n−1] ≡ s(n; θ̂[n−1]). The K[n] is the correction gain and
(x[n] − ŝ[n|n−1]) is the prediction error. The magnitude of the correction gain
K[n] is usually directly related to the value of the estimator error variance,
var(θ̂[n−1]), with a larger variance yielding a larger correction gain. This behaviour is reasonable since a larger variance implies a poorly estimated parameter (which
should have minimum variance) and a larger correction gain is expected. Thus
one expects the variance to decrease with more samples and the estimated parameter to converge to the true value (or the LSE with an infinite number of
samples).

8.4.1 EXAMPLE
We consider the specific case of a linear model with a vector parameter:

s[n] = H[n]θ[n]

The most interesting example of sequential LS arises with the weighted LS
error criterion with W[n] = C⁻¹[n], where C[n] is the covariance matrix of the
zero-mean noise, w[n], which is assumed to be uncorrelated. The argument [n]
implies that the vectors are based on n sample observations. We also consider:

C[n] = diag(σ₀², σ₁², ..., σ_n²)

H[n] = [ H[n−1]
         hᵀ[n]  ]

where hᵀ[n] is the nth row vector of the n × p matrix H[n]. It should be obvious
that:

s[n−1] = H[n−1]θ[n−1]

We also have that ŝ[n|n−1] = hᵀ[n]θ̂[n−1]. So the estimator update is:

θ̂[n] = θ̂[n−1] + K[n]( x[n] − hᵀ[n]θ̂[n−1] )

Let Σ[n] = C_θ̂[n] be the covariance matrix of θ̂ based on n samples of data; then
it can be shown that:

K[n] = Σ[n−1]h[n] / ( σ_n² + hᵀ[n]Σ[n−1]h[n] )

and furthermore we can also derive a covariance update:

Σ[n] = (I − K[n]hᵀ[n])Σ[n−1]

yielding a wholly recursive procedure requiring only knowledge of the observation
data x[n] and the initialisation values θ̂[−1] and Σ[−1], the initial estimate
of the parameter and an initial estimate of the parameter covariance matrix.
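A minimal sketch of this sequential update rule (the estimation of a DC level with a vague initial covariance is an assumed illustration, not from the notes):

```python
import numpy as np

def sequential_ls_update(theta, Sigma, h, x_n, sigma2_n):
    """One sequential least-squares update, using the notation of the notes.

    theta    : current estimate theta[n-1]             (p,)
    Sigma    : current covariance Sigma[n-1]           (p, p)
    h        : new observation row h[n]                (p,)
    x_n      : new scalar observation x[n]
    sigma2_n : noise variance of x[n]
    """
    # Gain: K[n] = Sigma[n-1] h[n] / (sigma_n^2 + h^T Sigma[n-1] h)
    K = Sigma @ h / (sigma2_n + h @ Sigma @ h)
    # Estimator update with the prediction error x[n] - h^T theta[n-1]
    theta_new = theta + K * (x_n - h @ theta)
    # Covariance update: Sigma[n] = (I - K h^T) Sigma[n-1]
    Sigma_new = (np.eye(len(theta)) - np.outer(K, h)) @ Sigma
    return theta_new, Sigma_new

# Illustrative use (assumed values): estimate a DC level A, so h[n] = [1]
rng = np.random.default_rng(7)
theta, Sigma = np.zeros(1), np.eye(1) * 1e6         # vague initialisation
for x_n in 2.0 + rng.standard_normal(100):
    theta, Sigma = sequential_ls_update(theta, Sigma, np.ones(1), x_n, 1.0)
print(theta, Sigma)                                  # ~ [2.0], small variance
```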

8.5 Constrained Least Squares


Assume that in the vector parameter LSE problem we are aware of constraints
on the individual parameters, that is the p-dimensional θ is subject to r < p independent
linear constraints. The constraints can be summarised by the condition
that θ satisfy the following system of linear constraint equations:

Aθ = b

Then, using the technique of Lagrangian multipliers, our LSE problem is that of
minimising the following Lagrangian error criterion:

J_c = (x − Hθ)ᵀ(x − Hθ) + λᵀ(Aθ − b)

Let θ̂ be the unconstrained LSE of the linear model, then the expression for
the constrained estimate is:

θ̂_c = θ̂ + (HᵀH)⁻¹Aᵀ[ A(HᵀH)⁻¹Aᵀ ]⁻¹( b − Aθ̂ )

where θ̂ = (HᵀH)⁻¹Hᵀx is the unconstrained LSE.
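A minimal sketch of the constrained LSE, using an assumed constraint that the three parameters sum to one:

```python
import numpy as np

rng = np.random.default_rng(8)

# Assumed illustrative setup: 3 parameters constrained so that they sum to 1
N, p = 40, 3
H = rng.standard_normal((N, p))
theta_true = np.array([0.2, 0.3, 0.5])              # satisfies the constraint
x = H @ theta_true + 0.1 * rng.standard_normal(N)

A, b = np.ones((1, p)), np.array([1.0])              # constraint: A theta = b

# Unconstrained LSE
HtH_inv = np.linalg.inv(H.T @ H)
theta_u = HtH_inv @ H.T @ x

# Constrained LSE: theta_c = theta_u + (H^T H)^-1 A^T [A (H^T H)^-1 A^T]^-1 (b - A theta_u)
correction = HtH_inv @ A.T @ np.linalg.solve(A @ HtH_inv @ A.T, b - A @ theta_u)
theta_c = theta_u + correction
print(theta_c, theta_c.sum())                        # estimate sums to 1.0
```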

8.6 Nonlinear Least Squares


So far we have assumed a linear signal model: s(θ) = Hθ, where the notation
for the signal model, s(θ), explicitly shows its dependence on the parameter θ.
In general the signal model will be an N-dimensional nonlinear function of the
p-dimensional parameter θ. In such a case the minimisation of:

J = (x − s(θ))ᵀ(x − s(θ))

becomes much more difficult. Differentiating wrt θ and setting to zero yields:

(∂s(θ)ᵀ/∂θ)(x − s(θ)) = 0

which requires the solution of p nonlinear simultaneous equations. Approximate
solutions based on linearization of the problem exist which require iteration until
convergence.

8.6.1 Newton-Raphson Method

Define:

g(θ) = (∂s(θ)ᵀ/∂θ)(x − s(θ))

So the problem becomes that of finding the zero of a nonlinear function, that
is: g(θ) = 0. The Newton-Raphson method linearizes the function about the
initialised value θ_k and directly solves for the zero of the function to produce
the next estimate:

θ_{k+1} = θ_k − [ ∂g(θ)/∂θ ]⁻¹ g(θ) |_{θ = θ_k}

Of course if the function were linear the next estimate would be the correct value,
but since the function is nonlinear this will not be the case and the procedure
is iterated until the estimates converge.

8.6.2 Gauss-Newton Method

We linearize the signal model about the known (i.e. initial guess or current
estimate) θ_k:

s(θ) ≈ s(θ_k) + ∂s(θ)/∂θ |_{θ = θ_k} (θ − θ_k)

and the LSE minimisation problem can be shown to become the minimisation of:

J = ( x̃(θ_k) − H(θ_k)θ )ᵀ( x̃(θ_k) − H(θ_k)θ )

where x̃(θ_k) = x − s(θ_k) + H(θ_k)θ_k is known and:

H(θ_k) = ∂s(θ)/∂θ |_{θ = θ_k}

Based on the standard solution to the LSE of the linear model, we have that:

θ_{k+1} = θ_k + ( Hᵀ(θ_k)H(θ_k) )⁻¹ Hᵀ(θ_k)( x − s(θ_k) )
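A minimal sketch of the Gauss-Newton iteration for an assumed scalar nonlinear model s[n] = exp(−θn) (this model and its values are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(9)

# Assumed illustrative nonlinear model: s[n] = exp(-theta * n), scalar theta
N, theta_true, sigma = 50, 0.1, 0.02
n = np.arange(N)
x = np.exp(-theta_true * n) + sigma * rng.standard_normal(N)

def s(theta):            # signal model s(theta)
    return np.exp(-theta * n)

def H(theta):            # Jacobian ds/dtheta, here an N x 1 matrix
    return (-n * np.exp(-theta * n)).reshape(-1, 1)

# Gauss-Newton: theta_{k+1} = theta_k + (H^T H)^-1 H^T (x - s(theta_k))
theta = 0.3                                          # initial guess
for _ in range(10):
    Hk = H(theta)
    step = np.linalg.solve(Hk.T @ Hk, Hk.T @ (x - s(theta)))
    theta = theta + step.item()
print(theta)                                         # ~ theta_true
```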

9 Method of Moments

Although we may not have an expression for the PDF, we assume that we can
use the natural estimator for the kth moment, μ_k = E(x^k[n]), that is:

μ̂_k = (1/N) Σ_{n=0}^{N−1} x^k[n]

If we can write an expression for the kth moment as a function of the unknown
parameter θ:

μ_k = h(θ)

and assuming h⁻¹ exists, then we can derive an estimate by:

θ̂ = h⁻¹(μ̂_k) = h⁻¹( (1/N) Σ_{n=0}^{N−1} x^k[n] )

If θ is a p-dimensional vector then we require p equations to solve for the p
unknowns. That is we need some set of p moment equations. Using the lowest
order moments what we would like is:

θ̂ = h⁻¹(μ̂)

where:

μ̂ = [ (1/N) Σ_{n=0}^{N−1} x[n],  (1/N) Σ_{n=0}^{N−1} x²[n],  ...,  (1/N) Σ_{n=0}^{N−1} x^p[n] ]ᵀ

9.1 EXAMPLE

Consider a 2-mixture Gaussian PDF:

p(x; θ) = (1 − θ) g₁(x) + θ g₂(x)

where g₁ = N(μ₁, σ₁²) and g₂ = N(μ₂, σ₂²) are two different Gaussian PDFs and θ
is the unknown mixture parameter that has to be estimated. We can write the second
moment as a function of θ as follows (taking both Gaussians to be zero mean, so that
the second moments are just the variances):

μ₂ = E(x²[n]) = ∫ x² p(x;θ) dx = (1 − θ)σ₁² + θσ₂² = h(θ)

and hence:

θ̂ = (μ̂₂ − σ₁²)/(σ₂² − σ₁²) = ( (1/N) Σ_{n=0}^{N−1} x²[n] − σ₁² ) / (σ₂² − σ₁²)
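A minimal sketch of this method-of-moments estimate, with assumed known variances and an assumed true mixture weight of 0.3:

```python
import numpy as np

rng = np.random.default_rng(10)

# Assumed illustrative setup: zero-mean 2-mixture with known variances
sigma1_sq, sigma2_sq, eps_true, N = 1.0, 9.0, 0.3, 100000

# Draw from the mixture (1 - eps) N(0, sigma1^2) + eps N(0, sigma2^2)
from_second = rng.random(N) < eps_true
x = np.where(from_second,
             np.sqrt(sigma2_sq) * rng.standard_normal(N),
             np.sqrt(sigma1_sq) * rng.standard_normal(N))

# Method-of-moments estimate: eps = (mean(x^2) - sigma1^2) / (sigma2^2 - sigma1^2)
eps_hat = (np.mean(x**2) - sigma1_sq) / (sigma2_sq - sigma1_sq)
print(eps_hat)                                       # ~ 0.3
```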

10 Bayesian Philosophy

10.1 Minimum Mean Square Estimator (MMSE)

The classic approach we have been using so far has assumed that the parameter
θ is unknown but deterministic. Thus the optimal estimator θ̂ is optimal irrespective
and independent of the actual value of θ. But in cases where the actual
value or prior knowledge of θ could be a factor (e.g. where an MVU estimator
does not exist for certain values or where prior knowledge would improve the
estimation) the classic approach would not work effectively.

In the Bayesian philosophy the parameter θ is treated as a random variable with a known
prior pdf, p(θ). Such prior knowledge concerning the distribution of the estimator
should provide better estimators than the deterministic case.

In the classic approach we derived the MVU estimator by first considering minimisation
of the mean square error, i.e. θ̂ = arg min mse(θ̂), where:

mse(θ̂) = E[(θ − θ̂)²] = ∫ (θ − θ̂)² p(x;θ) dx

and p(x;θ) is the pdf of x parametrised by θ. In the Bayesian approach we
similarly derive an estimator by minimising θ̂ = arg min Bmse(θ̂), where:

Bmse(θ̂) = E[(θ − θ̂)²] = ∫∫ (θ − θ̂)² p(x, θ) dx dθ

is the Bayesian mse and p(x, θ) is the joint pdf of x and θ (since θ is now a
random variable). It should be noted that the Bayesian squared error (θ − θ̂)²
and classic squared error (θ − θ̂)² are the same.

The minimum Bmse(θ̂) estimator, or MMSE, is derived by differentiating the expression for
Bmse(θ̂) with respect to θ̂ and setting this to zero to yield:

θ̂ = E(θ|x) = ∫ θ p(θ|x) dθ

where the posterior pdf, p(θ|x), is given by:

p(θ|x) = p(x, θ)/p(x) = p(x, θ)/∫ p(x, θ) dθ = p(x|θ)p(θ)/∫ p(x|θ)p(θ) dθ

Apart from the computational (and analytical!) requirements in deriving an
expression for the posterior pdf and then evaluating the expectation E(θ|x),
there is also the problem of finding an appropriate prior pdf. The usual choice
is to assume the joint pdf, p(x, θ), is Gaussian and hence both the prior pdf, p(θ),
and posterior pdf, p(θ|x), are also Gaussian (this property implies the Gaussian
pdf is a conjugate prior distribution). Thus the form of the pdfs remains the
same and all that changes are the means and variances.

10.1.1 EXAMPLE
Consider a signal embedded in noise:

x[n] = A + w[n]

where as before w[n] = N(0, σ²) is a WGN process and the unknown parameter
θ = A is to be estimated. However in the Bayesian approach we also assume
the parameter A is a random variable with a prior pdf, which in this case is the
Gaussian pdf p(A) = N(μ_A, σ_A²). We also have that p(x|A) = N(A, σ²) and we
can assume that x and A are jointly Gaussian. Thus the posterior pdf:

p(A|x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA = N(μ_{A|x}, σ²_{A|x})

is also a Gaussian pdf and after the required simplification we have that:

σ²_{A|x} = 1 / ( N/σ² + 1/σ_A² )

and

μ_{A|x} = ( N x̄/σ² + μ_A/σ_A² ) σ²_{A|x}

and hence the MMSE is:

Â = E[A|x] = ∫ A p(A|x) dA = μ_{A|x} = α x̄ + (1 − α) μ_A

where

α = σ_A² / ( σ_A² + σ²/N )

Upon closer examination of the MMSE we observe the following (assume σ_A² ≫ σ²):

1. With few data (N is small) then σ_A² ≪ σ²/N and Â → μ_A, that is the
MMSE tends towards the mean of the prior pdf (and effectively ignores
the contribution of the data). Also p(A|x) → N(μ_A, σ_A²).

2. With large amounts of data (N is large) σ_A² ≫ σ²/N and Â → x̄, that is
the MMSE tends towards the sample mean x̄ (and effectively ignores the
contribution of the prior information). Also p(A|x) → N(x̄, σ²/N).

If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean vector
[E(x)ᵀ, E(y)ᵀ]ᵀ and partitioned covariance matrix

C = [ C_xx  C_xy      of sizes   [ k×k  k×l
      C_yx  C_yy ]                 l×k  l×l ]

then the conditional pdf, p(y|x), is also Gaussian and:

E(y|x) = E(y) + C_yx C_xx⁻¹ (x − E(x))

C_y|x = C_yy − C_yx C_xx⁻¹ C_xy
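A minimal numerical sketch of this MMSE (the prior mean, prior variance and noise variance below are assumed illustrative values); it shows the estimate shrinking towards the prior mean for small N and towards the sample mean for large N:

```python
import numpy as np

rng = np.random.default_rng(11)

# Assumed illustrative setup: A ~ N(mu_A, sigma_A^2) prior, x[n] = A + w[n]
mu_A, sigma_A2, sigma2 = 0.0, 4.0, 1.0
A = rng.normal(mu_A, np.sqrt(sigma_A2))              # one realisation of the parameter

for N in (1, 10, 1000):
    x = A + np.sqrt(sigma2) * rng.standard_normal(N)
    xbar = x.mean()
    alpha = sigma_A2 / (sigma_A2 + sigma2 / N)
    A_mmse = alpha * xbar + (1 - alpha) * mu_A        # MMSE: blend of data and prior
    print(N, A, A_mmse, xbar)
```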

10.2 Bayesian Linear Model


Now consider the Bayesian linear model:

x = Hθ + w

where θ is the unknown parameter to be estimated with prior pdf N(μ_θ, C_θ)
and w is WGN with pdf N(0, C_w). The MMSE is provided by the expression for
E(y|x) where we identify y ≡ θ. We have that:

E(x) = Hμ_θ,   E(y) = μ_θ

and we can show that:

C_xx = E[(x − E(x))(x − E(x))ᵀ] = HC_θHᵀ + C_w

C_yx = E[(y − E(y))(x − E(x))ᵀ] = C_θHᵀ

and hence, since x and θ are jointly Gaussian, we have that:

θ̂ = E(θ|x) = μ_θ + C_θHᵀ( HC_θHᵀ + C_w )⁻¹( x − Hμ_θ )

10.3 Relation with Classic Estimation


In classical estimation we cannot make any assumptions on the prior; thus all
possible values of θ have to be considered. The equivalent prior pdf would be a flat
distribution, essentially σ_θ² = ∞. This so-called noninformative prior pdf will
yield the classic estimator where such is defined.

10.3.1 EXAMPLE

Consider the signal embedded in noise problem:

x[n] = A + w[n]

where we have shown that the MMSE is:

Â = α x̄ + (1 − α) μ_A

with α = σ_A² / (σ_A² + σ²/N). If the prior pdf is noninformative then σ_A² = ∞
and α = 1, so Â = x̄, which is the classic estimator.

10.4 Nuisance Parameters


Suppose that both θ and α were unknown parameters but we are only interested
in θ. Then α is a nuisance parameter. We can deal with this by integrating α
out of the way. Consider:

p(θ|x) = p(x|θ)p(θ) / ∫ p(x|θ)p(θ) dθ

Now p(x|θ) is, in reality, p(x|θ, α), but we can obtain the true p(x|θ) by:

p(x|θ) = ∫ p(x|θ, α) p(α|θ) dα

and if θ and α are independent then:

p(x|θ) = ∫ p(x|θ, α) p(α) dα

11 General Bayesian Estimators

The Bmse(θ̂), given by:

Bmse(θ̂) = E[(θ − θ̂)²] = ∫∫ (θ − θ̂)² p(x, θ) dx dθ

is one specific case of a general estimator that attempts to minimise the average
of a cost function, C(ε), that is the Bayes risk R = E[C(ε)], where ε = (θ − θ̂).
There are three different cost functions of interest:

1. Quadratic: C(ε) = ε², which yields R = Bmse(θ̂). We already know that
the estimate that minimises R = Bmse(θ̂) is:

θ̂ = ∫ θ p(θ|x) dθ

which is the mean of the posterior pdf.

2. Absolute: C(ε) = |ε|. The estimate, θ̂, that minimises R = E[|θ − θ̂|] satisfies:

∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ   or   Pr{θ ≤ θ̂ | x} = 1/2

that is, θ̂ is the median of the posterior pdf.

3. Hit-or-miss:

C(ε) = 0 for |ε| < δ,   C(ε) = 1 for |ε| > δ

The estimate that minimises the Bayes risk can be shown to be:

θ̂ = arg max_θ p(θ|x)

which is the mode of the posterior pdf (the value that maximises the pdf).

For the Gaussian posterior pdf it should be noted that the mean, median and
mode are identical. Of most interest are the quadratic and hit-or-miss cost
functions which, together with a special case of the latter, yield the following
three important classes of estimator:

1. MMSE (Minimum Mean Square Error) estimator which we have
already introduced as the mean of the posterior pdf:

θ̂ = ∫ θ p(θ|x) dθ = ∫ θ p(x|θ)p(θ) dθ / ∫ p(x|θ)p(θ) dθ

2. MAP (Maximum A Posteriori) estimator which is the mode or maximum
of the posterior pdf:

θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ)

3. Bayesian ML (Bayesian Maximum Likelihood) estimator which is
the special case of the MAP estimator where the prior pdf, p(θ), is uniform
or noninformative:

θ̂ = arg max_θ p(x|θ)

Noting that the conditional pdf of x given θ, p(x|θ), is essentially equivalent to the pdf of x parametrized by θ, p(x;θ), the Bayesian ML estimator
is equivalent to the classic MLE.

Comparing the three estimators:

The MMSE is preferred due to its least-squares cost function but it is
also the most difficult to derive and compute due to the need to find an
expression for (or measurements of) the posterior pdf p(θ|x) in order to integrate
∫ θ p(θ|x) dθ.

The MAP hit-or-miss cost function is more crude but the MAP estimate is easier to derive since there is no need to integrate, only find the
maximum of the posterior pdf p(θ|x), either analytically or numerically.

The Bayesian ML is equivalent in performance to the MAP only in the
case where the prior pdf is noninformative, otherwise it is a sub-optimal
estimator. However, like the classic MLE, the expression for the conditional pdf p(x|θ)
is usually easier to obtain than that of the posterior pdf, p(θ|x).
Since in most cases knowledge of the prior pdf is unavailable,
ML estimates, not surprisingly, tend to be more prevalent. However it
may not always be prudent to assume the prior pdf is uniform, especially
in cases where prior knowledge of the estimate is available even though
the exact pdf is unknown. In these cases a MAP estimate may perform
better even if an artificial prior pdf is assumed (e.g. a Gaussian prior
which has the added benefit of yielding a Gaussian posterior).
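As a small illustrative sketch (assumed toy problem, not from the notes), the three estimators can be computed from a posterior evaluated on a grid; for a Gaussian posterior the mean (quadratic cost), median (absolute cost) and mode (hit-or-miss cost) coincide:

```python
import numpy as np

rng = np.random.default_rng(12)

# Assumed toy problem: DC level A in WGN with a Gaussian prior
mu_A, sigma_A2, sigma2, N, A_true = 0.0, 4.0, 1.0, 5, 1.5
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

grid = np.linspace(-5, 5, 2001)
log_post = (-np.array([np.sum((x - a)**2) for a in grid]) / (2*sigma2)
            - (grid - mu_A)**2 / (2*sigma_A2))        # log p(x|A) + log p(A)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)                          # normalised posterior p(A|x)

mmse = np.trapz(grid * post, grid)                    # mean   (quadratic cost)
cdf = np.cumsum(post); cdf /= cdf[-1]
median = grid[np.searchsorted(cdf, 0.5)]              # median (absolute cost)
mapest = grid[np.argmax(post)]                        # mode   (hit-or-miss cost)
print(mmse, median, mapest)                           # all similar (Gaussian posterior)
```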

12 Linear Bayesian Estimators

12.1 Linear Minimum Mean Square Error (LMMSE) Estimator

We assume that the parameter θ is to be estimated based on the data set
x = [x[0] x[1] ... x[N−1]]ᵀ rather than assume any specific form for the joint
pdf p(x, θ). We consider the class of all affine estimators of the form:

θ̂ = Σ_{n=0}^{N−1} a_n x[n] + a_N = aᵀx + a_N

where a = [a₀ a₁ ... a_{N−1}]ᵀ, and choose the weight coefficients {a, a_N} to minimise
the Bayesian MSE:

Bmse(θ̂) = E[(θ − θ̂)²]

The resultant estimator is termed the linear minimum mean square error (LMMSE)
estimator. The LMMSE will be sub-optimal unless the MMSE is also linear.
Such would be the case if the Bayesian linear model applied:

x = Hθ + w

The weight coefficients are obtained by setting ∂Bmse(θ̂)/∂a_i = 0 for each coefficient,
which yields:

a_N = E(θ) − Σ_{n=0}^{N−1} a_n E(x[n]) = E(θ) − aᵀE(x)   and   a = C_xx⁻¹ C_xθ

where C_xx = E[(x − E(x))(x − E(x))ᵀ] is the N × N covariance matrix and
C_xθ = E[(x − E(x))(θ − E(θ))] is the N × 1 cross-covariance vector. Thus the
LMMSE estimator is:

θ̂ = aᵀx + a_N = E(θ) + C_θx C_xx⁻¹ (x − E(x))

where we note C_θx = C_xθᵀ = E[(θ − E(θ))(x − E(x))ᵀ] is the 1 × N cross-covariance
vector. For the p × 1 vector parameter θ an equivalent expression for the LMMSE estimator is derived:

θ̂ = E(θ) + C_θx C_xx⁻¹ (x − E(x))

where now C_θx = E[(θ − E(θ))(x − E(x))ᵀ] is the p × N cross-covariance
matrix. And the Bayesian MSE matrix is:

M_θ̂ = E[(θ − θ̂)(θ − θ̂)ᵀ] = C_θθ − C_θx C_xx⁻¹ C_xθ

where C_θθ = E[(θ − E(θ))(θ − E(θ))ᵀ] is the p × p covariance matrix.
12.1.1 Bayesian Gauss-Markov Theorem


If the data are described by the Bayesian linear model form:

x = Hθ + w

where x is the N × 1 data vector, H is a known N × p observation matrix, θ
is a p × 1 random vector of parameters with mean E(θ) and covariance matrix
C_θθ, and w is an N × 1 random vector with zero mean and covariance matrix
C_w which is uncorrelated with θ (the joint pdf p(w, θ) and hence also p(x, θ)
are otherwise arbitrary), then noting that:

E(x) = HE(θ)
C_xx = HC_θθHᵀ + C_w
C_θx = C_θθHᵀ

the LMMSE estimator of θ is:

θ̂ = E(θ) + C_θθHᵀ( HC_θθHᵀ + C_w )⁻¹( x − HE(θ) )
  = E(θ) + ( C_θθ⁻¹ + HᵀC_w⁻¹H )⁻¹ HᵀC_w⁻¹( x − HE(θ) )

and the covariance of the error, which is the Bayesian MSE matrix, is:

M_θ̂ = ( C_θθ⁻¹ + HᵀC_w⁻¹H )⁻¹

12.2 Wiener Filtering


We assume N samples of time-series data x = [x[0] x[1] ... x[N−1]]ᵀ which are
wide-sense stationary (WSS). As such the N × N covariance matrix takes the
symmetric Toeplitz form:

C_xx = R_xx,   [R_xx]_ij = r_xx[i − j]

where r_xx[k] = E(x[n]x[n−k]) is the autocorrelation function (ACF) of the x[n]
process and R_xx denotes the autocorrelation matrix. Note that since x[n]
is WSS the expectation E(x[n]x[n−k]) is independent of the absolute time
index n. In signal processing the estimated ACF is used:

r̂_xx[k] = (1/N) Σ_{n=0}^{N−1−|k|} x[n]x[n+|k|]   for |k| ≤ N−1,   0 for |k| ≥ N

Both the data x and the parameter to be estimated θ are assumed zero mean.
Thus the LMMSE estimator is:

θ̂ = C_θx C_xx⁻¹ x

Application of the LMMSE estimator to the three signal processing estimation
problems of filtering, smoothing and prediction gives rise to the Wiener filtering
equation solutions.

12.2.1 Smoothing
The problem is to estimate the signal θ = s = [s[0] s[1] ... s[N−1]]ᵀ based
on the noisy data x = [x[0] x[1] ... x[N−1]]ᵀ where:

x = s + w

and w = [w[0] w[1] ... w[N−1]]ᵀ is the noise process. An important difference
between smoothing and filtering is that the signal estimate ŝ[n] can use the
entire data set: the past values (x[0], x[1], ..., x[n−1]), the present x[n] and
the future values (x[n+1], x[n+2], ..., x[N−1]). This means that the solution
cannot be cast as a filtering problem since we cannot apply a causal filter to the
data.

We assume that the signal and noise processes are uncorrelated. Hence:

r_xx[k] = r_ss[k] + r_ww[k]

and thus:

C_xx = R_xx = R_ss + R_ww

also:

C_θx = E(sxᵀ) = E(s(s + w)ᵀ) = R_ss

Hence the LMMSE estimator (also called the Wiener estimator) is:

ŝ = R_ss( R_ss + R_ww )⁻¹ x = Wx

and the N × N matrix:

W = R_ss( R_ss + R_ww )⁻¹

is referred to as the Wiener smoothing matrix.

12.2.2 Filtering
The problem is to estimate the signal θ = s[n] based only on the present and
past noisy data x = [x[0] x[1] ... x[n]]ᵀ. As n increases this allows us to view
the estimation process as an application of a causal filter to the data and we
need to cast the LMMSE estimator expression in the form of a filter.
Assuming the signal and noise processes are uncorrelated we have:

C_xx = R_ss + R_ww

where C_xx is an (n+1) × (n+1) autocorrelation matrix. Also:

C_θx = E(s[n] xᵀ) = E(s[n] [s[0] s[1] ... s[n]]) = [r_ss[n] r_ss[n−1] ... r_ss[0]] = řᵀ_ss

is a 1 × (n+1) row vector. The LMMSE estimator is:

ŝ[n] = řᵀ_ss( R_ss + R_ww )⁻¹ x = aᵀx

where a = [a₀ a₁ ... a_n]ᵀ = (R_ss + R_ww)⁻¹ ř_ss is the (n+1) × 1 vector of
weights. Note that the check accent is used to denote time reversal, thus:

ř_ss = [r_ss[n] r_ss[n−1] ... r_ss[0]]ᵀ

We interpret the process of forming the estimator as time evolves (n increases)
as a filtering operation. Specifically we let h^(n)[k], the time-varying impulse
response, be the response of the filter at time n to an impulse applied k samples
before (i.e. at time n−k). We note that a_i can be interpreted as the response
of the filter at time n to the signal (or impulse) applied at time i = n−k. Thus
we can make the following correspondence:

h^(n)[k] = a_{n−k},   k = 0, 1, ..., n

Then:

ŝ[n] = Σ_{k=0}^{n} a_k x[k] = Σ_{k=0}^{n} h^(n)[n−k] x[k] = Σ_{k=0}^{n} h^(n)[k] x[n−k]

We define the vector h = [h^(n)[0] h^(n)[1] ... h^(n)[n]]ᵀ. Then we have that h = ǎ,
that is, h is a time-reversed version of a. To explicitly find the impulse response
h we note that since:

(R_ss + R_ww) a = ř_ss

then it is also true that:

(R_ss + R_ww) h = r_ss

When written out we get the Wiener-Hopf filtering equations:

[ r_xx[0]    r_xx[1]     ...  r_xx[n]     [ h^(n)[0]     [ r_ss[0]
  r_xx[1]    r_xx[0]     ...  r_xx[n−1]     h^(n)[1]       r_ss[1]
  :          :                :             :         =    :
  r_xx[n]    r_xx[n−1]   ...  r_xx[0]  ]    h^(n)[n] ]     r_ss[n] ]

where r_xx[k] = r_ss[k] + r_ww[k]. A computationally efficient solution for solving
the equations is the Levinson recursion which solves the equations recursively
to avoid re-solving them for each value of n.

12.2.3 Prediction
The problem is to estimate θ = x[N−1+l] based on the current and past data
x = [x[0] x[1] ... x[N−1]]ᵀ, at sample l ≥ 1 in the future. The resulting estimator is
termed the l-step linear predictor.

As before we have C_xx = R_xx where R_xx is the N × N autocorrelation matrix,
and:

C_θx = E(x[N−1+l] xᵀ) = E(x[N−1+l] [x[0] x[1] ... x[N−1]])
     = [r_xx[N−1+l] r_xx[N−2+l] ... r_xx[l]] = řᵀ_xx

Then the LMMSE estimator is:

x̂[N−1+l] = řᵀ_xx R_xx⁻¹ x = aᵀx

where a = [a₀ a₁ ... a_{N−1}]ᵀ = R_xx⁻¹ ř_xx. We can interpret the process of forming
the estimator as a filtering operation where:

h^(N)[k] = h[k] = a_{N−k}   (equivalently h[N−k] = a_k)

and then:

x̂[N−1+l] = Σ_{k=0}^{N−1} h[N−k] x[k] = Σ_{k=1}^{N} h[k] x[N−k]

Defining h = [h[1] h[2] ... h[N]]ᵀ = [a_{N−1} a_{N−2} ... a₀]ᵀ = ǎ, as before we can
find an explicit expression for h by noting that:

R_xx a = ř_xx   ⟹   R_xx h = r_xx

where r_xx = [r_xx[l] r_xx[l+1] ... r_xx[l+N−1]]ᵀ and ř_xx is the time-reversed version of
r_xx. When written out we get the Wiener-Hopf prediction equations:

[ r_xx[0]      r_xx[1]      ...  r_xx[N−1]     [ h[1]     [ r_xx[l]
  r_xx[1]      r_xx[0]      ...  r_xx[N−2]       h[2]       r_xx[l+1]
  :            :                 :               :      =   :
  r_xx[N−1]    r_xx[N−2]    ...  r_xx[0]   ]     h[N] ]     r_xx[l+N−1] ]

A computationally efficient solution for solving the equations is the Levinson
recursion which solves the equations recursively to avoid re-solving them for
each value of n. The special case l = 1, the one-step linear predictor, covers
two important cases in signal processing:

the values of h[n] are termed the linear prediction coefficients which are
used extensively in speech coding, and

the resulting equations are identical to the Yule-Walker equations used to
solve for the AR filter parameters of an AR(N) process.
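A minimal sketch of the one-step predictor (l = 1): the Wiener-Hopf/Yule-Walker system is built from the estimated ACF of an assumed AR(2) process and solved directly (a plain matrix solve stands in for the Levinson recursion here):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(13)

# Assumed illustrative setup: AR(2) process, estimate 1-step prediction coefficients
a1, a2, M = 0.75, -0.5, 20000
x = np.zeros(M)
for n in range(2, M):
    x[n] = a1*x[n-1] + a2*x[n-2] + rng.standard_normal()

# Estimated ACF: r_xx[k] = (1/M) sum x[n] x[n+k]
N = 2                                                 # predictor order
r = np.array([np.dot(x[:M-k], x[k:]) / M for k in range(N + 1)])

# Wiener-Hopf / Yule-Walker system (l = 1): Toeplitz(r[0..N-1]) h = r[1..N]
R = toeplitz(r[:N])
h = np.linalg.solve(R, r[1:N+1])
print(h)                                              # ~ [a1, a2]
```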

12.3 Sequential LMMSE


We can derive a recursive re-estimation of θ̂[n] (i.e. based on n data samples)
from:

θ̂[n−1], the estimate based on n−1 data samples,
x[n], the nth data sample, and
x̂[n|n−1], an estimate of x[n] based on n−1 data samples.

Using a vector space analogy where we define the following inner product:

(x, y) = E(xy)

the procedure is as follows:

1. From the previous iteration we have the estimate, θ̂[n−1], or we have
an initial estimate θ̂[0]. We can consider θ̂[n−1] as the true value of θ projected
on the subspace spanned by {x[0], x[1], ..., x[n−1]}.

2. We find the LMMSE estimator of x[n] based on the previous n−1 samples,
that is the one-step linear predictor, x̂[n|n−1], or we have an initial
estimate x̂[0|−1]. We can consider x̂[n|n−1] as the true value x[n] projected
on the subspace spanned by {x[0], x[1], ..., x[n−1]}.

3. We form the innovation x[n] − x̂[n|n−1], which we can consider as being
orthogonal to the space spanned by {x[0], x[1], ..., x[n−1]} and representing
the direction of correction based exclusively on the new data
sample x[n].

4. We calculate the correction gain K[n] by the normalised projection of θ
on the innovation x[n] − x̂[n|n−1], that is:

K[n] = E[θ(x[n] − x̂[n|n−1])] / E[(x[n] − x̂[n|n−1])²]

5. Finally we update the estimator by adding the correction:

θ̂[n] = θ̂[n−1] + K[n]( x[n] − x̂[n|n−1] )

12.3.1 Sequential LMMSE for the Bayesian Linear Model

For the vector parameter-scalar data form of the Bayesian linear model:

x[n] = hᵀ[n]θ + w[n]

we can derive the sequential vector LMMSE:

θ̂[n] = θ̂[n−1] + K[n]( x[n] − hᵀ[n]θ̂[n−1] )

K[n] = M[n−1]h[n] / ( σ_n² + hᵀ[n]M[n−1]h[n] )

M[n] = (I − K[n]hᵀ[n])M[n−1]

where M[n] is the error covariance estimate:

M[n] = E[(θ − θ̂[n])(θ − θ̂[n])ᵀ]

13 Kalman Filters

The Kalman filter can be viewed as an extension of the sequential LMMSE to
the case where the parameter estimate, or state, varies with time in a known but
stochastic way and with unknown initialisation. We consider the vector state-vector
observation case and consider the unknown state (to be estimated) at time n,
s[n], to vary according to the Gauss-Markov signal model or process
equation:

s[n] = As[n−1] + Bu[n],   n ≥ 0

where s[n] is the p × 1 state vector with unknown initialisation s[−1] = N(μ_s, C_s),
u[n] = N(0, Q) is the WGN r × 1 driving noise vector and A, B are known
p × p and p × r matrices. The observed data, x[n], is then assumed a linear
function of the state vector with the following observation equation:

x[n] = H[n]s[n] + w[n],   n ≥ 0

where x[n] is the M × 1 observation vector, w[n] = N(0, R) is the WGN M × 1
observation noise sequence and H[n] is a known M × p matrix.

We wish to estimate s[n] based on the observations X[n] = [xᵀ[0] xᵀ[1] ... xᵀ[n]]ᵀ.
Our criterion of optimality is the Bayesian MMSE:

E[(s[n] − ŝ[n|n])²]

where ŝ[n|n] = E(s[n]|X[n]) is the estimate of s[n] based on the observation
sequence X[n]. We define the innovation sequence:

x̃[n] = x[n] − x̂[n|n−1]

where x̂[n|n−1] = E(x[n]|X[n−1]) is the predictor of x[n] based on the
observation sequence X[n−1]. We can consider ŝ[n|n] as being derived from
two separate components:

ŝ[n|n] = E(s[n]|X[n−1]) + E(s[n]|x̃[n])

where E(s[n]|X[n−1]) is the state prediction and E(s[n]|x̃[n]) is the innovation
correction. By definition ŝ[n|n−1] = E(s[n]|X[n−1]) and from the
process equation we can show that:

ŝ[n|n−1] = Aŝ[n−1|n−1]

From our sequential LMMSE results we also have that:

E(s[n]|x̃[n]) = K[n]( x[n] − x̂[n|n−1] )

and:

K[n] = E(s[n]x̃ᵀ[n]) E(x̃[n]x̃ᵀ[n])⁻¹ = M[n|n−1]Hᵀ[n]( H[n]M[n|n−1]Hᵀ[n] + R )⁻¹

where:

M[n|n−1] = E[ (s[n] − ŝ[n|n−1])(s[n] − ŝ[n|n−1])ᵀ ]

is the state error covariance predictor at time n based on the observation sequence
X[n−1]. From the process equation we can show that:

M[n|n−1] = AM[n−1|n−1]Aᵀ + BQBᵀ

and by appropriate substitution we can derive the following expression for the
state error covariance estimate at time n:

M[n|n] = (I − K[n]H[n])M[n|n−1]

Kalman Filter Procedure

Prediction:
ŝ[n|n−1] = Aŝ[n−1|n−1]
M[n|n−1] = AM[n−1|n−1]Aᵀ + BQBᵀ

Gain:
K[n] = M[n|n−1]Hᵀ[n]( H[n]M[n|n−1]Hᵀ[n] + R )⁻¹

Correction:
ŝ[n|n] = ŝ[n|n−1] + K[n]( x[n] − x̂[n|n−1] )
M[n|n] = (I − K[n]H[n])M[n|n−1]
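A minimal scalar sketch of this procedure (all model values below are assumed for illustration); the prediction, gain and correction steps follow the equations above directly:

```python
import numpy as np

rng = np.random.default_rng(14)

# Assumed illustrative setup: scalar Gauss-Markov state observed in noise
A, B, Q, R, H = 0.99, 1.0, 0.01, 0.5, 1.0
T = 200
s_true, x_obs = np.zeros(T), np.zeros(T)
s = 0.0
for n in range(T):
    s = A*s + B*np.sqrt(Q)*rng.standard_normal()        # process equation
    s_true[n] = s
    x_obs[n] = H*s + np.sqrt(R)*rng.standard_normal()   # observation equation

# Kalman filter (scalar form of the procedure in the notes)
s_hat, M = 0.0, 1.0                                      # initial estimate and covariance
s_filt = np.zeros(T)
for n in range(T):
    # Prediction
    s_pred = A * s_hat
    M_pred = A*M*A + B*Q*B
    # Gain
    K = M_pred*H / (H*M_pred*H + R)
    # Correction
    s_hat = s_pred + K*(x_obs[n] - H*s_pred)
    M = (1 - K*H)*M_pred
    s_filt[n] = s_hat

# Filtered MSE should be well below the raw observation MSE
print(np.mean((s_filt - s_true)**2), np.mean((x_obs/H - s_true)**2))
```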

14 MAIN REFERENCE

Steven M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, 1993.
