Estimation Theory for Engineers - Lecture Notes
Roberto Togneri
30th August 2005
Applications
Modern estimation theory can be found at the heart of many electronic signal
processing systems designed to extract information. These systems include:
Radar, where the delay of the received pulse echo has to be estimated in the
presence of noise.

Sonar, where the delay of the received signal from each sensor has to be
estimated in the presence of noise.

Speech, Image, Biomedicine, Communications and Control, where signal and system
parameters have to be estimated from noisy measurements.

Seismology, where the underlying structure of the earth has to be estimated
from noisy sound reflections due to the different densities of oil and rock
layers.

In all of these applications the parameters of interest have to be estimated
from the noisy observed data x[n].
Introduction
Define: the observed N-point data set x[n], n = 0, 1, ..., N-1, collected into
the vector x = [x[0], x[1], ..., x[N-1]]^T, whose probability density function
(PDF) p(x; θ) depends on an unknown parameter θ. An estimate of θ must be a
function of the data only, that is:

    θ̂ = g(x)

Once a candidate estimator g(x) is proposed, two questions arise:

1. How close will θ̂ be to θ (i.e. how good or optimal is our estimator)?

2. Are there better (i.e. closer) estimators?

2 Minimum Variance Unbiased (MVU) Estimation

A natural optimality criterion is the mean square error:

    mse(θ̂) = E[(θ̂ - θ)²]

But this does not yield realisable estimator functions which can be written as
functions of the data only:

    mse(θ̂) = E{[(θ̂ - E(θ̂)) + (E(θ̂) - θ)]²} = var(θ̂) + [E(θ̂) - θ]²

Since the bias term E(θ̂) - θ depends on the unknown true value of θ, an
estimator that minimises the MSE cannot, in general, be written as a function
of the data alone. The standard approach is therefore to constrain the
estimator to be unbiased, E(θ̂) = θ, and to seek the unbiased estimator of
minimum variance for all θ: the Minimum Variance Unbiased (MVU) estimator.
2.2 EXAMPLE
Consider a fixed signal level A embedded in white Gaussian noise (WGN) w[n] of
variance σ²:

    x[n] = A + w[n]        n = 0, 1, ..., N-1

where θ = A is to be estimated from the data x[n]. Consider the sample-mean
estimator function:

    θ̂ = (1/N) Σ_{n=0}^{N-1} x[n]

Is the sample-mean an MVU estimator for A?

Unbiased?

    E(θ̂) = E[(1/N) Σ_{n=0}^{N-1} x[n]] = (1/N) Σ_{n=0}^{N-1} E(x[n]) = (1/N) N A = A

Minimum Variance?

    var(θ̂) = var[(1/N) Σ_{n=0}^{N-1} x[n]] = (1/N²) Σ_{n=0}^{N-1} var(x[n]) = (1/N²) N σ² = σ²/N

The sample mean is unbiased with variance σ²/N, but to decide whether this is
the minimum possible variance we need a lower bound on the variance of any
unbiased estimator.
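The two properties just derived are easy to check numerically. The following
Python sketch (an illustration only; the values A = 1.0, sigma = 2.0, N = 100
and the number of trials are arbitrary assumptions, not part of the notes)
simulates many realisations of x[n] = A + w[n] and compares the empirical mean
and variance of the sample-mean estimator with A and σ²/N.

import numpy as np

# Arbitrary illustrative values (assumptions, not from the notes)
A, sigma, N, trials = 1.0, 2.0, 100, 20000
rng = np.random.default_rng(0)

# Each row is one realisation x[n] = A + w[n], n = 0..N-1
x = A + sigma * rng.standard_normal((trials, N))

# Sample-mean estimator applied to every realisation
A_hat = x.mean(axis=1)

print("empirical mean of A_hat:", A_hat.mean())   # should be close to A
print("empirical var of A_hat :", A_hat.var())    # should be close to sigma**2 / N
print("theoretical sigma^2/N  :", sigma**2 / N)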
3 Cramer-Rao Lower Bound (CRLB)

For a scalar parameter θ the variance of any unbiased estimator is bounded
below by the CRLB:

    var(θ̂) ≥ 1 / E[-∂² ln p(x; θ)/∂θ²]

with the variance of the MVU estimator attaining the CRLB. That is:

    var(θ̂_MVU) = 1 / E[-∂² ln p(x; θ)/∂θ²]

Furthermore if, for some functions g and I, the first derivative of the
log-likelihood can be factored as:

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)

then the MVU estimator is θ̂_MVU = g(x) and its minimum variance is 1/I(θ).

For a p-dimensional parameter vector θ the corresponding condition is:

    C_θ̂ - I^{-1}(θ) ≥ 0

i.e. C_θ̂ - I^{-1}(θ) is positive semidefinite, where
C_θ̂ = E[(θ̂ - E(θ̂))(θ̂ - E(θ̂))^T] is the covariance matrix of the estimator.
The Fisher matrix, I(θ), is given as:

    [I(θ)]_ij = -E[∂² ln p(x; θ)/∂θ_i ∂θ_j]

Furthermore if, for some p-dimensional function g(x) and some p × p matrix
I(θ), we can factor the derivative of the log-likelihood as:

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)

then the MVU estimator is θ̂_MVU = g(x) with covariance I^{-1}(θ).
3.1 EXAMPLE

Consider the case of a signal embedded in noise:

    x[n] = A + w[n]        n = 0, 1, ..., N-1

where w[n] is WGN with variance σ² and θ = A. The PDF of the data is:

    p(x; θ) = Π_{n=0}^{N-1} (2πσ²)^{-1/2} exp[-(1/(2σ²))(x[n] - θ)²]
            = (2πσ²)^{-N/2} exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - θ)²]

Treating p(x; θ) as a function of θ (for the known data x) we take derivatives:

    ∂ ln p(x; θ)/∂θ = (1/σ²) Σ_{n=0}^{N-1} (x[n] - θ) = (N/σ²)(x̄ - θ)

    ∂² ln p(x; θ)/∂θ² = -N/σ²

For an MVU estimator the lower bound has to apply, that is:

    var(θ̂_MVU) = 1 / E[-∂² ln p(x; θ)/∂θ²] = σ²/N

but we know from the previous example that var(θ̂) = σ²/N for the sample mean,
and thus the sample-mean is an MVU estimator. Alternatively we can show this by
considering the first derivative:

    ∂ ln p(x; θ)/∂θ = (N/σ²)((1/N) Σ_{n=0}^{N-1} x[n] - θ) = I(θ)(g(x) - θ)

where I(θ) = N/σ² and g(x) = (1/N) Σ_{n=0}^{N-1} x[n]. Thus the MVU estimator is
indeed θ̂ = (1/N) Σ_{n=0}^{N-1} x[n] with minimum variance I(θ)^{-1} = σ²/N.

4 Linear Models
The linear data model is:

    x = Hθ + w

where:

    x = N × 1 observation vector
    H = N × p observation matrix
    θ = p × 1 vector of parameters to be estimated
    w = N × 1 noise vector with PDF N(0, σ²I)

The MVU estimator θ̂ = g(x) follows from the CRLB factorisation

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)

with C_θ̂ = I^{-1}(θ). So we need to factor:

    ∂ ln p(x; θ)/∂θ = ∂/∂θ [-(N/2) ln(2πσ²) - (1/(2σ²))(x - Hθ)^T(x - Hθ)]

into the form I(θ)(g(x) - θ). Carrying this out yields the MVU estimator:

    θ̂ = (H^T H)^{-1} H^T x

and its covariance is:

    C_θ̂ = σ²(H^T H)^{-1}
4.1 EXAMPLES

4.1.1 Curve Fitting

Consider fitting the data, x(t), by a pth order polynomial in t:

    x(t) = θ_0 + θ_1 t + θ_2 t² + ... + θ_p t^p + w(t)

Say we have N samples of data taken at the times t_0, t_1, ..., t_{N-1}, then:

    x = [x(t_0), x(t_1), ..., x(t_{N-1})]^T
    θ = [θ_0, θ_1, θ_2, ..., θ_p]^T

so x = Hθ + w, where H is the N × (p+1) matrix:

    H = [ 1   t_0        t_0²        ...  t_0^p
          1   t_1        t_1²        ...  t_1^p
          :   :          :                :
          1   t_{N-1}    t_{N-1}²    ...  t_{N-1}^p ]

Hence the MVU estimate of the polynomial coefficients based on the N samples
of data is:

    θ̂ = (H^T H)^{-1} H^T x
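A minimal numerical sketch of this estimate (the sample times, true
coefficients and noise level below are invented purely for illustration) builds
H from the times and solves for θ̂; numpy.linalg.lstsq is used in place of
forming (H^T H)^{-1} explicitly, which is numerically preferable but
algebraically equivalent.

import numpy as np

rng = np.random.default_rng(1)

# Invented example: quadratic (p = 2) observed at 50 time instants
t = np.linspace(0.0, 1.0, 50)
theta_true = np.array([0.5, -1.0, 2.0])          # [theta_0, theta_1, theta_2]
x = np.polynomial.polynomial.polyval(t, theta_true) + 0.1 * rng.standard_normal(t.size)

# Observation matrix H with columns 1, t, t^2
H = np.vander(t, N=3, increasing=True)

# MVU / LS estimate: theta_hat = (H^T H)^{-1} H^T x, computed via lstsq
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
print(theta_hat)                                  # close to theta_true for small noise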
4.1.2 Fourier Analysis

Consider approximating the data, x[n], by a linear combination of sine and
cosine terms, i.e. classical Fourier analysis. We consider the model:

    x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

so x = Hθ + w, where:

    θ = [a_1, a_2, ..., a_M, b_1, b_2, ..., b_M]^T

and H is the N × 2M matrix:

    H = [h_1^a  h_2^a  ...  h_M^a  h_1^b  h_2^b  ...  h_M^b]

where:

    h_k^a = [1, cos(2πk/N), cos(2πk(2)/N), ..., cos(2πk(N-1)/N)]^T
    h_k^b = [0, sin(2πk/N), sin(2πk(2)/N), ..., sin(2πk(N-1)/N)]^T

Hence the MVU estimate of the Fourier co-efficients based on the N samples of
data is:

    θ̂ = (H^T H)^{-1} H^T x

After carrying out the simplification (H^T H = (N/2)I) the solution can be
shown to be:

    θ̂ = [(2/N)(h_1^a)^T x, ..., (2/N)(h_M^a)^T x, (2/N)(h_1^b)^T x, ..., (2/N)(h_M^b)^T x]^T

which is none other than the standard solution found in signal processing
textbooks, usually expressed directly as:

    â_k = (2/N) Σ_{n=0}^{N-1} x[n] cos(2πkn/N)
    b̂_k = (2/N) Σ_{n=0}^{N-1} x[n] sin(2πkn/N)

and C_θ̂ = (2σ²/N) I.
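These closed-form coefficient estimates are simple to verify with a short
simulation (the chosen N, M, true coefficients and noise level below are
assumptions made only for this example).

import numpy as np

rng = np.random.default_rng(2)
N, M = 64, 3
n = np.arange(N)

# Invented true Fourier coefficients for k = 1..M
a_true = np.array([1.0, 0.0, 0.5])
b_true = np.array([0.0, 2.0, -1.0])

x = sum(a_true[k - 1] * np.cos(2 * np.pi * k * n / N) +
        b_true[k - 1] * np.sin(2 * np.pi * k * n / N) for k in range(1, M + 1))
x = x + 0.3 * rng.standard_normal(N)

# MVU estimates a_hat_k = (2/N) sum x[n] cos(2 pi k n / N), similarly for b_hat_k
a_hat = np.array([(2.0 / N) * np.sum(x * np.cos(2 * np.pi * k * n / N)) for k in range(1, M + 1)])
b_hat = np.array([(2.0 / N) * np.sum(x * np.sin(2 * np.pi * k * n / N)) for k in range(1, M + 1)])
print(a_hat, b_hat)   # close to a_true, b_true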
4.1.3 System Identification

Consider a tapped delay line (FIR) system with impulse response h[k],
k = 0, 1, ..., p-1, driven by a known input u[n] and observed at the output
x[n] in noise. To identify the system we model the output as:

    x[n] = Σ_{k=0}^{p-1} h[k] u[n-k] + w[n]

We assume N input samples are used to yield N output samples and our
identification problem is the same as estimation of the linear model parameters
for x = Hθ + w, where θ = [h[0], h[1], ..., h[p-1]]^T and H is the N × p matrix:

    H = [ u[0]      0         ...  0
          u[1]      u[0]      ...  0
          :         :              :
          u[N-1]    u[N-2]    ...  u[N-p] ]

The MVU estimate is θ̂ = (H^T H)^{-1} H^T x with C_θ̂ = σ²(H^T H)^{-1}. Since H
is a function of u[n], the input can be chosen to simplify and improve the
estimate. One convenient choice is a pseudorandom noise (PRN) sequence, which
has the property that the autocorrelation function is zero for k ≠ 0, that is:

    r_uu[k] = (1/N) Σ_{n=0}^{N-1-k} u[n] u[n+k] = 0        for k ≠ 0

and hence H^T H ≈ N r_uu[0] I. Define the crosscorrelation function:

    r_ux[k] = (1/N) Σ_{n=0}^{N-1-k} u[n] x[n+k]

Then the estimate of each tap reduces to:

    ĥ[i] = r_ux[i] / r_uu[0]
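The sketch below checks this correlation-based identification numerically; it
is only an illustration, with an invented length-4 impulse response and a
random ±1 input standing in for the PRN sequence (only approximately white, so
H^T H ≈ N r_uu[0] I holds only approximately).

import numpy as np

rng = np.random.default_rng(3)
N, p = 4096, 4
h_true = np.array([1.0, 0.5, -0.25, 0.1])

# A random +/-1 sequence as a stand-in for a PRN input (approximately white)
u = rng.choice([-1.0, 1.0], size=N)
x = np.convolve(u, h_true)[:N] + 0.05 * rng.standard_normal(N)

# Sample auto- and cross-correlations r_uu[0] and r_ux[k], k = 0..p-1
r_uu0 = np.mean(u * u)
r_ux = np.array([np.mean(u[:N - k] * x[k:]) for k in range(p)])

h_hat = r_ux / r_uu0
print(h_hat)   # close to h_true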
4.2 General Linear Model

The general linear model extends the linear model in two ways: 1. the noise is
no longer assumed white but has covariance matrix C (general Gaussian noise),
and 2. the observed data vector, x, may contain known signal components, s.
Thus the general linear model for the observed data is expressed as:

    x = Hθ + s + w

where:

    s = N × 1 known signal vector
    w = N × 1 noise vector with PDF N(0, C)

Our solution for the simple linear model where the noise is assumed white can
be used after applying a suitable whitening transformation. If we factor the
noise covariance matrix as:

    C^{-1} = D^T D

then w' = Dw has PDF N(0, I), since E(w'w'^T) = DCD^T = I. Applying D transforms
the model:

    x = Hθ + s + w

to:

    x' = Dx = DHθ + Ds + Dw = H'θ + s' + w'

or:

    x'' = x' - s' = H'θ + w'

We can then write the MVU estimator of θ based on x'' as:

    θ̂ = (H'^T H')^{-1} H'^T x'' = (H^T D^T D H)^{-1} H^T D^T D (x - s)

That is:

    θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1} (x - s)

with covariance:

    C_θ̂ = (H^T C^{-1} H)^{-1}
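A compact way to apply this result in practice is the whitened least-squares
computation below (a sketch only; the specific H, s and noise covariance are
invented for illustration). It forms D from a Cholesky factor of C, whitens the
data and reuses the ordinary LS solution, which is algebraically the same as
θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1}(x - s).

import numpy as np

rng = np.random.default_rng(4)
N = 200
t = np.linspace(0.0, 1.0, N)

H = np.column_stack([np.ones(N), t])          # invented observation matrix (line fit)
theta_true = np.array([1.0, -2.0])
s_known = 0.5 * np.sin(2 * np.pi * t)         # invented known signal component

C = np.diag(0.1 + t)                          # invented (coloured) noise covariance
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta_true + s_known + w

# Whitening: with C = L L^T (Cholesky), D = L^{-1} satisfies C^{-1} = D^T D
L = np.linalg.cholesky(C)
D = np.linalg.inv(L)

H_w = D @ H
x_w = D @ (x - s_known)

theta_hat, *_ = np.linalg.lstsq(H_w, x_w, rcond=None)
print(theta_hat)                              # close to theta_true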
5 Sufficient Statistics

A statistic T(x) is sufficient for θ if it summarises all the information in
the data that is relevant to θ. The Neyman-Fisher Factorization gives a
practical test: if we can factor the PDF p(x; θ) as:

    p(x; θ) = g(T(x), θ) h(x)

where g is a function depending on the data x only through T(x), and h is a
function of the data only, then T(x) is a sufficient statistic for θ.

5.1.1 EXAMPLE

Consider a signal embedded in a WGN signal:

    x[n] = A + w[n]

where θ = A. Then:

    p(x; θ) = (2πσ²)^{-N/2} exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - θ)²]

We factor p(x; θ) as follows:

    p(x; θ) = {(2πσ²)^{-N/2} exp[-(1/(2σ²))(Nθ² - 2θ Σ_{n=0}^{N-1} x[n])]} ·
              {exp[-(1/(2σ²)) Σ_{n=0}^{N-1} x²[n]]}
            = g(T(x), θ) h(x)

and hence T(x) = Σ_{n=0}^{N-1} x[n] is a sufficient statistic for θ.

Given a sufficient statistic T(x), the MVU estimator can be found in two ways:

1. Let θ̌ be any unbiased estimator of θ. Then θ̂ = E(θ̌|T(x)) = ∫ θ̌ p(θ̌|T(x)) dθ̌
   is the MVU estimator.

2. Find the function g such that θ̂ = g(T(x)) is an unbiased estimator of θ,
   that is E[g(T(x))] = θ; then θ̂ is the MVU estimator.

The first approach is justified by the Rao-Blackwell-Lehmann-Scheffe (RBLS)
theorem: θ̂ = E(θ̌|T(x)) is:

1. A valid estimator of θ (a function of the data only)

2. Unbiased

3. Of lesser or equal variance than that of θ̌, for all θ

and, if the sufficient statistic T(x) is complete, θ̂ is the MVU estimator.
5.2.1 EXAMPLE

Consider the previous example of a signal embedded in a WGN signal:

    x[n] = A + w[n]

with sufficient statistic T(x) = Σ_{n=0}^{N-1} x[n]. Using method 2 we seek a
function g such that E[g(T(x))] = θ = A. Now:

    E[T(x)] = E[Σ_{n=0}^{N-1} x[n]] = Σ_{n=0}^{N-1} E[x[n]] = Nθ

It is obvious that:

    E[(1/N) Σ_{n=0}^{N-1} x[n]] = θ

and thus θ̂ = g(T(x)) = (1/N) Σ_{n=0}^{N-1} x[n], which is the sample mean we
have already seen before, is the MVU estimator for θ.
6 Best Linear Unbiased Estimators (BLUE)

It may occur that the MVU estimator or a sufficient statistic cannot be found
or, indeed, the PDF of the data is itself unknown (only the second-order
statistics are known). In such cases one solution is to assume a functional
model of the estimator, as being linear in the data, and find the linear
estimator which is both unbiased and has minimum variance: the BLUE.

For the general vector case we want our estimator to be a linear function of
the data, that is:

    θ̂ = Ax

Unbiasedness requires E(θ̂) = A E(x) = θ, which can only be satisfied if
E(x) = Hθ and AH = I. The BLUE is derived by finding the A which minimises the
variance, C_θ̂ = ACA^T, where C = E[(x - E(x))(x - E(x))^T] is the covariance of
the data x, subject to the constraint AH = I. Carrying out this minimisation
yields:

    θ̂ = Ax = (H^T C^{-1} H)^{-1} H^T C^{-1} x

where C_θ̂ = (H^T C^{-1} H)^{-1}. This has the same functional form as the MVU
estimator for the general linear model. The crucial difference is that the BLUE
does not make any assumptions on the PDF of the data (or noise) whereas the MVU
estimator was derived assuming Gaussian noise. Of course, if the data is truly
Gaussian the BLUE is also the MVU estimator.

Gauss-Markov Theorem

If the data is of the general linear model form:

    x = Hθ + w

where w is a noise vector with zero mean and covariance C (of otherwise
arbitrary PDF), then the BLUE of θ is:

    θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1} x

where C_θ̂ = (H^T C^{-1} H)^{-1} is the minimum covariance.
6.1 EXAMPLE

Consider a signal embedded in noise:

    x[n] = A + w[n]

where w[n] is of unspecified PDF with var(w[n]) = σ_n² and the parameter θ = A
is to be estimated. We identify H by noting:

    E[x] = 1θ

where 1 = [1, 1, 1, ..., 1]^T, and so we have H ≡ 1. Also:

    C = diag(σ_0², σ_1², ..., σ_{N-1}²)

and

    C^{-1} = diag(1/σ_0², 1/σ_1², ..., 1/σ_{N-1}²)

Hence the BLUE is:

    θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1} x = (1^T C^{-1} x)/(1^T C^{-1} 1)
       = [Σ_{n=0}^{N-1} x[n]/σ_n²] / [Σ_{n=0}^{N-1} 1/σ_n²]

and

    C_θ̂ = var(θ̂) = 1/(1^T C^{-1} 1) = 1 / [Σ_{n=0}^{N-1} 1/σ_n²]

If the noise is white, σ_n² = σ² for all n, the BLUE reduces to the sample
mean:

    θ̂ = (1/N) Σ_{n=0}^{N-1} x[n]

with minimum variance var(θ̂) = σ²/N.
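In this scalar case the BLUE is just an inverse-variance weighted average. The
sketch below (the true level A and the per-sample noise variances are invented
for the example) compares it with the plain sample mean.

import numpy as np

rng = np.random.default_rng(5)
A_true = 3.0
sigma2 = np.array([0.5, 1.0, 2.0, 4.0, 8.0] * 40)   # invented per-sample noise variances
x = A_true + np.sqrt(sigma2) * rng.standard_normal(sigma2.size)

# BLUE: weight each sample by 1/sigma_n^2
w = 1.0 / sigma2
A_blue = np.sum(w * x) / np.sum(w)
var_blue = 1.0 / np.sum(w)

print(A_blue, var_blue)
print(x.mean())    # unweighted sample mean, larger variance than the BLUE here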
7 Maximum Likelihood Estimation (MLE)

The Maximum Likelihood Estimator (MLE) is the value of θ that maximises the
likelihood function p(x; θ) for the given data x, that is, θ̂ is such that:

    θ̂ = arg max_θ p(x; θ)

The MLE is asymptotically unbiased:

    lim_{N→∞} E(θ̂) = θ

and asymptotically efficient:

    lim_{N→∞} var(θ̂) = CRLB

Even when an MVU estimator cannot be found, given an expression for p(x; θ) one
can always find the MLE, numerically if necessary.
7.1.1 EXAMPLE

Consider the signal embedded in noise problem:

    x[n] = A + w[n]

where w[n] is WGN with zero mean but unknown variance which is also A, that is
the unknown parameter, θ = A > 0, manifests itself both as the unknown signal
and the variance of the noise. Although a highly unlikely scenario, this simple
example demonstrates the power of the MLE approach since finding the MVU
estimator by the previous procedures is not easy. Consider the PDF:

    p(x; θ) = (2πθ)^{-N/2} exp[-(1/(2θ)) Σ_{n=0}^{N-1} (x[n] - θ)²]

We consider p(x; θ) as a function of θ (the likelihood function) and maximise
its logarithm:

    ln p(x; θ) = ln[(2πθ)^{-N/2}] - (1/(2θ)) Σ_{n=0}^{N-1} (x[n] - θ)²

Differentiating we have:

    ∂ ln p(x; θ)/∂θ = -N/(2θ) + (1/θ) Σ_{n=0}^{N-1} (x[n] - θ)
                      + (1/(2θ²)) Σ_{n=0}^{N-1} (x[n] - θ)²

Setting this to zero and solving (taking the positive root since θ > 0) yields
the MLE:

    θ̂ = -1/2 + [ (1/N) Σ_{n=0}^{N-1} x²[n] + 1/4 ]^{1/2}

The MLE is asymptotically unbiased:

    lim_{N→∞} E(θ̂) = θ

and asymptotically efficient:

    lim_{N→∞} var(θ̂) = CRLB = θ²/(N(θ + 1/2))
7.1.2 EXAMPLE

Consider a signal embedded in noise:

    x[n] = A + w[n]

where w[n] is WGN with zero mean and known variance σ². We know the MVU
estimator for θ is the sample mean. To see that this is also the MLE, we
consider the PDF:

    p(x; θ) = (2πσ²)^{-N/2} exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - θ)²]

Setting the derivative of the log-likelihood to zero:

    ∂ ln p(x; θ)/∂θ = (1/σ²) Σ_{n=0}^{N-1} (x[n] - θ) = 0

gives Σ_{n=0}^{N-1} x[n] - Nθ = 0, and thus the MLE is the sample mean:

    θ̂ = (1/N) Σ_{n=0}^{N-1} x[n]
The MLE also has an invariance property: if the parameter of interest is
α = g(θ), then the MLE of α is given by α̂ = g(θ̂), where θ̂ is the MLE of θ. If g
is not a one-to-one function, then α̂ maximises the modified likelihood
function:

    p_T(x; α) = max_{θ: α = g(θ)} p(x; θ)
7.3 MLE for the General Linear Model

Consider the general linear model:

    x = Hθ + w

where H is a known N × p observation matrix, x is the N × 1 vector of observed
samples, θ is the p × 1 parameter vector and w is a noise vector with PDF
N(0, C). The PDF is:

    p(x; θ) = (2π)^{-N/2} det^{-1/2}(C) exp[-(1/2)(x - Hθ)^T C^{-1} (x - Hθ)]

and the MLE of θ is found by maximising ln p(x; θ), which can be shown to
yield:

    ∂ ln p(x; θ)/∂θ = ∂(Hθ)^T/∂θ C^{-1}(x - Hθ) = H^T C^{-1}(x - Hθ) = 0

and thus we obtain the MLE of θ as:

    θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1} x
7.4 EM Algorithm

We wish to use the MLE procedure to find an estimate for the unknown parameter
θ by maximising ln p_x(x; θ). However we may find that this is either too
difficult or there are difficulties in finding an expression for the PDF
itself. In such cases we term x the incomplete data and introduce another data
set y, the complete data, related to the observed data by a transformation:

    x = g(y)

However this is usually a many-to-one transformation, in that a subset of the
complete data maps to the incomplete data (e.g. the incomplete data may
represent an accumulation or sum of the complete data). This explains the
terminology: the observed data x is incomplete relative to the data y, which is
complete (for performing the MLE procedure). This is usually not evident,
however, until one is able to define what the complete data set is.
Unfortunately defining what constitutes the complete data is usually an
arbitrary procedure which is highly problem specific.
7.4.1 EXAMPLE

Consider spectral analysis, where the observed signal, x[n], is composed of an
unknown set of sinusoidal frequencies in noise:

    x[n] = Σ_{i=1}^{p} cos(2πf_i n) + w[n]        n = 0, 1, ..., N-1

where w[n] is WGN with variance σ² and the unknown parameter vector is the set
of frequencies θ = f = [f_1 f_2 ... f_p]^T. The standard MLE would require
maximisation of the log-likelihood of a multivariate Gaussian distribution,
which is equivalent to minimising the argument of the exponential:

    J(f) = Σ_{n=0}^{N-1} ( x[n] - Σ_{i=1}^{p} cos(2πf_i n) )²

which is a p-dimensional minimisation. If instead we could observe each
sinusoid separately in noise,

    y_i[n] = cos(2πf_i n) + w_i[n]        i = 1, 2, ..., p;  n = 0, 1, ..., N-1

where w_i[n] is WGN with variance σ_i², the problem would decouple into p
independent one-dimensional minimisations of:

    J(f_i) = Σ_{n=0}^{N-1} ( y_i[n] - cos(2πf_i n) )²        i = 1, 2, ..., p

which are easy to carry out. Hence the y_i[n] form the complete data set that
we are looking for to facilitate the MLE procedure, and we collect them into
the vector y. The transformation between complete and incomplete data,
x = g(y), is:

    x[n] = Σ_{i=1}^{p} y_i[n]        n = 0, 1, ..., N-1

and we further assume that the noise decomposes in the same way:

    w[n] = Σ_{i=1}^{p} w_i[n]        with        σ² = Σ_{i=1}^{p} σ_i²
Since the complete data y is not observed we cannot maximise ln p_y(y; θ)
directly. Instead we take the conditional expectation of the complete-data
log-likelihood given the observed (incomplete) data and the current parameter
estimate, and we attempt to maximise this expectation to yield, not the MLE,
but the next best-guess estimate for θ. Iterating the two steps constitutes the
EM algorithm. Specifically:

Expectation (E): compute the average log-likelihood of the complete data given
the known incomplete or observed data and the current estimate of the
parameter vector θ_k:

    Q(θ, θ_k) = ∫ ln p_y(y; θ) p(y|x; θ_k) dy

Maximisation (M): maximise the averaged log-likelihood to obtain the next
estimate:

    θ_{k+1} = arg max_θ Q(θ, θ_k)

The points to note with the EM algorithm are:

1. An initial value for the unknown parameter is needed and as with most
   iterative procedures a good initial estimate is required for good
   convergence

2. The selection of the complete data set is arbitrary

3. Although the likelihood cannot decrease from one iteration to the next,
   convergence may be to a local rather than the global maximum
7.4.2 EXAMPLE

Applying the EM algorithm to the previous example requires finding a closed
expression for the average log-likelihood.

E-step. We first express ln p_y(y; θ) in terms of the complete data:

    ln p_y(y; θ) = h(y) + Σ_{i=1}^{p} (1/σ_i²) Σ_{n=0}^{N-1} y_i[n] cos(2πf_i n)
                 = h(y) + Σ_{i=1}^{p} (1/σ_i²) c_i^T y_i

where the terms in h(y) do not depend on θ,
c_i = [1, cos(2πf_i), cos(2πf_i(2)), ..., cos(2πf_i(N-1))]^T and
y_i = [y_i[0], y_i[1], y_i[2], ..., y_i[N-1]]^T. We write the conditional
expectation as:

    Q(θ, θ_k) = E[ln p_y(y; θ) | x; θ_k]

Dropping the terms that do not depend on θ, maximising Q(θ, θ_k) wrt θ is
equivalent to maximising:

    Q'(θ, θ_k) = Σ_{i=1}^{p} (1/σ_i²) c_i^T ŷ_i

where c_i depends on the trial frequency f_i while ŷ_i = E(y_i|x; θ_k) is a
function only of the observed data x[n] and the current estimate θ_k. Since the
complete data are Gaussian and x = Σ_i y_i, the conditional expectation is:

    ŷ_i = E(y_i|x; θ_k) = c_i + (σ_i²/σ²)( x - Σ_{j=1}^{p} c_j )

or, sample by sample:

    ŷ_i[n] = cos(2πf_i^k n) + (σ_i²/σ²)( x[n] - Σ_{j=1}^{p} cos(2πf_j^k n) )

with the c_j evaluated at the current estimates f_j^k. Thus:

    Q'(θ, θ_k) = Σ_{i=1}^{p} (1/σ_i²) c_i^T ŷ_i

M-step. Maximisation of Q'(θ, θ_k) decouples into p separate one-dimensional
maximisations, one for each frequency:

    f_i^{k+1} = arg max_{f_i} c_i^T ŷ_i

Finally, the choice of the individual variances σ_i² is arbitrary subject to
the constraint:

    σ² = Σ_{i=1}^{p} σ_i²        i.e.        Σ_{i=1}^{p} σ_i²/σ² = 1
8 Least Squares

The classical approaches of the previous sections require knowledge of the
data PDF p(x; θ). An alternative is to adopt a signal model (rather than
probabilistic assumptions about the data) and achieve a design goal assuming
this model. With the Least Squares (LS) approach we assume that the signal
model is a function of the unknown parameter θ and produces a signal:

    s[n] ≡ s(n; θ)

where the observed data are the signal perturbed by noise and model
inaccuracies:

    x[n] = s[n] + w[n]

Here w[n] lumps together the measurement noise and the model error. The LS
error:

    e[n] = x[n] - s[n]

should be minimised in a squared-error sense, that is:

    J(θ) = Σ_{n=0}^{N-1} (x[n] - s[n])²

is minimised over the N observation samples of interest; we call the minimising
value θ̂ the LSE of θ, and Jmin = J(θ̂).

An important assumption to produce a meaningful unbiased estimate is that the
noise and model inaccuracies, w[n], have zero mean. However no other
probabilistic assumptions about the data are made (i.e. LSE is valid for both
Gaussian and non-Gaussian noise), although by the same token we also cannot
make any optimality claims with LSE (since this would depend on the
distribution of the noise and modelling errors). A problem that arises from
assuming a signal model function rather than knowledge of p(x; θ) is precisely
this lack of optimality statements, but to derive p(x; θ) one usually needs to
know what the underlying model and noise characteristics are anyway.
8.1.1 EXAMPLE

Consider observations, x[n], arising from the DC-level signal model
s[n] = s(n; θ) = θ:

    x[n] = θ + w[n]

The LS error criterion is:

    J(θ) = Σ_{n=0}^{N-1} (x[n] - s[n])² = Σ_{n=0}^{N-1} (x[n] - θ)²

Setting the derivative to zero:

    ∂J(θ)/∂θ = 0   ⟹   -2 Σ_{n=0}^{N-1} (x[n] - θ) = 0   ⟹   Σ_{n=0}^{N-1} x[n] - Nθ = 0

and hence θ̂ = (1/N) Σ_{n=0}^{N-1} x[n], which is the sample-mean. The minimum
LS error is:

    Jmin = J(θ̂) = Σ_{n=0}^{N-1} ( x[n] - (1/N) Σ_{m=0}^{N-1} x[m] )²
In the linear LS problem the signal model is linear in the parameters:

    s = Hθ

where s = [s[0], s[1], ..., s[N-1]]^T and H is a known N × p matrix with
θ = [θ_1, θ_2, ..., θ_p]^T. Now x = Hθ + w, and with:

    J(θ) = Σ_{n=0}^{N-1} (x[n] - s[n])² = (x - Hθ)^T(x - Hθ)

we have, differentiating and setting to zero:

    ∂J(θ)/∂θ = -2H^T x + 2H^T Hθ = 0

    θ̂ = (H^T H)^{-1} H^T x

which, surprise, surprise, is the identical functional form of the MVU
estimator for the linear model.

An interesting extension to the linear LS is the weighted LS, where the
contribution to the error from each component of the error vector can be
weighted in importance by using a different form of the error criterion:

    J(θ) = (x - Hθ)^T W (x - Hθ)

where W is an N × N positive definite weighting matrix.
Since the LSE picks, for a given signal model s(θ), the parameter value that
minimises J, one might think of also choosing between candidate models s(θ) by
comparing their Jmin values and preferring the model with the smallest Jmin.
However, models are not arbitrary and some models are more complex (or more
precisely have a larger number of parameters or degrees of freedom) than
others. The more complex a model the lower the Jmin it can achieve, but the
more likely the model is to overfit the data or be overtrained (i.e. fit the
noise rather than the underlying signal).
8.3.1 EXAMPLE

Assume we have data x(t) plotted against t and we would like to fit the best
curve to the data. But what curve do we fit: s(t; θ) = θ_1, a constant?
s(t; θ) = θ_1 + θ_2 t, a straight line? s(t; θ) = θ_1 + θ_2 t + θ_3 t², a
quadratic? etc. Each case represents an increase in the order of the model
(i.e. order of the polynomial fit), or the number of parameters. With samples
at t = 0, 1, ..., N-1 each model can be written as s = Hθ, where:

Constant: θ = [θ_1]^T and H = [1, 1, ..., 1]^T is an N × 1 matrix

Linear: θ = [θ_1, θ_2]^T and

    H = [ 1   0
          1   1
          :   :
          1   N-1 ]

is an N × 2 matrix

Quadratic: θ = [θ_1, θ_2, θ_3]^T and

    H = [ 1   0      0
          1   1      1
          1   2      4
          :   :      :
          1   N-1    (N-1)² ]

is an N × 3 matrix

and so on. If the underlying model is indeed a straight line then we would
expect Jmin to drop substantially in going from the constant to the straight
line model, while higher order polynomial models (e.g. quadratic, cubic, etc.)
will yield essentially the same Jmin (decreasing only slightly, except in cases
of overfitting). Thus the straight line model is the best model to use.
When the appropriate order is not known in advance an order-recursive LS
procedure can be used with nested versions of the same base model (e.g.
polynomials of different degree). In this method the LSE is updated in order
(of increasing number of parameters). Specifically, denoting by θ̂_k the LSE of
order k, the next-order estimate is obtained as:

    θ̂_{k+1} = θ̂_k + UPDATE_k

However the success of this approach depends on proper formulation of the
linear models in order to facilitate the derivation of the recursive update;
the structure of the observation matrix H must allow the update to be found,
since increasing the order implies increasing the dimensionality of the space
spanned by the columns of H.

The LS methods considered so far operate in a batch or block mode of processing
whereby we wait for N samples to arrive and then form our estimate based on
these samples. One problem is the delay in waiting for the N samples before we
produce our estimate; another problem is that as more data arrives we have to
repeat the calculations on the larger blocks of data (N increases as more data
arrives). The latter not only implies a growing computational burden but also
the fact that we have to buffer all the data we have seen; both will grow
linearly with the number of samples we have. Since in signal processing
applications the data typically arrive sequentially, an alternative is to use a
sequential LS update in which the new estimate θ̂[n] is obtained from the
previous estimate θ̂[n-1] and the new sample x[n]:

    θ̂[n] = θ̂[n-1] + K[n](x[n] - ŝ[n|n-1])

where ŝ[n|n-1] ≡ s(n; θ̂[n-1]) and (x[n] - ŝ[n|n-1]) is the prediction error.
The gain factor K[n] depends on var(θ̂[n-1]), with a larger variance yielding a
larger correction gain. This behaviour is reasonable since a larger variance
implies a poorly estimated parameter (which should have minimum variance) and a
larger correction gain is expected. Thus one expects the variance to decrease
with more samples and the estimated parameter to converge to the true value (or
the LSE with an infinite number of samples).
8.4.1 EXAMPLE

We consider the specific case of the linear model with a vector parameter,
observed one sample at a time:

    s[n] = H[n]θ

The most interesting example of sequential LS arises with the weighted LS error
criterion with weighting matrix C^{-1}[n], where

    C[n] = diag(σ_0², σ_1², ..., σ_n²)

is the covariance matrix of the (uncorrelated) noise samples. Writing the
observation matrix at time n in terms of that at time n-1 and the new row
h^T[n]:

    H[n] = [ H[n-1]
             h^T[n] ]

it should be obvious that the estimate at time n can be written in terms of the
estimate at time n-1. So the sequential update of the estimator is:

    θ̂[n] = θ̂[n-1] + K[n](x[n] - h^T[n]θ̂[n-1])

Let Σ[n] = C_θ̂[n] be the covariance of the estimate based on the data up to
time n. Then the gain is:

    K[n] = Σ[n-1]h[n] / (σ_n² + h^T[n]Σ[n-1]h[n])

and the covariance update is:

    Σ[n] = (I - K[n]h^T[n])Σ[n-1]
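These three update equations translate directly into a short recursion. The
sketch below is only an illustration: the line-fit data, the noise variance and
the diffuse initialisation (θ̂ = 0 with a large initial covariance) are
assumptions, not prescriptions from the notes.

import numpy as np

def sequential_ls(x, H, sigma2, theta0, Sigma0):
    # Sequential weighted LS: one scalar observation x[n] = h[n]^T theta + w[n] per step.
    # x: (N,) observations, H: (N, p) rows h[n]^T, sigma2: (N,) noise variances,
    # theta0: (p,) initial estimate, Sigma0: (p, p) initial covariance of the estimate.
    theta, Sigma = theta0.copy(), Sigma0.copy()
    for n in range(len(x)):
        h = H[n]
        K = Sigma @ h / (sigma2[n] + h @ Sigma @ h)            # gain
        theta = theta + K * (x[n] - h @ theta)                 # estimator update
        Sigma = (np.eye(len(theta)) - np.outer(K, h)) @ Sigma  # covariance update
    return theta, Sigma

# Invented example: fit a line x[n] = theta_0 + theta_1 * (n/N)
rng = np.random.default_rng(6)
N = 500
Hmat = np.column_stack([np.ones(N), np.arange(N) / N])
theta_true = np.array([1.0, -0.5])
x = Hmat @ theta_true + 0.2 * rng.standard_normal(N)
print(sequential_ls(x, Hmat, 0.04 * np.ones(N), np.zeros(2), 100.0 * np.eye(2))[0])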
Constrained LS arises when the p-dimensional parameter vector θ is subject to
r < p independent linear constraints:

    Aθ = b

Then using the technique of Lagrangian multipliers our LSE problem is that of
minimising the following Lagrangian error criterion:

    Jc = (x - Hθ)^T(x - Hθ) + λ^T(Aθ - b)

Let θ̂ = (H^T H)^{-1} H^T x be the unconstrained LSE. Carrying out the
minimisation yields the constrained LSE:

    θ̂_c = θ̂ + (H^T H)^{-1} A^T [A(H^T H)^{-1} A^T]^{-1} (b - Aθ̂)
s()
s() = H
p -dimensional parameter .
N -dimensional
J = (x s())T (x s())
becomes much more dicult. Dierentiating wrt
J
=0
s()T
(x s()) = 0
Approximate
solutions based on linearization of the problem exist which require iteration until
convergence.
g() =
s()T
(x s())
So the problem becomes that of nding the zero of a nonlinear function, that
is:
g() = 0.
25
initialised value
k+1 = k
g()
1
g()
= k
Of course if the function was linear the next estimate would be the correct value,
but since the function is nonlinear this will not be the case and the procedure
is iterated until the estimates converge.
k :
s()
s() s( k ) +
( k )
=k
J = (
x( k ) H( k ))T (
x( k ) H( k ))
where
( k ) = x s( k ) + H( k ) k
x
is known and:
s()
H( k ) =
=k
Based on the standard solution to the LSE of the linear model, we have that:
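To make the linearisation concrete, here is a small Python sketch of this
iteration applied to an invented one-parameter problem: estimating the
frequency of s[n] = cos(2πfn) from noisy data. The true frequency, the noise
level and the (deliberately close) starting value are assumptions made for the
example; this particular cost surface has many local minima, so a good initial
estimate is essential, as the notes point out for iterative procedures.

import numpy as np

rng = np.random.default_rng(7)
N = 100
n = np.arange(N)
f_true = 0.12
x = np.cos(2 * np.pi * f_true * n) + 0.1 * rng.standard_normal(N)

def s(f):                        # signal model s(n; f)
    return np.cos(2 * np.pi * f * n)

def jacobian(f):                 # ds/df, an N x 1 matrix here
    return (-2 * np.pi * n * np.sin(2 * np.pi * f * n)).reshape(-1, 1)

f_k = 0.118                      # initial value, assumed close to f_true
for _ in range(20):
    H = jacobian(f_k)
    r = x - s(f_k)
    step, *_ = np.linalg.lstsq(H, r, rcond=None)   # (H^T H)^{-1} H^T r
    f_k = f_k + step[0]

print(f_k)                       # approaches f_true when started close enough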
Method of Moments
Although we may not have an expression for the PDF, we assume that we can use
the natural estimator for the kth moment, μ_k = E(x^k[n]), that is:

    μ̂_k = (1/N) Σ_{n=0}^{N-1} x^k[n]

If we can write an expression for the kth moment as a function of the unknown
parameter θ:

    μ_k = h(θ)

and assuming h can be inverted, an estimate of θ is:

    θ̂ = h^{-1}(μ̂_k) = h^{-1}( (1/N) Σ_{n=0}^{N-1} x^k[n] )

If θ is a p-dimensional parameter vector then we require p moment equations,
μ = h(θ), giving θ̂ = h^{-1}(μ̂) where:

    μ̂ = [ (1/N) Σ_{n=0}^{N-1} x[n],  (1/N) Σ_{n=0}^{N-1} x²[n],  ...,
          (1/N) Σ_{n=0}^{N-1} x^p[n] ]^T
9.1 EXAMPLE

Consider a 2-mixture Gaussian PDF:

    p(x[n]; θ) = (1 - θ) g_1(x[n]) + θ g_2(x[n])

where g_1 = N(0, σ_1²) and g_2 = N(0, σ_2²) are known zero-mean Gaussian pdfs
and the mixing parameter θ is the unknown parameter that has to be estimated.
We can write the second moment as a function of θ as follows:

    μ_2 = E(x²[n]) = (1 - θ)σ_1² + θσ_2²

and hence:

    θ = (μ_2 - σ_1²)/(σ_2² - σ_1²)

so that, using the natural estimator of μ_2:

    θ̂ = [ (1/N) Σ_{n=0}^{N-1} x²[n] - σ_1² ] / (σ_2² - σ_1²)
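The estimator is a one-liner once the second sample moment is available. The
sketch below (with invented values θ = 0.3, σ_1² = 1 and σ_2² = 9 for the
zero-mean components) draws from the mixture and applies the formula.

import numpy as np

rng = np.random.default_rng(8)
N = 100000
theta_true, s1, s2 = 0.3, 1.0, 9.0            # mixing weight and the two variances

# Draw from (1 - theta) N(0, s1) + theta N(0, s2)
from_second = rng.random(N) < theta_true
std = np.where(from_second, np.sqrt(s2), np.sqrt(s1))
x = std * rng.standard_normal(N)

mu2_hat = np.mean(x ** 2)                     # natural estimator of E(x^2[n])
theta_hat = (mu2_hat - s1) / (s2 - s1)        # method-of-moments estimate
print(theta_hat)                              # close to theta_true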
10 Bayesian Philosophy
In the classic approach the parameter θ is assumed to be an unknown
deterministic constant, and we seek an estimator that is optimal irrespective
of the true value of θ. In situations where an MVU estimator does not exist for
certain values of θ, or where prior knowledge about θ would improve the
estimation, the classic approach would not work effectively. In the Bayesian
philosophy the parameter θ is instead treated as a random variable with a known
prior pdf, p(θ), through which the prior knowledge is incorporated.
In the classic approach we derived the MVU estimator by first considering
minimisation of the mean square error, i.e. θ̂ = arg min mse(θ̂), where:

    mse(θ̂) = E[(θ̂ - θ)²] = ∫ (θ̂ - θ)² p(x; θ) dx

and p(x; θ) is the pdf of x parametrised by θ. In the Bayesian approach we
instead minimise the Bayesian mse, i.e. θ̂ = arg min Bmse(θ̂), where:

    Bmse(θ̂) = E[(θ - θ̂)²] = ∫∫ (θ - θ̂)² p(x, θ) dx dθ

is the Bayesian mse and p(x, θ) = p(x|θ)p(θ) is the joint pdf of the data and
the parameter. Although the two expressions appear similar (since the squared
errors being averaged are the same), they are fundamentally different: θ is now
a random variable and the expectation is also taken over θ.

The minimum Bmse(θ̂) estimator, or MMSE estimator, is obtained by
differentiating Bmse(θ̂) and setting the result to zero; we obtain the mean of
the posterior pdf:

    θ̂ = E(θ|x) = ∫ θ p(θ|x) dθ

where the posterior pdf is given by Bayes' rule:

    p(θ|x) = p(x, θ)/p(x) = p(x, θ)/∫ p(x, θ) dθ = p(x|θ)p(θ)/∫ p(x|θ)p(θ) dθ

Apart from the computational requirements in deriving an expression for the
posterior pdf and then evaluating the expectation E(θ|x), there is also the
problem of finding an appropriate prior pdf. The usual choice is to assume that
the joint pdf, p(x, θ), is Gaussian (i.e. the prior pdf is a conjugate prior
distribution). Thus the form of the pdfs remains the same and all that changes
are the means and variances.
10.1.1 EXAMPLE

Consider a signal embedded in noise:

    x[n] = A + w[n]

where as before w[n] is WGN with variance σ² and θ = A, but now the parameter A
is modelled as a random variable with Gaussian prior pdf p(A) = N(μ_A, σ_A²).
We can assume the posterior pdf:

    p(A|x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA = N(μ_{A|x}, σ²_{A|x})

is also a Gaussian pdf, and after the required simplification we have that:

    σ²_{A|x} = 1 / ( N/σ² + 1/σ_A² )

and

    μ_{A|x} = σ²_{A|x} ( N x̄/σ² + μ_A/σ_A² )

where x̄ is the sample mean. The MMSE estimator is the posterior mean:

    Â = E[A|x] = ∫ A p(A|x) dA = μ_{A|x} = α x̄ + (1 - α) μ_A

where

    α = σ_A² / (σ_A² + σ²/N)

We can make the following observations (comparing σ_A² with σ²/N):

1. When σ_A² ≪ σ²/N (the prior knowledge is much better than the data) then
   α → 0 and Â → μ_A, that is the MMSE tends towards the mean of the prior pdf
   (and effectively ignores the contribution of the data). Also
   p(A|x) → N(μ_A, σ_A²).

2. When σ_A² ≫ σ²/N (the data is much more reliable than the prior knowledge)
   then α → 1 and Â → x̄, that is the MMSE tends towards the sample mean x̄ (and
   effectively ignores the contribution of the prior information). Also
   p(A|x) → N(x̄, σ²/N).
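The closed form makes this shrinkage behaviour easy to see numerically. In the
sketch below (the values of μ_A, σ_A², σ² and the list of N are assumptions for
the example) the MMSE estimate is a weighted combination of the sample mean and
the prior mean, with the weight α moving towards 1 as N grows.

import numpy as np

rng = np.random.default_rng(9)
mu_A, var_A = 2.0, 0.25          # prior mean and variance (assumed)
sigma2 = 1.0                     # noise variance (assumed)
A = mu_A + np.sqrt(var_A) * rng.standard_normal()   # draw the "true" A from the prior

for N in (1, 10, 100, 1000):
    x = A + np.sqrt(sigma2) * rng.standard_normal(N)
    alpha = var_A / (var_A + sigma2 / N)
    A_mmse = alpha * x.mean() + (1 - alpha) * mu_A
    print(N, round(alpha, 3), round(A_mmse, 3), round(x.mean(), 3))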
For jointly Gaussian x and y the posterior pdf p(y|x) is also Gaussian with
mean E(y|x) = E(y) + C_yx C_xx^{-1} (x - E(x)). Now consider the Bayesian
linear model:

    x = Hθ + w

where θ is the unknown random parameter vector to be estimated, with prior pdf
N(μ_θ, C_θ), and w is WGN with pdf N(0, C_w). The MMSE is provided by the
expression for E(y|x) where we identify y ≡ θ. We have that:

    E(x) = Hμ_θ        E(y) = μ_θ
    C_xx = HC_θH^T + C_w        C_yx = C_θH^T

and hence:

    θ̂ = E(θ|x) = μ_θ + C_θH^T(HC_θH^T + C_w)^{-1}(x - Hμ_θ)

If little prior knowledge is available, very large prior variances have to be
considered; in the limit this gives a noninformative prior distribution,
essentially σ_θ² → ∞.
10.3.1 EXAMPLE

Consider the signal embedded in noise problem:

    x[n] = A + w[n]

where we have shown that the MMSE is:

    Â = α x̄ + (1 - α) μ_A        with        α = σ_A² / (σ_A² + σ²/N)

With a noninformative prior, σ_A² → ∞, we obtain α = 1 and then

    Â = x̄

i.e. the MMSE estimator reduces to the classical sample-mean estimator.
A further feature of the Bayesian formulation is the handling of additional
unknown (nuisance) parameters. The posterior pdf is:

    p(θ|x) = p(x|θ)p(θ) / ∫ p(x|θ)p(θ) dθ

Now suppose the data pdf depends, in reality, also on a nuisance parameter α
with prior p(α). Then the required conditional pdf p(x|θ) is obtained by
integrating the nuisance parameter out:

    p(x|θ) = ∫ p(x|θ, α) p(α) dα

and the estimation of θ proceeds as before, with the true p(x|θ, α) replaced by
this marginalised pdf.
11 General Bayesian Estimators

The Bmse(θ̂), given by:

    Bmse(θ̂) = E[(θ - θ̂)²] = ∫∫ (θ - θ̂)² p(x, θ) dx dθ

is one specific case of a general estimator that attempts to minimise the
average of a cost function, C(ε), i.e. the Bayes risk R = E[C(ε)], where
ε = (θ - θ̂). Three common cost functions are:

1. Quadratic: C(ε) = ε², for which R = Bmse(θ̂). The estimate that minimises
   R is the mean of the posterior pdf.

2. Absolute: C(ε) = |ε|. The estimate, θ̂, that minimises R = E[|θ - θ̂|]
   satisfies:

       ∫_{-∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ = 1/2

   i.e. θ̂ is the median of the posterior pdf.

3. Hit-or-miss: C(ε) = 0 for |ε| < δ and C(ε) = 1 for |ε| > δ, for a small δ.
   The estimate that minimises R is the mode (maximum) of the posterior pdf.

For a Gaussian posterior pdf it should be noted that the mean, median and mode
are identical. These cost functions, together with a special case of the
latter, yield the following three important classes of estimator:

1. MMSE estimator, which is the mean of the posterior pdf:

       θ̂ = E(θ|x) = ∫ θ p(θ|x) dθ = ∫ θ p(x|θ)p(θ) dθ / ∫ p(x|θ)p(θ) dθ

2. MAP (Maximum A Posteriori) estimator, which is the mode of the posterior
   pdf:

       θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ)

3. Bayesian ML (Bayesian Maximum Likelihood) estimator, which is the special
   case of the MAP estimator where the prior pdf, p(θ), is uniform or
   noninformative:

       θ̂ = arg max_θ p(x|θ)

   Since the conditional pdf of x given θ, p(x|θ), is essentially equivalent to
   the pdf of x parametrized by θ, p(x; θ), the Bayesian ML estimator is in
   effect the classical MLE.

The MAP hit-or-miss cost function is more crude, but the MAP estimate is easier
to derive since there is no need to integrate, only to find the maximum of the
posterior pdf p(θ|x) (for which p(x|θ)p(θ) suffices). However, like the classic
MLE, an expression for the conditional pdf p(x|θ) is still required.
12 Linear Bayesian Estimators (LMMSE)

The MMSE, MAP and Bayesian ML estimators all require an expression for the
joint pdf p(x, θ). As with the classical BLUE we can instead constrain the
estimator to be a linear function of the data and require knowledge of only the
first two moments of p(x, θ). We seek an estimator of the form:

    θ̂ = Σ_{n=0}^{N-1} a_n x[n] + a_N = a^T x + a_N

where a = [a_0 a_1 ... a_{N-1}]^T, and choose the weights {a, a_N} to minimise
the Bayesian MSE:

    Bmse(θ̂) = E[(θ - θ̂)²]

The resultant estimator is termed the Linear Minimum Mean Square Error (LMMSE)
estimator. (The Bayesian linear model x = Hθ + w is treated as a special case
below.) The weight co-efficients are obtained from:

    ∂Bmse(θ̂)/∂a_i = 0        for i = 0, 1, ..., N

which yields:

    a_N = E(θ) - a^T E(x)        and        a = C_xx^{-1} C_xθ

where C_xx = E[(x - E(x))(x - E(x))^T] is the N × N covariance matrix of the
data and C_xθ = E[(x - E(x))(θ - E(θ))] is the N × 1 cross-covariance vector.
Thus the LMMSE estimator is:

    θ̂ = a^T x + a_N = E(θ) + C_θx C_xx^{-1} (x - E(x))

where we note C_θx = C_xθ^T = E[(θ - E(θ))(x - E(x))^T]. For a p-dimensional
parameter θ the LMMSE estimator has the same form:

    θ̂ = E(θ) + C_θx C_xx^{-1} (x - E(x))

where now C_θx = E[(θ - E(θ))(x - E(x))^T] is the p × N cross-covariance
matrix. The Bayesian MSE matrix (the covariance of the error ε = θ - θ̂) is:

    M_θ̂ = E[(θ - θ̂)(θ - θ̂)^T] = C_θθ - C_θx C_xx^{-1} C_xθ

where C_θθ = E[(θ - E(θ))(θ - E(θ))^T] is the p × p covariance matrix of θ.

For the Bayesian linear model:

    x = Hθ + w

where θ is a random parameter vector with mean E(θ) and covariance C_θθ, and w
is zero-mean noise with covariance C_w, uncorrelated with θ, we have:

    E(x) = HE(θ)
    C_xx = HC_θθH^T + C_w
    C_θx = C_θθH^T

so the LMMSE estimator of θ is:

    θ̂ = E(θ) + C_θθH^T(HC_θθH^T + C_w)^{-1}(x - HE(θ))

and the covariance of the error, which is the Bayesian MSE matrix, is:

    M_θ̂ = (C_θθ^{-1} + H^T C_w^{-1} H)^{-1}

In many signal processing problems the data x[n] is a realisation of a
zero-mean WSS random process and the parameter is also zero mean, so that
E(x) = 0, E(θ) = 0 and C_xx = R_xx, where R_xx is the autocorrelation matrix of
the process. The LMMSE estimator then reduces to:

    θ̂ = C_θx C_xx^{-1} x

Application of the LMMSE estimator to the three signal processing estimation
problems of smoothing, filtering and prediction gives rise to the Wiener
filtering equation solutions.
12.2.1 Smoothing

The problem is to estimate θ = s = [s[0] s[1] ... s[N-1]]^T based on the data
x = [x[0] x[1] ... x[N-1]]^T where:

    x = s + w

The estimate of s[n] may use both past and future values of the data (x[0],
..., x[n-1] as well as x[n+1], ..., x[N-1]), so smoothing cannot be cast as a
filtering problem since we cannot apply a causal filter to the data. We assume
that the signal and noise processes are uncorrelated. Hence:

    C_xx = R_ss + R_ww        C_θx = R_ss

where R_ss and R_ww are the N × N autocorrelation matrices of the signal and
of the noise, and the LMMSE estimate (the Wiener smoother) is:

    ŝ = R_ss (R_ss + R_ww)^{-1} x
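The matrix form of the smoother is convenient to prototype directly. The
sketch below is a toy illustration that assumes a first-order Gauss-Markov
signal with autocorrelation r_ss[k] = P ρ^|k| and white noise of variance σ²
(all of these modelling choices are assumptions made for the example); it
builds R_ss and R_ww, forms the smoothing matrix and applies it to one noisy
realisation.

import numpy as np

rng = np.random.default_rng(10)
N, P, rho, sigma2 = 200, 1.0, 0.95, 0.5       # assumed signal/noise parameters

# Autocorrelation matrices of the assumed AR(1)-type signal and the white noise
idx = np.arange(N)
Rss = P * rho ** np.abs(idx[:, None] - idx[None, :])
Rww = sigma2 * np.eye(N)

# One realisation: s drawn with covariance Rss, then x = s + w
s = np.linalg.cholesky(Rss) @ rng.standard_normal(N)
x = s + np.sqrt(sigma2) * rng.standard_normal(N)

# Wiener smoother s_hat = Rss (Rss + Rww)^{-1} x
s_hat = Rss @ np.linalg.solve(Rss + Rww, x)

print(np.mean((x - s) ** 2), np.mean((s_hat - s) ** 2))   # smoothing reduces the error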
12.2.2 Filtering

The problem is to estimate θ = s[n] based only on the present and past noisy
data x = [x[0] x[1] ... x[n]]^T. As n increases this allows us to view the
estimator as a filter acting on the incoming data. Again x = s + w with signal
and noise uncorrelated, so C_xx is the (n+1) × (n+1) matrix:

    C_xx = E(xx^T) = R_ss + R_ww

and:

    C_θx = E(s[n] [s[0] s[1] ... s[n]]) = [r_ss[n] r_ss[n-1] ... r_ss[0]] = r̃_ss^T

which is a 1 × (n+1) row vector; r̃_ss is the (n+1) × 1 vector of signal
autocorrelation samples in time-reversed order. Thus:

    ŝ[n] = r̃_ss^T (R_ss + R_ww)^{-1} x = a^T x        with a = (R_ss + R_ww)^{-1} r̃_ss

Specifically we let a_k = h^(n)[n-k] for k = 0, 1, ..., n, interpreting the
weights as the impulse response of a (time-varying) filter. Then:

    ŝ[n] = Σ_{k=0}^{n} a_k x[k] = Σ_{k=0}^{n} h^(n)[n-k] x[k] = Σ_{k=0}^{n} h^(n)[k] x[n-k]

We define the vector h = [h^(n)[0] h^(n)[1] ... h^(n)[n]]^T. Then we have that
h is a time-reversed version of a. To explicitly find the impulse response h we
note that since:

    (R_ss + R_ww) a = r̃_ss

then, after time-reversal and using r_xx[k] = r_ss[k] + r_ww[k], it is also
true that:

    [ r_xx[0]    r_xx[1]      ...  r_xx[n]   ] [ h^(n)[0] ]     [ r_ss[0] ]
    [ r_xx[1]    r_xx[0]      ...  r_xx[n-1] ] [ h^(n)[1] ]  =  [ r_ss[1] ]
    [    :          :                  :     ] [    :     ]     [    :    ]
    [ r_xx[n]    r_xx[n-1]    ...  r_xx[0]   ] [ h^(n)[n] ]     [ r_ss[n] ]

These are the Wiener-Hopf filtering equations, which can be solved efficiently
for each value of n using the Levinson recursion.
12.2.3 Prediction

The problem is to estimate θ = x[N-1+l] based on the current and past data
x = [x[0] x[1] ... x[N-1]]^T, i.e. the sample l ≥ 1 steps in the future. The
resulting estimator is termed the l-step linear predictor.

As before we have C_xx = R_xx, where R_xx is the N × N autocorrelation matrix,
and:

    C_θx = E(x[N-1+l] [x[0] x[1] ... x[N-1]])
         = [r_xx[N-1+l] r_xx[N-2+l] ... r_xx[l]] = r̃_xx^T

Hence:

    x̂[N-1+l] = r̃_xx^T R_xx^{-1} x = a^T x

where a = [a_0 a_1 ... a_{N-1}]^T = R_xx^{-1} r̃_xx. Letting h[N-k] = a_k we can
again interpret the weights as a filter impulse response and write:

    x̂[N-1+l] = Σ_{k=0}^{N-1} h[N-k] x[k] = Σ_{k=1}^{N} h[k] x[N-k]

Defining h = [h[1] h[2] ... h[N]]^T as before, we can find an explicit
expression for h by solving:

    [ r_xx[0]      r_xx[1]      ...  r_xx[N-1] ] [ h[1] ]     [ r_xx[l]     ]
    [ r_xx[1]      r_xx[0]      ...  r_xx[N-2] ] [ h[2] ]  =  [ r_xx[l+1]   ]
    [    :             :                 :     ] [  :   ]     [     :       ]
    [ r_xx[N-1]    r_xx[N-2]    ...  r_xx[0]   ] [ h[N] ]     [ r_xx[l+N-1] ]

These Wiener-Hopf prediction equations can again be solved with the Levinson
recursion for each value of N. For the one-step predictor, l = 1, they are the
Yule-Walker equations used to solve for the parameters of an AR(N) model, and
the h[n] are the corresponding linear prediction coefficients.
As with sequential LS, a sequential version of the LMMSE estimator can be
derived in which the estimate θ̂[n] (i.e. based on the data samples up to time
n) is obtained from the previous estimate θ̂[n-1] and the new sample x[n].
Central to the derivation is x̂[n|n-1], an estimate of x[n] based on the
previous data samples. Using a vector space analogy, where we define the inner
product between random variables as:

    (x, y) = E(xy)

the procedure is as follows:

1. We start with an initial estimate θ̂[0] or, at time n, we already have
   θ̂[n-1]. We can consider θ̂[n-1] as the true value of θ projected on the
   subspace spanned by {x[0], x[1], ..., x[n-1]}.

2. We form x̂[n|n-1], the projection of the new sample x[n] onto the same
   subspace, i.e. its prediction from the past data.

3. We form the innovation x[n] - x̂[n|n-1], which is orthogonal to the previous
   data and represents the new information brought by x[n].

4. We calculate the gain, i.e. the length of the projection of θ onto the
   innovation x[n] - x̂[n|n-1], that is:

       K[n] = E[θ(x[n] - x̂[n|n-1])] / E[(x[n] - x̂[n|n-1])²]

5. The estimate is updated as:

       θ̂[n] = θ̂[n-1] + K[n](x[n] - x̂[n|n-1])

For the Bayesian linear model observed one sample at a time,

    x[n] = h^T[n]θ + w[n]

we can derive the sequential vector LMMSE estimator:

    θ̂[n] = θ̂[n-1] + K[n](x[n] - h^T[n]θ̂[n-1])

    K[n] = M[n-1]h[n] / (σ_n² + h^T[n]M[n-1]h[n])

    M[n] = (I - K[n]h^T[n])M[n-1]

where M[n] is the minimum Bayesian MSE matrix of the estimate based on the data
up to time n.
13 Kalman Filters
The Kalman filter estimates the state s[n] of a linear dynamic system, a
process which evolves in time n according to a linear state (process) equation
of the form:

    s[n] = A s[n-1] + B u[n]

driven by the process noise u[n], and which is observed through the observation
equation:

    x[n] = H[n] s[n] + w[n]

where x[n] is the M × 1 observation vector, H[n] is a known M × p observation
matrix and w[n] is the observation noise with covariance matrix R, uncorrelated
with the process noise.
The Kalman filter computes, sequentially in n, the MMSE state estimate ŝ[n|n],
minimising the Bayesian MSE E[(s[n] - ŝ[n|n])²] of each component given the
observations up to time n. Central to the derivation is the innovation
sequence:

    x̃[n] = x[n] - x̂[n|n-1]

where x̂[n|n-1] is the prediction of x[n] from the past observation sequence
X[n-1] = {x[0], x[1], ..., x[n-1]}. The estimate splits into a prediction term
E(s[n]|X[n-1]) plus an innovation correction E(s[n]|x̃[n]). By definition
ŝ[n|n-1] = E(s[n]|X[n-1]) and, from the vector-space (orthogonal projection)
argument used for the sequential LMMSE estimator:

    E(s[n]|x̃[n]) = K[n](x[n] - x̂[n|n-1])

with:

    K[n] = E(s[n]x̃^T[n]) E(x̃[n]x̃^T[n])^{-1}
         = M[n|n-1]H^T[n](H[n]M[n|n-1]H^T[n] + R)^{-1}

where:

    M[n|n-1] = E[(s[n] - ŝ[n|n-1])(s[n] - ŝ[n|n-1])^T]

is the state error covariance predictor at time n given the observation
sequence X[n-1]; the prediction ŝ[n|n-1] and M[n|n-1] are obtained by
propagating ŝ[n-1|n-1] and M[n-1|n-1] through the state equation. The corrected
state error covariance at time n is:

    M[n|n] = (I - K[n]H[n])M[n|n-1]

Thus each time step consists of a prediction followed by:

    Gain:        K[n] = M[n|n-1]H^T[n](H[n]M[n|n-1]H^T[n] + R)^{-1}

    Correction:  ŝ[n|n] = ŝ[n|n-1] + K[n](x[n] - x̂[n|n-1])

                 M[n|n] = (I - K[n]H[n])M[n|n-1]
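The gain/correction recursion above, together with the prediction step through
the state model, gives the complete filter. The sketch below implements it for
an invented scalar Gauss-Markov state s[n] = a s[n-1] + u[n] observed in white
noise; the model, its parameters and the initialisation are assumptions made
purely for illustration, not values from the notes.

import numpy as np

rng = np.random.default_rng(11)
a, Q, R, N = 0.98, 0.1, 1.0, 300        # assumed state model and noise parameters

# Simulate the scalar state s[n] = a s[n-1] + u[n] and observations x[n] = s[n] + w[n]
s = np.zeros(N)
for n in range(1, N):
    s[n] = a * s[n - 1] + np.sqrt(Q) * rng.standard_normal()
x = s + np.sqrt(R) * rng.standard_normal(N)

# Scalar Kalman filter (H[n] = 1 here)
s_hat, M = 0.0, 1.0                     # initial state estimate and error variance (assumed)
s_filt = np.zeros(N)
for n in range(N):
    # Prediction through the state equation
    s_pred = a * s_hat
    M_pred = a * M * a + Q
    # Gain, correction and covariance update
    K = M_pred / (M_pred + R)
    s_hat = s_pred + K * (x[n] - s_pred)
    M = (1 - K) * M_pred
    s_filt[n] = s_hat

print(np.mean((x - s) ** 2), np.mean((s_filt - s) ** 2))   # filtering reduces the error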
14 MAIN REFERENCE