Estimation Theory
Alireza Karimi
Spring 2013
where n[k] is the measurement noise. The model parameters are estimated using u[k] and y[k] for k = 1, . . . , N.
DC level in noise : Consider a set of data {x[0], x[1], . . . , x[N − 1]} that can be modeled as :
x[n] = A + w[n]
where w[n] is some zero-mean noise process. A can be estimated using the measured data set.
Outline
Main reference :
Other references :
Lessons in Estimation Theory for Signal Processing, Communications
and Control. By Jerry M. Mendel, Prentice-Hall, 1995.
Probability, Random Processes and Estimation Theory for Engineers.
By Henry Stark and John W. Woods, Prentice-Hall, 1986.
Ω = {HH, HT, TH, TT}
Remark :
For the same experiment (throwing a coin twice) we could define
another random variable that would lead to a different PX (x).
In most engineering problems the sample space is a subset of the real line, so X(ζ) = ζ and PX (x) is a continuous function of x.
Probability Density Function (PDF)
The Probability Density Function, if it exists, is given by :
pX(x) = dPX(x)/dx
When we deal with a single random variable the subscripts are removed :
p(x) = dP(x)/dx
Properties :
(i) ∫_{−∞}^{∞} p(x) dx = P(∞) − P(−∞) = 1
(ii) Pr[{ζ | X(ζ) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{−∞}^{x} p(α) dα
(iii) Pr[x1 < X ≤ x2] = ∫_{x1}^{x2} p(x) dx
Gaussian : p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
The PDF has two parameters : µ, the mean, and σ², the variance.
Exponential (µ > 0) : p(x) = (1/µ) exp(−x/µ) u(x)
Rayleigh (σ > 0) : p(x) = (x/σ²) exp(−x²/(2σ²)) u(x)
Chi-square χ²_n : p(x) = ( 1/(2^{n/2} Γ(n/2)) ) x^{n/2−1} exp(−x/2) for x > 0, and p(x) = 0 for x < 0
where Γ(z) = ∫_{0}^{∞} t^{z−1} e^{−t} dt and u(x) is the unit step function.
Marginal PDF :
p(x) = ∫_{−∞}^{∞} p(x, y) dy and p(y) = ∫_{−∞}^{∞} p(x, y) dx
Bayes’ Formula : Consider two RVs defined on the same probability space
then we have :
p(x, y) = p(x|y)p(y) = p(y|x)p(x) or p(x|y) = p(x, y)/p(y)
Independent Random Variables
p(x, y ) = p(x)p(y )
where n!! denotes the double factorial, i.e., the product of every odd number from n down to 1.
Important formula : The relation between the variance and the mean of X is given by
var(X) = E(X²) − E²(X)
The variance is the mean of the square minus the square of the mean.
Independence, Uncorrelatedness and Orthogonality
If σxy = 0, then X and Y are uncorrelated and
E {XY } = E {X }E {Y }
x = [x1 , x2 , . . . , xn ]T
1. In some books (including our main reference) there is no distinction between random
variable X and its specific value x. From now on we adopt the notation of our reference.
Random Processes
Discrete Random Process : x[n] is a sequence of random variables
defined for every integer n.
Mean value : is defined as E (x[n]) = µx [n].
Autocorrelation Function (ACF) : is defined as
rxx [k, n] = E (x[n]x[n + k])
Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its
autocorrelation function (ACF) do not depend on n.
Autocovariance function : is defined as
cxx [k] = E [(x[n] − µx )(x[n + k] − µx )] = rxx [k] − µ2x
Cross-correlation Function (CCF) : is defined as
rxy [k] = E (x[n]y [n + k])
Cross-covariance function : is defined as
cxy [k] = E [(x[n] − µx )(y [n + k] − µy )] = rxy [k] − µx µy
Discrete White Noise
rxx[0] ≥ |rxx[k]|, rxx[k] = rxx[−k], rxy[k] = ryx[−k]
Power Spectral Density : The Fourier transform of the ACF and CCF gives the Auto-PSD and Cross-PSD :
Pxx(f) = Σ_{k=−∞}^{∞} rxx[k] exp(−j2πfk)
Pxy(f) = Σ_{k=−∞}^{∞} rxy[k] exp(−j2πfk)
Discrete White Noise : is a discrete random process with zero mean and rxx[k] = σ²δ[k], where δ[k] is the Kronecker impulse function. The PSD of white noise becomes Pxx(f) = σ² and is completely flat with frequency.
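A quick numerical illustration (a minimal sketch in numpy, not from the slides; the biased sample-ACF estimator and the raw periodogram used here are my own choices) of the fact that white noise has rxx[k] = σ²δ[k] and a flat PSD :

import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 4096, 2.0
x = rng.normal(0.0, np.sqrt(sigma2), N)   # zero-mean white Gaussian noise

# Biased sample ACF estimate: r_hat[k] = (1/N) sum_n x[n] x[n+k]
def sample_acf(x, max_lag):
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(max_lag + 1)])

r_hat = sample_acf(x, 5)
print("r_hat[0] ~ sigma^2 :", r_hat[0])       # close to 2.0
print("r_hat[1..5] ~ 0    :", r_hat[1:])

# Periodogram: a crude PSD estimate, approximately flat at sigma^2
P = np.abs(np.fft.fft(x))**2 / N
print("mean periodogram level ~ sigma^2 :", P.mean())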
x[n] = A + Bn + w[n], n = 0, 1, . . . , N − 1
Suppose that w[n] ∼ N(0, σ²) and is uncorrelated with all the other samples. Letting θ = [A B] and x = [x[0], x[1], . . . , x[N − 1]], the PDF is :
p(x; θ) = Π_{n=0}^{N−1} p(x[n]; θ) = ( 1/(√(2πσ²))^N ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)² )
The quality of any estimator for this problem depends on the assumptions made in the data model : in this example, the linear trend and the WGN PDF assumption.
p(x, θ) = p(x|θ)p(θ)
x[n] = A + w[n], n = 0, 1, . . . , N − 1
For the sample-mean estimator Â1 = (1/N) Σ_{n=0}^{N−1} x[n] :
E(Â1) = (1/N) Σ_{n=0}^{N−1} E(x[n]) = (1/N) Σ_{n=0}^{N−1} E(A + w[n]) = (1/N) Σ_{n=0}^{N−1} (A + 0) = A
var(Â1) = var( (1/N) Σ_{n=0}^{N−1} x[n] ) = (1/N²) Σ_{n=0}^{N−1} var(x[n]) = (1/N²) Nσ² = σ²/N
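A small Monte Carlo check of these two results (a sketch, not from the slides; the values A = 1, σ² = 4, N = 50 are arbitrary) :

import numpy as np

rng = np.random.default_rng(1)
A, sigma2, N, trials = 1.0, 4.0, 50, 20000

# Each row is one realization x[0..N-1] = A + w[n]; A_hat is the sample mean
x = A + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
A_hat = x.mean(axis=1)

print("E(A_hat)   ~ A        :", A_hat.mean())              # ~ 1.0
print("var(A_hat) ~ sigma2/N :", A_hat.var(), sigma2 / N)   # ~ 0.08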
Averaging n independent unbiased estimators θ̂i (each with the same variance) gives :
θ̂ = (1/n) Σ_{i=1}^{n} θ̂i ⇒ E(θ̂) = θ
var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂i) = (1/n²) n var(θ̂i) = var(θ̂i)/n
The most logical criterion for estimation is the Mean Square Error (MSE) :
var(θ̂) ≥ CRLB(θ)
ln p(x[0]; A) = −ln √(2πσ²) − (1/(2σ²)) (x[0] − A)²
Then
∂ ln p(x[0]; A)/∂A = (1/σ²) (x[0] − A) ⇒ −∂² ln p(x[0]; A)/∂A² = 1/σ²
According to the Theorem :
var(Â) ≥ σ², I(A) = 1/σ² and Â = g(x[0]) = x[0]
p(x; A) = ( 1/(√(2πσ²))^N ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
Then
∂ ln p(x; A)/∂A = ∂/∂A [ −ln((2πσ²)^{N/2}) − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]
= (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²) ( (1/N) Σ_{n=0}^{N−1} x[n] − A )
According to the Theorem :
var(Â) ≥ σ²/N, I(A) = N/σ² and Â = g(x) = (1/N) Σ_{n=0}^{N−1} x[n]
Transformation of Parameters
If it is desired to estimate α = g (θ), then the CRLB is :
var(α̂) ≥ [ −E( ∂² ln p(x; θ)/∂θ² ) ]⁻¹ ( ∂g/∂θ )²
Definition
Efficient estimator : An unbiased estimator that attains the CRLB is said
to be efficient.
Example : Knowing that x̄ = (1/N) Σ_{n=0}^{N−1} x[n] is an efficient estimator for A, is x̄² an efficient estimator for A² ?
Transformation of Parameters
Solution : Knowing that x̄ ∼ N(A, σ²/N), we have :
E(x̄²) = E²(x̄) + var(x̄) = A² + σ²/N ≠ A²
So the estimator x̄² of A² is not even unbiased.
Let’s look at the variance of this estimator :
var(x̄²) = E(x̄⁴) − E²(x̄²)
but we have from the moments of Gaussian RVs (slide 15) :
E(x̄⁴) = A⁴ + 6A² (σ²/N) + 3 (σ²/N)²
Therefore :
var(x̄²) = A⁴ + 6A²σ²/N + 3σ⁴/N² − (A² + σ²/N)² = 4A²σ²/N + 2σ⁴/N²
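The bias σ²/N and the variance expression are easy to verify numerically (a sketch, not part of the slides; A = 1, σ² = 4, N = 10 are arbitrary values) :

import numpy as np

rng = np.random.default_rng(2)
A, sigma2, N, trials = 1.0, 4.0, 10, 200000

xbar = (A + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))).mean(axis=1)
est = xbar**2                                   # proposed estimator of A^2

print("E(xbar^2)                 :", est.mean())          # ~ A^2 + sigma2/N = 1.4
print("A^2 + sigma2/N            :", A**2 + sigma2 / N)
print("var(xbar^2)               :", est.var())
print("4A^2 s2/N + 2 s2^2/N^2    :", 4*A**2*sigma2/N + 2*sigma2**2/N**2)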
Then the variance of any unbiased estimator θ̂ satisfies Cθ̂ − I−1 (θ) ≥ 0
where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the
Fisher information matrix and is given by :
[I(θ)]ij = −E[ ∂² ln p(x; θ) / ∂θi ∂θj ]
For the DC level in WGN with θ = [A σ²] both unknown :
ln p(x; θ) = −(N/2) ln 2π − (N/2) ln σ² − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²
The Fisher information matrix is :
I(θ) = −E [ ∂²ln p(x;θ)/∂A²   ∂²ln p(x;θ)/∂A∂σ² ; ∂²ln p(x;θ)/∂A∂σ²   ∂²ln p(x;θ)/∂(σ²)² ] = [ N/σ²   0 ; 0   N/(2σ⁴) ]
The matrix is diagonal (just for this example) and can be easily inverted to yield :
var(Â) ≥ σ²/N and var(σ̂²) ≥ 2σ⁴/N
So the CRLB is :
var(α̂) ≥ (∂g/∂θ) I⁻¹(θ) (∂g/∂θ)ᵀ = [ 2A/σ²   −A²/σ⁴ ] [ σ²/N   0 ; 0   2σ⁴/N ] [ 2A/σ²   −A²/σ⁴ ]ᵀ = (4α + 2α²)/N
for α = g(θ) = A²/σ².
x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]
with
hak = [ cos(2πk·0/N), cos(2πk·1/N), . . . , cos(2πk(N−1)/N) ]ᵀ , hbk = [ sin(2πk·0/N), sin(2πk·1/N), . . . , sin(2πk(N−1)/N) ]ᵀ
Hence the MVU estimate of the Fourier coefficients is : θ̂ = (HT H)−1 HT x
Example (Fourier Analysis)
After simplification (noting that (HᵀH)⁻¹ = (2/N) I), we have :
θ̂ = (2/N) [ (ha1)ᵀx, . . . , (haM)ᵀx, (hb1)ᵀx, . . . , (hbM)ᵀx ]ᵀ
which is the same as the standard solution :
âk = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N) , b̂k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)
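A numerical sketch (numpy; M, N and the true coefficients are arbitrary choices of mine) showing that the closed-form âk, b̂k coincide with the general solution (HᵀH)⁻¹Hᵀx :

import numpy as np

rng = np.random.default_rng(3)
N, M = 64, 2
n = np.arange(N)
a_true, b_true = np.array([1.0, -0.5]), np.array([0.3, 2.0])

# Build H = [h_a1 ... h_aM  h_b1 ... h_bM]
H = np.column_stack(
    [np.cos(2*np.pi*k*n/N) for k in range(1, M+1)]
    + [np.sin(2*np.pi*k*n/N) for k in range(1, M+1)])
x = H @ np.concatenate([a_true, b_true]) + rng.normal(0, 0.1, N)

theta_ls = np.linalg.solve(H.T @ H, H.T @ x)      # (H^T H)^{-1} H^T x
theta_direct = (2.0/N) * (H.T @ x)                # uses (H^T H) = (N/2) I

print(np.allclose(theta_ls, theta_direct))        # True
print(theta_ls)                                   # close to [1, -0.5, 0.3, 2]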
x[n] = Σ_{k=0}^{p−1} h[k] u[n − k] + w[n], n = 0, 1, . . . , N − 1
Remark : The unbiasedness restriction implies a linear model for the data.
However, it may still be used if the data are transformed suitably or the
model is linearized.
Finding the BLUE (Scalar Case)
1 Choose a linear estimator for the observed data
x[n] , n = 0, 1, . . . , N − 1
θ̂ = Σ_{n=0}^{N−1} an x[n] = aᵀx where a = [a0, a1, . . . , aN−1]ᵀ
2 Restrict the estimator to be unbiased :
E(θ̂) = Σ_{n=0}^{N−1} an E(x[n]) = θ
In summary :
1 Choose a linear estimator : θ̂ = Σ_{n=0}^{N−1} an x[n] = aᵀx
2 Restrict the estimator to be unbiased : E(θ̂) = aᵀE(x) = aᵀsθ = θ, then aᵀs = 1 where s = [s[0], s[1], . . . , s[N − 1]]ᵀ
3 Minimize the variance aᵀCa subject to aᵀs = 1.
Theorem (Gauss–Markov)
If the data are of the general linear model form
x = Hθ + w
where w is a noise vector with zero mean and covariance C (the PDF of w is otherwise arbitrary), then the BLUE of θ is :
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x
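A minimal numpy sketch of the Gauss–Markov estimator stated above (the line model, the heteroscedastic noise variances and all numerical values are made-up illustrative choices) :

import numpy as np

rng = np.random.default_rng(4)
N = 100
n = np.arange(N)
H = np.column_stack([np.ones(N), n])         # DC level + slope: x = A + B n + w
theta_true = np.array([1.0, 0.05])

# Uncorrelated noise with unequal variances: C = diag(var_n)
var_n = 0.5 + 0.05 * n
x = H @ theta_true + rng.normal(0.0, np.sqrt(var_n))

Cinv = np.diag(1.0 / var_n)                  # C^{-1} for diagonal C
theta_blue = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)
cov_blue = np.linalg.inv(H.T @ Cinv @ H)     # covariance of the BLUE

print("theta_blue :", theta_blue)
print("std devs   :", np.sqrt(np.diag(cov_blue)))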
Basic Idea : Choose the parameter value that makes the observed data the most likely to have been observed.
Example (DC level in WGN with variance A) : x[n] = A + w[n] with w[n] ∼ N(0, A), A > 0. The PDF is :
p(x; A) = ( 1/(2πA)^{N/2} ) exp( −(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)² )
∂ ln p(x; A)/∂A = −N/(2A) + (1/A) Σ_{n=0}^{N−1} (x[n] − A) + (1/(2A²)) Σ_{n=0}^{N−1} (x[n] − A)²
Example
Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄N = (1/N) Σ_{n=1}^{N} xn. Then, according to the Central Limit Theorem (CLT), zN = √N (x̄N − µ)/σ converges in distribution to z ∼ N(0, 1).
Example
Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄N = (1/N) Σ_{n=1}^{N} xn. Then, according to the Law of Large Numbers (LLN), x̄N converges in probability to µ.
Asymptotic Properties of Estimators
Asymptotic Unbiasedness : Estimator θ̂N is an asymptotically unbiased
estimator of θ if :
lim E (θ̂N − θ) = 0
N→∞
Asymptotic Distribution : It refers to p(xn) as it evolves for n = 1, 2, . . ., especially for large values of n (it is not the ultimate form of the distribution, which may be degenerate).
Remarks :
If θ̂N is asymptotically unbiased and its limiting variance is zero then
it converges to θ in mean-square.
If θ̂N converges to θ in mean-square, then the estimator is consistent.
Asymptotic unbiasedness does not imply consistency and vice versa.
plim can be treated as an operator, e.g. :
plim(xy) = plim(x) plim(y) ; plim(x/y) = plim(x)/plim(y)
p(x; A) = ( 1/(2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0
⇒ Σ_{n=0}^{N−1} x[n] − NÂ = 0
which leads to Â = x̄ = (1/N) Σ_{n=0}^{N−1} x[n]
α̂ = g(θ̂)
Example
Consider the DC level in WGN and find the MLE of α = A². Since g(θ) is not a one-to-one function :
α̂ = arg max_{α≥0} max{ p(x; √α), p(x; −√α) } = [ arg max_{−∞<A<∞} p(x; A) ]² = Â² = x̄²
∂ ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)
∂ ln p(x; θ)/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N−1} (x[n] − A)²
Setting both derivatives to zero gives :
 = x̄ and σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²
∂ ln p(x; θ)/∂θ = ( ∂(Hθ)ᵀ/∂θ ) C⁻¹ (x − Hθ)
Then setting the gradient to zero gives θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x.
Remarks :
The Hessian can be replaced by the negative of its expectation, the
Fisher information matrix I(θ).
This method suffers from convergence problems (local maximum).
Typically, for large data length, the log-likelihood function becomes
more quadratic and the algorithm will produce the MLE.
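A sketch of such an iterative MLE (not from the slides) : the scoring iteration A_{k+1} = A_k + score(A_k)/I(A_k) applied to the earlier example x[n] = A + w[n] with w[n] ∼ N(0, A). The score is the derivative given above; the Fisher information I(A) = N/A + N/(2A²) used here is derived for that particular model.

import numpy as np

rng = np.random.default_rng(5)
N, A_true = 1000, 2.0
x = A_true + rng.normal(0.0, np.sqrt(A_true), N)   # variance equals the mean A

def score(A):
    d = x - A
    return -N/(2*A) + d.sum()/A + (d**2).sum()/(2*A**2)

def fisher(A):                     # I(A) = N/A + N/(2 A^2) for this model
    return N/A + N/(2*A**2)

A = x.mean()                       # reasonable starting point
for _ in range(20):
    A = A + score(A) / fisher(A)   # scoring step: Hessian replaced by -I(A)

print("MLE of A :", A)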
Example
Estimate the DC level of a signal. We observe x[n] = A + w[n] for n = 0, . . . , N − 1 and the LS criterion is :
J(A) = Σ_{n=0}^{N−1} (x[n] − A)²
∂J(A)/∂A = −2 Σ_{n=0}^{N−1} (x[n] − A) = 0 ⇒ Â = (1/N) Σ_{n=0}^{N−1} x[n]
J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])² = (x − Hθ)ᵀ(x − Hθ) = xᵀx − 2xᵀHθ + θᵀHᵀHθ
∂J(θ)/∂θ = −2Hᵀx + 2HᵀHθ = 0 ⇒ θ̂ = (HᵀH)⁻¹Hᵀx
The minimum LS cost is :
Jmin = xᵀ(x − Hθ̂) = xᵀx − xᵀH(HᵀH)⁻¹Hᵀx
Weighted LS : with a symmetric positive definite weighting matrix W, minimizing (x − Hθ)ᵀW(x − Hθ) gives :
θ̂ = (HᵀWH)⁻¹HᵀWx
The data vector can lie anywhere in R^N, while signal vectors must lie in a p-dimensional subspace of R^N, termed S^p, which is spanned by the columns of H.
Geometrical Interpretation (Orthogonal Projection)
ŝ = Hθ̂ = Σ_{i=1}^{p} θ̂i hi = Σ_{i=1}^{p} (hiᵀx) hi (the last equality assumes orthonormal columns hi)
Remarks :
By increasing the order, Jmin is monotonically non-increasing.
By choosing p = N we can perfectly fit the data to the model. However, we fit the noise as well.
We should choose the simplest model that adequately describes the
data.
We increase the order only if the cost reduction is significant.
If we have an idea about the expected level of Jmin , we increase p to
approximately attain this level.
There is an order-recursive LS algorithm to efficiently compute a
(p + 1)-th order model based on a p-th order one (See 8.6).
Sequential Least Squares
Suppose that θ̂[N − 1], based on the data x[0], . . . , x[N − 1], is available. If we get a new sample x[N], we want to compute θ̂[N] as a function of θ̂[N − 1] and x[N].
Example
Consider the LS estimate of the DC level : Â[N − 1] = (1/N) Σ_{n=0}^{N−1} x[n]. We have :
Â[N] = (1/(N+1)) Σ_{n=0}^{N} x[n] = (1/(N+1)) [ N · (1/N) Σ_{n=0}^{N−1} x[n] + x[N] ]
= (N/(N+1)) Â[N − 1] + (1/(N+1)) x[N]
= Â[N − 1] + (1/(N+1)) ( x[N] − Â[N − 1] )
new estimate = old estimate + gain × prediction error
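A sketch of this recursion in numpy (all values arbitrary); the running estimate coincides with the batch sample mean at every step :

import numpy as np

rng = np.random.default_rng(6)
A_true, sigma = 1.0, 2.0
x = A_true + rng.normal(0.0, sigma, 200)

A_hat = x[0]                        # estimate based on the first sample
for N, xN in enumerate(x[1:], start=1):
    gain = 1.0 / (N + 1)
    A_hat = A_hat + gain * (xN - A_hat)   # new = old + gain * prediction error

print(A_hat, x.mean())              # identical up to round-off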
Sequential Least Squares
Example (DC level in uncorrelated noise)
x[n] = A + w[n] and var(w[n]) = σn². The WLS estimate (or BLUE) is :
Â[N − 1] = ( Σ_{n=0}^{N−1} x[n]/σn² ) / ( Σ_{n=0}^{N−1} 1/σn² )
Â[N] = Â[N − 1] + ( (1/σN²) / Σ_{n=0}^{N} 1/σn² ) ( x[N] − Â[N − 1] )
= Â[N − 1] + K[N] ( x[N] − Â[N − 1] )
K[N] = var(Â[N − 1]) / ( var(Â[N − 1]) + σN² )
Sequential Least Squares
Consider the general linear model x = Hθ + w, where w is an uncorrelated noise with the covariance matrix C. The BLUE (or WLS) is :
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x
Let’s define Σ[n] as the covariance of θ̂[n], the estimate based on x[0], . . . , x[n]. The estimate is updated as :
θ̂[n] = θ̂[n − 1] + K[n] ( x[n] − hᵀ[n] θ̂[n − 1] )
where
K[n] = Σ[n − 1] h[n] / ( σn² + hᵀ[n] Σ[n − 1] h[n] )
Covariance Update :
Σ[n] = ( I − K[n] hᵀ[n] ) Σ[n − 1]
Initialization ?
Constrained Least Squares
Linear constraints in the form Aθ = b can be considered in LS solution.
The LS criterion becomes :
Jc (θ) = (x − Hθ)T (x − Hθ) + λT (Aθ − b)
∂Jc/∂θ = −2Hᵀx + 2HᵀHθ + Aᵀλ
Setting the gradient equal to zero produces :
θ̂c = (HᵀH)⁻¹Hᵀx − (1/2)(HᵀH)⁻¹Aᵀλ = θ̂ − (HᵀH)⁻¹Aᵀ λ/2
where θ̂ = (HT H)−1 HT x is the unconstrained LSE.
Now Aθ̂c = b can be solved to find λ :
λ = 2[A(HT H)−1 AT ]−1 (Aθ̂ − b)
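A numpy sketch of these constrained-LS formulas (H, the constraint A θ = b and all numbers are small made-up examples : two parameters constrained to sum to one) :

import numpy as np

rng = np.random.default_rng(7)
N = 50
H = np.column_stack([np.ones(N), np.arange(N) / N])
theta_true = np.array([0.3, 0.7])
x = H @ theta_true + rng.normal(0.0, 0.2, N)

A = np.array([[1.0, 1.0]])          # constraint A theta = b : theta_1 + theta_2 = 1
b = np.array([1.0])

HtH_inv = np.linalg.inv(H.T @ H)
theta_u = HtH_inv @ H.T @ x                                    # unconstrained LSE
lam = 2 * np.linalg.solve(A @ HtH_inv @ A.T, A @ theta_u - b)  # lambda
theta_c = theta_u - HtH_inv @ A.T @ (lam / 2)                  # constrained LSE

print("unconstrained:", theta_u, " sum =", theta_u.sum())
print("constrained  :", theta_c, " sum =", theta_c.sum())      # sum is exactly 1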
where g(θ) = ( ∂sᵀ(θ)/∂θ ) [x − s(θ)] and
∂[g(θ)]i/∂θj = Σ_{n=0}^{N−1} [ (x[n] − s[n]) ∂²s[n]/(∂θi ∂θj) − (∂s[n]/∂θj)(∂s[n]/∂θi) ]
Around the solution [x − s(θ)] is small so the first term in the Jacobian
can be neglected. This makes this method equivalent to the Gauss-Newton
algorithm which is numerically more robust.
Nonlinear Least Squares
Gauss-Newton Method : Linearize signal model around the current
estimate and solve the resulting linear problem.
s(θ) ≈ s(θ0) + ( ∂s(θ)/∂θ )|θ=θ0 (θ − θ0) = s(θ0) + H(θ0)(θ − θ0)
Then the LS solution of the linearized problem gives the iteration :
θ̂k+1 = θ̂k + ( Hᵀ(θ̂k) H(θ̂k) )⁻¹ Hᵀ(θ̂k) ( x − s(θ̂k) )
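A minimal Gauss–Newton sketch (illustrative only; the exponential model s[n; θ] = θ1 exp(θ2 n), the starting point and all numbers are my own choices, not from the slides) following the linearize-and-solve recipe above :

import numpy as np

rng = np.random.default_rng(8)
n = np.arange(50, dtype=float)
theta_true = np.array([2.0, -0.05])
x = theta_true[0] * np.exp(theta_true[1] * n) + rng.normal(0.0, 0.05, n.size)

def s(theta):                        # signal model s[n; theta] = theta1 * exp(theta2 * n)
    return theta[0] * np.exp(theta[1] * n)

def jac(theta):                      # H(theta) = ds/dtheta, one column per parameter
    e = np.exp(theta[1] * n)
    return np.column_stack([e, theta[0] * n * e])

theta = np.array([1.0, -0.1])        # initial guess
for _ in range(20):
    H = jac(theta)
    r = x - s(theta)
    theta = theta + np.linalg.solve(H.T @ H, H.T @ r)   # Gauss-Newton step

print("estimate :", theta)           # close to [2.0, -0.05]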
α̂ = arg max_α xᵀ H(α) [ Hᵀ(α) H(α) ]⁻¹ Hᵀ(α) x
s[n] = A1 r n + A2 r 2n + A3 r 3n n = 0, 1, . . . , N − 1
Bayesian MSE :
Bmse(θ̂) = E{ [θ − θ̂(x)]² } = ∫∫ [θ − θ̂(x)]² p(x, θ) dx dθ
The Bayesian MSE is not a function of θ and can be minimized to find an estimator.
Â = ( (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
  / ( (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
Before collecting the data, the mean of the prior PDF p(A) is the best estimate; after collecting the data, the best estimate is the mean of the posterior PDF p(A|x).
The choice of p(A) is crucial for the quality of the estimation.
Only a Gaussian prior PDF leads to a closed-form estimator.
Conclusion : For an accurate estimator choose a prior PDF that can be
physically justified. For a closed-form estimator choose a Gaussian prior
PDF.
Properties of the Gaussian PDF
If x and y are two jointly Gaussian RVs with PDF
p(x, y) = ( 1/(2π √det(C)) ) exp( −(1/2) [x − E(x), y − E(y)] C⁻¹ [x − E(x), y − E(y)]ᵀ )
with covariance matrix : C = [ var(x)   cov(x, y) ; cov(x, y)   var(y) ]
then the conditional PDF p(y|x) is also Gaussian and :
E(y|x) = E(y) + ( cov(x, y)/var(x) ) [x − E(x)]
var(y|x) = var(y) − cov²(x, y)/var(x)
Â = µA|x = µA + ( cov(A, x[0])/var(x[0]) ) [x[0] − µA] = µA + ( σA²/(σ² + σA²) ) (x[0] − µA)
var(Â) = σ²A|x = σA² − cov²(A, x[0])/var(x[0]) = σA² ( 1 − σA²/(σ² + σA²) )
Properties of the Gaussian PDF
If x (k × 1) and y (l × 1) are jointly Gaussian with PDF
p(x, y) = ( 1/((2π)^{(k+l)/2} √det(C)) ) exp( −(1/2) [x − E(x); y − E(y)]ᵀ C⁻¹ [x − E(x); y − E(y)] )
with covariance matrix : C = [ Cxx   Cxy ; Cyx   Cyy ]
then the conditional PDF p(y|x) is also Gaussian and :
E(y|x) = E(y) + Cyx Cxx⁻¹ ( x − E(x) )
Cy|x = Cyy − Cyx Cxx⁻¹ Cxy
x = Hθ + w
The posterior PDF p(A|x) is Gaussian with mean and variance :
E(A|x) = µA + ( σA² / (σA² + σ²/N) ) (x̄ − µA) and var(A|x) = ( (σ²/N) σA² ) / ( σA² + σ²/N )
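A numpy sketch of these posterior formulas (µA, σA², σ², N are arbitrary illustrative values) : the posterior mean shrinks x̄ toward the prior mean µA, and the posterior variance is smaller than both σA² and σ²/N.

import numpy as np

rng = np.random.default_rng(9)
mu_A, var_A = 0.0, 1.0            # prior A ~ N(mu_A, var_A)
sigma2, N = 4.0, 20               # noise variance and data length
A = rng.normal(mu_A, np.sqrt(var_A))
x = A + rng.normal(0.0, np.sqrt(sigma2), N)
xbar = x.mean()

alpha = var_A / (var_A + sigma2 / N)                       # shrinkage factor in [0, 1]
post_mean = mu_A + alpha * (xbar - mu_A)                   # E(A | x)
post_var = (var_A * sigma2 / N) / (var_A + sigma2 / N)     # var(A | x)

print("A true   :", A)
print("xbar     :", xbar)
print("E(A|x)   :", post_mean)
print("var(A|x) :", post_var, " vs sigma2/N =", sigma2 / N)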
where R(θ̂) = E {C (ε)} is the Bayes Risk, C (ε) is a cost function and
ε = θ − θ̂ is the estimation error.
Quadratic
For this case R(θ̂) = E {(θ − θ̂)2 } =Bmse(θ̂) which leads to the MMSE
estimator with
θ̂ = E (θ|x)
so θ̂ is the mean of the posterior PDF p(θ|x).
Absolute
In this case we have :
g(θ̂) = ∫_{−∞}^{∞} |θ − θ̂| p(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ
By setting the derivative of g(θ̂) equal to zero and using Leibnitz’s rule, we get :
∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ
so θ̂ is the median of the posterior PDF p(θ|x).
Hit-or-Miss
In this case we have :
g(θ̂) = ∫_{−∞}^{θ̂−δ} 1 · p(θ|x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ|x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ
For small δ, this is minimized by choosing θ̂ at the maximum of p(θ|x), i.e. the mode of the posterior PDF (the MAP estimator).
θ̂ → [HᵀCw⁻¹H]⁻¹HᵀCw⁻¹x
An equivalent maximization is :
θ̂ = arg max p(x|θ)p(θ)
θ
or
θ̂ = arg max[ln p(x|θ) + ln p(θ)]
θ
If p(θ) is uniform or is approximately constant around the maximum
of p(x|θ) (for large data length), we can remove p(θ) to obtain the
Bayesian Maximum Likelihood :
θ̂ = arg max p(x|θ)
θ
Maximum A Posteriori Estimators
Example (DC Level in WGN with Uniform Prior PDF)
The MMSE estimator cannot be obtained in explicit form due to the need
to evaluate the following integrals :
Â = ( (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
  / ( (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
The MAP estimator is given as :
Â = −A0 if x̄ < −A0 ; Â = x̄ if −A0 ≤ x̄ ≤ A0 ; Â = A0 if x̄ > A0
Remark : The main advantage of the MAP estimator is that, for PDFs that are not jointly Gaussian, it may lead to explicit solutions or less computational effort.
Maximum A Posteriori Estimators
where s[n] is a WSS Gaussian process with known ACF, and w [n] is WGN
with variance σ 2 .
x = Hθ + w with θ = [s[0], s[1], . . . , s[ns − 1]]ᵀ and H the N × ns convolution matrix
H = [ h[0]   0   ···   0 ; h[1]   h[0]   ···   0 ; ··· ; h[N−1]   h[N−2]   ···   h[N−ns] ]
⇒ ŝ = Cs Hᵀ ( H Cs Hᵀ + σ²I )⁻¹ x
where Cs is a symmetric Toeplitz matrix with [Cs ]ij = rss [i − j]
Remarks :
The results are identical to those of MMSE estimators for jointly
Gaussian PDF.
The LMMSE estimator relies on the correlation between random
variables.
If a parameter is uncorrelated with the data (but nonlinearly
dependent), it cannot be estimated with an LMMSE estimator.
where A ∼ U[−A0, A0], so µA = 0 and
σA² = E(A²) = ∫_{−A0}^{A0} A² (1/(2A0)) dA = [ A³/(6A0) ]_{−A0}^{A0} = A0²/3
MMSE : Â = ( ∫_{−A0}^{A0} A exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA ) / ( ∫_{−A0}^{A0} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
MAP : Â = −A0 if x̄ < −A0 ; Â = x̄ if −A0 ≤ x̄ ≤ A0 ; Â = A0 if x̄ > A0
LMMSE : Â = ( A0²/3 / (A0²/3 + σ²/N) ) x̄
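A sketch comparing the three estimators numerically (numpy; the MMSE integrals are evaluated on a simple grid and all parameter values are arbitrary) :

import numpy as np

rng = np.random.default_rng(10)
A0, sigma2, N = 1.0, 4.0, 10
A = rng.uniform(-A0, A0)
x = A + rng.normal(0.0, np.sqrt(sigma2), N)
xbar = x.mean()

# MMSE: posterior mean for the uniform prior, by numerical integration on a grid
grid = np.linspace(-A0, A0, 2001)
logpost = -0.5 / sigma2 * ((x[:, None] - grid[None, :])**2).sum(axis=0)
w = np.exp(logpost - logpost.max())           # unnormalized posterior on the grid
A_mmse = (grid * w).sum() / w.sum()

A_map = np.clip(xbar, -A0, A0)                # MAP: xbar clipped to [-A0, A0]
A_lmmse = (A0**2 / 3) / (A0**2 / 3 + sigma2 / N) * xbar

print("A true :", A)
print("MMSE   :", A_mmse, " MAP :", A_map, " LMMSE :", A_lmmse)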
Geometrical Interpretation
Vector space of random variables : The set of scalar zero mean random
variables is a vector space.
The zero length vector of the set is a RV with zero variance.
For any real scalar a, ax is another zero mean RV in the set.
For two RVs x and y , x + y = y + x is a RV in the set.
For two RVs x and y , the inner product is : < x, y >= E (xy )
√
The length of x is defined as : x = < x, x > = E (x 2 )
Two RVs x and y are orthogonal if : < x, y >= E (xy ) = 0.
For RVs x1, x2, y and real numbers a1 and a2 we have : < a1x1 + a2x2, y > = a1 < x1, y > + a2 < x2, y >
E[ ( θ − Σ_{m=0}^{N−1} am x[m] ) x[n] ] = 0, n = 0, 1, . . . , N − 1
or
Σ_{m=0}^{N−1} am E(x[m]x[n]) = E(θx[n]), n = 0, 1, . . . , N − 1
[ E(x²[0])          E(x[0]x[1])       ···   E(x[0]x[N−1])   ] [ a0   ]   [ E(θx[0])    ]
[ E(x[1]x[0])       E(x²[1])          ···   E(x[1]x[N−1])   ] [ a1   ] = [ E(θx[1])    ]
[ ···                                                        ] [ ···  ]   [ ···         ]
[ E(x[N−1]x[0])     E(x[N−1]x[1])     ···   E(x²[N−1])      ] [ aN−1 ]   [ E(θx[N−1])  ]
Therefore
Cxx a = Cθxᵀ ⇒ a = Cxx⁻¹ Cθxᵀ
The LMMSE estimator is :
θ̂ = aᵀx = Cθx Cxx⁻¹ x
The vector LMMSE Estimator
We want to estimate θ = [θ1, θ2, . . . , θp]ᵀ with a linear estimator that minimizes the Bayesian MSE for each element :
θ̂i = Σ_{n=0}^{N−1} ain x[n] + aiN    minimize Bmse(θ̂i) = E[ (θi − θ̂i)² ]
Therefore :
θ̂i = E(θi) + Cθix Cxx⁻¹ ( x − E(x) ), i = 1, 2, . . . , p    (Cθix : 1×N, Cxx : N×N, x − E(x) : N×1)
The scalar LMMSE estimators can be combined into a vector estimator :
θ̂ = E(θ) + Cθx Cxx⁻¹ ( x − E(x) )    (θ̂ : p×1, Cθx : p×N, Cxx : N×N, x − E(x) : N×1)
and similarly
Bmse(θ̂i) = [Mθ̂]ii where Mθ̂ = Cθθ − Cθx Cxx⁻¹ Cxθ    (Mθ̂, Cθθ : p×p, Cθx : p×N, Cxθ : N×p)
x = Hθ + w
x[n] = A + w[n] ⇒ Â[N − 1] = ( σA² / (σA² + σ²/N) ) x̄
where K[N] = Bmse(Â[N − 1]) / ( Bmse(Â[N − 1]) + σ² )
Minimum MSE Update :
Bmse(Â[N]) = ( 1 − K[N] ) Bmse(Â[N − 1])
1 Find the LMMSE estimator of A based on x[0] :
Â[0] = ( E(Ax[0]) / E(x²[0]) ) x[0] = ( σA² / (σA² + σ²) ) x[0]
2 Find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0] :
x̂[1|0] = ( E(x[0]x[1]) / E(x²[0]) ) x[0] = ( σA² / (σA² + σ²) ) x[0]
3 Determine the innovation of the new data : x̃[1] = x[1] − x̂[1|0]. This error vector is orthogonal to x[0].
4 Add to Â[0] the LMMSE estimator of A based on the innovation :
Â[1] = Â[0] + ( E(Ax̃[1]) / E(x̃²[1]) ) x̃[1] = Â[0] + K[1] ( x[1] − x̂[1|0] )
(the added term is the projection of A on x̃[1])
Sequential LMMSE Estimation
Basic Idea : Generate a sequence of orthogonal RVs, namely, the
innovations :
{ x̃[0], x̃[1], x̃[2], . . . , x̃[n] } = { x[0], x[1] − x̂[1|0], x[2] − x̂[2|0,1], . . . , x[n] − x̂[n|0,1,...,n−1] }
Remarks :
For the initialization of the sequential LMMSE estimator, we can use the prior information : θ̂[−1] = E(θ) and M[−1] = Cθθ.
Remarks : All these problems are solved using the LMMSE estimator :
θ̂ = Cθx Cxx⁻¹ x with the minimum Bmse : Mθ̂ = Cθθ − Cθx Cxx⁻¹ Cxθ
This represents a time-varying, causal FIR filter. The impulse response can
be computed as :
(Rss + Rww )a = rss ⇒ (Rss + Rww )hn = rss
where u[n] is WGN with variance σu2 , s[−1] ∼ N (µs , σs2 ) and s[−1] is
independent of u[n] for all n ≥ 0. It can be shown that :
s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k] ⇒ E(s[n]) = a^{n+1} µs
and
Cs[m, n] = E{ ( s[m] − E(s[m]) ) ( s[n] − E(s[n]) ) } = a^{m+n+2} σs² + σu² a^{m−n} Σ_{k=0}^{n} a^{2k} for m ≥ n
Kalman Gain :
K[n] = M[n|n − 1] / ( σn² + M[n|n − 1] )
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K [n](x[n] − ŝ[n|n − 1])
Minimum MSE :
M[n|n] = (1 − K [n])M[n|n − 1]
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − H[n]ŝ[n|n − 1])
Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n]H[n])M[n|n − 1]
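A scalar sketch of the filtering cycle (numpy; the AR(1) state model s[n] = a·s[n−1] + u[n] and all numbers are illustrative choices of mine). The gain, correction and MSE-update lines match the scalar equations above; the prediction step used here is the standard one for this Gauss–Markov state model.

import numpy as np

rng = np.random.default_rng(11)
a, var_u, var_w, N = 0.95, 0.1, 1.0, 200      # state and measurement noise variances
s = np.zeros(N)
s[0] = rng.normal(0.0, 1.0)
for n in range(1, N):
    s[n] = a * s[n-1] + rng.normal(0.0, np.sqrt(var_u))
x = s + rng.normal(0.0, np.sqrt(var_w), N)    # noisy observations

s_hat, M = 0.0, 1.0                           # initial estimate and MSE
for n in range(N):
    # Prediction
    s_pred = a * s_hat
    M_pred = a**2 * M + var_u
    # Kalman gain, correction, minimum MSE update
    K = M_pred / (var_w + M_pred)
    s_hat = s_pred + K * (x[n] - s_pred)
    M = (1 - K) * M_pred

print("last state / estimate :", s[-1], s_hat)
print("steady-state MSE      :", M)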
Extended Kalman Filter
The extended Kalman filter is a sub-optimal approach when the state
and the observation equations are nonlinear.
Nonlinear data model : Consider the following nonlinear data model
s[n] = f(s[n − 1]) + Bu[n]
x[n] = g(s[n]) + w[n]
where f(·) and g(·) are nonlinear functions (vector-to-vector mappings).
Model linearization : We linearize the model around the estimated value
of s using a first order Taylor series.
f(s[n − 1]) ≈ f(ŝ[n − 1|n − 1]) + ( ∂f/∂s[n − 1] ) ( s[n − 1] − ŝ[n − 1|n − 1] )
g(s[n]) ≈ g(ŝ[n|n − 1]) + ( ∂g/∂s[n] ) ( s[n] − ŝ[n|n − 1] )
We denote the Jacobians by :
A[n − 1] = ∂f/∂s[n − 1] evaluated at ŝ[n − 1|n − 1] and H[n] = ∂g/∂s[n] evaluated at ŝ[n|n − 1]
Extended Kalman Filter
Now we use the linearized models :