Estimation Theory
Alireza Karimi
Spring 2013
where n[k] is the measurement noise. The model parameters are estimated using u[k] and y[k] for k = 1, . . . , N.
DC level in noise : Consider a set of data {x[0], x[1], . . . , x[N − 1]} that can be modeled as :
x[n] = A + w[n]
where w[n] is some zero-mean noise process. A can be estimated using the measured data set.
Outline
Main reference :
Other references :
Lessons in Estimation Theory for Signal Processing, Communications
and Control. By Jerry M. Mendel, Prentice-Hall, 1995.
Probability, Random Processes and Estimation Theory for Engineers.
By Henry Stark and John W. Woods, Prentice-Hall, 1986.
Ω = {HH, HT, TH, TT}
Remark :
For the same experiment (throwing a coin twice) we could define
another random variable that would lead to a different PX (x).
In most engineering problems the sample space is a subset of the real line, so X(ζ) = ζ and PX (x) is a continuous function of x.
Probability Density Function (PDF)
The Probability Density Function, if it exists, is given by :
pX(x) = dPX(x)/dx
When we deal with a single random variable the subscripts are removed :
p(x) = dP(x)/dx
Properties :
(i) ∫_{−∞}^{∞} p(x) dx = P(∞) − P(−∞) = 1
(ii) Pr[{ζ | X(ζ) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{−∞}^{x} p(α) dα
(iii) Pr[x1 < X ≤ x2] = ∫_{x1}^{x2} p(x) dx
Gaussian : p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
The PDF has two parameters : µ, the mean, and σ², the variance.
Exponential (µ > 0) : p(x) = (1/µ) exp(−x/µ) u(x)
Rayleigh (σ > 0) : p(x) = (x/σ²) exp(−x²/(2σ²)) u(x)
Chi-square χ²_n : p(x) = ( 1/(2^{n/2} Γ(n/2)) ) x^{n/2−1} exp(−x/2) for x > 0, and p(x) = 0 for x < 0
where Γ(z) = ∫_{0}^{∞} t^{z−1} e^{−t} dt and u(x) is the unit step function.
Marginal PDF :
p(x) = ∫_{−∞}^{∞} p(x, y) dy and p(y) = ∫_{−∞}^{∞} p(x, y) dx
Bayes’ Formula : Consider two RVs defined on the same probability space
then we have :
p(x, y) = p(x|y)p(y) = p(y|x)p(x) or p(x|y) = p(x, y)/p(y)
Independent Random Variables
p(x, y ) = p(x)p(y )
where n!! denotes the double factorial, i.e., the product of every odd number from n down to 1.
Important formula : The relation between the variance and the mean of X is given by
var(X) = E(X²) − E²(X)
The variance is the mean of the square minus the square of the mean.
Independence, Uncorrelatedness and Orthogonality
If σxy = 0, then X and Y are uncorrelated and
E {XY } = E {X }E {Y }
x = [x1 , x2 , . . . , xn ]T
1. In some books (including our main reference) there is no distinction between random
variable X and its specific value x. From now on we adopt the notation of our reference.
Random Processes
Discrete Random Process : x[n] is a sequence of random variables
defined for every integer n.
Mean value : is defined as E (x[n]) = µx [n].
Autocorrelation Function (ACF) : is defined as
rxx [k, n] = E (x[n]x[n + k])
Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its
autocorrelation function (ACF) do not depend on n.
Autocovariance function : is defined as
cxx [k] = E [(x[n] − µx )(x[n + k] − µx )] = rxx [k] − µ2x
Cross-correlation Function (CCF) : is defined as
rxy [k] = E (x[n]y [n + k])
Cross-covariance function : is defined as
cxy [k] = E [(x[n] − µx )(y [n + k] − µy )] = rxy [k] − µx µy
Discrete White Noise
rxx[0] ≥ |rxx[k]|, rxx[k] = rxx[−k], rxy[k] = ryx[−k]
Power Spectral Density : The Fourier transform of the ACF and CCF gives the Auto-PSD and Cross-PSD :
Pxx(f) = Σ_{k=−∞}^{∞} rxx[k] exp(−j2πfk)
Pxy(f) = Σ_{k=−∞}^{∞} rxy[k] exp(−j2πfk)
Discrete White Noise : is a discrete random process with zero mean and rxx[k] = σ²δ[k], where δ[k] is the Kronecker impulse function. The PSD of white noise becomes Pxx(f) = σ² and is completely flat with frequency.
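A quick numerical illustration (a minimal sketch in numpy, not from the slides; the biased sample-ACF estimator and the raw periodogram used here are my own choices) of the fact that white noise has rxx[k] = σ²δ[k] and a flat PSD :

import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 4096, 2.0
x = rng.normal(0.0, np.sqrt(sigma2), N)   # zero-mean white Gaussian noise

# Biased sample ACF estimate: r_hat[k] = (1/N) sum_n x[n] x[n+k]
def sample_acf(x, max_lag):
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(max_lag + 1)])

r_hat = sample_acf(x, 5)
print("r_hat[0] ~ sigma^2 :", r_hat[0])       # close to 2.0
print("r_hat[1..5] ~ 0    :", r_hat[1:])

# Periodogram: a crude PSD estimate, approximately flat at sigma^2
P = np.abs(np.fft.fft(x))**2 / N
print("mean periodogram level ~ sigma^2 :", P.mean())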
x[n] = A + Bn + w[n], n = 0, 1, . . . , N − 1
Suppose that w[n] ∼ N(0, σ²) and is uncorrelated with all the other samples. Letting θ = [A B] and x = [x[0], x[1], . . . , x[N − 1]], the PDF is :
p(x; θ) = Π_{n=0}^{N−1} p(x[n]; θ) = ( 1/(√(2πσ²))^N ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)² )
The quality of any estimator for this problem depends on the assumptions made in the data model : in this example, the linear trend and the WGN PDF assumption.
p(x, θ) = p(x|θ)p(θ)
x[n] = A + w[n], n = 0, 1, . . . , N − 1
For the sample-mean estimator Â1 = (1/N) Σ_{n=0}^{N−1} x[n] :
E(Â1) = (1/N) Σ_{n=0}^{N−1} E(x[n]) = (1/N) Σ_{n=0}^{N−1} E(A + w[n]) = (1/N) Σ_{n=0}^{N−1} (A + 0) = A
var(Â1) = var( (1/N) Σ_{n=0}^{N−1} x[n] ) = (1/N²) Σ_{n=0}^{N−1} var(x[n]) = (1/N²) Nσ² = σ²/N
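A small Monte Carlo check of these two results (a sketch, not from the slides; the values A = 1, σ² = 4, N = 50 are arbitrary) :

import numpy as np

rng = np.random.default_rng(1)
A, sigma2, N, trials = 1.0, 4.0, 50, 20000

# Each row is one realization x[0..N-1] = A + w[n]; A_hat is the sample mean
x = A + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
A_hat = x.mean(axis=1)

print("E(A_hat)   ~ A        :", A_hat.mean())              # ~ 1.0
print("var(A_hat) ~ sigma2/N :", A_hat.var(), sigma2 / N)   # ~ 0.08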
Averaging n independent unbiased estimators θ̂i (each with the same variance) gives :
θ̂ = (1/n) Σ_{i=1}^{n} θ̂i ⇒ E(θ̂) = θ
var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂i) = (1/n²) n var(θ̂i) = var(θ̂i)/n
The most logical criterion for estimation is the Mean Square Error (MSE) :
var(θ̂) ≥ CRLB(θ)
ln p(x[0]; A) = −ln √(2πσ²) − (1/(2σ²)) (x[0] − A)²
Then
∂ ln p(x[0]; A)/∂A = (1/σ²) (x[0] − A) ⇒ −∂² ln p(x[0]; A)/∂A² = 1/σ²
According to the Theorem :
var(Â) ≥ σ², I(A) = 1/σ² and Â = g(x[0]) = x[0]
p(x; A) = ( 1/(√(2πσ²))^N ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
Then
∂ ln p(x; A)/∂A = ∂/∂A [ −ln((2πσ²)^{N/2}) − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]
= (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²) ( (1/N) Σ_{n=0}^{N−1} x[n] − A )
According to the Theorem :
var(Â) ≥ σ²/N, I(A) = N/σ² and Â = g(x) = (1/N) Σ_{n=0}^{N−1} x[n]
Transformation of Parameters
If it is desired to estimate α = g (θ), then the CRLB is :
var(α̂) ≥ [ −E( ∂² ln p(x; θ)/∂θ² ) ]⁻¹ ( ∂g/∂θ )²
Definition
Efficient estimator : An unbiased estimator that attains the CRLB is said
to be efficient.
Example : Knowing that x̄ = (1/N) Σ_{n=0}^{N−1} x[n] is an efficient estimator for A, is x̄² an efficient estimator for A² ?
Transformation of Parameters
Solution : Knowing that x̄ ∼ N(A, σ²/N), we have :
E(x̄²) = E²(x̄) + var(x̄) = A² + σ²/N ≠ A²
So the estimator x̄² of A² is not even unbiased.
Let’s look at the variance of this estimator :
var(x̄²) = E(x̄⁴) − E²(x̄²)
but we have from the moments of Gaussian RVs (slide 15) :
E(x̄⁴) = A⁴ + 6A² (σ²/N) + 3 (σ²/N)²
Therefore :
var(x̄²) = A⁴ + 6A²σ²/N + 3σ⁴/N² − (A² + σ²/N)² = 4A²σ²/N + 2σ⁴/N²
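The bias σ²/N and the variance expression are easy to verify numerically (a sketch, not part of the slides; A = 1, σ² = 4, N = 10 are arbitrary values) :

import numpy as np

rng = np.random.default_rng(2)
A, sigma2, N, trials = 1.0, 4.0, 10, 200000

xbar = (A + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))).mean(axis=1)
est = xbar**2                                   # proposed estimator of A^2

print("E(xbar^2)                 :", est.mean())          # ~ A^2 + sigma2/N = 1.4
print("A^2 + sigma2/N            :", A**2 + sigma2 / N)
print("var(xbar^2)               :", est.var())
print("4A^2 s2/N + 2 s2^2/N^2    :", 4*A**2*sigma2/N + 2*sigma2**2/N**2)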
Then the variance of any unbiased estimator θ̂ satisfies Cθ̂ − I−1 (θ) ≥ 0
where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the
Fisher information matrix and is given by :
[I(θ)]ij = −E[ ∂² ln p(x; θ) / ∂θi ∂θj ]
For the DC level in WGN with θ = [A σ²] both unknown :
ln p(x; θ) = −(N/2) ln 2π − (N/2) ln σ² − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²
The Fisher information matrix is :
I(θ) = −E [ ∂²ln p(x;θ)/∂A²   ∂²ln p(x;θ)/∂A∂σ² ; ∂²ln p(x;θ)/∂A∂σ²   ∂²ln p(x;θ)/∂(σ²)² ] = [ N/σ²   0 ; 0   N/(2σ⁴) ]
The matrix is diagonal (just for this example) and can be easily inverted to yield :
var(Â) ≥ σ²/N and var(σ̂²) ≥ 2σ⁴/N
So the CRLB is :
var(α̂) ≥ (∂g/∂θ) I⁻¹(θ) (∂g/∂θ)ᵀ = [ 2A/σ²   −A²/σ⁴ ] [ σ²/N   0 ; 0   2σ⁴/N ] [ 2A/σ²   −A²/σ⁴ ]ᵀ = (4α + 2α²)/N
for α = g(θ) = A²/σ².
x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]
with
hak = [ cos(2πk·0/N), cos(2πk·1/N), . . . , cos(2πk(N−1)/N) ]ᵀ , hbk = [ sin(2πk·0/N), sin(2πk·1/N), . . . , sin(2πk(N−1)/N) ]ᵀ
Hence the MVU estimate of the Fourier coefficients is : θ̂ = (HT H)−1 HT x
Example (Fourier Analysis)
After simplification (noting that (HᵀH)⁻¹ = (2/N) I), we have :
θ̂ = (2/N) [ (ha1)ᵀx, . . . , (haM)ᵀx, (hb1)ᵀx, . . . , (hbM)ᵀx ]ᵀ
which is the same as the standard solution :
âk = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N) , b̂k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)
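A numerical sketch (numpy; M, N and the true coefficients are arbitrary choices of mine) showing that the closed-form âk, b̂k coincide with the general solution (HᵀH)⁻¹Hᵀx :

import numpy as np

rng = np.random.default_rng(3)
N, M = 64, 2
n = np.arange(N)
a_true, b_true = np.array([1.0, -0.5]), np.array([0.3, 2.0])

# Build H = [h_a1 ... h_aM  h_b1 ... h_bM]
H = np.column_stack(
    [np.cos(2*np.pi*k*n/N) for k in range(1, M+1)]
    + [np.sin(2*np.pi*k*n/N) for k in range(1, M+1)])
x = H @ np.concatenate([a_true, b_true]) + rng.normal(0, 0.1, N)

theta_ls = np.linalg.solve(H.T @ H, H.T @ x)      # (H^T H)^{-1} H^T x
theta_direct = (2.0/N) * (H.T @ x)                # uses (H^T H) = (N/2) I

print(np.allclose(theta_ls, theta_direct))        # True
print(theta_ls)                                   # close to [1, -0.5, 0.3, 2]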
x[n] = Σ_{k=0}^{p−1} h[k] u[n − k] + w[n], n = 0, 1, . . . , N − 1
Remark : The unbiasedness restriction implies a linear model for the data.
However, it may still be used if the data are transformed suitably or the
model is linearized.
Finding the BLUE (Scalar Case)
1 Choose a linear estimator for the observed data
x[n] , n = 0, 1, . . . , N − 1
θ̂ = Σ_{n=0}^{N−1} an x[n] = aᵀx where a = [a0, a1, . . . , aN−1]ᵀ
2 Restrict the estimator to be unbiased :
E(θ̂) = Σ_{n=0}^{N−1} an E(x[n]) = θ
In summary :
1 Choose a linear estimator : θ̂ = Σ_{n=0}^{N−1} an x[n] = aᵀx
2 Restrict the estimator to be unbiased : E(θ̂) = aᵀE(x) = aᵀsθ = θ, then aᵀs = 1 where s = [s[0], s[1], . . . , s[N − 1]]ᵀ
3 Minimize the variance aᵀCa subject to aᵀs = 1.
Theorem (Gauss–Markov)
If the data are of the general linear model form
x = Hθ + w
where w is a noise vector with zero mean and covariance C (the PDF of w is otherwise arbitrary), then the BLUE of θ is :
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x
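A minimal numpy sketch of the Gauss–Markov estimator stated above (the line model, the heteroscedastic noise variances and all numerical values are made-up illustrative choices) :

import numpy as np

rng = np.random.default_rng(4)
N = 100
n = np.arange(N)
H = np.column_stack([np.ones(N), n])         # DC level + slope: x = A + B n + w
theta_true = np.array([1.0, 0.05])

# Uncorrelated noise with unequal variances: C = diag(var_n)
var_n = 0.5 + 0.05 * n
x = H @ theta_true + rng.normal(0.0, np.sqrt(var_n))

Cinv = np.diag(1.0 / var_n)                  # C^{-1} for diagonal C
theta_blue = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)
cov_blue = np.linalg.inv(H.T @ Cinv @ H)     # covariance of the BLUE

print("theta_blue :", theta_blue)
print("std devs   :", np.sqrt(np.diag(cov_blue)))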
Basic Idea : Choose the parameter value that makes the observed data the most likely to have been observed.
Example (DC level in WGN with variance A) : x[n] = A + w[n] with w[n] ∼ N(0, A), A > 0. The PDF is :
p(x; A) = ( 1/(2πA)^{N/2} ) exp( −(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)² )
∂ ln p(x; A)/∂A = −N/(2A) + (1/A) Σ_{n=0}^{N−1} (x[n] − A) + (1/(2A²)) Σ_{n=0}^{N−1} (x[n] − A)²
Example
Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄N = (1/N) Σ_{n=1}^{N} xn. Then, according to the Central Limit Theorem (CLT), zN = √N (x̄N − µ)/σ converges in distribution to z ∼ N(0, 1).
Example
Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄N = (1/N) Σ_{n=1}^{N} xn. Then, according to the Law of Large Numbers (LLN), x̄N converges in probability to µ.
Asymptotic Properties of Estimators
Asymptotic Unbiasedness : Estimator θ̂N is an asymptotically unbiased
estimator of θ if :
lim E (θ̂N − θ) = 0
N→∞
Asymptotic Distribution : It refers to p(xn) as it evolves for n = 1, 2, . . ., especially for large values of n (it is not the ultimate form of the distribution, which may be degenerate).
Remarks :
If θ̂N is asymptotically unbiased and its limiting variance is zero then
it converges to θ in mean-square.
If θ̂N converges to θ in mean-square, then the estimator is consistent.
Asymptotic unbiasedness does not imply consistency and vice versa.
plim can be treated as an operator, e.g. :
plim(xy) = plim(x) plim(y) ; plim(x/y) = plim(x)/plim(y)
p(x; A) = ( 1/(2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0
⇒ Σ_{n=0}^{N−1} x[n] − NÂ = 0
which leads to Â = x̄ = (1/N) Σ_{n=0}^{N−1} x[n]
α̂ = g(θ̂)
Example
Consider the DC level in WGN and find the MLE of α = A². Since g(θ) is not a one-to-one function :
α̂ = arg max_{α≥0} max{ p(x; √α), p(x; −√α) } = [ arg max_{−∞<A<∞} p(x; A) ]² = Â² = x̄²
∂ ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)
∂ ln p(x; θ)/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N−1} (x[n] − A)²
Setting both derivatives to zero gives :
 = x̄ and σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²
∂ ln p(x; θ)/∂θ = ( ∂(Hθ)ᵀ/∂θ ) C⁻¹ (x − Hθ)
Then setting the gradient to zero gives θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x.
Remarks :
The Hessian can be replaced by the negative of its expectation, the
Fisher information matrix I(θ).
This method suffers from convergence problems (local maximum).
Typically, for large data length, the log-likelihood function becomes
more quadratic and the algorithm will produce the MLE.
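A sketch of such an iterative MLE (not from the slides) : the scoring iteration A_{k+1} = A_k + score(A_k)/I(A_k) applied to the earlier example x[n] = A + w[n] with w[n] ∼ N(0, A). The score is the derivative given above; the Fisher information I(A) = N/A + N/(2A²) used here is derived for that particular model.

import numpy as np

rng = np.random.default_rng(5)
N, A_true = 1000, 2.0
x = A_true + rng.normal(0.0, np.sqrt(A_true), N)   # variance equals the mean A

def score(A):
    d = x - A
    return -N/(2*A) + d.sum()/A + (d**2).sum()/(2*A**2)

def fisher(A):                     # I(A) = N/A + N/(2 A^2) for this model
    return N/A + N/(2*A**2)

A = x.mean()                       # reasonable starting point
for _ in range(20):
    A = A + score(A) / fisher(A)   # scoring step: Hessian replaced by -I(A)

print("MLE of A :", A)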
Example
Estimate the DC level of a signal. We observe x[n] = A + w[n] for n = 0, . . . , N − 1 and the LS criterion is :
J(A) = Σ_{n=0}^{N−1} (x[n] − A)²
∂J(A)/∂A = −2 Σ_{n=0}^{N−1} (x[n] − A) = 0 ⇒ Â = (1/N) Σ_{n=0}^{N−1} x[n]
J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])² = (x − Hθ)ᵀ(x − Hθ) = xᵀx − 2xᵀHθ + θᵀHᵀHθ
∂J(θ)/∂θ = −2Hᵀx + 2HᵀHθ = 0 ⇒ θ̂ = (HᵀH)⁻¹Hᵀx
The minimum LS cost is :
Jmin = xᵀ(x − Hθ̂) = xᵀx − xᵀH(HᵀH)⁻¹Hᵀx
Weighted LS : with a symmetric positive definite weighting matrix W, minimizing (x − Hθ)ᵀW(x − Hθ) gives :
θ̂ = (HᵀWH)⁻¹HᵀWx
The data vector can lie anywhere in R^N, while signal vectors must lie in a p-dimensional subspace of R^N, termed S^p, which is spanned by the columns of H.
Geometrical Interpretation (Orthogonal Projection)
ŝ = Hθ̂ = Σ_{i=1}^{p} θ̂i hi = Σ_{i=1}^{p} (hiᵀx) hi (the last equality assumes orthonormal columns hi)
Remarks :
By increasing the order, Jmin is monotonically non-increasing.
By choosing p = N we can perfectly fit the data to the model. However, we fit the noise as well.
We should choose the simplest model that adequately describes the
data.
We increase the order only if the cost reduction is significant.
If we have an idea about the expected level of Jmin , we increase p to
approximately attain this level.
There is an order-recursive LS algorithm to efficiently compute a
(p + 1)-th order model based on a p-th order one (See 8.6).
Sequential Least Squares
Suppose that θ̂[N − 1], based on the data x[0], . . . , x[N − 1], is available. If we get a new sample x[N], we want to compute θ̂[N] as a function of θ̂[N − 1] and x[N].
Example
Consider the LS estimate of the DC level : Â[N − 1] = (1/N) Σ_{n=0}^{N−1} x[n]. We have :
Â[N] = (1/(N+1)) Σ_{n=0}^{N} x[n] = (1/(N+1)) [ N · (1/N) Σ_{n=0}^{N−1} x[n] + x[N] ]
= (N/(N+1)) Â[N − 1] + (1/(N+1)) x[N]
= Â[N − 1] + (1/(N+1)) ( x[N] − Â[N − 1] )
new estimate = old estimate + gain × prediction error
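A sketch of this recursion in numpy (all values arbitrary); the running estimate coincides with the batch sample mean at every step :

import numpy as np

rng = np.random.default_rng(6)
A_true, sigma = 1.0, 2.0
x = A_true + rng.normal(0.0, sigma, 200)

A_hat = x[0]                        # estimate based on the first sample
for N, xN in enumerate(x[1:], start=1):
    gain = 1.0 / (N + 1)
    A_hat = A_hat + gain * (xN - A_hat)   # new = old + gain * prediction error

print(A_hat, x.mean())              # identical up to round-off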
Sequential Least Squares
Example (DC level in uncorrelated noise)
x[n] = A + w[n] and var(w[n]) = σn². The WLS estimate (or BLUE) is :
Â[N − 1] = ( Σ_{n=0}^{N−1} x[n]/σn² ) / ( Σ_{n=0}^{N−1} 1/σn² )
Â[N] = Â[N − 1] + ( (1/σN²) / Σ_{n=0}^{N} 1/σn² ) ( x[N] − Â[N − 1] )
= Â[N − 1] + K[N] ( x[N] − Â[N − 1] )
K[N] = var(Â[N − 1]) / ( var(Â[N − 1]) + σN² )
Sequential Least Squares
Consider the general linear model x = Hθ + w, where w is an uncorrelated noise with the covariance matrix C. The BLUE (or WLS) is :
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x
Let’s define Σ[n] as the covariance of θ̂[n], the estimate based on x[0], . . . , x[n]. The estimate is updated as :
θ̂[n] = θ̂[n − 1] + K[n] ( x[n] − hᵀ[n] θ̂[n − 1] )
where
K[n] = Σ[n − 1] h[n] / ( σn² + hᵀ[n] Σ[n − 1] h[n] )
Covariance Update :
Σ[n] = ( I − K[n] hᵀ[n] ) Σ[n − 1]
Initialization ?
Constrained Least Squares
Linear constraints in the form Aθ = b can be considered in LS solution.
The LS criterion becomes :
Jc (θ) = (x − Hθ)T (x − Hθ) + λT (Aθ − b)
∂Jc/∂θ = −2Hᵀx + 2HᵀHθ + Aᵀλ
Setting the gradient equal to zero produces :
θ̂c = (HᵀH)⁻¹Hᵀx − (1/2)(HᵀH)⁻¹Aᵀλ = θ̂ − (HᵀH)⁻¹Aᵀ λ/2
where θ̂ = (HT H)−1 HT x is the unconstrained LSE.
Now Aθ̂c = b can be solved to find λ :
λ = 2[A(HT H)−1 AT ]−1 (Aθ̂ − b)
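A numpy sketch of these constrained-LS formulas (H, the constraint A θ = b and all numbers are small made-up examples : two parameters constrained to sum to one) :

import numpy as np

rng = np.random.default_rng(7)
N = 50
H = np.column_stack([np.ones(N), np.arange(N) / N])
theta_true = np.array([0.3, 0.7])
x = H @ theta_true + rng.normal(0.0, 0.2, N)

A = np.array([[1.0, 1.0]])          # constraint A theta = b : theta_1 + theta_2 = 1
b = np.array([1.0])

HtH_inv = np.linalg.inv(H.T @ H)
theta_u = HtH_inv @ H.T @ x                                    # unconstrained LSE
lam = 2 * np.linalg.solve(A @ HtH_inv @ A.T, A @ theta_u - b)  # lambda
theta_c = theta_u - HtH_inv @ A.T @ (lam / 2)                  # constrained LSE

print("unconstrained:", theta_u, " sum =", theta_u.sum())
print("constrained  :", theta_c, " sum =", theta_c.sum())      # sum is exactly 1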
where g(θ) = ( ∂sᵀ(θ)/∂θ ) [x − s(θ)] and
∂[g(θ)]i/∂θj = Σ_{n=0}^{N−1} [ (x[n] − s[n]) ∂²s[n]/(∂θi ∂θj) − (∂s[n]/∂θj)(∂s[n]/∂θi) ]
Around the solution [x − s(θ)] is small so the first term in the Jacobian
can be neglected. This makes this method equivalent to the Gauss-Newton
algorithm which is numerically more robust.
Nonlinear Least Squares
Gauss-Newton Method : Linearize signal model around the current
estimate and solve the resulting linear problem.
s(θ) ≈ s(θ0) + ( ∂s(θ)/∂θ )|θ=θ0 (θ − θ0) = s(θ0) + H(θ0)(θ − θ0)
Then the LS solution of the linearized problem gives the iteration :
θ̂k+1 = θ̂k + ( Hᵀ(θ̂k) H(θ̂k) )⁻¹ Hᵀ(θ̂k) ( x − s(θ̂k) )
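A minimal Gauss–Newton sketch (illustrative only; the exponential model s[n; θ] = θ1 exp(θ2 n), the starting point and all numbers are my own choices, not from the slides) following the linearize-and-solve recipe above :

import numpy as np

rng = np.random.default_rng(8)
n = np.arange(50, dtype=float)
theta_true = np.array([2.0, -0.05])
x = theta_true[0] * np.exp(theta_true[1] * n) + rng.normal(0.0, 0.05, n.size)

def s(theta):                        # signal model s[n; theta] = theta1 * exp(theta2 * n)
    return theta[0] * np.exp(theta[1] * n)

def jac(theta):                      # H(theta) = ds/dtheta, one column per parameter
    e = np.exp(theta[1] * n)
    return np.column_stack([e, theta[0] * n * e])

theta = np.array([1.0, -0.1])        # initial guess
for _ in range(20):
    H = jac(theta)
    r = x - s(theta)
    theta = theta + np.linalg.solve(H.T @ H, H.T @ r)   # Gauss-Newton step

print("estimate :", theta)           # close to [2.0, -0.05]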
α̂ = arg max_α xᵀ H(α) [ Hᵀ(α) H(α) ]⁻¹ Hᵀ(α) x
s[n] = A1 r n + A2 r 2n + A3 r 3n n = 0, 1, . . . , N − 1
Bayesian MSE :
Bmse(θ̂) = E{ [θ − θ̂(x)]² } = ∫∫ [θ − θ̂(x)]² p(x, θ) dx dθ
The Bayesian MSE is not a function of θ and can be minimized to find an estimator.
Â = ( (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
  / ( (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
Before collecting the data, the mean of the prior PDF p(A) is the best estimate; after collecting the data, the best estimate is the mean of the posterior PDF p(A|x).
The choice of p(A) is crucial for the quality of the estimation.
Only a Gaussian prior PDF leads to a closed-form estimator.
Conclusion : For an accurate estimator choose a prior PDF that can be
physically justified. For a closed-form estimator choose a Gaussian prior
PDF.
Properties of the Gaussian PDF
If x and y are two jointly Gaussian RVs with PDF
p(x, y) = ( 1/(2π √det(C)) ) exp( −(1/2) [x − E(x), y − E(y)] C⁻¹ [x − E(x), y − E(y)]ᵀ )
with covariance matrix : C = [ var(x)   cov(x, y) ; cov(x, y)   var(y) ]
then the conditional PDF p(y|x) is also Gaussian and :
E(y|x) = E(y) + ( cov(x, y)/var(x) ) [x − E(x)]
var(y|x) = var(y) − cov²(x, y)/var(x)
Â = µA|x = µA + ( cov(A, x[0])/var(x[0]) ) [x[0] − µA] = µA + ( σA²/(σ² + σA²) ) (x[0] − µA)
var(Â) = σ²A|x = σA² − cov²(A, x[0])/var(x[0]) = σA² ( 1 − σA²/(σ² + σA²) )
Properties of the Gaussian PDF
If x (k × 1) and y (l × 1) are jointly Gaussian with PDF
p(x, y) = ( 1/((2π)^{(k+l)/2} √det(C)) ) exp( −(1/2) [x − E(x); y − E(y)]ᵀ C⁻¹ [x − E(x); y − E(y)] )
with covariance matrix : C = [ Cxx   Cxy ; Cyx   Cyy ]
then the conditional PDF p(y|x) is also Gaussian and :
E(y|x) = E(y) + Cyx Cxx⁻¹ ( x − E(x) )
Cy|x = Cyy − Cyx Cxx⁻¹ Cxy
x = Hθ + w
The posterior PDF p(A|x) is Gaussian with mean and variance :
E(A|x) = µA + ( σA² / (σA² + σ²/N) ) (x̄ − µA) and var(A|x) = ( (σ²/N) σA² ) / ( σA² + σ²/N )
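A numpy sketch of these posterior formulas (µA, σA², σ², N are arbitrary illustrative values) : the posterior mean shrinks x̄ toward the prior mean µA, and the posterior variance is smaller than both σA² and σ²/N.

import numpy as np

rng = np.random.default_rng(9)
mu_A, var_A = 0.0, 1.0            # prior A ~ N(mu_A, var_A)
sigma2, N = 4.0, 20               # noise variance and data length
A = rng.normal(mu_A, np.sqrt(var_A))
x = A + rng.normal(0.0, np.sqrt(sigma2), N)
xbar = x.mean()

alpha = var_A / (var_A + sigma2 / N)                       # shrinkage factor in [0, 1]
post_mean = mu_A + alpha * (xbar - mu_A)                   # E(A | x)
post_var = (var_A * sigma2 / N) / (var_A + sigma2 / N)     # var(A | x)

print("A true   :", A)
print("xbar     :", xbar)
print("E(A|x)   :", post_mean)
print("var(A|x) :", post_var, " vs sigma2/N =", sigma2 / N)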
where R(θ̂) = E {C (ε)} is the Bayes Risk, C (ε) is a cost function and
ε = θ − θ̂ is the estimation error.
Quadratic
For this case R(θ̂) = E {(θ − θ̂)2 } =Bmse(θ̂) which leads to the MMSE
estimator with
θ̂ = E (θ|x)
so θ̂ is the mean of the posterior PDF p(θ|x).
Absolute
In this case we have :
g(θ̂) = ∫_{−∞}^{∞} |θ − θ̂| p(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ
By setting the derivative of g(θ̂) equal to zero and using Leibnitz’s rule, we get :
∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ
so θ̂ is the median of the posterior PDF p(θ|x).
Hit-or-Miss
In this case we have :
g(θ̂) = ∫_{−∞}^{θ̂−δ} 1 · p(θ|x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ|x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ
For small δ, this is minimized by choosing θ̂ at the maximum of p(θ|x), i.e. the mode of the posterior PDF (the MAP estimator).
θ̂ → [HᵀCw⁻¹H]⁻¹HᵀCw⁻¹x
An equivalent maximization is :
θ̂ = arg max p(x|θ)p(θ)
θ
or
θ̂ = arg max[ln p(x|θ) + ln p(θ)]
θ
If p(θ) is uniform or is approximately constant around the maximum
of p(x|θ) (for large data length), we can remove p(θ) to obtain the
Bayesian Maximum Likelihood :
θ̂ = arg max p(x|θ)
θ
Maximum A Posteriori Estimators
Example (DC Level in WGN with Uniform Prior PDF)
The MMSE estimator cannot be obtained in explicit form due to the need
to evaluate the following integrals :
Â = ( (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
  / ( (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
The MAP estimator is given as :
Â = −A0 if x̄ < −A0 ; Â = x̄ if −A0 ≤ x̄ ≤ A0 ; Â = A0 if x̄ > A0
Remark : The main advantage of the MAP estimator is that, for PDFs that are not jointly Gaussian, it may lead to explicit solutions or less computational effort.
Maximum A Posteriori Estimators
where s[n] is a WSS Gaussian process with known ACF, and w [n] is WGN
with variance σ 2 .
x = Hθ + w with θ = [s[0], s[1], . . . , s[ns − 1]]ᵀ and H the N × ns convolution matrix
H = [ h[0]   0   ···   0 ; h[1]   h[0]   ···   0 ; ··· ; h[N−1]   h[N−2]   ···   h[N−ns] ]
⇒ ŝ = Cs Hᵀ ( H Cs Hᵀ + σ²I )⁻¹ x
where Cs is a symmetric Toeplitz matrix with [Cs ]ij = rss [i − j]
Remarks :
The results are identical to those of MMSE estimators for jointly
Gaussian PDF.
The LMMSE estimator relies on the correlation between random
variables.
If a parameter is uncorrelated with the data (but nonlinearly
dependent), it cannot be estimated with an LMMSE estimator.
where A ∼ U[−A0, A0], so µA = 0 and
σA² = E(A²) = ∫_{−A0}^{A0} A² (1/(2A0)) dA = [ A³/(6A0) ]_{−A0}^{A0} = A0²/3
MMSE : Â = ( ∫_{−A0}^{A0} A exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA ) / ( ∫_{−A0}^{A0} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ) dA )
MAP : Â = −A0 if x̄ < −A0 ; Â = x̄ if −A0 ≤ x̄ ≤ A0 ; Â = A0 if x̄ > A0
LMMSE : Â = ( A0²/3 / (A0²/3 + σ²/N) ) x̄
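A sketch comparing the three estimators numerically (numpy; the MMSE integrals are evaluated on a simple grid and all parameter values are arbitrary) :

import numpy as np

rng = np.random.default_rng(10)
A0, sigma2, N = 1.0, 4.0, 10
A = rng.uniform(-A0, A0)
x = A + rng.normal(0.0, np.sqrt(sigma2), N)
xbar = x.mean()

# MMSE: posterior mean for the uniform prior, by numerical integration on a grid
grid = np.linspace(-A0, A0, 2001)
logpost = -0.5 / sigma2 * ((x[:, None] - grid[None, :])**2).sum(axis=0)
w = np.exp(logpost - logpost.max())           # unnormalized posterior on the grid
A_mmse = (grid * w).sum() / w.sum()

A_map = np.clip(xbar, -A0, A0)                # MAP: xbar clipped to [-A0, A0]
A_lmmse = (A0**2 / 3) / (A0**2 / 3 + sigma2 / N) * xbar

print("A true :", A)
print("MMSE   :", A_mmse, " MAP :", A_map, " LMMSE :", A_lmmse)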
Geometrical Interpretation
Vector space of random variables : The set of scalar zero mean random
variables is a vector space.
The zero length vector of the set is a RV with zero variance.
For any real scalar a, ax is another zero mean RV in the set.
For two RVs x and y , x + y = y + x is a RV in the set.
For two RVs x and y , the inner product is : < x, y >= E (xy )
√
The length of x is defined as : x = < x, x > = E (x 2 )
Two RVs x and y are orthogonal if : < x, y >= E (xy ) = 0.
For RVs x1, x2, y and real numbers a1 and a2 we have : < a1x1 + a2x2, y > = a1 < x1, y > + a2 < x2, y >
E[ ( θ − Σ_{m=0}^{N−1} am x[m] ) x[n] ] = 0, n = 0, 1, . . . , N − 1
or
Σ_{m=0}^{N−1} am E(x[m]x[n]) = E(θx[n]), n = 0, 1, . . . , N − 1
[ E(x²[0])          E(x[0]x[1])       ···   E(x[0]x[N−1])   ] [ a0   ]   [ E(θx[0])    ]
[ E(x[1]x[0])       E(x²[1])          ···   E(x[1]x[N−1])   ] [ a1   ] = [ E(θx[1])    ]
[ ···                                                        ] [ ···  ]   [ ···         ]
[ E(x[N−1]x[0])     E(x[N−1]x[1])     ···   E(x²[N−1])      ] [ aN−1 ]   [ E(θx[N−1])  ]
Therefore
Cxx a = Cθxᵀ ⇒ a = Cxx⁻¹ Cθxᵀ
The LMMSE estimator is :
θ̂ = aᵀx = Cθx Cxx⁻¹ x
The vector LMMSE Estimator
We want to estimate θ = [θ1, θ2, . . . , θp]ᵀ with a linear estimator that minimizes the Bayesian MSE for each element :
θ̂i = Σ_{n=0}^{N−1} ain x[n] + aiN    minimize Bmse(θ̂i) = E[ (θi − θ̂i)² ]
Therefore :
θ̂i = E(θi) + Cθix Cxx⁻¹ ( x − E(x) ), i = 1, 2, . . . , p    (Cθix : 1×N, Cxx : N×N, x − E(x) : N×1)
The scalar LMMSE estimators can be combined into a vector estimator :
θ̂ = E(θ) + Cθx Cxx⁻¹ ( x − E(x) )    (θ̂ : p×1, Cθx : p×N, Cxx : N×N, x − E(x) : N×1)
and similarly
Bmse(θ̂i) = [Mθ̂]ii where Mθ̂ = Cθθ − Cθx Cxx⁻¹ Cxθ    (Mθ̂, Cθθ : p×p, Cθx : p×N, Cxθ : N×p)
x = Hθ + w
x[n] = A + w[n] ⇒ Â[N − 1] = ( σA² / (σA² + σ²/N) ) x̄
where K[N] = Bmse(Â[N − 1]) / ( Bmse(Â[N − 1]) + σ² )
Minimum MSE Update :
Bmse(Â[N]) = ( 1 − K[N] ) Bmse(Â[N − 1])
1 Find the LMMSE estimator of A based on x[0] :
Â[0] = ( E(Ax[0]) / E(x²[0]) ) x[0] = ( σA² / (σA² + σ²) ) x[0]
2 Find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0] :
x̂[1|0] = ( E(x[0]x[1]) / E(x²[0]) ) x[0] = ( σA² / (σA² + σ²) ) x[0]
3 Determine the innovation of the new data : x̃[1] = x[1] − x̂[1|0]. This error vector is orthogonal to x[0].
4 Add to Â[0] the LMMSE estimator of A based on the innovation :
Â[1] = Â[0] + ( E(Ax̃[1]) / E(x̃²[1]) ) x̃[1] = Â[0] + K[1] ( x[1] − x̂[1|0] )
(the added term is the projection of A on x̃[1])
Sequential LMMSE Estimation
Basic Idea : Generate a sequence of orthogonal RVs, namely, the
innovations :
{ x̃[0], x̃[1], x̃[2], . . . , x̃[n] } = { x[0], x[1] − x̂[1|0], x[2] − x̂[2|0,1], . . . , x[n] − x̂[n|0,1,...,n−1] }
Remarks :
For the initialization of the sequential LMMSE estimator, we can use the prior information : θ̂[−1] = E(θ) and M[−1] = Cθθ.
Remarks : All these problems are solved using the LMMSE estimator :
θ̂ = Cθx Cxx⁻¹ x with the minimum Bmse : Mθ̂ = Cθθ − Cθx Cxx⁻¹ Cxθ
This represents a time-varying, causal FIR filter. The impulse response can
be computed as :
(Rss + Rww )a = rss ⇒ (Rss + Rww )hn = rss
where u[n] is WGN with variance σu2 , s[−1] ∼ N (µs , σs2 ) and s[−1] is
independent of u[n] for all n ≥ 0. It can be shown that :
s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k] ⇒ E(s[n]) = a^{n+1} µs
and
Cs[m, n] = E{ ( s[m] − E(s[m]) ) ( s[n] − E(s[n]) ) } = a^{m+n+2} σs² + σu² a^{m−n} Σ_{k=0}^{n} a^{2k} for m ≥ n
Kalman Gain :
K[n] = M[n|n − 1] / ( σn² + M[n|n − 1] )
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K [n](x[n] − ŝ[n|n − 1])
Minimum MSE :
M[n|n] = (1 − K [n])M[n|n − 1]
Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − H[n]ŝ[n|n − 1])
Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n]H[n])M[n|n − 1]
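A scalar sketch of the filtering cycle (numpy; the AR(1) state model s[n] = a·s[n−1] + u[n] and all numbers are illustrative choices of mine). The gain, correction and MSE-update lines match the scalar equations above; the prediction step used here is the standard one for this Gauss–Markov state model.

import numpy as np

rng = np.random.default_rng(11)
a, var_u, var_w, N = 0.95, 0.1, 1.0, 200      # state and measurement noise variances
s = np.zeros(N)
s[0] = rng.normal(0.0, 1.0)
for n in range(1, N):
    s[n] = a * s[n-1] + rng.normal(0.0, np.sqrt(var_u))
x = s + rng.normal(0.0, np.sqrt(var_w), N)    # noisy observations

s_hat, M = 0.0, 1.0                           # initial estimate and MSE
for n in range(N):
    # Prediction
    s_pred = a * s_hat
    M_pred = a**2 * M + var_u
    # Kalman gain, correction, minimum MSE update
    K = M_pred / (var_w + M_pred)
    s_hat = s_pred + K * (x[n] - s_pred)
    M = (1 - K) * M_pred

print("last state / estimate :", s[-1], s_hat)
print("steady-state MSE      :", M)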
Extended Kalman Filter
The extended Kalman filter is a sub-optimal approach when the state
and the observation equations are nonlinear.
Nonlinear data model : Consider the following nonlinear data model
s[n] = f(s[n − 1]) + Bu[n]
x[n] = g(s[n]) + w[n]
where f(·) and g(·) are nonlinear functions (vector-to-vector mappings).
Model linearization : We linearize the model around the estimated value
of s using a first order Taylor series.
f(s[n − 1]) ≈ f(ŝ[n − 1|n − 1]) + ( ∂f/∂s[n − 1] ) ( s[n − 1] − ŝ[n − 1|n − 1] )
g(s[n]) ≈ g(ŝ[n|n − 1]) + ( ∂g/∂s[n] ) ( s[n] − ŝ[n|n − 1] )
We denote the Jacobians by :
A[n − 1] = ∂f/∂s[n − 1] evaluated at ŝ[n − 1|n − 1] and H[n] = ∂g/∂s[n] evaluated at ŝ[n|n − 1]
Extended Kalman Filter
Now we use the linearized models :