Mean Square Estimation: Pillai
$$\sigma^2 = E\{|Y-\varphi(X)|^2\} = E_X\big[E_Y\{|Y-\varphi(X)|^2 \mid X\}\big],$$

where the inner expectation is with respect to Y, and the outer one is with respect to X. Thus

$$\sigma^2 = E\big[E\{|Y-\varphi(X)|^2 \mid X\}\big] = \int E\{|Y-\varphi(X)|^2 \mid X = x\}\, f_X(x)\, dx. \tag{16-6}$$
To obtain the best estimator φ, we need to minimize σ² in (16-6) with respect to φ. In (16-6), since f_X(x) ≥ 0 and E{|Y − φ(X)|² | X} ≥ 0, and since φ appears only in the integrand, minimization of the mean square error σ² in (16-6) with respect to φ is equivalent to minimization of E{|Y − φ(X)|² | X} with respect to φ.
Since X is fixed at some value, φ(X) is no longer random, and hence minimization of E{|Y − φ(X)|² | X} is equivalent to

$$\frac{\partial}{\partial \varphi}\, E\{|Y-\varphi(X)|^2 \mid X\} = 0. \tag{16-7}$$

This gives

$$E\{(Y-\varphi(X)) \mid X\} = 0$$

or

$$E\{Y \mid X\} - E\{\varphi(X) \mid X\} = 0. \tag{16-8}$$

But

$$E\{\varphi(X) \mid X\} = \varphi(X), \tag{16-9}$$

since given X, the quantity φ(X) is a constant. Using (16-9)
in (16-8) we get the desired estimator to be

$$\hat{Y} = \varphi(X) = E\{Y \mid X\} = E\{Y \mid X_1, X_2, \ldots, X_n\}. \tag{16-10}$$

Thus the conditional mean of Y given X₁, X₂, …, Xₙ represents the best estimator for Y that minimizes the mean square error.

The minimum value of the mean square error is given by

$$\sigma_{\min}^2 = E\{|Y - E(Y \mid X)|^2\} = E\big[E\{|Y - E(Y \mid X)|^2 \mid X\}\big] = E\{\operatorname{var}(Y \mid X)\} \ge 0. \tag{16-11}$$
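As a quick illustration (a toy model, not from the lecture), the following Python sketch assumes Y = X² + W with W ~ N(0, 0.1) independent of X, so that E{Y|X} = X². By (16-11), the conditional-mean estimator's error is E{var(Y|X)} = 0.1, and any other estimator, such as the best straight-line fit, must do worse:

```python
import numpy as np

# Toy model (illustrative assumption): Y = X**2 + W, W ~ N(0, 0.1).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 100_000)
Y = X**2 + rng.normal(0.0, np.sqrt(0.1), X.size)

mse_cond = np.mean((Y - X**2) ** 2)        # conditional-mean estimator E{Y|X}
a, b = np.polyfit(X, Y, 1)                 # best least-squares straight line
mse_lin = np.mean((Y - (a * X + b)) ** 2)
print(mse_cond, mse_lin)                   # ~0.10 versus ~0.19
```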
Thus

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{kxy}{kx(1-x^2)/2} = \frac{2y}{1-x^2}, \qquad 0 < x < y < 1. \tag{16-13}$$

Hence the best MMSE estimator is given by
$$\hat{Y} = \varphi(X) = E\{Y \mid X\} = \int_x^1 y\, f_{Y|X}(y \mid x)\, dy = \frac{2}{1-x^2}\int_x^1 y^2\, dy = \frac{2}{1-x^2}\cdot\frac{1-x^3}{3} = \frac{2}{3}\,\frac{1-x^3}{1-x^2} = \frac{2}{3}\,\frac{1+x+x^2}{1+x}. \tag{16-14}$$
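A quick numerical cross-check of (16-14) (illustrative only, using scipy.integrate.quad) confirms the closed form:

```python
from scipy.integrate import quad

# Check E{Y | X = x} = integral of y * 2y/(1-x**2) over [x, 1] against (16-14).
for x in (0.1, 0.5, 0.9):
    numeric, _ = quad(lambda y: y * 2 * y / (1 - x**2), x, 1.0)
    closed = (2.0 / 3.0) * (1 + x + x**2) / (1 + x)
    print(x, numeric, closed)   # the two columns agree
```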
In the linear case, the estimator has the form Ŷ_l = Σᵢ₌₁ⁿ aᵢXᵢ, with error ε = Y − Σᵢ₌₁ⁿ aᵢXᵢ. Minimizing σ² = E{|ε|²} with respect to each coefficient aₖ,

$$\frac{\partial \sigma^2}{\partial a_k} = E\left\{\frac{\partial |\varepsilon|^2}{\partial a_k}\right\} = 2\,E\left\{\varepsilon\,\frac{\partial \varepsilon}{\partial a_k}\right\} = 0, \qquad k = 1, 2, \ldots, n. \tag{16-19}$$

But

$$\frac{\partial \varepsilon}{\partial a_k} = \frac{\partial}{\partial a_k}\Big(Y - \sum_{i=1}^{n} a_i X_i\Big) = \frac{\partial Y}{\partial a_k} - \frac{\partial}{\partial a_k}\Big(\sum_{i=1}^{n} a_i X_i\Big) = -X_k, \tag{16-20}$$

so that (16-19) and (16-20) together give the orthogonality condition

$$E\{\varepsilon X_k^*\} = E\Big\{\Big(Y - \sum_{i=1}^{n} a_i X_i\Big) X_k^*\Big\} = 0, \qquad k = 1, 2, \ldots, n. \tag{16-21}$$
The same orthogonality holds for the nonlinear estimator as well:

$$E\{e\,h(X)\} = 0, \tag{16-22}$$

implying that

$$e = Y - E\{Y \mid X\} \;\perp\; h(X).$$

This follows since

$$E\{e\,h(X)\} = E\{(Y - E[Y \mid X])\,h(X)\} = E\{Yh(X)\} - E\{E[Y \mid X]\,h(X)\} = E\{Yh(X)\} - E\{E[Yh(X) \mid X]\} = E\{Yh(X)\} - E\{Yh(X)\} = 0,$$

where h(X) can be moved inside the conditional expectation because it is fixed given X. Thus in the nonlinear version of the orthogonality rule, the error is orthogonal to any functional form of the data.
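A Monte Carlo sketch of (16-22), reusing the toy model Y = X² + W from above: there the error e = Y − E{Y|X} is just the noise W, so E{e h(X)} should vanish for any function h of the data.

```python
import numpy as np

# e = Y - E{Y|X} = W in the toy model; test orthogonality against several h.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, 500_000)
e = rng.normal(0.0, np.sqrt(0.1), X.size)    # the error e = W
for h in (np.sin, np.exp, lambda t: t**3):
    print(np.mean(e * h(X)))                 # all close to 0
```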
The orthogonality principle in (16-21) can be used to obtain the unknowns a₁, a₂, …, aₙ in the linear case. For example, suppose n = 2, and we need to estimate Y in terms of X₁ and X₂ linearly. Thus

$$\hat{Y}_l = a_1 X_1 + a_2 X_2.$$

From (16-21), the orthogonality rule gives

$$E\{\varepsilon X_1^*\} = E\{(Y - a_1X_1 - a_2X_2)\,X_1^*\} = 0,$$
$$E\{\varepsilon X_2^*\} = E\{(Y - a_1X_1 - a_2X_2)\,X_2^*\} = 0.$$

Thus

$$E\{|X_1|^2\}\,a_1 + E\{X_2X_1^*\}\,a_2 = E\{YX_1^*\},$$
$$E\{X_1X_2^*\}\,a_1 + E\{|X_2|^2\}\,a_2 = E\{YX_2^*\},$$

or

$$\begin{pmatrix} E\{|X_1|^2\} & E\{X_2X_1^*\} \\ E\{X_1X_2^*\} & E\{|X_2|^2\} \end{pmatrix}\begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} E\{YX_1^*\} \\ E\{YX_2^*\} \end{pmatrix}. \tag{16-23}$$
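To make (16-23) concrete, here is a minimal sketch with real-valued synthetic data (the conjugates drop out; the true coefficients 2.0 and −1.0 are arbitrary choices for the demo). The moments are estimated from samples and the 2×2 system is solved directly:

```python
import numpy as np

# Synthetic data: Y = 2*X1 - 1*X2 + small noise (assumed values, demo only).
rng = np.random.default_rng(2)
X1 = rng.normal(size=200_000)
X2 = 0.5 * X1 + rng.normal(size=X1.size)      # correlated second observation
Y = 2.0 * X1 - 1.0 * X2 + 0.1 * rng.normal(size=X1.size)

R = np.array([[np.mean(X1 * X1), np.mean(X2 * X1)],
              [np.mean(X1 * X2), np.mean(X2 * X2)]])
rhs = np.array([np.mean(Y * X1), np.mean(Y * X2)])
a1, a2 = np.linalg.solve(R, rhs)              # the normal equations (16-23)
print(a1, a2)                                 # close to 2.0 and -1.0
```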
The minimum value of the mean square error is

$$\sigma_n^2 = \min_{a_1, a_2, \ldots, a_n} E\{|\varepsilon|^2\} = \min_{a_1, \ldots, a_n} E\Big\{\varepsilon\Big(Y - \sum_{i=1}^{n} a_i X_i\Big)^{\!*}\Big\} = \min_{a_1, \ldots, a_n}\Big[E\{\varepsilon Y^*\} - \sum_{i=1}^{n} a_i^*\,E\{\varepsilon X_i^*\}\Big]. \tag{16-24}$$

But using (16-21), the second term in (16-24) is zero, since the error is orthogonal to the data Xᵢ when a₁, a₂, …, aₙ are chosen to be optimum. Thus the minimum value of the mean square error is given by
$$\sigma_n^2 = E\{\varepsilon Y^*\} = E\Big\{\Big(Y - \sum_{i=1}^{n} a_i X_i\Big)Y^*\Big\} = E\{|Y|^2\} - \sum_{i=1}^{n} a_i\,E\{X_iY^*\}. \tag{16-25}$$
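Continuing the n = 2 sketch above (reusing X1, X2, Y, a1, a2), the minimum error (16-25) can be evaluated from the sample moments and compared with the directly computed residual power:

```python
# Minimum mean square error via (16-25), then the residual power directly.
sigma2_min = np.mean(Y * Y) - a1 * np.mean(X1 * Y) - a2 * np.mean(X2 * Y)
print(sigma2_min)                                # ~ 0.1**2 = 0.01
print(np.mean((Y - a1 * X1 - a2 * X2) ** 2))     # same value, computed directly
```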
Thus the best linear estimator is also the best possible overall estimator
in the Gaussian case.
Next we turn our attention to prediction problems using linear
estimators.
Linear Prediction
Suppose X₁, X₂, …, Xₙ are known and X_{n+1} is unknown. Thus Y = X_{n+1}, and this represents a one-step prediction problem. If the unknown is X_{n+k}, then it represents a k-step ahead prediction problem. Returning to the one-step predictor, let X̂_{n+1} represent the best linear predictor. Then

$$\hat{X}_{n+1} = -\sum_{i=1}^{n} a_i X_i, \tag{16-35}$$

and the corresponding prediction error is

$$\varepsilon_n = X_{n+1} - \hat{X}_{n+1} = a_1X_1 + a_2X_2 + \cdots + a_nX_n + X_{n+1} = \sum_{i=1}^{n+1} a_i X_i, \qquad a_{n+1} = 1. \tag{16-36}$$

The orthogonality conditions E{εₙXₖ*} = 0, 1 ≤ k ≤ n, together with E{εₙεₙ*} = E{εₙX*_{n+1}} = σₙ², can be arranged in matrix form as

$$\begin{pmatrix} r_0 & r_1 & r_2 & \cdots & r_n \\ r_1^* & r_0 & r_1 & \cdots & r_{n-1} \\ \vdots & & & & \vdots \\ r_n^* & r_{n-1}^* & \cdots & r_1^* & r_0 \end{pmatrix}\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ \sigma_n^2 \end{pmatrix}, \tag{16-43}$$

where rₖ = E{X_{i+k}X_i*} denotes the autocorrelation sequence of the (wide-sense stationary) process.
Let

$$T_n = \begin{pmatrix} r_0 & r_1 & r_2 & \cdots & r_n \\ r_1^* & r_0 & r_1 & \cdots & r_{n-1} \\ \vdots & & & & \vdots \\ r_n^* & r_{n-1}^* & \cdots & r_1^* & r_0 \end{pmatrix}. \tag{16-44}$$
Notice that Tₙ is Hermitian Toeplitz and positive definite. Using (16-44), the unknowns in (16-43) can be represented as

$$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ 1 \end{pmatrix} = T_n^{-1}\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \sigma_n^2 \end{pmatrix} = \sigma_n^2\,\big(\text{last column of } T_n^{-1}\big). \tag{16-45}$$
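A minimal sketch of (16-44)-(16-45), assuming a made-up autocorrelation sequence r_k = 0.8^k (an AR(1)-type choice used purely for illustration):

```python
import numpy as np
from scipy.linalg import toeplitz

n = 4
r = 0.8 ** np.arange(n + 1)            # assumed autocorrelations r_0, ..., r_n
Tn = toeplitz(r)                       # (n+1)x(n+1) Hermitian Toeplitz matrix

last_col = np.linalg.inv(Tn)[:, -1]    # last column of Tn^{-1}
sigma2 = 1.0 / last_col[-1]            # minimum error power (cf. (16-48) below)
coeffs = sigma2 * last_col             # (a_1, ..., a_n, 1), per (16-45)
print(coeffs)                          # here: [0, 0, 0, -0.8, 1]
print(sigma2)                          # here: 0.36
```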
Let

$$T_n^{-1} = \begin{pmatrix} T_n^{11} & T_n^{12} & \cdots & T_n^{1,n+1} \\ T_n^{21} & T_n^{22} & \cdots & T_n^{2,n+1} \\ \vdots & & & \vdots \\ T_n^{n+1,1} & T_n^{n+1,2} & \cdots & T_n^{n+1,n+1} \end{pmatrix}. \tag{16-46}$$

Then (16-45) reads

$$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ 1 \end{pmatrix} = \sigma_n^2 \begin{pmatrix} T_n^{1,n+1} \\ T_n^{2,n+1} \\ \vdots \\ T_n^{n,n+1} \\ T_n^{n+1,n+1} \end{pmatrix}. \tag{16-47}$$
Thus, from the last entry,

$$\sigma_n^2 = \frac{1}{T_n^{n+1,n+1}} > 0, \tag{16-48}$$

and

$$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \frac{1}{T_n^{n+1,n+1}}\begin{pmatrix} T_n^{1,n+1} \\ T_n^{2,n+1} \\ \vdots \\ T_n^{n,n+1} \end{pmatrix}. \tag{16-49}$$
Eq. (16-49) represents the best linear predictor coefficients, and they can be evaluated from the last column of Tₙ⁻¹ in (16-45). Using these, the best one-step ahead predictor in (16-35) takes the form

$$\hat{X}_{n+1} = -\frac{1}{T_n^{n+1,n+1}} \sum_{i=1}^{n} T_n^{i,n+1}\, X_i, \tag{16-50}$$

and from (16-48), the minimum mean square error is given by the reciprocal of the (n+1, n+1) entry of Tₙ⁻¹.
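Continuing the sketch above (reusing n, coeffs, sigma2), the predictor (16-50) can be applied to a simulated AR(1) series whose theoretical autocorrelation is the assumed r_k = 0.8^k; the empirical one-step error power should approach σₙ² from (16-48):

```python
# Simulate x[t] = 0.8 x[t-1] + w[t] with innovation variance 0.36 (so r_0 = 1).
rng = np.random.default_rng(3)
N = 50_000
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.8 * x[t - 1] + np.sqrt(0.36) * rng.normal()

a = coeffs[:-1]                          # (a_1, ..., a_n), oldest sample first
xhat = np.array([-(a @ x[t - n:t]) for t in range(n, N)])   # per (16-35)
print(np.mean((x[n:] - xhat) ** 2))      # ~ sigma2 = 0.36, matching (16-48)
```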
From (16-36), since the one-step linear prediction error is

$$\varepsilon_n = X_{n+1} + a_nX_n + a_{n-1}X_{n-1} + \cdots + a_1X_1, \tag{16-51}$$

we can represent (16-51) formally as a filtering operation:

[Figure: X_{n+1} applied to the filter 1 + aₙz⁻¹ + a_{n−1}z⁻² + ⋯ + a₁z⁻ⁿ produces the error εₙ.]

Thus, let

$$A_n(z) = 1 + a_nz^{-1} + a_{n-1}z^{-2} + \cdots + a_1z^{-n}; \tag{16-52}$$

then from the above figure, we also have the representation

$$X_{n+1} = \frac{1}{A_n(z)}\,\varepsilon_n.$$
The filter

$$H(z) = \frac{1}{A_n(z)} = \frac{1}{1 + a_nz^{-1} + a_{n-1}z^{-2} + \cdots + a_1z^{-n}} \tag{16-53}$$

represents an AR(n) filter, and this shows that linear prediction leads to an autoregressive (AR) model.
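Still continuing the sketch (reusing x, coeffs, n): the taps of Aₙ(z) in (16-52) are (1, aₙ, …, a₁), i.e. coeffs reversed, and filtering the simulated series with Aₙ(z) should produce a nearly white error sequence of power σₙ², consistent with the AR(n) view in (16-53). scipy.signal.lfilter implements the difference equation:

```python
from scipy.signal import lfilter

b = coeffs[::-1]                     # (1, a_n, ..., a_1): taps of A_n(z)
eps = lfilter(b, [1.0], x)           # eps_t = x_t + a_n x_{t-1} + ... + a_1 x_{t-n}
print(np.var(eps[n:]))               # ~ 0.36 = sigma2
print(np.corrcoef(eps[n:-1], eps[n + 1:])[0, 1])   # lag-1 correlation ~ 0
```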
The polynomial Aₙ(z) in (16-52)-(16-53) can be simplified using (16-43)-(16-44). To see this, we rewrite Aₙ(z) as

$$A_n(z) = a_1z^{-n} + a_2z^{-(n-1)} + \cdots + a_nz^{-1} + 1 = [\,z^{-n}, z^{-(n-1)}, \ldots, z^{-1}, 1\,]\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ 1 \end{pmatrix} = [\,z^{-n}, z^{-(n-1)}, \ldots, z^{-1}, 1\,]\; T_n^{-1}\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \sigma_n^2 \end{pmatrix}. \tag{16-54}$$
Next, recall the determinant identity for a block-partitioned matrix:

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A|\,\big|D - CA^{-1}B\big|. \tag{16-56}$$

In particular, if D = 0 (a scalar), we get

$$CA^{-1}B = -\frac{1}{|A|}\begin{vmatrix} A & B \\ C & 0 \end{vmatrix}. \tag{16-57}$$

Identifying, from (16-54),

$$C = [\,z^{-n}, z^{-(n-1)}, \ldots, z^{-1}, 1\,], \qquad A = T_n, \qquad B = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \sigma_n^2 \end{pmatrix},$$
we get

$$A_n(z) = -\frac{1}{|T_n|}\begin{vmatrix} r_0 & r_1 & r_2 & \cdots & r_n & 0 \\ r_1^* & r_0 & r_1 & \cdots & r_{n-1} & 0 \\ \vdots & & & & \vdots & \vdots \\ r_n^* & r_{n-1}^* & \cdots & \cdots & r_0 & \sigma_n^2 \\ z^{-n} & z^{-(n-1)} & \cdots & z^{-1} & 1 & 0 \end{vmatrix} = \frac{\sigma_n^2}{|T_n|}\begin{vmatrix} r_0 & r_1 & r_2 & \cdots & r_n \\ r_1^* & r_0 & r_1 & \cdots & r_{n-1} \\ \vdots & & & & \vdots \\ r_{n-1}^* & r_{n-2}^* & \cdots & r_0 & r_1 \\ z^{-n} & z^{-(n-1)} & \cdots & z^{-1} & 1 \end{vmatrix}, \tag{16-58}$$

where the second form follows by expanding the determinant along its last column.
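A numerical sanity check of the reconstructed formula (16-58), again assuming the toy autocorrelations r_k = 0.8^k (self-contained, illustration only):

```python
import numpy as np
from scipy.linalg import toeplitz

n = 4
r = 0.8 ** np.arange(n + 1)
Tn = toeplitz(r)
last_col = np.linalg.inv(Tn)[:, -1]
sigma2 = 1.0 / last_col[-1]                    # (16-48)
coeffs = sigma2 * last_col                     # (a_1, ..., a_n, 1)

z = 1.7 + 0.3j                                 # arbitrary test point
row_z = z ** -np.arange(n, -1, -1)             # [z**-n, ..., z**-1, 1]
D = np.vstack([Tn[:-1], row_z])                # first n rows of Tn, then the z-row
det_form = sigma2 * np.linalg.det(D) / np.linalg.det(Tn)   # right side of (16-58)
direct = np.polyval(coeffs, 1 / z)             # A_n(z) = a_1 z**-n + ... + a_n z**-1 + 1
print(det_form, direct)                        # agree up to rounding error
```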
The order-update (Levinson) recursion for the prediction error power then yields

$$\sigma_{n+1}^2 = \sigma_n^2\,\big(1 - |s_{n+1}|^2\big) \le \sigma_n^2, \tag{16-64}$$

where sₙ₊₁ denotes the reflection coefficient at stage n + 1.
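For completeness, here is a compact Levinson-Durbin recursion (the standard algorithm, sketched only to illustrate (16-64); note it uses the common DSP convention, where the predictor coefficients are the negatives of the aᵢ in (16-36), so only the error powers are compared with the text). Each order update multiplies the error power by (1 − |sₖ|²), so it can never increase:

```python
import numpy as np

def levinson_durbin(r):
    """r: real autocorrelations [r_0, ..., r_n]; returns error power per order."""
    sigma2 = r[0]
    phi = np.zeros(0)                                 # order-0 predictor
    powers = [sigma2]
    for k in range(1, len(r)):
        s = (r[k] - phi @ r[k - 1:0:-1]) / sigma2     # reflection coefficient s_k
        phi = np.concatenate([phi - s * phi[::-1], [s]])
        sigma2 *= 1.0 - abs(s) ** 2                   # the update in (16-64)
        powers.append(sigma2)
    return np.array(powers)

print(levinson_durbin(0.8 ** np.arange(6)))   # [1, 0.36, 0.36, ...]: non-increasing
```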
Power Spectrum of a Regular Stochastic Process