# Topic 4: Sequences of Random Variables

(1 week)


Sequences (Vector) of Random Variables
Papoulis Chapter 7

•  A vector random variable, X, is a function that assigns a vector of
real numbers to each outcome ξ in S. This is the basis for random processes.
•  The vector can be used to describe a sequence of random variables.
X(ξ) = x

[Figure: X maps each outcome ξ in the sample space S to a point x in Rⁿ]

•  Example: Sample a signal X(t) every T sec.: Xk = X(kT)
X = (X1, X2, …, Xn) is a vector RV (or a random process)
•  An event involving an n-dimensional random variable X = (X1, X2,
…, Xn) has a corresponding region in n-dimensional real space.
•  The expectation of a random vector is the vector of the expectations.
If X = (X1, X2, …, Xn), then
E[X] = (E[X1], E[X2], …, E[Xn])    (4-1)

Conditional Expected Values
(very useful in estimation)
•  We will show that
EX [X | Y] =EY [EX [X | Y]| Y]
(4-2a)
•  Which implies the useful result
EX [X] = EY [EX [X | Y] ]
(4-2b)
•  The above is known as the Law of Iterated Expectations. It is also
known as the Law of Total Expectation, since it implies that

$$
E[X] =
\begin{cases}
\sum_{y} E[X \mid Y = y]\, P(Y = y), & \text{if } Y \text{ is discrete} \\
\int_{y} E[X \mid Y = y]\, f_Y(y)\, dy, & \text{if } Y \text{ is continuous}
\end{cases}
\qquad \text{(4-2c)}
$$

•  If X and Y are independent, then
E_X[X | Y] = E_X[X]    (4-3a)
•  If a is a constant, then
E[a | Y] = a    (4-3b)


Law of Iterated Expectations
•  Proof:

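$$
\begin{aligned}
E\big[E[X \mid Y]\big] &= \int_{y} E[X \mid Y = y]\, f_Y(y)\, dy \\
&= \int_{y} \int_{x} x\, f_{X|Y}(x \mid y)\, dx\, f_Y(y)\, dy \\
&= \int_{x} x \int_{y} f_{X,Y}(x, y)\, dy\, dx \\
&= \int_{x} x\, f_X(x)\, dx \\
&= E[X] \qquad \text{(4-3c)}
\end{aligned}
$$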
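The discrete form of the Law of Total Expectation can be checked exactly on a small example. The sketch below uses a made-up joint PMF (the numbers and the helper names are assumptions for illustration, not part of the slides) and compares E[X] with the probability-weighted sum of E[X | Y = y].

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (2, 1): F(1, 4),
}

def marginal_y(joint):
    """Return the marginal PMF P(Y = y) as a dict."""
    p = {}
    for (_, y), pr in joint.items():
        p[y] = p.get(y, 0) + pr
    return p

def cond_mean_x(joint, y):
    """Return E[X | Y = y]."""
    p_y = marginal_y(joint)[y]
    return sum(x * pr for (x, yy), pr in joint.items() if yy == y) / p_y

def mean_x(joint):
    """Return E[X] computed directly from the joint PMF."""
    return sum(x * pr for (x, _), pr in joint.items())

def iterated_mean_x(joint):
    """Return sum over y of E[X | Y = y] P(Y = y)."""
    return sum(cond_mean_x(joint, y) * p for y, p in marginal_y(joint).items())

print(mean_x(joint), iterated_mean_x(joint))
```

Exact rational arithmetic is used so that the two sides can be compared for equality rather than to within rounding error.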


Example: Expectation of the Sum of a Random
Number of Random Variables
If N is a positive integer-valued R.V. and Xi, i = 1, 2, …, are
identically distributed R.V.s with mean E[X] that are independent of
N, then
•  the sum

$$\sum_{i=1}^{N} X_i$$

is a random variable, and

$$E\Big[\sum_{i=1}^{N} X_i\Big] = E[N]\,E[X]. \qquad \text{(4-4)}$$


Expectation of the Sum of a Random Number of
Random Variables
Proof:

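$$
\begin{aligned}
E_X\Big[\sum_{i=1}^{N} X_i \,\Big|\, N = n\Big]
&= E_X\Big[\sum_{i=1}^{n} X_i \,\Big|\, N = n\Big] \\
&= E_X\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E_X[X_i] = n\, E_X[X] \qquad \text{(4-5a)}
\end{aligned}
$$

so that

$$E_X\Big[\sum_{i=1}^{N} X_i \,\Big|\, N\Big] = N \cdot E[X] \qquad \text{(4-5b)}$$

and finally we get the desired result

$$E\Big[\sum_{i=1}^{N} X_i\Big] = E_N\Big[E_X\Big[\sum_{i=1}^{N} X_i \,\Big|\, N\Big]\Big] = E_N\big[N \cdot E[X]\big] = E[N]\,E[X]. \qquad \text{(4-4)}$$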
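Result (4-4) can be illustrated by simulation. The sketch below assumes, purely for illustration, that N is uniform on {1, 2, 3} and that each Xi is a fair die roll drawn independently of N, so that E[N]E[X] = 2 × 3.5 = 7. The function name and parameters are hypothetical.

```python
import random

def simulate_random_sum(trials=200_000, seed=0):
    """Estimate E[X_1 + ... + X_N] with N and the X_i independent.

    Assumed for illustration: N is uniform on {1, 2, 3} and each X_i is a
    fair six-sided die roll, so E[N] = 2 and E[X] = 3.5.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = rng.randint(1, 3)                              # draw N
        total += sum(rng.randint(1, 6) for _ in range(n))  # sum of N die rolls
    return total / trials

estimate = simulate_random_sum()
print(f"simulated mean = {estimate:.3f}, E[N]E[X] = {2 * 3.5:.3f}")
```

The simulated mean should land close to 7, with the gap shrinking as the number of trials grows.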

Conditional Variance
[Diagram: how are var(X), E[X | Y], and var(X | Y) related?]

The conditional variance of X given Y is the R.V. given by

var(X | Y) = E[(X – E[X | Y])² | Y] = E[X² | Y] – (E[X | Y])².    (4-6a)

The Conditional Variance Formula (or Law of Total Variance)

var( X ) = E[var( X | Y )] + var( E[ X | Y ])

(4-6b)

Conditional Variance
Proof:
E[var(X | Y)] = E[ E[X² | Y] – (E[X | Y])² ]
= E[E[X² | Y]] – E[(E[X | Y])²]
= E[X²] – E[(E[X | Y])²]
= E[X²] – (E[X])² – E[(E[X | Y])²] + (E[X])²
= var(X) – E[(E[X | Y])²] + (E[E[X | Y]])²
= var(X) – var(E[X | Y]).    (4-7)
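The Law of Total Variance (4-6b) can also be verified exactly on a small discrete example. The joint PMF and the function name below are assumptions made for illustration only.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), for illustration only.
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (2, 1): F(1, 4),
}

def total_variance_terms(joint):
    """Return (var(X), E[var(X|Y)], var(E[X|Y])) for a discrete joint PMF."""
    p_y = {}
    for (_, y), pr in joint.items():
        p_y[y] = p_y.get(y, 0) + pr

    mean_x = sum(x * pr for (x, _), pr in joint.items())
    var_x = sum(x * x * pr for (x, _), pr in joint.items()) - mean_x ** 2

    exp_cond_var = 0    # accumulates E[var(X | Y)]
    second_moment = 0   # accumulates E[(E[X | Y])^2]
    for y, py in p_y.items():
        m1 = sum(x * pr for (x, yy), pr in joint.items() if yy == y) / py
        m2 = sum(x * x * pr for (x, yy), pr in joint.items() if yy == y) / py
        exp_cond_var += (m2 - m1 ** 2) * py
        second_moment += m1 ** 2 * py
    var_cond_mean = second_moment - mean_x ** 2
    return var_x, exp_cond_var, var_cond_mean

var_x, exp_cond_var, var_cond_mean = total_variance_terms(joint)
print(var_x, exp_cond_var, var_cond_mean)
```

For this PMF the first value equals the sum of the other two, as (4-6b) requires.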


Generalization to Many RVs
•  Let X1, X2, …, Xn, be n random variables defined on a sample space
•  Let the row vector XT = (X1, X2, …, Xn) be the transpose of a random
(column) vector.
•  Let u = (u1, u2, …, un) be a real vector
•  The notation { X ≤ u } denotes the event
{X1 ≤ u1, X2 ≤ u2, …, Xn ≤ un}, where, as before, the commas denote
intersections, that is,
{ X ≤ u } = {X1 ≤ u1}∩{X2 ≤ u2 }∩…∩{Xn ≤ un}
(4-8a)
•  The joint CDF of X1, X2, …, Xn or the CDF of the random vector X is
defined as
FX(u ) = P{ X ≤ u } = P{X1 ≤ u1, X2 ≤ u2, …, Xn ≤ un}
(4-8b)
–  FX(u ) is a real-valued function of n real variables (or of the
vector u ) with values between 0 and 1.
•  The expectation of a random vector is the vector of the expectations
E[XT] = [ E[X1], E[X2], … E[Xn] ]T = µXT
(4-8c)

The Covariance Matrix — I
•  There are n² pairs of random variables Xi and Xj, giving n²
covariances cov(Xi, Xj)
•  n of these are cov(Xi, Xi) = var(Xi)
•  The covariance matrix R is a symmetric n×n matrix with (i, j)th
entry r_ij = cov(Xi, Xj).
•  The variances of the Xi appear along the diagonal of the matrix
•  Uncorrelated RVs ⇒ R is a diagonal matrix
•  Expectation of a matrix with RVs as entries = matrix of
expectations
Note: sometimes we refer to the covariance matrix as R and
sometimes as Σ


The Covariance Matrix — II
•  X = [X1, X2, …, Xn]^T and µ_X are n×1 column vectors
•  (X – µ_X)^T is a 1×n row vector
•  (X – µ_X)(X – µ_X)^T is an n×n matrix whose (i, j)th entry is (Xi – µi)(Xj – µj)    (4-9a)
•  E[matrix] = matrix of expectations
•  R = E[(X – µ_X)(X – µ_X)^T] is an n×n matrix with (i, j)th entry r_ij = cov(Xi, Xj)
•  R is a symmetric positive semi-definite (also known as symmetric
nonnegative definite) matrix, since for every real n×1 vector a the quadratic form satisfies
var(X^T a) = a^T R a ≥ 0.    (4-9b)

•  R is used to find variance and covariances of linear combinations of the Xi
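As a rough numerical illustration of (4-9b), the sketch below builds a sample covariance matrix from simulated data and checks that the variance of a linear combination equals the quadratic form a^T R a. The data model, the weight vector, and the helper names are all assumptions chosen for the example; the 1/m normalization is used throughout so the identity holds up to rounding.

```python
import random

def covariance_matrix(samples):
    """Sample covariance matrix (1/m normalization) of a list of n-vectors."""
    m, n = len(samples), len(samples[0])
    mu = [sum(s[i] for s in samples) / m for i in range(n)]
    return [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / m
             for j in range(n)] for i in range(n)]

def quadratic_form(a, R):
    """Return a^T R a."""
    n = len(a)
    return sum(a[i] * R[i][j] * a[j] for i in range(n) for j in range(n))

def variance(values):
    """Sample variance with 1/m normalization."""
    m = len(values)
    mean = sum(values) / m
    return sum((v - mean) ** 2 for v in values) / m

rng = random.Random(1)
# Hypothetical correlated data: X2 depends on X1, X3 is independent noise.
samples = []
for _ in range(5000):
    x1 = rng.gauss(0, 1)
    samples.append((x1, 0.5 * x1 + rng.gauss(0, 1), rng.gauss(0, 2)))

R = covariance_matrix(samples)
a = (1.0, -2.0, 0.5)
lin_comb = [sum(ai * xi for ai, xi in zip(a, s)) for s in samples]
print(variance(lin_comb), quadratic_form(a, R))
```

The two printed numbers agree to floating-point precision and are nonnegative, which is the positive semi-definite property in action.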


Multivariate (Sequence) Parameters
•  Let X^T denote the transposed (column) data vector (x1, x2, ..., xn)

Mean: E[X^T] = [µ1, ..., µn]    (4-10a)

Covariance: σ_ij ≡ Cov(Xi, Xj)    (4-10b)

Correlation: Corr(Xi, Xj) ≡ ρ_ij = σ_ij / (σ_i σ_j)    (4-10c)
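•  The covariance matrix has elements σ_ij, is denoted by Σ, and is
defined below:

$$
\Sigma \equiv \mathrm{Cov}(X) = E\big[(X - \mu)(X - \mu)^T\big] =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}
\qquad \text{(4-10d)}
$$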

Multivariate Normal Distribution

$$x \sim N_d(\mu, \Sigma)$$

$$
p(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big] \qquad \text{(4-11)}
$$
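To make (4-11) concrete, here is a minimal sketch that evaluates the density for the special case d = 2, using the closed-form inverse and determinant of a 2×2 matrix. The function name and the test values are assumptions for illustration; a general-d version would need a linear algebra routine for the inverse.

```python
import math

def bivariate_normal_pdf(x, mu, sigma):
    """Evaluate (4-11) for d = 2 with an explicit 2x2 inverse and determinant."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("sigma must be positive definite")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse
    quad = (s22 * dx0 * dx0 - (s12 + s21) * dx0 * dx1 + s11 * dx1 * dx1) / det
    # (2*pi)^(d/2) equals 2*pi when d = 2
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# With a diagonal covariance the density factors into two 1-D Gaussians.
print(bivariate_normal_pdf((1.0, 2.0), (0.0, 1.0), ((4.0, 0.0), (0.0, 9.0))))
```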

Parameter Estimation (“Predicting the value of Y”)
•  Suppose Y is a RV with known PMF or PDF
•  Problem: Predict (or estimate) what value of Y will be
observed on the next trial
•  Questions:
•  What value should we predict?
•  What is a good prediction?
•  We need to specify some criterion that determines what is a
good/reasonable estimate.
•  Note that for continuous random variables, it doesn’t make
sense to predict the exact value of Y, since that occurs with
zero probability.
•  A common estimate is the mean-square estimate.

The Mean Square Error (MSE) Estimate
•  We will let Ŷ be the mean square estimate of the random variable, Y.
•  Let E[Y] = µ
•  The mean-squared error (MSE) is defined as:
e = E[(Y – Ŷ)²]
•  We proceed by “completing the square”
E[(Y – Ŷ)²] = E[(Y – µ + µ – Ŷ)²]
= E[(Y – µ)² + 2(Y – µ)(µ – Ŷ) + (µ – Ŷ)²]
= var(Y) + 2(µ – Ŷ)E[Y – µ] + (µ – Ŷ)²
= var(Y) + (µ – Ŷ)²    (since E[Y – µ] = 0)
> var(Y) if Ŷ ≠ µ    (4-12)
•  Clearly choosing Ŷ = µ minimizes the MSE of the estimate
•  Ŷ = µ is called the minimum- (or least-) mean-square error (MMSE
or LMSE) estimate
•  The minimum mean-square error is var(Y)
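The decomposition in (4-12) can be checked exactly for a discrete RV. The PMF and function name below are made up for illustration; the sketch computes the MSE of several constant guesses and compares each with var(Y) + (µ – Ŷ)².

```python
from fractions import Fraction as F

# Hypothetical PMF of Y, for illustration only.
pmf_y = {0: F(1, 2), 1: F(1, 4), 4: F(1, 4)}

def mse(pmf, guess):
    """E[(Y - guess)^2] for a constant guess."""
    return sum((y - guess) ** 2 * p for y, p in pmf.items())

mu = sum(y * p for y, p in pmf_y.items())   # E[Y]
var_y = mse(pmf_y, mu)                      # MSE at the mean is var(Y)

for guess in (0, 1, mu, 2, 4):
    print(guess, mse(pmf_y, guess), var_y + (mu - guess) ** 2)
```

Each row prints the same value twice, and the smallest value occurs at the guess equal to µ.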

The MSE of a RV Based Upon Observing another RV
•  Let X and Y denote random variables with known joint distribution
•  Suppose that the value of X becomes known to us, but not the value
of Y. How can we find the MMSE estimate, Ŷ ?
•  Can the MMSE estimate, Ŷ, which is a function of X, do better than
ignoring X and estimating the value of Y as Ŷ = µY = E[Y] ?
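•  Denoting the MMSE estimate Ŷ by c(X), the MSE is given by

$$
\begin{aligned}
e = E\{[Y - c(X)]^2\} &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [y - c(x)]^2\, f_{X,Y}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} f_X(x) \int_{-\infty}^{\infty} [y - c(x)]^2\, f_{Y|X}(y \mid x)\, dy\, dx \qquad \text{(4-13)}
\end{aligned}
$$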

•  Note that the integrands above are nonnegative, so e is minimized
if the inner integral is minimized for every value of x.
•  Note that for a fixed value of x, c(x) is a variable [not a function]

The MSE of a RV Based Upon Observing another RV-2
•  Since for a fixed value of x, c(x) is a variable [not a function], we can
minimize the MSE by setting the derivative of the inner integral, with
respect to c, to zero:

$$
\frac{d}{dc}\int_{-\infty}^{\infty} [y - c(x)]^2\, f_{Y|X}(y \mid x)\, dy = -\int_{-\infty}^{\infty} 2(y - c)\, f_{Y|X}(y \mid x)\, dy = 0 \qquad \text{(4-14a)}
$$

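•  Solving for “c”, after noting that

$$\int_{-\infty}^{\infty} c(x)\, f_{Y|X}(y \mid x)\, dy = c(x)\int_{-\infty}^{\infty} f_{Y|X}(y \mid x)\, dy = c(x),$$

where the integral is one, gives

$$\hat{Y} = c(X) = \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid X)\, dy = E[Y \mid X] \qquad \text{(4-14b)}$$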

•  Thus the MMSE estimate, Ŷ, is the conditional mean of Y given X.
•  The MMSE estimate is, in general, a nonlinear function of X; the
estimate and its conditional MSE are both RVs that are functions of X.


MMSE Example
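•  Let the random point (X, Y) be uniformly distributed on a semicircle
(the upper half of the unit disk)

[Figure: semicircle of radius 1 over –1 ≤ x ≤ 1; at X = α the region has height (1–α²)^(1/2)]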

•  The joint PDF has value 2/π on the semicircle
•  The conditional PDF of Y given that X = α is a uniform density on [0,
(1–α²)^(1/2)].
•  So, Ŷ = E[Y|X = α] = (1/2)•(1–α²)^(1/2) and this estimate achieves the
least possible MSE of var(Y|X = α) = (1–α²)/12
•  Intuitively reasonable since
•  If |α| is nearly 1, the MSE is small (since the range of Y is small)
•  If |α| is nearly 0, the MSE is large (since the range of Y is large)
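The semicircle example can be explored with a short Monte Carlo sketch. It draws points uniformly on the upper half of the unit disk by rejection sampling (a choice made here for simplicity, not taken from the slides) and compares the MSE of the conditional-mean estimate (1/2)(1–X²)^(1/2) with the MSE of the constant estimate E[Y]. The function name and defaults are hypothetical.

```python
import math
import random

def semicircle_mse(trials=200_000, seed=0):
    """Compare the MSE of E[Y|X] with that of the constant estimate E[Y].

    (X, Y) is drawn uniformly on the upper half of the unit disk by
    rejection sampling from the rectangle [-1, 1] x [0, 1].
    """
    rng = random.Random(seed)
    points = []
    while len(points) < trials:
        x, y = rng.uniform(-1, 1), rng.uniform(0, 1)
        if x * x + y * y <= 1:
            points.append((x, y))

    mean_y = sum(y for _, y in points) / trials
    mse_const = sum((y - mean_y) ** 2 for _, y in points) / trials
    mse_cond = sum((y - 0.5 * math.sqrt(1 - x * x)) ** 2
                   for x, y in points) / trials
    return mse_const, mse_cond

mse_const, mse_cond = semicircle_mse()
print(f"MSE using E[Y]: {mse_const:.4f}   MSE using E[Y|X]: {mse_cond:.4f}")
```

Averaging (1–α²)/12 over the distribution of X gives 1/16 for the conditional-mean estimate, while var(Y) = 1/4 – 16/(9π²) ≈ 0.0699, so observing X should give a modest but visible improvement in the simulated values.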

The Regression Curve of Y on X
[Figure: regression curve of Y on X plotted over –1 ≤ α ≤ 1]

•  Ŷ = E[Y|X=α] as a function of α is a curve called the regression
curve of Y on X
•  Graph of (1/2)•(1–α²)^(1/2) is a half-ellipse
•  Given X value, the MMSE estimator of Y can be read off
from the regression curve

Linear MMSE Estimation — I
•  Suppose that we wish to estimate Y as a linear function of the
observation X
•  The linear MMSE estimate of Y is aX + b where a and b are chosen
to minimize the mean-square error E[(Y – aX – b)²]
•  Let Z = Y – aX – b be the error, then
E[(Y – aX – b)²] = E[Z²] = var(Z) + (E[Z])²
= var(Y) + a²var(X) – 2a•cov(X,Y) + (E[Z])²    (4-15a)
•  The above is quadratic in a and b.
•  By differentiation, it can be shown that the minimum occurs when

a = ρσ_Y / σ_X,    b = µ_Y – aµ_X    (4-15b)

Linear MMSE Estimation — II
•  As before, let Z = Y – aX – b be the error, then the MSE is
e = E[(Y – aX – b)²] = E[Z²]    (4-16a)
•  Setting the derivative of the MSE with respect to a to zero gives

∂e/∂a = E[2Z(–X)] = 0    (4-16b)

•  This says that the estimation error Z is orthogonal to the received
data X, that is, E[ZX] = 0. Since the optimal b also gives E[Z] = 0, Z is
uncorrelated with X.
•  This is referred to as the orthogonality principle of linear
estimation. That is, the error is uncorrelated with the observation
(data) X and, intuitively, the estimate has done all it can to extract
correlated information from the data.
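The linear MMSE coefficients and the orthogonality principle can be checked exactly on a small discrete example. The joint PMF and helper names below are assumptions for illustration. The slope is computed as cov(X, Y)/var(X), which is algebraically the same as ρσ_Y/σ_X.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), for illustration only.
joint = {(0, 0): F(1, 4), (1, 1): F(1, 4), (1, 3): F(1, 4), (2, 3): F(1, 4)}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# a = rho * sigma_Y / sigma_X is the same as cov(X, Y) / var(X)
a = cov_xy / var_x
b = mu_y - a * mu_x

def mse_linear(a_, b_):
    """E[(Y - a_ X - b_)^2] for a candidate linear estimate."""
    return expect(lambda x, y: (y - a_ * x - b_) ** 2)

mean_error = expect(lambda x, y: y - a * x - b)            # E[Z]
error_times_x = expect(lambda x, y: (y - a * x - b) * x)   # E[ZX]

print(a, b, mean_error, error_times_x, mse_linear(a, b))
```

Both E[Z] and E[ZX] come out as zero, and moving a or b away from the computed values increases the MSE.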

Gaussian MMSE = Linear MMSE
•  In general, the linear MMSE estimate has an MSE at least as large as
that of the (usually nonlinear) MMSE estimate E[Y|X]
•  If X and Y are jointly Gaussian RVs, it can be shown that the
conditional PDF of Y given X = α is a Gaussian PDF with mean
µY + (ρσY/σX)(α – µX)
(4-17a)
and variance σ_Y²(1 – ρ²)    (4-17b)
•  Hence, E[Y|X = α] = µY + (ρσY/σX)(α – µX)
(4-17c)
is the same as the linear MMSE.
•  For jointly Gaussian RVs, MMSE estimate = linear MMSE estimate
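The conditional mean and variance in (4-17a) and (4-17b) can be illustrated by simulation. The sketch below generates jointly Gaussian pairs from two independent unit Gaussians (a standard construction, used here as an assumption), keeps the samples whose X falls in a narrow window around α, and reports the mean and variance of the corresponding Y values. All parameter values and names are hypothetical.

```python
import math
import random

def gaussian_pair(rng, mu_x, mu_y, sd_x, sd_y, rho):
    """Draw one jointly Gaussian (X, Y) pair with correlation rho."""
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    x = mu_x + sd_x * u
    y = mu_y + sd_y * (rho * u + math.sqrt(1 - rho * rho) * v)
    return x, y

def conditional_stats(alpha, width=0.1, trials=400_000, seed=0,
                      mu_x=1.0, mu_y=2.0, sd_x=2.0, sd_y=3.0, rho=0.6):
    """Empirical mean and variance of Y over samples with |X - alpha| < width."""
    rng = random.Random(seed)
    ys = []
    for _ in range(trials):
        x, y = gaussian_pair(rng, mu_x, mu_y, sd_x, sd_y, rho)
        if abs(x - alpha) < width:
            ys.append(y)
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    return mean, var

mean, var = conditional_stats(alpha=2.0)
# Formulas (4-17a) and (4-17b) with the default parameters above
print(mean, 2.0 + (0.6 * 3.0 / 2.0) * (2.0 - 1.0))
print(var, 3.0 ** 2 * (1 - 0.6 ** 2))
```

The empirical values should be close to 2.9 and 5.76, up to sampling noise and the small bias from the finite window width.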

Limit Theorems
•  Limit theorems specify the probabilistic behavior of n random
variables as n → ∞
•  Possible restrictions on RVs:
–  Independent random variables
–  Uncorrelated random variables
–  Have identical marginal CDFs/PDFs/PMFs
–  Have identical means and/or variances


The Average of n RVs
•  n random variables X1, X2, …, Xn have finite expectations µ1, µ2, …,
µn
•  Let the average be
Z = (X1 + X2 + …+ Xn)/n
(4-18a)
•  What is E[Z]?
•  Expectation is a linear operator so that
E[Z] = (E[X1] + E[X2] + …+ E[Xn])/n    (4-18b)
•  Expected value of average of n RVs = numerical average of their
expectations.
•  An important practical case is when the RVs are independent and
identically distributed (i.i.d.) and the average is called the sample
mean.

Variance of the Sample Mean
•  Sample mean Z = (X1 + X2 + …+ Xn)/n
•  E[Z] = E[X] = µ
•  It is easy to show that
var(Z) = var(X1 + X2 + …+ Xn)/n²    (4-18c)
= var(X)/n    (4-18d)
•  This is because the RVs are independent, so the variance of the sum
is the sum of the variances.
•  The variance decreases as n increases.
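A quick simulation shows the 1/n behavior in (4-18d). The sketch assumes i.i.d. Uniform(0, 1) samples, for which var(X) = 1/12; the function name and trial count are arbitrary choices for illustration.

```python
import random

def sample_mean_variance(n, trials=20_000, seed=0):
    """Empirical variance of the mean of n i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    grand = sum(means) / trials
    return sum((m - grand) ** 2 for m in means) / trials

for n in (1, 4, 16):
    # Columns: n, simulated var(Z), theoretical var(X)/n
    print(n, sample_mean_variance(n), (1 / 12) / n)
```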


Weak Law of Large Numbers (WLLN)
•  Weak Law of Large Numbers:
If X1, X2, … Xn, … are i.i.d. RVs with finite mean µ, then for
every ε > 0,
P{|(X1+X2+…+Xn)/n – µ| ≥ ε} → 0 as n → ∞
(4-19a)
Equivalently
P{|(X1+X2+…+Xn)/n – µ| ≤ ε} → 1 as n → ∞
(4-19b)
•  Note that it is not necessary for the RVs to have finite variance
•  But the proof is easier if variance is finite
•  Note: WLLN says lim n→∞ P{something} = 1
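The probability in (4-19a) can be estimated by simulation for a few values of n. The sketch below assumes fair die rolls (mean 3.5, variance 35/12) and ε = 0.25, and also prints the Chebyshev bound var(X)/(nε²), which is the usual route to the finite-variance proof. The names and parameter values are assumptions for illustration.

```python
import random

DIE_MEAN = 3.5
DIE_VAR = 35 / 12  # variance of one fair die roll

def deviation_probability(n, eps=0.25, trials=2_000, seed=0):
    """Estimate P{|Z_n - mu| >= eps} for the mean Z_n of n fair die rolls."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(z - DIE_MEAN) >= eps:
            hits += 1
    return hits / trials

def chebyshev_bound(n, eps=0.25):
    """Chebyshev upper bound var(X) / (n * eps^2), capped at 1."""
    return min(1.0, DIE_VAR / (n * eps * eps))

for n in (10, 100, 400):
    print(n, deviation_probability(n), chebyshev_bound(n))
```

The estimated probabilities fall toward zero as n grows and stay below the Chebyshev bound, which is loose but sufficient for the limit.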


Strong Law of Large Numbers (SLLN)
•  Strong Law of Large Numbers:
If X1, X2, … Xn, … are i.i.d. RVs with finite mean µ, then
P{ lim n→∞(X1 + X2 + … + Xn)/n = µ} = 1

(4-20a)

•  Suppose the experiment is repeated infinitely often and the RV X takes on
values x1, x2, …, xn, … on these trials
•  What can be said about (x1+x2+…+xn)/n?
•  There are three possibilities
•  Sequence converges to µ
•  or it converges to some other number
•  or it does not converge at all
•  The Strong Law of Large Numbers says that
P{(x1+x2+…+xn)/n converges to µ} = 1    (4-21b)
Note: SLLN says P{lim n→∞ something} = 1


Strong Law of Large Numbers — II
•  If the Strong Law of Large Numbers holds, then so does the Weak
Law
•  In fact, both require only that the RVs be i.i.d. with finite mean µ
•  But, the Weak Law of Large Numbers might be applicable in
cases when the Strong Law does not hold
•  Example: Weak Law of Large Numbers still applies if the RVs are
uncorrelated but not independent


Strong Law and Relative frequencies
•  The Strong Law of Large Numbers justifies the estimation of
probabilities in terms of relative frequencies
•  If the Xi are i.i.d. Bernoulli RVs with parameter p (and hence,
finite mean p), then the sample mean Zn converges to p with
probability 1 as n → ∞
•  The observed relative frequency of an event of probability p
converges to p with probability 1.
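A single long run of Bernoulli trials gives a feel for this convergence, although a finite simulation can only illustrate the Strong Law, not prove it. The sketch below tracks the relative frequency of an event with an assumed probability p = 0.3 at a few checkpoints along one sample path; the function name and checkpoint values are hypothetical.

```python
import random

def relative_frequencies(p, checkpoints, seed=0):
    """Relative frequency of a probability-p event at each checkpoint n,
    all measured along one sequence of independent trials."""
    rng = random.Random(seed)
    successes, done, results = 0, 0, {}
    for n in sorted(checkpoints):
        for _ in range(n - done):
            successes += rng.random() < p   # True counts as 1
        done = n
        results[n] = successes / n
    return results

freq = relative_frequencies(0.3, [100, 10_000, 200_000])
for n, f in freq.items():
    print(n, f)
```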


The Central Limit Theorem
(previously discussed)
•  Let X1, X2, …, Xn be i.i.d. RVs with mean µ and variance σ². Then
Yn = (X1 + X2 + …+ Xn – nµ)/(σ√n) is a RV with mean 0 and variance 1.
•  The Central Limit Theorem asserts that, for large values of n,
the CDF of Yn is well-approximated by the unit Gaussian CDF
•  Formally, the Central Limit Theorem states that the CDF of Yn
converges to the unit Gaussian CDF as n → ∞.
•  In practical use of the Central Limit Theorem, we hardly ever
use the RV
Yn = (X1 + X2 + …+ Xn – nµ)/(σ√n)
•  Instead, X1 + X2 + …+ Xn is treated as if its CDF is
approximately that of a N(nµ, nσ²) RV
•  Thus, we compute
P{X1 + X2 + …+ Xn ≤ u} ≈ Φ((u – nµ)/(σ√n))
which is effectively the same computation
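The practical approximation above can be compared with an exact answer when the Xi are Bernoulli, since their sum is binomial. The sketch below assumes n = 100 and p = 0.5 (values chosen only for illustration) and evaluates both the exact CDF and Φ((u – nµ)/(σ√n)). No continuity correction is applied, so some visible gap is expected.

```python
import math

def phi(z):
    """Unit Gaussian CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binomial_cdf(u, n, p):
    """Exact P{X1 + ... + Xn <= u} for i.i.d. Bernoulli(p) RVs."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(0, math.floor(u) + 1))

def clt_cdf(u, n, mu, sigma):
    """CLT approximation Phi((u - n*mu) / (sigma*sqrt(n)))."""
    return phi((u - n * mu) / (sigma * math.sqrt(n)))

n, p = 100, 0.5
mu, sigma = p, math.sqrt(p * (1 - p))   # mean and std of one Bernoulli(p)
for u in (45, 50, 55):
    # Columns: u, exact binomial CDF, Gaussian approximation
    print(u, binomial_cdf(u, n, p), clt_cdf(u, n, mu, sigma))
```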

Final Remarks on Probability
We see that the theory of probability is at bottom only
common sense reduced to calculation; it makes us appreciate
with exactitude what reasonable minds feel by a sort of
instinct, often without being able to account for it. … It is
remarkable that this science, which originated in the
consideration of games of chance, should become the most
important object of human knowledge. … The most
important questions of life are, for the most part, really only
problems of probability.
Pierre-Simon, Marquis de Laplace, Analytical Theory of
Probability
