
# Probability Review

Definition: Let $S$ be a sample space and $\mathcal{F}$ a sigma-field (sigma-algebra) defined over it. Let $P : \mathcal{F} \to \mathbb{R}$ be a mapping from the sigma-algebra into the real line such that for each $A \in \mathcal{F}$ there exists a unique $P(A) \in \mathbb{R}$. Clearly $P$ is a set function, and it is called a probability if it satisfies the following axioms:

1. $P(A) \ge 0$ for all $A \in \mathcal{F}$
2. $P(S) = 1$
3. Countable additivity: if $A_1, A_2, \dots$ are pairwise disjoint events, i.e. $A_i \cap A_j = \emptyset$ for $i \ne j$, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

The triplet $(S, \mathcal{F}, P)$ is called the probability space.

Conditional Probability
The probability of an event $B$ under the condition that another event $A$ has occurred is called the conditional probability of $B$ given $A$ and is defined by

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) \ne 0$$

so that $P(A \cap B) = P(A)\,P(B \mid A)$.
The events $A$ and $B$ are independent if $P(B \mid A) = P(B)$ and $P(A \mid B) = P(A)$, so that $P(A \cap B) = P(A)\,P(B)$.

Bayes Theorem
Suppose $A_1, A_2, \dots, A_n$ form a partition of $S$, that is, $S = A_1 \cup A_2 \cup \dots \cup A_n$ and $A_i \cap A_j = \emptyset$ for $i \ne j$.

Suppose the event $B$ occurs if one of the events $A_1, A_2, \dots, A_n$ occurs, and that we know the probabilities $P(A_i)$ and $P(B \mid A_i)$, $i = 1, 2, \dots, n$. We ask the following question: given that $B$ has occurred, what is the probability that a particular event $A_k$ has occurred? In other words, what is $P(A_k \mid B)$?

We have $P(B) = \sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)$ (using the theorem of total probability). Therefore

$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{P(B)} = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B \mid A_i)}$$
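As a small numerical illustration of Bayes theorem, the sketch below computes the posterior probabilities of every event in a partition. The function name `posterior` and the prior and likelihood values are hypothetical, chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_k | B) for every k, given P(A_i) and P(B | A_i)."""
    # Theorem of total probability: P(B) = sum_i P(A_i) P(B | A_i)
    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes theorem applied to each event of the partition
    return [p * l / p_b for p, l in zip(priors, likelihoods)]


# Hypothetical partition A_1, A_2, A_3 with assumed priors and likelihoods
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]
post = posterior(priors, likelihoods)
print(post)
```

The posteriors always sum to one because the denominator is exactly the sum of the numerators.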

Random Variables
Consider the probability space $(S, \mathcal{F}, P)$. A random variable is a function $X : S \to \mathbb{R}$ that assigns a real number $X(s)$ to each sample point $s \in S$.

Figure 2

Definition: The distribution function or cumulative distribution function (CDF) $F_X : \mathbb{R} \to [0, 1]$ is a function defined by

$$F_X(x) = P(\{s \mid X(s) \le x,\ s \in S\}) = P(\{X \le x\})$$

for all $x \in \mathbb{R}$. The notation $F_X(x)$ denotes the CDF of the RV $X$ at a point $x$.
Properties
$F_X(x)$ is a non-decreasing and right-continuous function of $x$.

$F_X(-\infty) = 0$ and $F_X(\infty) = 1$

$P(\{x_1 < X \le x\}) = F_X(x) - F_X(x_1)$

## Discrete random variables and probability mass functions

Definition: A random variable $X$ defined on the probability space $(S, \mathcal{F}, P)$ is said to be discrete if the number of elements in the range $R_X$ is finite or countably infinite.
Examples 1 and 2 are discrete random variables. If the sample space $S$ is discrete, the random variable $X$ defined on it is always discrete.
A discrete random variable $X$ with $R_X = \{x_1, x_2, x_3, \dots\}$ is completely specified by the probability mass function (PMF)

$$p_X(x_i) = P(\{s \mid X(s) = x_i\}) = P(\{X = x_i\}), \quad x_i \in R_X$$

## Continuous random variables and probability density functions

Definition: A random variable $X$ defined on the probability space $(S, \mathcal{F}, P)$ is said to be continuous if $F_X(x)$ is absolutely continuous. Thus $F_X(x)$ can be expressed as

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$

where $f_X : \mathbb{R} \to [0, \infty)$ is a function called the probability density function (PDF).

If $f_X(x)$ is continuous at the point $x$, then

$$f_X(x) = \frac{d}{dx} F_X(x)$$

Clearly, $f_X(x) \ge 0$ for all $x$ and

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
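The relation between the PDF and the CDF can be checked numerically. The sketch below assumes an exponential density with rate `lam` (an illustrative choice, not taken from the notes) and integrates it with the trapezoidal rule; the helper names `pdf` and `cdf_numeric` are hypothetical.

```python
import math


def pdf(x, lam=2.0):
    """Exponential density f_X(x) = lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def cdf_numeric(x, lam=2.0, steps=10000):
    """Approximate F_X(x), the integral of f_X(u) du, by the trapezoidal rule.

    The density is zero for u < 0, so the integration starts at 0.
    """
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * (pdf(0.0, lam) + pdf(x, lam))
    total += sum(pdf(i * h, lam) for i in range(1, steps))
    return total * h


# Compare with the closed form F_X(x) = 1 - exp(-lam * x)
print(cdf_numeric(1.0), 1.0 - math.exp(-2.0))
# Far in the tail the CDF approaches 1, the total area under the PDF
print(cdf_numeric(20.0))
```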

## Jointly distributed random variables

We may define two or more random variables on the same sample space. Let $X$ and $Y$ be two real random variables defined on the same probability space $(S, \mathcal{F}, P)$. $X$ and $Y$ together define a two-dimensional or joint random variable $(X, Y)$. Thus the joint random variable $(X, Y)$ is the mapping from the sample space to the plane $\mathbb{R}^2$ as shown in Figure 3 below.

Figure 3 The joint random variable as the mapping $s \mapsto (X(s), Y(s))$ from the sample space to the plane

The probability $P\{X \le x,\ Y \le y\}$, $(x, y) \in \mathbb{R}^2$, is called the joint distribution function of the random variables $X$ and $Y$ and is denoted by $F_{X,Y}(x, y)$.

Clearly,

$$F_{X,Y}(-\infty, -\infty) = F_{X,Y}(-\infty, y) = F_{X,Y}(x, -\infty) = 0$$

and $F_{X,Y}(\infty, \infty) = 1$.

$$P\{x_1 < X \le x_2,\ y_1 < Y \le y_2\} = F_{X,Y}(x_2, y_2) - F_{X,Y}(x_1, y_2) - F_{X,Y}(x_2, y_1) + F_{X,Y}(x_1, y_1)$$

$$F_{X,Y}(x, \infty) = F_X(x), \quad F_{X,Y}(\infty, y) = F_Y(y)$$

## Joint Probability Density Function

If $X$ and $Y$ are two continuous random variables and their joint distribution function is continuous in both $x$ and $y$, then we can define the joint probability density function (joint PDF) $f_{X,Y}(x, y)$ by

$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du$$

If $f_{X,Y}(x, y)$ is continuous at $(x, y)$, then

$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y)$$

Clearly,

$$f_{X,Y}(x, y) \ge 0 \quad \text{for all } (x, y)$$

and

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$$

The conditional PDF $f_{Y|X}(y \mid x)$ is defined by

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$

when $f_X(x) \ne 0$. Thus,

$$f_{X,Y}(x, y) = f_X(x)\,f_{Y|X}(y \mid x) = f_Y(y)\,f_{X|Y}(x \mid y)$$

Two random variables are statistically independent if for all $(x, y)$

$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$$

## Bayes rule for continuous random variables

We can derive the Bayes rule for two continuous joint random variables as follows.
Recall that

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

Therefore

$$f_{X|Y}(x \mid y) = \frac{f_X(x)\,f_{Y|X}(y \mid x)}{f_Y(y)} = \frac{f_{X,Y}(x, y)}{\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx} = \frac{f_{Y|X}(y \mid x)\,f_X(x)}{\int_{-\infty}^{\infty} f_X(u)\,f_{Y|X}(y \mid u)\,du}$$

Expected values
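The expected value of a function $g(X, Y)$ of continuous random variables $X$ and $Y$ is given by

$$E\,g(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\,f_{X,Y}(x, y)\,dx\,dy$$

provided it exists.

Particularly, we have

$$EX = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\,f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \right) dx = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$

For two continuous random variables $X$ and $Y$, the joint moment of order $m + n$ is defined as

$$E(X^m Y^n) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^m y^n f_{X,Y}(x, y)\,dx\,dy$$

and the joint central moment of order $m + n$ is defined as

$$E(X - \mu_X)^m (Y - \mu_Y)^n = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)^m (y - \mu_Y)^n f_{X,Y}(x, y)\,dx\,dy$$

where $\mu_X = EX$ and $\mu_Y = EY$.

The variance $\sigma_X^2$ is given by $E(X - \mu_X)^2 = EX^2 - \mu_X^2$. The covariance $\mathrm{Cov}(X, Y)$ between two RVs $X$ and $Y$ is given by
$\mathrm{Cov}(X, Y) = E(X - \mu_X)(Y - \mu_Y) = EXY - \mu_X \mu_Y$.
Two RVs $X$ and $Y$ are called uncorrelated if $\mathrm{Cov}(X, Y) = 0$, or in other words, $EXY = \mu_X \mu_Y$.
If $X$ and $Y$ are independent, they are always uncorrelated. The converse is generally not true.
The ratio

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

is called the correlation coefficient.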
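The statement that uncorrelated random variables need not be independent can be illustrated with a small example. The sketch below uses a discrete pair rather than a continuous one, purely to keep the arithmetic exact: $X$ is assumed uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The names `pmf` and `expect` are hypothetical.

```python
from fractions import Fraction

# Assumed joint PMF: X uniform on {-1, 0, 1} and Y = X**2
pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expect(g):
    """Expected value of g(X, Y) under the joint PMF."""
    return sum(prob * g(x, y) for (x, y), prob in pmf.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - mean_x * mean_y

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0)
p_joint = pmf[(0, 0)]
p_x0 = sum(prob for (x, y), prob in pmf.items() if x == 0)
p_y0 = sum(prob for (x, y), prob in pmf.items() if y == 0)

print("Cov(X, Y) =", cov)
print("P(X=0, Y=0) =", p_joint, "but P(X=0) P(Y=0) =", p_x0 * p_y0)
```

The covariance is zero, yet the joint probability does not factor, so $X$ and $Y$ are uncorrelated but dependent.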
Interpretation of the correlation coefficient
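Suppose we have to approximate the random variable $Y$ in terms of the random variable $X$ by the linear equation

$$\hat{Y} = aX + b$$

where $\hat{Y}$ is the approximation of $Y$. Such an approximation is called linear regression.
Approximation error: $Y - \hat{Y}$
Mean-square approximation error:

$$E(Y - \hat{Y})^2 = E(Y - aX - b)^2$$

Minimizing $E(Y - \hat{Y})^2$ with respect to $a$ and $b$ gives the optimal values of $a$ and $b$. Corresponding to the optimal solutions for $a$ and $b$, we have

$$\frac{\partial}{\partial a} E(Y - aX - b)^2 = 0$$

$$\frac{\partial}{\partial b} E(Y - aX - b)^2 = 0$$

Solving for $a$ and $b$, we get

$$a = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2} = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X}$$

and

$$b = \mu_Y - a \mu_X$$

so that

$$\hat{Y} - \mu_Y = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2} (X - \mu_X) = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} (X - \mu_X)$$

Figure 2 The regression line of $\hat{Y} - \mu_Y$ against $X - \mu_X$, with slope $\rho_{X,Y}\,\sigma_Y / \sigma_X$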
Conditional expectation
Suppose $X$ and $Y$ are two continuous RVs. The conditional expectation of $Y$ given $X = x$ is defined by

$$\mu_{Y \mid X = x} = E(Y \mid X = x) = \int_{-\infty}^{\infty} y\,f_{Y|X}(y \mid x)\,dy$$

## Jointly Gaussian Random variables

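Two random variables $X$ and $Y$ are called jointly Gaussian if their joint density function is

$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho_{X,Y}^2}} \exp\left\{ -\frac{1}{2(1 - \rho_{X,Y}^2)} \left[ \frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho_{X,Y}(x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2} \right] \right\}$$

where $\mu_1, \sigma_1^2$ are the mean and variance of $X$ and $\mu_2, \sigma_2^2$ are the mean and variance of $Y$.

If $\rho_{X,Y} = 0$, we have

$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left\{ -\frac{1}{2} \left[ \frac{(x - \mu_1)^2}{\sigma_1^2} + \frac{(y - \mu_2)^2}{\sigma_2^2} \right] \right\}$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y - \mu_2)^2}{2\sigma_2^2}} = f_X(x)\,f_Y(y)$$

For jointly Gaussian RVs, uncorrelatedness implies independence.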

Random Vectors
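We can extend the definition of joint RVs to $n$ random variables $X_1, X_2, \dots, X_n$ defined on the same probability space $(S, \mathcal{F}, P)$. We can denote these $n$ RVs by the random vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} = [X_1\ X_2\ \dots\ X_n]'$$

A particular value of the random vector $\mathbf{X}$ is denoted by $\mathbf{x} = [x_1\ x_2\ \dots\ x_n]'$.
The CDF of the random vector $\mathbf{X}$ is defined as the joint CDF of $X_1, X_2, \dots, X_n$. Thus

$$F_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n) = F_{\mathbf{X}}(\mathbf{x}) = P(\{X_1 \le x_1, X_2 \le x_2, \dots, X_n \le x_n\})$$

The corresponding joint PDF $f_{X_1, X_2, \dots, X_n}$ is given by

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_n} \int_{-\infty}^{x_{n-1}} \cdots \int_{-\infty}^{x_1} f_{X_1, X_2, \dots, X_n}(u_1, u_2, \dots, u_n)\,du_1\,du_2 \cdots du_n$$

If $f_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n)$ is continuous at $\mathbf{x} = [x_1\ x_2\ \dots\ x_n]'$, then

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{\partial^n}{\partial x_1\,\partial x_2 \cdots \partial x_n} F_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n)$$

The mean vector of $\mathbf{X}$, denoted by $\boldsymbol{\mu}_{\mathbf{X}}$, is defined as

$$\boldsymbol{\mu}_{\mathbf{X}} = E(\mathbf{X}) = [E(X_1)\ E(X_2)\ \dots\ E(X_n)]' = [\mu_{X_1}\ \mu_{X_2}\ \dots\ \mu_{X_n}]'$$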

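Similarly, for each $(i, j)$, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, n$, $i \ne j$, we can define the joint moment $E(X_i X_j)$. All the joint moments and the mean-square values $EX_i^2$, $i = 1, 2, \dots, n$, can be represented in a correlation matrix $\mathbf{R}_{\mathbf{X},\mathbf{X}}$ given by

$$\mathbf{R}_{\mathbf{X},\mathbf{X}} = E\,\mathbf{X}\mathbf{X}' = \begin{bmatrix} EX_1^2 & EX_1 X_2 & \cdots & EX_1 X_n \\ EX_2 X_1 & EX_2^2 & \cdots & EX_2 X_n \\ \vdots & \vdots & \ddots & \vdots \\ EX_n X_1 & EX_n X_2 & \cdots & EX_n^2 \end{bmatrix}$$

Similarly, all the possible covariances and the variances can be represented in terms of a matrix called the covariance matrix $\mathbf{C}_{\mathbf{X},\mathbf{X}}$ defined by

$$\mathbf{C}_{\mathbf{X},\mathbf{X}} = E(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}})(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}})' = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1, X_2) & \cdots & \mathrm{cov}(X_1, X_n) \\ \mathrm{cov}(X_2, X_1) & \mathrm{var}(X_2) & \cdots & \mathrm{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_n, X_1) & \mathrm{cov}(X_n, X_2) & \cdots & \mathrm{var}(X_n) \end{bmatrix}$$

It can be shown that

$$\mathbf{C}_{\mathbf{X},\mathbf{X}} = \mathbf{R}_{\mathbf{X},\mathbf{X}} - \boldsymbol{\mu}_{\mathbf{X}} \boldsymbol{\mu}_{\mathbf{X}}'$$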
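The identity relating the covariance matrix to the correlation matrix and the mean vector (covariance matrix = correlation matrix minus the outer product of the mean vector with itself) can be verified on a small example. The sketch below assumes a two-dimensional random vector that takes three hypothetical values with equal probability; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

# Hypothetical equiprobable values of a 2-dimensional random vector X
points = [(1, 2), (3, 0), (5, 4)]
prob = Fraction(1, len(points))
n = 2

# Mean vector: mu_i = E X_i
mu = [sum(prob * p[i] for p in points) for i in range(n)]

# Correlation matrix: R_ij = E X_i X_j
R = [[sum(prob * p[i] * p[j] for p in points) for j in range(n)]
     for i in range(n)]

# Covariance matrix from the definition: C_ij = E (X_i - mu_i)(X_j - mu_j)
C = [[sum(prob * (p[i] - mu[i]) * (p[j] - mu[j]) for p in points)
      for j in range(n)] for i in range(n)]

# Covariance matrix from the identity: R minus the outer product mu mu'
C_identity = [[R[i][j] - mu[i] * mu[j] for j in range(n)] for i in range(n)]

print(C)
print(C_identity)
```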

## Independent and Identically distributed random variables

The random variables $X_1, X_2, \dots, X_n$ are called (mutually) independent if and only if, for all $(x_1, x_2, \dots, x_n) \in \mathbb{R}^n$,

$$F_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i)$$

The random variables $X_1, X_2, \dots, X_n$ are called identically distributed if each random variable has the same marginal distribution function, that is, for all $x \in \mathbb{R}$,

$$F_{X_1}(x) = F_{X_2}(x) = \dots = F_{X_n}(x)$$
An important subclass of independent random variables is the independent and identically distributed (iid) random variables. The random
variables X1 , X 2 ,..., X n are called iid if X1 , X 2 ,..., X n are mutually independent and each of X1 , X 2 ,..., X n has the same marginal distribution
function.

## Uncorrelated random variables

The random variables $X_1, X_2, \dots, X_n$ are called uncorrelated if for each $(i, j)$, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, n$, $i \ne j$,
$\mathrm{Cov}(X_i, X_j) = 0$
If

## Multiple Jointly Gaussian Random variables

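For any positive integer $n$, let $X_1, X_2, \dots, X_n$ represent $n$ jointly distributed random variables. These $n$ random variables define a random vector $\mathbf{X} = [X_1, X_2, \dots, X_n]'$. These random variables are called jointly Gaussian if $X_1, X_2, \dots, X_n$ have the joint probability density function given by

$$f_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n) = \frac{1}{\left(\sqrt{2\pi}\right)^n \sqrt{\det(\mathbf{C}_{\mathbf{X}})}}\; e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_{\mathbf{X}})' \mathbf{C}_{\mathbf{X}}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{\mathbf{X}})}$$

where $\mathbf{C}_{\mathbf{X}}$ is the covariance matrix and $\boldsymbol{\mu}_{\mathbf{X}}$ is the mean vector of $\mathbf{X}$.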

Remark
The properties of the two-dimensional Gaussian random variables can be extended to multiple jointly Gaussian random variables.
If X1 , X 2 ,....., X n are jointly Gaussian, then the marginal PDF of each of X1 , X 2 ,....., X n is a Gaussian.
If the jointly Gaussian random variables X1 , X 2 ,..., X n are uncorrelated, then X1 , X 2 ,..., X n are independent also.
Inequalities based on expectations
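The mean and variance also give some quantitative information about the bounds of RVs. The following inequalities are extremely useful in many practical problems.
Markov and Chebyshev Inequalities
For a random variable $X$ which takes only nonnegative values,

$$P\{X \ge a\} \le \frac{E(X)}{a}$$

where $a > 0$.

Proof:

$$E(X) = \int_{0}^{\infty} x\,f_X(x)\,dx \ge \int_{a}^{\infty} x\,f_X(x)\,dx \ge \int_{a}^{\infty} a\,f_X(x)\,dx = a\,P\{X \ge a\}$$

$$\therefore\ P\{X \ge a\} \le \frac{E(X)}{a}$$

Clearly,

$$P\{(X - k)^2 \ge a\} \le \frac{E(X - k)^2}{a}$$

Taking $k = \mu_X$ and $a = \epsilon^2$,

$$P\{|X - \mu_X| \ge \epsilon\} = P\{(X - \mu_X)^2 \ge \epsilon^2\} \le \frac{\sigma_X^2}{\epsilon^2}$$

which is the Chebyshev inequality.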
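A quick exact check of the Chebyshev inequality: the sketch below assumes $X$ is the outcome of a fair six-sided die (an illustrative choice) and compares $P\{|X - \mu_X| \ge \epsilon\}$ with the bound $\sigma_X^2 / \epsilon^2$ for $\epsilon = 2$.

```python
from fractions import Fraction

# Assumed distribution: a fair six-sided die
values = range(1, 7)
prob = Fraction(1, 6)

mean = sum(prob * x for x in values)
variance = sum(prob * (x - mean) ** 2 for x in values)

eps = 2
# Exact tail probability P{|X - mean| >= eps}
tail = sum(prob for x in values if abs(x - mean) >= eps)
# Chebyshev bound sigma^2 / eps^2
bound = variance / eps ** 2

print("tail probability:", tail)
print("Chebyshev bound: ", bound)
```

The bound holds but is loose here, which is typical: Chebyshev uses only the mean and variance and no other information about the distribution.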

Laws of Large numbers
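Consider a sequence of random variables $\{X_n\}$ with a common mean $\mu$. It is common practice to determine $\mu$ on the basis of the sample mean defined by the relation

$$\frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

where $S_n = \sum_{i=1}^{n} X_i$.

Theorem 1 Weak law of large numbers (WLLN): Suppose $\{X_n\}$ is a sequence of random variables defined on a probability space $(S, \mathcal{F}, P)$ with finite means $\mu_i = EX_i$, $i = 1, 2, \dots, n$, and finite second moments. If

$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \mathrm{cov}(X_i, X_j) = 0,$$

then

$$\frac{S_n}{n} \xrightarrow{P} \frac{1}{n} \sum_{i=1}^{n} \mu_i.$$

Note that $\frac{S_n}{n} \xrightarrow{P} \frac{1}{n} \sum_{i=1}^{n} \mu_i$ means that for any $\epsilon > 0$,

$$\lim_{n \to \infty} P\left\{ \left| \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right| \ge \epsilon \right\} = 0$$

Proof: We have

$$E\left( \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right)^2 = E\left( \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right)^2 = \frac{1}{n^2} E\left( \sum_{i=1}^{n} (X_i - \mu_i) \right)^2$$

$$= \frac{1}{n^2} \sum_{i=1}^{n} E(X_i - \mu_i)^2 + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} E(X_i - \mu_i)(X_j - \mu_j)$$

$$= \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \mathrm{cov}(X_i, X_j)$$

$$\therefore\ \lim_{n \to \infty} E\left( \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right)^2 = \lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 + \lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \mathrm{cov}(X_i, X_j)$$

Now $\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$, as each $\sigma_i^2$ is finite (strictly, this step needs the variances to be uniformly bounded). Also, by hypothesis,

$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \mathrm{cov}(X_i, X_j) = 0$$

$$\therefore\ \lim_{n \to \infty} E\left( \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right)^2 = 0$$

Now

$$P\left\{ \left| \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right| \ge \epsilon \right\} \le \frac{E\left( \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right)^2}{\epsilon^2} \quad \text{(Chebyshev inequality)}$$

$$\therefore\ \lim_{n \to \infty} P\left\{ \left| \frac{S_n}{n} - \frac{1}{n} \sum_{i=1}^{n} \mu_i \right| \ge \epsilon \right\} = 0$$

that is,

$$\frac{S_n}{n} \xrightarrow{P} \frac{1}{n} \sum_{i=1}^{n} \mu_i$$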
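The weak law of large numbers can be observed in a simple simulation. The sketch below assumes iid samples uniform on $[0, 1)$, so the common mean is $0.5$; the helper `sample_mean` and the seed are arbitrary illustrative choices.

```python
import random


def sample_mean(n, rng):
    """Return S_n / n for n iid samples drawn uniformly from [0, 1)."""
    return sum(rng.random() for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so that the run is repeatable
for n in (10, 1000, 100000):
    deviation = abs(sample_mean(n, rng) - 0.5)
    print(f"n = {n:6d}  |S_n/n - 0.5| = {deviation:.5f}")
```

The deviation of the sample mean from $0.5$ tends to shrink as $n$ grows, although for any single run it need not decrease monotonically.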

## Special Case of the WLLN

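(a) Suppose $\{X_n\}$ is a sequence of independent and identically distributed random variables defined on a probability space $(S, \mathcal{F}, P)$.
Then we have
$EX_i = \text{constant} = \mu$ (say),
$\mathrm{var}(X_i) = \text{constant} = \sigma^2$ (say), and
$\mathrm{cov}(X_i, X_j) = 0$ for $i \ne j$.

$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = \lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0$$

$$\therefore\ \frac{S_n}{n} \xrightarrow{P} \mu$$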
lim

$\mu_i = EX_i$, $i = 1, 2, \dots, n$ and finite second moments.

Then we have
Central Limit theorem
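Theorem: Suppose $\{X_n\}$ is a sequence of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let

$$S_n = \sum_{i=1}^{n} X_i \quad \text{and} \quad Z_n = \frac{S_n - n\mu}{\sigma \sqrt{n}}.$$

Then $Z_n \xrightarrow{d} Z$, where $Z$ is a standard Gaussian random variable, that is,

$$\lim_{n \to \infty} F_{Z_n}(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du$$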
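The central limit theorem can be illustrated by simulation. The sketch below assumes iid samples uniform on $[0, 1)$ (mean $1/2$, variance $1/12$), forms the standardized sum $Z_n$ for $n = 30$, and compares the empirical value of $P\{Z_n \le 1\}$ with the standard Gaussian CDF. The names `standardized_sum` and `std_normal_cdf`, the seed, and the sample sizes are illustrative choices.

```python
import math
import random


def standardized_sum(n, rng):
    """Return Z_n = (S_n - n*mu) / (sigma*sqrt(n)) for n uniform [0, 1) samples."""
    mu = 0.5
    sigma = math.sqrt(1.0 / 12.0)
    s_n = sum(rng.random() for _ in range(n))
    return (s_n - n * mu) / (sigma * math.sqrt(n))


def std_normal_cdf(z):
    """CDF of the standard Gaussian, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


rng = random.Random(0)  # fixed seed so that the run is repeatable
zs = [standardized_sum(30, rng) for _ in range(20000)]
empirical = sum(z <= 1.0 for z in zs) / len(zs)

print("empirical P{Z_n <= 1}:", empirical)
print("Gaussian  P{Z   <= 1}:", std_normal_cdf(1.0))
```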

Random processes
Definition: Consider a probability space $(S, \mathcal{F}, P)$. A random process can be defined on $(S, \mathcal{F}, P)$ as an indexed family of random variables $\{X(s, t),\ s \in S,\ t \in \Gamma\}$, where $\Gamma$ is an index set usually denoting time.
Thus $X(s, t)$ is a function defined on $S \times \Gamma$. Figure 1 illustrates a random process. The random process $\{X(s, t),\ s \in S,\ t \in \Gamma\}$ is also referred to as a random function or a stochastic process.
We observe the following in the case of a random process $\{X(s, t),\ s \in S,\ t \in \Gamma\}$:
(1) For a fixed time $t = t_0$, the collection $\{X(s, t_0),\ s \in S\}$ is a random variable.
(2) For a fixed sample point $s = s_0$, the collection $\{X(s_0, t),\ t \in \Gamma\}$ is no longer a function on the sample space. It is a deterministic function on $\Gamma$ and is called a realization of the random process. Thus each realization corresponds to a particular sample point, and the cardinality of $S$ determines the number of such realizations. The collection of all the possible realizations of a random process is called the ensemble.
(3) When both $s$ and $t$ are fixed at values $s = s_0$ and $t = t_0$, $X(s_0, t_0)$ becomes a single number.
The underlying sample space and the index set are usually omitted to simplify the notation, and the random process $\{X(s, t),\ s \in S,\ t \in \Gamma\}$ is generally denoted by $\{X(t)\}$.

Figure 1 A random process: the realizations $X(s_1, t)$, $X(s_2, t)$ and $X(s_3, t)$ corresponding to the sample points $s_1$, $s_2$ and $s_3$, plotted against $t$

To describe $\{X(t),\ t \in \Gamma\}$ we have to consider the collection of the random variables at all possible values of $t$. For any positive integer $n$, the collection $X(t_1), X(t_2), \dots, X(t_n)$ represents $n$ jointly distributed random variables. Thus a random process $\{X(t),\ t \in \Gamma\}$ at these $n$ instants $t_1, t_2, \dots, t_n$ can be described by specifying the $n$-th order joint distribution function

$$F_{X(t_1), X(t_2), \dots, X(t_n)}(x_1, x_2, \dots, x_n) = P(X(t_1) \le x_1, X(t_2) \le x_2, \dots, X(t_n) \le x_n)$$

and the $n$-th order joint probability density function $f_{X(t_1), X(t_2), \dots, X(t_n)}(x_1, x_2, \dots, x_n)$ defined by

$$F_{X(t_1), X(t_2), \dots, X(t_n)}(x_1, x_2, \dots, x_n) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f_{X(t_1), X(t_2), \dots, X(t_n)}(u_1, u_2, \dots, u_n)\,du_1\,du_2 \cdots du_n$$

However, we have to consider the joint probability distribution function for very high n and all possible t1, t2 ,..., tn to describe the
random process in sufficient details. This being a formidable task, we have to look for other descriptions of a random process.
Moments of a random process
We defined the moments of a random variable and the joint moments of random variables. We can define all the possible moments and joint moments of a random process $\{X(t),\ t \in \Gamma\}$. Particularly, the following moments are important.

$\mu_X(t)$ = mean of the random process at $t$ = $E(X(t))$

$R_X(t_1, t_2)$ = autocorrelation function of the process at times $t_1$ and $t_2$ = $E(X(t_1) X(t_2))$

Note that $R_X(t_1, t_2) = R_X(t_2, t_1)$ and

$R_X(t, t) = EX^2(t)$ = second moment or mean-square value at time $t$.

The autocovariance function $C_X(t_1, t_2)$ of the random process at times $t_1$ and $t_2$ is defined by

$$C_X(t_1, t_2) = E(X(t_1) - \mu_X(t_1))(X(t_2) - \mu_X(t_2)) = R_X(t_1, t_2) - \mu_X(t_1)\mu_X(t_2)$$

$C_X(t, t) = E(X(t) - \mu_X(t))^2$ = variance of the process at time $t$.

A random process $X(t)$ is called a wide-sense stationary (WSS) process if

$\mu_X(t)$ = constant
$R_X(t_1, t_2) = R_X(t_1 - t_2)$ is a function of the time lag $\tau = t_1 - t_2$ only.

For a WSS process, $R_X(-\tau) = R_X(\tau)$ for a real process (for a complex process $X(t)$, $R_X(-\tau) = R_X^*(\tau)$).

For a discrete random process, we can define the autocorrelation sequence similarly.

If $R_X(\tau)$ drops quickly, the signal samples are less correlated, which in turn means that the signal changes rapidly with time. Such a signal has strong high-frequency components. If $R_X(\tau)$ drops slowly, the signal samples are highly correlated and such a signal has weaker high-frequency components.

## Spectral Representation of a WSS process: Wiener-Khinchin-Einstein theorem

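$R_X(\tau)$ is directly related to the frequency-domain representation of a WSS process. The power spectral density $S_X(\omega)$ is the contribution to the average power at frequency $\omega$ and is given by

$$S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau)\,e^{-j\omega\tau}\,d\tau$$

and, using the inverse Fourier transform,

$$R_X(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_X(\omega)\,e^{j\omega\tau}\,d\omega$$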

Example
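PSD of the amplitude-modulated random-phase sinusoid $X(t) = M(t)\cos(\omega_c t + \Phi)$, $\Phi \sim U(0, 2\pi)$, where $M(t)$ is a WSS process independent of $\Phi$.

$$R_X(\tau) = E\left[ M(t + \tau)\cos(\omega_c (t + \tau) + \Phi)\; M(t)\cos(\omega_c t + \Phi) \right]$$

$$= E\left[ M(t + \tau) M(t) \right]\, E\left[ \cos(\omega_c (t + \tau) + \Phi)\cos(\omega_c t + \Phi) \right] \quad \text{(using independence of } M(t) \text{ and the sinusoid)}$$

$$= \frac{1}{2} R_M(\tau)\cos \omega_c \tau$$

$$\therefore\ S_X(\omega) = \frac{1}{4}\left[ S_M(\omega + \omega_c) + S_M(\omega - \omega_c) \right]$$

where $S_M(\omega)$ is the PSD of $M(t)$.
The Wiener-Khinchin theorem is also valid for discrete-time random processes.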

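If we define $R_X[m] = E\,X[n + m] X[n]$, then the corresponding PSD is given by

$$S_X(\omega) = \sum_{m=-\infty}^{\infty} R_X[m]\,e^{-j\omega m}, \quad -\pi \le \omega \le \pi$$

or

$$S_X(f) = \sum_{m=-\infty}^{\infty} R_X[m]\,e^{-j 2\pi f m}, \quad -\tfrac{1}{2} \le f \le \tfrac{1}{2}$$

and

$$R_X[m] = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_X(\omega)\,e^{j\omega m}\,d\omega$$

For a discrete-time random process, the generalized PSD is defined in the $z$ domain as follows:

$$S_X(z) = \sum_{m=-\infty}^{\infty} R_X[m]\,z^{-m}$$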

## Response of Linear time-invariant system to WSS input

In many applications, physical systems are modeled as linear time-invariant (LTI) systems. The dynamic behavior of an LTI system in response to deterministic inputs is described by linear differential equations. We are familiar with time-domain and transform-domain (such as Laplace transform and Fourier transform) techniques to solve these equations. In this lecture, we develop the technique to analyze the response of an LTI system to a WSS random process.
Consider a discrete-time linear system with impulse response $h[n]$. Suppose a deterministic signal $x[n]$ is the input to the system. Then the output $y[n]$ is

$$y[n] = \sum_{k=-\infty}^{\infty} h[k]\,x[n - k]$$

Taking the discrete-time Fourier transform of both sides,

$$Y(\omega) = X(\omega) H(\omega)$$

where

$$H(\omega) = \sum_{n=-\infty}^{\infty} h[n]\,e^{-j\omega n}$$

When the input is a random process $\{X[n]\}$, we can write

$$Y[n] = \sum_{k=-\infty}^{\infty} h[k]\,X[n - k]$$

in the sense that each realization is subjected to the convolution operation. Assume that $\{X[n]\}$ is WSS. Then the expected value of the output is given by

$$\mu_Y = E\,Y[n] = \sum_{k=-\infty}^{\infty} h[k]\,E\,X[n - k] = \mu_X \sum_{k=-\infty}^{\infty} h[k] = \mu_X H(0)$$
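The cross-correlation of the output $Y[n]$ and the input $X[n - m]$ is given by

$$R_{YX}[m] = E\,X[n - m] Y[n] = E\,X[n - m] \sum_{k=-\infty}^{\infty} h[k]\,X[n - k]$$

$$= \sum_{k=-\infty}^{\infty} h[k]\,E\,X[n - m] X[n - k] = \sum_{k=-\infty}^{\infty} h[k]\,R_X[m - k]$$

$$\therefore\ R_{YX}[m] = R_X[m] * h[m]$$

Similarly,

$$E\,Y[n + m] Y[n] = E\,Y[n + m] \sum_{k=-\infty}^{\infty} h[k]\,X[n - k]$$

$$= \sum_{k=-\infty}^{\infty} h[k]\,E\,Y[n + m] X[n - k] = \sum_{k=-\infty}^{\infty} h[k]\,R_{YX}[m + k]$$

$$\therefore\ R_Y[m] = R_{YX}[m] * h[-m]$$

$$R_Y[m] = R_X[m] * h[m] * h[-m]$$

$R_Y[m]$ is a function of the lag $m$ only.

From the above we get

$$S_Y(\omega) = H(\omega) H^*(\omega)\,S_X(\omega)$$

$$\therefore\ S_Y(\omega) = |H(\omega)|^2\,S_X(\omega)$$

In terms of the $z$ transform, the power spectral densities are related by

$$S_Y(z) = S_X(z)\,H(z)\,H(z^{-1})$$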

## Continuous-time White noise process

One of the most important random processes is the white noise process. Noise in many practical situations is approximated by the white noise process. Most importantly, white noise plays a central role in the modeling of WSS signals.
A white noise process $\{X(t)\}$ is defined by

$$S_X(\omega) = \frac{N_0}{2}, \quad -\infty < \omega < \infty$$

where $N_0$ is a real constant called the intensity of the white noise. The corresponding autocorrelation function is given by

$$R_X(\tau) = \frac{N_0}{2}\,\delta(\tau)$$

where $\delta(\tau)$ is the Dirac delta function.

The average power of white noise is

$$P_{avg} = E X^2(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{N_0}{2}\,d\omega \to \infty$$

Thus the continuous-time white noise process is not realizable.

The autocorrelation function and the PSD of a white noise process are shown in the figure below.

Figure (a) The PSD $S_X(\omega) = N_0/2$, constant over all $\omega$; (b) the autocorrelation function $R_X(\tau) = \frac{N_0}{2}\delta(\tau)$, an impulse at $\tau = 0$
## Discrete-time White Noise Process

A discrete-time WSS process $\{X[n]\}$ is called a white noise process if $X[n_1]$ and $X[n_2]$ are uncorrelated for $n_1 \ne n_2$. We usually assume $\{X[n]\}$ to be zero-mean so that

$$R_X[m] = \sigma^2\,\delta[m]$$

Here $\sigma^2$ is the variance of $X[n]$, which is independent of $n$, and $\delta[m]$ is the unit impulse signal. By taking the discrete-time Fourier transform, we get

$$S_X(\omega) = \sigma^2, \quad -\pi \le \omega \le \pi$$

Figure The PSD $S_X(\omega)$ and the autocorrelation sequence $R_X[m]$ of a discrete-time white noise process

Note that a white noise process is described by its second-order statistics only and is therefore not unique. In particular, if in addition each $X[n]$ is Gaussian distributed, the white noise process is called a white Gaussian noise process. Similarly, a sequence of Bernoulli random variables constitutes the Bernoulli white noise process.

Linear Shift Invariant System with the white noise input
We have

$$S_X(\omega) = \sigma^2$$

$$S_Y(\omega) = |H(\omega)|^2\,\sigma^2$$

In the $z$-transform domain, we have

$$S_Y(z) = H(z)\,H(z^{-1})\,\sigma^2$$
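In the lag domain, the relation $S_Y(z) = H(z) H(z^{-1}) \sigma^2$ corresponds to $R_Y[m] = \sigma^2\,(h[m] * h[-m])$, since the input autocorrelation is $\sigma^2 \delta[m]$. The sketch below evaluates this for a short FIR filter; the impulse response `h`, the variance `sigma2` and the helper `convolve` are hypothetical choices for illustration.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


h = [1.0, 0.5]   # hypothetical causal FIR impulse response h[0], h[1]
sigma2 = 2.0     # assumed variance of the white noise input

# h[-m] is the time-reversed impulse response, so h[m] * h[-m] is the
# convolution of h with its reverse. Index i of the result corresponds
# to lag m = i - (len(h) - 1).
r_y = [sigma2 * v for v in convolve(h, h[::-1])]

for i, value in enumerate(r_y):
    print("R_Y[%d] = %s" % (i - (len(h) - 1), value))
```

The value at lag zero equals $\sigma^2 \sum_k h^2[k]$, the output power, and the sequence is symmetric in $m$ as expected for a real process.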

## Spectral factorization theorem

A WSS random signal $\{X[n]\}$ that satisfies the Paley-Wiener condition $\int_{-\pi}^{\pi} |\ln S_X(\omega)|\,d\omega < \infty$ can be considered as an output of a linear filter fed by a white noise sequence.

If $S_X(\omega)$ is an analytic function of $\omega$, and $\int_{-\pi}^{\pi} |\ln S_X(\omega)|\,d\omega < \infty$,

then $S_X(z) = \sigma_v^2\,H_c(z)\,H_a(z)$
where
$H_c(z)$ is the causal minimum-phase transfer function,

$H_a(z)$ is the anti-causal maximum-phase transfer function,

and $\sigma_v^2$ is a constant interpreted as the variance of a white-noise sequence.
Innovation sequence

$V[n] \rightarrow H_c(z) \rightarrow X[n]$

Figure Innovation filter

A minimum-phase filter has a corresponding inverse filter. Therefore,

$X[n] \rightarrow \dfrac{1}{H_c(z)} \rightarrow V[n]$

Figure Whitening filter

The spectral factorization theorem enables us to model a regular random process as an output of a minimum phase linear filter with
white noise as input. Different models are developed using different forms of linear filters.

These models are mathematically described by linear constant coefficient difference equations.

In statistics, random-process modeling using difference equations is known as time series analysis.

In most practical situations, the process may be considered as the output of a filter that has both zeros and poles:

$$V[n] \rightarrow H(z) = \frac{\sum_{i=0}^{q} b_i z^{-i}}{1 - \sum_{i=1}^{p} a_i z^{-i}} \rightarrow X[n]$$

The model is given by

$$X[n] = \sum_{i=1}^{p} a_i X[n - i] + \sum_{i=0}^{q} b_i V[n - i]$$
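The difference equation of the pole-zero model can be run directly as a recursion. The sketch below assumes the sign convention $X[n] = \sum_{i=1}^{p} a_i X[n-i] + \sum_{i=0}^{q} b_i V[n-i]$ with zero initial conditions; the function name `arma_filter` and the coefficient values are hypothetical.

```python
def arma_filter(a, b, v):
    """Run X[n] = sum_{i=1..p} a_i X[n-i] + sum_{i=0..q} b_i V[n-i].

    a holds a_1..a_p, b holds b_0..b_q, and v is the input sequence.
    Samples before n = 0 are taken to be zero (an assumption of this sketch).
    """
    x = []
    for n in range(len(v)):
        # Autoregressive part: weighted sum of past outputs
        ar = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        # Moving-average part: weighted sum of current and past inputs
        ma = sum(b[i] * v[n - i] for i in range(len(b)) if n - i >= 0)
        x.append(ar + ma)
    return x


# Impulse response of a first-order model with a_1 = 0.5 and b_0 = 1
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(arma_filter([0.5], [1.0], impulse))
```

Feeding a unit impulse gives the impulse response of $H(z)$; feeding samples of a white noise sequence instead would give one realization of the modeled process.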