
Before we give a definition of probability, let us examine the following concepts:

1. Random Experiment: An experiment is a random experiment if its outcome cannot be predicted precisely. One out of a number of possible outcomes occurs in a random experiment.

2. Sample Space: The sample space S is the collection of all outcomes of a random experiment. The elements of S are called sample points. A sample space may be finite, countably infinite or uncountable. A finite or countably infinite sample space is called a discrete sample space. An uncountable sample space is called a continuous sample space.

Fig. A sample space S with sample points s1, s2, ...

3. Event: An event A is a subset of the sample space to which a probability can be assigned. Thus A ⊆ S. For a discrete sample space, all subsets are events. S is the certain event (sure to occur) and ∅ is the impossible event. Consider the following examples.

Example 1 Tossing a fair coin: The possible outcomes are H (head) and T (tail). The associated sample space is S = {H, T}. It is a finite sample space. The events associated with the sample space S are S, {H}, {T} and ∅.

Example 2 Throwing a fair die: The six possible outcomes are the faces marked '1' to '6'.

The associated finite sample space is S = {'1', '2', '3', '4', '5', '6'}. Some events are A = the event of getting an odd face = {'1', '3', '5'}, B = the event of getting a six = {'6'}, and so on.

Example 3 Tossing a fair coin until a head is obtained: We may have to toss the coin any number of times before a head is obtained. Thus the possible outcomes are H, TH, TTH, TTTH, ... How many outcomes are there? The outcomes are countable but infinite in number. The countably infinite sample space is S = {H, TH, TTH, ...}.

Example 4 Picking a real number at random between -1 and 1

The associated sample space is S = {s : s ∈ ℝ, −1 ≤ s ≤ 1} = [−1, 1]. Clearly S is a continuous sample space. The probability of an event A is a number P(A) assigned to the event. Let us see how we can define probability.

1. Classical definition of probability (Laplace, 1812)

Consider a random experiment with a finite number of outcomes N. If all the outcomes of the experiment are equally likely, the probability of an event A is defined by

$$P(A) = \frac{N_A}{N}$$

where N_A = number of outcomes favourable to A.

Example 4 A fair die is rolled once. What is the probability of getting a 6? Here S = {'1', '2', '3', '4', '5', '6'} and A = {'6'}.
N = 6 and N_A = 1, so that P(A) = 1/6.

Example 5 A fair coin is tossed twice. What is the probability of getting two heads?

Here S = {HH, HT, TH, TT} and A = {HH}. The total number of outcomes is 4 and all four outcomes are equally likely. The only outcome favourable to A is HH. Therefore,
P(A) = 1/4

Discussion The classical definition is limited to a random experiment which has only a finite number of outcomes. In many experiments like that in the above example, the sample space is finite and each outcome may be assumed equally likely. In such cases, the counting method can be used to compute probabilities of events.
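As an illustration of the counting method, here is a minimal Python sketch (illustrative code, not part of the original notes) that enumerates the sample space of two tosses of a fair coin and counts the outcomes favourable to the event "two heads".

```python
from itertools import product
from fractions import Fraction

# Sample space of two tosses of a fair coin: all ordered pairs of H/T.
sample_space = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# Event A: both tosses show heads.
A = [outcome for outcome in sample_space if outcome == ("H", "H")]

# Classical definition: P(A) = N_A / N for equally likely outcomes.
P_A = Fraction(len(A), len(sample_space))
print(P_A)   # 1/4
```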

Consider the experiment of tossing a fair coin until a head appears. As we have discussed earlier, there are countably infinite outcomes. Can we believe that all these outcomes are equally likely? The notion of 'equally likely' is important here. Equally likely means equally probable. Thus this definition presupposes that all outcomes occur with equal probability, so the definition uses a concept that is yet to be defined.

2. Relative-frequency based definition of probability (von Mises, 1919)

If an experiment is repeated n times under similar conditions and the event A occurs n_A times, then

$$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$$

This definition is also inadequate from the theoretical point of view. We cannot repeat an experiment an infinite number of times.

How do we ascertain that the above ratio will converge for all possible sequences of outcomes of the experiment?
Example Suppose a die is rolled 500 times. The following table shows the frequency of each face.

Face   Frequency   Relative frequency
1      82          0.164
2      81          0.162
3      88          0.176
4      81          0.162
5      90          0.180
6      78          0.156
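A relative-frequency experiment of this kind is easy to simulate. The sketch below (illustrative code with an arbitrary random seed) rolls a simulated fair die 500 times and prints the frequency and relative frequency of each face.

```python
import random

random.seed(0)
n = 500
counts = {face: 0 for face in range(1, 7)}

# Roll a fair die n times and tally the frequency of each face.
for _ in range(n):
    counts[random.randint(1, 6)] += 1

for face in range(1, 7):
    print(face, counts[face], counts[face] / n)   # face, frequency, relative frequency
```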

3. Axiomatic definition of probability (Kolmogorov, 1933)

We have earlier defined an event as a subset of the sample space. Does each subset of the sample space form an event? The answer is yes for a finite sample space. However, we may not be able to assign probability meaningfully to all the subsets of a continuous sample space. We have to eliminate those subsets. This is where the concept of the sigma algebra becomes meaningful.
Definition: Let S be a sample space and F a sigma field (sigma algebra) defined over it. Let P : F → ℝ be a mapping from the sigma algebra F into the real line such that for each A ∈ F there exists a unique P(A). Clearly P is a set function; it is called a probability if it satisfies the following axioms:

1. P(A) ≥ 0 for all A ∈ F
2. P(S) = 1
3. Countable additivity: If A1, A2, ... are pairwise disjoint events, i.e. Ai ∩ Aj = ∅ for i ≠ j, then

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)$$

Remark The triplet (S, F, P) is called the probability space.

Any assignment of probability must satisfy the above three axioms. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B). This is a special case of axiom 3, and for a discrete sample space this simpler version may be taken as axiom 3. We shall give a proof of this result below. The events A and B are called mutually exclusive if A ∩ B = ∅.
Basic results of probability

From the above axioms we can establish the following basic results:

1. P(∅) = 0

This is because S = S ∪ ∅, so that P(S) = P(S ∪ ∅) = P(S) + P(∅), which gives P(∅) = 0.

2. P(Aᶜ) = 1 − P(A), where Aᶜ is the complement of A.

We have A ∪ Aᶜ = S and A ∩ Aᶜ = ∅. Therefore
P(A ∪ Aᶜ) = P(S) ⇒ P(A) + P(Aᶜ) = 1 ⇒ P(A) = 1 − P(Aᶜ).

3. If A, B ∈ F and A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

We have A ∪ B = A ∪ B ∪ ∅ ∪ ∅ ∪ ..., so that, using axiom 3,
P(A ∪ B) = P(A) + P(B) + P(∅) + P(∅) + ... = P(A) + P(B).

4. If A, B ∈ F, then P(A ∩ Bᶜ) = P(A) − P(A ∩ B).

We have A = (A ∩ Bᶜ) ∪ (A ∩ B), and the two sets on the right are disjoint (see the Venn diagram with regions A ∩ Bᶜ and A ∩ B). Therefore
P[(A ∩ Bᶜ) ∪ (A ∩ B)] = P(A)
⇒ P(A ∩ Bᶜ) + P(A ∩ B) = P(A)
⇒ P(A ∩ Bᶜ) = P(A) − P(A ∩ B).

We can similarly show that P(Aᶜ ∩ B) = P(B) − P(A ∩ B).

5. If A, B ∈ F, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

We have A ∪ B = (Aᶜ ∩ B) ∪ (A ∩ B) ∪ (A ∩ Bᶜ), a union of disjoint sets. Therefore
P(A ∪ B) = P(Aᶜ ∩ B) + P(A ∩ B) + P(A ∩ Bᶜ)
         = P(B) − P(A ∩ B) + P(A ∩ B) + P(A) − P(A ∩ B)
         = P(A) + P(B) − P(A ∩ B).
6. We can apply the properties of sets to establish the following result: for A, B, C ∈ F,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).

The following generalization is known as the principle of inclusion-exclusion.

7. Principle of inclusion-exclusion

Suppose A1, A2, ..., An ∈ F. Then

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P\Big(\bigcap_{i=1}^{n} A_i\Big)$$
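As a quick sanity check, the following sketch (illustrative code; the events A, B, C are chosen arbitrarily) verifies the three-event inclusion-exclusion formula by direct enumeration over the equally likely outcomes of one roll of a fair die.

```python
from fractions import Fraction

# Equally likely outcomes of one roll of a fair die.
S = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), len(S))

A, B, C = {1, 2, 3}, {2, 4, 6}, {3, 4, 5}

lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(B & C) - P(A & C)
       + P(A & B & C))
print(lhs, rhs, lhs == rhs)   # both equal 1 here, so the identity checks out
```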

Discussion: We require some rules to assign probabilities to certain basic events in F. For other events we can compute the probabilities in terms of the probabilities of these basic events.

Probability assignment in a discrete sample space

Consider a finite sample space S = {s1, s2, ..., sn}. Then the sigma algebra F is defined by the power set of S. For any elementary event {si}, we can assign a probability P({si}) such that

$$\sum_{i=1}^{n} P(\{s_i\}) = 1$$

For any event A ⊆ S, we can define the probability

$$P(A) = \sum_{s_i \in A} P(\{s_i\})$$

In the special case when the outcomes are equiprobable, we can assign equal probability p to each elementary event:

$$\sum_{i=1}^{n} p = np = 1 \;\Rightarrow\; p = \frac{1}{n}$$

so that

$$P(A) = \sum_{s_i \in A} P(\{s_i\}) = \frac{n(A)}{n}$$

where n(A) is the number of outcomes in A.

Example Consider the experiment of rolling a fair die considered in Example 2. Suppose Ai, i = 1, ..., 6 represent the elementary events. Thus A1 is the event of getting '1', A2 is the event of getting '2', and so on. Since all six disjoint events are equiprobable and S = A1 ∪ A2 ∪ ... ∪ A6, we get

P(A1) = P(A2) = ... = P(A6) = 1/6.

Suppose A is the event of getting an odd face. Then A = A1 ∪ A3 ∪ A5 and

P(A) = P(A1) + P(A3) + P(A5) = 3 × (1/6) = 1/2.

Example Consider the experiment of tossing a fair coin until a head is obtained, discussed in Example 3. Here S = {H, TH, TTH, ...}. Let us call s1 = H, s2 = TH, s3 = TTH, and so on. If we assign

$$P(\{s_n\}) = \frac{1}{2^n}, \qquad \text{then} \qquad \sum_{s_n \in S} P(\{s_n\}) = 1.$$

Let A = {s1, s2, s3} be the event of obtaining the head before the 4th toss. Then

$$P(A) = P(\{s_1\}) + P(\{s_2\}) + P(\{s_3\}) = \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^3} = \frac{7}{8}.$$
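The assignment P({s_n}) = 1/2^n can also be checked by simulation. The sketch below (illustrative code, arbitrary seed) estimates the probability of obtaining a head before the 4th toss and compares it with 7/8.

```python
import random

random.seed(1)

def tosses_until_head():
    """Number of tosses of a fair coin needed to see the first head."""
    n = 1
    while random.random() >= 0.5:   # treat < 0.5 as head, >= 0.5 as tail
        n += 1
    return n

trials = 100_000
hits = sum(1 for _ in range(trials) if tosses_until_head() <= 3)
print(hits / trials, 7 / 8)   # the estimate should be close to 0.875
```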

Probability assignment in a continuous space

Suppose the sample space S is continuous and uncountable. Such a sample space arises when the outcomes of an experiment are numbers. For example, such a sample space occurs when the experiment consists in measuring the voltage, the current or the resistance. In such a case, the sigma algebra consists of the Borel sets on the real line. Suppose S = ℝ and f : ℝ → [0, ∞) is a non-negative integrable function such that

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

For any Borel set A,

$$P(A) = \int_{A} f(x)\,dx$$

defines a probability on the Borel sigma-algebra B.


We can similarly define probability on the continuous spaces ℝ², ℝ³, etc.

Example Suppose

$$f_X(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$$

Then for [a1, b1] ⊆ [a, b],

$$P([a_1, b_1]) = \int_{a_1}^{b_1} \frac{1}{b-a}\,dx = \frac{b_1 - a_1}{b-a}$$

Example Consider S = ℝ², the two-dimensional Euclidean space. Let S1 ⊆ ℝ² and let |S1| represent the area of S1. Define

$$f_X(x) = \begin{cases} \dfrac{1}{|S_1|} & \text{for } x \in S_1 \\ 0 & \text{otherwise} \end{cases}
\qquad\text{so that}\qquad
P(A) = \frac{|A|}{|S_1|} \ \text{ for } A \subseteq S_1.$$

This example interprets the geometrical definition of probability.

Probability Using Counting Method: In many applications we have to deal with a finite sample space S and the elementary events formed by single elements of the set may be assumed equiprobable. In this case, we can define the probability of the event A according to the classical definition discussed earlier:

$$P(A) = \frac{n_A}{n}$$

where n_A = number of elements favourable to A and n is the total number of elements in the sample space S. Thus the calculation of probability involves finding the number of elements in the sample space S and in the event A. Combinatorial rules give us quick algebraic rules to find the number of elements in S. We briefly outline some of these rules:

(1) Product rule: Suppose we have a set A with m distinct elements and a set B with n distinct elements, and A × B = {(ai, bj) | ai ∈ A, bj ∈ B}. Then A × B contains mn ordered pairs of elements. This is illustrated in the figure below for m = 5 and n = 4. In other words, if we can choose element a in m possible ways and element b in n possible ways, then the ordered pair (a, b) can be chosen in mn possible ways.

(a1,b4) (a2,b4) (a3,b4) (a4,b4) (a5,b4)
(a1,b3) (a2,b3) (a3,b3) (a4,b3) (a5,b3)
(a1,b2) (a2,b2) (a3,b2) (a4,b2) (a5,b2)
(a1,b1) (a2,b1) (a3,b1) (a4,b1) (a5,b1)

Fig. Illustration of the product rule: the mn = 5 × 4 = 20 ordered pairs of A × B.

The above result can be generalized as follows: the number of distinct k-tuples in A1 × A2 × ... × Ak = {(a1, a2, ..., ak) | a1 ∈ A1, a2 ∈ A2, ..., ak ∈ Ak} is n1 n2 ... nk, where ni represents the number of distinct elements in Ai.

Example A fair die is thrown twice. What is the probability that a 3 will appear at least once?

Solution: The sample space corresponding to two throws of the die is illustrated in the following table. Clearly, the sample space has 6 × 6 = 36 elements by the product rule. The event corresponding to getting at least one 3 consists of the outcomes with a 3 in either throw and contains 11 elements. Therefore, the required probability is 11/36.

Throw 2
(1,6) (2,6) (3,6) (4,6) (5,6) (6,6)
(1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
(1,4) (2,4) (3,4) (4,4) (5,4) (6,4)
(1,3) (2,3) (3,3) (4,3) (5,3) (6,3)
(1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
(1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
Throw 1
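The same count can be obtained programmatically; the short sketch below (illustrative code) enumerates the 36 outcomes and counts those containing a 3.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two throws of a fair die.
sample_space = list(product(range(1, 7), repeat=2))

# Event: a 3 appears in at least one of the two throws.
at_least_one_three = [o for o in sample_space if 3 in o]

print(len(at_least_one_three))                               # 11
print(Fraction(len(at_least_one_three), len(sample_space)))  # 11/36
```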

(2) Sampling with replacement and ordering:

Suppose we have to choose k objects from a set of n objects. Further, after every choice, the object is placed back in the set. In this case, the number of distinct ordered k-tuples is n × n × ... × n (k times) = n^k.
(3) Sampling without replacement:

Suppose we have to choose k objects from a set of n objects by picking one object after another at random, without replacement. In this case the first object can be chosen from n objects, the second object can be chosen from the remaining n − 1 objects, and so on. Therefore, by the product rule, the number of distinct ordered k-tuples in this case is

$$n(n-1)\cdots(n-k+1) = \frac{n!}{(n-k)!}$$

The number n!/(n − k)! is called the number of permutations of n objects taking k at a time and is denoted by ⁿP_k. Thus

$$^{n}P_k = \frac{n!}{(n-k)!}$$

Clearly, ⁿP_n = n!.
Example: Birthday problem. Given a class of students, what is the probability of two students in the class having the same birthday? Plot this probability vs. the number of students and be surprised! If the group has more than 365 people, the probability of two people in the group having the same birthday is obviously 1.

Let k be the number of students in the class. Then the number of possible birthdays is 365 × 365 × ... × 365 (k times) = 365^k. The number of cases with each of the k students having a different birthday is ³⁶⁵P_k = 365 × 364 × ... × (365 − k + 1).

Therefore, the probability of a common birthday is

$$1 - \frac{^{365}P_k}{365^k}$$

Number of persons   Probability
2                   0.0027
10                  0.1169
20                  0.4114
25                  0.5687
40                  0.8912
50                  0.9704
60                  0.9941
80                  0.9999
100                 ≈ 1.0000

The plot of probability vs. number of students is shown in Fig. Observe the steep rise in the probability in the beginning. In fact, this probability for a group of 25 students is greater than 0.5, and from 60 students onward it is close to 1. This probability for 366 or more students is exactly one.
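The probabilities in the table can be reproduced with a few lines of code; the sketch below assumes 365 equally likely birthdays and independence between students.

```python
# Probability that at least two people in a group of k share a birthday,
# assuming 365 equally likely birthdays and independence.
def birthday_probability(k: int) -> float:
    p_all_different = 1.0
    for i in range(k):
        p_all_different *= (365 - i) / 365
    return 1.0 - p_all_different

for k in (2, 10, 20, 25, 40, 50, 60, 80, 100):
    print(k, round(birthday_probability(k), 4))
```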

(4) Sampling without replacement and without ordering

Let ⁿC_k be the number of ways in which k objects can be chosen out of a set of n objects when the ordering of the k chosen objects is not considered. Note that k objects can be arranged among themselves in k! ways. Therefore, if ordering of the k objects is considered, the number of ways in which k objects can be chosen out of n objects is ⁿC_k × k!. This is the case of sampling with ordering. Hence

$$^{n}C_k \, k! = {^{n}P_k} = \frac{n!}{(n-k)!} \;\Rightarrow\; {^{n}C_k} = \frac{n!}{k!\,(n-k)!}$$

ⁿC_k is also called the binomial coefficient.

Example: An urn contains 6 red balls, 5 green balls and 4 blue balls. 9 balls are picked at random from the urn without replacement. What is the probability that, out of the picked balls, 4 are red, 3 are green and 2 are blue?
Solution:

9 balls can be picked from a population of 15 balls in

$$^{15}C_9 = \frac{15!}{9!\,6!} \ \text{ways.}$$

Therefore the required probability is

$$\frac{^{6}C_4 \; ^{5}C_3 \; ^{4}C_2}{^{15}C_9}$$

(5) Arranging n objects into k specific groups

Suppose we want to partition a set of n distinct elements into k distinct subsets A1, A2, ..., Ak of sizes n1, n2, ..., nk respectively, so that n = n1 + n2 + ... + nk. Then the total number of distinct partitions is

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}$$

This can be proved by noting that the resulting number of partitions is

$$^{n}C_{n_1}\; {}^{n-n_1}C_{n_2} \cdots {}^{\,n-n_1-\cdots-n_{k-1}}C_{n_k}
= \frac{n!}{n_1!\,(n-n_1)!}\cdot\frac{(n-n_1)!}{n_2!\,(n-n_1-n_2)!}\cdots\frac{(n-n_1-\cdots-n_{k-1})!}{n_k!\,0!}
= \frac{n!}{n_1!\,n_2!\cdots n_k!}$$

Example: What is the probability that in a throw of 12 dice each face occurs twice?

Solution: The total number of elements in the sample space of the outcomes of a single throw of 12 dice is 6¹². The number of favourable outcomes is the number of ways in which the 12 dice can be arranged in six groups of size 2 each: group 1 consisting of two dice each showing 1, group 2 consisting of two dice each showing 2, and so on. Therefore, the total number of such distinct groupings is

$$\frac{12!}{2!\,2!\,2!\,2!\,2!\,2!}$$

Hence the required probability is

$$\frac{12!}{(2!)^6\, 6^{12}}$$
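A quick numerical check of this result (illustrative code, not from the original notes):

```python
from math import factorial

# P(each face occurs exactly twice in a throw of 12 fair dice)
favourable = factorial(12) // (factorial(2) ** 6)   # multinomial coefficient 12!/(2!)^6
total = 6 ** 12
print(favourable / total)   # about 0.0034
```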

Conditional probability

Consider the probability space (S, F, P). Let A and B be two events in F. We ask the following question: given that A has occurred, what is the probability of B? The answer is the conditional probability of B given A, denoted by P(B/A). We shall develop the concept of conditional probability and explain under what condition this conditional probability is the same as P(B).

Notation: P(B/A) = conditional probability of B given A.

Let us consider the case of equiprobable outcomes discussed earlier. Let N_AB sample points be favourable for the joint event A ∩ B. Clearly,

$$P(B/A) = \frac{\text{number of outcomes favourable to } A \text{ and } B}{\text{number of outcomes in } A} = \frac{N_{AB}}{N_A} = \frac{N_{AB}/N}{N_A/N} = \frac{P(A \cap B)}{P(A)}$$

This suggests the following definition of conditional probability. The probability of an event B under the condition that another event A has occurred is called the conditional probability of B given A and is defined by

$$P(B/A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) \ne 0$$

We can similarly define the conditional probability of A given B, denoted by P( A / B).

From the definition of conditional probability, we have the joint probability P(A ∩ B) of two events A and B as follows:

P(A ∩ B) = P(A) P(B/A) = P(B) P(A/B)

Example 1 Consider the example of tossing the fair die. Suppose
A = event of getting an even number = {2, 4, 6}
B = event of getting a number less than 4 = {1, 2, 3}
Then A ∩ B = {2} and

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{1/6}{3/6} = \frac{1}{3}$$

Example 2 A family has two children. It is known that at least one of the children is a girl. What is the probability that both the children are girls?
A = event of at least one girl
B = event of two girls

Clearly S = {gg, gb, bg, bb}, A = {gg, gb, bg} and B = {gg}, so A ∩ B = {gg} and

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{1/4}{3/4} = \frac{1}{3}$$
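The two-children example can also be verified by enumerating the four equally likely outcomes (illustrative code):

```python
from fractions import Fraction

# Equally likely two-child families: g = girl, b = boy (order matters).
S = ["gg", "gb", "bg", "bb"]
A = [s for s in S if "g" in s]        # at least one girl
B = [s for s in S if s == "gg"]       # two girls
A_and_B = [s for s in S if s in A and s in B]

P = lambda E: Fraction(len(E), len(S))
print(P(A_and_B) / P(A))              # 1/3
```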
Conditional probability and the axioms of probability

In the following we show that the conditional probability satisfies the axioms of probability. By definition
$$P(B/A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) \ne 0$$

Axiom 1: Since P(A ∩ B) ≥ 0 and P(A) > 0,

$$P(B/A) = \frac{P(A \cap B)}{P(A)} \ge 0$$

Axiom 2: We have S ∩ A = A, so

$$P(S/A) = \frac{P(S \cap A)}{P(A)} = \frac{P(A)}{P(A)} = 1$$

Axiom 3: Consider a sequence of disjoint events B1, B2, ..., Bn, .... We have

$$\Big(\bigcup_{i=1}^{\infty} B_i\Big) \cap A = \bigcup_{i=1}^{\infty} (B_i \cap A)$$

(see the Venn diagram for an illustration of the finite version of this result). Note that the sequence Bi ∩ A, i = 1, 2, ... is also a sequence of disjoint events, so that

$$P\Big(\bigcup_{i=1}^{\infty}(B_i \cap A)\Big) = \sum_{i=1}^{\infty} P(B_i \cap A)$$

Therefore

$$P\Big(\bigcup_{i=1}^{\infty} B_i \,\Big/\, A\Big) = \frac{P\big(\big(\bigcup_{i=1}^{\infty} B_i\big) \cap A\big)}{P(A)} = \frac{\sum_{i=1}^{\infty} P(B_i \cap A)}{P(A)} = \sum_{i=1}^{\infty} P(B_i / A)$$

Properties of Conditional Probabilities

(1) If B ⊇ A, then P(B/A) = 1 and P(A/B) ≥ P(A).

Since A ⊆ B, we have A ∩ B = A. Therefore

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A)}{P(A)} = 1
\qquad\text{and}\qquad
P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} \ge P(A) \quad (\text{since } P(B) \le 1)$$

(2) Chain rule of probability

For three events we have

$$P(A \cap B \cap C) = P\big((A \cap B) \cap C\big) = P(A \cap B)\,P(C/A \cap B) = P(A)\,P(B/A)\,P(C/A \cap B)$$

We can generalize the above to get the chain rule of probability

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2/A_1)\,P(A_3/A_1 \cap A_2)\cdots P(A_n/A_1 \cap A_2 \cap \cdots \cap A_{n-1})$$

(3) Theorem of Total Probability: Let A1, A2, ..., An be n events such that S = A1 ∪ A2 ∪ ... ∪ An and Ai ∩ Aj = ∅ for i ≠ j. Then for any event B,

$$P(B) = \sum_{i=1}^{n} P(A_i)\,P(B/A_i)$$

Proof: We have

$$\bigcup_{i=1}^{n} (B \cap A_i) = B$$

and the sequence B ∩ Ai, i = 1, ..., n is disjoint. Therefore

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(A_i)\,P(B/A_i)$$

Fig. Venn diagram: the partition A1, A2, A3, ... of S, with the event B intersecting each Ai.

Remark (1) A decomposition of a set S into two or more disjoint nonempty subsets is called a partition of S. The subsets A1, A2, ..., An form a partition of S if S = A1 ∪ A2 ∪ ... ∪ An and Ai ∩ Aj = ∅ for i ≠ j.

(2) The theorem of total probability can be used to determine the probability of a complex event in terms of related simpler events. This result will be used in Bayes' theorem, to be discussed at the end of the lecture.
Example 3 Suppose a box contains 2 white and 3 black balls. Two balls are picked at random without replacement. Let A1 = event that the first ball is white and Let A1c = event that the first

ball is black. Clearly A1 and A1c form a partition of the sample space corresponding to picking two balls from the box. Let B = the event that the second ball is white. Then P( B) = P( A1 ) P( B / A1 ) + P( A1c ) P( B / A1c )
$$P(B) = \frac{2}{5}\cdot\frac{1}{4} + \frac{3}{5}\cdot\frac{2}{4} = \frac{2}{5}$$

Independent events

Two events are called independent if the probability of occurrence of one event does not affect the probability of occurrence of the other. Thus the events A and B are independent if P(B/A) = P(B) and P(A/B) = P(A), where P(A) and P(B) are assumed to be non-zero. Equivalently, if A and B are independent, we have

P(A ∩ B) = P(A) P(B)

that is, the joint probability is the product of the individual probabilities.

Two events A and B are called statistically dependent if they are not independent. Similarly, we can define the independence of n events. The events A1, A2, ..., An are called independent if and only if, for all distinct indices,

P(Ai ∩ Aj) = P(Ai) P(Aj)
P(Ai ∩ Aj ∩ Ak) = P(Ai) P(Aj) P(Ak)
⋮
P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) P(A2) ... P(An)
Example 4 Consider the example of tossing a fair coin twice. The resulting sample space is given by S = {HH, HT, TH, TT} and all the outcomes are equiprobable. Let A = {TH, TT} be the event of getting a tail in the first toss and B = {TH, HH} be the event of getting a head in the second toss. Then P(A) = 1/2 and P(B) = 1/2. Again, A ∩ B = {TH}, so that

$$P(A \cap B) = \frac{1}{4} = P(A)\,P(B)$$

Hence the events A and B are independent.

Example 5 Consider the experiment of picking two balls at random discussed in Example 3. In this case, P(B) = 2/5 and P(B/A1) = 1/4. Therefore P(B) ≠ P(B/A1), and A1 and B are dependent.
Bayes Theorem

Suppose A1, A2, ..., An form a partition of S, so that S = A1 ∪ A2 ∪ ... ∪ An and Ai ∩ Aj = ∅ for i ≠ j. Suppose the event B occurs if one of the events A1, A2, ..., An occurs. Thus we have the information of the probabilities P(Ai) and P(B/Ai), i = 1, 2, ..., n. We ask the following question: given that B has occurred, what is the probability that a particular event Ak has occurred? In other words, what is P(Ak/B)?

We have, using the theorem of total probability,

$$P(B) = \sum_{i=1}^{n} P(A_i)\,P(B/A_i)$$

Therefore,

$$P(A_k/B) = \frac{P(A_k)\,P(B/A_k)}{P(B)} = \frac{P(A_k)\,P(B/A_k)}{\sum_{i=1}^{n} P(A_i)\,P(B/A_i)}$$

This result is known as Bayes' theorem. The probability P(Ak) is called the a priori probability and P(Ak/B) is called the a posteriori probability. Thus Bayes' theorem enables us to determine the a posteriori probability P(Ak/B) from the observation that B has occurred. This result is of practical importance and is at the heart of Bayesian classification, Bayesian estimation, etc.

Example 6 In a binary communication system a zero and a one are transmitted with probability 0.6 and 0.4 respectively. Due to errors in the communication system, a zero becomes a one with probability 0.1 and a one becomes a zero with probability 0.08. Determine the probability (i) of receiving a one and (ii) that a one was transmitted when the received message is a one.

Let S be the sample space corresponding to binary communication. Let T0 be the event of transmitting 0, T1 the event of transmitting 1, and R0 and R1 the corresponding events of receiving 0 and 1 respectively. Given P(T0) = 0.6, P(T1) = 0.4, P(R1/T0) = 0.1 and P(R0/T1) = 0.08, so that P(R1/T1) = 1 − 0.08 = 0.92.

(i) P(R1) = probability of receiving a 'one'
         = P(T1) P(R1/T1) + P(T0) P(R1/T0)
         = 0.4 × 0.92 + 0.6 × 0.1
         = 0.428

(ii) Using Bayes' rule,

$$P(T_1/R_1) = \frac{P(T_1)\,P(R_1/T_1)}{P(R_1)} = \frac{P(T_1)\,P(R_1/T_1)}{P(T_1)\,P(R_1/T_1) + P(T_0)\,P(R_1/T_0)} = \frac{0.4 \times 0.92}{0.4 \times 0.92 + 0.6 \times 0.1} \approx 0.86$$


Example 7 In an electronics laboratory, there are identical-looking capacitors of three makes A1, A2 and A3 in the ratio 2:3:4. It is known that 1% of A1, 1.5% of A2 and 2% of A3 are defective. What percentage of the capacitors in the laboratory are defective? If a capacitor picked at random is found to be defective, what is the probability that it is of make A3?

Let D be the event that the item is defective. Here we have to find P(D) and P(A3/D).

Here P(A1) = 2/9, P(A2) = 3/9 and P(A3) = 4/9, and the conditional probabilities are P(D/A1) = 0.01, P(D/A2) = 0.015 and P(D/A3) = 0.02. Therefore

P(D) = P(A1) P(D/A1) + P(A2) P(D/A2) + P(A3) P(D/A3)
     = (2/9)(0.01) + (3/9)(0.015) + (4/9)(0.02)
     ≈ 0.0161

and

$$P(A_3/D) = \frac{P(A_3)\,P(D/A_3)}{P(D)} = \frac{(4/9)(0.02)}{0.0161} \approx 0.55$$
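The same computation, written as a short total-probability/Bayes sketch (the dictionary names are illustrative):

```python
# Total probability and Bayes' rule for the capacitor example.
priors = {"A1": 2/9, "A2": 3/9, "A3": 4/9}            # make proportions
p_defective_given = {"A1": 0.01, "A2": 0.015, "A3": 0.02}

p_defective = sum(priors[m] * p_defective_given[m] for m in priors)
posterior_A3 = priors["A3"] * p_defective_given["A3"] / p_defective

print(round(p_defective, 4))    # ~0.0161
print(round(posterior_A3, 3))   # ~0.552
```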

Repeated Trials In our discussions so far, we considered the probability defined over a sample space corresponding to a random experiment. Often, we have to consider several random experiments in a sequence. For example, the experiment corresponding to the sequential transmission of bits through a communication system may be considered as a sequence of experiments, each representing the transmission of a single bit through the channel.

Suppose two experiments E1 and E2 with corresponding sample spaces S1 and S2 are performed sequentially. Such a combined experiment is called the product of the two experiments E1 and E2. Clearly, the outcome of this combined experiment consists of the ordered pair (s1, s2) where s1 ∈ S1 and s2 ∈ S2. The sample space corresponding to the combined experiment is given by S = S1 × S2. The events in S consist of all the Cartesian products of the form A1 × A2, where A1 is an event in S1 and A2 is an event in S2. Our aim is to define the probability P(A1 × A2).

The sample space S1 × S2 and the events A1 × S2, S1 × A2 and A1 × A2 are illustrated in Fig. We can easily show that P(S1 × A2) = P2(A2) and P(A1 × S2) = P1(A1), where Pi is the probability defined on the events of Si, i = 1, 2. This is because the event A1 × S2 in S occurs whenever A1 occurs in S1, irrespective of the event in S2.

Also note that A1 × A2 = (A1 × S2) ∩ (S1 × A2), so that

P(A1 × A2) = P[(A1 × S2) ∩ (S1 × A2)]

Fig. The product sample space S1 × S2 with the events A1 × S2, S1 × A2 and their intersection A1 × A2.

Independent Experiments: In many experiments, the events A1 × S2 and S1 × A2 are independent for every selection of A1 ⊆ S1 and A2 ⊆ S2. Such experiments are called independent experiments. In this case we can write

P(A1 × A2) = P[(A1 × S2) ∩ (S1 × A2)] = P(A1 × S2) P(S1 × A2) = P1(A1) P2(A2)

Example 1 Suppose S1 is the sample space of the experiment of rolling a six-faced fair die and S2 is the sample space of the experiment of tossing a fair coin. Clearly, S1 = {1, 2, 3, 4, 5, 6}, A1 = {2, 3} and S2 = {H, T}, A2 = {H}. Then

S1 × S2 = {(1, H), (1, T), (2, H), (2, T), (3, H), (3, T), (4, H), (4, T), (5, H), (5, T), (6, H), (6, T)}

and

P(A1 × A2) = P1(A1) P2(A2) = (2/6) × (1/2) = 1/6

Example 2

In a digital communication system transmitting 1 and 0, a 1 is transmitted twice as often as a 0. If two bits are transmitted in a sequence (independently), what is the probability that both bits will be 1?

S1 = {0, 1}, A1 = {1} and S2 = {0, 1}, A2 = {1}, with P1(A1) = P2(A2) = 2/3. Then

S1 × S2 = {(0, 0), (0, 1), (1, 0), (1, 1)}

and

P(A1 × A2) = P1(A1) P2(A2) = (2/3) × (2/3) = 4/9

Generalization: We can similarly define the sample space S = S1 × S2 × ... × Sn corresponding to n experiments, and the Cartesian product of events

A1 × A2 × ... × An = {(s1, s2, ..., sn) | s1 ∈ A1, s2 ∈ A2, ..., sn ∈ An}.

If the experiments are independent, we can write

P(A1 × A2 × ... × An) = P1(A1) P2(A2) ... Pn(An)

where Pi is the probability defined on the events of Si.

Bernoulli trial

Suppose in an experiment we are only concerned with whether a particular event A has occurred or not. We call this event the 'success', with probability P(A) = p, and the complementary event Aᶜ the 'failure', with probability P(Aᶜ) = 1 − p. Such a random experiment is called a Bernoulli trial.

Binomial Law:

We are interested in finding the probability of k successes in n independent Bernoulli trials. This probability p_n(k) is given by

$$p_n(k) = {^{n}C_k}\, p^k (1-p)^{n-k}$$

Consider n independent repetitions of the Bernoulli trial. Let S1 be the sample space associated with each trial, and suppose we are interested in a particular event A ⊆ S1 and its complement Aᶜ, such that P(A) = p and P(Aᶜ) = 1 − p. If A occurs in a trial, then we have a success, otherwise a failure. Thus the sample space corresponding to the n repeated trials is S = S1 × S2 × ... × Sn. Any event in S is of the form A1 × A2 × ... × An, where some of the Ai's are A and the remaining Ai's are Aᶜ. Using the property of independent experiments we have

P(A1 × A2 × ... × An) = P(A1) P(A2) ... P(An).

If k of the Ai's are A and the remaining n − k Ai's are Aᶜ, then

P(A1 × A2 × ... × An) = p^k (1 − p)^(n−k)

But there are ⁿC_k events in S with k A's and n − k Aᶜ's. For example, if n = 4 and k = 2, the possible events are

A × A × Aᶜ × Aᶜ, A × Aᶜ × A × Aᶜ, A × Aᶜ × Aᶜ × A, Aᶜ × A × A × Aᶜ, Aᶜ × A × Aᶜ × A, Aᶜ × Aᶜ × A × A.

We also note that all these ⁿC_k events are mutually exclusive. Hence the probability of k successes in n independent repetitions of the Bernoulli trial is given by

$$p_n(k) = {^{n}C_k}\, p^k (1-p)^{n-k}$$
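A minimal helper implementing the binomial law (the function name is ours, used only for illustration):

```python
from math import comb

def binomial_pmf(n: int, k: int, p: float) -> float:
    """P(k successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(4, 2, 0.5))                             # 0.375
# The pmf sums to 1 over k = 0, ..., n.
print(sum(binomial_pmf(4, k, 0.5) for k in range(5)))      # 1.0
```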

Example 1:

A fair die is rolled 6 times. What is the probability that a 4 appears exactly four times?

Solution:

We have S1 = {1, 2, 3, 4, 5, 6} and A = {4} with P(A) = p = 1/6, so that Aᶜ = {1, 2, 3, 5, 6} with P(Aᶜ) = 1 − p = 5/6. Therefore

$$p_6(4) = {^{6}C_4} \left(\frac{1}{6}\right)^{4} \left(\frac{5}{6}\right)^{2} = \frac{6 \cdot 5}{2}\cdot\frac{1}{6^4}\cdot\frac{5^2}{6^2} \approx 0.008$$

Example 2:

A communication source emits binary symbols 1 and 0 with probability 0.6 and 0.4 respectively. What is the probability that there will be 5 1s in a message of 20 symbols?

Solution:

S1 = {0, 1}, A = {1}, P(A) = p = 0.6. Therefore

$$p_{20}(5) = {^{20}C_5}\,(0.6)^5 (0.4)^{15} \approx 0.0013$$

Example 3 In a binary communication system, a bit error occurs with a probability of 10⁻⁵. What is the probability of getting at least one error bit in a message of 8 bits?

Here we can consider the sample space S1 = {error in transmission of 1 bit, no error in transmission of 1 bit} with p = P(error in transmission of 1 bit) = 10⁻⁵.

Probability of no bit error in the transmission of 8 bits = p₈(0) = (1 − 10⁻⁵)⁸ ≈ 0.9999.

Probability of at least one bit error in the transmission of 8 bits = 1 − p₈(0) ≈ 0.0001.

Typical plots of binomial probabilities are shown in the figure.

Approximations of the Binomial probabilities

Two interesting approximations of the binomial probabilities are very important.


Case 1 Suppose n is very large, p is very small, and np = λ, a constant. Then

$$p_n(k) = {^{n}C_k}\, p^k (1-p)^{n-k} = {^{n}C_k} \left(\frac{\lambda}{n}\right)^k \left(1-\frac{\lambda}{n}\right)^{n-k}
= \frac{n(n-1)\cdots(n-k+1)}{k!}\cdot\frac{\lambda^k}{n^k}\cdot\frac{(1-\lambda/n)^n}{(1-\lambda/n)^k}$$

$$= \frac{\lambda^k}{k!}\,(1-1/n)(1-2/n)\cdots\big(1-(k-1)/n\big)\,\frac{(1-\lambda/n)^n}{(1-\lambda/n)^k}$$

Since

$$\lim_{n\to\infty}\Big(1-\frac{\lambda}{n}\Big)^k = 1 \quad\text{and}\quad \lim_{n\to\infty}\Big(1-\frac{\lambda}{n}\Big)^n = e^{-\lambda},$$

we get

$$p_n(k) \approx \frac{\lambda^k e^{-\lambda}}{k!}$$

This distribution is known as the Poisson probability and is widely used in engineering and other fields. We shall discuss more about this distribution in a later class.

Case 2 When n is sufficiently large and np(1 − p) ≫ 1, p_n(k) may be approximated as

$$p_n(k) \approx \frac{1}{\sqrt{2\pi\, np(1-p)}}\; e^{-\frac{(k-np)^2}{2np(1-p)}}$$

The right-hand side is an expression for the normal (Gaussian) distribution, to be discussed in a later class.
Example: Consider the problem in Example 2. Here n = 20 and p = 0.6, so np = 12 and np(1 − p) = 4.8, and

$$p_{20}(5) \approx \frac{1}{\sqrt{2\pi \times 20 \times 0.6 \times 0.4}}\; e^{-\frac{(5-12)^2}{2 \times 20 \times 0.6 \times 0.4}} \approx 0.0011$$

which is close to the exact binomial value of about 0.0013.
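The quality of the two approximations can be checked numerically; the sketch below compares the exact binomial pmf with the Poisson and Gaussian approximations for an arbitrarily chosen n, p and k.

```python
from math import comb, exp, factorial, pi, sqrt

def binomial_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_approx(n, k, p):
    lam = n * p                       # intended for large n and small p
    return lam**k * exp(-lam) / factorial(k)

def normal_approx(n, k, p):
    var = n * p * (1 - p)             # intended for n*p*(1-p) >> 1
    return exp(-(k - n * p)**2 / (2 * var)) / sqrt(2 * pi * var)

n, p, k = 100, 0.05, 3
print(binomial_pmf(n, k, p), poisson_approx(n, k, p), normal_approx(n, k, p))
```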

Random Variables
In applications of probability, we are often concerned with numerical values which are random in nature. These random quantities may be considered as real-valued functions on the sample space. Such a real-valued function is called a real random variable and plays an important role in describing random data. We shall introduce the concept of random variables in the following sections.

Mathematical Preliminaries

Real-valued point function on a set

Recall that a real-valued function f : S → ℝ maps each element s ∈ S to a unique element f(s) ∈ ℝ. The set S is called the domain of f and the set R_f = {f(x) | x ∈ S} is called the range of f. Clearly R_f ⊆ ℝ. The range and domain of f are shown in Fig.

Fig. Domain S = {s1, s2, s3, s4} and range {f(s1), f(s2), f(s3), ...} of a function f.

Image and Inverse image

For a point s ∈ S, the functional value f(s) is called the image of the point s. If A ⊆ S, then the set of the images of elements of A is called the image of A and is denoted by f(A). Thus

f(A) = {f(s) | s ∈ A}.

Clearly f(A) ⊆ R_f.

Suppose B ⊆ ℝ. The set {x | f(x) ∈ B} is called the inverse image of B under f and is denoted by f⁻¹(B).

Example Suppose S = {H, T} and f : S → ℝ is defined by f(H) = 1 and f(T) = −1. Therefore R_f = {1, −1}.

The image of H is 1 and that of T is −1. For a subset of ℝ, say B1 = (−∞, 1.5],

f⁻¹(B1) = {s | f(s) ∈ B1} = {H, T}.

For another subset B2 = [5, 9], f⁻¹(B2) = ∅.

Random variable


A random variable associates the points in the sample space with real numbers. Consider the probability space (S, F, P) and a function X : S → ℝ mapping the sample space S into the real line. Let us define the probability of a subset B ⊆ ℝ by

P_X(B) = P(X⁻¹(B)) = P({s | X(s) ∈ B}).

Such a definition will be valid only if X⁻¹(B) is a valid event. If S is a discrete sample space, X⁻¹(B) is always a valid event, but the same may not be true if S is infinite. The concept of sigma algebra is again necessary to overcome this difficulty. We also need the Borel sigma algebra B, the sigma algebra defined on the real line. The function X : S → ℝ is called a random variable if the inverse image of every Borel set under X is an event. Thus, if X is a random variable, then

X⁻¹(B) = {s | X(s) ∈ B} ∈ F for every Borel set B.

Fig. Random variable: the inverse image A = X⁻¹(B) ⊆ S of a Borel set B ⊆ ℝ.

Remark: S is the domain of X. The range of X, denoted by R_X, is given by R_X = {X(s) | s ∈ S}. Clearly R_X ⊆ ℝ.

Notations: Random variables are represented by upper-case letters. Values of a random variable are denoted by lower-case letters. X(s) = x means that x is the value of the random variable X at the sample point s. Usually the argument s is omitted and we simply write X = x.

The above definition of the random variable requires that the mapping X is such that X⁻¹(B) is a valid event in S. If S is a discrete sample space, this requirement is met by any mapping X : S → ℝ. Thus any mapping defined on a discrete sample space is a random variable.

Example 1: Consider the example of tossing a fair coin twice. The sample space is S = {HH, HT, TH, TT} and all four outcomes are equally likely. Then we can define a random variable X as follows:

Sample point    Value of the random variable X = x
HH              0
HT              1
TH              2
TT              3

Here R_X = {0, 1, 2, 3}.

Example 2: Consider the sample space associated with a single toss of a fair die. The sample space is given by S = {1, 2, 3, 4, 5, 6}. If we define the random variable X that associates with each outcome a real number equal to the number on the face of the die, then

R_X = {1, 2, 3, 4, 5, 6}

Probability Space induced by a Random Variable

The random variable X induces a probability measure P_X on B defined by

P_X(B) = P(X⁻¹(B)) = P({s | X(s) ∈ B}).

The probability measure P_X satisfies the three axioms of probability:

Axiom 1: 0 ≤ P_X(B) = P(X⁻¹(B)) ≤ 1.

Axiom 2: P_X(ℝ) = P(X⁻¹(ℝ)) = P(S) = 1.

Axiom 3: Suppose B1, B2, ... are disjoint Borel sets. Then X⁻¹(B1), X⁻¹(B2), ... are disjoint events in F. Therefore

$$P_X\Big(\bigcup_{i=1}^{\infty} B_i\Big) = P\Big(X^{-1}\Big(\bigcup_{i=1}^{\infty} B_i\Big)\Big) = P\Big(\bigcup_{i=1}^{\infty} X^{-1}(B_i)\Big) = \sum_{i=1}^{\infty} P\big(X^{-1}(B_i)\big) = \sum_{i=1}^{\infty} P_X(B_i)$$

Thus the random variable X induces a probability space (ℝ, B, P_X).

Probability Distribution Function

We have seen that the event B and {s | X(s) ∈ B} are equivalent and P_X(B) = P({s | X(s) ∈ B}). The underlying sample space is omitted in the notation and we simply write {X ∈ B} and P({X ∈ B}) instead of {s | X(s) ∈ B} and P({s | X(s) ∈ B}) respectively.

Consider the Borel set (−∞, x], where x represents any real number. The equivalent event X⁻¹((−∞, x]) = {s | X(s) ≤ x, s ∈ S} is denoted as {X ≤ x}. The event {X ≤ x} can be taken as a representative event in studying the probability description of a random variable X. Any other event can be represented in terms of this event. For example,

{X > x} = {X ≤ x}ᶜ,  {x1 < X ≤ x2} = {X ≤ x2} \ {X ≤ x1},  {X = x} = ⋂ₙ₌₁^∞ ({X ≤ x} \ {X ≤ x − 1/n}),

and so on. The probability P({X ≤ x}) = P({s | X(s) ≤ x, s ∈ S}) is called the probability distribution function (also called the cumulative distribution function, abbreviated as CDF) of X and is denoted by F_X(x). Thus, for x ∈ (−∞, ∞),

F_X(x) = P({X ≤ x})
Example 3 Consider the random variable X in Example 1. We have

Value of X = x    P({X = x})
0                 1/4
1                 1/4
2                 1/4
3                 1/4

For x < 0, F_X(x) = P({X ≤ x}) = 0.

For 0 ≤ x < 1, F_X(x) = P({X ≤ x}) = P({X = 0}) = 1/4.

For 1 ≤ x < 2,
F_X(x) = P({X ≤ x}) = P({X = 0} ∪ {X = 1}) = P({X = 0}) + P({X = 1}) = 1/4 + 1/4 = 1/2.

For 2 ≤ x < 3,
F_X(x) = P({X ≤ x}) = P({X = 0} ∪ {X = 1} ∪ {X = 2}) = P({X = 0}) + P({X = 1}) + P({X = 2}) = 1/4 + 1/4 + 1/4 = 3/4.

For x ≥ 3, F_X(x) = P({X ≤ x}) = P(S) = 1.
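The staircase CDF of this example can be evaluated with a short helper (illustrative code):

```python
from bisect import bisect_right

# CDF of the discrete random variable of Examples 1 and 3:
# X takes the values 0, 1, 2, 3, each with probability 1/4.
values = [0, 1, 2, 3]
pmf = [0.25, 0.25, 0.25, 0.25]

def cdf(x: float) -> float:
    """F_X(x) = sum of the pmf over all values <= x (a right-continuous step function)."""
    return sum(pmf[: bisect_right(values, x)])

for x in (-1, 0, 0.5, 1, 2.7, 3, 10):
    print(x, cdf(x))
```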

Properties of the Distribution Function

1. 0 ≤ F_X(x) ≤ 1.

This follows from the fact that F_X(x) is a probability, and its value should lie between 0 and 1.

2. F_X(x) is a non-decreasing function of x. Thus, if x1 < x2, then F_X(x1) ≤ F_X(x2). This is because

x1 < x2 ⇒ {X(s) ≤ x1} ⊆ {X(s) ≤ x2} ⇒ P{X(s) ≤ x1} ≤ P{X(s) ≤ x2} ⇒ F_X(x1) ≤ F_X(x2).

3. F_X(x) is right continuous, i.e.

$$F_X(x^+) = \lim_{h \to 0,\, h>0} F_X(x+h) = F_X(x)$$

because

$$\lim_{h \to 0,\, h>0} F_X(x+h) = \lim_{h \to 0,\, h>0} P\{X(s) \le x+h\} = P\{X(s) \le x\} = F_X(x).$$

(Recall that a real function f(x) is continuous at a point a if and only if f(a) is defined and lim_{x→a+} f(x) = lim_{x→a−} f(x) = f(a); it is right-continuous at a if and only if f(a) is defined and lim_{x→a+} f(x) = f(a).)

4. F_X(−∞) = 0, because F_X(−∞) = P{s | X(s) ≤ −∞} = P(∅) = 0.

5. F_X(∞) = 1, because F_X(∞) = P{s | X(s) ≤ ∞} = P(S) = 1.

6. P({x1 < X ≤ x2}) = F_X(x2) − F_X(x1).

We have {X ≤ x2} = {X ≤ x1} ∪ {x1 < X ≤ x2}, a disjoint union, so
P({X ≤ x2}) = P({X ≤ x1}) + P({x1 < X ≤ x2})
⇒ P({x1 < X ≤ x2}) = P({X ≤ x2}) − P({X ≤ x1}) = F_X(x2) − F_X(x1).

7. F_X(x⁻) = F_X(x) − P(X = x), since

$$F_X(x^-) = \lim_{h \to 0,\, h>0} F_X(x-h) = \lim_{h \to 0,\, h>0} P\{X(s) \le x-h\} = P\{X(s) \le x\} - P(X(s) = x) = F_X(x) - P(X = x).$$

We can further establish the following results on probabilities of events on the real line:

P({x1 ≤ X ≤ x2}) = F_X(x2) − F_X(x1) + P(X = x1)
P({x1 ≤ X < x2}) = F_X(x2) − F_X(x1) + P(X = x1) − P(X = x2)
P({X > x}) = P({x < X < ∞}) = 1 − F_X(x)

Thus we have seen that, given F_X(x) for −∞ < x < ∞, we can determine the probability of any event involving values of the random variable X. Thus F_X(x), x ∈ ℝ, is a complete description of the random variable X.

Example 4 Consider the random variable X defined by the distribution function

F_X(x) = 0,             x < −2
       = x/8 + 1/4,     −2 ≤ x < 0
       = 1,             x ≥ 0

Find
a) P(X = 0)
b) P{X ≤ 0}
c) P{X > 2}
d) P{−1 < X ≤ 1}

Solution:

a) P(X = 0) = F_X(0⁺) − F_X(0⁻) = 1 − 1/4 = 3/4
b) P{X ≤ 0} = F_X(0) = 1
c) P{X > 2} = 1 − F_X(2) = 1 − 1 = 0
d) P{−1 < X ≤ 1} = F_X(1) − F_X(−1) = 1 − 1/8 = 7/8

Fig. Plot of F_X(x).

Conditional Distribution and Density function:

We discussed conditional probability in an earlier class. For two events A and B with P(B) ≠ 0, the conditional probability P(A/B) was defined as

$$P(A/B) = \frac{P(A \cap B)}{P(B)}$$

Clearly, the conditional probability can be defined on events involving a random variable X. Consider the event {X ≤ x} and any event B involving the random variable X. The conditional distribution function of X given B is defined as

$$F_X(x/B) = P\big(\{X \le x\}/B\big) = \frac{P\big(\{X \le x\} \cap B\big)}{P(B)}, \qquad P(B) \ne 0$$

We can verify that F_X(x/B) satisfies all the properties of the distribution function. In a similar manner, we can define the conditional density function f_X(x/B) of the random variable X given the event B as

$$f_X(x/B) = \frac{d}{dx} F_X(x/B)$$

Example 1: Suppose X is a random variable with the distribution function F_X(x). Define B = {X ≤ b}. Then

$$F_X(x/B) = \frac{P\big(\{X \le x\} \cap B\big)}{P(B)} = \frac{P\big(\{X \le x\} \cap \{X \le b\}\big)}{P\{X \le b\}} = \frac{P\big(\{X \le x\} \cap \{X \le b\}\big)}{F_X(b)}$$

Case 1: x < b. Then {X ≤ x} ∩ {X ≤ b} = {X ≤ x}, so that

$$F_X(x/B) = \frac{P\{X \le x\}}{F_X(b)} = \frac{F_X(x)}{F_X(b)} \qquad\text{and}\qquad f_X(x/B) = \frac{d}{dx}\,\frac{F_X(x)}{F_X(b)} = \frac{f_X(x)}{F_X(b)}$$

Case 2: x ≥ b. Then {X ≤ x} ∩ {X ≤ b} = {X ≤ b}, so that

$$F_X(x/B) = \frac{P\{X \le b\}}{F_X(b)} = \frac{F_X(b)}{F_X(b)} = 1 \qquad\text{and}\qquad f_X(x/B) = \frac{d}{dx} F_X(x/B) = 0$$

F_X(x/B) and f_X(x/B) are plotted in the following figures.


Remark: We can state a Bayes-rule type result in a similar manner. Suppose the sample space is partitioned into non-overlapping events B1, B2, ..., Bn such that S = ⋃ᵢ₌₁ⁿ Bi. Then, by the theorem of total probability,

$$F_X(x) = \sum_{i=1}^{n} P(B_i)\, F_X(x/B_i)$$

and, for any one of these events B,

$$P\big(B/\{X \le x\}\big) = \frac{P\big(B \cap \{X \le x\}\big)}{F_X(x)} = \frac{P(B)\,F_X(x/B)}{\sum_{i=1}^{n} P(B_i)\,F_X(x/B_i)}$$

Mixed Type Random Variable:

A random variable X is said to be of mixed type if its distribution function F_X(x) is discontinuous at a finite (or countably infinite) number of points and increases strictly with x over at least one interval of values of the random variable X. Thus, for a mixed-type random variable X, F_X(x) has discontinuities, but it is not of staircase type as in the case of a discrete random variable. A typical plot of the distribution function of a mixed-type random variable is shown in the figure.

Suppose S_D denotes the countable subset of points of S_X on which the RV X is characterized by the probability mass function p_X(x), x ∈ S_D. Similarly, let S_C be a continuous subset of points of S_X on which the RV is characterized by the probability density function f_X(x), x ∈ S_C. Clearly the subsets S_D and S_C partition the set S_X. If P(S_D) = p, then P(S_C) = 1 − p. Thus the probability of the event {X ≤ x} can be expressed as

P{X ≤ x} = P(S_D) P({X ≤ x} | S_D) + P(S_C) P({X ≤ x} | S_C)
         = p F_D(x) + (1 − p) F_C(x)

where F_D(x) is the conditional distribution function of X given that X is discrete and F_C(x) is the conditional distribution function of X given that X is continuous.

Clearly F_D(x) is a staircase function and F_C(x) is a continuous function.

Also note that

$$p = P(S_D) = \sum_{x \in S_D} p_X(x)$$

Example: Consider the distribution function of a random variable X given by

F_X(x) = 0,                   x < 0
       = 1/4 + x/16,          0 ≤ x < 4
       = 3/4 + (x − 4)/16,    4 ≤ x ≤ 8
       = 1,                   x > 8

Express F_X(x) as the distribution function of a mixed-type random variable.

Solution:

The distribution function F_X(x) is as shown in the figure. Clearly F_X(x) has jumps at x = 0 and x = 4, each of size 1/4, so that

p = p_X(0) + p_X(4) = 1/4 + 1/4 = 1/2

and

F_D(x) = (1/p)[p_X(0) u(x) + p_X(4) u(x − 4)] = (1/2) u(x) + (1/2) u(x − 4)

F_C(x) = 0,      x < 0
       = x/8,    0 ≤ x ≤ 8
       = 1,      x > 8

Example 2: X is the RV representing the lifetime of a device, with pdf f_X(x) for x > 0. Define the random variable

Y = X   if X ≤ a
  = a   if X > a

Here S_C = (0, a), S_D = {a}, and

p = P{Y ∈ S_D} = P{X > a} = 1 − F_X(a)

so that

F_Y(y) = p F_D(y) + (1 − p) F_C(y)

Discrete, Continuous and Mixed-type random variables

A random variable X is called a discrete random variable if F_X(x) is piecewise constant. Thus F_X(x) is flat except at the points of jump discontinuity. If the sample space S is discrete, the random variable X defined on it is always discrete.

X is called a continuous random variable if F_X(x) is an absolutely continuous function of x. Thus F_X(x) is continuous everywhere on ℝ and F_X′(x) exists everywhere except at a finite or countably infinite number of points.

X is called a mixed random variable if F_X(x) has jump discontinuities at a countable number of points and increases continuously over at least one interval of values of x. For such an RV X,

F_X(x) = p F_Xd(x) + (1 − p) F_Xc(x)

where F_Xd(x) is the distribution function of a discrete RV and F_Xc(x) is the distribution function of a continuous RV. Typical plots of F_X(x) for discrete, continuous and mixed random variables are shown in the figures below.

Fig. Plot of F_X(x) vs. x for a discrete random variable.

Fig. Plot of F_X(x) vs. x for a continuous random variable.

Fig. Plot of F_X(x) vs. x for a mixed-type random variable.

Discrete Random Variables and Probability mass functions
A random variable is said to be discrete if the number of elements in the range R_X is finite or countably infinite. Examples 1 and 2 above are discrete random variables. Assume R_X to be finite, and let x1, x2, x3, ..., xN be the elements in R_X. Here the mapping X(s) partitions S into N subsets {s | X(s) = xi}, i = 1, 2, ..., N.

The discrete random variable in this case is completely specified by the probability mass function (pmf)

p_X(xi) = P({s | X(s) = xi}), i = 1, 2, ..., N.

Clearly,

p_X(xi) ≥ 0 for every xi ∈ R_X, and Σ_{xi ∈ R_X} p_X(xi) = 1.

Suppose D ⊆ R_X. Then

$$P(\{X \in D\}) = \sum_{x_i \in D} p_X(x_i)$$

Fig. A discrete random variable mapping the sample points s1, s2, s3, s4 to the values X(s1), X(s2), X(s3), X(s4).

Example Consider the random variable X with the distribution function

F_X(x) = 0,     x < 0
       = 1/4,   0 ≤ x < 1
       = 1/2,   1 ≤ x < 2
       = 1,     x ≥ 2

The plot of F_X(x) is shown in Fig. The probability mass function of the random variable is given by

Value of X = x    p_X(x)
0                 1/4
1                 1/4
2                 1/2

We shall describe some useful discrete probability mass functions in a later class.
Continuous Random Variables and Probability Density Functions

For a continuous random variable X, F_X(x) is continuous everywhere. Therefore F_X(x) = F_X(x⁻) for every x. This implies that

p_X(x) = P({X = x}) = F_X(x) − F_X(x⁻) = 0.

Therefore, the probability mass function of a continuous RV X is zero for all x. A continuous random variable cannot be characterized by the probability mass function. Instead, a continuous random variable has a very important characterization in terms of a function called the probability density function.

If F_X(x) is differentiable, the probability density function (pdf) of X, denoted by f_X(x), is defined as

$$f_X(x) = \frac{d}{dx} F_X(x)$$

Interpretation of f_X(x):

$$f_X(x) = \frac{d}{dx} F_X(x) = \lim_{\Delta x \to 0} \frac{F_X(x + \Delta x) - F_X(x)}{\Delta x} = \lim_{\Delta x \to 0} \frac{P(\{x < X \le x + \Delta x\})}{\Delta x}$$

so that

P({x < X ≤ x + Δx}) ≈ f_X(x) Δx

Thus the probability of X lying in some small interval (x, x + Δx] is determined by f_X(x). In that sense, f_X(x) represents the concentration of probability, just as density represents the concentration of mass.
Properties of the Probability Density Function

1. f_X(x) ≥ 0. This follows from the fact that F_X(x) is a non-decreasing function.

2. $$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$

3. $$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$

4. $$P(x_1 < X \le x_2) = \int_{x_1}^{x_2} f_X(x)\,dx$$

Fig. P({x0 < X ≤ x0 + Δx0}) ≈ f_X(x0) Δx0: the area under f_X(x) between x0 and x0 + Δx0.

Example Consider the random variable X with the distribution function

F_X(x) = 0,               x < 0
       = 1 − e^(−ax),     x ≥ 0, a > 0

The pdf of the RV is given by

f_X(x) = 0,               x < 0
       = a e^(−ax),       x ≥ 0, a > 0

Remark: Using the Dirac delta function we can define the density function for a discrete random variable. Consider the random variable X defined by the probability mass function (pmf) p_X(xi) = P({s | X(s) = xi}), i = 1, 2, ..., N. The distribution function F_X(x) can be written as

$$F_X(x) = \sum_{i=1}^{N} p_X(x_i)\, u(x - x_i)$$

where u(x − xi) is the shifted unit-step function given by

u(x − xi) = 1 for x ≥ xi, and 0 otherwise.

Then the density function f_X(x) can be written in terms of the Dirac delta function as

$$f_X(x) = \sum_{i=1}^{N} p_X(x_i)\, \delta(x - x_i)$$

Example Consider the random variable defined in Example 1 and Example 3. The distribution function F_X(x) can be written as

F_X(x) = (1/4) u(x) + (1/4) u(x − 1) + (1/2) u(x − 2)

and

f_X(x) = (1/4) δ(x) + (1/4) δ(x − 1) + (1/2) δ(x − 2)
Probability Density Function of a Mixed-type Random Variable

Suppose X is a mixed-type random variable with F_X(x) having jump discontinuities at X = xi, i = 1, 2, ..., n. As already stated, the CDF of a mixed-type random variable X is given by

F_X(x) = p F_Xd(x) + (1 − p) F_Xc(x)

where F_Xd(x) is the distribution function of a discrete RV and F_Xc(x) is the distribution function of a continuous RV. Therefore

f_X(x) = p f_Xd(x) + (1 − p) f_Xc(x)

where

$$f_{Xd}(x) = \frac{1}{p}\sum_{i=1}^{n} p_X(x_i)\,\delta(x - x_i)$$

Example Consider the random variable X with the distribution function

F_X(x) = 0,            x < 0
       = 0.1,          x = 0
       = 0.1 + 0.8x,   0 < x < 1
       = 1,            x ≥ 1

The plot of F_X(x) is shown in Fig. F_X(x) can be expressed as

F_X(x) = 0.2 F_Xd(x) + 0.8 F_Xc(x)

where

F_Xd(x) = 0,     x < 0
        = 0.5,   0 ≤ x < 1
        = 1,     x ≥ 1

and

F_Xc(x) = 0,   x < 0
        = x,   0 ≤ x ≤ 1
        = 1,   x > 1

The pdf is given by

f_X(x) = 0.2 f_Xd(x) + 0.8 f_Xc(x)

where

f_Xd(x) = 0.5 δ(x) + 0.5 δ(x − 1)

and

f_Xc(x) = 1,   0 ≤ x ≤ 1
        = 0,   elsewhere

Functions of Random Variables

Often we have to consider random variables which are functions of other random variables. Let X be a random variable and g(·) a function. Then Y = g(X) is a random variable. We are interested in finding the pdf of Y. For example, suppose X represents the random voltage input to a full-wave rectifier. Then the rectifier output Y is given by Y = |X|. We have to find the probability description of the random variable Y. We consider the following cases:


(a) X is a discrete random variable with probability mass function p X ( x) The probability mass function of Y is given by pY ( y ) = P ({Y = y})
= P ({ x | g ( x) = y}) = =
{ x| g ( x ) = y} { x| g ( x ) = y}

P ({ X = x}) p X ( x)

(b) X is a continuous random variable with probability density function y = g ( x ) is one-to-one and monotonically increasing The probability distribution function of Y is given by

f X ( x) and

FY ( y ) = P {Y y} = P { g ( X ) y} = P { X g 1 ( y )} = P({ X x}) x = g ( y ) = FX ( x) ]x = g 1 ( y ) fY ( y ) = = = dFY ( y ) dy dFX ( x) dy x = g 1 ( y ) dFX ( x) dx dx dy x = g 1 ( y )

dFX ( x) = dx dy dx x = g 1 ( y ) = f X ( x) g ( x) x = g 1 ( y )

fY ( y ) =

f X ( x) f X ( x) = dy g ( x) x = g 1 ( y ) dx

This is illustrated in Fig.

Example 1: Probability density function of a linear function of a random variable

Suppose Y = aX + b, a > 0. Then x = (y − b)/a and dy/dx = a, so that

$$f_Y(y) = \frac{f_X(x)}{dy/dx}\bigg|_{x=(y-b)/a} = \frac{f_X\big(\frac{y-b}{a}\big)}{a}$$

Example 2: Probability density function of the distribution function of a random variable

Suppose the distribution function F_X(x) of a continuous random variable X is monotonically increasing and one-to-one, and define the random variable Y = F_X(X). Then f_Y(y) = 1 for 0 ≤ y ≤ 1.

To see this, note that y = F_X(x), so clearly 0 ≤ y ≤ 1, and dy/dx = dF_X(x)/dx = f_X(x). Therefore

$$f_Y(y) = \frac{f_X(x)}{dy/dx} = \frac{f_X(x)}{f_X(x)} = 1, \qquad 0 \le y \le 1.$$

Remark (1) The distribution given by f_Y(y) = 1, 0 ≤ y ≤ 1, is called the uniform distribution over the interval [0, 1].

(2) The above result is particularly important in simulating a random variable with a particular distribution function. We assumed F_X(x) to be a one-to-one function for invertibility. However, the result is more general: the random variable defined by the distribution function of any random variable is uniformly distributed over [0, 1]. For example, if X is a discrete RV,

F_Y(y) = P(Y ≤ y) = P(F_X(X) ≤ y) = P(X ≤ F_X⁻¹(y)) = F_X(F_X⁻¹(y)) = y

(assigning F_X⁻¹(y) to the left-most point of the interval for which F_X(x) = y). Therefore

f_Y(y) = dF_Y(y)/dy = 1, 0 ≤ y ≤ 1.

Fig. Y = F_X(X) and x = F_X⁻¹(y).
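This is the basis of inverse-transform sampling: draw U uniform on [0, 1] and output F_X⁻¹(U). A minimal sketch, assuming the exponential CDF F_X(x) = 1 − e^(−ax) from an earlier example and an arbitrarily chosen value of a:

```python
import math
import random

random.seed(42)
a = 2.0                      # rate parameter of F_X(x) = 1 - exp(-a*x), x >= 0

def inverse_cdf(u: float) -> float:
    """x = F_X^{-1}(u) for the exponential distribution with rate a."""
    return -math.log(1.0 - u) / a

# Draw uniform samples and map them through the inverse CDF.
samples = [inverse_cdf(random.random()) for _ in range(100_000)]
print(sum(samples) / len(samples))   # sample mean, should be close to 1/a = 0.5
```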

(c) X is a continuous random variable with probability density function f_X(x), and y = g(x) has multiple solutions for x

Suppose for y ∈ R_Y the equation y = g(x) has solutions xi, i = 1, 2, 3, ..., n. Then

$$f_Y(y) = \sum_{i=1}^{n} \frac{f_X(x_i)}{\left|\dfrac{dy}{dx}\right|_{x=x_i}}$$

Proof:

Consider the plot of Y = g(X). Suppose at a point y, the equation y = g(x) has three distinct roots x1, x2, x3 as shown. Consider the event {y < Y ≤ y + dy}. This event is equivalent to the union of the events

{x1 < X ≤ x1 + dx1}, {x2 − dx2 < X ≤ x2} and {x3 < X ≤ x3 + dx3}.

Therefore

P{y < Y ≤ y + dy} = P{x1 < X ≤ x1 + dx1} + P{x2 − dx2 < X ≤ x2} + P{x3 < X ≤ x3 + dx3}

so that

f_Y(y) dy = f_X(x1) dx1 + f_X(x2)(−dx2) + f_X(x3) dx3

where the negative sign in dx2 is used to account for a positive probability (the slope of g is negative at x2, so dx2 is negative when dy is positive). Therefore, dividing by dy and taking the limit, we get

$$f_Y(y) = f_X(x_1)\left|\frac{dx_1}{dy}\right| + f_X(x_2)\left|\frac{dx_2}{dy}\right| + f_X(x_3)\left|\frac{dx_3}{dy}\right| = \sum_{i=1}^{3} \frac{f_X(x_i)}{\left|\dfrac{dy}{dx}\right|_{x=x_i}}$$

In the above we assumed y = g(x) to have three roots. In general, if y = g(x) has n roots, then

$$f_Y(y) = \sum_{i=1}^{n} \frac{f_X(x_i)}{\left|\dfrac{dy}{dx}\right|_{x=x_i}}$$

Example 3: Probability density function of a linear function of a random variable

Suppose Y = aX + b, a ≠ 0. Then x = (y − b)/a and dy/dx = a, so that

$$f_Y(y) = \frac{f_X\big(\frac{y-b}{a}\big)}{|a|}$$

Example 4: Probability density function of the output of a full-wave rectifier

Suppose Y = |X|, −a ≤ X ≤ a, a > 0. For 0 < y ≤ a, y = |x| has two solutions, x1 = y and x2 = −y, and |dy/dx| = 1 at each solution point. Therefore

f_Y(y) = f_X(x)|ₓ₌y + f_X(x)|ₓ₌₋y = f_X(y) + f_X(−y)

Example 5: Probability density function of the output of a square-law device

Suppose Y = CX², C > 0. Then, for y ≥ 0, y = Cx² gives

x = ±√(y/C)

and dy/dx = 2Cx, so that |dy/dx| = 2C√(y/C) = 2√(Cy). Therefore

$$f_Y(y) = \frac{f_X\big(\sqrt{y/C}\big) + f_X\big(-\sqrt{y/C}\big)}{2\sqrt{Cy}}, \qquad y > 0$$

and f_Y(y) = 0 otherwise.

Expected Value of a Random Variable

The expectation operation extracts a few parameters of a random variable and provides a summary description of the random variable in terms of these parameters. It is far easier to estimate these parameters from data than to estimate the distribution or density function of the random variable.

Expected value or mean of a random variable

The expected value of a random variable X is defined by

$$EX = \int_{-\infty}^{\infty} x f_X(x)\,dx$$

provided the integral exists (i.e. ∫ |x| f_X(x) dx < ∞). EX is also called the mean or statistical average of the random variable X and is denoted by μ_X.

Note that, for a discrete RV X defined by the probability mass function p_X(xi), i = 1, 2, ..., N, the pdf f_X(x) is given by

$$f_X(x) = \sum_{i=1}^{N} p_X(x_i)\,\delta(x - x_i)$$

so that

$$\mu_X = EX = \int_{-\infty}^{\infty} x \sum_{i=1}^{N} p_X(x_i)\,\delta(x - x_i)\,dx = \sum_{i=1}^{N} p_X(x_i) \int_{-\infty}^{\infty} x\,\delta(x - x_i)\,dx = \sum_{i=1}^{N} x_i\, p_X(x_i)$$

Thus, for a discrete random variable X with pmf p_X(xi), i = 1, 2, ..., N,

$$\mu_X = \sum_{i=1}^{N} x_i\, p_X(x_i)$$

Interpretation of the mean

The mean gives an idea about the average value of the random variable. The values of the random variable are spread about this value. Observe that

$$\mu_X = \int_{-\infty}^{\infty} x f_X(x)\,dx = \frac{\int_{-\infty}^{\infty} x f_X(x)\,dx}{\int_{-\infty}^{\infty} f_X(x)\,dx} \qquad\Big(\text{since } \int_{-\infty}^{\infty} f_X(x)\,dx = 1\Big)$$

Therefore, the mean can also be interpreted as the centre of gravity of the pdf curve.

Fig. Mean of a random variable


Example 1 Suppose X is a random variable defined by the pdf

f_X(x) = 1/(b − a),   a ≤ x ≤ b
       = 0,           otherwise

Then

$$EX = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{a}^{b} \frac{x}{b-a}\,dx = \frac{a+b}{2}$$

Example 2 Consider the random variable X with pmf as tabulated below.

Value of X = x    p_X(x)
0                 1/8
1                 1/8
2                 1/4
3                 1/2

$$\mu_X = \sum_{i=1}^{4} x_i\, p_X(x_i) = 0\cdot\frac{1}{8} + 1\cdot\frac{1}{8} + 2\cdot\frac{1}{4} + 3\cdot\frac{1}{2} = \frac{17}{8}$$

Remark If f_X(x) is an even function of x, then ∫ x f_X(x) dx = 0. Thus the mean of an RV with an even symmetric pdf is 0.

Expected value of a function of a random variable

Suppose Y = g(X) is a function of a random variable X as discussed in the last class. Then

$$EY = E\,g(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$$

We shall illustrate this result in the special case when y = g(x) is a one-to-one and monotonically increasing function of x. In this case

$$f_Y(y) = \frac{f_X(x)}{g'(x)}\bigg|_{x=g^{-1}(y)}$$

and

$$EY = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_{y_1}^{y_2} y\, \frac{f_X(g^{-1}(y))}{g'(g^{-1}(y))}\,dy$$

where y1 = g(−∞) and y2 = g(∞). Substituting x = g⁻¹(y), so that y = g(x) and dy = g′(x) dx, we get

$$EY = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$$

The following important properties of the expectation operation can be immediately derived:

(a) If c is a constant, Ec = c. Clearly

$$Ec = \int_{-\infty}^{\infty} c f_X(x)\,dx = c \int_{-\infty}^{\infty} f_X(x)\,dx = c$$

(b) If g1(X) and g2(X) are two functions of the random variable X and c1 and c2 are constants, then

E[c1 g1(X) + c2 g2(X)] = c1 E g1(X) + c2 E g2(X)

because

$$E[c_1 g_1(X) + c_2 g_2(X)] = \int_{-\infty}^{\infty} [c_1 g_1(x) + c_2 g_2(x)] f_X(x)\,dx = c_1 \int_{-\infty}^{\infty} g_1(x) f_X(x)\,dx + c_2 \int_{-\infty}^{\infty} g_2(x) f_X(x)\,dx = c_1 E g_1(X) + c_2 E g_2(X)$$

The above property means that E is a linear operator.

Mean-square value

$$EX^2 = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx$$

Variance

For a random variable X with pdf f_X(x) and mean μ_X, the variance of X is denoted by σ_X² and defined as

$$\sigma_X^2 = E(X-\mu_X)^2 = \int_{-\infty}^{\infty} (x-\mu_X)^2 f_X(x)\,dx$$

Thus, for a discrete random variable X with pmf p_X(xi), i = 1, 2, ..., N,

$$\sigma_X^2 = \sum_{i=1}^{N} (x_i - \mu_X)^2\, p_X(x_i)$$

The standard deviation of X is defined as

$$\sigma_X = \sqrt{E(X-\mu_X)^2}$$

Example 3 Find the variance of the random variable discussed in Example 1.

$$\sigma_X^2 = E(X-\mu_X)^2 = \int_a^b \Big(x - \frac{a+b}{2}\Big)^2 \frac{1}{b-a}\,dx
= \frac{1}{b-a}\Big[\int_a^b x^2\,dx - 2\cdot\frac{a+b}{2}\int_a^b x\,dx + \Big(\frac{a+b}{2}\Big)^2\int_a^b dx\Big]
= \frac{(b-a)^2}{12}$$

Example 4 Find the variance of the random variable discussed in Example 2. As already computed, μ_X = 17/8, and

$$\sigma_X^2 = E(X-\mu_X)^2 = \Big(0-\frac{17}{8}\Big)^2\frac{1}{8} + \Big(1-\frac{17}{8}\Big)^2\frac{1}{8} + \Big(2-\frac{17}{8}\Big)^2\frac{1}{4} + \Big(3-\frac{17}{8}\Big)^2\frac{1}{2} = \frac{71}{64}$$
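Both the mean 17/8 and the variance can be checked directly from the pmf (illustrative code using exact fractions):

```python
from fractions import Fraction as F

# pmf of Example 2: values 0..3 with probabilities 1/8, 1/8, 1/4, 1/2.
pmf = {0: F(1, 8), 1: F(1, 8), 2: F(1, 4), 3: F(1, 2)}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean)      # 17/8
print(variance)  # 71/64
```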

Remark

Variance is a central moment and a measure of dispersion of the random variable about the mean.

E(X − μ_X)² is the average of the squared deviation from the mean. It gives information about the deviation of the values of the RV about the mean. A smaller σ_X² implies that the random values are more clustered about the mean; similarly, a bigger σ_X² means that the random values are more scattered.

For example, consider two random variables X1 and X2 with the pmfs shown below. Note that each of X1 and X2 has zero mean, while σ²_{X1} = 1/2 and σ²_{X2} = 5/3, implying that X2 has more spread about the mean.

x           -1    0    1
p_X1(x)     1/4   1/2  1/4

x           -2    -1    0    1    2
p_X2(x)     1/6   1/6   1/3  1/6  1/6

Fig. shows the pdfs of two continuous random variables with the same mean but different variances.

We could have used the mean absolute deviation E|X − μ_X| for the same purpose, but it is more difficult to handle both analytically and numerically.

Properties of variance

(1) σ_X² = EX² − μ_X²

$$\sigma_X^2 = E(X-\mu_X)^2 = E(X^2 - 2\mu_X X + \mu_X^2) = EX^2 - 2\mu_X EX + \mu_X^2 = EX^2 - 2\mu_X^2 + \mu_X^2 = EX^2 - \mu_X^2$$

(2) If Y = cX + b, where c and b are constants, then σ_Y² = c² σ_X²:

$$\sigma_Y^2 = E(cX + b - c\mu_X - b)^2 = E\,c^2 (X - \mu_X)^2 = c^2 \sigma_X^2$$

(3) If c is a constant, var(c) = 0.

nth moment of a random variable

We can define the nth moment and the nth central moment of a random variable X by the following relations:

nth-order moment: EX^n = \int_{-\infty}^{\infty} x^n f_X(x)\, dx, \quad n = 1, 2, \ldots

nth-order central moment: E(X - \mu_X)^n = \int_{-\infty}^{\infty} (x - \mu_X)^n f_X(x)\, dx, \quad n = 1, 2, \ldots

Note the following:

The mean \mu_X = EX is the first moment and the mean-square value EX^2 is the second moment.

The first central moment is 0 and the variance \sigma_X^2 = E(X - \mu_X)^2 is the second central moment.

The third central moment measures the lack of symmetry of the pdf of a random variable. E(X - \mu_X)^3 / \sigma_X^3 is called the coefficient of skewness; if the pdf is symmetric this coefficient is zero.

The fourth central moment measures the flatness or peakedness of the pdf of a random variable. E(X - \mu_X)^4 / \sigma_X^4 is called the kurtosis. If the peak of the pdf is sharper, then the random variable has a higher kurtosis.
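The skewness and kurtosis coefficients can be computed directly from a pdf by numerical integration. The sketch below does this for an exponential pdf with \lambda = 1, an assumed example not taken from the text; for that pdf the coefficients are 2 and 9 respectively.

```python
# Coefficient of skewness and kurtosis by numerical integration (sketch; exponential pdf, lambda = 1 assumed)
import numpy as np

lam = 1.0
x = np.linspace(0.0, 60.0, 600001)
f = lam * np.exp(-lam * x)                              # pdf

mean = np.trapz(x * f, x)
var = np.trapz((x - mean) ** 2 * f, x)
skew = np.trapz((x - mean) ** 3 * f, x) / var ** 1.5    # -> 2 for the exponential RV
kurt = np.trapz((x - mean) ** 4 * f, x) / var ** 2      # -> 9 for the exponential RV
print(skew, kurt)
```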

Inequalities based on expectations

The mean and variance also give some quantitative information about the bounds of a RV. The following inequalities are extremely useful in many practical problems.

Chebyshev Inequality

Suppose X is a parameter of a manufactured item with known mean \mu_X and variance \sigma_X^2. The quality control department rejects the item if the absolute deviation of X from \mu_X is greater than 2\sigma_X. What fraction of the manufactured items does the quality control department reject? Can you roughly guess it? The standard deviation gives us an intuitive idea of how the random variable is distributed about the mean. This idea is expressed more precisely in the remarkable Chebyshev inequality stated below.

For a random variable X with mean \mu_X and variance \sigma_X^2,

P\{|X - \mu_X| \ge \epsilon\} \le \frac{\sigma_X^2}{\epsilon^2}

Proof:

\sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx
 \ge \int_{|x - \mu_X| \ge \epsilon} (x - \mu_X)^2 f_X(x)\, dx
 \ge \epsilon^2 \int_{|x - \mu_X| \ge \epsilon} f_X(x)\, dx
 = \epsilon^2\, P\{|X - \mu_X| \ge \epsilon\}

\therefore P\{|X - \mu_X| \ge \epsilon\} \le \frac{\sigma_X^2}{\epsilon^2}

Markov Inequality

For a random variable X which takes only nonnegative values,

P\{X \ge a\} \le \frac{E(X)}{a}, \quad a > 0

Proof:

E(X) = \int_0^{\infty} x f_X(x)\, dx
 \ge \int_a^{\infty} x f_X(x)\, dx
 \ge \int_a^{\infty} a f_X(x)\, dx
 = a\, P\{X \ge a\}

\therefore P\{X \ge a\} \le \frac{E(X)}{a}
Remark: By the same argument, P\{(X - k)^2 \ge a\} \le \frac{E(X - k)^2}{a}.

Example A nonnegative RV X has the mean \mu_X = 1. Find an upper bound on the probability P(X \ge 3).

By Markov's inequality,

P(X \ge 3) \le \frac{E(X)}{3} = \frac{1}{3}

Hence the required upper bound is \frac{1}{3}.
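The looseness of these bounds can be seen empirically. The sketch below simulates a nonnegative RV with mean 1 (an exponential RV is an assumed choice) and compares the empirical probabilities with the Markov and Chebyshev bounds.

```python
# Empirical check of the Markov and Chebyshev bounds (sketch; exponential samples with mean 1 assumed)
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)      # nonnegative, EX = 1, var = 1

print(np.mean(x >= 3), 1/3)                         # P(X >= 3) vs the Markov bound EX/3
eps = 2.0
print(np.mean(np.abs(x - 1.0) >= eps), 1.0/eps**2)  # P(|X - mu| >= eps) vs the Chebyshev bound var/eps^2
```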

Characteristic Functions of Random Variables

Just as discrete-time and continuous-time signals have frequency-domain characterisations, the probability mass function and the probability density function can also be characterised in the frequency domain by means of the characteristic function of a random variable. These functions are particularly important in calculating the moments of a random variable and in evaluating the pdf of combinations of multiple RVs.

Characteristic function

Consider a random variable X with probability density function f_X(x). The characteristic function of X, denoted by \phi_X(\omega), is defined as

\phi_X(\omega) = E e^{j\omega X} = \int_{-\infty}^{\infty} e^{j\omega x} f_X(x)\, dx

where j = \sqrt{-1}.

Note the following:

\phi_X(\omega) is a complex quantity, representing essentially the Fourier transform of f_X(x), traditionally with e^{j\omega x} used instead of e^{-j\omega x}. This implies that the properties of the Fourier transform apply to the characteristic function.

The interpretation that \phi_X(\omega) is the expectation of e^{j\omega X} helps in calculating moments with the help of the characteristic function.

Since f_X(x) is always nonnegative and \int_{-\infty}^{\infty} f_X(x)\, dx = 1, \phi_X(\omega) always exists. [Recall that the Fourier transform of a function f(t) exists if \int_{-\infty}^{\infty} |f(t)|\, dt < \infty, i.e., if f(t) is absolutely integrable.]

We can get f_X(x) from \phi_X(\omega) by the inverse transform

f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_X(\omega) e^{-j\omega x}\, d\omega

Example 1 Consider the random variable X with pdf f_X(x) given by

f_X(x) = \frac{1}{b-a}, \quad a \le x \le b
       = 0 \qquad otherwise.

Solution: The characteristic function is given by

\phi_X(\omega) = \frac{1}{b-a} \int_a^b e^{j\omega x}\, dx
 = \frac{1}{b-a}\left[\frac{e^{j\omega x}}{j\omega}\right]_a^b
 = \frac{e^{j\omega b} - e^{j\omega a}}{j\omega (b-a)}

Example 2 The characteristic function of the random variable X with f_X(x) = \lambda e^{-\lambda x}, \lambda > 0, x \ge 0, is

\phi_X(\omega) = \int_0^{\infty} e^{j\omega x}\, \lambda e^{-\lambda x}\, dx
 = \lambda \int_0^{\infty} e^{-(\lambda - j\omega)x}\, dx
 = \frac{\lambda}{\lambda - j\omega}

Characteristic function of a discrete random variable: Suppose X is a random variable taking values from the discrete set R_X = \{x_1, x_2, \ldots\} with corresponding probability mass function p_X(x_i) for the value x_i. Then

\phi_X(\omega) = E e^{j\omega X} = \sum_{x_i \in R_X} p_X(x_i)\, e^{j\omega x_i}

Note that \phi_X(\omega) can be interpreted as the discrete-time Fourier transform with e^{j\omega x_i} substituting e^{-j\omega x_i} in the original discrete-time Fourier transform. The inverse relation (for integer-valued x_i, the integral being over any interval of length 2\pi) is

p_X(x_i) = \frac{1}{2\pi} \int_{2\pi} \phi_X(\omega)\, e^{-j\omega x_i}\, d\omega

Example 3 Suppose X is a random variable with the probability mass function

p_X(k) = {}^nC_k\, p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n

Then

\phi_X(\omega) = \sum_{k=0}^{n} {}^nC_k\, p^k (1-p)^{n-k} e^{j\omega k}
 = \sum_{k=0}^{n} {}^nC_k\, (p e^{j\omega})^k (1-p)^{n-k}
 = (p e^{j\omega} + 1 - p)^n \quad (using the binomial theorem)

Example 4 The characteristic function of the discrete random variable X with p_X(k) = p(1-p)^k, k = 0, 1, \ldots, is

\phi_X(\omega) = \sum_{k=0}^{\infty} e^{j\omega k}\, p(1-p)^k
 = p \sum_{k=0}^{\infty} \left((1-p)e^{j\omega}\right)^k
 = \frac{p}{1 - (1-p)e^{j\omega}}

Moments and the characteristic function

Given the characteristic function \phi_X(\omega), the kth moment is given by

EX^k = \frac{1}{j^k} \left. \frac{d^k \phi_X(\omega)}{d\omega^k} \right|_{\omega = 0}

To prove this, consider the power series expansion of e^{j\omega X}:

e^{j\omega X} = 1 + j\omega X + \frac{(j\omega)^2 X^2}{2!} + \ldots + \frac{(j\omega)^n X^n}{n!} + \ldots

Taking the expectation of both sides and assuming EX, EX^2, \ldots, EX^n to exist, we get

\phi_X(\omega) = 1 + j\omega\, EX + \frac{(j\omega)^2 EX^2}{2!} + \ldots + \frac{(j\omega)^n EX^n}{n!} + \ldots

Taking the first derivative of \phi_X(\omega) with respect to \omega at \omega = 0, we get

\left. \frac{d\phi_X(\omega)}{d\omega} \right|_{\omega = 0} = j\, EX

Similarly, taking the nth derivative of \phi_X(\omega) with respect to \omega at \omega = 0, we get

\left. \frac{d^n \phi_X(\omega)}{d\omega^n} \right|_{\omega = 0} = j^n\, EX^n

Thus

EX = \frac{1}{j} \left. \frac{d\phi_X(\omega)}{d\omega} \right|_{\omega = 0}, \qquad EX^n = \frac{1}{j^n} \left. \frac{d^n \phi_X(\omega)}{d\omega^n} \right|_{\omega = 0}

Example: First two moments of the random variable in Example 2.

\phi_X(\omega) = \frac{\lambda}{\lambda - j\omega}

\frac{d\phi_X(\omega)}{d\omega} = \frac{j\lambda}{(\lambda - j\omega)^2}, \qquad \frac{d^2\phi_X(\omega)}{d\omega^2} = \frac{2 j^2 \lambda}{(\lambda - j\omega)^3}

EX = \frac{1}{j} \left. \frac{j\lambda}{(\lambda - j\omega)^2} \right|_{\omega = 0} = \frac{1}{\lambda}

EX^2 = \frac{1}{j^2} \left. \frac{2 j^2 \lambda}{(\lambda - j\omega)^3} \right|_{\omega = 0} = \frac{2}{\lambda^2}
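The same differentiation can be carried out symbolically; the sketch below keeps \lambda symbolic and reproduces 1/\lambda and 2/\lambda^2.

```python
# Moments of the exponential RV from its characteristic function (sketch using sympy; lambda kept symbolic)
import sympy as sp

w, lam = sp.symbols('w lam', positive=True)
phi = lam / (lam - sp.I * w)

EX = sp.simplify(sp.diff(phi, w).subs(w, 0) / sp.I)          # 1/lam
EX2 = sp.simplify(sp.diff(phi, w, 2).subs(w, 0) / sp.I**2)   # 2/lam**2
print(EX, EX2)
```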

Probability generating function: If the random variable under consideration takes nonnegative integer values only, it is convenient to characterise it in terms of the probability generating function G_X(z), defined by

G_X(z) = E z^X = \sum_{k=0}^{\infty} p_X(k)\, z^k

Note the following:

G_X(z) is related to the z-transform; in the actual z-transform, z^{-k} is used instead of z^k.

The characteristic function of X is given by \phi_X(\omega) = G_X(e^{j\omega}).

G_X(1) = \sum_{k=0}^{\infty} p_X(k) = 1

G_X'(z) = \sum_{k=0}^{\infty} k\, p_X(k)\, z^{k-1} \quad \Rightarrow \quad G_X'(1) = \sum_{k=0}^{\infty} k\, p_X(k) = EX

G_X''(z) = \sum_{k=0}^{\infty} k(k-1)\, p_X(k)\, z^{k-2} = \sum_{k=0}^{\infty} k^2 p_X(k)\, z^{k-2} - \sum_{k=0}^{\infty} k\, p_X(k)\, z^{k-2}

G_X''(1) = \sum_{k=0}^{\infty} k^2 p_X(k) - \sum_{k=0}^{\infty} k\, p_X(k) = EX^2 - EX

\sigma_X^2 = EX^2 - (EX)^2 = G_X''(1) + G_X'(1) - \left(G_X'(1)\right)^2

Example 1: Binomial distribution. With p_X(x) = {}^nC_x\, p^x (1-p)^{n-x} and q = 1-p,

G_X(z) = \sum_x {}^nC_x\, p^x (1-p)^{n-x} z^x = \sum_x {}^nC_x\, (pz)^x (1-p)^{n-x} = (1 - p + pz)^n

G_X'(1) = EX = np

G_X''(1) = EX^2 - EX = n(n-1)p^2

EX^2 = G_X''(1) + EX = n(n-1)p^2 + np = n^2 p^2 + npq
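A symbolic check of these PGF calculations is sketched below; it reproduces EX = np and the variance np(1-p).

```python
# Mean and variance of B(n, p) from the probability generating function (sketch using sympy)
import sympy as sp

z, p, n = sp.symbols('z p n', positive=True)
G = (1 - p + p * z) ** n

EX = sp.simplify(sp.diff(G, z).subs(z, 1))            # n*p
EX2 = sp.simplify(sp.diff(G, z, 2).subs(z, 1) + EX)   # n*(n-1)*p**2 + n*p
var = sp.simplify(EX2 - EX**2)                        # equals n*p*(1 - p)
print(EX, EX2, var)
```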
Example 2: Geometric distribution. With p_X(x) = p(1-p)^x, x = 0, 1, \ldots, and q = 1-p,

G_X(z) = \sum_x p(1-p)^x z^x = p \sum_x \left((1-p)z\right)^x = \frac{p}{1 - (1-p)z}

G_X'(z) = \frac{p(1-p)}{\left(1 - (1-p)z\right)^2} \quad \Rightarrow \quad G_X'(1) = \frac{p(1-p)}{p^2} = \frac{q}{p} = EX

G_X''(z) = \frac{2p(1-p)^2}{\left(1 - (1-p)z\right)^3} \quad \Rightarrow \quad G_X''(1) = \frac{2pq^2}{p^3} = \frac{2q^2}{p^2}

EX^2 = G_X''(1) + EX = \frac{2q^2}{p^2} + \frac{q}{p}

Var(X) = EX^2 - (EX)^2 = \frac{2q^2}{p^2} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q^2}{p^2} + \frac{q}{p} = \frac{q(q+p)}{p^2} = \frac{q}{p^2}
Moment Generating Function: Sometimes it is convenient to work with a function similar to the Laplace transform, known as the moment generating function. For a random variable X, the moment generating function M_X(s) is defined by

M_X(s) = E e^{sX} = \int_{R_X} f_X(x)\, e^{sx}\, dx

where R_X is the range of the random variable X.

If X is a nonnegative continuous random variable, we can write

M_X(s) = \int_0^{\infty} f_X(x)\, e^{sx}\, dx

Note the following:

M_X'(s) = \int_0^{\infty} x f_X(x)\, e^{sx}\, dx \quad \Rightarrow \quad M_X'(0) = EX

\left. \frac{d^k}{ds^k} M_X(s) \right|_{s=0} = \left. \int_0^{\infty} x^k f_X(x)\, e^{sx}\, dx \right|_{s=0} = EX^k

Example Let X be a continuous random variable with

f_X(x) = \frac{\alpha/\pi}{x^2 + \alpha^2}, \quad -\infty < x < \infty, \ \alpha > 0

Then the defining integral for the mean diverges:

\int_0^{\infty} x f_X(x)\, dx = \frac{\alpha}{\pi} \int_0^{\infty} \frac{x}{x^2 + \alpha^2}\, dx = \frac{\alpha}{2\pi} \left[\ln\left(x^2 + \alpha^2\right)\right]_0^{\infty} = \infty

Hence EX does not exist. This density function is known as the Cauchy density function.


The joint characteristic function of two random variables X and Y is defined by

\phi_{X,Y}(\omega_1, \omega_2) = E e^{j\omega_1 X + j\omega_2 Y} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, e^{j\omega_1 x + j\omega_2 y}\, dy\, dx

and the joint moment generating function of (s_1, s_2) is defined by

M_{X,Y}(s_1, s_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, e^{x s_1 + y s_2}\, dx\, dy = E e^{s_1 X + s_2 Y}

If Z = aX + bY, then

M_Z(s) = E e^{Zs} = E e^{(aX + bY)s} = M_{X,Y}(as, bs)


Suppose X and Y are independent. Then

M_{X,Y}(s_1, s_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{s_1 x + s_2 y} f_X(x)\, f_Y(y)\, dy\, dx
 = \int_{-\infty}^{\infty} e^{s_1 x} f_X(x)\, dx \int_{-\infty}^{\infty} e^{s_2 y} f_Y(y)\, dy
 = M_X(s_1)\, M_Y(s_2)

In particular, if Z = X + Y and X and Y are independent, then

M_Z(s) = M_{X,Y}(s, s) = M_X(s)\, M_Y(s)

Using the corresponding property of the Laplace transform, we get

f_Z(z) = f_X(z) * f_Y(z)
Let us recall the MGF of a Gaussian random variable X \sim N(\mu_X, \sigma_X^2):

M_X(s) = E e^{Xs} = \frac{1}{\sqrt{2\pi}\,\sigma_X} \int_{-\infty}^{\infty} e^{-\frac{(x - \mu_X)^2}{2\sigma_X^2}} e^{xs}\, dx

Combining the exponents and completing the square in x,

-\frac{(x - \mu_X)^2}{2\sigma_X^2} + xs = -\frac{\left(x - \mu_X - \sigma_X^2 s\right)^2}{2\sigma_X^2} + \mu_X s + \frac{\sigma_X^2 s^2}{2}

so that

M_X(s) = e^{\mu_X s + \frac{\sigma_X^2 s^2}{2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_X} \int_{-\infty}^{\infty} e^{-\frac{\left(x - \mu_X - \sigma_X^2 s\right)^2}{2\sigma_X^2}}\, dx = e^{\mu_X s + \frac{\sigma_X^2 s^2}{2}}

We have

M_{X,Y}(s_1, s_2) = E e^{Xs_1 + Ys_2}
 = E\left(1 + Xs_1 + Ys_2 + \frac{(Xs_1 + Ys_2)^2}{2!} + \ldots\right)
 = 1 + s_1 EX + s_2 EY + \frac{s_1^2 EX^2}{2} + \frac{s_2^2 EY^2}{2} + s_1 s_2 EXY + \ldots

Hence,

EX = \left. \frac{\partial}{\partial s_1} M_{X,Y}(s_1, s_2) \right|_{s_1 = 0,\, s_2 = 0}

EY = \left. \frac{\partial}{\partial s_2} M_{X,Y}(s_1, s_2) \right|_{s_1 = 0,\, s_2 = 0}

EXY = \left. \frac{\partial^2}{\partial s_1 \partial s_2} M_{X,Y}(s_1, s_2) \right|_{s_1 = 0,\, s_2 = 0}

We can generate the joint moments of the RVs from the joint moment generating function.
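The Gaussian MGF just recalled can be checked by Monte Carlo; the parameter values below are assumptions chosen only for illustration.

```python
# Monte Carlo check of the Gaussian MGF E[e^{sX}] = exp(mu*s + sigma^2*s^2/2) (sketch; mu=1, sigma=2, s=0.3 assumed)
import numpy as np

mu, sigma, s = 1.0, 2.0, 0.3
x = np.random.default_rng(1).normal(mu, sigma, 2_000_000)
print(np.mean(np.exp(s * x)), np.exp(mu * s + 0.5 * sigma**2 * s**2))   # the two values agree closely
```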

Important Discrete Random Variables

Discrete uniform random variable

A discrete random variable X is said to be a uniform random variable if it assumes each of the values x_1, x_2, \ldots, x_n with equal probability. The probability mass function of the uniform random variable X is given by

p_X(x_i) = \frac{1}{n}, \quad i = 1, 2, \ldots, n

Its CDF is

F_X(x) = \frac{1}{n} \sum_{i=1}^{n} u(x - x_i)

where u(\cdot) denotes the unit step function. The pmf consists of n equal spikes of height 1/n at x_1, x_2, \ldots, x_n.

Mean and variance of the discrete uniform random variable

\mu_X = EX = \sum_{i=1}^{n} x_i\, p_X(x_i) = \frac{1}{n} \sum_{i=1}^{n} x_i

EX^2 = \sum_{i=1}^{n} x_i^2\, p_X(x_i) = \frac{1}{n} \sum_{i=1}^{n} x_i^2

\sigma_X^2 = EX^2 - \mu_X^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n} \sum_{i=1}^{n} x_i\right)^2

Example: Suppose X is the random variable representing the outcome of a single roll of a fair die. Then X can assume any of the 6 values in the set {1, 2, 3, 4, 5, 6} with the probability mass function

p_X(x) = \frac{1}{6}, \quad x = 1, 2, 3, 4, 5, 6

Bernoulli random variable Suppose X is a random variable that takes two values 0 and 1, with probability mass functions
p_X(1) = P\{X = 1\} = p \quad and \quad p_X(0) = 1 - p, \qquad 0 \le p \le 1

Such a random variable X is called a Bernoulli random variable, because it describes the outcomes of a Bernoulli trial. The typical cdf of the Bernoulli RV X is as shown in Fig.

The CDF jumps by 1 - p at x = 0 and by p at x = 1.

Remark We can define the pdf of X with the help of the delta function. Thus

f_X(x) = (1 - p)\,\delta(x) + p\,\delta(x - 1)

Example 1: Consider the experiment of tossing a biased coin. Suppose P\{H\} = p and P\{T\} = 1 - p. If we define the random variable X(H) = 1 and X(T) = 0, then X is a Bernoulli random variable.

Mean and variance of the Bernoulli random variable

\mu_X = EX = \sum_{k=0}^{1} k\, p_X(k) = 1 \cdot p + 0 \cdot (1-p) = p

EX^2 = \sum_{k=0}^{1} k^2\, p_X(k) = 1 \cdot p + 0 \cdot (1-p) = p

\sigma_X^2 = EX^2 - \mu_X^2 = p - p^2 = p(1-p)

Remark The Bernoulli RV is the simplest discrete RV. It can be used as the building block for many discrete RVs. For the Bernoulli RV, EX^m = p for m = 1, 2, 3, \ldots; thus all the moments of the Bernoulli RV have the same value p.

Binomial random variable:

Suppose X is a discrete random variable taking values from the set \{0, 1, \ldots, n\}. X is called a binomial random variable with parameters n and p, 0 \le p \le 1, if

p_X(k) = {}^nC_k\, p^k (1-p)^{n-k}, \quad where \quad {}^nC_k = \frac{n!}{k!\,(n-k)!}

As we have seen, the probability of k successes in n independent repetitions of a Bernoulli trial is given by the binomial law. If X is a discrete random variable representing the number of successes in this case, then X is a binomial random variable. For example, the number of heads in n independent tosses of a fair coin is a binomial random variable. The notation X \sim B(n, p) is used to represent a binomial RV with the parameters n and p.

p_X(k), k = 0, 1, \ldots, n, defines a valid probability mass function. This is because

\sum_{k=0}^{n} p_X(k) = \sum_{k=0}^{n} {}^nC_k\, p^k (1-p)^{n-k} = [p + (1-p)]^n = 1

The sum of n independent identically distributed Bernoulli random variables is a binomial random variable. The binomial distribution is useful when there are two types of objects: good/bad, correct/erroneous, healthy/diseased, etc.

Example: In a binary communication system, the probability of bit error is 0.01. If a block of 8 bits is transmitted, find the probability that (a) exactly 2 bit errors will occur, (b) at least 2 bit errors will occur, (c) all the bits will be erroneous. Suppose X is the random variable representing the number of bit errors in a block of 8 bits. Then X \sim B(8, 0.01). Therefore,

(a) P(exactly 2 bit errors) = p_X(2) = {}^8C_2\, (0.01)^2 (0.99)^6

(b) P(at least 2 bit errors) = 1 - p_X(0) - p_X(1) = 1 - (0.99)^8 - {}^8C_1\, (0.01)(0.99)^7

(c) P(all 8 bits erroneous) = p_X(8) = (0.01)^8 = 10^{-16}
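These three probabilities can be evaluated directly with a standard binomial routine, as sketched below.

```python
# Bit-error probabilities for X ~ B(8, 0.01) (sketch using scipy.stats)
from scipy.stats import binom

n, p = 8, 0.01
print(binom.pmf(2, n, p))       # (a) exactly 2 errors
print(1 - binom.cdf(1, n, p))   # (b) at least 2 errors
print(binom.pmf(8, n, p))       # (c) all 8 bits erroneous, 1e-16
```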

The probability mass function for a binomial random variable with n = 6 and p =0.8 is shown in the figure below.

Mean and variance of the Binomial random variable

EX = \sum_{k=0}^{n} k\, p_X(k)
 = \sum_{k=0}^{n} k\, {}^nC_k\, p^k (1-p)^{n-k}
 = 0 + \sum_{k=1}^{n} k\, \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}
 = \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!}\, p^k (1-p)^{n-k}
 = np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-1-(k-1)}
 = np \sum_{k_1=0}^{n-1} \frac{(n-1)!}{k_1!\,(n-1-k_1)!}\, p^{k_1} (1-p)^{n-1-k_1} \qquad (substituting k_1 = k - 1)
 = np\, (p + 1 - p)^{n-1} = np

Similarly,

EX^2 = \sum_{k=0}^{n} k^2\, p_X(k)
 = 0 + \sum_{k=1}^{n} k^2\, {}^nC_k\, p^k (1-p)^{n-k}
 = \sum_{k=1}^{n} k\, \frac{n!}{(k-1)!\,(n-k)!}\, p^k (1-p)^{n-k}
 = np \sum_{k=1}^{n} (k-1+1)\, \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-1-(k-1)}
 = np \sum_{k=1}^{n} (k-1)\, \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-1-(k-1)} + np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-1-(k-1)}
 = np\,(n-1)p + np \qquad (the first sum is the mean of B(n-1, p) and the second sum is 1)
 = n(n-1)p^2 + np

\sigma_X^2 = variance of X = n(n-1)p^2 + np - n^2 p^2 = np(1-p)


Geometric random variable:


Let X be a discrete random variable with range R_X = \{1, 2, \ldots\}. X is called a geometric random variable with parameter p, 0 \le p \le 1, if

p_X(k) = p(1-p)^{k-1}

X describes the number of independent Bernoulli trials, each with probability of success p, needed until the first success occurs. If the first success occurs at the kth trial, then the outcomes of the Bernoulli trials are

Failure, Failure, \ldots, Failure (k-1 times), Success

so that p_X(k) = (1-p)^{k-1} p = p(1-p)^{k-1}. R_X is countably infinite because we may have to wait arbitrarily long before the first success occurs. The geometric random variable X with parameter p is denoted by X \sim geo(p). The CDF of X \sim geo(p) is given by

F_X(k) = \sum_{i=1}^{k} (1-p)^{i-1} p = 1 - (1-p)^k

which gives the probability that the first success will occur before the (k+1)th trial. The figure shows the pmf of X \sim geo(p) for p = 0.25 and p = 0.5; observe that the plots have a mode at k = 1.

Example: Suppose X is the random variable representing the number of independent tosses of a coin until a head shows up. Clearly X is a geometric random variable.

Example: A fair die is rolled repeatedly. What is the probability that a 6 will be shown before the fourth roll?

Suppose X is the random variable representing the number of independent rolls of the die until a 6 shows up. Clearly X is a geometric random variable with p = \frac{1}{6}.

P(a 6 is shown before the 4th roll) = P(X = 1 or X = 2 or X = 3) = p_X(1) + p_X(2) + p_X(3)
 = p + p(1-p) + p(1-p)^2 = p(3 - 3p + p^2) = \frac{91}{216}
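An exact arithmetic check of the dice example is sketched below.

```python
# Probability that a '6' appears before the 4th roll (sketch; p = 1/6)
from fractions import Fraction

p = Fraction(1, 6)
prob = p + p * (1 - p) + p * (1 - p) ** 2   # p_X(1) + p_X(2) + p_X(3)
print(prob)                                 # 91/216
```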

Mean and variance of the geometric random variable

EX = \sum_{k=1}^{\infty} k\, p_X(k) = \sum_{k=1}^{\infty} k\, p(1-p)^{k-1} = p \sum_{k=1}^{\infty} k(1-p)^{k-1}
 = p \cdot \frac{1}{\left(1 - (1-p)\right)^2} \qquad (differentiating the geometric series)
 = \frac{1}{p}

EX^2 = \sum_{k=1}^{\infty} k^2\, p_X(k) = p \sum_{k=1}^{\infty} \left(k(k-1) + k\right)(1-p)^{k-1}
 = p(1-p) \sum_{k=1}^{\infty} k(k-1)(1-p)^{k-2} + p \sum_{k=1}^{\infty} k(1-p)^{k-1}
 = p(1-p) \cdot \frac{2}{p^3} + \frac{1}{p} = \frac{2(1-p)}{p^2} + \frac{1}{p}

\sigma_X^2 = EX^2 - (EX)^2 = \frac{2(1-p)}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{1-p}{p^2}

Mean \mu_X = \frac{1}{p}, \qquad Variance \sigma_X^2 = \frac{1-p}{p^2}

Poisson Random Variable: X is a Poisson random variable with parameter \lambda > 0 if

p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots

The plot of the pmf of the Poisson RV is shown in Fig.

Remark

p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!} satisfies the conditions of a pmf, because

\sum_{k=0}^{\infty} p_X(k) = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1

The distribution is named after the French mathematician S. D. Poisson.

Mean and Variance of the Poisson RV

The mean of the Poisson RV X is given by

\mu_X = \sum_{k=0}^{\infty} k\, p_X(k) = 0 + \sum_{k=1}^{\infty} k\, \frac{e^{-\lambda} \lambda^k}{k!} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda

EX^2 = \sum_{k=0}^{\infty} k^2\, p_X(k) = 0 + \sum_{k=1}^{\infty} k^2\, \frac{e^{-\lambda} \lambda^k}{k!}
 = e^{-\lambda} \sum_{k=1}^{\infty} k\, \frac{\lambda^k}{(k-1)!}
 = e^{-\lambda} \sum_{k=1}^{\infty} (k-1+1)\, \frac{\lambda^k}{(k-1)!}
 = e^{-\lambda} \sum_{k=2}^{\infty} \frac{\lambda^k}{(k-2)!} + e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!}
 = e^{-\lambda} \lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} + e^{-\lambda} \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}
 = e^{-\lambda} \lambda^2 e^{\lambda} + e^{-\lambda} \lambda e^{\lambda} = \lambda^2 + \lambda

\sigma_X^2 = EX^2 - \mu_X^2 = \lambda

Example: The number of calls received in a telephone exchange follows a Poisson distribution with an average of 10 calls per minute. What is the probability that in a one-minute duration (i) no call is received, (ii) exactly 5 calls are received, (iii) more than 3 calls are received?

Let X be the random variable representing the number of calls received. Given p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!} with \lambda = 10. Therefore,

(i) probability that no call is received = p_X(0) = e^{-10}

(ii) probability that exactly 5 calls are received = p_X(5) = \frac{e^{-10}\, 10^5}{5!}

(iii) probability that more than 3 calls are received = 1 - \sum_{k=0}^{3} p_X(k) = 1 - e^{-10}\left(1 + 10 + \frac{10^2}{2!} + \frac{10^3}{3!}\right)

The Poisson distribution is used to model many practical problems. It is used in many counting applications to count events that take place independently of one another. Thus it is used to model the count, during a particular length of time, of: customers arriving at a service station, telephone calls coming to a telephone exchange, packets arriving at a particular server, and particles decaying from a radioactive specimen.

Poisson approximation of the binomial random variable

The Poisson distribution is also used to approximate the binomial distribution B(n, p) when n is very large and p is small. Consider a binomial RV X \sim B(n, p) with n \to \infty and p \to 0 such that EX = np = \lambda remains constant. Then p_X(k) \to \frac{e^{-\lambda} \lambda^k}{k!}.

To see this, write

p_X(k) = {}^nC_k\, p^k (1-p)^{n-k}
 = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}\, p^k (1-p)^{n-k}
 = \frac{\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right)}{k!}\, (np)^k (1-p)^{n-k}
 = \frac{\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right)\, \lambda^k \left(1 - \frac{\lambda}{n}\right)^n}{k!\left(1 - \frac{\lambda}{n}\right)^k}

Noting that \lim_{n \to \infty}\left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}, we get

\lim_{n \to \infty} p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}

Thus the Poisson approximation can be used to compute binomial probabilities for large n; it also makes the analysis of such probabilities easier. Typical examples are the number of bit errors in a received binary data file and the number of typographical errors in a printed page.

Example Suppose there is an error probability of 0.01 per word in typing. What is the probability that there will be more than one error in a page of 120 words?

Suppose X is the RV representing the number of errors per page of 120 words. X \sim B(120, p) with p = 0.01, so \lambda = 120 \times 0.01 = 1.2. Therefore,

P(more than one error) = 1 - p_X(0) - p_X(1) \approx 1 - e^{-\lambda} - \lambda e^{-\lambda} = 1 - 2.2\, e^{-1.2} \approx 0.337
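The quality of the approximation for this example can be checked numerically, as sketched below.

```python
# Poisson approximation of B(120, 0.01) for the typing-error example (sketch using scipy.stats)
from scipy.stats import binom, poisson

n, p = 120, 0.01
lam = n * p                               # 1.2
exact = 1 - binom.cdf(1, n, p)            # P(more than one error), exact binomial
approx = 1 - poisson.cdf(1, lam)          # Poisson approximation 1 - e^{-lam}(1 + lam)
print(exact, approx)                      # both close to 0.337
```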

Uniform Random Variable


A continuous random variable X is called uniformly distributed over the interval [a, b] if its probability density function is given by

f_X(x) = \frac{1}{b-a}, \quad a \le x \le b
       = 0 \qquad otherwise

We use the notation X \sim U(a, b) to denote a random variable X uniformly distributed over the interval [a, b]. Also note that

\int_{-\infty}^{\infty} f_X(x)\, dx = \int_a^b \frac{1}{b-a}\, dx = 1

Distribution function F_X(x)

For x < a, \quad F_X(x) = 0

For a \le x \le b, \quad F_X(x) = \int_{-\infty}^{x} f_X(u)\, du = \int_a^x \frac{du}{b-a} = \frac{x-a}{b-a}

For x > b, \quad F_X(x) = 1

Mean and Variance of a Uniform Random Variable

\mu_X = EX = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_a^b \frac{x}{b-a}\, dx = \frac{a+b}{2}

EX^2 = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx = \int_a^b \frac{x^2}{b-a}\, dx = \frac{b^2 + ab + a^2}{3}

\sigma_X^2 = EX^2 - \mu_X^2 = \frac{b^2 + ab + a^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}

The characteristic function of the random variable X \sim U(a, b) is given by

\phi_X(\omega) = E e^{j\omega X} = \int_a^b \frac{e^{j\omega x}}{b-a}\, dx = \frac{e^{j\omega b} - e^{j\omega a}}{j\omega(b-a)}

Example: Suppose a random noise voltage X across an electronic circuit is uniformly distributed between -4 V and 5 V. What is the probability that the noise voltage will lie between 2 V and 3 V? What is the variance of the voltage?

P(2 < X \le 3) = \int_2^3 \frac{dx}{5 - (-4)} = \frac{1}{9}, \qquad \sigma_X^2 = \frac{(5+4)^2}{12} = \frac{27}{4}

Remark The uniform distribution is the simplest continuous distribution. It is used, for example, to model quantization errors: if a signal is discretized into steps of \Delta, then the quantization error is uniformly distributed between -\frac{\Delta}{2} and \frac{\Delta}{2}. The unknown phase of a sinusoid is assumed to be uniformly distributed over [0, 2\pi] in many applications; for example, in studying the noise performance of a communication receiver, the carrier signal is modelled as X(t) = A\cos(\omega_c t + \phi) with \phi \sim U(0, 2\pi). A random variable of arbitrary distribution can be generated with the help of a routine that generates uniformly distributed random numbers. This follows from the fact that the distribution function of a continuous random variable, evaluated at the random variable itself, is uniformly distributed over [0, 1] (see Example): if X is a continuous random variable, then F_X(X) \sim U(0, 1).

Normal or Gaussian Random Variable

The normal distribution is the most important distribution used to model natural and man-made phenomena. In particular, when a random variable is the sum of a large number of random variables, it can be modelled as a normal random variable.

A continuous random variable X is called a normal or Gaussian random variable with parameters \mu_X and \sigma_X^2 if its probability density function is given by

f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X}\, e^{-\frac{1}{2}\left(\frac{x - \mu_X}{\sigma_X}\right)^2}, \quad -\infty < x < \infty

where \mu_X and \sigma_X > 0 are real numbers.

We write that X is N(\mu_X, \sigma_X^2) distributed.

If \mu_X = 0 and \sigma_X^2 = 1,

f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}

and the random variable X is called the standard normal variable.

f_X(x) is a bell-shaped function, symmetrical about x = \mu_X. \sigma_X determines the spread of the random variable X: if \sigma_X^2 is small, X is more concentrated around the mean \mu_X.

F_X(x) = P(X \le x) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \int_{-\infty}^{x} e^{-\frac{1}{2}\left(\frac{t - \mu_X}{\sigma_X}\right)^2}\, dt

Substituting u = \frac{t - \mu_X}{\sigma_X}, we get

F_X(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{x - \mu_X}{\sigma_X}} e^{-\frac{u^2}{2}}\, du = \Phi\!\left(\frac{x - \mu_X}{\sigma_X}\right)

where \Phi(x) is the distribution function of the standard normal variable. Thus F_X(x) can be computed from tabulated values of \Phi(x). The table of \Phi(x) was very useful in the pre-computer days. In communication engineering, it is customary to work with the Q function defined by

Q(x) = 1 - \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}}\, du

Note that Q(0) = \frac{1}{2} and Q(-x) = 1 - Q(x).
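In numerical work the Q function is conveniently obtained from the complementary error function, Q(x) = \tfrac{1}{2}\,erfc(x/\sqrt{2}); a minimal sketch follows.

```python
# Q function via the complementary error function: Q(x) = 0.5*erfc(x/sqrt(2)) (sketch)
import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(np.asarray(x) / np.sqrt(2.0))

print(Q(0.0), Q(1.0), 1 - Q(-1.0))   # Q(0) = 0.5, and Q(-x) = 1 - Q(x)
```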

If X is N(\mu_X, \sigma_X^2) distributed, then

EX = \mu_X, \qquad var(X) = \sigma_X^2

Proof:

EX = \int_{-\infty}^{\infty} x f_X(x)\, dx = \frac{1}{\sqrt{2\pi}\,\sigma_X} \int_{-\infty}^{\infty} x\, e^{-\frac{1}{2}\left(\frac{x - \mu_X}{\sigma_X}\right)^2}\, dx

Substituting u = \frac{x - \mu_X}{\sigma_X}, so that x = \sigma_X u + \mu_X,

EX = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (\sigma_X u + \mu_X)\, e^{-\frac{u^2}{2}}\, du
 = \frac{\sigma_X}{\sqrt{2\pi}} \int_{-\infty}^{\infty} u\, e^{-\frac{u^2}{2}}\, du + \frac{\mu_X}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{u^2}{2}}\, du
 = 0 + \frac{\mu_X}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = \mu_X

where the first integral vanishes because the integrand is an odd function of u.

Evaluation of \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\, dx

Suppose I = \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\, dx. Then

I^2 = \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\, dx \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{x^2 + y^2}{2}}\, dy\, dx

Substituting x = r\cos\theta and y = r\sin\theta, we get

I^2 = \int_0^{2\pi} \int_0^{\infty} e^{-\frac{r^2}{2}}\, r\, dr\, d\theta
 = 2\pi \int_0^{\infty} e^{-\frac{r^2}{2}}\, r\, dr
 = 2\pi \int_0^{\infty} e^{-s}\, ds \qquad (s = \tfrac{r^2}{2})
 = 2\pi

\therefore I = \sqrt{2\pi}

var(X) = E(X - \mu_X)^2 = \frac{1}{\sqrt{2\pi}\,\sigma_X} \int_{-\infty}^{\infty} (x - \mu_X)^2\, e^{-\frac{1}{2}\left(\frac{x - \mu_X}{\sigma_X}\right)^2}\, dx

Put u = \frac{x - \mu_X}{\sigma_X} so that dx = \sigma_X\, du. Then

var(X) = \frac{\sigma_X^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} u^2 e^{-\frac{u^2}{2}}\, du = \frac{2\sigma_X^2}{\sqrt{2\pi}} \int_0^{\infty} u^2 e^{-\frac{u^2}{2}}\, du

Put t = \frac{u^2}{2} so that dt = u\, du and u = \sqrt{2t}. Then

var(X) = \frac{2\sigma_X^2}{\sqrt{2\pi}} \int_0^{\infty} \sqrt{2t}\, e^{-t}\, dt = \frac{2\sigma_X^2}{\sqrt{\pi}} \int_0^{\infty} t^{\frac{1}{2}} e^{-t}\, dt = \frac{2\sigma_X^2}{\sqrt{\pi}}\, \Gamma\!\left(\tfrac{3}{2}\right) = \frac{2\sigma_X^2}{\sqrt{\pi}} \cdot \frac{\sqrt{\pi}}{2} = \sigma_X^2

Note the definition and properties of the gamma function:

\Gamma(n) = \int_0^{\infty} t^{n-1} e^{-t}\, dt, \qquad \Gamma(n) = (n-1)\Gamma(n-1), \qquad \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}

Exponential Random Variable A continuous random variable X is called exponentially distributed with the parameter \lambda > 0 if its probability density function is of the form

f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0
       = 0 \qquad otherwise

The corresponding probability distribution function is

F_X(x) = \int_{-\infty}^{x} f_X(u)\, du = 0 \ for\ x < 0, \qquad = 1 - e^{-\lambda x} \ for\ x \ge 0

We have

\mu_X = EX = \int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}, \qquad \sigma_X^2 = E(X - \mu_X)^2 = \int_0^{\infty} \left(x - \tfrac{1}{\lambda}\right)^2 \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda^2}

The figure shows the pdf of an exponential RV.

The time between two consecutive occurrences of independent events can be modelled by the exponential RV. For example, the exponential distribution gives the probability density function of the time between two successive counts in a Poisson process. It is used to model the service time in a queueing system. In reliability studies, the expected lifetime of a part and the average time between successive failures of a system are determined using the exponential distribution.

Memoryless Property of the Exponential Distribution

For an exponential RV X with parameter \lambda,

P(X > t + t_0 \mid X > t_0) = P(X > t) \quad for\ t > 0,\ t_0 > 0

Proof:

P(X > t + t_0 \mid X > t_0) = \frac{P[(X > t + t_0) \cap (X > t_0)]}{P(X > t_0)} = \frac{P(X > t + t_0)}{P(X > t_0)} = \frac{1 - F_X(t + t_0)}{1 - F_X(t_0)} = \frac{e^{-\lambda(t + t_0)}}{e^{-\lambda t_0}} = e^{-\lambda t} = P(X > t)

Hence if X represents the life of a component in hours, the probability that the component will last more than t + t_0 hours, given that it has lasted t_0 hours, is the same as the probability that the component will last t hours. The information that the component has already lasted for t_0 hours is not used. Thus the life expectancy of a used component is the same as that of a new component. Such a model may not represent a real-world situation accurately, but it is used for its simplicity.
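The memoryless property can be checked by simulation; the parameter values below are assumptions chosen only for illustration.

```python
# Empirical check of the memoryless property of the exponential RV (sketch; lambda = 0.5, t0 = 1, t = 2 assumed)
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1/0.5, size=2_000_000)   # exponential with lambda = 0.5
t0, t = 1.0, 2.0

lhs = np.mean(x[x > t0] > t + t0)   # P(X > t + t0 | X > t0)
rhs = np.mean(x > t)                # P(X > t)
print(lhs, rhs)                     # the two estimates agree
```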
Laplace Distribution A continuous random variable X is called Laplace distributed with the parameter \lambda > 0 if its probability density function is of the form

f_X(x) = \frac{\lambda}{2}\, e^{-\lambda |x|}, \quad -\infty < x < \infty

We have

\mu_X = EX = \int_{-\infty}^{\infty} x\, \frac{\lambda}{2} e^{-\lambda |x|}\, dx = 0, \qquad \sigma_X^2 = E(X - \mu_X)^2 = \int_{-\infty}^{\infty} x^2\, \frac{\lambda}{2} e^{-\lambda |x|}\, dx = \frac{2}{\lambda^2}

Chi-square random variable

A random variable is called a chi-square random variable with n degrees of freedom if its pdf is given by

f_X(x) = \frac{1}{2^{n/2}\, \sigma^n\, \Gamma(n/2)}\, x^{n/2 - 1}\, e^{-x/2\sigma^2}, \quad x > 0
       = 0, \quad x < 0

with the parameter \sigma > 0 and \Gamma(\cdot) denoting the gamma function. A chi-square random variable with n degrees of freedom is denoted by \chi_n^2.

Note that a \chi_2^2 RV is an exponential RV with parameter \lambda = \frac{1}{2\sigma^2}.

The pdfs of \chi_n^2 RVs with different degrees of freedom are shown in the figure below.

Mean and variance of the chi-square random variable

\mu_X = \int_0^{\infty} x f_X(x)\, dx
 = \int_0^{\infty} x\, \frac{x^{n/2-1} e^{-x/2\sigma^2}}{2^{n/2}\, \sigma^n\, \Gamma(n/2)}\, dx
 = \int_0^{\infty} \frac{x^{n/2} e^{-x/2\sigma^2}}{2^{n/2}\, \sigma^n\, \Gamma(n/2)}\, dx
 = \frac{(2\sigma^2)^{n/2}}{2^{n/2}\, \sigma^n\, \Gamma(n/2)} \int_0^{\infty} u^{n/2} e^{-u}\, (2\sigma^2)\, du \qquad (substituting u = x/2\sigma^2)
 = \frac{2\sigma^2\, \Gamma\!\left(\frac{n+2}{2}\right)}{\Gamma(n/2)} = \frac{2\sigma^2\, \frac{n}{2}\, \Gamma(n/2)}{\Gamma(n/2)} = n\sigma^2

Similarly,

EX^2 = \int_0^{\infty} x^2 f_X(x)\, dx
 = \int_0^{\infty} \frac{x^{(n+2)/2} e^{-x/2\sigma^2}}{2^{n/2}\, \sigma^n\, \Gamma(n/2)}\, dx
 = \frac{(2\sigma^2)^{(n+2)/2}}{2^{n/2}\, \sigma^n\, \Gamma(n/2)} \int_0^{\infty} u^{(n+2)/2} e^{-u}\, (2\sigma^2)\, du \qquad (substituting u = x/2\sigma^2)
 = \frac{4\sigma^4\, \Gamma\!\left(\frac{n+4}{2}\right)}{\Gamma(n/2)} = \frac{4\sigma^4\, \frac{n+2}{2} \cdot \frac{n}{2}\, \Gamma(n/2)}{\Gamma(n/2)} = n(n+2)\sigma^4

\sigma_X^2 = EX^2 - \mu_X^2 = n(n+2)\sigma^4 - n^2\sigma^4 = 2n\sigma^4

Relation between the chi-square distribution and the Gaussian distribution

Let X_1, X_2, \ldots, X_n be independent zero-mean Gaussian random variables, each with variance \sigma_X^2. Then Y = X_1^2 + X_2^2 + \ldots + X_n^2 has the \chi_n^2 distribution, with mean n\sigma_X^2 and variance 2n\sigma_X^4. This result gives an important application of the chi-square random variable.

Rayleigh Random Variable

A Rayleigh random variable is characterised by the pdf

f_X(x) = \frac{x}{\sigma^2}\, e^{-x^2/2\sigma^2}, \quad x > 0
       = 0, \quad x < 0

where \sigma is the parameter of the random variable. The probability density functions of Rayleigh RVs for different \sigma are illustrated in Fig.

Mean and variance of the Rayleigh distribution

EX = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^{\infty} x\, \frac{x}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \int_0^{\infty} \frac{x^2}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \sigma\sqrt{\frac{\pi}{2}}

Similarly,

EX^2 = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx = \int_0^{\infty} x^2\, \frac{x}{\sigma^2} e^{-x^2/2\sigma^2}\, dx
 = 2\sigma^2 \int_0^{\infty} u\, e^{-u}\, du \qquad (substituting u = \tfrac{x^2}{2\sigma^2})
 = 2\sigma^2 \qquad (noting that \int_0^{\infty} u e^{-u}\, du is the mean of the exponential RV with \lambda = 1)

\sigma_X^2 = EX^2 - (EX)^2 = 2\sigma^2 - \frac{\pi}{2}\sigma^2 = \left(2 - \frac{\pi}{2}\right)\sigma^2

Relation between the Rayleigh distribution and the Gaussian distribution

A Rayleigh RV is related to Gaussian RVs as follows: let X_1 and X_2 be two independent zero-mean Gaussian RVs with a common variance \sigma^2. Then X = \sqrt{X_1^2 + X_2^2} has the Rayleigh pdf with the parameter \sigma. We shall prove this result in a later class. This important result also suggests the cases where the Rayleigh RV can be used.
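This relation, and the Rayleigh mean and variance derived above, can be checked by simulation; \sigma = 2 below is an assumed value.

```python
# Empirical check that sqrt(X1^2 + X2^2) of two independent N(0, sigma^2) RVs is Rayleigh (sketch; sigma = 2 assumed)
import numpy as np

sigma = 2.0
rng = np.random.default_rng(3)
x1 = rng.normal(0.0, sigma, 1_000_000)
x2 = rng.normal(0.0, sigma, 1_000_000)
r = np.sqrt(x1**2 + x2**2)

print(np.mean(r), sigma * np.sqrt(np.pi / 2))      # mean of the Rayleigh RV
print(np.var(r), (2 - np.pi / 2) * sigma**2)       # variance of the Rayleigh RV
```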

Applications of the Rayleigh RV: modelling the root-mean-square error, and modelling the envelope of a signal with two orthogonal components, as in the case of a signal of the form s(t) = X_1 \cos\omega t + X_2 \sin\omega t. If X_1 \sim N(0, \sigma^2) and X_2 \sim N(0, \sigma^2) are independent, then the envelope X = \sqrt{X_1^2 + X_2^2} has the Rayleigh distribution.

Simulation of Random Variables


In many fields of science and engineering, computer simulation is used to study random phenomena in nature and the performance of an engineering system in a noisy environment. For example, we may study through computer simulation the performance of a communication receiver. Sometimes a probability model may not be analytically tractable and computer simulation is used to calculate probabilities. The heart of all these applications is that it is possible to simulate a random variable with an empirical CDF or pdf that fits well with the theoretical CDF or pdf.

Generation of Random Numbers

Generating random numbers means producing a sequence of independent random numbers with a specified CDF or pdf. All random number generators rely on a routine that generates random numbers with the uniform pdf. Such a routine is of vital importance because the quality of the generated random numbers with any other distribution depends on it. By the quality of the generated random numbers we mean how closely the empirical CDF or pdf fits the true one. There are several algorithms to generate U[0, 1] random numbers. Note that these algorithms generate random numbers by a reproducible deterministic method. These numbers are pseudo-random numbers: they are reproducible, and the same sequence of numbers repeats after some period of count specific to the generating algorithm. This period is very large, and a finite sample of data within the period appears to be uniformly distributed. We will not discuss these algorithms; software packages provide routines to generate such numbers.

Method of Inverse Transform

Suppose we want to generate a random variable X with a prescribed distribution function F_X(x). We have observed that the random variable Y defined by Y = F_X(X) is U[0, 1] distributed. Thus, given a U[0, 1] random number Y, the inverse transform X = F_X^{-1}(Y) will have the CDF F_X(x).

The algorithmic steps for the inverse transform method are as follows:
1. Generate a random number from Y \sim U[0, 1]. Call it y.
2. Compute the value x such that F_X(x) = y.
3. Take x to be the random number generated.
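A minimal sketch of these steps follows, using the exponential CDF F_X(x) = 1 - e^{-\lambda x} as the target (the exponential case is also worked out analytically in the example further below); the value of \lambda is an assumption chosen only for illustration.

```python
# Inverse-transform sampling (sketch): generate X with F_X(x) = 1 - exp(-lambda*x) from U(0,1) numbers
import numpy as np

lam = 1.5
rng = np.random.default_rng(4)
y = rng.uniform(0.0, 1.0, 1_000_000)   # step 1: U(0,1) random numbers
x = -np.log(1.0 - y) / lam             # step 2: solve F_X(x) = y for x
print(np.mean(x), 1/lam)               # step 3: the generated sample has mean close to 1/lambda
```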

Example Suppose we want to generate a random variable with the pdf f_X(x) given by

f_X(x) = \frac{2x}{9}, \quad 0 \le x \le 3
       = 0 \qquad otherwise

The CDF of X is given by

F_X(x) = 0 \ for\ x < 0, \qquad = \frac{x^2}{9} \ for\ 0 \le x \le 3, \qquad = 1 \ for\ x > 3

Therefore, we generate a random number y from the U[0, 1] distribution and set F_X(x) = y. We have

\frac{x^2}{9} = y \quad \Rightarrow \quad x = 3\sqrt{y}

Example Suppose we want to generate a random variable with the exponential distribution given by f_X(x) = \lambda e^{-\lambda x}, \lambda > 0, x \ge 0. Then

F_X(x) = 1 - e^{-\lambda x}

Therefore, given y, we can get x by the mapping

1 - e^{-\lambda x} = y \quad \Rightarrow \quad x = -\frac{\log_e(1 - y)}{\lambda}

Since 1 - y is also uniformly distributed over [0, 1], the above expression can be simplified to

x = -\frac{\log_e y}{\lambda}
Generation of Gaussian random numbers

Generation of discrete random variables

The inverse transform method can also be applied to a discrete random variable by inverting its staircase CDF. Suppose X is a discrete random variable with the probability mass function p_X(x_i), i = 1, 2, \ldots, n. Given y = F_X(x), the inverse mapping is defined as follows: if F_X(x_{k-1}) \le y < F_X(x_k), then

F_X^{-1}(y) = x_k

The algorithmic steps of the inverse transform method for simulating discrete random variables are as follows:
1. Generate a random number from Y \sim U[0, 1]. Call it y.
2. Compute the value x_k such that F_X(x_{k-1}) \le y < F_X(x_k).
3. Take x_k to be the random number generated.

Example Generation of Bernoulli random numbers Suppose we want to generate a random number from X \sim Br(p). Generate y from the U[0, 1] distribution and set

x = 0 \ for\ y \le 1 - p, \qquad x = 1 \ otherwise
Jointly distributed random variables

We may define two or more random variables on the same sample space. Let X and Y be two real random variables defined on the same probability space (S, F, P). The mapping S \to \mathbb{R}^2 such that for s \in S, (X(s), Y(s)) \in \mathbb{R}^2, is called a joint random variable.

Figure: mapping of a sample point s to the point (X(s), Y(s)) in the plane.

Remark The above figure illustrates the mapping corresponding to a joint random variable. The joint random variable in the above case is denoted by (X, Y). We may represent a joint random variable as a two-dimensional vector X = [X\ \ Y]. We can extend the above definition to define joint random variables of any dimension: the mapping S \to \mathbb{R}^n such that for s \in S, (X_1(s), X_2(s), \ldots, X_n(s)) \in \mathbb{R}^n, is called an n-dimensional random variable and is denoted by the vector X = [X_1\ X_2\ \ldots\ X_n].

Example1: Suppose we are interested in studying the height and weight of the students in a class. We can define the joint RV ( X , Y ) where X represents height and Y represents the weight. Example 2 Suppose in a communication system X is the transmitted signal and Y is the corresponding noisy received signal. Then ( X , Y ) is a joint random variable.
Joint Probability Distribution Function

Recall the definition of the distribution function of a single random variable. The event \{X \le x\} was used to define the probability distribution function F_X(x). Given F_X(x), we can find the probability of any event involving the random variable.

Similarly, for two random variables X and Y, the event \{X \le x, Y \le y\} = \{X \le x\} \cap \{Y \le y\} is considered as the representative event.

The probability P\{X \le x, Y \le y\}, (x, y) \in \mathbb{R}^2, is called the joint distribution function of the random variables X and Y and is denoted by F_{X,Y}(x, y).

F_{X,Y}(x, y) satisfies the following properties:

F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2) if x_1 \le x_2 and y_1 \le y_2. Indeed, if x_1 \le x_2 and y_1 \le y_2,
\{X \le x_1, Y \le y_1\} \subseteq \{X \le x_2, Y \le y_2\}
\Rightarrow P\{X \le x_1, Y \le y_1\} \le P\{X \le x_2, Y \le y_2\}
\Rightarrow F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2)

F_{X,Y}(-\infty, y) = F_{X,Y}(x, -\infty) = 0

F_{X,Y}(\infty, \infty) = 1. Note that \{X \le \infty, Y \le \infty\} = S.

F_{X,Y}(x, y) is right continuous in both variables.

If x_1 < x_2 and y_1 < y_2,
P\{x_1 < X \le x_2, y_1 < Y \le y_2\} = F_{X,Y}(x_2, y_2) - F_{X,Y}(x_1, y_2) - F_{X,Y}(x_2, y_1) + F_{X,Y}(x_1, y_1) \ge 0

Given F_{X,Y}(x, y), -\infty < x < \infty, -\infty < y < \infty, we have a complete description of the random variables X and Y.

F_X(x) = F_{X,Y}(x, +\infty). To prove this,
\{X \le x\} = \{X \le x\} \cap \{Y \le +\infty\}
\Rightarrow F_X(x) = P\{X \le x\} = P\{X \le x, Y \le +\infty\} = F_{X,Y}(x, +\infty)

Similarly F_Y(y) = F_{X,Y}(\infty, y). Given F_{X,Y}(x, y), -\infty < x < \infty, -\infty < y < \infty, each of F_X(x) and F_Y(y) is called a marginal distribution function.
Example Consider two jointly distributed random variables X and Y with the joint CDF

F_{X,Y}(x, y) = (1 - e^{-2x})(1 - e^{-y}), \quad x \ge 0, y \ge 0
             = 0 \qquad otherwise

(a) Find the marginal CDFs. (b) Find the probability P\{1 < X \le 2, 1 < Y \le 2\}.

(a) F_X(x) = \lim_{y \to \infty} F_{X,Y}(x, y) = 1 - e^{-2x} for x \ge 0 and 0 elsewhere; F_Y(y) = \lim_{x \to \infty} F_{X,Y}(x, y) = 1 - e^{-y} for y \ge 0 and 0 elsewhere.

(b) P\{1 < X \le 2, 1 < Y \le 2\} = F_{X,Y}(2, 2) + F_{X,Y}(1, 1) - F_{X,Y}(1, 2) - F_{X,Y}(2, 1)
 = (1 - e^{-4})(1 - e^{-2}) + (1 - e^{-2})(1 - e^{-1}) - (1 - e^{-2})(1 - e^{-2}) - (1 - e^{-4})(1 - e^{-1})
 \approx 0.0272

Jointly distributed discrete random variables

If X and Y are two discrete random variables defined on the same probability space (S, F, P) such that X takes values from the countable subset R_X and Y takes values from the countable subset R_Y, then the joint random variable (X, Y) can take values from the countable subset R_X \times R_Y. The joint random variable (X, Y) is completely specified by its joint probability mass function

p_{X,Y}(x, y) = P\{s \mid X(s) = x, Y(s) = y\}, \quad (x, y) \in R_X \times R_Y

Given p_{X,Y}(x, y), we can determine other probabilities involving the random variables X and Y.

Remark

p_{X,Y}(x, y) = 0 for (x, y) \notin R_X \times R_Y

\sum_{(x, y) \in R_X \times R_Y} p_{X,Y}(x, y) = 1. This is because

\sum_{(x, y) \in R_X \times R_Y} p_{X,Y}(x, y) = P\Big(\bigcup_{(x, y) \in R_X \times R_Y} \{x, y\}\Big) = P(R_X \times R_Y) = P\{s \mid (X(s), Y(s)) \in R_X \times R_Y\} = P(S) = 1

Marginal Probability Mass Functions: The probability mass functions p_X(x) and p_Y(y) are obtained from the joint probability mass function as follows:

p_X(x) = P\{X = x\} = \sum_{y \in R_Y} p_{X,Y}(x, y)

and similarly

p_Y(y) = \sum_{x \in R_X} p_{X,Y}(x, y)

These probability mass functions p_X(x) and p_Y(y), obtained from the joint probability mass function, are called marginal probability mass functions.

Example Consider the random variables X and Y with the joint probability mass function tabulated below. The marginal probabilities are shown in the last column and the last row.

            X = 0    X = 1    X = 2    p_Y(y)
  Y = 0     0.25     0.10     0.15     0.50
  Y = 1     0.14     0.35     0.01     0.50
  p_X(x)    0.39     0.45     0.16     1

Joint Probability Density Function

If X and Y are two continuous random variables and their joint distribution function is continuous in both x and y, then we can define the joint probability density function f_{X,Y}(x, y) by

f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y), \quad provided it exists.

Clearly

F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du

Properties of the Joint Probability Density Function

f_{X,Y}(x, y) is always a non-negative quantity, that is, f_{X,Y}(x, y) \ge 0 for all (x, y) \in \mathbb{R}^2.

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1

The probability of any Borel set B can be obtained by

P(B) = \int\!\!\int_{(x, y) \in B} f_{X,Y}(x, y)\, dx\, dy

Marginal density functions

The marginal density functions f_X(x) and f_Y(y) of two jointly distributed RVs X and Y are given by the derivatives of the corresponding marginal distribution functions. Thus

f_X(x) = \frac{d}{dx} F_X(x) = \frac{d}{dx} F_{X,Y}(x, \infty) = \frac{d}{dx} \int_{-\infty}^{x} \left(\int_{-\infty}^{\infty} f_{X,Y}(u, y)\, dy\right) du = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy

and similarly

f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx

Remark The marginal CDF and pdf are the same as the CDF and pdf of the concerned single random variable. The term marginal simply indicates that it is derived from the corresponding joint distribution or density function of two or more jointly distributed random variables.

With the help of the two-dimensional Dirac delta function, we can define the joint pdf of two jointly distributed discrete random variables. Thus for discrete jointly distributed random variables X and Y,

f_{X,Y}(x, y) = \sum_{(x_i, y_j) \in R_X \times R_Y} p_{X,Y}(x_i, y_j)\, \delta(x - x_i, y - y_j)

Example The joint density function f_{X,Y}(x, y) in the previous example is

f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y}\left[(1 - e^{-2x})(1 - e^{-y})\right] = 2\, e^{-2x} e^{-y}, \quad x \ge 0, y \ge 0

Example: The joint pdf of two random variables X and Y is given by

f_{X,Y}(x, y) = c\,xy, \quad 0 \le x \le 2, \ 0 \le y \le 2
             = 0 \qquad otherwise

(i) Find c. (ii) Find F_{X,Y}(x, y). (iii) Find f_X(x) and f_Y(y). (iv) What is the probability P(0 < X \le 1, 0 < Y \le 1)?

(i) \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy\, dx = c \int_0^2 \int_0^2 xy\, dy\, dx = 4c = 1, so c = \frac{1}{4}

(ii) F_{X,Y}(x, y) = \frac{1}{4} \int_0^y \int_0^x uv\, du\, dv = \frac{x^2 y^2}{16}, \quad 0 \le x \le 2, \ 0 \le y \le 2

(iii) f_X(x) = \int_0^2 \frac{xy}{4}\, dy = \frac{x}{2}, \quad 0 \le x \le 2, and similarly f_Y(y) = \frac{y}{2}, \quad 0 \le y \le 2

(iv) P(0 < X \le 1, 0 < Y \le 1) = F_{X,Y}(1, 1) + F_{X,Y}(0, 0) - F_{X,Y}(0, 1) - F_{X,Y}(1, 0) = \frac{1}{16} + 0 - 0 - 0 = \frac{1}{16}


Conditional probability mass functions

p_{Y|X}(y \mid x) = P(\{Y = y\} \mid \{X = x\}) = \frac{P(\{X = x\} \cap \{Y = y\})}{P\{X = x\}} = \frac{p_{X,Y}(x, y)}{p_X(x)}, \quad provided\ p_X(x) \ne 0

Similarly we can define the conditional probability mass function p_{X|Y}(x \mid y).

From the definition of conditional probability mass functions, we can define independent random variables. Two discrete random variables X and Y are said to be independent if and only if p_{Y|X}(y \mid x) = p_Y(y), so that

p_{X,Y}(x, y) = p_X(x)\, p_Y(y)

Bayes' rule:

p_{X|Y}(x \mid y) = P(\{X = x\} \mid \{Y = y\}) = \frac{P(\{X = x\} \cap \{Y = y\})}{P\{Y = y\}} = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_{X,Y}(x, y)}{\sum_{x \in R_X} p_{X,Y}(x, y)}

Example Consider again the random variables X and Y with the joint probability mass function tabulated earlier, whose marginal probabilities are shown in the last column and the last row. Then

p_{Y|X}(1 \mid 0) = \frac{p_{X,Y}(0, 1)}{p_X(0)} = \frac{0.14}{0.39}
Conditional Probability Density Function

f_{Y|X}(y \mid X = x) = f_{Y|X}(y \mid x) is called the conditional density of Y given X. Let us define the conditional distribution function. We cannot define the conditional distribution function for continuous random variables X and Y by the relation

F_{Y|X}(y \mid x) = P(Y \le y \mid X = x) = \frac{P(Y \le y, X = x)}{P(X = x)}

because both the numerator and the denominator of this expression are zero. The conditional distribution function is instead defined in the limiting sense as follows:

F_{Y|X}(y \mid x) = \lim_{\Delta x \to 0} P(Y \le y \mid x < X \le x + \Delta x)
 = \lim_{\Delta x \to 0} \frac{P(Y \le y, x < X \le x + \Delta x)}{P(x < X \le x + \Delta x)}
 = \lim_{\Delta x \to 0} \frac{\int_{-\infty}^{y} f_{X,Y}(x, u)\, \Delta x\, du}{f_X(x)\, \Delta x}
 = \frac{\int_{-\infty}^{y} f_{X,Y}(x, u)\, du}{f_X(x)}

The conditional density is defined in the limiting sense as follows:

f_{Y|X}(y \mid X = x) = \lim_{\Delta y \to 0} \frac{F_{Y|X}(y + \Delta y \mid X = x) - F_{Y|X}(y \mid X = x)}{\Delta y}
 = \lim_{\Delta y \to 0,\, \Delta x \to 0} \frac{F_{Y|X}(y + \Delta y \mid x < X \le x + \Delta x) - F_{Y|X}(y \mid x < X \le x + \Delta x)}{\Delta y} \qquad (1)

because (X = x) = \lim_{\Delta x \to 0} (x < X \le x + \Delta x). The right-hand side of (1) is

\lim_{\Delta y \to 0,\, \Delta x \to 0} \frac{P(y < Y \le y + \Delta y \mid x < X \le x + \Delta x)}{\Delta y}
 = \lim_{\Delta y \to 0,\, \Delta x \to 0} \frac{P(y < Y \le y + \Delta y, x < X \le x + \Delta x)}{P(x < X \le x + \Delta x)\, \Delta y}
 = \lim_{\Delta y \to 0,\, \Delta x \to 0} \frac{f_{X,Y}(x, y)\, \Delta x\, \Delta y}{f_X(x)\, \Delta x\, \Delta y}
 = \frac{f_{X,Y}(x, y)}{f_X(x)}

\therefore f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \qquad (2)

Similarly we have

f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \qquad (3)

Two random variables are statistically independent if, for all (x, y),

f_{Y|X}(y \mid x) = f_Y(y), \quad or\ equivalently \quad f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \qquad (4)

Bayes rule for continuous random variables:

From (2) and (3) we get Bayes' rule:

f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x)\, f_{Y|X}(y \mid x)}{f_Y(y)} = \frac{f_{X,Y}(x, y)}{\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx} = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{\int_{-\infty}^{\infty} f_X(u)\, f_{Y|X}(y \mid u)\, du}
Given the joint density function we can find out the conditional density function.
Example: For random variables X and Y, the joint probability density function is given by

f_{X,Y}(x, y) = \frac{1 + xy}{4}, \quad |x| \le 1, \ |y| \le 1
             = 0 \qquad otherwise

Find the marginal densities f_X(x), f_Y(y) and the conditional density f_{Y|X}(y \mid x). Are X and Y independent?

f_X(x) = \int_{-1}^{1} \frac{1 + xy}{4}\, dy = \frac{1}{2}, \quad -1 \le x \le 1, \qquad and similarly f_Y(y) = \frac{1}{2}, \quad -1 \le y \le 1

f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{1 + xy}{2}, \quad |y| \le 1

Since f_{Y|X}(y \mid x) \ne f_Y(y) (equivalently, f_{X,Y}(x, y) \ne f_X(x) f_Y(y)), X and Y are not independent.
Bayes Rule for mixed random variables

Let X be a discrete random variable with probability mass function p_X(x) and let Y be a continuous random variable with the conditional probability density function f_{Y|X}(y \mid x). In practical problems we may have to estimate X from the observed Y. Then

p_{X|Y}(x \mid y) = \lim_{\Delta y \to 0} P(X = x \mid y < Y \le y + \Delta y)
 = \lim_{\Delta y \to 0} \frac{P(X = x, y < Y \le y + \Delta y)}{P(y < Y \le y + \Delta y)}
 = \lim_{\Delta y \to 0} \frac{p_X(x)\, f_{Y|X}(y \mid x)\, \Delta y}{f_Y(y)\, \Delta y}
 = \frac{p_X(x)\, f_{Y|X}(y \mid x)}{f_Y(y)}
 = \frac{p_X(x)\, f_{Y|X}(y \mid x)}{\sum_{x} p_X(x)\, f_{Y|X}(y \mid x)}
Example

X is a binary random variable with

X = 1 \ with\ probability\ p, \qquad X = -1 \ with\ probability\ 1 - p

V is Gaussian noise with mean 0 and variance \sigma^2, and the observation is Y = X + V, so that f_{Y|X}(y \mid x) is Gaussian with mean x and variance \sigma^2. Then

p_{X|Y}(x = 1 \mid y) = \frac{p_X(1)\, f_{Y|X}(y \mid 1)}{\sum_x p_X(x)\, f_{Y|X}(y \mid x)} = \frac{p\, e^{-(y-1)^2/2\sigma^2}}{p\, e^{-(y-1)^2/2\sigma^2} + (1 - p)\, e^{-(y+1)^2/2\sigma^2}}

Independent Random Variables

Let X and Y be two random variables characterised by the joint distribution function

F_{X,Y}(x, y) = P\{X \le x, Y \le y\}

and the corresponding joint density function f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y).

Then X and Y are independent if, for all (x, y) \in \mathbb{R}^2, \{X \le x\} and \{Y \le y\} are independent events. Thus

F_{X,Y}(x, y) = P\{X \le x, Y \le y\} = P\{X \le x\}\, P\{Y \le y\} = F_X(x)\, F_Y(y)

and equivalently

f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y) = f_X(x)\, f_Y(y), \qquad f_{Y|X}(y \mid x) = f_Y(y)

Remark: Suppose X and Y are two discrete random variables with joint probability mass function p_{X,Y}(x, y). Then X and Y are independent if

p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \quad for\ all\ (x, y) \in R_X \times R_Y
Transformation of two random variables: We are often interested in finding the probability density function of a function of two or more RVs. Following are a few examples.

The signal received by a communication receiver is given by Z = X + Y, where Z is the received signal, which is the superposition of the message signal X and the noise Y.

Frequently applied operations on communication signals, like modulation, demodulation and correlation, involve the multiplication of two signals in the form Z = XY.

We have to know the probability distribution of Z in any analysis of Z. More formally, given two random variables X and Y with joint probability density function f_{X,Y}(x, y) and a function Z = g(X, Y), we have to find f_Z(z). In this lecture, we will try to address this problem.

Probability density of a function of two random variables: We consider the transformation g : \mathbb{R}^2 \to \mathbb{R}. Consider the event \{Z \le z\} corresponding to each z. We can find a subset D_z \subseteq \mathbb{R}^2 such that D_z = \{(x, y) \mid g(x, y) \le z\}. Then

F_Z(z) = P(\{Z \le z\}) = P\{(x, y) \mid (x, y) \in D_z\} = \int\!\!\int_{(x, y) \in D_z} f_{X,Y}(x, y)\, dy\, dx

and

f_Z(z) = \frac{dF_Z(z)}{dz}
Example: Suppose Z = X+Y. Find the PDF f Z ( z ) .


Zz => X + Y Z

Therefore DZ is the shaded region in the fig below.

FZ ( z ) =
=

( x , y )Dz

f X ,Y ( x, y ) dxdy

zx f X ,Y ( x, y )dy dx z = f X ,Y ( x, u x ) du dx z = f X ,Y ( x, u x ) dx du

substituting y = u - x interchanging the order of integration

z d fZ ( z ) = f X ,Y ( x, u x ) dx du dz

f X ,Y ( x, u x ) dx

If X and Y are independent

f X ,Y ( x, z x ) = f X ( x ) fY ( z x )
fZ ( z ) =

f X ( x ) fY ( z x ) dx

= f X ( z ) * fY ( z )

Where * is the convolution operation.

Example: Suppose X and Y are independent random variables, each uniformly distributed over (a, b), with f_X(x) and f_Y(y) as shown in the figure below. Then the pdf of Z = X + Y, obtained by convolving the two rectangular pdfs, is a triangular probability density function, as the simulation sketch below also illustrates.
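The following sketch checks the triangular shape empirically; a = 0 and b = 1 are assumed values, for which f_Z(z) = z on [0, 1] and 2 - z on [1, 2].

```python
# The pdf of Z = X + Y for independent U(a, b) RVs is triangular (sketch; a = 0, b = 1 assumed)
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 1_000_000)
y = rng.uniform(0.0, 1.0, 1_000_000)
z = x + y

hist, edges = np.histogram(z, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
triangle = np.where(centers < 1.0, centers, 2.0 - centers)   # f_Z(z) = z on [0,1], 2 - z on [1,2]
print(np.max(np.abs(hist - triangle)))                       # small deviation from the triangular pdf
```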

Probability density function of Z = XY

F_Z(z) = \int\!\!\int_{(x, y):\, xy \le z} f_{X,Y}(x, y)\, dy\, dx

Substituting u = xy (so that du = x\, dy for fixed x) and differentiating with respect to z, one obtains

f_Z(z) = \frac{d}{dz} F_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{z}{x}\right) dx = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{X,Y}\!\left(\frac{z}{y}, y\right) dy
Probability density function of Z = Y / X

Z \le z \Rightarrow Y/X \le z, so

D_z = \{(x, y) \mid y \le xz \ for\ x > 0\} \cup \{(x, y) \mid y \ge xz \ for\ x < 0\}

F_Z(z) = \int\!\!\int_{(x, y) \in D_z} f_{X,Y}(x, y)\, dy\, dx

Substituting y = ux (so that dy = x\, du for fixed x) and differentiating with respect to z, one obtains

f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_{X,Y}(x, xz)\, dx

If X and Y are independent random variables, then f_{X,Y}(x, xz) = f_X(x)\, f_Y(xz) and

f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_X(x)\, f_Y(xz)\, dx
Example:

Suppose X and Y are independent zero-mean Gaussian random variables with unit standard deviation and Z = \frac{Y}{X}. Then

f_Z(z) = \int_{-\infty}^{\infty} |x|\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 z^2}{2}}\, dx
 = \frac{1}{2\pi} \int_{-\infty}^{\infty} |x|\, e^{-\frac{x^2(1 + z^2)}{2}}\, dx
 = \frac{1}{\pi} \int_0^{\infty} x\, e^{-\frac{x^2(1 + z^2)}{2}}\, dx
 = \frac{1}{\pi (1 + z^2)}
which is the Cauchy probability density function. Probability density function of Z = Dz = {( x, y ) | Z z} =


X2 +Y2

{( x, y ) |
2 z

x2 + y 2 z

= {( r , ) | 0 r z , 0 2 } FZ ( z ) = =
( x , y )Dz

f X ,Y ( x, y ) dydx

f ( r cos , r sin ) rdrd


XY 0 0 2

d fZ ( z ) = FZ ( z ) = dz

f XY ( z cos , z sin ) zd

Example: Suppose X and Y are two independent Gaussian random variables, each with mean 0 and variance σ², and Z = √(X² + Y²). Then
f_Z(z) = ∫_{0}^{2π} f_{X,Y}(z cos θ, z sin θ) z dθ
       = z ∫_{0}^{2π} f_X(z cos θ) f_Y(z sin θ) dθ
       = z ∫_{0}^{2π} (1/2πσ²) e^{-z²cos²θ/2σ²} e^{-z²sin²θ/2σ²} dθ
       = (z/σ²) e^{-z²/2σ²},   z ≥ 0
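A quick Monte Carlo sketch of this Rayleigh result; the value σ = 2 is an arbitrary illustrative choice.

```python
import numpy as np

# Simulate Z = sqrt(X^2 + Y^2) for independent N(0, sigma^2) inputs and compare
# the histogram with the analytic Rayleigh density (z / sigma^2) exp(-z^2 / 2 sigma^2).
rng = np.random.default_rng(1)
sigma, n = 2.0, 500_000
z = np.hypot(rng.normal(0, sigma, n), rng.normal(0, sigma, n))
hist, edges = np.histogram(z, bins=100, density=True)
c = 0.5 * (edges[:-1] + edges[1:])
analytic = (c / sigma**2) * np.exp(-c**2 / (2 * sigma**2))
print(np.max(np.abs(hist - analytic)))   # small for large n
```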

The above is the Rayleigh density function we discussed earlier.

Rician Distribution:

Suppose X and Y are independent Gaussian random variables with non-zero means μ_X and μ_Y respectively and common variance σ². We have to find the density function of the random variable Z = √(X² + Y²). Such a Z models, for example, the envelope of a sinusoid plus narrow-band Gaussian noise, or the received signal in a multipath situation. Here
f_{X,Y}(x, y) = (1/2πσ²) e^{-[(x-μ_X)² + (y-μ_Y)²]/2σ²}
and Z = √(X² + Y²). We have shown that
f_Z(z) = ∫_{0}^{2π} f_{X,Y}(z cos θ, z sin θ) z dθ
Suppose μ_X = μ cos φ and μ_Y = μ sin φ. Then
f_{X,Y}(z cos θ, z sin θ) = (1/2πσ²) e^{-[(z cos θ - μ cos φ)² + (z sin θ - μ sin φ)²]/2σ²}
                          = (1/2πσ²) e^{-(z² + μ²)/2σ²} e^{zμ cos(θ - φ)/σ²}
so that
f_Z(z) = (z/2πσ²) e^{-(z² + μ²)/2σ²} ∫_{0}^{2π} e^{zμ cos(θ - φ)/σ²} dθ
       = (z/σ²) e^{-(z² + μ²)/2σ²} I₀(zμ/σ²),   z ≥ 0
where I₀(·) is the modified Bessel function of the first kind and order zero. This is the Rician probability density function.

Joint Probability Density Function of two functions of two random variables

We consider the transformation (g₁, g₂): ℝ² → ℝ². We have to find out the joint probability density function f_{Z₁,Z₂}(z₁, z₂), where Z₁ = g₁(X, Y) and Z₂ = g₂(X, Y), that is, z₁ = g₁(x, y) and z₂ = g₂(x, y). Suppose the inverse mapping relation is x = h₁(z₁, z₂) and y = h₂(z₁, z₂).

Consider a differential region of area dz₁dz₂ at the point (z₁, z₂) in the Z₁-Z₂ plane. Let us see how the corners of the differential region are mapped to the X-Y plane. Observe that
h₁(z₁ + dz₁, z₂) = h₁(z₁, z₂) + (∂h₁/∂z₁) dz₁ = x + (∂h₁/∂z₁) dz₁
h₂(z₁ + dz₁, z₂) = h₂(z₁, z₂) + (∂h₂/∂z₁) dz₁ = y + (∂h₂/∂z₁) dz₁
Therefore, the point (z₁ + dz₁, z₂) is mapped to the point (x + (∂h₁/∂z₁) dz₁, y + (∂h₂/∂z₁) dz₁) in the X-Y plane.

[Figure: the point (z₁, z₂) and its neighbour (z₁ + dz₁, z₂) in the Z₁-Z₂ plane and their images (x, y) and (x + (∂h₁/∂z₁) dz₁, y + (∂h₂/∂z₁) dz₁) in the X-Y plane.]
We can similarly find the points in the X-Y plane corresponding to (z₁, z₂ + dz₂) and (z₁ + dz₁, z₂ + dz₂). The mapping is shown in the figure. We notice that each differential region in the X-Y plane is a parallelogram. It can be shown that the differential parallelogram at (x, y) has an area |J(z₁, z₂)| dz₁ dz₂, where J(z₁, z₂) is the Jacobian of the transformation, defined as the determinant
J(z₁, z₂) = det [ ∂h₁/∂z₁  ∂h₁/∂z₂
                  ∂h₂/∂z₁  ∂h₂/∂z₂ ]
Further, it can be shown that the absolute values of the Jacobians of the forward and the inverse transforms are reciprocals of each other, so that
|J(z₁, z₂)| = 1/|J(x, y)|
where
J(x, y) = det [ ∂g₁/∂x  ∂g₁/∂y
                ∂g₂/∂x  ∂g₂/∂y ]
Therefore, the differential parallelogram in the figure has an area dz₁dz₂/|J(x, y)|.

Suppose the transformation z₁ = g₁(x, y), z₂ = g₂(x, y) has n roots and let (xᵢ, yᵢ), i = 1, 2, …, n be the roots. The inverse mapping of the differential region in the Z₁-Z₂ plane will then be n differential regions corresponding to the n roots. The inverse mapping is illustrated in the figure below for n = 4. As these parallelograms are non-overlapping,
f_{Z₁,Z₂}(z₁, z₂) dz₁ dz₂ = Σ_{i=1}^{n} f_{X,Y}(xᵢ, yᵢ) dz₁dz₂/|J(xᵢ, yᵢ)|
so that
f_{Z₁,Z₂}(z₁, z₂) = Σ_{i=1}^{n} f_{X,Y}(xᵢ, yᵢ)/|J(xᵢ, yᵢ)|

Remark: If z₁ = g₁(x, y), z₂ = g₂(x, y) does not have a root in (x, y), then f_{Z₁,Z₂}(z₁, z₂) = 0.

[Figure: the differential rectangle with corners (z₁, z₂), (z₁ + dz₁, z₂), (z₁, z₂ + dz₂), (z₁ + dz₁, z₂ + dz₂) in the Z₁-Z₂ plane and its image parallelogram in the X-Y plane with corners (x, y), (x + (∂x/∂z₁)dz₁, y + (∂y/∂z₁)dz₁), (x + (∂x/∂z₂)dz₂, y + (∂y/∂z₂)dz₂) and (x + (∂x/∂z₁)dz₁ + (∂x/∂z₂)dz₂, y + (∂y/∂z₁)dz₁ + (∂y/∂z₂)dz₂).]

Example: pdf of a linear transformation
Z₁ = aX + bY,  Z₂ = cX + dY
Then
x = (d z₁ − b z₂)/(ad − bc),  y = (a z₂ − c z₁)/(ad − bc)
and
J(x, y) = det [ a  b
                c  d ] = ad − bc
so that
f_{Z₁,Z₂}(z₁, z₂) = f_{X,Y}( (d z₁ − b z₂)/(ad − bc), (a z₂ − c z₁)/(ad − bc) ) / |ad − bc|

Example: Suppose X and Y are two independent Gaussian random variables, each with mean 0 and variance σ². Given R = √(X² + Y²) and Θ = tan⁻¹(Y/X), find f_R(r) and f_Θ(θ).

Solution: We have x = r cos θ and y = r sin θ, so that
r² = x² + y²   (1)  and  tan θ = y/x   (2)
From (1),
∂r/∂x = x/r = cos θ,  ∂r/∂y = y/r = sin θ
From (2),
∂θ/∂x = −y/(x² + y²) = −sin θ/r,  ∂θ/∂y = x/(x² + y²) = cos θ/r
Therefore,
J(x, y) = det [ cos θ      sin θ
                −sin θ/r   cos θ/r ] = 1/r
so that
f_{R,Θ}(r, θ) = f_{X,Y}(x, y)/|J(x, y)| evaluated at x = r cos θ, y = r sin θ
             = r (1/2πσ²) e^{-(r²cos²θ + r²sin²θ)/2σ²}
             = (r/2πσ²) e^{-r²/2σ²}

f_R(r) = ∫_{0}^{2π} f_{R,Θ}(r, θ) dθ = (r/σ²) e^{-r²/2σ²},   0 ≤ r < ∞

f_Θ(θ) = ∫_{0}^{∞} f_{R,Θ}(r, θ) dr = (1/2πσ²) ∫_{0}^{∞} r e^{-r²/2σ²} dr = 1/2π,   0 ≤ θ < 2π

Rician Distribution:

X and Y are independent Gaussian random variables with non-zero means μ_X and μ_Y respectively and common variance σ². We have to find the joint density function of
Z = √(X² + Y²),  Θ = tan⁻¹(Y/X)
(Z models the envelope of a sinusoid plus narrow-band Gaussian noise, or the received signal in a multipath situation.)

We have to find J(x, y) corresponding to z and θ:
J(x, y) = det [ ∂z/∂x  ∂z/∂y
                ∂θ/∂x  ∂θ/∂y ]
From z² = x² + y² and tan θ = y/x, we have
∂z/∂x = x/z = cos θ,  ∂z/∂y = y/z = sin θ
and also
∂θ/∂x = −(y/x²) cos²θ = −sin θ/z,  ∂θ/∂y = (1/x) cos²θ = cos θ/z
Therefore,
J(x, y) = det [ cos θ      sin θ
                −sin θ/z   cos θ/z ] = (cos²θ + sin²θ)/z = 1/z

Here
f_{X,Y}(x, y) = (1/2πσ²) e^{-[(x-μ_X)² + (y-μ_Y)²]/2σ²}
We have to find the density at (x, y) corresponding to z and θ. Consider the transformation shown in the diagram below: writing μ_X = μ cos θ₀ and μ_Y = μ sin θ₀, we have x − μ_X = z cos θ − μ cos θ₀ and y − μ_Y = z sin θ − μ sin θ₀. Therefore
f_{Z,Θ}(z, θ) = f_{X,Y}(z cos θ, z sin θ)/|J(x, y)|
             = (z/2πσ²) e^{-[(z cos θ − μ cos θ₀)² + (z sin θ − μ sin θ₀)²]/2σ²}
             = (z/2πσ²) e^{-(z² + μ²)/2σ²} e^{zμ cos(θ − θ₀)/σ²}
Integrating over θ, as before, gives the Rician density of Z.
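A simulation sketch of the Rician envelope, checking the density obtained above; the values of μ_X, μ_Y and σ are illustrative assumptions only.

```python
import numpy as np
from scipy.special import i0   # modified Bessel function of the first kind, order 0

# Simulate Z = sqrt(X^2 + Y^2) with non-zero-mean Gaussian X, Y and compare the
# histogram with (z/sigma^2) exp(-(z^2 + mu^2)/(2 sigma^2)) I0(z mu / sigma^2).
rng = np.random.default_rng(2)
mu_x, mu_y, sigma, n = 1.0, 2.0, 1.5, 500_000
mu = np.hypot(mu_x, mu_y)
z = np.hypot(rng.normal(mu_x, sigma, n), rng.normal(mu_y, sigma, n))
hist, edges = np.histogram(z, bins=100, density=True)
c = 0.5 * (edges[:-1] + edges[1:])
rician = (c / sigma**2) * np.exp(-(c**2 + mu**2) / (2 * sigma**2)) * i0(c * mu / sigma**2)
print(np.max(np.abs(hist - rician)))   # close to zero for large n
```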

Expected Values of Functions of Random Variables

Recall that if Y = g(X) is a function of a continuous random variable X, then
EY = Eg(X) = ∫_{-∞}^{∞} g(x) f_X(x) dx
If Y = g(X) is a function of a discrete random variable X, then
EY = Eg(X) = Σ_{x∈R_X} g(x) p_X(x)

Suppose Z = g(X, Y) is a function of continuous random variables X and Y. Then the expected value of Z is given by
EZ = Eg(X, Y) = ∫_{-∞}^{∞} z f_Z(z) dz = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

Thus EZ can be computed without explicitly determining f_Z(z). We can establish the above result as follows. Suppose Z = g(X, Y) has n roots (xᵢ, yᵢ), i = 1, 2, …, n at Z = z. Then
{z < Z ≤ z + Δz} = ∪_{i=1}^{n} {(X, Y) ∈ ΔDᵢ}
where ΔDᵢ is the differential region containing (xᵢ, yᵢ). The mapping is illustrated in the figure for n = 3.
P({z < Z ≤ z + Δz}) = f_Z(z) Δz = Σ_{(xᵢ,yᵢ)∈ΔDᵢ} f_{X,Y}(xᵢ, yᵢ) Δxᵢ Δyᵢ
so that
z f_Z(z) Δz = Σ_{(xᵢ,yᵢ)∈ΔDᵢ} z f_{X,Y}(xᵢ, yᵢ) Δxᵢ Δyᵢ = Σ_{(xᵢ,yᵢ)∈ΔDᵢ} g(xᵢ, yᵢ) f_{X,Y}(xᵢ, yᵢ) Δxᵢ Δyᵢ
As z is varied over the entire Z axis, the corresponding (non-overlapping) differential regions in the X-Y plane cover the entire plane. Hence
∫_{-∞}^{∞} z f_Z(z) dz = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy
Thus,
Eg(X, Y) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

[Figure: the event {z < Z ≤ z + Δz} on the Z axis and the corresponding differential regions ΔD₁, ΔD₂, ΔD₃ in the X-Y plane.]

If Z = g(X, Y) is a function of discrete random variables X and Y, we can similarly show that
EZ = Eg(X, Y) = Σ_{(x,y)∈R_X×R_Y} g(x, y) p_{X,Y}(x, y)

Example: The joint pdf of two random variables X and Y is given by
f_{X,Y}(x, y) = xy/4,  0 ≤ x ≤ 2, 0 ≤ y ≤ 2
             = 0      otherwise
Find the joint expectation of g(X, Y) = X²Y.

Eg(X, Y) = EX²Y = ∫∫ g(x, y) f_{X,Y}(x, y) dx dy
         = ∫_{0}^{2} ∫_{0}^{2} x²y (xy/4) dx dy
         = (1/4) ∫_{0}^{2} x³ dx ∫_{0}^{2} y² dy
         = (1/4) (2⁴/4) (2³/3)
         = 8/3
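A quick numerical check of this joint expectation, using the same joint pdf f(x, y) = xy/4 on [0, 2] × [0, 2]; this is only a verification sketch.

```python
from scipy import integrate

# E[X^2 Y] = double integral of (x^2 y) * f(x, y) over the square [0, 2] x [0, 2].
g_times_f = lambda y, x: (x**2 * y) * (x * y / 4.0)   # dblquad integrates over y first
val, err = integrate.dblquad(g_times_f, 0, 2, lambda x: 0, lambda x: 2)
print(val)   # approximately 2.6667 = 8/3
```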

Example: If Z = aX + bY, where a and b are constants, then EZ = aEX + bEY.

Proof:
EZ = ∫∫ (ax + by) f_{X,Y}(x, y) dx dy
   = ∫∫ ax f_{X,Y}(x, y) dx dy + ∫∫ by f_{X,Y}(x, y) dx dy
   = a ∫ x [ ∫ f_{X,Y}(x, y) dy ] dx + b ∫ y [ ∫ f_{X,Y}(x, y) dx ] dy
   = a ∫ x f_X(x) dx + b ∫ y f_Y(y) dy
   = aEX + bEY
Thus, expectation is a linear operator.

Example: Consider the discrete random variables X and Y discussed in an earlier example. The joint probability mass function of the random variables is tabulated below. Find the joint expectation of g(X, Y) = XY.

          Y = 0    Y = 1    p_X(x)
X = 0     0.25     0.14     0.39
X = 1     0.10     0.35     0.45
X = 2     0.15     0.01     0.16
p_Y(y)    0.50     0.50     1

Clearly,
EXY = Σ_{(x,y)∈R_X×R_Y} xy p_{X,Y}(x, y) = 1·1·0.35 + 2·1·0.01 = 0.37
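The same computation written out with numpy, treating the table as an array (a verification sketch, nothing more):

```python
import numpy as np

# Rows are x = 0, 1, 2 and columns are y = 0, 1, matching the table above.
p = np.array([[0.25, 0.14],
              [0.10, 0.35],
              [0.15, 0.01]])
x = np.array([0, 1, 2])
y = np.array([0, 1])
EXY = (np.outer(x, y) * p).sum()     # sum of x*y*p(x, y)
print(EXY)            # 0.37
print(p.sum(axis=1))  # p_X(x): [0.39, 0.45, 0.16]
print(p.sum(axis=0))  # p_Y(y): [0.5, 0.5]
```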


Remark
(1) We have earlier shown that expectation is a linear operator. We can generally write
E[a₁g₁(X, Y) + a₂g₂(X, Y)] = a₁Eg₁(X, Y) + a₂Eg₂(X, Y)
Thus E(XY + 5 logₑ XY) = EXY + 5 E logₑ XY.
(2) If X and Y are independent random variables and g(X, Y) = g₁(X)g₂(Y), then
Eg(X, Y) = Eg₁(X)g₂(Y) = ∫∫ g₁(x)g₂(y) f_{X,Y}(x, y) dx dy
         = ∫∫ g₁(x)g₂(y) f_X(x) f_Y(y) dx dy
         = ∫ g₁(x) f_X(x) dx ∫ g₂(y) f_Y(y) dy
         = Eg₁(X) Eg₂(Y)

Joint Moments of Random Variables

Just as the moments of a random variable provide a summary description of the random variable, the joint moments provide a summary description of two random variables. For two continuous random variables X and Y, the joint moment of order m + n is defined as
E(X^m Y^n) = ∫∫ x^m y^n f_{X,Y}(x, y) dx dy
and the joint central moment of order m + n is defined as
E(X − μ_X)^m (Y − μ_Y)^n = ∫∫ (x − μ_X)^m (y − μ_Y)^n f_{X,Y}(x, y) dx dy
where μ_X = EX and μ_Y = EY.

Remark
(1) If X and Y are discrete random variables, the joint moment of order m + n is defined as
E(X^m Y^n) = Σ_{(x,y)∈R_{X,Y}} x^m y^n p_{X,Y}(x, y)
and the joint central moment of order m + n as
E(X − μ_X)^m (Y − μ_Y)^n = Σ_{(x,y)∈R_{X,Y}} (x − μ_X)^m (y − μ_Y)^n p_{X,Y}(x, y)
(2) If m = 1 and n = 1, we have the second-order moment of the random variables X and Y given by
E(XY) = ∫∫ xy f_{X,Y}(x, y) dx dy                 if X and Y are continuous
E(XY) = Σ_{(x,y)∈R_{X,Y}} xy p_{X,Y}(x, y)         if X and Y are discrete
(3) If X and Y are independent, E(XY) = EX EY.

Covariance of two random variables

The covariance of two random variables X and Y is defined as
Cov(X, Y) = E(X − μ_X)(Y − μ_Y)
Expanding the right-hand side, we get
Cov(X, Y) = E(X − μ_X)(Y − μ_Y)
          = E(XY − μ_Y X − μ_X Y + μ_X μ_Y)
          = EXY − μ_Y EX − μ_X EY + μ_X μ_Y
          = EXY − μ_X μ_Y

The ratio ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y) is called the correlation coefficient. We will give an interpretation of Cov(X, Y) and ρ(X, Y) later on. We will show that |ρ(X, Y)| ≤ 1. To establish the relation, we prove the following result: for two random variables X and Y,
E²(XY) ≤ EX² EY²

Proof:
Consider the random variable Z = aX + Y. Then
E(aX + Y)² ≥ 0
a²EX² + 2aEXY + EY² ≥ 0
Non-negativity of the left-hand side means that its minimum over a must also be non-negative. Setting the derivative with respect to a to zero,
d/da (a²EX² + 2aEXY + EY²) = 0  ⟹  a = −EXY/EX²
and the corresponding minimum value is
E²XY/EX² − 2E²XY/EX² + EY² = EY² − E²XY/EX²
The minimum being non-negative implies
EY² − E²XY/EX² ≥ 0
⟹ E²XY ≤ EX² EY²
⟹ |EXY| ≤ √(EX² EY²)

Now
ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y) = E(X − μ_X)(Y − μ_Y)/√(E(X − μ_X)² E(Y − μ_Y)²)
Applying the above result to the random variables (X − μ_X) and (Y − μ_Y),
|ρ(X, Y)| = |E(X − μ_X)(Y − μ_Y)|/√(E(X − μ_X)² E(Y − μ_Y)²) ≤ √(E(X − μ_X)² E(Y − μ_Y)²)/√(E(X − μ_X)² E(Y − μ_Y)²) = 1
⟹ |ρ(X, Y)| ≤ 1

Uncorrelated random variables

Two random variables X and Y are called uncorrelated if
Cov(X, Y) = 0, which also means E(XY) = μ_X μ_Y.
Recall that if X and Y are independent random variables, then f_{X,Y}(x, y) = f_X(x) f_Y(y). Then, assuming X and Y are continuous,
EXY = ∫∫ xy f_{X,Y}(x, y) dx dy
    = ∫∫ xy f_X(x) f_Y(y) dx dy
    = ∫ x f_X(x) dx ∫ y f_Y(y) dy
    = EX EY
Thus two independent random variables are always uncorrelated. The converse is not always true: two random variables may be dependent and still uncorrelated. If there is correlation between two random variables, one may be represented as a linear regression on the other. We will discuss this point in the next section.

Linear prediction of Y from X: Regression

Consider the linear predictor Ŷ = aX + b of Y from X.
Prediction error: Y − Ŷ
Mean-square prediction error: E(Y − Ŷ)² = E(Y − aX − b)²
Minimising the error with respect to a and b gives the optimal values. At the optimal solutions for a and b,
∂/∂a E(Y − aX − b)² = 0
∂/∂b E(Y − aX − b)² = 0
Solving for a and b gives a = Cov(X, Y)/σ_X² = ρ_{X,Y} σ_Y/σ_X and b = μ_Y − aμ_X, so that the optimal linear predictor satisfies
Ŷ − μ_Y = ρ_{X,Y} (σ_Y/σ_X)(x − μ_X)
where ρ_{X,Y} = σ_{XY}/(σ_X σ_Y) is the correlation coefficient.

[Figure: the regression line of Y on X, passing through (μ_X, μ_Y) with slope ρ_{X,Y} σ_Y/σ_X.]

Remark
If ρ_{X,Y} > 0, then X and Y are called positively correlated.
If ρ_{X,Y} < 0, then X and Y are called negatively correlated.
If ρ_{X,Y} = 0, then X and Y are uncorrelated, so Ŷ − μ_Y = 0 and Ŷ = μ_Y is the best prediction.


Note that independence implies uncorrelatedness. But uncorrelatedness generally does not imply independence (except for jointly Gaussian random variables).

Example:

Suppose Y = X² and X is uniformly distributed between (−1, 1). Then X and Y are dependent, but they are uncorrelated, because
Cov(X, Y) = E(X − μ_X)(Y − μ_Y) = EXY − EX EY = EX³ − EX·EX² = 0    (since EX = 0 and EX³ = 0)

In fact, for any zero-mean symmetric distribution of X, X and X² are uncorrelated.
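A short Monte Carlo sketch of this example; the sample size is an arbitrary choice.

```python
import numpy as np

# X uniform on (-1, 1) and Y = X^2 are dependent but uncorrelated.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 1_000_000)
y = x**2
print(np.corrcoef(x, y)[0, 1])   # approximately 0: uncorrelated
# Dependence is obvious: knowing x determines y = x^2 exactly.
```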

Remark: X̂ = aY is a linear estimator of X from Y; we return to such linear estimators in the section on linear minimum mean-square error estimation below.

Jointly Gaussian Random Variables

Many random variables occurring in practice are modeled as jointly Gaussian random variables. For example, the noise occurring in communication systems is modeled as jointly Gaussian. Two random variables X and Y are called jointly Gaussian if their joint density function is

f_{X,Y}(x, y) = 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] },
−∞ < x < ∞, −∞ < y < ∞

The joint pdf is determined by 5 parameters:
- means μ_X and μ_Y
- variances σ_X² and σ_Y²
- correlation coefficient ρ_{X,Y}
We denote the jointly Gaussian random variables X and Y with these parameters as (X, Y) ~ N(μ_X, μ_Y, σ_X², σ_Y², ρ_{X,Y}).

The pdf has a bell shape centred at (μ_X, μ_Y), as shown in the figure below. The variances σ_X² and σ_Y² determine the spread of the pdf surface and ρ_{X,Y} determines the orientation of the surface in the X-Y plane.
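A minimal sampling sketch of a jointly Gaussian pair; the five parameter values below are illustrative assumptions chosen only to show the empirical checks.

```python
import numpy as np

# Draw samples from N(mu_X, mu_Y, sx^2, sy^2, rho) and check the parameters empirically.
mu_x, mu_y, sx, sy, rho = 1.0, -2.0, 2.0, 1.0, 0.6
mean = [mu_x, mu_y]
cov = [[sx**2, rho * sx * sy],
       [rho * sx * sy, sy**2]]
rng = np.random.default_rng(4)
samples = rng.multivariate_normal(mean, cov, size=200_000)
print(samples.mean(axis=0))           # close to [1, -2]
print(np.corrcoef(samples.T)[0, 1])   # close to 0.6
```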

Properties of jointly Gaussian random variables
(1) If X and Y are jointly Gaussian, then X and Y are each Gaussian. We have
f_X(x) = ∫_{-∞}^{∞} f_{X,Y}(x, y) dy
       = ∫_{-∞}^{∞} 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] } dy
       = (1/(√(2π) σ_X)) e^{−(x − μ_X)²/2σ_X²} ∫_{-∞}^{∞} 1/(√(2π) σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2σ_Y²(1 − ρ²_{X,Y})) [ (y − μ_Y) − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² } dy    (completing the square in y)
       = (1/(√(2π) σ_X)) e^{−(x − μ_X)²/2σ_X²}    (the remaining integral is the area under a Gaussian pdf, which is 1)
Similarly,
f_Y(y) = (1/(√(2π) σ_Y)) e^{−(y − μ_Y)²/2σ_Y²}

(2) The converse of the above result is not true. If each of X and Y is Gaussian, X and Y are not necessarily jointly Gaussian. Suppose
f_{X,Y}(x, y) = 1/(2πσ_X σ_Y) exp{ −(x − μ_X)²/2σ_X² − (y − μ_Y)²/2σ_Y² } (1 + sin x sin y)
This f_{X,Y}(x, y) is non-Gaussian, but it qualifies as a joint pdf, because f_{X,Y}(x, y) ≥ 0 and (taking μ_X = μ_Y = 0 so that the sine terms integrate against even Gaussian factors)
∫∫ f_{X,Y}(x, y) dy dx
= ∫∫ 1/(2πσ_X σ_Y) e^{−(x − μ_X)²/2σ_X²} e^{−(y − μ_Y)²/2σ_Y²} dy dx
  + ∫ 1/(√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²} sin x dx · ∫ 1/(√(2π) σ_Y) e^{−(y − μ_Y)²/2σ_Y²} sin y dy
= 1 + 0 = 1
since each of the last two integrands is an odd function of its variable.

The marginal density f_X(x) is given by
f_X(x) = ∫ f_{X,Y}(x, y) dy
       = ∫ 1/(2πσ_X σ_Y) e^{−(x − μ_X)²/2σ_X²} e^{−(y − μ_Y)²/2σ_Y²} dy
         + 1/(√(2π) σ_X) e^{−(x − μ_X)²/2σ_X²} sin x · ∫ 1/(√(2π) σ_Y) e^{−(y − μ_Y)²/2σ_Y²} sin y dy
       = (1/(√(2π) σ_X)) e^{−(x − μ_X)²/2σ_X²} + 0    (odd function in y)
Similarly,
f_Y(y) = (1/(√(2π) σ_Y)) e^{−(y − μ_Y)²/2σ_Y²}

Thus X and Y are both Gaussian, but not jointly Gaussian.

(3) If X and Y are jointly Gaussian, then for any constants a and b the random variable Z = aX + bY is Gaussian with mean μ_Z = aμ_X + bμ_Y and variance
σ_Z² = a²σ_X² + b²σ_Y² + 2ab σ_X σ_Y ρ_{X,Y}

(4) Two jointly Gaussian random variables X and Y are independent if and only if X and Y are uncorrelated (ρ_{X,Y} = 0). Observe that if X and Y are uncorrelated, then
f_{X,Y}(x, y) = 1/(2πσ_X σ_Y) exp{ −(x − μ_X)²/2σ_X² − (y − μ_Y)²/2σ_Y² }
             = (1/(√(2π) σ_X)) e^{−(x − μ_X)²/2σ_X²} · (1/(√(2π) σ_Y)) e^{−(y − μ_Y)²/2σ_Y²}
             = f_X(x) f_Y(y)

Joint Characteristic Functions of Two Random Variables

The joint characteristic function of two random variables X and Y is defined by
φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y}
If X and Y are jointly continuous random variables, then
φ_{X,Y}(ω₁, ω₂) = ∫∫ f_{X,Y}(x, y) e^{jω₁x + jω₂y} dy dx
Note that φ_{X,Y}(ω₁, ω₂) is the same as the two-dimensional Fourier transform with the basis function e^{jω₁x + jω₂y} instead of e^{−(jω₁x + jω₂y)}.
f_{X,Y}(x, y) is related to the joint characteristic function by the Fourier inversion formula
f_{X,Y}(x, y) = 1/(4π²) ∫∫ φ_{X,Y}(ω₁, ω₂) e^{−jω₁x − jω₂y} dω₁ dω₂
If X and Y are discrete random variables, we can define the joint characteristic function in terms of the joint probability mass function as follows:
φ_{X,Y}(ω₁, ω₂) = Σ_{(x,y)∈R_X×R_Y} p_{X,Y}(x, y) e^{jω₁x + jω₂y}

Properties of Joint Characteristic Functions of Two Random Variables
The joint characteristic function has properties similar to those of the characteristic function of a single random variable. We can easily establish the following properties:
1. φ_X(ω) = φ_{X,Y}(ω, 0)
2. φ_Y(ω) = φ_{X,Y}(0, ω)
3. If X and Y are independent random variables, then
φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y} = E(e^{jω₁X} e^{jω₂Y}) = E e^{jω₁X} E e^{jω₂Y} = φ_X(ω₁) φ_Y(ω₂)
4. We have
φ_{X,Y}(ω₁, ω₂) = E e^{jω₁X + jω₂Y} = E(1 + j(ω₁X + ω₂Y) + j²(ω₁X + ω₂Y)²/2! + …)
               = 1 + jω₁EX + jω₂EY + j²ω₁²EX²/2 + j²ω₂²EY²/2 + j²ω₁ω₂EXY + …
Hence,
φ_{X,Y}(0, 0) = 1
EX = (1/j) ∂φ_{X,Y}(ω₁, ω₂)/∂ω₁ |_{ω₁=0, ω₂=0}
EY = (1/j) ∂φ_{X,Y}(ω₁, ω₂)/∂ω₂ |_{ω₁=0, ω₂=0}
EXY = (1/j²) ∂²φ_{X,Y}(ω₁, ω₂)/∂ω₁∂ω₂ |_{ω₁=0, ω₂=0}
In general, the (m + n)th order joint moment is given by
E(X^m Y^n) = (1/j^{m+n}) ∂^{m+n} φ_{X,Y}(ω₁, ω₂)/∂ω₁^m ∂ω₂^n |_{ω₁=0, ω₂=0}

Example: Joint characteristic function of the jointly Gaussian random variables X and Y with the joint pdf
f_{X,Y}(x, y) = 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }

Let us first recall the characteristic function of a Gaussian random variable X ~ N(μ_X, σ_X²):
φ_X(ω) = E e^{jωX}
       = ∫_{-∞}^{∞} (1/(σ_X√(2π))) e^{−(x − μ_X)²/2σ_X²} e^{jωx} dx
       = e^{jμ_Xω − σ_X²ω²/2} ∫_{-∞}^{∞} (1/(σ_X√(2π))) e^{−(x − μ_X − jσ_X²ω)²/2σ_X²} dx    (completing the square in the exponent)
       = e^{jμ_Xω − σ_X²ω²/2}    (area under a Gaussian pdf is 1)

If X and Y are jointly Gaussian with the joint pdf above, we can similarly show that
φ_{X,Y}(ω₁, ω₂) = E e^{j(ω₁X + ω₂Y)} = e^{jμ_Xω₁ + jμ_Yω₂ − (σ_X²ω₁² + 2ρ_{X,Y}σ_Xσ_Yω₁ω₂ + σ_Y²ω₂²)/2}

We can use joint characteristic functions to simplify probabilistic analysis, as illustrated below.

Example 2: Suppose Z = aX + bY. Then
φ_Z(ω) = E e^{jωZ} = E e^{jω(aX + bY)} = φ_{X,Y}(aω, bω)
If X and Y are jointly Gaussian, then
φ_Z(ω) = φ_{X,Y}(aω, bω) = e^{jω(aμ_X + bμ_Y) − ω²(a²σ_X² + 2abρ_{X,Y}σ_Xσ_Y + b²σ_Y²)/2}
which is the characteristic function of a Gaussian random variable with mean aμ_X + bμ_Y and variance a²σ_X² + 2abρ_{X,Y}σ_Xσ_Y + b²σ_Y².
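As a quick symbolic check of the moment formulas of property 4, one can differentiate the jointly Gaussian characteristic function just derived. The sketch below uses sympy; the symbol names are ours, not from the notes.

```python
import sympy as sp

# Recover EX and EXY by differentiating the jointly Gaussian characteristic
# function at the origin and dividing by the appropriate power of j.
w1, w2 = sp.symbols('w1 w2', real=True)
mx, my, sx, sy, rho = sp.symbols('mu_X mu_Y sigma_X sigma_Y rho', real=True, positive=True)
phi = sp.exp(sp.I*(mx*w1 + my*w2)
             - sp.Rational(1, 2)*(sx**2*w1**2 + 2*rho*sx*sy*w1*w2 + sy**2*w2**2))
EX  = sp.simplify(sp.diff(phi, w1).subs({w1: 0, w2: 0}) / sp.I)
EXY = sp.simplify(sp.diff(phi, w1, w2).subs({w1: 0, w2: 0}) / sp.I**2)
print(EX)    # mu_X
print(EXY)   # mu_X*mu_Y + rho*sigma_X*sigma_Y, i.e. EXY = mu_X mu_Y + Cov(X, Y)
```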

Example 3: If Z = X + Y and X and Y are independent, then
φ_Z(ω) = φ_{X,Y}(ω, ω) = φ_X(ω) φ_Y(ω)
Using the convolution property of the Fourier transform, we get
f_Z(z) = f_X(z) * f_Y(z)

Conditional Expectation

Recall that if X and Y are continuous random variables, then the conditional density function of Y given X = x is given by
f_{Y/X}(y/x) = f_{X,Y}(x, y)/f_X(x)
If X and Y are discrete random variables, then the conditional probability mass function of Y given X = x is given by
p_{Y/X}(y/x) = p_{X,Y}(x, y)/p_X(x)

The conditional expectation of Y given X = x is defined by
μ_{Y/X=x} = E(Y/X = x) = ∫_{-∞}^{∞} y f_{Y/X}(y/x) dy        if X and Y are continuous
μ_{Y/X=x} = E(Y/X = x) = Σ_{y∈R_Y} y p_{Y/X}(y/x)            if X and Y are discrete

Remark
The conditional expectation of Y given X = x is also called the conditional mean of Y given X = x. We can similarly define the conditional expectation of X given Y = y, denoted by E(X/Y = y). Higher-order conditional moments can be defined in a similar manner. Particularly, the conditional variance of Y given X = x is given by
σ²_{Y/X=x} = E[(Y − μ_{Y/X=x})² / X = x]

Example: Consider the discrete random variables X and Y discussed in the earlier example. The joint probability mass function of the random variables is tabulated below. Find the conditional expectation E(Y/X = 2).

          Y = 0    Y = 1    p_X(x)
X = 0     0.25     0.14     0.39
X = 1     0.10     0.35     0.45
X = 2     0.15     0.01     0.16
p_Y(y)    0.50     0.50     1

The conditional probability mass function is given by
p_{Y/X}(y/2) = p_{X,Y}(2, y)/p_X(2)
p_{Y/X}(0/2) = p_{X,Y}(2, 0)/p_X(2) = 0.15/0.16 = 15/16
p_{Y/X}(1/2) = p_{X,Y}(2, 1)/p_X(2) = 0.01/0.16 = 1/16
E(Y/X = 2) = 0 × p_{Y/X}(0/2) + 1 × p_{Y/X}(1/2) = 1/16

Example: Suppose X and Y are jointly uniform random variables with the joint probability density function given by
f_{X,Y}(x, y) = 1/2,  x ≥ 0, y ≥ 0, x + y ≤ 2
             = 0     otherwise
Find E(Y/X = x). From the figure, f_{X,Y}(x, y) = 1/2 in the shaded triangular area.

We have
f_X(x) = ∫_{0}^{2-x} f_{X,Y}(x, y) dy = ∫_{0}^{2-x} (1/2) dy = (2 − x)/2,   0 ≤ x ≤ 2
so that
f_{Y/X}(y/x) = f_{X,Y}(x, y)/f_X(x) = 1/(2 − x),   0 ≤ y ≤ 2 − x
E(Y/X = x) = ∫ y f_{Y/X}(y/x) dy = ∫_{0}^{2-x} y/(2 − x) dy = (2 − x)/2
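A Monte Carlo sketch of this conditional mean: sample uniformly over the triangle by rejection and average Y over samples with X near a chosen point. The point x₀ = 0.5 and the window width are illustrative choices.

```python
import numpy as np

# Uniform sampling over {x >= 0, y >= 0, x + y <= 2} by rejection from the square [0, 2]^2.
rng = np.random.default_rng(5)
pts = rng.uniform(0, 2, size=(2_000_000, 2))
pts = pts[pts.sum(axis=1) <= 2]          # keep points inside the triangle
x0, h = 0.5, 0.02
sel = np.abs(pts[:, 0] - x0) < h
print(pts[sel, 1].mean())                # approximately (2 - x0)/2 = 0.75
```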

Example: Suppose X and Y are jointly Gaussian random variables with the joint probability density function
f_{X,Y}(x, y) = 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] }
We have to find E(Y/X = x).

We have
f_{Y/X}(y/x) = f_{X,Y}(x, y)/f_X(x)
             = [ 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ (x − μ_X)²/σ_X² − 2ρ_{X,Y}(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y² ] } ] / [ (1/(√(2π) σ_X)) e^{−(x − μ_X)²/2σ_X²} ]
             = 1/(σ_Y √(2π(1 − ρ²_{X,Y}))) exp{ −1/(2σ_Y²(1 − ρ²_{X,Y})) [ (y − μ_Y) − ρ_{X,Y}(σ_Y/σ_X)(x − μ_X) ]² }
Therefore,
E(Y/X = x) = ∫_{-∞}^{∞} y f_{Y/X}(y/x) dy = μ_Y + ρ_{X,Y}(σ_Y/σ_X)(x − μ_X)

Conditional Expectation as a Random Variable

Note that E(Y/X = x) is a function of x. Using this function, we may define a random variable
φ(X) = E(Y/X)
Thus we may consider E(Y/X) as a function of the random variable X, and E(Y/X = x) as the value of E(Y/X) at X = x. Since E(Y/X) is a random variable, it has an expectation, and
E E(Y/X) = EY

Proof:
E E(Y/X) = ∫ E(Y/X = x) f_X(x) dx
         = ∫ [ ∫ y f_{Y/X}(y/x) dy ] f_X(x) dx
         = ∫∫ y f_X(x) f_{Y/X}(y/x) dy dx
         = ∫∫ y f_{X,Y}(x, y) dy dx
         = ∫ y [ ∫ f_{X,Y}(x, y) dx ] dy
         = ∫ y f_Y(y) dy
         = EY
Thus E E(Y/X) = EY, and similarly E E(X/Y) = EX.

Bayesian Estimation Theory and Conditional Expectation

Consider two random variables X and Y with joint pdf f_{X,Y}(x, y). Suppose Y is observable and f_X(x) is known. We have to estimate X for a given value of Y in some optimal sense. Some values of X are more likely than others (a priori information); we can represent this prior information in the form of a prior density function. In the following we omit the suffix in density functions just for notational simplicity.

[Diagram: the random variable X with density f_X(x) is observed through the conditional density f_{Y/X}(y/x), producing the observation Y = y.]

The conditional density function f_{Y/X}(y/x) is called the likelihood function in estimation terminology. We have
f_{X,Y}(x, y) = f_X(x) f_{Y/X}(y/x)
Also we have the Bayes rule
f_{X/Y}(x/y) = f_X(x) f_{Y/X}(y/x)/f_Y(y)
where f_{X/Y}(x/y) is the a posteriori density function.

Suppose the optimum estimator X̂(Y) is a function of the random variable Y that minimizes the mean-square estimation error E(X̂(Y) − X)². Such an estimator is known as the minimum mean-square error estimator.

The estimation problem is:
Minimize ∫∫ (X̂(y) − x)² f_{X,Y}(x, y) dx dy with respect to X̂(y).
This is equivalent to minimizing
∫∫ (X̂(y) − x)² f_Y(y) f_{X/Y}(x/y) dx dy = ∫ [ ∫ (X̂(y) − x)² f_{X/Y}(x/y) dx ] f_Y(y) dy
Since f_Y(y) is always positive, the above integral will be minimum if the inner integral is minimum for each y. This results in the problem:
Minimize ∫ (X̂(y) − x)² f_{X/Y}(x/y) dx with respect to X̂(y).
The minimum is given by
∂/∂X̂(y) ∫ (X̂(y) − x)² f_{X/Y}(x/y) dx = 0
⟹ 2 ∫ (X̂(y) − x) f_{X/Y}(x/y) dx = 0
⟹ X̂(y) ∫ f_{X/Y}(x/y) dx = ∫ x f_{X/Y}(x/y) dx
⟹ X̂(y) = E(X/Y = y)

Multiple Random Variables

In many applications we have to deal with many random variables. For example, in a navigation problem, the position of a spacecraft is represented by three random variables denoting the x, y and z coordinates. The noise affecting the R, G, B channels of colour video may be represented by three random variables. In such situations, it is convenient to define vector-valued random variables where each component of the vector is a random variable. In this lecture, we extend the concepts of joint random variables to the case of multiple random variables. A generalized analysis will be presented for n random variables defined on the same sample space.

Joint CDF of n random variables
Consider n random variables X₁, X₂, …, Xₙ defined on the same probability space (S, F, P). We define the random vector X as
X = [X₁ X₂ … Xₙ]'
where ' indicates the transpose operation. Thus an n-dimensional random vector X is defined by the mapping X: S → ℝⁿ. A particular value of the random vector X is denoted by x = [x₁ x₂ … xₙ]'.
The CDF of the random vector X is defined as the joint CDF of X₁, X₂, …, Xₙ. Thus
F_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) = F_X(x) = P({X₁ ≤ x₁, X₂ ≤ x₂, …, Xₙ ≤ xₙ})
Some of the most important properties of the joint CDF are listed below. These properties are mere extensions of the properties of two joint random variables.

Properties of the joint CDF of n random variables
(a) F_{X₁,…,Xₙ}(x₁, x₂, …, xₙ) is a non-decreasing function of each of its arguments.
(b) F_{X₁,…,Xₙ}(−∞, x₂, …, xₙ) = F_{X₁,…,Xₙ}(x₁, −∞, …, xₙ) = … = F_{X₁,…,Xₙ}(x₁, x₂, …, −∞) = 0
(c) F_{X₁,…,Xₙ}(∞, ∞, …, ∞) = 1
(d) F_{X₁,…,Xₙ}(x₁, x₂, …, xₙ) is right-continuous in each of its arguments.
(e) The marginal CDF of a random variable Xᵢ is obtained from F_{X₁,…,Xₙ}(x₁, …, xₙ) by letting all arguments except xᵢ tend to ∞. Thus
F_{X₁}(x₁) = F_{X₁,…,Xₙ}(x₁, ∞, …, ∞),  F_{X₂}(x₂) = F_{X₁,…,Xₙ}(∞, x₂, …, ∞), and so on.
Joint pmf of n discrete random variables
Suppose X is a discrete random vector defined on the probability space (S, F, P). Then X is completely specified by the joint probability mass function
p_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) = p_X(x) = P({X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ})
Given p_{X₁,…,Xₙ}(x₁, …, xₙ), we can find the marginal probability mass function by summing over the other n − 1 variables:
p_{X₁}(x₁) = Σ_{x₂} Σ_{x₃} … Σ_{xₙ} p_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ)    (n − 1 summations)

Joint pdf of n random variables
If X is a continuous random vector, that is, F_{X₁,…,Xₙ}(x₁, …, xₙ) is continuous in each of its arguments, then X can be specified by the joint probability density function
f_X(x) = f_{X₁,…,Xₙ}(x₁, …, xₙ) = ∂ⁿ F_{X₁,…,Xₙ}(x₁, …, xₙ)/∂x₁∂x₂…∂xₙ

Properties of the joint pdf of n random variables
The joint pdf of n random variables satisfies the following important properties:
(1) f_{X₁,…,Xₙ}(x₁, …, xₙ) is always non-negative, that is, f_{X₁,…,Xₙ}(x₁, …, xₙ) ≥ 0 for all (x₁, …, xₙ).
(2) ∫…∫ f_{X₁,…,Xₙ}(x₁, …, xₙ) dx₁ dx₂ … dxₙ = 1
(3) Given f_X(x) for all (x₁, …, xₙ) ∈ ℝⁿ, we can find the probability of a Borel set (region) B ⊆ ℝⁿ:
P({(x₁, …, xₙ) ∈ B}) = ∫…∫_B f_{X₁,…,Xₙ}(x₁, …, xₙ) dx₁ dx₂ … dxₙ
(4) The marginal pdf of a random variable Xᵢ is related to f_{X₁,…,Xₙ}(x₁, …, xₙ) by the (n − 1)-fold integral
f_{Xᵢ}(xᵢ) = ∫…∫ f_{X₁,…,Xₙ}(x₁, …, xᵢ, …, xₙ) dx₁ … dxₙ    (integrating over all arguments except xᵢ)
Similarly, by an (n − 2)-fold integral,
f_{Xᵢ,Xⱼ}(xᵢ, xⱼ) = ∫…∫ f_{X₁,…,Xₙ}(x₁, …, xₙ) dx₁ … dxₙ    (integrating over all arguments except xᵢ and xⱼ)
and so on. The conditional density functions are defined in a similar manner. Thus
f_{X_{m+1},…,Xₙ / X₁,…,X_m}(x_{m+1}, …, xₙ / x₁, …, x_m) = f_{X₁,…,Xₙ}(x₁, …, xₙ)/f_{X₁,…,X_m}(x₁, …, x_m)

Independent random variables: The random variables X₁, X₂, …, Xₙ are called (mutually) independent if and only if
f_{X₁,…,Xₙ}(x₁, …, xₙ) = ∏_{i=1}^{n} f_{Xᵢ}(xᵢ)
For example, if X₁, X₂, …, Xₙ are independent Gaussian random variables, then
f_{X₁,…,Xₙ}(x₁, …, xₙ) = ∏_{i=1}^{n} (1/(σᵢ√(2π))) e^{−(xᵢ − μᵢ)²/2σᵢ²}
Remark: X₁, X₂, …, Xₙ may be pairwise independent but not mutually independent.

Identically distributed random variables: The random variables X₁, X₂, …, Xₙ are called identically distributed if each random variable has the same marginal distribution function, that is,
F_{X₁}(x) = F_{X₂}(x) = … = F_{Xₙ}(x)
An important subclass of independent random variables is the independent and identically distributed (iid) random variables. The random variables X₁, X₂, …, Xₙ are called iid if they are mutually independent and each has the same marginal distribution function.
Example: If X₁, X₂, …, Xₙ are random variables generated by n independent tosses of a fair coin, each taking values 0 and 1, then X₁, X₂, …, Xₙ are iid and
p_X(1, 1, …, 1) = (1/2)ⁿ

Moments of Multiple Random Variables
Consider n jointly distributed random variables represented by the random vector X = [X₁, X₂, …, Xₙ]'. The expected value of any scalar-valued function g(X) is defined using the n-fold integral as
E g(X) = ∫…∫ g(x₁, …, xₙ) f_{X₁,…,Xₙ}(x₁, …, xₙ) dx₁ dx₂ … dxₙ
The mean vector of X, denoted by μ_X, is defined as
μ_X = E(X) = [E(X₁) E(X₂) … E(Xₙ)]' = [μ_{X₁} μ_{X₂} … μ_{Xₙ}]'
Similarly, for each (i, j), i = 1, 2, …, n, j = 1, 2, …, n, we can define the covariance
Cov(Xᵢ, Xⱼ) = E(Xᵢ − μ_{Xᵢ})(Xⱼ − μ_{Xⱼ})
All the possible covariances can be represented in a matrix called the covariance matrix C_X, defined by
C_X = E(X − μ_X)(X − μ_X)'
    = [ var(X₁)        cov(X₁, X₂)   …   cov(X₁, Xₙ)
        cov(X₂, X₁)    var(X₂)       …   cov(X₂, Xₙ)
        …
        cov(Xₙ, X₁)    cov(Xₙ, X₂)   …   var(Xₙ)      ]

Properties of the Covariance Matrix
- C_X is a symmetric matrix, because Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ).
- C_X is a non-negative definite matrix in the sense that for any real vector z ≠ 0, the quadratic form z'C_X z ≥ 0. The result can be proved as follows:
z'C_X z = z' E((X − μ_X)(X − μ_X)') z = E(z'(X − μ_X)(X − μ_X)'z) = E(z'(X − μ_X))² ≥ 0
The covariance matrix represents the second-order relationship between each pair of the random variables and plays an important role in applications of random variables. The n random variables X₁, X₂, …, Xₙ are called uncorrelated if for each (i, j), i ≠ j, Cov(Xᵢ, Xⱼ) = 0. If X₁, X₂, …, Xₙ are uncorrelated, C_X is a diagonal matrix.

Multiple Jointly Gaussian Random Variables

For any positive integer n, let X₁, X₂, …, Xₙ be n jointly distributed random variables defining the random vector X = [X₁, X₂, …, Xₙ]'. These random variables are called jointly Gaussian if their joint probability density function is given by
f_{X₁,…,Xₙ}(x₁, x₂, …, xₙ) = 1/((2π)^{n/2} √(det(C_X))) e^{−(1/2)(x − μ_X)'C_X⁻¹(x − μ_X)}
where C_X = E(X − μ_X)(X − μ_X)' is the covariance matrix and μ_X = E(X) = [E(X₁), E(X₂), …, E(Xₙ)]' is the vector formed by the means of the random variables.
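A minimal numerical sketch of these definitions: draw samples of a jointly Gaussian vector with an assumed covariance matrix, estimate C_X empirically, and check symmetry and non-negative definiteness. The mean vector and covariance values are illustrative assumptions.

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])
rng = np.random.default_rng(6)
X = rng.multivariate_normal(mu, C, size=100_000)
C_hat = np.cov(X.T)                               # empirical covariance matrix
print(np.allclose(C_hat, C_hat.T))                # symmetric
print(np.linalg.eigvalsh(C_hat).min() >= 0)       # non-negative definite
print(np.abs(C_hat - C).max())                    # small for a large sample size
```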

Vector Space Interpretation of Random Variables

Consider a set V with elements called vectors and the field of real numbers ℝ. V is called a vector space if and only if:
1. An operation vector addition '+' is defined in V such that (V, +) is a commutative group. Thus (V, +) satisfies the following properties:
(i) For any pair of elements v, w ∈ V, there exists a unique element (v + w) ∈ V.
(ii) Vector addition is associative: v + (w + z) = (v + w) + z for any three vectors v, w, z ∈ V.
(iii) There is a vector 0 ∈ V such that v + 0 = 0 + v = v for any v ∈ V.
(iv) For any v ∈ V there is a vector −v ∈ V such that v + (−v) = 0 = (−v) + v.
(v) For any v, w ∈ V, v + w = w + v.
2. For any element v ∈ V and any r ∈ ℝ, the scalar product rv ∈ V. This scalar product has the following properties for any r, s ∈ ℝ and any v, w ∈ V:
3. r(sv) = (rs)v
4. r(v + w) = rv + rw
5. (r + s)v = rv + sv
6. 1v = v

It is easy to verify that the set of all random variables defined on a probability space (S, F, P) forms a vector space with respect to addition and scalar multiplication. Similarly, the set of all n-dimensional random vectors forms a vector space. The interpretation of random variables as elements of a vector space helps in understanding many operations involving random variables.

Linear Independence
Consider N vectors v₁, v₂, …, v_N. If c₁v₁ + c₂v₂ + … + c_Nv_N = 0 implies that c₁ = c₂ = … = c_N = 0, then v₁, v₂, …, v_N are linearly independent.
For N random vectors X₁, X₂, …, X_N, if c₁X₁ + c₂X₂ + … + c_NX_N = 0 implies that c₁ = c₂ = … = c_N = 0, then X₁, X₂, …, X_N are linearly independent.

Inner Product
If v and w are real vectors in a vector space V defined over the field ℝ, the inner product <v, w> is a scalar such that, for all v, w, z ∈ V and r ∈ ℝ:
1. <v, w> = <w, v>
2. <v, v> = ||v||² ≥ 0, where ||v|| is the norm induced by the inner product
3. <v + w, z> = <v, z> + <w, z>
4. <rv, w> = r<v, w>

In the case of two random variables X and Y, the joint expectation EXY defines an inner product between X and Y. Thus
<X, Y> = EXY
We can easily verify that EXY satisfies the axioms of the inner product. The norm of a random variable X is given by
||X||² = EX²
For two n-dimensional random vectors X = [X₁ X₂ … Xₙ]' and Y = [Y₁ Y₂ … Yₙ]', the inner product is
<X, Y> = EX'Y = Σ_{i=1}^{n} EXᵢYᵢ
The norm of a random vector X is given by
||X||² = <X, X> = EX'X = Σ_{i=1}^{n} EXᵢ²
The set of random variables, along with the inner product defined through the joint expectation operation and the corresponding norm, defines a Hilbert space.

Schwarz Inequality
For any two vectors v and w belonging to a Hilbert space V,
|<v, w>| ≤ ||v|| ||w||
This means that for any two random variables X and Y,
|E(XY)| ≤ √(EX² EY²)
Similarly, for any two random vectors X and Y,
|E(X'Y)| ≤ √(EX'X · EY'Y)

Orthogonal Random Variables and Orthogonal Random Vectors

Two vectors v and w are called orthogonal if <v, w> = 0. Two random variables X and Y are called orthogonal if EXY = 0. Similarly, two random vectors X and Y are called orthogonal if
EX'Y = Σ_{i=1}^{n} EXᵢYᵢ = 0
Just like the independent random variables and the uncorrelated random variables, the orthogonal random variables form an important class of random variables.

Remark
If X and Y are uncorrelated, then E(X − μ_X)(Y − μ_Y) = 0, so (X − μ_X) is orthogonal to (Y − μ_Y).
If each of X and Y is zero-mean, then Cov(X, Y) = EXY. In this case EXY = 0 ⟺ Cov(X, Y) = 0, that is, orthogonality and uncorrelatedness coincide.

Minimum Mean-square-error Estimation
Suppose X is a random variable which is not observable and Y is another observable random variable which is statistically dependent on X through the joint probability density function f_{X,Y}(x, y). We pose the following problem: given a value of Y, what is the best guess for X? This problem is known as the estimation problem and has many practical applications. One application is signal estimation from noisy observations, as illustrated in the figure below.

[Figure: a signal X corrupted by additive noise gives the noisy observation Y; an estimation block produces the estimated signal X̂.]

Let X̂(Y) be the estimate of the random variable X based on the random variable Y. Clearly X̂(Y) is a function of Y. We have to find the best estimate X̂(Y) in some meaningful sense. Observe that
- X is the unknown random variable
- X̂(Y) is the estimate of X
- X − X̂(Y) is the estimation error
- E(X − X̂(Y))² is the mean of the square error.

One meaningful criterion is to minimize E(X − X̂(Y))² with respect to X̂(Y); the corresponding estimation principle is called the minimum mean square error principle. Such a function which we want to minimize is called a cost function in optimization theory. For finding X̂(Y), we have to minimize the cost function
E(X − X̂(Y))² = ∫∫ (x − X̂(y))² f_{X,Y}(x, y) dy dx
             = ∫∫ (x − X̂(y))² f_Y(y) f_{X/Y}(x/y) dy dx
             = ∫ f_Y(y) [ ∫ (x − X̂(y))² f_{X/Y}(x/y) dx ] dy
Since f_Y(y) is always positive, the minimization of E(X − X̂(Y))² with respect to X̂(Y) is equivalent to minimizing the inside integral ∫ (x − X̂(y))² f_{X/Y}(x/y) dx with respect to X̂(y). The condition for the minimum is
∂/∂X̂(y) ∫ (x − X̂(y))² f_{X/Y}(x/y) dx = 0
Or
−2 ∫ (x − X̂(y)) f_{X/Y}(x/y) dx = 0
⟹ X̂(y) ∫ f_{X/Y}(x/y) dx = ∫ x f_{X/Y}(x/y) dx
⟹ X̂(y) = E(X/Y = y)
Thus, the minimum mean-square error estimation involves the conditional expectation E(X/Y = y). To find E(X/Y = y), we have to determine the a posteriori density f_{X/Y}(x/y) and perform the integration ∫ x f_{X/Y}(x/y) dx. These operations are computationally exhaustive when we have to perform them numerically.

Example: Consider two zero-mean jointly Gaussian random variables X and Y with the joint pdf

f_{X,Y}(x, y) = 1/(2πσ_X σ_Y √(1 − ρ²_{X,Y})) exp{ −1/(2(1 − ρ²_{X,Y})) [ x²/σ_X² − 2ρ_{X,Y}xy/(σ_X σ_Y) + y²/σ_Y² ] },   −∞ < x < ∞, −∞ < y < ∞

The marginal density f_Y(y) is Gaussian and given by
f_Y(y) = (1/(√(2π) σ_Y)) e^{−y²/2σ_Y²}
so that
f_{X/Y}(x/y) = f_{X,Y}(x, y)/f_Y(y)
             = 1/(σ_X √(2π(1 − ρ²_{X,Y}))) exp{ −1/(2σ_X²(1 − ρ²_{X,Y})) [ x − ρ_{X,Y}(σ_X/σ_Y) y ]² }
which is Gaussian with mean ρ_{X,Y}(σ_X/σ_Y) y. Therefore, the MMSE estimator of X given Y = y is
X̂(y) = E(X/Y = y) = ρ_{X,Y}(σ_X/σ_Y) y
This example illustrates that in the case of jointly Gaussian random variables X and Y, the mean-square estimator of X given Y = y is linearly related to y. This important result gives us a clue to a simpler version of the mean-square error estimation problem, discussed below.
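A simulation sketch of this linearity: sample a correlated Gaussian pair, condition on Y falling in a narrow window around y₀, and compare the conditional average of X with ρ(σ_X/σ_Y) y₀. The parameter values and the window width are illustrative assumptions.

```python
import numpy as np

sigma_X, sigma_Y, rho = 2.0, 1.0, 0.7
rng = np.random.default_rng(7)
cov = [[sigma_X**2, rho * sigma_X * sigma_Y],
       [rho * sigma_X * sigma_Y, sigma_Y**2]]
x, y = rng.multivariate_normal([0, 0], cov, size=1_000_000).T
y0, h = 0.8, 0.02
cond_mean = x[np.abs(y - y0) < h].mean()
print(cond_mean, rho * sigma_X / sigma_Y * y0)   # the two numbers nearly agree
```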

Linear Minimum Mean-square-error Estimation and the Orthogonality Principle

We assume that X and Y are both zero-mean and X̂(Y) = aY. The estimation problem is now to find the optimal value of a. Thus we have the linear minimum mean-square error criterion, which minimizes E(X − aY)² with respect to a.

[Figure: the signal X plus noise gives the noisy observation Y; the linear estimator X̂ = aY produces the estimated signal.]

d/da E(X − aY)² = 0
⟹ E[ d/da (X − aY)² ] = 0
⟹ E(X − aY)Y = 0
⟹ E eY = 0
where e = X − aY is the estimation error. Thus the optimum value of a is such that the estimation error (X − aY) is orthogonal to the observed random variable Y, and the optimal estimator aY is the orthogonal projection of X on Y. This orthogonality principle forms the heart of a class of estimation problems called Wiener filtering. The orthogonality principle is illustrated geometrically in the figure below.

[Figure: the vector X, its orthogonal projection aY on Y, and the error e orthogonal to Y.]

The optimum value of a is given by
E(X − aY)Y = 0
⟹ EXY − aEY² = 0
⟹ a = EXY/EY²
The corresponding minimum linear mean-square error (LMMSE) is
LMMSE = E(X − aY)²
      = E(X − aY)X − aE(X − aY)Y
      = E(X − aY)X − 0    (E(X − aY)Y = 0, using the orthogonality principle)
      = EX² − aEXY
The orthogonality principle can be applied to optimal estimation of a random variable from more than one observation. We illustrate this in the following example.

Example: Suppose X is a zero-mean random variable which is to be estimated from two zero-mean random variables Y₁ and Y₂. Let the LMMSE estimator be X̂ = a₁Y₁ + a₂Y₂. Then the optimal values of a₁ and a₂ are given by
∂/∂aᵢ E(X − a₁Y₁ − a₂Y₂)² = 0,  i = 1, 2.
This results in the orthogonality conditions
E(X − a₁Y₁ − a₂Y₂)Y₁ = 0  and  E(X − a₁Y₁ − a₂Y₂)Y₂ = 0
Rewriting the above equations we get
a₁EY₁² + a₂EY₂Y₁ = EXY₁
a₁EY₁Y₂ + a₂EY₂² = EXY₂
Solving these equations we can find a₁ and a₂. Further, the corresponding minimum linear mean-square error (LMMSE) is
LMMSE = E(X − a₁Y₁ − a₂Y₂)²
      = E(X − a₁Y₁ − a₂Y₂)X − a₁E(X − a₁Y₁ − a₂Y₂)Y₁ − a₂E(X − a₁Y₁ − a₂Y₂)Y₂
      = E(X − a₁Y₁ − a₂Y₂)X − 0    (using the orthogonality principle)
      = EX² − a₁EXY₁ − a₂EXY₂
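A sketch of solving these normal equations numerically from sample moments. The data-generating model (X observed through two independent additive noises) is an assumption made only to produce correlated observations Y₁, Y₂.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
x = rng.normal(0, 1, n)
y1 = x + rng.normal(0, 0.5, n)
y2 = x + rng.normal(0, 1.0, n)
R = np.array([[np.mean(y1*y1), np.mean(y1*y2)],
              [np.mean(y2*y1), np.mean(y2*y2)]])   # matrix of the normal equations
r = np.array([np.mean(x*y1), np.mean(x*y2)])
a1, a2 = np.linalg.solve(R, r)
lmmse = np.mean(x*x) - a1*r[0] - a2*r[1]
print(a1, a2, lmmse)
```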

Convergence of a sequence of random variables

Let X₁, X₂, …, Xₙ be a sequence of n independent and identically distributed random variables. Suppose we want to estimate the mean of the random variable on the basis of the observed data by means of the relation
μ̂ₙ = (1/n) Σ_{i=1}^{n} Xᵢ
How closely does μ̂ₙ represent the true mean μ_X as n is increased? How do we measure the closeness between μ̂ₙ and μ_X?
Notice that μ̂ₙ is a random variable. What do we mean by the statement "μ̂ₙ converges to μ_X"?
Consider a deterministic sequence of real numbers x₁, x₂, …, xₙ, …. The sequence converges to a limit x if, corresponding to every ε > 0, we can find a positive integer N such that |x − xₙ| < ε for n > N. For example, the sequence 1, 1/2, …, 1/n, … converges to the number 0.
The Cauchy criterion gives the condition for convergence of a sequence without actually finding the limit: the sequence x₁, x₂, …, xₙ, … converges if and only if, for every ε > 0, there exists a positive integer N such that
|xₙ₊ₘ − xₙ| < ε for all n > N and all m > 0.

Convergence of a random sequence X₁, X₂, …, Xₙ, … cannot be defined as above. Note that for each s ∈ S, X₁(s), X₂(s), …, Xₙ(s), … is a sequence of numbers. Thus X₁, X₂, …, Xₙ, … represents a family of sequences of numbers. Convergence of a random sequence is to be defined using different criteria. Five of these criteria are explained below.

Convergence Everywhere
A sequence of random variables is said to converge everywhere to X if
|X(s) − Xₙ(s)| → 0 as n → ∞ for every s ∈ S.
Note here that the sequence of numbers for each sample point is convergent.

Almost sure (a.s.) convergence or convergence with probability 1

A random sequence X₁, X₂, …, Xₙ, … may not converge for every s ∈ S. Consider the event {s | Xₙ(s) → X(s)}. The sequence X₁, X₂, …, Xₙ, … is said to converge to X almost surely, or with probability 1, if
P{s | Xₙ(s) → X(s)} = 1 as n → ∞,
or equivalently, for every ε > 0 there exists N such that
P{s : |Xₙ(s) − X(s)| < ε for all n ≥ N} = 1
We write Xₙ → X a.s. in this case.

One important application is the Strong Law of Large Numbers (SLLN): if X₁, X₂, …, Xₙ, … are independent and identically distributed random variables with a finite mean μ_X, then
(1/n) Σ_{i=1}^{n} Xᵢ → μ_X with probability 1 as n → ∞.
Remark: μ̂ₙ = (1/n) Σ_{i=1}^{n} Xᵢ is called the sample mean.
The strong law of large numbers states that the sample mean converges to the true mean as the sample size increases. The SLLN is one of the fundamental theorems of probability. There is a weaker version of the law that we will discuss later.
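A simple sketch of the law of large numbers in action; the exponential(1) distribution is an arbitrary illustrative choice with true mean 1.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.exponential(1.0, 100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)   # sample mean after n observations
print(running_mean[[99, 9_999, 99_999]])                 # values approach the true mean 1
```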

Convergence in mean-square sense

A random sequence X₁, X₂, …, Xₙ, … is said to converge in the mean-square (m.s.) sense to a random variable X if
E(Xₙ − X)² → 0 as n → ∞
X is called the mean-square limit of the sequence and we write
l.i.m. Xₙ = X
where l.i.m. means limit in mean-square. We also write Xₙ → X in the m.s. sense.
The following Cauchy criterion gives the condition for m.s. convergence of a random sequence without actually finding the limit: the sequence X₁, X₂, …, Xₙ, … converges in m.s. if and only if
E|Xₙ₊ₘ − Xₙ|² → 0 as n → ∞ for every m > 0.

Example:
If X₁, X₂, …, Xₙ, … are iid random variables with finite variance σ_X², then the sample mean (1/n) Σ_{i=1}^{n} Xᵢ → μ_X in the mean-square sense as n → ∞.
We have to show that lim_{n→∞} E((1/n) Σ_{i=1}^{n} Xᵢ − μ_X)² = 0.
Now,
E((1/n) Σ_{i=1}^{n} Xᵢ − μ_X)² = E((1/n) Σ_{i=1}^{n} (Xᵢ − μ_X))²
= (1/n²) Σ_{i=1}^{n} E(Xᵢ − μ_X)² + (1/n²) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(Xᵢ − μ_X)(Xⱼ − μ_X)
= σ_X²/n + 0    (because of independence)
so that
lim_{n→∞} E((1/n) Σ_{i=1}^{n} Xᵢ − μ_X)² = lim_{n→∞} σ_X²/n = 0

Convergence in probability
Associated with the sequence of random variables X₁, X₂, …, Xₙ, …, we can define a sequence of probabilities P{|Xₙ − X| > ε}, n = 1, 2, … for every ε > 0. The sequence X₁, X₂, …, Xₙ, … is said to converge to X in probability if this sequence of probabilities converges to 0, that is,
P{|Xₙ − X| > ε} → 0 as n → ∞ for every ε > 0.
We write Xₙ → X in probability (written Xₙ →P X) to denote convergence in probability of the sequence X₁, X₂, …, Xₙ, … to the random variable X.
If a sequence is convergent in mean square, then it is convergent in probability also, because by the Markov inequality
P{|Xₙ − X|² > ε²} ≤ E(Xₙ − X)²/ε²
that is,
P{|Xₙ − X| > ε} ≤ E(Xₙ − X)²/ε²
If E(Xₙ − X)² → 0 as n → ∞ (mean square convergence), then P{|Xₙ − X| > ε} → 0 as n → ∞.

Example:
Suppose {Xₙ} is a sequence of random variables with
P{Xₙ = 0} = 1 − 1/n and P{Xₙ = 1} = 1/n
Clearly, for every 0 < ε < 1,
P{|Xₙ − 0| > ε} = P{Xₙ = 1} = 1/n → 0 as n → ∞.
Therefore {Xₙ} →P X = 0.
Thus the above sequence converges to a constant in probability.

Remark:
Convergence in probability is also called stochastic convergence.

Weak Law of Large Numbers

If X₁, X₂, …, Xₙ, … are independent and identically distributed random variables with finite mean μ_X and variance σ_X², and the sample mean is
μ̂ₙ = (1/n) Σ_{i=1}^{n} Xᵢ
then μ̂ₙ →P μ_X as n → ∞.
We have
E μ̂ₙ = E (1/n) Σ_{i=1}^{n} Xᵢ = μ_X
and
E(μ̂ₙ − μ_X)² = σ_X²/n    (as shown above)
so that
P{|μ̂ₙ − μ_X| > ε} ≤ E(μ̂ₙ − μ_X)²/ε² = σ_X²/(nε²) → 0 as n → ∞.
Convergence in distribution
Consider the random sequence X₁, X₂, …, Xₙ, … and a random variable X. Suppose F_{Xₙ}(x) and F_X(x) are the distribution functions of Xₙ and X respectively. The sequence is said to converge to X in distribution if
F_{Xₙ}(x) → F_X(x) as n → ∞
for all x at which F_X(x) is continuous. Here the two distribution functions eventually coincide. We write Xₙ →d X to denote convergence in distribution of the random sequence X₁, X₂, …, Xₙ, … to the random variable X.

Example: Suppose X₁, X₂, …, Xₙ, … is a sequence of iid random variables, each with the uniform density
f_{Xᵢ}(x) = 1/a,  0 ≤ x ≤ a
         = 0     otherwise
Define Zₙ = max(X₁, X₂, …, Xₙ). We can show that
F_{Zₙ}(z) = 0,        z < 0
         = (z/a)ⁿ,    0 ≤ z < a
         = 1,         otherwise
Clearly,
lim_{n→∞} F_{Zₙ}(z) = F_Z(z) = 0 for z < a, and 1 for z ≥ a
so {Zₙ} converges to Z = a in distribution.

Relation between Types of Convergence

Xₙ → X a.s.  : convergence almost sure
Xₙ → X m.s.  : convergence in mean-square
Xₙ → X in probability
Xₙ → X in distribution
Almost-sure convergence and mean-square convergence each imply convergence in probability, which in turn implies convergence in distribution.

Central Limit Theorem

Consider n independent random variables X₁, X₂, …, Xₙ. The mean and variance of each of the random variables are known; suppose E(Xᵢ) = μ_{Xᵢ} and var(Xᵢ) = σ²_{Xᵢ}.
Form a random variable
Yₙ = X₁ + X₂ + … + Xₙ
The mean and variance of Yₙ are given by
EYₙ = μ_{Yₙ} = μ_{X₁} + μ_{X₂} + … + μ_{Xₙ}
and
var(Yₙ) = σ²_{Yₙ} = E{Σ_{i=1}^{n} (Xᵢ − μᵢ)}²
        = Σ_{i=1}^{n} E(Xᵢ − μᵢ)² + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(Xᵢ − μᵢ)(Xⱼ − μⱼ)
        = σ²_{X₁} + σ²_{X₂} + … + σ²_{Xₙ}    (Xᵢ and Xⱼ are independent for i ≠ j)
Thus we can determine the mean and variance of Yₙ. Can we guess the probability distribution of Yₙ? The central limit theorem (CLT) provides an answer to this question. The CLT states that under very general conditions the suitably normalized sum Yₙ = Σ_{i=1}^{n} Xᵢ converges in distribution to Y ~ N(μ_Y, σ_Y²) as n → ∞. The conditions are:
1. The random variables X₁, X₂, …, Xₙ are independent with the same mean and variance, but not necessarily identically distributed.
2. The random variables X₁, X₂, …, Xₙ are independent with different means and the same variance, and not identically distributed.
3. The random variables X₁, X₂, …, Xₙ are independent with different means and each variance being neither too small nor too large.
We shall consider the first condition only. In this case, the central limit theorem can be stated as follows:

Suppose X₁, X₂, …, Xₙ is a sequence of independent and identically distributed random variables, each with mean μ_X and variance σ_X², and
Yₙ = (1/√n) Σ_{i=1}^{n} (Xᵢ − μ_X)
Then the sequence {Yₙ} converges in distribution to a Gaussian random variable Y with mean 0 and variance σ_X². That is,
lim_{n→∞} F_{Yₙ}(y) = ∫_{-∞}^{y} (1/(√(2π) σ_X)) e^{−u²/2σ_X²} du
Remarks
The central limit theorem is really a property of convolution. Consider the sum of two statistically independent random variables, say Y = X₁ + X₂. Then the pdf f_Y(y) is the convolution of f_{X₁}(y) and f_{X₂}(y). This can be shown with the help of the characteristic functions as follows:
φ_Y(ω) = E e^{jω(X₁ + X₂)} = E(e^{jωX₁}) E(e^{jωX₂}) = φ_{X₁}(ω) φ_{X₂}(ω)
so that
f_Y(y) = f_{X₁}(y) * f_{X₂}(y) = ∫_{-∞}^{∞} f_{X₁}(τ) f_{X₂}(y − τ) dτ
where * is the convolution operation. We can illustrate this by convolving two uniform distributions repeatedly: the convolution of two uniform distributions gives a triangular distribution, a further convolution gives a piecewise-parabolic distribution, and so on, approaching the Gaussian shape.

Proof of the central limit theorem

We give a less rigorous proof of the theorem with the help of the characteristic function. Further, we consider each of X₁, X₂, …, Xₙ to have zero mean. Thus
Yₙ = (X₁ + X₂ + … + Xₙ)/√n
Clearly,
μ_{Yₙ} = 0,  σ²_{Yₙ} = σ_X²,  E(Yₙ³) = E(X³)/√n, and so on.
The characteristic function of Yₙ is given by
φ_{Yₙ}(ω) = E(e^{jωYₙ}) = E exp{ jω (1/√n) Σ_{i=1}^{n} Xᵢ }
We will show that as n → ∞ the characteristic function φ_{Yₙ} is of the form of the characteristic function of a Gaussian random variable.
Since the Xᵢ are independent and identically distributed,
φ_{Yₙ}(ω) = [φ_X(ω/√n)]ⁿ
Expanding e^{jωX/√n} in a power series and assuming all the moments of X to be finite,
φ_X(ω/√n) = 1 + j(ω/√n) EX + (j ω/√n)² EX²/2! + (j ω/√n)³ EX³/3! + …
Substituting EX = 0 and EX² = σ_X², we get
φ_X(ω/√n) = 1 − (ω²/2n) σ_X² + R(ω, n)
where R(ω, n) is the sum of the terms involving ω³ and higher powers of ω. Note that each term in R(ω, n) involves a higher moment divided by a power of n larger than one, so that n R(ω, n) → 0 as n → ∞. Therefore
ln φ_{Yₙ}(ω) = n ln( 1 − ω²σ_X²/2n + R(ω, n) ) → −ω²σ_X²/2
and
lim_{n→∞} φ_{Yₙ}(ω) = e^{−ω²σ_X²/2}
which is the characteristic function of a Gaussian random variable with 0 mean and variance σ_X². Hence
Yₙ →d N(0, σ_X²)
Remark:
(1) Under the conditions of the CLT, the sample mean X̄ = (1/n) Σ_{i=1}^{n} Xᵢ is approximately N(μ_X, σ_X²/n) for large n. In other words, if samples are taken from any distribution with mean μ_X and variance σ_X², as the sample size n increases, the distribution function of the (suitably normalized) sample mean approaches the distribution function of a Gaussian random variable.
(2) The CLT states that the distribution function F_{Yₙ}(y) converges to a Gaussian distribution function. The theorem does not say that the pdf f_{Yₙ}(y) is a Gaussian pdf in the limit. For example, suppose each Xᵢ has a Bernoulli distribution. Then the pdf of Yₙ consists of impulses and can never approach the Gaussian pdf.
(3) The Cauchy distribution does not meet the conditions for the central limit theorem to hold. As we have noted earlier, this distribution does not have a finite mean or variance. Suppose a random variable Xᵢ has the Cauchy distribution
f_{Xᵢ}(x) = 1/(π(1 + x²)),   −∞ < x < ∞.
The characteristic function of Xᵢ is given by
φ_{Xᵢ}(ω) = e^{−|ω|}
The sample mean X̄ = (1/n) Σ_{i=1}^{n} Xᵢ will have the characteristic function
φ_{X̄}(ω) = e^{−|ω|}
Thus the sum (or the mean) of a large number of Cauchy random variables will not follow a Gaussian distribution.
(4) The central limit theorem is one of the most widely used results of probability. If a random variable is the result of several independent causes, then the random variable can be considered to be Gaussian. For example,
- the thermal noise in a resistor is the result of the independent motion of billions of electrons and is modelled as Gaussian;
- the observation error/measurement error of any process is modeled as Gaussian.
(5) The CLT can be used to simulate a Gaussian distribution given a routine to simulate a particular random variable.
Normal approximation of the Binomial distribution

One application of the CLT is the approximation of the Binomial probabilities. Suppose X₁, X₂, X₃, …, Xₙ is a sequence of Bernoulli(p) random variables with P{Xᵢ = 1} = p and P{Xᵢ = 0} = 1 − p. Then
Yₙ = Σ_{i=1}^{n} Xᵢ has a Binomial distribution with mean μ_{Yₙ} = np and variance σ²_{Yₙ} = np(1 − p).
Thus,
(Yₙ − np)/√(np(1 − p)) →d N(0, 1)
or
Yₙ ≈ N(np, np(1 − p))
Therefore,
P(Yₙ = k) = P(k − 1 < Yₙ ≤ k) ≈ ∫_{k-1}^{k} 1/√(2πnp(1−p)) e^{−(y − np)²/2np(1−p)} dy
          ≈ 1/√(2πnp(1−p)) e^{−(k − np)²/2np(1−p)}    (taking the integrand to be constant over an interval of length 1)
This is the normal approximation to the Binomial probabilities and is known as the De Moivre-Laplace approximation.
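A numerical sketch of the De Moivre-Laplace approximation; n = 100 and p = 0.3 are illustrative values.

```python
import numpy as np
from scipy import stats

n, p = 100, 0.3
k = np.arange(20, 41)
exact = stats.binom.pmf(k, n, p)
approx = np.exp(-(k - n*p)**2 / (2*n*p*(1 - p))) / np.sqrt(2*np.pi*n*p*(1 - p))
print(np.max(np.abs(exact - approx)))   # the two agree closely (difference of order 1e-3 or less)
```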

RANDOM PROCESSES
In practical problems we deal with time-varying waveforms whose value at any time is random in nature. For example, the speech waveform, the signal received by a communication receiver, or the daily record of stock-market data represent random variables that change with time. How do we characterize such data? Such data are characterized as random or stochastic processes. This lecture covers the fundamentals of random processes.

Random processes
Recall that a random variable maps each sample point in the sample space to a point on the real line. A random process maps each sample point to a waveform. Consider a probability space {S, F, P}. A random process can be defined on {S, F, P} as an indexed family of random variables {X(s, t), s ∈ S, t ∈ Γ} where Γ is an index set which may be discrete or continuous, usually denoting time. Thus a random process is a function of the sample point ξ and the index variable t and may be written as X(t, ξ).
Remark
- For a fixed t (= t₀), X(t₀, ξ) is a random variable.
- For a fixed ξ (= ξ₀), X(t, ξ₀) is a single realization of the random process and is a deterministic function.
- For a fixed ξ (= ξ₀) and a fixed t (= t₀), X(t₀, ξ₀) is a single number.
- When both t and ξ are varying we have the random process X(t, ξ).

The random process {X(s, t), s ∈ S, t ∈ Γ} is normally denoted by {X(t)}. The following figure illustrates a random process.

[Figure: three realizations X(t, s₁), X(t, s₂), X(t, s₃) of a random process, one waveform for each sample point.]
Example: Consider a sinusoidal signal X(t) = A cos ωt where A is a binary random variable with probability mass function p_A(1) = p and p_A(−1) = 1 − p. Clearly, {X(t), t ∈ ℝ} is a random process with two possible realizations X₁(t) = cos ωt and X₂(t) = −cos ωt. At a particular time t₀, X(t₀) is a random variable with the two values cos ωt₀ and −cos ωt₀.

Continuous-time vs. discrete-time process
If the index set Γ is continuous, {X(t), t ∈ Γ} is called a continuous-time process.
Example: Suppose X(t) = A cos(ω₀t + φ) where A and ω₀ are constants and φ is uniformly distributed between 0 and 2π. X(t) is an example of a continuous-time process.
[Figure: four realizations of the process, for φ = 0.8373, 0.9320, 1.6924 and 1.8636.]

If the index set is a countable set, {X(t), t ∈ Γ} is called a discrete-time process. Such a random process can be represented as {X[n], n ∈ Z} and is called a random sequence. Sometimes the notation {Xₙ, n ≥ 0} is used to describe a random sequence indexed by the set of positive integers.
We can define a discrete-time random process on discrete points of time. Particularly, we can get a discrete-time random process {X[n], n ∈ Z} by sampling a continuous-time process {X(t), t ∈ Γ} at a uniform interval T such that X[n] = X(nT). The discrete-time random process is more important in practical implementations; advanced statistical signal processing techniques have been developed to process this type of signals.
Example: Suppose Xₙ = 2 cos(ω₀n + Y) where ω₀ is a constant and Y is a random variable uniformly distributed between −π and π. Xₙ is an example of a discrete-time process.
[Figure: three realizations of the sequence, for Y = 0.4623, 1.9003 and 0.9720.]
Continuous-state vs. discrete-state process:
The value of a random process X(t) at any time t can be described from its probabilistic model. The state is the value taken by X(t) at a time t, and the set of all such states is called the state space. A random process is discrete-state if the state space is finite or countable; this also means that the corresponding sample space is finite or countable. Otherwise the random process is called a continuous-state process.
Example: Consider the random sequence {Xₙ, n ≥ 0} generated by repeated tossing of a fair coin where we assign 1 to Head and 0 to Tail. Clearly Xₙ can take only two values, 0 and 1. Hence {Xₙ, n ≥ 0} is a discrete-time two-state process.

How to describe a random process?

As we have observed above, X(t) at a specific time t is a random variable and can be described by its probability distribution function F_{X(t)}(x) = P(X(t) ≤ x). This distribution function is called the first-order probability distribution function. We can similarly define the first-order probability density function
f_{X(t)}(x) = dF_{X(t)}(x)/dx

To describe {X(t), t ∈ Γ} we have to use the joint distribution function of the random variables at all possible values of t. For any positive integer n, X(t₁), X(t₂), …, X(tₙ) represent n jointly distributed random variables. Thus a random process {X(t), t ∈ Γ} can be described by specifying the n-th order joint distribution function
F_{X(t₁),…,X(tₙ)}(x₁, x₂, …, xₙ) = P(X(t₁) ≤ x₁, X(t₂) ≤ x₂, …, X(tₙ) ≤ xₙ),   for all n ≥ 1 and all t₁, …, tₙ ∈ Γ
or the n-th order joint density function
f_{X(t₁),…,X(tₙ)}(x₁, …, xₙ) = ∂ⁿ F_{X(t₁),…,X(tₙ)}(x₁, …, xₙ)/∂x₁∂x₂…∂xₙ
If {X(t), t ∈ Γ} is a discrete-state random process, then it can also be specified by the collection of n-th order joint probability mass functions
p_{X(t₁),…,X(tₙ)}(x₁, …, xₙ) = P(X(t₁) = x₁, …, X(tₙ) = xₙ),   for all n ≥ 1 and all t₁, …, tₙ ∈ Γ
If the random process is continuous-state, it can be specified by Moments of a random process We defined the moments of a random variable and joint moments of random variables. We can define all the possible moments and joint moments of a random process
{ X (t ), t }. Particularly, following moments are important.

x (t ) = Mean of the random process at t = E ( X (t )


RX (t1 , t2 ) = autocorrelation function of the process at times t1 , t2 = E ( X (t 1 ) X (t2 )) Note that

RX (t1 , t2 ) = RX (t2 , t1 , ) and RX (t , t ) = EX 2 (t ) = sec ond moment or mean - square value at time t.

The autocovariance function C X (t1 , t2 ) of the random process at time t1 and t2 is defined by

C X (t1 , t2 ) = E ( X (t 1 ) X (t1 ))( X (t2 ) X (t2 )) =RX (t1 , t2 ) X (t1 ) X (t2 ) C X (t , t ) = E ( X (t ) X (t )) 2 = variance of the process at time t.

These moments give partial information about the process. The ratio X (t1, t2 ) =
CX (t1, t2 ) CX (t1, t1 ) CX (t2 , t2 )

is called the correlation coefficient. to

The autocorrelation function and the autocovariance functions are widely used characterize a class of random process called the wide-sense stationary process.

We can also define higher-order moments R X (t1 , t 2 , t 3 ) = E ( X (t 1 ), X (t 2 ), X (t 3 )) = Triple correlation function at t1 , t2 , t3 etc. The above definitions are easily extended to a random sequence { X n , n 0}. Example (a) Gaussian Random Process For any positive integer n, X (t1 ), X (t 2 ),..... X (t n ) represent variables. These
n

n jointly
a random

random vector

random

variables

define

X = [ X (t1 ), X (t2 ),..... X (tn )]'. The process X (t ) is called Gaussian if the random vector [ X (t1 ), X (t2 ),..... X (tn )]' is jointly Gaussian with the joint density function given by
1 X'C1X X e 2

f X (t1 ), X (t2 )... X (tn ) ( x1 , x2 ,...xn ) =


where C X = E ( X X )( X X ) '

det(CX )

and X = E ( X) = [ E ( X 1 ), E ( X 2 )......E ( X n ) ] '.

The Gaussian Random Process is completely specified by the autocovariance matrix and hence by the mean vector and the autocorrelation matrix R X = EXX ' . (b) Bernoulli Random Process A Bernoulli process is a discrete-time random process consisting of a sequence of independent and identically distributed Bernoulli random variables. Thus the discrete time random process { X n , n 0} is Bernoulli process if P{ X n = 1} = p and P{ X n = 0} = 1 p Example Consider the random sequence { X n , n 0} generated by repeated tossing of a fair coin where we assign 1 to Head and 0 to Tail. Here { X n , n 0} is a Bernoulli process where each random variable X n is a Bernoulli random variable with
1 and 2 1 p X (0) = P{ X n = 0} = 2 p X (1) = P{ X n = 1} =

(c) A sinusoid with a random phase X (t ) = A cos( w0 t + ) where A and w0 are constants and is uniformly distributed between 0 and 2 . Thus 1 f ( ) = 2 X (t ) at a particular t is a random variable and it can be shown that 1 f X ( t ) ( x ) = A2 x 2 0 x<A otherwise

The pdf is sketched in the Fig. below: The mean and autocorrelation of X (t ) :

X (t ) = EX (t )
= EA cos( w0t + ) = A cos( w0t + )

1 d 2

=0 RX (t1 , t2 ) = EA cos( w0t1 + ) A cos( w0t2 + ) = A2 E cos( w0t1 + ) cos( w0t2 + ) A2 E (cos( w0 (t1 t2 )) + cos( w0 (t1 + t2 + 2 ))) 2 1 A2 A2 cos( w0 (t1 t2 )) + = d cos( w0 (t1 + t2 + 2 )) 2 2 2 A2 = cos( w0 (t1 t2 )) 2 =

Two or More Random Processes

In practical situations we deal with two or more random processes. We often deal with the input and output processes of a system. To describe two or more random processes we have to use the joint distribution functions and the joint moments. Consider two random processes { X (t ), t } and {Y (t ), t }. For any positive integer
/ n , X (t1 ), X (t2 ),..... X (tn ), Y (t1/ ), Y (t2 ),.....Y (t n/ ) represent 2n jointly distributed random

variables. Thus these two random processes can be described by the ( n + m)th order joint distribution function

FX (t ), X (t
1

2 )..... X

/ / / ( tn ),Y ( t1 ),Y ( t2 ),.....Y ( t m )

( x1 , x2 .....xn , y1 , y2 ..... ym )

/ = P( X (t1 ) x1 , X (t2 ) x2 ..... X (tn ) xn , Y (t1/ ) y1 , Y (t 2/ ) y2 .....Y (tm ) ym )

or the corresponding ( n + m )th order joint density function


f X (t ), X (t
1 2 )..... X / / / ( tn ),Y ( t1 ),Y ( t2 ),.....Y ( tn )

( x1 , x2 .....xn , y1 , y2 ..... yn )

2n F / / / ( x1 , x2 ..... xn , y1 , y2 ..... yn ) x1x2 ...xn y1y2 ...yn X ( t1 ), X (t2 )..... X ( tn ),Y (t1 ),Y (t2 ),.....Y (t n )

Two random processes can be partially described by the joint moments: Cross correlation function of the processes at times t1 , t2 RXY (t1 , t2 ) = E ( X (t 1 )Y (t2 )) = E ( X (t 1 )Y (t2 ))
Cross cov ariance function of the processes at times t1 , t2

C XY (t1 , t2 ) == E ( X (t 1 ) X (t1 ))( X (t2 ) X (t2 )) = RXY (t1 , t2 ) X (t1 ) Y (t2 )


Cross-correlation coefficient

XY (t1, t2 ) =

CXY (t1, t2 ) CX (t1, t1 ) CY (t2 , t2 )

On the basis of the above definitions, we can study the degree of dependence between two random processes Independent processes: Two random processes { X (t ), t } and {Y (t ), t }. are called independent if Uncorrelated processes: Two random processes { X (t ), t } and {Y (t ), t }. are called uncorrelated if C XY (t1 , t2 ) = 0 t1 , t2 This also implies that for such two processes

RXY (t1 , t2 ) = X (t1 ) Y (t2 ) .


Orthogonal processes: Two random processes { X (t ), t } and {Y (t ), t }. are called orthogonal if R XY (t1 , t2 ) = 0 t1 , t2
Example Suppose X (t ) = A cos( w0 t + ) and Y (t ) = A sin( w0t + ) where A and w0 are constants and is uniformly distributed between 0 and 2 .

RANDOM PROCESSES

In practical problems we deal with time-varying waveforms whose value at any instant is random. For example, a speech waveform, the signal received by a communication receiver, or a daily record of stock-market prices is a quantity that varies randomly with time. Such data are modelled as random (stochastic) processes, and this lecture covers the fundamentals of random processes.

Random processes

Recall that a random variable maps each sample point of the sample space to a point on the real line. A random process maps each sample point to a waveform. Consider a probability space {S, F, P}. A random process on {S, F, P} is an indexed family of random variables {X(s, t), s ∈ S, t ∈ Γ}, where Γ is an index set, discrete or continuous, usually denoting time. Thus a random process is a function of the sample point s and the index variable t and may be written as X(t, s).

Remark
For a fixed t (= t_0), X(t_0, s) is a random variable. For a fixed s (= s_0), X(t, s_0) is a single realization of the random process and is a deterministic function of t. For fixed s (= s_0) and fixed t (= t_0), X(t_0, s_0) is a single number. When both t and s vary, we have the random process X(t, s).

The random process {X(s, t), s ∈ S, t ∈ Γ} is usually denoted simply by {X(t)}. The figure below illustrates a random process as an ensemble of realizations X(t, s_1), X(t, s_2), X(t, s_3) corresponding to the sample points s_1, s_2, s_3.

[Figure: Random Process — ensemble of realizations X(t, s_1), X(t, s_2), X(t, s_3)]

Example Consider the sinusoidal signal X(t) = A cos t, where A is a binary random variable with probability mass function p_A(1) = p and p_A(-1) = 1 - p. Clearly, {X(t), t ∈ ℝ} is a random process with two possible realizations, X_1(t) = cos t and X_2(t) = -cos t. At a particular time t_0, X(t_0) is a random variable with the two values cos t_0 and -cos t_0.

Continuous-time vs. discrete-time process

If the index set Γ is continuous, {X(t), t ∈ Γ} is called a continuous-time process.

Example Suppose X(t) = A cos(ω_0 t + Φ), where A and ω_0 are constants and Φ is uniformly distributed between 0 and 2π. X(t) is an example of a continuous-time process. Four realizations of the process, corresponding to the phase values Φ = 0.8373, 0.9320, 1.6924 and 1.8636, are shown in the figure below.

[Figure: four realizations of X(t) = A cos(ω_0 t + Φ)]

If the index set Γ is countable, {X(t), t ∈ Γ} is called a discrete-time process. Such a random process can be represented as {X[n], n ∈ Z} and is called a random sequence. Sometimes the notation {X_n, n ≥ 0} is used for a random sequence indexed by the non-negative integers.

A discrete-time random process is defined on discrete points of time. In particular, we can obtain a discrete-time random process {X[n], n ∈ Z} by sampling a continuous-time process {X(t), t ∈ Γ} at a uniform interval T so that X[n] = X(nT). The discrete-time random process is the more important one in practical implementations; advanced statistical signal-processing techniques have been developed to process such signals.

Example Suppose X_n = 2 cos(ω_0 n + Y), where ω_0 is a constant and Y is a random variable uniformly distributed between -π and π. {X_n} is an example of a discrete-time process. Realizations corresponding to Y = 0.4623, 1.9003 and 0.9720 are shown in the figure below.

[Figure: realizations of X_n = 2 cos(ω_0 n + Y)]
Continuous-state vs. discrete-state process:
The value taken by a random process X(t) at any time t can be described from its probabilistic model.

The state is the value taken by X(t) at a time t, and the set of all possible states is called the state space. A random process is discrete-state if the state space is finite or countable; this means that the corresponding sample space is also finite or countable. Otherwise the random process is called continuous-state.

Example Consider the random sequence {X_n, n ≥ 0} generated by repeated tossing of a fair coin, where we assign 1 to Head and 0 to Tail. Clearly X_n can take only two values, 0 and 1. Hence {X_n, n ≥ 0} is a discrete-time, two-state process.

How to describe a random process?


As observed above, X(t) at a specific time t is a random variable and can be described by its probability distribution function F_{X(t)}(x) = P(X(t) ≤ x). This distribution function is called the first-order probability distribution function. We can similarly define the first-order probability density function f_{X(t)}(x) = dF_{X(t)}(x)/dx.

To describe {X(t), t ∈ Γ} completely we have to use the joint distribution functions of the random variables at all possible choices of time instants. For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. Thus a random process {X(t), t ∈ Γ} can be described by specifying the n-th order joint distribution function

F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = P(X(t_1) ≤ x_1, X(t_2) ≤ x_2, ..., X(t_n) ≤ x_n), for all n ≥ 1 and all t_1, ..., t_n ∈ Γ,

or the n-th order joint density function

f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = ∂^n F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) / (∂x_1 ∂x_2 ... ∂x_n).

If {X(t), t ∈ Γ} is a discrete-state random process, it can also be specified by the collection of n-th order joint probability mass functions

p_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = P(X(t_1) = x_1, X(t_2) = x_2, ..., X(t_n) = x_n), for all n ≥ 1 and all t_1, ..., t_n ∈ Γ.

If the random process is continuous-state, it can similarly be specified by the collection of n-th order joint density functions.

Moments of a random process

We have defined the moments of a random variable and the joint moments of random variables. We can define all possible moments and joint moments of a random process {X(t), t ∈ Γ}. The following moments are particularly important.

μ_X(t) = mean of the random process at t = E[X(t)].

R_X(t_1, t_2) = autocorrelation function of the process at times t_1, t_2 = E[X(t_1)X(t_2)].

Note that R_X(t_1, t_2) = R_X(t_2, t_1) and R_X(t, t) = E[X²(t)] = second moment (mean-square value) at time t.

The autocovariance function C_X(t_1, t_2) of the random process at times t_1 and t_2 is defined by

C_X(t_1, t_2) = E[(X(t_1) - μ_X(t_1))(X(t_2) - μ_X(t_2))] = R_X(t_1, t_2) - μ_X(t_1)μ_X(t_2),
C_X(t, t) = E[(X(t) - μ_X(t))²] = variance of the process at time t.

These moments give partial information about the process. The ratio

ρ_X(t_1, t_2) = C_X(t_1, t_2) / √(C_X(t_1, t_1) C_X(t_2, t_2))

is called the correlation coefficient.

The autocorrelation and autocovariance functions are widely used to characterize a class of random processes called wide-sense stationary processes.

We can also define higher-order moments, e.g.

R_X(t_1, t_2, t_3) = E[X(t_1)X(t_2)X(t_3)] = triple correlation function at t_1, t_2, t_3, etc.

The above definitions are easily extended to a random sequence {X_n, n ≥ 0}.

Example (a) Gaussian Random Process

For any positive integer n, X(t_1), X(t_2), ..., X(t_n) represent n jointly distributed random variables. These n random variables define a random vector X = [X(t_1), X(t_2), ..., X(t_n)]'. The process {X(t)} is called Gaussian if the random vector [X(t_1), X(t_2), ..., X(t_n)]' is jointly Gaussian for every n and every choice of t_1, ..., t_n, with joint density function

f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = (1 / ((2π)^{n/2} √det(C_X))) exp(-(1/2)(x - μ_X)' C_X^{-1} (x - μ_X)),

where C_X = E[(X - μ_X)(X - μ_X)'] and μ_X = E[X] = [E X(t_1), E X(t_2), ..., E X(t_n)]'.

A Gaussian random process is thus completely specified by the mean vector μ_X and the autocovariance matrix C_X (equivalently, by μ_X and the autocorrelation matrix R_X = E[XX']).

(b) Bernoulli Random Process

A Bernoulli process is a discrete-time random process consisting of a sequence of independent and identically distributed Bernoulli random variables. Thus the discrete-time random process {X_n, n ≥ 0} is a Bernoulli process if

P{X_n = 1} = p and P{X_n = 0} = 1 - p.

Example Consider the random sequence {X_n, n ≥ 0} generated by repeated tossing of a fair coin, where we assign 1 to Head and 0 to Tail. Here {X_n, n ≥ 0} is a Bernoulli process in which each X_n is a Bernoulli random variable with

p_X(1) = P{X_n = 1} = 1/2 and p_X(0) = P{X_n = 0} = 1/2.
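The fact that a Gaussian process is fixed by its mean and autocovariance functions suggests a simple way to generate sample paths numerically: evaluate μ_X and C_X on a grid of time points and draw a jointly Gaussian vector. The sketch below (Python with NumPy; not part of the original notes) uses an illustrative covariance C_X(t_1, t_2) = e^{-|t_1 - t_2|}; both the covariance choice and the function names are assumptions made only for this example.

```python
import numpy as np

def sample_gaussian_process(t, mean_fn, cov_fn, n_paths=3, seed=0):
    """Draw sample paths of a Gaussian process at the time points t.

    mean_fn(t) returns the mean vector, cov_fn(t1, t2) the autocovariance;
    both are user-supplied modelling assumptions, not taken from the notes.
    """
    rng = np.random.default_rng(seed)
    mu = mean_fn(t)                                  # mean vector  mu_X
    t1, t2 = np.meshgrid(t, t, indexing="ij")
    C = cov_fn(t1, t2)                               # covariance matrix C_X
    return rng.multivariate_normal(mu, C, size=n_paths)

t = np.linspace(0.0, 5.0, 200)
paths = sample_gaussian_process(
    t,
    mean_fn=lambda t: np.zeros_like(t),              # zero-mean process
    cov_fn=lambda a, b: np.exp(-np.abs(a - b)),      # illustrative C_X(t1, t2)
)
print(paths.shape)   # (3, 200): three realizations of [X(t_1), ..., X(t_n)]
```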

(c) A sinusoid with a random phase

X(t) = A cos(ω_0 t + Φ), where A and ω_0 are constants and Φ is uniformly distributed between 0 and 2π, so that

f_Φ(φ) = 1/(2π), 0 ≤ φ ≤ 2π.

X(t) at a particular t is a random variable, and it can be shown that

f_{X(t)}(x) = 1/(π√(A² - x²)) for |x| < A, and 0 otherwise.

The pdf is sketched in the figure below.

The mean and autocorrelation of X(t):

μ_X(t) = E[X(t)] = E[A cos(ω_0 t + Φ)] = ∫_0^{2π} A cos(ω_0 t + φ) (1/(2π)) dφ = 0,

R_X(t_1, t_2) = E[A cos(ω_0 t_1 + Φ) A cos(ω_0 t_2 + Φ)]
  = (A²/2) E[cos(ω_0(t_1 - t_2)) + cos(ω_0(t_1 + t_2) + 2Φ)]
  = (A²/2) cos(ω_0(t_1 - t_2)) + (A²/2) ∫_0^{2π} cos(ω_0(t_1 + t_2) + 2φ) (1/(2π)) dφ
  = (A²/2) cos(ω_0(t_1 - t_2)).
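As a quick numerical sanity check of the result above, the sketch below (Python/NumPy; not part of the notes) averages A cos(ω_0 t_1 + Φ) A cos(ω_0 t_2 + Φ) over many independent draws of Φ and compares it with (A²/2) cos(ω_0(t_1 - t_2)); the specific values of A, ω_0, t_1, t_2 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A, w0 = 2.0, 2 * np.pi          # arbitrary amplitude and angular frequency
t1, t2 = 0.3, 1.1               # arbitrary time instants
phi = rng.uniform(0.0, 2 * np.pi, size=1_000_000)   # Phi ~ U[0, 2*pi]

x1 = A * np.cos(w0 * t1 + phi)
x2 = A * np.cos(w0 * t2 + phi)

print(np.mean(x1))                            # ~ 0  (mean of the process)
print(np.mean(x1 * x2))                       # Monte Carlo estimate of R_X(t1, t2)
print(0.5 * A**2 * np.cos(w0 * (t1 - t2)))    # theoretical (A^2/2) cos(w0 (t1 - t2))
```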

Two or More Random Processes

In practical situations we deal with two or more random processes; for example, the input and output processes of a system. To describe two or more random processes we have to use joint distribution functions and joint moments. Consider two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ}. For any positive integers n and m, X(t_1), ..., X(t_n), Y(t'_1), ..., Y(t'_m) represent n + m jointly distributed random variables. Thus the two random processes can be described by the (n + m)-th order joint distribution function

F_{X(t_1), ..., X(t_n), Y(t'_1), ..., Y(t'_m)}(x_1, ..., x_n, y_1, ..., y_m)
  = P(X(t_1) ≤ x_1, ..., X(t_n) ≤ x_n, Y(t'_1) ≤ y_1, ..., Y(t'_m) ≤ y_m),

or by the corresponding (n + m)-th order joint density function

f_{X(t_1), ..., X(t_n), Y(t'_1), ..., Y(t'_m)}(x_1, ..., x_n, y_1, ..., y_m)
  = ∂^{n+m} F_{X(t_1), ..., X(t_n), Y(t'_1), ..., Y(t'_m)}(x_1, ..., x_n, y_1, ..., y_m) / (∂x_1 ... ∂x_n ∂y_1 ... ∂y_m).

Two random processes can be partially described by the following joint moments.

Cross-correlation function of the processes at times t_1, t_2:
R_XY(t_1, t_2) = E[X(t_1)Y(t_2)].

Cross-covariance function of the processes at times t_1, t_2:
C_XY(t_1, t_2) = E[(X(t_1) - μ_X(t_1))(Y(t_2) - μ_Y(t_2))] = R_XY(t_1, t_2) - μ_X(t_1)μ_Y(t_2).

Cross-correlation coefficient:
ρ_XY(t_1, t_2) = C_XY(t_1, t_2) / √(C_X(t_1, t_1) C_Y(t_2, t_2)).

On the basis of the above definitions, we can study the degree of dependence between two random processes.

Independent processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called independent if, for every choice of time instants, the random vectors (X(t_1), ..., X(t_n)) and (Y(t'_1), ..., Y(t'_m)) are independent, i.e. their joint distribution function factors into the product of the two individual joint distribution functions.

Uncorrelated processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called uncorrelated if C_XY(t_1, t_2) = 0 for all t_1, t_2. For such processes this implies R_XY(t_1, t_2) = μ_X(t_1)μ_Y(t_2).

Orthogonal processes: Two random processes {X(t), t ∈ Γ} and {Y(t), t ∈ Γ} are called orthogonal if R_XY(t_1, t_2) = 0 for all t_1, t_2.

Example Suppose X(t) = A cos(ω_0 t + Φ) and Y(t) = A sin(ω_0 t + Φ), where A and ω_0 are constants and Φ is uniformly distributed between 0 and 2π. One can verify that R_XY(t_1, t_2) = E[A² cos(ω_0 t_1 + Φ) sin(ω_0 t_2 + Φ)] = (A²/2) sin(ω_0(t_2 - t_1)), which vanishes for t_1 = t_2 but not for all pairs (t_1, t_2).

Important Classes of Random Processes

Having characterized a random process by its joint distribution (density) functions and joint moments, we now define the following important classes of random processes.

(a) Independent and Identically Distributed Process

Consider a discrete-time random process {X_n}. For any finite choice of time instants n_1, n_2, ..., n_N, if the random variables X_{n_1}, X_{n_2}, ..., X_{n_N} are jointly independent with a common distribution, then {X_n} is called an independent and identically distributed (iid) random process. Thus for an iid random process {X_n},

F_{X_{n_1}, X_{n_2}, ..., X_{n_N}}(x_1, x_2, ..., x_N) = F_X(x_1) F_X(x_2) ... F_X(x_N)

and, equivalently, for a discrete-state iid process,

p_{X_{n_1}, X_{n_2}, ..., X_{n_N}}(x_1, x_2, ..., x_N) = p_X(x_1) p_X(x_2) ... p_X(x_N).

Moments of the iid process: It is easy to verify that for an iid process {X_n}

Mean: E[X_n] = μ_X = constant.
Variance: E[(X_n - μ_X)²] = σ_X² = constant.
Autocovariance: C_X(n, m) = E[(X_n - μ_X)(X_m - μ_X)] = E[X_n - μ_X] E[X_m - μ_X] = 0 for n ≠ m, and = σ_X² for n = m. Thus C_X(n, m) = σ_X² δ[n, m], where δ[n, m] = 1 for n = m and 0 otherwise.
Autocorrelation: R_X(n, m) = C_X(n, m) + μ_X² = σ_X² δ[n, m] + μ_X².

Example Bernoulli process: Consider the Bernoulli process {X_n} with p_X(1) = p and p_X(0) = 1 - p. This process is an iid process. Using the iid property, we can obtain the joint probability mass function of any order in terms of p. For example,

p_{X_1, X_2}(1, 0) = p(1 - p),
p_{X_1, X_2, X_3}(0, 0, 1) = (1 - p)² p,

and so on. Similarly, the mean, the variance and the autocorrelation function are given by

μ_X = E[X_n] = p,
var(X_n) = p(1 - p),
R_X(n_1, n_2) = E[X_{n_1} X_{n_2}] = E[X_{n_1}] E[X_{n_2}] = p² for n_1 ≠ n_2 (and R_X(n, n) = E[X_n²] = p).
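A short simulation (Python/NumPy; the parameter values are arbitrary and not from the notes) can confirm the Bernoulli-process moments quoted above: sample mean ≈ p, sample variance ≈ p(1 - p), and E[X_{n_1}X_{n_2}] ≈ p² for n_1 ≠ n_2.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_samples = 0.3, 200_000

# Each row is one realization of (X_1, ..., X_10) of the iid Bernoulli process.
X = (rng.uniform(size=(n_samples, 10)) < p).astype(float)

print(X.mean())                      # ~ p
print(X.var())                       # ~ p (1 - p)
print(np.mean(X[:, 2] * X[:, 7]))    # ~ p^2   (R_X(n1, n2) for n1 != n2)
print(np.mean(X[:, 4] * X[:, 4]))    # ~ p     (R_X(n, n) = E[X_n^2])
```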

(b) Independent Increment Process

A random process {X(t)} is called an independent increment process if for any n > 1 and t_1 < t_2 < ... < t_n, the n random variables X(t_1), X(t_2) - X(t_1), ..., X(t_n) - X(t_{n-1}) are jointly independent. If, in addition, the probability distribution of X(t + r) - X(t' + r) is the same as that of X(t) - X(t') for any choice of t, t' and r, then {X(t)} is called a stationary increment process. These definitions extend readily to discrete-time random processes.

The independent increment property simplifies the calculation of joint probability distribution, density and mass functions from the corresponding first-order quantities. As an example, for t_1 < t_2 and x_1 < x_2,

F_{X(t_1), X(t_2)}(x_1, x_2) = P({X(t_1) ≤ x_1, X(t_2) ≤ x_2})
  = P({X(t_1) ≤ x_1}) P({X(t_2) ≤ x_2} | {X(t_1) ≤ x_1})
  = P({X(t_1) ≤ x_1}) P({X(t_2) - X(t_1) ≤ x_2 - x_1})
  = F_{X(t_1)}(x_1) F_{X(t_2) - X(t_1)}(x_2 - x_1).

The independent increment property also simplifies the computation of the autocovariance function. For t_1 < t_2, the autocorrelation function of X(t) is

R_X(t_1, t_2) = E[X(t_1)X(t_2)]
  = E[X(t_1)(X(t_1) + X(t_2) - X(t_1))]
  = E[X²(t_1)] + E[X(t_1)] E[X(t_2) - X(t_1)]
  = E[X²(t_1)] + E[X(t_1)] E[X(t_2)] - (E[X(t_1)])²
  = var(X(t_1)) + E[X(t_1)] E[X(t_2)],

so that

C_X(t_1, t_2) = E[X(t_1)X(t_2)] - E[X(t_1)] E[X(t_2)] = var(X(t_1)).

Similarly, for t_1 > t_2, C_X(t_1, t_2) = var(X(t_2)). Therefore

C_X(t_1, t_2) = var(X(min(t_1, t_2))).

Example Two continuous-time independent increment processes are widely studied: (a) the Wiener process, whose increments follow a Gaussian distribution, and (b) the Poisson process, whose increments follow a Poisson distribution. We shall discuss these processes shortly.

Random Walk process

Consider an iid process {Z_n} taking the two values Z_n = 1 and Z_n = -1 with probability mass functions p_Z(1) = p and p_Z(-1) = q = 1 - p. Then the sum process {X_n} given by

X_n = Σ_{i=1}^{n} Z_i = X_{n-1} + Z_n, with X_0 = 0,

is called a Random Walk process. This is one of the most widely studied random processes. It is an independent increment process; this follows from the fact that X_n - X_{n-1} = Z_n and {Z_n} is an iid process. If we call Z_n = 1 a success and Z_n = -1 a failure, then X_n = Σ_{i=1}^{n} Z_i represents the difference between the number of successes and the number of failures in n independent trials. If p = 1/2, {X_n} is called a symmetric random walk process.

Probability mass function of the Random Walk process

At instant n, X_n can take integer values from -n to n. Suppose X_n = k. Clearly k = n_1 - n_{-1}, where n_1 is the number of successes and n_{-1} the number of failures in n trials, so that n_1 + n_{-1} = n. Hence

n_1 = (n + k)/2 and n_{-1} = (n - k)/2,

and both must be non-negative integers. Therefore

p_{X_n}(k) = C(n, (n+k)/2) p^{(n+k)/2} (1 - p)^{(n-k)/2} if (n+k)/2 and (n-k)/2 are non-negative integers, and 0 otherwise.

Mean, Variance and Covariance of a Random Walk process

Note that

E[Z_n] = 1·p + (-1)·(1 - p) = 2p - 1,
E[Z_n²] = 1·p + 1·(1 - p) = 1,

and

var(Z_n) = E[Z_n²] - (E[Z_n])² = 1 - 4p² + 4p - 1 = 4pq.

Therefore

E[X_n] = Σ_{i=1}^{n} E[Z_i] = n(2p - 1)

and, since the Z_i are independent random variables,

var(X_n) = Σ_{i=1}^{n} var(Z_i) = 4npq.

Since the random walk process {X_n} is an independent increment process, the autocovariance function is given by

C_X(n_1, n_2) = 4pq · min(n_1, n_2).

Three realizations of a random walk process are shown in the figure below.

Remark If the increments Z_n of the random walk take the values s and -s, then

E[X_n] = Σ_{i=1}^{n} E[Z_i] = n(2p - 1)s and var(X_n) = Σ_{i=1}^{n} var(Z_i) = 4npqs².
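The following sketch (Python/NumPy; the parameter values are illustrative and not from the notes) simulates many realizations of the random walk and checks the formulas E[X_n] = n(2p - 1) and var(X_n) = 4npq derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, n, n_paths = 0.6, 0.4, 50, 100_000

# Steps Z_i take the value +1 with probability p and -1 with probability q.
Z = np.where(rng.uniform(size=(n_paths, n)) < p, 1.0, -1.0)
X = Z.cumsum(axis=1)          # X_k = Z_1 + ... + Z_k for k = 1, ..., n

print(X[:, -1].mean(), n * (2 * p - 1))   # sample vs. theoretical mean of X_n
print(X[:, -1].var(),  4 * n * p * q)     # sample vs. theoretical variance of X_n
```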

(c) Markov process

A process {X(t)} is called a Markov process if, for any sequence of times t_1 < t_2 < ... < t_n,

P({X(t_n) ≤ x | X(t_1) = x_1, X(t_2) = x_2, ..., X(t_{n-1}) = x_{n-1}}) = P({X(t_n) ≤ x | X(t_{n-1}) = x_{n-1}}).

Thus for a Markov process the future of the process, given the present, is independent of the past. A discrete-state Markov process is called a Markov chain. If {X_n} is a discrete-time discrete-state random process, the process is Markov if

P({X_n = x_n | X_0 = x_0, X_1 = x_1, ..., X_{n-1} = x_{n-1}}) = P({X_n = x_n | X_{n-1} = x_{n-1}}).

An iid random process is a Markov process. Many practical signals with strong correlation between neighbouring samples are modelled as Markov processes.

Example Show that the random walk process {X_n} is Markov. Here,

P({X_n = x_n | X_0 = 0, X_1 = x_1, ..., X_{n-1} = x_{n-1}})
  = P({X_{n-1} + Z_n = x_n | X_0 = 0, X_1 = x_1, ..., X_{n-1} = x_{n-1}})
  = P({Z_n = x_n - x_{n-1}})
  = P({X_n = x_n | X_{n-1} = x_{n-1}}).

Wiener Process

Consider a symmetric random walk process {X_n} given by X_n = X(nΔ), where the discrete time instants are separated by Δ (see the figure below) and the step size is s. Assume Δ to be infinitesimally small and write t = nΔ.

Clearly,

E[X_n] = 0,
var(X_n) = 4pq n s² = 4·(1/2)·(1/2)·n s² = n s².

For large n, the distribution of X_n approaches the normal distribution with mean 0 and variance

n s² = (t/Δ) s² = αt, where α = s²/Δ.

As Δ → 0 and n → ∞, X_n becomes the continuous-time process X(t) with pdf

f_{X(t)}(x) = (1/√(2παt)) e^{-x²/(2αt)}.

This process {X(t)} is called the Wiener process.

A random process {X(t)} is called a Wiener process (or Brownian motion process) if it satisfies the following conditions:
(1) X(0) = 0.
(2) {X(t)} is an independent increment process.
(3) For each s ≥ 0 and t ≥ 0, the increment X(s + t) - X(s) has the normal distribution with mean 0 and variance αt, i.e.

f_{X(s+t) - X(s)}(x) = (1/√(2παt)) e^{-x²/(2αt)}.

The Wiener process was originally used to model Brownian motion: microscopic particles suspended in a fluid are subject to continuous molecular impacts, resulting in the zigzag motion of the particles, named Brownian motion after the botanist Robert Brown. The Wiener process can also be viewed as the integral of the white noise process.

A realization of the Wiener process is shown in the figure below:

The autocorrelation function of the Wiener process, assuming t_2 > t_1, is

R_X(t_1, t_2) = E[X(t_1)X(t_2)]
  = E[X(t_1){X(t_2) - X(t_1) + X(t_1)}]
  = E[X(t_1)] E[X(t_2) - X(t_1)] + E[X²(t_1)]
  = E[X²(t_1)] = αt_1.

Similarly, if t_1 > t_2, R_X(t_1, t_2) = αt_2. Hence

R_X(t_1, t_2) = α min(t_1, t_2).

Remark
The first-order pdf is f_{X(t)}(x) = (1/√(2παt)) e^{-x²/(2αt)}.
Since the mean is zero, C_X(t_1, t_2) = α min(t_1, t_2).
{X(t)} is a Gaussian process.
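Consistent with the construction above, a Wiener path can be approximated numerically by cumulatively summing independent zero-mean Gaussian increments of variance α·Δt. The sketch below (Python/NumPy; α, the time step and the horizon are arbitrary choices, not from the notes) also checks E[X²(T)] ≈ αT.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, dt, T, n_paths = 2.0, 0.001, 1.0, 20_000
n_steps = int(T / dt)

# Independent increments ~ N(0, alpha*dt); cumulative sum gives X(t) with X(0) = 0.
dX = rng.normal(0.0, np.sqrt(alpha * dt), size=(n_paths, n_steps))
X = dX.cumsum(axis=1)

print(X[:, -1].mean())            # ~ 0
print(X[:, -1].var(), alpha * T)  # sample variance of X(T) vs. theoretical alpha*T
```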

Poisson Process

Consider a random process representing the number of occurrences of an event up to time t (i.e. over the time interval (0, t]). Such a process is called a counting process and we denote it by {N(t), t ≥ 0}. Clearly {N(t), t ≥ 0} is a continuous-time, discrete-state process, and any of its realizations is a non-decreasing function of time.

The counting process {N(t), t ≥ 0} is called a Poisson process with rate parameter λ if
(i) N(0) = 0;
(ii) {N(t)} is an independent increment process, so that increments such as N(t_2) - N(t_1) and N(t_4) - N(t_3) over non-overlapping intervals are independent;
(iii) P({N(Δt) = 1}) = λΔt + o(Δt);
(iv) P({N(Δt) ≥ 2}) = o(Δt),

where o(Δt) denotes any function such that lim_{Δt→0} o(Δt)/Δt = 0.

These assumptions are valid for many applications. Some typical examples are: the number of alpha particles emitted by a radioactive substance; the number of binary packets received at a switching node of a communication network; the number of cars arriving at a petrol pump during a particular interval of time.

Writing ΔN(t) = N(t + Δt) - N(t),

P({N(t + Δt) = n}) = probability of n events up to time t + Δt
  = P({N(t) = n, ΔN(t) = 0}) + P({N(t) = n - 1, ΔN(t) = 1}) + P({N(t) < n - 1, ΔN(t) ≥ 2})
  = P({N(t) = n})(1 - λΔt - o(Δt)) + P({N(t) = n - 1})(λΔt + o(Δt)) + o(Δt).

Therefore

lim_{Δt→0} [P({N(t + Δt) = n}) - P({N(t) = n})]/Δt = -λP({N(t) = n}) + λP({N(t) = n - 1}),

that is,

(d/dt) P({N(t) = n}) = -λP({N(t) = n}) + λP({N(t) = n - 1}).   (1)

This is a first-order linear differential equation with initial condition P({N(0) = n}) = 0 for n ≥ 1 (and P({N(0) = 0}) = 1), and it can be solved recursively.

First consider P({N(t) = 0}). From (1),

(d/dt) P({N(t) = 0}) = -λP({N(t) = 0}) ⇒ P({N(t) = 0}) = e^{-λt}.

Next, to find P({N(t) = 1}),

(d/dt) P({N(t) = 1}) = -λP({N(t) = 1}) + λP({N(t) = 0}) = -λP({N(t) = 1}) + λe^{-λt},

with initial condition P({N(0) = 1}) = 0. Solving this first-order linear differential equation we get

P({N(t) = 1}) = λt e^{-λt}.

Continuing by mathematical induction we can show that

P({N(t) = n}) = (λt)^n e^{-λt} / n!.

Remark
(1) The parameter λ is called the rate or intensity of the Poisson process. It can be shown that

P({N(t_2) - N(t_1) = n}) = (λ(t_2 - t_1))^n e^{-λ(t_2 - t_1)} / n!.

Thus the probability of an increment depends only on the length of the interval t_2 - t_1 and not on the absolute times t_2 and t_1; the Poisson process is therefore a process with stationary increments.

(2) The independent and stationary increment properties help us compute the joint probability mass functions of N(t). For example,

P({N(t_1) = n_1, N(t_2) = n_2}) = P({N(t_1) = n_1}) P({N(t_2) = n_2} | {N(t_1) = n_1})
  = P({N(t_1) = n_1}) P({N(t_2) - N(t_1) = n_2 - n_1})
  = [(λt_1)^{n_1} e^{-λt_1} / n_1!] · [(λ(t_2 - t_1))^{n_2 - n_1} e^{-λ(t_2 - t_1)} / (n_2 - n_1)!].

Mean, Variance and Covariance of the Poisson process

We observe that at any time t > 0, N(t) is a Poisson random variable with parameter λt. Therefore

E[N(t)] = λt and var(N(t)) = λt.

Thus both the mean and the variance of a Poisson process grow linearly with time. Since N(t) is a random process with independent increments, we can readily show that

C_N(t_1, t_2) = var(N(min(t_1, t_2))) = λ min(t_1, t_2),
R_N(t_1, t_2) = C_N(t_1, t_2) + E[N(t_1)] E[N(t_2)] = λ min(t_1, t_2) + λ²t_1t_2.

A typical realization of a Poisson process is shown below.
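The Poisson counting process can be simulated either from its defining increments or (anticipating the next subsection) from exponential inter-arrival times. The sketch below (Python/NumPy; λ and t are arbitrary values chosen for illustration) uses exponential gaps and checks E[N(t)] = var(N(t)) = λt.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, t, n_paths = 3.0, 4.0, 50_000

counts = np.empty(n_paths)
for i in range(n_paths):
    # Event times are cumulative sums of Exp(lam) inter-arrival gaps.
    gaps = rng.exponential(1.0 / lam, size=int(3 * lam * t) + 50)
    counts[i] = np.searchsorted(gaps.cumsum(), t)   # N(t) = number of events in (0, t]

print(counts.mean(), lam * t)   # sample vs. theoretical mean  lambda*t
print(counts.var(),  lam * t)   # sample vs. theoretical variance  lambda*t
```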

Example: A petrol pump serves on the average 30 cars per hour. Find the probability that during a period of 5 minutes (i) no car comes to the station, (ii) exactly 3 cars come to the station and (iii) more than 3 cars come to the station.

The average arrival rate is λ = 30 cars/hour = 1/2 car per minute, so λt = 2.5 for t = 5 minutes.

(i) P{N(5) = 0} = e^{-2.5} ≈ 0.0821.

(ii) P{N(5) = 3} = (2.5)³ e^{-2.5} / 3! ≈ 0.2138.

(iii) P{N(5) > 3} = 1 - Σ_{n=0}^{3} (2.5)^n e^{-2.5}/n! ≈ 1 - 0.7576 = 0.2424.

Binomial model for comparison: with p = probability of a car arriving in a one-minute interval = 1/2 and n = 5, P(X = 0) = (1 - p)^5 = 0.5^5 ≈ 0.031.
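The petrol-pump numbers can be checked with scipy.stats.poisson (a standard SciPy function; this snippet is an addition, not part of the notes), using the Poisson parameter λt = 0.5 × 5 = 2.5.

```python
from scipy.stats import poisson

mu = 0.5 * 5          # lambda * t = 2.5 expected arrivals in 5 minutes

print(poisson.pmf(0, mu))    # (i)   P{N(5) = 0}  ~ 0.0821
print(poisson.pmf(3, mu))    # (ii)  P{N(5) = 3}  ~ 0.2138
print(poisson.sf(3, mu))     # (iii) P{N(5) > 3} = 1 - P{N(5) <= 3} ~ 0.2424
```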

Inter-arrival time and Waiting time of Poisson Process:

[Figure: event times t_1, t_2, ..., t_{n-1}, t_n on the time axis, with inter-arrival times T_1, T_2, ..., T_{n-1}, T_n between successive events]

Let T_n = time elapsed between the (n-1)-th event and the n-th event. The random variables {T_n, n = 1, 2, ...} are the inter-arrival times of the Poisson process. T_1 is the time elapsed before the first event takes place; it is a continuous random variable. Let us find the probability P({T_1 > t}):

P({T_1 > t}) = P({0 events up to time t}) = e^{-λt},
F_{T_1}(t) = 1 - P({T_1 > t}) = 1 - e^{-λt},
f_{T_1}(t) = λe^{-λt}, t ≥ 0.

Similarly,

P({T_n > t}) = P({0 events occur in (t_{n-1}, t_{n-1} + t] | the (n-1)-th event occurs at t_{n-1}})
  = P({0 events occur in (t_{n-1}, t_{n-1} + t]}) = e^{-λt},

so that

f_{T_n}(t) = λe^{-λt}, t ≥ 0, n ≥ 1.

Thus the inter-arrival times of a Poisson process with parameter λ are exponentially distributed with mean 1/λ.

Remark We have seen that the inter-arrival times are identically distributed with the exponential pdf. Further, the inter-arrival times are also independent. For example,

P({T_2 > t_2} | {T_1 = t_1}) = P({zero events occur in (t_1, t_1 + t_2]}) = e^{-λt_2},

which does not depend on t_1; hence T_2 is independent of T_1, and by the same argument the T_n are mutually independent.

It is interesting to note that the converse of the above result is also true: if the inter-arrival times between events of a discrete-state process {N(t), t ≥ 0} are independent and exponentially distributed with mean 1/λ, then {N(t), t ≥ 0} is a Poisson process with parameter λ.

The exponential distribution of the inter-arrival times indicates that the arrival process has no memory:

P({T_n > t_0 + t_1 | T_n > t_1}) = P({T_n > t_0}) for all t_0, t_1 ≥ 0.

Another important quantity is the waiting time W_n, the time that elapses before the n-th event occurs. Thus

W_n = Σ_{i=1}^{n} T_i.

Finding the first-order pdf of W_n is left as an exercise; note that W_n is the sum of n independent and identically distributed exponential random variables, so that

E[W_n] = Σ_{i=1}^{n} E[T_i] = n/λ and var(W_n) = Σ_{i=1}^{n} var(T_i) = n/λ².
Example The number of customers arriving at a service station is a Poisson process with a rate of 10 customers per minute. (a) What is the mean inter-arrival time of the customers? (b) What is the probability that the second customer will arrive 5 minutes after the first customer has arrived? (c) What is the average waiting time before the 10th customer arrives?

Semi-random Telegraph signal

Consider a two-state random process {X(t)} with the states X(t) = 1 and X(t) = -1. Suppose X(t) = 1 if the number of events of a Poisson process N(t) in the interval (0, t] is even, and X(t) = -1 if the number of events in (0, t] is odd. Such a process {X(t)} is called the semi-random telegraph signal because its initial value X(0) is always 1. Then

p_{X(t)}(1) = P({X(t) = 1})
  = P({N(t) = 0}) + P({N(t) = 2}) + ...
  = e^{-λt}(1 + (λt)²/2! + ...)
  = e^{-λt} cosh λt,

and

p_{X(t)}(-1) = P({X(t) = -1})
  = P({N(t) = 1}) + P({N(t) = 3}) + ...
  = e^{-λt}(λt/1! + (λt)³/3! + ...)
  = e^{-λt} sinh λt.

We can also find the conditional and joint probability mass functions. For example, for t_1 < t_2,

p_{X(t_1), X(t_2)}(1, 1) = P({X(t_1) = 1}) P({X(t_2) = 1} | {X(t_1) = 1})
  = e^{-λt_1} cosh λt_1 · P({N(t_2) is even} | {N(t_1) is even})
  = e^{-λt_1} cosh λt_1 · P({N(t_2) - N(t_1) is even})
  = e^{-λt_1} cosh λt_1 · e^{-λ(t_2 - t_1)} cosh λ(t_2 - t_1)
  = e^{-λt_2} cosh λt_1 cosh λ(t_2 - t_1).

Similarly,

p_{X(t_1), X(t_2)}(1, -1) = e^{-λt_2} cosh λt_1 sinh λ(t_2 - t_1),
p_{X(t_1), X(t_2)}(-1, 1) = e^{-λt_2} sinh λt_1 sinh λ(t_2 - t_1),
p_{X(t_1), X(t_2)}(-1, -1) = e^{-λt_2} sinh λt_1 cosh λ(t_2 - t_1).

Mean, autocorrelation and autocovariance of X(t):

E[X(t)] = 1·e^{-λt} cosh λt - 1·e^{-λt} sinh λt = e^{-λt}(cosh λt - sinh λt) = e^{-2λt},
E[X²(t)] = 1·e^{-λt} cosh λt + 1·e^{-λt} sinh λt = e^{-λt}(cosh λt + sinh λt) = e^{-λt}e^{λt} = 1,
var(X(t)) = 1 - e^{-4λt}.

For t_1 < t_2,

R_X(t_1, t_2) = E[X(t_1)X(t_2)]
  = p_{X(t_1), X(t_2)}(1, 1) - p_{X(t_1), X(t_2)}(1, -1) - p_{X(t_1), X(t_2)}(-1, 1) + p_{X(t_1), X(t_2)}(-1, -1)
  = e^{-λt_2} cosh λt_1 (cosh λ(t_2 - t_1) - sinh λ(t_2 - t_1)) + e^{-λt_2} sinh λt_1 (cosh λ(t_2 - t_1) - sinh λ(t_2 - t_1))
  = e^{-λt_2} e^{-λ(t_2 - t_1)} (cosh λt_1 + sinh λt_1)
  = e^{-λt_2} e^{-λ(t_2 - t_1)} e^{λt_1}
  = e^{-2λ(t_2 - t_1)}.

Similarly, for t_1 > t_2, R_X(t_1, t_2) = e^{-2λ(t_1 - t_2)}. Hence

R_X(t_1, t_2) = e^{-2λ|t_1 - t_2|}.

Random Telegraph signal

Consider a two-state random process {Y(t)} with states Y(t) = 1 and Y(t) = -1. Suppose P({Y(0) = 1}) = P({Y(0) = -1}) = 1/2, and Y(t) changes polarity with each occurrence of an event of a Poisson process of parameter λ. Such a random process {Y(t)} is called a random telegraph signal and can be expressed as

Y(t) = A X(t),

where {X(t)} is the semi-random telegraph signal and A is a random variable independent of X(t) with P({A = 1}) = P({A = -1}) = 1/2. Clearly,

E[A] = (-1)·(1/2) + 1·(1/2) = 0 and E[A²] = (-1)²·(1/2) + 1²·(1/2) = 1.

Therefore

E[Y(t)] = E[A X(t)] = E[A] E[X(t)] = 0 (A and X(t) are independent),
R_Y(t_1, t_2) = E[A X(t_1) A X(t_2)] = E[A²] E[X(t_1)X(t_2)] = e^{-2λ|t_1 - t_2|}.
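To illustrate R_Y(t_1, t_2) = e^{-2λ|t_1 - t_2|}, the sketch below (Python/NumPy; λ, the two time instants and the sample size are arbitrary choices) simulates the random telegraph signal at two instants by counting Poisson events in between and flipping a random initial sign.

```python
import numpy as np

rng = np.random.default_rng(6)
lam, t1, t2, n_paths = 1.5, 0.7, 1.2, 400_000

A = rng.choice([-1.0, 1.0], size=n_paths)            # random initial polarity
n_01 = rng.poisson(lam * t1, size=n_paths)           # events in (0, t1]
n_12 = rng.poisson(lam * (t2 - t1), size=n_paths)    # events in (t1, t2]

Y1 = A * (-1.0) ** n_01            # Y(t1): sign flips at each event
Y2 = Y1 * (-1.0) ** n_12           # Y(t2)

print(np.mean(Y1 * Y2))                  # Monte Carlo estimate of R_Y(t1, t2)
print(np.exp(-2 * lam * (t2 - t1)))      # theoretical e^{-2 lambda |t1 - t2|}
```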

Stationary Random Process


The concept of stationarity plays an important role in solving practical problems involving random processes. Just as time-invariance is an important characteristic of many deterministic systems, stationarity describes a certain time-invariance property of a random process. Stationarity also leads to a frequency-domain description of a random process.

Strict-sense Stationary Process

A random process {X(t)} is called strict-sense stationary (SSS) if its probability structure is invariant to a shift of the time origin. In terms of the joint distribution functions, {X(t)} is SSS if

F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = F_{X(t_1 + t_0), X(t_2 + t_0), ..., X(t_n + t_0)}(x_1, x_2, ..., x_n)

for all n ∈ N, all t_0, and all choices of sample instants t_1, t_2, ..., t_n.

Thus the joint distribution function of any set of random variables X(t_1), X(t_2), ..., X(t_n) does not depend on the placement of the origin of the time axis. This requirement is very strict, and less strict forms of stationarity may be defined. In particular, if

F_{X(t_1), ..., X(t_n)}(x_1, ..., x_n) = F_{X(t_1 + t_0), ..., X(t_n + t_0)}(x_1, ..., x_n) for n = 1, 2, ..., k,

then {X(t)} is called k-th order stationary.

If {X(t)} is stationary up to order 1,

F_{X(t_1)}(x_1) = F_{X(t_1 + t_0)}(x_1) for all t_0.

Choosing t_0 = -t_1 gives F_{X(t_1)}(x_1) = F_{X(0)}(x_1), which is independent of time. As a consequence,

E[X(t_1)] = E[X(0)] = μ_X(0) = constant.

If {X(t)} is stationary up to order 2,

F_{X(t_1), X(t_2)}(x_1, x_2) = F_{X(t_1 + t_0), X(t_2 + t_0)}(x_1, x_2).

Putting t_0 = -t_2,

F_{X(t_1), X(t_2)}(x_1, x_2) = F_{X(t_1 - t_2), X(0)}(x_1, x_2),

which implies that the second-order distribution depends only on the time lag t_1 - t_2. As a consequence, for such a process,

R_X(t_1, t_2) = E[X(t_1)X(t_2)] = ∫∫ x_1 x_2 f_{X(t_1 - t_2), X(0)}(x_1, x_2) dx_1 dx_2 = R_X(t_1 - t_2).

Similarly, C_X(t_1, t_2) = C_X(t_1 - t_2). Therefore the autocorrelation function of an SSS process depends only on the time lag t_1 - t_2.

We can also define joint stationarity of two random processes: two processes {X(t)} and {Y(t)} are called jointly strict-sense stationary if their joint probability distributions of any order are invariant under a translation of time. A complex process {Z(t) = X(t) + jY(t)} is called SSS if {X(t)} and {Y(t)} are jointly SSS.

Example An iid process is SSS, because for all n,

F_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = F_X(x_1) F_X(x_2) ... F_X(x_n) = F_{X(t_1 + t_0), X(t_2 + t_0), ..., X(t_n + t_0)}(x_1, x_2, ..., x_n).

Example The Poisson process {N(t), t ≥ 0} is not stationary, because E[N(t)] = λt varies with time.


Wide-sense stationary process

It is very difficult to test whether a process is SSS or not. A weaker form of stationarity, wide-sense stationarity, is extremely important from a practical point of view.

A random process {X(t)} is called a wide-sense stationary (WSS) process if

E[X(t)] = μ_X = constant, and
E[X(t_1)X(t_2)] = R_X(t_1 - t_2) is a function of the time lag t_1 - t_2 only

(equivalently, cov(X(t_1), X(t_2)) = C_X(t_1 - t_2) is a function of the time lag only).

Remark

(1) For a WSS process {X(t)},

E[X²(t)] = R_X(0) = constant,
var(X(t)) = E[X²(t)] - (E[X(t)])² = constant,
C_X(t_1, t_2) = E[X(t_1)X(t_2)] - E[X(t_1)]E[X(t_2)] = R_X(t_2 - t_1) - μ_X²,

so C_X(t_1, t_2) is a function of the lag (t_2 - t_1).

(2) An SSS process is always WSS, but the converse is not in general true.
Example: Sinusoid with random phase

Consider the random process {X(t)} given by X(t) = A cos(ω_0 t + Φ), where A and ω_0 are constants and Φ is uniformly distributed between 0 and 2π. This is the model of the carrier wave (a sinusoid of fixed frequency) used to analyse the noise performance of many receivers.

Note that

f_Φ(φ) = 1/(2π) for 0 ≤ φ ≤ 2π, and 0 otherwise.

By applying the rule for the transformation of a random variable, we get

f_{X(t)}(x) = 1/(π√(A² - x²)) for -A < x < A, and 0 otherwise,

which is independent of t. Hence {X(t)} is first-order stationary. Also,

E[X(t)] = E[A cos(ω_0 t + Φ)] = ∫_0^{2π} A cos(ω_0 t + φ) (1/(2π)) dφ = 0, a constant,

and

R_X(t_1, t_2) = E[X(t_1)X(t_2)] = E[A cos(ω_0 t_1 + Φ) A cos(ω_0 t_2 + Φ)]
  = (A²/2) E[cos(ω_0(t_1 + t_2) + 2Φ) + cos(ω_0(t_1 - t_2))]
  = (A²/2) cos(ω_0(t_1 - t_2)),

which is a function of the lag t_1 - t_2 only. Hence {X(t)} is wide-sense stationary.

Example: Sinusoid with random amplitude

Consider the random process {X(t)} given by X(t) = A cos(ω_0 t + φ), where φ and ω_0 are constants and A is a random variable. Here

E[X(t)] = E[A] cos(ω_0 t + φ),

which is independent of time only if E[A] = 0. Moreover,

R_X(t_1, t_2) = E[A cos(ω_0 t_1 + φ) A cos(ω_0 t_2 + φ)]
  = (1/2) E[A²][cos(ω_0(t_1 + t_2) + 2φ) + cos(ω_0(t_1 - t_2))],

which is not a function of (t_1 - t_2) only. Hence {X(t)} is not WSS.


Example: Random binary wave

Consider a binary random process {X(t)} consisting of a sequence of random pulses of duration T with the following features: the pulse amplitude A_k is a random variable with two values, p_{A_k}(1) = 1/2 and p_{A_k}(-1) = 1/2; pulse amplitudes in different pulse intervals are independent of each other; and the start time of the pulse sequence can be any value between 0 and T, so the random start delay D is uniformly distributed between 0 and T.

A realization of the random binary wave is shown in the figure above. Such waveforms are used in binary communication: a pulse of amplitude 1 is used to transmit a '1' and a pulse of amplitude -1 is used to transmit a '0'. The random process X(t) can be written as

X(t) = Σ_{n=-∞}^{∞} A_n rect((t - nT - D)/T).

For any t,

E[X(t)] = 1·(1/2) + (-1)·(1/2) = 0,
E[X²(t)] = 1²·(1/2) + (-1)²·(1/2) = 1.

Thus the mean and variance of the process are constants.

To find the autocorrelation function R_X(t_1, t_2), consider the case 0 < t_1 < t_2 = t_1 + τ < T with τ = t_2 - t_1 < T. Depending on the delay D, the points t_1 and t_2 may lie in the same pulse interval or in two different pulse intervals:

Case 1 (t_2 < D < T): t_1 and t_2 lie in the same pulse interval, so X(t_1)X(t_2) = 1.
Case 2 (0 < D < t_1): t_1 and t_2 again lie in the same pulse interval, so X(t_1)X(t_2) = (±1)(±1) = 1.
Case 3 (t_1 < D < t_2): t_1 and t_2 lie in different pulse intervals, so X(t_1)X(t_2) = A_m A_n with m ≠ n and E[X(t_1)X(t_2)] = 0.

Thus, conditioning on D,

R_X(t_1, t_2) = E[E[X(t_1)X(t_2) | D]]
  = 1·P(0 < D < t_1 or t_2 < D < T) + 0·P(t_1 < D < t_2)
  = 1 - (t_2 - t_1)/T.

We also have R_X(t_2, t_1) = E[X(t_2)X(t_1)] = R_X(t_1, t_2), so that

R_X(t_1, t_2) = 1 - |t_2 - t_1|/T for |t_2 - t_1| ≤ T.

For |t_2 - t_1| > T, t_1 and t_2 always lie in different pulse intervals and

E[X(t_1)X(t_2)] = E[X(t_1)]E[X(t_2)] = 0.

Thus the autocorrelation function of the random binary waveform depends only on τ = t_2 - t_1, and we can write

R_X(τ) = 1 - |τ|/T for |τ| ≤ T, and 0 otherwise.

The plot of R_X(τ) is a triangle of unit height over -T ≤ τ ≤ T, as sketched in the figure below; a simulation sketch follows it.

[Figure: plot of R_X(τ), a triangle of height 1 on -T ≤ τ ≤ T]
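A simulation of the random binary wave (Python/NumPy; T, the lag and the sample size are arbitrary choices) reproduces the triangular autocorrelation 1 - |τ|/T for |τ| ≤ T derived above. The wave is generated from iid ±1 amplitudes and a random start delay D ~ U[0, T].

```python
import numpy as np

rng = np.random.default_rng(7)
T, tau, n_paths = 1.0, 0.4, 500_000
t1 = 5.3                                   # arbitrary time instant
t2 = t1 + tau

amps = rng.choice([-1.0, 1.0], size=(n_paths, 12))  # iid +/-1 pulse amplitudes
D = rng.uniform(0.0, T, size=n_paths)               # random start delay of the pulse train

# X(t) equals the amplitude of the pulse interval containing t, i.e. index floor((t - D)/T).
x1 = amps[np.arange(n_paths), np.floor((t1 - D) / T).astype(int)]
x2 = amps[np.arange(n_paths), np.floor((t2 - D) / T).astype(int)]

print(np.mean(x1 * x2))          # Monte Carlo estimate of R_X(tau)
print(1 - abs(tau) / T)          # theoretical 1 - |tau|/T for |tau| <= T
```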

Example: Gaussian Random Process

Consider the Gaussian process {X(t)} discussed earlier. For any positive integer n, X(t_1), X(t_2), ..., X(t_n) are jointly Gaussian with joint density

f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) = (1/((2π)^{n/2}√det(C_X))) exp(-(1/2)(x - μ_X)' C_X^{-1} (x - μ_X)),

where C_X = E[(X - μ_X)(X - μ_X)'] and μ_X = E[X] = [E X(t_1), ..., E X(t_n)]'.

If {X(t)} is WSS, then

μ_X = [μ_X, μ_X, ..., μ_X]'

and

C_X = [ C_X(0)          C_X(t_2 - t_1)  ...  C_X(t_n - t_1)
        C_X(t_2 - t_1)  C_X(0)          ...  C_X(t_n - t_2)
        ...
        C_X(t_n - t_1)  C_X(t_n - t_2)  ...  C_X(0) ].

We see that f_{X(t_1), X(t_2), ..., X(t_n)}(x_1, x_2, ..., x_n) depends only on the time lags. Thus, for a Gaussian random process, WSS implies strict-sense stationarity, because the process is completely described by its mean and autocorrelation functions.

Properties of the Autocorrelation Function of a Real WSS Random Process

Autocorrelation of a deterministic signal

Consider a deterministic signal x(t) such that

0 < lim_{T→∞} (1/2T) ∫_{-T}^{T} x²(t) dt < ∞.

Such signals are called power signals. For a power signal x(t), the autocorrelation function is defined as

R_x(τ) = lim_{T→∞} (1/2T) ∫_{-T}^{T} x(t + τ) x(t) dt.

R_x(τ) measures the similarity between a signal and its time-shifted version. In particular, R_x(0) = lim_{T→∞} (1/2T) ∫_{-T}^{T} x²(t) dt is the mean-square value. If x(t) is a voltage waveform across a 1-ohm resistance, then R_x(0) is the average power delivered to the resistance. In this sense, R_x(0) represents the average power of the signal.

Example Suppose x(t) = A cos ω_0 t. The autocorrelation function of x(t) at lag τ is

R_x(τ) = lim_{T→∞} (1/2T) ∫_{-T}^{T} A cos(ω_0(t + τ)) A cos(ω_0 t) dt
       = lim_{T→∞} (A²/4T) ∫_{-T}^{T} [cos(ω_0(2t + τ)) + cos(ω_0 τ)] dt
       = (A²/2) cos(ω_0 τ).

We see that R_x(τ) of this periodic signal is also periodic, and its maxima occur at τ = 0, 2π/ω_0, 4π/ω_0, etc. The power of the signal is R_x(0) = A²/2.

The autocorrelation of a deterministic signal gives us insight into the properties of the autocorrelation function of a WSS process, which we discuss next.

Properties of the autocorrelation function of a WSS process

Consider a real WSS process {X(t)}. Since the autocorrelation function R_X(t_1, t_2) of such a process is a function of the lag τ = t_1 - t_2 only, we can define a one-parameter autocorrelation function

R_X(τ) = E[X(t + τ)X(t)].

If {X(t)} is a complex WSS process, then R_X(τ) = E[X(t + τ)X*(t)], where X*(t) is the complex conjugate of X(t). For a discrete random sequence, the autocorrelation sequence is defined similarly.

The autocorrelation function is an important function characterising a WSS random process, and it possesses some general properties which we briefly describe below.

1. R_X(0) = E[X²(t)] is the mean-square value of the process. If X(t) is a voltage signal applied across a 1-ohm resistance, then R_X(0) is the ensemble average power delivered to the resistance. Thus R_X(0) = E[X²(t)] ≥ 0.

2. For a real WSS process X(t), R_X(τ) is an even function of τ: R_X(-τ) = R_X(τ). Indeed,

R_X(-τ) = E[X(t - τ)X(t)] = E[X(t_1)X(t_1 + τ)] (substituting t_1 = t - τ) = R_X(τ).

Remark For a complex WSS process, R_X(-τ) = R_X*(τ).

3. |R_X(τ)| ≤ R_X(0). This follows from the Cauchy-Schwartz inequality

(E[X(t)X(t + τ)])² ≤ E[X²(t)] E[X²(t + τ)].

We have

R_X²(τ) = (E[X(t)X(t + τ)])² ≤ E[X²(t)] E[X²(t + τ)] = R_X(0) R_X(0),

so |R_X(τ)| ≤ R_X(0).

4. R_X(τ) is a positive semi-definite function, in the sense that for any positive integer n and any real numbers a_1, ..., a_n and time instants t_1, ..., t_n,

Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j R_X(t_i - t_j) ≥ 0.

Proof Define the random variable Y = Σ_{i=1}^{n} a_i X(t_i). Then

0 ≤ E[Y²] = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j E[X(t_i)X(t_j)] = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j R_X(t_i - t_j).

It can be shown that a sufficient condition for a function R_X(τ) to be the autocorrelation function of a real WSS process {X(t)} is that R_X(τ) be real, even and positive semi-definite.

5. If X(t) is mean-square periodic, then R_X(τ) is also periodic, with the same period.

Proof A real WSS random process {X(t)} is called mean-square periodic (MS periodic) with period T_p if for every t

E[(X(t + T_p) - X(t))²] = 0
⇒ E[X²(t + T_p)] + E[X²(t)] - 2E[X(t + T_p)X(t)] = 0
⇒ R_X(0) + R_X(0) - 2R_X(T_p) = 0
⇒ R_X(T_p) = R_X(0).

Again, using the Cauchy-Schwartz inequality,

(E[(X(t + τ + T_p) - X(t + τ))X(t)])² ≤ E[(X(t + τ + T_p) - X(t + τ))²] E[X²(t)]
⇒ (R_X(τ + T_p) - R_X(τ))² ≤ 2(R_X(0) - R_X(T_p)) R_X(0) = 0
⇒ R_X(τ + T_p) = R_X(τ).

For example, X(t) = A cos(ω_0 t + Φ), with A and ω_0 constants and Φ ~ U[0, 2π], is an MS periodic random process with period 2π/ω_0, and its autocorrelation function R_X(τ) = (A²/2) cos(ω_0 τ) is periodic with the same period 2π/ω_0.

The converse of this result is also true: if R_X(τ) is periodic with period T_p, then X(t) is MS periodic with period T_p. This property helps us determine the time period of an MS periodic random process.

6. Suppose X(t) = μ_X + V(t), where V(t) is a zero-mean WSS process with lim_{τ→∞} R_V(τ) = 0. Then

lim_{τ→∞} R_X(τ) = μ_X².

Interpretation of the autocorrelation function of a WSS process

The autocorrelation function R_X(τ) measures the correlation between the two random variables X(t) and X(t + τ). If R_X(τ) drops quickly with τ, then X(t) and X(t + τ) become weakly correlated even for small lags; such a signal changes rapidly with time and has significant high-frequency components. If R_X(τ) drops slowly, the signal samples are highly correlated and the signal has little high-frequency content. Later we shall see that R_X(τ) is directly related to the frequency-domain representation of a WSS process.

Cross-correlation function of jointly WSS processes

If {X(t)} and {Y(t)} are two real jointly WSS random processes, their cross-correlation functions are independent of t and depend only on the time lag. We can write the cross-correlation function as

R_XY(τ) = E[X(t + τ)Y(t)].

The cross-correlation function satisfies the following properties:

(i) R_XY(τ) = R_YX(-τ). This is because

R_XY(τ) = E[X(t + τ)Y(t)] = E[Y(t)X(t + τ)] = R_YX(-τ)

(a typical pair R_XY(τ), R_YX(-τ) is sketched in the figure).


(ii) |R_XY(τ)| ≤ √(R_X(0) R_Y(0)). We have

(R_XY(τ))² = (E[X(t + τ)Y(t)])² ≤ E[X²(t + τ)] E[Y²(t)] (Cauchy-Schwartz inequality) = R_X(0) R_Y(0),

so |R_XY(τ)| ≤ √(R_X(0) R_Y(0)). Further, since the geometric mean does not exceed the arithmetic mean,

|R_XY(τ)| ≤ √(R_X(0) R_Y(0)) ≤ (R_X(0) + R_Y(0))/2.

(iii) If X(t) and Y(t) are uncorrelated, then R_XY(τ) = E[X(t + τ)] E[Y(t)] = μ_X μ_Y.

(iv) If X(t) and Y(t) are orthogonal processes, then R_XY(τ) = E[X(t + τ)Y(t)] = 0.

Example Consider a random process Z(t) which is the sum of two real jointly WSS random processes X(t) and Y(t):

Z(t) = X(t) + Y(t),
R_Z(τ) = E[Z(t + τ)Z(t)] = E[(X(t + τ) + Y(t + τ))(X(t) + Y(t))] = R_X(τ) + R_Y(τ) + R_XY(τ) + R_YX(τ).

If X(t) and Y(t) are orthogonal processes, then R_XY(τ) = R_YX(τ) = 0 and R_Z(τ) = R_X(τ) + R_Y(τ).

Continuity and Differentiation of Random Processes

We know that the dynamic behaviour of a system is described by differential equations involving its input and output. For example, the behaviour of an RC network is described by a first-order linear differential equation with the source voltage as the input and the capacitor voltage as the output. What happens if the input voltage to the network is a random process? Each realization of the random process is a deterministic function to which the concepts of differentiation and integration can be applied. Our aim is to extend these concepts to the ensemble of realizations, i.e. to apply calculus to random processes.

We have discussed the convergence and the limit of a random sequence. The continuity of a random process can be defined with the help of convergence and limits of a random process; we can define continuity with probability 1, mean-square continuity, continuity in probability, and so on. We shall discuss mean-square continuity and the elementary concepts of the corresponding mean-square calculus.

Mean-square continuity of a random process

Recall that a sequence of random variables {X_n} converges to a random variable X in the mean-square sense if

lim_{n→∞} E[(X_n - X)²] = 0,

and we write l.i.m._{n→∞} X_n = X.

A random process {X(t)} is said to be continuous at a point t = t_0 in the mean-square sense if l.i.m._{t→t_0} X(t) = X(t_0), or equivalently

lim_{t→t_0} E[(X(t) - X(t_0))²] = 0.

Mean-square continuity and the autocorrelation function

(1) A random process {X(t)} is MS continuous at t_0 if its autocorrelation function R_X(t_1, t_2) is continuous at (t_0, t_0).

Proof:

E[(X(t) - X(t_0))²] = E[X²(t) - 2X(t)X(t_0) + X²(t_0)] = R_X(t, t) - 2R_X(t, t_0) + R_X(t_0, t_0).

If R_X(t_1, t_2) is continuous at (t_0, t_0), then

lim_{t→t_0} E[(X(t) - X(t_0))²] = lim_{t→t_0} [R_X(t, t) - 2R_X(t, t_0) + R_X(t_0, t_0)] = R_X(t_0, t_0) - 2R_X(t_0, t_0) + R_X(t_0, t_0) = 0.

(2) If {X(t)} is MS continuous at t_0, then its mean is continuous at t_0. This follows from the fact that

(E[X(t) - X(t_0)])² ≤ E[(X(t) - X(t_0))²],

so that

lim_{t→t_0} (E[X(t) - X(t_0)])² ≤ lim_{t→t_0} E[(X(t) - X(t_0))²] = 0,

i.e. E[X(t)] is continuous at t_0.
Example Consider the random binary wave {X(t)} discussed earlier. A typical realization of the process is a discontinuous function (see the figure below), yet the process has the autocorrelation function

R_X(τ) = 1 - |τ|/T_p for |τ| ≤ T_p, and 0 otherwise,

which is continuous at τ = 0 and hence at all τ. Therefore the process is MS continuous everywhere even though its realizations are discontinuous.

Example For a Wiener process {X(t)},

R_X(t_1, t_2) = α min(t_1, t_2), where α is a constant,

so R_X(t, t) = αt. Thus the autocorrelation function of a Wiener process is continuous everywhere, implying that a Wiener process is m.s. continuous everywhere. We can similarly show that the Poisson process is m.s. continuous everywhere.
Mean-square differentiability

The random process {X(t)} is said to have the mean-square derivative X'(t) at a point t provided (X(t + Δt) - X(t))/Δt approaches X'(t) in the mean-square sense as Δt → 0. In other words, the random process {X(t)} has an m.s. derivative X'(t) if

lim_{Δt→0} E[((X(t + Δt) - X(t))/Δt - X'(t))²] = 0.

Remark If all the sample functions of a random process X(t) are differentiable, then the above condition is satisfied and the m.s. derivative exists.

Example Consider the random-phase sinusoid {X(t)} given by X(t) = A cos(ω_0 t + Φ), where A and ω_0 are constants and Φ ~ U[0, 2π]. For each value of Φ, X(t) is differentiable. Therefore the m.s. derivative is

X'(t) = -Aω_0 sin(ω_0 t + Φ).
M.S. derivative and the autocorrelation function

The m.s. derivative of a random process X(t) at a point t exists if ∂²R_X(t_1, t_2)/∂t_1∂t_2 exists at the point (t, t). Applying the Cauchy criterion, the condition for the existence of the m.s. derivative is

lim_{Δt_1, Δt_2 → 0} E[((X(t + Δt_1) - X(t))/Δt_1 - (X(t + Δt_2) - X(t))/Δt_2)²] = 0.

Expanding the square and taking expectations gives

E[((X(t + Δt_1) - X(t))/Δt_1 - (X(t + Δt_2) - X(t))/Δt_2)²]
  = [R_X(t + Δt_1, t + Δt_1) + R_X(t, t) - 2R_X(t + Δt_1, t)]/Δt_1²
  + [R_X(t + Δt_2, t + Δt_2) + R_X(t, t) - 2R_X(t + Δt_2, t)]/Δt_2²
  - 2[R_X(t + Δt_1, t + Δt_2) - R_X(t + Δt_1, t) - R_X(t, t + Δt_2) + R_X(t, t)]/(Δt_1Δt_2).

Each of the terms in square brackets converges to ∂²R_X(t_1, t_2)/∂t_1∂t_2 at (t_1, t_2) = (t, t) if that second partial derivative exists, so that the limit becomes

∂²R_X/∂t_1∂t_2 + ∂²R_X/∂t_1∂t_2 - 2∂²R_X/∂t_1∂t_2 = 0.

Thus {X(t)} is m.s. differentiable at t if ∂²R_X(t_1, t_2)/∂t_1∂t_2 exists at (t, t).

In particular, if X(t) is WSS, R_X(t_1, t_2) = R_X(t_1 - t_2). Substituting τ = t_1 - t_2, we get

∂²R_X(t_1, t_2)/∂t_1∂t_2 = ∂/∂t_2 [dR_X(τ)/dτ · ∂(t_1 - t_2)/∂t_1] = -d²R_X(τ)/dτ².

Therefore, a WSS process X(t) is m.s. differentiable if R_X(τ) has a second derivative at τ = 0.
Example Consider a WSS process {X(t)} with autocorrelation function

R_X(τ) = exp(-a|τ|).

R_X(τ) does not have first and second derivatives at τ = 0, so {X(t)} is not mean-square differentiable.

Example The random binary wave {X(t)} has the autocorrelation function

R_X(τ) = 1 - |τ|/T_p for |τ| ≤ T_p, and 0 otherwise.

R_X(τ) does not have first and second derivatives at τ = 0. Therefore {X(t)} is not mean-square differentiable.

Example For a Wiener process {X(t)}, R_X(t_1, t_2) = α min(t_1, t_2), where α is a constant. Here

∂R_X(t_1, t_2)/∂t_1 = α for t_1 < t_2, 0 for t_1 > t_2, and does not exist at t_1 = t_2,

so ∂²R_X(t_1, t_2)/∂t_1∂t_2 does not exist on the line t_1 = t_2, in particular at (0, 0). Thus a Wiener process is nowhere m.s. differentiable.

Mean and autocorrelation of the derivative process

We have

E[X'(t)] = E[lim_{Δt→0} (X(t + Δt) - X(t))/Δt]
  = lim_{Δt→0} (E[X(t + Δt)] - E[X(t)])/Δt
  = lim_{Δt→0} (μ_X(t + Δt) - μ_X(t))/Δt
  = μ_X'(t).

For a WSS process, E[X'(t)] = μ_X'(t) = 0, since μ_X(t) = constant.

Similarly,

R_XX'(t_1, t_2) = E[X(t_1)X'(t_2)]
  = E[X(t_1) lim_{Δt_2→0} (X(t_2 + Δt_2) - X(t_2))/Δt_2]
  = lim_{Δt_2→0} [E[X(t_1)X(t_2 + Δt_2)] - E[X(t_1)X(t_2)]]/Δt_2
  = lim_{Δt_2→0} [R_X(t_1, t_2 + Δt_2) - R_X(t_1, t_2)]/Δt_2
  = ∂R_X(t_1, t_2)/∂t_2,

and we can similarly show that

R_X'X'(t_1, t_2) = E[X'(t_1)X'(t_2)] = ∂²R_X(t_1, t_2)/∂t_1∂t_2.

For a WSS process, with τ = t_1 - t_2,

E[X(t_1)X'(t_2)] = ∂R_X(t_1 - t_2)/∂t_2 = -dR_X(τ)/dτ,
E[X'(t_1)X'(t_2)] = -d²R_X(τ)/dτ²,

and, in particular, var(X'(t)) = E[X'²(t)] = -d²R_X(τ)/dτ² evaluated at τ = 0.
Mean-square Integral

Recall that the definite (Riemann) integral of a function x(t) over the interval [t_0, t] is defined as the limiting sum

∫_{t_0}^{t} x(τ) dτ = lim_{n→∞, Δτ_k→0} Σ_{k=0}^{n-1} x(τ_k) Δτ_k,

where t_0 < t_1 < ... < t_{n-1} < t_n = t is a partition of the interval [t_0, t], Δτ_k = t_{k+1} - t_k, and τ_k ∈ [t_k, t_{k+1}].

For a random process {X(t)}, the m.s. integral can be defined similarly as the process {Y(t)} given by

Y(t) = ∫_{t_0}^{t} X(τ) dτ = l.i.m._{n→∞, Δτ_k→0} Σ_{k=0}^{n-1} X(τ_k) Δτ_k.
Existence of the m.s. integral

It can be shown that a sufficient condition for the m.s. integral ∫_{t_0}^{t} X(τ) dτ to exist is that the double integral

∫_{t_0}^{t} ∫_{t_0}^{t} R_X(τ_1, τ_2) dτ_1 dτ_2

exists. If {X(t)} is m.s. continuous, then the above condition is satisfied and the process is m.s. integrable.

Mean and autocorrelation of the integral of a WSS process

We have

E[Y(t)] = E[∫_{t_0}^{t} X(τ) dτ] = ∫_{t_0}^{t} E[X(τ)] dτ = ∫_{t_0}^{t} μ_X dτ = μ_X(t - t_0).

Therefore, if μ_X ≠ 0, {Y(t)} is necessarily non-stationary. Also,

R_Y(t_1, t_2) = E[Y(t_1)Y(t_2)]
  = E[∫_{t_0}^{t_1} ∫_{t_0}^{t_2} X(τ_1)X(τ_2) dτ_1 dτ_2]
  = ∫_{t_0}^{t_1} ∫_{t_0}^{t_2} E[X(τ_1)X(τ_2)] dτ_1 dτ_2
  = ∫_{t_0}^{t_1} ∫_{t_0}^{t_2} R_X(τ_1 - τ_2) dτ_1 dτ_2,

which is a function of t_1 and t_2 (not of t_1 - t_2 alone). Thus the integral of a WSS process is always non-stationary.

[Figure: (a) a realization of a WSS process X(t); (b) the corresponding integral Y(t)]


Remark The nonstatinarity of the M.S. integral of a random process has physical importance the output of an integrator due to stationary noise rises unboundedly. Example The random binary wave { X ( t )} has the autocorrelation function

1 RX ( ) = Tp 0

Tp
otherwise

{ X ( t )} is mean-square integrable.

RX ( ) is continuous at = 0 implying that

{ X ( t )} is

M.S. continuous. Therefore,

Time averages and Ergodicity

Often we are interested in finding the various ensemble averages of a random process {X(t)} by means of the corresponding time averages determined from a single realization of the random process. For example, we can compute the time-mean of a single realization x(t) by the formula

<x> = lim_{T→∞} (1/2T) ∫_{-T}^{T} x(t) dt,

which is a constant for the selected realization and represents the dc value of x(t). Another important average used in electrical engineering is the rms value, given by

x_rms = [lim_{T→∞} (1/2T) ∫_{-T}^{T} x²(t) dt]^{1/2}.

Can <x> and x_rms represent μ_X and √(E[X²(t)]), respectively? To answer such questions we have to understand the various time averages and their properties.

Time averages of a random process
The time-average of a function $g(X(t))$ of a continuous-time random process $\{X(t)\}$ is defined by
$$ \langle g(X(t)) \rangle_T = \frac{1}{2T} \int_{-T}^{T} g(X(t))\,dt $$
where the integral is interpreted in the mean-square sense.

Similarly, the time-average of a function $g(X_n)$ of a discrete-time random process $\{X_n\}$ is defined by
$$ \langle g(X_n) \rangle_N = \frac{1}{2N+1} \sum_{i=-N}^{N} g(X_i) $$

The above definitions are in contrast to the corresponding ensemble averages defined by
$$ E g(X(t)) = \int_{-\infty}^{\infty} g(x)\, f_{X(t)}(x)\,dx \quad \text{(continuous case)} $$
$$ E g(X_n) = \sum_{i} g(x_i)\, p_{X_n}(x_i) \quad \text{(discrete case)} $$
The following time averages are of particular interest:

(a) Time-averaged mean
$$ \langle \mu_X \rangle_T = \frac{1}{2T} \int_{-T}^{T} X(t)\,dt \quad \text{(continuous case)} $$
$$ \langle \mu_X \rangle_N = \frac{1}{2N+1} \sum_{i=-N}^{N} X_i \quad \text{(discrete case)} $$

(b) Time-averaged autocorrelation function
$$ \langle R_X(\tau) \rangle_T = \frac{1}{2T} \int_{-T}^{T} X(t) X(t+\tau)\,dt \quad \text{(continuous case)} $$
$$ \langle R_X[m] \rangle_N = \frac{1}{2N+1} \sum_{i=-N}^{N} X_i X_{i+m} \quad \text{(discrete case)} $$

Note that $\langle g(X(t)) \rangle_T$ and $\langle g(X_n) \rangle_N$ are random variables governed by their respective probability distributions. However, determination of these distributions is difficult, so we shall characterize the behaviour of these averages in terms of their means and variances. We shall further assume that the random processes $\{X(t)\}$ and $\{X_n\}$ are WSS.
Mean and Variance of the time averages
Let us consider the simplest case, the time-averaged mean of a discrete-time WSS random process $\{X_n\}$ given by
$$ \langle \mu_X \rangle_N = \frac{1}{2N+1} \sum_{i=-N}^{N} X_i $$
The mean of $\langle \mu_X \rangle_N$ is
$$ E \langle \mu_X \rangle_N = E\left[ \frac{1}{2N+1} \sum_{i=-N}^{N} X_i \right] = \frac{1}{2N+1} \sum_{i=-N}^{N} E X_i = \mu_X $$
and the variance is
$$ E\left( \langle \mu_X \rangle_N - \mu_X \right)^2 = E\left[ \frac{1}{2N+1} \sum_{i=-N}^{N} (X_i - \mu_X) \right]^2 = \frac{1}{(2N+1)^2} \left[ \sum_{i=-N}^{N} E(X_i - \mu_X)^2 + \sum_{i=-N}^{N} \sum_{\substack{j=-N \\ j \neq i}}^{N} E(X_i - \mu_X)(X_j - \mu_X) \right] $$
If the samples $X_{-N}, X_{-N+1}, \ldots, X_{N-1}, X_N$ are uncorrelated,
$$ E\left( \langle \mu_X \rangle_N - \mu_X \right)^2 = \frac{1}{(2N+1)^2} \sum_{i=-N}^{N} E(X_i - \mu_X)^2 = \frac{\sigma_X^2}{2N+1} $$

We also observe that
$$ \lim_{N \to \infty} E\left( \langle \mu_X \rangle_N - \mu_X \right)^2 = 0 $$
From the above result, we conclude that $\langle \mu_X \rangle_N \xrightarrow{M.S.} \mu_X$.
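A small simulation sketch of this result (assuming, for illustration, an i.i.d. sequence with mean 1 and unit variance): the spread of the time-averaged mean around $\mu_X$ shrinks like $\sigma_X^2/(2N+1)$.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0

for N in (10, 100, 1000):
    # 4000 independent realizations of the length-(2N+1) time average
    X = mu + sigma * rng.standard_normal((4000, 2 * N + 1))
    m = X.mean(axis=1)                          # <mu_X>_N for each realization
    print(N, m.var(), sigma**2 / (2 * N + 1))   # empirical variance vs sigma^2/(2N+1)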

Let us consider the time-averaged mean for the continuous case. We have
$$ \langle \mu_X \rangle_T = \frac{1}{2T} \int_{-T}^{T} X(t)\,dt $$
$$ E \langle \mu_X \rangle_T = \frac{1}{2T} \int_{-T}^{T} E X(t)\,dt = \frac{1}{2T} \int_{-T}^{T} \mu_X\,dt = \mu_X $$
and the variance
$$ E\left( \langle \mu_X \rangle_T - \mu_X \right)^2 = E\left[ \frac{1}{2T} \int_{-T}^{T} (X(t) - \mu_X)\,dt \right]^2 = \frac{1}{4T^2} \int_{-T}^{T}\int_{-T}^{T} E(X(t_1) - \mu_X)(X(t_2) - \mu_X)\,dt_1\,dt_2 = \frac{1}{4T^2} \int_{-T}^{T}\int_{-T}^{T} C_X(t_1 - t_2)\,dt_1\,dt_2 $$
The above double integral is evaluated over the square region bounded by $t_1 = \pm T$ and $t_2 = \pm T$. We divide this square region into strips parallel to the line $t_1 - t_2 = 0$. Putting $\tau = t_1 - t_2$ and noting that the differential area between $t_1 - t_2 = \tau$ and $t_1 - t_2 = \tau + d\tau$ is $(2T - |\tau|)\,d\tau$, the double integral is converted to a single integral as follows:
$$ E\left( \langle \mu_X \rangle_T - \mu_X \right)^2 = \frac{1}{4T^2} \int_{-2T}^{2T} (2T - |\tau|)\,C_X(\tau)\,d\tau = \frac{1}{2T} \int_{-2T}^{2T} \left( 1 - \frac{|\tau|}{2T} \right) C_X(\tau)\,d\tau $$

Fig. The square region of integration in the $(t_1, t_2)$ plane, showing the strip between $t_1 - t_2 = \tau$ and $t_1 - t_2 = \tau + d\tau$.

Ergodicity Principle

If the time averages converge to the corresponding ensemble averages in the probabilistic sense, then a time average computed from a sufficiently long realization can be used as the value of the corresponding ensemble average. This is the ergodicity principle, discussed below.

Mean ergodic process
A WSS process $\{X(t)\}$ is said to be ergodic in mean if
$$ \langle \mu_X \rangle_T \xrightarrow{M.S.} \mu_X \quad \text{as } T \to \infty $$
Thus for a mean ergodic process $\{X(t)\}$,
$$ \lim_{T \to \infty} E \langle \mu_X \rangle_T = \mu_X \qquad \text{and} \qquad \lim_{T \to \infty} \operatorname{var}\left( \langle \mu_X \rangle_T \right) = 0 $$

We have earlier shown that $E \langle \mu_X \rangle_T = \mu_X$ and
$$ \operatorname{var}\left( \langle \mu_X \rangle_T \right) = \frac{1}{2T} \int_{-2T}^{2T} \left( 1 - \frac{|\tau|}{2T} \right) C_X(\tau)\,d\tau $$
Therefore, the condition for ergodicity in mean is
$$ \lim_{T \to \infty} \frac{1}{2T} \int_{-2T}^{2T} \left( 1 - \frac{|\tau|}{2T} \right) C_X(\tau)\,d\tau = 0 $$
If $C_X(\tau)$ decreases to 0 as $\tau \to \infty$, then the above condition is satisfied. Further,
$$ \frac{1}{2T} \left| \int_{-2T}^{2T} \left( 1 - \frac{|\tau|}{2T} \right) C_X(\tau)\,d\tau \right| \le \frac{1}{2T} \int_{-2T}^{2T} |C_X(\tau)|\,d\tau $$
Therefore, a sufficient condition for mean ergodicity is
$$ \int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau < \infty $$

Example Consider the random binary waveform $\{X(t)\}$ discussed in the earlier example. The process has the autocovariance function
$$ C_X(\tau) = \begin{cases} 1 - \dfrac{|\tau|}{T_p} & |\tau| \le T_p \\ 0 & \text{otherwise} \end{cases} $$
Here, for $2T > T_p$,
$$ \int_{-2T}^{2T} |C_X(\tau)|\,d\tau = 2 \int_{0}^{2T} |C_X(\tau)|\,d\tau = 2 \int_{0}^{T_p} \left( 1 - \frac{\tau}{T_p} \right) d\tau = 2\left( T_p - \frac{T_p}{2} \right) = T_p $$
so that
$$ \int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau = T_p < \infty $$

Hence $\{X(t)\}$ is mean ergodic.

Autocorrelation ergodicity
$$ \langle R_X(\tau) \rangle_T = \frac{1}{2T} \int_{-T}^{T} X(t) X(t+\tau)\,dt $$
If we consider $Z(t) = X(t) X(t+\tau)$, then $\mu_Z = E X(t) X(t+\tau) = R_X(\tau)$, and $\{X(t)\}$ will be autocorrelation ergodic if $\{Z(t)\}$ is mean ergodic. Thus $\{X(t)\}$ will be autocorrelation ergodic if
$$ \lim_{T \to \infty} \frac{1}{2T} \int_{-2T}^{2T} \left( 1 - \frac{|\tau_1|}{2T} \right) C_Z(\tau_1)\,d\tau_1 = 0 $$
where
$$ C_Z(\tau_1) = E Z(t) Z(t+\tau_1) - E Z(t)\, E Z(t+\tau_1) = E X(t) X(t+\tau) X(t+\tau_1) X(t+\tau+\tau_1) - R_X^2(\tau) $$
This involves a fourth-order moment of the process, which in general cannot be expressed in terms of second-order statistics; for a jointly Gaussian process the fourth-order moment factors into second-order moments and the condition for autocorrelation ergodicity can be evaluated.

Hence $\{X(t)\}$ will be autocorrelation ergodic if
$$ \lim_{T \to \infty} \frac{1}{2T} \int_{-2T}^{2T} \left( 1 - \frac{|\tau_1|}{2T} \right) \left( E Z(t) Z(t+\tau_1) - R_X^2(\tau) \right) d\tau_1 = 0 $$

Example Consider the random-phase sinusoid given by
$$ X(t) = A \cos(\omega_0 t + \phi) $$
where $A$ and $\omega_0$ are constants and $\phi \sim U[0, 2\pi]$ is a random variable. We have earlier proved that this process is WSS with $\mu_X = 0$ and
$$ R_X(\tau) = \frac{A^2}{2} \cos \omega_0 \tau $$
For any particular realization $x(t) = A \cos(\omega_0 t + \phi_1)$,
$$ \langle \mu_x \rangle_T = \frac{1}{2T} \int_{-T}^{T} A \cos(\omega_0 t + \phi_1)\,dt = \frac{A \sin(\omega_0 T) \cos \phi_1}{\omega_0 T} $$
and
$$ \langle R_x(\tau) \rangle_T = \frac{1}{2T} \int_{-T}^{T} A \cos(\omega_0 t + \phi_1)\, A \cos(\omega_0 (t+\tau) + \phi_1)\,dt = \frac{A^2}{4T} \int_{-T}^{T} \left[ \cos \omega_0 \tau + \cos(\omega_0 (2t+\tau) + 2\phi_1) \right] dt = \frac{A^2}{2} \cos \omega_0 \tau + \frac{A^2 \sin(2\omega_0 T) \cos(\omega_0 \tau + 2\phi_1)}{4 \omega_0 T} $$
We see that, as $T \to \infty$, $\langle \mu_x \rangle_T \to 0$ and $\langle R_x(\tau) \rangle_T \to \dfrac{A^2}{2} \cos \omega_0 \tau$. For each realization, both the time-averaged mean and the time-averaged autocorrelation function converge to the corresponding ensemble averages. Thus the random-phase sinusoid is ergodic in both mean and autocorrelation.
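A brief simulation sketch of this conclusion (with assumed illustrative values $A = 2$ and $\omega_0 = 2\pi$): for a single randomly drawn phase, the time averages approach the ensemble values $\mu_X = 0$ and $R_X(\tau) = (A^2/2)\cos\omega_0\tau$.

import numpy as np

rng = np.random.default_rng(3)
A, w0 = 2.0, 2.0 * np.pi                  # assumed amplitude and angular frequency
phi = rng.uniform(0.0, 2.0 * np.pi)       # one realization of the random phase

dt = 1e-3
t = np.arange(-500.0, 500.0, dt)          # a long observation window (T = 500)
x = A * np.cos(w0 * t + phi)

tau, lag = 0.1, int(round(0.1 / dt))
time_mean = x.mean()
time_acf = np.mean(x[:-lag] * x[lag:])    # time-averaged autocorrelation at lag tau

print(time_mean, 0.0)                               # ~ ensemble mean
print(time_acf, 0.5 * A**2 * np.cos(w0 * tau))      # ~ ensemble autocorrelation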

Remark A random process $\{X(t)\}$ is ergodic if its time averages converge in the M.S. sense to the corresponding ensemble averages. This is a stronger requirement than stationarity: the ensemble averages of all orders of such a process are independent of time. This implies that an ergodic process is necessarily stationary in the strict sense. The converse is not true: there are strictly stationary random processes that are not ergodic.

The following figure shows a hierarchical classification of random processes.

Fig. Hierarchical classification of random processes: ergodic processes form a subclass of WSS processes, which in turn form a subclass of all random processes.

Example Suppose $X(t) = C$ where $C \sim U[0, a]$. $\{X(t)\}$ is a family of horizontal straight lines, as illustrated in the Fig. below (realizations such as $X(t) = 0,\ \frac{a}{4},\ \frac{a}{2},\ \frac{3a}{4},\ a$).

Here $\mu_X = E C = \dfrac{a}{2}$, whereas
$$ \langle \mu_X \rangle_T = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} C\,dt = C $$
is a different constant for different realizations. Hence $\{X(t)\}$ is not mean ergodic.

Spectral Representation of a Wide-sense Stationary Random Process

Recall that the Fourier transform (FT) of a real signal $g(t)$ is given by
$$ G(\omega) = FT(g(t)) = \int_{-\infty}^{\infty} g(t) e^{-j\omega t}\,dt $$
where $e^{j\omega t} = \cos \omega t + j \sin \omega t$ is the complex exponential. The Fourier transform $G(\omega)$ exists if $g(t)$ satisfies the following Dirichlet conditions:
1) $g(t)$ is absolutely integrable, that is, $\int_{-\infty}^{\infty} |g(t)|\,dt < \infty$
2) $g(t)$ has only a finite number of discontinuities in any finite interval
3) $g(t)$ has only a finite number of maxima and minima within any finite interval.
The signal $g(t)$ can be obtained from $G(\omega)$ by the inverse Fourier transform (IFT) as follows:
$$ g(t) = IFT(G(\omega)) = \frac{1}{2\pi} \int_{-\infty}^{\infty} G(\omega) e^{j\omega t}\,d\omega $$
The existence of the inverse Fourier transform implies that we can represent a function $g(t)$ as a superposition of a continuum of complex sinusoids. The Fourier transform $G(\omega)$ is the strength of the sinusoid of frequency $\omega$ present in the signal. If $g(t)$ is a voltage signal measured in volts, $G(\omega)$ has the unit of volt/(radian/second). The function $G(\omega)$ is also called the spectrum of $g(t)$.
We can also define the Fourier transform in terms of the frequency variable $f = \dfrac{\omega}{2\pi}$. In this case, the Fourier transform and the inverse Fourier transform are given by
$$ G(f) = \int_{-\infty}^{\infty} g(t) e^{-j2\pi f t}\,dt $$
and
$$ g(t) = \int_{-\infty}^{\infty} G(f) e^{j2\pi f t}\,df $$
The Fourier transform is a linear transform and has many interesting properties. In particular, the energy of the signal $g(t)$ is related to its spectrum by Parseval's theorem:
$$ \int_{-\infty}^{\infty} g^2(t)\,dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} |G(\omega)|^2\,d\omega $$
How can we obtain a frequency-domain representation of a random process, particularly a WSS process? The answer is the spectral representation of the WSS random process. Wiener (1930) and Khinchin (1934) discovered it independently; Einstein (1914) had also used the concept.

Difficulty in Fourier Representation of a Random Process
We cannot define the Fourier transform of a WSS process $X(t)$ by the mean-square integral
$$ FT(X(t)) = \int_{-\infty}^{\infty} X(t) e^{-j\omega t}\,dt $$
The existence of the above integral would imply the existence of the Fourier transform of every realization of $X(t)$. But the very notion of stationarity demands that the realizations do not decay with time, so the first Dirichlet condition is violated. This difficulty is avoided by a frequency-domain representation of $X(t)$ in terms of the power spectral density (PSD). Recall that the power of a WSS process $X(t)$ is a constant given by $E X^2(t)$. The PSD describes how this power is distributed over frequency.

Definition of the Power Spectral Density of a WSS Process
Let us define
$$ X_T(t) = \begin{cases} X(t) & -T < t < T \\ 0 & \text{otherwise} \end{cases} = X(t)\,\mathrm{rect}\!\left( \frac{t}{2T} \right) $$
where $\mathrm{rect}\!\left( \frac{t}{2T} \right)$ is the unity-amplitude rectangular pulse of width $2T$ centred at the origin. As $T \to \infty$, $X_T(t)$ represents the random process $X(t)$. Define the mean-square integral
$$ FTX_T(\omega) = \int_{-T}^{T} X_T(t) e^{-j\omega t}\,dt $$
Applying Parseval's theorem we find the energy of the signal:
$$ \int_{-T}^{T} X_T^2(t)\,dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} |FTX_T(\omega)|^2\,d\omega $$
Therefore, the power associated with $X_T(t)$ is
$$ \frac{1}{2T} \int_{-T}^{T} X_T^2(t)\,dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{|FTX_T(\omega)|^2}{2T}\,d\omega $$
and the average power is given by
$$ E\left[ \frac{1}{2T} \int_{-T}^{T} X_T^2(t)\,dt \right] = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{E|FTX_T(\omega)|^2}{2T}\,d\omega $$
where $\dfrac{E|FTX_T(\omega)|^2}{2T}$ is the contribution to the average power at frequency $\omega$ and represents the power spectral density of $X_T(t)$. As $T \to \infty$, the left-hand side of the above expression represents the average power of $X(t)$. Therefore, the PSD $S_X(\omega)$ of the process $X(t)$ is defined in the limiting sense by
$$ S_X(\omega) = \lim_{T \to \infty} \frac{E|FTX_T(\omega)|^2}{2T} $$
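This limiting definition suggests a simple numerical estimator (a sketch only, using an assumed smoothed-noise sequence as a stand-in for a WSS process): the periodogram $|FTX_T(\omega)|^2/2T$ of a single realization fluctuates from realization to realization, but its average over many realizations is a stable estimate of the PSD at that frequency.

import numpy as np

rng = np.random.default_rng(4)
dt, n = 0.01, 8000                              # sampling step and samples per window; 2T = n*dt
t = np.arange(n) * dt
w = 2.0 * np.pi * 1.0                           # inspect one frequency (1 Hz), an arbitrary choice

def smoothed_noise():
    # a stand-in WSS realization: white noise smoothed by a short moving average
    return np.convolve(rng.standard_normal(n), np.ones(50) / 50.0, mode='same')

vals = []
for _ in range(500):                            # many independent realizations
    x = smoothed_noise()
    ft = np.sum(x * np.exp(-1j * w * t)) * dt   # Riemann-sum approximation of FT X_T(w)
    vals.append(np.abs(ft) ** 2 / (n * dt))     # |FT X_T(w)|^2 / 2T

vals = np.array(vals)
print(vals[0], vals[1], vals.mean())            # single periodograms fluctuate; their average estimates S_X(w)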

Relation between the autocorrelation function and the PSD: the Wiener-Khinchin-Einstein theorem
We have
$$ \frac{E|FTX_T(\omega)|^2}{2T} = \frac{E\, FTX_T(\omega)\, FTX_T^*(\omega)}{2T} = \frac{1}{2T} \int_{-T}^{T}\int_{-T}^{T} E X_T(t_1) X_T(t_2)\, e^{-j\omega t_1} e^{j\omega t_2}\,dt_1\,dt_2 = \frac{1}{2T} \int_{-T}^{T}\int_{-T}^{T} R_X(t_1 - t_2)\, e^{-j\omega (t_1 - t_2)}\,dt_1\,dt_2 $$

Fig. The square region of integration in the $(t_1, t_2)$ plane, showing the strip between $t_1 - t_2 = \tau$ and $t_1 - t_2 = \tau + d\tau$.

Note that the above integral is performed over the square region bounded by $t_1 = \pm T$ and $t_2 = \pm T$. Substitute $\tau = t_1 - t_2$, so that $t_2 = t_1 - \tau$ describes a family of straight lines parallel to $t_1 - t_2 = 0$. The differential area between the lines $t_1 - t_2 = \tau$ and $t_1 - t_2 = \tau + d\tau$ (the shaded strip in the figure) equals $(2T - |\tau|)\,d\tau$. The double integral is now replaced by a single integral in $\tau$:
$$ \frac{E\, FTX_T(\omega)\, FTX_T^*(\omega)}{2T} = \frac{1}{2T} \int_{-2T}^{2T} R_X(\tau) e^{-j\omega\tau} (2T - |\tau|)\,d\tau = \int_{-2T}^{2T} R_X(\tau) e^{-j\omega\tau} \left( 1 - \frac{|\tau|}{2T} \right) d\tau $$
If $R_X(\tau)$ is absolutely integrable, then the right-hand integral converges to $\int_{-\infty}^{\infty} R_X(\tau) e^{-j\omega\tau}\,d\tau$ as $T \to \infty$, so that
$$ \lim_{T \to \infty} \frac{E|FTX_T(\omega)|^2}{2T} = \int_{-\infty}^{\infty} R_X(\tau) e^{-j\omega\tau}\,d\tau $$

As we have noted earlier, $S_X(\omega) = \lim_{T \to \infty} \dfrac{E|FTX_T(\omega)|^2}{2T}$ is the contribution to the average power at frequency $\omega$ and is called the power spectral density of $X(t)$. Thus
$$ S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau) e^{-j\omega\tau}\,d\tau $$
and, using the inverse Fourier transform,
$$ R_X(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_X(\omega) e^{j\omega\tau}\,d\omega $$

Example 1 The autocorrelation function of a WSS process $X(t)$ is given by
$$ R_X(\tau) = a^2 e^{-b|\tau|}, \qquad b > 0 $$
Find the power spectral density of the process.
$$ S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau) e^{-j\omega\tau}\,d\tau = a^2 \int_{-\infty}^{0} e^{b\tau} e^{-j\omega\tau}\,d\tau + a^2 \int_{0}^{\infty} e^{-b\tau} e^{-j\omega\tau}\,d\tau = \frac{a^2}{b - j\omega} + \frac{a^2}{b + j\omega} = \frac{2a^2 b}{b^2 + \omega^2} $$

The autocorrelation function and the PSD are shown in Fig.
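A quick numerical check of this transform pair (a sketch with assumed illustrative values $a = 1.5$, $b = 2$; the integral is evaluated using the evenness of the integrand):

import numpy as np
from scipy.integrate import quad

a, b = 1.5, 2.0                                  # assumed illustrative values
for w in (0.0, 1.0, 3.0):
    # S_X(w) = 2 * integral_0^inf a^2 e^{-b tau} cos(w tau) d tau (real, even integrand)
    val, _ = quad(lambda tau: 2 * a**2 * np.exp(-b*tau) * np.cos(w*tau), 0, np.inf)
    print(w, val, 2 * a**2 * b / (b**2 + w**2))  # numerical value vs closed form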

Example 2 Suppose $X(t) = A + B \sin(\omega_c t + \phi)$ where $A$ is a constant bias and $\phi \sim U[0, 2\pi]$. Find $R_X(\tau)$ and $S_X(\omega)$.
$$ R_X(\tau) = E X(t+\tau) X(t) = E\left( A + B \sin(\omega_c (t+\tau) + \phi) \right)\left( A + B \sin(\omega_c t + \phi) \right) = A^2 + \frac{B^2}{2} \cos \omega_c \tau $$
$$ S_X(\omega) = 2\pi A^2 \delta(\omega) + \frac{\pi B^2}{2} \left( \delta(\omega + \omega_c) + \delta(\omega - \omega_c) \right) $$
where $\delta(\omega)$ is the Dirac delta function.

Fig. $S_X(\omega)$: impulses at $\omega = 0$ and at $\omega = \pm\omega_c$.

Example 3 PSD of the amplitude-modulated random-phase sinusoid
$$ X(t) = M(t) \cos(\omega_c t + \phi), \qquad \phi \sim U(0, 2\pi) $$
where $M(t)$ is a WSS process independent of $\phi$.
$$ R_X(\tau) = E\left[ M(t+\tau) \cos(\omega_c (t+\tau) + \phi)\, M(t) \cos(\omega_c t + \phi) \right] = E\left[ M(t+\tau) M(t) \right] E\left[ \cos(\omega_c (t+\tau) + \phi) \cos(\omega_c t + \phi) \right] $$
(using the independence of $M(t)$ and the random-phase sinusoid)
$$ = R_M(\tau)\, \frac{1}{2} \cos \omega_c \tau $$
$$ S_X(\omega) = \frac{1}{4} \left( S_M(\omega + \omega_c) + S_M(\omega - \omega_c) \right) $$
where $S_M(\omega)$ is the PSD of $M(t)$.

Fig. The baseband PSD $S_M(\omega)$ and the modulated PSD $S_X(\omega)$, which consists of copies of $S_M(\omega)$ shifted to $\pm\omega_c$.

Example 4 The PSD of a noise process is given by
$$ S_N(\omega) = \begin{cases} \dfrac{N_0}{2} & |\omega| \le \omega_c \\ 0 & \text{otherwise} \end{cases} $$
Find the autocorrelation of the process.
$$ R_N(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_N(\omega) e^{j\omega\tau}\,d\omega = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} \frac{N_0}{2} \cos \omega\tau\,d\omega = \frac{N_0}{2\pi} \frac{\sin \omega_c \tau}{\tau} $$
Properties of the PSD

$S_X(\omega)$, being the Fourier transform of $R_X(\tau)$, shares the properties of the Fourier transform. Here we discuss important properties of $S_X(\omega)$.

The average power of a random process $X(t)$ is
$$ E X^2(t) = R_X(0) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_X(\omega)\,d\omega $$
The average power in the frequency band $[\omega_1, \omega_2]$ (counting the contributions at both positive and negative frequencies) is
$$ \frac{2}{2\pi} \int_{\omega_1}^{\omega_2} S_X(\omega)\,d\omega $$

If $\{X(t)\}$ is real, $R_X(\tau)$ is a real and even function of $\tau$. Therefore,
$$ S_X(\omega) = \int_{-\infty}^{\infty} R_X(\tau) e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_X(\tau) (\cos \omega\tau - j \sin \omega\tau)\,d\tau = \int_{-\infty}^{\infty} R_X(\tau) \cos \omega\tau\,d\tau = 2 \int_{0}^{\infty} R_X(\tau) \cos \omega\tau\,d\tau $$
Thus $S_X(\omega)$ is a real and even function of $\omega$.

From the definition, $S_X(\omega) = \lim_{T \to \infty} \dfrac{E|FTX_T(\omega)|^2}{2T}$ is always non-negative. Thus $S_X(\omega) \ge 0$.

If $\{X(t)\}$ has a periodic component, $R_X(\tau)$ is periodic and $S_X(\omega)$ will have impulses.
Remark
1) The function $S_X(\omega)$ is the PSD of a WSS process $\{X(t)\}$ if and only if $S_X(\omega)$ is a non-negative, real and even function of $\omega$ and $\int_{-\infty}^{\infty} S_X(\omega)\,d\omega < \infty$.
2) The above condition on $S_X(\omega)$ also ensures that the corresponding autocorrelation function $R_X(\tau)$ is non-negative definite. Thus the non-negative definiteness of an autocorrelation function can be tested through its power spectrum.
3) Recall that a periodic function has a Fourier series expansion. If $\{X(t)\}$ is M.S. periodic, we can have an equivalent Fourier series expansion of $\{X(t)\}$.

Cross power spectral density

Consider a random process $Z(t)$ which is the sum of two real jointly WSS random processes $X(t)$ and $Y(t)$. As we have seen earlier,
$$ R_Z(\tau) = R_X(\tau) + R_Y(\tau) + R_{XY}(\tau) + R_{YX}(\tau) $$
If we take the Fourier transform of both sides,
$$ S_Z(\omega) = S_X(\omega) + S_Y(\omega) + FT(R_{XY}(\tau)) + FT(R_{YX}(\tau)) $$
where $FT(\cdot)$ stands for the Fourier transform. Thus we see that $S_Z(\omega)$ includes contributions from the Fourier transforms of the cross-correlation functions $R_{XY}(\tau)$ and $R_{YX}(\tau)$. These Fourier transforms represent cross power spectral densities.

Definition of Cross Power Spectral Density
Given two real jointly WSS random processes $X(t)$ and $Y(t)$, the cross power spectral density (CPSD) $S_{XY}(\omega)$ is defined as
$$ S_{XY}(\omega) = \lim_{T \to \infty} \frac{E\, FTX_T(\omega)\, FTY_T^*(\omega)}{2T} $$
where $FTX_T(\omega)$ and $FTY_T(\omega)$ are the Fourier transforms of the truncated processes $X_T(t) = X(t)\,\mathrm{rect}\!\left( \frac{t}{2T} \right)$ and $Y_T(t) = Y(t)\,\mathrm{rect}\!\left( \frac{t}{2T} \right)$ respectively, and $*$ denotes the complex conjugate operation. We can similarly define $S_{YX}(\omega)$ by
$$ S_{YX}(\omega) = \lim_{T \to \infty} \frac{E\, FTY_T(\omega)\, FTX_T^*(\omega)}{2T} $$

Proceeding in the same way as in the derivation of the Wiener-Khinchin-Einstein theorem for the WSS process, it can be shown that
$$ S_{XY}(\omega) = \int_{-\infty}^{\infty} R_{XY}(\tau) e^{-j\omega\tau}\,d\tau \qquad \text{and} \qquad S_{YX}(\omega) = \int_{-\infty}^{\infty} R_{YX}(\tau) e^{-j\omega\tau}\,d\tau $$
The cross-correlation function and the cross power spectral density thus form a Fourier transform pair and we can write
$$ R_{XY}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{XY}(\omega) e^{j\omega\tau}\,d\omega \qquad \text{and} \qquad R_{YX}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{YX}(\omega) e^{j\omega\tau}\,d\omega $$
Properties of the CPSD
The CPSD is a complex function of the frequency $\omega$. Some properties of the CPSD of two jointly WSS processes $X(t)$ and $Y(t)$ are listed below:

(1) $S_{XY}(\omega) = S_{YX}^*(\omega)$
Note that $R_{XY}(\tau) = R_{YX}(-\tau)$. Therefore,
$$ S_{XY}(\omega) = \int_{-\infty}^{\infty} R_{XY}(\tau) e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_{YX}(-\tau) e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_{YX}(\tau') e^{j\omega\tau'}\,d\tau' = S_{YX}^*(\omega) $$

(2) $\operatorname{Re}(S_{XY}(\omega))$ is an even function of $\omega$ and $\operatorname{Im}(S_{XY}(\omega))$ is an odd function of $\omega$.
We have
$$ S_{XY}(\omega) = \int_{-\infty}^{\infty} R_{XY}(\tau) (\cos \omega\tau - j \sin \omega\tau)\,d\tau = \operatorname{Re}(S_{XY}(\omega)) + j \operatorname{Im}(S_{XY}(\omega)) $$
where $\operatorname{Re}(S_{XY}(\omega)) = \int_{-\infty}^{\infty} R_{XY}(\tau) \cos \omega\tau\,d\tau$ is an even function of $\omega$ and $\operatorname{Im}(S_{XY}(\omega)) = -\int_{-\infty}^{\infty} R_{XY}(\tau) \sin \omega\tau\,d\tau$ is an odd function of $\omega$.

(3) If $X(t)$ and $Y(t)$ are uncorrelated and have constant means, then
$$ S_{XY}(\omega) = S_{YX}(\omega) = 2\pi \mu_X \mu_Y\, \delta(\omega) $$
Observe that
$$ R_{XY}(\tau) = E X(t+\tau) Y(t) = E X(t+\tau)\, E Y(t) = \mu_X \mu_Y = R_{YX}(\tau) $$
and the Fourier transform of this constant gives the impulse at $\omega = 0$.

(4) If $X(t)$ and $Y(t)$ are orthogonal, then
$$ S_{XY}(\omega) = S_{YX}(\omega) = 0 $$
Indeed, if $X(t)$ and $Y(t)$ are orthogonal,
$$ R_{XY}(\tau) = E X(t+\tau) Y(t) = 0 = R_{YX}(\tau) \implies S_{XY}(\omega) = S_{YX}(\omega) = 0 $$

(5) The cross power $P_{XY}$ between $X(t)$ and $Y(t)$ is defined by
$$ P_{XY} = \lim_{T \to \infty} \frac{1}{2T} E \int_{-T}^{T} X(t) Y(t)\,dt $$
Applying Parseval's theorem, we get
$$ P_{XY} = \lim_{T \to \infty} \frac{1}{2T} E \int_{-\infty}^{\infty} X_T(t) Y_T(t)\,dt = \lim_{T \to \infty} \frac{1}{2T} \cdot \frac{1}{2\pi} E \int_{-\infty}^{\infty} FTX_T(\omega)\, FTY_T^*(\omega)\,d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} \lim_{T \to \infty} \frac{E\, FTX_T(\omega)\, FTY_T^*(\omega)}{2T}\,d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{XY}(\omega)\,d\omega $$
Similarly,
$$ P_{YX} = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{YX}(\omega)\,d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{XY}^*(\omega)\,d\omega = P_{XY}^* $$

Example Consider the random process $Z(t) = X(t) + Y(t)$ discussed at the beginning of the lecture, where $Z(t)$ is the sum of two jointly WSS random processes $X(t)$ and $Y(t)$. We have
$$ R_Z(\tau) = R_X(\tau) + R_Y(\tau) + R_{XY}(\tau) + R_{YX}(\tau) $$
Taking the Fourier transform of both sides,
$$ S_Z(\omega) = S_X(\omega) + S_Y(\omega) + S_{XY}(\omega) + S_{YX}(\omega) $$
and integrating,
$$ \frac{1}{2\pi} \int_{-\infty}^{\infty} S_Z(\omega)\,d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_X(\omega)\,d\omega + \frac{1}{2\pi} \int_{-\infty}^{\infty} S_Y(\omega)\,d\omega + \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{XY}(\omega)\,d\omega + \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{YX}(\omega)\,d\omega $$
Therefore,
$$ P_Z = P_X + P_Y + P_{XY} + P_{YX} $$

Remark $P_{XY} + P_{YX}$ is the additional power contributed by the interaction of $X(t)$ and $Y(t)$ to the total power of $X(t) + Y(t)$. If $X(t)$ and $Y(t)$ are orthogonal, then
$$ S_Z(\omega) = S_X(\omega) + S_Y(\omega) + 0 + 0 = S_X(\omega) + S_Y(\omega) $$
and consequently
$$ P_Z = P_X + P_Y $$
Thus, in the case of two jointly WSS orthogonal processes, the power of the sum of the processes is equal to the sum of the respective powers.

Power spectral density of a discrete-time WSS random process

Suppose $g[n]$ is a discrete-time real signal. Assume $g[n]$ to be obtained by sampling a continuous-time signal $g(t)$ at a uniform interval $T$ such that
$$ g[n] = g(nT), \qquad n = 0, \pm 1, \pm 2, \ldots $$
The discrete-time Fourier transform (DTFT) of the signal $g[n]$ is defined by
$$ G(\omega) = \sum_{n=-\infty}^{\infty} g[n] e^{-j\omega n} $$
$G(\omega)$ exists if $\{g[n]\}$ is absolutely summable, that is, $\sum_{n=-\infty}^{\infty} |g[n]| < \infty$. The signal $g[n]$ is obtained from $G(\omega)$ by the inverse discrete-time Fourier transform
$$ g[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} G(\omega) e^{j\omega n}\,d\omega $$

The following observations about the DTFT are important:

$\omega$ is a frequency variable representing the frequency of a discrete sinusoid. Thus the signal $g[n] = A \cos(\omega_0 n)$ has a frequency of $\omega_0$ radians/sample.

$G(\omega)$ is always periodic in $\omega$ with a period of $2\pi$. Thus $G(\omega)$ is uniquely defined in the interval $-\pi \le \omega \le \pi$.

Suppose $\{g[n]\}$ is obtained by sampling a continuous-time signal $g_a(t)$ at a uniform interval $T$ such that $g[n] = g_a(nT),\ n = 0, \pm 1, \pm 2, \ldots$ The frequency $\omega$ of the discrete-time signal is related to the frequency $\Omega$ of the continuous-time signal by the relation
$$ \omega = \Omega T $$
where $T$ is the uniform sampling interval. The symbol $\Omega$ for the frequency of a continuous-time signal is used in the signal-processing literature just to distinguish it from the corresponding frequency $\omega$ of the discrete-time signal. This is illustrated in the Fig. below.

We can define the z-transform of the discrete-time signal by the relation
$$ G(z) = \sum_{n=-\infty}^{\infty} g[n] z^{-n} $$
where $z$ is a complex variable. $G(\omega)$ is related to $G(z)$ by
$$ G(\omega) = G(z)\big|_{z = e^{j\omega}} $$

Power spectrum of a discrete-time real WSS process $\{X[n]\}$

Consider a discrete-time real WSS process $\{X[n]\}$. The very notion of stationarity poses a problem in the frequency-domain representation of $\{X[n]\}$ through the discrete-time Fourier transform. The difficulty is avoided, as in the case of the continuous-time WSS process, by defining the truncated process
$$ X_N[n] = \begin{cases} X[n] & |n| \le N \\ 0 & \text{otherwise} \end{cases} $$
The power spectral density $S_X(\omega)$ of the process $\{X[n]\}$ is defined as
$$ S_X(\omega) = \lim_{N \to \infty} \frac{1}{2N+1} E\left| DTFTX_N(\omega) \right|^2 $$
where
$$ DTFTX_N(\omega) = \sum_{n=-N}^{N} X_N[n] e^{-j\omega n} = \sum_{n=-N}^{N} X[n] e^{-j\omega n} $$
Note that the average power of $\{X[n]\}$ is $R_X[0] = E X^2[n]$, and the power spectral density $S_X(\omega)$ indicates the contribution to the average power of the sinusoidal component of frequency $\omega$.

Wiener-Einstein-Khinchin theorem

The Wiener-Einstein-Khinchin theorem is also valid for discrete-time random processes: the power spectral density $S_X(\omega)$ of the WSS process $\{X[n]\}$ is the discrete-time Fourier transform of the autocorrelation sequence,
$$ S_X(\omega) = \sum_{m=-\infty}^{\infty} R_X[m] e^{-j\omega m} $$
$R_X[m]$ is related to $S_X(\omega)$ by the inverse discrete-time Fourier transform,
$$ R_X[m] = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_X(\omega) e^{j\omega m}\,d\omega $$
Thus $R_X[m]$ and $S_X(\omega)$ form a discrete-time Fourier transform pair. A generalized PSD can be defined in terms of the z-transform as follows:
$$ S_X(z) = \sum_{m=-\infty}^{\infty} R_X[m] z^{-m} $$
Clearly, $S_X(\omega) = S_X(z)\big|_{z = e^{j\omega}}$.
Example Suppose $R_X[m] = 2^{-|m|},\ m = 0, \pm 1, \pm 2, \ldots$ Then
$$ S_X(\omega) = \sum_{m=-\infty}^{\infty} R_X[m] e^{-j\omega m} = 1 + \sum_{m=1}^{\infty} 2^{-m}\left( e^{-j\omega m} + e^{j\omega m} \right) = 1 + \frac{e^{-j\omega}/2}{1 - e^{-j\omega}/2} + \frac{e^{j\omega}/2}{1 - e^{j\omega}/2} = \frac{3}{5 - 4\cos\omega} $$
The plot of the autocorrelation sequence and the power spectral density is shown in Fig. below.
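A short numerical check of this pair (a sketch, truncating the sum at a large lag where $2^{-|m|}$ is negligible):

import numpy as np

m = np.arange(-60, 61)                 # 2^{-|m|} is negligible beyond this range
R = 2.0 ** (-np.abs(m))

for w in (0.0, 1.0, np.pi):
    S_sum = np.sum(R * np.exp(-1j * w * m)).real   # truncated DTFT of the autocorrelation
    print(w, S_sum, 3.0 / (5.0 - 4.0 * np.cos(w))) # compare with the closed form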

Properties of the PSD of a discrete-time WSS process

For the real discrete-time process $\{X[n]\}$, the autocorrelation sequence $R_X[m]$ is real and even. Therefore, $S_X(\omega)$ is real and even.
$S_X(\omega) \ge 0$.
The average power of $\{X[n]\}$ is given by
$$ E X^2[n] = R_X[0] = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_X(\omega)\,d\omega $$
Similarly, the average power in the frequency band $[\omega_1, \omega_2]$ is given by
$$ \frac{2}{2\pi} \int_{\omega_1}^{\omega_2} S_X(\omega)\,d\omega $$
$S_X(\omega)$ is periodic in $\omega$ with a period of $2\pi$.

Interpretation of the power spectrum of a discrete-time WSS process

Assume that the discrete-time WSS process $\{X[n]\}$ is obtained by sampling a continuous-time random process $\{X_a(t)\}$ at a uniform interval, that is,
$$ X[n] = X_a(nT), \qquad n = 0, \pm 1, \pm 2, \ldots $$
The autocorrelation sequence $R_X[m]$ is given by
$$ R_X[m] = E X[n+m] X[n] = E X_a(nT + mT) X_a(nT) = R_{X_a}(mT) $$
$$ \therefore\ R_X[m] = R_{X_a}(mT), \qquad m = 0, \pm 1, \pm 2, \ldots $$
Thus the sequence $\{R_X[m]\}$ is obtained by sampling the autocorrelation function $R_{X_a}(\tau)$ at a uniform interval $T$.
The frequency $\omega$ of the discrete-time WSS process is related to the frequency $\Omega$ of the continuous-time process by the relation $\omega = \Omega T$.

White noise process

A white noise process $\{W(t)\}$ is defined by
$$ S_W(\omega) = \frac{N_0}{2}, \qquad -\infty < \omega < \infty $$
where $N_0$ is a real constant called the intensity of the white noise. The corresponding autocorrelation function is given by
$$ R_W(\tau) = \frac{N_0}{2} \delta(\tau) $$
where $\delta(\tau)$ is the Dirac delta.
The average power of white noise is
$$ P_{avg} = E W^2(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{N_0}{2}\,d\omega \to \infty $$
The autocorrelation function and the PSD of a white noise process are shown in Fig. below: $S_W(\omega)$ is the constant $\dfrac{N_0}{2}$ and $R_W(\tau)$ is the impulse $\dfrac{N_0}{2}\delta(\tau)$.

Remarks

The term white noise is analogous to white light, which contains all visible light frequencies. We generally consider zero-mean white noise processes. A white noise process is unpredictable, as the noise samples at different instants of time are uncorrelated:
$$ C_W(t_i, t_j) = 0 \quad \text{for } t_i \neq t_j $$
White noise is a mathematical abstraction; it cannot be realized, since it has infinite average power. However, if the system bandwidth is sufficiently narrower than the noise bandwidth and the noise PSD is flat over the system bandwidth, the noise can be modelled as a white noise process. Thermal noise, the noise generated in resistors due to the random motion of electrons, is well modelled as white Gaussian noise, since it has a very flat PSD over a very wide band (of the order of GHz).

A white noise process can have any probability density function. In particular, if the white noise process $\{W(t)\}$ is a Gaussian random process, then $\{W(t)\}$ is called a white Gaussian random process.

A white noise process is called a strict-sense white noise process if the noise samples at distinct instants of time are independent. A white Gaussian noise process is a strict-sense white noise process. Such a process represents a purely random process, because its samples at arbitrarily close instants are independent.

Example A random-phase sinusoid corrupted by white noise
Suppose $X(t) = B \sin(\omega_c t + \phi) + W(t)$ where $B$ and $\omega_c$ are constants, $\phi \sim U[0, 2\pi]$, and $\{W(t)\}$ is a zero-mean white Gaussian noise process with PSD $\dfrac{N_0}{2}$, independent of $\phi$. Find $R_X(\tau)$ and $S_X(\omega)$.
$$ R_X(\tau) = E X(t+\tau) X(t) = E\left( B \sin(\omega_c (t+\tau) + \phi) + W(t+\tau) \right)\left( B \sin(\omega_c t + \phi) + W(t) \right) = \frac{B^2}{2} \cos \omega_c \tau + R_W(\tau) $$
$$ S_X(\omega) = \frac{\pi B^2}{2} \left( \delta(\omega + \omega_c) + \delta(\omega - \omega_c) \right) + \frac{N_0}{2} $$
where $\delta(\omega)$ is the Dirac delta function.

Band-limited white noise

A noise process which has a non-zero constant PSD over a finite frequency band and zero elsewhere is called band-limited white noise. Thus the WSS process $\{X(t)\}$ is band-limited white noise if
$$ S_X(\omega) = \frac{N_0}{2}, \qquad -B < \omega < B $$
and zero elsewhere. For example, thermal noise, which has a constant PSD up to a very high frequency, is better modelled by a band-limited white noise process. The corresponding autocorrelation function $R_X(\tau)$ is given by
$$ R_X(\tau) = \frac{N_0 B}{2\pi} \frac{\sin B\tau}{B\tau} $$
The plots of $S_X(\omega)$ and $R_X(\tau)$ of a band-limited white noise process are shown in Fig. below. Further assume that $\{X(t)\}$ is a zero-mean process.

Observe the following:
The average power of the process is $E X^2(t) = R_X(0) = \dfrac{N_0 B}{2\pi}$.
$R_X(\tau) = 0$ for $\tau = \dfrac{n\pi}{B}$, i.e. for $\tau = \pm\dfrac{\pi}{B}, \pm\dfrac{2\pi}{B}, \pm\dfrac{3\pi}{B}, \ldots$, where $n$ is a non-zero integer. This means that $X(t)$ and $X\!\left(t + \dfrac{n\pi}{B}\right)$ are uncorrelated. Thus we can get uncorrelated samples by sampling a band-limited white noise process at a uniform interval of $\dfrac{\pi}{B}$.

A band-limited white noise process may also be a band-pass process with PSD as shown in the Fig. and given by
$$ S_X(\omega) = \begin{cases} \dfrac{N_0}{2} & \left|\,|\omega| - \omega_0\,\right| < \dfrac{B}{2} \\ 0 & \text{otherwise} \end{cases} $$
The corresponding autocorrelation function is given by
$$ R_X(\tau) = \frac{N_0 B}{2\pi}\, \frac{\sin \dfrac{B\tau}{2}}{\dfrac{B\tau}{2}}\, \cos \omega_0 \tau $$

Fig. PSD of the band-pass band-limited white noise: a constant level $\dfrac{N_0}{2}$ over the bands $\omega_0 - \dfrac{B}{2} < |\omega| < \omega_0 + \dfrac{B}{2}$.

Coloured Noise
A noise process which is not white is called coloured noise. Thus the noise process $\{X(t)\}$ with $R_X(\tau) = a^2 e^{-b|\tau|},\ b > 0$, and PSD $S_X(\omega) = \dfrac{2a^2 b}{b^2 + \omega^2}$ is an example of coloured noise.

White Noise Sequence

A random sequence $\{W[n]\}$ is called a white noise sequence if
$$ S_W(\omega) = \frac{N}{2}, \qquad -\pi \le \omega \le \pi $$
Therefore,
$$ R_W[m] = \frac{N}{2} \delta[m] $$
where $\delta[m]$ is the unit impulse sequence. The autocorrelation function and the PSD of a white noise sequence are shown in Fig.: $S_W(\omega)$ is the constant level $\dfrac{N}{2}$ over $-\pi \le \omega \le \pi$, and $R_W[m]$ is a single impulse of height $\dfrac{N}{2}$ at $m = 0$. A realization of a white noise sequence is shown in Fig. below.
Remark
The average power of the white noise sequence is
$$ E W^2[n] = R_W[0] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{N}{2}\,d\omega = \frac{N}{2} $$
so the average power of the white noise sequence is finite and uniformly distributed over all frequencies. If the white noise sequence $\{W[n]\}$ is a Gaussian sequence, then $\{W[n]\}$ is called a white Gaussian noise (WGN) sequence.
An i.i.d. random sequence is always white. Such a sequence may be called a strict-sense white noise sequence. A WGN sequence is a strict-sense stationary white noise sequence.
The white noise sequence model looks artificial, but it plays a key role in random signal modelling, similar to the role of the impulse function in the modelling of deterministic signals. A class of WSS processes called regular processes can be considered as the output of a linear system driven by white noise, as illustrated in the diagram below; a short simulation sketch follows. The notion of a sequence of i.i.d. random variables is also important in statistical inference.

White noise process $\to$ Linear System $\to$ Regular WSS random process
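A minimal sketch of this idea (assuming a simple first-order recursive filter as the linear system; the coefficient 0.9 is an arbitrary illustrative choice): filtering an i.i.d. white sequence produces a correlated WSS sequence whose estimated PSD is no longer flat.

import numpy as np
from scipy.signal import lfilter, welch

rng = np.random.default_rng(5)
v = rng.standard_normal(200000)            # white (i.i.d.) input sequence, variance 1
x = lfilter([1.0], [1.0, -0.9], v)         # a "regular" process: x[n] = 0.9 x[n-1] + v[n]

f, Sv = welch(v, fs=1.0, nperseg=1024)     # estimated PSD of the white input (flat)
_, Sx = welch(x, fs=1.0, nperseg=1024)     # estimated PSD of the output (concentrated at low frequencies)

print(Sv[:5].round(2), Sv[-5:].round(2))   # roughly the same level everywhere
print(Sx[:5].round(2), Sx[-5:].round(2))   # large at low frequencies, small near f = 0.5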

Response of a Linear time-invariant system to WSS input:

In many applications, physical systems are modelled as linear time-invariant (LTI) systems. The dynamic behaviour of an LTI system with deterministic inputs is described by linear differential equations. We are familiar with time-domain and transform-domain (Laplace transform and Fourier transform) techniques to solve these equations. In this lecture, we develop techniques to analyse the response of an LTI system to a WSS random process input. The purpose of this study is two-fold:
(1) analysis of the response of a system;
(2) finding an LTI system that can optimally estimate an unobserved random process from an observed process that is statistically related to it. For example, we may have to find an LTI system (also called a filter) to estimate a signal from noisy observations.

Basics of Linear Time Invariant Systems:
A system is modelled by a transformation $T$ that maps an input signal $x(t)$ to an output signal $y(t)$. We can thus write
$$ y(t) = T[x(t)] $$

Linear system
The system is called linear if superposition applies: the weighted sum of inputs results in the weighted sum of the corresponding outputs. Thus for a linear system
$$ T[a_1 x_1(t) + a_2 x_2(t)] = a_1 T[x_1(t)] + a_2 T[x_2(t)] $$

Example: Consider the output of a differentiator, given by
$$ y(t) = \frac{d\,x(t)}{dt} $$
Then
$$ \frac{d}{dt}\left( a_1 x_1(t) + a_2 x_2(t) \right) = a_1 \frac{d}{dt} x_1(t) + a_2 \frac{d}{dt} x_2(t) $$
Hence the differentiator is a linear system.

Linear time-invariant system
Consider a linear system with $y(t) = T[x(t)]$. The system is called time-invariant if
$$ T[x(t - t_0)] = y(t - t_0) \quad \text{for all } t_0 $$
It is easy to check that the differentiator in the above example is a linear time-invariant system.

Causal system
The system is called causal if the output of the system at $t = t_0$ depends only on the present and past values of the input. Thus for a causal system
$$ y(t_0) = T\left( x(t),\ t \le t_0 \right) $$

Response of a linear time-invariant system to deterministic input
A linear system can be characterised by its impulse response $h(t) = T[\delta(t)]$, where $\delta(t)$ is the Dirac delta function.

$\delta(t)$ $\to$ LTI system $\to$ $h(t)$

Recall that any function $x(t)$ can be represented in terms of the Dirac delta function as follows:
$$ x(t) = \int_{-\infty}^{\infty} x(s)\,\delta(t - s)\,ds $$
If $x(t)$ is input to the linear system $y(t) = T[x(t)]$, then
$$ y(t) = T\left[ \int_{-\infty}^{\infty} x(s)\,\delta(t - s)\,ds \right] = \int_{-\infty}^{\infty} x(s)\, T[\delta(t - s)]\,ds \quad \text{[using the linearity property]} = \int_{-\infty}^{\infty} x(s)\, h(t, s)\,ds $$
where $h(t, s) = T[\delta(t - s)]$ is the response at time $t$ due to the shifted impulse $\delta(t - s)$.
If the system is time-invariant, $h(t, s) = h(t - s)$. Therefore, for a linear time-invariant system,
$$ y(t) = \int_{-\infty}^{\infty} x(s)\, h(t - s)\,ds = x(t) * h(t) $$
where $*$ denotes the convolution operation. Thus for an LTI system,
$$ y(t) = x(t) * h(t) = h(t) * x(t) $$

$x(t)$ $\to$ LTI system $h(t)$ $\to$ $y(t)$
$X(\omega)$ $\to$ LTI system $H(\omega)$ $\to$ $Y(\omega)$

Taking the Fourier transform, we get
$$ Y(\omega) = H(\omega) X(\omega) $$
where $H(\omega) = FT[h(t)] = \int_{-\infty}^{\infty} h(t) e^{-j\omega t}\,dt$ is the frequency response of the system.

Response of an LTI System to WSS input

Consider an LTI system with impulse response $h(t)$. Suppose $\{X(t)\}$ is a WSS process input to the system. The output $\{Y(t)\}$ of the system is given by
$$ Y(t) = \int_{-\infty}^{\infty} h(s) X(t - s)\,ds = \int_{-\infty}^{\infty} h(t - s) X(s)\,ds $$
where we have assumed that the integrals exist in the mean square (m.s.) sense.

Mean and autocorrelation of the output process $\{Y(t)\}$
The mean of the output process is given by
$$ E Y(t) = E \int_{-\infty}^{\infty} h(s) X(t - s)\,ds = \int_{-\infty}^{\infty} h(s)\, E X(t - s)\,ds = \int_{-\infty}^{\infty} h(s)\, \mu_X\,ds = \mu_X \int_{-\infty}^{\infty} h(s)\,ds = \mu_X H(0) $$
where $H(0)$ is the frequency response $H(\omega)$ at zero frequency ($\omega = 0$), given by
$$ H(\omega)\big|_{\omega = 0} = \left. \int_{-\infty}^{\infty} h(t) e^{-j\omega t}\,dt \right|_{\omega = 0} = \int_{-\infty}^{\infty} h(t)\,dt $$

Therefore, the mean of the output process $\{Y(t)\}$ is a constant.

The cross-correlation of the input $X(t)$ and the output $Y(t)$ is given by
$$ E X(t+\tau) Y(t) = E\left[ X(t+\tau) \int_{-\infty}^{\infty} h(s) X(t - s)\,ds \right] = \int_{-\infty}^{\infty} h(s)\, E X(t+\tau) X(t - s)\,ds = \int_{-\infty}^{\infty} h(s)\, R_X(\tau + s)\,ds = \int_{-\infty}^{\infty} h(-u)\, R_X(\tau - u)\,du \quad [\text{putting } s = -u] = h(-\tau) * R_X(\tau) $$
$$ \therefore\ R_{XY}(\tau) = h(-\tau) * R_X(\tau) $$
Also,
$$ R_{YX}(\tau) = R_{XY}(-\tau) = h(\tau) * R_X(\tau) $$
Thus we see that $R_{XY}(\tau)$ is a function of the lag $\tau$ only. Therefore, $X(t)$ and $Y(t)$ are jointly wide-sense stationary.

The autocorrelation function of the output process $Y(t)$ is given by
$$ E Y(t+\tau) Y(t) = E\left[ \int_{-\infty}^{\infty} h(s) X(t+\tau-s)\,ds\; Y(t) \right] = \int_{-\infty}^{\infty} h(s)\, E X(t+\tau-s) Y(t)\,ds = \int_{-\infty}^{\infty} h(s)\, R_{XY}(\tau - s)\,ds = h(\tau) * R_{XY}(\tau) = h(\tau) * h(-\tau) * R_X(\tau) $$
Thus the autocorrelation of the output process $\{Y(t)\}$ depends on the time-lag $\tau$ only, i.e. $E Y(t+\tau) Y(t) = R_Y(\tau)$ with
$$ R_Y(\tau) = R_X(\tau) * h(\tau) * h(-\tau) $$
The above analysis indicates that for an LTI system with WSS input
(1) the output is WSS, and
(2) the input and output are jointly WSS.
The average power of the output process $\{Y(t)\}$ is given by
$$ P_Y = R_Y(0) = \left[ R_X(\tau) * h(\tau) * h(-\tau) \right]_{\tau = 0} $$

Power spectrum of the output process
Using the properties of the Fourier transform, we get the power spectral density of the output process:
$$ S_Y(\omega) = S_X(\omega) H(\omega) H^*(\omega) = S_X(\omega) |H(\omega)|^2 $$
Also note that $R_{XY}(\tau) = h(-\tau) * R_X(\tau)$ and $R_{YX}(\tau) = h(\tau) * R_X(\tau)$. Taking Fourier transforms, we get the cross power spectral densities
$$ S_{XY}(\omega) = H^*(\omega) S_X(\omega) \qquad \text{and} \qquad S_{YX}(\omega) = H(\omega) S_X(\omega) $$

$S_X(\omega)$ $\to$ $H^*(\omega)$ $\to$ $S_{XY}(\omega)$ $\to$ $H(\omega)$ $\to$ $S_Y(\omega)$
$R_X(\tau)$ $\to$ $h(-\tau)$ $\to$ $R_{XY}(\tau)$ $\to$ $h(\tau)$ $\to$ $R_Y(\tau)$
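A brief numerical sketch of the relation $S_Y(\omega) = |H(\omega)|^2 S_X(\omega)$ (assuming a discrete-time approximation with a white input and a simple FIR filter chosen only for illustration; the PSDs are estimated with Welch averaging):

import numpy as np
from scipy.signal import lfilter, welch, freqz

rng = np.random.default_rng(6)
b = np.array([0.25, 0.5, 0.25])            # an arbitrary illustrative FIR filter h[n]
x = rng.standard_normal(400000)            # white WSS input, S_X roughly flat
y = lfilter(b, [1.0], x)                   # output of the LTI system

f, Sx = welch(x, fs=1.0, nperseg=2048)     # estimated input PSD
_, Sy = welch(y, fs=1.0, nperseg=2048)     # estimated output PSD
_, H = freqz(b, worN=f, fs=1.0)            # frequency response H at the same frequencies

ratio = Sy / Sx
print(np.max(np.abs(ratio - np.abs(H)**2)))   # small: S_Y/S_X tracks |H|^2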

Example:
(a) A white noise process $X(t)$ with power spectral density $\dfrac{N_0}{2}$ is input to an ideal low-pass filter of bandwidth $B$. Find the PSD and autocorrelation function of the output process.

The frequency response of the ideal low-pass filter is $H(\omega) = 1$ for $|\omega| \le B$ and $0$ otherwise. The output power spectral density $S_Y(\omega)$ is given by
$$ S_Y(\omega) = |H(\omega)|^2 S_X(\omega) = \frac{N_0}{2}, \qquad -B \le \omega \le B $$
and zero elsewhere.
$$ R_Y(\tau) = \text{inverse Fourier transform of } S_Y(\omega) = \frac{1}{2\pi} \int_{-B}^{B} \frac{N_0}{2} e^{j\omega\tau}\,d\omega = \frac{N_0 \sin B\tau}{2\pi\tau} $$
The output PSD $S_Y(\omega)$ and the output autocorrelation function $R_Y(\tau)$ are illustrated in Fig. below.


Example 2:
A random voltage modelled as a white noise process $X(t)$ with power spectral density $\dfrac{N_0}{2}$ is input to the RC network shown in the figure.
Find (a) the output PSD $S_Y(\omega)$, (b) the output autocorrelation function $R_Y(\tau)$, and (c) the average output power $E Y^2(t)$.

The frequency response of the system is given by
$$ H(\omega) = \frac{\frac{1}{j\omega C}}{R + \frac{1}{j\omega C}} = \frac{1}{1 + j\omega RC} $$
Therefore,
(a)
$$ S_Y(\omega) = |H(\omega)|^2 S_X(\omega) = \frac{1}{1 + \omega^2 R^2 C^2} \cdot \frac{N_0}{2} $$
(b) Taking the inverse Fourier transform,
$$ R_Y(\tau) = \frac{N_0}{4RC}\, e^{-\frac{|\tau|}{RC}} $$
(c) The average output power is
$$ E Y^2(t) = R_Y(0) = \frac{N_0}{4RC} $$
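A quick numerical check of (c) (a sketch with assumed illustrative values $R = 1\,\mathrm{k\Omega}$, $C = 1\,\mu\mathrm{F}$, $N_0 = 2$): integrating the output PSD over all frequencies should reproduce $\dfrac{N_0}{4RC}$.

import numpy as np
from scipy.integrate import quad

R, C, N0 = 1e3, 1e-6, 2.0                  # assumed illustrative values
S_Y = lambda w: (N0 / 2.0) / (1.0 + (w * R * C) ** 2)

# Average power = (1/2pi) * integral of S_Y over all omega (use evenness of S_Y)
power, _ = quad(S_Y, 0, np.inf)
power = 2.0 * power / (2.0 * np.pi)

print(power, N0 / (4.0 * R * C))           # both ~ 500 W here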

Discrete-time Linear Shift Invariant System with WSS Random Inputs

We have seen that the Dirac delta function $\delta(t)$ plays a very important role in the analysis of the response of continuous-time LTI systems to deterministic and random inputs. A similar role in the case of discrete-time LTI systems is played by the unit sample sequence $\delta[n]$, defined by
$$ \delta[n] = \begin{cases} 1 & n = 0 \\ 0 & \text{otherwise} \end{cases} $$
Any discrete-time signal $x[n]$ can be expressed in terms of $\delta[n]$ as follows:
$$ x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k] $$
A discrete-time linear shift-invariant system is characterized by its unit sample response $h[n]$, which is the output of the system to the unit sample sequence $\delta[n]$:

$\delta[n]$ $\to$ LTI system $\to$ $h[n]$

The transfer function of such a system is given by
$$ H(\omega) = \sum_{n=-\infty}^{\infty} h[n] e^{-j\omega n} $$
An analysis similar to that for the continuous-time LTI system shows that the response $y[n]$ of a linear shift-invariant system with unit sample response $h[n]$ to a deterministic input $x[n]$ is given by
$$ y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k] = x[n] * h[n] $$

Consider a discrete-time linear shift-invariant system with unit sample response $h[n]$ and a WSS input $X[n]$:

$X[n]$ $\to$ $h[n]$ $\to$ $Y[n]$

$$ Y[n] = X[n] * h[n] = \sum_{k=-\infty}^{\infty} h[k]\,X[n-k] $$
For the WSS input $X[n]$, the mean of the output is
$$ \mu_Y = E\,Y[n] = \mu_X \sum_{n=-\infty}^{\infty} h[n] = \mu_X H(0) $$
where $H(0)$ is the dc gain of the system, given by
$$ H(0) = H(\omega)\big|_{\omega=0} = \left. \sum_{n=-\infty}^{\infty} h[n] e^{-j\omega n} \right|_{\omega=0} = \sum_{n=-\infty}^{\infty} h[n] $$

The autocorrelation of the output is
$$ R_Y[m] = E\,Y[n]\,Y[n-m] = E\left( X[n] * h[n] \right)\left( X[n-m] * h[n-m] \right) = R_X[m] * h[m] * h[-m] $$
so $R_Y[m]$ is a function of the lag $m$ only. From the above we get
$$ S_Y(\omega) = |H(\omega)|^2 S_X(\omega) $$

$S_X(\omega)$ $\to$ $|H(\omega)|^2$ $\to$ $S_Y(\omega)$

Note that even when the input is an uncorrelated (white) process, the output is in general a correlated process.

Consider again the discrete-time system with a random sequence $x[n]$ as input:

$x[n]$ $\to$ $h[n]$ $\to$ $y[n]$

$$ R_Y[m] = R_X[m] * h[m] * h[-m] $$
Taking the z-transform, we get
$$ S_Y(z) = S_X(z)\,H(z)\,H(z^{-1}) $$
Notice that if $H(z)$ is causal, then $H(z^{-1})$ is anti-causal. Similarly, if $H(z)$ is minimum-phase, then $H(z^{-1})$ is maximum-phase.

$R_X[m],\ S_X(z)$ $\to$ $H(z)$ $\to$ $H(z^{-1})$ $\to$ $R_Y[m],\ S_Y(z)$

Example
If
$$ H(z) = \frac{1}{1 - a z^{-1}}, \qquad |a| < 1 $$
and $x[n]$ is a unity-variance white-noise sequence, then $S_X(z) = \sigma_X^2 = 1$ and
$$ S_Y(z) = H(z)\,H(z^{-1})\,S_X(z) = \frac{1}{(1 - a z^{-1})(1 - a z)} $$
By partial fraction expansion and the inverse z-transform, we get
$$ R_Y[m] = \frac{a^{|m|}}{1 - a^2} $$
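A short simulation sketch of this result (with the arbitrary illustrative choice $a = 0.8$):

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(7)
a = 0.8
x = rng.standard_normal(500000)              # unity-variance white noise
y = lfilter([1.0], [1.0, -a], x)             # y[n] = a y[n-1] + x[n], i.e. H(z) = 1/(1 - a z^{-1})

for m in range(4):
    est = np.mean(y[m:] * y[:len(y) - m])    # estimated R_Y[m]
    print(m, est, a**m / (1.0 - a**2))       # compare with the closed form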

Spectral factorization theorem

A WSS random signal $X[n]$ that satisfies the Paley-Wiener condition
$$ \int_{-\pi}^{\pi} |\ln S_X(\omega)|\,d\omega < \infty $$
can be considered as the output of a linear filter fed by a white noise sequence. If $S_X(\omega)$ is an analytic function of $\omega$ and satisfies this condition, then
$$ S_X(z) = \sigma_v^2\, H_c(z)\, H_a(z) $$
where
$H_c(z)$ is the causal minimum-phase transfer function,
$H_a(z)$ is the anti-causal maximum-phase transfer function,
and $\sigma_v^2$ is a constant interpreted as the variance of a white noise sequence, the innovation sequence.

$v[n]$ $\to$ $H_c(z)$ $\to$ $X[n]$
Figure Innovation filter

Since $H_c(z)$ is minimum-phase, the corresponding inverse filter exists:
$X[n]$ $\to$ $\dfrac{1}{H_c(z)}$ $\to$ $v[n]$
Figure Whitening filter

Since $\ln S_X(z)$ is analytic in an annular region $\rho < |z| < \dfrac{1}{\rho}$, it has the Laurent expansion
$$ \ln S_X(z) = \sum_{k=-\infty}^{\infty} c[k] z^{-k} $$
where
$$ c[k] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_X(\omega)\, e^{j\omega k}\,d\omega $$
is the $k$th-order cepstral coefficient. For a real signal, $c[k] = c[-k]$ and
$$ c[0] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_X(\omega)\,d\omega $$
Therefore,
$$ S_X(z) = e^{\sum_{k=-\infty}^{\infty} c[k] z^{-k}} = e^{c[0]} \cdot e^{\sum_{k=1}^{\infty} c[k] z^{-k}} \cdot e^{\sum_{k=-\infty}^{-1} c[k] z^{-k}} $$
Let
$$ H_C(z) = e^{\sum_{k=1}^{\infty} c[k] z^{-k}}, \qquad |z| > \rho $$
$$ \quad = 1 + h_c[1] z^{-1} + h_c[2] z^{-2} + \cdots \qquad \left( h_c[0] = \lim_{z \to \infty} H_C(z) = 1 \right) $$
$H_C(z)$ and $\ln H_C(z)$ are both analytic in $|z| > \rho$, so $H_C(z)$ is a minimum-phase filter. Similarly, let
$$ H_a(z) = e^{\sum_{k=-\infty}^{-1} c[k] z^{-k}} = e^{\sum_{k=1}^{\infty} c[k] z^{k}} = H_C(z^{-1}), \qquad |z| < \frac{1}{\rho} $$
Therefore,
$$ S_X(z) = \sigma_v^2\, H_C(z)\, H_C(z^{-1}) $$
where $\sigma_v^2 = e^{c[0]}$.

Salient points

$S_X(z)$ can be factorized into a minimum-phase factor $H_C(z)$ and a maximum-phase factor $H_C(z^{-1})$.

In general spectral factorization is difficult; however, for a signal with a rational power spectrum, spectral factorization can be done easily.

Since $H_C(z)$ is a minimum-phase filter, $\dfrac{1}{H_C(z)}$ exists and is stable; therefore we can pass the given signal through the whitening filter $\dfrac{1}{H_C(z)}$ to obtain the innovation sequence.

$X[n]$ and $v[n]$ are related through an invertible transform, so they contain the same information.
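A compact numerical sketch of this cepstrum-based factorization (assuming the AR(1) spectrum from the earlier example, $S_X(\omega) = 1/|1 - a e^{-j\omega}|^2$ with the illustrative value $a = 0.8$; the FFT approximates the cepstral integrals on a dense frequency grid):

import numpy as np

a, n = 0.8, 4096
w = 2.0 * np.pi * np.arange(n) / n
S = 1.0 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2       # known PSD sampled on the grid

c = np.fft.ifft(np.log(S)).real                        # cepstral coefficients c[k] (approximately)
sigma_v2 = np.exp(c[0])                                # e^{c[0]}, the innovation variance

c_causal = np.r_[0.0, c[1:n//2], np.zeros(n - n//2)]   # keep only the causal part (k >= 1) of the cepstrum
Hc = np.exp(np.fft.fft(c_causal))                      # H_c(e^{jw}) = exp( sum_{k>=1} c[k] e^{-jwk} )
hc = np.fft.ifft(Hc).real                              # impulse response of the minimum-phase factor

print(sigma_v2)          # ~ 1.0, the white-noise variance behind this spectrum
print(hc[:4])            # ~ [1, 0.8, 0.64, 0.512], i.e. H_c(z) = 1/(1 - 0.8 z^{-1})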

Wold's Decomposition
Any WSS signal $X[n]$ can be decomposed as the sum of two mutually orthogonal processes, a regular process $X_r[n]$ and a predictable process $X_p[n]$:
$$ X[n] = X_r[n] + X_p[n] $$
$X_r[n]$ can be expressed as the output of a linear filter driven by a white noise sequence.
$X_p[n]$ is a predictable process, that is, a process that can be predicted from its own past with zero prediction error.
