This action might not be possible to undo. Are you sure you want to continue?

BooksAudiobooksComicsSheet Music### Categories

### Categories

### Categories

### Publishers

Scribd Selects Books

Hand-picked favorites from

our editors

our editors

Scribd Selects Audiobooks

Hand-picked favorites from

our editors

our editors

Scribd Selects Comics

Hand-picked favorites from

our editors

our editors

Scribd Selects Sheet Music

Hand-picked favorites from

our editors

our editors

Top Books

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Audiobooks

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Comics

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Sheet Music

What's trending, bestsellers,

award-winners & more

award-winners & more

P. 1

Lecture Notes|Views: 89|Likes: 0

Published by Sohail Afridi

See more

See less

https://www.scribd.com/doc/94768206/Lecture-Notes

08/28/2013

text

original

- 2 Combinatorics
- 3 Axioms of Probability
- 4 Conditional Probability and Independence
- Interlude: Practice Midterm 1
- 5 Discrete Random Variables
- 6 Continuous Random Variables
- 7 Joint Distributions and Independence
- Interlude: Practice Midterm 2
- 8 More on Expectation and Limit Theorems
- 9 Convergence in probability
- 11 Computing probabilities and expectations by conditioning
- 12 Markov Chains: Introduction
- 13 Markov Chains: Classiﬁcation of States
- 14 Branching processes
- 15 Markov Chains: Limiting Probabilities
- 16 Markov Chains: Reversibility
- 17 Three Applications
- 18 Poisson Process
- Interlude: Practice Final

Janko Gravner

Mathematics Department

University of California

Davis, CA 95616

gravner@math.ucdavis.edu

June 9, 2011

These notes were started in January 2009 with help from Christopher Ng, a student in

Math 135A and 135B classes at UC Davis, who typeset the notes he took during my lectures.

This text is not a treatise in elementary probability and has no lofty goals; instead, its aim is

to help a student achieve the proﬁciency in the subject required for a typical exam and basic

real-life applications. Therefore, its emphasis is on examples, which are chosen without much

redundancy. A reader should strive to understand every example given and be able to design

and solve a similar one. Problems at the end of chapters and on sample exams (the solutions

to all of which are provided) have been selected from actual exams, hence should be used as a

test for preparedness.

I have only one tip for studying probability: you cannot do it half-heartedly. You have to

devote to this class several hours per week of concentrated attention to understand the subject

enough so that standard problems become routine. If you think that coming to class and reading

the examples while also doing something else is enough, you’re in for an unpleasant surprise on

the exams.

This text will always be available free of charge to UC Davis students. Please contact me if

you spot any mistake. I am thankful to Marisano James for numerous corrections and helpful

suggestions.

Copyright 2010, Janko Gravner

1 INTRODUCTION 1

1 Introduction

The theory of probability has always been associated with gambling and many most accessible

examples still come from that activity. You should be familiar with the basic tools of the

gambling trade: a coin, a (six-sided) die, and a full deck of 52 cards. A fair coin gives you Heads

(H) or Tails (T) with equal probability, a fair die will give you 1, 2, 3, 4, 5, or 6 with equal

probability, and a shuﬄed deck of cards means that any ordering of cards is equally likely.

Example 1.1. Here are typical questions that we will be asking and that you will learn how to

answer. This example serves as an illustration and you should not expect to understand how to

get the answer yet.

Start with a shuﬄed deck of cards and distribute all 52 cards to 4 players, 13 cards to each.

What is the probability that each player gets an Ace? Next, assume that you are a player and

you get a single Ace. What is the probability now that each player gets an Ace?

Answers. If any ordering of cards is equally likely, then any position of the four Aces in the

deck is also equally likely. There are

_

52

4

_

possibilities for the positions (slots) for the 4 aces. Out

of these, the number of positions that give each player an Ace is 13

4

: pick the ﬁrst slot among

the cards that the ﬁrst player gets, then the second slot among the second player’s cards, then

the third and the fourth slot. Therefore, the answer is

13

4

(

52

4

)

≈ 0.1055.

After you see that you have a single Ace, the probability goes up: the previous answer needs

to be divided by the probability that you get a single Ace, which is

13·(

39

3

)

(

52

4

)

≈ 0.4388. The answer

then becomes

13

4

13·(

39

3

)

≈ 0.2404.

Here is how you can quickly estimate the second probability during a card game: give the

second ace to a player, the third to a diﬀerent player (probability about 2/3) and then the last

to the third player (probability about 1/3) for the approximate answer 2/9 ≈ 0.22.

History of probability

Although gambling dates back thousands of years, the birth of modern probability is considered

to be a 1654 letter from the Flemish aristocrat and notorious gambler Chevalier de M´er´e to the

mathematician and philosopher Blaise Pascal. In essence the letter said:

I used to bet even money that I would get at least one 6 in four rolls of a fair die.

The probability of this is 4 times the probability of getting a 6 in a single die, i.e.,

4/6 = 2/3; clearly I had an advantage and indeed I was making money. Now I bet

even money that within 24 rolls of two dice I get at least one double 6. This has the

same advantage (24/6

2

= 2/3), but now I am losing money. Why?

As Pascal discussed in his correspondence with Pierre de Fermat, de M´er´e’s reasoning was faulty;

after all, if the number of rolls were 7 in the ﬁrst game, the logic would give the nonsensical

probability 7/6. We’ll come back to this later.

1 INTRODUCTION 2

Example 1.2. In a family with 4 children, what is the probability of a 2:2 boy-girl split?

One common wrong answer:

1

5

, as the 5 possibilities for the number of boys are not equally

likely.

Another common guess: close to 1, as this is the most “balanced” possibility. This repre-

sents the mistaken belief that symmetry in probabilities should very likely result in symmetry in

the outcome. A related confusion supposes that events that are probable (say, have probability

around 0.75) occur nearly certainly.

Equally likely outcomes

Suppose an experiment is performed, with n possible outcomes comprising a set S. Assume

also that all outcomes are equally likely. (Whether this assumption is realistic depends on the

context. The above Example 1.2 gives an instance where this is not a reasonable assumption.)

An event E is a set of outcomes, i.e., E ⊂ S. If an event E consists of m diﬀerent outcomes

(often called “good” outcomes for E), then the probability of E is given by:

(1.1) P(E) =

m

n

.

Example 1.3. A fair die has 6 outcomes; take E = ¦2, 4, 6¦. Then P(E) =

1

2

.

What does the answer in Example 1.3 mean? Every student of probability should spend

some time thinking about this. The fact is that it is very diﬃcult to attach a meaning to P(E)

if we roll a die a single time or a few times. The most straightforward interpretation is that for

a very large number of rolls about half of the outcomes will be even. Note that this requires

at least the concept of a limit! This relative frequency interpretation of probability will be

explained in detail much later. For now, take formula (1.1) as the deﬁnition of probability.

2 COMBINATORICS 3

2 Combinatorics

Example 2.1. Toss three fair coins. What is the probability of exactly one Heads (H)?

There are 8 equally likely outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Out

of these, 3 have exactly one H. That is, E = ¦HTT, THT, TTH¦, and P(E) = 3/8.

Example 2.2. Let us now compute the probability of a 2:2 boy-girl split in a four-children

family. We have 16 outcomes, which we will assume are equally likely, although this is not quite

true in reality. We list the outcomes below, although we will soon see that there is a better way.

BBBB BBBG BBGB BBGG

BGBB BGBG BGGB BGGG

GBBB GBBG GBGB GBGG

GGBB GGBG GGGB GGGG

We conclude that

P(2:2 split) =

6

16

=

3

8

,

P(1:3 split or 3:1 split) =

8

16

=

1

2

,

P(4:0 split or 0:4 split) =

2

16

=

1

8

.

Example 2.3. Roll two dice. What is the most likely sum?

Outcomes are ordered pairs (i, j), 1 ≤ i ≤ 6, 1 ≤ j ≤ 6.

sum no. of outcomes

2 1

3 2

4 3

5 4

6 5

7 6

8 5

9 4

10 3

11 2

12 1

Our answer is 7, and P(sum = 7) =

6

36

=

1

6

.

2 COMBINATORICS 4

How to count?

Listing all outcomes is very ineﬃcient, especially if their number is large. We will, therefore,

learn a few counting techniques, starting with a trivial, but conceptually important fact.

Basic principle of counting. If an experiment consists of two stages and the ﬁrst stage has

m outcomes, while the second stage has n outcomes regardless of the outcome at the ﬁrst

stage, then the experiment as a whole has mn outcomes.

Example 2.4. Roll a die 4 times. What is the probability that you get diﬀerent numbers?

At least at the beginning, you should divide every solution into the following three steps:

Step 1: Identify the set of equally likely outcomes. In this case, this is the set of all ordered

4-tuples of numbers 1, . . . , 6. That is, ¦(a, b, c, d) : a, b, c, d ∈ ¦1, . . . , 6¦¦.

Step 2: Compute the number of outcomes. In this case, it is therefore 6

4

.

Step 3: Compute the number of good outcomes. In this case it is 6 5 4 3. Why? We

have 6 options for the ﬁrst roll, 5 options for the second roll since its number must diﬀer

from the number on the ﬁrst roll; 4 options for the third roll since its number must not

appear on the ﬁrst two rolls, etc. Note that the set of possible outcomes changes from

stage to stage (roll to roll in this case), but their number does not!

The answer then is

6·5·4·3

6

4

=

5

18

≈ 0.2778.

Example 2.5. Let us now compute probabilities for de M´er´e’s games.

In Game 1, there are 4 rolls and he wins with at least one 6. The number of good events is

6

4

−5

4

, as the number of bad events is 5

4

. Therefore

P(win) = 1 −

_

5

6

_

4

≈ 0.5177.

In Game 2, there are 24 rolls of two dice and he wins by at least one pair of 6’s rolled. The

number of outcomes is 36

24

, the number of bad ones is 35

24

, thus the number of good outcomes

equals 36

24

−35

24

. Therefore

P(win) = 1 −

_

35

36

_

24

≈ 0.4914.

Chevalier de M´er´e overcounted the good outcomes in both cases. His count 4 6

3

in Game

1 selects a die with a 6 and arbitrary numbers for other dice. However, many outcomes have

more than one six and are hence counted more than once.

One should also note that both probabilities are barely diﬀerent from 1/2, so de M´er´e was

gambling a lot to be able to notice the diﬀerence.

2 COMBINATORICS 5

Permutations

Assume you have n objects. The number of ways to ﬁll n ordered slots with them is

n (n −1) . . . 2 1 = n!,

while the number of ways to ﬁll k ≤ n ordered slots is

n(n −1) . . . (n −k + 1) =

n!

(n −k)!

.

Example 2.6. Shuﬄe a deck of cards.

• P(top card is an Ace) =

1

13

=

4·51!

52!

.

• P(all cards of the same suit end up next to each other) =

4!·(13!)

4

52!

≈ 4.510

−28

. This event

never happens in practice.

• P(hearts are together) =

40!13!

52!

= 6 10

−11

.

To compute the last probability, for example, collect all hearts into a block; a good event is

speciﬁed by ordering 40 items (the block of hearts plus 39 other cards) and ordering the hearts

within their block.

Before we go on to further examples, let us agree that when the text says without further

elaboration, that a random choice is made, this means that all available choices are equally

likely. Also, in the next problem (and in statistics in general) sampling with replacement refers

to choosing, at random, an object from a population, noting its properties, putting the object

back into the population, and then repeating. Sampling without replacement omits the putting

back part.

Example 2.7. A bag has 6 pieces of paper, each with one of the letters, E, E, P, P, P, and

R, on it. Pull 6 pieces at random out of the bag (1) without, and (2) with replacement. What

is the probability that these pieces, in order, spell PEPPER?

There are two problems to solve. For sampling without replacement:

1. An outcome is an ordering of the pieces of paper E

1

E

2

P

1

P

2

P

3

R.

2. The number of outcomes thus is 6!.

3. The number of good outcomes is 3!2!.

The probability is

3!2!

6!

=

1

60

.

For sampling with replacement, the answer is

3

3

·2

2

6

6

=

1

2·6

3

, quite a lot smaller.

2 COMBINATORICS 6

Example 2.8. Sit 3 men and 3 women at random (1) in a row of chairs and (2) around a table.

Compute P(all women sit together). In case (2), also compute P(men and women alternate).

In case (1), the answer is

4!3!

6!

=

1

5

.

For case (2), pick a man, say John Smith, and sit him ﬁrst. Then, we reduce to a row

problem with 5! outcomes; the number of good outcomes is 3! 3!. The answer is

3

10

. For the

last question, the seats for the men and women are ﬁxed after John Smith takes his seat and so

the answer is

3!2!

5!

=

1

10

.

Example 2.9. A group consists of 3 Norwegians, 4 Swedes, and 5 Finns, and they sit at random

around a table. What is the probability that all groups end up sitting together?

The answer is

3!·4!·5!·2!

11!

. Pick, say, a Norwegian (Arne) and sit him down. Here is how you

count the good events. There are 3! choices for ordering the group of Norwegians (and then sit

them down to one of both sides of Arne, depending on the ordering). Then, there are 4! choices

for arranging the Swedes and 5! choices for arranging the Finns. Finally, there are 2! choices to

order the two blocks of Swedes and Finns.

Combinations

Let

_

n

k

_

be the number of diﬀerent subsets with k elements of a set with n elements. Then,

_

n

k

_

=

n(n −1) . . . (n −k + 1)

k!

=

n!

k!(n −k)!

To understand why the above is true, ﬁrst choose a subset, then order its elements in a row

to ﬁll k ordered slots with elements from the set with n objects. Then, distinct choices of a

subset and its ordering will end up as distinct orderings. Therefore,

_

n

k

_

k! = n(n −1) . . . (n −k + 1).

We call

_

n

k

_

= n choose k or a binomial coeﬃcient (as it appears in the binomial theorem:

(x +y)

n

=

n

k=0

_

n

k

_

x

k

y

n−k

). Also, note that

_

n

0

_

=

_

n

n

_

= 1 and

_

n

k

_

=

_

n

n −k

_

.

The multinomial coeﬃcients are more general and are deﬁned next.

2 COMBINATORICS 7

The number of ways to divide a set of n elements into r (distinguishable) subsets of

n

1

, n

2

, . . . , n

r

elements, where n

1

+. . . +n

r

= n, is denoted by

_

n

n

1

...n

r

_

and

_

n

n

1

. . . n

r

_

=

_

n

n

1

__

n −n

1

n

2

__

n −n

1

−n

2

n

3

_

. . .

_

n −n

1

−. . . −n

r−1

n

r

_

=

n!

n

1

!n

2

! . . . n

r

!

To understand the slightly confusing word distinguishable, just think of painting n

1

elements

red, then n

2

diﬀerent elements blue, etc. These colors distinguish among the diﬀerent subsets.

Example 2.10. A fair coin is tossed 10 times. What is the probability that we get exactly 5

Heads?

P(exactly 5 Heads) =

_

10

5

_

2

10

≈ 0.2461,

as one needs to choose the position of the ﬁve heads among 10 slots to ﬁx a good outcome.

Example 2.11. We have a bag that contains 100 balls, 50 of them red and 50 blue. Select 5

balls at random. What is the probability that 3 are blue and 2 are red?

The number of outcomes is

_

100

5

_

and all of them are equally likely, which is a reasonable

interpretation of “select 5 balls at random.” The answer is

P(3 are blue and 2 are red) =

_

50

3

__

50

2

_

_

100

5

_ ≈ 0.3189

Why should this probability be less than a half? The probability that 3 are blue and 2 are

red is equal to the probability of 3 are red and 2 are blue and they cannot both exceed

1

2

, as

their sum cannot be more than 1. It cannot be exactly

1

2

either, because other possibilities (such

as all 5 chosen balls red) have probability greater than 0.

Example 2.12. Here we return to Example 1.1 and solve it more slowly. Shuﬄe a standard

deck of 52 cards and deal 13 cards to each of the 4 players.

What is the probability that each player gets an Ace? We will solve this problem in two

ways to emphasize that you often have a choice in your set of equally likely outcomes.

The ﬁrst way uses an outcome to be an ordering of 52 cards:

1. There are 52! equally likely outcomes.

2. Let the ﬁrst 13 cards go to the ﬁrst player, the second 13 cards to the second player,

etc. Pick a slot within each of the four segments of 13 slots for an Ace. There are 13

4

possibilities to choose these four slots for the Aces.

3. The number of choices to ﬁll these four positions with (four diﬀerent) Aces is 4!.

4. Order the rest of the cards in 48! ways.

2 COMBINATORICS 8

The probability, then, is

13

4

4!48!

52!

.

The second way, via a small leap of faith, assumes that each set of the four positions of the

four Aces among the 52 shuﬄed cards is equally likely. You may choose to believe this intuitive

fact or try to write down a formal proof: the number of permutations that result in a given set

F of four positions is independent of F. Then:

1. The outcomes are the positions of the 4 Aces among the 52 slots for the shuﬄed cards of

the deck.

2. The number of outcomes is

_

52

4

_

.

3. The number of good outcomes is 13

4

, as we need to choose one slot among 13 cards that

go to the ﬁrst player, etc.

The probability, then, is

13

4

(

52

4

)

, which agrees with the number we obtained the ﬁrst way.

Let us also compute P(one person has all four Aces). Doing the problem the second way,

we get

1. The number of outcomes is

_

52

4

_

.

2. To ﬁx a good outcome, pick one player (

_

4

1

_

choices) and pick four slots for the Aces for

that player (

_

13

4

_

choices).

The answer, then, is

(

4

1

)(

13

4

)

(

52

4

)

= 0.0106, a lot smaller than the probability of each player getting

an Ace.

Example 2.13. Roll a die 12 times. P(each number appears exactly twice)?

1. An outcome consists of ﬁlling each of the 12 slots (for the 12 rolls) with an integer between

1 and 6 (the outcome of the roll).

2. The number of outcomes, therefore, is 6

12

.

3. To ﬁx a good outcome, pick two slots for 1, then pick two slots for 2, etc., with

_

12

2

__

10

2

_

. . .

_

2

2

_

choices.

The probability, then, is

(

12

2

)(

10

2

)...(

2

2

)

6

12

.

What is P(1 appears exactly 3 times, 2 appears exactly 2 times)?

To ﬁx a good outcome now, pick three slots for 1, two slots for 2, and ﬁll the remaining 7

slots by numbers 3, . . . , 6. The number of choices is

_

12

3

__

9

2

_

4

7

and the answer is

(

12

3

)(

9

2

)4

7

6

12

.

Example 2.14. We have 14 rooms and 4 colors, white, blue, green, and yellow. Each room is

painted at random with one of the four colors. There are 4

14

equally likely outcomes, so, for

2 COMBINATORICS 9

example,

P(5 white, 4 blue, 3 green, 2 yellow rooms) =

_

14

5

__

9

4

__

5

3

__

2

2

_

4

14

.

Example 2.15. A middle row on a plane seats 7 people. Three of them order chicken (C) and

the remaining four pasta (P). The ﬂight attendant returns with the meals, but has forgotten

who ordered what and discovers that they are all asleep, so she puts the meals in front of them

at random. What is the probability that they all receive correct meals?

A reformulation makes the problem clearer: we are interested in P(3 people who ordered C

get C). Let us label the people 1, . . . , 7 and assume that 1, 2, and 3 ordered C. The outcome

is a selection of 3 people from the 7 who receive C, the number of them is

_

7

3

_

, and there is a

single good outcome. So, the answer is

1

(

7

3

)

=

1

35

. Similarly,

P(no one who ordered C gets C) =

_

4

3

_

_

7

3

_ =

4

35

,

P(a single person who ordered C gets C) =

3

_

4

2

_

_

7

3

_ =

18

35

,

P(two persons who ordered C get C) =

_

3

2

_

4

_

7

3

_ =

12

35

.

Problems

1. A California licence plate consists of a sequence of seven symbols: number, letter, letter,

letter, number, number, number, where a letter is any one of 26 letters and a number is one

among 0, 1, . . . , 9. Assume that all licence plates are equally likely. (a) What is the probability

that all symbols are diﬀerent? (b) What is the probability that all symbols are diﬀerent and the

ﬁrst number is the largest among the numbers?

2. A tennis tournament has 2n participants, n Swedes and n Norwegians. First, n people are

chosen at random from the 2n (with no regard to nationality) and then paired randomly with

the other n people. Each pair proceeds to play one match. An outcome is a set of n (ordered)

pairs, giving the winner and the loser in each of the n matches. (a) Determine the number of

outcomes. (b) What do you need to assume to conclude that all outcomes are equally likely?

(c) Under this assumption, compute the probability that all Swedes are the winners.

3. A group of 18 Scandinavians consists of 5 Norwegians, 6 Swedes, and 7 Finns. They are seated

at random around a table. Compute the following probabilities: (a) that all the Norwegians

sit together, (b) that all the Norwegians and all the Swedes sit together, and (c) that all the

Norwegians, all the Swedes, and all the Finns sit together.

2 COMBINATORICS 10

4. A bag contains 80 balls numbered 1, . . . , 80. Before the game starts, you choose 10 diﬀerent

numbers from amongst 1, . . . , 80 and write them on a piece of paper. Then 20 balls are selected

(without replacement) out of the bag at random. (a) What is the probability that all your

numbers are selected? (b) What is the probability that none of your numbers is selected? (c)

What is the probability that exactly 4 of your numbers are selected?

5. A full deck of 52 cards contains 13 hearts. Pick 8 cards from the deck at random (a) without

replacement and (b) with replacement. In each case compute the probability that you get no

hearts.

Solutions to problems

1. (a)

10·9·8·7·26·25·24

10

4

·26

3

, (b) the answer in (a) times

1

4

.

2. (a) Divide into two groups (winners and losers), then pair them:

_

2n

n

_

n!. Alternatively, pair

the ﬁrst player, then the next available player, etc., and, then, at the end choose the winners

and the losers: (2n − 1)(2n − 3) 3 1 2

n

. (Of course, these two expressions are the same.)

(b) All players are of equal strength, equally likely to win or lose any match against any other

player. (c) The number of good events is n!, the choice of a Norwegian paired with each Swede.

3. (a)

13!·5!

17!

, (b)

8!·5!·6!

17!

, (c)

2!·7!·6!·5!

17!

.

4. (a)

(

70

10

)

(

80

20

)

, (b)

(

70

20

)

(

80

20

)

, (c)

(

10

4

)(

70

16

)

(

80

20

)

.

5. (a)

(

39

8

)

(

52

8

)

, (b)

_

3

4

_

8

.

3 AXIOMS OF PROBABILITY 11

3 Axioms of Probability

The question here is: how can we mathematically deﬁne a random experiment? What we have

are outcomes (which tell you exactly what happens), events (sets containing certain outcomes),

and probability (which attaches to every event the likelihood that it happens). We need to

agree on which properties these objects must have in order to compute with them and develop

a theory.

When we have ﬁnitely many equally likely outcomes all is clear and we have already seen

many examples. However, as is common in mathematics, inﬁnite sets are much harder to deal

with. For example, we will soon see what it means to choose a random point within a unit circle.

On the other hand, we will also see that there is no way to choose at random a positive integer

— remember that “at random” means all choices are equally likely, unless otherwise speciﬁed.

Finally, a probability space is a triple (Ω, T, P). The ﬁrst object Ω is an arbitrary set,

repesenting the set of outcomes, sometimes called the sample space.

The second object T is a collection of events, that is, a set of subsets of Ω. Therefore, an

event A ∈ T is necessarily a subset of Ω. Can we just say that each A ⊂ Ω is an event? In this

course you can assume so without worry, although there are good reasons for not assuming so

in general ! I will give the deﬁnition of what properties T needs to satisfy, but this is only for

illustration and you should take a course in measure theory to understand what is really going

on. Namely, T needs to be a σ-algebra, which means (1) ∅ ∈ T, (2) A ∈ T =⇒ A

c

∈ T, and

(3) A

1

, A

2

, ∈ T =⇒ ∪

∞

i=1

A

i

∈ T.

What is important is that you can take the complement A

c

of an event A (i.e., A

c

happens

when A does not happen), unions of two or more events (i.e., A

1

∪ A

2

happens when either A

1

or A

2

happens), and intersections of two or more events (i.e., A

1

∩ A

2

happens when both A

1

and A

2

happen). We call events A

1

, A

2

, . . . pairwise disjoint if A

i

∩ A

j

= ∅ if i ,= j — that is,

at most one of these events can occur.

Finally, the probability P is a number attached to every event A and satisﬁes the following

three axioms:

Axiom 1. For every event A, P(A) ≥ 0.

Axiom 2. P(Ω) = 1.

Axiom 3. If A

1

, A

2

, . . . is a sequence of pairwise disjoint events, then

P(

∞

_

i=1

A

i

) =

∞

i=1

P(A

i

).

Whenever we have an abstract deﬁnition such as this one, the ﬁrst thing to do is to look for

examples. Here are some.

Example 3.1. Ω = ¦1, 2, 3, 4, 5, 6¦,

P(A) =

(number of elements in A)

6

.

3 AXIOMS OF PROBABILITY 12

The random experiment here is rolling a fair die. Clearly, this can be generalized to any ﬁnite

set with equally likely outcomes.

Example 3.2. Ω = ¦1, 2, . . .¦ and you have numbers p

1

, p

2

, . . . ≥ 0 with p

1

+p

2

+. . . = 1. For

any A ⊂ Ω,

P(A) =

i∈A

p

i

.

For example, toss a fair coin repeatedly until the ﬁrst Heads. Your outcome is the number of

tosses. Here, p

i

=

1

2

i

.

Note that p

i

cannot be chosen to be equal, as you cannot make the sum of inﬁnitely many

equal numbers to be 1.

Example 3.3. Pick a point from inside the unit circle centered at the origin. Here, Ω = ¦(x, y) :

x

2

+y

2

< 1¦ and

P(A) =

(area of A)

π

.

It is important to observe that if A is a singleton (a set whose element is a single point), then

P(A) = 0. This means that we cannot attach the probability to outcomes — you hit a single

point (or even a line) with probability 0, but a “fatter” set with positive area you hit with

positive probability.

Another important theoretical remark: this is a case where A cannot be an arbitrary subset

of the circle — for some sets area cannot be deﬁned!

Consequences of the axioms

(C0) P(∅) = 0.

Proof . In Axiom 3, take all sets to be ∅.

(C1) If A

1

∩ A

2

= ∅, then P(A

1

∪ A

2

) = P(A

1

) +P(A

2

).

Proof . In Axiom 3, take all sets other than ﬁrst two to be ∅.

(C2)

P(A

c

) = 1 −P(A).

Proof . Apply (C1) to A

1

= A, A

2

= A

c

.

3 AXIOMS OF PROBABILITY 13

(C3) 0 ≤ P(A) ≤ 1.

Proof . Use that P(A

c

) ≥ 0 in (C2).

(C4) If A ⊂ B, P(B) = P(A) +P(B ¸ A) ≥ P(A).

Proof . Use (C1) for A

1

= A and A

2

= B ¸ A.

(C5) P(A∪ B) = P(A) +P(B) −P(A∩ B).

Proof . Let P(A¸ B) = p

1

, P(A∩B) = p

12

and P(B¸ A) = p

2

, and note that A¸ B, A∩B, and

B¸A are pairwise disjoint. Then P(A) = p

1

+p

12

, P(B) = p

2

+p

12

, and P(A∪B) = p

1

+p

2

+p

12

.

(C6)

P(A

1

∪ A

2

∪ A

3

) = P(A

1

) +P(A

2

) +P(A

3

)

−P(A

1

∩ A

2

) −P(A

1

∩ A

3

) −P(A

2

∩ A

2

)

+P(A

1

∩ A

2

∩ A

3

)

and more generally

P(A

1

∪ ∪ A

n

) =

n

i=1

P(A

i

) −

1≤i<j≤n

P(A

i

∩ A

j

) +

1≤i<j<k≤n

P(A

i

∩ A

j

∩ A

k

) +. . .

+ (−1)

n−1

P(A

1

∩ ∩ A

n

).

This is called the inclusion-exclusion formula and is commonly used when it is easier to compute

probabilities of intersections than of unions.

Proof . We prove this only for n = 3. Let p

1

= P(A

1

∩ A

c

2

∩ A

c

3

), p

2

= P(A

c

1

∩ A

2

∩ A

c

3

),

p

3

= P(A

c

1

∩ A

c

2

∩ A

3

), p

12

= P(A

1

∩ A

2

∩ A

c

3

), p

13

= P(A

1

∩ A

c

2

∩ A

3

), p

23

= P(A

c

1

∩ A

2

∩ A

3

),

and p

123

= P(A

1

∩ A

2

∩ A

3

). Again, note that all sets are pairwise disjoint and that the right

hand side of (6) is

(p

1

+p

12

+p

13

+p

123

) + (p

2

+p

12

+p

23

+p

123

) + (p

3

+p

13

+p

23

+p

123

)

−(p

12

+p

123

) −(p

13

+p

123

) −(p

23

+p

123

)

+p

123

= p

1

+p

2

+p

3

+p

12

+p

13

+p

23

+p

123

= P(A

1

∪ A

2

∪ A

3

).

3 AXIOMS OF PROBABILITY 14

Example 3.4. Pick an integer in [1, 1000] at random. Compute the probability that it is

divisible neither by 12 nor by 15.

The sample space consists of the 1000 integers between 1 and 1000 and let A

r

be the subset

consisting of integers divisible by r. The cardinality of A

r

is ¸1000/r|. Another simple fact

is that A

r

∩ A

s

= A

lcm(r,s)

, where lcm stands for the least common multiple. Our probability

equals

1 −P(A

12

∪ A

15

) = 1 −P(A

12

) −P(A

15

) +P(A

12

∩ A

15

)

= 1 −P(A

12

) −P(A

15

) +P(A

60

)

= 1 −

83

1000

−

66

1000

+

16

1000

= 0.867.

Example 3.5. Sit 3 men and 4 women at random in a row. What is the probability that either

all the men or all the women end up sitting together?

Here, A

1

= ¦all women sit together¦, A

2

= ¦all men sit together¦, A

1

∩ A

2

= ¦both women

and men sit together¦, and so the answer is

P(A

1

∪ A

2

) = P(A

1

) +P(A

2

) −P(A

1

∩ A

2

) =

4! 4!

7!

+

5! 3!

7!

−

2! 3! 4!

7!

.

Example 3.6. A group of 3 Norwegians, 4 Swedes, and 5 Finns is seated at random around a

table. Compute the probability that at least one of the three groups ends up sitting together.

Deﬁne A

N

= ¦Norwegians sit together¦ and similarly A

S

, A

F

. We have

P(A

N

) =

3! 9!

11!

, P(A

S

) =

4! 8!

11!

, P(A

N

) =

5! 7!

11!

,

P(A

N

∩ A

S

) =

3! 4! 6!

11!

, P(A

N

∩ A

F

) =

3! 5! 5!

11!

, P(A

S

∩ A

F

) =

4! 5! 4!

11!

,

P(A

N

∩ A

S

∩ A

F

) =

3! 4! 5! 2!

11!

.

Therefore,

P(A

N

∪ A

S

∪ A

F

) =

3! 9! + 4! 8! + 5! 7! −3! 4! 6! −3! 5! 5! −4! 5! 4! + 3! 4! 5! 2!

11!

.

Example 3.7. Matching problem. A large company with n employees has a scheme according

to which each employee buys a Christmas gift and the gifts are then distributed at random to

the employees. What is the probability that someone gets his or her own gift?

Note that this is diﬀerent from asking, assuming that you are one of the employees, for the

probability that you get your own gift, which is

1

n

.

Let A

i

= ¦ith employee gets his or her own gift¦. Then, what we are looking for is

P(

n

_

i=1

A

i

).

3 AXIOMS OF PROBABILITY 15

We have

P(A

i

) =

1

n

(for all i),

P(A

i

∩ A

j

) =

(n −2)!

n!

=

1

n(n −1)

(for all i < j),

P(A

i

∩ A

j

∩ A

k

) =

(n −3)!

n!

=

1

n(n −1)(n −2)

(for all i < j < k),

. . .

P(A

1

∩ ∩ A

n

) =

1

n!

.

Therefore, our answer is

n

1

n

−

_

n

2

_

1

n(n −1)

+

_

n

3

_

1

n(n −1)(n −2)

−. . . + (−1)

n−1

1

n!

= 1 −

1

2!

+

1

3!

+. . . + (−1)

n−1

1

n!

→ 1 −

1

e

≈ 0.6321 (as n →∞).

Example 3.8. Birthday Problem. Assume that there are k people in the room. What is

the probability that there are two who share a birthday? We will ignore leap years, assume

all birthdays are equally likely, and generalize the problem a little: from n possible birthdays,

sample k times with replacement.

P(a shared birthday) = 1 −P(no shared birthdays) = 1 −

n (n −1) (n −k + 1)

n

k

.

When n = 365, the lowest k for which the above exceeds 0.5 is, famously, k = 23. Some values

are given in the following table.

k prob. for n = 365

10 0.1169

23 0.5073

41 0.9032

57 0.9901

70 0.9992

Occurences of this problem are quite common in various contexts, so we give another example.

Each day, the Massachusetts lottery chooses a four digit number at random, with leading 0’s

allowed. Thus, this is sampling with replacement from among n = 10, 000 choices each day. On

February 6, 1978, the Boston Evening Globe reported that

“During [the lottery’s] 22 months’ existence [...], no winning number has ever been

repeated. [David] Hughes, the expert [and a lottery oﬃcial] doesn’t expect to see

duplicate winners until about half of the 10, 000 possibilities have been exhausted.”

3 AXIOMS OF PROBABILITY 16

What would an informed reader make of this? Assuming k = 660 days, the probability of no

repetition works out to be about 2.19 10

−10

, making it a remarkably improbable event. What

happened was that Mr. Hughes, not understanding the Birthday Problem, did not check for

repetitions, conﬁdent that there would not be any. Apologetic lottery oﬃcials announced later

that there indeed were repetitions.

Example 3.9. Coupon Collector Problem. Within the context of the previous problem, assume

that k ≥ n and compute P(all n birthdays are represented).

More often, this is described in terms of cereal boxes, each of which contains one of n diﬀerent

cards (coupons), chosen at random. If you buy k boxes, what is the probability that you have

a complete collection?

When k = n, our probability is

n!

n

n

. More generally, let

A

i

= ¦ith birthday is missing¦.

Then, we need to compute

1 −P(

n

_

i=1

A

i

).

Now,

P(A

i

) =

(n −1)

k

n

k

(for all i)

P(A

i

∩ A

j

) =

(n −2)

k

n

k

(for all i < j)

. . .

P(A

1

∩ ∩ A

n

) = 0

and our answer is

1 −n

_

n −1

n

_

k

+

_

n

2

__

n −2

n

_

k

−. . . + (−1)

n−1

_

n

n −1

__

1

n

_

k

=

n

i=0

_

n

i

_

(−1)

i

_

1 −

i

n

_

k

.

This must be

n!

n

n

for k = n, and 0 when k < n, neither of which is obvious from the formula.

Neither will you, for large n, get anything close to the correct numbers when k ≤ n if you try to

compute the probabilities by computer, due to the very large size of summands with alternating

signs and the resulting rounding errors. We will return to this problem later for a much more

eﬃcient computational method, but some numbers are in the two tables below. Another remark

for those who know a lot of combinatorics: you will perhaps notice that the above probability

is

n!

n

k

S

k,n

, where S

k,n

is the Stirling number of the second kind.

3 AXIOMS OF PROBABILITY 17

k prob. for n = 6

13 0.5139

23 0.9108

36 0.9915

k prob. for n = 365

1607 0.0101

1854 0.1003

2287 0.5004

2972 0.9002

3828 0.9900

4669 0.9990

More examples with combinatorial ﬂavor

We will now do more problems which would rather belong to the previous chapter, but are a

little harder, so we do them here instead.

Example 3.10. Roll a die 12 times. Compute the probability that a number occurs 6 times

and two other numbers occur three times each.

The number of outcomes is 6

12

. To count the number of good outcomes:

1. Pick the number that occurs 6 times:

_

6

1

_

= 6 choices.

2. Pick the two numbers that occur 3 times each:

_

5

2

_

choices.

3. Pick slots (rolls) for the number that occurs 6 times:

_

12

6

_

choices.

4. Pick slots for one of the numbers that occur 3 times:

_

6

3

_

choices.

Therefore, our probability is

6(

5

2

)(

12

6

)(

6

3

)

6

12

.

Example 3.11. You have 10 pairs of socks in the closet. Pick 8 socks at random. For every i,

compute the probability that you get i complete pairs of socks.

There are

_

20

8

_

outcomes. To count the number of good outcomes:

1. Pick i pairs of socks from the 10:

_

10

i

_

choices.

2. Pick pairs which are represented by a single sock:

_

10−i

8−2i

_

choices.

3. Pick a sock from each of the latter pairs: 2

8−2i

choices.

Therefore, our probability is

2

8−2i

(

10−i

8−2i

)(

10

i

)

(

20

8

)

.

3 AXIOMS OF PROBABILITY 18

Example 3.12. Poker Hands. In the deﬁnitions, the word value refers to A, K, Q, J, 10, 9,

8, 7, 6, 5, 4, 3, 2. This sequence orders the cards in descending consecutive values, with one

exception: an Ace may be regarded as 1 for the purposes of making a straight (but note that,

for example, K, A, 1, 2, 3 is not a valid straight sequence — A can only begin or end a straight).

From the lowest to the highest, here are the hands:

(a) one pair: two cards of the same value plus 3 cards with diﬀerent values

J♠ J♣ 9♥ Q♣ 4♠

(b) two pairs: two pairs plus another card of diﬀerent value

J♠ J♣ 9♥ 9♣ 3♠

(c) three of a kind: three cards of the same value plus two with diﬀerent values

Q♠ Q♣ Q♥ 9♣ 3♠

(d) straight: ﬁve cards with consecutive values

5♥ 4♣ 3♣ 2♥ A♠

(e) ﬂush: ﬁve cards of the same suit

K♣ 9♣ 7♣ 6♣ 3♣

(f) full house: a three of a kind and a pair

J♣ J♦ J♥ 3♣ 3♠

(g) four of a kind: four cards of the same value

K♣ K♦ K♥ K♣ 10♠

(e) straight ﬂush: ﬁve cards of the same suit with consecutive values

A♠ K♠ Q♠ J♠ 10♠

Here are the probabilities:

hand no. combinations approx. prob.

one pair 13

_

12

3

_

_

4

2

_

4

3

0.422569

two pairs

_

13

2

_

11

_

4

2

_

_

4

2

_

4 0.047539

three of a kind 13

_

12

2

_

_

4

3

_

4

2

0.021128

straight 10 4

5

0.003940

ﬂush 4

_

13

5

_

0.001981

full house 13 12

_

4

3

_

_

4

2

_

0.001441

four of a kind 13 12 4 0.000240

straight ﬂush 10 4 0.000015

3 AXIOMS OF PROBABILITY 19

Note that the probabilities of a straight and a ﬂush above include the possibility of a straight

ﬂush.

Let us see how some of these are computed. First, the number of all outcomes is

_

52

5

_

=

2, 598, 960. Then, for example, for the three of a kind, the number of good outcomes may be

obtained by listing the number of choices:

1. Choose a value for the triple: 13.

2. Choose the values of other two cards:

_

12

2

_

.

3. Pick three cards from the four of the same chosen value:

_

4

3

_

.

4. Pick a card (i.e., the suit) from each of the two remaining values: 4

2

.

One could do the same for one pair:

1. Pick a number for the pair: 13.

2. Pick the other three numbers:

_

12

3

_

3. Pick two cards from the value of the pair:

_

4

2

_

.

4. Pick the suits of the other three values: 4

3

And for the ﬂush:

1. Pick a suit: 4.

2. Pick ﬁve numbers:

_

13

5

_

Our ﬁnal worked out case is straight ﬂush:

1. Pick a suit: 4.

2. Pick the beginning number: 10.

We end this example by computing the probability of not getting any hand listed above,

that is,

P(nothing) = P(all cards with diﬀerent values) −P(straight or ﬂush)

=

_

13

5

_

4

5

_

52

5

_ −(P(straight) +P(ﬂush) −P(straight ﬂush))

=

_

13

5

_

4

5

−10 4

5

−4

_

13

5

_

+ 40

_

52

5

_

≈ 0.5012.

3 AXIOMS OF PROBABILITY 20

Example 3.13. Assume that 20 Scandinavians, 10 Finns, and 10 Danes, are to be distributed

at random into 10 rooms, 2 per room. What is the probability that exactly 2i rooms are mixed,

i = 0, . . . 5?

This is an example when careful thinking about what the outcomes should be really pays

oﬀ. Consider the following model for distributing the Scandinavians into rooms. First arrange

them at random into a row of 20 slots S1, S2, . . . , S20. Assume that room 1 takes people in

slots S1, S2, so let us call these two slots R1. Similarly, room 2 takes people in slots S3, S4, so

let us call these two slots R2, etc.

Now, it is clear that we only need to keep track of the distribution of 10 D’s into the 20 slots,

corresponding to the positions of the 10 Danes. Any such distribution constitutes an outcome

and they are equally likely. Their number is

_

20

10

_

.

To get 2i (for example, 4) mixed rooms, start by choosing 2i (ex., 4) out of the 10 rooms

which are going to be mixed; there are

_

10

2i

_

choices. You also need to decide into which slot in

each of the 2i chosen mixed rooms the D goes, for 2

2i

choices.

Once these two choices are made, you still have 10 −2i (ex., 6) D’s to distribute into 5 −i

(ex., 3) rooms, as there are two D’s in each of these rooms. For this, you need to choose 5 − i

(ex., 3) rooms from the remaining 10 − 2i (ex., 6), for

_

10−2i

5−i

_

choices, and this choice ﬁxes a

good outcome.

The ﬁnal answer, therefore, is

_

10

2i

_

2

2i

_

10−2i

5−i

_

_

20

10

_ .

Problems

1. Roll a single die 10 times. Compute the following probabilities: (a) that you get at least one

6; (b) that you get at least one 6 and at least one 5; (c) that you get three 1’s, two 2’s, and ﬁve

3’s.

2. Three married couples take seats around a table at random. Compute P(no wife sits next to

her husband).

3. A group of 20 Scandinavians consists of 7 Swedes, 3 Finns, and 10 Norwegians. A committee

of ﬁve people is chosen at random from this group. What is the probability that at least one of

the three nations is not represented on the committee?

4. Choose each digit of a 5 digit number at random from digits 1, . . . 9. Compute the probability

that no digit appears more than twice.

5. Roll a fair die 10 times. (a) Compute the probability that at least one number occurs exactly

6 times. (b) Compute the probability that at least one number occurs exactly once.

3 AXIOMS OF PROBABILITY 21

6. A lottery ticket consists of two rows, each containing 3 numbers from 1, 2, . . . , 50. The

drawing consists of choosing 5 diﬀerent numbers from 1, 2, . . . , 50 at random. A ticket wins if

its ﬁrst row contains at least two of the numbers drawn and its second row contains at least two

of the numbers drawn. The four examples below represent the four types of tickets:

Ticket 1 Ticket 2 Ticket 3 Ticket 4

1 2 3

4 5 6

1 2 3

1 2 3

1 2 3

2 3 4

1 2 3

3 4 5

For example, if the numbers 1, 3, 5, 6, 17 are drawn, then Ticket 1, Ticket 2, and Ticket 4 all

win, while Ticket 3 loses. Compute the winning probabilities for each of the four tickets.

Solutions to problems

1. (a) 1 −(5/6)

10

. (b) 1 −2 (5/6)

10

+ (2/3)

10

. (c)

_

10

3

__

7

2

_

6

−10

.

2. The complement is the union of the three events A

i

= ¦couple i sits together¦, i = 1, 2, 3.

Moreover,

P(A

1

) =

2

5

= P(A

2

) = P(A

3

),

P(A

1

∩ A

2

) = P(A

1

∩ A

3

) = P(A

2

∩ A

3

) =

3! 2! 2!

5!

=

1

5

,

P(A

1

∩ A

2

∩ A

3

) =

2! 2! 2! 2!

5!

=

2

15

.

For P(A

1

∩A

2

), for example, pick a seat for husband

3

. In the remaining row of 5 seats, pick the

ordering for couple

1

, couple

2

, and wife

3

, then the ordering of seats within each of couple

1

and

couple

2

. Now, by inclusion-exclusion,

P(A

1

∪ A

2

∪ A

3

) = 3

2

5

−3

1

5

+

2

15

=

11

15

,

and our answer is

4

15

.

3. Let A

1

= the event that Swedes are not represented, A

2

= the event that Finns are not

represented, and A

3

= the event that Norwegians are not represented.

P(A

1

∪ A

2

∪ A

3

) = P(A

1

) +P(A

2

) +P(A

3

) −P(A

1

∩ A

2

) −P(A

1

∩ A

3

) −P(A

2

∩ A

3

)

+P(A

1

∩ A

2

∩ A

3

)

=

1

_

20

5

_

__

13

5

_

+

_

17

5

_

+

_

10

5

_

−

_

10

5

_

−0 −

_

7

5

_

+ 0

_

4. The number of bad events is 9

_

5

3

_

8

2

+ 9

_

5

4

_

8 + 9. The ﬁrst term is the number of

numbers in which a digit appears 3 times, but no digit appears 4 times: choose a digit, choose 3

3 AXIOMS OF PROBABILITY 22

positions ﬁlled by it, then ﬁll the remaining position. The second term is the number of numbers

in which a digit appears 4 times, but no digit appears 5 times, and the last term is the number

of numbers in which a digit appears 5 times. The answer then is

1 −

9

_

5

3

_

8

2

+ 9

_

5

4

_

8 + 9

9

5

.

5. (a) Let A

i

be the event that the number i appears exactly 6 times. As A

i

are pairwise

disjoint,

P(A

1

∪ A

2

∪ A

3

∪ A

4

∪ A

5

∪ A

6

) = 6

_

10

6

_

5

4

6

10

.

(b) (a) Now, A

i

is the event that the number i appears exactly once. By inclusion-exclusion,

P(A

1

∪ A

2

∪ A

3

∪ A

4

∪ A

5

∪ A

6

)

= 6P(A

1

)

−

_

6

2

_

P(A

1

∩ A

2

)

+

_

6

3

_

P(A

1

∩ A

2

∩ A

3

)

−

_

6

4

_

P(A

1

∩ A

2

∩ A

3

∩ A

4

)

+

_

6

5

_

P(A

1

∩ A

2

∩ A

3

∩ A

4

∩ A

5

)

−

_

6

6

_

P(A

1

∩ A

2

∩ A

3

∩ A

4

∩ A

5

∩ A

6

)

= 6 10

5

9

6

10

−

_

6

2

_

10 9

4

8

6

10

+

_

6

3

_

10 9 8

3

7

6

10

−

_

6

4

_

10 9 8 7

2

6

6

10

+

_

6

5

_

10 9 8 7 6

1

6

10

−0.

6. Below, a hit is shorthand for a chosen number.

P(ticket 1 wins) = P(two hits on each line) +P(two hits on one line, three on the other)

=

3 3 44 + 2 3

_

50

5

_ =

402

_

50

5

_,

3 AXIOMS OF PROBABILITY 23

and

P(ticket 2 wins) = P(two hits among 1, 2, 3) +P(three hits among 1, 2, 3)

=

3

_

47

3

_

+

_

47

2

_

_

50

5

_ =

49726

_

50

5

_ ,

and

P(ticket 3 wins) = P(2, 3 both hit) +P(1, 4 both hit and one of 2, 3 hit)

=

_

48

3

_

+ 2

_

46

2

_

_

50

5

_ =

19366

_

50

5

_ ,

and, ﬁnally,

P(ticket 4 wins) = P(3 hit, at least one additional hit on each line) +P(1, 2, 4, 5 all hit)

=

(4

_

45

2

_

+ 4 45 + 1) + 45

_

50

5

_ =

4186

_

50

5

_

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 24

4 Conditional Probability and Independence

Example 4.1. Assume that you have a bag with 11 cubes, 7 of which have a fuzzy surface and

4 are smooth. Out of the 7 fuzzy ones, 3 are red and 4 are blue; out of 4 smooth ones, 2 are red

and 2 are blue. So, there are 5 red and 6 blue cubes. Other than color and fuzziness, the cubes

have no other distinguishing characteristics.

You plan to pick a cube out of the bag at random, but forget to wear gloves. Before you

start your experiment, the probability that the selected cube is red is 5/11. Now, you reach into

the bag, grab a cube, and notice it is fuzzy (but you do not take it out or note its color in any

other way). Clearly, the probability should now change to 3/7!

Your experiment clearly has 11 outcomes. Consider the events R, B, F, S that the selected

cube is red, blue, fuzzy, and smooth, respectively. We observed that P(R) = 5/11. For the

probability of a red cube, conditioned on it being fuzzy, we do not have notation, so we introduce

it here: P(R[F) = 3/7. Note that this also equals

P(R ∩ F)

P(F)

=

P(the selected ball is red and fuzzy)

P(the selected ball is fuzzy)

.

This conveys the idea that with additional information the probability must be adjusted.

This is common in real life. Say bookies estimate your basketball team’s chances of winning a

particular game to be 0.6, 24 hours before the game starts. Two hours before the game starts,

however, it becomes known that your team’s star player is out with a sprained ankle. You

cannot expect that the bookies’ odds will remain the same and they change, say, to 0.3. Then,

the game starts and at half-time your team leads by 25 points. Again, the odds will change, say

to 0.8. Finally, when complete information (that is, the outcome of your experiment, the game

in this case) is known, all probabilities are trivial, 0 or 1.

For the general deﬁnition, take events A and B, and assume that P(B) > 0. The conditional

probability of A given B equals

P(A[B) =

P(A∩ B)

P(B)

.

Example 4.2. Here is a question asked on Wall Street job interviews. (This is the original

formulation; the macabre tone is not unusual for such interviews.)

“Let’s play a game of Russian roulette. You are tied to your chair. Here’s a gun, a revolver.

Here’s the barrel of the gun, six chambers, all empty. Now watch me as I put two bullets into

the barrel, into two adjacent chambers. I close the barrel and spin it. I put a gun to your head

and pull the trigger. Click. Lucky you! Now I’m going to pull the trigger one more time. Which

would you prefer: that I spin the barrel ﬁrst or that I just pull the trigger?”

Assume that the barrel rotates clockwise after the hammer hits and is pulled back. You are

given the choice between an unconditional and a conditional probability of death. The former,

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 25

if the barrel is spun again, remains 1/3. The latter, if the trigger is pulled without the extra

spin, equals the probability that the hammer clicked on an empty slot, which is next to a bullet

in the counterclockwise direction, and equals 1/4.

For a ﬁxed condition B, and acting on events A, the conditional probability Q(A) = P(A[B)

satisﬁes the three axioms in Chapter 3. (This is routine to check and the reader who is more the-

oretically inclined might view it as a good exercise.) Thus, Q is another probability assignment

and all consequences of the axioms are valid for it.

Example 4.3. Toss two fair coins, blindfolded. Somebody tells you that you tossed at least

one Heads. What is the probability that both tosses are Heads?

Here A = ¦both H¦, B = ¦at least one H¦, and

P(A[B) =

P(A∩ B)

P(B)

=

P(both H)

P(at least one H)

=

1

4

3

4

=

1

3

.

Example 4.4. Toss a coin 10 times. If you know (a) that exactly 7 Heads are tossed, (b) that

at least 7 Heads are tossed, what is the probability that your ﬁrst toss is Heads?

For (a),

P(ﬁrst toss H[exactly 7 H’s) =

_

9

6

_

1

2

10

_

10

7

_

1

2

10

=

7

10

.

Why is this not surprising? Conditioned on 7 Heads, they are equally likely to occur on any

given 7 tosses. If you choose 7 tosses out of 10 at random, the ﬁrst toss is included in your

choice with probability

7

10

.

For (b), the answer is, after canceling

1

2

10

,

_

9

6

_

+

_

9

7

_

+

_

9

8

_

+

_

9

9

_

_

10

7

_

+

_

10

8

_

+

_

10

9

_

+

_

10

10

_ =

65

88

≈ 0.7386.

Clearly, the answer should be a little larger than before, because this condition is more advan-

tageous for Heads.

Conditional probabilities are sometimes given, or can be easily determined, especially in

sequential random experiments. Then, we can use

P(A

1

∩ A

2

) = P(A

1

)P(A

2

[A

1

),

P(A

1

∩ A

2

∩ A

3

) = P(A

1

)P(A

2

[A

1

)P(A

3

[A

1

∩ A

2

),

etc.

Example 4.5. An urn contains 10 black and 10 white balls. Draw 3 (a) without replacement,

and (b) with replacement. What is the probability that all three are white?

We already know how to do part (a):

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 26

1. Number of outcomes:

_

20

3

_

.

2. Number of ways to select 3 balls out of 10 white ones:

_

10

3

_

.

Our probability is then

(

10

3

)

(

20

3

)

.

To do this problem another way, imagine drawing the balls sequentially. Then, we are

computing the probability of the intersection of the three events: P(1st ball is white, 2nd ball

is white, and 3rd ball is white). The relevant probabilities are:

1. P(1st ball is white) =

1

2

.

2. P(2nd ball is white[1st ball is white) =

9

19

.

3. P(3rd ball is white[1st two picked are white) =

8

18

.

Our probability is, then, the product

1

2

9

19

8

18

, which equals, as it must, what we obtained

before.

This approach is particularly easy in case (b), where the previous colors of the selected balls

do not aﬀect the probabilities at subsequent stages. The answer, therefore, is

_

1

2

_

3

.

Theorem 4.1. First Bayes’ formula. Assume that F

1

, . . . , F

n

are pairwise disjoint and that

F

1

∪ . . . ∪ F

n

= Ω, that is, exactly one of them always happens. Then, for an event A,

P(A) = P(F

1

)P(A[F

1

)+P(F

2

)P(A[F

2

)+. . .+P(F

n

)P(A[F

n

) .

Proof.

P(F

1

)P(A[F

1

) +P(F

2

)P(A[F

2

) +. . . +P(F

n

)P(A[F

n

) = P(A∩ F

1

) +. . . +P(A∩ F

n

)

= P((A∩ F

1

) ∪ . . . ∪ (A∩ F

n

))

= P(A∩ (F

1

∪ . . . ∪ F

n

))

= P(A∩ Ω) = P(A)

We call an instance of using this formula “computing the probability by conditioning on

which of the events F

i

happens.” The formula is useful in sequential experiments, when you

face diﬀerent experimental conditions at the second stage depending on what happens at the

ﬁrst stage. Quite often, there are just two events F

i

, that is, an event F and its complement

F

c

, and we are thus conditioning on whether F happens or not.

Example 4.6. Flip a fair coin. If you toss Heads, roll 1 die. If you toss Tails, roll 2 dice.

Compute the probability that you roll exactly one 6.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 27

Here you condition on the outcome of the coin toss, which could be Heads (event F) or Tails

(event F

c

). If A = ¦exactly one 6¦, then P(A[F) =

1

6

, P(A[F

c

) =

2·5

36

, P(F) = P(F

c

) =

1

2

and

so

P(A) = P(F)P(A[F) +P(F

c

)P(A[F

c

) =

2

9

.

Example 4.7. Roll a die, then select at random, without replacement, as many cards from the

deck as the number shown on the die. What is the probability that you get at least one Ace?

Here F

i

= ¦number shown on the die is i¦, for i = 1, . . . , 6. Clearly, P(F

i

) =

1

6

. If A is the

event that you get at least one Ace,

1. P(A[F

1

) =

1

13

,

2. In general, for i ≥ 1, P(A[F

i

) = 1 −

(

48

i

)

(

52

i

)

.

Therefore, by Bayes’ formula,

P(A) =

1

6

_

1

13

+ 1 −

_

48

2

_

_

52

2

_ + 1 −

_

48

3

_

_

52

3

_ + 1 −

_

48

4

_

_

52

4

_ + 1 −

_

48

5

_

_

52

5

_ + 1 −

_

48

6

_

_

52

6

_

_

.

Example 4.8. Coupon collector problem, revisited. As promised, we will develop a compu-

tationally much better formula than the one in Example 3.9. This will be another example of

conditioning, whereby you (1) reinterpret the problem as a sequential experiment and (2) use

Bayes’ formula with “conditions” F

i

being relevant events at the ﬁrst stage of the experiment.

Here is how it works in this example. Let p

k,r

be the probability that exactly r (out of a

total of n) birthdays are represented among k people; we call the event A. We will ﬁx n and let

k and r be variable. Note that p

k,n

is what we computed by the inclusion-exclusion formula.

At the ﬁrst stage you have k − 1 people; then the k’th person arrives on the scene. Let F

1

be the event that there are r birthdays represented among the k − 1 people and let F

2

be the

event that there are r − 1 birthdays represented among the k − 1 people. Let F

3

be the event

that any other number of birthdays occurs with k − 1 people. Clearly, P(A[F

3

) = 0, as the

newcomer contributes either 0 or 1 new birthdays. Moreover, P(A[F

1

) =

r

n

, the probability that

the newcomer duplicates one of the existing r birthdays, and P(A[F

2

) =

n−r+1

n

, the probability

that the newcomer does not duplicate any of the existing r −1 birthdays. Therefore,

p

k,r

= P(A) = P(A[F

1

)P(F

1

) +P(A[F

2

)P(F

2

) =

r

n

p

k−1,r

+

n −r + 1

n

p

k−1,r−1

,

for k, r ≥ 1, and this, together with the boundary conditions

p

0,0

= 1,

p

k,r

= 0, for 0 ≤ k < r,

p

k,0

= 0, for k > 0,

makes the computation fast and precise.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 28

Theorem 4.2. Second Bayes’ formula. Let F

1

, . . . , F

n

and A be as in Theorem 4.1. Then

P(F

j

[A) =

P(F

j

∩ A)

P(A)

=

P(A[F

j

)P(F

j

)

P(A[F

1

)P(F

1

) +. . . +P(A[F

n

)P(F

n

)

.

An event F

j

is often called a hypothesis, P(F

j

) its prior probability, and P(F

j

[A) its posterior

probability.

Example 4.9. We have a fair coin and an unfair coin, which always comes out Heads. Choose

one at random, toss it twice. It comes out Heads both times. What is the probability that the

coin is fair?

The relevant events are F = ¦fair coin¦, U = ¦unfair coin¦, and B = ¦both tosses H¦. Then

P(F) = P(U) =

1

2

(as each coin is chosen with equal probability). Moreover, P(B[F) =

1

4

, and

P(B[U) = 1. Our probability then is

1

2

1

4

1

2

1

4

+

1

2

1

=

1

5

.

Example 4.10. A factory has three machines, M

1

, M

2

and M

3

, that produce items (say,

lightbulbs). It is impossible to tell which machine produced a particular item, but some are

defective. Here are the known numbers:

machine proportion of items made prob. any made item is defective

M

1

0.2 0.001

M

2

0.3 0.002

M

3

0.5 0.003

You pick an item, test it, and ﬁnd it is defective. What is the probability that it was made

by machine M

2

?

The best way to think about this random experiment is as a two-stage procedure. First you

choose a machine with the probabilities given by the proportion. Then, that machine produces

an item, which you then proceed to test. (Indeed, this is the same as choosing the item from a

large number of them and testing it.)

Let D be the event that an item is defective and let M

i

also denote the event that the

item was made by machine i. Then, P(D[M

1

) = 0.001, P(D[M

2

) = 0.002, P(D[M

3

) = 0.003,

P(M

1

) = 0.2, P(M

2

) = 0.3, P(M

3

) = 0.5, and so

P(M

2

[D) =

0.002 0.3

0.001 0.2 + 0.002 0.3 + 0.003 0.5

≈ 0.26.

Example 4.11. Assume 10% of people have a certain disease. A test gives the correct diagnosis

with probability of 0.8; that is, if the person is sick, the test will be positive with probability 0.8,

but if the person is not sick, the test will be positive with probability 0.2. A random person from

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 29

the population has tested positive for the disease. What is the probability that he is actually

sick? (No, it is not 0.8!)

Let us deﬁne the three relevant events: S = ¦sick¦, H = ¦healthy¦, T = ¦tested positive¦.

Now, P(H) = 0.9, P(S) = 0.1, P(T[H) = 0.2 and P(T[S) = 0.8. We are interested in

P(S[T) =

P(T[S)P(S)

P(T[S)P(S) +P(T[H)P(H)

=

8

26

≈ 31%.

Note that the prior probability P(S) is very important! Without a very good idea about what

it is, a positive test result is diﬃcult to evaluate: a positive test for HIV would mean something

very diﬀerent for a random person as opposed to somebody who gets tested because of risky

behavior.

Example 4.12. O. J. Simpson’s ﬁrst trial , 1995. The famous sports star and media personality

O. J. Simpson was on trial in Los Angeles for murder of his wife and her boyfriend. One of the

many issues was whether Simpson’s history of spousal abuse could be presented by prosecution

at the trial; that is, whether this history was “probative,” i.e., it had some evidentiary value,

or whether it was merely “prejudicial” and should be excluded. Alan Dershowitz, a famous

professor of law at Harvard and a consultant for the defense, was claiming the latter, citing the

statistics that < 0.1% of men who abuse their wives end up killing them. As J. F. Merz and

J. C. Caulkins pointed out in the journal Chance (Vol. 8, 1995, pg. 14), this was the wrong

probability to look at!

We need to start with the fact that a woman is murdered. These numbered 4, 936 in 1992,

out of which 1, 430 were killed by partners. In other words, if we let

A = ¦the (murdered) woman was abused by the partner¦,

M = ¦the woman was murdered by the partner¦,

then we estimate the prior probabilities P(M) = 0.29, P(M

c

) = 0.71, and what we are interested

in is the posterior probability P(M[A). It was also commonly estimated at the time that

about 5% of the women had been physically abused by their husbands. Thus, we can say that

P(A[M

c

) = 0.05, as there is no reason to assume that a woman murdered by somebody else

was more or less likely to be abused by her partner. The ﬁnal number we need is P(A[M).

Dershowitz states that “a considerable number” of wife murderers had previously assaulted

them, although “some” did not. So, we will (conservatively) say that P(A[M) = 0.5. (The

two-stage experiment then is: choose a murdered woman at random; at the ﬁrst stage, she is

murdered by her partner, or not, with stated probabilities; at the second stage, she is among

the abused women, or not, with probabilities depending on the outcome of the ﬁrst stage.) By

Bayes’ formula,

P(M[A) =

P(M)P(A[M)

P(M)P(A[M) +P(M

c

)P(A[M

c

)

=

29

36.1

≈ 0.8.

The law literature studiously avoids quantifying concepts such as probative value and reasonable

doubt. Nevertheless, we can probably say that 80% is considerably too high, compared to the

prior probability of 29%, to use as a sole argument that the evidence is not probative.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 30

Independence

Events A and B are independent if P(A ∩ B) = P(A)P(B) and dependent (or correlated)

otherwise.

Assuming that P(B) > 0, one could rewrite the condition for independence,

P(A[B) = P(A),

so the probability of A is unaﬀected by knowledge that B occurred. Also, if A and B are

independent,

P(A∩ B

c

) = P(A) −P(A∩ B) = P(A) −P(A)P(B) = P(A)(1 −P(B)) = P(A)P(B

c

),

so A and B

c

are also independent — knowing that B has not occurred also has no inﬂuence on

the probability of A. Another fact to notice immediately is that disjoint events with nonzero

probability cannot be independent: given that one of them happens, the other cannot happen

and thus its probability drops to zero.

Quite often, independence is an assumption and it is the most important concept in proba-

bility.

Example 4.13. Pick a random card from a full deck. Let A = ¦card is an Ace¦ and R =

¦card is red¦. Are A and R independent?

We have P(A) =

1

13

, P(R) =

1

2

, and, as there are two red Aces, P(A ∩ R) =

2

52

=

1

26

.

The two events are independent — the proportion of aces among red cards is the same as the

proportion among all cards.

Now, pick two cards out of the deck sequentially without replacement. Are F = ¦ﬁrst card

is an Ace¦ and S = ¦second card is an Ace¦ independent?

Now P(F) = P(S) =

1

13

and P(S[F) =

3

51

, so they are not independent.

Example 4.14. Toss 2 fair coins and let F = ¦Heads on 1st toss¦, S = ¦Heads on 2nd toss¦.

These are independent. You will notice that here the independence is in fact an assumption.

How do we deﬁne independence of more than two events? We say that events A

1

, A

2

, . . . , A

n

are independent if

P(A

i

1

∩ . . . ∩ A

i

k

) = P(A

i

1

)P(A

i

2

) P(A

i

k

),

where 1 ≤ i

1

< i

2

< . . . < i

k

≤ n are arbitrary indices. The occurrence of any combination

of events does not inﬂuence the probability of others. Again, it can be shown that, in such a

collection of independent events, we can replace an A

i

by A

c

i

and the events remain independent.

Example 4.15. Roll a four sided fair die, that is, choose one of the numbers 1, 2, 3, 4 at

random. Let A = ¦1, 2¦, B = ¦1, 3¦, C = ¦1, 4¦. Check that these are pairwise independent

(each pair is independent), but not independent.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 31

Indeed, P(A) = P(B) = P(C) =

1

2

and P(A∩B) = P(A∩C) = P(B∩C) =

1

4

and pairwise

independence follows. However,

P(A∩ B ∩ C) =

1

4

,=

1

8

.

The simple reason for lack of independence is

A∩ B ⊂ C,

so we have complete information on the occurrence of C as soon as we know that A and B both

happen.

Example 4.16. You roll a die, your friend tosses a coin.

• If you roll 6, you win outright.

• If you do not roll 6 and your friend tosses Heads, you lose outright.

• If neither, the game is repeated until decided.

What is the probability that you win?

One way to solve this problem certainly is this:

P(win) = P(win on 1st round) +P(win on 2nd round) +P(win on 3rd round) +. . .

=

1

6

+

_

5

6

1

2

_

1

6

+

_

5

6

1

2

_

2

1

6

+. . . ,

and then we sum the geometric series. Important note: we have implicitly assumed independence

between the coin and the die, as well as between diﬀerent tosses and rolls. This is very common

in problems such as this!

You can avoid the nuisance, however, by the following trick. Let

D = ¦game is decided on 1st round¦,

W = ¦you win¦.

The events D and W are independent, which one can certainly check by computation, but, in

fact, there is a very good reason to conclude so immediately. The crucial observation is that,

provided that the game is not decided in the 1st round, you are thereafter facing the same game

with the same winning probability; thus

P(W[D

c

) = P(W).

In other words, D

c

and W are independent and then so are D and W, and therefore

P(W) = P(W[D).

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 32

This means that one can solve this problem by computing the relevant probabilities for the 1st

round:

P(W[D) =

P(W ∩ D)

P(D)

=

1

6

1

6

+

5

6

1

2

=

2

7

,

which is our answer.

Example 4.17. Craps. Many casinos allow you to bet even money on the following game. Two

dice are rolled and the sum S is observed.

• If S ∈ ¦7, 11¦, you win immediately.

• If S ∈ ¦2, 3, 12¦, you lose immediately.

• If S ∈ ¦4, 5, 6, 8, 9, 10¦, the pair of dice is rolled repeatedly until one of the following

happens:

– S repeats, in which case you win.

– 7 appears, in which case you lose.

What is the winning probability?

Let us look at all possible ways to win:

1. You win on the ﬁrst roll with probability

8

36

.

2. Otherwise,

• you roll a 4 (probability

3

36

), then win with probability

3

36

3

36

+

6

36

=

3

3+6

=

1

3

;

• you roll a 5 (probability

4

36

), then win with probability

4

4+6

=

2

5

;

• you roll a 6 (probability

5

36

), then win with probability

5

5+6

=

5

11

;

• you roll a 8 (probability

5

36

), then win with probability

5

5+6

=

5

11

;

• you roll a 9 (probability

4

36

), then win with probability

4

4+6

=

2

5

;

• you roll a 10 (probability

3

36

), then win with probability

3

3+6

=

1

3

.

Using Bayes’ formula,

P(win) =

8

36

+ 2

_

3

36

1

3

+

4

36

2

5

+

5

36

5

11

_

≈ 0.4929,

a decent game by casino standards.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 33

Bernoulli trials

Assume n independent experiments, each of which is a success with probability p and, thus,

failure with probability 1 −p.

In a sequence of n Bernoulli trials, P(exactly k successes) =

_

n

k

_

p

k

(1 −p)

n−k

.

This is because the successes can occur on any subset S of k trials out of n, i.e., on any

S ⊂ ¦1, . . . , n¦ with cardinality k. These possibilities are disjoint, as exactly k successes cannot

occur on two diﬀerent such sets. There are

_

n

k

_

such subsets; if we ﬁx such an S, then successes

must occur on k trials in S and failures on all n − k trials not in S; the probability that this

happens, by the assumed independence, is p

k

(1 −p)

n−k

.

Example 4.18. A machine produces items which are independently defective with probability

p. Let us compute a few probabilities:

1. P(exactly two items among the ﬁrst 6 are defective) =

_

6

2

_

p

2

(1 −p)

4

.

2. P(at least one item among the ﬁrst 6 is defective) = 1 −P(no defects) = 1 −(1 −p)

6

3. P(at least 2 items among the ﬁrst 6 are defective) = 1 −(1 −p)

6

−6p(1 −p)

5

4. P(exactly 100 items are made before 6 defective are found) equals

P(100th item defective, exactly 5 items among 1st 99 defective) = p

_

99

5

_

p

5

(1 −p)

94

.

Example 4.19. Problem of Points. This involves ﬁnding the probability of n successes before

m failures in a sequence of Bernoulli trials. Let us call this probability p

n,m

.

p

n,m

= P(in the ﬁrst m+n −1 trials, the number of successes is ≥ n)

=

n+m−1

k=n

_

n +m−1

k

_

p

k

(1 −p)

n+m−1−k

.

The problem is solved, but it needs to be pointed out that computationally this is not the best

formula. It is much more eﬃcient to use the recursive formula obtained by conditioning on the

outcome of the ﬁrst trial. Assume m, n ≥ 1. Then,

p

n,m

= P(ﬁrst trial is a success) P(n −1 successes before m failures)

+P(ﬁrst trial is a failure) P(n successes before m−1 failures)

= p p

n−1,m

+ (1 −p) p

n,m−1

.

Together with boundary conditions, valid for m, n ≥ 1,

p

n,0

= 0, p

0,m

= 1,

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 34

which allows for very speedy and precise computations for large m and n.

Example 4.20. Best of 7. Assume that two equally matched teams, A and B, play a series

of games and that the ﬁrst team that wins four games is the overall winner of the series. As

it happens, team A lost the ﬁrst game. What is the probability it will win the series? Assume

that the games are Bernoulli trials with success probability

1

2

.

We have

P(A wins the series) = P(4 successes before 3 failures)

=

6

k=4

_

6

k

__

1

2

_

6

=

15 + 6 + 1

2

6

≈ 0.3438.

Example 4.21. Banach Matchbox Problem. A mathematician carries two matchboxes, each

originally containing n matches. Each time he needs a match, he is equally likely to take it from

either box. What is the probability that, upon reaching for a box and ﬁnding it empty, there

are exactly k matches still in the other box? Here, 0 ≤ k ≤ n.

Let A

1

be the event that matchbox 1 is the one discovered empty and that, at that instant,

matchbox 2 contains k matches. Before this point, he has used n + n − k matches, n from

matchbox 1 and n − k from matchbox 2. This means that he has reached for box 1 exactly n

times in (n + n − k) trials and for the last time at the (n + 1 + n − k)th trial. Therefore, our

probability is

2 P(A

1

) = 2

1

2

_

2n −k

n

_

1

2

2n−k

=

_

2n −k

n

_

1

2

2n−k

.

Example 4.22. Each day, you independently decide, with probability p, to ﬂip a fair coin.

Otherwise, you do nothing. (a) What is the probability of getting exactly 10 Heads in the ﬁrst

20 days? (b) What is the probability of getting 10 Heads before 5 Tails?

For (a), the probability of getting Heads is p/2 independently each day, so the answer is

_

20

10

_

_

p

2

_

10

_

1 −

p

2

_

10

.

For (b), you can disregard days at which you do not ﬂip to get

14

k=10

_

14

k

_

1

2

14

.

Example 4.23. You roll a die and your score is the number shown on the die. Your friend rolls

ﬁve dice and his score is the number of 6’s shown. Compute (a) the probability of event A that

the two scores are equal and (b) the probability of event B that your friend’s score is strictly

larger than yours.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 35

In both cases we will condition on your friend’s score — this works a little better in case (b)

than conditioning on your score. Let F

i

, i = 0, . . . , 5, be the event that your friend’s score is i.

Then, P(A[F

i

) =

1

6

if i ≥ 1 and P(A[F

0

) = 0. Then, by the ﬁrst Bayes’ formula, we get

P(A) =

5

i=1

P(F

i

)

1

6

=

1

6

(1 −P(F

0

)) =

1

6

−

5

5

6

6

≈ 0.0997.

Moreover, P(B[F

i

) =

i−1

6

if i ≥ 2 and 0 otherwise, and so

P(B) =

5

i=1

P(F

i

)

i −1

6

=

1

6

5

i=1

i P(F

i

) −

1

6

5

i=1

P(F

i

)

=

1

6

5

i=1

i P(F

i

) −

1

6

+

5

5

6

6

=

1

6

5

i=1

i

_

5

i

__

1

6

_

i

_

5

6

_

5−i

−

1

6

+

5

5

6

6

=

1

6

5

6

−

1

6

+

5

5

6

6

≈ 0.0392.

The last equality can be obtained by computation, but we will soon learn why the sum has to

equal

5

6

.

Problems

1. Consider the following game. Pick one card at random from a full deck of 52 cards. If you

pull an Ace, you win outright. If not, then you look at the value of the card (K, Q, and J count

as 10). If the number is 7 or less, you lose outright. Otherwise, you select (at random, without

replacement) that number of additional cards from the deck. (For example, if you picked a 9

the ﬁrst time, you select 9 more cards.) If you get at least one Ace, you win. What are your

chances of winning this game?

2. An item is defective (independently of other items) with probability 0.3. You have a method

of testing whether the item is defective, but it does not always give you correct answer. If

the tested item is defective, the method detects the defect with probability 0.9 (and says it is

good with probability 0.1). If the tested item is good, then the method says it is defective with

probability 0.2 (and gives the right answer with probability 0.8).

A box contains 3 items. You have tested all of them and the tests detect no defects. What

is the probability that none of the 3 items is defective?

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 36

3. A chocolate egg either contains a toy or is empty. Assume that each egg contains a toy with

probability p, independently of other eggs. You have 5 eggs; open the ﬁrst one and see if it has

a toy inside, then do the same for the second one, etc. Let E

1

be the event that you get at least

4 toys and let E

2

be the event that you get at least two toys in succession. Compute P(E

1

) and

P(E

2

). Are E

1

and E

2

independent?

4. You have 16 balls, 3 blue, 4 green, and 9 red. You also have 3 urns. For each of the 16 balls,

you select an urn at random and put the ball into it. (Urns are large enough to accommodate any

number of balls.) (a) What is the probability that no urn is empty? (b) What is the probability

that each urn contains 3 red balls? (c) What is the probability that each urn contains all three

colors?

5. Assume that you have an n–element set U and that you select r independent random subsets

A

1

, . . . , A

r

⊂ U. All A

i

are chosen so that all 2

n

choices are equally likely. Compute (in a simple

closed form) the probability that the A

i

are pairwise disjoint.

Solutions to problems

1. Let

F

1

= ¦Ace ﬁrst time¦,

F

8

= ¦8 ﬁrst time¦,

F

9

= ¦9 ﬁrst time¦,

F

10

= ¦10, J, Q, or K ﬁrst time¦.

Also, let W be the event that you win. Then

P(W[F

1

) = 1,

P(W[F

8

) = 1 −

_

47

8

_

_

51

8

_,

P(W[F

9

) = 1 −

_

47

9

_

_

51

9

_,

P(W[F

10

) = 1 −

_

47

10

_

_

51

10

_,

and so,

P(W) =

4

52

+

4

52

_

1 −

_

47

8

_

_

51

8

_

_

+

4

52

_

1 −

_

47

9

_

_

51

9

_

_

+

16

52

_

1 −

_

47

10

_

_

51

10

_

_

.

2. Let F = ¦none is defective¦ and A = ¦test indicates that none is defective¦. By the second

Bayes’ formula,

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 37

P(F[A) =

P(A∩ F)

P(A)

=

(0.7 0.8)

3

(0.7 0.8 + 0.3 0.1)

3

=

_

56

59

_

3

.

3. P(E

1

) = 5p

4

(1 −p) +p

5

= 5p

4

−4p

5

and P(E

2

) = 1 −(1 −p)

5

−5p(1 −p)

4

−

_

4

2

_

p

2

(1 −p)

3

−

p

3

(1 −p)

2

. As E

1

⊂ E

2

, E

1

and E

2

are not independent.

4. (a) Let A

i

= the event that the i-th urn is empty.

P(A

1

) = P(A

2

) = P(A

3

) =

_

2

3

_

16

,

P(A

1

∩ A

2

) = P(A

1

∩ A

3

) = P(A

2

∩ A

3

) =

_

1

3

_

16

,

P(A

1

∩ A

2

∩ A

3

) = 0.

Hence, by inclusion-exclusion,

P(A

1

∪ A

2

∪ A

3

) =

2

16

−1

3

15

,

and

P(no urns are empty) = 1 −P(A

1

∪ A

2

∪ A

3

)

= 1 −

2

16

−1

3

15

.

(b) We can ignore other balls since only the red balls matter here.

Hence, the result is:

9!

3!3!3!

3

9

=

9!

8 3

12

.

(c) As

P(at least one urn lacks blue) = 3

_

2

3

_

3

−3

_

1

3

_

3

,

P(at least one urn lacks green) = 3

_

2

3

_

4

−3

_

1

3

_

4

,

P(at least one urn lacks red) = 3

_

2

3

_

9

−3

_

1

3

_

9

,

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 38

we have, by independence,

P(each urn contains all 3 colors) =

_

1 −

_

3

_

2

3

_

3

−3

_

1

3

_

3

__

_

1 −

_

3

_

2

3

_

4

−3

_

1

3

_

4

__

_

1 −

_

3

_

2

3

_

9

−3

_

1

3

_

9

__

.

5. This is the same as choosing an r n matrix in which every entry is independently 0 or

1 with probability 1/2 and ending up with at most one 1 in every column. Since columns are

independent, this gives ((1 +r)2

−r

)

n

.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 39

Interlude: Practice Midterm 1

This practice exam covers the material from the ﬁrst four chapters. Give yourself 50 minutes to

solve the four problems, which you may assume have equal point score.

1. Ten fair dice are rolled. What is the probability that:

(a) At least one 1 appears.

(b) Each of the numbers 1, 2, 3 appears exactly twice, while the number 4 appears four times.

(c) Each of the numbers 1, 2, 3 appears at least once.

2. Five married couples are seated at random around a round table.

(a) Compute the probability that all couples sit together (i.e., every husband-wife pair occupies

adjacent seats).

(b) Compute the probability that at most one wife does not sit next to her husband.

3. Consider the following game. A player rolls a fair die. If he rolls 3 or less, he loses immediately.

Otherwise he selects, at random, as many cards from a full deck as the number that came up

on the die. The player wins if all four Aces are among the selected cards.

(a) Compute the winning probability for this game.

(b) Smith tells you that he recently played this game once and won. What is the probability

that he rolled a 6 on the die?

4. A chocolate egg either contains a toy or is empty. Assume that each egg contains a toy with

probability p ∈ (0, 1), independently of other eggs. Each toy is, with equal probability, red,

white, or blue (again, independently of other toys). You buy 5 eggs. Let E

1

be the event that

you get at most 2 toys and let E

2

be the event that you get you get at least one red and at least

one white and at least one blue toy (so that you have a complete collection).

(a) Compute P(E

1

). Why is this probability very easy to compute when p = 1/2?

(b) Compute P(E

2

).

(c) Are E

1

and E

2

independent? Explain.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 40

Solutions to Practice Midterm 1

1. Ten fair dice are rolled. What is the probability that:

(a) At least one 1 appears.

Solution:

1 −P(no 1 appears) = 1 −

_

5

6

_

10

.

(b) Each of the numbers 1, 2, 3 appears exactly twice, while the number 4 appears four

times.

Solution:

_

10

2

__

8

2

__

6

2

_

6

10

=

10!

2

3

4! 6

10

.

(c) Each of the numbers 1, 2, 3 appears at least once.

Solution:

Let A

i

be the event that the number i does not appear. We know the following:

P(A

1

) = P(A

2

) = P(A

3

) =

_

5

6

_

10

,

P(A

1

∩ A

2

) = P(A

1

∩ A

3

) = P(A

2

∩ A

3

) =

_

4

6

_

10

,

P(A

1

∩ A

2

∩ A

3

) =

_

3

6

_

10

.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 41

Then,

P(1, 2, and 3 each appear at least once)

= P((A

1

∪ A

2

∪ A

3

)

c

)

= 1 −P(A

1

) −P(A

2

) −P(A

3

)

+P(A

1

∩ A

2

) +P(A

2

∩ A

3

) +P(A

1

∩ A

3

)

−P(A

1

∩ A

2

∩ A

3

)

= 1 −3

_

5

6

_

10

+ 3

_

4

6

_

10

−

_

3

6

_

10

.

2. Five married couples are seated at random around a round table.

(a) Compute the probability that all couples sit together (i.e., every husband-wife pair

occupies adjacent seats).

Solution:

Let i be an integer in the set ¦1, 2, 3, 4, 5¦. Denote each husband and wife as h

i

and w

i

, respectively.

i. Fix h

1

onto one of the seats.

ii. There are 9! ways to order the remaining 9 people in the remaining 9 seats. This

is our sample space.

iii. There are 2 ways to order w

1

.

iv. Treat each couple as a block and the remaining 8 seats as 4 pairs (where each

pair is two adjacent seats). There are 4! ways to seat the remaining 4 couples

into 4 pairs of seats.

v. There are 2

4

ways to order each h

i

and w

i

within its pair of seats.

Therefore, our solution is

2 4! 2

4

9!

.

(b) Compute the probability that at most one wife does not sit next to her husband.

Solution:

Let A be the event that all wives sit next to their husbands and let B be the event

that exactly one wife does not sit next to her husband. We know that P(A) =

2

5

·4!

9!

from part (a). Moreover, B = B

1

∪B

2

∪B

3

∪B

4

∪B

5

, where B

i

is the event that w

i

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 42

does not sit next to h

i

and the remaining couples sit together. Then, B

i

are disjoint

and their probabilities are all the same. So, we need to determine P(B

1

).

i. Fix h

1

onto one of the seats.

ii. There are 9! ways to order the remaining 9 people in the remaining 9 seats.

iii. Consider each of remaining 4 couples and w

1

as 5 blocks.

iv. As w

1

cannot be next to her husband, we have 3 positions for w

1

in the ordering

of the 5 blocks.

v. There are 4! ways to order the remaining 4 couples.

vi. There are 2

4

ways to order the couples within their blocks.

Therefore,

P(B

1

) =

3 4! 2

4

9!

.

Our answer, then, is

5

3 4! 2

4

9!

+

2

5

4!

9!

.

3. Consider the following game. The player rolls a fair die. If he rolls 3 or less, he loses

immediately. Otherwise he selects, at random, as many cards from a full deck as the

number that came up on the die. The player wins if all four Aces are among the selected

cards.

(a) Compute the winning probability for this game.

Solution:

Let W be the event that the player wins. Let F

i

be the event that he rolls i, where

i = 1, . . . , 6; P(F

i

) =

1

6

.

Since we lose if we roll a 1, 2, or 3, P(W[F

1

) = P(W[F

2

) = P(W[F

3

) = 0. Moreover,

P(W[F

4

) =

1

_

52

4

_,

P(W[F

5

) =

_

5

4

_

_

52

4

_,

P(W[F

6

) =

_

6

4

_

_

52

4

_.

Therefore,

P(W) =

1

6

1

_

52

4

_

_

1 +

_

5

4

_

+

_

6

4

__

.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 43

(b) Smith tells you that he recently played this game once and won. What is the proba-

bility that he rolled a 6 on the die?

Solution:

P(F

6

[W) =

1

6

1

(

52

4

)

_

6

4

_

P(W)

=

_

6

4

_

1 +

_

5

4

_

+

_

6

4

_

=

15

21

=

5

7

.

4. A chocolate egg either contains a toy or is empty. Assume that each egg contains a toy with

probability p ∈ (0, 1), independently of other eggs. Each toy is, with equal probability,

red, white, or blue (again, independently of other toys). You buy 5 eggs. Let E

1

be the

event that you get at most 2 toys and let E

2

be the event that you get you get at least

one red and at least one white and at least one blue toy (so that you have a complete

collection).

(a) Compute P(E

1

). Why is this probability very easy to compute when p = 1/2?

Solution:

P(E

1

) = P(0 toys) +P(1 toy) +P(2 toys)

= (1 −p)

5

+ 5p(1 −p)

4

+

_

5

2

_

p

2

(1 −p)

3

.

When p =

1

2

,

P(at most 2 toys) = P(at least 3 toys)

= P(at most 2 eggs are empty)

Therefore, P(E

1

) = P(E

c

1

) and so P(E

1

) =

1

2

.

4 CONDITIONAL PROBABILITY AND INDEPENDENCE 44

(b) Compute P(E

2

).

Solution:

Let A

1

be the event that red is missing, A

2

the event that white is missing, and

A

3

the event that blue is missing.

P(E

2

) = P((A

1

∪ A

2

∪ A

3

)

c

)

= 1 −3

_

1 −

p

3

_

5

+ 3

_

1 −

2p

3

_

5

−(1 −p)

5

.

(c) Are E

1

and E

2

independent? Explain.

Solution:

No: E

1

∩ E

2

= ∅.

5 DISCRETE RANDOM VARIABLES 45

5 Discrete Random Variables

A random variable is a number whose value depends upon the outcome of a random experiment.

Mathematically, a random variable X is a real-valued function on Ω, the space of outcomes:

X : Ω →R.

Sometimes, when convenient, we also allow X to have the value ∞ or, more rarely, −∞, but

this will not occur in this chapter. The crucial theoretical property that X should have is that,

for each interval B, the set of outcomes for which X ∈ B is an event, so we are able to talk

about its probability, P(X ∈ B). Random variables are traditionally denoted by capital letters

to distinguish them from deterministic quantities.

Example 5.1. Here are some examples of random variables.

1. Toss a coin 10 times and let X be the number of Heads.

2. Choose a random point in the unit square ¦(x, y) : 0 ≤ x, y ≤ 1¦ and let X be its distance

from the origin.

3. Choose a random person in a class and let X be the height of the person, in inches.

4. Let X be value of the NASDAQ stock index at the closing of the next business day.

A discrete random variable X has ﬁnitely or countably many values x

i

, i = 1, 2, . . ., and

p(x

i

) = P(X = x

i

) with i = 1, 2, . . . is called the probability mass function of X. Sometimes X

is added as the subscript of its p. m. f., p = p

X

.

A probability mass function p has the following properties:

1. For all i, p(x

i

) > 0 (we do not list values of X which occur with probability 0).

2. For any interval B, P(X ∈ B) =

x

i

∈B

p(x

i

).

3. As X must have some value,

i

p(x

i

) = 1.

Example 5.2. Let X be the number of Heads in 2 fair coin tosses. Determine its p. m. f.

Possible values of X are 0, 1, and 2. Their probabilities are: P(X = 0) =

1

4

, P(X = 1) =

1

2

,

and P(X = 2) =

1

4

.

You should note that the random variable Y , which counts the number of Tails in the 2

tosses, has the same p. m. f., that is, p

X

= p

Y

, but X and Y are far from being the same

random variable! In general, random variables may have the same p. m. f., but may not even

be deﬁned on the same set of outcomes.

Example 5.3. An urn contains 20 balls numbered 1, . . . , 20. Select 5 balls at random, without

replacement. Let X be the largest number among selected balls. Determine its p. m. f. and the

probability that at least one of the selected numbers is 15 or more.

5 DISCRETE RANDOM VARIABLES 46

The possible values are 5, . . . , 20. To determine the p. m. f., note that we have

_

20

5

_

outcomes,

and, then,

P(X = i) =

_

i−1

4

_

_

20

5

_ .

Finally,

P(at least one number 15 or more) = P(X ≥ 15) =

20

i=15

P(X = i).

Example 5.4. An urn contains 11 balls, 3 white, 3 red, and 5 blue balls. Take out 3 balls at

random, without replacement. You win $1 for each red ball you select and lose a $1 for each

white ball you select. Determine the p. m. f. of X, the amount you win.

The number of outcomes is

_

11

3

_

. X can have values −3, −2, −1, 0, 1, 2, and 3. Let us start

with 0. This can occur with one ball of each color or with 3 blue balls:

P(X = 0) =

3 3 5 +

_

5

3

_

_

11

3

_ =

55

165

.

To get X = 1, we can have 2 red and 1 white, or 1 red and 2 blue:

P(X = 1) = P(X = −1) =

_

3

2

__

3

1

_

+

_

3

1

__

5

2

_

_

11

3

_ =

39

165

.

The probability that X = −1 is the same because of symmetry between the roles that the red

and the white balls play. Next, to get X = 2 we must have 2 red balls and 1 blue:

P(X = −2) = P(X = 2) =

_

3

2

__

5

1

_

_

11

3

_ =

15

165

.

Finally, a single outcome (3 red balls) produces X = 3:

P(X = −3) = P(X = 3) =

1

_

11

3

_ =

1

165

.

All the seven probabilities should add to 1, which can be used either to check the computations

or to compute the seventh probability (say, P(X = 0)) from the other six.

Assume that X is a discrete random variable with possible values x

i

, i = 1, 2 . . .. Then, the

expected value, also called expectation, average, or mean, of X is

EX =

i

x

i

P(X = x

i

) =

i

x

i

p(x

i

).

For any function, g : R →R,

Eg(X) =

i

g(x

i

)P(X = x

i

).

5 DISCRETE RANDOM VARIABLES 47

Example 5.5. Let X be a random variable with P(X = 1) = 0.2, P(X = 2) = 0.3, and

P(X = 3) = 0.5. What is the expected value of X?

We can, of course, just use the formula, but let us instead proceed intuitively and see that

the deﬁnition makes sense. What, then, should the average of X be?

Imagine a large number n of repetitions of the experiment and measure the realization of X

in each. By the frequency interpretation of probability, about 0.2n realizations will have X = 1,

about 0.3n will have X = 2, and about 0.5n will have X = 3. The average value of X then

should be

1 0.2n + 2 0.3n + 3 0.5n

n

= 1 0.2 + 2 0.3 + 3 0.5 = 2.3,

which of course is the same as the formula gives.

Take a discrete random variable X and let µ = EX. How should we measure the deviation

of X from µ, i.e., how “spread-out” is the p. m. f. of X?

The most natural way would certainly be E[X − µ[. The only problem with this is that

absolute values are annoying. Instead, we deﬁne the variance of X

Var(X) = E(x −µ)

2

.

The quantity that has the correct units is the standard deviation

σ(X) =

_

Var(X) =

_

E(X −µ)

2

.

We will give another, more convenient, formula for variance that will use the following property

of expectation, called linearity:

E(α

1

X

1

+α

2

X

2

) = α

1

EX

1

+α

2

EX

2

,

valid for any random variables X

1

and X

2

and nonrandom constants α

1

and α

2

. This property

will be explained and discussed in more detail later. Then

Var(X) = E[(X −µ)

2

]

= E[X

2

−2µX +µ

2

]

= E(X

2

) −2µE(X) +µ

2

= E(X

2

) −µ

2

= E(X

2

) −(EX)

2

In computations, bear in mind that variance cannot be negative! Furthermore, the only way

that a random variable has 0 variance is when it is equal to its expectation µ with probability

1 (so it is not really random at all): P(X = µ) = 1. Here is the summary:

The variance of a random variable X is Var(X) = E(X−EX)

2

= E(X

2

)−(EX)

2

.

5 DISCRETE RANDOM VARIABLES 48

Example 5.6. Previous example, continued. Compute Var(X).

E(X

2

) = 1

2

0.2 + 2

2

0.3 + 3

2

0.5 = 5.9,

(EX)

2

= (2.3)

2

= 5.29, and so Var(X) = 5.9 −5.29 = 0.61 and σ(X) =

_

Var(X) ≈ 0.7810.

We will now look at some famous probability mass functions.

5.1 Uniform discrete random variable

This is a random variable with values x

1

, . . . , x

n

, each with equal probability 1/n. Such a random

variable is simply the random choice of one among n numbers.

Properties:

1. EX =

x

1

+...+x

n

n

.

2. VarX =

x

2

1

+...+x

2

n

n

−

_

x

1

+...+xn

n

_

2

.

Example 5.7. Let X be the number shown on a rolled fair die. Compute EX, E(X

2

), and

Var(X).

This is a standard example of a discrete uniform random variable and

EX =

1 + 2 +. . . + 6

6

=

7

2

,

EX

2

=

1 + 2

2

+. . . + 6

2

6

=

91

6

,

Var(X) =

91

6

−

_

7

2

_

2

=

35

12

.

5.2 Bernoulli random variable

This is also called an indicator random variable. Assume that A is an event with probability p.

Then, I

A

, the indicator of A, is given by

I

A

=

_

1 if A happens,

0 otherwise.

Other notations for I

A

include 1

A

and χ

A

. Although simple, such random variables are very

important as building blocks for more complicated random variables.

Properties:

1. EI

A

= p.

2. Var(I

A

) = p(1 −p).

For the variance, note that I

2

A

= I

A

, so that E(I

2

A

) = EI

A

= p.

5 DISCRETE RANDOM VARIABLES 49

5.3 Binomial random variable

A Binomial(n,p) random variable counts the number of successes in n independent trials, each

of which is a success with probability p.

Properties:

1. Probability mass function: P(X = i) =

_

n

i

_

p

i

(1 −p)

n−i

, i = 0, . . . , n.

2. EX = np.

3. Var(X) = np(1 −p).

The expectation and variance formulas will be proved in Chapter 8. For now, take them on

faith.

Example 5.8. Let X be the number of Heads in 50 tosses of a fair coin. Determine EX,

Var(X) and P(X ≤ 10)? As X is Binomial(50,

1

2

), so EX = 25, Var(X) = 12.5, and

P(X ≤ 10) =

10

i=0

_

50

i

_

1

2

50

.

Example 5.9. Denote by d the dominant gene and by r the recessive gene at a single locus.

Then dd is called the pure dominant genotype, dr is called the hybrid, and rr the pure recessive

genotype. The two genotypes with at least one dominant gene, dd and dr, result in the phenotype

of the dominant gene, while rr results in a recessive phenotype.

Assuming that both parents are hybrid and have n children, what is the probability that at

least two will have the recessive phenotype? Each child, independently, gets one of the genes at

random from each parent.

For each child, independently, the probability of the rr genotype is

1

4

. If X is the number of

rr children, then X is Binomial(n,

1

4

). Therefore,

P(X ≥ 2) = 1 −P(X = 0) −P(X = 1) = 1 −

_

3

4

_

n

−n

1

4

_

3

4

_

n−1

.

5.4 Poisson random variable

A random variable is Poisson(λ), with parameter λ > 0, if it has the probability mass function

given below.

Properties:

1. P(X = i) =

λ

i

i!

e

−λ

, for i = 0, 1, 2, . . ..

5 DISCRETE RANDOM VARIABLES 50

2. EX = λ.

3. Var(X) = λ.

Here is how we compute the expectation:

EX =

∞

i=1

i e

−λ

λ

i

i!

= e

−λ

λ

∞

i=1

λ

i−1

(i −1)!

= e

−λ

λe

λ

= λ,

and the variance computation is similar (and a good exercise!).

The Poisson random variable is useful as an approximation to a Binomial random variable

when the number of trials is large and the probability of success is small. In this context it is

often called the law of rare events, ﬁrst formulated by L. J. Bortkiewicz (in 1898), who studied

deaths by horse kicks in the Prussian cavalry.

Theorem 5.1. Poisson approximation to Binomial . When n is large, p is small, and λ = np

is of moderate size, Binomial(n,p) is approximately Poisson(λ):

If X is Binomial(n, p), with p =

λ

n

, then, as n →∞,

P(X = i) →e

−λ

λ

i

i!

,

for each ﬁxed i = 0, 1, 2, . . ..

Proof.

P(X = i) =

_

n

i

__

λ

n

_

i

_

1 −

λ

n

_

n−i

=

n(n −1) . . . (n −i + 1)

i!

λ

i

n

i

_

1 −

λ

n

_

n

_

1 −

λ

n

_

−i

=

λ

i

i!

_

1 −

λ

n

_

n

n(n −1) . . . (n −i + 1)

n

i

1

_

1 −

λ

n

_

i

→

λ

i

i!

e

−λ

1 1,

as n →∞.

The Poisson approximation is quite good: one can prove that the error made by computing

a probability using the Poisson approximation instead of its exact Binomial expression (in the

context of the above theorem) is no more than

min(1, λ) p.

Example 5.10. Suppose that the probability that a person is killed by lighting in a year is,

independently, 1/(500 million). Assume that the US population is 300 million.

5 DISCRETE RANDOM VARIABLES 51

1. Compute P(3 or more people will be killed by lightning next year) exactly.

If X is the number of people killed by lightning, then X is Binomial(n, p), where n = 300

million and p = 1/ (500 million), and the answer is

1 −(1 −p)

n

−np(1 −p)

n−1

−

_

n

2

_

p

2

(1 −p)

n−2

≈ 0.02311530.

2. Approximate the above probability.

As np =

3

5

, X is approximately Poisson(

3

5

), and the answer is

1 −e

−λ

−λe

−λ

−

λ

2

2

e

−λ

≈ 0.02311529.

3. Approximate P(two or more people are killed by lightning within the ﬁrst 6 months of

next year).

This highlights the interpretation of λ as a rate. If lightning deaths occur at the rate of

3

5

a year, they should occur at half that rate in 6 months. Indeed, assuming that lightning

deaths occur as a result of independent factors in disjoint time intervals, we can imagine

that they operate on diﬀerent people in disjoint time intervals. Thus, doubling the time

interval is the same as doubling the number n of people (while keeping p the same), and

then np also doubles. Consequently, halving the time interval has the same p, but half as

many trials, so np changes to

3

10

and so λ =

3

10

as well. The answer is

1 −e

−λ

−λe

−λ

≈ 0.0369.

4. Approximate P(in exactly 3 of next 10 years exactly 3 people are killed by lightning).

In every year, the probability of exactly 3 deaths is approximately

λ

3

3!

e

−λ

, where, again,

λ =

3

5

. Assuming year-to-year independence, the number of years with exactly 3 people

killed is approximately Binomial(10,

λ

3

3!

e

−λ

). The answer, then, is

_

10

3

__

λ

3

3!

e

−λ

_

3

_

1 −

λ

3

3!

e

−λ

_

7

≈ 4.34 10

−6

.

5. Compute the expected number of years, among the next 10, in which 2 or more people are

killed by lightning.

By the same logic as above and the formula for Binomal expectation, the answer is

10(1 −e

−λ

−λe

−λ

) ≈ 0.3694.

Example 5.11. Poisson distribution and law. Assume a crime has been committed. It is

known that the perpetrator has certain characteristics, which occur with a small frequency p

(say, 10

−8

) in a population of size n (say, 10

8

). A person who matches these characteristics has

5 DISCRETE RANDOM VARIABLES 52

been found at random (e.g., at a routine traﬃc stop or by airport security) and, since p is so

small, charged with the crime. There is no other evidence. What should the defense be?

Let us start with a mathematical model of this situation. Assume that N is the number of

people with given characteristics. This is a Binomial random variable but, given the assumptions,

we can easily assume that it is Poisson with λ = np. Choose a person from among these N, label

that person by C, the criminal. Then, choose at random another person, A, who is arrested.

The question is whether C = A, that is, whether the arrested person is guilty. Mathematically,

we can formulate the problem as follows:

P(C = A[ N ≥ 1) =

P(C = A, N ≥ 1)

P(N ≥ 1)

.

We need to condition as the experiment cannot even be performed when N = 0. Now, by the

ﬁrst Bayes’ formula,

P(C = A, N ≥ 1) =

∞

k=1

P(C = A, N ≥ 1 [ N = k) P(N = k)

=

∞

k=1

P(C = A[ N = k) P(N = k)

and

P(C = A[ N = k) =

1

k

,

so

P(C = A, N ≥ 1) =

∞

k=1

1

k

λ

k

k!

e

−λ

.

The probability that the arrested person is guilty then is

P(C = A[N ≥ 1) =

e

−λ

1 −e

−λ

∞

k=1

λ

k

k k!

.

There is no closed-form expression for the sum, but it can be easily computed numerically. The

defense may claim that the probability of innocence, 1−(the above probability), is about 0.2330

when λ = 1, presumably enough for a reasonable doubt.

This model was in fact tested in court, in the famous People v. Collins case, a 1968 jury

trial in Los Angeles. In this instance, it was claimed by the prosecution (on ﬂimsy grounds)

that p = 1/12, 000, 000 and n would have been the number of adult couples in the LA area, say

n = 5, 000, 000. The jury convicted the couple charged for robbery on the basis of the prose-

cutor’s claim that, due to low p, “the chances of there being another couple [with the speciﬁed

characteristics, in the LA area] must be one in a billion.” The Supreme Court of California

reversed the conviction and gave two reasons. The ﬁrst reason was insuﬃcient foundation for

5 DISCRETE RANDOM VARIABLES 53

the estimate of p. The second reason was that the probability that another couple with matching

characteristics existed was, in fact,

P(N ≥ 2 [ N ≥ 1) =

1 −e

−λ

−λe

−λ

1 −e

−λ

,

much larger than the prosecutor claimed, namely, for λ =

5

12

it is about 0.1939. This is about

twice the (more relevant) probability of innocence, which, for this λ, is about 0.1015.

5.5 Geometric random variable

A Geometric(p) random variable X counts the number trials required for the ﬁrst success in

independent trials with success probability p.

Properties:

1. Probability mass function: P(X = n) = p(1 −p)

n−1

, where n = 1, 2, . . ..

2. EX =

1

p

.

3. Var(X) =

1−p

p

2

.

4. P(X > n) =

∞

k=n+1

p(1 −p)

k−1

= (1 −p)

n

.

5. P(X > n +k[X > k) =

(1−p)

n+k

(1−p)

k

= P(X > n).

We omit the proofs of the second and third formulas, which reduce to manipulations with

geometric series.

Example 5.12. Let X be the number of tosses of a fair coin required for the ﬁrst Heads. What

are EX and Var(X)?

As X is Geometric(

1

2

), EX = 2 and Var(X) = 2.

Example 5.13. You roll a die, your opponent tosses a coin. If you roll 6 you win; if you do

not roll 6 and your opponent tosses Heads you lose; otherwise, this round ends and the game

repeats. On the average, how many rounds does the game last?

P(game decided on round 1) =

1

6

+

5

6

1

2

=

7

12

,

and so the number of rounds N is Geometric(

7

12

), and

EN =

12

7

.

5 DISCRETE RANDOM VARIABLES 54

Problems

1. Roll a fair die repeatedly. Let X be the number of 6’s in the ﬁrst 10 rolls and let Y the

number of rolls needed to obtain a 3. (a) Write down the probability mass function of X. (b)

Write down the probability mass function of Y . (c) Find an expression for P(X ≥ 6). (d) Find

an expression for P(Y > 10).

2. A biologist needs at least 3 mature specimens of a certain plant. The plant needs a year

to reach maturity; once a seed is planted, any plant will survive for the year with probability

1/1000 (independently of other plants). The biologist plants 3000 seeds. A year is deemed a

success if three or more plants from these seeds reach maturity.

(a) Write down the exact expression for the probability that the biologist will indeed end up

with at least 3 mature plants.

(b) Write down a relevant approximate expression for the probability from (a). Justify brieﬂy

the approximation.

(c) The biologist plans to do this year after year. What is the approximate probability that he

has at least 2 successes in 10 years?

(d) Devise a method to determine the number of seeds the biologist should plant in order to get

at least 3 mature plants in a year with probability at least 0.999. (Your method will probably

require a lengthy calculation – do not try to carry it out with pen and paper.)

3. You are dealt one card at random from a full deck and your opponent is dealt 2 cards

(without any replacement). If you get an Ace, he pays you $10, if you get a King, he pays you

$5 (regardless of his cards). If you have neither an Ace nor a King, but your card is red and

your opponent has no red cards, he pays you $1. In all other cases you pay him $1. Determine

your expected earnings. Are they positive?

4. You and your opponent both roll a fair die. If you both roll the same number, the game

is repeated, otherwise whoever rolls the larger number wins. Let N be the number of times

the two dice have to be rolled before the game is decided. (a) Determine the probability mass

function of N. (b) Compute EN. (c) Compute P(you win). (d) Assume that you get paid

$10 for winning in the ﬁrst round, $1 for winning in any other round, and nothing otherwise.

Compute your expected winnings.

5. Each of the 50 students in class belongs to exactly one of the four groups A, B, C, or D. The

membership numbers for the four groups are as follows: A: 5, B: 10, C: 15, D: 20. First, choose

one of the 50 students at random and let X be the size of that student’s group. Next, choose

one of the four groups at random and let Y be its size. (Recall: all random choices are with

equal probability, unless otherwise speciﬁed.) (a) Write down the probability mass functions for

X and Y . (b) Compute EX and EY . (c) Compute Var(X) and Var(Y ). (d) Assume you have

5 DISCRETE RANDOM VARIABLES 55

s students divided into n groups with membership numbers s

1

, . . . , s

n

, and again X is the size

of the group of a randomly chosen student, while Y is the size of the randomly chosen group.

Let EY = µ and Var(Y ) = σ

2

. Express EX with s, n, µ, and σ.

Solutions

1. (a) X is Binomial(10,

1

6

):

P(X = i) =

_

10

i

__

1

6

_

i

_

5

6

_

10−i

,

where i = 0, 1, 2, . . . , 10.

(b) Y is Geometric(

1

6

):

P(Y = i) =

1

6

_

5

6

_

i−1

,

where i = 1, 2, . . ..

(c)

P(X ≥ 6) =

10

i=6

_

10

i

__

1

6

_

i

_

5

6

_

10−i

.

(d)

P(Y > 10) =

_

5

6

_

10

.

2. (a) The random variable X, the number of mature plants, is Binomial

_

3000,

1

1000

_

.

P(X ≥ 3) = 1 −P(X ≤ 2)

= 1 −(0.999)

3000

−3000(0.999)

2999

(0.001) −

_

3000

2

_

(0.999)

2998

(0.001)

2

.

(b) By the Poisson approximation with λ = 3000

1

1000

= 3,

P(X ≥ 3) ≈ 1 −e

−3

−3e

−3

−

9

2

e

−3

.

(c) Denote the probability in (b) by s. Then, the number of years the biologists succeeds is

approximately Binomial(10, s) and the answer is

1 −(1 −s)

10

−10s(1 −s)

9

.

(d) Solve

e

−λ

+λe

−λ

+

λ

2

2

e

−λ

= 0.001

5 DISCRETE RANDOM VARIABLES 56

for λ and then let n = 1000λ. The equation above can be solved by rewriting

λ = log 1000 + log(1 +λ +

λ

2

2

)

and then solved by iteration. The result is that the biologist should plant 11, 229 seeds.

3. Let X be your earnings.

P(X = 10) =

4

52

,

P(X = 5) =

4

52

,

P(X = 1) =

22

52

_

26

2

_

_

51

2

_ =

11

102

,

P(X = −1) = 1 −

2

13

−

11

102

,

and so

EX =

10

13

+

5

13

+

11

102

−1 +

2

13

+

11

102

=

4

13

+

11

51

> 0

4. (a) N is Geometric(

5

6

):

P(N = n) =

_

1

6

_

n−1

5

6

,

where n = 1, 2, 3, . . ..

(b) EN =

6

5

.

(c) By symmetry, P(you win) =

1

2

.

(d) You get paid $10 with probability

5

12

, $1 with probability

1

12

, and 0 otherwise, so your

expected winnings are

51

12

.

5. (a)

x P(X = x) P(Y = x)

5 0.1 0.25

10 0.2 0.25

15 0.3 0.25

20 0.4 0.25

(b) EX = 15, EY = 12.5.

(c) E(X

2

) = 250, so Var(X) = 25. E(Y

2

) = 187.5, so Var(Y ) = 31.25.

5 DISCRETE RANDOM VARIABLES 57

(d) Let s = s

1

+. . . +s

n

. Then,

E(X) =

n

i=1

s

i

s

i

s

=

n

s

n

i=1

s

2

i

1

n

=

n

s

EY

2

=

n

s

(Var(Y ) + (EY )

2

) =

n

s

(σ

2

+µ

2

).

6 CONTINUOUS RANDOM VARIABLES 58

6 Continuous Random Variables

A random variable X is continuous if there exists a nonnegative function f so that, for every

interval B,

P(X ∈ B) =

_

B

f(x) dx,

The function f = f

X

is called the density of X.

We will assume that a density function f is continuous, apart from ﬁnitely many (possibly

inﬁnite) jumps. Clearly, it must hold that

_

∞

−∞

f(x) dx = 1.

Note also that

P(X ∈ [a, b]) = P(a ≤ X ≤ b) =

_

b

a

f(x) dx,

P(X = a) = 0,

P(X ≤ b) = P(X < b) =

_

b

−∞

f(x) dx.

The function F = F

X

given by

F(x) = P(X ≤ x) =

_

x

−∞

f(s) ds

is called the distribution function of X. On an open interval where f is continuous,

F

(x) = f(x).

Density has the same role as the probability mass function for discrete random variables: it

tells which values x are relatively more probable for X than others. Namely, if h is very small,

then

P(X ∈ [x, x +h]) = F(x +h) −F(x) ≈ F

(x) h = f(x) h.

By analogy with discrete random variables, we deﬁne,

EX =

_

∞

−∞

x f(x) dx,

Eg(X) =

_

∞

−∞

g(x) f(x) dx,

and variance is computed by the same formula: Var(X) = E(X

2

) −(EX)

2

.

6 CONTINUOUS RANDOM VARIABLES 59

Example 6.1. Let

f(x) =

_

cx if 0 < x < 4,

0 otherwise.

(a) Determine c. (b) Compute P(1 ≤ X ≤ 2). (c) Determine EX and Var(X).

For (a), we use the fact that density integrates to 1, so we have

_

4

0

cxdx = 1 and c =

1

8

. For

(b), we compute

_

2

1

x

8

dx =

3

16

.

Finally, for (c) we get

EX =

_

4

0

x

2

8

dx =

8

3

and

E(X

2

) =

_

4

0

x

3

8

dx = 8.

So, Var(X) = 8 −

64

9

=

8

9

.

Example 6.2. Assume that X has density

f

X

(x) =

_

3x

2

if x ∈ [0, 1],

0 otherwise.

Compute the density f

Y

of Y = 1 −X

4

.

In a problem such as this, compute ﬁrst the distribution function F

Y

of Y . Before starting,

note that the density f

Y

(y) will be nonzero only when y ∈ [0, 1], as the values of Y are restricted

to that interval. Now, for y ∈ (0, 1),

F

Y

(y) = P(Y ≤ y) = P(1 −X

4

≤ y) = P(1 −y ≤ X

4

) = P((1 −y)

1

4

≤ X) =

_

1

(1−y)

1

4

3x

2

dx.

It follows that

f

Y

(y) =

d

dy

F

Y

(y) = −3((1 −y)

1

4

)

2

1

4

(1 −y)

−

3

4

(−1) =

3

4

1

(1 −y)

1

4

,

for y ∈ (0, 1), and f

Y

(y) = 0 otherwise. Observe that it is immaterial how f(y) is deﬁned at

y = 0 and y = 1, because those two values contribute nothing to any integral.

As with discrete random variables, we now look at some famous densities.

6.1 Uniform random variable

Such a random variable represents the choice of a random number in [α, β]. For [α, β] = [0, 1],

this is ideally the output of a computer random number generator.

6 CONTINUOUS RANDOM VARIABLES 60

Properties:

1. Density: f(x) =

_

1

β−α

if x ∈ [α, β],

0 otherwise.

2. EX =

α+β

2

.

3. Var(X) =

(β−α)

2

12

.

Example 6.3. Assume that X is uniform on [0, 1]. What is P(X ∈ Q)? What is the probability

that the binary expansion of X starts with 0.010?

As Q is countable, it has an enumeration, say, Q = ¦q

1

, q

2

, . . . ¦. By Axiom 3 of Chapter 3:

P(X ∈ Q) = P(∪

i

¦X = q

i

¦) =

i

P(X = q

i

) = 0.

Note that you cannot do this for sets that are not countable or you would “prove” that P(X ∈

R) = 0, while we, of course, know that P(X ∈ R) = P(Ω) = 1. As X is, with probability 1,

irrational, its binary expansion is uniquely deﬁned, so there is no ambiguity about what the

second question means.

Divide [0, 1) into 2

n

intervals of equal length. If the binary expansion of a number x ∈ [0, 1)

is 0.x

1

x

2

. . ., the ﬁrst n binary digits determine which of the 2

n

subintervals x belongs to: if you

know that x belongs to an interval I based on the ﬁrst n−1 digits, then nth digit 1 means that

x is in the right half of I and nth digit 0 means that x is in the left half of I. For example, if

the expansion starts with 0.010, the number is in [0,

1

2

], then in [

1

4

,

1

2

], and then ﬁnally in [

1

4

,

3

8

].

Our answer is

1

8

, but, in fact, we can make a more general conclusion. If X is uniform on

[0, 1], then any of the 2

n

possibilities for its ﬁrst n binary digits are equally likely. In other

words, the binary digits of X are the result of an inﬁnite sequence of independent fair coin

tosses. Choosing a uniform random number on [0, 1] is thus equivalent to tossing a fair coin

inﬁnitely many times.

Example 6.4. A uniform random number X divides [0, 1] into two segments. Let R be the

ratio of the smaller versus the larger segment. Compute the density of R.

As R has values in (0, 1), the density f

R

(r) is nonzero only for r ∈ (0, 1) and we will deal

only with such r’s.

F

R

(r) = P(R ≤ r) = P

_

X ≤

1

2

,

X

1 −X

≤ r

_

+P

_

X >

1

2

,

1 −X

X

≤ r

_

= P

_

X ≤

1

2

, X ≤

r

r + 1

_

+P

_

X >

1

2

, X ≥

1

r + 1

_

= P

_

X ≤

r

r + 1

_

+P

_

X ≥

1

r + 1

_ _

since

r

r + 1

≤

1

2

and

1

r + 1

≥

1

2

_

=

r

r + 1

+ 1 −

1

r + 1

=

2r

r + 1

6 CONTINUOUS RANDOM VARIABLES 61

For r ∈ (0, 1), the density, thus, equals

f

R

(r) =

d

dr

F

R

(r) =

2

(r + 1)

2

.

6.2 Exponential random variable

A random variable is Exponential(λ), with parameter λ > 0, if it has the probability mass

function given below. This is a distribution for the waiting time for some random event, for

example, for a lightbulb to burn out or for the next earthquake of at least some given magnitude.

Properties:

1. Density: f(x) =

_

λe

−λx

if x ≥ 0,

0 if x < 0.

2. EX =

1

λ

.

3. Var(X) =

1

λ

2

.

4. P(X ≥ x) = e

−λx

.

5. Memoryless property: P(X ≥ x +y[X ≥ y) = e

−λx

.

The last property means that, if the event has not occurred by some given time (no matter

how large), the distribution of the remaining waiting time is the same as it was at the beginning.

There is no “aging.”

Proofs of these properties are integration exercises and are omitted.

Example 6.5. Assume that a lightbulb lasts on average 100 hours. Assuming exponential

distribution, compute the probability that it lasts more than 200 hours and the probability that

it lasts less than 50 hours.

Let X be the waiting time for the bulb to burn out. Then, X is Exponential with λ =

1

100

and

P(X ≥ 200) = e

−2

≈ 0.1353,

P(X ≤ 50) = 1 −e

−

1

2

≈ 0.3935.

6.3 Normal random variable

A random variable is Normal with parameters µ ∈ R and σ

2

> 0 or, in short, X is N(µ, σ

2

),

if its density is the function given below. Such a random variable is (at least approximately)

very common. For example, measurement with random error, weight of a randomly caught

yellow-billed magpie, SAT (or some other) test score of a randomly chosen student at UC Davis,

etc.

6 CONTINUOUS RANDOM VARIABLES 62

Properties:

1. Density:

f(x) = f

X

(x) =

1

σ

√

2π

e

−

(x−µ)

2

2σ

2

,

where x ∈ (−∞, ∞).

2. EX = µ.

3. Var(X) = σ

2

.

To show that

_

∞

−∞

f(x) dx = 1

is a tricky exercise in integration, as is the computation of the variance. Assuming that the

integral of f is 1, we can use symmetry to prove that EX must be µ:

EX =

_

∞

−∞

xf(x) dx =

_

∞

−∞

(x −µ)f(x) dx +µ

_

∞

−∞

f(x) dx

=

_

∞

−∞

(x −µ)

1

σ

√

2π

e

−

(x−µ)

2

2σ

2

dx +µ

=

_

∞

−∞

z

1

σ

√

2π

e

−

z

2

2σ

2

dz +µ

= µ,

where the last integral was obtained by the change of variable z = x−µ and is zero because the

function integrated is odd.

Example 6.6. Let X be a N(µ, σ

2

) random variable and let Y = αX +β, with α > 0. How is

Y distributed?

If X is a “measurement with error” αX +β amounts to changing the units and so Y should

still be normal. Let us see if this is the case. We start by computing the distribution function

of Y ,

F

Y

(y) = P(Y ≤ y)

= P(αX +β ≤ y)

= P

_

X ≤

y −β

α

_

=

_ y−β

α

−∞

f

X

(x) dx

6 CONTINUOUS RANDOM VARIABLES 63

and, then, the density

f

Y

(y) = f

X

_

y −β

α

_

1

α

=

1

√

2πσα

e

−

(y−β−αµ)

2

2α

2

σ

2

.

Therefore, Y is normal with EY = αµ +β and Var(Y ) = (ασ)

2

.

In particular,

Z =

X −µ

σ

has EZ = 0 and Var(Z) = 1. Such a N(0, 1) random variable is called standard Normal. Its

distribution function F

Z

(z) is denoted by Φ(z). Note that

f

Z

(z) =

1

√

2π

e

−z

2

/2

Φ(z) = F

Z

(z) =

1

√

2π

_

z

−∞

e

−x

2

/2

dx.

The integral for Φ(z) cannot be computed as an elementary function, so approximate values

are given in tables. Nowadays, this is largely obsolete, as computers can easily compute Φ(z)

very accurately for any given z. You should also note that it is enough to know these values for

z > 0, as in this case, by using the fact that f

Z

(x) is an even function,

Φ(−z) =

_

−z

−∞

f

Z

(x) dx =

_

∞

z

f

Z

(x) dx = 1 −

_

z

−∞

f

Z

(x) dx = 1 −Φ(z).

In particular, Φ(0) =

1

2

. Another way to write this is P(Z ≥ −z) = P(Z ≤ z), a form which is

also often useful.

Example 6.7. What is the probability that a Normal random variable diﬀers from its mean µ

by more than σ? More than 2σ? More than 3σ?

In symbols, if X is N(µ, σ

2

), we need to compute P([X − µ[ ≥ σ), P([X − µ[ ≥ 2σ), and

P([X −µ[ ≥ 3σ).

In this and all other examples of this type, the letter Z will stand for an N(0, 1) random

variable.

We have

P([X −µ[ ≥ σ) = P

_¸

¸

¸

¸

X −µ

σ

¸

¸

¸

¸

≥ 1

_

= P([Z[ ≥ 1) = 2P(Z ≥ 1) = 2(1 −Φ(1)) ≈ 0.3173.

Similarly,

P([X −µ[ ≥ 2σ) = 2(1 −Φ(2)) ≈ 0.0455,

P([X −µ[ ≥ 3σ) = 2(1 −Φ(3)) ≈ 0.0027.

6 CONTINUOUS RANDOM VARIABLES 64

Example 6.8. Assume that X is Normal with mean µ = 2 and variance σ

2

= 25. Compute

the probability that X is between 1 and 4.

Here is the computation:

P(1 ≤ X ≤ 4) = P

_

1 −2

5

≤

X −2

5

≤

4 −2

5

_

= P(−0.2 ≤ Z ≤ 0.4)

= P(Z ≤ 0.4) −P(Z ≤ −0.2)

= Φ(0.4) −(1 −Φ(0.2))

≈ 0.2347 .

Let S

n

be a Binomial(n, p) random variable. Recall that its mean is np and its variance

np(1 − p). If we pretend that S

n

is Normal, then

S

n

−np

√

np(1−p)

is standard Normal, i.e., N(0, 1).

The following theorem says that this is approximately true if p is ﬁxed (e.g., 0.5) and n is large

(e.g., n = 100).

Theorem 6.1. De Moivre-Laplace Central Limit Theorem.

Let S

n

be Binomial(n, p), where p is ﬁxed and n is large. Then,

S

n

−np

√

np(1−p)

≈ N(0, 1); more

precisely,

P

_

S

n

−np

_

np(1 −p)

≤ x

_

→Φ(x)

as n →∞, for every real number x.

We should also note that the above theorem is an analytical statement; it says that

k:0≤k≤np+x

√

np(1−p)

_

n

k

_

p

k

(1 −p)

n−k

→

1

√

2π

_

x

−∞

e

−

s

2

2

ds

as n →∞, for every x ∈ R. Indeed it can be, and originally was, proved this way, with a lot of

computational work.

An important issue is the quality of the Normal approximation to the Binomial. One can

prove that the diﬀerence between the Binomial probability (in the above theorem) and its limit

is at most

0.5 (p

2

+ (1 −p)

2

)

_

np (1 −p)

.

A commonly cited rule of thumb is that this is a decent approximation when np(1 − p) ≥ 10;

however, if we take p = 1/3 and n = 45, so that np(1−p) = 10, the bound above is about 0.0878,

too large for many purposes. Various corrections have been developed to diminish the error,

but they are, in my opinion, obsolete by now. In the situation when the above upper bound

6 CONTINUOUS RANDOM VARIABLES 65

on the error is too high, we should simply compute directly with the Binomial distribution and

not use the Normal approximation. (We will assume that the approximation is adequate in the

examples below.) Remember that, when n is large and p is small, say n = 100 and p =

1

100

, the

Poisson approximation (with λ = np) is much better!

Example 6.9. A roulette wheel has 38 slots: 18 red, 18 black, and 2 green. The ball ends at

one of these at random. You are a player who plays a large number of games and makes an even

bet of $1 on red in every game. After n games, what is the probability that you are ahead?

Answer this for n = 100 and n = 1000.

Let S

n

be the number of times you win. This is a Binomial(n,

9

19

) random variable.

P(ahead) = P(win more than half of the games)

= P

_

S

n

>

n

2

_

= P

_

S

n

−np

_

np(1 −p)

>

1

2

n −np

_

np(1 −p)

_

≈ P

_

Z >

(

1

2

−p)

√

n

_

p(1 −p)

_

For n = 100, we get

P

_

Z >

5

√

90

_

≈ 0.2990,

and for n = 1000, we get

P

_

Z >

5

3

_

≈ 0.0478.

For comparison, the true probabilities are 0.2650 and 0.0448, respectively.

Example 6.10. What would the answer to the previous example be if the game were fair, i.e.,

you bet even money on the outcome of a fair coin toss each time.

Then, p =

1

2

and

P(ahead) →P(Z > 0) = 0.5,

as n →∞.

Example 6.11. How many times do you need to toss a fair coin to get 100 heads with probability

90%?

Let n be the number of tosses that we are looking for. For S

n

, which is Binomial(n,

1

2

), we

need to ﬁnd n so that

P(S

n

≥ 100) ≈ 0.9.

We will use below that n > 200, as the probability would be approximately

1

2

for n = 200 (see

6 CONTINUOUS RANDOM VARIABLES 66

the previous example). Here is the computation:

P

_

S

n

−

1

2

n

1

2

√

n

≥

100 −

1

2

n

1

2

√

n

_

≈ P

_

Z ≥

100 −

1

2

n

1

2

√

n

_

= P

_

Z ≥

200 −n

√

n

_

= P

_

Z ≥ −

_

n −200

√

n

__

= P

_

Z ≤

n −200

√

n

_

= Φ

_

n −200

√

n

_

= 0.9

Now, according to the tables, Φ(1.28) ≈ 0.9, thus we need to solve

n−200

√

n

= 1.28, that is,

n −1.28

√

n −200 = 0.

This is a quadratic equation in

√

n, with the only positive solution

√

n =

1.28 +

√

1.28

2

+ 800

2

.

Rounding up the number n we get from above, we conclude that n = 219.

Problems

1. A random variable X has the density function

f(x) =

_

c(x +

√

x) x ∈ [0, 1],

0 otherwise.

(a) Determine c. (b) Compute E(1/X). (c) Determine the probability density function of

Y = X

2

.

2. The density function of a random variable X is given by

f(x) =

_

a +bx 0 ≤ x ≤ 2,

0 otherwise.

We also know that E(X) = 7/6. (a) Compute a and b. (b) Compute Var(X).

6 CONTINUOUS RANDOM VARIABLES 67

3. After your complaint about their service, a representative of an insurance company promised

to call you “between 7 and 9 this evening.” Assume that this means that the time T of the call

is uniformly distributed in the speciﬁed interval.

(a) Compute the probability that the call arrives between 8:00 and 8:20.

(b) At 8:30, the call still hasn’t arrived. What is the probability that it arrives in the next 10

minutes?

(c) Assume that you know in advance that the call will last exactly 1 hour. From 9 to 9:30,

there is a game show on TV that you wanted to watch. Let M be the amount of time of the

show that you miss because of the call. Compute the expected value of M.

4. Toss a fair coin twice. You win $1 if at least one of the two tosses comes out heads.

(a) Assume that you play this game 300 times. What is, approximately, the probability that

you win at least $250?

(b) Approximately how many times do you need to play so that you win at least $250 with

probability at least 0.99?

5. Roll a die n times and let M be the number of times you roll 6. Assume that n is large.

(a) Compute the expectation EM.

(b) Write down an approximation, in terms on n and Φ, of the probability that M diﬀers from

its expectation by less than 10%.

(c) How large should n be so that the probability in (b) is larger than 0.99?

Solutions

1. (a) As

1 = c

_

1

0

(x +

√

x) dx = c

_

1

2

+

2

3

_

=

7

6

c,

it follows that c =

6

7

.

(b)

6

7

_

1

0

1

x

(x +

√

x) dx =

18

7

.

(c)

F

r

(y) = P(Y ≤ y)

= P(X ≤

√

y)

=

6

7

_

√

y

0

(x +

√

x) dx,

6 CONTINUOUS RANDOM VARIABLES 68

and so

f

Y

(y) =

_

3

7

(1 +y

−

1

4

) if y ∈ (0, 1),

0 otherwise.

2. (a) From

_

1

0

f(x) dx = 1 we get 2a + 2b = 1 and from

_

1

0

xf(x) dx =

7

6

we get 2a +

8

3

b =

7

6

.

The two equations give a = b =

1

4

.

(b) E(X

2

) =

_

1

0

x

2

f(x) dx =

5

3

and so Var(X) =

5

3

−(

7

6

)

2

=

11

36

.

3. (a)

1

6

.

(b) Let T be the time of the call, from 7pm, in minutes; T is uniform on [0, 120]. Thus,

P(T ≤ 100[T ≥ 90) =

1

3

.

(c) We have

M =

_

¸

_

¸

_

0 if 0 ≤ T ≤ 60,

T −60 if 60 ≤ T ≤ 90,

30 if 90 ≤ T.

Then,

EM =

1

120

_

90

60

(t −60) dx +

1

120

_

120

90

30 dx = 11.25.

4. (a) P(win a single game) =

3

4

. If you play n times, the number X of games you win is

Binomial(n,

3

4

). If Z is N(0, 1), then

P(X ≥ 250) ≈ P

_

_

Z ≥

250 −n

3

4

_

n

3

4

1

4

_

_

.

For (a), n = 300 and the above expression is P(Z ≥

10

3

), which is approximately 1 −Φ(3.33) ≈

0.0004.

For (b), you need to ﬁnd n so that the above expression is 0.99 or so that

Φ

_

_

250 −n

3

4

_

n

3

4

1

4

_

_

= 0.01.

The argument must be negative, hence

250 −n

3

4

_

n

3

4

1

4

= −2.33.

6 CONTINUOUS RANDOM VARIABLES 69

If x =

√

3n, this yields

x

2

−4.66x −1000 = 0

and solving the quadratic equation gives x ≈ 34.04, n > (34.04)

2

/3, n ≥ 387.

5. (a) M is Binomial(n,

1

6

), so EM =

n

6

.

(b)

P

_¸

¸

¸M −

n

6

¸

¸

¸ <

n

6

0.1

_

≈ P

_

_

[Z[ <

n

6

0.1

_

n

1

6

5

6

_

_

= 2Φ

_

0.1

√

n

√

5

_

−1.

(c) The above must be 0.99 and so Φ

_

0.1

√

n

√

5

_

= 0.995,

0.1

√

n

√

5

= 2.57, and, ﬁnally, n ≥ 3303.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 70

7 Joint Distributions and Independence

Discrete Case

Assume that you have a pair (X, Y ) of discrete random variables X and Y . Their joint probability

mass function is given by

p(x, y) = P(X = x, Y = y)

so that

P((X, Y ) ∈ A) =

(x,y)∈A

p(x, y).

The marginal probability mass functions are the p. m. f.’s of X and Y , given by

P(X = x) =

y

P(X = x, Y = y) =

y

p(x, y)

P(Y = y) =

x

P(X = x, Y = y) =

x

p(x, y)

Example 7.1. An urn has 2 red, 5 white, and 3 green balls. Select 3 balls at random and let

X be the number of red balls and Y the number of white balls. Determine (a) joint p. m. f. of

(X, Y ), (b) marginal p. m. f.’s, (c) P(X ≥ Y ), and (d) P(X = 2[X ≥ Y ).

The joint p. m. f. is given by P(X = x, Y = y) for all possible x and y. In our case, x can

be 0, 1, or 2 and y can be 0, 1, 2, or 3. The values are given in the table.

y¸x 0 1 2 P(Y = y)

0 1/120 2 3/120 3/120 10/120

1 5 3/120 2 5 3/120 5/120 50/120

2 10 3/120 10 2/120 0 50/120

3 10/120 0 0 10/120

P(X = x) 56/120 56/120 8/120 1

The last row and column entries are the respective column and row sums and, therefore,

determine the marginal p. m. f.’s. To answer (c) we merely add the relevant probabilities,

P(X ≥ Y ) =

1 + 6 + 3 + 30 + 5

120

=

3

8

,

and, to answer (d), we compute

P(X = 2, X ≥ Y )

P(X ≥ Y )

=

8

120

3

8

=

8

45

.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 71

Two random variables X and Y are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

for all intervals A and B. In the discrete case, X and Y are independent exactly when

P(X = x, Y = y) = P(X = x)P(Y = y)

for all possible values x and y of X and Y , that is, the joint p. m. f. is the product of the

marginal p. m. f.’s.

Example 7.2. In the previous example, are X and Y independent?

No, the 0’s in the table are dead giveaways. For example, P(X = 2, Y = 2) = 0, but neither

P(X = 2) nor P(Y = 2) is 0.

Example 7.3. Most often, independence is an assumption. For example, roll a die twice and

let X be the number on the ﬁrst roll and let Y be the number on the second roll. Then, X and

Y are independent: we are used to assuming that all 36 outcomes of the two rolls are equally

likely, which is the same as assuming that the two random variables are discrete uniform (on

¦1, 2, . . . , 6¦) and independent.

Continuous Case

We say that (X, Y ) is a jointly continuous pair of random variables if there exists a joint density

f(x, y) ≥ 0 so that

P((X, Y ) ∈ S) =

__

S

f(x, y) dxdy,

where S is some nice (say, open or closed) subset of R

2

Example 7.4. Let (X, Y ) be a random point in S, where S is a compact (that is, closed and

bounded) subset of R

2

. This means that

f(x, y) =

_

1

area(S)

if (x, y) ∈ S,

0 otherwise.

The simplest example is a square of side length 1 where

f(x, y) =

_

1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,

0 otherwise.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 72

Example 7.5. Let

f(x, y) =

_

c x

2

y if x

2

≤ y ≤ 1,

0 otherwise.

Determine (a) the constant c, (b) P(X ≥ Y ), (c) P(X = Y ), and (d) P(X = 2Y ).

For (a),

_

1

−1

dx

_

1

x

2

c x

2

y dy = 1

c

4

21

= 1

and so

c =

21

4

.

For (b), let S be the region between the graphs y = x

2

and y = x, for x ∈ (0, 1). Then,

P(X ≥ Y ) = P((X, Y ) ∈ S)

=

_

1

0

dx

_

x

x

2

21

4

x

2

y dy

=

3

20

Both probabilities in (c) and (d) are 0 because a two-dimensional integral over a line is 0.

If f is the joint density of (X, Y ), then the two marginal densities, which are the densities of X

and Y , are computed by integrating out the other variable:

f

X

(x) =

_

∞

−∞

f(x, y) dy,

f

Y

(y) =

_

∞

−∞

f(x, y) dx.

Indeed, for an interval A, X ∈ A means that (X, Y ) ∈ S, where S = AR, and, therefore,

P(X ∈ A) =

_

A

dx

_

∞

−∞

f(x, y) dy.

The marginal densities formulas follow from the deﬁnition of density. With some advanced

calculus expertise, the following can be checked.

Two jointly continuous random variables X and Y are independent exactly when the joint

density is the product of the marginal ones:

f(x, y) = f

X

(x) f

Y

(y),

for all x and y.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 73

Example 7.6. Previous example, continued. Compute marginal densities and determine

whether X and Y are independent.

We have

f

X

(x) =

_

1

x

2

21

4

x

2

y dy =

21

8

x

2

(1 −x

4

),

for x ∈ [−1, 1], and 0 otherwise. Moreover,

f

Y

(y) =

_

√

y

−

√

y

21

4

x

2

y dx =

7

2

y

5

2

,

where y ∈ [0, 1], and 0 otherwise. The two random variables X and Y are clearly not indepen-

dent, as f(x, y) ,= f

X

(x)f

Y

(y).

Example 7.7. Let (X, Y ) be a random point in a square of length 1 with the bottom left corner

at the origin. Are X and Y independent?

f(x, y) =

_

1 (x, y) ∈ [0, 1] [0, 1],

0 otherwise.

The marginal densities are

f

X

(x) = 1,

if x ∈ [0, 1], and

f

Y

(y) = 1,

if y ∈ [0, 1], and 0 otherwise. Therefore, X and Y are independent.

Example 7.8. Let (X, Y ) be a random point in the triangle ¦(x, y) : 0 ≤ y ≤ x ≤ 1¦. Are X

and Y independent?

Now

f(x, y) =

_

2 0 ≤ y ≤ x ≤ 1

0, otherwise.

The marginal densities are

f

X

(x) = 2x,

if x ∈ [0, 1], and

f

Y

(y) = 2(1 −y),

if y ∈ [0, 1], and 0 otherwise. So X and Y are no longer distributed uniformly and no longer

independent.

We can make a more general conclusion from the last two examples. Assume that (X, Y ) is

a jointly continuous pair of random variables, uniform on a compact set S ⊂ R

2

. If they are to

be independent, their marginal densities have to be constant, thus uniform on some sets, say A

and B, and then S = A B. (If A and B are both intervals, then S = A B is a rectangle,

which is the most common example of independence.)

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 74

Example 7.9. Mr. and Mrs. Smith agree to meet at a speciﬁed location “between 5 and 6

p.m.” Assume that they both arrive there at a random time between 5 and 6 and that their

arrivals are independent. (a) Find the density for the time one of them will have to wait for

the other. (b) Mrs. Smith later tells you she had to wait; given this information, compute the

probability that Mr. Smith arrived before 5:30.

Let X be the time when Mr. Smith arrives and let Y be the time when Mrs. Smith arrives,

with the time unit 1 hour. The assumptions imply that (X, Y ) is uniform on [0, 1] [0, 1].

For (a), let T = [X − Y [, which has possible values in [0, 1]. So, ﬁx t ∈ [0, 1] and compute

(drawing a picture will also help)

P(T ≤ t) = P([X −Y [ ≤ t)

= P(−t ≤ X −Y ≤ t)

= P(X −t ≤ Y ≤ X +t)

= 1 −(1 −t)

2

= 2t −t

2

,

and so

f

T

(t) = 2 −2t.

For (b), we need to compute

P(X ≤ 0.5[X > Y ) =

P(X ≤ 0.5, X > Y )

P(X > Y )

=

1

8

1

2

=

1

4

.

Example 7.10. Assume that X and Y are independent, that X is uniform on [0, 1], and that

Y has density f

Y

(y) = 2y, for y ∈ [0, 1], and 0 elsewhere. Compute P(X +Y ≤ 1).

The assumptions determine the joint density of (X, Y )

f(x, y) =

_

2y if (x, y) ∈ [0, 1] [0, 1],

0 otherwise.

To compute the probability in question we compute

_

1

0

dx

_

1−x

0

2y dy

or

_

1

0

dy

_

1−y

0

2y dx,

whichever double integral is easier. The answer is

1

3

.

Example 7.11. Assume that you are waiting for two phone calls, from Alice and from Bob.

The waiting time T

1

for Alice’s call has expectation 10 minutes and the waiting time T

2

for

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 75

Bob’s call has expectation 40 minutes. Assume T

1

and T

2

are independent exponential random

variables. What is the probability that Alice’s call will come ﬁrst?

We need to compute P(T

1

< T

2

). Assuming our unit is 10 minutes, we have, for t

1

, t

2

> 0,

f

T

1

(t

1

) = e

−t

1

and

f

T

2

(t

2

) =

1

4

e

−t

2

/4

,

so that the joint density is

f(t

1

, t

2

) =

1

4

e

−t

1

−t

2

/4

,

for t

1

, t

2

≥ 0. Therefore,

P(T

1

< T

2

) =

_

∞

0

dt

1

_

∞

t

1

1

4

e

−t

1

−t

2

/4

dt

2

=

_

∞

0

e

−t

1

dt

1

e

−t

1

/4

=

_

∞

0

e

−5t

1

4

dt

1

=

4

5

.

Example 7.12. Buﬀon needle problem. Parallel lines at a distance 1 are drawn on a large sheet

of paper. Drop a needle of length onto the sheet. Compute the probability that it intersects

one of the lines.

Let D be the distance from the center of the needle to the nearest line and let Θ be the

acute angle relative to the lines. We will, reasonably, assume that D and Θ are independent

and uniform on their respective intervals 0 ≤ D ≤

1

2

and 0 ≤ Θ ≤

π

2

. Then,

P(the needle intersects a line) = P

_

D

sin Θ

<

2

_

= P

_

D <

2

sin Θ

_

.

Case 1: ≤ 1. Then, the probability equals

_

π/2

0

2

sin θ dθ

π/4

=

/2

π/4

=

2

π

.

When = 1, you famously get

2

π

, which can be used to get (very poor) approximations for π.

Case 2: > 1. Now, the curve d =

2

sin θ intersects d =

1

2

at θ = arcsin

1

. The probability

equals

4

π

_

2

_

arcsin

1

0

sin θ dθ +

_

π

2

−arcsin

1

_

1

2

_

=

4

π

_

2

−

1

2

_

2

−1 +

π

4

−

1

2

arcsin

1

_

.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 76

A similar approach works for cases with more than two random variables. Let us do an

example for illustration.

Example 7.13. Assume X

1

, X

2

, X

3

are uniform on [0, 1] and independent. What is P(X

1

+

X

2

+X

3

≤ 1)?

The joint density is

f

X

1

,X

2

,X

3

(x

1

, x

2

, x

3

) = f

X

1

(x

1

)f

X

2

(x

2

)f

X

3

(x

3

) =

_

1 if (x

1

, x

2

, x

3

) ∈ [0, 1]

3

,

0 otherwise.

Here is how we get the answer:

P(X

1

+X

2

+X

3

≤ 1) =

_

1

0

dx

1

_

1−x

1

0

dx

2

_

1−x

1

−x

2

0

dx

3

=

1

6

.

In general, if X

1

, . . . , X

n

are independent and uniform on [0, 1],

P(X

1

+. . . +X

n

≤ 1) =

1

n!

Conditional distributions

The conditional p. m. f. of X given Y = y is, in the discrete case, given simply by

p

X

(x[Y = y) = P(X = x[Y = y) =

P(X = x, Y = y)

P(Y = y)

.

This is trickier in the continuous case, as we cannot divide by P(Y = y) = 0.

For a jointly continuous pair of random variables X and Y , we deﬁne the conditional density of

X given Y = y as follows:

f

X

(x[Y = y) =

f(x, y)

f

Y

(y)

,

where f(x, y) is, of course, the joint density of (X, Y ).

Observe that when f

Y

(y) = 0,

_

∞

−∞

f(x, y) dx = 0, and so f(x, y) = 0 for every x. So, we

have a

0

0

expression, which we deﬁne to be 0.

Here is a “physicist’s proof” why this should be the conditional density formula:

P(X = x +dx[Y = y +dy) =

P(X = x +dx, Y = y +dy)

P(Y = y +dy)

=

f(x, y) dxdy

f

Y

(y) dy

=

f(x, y)

f

Y

(y)

dx

= f

X

(x[Y = y) dx .

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 77

Example 7.14. Let (X, Y ) be a random point in the triangle ¦(x, y) : x, y ≥ 0, x + y ≤ 1¦.

Compute f

X

(x[Y = y).

The joint density f(x, y) equals 2 on the triangle. For a given y ∈ [0, 1], we know that, if

Y = y, X is between 0 and 1 −y. Moreover,

f

Y

(y) =

_

1−y

0

2 dx = 2(1 −y).

Therefore,

f

X

(x[Y = y) =

_

1

1−y

0 ≤ x ≤ 1 −y,

0 otherwise.

In other words, given Y = y, X is distributed uniformly on [0, 1 −y], which is hardly surprising.

Example 7.15. Suppose (X, Y ) has joint density

f(x, y) =

_

21

4

x

2

y x

2

≤ y ≤ 1,

0 otherwise.

Compute f

X

(x[Y = y).

We compute ﬁrst

f

Y

(y) =

21

4

y

_

√

y

−

√

y

x

2

dx =

7

2

y

5

2

,

for y ∈ [0, 1]. Then,

f

X

(x[Y = y) =

21

4

x

2

y

7

2

y

5/2

=

3

2

x

2

y

−3/2

,

where −

√

y ≤ x ≤

√

y.

Suppose we are asked to compute P(X ≥ Y [Y = y). This makes no literal sense because

the probability P(Y = y) of the condition is 0. We reinterpret this expression as

P(X ≥ y[Y = y) =

_

∞

y

f

X

(x[Y = y) dx,

which equals

_

√

y

y

3

2

x

2

y

−3/2

dx =

1

2

y

−3/2

_

y

3/2

−y

3

_

=

1

2

_

1 −y

3/2

_

.

Problems

1. Let (X, Y ) be a random point in the square ¦(x, y) : −1 ≤ x, y ≤ 1¦. Compute the conditional

probability P(X ≥ 0 [ Y ≤ 2X). (It may be a good idea to draw a picture and use elementary

geometry, rather than calculus.)

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 78

2. Roll a fair die 3 times. Let X be the number of 6’s obtained and Y the number of 5’s.

(a) Compute the joint probability mass function of X and Y .

(b) Are X and Y independent?

3. X and Y are independent random variables and they have the same density function

f(x) =

_

c(2 −x) x ∈ (0, 1)

0 otherwise.

(a) Determine c. (b) Compute P(Y ≤ 2X) and P(Y < 2X).

4. Let X and Y be independent random variables, both uniformly distributed on [0, 1]. Let

Z = min(X, Y ) be the smaller value of the two.

(a) Compute the density function of Z.

(b) Compute P(X ≤ 0.5[Z ≤ 0.5).

(c) Are X and Z independent?

5. The joint density of (X, Y ) is given by

_

f(x, y) = 3x if 0 ≤ y ≤ x ≤ 1,

0 otherwise.

(a) Compute the conditional density of Y given X = x.

(b) Are X and Y independent?

Solutions to problems

1. After noting the relevant areas,

P(X ≥ 0[Y ≤ 2X) =

P(X ≥ 0, Y ≤ 2X)

P(Y ≤ 2X)

=

1

4

_

2 −

1

2

1

2

1

_

1

2

=

7

8

2. (a) The joint p. m. f. is given by the table

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 79

y¸x 0 1 2 3 P(Y = y)

0

4

3

216

3·4

2

216

3·4

216

1

216

125

216

1

3·4

2

216

3·2·4

216

3

216

0

75

216

2

3·4

216

3

216

0 0

15

216

3

1

216

0 0 0

1

216

P(X = x)

125

216

75

216

15

216

1

216

1

Alternatively, for x, y = 0, 1, 2, 3 and x +y ≤ 3,

P(X = x, Y = y) =

_

3

x

__

3 −x

y

__

1

6

_

x+y

_

4

6

_

3−x−y

(b) No. P(X = 3, Y = 3) = 0 and P(X = 3)P(Y = 3) ,= 0.

3. (a) From

c

_

1

0

(2 −x) dx = 1,

it follows that c =

2

3

.

(b) We have

P(Y ≤ 2X) = P(Y < 2X)

=

_

1

0

dy

_

1

y/2

4

9

(2 −x)(2 −y) dx

=

4

9

_

1

0

(2 −y) dy

_

1

y/2

(2 −x) dx

=

4

9

_

1

0

(2 −y) dy

_

2

_

1 −

y

2

_

−

1

2

_

1 −

y

2

4

__

=

4

9

_

1

0

(2 −y)

_

3

2

−y +

y

2

8

_

dy

=

4

9

_

1

0

_

3 −

7

2

y +

5

4

y

2

−

y

3

8

_

dy

=

4

9

_

3 −

7

4

+

5

12

−

1

32

_

.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 80

4. (a) For z ∈ [0, 1]

P(Z ≤ z) = 1 −P(both X and Y are above z) = 1 −(1 −z)

2

= 2z −z

2

,

so that

f

Z

(z) = 2(1 −z),

for z ∈ [0, 1] and 0 otherwise.

(b) From (a), we conclude that P(Z ≤ 0.5) =

3

4

and P(X ≤ 0.5, Z ≤ 0.5) = P(X ≤ 0.5) =

1

2

,

so the answer is

2

3

.

(c) No: Z ≤ X.

5. (a) Assume that x ∈ [0, 1]. As

f

X

(x) =

_

x

0

3xdy = 3x

2

,

we have

f

Y

(y[X = x) =

f(x, y)

f

X

(x)

=

3x

3x

2

=

1

x

,

for 0 ≤ y ≤ x. In other words, Y is uniform on [0, x].

(b) As the answer in (a) depends on x, the two random variables are not independent.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 81

Interlude: Practice Midterm 2

This practice exam covers the material from chapters 5 through 7. Give yourself 50 minutes to

solve the four problems, which you may assume have equal point score.

1. A random variable X has density function

f(x) =

_

c(x +x

2

), x ∈ [0, 1],

0, otherwise.

(a) Determine c.

(b) Compute E(1/X).

(c) Determine the probability density function of Y = X

2

.

2. A certain country holds a presidential election, with two candidates running for oﬃce. Not

satisﬁed with their choice, each voter casts a vote independently at random, based on the

outcome of a fair coin ﬂip. At the end, there are 4,000,000 valid votes, as well as 20,000 invalid

votes.

(a) Using a relevant approximation, compute the probability that, in the ﬁnal count of valid

votes only, the numbers for the two candidates will diﬀer by less than 1000 votes.

(b) Each invalid vote is double-checked independently with probability 1/5000. Using a relevant

approximation, compute the probability that at least 3 invalid votes are double-checked.

3. Toss a fair coin 5 times. Let X be the total number of Heads among the ﬁrst three tosses and

Y the total number of Heads among the last three tosses. (Note that, if the third toss comes

out Heads, it is counted both into X and into Y ).

(a) Write down the joint probability mass function of X and Y .

(b) Are X and Y independent? Explain.

(c) Compute the conditional probability P(X ≥ 2 [ X ≥ Y ).

4. Every working day, John comes to the bus stop exactly at 7am. He takes the ﬁrst bus

that arrives. The arrival of the ﬁrst bus is an exponential random variable with expectation 20

minutes.

Also, every working day, and independently, Mary comes to the same bus stop at a random

time, uniformly distributed between 7 and 7:30.

(a) What is the probability that tomorrow John will wait for more than 30 minutes?

(b) Assume day-to-day independence. Consider Mary late if she comes after 7:20. What is the

probability that Mary will be late on 2 or more working days among the next 10 working days?

(c) What is the probability that John and Mary will meet at the station tomorrow?

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 82

Solutions to Practice Midterm 2

1. A random variable X has density function

f(x) =

_

c(x +x

2

), x ∈ [0, 1],

0, otherwise.

(a) Determine c.

Solution:

Since

1 = c

_

1

0

(x +x

2

) dx,

1 = c

_

1

2

+

1

3

_

,

1 =

5

6

c,

and so c =

6

5

.

(b) Compute E(1/X).

Solution:

E

_

1

X

_

=

6

5

_

1

0

1

x

(x +x

2

) dx

=

6

5

_

1

0

(1 +x) dx

=

6

5

_

1 +

1

2

_

=

9

5

.

(c) Determine the probability density function of Y = X

2

.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 83

Solution:

The values of Y are in [0, 1], so we will assume that y ∈ [0, 1]. Then,

F(y) = P(Y ≤ y)

= P(X

2

≤ y)

= P(X ≤

√

y)

=

_

√

y

0

6

5

(x +x

2

) dx,

and so

f

Y

(y) =

d

dy

F

Y

(y)

=

6

5

(

√

y +y)

1

2

√

y

=

3

5

(1 +

√

y).

2. A certain country holds a presidential election, with two candidates running for oﬃce. Not

satisﬁed with their choice, each voter casts a vote independently at random, based on the

outcome of a fair coin ﬂip. At the end, there are 4,000,000 valid votes, as well as 20,000

invalid votes.

(a) Using a relevant approximation, compute the probability that, in the ﬁnal count of

valid votes only, the numbers for the two candidates will diﬀer by less than 1000

votes.

Solution:

Let S

n

be the vote count for candidate 1. Thus, S

n

is Binomial(n, p), where n =

4, 000, 000 and p =

1

2

. Then, n −S

n

is the vote count for candidate 2.

P([S

n

−(n −S

n

)[ ≤ 1000) = P(−1000 ≤ 2S

n

−n ≤ 1000)

= P

_

_

−500

_

n

1

2

1

2

≤

S

n

−n/2

_

n

1

2

1

2

≤

500

_

n

1

2

1

2

_

_

≈ P(−0.5 ≤ Z ≤ 0.5)

= P(Z ≤ 0.5) −P(Z ≤ −0.5)

= P(Z ≤ 0.5) −(1 −P(Z ≤ 0.5))

= 2P(Z ≤ 0.5) −1

= 2Φ(0.5) −1

≈ 2 0.6915 −1

= 0.383.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 84

(b) Each invalid vote is double-checked independently with probability 1/5000. Using

a relevant approximation, compute the probability that at least 3 invalid votes are

double-checked.

Solution:

Now, let S

n

be the number of double-checked votes, which is Binomial

_

20000,

1

5000

_

and thus approximately Poisson(4). Then,

P(S

n

≥ 3) = 1 −P(S

n

= 0) −P(S

n

= 1) −P(S

n

= 2)

≈ 1 −e

−4

−4e

−4

−

4

2

2

e

−4

= 1 −13e

−4

.

3. Toss a fair coin 5 times. Let X be the total number of Heads among the ﬁrst three tosses

and Y the total number of Heads among the last three tosses. (Note that, if the third toss

comes out Heads, it is counted both into X and into Y ).

(a) Write down the joint probability mass function of X and Y .

Solution:

P(X = x, Y = y) is given by the table

x¸y 0 1 2 3

0 1/32 2/32 1/32 0

1 2/32 5/32 4/32 1/32

2 1/32 4/32 5/32 2/32

3 0 1/32 2/32 1/32

To compute these, observe that the number of outcomes is 2

5

= 32. Then,

P(X = 2, Y = 1) = P(X = 2, Y = 1, 3rd coin Heads)

+P(X = 2, Y = 1, 3rd coin Tails)

=

2

32

+

2

32

=

4

32

,

P(X = 2, Y = 2) =

2 2

32

+

1

32

=

5

32

,

P(X = 1, Y = 1) =

2 2

32

+

1

32

=

5

32

,

etc.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 85

(b) Are X and Y independent? Explain.

Solution:

No,

P(X = 0, Y = 3) = 0 ,=

1

8

1

8

= P(X = 0)P(Y = 3).

(c) Compute the conditional probability P(X ≥ 2 [ X ≥ Y ).

Solution:

P(X ≥ 2[X ≥ Y ) =

P(X ≥ 2, X ≥ Y )

P(X ≥ Y )

=

1 + 4 + 5 + 1 + 2 + 1

1 + 2 + 1 + 5 + 4 + 1 + 5 + 2 + 1

=

14

22

=

7

11

.

4. Every working day, John comes to the bus stop exactly at 7am. He takes the ﬁrst bus that

arrives. The arrival of the ﬁrst bus is an exponential random variable with expectation 20

minutes.

Also, every working day, and independently, Mary comes to the same bus stop at a random

time, uniformly distributed between 7 and 7:30.

(a) What is the probability that tomorrow John will wait for more than 30 minutes?

Solution:

Assume that the time unit is 10 minutes. Let T be the arrival time of the bus.

It is Exponential with parameter λ =

1

2

. Then,

f

T

(t) =

1

2

e

−t/2

,

for t ≥ 0, and

P(T ≥ 3) = e

−3/2

.

7 JOINT DISTRIBUTIONS AND INDEPENDENCE 86

(b) Assume day-to-day independence. Consider Mary late if she comes after 7:20. What

is the probability that Mary will be late on 2 or more working days among the next

10 working days?

Solution:

Let X be Mary’s arrival time. It is uniform on [0, 3]. Therefore,

P(X ≥ 2) =

1

3

.

The number of late days among 10 days is Binomial

_

10,

1

3

_

and, therefore,

P(2 or more late working days among 10 working days)

= 1 −P(0 late) −P(1 late)

= 1 −

_

2

3

_

10

−10

1

3

_

2

3

_

9

.

(c) What is the probability that John and Mary will meet at the station tomorrow?

Solution:

We have

f

(T,X)

(t, x) =

1

6

e

−t/2

,

for x ∈ [0, 3] and t ≥ 0. Therefore,

P(X ≥ T) =

1

3

_

3

0

dx

_

∞

x

1

2

e

−t/2

dt

=

1

3

_

3

0

e

−x/2

dx

=

2

3

(1 −e

−3/2

).

8 MORE ON EXPECTATION AND LIMIT THEOREMS 87

8 More on Expectation and Limit Theorems

Given a pair of random variables (X, Y ) with joint density f and another function g of two

variables,

Eg(X, Y ) =

_ _

g(x, y)f(x, y) dxdy;

if instead (X, Y ) is a discrete pair with joint probability mass function p, then

Eg(X, Y ) =

x,y

g(x, y)p(x, y).

Example 8.1. Assume that two among the 5 items are defective. Put the items in a random

order and inspect them one by one. Let X be the number of inspections needed to ﬁnd the ﬁrst

defective item and Y the number of additional inspections needed to ﬁnd the second defective

item. Compute E[X −Y [.

The joint p. m. f. of (X, Y ) is given by the following table, which lists P(X = i, Y = j),

together with [i −j[ in parentheses, whenever the probability is nonzero:

i¸j 1 2 3 4

1 .1 (0) .1 (1) .1 (2) .1 (3)

2 .1 (1) .1 (0) .1 (1) 0

3 .1 (2) .1 (1) 0 0

4 .1 (3) 0 0 0

The answer is, therefore, E[X −Y [ = 1.4.

Example 8.2. Assume that (X, Y ) is a random point in the right triangle ¦(x, y) : x, y ≥

0, x +y ≤ 1¦. Compute EX, EY , and EXY .

Note that the density is 2 on the triangle, and so,

EX =

_

1

0

dx

_

1−x

0

x2 dy

=

_

1

0

2x(1 −x) dx

= 2

_

1

2

−

1

3

_

=

1

3

,

8 MORE ON EXPECTATION AND LIMIT THEOREMS 88

and, therefore, by symmetry, EY = EX =

1

3

. Furthermore,

E(XY ) =

_

1

0

dx

_

1−x

0

xy2 dy

=

_

1

0

2x

(1 −x)

2

2

dx

=

_

1

0

(1 −x)x

2

dx

=

1

3

−

1

4

=

1

12

.

Linearity and monotonicity of expectation

Theorem 8.1. Expectation is linear and monotone:

1. For constants a and b, E(aX +b) = aE(X) +b.

2. For arbitrary random variables X

1

, . . . , X

n

whose expected values exist,

E(X

1

+. . . +X

n

) = E(X

1

) +. . . +E(X

n

).

3. For two random variables X ≤ Y , we have EX ≤ EY .

Proof. We will check the second property for n = 2 and the continuous case, that is,

E(X +Y ) = EX +EY.

This is a consequence of the same property for two-dimensional integrals:

__

(x +y)f(x, y) dxdy =

__

xf(x, y) dxdy +

__

yf(x, y) dxdy.

To prove this property for arbitrary n (and the continuous case), one can simply proceed by

induction.

By the way we deﬁned expectation, the third property is not immediately obvious. However,

it is clear that Z ≥ 0 implies EZ ≥ 0 and, applying this to Z = Y −X, together with linearity,

establishes monotonicity.

We emphasize again that linearity holds for arbitrary random variables which do not need

to be independent! This is very useful. For example, we can often write a random variable X

as a sum of (possibly dependent) indicators, X = I

1

+ + I

n

. An instance of this method is

called the indicator trick.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 89

Example 8.3. Assume that an urn contains 10 black, 7 red, and 5 white balls. Select 5 balls

(a) with and (b) without replacement and let X be the number of red balls selected. Compute

EX.

Let I

i

be the indicator of the event that the ith ball is red, that is,

I

i

= I

{ith ball is red}

=

_

1 if ith ball is red,

0 otherwise.

In both cases, X = I

1

+I

2

+I

3

+I

4

+I

5

.

In (a), X is Binomial(5,

7

22

), so we know that EX = 5

7

22

, but we will not use this knowledge.

Instead, it is clear that

EI

1

= 1 P(1st ball is red) =

7

22

= EI

2

= . . . = EI

5

.

Therefore, by additivity, EX = 5

7

22

.

For (b), one solution is to compute the p. m. f. of X,

P(X = i) =

_

7

i

__

15

5−i

_

_

22

5

_ ,

where i = 0, 1, . . . , 5, and then

EX =

5

i=0

i

_

7

i

__

15

5−i

_

_

22

5

_ .

However, the indicator trick works exactly as before (the fact that I

i

are now dependent does

not matter) and so the answer is also exactly the same, EX = 5

7

22

.

Example 8.4. Matching problem, revisited. Assume n people buy n gifts, which are then

assigned at random, and let X be the number of people who receive their own gift. What is

EX?

This is another problem very well suited for the indicator trick. Let

I

i

= I

{person i receives own gift}

.

Then,

X = I

1

+I

2

+. . . +I

n

.

Moreover,

EI

i

=

1

n

,

for all i, and so

EX = 1.

Example 8.5. Five married couples are seated around a table at random. Let X be the number

of wives who sit next to their husbands. What is EX?

8 MORE ON EXPECTATION AND LIMIT THEOREMS 90

Now, let

I

i

= I

{wife i sits next to her husband}

.

Then,

X = I

1

+. . . +I

5

,

and

EI

i

=

2

9

,

so that

EX =

10

9

.

Example 8.6. Coupon collector problem, revisited. Sample from n cards, with replacement,

indeﬁnitely. Let N be the number of cards you need to sample for a complete collection, i.e., to

get all diﬀerent cards represented. What is EN?

Let N

i

be the number the number of additional cards you need to get the ith new card, after

you have received the (i −1)st new card.

Then, N

1

, the number of cards needed to receive the ﬁrst new card, is trivial, as the ﬁrst

card you buy is new: N

1

= 1. Afterward, N

2

, the number of additional cards needed to get

the second new card is Geometric with success probability

n−1

n

. After that, N

3

, the number of

additional cards needed to get the third new card is Geometric with success probability

n−2

n

. In

general, N

i

is geometric with success probability

n−i+1

n

, i = 1, . . . , n, and

N = N

1

+. . . +N

n

,

so that

EN = n

_

1 +

1

2

+

1

3

+. . . +

1

n

_

.

Now, we have

n

i=2

1

i

≤

_

n

1

1

x

dx ≤

n−1

i=1

1

i

,

by comparing the integral with the Riemman sum at the left and right endpoints in the division

of [1, n] into [1, 2], [2, 3], . . . , [n −1, n], and so

log n ≤

n

i=1

1

i

≤ log n + 1,

which establishes the limit

lim

n→∞

EN

nlog n

= 1.

Example 8.7. Assume that an urn contains 10 black, 7 red, and 5 white balls. Select 5 balls

(a) with replacement and (b) without replacement, and let W be the number of white balls

selected, and Y the number of diﬀerent colors. Compute EW and EY .

8 MORE ON EXPECTATION AND LIMIT THEOREMS 91

We already know that

EW = 5

5

22

in either case.

Let I

b

, I

r

, and I

w

be the indicators of the event that, respectively, black, red, and white balls

are represented. Clearly,

Y = I

b

+I

r

+I

w

,

and so, in the case with replacement

EY = 1 −

12

5

22

5

+ 1 −

15

5

22

5

+ 1 −

17

5

22

5

≈ 2.5289,

while in the case without replacement

EY = 1 −

_

12

5

_

_

22

5

_ + 1 −

_

15

5

_

_

22

5

_ + 1 −

_

17

5

_

_

22

5

_ ≈ 2.6209.

Expectation and independence

Theorem 8.2. Multiplicativity of expectation for independent factors.

The expectation of the product of independent random variables is the product of their expecta-

tions, i.e., if X and Y are independent,

E[g(X)h(Y )] = Eg(X) Eh(Y )

Proof. For the continuous case,

E[g(X)h(Y )] =

__

g(x)h(y)f(x, y) dxdy

=

__

g(x)h(y)f

X

(x)g

Y

(y) dxdy

=

_

g(x)f

X

(x) dx

_

h(y)f

Y

(y) dy

= Eg(X) Eh(Y )

Example 8.8. Let us return to a random point (X, Y ) in the triangle ¦(x, y) : x, y ≥ 0, x+y ≤

1¦. We computed that E(XY ) =

1

12

and that EX = EY =

1

3

. The two random variables have

E(XY ) ,= EX EY , thus, they cannot be independent. Of course, we already knew that they

were not independent.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 92

If, instead, we pick a random point (X, Y ) in the square ¦(x, y) : 0 ≤ x, y ≤ 1¦, X and Y

are independent and, therefore, E(XY ) = EX EY =

1

4

.

Finally, pick a random point (X, Y ) in the diamond of radius 1, that is, in the square with

corners at (0, 1), (1, 0), (0, −1), and (−1, 0). Clearly, we have, by symmetry,

EX = EY = 0,

but also

E(XY ) =

1

2

_

1

−1

dx

_

1−|x|

−1+|x|

xy dy

=

1

2

_

1

−1

xdx

_

1−|x|

−1+|x|

y dy

=

1

2

_

1

−1

xdx 0

= 0.

This is an example where E(XY ) = EX EY even though X and Y are not independent.

Computing expectation by conditioning

For a pair of random variables (X, Y ), we deﬁne the conditional expectation of Y given

X = x by

E(Y [X = x) =

y

y P(Y = y[X = x) (discrete case),

=

_

y

y f

Y

(y[X = x) (continuous case).

Observe that E(Y [X = x) is a function of x; let us call it g(x) for a moment. We denote

g(X) by E(Y [X). This is the expectation of Y provided the value X is known; note again that

this is an expression dependent on X and so we can compute its expectation. Here is what we

get.

Theorem 8.3. Tower property.

The formula E(E(Y [X)) = EY holds; less mysteriously, in the discrete case

EY =

x

E(Y [X = x) P(X = x),

and, in the continuous case,

EY =

_

E(Y [X = x)f

X

(x) dx.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 93

Proof. To verify this in the discrete case, we write out the expectation inside the sum:

x

y

yP(Y = y[X = x) P(X = x) =

x

y

y

P(X = x, Y = y)

P(X = x)

P(X = x)

=

x

y

yP(X = x, Y = y)

= EY

Example 8.9. Once again, consider a random point (X, Y ) in the triangle ¦(x, y) : x, y ≥

0, x +y ≤ 1¦. Given that X = x, Y is distributed uniformly on [0, 1 −x] and so

E(Y [X = x) =

1

2

(1 −x).

By deﬁnition, E(Y [X) =

1

2

(1 −X), and the expectation of

1

2

(1 −X) must, therefore, equal the

expectation of Y , and, indeed, it does as EX = EY =

1

3

, as we know.

Example 8.10. Roll a die and then toss as many coins as shown up on the die. Compute the

expected number of Heads.

Let X be the number on the die and let Y be the number of Heads. Fix an x ∈ ¦1, 2, . . . , 6¦.

Given that X = x, Y is Binomial(x,

1

2

). In particular,

E(Y [X = x) = x

1

2

,

and, therefore,

E(number of Heads) = EY

=

6

x=1

x

1

2

P(X = x)

=

6

x=1

x

1

2

1

6

=

7

4

.

Example 8.11. Here is another job interview question. You die and are presented with three

doors. One of them leads to heaven, one leads to one day in purgatory, and one leads to two

days in purgatory. After your stay in purgatory is over, you go back to the doors and pick again,

but the doors are reshuﬄed each time you come back, so you must in fact choose a door at

random each time. How long is your expected stay in purgatory?

Code the doors 0, 1, and 2, with the obvious meaning, and let N be the number of days in

purgatory.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 94

Then

E(N[your ﬁrst pick is door 0) = 0,

E(N[your ﬁrst pick is door 1) = 1 +EN,

E(N[your ﬁrst pick is door 2) = 2 +EN.

Therefore,

EN = (1 +EN)

1

3

+ (2 +EN)

1

3

,

and solving this equation gives

EN = 3.

Covariance

Let X, Y be random variables. We deﬁne the covariance of (or between) X and Y as

Cov(X, Y ) = E((X −EX)(Y −EY ))

= E(XY −(EX) Y −(EY ) X +EX EY )

= E(XY ) −EX EY −EY EX +EX EY

= E(XY ) −EX EY.

To summarize, the most useful formula is

Cov(X, Y ) = E(XY ) −EX EY.

Note immediately that, if X and Y are independent, then Cov(X, Y ) = 0, but the converse

is false.

Let X and Y be indicator random variables, so X = I

A

and Y = I

B

, for two events A and

B. Then, EX = P(A), EY = P(B), E(XY ) = E(I

A∩B

) = P(A∩ B), and so

Cov(X, Y ) = P(A∩ B) −P(A)P(B) = P(A)[P(B[A) −P(B)].

If P(B[A) > P(B), we say the two events are positively correlated and, in this case, the covariance

is positive; if the events are negatively correlated all inequalities are reversed. For general random

variables X and Y , Cov(X, Y ) > 0 intuitively means that, “on the average,” increasing X will

result in larger Y .

Variance of sums of random variables

Theorem 8.4. Variance-covariance formula:

E(

n

i=1

X

i

)

2

=

n

i=1

EX

2

i

+

i=j

E(X

i

X

j

),

Var(

n

i=1

X

i

) =

n

i=1

Var(X

i

) +

i=j

Cov(X

i

, X

j

).

8 MORE ON EXPECTATION AND LIMIT THEOREMS 95

Proof. The ﬁrst formula follows from writing the sum

_

n

i=1

X

i

_

2

=

n

i=1

X

2

i

+

i=j

X

i

X

j

and linearity of expectation. The second formula follows from the ﬁrst:

Var(

n

i=1

X

i

) = E

_

n

i=1

X

i

−E(

n

i=1

X

i

)

_

2

= E

_

n

i=1

(X

i

−EX

i

)

_

2

=

n

i=1

Var(X

i

) +

i=j

E(X

i

−EX

i

)E(X

j

−EX

j

),

which is equivalent to the formula.

Corollary 8.5. Linearity of variance for independent summands.

If X

1

, X

2

, . . . , X

n

are independent, then Var(X

1

+. . . +X

n

) = Var(X

1

)+. . . +Var(X

n

).

The variance-covariance formula often makes computing variance possible even if the random

variables are not independent, especially when they are indicators.

Example 8.12. Let S

n

be Binomial(n, p). We will now fulﬁll our promise from Chapter 5 and

compute its expectation and variance.

The crucial observation is that S

n

=

n

i=1

I

i

, where I

i

is the indicator I

{ith trial is a success}

.

Therefore, I

i

are independent. Then, ES

n

= np and

Var(S

n

) =

n

i=1

Var(I

i

)

=

n

i=1

(EI

i

−(EI

i

)

2

)

= n(p −p

2

)

= np(1 −p).

Example 8.13. Matching problem, revisited yet again. Recall that X is the number of people

who get their own gift. We will compute Var(X).

8 MORE ON EXPECTATION AND LIMIT THEOREMS 96

Recall also that X =

n

i=1

I

i

, where I

i

= I

{ith person gets own gift}

, so that EI

i

=

1

n

and

E(X

2

) = n

1

n

+

i=j

E(I

i

I

j

)

= 1 +

i=j

1

n(n −1)

= 1 + 1

= 2.

The E(I

i

I

j

) above is the probability that the ith person and jth person both get their own gifts

and, thus, equals

1

n(n−1)

. We conclude that Var(X) = 1. (In fact, X is, for large n, very close

to Poisson with λ = 1.)

Example 8.14. Roll a die 10 times. Let X be the number of 6’s rolled and Y be the number

of 5’s rolled. Compute Cov(X, Y ).

Observe that X =

10

i=1

I

i

, where I

i

= I

{ith roll is 6}

, and Y =

10

i=1

J

i

, where J

i

= I

{ith roll is 5}

.

Then, EX = EY =

10

6

=

5

3

. Moreover, E(I

i

J

j

) equals 0 if i = j (because both 5 and 6 cannot

be rolled on the same roll), and E(I

i

J

j

) =

1

6

2

if i ,= j (by independence of diﬀerent rolls).

Therefore,

EXY =

10

i=1

10

j=1

E(I

i

J

j

)

=

i=j

1

6

2

=

10 9

36

=

5

2

and

Cov(X, Y ) =

5

2

−

_

5

3

_

2

= −

5

18

is negative, as should be expected.

Weak Law of Large Numbers

Assume that an experiment is performed in which an event Ahappens with probability P(A) = p.

At the beginning of the course, we promised to attach a precise meaning to the statement: “If

you repeat the experiment, independently, a large number of times, the proportion of times A

happens converges to p.” We will now do so and actually prove the more general statement

below.

Theorem 8.6. Weak Law of Large Numbers.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 97

If X, X

1

, X

2

, . . . are independent and identically distributed random variables with ﬁnite expec-

tation and variance, then

X

1

+...+X

n

n

converges to EX in the sense that, for any ﬁxed > 0,

P

_¸

¸

¸

¸

X

1

+. . . +X

n

n

−EX

¸

¸

¸

¸

≥

_

→0,

as n →∞.

In particular, if S

n

is the number of successes in n independent trials, each of which is

a success with probability p, then, as we have observed before, S

n

= I

1

+ . . . + I

n

, where

I

i

= I

{success at trial i}

. So, for every > 0,

P

_¸

¸

¸

¸

S

n

n

−p

¸

¸

¸

¸

≥

_

→0,

as n →∞. Thus, the proportion of successes converges to p in this sense.

Theorem 8.7. Markov Inequality. If X ≥ 0 is a random variable and a > 0, then

P(X ≥ a) ≤

1

a

EX.

Example 8.15. If EX = 1 and X ≥ 0, it must be that P(X ≥ 10) ≤ 0.1.

Proof. Here is the crucial observation:

I

{X≥a}

≤

1

a

X.

Indeed, if X < a, the left-hand side is 0 and the right-hand side is nonnegative; if X ≥ a, the

left-hand side is 1 and the right-hand side is at least 1. Taking the expectation of both sides,

we get

P(X ≥ a) = E(I

{X≥a}

) ≤

1

a

EX.

Theorem 8.8. Chebyshev inequality. If EX = µ and Var(X) = σ

2

are both ﬁnite and k > 0,

then

P([X −µ[ ≥ k) ≤

σ

2

k

2

.

Example 8.16. If EX = 1 and Var(X) = 1, P(X ≥ 10) ≤ P([X −1[ ≥ 9) ≤

1

81

.

Example 8.17. If EX = 1, Var(X) = 0.1,

P([X −1[ ≥ 0.5) ≤

0.1

0.5

2

=

2

5

.

As the previous two examples show, the Chebyshev inequality is useful if either σ is small

or k is large.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 98

Proof. By the Markov inequality,

P([X −µ[ ≥ k) = P((X −µ)

2

≥ k

2

) ≤

1

k

2

E(X −µ)

2

=

1

k

2

Var(X).

We are now ready to prove the Weak Law of Large Numbers.

Proof. Denote µ = EX, σ

2

= Var(X), and let S

n

= X

1

+. . . +X

n

. Then,

ES

n

= EX

1

+. . . +EX

n

= nµ

and

Var(S

n

) = nσ

2

.

Therefore, by the Chebyshev inequality,

P([S

n

−nµ[ ≥ n) ≤

nσ

2

n

2

2

=

σ

2

n

2

→0,

as n →∞.

A careful examination of the proof above will show that it remains valid if depends on n,

but goes to 0 slower than

1

√

n

. This suggests that

X

1

+...+Xn

n

converges to EX at the rate of

about

1

√

n

. We will make this statement more precise below.

Central Limit Theorem

Theorem 8.9. Central limit theorem.

Assume that X, X

1

, X

2

, . . . are independent, identically distributed random variables, with ﬁnite

µ = EX and σ

2

= V ar(X). Then,

P

_

X

1

+. . . +X

n

−µn

σ

√

n

≤ x

_

→P(Z ≤ x),

as n →∞, where Z is standard Normal.

We will not prove this theorem in full detail, but will later give good indication as to why

it holds. Observe, however, that it is a remarkable theorem: the random variables X

i

have an

arbitrary distribution (with given expectation and variance) and the theorem says that their

sum approximates a very particular distribution, the normal one. Adding many independent

copies of a random variable erases all information about its distribution other than expectation

and variance!

8 MORE ON EXPECTATION AND LIMIT THEOREMS 99

On the other hand, the convergence is not very fast; the current version of the celebrated

Berry-Esseen theorem states that an upper bound on the diﬀerence between the two probabilities

in the Central limit theorem is

0.4785

E[X −µ[

3

σ

3

√

n

.

Example 8.18. Assume that X

n

are independent and uniform on [0, 1]. Let S

n

= X

1

+. . .+X

n

.

(a) Compute approximately P(S

200

≤ 90). (b) Using the approximation, ﬁnd n so that P(S

n

≥

50) ≥ 0.99.

We know that EX

i

=

1

2

and Var(X

i

) =

1

3

−

1

4

=

1

12

.

For (a),

P(S

200

≤ 90) = P

_

_

S

200

−200

1

2

_

200

1

12

≤

90 −200

1

2

_

200

1

12

_

_

≈ P(Z ≤ −

√

6)

= 1 −P(Z ≤

√

6)

≈ 1 −0.993

= 0.007

For (b), we rewrite

P

_

_

S

n

−n

1

2

_

n

1

12

≥

50 −n

1

2

_

n

1

12

_

_

= 0.99,

and then approximate

P

_

_

Z ≥ −

n

2

−50

_

n

1

12

_

_

= 0.99,

or

P

_

_

Z ≤

n

2

−50

_

n

1

12

_

_

= Φ

_

_

n

2

−50

_

n

1

12

_

_

= 0.99.

Using the fact that Φ(z) = 0.99 for (approximately) z = 2.326, we get that

n −1.345

√

n −100 = 0

and that n = 115.

Example 8.19. A casino charges $1 for entrance. For promotion, they oﬀer to the ﬁrst 30,000

“guests” the following game. Roll a fair die:

• if you roll 6, you get free entrance and $2;

8 MORE ON EXPECTATION AND LIMIT THEOREMS 100

• if you roll 5, you get free entrance;

• otherwise, you pay the normal fee.

Compute the number s so that the revenue loss is at most s with probability 0.9.

In symbols, if L is lost revenue, we need to ﬁnd s so that

P(L ≤ s) = 0.9.

We have L = X

1

+ + X

n

, where n = 30, 000, X

i

are independent, and P(X

i

= 0) =

4

6

,

P(X

i

= 1) =

1

6

, and P(X

i

= 3) =

1

6

. Therefore,

EX

i

=

2

3

and

Var(X

i

) =

1

6

+

9

6

−

_

2

3

_

2

=

11

9

.

Therefore,

P(L ≤ s) = P

_

_

L −

2

3

n

_

n

11

9

≤

s −

2

3

n

_

n

11

9

_

_

≈ P

_

_

Z ≤

s −

2

3

n

_

n

11

9

_

_

= 0.9,

which gives

s −

2

3

n

_

n

11

9

≈ 1.28,

and ﬁnally,

s ≈

2

3

n + 1.28

_

n

11

9

≈ 20, 245.

Problems

1. An urn contains 2 white and 4 black balls. Select three balls in three successive steps without

replacement. Let X be the total number of white balls selected and Y the step in which you

selected the ﬁrst black ball. For example, if the selected balls are white, black, black, then

X = 1, Y = 2. Compute E(XY ).

8 MORE ON EXPECTATION AND LIMIT THEOREMS 101

2. The joint density of (X, Y ) is given by

_

f(x, y) = 3x if 0 ≤ y ≤ x ≤ 1,

0 otherwise.

Compute Cov(X, Y ).

3. Five married couples are seated at random in a row of 10 seats.

(a) Compute the expected number of women that sit next to their husbands.

(b) Compute the expected number of women that sit next to at least one man.

4. There are 20 birds that sit in a row on a wire. Each bird looks left or right with equal

probability. Let N be the number of birds not seen by any neighboring bird. Compute EN.

5. Recall that a full deck of cards contains 52 cards, 13 cards of each of the four suits. Distribute

the cards at random to 13 players, so that each gets 4 cards. Let N be the number of players

whose four cards are of the same suit. Using the indicator trick, compute EN.

6. Roll a fair die 24 times. Compute, using a relevant approximation, the probability that the

sum of numbers exceeds 100.

Solutions to problems

1. First we determine the joint p. m. f. of (X, Y ). We have

P(X = 0, Y = 1) = P(bbb) =

1

5

,

P(X = 1, Y = 1) = P(bwb or bbw) =

2

5

,

P(X = 1, Y = 2) = P(wbb) =

1

5

,

P(X = 2, Y = 1) = P(bww) =

1

15

,

P(X = 2, Y = 2) = P(wbw) =

1

15

,

P(X = 2, Y = 3) = P(wwb) =

1

15

,

so that

E(XY ) = 1

2

5

+ 2

1

5

+ 2

1

15

+ 4

1

15

+ 6

1

15

=

8

5

.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 102

2. We have

EX =

_

1

0

dx

_

x

0

x 3xdy =

_

1

0

3x

3

dx =

3

4

,

EX =

_

1

0

dx

_

x

0

y 3xdy =

_

1

0

3

2

x

3

dx =

3

8

,

E(XY ) =

_

1

0

dx

_

x

0

xy 3xdy =

_

1

0

3

2

x

4

dx =

3

10

,

so that Cov(X, Y ) =

3

10

−

3

4

3

8

=

3

160

.

3. (a) Let the number be M. Let I

i

= I

{couple i sits together}

. Then

EI

i

=

9! 2!

10!

=

1

5

,

and so

EM = EI

1

+. . . +EI

5

= 5 EI

1

= 1.

(b) Let the number be N. Let I

i

= I

{woman i sits next to a man}

. Then, by dividing into cases,

whereby the woman either sits on one of the two end chairs or on one of the eight middle chairs,

EI

i

=

2

10

5

9

+

8

10

_

1 −

4

9

3

8

_

=

7

9

,

and so

EM = EI

1

+. . . +EI

5

= 5 EI

1

=

35

9

.

4. For the two birds at either end, the probability that it is not seen is

1

2

, while for any other

bird this probability is

1

4

. By the indicator trick

EN = 2

1

2

+ 18

1

4

=

11

2

.

5. Let

I

i

= I

{player i has four cards of the same suit}

,

so that N = I

1

+... +I

13

. Observe that:

• the number of ways to select 4 cards from a 52 card deck is

_

52

4

_

;

• the number of choices of a suit is 4; and

• after choosing a suit, the number of ways to select 4 cards of that suit is

_

13

4

_

.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 103

Therefore, for all i,

EI

i

=

4

_

13

4

_

_

52

4

_

and

EN =

4

_

13

4

_

_

52

4

_ 13.

6. Let X

1

, X

2

, . . . be the numbers on successive rolls and S

n

= X

1

+ . . . + X

n

the sum. We

know that EX

i

=

7

2

, and Var(X

i

) =

35

12

. So, we have

P(S

24

≥ 100) = P

_

_

S

24

−24

7

2

_

24

35

12

≥

100 −24

7

2

_

24

35

12

_

_

≈ P(Z ≥ 1.85) = 1 −Φ(1.85) ≈ 0.032.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 104

Interlude: Practice Final

This practice exam covers the material from chapters 1 through 8. Give yourself 120 minutes

to solve the six problems, which you may assume have equal point score.

1. Recall that a full deck of cards contains 13 cards of each of the four suits (♣, ♦, ♥, ♠). Select

cards from the deck at random, one by one, without replacement.

(a) Compute the probability that the ﬁrst four cards selected are all hearts (♥).

(b) Compute the probability that all suits are represented among the ﬁrst four cards selected.

(c) Compute the expected number of diﬀerent suits among the ﬁrst four cards selected.

(d) Compute the expected number of cards you have to select to get the ﬁrst hearts card.

2. Eleven Scandinavians: 2 Swedes, 4 Norwegians, and 5 Finns are seated in a row of 11 chairs

at random.

(a) Compute the probability that all groups sit together (i.e., the Swedes occupy adjacent seats,

as do the Norwegians and Finns).

(b) Compute the probability that at least one of the groups sits together.

(c) Compute the probability that the two Swedes have exactly one person sitting between them.

3. You have two fair coins. Toss the ﬁrst coin three times and let X be the number of Heads.

Then toss the second coin X times, that is, as many times as the number of Heads in the ﬁrst

coin toss. Let Y be the number of Heads in the second coin toss. (For example, if X = 0, Y is

automatically 0; if X = 2, toss the second coin twice and count the number of Heads to get Y .)

(a) Determine the joint probability mass function of X and Y , that is, write down a formula for

P(X = i, Y = j) for all relevant i and j.

(b) Compute P(X ≥ 2 [ Y = 1).

4. Assume that 2, 000, 000 single male tourists visit Las Vegas every year. Assume also that

each of these tourists, independently, gets married while drunk with probability 1/1, 000, 000.

(a) Write down the exact probability that exactly 3 male tourists will get married while drunk

next year.

(b) Compute the expected number of such drunk marriages in the next 10 years.

(c) Write down a relevant approximate expression for the probability in (a).

(d) Write down an approximate expression for the probability that there will be no such drunk

marriage during at least one of the next 3 years.

5. Toss a fair coin twice. You win $2 if both tosses comes out Heads, lose $1 if no toss comes

out Heads, and win or lose nothing otherwise.

(a) What is the expected number of games you need to play to win once?

8 MORE ON EXPECTATION AND LIMIT THEOREMS 105

(b) Assume that you play this game 500 times. What is, approximately, the probability that

you win at least $135?

(c) Again, assume that you play this game 500 times. Compute (approximately) the amount

of money x such that your winnings will be at least x with probability 0.5. Then, do the same

with probability 0.9.

6. Two random variables X and Y are independent and have the same probability density

function

g(x) =

_

c(1 +x) x ∈ [0, 1],

0 otherwise.

(a) Find the value of c. Here and in (b): use

_

1

0

x

n

dx =

1

n+1

, for n > −1.

(b) Find Var(X +Y ).

(c) Find P(X +Y < 1) and P(X +Y ≤ 1). Here and in (d): when you get to a single integral

involving powers, stop.

(d) Find E[X −Y [.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 106

Solutions to Practice Final

1. Recall that a full deck of cards contains 13 cards of each of the four suits (♣, ♦, ♥, ♠).

Select cards from the deck at random, one by one, without replacement.

(a) Compute the probability that the ﬁrst four cards selected are all hearts (♥).

Solution:

_

13

4

_

_

52

4

_

(b) Compute the probability that all suits are represented among the ﬁrst four cards

selected.

Solution:

13

4

_

52

4

_

(c) Compute the expected number of diﬀerent suits among the ﬁrst four cards selected.

Solution:

If X is the number of suits represented, then X = I

♥

+ I

♦

+ I

♣

+ I

♠

, where I

♥

=

I

{♥ is represented}

, etc. Then,

EI

♥

= 1 −

_

39

4

_

_

52

4

_,

which is the same for the other three indicators, so

EX = 4 EI

♥

= 4

_

1 −

_

39

4

_

_

52

4

_

_

.

(d) Compute the expected number of cards you have to select to get the ﬁrst hearts card.

Solution:

Label non-♥ cards 1, . . . , 39 and let I

i

= I

{card i selected before any ♥ card}

. Then, EI

i

=

1

14

for any i. If N is the number of cards you have to select to get the ﬁrst hearts

card, then

EN = E (I

1

+ +EI

39

) =

39

14

.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 107

2. Eleven Scandinavians: 2 Swedes, 4 Norwegians, and 5 Finns are seated in a row of 11

chairs at random.

(a) Compute the probability that all groups sit together (i.e., the Swedes occupy adjacent

seats, as do the Norwegians and Finns).

Solution:

2! 4! 5! 3!

11!

(b) Compute the probability that at least one of the groups sits together.

Solution:

Deﬁne A

S

= ¦Swedes sit together¦ and, similarly, A

N

and A

F

. Then,

P(A

S

∪ A

N

∪ A

F

)

= P(A

S

) +P(A

N

) +P(A

F

)

−P(A

S

∩ A

N

) −P(A

S

∩ A

F

) −P(A

N

∩ A

F

)

+P(A

S

∩ A

N

∩ A

F

)

=

2! 10! + 4! 8! + 5! 7! −2! 4! 7! −2! 5! 6! −4! 5! 3! + 2! 4! 5! 3!

11!

(c) Compute the probability that the two Swedes have exactly one person sitting between

them.

Solution:

The two Swedes may occupy chairs 1, 3; or 2, 4; or 3, 5; ...; or 9, 11. There are exactly

9 possibilities, so the answer is

9

_

11

2

_ =

9

55

.

3. You have two fair coins. Toss the ﬁrst coin three times and let X be the number of Heads.

Then, toss the second coin X times, that is, as many times as you got Heads in the ﬁrst

coin toss. Let Y be the number of Heads in the second coin toss. (For example, if X = 0,

Y is automatically 0; if X = 2, toss the second coin twice and count the number of Heads

to get Y .)

8 MORE ON EXPECTATION AND LIMIT THEOREMS 108

(a) Determine the joint probability mass function of X and Y , that is, write down a

formula for P(X = i, Y = j) for all relevant i and j.

Solution:

P(X = i, Y = j) = P(X = i)P(Y = j[X = i) =

_

3

i

_

1

2

3

_

i

j

_

1

2

i

,

for 0 ≤ j ≤ i ≤ 3.

(b) Compute P(X ≥ 2 [ Y = 1).

Solution:

This equals

P(X = 2, Y = 1) +P(X = 3, Y = 1)

P(X = 1, Y = 1) +P(X = 2, Y = 1) +P(X = 3, Y = 1)

=

5

9

.

4. Assume that 2, 000, 000 single male tourists visit Las Vegas every year. Assume also that

each of these tourists independently gets married while drunk with probability 1/1, 000, 000.

(a) Write down the exact probability that exactly 3 male tourists will get married while

drunk next year.

Solution:

With X equal to the number of such drunk marriages, X is Binomial(n, p) with

p = 1/1, 000, 000 and n = 2, 000, 000, so we have

P(X = 3) =

_

n

3

_

p

3

(1 −p)

n−3

(b) Compute the expected number of such drunk marriages in the next 10 years.

Solution:

As X is binomial, its expected value is np = 2, so the answer is 10 EX = 20.

(c) Write down a relevant approximate expression for the probability in (a).

Solution:

We use that X is approximately Poisson with λ = 2, so the answer is

λ

3

3!

e

−λ

=

4

3

e

−2

.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 109

(d) Write down an approximate expression for the probability that there will be no such

drunk marriage during at least one of the next 3 years.

Solution:

This equals 1−P(at least one such marriage in each of the next 3 years), which equals

1 −

_

1 −e

−2

_

3

.

5. Toss a fair coin twice. You win $2 if both tosses comes out Heads, lose $1 if no toss comes

out Heads, and win or lose nothing otherwise.

(a) What is the expected number of games you need to play to win once?

Solution:

The probability of winning in

1

4

. The answer, the expectation of a Geometric

_

1

4

_

random variable, is 4.

(b) Assume that you play this game 500 times. What is, approximately, the probability

that you win at least $135?

Solution:

Let X be the winnings in one game and X

1

, X

2

, . . . , X

n

the winnings in successive

games, with S

n

= X

1

+. . . +X

n

. Then, we have

EX = 2

1

4

−1

1

4

=

1

4

and

Var(X) = 4

1

4

+ 1

1

4

−

_

1

4

_

2

=

19

16

.

Thus,

P(S

n

≥ 135) = P

_

_

S

n

−n

1

4

_

n

19

16

≥

135 −n

1

4

_

n

19

16

_

_

≈ P

_

_

Z ≥

135 −n

1

4

_

n

19

16

_

_

,

where Z is standard Normal. Using n = 500, we get the answer

1 −Φ

_

_

10

_

500

19

16

_

_

.

8 MORE ON EXPECTATION AND LIMIT THEOREMS 110

(c) Again, assume that you play this game 500 times. Compute (approximately) the

amount of money x such that your winnings will be at least x with probability 0.5.

Then, do the same with probability 0.9.

Solution:

For probability 0.5, the answer is exactly ES

n

= n

1

4

= 125. For probability 0.9, we

approximate

P(S

n

≥ x) ≈ P

_

_

Z ≥

x −125

_

500

19

16

_

_

= P

_

_

Z ≤

125 −x

_

500

19

16

_

_

= Φ

_

_

125 −x

_

500

19

16

_

_

,

where we have used that x < 125. Then, we use that Φ(z) = 0.9 at z ≈ 1.28, leading

to the equation

125 −x

_

500

19

16

= 1.28

and, therefore,

x = 125 −1.28

_

500

19

16

.

(This gives x = 93.)

6. Two random variables X and Y are independent and have the same probability density

function

g(x) =

_

c(1 +x) x ∈ [0, 1],

0 otherwise.

(a) Find the value of c. Here and in (b): use

_

1

0

x

n

dx =

1

n+1

, for n > −1.

Solution:

As

1 = c

_

1

0

(1 +x) dx = c

3

2

,

we have c =

2

3

(b) Find Var(X +Y ).

Solution:

8 MORE ON EXPECTATION AND LIMIT THEOREMS 111

By independence, this equals 2Var(X) = 2(E(X

2

) −(EX)

2

). Moreover,

E(X) =

2

3

_

1

0

x(1 +x) dx =

5

9

,

E(X

2

) =

2

3

_

1

0

x

2

(1 +x) dx =

7

18

,

and the answer is

13

81

.

(c) Find P(X + Y < 1) and P(X + Y ≤ 1). Here and in (d): when you get to a single

integral involving powers, stop.

Solution:

The two probabilities are both equal to

_

2

3

_

2

_

1

0

dx

_

1−x

0

(1 +x)(1 +y) dy =

_

2

3

_

2

_

1

0

(1 +x)

_

(1 −x) +

(1 −x)

2

2

_

dx.

(d) Find E[X −Y [.

Solution:

This equals

_

2

3

_

2

2

_

1

0

dx

_

x

0

(x −y)(1 +x)(1 +y) dy =

_

2

3

_

2

2

_

1

0

_

x(1 +x)

_

x +

x

2

2

_

−(1 +x)

_

x

2

2

+

x

3

3

__

dx.

9 CONVERGENCE IN PROBABILITY 112

9 Convergence in probability

One of the goals of probability theory is to extricate a useful deterministic quantity out of a

random situation. This is typically possible when a large number of random eﬀects cancel each

other out, so some limit is involved. In this chapter we consider the following setting: given a

sequence of random variables, Y

1

, Y

2

, . . ., we want to show that, when n is large, Y

n

is approx-

imately f(n), for some simple deterministic function f(n). The meaning of “approximately” is

what we now make clear.

A sequence Y

1

, Y

2

, . . . of random variables converges to a number a in probability if, as n →∞,

P([Y

n

−a[ ≤ ) converges to 1, for any ﬁxed > 0. This is equivalent to P([Y

n

−a[ > ) →0 as

n →∞, for any ﬁxed > 0.

Example 9.1. Toss a fair coin n times, independently. Let R

n

be the “longest run of Heads,”

i.e., the longest sequence of consecutive tosses of Heads. For example, if n = 15 and the tosses

come out

HHTTHHHTHTHTHHH,

then R

n

= 3. We will show that, as n →∞,

R

n

log

2

n

→1,

in probability. This means that, to a ﬁrst approximation, one should expect about 20 consecutive

Heads somewhere in a million tosses.

To solve a problem such as this, we need to ﬁnd upper bounds for probabilities that R

n

is

large and that it is small, i.e., for P(R

n

≥ k) and P(R

n

≤ k), for appropriately chosen k. Now,

for arbitrary k,

P(R

n

≥ k) = P (k consecutive Heads start at some i, 0 ≤ i ≤ n −k + 1)

= P(

n−k+1

_

i=1

¦i is the ﬁrst Heads in a succession of at least k Heads¦)

≤ n

1

2

k

.

For the lower bound, divide the string of size n into disjoint blocks of size k. There are

¸

n

k

| such blocks (if n is not divisible by k, simply throw away the leftover smaller block at the

end). Then, R

n

≥ k as soon as one of the blocks consists of Heads only; diﬀerent blocks are

independent. Therefore,

P(R

n

< k) ≤

_

1 −

1

2

k

_

n

k

≤ exp

_

−

1

2

k

_

n

k

_

_

,

9 CONVERGENCE IN PROBABILITY 113

using the famous inequality 1 −x ≤ e

−x

, valid for all x.

Below, we will use the following trivial inequalities, valid for any real number x ≥ 2: ¸x| ≥

x −1, ¸x| ≤ x + 1, x −1 ≥

x

2

, and x + 1 ≤ 2x.

To demonstrate that

R

n

log

2

n

→1, in probability, we need to show that, for any > 0,

P(R

n

≥ (1 +) log

2

n) → 0, (1)

P(R

n

≤ (1 −) log

2

n) → 0, (2)

as

P

_¸

¸

¸

¸

R

n

log

2

n

−1

¸

¸

¸

¸

≥

_

= P

_

R

n

log

2

n

≥ 1 + or

R

n

log

2

n

≤ 1 −

_

= P

_

R

n

log

2

n

≥ 1 +

_

+P

_

R

n

log

2

n

≤ 1 −

_

= P (R

n

≥ (1 +) log

2

n) +P (R

n

≤ (1 −) log

2

n) .

A little fussing in the proof comes from the fact that (1 ± ) log

2

n are not integers. This is

common in such problems. To prove (1), we plug k = ¸(1 + ) log

2

n| into the upper bound to

get

P(R

n

≥ (1 +) log

2

n) ≤ n

1

2

(1+) log

2

n−1

= n

2

n

1+

=

2

n

→0,

as n → ∞. On the other hand, to prove (2) we need to plug k = ¸(1 − ) log

2

n| + 1 into the

lower bound,

P(R

n

≤ (1 −) log

2

n) ≤ P(R

n

< k)

≤ exp

_

−

1

2

k

_

n

k

_

_

≤ exp

_

−

1

2

k

_

n

k

−1

_

_

≤ exp

_

−

1

32

1

n

1−

n

(1 −) log

2

n

_

= exp

_

−

1

32

n

(1 −) log

2

n

_

→ 0,

as n →∞, as n

**is much larger than log
**

2

n.

9 CONVERGENCE IN PROBABILITY 114

The most basic tool in proving convergence in probability is the Chebyshev inequality: if X

is a random variable with EX = µ and Var(X) = σ

2

, then

P([X −µ[ ≥ k) ≤

σ

2

k

2

,

for any k > 0. We proved this inequality in the previous chapter and we will use it to prove the

next theorem.

Theorem 9.1. Connection between variance and convergence in probability.

Assume that Y

n

are random variables and that a is a constant such

that

EY

n

→a,

Var(Y

n

) →0,

as n →∞. Then,

Y

n

→a,

as n →∞, in probability.

Proof. Fix an > 0. If n is so large that

[EY

n

−a[ < /2,

then

P([Y

n

−a[ > ) ≤ P([Y

n

−EY

n

[ > /2)

≤ 4

Var(Y

n

)

2

→ 0,

as n →∞. Note that the second inequality in the computation is the Chebyshev inequality.

This is most often applied to sums of random variables. Let

S

n

= X

1

+. . . +X

n

,

where X

i

are random variables with ﬁnite expectation and variance. Then, without any inde-

pendence assumption,

ES

n

= EX

1

+. . . +EX

n

and

E(S

2

n

) =

n

i=1

EX

2

i

+

i=j

E(X

i

X

j

),

Var(S

n

) =

n

i=1

Var(X

i

) +

i=j

Cov(X

i

, X

j

).

9 CONVERGENCE IN PROBABILITY 115

Recall that

Cov(X

1

, X

j

) = E(X

i

X

j

) −EX

i

EX

j

and

Var(aX) = a

2

Var(X).

Moreover, if X

i

are independent,

Var(X

1

+. . . +X

n

) = Var(X

1

) +. . . + Var(X

n

).

Continuing with the review, let us reformulate and prove again the most famous convergence in

probability theorem. We will use the common abbreviation i. i. d. for independent identically

distributed random variables.

Theorem 9.2. Weak law of large numbers. Let X, X

1

, X

2

, . . . be i. i. d. random variables with

with EX = µ and Var(X) = σ

2

< ∞. Let S

n

= X

1

+. . . +X

n

. Then, as n →∞,

S

n

n

→µ

in probability.

Proof. Let Y

n

=

Sn

n

. We have EY

n

= µ and

Var(Y

n

) =

1

n

2

Var(S

n

) =

1

n

2

nσ

2

=

σ

2

n

.

Thus, we can simply apply the previous theorem.

Example 9.2. We analyze a typical “investment” (the accepted euphemism for gambling on

ﬁnancial markets) problem. Assume that you have two investment choices at the beginning of

each year:

• a risk-free “bond” which returns 6% per year; and

• a risky “stock” which increases your investment by 50% with probability 0.8 and wipes it

out with probability 0.2.

Putting an amount s in the bond, then, gives you 1.06s after a year. The same amount in the

stock gives you 1.5s with probability 0.8 and 0 with probability 0.2; note that the expected value

is 0.8 1.5s = 1.2s > 1.06s. We will assume year-to-year independence of the stock’s return.

We will try to maximize the return to our investment by “hedging.” That is, we invest, at

the beginning of each year, a ﬁxed proportion x of our current capital into the stock and the

remaining proportion 1 − x into the bond. We collect the resulting capital at the end of the

year, which is simultaneously the beginning of next year, and reinvest with the same proportion

x. Assume that our initial capital is x

0

.

9 CONVERGENCE IN PROBABILITY 116

It is important to note that the expected value of the capital at the end of the year is

maximized when x = 1, but by using this strategy you will eventually lose everything. Let X

n

be your capital at the end of year n. Deﬁne the average growth rate of your investment as

λ = lim

n→∞

1

n

log

X

n

x

0

,

so that

X

n

≈ x

0

e

λn

.

We will express λ in terms of x; in particular, we will show that it is a nonrandom quantity.

Let I

i

= I

{stock goes up in year i}

. These are independent indicators with EI

i

= 0.8.

X

n

= X

n−1

(1 −x) 1.06 +X

n−1

x 1.5 I

n

= X

n−1

(1.06(1 −x) + 1.5x I

n

)

and so we can unroll the recurrence to get

X

n

= x

0

(1.06(1 −x) + 1.5x)

S

n

((1 −x)1.06)

n−S

n

,

where S

n

= I

1

+. . . +I

n

. Therefore,

1

n

log

X

n

x

0

=

S

n

n

log(1.06 + 0.44x) +

_

1 −

S

n

n

_

log(1.06(1 −x))

→ 0.8 log(1.06 + 0.44x) + 0.2 log(1.06(1 −x)),

in probability, as n → ∞. The last expression deﬁnes λ as a function of x. To maximize this,

we set

dλ

dx

= 0 to get

0.8 0.44

1.06 + 0.44x

=

0.2

1 −x

.

The solution is x =

7

22

, which gives λ ≈ 8.1%.

Example 9.3. Distribute n balls independently at random into n boxes. Let N

n

be the number

of empty boxes. Show that

1

n

N

n

converges in probability and identify the limit.

Note that

N

n

= I

1

+. . . +I

n

,

where I

i

= I

{ith box is empty}

, but we cannot use the weak law of large numbers as I

i

are not

independent. Nevertheless,

EI

i

=

_

n −1

n

_

n

=

_

1 −

1

n

_

n

,

and so

EN

n

= n

_

1 −

1

n

_

n

.

9 CONVERGENCE IN PROBABILITY 117

Moreover,

E(N

2

n

) = EN

n

+

i=j

E(I

i

I

j

)

with

E(I

i

I

j

) = P(box i and j are both empty) =

_

n −2

n

_

n

,

so that

Var(N

n

) = E(N

2

n

) −(EN

n

)

2

= n

_

1 −

1

n

_

n

+n(n −1)

_

1 −

2

n

_

n

−n

2

_

1 −

1

n

_

2n

.

Now, let Y

n

=

1

n

N

n

. We have

EY

n

→e

−1

,

as n →∞, and

Var(Y

n

) =

1

n

_

1 −

1

n

_

n

+

n −1

n

_

1 −

2

n

_

n

−

_

1 −

1

n

_

2n

→ 0 +e

−2

−e

−2

= 0,

as n →∞. Therefore,

Y

n

=

N

n

n

→e

−1

,

as n →∞, in probability.

Problems

1. Assume that n married couples (amounting to 2n people) are seated at random on 2n seats

around a round table. Let T be the number of couples that sit together. Determine ET and

Var(T).

2. There are n birds that sit in a row on a wire. Each bird looks left or right with equal

probability. Let N be the number of birds not seen by any neighboring bird. Determine, with

proof, the constant c so that, as n →∞,

1

n

N →c in probability.

3. Recall the coupon collector problem: sample from n cards, with replacement, indeﬁnitely, and

let N be the number of cards you need to get so that each of n diﬀerent cards are represented.

Find a sequence a

n

so that, as n →∞, N/a

n

converges to 1 in probability.

4. Kings and Lakers are playing a “best of seven” playoﬀ series, which means they play until

one team wins four games. Assume Kings win every game independently with probability p.

9 CONVERGENCE IN PROBABILITY 118

(There is no diﬀerence between home and away games.) Let N be the number of games played.

Compute EN and Var(N).

5. An urn contains n red and m black balls. Select balls from the urn one by one without

replacement. Let X be the number of red balls selected before any black ball, and let Y be the

number of red balls between the ﬁrst and the second black one. Compute EX and EY .

Solutions to problems

1. Let I

i

be the indicator of the event that the ith couple sits together. Then, T = I

1

+ +I

n

.

Moreover,

EI

i

=

2

2n −1

, E(I

i

I

j

) =

2

2

(2n −3)!

(2n −1)!

=

4

(2n −1)(2n −2)

,

for any i and j ,= i. Thus,

ET =

2n

2n −1

and

E(T

2

) = ET +n(n −1)

4

(2n −1)(2n −2)

=

4n

2n −1

,

so

Var(T) =

4n

2n −1

−

4n

2

(2n −1)

2

=

4n(n −1)

(2n −1)

2

.

2. Let I

i

indicate the event that bird i is not seen by any other bird. Then, EI

i

is

1

2

if i = 1 or

i = n and

1

4

otherwise. It follows that

EN = 1 +

n −2

4

=

n + 2

4

.

Furthermore, I

i

and I

j

are independent if [i − j[ ≥ 3 (two birds that have two or more birds

between them are observed independently). Thus, Cov(I

i

, I

j

) = 0 if [i −j[ ≥ 3. As I

i

and I

j

are

indicators, Cov(I

i

, I

j

) ≤ 1 for any i and j. For the same reason, Var(I

i

) ≤ 1. Therefore,

Var(N) =

i

Var(I

i

) +

i=j

Cov(I

i

, I

j

) ≤ n + 4n = 5n.

Clearly, if M =

1

n

N, then EM =

1

n

EN →

1

4

and Var(M) =

1

n

2

Var(N) → 0. It follows that

c =

1

4

.

3. Let N

i

be the number of coupons needed to get i diﬀerent coupons after having i −1 diﬀerent

ones. Then N = N

1

+ . . . + N

n

, and N

i

are independent Geometric with success probability

9 CONVERGENCE IN PROBABILITY 119

n−i+1

n

. So,

EN

i

=

n

n −i + 1

, Var(N

i

) =

n(i −1)

(n −i + 1)

2

,

and, therefore,

EN = n

_

1 +

1

2

+. . . +

1

n

_

,

Var(N) =

n

i=1

n(i −1)

(n −i + 1)

2

≤ n

2

_

1 +

1

2

2

+. . . +

1

n

2

_

≤ n

2

π

2

6

< 2n

2

.

If a

n

= nlog n, then

1

a

n

EN →1,

1

a

2

n

EN →0,

as n →∞, so that

1

a

n

N →1

in probability.

4. Let I

i

be the indicator of the event that the ith game is played. Then, EI

1

= EI

2

= EI

3

=

EI

4

= 1,

EI

5

= 1 −p

4

−(1 −p)

4

,

EI

6

= 1 −p

5

−5p

4

(1 −p) −5p(1 −p)

4

−(1 −p)

5

,

EI

7

=

_

6

3

_

p

3

(1 −p)

3

.

Add the seven expectations to get EN. To compute E(N

2

), we use the fact that I

i

I

j

= I

i

if

i > j, so that E(I

i

I

j

) = EI

i

. So,

EN

2

=

i

EI

i

+ 2

i>j

E(I

i

I

j

) =

i

EI

i

+ 2

i

(i −1)EI

i

=

7

i=1

(2i −1)EI

i

,

and the ﬁnal result can be obtained by plugging in EI

i

and by the standard formula

Var(N) = E(N

2

) −(EN)

2

.

5. Imagine the balls ordered in a row where the ordering speciﬁes the sequence in which they

are selected. Let I

i

be the indicator of the event that the ith red ball is selected before any black

ball. Then, EI

i

=

1

m+1

, the probability that in a random ordering of the ith red ball and all m

black balls, the red comes ﬁrst. As X = I

1

+. . . +I

n

, EX =

n

m+1

.

Now, let J

i

be the indicator of the event that the ith red ball is selected between the ﬁrst and

the second black one. Then, EJ

i

is the probability that the red ball is second in the ordering of

the above m+ 1 balls, so EJ

i

= EI

i

, and EY = EX.

10 MOMENT GENERATING FUNCTIONS 120

10 Moment generating functions

If X is a random variable, then its moment generating function is

φ(t) = φ

X

(t) = E(e

tX

) =

_

x

e

tx

P(X = x) in the discrete case,

_

∞

−∞

e

tx

f

X

(x) dx in the continuous case.

Example 10.1. Assume that X is an Exponential(1) random variable, that is,

f

X

(x) =

_

e

−x

x > 0,

0 x ≤ 0.

Then,

φ(t) =

_

∞

0

e

tx

e

−x

dx =

1

1 −t

,

only when t < 1. Otherwise, the integral diverges and the moment generating function does not

exist. Have in mind that the moment generating function is meaningful only when the integral

(or the sum) converges.

Here is where the name comes from: by writing its Taylor expansion in place of e

tX

and

exchanging the sum and the integral (which can be done in many cases)

E(e

tX

) = E[1 +tX +

1

2

t

2

X

2

+

1

3!

t

3

X

3

+. . .]

= 1 +tE(X) +

1

2

t

2

E(X

2

) +

1

3!

t

3

E(X

3

) +. . .

The expectation of the k-th power of X, m

k

= E(X

k

), is called the k-th moment of x. In

combinatorial language, φ(t) is the exponential generating function of the sequence m

k

. Note

also that

d

dt

E(e

tX

)[

t=0

= EX,

d

2

dt

2

E(e

tX

)[

t=0

= EX

2

,

which lets us compute the expectation and variance of a random variable once we know its

moment generating function.

Example 10.2. Compute the moment generating function for a Poisson(λ) random variable.

10 MOMENT GENERATING FUNCTIONS 121

By deﬁnition,

φ(t) =

∞

n=0

e

tn

λ

n

n!

e

−λ

= e

−λ

∞

n=0

(e

t

λ)

n

n!

= e

−λ+λe

t

= e

λ(e

t

−1)

.

Example 10.3. Compute the moment generating function for a standard Normal random

variable.

By deﬁnition,

φ

X

(t) =

1

√

2π

_

∞

−∞

e

tx

e

−x

2

/2

dx

=

1

√

2π

e

1

2

t

2

_

∞

−∞

e

−

1

2

(x−t)

2

dx

= e

1

2

t

2

,

where, from the ﬁrst to the second line, we have used, in the exponent,

tx −

1

2

x

2

= −

1

2

(−2tx +x

2

) =

1

2

((x −t)

2

−t

2

).

Lemma 10.1. If X

1

, X

2

, . . . , X

n

are independent and S

n

= X

1

+. . . +X

n

, then

φ

Sn

(t) = φ

X

1

(t) . . . φ

Xn

(t).

If X

i

is identically distributed as X, then

φ

S

n

(t) = (φ

X

(t))

n

.

Proof. This follows from multiplicativity of expectation for independent random variables:

E[e

tS

n

] = E[e

tX

1

e

tX

2

. . . e

tX

n

] = E[e

tX

1

] E[e

tX

2

] . . . E[e

tX

n

].

Example 10.4. Compute the moment generating function of a Binomial(n, p) random variable.

Here we have S

n

=

n

k=1

I

k

, where the indicators I

k

are independent and I

k

= I

{success on kth trial}

,

so that

φ

S

n

(t) = (e

t

p + 1 −p)

n

.

10 MOMENT GENERATING FUNCTIONS 122

Why are moment generating functions useful? One reason is the computation of large devi-

ations. Let S

n

= X

1

+ + X

n

, where X

i

are independent and identically distributed as X,

with expectation EX = µ and moment generating function φ. At issue is the probability that

S

n

is far away from its expectation nµ, more precisely, P(S

n

> an), where a > µ. We can,

of course, use the Chebyshev inequality to get an upper bound of order

1

n

. It turns out that

this probability is, for large n, much smaller; the theorem below gives an upper bound that is a

much better estimate.

Theorem 10.2. Large deviation bound.

Assume that φ(t) is ﬁnite for some t > 0. For any a > µ,

P(S

n

≥ an) ≤ exp(−nI(a)),

where

I(a) = sup¦at −log φ(t) : t > 0¦ > 0.

Proof. For any t > 0, using the Markov inequality,

P(S

n

≥ an) = P(e

tSn−tan

≥ 1) ≤ E[e

tSn−tan

] = e

−tan

φ(t)

n

= exp (−n(at −log φ(t))) .

Note that t > 0 is arbitrary, so we can optimize over t to get what the theorem claims. We need

to show that I(a) > 0 when a > µ. For this, note that Φ(t) = at − log φ(t) satisﬁes Φ(0) = 0

and, assuming that one can diﬀerentiate inside the integral sign (which one can in this case, but

proving this requires abstract analysis beyond our scope),

Φ

(t) = a −

φ

(t)

φ(t)

= a −

E(Xe

tX

)

φ(t)

,

and, then,

Φ

(0) = a −µ > 0,

so that Φ(t) > 0 for some small enough positive t.

Example 10.5. Roll a fair die n times and let S

n

be the sum of the numbers you roll. Find an

upper bound for the probability that S

n

exceeds its expectation by at least n, for n = 100 and

n = 1000.

We ﬁt this into the above theorem: observe that µ = 3.5 and ES

n

= 3.5 n, and that we need

to ﬁnd an upper bound for P(S

n

≥ 4.5 n), i.e., a = 4.5. Moreover,

φ(t) =

1

6

6

i=1

e

it

=

e

t

(e

6t

−1)

6(e

t

−1)

and we need to compute I(4.5), which, by deﬁnition, is the maximum, over t > 0, of the function

4.5 t −log φ(t),

whose graph is in the ﬁgure below.

10 MOMENT GENERATING FUNCTIONS 123

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−0.2

−0.15

−0.1

−0.05

0

0.05

0.1

0.15

0.2

t

It would be nice if we could solve this problem by calculus, but unfortunately we cannot (which

is very common in such problems), so we resort to numerical calculations. The maximum is at

t ≈ 0.37105 and, as a result, I(4.5) is a little larger than 0.178. This gives the upper bound

P(S

n

≥ 4.5 n) ≤ e

−0.178·n

,

which is about 0.17 for n = 10, 1.83 10

−8

for n = 100, and 4.16 10

−78

for n = 1000. The bound

35

12n

for the same probability, obtained by the Chebyshev inequality, is much too large for large

n.

Another reason why moment generating functions are useful is that they characterize the

distribution and convergence of distributions. We will state the following theorem without proof.

Theorem 10.3. Assume that the moment generating functions for random variables X, Y , and

X

n

are ﬁnite for all t.

1. If φ

X

(t) = φ

Y

(t) for all t, then P(X ≤ x) = P(Y ≤ x) for all x.

2. If φ

Xn

(t) →φ

X

(t) for all t, and P(X ≤ x) is continuous in x, then P(X

n

≤ x) →P(X ≤

x) for all x.

Example 10.6. Show that the sum of independent Poisson random variables is Poisson.

Here is the situation. We have n independent random variables X

1

, . . . , X

n

, such that:

X

1

is Poisson(λ

1

), φ

X

1

(t) = e

λ

1

(e

t

−1)

,

X

2

is Poisson(λ

2

), φ

X

2

(t) = e

λ

2

(e

t

−1)

,

.

.

.

X

n

is Poisson(λ

n

), φ

X

n

(t) = e

λn(e

t

−1)

.

10 MOMENT GENERATING FUNCTIONS 124

Therefore,

φ

X

1

+...+X

n

(t) = e

(λ

1

+...+λ

n

)(e

t

−1)

and so X

1

+ . . . + X

n

is Poisson(λ

1

+ . . . + λ

n

). Similarly, one can also prove that the sum of

independent Normal random variables is Normal.

We will now reformulate and prove the Central Limit Theorem in a special case when the

moment generating function is ﬁnite. This assumption is not needed and the theorem may be

applied it as it was in the previous chapter.

Theorem 10.4. Assume that X is a random variable, with EX = µ and Var(X) = σ

2

, and

assume that φ

X

(t) is ﬁnite for all t. Let S

n

= X

1

+. . . +X

n

, where X

1

, . . . , X

n

are i. i. d. and

distributed as X. Let

T

n

=

S

n

−nµ

σ

√

n

.

Then, for every x,

P(T

n

≤ x) →P(Z ≤ x),

as n →∞, where Z is a standard Normal random variable.

Proof. Let Y =

X−µ

σ

and Y

i

=

X

i

−µ

σ

. Then, Y

i

are independent, distributed as Y , E(Y

i

) = 0,

Var(Y

i

) = 1, and

T

n

=

Y

1

+. . . +Y

n

√

n

.

To ﬁnish the proof, we show that φ

T

n

(t) →φ

Z

(t) = exp(t

2

/2) as n →∞:

φ

Tn

(t) = E

_

e

t T

n

¸

= E

_

e

t

√

n

Y

1

+...+

t

√

n

Y

n

_

= E

_

e

t

√

n

Y

1

_

E

_

e

t

√

n

Y

n

_

= E

_

e

t

√

n

Y

_

n

=

_

1 +

t

√

n

EY +

1

2

t

2

n

E(Y

2

) +

1

6

t

3

n

3/2

E(Y

3

) +. . .

_

n

=

_

1 + 0 +

1

2

t

2

n

+

1

6

t

3

n

3/2

E(Y

3

) +. . .

_

n

≈

_

1 +

t

2

2

1

n

_

n

→ e

t

2

2

.

10 MOMENT GENERATING FUNCTIONS 125

Problems

1. A player selects three cards at random from a full deck and collects as many dollars as the

number of red cards among the three. Assume 10 people play this game once and let X be the

number of their combined winnings. Compute the moment generating function of X.

2. Compute the moment generating function of a uniform random variable on [0, 1].

3. This exercise was in fact the original motivation for the study of large deviations by the

Swedish probabilist Harald Cram`er, who was working as an insurance company consultant in

the 1930’s. Assume that the insurance company receives a steady stream of payments, amounting

to (a deterministic number) λ per day. Also, every day, they receive a certain amount in claims;

assume this amount is Normal with expectation µ and variance σ

2

. Assume also day-to-day

independence of the claims. Regulators require that, within a period of n days, the company

must be able to cover its claims by the payments received in the same period, or else. Intimidated

by the ﬁerce regulators, the company wants to fail to satisfy their requirement with probability

less than some small number . The parameters n, µ, σ and are ﬁxed, but λ is a quantity the

company controls. Determine λ.

4. Assume that S is Binomial(n, p). For every a > p, determine by calculus the large deviation

bound for P(S ≥ an).

5. Using the central limit theorem for a sum of Poisson random variables, compute

lim

n→∞

e

−n

n

i=0

n

i

i!

.

Solutions to problems

1. Compute the moment generating function for a single game, then raise it to the 10th power:

φ(t) =

_

1

_

52

3

_

__

26

3

_

+

_

26

1

__

26

2

_

e

t

+

_

26

2

__

26

1

_

e

2t

+

_

26

3

_

e

3t

_

_

10

.

2. Answer: φ(t) =

_

1

0

e

tx

dx =

1

t

(e

t

−1).

3. By the assumption, a claim Y is Normal N(µ, σ

2

) and, so, X = (Y −µ)/σ is standard normal.

Note that Y = σX +µ. Thus, he combined amount of claims is σ(X

1

+ +X

n

) +nµ, where

10 MOMENT GENERATING FUNCTIONS 126

X

i

are i. i. d. standard Normal, so we need to bound

P(X

1

+ +X

n

≥

λ −µ

σ

n) ≤ e

−In

.

As log φ(t) =

1

2

t

2

, we need to maximize, over t > 0,

λ −µ

σ

t −

1

2

t

2

,

and the maximum equals

I =

1

2

_

λ −µ

σ

_

2

.

Finally, we solve the equation

e

−In

= ,

to get

λ = µ +σ

_

−2 log

n

.

4. After a computation, the answer we get is

I(a) = a log

a

p

+ (1 −a) log

1 −a

1 −p

.

5. Let S

n

be the sum of i. i. d. Poisson(1) random variables. Thus, S

n

is Poisson(n) and

ES

n

= n. By the Central Limit Theorem, P(S

n

≤ n) →

1

2

, but P(S

n

≤ n) is exactly the

expression in question. So, the answer is

1

2

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 127

11 Computing probabilities and expectations by conditioning

Conditioning is the method we encountered before; to remind ourselves, it involves two-stage

(or multistage) processes and conditions are appropriate events at the ﬁrst stage. Recall also

the basic deﬁnitions:

• Conditional probability: if A and B are two events, P(A[B) =

P(A∩B)

P(B)

;

• Conditional probability mass function: if (X, Y ) has probability mass function p, p

X

(x[Y =

y) =

p(x,y)

p

Y

(y)

= P(X = x[Y = y);

• Conditional density: if (X, Y ) has joint density f, f

X

(x[Y = y) =

f(x,y)

f

Y

(y)

.

• Conditional expectation: E(X[Y = y) is either

x

xp

X

(x[Y = y) or

_

xf

X

(x[Y = y) dx

depending on whether the pair (X, Y ) is discrete or continuous.

Bayes’ formula also applies to expectation. Assume that the distribution of a random

variable X conditioned on Y = y is given, and, consequently, its expectation E(X[Y = y) is

also known. Such is the case of a two-stage process, whereby the value of Y is chosen at the

ﬁrst stage, which then determines the distribution of X at the second stage. This situation is

very common in applications. Then,

E(X) =

_

y

E(X[Y = y)P(Y = y) if Y is discrete,

_

∞

−∞

E(X[Y = y)f

Y

(y) dy if Y is continuous.

Note that this applies to the probability of an event (which is nothing other than the expectation

of its indicator) as well — if we know P(A[Y = y) = E(I

A

[Y = y), then we may compute

P(A) = EI

A

by Bayes’ formula above.

Example 11.1. Assume that X, Y are independent Poisson, with EX = λ

1

, EY = λ

2

. Com-

pute the conditional probability mass function of p

X

(x[X +Y = n).

Recall that X +Y is Poisson(λ

1

+λ

2

). By deﬁnition,

P(X = k[X +Y = n) =

P(X = k, X +Y = n)

P(X +Y = n)

=

P(X = k)P(Y = n −k)

(λ

1

+λ

2

)

n

n!

e

−(λ

1

+λ

2

)

=

λ

k

1

k!

e

−λ

1

λ

n−k

2

(n−k)!

e

−λ

2

(λ

1

+λ

2

)

n

n!

e

−(λ

1

+λ

2

)

=

_

n

k

__

λ

1

λ

1

+λ

2

_

k

_

λ

2

λ

1

+λ

2

_

n−k

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 128

Therefore, conditioned on X +Y = n, X is Binomial(n,

λ

1

λ

1

+λ

2

).

Example 11.2. Let T

1

, T

2

be two independent Exponential(λ) random variables and let S

1

=

T

1

, S

2

= T

1

+T

2

. Compute f

S

1

(s

1

[S

2

= s

2

).

First,

P(S

1

≤ s

1

, S

2

≤ s

2

) = P(T

1

≤ s

1

, T

1

+T

2

≤ s

2

)

=

_

s

1

0

dt

1

_

s

2

−t

1

0

f

T

1

,T

2

(t

1

, t

2

) dt

2

.

If f = f

S

1

,S

2

, then

f(s

1

, s

2

) =

∂

2

∂s

1

∂s

2

P(S

1

≤ s

1

, S

2

≤ s

2

)

=

∂

∂s

2

_

s

2

−s

1

0

f

T

1

,T

2

(s

1

, t

2

) dt

2

= f

T

1

,T

2

(s

1

, s

2

−s

1

)

= f

T

1

(s

1

)f

T

2

(s

2

−s

1

)

= λe

−λs

1

λe

−λ(s

2

−s

1

)

= λ

2

e

−λs

2

.

Therefore,

f(s

1

, s

2

) =

_

λ

2

e

−λs

2

if 0 ≤ s

1

≤ s

2

,

0 otherwise

and, consequently, for s

2

≥ 0,

f

S

2

(s

2

) =

_

s

2

0

f(s

1

, s

2

) ds

1

= λ

2

s

2

e

−λs

2

.

Therefore,

f

S

1

(s

1

[S

2

= s

2

) =

λ

2

e

−λs

2

λ

2

s

2

e

−λs

2

=

1

s

2

,

for 0 ≤ s

1

≤ s

2

, and 0 otherwise. Therefore, conditioned on T

1

+ T

2

= s

2

, T

1

is uniform on

[0, s

2

].

Imagine the following: a new lightbulb is put in and, after time T

1

, it burns out. It is then

replaced by a new lightbulb, identical to the ﬁrst one, which also burns out after an additional

time T

2

. If we know the time when the second bulb burns out, the ﬁrst bulb’s failure time is

uniform on the interval of its possible values.

Example 11.3. Waiting to exceed the initial score. For the ﬁrst problem, roll a die once and

assume that the number you rolled is U. Then, continue rolling the die until you either match

or exceed U. What is the expected number of additional rolls?

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 129

Let N be the number of additional rolls. This number is Geometric, if we know U, so let us

condition on the value of U. We know that

E(N[U = n) =

6

7 −n

,

and so, by Bayes’ formula for expectation,

E(N) =

6

n=1

E(N[U = n) P(U = n)

=

1

6

6

n=1

6

7 −n

=

6

n=1

1

7 −n

= 1 +

1

2

+

1

3

+. . . +

1

6

= 2.45 .

Now, let U be a uniform random variable on [0, 1], that is, the result of a call of a random

number generator. Once we know U, generate additional independent uniform random variables

(still on [0, 1]), X

1

, X

2

, . . ., until we get one that equals or exceeds U. Let n be the number of

additional calls of the generator, that is, the smallest n for which X

n

≥ U. Determine the

p. m. f. of N and EN.

Given that U = u, N is Geometric(1 −u). Thus, for k = 0, 1, 2, . . .,

P(N = k[U = u) = u

k−1

(1 −u)

and so

P(N = k) =

_

1

0

P(N = k[U = u) du =

_

1

0

u

k−1

(1 −u) du =

1

k(k + 1)

.

In fact, a slick alternate derivation shows that P(N = k) does not depend on the distribution of

random variables (which we assumed to be uniform), as soon as it is continuous, so that there

are no “ties” (i.e., no two random variables are equal). Namely, the event ¦N = k¦ happens

exactly when X

k

is the largest and U is the second largest among X

1

, X

2

, . . . , X

k

, U. All orders,

by diminishing size, of these k + 1 random numbers are equally likely, so the probability that

X

k

and U are the ﬁrst and the second is

1

k+1

1

k

.

It follows that

EN =

∞

k=1

1

k + 1

= ∞,

which can (in the uniform case) also be obtained by

EN =

_

1

0

E(N[U = u) du =

_

1

0

1

1 −u

du = ∞.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 130

As we see from this example, random variables with inﬁnite expectation are more common and

natural than one might suppose.

Example 11.4. The number N of customers entering a store on a given day is Poisson(λ).

Each of them buys something independently with probability p. Compute the probability that

exactly k people buy something.

Let X be the number of people who buy something. Why should X be Poisson? Approxi-

mate: let n be the (large) number of people in the town and the probability that any particular

one of them enters the store on a given day. Then, by the Poisson approximation, with λ = n,

N ≈ Binomial(n, ) and X ≈ Binomial(n, p) ≈ Poisson(pλ). A more straightforward way to

see this is as follows:

P(X = k) =

∞

n=k

P(X = k[N = n)P(N = n)

=

∞

n=k

_

n

k

_

p

k

(1 −p)

n−k

λ

n

e

−λ

n!

= e

−λ

∞

n=k

n!

k!(n −k)!

p

k

(1 −p)

n−k

λ

n

n!

=

e

−λ

p

k

λ

k

k!

∞

n=k

(1 −p)

n−k

λ

n−k

(n −k)!

=

e

−λ

(pλ)

k

k!

∞

=0

((1 −p)λ)

!

=

e

−λ

(pλ)

k

k!

e

(1−p)λ

=

e

−pλ

(pλ)

k

k!

.

This is, indeed, the Poisson(pλ) probability mass function.

Example 11.5. A coin with Heads probability p is tossed repeatedly. What is the expected

number of tosses needed to get k successive heads?

Note: If we remove “successive,” the answer is

k

p

, as it equals the expectation of the sum of

k (independent) Geometric(p) random variables.

Let N

k

be the number of needed tosses and m

k

= EN

k

. Let us condition on the value of

N

k−1

. If N

k−1

= n, then observe the next toss; if it is Heads, then N

k

= n+1, but, if it is Tails,

then we have to start from the beginning, with n + 1 tosses wasted. Here is how we translate

this into mathematics:

E[N

k

[N

k−1

= n] = p(n + 1) + (1 −p)(n + 1 +E(N

k

))

= pn +p + (1 −p)n + 1 −p +m

k

(1 −p)

= n + 1 +m

k

(1 −p).

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 131

Therefore,

m

k

= E(N

k

) =

∞

n=k−1

E[N

k

[N

k−1

= n]P(N

k−1

= n)

=

∞

n=k−1

(n + 1 +m

k

(1 −p)) P(N

k−1

= n)

= m

k−1

+ 1 +m

k

(1 −p)

=

1

p

+

m

k−1

p

.

This recursion can be unrolled,

m

1

=

1

p

m

2

=

1

p

+

1

p

2

.

.

.

m

k

=

1

p

+

1

p

2

+. . . +

1

p

k

.

In fact, we can even compute the moment generating function of N

k

by diﬀerent condition-

ing

1

. Let F

a

, a = 0, . . . , k − 1, be the event that the tosses begin with a Heads, followed by

Tails, and let F

k

the event that the ﬁrst k tosses are Heads. One of F

0

, . . . , F

k

must happen,

therefore, by Bayes’ formula,

E[e

tN

k

] =

k

a=0

E[e

tN

k

[F

k

]P(F

k

).

If F

k

happens, then N

k

= k, otherwise a + 1 tosses are wasted and one has to start over with

the same conditions as at the beginning. Therefore,

E[e

tN

k

] =

k−1

a=0

E[e

t(N

k

+a+1)

]p

a

(1 −p) +e

tk

p

k

= (1 −p)E[e

tN

k

]

k−1

a=0

e

t(a+1)

p

a

+e

tk

p

k

,

and this gives an equation for E[e

tN

k

] which can be solved:

E[e

tN

k

] =

p

k

e

tk

1 −(1 −p)

k−1

a=0

e

t(a+1)

p

a

=

p

k

e

tk

(1 −pe

t

)

1 −pe

t

−(1 −p)e

t

(1 −p

k

e

tk

)

.

We can, then, get EN

k

by diﬀerentiating and some algebra, by

EN

k

=

d

dt

E[e

tN

k

][

t=0

.

1

Thanks to Travis Scrimshaw for pointing this out.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 132

Example 11.6. Gambler’s ruin. Fix a probability p ∈ (0, 1). Play a sequence of games; in each

game you (independently) win $1 with probability p and lose $1 with probability 1 −p. Assume

that your initial capital is i dollars and that you play until you either reach a predetermined

amount N, or you lose all your money. For example, if you play a fair game, p =

1

2

, while, if

you bet on Red at roulette, p =

9

19

. You are interested in the probability P

i

that you leave the

game happy with your desired amount N.

Another interpretation is that of a simple random walk on the integers. Start at i, 0 ≤ i ≤ N,

and make steps in discrete time units: each time (independently) move rightward by 1 (i.e., add

1 to your position) with probability p and move leftward by 1 (i.e., add −1 to your position)

with probability 1 −p. In other words, if the position of the walker at time n is S

n

, then

S

n

= i +X

1

+ +X

n

,

where X

k

are i. i. d. and P(X

k

= 1) = p, P(X

k

= −1) = 1 −p. This random walk is one of the

very basic random (or, if you prefer a Greek word, stochastic) processes. The probability P

i

is

the probability that the walker visits N before a visit to 0.

We condition on the ﬁrst step X

1

the walker makes, i.e., the outcome of the ﬁrst bet. Then,

by Bayes’ formula,

P

i

= P(visit N before 0[X

1

= 1)P(X

1

= 1) +P(visit N before 0[X

1

= −1)P(X

1

= −1)

= P

i+1

p +P

i−1

(1 −p),

which gives us a recurrence relation, which we can rewrite as

P

i+1

−P

i

=

1 −p

p

(P

i

−P

i−1

).

We also have boundary conditions P

0

= 0, P

N

= 1. This is a recurrence we can solve quite

easily, as

P

2

−P

1

=

1 −p

p

P

1

P

3

−P

2

=

1 −p

p

(P

2

−P

1

) =

_

1 −p

p

_

2

P

1

.

.

.

P

i

−P

i−1

=

_

1 −p

p

_

i−1

P

1

, for i = 1, . . . , N.

We conclude that

P

i

−P

1

=

_

_

1 −p

p

_

+

_

1 −p

p

_

2

+. . . +

_

1 −p

p

_

i−1

_

P

1

,

P

i

=

_

1 +

_

1 −p

p

_

+

_

1 −p

p

_

2

+. . . +

_

1 −p

p

_

i−1

_

P

1

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 133

Therefore,

P

i

=

_

¸

_

¸

_

1−

1−p

p

i

1−

1−p

p

P

1

if p ,=

1

2

,

iP

1

if p =

1

2

.

To determine the unknown P

1

we use that P

N

= 1 to ﬁnally get

P

i

=

_

¸

_

¸

_

1−

1−p

p

i

1−

1−p

p

N

if p ,=

1

2

,

i

N

if p =

1

2

.

For example, if N = 10 and p = 0.6, then P

5

≈ 0.8836; if N = 1000 and p =

9

19

, then

P

900

≈ 2.6561 10

−5

.

Example 11.7. Bold Play. Assume that the only game available to you is a game in which

you can place even bets at any amount, and that you win each of these bets with probability p.

Your initial capital is x ∈ [0, N], a real number, and again you want to increase it to N before

going broke. Your bold strategy (which can be proved to be the best) is to bet everything unless

you are close enough to N that a smaller amount will do:

1. Bet x if x ≤

N

2

.

2. Bet N −x if x ≥

N

2

.

We can, without loss of generality, ﬁx our monetary unit so that N = 1. We now deﬁne

P(x) = P(reach 1 before reaching 0).

By conditioning on the outcome of your ﬁrst bet,

P(x) =

_

p P(2x) if x ∈

_

0,

1

2

¸

,

p 1 + (1 −p) P(2x −1) if x ∈

_

1

2

, 1

¸

.

For each positive integer n, this is a linear system for P(

k

2

n

), k = 0, . . . , 2

n

, which can be solved.

For example:

• When n = 1, P

_

1

2

_

= p.

• When n = 2, P

_

1

4

_

= p

2

, P

_

3

4

_

= p + (1 −p)p.

• When n = 3, P

_

1

8

_

= p

3

, P

_

3

8

_

= p P

_

3

4

_

= p

2

+ p

2

(1 − p), P

_

5

8

_

= p + p

2

(1 − p),

P

_

7

8

_

= p +p(1 −p) +p(1 −p)

2

.

It is easy to verify that P(x) = x, for all x, if p =

1

2

. Moreover, it can be computed that

P(0.9) ≈ 0.8794 for p =

9

19

, which is not too diﬀerent from a fair game. The ﬁgure below

displays the graphs of functions P(x) for p = 0.1, 0.25,

9

19

, and

1

2

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 134

0 0.2 0.4 0.6 0.8 1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

P

(

x

)

A few remarks for the more mathematically inclined: The function P(x) is continuous, but

nowhere diﬀerentiable on [0, 1]. It is thus a highly irregular function despite the fact that it is

strictly increasing. In fact, P(x) is the distribution function of a certain random variable Y ,

that is, P(x) = P(Y ≤ x). This random variable with values in (0, 1) is deﬁned by its binary

expansion

Y =

∞

j=1

D

j

1

2

j

,

where its binary digits D

j

are independent and equal to 1 with probability 1 − p and, thus, 0

with probability p.

Theorem 11.1. Expectation and variance of sums with a random number of terms.

Assume that X, X

1

, X

2

, . . . is an i. i. d. sequence of random variables with ﬁnite EX = µ and

Var(X) = σ

2

. Let N be a nonnegative integer random variable, independent of all X

i

, and let

S =

N

i=1

X

i

.

Then

ES = µEN,

Var(S) = σ

2

EN +µ

2

Var(N).

Proof. Let S

n

= X

1

+. . . +X

n

. We have

E[S[N = n] = ES

n

= nEX

1

= nµ.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 135

Then,

ES =

∞

n=0

nµP(N = n) = µEN.

For variance, compute ﬁrst

E(S

2

) =

∞

n=0

E[S

2

[N = n]P(N = n)

=

∞

n=0

E(S

2

n

)P(N = n)

=

∞

n=0

((Var(S

n

) + (ES

n

)

2

)P(N = n)

=

∞

n=0

(nσ

2

+n

2

µ

2

)P(N = n)

= σ

2

EN +µ

2

E(N

2

).

Therefore,

Var(S) = E(S

2

) −(ES)

2

= σ

2

EN +µ

2

E(N

2

) −µ

2

(EN)

2

= σ

2

EN +µ

2

Var(N).

Example 11.8. Toss a fair coin until you toss Heads for the ﬁrst time. Each time you toss

Tails, roll a die and collect as many dollars as the number on the die. Let S be your total

winnings. Compute ES and Var(S).

This ﬁts into the above context, with X

i

, the numbers rolled on the die, and N, the number

of Tails tossed before ﬁrst Heads. We know that

EX

1

=

7

2

,

Var(X

1

) =

35

12

.

Moreover, N + 1 is a Geometric(

1

2

) random variable, and so

EN = 2 −1 = 1.

Var(N) =

1 −

1

2

_

1

2

_

2

= 2

Plug in to get ES =

7

2

and Var(S) =

59

12

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 136

Example 11.9. We now take another look at Example 8.11. We will rename the number of

days in purgatory as S, to ﬁt it better into the present context, and call the three doors 0, 1,

and 2. Let N be the number of times your choice of door is not door 0. This means that N is

Geometric(

1

3

)−1. Any time you do not pick door 0, you pick door 1 or 2 with equal probability.

Therefore, each X

i

is 1 or 2 with probability

1

2

each. (Note that X

i

are not 0, 1, or 2 with

probability

1

3

each!)

It follows that

EN = 3 −1 = 2,

Var(N) =

1 −

1

3

_

1

3

_

2

= 6

and

EX

1

=

3

2

,

Var(X

1

) =

1

2

+ 2

2

2

−

9

4

=

1

4

.

Therefore, ES = EN EX

1

= 3, which, of course, agrees with the answer in Example 8.11.

Moreover,

Var(S) =

1

4

2 +

9

4

6 = 14.

Problems

1. Toss an unfair coin with probability p ∈ (0, 1) of Heads n times. By conditioning on the

outcome of the last toss, compute the probability that you get an even number of Heads.

2. Let X

1

and X

2

be independent Geometric(p) random variables. Compute the conditional

p. m. f. of X

1

given X

1

+X

2

= n, n = 2, 3, . . .

3. Assume that the joint density of (X, Y ) is

f(x, y) =

1

y

e

−y

, for 0 < x < y,

and 0 otherwise. Compute E(X

2

[Y = y).

4. You are trapped in a dark room. In front oﬀ you are two buttons, A and B. If you press A,

• with probability 1/3 you will be released in two minutes;

• with probability 2/3 you will have to wait ﬁve minutes and then you will be able to press

one of the buttons again.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 137

If you press B,

• you will have to wait three minutes and then be able to press one of the buttons again.

Assume that you cannot see the buttons, so each time you press one of them at random. Compute

the expected time of your conﬁnement.

5. Assume that a Poisson number with expectation 10 of customers enters the store. For

promotion, each of them receives an in-store credit uniformly distributed between 0 and 100

dollars. Compute the expectation and variance of the amount of credit the store will give.

6. Generate a random number Λ uniformly on [0, 1]; once you observe the value of Λ, say

Λ = λ, generate a Poisson random variable N with expectation λ. Before you start the random

experiment, what is the probability that N ≥ 3?

7. A coin has probability p of Heads. Alice ﬂips it ﬁrst, then Bob, then Alice, etc., and the

winner is the ﬁrst to ﬂip Heads. Compute the probability that Alice wins.

Solutions to problems

1. Let p

n

be the probability of an even number of Heads in n tosses. We have

p

n

= p (1 −p

n−1

) + (1 −p)p

n−1

= p + (1 −2p)p

n−1

,

and so

p

n

−

1

2

= (1 −2p)(p

n−1

−

1

2

),

and then

p

n

=

1

2

+C(1 −2p)

n

.

As p

0

= 1, we get C =

1

2

and, ﬁnally,

p

n

=

1

2

+

1

2

(1 −2p)

n

.

2. We have, for i = 1, . . . , n −1,

P(X

1

= i[X

1

+X

2

= n) =

P(X

1

= i)P(X

2

= n −i)

P(X

1

+X

2

= n)

=

p(1 −p)

i−1

p(1 −p)

n−i−1

n−1

k=1

p(1 −p)

k−1

p(1 −p)

n−k−1

=

1

n −1

,

so X

1

is uniform over its possible values.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 138

3. The conditional density of X given Y = y is f

X

(x[Y = y) =

1

y

, for 0 < x < y (i.e., uniform

on [0, y]), and so the answer is

y

2

3

.

4. Let I be the indicator of the event that you press A, and X the time of your conﬁnement in

minutes. Then,

EX = E(X[I = 0)P(I = 0) +E(X[I = 1)P(I = 1) = (3 +EX)

1

2

+ (

1

3

2 +

2

3

(5 +EX))

1

2

and the answer is EX = 21.

5. Let N be the number of customers and X the amount of credit, while X

i

are independent

uniform on [0, 100]. So, EX

i

= 50 and Var(X

i

) =

100

2

12

. Then, X =

N

i=1

X

i

, so EX = 50EN =

500 and Var(X) =

100

2

12

10 + 50

2

10.

6. The answer is

P(N ≥ 3) =

_

1

0

P(N ≥ 3[Λ = λ) dλ =

_

1

0

(1 −(1 +λ +

λ

2

2

)e

−λ

)dλ.

7. Let f(p) be the probability. Then,

f(p) = p + (1 −p)(1 −f(p))

which gives

f(p) =

1

2 −p

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 139

Interlude: Practice Midterm 1

This practice exam covers the material from chapters 9 through 11. Give yourself 50 minutes

to solve the four problems, which you may assume have equal point score.

1. Assume that a deck of 4n cards has n cards of each of the four suits. The cards are shuﬄed

and dealt to n players, four cards per player. Let D

n

be the number of people whose four cards

are of four diﬀerent suits.

(a) Find ED

n

.

(b) Find Var(D

n

).

(c) Find a constant c so that

1

n

D

n

converges to c in probability, as n →∞.

2. Consider the following game, which will also appear in problem 4. Toss a coin with probability

p of Heads. If you toss Heads, you win $2, if you toss Tails, you win $1.

(a) Assume that you play this game n times and let S

n

be your combined winnings. Compute

the moment generating function of S

n

, that is, E(e

tS

n

).

(b) Keep the assumptions from (a). Explain how you would ﬁnd an upper bound for the

probability that S

n

is more than 10% larger than its expectation. Do not compute.

(c) Now you roll a fair die and you play the game as many times as the number you roll. Let Y

be your total winnings. Compute E(Y ) and Var(Y ).

3. The joint density of X and Y is

f(x, y) =

e

−x/y

e

−y

y

,

for x > 0 and y > 0, and 0 otherwise. Compute E(X[Y = y).

4. Consider the following game again: Toss a coin with probability p of Heads. If you toss

Heads, you win $2, if you toss Tails, you win $1. Assume that you start with no money and you

have to quit the game when your winnings match or exceed the dollar amount n. (For example,

assume n = 5 and you have $3: if your next toss is Heads, you collect $5 and quit; if your next

toss is Tails, you play once more. Note that, at the amount you quit, your winnings will be

either n or n + 1.) Let p

n

be the probability that you will quit with winnings exactly n.

(a) What is p

1

? What is p

2

?

(b) Write down the recursive equation which expresses p

n

in terms of p

n−1

and p

n−2

.

(c) Solve the recursion.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 140

Solutions to Practice Midterm 1

1. Assume that a deck of 4n cards has n cards of each of the four suits. The cards are shuﬄed

and dealt to n players, four cards per player. Let D

n

be the number of people whose four

cards are of four diﬀerent suits.

(a) Find ED

n

.

Solution:

As

D

n

=

n

i=1

I

i

,

where

I

i

= I

{ith player gets four diﬀerent suits}

,

and

EI

i

=

n

4

_

4n

4

_,

the answer is

ED

n

=

n

5

_

4n

4

_.

(b) Find Var(D

n

).

Solution:

We also have, for i ,= j,

E(I

i

I

j

) =

n

4

(n −1)

4

_

4n

4

__

4n−4

4

_.

Therefore,

E(D

2

n

) =

i=j

E(I

i

I

j

) +

i

EI

i

=

n

5

(n −1)

5

_

4n

4

__

4n−4

4

_ +

n

5

_

4n

4

_

and

Var(D

n

) =

n

5

(n −1)

5

_

4n

4

__

4n−4

4

_ +

n

5

_

4n

4

_ −

_

n

5

_

4n

4

_

_

2

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 141

(c) Find a constant c so that

1

n

D

n

converges to c in probability as n →∞.

Solution:

Let Y

n

=

1

n

D

n

. Then

EY

n

=

n

4

_

4n

4

_ =

6n

3

(4n −1)(4n −2)(4n −3)

→

6

4

3

=

3

32

,

as n →∞. Moreover,

Var(Y

n

) =

1

n

2

Var(D

n

) =

n

3

(n −1)

5

_

4n

4

__

4n−4

4

_ +

n

3

_

4n

4

_ −

_

n

4

_

4n

4

_

_

2

→

6

2

4

6

+ 0 −

_

3

32

_

2

= 0,

as n →∞, so the statement holds with c =

3

32

.

2. Consider the following game, which will also appear in problem 4. Toss a coin with

probability p of Heads. If you toss Heads, you win $2, if you toss Tails, you win $1.

(a) Assume that you play this game n times and let S

n

be your combined winnings.

Compute the moment generating function of S

n

, that is, E(e

tS

n

).

Solution:

E(e

tS

n

) = (e

2t

p +e

t

(1 −p))

n

.

(b) Keep the assumptions from (a). Explain how you would ﬁnd an upper bound for the

probability that S

n

is more than 10% larger than its expectation. Do not compute.

Solution:

As EX

1

= 2p + 1 −p = 1 +p, ES

n

= n(1 +p), and we need to ﬁnd an upper bound

for P(S

n

> n(1.1 + 1.1p). When (1.1 + 1.1p) ≥ 2, i.e., p ≥

9

11

, this is an impossible

event, so the probability is 0. When p <

9

11

, the bound is

P(S

n

> n(1.1 + 1.1p) ≤ e

−I(1.1+1.1p)n

,

where I(1.1 + 1.1p) > 0 and is given by

I(1.1 + 1.1p) = sup¦(1.1 + 1.1p)t −log(pe

2t

+ (1 −p)e

t

) : t > 0¦.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 142

(c) Now you roll a fair die and you play the game as many times as the number you roll.

Let Y be your total winnings. Compute E(Y ) and Var(Y ).

Solution:

Let Y = X

1

+. . . +X

N

, where X

i

are independently and identically distributed with

P(X

1

= 2) = p and P(X

1

= 1) = 1 − p, and P(N = k) =

1

6

, for k = 1, . . . , 6. We

know that

EY = EN EX

1

,

Var(Y ) = Var(X

1

) EN + (EX

1

)

2

Var(N).

We have

EN =

7

2

, Var(N) =

35

12

.

Moreover,

EX

1

= 2p + 1 −p = 1 +p,

and

EX

2

1

= 4p + 1 −p = 1 + 3p,

so that

Var(X

1

) = 1 + 3p −(1 +p)

2

= p −p

2

.

The answer is

EY =

7

2

(1 +p),

Var(Y ) = (p −p

2

)

7

2

+ (1 +p)

2

35

12

.

3. The joint density of X and Y is

f(x, y) =

e

−x/y

e

−y

y

,

for x > 0 and y > 0, and 0 otherwise. Compute E(X[Y = y).

Solution:

We have

E(X[Y = y) =

_

∞

0

xf

X

(x[Y = y) dx.

As

f

Y

(y) =

_

∞

0

e

−x/y

e

−y

y

dx

=

e

−y

y

_

∞

0

e

−x/y

dx

=

e

−y

y

_

ye

−x/y

_

x=∞

x=0

= e

−y

, for y > 0,

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 143

we have

f

X

(x[Y = y) =

f(x, y)

f

Y

(y)

=

e

−x/y

y

, for x, y > 0,

and so

E(X[Y = y) =

_

∞

0

x

y

e

−x/y

dx

= y

_

∞

0

ze

−z

dz

= y.

4. Consider the following game again: Toss a coin with probability p of Heads. If you toss

Heads, you win $2, if you toss Tails, you win $1. Assume that you start with no money

and you have to quit the game when your winnings match or exceed the dollar amount

n. (For example, assume n = 5 and you have $3: if your next toss is Heads, you collect

$5 and quit; if your next toss is Tails, you play once more. Note that, at the amount you

quit, your winnings will be either n or n+1.) Let p

n

be the probability that you will quit

with winnings exactly n.

(a) What is p

1

? What is p

2

?

Solution:

We have

p

1

= 1 −p

and

p

2

= (1 −p)

2

+p.

Also, p

0

= 1.

(b) Write down the recursive equation which expresses p

n

in terms of p

n−1

and p

n−2

.

Solution:

We have

p

n

= p p

n−2

+ (1 −p)p

n−1

.

11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 144

(c) Solve the recursion.

Solution:

We can use

p

n

−p

n−1

= (−p)(p

n−1

−p

n−2

)

= (−p)

n−1

(p

1

−p

0

).

Another possibility is to use the characteristic equation λ

2

−(1 −p)λ −p = 0 to get

λ =

1 −p ±

_

(1 −p)

2

+ 4p

2

=

1 −p ±(1 +p)

2

=

_

1

−p

.

This gives

p

n

= a +b(−p)

n

,

with

a +b = 1, a −bp = 1 −p.

We get

a =

1

1 +p

, b =

p

1 +p

,

and then

p

n

=

1

1 +p

+

p

1 +p

(−p)

n

.

12 MARKOV CHAINS: INTRODUCTION 145

12 Markov Chains: Introduction

Example 12.1. Take your favorite book. Start, at step 0, by choosing a random letter. Pick

one of the ﬁve random procedures described below and perform it at each time step n = 1, 2, . . .

1. Pick another random letter.

2. Choose a random occurrence of the letter obtained at the previous step ((n −1)st), then

pick the letter following it in the text. Use the convention that the letter that follows the

last letter in the text is the ﬁrst letter in the text.

3. At step 1 use procedure (2), while for n ≥ 2 choose a random occurrence of the two letters

obtained, in order, in the previous two steps, then pick the following letter.

4. Choose a random occurrence of all previously chosen letters, in order, then pick the fol-

lowing letter.

5. At step n, perform procedure (1) with probability

1

n

and perform procedure (2) with

probability 1 −

1

n

.

Repeated iteration of procedure (1) merely gives the familiar independent experiments — selec-

tion of letters is done with replacement and, thus, the letters at diﬀerent steps are independent.

Procedure (2), however, is diﬀerent: the probability mass function for the letter at the next

time step depends on the letter at this step and nothing else. If we call our current letter our

state, then we transition into a new state chosen with the p. m. f. that depends only on our

current state. Such processes are called Markov.

Procedure (3) is not Markov at ﬁrst glance. However, it becomes such via a natural redeﬁ-

nition of state: keep track of the last two letters; call an ordered pair of two letters a state.

Procedure (4) can be made Markov in a contrived fashion, that is, by keeping track, at the

current state, of the entire history of the process. There is, however, no natural way of making

this process Markov and, indeed, there is something diﬀerent about this scheme: it ceases being

random after many steps are performed, as the sequence of the chosen letters occurs just once

in the book.

Procedure (5) is Markov, but what distinguishes it from (2) is that the p. m. f. is dependent

not only on the current step, but also on time. That is, the process is Markov, but not time-

homogeneous. We will only consider time-homogeneous Markov processes.

In general, a Markov chain is given by

• a state space, a countable set S of states, which are often labeled by the positive integers

1, 2, . . . ;

• transition probabilities, a (possibly inﬁnite) matrix of numbers P

ij

, where i and j range

over all states in S; and

12 MARKOV CHAINS: INTRODUCTION 146

• an initial distribution α, a probability mass function on the states.

Here is how these three ingredients determine a sequence X

0

, X

1

, . . . of random variables

(with values in S): Use the initial distribution as your random procedure to pick X

0

. Subse-

quently, given that you are at state i ∈ S at any time n, make the transition to state j ∈ S with

probability P

ij

, that is

(12.1) P

ij

= P(X

n+1

= j[X

n

= i).

The transition probabilities are collected into the transition matrix:

P =

_

¸

_

P

11

P

12

P

13

. . .

P

21

P

22

P

23

. . .

.

.

.

_

¸

_.

A stochastic matrix is a (possibly inﬁnite) square matrix with nonnegative entries, such that all

rows sum to 1, that is

j

P

ij

= 1,

for all i. In other words, every row of the matrix is a p. m. f. Clearly, by (12.1), every transition

matrix is a stochastic matrix (as X

n+1

must be in some state). The opposite also holds: given

any stochastic matrix, one can construct a Markov chain on positive integer states with the

same transition matrix, by using the entries as transition probabilities as in (12.1).

Geometrically, a Markov chain is often represented as oriented graph on S (possibly with

self-loops) with an oriented edge going from i to j whenever a transition from i to j is possible,

i.e., whenever P

ij

> 0; such an edge is labeled by P

ij

.

Example 12.2. A random walker moves on the set ¦0, 1, 2, 3, 4¦. She moves to the right (by

1) with probability, p, and to the left with probability 1 − p, except when she is at 0 or at 4.

These two states are absorbing: once there, the walker does not move. The transition matrix is

P =

_

¸

¸

¸

¸

_

1 0 0 0 0

1 −p 0 p 0 0

0 1 −p 0 p 0

0 0 1 −p 0 p

0 0 0 0 1

_

¸

¸

¸

¸

_

.

and the matching graphical representation is below.

1 −p 1 −p

1

p p

0 1 2 3

1

4

1 −p

p

12 MARKOV CHAINS: INTRODUCTION 147

Example 12.3. Same as the previous example except that now 0 and 4 are reﬂecting. From

0, the walker always moves to 1, while from 4 she always moves to 3. The transition matrix

changes to

P =

_

¸

¸

¸

¸

_

0 1 0 0 0

1 −p 0 p 0 0

0 1 −p 0 p 0

0 0 1 −p 0 p

0 0 0 1 0

_

¸

¸

¸

¸

_

.

Example 12.4. Random walk on a graph. Assume that a graph with undirected edges is

given by its adjacency matrix, which is a binary matrix with the i, jth entry 1 exactly when i

is connected to j. At every step, a random walker moves to a randomly chosen neighbor. For

example, the adjacency matrix of the graph

3

4 1

2

is

_

¸

¸

_

0 1 0 1

1 0 1 1

0 1 0 1

1 1 1 0

_

¸

¸

_

,

and the transition matrix is

P =

_

¸

¸

_

0

1

2

0

1

2

1

3

0

1

3

1

3

0

1

2

0

1

2

1

3

1

3

1

3

0

_

¸

¸

_

.

Example 12.5. The general two-state Markov chain. There are two states 1 and 2 with

transitions:

• 1 →1 with probability α;

• 1 →2 with probability 1 −α;

• 2 →1 with probability β;

• 2 →2 with probability 1 −β.

The transition matrix has two parameters α, β ∈ [0, 1]:

P =

_

α 1 −α

β 1 −β

_

.

12 MARKOV CHAINS: INTRODUCTION 148

Example 12.6. Changeovers. Keep track of two-toss blocks in an inﬁnite sequence of indepen-

dent coin tosses with probability p of Heads. The states represent (previous ﬂip, current ﬂip)

and are (in order) HH, HT, TH, and TT. The resulting transition matrix is

_

¸

¸

_

p 1 −p 0 0

0 0 p 1 −p

p 1 −p 0 0

0 0 p 1 −p

_

¸

¸

_

.

Example 12.7. Simple random walk on Z. The walker moves left or right by 1, with prob-

abilities p and 1 − p, respectively. The state space is doubly inﬁnite and so is the transition

matrix:

_

¸

¸

¸

¸

¸

¸

_

.

.

.

. . . 1 −p 0 p 0 0 . . .

. . . 0 1 −p 0 p 0 . . .

. . . 0 0 1 −p 0 p . . .

.

.

.

_

¸

¸

¸

¸

¸

¸

_

Example 12.8. Birth-death chain. This is a general model in which a population may change

by at most 1 at each time step. Assume that the size of a population is x. Birth probability

p

x

is the transition probability to x + 1, death probability q

x

is the transition to x − 1. and

r

x

= 1 −p

x

−q

x

is the transition to x. Clearly, q

0

= 0. The transition matrix is

_

¸

¸

¸

_

r

0

p

0

0 0 0 . . .

q

1

r

1

p

1

0 0 . . .

0 q

2

r

2

p

2

0 . . .

.

.

.

_

¸

¸

¸

_

We begin our theory by studying n-step probabilities

P

n

ij

= P(X

n

= j[X

0

= i) = P(X

n+m

= j[X

m

= i).

Note that P

0

ij

= I, the identity matrix, and P

1

ij

= P

ij

. Note also that the condition X

0

= i

simply speciﬁes a particular non-random initial state.

Consider an oriented path of length n from i to j, that is i, k

1

, . . . , k

n−1

, j, for some states

k

1

, . . . , k

n−1

. One can compute the probability of following this path by multiplying all transition

probabilities, i.e., P

ik

1

P

k

1

k

2

P

k

n−1

j

. To compute P

n

ij

, one has to sum these products over all

paths of length n from i to j. The next theorem writes this in a familiar and neater fashion.

Theorem 12.1. Connection between n-step probabilities and matrix powers:

P

n

ij

is the i, jth entry of the nth power of the transition matrix.

12 MARKOV CHAINS: INTRODUCTION 149

Proof. Call the transition matrix P and temporarily denote the n-step transition matrix by

P

(n)

. Then, for m, n ≥ 0,

P

(n+m)

ij

= P(X

n+m

= j[X

0

= i)

=

k

P(X

n+m

= j, X

n

= k[X

0

= i)

=

k

P(X

n+m

= j[X

n

= k) P(X

n

= k[X

0

= i)

=

k

P(X

m

= j[X

0

= k) P(X

n

= k[X

0

= i)

=

k

P

(m)

kj

P

(n)

ik

.

The ﬁrst equality decomposes the probability according to where the chain is at time n, the

second uses the Markov property and the third time-homogeneity. Thus,

P

(m+n)

= P

(n)

P

(m)

,

and, then, by induction

P

(n)

= P

(1)

P

(1)

P

(1)

= P

n

.

The fact that the matrix powers of the transition matrix give the n-step probabilities makes

linear algebra useful in the study of ﬁnite-state Markov chains.

Example 12.9. For the two state Markov Chain

P =

_

α 1 −α

β 1 −β

_

,

and

P

2

=

_

α

2

+ (1 −α)β α(1 −α) + (1 −α)(1 −β)

αβ + (1 −β)β β(1 −α) + (1 −β)

2

_

gives all P

2

ij

.

Assume now that the initial distribution is given by

α

i

= P(X

0

= i),

for all states i (again, for notational purposes, we assume that i = 1, 2, . . .). As this must

determine a p. m. f., we have α

i

≥ 0 and

i

α

i

= 1. Then,

P(X

n

= j) =

i

P(X

n

= j[X

0

= i)P(X

0

= i)

=

i

α

i

P

n

ij

.

12 MARKOV CHAINS: INTRODUCTION 150

Then, the row of probabilities at time n is given by [P(X

n

= i), i ∈ S] = [α

1

, α

2

, . . .] P

n

.

Example 12.10. Consider the random walk on the graph from Example 12.4. Choose a

starting vertex at random. (a) What is the probability mass function at time 2? (b) Compute

P(X

2

= 2, X

6

= 3, X

12

= 4).

As

P =

_

¸

¸

_

0

1

2

0

1

2

1

3

0

1

3

1

3

0

1

2

0

1

2

1

3

1

3

1

3

0

_

¸

¸

_

,

we have

_

P(X

2

= 1) P(X

2

= 2) P(X

2

= 3) P(X

2

= 4)

¸

=

_

1

4

1

4

1

4

1

4

¸

P

2

=

_

2

9

5

18

2

9

5

18

¸

.

The probability in (b) equals

P(X

2

= 2) P

4

23

P

6

34

=

8645

708588

≈ 0.0122.

Problems

1. Three white and three black balls are distributed in two urns, with three balls per urn. The

state of the system is the number of white balls in the ﬁrst urn. At each step, we draw at

random a ball from each of the two urns and exchange their places (the ball that was in the ﬁrst

urn is put into the second and vice versa). (a) Determine the transition matrix for this Markov

chain. (b) Assume that initially all white balls are in the ﬁrst urn. Determine the probability

that this is also the case after 6 steps.

2. A Markov chain on states 0, 1, 2, has the transition matrix

_

_

1

2

1

3

1

6

0

1

3

2

3

5

6

0

1

6

_

_

Assume that P(X

0

= 0) = P(X

0

= 1) =

1

4

. Determine EX

3

.

3. We have two coins: coin 1 has probability 0.7 of Heads and coin 2 probability 0.6 of Heads.

You ﬂip a coin once per day, starting today (day 0), when you pick one of the two coins with

equal probability and toss it. On any day, if you ﬂip Heads, you ﬂip coin 1 the next day,

12 MARKOV CHAINS: INTRODUCTION 151

otherwise you ﬂip coin 2 the next day. (a) Compute the probability that you ﬂip coin 1 on day

3. (b) Compute the probability that you ﬂip coin 1 on days 3, 6, and 14. (c) Compute the

probability that you ﬂip Heads on days 3 and 6.

4. A walker moves on two positions a and b. She begins at a at time 0 and is at a the next time

as well. Subsequently, if she is at the same position for two consecutive time steps, she changes

position with probability 0.8 and remains at the same position with probability 0.2; in all other

cases she decides the next position by a ﬂip of a fair coin. (a) Interpret this as a Markov chain

on a suitable state space and write down the transition matrix P. (b) Determine the probability

that the walker is at position a at time 10.

Solutions to problems

1. The states are 0, 1, 2, 3. For (a),

P =

_

¸

¸

_

0 1 0 0

1

9

4

9

4

9

0

0

4

9

4

9

1

9

0 0 1 0

_

¸

¸

_

For (b), compute the fourth entry of

_

0 0 0 1

¸

P

6

,

that is, the 4, 4th entry of P

6

.

2. The answer is given by

EX

3

=

_

P(X

3

= 0) P(X

3

= 1) P(X

3

= 2)

¸

_

_

0

1

2

_

_

=

_

1

4

1

4

1

2

¸

P

3

_

_

0

1

2

_

_

.

3. The state X

n

of our Markov chain, 1 or 2, is the coin we ﬂip on day n. (a) Let

P =

_

0.7 0.3

0.6 0.4

_

.

Then,

_

P(X

3

= 1) P(X

3

= 2)

¸

=

_

1

2

1

2

¸

P

3

and the answer to (a) is the ﬁrst entry. (b) Answer: P(X

3

= 1) P

3

11

P

8

11

. (c) You toss Heads

on days 3 and 6 if and only if you toss Coin 1 on days 4 and 7, so the answer is P(X

4

= 1) P

3

11

.

12 MARKOV CHAINS: INTRODUCTION 152

4. (a) The states are the four ordered pairs aa, ab, ba, and bb, which we will code as 1, 2, 3, and

4. Then,

P =

_

¸

¸

_

0.2 0.8 0 0

0 0 0.5 0.5

0.5 0.5 0 0

0 0 0.8 0.2

_

¸

¸

_

.

The answer to (b) is the sum of the ﬁrst and the third entries of

_

1 0 0 0

¸

P

9

.

The power is 9 instead of 10 because the initial time for the chain (when it is at state aa) is

time 1 for the walker.

13 MARKOV CHAINS: CLASSIFICATION OF STATES 153

13 Markov Chains: Classiﬁcation of States

We say that a state j is accessible from state i, i → j, if P

n

ij

> 0 for some n ≥ 0. This means

that there is a possibility of reaching j from i in some number of steps. If j is not accessible

from i, P

n

ij

= 0 for all n ≥ 0, and thus the chain started from i never visits j:

P(ever visit j[X

0

= i) = P(

∞

_

n=0

¦X

n

= j¦ [X

0

= i)

≤

∞

n=0

P(X

n

= j[X

0

= i) = 0.

Also, note that for accessibility the size of entries of P does not matter, all that matters is which

are positive and which are 0. For computational purposes, one should also observe that, if the

chain has m states, then j is accessible from i if and only if (P +P

2

+. . . +P

m

)

ij

> 0.

If i is accessible from j and j is accessible from i, then we say that i and j communicate,

i ↔j. It is easy to check that this is an equivalence relation:

1. i ↔i;

2. i ↔j implies j ↔i; and

3. i ↔j and j ↔k together imply i ↔k.

The only nontrivial part is (3) and, to prove it, let us assume i →j and j →k. This means that

there exists an n ≥ 0 so that P

n

ij

> 0 and an m ≥ 0 so that P

m

jk

> 0. Now, one can get from i

to j in m+n steps by going ﬁrst to j in n steps and then from j to k in m steps, so that

P

n+m

ik

≥ P

n

ij

P

m

jk

> 0.

(Alternatively, one can use that P

m+n

= P

n

P

m

and then

P

n+m

ik

=

P

n

i

P

m

k

≥ P

n

ij

P

m

jk

,

as the sum of nonnegative numbers is at least as large as one of its terms.)

The accessibility relation divides states into classes. Within each class, all states commu-

nicate with each other, but no pair of states in diﬀerent classes communicates. The chain is

irreducible if there is only one class. If the chain has m states, irreducibility means that all

entries of I +P +. . . +P

m

are nonzero.

Example 13.1. To determine the classes we may present the Markov chain as a graph, in which

we only need to depict the edges that signify nonzero transition probabilities (their precise value

is irrelevant for this purpose); by convention, we draw an undirected edge when probabilities in

both directions are nonzero. Here is an example:

13 MARKOV CHAINS: CLASSIFICATION OF STATES 154

3

4

5

1

2

Any state 1, 2, 3, 4 is accessible from any of the ﬁve states, but 5 is not accessible from 1, 2, 3, 4.

So, we have two classes: ¦1, 2, 3, 4¦, and ¦5¦. The chain is not irreducible.

Example 13.2. Consider the chain on states 1, 2, 3 and

P =

_

_

1

2

1

2

0

1

2

1

4

1

4

0

1

3

2

3

_

_

.

As 1 ↔2 and 2 ↔3, this is an irreducible chain.

Example 13.3. Consider the chain on states 1, 2, 3, 4, and

P =

_

¸

¸

_

1

2

1

2

0 0

1

2

1

2

0 0

0 0

1

4

3

4

0 0 0 1

_

¸

¸

_

.

This chain has three classes ¦1, 2¦, ¦3¦ and ¦4¦, hence, it is not irreducible.

For any state i, denote

f

i

= P(ever reenter i[X

0

= i).

We call a state i recurrent if f

i

= 1, and transient if f

i

< 1.

Example 13.4. Back to the previous example. Obviously, 4 is recurrent, as it is an absorbing

state. The only possibility of returning to 3 is to do so in one step, so we have f

3

=

1

4

, and 3 is

transient. Moreover, f

1

= 1 because in order to never return to 1 we need to go to state 2 and

stay there forever. We stay at 2 for n steps with probability

_

1

2

_

n

→0,

as n → ∞, so the probability of staying at 1 forever is 0 and, consequently, f

1

= 1. By similar

logic, f

2

= 1. We will soon develop better methods to determine recurrence and transience.

Starting from any state, a Markov Chain visits a recurrent state inﬁnitely many times or not

at all. Let us now compute, in two diﬀerent ways, the expected number of visits to i (i.e., the

times, including time 0, when the chain is at i). First, we observe that, at every visit to i, the

probability of never visiting i again is 1 −f

i

, therefore,

P(exactly n visits to i[X

0

= i) = f

n−1

i

(1 −f

i

).

13 MARKOV CHAINS: CLASSIFICATION OF STATES 155

This formula says that the number of visits to i is a Geometric(1 −f

i

) random variable and so

its expectation is

E(number of visits to i[X

0

= i) =

1

1 −f

i

.

A second way to compute this expectation is by using the indicator trick:

E(number of visits to i[X

0

= i) = E(

∞

n=0

I

n

[X

0

= i),

where I

n

= I

{Xn=i}

, n = 0, 1, 2, . . .. Then,

E(

∞

n=0

I

n

[X

0

= i) =

∞

n=0

P(X

n

= i[X

0

= i)

=

∞

n=0

P

n

ii

.

Thus,

1

1 −f

i

=

∞

n=0

P

n

ii

and we have proved the following theorem.

Theorem 13.1. Characterization of recurrence via n step return probabilities:

A state i is recurrent if and only if

∞

n=1

P

n

ii

= ∞.

We call a subset S

0

⊂ S of states closed if P

ij

= 0 for each i ∈ S

0

and j / ∈ S

0

. In plain

language, once entered, a closed set cannot be exited.

Proposition 13.2. If a closed subset S

0

has only ﬁnitely many states, then there must be at least

one recurrent state. In particular, any ﬁnite Markov chain must contain at least one recurrent

state.

Proof. Start from any state from S

0

. By deﬁnition, the chain stays in S

0

forever. If all states in

S

0

are transient, then each of them is visited either not at all or only ﬁnitely many times. This

is impossible.

Proposition 13.3. If i is recurrent and i → j, then also j →i.

Proof. There is an n

0

such that P

n

0

ij

> 0, i.e., starting from i, the chain can reach j in n

0

steps.

Thus, every time it is at i, there is a ﬁxed positive probability that it will be at j n

0

steps later.

Starting from i, the chain returns to i inﬁnitely many times and, every time it does so, it has

an independent chance to reach j n

0

steps later; thus, eventually the chain does reach j. Now

assume that it is not true that j →i. Then, once the chain reaches j, it never returns to i, but

then, i is not recurrent. This contradiction ends the proof.

13 MARKOV CHAINS: CLASSIFICATION OF STATES 156

Proposition 13.4. If i is recurrent and i → j, then j is also recurrent. Therefore, in any class,

either all states are recurrent or all are transient. In particular, if the chain is irreducible, then

either all states are recurrent or all are transient.

In light of this proposition, we can classify each class, as well as an irreducible Markov chain,

as recurrent or transient.

Proof. By the previous proposition, we know that also j →i. We will now give two arguments

for the recurrence of j.

We could use the same logic as before: starting from j, the chain must visit i with probability

1 (or else the chain starting at i has a positive probability of no return to i, by visiting j), then

it returns to i inﬁnitely many times and, at each of those times, it has an independent chance

of getting to j at a later time — so it must do so inﬁnitely often.

For another argument, we know that there exist k, m ≥ 0 so that P

k

ij

> 0, P

m

ji

> 0. Further-

more, for any n ≥ 0, one way to get from j to j in m+n +k steps is by going from j to i in m

steps, then from i to i in n steps, and then from i to j in k steps; thus,

P

m+n+k

jj

≥ P

m

ji

P

n

ii

P

k

ij

.

If

∞

n=0

P

ii

= ∞, then

∞

n=0

P

m+n+k

jj

= ∞and, ﬁnally,

∞

=0

P

jj

= ∞. In short, if i is recurrent,

then so is j.

Proposition 13.5. Any recurrent class is a closed subset of states.

Proof. Let S

0

be a recurrent class, i ∈ S

0

and j ,∈ S

0

. We need to show that P

ij

= 0. Assume

the converse, P

ij

> 0. As j does not communicate with i, the chain never reaches i from j, i.e.,

i is not accessible from j. But this is a contradiction to Proposition 13.3.

For ﬁnite Markov chains, these propositions make it easy to determine recurrence and tran-

sience: if a class is closed, it is recurrent, but if it is not closed, it is transient.

Example 13.5. Assume that the states are 1, 2, 3, 4 and that the transition matrix is

P =

_

¸

¸

_

0 0

1

2

1

2

1 0 0 0

0 1 0 0

0 1 0 0

_

¸

¸

_

.

By inspection, every state is accessible from every other state and so this chain is irreducible.

Therefore, every state is recurrent.

13 MARKOV CHAINS: CLASSIFICATION OF STATES 157

Example 13.6. Assume now that the states are 1, . . . , 6 and

P =

_

¸

¸

¸

¸

¸

¸

_

0 1 0 0 0 0

0.4 0.6 0 0 0 0

0.3 0 0.4 0.2 0.1 0

0 0 0 0.3 0.7 0

0 0 0 0.5 0 0.5

0 0 0 0.3 0 0.7

_

¸

¸

¸

¸

¸

¸

_

.

2

1 3 4

5 6

0.6

1

0.4

0.3

0.3

0.5

0.5

0.7

0.1

0.3 0.2

0.4

0.7

We observe that 3 can only be reached from 3, therefore, 3 is in a class of its own. States 1 and

2 can reach each other and no other state, so they form a class together. Furthermore, 4, 5, 6

all communicate with each other. Thus, the division into classes is ¦1, 2¦, ¦3¦, and ¦4, 5, 6¦. As

it is not closed, ¦3¦ is a transient class (in fact, it is clear that f

3

= 0.4). On the other hand,

¦1, 2¦ and ¦4, 5, 6¦ both are closed and, therefore, recurrent.

Example 13.7. Recurrence of a simple random walk on Z. Recall that such a walker moves

from x to x + 1 with probability p and to x − 1 with probability 1 − p. We will assume that

p ∈ (0, 1) and denote the chain S

n

= S

(1)

n

. (The superscript indicates the dimension. We will

make use of this in subsequent examples in which the walker will move in higher dimensions.)

As such a walk is irreducible, we only have to check whether state 0 is recurrent or transient, so

we assume that the walker begins at 0. First, we observe that the walker will be at 0 at a later

time only if she makes an equal number of left and right moves. Thus, for n = 1, 2, . . .,

P

2n−1

00

= 0

and

P

2n

00

=

_

2n

n

_

p

n

(1 −p)

n

.

Now, we recall Stirling’s formula:

n! ∼ n

n

e

−n

√

2πn

(the symbol “∼” means that the quotient of the two quantities converges to 1 as n → ∞).

13 MARKOV CHAINS: CLASSIFICATION OF STATES 158

Therefore,

_

2n

n

_

=

(2n)!

(n!)

2

∼

(2n)

2n

e

−2n

√

2π2n

n

2n

e

−2n

2πn

=

2

2n

√

nπ

,

and, therefore,

P

2n

00

=

2

2n

√

nπ

p

n

(1 −p)

n

∼

1

√

nπ

(4p(1 −p))

n

In the symmetric case, when p =

1

2

,

P

2n

00

∼

1

√

nπ

,

therefore,

∞

n=0

P

2n

00

= ∞,

and the random walk is recurrent.

When p ,=

1

2

, 4p(1 − p) < 1, so that P

2n

00

goes to 0 faster than the terms of a convergent

geometric series,

∞

n=0

P

2n

00

< ∞,

and the random walk is transient. In this case, what is the probability f

0

that the chain ever

reenters 0? We need to recall the Gambler’s ruin probabilities,

P(S

n

reaches N before 0[S

0

= 1) =

1 −

1−p

p

1 −

_

1−p

p

_

N

.

As N →∞, the probability

P(S

n

reaches 0 before N[S

0

= 1) = 1 −P(S

n

reaches N before 0[S

0

= 1)

converges to

P(S

n

ever reaches 0[S

0

= 1) =

_

1 if p <

1

2

,

1−p

p

if p >

1

2

.

13 MARKOV CHAINS: CLASSIFICATION OF STATES 159

Assume that p >

1

2

. Then,

f

0

= P(S

1

= 1, S

n

returns to 0 eventually) +P(S

1

= −1, S

n

returns to 0 eventually)

= p

1 −p

p

+ (1 −p) 1

= 2(1 −p).

If p < 1/2, we may use the fact that replacing the walker’s position with its mirror image replaces

p by 1 −p; this gives f

0

= 2p when p <

1

2

.

Example 13.8. Is the simple symmetric random walk on Z

2

recurrent? A walker now moves on

integer points in two dimensions: each step is a distance 1 jump in one of the four directions (N,

S, E, or W). We denote this Markov chain by S

(2)

n

and imagine a drunk wandering at random

through the rectangular grid of streets of a large city. (Chicago would be a good example.) The

question is whether the drunk will eventually return to her home at (0, 0). All starting positions

in this and in the next example will be the appropriate origins. Note again that the walker can

only return in an even number of steps and, in fact, both the number of steps in the x direction

(E or W) and in the y direction (N or S) must be even (otherwise, the respective coordinate

cannot be 0).

We condition on the number N of times the walker moves in the x-direction:

P(S

(2)

2n

= (0, 0)) =

n

k=0

P(N = 2k)P(S

(2)

2n

= (0, 0)[N = 2k)

=

n

k=0

P(N = 2k)P(S

(1)

2k

= 0)P(S

(1)

2(n−k)

= 0).

In order not to obscure the computation, we will not show the full details from now on; ﬁlling

in the missing pieces is an excellent computational exercise.

First, as the walker chooses to go horizontally or vertically with equal probability, N ∼

2n

2

=

n with overwhelming probability and so we can assume that k ∼

n

2

. Taking this into account,

P(S

(1)

2k

= 0) ∼

√

2

√

nπ

,

P(S

(1)

2(n−k)

= 0) ∼

√

2

√

nπ

.

Therefore,

P(S

(2)

2n

= (0, 0)) ∼

2

nπ

n

k=0

P(N = 2k)

∼

2

nπ

P(N is even)

∼

1

nπ

,

13 MARKOV CHAINS: CLASSIFICATION OF STATES 160

as we know that (see Problem 1 in Chapter 11)

P(N is even) =

1

2

.

Therefore,

∞

n=0

P(S

(2)

2n

= (0, 0)) = ∞

and we have demonstrated that this chain is still recurrent, albeit barely. In fact, there is an

easier slick proof that does not generalize to higher dimensions, which demonstrates that

P(S

(2)

2n

= 0) = P(S

(1)

2n

= 0)

2

.

Here is how it goes. If we let each coordinate of a two-dimensional random walker move inde-

pendently, then the above is certainly true. Such a walker makes diagonal moves, from (x, y) to

(x +1, y +1), (x −1, y +1), (x +1, y −1), or (x −1, y −1) with equal probability. At ﬁrst, this

appears to be a diﬀerent walk, but if we rotate the lattice by 45 degrees, scale by

1

√

2

, and ignore

half of the points that are never visited, this becomes the same walk as S

(2)

n

. In particular, it is

at the origin exactly when S

(2)

n

is.

Example 13.9. Is the simple symmetric random walk on Z

3

recurrent? Now, imagine a squirrel

running around in a 3 dimensional maze. The process S

(3)

n

moves from a point (x, y, z) to one

of the six neighbors (x ± 1, y, z), (x, y ± 1, z), (x, y, z ± 1) with equal probability. To return to

(0, 0, 0), it has to make an even number number of steps in each of the three directions. We will

condition on the number N of steps in the z direction. This time N ∼

2n

3

and, thus,

P(S

(3)

2n

= (0, 0, 0)) =

n

k=0

P(N = 2k)P(S

(1)

2k

= 0)P(S

(2)

2(n−k)

= (0, 0))

∼

n

k=0

P(N = 2k)

√

3

√

πn

3

2πn

=

3

√

3

2π

3/2

n

3/2

P(N is even)

∼

3

√

3

4π

3/2

n

3/2

.

Therefore,

n

P(S

(3)

2n

= (0, 0, 0)) < ∞

and the three-dimensional random walk is transient, so the squirrel may never return home. The

probability f

0

= P(return to 0), thus, is not 1, but can we compute it? One approximation is

obtained by using

1

1 −f

0

=

∞

n=0

P(S

(3)

2n

= (0, 0, 0)) = 1 +

1

6

+. . . ,

13 MARKOV CHAINS: CLASSIFICATION OF STATES 161

but this series converges slowly and its terms are diﬃcult to compute. Instead, one can use the

remarkable formula, derived by Fourier analysis,

1

1 −f

0

=

1

(2π)

3

___

(−π,π)

3

dxdy dz

1 −

1

3

(cos(x) + cos(y) + cos(z))

,

which gives, to four decimal places,

f

0

≈ 0.3405.

Problems

1. For the following transition matrices, determine the classes and specify which are recurrent

and which transient.

P

1

=

_

¸

¸

¸

¸

_

0

1

2

1

2

0 0

1

2

0

1

2

0 0

1

2

1

2

0 0 0

1

2

1

4

0 0

1

4

0

1

2

1

4

1

4

0

_

¸

¸

¸

¸

_

P

2

=

_

¸

¸

¸

¸

_

0 0 0 1 0

0 0 0 1 0

1

3

1

3

0 0

1

3

0 0 1 0 0

0 0 1 0 0

_

¸

¸

¸

¸

_

P

3

=

_

¸

¸

¸

¸

_

1

2

0

1

2

0 0

1

4

1

2

1

4

0 0

1

4

1

4

1

2

0 0

0 0 0

1

2

1

2

0 0 0

1

2

1

2

_

¸

¸

¸

¸

_

P

4

=

_

¸

¸

¸

¸

_

1

2

1

2

0 0 0

1

2

1

2

0 0 0

0 0 1 0 0

0 0

1

2

1

2

0

1 0 0 0 0

_

¸

¸

¸

¸

_

2. Assume that a Markov chain X

n

has states 0, 1, 2 . . . and transitions from each i > 0 to i +1

with probability 1 −

1

2·i

α

and to 0 with probability

1

2·i

α

. Moreover, from 0 it transitions to 1

with probability 1. (a) Is this chain irreducible? (b) Assume that X

0

= 0 and let R be the

ﬁrst return time to 0 (i.e., the ﬁrst time after the initial time the chain is back at the origin).

Determine α for which

1 −f

0

= P(no return) = lim

n→∞

P(R > n) = 0.

(c) Depending on α, determine which classes are recurrent.

3. Consider the one-dimensional simple symmetric random walk S

n

= S

(1)

n

with p =

1

2

. As in

the Gambler’s ruin problem, ﬁx an N and start at some 0 ≤ i ≤ N. Let E

i

be the expected

time at which the walk ﬁrst hits either 0 or N. (a) By conditioning on the ﬁrst step, determine

the recursive equation for E

i

. Also, determine the boundary conditions E

0

and E

N

. (b) Solve

the recursion. (c) Assume that the chain starts at 0 and let R be the ﬁrst time (after time 0)

that it revisits 0. By recurrence, we know that P(R < ∞) = 1; use (b) to show that ER = ∞.

The walk will eventually return to 0, but the expected waiting time is inﬁnite!

13 MARKOV CHAINS: CLASSIFICATION OF STATES 162

Solutions to problems

1. Assume that the states are 1, . . . , 5. For P

1

: ¦1, 2, 3¦ recurrent, ¦4, 5¦ transient. For P

2

:

irreducible, so all states are recurrent. For P

3

: ¦1, 2, 3¦ recurrent, ¦4, 5¦ recurrent. For P

4

:

¦1, 2¦ recurrent, ¦3¦ recurrent (absorbing), ¦4¦ transient, ¦5¦ transient.

2. (a) The chain is irreducible. (b) If R > n, then the chain, after moving to 1, makes n − 1

consecutive steps to the right, so

P(R > n) =

n−1

i=1

_

1 −

1

2 i

α

_

.

The product converges to 0 if and only if its logarithm converges to −∞ and that holds if and

only if the series

∞

i=1

1

2 i

α

diverges, which is when α ≤ 1. (c) For α ≤ 1, the chain is recurrent, otherwise, it is transient.

3. For (a), the walker makes one step and then proceeds from i+1 or i−1 with equal probability,

so that

E

i

= 1 +

1

2

(E

i+1

+E

i

),

with E

0

= E

N

= 0. For (b), the homogeneous equation is the same as the one in the Gambler’s

ruin, so its general solution is linear: Ci +D. We look for a particular solution of the form Bi

2

and we get Bi

2

= 1 +

1

2

(B(i

2

+2i +1) +B(i

2

−2i +1)) = 1 +Bi

2

+B, so B = −1. By plugging

in the boundary conditions we can solve for C and D to get D = 0, C = N. Therefore,

E

i

= i(N −i).

For (c), after a step the walker proceeds either from 1 or −1 and, by symmetry, the expected

time to get to 0 is the same for both. So, for every N,

ER ≥ 1 +E

1

= 1 + 1 (N −1) = N,

and so ER = ∞.

14 BRANCHING PROCESSES 163

14 Branching processes

In this chapter we will consider a random model for population growth in the absence of spatial or

any other resource constraints. So, consider a population of individuals which evolves according

to the following rule: in every generation n = 0, 1, 2, . . ., each individual produces a random

number of oﬀspring in the next generation, independently of other individuals. The probability

mass function for oﬀspring is often called the oﬀspring distribution and is given by

p

i

= P(number of oﬀspring = i),

for i = 0, 1, 2, . . .. We will assume that p

0

< 1 and p

1

< 1 to eliminate the trivial cases. This

model was introduced by F. Galton in the late 1800s to study the disappearance of family names;

in this case p

i

is the probability that a man has i sons.

We will start with a single individual in generation 0 and generate the resulting random

family tree. This tree is either ﬁnite (when some generation produces no oﬀspring at all) or

inﬁnite — in the former case, we say that the branching process dies out and, in the latter case,

that it survives.

We can look at this process as a Markov chain where X

n

is the number of individuals in

generation n. Let us start with the following observations:

• If X

n

reaches 0, it stays there, so 0 is an absorbing state.

• If p

0

> 0, P(X

n+1

= 0[X

n

= k) > 0, for all k.

• Therefore, by Proposition 13.5, all states other than 0 are transient if p

0

> 0; the popu-

lation must either die out or increase to inﬁnity. If p

0

= 0, then the population cannot

decrease and each generation increases with probability at least 1 − p

1

, therefore it must

increase to inﬁnity.

It is possible to write down the transition probabilities for this chain, but they have a rather

complicated explicit form, as

P(X

n+1

= i[X

n

= k) = P(W

1

+W

2

+. . . +W

k

= i),

where W

1

, . . . , W

k

are independent random variables, each with the oﬀspring distribution. This

suggests using moment generating functions, which we will indeed do. Recall that we are as-

suming that X

0

= 1.

Let

δ

n

= P(X

n

= 0)

be the probability that the population is extinct by generation (which we also think of as time) n.

The probability π

0

that the branching process dies out is, then, the limit of these probabilities:

π

0

= P(the process dies out) = P(X

n

= 0 for some n) = lim

n→∞

P(X

n

= 0) = lim

n→∞

δ

n

.

14 BRANCHING PROCESSES 164

Note that π

0

= 0 if p

0

= 0. Our main task will be to compute π

0

for general probabilities p

k

. We

start, however, with computing the expectation and variance of the population at generation n.

Let µ and σ

2

be the expectation and the variance of the oﬀspring distribution, that is,

µ = EX

1

=

∞

k=0

kp

k

,

and

σ

2

= Var(X

1

).

Let m

n

= E(X

n

) and v

n

= Var(X

n

). Now, X

n+1

is the sum of a random number X

n

of

independent random variables, each with the oﬀspring distribution. Thus, we have by Theorem

11.1,

m

n+1

= m

n

µ,

and

v

n+1

= m

n

σ

2

+v

n

µ

2

.

Together with initial conditions m

0

= 1, v

0

= 0, the two recursive equations determine m

n

and

v

n

. We can very quickly solve the ﬁrst recursion to get m

n

= µ

n

and so

v

n+1

= µ

n

σ

2

+v

n

µ

2

.

When µ = 1, m

n

= 1 and v

n

= nσ

2

. When µ ,= 1, the recursion has the general solution

v

n

= Aµ

n

+Bµ

2n

. The constant A must satisfy

Aµ

n+1

= σ

2

µ

n

+Aµ

n+2

,

so that,

A =

σ

2

µ(1 −µ)

.

From v

0

= 0 we get A+B = 0 and the solution is given in the next theorem.

Theorem 14.1. Expectation m

n

and variance v

n

of the nth generation count.

We have

m

n

= µ

n

and

v

n

=

_

σ

2

µ

n

(1−µ

n

)

µ(1−µ)

if µ ,= 1,

nσ

2

if µ = 1.

We can immediately conclude that µ < 1 implies π

0

= 1, as

P(X

n

,= 0) = P(X

n

≥ 1) ≤ EX

n

= µ

n

→0;

14 BRANCHING PROCESSES 165

if the individuals have less than one oﬀspring on average, the branching process dies out.

Now, let φ be the moment generating function of the oﬀspring distribution. It is more

convenient to replace e

t

in our original deﬁnition with s, so that

φ(s) = φ

X

1

(s) = E(s

X

1

) =

∞

k=0

p

k

s

k

.

In combinatorics, this would be exactly the generating function of the sequence p

k

. Then, the

moment generating function of X

n

is

φ

Xn

(s) = E[s

Xn

] =

∞

k=0

P(X

n

= k)s

k

.

We will assume that 0 ≤ s ≤ 1 and observe that, for such s, this power series converges. Let us

get a recursive equation for φ

Xn

by conditioning on the population count in generation n −1:

φ

Xn

(s) = E[s

X

n

]

=

∞

k=0

E[s

X

n

[X

n−1

= k]P(X

n−1

= k)

=

∞

k=0

E[s

W

1

+...+W

k

]P(X

n−1

= k)

=

∞

k=0

(E(s

W

1

)E(s

W

2

) E(s

W

k

)P(X

n−1

= k)

=

∞

k=0

φ(s)

k

P(X

n−1

= k)

= φ

X

n−1

(φ(s)).

So, φ

Xn

is the nth iterate of φ,

φ

X

2

(s) = φ(φ(s)), φ

X

3

(s) = φ(φ(φ(s))), . . .

and we can also write

φ

X

n

(s) = φ(φ

X

n−1

(s)).

Next, we take a closer look at the properties of φ. Clearly,

φ(0) = p

0

> 0

and

φ(1) =

∞

k=0

p

k

= 1.

Moreover, for s > 0,

φ

(s) =

∞

k=0

kp

k

s

k−1

> 0,

14 BRANCHING PROCESSES 166

so φ is strictly increasing, with

φ

(1) = µ.

Finally,

φ

(s) =

∞

k=1

k(k −1)p

k

s

k−2

≥ 0,

so φ is also convex. The crucial observation is that

δ

n

= φ

Xn

(0),

and so δ

n

is obtained by starting at 0 and computing the nth iteration of φ. It is also clear

that δ

n

is a nondecreasing sequence (because X

n−1

= 0 implies that X

n

= 0). We now consider

separately two cases:

• Assume that φ is always above the diagonal, that is, φ(s) ≥ s for all s ∈ [0, 1]. This

happens exactly when µ = φ

(1) ≤ 1. In this case, δ

n

converges to 1, and so π

0

= 1. This

is shown in the right graph of the ﬁgure below.

• Now, assume that φ is not always above the diagonal, which happens when µ > 1. In

this case, there exists exactly one s < 1 which solves s = φ(s). As δ

n

converges to this

solution, we conclude that π

0

< 1 is the smallest solution to s = φ(s). This is shown in

the left graph of the ﬁgure below.

1

1

1

1

δ

2

φ φ

δ

1

δ

2

δ

1

The following theorem is a summary of our ﬁndings.

Theorem 14.2. Probability that the branching process dies out.

If µ ≤ 1, π

0

= 1. If µ > 1, then π

0

is the smallest solution on [0, 1] to s = φ(s).

Example 14.1. Assume that a branching process is started with X

0

= k instead of X

0

= 1.

How does this change the survival probability? The k individuals all evolve independent family

trees, so that the probability of eventual death is π

k

0

. It also follows that

P(the process ever dies out[X

n

= k) = π

k

0

14 BRANCHING PROCESSES 167

for every n.

If µ is barely larger than 1, the probability π

0

of extinction is quite close to 1. In the context

of family names, this means that the ones with already a large number of representatives in the

population are at a distinct advantage, as the probability that they die out by chance is much

lower than that of those with only a few representatives. Thus, common family names become

ever more common, especially in societies that have used family names for a long time. The

most famous example of this phenomenon is in Korea, where three family names (Kim, Lee, and

Park in English transcriptions) account for about 45% of the population.

Example 14.2. Assume that

p

k

= p

k

(1 −p), k = 0, 1, 2, . . . .

This means that the oﬀspring distribution is Geometric(1 −p) minus 1. Thus,

µ =

1

1 −p

−1 =

p

1 −p

and, if p ≤

1

2

, π

0

= 1. Now suppose that p >

1

2

. Then, we have to compute

φ(s) =

∞

k=0

s

k

p

k

(1 −p)

=

1 −p

1 −ps

.

The equation φ(s) = s has two solutions, s = 1 and s =

1−p

p

. Thus, when p >

1

2

,

π

0

=

1 −p

p

.

Example 14.3. Assume that the oﬀspring distribution is Binomial(3,

1

2

). Compute π

0

.

As µ =

3

2

> 1, π

0

is given by

φ(s) =

1

8

+

3

8

s +

3

8

s

2

+

1

8

s

3

= s,

with solutions s = 1, −

√

5 −2, and

√

5 −2. The one that lies in (0, 1),

√

5 −2 ≈ 0.2361, is the

probability π

0

.

Problems

1. For a branching process with oﬀspring distribution given by p

0

=

1

6

, p

1

=

1

2

, p

3

=

1

3

, determine

(a) the expectation and variance of X

9

, the population at generation 9, (b) the probability that

the branching process dies by generation 3, but not by generation 2, and (c) the probability that

14 BRANCHING PROCESSES 168

the process ever dies out. Then, assume that you start 5 independent copies of this branching

process at the same time (equivalently, change X

0

to 5), and (d) compute the probability that

the process ever dies out.

2. Assume that the oﬀspring distribution of a branching process is Poisson with parameter λ.

(a) Determine the expected combined population through generation 10. (b) Determine, with

the aid of computer if necessary, the probability that the process ever dies out for λ =

1

2

, λ = 1,

and λ = 2.

3. Assume that the oﬀspring distribution of a branching process is given by p

1

= p

2

= p

3

=

1

3

.

Note that p

0

= 0. Solve the following problem for a = 1, 2, 3. Let Y

n

be the proportion of

individuals in generation n (out of the total number of X

n

individuals) from families of size

a. (A family consists of individuals that are oﬀspring of the same parent from the previous

generation.) Compute the limit of Y

n

as n →∞.

Solutions to problems

1. For (a), compute µ =

3

2

, σ

2

=

7

2

−

9

4

=

5

4

, and plug into the formula. Then compute

φ(s) =

1

6

+

1

2

s +

1

3

s

3

.

For (b),

P(X

3

= 0) −P(X

2

= 0) = φ(φ(φ(0))) −φ(φ(0)) ≈ 0.0462.

For (c), we solve φ(s) = s, 0 = 2s

3

−3s +1 = (s −1)(2s

2

+2s −1), and so π

0

=

√

3−1

2

≈ 0.3660.

For (d), the answer is π

5

0

.

2. For (a), µ = λ and

E(X

0

+X

1

+. . . +X

10

) = EX

0

+EX

1

+. . . +EX

10

= 1 +λ + +λ

10

=

λ

11

−1

λ −1

,

if λ ,= 1, and 11 if λ = 1. For (b), if λ ≤ 1 then π

0

= 1, but if λ > 1, then π

0

is the solution for

s ∈ (0, 1) to

e

λ(s−1)

= s.

This equation cannot be solved analytically, but we can numerically obtain the solution for λ = 2

to get π

0

≈ 0.2032.

3. Assuming X

n−1

= k, the number of families at time n is also k. Each of these has, in-

dependently, a members with probability p

a

. If k is large — which it will be for large n, as

the branching process cannot die out — then, with overwhelming probability, the number of

children in such families is about ap

a

k, while X

n

is about µk. Then, the proportion Y

n

is about

ap

a

µ

, which works out to be

1

6

,

1

3

, and

1

2

, for a = 1, 2, and 3.

15 MARKOV CHAINS: LIMITING PROBABILITIES 169

15 Markov Chains: Limiting Probabilities

Example 15.1. Assume that the transition matrix is given by

P =

_

_

0.7 0.2 0.1

0.4 0.6 0

0 1 0

_

_

.

Recall that the n-step transition probabilities are given by powers of P. Let us look at some

large powers of P, beginning with

P

4

=

_

_

0.5401 0.4056 0.0543

0.5412 0.4048 0.054

0.54 0.408 0.052

_

_

.

Then, to four decimal places,

P

8

≈

_

_

0.5405 0.4054 0.0541

0.5405 0.4054 0.0541

0.5405 0.4054 0.0541

_

_

.

and subsequent powers are the same to this precision.

The matrix elements appear to converge and the rows become almost identical. Why? What

determines the limit? These questions will be answered in this chapter.

We say that a state i ∈ S has period d ≥ 1 if (1) P

n

ii

> 0 implies that d[n and (2) d is the

largest positive integer that satisﬁes (1).

Example 15.2. Simple random walk on Z, with p ∈ (0, 1). The period of any state is 2 because

the walker can return to her original position in any even number of steps, but in no odd number

of steps.

Example 15.3. Random walk on the vertices of a square. Again, the period of any state is 2

for the same reason.

Example 15.4. Random walk on the vertices of a triangle. The period of any state is 1

because the walker can return in two steps (one step out and then back) or three steps (around

the triangle).

Example 15.5. Deterministic cycle. If a chain has n states 0, 1, . . . , n−1 and transitions from

i to (i + 1) mod n with probability 1, then it has period n. So, any period is possible.

However, if the following two transition probabilities are changed: P

01

= 0.9 and P

00

= 0.1,

then the chain has period 1. In fact, the period of any state i with P

ii

> 0 is trivially 1.

It can be shown that the period is the same for all states in the same class. If a state, and

therefore its class, has period 1, it is called aperiodic. If the chain is irreducible, we call the

entire chain aperiodic if all states have period 1.

15 MARKOV CHAINS: LIMITING PROBABILITIES 170

For a state i ∈ S, let

R

i

= inf¦n ≥ 1 : X

n

= i¦

be the ﬁrst time, after time 0, that the chain is at i ∈ S. Also, let

f

(n)

i

= P(R

i

= n[X

0

= i)

be the p. m. f. of R

i

when the starting state is i itself (in which case we may call R

i

the return

time). We can connect these to the familiar quantity

f

i

= P(ever reenter i[X

0

= i) =

∞

n=1

f

(n)

i

,

so that the state i is recurrent exactly when

∞

n=1

f

(n)

i

= 1. Then, we deﬁne

m

i

= E[R

i

[X

0

= i] =

∞

n=1

nf

(n)

i

.

If the above series converges, i.e., m

i

< ∞, then we say that i is positive recurrent. It can be

shown that positive recurrence is also a class property: a state shares it with all members of its

class. Thus, an irreducible chain is positive recurrent if each of its states is.

It is not hard to show that a ﬁnite irreducible chain is positive recurrent. In this case, there

must exist an m ≥ 1 and an > 0 so that i can be reached from any j in at most m steps with

probability at least . Then, P(R

i

≥ n) ≤ (1 −)

n/m

, which goes to 0 geometrically fast.

We now state the key theorems. Some of these have rather involved proofs (although none

is exceptionally diﬃcult), which we will merely sketch or omit altogether.

Theorem 15.1. Proportion of the time spent at i.

Assume that the chain is irreducible and positive recurrent. Let N

n

(i) be the number of visits to

i in the time interval from 0 through n. Then,

lim

n→∞

N

n

(i)

n

=

1

m

i

,

in probability.

Proof. The idea is quite simple: once the chain visits i, it returns, on average, once per m

i

time

steps, hence the proportion of time spent there is 1/m

i

. We skip a more detailed proof.

A vector of probabilities, π

i

, i ∈ S, such that

i∈S

π

i

= 1 is called an invariant, or stationary,

distribution for a Markov chain with transition matrix P if

i∈S

π

i

P

ij

= π

j

, for all j ∈ S.

15 MARKOV CHAINS: LIMITING PROBABILITIES 171

In matrix form, if we put π into a row vector [π

1

, π

2

, . . .], then

[π

1

, π

2

, . . .] P = [π

1

, π

2

, . . .].

Thus [π

1

, π

2

, . . .] is a left eigenvector of P, for eigenvalue 1. More important for us is the following

probabilistic interpretation. If π

i

is the p. m. f. for X

0

, that is, P(X

0

= i) = π

i

, for all i ∈ S, it

is also the p. m. f. for X

1

and hence for all other X

n

, that is, P(X

n

= i) = π

i

, for all n.

Theorem 15.2. Existence and uniqueness of invariant distributions.

An irreducible positive recurrent Markov chain has a unique invariant distribution, which is

given by

π

i

=

1

m

i

.

In fact, an irreducible chain is positive recurrent if and only if a stationary distribution exists.

The formula for π should not be a surprise: if the probability that the chain is at i is always

π

i

, then one should expect that the proportion of time spent at i, which we already know to be

1/m

i

, to be equal to π

i

. We will not, however, go deeper into the proof.

Theorem 15.3. Convergence to invariant distribution.

If a Markov chain is irreducible, aperiodic, and positive recurrent, then, for every i, j ∈ S,

lim

n→∞

P

n

ij

= π

j

.

Recall that P

n

ij

= P(X

n

= j[X

0

= i) and note that the limit is independent of the initial

state. Thus, the rows of P

n

are more and more similar to the row vector π as n becomes large.

The most elegant proof of this theorem uses coupling, an important idea ﬁrst developed by

a young French probabilist Wolfgang Doeblin in the late 1930s. (Doeblin’s life is a romantic,

and quite tragic, story. An immigrant from Germany, he died as a soldier in the French army

in 1940, at the age of 25. He made signiﬁcant mathematical contributions during his army

service.) Start with two independent copies of the chain — two particles moving from state to

state according to the transition probabilities — one started from i, the other using the initial

distribution π. Under the stated assumptions, they will eventually meet. Afterwards, the two

particles move together in unison, that is, they are coupled. Thus, the diﬀerence between the

two probabilities at time n is bounded above by twice the probability that coupling does not

happen by time n, which goes to 0. We will not go into greater detail, but, as we will see in the

next example, periodicity is necessary.

Example 15.6. A deterministic cycle with a = 3 has the transition matrix

P =

_

_

0 1 0

0 0 1

1 0 0

_

_

.

15 MARKOV CHAINS: LIMITING PROBABILITIES 172

This is an irreducible chain with the invariant distribution π

0

= π

1

= π

2

=

1

3

(as it is easy to

check). Moreover,

P

2

=

_

_

0 0 1

1 0 0

0 1 0

_

_

,

P

3

= I, P

4

= P, etc. Although the chain does spend 1/3 of the time at each state, the transition

probabilities are a periodic sequence of 0’s and 1’s and do not converge.

Our ﬁnal theorem is mostly a summary of the results for the special, and for us the most

common, case.

Theorem 15.4. Convergence theorem for a ﬁnite state space S.

Assume that s Markov chain with a ﬁnite state space is irreducible.

1. There exists a unique invariant distribution given by π

i

=

1

m

i

.

2. For every i, irrespective of the initial state,

1

n

N

n

(i) →π

i

,

in probability.

3. If the chain is also aperiodic, then, for all i and j,

P

n

ij

→π

j

.

4. If the chain is periodic with period d, then, for every pair i, j ∈ S, there exists an integer

r, 0 ≤ r ≤ d −1, so that

lim

m→∞

P

md+r

ij

= dπ

j

and so that P

n

ij

= 0 for all n such that n ,= r mod d.

Example 15.7. We begin by our ﬁrst example, Example 15.1. That was clearly an irreducible

and aperiodic (note that P

00

> 0) chain. The invariant distribution [π

1

, π

2

, π

3

] is given by

0.7π

1

+ 0.4π

2

= π

1

0.2π

1

+ 0.6π

2

+π

3

= π

2

0.1π

1

= π

3

This system has inﬁnitely many solutions and we need to use

π

1

+π

2

+π

3

= 1

15 MARKOV CHAINS: LIMITING PROBABILITIES 173

to get the unique solution

π

1

=

20

37

≈ 0.5405, π

2

=

15

37

≈ 0.4054, π

3

=

2

37

≈ 0.0541.

Example 15.8. The general two-state Markov chain. Here S = ¦1, 2¦ and

P =

_

α 1 −α

β 1 −β

_

and we assume that 0 < α, β < 1.

απ

1

+βπ

2

=π

1

(1 −α)π

1

+ (1 −β)π

2

=π

2

π

1

+π

2

=1

and, after some algebra,

π

1

=

β

1 +β −α

,

π

2

=

1 −α

1 −β +α

.

Here are a few common follow-up questions:

• Start the chain at 1. In the long run, what proportion of time does the chain spend at 2?

Answer: π

2

(and this does not depend on the starting state).

• Start the chain at 2. What is the expected return time to 2? Answer:

1

π

2

.

• In the long run, what proportion of time is the chain at 2, while at the previous time it

was at 1? Answer: π

1

P

12

, as it needs to be at 1 at the previous time and then make a

transition to 2 (again, the answer does not depend on the starting state).

Example 15.9. In this example we will see how to compute the average length of time a chain

remains in a subset of states, once the subset is entered. Assume that a machine can be in 4

states labeled 1, 2, 3, and 4. In states 1 and 2 the machine is up, working properly. In states 3

and 4 the machine is down, out of order. Suppose that the transition matrix is

P =

_

¸

¸

_

1

4

1

4

1

2

0

0

1

4

1

2

1

4

1

4

1

4

1

4

1

4

1

4

1

4

0

1

2

_

¸

¸

_

.

(a) Compute the average length of time the machine remains up after it goes up. (b) Compute

the proportion of time that the system is up, but down at the next time step (this is called the

breakdown rate).

15 MARKOV CHAINS: LIMITING PROBABILITIES 174

We begin with computing the invariant distribution, which works out to be π

1

=

9

48

, π

2

=

12

48

,

π

3

=

14

48

, π

4

=

13

48

. Then, the breakdown rate is

π

1

(P

13

+P

14

) +π

2

(P

23

+P

24

) =

9

32

,

the answer to (b).

Now, let u be the average stretch of time the machine remains up and let d be the average

stretch of time it is down. We need to compute u to answer (a). We will achieve this by

ﬁguring out the two equations that u and d must satisfy. For the ﬁrst equation, we use that the

proportion of time the system is up is

u

u +d

= π

1

+π

2

=

21

48

.

For the second equation, we use that there is a single breakdown in every interval of time

consisting of the stretch of up time followed by the stretch of down time, i.e., from one repair

to the next repair. This means

1

u +d

=

9

32

,

the breakdown rate from (b). The system of two equations gives d = 2 and, to answer (a),

u =

14

9

.

Computing the invariant distribution amounts to solving a system of linear equations. Nowa-

days this is not diﬃcult to do, even for enormous systems; still, it is worthwhile to observe that

there are cases when the invariant distribution is easy to identify.

We call a square matrix with nonnegative entries doubly stochastic if the sum of the entries

in each row and in each column is 1.

Proposition 15.5. Invariant distribution in a doubly stochastic case.

If the transition matrix for an irreducible Markov chain with a ﬁnite state space S is doubly

stochastic, its (unique) invariant measure is uniform over S.

Proof. Assume that S = ¦1, . . . , m¦, as usual. If [1, . . . , 1] is the row vector with m 1’s, then

[1, . . . , 1]P is exactly the vector of column sums, thus [1, . . . , 1]. This vector is preserved by right

multiplication by P, as is

1

m

[1, . . . , 1]. This vector speciﬁes the uniform p. m. f. on S.

Example 15.10. Simple random walk on a circle. Pick a probability p ∈ (0, 1). Assume that

a points labeled 0, 1, . . . , a −1 are arranged on a circle clockwise. From i, the walker moves to

i + 1 (with a identiﬁed with 0) with probability p and to i − 1 (with −1 identiﬁed with a − 1)

15 MARKOV CHAINS: LIMITING PROBABILITIES 175

with probability 1 −p. The transition matrix is

P =

_

¸

¸

¸

¸

¸

¸

_

0 p 0 0 . . . 0 0 1 −p

1 −p 0 p 0 . . . 0 0 0

0 1 −p 0 p . . . 0 0 0

. . .

0 0 0 0 . . . 1 −p 0 p

p 0 0 0 . . . 0 1 −p 0

_

¸

¸

¸

¸

¸

¸

_

and is doubly stochastic. Moreover, the chain is aperiodic if a is odd and otherwise periodic

with period 2. Therefore, the proportion of time the walker spends at any state is

1

a

, which is

also the limit of P

n

ij

for all i and j if a is odd. If a is even, then P

n

ij

= 0 if (i −j) and n have a

diﬀerent parity, while if they are of the same parity, P

n

ij

→

2

a

.

Assume that we change the transition probabilities a little: assume that, only when the

walker is at 0, she stays at 0 with probability r ∈ (0, 1), moves to 1 with probability (1 − r)p,

and to a −1 with probability (1 −r)(1 −p). The other transition probabilities are unchanged.

Clearly, now the chain is aperiodic for any a, but the transition matrix is no longer doubly

stochastic. What happens to the invariant distribution?

The walker spends a longer time at 0; if we stop the clock while she stays at 0, the chain

is the same as before and spends an equal an proportion of time at all states. It follows that

our perturbed chain spends the same proportion of time at all states except 0, where it spends

a Geometric(1 −r) time at every visit. Therefore, π

0

is larger by the factor

1

1−r

than other π

i

.

Thus, the row vector with invariant distributions is

1

1

1−r

+a −1

_

1

1−r

1 . . . 1

¸

=

_

1

1+(1−r)(a−1)

1−r

1+(1−r)(a−1)

. . .

1−r

1+(1−r)(a−1)

_

.

Thus, we can still determine the invariant distribution if only the self-transition probabilities P

ii

are changed.

Problems

1. Consider the chain in Problem 2 of Chapter 12. (a) Determine the invariant distribution. (b)

Determine lim

n→∞

P

n

10

. Why does it exist?

2. Consider the chain in Problem 4 of Chapter 12 with the same initial state. Determine the

proportion of time the walker spends at a.

3. Roll a fair die n times and let S

n

be the sum of the numbers you roll. Determine, with proof,

lim

n→∞

P(S

n

mod 13 = 0).

4. Peter owns two pairs of running shoes. Each morning he goes running. He is equally likely

15 MARKOV CHAINS: LIMITING PROBABILITIES 176

to leave from his front or back door. Upon leaving the house, he chooses a pair of running shoes

at the door from which he leaves or goes running barefoot if there are no shoes there. On his

return, he is equally likely to enter at either door and leaves his shoes (if any) there. (a) What

proportion of days does he run barefoot? (b) What proportion of days is there at least one pair

of shoes at the front door (before he goes running)? (c) Now, assume also that the pairs of shoes

are green and red and that he chooses a pair at random if he has a choice. What proportion of

mornings does he run in green shoes?

5. Prof. Messi does one of three service tasks every year, coded as 1, 2, and 3. The assignment

changes randomly from year to year as a Markov chain with transition matrix

_

_

7

10

1

5

1

10

1

5

3

5

1

5

1

10

2

5

1

5

_

_

.

Determine the proportion of years that Messi has the same assignment as the previous two years.

6. Consider the Markov chain with states 0, 1, 2, 3, 4, which transition from state i > 0 to one

of the states 0, . . . , i − 1 with equal probability, and transition from 0 to 4 with probability 1.

Show that all P

n

ij

converge as n →∞ and determine the limits.

Solutions to problems

1. The chain is irreducible and aperiodic. Moreover, (a) π =

_

10

21

,

5

21

,

6

21

¸

and (b) the limit is

π

0

=

10

21

.

2. The chain is irreducible and aperiodic. Moreover, π =

_

5

41

,

8

41

,

8

41

,

20

41

¸

and the answer is

π

1

+π

3

=

13

41

.

3. Consider S

n

mod 13. This is a Markov chain with states 0, 1 . . . , 12 and transition matrix is

_

_

0

1

6

1

6

1

6

1

6

1

6

1

6

0 . . . 0

. . .

1

6

1

6

1

6

1

6

1

6

1

6

0 0 . . . 0

_

_

.

(To get the next row, shift cyclically to the right.) This is a doubly stochastic matrix with

π

i

=

1

13

, for all i. So the answer is

1

13

.

4. Consider the Markov chain with states given by the number of shoes at the front door. Then

P =

_

_

3

4

1

4

0

1

4

1

2

1

4

0

1

4

3

4

_

_

.

15 MARKOV CHAINS: LIMITING PROBABILITIES 177

This is a doubly stochastic matrix with π

0

= π

1

= π

2

=

1

3

. Answers: (a) π

0

1

2

+π

2

1

2

=

1

3

; (b)

π

1

+π

2

=

2

3

; (c) π

0

1

4

+π

2

1

4

+π

1

1

2

=

1

3

.

5. Solve for the invariant distribution: π =

_

6

17

,

7

17

,

4

17

¸

. The answer is π

1

P

2

11

+π

2

P

2

22

+π

3

P

2

33

=

19

50

.

6. As the chain is irreducible and aperiodic, P

n

ij

converges to π

j

, j = 0, 1, 2, 3, 4, where π is given

by π =

_

12

37

,

6

37

,

4

37

,

3

37

,

12

37

¸

.

15 MARKOV CHAINS: LIMITING PROBABILITIES 178

Interlude: Practice Midterm 2

This practice exam covers the material from chapters 12 through 15. Give yourself 50 minutes

to solve the four problems, which you may assume have equal point score.

1. Suppose that whether it rains on a day or not depends on whether it did so on the previous

two days. If it rained yesterday and today, then it will rain tomorrow with probability 0.7;

if it rained yesterday but not today, it will rain tomorrow with probability 0.4; if it did not

rain yesterday, but rained today, it will rain tomorrow with probability 0.5; if it did not rain

yesterday nor today, it will rain tomorrow with probability 0.2.

(a) Let X

n

be the Markov chain with 4 states, (R, R), (N, R), (R, N), (N, N), which code

(weather yesterday, weather today) with R = rain and N = no rain. Write down the tran-

sition probability matrix for this chain. (b) Today is Wednesday and it is raining. It also rained

yesterday. Explain how you would compute the probability that it will rain on Saturday. Do

not carry out the computation.

(c) Under the same assumption as in (b), explain how you would approximate the probability

of rain on a day exactly a year from today. Carefully justify your answer, but do not carry out

the computation.

2. Consider the Markov chain with states 1, 2, 3, 4, 5, given by the following transition matrix:

P =

_

¸

¸

¸

¸

_

1

2

0

1

2

0 0

1

4

1

2

1

4

0 0

1

2

0

1

2

0 0

0 0 0

1

2

1

2

0 0 0

1

2

1

2

_

¸

¸

¸

¸

_

.

Specify the classes and determine whether they are transient or recurrent.

3. In a branching process the number of descendants is determined as follows. An individual

ﬁrst tosses a coin that comes out Heads with probability p. If this coin comes out Tails, the

individual has no descendants. If the coin comes Heads, the individual has 1 or 2 descendants,

each with probability

1

2

.

(a) Compute π

0

, the probability that the branching process eventually dies out. Your answer

will, of course, depend on the parameter p.

(b) Write down the expression for the probability that the branching process is still alive at

generation 3. Do not simplify.

4. A random walker is in one of the four states, 0, 1, 2, or 3. If she is at i at some time, she

makes the following transition. With probability

1

2

she moves from i to (i +1) mod 4 (that is, if

she is at 0 she moves to 1, from 1 she moves to 2, from 2 to 3, and from 3 to 0). With probability

1

2

, she moves to a random state among the four states, each chosen with equal probability.

(a) Show that this chain has a unique invariant distribution and compute it. (Take a good look

at the transition matrix before you start solving this).

15 MARKOV CHAINS: LIMITING PROBABILITIES 179

(b) After the walker makes many steps, compute the proportion of time she spends at 1. Does

the answer depend on the chain’s starting point?

(c) After the walker makes many steps, compute the proportion of times she is at the same state

as at the previous time.

15 MARKOV CHAINS: LIMITING PROBABILITIES 180

Solutions to Practice Midterm 2

1. Suppose that whether it rains on a day or not depends on whether it did so on the previous

two days. If it rained yesterday and today, then it will rain tomorrow with probability 0.7;

if it rained yesterday but not today, it will rain tomorrow with probability 0.4; if it did

not rain yesterday, but rained today, it will rain tomorrow with probability 0.5; if it did

not rain yesterday nor today, it will rain tomorrow with probability 0.2.

(a) Let X

n

be the Markov chain with 4 states, (R, R), (N, R), (R, N), (N, N), which code

(weather yesterday, weather today) with R = rain and N = no rain. Write down the

transition probability matrix for this chain.

Solution:

Let (R, R) be state 1, (N, R) state 2, (R, N) state 3, and (N, N) state 4. The

transition matrix is

P =

_

¸

¸

_

0.7 0 0.3 0

0.5 0 0.5 0

0 0.4 0 0.6

0 0.2 0 0.8

_

¸

¸

_

.

(b) Today is Wednesday and it is raining. It also rained yesterday. Explain how you

would compute the probability that it will rain on Saturday. Do not carry out the

computation.

Solution: If Wednesday is time 0, then Saturday is time 3. The initial state is given by

the row [1, 0, 0, 0] and it will rain on Saturday if we end up at state 1 or 2. Therefore,

our solution is

_

1 0 0 0

¸

P

3

_

¸

¸

_

1

1

0

0

_

¸

¸

_

,

that is, the sum of the ﬁrst two entries of

_

1 0 0 0

¸

P

3

.

(c) Under the same assumption as in (b), explain how you would approximate the prob-

ability of rain on a day exactly a year from today. Carefully justify your answer, but

do not carry out the computation.

Solution:

The matrix P is irreducible since the chain makes the following transitions with

positive probability: (R, R) → (R, N) → (N, N) → (N, R) → (R, R). It is also

15 MARKOV CHAINS: LIMITING PROBABILITIES 181

aperiodic because the transition (R, R) →(R, R) has positive probability. Therefore,

the probability can be approximated by π

1

+ π

2

, where [π

1

, π

2

, π

3

, π

4

] is the unique

solution to [π

1

, π

2

, π

3

, π

4

] P = [π

1

, π

2

, π

3

, π

4

] and π

1

+π

2

+π

3

+π

4

= 1.

2. Consider the Markov chain with states 1, 2, 3, 4, 5, given by the following transition

matrix:

P =

_

¸

¸

¸

¸

_

1

2

0

1

2

0 0

1

4

1

2

1

4

0 0

1

2

0

1

2

0 0

0 0 0

1

2

1

2

0 0 0

1

2

1

2

_

¸

¸

¸

¸

_

.

Specify all classes and determine whether they are transient or recurrent.

Solution:

Answer:

• ¦2¦ is transient;

• ¦1, 3¦ is recurrent;

• ¦4, 5¦ is recurrent.

3. In a branching process, the number of descendants is determined as follows. An individual

ﬁrst tosses a coin that comes out Heads with probability p. If this coin comes out Tails,

the individual has no descendants. If the coin comes Heads, the individual has 1 or 2

descendants, each with probability

1

2

.

(a) Compute π

0

, the probability that the branching process eventually dies out. Your

answer will, of course, depend on the parameter p.

Solution:

The probability mass function for the number of descendants is

_

0 1 2

1 −p

p

2

p

2

_

,

and so

E(number of descendants) =

p

2

+p =

3p

2

.

If

3p

2

≤ 1, i.e., p ≤

2

3

, then π

0

= 1. Otherwise, we need to compute φ(s) and solve

φ(s) = s. We have

φ(s) = 1 −p +

p

2

s +

p

2

s

2

.

Then,

s = 1 −p +

p

2

s +

p

2

s

2

,

0 = ps

2

+ (p −2)s + 2(1 −p),

0 = (s −1)(ps −2(1 −p)).

15 MARKOV CHAINS: LIMITING PROBABILITIES 182

We conclude that π

0

=

2(1−p)

p

, if p >

2

3

.

(b) Write down the expression for the probability that the branching process is still alive

at generation 3. Do not simplify.

Solution:

The answer is 1 −φ(φ(φ(0))) and we compute

φ(0) = 1 −p,

φ(φ(0)) = 1 −p +

p

2

(1 −p) +

p

2

(1 −p)

2

,

1 −φ(φ(φ(0))) = 1 −

_

1 −p +

p

2

(1 −p +

p

2

(1 −p)+

p

2

(1 −p)

2

) +

p

2

(1 −p +

p

2

(1 −p) +

p

2

(1 −p)

2

)

2

_

.

4. A random walker is at one of the four states, 0, 1, 2, or 3. If she at i at some time, she

makes the following transition. With probability

1

2

she moves from i to (i + 1) mod 4

(that is, if she is at 0 she moves to 1, from 1 she moves to 2, from 2 to 3, and from 3 to

0). With probability

1

2

, she moves to a random state among the four states, each chosen

with equal probability.

(a) Show that this chain has a unique invariant distribution and compute it. (Take a

good look at the transition matrix before you start solving this).

Solution:

The transition matrix is

P =

_

¸

¸

_

1

8

5

8

1

8

1

8

1

8

1

8

5

8

1

8

1

8

1

8

1

8

5

8

5

8

1

8

1

8

1

8

_

¸

¸

_

.

As P is a doubly stochastic irreducible matrix, π = [

1

4

,

1

4

,

1

4

,

1

4

] is the unique invariant

distribution. (Note that irreducibility is trivial, as all entries are positive).

(b) After the walker makes many steps, compute the proportion of time she spends at 1.

Does the answer depend on the chain’s starting point?

Solution:

The proportion equals π

1

=

1

4

, independently of the starting point.

(c) After the walker makes many steps, compute the proportion of time she is at the

same state as at the previous time.

15 MARKOV CHAINS: LIMITING PROBABILITIES 183

Solution:

The probability of staying at the same state is always

1

8

, which is the answer.

16 MARKOV CHAINS: REVERSIBILITY 184

16 Markov Chains: Reversibility

Assume that you have an irreducible and positive recurrent chain, started at its unique invariant

distribution π. Recall that this means that π is the p. m. f. of X

0

and of all other X

n

as well.

Now suppose that, for every n, X

0

, X

1

, . . . , X

n

have the same joint p. m. f. as their time-reversal

X

n

, X

n−1

, . . . , X

0

. Then, we call the chain reversible — sometimes it is, equivalently, also

said that its invariant distribution π is reversible. This means that a recorded simulation of a

reversible chain looks the same if the “movie” is run backwards.

Is there a condition for reversibility that can be easily checked? The ﬁrst thing to observe

is that for the chain started at π, reversible or not, the time-reversed chain has the Markov

property. This is not completely intuitively clear, but can be checked:

P(X

k

= i[X

k+1

= j, X

k+2

= i

k+2

, . . . , X

n

= i

n

)

=

P(X

k

= i, X

k+1

= j, X

k+2

= i

k+2

, . . . , X

n

= i

n

)

P(X

k+1

= j, X

k+2

= i

k+2

, . . . , X

n

= i

n

)

=

π

i

P

ij

P

ji

k+2

P

i

n−1

in

π

j

P

ji

k+2

P

i

n−1

i

n

=

π

i

P

ij

π

j

,

which is an expression dependent only on i and j. For reversibility, this expression must be

the same as the forward transition probability P(X

k+1

= i[X

k

= j) = P

ji

. Conversely, if both

the original and the time-reversed chain have the same transition probabilities (and we already

know that the two start at the same invariant distribution and that both are Markov), then

their p. m. f.’s must agree. We have proved the following useful result.

Theorem 16.1. Reversibility condition.

A Markov chain with invariant measure π is reversible if and only if

π

i

P

ij

= π

j

P

ji

,

for all states i and j.

Another useful fact is that once reversibility is checked, invariance is automatic.

Proposition 16.2. Reversibility implies invariance. If a probability mass function π

i

satisﬁes

the condition in the previous theorem, then it is invariant.

Proof. We need to check that, for every j, π

j

=

i

π

i

P

ij

, and here is how we do it:

i

π

i

P

ij

=

i

π

j

P

ji

= π

j

i

P

ji

= π

j

.

16 MARKOV CHAINS: REVERSIBILITY 185

We now proceed to describe random walks on weighted graphs, the most easily recognizable

examples of reversible chains. Assume that every undirected edge between vertices i and j in a

complete graph has weight w

ij

= w

ji

; we think of edges with zero weight as not present at all.

When at i, the walker goes to j with probability proportional to w

ij

, so that

P

ij

=

w

ij

k

w

ik

.

What makes such random walks easy to analyze is the existence of a simple reversible measure.

Let

s =

i,k

w

ik

be the sum of all weights and let

π

i

=

k

w

ik

s

.

To see why this is a reversible distribution, compute

π

i

P

ij

=

k

w

ik

s

w

ij

k

w

ik

=

w

ij

s

,

which clearly remains the same if we switch i and j.

We should observe that this chain is irreducible exactly when the graph with present edges

(those with w

ij

> 0) is connected. The graph can only be periodic and the period can only be

2 (because the walker can always return in two steps) when it is bipartite: the set of vertices V

is divided into two sets V

1

and V

2

with every edge connecting a vertex from V

1

to a vertex from

V

2

. Finally, we note that there is no reason to forbid self-edges: some of the weights w

ii

may be

nonzero. (However, each w

ii

appears only once in s, while each w

ij

, with i ,= j, appears there

twice.)

By far the most common examples have no self-edges and all nonzero weights equal to 1 —

we already have a name for these cases: random walks on graphs. The number of neighbors of

a vertex is commonly called its degree. Then, the invariant distribution is

π

i

=

degree of i

2 (number of all edges)

.

Example 16.1. Consider the random walk on the graph below.

1

2

3 4

5

6

16 MARKOV CHAINS: REVERSIBILITY 186

What is the proportion of time the walk spends at vertex 2?

The reversible distribution is

π

1

=

3

18

, π

2

=

4

18

, π

3

=

2

18

, π

4

=

3

18

, π

5

=

3

18

, π

6

=

3

18

,

and, thus, the answer is

2

9

.

Assume now that the walker may stay at a vertex with probability P

ii

, but when she does

move she moves to a random neighbor as before. How can we choose P

ii

so that π becomes

uniform, π

i

=

1

6

for all i?

We should choose the weights of self-edges so that the sum of the weights of all edges

emanating from any vertex is the same. Thus, w

22

= 0, w

33

= 2, and w

ii

= 1, for all other i.

Example 16.2. Ehrenfest chain. We have M balls distributed in two urns. Each time, pick a

ball at random, move it from the urn where it currently resides to the other urn. Let X

n

be the

number of balls in urn 1. Prove that this chain has a reversible distribution.

The nonzero transition probabilities are

P

0,1

= 1,

P

M,M−1

= 1,

P

i,i−1

=

i

M

,

P

i,i+1

=

M −i

M

.

Some inspiration: the invariant measure puts each ball at random into one of the two urns, as

switching any ball between the two urns does not alter this assignment. Thus, π is Binomial(M,

1

2

),

π

i

=

_

M

i

_

1

2

M

.

Let us check that this is a reversible measure. The following equalities need to be veriﬁed:

π

0

P

01

= π

1

P

10

,

π

i

P

i,i+1

= π

i+1

P

i+1,i

,

π

i

P

i,i−1

= π

i−1

P

i−1,i

,

π

M

P

M,M−1

= π

M−1

P

M−1,M

,

and it is straightforward to do so. Note that this chain is irreducible, but not aperiodic (it has

period 2).

Example 16.3. Markov chain Monte Carlo. Assume that we have a very large probability

space, say some subset of S = ¦0, 1¦

V

, where V is a large set of n sites. Assume also that

16 MARKOV CHAINS: REVERSIBILITY 187

we have a probability measure on S given via the energy (sometimes called the Hamiltonian)

function E : S →R. The probability of any conﬁguration ω ∈ S is

π(ω) =

1

Z

−

1

T

E(ω)

.

Here, T > 0 is the temperature, a parameter, and Z is the normalizing constant that makes

ω∈S

π(ω) = 1. Such distributions frequently occur in statistical physics and are often called

Maxwell-Boltzmann distributions. They have numerous other applications, however, especially

in optimization problems, and have yielded an optimization technique called simulated annealing.

If T is very large, the role of the energy is diminished and the states are almost equally likely.

On the other hand, if T is very small, the large energy states have a much lower probability than

the small energy ones, thus the system is much more likely to be found in the close to minimal

energy states. If we want to ﬁnd states with small energy, we merely choose some small T and

generate at random, according to P, some states, and we have a reasonable answer. The only

problem is that, although E is typically a simple function, π is very diﬃcult to evaluate exactly,

as Z is some enormous sum. (There are a few celebrated cases, called exactly solvable systems,

in which exact computations are diﬃcult, but possible.)

Instead of generating a random state directly, we design a Markov chain, which has π as

its invariant distribution. It is common that the convergence to π is quite fast and that the

necessary number of steps of the chain to get close to π is some small power of n. This is in

startling contrast to the size of S, which is typically exponential in n. However, the convergence

slows down at a rate exponential in T

−1

when T is small.

We will illustrate this on the Knapsack problem. Assume that you are a burglar and have

just broken into a jewelry store. You see a large number n of items, with weights w

i

and values

v

i

. Your backpack (knapsack) has a weight limit b. You are faced with a question of how to ﬁll

your backpack, that is, you have to maximize the combined value of the items you will carry out

V = V (ω

1

, . . . , ω

n

) =

n

i=1

v

i

ω

i

,

subject to the constraints that ω

i

∈ ¦0, 1¦ and that the combined weight does not exceed the

backpack capacity,

n

i=1

w

i

ω

i

≤ b.

This problem is known to be NP-hard; there is no known algorithm to solve it quickly.

The set S of feasible solutions ω = (ω

1

, . . . , ω

n

) that satisfy the constraints above will be our

state space and the energy function E on S is given as E = −V , as we want to maximize V .

The temperature T measures how good a solution we are happy with — the idea of simulated

annealing is, in fact, a gradual lowering of the temperature to improve the solution. There is

give and take: higher temperature improves the speed of convergence and lower temperature

improves the quality of the result.

Finally, we are ready to specify the Markov chain (sometimes called a Metropolis algorithm,

in honor of N. Metropolis, a pioneer in computational physics). Assume that the chain is at

16 MARKOV CHAINS: REVERSIBILITY 188

state ω at time t, i.e., X

t

= ω. Pick a coordinate i, uniformly at random. Let ω

i

be the same as

ω except that its ith coordinate is ﬂipped: ω

i

i

= 1 −ω

i

. (This means that the status of the ith

item is changed from in to out or from out to in.) If ω

i

is not feasible, then X

t+1

= ω and the

state is unchanged. Otherwise, evaluate the diﬀerence in energy E(ω

i

) − E(ω) and proceed as

follows:

• if E(ω

i

) −E(ω) ≤ 0, then make the transition to ω

i

, X

t+1

= ω

i

;

• if E(ω

i

) − E(ω) > 0, then make the transition to ω

i

with probability e

1

T

(E(ω)−E(ω

i

))

, or

else stay at ω.

Note that, in the second case, the new state has higher energy, but, in physicist’s terms, we

tolerate the transition because of temperature, which corresponds to the energy input from the

environment.

We need to check that this chain is irreducible on S: to see this, note that we can get from

any feasible solution to an empty backpack by removing object one by one, and then back by

reversing the steps. Thus, the chain has a unique invariant measure, but is it the right one, that

is, π? In fact, the measure π on S is reversible. We need to show that, for any pair ω, ω

∈ S,

π(ω)P(ω, ω

) = π(ω

)P(ω

, ω),

and this is enough to do with ω

= ω

i

, for arbitrary i, and assume that both are feasible (as

only such transitions are possible). Note ﬁrst that the normalizing constant Z cancels out (the

key feature of this method) and so does the probability

1

n

that i is chosen. If E(ω

i

) −E(ω) ≤ 0,

then the equality reduces to

e

−

1

T

E(ω)

= e

−

1

T

E(ω

i

)

e

1

T

(E(ω

i

)−E(ω))

,

and similarly in the other case.

Problems

1. Determine the invariant distribution for the random walk in Examples 12.4 and 12.10.

2. A total of m white and m black balls are distributed into two urns, with m balls per urn. At

each step, a ball is randomly selected from each urn and the two balls are interchanged. The

state of this Markov chain can, thus, be described by the number of black balls in urn 1. Guess

the invariant measure for this chain and prove that it is reversible.

3. Each day, your opinion on a particular political issue is either positive, neutral, or negative.

If it is positive today, then it is neutral or negative tomorrow with equal probability. If it is

16 MARKOV CHAINS: REVERSIBILITY 189

neutral or negative, it stays the same with probability 0.5, and, otherwise, it is equally likely to

be either of the other two possibilities. Is this a reversible Markov chain?

4. A king moves on a standard 8 8 chessboard. Each time, it makes one of the available legal

moves (to a horizontally, vertically or diagonally adjacent square) at random. (a) Assuming

that the king starts at one of the four corner squares of the chessboard, compute the expected

number of steps before it returns to the starting position. (b) Now you have two kings, they both

start at the same corner square and move independently. What is, now, the expected number

of steps before they simultaneously occupy the starting position?

Solutions to problems

1. Answer: π =

_

1

5

,

3

10

,

1

5

,

3

10

¸

.

2. If you choose m balls to put into urn 1 at random, you get

π

i

=

_

m

i

__

m

m−i

_

_

2m

m

_ ,

and the transition probabilities are

P

i,i−1

=

i

2

m

2

, P

i,i+1

=

(m−i)

2

m

2

, P

i,i

=

2i(m−i)

m

2

.

Reversibility check is routine.

3. If the three states are labeled in the order given, 1, 2, and 3, then we have

P =

_

_

0

1

2

1

2

1

4

1

2

1

4

1

4

1

4

1

2

_

_

.

The only way to check reversibility is to compute the invariant distribution π

1

, π

2

, π

3

, form the

diagonal matrix D with π

1

, π

2

, π

3

on the diagonal and to check that DP is symmetric. We get

π

1

=

1

5

, π

2

=

2

5

, π

3

=

2

5

, and DP is, indeed, symmetric, so the chain is reversible.

4. This is a random walk on a graph with 64 vertices (squares) and degrees 3 (4 corner squares),

5 (24 side squares), and 8 (36 remaining squares). If i is a corner square, π

i

=

3

3·4+5·24+8·36

,

so the answer to (a) is

420

3

. In (b), you have two independent chains, so π

(i,j)

= π

i

π

j

and the

answer is

_

420

3

_

2

.

17 THREE APPLICATIONS 190

17 Three Applications

Parrondo’s Paradox

This famous paradox was constructed by the Spanish physicist J. Parrondo. We will consider

three games A, B and C with ﬁve parameters: probabilities p, p

1

, p

2

, and γ, and an integer

period M ≥ 2. These parameters are, for now, general so that the description of the games is

more transparent. We will choose particular values once we are ﬁnished with the analysis.

We will call a game losing if, after playing it for a long time, a player’s capital becomes more

and more negative, i.e., the player loses more and more money.

Game A is very simple; in fact it is an asymmetric one-dimensional simple random walk.

Win $1, i.e., add +1 to your capital, with probability p, and lose a dollar, i.e., add −1 to your

capital, with probability 1 −p. This is clearly a losing game if p <

1

2

.

In game B, the winning probabilities depend on whether your current capital is divisible by

M. If it is, you add +1 with probability p

1

, and −1 with probability 1 − p

1

, and, if it is not,

you add +1 with probability p

2

and −1 with probability 1 −p

2

. We will determine below when

this is a losing game.

Now consider game C, in which you, at every step, play A with probability γ and B with

probability 1 −γ. Is it possible that A and B are losing games, while C is winning?

The surprising answer is yes! However, this should not be so surprising as in game B your

winning probabilities depend on the capital you have and you can manipulate the proportion

of time your capital spends at “unfavorable” amounts by playing the combination of the two

games.

We now provide a detailed analysis. As mentioned, game A is easy. To analyze game B, take

a simple random walk which makes a +1 step with probability p

2

and −1 step with probability

1 − p

2

. Assume that you start this walk at some x, 0 < x < M. Then, by the Gambler’s ruin

computation (Example 11.6),

(17.1) P(the walk hits M before 0) =

1 −

_

1−p

2

p

2

_

x

1 −

_

1−p

2

p

2

_

M

.

Starting from a multiple of M, the probability that you increase your capital by M before either

decreasing it by M or returning to the starting point is

(17.2) p

1

1 −

_

1−p

2

p

2

_

1 −

_

1−p

2

p

2

_

M

.

(You have to make a step to the right and then use the formula (17.1) with x = 1.) Similarly, from

a multiple of M, the probability that you decrease your capital by M before either increasing it

17 THREE APPLICATIONS 191

by M or returning to the starting point is

(17.3) (1 −p

1

)

_

1−p

2

p

2

_

M−1

−

_

1−p

2

p

2

_

M

1 −

_

1−p

2

p

2

_

M

.

(Now you have to move one step to the left and then use 1−(probability in (17.1) with x =

M −1).)

The main trick is to observe that game B is losing if (17.2)<(17.3). Why? Observe your

capital at multiples of M: if, starting from kM, the probability that the next (diﬀerent) multiple

of M you visit is (k −1)M exceeds the probability that it is (k + 1)M, then the game is losing

and that is exactly when (17.2)<(17.3). After some algebra, this condition reduces to

(17.4)

(1 −p

1

)(1 −p

2

)

M−1

p

1

p

M−1

2

> 1.

Now, game C is the same as game B with p

1

and p

2

replaced by q

1

= γp + (1 − γ)p

1

and

q

2

= γp + (1 −γ)p

2

, yielding a winning game if

(17.5)

(1 −q

1

)(1 −q

2

)

M−1

q

1

q

M−1

2

< 1.

This is easily achieved with large enough M as soon as p

2

<

1

2

and q

2

>

1

2

, but even for M = 3,

one can choose p =

5

11

, p

1

=

1

11

2

, p

2

=

10

11

, γ =

1

2

, to get

6

5

in (17.4) and

217

300

in (17.5).

A Discrete Renewal Theorem

Theorem 17.1. Assume that f

1

, . . . , f

N

≥ 0 are given numbers with

N

k=1

f

k

= 1. Let µ =

N

k=1

k f

k

. Deﬁne a sequence u

n

as follows:

u

n

= 0 if n < 0,

u

0

= 1,

u

n

=

N

k=1

f

k

u

n−k

if n > 0.

Assume that the greatest common divisor of the set ¦k : f

k

> 0¦ is 1. Then,

lim

n→∞

u

n

=

1

µ

.

Example 17.1. Roll a fair die forever and let S

m

be the sum of outcomes of the ﬁrst m rolls.

Let p

n

= P(S

m

ever equals n). Estimate p

10,000

.

17 THREE APPLICATIONS 192

One can write a linear recursion

p

0

= 1,

p

n

=

1

6

(p

n−1

+ +p

n−6

) ,

and then solve it, but this is a lot of work! (Note that one should either modify the recursion

for n ≤ 5 or, more easily, deﬁne p

n

= 0 for n < 0.) By the above theorem, however, we can

immediately conclude that p

n

converges to

2

7

.

Example 17.2. Assume that a random walk starts from 0 and jumps from x either to x + 1

or to x + 2, with probability p and 1 −p, respectively. What is, approximately, the probability

that the walk ever hits 10, 000? The recursion is now much simpler:

p

0

= 1,

p

n

= p p

n−1

+ (1 −p) p

n−2

,

and we can solve it, but again we can avoid the work by applying the theorem to get that p

n

converges to

1

2−p

.

Proof. We can assume, without loss of generality, that f

N

> 0 (or else reduce N).

Deﬁne a Markov chain with state space S = ¦0, 1, . . . , N −1¦ by

_

¸

¸

¸

¸

¸

¸

¸

¸

¸

_

f

1

1 −f

1

0 0 . . .

f

2

1−f

1

0

1−f

1

−f

2

1−f

1

0 . . .

f

3

1−f

1

−f

2

0 0

1−f

1

−f

2

−f

3

1−f

1

−f

2

. . .

. . .

f

N

1−f

1

−···−f

N−1

0 0 0 . . .

_

¸

¸

¸

¸

¸

¸

¸

¸

¸

_

.

This is called a renewal chain: it moves to the right (from x to x + 1) on the nonnegative

integers, except for renewals, i.e., jumps to 0. At N −1, the jump to 0 is certain (note that the

matrix entry P

N−1,0

is 1, since the sum of f

k

’s is 1).

The chain is irreducible (you can get to N − 1 from anywhere, from N − 1 to 0, and from

0 anywhere) and we will see shortly that is also aperiodic. If X

0

= 0 and R

0

is the ﬁrst return

time to 0, then

P(R

0

= k)

clearly equals f

1

, if k = 1. Then, for k = 2 it equals

(1 −f

1

)

f

2

1 −f

1

= f

2

,

and, then, for k = 3 it equals

(1 −f

1

)

1 −f

1

−f

2

1 −f

1

f

3

1 −f

1

−f

2

= f

3

,

17 THREE APPLICATIONS 193

and so on. We conclude that (recall again that X

0

= 0)

P(R

0

= k) = f

k

for all k ≥ 1.

In particular, the promised aperiodicity follows, as the chain can return to 0 in k steps if f

k

> 0.

Moreover, the expected return time to 0 is

m

00

=

N

k=1

kf

k

= µ.

The next observation is that the probability P

n

00

that the chain is at 0 in n steps is given by the

recursion

(17.6) P

n

00

=

n

k=1

P(R

0

= k)P

n−k

00

.

To see this, observe that you must return to 0 at some time not exceeding n in order to end up

at 0; either you return for the ﬁrst time at time n or you return at some previous time k and,

then, you have to be back at 0 in n −k steps.

The above formula (17.6) is true for every Markov chain. In this case, however, we note that

the ﬁrst return time to 0 is, certainly, at most N, so we can always sum to N with the proviso

that P

n−k

00

= 0 when k > n. So, from (17.6) we get

P

n

00

=

N

k=1

f

k

P

n−k

00

.

The recursion for P

n

00

is the same as the recursion for u

n

. The initial conditions are also the

same and we conclude that u

n

= P

n

00

. It follows from the convergence theorem (Theorem 15.3)

that

lim

n→∞

u

n

= lim

n→∞

P

n

00

=

1

m

00

=

1

µ

,

which ends the proof.

Patterns in coin tosses

Assume that you repeatedly toss a coin, with Heads represented by 1 and Tails represented by

0. On any toss, 1 occurs with probability p. Assume also that you have a pattern of outcomes,

say 1011101. What is the expected number of tosses needed to obtain this pattern? It should

be about 2

7

= 128 when p =

1

2

, but what is it exactly? One can compare two patterns by this

waiting game, saying that the one with the smaller expected value wins.

Another way to compare two patterns is the horse race: you and your adversary each choose

a pattern, say 1001 and 0100, and the person whose pattern appears ﬁrst wins.

Here are the natural questions. How do we compute the expectations in the waiting game

and the probabilities in the horse race? Is the pattern that wins in the waiting game more

17 THREE APPLICATIONS 194

likely to win in the horse race? There are several ways of solving these problems (a particularly

elegant one uses the so called Optional Stopping Theorem for martingales), but we will use

Markov chains.

The Markov chain X

n

we will utilize has as the state space all patterns of length . Each

time, the chain transitions into the pattern obtained by appending 1 (with probability p) or 0

(with probability 1 − p) at the right end of the current pattern and by deleting the symbol at

the left end of the current pattern. That is, the chain simply keeps track of the last symbols

in a sequence of tosses.

There is a slight problem before we have tosses. For now, assume that the chain starts

with some particular sequence of tosses, chosen in some way.

We can immediately ﬁgure out the invariant distribution for this chain. At any time n ≥ 2

and for any pattern A with k 1’s and −k 0’s,

P(X

n

= A) = p

k

(1 −p)

−k

,

as the chain is generated by independent coin tosses! Therefore, the invariant distribution of

X

n

assigns to A the probability

π

A

= p

k

(1 −p)

−k

.

Now, if we have two patterns B and A, denote by N

B→A

the expected number of additional

tosses we need to get A provided that the ﬁrst tosses ended in B. Here, if A is a subpattern of

B, this does not count, we have to actually make A in the additional tosses, although we can

use a part of B. For example, if B = 111001 and A = 110, and the next tosses are 10, then

N

B→A

= 2, and, if the next tosses are 001110, then N

B→A

= 6.

Also denote

E(B →A) = E(N

B→A

).

Our initial example can, therefore, be formulated as follows: compute

E(∅ →1011101).

The convergence theorem for Markov chains guarantees that, for every A,

E(A →A) =

1

π

A

.

The hard part of our problem is over. We now show how to analyze the waiting game by the

example.

We know that

E(1011101 →1011101) =

1

π

1011101

.

However, starting with 1011101, we can only use the overlap 101 to help us get back to 1011101,

so that

E(1011101 →1011101) = E(101 →1011101).

17 THREE APPLICATIONS 195

To get from ∅ to 1011101, we have to get ﬁrst to 101 and then from there to 1011101, so that

E(∅ →1011101) = E(∅ →101) +E(101 →1011101).

We have reduced the problem to 101 and we iterate our method:

E(∅ →101) = E(∅ →1) +E(1 →101)

= E(∅ →1) +E(101 →101)

= E(1 →1) +E(101 →101)

=

1

π

1

+

1

π

101

.

The ﬁnal result is

E(∅ →1011101) =

1

π

1011101

+

1

π

101

+

1

π

1

=

1

p

5

(1 −p)

2

+

1

p

2

(1 −p)

+

1

p

,

which is equal to 2

7

+ 2

3

+ 2 = 138 when p =

1

2

.

In general, the expected time E(∅ →A) can be computed by adding to 1/π

A

all the overlaps

between A and its shifts, that is, all the patterns by which A begins and ends. In the example,

the overlaps are 101 and 1. The more overlaps A has, the larger E(∅ →A) is. Accordingly, for

p =

1

2

, of all patterns of length , the largest expectation is 2

+ 2

−1

+ + 2 = 2

+1

− 2 (for

constant patterns 11 . . . 1 and 00 . . . 0) and the smallest is 2

**when there is no overlap at all (for
**

example, for 100 . . . 0).

Now that we know how to compute the expectations in the waiting game, we will look at

the horse race. Fix two patterns A and B and let p

A

= P(A wins) and p

B

= P(B wins). The

trick is to consider the time N, the ﬁrst time one of the two appears. Then, we can write

N

∅→A

= N +I

{B appears before A}

N

B→A

,

where N

B→A

is the additional number of tosses we need to get to A after we reach B for the ﬁrst

time. In words, to get to A we either stop at N or go further starting from B, but the second

case occurs only when B occurs before A. It is clear that N

B→A

has the same distribution as

N

B→A

and is independent of the event that B appears before A. (At the time B appears for the

ﬁrst time, what matters for N

B→A

is that we are at B and not whether we have seen A earlier.)

Taking expectations,

E(∅ →A) = E(N) +p

B

E(B →A),

E(∅ →B) = E(N) +p

A

E(A →B),

p

A

+p

B

= 1.

We already know how to compute E(∅ →A), E(∅ →B), E(B →A), and E(A →B), so this is

a system of three equations with three unknowns: p

A

, p

B

and N.

Example 17.3. Let us return to the patterns A = 1001 and B = 0100, and p =

1

2

, and compute

the winning probabilities in the horse race.

17 THREE APPLICATIONS 196

We compute E(∅ → A) = 16 + 2 = 18 and E(∅ → B) = 16 + 2 = 18. Next, we compute

E(B → A) = E(0100 → 1001). First, we note that E(0100 → 1001) = E(100 → 1001) and,

then, E(∅ → 1001) = E(∅ → 100) + E(100 → 1001), so that E(0100 → 1001) = E(∅ →

1001) − E(∅ → 100) = 18 − 8 = 10. Similarly, E(A → B) = 18 − 4 = 14, and, then, the above

three equations with three unknowns give p

A

=

5

12

, p

B

=

7

12

, E(N) =

73

6

.

We conclude with two examples, each somewhat paradoxical and thus illuminating.

Example 17.4. Consider sequences A = 1010 and B = 0100. It is straightforward to verify

that E(∅ →A) = 20, E(∅ →B) = 18, while p

A

=

9

14

. So, A loses in the waiting game, but wins

in the horse race! What is going on? Simply, when A loses in the horse race, it loses by a lot,

thereby tipping the waiting game towards B.

Example 17.5. This example concerns the horse race only. Consider the relation ≥ given by

A ≥ B if P(A beats B) ≥ 0.5. Naively, one would expect that this relation is transitive, but

this is not true! The simplest example are triples 011 ≥ 100 ≥ 001 ≥ 011, with probabilities

1

2

,

3

4

and

2

3

.

Problems

1. Start at 0 and perform the following random walk on the integers. At each step, ﬂip 3 fair

coins and make a jump forward equal to the number of Heads (you stay where you are if you

ﬂip no Heads). Let p

n

be the probability that you ever hit n. Compute lim

n→∞

p

n

. (It is not

2

3

!)

2. Suppose that you have three patterns A = 0110, B = 1010, C = 0010. Compute the

probability that A appears ﬁrst among the three in a sequence of fair coin tosses.

Solutions to problems

1. The size S of the step has the p. m. f. given by P(X = 0) =

1

8

, P(X = 1) =

3

8

, P(X = 2) =

3

8

,

P(X = 3) =

1

8

. Thus,

p

n

=

1

8

p

n

+

3

8

p

n−1

+

3

8

p

n−2

+

1

8

p

n−3

,

and so

p

n

=

8

7

_

3

8

p

n−1

+

3

8

p

n−2

+

1

8

p

n−3

_

.

17 THREE APPLICATIONS 197

It follows that p

n

converges to the reciprocal of

8

7

_

3

8

1 +

3

8

2 +

1

8

3

_

,

that is, to E(S[S > 0)

−1

. The answer is

lim

n→∞

p

n

=

7

12

.

2. If N is the ﬁrst time one of the three appears, we have

E(∅ →A) = EN +p

B

E(B →A) +p

C

E(C →A)

E(∅ →B) = EN +p

A

E(A →B) +p

C

E(C →B)

E(∅ →C) = EN +p

A

E(A →C) +p

B

E(B →C)

p

A

+p

B

+p

C

= 1

and

E(∅ →A) = 18

E(∅ →B) = 20

E(∅ →C) = 18

E(B →A) = 16

E(C →A) = 16

E(A →B) = 16

E(C →B) = 16

E(A →C) = 16

E(B →C) = 16

The solution is EN = 8, p

A

=

3

8

, p

B

=

1

4

, and p

C

=

3

8

. The answer is

3

8

.

18 POISSON PROCESS 198

18 Poisson Process

A counting process is a random process N(t), t ≥ 0, such that

1. N(t) is a nonnegative integer for each t;

2. N(t) is nondecreasing in t; and

3. N(t) is right-continuous.

The third condition is merely a convention: if the ﬁrst two events happen at t = 2 and t = 3,

then N(2) = 1, N(3) = 2, N(t) = 1 for t ∈ (2, 3), and N(t) = 0 for t < 2. Thus, N(t) − N(s)

represents the number of events in (s, t].

A Poisson process with rate (or intensity) λ > 0 is a counting process N(t) such that

1. N(0) = 0;

2. it has independent increments: if (s

1

, t

1

]

(s

2

, t

2

] = ∅, then N(t

1

) − N(s

1

) and N(t

2

) −

N(s

2

) are independent; and

3. the number of events in any interval of length t is Poisson(λt).

In particular,

P(N(t +s) −N(s) = k) = e

−λt

(λt)

k

k!

, k = 0, 1, 2, . . . ,

E(N(t +s) −N(s)) = λt.

Moreover, as h →0,

P(N(h) = 1) = e

−λh

λh ∼ λh,

P(N(h) ≥ 2) = O(h

2

) ¸λh.

Thus, in small time intervals, a single event happens with probability proportional to the length

of the interval; this is why λ is called the rate.

A deﬁnition as the above should be followed by the question whether the object in question

exists — we may be wishing for contradictory properties. To demonstrate the existence, we

will outline two constructions of the Poisson process. Yes, it is unique, but it would require

some probabilistic sophistication to prove this, as would the proof (or even the formulation) of

convergence in the ﬁrst construction we are about to give. Nevertheless, it is very useful, as it

makes many properties of the Poisson process almost instantly understandable.

Construction by tossing a low-probability coin very fast. Pick a large n and assume

that you have a coin with (low) Heads probability

λ

n

. Toss the coin at times which are positive

integer multiples of

1

n

(that is, very fast) and let N

n

(t) be the number of Heads in [0, t]. Clearly,

as n → ∞, the number of Heads in any interval (s, t] is Binomial with the number of trials

n(t −s) ±2 and success probability

λ

n

; thus, it converges to Poisson(t −s), as n →∞. Moreover,

N

n

has independent increments for any n and hence the same holds in the limit. We should

18 POISSON PROCESS 199

note that the Heads probability does not need to be exactly

λ

n

, instead, it suﬃces that this

probability converges to λ when multiplied by n. Similarly, we do not need all integer multiples

of

1

n

; it is enough that their number in [0, t], divided by n, converges to t in probability for any

ﬁxed t.

An example of a property that follows immediately is the following. Let S

k

be the time of

the kth (say, 3rd) event (which is a random time) and let N

k

(t) be the number of additional

events within time t after time S

k

. Then, N

k

(t) is another Poisson process, with the same rate

λ, as starting to count the Heads afresh after the kth Heads gives us the same process as if we

counted them from the beginning — we can restart a Poisson process at the time of the kth

event. In fact, we can do so at any stopping time, a random time T with the property that

T = t depends only on the behavior of the Poisson process up to time t (i.e., depends on the

past, but not on the future). The Poisson process, restarted at a stopping time, has the same

properties as the original process started at time 0; this is called the strong Markov property.

As each N

k

is a Poisson process, N

k

(0) = 0, so two events in the original Poisson N(t)

process do not happen at the same time.

Let T

1

, T

2

, . . . be the interarrival times, where T

n

is the time elapsed between (n −1)st and

nth event. A typical example would be the times between consecutive buses arriving at a station.

Proposition 18.1. Distribution of interarrival times:

T

1

, T

2

, . . . are independent and Exponential(λ).

Proof. We have

P(T

1

> t) = P(N(t) = 0) = e

−λt

,

which proves that T

1

is Exponential(λ). Moreover, for any s > 0 and any t > 0,

P(T

2

> t[T

1

= s) = P(no events in (s, s +t][T

1

= s) = P(N(t) = 0) = e

−λt

,

as events in (s, s + t] are not inﬂuenced by what happens in [0, s]. So, T

2

is independent of T

1

and Exponential(λ). Similarly, we can establish that T

3

is independent of T

1

and T

2

with the

same distribution, and so on.

Construction by exponential interarrival times. We can use the above Proposition 18.1

for another construction of a Poisson process, which is convenient for simulations. Let T

1

, T

2

, . . .

be i. i. d. Exponential(λ) random variables and let S

n

= T

1

+ . . . + T

n

be the waiting time for

the nth event. We deﬁne N(t) to be the largest n so that S

n

≤ t.

We know that ES

n

=

n

λ

, but we can derive its density; the distribution is called Gamma(n, λ).

We start with

P(S

n

> t) = P(N(t) < n) =

n−1

j=0

e

−λt

(λt)

j

j!

,

18 POISSON PROCESS 200

and then we diﬀerentiate to get

−f

S

n

(t) =

n−1

j=0

1

j!

(−λe

−λt

(λt)

j

+e

−λt

j(λt)

j−1

λ)

= λe

−λt

k−1

j=0

_

−

(λt)

j

j!

+

(λt)

j−1

(j −1)!

_

= −λe

−λt

(λt)

n−1

(n −1)!

,

and so

f

S

n

(t) = λe

−λt

(λt)

n−1

(n −1)!

.

Example 18.1. Consider a Poisson process with rate λ. Compute (a) E(time of the 10th

event), (b) P(the 10th event occurs 2 or more time units after the 9th event), (c) P(the 10th

event occurs later than time 20), and (d) P(2 events in [1, 4] and 3 events in [3, 5]).

The answer to (a) is

10

λ

by Proposition 18.1. The answer to (b) is e

−2λ

, as one can restart

the Poisson process at any event. The answer to (c) is P(S

10

> 20) = P(N(20) < 10), so we

can either write the integral

P(S

10

> 20) =

_

∞

20

λe

−λt

(λt)

9

9!

dt,

or use

P(N(20) < 10) =

9

j=0

e

−20λ

(20λ)

j

9!

.

To answer (d), we condition on the number of events in [3, 4]:

2

k=0

P(2 events in [1, 4] and 3 events in [3, 5] [ k events in [3, 4]) P(k events in [3, 4])

=

2

k=0

P(2 −k events in [1, 3] and 3 −k events in [4, 5]) P(k events in [3, 4])

=

2

k=0

e

−2λ

(2λ)

2−k

(2 −k)!

e

−λ

λ

3−k

(3 −k)!

e

−λ

λ

k

k!

=e

−4λ

_

1

3

λ

5

+λ

4

+

1

2

λ

3

_

.

Theorem 18.2. Superposition of independent Poisson processes.

Assume that N

1

(t) and N

2

(t) are independent Poisson processes with rates λ

1

and λ

2

. Combine

them into a single process by taking the union of both sets of events or, equivalently, N(t) =

N

1

(t) +N

2

(t). This is a Poisson process with rate λ

1

+λ

2

.

18 POISSON PROCESS 201

Proof. This is a consequence of the same property for Poisson random variables.

Theorem 18.3. Thinning of a Poisson process.

Each event in a Poisson process N(t) with rate λ is independently a Type I event with probability

p; the remaining events are Type II. Let N

1

(t) and N

2

(t) be the numbers of Type I and Type II

events in [0, t]. These are independent Poisson processes with rates λp and λ(1 −p).

The most substantial part of this theorem is independence, as the other claims follow from

the thinning properties of Poisson random variables (Example 11.4).

Proof. We argue by discrete approximation. At each integer multiple of

1

n

, we toss two indepen-

dent coins: coin A has Heads probability

pλ

n

; and coin B has Heads probability

(1−p)λ

n

/

_

1 −

pλ

n

_

.

Then call discrete Type I events the locations with coin A Heads; discrete Type II(a) events the

locations with coin A Tails and coin B Heads; and discrete Type II(b) events the locations with

coin B Heads. A location is a discrete event if it is either a Type I or a Type II(a) event.

One can easily compute that a location

k

n

is a discrete event with probability

λ

n

. Moreover,

given that a location is a discrete event, it is Type I with probability p. Therefore, the process

of discrete events and its division into Type I and Type II(a) events determines the discrete

versions of the processes in the statement of the theorem. Now, discrete Type I and Type II(a)

events are not independent (for example, both cannot occur at the same location), but discrete

Type I and Type II(b) events are independent (as they depend on diﬀerent coins). The proof

will be concluded by showing that discrete Type II(a) and Type II(b) events have the same limit

as n → ∞. The Type I and Type II events will then be independent as limits of independent

discrete processes.

To prove the claimed asymptotic equality, observe ﬁrst that the discrete schemes (a) and (b)

result in a diﬀerent outcome at a location

k

n

exactly when two events occur there: a discrete Type

I event and a discrete Type II(b) event. The probability that the two discrete Type II schemes

diﬀer at

k

n

is thus at most

C

n

2

, for some constant C. This causes the expected number of such

“double points” in [0, t] to be at most

Ct

n

. Therefore, by the Markov inequality, an upper bound

for the probability that there is at least one double point in [0, t] is also

Ct

n

. This probability

goes to zero, as n → ∞, for any ﬁxed t and, consequently, discrete (a) and (b) schemes indeed

result in the same limit.

Example 18.2. Customers arrive at a store at a rate of 10 per hour. Each is either male or

female with probability

1

2

. Assume that you know that exactly 10 women entered within some

hour (say, 10 to 11am). (a) Compute the probability that exactly 10 men also entered. (b)

Compute the probability that at least 20 customers have entered.

Male and female arrivals are independent Poisson processes, with parameter

1

2

10 = 5, so

the answer to (a) is

e

−5

5

10

10!

.

18 POISSON PROCESS 202

The answer to (b) is

∞

k=10

P(k men entered) =

∞

k=10

e

−5

5

k

k!

= 1 −

9

k=0

e

−5

5

k

k!

.

Example 18.3. Assume that cars arrive at a rate of 10 per hour. Assume that each car will

pick up a hitchhiker with probability

1

10

. You are second in line. What is the probability that

you will have to wait for more than 2 hours?

Cars that pick up hitchhikers are a Poisson process with rate 10

1

10

= 1. For this process,

P(T

1

+T

2

> 2) = P(N(2) ≤ 1) = e

−2

(1 + 2) = 3e

−2

.

Proposition 18.4. Order of events in independent Poisson processes.

Assume that we have two independent Poisson processes, N

1

(t) with rate λ

1

and N

2

(t) with rate

λ

2

. The probability that n events occur in the ﬁrst process before m events occur in the second

process is

n+m−1

k=n

_

n +m−1

k

__

λ

1

λ

1

+λ

2

_

k

_

λ

2

λ

1

+λ

2

_

n+m−1−k

.

We can easily extend this idea to more than two independent Poisson processes; we will not

make a formal statement, but instead illustrate by the few examples below.

Proof. Start with a Poisson process with λ

1

+ λ

2

, then independently decide for each event

whether it belongs to the ﬁrst process, with probability

λ

1

λ

1

+λ

2

, or to the second process, with

probability

λ

2

λ

1

+λ

2

. The obtained processes are independent and have the correct rates. The

probability we are interested in is the probability that among the ﬁrst m + n −1 events in the

combined process, n or more events belong to the ﬁrst process, which is the binomial probability

in the statement.

Example 18.4. Assume that λ

1

= 5, λ

2

= 1. Then,

P(5 events in the ﬁrst process before 1 in the second) =

_

5

6

_

5

and

P(5 events in the ﬁrst process before 2 in the second) =

6

k=5

_

6

k

__

5

6

_

k

_

1

6

_

6−k

=

11 5

5

6

6

.

Example 18.5. You have three friends, A, B, and C. Each will call you after an Exponential

amount of time with expectation 30 minutes, 1 hour, and 2.5 hours, respectively. You will go

out with the ﬁrst friend that calls. What is the probability that you go out with A?

18 POISSON PROCESS 203

We could evaluate the triple integral, but we will avoid that. Interpret each call as the ﬁrst

event in the appropriate one of three Poisson processes with rates 2, 1, and

2

5

, assuming the

time unit to be one hour. (Recall that the rates are inverses of the expectations.)

We will solve the general problem with rates λ

1

, λ

2

, and λ

3

. Start with rate λ

1

+ λ

2

+

λ

3

Poisson process, distribute the events with probability

λ

1

λ

1

+λ

2

+λ

3

,

λ

2

λ

1

+λ

2

+λ

3

, and

λ

3

λ

1

+λ

2

+λ

3

,

respectively. The probability of A calling ﬁrst is clearly

λ

1

λ

1

+λ

2

+λ

3

, which in our case works out

to be

2

2 + 1 +

2

5

=

10

17

.

Our next theorem illustrates what we can say about previous event times if we either know

that their number by time t is k or we know that the kth one happens exactly at time t.

Theorem 18.5. Uniformity of previous event times.

1. Given that N(t) = k, the conditional distribution of the interarrival times, S

1

, . . . , S

k

, is

distributed as order statistics of k independent uniform variables: the set ¦S

1

, . . . , S

k

¦ is

equal, in distribution, to ¦U

1

, . . . , U

k

¦, where U

i

are independent and uniform on [0, t].

2. Given that S

k

= t, S

1

, . . . , S

k−1

are distributed as order statistics of k − 1 independent

uniform random variables on [0, t].

Proof. Again, we discretize and the discrete counterpart is as follows. Assume that we toss a

coin, with arbitrary ﬁxed Heads probability, N times in succession and that we know that the

number of Heads in these N tosses is k. Then, these Heads occur in any of the

_

N

k

_

subsets (of

k tosses out of a total of N) with equal probability, simply by symmetry. This is exactly the

statement of the theorem, in the appropriate limit.

Example 18.6. Passengers arrive at a bus station as a Poisson process with rate λ.

(a) The only bus departs after a deterministic time T. Let W be the combined waiting time for

all passengers. Compute EW.

If S

1

, S

2

, . . . are the arrival times in [0, T], then the combined waiting time is W = T −S

1

+

T − S

3

+ . . .. Recall that we denote by N(t) the number of arrivals in [0, t]. We obtain the

answer by conditioning on the value of N(T): if we know that N(T) = k, then W is the sum

of k i. i. d. uniform random variables on [0, T]. Hence, W =

N(T)

k=1

U

k

, where U

1

, U

2

, . . . are

i. i. d. uniform on [0, T] and independent of N(T). By Theorem 11.1,

EW =

λT

2

2

and Var(W) =

λT

3

3

.

(b) Now two buses depart, one at T and one at S < T. What is EW now?

18 POISSON PROCESS 204

We have two independent Poisson processes in time intervals [0, S] and [S, T], so the answer

is

λ

S

2

2

+λ

(T −S)

2

2

.

(c) Now assume T, the only bus departure time, is Exponential(µ), independent of the passen-

gers’ arrivals.

This time,

EW =

_

∞

0

E(W[T = t)f

T

(t) dt =

_

∞

0

λ

t

2

2

f

T

(t) dt =

λ

2

E(T

2

)

=

λ

2

(Var(T) + (ET)

2

) =

λ

2

2

µ

2

=

λ

µ

2

.

(d) Finally, two buses depart as the ﬁrst two events in a rate µ Poisson process.

This makes

EW = 2

λ

µ

2

.

Example 18.7. You have two machines. Machine 1 has lifetime T

1

, which is Exponential(λ

1

),

and Machine 2 has lifetime T

2

, which is Exponential(λ

2

). Machine 1 starts at time 0 and Machine

2 starts at a time T.

(a) Assume that T is deterministic. Compute the probability that M

1

is the ﬁrst to fail.

We could compute this via a double integral (which is a good exercise!), but instead we

proceed thus:

P(T

1

< T

2

+T) = P(T

1

< T) +P(T

1

≥ T, T

1

< T

2

+T)

= P(T

1

< T) +P(T

1

< T

2

+T[T

1

≥ T)P(T

1

≥ T)

= 1 −e

−λ

1

T

+P(T

1

−T < T

2

[T

1

≥ T)e

−λ

1

T

= 1 −e

−λ

1

T

+P(T

1

< T

2

)e

−λ

1

T

= 1 −e

−λ

1

T

+

λ

1

λ

1

+λ

2

e

−λ

1

T

.

The key observation above is that P(T

1

− T < T

2

[T

1

≥ T) = P(T

1

< T

2

). Why does this

hold? We can simply quote the memoryless property of the Exponential distribution, but it

is instructive to make a short argument using Poisson processes. Embed the failure times into

appropriate Poisson processes. Then, T

1

≥ T means that no events in the ﬁrst process occur

during time [0, T]. Under this condition, T

1

−T is the time of the ﬁrst event of the same process

restarted at T, but this restarted process is not inﬂuenced by what happened before T, so the

condition (which in addition does not inﬂuence T

2

) drops out.

18 POISSON PROCESS 205

(b) Answer the same question when T is Exponential(µ) (and, of course, independent of the

machines). Now, by the same logic,

P(T

1

< T

2

+T) = P(T

1

< T) +P(T

1

≥ T, T

1

< T

2

+T)

=

λ

1

λ

1

+µ

+

λ

1

λ

1

+λ

2

µ

λ

1

+λ

2

.

Example 18.8. Impatient hitchhikers. Two people, Alice and Bob, are hitchhiking. Cars that

would pick up a hitchhiker arrive as a Poisson process with rate λ

C

. Alice is ﬁrst in line for a

ride. Moreover, after Exponential(λ

A

) time, Alice quits, and after Exponential(λ

B

) time, Bob

quits. Compute the probability that Alice is picked up before she quits and compute the same

for Bob.

Embed each quitting time into an appropriate Poisson process, call these A and B processes,

and call the car arrivals C process. Clearly, Alice gets picked if the ﬁrst event in the combined

A and C process is a C event:

P(Alice gets picked) =

λ

C

λ

A

+λ

C

.

Moreover,

P(Bob gets picked)

= P(¦at least 2 C events before a B event¦

∪ ¦at least one A event before either a B or a C event,

and then at least one C event before a B event¦)

= P(at least 2 C events before a B event)

+P(at least one A event before either a B or a C event,

and then at least one C event before a B event)

−P(at least one A event before either a B or a C event,

and then at least two C events before a B event)

=

_

λ

C

λ

B

+λ

C

_

2

+

_

λ

A

λ

A

+λ

B

+λ

C

__

λ

C

λ

B

+λ

C

_

−

_

λ

A

λ

A

+λ

B

+λ

C

__

λ

C

λ

B

+λ

C

_

2

=

λ

A

+λ

C

λ

A

+λ

B

+λ

C

λ

C

λ

B

+λ

C

.

This leaves us with an excellent hint that there may be a shorter way and, indeed, there is:

P(Bob gets picked)

= P(ﬁrst event is either A or C, and then the next event among B and C is C).

18 POISSON PROCESS 206

Problems

1. An oﬃce has two clerks. Three people, A, B, and C enter simultaneously. A and B begin

service with the two clerks, while C waits for the ﬁrst available clerk. Assume that the service

time is Exponential(λ). (a) Compute the probability that A is the last to ﬁnish the service. (b)

Compute the expected time before C is ﬁnished (i.e., C’s combined waiting and service time).

2. A car wash has two stations, 1 and 2, with Exponential(λ

1

) and Exponential(λ

2

) service

times. A car enters at station 1. Upon completing the service at station 1, the car proceeds to

station 2, provided station 2 is free; otherwise, the car has to wait at station 1, blocking the

entrance of other cars. The car exits the wash after the service at station 2 is completed. When

you arrive at the wash there is a single car at station 1. Compute the expected time before you

exit.

3. A system has two server stations, 1 and 2, with Exponential(λ

1

) and Exponential(λ

2

) service

times. Whenever a new customer arrives, any customer in the system immediately departs.

Customer arrivals are a rate µ Poisson process, and a new arrival enters the system at station

1, then goes to station 2. (a) What proportion of customers complete their service? (b) What

proportion of customers stay in the system for more than 1 time unit, but do not complete the

service?

4. A machine needs frequent maintenance to stay on. The maintenance times occur as a Poisson

process with rate µ. Once the machine receives no maintenance for a time interval of length h,

it breaks down. It then needs to be repaired, which takes an Exponential(λ) time, after which it

goes back on. (a) After the machine is started, ﬁnd the probability that the machine will break

down before receiving its ﬁrst maintenance. (b) Find the expected time for the ﬁrst breakdown.

(c) Find the proportion of time the machine is on.

5. Assume that certain events (say, power surges) occur as a Poisson process with rate 3 per hour.

These events cause damage to a certain system (say, a computer), thus, a special protecting unit

has been designed. That unit now has to be removed from the system for 10 minutes for service.

(a) Assume that a single event occurring in the service period will cause the system to crash.

What is the probability that the system will crash?

(b) Assume that the system will survive a single event, but two events occurring in the service

period will cause it to crash. What is, now, the probability that the system will crash?

(c) Assume that a crash will not happen unless there are two events within 5 minutes of each

other. Compute the probability that the system will crash.

(d) Solve (b) by assuming that the protective unit will be out of the system for a time which is

exponentially distributed with expectation 10 minutes.

18 POISSON PROCESS 207

Solutions to problems

1. (a) This is the probability that two events happen in a rate λ Poisson process before a single

event in an independent rate λ process, that is,

1

4

. (b) First, C has to wait for the ﬁrst event in

two combined Poisson processes, which is a single process with rate 2λ, and then for the service

time; the answer is

1

2λ

+

1

λ

=

3

2λ

.

2. Your total time is (the time the other car spends at station 1) + (the time you spend at

station 2)+(maximum of the time the other car spends at station 2 and the time you spend at

station 1). If T

1

and T

2

are Exponential(λ

1

) and Exponential(λ

2

), then you need to compute

E(T

1

) +E(T

2

) +E(max¦T

1

, T

2

¦).

Now use that

max¦T

1

, T

2

¦ = T

1

+T

2

−min¦T

1

, T

2

¦

and that min¦T

1

, T

2

¦ is Exponential(λ

1

+λ

2

), to get

2

λ

1

+

2

λ

2

−

1

λ

1

+λ

2

.

3. (a) A customer needs to complete the service at both stations before a new one arrives, thus

the answer is

λ

1

λ

1

+µ

λ

2

λ

2

+µ

.

(b) Let T

1

and T

2

be the customer’s times at stations 1 and 2. The event will happen if either:

• T

1

> 1, no newcomers during time 1, and a newcomer during time [1, T

1

]; or

• T

1

< 1, T

1

+T

2

> 1, no newcomers by time 1, and a newcomer during time [1, T

1

+T

2

].

For the ﬁrst case, nothing will happen by time 1, which has probability e

−(µ+λ

1

)

. Then, after

time 1, a newcomer has to appear before the service time at station 1, which has probability

µ

λ

1

+µ

.

For the second case, conditioned on T

1

= t < 1, the probability is

e

−µ

e

−λ

2

(1−t)

µ

λ

2

+µ

.

Therefore, the probability of the second case is

e

−µ

µ

λ

2

+µ

_

1

0

e

−λ

2

(1−t)

λ

1

e

−λ

1

t

dt = e

−µ

λ

1

µ

λ

2

+µ

e

−λ

2

e

λ

2

−λ

1

−1

λ

2

−λ

1

,

18 POISSON PROCESS 208

where the last factor is 1 when λ

1

= λ

2

. The answer is

e

−(µ+λ

1

)

µ

λ

1

+µ

+

λ

1

µ

λ

2

+µ

e

−(µ+λ

2

)

e

λ

2

−λ

1

−1

λ

2

−λ

1

.

4. (a) The answer is e

−µh

. (b) Let W be the waiting time for maintenance such that the next

maintenance is at least time h in the future, and let T

1

be the time of the ﬁrst maintenance.

Then, provided t < h,

E(W[T

1

= t) = t +EW,

as the process is restarted at time t. Therefore,

EW =

_

h

0

(t +EW)µe

−µt

dt =

_

h

0

tµe

−µt

dt +EW

_

h

0

µe

−µt

dt.

Computing the two integrals and solving for EW gives

EW =

1 −µhe

−µh

−e

−µh

e

−µh

.

The answer to (b) is EW +h (the machine waits for h more units before it breaks down). The

answer to (c) is

EW +h

EW +h +

1

λ

.

5. Assume the time unit is 10 minutes,

1

6

of an hour. The answer to (a) is

P(N(1/6) ≥ 1) = 1 −e

−

1

2

,

and to (b)

P(N(1/6) ≥ 2) = 1 −

3

2

e

−

1

2

.

For (c), if there are 0 or 1 events in the 10 minutes, there will be no crash, but 3 or more events

in the 10 minutes will cause a crash. The ﬁnal possibility is exactly two events, in which case

the crash will happen with probability

P

_

[U

1

−U

2

[ <

1

2

_

,

where U

1

and U

2

are independent uniform random variables on [0, 1]. By drawing a picture, this

probability can be computed to be

3

4

. Therefore,

P(crash) = P(X > 2) +P(crash[X = 2)P(X = 2)

= 1 −e

−

1

2

−

1

2

e

−

1

2

−e

−

1

2

_

1

2

_

2

2

+

3

4

_

1

2

_

2

2

e

−

1

2

= 1 −

49

32

e

−

1

2

18 POISSON PROCESS 209

Finally, for (d), we need to calculate the probability that two events in a rate 3 Poisson process

occur before an event occurs in a rate 6 Poisson process. This probability is

_

3

3 + 6

_

2

=

1

9

.

18 POISSON PROCESS 210

Interlude: Practice Final

This practice exam covers the material from chapters 9 through 18. Give yourself 120 minutes

to solve the six problems, which you may assume have equal point score.

1. You select a random number X in [0, 1] (uniformly). Your friend then keeps selecting random

numbers U

1

, U

2

, . . . in [0, 1] (uniformly and independently) until he gets a number larger than

X/2, then he stops.

(a) Compute the expected number N of times your friend selects a number.

(b) Compute the expected sum S of the numbers your friend selects.

2. You are a member of a sports team and your coach has instituted the following policy. You

begin with zero warnings. After every game the coach evaluates whether you’ve had a discipline

problem during the game; if so, he gives you a warning. After you receive two warnings (not

necessarily in consecutive games), you are suspended for the next game and your warnings count

goes back to zero. After the suspension, the rules are the same as at the beginning. You ﬁgure

you will receive a warning after each game you play independently with probability p ∈ (0, 1).

(a) Let the state of the Markov chain be your warning count after a game. Write down the

transition matrix and determine whether this chain is irreducible and aperiodic. Compute its

invariant distribution.

(b) Write down an expression for the probability that you are suspended for both games 10 and

15. Do not evaluate.

(c) Let s

n

be the probability that you are suspended in the nth game. Compute lim

n→∞

s

n

.

3. A random walker on the nonnegative integers starts at 0 and then at each step adds either 2

or 3 to her position, each with probability

1

2

.

(a) Compute the probability that the walker is at 2n + 3 after making n steps.

(b) Let p

n

be the probability that the walker ever hits n. Compute lim

n→∞

p

n

.

4. A random walker is at one of the six vertices, labeled 0, 1, 2, 3, 4, and 5, of the graph in the

picture. At each time, she moves to a randomly chosen vertex connected to her current position

by an edge. (All choices are equally likely and she never stays at the same position for two

successive steps.)

1

2

3 4

5

0

18 POISSON PROCESS 211

(a) Compute the proportion of time the walker spends at 0, after she makes many steps. Does

this proportion depend on the walker’s starting vertex?

(b) Compute the proportion of time the walker is at an odd state (1, 3, or 5) while, previously,

she was at even state (0, 2, or 4).

(c) Now assume that the walker starts at 0. What is expected number of steps she will take

before she is back at 0?

5. In a branching process, an individual has two descendants with probability

3

4

and no descen-

dants with probability

1

4

. The process starts with a single individual in generation 0.

(a) Compute the expected number of individuals in generation 2.

(b) Compute the probability that the process ever becomes extinct.

6. Customers arrive at two service stations, labeled 1 and 2, as a Poisson process with rate

λ. Assume that the time unit is one hour. Whenever a new customer arrives, any previous

customer is immediately ejected from the system. A new arrival enters the service at station 1,

then goes to station 2.

(a) Assume that the service time at each station is exactly 2 hours. What proportion of entering

customers will complete the service (before they are ejected)?

(b) Assume that the service time at each station now is exponential with expectation 2 hours.

What proportion of entering customers will now complete the service?

(c) Keep the service time assumption from (b). A customer arrives, but he is now given special

treatment: he will not be ejected unless at least three or more new customers arrive during his

service. Compute the probability that this special customer is allowed to complete his service.

18 POISSON PROCESS 212

Solutions to Practice Final

1. You select a random number X in [0, 1] (uniformly). Your friend then keeps selecting

random numbers U

1

, U

2

, . . . in [0, 1] (uniformly and independently) until he gets a number

larger than X/2, then he stops.

(a) Compute the expected number N of times your friend selects a number.

Solution:

Given X = x, N is distributed geometrically with success probability 1 −

x

2

, so

E[N[X = x] =

1

1 −

x

2

,

and so,

EN =

_

1

0

dx

1 −

x

2

= −2 log(1 −

x

2

)[

1

0

= 2 log 2.

(b) Compute the expected sum S of the numbers your friend selects.

Solution:

Given X = x and N = n, your friend selects n − 1 numbers uniformly in [0,

x

2

] and

one number uniformly in [

x

2

, 1]. Therefore,

E[S[X = x, N = n] = (n −1)

x

4

+

1

2

_

1 +

x

2

_

=

1

4

nx +

1

2

,

E[S[X = x] =

1

4

1

1 −

x

2

x +

1

2

=

1

2 −x

,

ES =

_

1

0

1

2 −x

dx = log 2.

2. You are a member of a sports team and your coach has instituted the following policy.

You begin with zero warnings. After every game the coach evaluates whether you’ve had

a discipline problem during the game; if so, he gives you a warning. After you receive

two warnings (not necessarily in consecutive games), you are suspended for the next game

and your warnings count goes back to zero. After the suspension, the rules are the same

as at the beginning. You ﬁgure you will receive a warning after each game you play

independently with probability p ∈ (0, 1).

18 POISSON PROCESS 213

(a) Let the state of the Markov chain be your warning count after a game. Write down

the transition matrix and determine whether this chain is irreducible and aperiodic.

Compute its invariant distribution.

Solution:

The transition matrix is

P =

_

_

1 −p p 0

0 1 −p p

1 0 0

_

_

,

and the chain is clearly irreducible (the transitions 0 → 1 → 2 → 0 happen with

positive probability) and aperiodic (0 → 0 happens with positive probability). The

invariant distribution is given by

π

0

(1 −p) +π

2

= π

0

π

0

p +π

1

(1 −p) = π

1

π

1

p = π

2

and

π

0

+π

1

+π

2

= 1,

which gives

π =

_

1

p + 2

,

1

p + 2

,

p

p + 2

_

.

(b) Write down an expression for the probability that you are suspended for both games

10 and 15. Do not evaluate.

Solution:

You must have 2 warnings after game 9 and then again 2 warnings after game 14:

P

9

02

P

4

02

.

(c) Let s

n

be the probability that you are suspended in the nth game. Compute lim

n→∞

s

n

.

Solution:

As the chain is irreducible and aperiodic,

lim

n→∞

s

n

= lim

n→∞

P(X

n−1

= 2) =

p

2 +p

.

18 POISSON PROCESS 214

3. A random walker on the nonnegative integers starts at 0 and then at each step adds either

2 or 3 to her position, each with probability

1

2

.

(a) Compute the probability that the walker is at 2n + 3 after making n steps.

Solution:

The walker has to make (n −3) 2-steps and 3 3-steps, so the answer is

_

n

3

_

1

2

n

.

(b) Let p

n

be the probability that the walker ever hits n. Compute lim

n→∞

p

n

.

Solution:

The step distribution is aperiodic, as the greatest common divisor of 2 and 3 is 1, so

lim

n→∞

p

n

=

1

1

2

2 +

1

2

3

=

2

5

.

4. A random walker is at one of six vertices, labeled 0, 1, 2, 3, 4, and 5, of the graph in the

picture. At each time, she moves to a randomly chosen vertex connected to her current

position by an edge. (All choices are equally likely and she never stays at the same position

for two successive steps.)

1

2

3 4

5

0

(a) Compute the proportion of time the walker spends at 0, after she makes many steps.

Does this proportion depend on the walker’s starting vertex?

Solution:

Independently of the starting vertex, the proportion is π

0

, where [π

0

, π

1

, π

2

, π

3

, π

4

, π

5

]

is the unique invariant distribution. (Unique because of irreducibility.) This chain

18 POISSON PROCESS 215

is reversible with the invariant distribution given by

1

14

[4, 2, 2, 2, 3, 1]. Therefore, the

answer is

2

7

.

(b) Compute the proportion of times the walker is at an odd state (1, 3, or 5) while,

previously, she was at even state (0, 2, or 4).

Solution:

The answer is

π

0

(p

03

+p

05

) +π

2

p

21

+π

4

(p

43

+p

41

) =

2

7

2

4

+

1

7

1

2

+

3

14

2

3

=

5

14

.

(c) Now assume that the walker starts at 0. What is expected number of steps she will

take before she is back at 0?

Solution:

The answer is

1

π

0

=

7

2

5. In a branching process, an individual has two descendants with probability

3

4

and no

descendants with probability

1

4

. The process starts with a single individual in generation

0.

(a) Compute the expected number of individuals in generation 2.

Solution:

As

µ = 2

3

4

+ 0

1

4

=

3

2

,

the answer is

µ

2

=

9

4

.

(b) Compute the probability that the process ever goes extinct.

Solution:

As

φ(s) =

1

4

+

3

4

s

2

18 POISSON PROCESS 216

the solution to φ(s) = s is given by 3s

2

− 4s + 1 = 0, i.e., (3s − 1)(s − 1) = 0. The

answer is

π

0

=

1

3

.

6. Customers arrive at two service stations, labeled 1 and 2, as a Poisson process with rate λ.

Assume that the time unit is one hour. Whenever a new customer arrives, any previous

customer is immediately ejected from the system. A new arrival enters the service at

station 1, then goes to station 2.

(a) Assume that the service time at each station is exactly 2 hours. What proportion of

entering customers will complete the service (before they are ejected)?

Solution:

The answer is

P(customer served) = P(no arrival in 4 hours) = e

−4λ

.

(b) Assume that the service time at each station now is exponential with expectation 2

hours. What proportion of entering customers will now complete the service?

Solution:

Now,

P(customer served)

= P(2 or more arrivals in rate

1

2

Poisson process

before one arrival in rate λ Poisson process)

=

_

1

2

1

2

+λ

_

2

=

1

(1 + 2λ)

2

.

(c) Keep the service time assumption from (b). A customer arrives, but he is now given

special treatment: he will not be ejected unless at least three or more new customers

arrive during his service. Compute the probability that this special customer is

allowed to complete his service.

Solution:

The special customer is served exactly when 2 or more arrivals in rate

1

2

Poisson

18 POISSON PROCESS 217

before three arrivals in rate λ Poisson process happen. Equivalently, among the ﬁrst

4 arrivals in rate λ+

1

2

Poisson process, 2 or more belong to the rate

1

2

Poisson process.

The answer is

1 −

_

λ

λ +

1

2

_

4

−4

_

λ

λ +

1

2

_

3

1

2

λ +

1

2

= 1 −

λ

4

+ 2λ

3

_

λ +

1

2

_

4

.

1

INTRODUCTION

1

1

Introduction

The theory of probability has always been associated with gambling and many most accessible examples still come from that activity. You should be familiar with the basic tools of the gambling trade: a coin, a (six-sided) die, and a full deck of 52 cards. A fair coin gives you Heads (H) or Tails (T) with equal probability, a fair die will give you 1, 2, 3, 4, 5, or 6 with equal probability, and a shuﬄed deck of cards means that any ordering of cards is equally likely. Example 1.1. Here are typical questions that we will be asking and that you will learn how to answer. This example serves as an illustration and you should not expect to understand how to get the answer yet. Start with a shuﬄed deck of cards and distribute all 52 cards to 4 players, 13 cards to each. What is the probability that each player gets an Ace? Next, assume that you are a player and you get a single Ace. What is the probability now that each player gets an Ace? Answers. If any ordering of cards is equally likely, then any position of the four Aces in the deck is also equally likely. There are 52 possibilities for the positions (slots) for the 4 aces. Out 4 of these, the number of positions that give each player an Ace is 134 : pick the ﬁrst slot among the cards that the ﬁrst player gets, then the second slot among the second player’s cards, then 4 the third and the fourth slot. Therefore, the answer is 13 ≈ 0.1055. (52) 4 After you see that you have a single Ace, the probability goes up: the previous answer needs 13·(39) to be divided by the probability that you get a single Ace, which is 523 ≈ 0.4388. The answer (4) 134 then becomes 13· 39 ≈ 0.2404. (3) Here is how you can quickly estimate the second probability during a card game: give the second ace to a player, the third to a diﬀerent player (probability about 2/3) and then the last to the third player (probability about 1/3) for the approximate answer 2/9 ≈ 0.22.

History of probability

Although gambling dates back thousands of years, the birth of modern probability is considered to be a 1654 letter from the Flemish aristocrat and notorious gambler Chevalier de M´r´ to the ee mathematician and philosopher Blaise Pascal. In essence the letter said: I used to bet even money that I would get at least one 6 in four rolls of a fair die. The probability of this is 4 times the probability of getting a 6 in a single die, i.e., 4/6 = 2/3; clearly I had an advantage and indeed I was making money. Now I bet even money that within 24 rolls of two dice I get at least one double 6. This has the same advantage (24/62 = 2/3), but now I am losing money. Why? As Pascal discussed in his correspondence with Pierre de Fermat, de M´r´’s reasoning was faulty; ee after all, if the number of rolls were 7 in the ﬁrst game, the logic would give the nonsensical probability 7/6. We’ll come back to this later.

1

INTRODUCTION

2

Example 1.2. In a family with 4 children, what is the probability of a 2:2 boy-girl split? One common wrong answer: likely.

1 5,

as the 5 possibilities for the number of boys are not equally

Another common guess: close to 1, as this is the most “balanced” possibility. This represents the mistaken belief that symmetry in probabilities should very likely result in symmetry in the outcome. A related confusion supposes that events that are probable (say, have probability around 0.75) occur nearly certainly.

**Equally likely outcomes
**

Suppose an experiment is performed, with n possible outcomes comprising a set S. Assume also that all outcomes are equally likely. (Whether this assumption is realistic depends on the context. The above Example 1.2 gives an instance where this is not a reasonable assumption.) An event E is a set of outcomes, i.e., E ⊂ S. If an event E consists of m diﬀerent outcomes (often called “good” outcomes for E), then the probability of E is given by: (1.1) P (E) = m . n

Example 1.3. A fair die has 6 outcomes; take E = {2, 4, 6}. Then P (E) = 1 . 2 What does the answer in Example 1.3 mean? Every student of probability should spend some time thinking about this. The fact is that it is very diﬃcult to attach a meaning to P (E) if we roll a die a single time or a few times. The most straightforward interpretation is that for a very large number of rolls about half of the outcomes will be even. Note that this requires at least the concept of a limit! This relative frequency interpretation of probability will be explained in detail much later. For now, take formula (1.1) as the deﬁnition of probability.

2

COMBINATORICS

3

2

Combinatorics

Example 2.1. Toss three fair coins. What is the probability of exactly one Heads (H)? There are 8 equally likely outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Out of these, 3 have exactly one H. That is, E = {HTT, THT, TTH}, and P (E) = 3/8. Example 2.2. Let us now compute the probability of a 2:2 boy-girl split in a four-children family. We have 16 outcomes, which we will assume are equally likely, although this is not quite true in reality. We list the outcomes below, although we will soon see that there is a better way. BBBB BGBB GBBB GGBB We conclude that P (2:2 split) = BBBG BGBG GBBG GGBG BBGB BGGB GBGB GGGB BBGG BGGG GBGG GGGG

8 1 = , 16 2 2 1 P (4:0 split or 0:4 split) = = . 16 8 P (1:3 split or 3:1 split) = Example 2.3. Roll two dice. What is the most likely sum? Outcomes are ordered pairs (i, j), 1 ≤ i ≤ 6, 1 ≤ j ≤ 6. sum 2 3 4 5 6 7 8 9 10 11 12 Our answer is 7, and P (sum = 7) =

6 36

6 3 = , 16 8

no. of outcomes 1 2 3 4 5 6 5 4 3 2 1 = 1. 6

it is therefore 64 . Chevalier de M´r´ overcounted the good outcomes in both cases. Why? We have 6 options for the ﬁrst roll. Roll a die 4 times. but conceptually important fact.4914. so de M´r´ was ee gambling a lot to be able to notice the diﬀerence. .5177. but their number does not! The answer then is 6·5·4·3 64 = 5 18 ≈ 0.5. therefore. Therefore P (win) = 1 − 5 6 4 ≈ 0. In this case it is 6 · 5 · 4 · 3. Note that the set of possible outcomes changes from stage to stage (roll to roll in this case). d) : a. . His count 4 · 63 in Game ee 1 selects a die with a 6 and arbitrary numbers for other dice. 4 options for the third roll since its number must not appear on the ﬁrst two rolls. . If an experiment consists of two stages and the ﬁrst stage has m outcomes. d ∈ {1. In this case. What is the probability that you get diﬀerent numbers? At least at the beginning. However. We will. .4. One should also note that both probabilities are barely diﬀerent from 1/2. b. ee In Game 1. That is. . the number of bad ones is 3524 . Example 2. there are 4 rolls and he wins with at least one 6. b. while the second stage has n outcomes regardless of the outcome at the ﬁrst stage. In this case.2778. especially if their number is large. 5 options for the second roll since its number must diﬀer from the number on the ﬁrst roll. you should divide every solution into the following three steps: Step 1: Identify the set of equally likely outcomes. many outcomes have more than one six and are hence counted more than once. Basic principle of counting. Example 2.2 COMBINATORICS 4 How to count? Listing all outcomes is very ineﬃcient. The number of good events is 64 − 54 . Step 2: Compute the number of outcomes. Therefore P (win) = 1 − 35 36 24 ≈ 0. as the number of bad events is 54 . . c. In Game 2. learn a few counting techniques. 6}}. Step 3: Compute the number of good outcomes. there are 24 rolls of two dice and he wins by at least one pair of 6’s rolled. this is the set of all ordered 4-tuples of numbers 1. . 6. starting with a trivial. etc. . . thus the number of good outcomes equals 3624 − 3524 . {(a. The number of outcomes is 3624 . c. then the experiment as a whole has mn outcomes. Let us now compute probabilities for de M´r´’s games.

Shuﬄe a deck of cards.2 COMBINATORICS 5 Permutations Assume you have n objects. What is the probability that these pieces. P . Sampling without replacement omits the putting back part.6. for example. at random. • P (top card is an Ace) = 1 13 = 4·51! 52! . E. Pull 6 pieces at random out of the bag (1) without.7. (n − k + 1) = n! . 2·63 quite a lot smaller. noting its properties. a good event is speciﬁed by ordering 40 items (the block of hearts plus 39 other cards) and ordering the hearts within their block. in the next problem (and in statistics in general) sampling with replacement refers to choosing. The number of good outcomes is 3!2!. . The probability is 3!2! 6! = 1 60 . Also. For sampling without replacement: 1. . A bag has 6 pieces of paper. To compute the last probability. putting the object back into the population. 2. and R. Example 2. . The number of ways to ﬁll n ordered slots with them is n · (n − 1) . that a random choice is made. P . 3. • P (hearts are together) = 40!13! 52! ≈ 4. collect all hearts into a block. in order. and (2) with replacement. This event = 6 · 10−11 . this means that all available choices are equally likely. An outcome is an ordering of the pieces of paper E1 E2 P1 P2 P3 R. (n − k)! Example 2. spell P EP P ER? There are two problems to solve. while the number of ways to ﬁll k ≤ n ordered slots is n(n − 1) . let us agree that when the text says without further elaboration. on it. . the answer is = 1 . and then repeating. E. The number of outcomes thus is 6!. . P . 33 ·22 66 For sampling with replacement. an object from a population. Before we go on to further examples. 2 · 1 = n!.5·10−28 . each with one of the letters. 4!·(13!)4 52! • P (all cards of the same suit end up next to each other) = never happens in practice.

n−k The multinomial coeﬃcients are more general and are deﬁned next. Combinations Let n k be the number of diﬀerent subsets with k elements of a set with n elements. Then. n k = = n(n − 1) . 5! Example 2. Then. ﬁrst choose a subset. What is the probability that all groups end up sitting together? The answer is 3!·4!·5!·2! . A group consists of 3 Norwegians. also compute P (men and women alternate). say. and 5 Finns. Sit 3 men and 3 women at random (1) in a row of chairs and (2) around a table. . the answer is 4!3! 6! = 1. The answer is 10 . Here is how you 11! count the good events. . then order its elements in a row to ﬁll k ordered slots with elements from the set with n objects. n k! = n(n − 1) . say John Smith. the seats for the men and women are ﬁxed after John Smith takes his seat and so 1 the answer is 3!2! = 10 . Therefore. note that k=0 k x y n 0 = n n =1 and n k = n .9. . Then. (n − k + 1). There are 3! choices for ordering the group of Norwegians (and then sit them down to one of both sides of Arne. there are 2! choices to order the two blocks of Swedes and Finns. and they sit at random around a table.8. and sit him ﬁrst. Also. a Norwegian (Arne) and sit him down. Compute P (all women sit together). In case (2).2 COMBINATORICS 6 Example 2. distinct choices of a subset and its ordering will end up as distinct orderings. depending on the ordering). Then. we reduce to a row 3 problem with 5! outcomes. For the last question. Finally. k We call n = n choose k or a binomial coeﬃcient (as it appears in the binomial theorem: k n k n−k (x + y)n = n ). pick a man. there are 4! choices for arranging the Swedes and 5! choices for arranging the Finns. (n − k + 1) k! n! k!(n − k)! To understand why the above is true. . 4 Swedes. the number of good outcomes is 3! · 3!. Pick. 5 For case (2). . In case (1).

4. Let the ﬁrst 13 cards go to the ﬁrst player. . . Order the rest of the cards in 48! ways. is denoted by n1 . nr = = n − n1 n2 n! n1 !n2 ! . + nr = n. as one needs to choose the position of the ﬁve heads among 10 slots to ﬁx a good outcome. Example 2.. 50 of them red and 50 blue.12. We have a bag that contains 100 balls. Here we return to Example 1. Example 2. There are 52! equally likely outcomes.11. nr elements.. Pick a slot within each of the four segments of 13 slots for an Ace. . . 2. n2 . . .1 and solve it more slowly. . the second 13 cards to the second player. etc.10. 3. It cannot be exactly 1 either. What is the probability that we get exactly 5 Heads? P (exactly 5 Heads) = 10 5 210 ≈ 0. . then n2 diﬀerent elements blue.2461. . A fair coin is tossed 10 times. . Select 5 balls at random. What is the probability that 3 are blue and 2 are red? The number of outcomes is 100 and all of them are equally likely. where n1 + .. as 2 their sum cannot be more than 1. These colors distinguish among the diﬀerent subsets. What is the probability that each player gets an Ace? We will solve this problem in two ways to emphasize that you often have a choice in your set of equally likely outcomes. etc. Example 2. . n3 nr 7 To understand the slightly confusing word distinguishable. nr ! n n1 n − n1 − n2 n − n1 − .” The answer is P (3 are blue and 2 are red) = 50 50 3 2 100 5 ≈ 0. There are 134 possibilities to choose these four slots for the Aces.. Shuﬄe a standard deck of 52 cards and deal 13 cards to each of the 4 players.2 COMBINATORICS The number of ways to divide a set of n elements into r (distinguishable) subsets of n n1 .nr and n n1 . − nr−1 .3189 Why should this probability be less than a half? The probability that 3 are blue and 2 are red is equal to the probability of 3 are red and 2 are blue and they cannot both exceed 1 . The number of choices to ﬁll these four positions with (four diﬀerent) Aces is 4!. because other possibilities (such 2 as all 5 chosen balls red) have probability greater than 0. . just think of painting n1 elements red. The ﬁrst way uses an outcome to be an ordering of 52 cards: 1. . which is a reasonable 5 interpretation of “select 5 balls at random.

Doing the problem the second way. then.0106. 3 2 Example 2. which agrees with the number we obtained the ﬁrst way. 3. green.14. 2.. pick three slots for 1. 2 appears exactly 2 times)? To ﬁx a good outcome now. We have 14 rooms and 4 colors.. blue. The number of outcomes is 52 4 . etc. The outcomes are the positions of the 4 Aces among the 52 slots for the shuﬄed cards of the deck.13. we get 1. etc. The number of outcomes. as we need to choose one slot among 13 cards that go to the ﬁrst player. and yellow. then.. 8 The probability. pick two slots for 1. is The second way. a lot smaller than the probability of each player getting (52) 4 Example 2. Each room is painted at random with one of the four colors. . pick one player ( that player ( 13 choices). is (52) 4 134 . Let us also compute P (one person has all four Aces). white. 2 2 . 3.. so. for .2 COMBINATORICS 134 4!48! 52! . . There are 414 equally likely outcomes. The number of good outcomes is 134 . is 612 . . Then: 1. Roll a die 12 times. is (12)(10). therefore. with choices. You may choose to believe this intuitive fact or try to write down a formal proof: the number of permutations that result in a given set F of four positions is independent of F . What is P (1 appears exactly 3 times. two slots for 2. To ﬁx a good outcome. . via a small leap of faith. The probability. P (each number appears exactly twice)? 1. An outcome consists of ﬁlling each of the 12 slots (for the 12 rolls) with an integer between 1 and 6 (the outcome of the roll). is an Ace. 4 The answer. 4 1 2. then pick two slots for 2. 6.. The number of outcomes is 52 4 . then. 2. The number of choices is 12 9 47 and the answer is 3 612 .(2) 2 2 2 612 12 2 10 2 . and ﬁll the remaining 7 (12)(9)47 2 slots by numbers 3. choices) and pick four slots for the Aces for (4)(13) 1 4 = 0. To ﬁx a good outcome. The probability. assumes that each set of the four positions of the four Aces among the 52 shuﬄed cards is equally likely. then.

What is the probability that they all receive correct meals? A reformulation makes the problem clearer: we are interested in P (3 people who ordered C get C). but has forgotten who ordered what and discovers that they are all asleep. So. all the Swedes. Assume that all licence plates are equally likely. 35 ·4 7 3 = Problems 1. Each pair proceeds to play one match. Compute the following probabilities: (a) that all the Norwegians sit together. number. 7 (3) P (no one who ordered C gets C) = 4 3 7 3 = 4 . compute the probability that all Swedes are the winners. The outcome is a selection of 3 people from the 7 who receive C. A tennis tournament has 2n participants. so she puts the meals in front of them at random. and there is a 3 1 single good outcome. n people are chosen at random from the 2n (with no regard to nationality) and then paired randomly with the other n people. 2. where a letter is any one of 26 letters and a number is one among 0. . letter. First. The ﬂight attendant returns with the meals. 35 3· 7 3 4 2 P (a single person who ordered C gets C) = P (two persons who ordered C get C) = 3 2 = 12 . the number of them is 7 . 7 and assume that 1. 4 blue. number. (b) What do you need to assume to conclude that all outcomes are equally likely? (c) Under this assumption. They are seated at random around a table.2 COMBINATORICS 9 example. An outcome is a set of n (ordered) pairs. Example 2. 2 yellow rooms) = 14 5 9 4 5 3 2 2 414 . . . 35 18 . letter. Similarly. giving the winner and the loser in each of the n matches. 9. and all the Finns sit together. 6 Swedes. A middle row on a plane seats 7 people. A California licence plate consists of a sequence of seven symbols: number. . (a) Determine the number of outcomes. P (5 white. Let us label the people 1. number. 3 green. Three of them order chicken (C) and the remaining four pasta (P). . (a) What is the probability that all symbols are diﬀerent? (b) What is the probability that all symbols are diﬀerent and the ﬁrst number is the largest among the numbers? 2. and 7 Finns. . (b) that all the Norwegians and all the Swedes sit together. n Swedes and n Norwegians. letter. and 3 ordered C.15. A group of 18 Scandinavians consists of 5 Norwegians. 1. . . 3. and (c) that all the Norwegians. the answer is 1 = 35 . .

Solutions to problems 1.. (a) 13!·5! 17! . . 4 2. (Of course. Pick 8 cards from the deck at random (a) without replacement and (b) with replacement. then the next available player.) (b) All players are of equal strength. 80. Then 20 balls are selected (without replacement) out of the bag at random. 17! (70) (10)(70) (70) 20 4 16 10 . (a) 4. . . (c) 2!·7!·6!·5! . A full deck of 52 cards contains 13 hearts. these two expressions are the same. (b) (52) 8 3 8 4 . (a) 5.2 COMBINATORICS 10 4. (b) 8!·5!·6! 17! . (a) What is the probability that all your numbers are selected? (b) What is the probability that none of your numbers is selected? (c) What is the probability that exactly 4 of your numbers are selected? 5. (b) 80 . (a) Divide into two groups (winners and losers). Before the game starts. 80 and write them on a piece of paper. the choice of a Norwegian paired with each Swede. (c) (20) (20) (80) 20 (39) 8 . Alternatively. then pair them: 2n · n!. (a) 10·9·8·7·26·25·24 . you choose 10 diﬀerent numbers from amongst 1. and. In each case compute the probability that you get no hearts. . 104 ·263 (b) the answer in (a) times 1 . A bag contains 80 balls numbered 1. . at the end choose the winners and the losers: (2n − 1)(2n − 3) · · · 3 · 1 · 2n . etc. 80 . 3. . . equally likely to win or lose any match against any other player. . . pair n the ﬁrst player. (c) The number of good events is n!. then.

sometimes called the sample space. Here are some. Example 3. Can we just say that each A ⊂ Ω is an event? In this course you can assume so without worry. P ). P (A) = (number of elements in A) . as is common in mathematics. . We call events A1 . The ﬁrst object Ω is an arbitrary set. For every event A. · · · ∈ F =⇒ ∪∞ Ai ∈ F. a set of subsets of Ω. A2 . Finally. Finally. 3. and intersections of two or more events (i. . but this is only for illustration and you should take a course in measure theory to understand what is really going on. F needs to be a σ-algebra.. then ∞ ∞ P( i=1 Ai ) = i=1 P (Ai )..e. . . we will soon see what it means to choose a random point within a unit circle. 2. We need to agree on which properties these objects must have in order to compute with them and develop a theory. Ω = {1. events (sets containing certain outcomes).1. Ac happens when A does not happen). F. a probability space is a triple (Ω. unions of two or more events (i. Whenever we have an abstract deﬁnition such as this one. pairwise disjoint if Ai ∩ Aj = ∅ if i = j — that is. A1 ∩ A2 happens when both A1 and A2 happen). If A1 .3 AXIOMS OF PROBABILITY 11 3 Axioms of Probability The question here is: how can we mathematically deﬁne a random experiment? What we have are outcomes (which tell you exactly what happens). (2) A ∈ F =⇒ Ac ∈ F. . The second object F is a collection of events.e. the ﬁrst thing to do is to look for examples. On the other hand. Axiom 2. For example. inﬁnite sets are much harder to deal with. However. Therefore. i=1 What is important is that you can take the complement Ac of an event A (i. although there are good reasons for not assuming so in general ! I will give the deﬁnition of what properties F needs to satisfy. which means (1) ∅ ∈ F. When we have ﬁnitely many equally likely outcomes all is clear and we have already seen many examples. A2 . repesenting the set of outcomes. . P (Ω) = 1.e. A2 . that is. we will also see that there is no way to choose at random a positive integer — remember that “at random” means all choices are equally likely. at most one of these events can occur. and probability (which attaches to every event the likelihood that it happens). Axiom 3. Namely. is a sequence of pairwise disjoint events.. 6}. the probability P is a number attached to every event A and satisﬁes the following three axioms: Axiom 1. A1 ∪ A2 happens when either A1 or A2 happens). and (3) A1 . 5. 6 . unless otherwise speciﬁed. 4. P (A) ≥ 0. an event A ∈ F is necessarily a subset of Ω.

. Note that pi cannot be chosen to be equal. as you cannot make the sum of inﬁnitely many equal numbers to be 1. Another important theoretical remark: this is a case where A cannot be an arbitrary subset of the circle — for some sets area cannot be deﬁned! Consequences of the axioms (C0) P (∅) = 0. For any A ⊂ Ω. Ω = {1.3. . take all sets to be ∅. . Proof .2. y) : x2 + y 2 < 1} and (area of A) P (A) = . Example 3. (C2) P (Ac ) = 1 − P (A). This means that we cannot attach the probability to outcomes — you hit a single point (or even a line) with probability 0. Your outcome is the number of 1 tosses. Example 3. but a “fatter” set with positive area you hit with positive probability. Pick a point from inside the unit circle centered at the origin. then P (A) = 0. p2 . = 1. pi . Apply (C1) to A1 = A. In Axiom 3. this can be generalized to any ﬁnite set with equally likely outcomes. Proof .3 AXIOMS OF PROBABILITY 12 The random experiment here is rolling a fair die. Here. A2 = Ac . Proof . Ω = {(x. Clearly. toss a fair coin repeatedly until the ﬁrst Heads. . π It is important to observe that if A is a singleton (a set whose element is a single point). In Axiom 3. pi = 2i . . (C1) If A1 ∩ A2 = ∅. then P (A1 ∪ A2 ) = P (A1 ) + P (A2 ). . . ≥ 0 with p1 + p2 + . Here. take all sets other than ﬁrst two to be ∅. . .} and you have numbers p1 . P (A) = i∈A For example. 2.

2 3 1 3 p3 = P (Ac ∩ Ac ∩ A3 ). p13 = P (A1 ∩ Ac ∩ A3 ). (C6) P (A1 ∪ A2 ∪ A3 ) = P (A1 ) + P (A2 ) + P (A3 ) − P (A1 ∩ A2 ) − P (A1 ∩ A3 ) − P (A2 ∩ A2 ) + P (A1 ∩ A2 ∩ A3 ) and more generally n P (A1 ∪ · · · ∪ An ) = i=1 P (Ai ) − 1≤i<j≤n P (Ai ∩ Aj ) + 1≤i<j<k≤n P (Ai ∩ Aj ∩ Ak ) + . and note that A \ B. (C5) P (A ∪ B) = P (A) + P (B) − P (A ∩ B). Again. Proof . (C4) If A ⊂ B. P (B) = P (A) + P (B \ A) ≥ P (A). We prove this only for n = 3. p12 = P (A1 ∩ A2 ∩ Ac ). + (−1)n−1 P (A1 ∩ · · · ∩ An ). Use (C1) for A1 = A and A2 = B \ A. 1 2 3 2 1 and p123 = P (A1 ∩ A2 ∩ A3 ). note that all sets are pairwise disjoint and that the right hand side of (6) is (p1 + p12 + p13 + p123 ) + (p2 + p12 + p23 + p123 ) + (p3 + p13 + p23 + p123 ) − (p12 + p123 ) − (p13 + p123 ) − (p23 + p123 ) + p123 = p1 + p2 + p3 + p12 + p13 + p23 + p123 = P (A1 ∪ A2 ∪ A3 ).3 AXIOMS OF PROBABILITY 13 (C3) 0 ≤ P (A) ≤ 1. and B \A are pairwise disjoint. P (A ∩ B) = p12 and P (B \ A) = p2 . Let p1 = P (A1 ∩ Ac ∩ Ac ). and P (A∪B) = p1 +p2 +p12 . Use that P (Ac ) ≥ 0 in (C2). P (B) = p2 +p12 . Proof . A ∩ B. Then P (A) = p1 +p12 . . This is called the inclusion-exclusion formula and is commonly used when it is easier to compute probabilities of intersections than of unions. p23 = P (Ac ∩ A2 ∩ A3 ). p2 = P (Ac ∩ A2 ∩ Ac ). Proof . . . Let P (A \ B) = p1 . Proof .

4. Pick an integer in [1. Compute the probability that it is divisible neither by 12 nor by 15.6. where lcm stands for the least common multiple. Sit 3 men and 4 women at random in a row. Then.5. 11! Example 3. 11! 11! 11! 3! · 4! · 5! · 2! . We have 3! · 9! 4! · 8! 5! · 7! .3 AXIOMS OF PROBABILITY 14 Example 3. for the 1 probability that you get your own gift. 11! 11! 11! 3! · 4! · 6! 3! · 5! · 5! 4! · 5! · 4! P (AN ∩ AS ) = .s) . and so the answer is P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 ∩ A2 ) = 4! · 4! 5! · 3! 2! · 3! · 4! + − . P (AN ∩ AS ∩ AF ) = 11! P (AN ) = Therefore.7. P (AS ∩ AF ) = .867. what we are looking for is n P( i=1 Ai ). Let Ai = {ith employee gets his or her own gift}. The sample space consists of the 1000 integers between 1 and 1000 and let Ar be the subset consisting of integers divisible by r. P (AN ∪ AS ∪ AF ) = 3! · 9! + 4! · 8! + 5! · 7! − 3! · 4! · 6! − 3! · 5! · 5! − 4! · 5! · 4! + 3! · 4! · 5! · 2! . Matching problem. assuming that you are one of the employees. Another simple fact is that Ar ∩ As = Alcm(r. A group of 3 Norwegians. Deﬁne AN = {Norwegians sit together} and similarly AS . 7! 7! 7! Example 3. . What is the probability that someone gets his or her own gift? Note that this is diﬀerent from asking. which is n . 1000 1000 1000 Example 3. A1 = {all women sit together}. 4 Swedes. P (AS ) = . P (AN ∩ AF ) = . AF . A1 ∩ A2 = {both women and men sit together}. 1000] at random. What is the probability that either all the men or all the women end up sitting together? Here. The cardinality of Ar is 1000/r . A large company with n employees has a scheme according to which each employee buys a Christmas gift and the gifts are then distributed at random to the employees. Compute the probability that at least one of the three groups ends up sitting together. Our probability equals 1 − P (A12 ∪ A15 ) = 1 − P (A12 ) − P (A15 ) + P (A12 ∩ A15 ) = 1 − P (A12 ) − P (A15 ) + P (A60 ) 83 66 16 =1− − + = 0. P (AN ) = . A2 = {all men sit together}. and 5 Finns is seated at random around a table.

this is sampling with replacement from among n = 10.” . Each day.8.9901 0. On February 6. Assume that there are k people in the room. no winning number has ever been repeated. = 1 (for all i). and generalize the problem a little: from n possible birthdays.. the Massachusetts lottery chooses a four digit number at random. + (−1)n−1 · · n 2 n(n − 1) 3 n(n − 1)(n − 2) n! 1 1 1 = 1 − + + . e n· Example 3. n! n(n − 1) (n − 3)! 1 = (for all i < j < k). the expert [and a lottery oﬃcial] doesn’t expect to see duplicate winners until about half of the 10. 1978.5073 0.9032 0.]. Some values are given in the following table. What is the probability that there are two who share a birthday? We will ignore leap years. . [David] Hughes. k 10 23 41 57 70 prob. the lowest k for which the above exceeds 0.5 is. . 000 choices each day. . sample k times with replacement. for n = 365 0. famously. with leading 0’s allowed. P (a shared birthday) = 1 − P (no shared birthdays) = 1 − n · (n − 1) · · · (n − k + 1) . n (n − 2)! 1 = (for all i < j).. . our answer is 1 n n 1 1 1 − + − . n! n(n − 1)(n − 2) 1 . k = 23. assume all birthdays are equally likely.. n! When n = 365. so we give another example. 000 possibilities have been exhausted.6321 (as n → ∞).1169 0. Thus. + (−1)n−1 2! 3! n! 1 → 1 − ≈ 0. the Boston Evening Globe reported that “During [the lottery’s] 22 months’ existence [. nk = = = . Birthday Problem..3 AXIOMS OF PROBABILITY 15 We have P (Ai ) P (Ai ∩ Aj ) P (Ai ∩ Aj ∩ Ak ) P (A1 ∩ · · · ∩ An ) Therefore.9992 Occurences of this problem are quite common in various contexts.

Within the context of the previous problem. Example 3. . If you buy k boxes. = (n − 1)k nk (n − 2)k nk 0 (for all i) (for all i < j) n−1 n k + n 2 n−2 n k k − . the probability of no repetition works out to be about 2..3 AXIOMS OF PROBABILITY 16 What would an informed reader make of this? Assuming k = 660 days. and 0 when k < n. assume that k ≥ n and compute P (all n birthdays are represented). due to the very large size of summands with alternating signs and the resulting rounding errors. Another remark for those who know a lot of combinatorics: you will perhaps notice that the above probability n! is nk Sk.. did not check for repetitions. Now. We will return to this problem later for a much more eﬃcient computational method. More often. what is the probability that you have a complete collection? When k = n. More generally. conﬁdent that there would not be any.9. Apologetic lottery oﬃcials announced later that there indeed were repetitions. get anything close to the correct numbers when k ≤ n if you try to compute the probabilities by computer. we need to compute 1 − P( i=1 n Ai ). . let Ai = {ith birthday is missing}. for large n. Neither will you. each of which contains one of n diﬀerent cards (coupons). but some numbers are in the two tables below.n is the Stirling number of the second kind. P (Ai ) P (Ai ∩ Aj ) P (A1 ∩ · · · ∩ An ) and our answer is 1−n n = = . n! This must be nn for k = n. chosen at random.19 · 10−10 . not understanding the Birthday Problem.n . neither of which is obvious from the formula. where Sk. our probability is n! nn . Then. Hughes. + (−1)n−1 n n−1 1 n k = i=0 n i (−1)i 1 − n i . making it a remarkably improbable event. this is described in terms of cereal boxes. . What happened was that Mr. Coupon Collector Problem.

Example 3.11. 12 6 6 3 3. choices.3 AXIOMS OF PROBABILITY k 1607 1854 2287 2972 3828 4669 prob. Pick i pairs of socks from the 10: choices.5004 0. Pick slots for one of the numbers that occur 3 times: Therefore. To count the number of good outcomes: 10 i 1.9002 0. 5 2 2. for n = 6 0. You have 10 pairs of socks in the closet. The number of outcomes is 612 . Therefore.1003 0. 612 choices. To count the number of good outcomes: 1. Roll a die 12 times. for n = 365 0.10. our probability is 28−2i (10−i)(10) 8−2i i (20) 8 . compute the probability that you get i complete pairs of socks. Pick slots (rolls) for the number that occurs 6 times: 4. Example 3. 3. Pick the number that occurs 6 times: 6 1 = 6 choices.9108 0. . There are 20 8 outcomes. Pick a sock from each of the latter pairs: 28−2i choices. For every i.9900 0.5139 0. Pick pairs which are represented by a single sock: choices. but are a little harder. Pick 8 socks at random. 10−i 8−2i 2.9915 More examples with combinatorial ﬂavor We will now do more problems which would rather belong to the previous chapter. Pick the two numbers that occur 3 times each: choices.0101 0.9990 17 k 13 23 36 prob. Compute the probability that a number occurs 6 times and two other numbers occur three times each. our probability is 6(5)(12)(6) 2 6 3 . so we do them here instead.

3.000240 0.000015 . 5. 8. 6. 4.021128 0. Q. 7.001441 0. Poker Hands. In the deﬁnitions.422569 0.3 AXIOMS OF PROBABILITY 18 Example 3.001981 0. 9. 1. 2. K. J.003940 0. K. 10. prob. combinations 13 · 12 · 4 · 43 3 2 13 4 4 2 · 11 · 2 · 2 · 4 12 4 13 · 2 · 3 · 42 10 · 45 4 · 13 5 13 · 12 · 4 · 4 3 2 13 · 12 · 4 10 · 4 approx. From the lowest to the highest. with one exception: an Ace may be regarded as 1 for the purposes of making a straight (but note that. 3 is not a valid straight sequence — A can only begin or end a straight). 2. for example.12. This sequence orders the cards in descending consecutive values. here are the hands: (a) one pair : two cards of the same value plus 3 cards with diﬀerent values J♠ J♣ 9♥ Q♣ 4♠ (b) two pairs: two pairs plus another card of diﬀerent value J♠ J♣ 9♥ 9♣ 3♠ (c) three of a kind : three cards of the same value plus two with diﬀerent values Q♠ Q♣ Q♥ 9♣ 3♠ (d) straight: ﬁve cards with consecutive values 5♥ 4♣ 3♣ 2♥ A♠ (e) ﬂush: ﬁve cards of the same suit K♣ 9♣ 7♣ 6♣ 3♣ (f) full house: a three of a kind and a pair J♣ J♦ J♥ 3♣ 3♠ (g) four of a kind : four cards of the same value K♣ K♦ K♥ K♣ 10♠ (e) straight ﬂush: ﬁve cards of the same suit with consecutive values A♠ K♠ Q♠ J♠ 10♠ Here are the probabilities: hand one pair two pairs three of a kind straight ﬂush full house four of a kind straight ﬂush no. the word value refers to A. A.047539 0. 0.

2.e. the suit) from each of the two remaining values: 42 . 2. 960. Pick ﬁve numbers: 13 5 . that is. Pick the suits of the other three values: 43 And for the ﬂush: 1. for the three of a kind . Pick a suit: 4.3 AXIOMS OF PROBABILITY 19 Note that the probabilities of a straight and a ﬂush above include the possibility of a straight ﬂush. Pick two cards from the value of the pair: 4. Pick a suit: 4.. 2. 4. the number of good outcomes may be obtained by listing the number of choices: 1. 4 3 3. Let us see how some of these are computed.5012. Choose the values of other two cards: 12 2 . . Pick a card (i. We end this example by computing the probability of not getting any hand listed above. Then. 2. 598. Pick three cards from the four of the same chosen value: . Pick the beginning number: 10. Pick the other three numbers: 12 3 4 2 3. the number of all outcomes is 52 = 5 2. Our ﬁnal worked out case is straight ﬂush: 1. P (nothing) = P (all cards with diﬀerent values) − P (straight or ﬂush) = = 13 5 13 5 · 45 52 5 − (P (straight) + P (ﬂush) − P (straight ﬂush)) 13 5 · 45 − 10 · 45 − 4 · 52 5 + 40 ≈ 0. First. Pick a number for the pair: 13. One could do the same for one pair : 1. for example. Choose a value for the triple: 13.

. . Any such distribution constitutes an outcome and they are equally likely. For this. What is the probability that at least one of the three nations is not represented on the committee? 4. Assume that room 1 takes people in slots S1.3 AXIOMS OF PROBABILITY 20 Example 3. is 10 2i 22i 20 10 10−2i 5−i .. A group of 20 Scandinavians consists of 7 Swedes. S4. therefore. 6) D’s to distribute into 5 − i (ex.. and ﬁve 3’s. . two 2’s. as there are two D’s in each of these rooms. (b) that you get at least one 6 and at least one 5. there are 10 choices. 2 per room. Once these two choices are made. 3) rooms. 3 Finns. you still have 10 − 2i (ex. 10 Finns. 5.13. for 22i choices. 4) mixed rooms. . 10 To get 2i (for example. so let us call these two slots R1. The ﬁnal answer. corresponding to the positions of the 10 Danes. A committee of ﬁve people is chosen at random from this group. What is the probability that exactly 2i rooms are mixed. Compute P (no wife sits next to her husband). i = 0. Roll a single die 10 times. Their number is 20 . start by choosing 2i (ex. (b) Compute the probability that at least one number occurs exactly once. . are to be distributed at random into 10 rooms. Problems 1. and this choice ﬁxes a 5−i good outcome. 3) rooms from the remaining 10 − 2i (ex. First arrange them at random into a row of 20 slots S1. . Consider the following model for distributing the Scandinavians into rooms. 2. (a) Compute the probability that at least one number occurs exactly 6 times. room 2 takes people in slots S3. 3. 4) out of the 10 rooms which are going to be mixed. Compute the following probabilities: (a) that you get at least one 6. . so let us call these two slots R2.. 9. S2. . you need to choose 5 − i (ex. etc. it is clear that we only need to keep track of the distribution of 10 D’s into the 20 slots. Choose each digit of a 5 digit number at random from digits 1. (c) that you get three 1’s. . . . Three married couples take seats around a table at random. . Roll a fair die 10 times. and 10 Danes. Now. Assume that 20 Scandinavians. S2. 5? This is an example when careful thinking about what the outcomes should be really pays oﬀ. Compute the probability that no digit appears more than twice. S20. and 10 Norwegians. 6). for 10−2i choices.. Similarly. You also need to decide into which slot in 2i each of the 2i chosen mixed rooms the D goes.

(b) 1 − 2 · (5/6)10 + (2/3)10 . 5 3! · 2! · 2! 1 P (A1 ∩ A2 ) = P (A1 ∩ A3 ) = P (A2 ∩ A3 ) = = . P (A1 ∪ A2 ∪ A3 ) = 3 · and our answer is 4 15 . Moreover. A lottery ticket consists of two rows. P (A1 ∩ A2 ∩ A3 ) = 5! 15 For P (A1 ∩ A2 ). 5! 5 2 2! · 2! · 2! · 2! = . . . 2. 5. A ticket wins if its ﬁrst row contains at least two of the numbers drawn and its second row contains at least two of the numbers drawn. Compute the winning probabilities for each of the four tickets. but no digit appears 4 times: choose a digit. The drawing consists of choosing 5 diﬀerent numbers from 1. 2 P (A1 ) = = P (A2 ) = P (A3 ). pick the ordering for couple1 . and A3 = the event that Norwegians are not represented. . (a) 1 − (5/6)10 . P (A1 ∪ A2 ∪ A3 ) = P (A1 ) + P (A2 ) + P (A3 ) − P (A1 ∩ A2 ) − P (A1 ∩ A3 ) − P (A2 ∩ A3 ) +P (A1 ∩ A2 ∩ A3 ) 1 13 17 10 10 7 = 20 + + − −0− +0 5 5 5 5 5 5 4. Let A1 = the event that Swedes are not represented. . and Ticket 4 all win. 50 at random. . In the remaining row of 5 seats. The four examples below represent the four types of tickets: Ticket 1 1 2 3 4 5 6 Ticket 2 1 2 3 1 2 3 Ticket 3 1 2 3 2 3 4 Ticket 4 1 2 3 3 4 5 For example. The number of bad events is 9 · 5 · 82 + 9 · 5 · 8 + 9. . A2 = the event that Finns are not represented. while Ticket 3 loses. 1 2 11 2 −3· + = . . by inclusion-exclusion. (c) 10 3 7 2 6−10 . and wife3 . if the numbers 1. The ﬁrst term is the number of 3 4 numbers in which a digit appears 3 times. each containing 3 numbers from 1. . 2. Ticket 2. i = 1. choose 3 . 2. 2. The complement is the union of the three events Ai = {couple i sits together}. couple2 . Solutions to problems 1. 3. for example. 6. 5 5 15 15 3. then Ticket 1. Now. 3. 17 are drawn. pick a seat for husband3 . 50.3 AXIOMS OF PROBABILITY 21 6. then the ordering of seats within each of couple1 and couple2 .

10 · 54 P (A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5 ∪ A6 ) = 6 · 6 10 . a hit is shorthand for a chosen number. 6 (b) (a) Now. Ai is the event that the number i appears exactly once. Below. (a) Let Ai be the event that the number i appears exactly 6 times. and the last term is the number of numbers in which a digit appears 5 times. P (ticket 1 wins) = P (two hits on each line) + P (two hits on one line. but no digit appears 5 times. three on the other) 3 · 3 · 44 + 2 · 3 402 = = 50 . P (A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5 ∪ A6 ) = 6P (A1 ) − + − + − 6 2 6 3 6 4 6 5 6 6 P (A1 ∩ A2 ) P (A1 ∩ A2 ∩ A3 ) P (A1 ∩ A2 ∩ A3 ∩ A4 ) P (A1 ∩ A2 ∩ A3 ∩ A4 ∩ A5 ) P (A1 ∩ A2 ∩ A3 ∩ A4 ∩ A5 ∩ A6 ) 59 610 48 610 37 610 26 610 1 610 = 6 · 10 · 6 2 6 + 3 6 − 4 6 + 5 − 0. 50 5 5 . then ﬁll the remaining position. As Ai are pairwise disjoint. The second term is the number of numbers in which a digit appears 4 times.3 AXIOMS OF PROBABILITY 22 positions ﬁlled by it. 5. By inclusion-exclusion. The answer then is 1− 9· 5 3 · 82 + 9 · 95 5 4 ·8+9 . − · 10 · 9 · · 10 · 9 · 8 · · 10 · 9 · 8 · 7 · · 10 · 9 · 8 · 7 · 6 · 6.

5 all hit) = (4 45 2 + 4 · 45 + 1) + 45 50 5 = 4186 50 5 . and P (ticket 3 wins) = P (2. 3) = 3· 47 3 + 50 5 47 2 = 49726 50 5 . P (ticket 4 wins) = P (3 hit. ﬁnally. and. 4 both hit and one of 2. at least one additional hit on each line) + P (1. 3 both hit) + P (1. 2. 2. 3) + P (three hits among 1.3 AXIOMS OF PROBABILITY 23 and P (ticket 2 wins) = P (two hits among 1. 3 hit) = 48 3 +2· 50 5 46 2 = 19366 50 5 . 4. 2.

. the cubes have no other distinguishing characteristics. Here is a question asked on Wall Street job interviews. the odds will change. We observed that P (R) = 5/11. the game in this case) is known. You are given the choice between an unconditional and a conditional probability of death. 2 are red and 2 are blue. Before you start your experiment. the game starts and at half-time your team leads by 25 points.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 24 4 Conditional Probability and Independence Example 4.8. For the probability of a red cube. P (F ) P (the selected ball is fuzzy) This conveys the idea that with additional information the probability must be adjusted. Here’s a gun.1. Again. You cannot expect that the bookies’ odds will remain the same and they change. Say bookies estimate your basketball team’s chances of winning a particular game to be 0. I put a gun to your head and pull the trigger. For the general deﬁnition. This is common in real life. The conditional probability of A given B equals P (A|B) = P (A ∩ B) . B. Other than color and fuzziness. all empty. Out of the 7 fuzzy ones. The former. so we introduce it here: P (R|F ) = 3/7. and assume that P (B) > 0. but forget to wear gloves. Assume that you have a bag with 11 cubes. I close the barrel and spin it. Lucky you! Now I’m going to pull the trigger one more time. S that the selected cube is red. the probability should now change to 3/7! Your experiment clearly has 11 outcomes. Clearly. you reach into the bag. Now. the macabre tone is not unusual for such interviews. blue. F . Then. 3 are red and 4 are blue. Note that this also equals P (the selected ball is red and fuzzy) P (R ∩ F ) = . Which would you prefer: that I spin the barrel ﬁrst or that I just pull the trigger?” Assume that the barrel rotates clockwise after the hammer hits and is pulled back. we do not have notation.3. You plan to pick a cube out of the bag at random. say. it becomes known that your team’s star player is out with a sprained ankle. So. P (B) Example 4. take events A and B. 0 or 1. and smooth. six chambers. conditioned on it being fuzzy. (This is the original formulation. Click. say to 0. and notice it is fuzzy (but you do not take it out or note its color in any other way). Here’s the barrel of the gun. Consider the events R. all probabilities are trivial. the probability that the selected cube is red is 5/11. Now watch me as I put two bullets into the barrel. respectively. when complete information (that is.) “Let’s play a game of Russian roulette. a revolver. fuzzy.6. the outcome of your experiment. You are tied to your chair.2. 24 hours before the game starts. grab a cube. there are 5 red and 6 blue cubes. into two adjacent chambers. Finally. to 0. however. 7 of which have a fuzzy surface and 4 are smooth. out of 4 smooth ones. Two hours before the game starts.

Somebody tells you that you tossed at least one Heads. which is next to a bullet in the counterclockwise direction. Toss two fair coins.5. Example 4. If you choose 7 tosses out of 10 at random. An urn contains 10 black and 10 white balls. if the trigger is pulled without the extra spin. etc. we can use P (A1 ∩ A2 ) = P (A1 )P (A2 |A1 ). Conditional probabilities are sometimes given. and acting on events A. For a ﬁxed condition B. the conditional probability Q(A) = P (A|B) satisﬁes the three axioms in Chapter 3. P (ﬁrst toss H|exactly 7 H’s) = 9 6 10 7 · 1 210 · 21 10 = 7 . B = {at least one H}. and equals 1/4. equals the probability that the hammer clicked on an empty slot.7386.3. or can be easily determined. Toss a coin 10 times. the answer is. and (b) with replacement. 3 Example 4.4. 88 Clearly. Draw 3 (a) without replacement. and P (A|B) = P (A ∩ B) P (both H) = = P (B) P (at least one H) 1 4 3 4 1 = . after canceling 9 6 1 . the answer should be a little larger than before. remains 1/3. What is the probability that both tosses are Heads? Here A = {both H}. Example 4. what is the probability that your ﬁrst toss is Heads? For (a).) Thus. blindfolded. What is the probability that all three are white? We already know how to do part (a): . because this condition is more advantageous for Heads. especially in sequential random experiments. Q is another probability assignment and all consequences of the axioms are valid for it. The latter.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 25 if the barrel is spun again. 210 9 8 10 9 9 7 10 8 9 9 10 10 + + + + + + 10 7 = 65 ≈ 0. 10 Why is this not surprising? Conditioned on 7 Heads. (b) that at least 7 Heads are tossed. P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ). Then. the ﬁrst toss is included in your 7 choice with probability 10 . If you know (a) that exactly 7 Heads are tossed. they are equally likely to occur on any given 7 tosses. For (b). (This is routine to check and the reader who is more theoretically inclined might view it as a good exercise.

∪ Fn = Ω. 3. 2 2. 10 3 2. Number of ways to select 3 balls out of 10 white ones: Our probability is then . imagine drawing the balls sequentially. . . ∪ Fn )) = P (A ∩ Ω) = P (A) We call an instance of using this formula “computing the probability by conditioning on which of the events Fi happens. First Bayes’ formula. . roll 1 die. Number of outcomes: 20 3 26 . If you toss Tails. . P (3rd ball is white|1st two picked are white) = Our probability is. . Then. 1 2 · 9 19 · 8 18 . there are just two events Fi . P (A) = P (F1 )P (A|F1 )+P (F2 )P (A|F2 )+. ∪ (A ∩ Fn )) = P (A ∩ (F1 ∪ . 2 Theorem 4.6. If you toss Heads. we are computing the probability of the intersection of the three events: P (1st ball is white. (20) 3 To do this problem another way. . . . where the previous colors of the selected balls 3 do not aﬀect the probabilities at subsequent stages. Quite often.” The formula is useful in sequential experiments. the product before. which equals. an event F and its complement F c . . what we obtained This approach is particularly easy in case (b). Then. Flip a fair coin. is 1 . therefore. . Proof. that is. Example 4. + P (Fn )P (A|Fn ) = P (A ∩ F1 ) + . . The relevant probabilities are: 1. .1. . The answer. P (1st ball is white) = 1 . . roll 2 dice. . Assume that F1 . 2nd ball is white. (10) 3 . as it must. 8 18 . when you face diﬀerent experimental conditions at the second stage depending on what happens at the ﬁrst stage. P (2nd ball is white|1st ball is white) = 9 19 . for an event A. exactly one of them always happens. and 3rd ball is white). . then.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 1. . Fn are pairwise disjoint and that F1 ∪ . Compute the probability that you roll exactly one 6. P (F1 )P (A|F1 ) + P (F2 )P (A|F2 ) + . and we are thus conditioning on whether F happens or not. + P (A ∩ Fn ) = P ((A ∩ F1 ) ∪ . that is.+P (Fn )P (A|Fn ) .

Therefore. pk.0 = 0. pk. and P (A|F2 ) = n−r+1 . for k > 0. (52) i +1− 48 3 52 3 +1− 48 4 52 4 +1− 48 5 52 5 +1− 48 6 52 6 . the probability n that the newcomer does not duplicate any of the existing r − 1 birthdays. for i = 1. . Let F3 be the event that any other number of birthdays occurs with k − 1 people. makes the computation fast and precise. the probability that the newcomer duplicates one of the existing r birthdays. 1. If A is the 6 event that you get at least one Ace. 6. P (Fi ) = 1 . In general.r be the probability that exactly r (out of a total of n) birthdays are represented among k people. P (A|F3 ) = 0. . as the r newcomer contributes either 0 or 1 new birthdays. P (A|F1 ) = n .0 = 1.7. Note that pk. we call the event A. As promised.r = P (A) = P (A|F1 )P (F1 ) + P (A|F2 )P (F2 ) = r n−r+1 · pk−1. then select at random. revisited. P (A|F c ) = 2·5 . which could be Heads (event F ) or Tails 1 (event F c ). Coupon collector problem. .r + · pk−1. 2. together with the boundary conditions p0.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 27 Here you condition on the outcome of the coin toss. then the k’th person arrives on the scene. Clearly.8. P (A|Fi ) = 1 − Therefore. Here is how it works in this example.r = 0. pk. P (A|F1 ) = 1 13 . whereby you (1) reinterpret the problem as a sequential experiment and (2) use Bayes’ formula with “conditions” Fi being relevant events at the ﬁrst stage of the experiment. as many cards from the deck as the number shown on the die. Moreover. 9 Example 4. r ≥ 1. We will ﬁx n and let k and r be variable. for 0 ≤ k < r. Example 4. without replacement. we will develop a computationally much better formula than the one in Example 3. P (F ) = P (F c ) = 1 and 36 2 so 2 P (A) = P (F )P (A|F ) + P (F c )P (A|F c ) = . Clearly. Let pk. Roll a die.n is what we computed by the inclusion-exclusion formula. then P (A|F ) = 6 . Let F1 be the event that there are r birthdays represented among the k − 1 people and let F2 be the event that there are r − 1 birthdays represented among the k − 1 people. P (A) = 1 6 1 +1− 13 48 2 52 2 (48) i .9. for i ≥ 1. This will be another example of conditioning. n n for k. At the ﬁrst stage you have k − 1 people. If A = {exactly one 6}.r−1 . by Bayes’ formula. . and this. . What is the probability that you get at least one Ace? Here Fi = {number shown on the die is i}.

26. A test gives the correct diagnosis with probability of 0. What is the probability that the coin is fair? The relevant events are F = {fair coin}. and 2 4 P (B|U ) = 1. Then. and so P (M2 |D) = 0. P (D|M3 ) = 0. . the test will be positive with probability 0. Example 4.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 28 Theorem 4.9.002 0. and B = {both tosses H}.001 0. test it.5 Example 4.001.2. It comes out Heads both times. . . P (D|M1 ) = 0. P (Fj ) its prior probability. which always comes out Heads. M1 . if the person is sick. P (M1 ) = 0. that is.002. Assume 10% of people have a certain disease. Second Bayes’ formula. . and P (Fj |A) its posterior probability. which you then proceed to test. P (B|F ) = 1 . (Indeed.3 ≈ 0. .2 0. lightbulbs). Then.003 You pick an item. 0.2 + 0. this is the same as choosing the item from a large number of them and testing it.11. M2 and M3 . Then P (F ) = P (U ) = 1 (as each coin is chosen with equal probability).8.3.2. Choose one at random. any made item is defective 0. P (M2 ) = 0.1. + P (A|Fn )P (Fn ) An event Fj is often called a hypothesis.5. but some are defective.5 prob. What is the probability that it was made by machine M2 ? The best way to think about this random experiment is as a two-stage procedure.8.001 · 0. .003. the test will be positive with probability 0.2.3 + 0. Our probability then is 1 2 1 1 2 · 4 1 1 4 + 2 · 1 = . but if the person is not sick. and ﬁnd it is defective.) Let D be the event that an item is defective and let Mi also denote the event that the item was made by machine i. U = {unfair coin}. Let F1 . 5 ·1 Example 4. toss it twice. Moreover. that machine produces an item. Then P (Fj |A) = P (Fj ∩ A) P (A|Fj )P (Fj ) = . A factory has three machines.10. A random person from . P (D|M2 ) = 0.003 · 0. Here are the known numbers: machine M1 M2 M3 proportion of items made 0. We have a fair coin and an unfair coin. P (A) P (A|F1 )P (F1 ) + .002 · 0. P (M3 ) = 0.3 0.002 · 0. that produce items (say. It is impossible to tell which machine produced a particular item. Fn and A be as in Theorem 4. First you choose a machine with the probabilities given by the proportion.

Thus. So. C. with stated probabilities. she is among the abused women.8. 430 were killed by partners. or not. 8. P (T |S)P (S) + P (T |H)P (H) 26 Note that the prior probability P (S) is very important! Without a very good idea about what it is. Alan Dershowitz. 1995. Dershowitz states that “a considerable number” of wife murderers had previously assaulted them. P (T |H) = 0. One of the many issues was whether Simpson’s history of spousal abuse could be presented by prosecution at the trial.1. pg.8. c )P (A|M c ) P (M )P (A|M ) + P (M 36. O.” i.) By Bayes’ formula. 936 in 1992. we will (conservatively) say that P (A|M ) = 0. it is not 0. . to use as a sole argument that the evidence is not probative. The ﬁnal number we need is P (A|M ).9. Simpson was on trial in Los Angeles for murder of his wife and her boyfriend..5. Simpson’s ﬁrst trial. Nevertheless. a famous professor of law at Harvard and a consultant for the defense.2 and P (T |S) = 0. As J. These numbered 4.71. J. F. Caulkins pointed out in the journal Chance (Vol. at the second stage. P (M c ) = 0. H = {healthy}. we can say that P (A|M c ) = 0.05. it had some evidentiary value. 1995. P (S) = 0. Example 4.29. Now. In other words. although “some” did not. or not.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 29 the population has tested positive for the disease. T = {tested positive}. We are interested in P (S|T ) = P (T |S)P (S) 8 = ≈ 31%. It was also commonly estimated at the time that about 5% of the women had been physically abused by their husbands.1% of men who abuse their wives end up killing them. (The two-stage experiment then is: choose a murdered woman at random. if we let A = {the (murdered) woman was abused by the partner}. and what we are interested in is the posterior probability P (M |A). 14). The famous sports star and media personality O. a positive test result is diﬃcult to evaluate: a positive test for HIV would mean something very diﬀerent for a random person as opposed to somebody who gets tested because of risky behavior. M = {the woman was murdered by the partner}. compared to the prior probability of 29%. we can probably say that 80% is considerably too high. Merz and J. then we estimate the prior probabilities P (M ) = 0.8!) Let us deﬁne the three relevant events: S = {sick}. she is murdered by her partner. citing the statistics that < 0. with probabilities depending on the outcome of the ﬁrst stage.e. J. at the ﬁrst stage. whether this history was “probative. as there is no reason to assume that a woman murdered by somebody else was more or less likely to be abused by her partner. P (M |A) = P (M )P (A|M ) 29 = ≈ 0. What is the probability that he is actually sick? (No. that is. this was the wrong probability to look at! We need to start with the fact that a woman is murdered. or whether it was merely “prejudicial” and should be excluded.12. was claiming the latter. out of which 1.1 The law literature studiously avoids quantifying concepts such as probative value and reasonable doubt. P (H) = 0.

Let A = {card is an Ace} and R = {card is red}. Example 4. Also. Are F = {ﬁrst card is an Ace} and S = {second card is an Ace} independent? Now P (F ) = P (S) = 1 13 and P (S|F ) = 3 51 . . 4 at random. The two events are independent — the proportion of aces among red cards is the same as the proportion among all cards. it can be shown that. 3}. Assuming that P (B) > 0. .4 CONDITIONAL PROBABILITY AND INDEPENDENCE 30 Independence Events A and B are independent if P (A ∩ B) = P (A)P (B) and dependent (or correlated ) otherwise. Now. but not independent.15. 4}. Are A and R independent? 1 1 2 1 We have P (A) = 13 . where 1 ≤ i1 < i2 < . . Again. 2}. Roll a four sided fair die. that is. C = {1. Toss 2 fair coins and let F = {Heads on 1st toss}. independence is an assumption and it is the most important concept in probability. . Pick a random card from a full deck. Example 4. pick two cards out of the deck sequentially without replacement. 3. P (A|B) = P (A). Quite often. . so they are not independent. as there are two red Aces. You will notice that here the independence is in fact an assumption. so the probability of A is unaﬀected by knowledge that B occurred.13. S = {Heads on 2nd toss}. ∩ Aik ) = P (Ai1 )P (Ai2 ) · · · P (Aik ). P (A ∩ B c ) = P (A) − P (A ∩ B) = P (A) − P (A)P (B) = P (A)(1 − P (B)) = P (A)P (B c ). and. How do we deﬁne independence of more than two events? We say that events A1 . we can replace an Ai by Ac and the events remain independent. . A2 . < ik ≤ n are arbitrary indices. Let A = {1. An are independent if P (Ai1 ∩ . if A and B are independent.14. P (R) = 2 . . P (A ∩ R) = 52 = 26 . . so A and B c are also independent — knowing that B has not occurred also has no inﬂuence on the probability of A. Check that these are pairwise independent (each pair is independent). These are independent. B = {1. . i Example 4. one could rewrite the condition for independence. 2. Another fact to notice immediately is that disjoint events with nonzero probability cannot be independent: given that one of them happens. the other cannot happen and thus its probability drops to zero. in such a collection of independent events. The occurrence of any combination of events does not inﬂuence the probability of others. choose one of the numbers 1.

In other words. • If you do not roll 6 and your friend tosses Heads. This is very common in problems such as this! You can avoid the nuisance.16. What is the probability that you win? One way to solve this problem certainly is this: P (win) = P (win on 1st round) + P (win on 2nd round) + P (win on 3rd round) + . .. and P (A ∩ B) = P (A ∩ C) = P (B ∩ C) = 1 1 = . and therefore P (W ) = P (W |D). Let D = {game is decided on 1st round}. as well as between diﬀerent tosses and rolls. but. your friend tosses a coin. you lose outright. by the following trick. You roll a die. .4 CONDITIONAL PROBABILITY AND INDEPENDENCE 1 2 31 1 4 Indeed. = 1 + 6 5 1 · 6 2 1 + 6 5 1 · 6 2 2 1 + . 6 and then we sum the geometric series. provided that the game is not decided in the 1st round. you are thereafter facing the same game with the same winning probability. there is a very good reason to conclude so immediately. P (A) = P (B) = P (C) = independence follows. W = {you win}. • If you roll 6. however. Example 4. Dc and W are independent and then so are D and W .. thus P (W |Dc ) = P (W ). . which one can certainly check by computation. However. in fact. so we have complete information on the occurrence of C as soon as we know that A and B both happen. • If neither. 4 8 and pairwise P (A ∩ B ∩ C) = The simple reason for lack of independence is A ∩ B ⊂ C. you win outright. The events D and W are independent. The crucial observation is that.. Important note: we have implicitly assumed independence between the coin and the die. the game is repeated until decided.

11}.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 32 This means that one can solve this problem by computing the relevant probabilities for the 1st round: 1 P (W ∩ D) 2 P (W |D) = = 1 651 = . 3 4 4 2 36 ). • If S ∈ {7. Otherwise. 5 3 3 10 (probability 36 ). • If S ∈ {2. then win with probability 4+6 = 5 . in which case you win. Many casinos allow you to bet even money on the following game. What is the winning probability? Let us look at all possible ways to win: 1. a decent game by casino standards. 5 5 5 8 (probability 36 ).17. 5. 10}. in which case you lose. . • If S ∈ {4. you win immediately. then win with probability 3 36 3 6 + 36 36 = 3 3+6 = 1.4929. • you roll a 4 (probability • you roll a 5 (probability • you roll a • you roll a • you roll a • you roll a 3 36 ). then win with probability 5+6 = 11 . 8 36 . then win with probability 4+6 = 2 . 5 5 5 6 (probability 36 ). – 7 appears. you lose immediately. You win on the ﬁrst roll with probability 2. 6. the pair of dice is rolled repeatedly until one of the following happens: – S repeats. P (D) 7 6 + 62 which is our answer. 9. Two dice are rolled and the sum S is observed. 4 4 9 (probability 36 ). then win with probability 3+6 = 1 . 3. Craps. 3 Using Bayes’ formula. Example 4. P (win) = 8 +2 36 3 1 4 2 5 5 · + · + · 36 3 36 5 36 11 ≈ 0. 8. 12}. then win with probability 5+6 = 11 .

18. A machine produces items which are independently defective with probability p. P (exactly two items among the ﬁrst 6 are defective) = 6 2 p2 (1 − p)4 . then successes k must occur on k trials in S and failures on all n − k trials not in S. as exactly k successes cannot occur on two diﬀerent such sets. is pk (1 − p)n−k . In a sequence of n Bernoulli trials.m = P (ﬁrst trial is a success) · P (n − 1 successes before m failures) + P (ﬁrst trial is a failure) · P (n successes before m − 1 failures) = p · pn−1. Problem of Points. i. the number of successes is ≥ n) n+m−1 = k=n n+m−1 k p (1 − p)n+m−1−k . but it needs to be pointed out that computationally this is not the best formula.m−1 . valid for m.m . Let us compute a few probabilities: 1. Then. failure with probability 1 − p. on any S ⊂ {1. pn. . pn. n ≥ 1. Example 4. P (at least one item among the ﬁrst 6 is defective) = 1 − P (no defects) = 1 − (1 − p)6 3.m + (1 − p) · pn. It is much more eﬃcient to use the recursive formula obtained by conditioning on the outcome of the ﬁrst trial.m = 1. . Together with boundary conditions. exactly 5 items among 1st 99 defective) = p · 99 5 p (1 − p)94 . Let us call this probability pn. Assume m. each of which is a success with probability p and. n} with cardinality k. pn. if we ﬁx such an S. P (exactly 100 items are made before 6 defective are found) equals P (100th item defective. p0. This involves ﬁnding the probability of n successes before m failures in a sequence of Bernoulli trials. . P (at least 2 items among the ﬁrst 6 are defective) = 1 − (1 − p)6 − 6p(1 − p)5 4.e.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 33 Bernoulli trials Assume n independent experiments.. There are n such subsets. k This is because the successes can occur on any subset S of k trials out of n. thus. . the probability that this happens. by the assumed independence.0 = 0. n ≥ 1. These possibilities are disjoint.19. k The problem is solved. 2.m = P (in the ﬁrst m + n − 1 trials. 5 Example 4. P (exactly k successes) = n k p (1 − p)n−k . .

each originally containing n matches. Compute (a) the probability of event A that the two scores are equal and (b) the probability of event B that your friend’s score is strictly larger than yours. A and B. 2 We have P (A wins the series) = P (4 successes before 3 failures) 6 = k=4 6 k 1 2 6 = 15 + 6 + 1 ≈ 0. you independently decide. k 214 Example 4. play a series of games and that the ﬁrst team that wins four games is the overall winner of the series. upon reaching for a box and ﬁnding it empty. As it happens. n from matchbox 1 and n − k from matchbox 2. You roll a die and your score is the number shown on the die. 0 ≤ k ≤ n. Each day. . Therefore. he is equally likely to take it from either box. A mathematician carries two matchboxes. Assume that two equally matched teams. you do nothing. Otherwise.23.20. there are exactly k matches still in the other box? Here. Example 4.22. matchbox 2 contains k matches.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 34 which allows for very speedy and precise computations for large m and n.21. Best of 7 . 26 Example 4. Let A1 be the event that matchbox 1 is the one discovered empty and that. For (b). Each time he needs a match. our probability is 1 2n − k 1 1 2n − k = . team A lost the ﬁrst game. the probability of getting Heads is p/2 independently each day. (a) What is the probability of getting exactly 10 Heads in the ﬁrst 20 days? (b) What is the probability of getting 10 Heads before 5 Tails? For (a). he has used n + n − k matches. so the answer is 20 10 p 2 10 1− p 2 10 . with probability p. 2 · P (A1 ) = 2 · 2 n n 22n−k 22n−k Example 4. What is the probability it will win the series? Assume that the games are Bernoulli trials with success probability 1 . What is the probability that. Banach Matchbox Problem. at that instant. Before this point. you can disregard days at which you do not ﬂip to get 14 k=10 14 1 . Your friend rolls ﬁve dice and his score is the number of 6’s shown.3438. to ﬂip a fair coin. This means that he has reached for box 1 exactly n times in (n + n − k) trials and for the last time at the (n + 1 + n − k)th trial.

if you picked a 9 the ﬁrst time. . by the ﬁrst Bayes’ formula.9 (and says it is good with probability 0. Pick one card at random from a full deck of 52 cards. be the event that your friend’s score is i. but we will soon learn why the sum has to equal 5 .) If you get at least one Ace. Consider the following game. then the method says it is defective with probability 0. . You have a method of testing whether the item is defective. 6 Problems 1. (For example. without replacement) that number of additional cards from the deck. If the number is 7 or less.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 35 In both cases we will condition on your friend’s score — this works a little better in case (b) than conditioning on your score. You have tested all of them and the tests detect no defects. we get 6 5 P (A) = i=1 P (Fi ) · 1 1 1 55 = (1 − P (F0 )) = − 6 ≈ 0. but it does not always give you correct answer. and so 5 P (B) = i=1 P (Fi ) · 5 i=1 5 i=1 5 i−1 6 5 1 = 6 = = = 1 6 1 6 1 i · P (Fi ) − 6 i · P (Fi ) − i· 5 i 1 6 P (Fi ) i=1 1 55 + 6 66 i i=1 5 6 5−i − 1 55 + 6 66 1 5 1 55 · − + ≈ 0. .8). and J count as 10). Then. A box contains 3 items.3. What are your chances of winning this game? 2. What is the probability that none of the 3 items is defective? . If not. Q. If the tested item is good. P (B|Fi ) = i−1 6 if i ≥ 2 and 0 otherwise. Let Fi . P (A|Fi ) = 1 if i ≥ 1 and P (A|F0 ) = 0. you win outright. you select (at random. . then you look at the value of the card (K. 5. Then. the method detects the defect with probability 0. you win.2 (and gives the right answer with probability 0. you select 9 more cards. If you pull an Ace.0997. 6 6 6 66 The last equality can be obtained by computation. i = 0. If the tested item is defective. you lose outright. 6 6 6 6 Moreover. Otherwise. An item is defective (independently of other items) with probability 0.0392.1).

You have 16 balls.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 36 3. . You also have 3 urns. then do the same for the second one. You have 5 eggs. open the ﬁrst one and see if it has a toy inside. . F10 = {10. All Ai are chosen so that all 2n choices are equally likely. P (W ) = 4 4 + 52 52 1− 47 8 51 8 47 8 51 8 47 9 51 9 47 10 51 10 . etc. . F9 = {9 ﬁrst time}. . . P (W |F8 ) = 1 − P (W |F9 ) = 1 − P (W |F10 ) = 1 − and so. + 4 52 1− 47 9 51 9 + 16 52 1− 47 10 51 10 . Solutions to problems 1. or K ﬁrst time}. Are E1 and E2 independent? 4. Q. Ar ⊂ U . (Urns are large enough to accommodate any number of balls. and 9 red. For each of the 16 balls. Assume that you have an n–element set U and that you select r independent random subsets A1 . Let F = {none is defective} and A = {test indicates that none is defective}. Compute P (E1 ) and P (E2 ). let W be the event that you win.) (a) What is the probability that no urn is empty? (b) What is the probability that each urn contains 3 red balls? (c) What is the probability that each urn contains all three colors? 5. independently of other eggs. Let F1 = {Ace ﬁrst time}. J. 4 green. 2. By the second Bayes’ formula. A chocolate egg either contains a toy or is empty. . . Then P (W |F1 ) = 1. Assume that each egg contains a toy with probability p. 3 blue. Also. Let E1 be the event that you get at least 4 toys and let E2 be the event that you get at least two toys in succession. you select an urn at random and put the ball into it. F8 = {8 ﬁrst time}. Compute (in a simple closed form) the probability that the Ai are pairwise disjoint.

the result is: 9! 9! 3!3!3! = . . P (A1 ) = P (A2 ) = P (A3 ) = 2 3 16 4 2 p2 (1 − p)3 − . 315 .7 · 0. 4 −3 −3 . P (A1 ∪ A2 ∪ A3 ) = and 216 − 1 . Hence. 1 3 16 P (A1 ∩ A2 ) = P (A1 ∩ A3 ) = P (A2 ∩ A3 ) = P (A1 ∩ A2 ∩ A3 ) = 0. by inclusion-exclusion.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 37 P (F |A) = = = P (A ∩ F ) P (A) (0. P (E1 ) = 5p4 (1 − p) + p5 = 5p4 − 4p5 and P (E2 ) = 1 − (1 − p)5 − 5p(1 − p)4 − p3 (1 − p)2 . 39 8 · 312 (c) As P (at least one urn lacks blue) = 3 P (at least one urn lacks green) = 3 P (at least one urn lacks red) = 3 2 3 2 3 2 3 9 3 −3 4 1 3 1 3 1 3 3 .8)3 (0. E1 and E2 are not independent.7 · 0. 315 (b) We can ignore other balls since only the red balls matter here. 9 .8 + 0. P (no urns are empty) = 1 − P (A1 ∪ A2 ∪ A3 ) 216 − 1 = 1− . 3. Hence. 4.1)3 56 59 3 . As E1 ⊂ E2 .3 · 0. (a) Let Ai = the event that the i-th urn is empty.

. This is the same as choosing an r × n matrix in which every entry is independently 0 or 1 with probability 1/2 and ending up with at most one 1 in every column. Since columns are n independent. by independence.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 38 we have. this gives ((1 + r)2−r ) . P (each urn contains all 3 colors) = 1− 1− 1− 3 3 3× 2 3 2 3 3 −3 4 1 3 1 3 3 × 4 −3 2 3 9 × 1 3 9 −3× . 5.

You buy 5 eggs. Ten fair dice are rolled. Otherwise he selects. or blue (again. 1. What is the probability that: (a) At least one 1 appears. independently of other eggs. Assume that each egg contains a toy with probability p ∈ (0. independently of other toys). A chocolate egg either contains a toy or is empty. 2. while the number 4 appears four times. What is the probability that he rolled a 6 on the die? 4. If he rolls 3 or less. Each toy is.. 3. Let E1 be the event that you get at most 2 toys and let E2 be the event that you get you get at least one red and at least one white and at least one blue toy (so that you have a complete collection). (a) Compute the winning probability for this game. 3 appears at least once. (c) Are E1 and E2 independent? Explain. Five married couples are seated at random around a round table. he loses immediately. with equal probability. white. The player wins if all four Aces are among the selected cards. red.e. at random. 1). 2. (b) Smith tells you that he recently played this game once and won.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 39 Interlude: Practice Midterm 1 This practice exam covers the material from the ﬁrst four chapters. . A player rolls a fair die. Give yourself 50 minutes to solve the four problems. every husband-wife pair occupies adjacent seats). (b) Each of the numbers 1. (a) Compute P (E1 ). Why is this probability very easy to compute when p = 1/2? (b) Compute P (E2 ). 3 appears exactly twice. (b) Compute the probability that at most one wife does not sit next to her husband. Consider the following game. (c) Each of the numbers 1. as many cards from a full deck as the number that came up on the die. (a) Compute the probability that all couples sit together (i. 2. which you may assume have equal point score.

We know the following: P (A1 ) = P (A2 ) = P (A3 ) = 5 6 10 . . 3 appears exactly twice. Solution: 10 2 8 6 2 2 610 = 23 10! .4 CONDITIONAL PROBABILITY AND INDEPENDENCE 40 Solutions to Practice Midterm 1 1. (b) Each of the numbers 1. 2. Solution: 5 6 10 1 − P (no 1 appears) = 1 − . 3 appears at least once. What is the probability that: (a) At least one 1 appears. · 4! · 610 (c) Each of the numbers 1. . Solution: Let Ai be the event that the number i does not appear. Ten fair dice are rolled. 4 6 10 P (A1 ∩ A2 ) = P (A1 ∩ A3 ) = P (A2 ∩ A3 ) = P (A1 ∩ A2 ∩ A3 ) = 3 6 10 . while the number 4 appears four times. 2.

iii. Treat each couple as a block and the remaining 8 seats as 4 pairs (where each pair is two adjacent seats). Moreover. every husband-wife pair occupies adjacent seats). 2. There are 4! ways to seat the remaining 4 couples into 4 pairs of seats. 3. B = B1 ∪ B2 ∪ B3 ∪ B4 ∪ B5 . 4. This is our sample space. Five married couples are seated at random around a round table. There are 24 ways to order each hi and wi within its pair of seats. Solution: Let A be the event that all wives sit next to their husbands and let B be the event 5 ·4! that exactly one wife does not sit next to her husband. There are 9! ways to order the remaining 9 people in the remaining 9 seats. where Bi is the event that wi . There are 2 ways to order w1 . Fix h1 onto one of the seats. respectively.. (a) Compute the probability that all couples sit together (i. Solution: Let i be an integer in the set {1. ii.e. Therefore. We know that P (A) = 2 9! from part (a). 2. iv. Denote each husband and wife as hi and wi . P (1. and 3 each appear at least once) = P ((A1 ∪ A2 ∪ A3 )c ) = 1 − P (A1 ) − P (A2 ) − P (A3 ) +P (A1 ∩ A2 ) + P (A2 ∩ A3 ) + P (A1 ∩ A3 ) −P (A1 ∩ A2 ∩ A3 ) = 1−3· 5 6 10 41 +3· 4 6 10 − 3 6 10 . 2. 9! (b) Compute the probability that at most one wife does not sit next to her husband. i.4 CONDITIONAL PROBABILITY AND INDEPENDENCE Then. our solution is 2 · 4! · 24 . 5}. v.

. he loses immediately. we need to determine P (B1 ). P (W |F4 ) = 1 52 4 5 4 52 4 6 4 52 4 . is 5· 3 · 4! · 24 . . The player wins if all four Aces are among the selected cards. (a) Compute the winning probability for this game. 9! 9! 3. where i = 1. iv. Then. . So. . Bi are disjoint and their probabilities are all the same. The player rolls a fair die. Moreover. 6. at random. P (B1 ) = Our answer.4 CONDITIONAL PROBABILITY AND INDEPENDENCE 42 does not sit next to hi and the remaining couples sit together. Consider each of remaining 4 couples and w1 as 5 blocks. Let Fi be the event that he rolls i. As w1 cannot be next to her husband. or 3. Consider the following game. 6 Since we lose if we roll a 1. vi. There are 9! ways to order the remaining 9 people in the remaining 9 seats. then. ii. as many cards from a full deck as the number that came up on the die. There are 24 ways to order the couples within their blocks. P (W |F5 ) = P (W |F6 ) = Therefore. P (Fi ) = 1 . There are 4! ways to order the remaining 4 couples. Fix h1 onto one of the seats. Otherwise he selects. If he rolls 3 or less. 2. iii. P (W |F1 ) = P (W |F2 ) = P (W |F3 ) = 0. Solution: Let W be the event that the player wins. v. i. . 9! Therefore. 5 6 + 4 4 . we have 3 positions for w1 in the ordering of the 5 blocks. . P (W ) = 1 1 · 6 52 4 1+ . 3 · 4! · 24 25 · 4! + .

Assume that each egg contains a toy with probability p ∈ (0. 2 When p = 1 . 1). or blue (again. 7 + 6 4 4. independently of other toys). Why is this probability very easy to compute when p = 1/2? Solution: P (E1 ) = P (0 toys) + P (1 toy) + P (2 toys) 5 2 = (1 − p)5 + 5p(1 − p)4 + p (1 − p)3 . 2 . What is the probability that he rolled a 6 on the die? Solution: P (F6 |W ) = = = = 1 6 · 1 · (52) 4 P (W ) 5 4 6 4 6 4 1+ 15 21 5 .4 CONDITIONAL PROBABILITY AND INDEPENDENCE 43 (b) Smith tells you that he recently played this game once and won. independently of other eggs. red. white. Let E1 be the event that you get at most 2 toys and let E2 be the event that you get you get at least one red and at least one white and at least one blue toy (so that you have a complete collection). 2 P (at most 2 toys) = P (at least 3 toys) = P (at most 2 eggs are empty) c Therefore. Each toy is. You buy 5 eggs. P (E1 ) = P (E1 ) and so P (E1 ) = 1 . (a) Compute P (E1 ). with equal probability. A chocolate egg either contains a toy or is empty.

and A3 the event that blue is missing. P (E2 ) = P ((A1 ∪ A2 ∪ A3 )c ) = 1−3· 1− p 3 5 5 +3· 1− 2p 3 − (1 − p)5 . Solution: 44 Let A1 be the event that red is missing. . (c) Are E1 and E2 independent? Explain. Solution: No: E1 ∩ E2 = ∅. A2 the event that white is missing.4 CONDITIONAL PROBABILITY AND INDEPENDENCE (b) Compute P (E2 ).

. m. f. the set of outcomes for which X ∈ B is an event. . Sometimes X is added as the subscript of its p. a random variable X is a real-valued function on Ω. Let X be value of the NASDAQ stock index at the closing of the next business day. For any interval B. Let X be the number of Heads in 2 fair coin tosses. y) : 0 ≤ x. . f.1. .. Choose a random point in the unit square {(x. so we are able to talk about its probability. Their probabilities are: P (X = 0) = 1 . . Determine its p. 2. P (X ∈ B). 4 2 and P (X = 2) = 1 . 20. . Select 5 balls at random. but may not even be deﬁned on the same set of outcomes. f. and p(xi ) = P (X = xi ) with i = 1. A discrete random variable X has ﬁnitely or countably many values xi . P (X = 1) = 1 . and the probability that at least one of the selected numbers is 15 or more. is called the probability mass function of X.3. For all i. . . 2. As X must have some value. but X and Y are far from being the same random variable! In general. that is. xi ∈B p(xi ). p = pX . Sometimes. f. . pX = pY . 4. but this will not occur in this chapter. y ≤ 1} and let X be its distance from the origin. .. the space of outcomes: X : Ω → R. −∞. An urn contains 20 balls numbered 1. A probability mass function p has the following properties: 1. The crucial theoretical property that X should have is that. Possible values of X are 0. Choose a random person in a class and let X be the height of the person. without replacement. we also allow X to have the value ∞ or. Example 5. i p(xi ) = 1.. Mathematically. .5 DISCRETE RANDOM VARIABLES 45 5 Discrete Random Variables A random variable is a number whose value depends upon the outcome of a random experiment. Determine its p. Toss a coin 10 times and let X be the number of Heads. when convenient. for each interval B. Here are some examples of random variables. Example 5. 4 You should note that the random variable Y . Random variables are traditionally denoted by capital letters to distinguish them from deterministic quantities.2. . 3. m. 2. 1. which counts the number of Tails in the 2 tosses. m. P (X ∈ B) = 3. f. Example 5. 1. m. random variables may have the same p. in inches. m. Let X be the largest number among selected balls. 2. has the same p. i = 1. p(xi ) > 0 (we do not list values of X which occur with probability 0). more rarely. and 2.

Assume that X is a discrete random variable with possible values xi . 2. . The possible values are 5. Eg(X) = i g(xi )P (X = xi ). f. 3 white. Let us start 3 with 0. This can occur with one ball of each color or with 3 blue balls: P (X = 0) = 3·3·5+ 11 3 5 3 = 55 . Example 5. or mean.4. then. of X. of X is EX = i xi P (X = xi ) = i xi p(xi ). without replacement. the expected value. a single outcome (3 red balls) produces X = 3: P (X = −3) = P (X = 3) = 1 11 3 3 2 11 3 5 1 = 15 . . we can have 2 red and 1 white. and 5 blue balls. average. 165 To get X = 1. f. also called expectation. Take out 3 balls at random. Then. . Next. or 1 red and 2 blue: P (X = 1) = P (X = −1) = 3 2 3 1 + 11 3 3 1 5 2 = 39 . 20 i−1 4 20 5 . −1. m. 1. An urn contains 11 balls. P (at least one number 15 or more) = P (X ≥ 15) = i=15 P (X = i). 2 . to get X = 2 we must have 2 red balls and 1 blue: P (X = −2) = P (X = 2) = Finally.5 DISCRETE RANDOM VARIABLES 20 5 46 outcomes. You win $1 for each red ball you select and lose a $1 for each white ball you select. X can have values −3. note that we have and. . The number of outcomes is 11 . . To determine the p. the amount you win. 165 All the seven probabilities should add to 1. which can be used either to check the computations or to compute the seventh probability (say. 20. i = 1. . m.. and 3. .. P (X = 0)) from the other six. −2. g : R → R. 0. 3 red. 165 The probability that X = −1 is the same because of symmetry between the roles that the red and the white balls play. 165 = 1 . For any function. Determine the p. P (X = i) = Finally.

5 DISCRETE RANDOM VARIABLES 47 Example 5. The average value of X then should be 1 · 0.5. What is the expected value of X? We can. i. We will give another. should the average of X be? Imagine a large number n of repetitions of the experiment and measure the realization of X in each.3 + 3 · 0. By the frequency interpretation of probability.3n will have X = 2. Instead. and P (X = 3) = 0. P (X = 2) = 0. called linearity: E(α1 X1 + α2 X2 ) = α1 EX1 + α2 EX2 . Then Var(X) = E[(X − µ)2 ] = E[X 2 − 2µX + µ2 ] = E(X 2 ) − 2µE(X) + µ2 = E(X 2 ) − µ2 = E(X 2 ) − (EX)2 In computations.3. valid for any random variables X1 and X2 and nonrandom constants α1 and α2 . What. of course. how “spread-out” is the p. n which of course is the same as the formula gives. Let X be a random variable with P (X = 1) = 0.3n + 3 · 0. m. more convenient. and about 0. but let us instead proceed intuitively and see that the deﬁnition makes sense. f.2n + 2 · 0. . The only problem with this is that absolute values are annoying. formula for variance that will use the following property of expectation.2n realizations will have X = 1. about 0.5n = 1 · 0. Take a discrete random variable X and let µ = EX.5. then. about 0. bear in mind that variance cannot be negative! Furthermore.5n will have X = 3. Here is the summary: The variance of a random variable X is Var(X) = E(X −EX)2 = E(X 2 )−(EX)2 . just use the formula. the only way that a random variable has 0 variance is when it is equal to its expectation µ with probability 1 (so it is not really random at all): P (X = µ) = 1. The quantity that has the correct units is the standard deviation σ(X) = Var(X) = E(X − µ)2 . of X? The most natural way would certainly be E|X − µ|.5 = 2.2 + 2 · 0.. How should we measure the deviation of X from µ. we deﬁne the variance of X Var(X) = E(x − µ)2 .2.3.e. This property will be explained and discussed in more detail later.

2. Properties: 1.29 = 0. Although simple. . Let X be the number shown on a rolled fair die. + 6 = . Var(X) ≈ 0. EIA = p. 6 6 2 91 7 35 Var(X) = − = . note that IA = IA ..7.2 Bernoulli random variable This is also called an indicator random variable.7810. Var(IA ) = p(1 − p).5 = 5. Compute Var(X). Assume that A is an event with probability p. and so Var(X) = 5. and Var(X). Previous example..9 − 5.9.. is given by IA = 1 0 if A happens. Properties: 1. E(X 2 ) = 12 · 0.. n Example 5.5 DISCRETE RANDOM VARIABLES 48 Example 5.6. . This is a standard example of a discrete uniform random variable and 7 1 + 2 + . Then. + 62 91 EX 2 = = . n x2 +.3)2 = 5.1 Uniform discrete random variable This is a random variable with values x1 . . 2 2 For the variance. otherwise.3 + 32 · 0.61 and σ(X) = We will now look at some famous probability mass functions. so that E(IA ) = EIA = p. such random variables are very important as building blocks for more complicated random variables. EX = x1 +. . 5. 6 2 12 5...2 + 22 · 0. xn . . continued. Other notations for IA include 1A and χA . . the indicator of A. . each with equal probability 1/n. EX = 6 2 1 + 22 + .29.+x2 n 1 n 2.+xn . (EX)2 = (2. IA .. E(X 2 ). Compute EX.+xn 2 . VarX = − x1 +.. Such a random variable is simply the random choice of one among n numbers.

1 Var(X) and P (X ≤ 10)? As X is Binomial(50. 1 ). i = 0. Properties: 1. gets one of the genes at random from each parent. Assuming that both parents are hybrid and have n children. ..p) random variable counts the number of successes in n independent trials. independently. Probability mass function: P (X = i) = 2. . with parameter λ > 0. 4 P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 3 4 n −n· 1 4 3 4 n−1 . and rr the pure recessive genotype.5 DISCRETE RANDOM VARIABLES 49 5.4 Poisson random variable A random variable is Poisson(λ). each of which is a success with probability p. while rr results in a recessive phenotype. . for i = 0. 5. EX = np. P (X ≤ 10) = i=0 50 1 . Var(X) = np(1 − p). result in the phenotype of the dominant gene.3 Binomial random variable A Binomial(n. The expectation and variance formulas will be proved in Chapter 8. Example 5. Properties: 1. and 10 n i pi (1 − p)n−i . the probability of the rr genotype is 1 . what is the probability that at least two will have the recessive phenotype? Each child. n. i 250 Example 5. 1. independently. 2. if it has the probability mass function given below. For each child. 3. . Denote by d the dominant gene and by r the recessive gene at a single locus. . For now. Therefore.9.8. Determine EX. dr is called the hybrid. Then dd is called the pure dominant genotype. dd and dr. P (X = i) = λi i! e−λ . then X is Binomial(n. Var(X) = 12. . Let X be the number of Heads in 50 tosses of a fair coin. The two genotypes with at least one dominant gene. . so EX = 25. . If X is the number of 4 rr children. 2 ).5. take them on faith.

(n − i + 1) λi λ n · i 1− 1− i! n n n n i λ λ n(n − 1) . . P (X = i) = = = → as n → ∞. The Poisson random variable is useful as an approximation to a Binomial random variable when the number of trials is large and the probability of success is small. 2. ﬁrst formulated by L. P (X = i) → e−λ for each ﬁxed i = 0. i! 1− λ n n−i λ −i n(n − 1) .1. p). (n − i + 1) 1 · 1− · · λ i i! n ni 1− n λi −λ · e · 1 · 1. Example 5. 1. Binomial(n. Theorem 5. as n → ∞. who studied deaths by horse kicks in the Prussian cavalry. Poisson approximation to Binomial .p) is approximately Poisson(λ): λ If X is Binomial(n. Here is how we compute the expectation: ∞ 50 EX = i=1 i · e−λ λi = e−λ λ i! ∞ i=1 λi−1 = e−λ λ eλ = λ. n i λ n i λi . then. . Assume that the US population is 300 million. In this context it is often called the law of rare events. .10. EX = λ. Proof. λ) · p. independently. Suppose that the probability that a person is killed by lighting in a year is. . When n is large. Var(X) = λ. .5 DISCRETE RANDOM VARIABLES 2.. . with p = n . Bortkiewicz (in 1898). J. (i − 1)! and the variance computation is similar (and a good exercise!). and λ = np is of moderate size. 3. . . i! The Poisson approximation is quite good: one can prove that the error made by computing a probability using the Poisson approximation instead of its exact Binomial expression (in the context of the above theorem) is no more than min(1. 1/(500 million). p is small.

3694. p).02311530. the probability of exactly 3 deaths is approximately λ e−λ . 3 This highlights the interpretation of λ as a rate. The answer is 1 − e−λ − λe−λ ≈ 0. among the next 10. then X is Binomial(n. is 3! 10 3 λ3 −λ e 3! 3 3 λ3 1 − e−λ 3! 7 ≈ 4. If lightning deaths occur at the rate of 5 a year. in which 2 or more people are killed by lightning. Compute the expected number of years. Consequently. Approximate the above probability. where n = 300 million and p = 1/ (500 million). X is approximately Poisson( 3 ). assuming that lightning deaths occur as a result of independent factors in disjoint time intervals. then. By the same logic as above and the formula for Binomal expectation. A person who matches these characteristics has .02311529. Compute P (3 or more people will be killed by lightning next year) exactly. λ e−λ ). again. Assuming year-to-year independence. we can imagine that they operate on diﬀerent people in disjoint time intervals. where. 5. and the answer is 1 − (1 − p)n − np(1 − p)n−1 − 2. Assume a crime has been committed. The answer.0369. halving the time interval has the same p. 2 3. so np changes to 10 and so λ = 10 as well. doubling the time interval is the same as doubling the number n of people (while keeping p the same). and the answer is 5 n 2 p (1 − p)n−2 ≈ 0. but half as 3 3 many trials. Indeed. they should occur at half that rate in 6 months. 3! λ = 3 . 108 ). the number of years with exactly 3 people 5 3 killed is approximately Binomial(10. the answer is 10(1 − e−λ − λe−λ ) ≈ 0. Thus. 2 1 − e−λ − λe−λ − λ2 −λ e ≈ 0. Approximate P (two or more people are killed by lightning within the ﬁrst 6 months of next year). 4. Example 5. 3 As np = 5 .34 · 10−6 . In every year. 10−8 ) in a population of size n (say. It is known that the perpetrator has certain characteristics. Approximate P (in exactly 3 of next 10 years exactly 3 people are killed by lightning). and then np also doubles. 51 If X is the number of people killed by lightning. which occur with a small frequency p (say. Poisson distribution and law .11.5 DISCRETE RANDOM VARIABLES 1.

Then. due to low p. k · k! There is no closed-form expression for the sum. This model was in fact tested in court. charged with the crime. “the chances of there being another couple [with the speciﬁed characteristics.2330 when λ = 1. Assume that N is the number of people with given characteristics. that is. 000. The defense may claim that the probability of innocence. presumably enough for a reasonable doubt. N ≥ 1 | N = k) · P (N = k) P (C = A | N = k) · P (N = k) k=1 = and P (C = A | N = k) = so P (C = A. The jury convicted the couple charged for robbery on the basis of the prosecutor’s claim that. Mathematically. who is arrested. is about 0. Collins case. given the assumptions. say n = 5. k 1 λk −λ · ·e .” The Supreme Court of California reversed the conviction and gave two reasons. we can easily assume that it is Poisson with λ = np. This is a Binomial random variable but. 1−(the above probability). since p is so small. N ≥ 1) = k=1 ∞ P (C = A. the criminal. In this instance. Now. by the ﬁrst Bayes’ formula. 000 and n would have been the number of adult couples in the LA area. but it can be easily computed numerically. a 1968 jury trial in Los Angeles. ∞ P (C = A. The ﬁrst reason was insuﬃcient foundation for .g. we can formulate the problem as follows: P (C = A | N ≥ 1) = P (C = A. in the LA area] must be one in a billion. at a routine traﬃc stop or by airport security) and. whether the arrested person is guilty. N ≥ 1) = k=1 ∞ 1 . 000. A. Choose a person from among these N . There is no other evidence. N ≥ 1) . in the famous People v. choose at random another person. it was claimed by the prosecution (on ﬂimsy grounds) that p = 1/12.5 DISCRETE RANDOM VARIABLES 52 been found at random (e. label that person by C. The question is whether C = A.. k k! The probability that the arrested person is guilty then is e−λ P (C = A|N ≥ 1) = · 1 − e−λ ∞ k=1 λk . 000. P (N ≥ 1) We need to condition as the experiment cannot even be performed when N = 0. What should the defense be? Let us start with a mathematical model of this situation.

5 Geometric random variable A Geometric(p) random variable X counts the number trials required for the ﬁrst success in independent trials with success probability p.1015. P (X > n) = − p)k−1 = (1 − p)n .. .12. = P (X > n). is about 0. Example 5. 1 − e−λ 5 much larger than the prosecutor claimed. . Probability mass function: P (X = n) = p(1 − p)n−1 . your opponent tosses a coin. for λ = 12 it is about 0. On the average. EX = 2 and Var(X) = 2. where n = 1. how many rounds does the game last? P (game decided on round 1) = 7 and so the number of rounds N is Geometric( 12 ). namely. 7 .1939. which reduce to manipulations with geometric series. .13. 5. Var(X) = 1−p . 2. If you roll 6 you win. this round ends and the game repeats. EX = p . The second reason was that the probability that another couple with matching characteristics existed was. 3. p2 ∞ k=n+1 p(1 4. in fact. if you do not roll 6 and your opponent tosses Heads you lose. You roll a die. and 7 1 5 1 + · = . 2 Example 5. P (X > n + k|X > k) = (1−p)n+k (1−p)k We omit the proofs of the second and third formulas. This is about twice the (more relevant) probability of innocence. Properties: 1. which. Let X be the number of tosses of a fair coin required for the ﬁrst Heads. for this λ. 5. otherwise. P (N ≥ 2 | N ≥ 1) = 1 − e−λ − λe−λ . 6 6 2 12 EN = 12 .5 DISCRETE RANDOM VARIABLES 53 the estimate of p. What are EX and Var(X)? As X is Geometric( 1 ). 1 2.

(c) Find an expression for P (X ≥ 6). once a seed is planted. What is the approximate probability that he has at least 2 successes in 10 years? (d) Devise a method to determine the number of seeds the biologist should plant in order to get at least 3 mature plants in a year with probability at least 0. unless otherwise speciﬁed. Next.999. C. The membership numbers for the four groups are as follows: A: 5. Let X be the number of 6’s in the ﬁrst 10 rolls and let Y the number of rolls needed to obtain a 3. B: 10. (a) Write down the probability mass function of X.) 3. (Recall: all random choices are with equal probability. he pays you $10. (b) Write down the probability mass function of Y . the game is repeated. Justify brieﬂy the approximation. (b) Compute EN . (c) Compute P (you win). (d) Assume that you get paid $10 for winning in the ﬁrst round. (a) Write down the exact expression for the probability that the biologist will indeed end up with at least 3 mature plants. (Your method will probably require a lengthy calculation – do not try to carry it out with pen and paper.) (a) Write down the probability mass functions for X and Y . You are dealt one card at random from a full deck and your opponent is dealt 2 cards (without any replacement). In all other cases you pay him $1. choose one of the four groups at random and let Y be its size. D: 20. if you get a King. Let N be the number of times the two dice have to be rolled before the game is decided. First. If you get an Ace. (a) Determine the probability mass function of N . 2. (c) Compute Var(X) and Var(Y ). but your card is red and your opponent has no red cards. B. $1 for winning in any other round. otherwise whoever rolls the larger number wins. 5. The biologist plants 3000 seeds. (c) The biologist plans to do this year after year. The plant needs a year to reach maturity. choose one of the 50 students at random and let X be the size of that student’s group. (d) Find an expression for P (Y > 10). You and your opponent both roll a fair die. Determine your expected earnings. C: 15. Are they positive? 4. Roll a fair die repeatedly. Each of the 50 students in class belongs to exactly one of the four groups A. or D. A biologist needs at least 3 mature specimens of a certain plant. Compute your expected winnings. he pays you $5 (regardless of his cards). and nothing otherwise. (b) Compute EX and EY . A year is deemed a success if three or more plants from these seeds reach maturity. If you both roll the same number. any plant will survive for the year with probability 1/1000 (independently of other plants). If you have neither an Ace nor a King.5 DISCRETE RANDOM VARIABLES 54 Problems 1. (d) Assume you have . (b) Write down a relevant approximate expression for the probability from (a). he pays you $1.

Solutions 1. 2 = 3. 2. . 2. Then. 1. n. . the number of mature plants. 1000 . . 1 6 5 6 i−1 . P (X ≥ 6) = i=6 10 i 1 6 5 6 i 5 6 10−i .. . (a) The random variable X. Express EX with s.001)2 . 1 ): 6 P (X = i) = where i = 0. the number of years the biologists succeeds is approximately Binomial(10. (a) X is Binomial(10.999)3000 − 3000(0. sn . 1 2.999)2999 (0. . P (X ≥ 3) = 1 − P (X ≤ 2) = 1 − (0. s) and the answer is 1 − (1 − s)10 − 10s(1 − s)9 . Let EY = µ and Var(Y ) = σ 2 . (c) 10 10 i 1 6 i 5 6 10−i . 10. 2 (c) Denote the probability in (b) by s.001) − (b) By the Poisson approximation with λ = 3000 · 1 1000 3000 (0. (d) P (Y > 10) = 10 . and σ. . µ. and again X is the size of the group of a randomly chosen student.5 DISCRETE RANDOM VARIABLES 55 s students divided into n groups with membership numbers s1 . (d) Solve e−λ + λe−λ + λ2 −λ e = 0. 9 P (X ≥ 3) ≈ 1 − e−3 − 3e−3 − e−3 .001 2 .999)2998 (0. . . is Binomial 3000. (b) Y is Geometric( 1 ): 6 P (Y = i) = where i = 1. while Y is the size of the randomly chosen group. . . .

so your . 102 P (X = −1) = 1 − and so EX = 2 11 − . and 0 otherwise. 1 6 n−1 5 · . so Var(Y ) = 31.25 0. 3.5 DISCRETE RANDOM VARIABLES 56 for λ and then let n = 1000λ. 2 (d) You get paid $10 with probability expected winnings are 51 . (c) E(X 2 ) = 250. EY = 12.2 0. so Var(X) = 25. P (you win) = 1 .4 P (Y = x) 0. .25 0. 13 102 5 11 2 11 4 11 10 + + −1+ + = + >0 13 13 102 13 102 13 51 4. E(Y 2 ) = 187.25 0. 6 (c) By symmetry.25 5 12 . 229 seeds. 52 22 · 52 26 2 51 2 P (X = 1) = = 11 . 6 (b) EN = 5 . 2. P (X = x) 0.5. The equation above can be solved by rewriting λ = log 1000 + log(1 + λ + λ2 ) 2 and then solved by iteration. The result is that the biologist should plant 11. . (a) x 5 10 15 20 (b) EX = 15. .25. $1 with probability 1 12 .5. 12 5. 52 4 P (X = 5) = .1 0. 3. Let X be your earnings. (a) N is Geometric( 5 ): 6 P (N = n) = where n = 1.. P (X = 10) = 4 .3 0.

n E(X) = i=1 si n si · = s s n s2 · i i=1 1 n n n = · EY 2 = (Var(Y ) + (EY )2 ) = (σ 2 + µ2 ). . . n s s s . + sn .5 DISCRETE RANDOM VARIABLES 57 (d) Let s = s1 + . Then.

x + h]) = F (x + h) − F (x) ≈ F (x) · h = f (x) · h. P (X ∈ B) = B f (x) dx. and variance is computed by the same formula: Var(X) = E(X 2 ) − (EX)2 . The function F = FX given by x F (x) = P (X ≤ x) = −∞ f (s) ds is called the distribution function of X. Density has the same role as the probability mass function for discrete random variables: it tells which values x are relatively more probable for X than others. b]) = P (a ≤ X ≤ b) = a b f (x) dx.6 CONTINUOUS RANDOM VARIABLES 58 6 Continuous Random Variables A random variable X is continuous if there exists a nonnegative function f so that. if h is very small. ∞ EX = −∞ x · f (x) dx. −∞ Note also that P (X ∈ [a. By analogy with discrete random variables. ∞ Eg(X) = −∞ g(x) · f (x) dx. apart from ﬁnitely many (possibly inﬁnite) jumps. . it must hold that ∞ f (x) dx = 1. F (x) = f (x). P (X = a) = 0. b P (X ≤ b) = P (X < b) = −∞ f (x) dx. On an open interval where f is continuous. We will assume that a density function f is continuous. Namely. we deﬁne. The function f = fX is called the density of X. Clearly. for every interval B. then P (X ∈ [x.

1]. for y ∈ (0. 8 So. In a problem such as this. Example 6. 8 16 1 Finally. so we have (b). β]. for y ∈ (0.2. dy 4 4 (1 − y) 4 1 3x2 0 if x ∈ [0.1. Assume that X has density fX (x) = Compute the density fY of Y = 1 − X 4 . (b) Compute P (1 ≤ X ≤ 2). For 8 x2 8 dx = 8 3 4 and E(X 2 ) = 0 x3 dx = 8. as the values of Y are restricted to that interval. 4 0 cx dx (a) Determine c. . we compute 2 x 3 dx = . compute ﬁrst the distribution function FY of Y . 1 1 (1−y) 4 3x2 dx .6 CONTINUOUS RANDOM VARIABLES 59 Example 6. Let f (x) = cx if 0 < x < 4. Now. 0 otherwise. FY (y) = P (Y ≤ y) = P (1 − X 4 ≤ y) = P (1 − y ≤ X 4 ) = P ((1 − y) 4 ≤ X) = It follows that fY (y) = 1 3 d 1 3 1 FY (y) = −3((1 − y) 4 )2 (1 − y)− 4 (−1) = 1 . For [α. Before starting. otherwise. this is ideally the output of a computer random number generator. 1]. we now look at some famous densities.1 Uniform random variable Such a random variable represents the choice of a random number in [α. 1]. 1). As with discrete random variables. and fY (y) = 0 otherwise. note that the density fY (y) will be nonzero only when y ∈ [0. we use the fact that density integrates to 1. because those two values contribute nothing to any integral. for (c) we get EX = 0 4 = 1 and c = 1 . 1). β] = [0. For (a). Var(X) = 8 − 64 9 = 8 9. 6. Observe that it is immaterial how f (y) is deﬁned at y = 0 and y = 1. (c) Determine EX and Var(X).

X ≤ +P X > 2 r+1 r 1 = P X≤ +P X ≥ r+1 r+1 1 2r r +1− = = r+1 r+1 r+1 1 1−X X> . In other words. . Note that you cannot do this for sets that are not countable or you would “prove” that P (X ∈ R) = 0. we can make a more general conclusion. but. then in [ 1 . For example. 1) and we will deal only with such r’s. 1]. of course.4. say. As R has values in (0. and then ﬁnally in [ 1 . . Let R be the ratio of the smaller versus the larger segment. its binary expansion is uniquely deﬁned. Q = {q1 . A uniform random number X divides [0.010? As Q is countable. Compute the density of R. then any of the 2n possibilities for its ﬁrst n binary digits are equally likely. 1) is 0. . What is P (X ∈ Q)? What is the probability that the binary expansion of X starts with 0. 1 β−α if x ∈ [α. As X is. Example 6. 1). Divide [0. (β−α)2 12 . . β]. the density fR (r) is nonzero only for r ∈ (0. }. If the binary expansion of a number x ∈ [0.. 1]. 1] into two segments. q2 . then nth digit 1 means that x is in the right half of I and nth digit 0 means that x is in the left half of I. Assume that X is uniform on [0. Choosing a uniform random number on [0. in fact. 1] is thus equivalent to tossing a fair coin inﬁnitely many times.3.X ≥ 2 r+1 r 1 1 1 since ≤ and ≥ r+1 2 r+1 2 +P .6 CONTINUOUS RANDOM VARIABLES 60 Properties: 1. 0 3. the number is in [0. while we. if 1 3 1 the expansion starts with 0. 8 ]. know that P (X ∈ R) = P (Ω) = 1. ≤r 2 X 1 1 . otherwise. the binary digits of X are the result of an inﬁnite sequence of independent fair coin tosses. EX = α+β 2 . FR (r) = P (R ≤ r) = P X 1 ≤r X≤ . irrational. 2 ]. 1) into 2n intervals of equal length. 4 4 Our answer is 1 . .x1 x2 . 2 1−X 1 r = P X ≤ . Density: f (x) = 2. it has an enumeration. the ﬁrst n binary digits determine which of the 2n subintervals x belongs to: if you know that x belongs to an interval I based on the ﬁrst n − 1 digits. with probability 1. so there is no ambiguity about what the second question means. By Axiom 3 of Chapter 3: P (X ∈ Q) = P (∪i {X = qi }) = i P (X = qi ) = 0. 2 ]. If X is uniform on 8 [0.010. Var(X) = Example 6.

dr (r + 1)2 6. The last property means that.6 CONTINUOUS RANDOM VARIABLES 61 For r ∈ (0. λ2 4. if x < 0.3 Normal random variable A random variable is Normal with parameters µ ∈ R and σ 2 > 0 or. Assume that a lightbulb lasts on average 100 hours. λe−λx 0 if x ≥ 0.1353. .2 Exponential random variable A random variable is Exponential(λ). Assuming exponential distribution. X is Exponential with λ = and P (X ≥ 200) = e−2 ≈ 0. for a lightbulb to burn out or for the next earthquake of at least some given magnitude. EX = λ . Example 6. There is no “aging. 5. with parameter λ > 0. 1). For example. P (X ≤ 50) = 1 − e− 2 ≈ 0. if its density is the function given below.3935. thus. Var(X) = 1 . measurement with random error. etc. σ 2 ). in short.” Proofs of these properties are integration exercises and are omitted. Memoryless property: P (X ≥ x + y|X ≥ y) = e−λx . the distribution of the remaining waiting time is the same as it was at the beginning. Properties: 1. SAT (or some other) test score of a randomly chosen student at UC Davis. Then. if it has the probability mass function given below. Let X be the waiting time for the bulb to burn out. Density: f (x) = 1 2. 1 1 100 6. This is a distribution for the waiting time for some random event. compute the probability that it lasts more than 200 hours and the probability that it lasts less than 50 hours. X is N (µ. for example. 3. P (X ≥ x) = e−λx . equals fR (r) = d 2 FR (r) = . the density.5. Such a random variable is (at least approximately) very common. weight of a randomly caught yellow-billed magpie. if the event has not occurred by some given time (no matter how large).

How is Y distributed? If X is a “measurement with error” αX + β amounts to changing the units and so Y should still be normal. 3. Assuming that the integral of f is 1. Let us see if this is the case. = where the last integral was obtained by the change of variable z = x − µ and is zero because the function integrated is odd. Density: (x−µ)2 1 f (x) = fX (x) = √ e− 2σ2 . σ 2π where x ∈ (−∞.6. Let X be a N (µ. EX = µ. we can use symmetry to prove that EX must be µ: ∞ ∞ ∞ EX = −∞ ∞ xf (x) dx = −∞ (x − µ)f (x) dx + µ −∞ f (x) dx (x−µ)2 1 (x − µ) √ e− 2σ2 dx + µ σ 2π −∞ ∞ z2 1 = z √ e− 2σ2 dz + µ −∞ σ 2π = µ. We start by computing the distribution function of Y . Var(X) = σ 2 . σ 2 ) random variable and let Y = αX + β. FY (y) = P (Y ≤ y) = P (αX + β ≤ y) y−β = P X≤ α = y−β α −∞ fX (x) dx . with α > 0. 2. as is the computation of the variance. To show that ∞ f (x) dx = 1 −∞ is a tricky exercise in integration. ∞). Example 6.6 CONTINUOUS RANDOM VARIABLES 62 Properties: 1.

and P (|X − µ| ≥ 3σ). 2π −∞ In particular. P (|X − µ| ≥ 2σ) = 2(1 − Φ(2)) ≈ 0. In particular. Example 6. X −µ σ has EZ = 0 and Var(Z) = 1. The integral for Φ(z) cannot be computed as an elementary function. P (|X − µ| ≥ 3σ) = 2(1 − Φ(3)) ≈ 0. then. What is the probability that a Normal random variable diﬀers from its mean µ by more than σ? More than 2σ? More than 3σ? In symbols. You should also note that it is enough to know these values for z > 0. so approximate values are given in tables. by using the fact that fZ (x) is an even function. 2πσα Therefore. . X −µ ≥1 σ = P (|Z| ≥ 1) = 2P (Z ≥ 1) = 2(1 − Φ(1)) ≈ 0. In this and all other examples of this type. 1) random variable. σ 2 ).7. a form which is 2 also often useful.3173. P (|X − µ| ≥ 2σ). Another way to write this is P (Z ≥ −z) = P (Z ≤ z). −z ∞ z Φ(−z) = −∞ fZ (x) dx = z fZ (x) dx = 1 − −∞ fZ (x) dx = 1 − Φ(z). Its distribution function FZ (z) is denoted by Φ(z). this is largely obsolete. if X is N (µ. Such a N (0. We have P (|X − µ| ≥ σ) = P Similarly. we need to compute P (|X − µ| ≥ σ). Φ(0) = 1 . as in this case. as computers can easily compute Φ(z) very accurately for any given z.0027. Nowadays.0455. the density fY (y) = fX = y−β α · 1 α (y−β−αµ)2 1 √ e− 2α2 σ2 . Y is normal with EY = αµ + β and Var(Y ) = (ασ)2 . the letter Z will stand for an N (0.6 CONTINUOUS RANDOM VARIABLES 63 and. Note that Z= fZ (z) = Φ(z) = FZ (z) = 1 2 √ e−z /2 2π z 1 2 √ e−x /2 dx. 1) random variable is called standard Normal.

e. Let Sn be Binomial(n. proved this way. An important issue is the quality of the Normal approximation to the Binomial.g. 1). Here is the computation: P (1 ≤ X ≤ 4) = P 1−2 X −2 4−2 ≤ ≤ 5 5 5 = P (−0. obsolete by now. i. p) random variable..5) and n is large (e. N (0. the bound above is about 0. If we pretend that Sn is Normal. De Moivre-Laplace Central Limit Theorem.2) = Φ(0.5 · (p2 + (1 − p)2 ) .g.8. so that np(1−p) = 10. p). Then.2 ≤ Z ≤ 0. In the situation when the above upper bound .1. too large for many purposes. however. P as n → ∞. 1). Sn − np np(1 − p) ≤x → Φ(x) np(1−p) ≈ N (0. Indeed it can be. and originally was. it says that √ n k 1 p (1 − p)n−k → √ k 2π x −∞ e− 2 ds s2 k:0≤k≤np+x np(1−p) as n → ∞. Let Sn be a Binomial(n. for every real number x..0878. np(1−p) The following theorem says that this is approximately true if p is ﬁxed (e.6 CONTINUOUS RANDOM VARIABLES 64 Example 6. Compute the probability that X is between 1 and 4. n = 100).2)) ≈ 0. n p (1 − p) A commonly cited rule of thumb is that this is a decent approximation when np(1 − p) ≥ 10. for every x ∈ R. 0. where p is ﬁxed and n is large.4) = P (Z ≤ 0.4) − (1 − Φ(0.4) − P (Z ≤ −0. more We should also note that the above theorem is an analytical statement. but they are. √Sn −np precisely. in my opinion. if we take p = 1/3 and n = 45. then √Sn −np is standard Normal. Assume that X is Normal with mean µ = 2 and variance σ 2 = 25. Theorem 6. One can prove that the diﬀerence between the Binomial probability (in the above theorem) and its limit is at most 0. Various corrections have been developed to diminish the error.. with a lot of computational work. Recall that its mean is np and its variance np(1 − p).2347 .

What would the answer to the previous example be if the game were fair.5. For Sn .11. we 2 need to ﬁnd n so that P (Sn ≥ 100) ≈ 0. Example 6.0448. 9 Let Sn be the number of times you win. we get P Z> 5 3 ≈ 0.10.0478. After n games.2650 and 0. respectively. Then.. 18 black. when n is large and p is small.2990. How many times do you need to toss a fair coin to get 100 heads with probability 90%? Let n be the number of tosses that we are looking for. which is Binomial(n. you bet even money on the outcome of a fair coin toss each time. what is the probability that you are ahead? Answer this for n = 100 and n = 1000. This is a Binomial(n.) Remember that. we should simply compute directly with the Binomial distribution and not use the Normal approximation. i.e. A roulette wheel has 38 slots: 18 red. (We will assume that the approximation is adequate in the 1 examples below. the true probabilities are 0. 5 Z>√ 90 ≈ 0. 1 ).9. the Poisson approximation (with λ = np) is much better! Example 6. we get P and for n = 1000. say n = 100 and p = 100 . and 2 green. Example 6.9. The ball ends at one of these at random. 19 ) random variable. We will use below that n > 200. as the probability would be approximately 1 2 1 2 and P (ahead) → P (Z > 0) = 0. p = as n → ∞. for n = 200 (see .6 CONTINUOUS RANDOM VARIABLES 65 on the error is too high. For comparison. P (ahead) = P (win more than half of the games) n = P Sn > 2 1 n − np Sn − np = P > 2 np(1 − p) np(1 − p) √ ( 1 − p) n ≈ P Z> 2 p(1 − p) For n = 100. You are a player who plays a large number of games and makes an even bet of $1 on red in every game.

Here is the computation: P 1 Sn − 2 n 100 − 1 n ≥ 1√ 2 √ 1 2 n 2 n ≈ P = P = = = = Z≥ Z≥ 100 − 1 n 2 1√ n 2 200 − n √ n n − 200 √ P Z≥− n n − 200 P Z≤ √ n n − 200 √ Φ n 0. that is. Φ(1. The density function of a random variable X is given by f (x) = a + bx 0 0 ≤ x ≤ 2. 2. with the only positive solution √ √ 1.9 n−200 √ n Now. . (b) Compute E(1/X). otherwise. (b) Compute Var(X). we conclude that n = 219. Problems 1. 1].28 + 1.28.28 n − 200 = 0. thus we need to solve √ n − 1. (a) Determine c.282 + 800 n= .28) ≈ 0. 2 = 1. (c) Determine the probability density function of Y = X 2. otherwise. according to the tables. A random variable X has the density function f (x) = c(x + 0 √ x) x ∈ [0.6 CONTINUOUS RANDOM VARIABLES 66 the previous example). We also know that E(X) = 7/6. Rounding up the number n we get from above. This is a quadratic equation in √ n. (a) Compute a and b.9.

in terms on n and Φ. approximately. Let M be the amount of time of the show that you miss because of the call.” Assume that this means that the time T of the call is uniformly distributed in the speciﬁed interval. a representative of an insurance company promised to call you “between 7 and 9 this evening. 4. the call still hasn’t arrived.6 CONTINUOUS RANDOM VARIABLES 67 3. 6 it follows that c = (b) 6 7. the probability that you win at least $250? (b) Approximately how many times do you need to play so that you win at least $250 with probability at least 0. (c) How large should n be so that the probability in (b) is larger than 0. there is a game show on TV that you wanted to watch. (b) Write down an approximation. Assume that n is large. of the probability that M diﬀers from its expectation by less than 10%. After your complaint about their service. 6 7 (c) 1 0 √ 1 18 (x + x) dx = . What is the probability that it arrives in the next 10 minutes? (c) Assume that you know in advance that the call will last exactly 1 hour. (a) Compute the probability that the call arrives between 8:00 and 8:20. Toss a fair coin twice.99? 5. From 9 to 9:30. What is. x 7 Fr (y) = P (Y ≤ y) √ = P (X ≤ y) = 6 7 √ y (x + 0 √ x) dx. You win $1 if at least one of the two tosses comes out heads. Compute the expected value of M . (b) At 8:30. (a) As 1=c 0 1 (x + √ x) dx = c 1 2 + 2 3 7 = c. . Roll a die n times and let M be the number of times you roll 6.99? Solutions 1. (a) Assume that you play this game 300 times. (a) Compute the expectation EM .

1 120 90 Then. 3 1 n· 4 · 4 The argument must be negative. 1). n = 300 and the above expression is P (Z ≥ 0. Thus. otherwise. which is approximately 1 − Φ(3. 3 (c) We have 0 if 0 ≤ T ≤ 60. · . M = T − 60 if 60 ≤ T ≤ 90. 1 2. 4 (b) E(X 2 ) = 3. 1 0 xf (x) dx 11 36 . 90 4.99 or so that 250 − n · 3 4 Φ = 0. (a) P (win a single game) = 3 . 30 if 90 ≤ T. 1 P (T ≤ 100|T ≥ 90) = . 120].33) ≈ For (b). from 7pm.0004. If you play n times. the number X of games you win is 4 Binomial(n. T is uniform on [0. 3 ).25.01.6 CONTINUOUS RANDOM VARIABLES 68 and so fY (y) = 3 7 (1 0 + y − 4 ) if y ∈ (0. P (X ≥ 250) ≈ P Z ≥ 3 1 n· 4 · 4 For (a). EM = (t − 60) dx + 60 1 120 120 30 dx = 11. 1). you need to ﬁnd n so that the above expression is 0. 6 1 2 0 x f (x) dx 1 = 7 6 we get 2a + 8 b = 7 . 10 3 ). If Z is N (0. 3 6 = 5 3 and so Var(X) = 5 3 − ( 7 )2 = 6 (b) Let T be the time of the call.33. in minutes. (a) 1 . (a) From 0 f (x) dx = 1 we get 2a + 2b = 1 and from The two equations give a = b = 1 . then 4 3 250 − n · 4 . hence 250 − n · n· 3 4 3 4 1 4 = −2.

04)2 /3.57.04.6 CONTINUOUS RANDOM VARIABLES √ 3n. (c) The above must be 0.1 ≈ P |Z| < M− 6 6 √ 0. √ 0. (a) M is Binomial(n.1 n √ 5 n 6 · 0. this yields x2 − 4. n > (34.66x − 1000 = 0 69 If x = and solving the quadratic equation gives x ≈ 34. so EM = n .1 1 6 n· · 5 6 = 2Φ √ 0. and.99 and so Φ = 0.1 n √ 5 = 2. n ≥ 387. .995.1 n √ 5 − 1. 5. n ≥ 3303. 6 6 (b) P n n < · 0. 1 ). ﬁnally.

f.’s of X and Y . y) x P (Y = y) = x P (X = x. Y = y) for all possible x and y. therefore. y) = P (X = x. Their joint probability mass function is given by p(x. of (X. given by P (X = x) = y P (X = x. 120 8 = 8 . Select 3 balls at random and let X be the number of red balls and Y the number of white balls. y). In our case.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 70 7 Joint Distributions and Independence Discrete Case Assume that you have a pair (X. f. Y ). m. To answer (c) we merely add the relevant probabilities. An urn has 2 red. 2. 1. or 2 and y can be 0. or 3.1. f. Y = y) so that P ((X. The joint p.’s. m. (b) marginal p. m. to answer (d). Y = y) = Example 7. and 3 green balls. f. Y ) ∈ A) = (x. f.y)∈A p(x. m. (c) P (X ≥ Y ). m. 1. 45 . X ≥ Y ) = P (X ≥ Y ) 8 120 3 8 1 + 6 + 3 + 30 + 5 3 = . Y = y) = y p(x. and (d) P (X = 2|X ≥ Y ). P (X ≥ Y ) = and. determine the marginal p. Y ) of discrete random variables X and Y . we compute P (X = 2. is given by P (X = x.’s. The marginal probability mass functions are the p. 5 white. The values are given in the table. Determine (a) joint p. x can be 0. y) p(x. y\x 0 1 2 3 P (X = x) 0 1/120 5 · 3/120 10 · 3/120 10/120 56/120 1 2 · 3/120 2 · 5 · 3/120 10 · 2/120 0 56/120 2 3/120 5/120 0 0 8/120 P (Y = y) 10/120 50/120 50/120 10/120 1 The last row and column entries are the respective column and row sums and.

3. m. Then. the 0’s in the table are dead giveaways. For example. Continuous Case We say that (X. closed and bounded) subset of R2 . 2. y) ≥ 0 so that P ((X. open or closed) subset of R2 Example 7. the joint p. This means that f (x. that is. y) = 1 area(S) if (x. roll a die twice and let X be the number on the ﬁrst roll and let Y be the number on the second roll. In the discrete case. which is the same as assuming that the two random variables are discrete uniform (on {1. where S is a compact (that is. . where S is some nice (say. independence is an assumption. f. P (X = 2.2. For example. . . In the previous example. Y = 2) = 0. m. . are X and Y independent? No. 0 otherwise. X and Y are independent exactly when P (X = x. y) ∈ S. Y = y) = P (X = x)P (Y = y) for all possible values x and y of X and Y . Let (X. y) dx dy. but neither P (X = 2) nor P (Y = 2) is 0. 6}) and independent. .7 JOINT DISTRIBUTIONS AND INDEPENDENCE 71 Two random variables X and Y are independent if P (X ∈ A.4. 0 The simplest example is a square of side length 1 where f (x. f.’s. X and Y are independent: we are used to assuming that all 36 outcomes of the two rolls are equally likely. y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Y ) be a random point in S. Example 7. otherwise. Most often. Y ∈ B) = P (X ∈ A)P (Y ∈ B) for all intervals A and B. is the product of the marginal p. Y ) ∈ S) = S f (x. Y ) is a jointly continuous pair of random variables if there exists a joint density f (x. Example 7.

−∞ Indeed. −∞ ∞ f (x. which are the densities of X and Y . y) dy. are computed by integrating out the other variable: ∞ fX (x) = fY (y) = f (x. y) dy. Two jointly continuous random variables X and Y are independent exactly when the joint density is the product of the marginal ones: f (x. 1 1 dx −1 x2 c x2 y dy = 1 c· 4 21 = 1 and so c= 21 . otherwise. for x ∈ (0. 4 For (b). y) = c x2 y 0 if x2 ≤ y ≤ 1.5. X ∈ A means that (X. If f is the joint density of (X. where S = A × R. The marginal densities formulas follow from the deﬁnition of density. (b) P (X ≥ Y ). for an interval A. y) = fX (x) · fY (y). therefore. Let f (x. Determine (a) the constant c. the following can be checked. Then. let S be the region between the graphs y = x2 and y = x. 1).7 JOINT DISTRIBUTIONS AND INDEPENDENCE 72 Example 7. y) dx. For (a). for all x and y. (c) P (X = Y ). ∞ P (X ∈ A) = A dx −∞ f (x. Y ) ∈ S) 1 x 21 2 · x y dy = dx 2 4 0 x 3 = 20 Both probabilities in (c) and (d) are 0 because a two-dimensional integral over a line is 0. Y ). and (d) P (X = 2Y ). With some advanced calculus expertise. then the two marginal densities. and. Y ) ∈ S. . P (X ≥ Y ) = P ((X.

Are X and Y independent? f (x. Let (X. We have fX (x) = 1 x2 Compute marginal densities and determine 21 2 21 2 · x y dy = x (1 − x4 ).) 2 0≤y≤x≤1 0. thus uniform on some sets. and fY (y) = 1. if x ∈ [0.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 73 Example 7. Are X and Y independent? Now f (x.7. 1]. Assume that (X. and 0 otherwise. uniform on a compact set S ⊂ R2 . and 0 otherwise. Let (X. 1] × [0. X and Y are independent. say A and B. 1].6. and 0 otherwise. 1]. (If A and B are both intervals. otherwise. Example 7. y) = The marginal densities are fX (x) = 2x. then S = A × B is a rectangle. Y ) be a random point in a square of length 1 with the bottom left corner at the origin. as f (x. We can make a more general conclusion from the last two examples. if y ∈ [0. if y ∈ [0. y) = The marginal densities are fX (x) = 1. Y ) be a random point in the triangle {(x. and 0 otherwise. continued. Example 7. 1]. If they are to be independent. Previous example. which is the most common example of independence. Y ) is a jointly continuous pair of random variables. whether X and Y are independent. 4 8 √ y for x ∈ [−1. 0 otherwise. if x ∈ [0. So X and Y are no longer distributed uniformly and no longer independent. Therefore. 4 2 where y ∈ [0. The two random variables X and Y are clearly not independent. 1 (x. and fY (y) = 2(1 − y). y) ∈ [0. their marginal densities have to be constant. 1]. 1].8. Moreover. . 1]. y) : 0 ≤ y ≤ x ≤ 1}. and then S = A × B. fY (y) = √ − y 21 2 7 5 · x y dx = y 2 . y) = fX (x)fY (y).

For (b). Y ) f (x.10. The waiting time T1 for Alice’s call has expectation 10 minutes and the waiting time T2 for .11. Smith arrives and let Y be the time when Mrs. for y ∈ [0. from Alice and from Bob. So. 3 Example 7. ﬁx t ∈ [0. 1]. which has possible values in [0. y) = 2y 0 if (x. compute the probability that Mr. whichever double integral is easier.m. Y ) is uniform on [0. Let X be the time when Mr.” Assume that they both arrive there at a random time between 5 and 6 and that their arrivals are independent. To compute the probability in question we compute 1 1−x dx 0 0 1−y 2y dy or 0 1 dy 0 2y dx. For (a).5. let T = |X − Y |. 1]. The assumptions determine the joint density of (X. Smith agree to meet at a speciﬁed location “between 5 and 6 p. and that Y has density fY (y) = 2y. that X is uniform on [0. 1] and compute (drawing a picture will also help) P (T ≤ t) = P (|X − Y | ≤ t) = P (−t ≤ X − Y ≤ t) = P (X − t ≤ Y ≤ X + t) = 1 − (1 − t)2 = 2t − t2 . X > Y ) = P (X > Y ) 1 8 1 2 1 = .9.5|X > Y ) = P (X ≤ 0. 1] × [0. y) ∈ [0. 1]. 1]. otherwise. (b) Mrs. Smith arrived before 5:30. Assume that X and Y are independent. The assumptions imply that (X. 1] × [0. Compute P (X + Y ≤ 1). and 0 elsewhere. Assume that you are waiting for two phone calls.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 74 Example 7. given this information. Smith later tells you she had to wait. we need to compute P (X ≤ 0. 1]. and so fT (t) = 2 − 2t. 4 Example 7. The answer is 1 . Mr. with the time unit 1 hour. (a) Find the density for the time one of them will have to wait for the other. Smith arrives. and Mrs.

Parallel lines at a distance 1 are drawn on a large sheet of paper. fT1 (t1 ) = e−t1 and so that the joint density is for t1 . t2 ) = e−t1 −t2 /4 . sin θ dθ /2 2 = = . The probability π 1 1 − arcsin . t2 > 0.12. 4 2 2 sin θ dθ + 0 π 1 − arcsin 2 · = 4 1 − π 2 2 2 −1+ . We will. Then. Let D be the distance from the center of the needle to the nearest line and let Θ be the acute angle relative to the lines. Then. 4 1 f (t1 . 2 > 1. assume that D and Θ are independent and uniform on their respective intervals 0 ≤ D ≤ 1 and 0 ≤ Θ ≤ π . ∞ ∞ 1 fT2 (t2 ) = e−t2 /4 . t2 ≥ 0. the curve d = arcsin 1 sin θ intersects d = 1 2 at θ = arcsin 1 . we have. 2 2 P (the needle intersects a line) = P = P Case 1: ≤ 1. Compute the probability that it intersects one of the lines. for t1 . Assuming our unit is 10 minutes. reasonably. π/4 π/4 π 1 2 When Case 2: equals 4 π = 1. Drop a needle of length onto the sheet. Now. D < sin Θ 2 D< 2 sin Θ . 4 dt1 ∞ P (T1 < T2 ) = = 0 t1 1 −t1 −t2 /4 e dt2 4 0 e−t1 dt1 e−t1 /4 e −5t1 4 ∞ = 0 dt1 = 4 . Assume T1 and T2 are independent exponential random variables. the probability equals π/2 2 0 2 π. 5 Example 7. What is the probability that Alice’s call will come ﬁrst? We need to compute P (T1 < T2 ). Buﬀon needle problem. you famously get which can be used to get (very poor) approximations for π. Therefore.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 75 Bob’s call has expectation 40 minutes.

x2 . y) dx = 0. given simply by pX (x|Y = y) = P (X = x|Y = y) = P (X = x. For a jointly continuous pair of random variables X and Y . of X given Y = y is. ∞ . . Y = y + dy) P (Y = y + dy) f (x. f.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 76 A similar approach works for cases with more than two random variables. + Xn ≤ 1) = 1 n! Conditional distributions The conditional p. the joint density of (X. fX (x|Y = y) = fY (y) where f (x. Y ). in the discrete case. of course.13. we 0 have a 0 expression. x2 . and so f (x. as we cannot divide by P (Y = y) = 0. which we deﬁne to be 0. Y = y) . x3 ) ∈ [0. Let us do an example for illustration. P (Y = y) This is trickier in the continuous case. x3 ) = fX1 (x1 )fX2 (x2 )fX3 (x3 ) = Here is how we get the answer: 1 1−x1 1−x1 −x2 1 if (x1 . . Xn are independent and uniform on [0. y) dx dy = fY (y) dy f (x. y) = dx fY (y) = fX (x|Y = y) dx . 1] and independent. m. So. What is P (X1 + X2 + X3 ≤ 1)? The joint density is fX1 . . if X1 . Observe that when fY (y) = 0. X3 are uniform on [0. P (X1 + X2 + X3 ≤ 1) = 0 dx1 0 dx2 0 1 dx3 = . Assume X1 . y) . 0 otherwise. . P (X1 + . . Here is a “physicist’s proof” why this should be the conditional density formula: P (X = x + dx|Y = y + dy) = P (X = x + dx.X2 . 1]3 . X2 . 6 In general. y) = 0 for every x.X3 (x1 . y) is. . −∞ f (x. we deﬁne the conditional density of X given Y = y as follows: f (x. Example 7. 1].

This makes no literal sense because the probability P (Y = y) of the condition is 0. 2 Suppose we are asked to compute P (X ≥ Y |Y = y). y ≥ 0. otherwise. 2 21 2 4 x y 7 5/2 2y = 3 2 −3/2 x y . We compute ﬁrst 21 fY (y) = y 4 for y ∈ [0. y) : x. 1 − y]. 1]. we know that. Compute fX (x|Y = y). The joint density f (x. Problems 1. which is hardly surprising. rather than calculus.) .14. X is between 0 and 1 − y. Compute the conditional probability P (X ≥ 0 | Y ≤ 2X). if Y = y.15. otherwise. √ y 21 2 4 x y 0 x2 ≤ y ≤ 1. fX (x|Y = y) = √ √ where − y ≤ x ≤ y. 0 In other words. 2 dx = 2(1 − y). Let (X. X is distributed uniformly on [0. For a given y ∈ [0. Then. Example 7. y) = Compute fX (x|Y = y). 0 1 1−y fX (x|Y = y) = 0 ≤ x ≤ 1 − y. which equals y √ y 3 2 −3/2 1 1 x y dx = y −3/2 y 3/2 − y 3 = 2 2 2 1 − y 3/2 . 1]. We reinterpret this expression as ∞ P (X ≥ y|Y = y) = y fX (x|Y = y) dx.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 77 Example 7. y) equals 2 on the triangle. y) : −1 ≤ x. Y ) has joint density f (x. given Y = y. Y ) be a random point in the square {(x. 1−y fY (y) = Therefore. x + y ≤ 1}. Y ) be a random point in the triangle {(x. Suppose (X. Let (X. y ≤ 1}. Moreover. √ − y x2 dx = 7 5 y2. (It may be a good idea to draw a picture and use elementary geometry.

(a) Compute the conditional density of Y given X = x. Let Z = min(X. X and Y are independent random variables and they have the same density function f (x) = c(2 − x) 0 x ∈ (0. Y ≤ 2X) P (Y ≤ 2X) 1 1 1 4 2− 2 · 2 ·1 1 2 7 8 2. The joint density of (X. 1]. (a) The joint p. 4. is given by the table . (a) Compute the density function of Z. Roll a fair die 3 times. Let X be the number of 6’s obtained and Y the number of 5’s. 0 otherwise. 1) otherwise. f. (b) Compute P (Y ≤ 2X) and P (Y < 2X). (a) Determine c. (b) Are X and Y independent? Solutions to problems 1. m. (a) Compute the joint probability mass function of X and Y .5).5|Z ≤ 0. Y ) be the smaller value of the two. (b) Compute P (X ≤ 0. (b) Are X and Y independent? 3. y) = 3x if 0 ≤ y ≤ x ≤ 1. Let X and Y be independent random variables. (c) Are X and Z independent? 5. After noting the relevant areas. both uniformly distributed on [0. P (X ≥ 0|Y ≤ 2X) = = = P (X ≥ 0. Y ) is given by f (x.7 JOINT DISTRIBUTIONS AND INDEPENDENCE 78 2.

P (X = 3. 9 4 12 32 . 2. for x. it follows that c = 2 . (a) From c 0 1 (2 − x) dx = 1. 3 (b) We have P (Y ≤ 2X) = P (Y < 2X) 1 1 4 = dy (2 − x)(2 − y) dx 0 y/2 9 = = = = = 4 9 1 1 (2 − y) dy 0 y/2 (2 − x) dx 1− y2 4 4 1 y 1 (2 − y) dy 2 1 − − 9 0 2 2 4 1 3 y2 (2 − y) −y+ dy 9 0 2 8 4 1 7 5 y3 3 − y + y2 − dy 9 0 2 4 8 4 7 5 1 3− + − . 3. 1. P (X = x. Y = 3) = 0 and P (X = 3)P (Y = 3) = 0. Y = y) = 3 x 3−x y 1 6 x+y 4 6 3−x−y (b) No. y = 0. 3 and x + y ≤ 3.7 JOINT DISTRIBUTIONS AND INDEPENDENCE y\x 0 1 2 3 0 43 216 3·42 216 3·4 216 1 216 79 3 1 216 1 3·42 216 3·2·4 216 3 216 2 3·4 216 3 216 P (Y = y) 125 216 75 216 15 216 1 216 0 0 0 0 0 0 P (X = x) 125 216 75 216 15 216 1 216 1 Alternatively.

1] and 0 otherwise. (b) As the answer in (a) depends on x.5) = so the answer is 2 .5. the two random variables are not independent. 2 fX (x) = we have fY (y|X = x) = 3x dy = 3x2 . 3 (c) No: Z ≤ X. y) 3x 1 = 2 = . (a) For z ∈ [0. In other words. Z ≤ 0. (a) Assume that x ∈ [0.5) = 1 .7 JOINT DISTRIBUTIONS AND INDEPENDENCE 80 4. x]. . 1] P (Z ≤ z) = 1 − P (both X and Y are above z) = 1 − (1 − z)2 = 2z − z 2 . we conclude that P (Z ≤ 0. for z ∈ [0.5) = P (X ≤ 0. fX (x) 3x x for 0 ≤ y ≤ x. Y is uniform on [0. (b) From (a). 1]. 5. so that fZ (z) = 2(1 − z). As x 3 4 and P (X ≤ 0. 0 f (x.

7

JOINT DISTRIBUTIONS AND INDEPENDENCE

81

**Interlude: Practice Midterm 2
**

This practice exam covers the material from chapters 5 through 7. Give yourself 50 minutes to solve the four problems, which you may assume have equal point score. 1. A random variable X has density function f (x) = (a) Determine c. (b) Compute E(1/X). (c) Determine the probability density function of Y = X 2 . 2. A certain country holds a presidential election, with two candidates running for oﬃce. Not satisﬁed with their choice, each voter casts a vote independently at random, based on the outcome of a fair coin ﬂip. At the end, there are 4,000,000 valid votes, as well as 20,000 invalid votes. (a) Using a relevant approximation, compute the probability that, in the ﬁnal count of valid votes only, the numbers for the two candidates will diﬀer by less than 1000 votes. (b) Each invalid vote is double-checked independently with probability 1/5000. Using a relevant approximation, compute the probability that at least 3 invalid votes are double-checked. 3. Toss a fair coin 5 times. Let X be the total number of Heads among the ﬁrst three tosses and Y the total number of Heads among the last three tosses. (Note that, if the third toss comes out Heads, it is counted both into X and into Y ). (a) Write down the joint probability mass function of X and Y . (b) Are X and Y independent? Explain. (c) Compute the conditional probability P (X ≥ 2 | X ≥ Y ). 4. Every working day, John comes to the bus stop exactly at 7am. He takes the ﬁrst bus that arrives. The arrival of the ﬁrst bus is an exponential random variable with expectation 20 minutes. Also, every working day, and independently, Mary comes to the same bus stop at a random time, uniformly distributed between 7 and 7:30. (a) What is the probability that tomorrow John will wait for more than 30 minutes? (b) Assume day-to-day independence. Consider Mary late if she comes after 7:20. What is the probability that Mary will be late on 2 or more working days among the next 10 working days? (c) What is the probability that John and Mary will meet at the station tomorrow? c(x + x2 ), 0, x ∈ [0, 1], otherwise.

7

JOINT DISTRIBUTIONS AND INDEPENDENCE

82

**Solutions to Practice Midterm 2
**

1. A random variable X has density function f (x) = (a) Determine c. Solution: Since

1

c(x + x2 ), 0,

x ∈ [0, 1], otherwise.

1 = c

0

(x + x2 ) dx, ,

1 = c 1 = and so c = 6 . 5 (b) Compute E(1/X). Solution: 1 X 5 c, 6

1 1 + 2 3

E

= = = =

6 11 (x + x2 ) dx 5 0 x 6 1 (1 + x) dx 5 0 1 6 1+ 5 2 9 . 5

(c) Determine the probability density function of Y = X 2 .

7

**JOINT DISTRIBUTIONS AND INDEPENDENCE Solution: The values of Y are in [0, 1], so we will assume that y ∈ [0, 1]. Then, F (y) = P (Y ≤ y) = P (X 2 ≤ y) √ = P (X ≤ y)
**

√ y

83

=

0

6 (x + x2 ) dx, 5

and so fY (y) = = = d FY (y) dy 1 6 √ ( y + y) · √ 5 2 y 3 √ (1 + y). 5

2. A certain country holds a presidential election, with two candidates running for oﬃce. Not satisﬁed with their choice, each voter casts a vote independently at random, based on the outcome of a fair coin ﬂip. At the end, there are 4,000,000 valid votes, as well as 20,000 invalid votes. (a) Using a relevant approximation, compute the probability that, in the ﬁnal count of valid votes only, the numbers for the two candidates will diﬀer by less than 1000 votes. Solution: Let Sn be the vote count for candidate 1. Thus, Sn is Binomial(n, p), where n = 4, 000, 000 and p = 1 . Then, n − Sn is the vote count for candidate 2. 2 P (|Sn − (n − Sn )| ≤ 1000) = P (−1000 ≤ 2Sn − n ≤ 1000) −500 Sn − n/2 = P ≤ ≤ 1 n· 1 · 1 n· 2 · 1 2 2 2 ≈ P (−0.5 ≤ Z ≤ 0.5) = P (Z ≤ 0.5) − P (Z ≤ −0.5) = P (Z ≤ 0.5) − (1 − P (Z ≤ 0.5)) = 2P (Z ≤ 0.5) − 1 = 2Φ(0.5) − 1 ≈ 2 · 0.6915 − 1 = 0.383. 500 n·

1 2

·

1 2

7

JOINT DISTRIBUTIONS AND INDEPENDENCE

84

(b) Each invalid vote is double-checked independently with probability 1/5000. Using a relevant approximation, compute the probability that at least 3 invalid votes are double-checked. Solution:

1 Now, let Sn be the number of double-checked votes, which is Binomial 20000, 5000 and thus approximately Poisson(4). Then,

P (Sn ≥ 3) = 1 − P (Sn = 0) − P (Sn = 1) − P (Sn = 2) 42 ≈ 1 − e−4 − 4e−4 − e−4 2 = 1 − 13e−4 . 3. Toss a fair coin 5 times. Let X be the total number of Heads among the ﬁrst three tosses and Y the total number of Heads among the last three tosses. (Note that, if the third toss comes out Heads, it is counted both into X and into Y ). (a) Write down the joint probability mass function of X and Y . Solution: P (X = x, Y = y) is given by the table x\y 0 1 2 3 0 1/32 2/32 1/32 0 1 2/32 5/32 4/32 1/32 2 1/32 4/32 5/32 2/32 3 0 1/32 2/32 1/32

To compute these, observe that the number of outcomes is 25 = 32. Then, P (X = 2, Y = 1) = P (X = 2, Y = 1, 3rd coin Heads) +P (X = 2, Y = 1, 3rd coin Tails) 2 2 4 = + = , 32 32 32 2·2 1 5 P (X = 2, Y = 2) = + = , 32 32 32 2·2 1 5 P (X = 1, Y = 1) = + = , 32 32 32 etc.

and independently. uniformly distributed between 7 and 7:30. Solution: No. Let T be the arrival time of the bus. John comes to the bus stop exactly at 7am. He takes the ﬁrst bus that arrives. 1 fT (t) = e−t/2 . Y = 3) = 0 = 1 1 · = P (X = 0)P (Y = 3). and P (T ≥ 3) = e−3/2 . 11 P (X ≥ 2|X ≥ Y ) = = = = 4. Also. . The arrival of the ﬁrst bus is an exponential random variable with expectation 20 minutes. Solution: P (X ≥ 2. (a) What is the probability that tomorrow John will wait for more than 30 minutes? Solution: Assume that the time unit is 10 minutes. P (X = 0. every working day. X ≥ Y ) P (X ≥ Y ) 1+4+5+1+2+1 1+2+1+5+4+1+5+2+1 14 22 7 . 2 for t ≥ 0. 8 8 85 (c) Compute the conditional probability P (X ≥ 2 | X ≥ Y ). Then.7 JOINT DISTRIBUTIONS AND INDEPENDENCE (b) Are X and Y independent? Explain. 1 It is Exponential with parameter λ = 2 . Every working day. Mary comes to the same bus stop at a random time.

Therefore. Therefore. Consider Mary late if she comes after 7:20. 3 .7 JOINT DISTRIBUTIONS AND INDEPENDENCE 86 (b) Assume day-to-day independence. It is uniform on [0.X) (t. 3]. 3 and. What is the probability that Mary will be late on 2 or more working days among the next 10 working days? Solution: Let X be Mary’s arrival time. therefore. 3 1 The number of late days among 10 days is Binomial 10. P (2 or more late working days among 10 working days) = 1 − P (0 late) − P (1 late) = 1− 2 3 10 − 10 · 1 3 2 3 9 . (c) What is the probability that John and Mary will meet at the station tomorrow? Solution: We have 1 f(T. 6 for x ∈ [0. 1 P (X ≥ 2) = . x) = e−t/2 . 3] and t ≥ 0. P (X ≥ T ) = = = ∞ 1 3 1 −t/2 dx e dt 3 0 2 x 1 3 −x/2 e dx 3 0 2 (1 − e−3/2 ).

y). and EXY . Put the items in a random order and inspect them one by one. Example 8. E|X − Y | = 1. then Eg(X.1 (0) . if instead (X. y)f (x. Let X be the number of inspections needed to ﬁnd the ﬁrst defective item and Y the number of additional inspections needed to ﬁnd the second defective item.2.1 . x + y ≤ 1}. EY . y) : x.1 (1) . of (X. Y ) = x. Y ) is given by the following table. m. whenever the probability is nonzero: i\j 1 2 3 4 1 (0) (1) (2) (3) 2 . y ≥ 0. Y ) is a discrete pair with joint probability mass function p. Y ) is a random point in the right triangle {(x. 1 1−x EX = 0 1 dx 0 x 2 dy = 0 2x(1 − x) dx 1 1 − 2 3 = 2 = 1 . Example 8.1 (2) . 3 . y) dxdy. Assume that (X.1 .1 (3) 0 0 0 .1. which lists P (X = i. and so. Compute E|X − Y |.1 (1) 0 0 4 .1 . y)p(x. Y ) with joint density f and another function g of two variables. f.1 The answer is. The joint p. Note that the density is 2 on the triangle. Assume that two among the 5 items are defective. Eg(X. Compute EX. together with |i − j| in parentheses. Y ) = g(x.1 (1) 0 3 . therefore.8 MORE ON EXPECTATION AND LIMIT THEOREMS 87 8 More on Expectation and Limit Theorems Given a pair of random variables (X. Y = j).y g(x.4.

X = I1 + · · · + In . y) dx dy + yf (x. therefore. one can simply proceed by induction. . EY = EX = 3 .8 MORE ON EXPECTATION AND LIMIT THEOREMS 88 1 and. . Expectation is linear and monotone: 1. By the way we deﬁned expectation. Xn whose expected values exist. establishes monotonicity. we have EX ≤ EY . For two random variables X ≤ Y . E(X1 + . by symmetry. For arbitrary random variables X1 . + Xn ) = E(X1 ) + . we can often write a random variable X as a sum of (possibly dependent) indicators. This is a consequence of the same property for two-dimensional integrals: (x + y)f (x. . 12 Linearity and monotonicity of expectation Theorem 8. together with linearity. We emphasize again that linearity holds for arbitrary random variables which do not need to be independent! This is very useful. . . However. E(X + Y ) = EX + EY. it is clear that Z ≥ 0 implies EZ ≥ 0 and. . . the third property is not immediately obvious. 2. An instance of this method is called the indicator trick . applying this to Z = Y − X. . 1 1−x E(XY ) = 0 1 dx 0 xy2 dy (1 − x)2 dx 2 = 0 1 2x = 0 (1 − x)x2 dx = = 1 1 − 3 4 1 . For example. 3. Proof. that is. y) dx dy = xf (x. For constants a and b. Furthermore. E(aX + b) = aE(X) + b. To prove this property for arbitrary n (and the continuous case). We will check the second property for n = 2 and the continuous case. . y) dx dy.1. + E(Xn ).

Instead. EX = i=0 i 7 i . However. 7 7 In (a). 1. EX = 5 · 7 22 .4.5. f. EX = 5 · 22 . X = I1 + I2 + I3 + I4 + I5 . which are then assigned at random. and so EX = 1. one solution is to compute the p. n . . it is clear that 1 if ith ball is red. X = I1 + I2 + . Compute EX. Five married couples are seated around a table at random. and 5 white balls. Let Ii = I{person i receives own gift} . What is EX? This is another problem very well suited for the indicator trick.8 MORE ON EXPECTATION AND LIMIT THEOREMS 89 Example 8. 5. + In . What is EX? 1 . the indicator trick works exactly as before (the fact that Ii are now dependent does 7 not matter) and so the answer is also exactly the same. revisited. Example 8. of X. Example 8. X is Binomial(5. Matching problem. Ii = I{ith ball is red} = In both cases. = EI5 . . . 7 red. EI1 = 1 · P (1st ball is red) = Therefore. Assume n people buy n gifts. . P (X = i) = where i = 0. . Assume that an urn contains 10 black. 7 = EI2 = . Select 5 balls (a) with and (b) without replacement and let X be the number of red balls selected. so we know that EX = 5· 22 . .3. m. EIi = for all i. by additivity. . Let Ii be the indicator of the event that the ith ball is red. but we will not use this knowledge. Let X be the number of wives who sit next to their husbands. 22 ). 0 otherwise. and let X be the number of people who receive their own gift. Then. Moreover. . that is. 22 For (b). and then 5 7 i 15 5−i 22 5 15 5−i 22 5 .

. and so that EX = 2 EIi = . 7 red. i which establishes the limit EN = 1. 2]. [n − 1. . revisited. . . Compute EW and EY . In n general. + Nn . i by comparing the integral with the Riemman sum at the left and right endpoints in the division of [1. Ni is geometric with success probability n−i+1 . + I5 .. 1 ≤ i 1 1 dx ≤ x n−1 i=1 1 . and Y the number of diﬀerent colors. we have n i=2 1 1 1 + + . N1 . i = 1. N2 . 9 10 . 90 Then. . [2. and so n log n ≤ i=1 1 ≤ log n + 1. Sample from n cards.. Select 5 balls (a) with replacement and (b) without replacement. . to get all diﬀerent cards represented. What is EN ? Let Ni be the number the number of additional cards you need to get the ith new card. and let W be the number of white balls selected. ..e. and n N = N1 + . Let N be the number of cards you need to sample for a complete collection. X = I1 + . is trivial.6. . so that EN = n 1 + Now. Afterward. and 5 white balls. . Then. .7. + 2 3 n n . the number of n additional cards needed to get the third new card is Geometric with success probability n−2 . let Ii = I{wife i sits next to her husband} . after you have received the (i − 1)st new card. 3]. the number of cards needed to receive the ﬁrst new card. Coupon collector problem. N3 . with replacement. i. n.8 MORE ON EXPECTATION AND LIMIT THEOREMS Now. 9 Example 8. n] into [1. n→∞ n log n lim Example 8. Assume that an urn contains 10 black. the number of additional cards needed to get the second new card is Geometric with success probability n−1 . as the ﬁrst card you buy is new: N1 = 1. After that. . n]. . indeﬁnitely. .

x + y ≤ 1 1}. Let us return to a random point (X. and Iw be the indicators of the event that. if X and Y are independent. The expectation of the product of independent random variables is the product of their expectations. y) dx dy g(x)h(y)fX (x)gY (y) dxdy g(x)fX (x) dx h(y)fY (y) dy = Eg(X) · Eh(Y ) Example 8. Y ) in the triangle {(x. y ≥ 0. 22 while in the case without replacement EY = 1 − 12 5 22 5 +1− 15 5 22 5 +1− 17 5 22 5 ≈ 2. respectively. Y = Ib + Ir + Iw . we already knew that they were not independent. red. and white balls are represented. i.2. y) : x. and so. in the case with replacement EY = 1 − 125 155 175 5 + 1 − 225 + 1 − 225 ≈ 2.6209. E[g(X)h(Y )] = Eg(X) · Eh(Y ) Proof.5289. Of course. Ir . they cannot be independent. E[g(X)h(Y )] = = = g(x)h(y)f (x. Expectation and independence Theorem 8..8. black.e. Clearly. . We computed that E(XY ) = 12 and that EX = EY = 1 .8 MORE ON EXPECTATION AND LIMIT THEOREMS We already know that EW = 5 · 91 5 22 in either case. thus. The two random variables have 3 E(XY ) = EX · EY . For the continuous case. Let Ib . Multiplicativity of expectation for independent factors.

4 Finally. let us call it g(x) for a moment. and. EY = E(Y |X = x)fX (x) dx. EX = EY = 0. Y ) in the diamond of radius 1. Y ) in the square {(x. Tower property. Y ). therefore. (1. 1). We denote g(X) by E(Y |X). y = y y fY (y|X = x) (continuous case). Here is what we get. 0). X and Y are independent and. that is. 0).3. in the continuous case. in the square with corners at (0.8 MORE ON EXPECTATION AND LIMIT THEOREMS 92 If. . Theorem 8. Clearly. The formula E(E(Y |X)) = EY holds. −1). we have. note again that this is an expression dependent on X and so we can compute its expectation. less mysteriously. but also E(XY ) = = = 1 2 1 2 1 1−|x| dx −1 1 −1 1 −1 xy dy −1+|x| 1−|x| x dx −1+|x| y dy 1 2 = 0. by symmetry. Observe that E(Y |X = x) is a function of x. x dx · 0 This is an example where E(XY ) = EX · EY even though X and Y are not independent. Computing expectation by conditioning For a pair of random variables (X. (0. y) : 0 ≤ x. in the discrete case EY = x E(Y |X = x) · P (X = x). E(XY ) = EX · EY = 1 . y ≤ 1}. instead. we deﬁne the conditional expectation of Y given X = x by E(Y |X = x) = y P (Y = y|X = x) (discrete case). we pick a random point (X. This is the expectation of Y provided the value X is known. pick a random point (X. and (−1.

therefore. Y = y) = EY Example 8. and let N be the number of days in purgatory. so you must in fact choose a door at random each time. but the doors are reshuﬄed each time you come back. therefore. One of them leads to heaven. 2 and. . as we know. and 2. . with the obvious meaning. y) : x. one leads to one day in purgatory. Let X be the number on the die and let Y be the number of Heads. 4 Example 8. Compute the expected number of Heads.8 MORE ON EXPECTATION AND LIMIT THEOREMS 93 Proof. 3 Example 8.9. it does as EX = EY = 1 . E(Y |X) = 1 (1 − X). In particular. Once again. and one leads to two days in purgatory. 2 By deﬁnition. E(number of Heads) = EY 6 = x=1 6 x· x· x=1 1 · P (X = x) 2 1 1 · 2 6 = = 7 . indeed. . . 2 1 E(Y |X = x) = x · . To verify this in the discrete case. and. equal the 2 2 expectation of Y . Given that X = x. Y is distributed uniformly on [0. and the expectation of 1 (1 − X) must. x + y ≤ 1}. we write out the expectation inside the sum: yP (Y = y|X = x) · P (X = x) = x y x y y P (X = x. Roll a die and then toss as many coins as shown up on the die. 1 − x] and so 1 E(Y |X = x) = (1 − x). Y is Binomial(x.11.10. 1. Given that X = x. Y = y) · P (X = x) P (X = x) = x y yP (X = x. After your stay in purgatory is over. Y ) in the triangle {(x. Here is another job interview question. Fix an x ∈ {1. consider a random point (X. You die and are presented with three doors. How long is your expected stay in purgatory? Code the doors 0. you go back to the doors and pick again. 2. 1 ). . y ≥ 0. 6}.

Y ) > 0 intuitively means that. EY = P (B). E(N |your ﬁrst pick is door 1) = 1 + EN. if X and Y are independent.8 MORE ON EXPECTATION AND LIMIT THEOREMS Then 94 E(N |your ﬁrst pick is door 0) = 0. To summarize. Covariance Let X. the covariance is positive. if the events are negatively correlated all inequalities are reversed. we say the two events are positively correlated and. We deﬁne the covariance of (or between) X and Y as Cov(X. but the converse is false. For general random variables X and Y . Y ) = E((X − EX)(Y − EY )) = E(XY − (EX) · Y − (EY ) · X + EX · EY ) = E(XY ) − EX · EY − EY · EX + EX · EY = E(XY ) − EX · EY. in this case. for two events A and B. “on the average. Let X and Y be indicator random variables. Y ) = 0. Y ) = E(XY ) − EX · EY. then Cov(X. and so Cov(X. EX = P (A). Then. Cov(Xi . E(N |your ﬁrst pick is door 2) = 2 + EN. Cov(X.” increasing X will result in larger Y . the most useful formula is Cov(X. If P (B|A) > P (B). Xj ). Variance of sums of random variables Theorem 8. Y ) = P (A ∩ B) − P (A)P (B) = P (A)[P (B|A) − P (B)]. so X = IA and Y = IB . Therefore. i=j Xi ) = i=1 i=1 Var(Xi ) + . Variance-covariance formula: n n E( Var( Xi ) = i=1 n i=1 n 2 EXi2 + i=j E(Xi Xj ). Note immediately that. E(XY ) = E(IA∩B ) = P (A ∩ B). Y be random variables.4. and solving this equation gives 1 1 EN = (1 + EN ) + (2 + EN ) . 3 3 EN = 3.

The ﬁrst formula follows from writing the sum n 2 n Xi i=1 = i=1 Xi2 + i=j Xi Xj and linearity of expectation. . p). Corollary 8. Then. .+Xn ) = Var(X1 )+. If X1 .8 MORE ON EXPECTATION AND LIMIT THEOREMS 95 Proof. Xn are independent. especially when they are indicators.+Var(Xn ). Example 8.12. ESn = np and n Var(Sn ) = i=1 n Var(Ii ) (EIi − (EIi )2 ) i=1 = = n(p − p2 ) = np(1 − p). The second formula follows from the ﬁrst: n n n 2 Var( i=1 Xi ) = E i=1 n Xi − E( i=1 Xi ) 2 = E i=1 n (Xi − EXi ) Var(Xi ) + E(Xi − EXi )E(Xj − EXj ). revisited yet again. . We will compute Var(X). . then Var(X1 +. . The crucial observation is that Sn = n Ii . . The variance-covariance formula often makes computing variance possible even if the random variables are not independent. . Matching problem. where Ii is the indicator I{ith trial is a success} . i=1 Therefore.5. Ii are independent. .13. Example 8. Linearity of variance for independent summands. Let Sn be Binomial(n. X2 . i=j = i=1 which is equivalent to the formula. We will now fulﬁll our promise from Chapter 5 and compute its expectation and variance. Recall that X is the number of people who get their own gift. .

Theorem 8. X is. we promised to attach a precise meaning to the statement: “If you repeat the experiment. independently. Compute Cov(X. as should be expected.6. very close to Poisson with λ = 1. . EX = EY = 10 = 5 . thus. E(Ii Jj ) equals 0 if i = j (because both 5 and 6 cannot 6 3 be rolled on the same roll). Weak Law of Large Numbers. At the beginning of the course. 10 · 9 36 5 2 5 − 2 5 3 2 =− 5 18 Weak Law of Large Numbers Assume that an experiment is performed in which an event A happens with probability P (A) = p. where Ji = I{ith roll is 5} . 96 1 n where Ii = I{ith person gets own gift} . The E(Ii Ij ) above is the probability that the ith person and jth person both get their own gifts 1 and. where Ii = I{ith roll is 6} . and Y = 10 Ji . equals n(n−1) . Observe that X = 10 Ii . a large number of times.) Example 8.” We will now do so and actually prove the more general statement below.14. (In fact. We conclude that Var(X) = 1. Moreover.8 MORE ON EXPECTATION AND LIMIT THEOREMS Recall also that X = n i=1 Ii . Roll a die 10 times. 10 10 EXY = i=1 j=1 E(Ii Jj ) 1 62 = i=j = = and Cov(X. i=1 i=1 Then. the proportion of times A happens converges to p. Y ). Let X be the number of 6’s rolled and Y be the number of 5’s rolled. Therefore. Y ) = is negative. for large n. so that EIi = 1 + n E(Ii Ij ) i=j and E(X 2 ) = n · = 1+ i=j 1 n(n − 1) = 1+1 = 2. and E(Ii Jj ) = 612 if i = j (by independence of diﬀerent rolls).

15. Sn = I1 + .16. So. . if X ≥ a. P (|X − 1| ≥ 0.. . then. then X1 +. as n → ∞. . . a I{X≥a} ≤ Theorem 8. P Sn −p ≥ n → 0. . n P as n → ∞.1. If EX = 1 and X ≥ 0.5) ≤ 0. as we have observed before.5 5 1 81 . the Chebyshev inequality is useful if either σ is small or k is large. P (X ≥ 10) ≤ P (|X − 1| ≥ 9) ≤ Example 8. are independent and identically distributed random variables with ﬁnite expectation and variance. If EX = 1 and Var(X) = 1. the left-hand side is 0 and the right-hand side is nonnegative. + Xn − EX ≥ n → 0. If EX = µ and Var(X) = σ 2 are both ﬁnite and k > 0.8. X1 . if Sn is the number of successes in n independent trials. the proportion of successes converges to p in this sense. Taking the expectation of both sides. + In .17. As the previous two examples show.1. a Indeed. the left-hand side is 1 and the right-hand side is at least 1. if X < a. If X ≥ 0 is a random variable and a > 0. for every > 0. a Example 8. it must be that P (X ≥ 10) ≤ 0.. for any ﬁxed > 0. each of which is a success with probability p. then σ2 P (|X − µ| ≥ k) ≤ 2 . k Example 8. Markov Inequality.7. Proof.1 2 = . Here is the crucial observation: 1 X. Thus.+Xn converges to EX in the sense that. Chebyshev inequality.8 MORE ON EXPECTATION AND LIMIT THEOREMS 97 If X. X2 . then P (X ≥ a) ≤ 1 EX. . X1 + . . . Var(X) = 0. If EX = 1. we get 1 P (X ≥ a) = E(I{X≥a} ) ≤ EX. where Ii = I{success at trial i} . Theorem 8. 2 0. In particular.

n nσ 2 σ2 = 2 → 0.9. with ﬁnite µ = EX and σ 2 = V ar(X). . Proof. . Denote µ = EX. are independent. X1 . P (|X − µ| ≥ k) = P ((X − µ)2 ≥ k 2 ) ≤ 1 1 E(X − µ)2 = 2 Var(X). and let Sn = X1 + . This suggests that X1 +. + EXn = nµ and Var(Sn ) = nσ 2 . Assume that X. Then. by the Chebyshev inequality. . Adding many independent copies of a random variable erases all information about its distribution other than expectation and variance! . . identically distributed random variables. Then. + Xn − µn √ ≤x σ n → P (Z ≤ x).+Xn converges to EX at the rate of n about 1 √ . however. A careful examination of the proof above will show that it remains valid if depends on n.. Observe. P X1 + . Therefore. that it is a remarkable theorem: the random variables Xi have an arbitrary distribution (with given expectation and variance) and the theorem says that their sum approximates a very particular distribution. the normal one. where Z is standard Normal. X2 . σ 2 = Var(X). By the Markov inequality. P (|Sn − nµ| ≥ n ) ≤ as n → ∞. + Xn . . Central limit theorem.8 MORE ON EXPECTATION AND LIMIT THEOREMS 98 Proof. ESn = EX1 + . k2 k We are now ready to prove the Weak Law of Large Numbers. 1 but goes to 0 slower than √n . but will later give good indication as to why it holds. . Central Limit Theorem Theorem 8.. . We will not prove this theorem in full detail. as n → ∞. n2 2 n We will make this statement more precise below. . .

. ﬁnd n so that P (Sn ≥ 50) ≥ 0. n 2 − 50 n· 1 12 − 50 n· 1 12 = Φ = 0.99. Roll a fair die: • if you roll 6.99 for (approximately) z = 2. For promotion. .18.993 = 0.+Xn . you get free entrance and $2. 1]. . Assume that Xn are independent and uniform on [0. the convergence is not very fast. − 50 n· 1 12 = 0. ≤ 90 − 200 · 200 · 1 12 1 2 ≈ 1 − 0.99.007 For (b). σ3 n Example 8.000 “guests” the following game.99.4785 · . A casino charges $1 for entrance. the current version of the celebrated Berry-Esseen theorem states that an upper bound on the diﬀerence between the two probabilities in the Central limit theorem is E|X − µ|3 √ 0.345 n − 100 = 0 and that n = 115. Example 8. Let Sn = X1 +. Using the fact that Φ(z) = 0.326. they oﬀer to the ﬁrst 30. (a) Compute approximately P (S200 ≤ 90).19. we rewrite P and then approximate Sn − n · n· P Z ≥ − or P Z ≤ n 2 n 2 1 12 1 2 ≥ 50 − n · n· 1 12 1 2 = 0. P (S200 ≤ 90) = P S200 − 200 · 1 2 1 200 · 12 √ ≈ P (Z ≤ − 6) √ = 1 − P (Z ≤ 6) 1 2 and Var(Xi ) = 1 3 − 1 4 = 1 12 . (b) Using the approximation.8 MORE ON EXPECTATION AND LIMIT THEOREMS 99 On the other hand.99. We know that EXi = For (a). we get that √ n − 1.

Select three balls in three successive steps without replacement. Let X be the total number of white balls selected and Y the step in which you selected the ﬁrst black ball. and P (Xi = 3) = 1 . we need to ﬁnd s so that P (L ≤ s) = 0.9. 245. Y = 2.8 MORE ON EXPECTATION AND LIMIT THEOREMS • if you roll 5. and P (Xi = 0) = P (Xi = 1) = 1 . then X = 1. n· and ﬁnally. We have L = X1 + · · · + Xn . where n = 30. 9 ·n 11 9 ≤ 2 3 s− 2 3 ·n 11 9 n· n· ·n 11 9 ≈ P Z ≤ which gives s− s− = 0. 6 6 EXi = and Var(Xi ) = Therefore. .9. For example. n· 2 3 ·n 11 9 ≈ 1. Therefore. Xi are independent.9. if the selected balls are white. 000. black. • otherwise. you get free entrance. 9 Problems 1. 2 3 2 3 2 1 9 + − 6 6 = 11 .28. black. you pay the normal fee. Compute E(XY ).28 3 n· 11 ≈ 20. An urn contains 2 white and 4 black balls. 2 s ≈ n + 1. In symbols. 100 Compute the number s so that the revenue loss is at most s with probability 0. if L is lost revenue. P (L ≤ s) = P L− 2 3 4 6.

Y = 1) = P (bwb or bbw) = . Distribute the cards at random to 13 players. so that each gets 4 cards. f. Each bird looks left or right with equal probability. Using the indicator trick. 5 1 P (X = 1. Y = 1) = P (bbb) = . 3.8 MORE ON EXPECTATION AND LIMIT THEOREMS 101 2. 6. We have 1 P (X = 0. Roll a fair die 24 times. There are 20 birds that sit in a row on a wire. First we determine the joint p. compute EN . Solutions to problems 1. The joint density of (X. Compute. Compute EN . Y = 2) = P (wbb) = . 5 1 P (X = 2. 15 1 P (X = 2. m. Y ). 5. 15 so that E(XY ) = 1 · 2 1 1 1 1 8 +2· +2· +4· +6· = . Five married couples are seated at random in a row of 10 seats. Y ). Let N be the number of birds not seen by any neighboring bird. Compute Cov(X. the probability that the sum of numbers exceeds 100. (a) Compute the expected number of women that sit next to their husbands. of (X. 13 cards of each of the four suits. 15 1 P (X = 2. Y = 3) = P (wwb) = . Y ) is given by f (x. 0 otherwise. using a relevant approximation. Y = 2) = P (wbw) = . 5 2 P (X = 1. Y = 1) = P (bww) = . Recall that a full deck of cards contains 52 cards. 5 5 15 15 15 5 . 4. y) = 3x if 0 ≤ y ≤ x ≤ 1. Let N be the number of players whose four cards are of the same suit. (b) Compute the expected number of women that sit next to at least one man.

By the indicator trick 4 EN = 2 · 1 1 11 + 18 · = . + EI5 = 5 EI1 = 2 5 8 4 3 · + · 1− · 10 9 10 9 8 7 = . . . (b) Let the number be N . For the two birds at either end. and • after choosing a suit. 9 35 . so that N = I1 + . Let Ii = I{woman i sits next to a man} . the number of ways to select 4 cards of that suit is 13 4 52 4 . (a) Let the number be M . + EI5 = 5 EI1 = 1. Then EIi = and so EM = EI1 + . . 2 4 2 5. We have 3 3x3 dx = . 3. .8 MORE ON EXPECTATION AND LIMIT THEOREMS 102 2. Y ) = − 3 4 · 3 8 = 3 160 . Let Ii = I{couple i sits together} . while for any other 2 bird this probability is 1 . Observe that: • the number of ways to select 4 cards from a 52 card deck is • the number of choices of a suit is 4.. 10! 5 4. 4 0 0 0 1 x 1 3 3 3 dx y · 3x dy = EX = x dx = . E(XY ) = dx xy · 3x dy = 10 0 0 0 2 EX = dx x · 3x dy = 3 10 1 x 1 so that Cov(X. the probability that it is not seen is 1 . 9 9! 2! 1 = . whereby the woman either sits on one of the two end chairs or on one of the eight middle chairs. . + I13 . Let Ii = I{player i has four cards of the same suit} .. Then. . 2 8 0 0 0 1 x 1 3 3 4 x dx = . by dividing into cases. EIi = and so EM = EI1 + .

So. 6. for all i. We know that EXi = 7 .032. we have 2 12 7 7 100 − 24 · 2 S24 − 24 · 2 ≈ P (Z ≥ 1. be the numbers on successive rolls and Sn = X1 + . P (S24 ≥ 100) = P ≥ 35 35 24 · 12 24 · 12 .85) ≈ 0. . . . + Xn the sum.85) = 1 − Φ(1. and Var(Xi ) = 35 . X2 . .8 MORE ON EXPECTATION AND LIMIT THEOREMS 103 Therefore. Let X1 . EIi = and EN = 4· 13 4 52 4 4· 13 4 52 4 · 13. .

000. gets married while drunk with probability 1/1.8 MORE ON EXPECTATION AND LIMIT THEOREMS 104 Interlude: Practice Final This practice exam covers the material from chapters 1 through 8. (a) Compute the probability that the ﬁrst four cards selected are all hearts (♥). 000 single male tourists visit Las Vegas every year. Y = j) for all relevant i and j. and 5 Finns are seated in a row of 11 chairs at random. (For example. You win $2 if both tosses comes out Heads. 5. (c) Compute the expected number of diﬀerent suits among the ﬁrst four cards selected. without replacement. the Swedes occupy adjacent seats. Let Y be the number of Heads in the second coin toss. Eleven Scandinavians: 2 Swedes. 1.. Assume that 2. Y is automatically 0. 4. if X = 0. 2. (b) Compute the probability that all suits are represented among the ﬁrst four cards selected. lose $1 if no toss comes out Heads. as many times as the number of Heads in the ﬁrst coin toss. ♠). (d) Compute the expected number of cards you have to select to get the ﬁrst hearts card. (b) Compute the probability that at least one of the groups sits together. toss the second coin twice and count the number of Heads to get Y . Give yourself 120 minutes to solve the six problems. independently. if X = 2. which you may assume have equal point score. (a) Compute the probability that all groups sit together (i. that is. Select cards from the deck at random. ♥. 3. 4 Norwegians. (b) Compute the expected number of such drunk marriages in the next 10 years. 000. Then toss the second coin X times.e. ♦. (a) What is the expected number of games you need to play to win once? . Recall that a full deck of cards contains 13 cards of each of the four suits (♣. as do the Norwegians and Finns). (b) Compute P (X ≥ 2 | Y = 1). Toss the ﬁrst coin three times and let X be the number of Heads. (a) Write down the exact probability that exactly 3 male tourists will get married while drunk next year. one by one.) (a) Determine the joint probability mass function of X and Y . (d) Write down an approximate expression for the probability that there will be no such drunk marriage during at least one of the next 3 years. 000. (c) Compute the probability that the two Swedes have exactly one person sitting between them. that is. Assume also that each of these tourists. (c) Write down a relevant approximate expression for the probability in (a). and win or lose nothing otherwise. You have two fair coins. write down a formula for P (X = i. Toss a fair coin twice.

5. . Here and in (d): when you get to a single integral involving powers. do the same with probability 0. 1 n 0 x dx = 1 n+1 .9. approximately. 1]. Here and in (b): use (b) Find Var(X + Y ). for n > −1. Compute (approximately) the amount of money x such that your winnings will be at least x with probability 0. (d) Find E|X − Y |. (a) Find the value of c. Two random variables X and Y are independent and have the same probability density function c(1 + x) x ∈ [0. assume that you play this game 500 times. the probability that you win at least $135? (c) Again. stop. g(x) = 0 otherwise. What is.8 MORE ON EXPECTATION AND LIMIT THEOREMS 105 (b) Assume that you play this game 500 times. (c) Find P (X + Y < 1) and P (X + Y ≤ 1). Then. 6.

8

MORE ON EXPECTATION AND LIMIT THEOREMS

106

**Solutions to Practice Final
**

1. Recall that a full deck of cards contains 13 cards of each of the four suits (♣, ♦, ♥, ♠). Select cards from the deck at random, one by one, without replacement. (a) Compute the probability that the ﬁrst four cards selected are all hearts (♥). Solution:

13 4 52 4

(b) Compute the probability that all suits are represented among the ﬁrst four cards selected. Solution: 134

52 4

(c) Compute the expected number of diﬀerent suits among the ﬁrst four cards selected. Solution: If X is the number of suits represented, then X = I♥ + I♦ + I♣ + I♠ , where I♥ = I{♥ is represented} , etc. Then, EI♥ = 1 −

39 4 52 4

,

**which is the same for the other three indicators, so EX = 4 EI♥ = 4 1−
**

39 4 52 4

.

(d) Compute the expected number of cards you have to select to get the ﬁrst hearts card. Solution: Label non-♥ cards 1, . . . , 39 and let Ii = I{card i selected before any ♥ card} . Then, EIi = 1 14 for any i. If N is the number of cards you have to select to get the ﬁrst hearts card, then 39 EN = E (I1 + · · · + EI39 ) = . 14

8

MORE ON EXPECTATION AND LIMIT THEOREMS

107

2. Eleven Scandinavians: 2 Swedes, 4 Norwegians, and 5 Finns are seated in a row of 11 chairs at random. (a) Compute the probability that all groups sit together (i.e., the Swedes occupy adjacent seats, as do the Norwegians and Finns). Solution: 2! 4! 5! 3! 11! (b) Compute the probability that at least one of the groups sits together. Solution: Deﬁne AS = {Swedes sit together} and, similarly, AN and AF . Then, P (AS ∪ AN ∪ AF ) = P (AS ) + P (AN ) + P (AF ) − P (AS ∩ AN ) − P (AS ∩ AF ) − P (AN ∩ AF ) + P (AS ∩ AN ∩ AF ) 2! 10! + 4! 8! + 5! 7! − 2! 4! 7! − 2! 5! 6! − 4! 5! 3! + 2! 4! 5! 3! = 11!

(c) Compute the probability that the two Swedes have exactly one person sitting between them. Solution: The two Swedes may occupy chairs 1, 3; or 2, 4; or 3, 5; ...; or 9, 11. There are exactly 9 possibilities, so the answer is 9 9 11 = 55 . 2 3. You have two fair coins. Toss the ﬁrst coin three times and let X be the number of Heads. Then, toss the second coin X times, that is, as many times as you got Heads in the ﬁrst coin toss. Let Y be the number of Heads in the second coin toss. (For example, if X = 0, Y is automatically 0; if X = 2, toss the second coin twice and count the number of Heads to get Y .)

8

MORE ON EXPECTATION AND LIMIT THEOREMS

108

(a) Determine the joint probability mass function of X and Y , that is, write down a formula for P (X = i, Y = j) for all relevant i and j. Solution: P (X = i, Y = j) = P (X = i)P (Y = j|X = i) = for 0 ≤ j ≤ i ≤ 3. (b) Compute P (X ≥ 2 | Y = 1). Solution: This equals 5 P (X = 2, Y = 1) + P (X = 3, Y = 1) = . P (X = 1, Y = 1) + P (X = 2, Y = 1) + P (X = 3, Y = 1) 9 4. Assume that 2, 000, 000 single male tourists visit Las Vegas every year. Assume also that each of these tourists independently gets married while drunk with probability 1/1, 000, 000. (a) Write down the exact probability that exactly 3 male tourists will get married while drunk next year. Solution: With X equal to the number of such drunk marriages, X is Binomial(n, p) with p = 1/1, 000, 000 and n = 2, 000, 000, so we have P (X = 3) = n 3 p (1 − p)n−3 3 3 1 i 1 · , i 23 j 2i

(b) Compute the expected number of such drunk marriages in the next 10 years. Solution: As X is binomial, its expected value is np = 2, so the answer is 10 EX = 20. (c) Write down a relevant approximate expression for the probability in (a). Solution: We use that X is approximately Poisson with λ = 2, so the answer is λ3 −λ 4 −2 e = e . 3! 3

8

MORE ON EXPECTATION AND LIMIT THEOREMS

109

(d) Write down an approximate expression for the probability that there will be no such drunk marriage during at least one of the next 3 years. Solution: This equals 1−P (at least one such marriage in each of the next 3 years), which equals 1 − 1 − e−2

3

.

5. Toss a fair coin twice. You win $2 if both tosses comes out Heads, lose $1 if no toss comes out Heads, and win or lose nothing otherwise. (a) What is the expected number of games you need to play to win once? Solution: The probability of winning in random variable, is 4.

1 4.

The answer, the expectation of a Geometric

1 4

(b) Assume that you play this game 500 times. What is, approximately, the probability that you win at least $135? Solution: Let X be the winnings in one game and X1 , X2 , . . . , Xn the winnings in successive games, with Sn = X1 + . . . + Xn . Then, we have EX = 2 · and Var(X) = 4 · Thus, P (Sn ≥ 135) = P Sn − n · n·

19 16 1 4

1 1 1 −1· = 4 4 4 1 4

2

1 1 +1· − 4 4

=

19 . 16 135 − n · n·

19 16 1 4

≥

135 − n · n·

19 16

1 4

≈ P Z ≥

,

where Z is standard Normal. Using n = 500, we get the answer 10 . 1 − Φ 500 · 19 16

do the same with probability 0. therefore.28 · (This gives x = 93. 16 = 1 n+1 . for n > −1.9. Here and in (b): use Solution: As 1=c· 0 1 n 0 x dx 500 · 19 . 1 3 (1 + x) dx = c · . Solution: . leading to the equation 125 − x = 1. Two random variables X and Y are independent and have the same probability density function c(1 + x) x ∈ [0. x = 125 − 1. For probability 0.28 19 500 · 16 and. Then. 1]. Solution: For probability 0. assume that you play this game 500 times.9 at z ≈ 1.5. P (Sn ≥ x) ≈ P Z ≥ 19 500 · 19 500 · 19 500 · 16 16 16 where we have used that x < 125.8 MORE ON EXPECTATION AND LIMIT THEOREMS 110 (c) Again. (a) Find the value of c.) 6. 2 we have c = 2 3 (b) Find Var(X + Y ).5.9. g(x) = 0 otherwise. the answer is exactly ESn = n · 1 = 125. Then. Compute (approximately) the amount of money x such that your winnings will be at least x with probability 0. we use that Φ(z) = 0. we 4 approximate 125 − x 125 − x x − 125 = P Z ≤ = Φ .28.

8

**MORE ON EXPECTATION AND LIMIT THEOREMS By independence, this equals 2Var(X) = 2(E(X 2 ) − (EX)2 ). Moreover, 2 3 2 E(X 2 ) = 3 E(X) = and the answer is
**

13 81 . 1 0

111

5 x(1 + x) dx = , 9

1

x2 (1 + x) dx =

0

7 , 18

(c) Find P (X + Y < 1) and P (X + Y ≤ 1). Here and in (d): when you get to a single integral involving powers, stop. Solution: The two probabilities are both equal to 2 3 2 3

2 0 2 0 1 1 1−x

dx

0

(1 + x)(1 + y) dy = (1 − x)2 2 dx.

(1 + x) (1 − x) +

**(d) Find E|X − Y |. Solution: This equals 2 3 2 3
**

2 1 x

·2

0 2 1

dx

0

(x − y)(1 + x)(1 + y) dy = x2 2 − (1 + x) x2 x3 + 2 3 dx.

·2

0

x(1 + x) x +

9

CONVERGENCE IN PROBABILITY

112

9

Convergence in probability

One of the goals of probability theory is to extricate a useful deterministic quantity out of a random situation. This is typically possible when a large number of random eﬀects cancel each other out, so some limit is involved. In this chapter we consider the following setting: given a sequence of random variables, Y1 , Y2 , . . ., we want to show that, when n is large, Yn is approximately f (n), for some simple deterministic function f (n). The meaning of “approximately” is what we now make clear.

A sequence Y1 , Y2 , . . . of random variables converges to a number a in probability if, as n → ∞, P (|Yn − a| ≤ ) converges to 1, for any ﬁxed > 0. This is equivalent to P (|Yn − a| > ) → 0 as n → ∞, for any ﬁxed > 0. Example 9.1. Toss a fair coin n times, independently. Let Rn be the “longest run of Heads,” i.e., the longest sequence of consecutive tosses of Heads. For example, if n = 15 and the tosses come out HHTTHHHTHTHTHHH, then Rn = 3. We will show that, as n → ∞, Rn → 1, log2 n in probability. This means that, to a ﬁrst approximation, one should expect about 20 consecutive Heads somewhere in a million tosses. To solve a problem such as this, we need to ﬁnd upper bounds for probabilities that Rn is large and that it is small, i.e., for P (Rn ≥ k) and P (Rn ≤ k), for appropriately chosen k. Now, for arbitrary k, P (Rn ≥ k) = P (k consecutive Heads start at some i, 0 ≤ i ≤ n − k + 1)

n−k+1

= P(

i=1

{i is the ﬁrst Heads in a succession of at least k Heads}) 1 . 2k

≤ n·

For the lower bound, divide the string of size n into disjoint blocks of size k. There are such blocks (if n is not divisible by k, simply throw away the leftover smaller block at the end). Then, Rn ≥ k as soon as one of the blocks consists of Heads only; diﬀerent blocks are independent. Therefore,

n k

P (Rn < k) ≤

1 1− k 2

n k

≤ exp −

1 n 2k k

,

9

CONVERGENCE IN PROBABILITY

113

using the famous inequality 1 − x ≤ e−x , valid for all x. Below, we will use the following trivial inequalities, valid for any real number x ≥ 2: x ≥ x − 1, x ≤ x + 1, x − 1 ≥ x , and x + 1 ≤ 2x. 2 To demonstrate that (1) (2) as P Rn −1 ≥ log2 n Rn Rn ≥ 1 + or ≤1− log2 n log2 n Rn Rn = P ≥1+ +P ≤1− log2 n log2 n = P (Rn ≥ (1 + ) log2 n) + P (Rn ≤ (1 − ) log2 n) . = P

Rn log2 n

→ 1, in probability, we need to show that, for any P (Rn ≥ (1 + ) log2 n) → 0, P (Rn ≤ (1 − ) log2 n) → 0,

> 0,

A little fussing in the proof comes from the fact that (1 ± ) log2 n are not integers. This is common in such problems. To prove (1), we plug k = (1 + ) log2 n into the upper bound to get P (Rn ≥ (1 + ) log2 n) ≤ n · = n· = 1 2(1+ ) log2 n−1 2 n1+

2 → 0, n

as n → ∞. On the other hand, to prove (2) we need to plug k = (1 − ) log2 n + 1 into the lower bound, P (Rn ≤ (1 − ) log2 n) P (Rn < k) 1 n ≤ exp − k 2 k 1 n ≤ exp − k −1 2 k 1 1 n ≤ exp − · 1− · 32 n (1 − ) log2 n 1 n = exp − 32 (1 − ) log2 n → 0, ≤

as n → ∞, as n is much larger than log2 n.

9

CONVERGENCE IN PROBABILITY

114

The most basic tool in proving convergence in probability is the Chebyshev inequality: if X is a random variable with EX = µ and Var(X) = σ 2 , then P (|X − µ| ≥ k) ≤ σ2 , k2

for any k > 0. We proved this inequality in the previous chapter and we will use it to prove the next theorem. Theorem 9.1. Connection between variance and convergence in probability. Assume that Yn are random variables and that a is a constant such that EYn → a, Var(Yn ) → 0, as n → ∞. Then, Yn → a, as n → ∞, in probability. Proof. Fix an > 0. If n is so large that |EYn − a| < /2, then P (|Yn − a| > ) ≤ ≤ P (|Yn − EYn | > /2) Var(Yn ) 4 2

→ 0, as n → ∞. Note that the second inequality in the computation is the Chebyshev inequality. This is most often applied to sums of random variables. Let Sn = X1 + . . . + Xn , where Xi are random variables with ﬁnite expectation and variance. Then, without any independence assumption, ESn = EX1 + . . . + EXn and

2 E(Sn ) = n

EXi2 +

i=1 n i=j

E(Xi Xj ), Cov(Xi , Xj ).

i=j

Var(Sn ) =

i=1

Var(Xi ) +

Example 9. as n → ∞. Weak law of large numbers. + Xn ) = Var(X1 ) + . Proof. . We have EYn = µ and Var(Yn ) = 1 1 σ2 .8 and wipes it out with probability 0. let us reformulate and prove again the most famous convergence in probability theorem. Then. d. Var(Sn ) = 2 n σ 2 = n2 n n Thus. Xj ) = E(Xi Xj ) − EXi EXj and Var(aX) = a2 Var(X). i. . X2 . + Var(Xn ). . X1 .2. . We will try to maximize the return to our investment by “hedging. + Xn . and reinvest with the same proportion x. for independent identically distributed random variables. then.5s with probability 0. Assume that our initial capital is x0 . Let X. . we invest. . Assume that you have two investment choices at the beginning of each year: • a risk-free “bond” which returns 6% per year. a ﬁxed proportion x of our current capital into the stock and the remaining proportion 1 − x into the bond. and • a risky “stock” which increases your investment by 50% with probability 0.06s after a year. Var(X1 + . .2. at the beginning of each year. if Xi are independent. Moreover. Let Sn = X1 + .” That is. Putting an amount s in the bond. We will assume year-to-year independence of the stock’s return. We analyze a typical “investment” (the accepted euphemism for gambling on ﬁnancial markets) problem. Continuing with the review. We collect the resulting capital at the end of the year. which is simultaneously the beginning of next year.8 and 0 with probability 0. The same amount in the stock gives you 1. we can simply apply the previous theorem.2s > 1.8 · 1. be i. . We will use the common abbreviation i.9 CONVERGENCE IN PROBABILITY 115 Recall that Cov(X1 .06s. note that the expected value is 0. . Let Yn = Sn n .5s = 1. Theorem 9.2. . d. random variables with with EX = µ and Var(X) = σ 2 < ∞. i. gives you 1.2. Sn →µ n in probability.

. .1%. n n and so ENn = n · 1 − 1 n n . These are independent indicators with EIi = 0.06(1 − x)).8 log(1. Note that Nn = I1 + . The last expression deﬁnes λ as a function of x. = n→∞ 1 Xn log . Therefore. Example 9. but we cannot use the weak law of large numbers as Ii are not independent.8 · 0.06 + 0.06(1 − x) + 1.44x) + 0. as n → ∞. n−1 n 1 n EIi = = 1− . where Ii = I{ith box is empty} . Let Nn be the number 1 of empty boxes.06 + Xn−1 · x · 1.44x) + 1 − log(1. n x0 in probability. .06(1 − x) + 1. To maximize this.44 = . we will show that it is a nonrandom quantity.06(1 − x)) n n → 0.06 + 0.06 + 0.5x)Sn ((1 − x)1. .44x 1−x The solution is x = 7 22 .5x · In ) and so we can unroll the recurrence to get Xn = x0 (1. but by using this strategy you will eventually lose everything. Distribute n balls independently at random into n boxes.5 · In = Xn−1 (1. .9 CONVERGENCE IN PROBABILITY 116 It is important to note that the expected value of the capital at the end of the year is maximized when x = 1. Deﬁne the average growth rate of your investment as λ = lim so that Xn ≈ x0 eλn . We will express λ in terms of x. which gives λ ≈ 8. we set dλ = 0 to get dx 0. + In . where Sn = I1 + . 1.06)n−Sn . Show that n Nn converges in probability and identify the limit.8.3. + In . Let Ii = I{stock goes up in year i} . 1 Xn log n x0 Sn Sn log(1. Nevertheless. Xn = Xn−1 (1 − x) · 1.2 log(1.2 0. Let Xn be your capital at the end of year n. in particular.

with replacement. as n → ∞. Nn → e−1 . Let N be the number of birds not seen by any neighboring bird. Assume Kings win every game independently with probability p. 2 E(Nn ) = ENn + i=j E(Ii Ij ) n with E(Ii Ij ) = P (box i and j are both empty) = so that Var(Nn ) = 2 E(Nn ) n−2 n n . as n → ∞. which means they play until one team wins four games. Let T be the number of couples that sit together. and Var(Yn ) = 1 n 1− 1 n n + n−1 n 1− 2 n n − 1− 1 n 2n → 0 + e−2 − e−2 = 0. Determine. n Problems 1. Kings and Lakers are playing a “best of seven” playoﬀ series. with 1 proof. Therefore. We have EYn → e−1 . Find a sequence an so that. There are n birds that sit in a row on a wire. Each bird looks left or right with equal probability. Recall the coupon collector problem: sample from n cards. 1 − (ENn ) = n 1 − n 2 n 2 + n(n − 1) 1 − n −n 2 1 1− n 2n . the constant c so that. in probability. and let N be the number of cards you need to get so that each of n diﬀerent cards are represented. Yn = as n → ∞.9 CONVERGENCE IN PROBABILITY 117 Moreover. 1 Now. indeﬁnitely. . N/an converges to 1 in probability. as n → ∞. 4. Assume that n married couples (amounting to 2n people) are seated at random on 2n seats around a round table. let Yn = n Nn . 2. as n → ∞. 3. Determine ET and Var(T ). n N → c in probability.

Solutions to problems 1. 4 1 nN. 2 22 (2n − 3)! 4 EIi = . 2 2n − 1 (2n − 1) (2n − 1)2 1 2 2. Then N = N1 + . E(Ii Ij ) = = . Ij ) = 0 if |i − j| ≥ 3. Cov(Ii . Then. An urn contains n red and m black balls. 4 4 if i = 1 or Furthermore. 1 4 Clearly. Ij ) ≤ 1 for any i and j. . As Ii and Ij are indicators. EIi is i = n and 1 otherwise. . and Ni are independent Geometric with success probability . Compute EN and Var(N ). Let Ii indicate the event that bird i is not seen by any other bird. (2n − 1)(2n − 2) 2n − 1 4n 4n2 4n(n − 1) − = . Ij ) ≤ n + 4n = 5n. Thus. Moreover. Ii and Ij are independent if |i − j| ≥ 3 (two birds that have two or more birds between them are observed independently). Then. Var(N ) = i Var(Ii ) + i=j 1 n EN Cov(Ii . ET = and E(T 2 ) = ET + n(n − 1) so Var(T ) = 2n 2n − 1 4 4n = . Compute EX and EY .) Let N be the number of games played. Let Ii be the indicator of the event that the ith couple sits together. For the same reason. Therefore. It follows that 3. Var(Ii ) ≤ 1. Let Ni be the number of coupons needed to get i diﬀerent coupons after having i − 1 diﬀerent ones. Thus. if M = c = 1. then EM = → and Var(M ) = 1 Var(N ) n2 → 0. and let Y be the number of red balls between the ﬁrst and the second black one. 5. Let X be the number of red balls selected before any black ball. 2n − 1 (2n − 1)! (2n − 1)(2n − 2) for any i and j = i. Cov(Ii . T = I1 + · · · + In . It follows that 4 EN = 1 + n−2 n+2 = .9 CONVERGENCE IN PROBABILITY 118 (There is no diﬀerence between home and away games. Select balls from the urn one by one without replacement. + Nn .

EX = m+1 . EI1 = EI2 = EI3 = EI4 = 1. 1+ 1 1 + . 4. Imagine the balls ordered in a row where the ordering speciﬁes the sequence in which they are selected. therefore. Now. Then.9 CONVERGENCE IN PROBABILITY So.. ENi = n . . As X = I1 + . 6 Var(N ) = i=1 n(i − 1) ≤ n2 (n − i + 1)2 1 EN → 1. then 1 EN → 0. . so that 1 N →1 an in probability. the probability that in a random ordering of the ith red ball and all m n black balls. (n − i + 1)2 119 n−i+1 n . + 2 2 2 n ≤ n2 π2 < 2n2 . so that E(Ii Ij ) = EIi .. so EJi = EIi . and the ﬁnal result can be obtained by plugging in EIi and by the standard formula Var(N ) = E(N 2 ) − (EN )2 . a2 n as n → ∞. EI7 = 6 3 p (1 − p)3 . To compute E(N 2 ). 5. let Ji be the indicator of the event that the ith red ball is selected between the ﬁrst and the second black one. So. the red comes ﬁrst. EI5 = 1 − p4 − (1 − p)4 . we use the fact that Ii Ij = Ii if i > j. + In . + 2 n . EN = n 1 + n 1 1 + . 3 Add the seven expectations to get EN . Let Ii be the indicator of the event that the ith red ball is selected before any black 1 ball. n−i+1 Var(Ni ) = n(i − 1) . and EY = EX. 7 EN = i 2 EIi + 2 i>j E(Ii Ij ) = i EIi + 2 i (i − 1)EIi = i=1 (2i − 1)EIi . Then. EIi = m+1 . Then. and. Let Ii be the indicator of the event that the ith game is played. an If an = n log n. EJi is the probability that the red ball is second in the ordering of the above m + 1 balls.. EI6 = 1 − p5 − 5p4 (1 − p) − 5p(1 − p)4 − (1 − p)5 . ..

. . Note also that d E(etX )|t=0 = EX. 1 . Example 10.10 MOMENT GENERATING FUNCTIONS 120 10 Moment generating functions If X is a random variable. Here is where the name comes from: by writing its Taylor expansion in place of etX and exchanging the sum and the integral (which can be done in many cases) 1 1 E(etX ) = E[1 + tX + t2 X 2 + t3 X 3 + . 1−t ∞ etx e−x dx = only when t < 1. mk = E(X k ). then its moment generating function is φ(t) = φX (t) = E(etX ) = tx x e P (X = x) ∞ tx −∞ e fX (x) dx in the discrete case. the integral diverges and the moment generating function does not exist. fX (x) = Then. Assume that X is an Exponential(1) random variable.] 2 3! 1 2 1 2 = 1 + tE(X) + t E(X ) + t3 E(X 3 ) + . dt2 which lets us compute the expectation and variance of a random variable once we know its moment generating function. Example 10. . Have in mind that the moment generating function is meaningful only when the integral (or the sum) converges. in the continuous case. 2 3! The expectation of the k-th power of X. . Otherwise. φ(t) = 0 e−x 0 x > 0. In combinatorial language. x ≤ 0. φ(t) is the exponential generating function of the sequence mk . that is. is called the k-th moment of x. dt d2 E(etX )|t=0 = EX 2 .2. Compute the moment generating function for a Poisson(λ) random variable. .1.

from the ﬁrst to the second line. . . . . X2 . . + Xn . φXn (t). Proof. Compute the moment generating function for a standard Normal random variable. · etXn ] = E[etX1 ] · E[etX2 ] · . 1 1 1 tx − x2 = − (−2tx + x2 ) = ((x − t)2 − t2 ). φSn (t) = (et p + 1 − p)n . Example 10. Here we have Sn = so that n k=1 Ik . p) random variable. Compute the moment generating function of a Binomial(n. . . where. · E[etXn ]. we have used. then φSn (t) = (φX (t))n . 2 2 2 Lemma 10.10 MOMENT GENERATING FUNCTIONS By deﬁnition. Xn are independent and Sn = X1 + . . in the exponent.1. . then φSn (t) = φX1 (t) . where the indicators Ik are independent and Ik = I{success on kth trial} . . This follows from multiplicativity of expectation for independent random variables: E[etSn ] = E[etX1 · etX2 · . ∞ 121 φ(t) = n=0 etn · ∞ −λ n=0 −λ+λet t −1) λn −λ e n! (et λ)n n! = e = e = eλ(e .3. . . φX (t) = = ∞ 1 2 √ etx e−x /2 dx 2π −∞ 1 1 2 ∞ − 1 (x−t)2 √ e2t e 2 dx 2π −∞ 1 2 = e2t .4. Example 10. If Xi is identically distributed as X. . By deﬁnition. If X1 .

It turns out that this probability is.5. and that we need to ﬁnd an upper bound for P (Sn ≥ 4. Proof. We need to show that I(a) > 0 when a > µ. use the Chebyshev inequality to get an upper bound of order n . so that Φ(t) > 0 for some small enough positive t. Example 10.5. i. then. for n = 100 and n = 1000.10 MOMENT GENERATING FUNCTIONS 122 Why are moment generating functions useful? One reason is the computation of large deviations. of the function 4. Find an upper bound for the probability that Sn exceeds its expectation by at least n. Roll a fair die n times and let Sn be the sum of the numbers you roll. . more precisely. the theorem below gives an upper bound that is a much better estimate. with expectation EX = µ and moment generating function φ. assuming that one can diﬀerentiate inside the integral sign (which one can in this case.5 t − log φ(t). whose graph is in the ﬁgure below..5 and ESn = 3. note that Φ(t) = at − log φ(t) satisﬁes Φ(0) = 0 and. Theorem 10.2. φ(t) φ(t) eit = i=1 et (e6t − 1) 6(et − 1) and we need to compute I(4. P (Sn ≥ an) ≤ exp(−n I(a)). Let Sn = X1 + · · · + Xn .e. Note that t > 0 is arbitrary. by deﬁnition. over t > 0. a = 4. For any t > 0. For any a > µ. is the maximum. which. We ﬁt this into the above theorem: observe that µ = 3. where a > µ. for large n.5 n). where I(a) = sup{at − log φ(t) : t > 0} > 0. At issue is the probability that Sn is far away from its expectation nµ. Φ (0) = a − µ > 0. Moreover. Assume that φ(t) is ﬁnite for some t > 0. 1 of course. Φ (t) = a − and. We can. where Xi are independent and identically distributed as X. much smaller.5 n. using the Markov inequality. P (Sn > an).5). 1 φ(t) = 6 6 E(XetX ) φ (t) =a− . so we can optimize over t to get what the theorem claims. For this. but proving this requires abstract analysis beyond our scope). Large deviation bound . P (Sn ≥ an) = P (etSn −tan ≥ 1) ≤ E[etSn −tan ] = e−tan φ(t)n = exp (−n(at − log φ(t))) .

1. Theorem 10. 1. This gives the upper bound P (Sn ≥ 4. . . .1 0.2 0 0. Example 10.1 −0.6 0. .3 0.178.15 −0. and 4. We have n independent random variables X1 . .2 0. . Here is the situation. I(4.05 −0. φXn (t) = eλn (e t −1) t .7 0. then P (X ≤ x) = P (Y ≤ x) for all x. and P (X ≤ x) is continuous in x.178·n . The maximum is at t ≈ 0.3.2 0. is Poisson(λn ).5 n) ≤ e−0. We will state the following theorem without proof. Another reason why moment generating functions are useful is that they characterize the distribution and convergence of distributions. φX1 (t) = eλ1 (e −1) .6.16 · 10−78 for n = 1000. . is much too large for large n.17 for n = 10. but unfortunately we cannot (which is very common in such problems). If φX (t) = φY (t) for all t.05 0 −0.5 t 0. then P (Xn ≤ x) → P (X ≤ x) for all x. The bound 35 12n for the same probability.9 1 It would be nice if we could solve this problem by calculus. as a result. .1 0.83 · 10−8 for n = 100.8 0. which is about 0. such that: X1 X2 Xn is Poisson(λ1 ).15 0. obtained by the Chebyshev inequality. Xn . so we resort to numerical calculations. Y . t φX2 (t) = eλ2 (e −1) .37105 and.5) is a little larger than 0. and Xn are ﬁnite for all t. is Poisson(λ2 ). 2. If φXn (t) → φX (t) for all t. Assume that the moment generating functions for random variables X.10 MOMENT GENERATING FUNCTIONS 123 0.4 0. Show that the sum of independent Poisson random variables is Poisson.

with EX = µ and Var(X) = σ 2 . Assume that X is a random variable.. . . + Xn . This assumption is not needed and the theorem may be applied it as it was in the previous chapter. φX1 +.4. d. Similarly. . P (Tn ≤ x) → P (Z ≤ x). for every x. Xn are i. . Tn = σ n Then. as n → ∞. and Xi −µ σ . .+ √n Yn Y1 ···E e t √ n Yn Y n n t 1 t2 1 t3 1 + √ EY + E(Y 2 ) + E(Y 3 ) + . distributed as Y . . + Yn √ . ... and distributed as X. Proof. + λn ). . we show that φTn (t) → φZ (t) = exp(t2 /2) as n → ∞: φTn (t) = = = = = = ≈ E et T n E e E e E e t √ n t √ n t √ n t Y1 +. Yi are independent. Y1 + . + Xn is Poisson(λ1 + . 2 n 6 n3/2 n t2 1 1+ 2 n t2 → e2.. E(Yi ) = 0.+Xn (t) = e(λ1 +. . . .+λn )(e t −1) and so X1 + . . where Z is a standard Normal random variable. . . .10 MOMENT GENERATING FUNCTIONS 124 Therefore. 2 n 6 n3/2 n n 1 t2 1 t3 1+0+ + E(Y 3 ) + . Then. i.. We will now reformulate and prove the Central Limit Theorem in a special case when the moment generating function is ﬁnite. and assume that φX (t) is ﬁnite for all t. . Let Y = X−µ and Yi = σ Var(Yi ) = 1. Let Sn − nµ √ . . n Tn = To ﬁnish the proof. Let Sn = X1 + . one can also prove that the sum of independent Normal random variables is Normal. Theorem 10. where X1 ..

compute n n→∞ lim e−n i=0 ni . 4. the company must be able to cover its claims by the payments received in the same period. Intimidated by the ﬁerce regulators. but λ is a quantity the company controls. 3. A player selects three cards at random from a full deck and collects as many dollars as the number of red cards among the three. The parameters n. 5. Regulators require that. Thus. This exercise was in fact the original motivation for the study of large deviations by the Swedish probabilist Harald Cram`r.10 MOMENT GENERATING FUNCTIONS 125 Problems 1. Compute the moment generating function of X. Assume 10 people play this game once and let X be the number of their combined winnings. σ and are ﬁxed. By the assumption. where . t 3. he combined amount of claims is σ(X1 + · · · + Xn ) + nµ. determine by calculus the large deviation bound for P (S ≥ an). Compute the moment generating function for a single game. Also. µ. amounting to (a deterministic number) λ per day. assume this amount is Normal with expectation µ and variance σ 2 . 2. i! Solutions to problems 1. p). Determine λ. a claim Y is Normal N (µ. Answer: φ(t) = 1 tx 0 e dx = 1 (et − 1). Assume that the insurance company receives a steady stream of payments. or else. X = (Y −µ)/σ is standard normal. For every a > p. Compute the moment generating function of a uniform random variable on [0. so. who was working as an insurance company consultant in e the 1930’s. 2. Assume also day-to-day independence of the claims. σ 2 ) and. every day. the company wants to fail to satisfy their requirement with probability less than some small number . Using the central limit theorem for a sum of Poisson random variables. then raise it to the 10th power: φ(t) = 1 52 3 26 26 + 3 1 26 26 · et + 2 2 26 26 · e2t + · e3t 1 3 10 . Assume that S is Binomial(n. they receive a certain amount in claims. 1]. within a period of n days. Note that Y = σX + µ.

4. σ 1 As log φ(t) = 2 t2 . i. So. By the Central Limit Theorem. 2 . Poisson(1) random variables. we solve the equation e−In = . σ 2 and the maximum equals 1 I= 2 Finally. the answer we get is I(a) = a log a 1−a + (1 − a) log . p 1−p 5. to get λ=µ+σ· −2 log . P (Sn ≤ n) → 1 . the answer is 1 . but P (Sn ≤ n) is exactly the 2 expression in question. over t > 0. we need to maximize. i.10 MOMENT GENERATING FUNCTIONS 126 Xi are i. so we need to bound P (X1 + · · · + Xn ≥ λ−µ n) ≤ e−In . Let Sn be the sum of i. d. After a computation. d. Thus. Sn is Poisson(n) and ESn = n. λ−µ 1 t − t2 . standard Normal. n λ−µ σ 2 .

1. then we may compute P (A) = EIA by Bayes’ formula above. with EX = λ1 . xfX (x|Y = y) dx Bayes’ formula also applies to expectation. Y ) is discrete or continuous. fX (x|Y = y) = f (x. pX (x|Y = y) = p(x. Y ) has joint density f . This situation is very common in applications. Then. pY (y) • Conditional density: if (X. Y ) has probability mass function p.y) fY (y) . . Example 11. P (A|B) = P (A∩B) P (B) . Recall that X + Y is Poisson(λ1 + λ2 ). Y are independent Poisson. EY = λ2 . P (X = k|X + Y = n) = = P (X = k. and. it involves two-stage (or multistage) processes and conditions are appropriate events at the ﬁrst stage. whereby the value of Y is chosen at the ﬁrst stage. which then determines the distribution of X at the second stage. consequently. By deﬁnition. Assume that X. X + Y = n) P (X + Y = n) P (X = k)P (Y = n − k) (λ1 +λ2 )n −(λ1 +λ2 ) e n! n−k λ2 λk −λ1 1 · (n−k)! e−λ2 k! e (λ1 +λ2 )n −(λ1 +λ2 ) e n! k n λ1 = = k λ1 + λ2 λ2 λ1 + λ2 n−k . Note that this applies to the probability of an event (which is nothing other than the expectation of its indicator) as well — if we know P (A|Y = y) = E(IA |Y = y). Assume that the distribution of a random variable X conditioned on Y = y is given. • Conditional probability mass function: if (X. Recall also the basic deﬁnitions: • Conditional probability: if A and B are two events. • Conditional expectation: E(X|Y = y) is either x xpX (x|Y = y) or depending on whether the pair (X.y) = P (X = x|Y = y). Compute the conditional probability mass function of pX (x|X + Y = n). Such is the case of a two-stage process. its expectation E(X|Y = y) is also known. to remind ourselves. if Y is continuous.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 127 11 Computing probabilities and expectations by conditioning Conditioning is the method we encountered before. E(X) = y E(X|Y = y)P (Y = y) ∞ −∞ E(X|Y = y)fY (y) dy if Y is discrete.

3. P (S1 ≤ s1 . Therefore. 128 Therefore. 0 f (s1 . Therefore. otherwise fS2 (s2 ) = Therefore. for s2 ≥ 0. S2 = T1 + T2 . Let T1 .T2 (s1 . t2 ) dt2 = ∂s2 0 = fT1 . t2 ) dt2 . If f = fS1 . the ﬁrst bulb’s failure time is uniform on the interval of its possible values. s2 ) ds1 = λ2 s2 e−λs2 .11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING λ1 λ1 +λ2 ). Example 11. s2 ) = and. S2 ≤ s2 ) ∂s1 ∂s2 s2 −s1 ∂ fT1 . First. Then. T2 be two independent Exponential(λ) random variables and let S1 = T1 . s2 ]. f (s1 . S2 ≤ s2 ) = P (T1 ≤ s1 . conditioned on T1 + T2 = s2 .T2 (s1 . and 0 otherwise. s2 λ2 s2 e−λs2 for 0 ≤ s1 ≤ s2 . X is Binomial(n. Imagine the following: a new lightbulb is put in and. Waiting to exceed the initial score. Example 11. roll a die once and assume that the number you rolled is U . s2 − s1 ) = fT1 (s1 )fT2 (s2 − s1 ) = λe−λs1 λ e−λ(s2 −s1 ) = λ2 e−λs2 . after time T1 . Compute fS1 (s1 |S2 = s2 ). then f (s1 . What is the expected number of additional rolls? . fS1 (s1 |S2 = s2 ) = λ2 e−λs2 1 = . conditioned on X + Y = n. T1 + T2 ≤ s2 ) s1 s2 −t1 = 0 dt1 0 fT1 .2. consequently.T2 (t1 . For the ﬁrst problem. T1 is uniform on [0.S2 . s2 λ2 e−λs2 0 if 0 ≤ s1 ≤ s2 . It is then replaced by a new lightbulb. If we know the time when the second bulb burns out. identical to the ﬁrst one. which also burns out after an additional time T2 . s2 ) = ∂2 P (S1 ≤ s1 . it burns out. continue rolling the die until you either match or exceed U .

. . m. . . so let us condition on the value of U . 2 3 6 = 1+ Now. k+1 which can (in the uniform case) also be obtained by 1 1 EN = 0 E(N |U = u) du = 0 1 du = ∞. no two random variables are equal). 1]. of N and EN . This number is Geometric. + = 2. if we know U . . by Bayes’ formula for expectation. 2. . X1 .45 . N is Geometric(1 − u). the smallest n for which Xn ≥ U . the result of a call of a random number generator. a slick alternate derivation shows that P (N = k) does not depend on the distribution of random variables (which we assumed to be uniform). the event {N = k} happens exactly when Xk is the largest and U is the second largest among X1 . so the probability that 1 1 Xk and U are the ﬁrst and the second is k+1 · k . X2 . by diminishing size. All orders.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 129 Let N be the number of additional rolls. Determine the p. f. X2 . 1])... . Namely. Let n be the number of additional calls of the generator. Xk . Once we know U . 1. so that there are no “ties” (i.. let U be a uniform random variable on [0. k(k + 1) In fact. . We know that E(N |U = n) = and so. of these k + 1 random numbers are equally likely. 1−u . generate additional independent uniform random variables (still on [0.e. Thus. P (N = k|U = u) = uk−1 (1 − u) and so P (N = k) = 0 1 1 P (N = k|U = u) du = 0 uk−1 (1 − u) du = 1 . Given that U = u. . that is. for k = 0. . 7−n E(N ) = = = n=1 E(N |U = n) P (U = n) n=1 6 1 6 n=1 6 6 7−n 1 7−n 1 1 1 + + . 6 6 . U . . as soon as it is continuous. until we get one that equals or exceeds U . It follows that EN = k=1 ∞ 1 = ∞. that is. .

if it is Heads. Compute the probability that exactly k people buy something. Let X be the number of people who buy something. A coin with Heads probability p is tossed repeatedly.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 130 As we see from this example. if it is Tails. random variables with inﬁnite expectation are more common and natural than one might suppose. A more straightforward way to see this is as follows: ∞ P (X = k) = n=k ∞ P (X = k|N = n)P (N = n) λn e−λ n k p (1 − p)n−k n! k ∞ = n=k = e−λ = = = = n=k −λ pk λk ∞ e n! λn pk (1 − p)n−k k!(n − k)! n! (1 − p)n−k λn−k (n − k)! ((1 − p)λ) ! k! e−λ (pλ)k k! n=k ∞ =0 e−λ (pλ)k (1−p)λ e k! e−pλ (pλ)k . then we have to start from the beginning. Let Nk be the number of needed tosses and mk = ENk . but. then observe the next toss. Example 11. Here is how we translate this into mathematics: E[Nk |Nk−1 = n] = p(n + 1) + (1 − p)(n + 1 + E(Nk )) = pn + p + (1 − p)n + 1 − p + mk (1 − p) = n + 1 + mk (1 − p). N ≈ Binomial(n. indeed. k! This is. with n + 1 tosses wasted. with λ = n .4. Each of them buys something independently with probability p. Example 11. Then. then Nk = n + 1. The number N of customers entering a store on a given day is Poisson(λ). as it equals the expectation of the sum of p k (independent) Geometric(p) random variables. p ) ≈ Poisson(pλ). Why should X be Poisson? Approximate: let n be the (large) number of people in the town and the probability that any particular one of them enters the store on a given day. ) and X ≈ Binomial(n. . If Nk−1 = n. Let us condition on the value of Nk−1 .” the answer is k .5. the Poisson(pλ) probability mass function. by the Poisson approximation. What is the expected number of tosses needed to get k successive heads? Note: If we remove “successive.

get ENk by diﬀerentiating and some algebra. . k − 1.. by Bayes’ formula. and this gives an equation for E[etNk ] which can be solved: E[etNk ] = pk etk 1 − (1 − p) k−1 t(a+1) a p a=0 e = pk etk (1 − pet ) . ∞ mk = E(Nk ) = n=k−1 ∞ E[Nk |Nk−1 = n]P (Nk−1 = n) (n + 1 + mk (1 − p)) P (Nk−1 = n) n=k−1 = = mk−1 + 1 + mk (1 − p) 1 mk−1 = + . . One of F0 . k−1 k−1 E[etNk ] = a=0 E[et(Nk +a+1) ]pa (1 − p) + etk pk = (1 − p)E[etNk ] a=0 et(a+1) pa + etk pk . . be the event that the tosses begin with a Heads. + k. dt Thanks to Travis Scrimshaw for pointing this out. . . and let Fk the event that the ﬁrst k tosses are Heads. we can even compute the moment generating function of Nk by diﬀerent conditioning1 . p p This recursion can be unrolled. m1 = m2 = . . . Fk must happen. . Let Fa . 1 − pet − (1 − p)et (1 − pk etk ) We can. . . then.. then Nk = k. Therefore.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 131 Therefore. otherwise a + 1 tosses are wasted and one has to start over with the same conditions as at the beginning. . followed by Tails. by ENk = 1 d E[etNk ]|t=0 . k E[etNk ] = a=0 E[etNk |Fk ]P (Fk ). p p2 p In fact. therefore. mk = 1 p 1 1 + 2 p p 1 1 1 + + . a = 0. If Fk happens.

p = 19 . PN = 1. Fix a probability p ∈ (0.. and make steps in discrete time units: each time (independently) move rightward by 1 (i. Start at i. . You are interested in the probability Pi that you leave the game happy with your desired amount N . Another interpretation is that of a simple random walk on the integers.6. if you prefer a Greek word. if you play a fair game. This random walk is one of the very basic random (or. .11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 132 Example 11. p We also have boundary conditions P0 = 0. d. and P (Xk = 1) = p. add −1 to your position) with probability 1 − p. if 2 9 you bet on Red at roulette. . . This is a recurrence we can solve quite easily. add 1 to your position) with probability p and move leftward by 1 (i. For example. Pi − Pi−1 = We conclude that Pi − P1 = Pi = 1−p p 1+ + 1−p p + 2 1−p P1 p 1−p (P2 − P1 ) = p 1−p p i−1 1−p p 2 P1 P1 . then Sn = i + X1 + · · · + Xn . i−1 1−p p 1−p p + . In other words. which gives us a recurrence relation. which we can rewrite as Pi+1 − Pi = 1−p (Pi − Pi−1 ). i. 0 ≤ i ≤ N .. .. Gambler’s ruin.. We condition on the ﬁrst step X1 the walker makes. + 2 1−p p i−1 P1 . Then. p = 1 . The probability Pi is the probability that the walker visits N before a visit to 0. where Xk are i. . for i = 1. + . i.. Pi = P (visit N before 0|X1 = 1)P (X1 = 1) + P (visit N before 0|X1 = −1)P (X1 = −1) = Pi+1 p + Pi−1 (1 − p). the outcome of the ﬁrst bet. N. as P2 − P1 = P3 − P2 = ..e. Assume that your initial capital is i dollars and that you play until you either reach a predetermined amount N . if the position of the walker at time n is Sn . or you lose all your money. by Bayes’ formula. . while. stochastic) processes.e. 1). P (Xk = −1) = 1 − p. Play a sequence of games.e. in each game you (independently) win $1 with probability p and lose $1 with probability 1 − p. + 1−p p P1 ..

and 1 . N 2 For example. = p2 . which can be solved. . P 3 4 = p + (1 − p)p. 2n . 9 19 .1. = p2 + p2 (1 − p).25. P 5 8 = p + p2 (1 − p). N ]. if p = 1 . 2 For each positive integer n. 2 . P • When n = 2. Bet x if x ≤ N 2. . P 3 = p · P 8 7 P 8 = p + p(1 − p) + p(1 − p)2 . 19 . a real number. for all x. if N = 1000 and p = P900 ≈ 2.9) ≈ 0. and that you win each of these bets with probability p. N 2 Pi = 1− 1−p p i if p = 1 . Pi = 1− iP 1−p p 1− 1−p p i P1 if p = 1 . k = 0. 1 . and again you want to increase it to N before going broke. 2 p · 1 + (1 − p) · P (2x − 1) if x ∈ 1 . P 1 2 1 4 = p. n For example: • When n = 1. Bold Play. N 2.7. then P5 ≈ 0.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 133 Therefore. The ﬁgure below 9 displays the graphs of functions P (x) for p = 0. .6561 · 10−5 . It is easy to verify that P (x) = x.8794 for p = 19 . . ﬁx our monetary unit so that N = 1. Assume that the only game available to you is a game in which you can place even bets at any amount. By conditioning on the outcome of your ﬁrst bet. 2 1 To determine the unknown P1 we use that PN = 1 to ﬁnally get i 1− 1−p p if p = 1 .6. P 8 = p3 . if N = 10 and p = 0. this is a linear system for P ( 2k ). Your bold strategy (which can be proved to be the best) is to bet everything unless you are close enough to N that a smaller amount will do: 1. 1 . which is not too diﬀerent from a fair game. Bet N − x if x ≥ We can. We now deﬁne P (x) = P (reach 1 before reaching 0). 3 4 1 • When n = 3.8836. P (x) = p · P (2x) if x ∈ 0. 2. Moreover. it can be computed that 2 9 P (0. Your initial capital is x ∈ [0. without loss of generality. 0. 2 if p = 1 . then Example 11.

We have E[S|N = n] = ESn = nEX1 = nµ. thus.2 0.4 0.6 P(x) 0. 0 with probability p. . + Xn . Let N be a nonnegative integer random variable. sequence of random variables with ﬁnite EX = µ and Var(X) = σ 2 .2 0. that is. .8 1 A few remarks for the more mathematically inclined: The function P (x) is continuous. Expectation and variance of sums with a random number of terms. Assume that X.4 x 0.7 0. . . In fact.3 0.9 0.1. independent of all Xi . It is thus a highly irregular function despite the fact that it is strictly increasing. Proof. but nowhere diﬀerentiable on [0. 1]. Let Sn = X1 + . d. Then ES = µ EN. 2 j=1 where its binary digits Dj are independent and equal to 1 with probability 1 − p and. . X1 . 1) is deﬁned by its binary expansion ∞ 1 Y = Dj j .11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 134 1 0.5 0.8 0. P (x) is the distribution function of a certain random variable Y . is an i. and let N S= i=1 Xi . . P (x) = P (Y ≤ x). X2 .1 0 0 0. i.6 0. Theorem 11. This random variable with values in (0. Var(S) = σ 2 EN + µ2 Var(N ).

n=0 For variance. N + 1 is a Geometric( 1 ) random variable. with Xi . Example 11. 12 Moreover.8. and so 2 EN = 2 − 1 = 1. This ﬁts into the above context.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 135 Then. the number of Tails tossed before ﬁrst Heads. and N . 2 Var(X1 ) = 35 . We know that 7 EX1 = . ES = ∞ nµ P (N = n) = µ EN. roll a die and collect as many dollars as the number on the die. Compute ES and Var(S). . Therefore. the numbers rolled on the die. Var(S) = E(S 2 ) − (ES)2 = σ 2 EN + µ2 E(N 2 ) − µ2 (EN )2 = σ 2 EN + µ2 Var(N ). Toss a fair coin until you toss Heads for the ﬁrst time. Let S be your total winnings. compute ﬁrst ∞ E(S 2 ) = n=0 ∞ E[S 2 |N = n]P (N = n) 2 E(Sn )P (N = n) n=0 ∞ = = = ((Var(Sn ) + (ESn )2 )P (N = n) n=0 ∞ (n σ 2 + n2 µ2 )P (N = n) n=0 = σ 2 EN + µ2 E(N 2 ). Var(N ) = Plug in to get ES = 7 2 1− 1 2 1 2 2 =2 and Var(S) = 59 12 . Each time you toss Tails.

Any time you do not pick door 0. 2. 1.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 136 Example 11. 3. This means that N is 1 Geometric( 3 )−1. m. n = 2. y) = and 0 otherwise. f. Let N be the number of times your choice of door is not door 0. compute the probability that you get an even number of Heads. of course. If you press A. and call the three doors 0. which. 3. agrees with the answer in Example 8.9. Therefore. . 4. ES = EN · EX1 = 3. . to ﬁt it better into the present context. A and B. Toss an unfair coin with probability p ∈ (0. 2 4 4 Therefore. 4 4 Problems 1. 1 −y e . y . and 2. Y ) is f (x. 1) of Heads n times. you pick door 1 or 2 with equal probability. of X1 given X1 + X2 = n. Moreover. 1 9 Var(S) = · 2 + · 6 = 14. You are trapped in a dark room.11. • with probability 1/3 you will be released in two minutes. (Note that Xi are not 0. Let X1 and X2 be independent Geometric(p) random variables. • with probability 2/3 you will have to wait ﬁve minutes and then you will be able to press one of the buttons again. 1. for 0 < x < y. 2 Var(X1 ) = 1− 1 3 1 2 3 =6 12 + 22 9 1 − = . Compute E(X 2 |Y = y). We now take another look at Example 8. each Xi is 1 or 2 with probability 1 each. . Compute the conditional p. By conditioning on the outcome of the last toss. We will rename the number of days in purgatory as S. or 2 with 2 1 probability 3 each!) It follows that EN = 3 − 1 = 2.11. Var(N ) = and 3 EX1 = . Assume that the joint density of (X. In front oﬀ you are two buttons.

so each time you press one of them at random. and so pn − and then pn = As p0 = 1. n−1 k−1 p(1 − p)n−k−1 n−1 k=1 p(1 − p) so X1 is uniform over its possible values. Solutions to problems 1. We have. P (X1 = i|X1 +X2 = n) = P (X1 = i)P (X2 = n − i) = P (X1 + X2 = n) p(1 − p)i−1 p(1 − p)n−i−1 1 = . each of them receives an in-store credit uniformly distributed between 0 and 100 dollars. Compute the probability that Alice wins. etc. We have pn = p · (1 − pn−1 ) + (1 − p)pn−1 = p + (1 − 2p)pn−1 . . once you observe the value of Λ. ﬁnally. Alice ﬂips it ﬁrst.. Before you start the random experiment. 5.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 137 If you press B. Assume that a Poisson number with expectation 10 of customers enters the store. what is the probability that N ≥ 3? 7. . A coin has probability p of Heads. say Λ = λ. pn = 1 1 + (1 − 2p)n . we get C = 1 2 1 1 = (1 − 2p)(pn−1 − ). 1]. for i = 1. . . 6. . Generate a random number Λ uniformly on [0. Compute the expected time of your conﬁnement. Assume that you cannot see the buttons. generate a Poisson random variable N with expectation λ. 2 and. and the winner is the ﬁrst to ﬂip Heads. • you will have to wait three minutes and then be able to press one of the buttons again. 2 2 2. then Bob. then Alice. Let pn be the probability of an even number of Heads in n tosses. n − 1. For promotion. 2 2 1 + C(1 − 2p)n . Compute the expectation and variance of the amount of credit the store will give.

uniform on [0.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 138 1 3. The answer is 1 1 P (N ≥ 3) = 0 P (N ≥ 3|Λ = λ) dλ = 0 (1 − (1 + λ + λ2 −λ )e )dλ. y]). EXi = 50 and Var(Xi ) = 100 . Then. and so the answer is y2 3 . 100]. So. 1 1 2 1 EX = E(X|I = 0)P (I = 0) + E(X|I = 1)P (I = 1) = (3 + EX) + ( · 2 + · (5 + EX)) 2 3 3 2 and the answer is EX = 21. Let f (p) be the probability. Then. Let I be the indicator of the event that you press A.e. f (p) = p + (1 − p)(1 − f (p)) which gives f (p) = 1 . and X the time of your conﬁnement in minutes. 4. 2 7. 12 6. The conditional density of X given Y = y is fX (x|Y = y) = y . 2−p . Then. X = N Xi . while Xi are independent 2 uniform on [0. Let N be the number of customers and X the amount of credit. 5. for 0 < x < y (i. so EX = 50·EN = i=1 12 2 500 and Var(X) = 100 · 10 + 502 · 10..

3. y for x > 0 and y > 0. you collect $5 and quit. Let Dn be the number of people whose four cards are of four diﬀerent suits. (c) Solve the recursion. Assume that you start with no money and you have to quit the game when your winnings match or exceed the dollar amount n. Explain how you would ﬁnd an upper bound for the probability that Sn is more than 10% larger than its expectation. you play once more. The cards are shuﬄed and dealt to n players. four cards per player. Toss a coin with probability p of Heads. E(etSn ). . (b) Find Var(Dn ). y) = e−x/y e−y . (For example. (a) What is p1 ? What is p2 ? (b) Write down the recursive equation which expresses pn in terms of pn−1 and pn−2 . (c) Now you roll a fair die and you play the game as many times as the number you roll. 4. The joint density of X and Y is f (x. Assume that a deck of 4n cards has n cards of each of the four suits. and 0 otherwise. Compute the moment generating function of Sn . at the amount you quit. Let Y be your total winnings. (a) Assume that you play this game n times and let Sn be your combined winnings. Note that. which you may assume have equal point score. Consider the following game again: Toss a coin with probability p of Heads. Compute E(X|Y = y). if you toss Tails. 1. If you toss Heads. as n → ∞. if you toss Tails. your winnings will be either n or n + 1. 2. you win $1. assume n = 5 and you have $3: if your next toss is Heads. you win $2. If you toss Heads.) Let pn be the probability that you will quit with winnings exactly n. (b) Keep the assumptions from (a).11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 139 Interlude: Practice Midterm 1 This practice exam covers the material from chapters 9 through 11. Compute E(Y ) and Var(Y ). that is. you win $1. you win $2. which will also appear in problem 4. if your next toss is Tails. Do not compute. (a) Find EDn . (c) Find a constant c so that 1 n Dn converges to c in probability. Give yourself 50 minutes to solve the four problems. Consider the following game.

n5 4n 4 E(Ii Ij ) + i EIi = n5 (n − 1)5 4n 4 4n−4 4 + and Var(Dn ) = n5 (n − 1)5 4n 4 4n−4 4 + n5 4n 4 − n5 4n 4 2 . for i = j. Assume that a deck of 4n cards has n cards of each of the four suits. and EIi = the answer is EDn = (b) Find Var(Dn ). n4 (n − 1)4 4n 4 4n−4 4 . where Ii = I{ith player gets four diﬀerent suits} . E(Ii Ij ) = Therefore.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 140 Solutions to Practice Midterm 1 1. 2 E(Dn ) = i=j n4 4n 4 . (a) Find EDn . Solution: As Dn = i=1 n Ii . . four cards per player. Let Dn be the number of people whose four cards are of four diﬀerent suits. n5 4n 4 . Solution: We also have. The cards are shuﬄed and dealt to n players.

. and we need to ﬁnd an upper bound 9 for P (Sn > n(1. this is an impossible 9 event. .1 + 1. so the statement holds with c = 3 32 . n4 4n 4 = 6n3 3 6 → 3 = .1p)n . Consider the following game.1p) ≤ e−I(1.1p) > 0 and is given by I(1.1p)t − log(pe2t + (1 − p)et ) : t > 0}. When p < 11 . (a) Assume that you play this game n times and let Sn be your combined winnings.1 + 1. p ≥ 11 . E(etSn ). where I(1. If you toss Heads. (b) Keep the assumptions from (a). if you toss Tails. When (1. ESn = n(1 + p). Moreover.1 + 1.1 + 1. Do not compute. so the probability is 0. that is. 1 n Dn 141 converges to c in probability as n → ∞.1+1. the bound is P (Sn > n(1. which will also appear in problem 4. 1 n3 (n − 1)5 n3 Var(Yn ) = 2 Var(Dn ) = 4n 4n−4 + 4n − n 4 4 4 as n → ∞. you win $2. Explain how you would ﬁnd an upper bound for the probability that Sn is more than 10% larger than its expectation. 2. Then EYn = as n → ∞. Compute the moment generating function of Sn . Solution: As EX1 = 2p + 1 − p = 1 + p. Solution: E(etSn ) = (e2t · p + et · (1 − p))n . Toss a coin with probability p of Heads. you win $1.1p) = sup{(1.1 + 1.1p). (4n − 1)(4n − 2)(4n − 3) 4 32 n4 4n 4 2 → 62 +0− 46 3 32 2 = 0. i.1 + 1.e.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING (c) Find a constant c so that Solution: 1 Let Yn = n Dn .1p) ≥ 2.

The answer is 7 · (1 + p). Compute E(Y ) and Var(Y ). Let Y be your total winnings. y for x > 0 and y > 0. Var(N ) = . EX1 = 2p + 1 − p = 1 + p. + XN . 2 7 35 Var(Y ) = (p − p2 ) + (1 + p)2 . = EN · EX1 . Compute E(X|Y = y). ∞ As fY (y) = 0 e−x/y e−y dx y e−y ∞ −x/y = e dx y 0 x=∞ e−y = ye−x/y y x=0 −y = e . . . The joint density of X and Y is EY = e−x/y e−y . where Xi are independently and identically distributed with P (X1 = 2) = p and P (X1 = 1) = 1 − p. y) = Solution: We have E(X|Y = y) = 0 ∞ xfX (x|Y = y) dx. 2 12 3. . . f (x. . and P (N = k) = 1 . Solution: Let Y = X1 + . 7 35 EN = . 2 12 so that Var(X1 ) = 1 + 3p − (1 + p)2 = p − p2 . Var(Y ) = Var(X1 ) · EN + (EX1 )2 · Var(N ). . We 6 know that EY We have Moreover. . for y > 0. and 0 otherwise. and 2 EX1 = 4p + 1 − p = 1 + 3p. for k = 1. 6.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING 142 (c) Now you roll a fair die and you play the game as many times as the number you roll.

Also. p0 = 1. y) fY (y) e−x/y . you win $2.11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING we have fX (x|Y = y) = = and so ∞ 143 f (x. assume n = 5 and you have $3: if your next toss is Heads. Solution: We have pn = p · pn−2 + (1 − p)pn−1 . . (b) Write down the recursive equation which expresses pn in terms of pn−1 and pn−2 . 4. if you toss Tails. if your next toss is Tails. you play once more. Assume that you start with no money and you have to quit the game when your winnings match or exceed the dollar amount n. for x. (For example. at the amount you quit. your winnings will be either n or n + 1. Note that. Consider the following game again: Toss a coin with probability p of Heads. If you toss Heads. y > 0. you win $1. you collect $5 and quit. (a) What is p1 ? What is p2 ? Solution: We have p1 = 1 − p and p2 = (1 − p)2 + p.) Let pn be the probability that you will quit with winnings exactly n. y x −x/y e dx y ze−z dz E(X|Y = y) = 0 ∞ 0 = y = y.

144 Another possibility is to use the characteristic equation λ2 − (1 − p)λ − p = 0 to get λ= This gives pn = a + b(−p)n . We get a= and then pn = 1 p . with a + b = 1. 1+p 1+p . 1 p + (−p)n . 1+p 1+p 1−p± (1 − p)2 + 4p 1 − p ± (1 + p) = = 2 2 1 −p .b= . a − bp = 1 − p. Solution: We can use pn − pn−1 = (−p)(pn−1 − pn−2 ) = (−p)n−1 (p1 − p0 ).11 COMPUTING PROBABILITIES AND EXPECTATIONS BY CONDITIONING (c) Solve the recursion.

m. where i and j range over all states in S. by keeping track. Pick one of the ﬁve random procedures described below and perform it at each time step n = 1. while for n ≥ 2 choose a random occurrence of the two letters obtained. m. and . . • transition probabilities. it becomes such via a natural redeﬁnition of state: keep track of the last two letters. 2. but what distinguishes it from (2) is that the p. a countable set S of states. In general. 5. . 2. If we call our current letter our state.1. then pick the letter following it in the text. . . that depends only on our current state. Pick another random letter. call an ordered pair of two letters a state. Procedure (2). the process is Markov. Use the convention that the letter that follows the last letter in the text is the ﬁrst letter in the text. We will only consider time-homogeneous Markov processes. then pick the following letter. there is something diﬀerent about this scheme: it ceases being random after many steps are performed. by choosing a random letter. no natural way of making this process Markov and. perform procedure (1) with probability 1 probability 1 − n . . Choose a random occurrence of the letter obtained at the previous step ((n − 1)st). f. 1. is dependent not only on the current step. Procedure (5) is Markov. Start. There is. Procedure (4) can be made Markov in a contrived fashion. 2. then pick the following letter. is diﬀerent: the probability mass function for the letter at the next time step depends on the letter at this step and nothing else.12 MARKOV CHAINS: INTRODUCTION 145 12 Markov Chains: Introduction Example 12. 3. as the sequence of the chosen letters occurs just once in the book. Take your favorite book. which are often labeled by the positive integers 1. . in order. Procedure (3) is not Markov at ﬁrst glance. a Markov chain is given by • a state space. Choose a random occurrence of all previously chosen letters. at step 0. however. That is. but not timehomogeneous. in order. At step 1 use procedure (2). 1 n and perform procedure (2) with Repeated iteration of procedure (1) merely gives the familiar independent experiments — selection of letters is done with replacement and. at the current state. that is. then we transition into a new state chosen with the p. . f. however. of the entire history of the process. Such processes are called Markov . indeed. in the previous two steps. a (possibly inﬁnite) matrix of numbers Pij . thus. 4. the letters at diﬀerent steps are independent. At step n. However. but also on time.

12

MARKOV CHAINS: INTRODUCTION • an initial distribution α, a probability mass function on the states.

146

Here is how these three ingredients determine a sequence X0 , X1 , . . . of random variables (with values in S): Use the initial distribution as your random procedure to pick X0 . Subsequently, given that you are at state i ∈ S at any time n, make the transition to state j ∈ S with probability Pij , that is (12.1) Pij = P (Xn+1 = j|Xn = i).

The transition probabilities are collected into the transition matrix : P11 P12 P13 . . . P = P21 P22 P23 . . . . . . . A stochastic matrix is a (possibly inﬁnite) square matrix with nonnegative entries, such that all rows sum to 1, that is Pij = 1,

j

for all i. In other words, every row of the matrix is a p. m. f. Clearly, by (12.1), every transition matrix is a stochastic matrix (as Xn+1 must be in some state). The opposite also holds: given any stochastic matrix, one can construct a Markov chain on positive integer states with the same transition matrix, by using the entries as transition probabilities as in (12.1). Geometrically, a Markov chain is often represented as oriented graph on S (possibly with self-loops) with an oriented edge going from i to j whenever a transition from i to j is possible, i.e., whenever Pij > 0; such an edge is labeled by Pij . Example 12.2. A random walker moves on the set {0, 1, 2, 3, 4}. She moves to the right (by 1) with probability, p, and to the left with probability 1 − p, except when she is at 0 or at 4. These two states are absorbing: once there, the walker does not move. The transition matrix is 1 0 0 0 0 1−p 0 p 0 0 0 1−p 0 p 0 . P = 0 0 1−p 0 p 0 0 0 0 1 and the matching graphical representation is below.

p 0 1 1−p 1 1−p 2

p 3 1−p

p 4 1

12

MARKOV CHAINS: INTRODUCTION

147

Example 12.3. Same as the previous example except that now 0 and 4 are reﬂecting. From 0, the walker always moves to 1, while from 4 she always moves to 3. The transition matrix changes to 0 1 0 0 0 1−p 0 p 0 0 0 1−p 0 p 0 . P = 0 0 1−p 0 p 0 0 0 1 0 Example 12.4. Random walk on a graph. Assume that a graph with undirected edges is given by its adjacency matrix, which is a binary matrix with the i, jth entry 1 exactly when i is connected to j. At every step, a random walker moves to a randomly chosen neighbor. For example, the adjacency matrix of the graph

2

3

1

is 0 1 0 1 0 1 0 1 1 0 1 0 1

1 2

4

1 1 , 1 0 0 0

1 3 1 3 1 2 1 3 1 2

and the transition matrix is

.

1 P = 3 0

1 3

0

1 2 1 3

0

Example 12.5. The general two-state Markov chain. There are two states 1 and 2 with transitions: • 1 → 1 with probability α; • 1 → 2 with probability 1 − α; • 2 → 1 with probability β; • 2 → 2 with probability 1 − β. The transition matrix has two parameters α, β ∈ [0, 1]: P = α 1−α β 1−β .

12

MARKOV CHAINS: INTRODUCTION

148

Example 12.6. Changeovers. Keep track of two-toss blocks in an inﬁnite sequence of independent coin tosses with probability p of Heads. The states represent (previous ﬂip, current ﬂip) and are (in order) HH, HT, TH, and TT. The resulting transition matrix is p 1−p 0 0 0 0 p 1 − p . p 1 − p 0 0 0 0 p 1−p Example 12.7. Simple random walk on Z. The walker moves left or right by 1, with probabilities p and 1 − p, respectively. The state space is doubly inﬁnite and so is the transition matrix: .. . . . . 1 − p 0 p 0 0 . . . . . . 0 1−p 0 p 0 . . . . . . 0 0 1 − p 0 p . . . .. . Example 12.8. Birth-death chain. This is a general model in which a population may change by at most 1 at each time step. Assume that the size of a population is x. Birth probability px is the transition probability to x + 1, death probability qx is the transition to x − 1. and rx = 1 − px − qx is the transition to x. Clearly, q0 = 0. The transition matrix is r0 p0 0 0 0 . . . q1 r1 p1 0 0 . . . 0 q2 r2 p2 0 . . . .. . We begin our theory by studying n-step probabilities

n Pij = P (Xn = j|X0 = i) = P (Xn+m = j|Xm = i). 0 1 Note that Pij = I, the identity matrix, and Pij = Pij . Note also that the condition X0 = i simply speciﬁes a particular non-random initial state.

Consider an oriented path of length n from i to j, that is i, k1 , . . . , kn−1 , j, for some states k1 , . . . , kn−1 . One can compute the probability of following this path by multiplying all transition n probabilities, i.e., Pik1 Pk1 k2 · · · Pkn−1 j . To compute Pij , one has to sum these products over all paths of length n from i to j. The next theorem writes this in a familiar and neater fashion. Theorem 12.1. Connection between n-step probabilities and matrix powers:

n Pij is the i, jth entry of the nth power of the transition matrix.

12

MARKOV CHAINS: INTRODUCTION

149

Proof. Call the transition matrix P and temporarily denote the n-step transition matrix by P (n) . Then, for m, n ≥ 0, Pij

(n+m)

= P (Xn+m = j|X0 = i) =

k

P (Xn+m = j, Xn = k|X0 = i) P (Xn+m = j|Xn = k) · P (Xn = k|X0 = i)

k

= =

k

P (Xm = j|X0 = k) · P (Xn = k|X0 = i) Pkj Pik .

k (m) (n)

=

The ﬁrst equality decomposes the probability according to where the chain is at time n, the second uses the Markov property and the third time-homogeneity. Thus, P (m+n) = P (n) P (m) , and, then, by induction P (n) = P (1) P (1) · · · P (1) = P n .

The fact that the matrix powers of the transition matrix give the n-step probabilities makes linear algebra useful in the study of ﬁnite-state Markov chains. Example 12.9. For the two state Markov Chain P = and P2 =

2 gives all Pij .

α 1−α β 1−β

,

α2 + (1 − α)β α(1 − α) + (1 − α)(1 − β) αβ + (1 − β)β β(1 − α) + (1 − β)2

Assume now that the initial distribution is given by αi = P (X0 = i), for all states i (again, for notational purposes, we assume that i = 1, 2, . . .). As this must determine a p. m. f., we have αi ≥ 0 and i αi = 1. Then, P (Xn = j) =

i

P (Xn = j|X0 = i)P (X0 = i)

n αi Pij . i

=

Three white and three black balls are distributed in two urns. Choose a starting vertex at random. with three balls per urn. As 1 P = 3 0 1 3 0 0 1 2 1 3 1 2 0 0 1 3 1 3 0 1 2 1 3 1 2 . 2. Determine EX3 . starting today (day 0). . At each step. the row of probabilities at time n is given by [P (Xn = i). Example 12.4. has the transition matrix 1 1 1 0 5 6 2 0 3 1 3 6 2 3 1 6 Assume that P (X0 = 0) = P (X0 = 1) = 1 . 2. 1. The state of the system is the number of white balls in the ﬁrst urn. X12 = 4). you ﬂip coin 1 the next day. Consider the random walk on the graph from Example 12.7 of Heads and coin 2 probability 0. . 708588 Problems 1. when you pick one of the two coins with equal probability and toss it. We have two coins: coin 1 has probability 0.] · P n . α2 . On any day. (a) What is the probability mass function at time 2? (b) Compute P (X2 = 2. i ∈ S] = [α1 .6 of Heads. . You ﬂip a coin once per day. if you ﬂip Heads.12 MARKOV CHAINS: INTRODUCTION 150 Then. 4 3. Determine the probability that this is also the case after 6 steps. (a) Determine the transition matrix for this Markov chain. 8645 ≈ 0. A Markov chain on states 0.0122. X6 = 3. we draw at random a ball from each of the two urns and exchange their places (the ball that was in the ﬁrst urn is put into the second and vice versa).10. (b) Assume that initially all white balls are in the ﬁrst urn. . we have P (X2 = 1) P (X2 = 2) P (X2 = 3) P (X2 = 4) = The probability in (b) equals 4 6 P (X2 = 2) · P23 · P34 = 1 4 1 4 1 4 1 4 · P2 = 2 9 5 18 2 9 5 18 .

2. if she is at the same position for two consecutive time steps. (c) You toss Heads 3 on days 3 and 6 if and only if you toss Coin 1 on days 4 and 7.6 0. (a) Let P = Then. The state Xn of our Markov chain. .3 . Solutions to problems 1. 1.8 and remains at the same position with probability 0. 0. compute the fourth entry of 0 0 0 1 · P 6. P (X3 = 1) P (X3 = 2) = 1 2 1 2 0. (c) Compute the probability that you ﬂip Heads on days 3 and 6. The states are 0. 2 3.7 0. 1 or 2. (b) Answer: P (X3 = 1) · P11 · P11 . so the answer is P (X4 = 1) · P11 . 3. 6. 4th entry of P 6 . 2. she changes position with probability 0. is the coin we ﬂip on day n.12 MARKOV CHAINS: INTRODUCTION 151 otherwise you ﬂip coin 2 the next day. and 14.2. that is. (b) Compute the probability that you ﬂip coin 1 on days 3. Subsequently. 1 P = 9 0 0 For (b). 4. the 4. A walker moves on two positions a and b. She begins at a at time 0 and is at a the next time as well. For (a). in all other cases she decides the next position by a ﬂip of a fair coin.4 · P3 3 8 and the answer to (a) is the ﬁrst entry. (a) Interpret this as a Markov chain on a suitable state space and write down the transition matrix P . (b) Determine the probability that the walker is at position a at time 10. (a) Compute the probability that you ﬂip coin 1 on day 3. The answer is given by 0 EX3 = P (X3 = 0) P (X3 = 1) P (X3 = 2) · 1 = 2 1 4 1 4 1 2 0 1 4 9 4 9 0 4 9 4 9 0 1 0 0 1 9 0 0 · P 3 · 1 .

5 0 0 0 0 0.2 The answer to (b) is the sum of the ﬁrst and the third entries of 1 0 0 0 P 9.12 MARKOV CHAINS: INTRODUCTION 152 4. 3. 0. . which we will code as 1. (a) The states are the four ordered pairs aa. and bb. P = 0. 2. ba.2 0.5 0. and 4. The power is 9 instead of 10 because the initial time for the chain (when it is at state aa) is time 1 for the walker.5 0. Then. ab.8 0.8 0 0 0 0 0.5 .

If i is accessible from j and j is accessible from i. all that matters is which are positive and which are 0. in which we only need to depict the edges that signify nonzero transition probabilities (their precise value is irrelevant for this purpose). irreducibility means that all entries of I + P + . If j is not accessible n from i. we draw an undirected edge when probabilities in both directions are nonzero.13 MARKOV CHAINS: CLASSIFICATION OF STATES 153 13 Markov Chains: Classiﬁcation of States n We say that a state j is accessible from state i. + P m are nonzero. Pij = 0 for all n ≥ 0. so that n+m n m Pik ≥ Pij Pjk > 0. one can use that P m+n = P n · P m and then n+m Pik = n m Pin P m ≥ Pij Pjk . to prove it. i ↔ j and j ↔ k together imply i ↔ k. + P m )ij > 0. This means that there is a possibility of reaching j from i in some number of steps. If the chain has m states. i ↔ j implies j ↔ i. . This means that n m there exists an n ≥ 0 so that Pij > 0 and an m ≥ 0 so that Pjk > 0.) The accessibility relation divides states into classes. (Alternatively. let us assume i → j and j → k. 2. if Pij > 0 for some n ≥ 0. i → j. i ↔ i. then j is accessible from i if and only if (P + P 2 + . Also. but no pair of states in diﬀerent classes communicates. .1. To determine the classes we may present the Markov chain as a graph. For computational purposes. k as the sum of nonnegative numbers is at least as large as one of its terms. then we say that i and j communicate. Here is an example: . . one can get from i to j in m + n steps by going ﬁrst to j in n steps and then from j to k in m steps. and thus the chain started from i never visits j: ∞ P (ever visit j|X0 = i) = P ( ∞ {Xn = j} |X0 = i) n=0 ≤ n=0 P (Xn = j|X0 = i) = 0. all states communicate with each other. note that for accessibility the size of entries of P does not matter. one should also observe that. by convention. Within each class. It is easy to check that this is an equivalence relation: 1. . Now. i ↔ j. and 3. Example 13. The only nontrivial part is (3) and. if the chain has m states. The chain is irreducible if there is only one class.

denote fi = P (ever reenter i|X0 = i). Example 13. 2. f1 = 1 because in order to never return to 1 we need to go to state 2 and stay there forever. So. a Markov Chain visits a recurrent state inﬁnitely many times or not at all. at every visit to i. 2}. 2. f2 = 1. 3.2. 2. . 2. 2. The chain is not irreducible. as n → ∞. Example 13. By similar logic.3. Consider the chain on states 1. 4 is recurrent. f1 = 1. Let us now compute. as it is an absorbing state. We call a state i recurrent if fi = 1. but 5 is not accessible from 1. the times. this is an irreducible chain. so we have f3 = 1 . P (exactly n visits to i|X0 = i) = fin−1 (1 − fi ).. For any state i. First. 3. 4 1 This chain has three classes {1.e. 4. Example 13. hence. when the chain is at i). and 3 is 4 transient. 3. 3 and 1 1 2 2 0 P = 1 1 1 . and {5}. {3} and {4}. 2 4 4 0 1 2 3 3 As 1 ↔ 2 and 2 ↔ 3. and transient if fi < 1. 3. The only possibility of returning to 3 is to do so in one step. 1 1 2 2 0 1 1 0 P = 2 2 1 0 0 4 0 0 0 4. the expected number of visits to i (i. Starting from any state. Back to the previous example. Obviously. we have two classes: {1. We will soon develop better methods to determine recurrence and transience. so the probability of staying at 1 forever is 0 and.4. Consider the chain on states 1. therefore. 4 is accessible from any of the ﬁve states. and 0 0 3 . Moreover.13 MARKOV CHAINS: CLASSIFICATION OF STATES 154 2 3 5 1 4 Any state 1. it is not irreducible. consequently. We stay at 2 for n steps with probability 1 2 n → 0. we observe that. including time 0. the probability of never visiting i again is 1 − fi . in two diﬀerent ways. 4}.

Proposition 13. starting from i. Characterization of recurrence via n step return probabilities: ∞ A state i is recurrent if and only if n=1 n Pii = ∞. In particular. thus. the chain returns to i inﬁnitely many times and. Proof. Then.1. n = 0. then each of them is visited either not at all or only ﬁnitely many times. a closed set cannot be exited. We call a subset S0 ⊂ S of states closed if Pij = 0 for each i ∈ S0 and j ∈ S0 . Start from any state from S0 . Now assume that it is not true that j → i. every time it does so.e. If a closed subset S0 has only ﬁnitely many states. . This contradiction ends the proof. every time it is at i. i is not recurrent. If i is recurrent and i → j. If all states in S0 are transient. . 2. By deﬁnition. Thus. n=0 = Thus. then there must be at least one recurrent state. There is an n0 such that Pij 0 > 0. Then.2.3. In plain / language. then also j → i. eventually the chain does reach j.13 MARKOV CHAINS: CLASSIFICATION OF STATES 155 This formula says that the number of visits to i is a Geometric(1 − fi ) random variable and so its expectation is 1 E(number of visits to i|X0 = i) = . it has an independent chance to reach j n0 steps later.. any ﬁnite Markov chain must contain at least one recurrent state. Starting from i. the chain stays in S0 forever. 1 − fi A second way to compute this expectation is by using the indicator trick: ∞ E(number of visits to i|X0 = i) = E( n=0 In |X0 = i). Proposition 13. . n Proof.. the chain can reach j in n0 steps. 1. ∞ n Pii n=0 Theorem 13. ∞ ∞ E( n=0 In |X0 = i) = n=0 ∞ P (Xn = i|X0 = i) n Pii . where In = I{Xn =i} . there is a ﬁxed positive probability that it will be at j n0 steps later. but then. i. once the chain reaches j. This is impossible. it never returns to i. 1 = 1 − fi and we have proved the following theorem. . once entered.

and then from i to j in k steps.3. 4 and that the transition matrix is 1 1 0 0 0 2 0 . every state is accessible from every other state and so this chain is irreducible. 0 0 1 0 P = 0 1 0 1 3. at each of those times. ∞ m+n+k n=0 Pjj = ∞ and. i ∈ S0 and j ∈ S0 . By the previous proposition. m ≥ 0 so that Pij > 0. For ﬁnite Markov chains. we know that there exist k. either all states are recurrent or all are transient. thus. then either all states are recurrent or all are transient. We need to show that Pij = 0. these propositions make it easy to determine recurrence and transience: if a class is closed. then j is also recurrent.4. If i is recurrent and i → j. i. the chain never reaches i from j. We will now give two arguments for the recurrence of j. ﬁnally. In short. But this is a contradiction to Proposition 13. We could use the same logic as before: starting from j. Furthermore. ∞ =0 Pjj = ∞. it is transient. one way to get from j to j in m + n + k steps is by going from j to i in m steps. Assume that the states are 1. Proof. Let S0 be a recurrent class. if i is recurrent. then n=0 then so is j. every state is recurrent. it is recurrent. 2. Pij > 0. Therefore. then it returns to i inﬁnitely many times and. Proposition 13. by visiting j). we can classify each class. m k For another argument.5. if the chain is irreducible. in any class. i is not accessible from j.e. If ∞ Pii = ∞. for any n ≥ 0. 0 0 2 By inspection. m+n+k m n k Pjj ≥ Pji Pii Pij .5. Example 13. In light of this proposition. . Assume the converse. Proof. the chain must visit i with probability 1 (or else the chain starting at i has a positive probability of no return to i.13 MARKOV CHAINS: CLASSIFICATION OF STATES 156 Proposition 13. we know that also j → i. In particular.. as well as an irreducible Markov chain. Any recurrent class is a closed subset of states. it has an independent chance of getting to j at a later time — so it must do so inﬁnitely often. Therefore. as recurrent or transient. but if it is not closed. As j does not communicate with i. then from i to i in n steps. Pji > 0.

1 0. First. . .7 0 0.7 6 0. we observe that the walker will be at 0 at a later time only if she makes an equal number of left and right moves.1 0 . 6 and 0 0 0 0 0 0 0. Furthermore. n Now. We will make use of this in subsequent examples in which the walker will move in higher dimensions. 2. . 2}.4 1 0. . Thus. we only have to check whether state 0 is recurrent or transient.5 0. We will assume that (1) p ∈ (0.. therefore.2 0.4).7 We observe that 3 can only be reached from 3. we recall Stirling’s formula: √ n! ∼ nn e−n 2πn (the symbol “∼” means that the quotient of the two quantities converges to 1 as n → ∞). {1. it is clear that f3 = 0.3 0. 0.3 0.5 0. Thus. {3}. On the other hand. for n = 1. so we assume that the walker begins at 0. 1) and denote the chain Sn = Sn .13 MARKOV CHAINS: CLASSIFICATION OF STATES 1. 6 all communicate with each other. As it is not closed. 2} and {4.7 157 Example 13.3 0. 6}. so they form a class together.6 1 0.6 0 0. . . Recurrence of a simple random walk on Z. 3 is in a class of its own. therefore.4 0.4 2 0.6. {3} is a transient class (in fact.3 3 0. .7. 5.4 P = 0 0 0 0 0 0 0 0 0 0. States 1 and 2 can reach each other and no other state.5 5 0. 4. and {4.5 0 0. 5. recurrent. 6} both are closed and.3 0 0.2 4 0. Assume now that the states are 0 1 0 0. (The superscript indicates the dimension. . the division into classes is {1.) As such a walk is irreducible. 5. 2n−1 P00 =0 and 2n P00 = 2n n p (1 − p)n . Example 13. Recall that such a walker moves from x to x + 1 with probability p and to x − 1 with probability 1 − p.3 0 0.

nπ ∞ 2n P00 = ∞. 2n P00 = (2n)! (n!)2 √ (2n)2n e−2n 2π2n n2n e−2n 2πn 2n 2 √ . 4p(1 − p) < 1. 2n n = ∼ = and. . when p = 1 . so that P00 goes to 0 faster than the terms of a convergent 2 geometric series. therefore. ∞ 2n P00 < ∞. In this case. if p > 1 . 2 22n √ pn (1 − p)n nπ 1 √ (4p(1 − p))n nπ 1 2n P00 ∼ √ . nπ ∼ In the symmetric case. n=0 therefore.13 MARKOV CHAINS: CLASSIFICATION OF STATES 158 Therefore. 2 1−p p N 1−p p . P (Sn reaches N before 0|S0 = 1) = 1− 1− As N → ∞. and the random walk is recurrent. the probability P (Sn reaches 0 before N |S0 = 1) = 1 − P (Sn reaches N before 0|S0 = 1) converges to P (Sn ever reaches 0|S0 = 1) = 1 1−p p 1 if p < 2 . 2n When p = 1 . what is the probability f0 that the chain ever reenters 0? We need to recall the Gambler’s ruin probabilities. n=0 and the random walk is transient.

0)|N = 2k) P (N = 2k)P (S2k = 0)P (S2(n−k) = 0). P (S2n = (0.8. We condition on the number N of times the walker moves in the x-direction: P (S2n = (0. both the number of steps in the x direction (E or W) and in the y direction (N or S) must be even (otherwise. We denote this Markov chain by Sn and imagine a drunk wandering at random through the rectangular grid of streets of a large city. nπ . the respective coordinate cannot be 0). nπ Therefore. Is the simple symmetric random walk on Z2 recurrent? A walker now moves on integer points in two dimensions: each step is a distance 1 jump in one of the four directions (N. First. nπ √ 2 (1) P (S2(n−k) = 0) ∼ √ . Sn returns to 0 eventually) + P (S1 = −1.) The question is whether the drunk will eventually return to her home at (0.13 MARKOV CHAINS: CLASSIFICATION OF STATES 159 1 Assume that p > 2 . 2 Example 13. 2 √ 2 (1) P (S2k = 0) ∼ √ . 0)) = k=0 n (2) n P (N = 2k)P (S2n = (0. in fact. Taking this into account. we will not show the full details from now on. 0). E. f0 = P (S1 = 1. All starting positions in this and in the next example will be the appropriate origins. (2) S. Then. or W). we may use the fact that replacing the walker’s position with its mirror image replaces p by 1 − p. N ∼ 2n = 2 n with overwhelming probability and so we can assume that k ∼ n . If p < 1/2. this gives f0 = 2p when p < 1 . as the walker chooses to go horizontally or vertically with equal probability. ﬁlling in the missing pieces is an excellent computational exercise. 0)) ∼ ∼ ∼ (2) n 2 nπ P (N = 2k) k=0 2 P (N is even) nπ 1 . Sn returns to 0 eventually) 1−p = p· + (1 − p) · 1 p = 2(1 − p). k=0 (1) (1) (2) = In order not to obscure the computation. Note again that the walker can only return in an even number of steps and. (Chicago would be a good example.

Here is how it goes. albeit barely. If we let each coordinate of a two-dimensional random walker move independently. y + 1). . y + 1).13 MARKOV CHAINS: CLASSIFICATION OF STATES 160 as we know that (see Problem 1 in Chapter 11) 1 P (N is even) = . y. 0. . (x + 1. z). (x. so the squirrel may never return home. (x − 1. y) to (x + 1. is not 1. The probability f0 = P (return to 0). y − 1). and ignore half of the points that are never visited. there is an easier slick proof that does not generalize to higher dimensions. Is the simple symmetric random walk on Z3 recurrent? Now. then the above is certainly true. it is (2) at the origin exactly when Sn is. In fact. from (x. This time N ∼ 2n and. but can we compute it? One approximation is obtained by using ∞ 1 1 (3) P (S2n = (0. 0)) (1) (2) √ 3 3 P (N = 2k) √ ∼ πn 2πn k=0 √ 3 3 = P (N is even) 3/2 n3/2 2π √ 3 3 ∼ . 0)) = 1 + + . In particular. imagine a squirrel (3) running around in a 3 dimensional maze.9. 3/2 n3/2 4π Therefore. 0)) < ∞ n (3) and the three-dimensional random walk is transient. . 2 Therefore. The process Sn moves from a point (x. but if we rotate the lattice by 45 degrees. 0). or (x − 1. this becomes the same walk as Sn . z) to one of the six neighbors (x ± 1. 3 (3) P (S2n n (2) (2) (1) (2) = (0. thus. z ± 1) with equal probability. (x. 0)) = k=0 n P (N = 2k)P (S2k = 0)P (S2(n−k) = (0. this 1 appears to be a diﬀerent walk. y − 1) with equal probability. P (S2n = (0. which demonstrates that P (S2n = 0) = P (S2n = 0)2 . 0. thus. We will condition on the number N of steps in the z direction. y ± 1. y. ∞ n=0 P (S2n = (0. scale by √2 . 0)) = ∞ and we have demonstrated that this chain is still recurrent. z). 0. y. it has to make an even number number of steps in each of the three directions. Such a walker makes diagonal moves. Example 13. = 1 − f0 6 n=0 . 0. At ﬁrst. To return to (0.

. 1 0 1 2 0 0 0 0 0 1 2 1 0 1 0 0 0 0 0 1 2 2 P1 = 1 1 0 0 0 P2 = 1 1 0 0 2 2 1 3 3 1 0 0 1 0 0 0 1 2 4 4 1 1 1 0 0 1 0 0 2 4 4 0 1 1 1 1 2 0 2 0 0 2 2 0 0 1 1 1 0 0 1 1 0 0 4 2 4 2 2 1 P3 = 1 1 2 0 0 P4 = 0 0 1 0 4 4 0 0 0 1 1 0 0 1 1 2 2 2 2 1 1 0 0 0 2 2 1 0 0 0 specify which are recurrent 0 0 1 3 0 0 0 0 0 0 0 2. use (b) to show that ER = ∞. 1. Also. the ﬁrst time after the initial time the chain is back at the origin). one can use the remarkable formula. f0 ≈ 0. (a) Is this chain irreducible? (b) Assume that X0 = 0 and let R be the ﬁrst return time to 0 (i. (c) Assume that the chain starts at 0 and let R be the ﬁrst time (after time 0) that it revisits 0. Instead. 1 1 = 1 − f0 (2π)3 which gives. to four decimal places. determine which classes are recurrent. Assume that a Markov chain Xn has states 0. By recurrence. n→∞ (c) Depending on α.e. 1− 1 3 (cos(x) (−π. Moreover.π)3 dx dy dz . but the expected waiting time is inﬁnite! (1) . we know that P (R < ∞) = 1. Consider the one-dimensional simple symmetric random walk Sn = Sn with p = 1 .3405. and transitions from each i > 0 to i + 1 1 1 with probability 1 − 2·iα and to 0 with probability 2·iα . determine the recursive equation for Ei . Determine α for which 1 − f0 = P (no return) = lim P (R > n) = 0. ﬁx an N and start at some 0 ≤ i ≤ N . As in 2 the Gambler’s ruin problem. + cos(y) + cos(z)) Problems 1. . 2 . determine the boundary conditions E0 and EN . 3. determine the classes and and which transient.13 MARKOV CHAINS: CLASSIFICATION OF STATES 161 but this series converges slowly and its terms are diﬃcult to compute.. (a) By conditioning on the ﬁrst step. (b) Solve the recursion. from 0 it transitions to 1 with probability 1. For the following transition matrices. Let Ei be the expected time at which the walk ﬁrst hits either 0 or N . derived by Fourier analysis. The walk will eventually return to 0.

By plugging 2 in the boundary conditions we can solve for C and D to get D = 0. So. . so that 1 Ei = 1 + (Ei+1 + Ei ). (c) For α ≤ 1. 5} transient. {4. Therefore. which is when α ≤ 1. it is transient. so n−1 P (R > n) = i=1 1− 1 2 · iα . for every N . The product converges to 0 if and only if its logarithm converges to −∞ and that holds if and only if the series ∞ 1 2 · iα i=1 diverges. the chain is recurrent. For P1 : {1. and so ER = ∞. then the chain. {3} recurrent (absorbing).13 MARKOV CHAINS: CLASSIFICATION OF STATES 162 Solutions to problems 1. . (a) The chain is irreducible. For (a). Assume that the states are 1. 2 with E0 = EN = 0. makes n − 1 consecutive steps to the right. 2. the walker makes one step and then proceeds from i+1 or i−1 with equal probability. . For (c). so all states are recurrent. after a step the walker proceeds either from 1 or −1 and. 2. C = N . otherwise. For (b). so B = −1. the expected time to get to 0 is the same for both. 5. 2. (b) If R > n. For P4 : {1. . so its general solution is linear: Ci + D. {4. {5} transient. For P2 : irreducible. 2} recurrent. For P3 : {1. Ei = i(N − i). the homogeneous equation is the same as the one in the Gambler’s ruin. by symmetry. 5} recurrent. 3. {4} transient. ER ≥ 1 + E1 = 1 + 1 · (N − 1) = N. 3} recurrent. We look for a particular solution of the form Bi2 and we get Bi2 = 1 + 1 (B(i2 + 2i + 1) + B(i2 − 2i + 1)) = 1 + Bi2 + B. after moving to 1. 3} recurrent. .

. . that it survives. We will start with a single individual in generation 0 and generate the resulting random family tree. each individual produces a random number of oﬀspring in the next generation. .. we say that the branching process dies out and. 2. in this case pi is the probability that a man has i sons. Galton in the late 1800s to study the disappearance of family names. then the population cannot decrease and each generation increases with probability at least 1 − p1 . for all k. for i = 0. as P (Xn+1 = i|Xn = k) = P (W1 + W2 + . This tree is either ﬁnite (when some generation produces no oﬀspring at all) or inﬁnite — in the former case. The probability mass function for oﬀspring is often called the oﬀspring distribution and is given by pi = P (number of oﬀspring = i). Recall that we are assuming that X0 = 1. 1. Let δn = P (Xn = 0) be the probability that the population is extinct by generation (which we also think of as time) n.5. P (Xn+1 = 0|Xn = k) > 0. then. . It is possible to write down the transition probabilities for this chain. consider a population of individuals which evolves according to the following rule: in every generation n = 0. . . the population must either die out or increase to inﬁnity. • If p0 > 0. but they have a rather complicated explicit form. If p0 = 0.. This suggests using moment generating functions. so 0 is an absorbing state. n→∞ n→∞ . So. Wk are independent random variables. . . it stays there. Let us start with the following observations: • If Xn reaches 0. We can look at this process as a Markov chain where Xn is the number of individuals in generation n. + Wk = i). .14 BRANCHING PROCESSES 163 14 Branching processes In this chapter we will consider a random model for population growth in the absence of spatial or any other resource constraints. . each with the oﬀspring distribution. . This model was introduced by F. We will assume that p0 < 1 and p1 < 1 to eliminate the trivial cases. in the latter case. 2. where W1 . • Therefore. all states other than 0 are transient if p0 > 0. by Proposition 13. which we will indeed do. The probability π0 that the branching process dies out is. therefore it must increase to inﬁnity. the limit of these probabilities: π0 = P (the process dies out) = P (Xn = 0 for some n) = lim P (Xn = 0) = lim δn . . 1. independently of other individuals.

µ(1 − µ) From v0 = 0 we get A + B = 0 and the solution is given in the next theorem. The constant A must satisfy Aµn+1 = σ 2 µn + Aµn+2 . Together with initial conditions m0 = 1. ∞ µ = EX1 = k=0 kpk . We have mn = µn and vn = σ 2 µn (1−µn ) µ(1−µ) nσ 2 if µ = 1.1. Expectation mn and variance vn of the nth generation count. We can very quickly solve the ﬁrst recursion to get mn = µn and so vn+1 = µn σ 2 + vn µ2 . A= σ2 . if µ = 1. When µ = 1. and σ 2 = Var(X1 ). Now.14 BRANCHING PROCESSES 164 Note that π0 = 0 if p0 = 0. Let mn = E(Xn ) and vn = Var(Xn ). Xn+1 is the sum of a random number Xn of independent random variables. We start. with computing the expectation and variance of the population at generation n. as P (Xn = 0) = P (Xn ≥ 1) ≤ EXn = µn → 0. When µ = 1. Theorem 14. each with the oﬀspring distribution. Let µ and σ 2 be the expectation and the variance of the oﬀspring distribution. however. Thus. the two recursive equations determine mn and vn . so that. that is. we have by Theorem 11. . and vn+1 = mn σ 2 + vn µ2 . the recursion has the general solution vn = Aµn + Bµ2n . mn = 1 and vn = nσ 2 . We can immediately conclude that µ < 1 implies π0 = 1. v0 = 0. mn+1 = mn µ. Our main task will be to compute π0 for general probabilities pk .1.

Now. It is more convenient to replace et in our original deﬁnition with s. Clearly. for such s. . Moreover. In combinatorics. So. .. φX3 (s) = φ(φ(φ(s))). k=0 . this would be exactly the generating function of the sequence pk . let φ be the moment generating function of the oﬀspring distribution.14 BRANCHING PROCESSES 165 if the individuals have less than one oﬀspring on average. Let us get a recursive equation for φXn by conditioning on the population count in generation n − 1: φXn (s) = E[sXn ] ∞ = k=0 ∞ E[sXn |Xn−1 = k]P (Xn−1 = k) E[sW1 +. We will assume that 0 ≤ s ≤ 1 and observe that. we take a closer look at the properties of φ. . for s > 0. Then.+Wk ]P (Xn−1 = k) k=0 ∞ = = k=0 ∞ (E(sW1 )E(sW2 ) · · · E(sWk )P (Xn−1 = k) φ(s)k P (Xn−1 = k) k=0 = = φXn−1 (φ(s)). φXn is the nth iterate of φ. so that ∞ φ(s) = φX1 (s) = E(s X1 )= k=0 pk sk . φX2 (s) = φ(φ(s)). φ (s) = ∞ kpk sk−1 > 0. and we can also write φXn (s) = φ(φXn−1 (s)).. Next. this power series converges. the branching process dies out. the moment generating function of Xn is ∞ φXn (s) = E[s Xn ]= k=0 P (Xn = k)sk . φ(0) = p0 > 0 and φ(1) = k=0 ∞ pk = 1.

then π0 is the smallest solution on [0. with φ (1) = µ. π0 = 1. 1]. The crucial observation is that δn = φXn (0). and so π0 = 1. Example 14. It is also clear that δn is a nondecreasing sequence (because Xn−1 = 0 implies that Xn = 0). φ(s) ≥ s for all s ∈ [0. Probability that the branching process dies out. If µ > 1.2. 1 φ 1 φ δ1δ2 1 δ1 δ2 1 The following theorem is a summary of our ﬁndings. δn converges to 1. • Now. This is shown in the right graph of the ﬁgure below. φ (s) = k=1 ∞ k(k − 1)pk sk−2 ≥ 0. This is shown in the left graph of the ﬁgure below. 1] to s = φ(s). there exists exactly one s < 1 which solves s = φ(s). How does this change the survival probability? The k individuals all evolve independent family k trees. and so δn is obtained by starting at 0 and computing the nth iteration of φ. which happens when µ > 1. Finally.14 BRANCHING PROCESSES 166 so φ is strictly increasing.1. so that the probability of eventual death is π0 . We now consider separately two cases: • Assume that φ is always above the diagonal. In this case. that is. This happens exactly when µ = φ (1) ≤ 1. If µ ≤ 1. Theorem 14. assume that φ is not always above the diagonal. In this case. we conclude that π0 < 1 is the smallest solution to s = φ(s). It also follows that k P (the process ever dies out|Xn = k) = π0 . so φ is also convex. As δn converges to this solution. Assume that a branching process is started with X0 = k instead of X0 = 1.

8 8 8 8 √ √ √ with solutions s = 1. = The equation φ(s) = s has two solutions. this means that the ones with already a large number of representatives in the population are at a distinct advantage. and (c) the probability that . the probability π0 of extinction is quite close to 1. Problems 1. For a branching process with oﬀspring distribution given by p0 = 1 . Now suppose that p > 1 . we have to compute 2 2 ∞ φ(s) = k=0 sk pk (1 − p) 1−p . π0 = 1. 1). 1 − ps 1−p p . and Park in English transcriptions) account for about 45% of the population. 2 1 Example 14. 5 − 2 ≈ 0. and 5 − 2. 2. but not by generation 2. determine 6 2 3 (a) the expectation and variance of X9 .2.3. Thus.14 BRANCHING PROCESSES 167 for every n. − 5 − 2. where three family names (Kim. The most famous example of this phenomenon is in Korea. when p > 1 .2361. 2 ). the population at generation 9. s = 1 and s = π0 = 1−p . common family names become ever more common. This means that the oﬀspring distribution is Geometric(1 − p) minus 1. . (b) the probability that the branching process dies by generation 3. Assume that the oﬀspring distribution is Binomial(3. Then. As µ = 3 2 > 1. Example 14. . p3 = 1 . Lee. k = 0. π0 is given by φ(s) = 1 3 3 1 + s + s2 + s3 = s. If µ is barely larger than 1. p1 = 1 . p Thus. Thus. is the probability π0 . . if p ≤ 1 . as the probability that they die out by chance is much lower than that of those with only a few representatives. Assume that pk = pk (1 − p). 1. The one that lies in (0. µ= 1 p −1= 1−p 1−p and. . In the context of family names. Compute π0 . especially in societies that have used family names for a long time.

. and 11 if λ = 1.2032. For (b). as the branching process cannot die out — then. 2. 6 2 3 φ(s) = For (b). the answer is π0 . which works out to be 6 . 0 = 2s3 − 3s + 1 = (s − 1)(2s2 + 2s − 1). λ11 − 1 .) Compute the limit of Yn as n → ∞. (a) Determine the expected combined population through generation 10. for a = 1. If k is large — which it will be for large n. λ = 1. if λ ≤ 1 then π0 = 1. 3 . compute µ = 3 . Note that p0 = 0. Then compute 4 1 1 1 + s + s3 . while Xn is about µk. 2. then π0 is the solution for s ∈ (0. assume that you start 5 independent copies of this branching process at the same time (equivalently. E(X0 + X1 + . Assuming Xn−1 = k. the proportion Yn is about apa 1 1 1 µ . with overwhelming probability. σ 2 = 2 7 2 − 9 4 = 5 . Then.3660. For (c). P (X3 = 0) − P (X2 = 0) = φ(φ(φ(0))) − φ(φ(0)) ≈ 0. change X0 to 5). with the aid of computer if necessary. For (a). µ = λ and √ 3−1 2 ≈ 0. 2 and λ = 2. 1) to eλ(s−1) = s. 2. and plug into the formula. the number of children in such families is about apa k. . the probability that the process ever dies out for λ = 1 .14 BRANCHING PROCESSES 168 the process ever dies out. and so π0 = 5 For (d). (A family consists of individuals that are oﬀspring of the same parent from the previous generation. 1 3. 2. Assume that the oﬀspring distribution of a branching process is given by p1 = p2 = p3 = 3 . the number of families at time n is also k. + X10 ) = EX0 + EX1 + . + EX10 = 1 + λ + · · · + λ10 = This equation cannot be solved analytically.0462. Each of these has. Assume that the oﬀspring distribution of a branching process is Poisson with parameter λ. and 3. 3. but if λ > 1. 3. λ−1 if λ = 1. Then. . independently. (b) Determine. Let Yn be the proportion of individuals in generation n (out of the total number of Xn individuals) from families of size a. and 2 . . For (a). but we can numerically obtain the solution for λ = 2 to get π0 ≈ 0. a members with probability pa . we solve φ(s) = s. and (d) compute the probability that the process ever dies out. Solutions to problems 1. . Solve the following problem for a = 1.

with p ∈ (0. Again.6 0 1 Recall that the n-step transition probabilities are given by powers of P . If a state. Assume that the transition matrix 0. beginning with 0.1.3. In fact.4054 0. .5412 0.5405 0.5. n − 1 and transitions from i to (i + 1) mod n with probability 1.5405 0. However. 0.052 Then. If the chain is irreducible. any period is possible. Why? What determines the limit? These questions will be answered in this chapter. Deterministic cycle.4048 0.1. The period of any state is 2 because the walker can return to her original position in any even number of steps. Example 15.4054 0. The period of any state is 1 because the walker can return in two steps (one step out and then back) or three steps (around the triangle). Example 15. So.5401 0. we call the entire chain aperiodic if all states have period 1. Example 15. and therefore its class.1 0 . has period 1.408 0. Let us look at some large powers of P . to four decimal places.054 .0541 and subsequent powers are the same to this precision. Random walk on the vertices of a square. The matrix elements appear to converge and the rows become almost identical.0541 .0541 P 8 ≈ 0. It can be shown that the period is the same for all states in the same class. then the chain has period 1.2 P = 0. the period of any state is 2 for the same reason. Example 15. .0543 P 4 = 0.9 and P00 = 0. if the following two transition probabilities are changed: P01 = 0. then it has period n. If a chain has n states 0. 1). 0 Example 15. 0. n We say that a state i ∈ S has period d ≥ 1 if (1) Pii > 0 implies that d|n and (2) d is the largest positive integer that satisﬁes (1).4. Simple random walk on Z. . the period of any state i with Pii > 0 is trivially 1.15 MARKOV CHAINS: LIMITING PROBABILITIES 169 15 Markov Chains: Limiting Probabilities is given by 0.4056 0. 1.5405 0. Random walk on the vertices of a triangle. but in no odd number of steps. .4054 0. .7 0.54 0.2. it is called aperiodic. 0.4 0.

Also. m. after time 0. πi . or stationary. let Ri = inf{n ≥ 1 : Xn = i} 170 be the ﬁrst time. an irreducible chain is positive recurrent if each of its states is. A vector of probabilities. i. let fi (n) = P (Ri = n|X0 = i) be the p. i ∈ S.1. Thus. Some of these have rather involved proofs (although none is exceptionally diﬃcult). Proof. we deﬁne nfi . Let Nn (i) be the number of visits to i in the time interval from 0 through n. it returns.15 MARKOV CHAINS: LIMITING PROBABILITIES For a state i ∈ S.. Then. Assume that the chain is irreducible and positive recurrent. We now state the key theorems. then we say that i is positive recurrent. for all j ∈ S. hence the proportion of time spent there is 1/mi . once per mi time steps. It can be shown that positive recurrence is also a class property: a state shares it with all members of its class. distribution for a Markov chain with transition matrix P if πi Pij = πj . Then. We can connect these to the familiar quantity ∞ fi = P (ever reenter i|X0 = i) = n=1 fi . mi < ∞. i∈S . P (Ri ≥ n) ≤ (1 − ) n/m . there must exist an m ≥ 1 and an > 0 so that i can be reached from any j in at most m steps with probability at least . It is not hard to show that a ﬁnite irreducible chain is positive recurrent. on average. that the chain is at i ∈ S. (n) ∞ mi = E[Ri |X0 = i] = n=1 If the above series converges. We skip a more detailed proof. (n) so that the state i is recurrent exactly when (n) ∞ n=1 fi = 1. such that i∈S πi = 1 is called an invariant. n→∞ lim Nn (i) 1 = . In this case. f. Proportion of the time spent at i. n mi in probability. The idea is quite simple: once the chain visits i. of Ri when the starting state is i itself (in which case we may call Ri the return time). Then.e. which goes to 0 geometrically fast. which we will merely sketch or omit altogether. Theorem 15.

at the age of 25. for every i. the diﬀerence between the two probabilities at time n is bounded above by twice the probability that coupling does not happen by time n. If a Markov chain is irreducible. 1 0 0 . for eigenvalue 1. which is given by 1 .] is a left eigenvector of P . A deterministic cycle with a = 3 has the transition matrix 0 1 0 P = 0 0 1 . f. . P (Xn = i) = πi . periodicity is necessary.]. m. . however. as we will see in the next example. . and positive recurrent. for all n. m. . Under the stated assumptions. πi = mi In fact. that is. then. which we already know to be 1/mi . An irreducible positive recurrent Markov chain has a unique invariant distribution.3. then [π1 . Theorem 15. the rows of P n are more and more similar to the row vector π as n becomes large. to be equal to πi . We will not.) Start with two independent copies of the chain — two particles moving from state to state according to the transition probabilities — one started from i.]. π2 . If πi is the p. Afterwards. . Example 15. The formula for π should not be a surprise: if the probability that the chain is at i is always πi . More important for us is the following probabilistic interpretation. . for X0 . and quite tragic. story. the two particles move together in unison. n Recall that Pij = P (Xn = j|X0 = i) and note that the limit is independent of the initial state. that is. Convergence to invariant distribution. they will eventually meet. but. . We will not go into greater detail. Thus. . for all i ∈ S. He made signiﬁcant mathematical contributions during his army service.15 MARKOV CHAINS: LIMITING PROBABILITIES 171 In matrix form.6. go deeper into the proof. Existence and uniqueness of invariant distributions. an irreducible chain is positive recurrent if and only if a stationary distribution exists. which goes to 0. The most elegant proof of this theorem uses coupling. π2 . Thus [π1 . an important idea ﬁrst developed by a young French probabilist Wolfgang Doeblin in the late 1930s. (Doeblin’s life is a romantic. Theorem 15. π2 . that is. the other using the initial distribution π. Thus. he died as a soldier in the French army in 1940. π2 . An immigrant from Germany. if we put π into a row vector [π1 .2. .] · P = [π1 . then one should expect that the proportion of time spent at i. f. n→∞ n lim Pij = πj . P (X0 = i) = πi . it is also the p. . they are coupled . j ∈ S. . . for X1 and hence for all other Xn . aperiodic.

6π2 +π3 = π2 This system has inﬁnitely many solutions and we need to use π1 + π2 + π3 = 1 . That was clearly an irreducible and aperiodic (note that P00 > 0) chain. For every i. Although the chain does spend 1/3 of the time at each state. 1 mi . Convergence theorem for a ﬁnite state space S. Example 15. 1. then.4π2 0. so that md+r lim Pij = dπj m→∞ and so that n Pij = 0 for all n such that n = r mod d. The invariant distribution [π1 . there exists an integer r. n in probability. 3. There exists a unique invariant distribution given by πi = 2. Assume that s Markov chain with a ﬁnite state space is irreducible. 0 ≤ r ≤ d − 1. 4. irrespective of the initial state.1π1 = π1 = π3 0. If the chain is periodic with period d. and for us the most common. etc. If the chain is also aperiodic. 1 Nn (i) → πi . the transition probabilities are a periodic sequence of 0’s and 1’s and do not converge. 0 0 1 P 2 = 1 0 0 . case.1. Our ﬁnal theorem is mostly a summary of the results for the special. Example 15.7. for every pair i. P 4 = P . 0 1 0 P 3 = I.4.7π1 + 0. j ∈ S. n Pij → πj . π3 ] is given by 0.15 MARKOV CHAINS: LIMITING PROBABILITIES 1 3 172 (as it is easy to This is an irreducible chain with the invariant distribution π0 = π1 = π2 = check). Theorem 15. for all i and j. then. Moreover.2π1 + 0. We begin by our ﬁrst example. π2 .

and 4. The general two-state Markov chain. what proportion of time does the chain spend at 2? Answer: π2 (and this does not depend on the starting state). 1−β+α • In the long run.9. π1 = π2 = Here are a few common follow-up questions: • Start the chain at 1.0541.4054. after some algebra. 2} and P = and we assume that 0 < α. 37 37 37 Example 15. In this example we will see how to compute the average length of time a chain remains in a subset of states. Here S = {1. the answer does not depend on the starting state). απ1 + βπ2 π1 + π2 and. but down at the next time step (this is called the breakdown rate). π3 = ≈ 0. (b) Compute the proportion of time that the system is up.5405. working properly. Assume that a machine can be in 4 states labeled 1. In the long run. as it needs to be at 1 at the previous time and then make a transition to 2 (again. while at the previous time it was at 1? Answer: π1 P12 . In states 3 and 4 the machine is down. α 1−α β 1−β = π1 =1 (1 − α)π1 + (1 − β)π2 = π2 β . once the subset is entered. 1 1 1 4 4 4 4 1 1 1 4 4 0 2 (a) Compute the average length of time the machine remains up after it goes up.8. In states 1 and 2 the machine is up. Suppose that the transition matrix is 1 1 1 4 4 2 0 0 1 1 1 P = 1 4 2 4 .15 MARKOV CHAINS: LIMITING PROBABILITIES 173 to get the unique solution π1 = 20 15 2 ≈ 0. 2. what proportion of time is the chain at 2. What is the expected return time to 2? Answer: 1 π2 . Example 15. 1+β−α 1−α . . β < 1. π2 = ≈ 0. out of order. 3. • Start the chain at 2.

1].. . From i. the breakdown rate is 48 48 π1 (P13 + P14 ) + π2 (P23 + P24 ) = the answer to (b). . . u+d 48 For the second equation. Nowadays this is not diﬃcult to do.10. thus [1. If [1. f. . We will achieve this by ﬁguring out the two equations that u and d must satisfy.5. then [1. . . This vector is preserved by right 1 multiplication by P . For the ﬁrst equation. 1). Simple random walk on a circle. .e. 1. u = 14 . which works out to be π1 = π3 = 14 . we use that the proportion of time the system is up is u 21 = π1 + π2 = . the walker moves to i + 1 (with a identiﬁed with 0) with probability p and to i − 1 (with −1 identiﬁed with a − 1) . . . We begin with computing the invariant distribution. a − 1 are arranged on a circle clockwise. i. 9 Computing the invariant distribution amounts to solving a system of linear equations. Invariant distribution in a doubly stochastic case. Proof. Then. it is worthwhile to observe that there are cases when the invariant distribution is easy to identify. m. Assume that a points labeled 0. Pick a probability p ∈ (0. u+d 32 the breakdown rate from (b). still. We need to compute u to answer (a). 1]. . 1] is the row vector with m 1’s. 174 π2 = 12 48 . . 1]P is exactly the vector of column sums. let u be the average stretch of time the machine remains up and let d be the average stretch of time it is down. . from one repair to the next repair. . We call a square matrix with nonnegative entries doubly stochastic if the sum of the entries in each row and in each column is 1. 9 . to answer (a). Proposition 15. Example 15. π4 = 13 . . This vector speciﬁes the uniform p. . This means 1 9 = . . . on S. as is m [1.15 MARKOV CHAINS: LIMITING PROBABILITIES 9 48 . . m}. we use that there is a single breakdown in every interval of time consisting of the stretch of up time followed by the stretch of down time. . The system of two equations gives d = 2 and. Assume that S = {1. If the transition matrix for an irreducible Markov chain with a ﬁnite state space S is doubly stochastic. . . . as usual. even for enormous systems. . 32 Now. . its (unique) invariant measure is uniform over S.

. (b) n Determine limn→∞ P10 . the chain is aperiodic if a is odd and otherwise periodic 1 with period 2.. .... 0 0 0 0 p 0 0 0 . the proportion of time the walker spends at any state is a . π0 is larger by the factor 1−r than other πi . Consider the chain in Problem 2 of Chapter 12. 0 0 0 0 0 0 1−p 0 0 1−p 1−p 0 0 p 0 and is doubly stochastic. Clearly.. Determine the proportion of time the walker spends at a... Moreover.. He is equally likely . the row vector with invariant distributions is 1 1−r 1 +a−1 1 1−r 1 . we can still determine the invariant distribution if only the self-transition probabilities Pii are changed. Therefore. which is n n also the limit of Pij for all i and j if a is odd.. . where it spends 1 a Geometric(1 − r) time at every visit. The other transition probabilities are unchanged. while if they are of the same parity. 3. 1−r 1+(1−r)(a−1) . with proof. Roll a fair die n times and let Sn be the sum of the numbers you roll. Thus. Consider the chain in Problem 4 of Chapter 12 with the same initial state. 4. The transition matrix is 0 p 0 0 1 − p 0 p 0 0 1−p 0 p P = . Problems 1. the chain is the same as before and spends an equal an proportion of time at all states.. . (a) Determine the invariant distribution.15 MARKOV CHAINS: LIMITING PROBABILITIES 175 with probability 1 − p. 1).. Peter owns two pairs of running shoes. Thus. Why does it exist? 2. If a is even. only when the walker is at 0. Pij → a .. . now the chain is aperiodic for any a. moves to 1 with probability (1 − r)p. Assume that we change the transition probabilities a little: assume that. Each morning he goes running. 1 = 1 1+(1−r)(a−1) 1−r 1+(1−r)(a−1) . she stays at 0 with probability r ∈ (0.. Therefore. Determine. but the transition matrix is no longer doubly stochastic. if we stop the clock while she stays at 0. It follows that our perturbed chain spends the same proportion of time at all states except 0. and to a − 1 with probability (1 − r)(1 − p).. then Pij = 0 if (i − j) and n have a 2 n diﬀerent parity. limn→∞ P (Sn mod 13 = 0).. What happens to the invariant distribution? The walker spends a longer time at 0.

15 MARKOV CHAINS: LIMITING PROBABILITIES 176 to leave from his front or back door. . . Consider Sn mod 13. i − 1 with equal probability. So the answer is 13 . 41 . 6. 2. 4. (a) What proportion of days does he run barefoot? (b) What proportion of days is there at least one pair of shoes at the front door (before he goes running)? (c) Now. shift cyclically to the right. 4. Then 3 1 4 4 0 P = 1 1 1 . The assignment changes randomly from year to year as a Markov chain with transition matrix 7 1 1 10 1 5 1 10 5 3 5 2 5 10 1 . 2. Prof.. . 1. 12 and transition matrix is 1 0 1 1 6 1 1 1 0 . 3.. . Messi does one of three service tasks every year. 2. 5 1 5 Determine the proportion of years that Messi has the same assignment as the previous two years. 21 . 4 2 4 0 1 3 4 4 . he chooses a pair of running shoes at the door from which he leaves or goes running barefoot if there are no shoes there. . and transition from 0 to 4 with probability 1. coded as 1. assume also that the pairs of shoes are green and red and that he chooses a pair at random if he has a choice. Consider the Markov chain with states 0. 1 1 1 1 1 1 6 6 6 6 6 6 0 0 .. . . 21 and (b) the limit is 5 8 8 20 41 . (a) π = 10 π0 = 21 . . What proportion of mornings does he run in green shoes? 5. and 3. n Show that all Pij converge as n → ∞ and determine the limits. which transition from state i > 0 to one of the states 0. The chain is irreducible and aperiodic. 41 and the answer is 3. The chain is irreducible and aperiodic. On his return. for all i. 0 (To get the next row. Moreover. Consider the Markov chain with states given by the number of shoes at the front door. 41 . 1 . 0 6 6 6 6 6 .) This is a doubly stochastic matrix with 1 1 πi = 13 . 41 10 5 6 21 . Moreover. This is a Markov chain with states 0. . π = π1 + π3 = 13 . he is equally likely to enter at either door and leaves his shoes (if any) there. Solutions to problems 1. . Upon leaving the house..

4. Answers: (a) π0 · 3 1 1 π1 + π2 = 2 . 37 37 . 17 = 1 . (c) π0 · 4 + π2 · 1 + π1 · 1 = 3 . where π is given 6 4 3 by π = 12 . Solve for the invariant distribution: π = 19 50 . 1. 37 . 12 . 3 4 2 5. 37 . 2.15 MARKOV CHAINS: LIMITING PROBABILITIES 1 2 177 + π2 · 1 2 This is a doubly stochastic matrix with π0 = π1 = π2 = 1 . 17 . j = 0. The answer is π1 ·P11 +π2 ·P22 +π3 ·P33 = n 6. As the chain is irreducible and aperiodic. (b) 3 2 2 2 . 6 7 4 17 . 3. 37 . Pij converges to πj .

2.15 MARKOV CHAINS: LIMITING PROBABILITIES 178 Interlude: Practice Midterm 2 This practice exam covers the material from chapters 12 through 15. 0. Explain how you would compute the probability that it will rain on Saturday. N ). from 1 she moves to 2. each chosen with equal probability. (c) Under the same assumption as in (b). Write down the transition probability matrix for this chain. the individual has no descendants. it will rain tomorrow with probability 0. . (b) Today is Wednesday and it is raining. she moves to a random state among the four states. (a) Show that this chain has a unique invariant distribution and compute it. Your answer will. N ). if it did not rain yesterday. (Take a good look at the transition matrix before you start solving this). 1 1 2 0 2 1 1 1 4 2 4 P = 1 0 1 2 2 0 0 0 0 0 0 5. from 2 to 3. but rained today. (R. If she is at i at some time. With probability 1 2 . given by the following transition matrix: 0 0 0 0 0 0 . 2 (a) Compute π0 . If this coin comes out Tails. Do not simplify. In a branching process the number of descendants is determined as follows. R). weather today) with R = rain and N = no rain. Do not carry out the computation. if 2 she is at 0 she moves to 1. (N. (b) Write down the expression for the probability that the branching process is still alive at generation 3. (a) Let Xn be the Markov chain with 4 states.7. depend on the parameter p. explain how you would approximate the probability of rain on a day exactly a year from today. it will rain tomorrow with probability 0. 1. but do not carry out the computation. With probability 1 she moves from i to (i + 1) mod 4 (that is. (N. each with probability 1 . of course. A random walker is in one of the four states. if it rained yesterday but not today. 4. Give yourself 50 minutes to solve the four problems. 4.5. 1 1 2 1 2 2 1 2 Specify the classes and determine whether they are transient or recurrent. which code (weather yesterday. Suppose that whether it rains on a day or not depends on whether it did so on the previous two days. R). 3. Carefully justify your answer.4. 1. if it did not rain yesterday nor today. the probability that the branching process eventually dies out. If the coin comes Heads. or 3. 2. 2. it will rain tomorrow with probability 0. then it will rain tomorrow with probability 0. Consider the Markov chain with states 1. An individual ﬁrst tosses a coin that comes out Heads with probability p. (R. and from 3 to 0). 3. If it rained yesterday and today. It also rained yesterday. 2. which you may assume have equal point score. she makes the following transition. the individual has 1 or 2 descendants.

compute the proportion of time she spends at 1.15 MARKOV CHAINS: LIMITING PROBABILITIES 179 (b) After the walker makes many steps. compute the proportion of times she is at the same state as at the previous time. . Does the answer depend on the chain’s starting point? (c) After the walker makes many steps.

15 MARKOV CHAINS: LIMITING PROBABILITIES 180 Solutions to Practice Midterm 2 1.2. which code (weather yesterday. Carefully justify your answer. R) state 2. the sum of the ﬁrst two entries of 1 0 0 0 · P 3 . our solution is 1 3 1 1 0 0 0 · P · . The initial state is given by the row [1. The transition matrix is 0. R). Solution: The matrix P is irreducible since the chain makes the following transitions with positive probability: (R. N ) → (N.7 0 0. N ). N ) state 3. (c) Under the same assumption as in (b). It is also . if it rained yesterday but not today.5. then it will rain tomorrow with probability 0. explain how you would approximate the probability of rain on a day exactly a year from today. (R. it will rain tomorrow with probability 0. R) be state 1. it will rain tomorrow with probability 0. N ) → (N.5 0 0. (N. Solution: Let (R. 0] and it will rain on Saturday if we end up at state 1 or 2. but do not carry out the computation. R). R) → (R. Suppose that whether it rains on a day or not depends on whether it did so on the previous two days. and (N. 0 0 that is. N ). Therefore.7. If it rained yesterday and today. R). N ) state 4. (R.2 0 0. Do not carry out the computation. Write down the transition probability matrix for this chain. then Saturday is time 3. but rained today. R) → (R.8 (b) Today is Wednesday and it is raining.4. Explain how you would compute the probability that it will rain on Saturday. (N. 0. if it did not rain yesterday.3 0 0. 0 0. weather today) with R = rain and N = no rain. Solution: If Wednesday is time 0. it will rain tomorrow with probability 0. It also rained yesterday.5 0 P = 0 0. (N. if it did not rain yesterday nor today. (a) Let Xn be the Markov chain with 4 states. 0. (R.4 0 0.6 .

the probability that the branching process eventually dies out.. . Your answer will. π2 . 2. depend on the parameter p. 4. 5. We have p p φ(s) = 1 − p + s + s2 . 3. π4 ] is the unique solution to [π1 . Consider the Markov chain with states 1. the number of descendants is determined as follows. given by the following transition matrix: 1 1 2 0 2 0 0 1 1 1 0 0 4 2 4 1 P = 2 0 1 0 0 . π3 . 2 2 0 = ps2 + (p − 2)s + 2(1 − p). of course. i. 5} is recurrent. Solution: Answer: • {2} is transient. 2 2 Then. 2. the individual has no descendants.e. 2 2 3p 2 If 2 ≤ 1. p p s = 1 − p + s + s2 . we need to compute φ(s) and solve φ(s) = s.15 MARKOV CHAINS: LIMITING PROBABILITIES 181 aperiodic because the transition (R. 2 0 0 0 1 1 2 2 1 0 0 0 2 1 2 Specify all classes and determine whether they are transient or recurrent. If this coin comes out Tails. each with probability 1 . π4 ] · P = [π1 . Therefore. π4 ] and π1 + π2 + π3 + π4 = 1. π3 . In a branching process. An individual ﬁrst tosses a coin that comes out Heads with probability p. Solution: The probability mass function for the number of descendants is 0 1 1−p p 2 and so 2 p 2 . π2 . p 3p +p= . the individual has 1 or 2 descendants. then π0 = 1. R) → (R. p ≤ 3 . π3 . Otherwise. 2 (a) Compute π0 . 3. If the coin comes Heads. • {4. where [π1 . π2 . 3} is recurrent. the probability can be approximated by π1 + π2 . R) has positive probability. E(number of descendants) = 0 = (s − 1)(ps − 2(1 − p)). • {1.

A random walker is at one of the four states. each chosen 2 with equal probability.15 MARKOV CHAINS: LIMITING PROBABILITIES We conclude that π0 = 2(1−p) . 0. (Note that irreducibility is trivial. if she is at 0 she moves to 1. 2 2 2 2 4. she makes the following transition. 1 ] is the unique invariant 4 4 distribution. compute the proportion of time she is at the same state as at the previous time. 4 . (c) After the walker makes many steps. Solution: The answer is 1 − φ(φ(φ(0))) and we compute p p φ(φ(0)) = 1 − p + (1 − p) + (1 − p)2 . 3 (b) Write down the expression for the probability that the branching process is still alive at generation 3. from 1 she moves to 2. P = 1 8 1 8 1 8 5 8 5 8 1 8 1 8 1 8 1 8 5 8 1 8 1 8 1 8 1 8 5 8 1 8 . With probability 1 . p 182 if p > 2 . With probability 1 she moves from i to (i + 1) mod 4 2 (that is. and from 3 to 0). (a) Show that this chain has a unique invariant distribution and compute it. as all entries are positive). . independently of the starting point. Does the answer depend on the chain’s starting point? Solution: 1 The proportion equals π1 = 4 . (b) After the walker makes many steps. she moves to a random state among the four states. 2. from 2 to 3. or 3. Do not simplify. π = [ 1 . (Take a good look at the transition matrix before you start solving this). 4 . 2 2 p p 1 − φ(φ(φ(0))) = 1 − 1 − p + (1 − p + (1 − p)+ 2 2 p p p p 2 (1 − p) ) + (1 − p + (1 − p) + (1 − p)2 )2 . compute the proportion of time she spends at 1. If she at i at some time. Solution: The transition matrix is φ(0) = 1 − p. 1 1 As P is a doubly stochastic irreducible matrix. 1.

15 MARKOV CHAINS: LIMITING PROBABILITIES Solution: The probability of staying at the same state is always 1 . 8 183 . which is the answer.

as their time-reversal Xn . invariance is automatic. If a probability mass function πi satisﬁes the condition in the previous theorem. started at its unique invariant distribution π. f. πj Pji = πj i . Proof. For reversibility. Is there a condition for reversibility that can be easily checked? The ﬁrst thing to observe is that for the chain started at π. This is not completely intuitively clear.’s must agree. Xn = in ) πi Pij Pjik+2 · · · Pin−1 in = πj Pjik+2 · · · Pin−1 in πi Pij = . Xn = in ) = P (Xk+1 = j. We have proved the following useful result. but can be checked: P (Xk = i|Xk+1 = j. Reversibility implies invariance. . . . . Another useful fact is that once reversibility is checked. We need to check that. Recall that this means that π is the p. Xn have the same joint p. . Xk+1 = j. Proposition 16. if both the original and the time-reversed chain have the same transition probabilities (and we already know that the two start at the same invariant distribution and that both are Markov). Theorem 16. for every j. Xk+2 = ik+2 . Reversibility condition. Now suppose that. Conversely. . m. the time-reversed chain has the Markov property. . . also said that its invariant distribution π is reversible.2.1. reversible or not. Xk+2 = ik+2 . . . f. A Markov chain with invariant measure π is reversible if and only if πi Pij = πj Pji . Xn−1 . f. πj which is an expression dependent only on i and j. Xn = in ) P (Xk = i. . X1 . This means that a recorded simulation of a reversible chain looks the same if the “movie” is run backwards. . m. of X0 and of all other Xn as well. for all states i and j. X0 . we call the chain reversible — sometimes it is. Then. then their p. Xk+2 = ik+2 . equivalently. . . . X0 . .16 MARKOV CHAINS: REVERSIBILITY 184 16 Markov Chains: Reversibility Assume that you have an irreducible and positive recurrent chain. this expression must be the same as the forward transition probability P (Xk+1 = i|Xk = j) = Pji . πj = πi Pij = i i i πi Pij . then it is invariant. . . . m. and here is how we do it: Pji = πj . . for every n.

Then. We should observe that this chain is irreducible exactly when the graph with present edges (those with wij > 0) is connected.k be the sum of all weights and let wik . the invariant distribution is πi = degree of i .16 MARKOV CHAINS: REVERSIBILITY 185 We now proceed to describe random walks on weighted graphs. The number of neighbors of a vertex is commonly called its degree. compute πi = k πi Pij = k wik · s wij wij = . Let s= wik i. we think of edges with zero weight as not present at all. Finally. we note that there is no reason to forbid self-edges: some of the weights wii may be nonzero. When at i. 1 6 2 5 3 4 . with i = j. each wii appears only once in s.) By far the most common examples have no self-edges and all nonzero weights equal to 1 — we already have a name for these cases: random walks on graphs. s To see why this is a reversible distribution. Assume that every undirected edge between vertices i and j in a complete graph has weight wij = wji . s k wik which clearly remains the same if we switch i and j. Consider the random walk on the graph below. appears there twice. The graph can only be periodic and the period can only be 2 (because the walker can always return in two steps) when it is bipartite: the set of vertices V is divided into two sets V1 and V2 with every edge connecting a vertex from V1 to a vertex from V2 . 2 · (number of all edges) Example 16. k wik What makes such random walks easy to analyze is the existence of a simple reversible measure. the most easily recognizable examples of reversible chains. while each wij . (However. the walker goes to j with probability proportional to wij .1. so that wij Pij = .

πi Pi. i Pi. but not aperiodic (it has period 2). How can we choose Pii so that π becomes 1 uniform. Ehrenfest chain.i−1 = πi−1 Pi−1. 1}V .i−1 = . where V is a large set of n sites. π4 = . the answer is 2 . Assume that we have a very large probability space. for all other i. thus. We have M balls distributed in two urns. i 2M Let us check that this is a reversible measure. and it is straightforward to do so. pick a ball at random. Note that this chain is irreducible. as switching any ball between the two urns does not alter this assignment.i+1 = . 18 18 18 18 18 18 and. π5 = .2. Example 16.16 MARKOV CHAINS: REVERSIBILITY 186 What is the proportion of time the walk spends at vertex 2? The reversible distribution is π1 = 3 4 2 3 3 3 . The following equalities need to be veriﬁed: π0 P01 = π1 P10 . 1 ).1 = 1. 9 Assume now that the walker may stay at a vertex with probability Pii . π is Binomial(M.M . M M −i Pi. Each time. Prove that this chain has a reversible distribution. Thus. PM. Let Xn be the number of balls in urn 1.M −1 = πM −1 PM −1. Thus. The nonzero transition probabilities are P0.i .i . π3 = . π2 = . Markov chain Monte Carlo. πM PM. say some subset of S = {0. πi Pi. π6 = . w22 = 0. w33 = 2.M −1 = 1. and wii = 1.i+1 = πi+1 Pi+1. but when she does move she moves to a random neighbor as before. πi = 6 for all i? We should choose the weights of self-edges so that the sum of the weights of all edges emanating from any vertex is the same. 2 πi = M 1 . Assume also that . Example 16. M Some inspiration: the invariant measure puts each ball at random into one of the two urns.3. move it from the urn where it currently resides to the other urn.

ωn ) that satisfy the constraints above will be our state space and the energy function E on S is given as E = −V . but possible. there is no known algorithm to solve it quickly. thus the system is much more likely to be found in the close to minimal energy states. Z Here. the large energy states have a much lower probability than the small energy ones. . (There are a few celebrated cases. the convergence slows down at a rate exponential in T −1 when T is small. Your backpack (knapsack) has a weight limit b. 1} and that the combined weight does not exceed the backpack capacity.16 MARKOV CHAINS: REVERSIBILITY 187 we have a probability measure on S given via the energy (sometimes called the Hamiltonian) function E : S → R. which is typically exponential in n. . The set S of feasible solutions ω = (ω1 . We will illustrate this on the Knapsack problem. π(ω) = If T is very large. with weights wi and values vi . you have to maximize the combined value of the items you will carry out n V = V (ω1 . in fact. n wi ωi ≤ b. . Finally. we design a Markov chain. a parameter. a gradual lowering of the temperature to improve the solution. in honor of N. . The probability of any conﬁguration ω ∈ S is 1 − 1 E(ω) T . ωn ) = i=1 vi ωi . Metropolis. we are ready to specify the Markov chain (sometimes called a Metropolis algorithm. as we want to maximize V . You are faced with a question of how to ﬁll your backpack. if T is very small. i=1 This problem is known to be NP-hard.) Instead of generating a random state directly. . . You see a large number n of items. that is. This is in startling contrast to the size of S. The temperature T measures how good a solution we are happy with — the idea of simulated annealing is. On the other hand. . however. T > 0 is the temperature. which has π as its invariant distribution. some states. in which exact computations are diﬃcult. There is give and take: higher temperature improves the speed of convergence and lower temperature improves the quality of the result. subject to the constraints that ωi ∈ {0. a pioneer in computational physics). Assume that you are a burglar and have just broken into a jewelry store. . They have numerous other applications. and we have a reasonable answer. It is common that the convergence to π is quite fast and that the necessary number of steps of the chain to get close to π is some small power of n. and have yielded an optimization technique called simulated annealing. Such distributions frequently occur in statistical physics and are often called Maxwell-Boltzmann distributions. However. according to P . If we want to ﬁnd states with small energy. as Z is some enormous sum. especially in optimization problems. and Z is the normalizing constant that makes ω∈S π(ω) = 1. we merely choose some small T and generate at random. although E is typically a simple function. The only problem is that. the role of the energy is diminished and the states are almost equally likely. π is very diﬃcult to evaluate exactly. Assume that the chain is at . called exactly solvable systems.

4 and 12. then it is neutral or negative tomorrow with equal probability. Problems 1. the new state has higher energy. thus. Xt = ω. • if E(ω i ) − E(ω) > 0. Let ω i be the same as i ω except that its ith coordinate is ﬂipped: ωi = 1 − ωi . If it is . or else stay at ω.) If ω i is not feasible. which corresponds to the energy input from the environment.16 MARKOV CHAINS: REVERSIBILITY 188 state ω at time t. At each step.e. then make the transition to ω i with probability e T (E(ω)−E(ω )) . Xt+1 = ω i . for any pair ω. The state of this Markov chain can. π(ω)P (ω. then the equality reduces to e− T E(ω) = e− T E(ω ) e T (E(ω and similarly in the other case. in physicist’s terms. Otherwise. 3. Guess the invariant measure for this chain and prove that it is reversible. ω ) = π(ω )P (ω . that is. and assume that both are feasible (as only such transitions are possible). Note that. but. neutral. uniformly at random. then make the transition to ω i . Pick a coordinate i. but is it the right one. 1 1 i 1 i )−E(ω)) 1 i . your opinion on a particular political issue is either positive. (This means that the status of the ith item is changed from in to out or from out to in. A total of m white and m black balls are distributed into two urns. 2. then Xt+1 = ω and the state is unchanged. If it is positive today. be described by the number of black balls in urn 1. We need to check that this chain is irreducible on S: to see this. for arbitrary i. π? In fact. evaluate the diﬀerence in energy E(ω i ) − E(ω) and proceed as follows: • if E(ω i ) − E(ω) ≤ 0.10. Each day. a ball is randomly selected from each urn and the two balls are interchanged. and this is enough to do with ω = ω i . or negative. the measure π on S is reversible. If E(ω i ) − E(ω) ≤ 0. ω). the chain has a unique invariant measure. we tolerate the transition because of temperature. with m balls per urn. We need to show that. Note ﬁrst that the normalizing constant Z cancels out (the 1 key feature of this method) and so does the probability n that i is chosen. ω ∈ S. and then back by reversing the steps. i.. Determine the invariant distribution for the random walk in Examples 12. note that we can get from any feasible solution to an empty backpack by removing object one by one. Thus. in the second case.

i−1 = Reversibility check is routine. 2 2 m m m2 The only way to check reversibility is to compute the invariant distribution π1 . you get πi = and the transition probabilities are Pi. A king moves on a standard 8 × 8 chessboard. compute the expected number of steps before it returns to the starting position. 2 4 1 4 1 4 1 2 m i m m−i 2m m . 3 10 . then we have 0 1 1 2 2 1 P = 4 1 1 .5. Pi. so π(i. it is equally likely to be either of the other two possibilities. 2. What is. (b) Now you have two kings.i = . π3 . 5 5 4. otherwise. 1.j) = πi πj and the 3 answer is 420 2 . i2 (m − i)2 2i(m − i) . π3 on the diagonal and to check that DP is symmetric.i+1 = . now. indeed. 1 5. If i is a corner square. In (b). and. If you choose m balls to put into urn 1 at random. and DP is. π2 . it makes one of the available legal moves (to a horizontally. Is this a reversible Markov chain? 4. Each time. π2 . 3. If the three states are labeled in the order given. (a) Assuming that the king starts at one of the four corner squares of the chessboard. the expected number of steps before they simultaneously occupy the starting position? Solutions to problems 1. and 8 (36 remaining squares). 3 . it stays the same with probability 0. 3 5 (24 side squares).16 MARKOV CHAINS: REVERSIBILITY 189 neutral or negative. so the answer to (a) is 420 . symmetric. π3 = 2 . so the chain is reversible. 2. they both start at the same corner square and move independently. Answer: π = 1 5. you have two independent chains. form the diagonal matrix D with π1 . vertically or diagonally adjacent square) at random. We get 1 π1 = 5 . πi = 3·4+5·24+8·36 . Pi. and 3. π2 = 2 . This is a random walk on a graph with 64 vertices (squares) and degrees 3 (4 corner squares). 3 10 .

This is clearly a losing game if p < 1 . take a simple random walk which makes a +1 step with probability p2 and −1 step with probability 1 − p2 . with probability p. Starting from a multiple of M . Win $1. if it is not. and. you add +1 with probability p1 . by the Gambler’s ruin computation (Example 11. Then.e. We now provide a detailed analysis. you add +1 with probability p2 and −1 with probability 1 − p2 .. Game A is very simple. If it is.6). in fact it is an asymmetric one-dimensional simple random walk. As mentioned. 1− 1− 1−p2 p2 1−p2 p2 x M (17. the player loses more and more money. after playing it for a long time. a player’s capital becomes more and more negative.2) p1 · 1− 1− 1−p2 p2 1−p2 p2 M . We will call a game losing if.1) P (the walk hits M before 0) = . with probability 1 − p.) Similarly. and an integer period M ≥ 2. at every step. To analyze game B. Assume that you start this walk at some x.. (You have to make a step to the right and then use the formula (17. in which you. We will determine below when this is a losing game.17 THREE APPLICATIONS 190 17 Three Applications Parrondo’s Paradox This famous paradox was constructed by the Spanish physicist J. i. p1 . We will choose particular values once we are ﬁnished with the analysis. play A with probability γ and B with probability 1 − γ. 0 < x < M . p2 . from a multiple of M . and lose a dollar. Now consider game C. general so that the description of the games is more transparent. 2 In game B.1) with x = 1. add −1 to your capital. i. These parameters are.. add +1 to your capital.e. game A is easy. and −1 with probability 1 − p1 . Is it possible that A and B are losing games. while C is winning? The surprising answer is yes! However. the probability that you increase your capital by M before either decreasing it by M or returning to the starting point is (17. for now. the probability that you decrease your capital by M before either increasing it . this should not be so surprising as in game B your winning probabilities depend on the capital you have and you can manipulate the proportion of time your capital spends at “unfavorable” amounts by playing the combination of the two games. Parrondo. B and C with ﬁve parameters: probabilities p. We will consider three games A. i.e. the winning probabilities depend on whether your current capital is divisible by M . and γ.

Assume that the greatest common divisor of the set {k : fk > 0} is 1. . n→∞ lim un = 1 .2)<(17.3) (1 − p1 ) · − 1− 1−p2 p2 1−p2 p2 M M . . After some algebra.3). p1 = 112 . Assume that f1 . Then. Deﬁne a sequence un as follows: un = 0 u0 = 1.4) (1 − p1 )(1 − p2 )M −1 > 1.3). Let µ = if n < 0. Let pn = P (Sm ever equals n). 11 2 5 300 A Discrete Renewal Theorem Theorem 17. .1. .5).5) (1 − q1 )(1 − q2 )M −1 < 1. un = k=1 fk un−k if n > 0.000 . .1) with x = M − 1). Why? Observe your capital at multiples of M : if. fN ≥ 0 are given numbers with N k=1 k fk . N N k=1 fk = 1. yielding a winning game if (17. Roll a fair die forever and let Sm be the sum of outcomes of the ﬁrst m rolls. p1 pM −1 2 Now. game C is the same as game B with p1 and p2 replaced by q1 = γp + (1 − γ)p1 and q2 = γp + (1 − γ)p2 .) The main trick is to observe that game B is losing if (17. this condition reduces to (17. the probability that the next (diﬀerent) multiple of M you visit is (k − 1)M exceeds the probability that it is (k + 1)M . M q1 q2 −1 This is easily achieved with large enough M as soon as p2 < 1 and q2 > 1 .4) and 217 in (17. to get 6 in (17. (Now you have to move one step to the left and then use 1−(probability in (17. Estimate p10. p2 = 10 . but even for M = 3. µ Example 17. γ = 1 . starting from kM .17 THREE APPLICATIONS 191 by M or returning to the starting point is 1−p2 p2 M −1 (17. 2 2 5 1 one can choose p = 11 . then the game is losing and that is exactly when (17.2)<(17.1.

however. We can assume.) By the above theorem. The chain is irreducible (you can get to N − 1 from anywhere. .0 is 1. i. except for renewals. since the sum of fk ’s is 1). the jump to 0 is certain (note that the matrix entry PN −1. and from 0 anywhere) and we will see shortly that is also aperiodic. . more easily. . At N − 1. 1. approximately. and we can solve it. Deﬁne a Markov chain with state space S = {0. then. but this is a lot of work! (Note that one should either modify the recursion for n ≤ 5 or. 1−f1 1−f1 f3 1−f1 −f2 −f3 0 0 . 1 pn = (pn−1 + · · · + pn−6 ) . . .. that fN > 0 (or else reduce N ). 1−f1 −f2 1−f1 −f2 . for k = 2 it equals (1 − f1 ) · and. from N − 1 to 0. deﬁne pn = 0 for n < 0. but again we can avoid the work by applying the theorem to get that pn 1 converges to 2−p . . if k = 1. Assume that a random walk starts from 0 and jumps from x either to x + 1 or to x + 2. .. 000? The recursion is now much simpler: p0 = 1. 1 − f1 1 − f1 − f2 f2 = f2 . Proof. 1 − f1 . If X0 = 0 and R0 is the ﬁrst return time to 0. without loss of generality.2. then P (R0 = k) clearly equals f1 . 6 192 and then solve it. respectively..17 THREE APPLICATIONS One can write a linear recursion p0 = 1... . Then. . the probability that the walk ever hits 10. Example 17. What is. pn = p · pn−1 + (1 − p) · pn−2 . with probability p and 1 − p. 1−f1 −f2 f2 0 0 . 1−f1 −···−fN −1 This is called a renewal chain: it moves to the right (from x to x + 1) on the nonnegative integers.. jumps to 0. we can 2 immediately conclude that pn converges to 7 .e. fN 0 0 0 .. N − 1} by f1 1 − f1 0 0 . for k = 3 it equals (1 − f1 ) · f3 1 − f1 − f2 · = f3 .

observe that you must return to 0 at some time not exceeding n in order to end up at 0. the promised aperiodicity follows. Another way to compare two patterns is the horse race: you and your adversary each choose a pattern. certainly. So. you have to be back at 0 in n − k steps. so we can always sum to N with the proviso n−k that P00 = 0 when k > n.6) n P00 = k=1 n−k P (R0 = k)P00 . Moreover. The initial conditions are also the n same and we conclude that un = P00 . however. at most N . and the person whose pattern appears ﬁrst wins. We conclude that (recall again that X0 = 0) P (R0 = k) = fk for all k ≥ 1. as the chain can return to 0 in k steps if fk > 0. Here are the natural questions. n The next observation is that the probability P00 that the chain is at 0 in n steps is given by the recursion n (17. Patterns in coin tosses Assume that you repeatedly toss a coin. The above formula (17. with Heads represented by 1 and Tails represented by 0. but what is it exactly? One can compare two patterns by this waiting game. 1 occurs with probability p. from (17. which ends the proof. say 1011101. we note that the ﬁrst return time to 0 is. In particular. In this case. On any toss. either you return for the ﬁrst time at time n or you return at some previous time k and. Assume also that you have a pattern of outcomes. lim un = lim P00 = n→∞ n→∞ m00 µ n−k fk P00 . say 1001 and 0100.17 THREE APPLICATIONS 193 and so on.3) that 1 1 n = . To see this. What is the expected number of tosses needed to obtain this pattern? It should 1 be about 27 = 128 when p = 2 . How do we compute the expectations in the waiting game and the probabilities in the horse race? Is the pattern that wins in the waiting game more .6) we get N n P00 = k=1 n The recursion for P00 is the same as the recursion for un . the expected return time to 0 is N m00 = k=1 kfk = µ. It follows from the convergence theorem (Theorem 15.6) is true for every Markov chain. then. saying that the one with the smaller expected value wins.

be formulated as follows: compute E(∅ → 1011101). P (Xn = A) = pk (1 − p) −k .17 THREE APPLICATIONS 194 likely to win in the horse race? There are several ways of solving these problems (a particularly elegant one uses the so called Optional Stopping Theorem for martingales). For example. if we have two patterns B and A. if A is a subpattern of B. Here. We can immediately ﬁgure out the invariant distribution for this chain. if the next tosses are 001110. but we will use Markov chains. πA The hard part of our problem is over. chosen in some way. At any time n ≥ 2 and for any pattern A with k 1’s and − k 0’s. Our initial example can. assume that the chain starts with some particular sequence of tosses. we can only use the overlap 101 to help us get back to 1011101. E(A → A) = 1 . For now. Also denote E(B → A) = E(NB→A ). π1011101 However. then NB→A = 2. starting with 1011101. this does not count. . if B = 111001 and A = 110. and the next tosses are 10. The Markov chain Xn we will utilize has as the state space all patterns of length . then NB→A = 6. although we can use a part of B. the chain transitions into the pattern obtained by appending 1 (with probability p) or 0 (with probability 1 − p) at the right end of the current pattern and by deleting the symbol at the left end of the current pattern. therefore. the invariant distribution of Xn assigns to A the probability πA = pk (1 − p) −k . the chain simply keeps track of the last symbols in a sequence of tosses. We know that E(1011101 → 1011101) = 1 . The convergence theorem for Markov chains guarantees that. so that E(1011101 → 1011101) = E(101 → 1011101). There is a slight problem before we have tosses. and. That is. We now show how to analyze the waiting game by the example. for every A. Now. Each time. denote by NB→A the expected number of additional tosses we need to get A provided that the ﬁrst tosses ended in B. as the chain is generated by independent coin tosses! Therefore. we have to actually make A in the additional tosses.

for 1 p = 2 . E(∅ → B) = E(N ) + pA · E(A → B). Example 17. the expected time E(∅ → A) can be computed by adding to 1/πA all the overlaps between A and its shifts. and compute 2 the winning probabilities in the horse race. 0). we can write N∅→A = N + I{B appears before A} NB→A . where NB→A is the additional number of tosses we need to get to A after we reach B for the ﬁrst time. that is. The more overlaps A has. so this is a system of three equations with three unknowns: pA . but the second case occurs only when B occurs before A. Then. and p = 1 . Accordingly. . the ﬁrst time one of the two appears. In words. to get to A we either stop at N or go further starting from B. the larger E(∅ → A) is. Let us return to the patterns A = 1001 and B = 0100. The trick is to consider the time N . (At the time B appears for the ﬁrst time. E(∅ → B). E(∅ → A) = E(N ) + pB · E(B → A). the overlaps are 101 and 1. pA + pB = 1. of all patterns of length . . 2 In general. 1 and 00 .) Taking expectations. π1 π101 The ﬁnal result is E(∅ → 1011101) = 1 1 1 + + π1011101 π101 π1 1 1 1 = 5 + + . pB and N . we have to get ﬁrst to 101 and then from there to 1011101. . E(B → A). for 100 . We already know how to compute E(∅ → A). what matters for NB→A is that we are at B and not whether we have seen A earlier. .17 THREE APPLICATIONS 195 To get from ∅ to 1011101. Fix two patterns A and B and let pA = P (A wins) and pB = P (B wins). In the example. so that E(∅ → 1011101) = E(∅ → 101) + E(101 → 1011101). 0) and the smallest is 2 when there is no overlap at all (for example. and E(A → B). Now that we know how to compute the expectations in the waiting game. the largest expectation is 2 + 2 −1 + · · · + 2 = 2 +1 − 2 (for constant patterns 11 . . all the patterns by which A begins and ends. .3. p (1 − p)2 p2 (1 − p) p which is equal to 27 + 23 + 2 = 138 when p = 1 . we will look at the horse race. . It is clear that NB→A has the same distribution as NB→A and is independent of the event that B appears before A. We have reduced the problem to 101 and we iterate our method: E(∅ → 101) = E(∅ → 1) + E(1 → 101) = E(∅ → 1) + E(101 → 101) = E(1 → 1) + E(101 → 101) 1 1 = + .

P (X = 1) = 8 . 8 8 8 8 and so 8 3 3 1 pn = pn−1 + pn−2 + pn−3 . then. Thus. E(∅ → B) = 18. It is straightforward to verify 9 that E(∅ → A) = 20. P (X = 2) = 3 . B = 1010.5. when A loses in the horse race. we note that E(0100 → 1001) = E(100 → 1001) and. one would expect that this relation is transitive. each somewhat paradoxical and thus illuminating. 8 8 1 P (X = 3) = 8 . Similarly. Example 17. and.5. then. Next. but wins in the horse race! What is going on? Simply.17 THREE APPLICATIONS 196 We compute E(∅ → A) = 16 + 2 = 18 and E(∅ → B) = 16 + 2 = 18. so that E(0100 → 1001) = E(∅ → 1001) − E(∅ → 100) = 18 − 8 = 10. Let pn be the probability that you ever hit n. Consider sequences A = 1010 and B = 0100. 2 3 2 4 and 3 . Naively. C = 0010. 1 3 3 1 pn = pn + pn−1 + pn−2 + pn−3 . Problems 1. E(∅ → 1001) = E(∅ → 100) + E(100 → 1001). Suppose that you have three patterns A = 0110. E(A → B) = 18 − 4 = 14. Example 17. 7 8 8 8 . given by P (X = 0) = 1 . pB = 12 . m. f. it loses by a lot. At each step. with probabilities 1 . Start at 0 and perform the following random walk on the integers. First. 6 We conclude with two examples. E(N ) = 73 . A loses in the waiting game.4. (It is not 2 3 !) 2. ﬂip 3 fair coins and make a jump forward equal to the number of Heads (you stay where you are if you ﬂip no Heads). Solutions to problems 3 1. while pA = 14 . This example concerns the horse race only. thereby tipping the waiting game towards B. but this is not true! The simplest example are triples 011 ≥ 100 ≥ 001 ≥ 011. Compute limn→∞ pn . the above 5 7 three equations with three unknowns give pA = 12 . we compute E(B → A) = E(0100 → 1001). The size S of the step has the p. So. Compute the probability that A appears ﬁrst among the three in a sequence of fair coin tosses. Consider the relation ≥ given by A ≥ B if P (A beats B) ≥ 0.

8 8 8 that is. we have E(∅ → A) = EN + pB E(B → A) + pC E(C → A) E(∅ → B) = EN + pA E(A → B) + pC E(C → B) E(∅ → C) = EN + pA E(A → C) + pB E(B → C) pA + pB + pC = 1 and E(∅ → A) = 18 E(∅ → B) = 20 E(∅ → C) = 18 E(B → A) = 16 E(C → A) = 16 E(A → B) = 16 E(C → B) = 16 E(A → C) = 16 E(B → C) = 16 3 The solution is EN = 8.17 THREE APPLICATIONS 197 It follows that pn converges to the reciprocal of 8 7 3 3 1 ·1+ ·2+ ·3 . pB = 1 . The answer is n→∞ lim pn = 7 . pA = 3 . to E(S|S > 0)−1 . 12 2. and pC = 8 . 8 4 8 . The answer is 3 . If N is the ﬁrst time one of the three appears.

such that 1. N (t) − N (s) represents the number of events in (s. in small time intervals. the number of events in any interval of length t is Poisson(λt). t] is Binomial with the number of trials λ n(t − s) ± 2 and success probability n . P (N (h) ≥ 2) = O(h2 ) λh. as n → ∞. A deﬁnition as the above should be followed by the question whether the object in question exists — we may be wishing for contradictory properties. N (t) = 1 for t ∈ (2. Nevertheless. N (t) is right-continuous. . N (t) is a nonnegative integer for each t. Construction by tossing a low-probability coin very fast. it is unique. t]. k! Thus. t1 ] (s2 . a single event happens with probability proportional to the length of the interval. as h → 0. t ≥ 0. as it makes many properties of the Poisson process almost instantly understandable. the number of Heads in any interval (s. very fast) and let Nn (t) be the number of Heads in [0. Nn has independent increments for any n and hence the same holds in the limit. and 3. it converges to Poisson(t − s). we will outline two constructions of the Poisson process. To demonstrate the existence.18 POISSON PROCESS 198 18 Poisson Process A counting process is a random process N (t). t2 ] = ∅. this is why λ is called the rate. (λt)k . N (0) = 0. Yes. N (t) is nondecreasing in t. it is very useful. Moreover. The third condition is merely a convention: if the ﬁrst two events happen at t = 2 and t = 3. k = 0. 2. 1. Thus. t]. 2. 2. . as would the proof (or even the formulation) of convergence in the ﬁrst construction we are about to give. . Clearly. as n → ∞. Moreover. thus. In particular. and N (t) = 0 for t < 2. then N (t1 ) − N (s1 ) and N (t2 ) − N (s2 ) are independent. it has independent increments: if (s1 . We should . and 3. Toss the coin at times which are positive 1 integer multiples of n (that is. P (N (h) = 1) = e−λh λh ∼ λh. then N (2) = 1. P (N (t + s) − N (s) = k) = e−λt E(N (t + s) − N (s)) = λt. Pick a large n and assume λ that you have a coin with (low) Heads probability n . 3). N (3) = 2. but it would require some probabilistic sophistication to prove this. A Poisson process with rate (or intensity) λ > 0 is a counting process N (t) such that 1. .

it suﬃces that this probability converges to λ when multiplied by n. . but we can derive its density. In fact. A typical example would be the times between consecutive buses arriving at a station. instead.1. We have P (T1 > t) = P (N (t) = 0) = e−λt . Then. Let T1 . i.. Let Sk be the time of the kth (say. Let T1 . and so on. . converges to t in probability for any ﬁxed t. Exponential(λ) random variables and let Sn = T1 + . d. we do not need all integer multiples 1 of n . . t]. Similarly.e. the distribution is called Gamma(n. where Tn is the time elapsed between (n − 1)st and nth event. Moreover. We deﬁne N (t) to be the largest n so that Sn ≤ t. Proposition 18. Nk (0) = 0. with the same rate λ. . . We know that ESn = n . be i. but not on the future). restarted at a stopping time. . . λ We start with n−1 (λt)j P (Sn > t) = P (N (t) < n) = e−λt . . Construction by exponential interarrival times. T2 . 3rd) event (which is a random time) and let Nk (t) be the number of additional events within time t after time Sk . We can use the above Proposition 18. it is enough that their number in [0. T2 .1 for another construction of a Poisson process. we can do so at any stopping time. divided by n. An example of a property that follows immediately is the following.18 POISSON PROCESS 199 λ note that the Heads probability does not need to be exactly n . we can establish that T3 is independent of T1 and T2 with the same distribution. Distribution of interarrival times: T1 . s + t] are not inﬂuenced by what happens in [0. . So. which is convenient for simulations. which proves that T1 is Exponential(λ). + Tn be the waiting time for the nth event. are independent and Exponential(λ). T2 is independent of T1 and Exponential(λ). be the interarrival times. s + t]|T1 = s) = P (N (t) = 0) = e−λt . The Poisson process. . λ). so two events in the original Poisson N (t) process do not happen at the same time. depends on the past. this is called the strong Markov property. for any s > 0 and any t > 0. Similarly. as starting to count the Heads afresh after the kth Heads gives us the same process as if we counted them from the beginning — we can restart a Poisson process at the time of the kth event. j! j=0 . s]. . As each Nk is a Poisson process. T2 . has the same properties as the original process started at time 0. a random time T with the property that T = t depends only on the behavior of the Poisson process up to time t (i. as events in (s. Proof. Nk (t) is another Poisson process. P (T2 > t|T1 = s) = P (no events in (s.

4] and 3 events in [3. (c) P (the 10th event occurs later than time 20). (n − 1)! (λt)n−1 . Superposition of independent Poisson processes. 4]) e−2λ k=0 = (2λ)2−k −λ λ3−k λk ·e · e−λ (2 − k)! (3 − k)! k! = e−4λ 1 5 1 λ + λ4 + λ3 . (n − 1)! fSn (t) = λe−λt Example 18. we condition on the number of events in [3. Compute (a) E(time of the 10th event). 3] and 3 − k events in [4. Assume that N1 (t) and N2 (t) are independent Poisson processes with rates λ1 and λ2 . 5]) · P (k events in [3. and (d) P (2 events in [1. 3 2 Theorem 18. The answer to (b) is e−2λ . so we can either write the integral ∞ P (S10 > 20) = or use λe−λt 20 9 (λt)9 dt. 4] and 3 events in [3. (b) P (the 10th event occurs 2 or more time units after the 9th event). 9! (20λ)j .2. Consider a Poisson process with rate λ. The answer to (c) is P (S10 > 20) = P (N (20) < 10). 4]: 2 P (2 events in [1. Combine them into a single process by taking the union of both sets of events or. 4]) · P (k events in [3.1. The answer to (a) is 10 by Proposition 18. 9! P (N (20) < 10) = j=0 e−20λ To answer (d). 4]) k=0 2 = k=0 2 P (2 − k events in [1. N (t) = N1 (t) + N2 (t). as one can restart λ the Poisson process at any event. This is a Poisson process with rate λ1 + λ2 . . 5] | k events in [3.1. 5]).18 POISSON PROCESS 200 and then we diﬀerentiate to get n−1 −fSn (t) = j=0 1 (−λe−λt (λt)j + e−λt j(λt)j−1 λ) j! k−1 = λe −λt j=0 − (λt)j−1 (λt)j + j! (j − 1)! = −λe−λt and so (λt)n−1 . equivalently.

for some constant C. by the Markov inequality. we toss two indepen- dent coins: coin A has Heads probability pλ . it is Type I with probability p. Moreover.3. The Type I and Type II events will then be independent as limits of independent discrete processes. an upper bound n for the probability that there is at least one double point in [0. both cannot occur at the same location). k λ One can easily compute that a location n is a discrete event with probability n . The most substantial part of this theorem is independence. and coin B has Heads probability (1−p)λ / 1 − pλ . Let N1 (t) and N2 (t) be the numbers of Type I and Type II events in [0. the remaining events are Type II. Assume that you know that exactly 10 women entered within some 2 hour (say. t]. with parameter the answer to (a) is 510 . (b) Compute the probability that at least 20 customers have entered.2. This probability n goes to zero. the process of discrete events and its division into Type I and Type II(a) events determines the discrete versions of the processes in the statement of the theorem. discrete Type II(a) events the locations with coin A Tails and coin B Heads. We argue by discrete approximation. given that a location is a discrete event. observe ﬁrst that the discrete schemes (a) and (b) k result in a diﬀerent outcome at a location n exactly when two events occur there: a discrete Type I event and a discrete Type II(b) event. Thinning of a Poisson process. consequently.4). Customers arrive at a store at a rate of 10 per hour. e−5 10! 1 2 · 10 = 5. so . n n n Then call discrete Type I events the locations with coin A Heads. and discrete Type II(b) events the locations with coin B Heads. Therefore. as n → ∞. Example 18. 1 Proof. (a) Compute the probability that exactly 10 men also entered. discrete Type I and Type II(a) events are not independent (for example. Each event in a Poisson process N (t) with rate λ is independently a Type I event with probability p. These are independent Poisson processes with rates λp and λ(1 − p). t] to be at most Ct . The proof will be concluded by showing that discrete Type II(a) and Type II(b) events have the same limit as n → ∞. This is a consequence of the same property for Poisson random variables. The probability that the two discrete Type II schemes k C diﬀer at n is thus at most n2 . Theorem 18. This causes the expected number of such “double points” in [0. At each integer multiple of n . t] is also Ct . To prove the claimed asymptotic equality.18 POISSON PROCESS 201 Proof. as the other claims follow from the thinning properties of Poisson random variables (Example 11. Male and female arrivals are independent Poisson processes. for any ﬁxed t and. 10 to 11am). A location is a discrete event if it is either a Type I or a Type II(a) event. Therefore. Now. but discrete Type I and Type II(b) events are independent (as they depend on diﬀerent coins). Each is either male or female with probability 1 . discrete (a) and (b) schemes indeed result in the same limit.

we will not make a formal statement. P (T1 + T2 > 2) = P (N (2) ≤ 1) = e−2 (1 + 2) = 3e−2 . then independently decide for each event whether it belongs to the ﬁrst process. Proposition 18.5. Assume that λ1 = 5. You have three friends. The +λ probability we are interested in is the probability that among the ﬁrst m + n − 1 events in the combined process. k λ1 + λ2 λ1 + λ2 k=n We can easily extend this idea to more than two independent Poisson processes. which is the binomial probability in the statement.18 POISSON PROCESS 202 The answer to (b) is ∞ ∞ P (k men entered) = k=10 k=10 e−5 5k =1− k! 9 e−5 k=0 5k . n or more events belong to the ﬁrst process. or to the second process. 1 hour. You will go out with the ﬁrst friend that calls. Assume that we have two independent Poisson processes. Example 18. Proof. and C. 66 Example 18. with probability λ1λ1 2 . A. and 2. k! Example 18. Then. What is the probability that you go out with A? .4.4. but instead illustrate by the few examples below. N1 (t) with rate λ1 and N2 (t) with rate λ2 . Assume that cars arrive at a rate of 10 per hour. respectively. Each will call you after an Exponential amount of time with expectation 30 minutes. Assume that each car will 1 pick up a hitchhiker with probability 10 . For this process. What is the probability that you will have to wait for more than 2 hours? Cars that pick up hitchhikers are a Poisson process with rate 10 · 1 10 = 1. P (5 events in the ﬁrst process before 1 in the second) = and 6 5 6 k 5 P (5 events in the ﬁrst process before 2 in the second) = k=5 6 k 5 6 1 6 6−k = 11 · 55 . The probability that n events occur in the ﬁrst process before m events occur in the second process is n+m−1 k n+m−1−k n+m−1 λ1 λ2 .3. λ2 = 1. Start with a Poisson process with λ1 + λ2 . B. with +λ probability λ1λ2 2 . The obtained processes are independent and have the correct rates. Order of events in independent Poisson processes.5 hours. You are second in line.

1. Sk } is equal. . Sk−1 are distributed as order statistics of k − 1 independent uniform random variables on [0. Proof. and 2 . these Heads occur in any of the N subsets (of k k tosses out of a total of N ) with equal probability.) We will solve the general problem with rates λ1 . . T ]. EW = and Var(W ) = λT 3 3 . 2. . S1 . . λ2 . t]. in the appropriate limit. then W is the sum N (T ) of k i. to {U1 . ..18 POISSON PROCESS 203 We could evaluate the triple integral. . . . Let W be the combined waiting time for all passengers. 17 2+1+ 2 5 Our next theorem illustrates what we can say about previous event times if we either know that their number by time t is k or we know that the kth one happens exactly at time t. . uniform on [0. t]. i. This is exactly the statement of the theorem.5. where Ui are independent and uniform on [0. Example 18. .6. Hence. d. i. Recall that we denote by N (t) the number of arrivals in [0. d. in distribution. t]. (Recall that the rates are inverses of the expectations. assuming the 5 time unit to be one hour. and λ3 . Compute EW . 1. T ]. Theorem 18. 2 2 λ1 respectively. uniform random variables on [0. Given that Sk = t. Uk }. . which in our case works out to be 2 10 = . . (a) The only bus departs after a deterministic time T . T ] and independent of N (T ). λT 2 2 (b) Now two buses depart. the conditional distribution of the interarrival times. we discretize and the discrete counterpart is as follows. 1. . What is EW now? . . then the combined waiting time is W = T − S1 + T − S3 + . are i. . N times in succession and that we know that the number of Heads in these N tosses is k. is distributed as order statistics of k independent uniform variables: the set {S1 . We obtain the answer by conditioning on the value of N (T ): if we know that N (T ) = k. . but we will avoid that. Start with rate λ1 + λ2 + λ1 λ λ λ3 Poisson process. S2 . distribute the events with probability λ1 +λ2 +λ3 . Again. with arbitrary ﬁxed Heads probability. If S1 . λ1 +λ2 +λ3 . . By Theorem 11. Assume that we toss a coin. Then. S1 . Given that N (t) = k. where U1 . are the arrival times in [0. and λ1 +λ3 +λ3 . . . . Uniformity of previous event times. . Interpret each call as the ﬁrst event in the appropriate one of three Poisson processes with rates 2. . one at T and one at S < T . Passengers arrive at a bus station as a Poisson process with rate λ. W = k=1 Uk . . . The probability of A calling ﬁrst is clearly λ1 +λ2 +λ3 . Sk . U2 . simply by symmetry.

We could compute this via a double integral (which is a good exercise!). This makes EW = 2 λ . but this restarted process is not inﬂuenced by what happened before T . Under this condition. T1 < T2 + T ) = P (T1 < T ) + P (T1 < T2 + T |T1 ≥ T )P (T1 ≥ T ) = 1 − e−λ1 T + P (T1 − T < T2 |T1 ≥ T )e−λ1 T = 1 − e−λ1 T + P (T1 < T2 )e−λ1 T λ1 e−λ1 T . T1 ≥ T means that no events in the ﬁrst process occur during time [0. S] and [S. You have two machines. Then. and Machine 2 has lifetime T2 . Machine 1 has lifetime T1 . is Exponential(µ).18 POISSON PROCESS 204 We have two independent Poisson processes in time intervals [0. Machine 1 starts at time 0 and Machine 2 starts at a time T . . 2 2 (c) Now assume T . Embed the failure times into appropriate Poisson processes. so the answer is λ S2 (T − S)2 +λ . which is Exponential(λ2 ). Compute the probability that M1 is the ﬁrst to fail. = 1 − e−λ1 T + λ1 + λ2 The key observation above is that P (T1 − T < T2 |T1 ≥ T ) = P (T1 < T2 ). (a) Assume that T is deterministic. ∞ ∞ EW = 0 E(W |T = t)fT (t) dt = λ 0 t2 λ fT (t) dt = E(T 2 ) 2 2 λ λ λ 2 = 2. = (Var(T ) + (ET )2 ) = 2 2 2µ µ (d) Finally. T ]. which is Exponential(λ1 ). Why does this hold? We can simply quote the memoryless property of the Exponential distribution. the only bus departure time. so the condition (which in addition does not inﬂuence T2 ) drops out. T1 − T is the time of the ﬁrst event of the same process restarted at T . but it is instructive to make a short argument using Poisson processes. This time. two buses depart as the ﬁrst two events in a rate µ Poisson process. T ]. µ2 Example 18. but instead we proceed thus: P (T1 < T2 + T ) = P (T1 < T ) + P (T1 ≥ T.7. independent of the passengers’ arrivals.

there is: − P (Bob gets picked) = P (ﬁrst event is either A or C.8. . Alice is ﬁrst in line for a ride. T1 < T2 + T ) λ1 λ1 µ = . of course. Alice quits. Compute the probability that Alice is picked up before she quits and compute the same for Bob. Alice and Bob. and call the car arrivals C process. after Exponential(λA ) time. and then the next event among B and C is C). Alice gets picked if the ﬁrst event in the combined A and C process is a C event: P (Alice gets picked) = Moreover.18 POISSON PROCESS 205 (b) Answer the same question when T is Exponential(µ) (and. call these A and B processes. and then at least one C event before a B event}) = P (at least 2 C events before a B event) + P (at least one A event before either a B or a C event. Clearly. Moreover. are hitchhiking. independent of the machines). Impatient hitchhikers. P (T1 < T2 + T ) = P (T1 < T ) + P (T1 ≥ T. λA + λC λC λB + λC 2 λA λC λA + λB + λC λB + λC λA + λC λC = · . λA + λB + λC λB + λC This leaves us with an excellent hint that there may be a shorter way and. Now. and after Exponential(λB ) time. + λ1 + µ λ1 + λ2 λ1 + λ2 Example 18. Embed each quitting time into an appropriate Poisson process. and then at least two C events before a B event) = 2 λC λB + λC λA + λA + λB + λC λC . Cars that would pick up a hitchhiker arrive as a Poisson process with rate λC . and then at least one C event before a B event) − P (at least one A event before either a B or a C event. by the same logic. indeed. Bob quits. P (Bob gets picked) = P ({at least 2 C events before a B event} ∪ {at least one A event before either a B or a C event. Two people.

A machine needs frequent maintenance to stay on. Three people.18 POISSON PROCESS 206 Problems 1. When you arrive at the wash there is a single car at station 1. Compute the expected time before you exit. (a) Assume that a single event occurring in the service period will cause the system to crash. a computer). 1 and 2. That unit now has to be removed from the system for 10 minutes for service. after which it goes back on. while C waits for the ﬁrst available clerk. . It then needs to be repaired. ﬁnd the probability that the machine will break down before receiving its ﬁrst maintenance. B. (c) Find the proportion of time the machine is on. the probability that the system will crash? (c) Assume that a crash will not happen unless there are two events within 5 minutes of each other. (d) Solve (b) by assuming that the protective unit will be out of the system for a time which is exponentially distributed with expectation 10 minutes. the car proceeds to station 2. (a) What proportion of customers complete their service? (b) What proportion of customers stay in the system for more than 1 time unit. A system has two server stations. Assume that the service time is Exponential(λ). What is. but two events occurring in the service period will cause it to crash. A. What is the probability that the system will crash? (b) Assume that the system will survive a single event. then goes to station 2. The maintenance times occur as a Poisson process with rate µ. but do not complete the service? 4.. Assume that certain events (say. (b) Compute the expected time before C is ﬁnished (i. An oﬃce has two clerks. which takes an Exponential(λ) time. 5. Once the machine receives no maintenance for a time interval of length h. any customer in the system immediately departs. Upon completing the service at station 1. A car enters at station 1. with Exponential(λ1 ) and Exponential(λ2 ) service times. (a) After the machine is started. power surges) occur as a Poisson process with rate 3 per hour. provided station 2 is free. the car has to wait at station 1. 1 and 2. Whenever a new customer arrives.e. A car wash has two stations. it breaks down. with Exponential(λ1 ) and Exponential(λ2 ) service times. (a) Compute the probability that A is the last to ﬁnish the service. and a new arrival enters the system at station 1. 3. (b) Find the expected time for the ﬁrst breakdown. These events cause damage to a certain system (say. 2. and C enter simultaneously. A and B begin service with the two clerks. blocking the entrance of other cars. Customer arrivals are a rate µ Poisson process. otherwise. Compute the probability that the system will crash. C’s combined waiting and service time). a special protecting unit has been designed. now. thus. The car exits the wash after the service at station 2 is completed.

T2 } is Exponential(λ1 + λ2 ). which has probability e−(µ+λ1 ) . the probability of the second case is e−µ µ λ2 + µ 1 0 µ . T2 } and that min{T1 . the answer is 2λ + λ = 2λ . Then. λ1 + µ λ2 + µ (b) Let T1 and T2 be the customer’s times at stations 1 and 2. For the second case. and then for the service 1 1 3 time. (a) A customer needs to complete the service at both stations before a new one arrives. Your total time is (the time the other car spends at station 1) + (the time you spend at station 2)+(maximum of the time the other car spends at station 2 and the time you spend at station 1). The event will happen if either: • T1 > 1. which has probability µ λ1 +µ . conditioned on T1 = t < 1. or • T1 < 1. Now use that max{T1 . thus the answer is λ1 λ2 · . T1 + T2 > 1. If T1 and T2 are Exponential(λ1 ) and Exponential(λ2 ). T2 } = T1 + T2 − min{T1 . T1 ]. T2 }). that is. a newcomer has to appear before the service time at station 1. 2. no newcomers during time 1. T1 + T2 ]. which is a single process with rate 2λ.18 POISSON PROCESS 207 Solutions to problems 1. λ1 λ2 λ1 + λ2 3. (b) First. 1 . then you need to compute E(T1 ) + E(T2 ) + E(max{T1 . to get 2 2 1 + − . λ2 + µ λ2 − λ1 . λ2 + µ e−λ2 (1−t) λ1 e−λ1 t dt = e−µ λ1 µ −λ2 eλ2 −λ1 − 1 e . C has to wait for the ﬁrst event in 4 two combined Poisson processes. the probability is e−µ e−λ2 (1−t) Therefore. nothing will happen by time 1. and a newcomer during time [1. For the ﬁrst case. no newcomers by time 1. and a newcomer during time [1. (a) This is the probability that two events happen in a rate λ Poisson process before a single event in an independent rate λ process. after time 1.

The answer to (a) is 1 P (N (1/6) ≥ 1) = 1 − e− 2 . Therefore. e−µh The answer to (b) is EW + h (the machine waits for h more units before it breaks down). where U1 and U2 are independent uniform random variables on [0. (a) The answer is e−µh . and to (b) 3 1 P (N (1/6) ≥ 2) = 1 − e− 2 . Assume the time unit is 10 minutes. E(W |T1 = t) = t + EW. λ1 + µ λ2 + µ λ2 − λ1 4. By drawing a picture. there will be no crash. provided t < h. The answer is e−(µ+λ1 ) µ λ1 µ −(µ+λ2 ) eλ2 −λ1 − 1 + e . (b) Let W be the waiting time for maintenance such that the next maintenance is at least time h in the future. 2 For (c). this probability can be computed to be 3 . Then. but 3 or more events in the 10 minutes will cause a crash. h h 0 h 0 EW = 0 (t + EW )µ e−µt dt = tµ e−µt dt + EW µ e−µt dt. Computing the two integrals and solving for EW gives EW = 1 − µhe−µh − e−µh . 1]. The answer to (c) is EW + h 1. 4 P (crash) = P (X > 2) + P (crash|X = 2)P (X = 2) = 1−e = 1− −1 2 1 1 1 1 − e− 2 − e− 2 2 2 2 2 49 − 1 e 2 32 3 + · 4 1 2 −1 2 2 2 e . as the process is restarted at time t. in which case the crash will happen with probability P |U1 − U2 | < 1 2 . Therefore. The ﬁnal possibility is exactly two events. 1 6 of an hour. and let T1 be the time of the ﬁrst maintenance.18 POISSON PROCESS 208 where the last factor is 1 when λ1 = λ2 . if there are 0 or 1 events in the 10 minutes. EW + h + λ 5.

we need to calculate the probability that two events in a rate 3 Poisson process occur before an event occurs in a rate 6 Poisson process. 9 . for (d).18 POISSON PROCESS 209 Finally. This probability is 3 3+6 2 1 = .

each with probability 1 . You ﬁgure you will receive a warning after each game you play independently with probability p ∈ (0. After you receive two warnings (not necessarily in consecutive games). 2. Write down the transition matrix and determine whether this chain is irreducible and aperiodic. 1. U2 . A random walker on the nonnegative integers starts at 0 and then at each step adds either 2 or 3 to her position. the rules are the same as at the beginning. 1] (uniformly). you are suspended for the next game and your warnings count goes back to zero. Compute limn→∞ sn . A random walker is at one of the six vertices. (b) Compute the expected sum S of the numbers your friend selects. which you may assume have equal point score. in [0. 4. then he stops. 1). (a) Compute the expected number N of times your friend selects a number. 4. 3. You are a member of a sports team and your coach has instituted the following policy. After the suspension. he gives you a warning. Your friend then keeps selecting random numbers U1 . Give yourself 120 minutes to solve the six problems. At each time. (a) Let the state of the Markov chain be your warning count after a game. . if so. (b) Write down an expression for the probability that you are suspended for both games 10 and 15. Do not evaluate. 1.) 1 0 2 5 3 4 . 2. 3. . Compute its invariant distribution. 2 (a) Compute the probability that the walker is at 2n + 3 after making n steps. and 5. (c) Let sn be the probability that you are suspended in the nth game. You select a random number X in [0. (b) Let pn be the probability that the walker ever hits n. After every game the coach evaluates whether you’ve had a discipline problem during the game. Compute limn→∞ pn . . (All choices are equally likely and she never stays at the same position for two successive steps. she moves to a randomly chosen vertex connected to her current position by an edge. You begin with zero warnings. of the graph in the picture. 1] (uniformly and independently) until he gets a number larger than X/2. labeled 0.18 POISSON PROCESS 210 Interlude: Practice Final This practice exam covers the material from chapters 9 through 18.

Customers arrive at two service stations. Does this proportion depend on the walker’s starting vertex? (b) Compute the proportion of time the walker is at an odd state (1. any previous customer is immediately ejected from the system. 6. . previously. (a) Assume that the service time at each station is exactly 2 hours. or 4). The process starts with a single individual in generation 0. or 5) while. 3.18 POISSON PROCESS 211 (a) Compute the proportion of time the walker spends at 0. after she makes many steps. In a branching process. then goes to station 2. as a Poisson process with rate λ. Compute the probability that this special customer is allowed to complete his service. A customer arrives. (b) Compute the probability that the process ever becomes extinct. What proportion of entering customers will now complete the service? (c) Keep the service time assumption from (b). an individual has two descendants with probability 3 and no descen4 dants with probability 1 . A new arrival enters the service at station 1. she was at even state (0. 4 (a) Compute the expected number of individuals in generation 2. labeled 1 and 2. (c) Now assume that the walker starts at 0. What is expected number of steps she will take before she is back at 0? 5. Whenever a new customer arrives. but he is now given special treatment: he will not be ejected unless at least three or more new customers arrive during his service. 2. What proportion of entering customers will complete the service (before they are ejected)? (b) Assume that the service time at each station now is exponential with expectation 2 hours. Assume that the time unit is one hour.

2 E[S|X = x.18 POISSON PROCESS 212 Solutions to Practice Final 1. 1] (uniformly and independently) until he gets a number larger than X/2. 1− x 2 EN = 0 dx 1− x 2 = −2 log(1 − = 2 log 2. he gives you a warning. You are a member of a sports team and your coach has instituted the following policy. (a) Compute the expected number N of times your friend selects a number. N = n] = (n − 1) E[S|X = x] = 1 1 1 · 4 1− x 2 1 x 1 x 1 + 1+ = nx + . if so. N is distributed geometrically with success probability 1 − x . . x 1 )| 2 0 (b) Compute the expected sum S of the numbers your friend selects. so 2 E[N |X = x] = and so. Solution: Given X = x. You begin with zero warnings. you are suspended for the next game and your warnings count goes back to zero. You select a random number X in [0. 1 1 . Your friend then keeps selecting random numbers U1 . . 2 2−x ES = 0 1 dx = log 2. x ] and 2 one number uniformly in [ x . U2 . 1). After every game the coach evaluates whether you’ve had a discipline problem during the game. then he stops. After you receive two warnings (not necessarily in consecutive games). 2−x 2. . Solution: Given X = x and N = n. . After the suspension. You ﬁgure you will receive a warning after each game you play independently with probability p ∈ (0. Therefore. 4 2 2 4 2 1 1 ·x+ = . 1]. the rules are the same as at the beginning. your friend selects n − 1 numbers uniformly in [0. 1] (uniformly). in [0.

n→∞ lim sn = lim P (Xn−1 = 2) = n→∞ p . . Solution: The transition matrix is 1−p p 0 1 − p p . P = 0 1 0 0 and the chain is clearly irreducible (the transitions 0 → 1 → 2 → 0 happen with positive probability) and aperiodic (0 → 0 happens with positive probability). (c) Let sn be the probability that you are suspended in the nth game. p+2 p+2 p+2 (b) Write down an expression for the probability that you are suspended for both games 10 and 15. The invariant distribution is given by π0 (1 − p) + π2 = π0 π0 p + π1 (1 − p) = π1 π1 p = π2 and π0 + π1 + π2 = 1. which gives π= 1 1 p . Solution: You must have 2 warnings after game 9 and then again 2 warnings after game 14: 9 4 P02 · P02 . Compute limn→∞ sn . . Compute its invariant distribution. Write down the transition matrix and determine whether this chain is irreducible and aperiodic. Solution: As the chain is irreducible and aperiodic. 2+p .18 POISSON PROCESS 213 (a) Let the state of the Markov chain be your warning count after a game. Do not evaluate.

3.18 POISSON PROCESS 214 3.) 1 0 2 5 3 4 (a) Compute the proportion of time the walker spends at 0. 4. At each time. labeled 0. each with probability 1 . where [π0 . after she makes many steps. of the graph in the picture. 1. A random walker on the nonnegative integers starts at 0 and then at each step adds either 2 or 3 to her position. 3 2n (b) Let pn be the probability that the walker ever hits n. Compute limn→∞ pn . Solution: The step distribution is aperiodic. 5 ·3 4. the proportion is π0 . (All choices are equally likely and she never stays at the same position for two successive steps. so n→∞ lim pn = 1 2 1 ·2+ 1 2 2 = . Does this proportion depend on the walker’s starting vertex? Solution: Independently of the starting vertex. π2 . and 5. π3 . Solution: The walker has to make (n − 3) 2-steps and 3 3-steps. A random walker is at one of six vertices. as the greatest common divisor of 2 and 3 is 1. (Unique because of irreducibility.) This chain . she moves to a randomly chosen vertex connected to her current position by an edge. π4 . so the answer is n 1 . π5 ] is the unique invariant distribution. 2. π1 . 2 (a) Compute the probability that the walker is at 2n + 3 after making n steps.

2. an individual has two descendants with probability 4 and no 1 descendants with probability 4 . 3. The process starts with a single individual in generation 0.18 POISSON PROCESS is reversible with the invariant distribution given by answer is 2 . she was at even state (0. 4 (b) Compute the probability that the process ever goes extinct. 1]. 2. 2. 3. Solution: As µ=2· the answer is 3 1 3 +0· = . 215 Therefore. 2. or 5) while. 7 4 7 2 14 3 14 (c) Now assume that the walker starts at 0. (a) Compute the expected number of individuals in generation 2. 7 1 14 [4. What is expected number of steps she will take before she is back at 0? Solution: The answer is 7 1 = π0 2 3 5. previously. or 4). 4 4 2 9 µ2 = . Solution: As φ(s) = 1 3 2 + s 4 4 . the (b) Compute the proportion of times the walker is at an odd state (1. Solution: The answer is π0 (p03 + p05 ) + π2 · p21 + π4 (p43 + p41 ) = 2 2 1 1 3 2 5 · + · + · = . In a branching process.

(a) Assume that the service time at each station is exactly 2 hours. (1 + 2λ)2 = 1 2 1 2 +λ (c) Keep the service time assumption from (b). What proportion of entering customers will now complete the service? Solution: Now. The answer is 1 π0 = . P (customer served) = P (2 or more arrivals in rate 2 1 Poisson process 2 before one arrival in rate λ Poisson process) = 1 . any previous customer is immediately ejected from the system. (3s − 1)(s − 1) = 0. A customer arrives. Solution: The special customer is served exactly when 2 or more arrivals in rate 1 2 Poisson .18 POISSON PROCESS 216 the solution to φ(s) = s is given by 3s2 − 4s + 1 = 0. i. but he is now given special treatment: he will not be ejected unless at least three or more new customers arrive during his service. (b) Assume that the service time at each station now is exponential with expectation 2 hours. Customers arrive at two service stations. Assume that the time unit is one hour. A new arrival enters the service at station 1. labeled 1 and 2. then goes to station 2. 3 6. Whenever a new customer arrives.e. Compute the probability that this special customer is allowed to complete his service. What proportion of entering customers will complete the service (before they are ejected)? Solution: The answer is P (customer served) = P (no arrival in 4 hours) = e−4λ .. as a Poisson process with rate λ.

among the ﬁrst 1 4 arrivals in rate λ+ 2 Poisson process.18 POISSON PROCESS 217 before three arrivals in rate λ Poisson process happen. 1 4 2 . Equivalently. 2 The answer is 1− λ λ+ 4 1 2 −4 λ λ+ 3 1 2 · 1 2 λ+ 1 2 =1− λ4 + 2λ3 λ+ . 2 or more belong to the rate 1 Poisson process.

Test Patterns

LESCO-GEPCO

Electronic Engineering Booklist

Tanha Tanha by Ahmed Faraz

Chain Codes

Project Report

Antenna Azimuth Position Control System Project Verification

Electrical Machines

(eBook) Radio Shack - Mini-Notebook - Formulas Tables Basic Circuits _2

DIP_Slides 2 Image Fundamentals_ Umer Javed

DIP_Slides 1 Introduction_ Umer Javed

Electrical Protection and Switchgear

Cover Letter

GCC Recruiting Flyer

Plc Program and Hm i Modules

Hifz Test Manual Vol+3.668-704

Muslim Scientist Notes (Male)

Final Project Presentation

FYP Report Starting 8 Pages

ProjectFinalSolution (1)

PIEAS Sample Test Paper for MS Engineers and Scientists

Control Lab 3..

Control Lab 2

Chain Codes

Antenna Azimuth Position Control System Verification

- Read and print without ads
- Download to keep your version
- Edit, email or read offline

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

CANCEL

OK

You've been reading!

NO, THANKS

OK

scribd

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->