
Class Notes for Engineering Students

Free Research and Education Documents

Fall 2010

Contents

1 Mathematical Review
  1.1 Elementary Set Operations
  1.2 Additional Rules and Properties
  1.3 Cartesian Products
  1.4 Set Theory and Probability

2 Basic Concepts of Probability
  2.1 Sample Spaces and Events
  2.2 Probability Laws
    2.2.1 Finite Sample Spaces
    2.2.2 Countably Infinite Models
    2.2.3 Uncountably Infinite Models
    2.2.4 Probability and Measure Theory*

3 Conditional Probability
  3.1 Conditioning on Events
  3.2 The Total Probability Theorem
  3.3 Bayes' Rule
  3.4 Independence
    3.4.1 Independence of Multiple Events
    3.4.2 Conditional Independence
  3.5 Equivalent Notations

4 Equiprobable Outcomes and Combinatorics
  4.1 The Counting Principle
  4.2 Permutations
    4.2.1 Stirling's Formula*
    4.2.2 k-Permutations
  4.3 Combinations
  4.4 Partitions
  4.5 Combinatorial Examples

5 Discrete Random Variables
  5.1 Probability Mass Functions
  5.2 Important Discrete Random Variables
    5.2.1 Bernoulli Random Variables
    5.2.2 Binomial Random Variables
    5.2.3 Poisson Random Variables
    5.2.4 Geometric Random Variables
    5.2.5 Discrete Uniform Random Variables
  5.3 Functions of Random Variables

6 Meeting Expectations
  6.1 Expected Values
  6.2 Functions and Expectations
    6.2.1 The Mean
    6.2.2 The Variance
    6.2.3 Affine Functions
  6.3 Moments
  6.4 Ordinary Generating Functions

7 Multiple Discrete Random Variables
  7.1 Joint Probability Mass Functions
  7.2 Functions and Expectations
  7.3 Conditional Random Variables
    7.3.1 Conditioning on Events
  7.4 Conditional Expectations
  7.5 Independence
    7.5.1 Sums of Independent Random Variables
  7.6 Numerous Random Variables

8 Continuous Random Variables
  8.1 Cumulative Distribution Functions
    8.1.1 Discrete Random Variables
    8.1.2 Continuous Random Variables
    8.1.3 Mixed Random Variables*
  8.2 Probability Density Functions
  8.3 Important Distributions
    8.3.1 The Uniform Distribution
    8.3.2 The Gaussian (Normal) Random Variable
    8.3.3 The Exponential Distribution
  8.4 Additional Distributions
    8.4.1 The Gamma Distribution
    8.4.2 The Rayleigh Distribution
    8.4.3 The Laplace Distribution
    8.4.4 The Cauchy Distribution

9 Functions and Derived Distributions
  9.1 Monotone Functions
  9.2 Differentiable Functions
  9.3 Generating Random Variables
    9.3.1 Continuous Random Variables
    9.3.2 Discrete Random Variables

10 Expectations and Bounds
  10.1 Expectations Revisited
  10.2 Moment Generating Functions
  10.3 Important Inequalities
    10.3.1 The Markov Inequality
    10.3.2 The Chebyshev Inequality
    10.3.3 The Chernoff Bound
    10.3.4 Jensen's Inequality

11 Multiple Continuous Random Variables
  11.1 Joint Cumulative Distributions
  11.2 Conditional Probability Distributions
    11.2.1 Conditioning on Values
    11.2.2 Conditional Expectation
    11.2.3 Derived Distributions
  11.3 Independence
    11.3.1 Sums of Continuous Random Variables

12 Convergence, Sequences and Limit Theorems
  12.1 Types of Convergence
    12.1.1 Convergence in Probability
    12.1.2 Mean Square Convergence
    12.1.3 Convergence in Distribution
  12.2 The Law of Large Numbers
    12.2.1 Heavy-Tailed Distributions*
  12.3 The Central Limit Theorem
    12.3.1 Normal Approximation

Chapter 1

Mathematical Review

Set theory is now generally accepted as the foundation of modern mathemat-

ics, and it plays an instrumental role in the treatment of probability. Un-

fortunately, a simple description of set theory can lead to paradoxes, while a

rigorous axiomatic approach is quite tedious. In these notes, we circumvent

these diﬃculties and assume that the meaning of a set as a collection of objects

is intuitively clear. This standpoint is known as naive set theory. We proceed

to deﬁne relevant notation and operations.

Figure 1.1: This is an illustration of a generic set and its elements.

A set is a collection of objects, which are called the elements of the set. If an element x belongs to a set S, we express this fact by writing x ∈ S. If x does not belong to S, we write x ∉ S. We use the equality symbol to denote logical identity. For instance, x = y means that x and y are symbols denoting the same object. Similarly, the equation S = T states that S and T are two symbols for the same set. In particular, the sets S and T contain precisely the same elements. If x and y are different objects, then we write x ≠ y. Also, we can express the fact that S and T are different sets by writing S ≠ T.

A set S is a subset of T if every element of S is also contained in T. We express this relation by writing S ⊂ T. Note that this definition does not require S to be different from T. In fact, S = T if and only if S ⊂ T and T ⊂ S. If S ⊂ T and S is different from T, then S is a proper subset of T, which we indicate by S ⊊ T.

There are many ways to specify a set. If the set contains only a few elements, one can simply list the objects in the set,

S = {x_1, x_2, x_3}.

The content of a set can also be enumerated whenever S has a countable number of elements,

S = {x_1, x_2, . . .}.

Usually, the way to specify a set is to take some collection T of objects and

some property that elements of T may or may not possess, and to form the

set consisting of all elements of T having that property. For example, starting

with the integers Z, we can form the subset S consisting of all even numbers,

S = {x ∈ Z|x is an even number}.

More generally, we denote the set of all elements that satisfy a certain property

P by S = {x|x satisﬁes P}. The braces are to be read as “the set of” while

the symbol | stands for the words “such that.”

It is convenient to introduce two special sets. The empty set, denoted by

∅, is a set that contains no elements. The universal set is the collection of all

objects of interest in a particular context, and it is represented by Ω. Once a

universal set Ω is speciﬁed, we need only consider sets that are subsets of Ω. In

the context of probability, Ω is often called the sample space. The complement

of a set S, relative to the universal set Ω, is the collection of all objects in Ω

that do not belong to S,

S^c = {x ∈ Ω | x ∉ S}.

For example, we have Ω^c = ∅.


1.1 Elementary Set Operations

The ﬁeld of probability makes extensive use of set theory. Below, we review

elementary set operations, and we establish basic terminology and notation.

Consider two sets, S and T.

Figure 1.2: This is an abstract representation of two sets, S and T. Each contains the elements in its shaded circle and the overlap corresponds to elements that are in both sets.

The union of sets S and T is the collection of all elements that belong to

S or T (or both), and it is denoted by S ∪ T. Formally, we deﬁne the union

of these two sets by S ∪ T = {x|x ∈ S or x ∈ T}.

Figure 1.3: The union of sets S and T consists of all elements that are contained in S or T.

The intersection of sets S and T is the collection of all elements that belong

to S and T. It is denoted by S ∩ T, and it can be expressed mathematically

as S ∩ T = {x|x ∈ S and x ∈ T}.

When S and T have no elements in common, we write S ∩ T = ∅. We

also express this fact by saying that S and T are disjoint. More generally, a

collection of sets is said to be disjoint if no two sets have a common element.


Figure 1.4: The intersection of sets S and T only contains elements that are both in S and T.

A collection of sets is said to form a partition of S if the sets in the collection

are disjoint and their union is S.

Figure 1.5: A partition of S is a collection of sets that are disjoint and whose

union is S.

The difference of two sets, denoted by S − T, is defined as the set consisting of those elements of S that are not in T, S − T = {x | x ∈ S and x ∉ T}. This set is sometimes called the complement of T relative to S, or the complement of T in S.

So far, we have looked at the definition of the union and the intersection of two sets. We can also form the union or the intersection of arbitrarily many sets. This is defined in a straightforward way,

⋃_{α∈I} S_α = {x | x ∈ S_α for some α ∈ I}
⋂_{α∈I} S_α = {x | x ∈ S_α for all α ∈ I}.

The index set I can be finite or infinite.


Figure 1.6: The complement of T relative to S contains all the elements of S that are not in T.

1.2 Additional Rules and Properties

Given a collection of sets, it is possible to form new ones by applying elemen-

tary set operations to them. As in algebra, one uses parentheses to indicate

precedence. For instance, R ∪ (S ∩ T) denotes the union of two sets R and

S ∩ T, whereas (R ∪ S) ∩ T represents the intersection of two sets R ∪ S and

T. The sets thus formed are quite diﬀerent.

Figure 1.7: The order of set operations is important; parentheses should be employed to specify precedence.

Sometimes, different combinations of operations lead to the same set. For instance, we have the following distributive laws:

R ∩ (S ∪ T) = (R ∩ S) ∪ (R ∩ T)
R ∪ (S ∩ T) = (R ∪ S) ∩ (R ∪ T).

Two particularly useful equivalent combinations of operations are given by De Morgan's laws, which state that

R − (S ∪ T) = (R − S) ∩ (R − T)
R − (S ∩ T) = (R − S) ∪ (R − T).

These two laws can be generalized to

(⋃_{α∈I} S_α)^c = ⋂_{α∈I} S_α^c
(⋂_{α∈I} S_α)^c = ⋃_{α∈I} S_α^c

when multiple sets are involved. To establish the first equality, suppose that x belongs to (⋃_{α∈I} S_α)^c. Then x is not contained in ⋃_{α∈I} S_α. That is, x is not an element of S_α for any α ∈ I. This implies that x belongs to S_α^c for all α ∈ I, and therefore x ∈ ⋂_{α∈I} S_α^c. We have shown that (⋃_{α∈I} S_α)^c ⊂ ⋂_{α∈I} S_α^c. The converse inclusion is obtained by reversing the above argument. The second law can be obtained in a similar fashion.
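As a quick sanity check, the generalized laws can be verified numerically on small finite sets. The sketch below uses Python's built-in set type; the universal set Ω and the indexed family of sets are arbitrary choices made purely for illustration.

```python
# Sanity check of the generalized De Morgan laws with Python sets.
# Omega and the family below are arbitrary illustrative choices.
Omega = set(range(10))
family = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7}]

def complement(s):
    """Complement relative to the universal set Omega."""
    return Omega - s

union_all = set().union(*family)
inter_all = Omega.intersection(*family)

# (union of S_alpha)^c == intersection of the complements
assert complement(union_all) == Omega.intersection(*(complement(s) for s in family))
# (intersection of S_alpha)^c == union of the complements
assert complement(inter_all) == set().union(*(complement(s) for s in family))
print("De Morgan's laws hold for this family of sets.")
```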

1.3 Cartesian Products

There is yet another way to create new sets from existing ones. It involves the

notion of an ordered pair of objects. Given sets S and T, the cartesian product

S ×T is the set of all ordered pairs (x, y) for which x is an element of S and

y is an element of T, S ×T = {(x, y)|x ∈ S and y ∈ T}.

Figure 1.8: Cartesian products can be used to create new sets. In this example, the sets {1, 2, 3} and {a, b} are employed to create a cartesian product with six elements.


1.4 Set Theory and Probability

Set theory provides a rigorous foundation for modern probability and its ax-

iomatic basis. It is employed to describe the laws of probability, give meaning

to their implications and answer practical questions. Being familiar with ba-

sic deﬁnitions and set operations is key in understanding the subtleties of

probability; it helps overcome its many challenges. A working knowledge of

set theory becomes critical when modeling measured quantities and evolving

processes that appear random, an invaluable skill for engineers.

Further Reading

1. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 1.1.

2. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Section 1.2.

Chapter 2

Basic Concepts of Probability

The theory of probability provides the mathematical tools necessary to study

and analyze uncertain phenomena that occur in nature. It establishes a formal

framework to understand and predict the outcome of a random experiment. It

can be used to model complex systems and characterize stochastic processes.

This is instrumental in designing eﬃcient solutions to many engineering prob-

lems. Two components deﬁne a probabilistic model, a sample space and a

probability law.

2.1 Sample Spaces and Events

In the context of probability, an experiment is a random occurrence that pro-

duces one of several outcomes. The set of all possible outcomes is called the

sample space of the experiment, and it is denoted by Ω. An admissible subset

of the sample space is called an event.

Example 1. The rolling of a die forms a common experiment. A sample

space Ω corresponding to this experiment is given by the six faces of a die.

The set of prime numbers less than or equal to six, namely {2, 3, 5}, is one of

many possible events. The actual number observed after rolling the die is the

outcome of the experiment.

Figure 2.1: A sample space contains all the possible outcomes; an admissible subset of the sample space is called an event.

Figure 2.2: A possible sample space for the rolling of a die is Ω = {1, 2, . . . , 6}, and the subset {2, 3, 5} forms a specific event.

There is essentially no restriction on what constitutes an experiment. The flipping of a coin, the flipping of n coins, and the tossing of an infinite sequence of coins are all random experiments. Also, two similar experiments may have different sample spaces. A sample space Ω for observing the number of heads in n tosses of a coin is {0, 1, . . . , n}; however, when describing the complete history of the n coin tosses, the sample space becomes much larger, with 2^n distinct sequences of heads and tails. Ultimately, the choice of a particular sample space depends on the properties one wishes to analyze. Yet some rules must be followed in selecting a sample space.

1. The elements of a sample space should be distinct and mutually exclusive.

This ensures that the outcome of an experiment is unique.

2. A sample space must be collectively exhaustive. That is, every possible

outcome of the experiment must be accounted for in the sample space.

In general, a sample space should be precise enough to distinguish between all

outcomes of interest, while avoiding frivolous details.


Example 2. Consider the space composed of the odd integers located between

one and six, the even integers contained between one and six, and the prime

numbers less than or equal to six. This collection cannot be a sample space

for the rolling of a die because its elements are not mutually exclusive. In

particular, the numbers three and ﬁve are both odd and prime, while the number

two is prime and even. This violates the uniqueness criterion.

Figure 2.3: This collection of objects cannot be a sample space as the elements

are not mutually exclusive.

Alternatively, the elements of the space composed of the odd numbers be-

tween one and six, and the even numbers between one and six, are distinct and

mutually exclusive; an integer cannot be simultaneously odd and even. More-

over, this space is collectively exhaustive because every integer is either odd or

even. This latter description forms a possible sample space for the rolling of a

die.

Figure 2.4: A candidate sample space for the rolling of a die is composed of

two objects, the odd numbers and the even numbers between one and six.


2.2 Probability Laws

A probability law speciﬁes the likelihood of events related to an experiment.

Formally, a probability law assigns to every event A a number Pr(A), called

the probability of event A, such that the following axioms are satisﬁed.

1. (Nonnegativity) Pr(A) ≥ 0, for every event A.

2. (Normalization) The probability of the sample space Ω is equal to one,

Pr(Ω) = 1.

3. (Countable Additivity) If A and B are disjoint events, A ∩ B = ∅,

then the probability of their union satisﬁes

Pr(A ∪ B) = Pr(A) + Pr(B).

More generally, if A_1, A_2, . . . is a sequence of disjoint events and ⋃_{k=1}^∞ A_k is itself an admissible event, then

Pr(⋃_{k=1}^∞ A_k) = ∑_{k=1}^∞ Pr(A_k).

A number of properties can be deduced from the three axioms of proba-

bility. We prove two such properties below. The ﬁrst statement describes the

relation between inclusion and probabilities.

Proposition 1. If A ⊂ B, then Pr(A) ≤ Pr(B).

Proof. Since A ⊂ B, we have B = A∪(B−A). Noting that A and B−A are

disjoint sets, we get

Pr(B) = Pr(A) + Pr(B −A) ≥ Pr(A),

where the inequality follows from the nonnegativity of probability laws.

Our second result speciﬁes the probability of the union of two events that

are not necessarily mutually exclusive.

Proposition 2. Pr(A ∪ B) = Pr(A) + Pr(B) −Pr(A ∩ B).


Figure 2.5: If event A is a subset of event B, then the probability of A is less than or equal to the probability of B.

Proof. Using the third axiom of probability on the disjoint sets A and (A ∪

B) −A, we can write

Pr(A ∪ B) = Pr(A) + Pr((A ∪ B) −A) = Pr(A) + Pr(B −A).

Similarly, applying the third axiom to A ∩ B and B −(A ∩ B), we obtain

Pr(B) = Pr(A ∩ B) + Pr(B −(A ∩ B)) = Pr(A ∩ B) + Pr(B −A).

Combining these two equations yields the desired result.

Figure 2.6: The probability of the union of A and B is equal to the probability of A plus the probability of B minus the probability of their intersection.

The statement of Proposition 2 can be extended to finite unions of events. Specifically, we can write

Pr(⋃_{k=1}^n A_k) = ∑_{k=1}^n (−1)^{k−1} ∑_{I⊂{1,...,n}, |I|=k} Pr(⋂_{i∈I} A_i),

where the rightmost sum runs over all subsets I of {1, . . . , n} that contain exactly k elements. This more encompassing equation is known as the inclusion-exclusion principle.


We can use Proposition 2 recursively to derive a bound on the probabilities

of unions. This theorem, which is sometimes called the Boole inequality, asserts

that the probability of at least one event occurring is no greater than the sum

of the probabilities of the individual events.

Theorem 1 (Union Bound). Let A_1, A_2, . . . , A_n be a collection of events. Then

Pr(⋃_{k=1}^n A_k) ≤ ∑_{k=1}^n Pr(A_k). (2.1)

Proof. We show this result using induction. First, we note that the claim is trivially true for n = 1. As an inductive hypothesis, assume that (2.1) holds for some n ≥ 1. Then, we have

Pr(⋃_{k=1}^{n+1} A_k) = Pr(A_{n+1} ∪ ⋃_{k=1}^n A_k)
= Pr(A_{n+1}) + Pr(⋃_{k=1}^n A_k) − Pr(A_{n+1} ∩ ⋃_{k=1}^n A_k)
≤ Pr(A_{n+1}) + Pr(⋃_{k=1}^n A_k) ≤ ∑_{k=1}^{n+1} Pr(A_k).

Therefore, by the principle of mathematical induction, (2.1) is valid for all positive integers.
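The union bound is easy to check numerically on a small sample space. The Python sketch below compares the two sides of (2.1) for three arbitrarily chosen events on a fair die; the specific events are ours, not from the text.

```python
# Numerical illustration of the union bound (Theorem 1) for a fair six-sided die.
from fractions import Fraction

omega = set(range(1, 7))                 # sample space of a fair die
events = [{1, 2}, {2, 3, 4}, {4, 6}]     # A_1, A_2, A_3 (arbitrary choices)

def pr(event):
    return Fraction(len(event), len(omega))

lhs = pr(set().union(*events))           # Pr(A_1 ∪ A_2 ∪ A_3) = 5/6
rhs = sum(pr(a) for a in events)         # Pr(A_1) + Pr(A_2) + Pr(A_3) = 7/6
assert lhs <= rhs
print(lhs, "<=", rhs)
```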

2.2.1 Finite Sample Spaces

If a sample space Ω contains a ﬁnite number of elements, then a probability law

on Ω is completely determined by the probabilities of its individual outcomes.

Denote a sample space containing n elements by Ω = {s_1, s_2, . . . , s_n}. Any event in Ω is of the form A = {s_i ∈ Ω | i ∈ I}, where I is a subset of the integers one through n. The probability of event A is therefore given by the third axiom of probability,

Pr(A) = Pr({s_i ∈ Ω | i ∈ I}) = ∑_{i∈I} Pr(s_i).

We emphasize that, by definition, distinct outcomes are always disjoint events.


If in addition the elements of Ω are equally likely with

Pr(s_1) = Pr(s_2) = · · · = Pr(s_n) = 1/n,

then the probability of an event A becomes

Pr(A) = |A|/n, (2.2)

where |A| denotes the number of elements in A.

Example 3. The rolling of a fair die is an experiment with a finite number of equally likely outcomes. The probability of observing any of the faces labeled one through six is therefore equal to 1/6. The probability of any event can easily be computed by counting the number of distinct outcomes included in the event. For instance, the probability of rolling a prime number is

Pr({2, 3, 5}) = Pr(2) + Pr(3) + Pr(5) = 3/6.
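For finite, equally likely models, (2.2) translates directly into a counting routine. The short Python sketch below recomputes the probabilities of this example; the function name pr is ours.

```python
# Probability under equally likely outcomes, as in (2.2): Pr(A) = |A| / n.
from fractions import Fraction

omega = set(range(1, 7))                      # sample space for a fair die

def pr(event):
    return Fraction(len(event & omega), len(omega))

print(pr({2, 3, 5}))                          # rolling a prime number: 1/2, i.e. 3/6
print(pr({x for x in omega if x % 2 == 0}))   # rolling an even number: 1/2
```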

2.2.2 Countably Inﬁnite Models

Consider a sample space that consists of a countably infinite number of elements, Ω = {s_1, s_2, . . .}. Again, a probability law on Ω is specified by the probabilities of individual outcomes. An event in Ω can be written as A = {s_j ∈ Ω | j ∈ J}, where J is a subset of the positive integers. Using the third axiom of probability, Pr(A) can be written as

Pr(A) = Pr({s_j ∈ Ω | j ∈ J}) = ∑_{j∈J} Pr(s_j).

The possibly infinite sum ∑_{j∈J} Pr(s_j) always converges since the summands are nonnegative and the sum is bounded above by one. It is consequently well defined.

Figure 2.7: A countable set is a collection of elements with the same cardinality

as some subset of the natural numbers.


Example 4. Suppose that a fair coin is tossed repetitively until heads is ob-

served. The number of coin tosses is recorded as the outcome of this experi-

ment. A natural sample space for this experiment is Ω = {1, 2, . . .}, a countably

inﬁnite set.

The probability of observing heads on the first trial is 1/2, and the probability of observing heads for the first time on trial k is 2^{−k}. The probability of the entire sample space is therefore equal to

Pr(Ω) = ∑_{k=1}^∞ Pr(k) = ∑_{k=1}^∞ (1/2)^k = 1,

as expected. Similarly, the probability of the number of coin tosses being even can be computed as

Pr({2, 4, 6, . . .}) = ∑_{k=1}^∞ Pr(2k) = ∑_{k=1}^∞ (1/2)^{2k} = (1/4) · 1/(1 − 1/4) = 1/3.
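The value 1/3 can also be estimated by simulation. The Python sketch below repeats the coin-tossing experiment many times; the trial count and random seed are arbitrary choices.

```python
# Monte Carlo check for Example 4: toss a fair coin until heads appears and
# estimate the probability that the number of tosses is even (exact value 1/3).
import random

def tosses_until_heads(rng):
    count = 1
    while rng.random() >= 0.5:   # tails with probability 1/2, keep tossing
        count += 1
    return count

rng = random.Random(0)
trials = 200_000
even = sum(1 for _ in range(trials) if tosses_until_heads(rng) % 2 == 0)
print(even / trials)             # approximately 0.333
```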

2.2.3 Uncountably Inﬁnite Models

Probabilistic models with uncountably inﬁnite sample spaces diﬀer from the

ﬁnite and countable cases in that a probability law may not necessarily be

speciﬁed by the probabilities of single-element outcomes. This diﬃculty arises

from the large number of elements contained in the sample space when the

latter is uncountable. Many subsets of Ω do not have a ﬁnite or countable

representation, and as such the third axiom of probability cannot be applied

to relate the probabilities of these events to the probabilities of individual

outcomes. Despite these apparent diﬃculties, probabilistic models with un-

countably inﬁnite sample spaces are quite useful in practice. To develop an

understanding of uncountable probabilistic models, we consider the unit inter-

val [0, 1].

Figure 2.8: The unit interval [0, 1], composed of all real numbers between zero and one, is an example of an uncountable set.


Suppose that an element is chosen at random from this interval, with all

elements having equal weight. By the normalization axiom of probability, we must

have Pr ([0, 1]) = 1. Furthermore, if two intervals have the same length, the

probabilities of the outcome falling in either interval should be identical. For

instance, it is natural to anticipate that Pr ((0, 0.25)) = Pr ((0.75, 1)).

Figure 2.9: If the outcome of an experiment is uniformly distributed over [0, 1], then two subintervals of equal lengths should have the same probabilities.

In an extension of the previous observation, we take the probability of an

open interval (a, b) where 0 ≤ a < b ≤ 1 to equal

Pr((a, b)) = b −a. (2.3)

Using the third axiom of probability, it is then possible to ﬁnd the probability

of a ﬁnite or countable union of disjoint open intervals.

Figure 2.10: The probabilities of events that are formed through the union of disjoint intervals can be computed easily.

Specifically, for constants 0 ≤ a_1 < b_1 < a_2 < b_2 < · · · ≤ 1, we get

Pr(⋃_{k=1}^∞ (a_k, b_k)) = ∑_{k=1}^∞ (b_k − a_k).

The probabilities of more complex events can be obtained by applying ad-

ditional elementary set operations. However, it suﬃces to say for now that


specifying the probability of the outcome falling in (a, b) for every possible

open interval is enough to deﬁne a probability law on Ω. In the example at

hand, (2.3) completely determines the probability law on [0, 1].

Note that we can give an alternative means of computing the probability of an interval. Again, consider the open interval (a, b) where 0 ≤ a < b ≤ 1. The probability of the outcome falling in this interval is equal to

Pr((a, b)) = b − a = ∫_a^b dx = ∫_{(a,b)} dx.

Moreover, for 0 ≤ a_1 < b_1 < a_2 < b_2 < · · · ≤ 1, we can write

Pr(⋃_{k=1}^∞ (a_k, b_k)) = ∑_{k=1}^∞ (b_k − a_k) = ∫_{⋃_{k=1}^∞ (a_k, b_k)} dx.

For this carefully crafted example, it appears that the probability of an admissible event A is given by the integral

Pr(A) = ∫_A dx.

This is indeed accurate for the current scenario. In fact, the class of admissible events for this experiment is simply the collection of all sets for which the integral ∫_A dx can be computed. In other words, if a number is chosen at random from [0, 1], then the probability of this number falling in measurable set A ⊂ [0, 1] is

Pr(A) = ∫_A dx.

This method of computing probabilities can be extended to more complicated

problems. In these notes, we will see many probabilistic models with uncount-

ably inﬁnite sample spaces. The mathematical tools required to handle such

models will be treated alongside.

Example 5. Suppose that a participant at a game show is required to spin the wheel of serendipity, a perfect circle with unit radius. When subjected to a vigorous spin, the wheel is equally likely to stop anywhere along its perimeter. A sample space for this experiment is the collection of all angles from 0 to 2π, an uncountable set. The probability of Ω is invariably equal to one, Pr([0, 2π)) = 1.


Figure 2.11: The wheel of serendipity provides an example of a random exper-

iment for which the sample space is uncountable.

The probability that the wheel stops in the first quadrant is given by

Pr([0, π/2)) = ∫_0^{π/2} 1/(2π) dθ = 1/4.

More generally, the probability that the wheel stops in an interval (a, b) where 0 ≤ a ≤ b < 2π can be written as

Pr((a, b)) = (b − a)/(2π).

If B ⊂ [0, 2π) is a measurable set representing all winning outcomes, then the probability of success at the wheel becomes

Pr(B) = ∫_B 1/(2π) dθ.
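A simulation offers one way to sanity-check such continuous models. The sketch below draws a uniformly random stopping angle on [0, 2π) and estimates the probability of landing in the first quadrant; the trial count and seed are arbitrary choices.

```python
# Monte Carlo sketch of Example 5: uniformly random stopping angle on [0, 2*pi).
# Estimates the probability of stopping in the first quadrant (exact value 1/4).
import math
import random

rng = random.Random(0)
trials = 100_000
hits = sum(1 for _ in range(trials)
           if rng.uniform(0.0, 2.0 * math.pi) < math.pi / 2)
print(hits / trials)             # approximately 0.25
```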

2.2.4 Probability and Measure Theory*

A thorough treatment of probability involves advanced mathematical concepts,

especially when it comes to inﬁnite sample spaces. The basis of our intuition

for the inﬁnite is the set of natural numbers,

N = {1, 2, . . .}.

Two sets are said to have the same cardinality if their elements can be put in one-to-one correspondence. A set with the same cardinality as a subset of the natural numbers is said to be countable. That is, the elements of a countable set can always be listed in sequence, s_1, s_2, . . .; although the order may have nothing to do with any relation between the elements. The integers

and the rational numbers are examples of countably inﬁnite sets. It may be

surprising at ﬁrst to learn that there exist uncountable sets. To escape beyond

the countable, one needs set theoretic tools such as power sets. The set of real

numbers is uncountably inﬁnite; it cannot be put in one-to-one correspondence

with the natural numbers. A typical progression in analysis consists of using

the ﬁnite to get at the countably inﬁnite, and then to employ the countably

inﬁnite to get at the uncountable.

It is tempting to try to assign probabilities to every subset of a sample

space Ω. However, for uncountably inﬁnite sample spaces, this leads to serious

diﬃculties that cannot be resolved. In general, it is necessary to work with

special subclasses of the class of all subsets of a sample space Ω. The collections

of the appropriate kinds are called ﬁelds and σ-ﬁelds, and they are studied in

measure theory. This leads to measure-theoretic probability, and to its uniﬁed

treatment of the discrete and the continuous.

Fortunately, it is possible to develop an intuitive and working understand-

ing of probability without worrying about these issues. At some point in your

academic career, you may wish to study analysis and measure theory more

carefully and in greater detail. However, it is not our current purpose to

initiate the rigorous treatment of these topics.

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Chapter 2.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 1.2.

3. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Sections 2.1–

2.3.

4. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 1.1, 1.3–1.4.

Chapter 3

Conditional Probability

Conditional probability provides a way to compute the likelihood of an event

based on partial information. This is a powerful concept that is used exten-

sively throughout engineering with applications to decision making, networks,

communications and many other ﬁelds.

3.1 Conditioning on Events

We begin our description of conditional probability with illustrative examples.

The intuition gained through this exercise is then generalized by introducing

a formal deﬁnition for this important concept.

Example 6. The rolling of a fair die is an experiment with six equally likely

outcomes. As such, the probability of obtaining any of the outcomes is 1/6.

However, if we are told that the upper face features an odd number, then only

three possibilities remain, namely {1, 3, 5}. These three outcomes had equal

probabilities before the additional information was revealed. It then seems nat-

ural to assume that they remain equally likely afterwards. In particular, it is

reasonable to assign a probability of 1/3 to each of the three outcomes that re-

main possible candidates after receiving the side information. We can express

the probability of getting a three given that the outcome is an odd number as

Pr(3 ∩ {1, 3, 5})/Pr({1, 3, 5}) = Pr(3)/Pr({1, 3, 5}) = 1/3.


Figure 3.1: Partial information about the outcome of an experiment may

change the likelihood of events. The resulting values are known as conditional

probabilities.

Example 7. Let A and B be events associated with a random experiment, and assume that Pr(B) > 0. To gain additional insight into conditional probability, we consider the scenario where this experiment is repeated N times. Let N_{AB} be the number of trials for which A ∩ B occurs, N_{AB^c} be the number of times where only A occurs, N_{A^cB} be the number of times where only B occurs, and N_{A^cB^c} be the number of trials for which neither takes place. From these definitions, we gather that A is observed exactly N_A = N_{AB} + N_{AB^c} times, and B is seen N_B = N_{AB} + N_{A^cB} times.

The frequentist view of probability is based on the fact that, as N becomes large, one can approximate the probability of an event by taking the ratio of the number of times this event occurs over the total number of trials. For instance, we can write

Pr(A ∩ B) ≈ N_{AB}/N,    Pr(B) ≈ N_B/N.

Likewise, the conditional probability of A given knowledge that the outcome lies in B can be computed using

Pr(A|B) ≈ N_{AB}/N_B = (N_{AB}/N)/(N_B/N) ≈ Pr(A ∩ B)/Pr(B). (3.1)

As N approaches infinity, these approximations become exact and (3.1) unveils the formula for conditional probability.

Having considered intuitive arguments, we turn to the mathematical def-

inition of conditional probability. Let B be an event such that Pr(B) > 0.


A conditional probability law assigns to every event A a number Pr(A|B), termed the conditional probability of A given B, such that

Pr(A|B) = Pr(A ∩ B)/Pr(B). (3.2)

We can show that the collection of conditional probabilities {Pr(A|B)} specifies a valid probability law, as defined in Section 2.2. For every event A, we have

Pr(A|B) = Pr(A ∩ B)/Pr(B) ≥ 0

and, hence, Pr(A|B) is nonnegative. The probability of the entire sample space Ω is equal to

Pr(Ω|B) = Pr(Ω ∩ B)/Pr(B) = Pr(B)/Pr(B) = 1.

If A_1, A_2, . . . is a sequence of disjoint events, then A_1 ∩ B, A_2 ∩ B, . . . is also a sequence of disjoint events and

Pr(⋃_{k=1}^∞ A_k | B) = Pr((⋃_{k=1}^∞ A_k) ∩ B)/Pr(B) = Pr(⋃_{k=1}^∞ (A_k ∩ B))/Pr(B)
= ∑_{k=1}^∞ Pr(A_k ∩ B)/Pr(B) = ∑_{k=1}^∞ Pr(A_k|B),

where the third equality follows from the third axiom of probability applied to the set ⋃_{k=1}^∞ (A_k ∩ B). Thus, the conditional probability law defined by (3.2) satisfies the three axioms of probability.

Example 8. A fair coin is tossed repetitively until heads is observed. In Example 4, we found that the probability of observing heads for the first time on trial k is 2^{−k}. We now wish to compute the probability that heads occurred for the first time on the second trial given that it took an even number of tosses to observe heads. In this example, A = {2} and B is the set of even numbers. The probability that the outcome is two, given that the number of tosses is even, is equal to

Pr(2|B) = Pr(2 ∩ B)/Pr(B) = Pr(2)/Pr(B) = (1/4)/(1/3) = 3/4.


In the above computation, we have used the fact that the probability of ﬂipping

the coin an even number of times is equal to 1/3. This fact was established in

Example 4.
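The same ratio can be reproduced numerically. The short Python sketch below truncates the geometric series for Pr(B); the truncation point is an arbitrary choice.

```python
# Numerical check of Example 8: Pr(first heads on toss 2 | even number of tosses).
# The exact answer is (1/4) / (1/3) = 3/4; the series for Pr(B) is truncated at k = 200.
p_even = sum(0.5 ** k for k in range(2, 201, 2))   # Pr(B), approaches 1/3
p_two = 0.5 ** 2                                   # Pr(A ∩ B) = Pr({2}) = 1/4
print(p_two / p_even)                              # approximately 0.75
```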

The definition of conditional probability can be employed to compute the probability of several events occurring simultaneously. Let A_1, A_2, . . . , A_n be a collection of events. The probability of events A_1 through A_n taking place at the same time is given by

Pr(⋂_{k=1}^n A_k) = Pr(A_1) Pr(A_2|A_1) Pr(A_3|A_1 ∩ A_2) · · · Pr(A_n | ⋂_{k=1}^{n−1} A_k). (3.3)

This formula is known as the chain rule of probability, and it can be verified by expanding each of the conditional probabilities using (3.2),

Pr(⋂_{k=1}^n A_k) = Pr(A_1) · (Pr(A_1 ∩ A_2)/Pr(A_1)) · (Pr(A_1 ∩ A_2 ∩ A_3)/Pr(A_1 ∩ A_2)) · · · (Pr(⋂_{k=1}^n A_k)/Pr(⋂_{k=1}^{n−1} A_k)).

This latter expression implicitly assumes that Pr(⋂_{k=1}^{n−1} A_k) ≠ 0.

Example 9. An urn contains eight blue balls and four green balls. Three

balls are drawn from this urn without replacement. We wish to compute the

probability that all three balls are blue. The probability of drawing a blue ball

the ﬁrst time is equal to 8/12. The probability of drawing a blue ball the second

time given that the ﬁrst ball is blue is 7/11. Finally, the probability of drawing

a blue ball the third time given that the ﬁrst two balls are blue is 6/10. Using

(3.3), we can compute the probability of drawing three blue balls as

Pr(bbb) = (8/12) · (7/11) · (6/10) = 14/55.

3.2 The Total Probability Theorem

The probability of events A and B occurring at the same time can be calculated

as a special case of (3.3). For two events, this computational formula simpliﬁes

to

Pr(A ∩ B) = Pr(A|B) Pr(B). (3.4)


Figure 3.2: Conditional probability can be employed to calculate the probability of multiple events occurring at the same time.

We can also obtain this equation directly from the deﬁnition of conditional

probability. This property is a key observation that plays a central role in

establishing two important results, the total probability theorem and Bayes’

rule. To formulate these two theorems, we need to revisit the notion of a

partition. A collection of events A_1, A_2, . . . , A_n is said to be a partition of the sample space Ω if these events are disjoint and their union is the entire sample space,

⋃_{k=1}^n A_k = Ω.

Visually, a partition divides an entire set into disjoint subsets, as exemplified in Figure 3.3.

Theorem 2 (Total Probability Theorem). Let A_1, A_2, . . . , A_n be a collection of events that forms a partition of the sample space Ω. Suppose that Pr(A_k) > 0 for all k. Then, for any event B, we can write

Pr(B) = Pr(A_1 ∩ B) + Pr(A_2 ∩ B) + · · · + Pr(A_n ∩ B)
= Pr(A_1) Pr(B|A_1) + Pr(A_2) Pr(B|A_2) + · · · + Pr(A_n) Pr(B|A_n).


Figure 3.3: A partition of S can be formed by selecting a collection of subsets

that are disjoint and whose union is S.

Proof. The collection of events A_1, A_2, . . . , A_n forms a partition of the sample space Ω. We can therefore write

B = B ∩ Ω = B ∩ (⋃_{k=1}^n A_k).

Since A_1, A_2, . . . , A_n are disjoint sets, the events A_1 ∩ B, A_2 ∩ B, . . . , A_n ∩ B are also disjoint. Combining these two facts, we get

Pr(B) = Pr(B ∩ (⋃_{k=1}^n A_k)) = Pr(⋃_{k=1}^n (B ∩ A_k))
= ∑_{k=1}^n Pr(B ∩ A_k) = ∑_{k=1}^n Pr(A_k) Pr(B|A_k),

where the fourth equality follows from the third axiom of probability.

Figure 3.4: The total probability theorem states that the probability of event B can be computed by summing Pr(A_i ∩ B) over all members of the partition A_1, A_2, . . . , A_n.

A graphical interpretation of Theorem 2 is illustrated in Figure 3.4. Event B can be decomposed into the disjoint union of A_1 ∩ B, A_2 ∩ B, . . . , A_n ∩ B.


The probability of event B can then be computed by adding the corresponding summands

Pr(A_1 ∩ B), Pr(A_2 ∩ B), . . . , Pr(A_n ∩ B).

Example 10. An urn contains ﬁve green balls and three red balls. A second

urn contains three green balls and nine red balls. One of the two urns is picked

at random, with equal probabilities, and a ball is drawn from the selected urn.

We wish to compute the probability of obtaining a green ball.

In this problem, using a divide and conquer approach seems appropriate;

we therefore utilize the total probability theorem. If the ﬁrst urn is chosen,

then the ensuing probability of getting a green ball is 5/8. On the other hand,

if a ball is drawn from the second urn, the probability that it is green reduces

to 3/12. Since the probability of selecting either urn is 1/2, we can write the

overall probability of getting a green ball as

Pr(g) = Pr(g ∩ U_1) + Pr(g ∩ U_2)
= Pr(g|U_1) Pr(U_1) + Pr(g|U_2) Pr(U_2)
= (5/8) · (1/2) + (3/12) · (1/2) = 7/16.
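The same divide-and-conquer computation can be carried out in code, using exact rational arithmetic; the variable names are ours.

```python
# Example 10 with the total probability theorem, in exact rational arithmetic.
from fractions import Fraction

p_urn = Fraction(1, 2)                 # either urn is picked with probability 1/2
p_green_given_u1 = Fraction(5, 8)      # urn 1: 5 green balls out of 8
p_green_given_u2 = Fraction(3, 12)     # urn 2: 3 green balls out of 12

p_green = p_green_given_u1 * p_urn + p_green_given_u2 * p_urn
print(p_green)                         # 7/16
```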

3.3 Bayes’ Rule

The following result is also very useful. It relates the conditional probability

of A given B to the conditional probability of B given A.

Theorem 3 (Bayes' Rule). Let A_1, A_2, . . . , A_n be a collection of events that forms a partition of the sample space Ω. Suppose that Pr(A_k) > 0 for all k. Then, for any event B such that Pr(B) > 0, we can write

Pr(A_i|B) = Pr(A_i) Pr(B|A_i) / Pr(B) = Pr(A_i) Pr(B|A_i) / ∑_{k=1}^n Pr(A_k) Pr(B|A_k). (3.5)

Proof. Bayes' rule is easily verified. We expand the probability of A_i ∩ B using (3.4) twice, and we get

Pr(A_i ∩ B) = Pr(A_i|B) Pr(B) = Pr(B|A_i) Pr(A_i).


Rearranging the terms yields the ﬁrst equality. The second equality in (3.5)

is obtained by applying Theorem 2 to the denominator Pr(B).

Example 11. April, a biochemist, designs a test for a latent disease. If a

subject has the disease, the probability that the test results turn out positive is

0.95. Similarly, if a subject does not have the disease, the probability that the

test results come up negative is 0.95. Suppose that one percent of the population

is infected by the disease. We wish to ﬁnd the probability that a person who

tested positive has the disease.

Let D denote the event that a person has the disease, and let P be the event that the test results are positive. Using Bayes' rule, we can compute the probability that a person who tested positive has the disease,

Pr(D|P) = Pr(D) Pr(P|D) / (Pr(D) Pr(P|D) + Pr(D^c) Pr(P|D^c))
= (0.01 · 0.95) / (0.01 · 0.95 + 0.99 · 0.05) ≈ 0.1610.

Although the test may initially appear fairly accurate, the probability that a person with a positive test carries the disease remains small.
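The posterior probability is straightforward to compute programmatically. The sketch below mirrors the calculation of Example 11; the variable names are ours.

```python
# Example 11 with Bayes' rule: posterior probability of disease given a positive test.
p_disease = 0.01           # prior: one percent of the population is infected
p_pos_given_d = 0.95       # probability of a positive test given the disease
p_pos_given_not_d = 0.05   # probability of a positive test without the disease

p_pos = p_disease * p_pos_given_d + (1 - p_disease) * p_pos_given_not_d
posterior = p_disease * p_pos_given_d / p_pos
print(round(posterior, 4))   # 0.161
```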

3.4 Independence

Two events A and B are said to be independent if Pr(A∩ B) = Pr(A) Pr(B).

Interestingly, independence is closely linked to the concept of conditional probability. If Pr(B) > 0 and events A and B are independent, then

Pr(A|B) = Pr(A ∩ B)/Pr(B) = Pr(A) Pr(B)/Pr(B) = Pr(A).

That is, the a priori probability of event A is identical to the a posteriori

probability of A given B. In other words, if A is independent of B, then

partial knowledge of B contains no information about the likely occurrence of

A. We note that independence is a symmetric relation; if A is independent of

B, then B is also independent of A. It is therefore unambiguous to say that

A and B are independent events.


Example 12. Suppose that two dice are rolled at the same time, a red die and

a blue die. We observe the numbers that appear on the upper faces of the two

dice. The sample space for this experiment is composed of thirty-six equally

likely outcomes. Consider the probability of getting a four on the red die given

that the blue die shows a six,

Pr({r = 4}|{b = 6}) = Pr({r = 4} ∩ {b = 6})/Pr(b = 6) = 1/6 = Pr(r = 4).

From this equation, we gather that

Pr({r = 4} ∩ {b = 6}) = Pr(r = 4) Pr(b = 6).

As such, rolling a four on the red die and rolling a six on the blue die are

independent events.

Similarly, consider the probability of obtaining a four on the red die given

that the sum of the two dice is eleven,

Pr({r = 4}|{r + b = 11}) = Pr({r = 4} ∩ {r + b = 11})/Pr(r + b = 11) = 0 ≠ 1/6 = Pr(r = 4).

In this case, we conclude that getting a four on the red die and a sum total of

eleven are not independent events.

The basic idea of independence seems intuitively clear: if knowledge about

the occurrence of event B has no impact on the probability of A, then these

two events must be independent. Yet, independent events are not necessarily

easy to visualize in terms of their sample space. A common mistake is to

assume that two events are independent if they are disjoint. Two mutually

exclusive events can hardly be independent: if Pr(A) > 0, Pr(B) > 0, and

Pr(A ∩ B) = 0 then

Pr(A ∩ B) = 0 < Pr(A) Pr(B).

Hence, A and B cannot be independent if they are disjoint, non-trivial events.


3.4.1 Independence of Multiple Events

The concept of independence can be extended to multiple events. The events A_1, A_2, . . . , A_n are independent provided that

Pr(⋂_{i∈I} A_i) = ∏_{i∈I} Pr(A_i), (3.6)

for every subset I of {1, 2, . . . , n}.

For instance, consider a collection of three events, A, B and C. These

events are independent whenever

Pr(A ∩ B) = Pr(A) Pr(B)

Pr(A ∩ C) = Pr(A) Pr(C)

Pr(B ∩ C) = Pr(B) Pr(C)

(3.7)

and, in addition,

Pr(A ∩ B ∩ C) = Pr(A) Pr(B) Pr(C).

The three equalities in (3.7) assert that A, B and C are pairwise independent.

Note that the fourth equation does not follow from the ﬁrst three conditions,

nor does it imply any of them. Pairwise independence does not necessarily

imply independence. This is illustrated below.

Example 13. A fair coin is ﬂipped twice. Let A denote the event that heads

is observed on the ﬁrst toss. Let B be the event that heads is obtained on the

second toss. Finally, let C be the event that the two coins show distinct sides.

These three events each have a probability of 1/2. Furthermore, we have

Pr(A ∩ B) = Pr(A ∩ C) = Pr(B ∩ C) = 1/4

and, therefore, these events are pairwise independent. However, we can verify that

Pr(A ∩ B ∩ C) = 0 ≠ 1/8 = Pr(A) Pr(B) Pr(C).

This shows that events A, B and C are not independent.
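Because the sample space has only four outcomes, the claims of Example 13 can be checked by brute force. The Python sketch below does exactly that; representing outcomes as pairs of letters is an arbitrary choice.

```python
# Brute-force check of Example 13 over the four equally likely outcomes of two coin flips.
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))          # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
A = {w for w in omega if w[0] == "H"}          # heads on the first toss
B = {w for w in omega if w[1] == "H"}          # heads on the second toss
C = {w for w in omega if w[0] != w[1]}         # the two coins show distinct sides

def pr(event):
    return Fraction(len(event), len(omega))

print(pr(A & B) == pr(A) * pr(B))              # True: pairwise independent
print(pr(A & C) == pr(A) * pr(C))              # True
print(pr(B & C) == pr(B) * pr(C))              # True
print(pr(A & B & C) == pr(A) * pr(B) * pr(C))  # False: not mutually independent
```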


Example 14. Two dice are rolled at the same time, a red die and a blue die.

Let A be the event that the number on the red die is odd. Let B be the event

that the number on the red die is either two, three or four. Also, let C be the

event that the product of the two dice is twelve. The individual probabilities of

these events are

Pr(A) = Pr(r ∈ {1, 3, 5}) = 1/2
Pr(B) = Pr(r ∈ {2, 3, 4}) = 1/2
Pr(C) = Pr(r × b = 12) = 4/36.

We note that these events are not pairwise independent because

Pr(A ∩ B) = 1/6 ≠ 1/4 = Pr(A) Pr(B)
Pr(A ∩ C) = 1/36 ≠ 1/18 = Pr(A) Pr(C)
Pr(B ∩ C) = 1/12 ≠ 1/18 = Pr(B) Pr(C).

Consequently, the multiple events A, B and C are not independent. Still, the probability of these three events occurring simultaneously is

Pr(A ∩ B ∩ C) = 1/36 = (1/2) · (1/2) · (4/36) = Pr(A) Pr(B) Pr(C).

3.4.2 Conditional Independence

We introduced earlier the meaning of conditional probability, and we showed that the set of conditional probabilities {Pr(A|B)} specifies a valid probability law. It is therefore possible to discuss independence with respect to conditional probability. We say that events A_1 and A_2 are conditionally independent, given event B, if

Pr(A_1 ∩ A_2|B) = Pr(A_1|B) Pr(A_2|B).

Note that if A_1 and A_2 are conditionally independent, we can use equation (3.2) to write

Pr(A_1 ∩ A_2|B) = Pr(A_1 ∩ A_2 ∩ B)/Pr(B)
= Pr(B) Pr(A_1|B) Pr(A_2|A_1 ∩ B)/Pr(B)
= Pr(A_1|B) Pr(A_2|A_1 ∩ B).


Under the assumption that Pr(A_1|B) > 0, we can combine the previous two expressions and get

Pr(A_2|A_1 ∩ B) = Pr(A_2|B).

This latter result asserts that, given event B has taken place, the additional information that A_1 has also occurred does not affect the likelihood of A_2. It is simple to show that conditional independence is a symmetric relation as well.

Example 15. Suppose that a fair coin is tossed until heads is observed. The number of trials is recorded as the outcome of this experiment. We denote by B the event that the coin is tossed more than one time. Moreover, we let A_1 be the event that the number of trials is an even number; and A_2, the event that the number of trials is less than six. The conditional probabilities of A_1 and A_2, given that the coin is tossed more than once, are

Pr(A_1|B) = Pr(A_1 ∩ B)/Pr(B) = (1/3)/(1/2) = 2/3
Pr(A_2|B) = Pr(A_2 ∩ B)/Pr(B) = (15/32)/(1/2) = 15/16.

The joint probability of events A_1 and A_2 given B is equal to

Pr(A_1 ∩ A_2|B) = Pr(A_1 ∩ A_2 ∩ B)/Pr(B) = (5/16)/(1/2) = 5/8
= (2/3) · (15/16) = Pr(A_1|B) Pr(A_2|B).

We conclude that A_1 and A_2 are conditionally independent given B. In particular, we have

Pr(A_2|A_1 ∩ B) = Pr(A_2|B)
Pr(A_1|A_2 ∩ B) = Pr(A_1|B).

We emphasize that events A_1 and A_2 are not independent with respect to the unconditional probability law.

Two events that are independent with respect to an unconditional proba-

bility law may not be conditionally independent.


Example 16. Two dice are rolled at the same time, a red die and a blue die.

We can easily compute the probability of simultaneously getting a two on the

red die and a six on the blue die,

Pr({r = 2} ∩ {b = 6}) = 1/36 = Pr(r = 2) Pr(b = 6).

Clearly, these two events are independent.

Consider the probability of rolling a two on the red die and a six on the

blue die given that the sum of the two dice is an odd number. The individual

conditional probabilities are given by

Pr({r = 2}|{r + b is odd}) = Pr({b = 6}|{r + b is odd}) = 1/6,

whereas the joint conditional probability is

Pr({r = 2} ∩ {b = 6}|{r + b is odd}) = 0.

These two events are not conditionally independent.

It is possible to extend the notion of conditional independence to several events. The events A_1, A_2, . . . , A_n are conditionally independent given B if

Pr(⋂_{i∈I} A_i | B) = ∏_{i∈I} Pr(A_i|B)

for every subset I of {1, 2, . . . , n}. This definition is analogous to (3.6), albeit using the appropriate conditional probability law.

3.5 Equivalent Notations

In the study of probability, we are frequently interested in the probability of

multiple events occurring simultaneously. So far, we have expressed the joint

probability of events A and B using the notation Pr(A∩B). For mathematical

convenience, we also represent the probability that two events occur at the

same time by

Pr(A, B) = Pr(A ∩ B).


This alternate notation easily extends to the joint probability of several events. We denote the joint probability of events A_1, A_2, . . . , A_n by

Pr(A_1, A_2, . . . , A_n) = Pr(⋂_{k=1}^n A_k).

Conditional probabilities can be written using a similar format. The probability of A given events B_1, B_2, . . . , B_n becomes

Pr(A|B_1, B_2, . . . , B_n) = Pr(A | ⋂_{k=1}^n B_k).

From this point forward, we use these equivalent notations interchangeably.

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Chapter 3.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Sections 1.3–1.5.

3. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Sections 2.4–

2.6.

4. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 1.5–1.6.

Chapter 4

Equiprobable Outcomes and

Combinatorics

The simplest probabilistic scenario is perhaps one where the sample space

Ω is ﬁnite and all the possible outcomes are equally likely. In such cases,

computing the probability of an event amounts to counting the number of

outcomes comprising this event and dividing this count by the total number of

elements contained in the sample space. This situation was ﬁrst explored in

Section 2.2.1, with an explicit formula for computing probabilities appearing

in (2.2).

While counting outcomes may appear intuitively straightforward, it is in

many circumstances a daunting task. For instance, consider the number of

distinct subsets of the integers {1, 2, . . . , n} that do not contain two consecutive

integers. This number is equal to

φ

n+2

−(1 −φ)

n+2

√

5

,

where φ = (1 +

√

5)/2 is the golden ratio. This somewhat surprising result

can be obtained iteratively through the Fibonacci recurrence relation, and it

attests to the fact that counting may at times be challenging. Calculating

the number of ways that certain patterns can be formed is part of the ﬁeld of

combinatorics. In this chapter, we introduce useful counting techniques that

can be applied to situations pertinent to probability.
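As a brief illustration that is not part of the original text, the count above can be generated with the Fibonacci recurrence and compared against the closed form; the function name below is ours.

from math import sqrt

def no_consecutive_subsets(n):
    # f(m) = number of subsets of {1, ..., m} with no two consecutive integers;
    # recurrence f(m) = f(m-1) + f(m-2), with f(0) = 1 and f(1) = 2.
    if n == 0:
        return 1
    a, b = 1, 2
    for _ in range(n - 1):
        a, b = b, a + b
    return b

phi = (1 + sqrt(5)) / 2
for n in range(1, 11):
    closed_form = (phi**(n + 2) - (1 - phi)**(n + 2)) / sqrt(5)
    print(n, no_consecutive_subsets(n), round(closed_form))   # the two counts match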


4.1 The Counting Principle

The counting principle is a guiding rule for computing the number of elements

in a cartesian product. Suppose that S and T are ﬁnite sets with m and n

elements, respectively. The cartesian product of S and T is given by

S ×T = {(x, y)|x ∈ S and y ∈ T}.

The number of elements in the cartesian product S ×T is equal to mn. This

is illustrated in Figure 4.1.


Figure 4.1: This ﬁgure provides a graphical interpretation of the cartesian

product of S = {1, 2, 3} and T = {a, b}. In general, if S has m elements and T

contains n elements, then the cartesian product S×T consists of mn elements.

Example 17. Consider an experiment consisting of ﬂipping a coin and rolling

a die. There are two possibilities for the coin, heads or tails, and the die has

six faces. The number of possible outcomes for this experiment is 2 ×6 = 12.

That is, there are twelve diﬀerent ways to ﬂip a coin and roll a die.

The counting principle can be broadened to calculating the number of

elements in the cartesian product of multiple sets. Consider the ﬁnite sets

S_1, S_2, . . . , S_r and their cartesian product

S_1 × S_2 × · · · × S_r = {(s_1, s_2, . . . , s_r) | s_i ∈ S_i}.

If we denote the cardinality of S_i by n_i = |S_i|, then the number of distinct
ordered r-tuples of the form (s_1, s_2, . . . , s_r) is n = n_1 n_2 · · · n_r.

Example 18 (Sampling with Replacement and Ordering). An urn contains n

balls numbered one through n. A ball is drawn from the urn, and its number is

recorded on an ordered list. The ball is then replaced in the urn. This procedure

is repeated k times. We wish to compute the number of possible sequences that

can result from this experiment. There are k drawings and n possibilities per

drawing. Using the counting principle, we gather that the number of distinct

sequences is n^k.


Figure 4.2: The cartesian product {1, 2}^2 has four distinct ordered pairs.
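As a quick check of the counting principle, not part of the original notes, one can enumerate every ordered sequence with itertools.product and compare the count against n^k; the values n = 4 and k = 3 are arbitrary.

from itertools import product

n, k = 4, 3                           # 4 numbered balls, 3 draws with replacement
sequences = list(product(range(1, n + 1), repeat=k))
print(len(sequences), n**k)           # both equal 64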

Example 19. The power set of S, denoted by 2^S, is the collection of all subsets
of S. In set theory, 2^S represents the set of all functions from S to {0, 1}. By
identifying a function in 2^S with the corresponding preimage of one, we obtain
a bijection between 2^S and the subsets of S. In particular, each function in 2^S
is the characteristic function of a subset of S.

Suppose that S is finite with n = |S| elements. For every element of S,
a characteristic function in 2^S is either zero or one. There are therefore 2^n
distinct characteristic functions from S to {0, 1}. Hence, the number of distinct
subsets of S is given by 2^n.

4.2 Permutations

Again, consider the integer set S = {1, 2, . . . , n}. A permutation of S is an

ordered arrangement of its elements, a list without repetitions. The number

of permutations of S can be computed as follows. Clearly, there are n distinct

possibilities for the ﬁrst item in the list. The number of possibilities for the

second item is n − 1, namely all the integers in S except the element we


Figure 4.3: The power set of {1, 2, 3} contains eight elements. They are dis-

played above.

selected initially. Similarly, the number of distinct possibilities for the mth

item is n − m + 1. This pattern continues until all the elements in S are

recorded. Summarizing, we ﬁnd that the total number of permutations of S

is n factorial, n! = n(n −1) · · · 1.

Example 20. We wish to compute the number of permutations of S = {1, 2, 3}.

Since the set S possesses three elements, it has 3! = 6 diﬀerent permutations.

They can be written as 123, 132, 213, 231, 312, 321.


Figure 4.4: Ordering the numbers one, two and three leads to six possible

permutations.


4.2.1 Stirling’s Formula*

The number n! grows very rapidly as a function of n. A good approximation

for n! when n is large is given by Stirling’s formula,

n! ∼ n^n e^{−n} √(2πn).

The notation a_n ∼ b_n signifies that the ratio a_n/b_n → 1 as n → ∞.
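The quality of this approximation is easy to probe numerically; the following sketch, which is not part of the original notes, prints the ratio n!/Stirling(n) for a few values of n.

from math import e, factorial, pi, sqrt

def stirling(n):
    # Stirling's approximation: n! ~ n**n * e**(-n) * sqrt(2*pi*n)
    return n**n * e**(-n) * sqrt(2 * pi * n)

for n in (1, 5, 10, 20, 50):
    print(n, factorial(n) / stirling(n))   # the ratio tends to 1 as n grows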

4.2.2 k-Permutations

Suppose that we rank only k elements out of the set S = {1, 2, . . . , n}, where

k ≤ n. We wish to count the number of distinct k-permutations of S. Fol-

lowing our previous argument, we can choose one of n elements to be the ﬁrst

item listed, one of the remaining (n −1) elements for the second item, and so

on. The procedure terminates when k items have been recorded. The number

of possible sequences is therefore given by

n! / (n − k)! = n(n − 1) · · · (n − k + 1).

Example 21. A recently formed music group has four original songs they can

play. They are asked to perform two songs at South by Southwest. We wish

to compute the number of song arrangements the group can oﬀer in concert.

Abstractly, this is equivalent to computing the number of 2-permutations of

four songs. Thus, the number of distinct arrangements is 4!/2! = 12.


Figure 4.5: There are twelve 2-permutations of the numbers one through four.
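As an aside, the twelve arrangements of Example 21 can be listed explicitly with itertools.permutations; the song titles below are made up for illustration.

from itertools import permutations
from math import factorial

songs = ["song A", "song B", "song C", "song D"]   # hypothetical titles
setlists = list(permutations(songs, 2))            # ordered pairs of distinct songs
print(len(setlists), factorial(4) // factorial(4 - 2))   # both equal 12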


Example 22 (Sampling without Replacement, with Ordering). An urn con-

tains n balls numbered one through n. A ball is picked from the urn, and its

number is recorded on an ordered list. The ball is not replaced in the urn.

This procedure is repeated until k balls are selected from the urn, where k ≤ n.

We wish to compute the number of possible sequences that can result from

this experiment. The number of possibilities is equivalent to the number of

k-permutations of n elements, which is given by n!/(n −k)!.

4.3 Combinations

Consider the integer set S = {1, 2, . . . , n}. A combination is a subset of S.

We emphasize that a combination diﬀers from a permutation in that elements

in a combination have no speciﬁc ordering. The 2-element subsets of S =

{1, 2, 3, 4} are

{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4},

whereas the 2-permutations of S are more numerous with

(1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4),

(3, 1), (3, 2), (3, 4), (4, 1), (4, 2), (4, 3).

There are fewer 2-element subsets of S than 2-permutations of S.


Figure 4.6: There exist six 2-element subsets of the numbers one through four.

We can compute the number of k-element combinations of S = {1, 2, . . . , n}

as follows. Note that a k-permutation can be formed by ﬁrst selecting k objects


from S and then ordering them. There are k! distinct ways of ordering k ob-

jects. The number of k-permutations must then equal the number of k-element

combinations multiplied by k!. Since the total number of k-permutations of S

is n!/(n −k)!, the number of k-element combinations is found to be

(n choose k) = n! / (k!(n − k)!) = n(n − 1) · · · (n − k + 1) / k!.

This expression is termed a binomial coeﬃcient. Observe that selecting a k-

element subset of S is equivalent to choosing the n − k elements that belong

to its complement. Consequently, we can write

(n choose k) = (n choose n − k).

Example 23 (Sampling without Replacement or Ordering). An urn contains

n balls numbered one through n. A ball is drawn from the urn and placed in

a separate jar. This process is repeated until the jar contains k balls, where

k ≤ n. We wish to compute the number of distinct combinations the jar can

hold after the completion of this experiment. Because there is no ordering in

the jar, this amounts to counting the number of k-element subsets of a given

n-element set, which is given by

(n choose k) = n! / (k!(n − k)!).

Again, let S = {1, 2, . . . , n}. Since a combination is also a subset and

the number of k-element combinations of S is (n choose k), the sum of the binomial
coefficients (n choose k) over all values of k must be equal to the number of elements
in the power set of S. Thus, we get

∑_{k=0}^{n} (n choose k) = 2^n.
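Both identities are easy to confirm numerically; the sketch below, not part of the original notes, uses math.comb for the binomial coefficients.

from math import comb

n = 8
# Symmetry: choosing k elements is the same as leaving out the other n - k.
print(all(comb(n, k) == comb(n, n - k) for k in range(n + 1)))   # True
# The binomial coefficients sum to the size of the power set.
print(sum(comb(n, k) for k in range(n + 1)), 2**n)               # 256 256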

4.4 Partitions

Abstractly, a combination is equivalent to partitioning a set into two disjoint

subsets, one containing k objects and the other containing the n−k remaining


elements. In general, the set S = {1, 2, . . . , n} can be partitioned into r disjoint

subsets. Let n_1, n_2, . . . , n_r be nonnegative integers such that

∑_{i=1}^{r} n_i = n.

Consider the following iterative algorithm that leads to a partition of S. First,

we choose a subset of n_1 elements from S. Having selected the first subset,
we pick a second subset containing n_2 elements from the remaining n − n_1
elements. We continue this procedure by successively choosing subsets of n_i
elements from the residual n − n_1 − · · · − n_{i−1} elements, until no element
remains. This algorithm yields a partition of S into r subsets, with the ith
subset containing exactly n_i elements.

We wish to count the number of such partitions. We know that there are
(n choose n_1) ways to form the first subset. Examining our algorithm, we see that there
are exactly

(n − n_1 − · · · − n_{i−1} choose n_i)

ways to form the ith subset. Using the counting principle, the total number
of partitions is then given by

(n choose n_1) (n − n_1 choose n_2) · · · (n − n_1 − · · · − n_{r−1} choose n_r),

which after simplification can be written as

(n choose n_1, n_2, . . . , n_r) = n! / (n_1! n_2! · · · n_r!).

This expression is called a multinomial coeﬃcient.

Example 24. A die is rolled nine times. We wish to compute the number

of possible outcomes for which every odd number appears three times. The

number of distinct sequences in which one, three and ﬁve each appear three

times is equal to the number of partitions of {1, 2, . . . , 9} into three subsets of

size three, namely

9! / (3! 3! 3!) = 1680.
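As a numerical aside, the multinomial coefficient of Example 24 can be computed either from factorials or as a product of binomial coefficients, mirroring the iterative algorithm described above.

from math import comb, factorial

# Partition nine die rolls into three groups of three
# (the positions of the ones, of the threes, and of the fives).
via_factorials = factorial(9) // (factorial(3) * factorial(3) * factorial(3))
via_binomials  = comb(9, 3) * comb(6, 3) * comb(3, 3)
print(via_factorials, via_binomials)   # 1680 1680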


In the above analysis, we assume that the cardinality of each subset is

ﬁxed. Suppose instead that we are interested in counting the number of ways

to pick the cardinality of the subsets that form the partition. Speciﬁcally, we

wish to compute the number of ways integers n_1, n_2, . . . , n_r can be selected
such that every integer is nonnegative and

∑_{i=1}^{r} n_i = n.

We can visualize the number of possible assignments as follows. Picture n

balls spaced out on a straight line and consider r − 1 vertical markers, each of

which can be put between two consecutive balls, before the ﬁrst ball, or after

the last ball. For instance, if there are ﬁve balls and two markers then one

possible assignment is illustrated in Figure 4.7.


Figure 4.7: The number of possible cardinality assignments for the partition

of a set of n elements into r distinct subsets is equivalent to the number of

ways to select n positions out of n + r −1 candidates.

The number of objects in the first subset corresponds to the number of balls

appearing before the ﬁrst marker. Similarly, the number of objects in the

ith subset is equal to the number of balls positioned between the ith marker

and the preceding one. Finally, the number of objects in the last subset is

simply the number of balls after the last marker. In this scheme, two consecu-

tive markers imply that the corresponding subset is empty. For example, the


integer assignment associated with Figure 4.7 is

(n_1, n_2, n_3) = (0, 2, 3).


Figure 4.8: The assignment displayed in Figure 4.7 corresponds to having no

element in the ﬁrst set, two elements in the second set and three elements in

the last one.

There is a natural bijection between an integer assignment and the graph-

ical representation depicted above. To count the number of possible integer

assignments, it suﬃces to calculate the number of ways to place the markers

and the balls. In particular, there are n + r − 1 positions, n balls and r − 1

markers. The number of ways to assign the markers is equal to the number of

n-combinations of n + r −1 elements,

(n + r − 1 choose n) = (n + r − 1 choose r − 1).

Example 25 (Sampling with Replacement, without Ordering). An urn con-

tains r balls numbered one through r. A ball is drawn from the urn and its

number is recorded. The ball is then returned to the urn. This procedure is

repeated a total of n times. The outcome of this experiment is a table that

contains the number of times each ball has come in sight. We are interested

in computing the number of possible outcomes. This is equivalent to counting

the ways a set with n elements can be partitioned into r subsets. The number

of possible outcomes is therefore given by

(n + r − 1 choose n) = (n + r − 1 choose r − 1).
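As a check that is not part of the original notes, one can enumerate the nonnegative integer solutions of n_1 + · · · + n_r = n by brute force and compare the count against the binomial coefficient above; n = 5 and r = 3 match Figure 4.7.

from itertools import product
from math import comb

n, r = 5, 3
solutions = [c for c in product(range(n + 1), repeat=r) if sum(c) == n]
print(len(solutions), comb(n + r - 1, r - 1))   # both equal 21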

4.5 Combinatorial Examples

In this section, we present a few applications of combinatorics to computing

the probabilities of speciﬁc events.


Example 26 (Pick 3 Texas Lottery). The Texas Lottery game “Pick 3” is

easy to play. A player must pick three numbers from zero to nine, and choose

how to play them: exact order or any order. The Pick 3 balls are drawn using

three air-driven machines. These machines employ compressed air to mix and

select each ball.

The probability of winning when playing the exact order is

(1/10) · (1/10) · (1/10) = 1/1000.

The probability of winning while playing any order depends on the numbers
selected. If three distinct numbers are selected, then the probability of winning
is

3!/1000 = 3/500.

If a number is repeated twice, the probability of winning becomes

(3 choose 2)/1000 = 3/1000.

Finally, if a same number is selected three times, the probability of winning

decreases to 1/1000.

Example 27 (Mega Millions Texas Lottery). To play the Mega Millions game,

a player must select ﬁve numbers from 1 to 56 in the upper white play area of

the play board, and one Mega Ball number from 1 to 46 in the lower yellow play

area of the play board. All drawing equipment is stored in a secured on-site

storage room. Only authorized drawings department personnel have keys to the

door. Upon entry of the secured room to begin the drawing process, a lottery

drawing specialist examines the security seal to determine if any unauthorized

access has occurred. For each drawing, the Lotto Texas balls are mixed by four

acrylic mixing paddles rotating clockwise. High speed is used for mixing and

low speed for ball selection. As each ball is selected, it rolls down a chute into

an oﬃcial number display area. We wish to compute the probability of winning

the Mega Millions Grand Prize, which requires the correct selection of the ﬁve

white balls plus the gold Mega ball.

The probability of winning the Mega Millions Grand Prize is

(1 / (56 choose 5)) · (1/46) = (51! 5! / 56!) · (1/46) = 1/175711536.
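The arithmetic is a one-liner in Python; this small check is ours, not part of the lottery description above.

from math import comb

outcomes = comb(56, 5) * 46     # five white balls out of 56, one Mega Ball out of 46
print(outcomes, 1 / outcomes)   # 175711536 and roughly 5.7e-09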


Example 28 (Sinking Boat). Six couples, twelve people total, are out at sea

on a sail boat. Unfortunately, the boat hits an iceberg and starts sinking slowly.

Two Coast Guard vessels, the Ibis and the Mako, come to the rescue. Each

passenger must decide which of the two rescue boats to board. What is the

probability that no two people from a same couple end up on the Mako?

Suppose that all the passengers choose a rescue boat independently of one

another. For a given couple, the probability that the two members do not end

up on the Mako is 3/4. Since couples act independently, the probability that no

two people from a same couple are united on the Mako becomes 3^6/4^6. In other

words, there are a total of 4096 possible arrangements of people on the Mako

after the rescue operation, all equally likely; only 729 of these arrangements

are such that no two members from a same couple board the Mako.

Example 29 (The Locksmith Convention*). A total of n people are gathering

to attend a two-day, secret locksmith convention. As tradition dictates, every

locksmith brings a lock and two keys to the convention. At the opening cere-

mony, all the locksmiths sit around a large table, and they exchange keys to

their own locks with the two members sitting immediately next to them. The

rules of the secret society stipulate that the conference can only proceed on

the second day if a working lock-and-key combination can be reconstituted. If

each locksmith decides to attend the second day of the convention with proba-

bility half, independently of others, what is the probability that the reunion can

continue its course of action on the second day?

There are n attendants on the ﬁrst day. The number of possible audiences

on the second day is therefore 2^n, and they are all equally likely. To find

the probability that the reunion can proceed, it suﬃces to count the number of

ways in which at least two people that were sitting next to each other during

the opening ceremony are present. Alternatively, we can compute the likelihood

that this does not occur. To solve this problem, we label the people attending

the opening ceremony using integers one through n, according to their posi-

tions around the table. We consider three situations that may occur on the

second day and for which the society’s requirements are not met. If member 1

is present, then members 2 and n cannot be there on the second day, and the

number of possible non-qualifying arrangements is equal to the number of dis-

tinct subsets of the integers {3, 4, . . . , n − 1} that do not contain consecutive


integers. This number can be deduced from the discussion at the beginning of

this chapter, and is given by

(φ^{n−1} − (1 − φ)^{n−1}) / √5.

By symmetry, the number of arrangements for which a lock-and-key combi-

nation cannot be reconstituted when member n shows up on the second day

is

(φ^{n−1} − (1 − φ)^{n−1}) / √5.

Finally, if both members 1 and n are absent, then there are a total of

(φ^n − (1 − φ)^n) / √5

combinations for which the meeting cannot proceed, as this is the number of

distinct subsets of the integers {2, 3, . . . , n−1} that do not contain consecutive

integers. Collecting these results, we gather that the probability that the meeting
can proceed on the second day is

1 − (2φ^{n−1} − 2(1 − φ)^{n−1} + φ^n − (1 − φ)^n) / (2^n √5).

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Chapter 1.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 1.6.

3. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Section 1.7.

Chapter 5

Discrete Random Variables

Suppose that an experiment and a sample space are given. A random variable

is a real-valued function of the outcome of the experiment. In other words,

the random variable assigns a speciﬁc number to every possible outcome of

the experiment. The numerical value of a particular outcome is simply called

the value of the random variable. Because of the structure of real numbers, it

is possible to deﬁne pertinent statistical properties on random variables that

otherwise do not apply to probability spaces in general.


Figure 5.1: The sample space in this example has seven possible outcomes. A

random variable maps each of these outcomes to a real number.

Example 30. There are six possible outcomes to the rolling of a fair die,

namely each of the six faces. These faces map naturally to the integers one


through six. The value of the random variable, in this case, is simply the

number of dots that appear on the top face of the die.


Figure 5.2: This random variable takes its input from the rolling of a die and

assigns to each outcome a real number that corresponds to the number of dots

that appear on the top face of the die.

A simple class of random variables is the collection of discrete random

variables. A variable is called discrete if its range is ﬁnite or countably inﬁnite;

that is, it can only take a ﬁnite or countable number of values.

Example 31. Consider the experiment where a coin is tossed repetitively un-

til heads is observed. The corresponding function, which maps the number

of tosses to an integer, is a discrete random variable that takes a countable

number of values. The range of this random variable is given by the positive

integers {1, 2, . . .}.

5.1 Probability Mass Functions

A discrete random variable X is characterized by the probability of each of

the elements in its range. We identify the probabilities of individual elements

in the range of X using the probability mass function (PMF) of X, which we

denote by p_X(·). If x is a possible value of X then the probability mass of x,
written p_X(x), is defined by

p_X(x) = Pr({X = x}) = Pr(X = x). (5.1)

Equivalently, we can think of p_X(x) as the probability of the set of all outcomes
in Ω for which X is equal to x,

p_X(x) = Pr(X^{−1}(x)) = Pr({ω ∈ Ω|X(ω) = x}).

Here, X^{−1}(x) denotes the preimage of x defined by {ω ∈ Ω|X(ω) = x}. This
is not to be confused with the inverse of a bijection.


Figure 5.3: The probability mass of x is given by the probability of the set of

all outcomes which X maps to x.

Let X(Ω) denote the collection of all the possible numerical values X can

take; this set is known as the range of X. Using this notation, we can write

∑_{x∈X(Ω)} p_X(x) = 1. (5.2)

We emphasize that the sets deﬁned by {ω ∈ Ω|X(ω) = x} are disjoint and

form a partition of the sample space Ω, as x ranges over all the possible values

in X(Ω). Thus, (5.2) follows immediately from the countable additivity axiom

and the normalization axiom of probability laws. In general, if X is a discrete

random variable and S is a subset of X(Ω), we can write

Pr(S) = Pr({ω ∈ Ω|X(ω) ∈ S}) = ∑_{x∈S} p_X(x). (5.3)

This equation oﬀers an explicit formula to compute the probability of any

subset of X(Ω), provided that X is discrete.


Example 32. An urn contains three balls numbered one, two and three. Two

balls are drawn from the urn without replacement. We wish to ﬁnd the proba-

bility that the sum of the two selected numbers is odd.

Let Ω be the set of ordered pairs corresponding to the possible outcomes of

the experiment, Ω = {(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)}. Note that these

outcomes are equiprobable. We employ X to represent the sum of the two

selected numbers. The PMF of random variable X is given by

p_X(3) = p_X(4) = p_X(5) = 1/3.

If S denotes the event that the sum of the two numbers is odd, then the prob-

ability of the sum being odd can be computed as follows,

Pr(S) = Pr({3, 5}) = p_X(3) + p_X(5) = 2/3.

5.2 Important Discrete Random Variables

A number of discrete random variables appear frequently in problems related

to probability. These random variables arise in many diﬀerent contexts, and

they are worth getting acquainted with. In general, discrete random variables

occur primarily in situations where counting is involved.

5.2.1 Bernoulli Random Variables

The ﬁrst and simplest random variable is the Bernoulli random variable. Let

X be a random variable that takes on only two possible numerical values,

X(Ω) = {0, 1}. Then, X is a Bernoulli random variable and its PMF is given

by

p_X(x) = 1 − p, if x = 0,
p_X(x) = p, if x = 1,

where p ∈ [0, 1].

Example 33. Consider the ﬂipping of a biased coin, for which heads is ob-

tained with probability p and tails is obtained with probability 1 −p. A random

variable that maps heads to one and tails to zero is a Bernoulli random vari-

able with parameter p. In fact, every Bernoulli random variable is equivalent

to the tossing of a coin.


Figure 5.4: The PMF of a Bernoulli random variable appears above for pa-

rameter p = 0.25.

5.2.2 Binomial Random Variables

Multiple independent Bernoulli random variables can be combined to construct

more sophisticated random variables. Suppose X is the sum of n independent

and identically distributed Bernoulli random variables. Then X is called a

binomial random variable with parameters n and p. The PMF of X is given

by

p

X

(k) = Pr(X = k) =

n

k

p

k

(1 −p)

n−k

,

where k = 0, 1, . . . n. We can easily verify that X fulﬁlls the normalization

axiom,

n

¸

k=0

n

k

p

k

(1 −p)

n−k

= (p + (1 −p))

n

= 1.

Example 34. The Brazos Soda Company creates an “Under the Caps” pro-

motion whereby a customer can win an instant cash prize of $1 by looking

under a bottle cap. The likelihood to win is one in four, and it is independent

from bottle to bottle. A customer buys eight bottles of soda from this company.

We wish to ﬁnd the PMF of the number of winning bottles, which we denote

by X. Also, we want to compute the probability of winning more than $4.

The random variable X is binomial with parameters n = 8 and p = 1/4.


Figure 5.5: This ﬁgure shows a binomial random variable with parameters

n = 8 and p = 0.25.

Its PMF is given by

p_X(k) = (8 choose k) (1/4)^k (3/4)^{8−k} = (8 choose k) 3^{8−k} / 4^8,

where k = 0, 1, . . . , 8. The probability of winning more than $4 is

Pr(X > 4) = ∑_{k=5}^{8} (8 choose k) 3^{8−k} / 4^8.
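Evaluating this sum is straightforward; the short sketch below, which is not part of the original notes, gives Pr(X > 4) ≈ 0.0273.

from math import comb

n, p = 8, 0.25
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5, n + 1))
print(tail)   # approximately 0.0273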

5.2.3 Poisson Random Variables

The probability mass function of a Poisson random variable is given by

p_X(k) = (λ^k / k!) e^{−λ}, k = 0, 1, . . .

where λ is a positive number. Note that, using Taylor series expansion, we
have

∑_{k=0}^{∞} p_X(k) = ∑_{k=0}^{∞} (λ^k / k!) e^{−λ} = e^{−λ} ∑_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1,

which shows that this PMF fulﬁlls the normalization axiom of probability laws.

The Poisson random variable is of fundamental importance when counting the

number of occurrences of a phenomenon within a certain time period. It ﬁnds

extensive use in networking, inventory management and queueing applications.


Figure 5.6: This ﬁgure shows the PMF of a Poisson random variable with

parameter λ = 2. Note that the values of the PMF are only present for

k = 0, 1, . . . , 8.

Example 35. Requests at an Internet server arrive at a rate of λ connections

per second. The number of service requests per second is modeled as a random

variable with a Poisson distribution. We wish to ﬁnd the probability that no

service requests arrive during a time interval of one second.

Let N be a random variable that represents the number of requests that

arrive within a span of one second. By assumption, N is a Poisson random
variable with PMF

p_N(k) = (λ^k / k!) e^{−λ}.

The probability that no service requests arrive in one second is simply given by

p_N(0) = e^{−λ}.

It is possible to obtain a Poisson random variable as the limit of a sequence

of binomial random variables. Fix λ and let p_n = λ/n. For k = 0, 1, . . . , n, we
define the PMF of the random variable X_n as

p_{X_n}(k) = Pr(X_n = k) = (n choose k) p_n^k (1 − p_n)^{n−k}
= (n! / (k!(n − k)!)) (λ/n)^k (1 − λ/n)^{n−k}
= (n! / (n^k (n − k)!)) (λ^k / k!) (1 − λ/n)^{n−k}.

In the limit, as n approaches infinity, we get

lim_{n→∞} p_{X_n}(k) = lim_{n→∞} (n! / (n^k (n − k)!)) (λ^k / k!) (1 − λ/n)^{n−k} = (λ^k / k!) e^{−λ}.

Thus, the sequence of binomial random variables {X_n} converges in distribution
to the Poisson random variable X.


Figure 5.7: The levels of a binomial PMF with parameter p = λ/n converge to

the probabilities of a Poisson PMF with parameter λ as n increases to inﬁnity.

This discussion suggests that the Poisson PMF can be employed to ap-

proximate the PMF of a binomial random variable in certain circumstances.

Suppose that Y is a binomial random variable with parameters n and p. If n is

large and p is small then the probability that Y equals k can be approximated

by

p_Y(k) = (n! / (n^k (n − k)!)) (λ^k / k!) (1 − λ/n)^{n−k} ≈ (λ^k / k!) e^{−λ},

where λ = np. The latter expression can be computed numerically in a

straightforward manner.

Example 36. The probability of a bit error on a communication channel is

equal to 10^{−2}. We wish to approximate the probability that a block of 1000 bits

has four or more errors.

Assume that the probability of individual errors is independent from bit to

bit. The transmission of each bit can be modeled as a Bernoulli trial, with a


zero indicating a correct transmission and a one representing a bit error. The

total number of errors in 1000 transmissions then corresponds to a binomial

random variable with parameters n = 1000 and p = 10^{−2}. The probability

of making four or more errors can be approximated using a Poisson random

variable with constant λ = np = 10. Thus, we can approximate the probability

that a block of 1000 bits has four or more errors by

Pr(N ≥ 4) = 1 −Pr(N < 4) ≈ 1 −

3

¸

k=0

λ

k

k!

e

−λ

= 1 −e

−10

1 + 10 + 50 +

500

3

≈ 0.01034.
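As an aside, both the exact binomial tail and its Poisson approximation are easy to evaluate numerically; the sketch below is ours and prints two values that agree to roughly three decimal places.

from math import comb, exp, factorial

n, p = 1000, 0.01
lam = n * p
exact   = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4))
poisson = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(4))
print(exact, poisson)   # both are close to 0.99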

5.2.4 Geometric Random Variables

Consider a random experiment where a Bernoulli trial is repeated multiple

times until a one is observed. At each time step, the probability of getting a

one is equal to p and the probability of getting a zero is 1 −p. The number of

trials carried out before completion, which we denote by X, is recorded as the

outcome of this experiment. The random variable X is a geometric random

variable, and its PMF is given by

p_X(k) = (1 − p)^{k−1} p, k = 1, 2, . . .

We stress that (1 − p)^{k−1} p simply represents the probability of obtaining a
sequence of k − 1 zeros immediately followed by a one.

Example 37. The Brazos Soda Company introduces another “Under the Caps”

promotion. This time, a customer can win an additional bottle of soda by look-

ing under the cap of her bottle. The probability to win is 1/5, and it is inde-

pendent from bottle to bottle. A customer purchases one bottle of soda from the

Brazos Soda Company and thereby enters the contest. For every extra bottle

of soda won by looking under the cap, the customer gets an additional chance

to play. We wish to ﬁnd the PMF of the number of bottles obtained by this

customer.


Figure 5.8: The PMF of a geometric random variable is a decreasing function

of k. It is plotted above for p = 0.25. The values of the PMF are only present

for k = 1, 2, . . . , 12.

Let X denote the total number of bottles obtained by the customer. The

random variable X is geometric and its PMF is

p_X(k) = (1/5)^{k−1} (4/5),

where k = 1, 2, . . .

Memoryless Property: A remarkable aspect of the geometric random vari-

able is that it satisﬁes the memoryless property,

Pr(X = k + j|X > k) = Pr(X = j).

This can be established using the deﬁnition of conditional probability. Let X

be a geometric random variable with parameter p, and assume that k and j

are positive integers. We can write the conditional probability of X as

Pr(X = k + j|X > k) = Pr({X = k + j} ∩ {X > k}) / Pr(X > k)
= Pr(X = k + j) / Pr(X > k)
= (1 − p)^{k+j−1} p / (1 − p)^k
= (1 − p)^{j−1} p = Pr(X = j).

In words, the probability that the number of trials carried out before com-

pletion is k + j, given k unsuccessful trials, is equal to the unconditioned

probability that the total number of trials is j. It can be shown that the ge-

ometric random variable is the only discrete random variable that possesses

the memoryless property.
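The memoryless identity is easy to confirm directly from the PMF; the values of p, k and j below are arbitrary and the check is not part of the original notes.

p, k, j = 0.3, 4, 2
pmf  = lambda m: (1 - p)**(m - 1) * p   # Pr(X = m)
tail = lambda m: (1 - p)**m             # Pr(X > m)
lhs = pmf(k + j) / tail(k)              # Pr(X = k + j | X > k)
rhs = pmf(j)                            # Pr(X = j)
print(lhs, rhs)                         # both equal 0.7 * 0.3 = 0.21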

5.2.5 Discrete Uniform Random Variables

A ﬁnite random variable where all the possible outcomes are equally likely

is called a discrete uniform random variable. Let X be a uniform random

variable taking value over X(Ω) = {1, 2, . . . , n}. Its PMF is therefore given by

p_X(k) = 1/n, if k = 1, 2, . . . , n,
p_X(k) = 0, otherwise.

We encountered speciﬁc cases of this random variable before. The tossing of a

fair coin and the rolling of a die can both be used to construct uniform random

variables.


Figure 5.9: A uniform random variable corresponds to the situation where all

the possible values of the random variable are equally likely. It is shown above

for the case where n = 8.


5.3 Functions of Random Variables

Recall that a random variable is a function of the outcome of an experiment.

Given a random variable X, it is possible to create a new random variable

Y by applying a real-valued function g(·) to X. If X is a random variable

then Y = g(X) is also a random variable since it associates a numerical value

to every outcome of the experiment. In particular, if ω ∈ Ω is the outcome

of the experiment, then X takes on value x = X(ω) and Y takes on value

Y (ω) = g(x).


Figure 5.10: A real-valued function of a random variable is a random variable

itself. In this ﬁgure, Y is obtained by passing random variable X through a

function g(·).

Furthermore, if X is a discrete random variable, then so is Y . The set of

possible values Y can take is denoted by g(X(Ω)), and the number of elements

in g(X(Ω)) is no greater than the number of elements in X(Ω). The PMF of

Y, which we represent by p_Y(·), is obtained as follows. If y ∈ g(X(Ω)) then

p_Y(y) = ∑_{x∈X(Ω): g(x)=y} p_X(x); (5.4)

otherwise, p_Y(y) = 0. In particular, p_Y(y) is non-zero only if there exists an
x ∈ X(Ω) such that g(x) = y and p_X(x) > 0.

Example 38. Let X be a random variable and let Y = g(X) = aX +b, where

a and b are constants. That is, Y is an affine function of X. Suppose that
a ≠ 0; then the probability of Y being equal to value y is given by

p_Y(y) = p_X((y − b)/a).

Linear and aﬃne functions of random variables are commonplace in applied

probability and engineering.

Example 39. A taxi driver works in New York City. The distance of a ride

with a typical client is a discrete uniform random variable taking value over

X(Ω) = {1, 2, . . . , 10}. The metered rate of fare, according to city rules, is

$2.50 upon entry and $0.40 for each additional unit (one-ﬁfth of a mile). We

wish to ﬁnd the PMF of Y , the value of a typical fare.

Traveling one mile is ﬁve units and costs an extra $2.00. The smallest

possible fare is therefore $4.50, with probability 1/10. Similarly, a ride of X

miles will generate a revenue of Y = 2.5 + 2X dollars for the cab driver. The

PMF of Y is thus given by

p_Y(2.5 + 2k) = 1/10

for k = 1, 2, . . . , 10; and it is necessarily zero for any other argument.

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Sections 4.1–4.2, 4.6–4.8.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Sections 2.1–2.3.

3. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Section 2.7.

4. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 2.1–2.2.

Chapter 6

Meeting Expectations

When a large collection of data is gathered, one is typically interested not

necessarily in every individual data point, but rather in certain descriptive

quantities such as the average or the median. The same is true for random

variables. The PMF of discrete random variable X provides a complete char-

acterization of the distribution of X by specifying the probability of every

possible value of X. Still, it may be desirable to summarize the information

contained in the PMF of a random variable. One way to accomplish this task

and thereby obtain meaningful descriptive quantities is through the expecta-

tion operator.

6.1 Expected Values

The expected value E[X] of discrete random variable X is deﬁned by

E[X] = ∑_{x∈X(Ω)} x p_X(x), (6.1)

whenever this sum converges absolutely. If this sum is not absolutely conver-

gent, then X is said not to possess an expected value. As mentioned above,

the expected value E[X] provides insightful information about the underlying

random variable X without giving a comprehensive and overly detailed de-

scription. The expected value of a random variable, as deﬁned in (6.1), is also

called the mean of X. It is important to realize that E[X] is not a function of

random variable X; rather, it is a function of the PMF of X.


Example 40. A fair die is rolled once, with the number of dots appearing

on the top face taken as the value of the corresponding random variable. The

expected value of the roll can be computed as

∑_{k=1}^{6} k/6 = 21/6 = 3.5.

In other words, the mean of this random variable is 3.5.

Example 41. Assume that a fair coin is ﬂipped repetitively until heads is

observed. The value of random variable X is taken to be the total number

of tosses performed during this experiment. The possible values for X are

therefore given by X(Ω) = {1, 2, . . .}. Recall that, in this case, the PMF of X

is equal to p_X(k) = 2^{−k}, where k is a positive integer. The expected value of

this geometric random variable can be computed as

E[X] = ∑_{k=1}^{∞} k/2^k = 2.

The expected number of tosses until the coin produces heads is equal to two.

In general, determining the expectation of a random variable requires as

input its PMF, a detailed characterization of the random variable, and returns

a much simpler scalar attribute, its mean. Hence, computing the expected

value of the random variable yields a concise summary of its overall behavior.

6.2 Functions and Expectations

The mean forms one instance where the distribution of a random variable is

condensed into a scalar quantity. There are several additional examples. The

notion of an expectation can be combined with traditional functions to create

alternate descriptions and other meaningful quantities. Suppose that X is a

discrete random variable. Let g(·) be a real-valued function on the range of X,

and consider the expectation of g(X). This expected value, E[g(X)], is a scalar

quantity that can also provide partial information about the distribution of

X.

One way to determine the expected value of g(X) is to ﬁrst note that Y =

g(X) is itself a random variable. Thus, we can ﬁnd the derived distribution


of Y and then apply the deﬁnition of expected value provided in (6.1). Yet,

there is a more direct way to compute this quantity; the expectation of g(X)

can be expressed as

E[g(X)] = ∑_{x∈X(Ω)} g(x) p_X(x). (6.2)

It is worth re-emphasizing that there exist random variables and functions for

which the above sum does not converge. In such cases, we simply say that the

expected value of g(X) does not exist. Also, notice that the mean E[X] is a

special case of (6.2), where g(X) = X. Hence the deﬁnition of E[g(X)] given

above subsumes our original description of E[X], which appeared in (6.1). We

explore pertinent examples below.

Example 42. The simplest possible scenario for (6.2) is the case where the

function g(·) is a constant. The expectation of g(X) = c becomes

E[c] = ∑_{x∈X(Ω)} c p_X(x) = c ∑_{x∈X(Ω)} p_X(x) = c.

The last equality follows from the normalization axiom of probability laws.

The expectation of a constant is always the constant itself.

Example 43. Let S be a subset of the real numbers, and deﬁne the indicator

function of S by

1_S(x) = 1, if x ∈ S,
1_S(x) = 0, if x ∉ S.

The expectation of 1_S(X) is equal to

E[1_S(X)] = ∑_{x∈X(Ω)} 1_S(x) p_X(x) = ∑_{x∈S∩X(Ω)} p_X(x) = Pr(X ∈ S).

That is, the expectation of the indicator function of S is simply the probability

that X takes on a value in S. This alternate way of computing the probability

of an event can sometimes be employed to solve otherwise diﬃcult probability

problems.


Let random variable Y be deﬁned by applying real-valued function g(·)

to X, with Y = g(X). The mean of Y is equal to the expectation of g(X),

and we know from our ongoing discussion that this value can be obtained by

applying two diﬀerent formulas. To ensure consistency, we verify that these

two approaches lead to a same answer.

First, we can apply (6.1) directly to Y , and obtain

E[Y] = ∑_{y∈g(X(Ω))} y p_Y(y),

where p_Y(·) is the PMF of Y provided by (5.4). Alternatively, using (6.2), we
have

E[Y] = E[g(X)] = ∑_{x∈X(Ω)} g(x) p_X(x).

We prove that these two expressions describe a same answer as follows. Recall
that the PMF of Y evaluated at y is obtained by summing the values of p_X(·)
over all x ∈ X(Ω) such that g(x) = y. Mathematically, this can be expressed
as p_Y(y) = ∑_{x∈X(Ω): g(x)=y} p_X(x). Using this equality, we can write

E[Y] = ∑_{y∈g(X(Ω))} y p_Y(y) = ∑_{y∈g(X(Ω))} y ∑_{x∈X(Ω): g(x)=y} p_X(x)
= ∑_{y∈g(X(Ω))} ∑_{x∈X(Ω): g(x)=y} y p_X(x)
= ∑_{y∈g(X(Ω))} ∑_{x∈X(Ω): g(x)=y} g(x) p_X(x)
= ∑_{x∈X(Ω)} g(x) p_X(x) = E[g(X)].

Note that ﬁrst summing over all possible values of Y and then over the preim-

age of every y ∈ Y (Ω) is equivalent to summing over all x ∈ X(Ω). Hence,

we have shown that computing the expectation of a function using the two

methods outlined above leads to a same solution.

Example 44. Brazos Extreme Events Radio creates the “Extreme Trio” con-

test. To participate, a person must ﬁll out an application card. Three cards

are drawn from the lot and each winner is awarded $1,000. While a same

participant can send multiple cards, he or she can only win one grand prize.


At the time of the drawing, the radio station has accumulated a total of 100

cards. David, an over-enthusiastic listener, is accountable for half of these

cards. We wish to compute the amount of money David expects to win under

this promotion.

Let X be the number of cards drawn by the radio station written by David.

The PMF of X is given by

p_X(k) = (50 choose k)(50 choose 3 − k) / (100 choose 3), k ∈ {0, 1, 2, 3}.

The money earned by David can be expressed as g(k) = 1000 min{k, 1}. It

follows that the expected amount of money he receives is equal to

∑_{k=0}^{3} (1000 min{k, 1}) p_X(k) = 1000 · (29/33).

Alternatively, we can deﬁne Y = 1000 min{X, 1}. Clearly, Y can only take

on one of two possible values, 0 or 1000. Evaluating the PMF of Y , we get

p_Y(0) = p_X(0) = 4/33 and p_Y(1000) = 1 − p_Y(0) = 29/33. The expected value

of Y is equal to

0 · p_Y(0) + 1000 · p_Y(1000) = 1000 · (29/33).

As anticipated, both methods lead to the same answer. The expected amount

of money won by David is roughly $878.79.

6.2.1 The Mean

As seen at the beginning of this chapter, the simplest non-trivial expectation

is the mean. We provide two additional examples for the mean, and we explore

a physical interpretation of its deﬁnition below.

Example 45. Let X be a geometric random variable with parameter p and

PMF

p_X(k) = (1 − p)^{k−1} p, k = 1, 2, . . .

The mean of this random variable is

E[X] = ∑_{k=1}^{∞} k(1 − p)^{k−1} p = p ∑_{k=1}^{∞} k(1 − p)^{k−1} = 1/p.


Example 46. Let X be a binomial random variable with parameters n and p.

The PMF of X is given by

p_X(k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, . . . , n.

The mean of this binomial random variable can therefore be computed as

E[X] = ∑_{k=0}^{n} k (n choose k) p^k (1 − p)^{n−k}
= ∑_{k=1}^{n} (n! / ((k − 1)!(n − k)!)) p^k (1 − p)^{n−k}
= ∑_{ℓ=0}^{n−1} (n! / (ℓ!(n − 1 − ℓ)!)) p^{ℓ+1} (1 − p)^{n−ℓ−1}
= np ∑_{ℓ=0}^{n−1} (n − 1 choose ℓ) p^ℓ (1 − p)^{n−1−ℓ} = np.

Notice how we rearranged the sum into a familiar form to compute its value.

It can be insightful to relate the mean of a random variable to classical

mechanics. Let X be a random variable and suppose that, for every x ∈ X(Ω),

we place an infinitesimal particle of mass p_X(x) at position x along a real line.

The mean of random variable X as deﬁned in (6.1) coincides with the center

of mass of the system of particles.

Example 47. Let X be a Bernoulli random variable such that

p_X(x) = 0.25, if x = 0,
p_X(x) = 0.75, if x = 1.

The mean of X is given by

E[X] = 0 · 0.25 + 1 · 0.75 = 0.75.

Consider a two-particle system with masses m_1 = 0.25 and m_2 = 0.75, respectively.
In the coordinate system illustrated below, the particles are located at
positions x_1 = 0 and x_2 = 1. From classical mechanics, we know that their

center of mass can be expressed as

(m_1 x_1 + m_2 x_2) / (m_1 + m_2) = 0.75.

As anticipated, the center of mass corresponds to the mean of X.


Figure 6.1: The center of mass on the ﬁgure is indicated by the tip of the

arrow. In general, the mean of a discrete random variable corresponds to the

center of mass of the associated particle system

6.2.2 The Variance

A second widespread descriptive quantity associated with random variable X

is its variance, which we denote by Var[X]. It is deﬁned by

Var[X] = E[(X − E[X])^2]. (6.3)

Evidently, the variance is always nonnegative. It provides a measure of the

dispersion of X around its mean. For discrete random variables, it can be

computed explicitly as

Var[X] = ∑_{x∈X(Ω)} (x − E[X])^2 p_X(x).

The square root of the variance is referred to as the standard deviation of X,

and it is often denoted by σ.

Example 48. Suppose X is a Bernoulli random variable with parameter p.

We can compute the mean of X as

E[X] = 1 · p + 0 · (1 −p) = p.

Its variance is given by

Var[X] = (1 − p)^2 · p + (0 − p)^2 · (1 − p) = p(1 − p).

Example 49. Let X be a Poisson random variable with parameter λ. The

mean of X is given by

E[X] = ∑_{k=0}^{∞} k (λ^k / k!) e^{−λ} = ∑_{k=1}^{∞} (λ^k / (k − 1)!) e^{−λ} = λ ∑_{ℓ=0}^{∞} (λ^ℓ / ℓ!) e^{−λ} = λ.

The variance of X can be calculated as

Var[X] = ∑_{k=0}^{∞} (k − λ)^2 (λ^k / k!) e^{−λ}
= ∑_{k=0}^{∞} (λ^2 + k(1 − 2λ) + k(k − 1)) (λ^k / k!) e^{−λ}
= λ − λ^2 + ∑_{k=2}^{∞} (λ^k / (k − 2)!) e^{−λ} = λ.

Both the mean and the variance of a Poisson random variable are equal to its

parameter λ.

6.2.3 Aﬃne Functions

Proposition 3. Suppose X is a random variable with ﬁnite mean. Let Y be

the aﬃne function of X deﬁned by Y = aX + b, where a and b are ﬁxed real

numbers. The mean of random variable Y is equal to E[Y ] = aE[X] + b.

Proof. This can be computed using (6.2);

E[Y] = ∑_{x∈X(Ω)} (ax + b) p_X(x)
= a ∑_{x∈X(Ω)} x p_X(x) + b ∑_{x∈X(Ω)} p_X(x)
= aE[X] + b.

We can summarize this property with E[aX + b] = aE[X] + b.

It is not much harder to show that the expectation is a linear functional.

Suppose that g(·) and h(·) are two real-valued functions such that E[g(X)]

and E[h(X)] both exist. We can write the expectation of ag(X) + h(X) as

E[ag(X) + h(X)] = ∑_{x∈X(Ω)} (ag(x) + h(x)) p_X(x)
= a ∑_{x∈X(Ω)} g(x) p_X(x) + ∑_{x∈X(Ω)} h(x) p_X(x)
= aE[g(X)] + E[h(X)].

This demonstrates that the expectation is both homogeneous and additive.


Proposition 4. Assume that X is a random variable with ﬁnite mean and

variance, and let Y be the aﬃne function of X given by Y = aX +b, where a

and b are constants. The variance of Y is equal to Var[Y] = a^2 Var[X].

Proof. Consider (6.3) applied to Y = aX + b,

Var[Y] = ∑_{x∈X(Ω)} (ax + b − E[aX + b])^2 p_X(x)
= ∑_{x∈X(Ω)} (ax + b − aE[X] − b)^2 p_X(x)
= a^2 ∑_{x∈X(Ω)} (x − E[X])^2 p_X(x) = a^2 Var[X].

The variance of an aﬃne function only depends on the distribution of its

argument and parameter a. A translation of the argument by b does not aﬀect

the variance of Y = aX + b; in other words, it is shift invariant.

6.3 Moments

The moments of a random variable X are likewise important quantities used

in providing partial information about the PMF of X. The nth moment of

random variable X is deﬁned by

E[X^n] = ∑_{x∈X(Ω)} x^n p_X(x). (6.4)

Incidentally, the mean of random variable X is its ﬁrst moment.

Proposition 5. The variance of random variable X can be expressed in terms

of its first two moments, Var[X] = E[X^2] − (E[X])^2.

Proof. Suppose that the variance of X exists and is ﬁnite. Starting from (6.3),

we can expand the variance of X as follows,

Var[X] = ∑_{x∈X(Ω)} (x − E[X])^2 p_X(x)
= ∑_{x∈X(Ω)} (x^2 − 2xE[X] + (E[X])^2) p_X(x)
= ∑_{x∈X(Ω)} x^2 p_X(x) − 2E[X] ∑_{x∈X(Ω)} x p_X(x) + (E[X])^2 ∑_{x∈X(Ω)} p_X(x)
= E[X^2] − (E[X])^2.

This alternate formula for the variance is sometimes convenient for computa-

tional purposes.

We include below an example where the above formula for the variance is

applied. This allows the straightforward application of standard sums from

calculus.

Example 50. Let X be a uniform random variable with PMF

p_X(k) = 1/n, if k = 1, 2, . . . , n,
p_X(k) = 0, otherwise.

The mean of this uniform random variable is equal to

E[X] = (n + 1)/2.

The variance of X can then be obtained as

Var[X] = E[X^2] − (E[X])^2 = ∑_{k=1}^{n} k^2/n − ((n + 1)/2)^2
= n(n + 1)(2n + 1)/(6n) − ((n + 1)/2)^2 = (n^2 − 1)/12.
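As a numerical aside, the two formulas are easily verified for a specific n; the value n = 8 below matches Figure 5.9.

n = 8
values = range(1, n + 1)
mean = sum(k / n for k in values)
var  = sum(k**2 / n for k in values) - mean**2
print(mean, (n + 1) / 2)      # 4.5 4.5
print(var, (n**2 - 1) / 12)   # 5.25 5.25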

Closely related to the moments of a random variable are its central mo-

ments. The kth central moment of X is defined by E[(X − E[X])^k]. The

variance is an example of a central moment, as we can see from deﬁnition

(6.3). The central moments are used to define the skewness of a random variable,
which is a measure of asymmetry; and its kurtosis, which assesses whether

the variance is due to infrequent extreme deviations or more frequent, modest-

size deviations. Although these quantities will not play a central role in our

exposition of probability, they each reveal a diﬀerent characteristic of a random

variable and they are encountered frequently in statistics.

6.4 Ordinary Generating Functions

In the special yet important case where X(Ω) is a subset of the non-negative

integers, it is occasionally useful to employ the ordinary generating function.


This function bears a close resemblance to the z-transform and is deﬁned by

G_X(z) = E[z^X] = ∑_{k=0}^{∞} z^k p_X(k). (6.5)

It is also called the probability-generating function because of the following

property. The probability Pr(X = k) = p_X(k) can be recovered from the
corresponding generating function G_X(z) through Taylor series expansion. Within
the radius of convergence of G_X(z), we have

G_X(z) = ∑_{k=0}^{∞} (1/k!) (d^k G_X/dz^k)(0) z^k.

Comparing this equation to (6.5), we conclude that

p_X(k) = (1/k!) (d^k G_X/dz^k)(0).

We note that, for |z| ≤ 1, we have

|∑_{k=0}^{∞} z^k p_X(k)| ≤ ∑_{k=0}^{∞} |z|^k p_X(k) ≤ ∑_{k=0}^{∞} p_X(k) = 1

and hence the radius of convergence of any probability-generating function

must include one.

The ordinary generating function plays an important role in dealing with

sums of discrete random variables. As a preview of what lies ahead, we com-

pute ordinary generating functions for Bernoulli and Binomial random vari-

ables below.

Example 51. Let X be a Bernoulli random variable with parameter p. The

ordinary generating function of X is given by

G_X(z) = p_X(0) + p_X(1) z = 1 − p + pz.

Example 52. Let S be a binomial random variable with parameters n and p.

The ordinary generating function of S can be computed as

G_S(z) = ∑_{k=0}^{n} z^k p_S(k) = ∑_{k=0}^{n} z^k (n choose k) p^k (1 − p)^{n−k}
= ∑_{k=0}^{n} (n choose k) (pz)^k (1 − p)^{n−k} = (1 − p + pz)^n.

We know, from Section 5.2.2, that one way to create a binomial random

variable is to sum n independent and identically distributed Bernoulli random

variables, each with parameter p. Looking at the ordinary generating functions

above, we notice that G_S(z) is the product of n copies of G_X(z). Intuitively,

it appears that the sum of independent discrete random variables leads to the

product of their ordinary generating functions, a relation that we will revisit

shortly.

The mean and second moment of X can be computed based on its ordinary

generating function. In particular, we have

E[X] = lim_{z↑1} (dG_X/dz)(z).

Similarly, the second moment of X can be derived as

E[X^2] = lim_{z↑1} ( (d^2 G_X/dz^2)(z) + (dG_X/dz)(z) ).

This can be quite useful, as illustrated in the following example.

Example 53 (Poisson Random Variable). Suppose that X has a Poisson dis-

tribution with parameter λ > 0. The function G_X(z) can be computed using
the distribution of X,

G_X(z) = ∑_{k=0}^{∞} z^k e^{−λ} λ^k/k! = e^{−λ} ∑_{k=0}^{∞} (λz)^k/k! = e^{−λ} e^{λz} = e^{λ(z−1)}.

The ﬁrst two moments of X are given by

E[X] = lim_{z↑1} (dG_X/dz)(z) = lim_{z↑1} λ e^{λ(z−1)} = λ,
E[X^2] = lim_{z↑1} ( (d^2 G_X/dz^2)(z) + (dG_X/dz)(z) ) = lim_{z↑1} (λ^2 + λ) e^{λ(z−1)} = λ^2 + λ.

This provides a very eﬃcient way to compute the mean and variance of X,

which are both equal to λ. It may be helpful to compare this derivation with

Example 49.
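For readers who prefer to let a computer algebra system do the differentiation, the sketch below reproduces this calculation; it assumes the sympy library is available and is not part of the original notes.

import sympy as sp

z, lam = sp.symbols('z lambda', positive=True)
G = sp.exp(lam * (z - 1))                    # ordinary generating function of the Poisson PMF

mean   = sp.limit(sp.diff(G, z), z, 1)                        # E[X] = lambda
second = sp.limit(sp.diff(G, z, 2) + sp.diff(G, z), z, 1)     # E[X^2] = lambda^2 + lambda
print(mean, sp.expand(second), sp.simplify(second - mean**2)) # the variance is lambda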

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Section 4.3.


2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 2.4.

3. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 2.4, 3.1.

4. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Sections 4.8.

Chapter 7

Multiple Discrete Random Variables

Thus far, our treatment of probability has been focused on single random variables. It is often convenient or required to model stochastic phenomena using multiple random variables. In this chapter, we extend some of the concepts developed for single random variables to multiple random variables. We center our exposition of random vectors around the simplest case, pairs of random variables.

7.1 Joint Probability Mass Functions

Consider two discrete random variables X and Y associated with a single experiment. The random pair (X, Y) is characterized by the joint probability mass function of X and Y, which we denote by p_{X,Y}(·, ·). If x is a possible value of X and y is a possible value of Y, then the joint probability mass function evaluated at (x, y) is
$$p_{X,Y}(x, y) = \Pr(\{X = x\} \cap \{Y = y\}) = \Pr(X = x, Y = y).$$
Note the similarity between the definition of the joint PMF and (5.1).

Suppose that S is a subset of X(Ω) × Y(Ω). We can express the probability of S as
$$\Pr(S) = \Pr(\{\omega \in \Omega \mid (X(\omega), Y(\omega)) \in S\}) = \sum_{(x,y) \in S} p_{X,Y}(x, y).$$
In particular, we have
$$\sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} p_{X,Y}(x, y) = 1.$$

Figure 7.1: The random pair (X, Y) maps every outcome contained in the sample space to a vector in R^2.

To further distinguish between the joint PMF of (X, Y) and the individual PMFs p_X(·) and p_Y(·), we occasionally refer to the latter as marginal probability mass functions. We can compute the marginal PMFs of X and Y from the joint PMF p_{X,Y}(·, ·) using the formulas
$$p_X(x) = \sum_{y \in Y(\Omega)} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x \in X(\Omega)} p_{X,Y}(x, y).$$
On the other hand, knowledge of the marginal distributions p_X(·) and p_Y(·) is not enough to obtain a complete description of the joint PMF p_{X,Y}(·, ·). This fact is illustrated in Examples 54 and 55.


Example 54. An urn contains three balls numbered one, two and three. A random experiment consists of drawing two balls from the urn, without replacement. The number appearing on the first ball is a random variable, which we denote by X. Similarly, we refer to the number inscribed on the second ball as Y. The joint PMF of X and Y is specified in table form below,

p_{X,Y}(x, y)   y = 1   y = 2   y = 3
x = 1             0      1/6     1/6
x = 2            1/6      0      1/6
x = 3            1/6     1/6      0

We can compute the marginal PMF of X as
$$p_X(x) = \sum_{y \in Y(\Omega)} p_{X,Y}(x, y) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3},$$
where x ∈ {1, 2, 3}. Likewise, the marginal PMF of Y is given by
$$p_Y(y) = \begin{cases} 1/3, & \text{if } y \in \{1, 2, 3\} \\ 0, & \text{otherwise.} \end{cases}$$

Example 55. Again, suppose that an urn contains three balls numbered one, two and three. This time the random experiment consists of drawing two balls from the urn with replacement. We use X and Y to denote the numbers appearing on the first and second balls, respectively. The joint PMF of X and Y becomes

p_{X,Y}(x, y)   y = 1   y = 2   y = 3
x = 1            1/9     1/9     1/9
x = 2            1/9     1/9     1/9
x = 3            1/9     1/9     1/9

The marginal distributions of X and Y are the same as in Example 54; however, the joint PMFs differ.
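The marginalization step in these two examples is easy to reproduce numerically. The sketch below is only illustrative; it stores each joint PMF as a dictionary keyed by (x, y) and sums out the second coordinate.

```python
from fractions import Fraction as F

# Joint PMFs from Examples 54 (without replacement) and 55 (with replacement).
without = {(x, y): F(1, 6) for x in (1, 2, 3) for y in (1, 2, 3) if x != y}
with_repl = {(x, y): F(1, 9) for x in (1, 2, 3) for y in (1, 2, 3)}

def marginal_x(joint):
    # p_X(x) = sum over y of p_{X,Y}(x, y)
    pmf = {}
    for (x, _), p in joint.items():
        pmf[x] = pmf.get(x, F(0)) + p
    return pmf

print(marginal_x(without))    # {1: 1/3, 2: 1/3, 3: 1/3}
print(marginal_x(with_repl))  # same marginals, even though the joint PMFs differ
```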

7.2 Functions and Expectations

Let X and Y be two random variables with joint PMF p_{X,Y}(·, ·). Consider a third random variable defined by V = g(X, Y), where g(·, ·) is a real-valued function. We can obtain the PMF of V by computing
$$p_V(v) = \sum_{\{(x,y) \,\mid\, g(x,y) = v\}} p_{X,Y}(x, y). \qquad (7.1)$$
This equation is the analog of (5.3) for pairs of random variables.

Figure 7.2: A real-valued function of two random variables, X and Y, is also a random variable. Above, V = g(X, Y) maps elements of the sample space to real numbers.

Example 56. Two dice, a blue die and a red one, are rolled simultaneously. The random variable X represents the number of dots that appears on the top face of the blue die, whereas Y denotes the number of dots on the red die. We can form a random variable U that describes the sum of these two dice, U = X + Y.

The lowest possible value for U is two, and its maximum value is twelve. The PMF of U, as calculated using (7.1), appears in table form below,

k        2, 12   3, 11   4, 10   5, 9   6, 8    7
p_U(k)   1/36    1/18    1/12    1/9    5/36   1/6
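Equation (7.1) amounts to summing the joint PMF over the preimage of each value of g. A minimal Python sketch of this computation for the two dice, with the joint PMF stored as a dictionary, is shown below.

```python
from fractions import Fraction as F
from collections import defaultdict

# Joint PMF of two fair dice: every pair (x, y) has probability 1/36.
joint = {(x, y): F(1, 36) for x in range(1, 7) for y in range(1, 7)}

# Apply (7.1) with g(x, y) = x + y.
p_U = defaultdict(F)
for (x, y), p in joint.items():
    p_U[x + y] += p

print(dict(sorted(p_U.items())))
# {2: 1/36, 3: 1/18, 4: 1/12, 5: 1/9, 6: 5/36, 7: 1/6, 8: 5/36, ...}
```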

The definition of the expectation operator can be extended to multiple random variables. In particular, the expected value of g(X, Y) is obtained by computing
$$E[g(X, Y)] = \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} g(x, y)\, p_{X,Y}(x, y). \qquad (7.2)$$


Example 57. An urn contains three balls numbered one, two and three. Two balls are selected from the urn at random, without replacement. We employ X and Y to represent the numbers on the first and second balls, respectively. We wish to compute the expected value of the function X + Y.

Using (7.2), we compute the expectation of g(X, Y) = X + Y as
$$E[g(X, Y)] = E[X + Y] = \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} (x + y)\, p_{X,Y}(x, y) = \sum_{x \in X(\Omega)} x\, p_X(x) + \sum_{y \in Y(\Omega)} y\, p_Y(y) = 4.$$
The expected value of X + Y is four.

7.3 Conditional Random Variables

Many events of practical interest are dependent. That is, knowledge about event A may provide partial information about the realization of event B. This inter-dependence is captured by the concept of conditioning, which was first discussed in Chapter 3. In this section, we extend the concept of conditioning to multiple random variables. We study the probability of events concerning random variable Y given that some information about random variable X is available.

Let X and Y be two random variables associated with a same experiment. The conditional probability mass function of Y given X = x, which we write p_{Y|X}(·|·), is defined by
$$p_{Y|X}(y|x) = \Pr(Y = y \mid X = x) = \frac{\Pr(\{Y = y\} \cap \{X = x\})}{\Pr(X = x)} = \frac{p_{X,Y}(x, y)}{p_X(x)},$$
provided that p_X(x) ≠ 0. Note that conditioning on X = x is not possible when p_X(x) vanishes, as it has no relevant meaning. This is similar to conditional probabilities being defined only for conditioning events with non-zero probabilities.


Let x be fixed with p_X(x) > 0. The conditional PMF introduced above is a valid PMF since it is nonnegative and
$$\sum_{y \in Y(\Omega)} p_{Y|X}(y|x) = \sum_{y \in Y(\Omega)} \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{1}{p_X(x)} \sum_{y \in Y(\Omega)} p_{X,Y}(x, y) = 1.$$
The probability that Y belongs to S, given X = x, is obtained by summing the conditional probability p_{Y|X}(·|x) over all outcomes included in S,
$$\Pr(Y \in S \mid X = x) = \sum_{y \in S} p_{Y|X}(y|x).$$

Example 58 (Hypergeometric Random Variable). A wireless communication channel is employed to transmit a data packet that contains a total of m bits. The first n bits of this message are dedicated to the packet header, which stores very sensitive information. The wireless connection is unreliable; every transmitted bit is received properly with probability 1 − p and erased with probability p, independently of other bits. We wish to derive the conditional probability distribution of the number of erasures located in the header, given that the total number of corrupted bits in the packet is equal to c.

Let H represent the number of erasures in the header, and denote the number of corrupted bits in the entire message by C. The conditional probability mass function of H given C = c is equal to
$$p_{H|C}(h|c) = \frac{p_{H,C}(h, c)}{p_C(c)} = \frac{\binom{n}{h} (1-p)^{n-h} p^h \binom{m-n}{c-h} (1-p)^{(m-n)-(c-h)} p^{c-h}}{\binom{m}{c} (1-p)^{m-c} p^c} = \frac{\binom{n}{h} \binom{m-n}{c-h}}{\binom{m}{c}},$$
where h = 0, 1, . . . , min{c, n}. Clearly, the conditioning affects the probability distribution of the number of erasures in the header. In general, a random variable with such a distribution is known as a hypergeometric random variable.


The definition of the conditional PMF can be rearranged to obtain a convenient formula to calculate the joint distribution of X and Y, namely
$$p_{X,Y}(x, y) = p_{Y|X}(y|x)\, p_X(x) = p_{X|Y}(x|y)\, p_Y(y).$$
This formula can be used to compute the joint PMF of X and Y sequentially.

Example 59 (Splitting Property of Poisson PMF). A digital communication system sends out either a one with probability p or a zero with probability 1 − p, independently of previous transmissions. The number of transmitted binary digits within a given time interval has a Poisson PMF with parameter λ. We wish to show that the number of ones sent in that same time interval has a Poisson PMF with parameter pλ.

Let M denote the number of ones within the stipulated interval, N be the number of zeros, and K = M + N be the total number of bits sent during the same interval. The conditional PMF of the number of ones given that the total number of transmissions is k is
$$p_{M|K}(m|k) = \binom{k}{m} p^m (1-p)^{k-m}, \quad m = 0, 1, \ldots, k.$$
The probability that M is equal to m is therefore equal to
$$\begin{aligned} p_M(m) &= \sum_{k=0}^{\infty} p_{K,M}(k, m) = \sum_{k=0}^{\infty} p_{M|K}(m|k)\, p_K(k) \\ &= \sum_{k=m}^{\infty} \binom{k}{m} p^m (1-p)^{k-m} \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{u=0}^{\infty} \binom{u+m}{m} p^m (1-p)^u \frac{\lambda^{u+m}}{(u+m)!} e^{-\lambda} \\ &= \frac{(\lambda p)^m}{m!} e^{-\lambda} \sum_{u=0}^{\infty} \frac{((1-p)\lambda)^u}{u!} = \frac{(\lambda p)^m}{m!} e^{-p\lambda}. \end{aligned}$$
Above, we have used the change of variables k = u + m. We have also rearranged the sum into a familiar form, leveraging the fact that the summation of a Poisson PMF over all possible values is equal to one. We can see that M has a Poisson PMF with parameter pλ.
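The splitting property is also easy to check by simulation. The following sketch assumes NumPy is installed; the seed and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, trials = 5.0, 0.3, 200_000

# Draw a Poisson(lam) number of bits, then keep each bit as a "one" with probability p.
K = rng.poisson(lam, size=trials)
M = rng.binomial(K, p)            # number of ones among the K transmitted bits

print(M.mean(), M.var())          # both close to p * lam = 1.5, as for a Poisson(p*lam) variable
print(rng.poisson(p * lam, size=trials).mean())   # direct Poisson(p*lam) draws, for comparison
```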


7.3.1 Conditioning on Events

It is also possible to define the conditional PMF of a random variable X, conditioned on an event S where Pr(S) > 0. Let X be a random variable associated with a particular experiment, and let S be a non-trivial event corresponding to this experiment. The conditional PMF of X given S is defined by
$$p_{X|S}(x) = \Pr(X = x \mid S) = \frac{\Pr(\{X = x\} \cap S)}{\Pr(S)}. \qquad (7.3)$$
Note that the events {ω ∈ Ω | X(ω) = x} form a partition of Ω as x ranges over all the values in X(Ω). Using the total probability theorem, we gather that
$$\sum_{x \in X(\Omega)} \Pr(\{X = x\} \cap S) = \Pr(S)$$
and, consequently, we get
$$\sum_{x \in X(\Omega)} p_{X|S}(x) = \sum_{x \in X(\Omega)} \frac{\Pr(\{X = x\} \cap S)}{\Pr(S)} = 1.$$
We conclude that p_{X|S}(·) is a valid PMF.

Another interpretation of (7.3) is the following. Let Y = 1_S(·) symbolize the indicator function of S, with
$$1_S(\omega) = \begin{cases} 1, & \omega \in S \\ 0, & \omega \notin S. \end{cases}$$
Then p_{X|S}(x) = p_{X|Y}(x|1). In this sense, conditioning on an event is a special case of a conditional probability mass function.

Example 60. A data packet is sent to a destination over an unreliable wireless communication link. Data is successfully decoded at the receiver with probability p, and the transmission fails with probability (1 − p), independently of previous trials. When initial decoding fails, a retransmission is requested immediately. However, if packet transmission fails n consecutive times, then the data is dropped from the queue altogether, with no further transmission attempt. Let X denote the number of trials, and let S be the event that the data packet is successfully transmitted. We wish to compute the conditional PMF of X given S.

First, we note that a successful transmission can happen at any of n possible instants. These outcomes being mutually exclusive, the probability of S can be written as
$$\Pr(S) = \sum_{k=1}^{n} (1-p)^{k-1} p = 1 - (1-p)^n .$$
Thus, the conditional PMF p_{X|S}(·) is equal to
$$p_{X|S}(k) = \frac{(1-p)^{k-1} p}{1 - (1-p)^n},$$
where k = 1, 2, . . . , n.
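A quick numerical sketch of this truncated geometric PMF is given below; the parameter values are arbitrary, and the check simply confirms that the conditional PMF sums to one.

```python
def conditional_pmf_given_success(p, n):
    """Conditional PMF of the trial index X given success within n attempts (Example 60)."""
    norm = 1 - (1 - p) ** n                     # Pr(S)
    return {k: (1 - p) ** (k - 1) * p / norm for k in range(1, n + 1)}

pmf = conditional_pmf_given_success(p=0.4, n=3)
print(pmf)
print(sum(pmf.values()))    # 1.0, up to floating-point error
```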

7.4 Conditional Expectations

The conditional expectation of Y given X = x is simply the expectation of Y with respect to the conditional PMF p_{Y|X}(·|x),
$$E[Y \mid X = x] = \sum_{y \in Y(\Omega)} y\, p_{Y|X}(y|x).$$
This conditional expectation can be viewed as a function of x,
$$h(x) = E[Y \mid X = x].$$
It is therefore mathematically accurate and sometimes desirable to talk about the random variable h(X) = E[Y|X]. In particular, if X and Y are two random variables associated with an experiment, the outcome of this experiment determines the value of X, say X = x, which in turn yields the conditional expectation h(x) = E[Y|X = x]. From this point of view, the conditional expectation E[Y|X] is simply an instance of a random variable.

Not too surprisingly, the expectation of E[Y|X] is equal to E[Y], as evinced by the following derivation,
$$\begin{aligned} E[E[Y|X]] &= \sum_{x \in X(\Omega)} E[Y \mid X = x]\, p_X(x) = \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} y\, p_{Y|X}(y|x)\, p_X(x) \\ &= \sum_{y \in Y(\Omega)} \sum_{x \in X(\Omega)} y\, p_{X,Y}(x, y) = \sum_{y \in Y(\Omega)} y\, p_Y(y) = E[Y]. \end{aligned}$$
Using a similar argument, it is straightforward to show that
$$E[E[g(Y)|X]] = E[g(Y)].$$
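This tower property can be verified numerically on a small joint PMF. The sketch below reuses the urn of Example 54 and is only illustrative.

```python
from fractions import Fraction as F

# Joint PMF from Example 54 (two draws without replacement from {1, 2, 3}).
joint = {(x, y): F(1, 6) for x in (1, 2, 3) for y in (1, 2, 3) if x != y}

p_X = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (1, 2, 3)}

def cond_exp_Y_given_X(x):
    # E[Y | X = x] = sum_y y * p_{X,Y}(x, y) / p_X(x)
    return sum(y * p for (a, y), p in joint.items() if a == x) / p_X[x]

lhs = sum(cond_exp_Y_given_X(x) * p_X[x] for x in (1, 2, 3))   # E[ E[Y|X] ]
rhs = sum(y * p for (_, y), p in joint.items())                # E[Y]
print(lhs, rhs)    # both equal 2
```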

Example 61. An entrepreneur opens a small business that sells two kinds of beverages from the Brazos Soda Company, cherry soda and lemonade. The number of bottles sold in an hour at the store is found to be a Poisson random variable with parameter λ = 10. Every customer selects a cherry soda with probability p and a lemonade with probability (1 − p), independently of other customers. We wish to find the conditional mean of the number of cherry sodas purchased in an hour given that ten beverages were sold to customers during this time period.

Let B represent the number of bottles sold during an hour. Similarly, let C and L be the number of cherry sodas and lemonades purchased during the same time interval, respectively. We note that B = C + L. The conditional PMF of C given that the total number of beverages sold equals ten is
$$p_{C|B}(k|10) = \binom{10}{k} p^k (1-p)^{10-k}. \qquad (7.4)$$
This follows from the fact that every customer selects a cherry soda with probability p, independently of other customers. The conditional mean is then seen to equal
$$E[C \mid B = 10] = \sum_{k=0}^{10} k \binom{10}{k} p^k (1-p)^{10-k} = 10p.$$

We can define the expectation of X conditioned on event S in an analogous fashion. Let S be an event such that Pr(S) > 0. The conditional expectation of X given S is
$$E[X \mid S] = \sum_{x \in X(\Omega)} x\, p_{X|S}(x),$$
where p_{X|S}(·) is defined in (7.3). Similarly, we have
$$E[g(X) \mid S] = \sum_{x \in X(\Omega)} g(x)\, p_{X|S}(x).$$

Example 62. Spring break is coming and a student decides to renew his shirt collection. The number of shirts purchased by the student is a random variable denoted by N. The PMF of this random variable is a geometric distribution with parameter p = 0.5. Any one shirt costs $10, $20 or $50 with respective probabilities 0.5, 0.3 and 0.2, independently of other shirts. We wish to compute the expected amount of money spent by the student during his shopping spree. Also, we wish to compute the expected amount of money disbursed given that the student buys at least five shirts.

Let C_i be the cost of the ith shirt. The total amount of money spent by the student, denoted by T, can be expressed as
$$T = \sum_{i=1}^{N} C_i .$$
The mean of T can be computed using nested conditional expectation. It is equal to
$$E[T] = E\left[\sum_{i=1}^{N} C_i\right] = E\left[E\left[\sum_{i=1}^{N} C_i \,\Big|\, N\right]\right] = E\left[\sum_{i=1}^{N} E[C_i \mid N]\right] = 21\, E[N] = 42.$$
The student is expected to spend $42. Given that the student buys at least five shirts, the conditional expectation becomes
$$E[T \mid N \geq 5] = E\left[\sum_{i=1}^{N} C_i \,\Big|\, N \geq 5\right] = E\left[E\left[\sum_{i=1}^{N} C_i \,\Big|\, N\right] \Big|\, N \geq 5\right] = 21\, E[N \mid N \geq 5] = 126.$$
Conditioned on buying at least five shirts, the student is expected to spend $126. Note that, in computing the conditional expectation, we have utilized the memoryless property of the geometric random variable,
$$E[N \mid N \geq 5] = E[N \mid N > 4] = 4 + E[N] = 6.$$
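A Monte Carlo check of this random-sum calculation is sketched below, assuming NumPy is installed; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
costs, probs = [10, 20, 50], [0.5, 0.3, 0.2]
trials = 100_000

N = rng.geometric(0.5, size=trials)                                 # number of shirts purchased
T = np.array([rng.choice(costs, size=n, p=probs).sum() for n in N]) # total cost per trial

print(T.mean())            # close to 42
print(T[N >= 5].mean())    # close to 126
```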

7.5 Independence

Let X and Y be two random variables associated with a same experiment. We say that X and Y are independent random variables if
$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$$
for every x ∈ X(Ω) and every y ∈ Y(Ω).

There is a clear relation between the concept of independence introduced in Section 3.4 and the independence of two random variables. Random variables X and Y are independent if and only if the events {X = x} and {Y = y} are independent for every pair (x, y) such that x ∈ X(Ω) and y ∈ Y(Ω).

Proposition 6. If X and Y are independent random variables, then
$$E[XY] = E[X]\, E[Y].$$
Proof. Assume that both E[X] and E[Y] exist, then
$$E[XY] = \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} x y\, p_{X,Y}(x, y) = \sum_{x \in X(\Omega)} \sum_{y \in Y(\Omega)} x y\, p_X(x)\, p_Y(y) = \sum_{x \in X(\Omega)} x\, p_X(x) \sum_{y \in Y(\Omega)} y\, p_Y(y) = E[X]\, E[Y],$$
where we have used the fact that p_{X,Y}(x, y) = p_X(x) p_Y(y) for independent random variables.

Example 63. Two dice of different colors are rolled, as in Example 56. We wish to compute the expected value of their product. We know that the mean of each roll is E[X] = E[Y] = 3.5. Furthermore, it is straightforward to show that X and Y, the numbers of dots on the two dice, are independent random variables. The expected value of their product is then equal to the product of the individual means, E[XY] = E[X]E[Y] = 12.25.

We can parallel the proof of Proposition 6 to show that
$$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)] \qquad (7.5)$$
whenever X and Y are independent random variables and the corresponding expectations exist.

7.5.1 Sums of Independent Random Variables

We turn to the question of determining the distribution of a sum of two independent random variables in terms of the marginal PMFs of the summands. Suppose X and Y are independent random variables that take on integer values. Let p_X(·) and p_Y(·) be the PMFs of X and Y, respectively. We wish to determine the distribution p_U(·) of U, where U = X + Y. To accomplish this task, it suffices to compute the probability that U assumes a value k, where k is an arbitrary integer,
$$p_U(k) = \Pr(U = k) = \Pr(X + Y = k) = \sum_{\{(x,y) \in X(\Omega) \times Y(\Omega) \,\mid\, x + y = k\}} p_{X,Y}(x, y) = \sum_{m \in \mathbb{Z}} p_{X,Y}(m, k - m) = \sum_{m \in \mathbb{Z}} p_X(m)\, p_Y(k - m).$$
The latter operation is called a discrete convolution. In particular, the PMF of U = X + Y is the discrete convolution of p_X(·) and p_Y(·) given by
$$p_U(k) = (p_X * p_Y)(k) = \sum_{m \in \mathbb{Z}} p_X(m)\, p_Y(k - m). \qquad (7.6)$$
The discrete convolution is commutative and associative.
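A short computational sketch of (7.6) appears below; it treats PMFs as dictionaries over the integers and reproduces the dice-sum table of Example 56.

```python
from collections import defaultdict

def convolve_pmf(p_x, p_y):
    """Discrete convolution (7.6) of two integer-valued PMFs given as dicts."""
    p_u = defaultdict(float)
    for m, px in p_x.items():
        for j, py in p_y.items():
            p_u[m + j] += px * py
    return dict(p_u)

die = {k: 1 / 6 for k in range(1, 7)}       # PMF of one fair die
print(convolve_pmf(die, die))               # p_U(7) = 0.1666..., p_U(2) = p_U(12) = 0.0277...
```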

One interesting application of (7.5) occurs when dealing with sums of independent random variables. Suppose X and Y are two independent random variables that take on integer values and, again, let U = X + Y. The ordinary generating function of U, as defined in Section 6.4, is given by
$$G_U(z) = E\left[z^U\right] = E\left[z^{X+Y}\right] = E\left[z^X z^Y\right] = E\left[z^X\right] E\left[z^Y\right] = G_X(z)\, G_Y(z).$$
That is, the generating function of a sum of independent random variables is equal to the product of the individual ordinary generating functions.

Example 64 (Sum of Poisson Random Variables). Let X be a Poisson random variable with parameter α and let Y be a Poisson random variable with parameter β. We wish to find the probability mass function of U = X + Y.

To solve this problem, we use ordinary generating functions. First, recall that the generating function of a Poisson random variable with parameter λ is e^{λ(z−1)}. The ordinary generating functions of X and Y are therefore equal to
$$G_X(z) = e^{\alpha(z-1)}, \qquad G_Y(z) = e^{\beta(z-1)}.$$
As such, the ordinary generating function of U = X + Y is
$$G_U(z) = G_X(z)\, G_Y(z) = e^{\alpha(z-1)} e^{\beta(z-1)} = e^{(\alpha+\beta)(z-1)}.$$
We conclude, by the uniqueness of generating functions, that U is a Poisson random variable with parameter α + β and, accordingly, we get
$$p_U(k) = \frac{(\alpha + \beta)^k}{k!} e^{-(\alpha+\beta)}, \quad k = 0, 1, 2, \ldots$$
This method of finding the PMF of U is more concise than using the discrete convolution.

We saw in Section 7.3 that a random variable can also be conditioned on a specific event. Let X be a random variable and let S be a non-trivial event. The variable X is independent of S if
$$\Pr(\{X = x\} \cap S) = p_X(x) \Pr(S)$$
for every x ∈ X(Ω). In particular, if X is independent of event S then
$$p_{X|S}(x) = p_X(x)$$
for all x ∈ X(Ω).


7.6 Numerous Random Variables

The notion of a joint distribution can be applied to any number of random variables. Let X_1, X_2, . . . , X_n be random variables; their joint PMF is defined by
$$p_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \Pr\left( \bigcap_{k=1}^{n} \{X_k = x_k\} \right),$$
or, in vector form, p_X(x) = Pr(X = x). When the random variables {X_k} are independent, their joint PMF reduces to
$$p_{\mathbf{X}}(\mathbf{x}) = \prod_{k=1}^{n} p_{X_k}(x_k).$$
Although not discussed in detail herein, we emphasize that most concepts introduced in this chapter can be extended to multiple random variables in a straightforward manner. Also, we note that, from an abstract perspective, the space defined by vectors of the form (X_1(ω), X_2(ω), . . . , X_n(ω)), ω ∈ Ω, is itself a sample space on which random variables can be defined.

Consider the empirical sums
$$S_n = \sum_{k=1}^{n} X_k, \quad n \geq 1,$$
where X_1, X_2, . . . are independent integer-valued random variables, each with marginal PMF p_X(·). Obviously, the distribution of S_1 is simply p_X(·). More generally, the PMF of S_n can be obtained recursively using the formula
$$S_n = S_{n-1} + X_n .$$
This leads to the PMF
$$p_{S_n}(k) = (p_X * p_X * \cdots * p_X)(k),$$
which is the n-fold convolution of p_X(·). Furthermore, the ordinary generating function of S_n can be written as
$$G_{S_n}(z) = \left(G_X(z)\right)^n .$$


Example 65 (Binomial Random Variables). Suppose that S_n is a sum of n independent and identically distributed Bernoulli random variables, each with parameter p ∈ (0, 1). The PMF of S_1 is equal to
$$p_{S_1}(k) = \begin{cases} 1 - p, & k = 0 \\ p, & k = 1. \end{cases}$$
Assume that the PMF of S_{n−1} is given by
$$p_{S_{n-1}}(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}, \quad k = 0, 1, \ldots, n-1.$$
Then, the distribution of S_n can be computed recursively using the discrete convolution,
$$\begin{aligned} p_{S_n}(k) &= \sum_{m=-\infty}^{\infty} p_{S_{n-1}}(m)\, p_X(k - m) = p_{S_{n-1}}(k)(1 - p) + p_{S_{n-1}}(k - 1)\, p \\ &= \binom{n-1}{k} p^k (1-p)^{n-k} + \binom{n-1}{k-1} p^k (1-p)^{n-k} = \binom{n}{k} p^k (1-p)^{n-k}, \end{aligned}$$
where k = 0, 1, . . . , n. Thus, by the principle of mathematical induction, we gather that the sum of n independent Bernoulli random variables, each with parameter p, is a binomial random variable with parameters n and p.

The convolution of two binomial distributions, one with parameters m and p and the other with parameters n and p, is also a binomial distribution with parameters (m + n) and p. This fact follows directly from the previous argument.
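The recursive convolution in this example can be reproduced directly. The sketch below, with arbitrary values of n and p, convolves Bernoulli PMFs and compares the result with the binomial formula.

```python
from math import comb

def convolve(p_a, p_b):
    # Discrete convolution of two PMFs on the nonnegative integers, given as lists.
    out = [0.0] * (len(p_a) + len(p_b) - 1)
    for i, a in enumerate(p_a):
        for j, b in enumerate(p_b):
            out[i + j] += a * b
    return out

p = 0.3
bernoulli = [1 - p, p]
pmf = [1.0]                        # PMF of the empty sum S_0
for _ in range(5):                 # convolve five Bernoulli(p) PMFs
    pmf = convolve(pmf, bernoulli)

print(pmf)
print([comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(6)])   # matches the binomial(5, p) PMF
```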

Example 66 (Negative Binomial Random Variable*). Suppose that a Bernoulli trial is repeated multiple times until r ones are obtained. We denote by X the random variable that represents the number of zeros observed before completion of the process. The distribution of X is given by
$$p_X(k) = \binom{k + r - 1}{r - 1} p^r (1-p)^k, \quad k = 0, 1, \ldots$$
where p is the common parameter of the Bernoulli trials. In general, a random variable with such a distribution is known as a negative binomial random variable.

We wish to show that X can be obtained as a sum of r independent random variables, a task which we perform by looking at its ordinary generating function. We use properties of differential equations to derive G_X(z) from p_X(·). By definition, we have
$$G_X(z) = \sum_{k=0}^{\infty} z^k \binom{k + r - 1}{r - 1} p^r (1-p)^k .$$
Consider the derivative of G_X(z),
$$\begin{aligned} \frac{dG_X}{dz}(z) &= \sum_{k=0}^{\infty} k z^{k-1} \binom{k + r - 1}{r - 1} p^r (1-p)^k = \sum_{k=1}^{\infty} z^{k-1} \frac{(k + r - 1)!}{(r-1)!\,(k-1)!} p^r (1-p)^k \\ &= (1 - p) \sum_{\ell=0}^{\infty} (\ell + r)\, z^{\ell} \binom{\ell + r - 1}{r - 1} p^r (1-p)^{\ell} = (1 - p) \left( z \frac{dG_X}{dz}(z) + r\, G_X(z) \right). \end{aligned}$$
It follows from this equation that
$$\frac{1}{G_X(z)} \frac{dG_X}{dz}(z) = \frac{(1-p)\, r}{1 - (1-p)z}$$
or, alternatively, we can write
$$\frac{d}{dz} \log\left(G_X(z)\right) = \frac{(1-p)\, r}{1 - (1-p)z}.$$
Integrating both sides yields
$$\log\left(G_X(z)\right) = -r \log\left(1 - (1-p)z\right) + c,$$
where c is an arbitrary constant. Applying the boundary condition G_X(1) = 1, we get the ordinary generating function of X as
$$G_X(z) = \left( \frac{p}{1 - (1-p)z} \right)^r .$$
From this equation, we can deduce that X is the sum of r independent random variables, Y_1, . . . , Y_r, each with ordinary generating function
$$G_{Y_i}(z) = \frac{p}{1 - (1-p)z}.$$
In particular, the distribution of Y_i is given by
$$p_{Y_i}(m) = \frac{1}{m!} \frac{d^m G_{Y_i}}{dz^m}(0) = p (1-p)^m, \quad m = 0, 1, \ldots$$
It may be instructive to compare this distribution with the PMF of a geometric random variable.

Further Reading

1. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002: Sections 2.5–2.7.
2. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall, 2006: Sections 6.1–6.4.
3. Gubner, J. A., Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006: Sections 3.4–3.5.

Chapter 8

Continuous Random Variables

So far, we have studied discrete random variables and we have explored their properties. Discrete random variables are quite useful in many contexts, yet they form only a small subset of the collection of random variables pertinent to applied probability and engineering. In this chapter, we consider random variables that range over a continuum of possible values; that is, random variables that can take on an uncountable set of values.

Continuous random variables are powerful mathematical abstractions that allow engineers to pose and solve important problems. Many of these problems are difficult to address using discrete models. While this extra flexibility is useful and desirable, it comes at a certain cost. A continuous random variable cannot be characterized by a probability mass function. This predicament emerges from the limitations of the third axiom of probability laws, which only applies to countable collections of disjoint events.

Below, we provide a definition for continuous random variables. Furthermore, we extend and apply the concepts and methods initially developed for discrete random variables to the class of continuous random variables. In particular, we develop a continuous counterpart to the probability mass function.

8.1 Cumulative Distribution Functions

We begin our exposition of continuous random variables by introducing a general concept which can be employed to bridge our understanding of discrete and continuous random variables. Recall that a random variable is a real-valued function acting on the outcomes of an experiment. In particular, given a sample space, random variable X is a function from Ω to R. The cumulative distribution function (CDF) of X is defined point-wise as the probability of the event {X ≤ x},
$$F_X(x) = \Pr(\{X \leq x\}) = \Pr(X \leq x).$$
In terms of the underlying sample space, F_X(x) denotes the probability of the set of all outcomes in Ω for which the value of X is less than or equal to x,
$$F_X(x) = \Pr\left(X^{-1}((-\infty, x])\right) = \Pr(\{\omega \in \Omega \mid X(\omega) \leq x\}).$$
In essence, the CDF is a convenient way to specify the probability of all events of the form {X ∈ (−∞, x]}.

The CDF of random variable X exists for any well-behaved function X : Ω → R. Moreover, since the realization of X is a real number, we have
$$\lim_{x \downarrow -\infty} F_X(x) = 0, \qquad \lim_{x \uparrow \infty} F_X(x) = 1.$$
Suppose x_1 < x_2; then we can write {X ≤ x_2} as the union of the two disjoint sets {X ≤ x_1} and {x_1 < X ≤ x_2}. It follows that
$$F_X(x_2) = \Pr(X \leq x_2) = \Pr(X \leq x_1) + \Pr(x_1 < X \leq x_2) \geq \Pr(X \leq x_1) = F_X(x_1). \qquad (8.1)$$
In other words, a CDF is always a non-decreasing function. Finally, we note from (8.1) that the probability of X falling in the interval (x_1, x_2] is
$$\Pr(x_1 < X \leq x_2) = F_X(x_2) - F_X(x_1). \qquad (8.2)$$

8.1.1 Discrete Random Variables

If X is a discrete random variable, then the CDF of X is given by
$$F_X(x) = \sum_{u \in X(\Omega) \cap (-\infty, x]} p_X(u),$$
and its PMF can be computed using the formula
$$p_X(x) = \Pr(X \leq x) - \Pr(X < x) = F_X(x) - \lim_{u \uparrow x} F_X(u).$$
Fortunately, this formula is simpler when the random variable X only takes integer values, as seen in the example below.

Figure 8.1: This figure shows the PMF of a discrete random variable, along with the corresponding CDF. The values of the PMF are depicted by the height of the rectangles; their cumulative sums lead to the values of the CDF.

Example 67. Let X be a geometric random variable with parameter p,
$$p_X(k) = (1-p)^{k-1} p, \quad k = 1, 2, \ldots$$
For x > 0, the CDF of X is given by
$$F_X(x) = \sum_{k=1}^{\lfloor x \rfloor} (1-p)^{k-1} p = 1 - (1-p)^{\lfloor x \rfloor},$$
where ⌊·⌋ denotes the standard floor function. For integer k ≥ 1, the PMF of geometric random variable X can be recovered from the CDF as follows,
$$p_X(k) = F_X(k) - \lim_{u \uparrow k} F_X(u) = F_X(k) - F_X(k-1) = \left(1 - (1-p)^k\right) - \left(1 - (1-p)^{k-1}\right) = (1-p)^{k-1} p.$$
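The recovery of the PMF from the CDF in Example 67 is easy to check numerically; the short sketch below uses an arbitrary value of p.

```python
p = 0.25

def cdf(x):
    # CDF of a geometric(p) random variable, valid for x > 0 (Example 67).
    return 1 - (1 - p) ** int(x)

for k in range(1, 6):
    recovered = cdf(k) - cdf(k - 1)        # F_X(k) - F_X(k - 1)
    direct = (1 - p) ** (k - 1) * p        # the geometric PMF
    print(k, recovered, direct)            # the two columns agree
```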


8.1.2 Continuous Random Variables

Having introduced the general notion of a CDF, we can safely provide a more precise definition for continuous random variables. Let X be a random variable with CDF F_X(·); then X is said to be a continuous random variable if F_X(·) is continuous and differentiable.

Example 68. Suppose that X is a random variable with CDF given by
$$F_X(x) = \begin{cases} 1 - e^{-x}, & x \geq 0 \\ 0, & x < 0. \end{cases}$$
This cumulative distribution function is differentiable with
$$\frac{dF_X}{dx}(x) = \begin{cases} e^{-x}, & x > 0 \\ 0, & x < 0 \end{cases}$$
and therefore X is a continuous random variable.

Figure 8.2: The CDF of a continuous random variable is differentiable. This figure provides one example of a continuous random variable. Both the CDF F_X(·) and its derivative f_X(·) (PDF) are displayed.


8.1.3 Mixed Random Variables*

Generally speaking, the CDF of a discrete random variable is a discontinuous staircase-like function, whereas the CDF of a continuous random variable is continuous and differentiable almost everywhere. There exist random variables for which neither situation applies. Such random variables are sometimes called mixed random variables. Our exposition of mixed random variables in this document is very limited. Still, we emphasize that a good understanding of discrete and continuous random variables is instrumental in understanding and solving problems involving mixed random variables.

Figure 8.3: This figure shows the CDF of a mixed random variable. In general, mixed random variables have neither a PMF nor a PDF. Their CDF may be composed of a mixture of differentiable intervals and discontinuous jumps.

8.2 Probability Density Functions

As mentioned above, the CDF of a continuous random variable X is a differentiable function. The derivative of F_X(·) is called the probability density function (PDF) of X, and it is denoted by f_X(·). If X is a random variable with PDF f_X(·) then, by the fundamental theorem of calculus, we have
$$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\, d\xi.$$
Equivalently, we can write
$$f_X(x) = \frac{dF_X}{dx}(x).$$
Note that PDFs are only defined for continuous random variables. This is somewhat restrictive. Nevertheless, the PDF can be a very powerful tool to derive properties of continuous random variables, which may otherwise be difficult to compute.

For x_1 < x_2, we can combine the definition of f_X(·) and (8.2) to obtain
$$\Pr(x_1 < X \leq x_2) = \int_{x_1}^{x_2} f_X(\xi)\, d\xi.$$
Furthermore, it is easily seen that for any continuous random variable
$$\Pr(X = x_2) = \lim_{x_1 \uparrow x_2} \Pr(x_1 < X \leq x_2) = \lim_{x_1 \uparrow x_2} \int_{x_1}^{x_2} f_X(\xi)\, d\xi = \int_{x_2}^{x_2} f_X(\xi)\, d\xi = 0.$$
In other words, if X is a continuous random variable, then Pr(X = x) = 0 for any real number x. An immediate corollary of this fact is
$$\Pr(x_1 < X < x_2) = \Pr(x_1 \leq X < x_2) = \Pr(x_1 < X \leq x_2) = \Pr(x_1 \leq X \leq x_2);$$
the inclusion or exclusion of endpoints in an interval does not affect the probability of the corresponding interval when X is a continuous random variable.

We can derive properties for the PDF of continuous random variable X based on the axioms of probability laws. First, the probability that X is a real number is given by
$$\Pr(-\infty < X < \infty) = \int_{-\infty}^{\infty} f_X(\xi)\, d\xi = 1.$$
Thus, f_X(x) must integrate to one. Also, because the probabilities of events are nonnegative, we must have f_X(x) ≥ 0 everywhere. Finally, given an admissible set S, the probability that X ∈ S can be expressed through an integral,
$$\Pr(X \in S) = \int_{S} f_X(\xi)\, d\xi.$$
Admissible events of the form {X ∈ S} are sets for which we know how to carry out this integral.


8.3 Important Distributions

Good intuition about continuous random variables can be developed by looking at examples. In this section, we introduce important random variables and their distributions. These random variables find widespread application in various fields of engineering.

8.3.1 The Uniform Distribution

A (continuous) uniform random variable is such that all intervals of the same length contained within its support are equally probable. The PDF of a uniform random variable is defined by two parameters, a and b, which represent the minimum and maximum values of its support, respectively. The PDF f_X(·) is given by
$$f_X(x) = \begin{cases} \frac{1}{b-a}, & x \in [a, b] \\ 0, & \text{otherwise.} \end{cases}$$
The associated cumulative distribution function becomes
$$F_X(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x \geq b. \end{cases}$$
Example 69. David comes to campus every morning riding the Aggie Spirit Transit. On his route, a bus comes every thirty minutes, from sunrise until dusk. David, who does not believe in alarm clocks or watches, wakes up with daylight. After cooking a hefty breakfast, he walks to the bus stop. If his arrival time at the bus stop is uniformly distributed between 9:00 a.m. and 9:30 a.m., what is the probability that he waits less than five minutes for the bus?

Let t_0 be the time at which David arrives at the bus stop, and denote by T the time he spends waiting. The time at which the next bus arrives at David's stop is uniformly distributed between t_0 and t_0 + 30. The amount of time that he spends at the bus stop is therefore uniformly distributed between 0 and 30 minutes. Accordingly, we have
$$f_T(t) = \begin{cases} \frac{1}{30}, & t \in [0, 30] \\ 0, & \text{otherwise.} \end{cases}$$

Figure 8.4: This figure shows the PDFs of uniform random variables with support intervals [0, 1], [0, 2] and [0, 4].

The probability that David waits less than five minutes is
$$\Pr(T < 5) = \int_{0}^{5} \frac{1}{30}\, dt = \frac{1}{6}.$$

8.3.2 The Gaussian (Normal) Random Variable

The Gaussian random variable is of fundamental importance in probability and statistics. It is often used to model distributions of quantities influenced by large numbers of small random components. The PDF of a Gaussian random variable is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-m)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$
where m and σ > 0 are real parameters.

A Gaussian variable whose distribution has parameters m = 0 and σ = 1 is called a normal random variable or a standard Gaussian random variable, names that hint at its popularity. The CDF of a Gaussian random variable does not admit a closed-form expression; it can be expressed as
$$F_X(x) = \Pr(X \leq x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(\xi-m)^2}{2\sigma^2}}\, d\xi = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(x-m)/\sigma} e^{-\frac{\zeta^2}{2}}\, d\zeta = \Phi\left(\frac{x-m}{\sigma}\right),$$
where Φ(·) is termed the standard normal cumulative distribution function and is defined by
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{\zeta^2}{2}}\, d\zeta.$$
We emphasize that the function Φ(·) is nothing more than a convenient notation for the CDF of a normal random variable.

Figure 8.5: The distributions of Gaussian random variables appear above for parameters m = 0 and σ² ∈ {1, 2, 4}.

Example 70. A binary signal is transmitted through a noisy communication channel. The sent signal takes on either a value of 1 or −1. The message received at the output of the communication channel is corrupted by additive thermal noise. This noise can be accurately modeled as a Gaussian random variable. The receiver declares that a 1 (−1) was transmitted if the received signal is positive (negative). What is the probability of making an erroneous decision?

Let S ∈ {−1, 1} denote the transmitted signal, N be the value of the thermal noise, and Y represent the value of the received signal. An error can occur in one of two possible ways: S = 1 was transmitted and Y is less than zero, or S = −1 was transmitted and Y is greater than zero. Using the total probability theorem, we can compute the probability of error as
$$\Pr(Y \geq 0 \mid S = -1) \Pr(S = -1) + \Pr(Y \leq 0 \mid S = 1) \Pr(S = 1).$$
By symmetry, it is easily argued that
$$\Pr(Y \leq 0 \mid S = 1) = \Pr(Y \geq 0 \mid S = -1) = \Pr(N > 1) = \int_{1}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{\xi^2}{2\sigma^2}}\, d\xi = 1 - \Phi\left(\frac{1}{\sigma}\right).$$
The probability that the receiver makes an erroneous decision is 1 − Φ(1/σ). The reliability of this transmission scheme depends on the amount of noise present at the receiver.

The normal random variable is so frequent in applied mathematics and engineering that many variations of its CDF possess their own names. The error function is a function which is primarily encountered in the fields of statistics and partial differential equations. It is defined by
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-\xi^2}\, d\xi.$$
The error function is related to the standard normal cumulative distribution function by scaling and translation,
$$\Phi(x) = \frac{1 + \operatorname{erf}\left(x/\sqrt{2}\right)}{2}.$$
If X is a standard normal random variable, then erf(x/√2) denotes the probability that X lies in the interval (−x, x). In engineering, it is customary to employ the Q-function, which is given by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-\frac{\xi^2}{2}}\, d\xi = 1 - \Phi(x) = \frac{1 - \operatorname{erf}\left(x/\sqrt{2}\right)}{2}. \qquad (8.3)$$
Equation (8.3) may prove useful when using software packages that provide a built-in implementation for erf(·), but not for the Q-function. The probability of an erroneous decision in Example 70 can be expressed concisely using the Q-function as Q(1/σ).
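As a small illustration of (8.3), the sketch below evaluates the Q-function through the complementary error function available in Python's standard math module; the value of σ is arbitrary.

```python
from math import erfc, sqrt

def Q(x):
    # Q(x) = (1 - erf(x / sqrt(2))) / 2, as in (8.3); erfc(u) = 1 - erf(u).
    return 0.5 * erfc(x / sqrt(2))

sigma = 0.5
print(Q(1 / sigma))   # error probability of Example 70; about 0.0228 when sigma = 0.5
```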

Next, we prove that the standard normal PDF integrates to one. The solution is easy to follow, but hard to discover. It is therefore useful to include it in this document. Consider a standard normal PDF,
$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.$$
We can show that f_X(x) integrates to one using a subtle argument and a change of variables. We start with the square of the integrated PDF and proceed from there,
$$\begin{aligned} \left( \int_{-\infty}^{\infty} f_X(\xi)\, d\xi \right)^2 &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{\xi^2}{2}}\, d\xi \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{\zeta^2}{2}}\, d\zeta = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{\xi^2 + \zeta^2}{2}}\, d\xi\, d\zeta \\ &= \int_{0}^{2\pi} \frac{1}{2\pi}\, d\theta \int_{0}^{\infty} e^{-\frac{r^2}{2}}\, r\, dr = \left[ -e^{-\frac{r^2}{2}} \right]_{0}^{\infty} = 1. \end{aligned}$$
Since the square of the desired integral is nonnegative and equal to one, we can conclude that the normal PDF integrates to one.

8.3.3 The Exponential Distribution

The exponential random variable is also frequently encountered in engineering. It can be used to model the lifetime of devices and systems, and the time elapsed between specific occurrences. An exponential random variable X with parameter λ > 0 has PDF
$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0.$$
For x ≥ 0, its CDF is equal to
$$F_X(x) = 1 - e^{-\lambda x}.$$
The parameter λ characterizes the rate at which events occur.

Example 71. Connection requests at an Internet server are characterized by an exponential inter-arrival time with parameter λ = 1/2. If a request arrives at time t_0, what is the probability that the next request arrives within two minutes?

The probability that the inter-arrival time T is less than two minutes can be computed as
$$\Pr(T < 2) = \int_{0}^{2} \frac{1}{2} e^{-\frac{\xi}{2}}\, d\xi = \left[ -e^{-\frac{\xi}{2}} \right]_{0}^{2} = 1 - e^{-1} \approx 0.632.$$

Figure 8.6: The distributions of exponential random variables are shown above for parameters λ ∈ {0.5, 1, 2}.

The exponential random variable can be obtained as the limit of a sequence of geometric random variables. Let λ be fixed and define p_n = λ/n. We define the PMF of random variable Y_n as
$$p_{Y_n}(k) = (1 - p_n)^{k-1} p_n = \left(1 - \frac{\lambda}{n}\right)^{k-1} \frac{\lambda}{n}, \quad k = 1, 2, \ldots$$
That is, random variable Y_n is a standard geometric random variable with parameter p_n = λ/n. For every n, we create a new variable X_n,
$$X_n = \frac{Y_n}{n}.$$
By construction, the random variable X_n has PMF
$$p_{X_n}(y) = \begin{cases} (1 - p_n)^{k-1} p_n, & \text{if } y = k/n \\ 0, & \text{otherwise.} \end{cases}$$
For any x ≥ 0, the CDF of random variable X_n can be computed as
$$\Pr(X_n \leq x) = \Pr(Y_n \leq nx) = \sum_{k=1}^{\lfloor nx \rfloor} p_{Y_n}(k) = \sum_{k=1}^{\lfloor nx \rfloor} (1 - p_n)^{k-1} p_n = 1 - (1 - p_n)^{\lfloor nx \rfloor}.$$
In the limit, as n grows unbounded, we get
$$\lim_{n \to \infty} \Pr(X_n \leq x) = \lim_{n \to \infty} \left( 1 - (1 - p_n)^{\lfloor nx \rfloor} \right) = 1 - \lim_{n \to \infty} \left( 1 - \frac{\lambda}{n} \right)^{\lfloor nx \rfloor} = 1 - e^{-\lambda x}.$$
Thus, the sequence of scaled geometric random variables {X_n} converges in distribution to an exponential random variable X with parameter λ.

Memoryless Property: In view of this asymptotic characterization and the fact that geometric random variables are memoryless, it is not surprising that the exponential random variable also satisfies the memoryless property,
$$\Pr(X > t + u \mid X > t) = \Pr(X > u).$$
This fact can be shown by a straightforward application of conditional probability. Suppose that X is an exponential random variable with parameter λ. Also, let t and u be two positive numbers. The memoryless property can be verified by expanding the conditional probability of X using definition (3.2),
$$\Pr(X > t + u \mid X > t) = \frac{\Pr(\{X > t + u\} \cap \{X > t\})}{\Pr(X > t)} = \frac{\Pr(X > t + u)}{\Pr(X > t)} = \frac{e^{-\lambda(t+u)}}{e^{-\lambda t}} = e^{-\lambda u} = \Pr(X > u).$$
In fact, the exponential random variable is the only continuous random variable that satisfies the memoryless property.

Example 72. A prominent company, Century Oak Networks, maintains a bank of servers for its operation. Hard drives on the servers have a half-life of two years. We wish to compute the probability that a specific disk needs repair within its first year of usage.

Half-lives are typically used to describe quantities that undergo exponential decay. Let T denote the time elapsed until failure of the disk. We know that T is an exponential random variable and, although we are not given λ explicitly, we know that
$$\Pr(T > 2) = \frac{1}{2}.$$
We use the memoryless property to solve this problem,
$$\Pr(T > 2) = \Pr(T > 1) \Pr(T > 1 + 1 \mid T > 1) = \Pr(T > 1) \Pr(T > 1) = \left(\Pr(T > 1)\right)^2 .$$
It follows that Pr(T > 1) = √(Pr(T > 2)) = 1/√2. We can then write Pr(T < 1) = 1 − Pr(T > 1) ≈ 0.293. An alternative way to solve this problem would be to first find the value of λ associated with T, and then compute Pr(T < 1) from the corresponding integral.
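The alternative route mentioned above is sketched below: the half-life condition fixes λ, and the exponential CDF then gives the same answer as the memoryless argument.

```python
from math import exp, log, sqrt

# Half-life of two years: Pr(T > 2) = 1/2 fixes the rate lambda = ln(2) / 2.
lam = log(2) / 2

p_fail_first_year = 1 - exp(-lam * 1)   # Pr(T < 1) from the exponential CDF
print(p_fail_first_year)                # about 0.2929
print(1 - 1 / sqrt(2))                  # the memoryless-property answer, same value
```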

8.4 Additional Distributions

Probability distributions arise in many different contexts and they assume various forms. We conclude this first chapter on continuous random variables by mentioning a few additional distributions that find application in engineering. It is interesting to note the interconnection between various random variables and their corresponding probability distributions.

8.4.1 The Gamma Distribution

The gamma PDF defines a versatile collection of distributions. The PDF of a gamma random variable is given by
$$f_X(x) = \frac{\lambda (\lambda x)^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)}, \quad x > 0,$$
where Γ(·) denotes the gamma function defined by
$$\Gamma(z) = \int_{0}^{\infty} \xi^{z-1} e^{-\xi}\, d\xi, \quad z > 0.$$
The two parameters α > 0 and λ > 0 affect the shape of the ensuing distribution significantly. By varying these two parameters, it is possible for the gamma PDF to accurately model a wide array of empirical data.

The gamma function can be evaluated recursively using integration by parts; this yields the relation Γ(z + 1) = zΓ(z) for z > 0. For nonnegative integers, it can easily be shown that Γ(k + 1) = k!. Perhaps the most well-known value for the gamma function at a non-integer argument is Γ(1/2) = √π. Interestingly, this specific value for the gamma function can be evaluated by a procedure similar to the one we used to integrate the Gaussian distribution,
$$\begin{aligned} \Gamma\!\left(\tfrac{1}{2}\right)^2 &= \int_{0}^{\infty} \xi^{-\frac{1}{2}} e^{-\xi}\, d\xi \int_{0}^{\infty} \zeta^{-\frac{1}{2}} e^{-\zeta}\, d\zeta = \int_{0}^{\infty} \int_{0}^{\infty} \xi^{-\frac{1}{2}} \zeta^{-\frac{1}{2}} e^{-(\xi + \zeta)}\, d\xi\, d\zeta \\ &= \int_{0}^{\pi/2} \int_{0}^{\infty} \frac{1}{r^2 \sin\theta \cos\theta}\, e^{-r^2}\, 4 r^3 \sin\theta \cos\theta\, dr\, d\theta = \int_{0}^{\pi/2} \int_{0}^{\infty} e^{-r^2}\, 4r\, dr\, d\theta = \pi. \end{aligned}$$
Many common distributions are special cases of the gamma distribution, as seen in Figure 8.7.

Figure 8.7: Gamma distributions form a two-parameter family of PDFs and, depending on (α, λ), they can be employed to model various situations. The parameters used above are (1, 0.5) for the exponential distribution, (2, 0.5) for the chi-square distribution and (4, 2) for the Erlang distribution; they are all instances of gamma distributions.

The Exponential Distribution: When α = 1, the gamma distribution simply reduces to the exponential distribution discussed in Section 8.3.3.

The Chi-Square Distribution: When λ = 1/2 and α = k/2 for some positive integer k, the gamma distribution becomes a chi-square distribution,
$$f_X(x) = \frac{x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}}{2^{\frac{k}{2}}\, \Gamma(k/2)}, \quad x > 0.$$
The chi-square distribution is one of the probability distributions most widely used in statistical inference problems. Interestingly, the sum of the squares of k independent standard normal random variables leads to a chi-square variable with k degrees of freedom.

The Erlang Distribution: When α = m, a positive integer, the gamma distribution is called an Erlang distribution. This distribution finds application in queueing theory. Its PDF is given by
$$f_X(x) = \frac{\lambda (\lambda x)^{m-1} e^{-\lambda x}}{(m-1)!}, \quad x > 0.$$
An m-Erlang random variable can be obtained by summing m independent exponential random variables. Specifically, let X_1, X_2, . . . , X_m be independent exponential random variables, each with parameter λ > 0. The random variable S_m given by
$$S_m = \sum_{k=1}^{m} X_k$$
is an Erlang random variable with parameters m and λ.

Example 73. Suppose that the requests arriving at a computer server on the Internet are characterized by independent, memoryless inter-arrival periods. Let S_m be a random variable that denotes the time instant of the mth arrival; then S_m is an Erlang random variable.
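The sum-of-exponentials characterization of the Erlang distribution can be checked by simulation. The sketch below assumes NumPy is installed; the seed and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam, trials = 4, 2.0, 100_000

# Sum m independent Exponential(lam) variables ...
S = rng.exponential(scale=1 / lam, size=(trials, m)).sum(axis=1)

# ... and compare with direct Gamma(shape=m, scale=1/lam) draws (the Erlang distribution).
G = rng.gamma(shape=m, scale=1 / lam, size=trials)

print(S.mean(), G.mean())   # both close to m / lam = 2.0
print(S.var(), G.var())     # both close to m / lam**2 = 1.0
```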

8.4.2 The Rayleigh Distribution

The Rayleigh PDF is given by
$$f_R(r) = \frac{r}{\sigma^2} e^{-\frac{r^2}{2\sigma^2}}, \quad r \geq 0.$$
The Rayleigh distribution arises in the context of wireless communications. Suppose that X and Y are two independent normal random variables; then the magnitude of the random vector (X, Y) possesses a Rayleigh distribution. Also, if R is a Rayleigh random variable, then R² has an exponential distribution.

Figure 8.8: This figure plots the distributions of Rayleigh random variables for parameters σ² ∈ {1, 2, 4}.

Example 74. Radio signals propagating through wireless media get reflected, refracted and diffracted. This creates variations in signal strength at the destination, a phenomenon known as fading. Rayleigh random variables are often employed to model amplitude fluctuations of radio signals in urban environments.

8.4.3 The Laplace Distribution

The Laplace distribution is sometimes called a double exponential distribution because it can be thought of as an exponential function and its reflection spliced together. The PDF of a Laplacian random variable can then be written as
$$f_X(x) = \frac{1}{2b} e^{-\frac{|x|}{b}}, \quad x \in \mathbb{R},$$
where b is a positive constant. The difference between two independent and identically distributed exponential random variables is governed by a Laplace distribution.

Figure 8.9: The PDF of a Laplace random variable can be constructed using an exponential function and its reflection spliced together. This figure shows Laplace PDFs for parameters b ∈ {0.5, 1, 2}.

8.4.4 The Cauchy Distribution

The Cauchy distribution is considered a heavy-tail distribution because its tail is not exponentially bounded. The PDF of a Cauchy random variable is given by
$$f_X(x) = \frac{\gamma}{\pi (\gamma^2 + x^2)}, \quad x \in \mathbb{R}.$$
An interesting fact about this distribution is that its mean, variance and all higher moments are undefined. Moreover, if X_1, X_2, . . . , X_n are independent random variables, each with a standard Cauchy distribution, then the sample mean (X_1 + X_2 + · · · + X_n)/n possesses the same Cauchy distribution. Cauchy random variables appear in detection theory to model communication systems subject to extreme noise conditions; they also find applications in physics. Physicists sometimes refer to this distribution as the Lorentz distribution.

Figure 8.10: Cauchy distributions are categorized as heavy-tail distributions because of their very slow decay. The PDFs of Cauchy random variables are plotted above for parameters γ ∈ {0.5, 1, 2}.

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall, 2006: Sections 5.1, 5.3–5.6.
2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002: Sections 3.1–3.3.
3. Gubner, J. A., Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006: Sections 4.1, 5.1.
4. Miller, S. L., and Childers, D. G., Probability and Random Processes with Applications to Signal Processing and Communications, 2004: Sections 3.1–3.4.

Chapter 9

Functions and Derived Distributions

We already know from our previous discussion that it is possible to form new random variables by applying real-valued functions to existing discrete random variables. In a similar manner, it is possible to generate a new random variable Y by taking a well-behaved function g(·) of a continuous random variable X. The graphical interpretation of this notion is analogous to the discrete case and appears in Figure 9.1.

Figure 9.1: A function of a random variable is a random variable itself. In this figure, Y is obtained by applying function g(·) to the value of continuous random variable X.

Suppose X is a continuous random variable and let g(·) be a real-valued function. The function composition Y = g(X) is itself a random variable. The probability that Y falls in a specific set S depends on both the function g(·) and the PDF of X,
$$\Pr(Y \in S) = \Pr(g(X) \in S) = \Pr\left(X \in g^{-1}(S)\right) = \int_{g^{-1}(S)} f_X(\xi)\, d\xi,$$
where g^{-1}(S) = {ξ ∈ X(Ω) | g(ξ) ∈ S} denotes the preimage of S. In particular, we can derive the CDF of Y using the formula
$$F_Y(y) = \Pr(g(X) \leq y) = \int_{\{\xi \in X(\Omega) \,\mid\, g(\xi) \leq y\}} f_X(\xi)\, d\xi. \qquad (9.1)$$

Example 75. Let X be a Rayleigh random variable with parameter σ² = 1, and define Y = X². We wish to find the distribution of Y. Using (9.1), we can compute the CDF of Y. For y > 0, we get
$$F_Y(y) = \Pr(Y \leq y) = \Pr\left(X^2 \leq y\right) = \Pr(-\sqrt{y} \leq X \leq \sqrt{y}) = \int_{0}^{\sqrt{y}} \xi e^{-\frac{\xi^2}{2}}\, d\xi = \int_{0}^{y} \frac{1}{2} e^{-\frac{\zeta}{2}}\, d\zeta = 1 - e^{-\frac{y}{2}}.$$
In this derivation, we use the fact that X ≥ 0 in identifying the boundaries of integration, and we apply the change of variables ζ = ξ² in computing the integral. We recognize F_Y(·) as the CDF of an exponential random variable. This shows that the square of a Rayleigh random variable possesses an exponential distribution.
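This relation between the Rayleigh and exponential distributions can also be observed empirically. The following sketch assumes NumPy is installed; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000

R = rng.rayleigh(scale=1.0, size=trials)     # Rayleigh with sigma^2 = 1
Y = R ** 2

print(Y.mean(), Y.var())                     # about 2 and 4, matching an Exponential(1/2) variable
print(np.mean(Y <= 1.0), 1 - np.exp(-0.5))   # empirical CDF at 1 versus 1 - e^{-1/2}
```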

In general, the fact that X is a continuous random variable does not provide much information about the properties of Y = g(X). For instance, Y could be a continuous random variable, a discrete random variable or neither. To gain a better understanding of derived distributions, we begin our exposition of functions of continuous random variables by exploring specific cases.

9.1 Monotone Functions

A monotonic function is a function that preserves a given order. For instance, g(·) is monotone increasing if, for all x_1 and x_2 such that x_1 ≤ x_2, we have g(x_1) ≤ g(x_2). Likewise, a function g(·) is termed monotone decreasing provided that g(x_1) ≥ g(x_2) whenever x_1 ≤ x_2. If the inequalities above are replaced by strict inequalities (< and >), then the corresponding functions are said to be strictly monotonic. Monotonic functions of random variables are straightforward to handle because they admit a simple characterization of their derived CDFs. For a non-decreasing function g(·) of continuous random variable X, we have
$$F_Y(y) = \Pr(Y \leq y) = \Pr(g(X) \leq y) = \Pr(g(X) \in (-\infty, y]) = \Pr\left(X \in g^{-1}((-\infty, y])\right) = \Pr\left(X \leq \sup\left\{ g^{-1}((-\infty, y]) \right\}\right) = F_X\left( \sup\left\{ g^{-1}((-\infty, y]) \right\} \right). \qquad (9.2)$$
The supremum comes from the fact that multiple values of x may lead to a same value of y; that is, the preimage g^{-1}(y) = {x | g(x) = y} may contain several elements. Furthermore, g(·) may be discontinuous and g^{-1}(y) may not contain any value. These scenarios all need to be accounted for in our expression, and this is accomplished by selecting the largest value in the set g^{-1}((−∞, y]).

Figure 9.2: In this figure, Y is obtained by passing random variable X through a function g(·). The preimage of point y contains several elements, as seen above.

Example 76. Let X be a continuous random variable uniformly distributed over the interval [0, 1]. We wish to characterize the derived distribution of Y = 2X. This can be accomplished as follows. For y ∈ [0, 2], we get
$$F_Y(y) = \Pr(Y \leq y) = \Pr\left(X \leq \frac{y}{2}\right) = \int_{0}^{\frac{y}{2}} dx = \frac{y}{2}.$$
In particular, Y is a uniform random variable with support [0, 2]. By taking derivatives, we obtain the PDF of Y as
$$f_Y(y) = \begin{cases} \frac{1}{2}, & y \in [0, 2] \\ 0, & \text{otherwise.} \end{cases}$$
More generally, an affine function of a uniform random variable is also a uniform random variable.

Figure 9.3: If g(·) is monotone increasing and discontinuous, then g^{-1}(y) can be empty, whereas g^{-1}((−∞, y]) is typically a well-defined interval. It is therefore advisable to define F_Y(y) in terms of g^{-1}((−∞, y]).

The same methodology applies to non-increasing functions. Suppose that g(·) is monotone decreasing, and let Y = g(X) be a function of continuous random variable X. The CDF of Y is then equal to
$$F_Y(y) = \Pr(Y \leq y) = \Pr\left(X \in g^{-1}((-\infty, y])\right) = \Pr\left(X \geq \inf\left\{ g^{-1}((-\infty, y]) \right\}\right) = 1 - F_X\left( \inf\left\{ g^{-1}((-\infty, y]) \right\} \right). \qquad (9.3)$$
This formula is similar to the previous case in that the infimum accounts for the fact that the preimage g^{-1}(y) = {x | g(x) = y} may contain numerous elements or no elements at all.


9.2 Differentiable Functions

To further our understanding of derived distributions, we next consider the situation where g(·) is a differentiable and strictly increasing function. Note that, with these two properties, g(·) becomes an invertible function. It is therefore possible to write x = g^{-1}(y) unambiguously, as the value of x is unique. In such a case, the CDF of Y = g(X) becomes
$$F_Y(y) = \Pr\left(X \leq g^{-1}(y)\right) = F_X\left(g^{-1}(y)\right).$$
Differentiating this equation with respect to y, we obtain the PDF of Y,
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X\left(g^{-1}(y)\right) = f_X\left(g^{-1}(y)\right) \frac{d}{dy} g^{-1}(y) = f_X\left(g^{-1}(y)\right) \frac{dx}{dy}.$$
With the simple substitution x = g^{-1}(y), we get
$$f_Y(y) = f_X(x)\, \frac{dx}{dy} = \frac{f_X(x)}{\frac{dg}{dx}(x)}.$$
Note that dg/dx(x) = |dg/dx(x)| is strictly positive because g(·) is a strictly increasing function. From this analysis, we gather that Y = g(X) is a continuous random variable. In addition, we can express the PDF of Y = g(X) in terms of the PDF of X and the derivative of g(·), as seen above.

Likewise, suppose that g(·) is differentiable and strictly decreasing. We can write the CDF of random variable Y = g(X) as follows,

F_Y(y) = Pr(g(X) ≤ y) = Pr(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)).

Its PDF is given by

f_Y(y) = d/dy [1 − F_X(g^{−1}(y))] = f_X(x) / (−(dg/dx)(x)),

where again x = g^{−1}(y). We point out that (dg/dx)(x) = −|(dg/dx)(x)| is strictly negative because g(·) is a strictly decreasing function. As before, we find that Y = g(X) is a continuous random variable and the PDF of Y can be expressed in terms of f_X(·) and the derivative of g(·).

Figure 9.4: This figure provides a graphical interpretation of why the derivative of g(·) plays an important role in determining the value of the derived PDF f_Y(·). For an interval of width δ on the y-axis, the size of the corresponding interval on the x-axis depends heavily on the derivative of g(·): a small slope leads to a wide interval, whereas a steep slope produces a narrow interval on the x-axis.

Combining these two expressions, we observe that, when g(·) is differentiable and strictly monotone, the PDF of Y becomes

f_Y(y) = f_X(g^{−1}(y)) |dx/dy| = f_X(x) / |(dg/dx)(x)|              (9.4)

where x = g^{−1}(y). The role of |(dg/dx)(·)| in finding the derived PDF f_Y(·) is illustrated in Figure 9.4.

Example 77. Suppose that X is a Gaussian random variable with PDF

f_X(x) = (1/√(2π)) e^{−x²/2}.

We wish to find the PDF of random variable Y where Y = aX + b and a ≠ 0. In this example, we have g(x) = ax + b, and g(·) is immediately recognized as a strictly monotonic function. The inverse of g(·) is equal to

x = g^{−1}(y) = (y − b)/a,

and the desired derivative is given by

dx/dy = 1 / (dg/dx)(x) = 1/a.

The PDF of Y can be computed using (9.4), and is found to be

f_Y(y) = f_X(g^{−1}(y)) |dx/dy| = (1/(√(2π)|a|)) e^{−(y−b)²/(2a²)},

which is itself a Gaussian distribution. Using a similar progression, we can show that an affine function of any Gaussian random variable necessarily remains a Gaussian random variable (provided a ≠ 0).

Example 78. Suppose X is a Rayleigh random variable with parameter σ² = 1, and let Y = X². We wish to derive the distribution of random variable Y using the PDF of X.

Recall that the distribution of Rayleigh random variable X is given by

f_X(x) = x e^{−x²/2},   x ≥ 0.

Since Y is the square of X, we have g(x) = x². Note that X is a nonnegative random variable and g(x) = x² is strictly monotonic over [0, ∞). The PDF of Y is therefore found to be

f_Y(y) = f_X(x) / (dg/dx)(x) = f_X(√y) / (dg/dx)(√y) = (√y e^{−y/2}) / (2√y) = (1/2) e^{−y/2},

where y ≥ 0. Thus, random variable Y possesses an exponential distribution with parameter 1/2. It may be instructive to compare this derivation with the steps outlined in Example 75.
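The identity in Example 78 is easy to probe numerically. The following minimal sketch, added here for illustration and assuming only that NumPy is available, draws Rayleigh(σ² = 1) samples, squares them, and compares the empirical mean and variance with those of an exponential distribution with parameter 1/2 (mean 2 and variance 4).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.rayleigh(scale=1.0, size=100_000)   # Rayleigh samples with sigma^2 = 1
    y = x**2                                    # derived random variable Y = X^2

    # An exponential distribution with parameter 1/2 has mean 2 and variance 4.
    print(y.mean(), y.var())                    # both values should be close to 2 and 4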

Finally, suppose that g(·) is a differentiable function with a finite number of local extrema. Then, g(·) is piecewise monotonic and we can write the PDF of Y = g(X) as

f_Y(y) = Σ_{x ∈ X(Ω) : g(x) = y} f_X(x) / |(dg/dx)(x)|              (9.5)

for (almost) all values of y ∈ R. That is, f_Y(y) is obtained by first identifying the values of x for which g(x) = y. The PDF of Y is then computed explicitly by finding the local contribution of each of these values to f_Y(y) using the methodology developed above. This is accomplished by applying (9.4) repetitively to every value of x for which g(x) = y. It is certainly useful to compare (9.5) to its discrete equivalent (5.4), which is easier to understand and visualize.

Figure 9.5: The PDF of Y = g(X) when X is a continuous random variable and g(·) is differentiable with a finite number of local extrema is obtained by first identifying all the values of x for which g(x) = y, and then calculating the contribution of each of these values to f_Y(y) using (9.4). The end result leads to (9.5).

Example 79. Suppose X is a continuous random variable uniformly distributed over [0, 2π). Let Y = cos(X), the random sampling of a sinusoidal waveform. We wish to find the PDF of Y.

For y ∈ (−1, 1), the preimage g^{−1}(y) contains two values in [0, 2π), namely arccos(y) and 2π − arccos(y). Recall that the derivative of cos(x) is given by

d/dx cos(x) = −sin(x).

Collecting these results, we can write the PDF of Y as

f_Y(y) = f_X(arccos(y)) / |−sin(arccos(y))| + f_X(2π − arccos(y)) / |−sin(2π − arccos(y))|
       = 1/(2π√(1 − y²)) + 1/(2π√(1 − y²)) = 1/(π√(1 − y²)),

where −1 < y < 1. The CDF of Y can be obtained by integrating f_Y(y). Not surprisingly, solving this integral involves a trigonometric substitution.
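As a quick sanity check on Example 79 (added here; not part of the original notes and assuming NumPy is available), one can histogram samples of Y = cos(X) and compare them with the density 1/(π√(1 − y²)):

    import numpy as np

    rng = np.random.default_rng(1)
    y = np.cos(rng.uniform(0.0, 2.0 * np.pi, size=200_000))

    # Compare an empirical histogram with f_Y(y) = 1 / (pi * sqrt(1 - y^2)).
    hist, edges = np.histogram(y, bins=50, range=(-0.99, 0.99), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    f = 1.0 / (np.pi * np.sqrt(1.0 - centers**2))
    print(np.max(np.abs(hist - f)))   # small, except near the endpoints where f_Y blows up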

9.3 Generating Random Variables

In many engineering projects, computer simulations are employed as a first step in validating concepts or comparing various design candidates. Many such tasks involve the generation of random variables. In this section, we explore a method to generate arbitrary random variables based on a routine that outputs a random value uniformly distributed between zero and one.

9.3.1 Continuous Random Variables

First, we consider a scenario where the simulation task requires the generation of a continuous random variable. We begin our exposition with a simple observation. Let X be a continuous random variable with PDF f_X(·). Consider the random variable Y = F_X(X). Since F_X(·) is differentiable and strictly increasing over the support of X, we get

f_Y(y) = f_X(x) / (dF_X/dx)(x) = f_X(x) / |f_X(x)| = 1,

where y ∈ (0, 1) and x = F_X^{−1}(y). The PDF of Y is zero outside of this interval because 0 ≤ F_X(x) ≤ 1. Thus, using an arbitrary continuous random variable X, we can generate a uniform random variable Y with PDF

f_Y(y) = { 1,  y ∈ (0, 1)
         { 0,  otherwise.

This observation provides valuable insight about our original goal. Suppose that Y is a continuous random variable uniformly distributed over [0, 1]. We wish to generate a continuous random variable with CDF F_X(·). First, we note that, when F_X(·) is invertible, we have

F_X^{−1}(F_X(X)) = X.

Thus, applying F_X^{−1}(·) to uniform random variable Y should lead to the desired result. Define V = F_X^{−1}(Y), and consider the PDF of V. Using our knowledge of derived distributions, we get

f_V(v) = f_Y(y) / (dF_X^{−1}/dy)(y) = f_Y(y) (dF_X/dv)(v) = f_X(v),

where y = F_X(v). Note that f_Y(y) = 1 for any y ∈ [0, 1] because Y is uniform over the unit interval. Hence the PDF of F_X^{−1}(Y) possesses the desired structure. We stress that this technique can be utilized to generate any random variable with PDF f_X(·) using a computer routine that outputs a random value uniformly distributed between zero and one. In other words, to create a continuous random variable X with CDF F_X(·), one can apply the function F_X^{−1}(·) to a random variable Y that is uniformly distributed over [0, 1].

Example 80. Suppose that Y is a continuous random variable uniformly distributed over [0, 1]. We wish to create an exponential random variable X with parameter λ by taking a function of Y.

Random variable X is nonnegative, and its CDF is given by F_X(x) = 1 − e^{−λx} for x ≥ 0. The inverse of F_X(·) is given by

F_X^{−1}(y) = −(1/λ) log(1 − y).

We can therefore generate the desired random variable X with

X = −(1/λ) log(1 − Y).

Indeed, for x ≥ 0, we obtain

f_X(x) = f_Y(y) / (1/(λ(1 − y))) = λ e^{−λx},

where we have implicitly defined y = 1 − e^{−λx}. This is the desired distribution.
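The recipe above translates directly into code. The sketch below is an added illustration (it is not from the original text and assumes NumPy is available); it turns uniform samples into exponential samples through X = −(1/λ) log(1 − Y) and checks the empirical mean against 1/λ.

    import numpy as np

    def sample_exponential(lam, size, rng):
        """Inverse-transform sampling: apply F_X^{-1} to Uniform(0, 1) draws."""
        y = rng.uniform(0.0, 1.0, size)
        return -np.log(1.0 - y) / lam

    rng = np.random.default_rng(2)
    lam = 2.5                       # hypothetical rate parameter for illustration
    x = sample_exponential(lam, 100_000, rng)
    print(x.mean(), 1.0 / lam)      # the two values should be close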

9.3.2 Discrete Random Variables

It is equally straightforward to generate a discrete random variable from a continuous random variable that is uniformly distributed between zero and one. Let p_X(·) be a PMF, and denote its support by {x_1, x_2, . . .} where x_i < x_j whenever i < j. We know that the corresponding CDF is given by

F_X(x) = Σ_{x_i ≤ x} p_X(x_i).

We can generate a random variable X with PMF p_X(·) with the following case function,

g(y) = { x_i,  if F_X(x_{i−1}) < y ≤ F_X(x_i)
       { 0,    otherwise.

Note that we have used the convention x_0 = 0 to simplify the definition of g(·). Taking X = g(Y), we get

Pr(X = x_i) = Pr(F_X(x_{i−1}) < Y ≤ F_X(x_i)) = F_X(x_i) − F_X(x_{i−1}) = p_X(x_i).

Of course, implementing a discrete random variable through a case statement may lead to an excessively slow routine. For many discrete random variables, there are much more efficient ways to generate a specific output.
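One common refinement of the case function g(·) is to replace the chain of comparisons with a binary search over the cumulative probabilities. The minimal sketch below is an added illustration assuming NumPy; the particular support and PMF are arbitrary choices, not values taken from the text.

    import numpy as np

    def sample_discrete(support, pmf, size, rng):
        """Map Uniform(0, 1) draws through the inverse of the discrete CDF."""
        cdf = np.cumsum(pmf)                          # F_X(x_1), F_X(x_2), ...
        y = rng.uniform(0.0, 1.0, size)
        idx = np.searchsorted(cdf, y, side='left')    # smallest i with y <= F_X(x_i)
        return np.asarray(support)[idx]

    rng = np.random.default_rng(3)
    support = [1, 2, 5]
    pmf = [0.2, 0.5, 0.3]
    samples = sample_discrete(support, pmf, 100_000, rng)
    print([np.mean(samples == v) for v in support])   # roughly [0.2, 0.5, 0.3]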

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Section 5.7.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 3.6.

3. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Section 4.6.

4. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 1.1,1.3–1.4.

5. Mitzenmacher, M., and Upfal, E., Probability and Computing: Randomized

Algorithms and Probabilistic Analysis, Cambridge, 2005: Chapters 1 & 10.

Chapter 10

Expectations and Bounds

The concept of expectation, which was originally introduced in the context

of discrete random variables, can be generalized to other types of random

variables. For instance, the expectation of a continuous random variable is

deﬁned in terms of its probability density function (PDF). We know from our

previous discussion that expectations provide an eﬀective way to summarize

the information contained in the distribution of a random variable. As we

will see shortly, expectations are also very valuable in establishing bounds on

probabilities.

10.1 Expectations Revisited

The definition of an expectation associated with a continuous random variable is very similar to its discrete counterpart; the weighted sum is simply replaced by a weighted integral. For a continuous random variable X with PDF f_X(·), the expectation of g(X) is defined by

E[g(X)] = ∫_R g(ξ) f_X(ξ) dξ.

In particular, the mean of X is equal to

E[X] = ∫_R ξ f_X(ξ) dξ

and its variance becomes

Var(X) = E[(X − E[X])²] = ∫_R (ξ − E[X])² f_X(ξ) dξ.

As before, the variance of random variable X can also be computed using Var(X) = E[X²] − (E[X])².

Example 81. We wish to calculate the mean and variance of a Gaussian random variable with parameters m and σ². By definition, the PDF of this random variable can be written as

f_X(ξ) = (1/(√(2π)σ)) e^{−(ξ−m)²/(2σ²)},   ξ ∈ R.

The mean of X can be obtained through direct integration, with a change of variables,

E[X] = (1/(√(2π)σ)) ∫_{−∞}^{∞} ξ e^{−(ξ−m)²/(2σ²)} dξ
     = (σ/√(2π)) ∫_{−∞}^{∞} (ζ + m/σ) e^{−ζ²/2} dζ
     = (σ/√(2π)) ∫_{−∞}^{∞} ζ e^{−ζ²/2} dζ + (σ/√(2π)) ∫_{−∞}^{∞} (m/σ) e^{−ζ²/2} dζ = m.

In finding a solution, we have leveraged the fact that ζ e^{−ζ²/2} is an absolutely integrable, odd function. We also took advantage of the normalization condition which ensures that a Gaussian PDF integrates to one. To derive the variance, we again use the normalization condition. For a Gaussian PDF, this property implies that

∫_{−∞}^{∞} e^{−(ξ−m)²/(2σ²)} dξ = √(2π)σ.

Differentiating both sides of this equation with respect to σ, we get

∫_{−∞}^{∞} ((ξ − m)²/σ³) e^{−(ξ−m)²/(2σ²)} dξ = √(2π).

Rearranging the terms yields

∫_{−∞}^{∞} ((ξ − m)²/(√(2π)σ)) e^{−(ξ−m)²/(2σ²)} dξ = σ².

Hence, Var(X) = E[(X − m)²] = σ². Of course, the variance can also be obtained by more conventional methods.

Example 82. Suppose that R is a Rayleigh random variable with parameter σ². We wish to compute its mean and variance.

Recall that R is a nonnegative random variable with PDF

f_R(r) = (r/σ²) e^{−r²/(2σ²)},   r ≥ 0.

Using this distribution, we get

E[R] = ∫_0^∞ ξ f_R(ξ) dξ = ∫_0^∞ (ξ²/σ²) e^{−ξ²/(2σ²)} dξ
     = [−ξ e^{−ξ²/(2σ²)}]_0^∞ + ∫_0^∞ e^{−ξ²/(2σ²)} dξ
     = √(2π)σ ∫_0^∞ (1/(√(2π)σ)) e^{−ζ²/(2σ²)} dζ = √(2π)σ/2.

Integration by parts is key in solving this expectation. Also, notice the judicious use of the fact that the integral of a zero-mean Gaussian PDF over [0, ∞) must be equal to 1/2. We compute the second moment of R below,

E[R²] = ∫_0^∞ ξ² f_R(ξ) dξ = ∫_0^∞ (ξ³/σ²) e^{−ξ²/(2σ²)} dξ
      = [−ξ² e^{−ξ²/(2σ²)}]_0^∞ + ∫_0^∞ 2ξ e^{−ξ²/(2σ²)} dξ
      = [−2σ² e^{−ξ²/(2σ²)}]_0^∞ = 2σ².

The variance of R is therefore equal to

Var[R] = ((4 − π)/2) σ².

Typically, σ² is employed to denote the variance of a random variable. It may be confusing at first to have a random variable R described in terms of parameter σ² whose variance is equal to (4 − π)σ²/2. This situation is an artifact of the following relation. A Rayleigh random variable R can be generated through the expression R = √(X² + Y²), where X and Y are independent zero-mean Gaussian variables with variance σ². Thus, the parameter σ² in f_R(·) is a tribute to this popular construction, not a representation of its actual variance.
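The moments obtained in Example 82 can be checked against simulated data. The sketch below is an added illustration (assuming NumPy; the value of σ is an arbitrary choice) rather than part of the original notes.

    import numpy as np

    rng = np.random.default_rng(4)
    sigma = 1.5                                   # hypothetical parameter sigma
    r = rng.rayleigh(scale=sigma, size=200_000)

    print(r.mean(), np.sqrt(np.pi / 2) * sigma)           # E[R] = sigma * sqrt(pi/2)
    print(r.var(), (4 - np.pi) / 2 * sigma**2)            # Var[R] = (4 - pi) sigma^2 / 2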

For a nonnegative random variable X, an alternative way to compute E[X] is described in Proposition 7.

Proposition 7. Suppose that X is a nonnegative random variable with finite mean, then

E[X] = ∫_0^∞ Pr(X > x) dx.

Proof. We offer a proof for the special case where X is a continuous random variable, although the result remains true in general,

∫_0^∞ Pr(X > x) dx = ∫_0^∞ ∫_x^∞ f_X(ξ) dξ dx = ∫_0^∞ ∫_0^ξ f_X(ξ) dx dξ = ∫_0^∞ ξ f_X(ξ) dξ = E[X].

Interchanging the order of integration is justified because X is assumed to have finite mean.

Example 83. A player throws darts at a circular target hung on a wall. The dartboard has unit radius, and the position of every dart is distributed uniformly over the target. We wish to compute the expected distance from each dart to the center of the dartboard.

Let R denote the distance from a dart to the center of the target. For 0 ≤ r ≤ 1, the probability that R exceeds r is given by

Pr(R > r) = 1 − Pr(R ≤ r) = 1 − πr²/π = 1 − r².

Then, by Proposition 7, the expected value of R is equal to

E[R] = ∫_0^1 (1 − r²) dr = [r − r³/3]_0^1 = 1 − 1/3 = 2/3.

Notice how we were able to compute the answer without deriving an explicit expression for f_R(·).
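A short Monte Carlo experiment agrees with the value 2/3; this is an added illustration assuming NumPy, using rejection sampling to draw points uniformly on the unit disk.

    import numpy as np

    rng = np.random.default_rng(5)
    pts = rng.uniform(-1.0, 1.0, size=(400_000, 2))
    inside = pts[np.sum(pts**2, axis=1) <= 1.0]      # keep points landing on the board
    distances = np.sqrt(np.sum(inside**2, axis=1))
    print(distances.mean())                          # close to 2/3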

10.2 Moment Generating Functions

The moment generating function of a random variable X is defined by

M_X(s) = E[e^{sX}].

For continuous random variables, the moment generating function becomes

M_X(s) = ∫_{−∞}^{∞} f_X(ξ) e^{sξ} dξ.

The experienced reader will quickly recognize the definition of M_X(s) as a variant of the Laplace transform, a widely used linear operator. The moment generating function gets its name from the following property. Suppose that M_X(s) exists within an open interval around s = 0; then the nth moment of X is given by

d^n/ds^n M_X(s)|_{s=0} = d^n/ds^n E[e^{sX}]|_{s=0} = E[d^n/ds^n e^{sX}]|_{s=0} = E[X^n e^{sX}]|_{s=0} = E[X^n].

In words, if we differentiate M_X(s) a total of n times and then evaluate the resulting function at zero, we obtain the nth moment of X. In particular, we have

(dM_X/ds)(0) = E[X]   and   (d²M_X/ds²)(0) = E[X²].

Example 84 (Exponential Random Variable). Let X be an exponential random variable with parameter λ. The moment generating function of X is given by

M_X(s) = ∫_0^∞ λ e^{−λξ} e^{sξ} dξ = ∫_0^∞ λ e^{−(λ−s)ξ} dξ = λ/(λ − s),

which is valid for s < λ. The mean of X is

E[X] = (dM_X/ds)(0) = λ/(λ − s)²|_{s=0} = 1/λ;

more generally, the nth moment of X can be computed as

E[X^n] = (d^n M_X/ds^n)(0) = n!λ/(λ − s)^{n+1}|_{s=0} = n!/λ^n.

Incidentally, we can deduce from these results that the variance of X is 1/λ².
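Symbolic differentiation reproduces these moments mechanically. The following sketch is an added illustration; it assumes SymPy is available, which the notes do not prescribe.

    import sympy as sp

    s, lam = sp.symbols('s lambda', positive=True)
    M = lam / (lam - s)                              # MGF of an exponential random variable

    mean = sp.diff(M, s, 1).subs(s, 0)               # E[X] = 1/lambda
    second = sp.diff(M, s, 2).subs(s, 0)             # E[X^2] = 2/lambda^2
    print(mean, sp.simplify(second - mean**2))       # variance 1/lambda^2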

The definition of the moment generating function applies to discrete random variables as well. In fact, for integer-valued random variables, the moment generating function and the ordinary generating function are related through the equation

M_X(s) = Σ_{k ∈ X(Ω)} e^{sk} p_X(k) = G_X(e^s).

126 CHAPTER 10. EXPECTATIONS AND BOUNDS

Example 85 (Discrete Uniform Random Variable). Suppose U is a discrete

uniform random variable taking value in U(Ω) = {1, 2, . . . , n}. Then, p

U

(k) =

1/n for 1 ≤ k ≤ n and

M

U

(s) =

n

¸

k=1

1

n

e

sk

=

1

n

n

¸

k=1

e

sk

=

e

s

(e

ns

−1)

n(e

s

−1)

.

The moment generating function provides an alternate and somewhat intricate

way to compute the mean of U,

E[U] =

dM

U

ds

(0) = lim

s→0

ne

(n+2)s

−(n + 1)e

(n+1)s

+e

s

n(e

s

−1)

2

= lim

s→0

n(n + 2)e

(n+1)s

−(n + 1)

2

e

ns

+ 1

2n(e

s

−1)

= lim

s→0

n(n + 1)(n + 2)ne

(n+1)s

−n(n + 1)

2

e

ns

2ne

s

=

n + 1

2

.

Notice the double application of l’Hˆopital’s rule to evaluate the derivative of

M

U

(s) at zero. This may be deemed a more contrived method to derive the

expected value of a discrete uniform random variables, but it does not rely on

prior knowledge of special sums. Through similar steps, one can derive the

second moment of U, which is equal to

E

U

2

=

(n + 1)(2n + 1)

6

.

From these two results, we can show that the variance of U is (n

2

−1)/12.

The simple form of the moment generating function of a standard normal

random variable points to its importance in many situations. The exponential

function is analytic and possesses many representations.

Example 86 (Gaussian Random Variable). Let X be a standard normal random variable whose PDF is given by

f_X(ξ) = (1/√(2π)) e^{−ξ²/2}.

The moment generating function of X is equal to

M_X(s) = E[e^{sX}] = ∫_{−∞}^{∞} (1/√(2π)) e^{−ξ²/2} e^{sξ} dξ
       = ∫_{−∞}^{∞} (1/√(2π)) e^{−(ξ² − 2sξ)/2} dξ
       = e^{s²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(ξ² − 2sξ + s²)/2} dξ
       = e^{s²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(ξ−s)²/2} dξ = e^{s²/2}.

The last equality follows from the normalization condition and the fact that the integrand is a Gaussian PDF.

Let M_X(s) be the moment generating function associated with a random variable X, and consider the random variable Y = aX + b, where a and b are constants. The moment generating function of Y can be obtained as follows,

M_Y(s) = E[e^{sY}] = E[e^{s(aX+b)}] = e^{sb} E[e^{saX}] = e^{sb} M_X(as).

Thus, if Y is an affine function of X then M_Y(s) = e^{sb} M_X(as).

Example 87. We can use this property to identify the moment generating function of a Gaussian random variable with parameters m and σ². Recall that an affine function of a Gaussian random variable is also Gaussian. Let Y = σX + m; then the moment generating function of Y becomes

M_Y(s) = E[e^{sY}] = E[e^{s(σX+m)}] = e^{sm} E[e^{sσX}] = e^{sm + s²σ²/2}.

From this moment generating function, we get

E[Y] = (dM_Y/ds)(0) = (m + sσ²) e^{sm + s²σ²/2}|_{s=0} = m

E[Y²] = (d²M_Y/ds²)(0) = [σ² e^{sm + s²σ²/2} + (m + sσ²)² e^{sm + s²σ²/2}]|_{s=0} = σ² + m².

The mean of Y is m and its variance is equal to σ², as anticipated.

10.3 Important Inequalities

There are many situations for which computing the exact value of a prob-

ability is impossible or impractical. In such cases, it may be acceptable to

provide bounds on the value of an elusive probability. The expectation is

most important in ﬁnding pertinent bounds.

As we will see, many upper bounds rely on the concept of dominating functions. Suppose that g(x) and h(x) are two nonnegative functions such that g(x) ≤ h(x) for all x ∈ R. Then, for any continuous random variable X, the following inequality holds

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx ≤ ∫_{−∞}^{∞} h(x) f_X(x) dx = E[h(X)].

This is illustrated in Figure 10.1. In words, the weighted integral of g(·) is dominated by the weighted integral of h(·), where f_X(·) acts as the weighting function. This notion is instrumental in understanding bounding techniques.

Figure 10.1: If g(x) and h(x) are two nonnegative functions such that g(x) ≤ h(x) for all x ∈ R, then E[g(X)] is less than or equal to E[h(X)].

10.3.1 The Markov Inequality

We begin our exposition of classical upper bounds with a result known as the Markov inequality. Recall that, for admissible set S ⊂ R, we have

Pr(X ∈ S) = E[1_S(X)].

Thus, to obtain a bound on Pr(X ∈ S), it suffices to find a function that dominates 1_S(·) and for which we can compute the expectation.

Suppose that we wish to bound Pr(X ≥ a) where X is a nonnegative random variable. In this case, we can select S = [a, ∞) and the function h(x) = x/a. For any x ≥ 0, we have h(x) ≥ 1_S(x), as illustrated in Figure 10.2. It follows that

Pr(X ≥ a) = E[1_S(X)] ≤ E[X]/a.

Figure 10.2: Suppose that we wish to find a bound for Pr(X ≥ a). We define set S = [a, ∞) and function g(x) = 1_S(x). Using the dominating function h(x) = x/a, we conclude that Pr(X ≥ a) ≤ a^{−1} E[X] for any nonnegative random variable X.

10.3.2 The Chebyshev Inequality

The Chebyshev inequality provides an extension of this methodology to more general dominating functions. This yields a number of bounds that become useful in a myriad of contexts.

Proposition 8 (Chebyshev Inequality). Suppose h(·) is a nonnegative function and let S be an admissible set. We denote the infimum of h(·) over S by

i_S = inf_{x ∈ S} h(x).

The Chebyshev inequality asserts that

i_S Pr(X ∈ S) ≤ E[h(X)]                                            (10.1)

where X is an arbitrary random variable.

Proof. This is a remarkably powerful result and it can be shown in a few steps. The definition of i_S and the fact that h(·) is nonnegative imply that

i_S 1_S(x) ≤ h(x) 1_S(x) ≤ h(x)

for any x ∈ R. Moreover, for any such x and distribution f_X(·), we can write i_S 1_S(x) f_X(x) ≤ h(x) f_X(x), which in turn yields

i_S Pr(X ∈ S) = E[i_S 1_S(X)] = ∫_R i_S 1_S(ξ) f_X(ξ) dξ ≤ ∫_R h(ξ) f_X(ξ) dξ = E[h(X)].

When i_S > 0, this provides the upper bound Pr(X ∈ S) ≤ i_S^{−1} E[h(X)].

Although the proof assumes a continuous random variable, we emphasize that the Chebyshev inequality applies to both discrete and continuous random variables alike. The interested reader can rework the proof using the discrete setting and a generic PMF. We provide special instances of the Chebyshev inequality below.

Example 88. Consider the nonnegative function h(x) = x² and let S = {x | x² ≥ b²} where b is a positive constant. We wish to find a bound on the probability that |X| exceeds b. Using the Chebyshev inequality, we have

i_S = inf_{x ∈ S} x² = b²

and, consequently, we get

b² Pr(X ∈ S) ≤ E[X²].

Constant b being a positive real number, we can rewrite this equation as

Pr(|X| ≥ b) = Pr(X ∈ S) ≤ E[X²]/b².
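For a standard normal random variable and b = 2, for instance, this bound reads Pr(|X| ≥ 2) ≤ 1/4, while the exact probability is roughly 0.046. The quick simulation below is an added illustration assuming NumPy.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.standard_normal(1_000_000)
    b = 2.0
    print(np.mean(np.abs(x) >= b))       # about 0.0455
    print(np.mean(x**2) / b**2)          # Chebyshev bound E[X^2]/b^2, about 0.25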

Example 89 (The Cantelli Inequality). Suppose that X is a random variable with mean m and variance σ². We wish to show that

Pr(X − m ≥ a) ≤ σ²/(a² + σ²),

where a ≥ 0.

This equation is slightly more involved and requires a small optimization in addition to the Chebyshev inequality. Define Y = X − m and note that, by construction, we have E[Y] = 0. Consider the probability Pr(Y ≥ a) where a > 0, and let S = {y | y ≥ a}. Also, define the nonnegative function h(y) = (y + b)², where b > 0. Following the steps of the Chebyshev inequality, we write the infimum of h(y) over S as

i_S = inf_{y ∈ S} (y + b)² = (a + b)².

Then, applying the Chebyshev inequality, we obtain

Pr(Y ≥ a) ≤ E[(Y + b)²]/(a + b)² = (σ² + b²)/(a + b)².              (10.2)

This inequality holds for any b > 0. To produce a better upper bound, we minimize the right-hand side of (10.2) over all possible values of b. Differentiating this expression and setting the derivative equal to zero yields

2b/(a + b)² = 2(σ² + b²)/(a + b)³

or, equivalently, b = σ²/a. A second derivative test reveals that this is indeed a minimum. Collecting these results, we obtain

Pr(Y ≥ a) ≤ (σ² + b²)/(a + b)² = σ²/(a² + σ²).

Substituting Y = X − m leads to the desired result.

In some circumstances, a Chebyshev inequality can be tight.

Example 90. Let a and b be two constants such that 0 < b ≤ a. Consider the function h(x) = x² along with the set S = {x | x² ≥ a²}. Furthermore, let X be a discrete random variable with PMF

p_X(x) = { 1 − b²/a²,  x = 0
         { b²/a²,      x = a
         { 0,          otherwise.

For this random variable, we have Pr(X ∈ S) = b²/a². By inspection, we also gather that the second moment of X is equal to E[X²] = b². Applying the Chebyshev inequality, we get i_S = inf_{x ∈ S} h(x) = a² and therefore

Pr(X ∈ S) ≤ i_S^{−1} E[h(X)] = b²/a².

Thus, in this particular example, the inequality is met with equality.

10.3.3 The Chernoﬀ Bound

The Chernoff bound is yet another upper bound that can be constructed from the Chebyshev inequality. Still, because of its central role in many application domains, it deserves its own section. Suppose that we want to find a bound on the probability Pr(X ≥ a). We can apply the Chebyshev inequality using the nonnegative function h(x) = e^{sx}, where s > 0. For this specific construction, S = [a, ∞) and

i_S = inf_{x ∈ S} e^{sx} = e^{sa}.

It follows that

Pr(X ≥ a) ≤ e^{−sa} E[e^{sX}] = e^{−sa} M_X(s).

Because this inequality holds for any s > 0, we can optimize the upper bound over all possible values of s, thereby picking the best one,

Pr(X ≥ a) ≤ inf_{s>0} e^{−sa} M_X(s).                              (10.3)

This inequality is called the Chernoff bound. It is sometimes expressed in terms of the log-moment generating function Λ(s) = log M_X(s). In this latter case, (10.3) translates into

log Pr(X ≥ a) ≤ −sup_{s>0} {sa − Λ(s)}.                            (10.4)

The right-hand side of (10.4) is called the Legendre transformation of Λ(s). Figure 10.3 plots e^{s(x−a)} for various values of s > 0. It should be noted that all these functions dominate 1_{[a,∞)}(x), and therefore they each provide a different bound on Pr(X ≥ a). It is natural to select the function that provides the best bound. Yet, in general, this optimal e^{s(x−a)} may depend on the distribution of X and the value of a, which explains why (10.3) involves a search over all possible values of s.

Figure 10.3: This figure illustrates how exponential functions can be employed to provide bounds on Pr(X > a). Optimizing over all admissible exponential functions, e^{s(x−a)} where s > 0, leads to the celebrated Chernoff bound.
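As an added illustration (assuming NumPy; not part of the original notes), the infimum in (10.3) can be approximated by a grid search over s. For a standard normal random variable, M_X(s) = e^{s²/2}, the minimizing exponent is s = a and the bound becomes e^{−a²/2}.

    import numpy as np

    def chernoff_bound(a, mgf, s_grid):
        """Evaluate e^{-sa} M_X(s) on a grid of s > 0 and return the smallest value."""
        return np.min(np.exp(-s_grid * a) * mgf(s_grid))

    a = 3.0
    s_grid = np.linspace(1e-3, 10.0, 10_000)
    bound = chernoff_bound(a, lambda s: np.exp(s**2 / 2.0), s_grid)
    print(bound, np.exp(-a**2 / 2.0))    # the grid minimum is essentially e^{-a^2/2}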

10.3.4 Jensen’s Inequality

Some inequalities can be derived based on the properties of a single function. The Jensen inequality is one such example. Suppose that function g(·) is convex and twice differentiable, with

(d²g/dx²)(x) ≥ 0

for all x ∈ R. From the fundamental theorem of calculus, we have

g(x) = g(a) + ∫_a^x (dg/dx)(ξ) dξ.

Furthermore, because the second derivative of g(·) is a nonnegative function, we gather that (dg/dx)(·) is a monotone increasing function. As such, for any value of a, we have

g(x) = g(a) + ∫_a^x (dg/dx)(ξ) dξ ≥ g(a) + ∫_a^x (dg/dx)(a) dξ = g(a) + (x − a)(dg/dx)(a).

For random variable X, we then have

g(X) ≥ g(a) + (X − a)(dg/dx)(a).

Choosing a = E[X] and taking expectations on both sides, we obtain

E[g(X)] ≥ g(E[X]) + (E[X] − E[X])(dg/dx)(E[X]) = g(E[X]).

That is, E[g(X)] ≥ g(E[X]), provided that these two expectations exist. The Jensen inequality actually holds for convex functions that are not twice differentiable, but the proof is much harder in the general setting.

134 CHAPTER 10. EXPECTATIONS AND BOUNDS

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Sections 5.2, 7.7, 8.2.

2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 3.1, 4.1, 7.1.

3. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Sections 4.2–4.3.

4. Miller, S. L., and Childers, D. G., Probability and Random Processes with

Applications to Signal Processing and Communications, 2004: Section 5.2.

Chapter 11

Multiple Continuous Random

Variables

Being versed at dealing with multiple random variables is an essential part of

statistics, engineering and science. This is equally true for models based on

discrete and continuous random variables. In this chapter, we focus on the

latter and expand our exposition of continuous random variables to random

vectors. Again, our initial survey of this topic revolves around conditional

distributions and pairs of random variables. More complex scenarios will be

considered in the later parts of the chapter.

11.1 Joint Cumulative Distributions

Let X and Y be two random variables associated with the same experiment. The joint cumulative distribution function of X and Y is defined by

F_{X,Y}(x, y) = Pr(X ≤ x, Y ≤ y),   x, y ∈ R.

Keeping in mind that X and Y are real-valued functions acting on the same sample space, we can also write

F_{X,Y}(x, y) = Pr({ω ∈ Ω | X(ω) ≤ x, Y(ω) ≤ y}).

From this characterization, we can identify a few properties of the joint CDF;

lim_{y↑∞} F_{X,Y}(x, y) = lim_{y↑∞} Pr({ω ∈ Ω | X(ω) ≤ x, Y(ω) ≤ y})
                        = Pr({ω ∈ Ω | X(ω) ≤ x, Y(ω) ∈ R})
                        = Pr({ω ∈ Ω | X(ω) ≤ x}) = F_X(x).

Similarly, we have lim_{x↑∞} F_{X,Y}(x, y) = F_Y(y). Taking limits in the other direction, we get

lim_{x↓−∞} F_{X,Y}(x, y) = lim_{y↓−∞} F_{X,Y}(x, y) = 0.

When the function F_{X,Y}(·, ·) is totally differentiable, it is possible to define the joint probability density function of X and Y,

f_{X,Y}(x, y) = ∂²F_{X,Y}/∂x∂y (x, y) = ∂²F_{X,Y}/∂y∂x (x, y),   x, y ∈ R.   (11.1)

Hereafter, we refer to a pair of random variables as continuous if the corresponding joint PDF exists and is defined unambiguously through (11.1). When this is the case, standard calculus asserts that the following equation holds,

F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(ξ, ζ) dζ dξ.

From its definition, we note that f_{X,Y}(·, ·) is a nonnegative function which integrates to one,

∫_{R²} f_{X,Y}(ξ, ζ) dζ dξ = 1.

Furthermore, for any admissible set S ⊂ R², the probability that (X, Y) ∈ S can be evaluated through the integral formula

Pr((X, Y) ∈ S) = ∫_{R²} 1_S(ξ, ζ) f_{X,Y}(ξ, ζ) dζ dξ = ∫_S f_{X,Y}(ξ, ζ) dζ dξ.   (11.2)

In particular, if S is the Cartesian product of two intervals,

S = {(x, y) ∈ R² | a ≤ x ≤ b, c ≤ y ≤ d},

then the probability that (X, Y) ∈ S reduces to the typical integral form

Pr((X, Y) ∈ S) = Pr(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{X,Y}(ξ, ζ) dζ dξ.

Example 91. Suppose that the random pair (X, Y) is uniformly distributed over the unit circle. We can express the joint PDF f_{X,Y}(·, ·) as

f_{X,Y}(x, y) = { 1/π,  x² + y² ≤ 1
              { 0,    otherwise.

We wish to find the probability that the point (X, Y) lies inside a circle of radius 1/2.

Let S = {(x, y) ∈ R² | x² + y² ≤ 1/4}. The probability that (X, Y) belongs to S is given by

Pr((X, Y) ∈ S) = ∫_{R²} (1_S(ξ, ζ)/π) dξ dζ = 1/4.

Thus, the probability that (X, Y) is contained within a circle of radius half is one fourth.

Example 92. Let X and Y be two independent zero-mean Gaussian random variables, each with variance σ². For (x, y) ∈ R², their joint PDF is given by

f_{X,Y}(x, y) = (1/(2πσ²)) e^{−(x² + y²)/(2σ²)}.

We wish to find the probability that (X, Y) falls within a circle of radius r centered at the origin.

We can compute this probability using integral formula (11.2) applied to this particular problem. Let R = √(X² + Y²) and assume r > 0; then

Pr(R ≤ r) = ∫∫_{x²+y²≤r²} f_{X,Y}(x, y) dx dy = ∫∫_{x²+y²≤r²} (1/(2πσ²)) e^{−(x²+y²)/(2σ²)} dx dy
          = ∫_0^r ∫_0^{2π} (1/(2πσ²)) e^{−ρ²/(2σ²)} ρ dθ dρ = 1 − e^{−r²/(2σ²)}.

The probability that (X, Y) is contained within a circle of radius r is 1 − e^{−r²/(2σ²)}. Recognizing that R is a continuous random variable, we can write its PDF as

f_R(r) = (r/σ²) e^{−r²/(2σ²)},   r ≥ 0.

From this equation, we gather that R possesses a Rayleigh distribution with parameter σ².

11.2 Conditional Probability Distributions

Given non-vanishing event A, we can write the conditional CDF of random variable X as

F_{X|A}(x) = Pr(X ≤ x | A) = Pr({X ≤ x} ∩ A) / Pr(A),   x ∈ R.

Note that event A can be defined in terms of variables X and Y. For instance, we may use A = {Y ≥ X} as our condition. Under suitable conditions, it is equally straightforward to specify the conditional PDF of X given A,

f_{X|A}(x) = (dF_{X|A}/dx)(x),   x ∈ R.

Example 93. Let X and Y be continuous random variables with joint PDF

f_{X,Y}(x, y) = λ² e^{−λ(x+y)},   x, y ≥ 0.

We wish to compute the conditional PDF of X given A = {Y ≤ X}. To solve this problem, we first compute the probability of the event {X ≤ x} ∩ A,

Pr({X ≤ x} ∩ A) = ∫_0^x ∫_0^ξ f_{X,Y}(ξ, ζ) dζ dξ = ∫_0^x ∫_0^ξ λ² e^{−λ(ξ+ζ)} dζ dξ
                = ∫_0^x λ e^{−λξ} (1 − e^{−λξ}) dξ = (1 − e^{−λx})²/2.

By symmetry, we gather that Pr(A) = 1/2 and, as such,

f_{X|A}(x) = 2λ e^{−λx} (1 − e^{−λx}),   x ≥ 0.

One case of special interest is the situation where event A is defined in terms of the random variable X itself. In particular, consider the PDF of X conditioned on the fact that X belongs to an interval I. Then, A = {X ∈ I} and the conditional CDF of X becomes

F_{X|A}(x) = Pr(X ≤ x | X ∈ I) = Pr({X ≤ x} ∩ {X ∈ I}) / Pr(X ∈ I) = Pr(X ∈ (−∞, x] ∩ I) / Pr(X ∈ I).

Differentiating with respect to x, we obtain the conditional PDF of X,

f_{X|A}(x) = f_X(x) / Pr(X ∈ I)

for any x ∈ I. In words, the conditional PDF of X becomes a scaled version of f_X(·) whenever x ∈ I, and it is equal to zero otherwise. Essentially, this is equivalent to re-normalizing the PDF of X over interval I, accounting for the partial information given by X ∈ I.

11.2.1 Conditioning on Values

Suppose that X and Y form a pair of random variables with joint PDF f_{X,Y}(·, ·). With great care, it is possible and desirable to define the conditional PDF of X, conditioned on Y = y. Special attention must be given to this situation because the event {Y = y} has probability zero whenever Y is a continuous random variable. Still, when X and Y are jointly continuous and for any y ∈ R such that f_Y(y) > 0, we can define the conditional PDF of X given Y = y as

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).                             (11.3)

Intuitively, this definition is motivated by the following property. For small ∆_x and ∆_y, we can write

Pr(x ≤ X ≤ x + ∆_x | y ≤ Y ≤ y + ∆_y) = Pr(x ≤ X ≤ x + ∆_x, y ≤ Y ≤ y + ∆_y) / Pr(y ≤ Y ≤ y + ∆_y)
                                       ≈ f_{X,Y}(x, y) ∆_x ∆_y / (f_Y(y) ∆_y) = (f_{X,Y}(x, y) / f_Y(y)) ∆_x.

Thus, loosely speaking, f_{X|Y}(x|y) ∆_x represents the probability that X lies close to x, given that Y is near y.

Using this definition, it is possible to compute the probabilities of events associated with X conditioned on a specific value of random variable Y,

Pr(X ∈ S | Y = y) = ∫_S f_{X|Y}(x|y) dx.

A word of caution is in order. The technical difficulty that surfaces when conditioning on {Y = y} stems from the fact that Pr(Y = y) = 0. Remember that, in general, the notion of conditional probability is only defined for non-vanishing conditions. Although we were able to circumvent this issue, care must be taken when dealing with a conditional PDF of the form (11.3), as it only provides valuable insight when the random variables (X, Y) are jointly continuous.

Example 94. Consider the experiment where an outcome (ω_1, ω_2) is selected at random from the unit circle. Let X = ω_1 and Y = ω_2. We wish to compute the conditional PDF of X given that Y = 0.5.

First, we compute the marginal PDF of Y evaluated at Y = 0.5,

f_Y(0.5) = ∫_R f_{X,Y}(x, 0.5) dx = ∫_{−√3/2}^{√3/2} (1/π) dx = √3/π.

We then apply definition (11.3) to obtain the desired conditional PDF of X,

f_{X|Y}(x|0.5) = f_{X,Y}(x, 0.5) / f_Y(0.5) = (π/√3) f_{X,Y}(x, 0.5)
             = { 1/√3,  |x| ≤ √3/2
               { 0,     otherwise.

11.2.2 Conditional Expectation

The conditional expectation of a function g(Y) is simply the integral of g(Y) weighted by the proper conditional PDF,

E[g(Y) | X = x] = ∫_R g(y) f_{Y|X}(y|x) dy
E[g(Y) | S] = ∫_R g(y) f_{Y|S}(y) dy.

Note again that the function

h(x) = E[Y | X = x]

defines a random variable h(X), since the conditional expectation of Y may vary as a function of X. After all, a conditional expectation is itself a random variable.

Example 95. An analog communication system transmits a random signal over a noisy channel. The transmit signal X and the additive noise N are both standard Gaussian random variables, and they are independent. The signal received at the destination is equal to

Y = X + N.

We wish to estimate the value of X conditioned on Y = y.

For this problem, the joint PDF of X and Y is

f_{X,Y}(x, y) = (1/(2π)) exp(−(2x² − 2xy + y²)/2)

and the conditional distribution of X given Y becomes

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
            = [(1/(2π)) exp(−(2x² − 2xy + y²)/2)] / [(1/(2√π)) exp(−y²/4)]
            = (1/√π) exp(−(4x² − 4xy + y²)/4).

By inspection, we recognize that this conditional PDF is a Gaussian distribution with parameters m = y/2 and σ² = 1/2. A widespread algorithm employed to perform the desired task is called the minimum mean square error (MMSE) estimator. In the present case, the MMSE estimator reduces to the conditional expectation of X given Y = y, which is

E[X | Y = y] = ∫_R x f_{X|Y}(x|y) dx = y/2.
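A simulation of this estimator (an added illustration assuming NumPy, not part of the original notes) confirms that the rule x̂ = y/2 halves the mean square error relative to using the raw observation:

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.standard_normal(500_000)          # transmitted signal
    n = rng.standard_normal(500_000)          # additive noise
    y = x + n                                 # received signal

    x_hat = y / 2.0                           # conditional-mean (MMSE) estimate
    print(np.mean((x - x_hat)**2))            # about 0.5, the conditional variance
    print(np.mean((x - y)**2))                # about 1.0, using y directly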

11.2.3 Derived Distributions

Suppose X_1 and X_2 are jointly continuous random variables. Furthermore, let Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2), where g_1(·, ·) and g_2(·, ·) are real-valued functions. Under certain conditions, the pair of random variables (Y_1, Y_2) will also be continuous. Deriving the joint PDF of (Y_1, Y_2) can get convoluted, a task we forgo; it requires the skillful application of vector calculus. Nevertheless, we examine the case where a simple expression for f_{Y_1,Y_2}(·, ·) exists.

Consider the scenario where the functions g_1(·, ·) and g_2(·, ·) are totally differentiable, with Jacobian determinant

J(x_1, x_2) = det [ ∂g_1/∂x_1 (x_1, x_2)   ∂g_1/∂x_2 (x_1, x_2) ]
                  [ ∂g_2/∂x_1 (x_1, x_2)   ∂g_2/∂x_2 (x_1, x_2) ]
            = ∂g_1/∂x_1 (x_1, x_2) ∂g_2/∂x_2 (x_1, x_2) − ∂g_1/∂x_2 (x_1, x_2) ∂g_2/∂x_1 (x_1, x_2) ≠ 0.

Also, assume that the system of equations

g_1(x_1, x_2) = y_1
g_2(x_1, x_2) = y_2

has a unique solution. We express this solution using x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2). Then, the random variables (Y_1, Y_2) are jointly continuous with joint PDF

f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2) / |J(x_1, x_2)|        (11.4)

where x_1 = h_1(y_1, y_2) and x_2 = h_2(y_1, y_2). Note the close resemblance between this equation and the derived distribution of (9.4). Looking back at Chapter 9 offers an idea of what proving this result entails. It also hints at how this equation can be modified to accommodate non-unique mappings.
equation can be modiﬁed to accommodate non-unique mappings.

Example 96. An important application of (11.4) pertains to the properties

of Gaussian vectors. Suppose that X

1

and X

2

are jointly continuous random

variables, and let

X =

¸

X

1

X

2

¸

.

Deﬁne the mean of X by

m = E[X] =

¸

E[X

1

]

E[X

2

]

¸

.

and its covariance by

Σ = E

(X−m) (X−m)

T

=

¸

E[(X

1

−m

1

)

2

] E[(X

1

−m

1

)(X

2

−m

2

)]

E[(X

2

−m

2

)(X

1

−m

1

)] E[(X

2

−m

2

)

2

]

¸

.

Random variables X

1

and X

2

are said to be jointly Gaussian provided that

their joint PDF is of the form

f

X

1

,X

2

(x

1

, x

2

) =

1

2π|Σ|

1

2

exp

−

1

2

(x −m)

T

Σ

−1

(x −m)

.

Assume that the random variables Y

1

and Y

2

are generated through the matrix

equation

Y =

¸

Y

1

Y

2

¸

= AX+b,

11.3. INDEPENDENCE 143

where A is a 2 ×2 invertible matrix and b is a constant vector. In this case,

X = A

−1

(Y−b) and the corresponding Jacobian determinant is

J(x

1

, x

2

) = det

¸

a

11

a

12

a

21

a

22

¸

= |A|.

Applying (11.4), we gather that the joint PDF of (Y

1

, Y

2

) is expressed as

f

Y

1

,Y

2

(y

1

, y

2

) =

1

2π|Σ|

1

2

|A|

exp

−

1

2

(x −m)

T

Σ

−1

(x −m)

=

1

2π|AΣA|

1

2

exp

−

1

2

A

−1

(y −b) −m

T

Σ

−1

A

−1

(y −b) −m

=

1

2π|AΣA|

1

2

exp

−

1

2

(y −b −Am)

T

(AΣA)

−1

(y −b −Am)

.

Looking at this equation, we conclude that random variables Y

1

and Y

2

are

also jointly Gaussian, as their joint PDF possesses the proper form. It should

come as no surprise that the mean of Y is E[Y] = Am+b and its covariance

matrix is equal to

E

(Y−E[Y]) (Y−E[Y])

T

= AΣA.

In other words, a non-trivial aﬃne transformation of a two-dimensional Gaus-

sian vector yields another Gaussian vector. This admirable property general-

izes to higher dimensions. Indeed, if Y = AX+b where A is an n×n invertible

matrix and X is a Gaussian random vector, then Y remains a Gaussian ran-

dom vector. Furthermore, to obtain the derived distribution of the latter vector,

it suﬃces to compute its mean and covariance, and substitute the resulting pa-

rameters in the general form of the Gaussian PDF. Collectively, the features

of joint Gaussian random vectors underly many contemporary successes of en-

gineering.
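The covariance identity AΣA^T is straightforward to verify numerically. The sketch below is an added illustration (it assumes NumPy, and the particular m, Σ, A and b are arbitrary choices, not values from the text): it samples a two-dimensional Gaussian vector, applies an affine map, and compares the empirical mean and covariance of the result with Am + b and AΣA^T.

    import numpy as np

    rng = np.random.default_rng(8)
    m = np.array([1.0, -2.0])                        # hypothetical mean vector
    Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])       # hypothetical covariance matrix
    A = np.array([[1.0, 2.0], [0.0, 3.0]])           # invertible matrix
    b = np.array([0.5, -1.0])

    X = rng.multivariate_normal(m, Sigma, size=500_000)
    Y = X @ A.T + b                                  # Y = A X + b, one row per sample

    print(Y.mean(axis=0), A @ m + b)                 # empirical mean vs. A m + b
    print(np.cov(Y, rowvar=False))                   # close to A Sigma A^T
    print(A @ Sigma @ A.T)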

11.3 Independence

Two random variables X and Y are mutually independent if their joint CDF is the product of their respective CDFs,

F_{X,Y}(x, y) = F_X(x) F_Y(y)

for x, y ∈ R. For jointly continuous random variables, this definition necessarily implies that the joint PDF f_{X,Y}(·, ·) is the product of their marginal PDFs,

f_{X,Y}(x, y) = ∂²F_{X,Y}/∂x∂y (x, y) = (dF_X/dx)(x) (dF_Y/dy)(y) = f_X(x) f_Y(y)

for x, y ∈ R. Furthermore, we gather from (11.3) that the conditional PDF of Y given X = x is equal to the marginal PDF of Y whenever X and Y are independent,

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_X(x) f_Y(y) / f_X(x) = f_Y(y),

provided of course that f_X(x) ≠ 0. Additionally, if X and Y are independent random variables, then the two events {X ∈ S} and {Y ∈ T} are also independent,

Pr(X ∈ S, Y ∈ T) = ∫_S ∫_T f_{X,Y}(x, y) dy dx = (∫_S f_X(x) dx)(∫_T f_Y(y) dy) = Pr(X ∈ S) Pr(Y ∈ T).

Example 97. Consider a random experiment where an outcome (ω_1, ω_2) is selected at random from the unit square. Let X = ω_1, Y = ω_2 and U = ω_1 + ω_2. We wish to show that X and Y are independent, but that X and U are not independent.

We begin by computing the joint CDF of X and Y. For x, y ∈ [0, 1], we have

F_{X,Y}(x, y) = ∫_0^x ∫_0^y dζ dξ = xy = F_X(x) F_Y(y).

More generally, for (x, y) ∈ R², we get

F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y 1_{[0,1]²}(ξ, ζ) dζ dξ = (∫_{−∞}^x 1_{[0,1]}(ξ) dξ)(∫_{−∞}^y 1_{[0,1]}(ζ) dζ) = F_X(x) F_Y(y).

Thus, we gather that X and Y are independent.

Next, we show that X and U are not independent. Note that F_U(1) = 0.5 and F_X(0.5) = 0.5. Consider the joint CDF of X and U evaluated at (0.5, 1),

F_{X,U}(0.5, 1) = ∫_0^{1/2} ∫_0^{1−ξ} dζ dξ = ∫_0^{1/2} (1 − ξ) dξ = 3/8 ≠ F_X(0.5) F_U(1).

Clearly, random variables X and U are not independent.

11.3.1 Sums of Continuous Random Variables

As mentioned before, sums of independent random variables are frequently encountered in engineering. We therefore turn to the question of determining the distribution of a sum of independent continuous random variables in terms of the PDFs of its constituents. If X and Y are independent random variables, the distribution of their sum U = X + Y can be obtained by using the convolution operator. Let f_X(·) and f_Y(·) be the PDFs of X and Y, respectively. The convolution of f_X(·) and f_Y(·) is the function defined by

(f_X ∗ f_Y)(u) = ∫_{−∞}^{∞} f_X(ξ) f_Y(u − ξ) dξ = ∫_{−∞}^{∞} f_X(u − ζ) f_Y(ζ) dζ.

The PDF of the sum U = X + Y is the convolution of the individual densities f_X(·) and f_Y(·),

f_U(u) = (f_X ∗ f_Y)(u).

To show that this is indeed the case, we first consider the CDF of U,

F_U(u) = Pr(U ≤ u) = Pr(X + Y ≤ u)
       = ∫_{−∞}^{∞} ∫_{−∞}^{u−ξ} f_{X,Y}(ξ, ζ) dζ dξ
       = ∫_{−∞}^{∞} (∫_{−∞}^{u−ξ} f_Y(ζ) dζ) f_X(ξ) dξ
       = ∫_{−∞}^{∞} F_Y(u − ξ) f_X(ξ) dξ.

Taking the derivative of F_U(u) with respect to u, we obtain

d/du F_U(u) = d/du ∫_{−∞}^{∞} F_Y(u − ξ) f_X(ξ) dξ
            = ∫_{−∞}^{∞} (d/du F_Y(u − ξ)) f_X(ξ) dξ
            = ∫_{−∞}^{∞} f_Y(u − ξ) f_X(ξ) dξ.

Notice the judicious use of the fundamental theorem of calculus. This shows that f_U(u) = (f_X ∗ f_Y)(u).

Example 98 (Sum of Uniform Random Variables). Suppose that two numbers are independently selected from the interval [0, 1], each with a uniform distribution. We wish to compute the PDF of their sum. Let X and Y be random variables describing the two choices, and let U = X + Y represent their sum. The PDFs of X and Y are

f_X(ξ) = f_Y(ξ) = { 1,  0 ≤ ξ ≤ 1
                  { 0,  otherwise.

The PDF of their sum is therefore equal to

f_U(u) = ∫_{−∞}^{∞} f_X(u − ξ) f_Y(ξ) dξ.

Since f_Y(ξ) = 1 when 0 ≤ ξ ≤ 1 and zero otherwise, this integral becomes

f_U(u) = ∫_0^1 f_X(u − ξ) dξ = ∫_0^1 1_{[0,1]}(u − ξ) dξ.

The integrand above is zero unless 0 ≤ u − ξ ≤ 1 (i.e., unless u − 1 ≤ ξ ≤ u). Thus, if 0 ≤ u ≤ 1, we get

f_U(u) = ∫_0^u dξ = u;

while, if 1 < u ≤ 2, we obtain

f_U(u) = ∫_{u−1}^1 dξ = 2 − u.

If u < 0 or u > 2, the value of the PDF becomes zero. Collecting these results, we can write the PDF of U as

f_U(u) = { u,      0 ≤ u ≤ 1,
         { 2 − u,  1 < u ≤ 2,
         { 0,      otherwise.
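The triangular shape can also be recovered numerically by discretizing the two densities and convolving them, which is how derived densities of sums are often approximated in practice. This is a minimal sketch assuming NumPy; it is not part of the original development.

    import numpy as np

    du = 0.001
    grid = np.arange(0.0, 1.0, du)
    f_x = np.ones_like(grid)                  # uniform density on [0, 1], discretized
    f_y = np.ones_like(grid)

    f_u = np.convolve(f_x, f_y) * du          # numerical convolution, supported on [0, 2]
    u = np.arange(f_u.size) * du
    print(np.interp([0.5, 1.0, 1.5], u, f_u)) # approximately [0.5, 1.0, 0.5]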

Example 99 (Sum of Exponential Random Variables). Two numbers are selected independently from the positive real numbers, each according to an exponential distribution with parameter λ. We wish to find the PDF of their sum. Let X and Y represent these two numbers, and denote this sum by U = X + Y. The random variables X and Y have PDFs

f_X(ξ) = f_Y(ξ) = { λe^{−λξ},  ξ ≥ 0
                  { 0,         otherwise.

When u ≥ 0, we can use the convolution formula and write

f_U(u) = ∫_{−∞}^{∞} f_X(u − ξ) f_Y(ξ) dξ = ∫_0^u λe^{−λ(u−ξ)} λe^{−λξ} dξ = ∫_0^u λ² e^{−λu} dξ = λ² u e^{−λu}.

On the other hand, if u < 0 then we get f_U(u) = 0. The PDF of U is given by

f_U(u) = { λ² u e^{−λu},  u ≥ 0
         { 0,             otherwise.

This is an Erlang distribution with parameters m = 2 and λ.

Example 100 (Sum of Gaussian Random Variables). It is an interesting and important fact that the sum of two independent Gaussian random variables is itself a Gaussian random variable. Suppose X is Gaussian with mean m_1 and variance σ_1², and similarly Y is Gaussian with mean m_2 and variance σ_2²; then U = X + Y has a Gaussian density with mean m_1 + m_2 and variance σ_1² + σ_2². We will show this property in the special case where both random variables are standard normal random variables. The general case can be attained in a similar manner, but the computations are somewhat tedious.

Suppose X and Y are two independent Gaussian random variables with PDFs

f_X(ξ) = f_Y(ξ) = (1/√(2π)) e^{−ξ²/2}.

Then, the PDF of U = X + Y is given by

f_U(u) = (f_X ∗ f_Y)(u) = (1/(2π)) ∫_{−∞}^{∞} e^{−(u−ξ)²/2} e^{−ξ²/2} dξ
       = (1/(2π)) e^{−u²/4} ∫_{−∞}^{∞} e^{−(ξ − u/2)²} dξ
       = (1/(2√π)) e^{−u²/4} (∫_{−∞}^{∞} (1/√π) e^{−(ξ − u/2)²} dξ).

The expression within the parentheses is equal to one since the integrand is a Gaussian PDF with m = u/2 and σ² = 1/2. Thus, we obtain

f_U(u) = (1/√(4π)) e^{−u²/4},

which verifies that U is indeed Gaussian.

Let X and Y be independent random variables. Consider the random variable U = X + Y. The moment generating function of U is given by

M_U(s) = E[e^{sU}] = E[e^{s(X+Y)}] = E[e^{sX} e^{sY}] = E[e^{sX}] E[e^{sY}] = M_X(s) M_Y(s).

That is, the moment generating function of the sum of two independent random variables is the product of the individual moment generating functions.

Example 101 (Sum of Gaussian Random Variables). In this example, we revisit the problem of adding independent Gaussian variables using moment generating functions. Again, let X and Y denote the two independent Gaussian variables. We denote the mean and variance of X by m_1 and σ_1². Likewise, we represent the mean and variance of Y by m_2 and σ_2². We wish to show that the sum U = X + Y is Gaussian with parameters m_1 + m_2 and σ_1² + σ_2².

The moment generating functions of X and Y are

M_X(s) = e^{m_1 s + σ_1² s²/2}
M_Y(s) = e^{m_2 s + σ_2² s²/2}.

The moment generating function of U = X + Y is therefore equal to

M_U(s) = M_X(s) M_Y(s) = exp((m_1 + m_2)s + (σ_1² + σ_2²)s²/2),

which demonstrates that U is a Gaussian random variable with mean m_1 + m_2 and variance σ_1² + σ_2², as anticipated.

Further Reading

1. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena

Scientiﬁc, 2002: Section 3.5.

2. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall,

2006: Chapter 6, Section 7.5.

3. Miller, S. L., and Childers, D. G., Probability and Random Processes with Ap-

plications to Signal Processing and Communications, 2004: Chapter 5, Sec-

tions 6.1–6.3.

4. Gubner, J. A., Probability and Random Processes for Electrical and Computer

Engineers, Cambridge, 2006: Chapters 7–9.

Chapter 12

Convergence, Sequences and

Limit Theorems

Some of the most astonishing results in probability are related to the properties

of sequences of random variables and the convergence of empirical distribu-

tions. From an engineering viewpoint, these results are important as they

enable the eﬃcient design of complex systems with very small probabilities of

failure. Concentration behavior facilitates true economies of scale.

12.1 Types of Convergence

The premise on which most probabilistic convergence results lie is a sequence of random variables X_1, X_2, . . . and a limiting random variable X, all of which are defined on the same probability space. Recall that a random variable is a function of the outcome of a random experiment. The above statement stipulates that all the random variables listed above are functions of the outcome of the same experiment.

Statements that can be made about a sequence of random variables range from simple assertions to more intricate claims. For instance, the sequence may appear to move toward a deterministic quantity or to behave increasingly akin to a certain function. Alternatively, the CDFs of the random variables in the sequence may appear to approach a precise function. Being able to recognize specific patterns within the sequence is key in establishing convergence results. The various statements one can make about the sequence X_1, X_2, . . . lead to the different types of convergence encountered in probability. Below, we discuss briefly three types of convergence.
we discuss brieﬂy three types of convergence.

Example 102. Suppose that X_1, X_2, . . . is a sequence of independent Gaussian random variables, each with mean m and variance σ². Define the partial sums

S_n = Σ_{i=1}^{n} X_i,                                             (12.1)

and consider the sequence

S_1, S_2/2, S_3/3, . . .                                           (12.2)

We know that affine transformations of Gaussian random variables remain Gaussian. Furthermore, we know that sums of jointly Gaussian random variables are also Gaussian. Thus, S_n/n possesses a Gaussian distribution with mean

E[S_n/n] = E[S_n]/n = (E[X_1] + · · · + E[X_n])/n = m

and variance

Var[S_n/n] = Var[S_n]/n² = (Var[X_1] + · · · + Var[X_n])/n² = σ²/n.

It appears that the PDF of S_n/n concentrates around m as n approaches infinity. That is, the sequence in (12.2) seems to become increasingly predictable.

Example 103. Again, let X_1, X_2, . . . be the sequence described above, and let S_n be defined according to (12.1). This time, we wish to characterize the properties of

S_1 − m, (S_2 − 2m)/√2, (S_3 − 3m)/√3, . . .

From our current discussion, we know that (S_n − nm)/√n is a Gaussian random variable. We can compute its mean and variance as follows,

E[(S_n − nm)/√n] = E[S_n − nm]/√n = 0
Var[(S_n − nm)/√n] = Var[S_n − nm]/n = Var[S_n]/n = σ².

No matter how large n is, the random variable (S_n − nm)/√n has a Gaussian distribution with mean zero and variance σ². Intriguingly, the distribution remains invariant throughout the sequence.

12.1.1 Convergence in Probability

The basic concept behind the definition of convergence in probability is that the probability that a random variable deviates from its typical behavior becomes smaller as the sequence progresses. Formally, a sequence X_1, X_2, . . . of random variables converges in probability to X if for every ε > 0,

lim_{n→∞} Pr(|X_n − X| ≥ ε) = 0.

In Example 102, the sequence {S_n/n} converges in probability to m.

12.1.2 Mean Square Convergence

We say that a sequence X_1, X_2, . . . of random variables converges in mean square to X if

lim_{n→∞} E[|X_n − X|²] = 0.

That is, the second moment of the difference between X_n and X vanishes as n goes to infinity. Convergence in the mean square sense implies convergence in probability.

Proposition 9. Let X_1, X_2, . . . be a sequence of random variables that converges in mean square to X. Then, the sequence X_1, X_2, . . . also converges to X in probability.

Proof. Suppose that ε > 0 is fixed. The sequence X_1, X_2, . . . converges in mean square to X. Thus, for δ > 0, there exists an N such that n ≥ N implies

E[|X_n − X|²] < δ.

If we apply the Chebyshev inequality to X_n − X, we get

Pr(|X_n − X| ≥ ε) ≤ E[|X_n − X|²]/ε² < δ/ε²

for every n ≥ N. Since δ can be made arbitrarily small, we conclude that this sequence also converges to X in probability.

12.1.3 Convergence in Distribution

A sequence $X_1, X_2, \ldots$ of random variables is said to converge in distribution to a random variable $X$ if
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
at every point $x \in \mathbb{R}$ where $F_X(\cdot)$ is continuous. This type of convergence is also called weak convergence.

Example 104. Let $X_n$ be a continuous random variable that is uniformly distributed over $[0, 1/n]$. Then, the sequence $X_1, X_2, \ldots$ converges in distribution to $0$.

In this example, $X = 0$ and
$$F_X(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0. \end{cases}$$
Furthermore, for every $x < 0$, we have $F_{X_n}(x) = 0$; and for every $x > 0$, we have
$$\lim_{n \to \infty} F_{X_n}(x) = 1.$$
Hence, the sequence $X_1, X_2, \ldots$ converges in distribution to a constant.
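Since $F_{X_n}(x) = \min(\max(nx, 0), 1)$ in this example, the convergence can be tabulated directly without any sampling. A brief sketch (the evaluation points are arbitrary):

```python
import numpy as np

def cdf_uniform(x, n):
    """CDF of a random variable that is uniform on [0, 1/n]."""
    return np.clip(n * np.asarray(x, dtype=float), 0.0, 1.0)

xs = [-0.1, 0.01, 0.1, 0.5]
for n in (1, 10, 100, 1000):
    print(f"n = {n:4d}:", np.round(cdf_uniform(xs, n), 3))
# At every fixed x > 0 the values tend to 1, and at every x < 0 they stay at 0,
# matching the CDF of the constant 0 at all of its continuity points.
```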

12.2 The Law of Large Numbers

The law of large numbers focuses on the convergence of empirical averages. Although there are many versions of this law, we only state its simplest form below. Suppose that $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables, each with finite second moment. Furthermore, for $n \geq 1$, define the empirical sum
$$S_n = \sum_{i=1}^{n} X_i.$$
The law of large numbers asserts that the sequence of empirical averages,
$$\frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$
converges in probability to the mean $\mathrm{E}[X]$.


Theorem 4 (Law of Large Numbers). Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mathrm{E}[X]$ and finite variance. For every $\epsilon > 0$, we have
$$\lim_{n \to \infty} \Pr\left(\left|\frac{S_n}{n} - \mathrm{E}[X]\right| \geq \epsilon\right) = \lim_{n \to \infty} \Pr\left(\left|\frac{X_1 + \cdots + X_n}{n} - \mathrm{E}[X]\right| \geq \epsilon\right) = 0.$$

Proof. Taking the expectation of the empirical average, we obtain
$$\mathrm{E}\left[\frac{S_n}{n}\right] = \frac{\mathrm{E}[S_n]}{n} = \frac{\mathrm{E}[X_1] + \cdots + \mathrm{E}[X_n]}{n} = \mathrm{E}[X].$$
Using independence, we also have
$$\mathrm{Var}\left[\frac{S_n}{n}\right] = \frac{\mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]}{n^2} = \frac{\mathrm{Var}[X]}{n}.$$
As $n$ goes to infinity, the variance of the empirical average $S_n/n$ vanishes. Thus, we showed that the sequence $\{S_n/n\}$ of empirical averages converges in mean square to $\mathrm{E}[X]$ since
$$\lim_{n \to \infty} \mathrm{E}\left[\left|\frac{S_n}{n} - \mathrm{E}[X]\right|^2\right] = \lim_{n \to \infty} \mathrm{Var}\left[\frac{S_n}{n}\right] = 0.$$
To get convergence in probability, we apply the Chebyshev inequality, as we did in Proposition 9,
$$\Pr\left(\left|\frac{S_n}{n} - \mathrm{E}[X]\right| \geq \epsilon\right) \leq \frac{\mathrm{Var}\left[\frac{S_n}{n}\right]}{\epsilon^2} = \frac{\mathrm{Var}[X]}{n\epsilon^2},$$
which clearly goes to zero as $n$ approaches infinity.

Example 105. Suppose that a die is thrown repetitively. We are interested in the average number of times a six shows up on the top face, as the number of throws becomes very large.

Let $D_n$ be a random variable that represents the number on the $n$th roll. Also, define the random variable $X_n = \mathbf{1}_{\{D_n = 6\}}$. Then, $X_n$ is a Bernoulli random variable with parameter $p = 1/6$, and the empirical average $S_n/n$ is equal to the number of times a six is observed divided by the total number of rolls. By the law of large numbers, we have
$$\lim_{n \to \infty} \Pr\left(\left|\frac{S_n}{n} - \frac{1}{6}\right| \geq \epsilon\right) = 0.$$
That is, as the number of rolls increases, the average number of times a six is observed converges to the probability of getting a six.
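This example is easy to simulate. A minimal sketch, assuming NumPy is available; the total number of rolls is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=100000)   # fair six-sided die, faces 1..6
is_six = (rolls == 6)

for n in (10, 100, 1000, 10000, 100000):
    print(f"after {n:6d} rolls, fraction of sixes = {is_six[:n].mean():.4f}")
# The running fraction settles near 1/6 ~ 0.1667, as the law of large numbers predicts.
```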


12.2.1 Heavy-Tailed Distributions*

There are situations where the law of large numbers does not apply. For example, when dealing with heavy-tailed distributions, one needs to be very careful. In this section, we study Cauchy random variables in greater detail. First, we show that the sum of two independent Cauchy random variables is itself a Cauchy random variable.

Let $X_1$ and $X_2$ be two independent Cauchy random variables with parameters $\gamma_1$ and $\gamma_2$, respectively. We wish to compute the PDF of $S = X_1 + X_2$. For continuous random variables, the PDF of $S$ is given by the convolution of $f_{X_1}(\cdot)$ and $f_{X_2}(\cdot)$. Thus, we can write
$$f_S(x) = \int_{-\infty}^{\infty} f_{X_1}(\xi) f_{X_2}(x - \xi) \, d\xi = \int_{-\infty}^{\infty} \frac{\gamma_1}{\pi (\gamma_1^2 + \xi^2)} \cdot \frac{\gamma_2}{\pi (\gamma_2^2 + (x - \xi)^2)} \, d\xi.$$

This integral is somewhat difficult to solve. We therefore resort to complex analysis and contour integration to get a solution. Let $C$ be a contour that goes along the real line from $-a$ to $a$, and then counterclockwise along a semicircle centered at zero. For $a$ large enough, Cauchy's residue theorem requires that
$$\oint_C f_{X_1}(z) f_{X_2}(x - z) \, dz = \oint_C \frac{\gamma_1}{\pi (\gamma_1^2 + z^2)} \cdot \frac{\gamma_2}{\pi (\gamma_2^2 + (x - z)^2)} \, dz = 2\pi i \left( \mathrm{Res}(g, i\gamma_1) + \mathrm{Res}(g, x + i\gamma_2) \right) \qquad (12.3)$$
where we have implicitly defined the function
$$g(z) = \frac{\gamma_1 \gamma_2}{\pi^2 (\gamma_1^2 + z^2)(\gamma_2^2 + (z - x)^2)} = \frac{\gamma_1 \gamma_2}{\pi^2 (z - i\gamma_1)(z + i\gamma_1)(z - x + i\gamma_2)(z - x - i\gamma_2)}.$$
Only two residues are contained within the enclosed region. Because they are simple poles, their values are given by
$$\mathrm{Res}(g, i\gamma_1) = \lim_{z \to i\gamma_1} (z - i\gamma_1) g(z) = \frac{\gamma_2}{2 i \pi^2 \left( (x - i\gamma_1)^2 + \gamma_2^2 \right)}$$
$$\mathrm{Res}(g, x + i\gamma_2) = \lim_{z \to x + i\gamma_2} (z - x - i\gamma_2) g(z) = \frac{\gamma_1}{2 i \pi^2 \left( (x + i\gamma_2)^2 + \gamma_1^2 \right)}.$$


It follows that
$$2\pi i \left( \mathrm{Res}(g, i\gamma_1) + \mathrm{Res}(g, x + i\gamma_2) \right) = \frac{\gamma_2}{\pi \left( (x - i\gamma_1)^2 + \gamma_2^2 \right)} + \frac{\gamma_1}{\pi \left( (x + i\gamma_2)^2 + \gamma_1^2 \right)} = \frac{\gamma_1 + \gamma_2}{\pi \left( (\gamma_1 + \gamma_2)^2 + x^2 \right)}.$$
The contribution of the arc in (12.3) vanishes as $a \to \infty$. We then conclude that the PDF of $S$ is equal to
$$f_S(x) = \frac{\gamma_1 + \gamma_2}{\pi \left( (\gamma_1 + \gamma_2)^2 + x^2 \right)}.$$
The sum of two independent Cauchy random variables with parameters $\gamma_1$ and $\gamma_2$ is itself a Cauchy random variable with parameter $\gamma_1 + \gamma_2$.
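This closed form can be sanity-checked by simulation. The sketch below is illustrative only: the parameters $\gamma_1 = 1$ and $\gamma_2 = 2$ are arbitrary, and it compares the empirical CDF of $X_1 + X_2$ against the CDF of a Cauchy random variable with parameter $\gamma_1 + \gamma_2$.

```python
import numpy as np

rng = np.random.default_rng(4)
g1, g2 = 1.0, 2.0                   # illustrative Cauchy parameters
N = 200000

# gamma * (standard Cauchy) is Cauchy with parameter gamma.
s = g1 * rng.standard_cauchy(N) + g2 * rng.standard_cauchy(N)

def cauchy_cdf(x, gamma):
    """CDF of a zero-centered Cauchy random variable with parameter gamma."""
    return 0.5 + np.arctan(x / gamma) / np.pi

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x = {x:+.1f}: empirical CDF {np.mean(s <= x):.4f}, "
          f"Cauchy({g1 + g2:.0f}) CDF {cauchy_cdf(x, g1 + g2):.4f}")
```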

Let $X_1, X_2, \ldots$ form a sequence of independent Cauchy random variables, each with parameter $\gamma$. Also, consider the empirical sum
$$S_n = \sum_{i=1}^{n} X_i.$$
Using mathematical induction and the aforementioned fact, it is possible to show that $S_n$ is a Cauchy random variable with parameter $n\gamma$. Furthermore, for $x \in \mathbb{R}$, the PDF of the empirical average $S_n/n$ is given by
$$f_{S_n/n}(x) = n f_{S_n}(nx) = \frac{n^2 \gamma}{\pi \left( n^2 \gamma^2 + (nx)^2 \right)} = \frac{\gamma}{\pi (\gamma^2 + x^2)},$$
where we have used the methodology developed in Chapter 9. Amazingly, the empirical average of a sequence of independent Cauchy random variables, each with parameter $\gamma$, remains a Cauchy random variable with the same parameter. Clearly, the law of large numbers does not apply to this scenario. Note that our version of the law of large numbers requires the random variables to have finite second moments, a condition that is clearly violated by the Cauchy distribution. This explains why convergence does not take place in this situation.
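The failure of averaging is striking in simulation. A brief sketch ($\gamma = 1$, the sample sizes, and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
gamma = 1.0

for n in (10, 1000, 100000):
    # Each row is one realization of X_1, ..., X_n; row means are samples of S_n / n.
    means = gamma * rng.standard_cauchy(size=(200, n)).mean(axis=1)
    spread = np.percentile(means, 90) - np.percentile(means, 10)
    print(f"n = {n:6d}: 10th-90th percentile spread of S_n/n = {spread:.2f}")
# The spread does not shrink as n grows: the empirical average is itself Cauchy
# with the same parameter, so it never concentrates.
```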

12.3 The Central Limit Theorem

The central limit theorem is a remarkable result in probability; it partly explains the prevalence of Gaussian random variables. In some sense, it captures the behavior of large sums of small, independent random components.

Theorem 5 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be independent and identically distributed random variables, each with mean $\mathrm{E}[X]$ and variance $\sigma^2$. The distribution of
$$\frac{S_n - n\mathrm{E}[X]}{\sigma \sqrt{n}}$$
converges in distribution to a standard normal random variable as $n \to \infty$. In particular, for any $x \in \mathbb{R}$,
$$\lim_{n \to \infty} \Pr\left( \frac{S_n - n\mathrm{E}[X]}{\sigma \sqrt{n}} \leq x \right) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{\xi^2}{2}} \, d\xi.$$

Proof. Initially, we assume that $\mathrm{E}[X] = 0$ and $\sigma^2 = 1$. Furthermore, we only study the situation where the moment generating function of $X$ exists and is finite. Consider the log-moment generating function of $X$,
$$\Lambda_X(s) = \log M_X(s) = \log \mathrm{E}\left[ e^{sX} \right].$$
The first two derivatives of $\Lambda_X(s)$ are equal to
$$\frac{d\Lambda_X}{ds}(s) = \frac{1}{M_X(s)} \frac{dM_X}{ds}(s)$$
$$\frac{d^2\Lambda_X}{ds^2}(s) = -\left( \frac{1}{M_X(s)} \frac{dM_X}{ds}(s) \right)^2 + \frac{1}{M_X(s)} \frac{d^2 M_X}{ds^2}(s).$$
Collecting these results and evaluating the functions at zero, we get $\Lambda_X(0) = 0$, $\frac{d\Lambda_X}{ds}(0) = \mathrm{E}[X] = 0$, and $\frac{d^2\Lambda_X}{ds^2}(0) = \mathrm{E}[X^2] = 1$. Next, we study the log-moment generating function of $S_n/\sqrt{n}$. Recall that the expectation of a product of independent random variables is equal to the product of their individual expectations. Using this property, we get
$$\log \mathrm{E}\left[ e^{s S_n / \sqrt{n}} \right] = \log \mathrm{E}\left[ e^{s X_1 / \sqrt{n}} \cdots e^{s X_n / \sqrt{n}} \right] = \log \left( M_X\left( \frac{s}{\sqrt{n}} \right) \cdots M_X\left( \frac{s}{\sqrt{n}} \right) \right) = n \Lambda_X\left( s n^{-1/2} \right).$$

To explore the asymptotic behavior of the sequence $\{S_n/\sqrt{n}\}$, we take the limit of $n \Lambda_X\left( s n^{-1/2} \right)$ as $n \to \infty$. In doing so, notice the double use of L'Hôpital's rule,
$$\lim_{n \to \infty} \frac{\Lambda_X\left( s n^{-1/2} \right)}{n^{-1}} = \lim_{n \to \infty} \frac{\frac{d\Lambda_X}{ds}\left( s n^{-1/2} \right) \frac{s n^{-3/2}}{2}}{n^{-2}} = \lim_{n \to \infty} \frac{s}{2 n^{-1/2}} \frac{d\Lambda_X}{ds}\left( s n^{-1/2} \right)$$
$$= \lim_{n \to \infty} \frac{\frac{d^2\Lambda_X}{ds^2}\left( s n^{-1/2} \right) \frac{s^2 n^{-3/2}}{2}}{n^{-3/2}} = \lim_{n \to \infty} \frac{s^2}{2} \frac{d^2\Lambda_X}{ds^2}\left( s n^{-1/2} \right) = \frac{s^2}{2}.$$
That is, the moment generating function of $S_n/\sqrt{n}$ converges point-wise to $e^{s^2/2}$ as $n \to \infty$. In fact, this implies that
$$\Pr\left( \frac{S_n}{\sqrt{n}} \leq x \right) \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{\xi^2}{2}} \, d\xi$$
as $n \to \infty$. In words, $S_n/\sqrt{n}$ converges in distribution to a standard normal random variable. The more general case where $\mathrm{E}[X]$ and $\mathrm{Var}[X]$ are arbitrary constants can be established in an analogous manner by proper scaling of the random variables $X_1, X_2, \ldots$

In the last step of the proof, we stated that point-wise convergence of the moment generating functions implies convergence in distribution. This is a sophisticated result that we quote without proof.
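The theorem is easy to observe empirically by standardizing sums of decidedly non-Gaussian variables, for instance uniform ones. This is an illustrative sketch only; $n$, the number of replications, and the evaluation points are arbitrary, and $\Phi$ is computed from the error function.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, reps = 50, 100000
EX, var = 0.5, 1.0 / 12.0                 # mean and variance of Uniform[0, 1]

sums = rng.random((reps, n)).sum(axis=1)  # samples of S_n
z = (sums - n * EX) / np.sqrt(n * var)    # standardized sums

def phi(x):
    """CDF of a standard normal random variable."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}: empirical CDF {np.mean(z <= x):.4f}, Phi(x) = {phi(x):.4f}")
```

Even with $n = 50$, the empirical CDF of the standardized sum is already very close to the standard normal CDF.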

12.3.1 Normal Approximation

The central limit theorem can be employed to approximate the CDF of large sums. Again, let
$$S_n = X_1 + \cdots + X_n$$
where $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean $\mathrm{E}[X]$ and variance $\sigma^2$. When $n$ is large, the CDF of $S_n$ can be estimated by approximating
$$\frac{S_n - n\mathrm{E}[X]}{\sigma \sqrt{n}}$$
as a standard normal random variable. More specifically, we have
$$F_{S_n}(x) = \Pr(S_n \leq x) = \Pr\left( \frac{S_n - n\mathrm{E}[X]}{\sigma \sqrt{n}} \leq \frac{x - n\mathrm{E}[X]}{\sigma \sqrt{n}} \right) \approx \Phi\left( \frac{x - n\mathrm{E}[X]}{\sigma \sqrt{n}} \right),$$
where Φ(·) is the CDF of a standard normal random variable.
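As a concrete instance of this approximation (a sketch with arbitrary parameters), take the $X_i$ to be Bernoulli with $p = 0.3$, so that $S_n$ is binomial, $\mathrm{E}[X] = p$, and $\sigma^2 = p(1-p)$. The exact CDF is a finite sum of binomial probabilities and can be compared with the normal estimate.

```python
from math import comb, erf, sqrt

n, p = 100, 0.3
sigma = sqrt(p * (1 - p))

def binom_cdf(k, n, p):
    """Exact CDF of a binomial(n, p) random variable at integer k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def phi(x):
    """CDF of a standard normal random variable."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for k in (20, 25, 30, 35, 40):
    exact = binom_cdf(k, n, p)
    approx = phi((k - n * p) / (sigma * sqrt(n)))
    print(f"Pr(S_n <= {k}): exact {exact:.4f}, normal approximation {approx:.4f}")
```

A continuity correction (evaluating Φ at $k + 1/2$ instead of $k$) would tighten the agreement further, but even the plain formula tracks the exact values reasonably well for $n = 100$.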

Further Reading

1. Ross, S., A First Course in Probability, 7th edition, Pearson Prentice Hall, 2006: Chapter 8.
2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002: Sections 7.2-7.4.
3. Miller, S. L., and Childers, D. G., Probability and Random Processes with Applications to Signal Processing and Communications, 2004: Chapter 7.

Index

z-transform, 70

Bayes’ rule, 26

Bernoulli random variable, 50

Binomial coeﬃcient, 40

Binomial random variable, 51

Boole inequality, 13

Cardinality, 18

Cartesian product, 6

Cauchy random variable, 108

Central limit theorem, 157

Central moments, 69

Chebyshev inequality, 129

Chernoﬀ bound, 131

Collectively exhaustive, 9

Combination, 39

Combinatorics, 34

Complement, 2

Conditional expectation, 81

Conditional independence, 30

Conditional probability, 22

Conditional probability density function, 139

Conditional probability mass function, 77

Continuous random variable, 94

Convergence in distribution, 153

Convergence in probability, 152

Convolution, 145

Countable, 18

Cumulative distribution function, 92

De Morgan’s laws, 6

Discrete convolution, 85

Discrete random variable, 48

Disjoint sets, 3

Distinct elements, 9

Element, 1

Empty set, 2

Error function, 100

Event, 8

Expectation, 61, 62, 121, 123

Expected value, 60

Experiment, 8

Exponential random variable, 101

Factorial, 37

Functions of random variables, 58, 110

Gamma random variable, 104

Gaussian random variable, 98

Geometric random variable, 55

Hypergeometric random variable, 78

Inclusion-exclusion principle, 12

Independence, 27, 29, 84, 143

Indicator function, 62


Intersection, 3

Jensen inequality, 132

Joint cumulative distribution function,

135

Joint probability density function, 136

Joint probability mass function, 73

Laplace random variable, 107

Law of large numbers, 154

Legendre transformation, 132

Marginal probability mass function, 74

Markov inequality, 128

Mean, 60, 64, 121

Mean square convergence, 152

Measure theory, 19

Memoryless property, 56, 103

Mixed random variable, 95

Moment generating function, 125

Moments, 68

Monotonic function, 111

Multinomial coeﬃcient, 41

Mutually exclusive, 9

Natural numbers, 18

Negative binomial random variable, 89

Normal random variable, 98

Ordered pair, 6

Ordinary generating function, 70, 125

Outcome, 8

Partition, 4, 24, 41, 42

Permutation, 36

Poisson random variable, 52

Power set, 19, 36

Preimage, 49

Probability density function (PDF), 95

Probability law, 11

Probability mass function (PMF), 48, 74

Probability-generating function, 70

Proper subset, 2

Q-function, 100

Random variable, 47, 92

Rayleigh random variable, 106

Sample space, 2, 8

Set, 1

Standard deviation, 66

Stirling’s formula, 38

Strictly monotonic function, 112

Subset, 2

Total probability theorem, 24

Uniform random variable (continuous), 97

Uniform random variable (discrete), 57

Union, 3

Union bound, 13

Value of a random variable, 47

Variance, 66, 121

Weak convergence, 153


36 We note that these events are not pairwise independent because 1 1 Pr(A ∩ B) = = = Pr(A) Pr(B) 6 4 1 1 Pr(A ∩ C) = = = Pr(A) Pr(C) 36 18 1 1 Pr(B ∩ C) = = = Pr(B) Pr(C). 3. a red die and a blue die.4. if Pr(A1 ∩ A2 |B) = Pr(A1 |B) Pr(A2 |B). given event B. the multiple events A. let C be the event that the product of the two dice is twelve. CONDITIONAL PROBABILITY Example 14. the Pr(A) = Pr(r ∈ {1. 4}) = 2 4 Pr(C) = Pr(r × b = 12) = . 12 18 Consequently. Also.30 CHAPTER 3. Let B be the event that the number on the red die is either two. It is therefore possible to discuss independence with respect to conditional probability.2) to write Pr(A1 ∩ A2 |B) = Pr(A1 ∩ A2 ∩ B) Pr(B) Pr(B) Pr(A1 |B) Pr(A2 |A1 ∩ B) = Pr(B) = Pr(A1 |B) Pr(A2 |A1 ∩ B). Pr(A ∩ B ∩ C) = 36 2 2 36 3. 5}) = probability of these three events occurring simultaneously is 1 1 4 1 = · · = Pr(A) Pr(B) Pr(C).2 Conditional Independence We introduced earlier the meaning of conditional probability. Note that if A1 and A2 are conditionally independent. B and C are not independent. Let A be the event that the number on the red die is odd. and we showed that the set of conditional probabilities {Pr(A|B)} speciﬁes a valid probability law. The individual probabilities of these events are 1 2 1 Pr(B) = Pr(r ∈ {2. we can use equation (3. three or four. . We say that events A1 and A2 are conditionally independent. Two dice are rolled at the same time. Still. 3.

The conditional probabilities of A1 and A2 . Two events that are independent with respect to an unconditional probability law may not be conditionally independent. INDEPENDENCE 31 Under the assumption that Pr(A1 |B) > 0. This latter result asserts that.3.4. the event that the number of trials is less than six. Suppose that a fair coin is tossed until heads is observed. In particular. The number of trials is recorded as the outcome of this experiment. we have Pr(A2 |A1 ∩ B) = Pr(A2 |B) Pr(A1 |A2 ∩ B) = Pr(A1 |B). . are 1/3 2 Pr(A1 ∩ B) = = Pr(B) 1/2 3 Pr(A2 ∩ B) 15/32 15 Pr(A2 |B) = = = . we let A1 be the event that the number of trials is an even number. We conclude that A1 and A2 are conditionally independent given B. and A2 . We denote by B the event that the coin is tossed more than one time. It is simple to show that conditional independence is a symmetric relation as well. we can combine the previous two expressions and get Pr(A2 |A1 ∩ B) = Pr(A2 |B). Example 15. Moreover. We emphasize that events A1 and A2 are not independent with respect to the unconditional probability law. Pr(B) 1/2 16 Pr(A1 |B) = The joint probability of events A1 and A2 given B is equal to Pr(A1 ∩ A2 |B) = Pr(A1 ∩ A2 ∩ B) Pr(B) 5/16 5 2 15 = = = · 1/2 8 3 16 = Pr(A1 |B) Pr(A2 |B). the additional information that A1 has also occurred does not aﬀect the likelihood of A2 . given event B has taken place. given that the coin is tossed more than once.
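The probabilities quoted in Example 15 can be reproduced exactly with rational arithmetic, as in the following sketch (Python assumed; the helper names are illustrative only).

```python
from fractions import Fraction

half = Fraction(1, 2)
pmf = lambda k: half ** k                      # Pr(X = k) for a fair coin, k = 1, 2, ...

pr_B = 1 - pmf(1)                              # Pr(X > 1) = 1/2
pr_A1_B = Fraction(1, 4) / (1 - Fraction(1, 4))   # Pr(X even) = sum of (1/4)^j = 1/3
pr_A2_B = sum(pmf(k) for k in range(2, 6))     # Pr(1 < X < 6) = 15/32
pr_A1A2_B = pmf(2) + pmf(4)                    # Pr(X in {2, 4}) = 5/16

# Conditional independence given B: Pr(A1 and A2 | B) = Pr(A1 | B) Pr(A2 | B).
print(pr_A1A2_B / pr_B == (pr_A1_B / pr_B) * (pr_A2_B / pr_B))   # True
```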

B) = Pr(A ∩ B). these two events are independent. . we also represent the probability that two events occur at the same time by Pr(A. 6 whereas the joint conditional probability is Pr({r = 2} ∩ {b = 6}|{r + b is odd}) = 0. Consider the probability of rolling a two on the red die and a six on the blue die given that the sum of the two dice is an odd number. So far. we have expressed the joint probability of events A and B using the notation Pr(A∩B). CONDITIONAL PROBABILITY Example 16. We can easily compute the probability of simultaneously getting a two on the red die and a six on the blue die. . It is possible to extend the notion of conditional independence to several events. . These two events are not conditionally independent.5 Equivalent Notations In the study of probability. This deﬁnition is analogous to (3. . The individual conditional probabilities are given by 1 Pr({r = 2}|{r + b is odd}) = Pr({b = 6}|{r + b is odd}) = . albeit using the appropriate conditional probability law. . 2. n}. . 3. . we are frequently interested in the probability of multiple events occurring simultaneously. A2 . Pr({r = 2} ∩ {b = 6}) = 1 = Pr(r = 2) Pr(b = 6). . For mathematical convenience. An are conditionally independent given B if Pr i∈I Ai B = i∈I Pr(Ai |B) for every subset I of {1. a red die and a blue die.6). . The events A1 . Two dice are rolled at the same time. 36 Clearly.32 CHAPTER 3.

6. Further Reading 1. A. Bn ) = Pr A k=1 Bk . . G.4– 2. N. Ross.5–1.. B2 . and Childers. A2 . D.3. Pearson Prentice Hall.. J. . The probability of A given events B1 . 2004: Sections 2.5. Miller.. Probability and Random Processes for Electrical and Computer Engineers. we use these equivalent notations interchangeably. An by n Pr(A1 .3–1.6. An ) = Pr k=1 Ak .. 3. Probability and Random Processes with Applications to Signal Processing and Communications. Athena Scientiﬁc.. Gubner. Cambridge. . Introduction to Probability. . L.5. . D. . B2 . . S. 2002: Sections 1. 2. Conditional probabilities can be written using a similar format. . and Tsitsiklis. Bn becomes n Pr(A|B1 . EQUIVALENT NOTATIONS 33 This alternate notation easily extends to the joint probability of several events. S. . From this point forward. We denote the joint probability of events A1 . .. . 4. . 7th edition. A2 . A First Course in Probability. . 2006: Sections 1. . P. . . J. . Bertsekas. 2006: Chapter 3.

with an explicit formula for computing probabilities appearing in (2. 34 . Calculating the number of ways that certain patterns can be formed is part of the ﬁeld of combinatorics. While counting outcomes may appear intuitively straightforward. . and it attests of the fact that counting may at times be challenging. computing the probability of an event amounts to counting the number of outcomes comprising this event and dividing the sum by the total number of elements contained in the sample space. This somewhat surprising result can be obtained iteratively through the Fibonacci recurrence relation. . 5 where φ = (1 + √ 5)/2 is the golden ratio. For instance.Chapter 4 Equiprobable Outcomes and Combinatorics The simplest probabilistic scenario is perhaps one where the sample space Ω is ﬁnite and all the possible outcomes are equally likely. In this chapter. This number is equal to distinct subsets of the integers {1. This situation was ﬁrst explored in Section 2.2).1. it is in many circumstances a daunting task. .2. . In such cases. consider the number of integers. 2. n} that do not contain two consecutive φn+2 − (1 − φ)n+2 √ . we introduce useful counting techniques that can be applied to situations pertinent to probability.

heads or tales. . . and the die has That is. Consider the ﬁnite sets S1 . .1: This ﬁgure provides a graphical interpretation of the cartesian contains n elements. . six faces. . product of S = {1. This is illustrated in Figure 4. S2 . Suppose that S and T are ﬁnite sets with m and n elements. The cartesian product of S and T is given by S × T = {(x. then the cartesian product S ×T consists of mn elements. The number of elements in the cartesian product S × T is equal to mn. . The number of possible outcomes for this experiment is 2 × 6 = 12. 1 1 1 1 a 2 b 3 3 3 3 b a 2 2 b 2 a b a Figure 4. The counting principle can be broadened to calculating the number of elements in the cartesian product of multiple sets. sr )|si ∈ Si } . 2.1 The Counting Principle The counting principle is a guiding rule for computing the number of elements in a cartesian product. Consider an experiment consisting of ﬂipping a coin and rolling a die. there are twelve diﬀerent ways to ﬂip a coin and roll a die. Sr and their cartesian product S1 × S2 × · · · × Sr = {(s1 . s2 . In general.1. 3} and T = {a. THE COUNTING PRINCIPLE 35 4. if S has m elements and T Example 17. . y)|x ∈ S and y ∈ T }. There are two possibilities for the coin. .4. b}. . respectively.1.
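For instance, the coin-and-die experiment of Example 17 can be enumerated directly; the following is a minimal Python sketch.

```python
from itertools import product

coin = ["heads", "tails"]
die = range(1, 7)
outcomes = list(product(coin, die))   # the Cartesian product S x T
print(len(outcomes))                  # 12 = 2 * 6
```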

A permutation of S is an . This procedure is repeated k times. a list without repetitions. An urn contains n balls numbered one through n. We wish to compute the number of possible sequences that can result from this experiment. 2}2 has four distinct ordered pairs. The number of permutations of S can be computed as follows. . A ball is drawn from the urn. The power set of S. of S. 1}. . By distinct characteristic functions from S to {0. consider the integer set S = {1. 1}. is the collection of all subsets identifying a function in 2S with the corresponding preimage of one. sr ) is n = n1 n2 · · · nr . then the number of distinct ordered r-tuples of the form (s1 .36CHAPTER 4. we obtain a bijection between 2S and the subsets of S. . Clearly. and its number is recorded on an ordered list. . In particular. For every element of S. namely all the integers in S except the element we Again. 1 2 1 1 1 2 2 2 1 2 Figure 4. we gather that the number of distinct sequences is nk . denoted by 2S . . Using the counting principle. In set theory. n}. there are n distinct possibilities for the ﬁrst item in the list. . s2 . . Hence. The number of possibilities for the second item is n − 1. each function in 2S is the characteristic function of a subset of S. Example 18 (Sampling with Replacement and Ordering). Suppose that S is ﬁnite with n = |S| elements. The ball is then replaced in the urn. Example 19. There are k drawings and n possibilities per drawing.2: The cartesian product {1. There are therefore 2n subsets of S is given by 2n . . 2. 2S represents the set of all functions from S to {0.2 Permutations ordered arrangement of its elements. the number of distinct 4. EQUIPROBABLE OUTCOMES AND COMBINATORICS If we denote the cardinality of Si by ni = |Si |. a characteristic function in 2S is either zero or one.

They can be written as 123. Since the set S possesses three elements.4: Ordering the numbers one. We wish to compute the number of permutations of S = {1. Summarizing. 213. Similarly. 321. 2. 231. This pattern continues until all the elements in S are 1 2 3 1 1 2 2 3 1 3 2 3 2 3 3 3 1 2 1 2 1 Figure 4. Figure 4.3: The power set of {1. n! = n(n − 1) · · · 1. 3}. 312. . They are dis- selected initially. we ﬁnd that the total number of permutations of S is n factorial. 132. it has 3! = 6 diﬀerent permutations. Example 20.2. PERMUTATIONS 37 1 1 2 3 1 2 1 2 2 3 2 3 1 3 3 played above.4. item is n − m + 1. the number of distinct possibilities for the mth recorded. two and three leads to six possible permutations. 2. 3} contains eight elements.

. we can choose one of n elements to be the ﬁrst on. The number of possible sequences is therefore given by n! = n(n − 1) · · · (n − k + 1). and so k ≤ n. √ n! ∼ nn e−n 2πn. . where 1 2 3 4 1 1 1 2 2 3 4 3 2 2 3 3 1 4 1 2 3 4 4 4 4 1 2 3 Figure 4. They are asked to perform two songs at South by Southwest. . We wish to compute the number of song arrangements the group can oﬀer in concert. Abstractly. the number of distinct arrangements is 4!/2! = 12. A good approximation for n! when n is large is given by Stirling’s formula. A recently formed music group has four original songs they can play. EQUIPROBABLE OUTCOMES AND COMBINATORICS 4. n}. The notation an ∼ bn signiﬁes that the ratio an /bn → 1 as n → ∞. 2. .2.2 k-Permutations lowing our previous argument.1 Stirling’s Formula* The number n! grows very rapidly as a function of n. item listed.5: There are twelve 2-permutations of the numbers one through four. We wish to count the number of distinct k-permutations of S. . one of the remaining (n − 1) elements for the second item. The procedure terminates when k items have been recorded. (n − k)! Example 21. Thus. Fol- Suppose that we rank only k elements out of the set S = {1.38CHAPTER 4.2. 4. this is equivalent to computing the number of 2-permutations of four songs.
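The count of k-permutations and the quality of Stirling's approximation are easy to check numerically; the sketch below assumes Python's math module.

```python
import math

# Number of k-permutations of n objects: n! / (n - k)!
def k_permutations(n, k):
    return math.factorial(n) // math.factorial(n - k)

print(k_permutations(4, 2))   # 12 song arrangements, matching 4!/2!

# Stirling's approximation: n! ~ n^n e^{-n} sqrt(2 pi n)
for n in (10, 50, 100):
    approx = n**n * math.exp(-n) * math.sqrt(2 * math.pi * n)
    print(n, math.factorial(n) / approx)   # the ratio tends to 1 as n grows
```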

(2. Note that a k-permutation can be formed by ﬁrst selecting k objects We can compute the number of k-element combinations of S = {1. {2. There are fewer 2-element subsets of S than 2-permutations of S. 4}. This procedure is repeated until k balls are selected from the urn. 3). 2).6: There exist six 2-element subsets of the numbers one through four. 2). 4}. (3. 1). . which is given by n!/(n − k)!. (4. An urn contains n balls numbered one through n. 4). 1). 2. . with Ordering). {2. 2). 3). (2. 2}. 3}. whereas the 2-permutations of S are more numerous with (1. A ball is picked from the urn. A combination is a subset of S. 3). . n}. 4). 4. {1. (3. 3. The 2-element subsets of S = {1. (1. We wish to compute the number of possible sequences that can result from this experiment. {3. (2. Consider the integer set S = {1. COMBINATIONS 39 Example 22 (Sampling without Replacement. . 2. (3. {1. where k ≤ n. 1 2 1 3 4 1 4 2 4 3 3 4 1 2 2 3 Figure 4.3. 4). and its number is recorded on an ordered list. 4} are {1. . . The number of possibilities is equivalent to the number of k-permutations of n elements. (4. n} . 3}.3 Combinations We emphasize that a combination diﬀers from a permutation in that elements in a combination have no speciﬁc ordering. . . (4. (1. 2. 4}. 1). as follows. The ball is not replaced in the urn.4.

where k ≤ n. k!(n − k)! n k the number of k-element combinations of S is coeﬃcients n k Again. Since a combination is also a subset and . Observe that selecting a kto its complement. . .4 Partitions Abstractly. we get n k=0 n k = 2n . An urn contains n balls numbered one through n. We wish to compute the number of distinct combinations the jar can hold after the completion of this experiment. 4. this amounts to counting the number of k-element subsets of a given n-element set. Thus. Consequently. EQUIPROBABLE OUTCOMES AND COMBINATORICS from S and then ordering them. . n}. This process is repeated until the jar contains k balls. 2. A ball is drawn from the urn and placed in a separate jar. Since the total number of k-permutations of S is n!/(n − k)!. Because there is no ordering in the jar. There are k! distinct ways of ordering k objects. let S = {1. The number of k-permutations must then equal the number of k-element combinations multiplied by k!. the sum of the binomial over all values of k must be equal to the number of elements in the power set of S. n−k Example 23 (Sampling without Replacement or Ordering). the number of k-element combinations is found to be n k = n! n(n − 1) · · · (n − k + 1) = . . a combination is equivalent to partitioning a set into two disjoint subsets. one containing k objects and the other containing the n − k remaining . which is given by n k = n! .40CHAPTER 4. k!(n − k)! k! This expression is termed a binomial coeﬃcient. we can write n k = element subset of S is equivalent to choosing the n − k elements that belong n .
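Both the symmetry of the binomial coefficients and the identity relating their sum to the size of the power set can be verified for a small n, as in the following Python sketch.

```python
import math

n = 10
# C(n, k) counts the k-element subsets of an n-element set.
coeffs = [math.comb(n, k) for k in range(n + 1)]
print(coeffs[2] == math.comb(n, n - 2))    # True: C(n, k) = C(n, n - k)
print(sum(coeffs) == 2 ** n)               # True: summing over k gives the size of the power set
```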

n2 . . Let n1 . . We wish to count the number of such partitions. n2 . until no element subset containing exactly ni elements. . we pick a second subset containing n2 elements from the remaining n − n1 elements. We wish to compute the number of possible outcomes for which every odd number appears three times.4. This algorithm yields a partition of S into r subsets. namely 9! = 1680. i=1 Consider the following iterative algorithm that leads to a partition of S. the set S = {1. . 9} into three subsets of size three. Having selected the ﬁrst subset. . . with the ith ways to form the ﬁrst subset. . In general. . Example 24. . Examining our algorithm. 3!3!3! . The number of distinct sequences in which one. . . PARTITIONS 41 elements. nr = n! . . 2.4. the total number of partitions is then given by n n1 n − n1 − · · · − nr−1 n − n1 . A die is rolled nine times. Using the counting principle. 2. . . we see that there n − n1 − · · · − ni−1 ni are exactly ways to form the ith subset. We know that there are n n1 remains. We continue this procedure by successively choosing subsets of ni elements from the residual n − n1 − · · · − ni−1 elements. First. three and ﬁve each appear three times is equal to the number of partitions of {1. ··· nr n2 which after simpliﬁcation can be written as n n1 . . n} can be partitioned into r disjoint subsets. . nr be nonnegative integers such that r ni = n. n1 !n2 ! · · · nr ! This expression is called a multinomial coeﬃcient. we choose a subset of n1 elements from S.
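A small helper (illustrative Python, using only the factorial function) reproduces the count of Example 24.

```python
import math

# Multinomial coefficient: ways to split n items into groups of sizes n1, ..., nr.
def multinomial(*sizes):
    total = math.factorial(sum(sizes))
    for s in sizes:
        total //= math.factorial(s)
    return total

print(multinomial(3, 3, 3))   # 1680 partitions of {1, ..., 9} into three triples
```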

. . each of 1 1 1 1 1 Figure 4. i=1 We can visualize the number of possible assignments as follows. we wish to compute the number of ways integers n1 . nr can be selected such that every integer is nonnegative and r ni = n. before the ﬁrst ball. For example. the number of objects in the last subset is simply the number of balls after the last marker. n2 .42CHAPTER 4. Finally. . Picture n which can be put between two consecutive balls. Suppose instead that we are interested in counting the number of ways to pick the cardinality of the subsets that form the partition. For instance. EQUIPROBABLE OUTCOMES AND COMBINATORICS In the above analysis.7: The number of possible cardinality assignments for the partition of a set of n elements into r distinct subsets is equivalent to the number of ways to select n positions out of n + r − 1 candidates. In this scheme. or after the last ball. the . The number of objects in the ﬁrst subset corresponds to the number balls appearing before the ﬁrst marker. Similarly. we assume that the cardinality of each subset is ﬁxed. the number of objects in the ith subset is equal to the number of balls positioned between the ith marker and the preceding one. Speciﬁcally.7. 1 1 1 1 1 balls space out on a straight line and consider r − 1 vertical markers. if there are ﬁve balls and two markers then one possible assignment is illustrated in Figure 4. . two consecutive markers imply that the corresponding subset is empty.

r−1 Example 25 (Sampling with Replacement.7 is (n1 . COMBINATORIAL EXAMPLES integer assignment associated with the Figure 4. r−1 4.5. n2 . there are n + r − 1 positions. 43 1 1 1 1 1 Figure 4. n+r−1 n = n+r−1 . The number of possible outcomes is therefore given by n+r−1 n = n+r−1 . This procedure is repeated a total of n times. n balls and r − 1 markers. without Ordering). it suﬃces to calculate the number of ways to place the markers and the balls.8: The assignment displayed in Figure 4. The ball is then returned to the urn.7 corresponds to having no element in the ﬁrst set. The number of ways to assign the markers is equal to the number of n-combinations of n + r − 1 elements.5 Combinatorial Examples In this section. . The outcome of this experiment is a table that contains the number of times each ball has come in sight.4. This is equivalent to counting the ways a set with n elements can be partitioned into r subsets. we present a few applications of combinatorics to computing the probabilities of speciﬁc events. An urn contains r balls numbered one through r. In particular. two elements in the second set and three elements in the last one. 3). A ball is drawn from the urn and its number is recorded. To count the number of possible integer assignments. n3 ) = (0. We are interested in computing the number of possible outcomes. There is a natural bijection between an integer assignment and the graphical representation depicted above. 2.
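For small parameters, the count of possible frequency tables can be confirmed by brute force; the sketch below (Python assumed) enumerates all ordered draws and collects the distinct tables.

```python
import math
from itertools import product

n, r = 5, 3   # five draws with replacement from an urn holding r numbered balls

# Each draw sequence yields a table of how many times each ball appeared.
tables = {tuple(seq.count(b) for b in range(r)) for seq in product(range(r), repeat=n)}
print(len(tables), math.comb(n + r - 1, n))   # both equal 21
```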

The probability of winning the Mega Millions Grand Prize is 1 56 5 1 51!5! 1 1 = = . We wish to compute the probability of winning the Mega Millions Grand Prize. All drawing equipment is stored in a secured on-site storage room. Only authorized drawings department personnel have keys to the door. To play the Mega Millions game. 10 10 10 1000 The probability of winning while playing any order depends on the numbers selected. 46 56! 46 175711536 . The probability of winning when playing the exact order is 1 1 1 1 = . the probability of winning becomes 3 2 is 1000 decreases to 1/1000. and choose how to play them: exact order or any order. the Lotto Texas balls are mixed by four acrylic mixing paddles rotating clockwise. and one Mega Ball number from 1 to 46 in the lower yellow play area of the play board. The Pick 3 balls are drawn using three air-driven machines. As each ball is selected. Upon entry of the secured room to begin the drawing process. A player must pick three numbers from zero to nine. it rolls down a chute into an oﬃcial number display area. a player must select ﬁve numbers from 1 to 56 in the upper white play area of the play board. 1000 Finally. the probability of winning Example 27 (Mega Millions Texas Lottery). High speed is used for mixing and low speed for ball selection. These machines employ compressed air to mix and select each ball. The Texas Lottery game “Pick 3” is easy to play. if a same number is selected three times. = 3 . EQUIPROBABLE OUTCOMES AND COMBINATORICS Example 26 (Pick 3 Texas Lottery). If three distinct numbers are selected. then the probability of winning 3 3! = . 1000 500 If a number is repeated twice. which requires the correct selection of the ﬁve white balls plus the gold Mega ball. a lottery drawing specialist examines the security seal to determine if any unauthorized access has occurred.44CHAPTER 4. For each drawing.
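The lottery odds quoted above follow directly from the counting formulas of this chapter; a brief Python check (standard library only) is shown below.

```python
import math
from fractions import Fraction

# Pick 3, exact order: one winning sequence among 10^3 equally likely ones.
print(Fraction(1, 10 ** 3))                     # 1/1000

# Pick 3, any order with three distinct digits: 3! winning sequences.
print(Fraction(math.factorial(3), 10 ** 3))     # 3/500

# Mega Millions Grand Prize: five white balls out of 56 plus the Mega Ball out of 46.
print(Fraction(1, math.comb(56, 5) * 46))       # 1/175711536
```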

COMBINATORIAL EXAMPLES 45 Example 28 (Sinking Boat). As tradition dictates. A total of n people are gathering to attend a two-day. 4. To ﬁnd the probability that the reunion can proceed. Six couples. What is the probability that no two people from a same couple end up on the Mako? Suppose that all the passengers choose a rescue boat independently of one another. Two Coast Guard vessels. . . . The rules of the secret society stipulate that the conference can only proceed on the second day if a working lock-and-key combination can be reconstituted. To solve this problem. Example 29 (The Locksmith Convention*). what is the probability that the reunion can continue its course of action on the second day? There are n attendants on the ﬁrst day. only 729 of these arrangements are such that no two members from a same couple board the Mako. come to the rescue. At the opening ceremony. Since couples act independently. then members 2 and n cannot be there on the second day. the probability that no two people from a same couple are united on the Mako becomes 36 /46 . the boat hits an iceberg and starts sinking slowly. independently of others. and they are all equally likely. Alternatively. and the number of possible non-qualifying arrangements is equal to the number of distinct subsets of the integers {3. secret locksmith convention. If each locksmith decides to attend the second day of the convention with probability half. we label the people attending the opening ceremony using integers one through n. For a given couple. it suﬃces to count the number of ways in which at least two people that were sitting next to each other during the opening ceremony are present.4. In other words. and they exchange keys to their own locks with the two members sitting immediately next to them.5. the probability that the two members do not end up on the Mako is 3/4. . We consider three situations that may occur on the second day and for which the society’s requirements are not met. we can compute the likelihood that this does not occur. every locksmith brings a lock and two keys to the convention. Unfortunately. twelve people total. The number of possible audiences on the second day is therefore 2n . n − 1} that do not contain consecutive . Each passenger must decide which of the two rescue boats to board. all equally likely. all the locksmiths sit around a large table. there are a total of 4096 possible arrangements of people on the Mako after the rescue operation. If member 1 is present. are out at sea on a sail boat. the Ibis and the Mako. according to their positions around the table.

the number of arrangements for which a lock-and-key combination cannot be reconstituted when member n shows up on the second day φn−1 − (1 − φ)n−1 √ .. and is given by φn−1 − (1 − φ)n−1 √ . 2006: Section 1.7. as this is the number of integers. . S. Cambridge. J. P. 2n 5 distinct subsets of the integers {2. and Tsitsiklis. 2002: Section 1.. A First Course in Probability. if both members 1 and n are absent. Ross. . . N. 7th edition. 3. Introduction to Probability. Probability and Random Processes for Electrical and Computer Engineers. n − 1} that do not contain consecutive is Further Reading 1. J. This number can be deduced from the discussion at the beginning of this chapter. 2. D. we gather that the probability of the meeting being cancelled on the second day is 1− 2φn−1 − 2(1 − φ)n−1 + φn − (1 − φ)n √ .. Athena Scientiﬁc. Bertsekas. 5 Finally.46CHAPTER 4. then there are a total of φn − (1 − φ)n √ 5 combinations for which the meeting cannot proceed. Collecting these results.6. 5 By symmetry.. EQUIPROBABLE OUTCOMES AND COMBINATORICS integers. 2006: Chapter 1. Pearson Prentice Hall. . A. Gubner. . 3.

There are six possible outcomes to the rolling of a fair die. The numerical value of a particular outcome is simply called the value of the random variable. the random variable assigns a speciﬁc number to every possible outcome of the experiment. These faces map naturally to the integers one 47 .Chapter 5 Discrete Random Variables Suppose that an experiment and a sample space are given. Example 30. Sample Space 3 4 1 6 7 5 2 Real Numbers Figure 5. A random variable maps each of these outcomes to a real number. A random variable is a real-valued function of the outcome of the experiment. In other words. namely each of the six faces. Because of the structure of real numbers.1: The sample space in this example has seven possible outcomes. it is possible to deﬁne pertinent statistical properties on random variables that otherwise do not apply to probability spaces in general.

2. . If x is a possible value of X then the probability mass of x. Example 31. that is. . . A variable is called discrete if its range is ﬁnite or countably inﬁnite. . The range of this random variable is given by the positive integers {1.1 Probability Mass Functions A discrete random variable X is characterized by the probability of each of the elements in its range. which maps the number of tosses to an integer. The value of the random variable. 1 2 3 4 5 6 Figure 5. it can only take a ﬁnite or countable number of values. Consider the experiment where a coin is tossed repetitively until heads is observed. which we denote by pX (·). in this case.48 CHAPTER 5. We identify the probabilities of individual elements in the range of X using the probability mass function (PMF) of X. DISCRETE RANDOM VARIABLES through six.}. A simple class of random variables is the collection of discrete random variables. The corresponding function. is a discrete random variable that takes a countable number of values. is simply the number of dots that appear on the top face of the die. 5.2: This random variable takes its input from the rolling of a die and assigns to each outcome a real number that corresponds to the number of dots that appear on the top face of the die.

PROBABILITY MASS FUNCTIONS written pX (x). (5. X −1 (x) denotes the preimage of x deﬁned by {ω ∈ Ω|X(ω) = x}.3: The probability mass of x is given by the probability of the set of all outcomes which X maps to x. pX (x) = Pr(X −1 (x)) = Pr({ω ∈ Ω|X(ω) = x}).3) This equation oﬀers an explicit formula to compute the probability of any . Let X(Ω) denote the collection of all the possible numerical values X can take. In general. as x ranges over all the possible values in X(Ω).5. is deﬁned by pX (x) = Pr({X = x}) = Pr(X = x). we can write pX (x) = 1. we can think of pX (x) as the probability of the set of all outcomes in Ω for which X is equal to x. x∈X(Ω) (5. this set is known as the range of X.1) Equivalently. 49 (5.2) We emphasize that the sets deﬁned by {ω ∈ Ω|X(ω) = x} are disjoint and form a partition of the sample space Ω. Here. provided that X is discrete. x∈S (5. This is not to be confused with the inverse of a bijection. we can write Pr(S) = Pr ({ω ∈ Ω|X(ω) ∈ S}) = subset of X(Ω). pX (x). Thus. if X is a discrete random variable and S is a subset of X(Ω).2) follows immediately from the countable additivity axiom and the normalization axiom of probability laws. Using this notation. Sample Space 7 5 2 4 6 1 3 x Figure 5.1.

The PMF of random variable X is given by 1 pX (3) = pX (4) = pX (5) = . Example 33. DISCRETE RANDOM VARIABLES Example 32.50 CHAPTER 5. Then. 1}. Note that these outcomes are equiprobable. 3). 5. We employ X to represent the sum of the two selected numbers. (2. Ω = {(1. if x = 1 . 5}) = pX (3) + pX (5) = . Consider the ﬂipping of a biased coin. X(Ω) = {0. 2 Pr(S) = Pr({3. 1 − p. if x = 0 p. (3. discrete random variables occur primarily in situations where counting is involved.2. then the probability of the sum being odd can be computed as follows. A random variable that maps heads to one and tails to zero is a Bernoulli random variable with parameter p. In fact. 1). (2. 1). Let X be a random variable that takes on only two possible numerical values. We wish to ﬁnd the probability that the sum of the two selected numbers is odd. 3 If S denotes the event that the sum of the two numbers is odd. In general.2 Important Discrete Random Variables A number of discrete random variables appears frequently in problems related to probability. (3. Let Ω be the set of ordered pairs corresponding to the possible outcomes of the experiment. An urn contains three balls numbered one. every Bernoulli random variable is equivalent to the tossing of a coin. Two balls are drawn from the urn without replacement. (1. 2)}. for which heads is obtained with probability p and tails is obtained with probability 1 − p.1 Bernoulli Random Variables The ﬁrst and simplest random variable is the Bernoulli random variable. two and three. X is a Bernoulli random variable and its PMF is given by pX (x) = where p ∈ [0. and they are worth getting acquainted with. 3 5. 1]. 2). These random variables arise in many diﬀerent contexts. 3).

and it is independent from bottle to bottle.2.2 Binomial Random Variables Multiple independent Bernoulli random variables can be combined to construct more sophisticated random variables.4 0. The PMF of X is given by pX (k) = Pr(X = k) = n k p (1 − p)n−k .6 0. IMPORTANT DISCRETE RANDOM VARIABLES Bernoulli Random Variable Probability Mass Function 0. We wish to ﬁnd the PMF of the number of winning bottles. k Example 34. n k=0 n k p (1 − p)n−k = (p + (1 − p))n = 1. . The random variable X is binomial with parameters n = 8 and p = 1/4.2 0.3 0. Suppose X is the sum of n independent and identically distributed Bernoulli random variables. k where k = 0.5. we want to compute the probability of winning more than $4. .1 0 -1 0 p = 0. We can easily verify that X fulﬁlls the normalization axiom.2.25 51 1 2 Value 3 4 Figure 5. The Brazos Soda Company creates an “Under the Caps” promotion whereby a customer can win an instant cash prize of $1 by looking under a bottle cap. which we denote by X. The likelihood to win is one in four. n. 5.25. Then X is called a binomial random variable with parameters n and p. . 1.7 0. A customer buys eight bottles of soda from this company.8 0.4: The PMF of a Bernoulli random variable appears above for parameter p = 0. Also.5 0. .

3 0. k 48 where k = 0. .25 4 Value 6 8 Figure 5. k! The probability mass function of a Poisson random variable is given by pX (k) = k = 0.2. 1. . 1.5: This ﬁgure shows a binomial random variable with parameters n = 8 and p = 0. Its PMF is given by pX (k) = 8 k 1 4 k 3 4 8−k = 8 38−k .1 0.52 CHAPTER 5. using Taylor series expansion. DISCRETE RANDOM VARIABLES Binomial Random Variable Probability Mass Function 0.15 0. .35 0. . The Poisson random variable is of fundamental importance when counting the number of occurrences of a phenomenon within a certain time period. The probability of winning more than $4 is 8 Pr(X > 4) = k=5 8 38−k .05 0 0 2 p = 0. we have ∞ k=0 pX (k) = ∞ k=0 λk λk −λ e = e−λ = e−λ eλ = 1. . Note that. k 48 5. k! k! k=0 ∞ which shows that this PMF fulﬁlls the normalization axiom of probability laws.25. where λ is a positive number.2 0.25 0. . 8.3 Poisson Random Variables λk −λ e . It ﬁnds extensive use in networking. . . inventory management and queueing applications.

IMPORTANT DISCRETE RANDOM VARIABLES Poisson Random Variable Probability Mass Function 0. N is a Poisson random λk −λ e . For k = 1. Fix λ and let pn = λ/n. 1. .6: This ﬁgure shows the PMF of a Poisson random variable with parameter λ = 2.2 0.3 0. . Requests at an Internet server arrive at a rate of λ connections per second.5. k! The probability that no service requests arrive in one second is simply given by pN (k) = pN (0) = e−λ .05 0 0 2 4 Value 6 8 l=2 53 Figure 5. We wish to ﬁnd the probability that no service requests arrive during a time interval of one second.25 0.1 0.15 0. . . . we deﬁne the PMF of the random variable Xn as pXn (k) = Pr(Xn = k) = = = n! k!(n − k)! λ n n k pn (1 − pn )n−k k k variable with PMF 1− 1− λ n λ n n−k n−k λk n! nk (n − k)! k! . . Example 35. The number of service requests per second is modeled as a random variable with a Poisson distribution. 8. . Note that the values of the PMF are only present for k = 0. 2.2. . By assumption. It is possible to obtain a Poisson random variable as the limit of a sequence of binomial random variables. Let N be a random variable that represents the number of requests that arrives within a span of one second. n.

Thus. DISCRETE RANDOM VARIABLES In the limit. Example 36.05 0 0 2 4 n=5 n = 15 n = 25 n = 35 Poisson 6 8 Value 10 12 14 Figure 5.25 0. as n approaches inﬁnity. with a . This discussion suggests that the Poisson PMF can be employed to approximate the PMF of a binomial random variable in certain circumstances.1 0. we get lim pXn (k) = lim λk n! n→∞ nk (n − k)! k! 1− λ n n−k n→∞ = λk −λ e . k! where λ = np.7: The levels of a binomial PMF with parameter p = λ/n converge to the probabilities of a Poisson PMF with parameter λ as n increases to inﬁnity. If n is large and p is small then the probability that Y equals k can be approximated by pY (k) = n! λk nk (n − k)! k! 1− λ n n−k ≈ λk −λ e . The probability of a bit error on a communication channel is equal to 10−2 . the sequence of binomial random variables {Xn } converges in distribu- Convergence to Poisson Distribution Probability Mass Function 0.45 0. The transmission of each bit can be modeled as a Bernoulli trial. The latter expression can be computed numerically in a straightforward manner. Suppose that Y is a binomial random variable with parameters n and p.35 0.3 0.15 0. Assume that the probability of individual errors is independent from bit to bit. k! tion to the Poisson random variable X. We wish to approximate the probability that a block of 1000 bits has four or more errors.54 CHAPTER 5.4 0.2 0.
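The convergence of the binomial PMF to the Poisson PMF can also be observed numerically; the sketch below (Python assumed) fixes λ and lets n grow with p = λ/n.

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return lam**k / math.factorial(k) * math.exp(-lam)

lam, k = 2.0, 3
for n in (5, 15, 25, 35, 1000):
    print(n, binom_pmf(n, lam / n, k))   # approaches the Poisson value below as n grows
print(poisson_pmf(lam, k))               # lambda^k e^{-lambda} / k! = 0.1804...
```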

4 Geometric Random Variables Consider a random experiment where a Bernoulli trial is repeated multiple times until a one is observed. A customer purchases one bottle of soda from the Brazos Soda Company and thereby enters the contest. k = 1. Thus. We wish to ﬁnd the PMF of the number of bottles obtained by this customer. IMPORTANT DISCRETE RANDOM VARIABLES 55 zero indicating a correct transmission and a one representing a bit error. and it is independent from bottle to bottle.2. The random variable X is a geometric random variable. We stress that (1 − p)k−1 p simply represents the probability of obtaining a sequence of k − 1 zero immediately followed by a one. The number of trials carried out before completion.2.5. Example 37. a customer can win an additional bottle of soda by looking under the cap of her bottle. This time. . . . is recorded as the outcome of this experiment. the probability of getting a one is equal to p and the probability of getting a zero is 1 − p. For every extra bottle of soda won by looking under the cap. . 2. the customer gets an additional chance to play. and its PMF is given by pX (k) = (1 − p)k−1 p. 500 3 5. The total number of errors in 1000 transmissions then corresponds to a binomial random variable with parameters n = 1000 and p = 10−2 . The probability of making four or more errors can be approximated using a Poisson random variable with constant λ = np = 10.01034. which we denote by X. The probability to win is 1/5. we can approximate the probability that a block of 1000 bits has four or more errors by 3 Pr(N ≥ 4) = 1 − Pr(N < 4) ≈ 1 − k=0 λk −λ e k! = 1 − e−10 1 + 10 + 50 + ≈ 0. At each time step. The Brazos Soda Company introduces another “Under the Caps” promotion.

Memoryless Property: A remarkable aspect of the geometric random variable is that it satisﬁes the memoryless property. 12.1 0. . 1 5 k−1 4 . We can write the conditional probability of X as Pr(X = k + j|X > k) = Pr({X = k + j} ∩ {X > k}) Pr(X > k) (1 − p)k+j−1 p Pr(X = k + j) = = Pr(X > k) (1 − p)k = (1 − p)j−1 p = Pr(X = j). Pr(X = k + j|X > k) = Pr(X = j). This can be established using the deﬁnition of conditional probability. 5 .05 0 0 2 4 6 8 Value 10 12 p = 0. 2. The values of the PMF are only present for k = 1. . .56 CHAPTER 5.25 Figure 5. Let X denote the total number of bottles obtained by the customer. . It is plotted above for p = 0. . . DISCRETE RANDOM VARIABLES Geometric Random Variable Probability Mass Function 0. The random variable X is geometric and its PMF is pX (k) = where k = 1.3 0. .25 0. Let X be a geometric random variable with parameter p.8: The PMF of a geometric random variable is a decreasing function of k. 2.15 0. and assume that k and j are positive integers.2 0.25.
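The memoryless property can be confirmed with exact rational arithmetic, using the fact that Pr(X > k) = (1 - p)^k for a geometric random variable; the following is a minimal Python sketch.

```python
from fractions import Fraction

p = Fraction(1, 4)
pmf = lambda k: (1 - p) ** (k - 1) * p      # geometric PMF, k = 1, 2, ...
tail = lambda k: (1 - p) ** k               # Pr(X > k)

k, j = 3, 5
lhs = pmf(k + j) / tail(k)                  # Pr(X = k + j | X > k)
print(lhs == pmf(j))                        # True: the memoryless property
```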

06 0. We encountered speciﬁc cases of this random variable before. It can be shown that the geometric random variable is the only discrete random variable that possesses the memoryless property. . n}.1 0. Let X be a uniform random variable taking value over X(Ω) = {1.02 0 0 1 2 3 n=8 4 5 Value 6 7 8 9 Figure 5. .04 0. 2. is equal to the unconditioned probability that the total number of trials is j. . .08 0.14 0.12 0. 2. otherwise. . The tossing of a fair coin and the rolling of a die can both be used to construct uniform random variables. IMPORTANT DISCRETE RANDOM VARIABLES 57 In words.2. n 0. Its PMF is therefore given by pX (k) = 1/n. .2. .16 0. 5.9: A uniform random variable corresponds to the situation where all the possible values of the random variable are equally likely. . .5.5 Discrete Uniform Random Variables A ﬁnite random variable where all the possible outcomes are equally likely is called a discrete uniform random variable. Uniform Random Variable Probability Mass Function 0. the probability that the number of trials carried out before completion is k + j. It is shown above for the case where n = 8. given k unsuccessful trials. if k = 1.

Suppose that . Example 38. Let X be a random variable and let Y = g(X) = aX + b. In particular. Y is an aﬃne function of X.3 Functions of Random Variables Recall that a random variable is a function of the outcome of an experiment. if ω ∈ Ω is the outcome Y (ω) = g(x). is obtained as follows.4) otherwise. pY (y) is non-zero only if there exists an x ∈ X(Ω) such that g(x) = y and pX (x) > 0. DISCRETE RANDOM VARIABLES 5. If X is a random variable then Y = g(X) is also a random variable since it associates a numerical value to every outcome of the experiment. In this ﬁgure.58 CHAPTER 5. and the number of elements in g(X(Ω)) is no greater than the number of elements in X(Ω). In particular. Sample Space 4 5 2 3 1 X Y = g(X) of the experiment. it is possible to create a new random variable Y by applying a real-valued function g(·) to X. pY (y) = 0. If y ∈ g(X(Ω)) then pY (y) = {x∈X(Ω)|g(x)=y} pX (x). The PMF of Y . which we represent by pY (·). (5. Y is obtained by passing random variable X through a function g(·). where a and b are constant. The set of possible values Y can take is denoted by g(X(Ω)). if X is a discrete random variable. Furthermore. then so is Y . That is. Given a random variable X.10: A real-valued function of a random variable is a random variable itself. then X takes on value x = X(ω) and Y takes on value Figure 5.

. P. . the value of a typical fare.40 for each additional unit (one-ﬁfth of a mile). 7th edition. . Athena Scientiﬁc. .00. . 10. 59 Linear and aﬃne functions of random variables are commonplace in applied probability and engineering. Introduction to Probability. Pearson Prentice Hall. 2.1–2. and it is necessarily zero for any other argument. . Similarly.50. J. We wish to ﬁnd the PMF of Y . Probability and Random Processes with Applications to Signal Processing and Communications. . L. The distance of a ride with a typical client is a discrete uniform random variable taking value over $2.5 + 2X dollars for the cab driver. 4.50 upon entry and $0. 2006: Sections 4.. S. 2. 3. 2004: Section 2.. Further Reading 1. 2. a ride of X miles will generate a revenue of Y = 2. . J. Example 39. 4. Gubner. G. D. 2002: Sections 2.5.1–1.5 + 2k) = 1 10 X(Ω) = {1. D. A.8. Miller.. 2006: Sections 2.2. FUNCTIONS OF RANDOM VARIABLES a = 0. . Ross. S. The smallest possible fare is therefore $4. with probability 1/10. A taxi driver works in New York City. Probability and Random Processes for Electrical and Computer Engineers. Cambridge. is for k = 1.. Traveling one mile is ﬁve units and costs an extra $2. and Childers.6–4. The metered rate of fare. N. .1–4.7. according to city rules. then the probability of Y being equal to value y is given by pY (y) = pX y−b a . A First Course in Probability. The PMF of Y is thus given by pY (2. Bertsekas.3.. and Tsitsiklis.3.2. 10}.

one is typically interested not necessarily in every individual data point. as deﬁned in (6. (6. As mentioned above. 60 .1 Expected Values The expected value E[X] of discrete random variable X is deﬁned by E[X] = x∈X(Ω) xpX (x).1). One way to accomplish this task and thereby obtain meaningful descriptive quantities is through the expectation operator. The same is true for random variables. The expected value of a random variable.1) whenever this sum converges absolutely. then X is said not to possess an expected value. but rather in certain descriptive quantities such as the average or the median. is also called the mean of X. the expected value E[X] provides insightful information about the underlying random variable X without giving a comprehensive and overly detailed description. it is a function of the PMF of X. It is important to realize that E[X] is not a function of random variable X.Chapter 6 Meeting Expectations When a large collection of data is gathered. The PMF of discrete random variable X provides a complete characterization of the distribution of X by specifying the probability of every possible value of X. rather. Still. 6. it may be desirable to summarize the information contained in the PMF of a random variable. If this sum is not absolutely convergent.

This expected value. with the number of dots appearing on the top face taken as the value of the corresponding random variable. . Hence. we can ﬁnd the derived distribution . determining the expectation of a random variable requires as input its PMF. Suppose that X is a discrete random variable. 6 12 In other words. where k is a positive integer. A fair die is rolled once. The notion of an expectation can be combined with traditional functions to create alternate descriptions and other meaningful quantities. The expected value of the roll can be computed as 6 k=1 k 42 = = 3. . 2. E[g(X)]. FUNCTIONS AND EXPECTATIONS 61 Example 40. There are several additional examples. in this case.5. Example 41. Assume that a fair coin is ﬂipped repetitively until heads is observed.2. 2k The expected number of tosses until the coin produces heads is equal to two.5. a detailed characterization of the random variable. The expected value of k = 2. Let g(·) be a real-valued function on the range of X. the PMF of X this geometric random variable can be computed as E[X] = ∞ k=1 is equal to pX (k) = 2−k . In general. The value of random variable X is taken to be the total number of tosses performed during this experiment. is a scalar quantity that can also provide partial information about the distribution of X. One way to determine the expected value of g(X) is to ﬁrst note that Y = g(X) is itself a random variable. .2 Functions and Expectations The mean forms one instance where the distribution of a random variable is condensed into a scalar quantity. and consider the expectation of g(X). Recall that. Thus.}. The possible values for X are therefore given by X(Ω) = {1. the mean of this random variable is 3. 6. its mean.6. and returns a much simpler scalar attribute. computing the expected value of the random variable yields a concise summary of its overall behavior.

2) It is worth re-emphasizing that there exist random variables and functions for which the above sum does not converge. notice that the mean E[X] is a special case of (6. 0. This alternate way of computing the probability of an event can sometimes be employed to solve otherwise diﬃcult probability problems. = x∈S∩X(Ω) That is. the expectation of g(X) can be expressed as E [g(X)] = x∈X(Ω) g(x)pX (x). Let S be a subset of the real numbers. Also. The expectation of a constant is always the constant itself. Example 42. x∈S x ∈ S.2) is the case where the function g(·) is a constant. Yet. We explore pertinent examples below. and deﬁne the indicator function of S by 1S (x) = The expectation of 1S (X) is equal to E [1S (X)] = x∈X(Ω) 1. the expectation of the indicator function of S is simply the probability that X takes on a value in S. Hence the deﬁnition of E[g(X)] given above subsumes our original description of E[X]. Example 43. / 1S (x)pX (x) pX (x) = Pr(X ∈ S). In such cases. (6. The simplest possible scenario for (6. MEETING EXPECTATIONS of Y and then apply the deﬁnition of expected value provided in (6. The expectation of g(X) = c becomes E[c] = x∈X(Ω) cpX (x) = c x∈X(Ω) pX (x) = c.1). we simply say that the expected value of g(X) does not exist. where g(X) = X. The last inequality follows from the normalization axiom of probability laws. .1). which appeared in (6.62 CHAPTER 6. there is a more direct way to compute this quantity.2).
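The general formula, the indicator identity E[1_S(X)] = Pr(X ∈ S) and the expectation of a constant are all easy to verify on a small example; the sketch below (Python assumed) uses a fair die as the underlying random variable.

```python
from fractions import Fraction

# PMF of a fair die.
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

expect = lambda g: sum(g(x) * p for x, p in p_X.items())

print(expect(lambda x: x))               # the mean: 7/2
print(expect(lambda x: x in {3, 5}))     # expectation of an indicator = Pr(X in S) = 1/3
print(expect(lambda x: 4))               # the expectation of a constant is the constant
```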

Let random variable Y be defined by applying real-valued function g(·) to X, with Y = g(X). The mean of Y is equal to the expectation of g(X), and we know from our ongoing discussion that this value can be obtained by applying two different formulas. To ensure consistency, we verify that these two approaches lead to the same answer. First, we can apply (6.1) directly to Y and obtain

    E[Y] = \sum_{y \in g(X(\Omega))} y \, p_Y(y),

where pY(·) is the PMF of Y provided by (5.3). Alternatively, using (6.2), we have

    E[Y] = E[g(X)] = \sum_{x \in X(\Omega)} g(x) \, p_X(x).

We prove that these two expressions describe the same answer as follows. Recall that the PMF of Y evaluated at y is obtained by summing the values of pX(·) over all x ∈ X(Ω) such that g(x) = y. Mathematically, this can be expressed as

    p_Y(y) = \sum_{\{x \in X(\Omega) \mid g(x) = y\}} p_X(x).

Using this equality, we can write

    E[Y] = \sum_{y \in g(X(\Omega))} y \, p_Y(y)
         = \sum_{y \in g(X(\Omega))} y \sum_{\{x \in X(\Omega) \mid g(x) = y\}} p_X(x)
         = \sum_{y \in g(X(\Omega))} \sum_{\{x \in X(\Omega) \mid g(x) = y\}} y \, p_X(x)
         = \sum_{y \in g(X(\Omega))} \sum_{\{x \in X(\Omega) \mid g(x) = y\}} g(x) \, p_X(x)
         = \sum_{x \in X(\Omega)} g(x) \, p_X(x) = E[g(X)].

Note that first summing over all possible values of Y and then over the preimage of every y ∈ Y(Ω) is equivalent to summing over all x ∈ X(Ω). Hence, we have shown that computing the expectation of a function using the two methods outlined above leads to the same solution.

Example 44. Brazos Extreme Events Radio creates the "Extreme Trio" contest. To participate, a person must fill out an application card. Three cards are drawn from the lot and each winner is awarded $1,000. While a same participant can send multiple cards, he or she can only win one grand prize.

At the time of the drawing, the radio station has accumulated a total of 100 cards. David, an over-enthusiastic listener, is accountable for half of these cards. We wish to compute the amount of money David expects to win under this promotion. Let X be the number of cards drawn by the radio station that were written by David. The PMF of X is given by

    p_X(k) = \frac{\binom{50}{k} \binom{50}{3-k}}{\binom{100}{3}}, \qquad k \in \{0, 1, 2, 3\}.

The money earned by David can be expressed as g(k) = 1000 min{k, 1}. It follows that the expected amount of money he receives is equal to

    \sum_{k=0}^{3} \left( 1000 \min\{k, 1\} \right) p_X(k) = 1000 \cdot \frac{29}{33}.

Alternatively, we can define Y = 1000 min{X, 1}. Clearly, Y can only take on one of two possible values, 0 or 1000. Evaluating the PMF of Y, we get pY(0) = pX(0) = 4/33 and pY(1000) = 1 − pY(0) = 29/33. The expected value of Y is equal to

    0 \cdot p_Y(0) + 1000 \cdot p_Y(1000) = 1000 \cdot \frac{29}{33}.

As anticipated, both methods lead to the same answer. The expected amount of money won by David is roughly $878.79.
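The figure 1000 · 29/33 ≈ $878.79 can be reproduced directly from the hypergeometric PMF above; the check below is a minimal Python sketch.

```python
from fractions import Fraction
from math import comb

# PMF of X, the number of David's cards among the three drawn (hypergeometric).
p_X = {k: Fraction(comb(50, k) * comb(50, 3 - k), comb(100, 3)) for k in range(4)}

winnings = sum(1000 * min(k, 1) * p for k, p in p_X.items())
print(winnings, float(winnings))    # 29000/33, about 878.79
```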

6.2.1 The Mean

As seen at the beginning of this chapter, the simplest non-trivial expectation is the mean. We provide two additional examples for the mean, and we explore a physical interpretation of its definition below.

Example 45. Let X be a geometric random variable with parameter p and PMF

    p_X(k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots

The mean of this random variable is

    E[X] = \sum_{k=1}^{\infty} k (1-p)^{k-1} p = p \sum_{k=1}^{\infty} k (1-p)^{k-1} = \frac{1}{p}.

Example 46. Let X be a binomial random variable with parameters n and p. The PMF of X is given by

    p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.

The mean of this binomial random variable can therefore be computed as

    E[X] = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k}
         = \sum_{k=1}^{n} \frac{n!}{(k-1)!(n-k)!} p^k (1-p)^{n-k}
         = \sum_{\ell=0}^{n-1} \frac{n!}{\ell!(n-1-\ell)!} p^{\ell+1} (1-p)^{n-1-\ell}
         = np \sum_{\ell=0}^{n-1} \binom{n-1}{\ell} p^{\ell} (1-p)^{n-1-\ell} = np.

Notice how we rearranged the sum into a familiar form to compute its value.

It can be insightful to relate the mean of a random variable to classical mechanics. Let X be a random variable and suppose that, for every x ∈ X(Ω), we place an infinitesimal particle of mass pX(x) at position x along a real line. The mean of random variable X as defined in (6.1) coincides with the center of mass of the system of particles.

Example 47. Let X be a Bernoulli random variable such that

    p_X(x) = \begin{cases} 0.25, & \text{if } x = 0 \\ 0.75, & \text{if } x = 1. \end{cases}

The mean of X is given by E[X] = 0 · 0.25 + 1 · 0.75 = 0.75. Consider a two-particle system with masses m1 = 0.25 and m2 = 0.75, respectively. In the coordinate system illustrated below, the particles are located at positions x1 = 0 and x2 = 1. From classical mechanics, we know that their center of mass can be expressed as

    \frac{m_1 x_1 + m_2 x_2}{m_1 + m_2} = 0.75.

As anticipated, the center of mass corresponds to the mean of X.

Suppose X is a Bernoulli random variable with parameter p. In general. which we denote by Var[X]. MEETING EXPECTATIONS x1 Center of Mass x2 R Figure 6. Let X be a Poisson random variable with parameter λ. the mean of a discrete random variable corresponds to the center of mass of the associated particle system 6.1: The center of mass on the ﬁgure is indicated by the tip of the arrow.66 CHAPTER 6. ℓ! .2. the variance is always nonnegative. and it is often denoted by σ. For discrete random variables. Example 49. it can be computed explicitly as Var[X] = x∈X(Ω) (x − E[X])2 pX (x). Example 48.2 The Variance A second widespread descriptive quantity associated with random variable X is its variance. It provides a measure of the dispersion of X around its mean. Its variance is given by Var[X] = (1 − p)2 · p + (0 − p)2 · (1 − p) = p(1 − p). (6. The square root of the variance is referred to as the standard deviation of X.3) Evidently. It is deﬁned by Var[X] = E (X − E[X])2 . The mean of X is given by E[X] = ∞ k =λ k=0 ∞ ℓ=0 λk −λ e = k! ℓ ∞ k=1 λk e−λ (k − 1)! λ −λ e = λ. We can compute the mean of X as E[X] = 1 · p + 0 · (1 − p) = p.

2. This can be computed using (6.3 Aﬃne Functions Proposition 3. (k − 2)! Both the mean and the variance of a Poisson random variable are equal to its parameter λ. FUNCTIONS AND EXPECTATIONS The variance of X can be calculated as Var[X] = = ∞ k=0 ∞ k=0 67 (k − λ)2 λk −λ e k! λk −λ e k! λ2 + k(1 − 2λ) + k(k − 1) ∞ k=2 = λ − λ2 + λk e−λ = λ. Suppose X is a random variable with ﬁnite mean. Let Y be the aﬃne function of X deﬁned by Y = aX + b. The mean of random variable Y is equal to E[Y ] = aE[X] + b. where a and b are ﬁxed real numbers. It is not much harder to show that the expectation is a linear functional. Proof.2). This demonstrates that the expectation is both homogeneous and additive. . 6.2. E[Y ] = x∈X(Ω) (ax + b)pX (x) xpX (x) + b x∈X(Ω) x∈X(Ω) =a pX (x) = aE[X] + b.6. Suppose that g(·) and h(·) are two real-valued functions such that E[g(X)] and E[h(X)] both exist. We can write the expectation of ag(X) + h(X) as E[ag(X) + h(X)] = x∈X(Ω) (ag(x) + h(x))pX (x) g(x)pX (x) + x∈X(Ω) x∈X(Ω) =a h(x)pX (x) = aE[g(X)] + E[h(X)]. We can summarize this property with E[aX + b] = aE[X] + b.

Suppose that the variance of X exists and is ﬁnite.3) applied to Y = aX + b. Var[X] = E [X 2 ] − (E[X])2 . Proposition 5. Consider (6. it is shift invariant. A translation of the argument by b does not aﬀect the variance of Y = aX + b. in other words.4) Incidentally. we can expand the variance of X as follows. The nth moment of random variable X is deﬁned by E[X n ] = x∈X(Ω) xn pX (x). Proof. where a and b are constants.68 CHAPTER 6. = x∈X(Ω) = a2 x∈X(Ω) The variance of an aﬃne function only depends on the distribution of its argument and parameter a. Assume that X is a random variable with ﬁnite mean and variance. Var[Y ] = x∈X(Ω) (ax + b − E[aX + b])2 pX (x) (ax + b − aE[X] − b)2 pX (x) (x − E[X])2 pX (x) = a2 Var[X]. 2 = x∈X(Ω) = x∈X(Ω) xpX (x) + (E[X])2 x∈X(Ω) x∈X(Ω) pX (x) =E X 2 . the mean of random variable X is its ﬁrst moment.3 Moments The moments of a random variable X are likewise important quantities used in providing partial information about the PMF of X. 6. The variance of Y is equal to Var[Y ] = a2 Var[X]. Starting from (6. and let Y be the aﬃne function of X given by Y = aX + b. (6. MEETING EXPECTATIONS Proposition 4.3). Proof. The variance of random variable X can be expressed in terms of its ﬁrst two moments. Var[X] = x∈X(Ω) (x − E[X])2 pX (x) x2 − 2xE[X] + (E[X])2 pX (x) x2 pX (x) − 2E[X] − (E[X]) .
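Proposition 5 is easy to confirm numerically; the sketch below (Python assumed) compares the defining formula for the variance with E[X²] − (E[X])² for a Bernoulli random variable with p = 3/4.

```python
from fractions import Fraction

p_X = {0: Fraction(1, 4), 1: Fraction(3, 4)}          # a Bernoulli PMF with p = 3/4
expect = lambda g: sum(g(x) * p for x, p in p_X.items())

mean = expect(lambda x: x)
var_def = expect(lambda x: (x - mean) ** 2)           # definition of the variance
var_alt = expect(lambda x: x ** 2) - mean ** 2        # Var[X] = E[X^2] - (E[X])^2
print(var_def == var_alt == Fraction(3, 16))          # True: p(1 - p) = 3/16
```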

Although these quantities will not play a central role in our exposition of probability. . 12 = k2 − n 2 k=1 n+1 2 n+1 2 Closely related to the moments of a random variable are its central moments. .4 Ordinary Generating Functions In the special yet important case where X(Ω) is a subset of the non-negative integers.4. This allows the straightforward application of standard sums from calculus. Let X be a uniform random variable with PMF pX (k) = 1/n. 6. The mean of this uniform random variable is equal to E[X] = n+1 .6. . which is a measure of asymmetry. The central moments are used to deﬁne the skewness of random variable. 2. . if k = 1. . We include below an example where the above formula for the variance is applied. The variance is an example of a central moment. it is occasionally useful to employ the ordinary generating function. modestsize deviations. as we can see from deﬁnition (6. they each reveal a diﬀerent characteristic of a random variable and they are encountered frequently in statistics. and its kurtosis. The kth central moment of X is deﬁned by E (X − E[X])k . n 0.3). otherwise. Example 50. which assesses whether the variance is due to infrequent extreme deviations or more frequent. ORDINARY GENERATING FUNCTIONS 69 This alternate formula for the variance is sometimes convenient for computational purposes. 2 n 2 The variance of X can then be obtained as Var[X] = E X 2 − (E[X])2 = n(n + 1)(2n + 1) − 6n n2 − 1 = .

5) It is also called the probability-generating function because of the following property. The probability Pr(X = k) = pX (k) can be recovered from the corresponding generating function GX (z) through Taylor series expansion.5). As a preview of what lies ahead. Let S be a binomial random variable with parameters n and p. Let X be a Bernoulli random variable with parameter p. dz k Comparing this equation to (6. k . we have ∞ k=0 1 dk GX (0). for |z| ≤ 1. we have GX (z) = ∞ k=0 1 k! dk GX (0) z k . MEETING EXPECTATIONS This function bears a close resemblance to the z-transform and is deﬁned by GX (z) = E z X = ∞ k=0 z k pX (k).70 CHAPTER 6. Within the radius of convergence of GX (z). Example 52. The ordinary generating function of S can be computed as n n GS (z) = k=0 n z pS (k) = k=0 k zk n k p (1 − p)n−k k = k=0 n (pz)k (1 − p)n−k = (1 − p + pz)n . The ordinary generating function of X is given by GX (z) = pX (0) + pX (1)z = 1 − p + pz. k! dz k ∞ k=0 z k pX (k) ≤ ∞ k=0 |z|k pX (k) ≤ pX (k) = 1 and hence the radius of convergence of any probability-generating function must include one. The ordinary generating function plays an important role in dealing with sums of discrete random variables. we conclude that pX (k) = We note that. (6. we compute ordinary generating functions for Bernoulli and Binomial random variables below. Example 51.

GX (z) = ∞ k=0 zk (λz)k e−λ λk = e−λ = e−λ eλz = eλ(z−1) .2.3. we notice that the GS (z) is the product of n copies of GX (z). Example 53 (Poisson Random Variable). 2006: Section 4. it appears that the sum of independent discrete random variables leads to the product of their ordinary generating functions. The mean and second moment of X can be computed based on its ordinary generating function. which are both equal to λ. a relation that we will revisit shortly. . Suppose that X has a Poisson distribution with parameter λ > 0. k! k! k=0 ∞ The ﬁrst two moments of X are given by E[X] = lim z↑1 E X2 dGX (z) = lim λeλ(z−1) = λ z↑1 dz 2 d GX dGX = lim (z) + (z) = lim λ2 + λ eλ(z−1) = λ2 + λ. In particular. It may be helpful to compare this derivation with Example 49. Intuitively. as illustrated in the following example. from Section 5. Further Reading 1. ORDINARY GENERATING FUNCTIONS 71 We know. 7th edition. Ross. 2 dz dz This can be quite useful. we have E[X] = lim z↑1 dGX (z). the second moment of X can be derived as E X 2 = lim z↑1 dGX d2 GX (z) + (z) ..6. S. Pearson Prentice Hall. Looking at the ordinary generating functions above. dz Similarly.4. that one way to create a binomial random variable is to sum n independent and identically distributed Bernoulli random variables. A First Course in Probability. The function GX (s) can be computed using the distribution of X.2. 2 z↑1 z↑1 dz dz This provides a very eﬃcient way to compute the mean and variance of X. each with parameter p.
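The derivative identities above can be checked symbolically; the sketch below assumes the third-party sympy package is available and works with the Poisson generating function G_X(z) = e^{λ(z−1)} derived in this example.

```python
import sympy as sp

z, lam = sp.symbols("z lambda", positive=True)
G = sp.exp(lam * (z - 1))      # ordinary generating function of a Poisson(lambda) variable

mean = sp.limit(sp.diff(G, z), z, 1)                           # E[X] = lambda
second = sp.limit(sp.diff(G, z, 2) + sp.diff(G, z), z, 1)      # E[X^2] = lambda^2 + lambda
print(mean, sp.expand(second))

# Recovering a probability mass: pX(k) = G^(k)(0) / k!
k = 3
pmf_k = sp.diff(G, z, k).subs(z, 0) / sp.factorial(k)
print(sp.simplify(pmf_k))      # lambda**3 * exp(-lambda) / 6
```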

and Childers.4. Probability and Random Processes with Applications to Signal Processing and Communications. 3.8. Miller. G. 2004: Sections 4. A. J. Athena Scientiﬁc. 3.4. Introduction to Probability. Gubner. P. Bertsekas.. .. MEETING EXPECTATIONS 2. D. N. J. 2002: Section 2. 4.1. S. D. 2006: Sections 2. Probability and Random Processes for Electrical and Computer Engineers. Cambridge. L.... and Tsitsiklis.72 CHAPTER 6.

y) = Pr({X = x} ∩ {Y = y}) = Pr(X = x. pairs of random variables. It is often convenient or required to model stochastic phenomena using multiple random variables. We can express the probability 73 mass function of X and Y . Y ) is characterized by the joint probability value of X and y is a possible value of Y .Chapter 7 Multiple Discrete Random Variables Thus far. The random pair (X. Y = y). Note the similarity between the deﬁnition of the joint PMF and (5.Y (x. y) is denoted by pX. ·). If x is a possible .1).Y (·. 7. In this section. we extend some of the concepts developed for single random variables to multiple random variables. Suppose that S is a subset of X(Ω)×Y (Ω). which we denote by pX.1 Joint Probability Mass Functions Consider two discrete random variables X and Y associated with a single experiment. then the probability mass function of (x. our treatment of probability has been focused on single random variables. We center our exposition of random vectors around the simplest case.

Y (x. pX. y). y). of S as Pr(S) = Pr({ω ∈ Ω|(X(ω). Y ) and the individual PMFs pX (·) and pY (·). x∈X(Ω) y∈Y (Ω) To further distinguish between the joint PMF of (X. . Y ) maps every outcome contained in the sample space to a vector in R2 . In particular. We can compute the marginal PMFs of X and Y from the joint PMF pX. y) = 1. ·) using the formulas pX (x) = y∈Y (Ω) pX.y)∈S pX. x∈X(Ω) pY (y) = On the other hand. MULTIPLE DISCRETE RANDOM VARIABLES Sample Space 2 1 4 3 5 X R Y (X.Y (x.Y (·.Y (·.Y (x. This fact is illustrated in Examples 54 & 55.1: The random pair (X. Y ) R Figure 7. y).Y (x. knowledge of the marginal distributions pX (·) and pY (·) is not enough to obtain a complete description of the joint PMF pX. Y (ω)) ∈ S}) = (x. we occasionally refer to the latter as marginal probability mass functions.74 CHAPTER 7. we have pX. ·).

without replacement. The joint PMF of X and Y is speciﬁed in table form below. 1/6 1/6 1/6 1/6 We can compute the marginal PMF of X as pX (x) = y∈Y (Ω) pX. respectively. y) 1 2 3 1 0 1/6 2 0 3 1/6 0 . Similarly.Y (·. however. Again. FUNCTIONS AND EXPECTATIONS 75 Example 54. we refer to the number inscribed on the second ball as Y . two and three.Y (x. 2. y) 1 2 3 1 2 3 . two and three. ·) is a real-valued . 1/3. Y ). 2. 3}. The number appearing on the ﬁrst ball is a random variable. 1/9 1/9 1/9 1/9 1/9 1/9 1/9 1/9 1/9 The marginal distributions of X and Y are the same as in Example 54.7. which we denote by X. y) = 1 1 1 + = . the joint PMFs diﬀer. 6 6 3 where x ∈ {1.Y (x. The joint PMF of X and Y becomes pX.2 Functions and Expectations Let X and Y be two random variables with joint PMF pX. ·). pX. the marginal PMF of Y is given by pY (y) = 0.Y (x. Consider a third random variable deﬁned by V = g(X. where g(·. Likewise. if y ∈ {1. We use X and Y to denote the numbers appearing on the ﬁrst and second balls. This time the random experiment consists of drawing two balls from the urn with replacement.2. 7. An urn contains three balls numbered one. suppose that an urn contains three balls numbered one. Example 55. 3} otherwise. A random experiment consists of drawing two balls from the urn.
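The joint PMF of Example 54 can be stored as a small array, and the marginal PMFs then follow by summing over rows or columns, exactly as in the formulas above. The NumPy sketch below illustrates this bookkeeping; rows index x and columns index y, both over {1, 2, 3}.

    import numpy as np

    # Joint PMF of Example 54: two balls drawn without replacement from {1, 2, 3}.
    joint = np.array([[0,   1/6, 1/6],
                      [1/6, 0,   1/6],
                      [1/6, 1/6, 0  ]])

    p_X = joint.sum(axis=1)   # marginal of X: sum over y
    p_Y = joint.sum(axis=0)   # marginal of Y: sum over x

    print(p_X, p_Y)           # both equal [1/3, 1/3, 1/3]
    assert np.isclose(joint.sum(), 1.0)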

The random variable X represents the number of dots that appears on the top face of the blue die. X and Y .1). is also a random variable. Y ) maps elements of the sample space to real numbers. Y ) is obtained by computing E[g(X.y)=v} pX. MULTIPLE DISCRETE RANDOM VARIABLES function. 8 7 1/6 pU (k) 1/36 1/9 5/36 The deﬁnition of the expectation operator can be extended to multiple random variables.2) . whereas Y denotes the number of dots on the red die. The lowest possible value for U is two.2: A real-valued function of two random variables. the expected value of g(X. We can obtain the PMF of V by computing pV (v) = {(x.3) for pairs of random variables. 11 1/18 4. as calculated using (7. y). y)pX. appears in table form below k 2. We can form a random variable U that describes the sum of these two dice. Sample Space R 5 2 1 3 X R V R 4 V = g(X.Y (x. 12 3. are rolled simultaneously.Y (x. V = g(X. Example 56. 9 6. and its maximum value is twelve. Y )] = x∈X(Ω) y∈Y (Ω) g(x. Above.y)|g(x. a blue die and a red one. (7. 10 1/12 5. Two dice.1) This equation is the analog of (5.76 CHAPTER 7. U =X +Y. (7. The PMF of U . y). Y ) Y Figure 7. In particular.
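A direct enumeration over the 36 equally likely outcomes reproduces the PMF of U = X + Y and its expected value, mirroring (7.1) and (7.2). The fragment below is a small illustrative sketch of that enumeration.

    from fractions import Fraction
    from collections import defaultdict

    p_U = defaultdict(Fraction)
    expected_U = Fraction(0)

    for x in range(1, 7):          # blue die
        for y in range(1, 7):      # red die
            p = Fraction(1, 36)    # each pair of faces is equally likely
            p_U[x + y] += p
            expected_U += (x + y) * p

    print(dict(p_U))     # p_U(2) = 1/36, ..., p_U(7) = 1/6, ..., p_U(12) = 1/36
    print(expected_U)    # 7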

In this section. y) xpX (x) + x∈X(Ω) y∈Y (Ω) = ypY (y) = 4. We wish to compute the expected value of the function X + Y . as it has no relevant meaning. which we write pY |X (·|·).7. We employ X and Y to represent the numbers on the ﬁrst and second balls. Y )] = E[X + Y ] = x∈X(Ω) y∈Y (Ω) (x + y)pX. respectively. Y ) = X + Y as E[g(X. without replacement. That is. which was ﬁrst discussed in Chapter 3. This is similar to conditional probabilities being only deﬁned for conditional events with non-zero probabilities. An urn contains three balls numbered one. We study the probability of events concerning random variable Y given that some information about random variable X is available.2). Let X and Y be two random variables associated with a same experiment.3 Conditional Random Variables Many events of practical interest are dependent. knowledge about event A may provide partial information about the realization of event B. The conditional probability mass function of Y given X = x. is deﬁned by pY |X (y|x) = Pr(Y = y|X = x) = Pr({Y = y} ∩ {X = x}) Pr(X = x) pX. two and three. The expected value of X + Y is four. we extend the concept of conditioning to multiple random variables. y) = . 7. . This inter-dependence is captured by the concept of conditioning. we compute the expectation of g(X.3.Y (x. Using (7. CONDITIONAL RANDOM VARIABLES 77 Example 57. pX (x) provided that pX (x) = 0. Two balls are selected from the urn at random. Note that conditioning on X = x is not possible when pX (x) vanishes.Y (x.

The conditional PMF introduced above is a valid PMF since it is nonnegative and pY |X (y|x) = = pX. y) pX (x) pX. We wish to derive the conditional probability distribution of the number of erasures located in the header. y∈Y (Ω) y∈Y (Ω) y∈Y (Ω) 1 pX (x) The probability that Y belongs to S. min{e.C (h. 1.78 CHAPTER 7.Y (x. The wireless connection is unreliable. Clearly. MULTIPLE DISCRETE RANDOM VARIABLES Let x be ﬁxed with pX (x) > 0. In general. which stores very sensitive information. a random variable with such a distribution is known as a hypergeometric random variable. y) = 1. Pr(Y ∈ S|X = x) = pY |X (y|x). y∈S Example 58 (Hypergeometric Random Variable). given X = x. . and denote the number of corrupted bits in the entire message by C.Y (x. The ﬁrst n bits of this message are dedicated to the packet header. . given that the total number of corrupted bits in the packet is equal to e. the conditioning aﬀects the probability distribution of the number of corrupted bits in the header. m}. A wireless communication channel is employed to transmit a data packet that contains a total of m bits. . Let H represent the number of erasures in the header. The conditional probability mass function of H given C = c is equal to pH|C (h|c) = = = pH. every transmitted bit is received properly with probability 1 − p and erased with probability p. independently of other bits. c) pH (h) n (1 − p)n−h ph h n h m−n c−h m c m−n c−h m c (1 − p)m−c pc (1 − p)(m−n)−(c−h) pc−h . where h = 0. is obtained by summing the conditional probability pY |X (·|·) over all outcomes included in S. .
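The conditional PMF pH|C(h|c) derived in Example 58 only involves binomial coefficients, so it can be tabulated in a few lines. The sketch below does so and checks that the conditional probabilities sum to one; the values m = 20, n = 8, and c = 5 are arbitrary illustrative choices, not parameters taken from the text.

    from math import comb

    def hypergeometric_pmf(h, n, m, c):
        # Pr(H = h | C = c): h erasures among the n header bits,
        # given c erasures among the m bits of the packet.
        return comb(n, h) * comb(m - n, c - h) / comb(m, c)

    m, n, c = 20, 8, 5
    support = range(max(0, c - (m - n)), min(n, c) + 1)
    pmf = {h: hypergeometric_pmf(h, n, m, c) for h in support}

    assert abs(sum(pmf.values()) - 1.0) < 1e-12
    print(pmf)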

The number of transmitted The probability that M is equal to m is therefore equal to pM (m) = = = ∞ k=0 ∞ k=m ∞ u=0 pK. . N be the number of zeros. leveraging the fact that the summation of a Poisson PMF over all possible values is equal to one. and K = M + N be the total number of bits sent during the same interval.Y (x.7. . independently of previous transmissions.M (k.3. We can see that M has a Poisson PMF with parameter pλ. m) = ∞ k=0 pM |K (m|k)pK (k) λk k m p (1 − p)k−m e−λ m k! λu+m −λ u+m m e p (1 − p)u (u + m)! m ∞ (λp)m −λ (λp)m −pλ ((1 − p)λ)u = e = e . We have also rearranged the sum into a familiar form. m m = 0. . we have used the change of variables k = u + m. k. . Let M denote the number of ones within the stipulated interval. CONDITIONAL RANDOM VARIABLES 79 The deﬁnition of conditional PMF can be rearranged to obtain a convenient formula to calculate the joint distribution of X and Y . The number of ones given that the total number of transmissions is k is given by pM |K (m|k) = k m p (1 − p)k−m . m! u! m! u=0 Above. Example 59 (Splitting Property of Poisson PMF). . y) = pY |X (y|x)pX (x) = pX|Y (x|y)pY (y). A digital communication system sends out either a one with probability p or a zero with probability binary digits within a given time interval has a Poisson PMF with parameter λ. 1 − p. 1. This formula can be use to compute the joint PMF of X and Y sequentially. namely pX. We wish to show that the number of ones sent in that same time interval has a Poisson PMF with parameter pλ.
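The splitting property can also be checked by simulation: draw the total count K from a Poisson PMF with parameter λ, thin it with a binomial draw, and compare the empirical mean and variance of the resulting count M with the value pλ that a Poisson PMF with parameter pλ would predict. The NumPy sketch below performs this sanity check with arbitrary parameter values.

    import numpy as np

    rng = np.random.default_rng(0)
    lam, p, trials = 6.0, 0.3, 200_000

    K = rng.poisson(lam, size=trials)   # total number of transmitted bits
    M = rng.binomial(K, p)              # number of ones among the K bits

    # A Poisson PMF with parameter p * lam has mean and variance both equal to p * lam.
    print(M.mean(), M.var())            # both should be close to 1.8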

with 1. Let Y = 1S (·) symbolize the indicator function of S. Example 60. Let X be a random variable all the values in X(Ω). and let S be a non-trivial event corresponding to this experiment. we gather that Pr({X = x} ∩ S) = Pr(S) Note that the events {ω ∈ Ω|X(ω) = x} form a partition of Ω as x ranges over x∈X(Ω) and. / Then pX|S (x) = pX|Y (x|1). consequently.3) is the following. and the transmission fails with probability (1 − p). However. a retransmission is requested immediately. conassociated with a particular experiment. Pr(S) (7. if packet transmission fails n consecutive times.80 CHAPTER 7. MULTIPLE DISCRETE RANDOM VARIABLES 7.3. with no further transmission attempt. A data packet is sent to a destination over an unreliable wireless communication link. conditioning on an event is a special case of a conditional probability mass function. When initial decoding fails.3) ditioned on an event S where Pr(X ∈ S) > 0. Another interpretation of (7. ω ∈ S 1S (ω) = 0. ω ∈ S.1 Conditioning on Events It is also possible to deﬁne the conditional PMF of a random variable X. independently of . The conditional PMF of X given S is deﬁned by pX|S (x) = Pr(X = x|S) = Pr({X = x} ∩ S) . = x∈X(Ω) We conclude that pX|S (·) is a valid PMF. Data is successfully decoded at the receiver with probaprevious trials. Let X denote the number of trials. In this sense. then the data is dropped from the queue altogether. and let S be the event that the data bility p. Using the total probability theorem. we get pX|S (x) = x∈X(Ω) x∈X(Ω) Pr({X = x} ∩ S) Pr(S) Pr({X = x} ∩ S) Pr(S) = 1.

We wish to compute the conditional PMF of X given S. Thus. (1 − p)k−1 p . n. From this point of view. 2. we note that a successful transmission can happen at any of n possible instants. E[Y |X = x] = ypY |X (y|x). the conditional PMF pN |S (·) is equal to pN |S (k) = where k = 1. .7. h(x) = E[Y |X = x]. the expectation of E[Y |X] is equal to E[Y ]. These outcomes being mutually exclusive. expectation h(x) = E[Y |X = x]. say X = x. if X and Y are two ran- determines the value of X. First. y∈Y (Ω) This conditional expectation can be viewed as a function of x. It is therefore mathematically accurate and sometimes desirable to talk about dom variables associated with an experiment. . the conditional Not too surprisingly. CONDITIONAL EXPECTATIONS 81 packet is successfully transmitted.4. . the outcome of this experiment the random variable h(X) = E[Y |X].4 Conditional Expectations The conditional expectation of Y given X = x is simply the expectation of Y with respect to the conditional PMF pY |X (·|x). 1 − (1 − p)n 7. which in turn yields the conditional expectation E[Y |X] is simply an instance of a random variable. as evinced . the probability of S can be written as n Pr(S) = k=1 (1 − p)k−1 p = 1 − (1 − p)n . In particular.

The conditional expectation . let C and L be the number of cherry sodas and lemonades purchased during the same time interval. independently of other customers. k (7. Every customer selects a cherry soda with customers. An entrepreneur opens a small business that sells two kinds of beverages from the Brazos Soda Company. independently of other purchased in an hour given that ten beverages were sold to customers during This follows from the fact that every customer selects a cherry soda with probability p. it is straightforward to show that E [E[g(Y )|X]] = E[g(Y )]. respectively. We wish to ﬁnd the conditional mean of the number of cherry sodas this time period. The conditional mean is then seen to equal 10 E[C|B = 10] = k=0 k 10 k p (1 − p)10−k = 10p. y) = y∈Y (Ω) x∈X(Ω) = y∈Y (Ω) ypY (y) = E[Y ]. Using a similar argument.Y (x.4) probability p and a lemonade with probability (1 − p).82 CHAPTER 7. The conditional PMF of C given that the total number of beverages sold equals ten is pC|B (k|10) = 10 k p (1 − p)10−k . E [E[Y |X]] = = x∈X(Ω) y∈Y (Ω) x∈X(Ω) E[Y |X = x]pX (x) ypY |X (y|x)pX (x) ypX. Example 61. cherry soda and lemonade. k We can deﬁne the expectation of X conditioned on event S in an analogous fashion. Similarly. Let B represent the number of bottles sold during an hour. MULTIPLE DISCRETE RANDOM VARIABLES by the following derivation. We note that B = C + L. Let S be an event such that Pr(S) > 0. The number of bottles sold in an hour at the store is found to be a Poisson random variable with parameter λ = 10.
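Both conclusions surrounding Example 61, namely E[C | B = 10] = 10p and, through iterated expectations, E[C] = E[E[C | B]] = pλ, can be confirmed with a short simulation under the same modeling assumptions. The sketch below uses λ = 10 from the example and an arbitrary value p = 0.4 for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, p, trials = 10.0, 0.4, 300_000

    B = rng.poisson(lam, size=trials)   # bottles sold in an hour
    C = rng.binomial(B, p)              # cherry sodas among them

    print(C[B == 10].mean())            # ~ 10 * p = 4.0, the conditional mean given B = 10
    print(C.mean())                     # ~ p * lam = 4.0, consistent with E[E[C|B]] = E[C]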

0. Example 62. The total amount of money spent by the student. we wish to compute the expected amount of money disbursed given that the student buys at least ﬁve shirts. Let Ci be the cost of the ith shirt.4. Spring break is coming and a student decides to renew his shirt collection.2.5. Also.3). The number of shirts purchased by the student is a random variable denoted by N . we have E[g(X)|S] = x∈X(Ω) g(x)pX|S (x). $20 or $50 with respective probabilities 0. We wish to compute the expected amount of money spent by the student during his shopping spree. CONDITIONAL EXPECTATIONS of X given S is E[X|S] = x∈X(Ω) 83 xpX|S (x).5. The student is expected to spend $42. Similarly. where pX|S (·) is deﬁned in (7. Any one shirt costs $10. denoted by T . . It is equal to N N E[T ] = E i=1 N Ci = E E i=1 Ci N =E i=1 E[Ci |N ] = 21E[N ] = 42. independently of other shirts.3 and 0. can be expressed as N T = i=1 Ci . The mean of T can be computed using nested conditional expectation. the conditional expectation becomes N E[T |N ≥ 5] = E i=1 Ci N ≥ 5 N =E E i=1 Ci N N ≥5 = 21E[N |N ≥ 5] = 126.7. Given that the student buys at least ﬁve shirts. The PMF of this random variable is a geometric distribution with parameter p = 0.

5. Note that. then E[XY ] = x∈X(Ω) y∈Y (Ω) xypX.Y (x. y) = pX (x)pY (y) for independent random variables.4 and the independence of two random variables. There is a clear relation between the concept of independence introduced in Section 3. y) = pX (x)pY (y) for every x ∈ X(Ω) and every y ∈ Y (Ω). E[N |N ≥ 5] = E[N |N > 4] = 4 + E[N ] = 6. Proof.84 CHAPTER 7. the student is expected to spend $126. as in Example 56. 7. Proposition 6. Furthermore. where we have used the fact that pX.Y (x. y) xypX (x)pY (y) x∈X(Ω) y∈Y (Ω) = = x∈X(Ω) xpX (x) y∈Y (Ω) ypY (y) = E[X]E[Y ]. Two dice of diﬀerent colors are rolled. Example 63. y) such that x ∈ X(Ω) and y ∈ Y (Ω). then E[XY ] = E[X]E[Y ]. We say that X and Y are independent random variables if pX. If X and Y are independent random variables. Random variables X and Y are independent if and only if the events {X = x} and {Y = y} are independent for every pair (x. Assume that both E[X] and E[Y ] exist. in computing the conditional expectation. We wish to compute the expected value of their product. we have utilized the memoryless property of the geometric random variable. MULTIPLE DISCRETE RANDOM VARIABLES Conditioned on buying more than ﬁve shirts.Y (x.5 Independence Let X and Y be two random variables associated with a same experiment. We know that the mean of each role is E[X] = E[Y ] = 3. it is straightforward to show .

We wish to determine the distribution pU (·) of U . respectively. E[XY ] = E[X]E[Y ] = 12. pU (k) = Pr(U = k) = Pr(X + Y = k) = {(x. (7. y) pX. Suppose X and Y are two independent random .5) occurs when dealing with sums of independent random variables. In particular.7.Y (x.5) whenever X and Y are independent random variables and the corresponding expectations exist.25. One interesting application of (7.Y (m. 7. = m∈Z = m∈Z The latter operation is called a discrete convolution.1 Sums of Independent Random Variables We turn to the question of determining the distribution of a sum of two independent random variables in terms of the marginal PMF of the summands. k − m) pX (m)pY (k − m). are independent random variables. The expected value of their product is then equal to the product of the individual means. it suﬃces to compute the probability that U assumes a value k. where U = X + Y .5. Suppose X and Y are independent random variables that take on integer values. We can parallel the proof of Proposition 6 to show that E[g(X)h(Y )] = E[g(X)]E[h(Y )] (7. To accomplish this task.5. the PMF of U = X + Y is the discrete convolution of pX (·) and pY (·) given by pU (k) = (pX ∗ pY )(k) = pX (m)pY (k − m). where k is an arbitrary integer.6) m∈Z The discrete convolution is commutative and associative.y)∈X(Ω)×Y (Ω)|x+y=k} pX. the numbers of dots on the two dice. Let pX (·) and pY (·) be the PMFs of X and Y . INDEPENDENCE 85 that X and Y .
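Equation (7.6) is precisely the operation implemented by a discrete convolution routine. As an illustration, the sketch below convolves the PMFs of two fair dice with numpy.convolve and recovers the triangular PMF of their sum; the dice are not tied to any particular example in the text.

    import numpy as np

    p_X = np.full(6, 1/6)        # PMF of a fair die on {1, ..., 6}
    p_Y = np.full(6, 1/6)

    p_U = np.convolve(p_X, p_Y)  # PMF of U = X + Y, supported on {2, ..., 12}
    for k, prob in zip(range(2, 13), p_U):
        print(k, round(prob, 4)) # 2: 0.0278, ..., 7: 0.1667, ..., 12: 0.0278
    assert np.isclose(p_U.sum(), 1.0)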

That is. pU (k) = k! This method of ﬁnding the PMF of U is more concise than using the discrete convolution. .4. As such. accordingly. we use ordinary generating functions. . k = 0. the ordinary generating function of U = X + Y is GU (z) = GX (z)GY (z) = eα(z−1) eβ(z−1) = e(α+β)(z−1) . MULTIPLE DISCRETE RANDOM VARIABLES variables that take on integer values and. The ordinary generating function of U . that U is a Poisson random variable with parameter α + β and. . Let X be a Poisson random variable with parameter α and let Y be a Poisson random variable with parameter β.86 CHAPTER 7. as deﬁned in Section 6. by the uniqueness of generating functions. The variable X is independent of S if Pr({X = x} ∩ S) = pX (x) Pr(S) for every x ∈ X(Ω). if X is independent of event S then pX|S (x) = pX (x) for all x ∈ X(Ω). 2. We saw in Section 7. First. . In particular. To solve this problem. Example 64 (Sum of Poisson Random Variables). recall that the generating function of a Poisson random variable with parameter λ is eλ(z−1) . The ordinary generating functions of X and Y are therefore equal to GX (z) = eα(z−1) GY (z) = eβ(z−1) . 1. We wish to ﬁnd the probability mass function of U = X + Y . the generating function of a sum of independent random variables is equal to the product of the individual ordinary generating functions. is given by GU (z) = E z U = E z X+Y = E zX zY = E zX E zY = GX (z)GY (z). we get (α + β)k −(α+β) e .3 that a random variable can also be conditioned on a speciﬁc event. let U = X + Y . again. Let X be a random variable and let S be a non-trivial event. We conclude.

. pX (x) = Pr(X = x). where X1 . xn ) = Pr k=1 {Xk = xk } . ω ∈ Ω.. .6 Numerous Random Variables The notion of a joint distribution can be applied to any number of random variables. which is the n-fold convolution of pX (·). Consider the empirical sums n Sn = k=1 Xk n ≥ 1. Xn (ω)). the PMF of Sn can be obtained recursively using the formula Sn = Sn−1 + Xn . their joint PMF is deﬁned by n pX1 . Furthermore. . the space deﬁned by vectors of the form (X1 (ω). each with marginal PMF pX (·). ... are independent integer-valued random variables. . X2 (ω). independent. from an abstract perspective. . . Xn be random variables. is itself a sample space on which random variables can be deﬁned. in vector form. . .Xn (x1 . NUMEROUS RANDOM VARIABLES 87 7. Also. . . . When the random variables {Xk } are pX (x) = k=1 pXk (xk ).7. . . we emphasize that most concepts introduced in this chapter can be extended to multiple random variables in a straightforward manner. the ordinary generating function of Sn can be written as GSn (z) = (GX (z))n . x2 .. the distribution of S1 is simply pX (·).X2 . Obviously. Let X1 . their joint PMF reduces to n or. Although not discussed in details herein. we note that. More generally. . . This leads to the PMF pSn (k) = (pX ∗ pX ∗ · · · ∗ pX )(k). X2 .6. X2 .

. n − 1. Thus. . by the principle of mathematical induction. . MULTIPLE DISCRETE RANDOM VARIABLES Example 65 (Binomial Random Variables). . we gather that the sum of n independent Bernoulli random variables. The PMF of S1 is equal to 1 − p k = 0 pS1 (k) = p k = 1. 1. each with parameter p ∈ (0. . 1. n. . Assume that the PMF of Sn−1 is given by pSn−1 (k) = n−1 k p (1 − p)n−1−k . Example 66 (Negative Binomial Random Variable*). The distribution of X is given by pX (k) = k+r−1 r p (1 − p)k . 1.88 CHAPTER 7. . . pSn (k) = ∞ m=−∞ pSn−1 (m)pX (k − m) = pSn−1 (k)(1 − p) + pSn−1 (k − 1)p = = n−1 k n−1 k p (1 − p)n−k + p (1 − p)n−k k k−1 n k p (1 − p)n−k k where k = 0. Suppose that Sn is a sum of n independent and identically distributed Bernoulli random variables. is also a binomial random distribution with parameters (m + n) and p. is a binomial random variable with parameters n and p. each with parameter p. r−1 k = 0. Suppose that a Bernoulli trial is repeated multiple times until r ones are obtained. the distribution of Sn can be computed recursively using the discrete convolution. We denote by X the random variable that represents the number of zeros observed before completion of the process. 1). . . The convolution of two binomial distributions. Then. . one with parameter m and p and the other with parameters n and p. This fact follows directly from the previous argument. k k = 0.

NUMEROUS RANDOM VARIABLES 89 where p is the common parameter of the Bernoulli trials. We use properties of diﬀerential equations to derive GX (z) from pX (·). where c is an arbitrary constant. a task which we perform by looking at its ordinary generating function. alternatively. dGX (z) = dz = ∞ k=0 ∞ k=1 kz k−1 z k−1 k+r−1 r p (1 − p)k r−1 = (1 − p) ∞ (k + r − 1)! r p (1 − p)k (r − 1)!(k − 1)! (ℓ + r)z ℓ ℓ+r−1 r p (1 − p)ℓ r−1 ℓ=0 = (1 − p) z It follows from this equation that dGX (z) + rGX (z) . We wish to show that X can be obtained as a sum of r independent random variables. we have GX (z) = ∞ k=0 zk k+r−1 r p (1 − p)k .7. we get the ordinary generating function of X as GX (z) = p 1 − (1 − p)z r . . dz 1 dGX (1 − p)r (z) = GX (z) dz 1 − (1 − p)z or. we can write (1 − p)r d log (GX (z)) = . Applying boundary condition GX (1) = 1. dz 1 − (1 − p)z Integrating both sides yields log (GX (z)) = −r log (1 − (1 − p)z) + c. r−1 Consider the derivative of GX (z). By deﬁnition.6. In general. a random variable with such a distribution is known as a negative binomial random variable.

Ross. Athena Scientiﬁc. . It may be instructive to compare this distribution with the PMF of a geometric Further Reading 1. Pearson Prentice Hall. 2006: Sections 3.5.7. 2. Yr . .. ... Y1 . 2002: Sections 2. N. the distribution of Yi is given by 1 dm GYi (0) = p(1 − p)m pYi (m) = m m! dz random variable. . J.5–2. 2006: Sections 6.4–3.. D. each with ordinary generating function GYi (z) = p . . Bertsekas.90 CHAPTER 7. Cambridge. Introduction to Probability. J. we can deduce that X is the sum of r independent random variables. 3. A. .4. and Tsitsiklis.1–6. S. 7th edition. m = 0. Gubner. Probability and Random Processes for Electrical and Computer Engineers. MULTIPLE DISCRETE RANDOM VARIABLES From this equation. . . P. A First Course in Probability. 1. 1 − (1 − p)z In particular.

random variables that can take on an uncountable set of values. we extend and apply the concepts and methods initially developed for discrete random variables to the class of continuous random variables. we provide a deﬁnition for continuous random variables. we consider random variables that range over a continuum of possible values. that is. This predicament emerges from the limitations of the third axiom of probability laws. which only applies to countable collections of disjoint events. While this extra ﬂexibility is useful and desirable.1 Cumulative Distribution Functions We begin our exposition of continuous random variables by introducing a general concept which can be employed to bridge our understanding of discrete 91 . Continuous random variables are powerful mathematical abstractions that allow engineers to pose and solve important problems. yet they form only a small subset of the collection of random variables pertinent to applied probability and engineering. we develop a continuous counterpart to the probability mass function. Furthermore. Discrete random variables are quite useful in many contexts. we have studied discrete random variables and we have explored their properties.Chapter 8 Continuous Random Variables So far. Many of these problems are diﬃcult to address using discrete models. it comes at a certain cost. In particular. Below. 8. In this chapter. A continuous random variable cannot be characterized by a probability mass function.

1 Discrete Random Variables If X is a discrete random variable. we note from (8. then we can write {X ≤ x2 } as the union of the two disjoint = Pr(X ≤ x1 ) + Pr(x1 < X ≤ x2 ) ≥ Pr(X ≤ x1 ) = FX (x1 ). x]) = Pr({ω ∈ Ω|X(ω) ≤ x}). x2 ] is Pr(x1 < X ≤ x2 ) = FX (x2 ) − FX (x1 ). CONTINUOUS RANDOM VARIABLES and continuous random variables. (8. It follows that FX (x2 ) = Pr(X ≤ x2 ) Suppose x1 < x2 . In terms of the underlying sample space. the CDF is a convenient way to specify the probability of all events of the form {X ∈ (−∞.2) 8. The CDF of random variable X exists for any well-behaved function X : Ω → R. a CDF is always a non-decreasing function. sets {X ≤ x1 } and {x1 < X ≤ x2 }.x] pX (u). In particular.1) that the probability of X falling in the interval (x1 .1) In other words. FX (x) denotes the probability of the set of all outcomes in Ω for which the value of X is less than or equal to x.92 CHAPTER 8. . we have x↓−∞ x↑∞ lim FX (x) = 0 lim FX (x) = 1. (8. Moreover. Recall that a random variable is a realvalued function acting on the outcomes of an experiment. Finally. The cumulative distribution function (CDF) of X is deﬁned point-wise as the probability of the event {X ≤ x}.1. then the CDF of X is given by FX (x) = u∈X(Ω)∩(−∞. FX (x) = Pr({X ≤ x}) = Pr(X ≤ x). since the realization of X is a real number. FX (x) = Pr X −1 ((−∞. In essence. given a sample space. random variable X is a function from Ω to R. x]}.

1: This ﬁgure shows the PMF of a discrete random variable.4 0. where ⌊·⌋ denotes the standard ﬂoor function. pX (k) = FX (x) − lim FX (u) = FX (k) − FX (k − 1) u↑x = 1 − (1 − p)k − 1 − (1 − p)k−1 = (1 − p)k−1 p. pX (k) = (1 − p)k−1 p k = 1. Let X be a geometric random variable with parameter p. Discrete Random Variable CDF 0. their cumulative sums lead to the values of the CDF. For x > 0. this formula is simpler when the random variable X only takes integer values.1.8. Example 67.6 0. 2. u↑x 93 Fortunately. along with the corresponding CDF. as seen in the example below. CUMULATIVE DISTRIBUTION FUNCTIONS and its PMF can be computed using the formula pX (x) = Pr(X ≤ x) − Pr(X < x) = FX (x) − lim FX (u). The values of the PMF are depicted by the height of the rectangles. . .8 PMF 0. the CDF of X is given by FX (x) = ⌊x⌋ k=1 (1 − p)k−1 p = 1 − (1 − p)⌊x⌋ .2 0 0 2 4 Value 6 8 1 Figure 8. the PMF of geometric random variable X can be recovered from the CDF as follows. . For integer k ≥ 1. .
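Example 67 translates into a few lines of code: evaluate FX(x) = 1 − (1 − p)^⌊x⌋ and recover the PMF from consecutive differences of the CDF at the integers. The fragment below is a simple sketch of that computation with an arbitrary value of p.

    from math import floor

    p = 0.25

    def geometric_cdf(x):
        # F_X(x) = 1 - (1 - p)^floor(x) for x > 0, and 0 otherwise.
        return 1 - (1 - p)**floor(x) if x > 0 else 0.0

    # Recover the PMF from the CDF: p_X(k) = F_X(k) - F_X(k - 1).
    for k in range(1, 6):
        pmf_k = geometric_cdf(k) - geometric_cdf(k - 1)
        assert abs(pmf_k - (1 - p)**(k - 1) * p) < 1e-12
        print(k, pmf_k)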

Suppose that X is a random variable with CDF given by 1 − e−x .2: The CDF of a continuous random variable is diﬀerentiable. This ﬁgure provides one example of a continuous random variable. Let X be a random variable with CDF FX (·).8 0. CONTINUOUS RANDOM VARIABLES 8. Both.4 0. Example 68. Continuous Random Variable 1 0.94 CHAPTER 8. FX (x) = This cumulative distribution function is diﬀerentiable with e−x .1. .2 Continuous Random Variables Having introduced the general notion of a CDF. then X is said to be a continuous random variable if FX (·) is continuous and diﬀerentiable. x > 0 dFX (x) = 0.2 0 0 1 2 3 Value 4 5 CDF PDF Figure 8. x≥0 x < 0. we can safely provide a more precise deﬁnition for continuous random variables. dx x<0 and therefore X is a continuous random variable. 0. the CDF FX (·) and its derivative fX (·) (PDF) are displayed.6 0.

1. PROBABILITY DENSITY FUNCTIONS 95 8.2 0 0 1 2 3 Value 4 5 CDF Figure 8.3: This ﬁgure shows the CDF of a mixed random variable.2. Their CDF may be composed of a mixture of diﬀerentiable intervals and discontinuous jumps. The derivative of FX (·) is called the probability density function (PDF) of X. Mixed Random Variable 1 0. and it is denoted by fX (·).3 Mixed Random Variables* Generally speaking. Such random variables are sometimes called mixed random variables. we emphasize that a good understanding of discrete and continuous random variables is instrumental in understanding and solving problems including mixed random variables. the CDF of a discrete random variable is a discontinuous staircase-like function.6 0. the CDF of continuous random variable X is a diﬀerentiable function. mixed random variables do not have a PMF nor a PDF. we have x FX (x) = −∞ fX (ξ)dξ.2 Probability Density Functions As mentioned above.8. In general. Our exposition of mixed random variables in this document is very limited. If X is a random variable with PDF fX (·) then.8 0. whereas the CDF of a continuous random variable is continuous and diﬀerentiable almost everywhere.4 0. . by the fundamental theorem of calculus. Still. There exist random variables for which neither situation applies. 8.

the probability that X ∈ S can be expressed through an integral. because the probabilities of events are set S. CONTINUOUS RANDOM VARIABLES Equivalently. dx Note that PDFs are only deﬁned for continuous random variables. S nonnegative. This is fX (x) = somewhat restrictive. Also. Pr(X ∈ S) = carry this integral. we can combine the deﬁnition of fX (·) and (8. the probability that X is a real number is given by Pr(−∞ < X < ∞) = ∞ −∞ fX (ξ)dξ = 1. then Pr(X = x) = 0 for any real number x. which may otherwise be diﬃcult to compute. it is easily seen that for any continuous random variable x2 Pr(X = x2 ) = lim Pr(x1 < X ≤ x2 ) = lim x1 ↑x2 x2 x2 x1 ↑x2 fX (ξ)dξ x1 = fX (ξ)dξ = 0. In other words. An immediate corollary of this fact is Pr(x1 < X < x2 ) = Pr(x1 ≤ X < x2 ) = Pr(x1 < X ≤ x2 ) = Pr(x1 ≤ X ≤ x2 ). We can derive properties for the PDF of continuous random variable X based on the axioms of probability laws. Finally.2) to obtain x2 Pr(x1 < X ≤ x2 ) = fX (ξ)dξ. For x1 < x2 . we can write dFX (x). if X is a continuous random variable. Thus. the inclusion or exclusion of endpoints in an interval does not aﬀect the probability of the corresponding interval when X is a continuous random variable.96 CHAPTER 8. fX (ξ)dξ. x1 Furthermore. fX (x) must integrate to one. First. given an admissible Admissible events of the form {X ∈ S} are sets for which we know how to . we must have fX (x) ≥ 0 everywhere. Nevertheless. the PDF can be a very powerful tool to derive properties of continuous random variables.

m. a and b. we introduce important random variables and their distributions. a ≤ x ≤ b b−a 1.3. The PDF of a uniform random variable is deﬁned by two parameters. David.8. The PDF fX (·) is given by fX (x) = 1 .3. . a bus comes every thirty minutes.m. which represent the minimum and maximum values of its support. b] otherwise. who does not believe in alarm clocks or watches.3 Important Distributions Good intuition about continuous random variables can be developed by looking at examples. and denote by T the time he spends waiting. t ∈ [0. After cooking a hefty breakfast. we have fT (t) = 1. 8. In this section. 0. from sunrise until dusk. x ≥ b. The amount of time that he spends at the bus stop is therefore uniformly distributed between 0 and 30 minutes. IMPORTANT DISTRIBUTIONS 97 8.. If his arrival time at the bus stop is uniformly distributed between 9:00 a. Accordingly. respectively.1 The Uniform Distribution A (continuous) uniform random variable is such that all intervals of a same length contained within its support are equally probable. The time at which the next bus arrives at David’s stop is uniformly distributed between t0 and t0 + 30. Example 69. David comes to campus every morning riding the Aggie Spirit Transit. On his route. These random variables ﬁnd widespread application in various ﬁelds of engineering. wakes up with daylight. b−a x ∈ [a. 30] otherwise. x<a FX (x) = x−a . 0. and 9:30 a. he walks to the bus stop. 30 The associated cumulative distribution function becomes 0. what is the probability that he waits less than ﬁve minutes for the bus? Let t0 be the time at which David arrives at the bus stop.

e− 2 dζ = Φ x−m σ .8 0. 4]. CONTINUOUS RANDOM VARIABLES Uniform Distribution 1.4: This ﬁgure shows the PDFs of uniform random variables with support intervals [0.4 0. The CDF of a Gaussian random variable does not admit a closed-form expression. names that hint at its popularity. The probability that David waits less than ﬁve minutes is 5 Probability Density Function Pr(T < 5) = 0 1 1 dt = . 2] and [0. The PDF of a Gaussian random variable is given by fX (x) = √ (x−m)2 1 e− 2σ2 2πσ − ∞ < x < ∞. where m and σ > 0 are real parameters. It is often used to model distributions of quantities inﬂuenced by large numbers of small random components. 1]. A Gaussian variable whose distribution has parameters m = 0 and σ = 1 is called a normal random variable or a standard Gaussian random variable.98 CHAPTER 8. 30 6 8.2 0 0 1 2 3 Value 4 5 max = 1 max = 2 max = 4 Figure 8.6 0.3.2 The Gaussian (Normal) Random Variable The Gaussian random variable is of fundamental importance in probability and statistics.2 1 0. it can be expressed as FX (x) = Pr(X ≤ x) = √ 1 =√ 2π (x−m)/σ −∞ 1 2πσ ζ2 x −∞ e− (ξ−m)2 2σ 2 dξ . [0.

Using the total probability theorem. and Y represent the value of the received signal. An error can occur in one of two possible ways: S = 1 was transmitted and Y is less than zero. 1} denote the transmitted signal.2 0.5: The distributions of Gaussian random variables appear above for parameters m = 0 and σ 2 ∈ {1. A binary signal is transmitted through a noisy communication channel.3.1 0 -4 -2 0 Value 2 4 1 2 4 99 Figure 8. . 2. where Φ(·) is termed the standard normal cumulative distribution function and is deﬁned by 1 Φ(x) = √ 2π x −∞ Probability Density Function e− 2 dζ. The receiver declares that a 1 (−1) was transmitted if the sent signal is positive (negative).5 0. or S = −1 was transmitted and Y is greater than zero. This noise can be accurately modeled as a Gaussian random variable. 4}. we can compute the probability of error as Let S ∈ {−1. IMPORTANT DISTRIBUTIONS Gaussian Distribution 0. The sent signal takes on either a value of 1 or −1. Example 70. What is the probability of making an erroneous decision? noise.3 0. The message received at the output of the communication channel is corrupted by additive thermal noise.4 0. N be the value of the thermal Pr(Y ≥ 0|S = −1) Pr(S = −1) + Pr(Y ≤ 0|S = 1) Pr(S = 1). ζ2 We emphasize that the function Φ(·) is nothing more than a convenient notation for the CDF of a normal random variable.8.

In engineering. √ 1 + erf x/ 2 . then erf x/ 2 denotes the probability that X lies in the interval (−x. Consider a standard normal PDF. Next. It is therefore useful to include it in this document. we prove that the standard normal PDF integrates to one. Φ(x) = 2 x 0 e−ξ dξ. 2π . 2 The error function is related to the standard normal cumulative distribution √ If X is a standard normal random variable.3) may prove useful when using software packages that provide a built-in implementation for erf(·). it is customary to employ the Q-function. but hard to discover. which is given by ∞ ξ2 1 e− 2 dξ = 1 − Φ(x) Q(x) = √ 2π x √ 1 − erf x/ 2 . The The reliability of this transmission scheme depends on the amount of noise error function is a function which is primarily encountered in the ﬁelds of statistics and partial diﬀerential equations. x2 1 fX (x) = √ e− 2 . CONTINUOUS RANDOM VARIABLES By symmetry. x).100 CHAPTER 8. present at the receiver. but not for the Q-function. The normal random variable is so frequent in applied mathematics and engineering that many variations of its CDF possess their own names. The solution is easy to follow. it is easily argued that Pr(Y ≤ 0|S = 1) = Pr(Y ≥ 0|S = −1) = Pr(N > 1) = 1 ∞ √ ξ2 1 e− 2σ2 dξ = 1 − Φ 2πσ 1 σ .3) Equation (8. = 2 (8. It is deﬁned by 2 erf(x) = √ π function by scaling and translation. The probability that the receiver makes an erroneous decision is 1 − Φ(1/σ). The probability of an erroneous decision in Example 70 can be expressed concisely using the Q-function as Q(1/σ).
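Because Φ(·) and Q(·) are simple rescalings of the error function, the error probability of Example 70 can be evaluated with math.erf alone. The sketch below computes Q(1/σ) = 1 − Φ(1/σ) for a few illustrative noise levels; the numerical values of σ are arbitrary.

    from math import erf, sqrt

    def Phi(x):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1 + erf(x / sqrt(2)))

    def Q(x):
        return 1 - Phi(x)

    for sigma in (0.5, 1.0, 2.0):
        print(sigma, Q(1 / sigma))   # probability of an erroneous decision, 1 - Phi(1/sigma)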

Connection requests at an Internet server are characterized by an exponential inter-arrival time with parameter λ = 1/2. and the time elapsed between speciﬁc occurrences. IMPORTANT DISTRIBUTIONS 101 We can show that fX (x) integrates to one using a subtle argument and a change of variables.3 The Exponential Distribution The exponential random variable is also frequently encountered in engineering. Since the square of the desired integral is nonnegative and equal to one. Pr(T < 2) = 0 ξ 1 −ξ e 2 dξ = −e− 2 2 2 0 = 1 − e−1 ≈ 0. its CDF is equal to FX (x) = 1 − e−λx .8. It can be used to model the lifetime of devices and systems. An exponential random variable X with parameter λ > 0 has PDF fX (x) = λe−λx For x ≥ 0. The parameter λ characterizes the rate at which events occur. We start with the square of the integrated PDF and proceed from there. 8. ∞ −∞ 2 ∞ −∞ ∞ −∞ ∞ ξ2 ζ2 1 1 √ e− 2 dξ √ e− 2 dζ 2π 2π −∞ ∞ 2π 1 − ξ2 +ζ 2 1 e 2 dξdζ = dθ 2π −∞ 2π 0 ∞ 0 fX (ξ)dξ = = ∞ 0 e− 2 rdr r2 = −e 2 − r2 = 1. . If a requests arrives at time t0 .632. what is the probability that the next packet arrives within two minutes? The probability that the inter-arrival time T is less than two minutes can be computed as 2 x ≥ 0.3.3. Example 71. we can conclude that the normal PDF integrates to one.

we create a new variable Xn . 2}. 0. 1. .102 CHAPTER 8. The exponential random variable can be obtained as the limit of a sequence of geometric random variables. if y = k/n otherwise. 2. . random variable Yn is a standard geometric random variable with parameter pn = λ/n.5 1 0.5 0 0 0.6: The distributions of exponential random variables are shown above for parameters λ ∈ {0. We deﬁne the PMF of random variable Yn as pYn (k) = (1 − pn ) k−1 k−1 Probability Density Function pn = λ 1− n λ n k = 1. Xn = Yn .5 Figure 8. CONTINUOUS RANDOM VARIABLES Exponential Distribution 2 1. . That is. For every n. Let λ be ﬁxed and deﬁned pn = λ/n. . n By construction. the CDF of random variable Xn can be computed as Pr(Xn ≤ x) = Pr(Yn ≤ nx) = = ⌊nx⌋ k=1 ⌊nx⌋ k=1 pYn (k) (1 − pn )k−1 pn = 1 − (1 − pn )⌊nx⌋ . the random variable Xn has PMF pXn (y) = (1 − pn )k−1 pn .5 2 2 1 0.5 1 Value 1.5. For any x ≥ 0.

Let T denote the time elapsed until failure of the disk. A prominent company.3. as n grows unbounded. In reality. This fact can be shown by a straightforward application of conditional probability. let t and u be two positive numbers. We know that T is an exponential random variable and. 2 . the sequence of scaled geometric random variables {Xn } converges in distribution to an exponential random variable X with parameter λ. Memoryless Property: In view of this asymptotic characterization and the fact that geometric random variables are memoryless. Pr(X > t + u|X > t) = Pr({X > t + u} ∩ {X > t}) Pr(X > t) Pr(X > t + u) e−λ(t+u) = = Pr(X > t) e−λt = e−λu = Pr(X > u). maintains a bank of servers for its operation. Suppose that X is an exponential random variable with parameter λ. although we are not given λ explicitly. Half-lives are typically used to describe quantities that undergo exponential decay. Also.8.2). it is not surprising that the exponential random variable also satisﬁes the memoryless property. Pr(X > t + u|X > t) = Pr(X > u). Thus. Century Oak Networks. The memoryless property can be veriﬁed by expanding the conditional probability of X using deﬁnition (3. we get n→∞ 103 lim Pr(Xn ≤ x) = lim 1 − (1 − pn )⌊nx⌋ n→∞ = 1 − lim n→∞ 1− λ n ⌊nx⌋ = 1 − e−λx . We wish to compute the probability that a speciﬁc disk needs repair within its ﬁrst year of usage. Example 72. the exponential random variable is the only continuous random variable that satisﬁes the memoryless property. IMPORTANT DISTRIBUTIONS In the limit. Hard drives on the servers have a half-life of two years. we know that 1 Pr(T > 2) = .
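The computation in Example 72 also fits in a few lines of Python: obtain λ from the half-life condition Pr(T > 2) = 1/2, then evaluate Pr(T < 1) both directly from the exponential CDF and through the memoryless argument used in the solution. The sketch below is only a numerical cross-check.

    from math import log, exp, sqrt

    # A half-life of two years means Pr(T > 2) = 1/2, which pins down the rate.
    lam = log(2) / 2

    direct = 1 - exp(-lam * 1)          # Pr(T < 1) from the CDF
    via_memoryless = 1 - 1 / sqrt(2)    # Pr(T > 2) = Pr(T > 1)^2, so Pr(T > 1) = 1/sqrt(2)

    print(direct, via_memoryless)       # both ~ 0.2929
    assert abs(direct - via_memoryless) < 1e-12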

4.1 The Gamma Distribution The gamma PDF deﬁnes a versatile collection of distributions. 8. The two parameters α > 0 and λ > 0 aﬀect the shape of the ensuing distribution signiﬁcantly. and then compute Pr(T < 1) from the corresponding integral. Perhaps. The PDF of a gamma random variable is given by fX (x) = λ(λx)α−1 e−λx Γ(α) x > 0. For nonnegative integers. it can easily be shown that Γ(k+1) = k!. the most well-known . It is interesting to note the interconnection between various random variables and their corresponding probability distributions. it is possible for the gamma PDF to accurately model a wide array of empirical data. By varying these two parameters. where Γ(·) denotes the gamma function deﬁned by Γ(z) = 0 ∞ ξ z−1 e−ξ dξ z > 0. The gamma function can be evaluated recursively using integration by parts. We can then write Pr(T < be to ﬁrst ﬁnd the value of λ associated with T .293. It follows that Pr(T > 1) = √ Pr(T > 2) = 1/ 2.104 CHAPTER 8. CONTINUOUS RANDOM VARIABLES We use the memoryless property to solve this problem. this yields the relation Γ(z + 1) = zΓ(z) for z > 0.4 Additional Distributions Probability distributions arise in many diﬀerent contexts and they assume various forms. 1) = 1 − Pr(T > 1) ≈ 0. An alternative way to solve this problem would 8. We conclude this ﬁrst chapter on continuous random variables by mentioning a few additional distributions that ﬁnd application in engineering. Pr(T > 2) = Pr(T > 1) Pr(T > 1 + 1|T > 1) = Pr(T > 1) Pr(T > 1) = (Pr(T > 1))2 .

In- terestingly.7. this speciﬁc value for the gamma function can be evaluated by a procedure similar to the one we used to integrate the Gaussian distribution. Γ 1 2 2 = 0 ∞ ∞ 0 ξ − 1 −ξ 2 ∞ 0 e dξ 0 1 1 ∞ ζ − 2 e−ζ dζ 1 = = 0 ξ − 2 ζ − 2 e−(ξ+ζ) dξdζ r2 1 2 e−r 4r3 sin θ cos θdrdθ sin θ cos θ 2 π/2 0 π/2 ∞ = 0 0 ∞ e−r 4rdrdθ = π. they are all instances of gamma distributions. Probability Density Function . ADDITIONAL DISTRIBUTIONS value for the gamma function at a non-integer argument is Γ(1/2) = √ 105 π. 2) for the Erlang distribution.7: Gamma distributions form a two-parameter family of PDFs and. depending on (α. Gamma Distribution 0.4.5) for the exponential distribution.8. (2. as seen in Figure 8. Many common distributions are special cases of the gamma distribution. they can be employed to model various situations. 0.3.4 0. 0.6 0. The parameters used above are (1.5) for the chi-square distribution and (4.3 0. the gamma distribution simply reduces to the exponential distribution discussed in Section 8.5 0.1 0 0 1 2 3 Value 4 5 Exponential Chi-Square Erlang Figure 8. The Exponential Distribution: When α = 1.3. λ).2 0.

.2 The Rayleigh Distribution r − r22 e 2σ σ2 The Rayleigh PDF is given by fR (r) = r ≥ 0. 8. Suppose that X and Y are two independent normal random variables. X2 . This distribution ﬁnds application in queueing theory. then Sm is an Erlang random variable. CONTINUOUS RANDOM VARIABLES The Chi-Square Distribution: When λ = 1/2 and α = k/2 for some positive integer k. Speciﬁcally. the gamma distribution is called an Erlang distribution. each with parameter λ > 0.106 CHAPTER 8. the gamma distribution becomes a chi-square distribution. Interestingly. let X1 . Let Sm be a random variable that denotes the time instant of the mth arrival. Suppose that the requests arriving at a computer server on the Internet are characterized by independent. Xm be independent exponential random variables. . . the sum of the squares of k independent standard normal random variables leads to a chi-square variable with k degrees of freedom. Its PDF is given by fX (x) = λ(λx)m−1 e−λx (m − 1)! x > 0. then . memoryless inter-arrival periods. An m-Erlang random variable can be obtained by summing m independent exponential random variables. Example 73. The Rayleigh distribution arises in the context of wireless communications.4. The random variable Sm given by Sm = k=1 m Xk is an Erlang random variable with parameter m and λ. . The Erlang Distribution: When α = m. fX (x) = x 2 −1 e− 2 k k x 2 2 Γ(k/2) x > 0. a positive integer. The chi-square distribution is one of the probability distributions most widely used in statistical inference problems.

.5 0. The diﬀerence between two independent and identically distributed exponential random variables is governed by a Laplace distribution. 2.4.8. the magnitude of this random vector possesses a Rayleigh distribution. ADDITIONAL DISTRIBUTIONS Rayleigh Distribution 0.3 0.7 0. This creates variations in signal strength at the destinations. 8.2 0.8: This ﬁgure plots the distributions of Rayleigh random variables for parameters σ 2 ∈ {1. Also. Rayleigh random variables are often employed to model amplitude ﬂuctuations of radio signals in urban environments. 4}. The PDF of a Laplacian random variable can then be written as fX (x) = 1 − |x| e b 2b x ∈ R. if R is a Rayleigh random variable then R2 has an exponential distribution. Radio signals propagating through wireless media get reﬂected. a phenomenon known as fading.3 The Laplace Distribution The Laplace distribution is sometimes called a double exponential distribution because it can be thought of as an exponential function and its reﬂection spliced together.6 0.4 0. refracted and diﬀracted.4.1 0 0 1 2 3 Value 1 2 4 107 Probability Density Function 4 5 Figure 8. where b is a positive constant. Example 74.

Moreover.4 The Cauchy Distribution The Cauchy distribution is considered a heavy-tail distribution because its tail is not exponentially bounded. 7th edition.5 1 2 Figure 8.2 0 -4 -2 0 Value 2 4 0.8 0. The PDF of a Cauchy random variable is given by fX (x) = π (γ 2 γ + x2 ) x ∈ R. .5. 5. mean (X1 + X2 + · · · + Xn )/n possesses the same Cauchy distribution. 8. X2 .108 CHAPTER 8. if X1 . 1. then the sample random variables appear in detection theory to model communication systems subject to extreme noise conditions. Probability Density Function . they also ﬁnds applications in physics.9: The PDF of a Laplace random variable can be constructed using an exponential function and its reﬂection spliced together. A First Course in Probability. Pearson Prentice Hall.6. 2}. . CONTINUOUS RANDOM VARIABLES Laplace Distribution 1 0. Xn are independent random variables.4 0. An interesting fact about this distribution is that its mean. 2006: Sections 5. S. each with a standard Cauchy distribution.4. .6 0..1. Ross. . This ﬁgures shows Laplace PDFs for parameters b ∈ {0. Cauchy Further Reading 1. Physicists sometimes refer to this distribution as the Lorentz distribution. variance and all higher moments are undeﬁned.3–5.

4 0. and Tsitsiklis.1 0 -4 -2 0 Value 0.5. 2004: Sections 3. Probability and Random Processes with Applications to Signal Processing and Communications.1. Bertsekas.3 0. Cambridge..1–3. 4. J. Introduction to Probability. Probability and Random Processes for Electrical and Computer Engineers. D. 2002: Sections 3. D.2 0.7 0.6 0.3.. S. G.5 1 2 109 Probability Density Function 2 4 Figure 8. 1. The PDFs of Cauchy random variables are plotted above for parameters γ ∈ {0. A.4.1– 3. 2.8. .. P. 2}..10: Cauchy distributions are categorized as heavy-tail distributions because of their very slow decay. 2006: Sections 4. 5. ADDITIONAL DISTRIBUTIONS Cauchy Distribution 0.5 0. 3.4. Athena Scientiﬁc. Gubner. Miller.1. and Childers.. J. N. L.

Chapter 9

Functions and Derived Distributions

We already know from our previous discussion that it is possible to form new random variables by applying real-valued functions to existing discrete random variables. In a similar manner, it is possible to generate a new random variable Y by taking a well-behaved function g(·) of a continuous random variable X. Suppose X is a continuous random variable and let g(·) be a real-valued function. The function composition Y = g(X) is itself a random variable. The graphical interpretation of this notion is analogous to the discrete case and appears in Figure 9.1.

Figure 9.1: A function of a random variable is a random variable itself. In this figure, Y is obtained by applying function g(·) to the value of continuous random variable X.

The probability that Y falls in a specific set S depends on both the function g(·)

and the PDF of X. In particular,

Pr(Y ∈ S) = Pr(g(X) ∈ S) = Pr(X ∈ g⁻¹(S)) = ∫_{g⁻¹(S)} fX(ξ) dξ,

where g⁻¹(S) = {ξ ∈ X(Ω) | g(ξ) ∈ S} denotes the preimage of S. In general, the fact that X is a continuous random variable does not provide much information about the properties of Y = g(X). For instance, Y could be a continuous random variable, a discrete random variable or neither. Likewise, we can derive the CDF of Y using the formula

FY(y) = Pr(g(X) ≤ y) = ∫_{{ξ ∈ X(Ω) | g(ξ) ≤ y}} fX(ξ) dξ.    (9.1)

Example 75. Let X be a Rayleigh random variable with parameter σ² = 1, and define Y = X². We wish to find the distribution of Y. Using (9.1), we can compute the CDF of Y. For y > 0, we get

FY(y) = Pr(Y ≤ y) = Pr(X² ≤ y) = Pr(−√y ≤ X ≤ √y) = ∫_0^{√y} ξ e^{−ξ²/2} dξ = ∫_0^{y} (1/2) e^{−ζ/2} dζ = 1 − e^{−y/2}.

In this derivation, we use the fact that X ≥ 0 in identifying the boundaries of integration, and we apply the change of variables ζ = ξ² in computing the integral. We recognize FY(·) as the CDF of an exponential random variable. This shows that the square of a Rayleigh random variable possesses an exponential distribution. To gain a better understanding of derived distributions, we begin our exposition of functions of continuous random variables by exploring specific cases.

9.1 Monotone Functions

A monotonic function is a function that preserves a given order. For instance, g(·) is monotone increasing if, for all x1 and x2 such that x1 ≤ x2, we have g(x1) ≤ g(x2). Likewise, a function g(·) is termed monotone decreasing provided that g(x1) ≥ g(x2) whenever x1 ≤ x2. If the inequalities above are

Y is obtained by passing random variable X through a function g(·).112 CHAPTER 9. y]). These scenarios all need to be accounted for in our expression. g(·) may be discontinuous and g −1 (y) may not contain any value. Let X be a continuous random variable uniformly distributed over interval [0. Y y g −1 (y) = {x|g(x) = y} X Figure 9. = Pr(X ∈ g −1 ((−∞. 1].2: In this ﬁgure. and this is accomplished by selecting the largest value in the set g −1 ((−∞. For non-decreasing function g(·) of continuous random variable X. the preimage g −1 (y) = {x|g(x) = y} may contain several elements. that is. y]) . as seen above. y])) = Pr X ≤ sup g −1 ((−∞. Furthermore. we get y FY (y) = Pr(Y ≤ y) = Pr X ≤ 2 y 2 y = dx = . FUNCTIONS AND DERIVED DISTRIBUTIONS replaced by strict inequalities (< and >). y]) (9. We wish to characterize the derived distribution of Y = 2X. Example 76. we have FY (y) = Pr(Y ≤ y) = Pr(g(X) ≤ y) = Pr(g(X) ∈ (−∞. then the corresponding functions are said to be strictly monotonic. For y ∈ [0. Monotonic functions of random variables are straightforward to handle because they admit the simple characterization of their derived CDFs. 2].2) The supremum comes from the fact that multiple values of x may lead to a same value of y. The preimage of point y contains several elements. This can be accomplished as follows. 2 0 . y]) = FX sup g −1 ((−∞.

y]) = Pr X ≥ inf g −1 ((−∞.3: If g(·) is monotone increasing and discontinuous. .3) This formula is similar to the previous case in that the inﬁmum accounts for the fact that the preimage g −1 (y) = {x|g(x) = y} may contain numerous elements or no elements at all. The same methodology applies to non-increasing functions. y]) is typically a well-deﬁned interval. (9. y]) = 1 − FX inf g −1 ((−∞. y]). Y is a uniform random variable with support [0. y]) . and let Y = g(X) be a function of continuous random variable X. y]) X Figure 9. an aﬃne function of a uniform random variable is also a uniform random variable. It is therefore advisable to deﬁne FY (y) in terms of g −1 ((−∞. 0. The CDF of Y is then equal to FY (y) = Pr(Y ≤ y) = Pr X ∈ g −1 ((−∞. By taking derivatives. we obtain the PDF of Y as 1. Suppose that g(·) is monotone decreasing. 2] otherwise. In particular. More generally.9. 2]. 2 fY (y) = y ∈ [0. MONOTONE FUNCTIONS Y 113 y g −1 ((−∞. whereas g −1 ((−∞.1. then g −1 (y) can be empty.

In such a case. dg − dx (x) = − dg (x) dx where again x = g −1 (y). Note that. It is therefore possible to write x = g −1 (y) unambiguous. Likewise. From this analysis. We point out that dg (x) dx is strictly negative because g(·) is a strictly decreasing function. we next consider the situation where g(·) is a diﬀerentiable and strictly increasing function. we can express the PDF of Y = g(X) in terms of the PDF of X and the derivative of g(·). we gather that Y = g(X) is a continuous random variable. g(·) becomes an invertible function. as the value of x is unique. Its PDF is given by fY (y) = d 1 − FX g −1 (y) dy = fX (x) .114 CHAPTER 9. with these two properties. . As before. Combining these two expressions. as seen above. FUNCTIONS AND DERIVED DISTRIBUTIONS 9. FY (y) = Pr(g(X) ≤ y) = Pr X ≥ g −1 (y) = 1 − FX g −1 (y) . we obtain the PDF of Y fY (y) = d d FY (y) = FX g −1 (y) dy dy d −1 dx = fX g −1 (y) g (y) = fX g −1 (y) . we ﬁnd that Y = g(X) is a continuous random variable and the PDF of Y can be expressed in terms of fX (·) and the derivative of g(·). Diﬀerentiating this equation with respect to y. We can write the CDF of random variable Y = g(X) as follows. we get fY (y) = fX (x) Note that dg (x) dx fX (x) dx = dg . suppose that g(·) is diﬀerentiable and strictly decreasing. the CDF of Y = g(X) becomes FY (y) = Pr X ≤ g −1 (y) = FX g −1 (y) . dy dy With the simple substitution x = g −1 (y). dy (x) dx = dg (x) dx is strictly positive because g(·) is a strictly increasing function.2 Diﬀerentiable Functions To further our understanding of derived distributions. In addition.



Figure 9.4: This ﬁgure provides a graphical interpretation of why the derivative of g(·) plays an important role in determining the value of the derived PDF fY (·). For an interval of width δ on the y-axis, the size of the corresponding interval on the x-axis depends heavily on the derivative of g(·). A small slope leads to a wide interval, whereas a steep slope produces a narrow interval on the x-axis.

we observe that, when g(·) is differentiable and strictly monotone, the PDF of Y becomes

fY(y) = fX(g⁻¹(y)) |dx/dy| = fX(x) / |dg/dx (x)|,    (9.4)

where x = g⁻¹(y). The role of dg/dx (·) in finding the derived PDF fY(·) is illustrated in Figure 9.4.

Example 77. Suppose that X is a Gaussian random variable with PDF

fX(x) = (1/√(2π)) e^{−x²/2}.

We wish to find the PDF of the random variable Y = aX + b, where a ≠ 0. In this example, we have g(x) = ax + b, and g(·) is immediately recognized as a strictly monotonic function. The inverse of g(·) is equal to

x = g⁻¹(y) = (y − b)/a,

and the desired derivative is given by

dx/dy = 1 / (dg/dx (x)) = 1/a.



The PDF of Y can be computed using (9.4), and is found to be

fY(y) = fX(g⁻¹(y)) |dx/dy| = (1/(√(2π) |a|)) e^{−(y−b)²/(2a²)},

which is itself a Gaussian distribution. Using a similar progression, we can show that an affine function of any Gaussian random variable remains a Gaussian random variable (provided a ≠ 0).

Example 78. Suppose X is a Rayleigh random variable with parameter σ² = 1, and let Y = X². We wish to derive the distribution of random variable Y using the PDF of X. Recall that the distribution of Rayleigh random variable X is given by

fX(x) = x e^{−x²/2},    x ≥ 0.

Since Y is the square of X, we have g(x) = x². Note that X is a non-negative random variable and g(x) = x² is strictly monotonic over [0, ∞). The PDF of Y is therefore found to be

fY(y) = fX(x) / |dg/dx (x)| = fX(√y) / (2√y) = (√y e^{−y/2}) / (2√y) = (1/2) e^{−y/2},

where y \ge 0. Thus, the random variable Y possesses an exponential distribution with parameter 1/2. It may be instructive to compare this derivation with the steps outlined in Example 75.

Finally, suppose that g(·) is a differentiable function with a finite number of local extrema. Then g(·) is piecewise monotonic, and we can write the PDF of Y = g(X) as

f_Y(y) = \sum_{\{x \in X(\Omega) \mid g(x) = y\}} \frac{f_X(x)}{\left|\frac{dg}{dx}(x)\right|}    (9.5)

for (almost) all values of y ∈ R. That is, f_Y(y) is obtained by first identifying the values of x for which g(x) = y. The PDF of Y is then computed explicitly by finding the local contribution of each of these values to f_Y(y), using the methodology developed above. This is accomplished by applying (9.4) repetitively to every value of x for which g(x) = y. It is certainly useful to compare (9.5) to its discrete equivalent (5.4), which is easier to understand and visualize.

Figure 9.5: The PDF of Y = g(X) when X is a continuous random variable and g(·) is diﬀerentiable with a ﬁnite number of local extrema is obtained by ﬁrst identifying all the values of x for which g(x) = y, and then calculating the contribution of each of these values to fY (y) using (9.4). The end result leads to (9.5).
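Formula (9.5) also lends itself to direct numerical evaluation. The short Python sketch below is our own illustration, not part of the original notes: it applies (9.5) to Y = X² with X a standard normal random variable, where every y > 0 has the two preimages ±√y and |dg/dx| = 2|x|, and compares the result against a Monte Carlo histogram.

```python
import numpy as np

def f_X(x):
    # standard normal PDF
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def f_Y(y):
    # formula (9.5) for Y = g(X) = X^2: preimages are +sqrt(y) and -sqrt(y),
    # and |dg/dx| = 2|x| at each preimage
    r = np.sqrt(y)
    return f_X(r) / (2 * r) + f_X(-r) / (2 * r)

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000) ** 2

# compare the derived density with an empirical histogram
edges = np.linspace(0.1, 4.0, 40)
hist, _ = np.histogram(samples, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(centers))))   # small, up to Monte Carlo noise
```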

Example 79. Suppose X is a continuous random variable uniformly distributed over [0, 2π). Let Y = cos(X), the random sampling of a sinusoidal waveform. We wish to find the PDF of Y. For y ∈ (−1, 1), the preimage g^{-1}(y) contains two values in [0, 2π), namely arccos(y) and 2π − arccos(y). Recall that the derivative of cos(x) is given by

\frac{d}{dx}\cos(x) = -\sin(x).

Collecting these results, we can write the PDF of Y as

f_Y(y) = \frac{f_X(\arccos(y))}{\left|-\sin(\arccos(y))\right|} + \frac{f_X(2\pi - \arccos(y))}{\left|-\sin(2\pi - \arccos(y))\right|} = \frac{1}{2\pi\sqrt{1-y^2}} + \frac{1}{2\pi\sqrt{1-y^2}} = \frac{1}{\pi\sqrt{1-y^2}},

where −1 < y < 1. The CDF of Y can be obtained by integrating f_Y(y). Not surprisingly, solving this integral involves a trigonometric substitution.
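As a quick sanity check of Example 79 (our addition, not part of the notes), one can sample X uniformly over [0, 2π), apply cos(·), and compare the empirical histogram of Y with the derived density 1/(π√(1 − y²)).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0 * np.pi, size=500_000)
y = np.cos(x)

def f_Y(y):
    # derived PDF from Example 79, valid for -1 < y < 1
    return 1.0 / (np.pi * np.sqrt(1.0 - y**2))

edges = np.linspace(-0.95, 0.95, 39)          # stay away from the endpoints
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(centers))))    # close to zero
```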

9.3 Generating Random Variables

In many engineering projects, computer simulations are employed as a ﬁrst step in validating concepts or comparing various design candidates. Many

such tasks involve the generation of random variables. In this section, we explore a method to generate arbitrary random variables based on a routine that outputs a random value uniformly distributed between zero and one.

9.3.1 Continuous Random Variables

First, we consider a scenario where the simulation task requires the generation of a continuous random variable. We begin our exposition with a simple observation. Let X be a continuous random variable with PDF f_X(·), and consider the random variable Y = F_X(X). Note that, because 0 ≤ F_X(x) ≤ 1, the PDF of Y is zero outside of the interval [0, 1]. Since F_X(·) is differentiable and strictly increasing over the support of X, we can use our knowledge of derived distributions to get

f_Y(y) = \frac{f_X(x)}{\left|\frac{dF_X}{dx}(x)\right|} = \frac{f_X(x)}{|f_X(x)|} = 1,

where y ∈ (0, 1) and x = F_X^{-1}(y). Thus, f_Y(y) = 1 for any y ∈ [0, 1]; that is, Y = F_X(X) is uniformly distributed over the unit interval.

This observation provides valuable insight about our original goal. We wish to generate a continuous random variable with CDF F_X(·). Suppose that we can generate a uniform random variable Y with PDF

f_Y(y) = 1 for y ∈ (0, 1), and 0 otherwise.

Note that, when F_X(·) is invertible, we have F_X^{-1}(F_X(X)) = X. Thus, applying F_X^{-1}(·) to the uniform random variable Y should lead to the desired result. Define V = F_X^{-1}(Y), and consider the PDF of V. Using our knowledge of derived distributions, we get

f_V(v) = \frac{f_Y(y)}{\left|\frac{dF_X^{-1}}{dy}(y)\right|} = f_Y(y)\,\frac{dF_X}{dv}(v) = f_X(v),

where y = F_X(v). Hence V = F_X^{-1}(Y) possesses the desired distribution.
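The construction V = F_X^{-1}(Y) translates directly into code whenever the inverse CDF is available in closed form. The sketch below is our own illustration (the function names are ours): it generates Rayleigh samples with σ = 1 from uniform samples, using the inverse CDF F^{-1}(u) = √(−2 ln(1 − u)).

```python
import numpy as np

def rayleigh_inverse_cdf(u, sigma=1.0):
    # inverse of F(x) = 1 - exp(-x^2 / (2 sigma^2)) for 0 <= u < 1
    return sigma * np.sqrt(-2.0 * np.log(1.0 - u))

rng = np.random.default_rng(2)
u = rng.uniform(size=100_000)
samples = rayleigh_inverse_cdf(u)

# the mean of a Rayleigh variable with sigma = 1 is sqrt(pi/2) ~ 1.2533
print(samples.mean())
```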

In other words, to create a continuous random variable X with CDF F_X(·), one can apply the function F_X^{-1}(·) to a random variable Y that is uniformly distributed over [0, 1]. We stress that this technique can be utilized to generate any random variable with PDF f_X(·) using a computer routine that outputs a random value uniformly distributed between zero and one.

Example 80. Suppose that Y is a continuous random variable uniformly distributed over [0, 1]. We wish to create an exponential random variable X with parameter λ by taking a function of Y. Random variable X is nonnegative, and its CDF is given by F_X(x) = 1 − e^{−λx} for x ≥ 0. The inverse of F_X(·) is given by

F_X^{-1}(y) = -\frac{1}{\lambda}\log(1 - y).

We can therefore generate the desired random variable X with

X = -\frac{1}{\lambda}\log(1 - Y).

Indeed, for x ≥ 0, we obtain

f_X(x) = \frac{f_Y(y)}{\frac{1}{\lambda(1-y)}} = \lambda(1 - y) = \lambda e^{-\lambda x},

where we have implicitly defined y = 1 − e^{−λx}. This is the desired distribution.

9.3.2 Discrete Random Variables

It is equally straightforward to generate a discrete random variable from a continuous random variable that is uniformly distributed between zero and one. Let p_X(·) be a PMF, and denote its support by {x_1, x_2, ...} where x_i < x_j whenever i < j. We know that the corresponding CDF is given by

F_X(x) = \sum_{x_i \le x} p_X(x_i).

Suppose that Y is a continuous random variable uniformly distributed over [0, 1]. We can generate a random variable X with PMF p_X(·) using the following case function,

g(y) = x_i if F_X(x_{i-1}) < y \le F_X(x_i), and g(y) = 0 otherwise.
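In code, this case function amounts to an inverse-CDF lookup, so it can be realized with a binary search over the cumulative sums of the PMF rather than an explicit chain of cases. The sketch below is ours, for illustration only; the verification that the construction has the right PMF follows next.

```python
import numpy as np

def make_discrete_sampler(support, pmf):
    """Return a function mapping uniform samples in (0, 1] to values with the given PMF."""
    support = np.asarray(support)
    cdf = np.cumsum(pmf)                       # F_X(x_1), F_X(x_2), ...
    def g(y):
        # index i such that F_X(x_{i-1}) < y <= F_X(x_i)
        idx = np.searchsorted(cdf, y, side="left")
        return support[idx]
    return g

g = make_discrete_sampler([1, 2, 3], [0.2, 0.5, 0.3])
rng = np.random.default_rng(3)
x = g(rng.uniform(size=100_000))
print([(v, np.mean(x == v)) for v in (1, 2, 3)])   # close to 0.2, 0.5, 0.3
```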

Note that we have used the convention x_0 = 0 to simplify the definition of g(·). Taking X = g(Y), we get

\Pr(X = x_i) = \Pr(F_X(x_{i-1}) < Y \le F_X(x_i)) = F_X(x_i) - F_X(x_{i-1}) = p_X(x_i).

Of course, implementing a discrete random variable through a case statement may lead to an excessively slow routine. For many discrete random variables, there are much more efficient ways to generate a specific output.

Further Reading

1. Ross, S., A First Course in Probability, Pearson Prentice Hall, 7th edition, 2006.
2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002.
3. Gubner, J. A., Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006.
4. Miller, S. L., and Childers, D. G., Probability and Random Processes with Applications to Signal Processing and Communications, 2004.
5. Mitzenmacher, M., and Upfal, E., Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge, 2005: Chapters 1 & 10.

Chapter 10

Expectations and Bounds

The concept of expectation, which was originally introduced in the context of discrete random variables, can be generalized to other types of random variables. For instance, the expectation of a continuous random variable is defined in terms of its probability density function (PDF). We know from our previous discussion that expectations provide an effective way to summarize the information contained in the distribution of a random variable. As we will see shortly, expectations are also very valuable in establishing bounds on probabilities.

10.1 Expectations Revisited

The definition of an expectation associated with a continuous random variable is very similar to its discrete counterpart; the weighted sum is simply replaced by a weighted integral. For a continuous random variable X with PDF f_X(·), the expectation of g(X) is defined by

E[g(X)] = \int_{\mathbb{R}} g(\xi) f_X(\xi)\,d\xi.

In particular, the mean of X is equal to

E[X] = \int_{\mathbb{R}} \xi f_X(\xi)\,d\xi

and its variance becomes

\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = \int_{\mathbb{R}} (\xi - E[X])^2 f_X(\xi)\,d\xi.

the PDF of this random variable can be written as fX (ξ) = √ variables. ∞ (ξ−m)2 1 E[X] = √ ξe− 2σ2 dξ 2πσ −∞ ∞ σ m − ζ2 =√ ζ+ e 2 dζ σ 2π −∞ ∞ ζ2 σ σ ζe− 2 dζ + √ =√ 2π −∞ 2π (ξ−m)2 1 e− 2σ2 2πσ ξ ∈ R. We wish to calculate the mean and variance of a Gaussian random variable with parameters m and σ 2 . we have leveraged the facts that ζe− 2 is an absolutely integrable. Of course. . this property implies that ∞ −∞ e− (ξ−m)2 2σ 2 dξ = √ 2πσ. σ ζ2 In ﬁnding a solution. the variance of random variable X can also be computed using Var(X) = E [X 2 ] − (E[X])2 . 2πσ obtained by more conventional methods. σ3 Rearranging the terms yields ∞ −∞ (ξ − m)2 − (ξ−m)2 √ e 2σ2 dξ = σ 2 . odd function. we get ∞ −∞ √ (ξ − m)2 − (ξ−m)2 e 2σ2 dξ = 2π. We also took advantage of the normalization condition which ensures that a Gaussian PDF integrates to one. To derive the variance. The mean of X can be obtained through direct integration. Diﬀerentiating both sides of this equation with respect to σ. Hence. we again use the normalization condition. We wish to compute its mean and variance. For a Gaussian PDF. Suppose that R is a Rayleigh random variable with parameter σ 2 . EXPECTATIONS AND BOUNDS As before. Example 81. By deﬁnition.122 CHAPTER 10. with a change of ∞ −∞ m − ζ2 e 2 dζ = m. Var(X) = E [(X − m)2 ] = σ 2 . the variance can also be Example 82.

The variance of R is therefore equal to Var[R] = (4 − π) 2 σ . This situation is an artifact of the following relation. We compute the second moment of R below. ξfR (ξ)dξ = 2 − ξ2 2σ ∞ 0 ∞ ξ 2 − ξ22 e 2σ dξ σ2 ξ2 = −ξe = √ ∞ 0 ∞ + 0 e− 2σ2 dξ √ 2πσ . ∞) must be equal to 1/2. 2 Typically. where X and Y are independent zero-mean Gaussian variables with variance σ 2 . For nonnegative random variable X. the parameter σ 2 in fR (·) is a tribute to this popular construction. σ 2 is employed to denote the variance of a random variable.1. . A Rayleigh random variable R can be generated through the √ expression R = X 2 + Y 2 . It may be confusing at ﬁrst to have a random variable R described in terms of parameter σ 2 whose variance is equal to (4−π)σ 2 /2.10. E R2 = 0 ∞ ξ 2 fR (ξ)dξ = 2 − ξ2 2σ ∞ 0 ∞ ξ 3 − ξ22 e 2σ dξ σ2 ξ2 = −ξ 2 e ∞ 0 + 0 2ξe− 2σ2 dξ = −2σ 2 e 2 − ξ2 2σ ∞ 0 = 2σ 2 . Thus. not a representation of its actual variance. an alternative way to compute E[X] is described in Proposition 7. notice the judicious use of the fact that the integral of a standard normal random variable over [0. 2 2πσ 0 ζ2 1 √ e− 2σ2 dζ = 2πσ Integration by parts is key in solving this expectation. Also. we get E[R] = 0 ∞ 123 r − r22 e 2σ σ2 r ≥ 0. EXPECTATIONS REVISITED Recall that R is a nonnegative random variable with PDF fR (r) = Using this distribution.

Proposition 7. Suppose that X is a nonnegative random variable with finite mean. Then,

E[X] = \int_0^{\infty} \Pr(X > x)\,dx.

Proof. We offer a proof for the special case where X is a continuous random variable, although the result remains true in general.

\int_0^{\infty} \Pr(X > x)\,dx = \int_0^{\infty} \int_x^{\infty} f_X(\xi)\,d\xi\,dx = \int_0^{\infty} \int_0^{\xi} f_X(\xi)\,dx\,d\xi = \int_0^{\infty} \xi f_X(\xi)\,d\xi = E[X].

Interchanging the order of integration is justified because X is assumed to have finite mean.

Example 83. A player throws darts at a circular target hung on a wall. The dartboard has unit radius, and the position of every dart is distributed uniformly over the target. We wish to compute the expected distance from each dart to the center of the dartboard. Let R denote the distance from a dart to the center of the target. For 0 ≤ r ≤ 1, the probability that R exceeds r is given by

\Pr(R > r) = 1 - \Pr(R \le r) = 1 - \frac{\pi r^2}{\pi} = 1 - r^2.

Then, by Proposition 7, the expected value of R is equal to

E[R] = \int_0^1 (1 - r^2)\,dr = \left[r - \frac{r^3}{3}\right]_0^1 = 1 - \frac{1}{3} = \frac{2}{3}.

Notice how we were able to compute the answer without deriving an explicit expression for f_R(·).

10.2 Moment Generating Functions

The moment generating function of a random variable X is defined by

M_X(s) = E\left[e^{sX}\right].

For continuous random variables, the moment generating function becomes

M_X(s) = \int_{-\infty}^{\infty} f_X(\xi)\, e^{s\xi}\,d\xi.

The experienced reader will quickly recognize the definition of M_X(s) as a variant of the Laplace transform, a widely used linear operator. The definition of the moment generating function applies to discrete random variables as well. In fact, for integer-valued random variables, the moment generating function and the ordinary generating function are related through the equation

M_X(s) = \sum_{k \in X(\Omega)} e^{sk} p_X(k) = G_X(e^s).

The moment generating function gets its name from the following property. Suppose that M_X(s) exists within an open interval around s = 0. Then the nth moment of X is given by

\left.\frac{d^n M_X}{ds^n}(s)\right|_{s=0} = \left.\frac{d^n}{ds^n} E\left[e^{sX}\right]\right|_{s=0} = \left. E\left[\frac{d^n}{ds^n} e^{sX}\right]\right|_{s=0} = \left. E\left[X^n e^{sX}\right]\right|_{s=0} = E\left[X^n\right].

In words, if we differentiate M_X(s) a total of n times and then evaluate the resulting function at zero, we obtain the nth moment of X. In particular, we have dM_X/ds(0) = E[X] and d²M_X/ds²(0) = E[X²].

Example 84 (Exponential Random Variable). Let X be an exponential random variable with parameter λ. The moment generating function of X is given by

M_X(s) = \int_0^{\infty} \lambda e^{-\lambda\xi} e^{s\xi}\,d\xi = \int_0^{\infty} \lambda e^{-(\lambda - s)\xi}\,d\xi = \frac{\lambda}{\lambda - s}.

The mean of X is

E[X] = \frac{dM_X}{ds}(0) = \left.\frac{\lambda}{(\lambda - s)^2}\right|_{s=0} = \frac{1}{\lambda};

more generally, the nth moment of X can be computed as

E\left[X^n\right] = \frac{d^n M_X}{ds^n}(0) = \left.\frac{n!\,\lambda}{(\lambda - s)^{n+1}}\right|_{s=0} = \frac{n!}{\lambda^n}.

Incidentally, we can deduce from these results that the variance of X is 1/λ².
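The moment-from-derivative property is easy to check symbolically. The snippet below is our own illustration: it differentiates the exponential MGF λ/(λ − s) and evaluates at s = 0, recovering E[X] = 1/λ, E[X²] = 2/λ², and the variance 1/λ².

```python
import sympy as sp

s, lam = sp.symbols("s lambda", positive=True)
M = lam / (lam - s)                 # MGF of an exponential random variable

first_moment = sp.diff(M, s, 1).subs(s, 0)    # E[X]
second_moment = sp.diff(M, s, 2).subs(s, 0)   # E[X^2]
variance = sp.simplify(second_moment - first_moment**2)

print(first_moment, second_moment, variance)  # 1/lambda, 2/lambda**2, lambda**(-2)
```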

Example 86 (Gaussian Random Variable). we can show that the variance of U is (n2 − 1)/12. n}. Suppose U is a discrete uniform random variable taking value in U (Ω) = {1. 2π The moment generating function of X is equal to ∞ ξ2 1 √ e− 2 esξ dξ MX (s) = E esX = 2π −∞ ∞ ∞ 2 +2sξ ξ ξ2 −2sξ+s2 s2 1 1 2 √ e− 2 dξ = e 2 √ e− = dξ 2π 2π −∞ −∞ ∞ (ξ−s)2 s2 s2 1 √ e− 2 dξ = e 2 . Let X be a standard normal random variable whose PDF is given by ξ2 1 fX (ξ) = √ e− 2 . E[U ] = dMU ne(n+2)s − (n + 1)e(n+1)s + es (0) = lim s→0 ds n (es − 1)2 n(n + 2)e(n+1)s − (n + 1)2 ens + 1 s→0 2n (es − 1) n(n + 1)(n + 2)ne(n+1)s − n(n + 1)2 ens n+1 = lim . The exponential function is analytic and possesses many representations. =e2 2π −∞ . 6 From these two results. . which is equal to (n + 1)(2n + 1) E U2 = . The simple form of the moment generating function of a standard normal random variable points to its importance in many situations. = s s→0 2ne 2 Notice the double application of l’Hˆpital’s rule to evaluate the derivative of o = lim MU (s) at zero. n(es − 1) The moment generating function provides an alternate and somewhat intricate way to compute the mean of U . Then. one can derive the second moment of U . pU (k) = 1/n for 1 ≤ k ≤ n and n MU (s) = k=1 1 1 sk e = n n n esk = k=1 es (ens − 1) . but it does not rely on prior knowledge of special sums. 2. Through similar steps. This may be deemed a more contrived method to derive the expected value of a discrete uniform random variables. . . EXPECTATIONS AND BOUNDS Example 85 (Discrete Uniform Random Variable).126 CHAPTER 10. .

The last equality follows from the normalization condition and the fact that the integrand is a Gaussian PDF. Thus, the moment generating function of a standard normal random variable is M_X(s) = e^{s^2/2}.

Example 87. Let M_X(s) be the moment generating function associated with a random variable X, and consider the random variable Y = aX + b, where a and b are constants. The moment generating function of Y can be obtained as follows,

M_Y(s) = E\left[e^{sY}\right] = E\left[e^{s(aX+b)}\right] = e^{sb} E\left[e^{saX}\right] = e^{sb} M_X(as).

Thus, if Y is an affine function of X, then M_Y(s) = e^{sb} M_X(as). We can use this property to identify the moment generating function of a Gaussian random variable with parameters m and σ². Recall that an affine function of a Gaussian random variable is also Gaussian. Let X be a standard normal random variable and let Y = σX + m; then the moment generating function of Y becomes

M_Y(s) = E\left[e^{sY}\right] = E\left[e^{s(\sigma X + m)}\right] = e^{sm} E\left[e^{s\sigma X}\right] = e^{sm + \frac{s^2\sigma^2}{2}}.

From this moment generating function, we get

E[Y] = \frac{dM_Y}{ds}(0) = \left.(m + s\sigma^2)\, e^{sm + \frac{s^2\sigma^2}{2}}\right|_{s=0} = m

E\left[Y^2\right] = \frac{d^2 M_Y}{ds^2}(0) = \left.\left(\sigma^2 + (m + s\sigma^2)^2\right) e^{sm + \frac{s^2\sigma^2}{2}}\right|_{s=0} = \sigma^2 + m^2.

The mean of Y is m and its variance is equal to σ², as anticipated.

10.3 Important Inequalities

There are many situations for which computing the exact value of a probability is impossible or impractical. In such cases, it may be acceptable to provide bounds on the value of an elusive probability. The expectation is most important in finding pertinent bounds. As we will see, many upper bounds rely on the concept of dominating functions. Suppose that g(x) and h(x) are two nonnegative functions such that g(x) ≤ h(x) for all x ∈ R.

Then, for any continuous random variable X, the following inequality holds,

E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \le \int_{-\infty}^{\infty} h(x) f_X(x)\,dx = E[h(X)],

where f_X(·) acts as the weighting function. In words, if g(x) ≤ h(x) for all x ∈ R, then E[g(X)] is less than or equal to E[h(X)]: the weighted integral of g(·) is dominated by the weighted integral of h(·), as illustrated in Figure 10.1. This notion is instrumental in understanding bounding techniques.

Figure 10.1: If g(x) and h(x) are two nonnegative functions such that g(x) ≤ h(x) for all x ∈ R, then E[g(X)] is less than or equal to E[h(X)].

10.3.1 The Markov Inequality

We begin our exposition of classical upper bounds with a result known as the Markov inequality. Recall that, for an admissible set S ⊂ R, we have

\Pr(X \in S) = E\left[\mathbf{1}_S(X)\right].

Thus, to obtain a bound on Pr(X ∈ S), it suffices to find a function that dominates 1_S(·) and for which we can compute the expectation.

Suppose that we wish to bound Pr(X ≥ a) where X is a nonnegative random variable. In this case, we can select S = [a, ∞) and the function h(x) = x/a. For any x ≥ 0, we have h(x) ≥ 1_S(x), as illustrated in Figure 10.2. It follows that

\Pr(X \ge a) = E\left[\mathbf{1}_S(X)\right] \le E[h(X)] = \frac{E[X]}{a}.

Figure 10.2: Suppose that we wish to find a bound for Pr(X ≥ a). We define the set S = [a, ∞) and the function g(x) = 1_S(x). Using the dominating function h(x) = x/a, we conclude that Pr(X ≥ a) ≤ a^{-1} E[X] for any nonnegative random variable X.
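For a concrete feel of how loose or tight the Markov bound can be, the snippet below (our own illustration) compares a^{-1} E[X] with the exact tail probability Pr(X ≥ a) = e^{−a} for an exponential random variable with λ = 1.

```python
import numpy as np

mean = 1.0                      # E[X] for an exponential random variable with lambda = 1
for a in (1.0, 2.0, 5.0, 10.0):
    markov = mean / a                    # Markov upper bound
    exact = np.exp(-a)                   # true Pr(X >= a)
    print(f"a={a:5.1f}  Markov bound={markov:.4f}  exact={exact:.6f}")
```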

10.3.2 The Chebyshev Inequality

The Chebyshev inequality provides an extension of this methodology to various dominating functions. This yields a number of bounds that become useful in a myriad of contexts.

Proposition 8 (Chebyshev Inequality). Suppose h(·) is a nonnegative function and let S be an admissible set. We denote the infimum of h(·) over S by

i_S = \inf_{x \in S} h(x).

The Chebyshev inequality asserts that

i_S \Pr(X \in S) \le E[h(X)],

where X is an arbitrary random variable.

Proof. This is a remarkably powerful result, and it can be shown in a few steps. The definition of i_S and the fact that h(·) is nonnegative imply that

i_S \mathbf{1}_S(x) \le h(x)\mathbf{1}_S(x) \le h(x)    (10.1)

for any x ∈ R. Moreover, for any such x and distribution f_X(·), we can write i_S 1_S(x) f_X(x) ≤ h(x) f_X(x), which in turn yields

i_S \Pr(X \in S) = E\left[i_S \mathbf{1}_S(X)\right] = \int_{\mathbb{R}} i_S \mathbf{1}_S(\xi) f_X(\xi)\,d\xi \le \int_{\mathbb{R}} h(\xi) f_X(\xi)\,d\xi = E[h(X)].


When i_S > 0, this provides the upper bound Pr(X ∈ S) ≤ i_S^{-1} E[h(X)]. Although the proof assumes a continuous random variable, we emphasize that the Chebyshev inequality applies to discrete and continuous random variables alike. The interested reader can rework the proof using the discrete setting and a generic PMF. We provide special instances of the Chebyshev inequality below.

Example 88. Consider the nonnegative function h(x) = x² and let S = {x | x² ≥ b²}, where b is a positive constant. We wish to find a bound on the probability that |X| exceeds b. Using the Chebyshev inequality, we have i_S = inf_{x ∈ S} x² = b² and, consequently, we get

b^2 \Pr(X \in S) \le E\left[X^2\right].

Constant b being a positive real number, we can rewrite this equation as

\Pr(|X| \ge b) = \Pr(X \in S) \le \frac{E\left[X^2\right]}{b^2}.

Example 89 (The Cantelli Inequality). Suppose that X is a random variable with mean m and variance σ². We wish to show that

\Pr(X - m \ge a) \le \frac{\sigma^2}{a^2 + \sigma^2},

where a ≥ 0. This equation is slightly more involved and requires a small optimization in addition to the Chebyshev inequality. Define Y = X − m and note that, by construction, we have E[Y] = 0. Consider the probability Pr(Y ≥ a) where a > 0, and let S = {y | y ≥ a}. Also, define the nonnegative function h(y) = (y + b)², where b > 0. Following the steps of the Chebyshev inequality, we write the infimum of h(y) over S as

i_S = \inf_{y \in S} (y + b)^2 = (a + b)^2.

Then, applying the Chebyshev inequality, we obtain

\Pr(Y \ge a) \le \frac{E\left[(Y + b)^2\right]}{(a + b)^2} = \frac{\sigma^2 + b^2}{(a + b)^2}.    (10.2)


This inequality holds for any b > 0. To produce a better upper bound, we minimize the right-hand side of (10.2) over all possible values of b. Differentiating this expression and setting the derivative equal to zero yields

\frac{2b}{(a + b)^2} = \frac{2(\sigma^2 + b^2)}{(a + b)^3}

or, equivalently, b = σ²/a. A second derivative test reveals that this is indeed a minimum. Collecting these results, we obtain

\Pr(Y \ge a) \le \frac{\sigma^2 + b^2}{(a + b)^2} = \frac{\sigma^2}{a^2 + \sigma^2}.

Substituting Y = X − m leads to the desired result.

In some circumstances, a Chebyshev inequality can be tight.

Example 90. Let a and b be two constants such that 0 < b ≤ a. Consider the function h(x) = x² along with the set S = {x | x² ≥ a²}. Furthermore, let X be a discrete random variable with PMF

p_X(x) = 1 - \frac{b^2}{a^2} for x = 0, \frac{b^2}{a^2} for x = a, and 0 otherwise.

For this random variable, we have Pr(X ∈ S) = b²/a². By inspection, we also gather that the second moment of X is equal to E[X²] = b². Applying the Chebyshev inequality, we get i_S = inf_{x ∈ S} h(x) = a² and therefore

\Pr(X \in S) \le i_S^{-1} E[h(X)] = \frac{b^2}{a^2}.

Thus, in this particular example, the inequality is met with equality.
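Example 90 can also be verified directly. The snippet below is our own check: it draws from the two-point PMF with a = 2 and b = 1 and confirms that the empirical value of Pr(|X| ≥ a) matches the Chebyshev bound E[X²]/a² up to sampling noise.

```python
import numpy as np

a, b = 2.0, 1.0
rng = np.random.default_rng(4)
# PMF: Pr(X = 0) = 1 - b^2/a^2, Pr(X = a) = b^2/a^2
x = rng.choice([0.0, a], size=1_000_000, p=[1 - b**2 / a**2, b**2 / a**2])

empirical = np.mean(np.abs(x) >= a)     # Pr(X in S)
bound = np.mean(x**2) / a**2            # E[X^2] / a^2
print(empirical, bound, b**2 / a**2)    # all approximately 0.25
```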

10.3.3 The Chernoff Bound

The Chernoﬀ bound is yet another upper bound that can be constructed from the Chebyshev inequality. Still, because of its central role in many application domains, it deserves its own section. Suppose that we want to ﬁnd a bound on the probability Pr(X ≥ a). We can apply the Chebyshev inequality using the

nonnegative function h(x) = e^{sx}, where s > 0. For this specific construction, S = [a, ∞) and

i_S = \inf_{x \in S} e^{sx} = e^{sa}.

It follows that

\Pr(X \ge a) \le e^{-sa} E\left[e^{sX}\right] = e^{-sa} M_X(s).

Figure 10.3 plots e^{s(x−a)} for various values of s > 0. It should be noted that all these functions dominate 1_{[a,∞)}(x), and therefore they each provide a different bound on Pr(X ≥ a). It is natural to select the function that provides the best bound; yet, in general, this optimal choice of e^{s(x−a)} may depend on the distribution of X and the value of a. Because the inequality above holds for any s > 0, we can optimize the upper bound over all possible values of s, thereby picking the best one,

\Pr(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s).    (10.3)

This inequality is called the Chernoff bound, and the right-hand side of (10.3) involves a search over all possible values of s. It is sometimes expressed in terms of the log-moment generating function Λ(s) = log M_X(s). In this latter case, (10.3) translates into

\log \Pr(X \ge a) \le -\sup_{s > 0} \{sa - \Lambda(s)\},    (10.4)

which explains why the right-hand side of (10.4) is called the Legendre transformation of Λ(s).

Figure 10.3: This figure illustrates how exponential functions can be employed to provide bounds on Pr(X > a). Optimizing over all admissible exponential functions, e^{s(x−a)} where s > 0, leads to the celebrated Chernoff bound.
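The optimization over s in (10.3) is often carried out numerically. The sketch below is our own illustration: it applies the bound to a standard normal random variable, whose MGF e^{s²/2} appears earlier in the chapter, and compares the numerically optimized bound with the closed-form value e^{−a²/2} and with the exact tail probability.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def chernoff_bound(a):
    # minimize exp(-s*a) * M_X(s) with M_X(s) = exp(s^2 / 2) over s > 0
    res = minimize_scalar(lambda s: np.exp(-s * a + s**2 / 2),
                          bounds=(1e-6, 50.0), method="bounded")
    return res.fun

for a in (1.0, 2.0, 3.0):
    print(a, chernoff_bound(a), np.exp(-a**2 / 2), 1 - norm.cdf(a))
```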

10.3.4 Jensen's Inequality

Some inequalities can be derived based on the properties of a single function; the Jensen inequality is one such example. Suppose that the function g(·) is convex and twice differentiable, with

\frac{d^2 g}{dx^2}(x) \ge 0

for all x ∈ R. From the fundamental theorem of calculus, we have

g(x) = g(a) + \int_a^x \frac{dg}{dx}(\xi)\,d\xi.

Furthermore, because the second derivative of g(·) is a nonnegative function, we gather that dg/dx(·) is a monotone increasing function. As such, for any value of a, we have

g(x) = g(a) + \int_a^x \frac{dg}{dx}(\xi)\,d\xi \ge g(a) + \int_a^x \frac{dg}{dx}(a)\,d\xi = g(a) + (x - a)\frac{dg}{dx}(a).

For the random variable X, we then have

g(X) \ge g(a) + (X - a)\frac{dg}{dx}(a).

Choosing a = E[X] and taking expectations on both sides, we obtain

E[g(X)] \ge g(E[X]) + (E[X] - E[X])\frac{dg}{dx}(E[X]) = g(E[X]).

That is, E[g(X)] ≥ g(E[X]), provided that these two expectations exist. The Jensen inequality actually holds for convex functions that are not twice differentiable, but the proof is much harder in the general setting.

Further Reading

1. Ross, S., A First Course in Probability, Pearson Prentice Hall, 7th edition, 2006.
2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002.
3. Gubner, J. A., Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006.
4. Miller, S. L., and Childers, D. G., Probability and Random Processes with Applications to Signal Processing and Communications, 2004.

Chapter 11

Multiple Continuous Random Variables

Being versed at dealing with multiple random variables is an essential part of statistics, engineering, and science. This is equally true for models based on discrete and continuous random variables. In this chapter, we focus on the latter and expand our exposition of continuous random variables to random vectors. Again, our initial survey of this topic revolves around conditional distributions and pairs of random variables; more complex scenarios will be considered in the later parts of the chapter.

11.1 Joint Cumulative Distributions

Let X and Y be two random variables associated with a same experiment. The joint cumulative distribution function of X and Y is defined by

F_{X,Y}(x, y) = \Pr(X \le x, Y \le y),  x, y \in \mathbb{R}.

Keeping in mind that X and Y are real-valued functions acting on a same sample space, we can also write

F_{X,Y}(x, y) = \Pr\left(\{\omega \in \Omega \mid X(\omega) \le x, Y(\omega) \le y\}\right).

we can identify a few properties of the joint CDF. y) ∈ R2 a ≤ x ≤ b. standard calculus asserts that the following equation holds. y) = (x. ζ)dζdξ. MULTIPLE CONTINUOUS RANDOM VARIABLES From this characterization.Y (x.Y (x. fX. if S is the cartesian product of two intervals. R2 Furthermore. Y ) ∈ S) = R2 1S (ξ.Y (ξ.1) Hereafter. Similarly. for any admissible set S ⊂ R2 . S = In particular.Y (ξ.Y (·. When this is the case.Y ∂ 2 FX. x y FX.Y (ξ. Y (ω) ∈ R}) = Pr ({ω ∈ Ω|X(ω) ≤ x}) = FX (x).Y (x. it is possible to deﬁne ∂ 2 FX. Y (ω) ≤ y}) y↑∞ = Pr ({ω ∈ Ω|X(ω) ≤ x. . ζ)fX. we have limx↑∞ FX. ·) is totally diﬀerentiable. y) = When the function FX. the probability that (X. ∂x∂y ∂y∂x (11.Y (x. y) = −∞ −∞ fX.136 CHAPTER 11.1). Taking limits in the other direction. y) = lim FX. c ≤ y ≤ d . From its deﬁnition. ζ)dζdξ (11. y) = FY (y). y↑∞ lim FX. ·) is a nonnegative function which integrates to one. fX.Y (x.Y (x. y ∈ R. y) = 0. y) = lim Pr ({ω ∈ Ω|X(ω) ≤ x. ζ)dζdξ. we note that fX. y) x. S = (x. Y ) ∈ S can be evaluated through the integral formula Pr((X.2) fX. ζ)dζdξ = 1.Y (·. y↓−∞ the joint probability density function of X and Y . we refer to a pair of random variables as continuous if the corresponding joint PDF exists and is deﬁned unambiguously through (11.Y (ξ.Y (x. we get x↓−∞ lim FX.

The probability that (X.5}. 2πσ 2 We wish to ﬁnd the probability that (X. then 1 − x2 +y2 fX. ζ)dζdξ. ζ) dξdζ = . Y ) is contained within a circle of radius half is one fourth. the probability that (X. Y ) is uniformly distributed over the unit circle. Y ) ∈ S) = Pr(a ≤ X ≤ b. we can write its PDF as r2 r fR (r) = 2 e− 2σ2 r ≥ 0. The probability that (X. Y ) falls within a circle of radius r centered at the origin. y)dxdy = Pr(R ≤ r) = e 2σ2 (x. For (x.2) applied to √ this particular problem. y) = π 0 otherwise. π 4 Thus. Y ) is contained within a circle of radius r is 1 − e− 2σ2 . Suppose that the random pair (X.11. Y ) ∈ S reduces to the typical integral form b d 137 Pr((X.Y (·. Let X and Y be two independent zero-mean Gaussian random variables. σ From this equation. we gather That R possesses a Rayleigh distribution with parameter σ 2 . Y ) belongs Pr((X.1. Let R = X 2 + Y 2 and assume r > 0. each with variance σ 2 . Y ) ∈ S) = R2 to S is given by 1 1S (ξ. y) ∈ R2 |x2 + y 2 ≤ 0. a c Example 91. radius 1/2. Y ) lies inside a circle of Let S = {(x. y)dxdy 2πσ 2 R≤r r R≤r 2π 0 = 0 1 e 2πσ 2 2 − r2 2σ rdθdr = 1 − e− 2σ2 . y) ∈ R2 . Example 92. y) = e 2σ2 . r2 r2 Recognizing that R is a continuous random variable.Y (x. their joint PDF is given by 1 − x2 +y2 fX. JOINT CUMULATIVE DISTRIBUTIONS then the probability that (X.Y (x.Y (ξ. We can compute this probability using integral formula (11. c ≤ Y ≤ d) = fX. We wish to ﬁnd the probability that the point (X.Y (x. We can express the joint PDF fX. . ·) as 1 x2 + y 2 ≤ 1 fX.

Under suitable conditions. we gather that Pr(A) = 1/2 and. x] ∩ I) .Y (x. MULTIPLE CONTINUOUS RANDOM VARIABLES 11. equally straightforward to specify the conditional PDF of X given A. y ≥ 0. this problem. Note that event A can be deﬁned in terms of variables X and Y . x ξ x ξ We wish to compute the conditional PDF of X given A = {X ≤ Y }. A = {X ∈ I} and the conditional CDF of X becomes FX|A (x) = Pr(X ≤ x|X ∈ I) = Pr ({X ≤ x} ∩ {X ∈ I}) Pr(X ∈ I) Pr (X ∈ (−∞.138 CHAPTER 11. To solve Pr({X ≤ x} ∩ A) = = fX. consider the PDF of X conditioned on the fact that X belongs to an interval I. it is Example 93.Y (ξ. we can write the conditional CDF of random variable X as FX|A (x) = Pr(X ≤ x|A) = Pr({X ≤ x} ∩ A) Pr(A) x ∈ R. ζ)dζdξ = 0 x 0 0 −λξ 0 λ2 e−λ(ξ+ζ) dζdξ 2 λe 0 1−e −λξ 1 − e−λx dξ = . we ﬁrst compute the probability of the event {X ≤ x} ∩ A. y) = λ2 e−λ(x+y) x.2 Conditional Probability Distributions Given non-vanishing event A. 2 By symmetry. For instance. fX|A (x) = dFX|A (x) x ∈ R. Let X and Y be continuous random variables with joint PDF fX. fX|A (x) = 2λe−λx 1 − e−λx x ≥ 0. as such. In particular. = Pr(X ∈ I) . dx we may use A = {Y ≥ X} as our condition. Then. One case of special interest is the situation where event A is deﬁned in terms of the random variable X itself.

= ≈ fY (y)∆y fY (y) Thus. The technical diﬃculty that surfaces when conditioning on {Y = y} stems from the fact that Pr(Y = y) = 0. CONDITIONAL PROBABILITY DISTRIBUTIONS Diﬀerentiating with respect to x. With great care. S A word of caution is in order. fX|Y (x|y)∆x represents the probability that X lies close to x. For small ∆x and ∆y . conditioned on Y = y.Y (x. Remember . this is partial information given by X ∈ I. y) fX. loosely speaking.1 Conditioning on Values Suppose that X and Y form a pair of random variables with joint PDF fX.2. y ≤ Y ≤ y + ∆y ) Pr(y ≤ Y ≤ y + ∆y ) fX.2. we can deﬁned the conditional PDF of X fX|Y (x|y) = continuous random variable. when X and Y are jointly continuous and given Y = y as fX.3) fY (y) Intuitively. it is possible to compute the probabilities of events associated with X conditioned on a speciﬁc value of random variable Y . Essentially. we obtain the conditional PDF of X. we can write Pr(x ≤ X ≤ x + ∆x |y ≤ Y ≤ y + ∆y ) = Pr(x ≤ X ≤ x + ∆x .11. Special attention must be given to this situation because the event {Y = y} has probability zero whenever Y is a for any y ∈ R such that fY (y) > 0. Still. Pr(X ∈ S|Y = y) = fX|Y (x|y)dx. ·). In words. fX|A (x) = fX (x) Pr(X ∈ I) 139 for any x ∈ I. and it is equal to zero otherwise. y)∆x ∆y ∆x .Y (·. accounting for the 11. equivalent to re-normalizing the PDF of X over interval I. given that Y is near y. the conditional PDF of X becomes a scaled version of fX (·) whenever x ∈ I. this deﬁnition is motivated by the following property.Y (x. y) . it is possible and desirable to deﬁne the conditional PDF of X. (11.Y (x. Using this deﬁnition.

we compute the marginal PDF of Y evaluated at Y = 0. fX|Y (x|0. Y ) are jointly continuous. care must be taken when dealing with the conditional PDF of the form (11.2. Let X = ω1 and Y = ω2 . ω2 ) is selected at random from the unit circle.5) f (0.3) to obtain the desired conditional PDF of X. Example 94. Example 95.5) = R fX. a conditional expectation is itself a random variable. π π We then apply deﬁnition (11. 0.Y (x. as it only provides valuable insight when the random variables (X.5) = = fX. We wish to compute the conditional PDF of X given that Y = 0. E[g(Y )|S] = R Note again that the function h(x) = E[Y |X = x] deﬁnes a random variable since the conditional expectation of Y may vary as a function of X.Y (x.5) π = √ fX.5.5) 3 Y √ 1 √ |y| ≤ 3 3 2 11. 0. √ fY (0.140 CHAPTER 11. Although we were able to circumvent this issue. The transmit signal X and the additive noise N are both . E[g(Y )|X = x] = R g(y)fY |X (y|x)dy g(y)fY |S (y)dy.3).5)dx = 3 2 √ 3 2 √ 1 3 dx = . MULTIPLE CONTINUOUS RANDOM VARIABLES that. An analog communication system transmits a random signal over a noisy channel. 0. The conditional expectation of a function g(Y ) is simply the integral of g(Y ) weighted by the proper conditional PDF. the notion of conditional probability is only deﬁned for nonvanishing conditions.Y (x. After all. in general.5.2 Conditional Expectation 0 otherwise. First. Consider the experiment where an outcome (ω1 .

2. let Y1 = g1 (X1 . diﬀerentiable.Y (x. x2 ) ∂x2 ∂g2 (x1 . X2 ) and Y2 = g2 (X1 . By inspection. ∂x1 ∂x2 ∂x2 ∂x1 . The signal received at the destination is equal to Y = X + N. which is E[X|Y = y] = y xfX. ·) and g2 (·. Under certain conditions. we examine the case where a simple expression for fY1 . In the present case.11. 2 R 11. y) = 2x2 − 2xy + y 2 1 exp − 2π 2 1 2π and the conditional distribution of X given Y becomes fX. It requires the skillful application of vector calculus.2. x2 ) ∂x1 ∂g2 (x1 . X2 ). x2 ) = 0. the MMSE estimator reduces to the conditional expectation of X given Y = y. CONDITIONAL PROBABILITY DISTRIBUTIONS 141 standard Gaussian random variables. the joint PDF of X and Y is fX. Y2 ) will also be continuous. x2 ) ∂x1 Consider the scenario where the functions g1 (·. x2 ) ∂x2 ∂g1 ∂g2 ∂g1 ∂g2 (x1 . ·) are real-valued functions. ·) are totally ∂g1 (x1 .Y (x. with Jacobian determinant J(x1 . Furthermore. Y2 ) can get convoluted. A widespread algorithm employed to perform the desired task is called the minimum mean square error (MMSE) estimator. we recognize that this conditional PDF is a Gaussian distribution with parameters m = y/2 and σ 2 = 1/2.Y2 (·. ·) and g2 (·. the pair of random variables (Y1 . For this problem. ·) exists. and they are independent. x2 ) = det = ∂g1 (x1 . We wish to estimate the value of X conditioned on Y = y. a task we forgo. Deriving the joint PDF of (Y1 . Nevertheless. y) fX|Y (x|y) = = fY (y) exp − 2x 1 √ 2 π 2 −2xy+y 2 2 4x2 − 4xy + y 2 1 = √ exp − 4 π exp − y4 2 .3 Derived Distributions Suppose X1 and X2 are jointly continuous random variables. x2 ) (x1 . where g1 (·. x2 ) (x1 . x2 ) − (x1 .Y (x|y)dx = .

4). It also hints at how this equation can be modiﬁed to accommodate non-unique mappings.4) pertains to the properties of Gaussian vectors. x2 )| where x1 = h1 (y1 . E[X1 ] E[X2 ] . x2 ) (11. . An important application of (11. x2 ) = y2 has a unique solution. 2 2π|Σ| 1 1 2 Assume that the random variables Y1 and Y2 are generated through the matrix equation Y= Y1 Y2 = AX + b. y2 ) and x2 = h2 (y1 . x2 ) = y1 g2 (x1 . y2 ). y2 ).X2 (x1 .142 CHAPTER 11. MULTIPLE CONTINUOUS RANDOM VARIABLES Also. and let X= Deﬁne the mean of X by m = E[X] = and its covariance by Σ = E (X − m) (X − m)T = E[(X2 − m2 )(X1 − m1 )] E [(X1 − m1 )2 ] E[(X1 − m1 )(X2 − m2 )] E [(X2 − m2 )2 ] . y2 ) = this equation and the derived distribution of (9. X1 X2 . y2 ) and x2 = h2 (y1 . the random variables (Y1 . Example 96. Suppose that X1 and X2 are jointly continuous random variables. Looking back at Chapter 9 oﬀers an idea of what proving this result entails.Y2 (y1 . assume that the system of equations g1 (x1 . Random variables X1 and X2 are said to be jointly Gaussian provided that their joint PDF is of the form fX1 .4) |J(x1 . x2 ) = 1 exp − (x − m)T Σ−1 (x − m) . We express this solution using x1 = h1 (y1 . Note the close resemblance between fY1 . Then.X2 (x1 . Y2 ) are jointly continuous with joint PDF fX1 .

FX. it suﬃces to compute its mean and covariance. 1 exp 2 2π|AΣA| 2 1 1 2 Looking at this equation. 11. the features of joint Gaussian random vectors underly many contemporary successes of engineering. Indeed. X = A−1 (Y − b) and the corresponding Jacobian determinant is J(x1 . a non-trivial aﬃne transformation of a two-dimensional Gaussian vector yields another Gaussian vector. It should come as no surprise that the mean of Y is E [Y] = Am + b and its covariance matrix is equal to E (Y − E [Y]) (Y − E [Y])T = AΣA. INDEPENDENCE 143 where A is a 2 × 2 invertible matrix and b is a constant vector.3 Independence Two random variables X and Y are mutually independent if their joint CDF is the product of their respective CDFs. Applying (11. as their joint PDF possesses the proper form. then Y remains a Gaussian random vector.4). In this case.Y (x. This admirable property generalizes to higher dimensions.11. y2 ) = 1 exp − (x − m)T Σ−1 (x − m) 2 2π|Σ| |A| 1 1 T − A−1 (y − b) − m Σ−1 A−1 (y − b) − m = 1 exp 2 2π|AΣA| 2 1 1 = − (y − b − Am)T (AΣA)−1 (y − b − Am) . In other words. and substitute the resulting parameters in the general form of the Gaussian PDF. Collectively.Y2 (y1 . Furthermore. x2 ) = det a11 a12 a21 a22 = |A|.3. y) = FX (x)FY (y) . we gather that the joint PDF of (Y1 . if Y = AX+b where A is an n×n invertible matrix and X is a Gaussian random vector. Y2 ) is expressed as fY1 . we conclude that random variables Y1 and Y2 are also jointly Gaussian. to obtain the derived distribution of the latter vector.

dent random variables. Furthermore. . ζ)dζdξ y = −∞ 1[0.Y (x. Additionally.Y (x. y) = −∞ x −∞ 1[0. y)dydx S T fX (x)dx T fY (y)dy = Pr(X ∈ S) Pr(Y ∈ T ). if x.Y (x. Let X = ω1 .Y dFX dFY (x. ·) is the product of their marginal PDFs. we gather from (11. For jointly continuous random variables.Y (x.1]2 (ξ. y) = (x) (y) = fX (x)fY (y) ∂x∂y dx dy for x.Y (x. y ∈ R.1] (ζ)dζ = FX (x)FY (y). but that X and U are not independent. Y = ω2 and U = ω1 +ω2 . ω2 ) is selected at random from the unit square. y) = for x. MULTIPLE CONTINUOUS RANDOM VARIABLES sarily implies that the joint PDF fX. we get x y FX. then the two events {X ∈ S} and {Y ∈ T } are also Pr(X ∈ S. We wish to show that X and Y are independent. we x y FX. y) = = fY (y) fX (x) fX (x) of Y given X = x is equal to the marginal PDF of Y whenever X and Y are provided of course that fX (x) = 0. Thus.144 CHAPTER 11. y ∈ R. y ∈ R2 . fY |X (y|x) = fX (x)fY (y) fX. Example 97. More generally. y) = 0 0 dζdξ = xy = FX (x)FY (y). Y ∈ T ) = = S fX. if X and Y are indepenindependent. y ∈ [0. we gather that X and Y are independent.1] (ξ)dξ −∞ 1[0. 1]. have We begin by computing the joint CDF of X and Y .3) that the conditional PDF independent. For x. this deﬁnition neces- fX. Consider a random experiment where an outcome (ω1 . ∂ 2 FX.Y (·.

ζ)dζdξ −∞ u−ξ −∞ ∞ −∞ ∞ −∞ fY (ζ)dζfX (ξ)dξ −∞ FY (u − ξ)fX (ξ)dξ. sums of independent random variables are frequently encountered in engineering. fU (u) = (fX ∗ fY )(u). FX.1 Sums of Continuous Random Variables As mentioned before. 8 Clearly.5 and FX (0. respectively. FU (u) = Pr(U ≤ u) = Pr(X + Y ≤ u) = = = ∞ u−ξ fX. 1). we ﬁrst consider the CDF of U . We therefore turn to the question of determining the distribution of a sum of independent continuous random variables in terms of the PDFs of its constituents.U (0.5) = 0. To show that this is indeed the case.5)FU (1).Y (ξ. we show that X and U are not independent. The convolution of fX (·) and fY (·) is the function deﬁned by (fX ∗ fY )(u) = = ∞ −∞ ∞ −∞ fX (ξ)fY (u − ξ)dξ fX (u − ζ)fY (ζ)dζ.5. the distribution of their sum U = X + Y can be obtained by using the convolution operator. Note that FU (1) = 0.3. .3. random variables X and U are not independent. If X and Y are independent random variables.5. Let fX (·) and fY (·) be the PDFs of X and Y . INDEPENDENCE 145 Next. 1) = 0 1 2 1−ξ dζdξ = 0 0 1 2 (1 − ξ)dξ = 3 = FX (0.5.11. Consider the joint CDF of X and U evaluated at (0. The PDF of the sum U = X + Y is the convolution of the individual densities fX (·) and fY (·). 11.

Taking the derivative of F_U(u) with respect to u, we obtain

\frac{d}{du} F_U(u) = \frac{d}{du}\int_{-\infty}^{\infty} F_Y(u - \xi) f_X(\xi)\,d\xi = \int_{-\infty}^{\infty} \frac{d}{du} F_Y(u - \xi) f_X(\xi)\,d\xi = \int_{-\infty}^{\infty} f_Y(u - \xi) f_X(\xi)\,d\xi.

Notice the judicious use of the fundamental theorem of calculus. This shows that f_U(u) = (f_X * f_Y)(u).

Example 98 (Sum of Uniform Random Variables). Suppose that two numbers are independently selected from the interval [0, 1], each with a uniform distribution. We wish to compute the PDF of their sum. Let X and Y be random variables describing the two choices, and let U = X + Y represent their sum. The PDFs of X and Y are

f_X(\xi) = f_Y(\xi) = 1 for 0 \le \xi \le 1, and 0 otherwise.

The PDF of their sum is therefore equal to

f_U(u) = \int_{-\infty}^{\infty} f_X(u - \xi) f_Y(\xi)\,d\xi.

Since f_Y(y) = 1 when 0 ≤ y ≤ 1 and zero otherwise, this integral becomes

f_U(u) = \int_0^1 f_X(u - \xi)\,d\xi = \int_0^1 \mathbf{1}_{[0,1]}(u - \xi)\,d\xi.

The integrand above is zero unless 0 ≤ u − ξ ≤ 1 (i.e., unless u − 1 ≤ ξ ≤ u). Thus, if 0 ≤ u ≤ 1, we get

f_U(u) = \int_0^u d\xi = u,

while, if 1 < u ≤ 2, we obtain

f_U(u) = \int_{u-1}^1 d\xi = 2 - u.

If u < 0 or u > 2, the value of the PDF becomes zero. Collecting these results, we can write the PDF of U as

f_U(u) = u for 0 \le u \le 1, 2 - u for 1 < u \le 2, and 0 otherwise.
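A quick numerical confirmation of Example 98 (our addition, not part of the notes): simulate the sum of two independent uniform variables and compare the histogram with the triangular density derived above.

```python
import numpy as np

rng = np.random.default_rng(5)
u = rng.uniform(size=(2, 500_000)).sum(axis=0)   # U = X + Y

def f_U(u):
    # triangular density from Example 98, valid on [0, 2]
    return np.where(u <= 1.0, u, 2.0 - u)

edges = np.linspace(0.0, 2.0, 41)
hist, _ = np.histogram(u, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_U(centers))))       # small
```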

Example 99 (Sum of Exponential Random Variables). Two numbers are selected independently from the positive real numbers, each according to an exponential distribution with parameter λ. We wish to find the PDF of their sum. Let X and Y represent these two numbers, and denote this sum by U = X + Y. The random variables X and Y have PDFs

f_X(\xi) = f_Y(\xi) = \lambda e^{-\lambda\xi} for \xi \ge 0, and 0 otherwise.

When u ≥ 0, we can use the convolution formula and write

f_U(u) = \int_{-\infty}^{\infty} f_X(u - \xi) f_Y(\xi)\,d\xi = \int_0^u \lambda e^{-\lambda(u - \xi)} \lambda e^{-\lambda\xi}\,d\xi = \int_0^u \lambda^2 e^{-\lambda u}\,d\xi = \lambda^2 u e^{-\lambda u}.

On the other hand, if u < 0 then the value of the PDF becomes zero. The PDF of U is therefore given by

f_U(u) = \lambda^2 u e^{-\lambda u} for u \ge 0, and 0 otherwise.

This is an Erlang distribution with parameter m = 2 and λ > 0.

Example 100 (Sum of Gaussian Random Variables). It is an interesting and important fact that the sum of two independent Gaussian random variables is itself a Gaussian random variable. Suppose X is Gaussian with mean m_1 and variance σ_1², and similarly Y is Gaussian with mean m_2 and variance σ_2²; then U = X + Y has a Gaussian density with mean m_1 + m_2 and variance σ_1² + σ_2². We will show this property in the special case where both random variables are standard normal.

We denote the mean and variance of X by m1 and σ1 . The moment generating function of U is given by MU (s) = E esU = E es(X+Y ) = E esX esY = E esX E esY = MX (s)MY (s). Let X and Y be independent random variables. let X and Y denote the two independent Gaussian 2 variables.148 CHAPTER 11. we obtain u2 1 fU (u) = √ e− 4 . Likewise. we revisit the problem of adding independent Gaussian variables using moment generating functions. We wish to show that 2 2 the sum U = X + Y is Gaussian with parameters m1 + m2 and σ1 + σ2 . Example 101 (Sum of Gaussian Random Variables). Again. Thus. . That is. the PDF of U = X + Y is given by fU (u) = (fX ∗ fY )(u) = 1 2π ∞ −∞ e− (u−ξ)2 2 e− 2 dξ ξ2 1 − u2 ∞ −(ξ− u )2 2 = e dξ e 4 2π −∞ ∞ u 2 u2 1 1 √ e−(ξ− 2 ) dξ . = √ e− 4 2 π π −∞ The expression within the parentheses is equal to one since the integrant is a Gaussian PDF with m = u/2 and σ 2 = 1/2. The general case can be attained in a similar manner. 2π Then. Consider the random variable U = X + Y . but the computations are somewhat tedious. MULTIPLE CONTINUOUS RANDOM VARIABLES are standard normal random variable. 2 we represent the mean and variance of Y by m2 and σ2 . Suppose X and Y are two independent Gaussian random variables with PDFs ξ2 1 fX (ξ) = fY (ξ) = √ e− 2 . In this example. the moment generating function of the sum of two independent random variables is the product of the individual moment generating functions. 4π which veriﬁes that U is indeed Gaussian.

S. Athena Scientiﬁc. G. L. Ross. 2004: Chapter 5.. Probability and Random Processes for Electrical and Computer Engineers.. A First Course in Probability. 2. 2002: Section 3. 3. N. Pearson Prentice Hall.. D.. and Tsitsiklis. 4. D.3.1–6. Sections 6.3. as anticipated.5.. S. Cambridge. .11. A. P. which demonstrates that U is a Gaussian random variable with mean m1 + m2 2 2 and variance σ1 + σ2 . Section 7. and Childers. Further Reading 1.. Miller. Introduction to Probability.5. Probability and Random Processes with Applications to Signal Processing and Communications. 2006: Chapters 7–9. Gubner. 7th edition. J. INDEPENDENCE The moment generating functions of X and Y are MX (s) = em1 s+ MY (s) = em2 s+ 2 σ1 s2 2 2 σ2 s2 2 149 The moment generating function of U = X + Y is therefore equal to 2 2 (σ1 + σ2 )s2 MU (s) = MX (s)MY (s) = exp (m1 + m2 )s + 2 . 2006: Chapter 6. Bertsekas. J.

the CDFs of the random variables in the sequence may appear to approach a precise function. Alternatively.1 Types of Convergence The premise on which most probabilistic convergence results lie is a sequence of random variables X1 . Statements that can be made about a sequence of random variables range from simple assertions to more intricate claims. Concentration behavior facilitates true economies of scale. From an engineering viewpoint. Sequences and Limit Theorems Some of the most astonishing results in probability are related to the properties of sequences of random variables and the convergence of empirical distributions. X2 . Being able to recognize speciﬁc patterns within the sequence is key in establishing converge 150 . Recall that a random variable is a function of the outcome of a random experiment. 12. The above statement stipulate that all the random variables listed above are functions of the outcome of a same experiment. . these results are important as they enable the eﬃcient design of complex systems with very small probabilities of failure.Chapter 12 Convergence. . the sequence may appear to move toward a deterministic quantity or to behave increasingly akin to a certain function. For instance. . all of which are deﬁned on the same probability space. and a limiting random variable X.

. 2 3 √ From our current discussion. Example 102. (12. Gaussian. Example 103. = n n2 n2 n Sn E[Sn ] E[X1 ] + · · · + E[Xn ] = = =m n n n and consider the sequence It appears that the PDF of Sn /n concentrates around m as n approaches inﬁnity. TYPES OF CONVERGENCE 151 results. the random variable (Sn − nm)/ n has a Gaussian remains invariant throughout the sequence. That is. each with mean m and variance σ 2 . Below. X2 . = n n n √ No matter how large n is. . X2 . is a sequence of independent Gaussian random variables. Sn /n possesses a Gaussian distribution with mean E and variance Var Var[X1 ] + · · · + Var[Xn ] σ2 Var[Sn ] Sn = = . (12..1) S2 S3 . . The various statement one can make about the sequence X1 .2) seems to become increasingly predictable. This time... and let Sn be deﬁned according to (12. √ . Again. the sequence in (12. we know that (Sn − nm)/ n is a Gaussian ranS1 − m. . be the sequence described above. We can compute its mean and variance as follows E Sn − nm E[Sn − nm] √ =0 = n n Sn − mn Var[Sn ] Var[Sn − mn] √ Var = = σ2. . . . let X1 .12. . Intriguingly. Deﬁne the partial sums n Sn = i=1 Xi . Furthermore. Thus.. Suppose that X1 . lead to the diﬀerent types of convergence encountered in probability.. we know that sums of jointly Gaussian random variables are also Gaussian.1). distribution with mean zero and variance σ 2 .1. we discuss brieﬂy three types of convergence. we wish to characterize the properties of S2 − 2m S3 − 3m √ . .2) 2 3 We know that aﬃne transformations of Gaussian random variables remain S1 . . dom variables. . X2 . the distributions .

X2 .1. CONVERGENCE. If we apply the Chebyshev inequality to Xn − X. Proof. That is. Formally. . The sequence X1 . . . . for δ > 0. 12. Thus. we conclude that this sequence also converges to X in probability. Proposition 9. SEQUENCES AND LIMIT THEOREMS 12. Suppose that ǫ > 0 is ﬁxed. of random variables converges in mean square to X if n→∞ lim E |Xn − X|2 = 0. a sequence X1 . . also converges to X in probability. n→∞ In Example 102. . . the second moment of the diﬀerence between Xn and X vanishes as n goes to inﬁnity. Then. the sequence {Sn /n} converges in probability to m. . X2 . of random variables converges in probability to X if for every ǫ > 0. X2 .1. Convergence in the mean square sense implies convergence in probability. be a sequence of random variables that converge in mean square to X. X2 . Let X1 . . converges in mean square to X. there exists an N such that n ≥ N implies n→∞ lim E |Xn − X|2 < δ. . . we get Pr (|Xn − X| ≥ ǫ) ≤ E |Xn − X|2 δ < 2 2 ǫ ǫ Since δ can be made arbitrarily small. the sequence X1 .1 Convergence in Probability The basic concept behind the deﬁnition of convergence in probability is that the probability that a random variable deviates from its typical behavior becomes less likely as the sequence progresses. . lim Pr (|Xn − X| ≥ ǫ) = 0.152CHAPTER 12.2 Mean Square Convergence We say that a sequence X1 . . . . . X2 .

X2 . Let Xn be a continuous random variable that is uniformly distributed over [0. X2 . .12. deﬁne the empirical sum Sn = i=1 n Xn . Then. the sequence X1 . 1/n]. Furthermore. This type of convergence is also called weak convergence. we have n→∞ Hence. X = 0 and 1 x ≥ 0 FX (x) = 0 x < 0. converges in distribution to a constant. . . of random variables is said to converge in distribution to a random variable X if n→∞ lim FXn (x) = FX (x) at every point x ∈ R where FX (·) is continuous. there are many versions of this law. . i=1 converges in probability to the mean E[X]. and for every x > 0. . converges in distribution to 0.3 Convergence in Distribution A sequence X1 . . is a sequence of independent and identically distributed random variable. . lim FXn (x) = 1. THE LAW OF LARGE NUMBERS 153 12. In this example. . . The law of large number asserts that the sequence of empirical averages. X2 . we have FXn (x) = 0. X2 . 12. the sequence X1 . . each with ﬁnite second moment. Although. . Furthermore. . for every x < 0. we only state its simplest form below. 1 Sn = n n n Xn .2. Suppose that X1 . . for n ≥ 1.2 The Law of Large Numbers The law of large numbers focuses on the convergence of empirical averages.1. Example 104.

Theorem 4 (Law of Large Numbers). Let X_1, X_2, ... be independent and identically distributed random variables with mean E[X] and finite variance. For every ε > 0, we have

\lim_{n \to \infty} \Pr\left(\left|\frac{S_n}{n} - E[X]\right| \ge \epsilon\right) = \lim_{n \to \infty} \Pr\left(\left|\frac{X_1 + \cdots + X_n}{n} - E[X]\right| \ge \epsilon\right) = 0.

Proof. Taking the expectation of the empirical average, we obtain

E\left[\frac{S_n}{n}\right] = \frac{E[S_n]}{n} = \frac{E[X_1] + \cdots + E[X_n]}{n} = E[X].

Using independence, we also have

\mathrm{Var}\left[\frac{S_n}{n}\right] = \frac{\mathrm{Var}[S_n]}{n^2} = \frac{\mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]}{n^2} = \frac{\mathrm{Var}[X]}{n}.

As n goes to infinity, the variance of the empirical average S_n/n vanishes. To get convergence in probability, we apply the Chebyshev inequality, as we did in Proposition 9,

\Pr\left(\left|\frac{S_n}{n} - E[X]\right| \ge \epsilon\right) \le \frac{\mathrm{Var}[S_n/n]}{\epsilon^2} = \frac{\mathrm{Var}[X]}{n\epsilon^2},

which clearly goes to zero as n approaches infinity. In fact, we showed that the sequence {S_n/n} of empirical averages converges in mean square to E[X], since

\lim_{n \to \infty} E\left[\left|\frac{S_n}{n} - E[X]\right|^2\right] = \lim_{n \to \infty} \mathrm{Var}\left[\frac{S_n}{n}\right] = 0.

Example 105. Suppose that a die is thrown repetitively. We are interested in the average number of times a six shows up on the top face, as the number of throws becomes very large. Let D_n be a random variable that represents the number on the nth roll, and define the random variable X_n = 1_{\{D_n = 6\}}. Then X_n is a Bernoulli random variable with parameter p = 1/6, and the empirical average S_n/n is equal to the number of times a six is observed divided by the total number of rolls. By the law of large numbers, we have

\lim_{n \to \infty} \Pr\left(\left|\frac{S_n}{n} - \frac{1}{6}\right| \ge \epsilon\right) = 0.

That is, as the number of rolls increases, the average number of times a six is observed converges in probability to the probability of getting a six, 1/6.
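The dice example translates into a few lines of simulation, shown below as our own illustration: the running fraction of sixes settles near 1/6 as the number of rolls grows.

```python
import numpy as np

rng = np.random.default_rng(6)
rolls = rng.integers(1, 7, size=100_000)          # D_1, D_2, ...
indicators = (rolls == 6).astype(float)           # X_n = 1{D_n = 6}
running_average = np.cumsum(indicators) / np.arange(1, len(rolls) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_average[n - 1])              # approaches 1/6 ~ 0.1667
```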

12.2.1 Heavy-Tailed Distributions*

There are situations where the law of large numbers does not apply. For example, when dealing with heavy-tailed distributions, one needs to be very careful. In this section, we study Cauchy random variables in greater detail. First, we show that the sum of two independent Cauchy random variables is itself a Cauchy random variable.

Let X_1 and X_2 be two independent Cauchy random variables with parameters γ_1 and γ_2, respectively. We wish to compute the PDF of S = X_1 + X_2. For continuous random variables, the PDF of S is given by the convolution of f_{X_1}(·) and f_{X_2}(·). Thus, we can write

f_S(x) = \int_{-\infty}^{\infty} f_{X_1}(\xi) f_{X_2}(x - \xi)\,d\xi = \int_{-\infty}^{\infty} \frac{\gamma_1}{\pi(\gamma_1^2 + \xi^2)} \cdot \frac{\gamma_2}{\pi(\gamma_2^2 + (x - \xi)^2)}\,d\xi.

This integral is somewhat difficult to solve, so we resort to complex analysis and contour integration to get a solution. Let C be a contour that goes along the real line from −a to a, and then counterclockwise along a semicircle centered at zero. For a large enough, Cauchy's residue theorem requires that

\oint_C f_{X_1}(z) f_{X_2}(x - z)\,dz = \oint_C g(z)\,dz = 2\pi i\left(\mathrm{Res}(g, i\gamma_1) + \mathrm{Res}(g, x + i\gamma_2)\right),    (12.3)

where we have implicitly defined the function

g(z) = \frac{\gamma_1\gamma_2}{\pi^2 (z - i\gamma_1)(z + i\gamma_1)(z - x + i\gamma_2)(z - x - i\gamma_2)}.

Only two poles are contained within the enclosed region. Because they are simple poles, their residues are given by

\mathrm{Res}(g, i\gamma_1) = \lim_{z \to i\gamma_1} (z - i\gamma_1)\, g(z) = \frac{\gamma_2}{2i\pi^2\left((x - i\gamma_1)^2 + \gamma_2^2\right)}

\mathrm{Res}(g, x + i\gamma_2) = \lim_{z \to x + i\gamma_2} (z - x - i\gamma_2)\, g(z) = \frac{\gamma_1}{2i\pi^2\left((x + i\gamma_2)^2 + \gamma_1^2\right)}.

. each with parameter γ.156CHAPTER 12. We then conclude fS (x) = (γ1 + γ2 ) . it captures . = 2 γ 2 + (nx)2 ) 2 + x2 ) π (n π (γ where we have used the methodology developed in Chapter 9. In some sense. iγ1 ) + Res(g. CONVERGENCE. Clearly. π ((γ1 + γ2 )2 + x2 ) The sum of two independent Cauchy random variables with parameters γ1 and γ is itself a Cauchy random variable with parameter γ1 + γ2 . the empirical average of a sequence of independent Cauchy random variables. X2 . consider the empirical sum n Sn = i=1 Xn . remains a Cauchy random variable with the same parameter. SEQUENCES AND LIMIT THEOREMS It follows that 2πi (Res(g. a condition that is clearly violated by the Cauchy distribution. the law of large numbers does not apply to this scenario. . Also.3) vanishes as a → ∞. Amazingly.3 The Central Limit Theorem The central limit theorem is a remarkable result in probability. . it is possible to show that Sn is a Cauchy random variable with parameter nγ. This explains why convergence does not take place in this situation. Furthermore. Using mathematical induction and the aforementioned fact. Note that our version of the law of large numbers requires random variables to have ﬁnite second moments. for x ∈ R. 12. x + iγ2 )) γ1 γ2 + = 2 2 + γ2) π ((x − iγ1 ) π ((x + iγ2 )2 + γ1 ) 2 (γ1 + γ2 ) = π ((γ1 + γ2 )2 + x2 ) that the PDF of S is equal to The contribution of the arc in (12. Let X1 . the PDF of the empirical average Sn /n is given by fSn (nx) 1 n γ n2 γ = . each with parameter γ. form a sequence of independent Cauchy random variables. it partly explains the prevalence of Gaussian random variables.

. for any x ∈ R. Recall that the expectation of 0. MX (s) ds2 Collecting these results and evaluating the functions at zero. we take the limit of nΛX sn−1/2 as n → ∞. The distribution of Sn − nE[X] √ σ n converges in distribution to a standard normal random variable as n → ∞. Consider the log-moment generating function of X. √ To explore the asymptotic behavior of the sequence {Sn / n}. ΛX (s) = log MX (s) = log E esX . The ﬁrst two derivatives of ΛX (s) are equal to dΛX 1 dMX (s) = (s) ds MX (s) ds 1 d2 ΛX (s) = 2 ds MX (s) dΛX (0) ds dMX (s) ds 2 + 1 d2 MX (s). independent random components. n→∞ lim Pr Sn − nE[X] √ ≤x σ n x = −∞ ξ2 1 √ e− 2 dξ. notice the double use of L’Hˆpital’s o . we assume that E[X] = 0 and σ 2 = 1. we study the √ log-moment generating function of Sn / n. Next. and a product of independent random variables is equal to the product of their individual expectations. Let X1 . Initially. 2π Proof. we get log E esSn / √ n √ √ d2 ΛX (0) ds2 = log E esX1 / = log MX n · · · esXn / n s √ n · · · MX s √ n = nΛX sn−1/2 . . each with mean E[X] and variance σ 2 . THE CENTRAL LIMIT THEOREM the behavior of large sums of small. 157 Theorem 5 (Central Limit Theorem). we only study the situation where the moment generating function of X exists and is ﬁnite.12. X2 . Furthermore. . Using this property.3. be independent and identically distributed random variables. we get ΛX (0) = = E [X 2 ] = 1. In doing so. In particular. = E[X] = 0.

rule. Next, we take the limit of nΛ_X(sn^{-1/2}) as n → ∞,

\lim_{n \to \infty} n\Lambda_X\!\left(sn^{-1/2}\right) = \lim_{n \to \infty} \frac{\Lambda_X\!\left(sn^{-1/2}\right)}{n^{-1}} = \lim_{n \to \infty} \frac{s}{2}\cdot\frac{\frac{d\Lambda_X}{ds}\!\left(sn^{-1/2}\right)}{n^{-1/2}} = \lim_{n \to \infty} \frac{s^2}{2}\,\frac{d^2\Lambda_X}{ds^2}\!\left(sn^{-1/2}\right) = \frac{s^2}{2},

where each step after the first follows from l'Hôpital's rule. That is, the moment generating function of S_n/\sqrt{n} converges point-wise to e^{s^2/2} as n → ∞. In fact, this implies that

\Pr\left(\frac{S_n}{\sqrt{n}} \le x\right) \to \int_{-\infty}^x \frac{1}{\sqrt{2\pi}} e^{-\frac{\xi^2}{2}}\,d\xi

as n → ∞; that is, S_n/√n converges in distribution to a standard normal random variable. In the last step of the proof, we stated that point-wise convergence of the moment generating functions implies convergence in distribution. This is a sophisticated result that we quote without proof. The more general case where E[X] and Var[X] are arbitrary constants can be established in an analogous manner by proper scaling of the random variables X_1, X_2, ...

12.3.1 Normal Approximation

The central limit theorem can be employed to approximate the CDF of large sums. More specifically, let S_n = X_1 + · · · + X_n where X_1, X_2, ... are independent and identically distributed random variables with mean E[X] and variance σ². When n is large, the CDF of S_n can be estimated by approximating

\frac{S_n - nE[X]}{\sigma\sqrt{n}}

as a standard normal random variable. That is,

F_{S_n}(x) = \Pr(S_n \le x) = \Pr\left(\frac{S_n - nE[X]}{\sigma\sqrt{n}} \le \frac{x - nE[X]}{\sigma\sqrt{n}}\right) \approx \Phi\left(\frac{x - nE[X]}{\sigma\sqrt{n}}\right),

where Φ(·) is the CDF of a standard normal random variable.
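To see the approximation at work (our sketch, not from the notes), take sums of n independent exponential random variables, standardize them, and compare their empirical CDF at a few points with Φ(·).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, trials = 50, 200_000
samples = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)

# standardize: an exponential with scale 1 has mean 1 and variance 1
z = (samples - n * 1.0) / np.sqrt(n * 1.0)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, np.mean(z <= x), norm.cdf(x))   # empirical CDF vs Phi(x)
```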

Further Reading

1. Ross, S., A First Course in Probability, Pearson Prentice Hall, 7th edition, 2006: Chapter 8.
2. Bertsekas, D. P., and Tsitsiklis, J. N., Introduction to Probability, Athena Scientific, 2002: Sections 7.2–7.4.
3. Miller, S. L., and Childers, D. G., Probability and Random Processes with Applications to Signal Processing and Communications, 2004: Chapter 7.

18 Cartesian product. 29. 121. 61. 85 Discrete random variable. 9 Combination. 153 Convergence in probability. 30 Conditional probability. 123 Expected value. 101 Factorial. 78 Inclusion-exclusion principle. 48 Disjoint sets. 2 Conditional expectation. 100 Event. 62 160 . 26 Bernoulli random variable. 108 Central limit theorem. 1 Empty set. 6 Cauchy random variable. 27. 13 Cardinality.Gaussian random variable. 129 Chernoﬀ bound. 110 Gamma random variable. 55 tion. 152 Hypergeometric random variable. 84. 2 Error function.Index z-transform. 8 Exponential random variable. 6 Discrete convolution. 9 Conditional probability density func. 58. 3 Distinct elements. 98 Geometric random variable. 50 Binomial coeﬃcient. 40 Binomial random variable. 157 Central moments. 139 Conditional probability mass function. 34 Complement. 8 Expectation. 22 Element. 51 Boole inequality. 131 Collectively exhaustive. 69 Chebyshev inequality. 81 Conditional independence. 62. 77 Continuous random variable. 145 Countable. 92 De Morgan’s laws. 12 Independence. 143 Indicator function. 94 Convergence in distribution. 70 Bayes’ rule. 39 Combinatorics. 18 Cumulative distribution function. 37 Functions of random variables. 104 Convolution. 60 Experiment.

36 . 121 Ordinary generating function. 74 Probability-generating function. 24. 73 Laplace random variable. 42 Permutation. 111 Multinomial coeﬃcient. 70. 56. 64. 154 Legendre transformation. 98 Ordered pair. 47. 66. 128 Mean. 112 Subset. 95 Probability law. 2 Total probability theorem. 121 Mean square convergence. 100 Random variable. 107 Law of large numbers. 125 Weak convergence. 66 Stirling’s formula. 60. 4. 74 Markov inequality. 152 Measure theory. 153 Outcome. 136 Joint probability mass function. 8 Partition. 3 Union bound. 47 Variance. 49 161 Probability density function (PDF). 97 Uniform random variable (discrete). 8 Set. 6 Preimage. 11 Probability mass function (PMF). 125 Moments. 24 Uniform random variable (continuous). 68 Monotonic function. 19. 18 Negative binomial random variable. 41. 9 Natural numbers. 38 Strictly monotonic function. 106 Sample space. 2. 70 Proper subset. 132 Joint cumulative distribution function. 132 Marginal probability mass function. 92 Rayleigh random variable. 1 Standard deviation. 89 Normal random variable. 13 Value of a random variable. 41 Mutually exclusive. 95 Moment generating function.INDEX Intersection. 2 Q-function. 57 Union. 103 Mixed random variable. 48. 52 Power set. 36 Poisson random variable. 3 Jensen inequality. 19 Memoryless property. 135 Joint probability density function.
