
University of Nottingham

MATH2010

Probability Models and Methods

Autumn Semester Notes 2022/23



Contents

0 Background
  0.1 Random variables and distributions
  0.2 Standard continuous RVs (distributions)
  0.3 Bivariate distributions
  0.4 Problems

Part I: Distributions of more than one random variable

1 Joint distributions
  1.1 Joint distributions for more than two RVs
  1.2 Problems

2 Independent Random Variables
  2.1 Sums of independent RVs
  2.2 Functions of independent RVs
  2.3 Mutual independence of a collection of RVs
  2.4 Problems

3 Expectation, Covariance and Correlation
  3.1 Expectation and some properties
  3.2 Covariance and correlation
  3.3 Moments and Higher dimensions
  3.4 Problems

4 Conditional distributions and expectations
  4.1 Conditional distributions
  4.2 Conditional distributions and independence
  4.3 Conditional expectation
  4.4 Conditional variance
  4.5 Problems

Part II: Transformations, generating functions and inequalities

5 Transformations
  5.1 1-dimensional transformations
  5.2 2-dimensional transformations
  5.3 Extension to more than 2 dimensions
  5.4 Problems

6 Generating Functions
  6.1 Moment generating functions
  6.2 Properties of MGFs
  6.3 Joint moment generating functions
  6.4 Sums of independent random variables
  6.5 Probability generating functions
  6.6 Problems

7 Markov's and Chebychev's Inequalities
  7.1 Problems

8 Multivariate normal distributions
  8.1 Characterisation
  8.2 Exploration
  8.3 Properties
  8.4 The degenerate case
  8.5 Some additional properties
  8.6 Problems

9 Two limit theorems
  9.1 Problems


Preliminaries
These notes & support for your learning

These notes are the basis for the teaching and learning in the autumn semester of
MATH2010. You will be expected to read ahead through some set sections of the notes
in advance of most lectures. Then we can use the lecture time to focus on working
through some examples and discussing the more nuanced aspects of the material we
encounter.
You will need to bring a copy of these notes and some blank paper (or digital
equivalents) to every lecture.
See Moodle for more details of all aspects of the resources and support available
to you for this module, including problem classes, computing classes, drop-in classes
and the lecturer's office hours.

Problems

Working through problems is a key part of studying mathematics. You should begin
attempting problems as soon as we have worked through the relevant parts of the
notes. The suggested texts are an excellent source of further problems (and alternative
explanations and presentations of the lecture material).

0 Background

0.1 Random variables and distributions

A random variable (RV) X is a function of the outcome of a random experiment, or


a mapping from the set of outcomes Ω to R: X : Ω → R.
The (cumulative) distribution function (CDF) of X, FX : R → [0, 1], is
defined by
FX (x) = P (X ⩽ x) = P ({ω ∈ Ω : X(ω) ⩽ x}).

Hence, in particular,

P (X > x) = 1 − FX (x) and

P (x1 < X ⩽ x2 ) = FX (x2 ) − FX (x1 ).

Associated with a discrete random variable X is its probability mass function


(PMF) pX defined by

pX (xi ) = P (X = xi ) = P ({ω : X(ω) = xi })

and associated with a continuous random variable X is its probability density


function (PDF) fX satisfying

P(X ∈ E) = ∫_E fX(x) dx for all E ⊆ R.

This gives

FX(x) = Σ_{xi ⩽ x} pX(xi) if X is discrete, and FX(x) = ∫_{−∞}^{x} fX(u) du if X is continuous. (DV!)

If X is a continuous RV then

1. fX(x) = (d/dx) FX(x),

2. P(a < X ⩽ b) = ∫_{a}^{b} fX(x) dx,

3. P(X ∈ A ∪ B) = ∫_A fX(x) dx + ∫_B fX(x) dx if A ∩ B = ∅, (key when fX is piecewise)

4. P(X = a) = ∫_{a}^{a} fX(x) dx = 0.

This last point implies that P(X ∈ (a, b)) = P(X ∈ [a, b)) = . . . for X continuous; cf. item 2.

The expected value or expectation of a random variable X is given by

E[X] = Σ_x x · pX(x) if X is discrete, and E[X] = ∫_{−∞}^{∞} x fX(x) dx if X is continuous.

It follows that the expected value of a function g(·) of a random variable X is given by

E[g(X)] = Σ_x g(x) · pX(x) if X is discrete, and E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx if X is continuous.

An important expectation is the variance of the random variable X, defined by
var(X) = E[(X − E[X])²]. A useful equivalent form is var(X) = E[X²] − (E[X])².

0.2 Standard continuous RVs (distributions)

• Uniform distribution, U(a, b) (with a < b ∈ R):

  fX(x) = 1/(b − a) if x ∈ (a, b), and fX(x) = 0 otherwise;

• Normal distribution, N(µ, σ²) (with µ ∈ R and σ > 0):

  fX(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, x ∈ R;

• Exponential distribution with (rate) parameter λ > 0:

  fX(x) = λ e^{−λx} if x > 0, and fX(x) = 0 otherwise;

• Gamma distribution Gamma(α, β) (or Γ(α, β), or G(α, β), with α, β > 0):

  fX(x) = (1/Γ(α)) β e^{−βx} (βx)^{α−1} if x ⩾ 0, and fX(x) = 0 if x < 0,

  where Γ(s) = ∫_0^∞ e^{−x} x^{s−1} dx;

• Beta distribution Beta(α, β) (with α, β > 0):

  fX(x) = x^{α−1} (1 − x)^{β−1} / B(α, β) if 0 ⩽ x ⩽ 1, and fX(x) = 0 otherwise,

  where B(α, β) = ∫_0^1 t^{α−1} (1 − t)^{β−1} dt = Γ(α)Γ(β)/Γ(α + β).

I will assume that you are familiar with the following distributions (their PDF/PMF,
mean, variance): normal, uniform, exponential, Poisson, geometric, binomial.

0.3 Bivariate distributions

Similar ideas to those in Section 0.1 apply to pairs of random variables: bivariate
distributions. In MATH1001 you saw, in the context of discrete distributions,

• Joint and marginal distributions.

• Expectation (mean, variance, covariance, correlation).

• Independence.

• Conditional distributions, conditional expectation, the tower property.

The first part of this module is about looking at these concepts in more detail, with
the focus on continuous distributions.

(The second part includes sections on transformations, generating functions and


limit theorems which extend topics you studied in MATH1001, as well as some addi-
tional topics, namely inequalities and the multivariate normal distribution.)

0.4 Problems

1. A random variable X has PDF

   fX(x) = αx², x ∈ [0, 2], and fX(x) = 0 otherwise,

where α ∈ R is a constant.

(a) Determine the value of α.


(b) Find P (X < 1).
(c) Find the expected value and variance of X.
(d) Determine the CDF of X.

2. Suppose that X ∼ U(a, b). Verify that E[X] = (a + b)/2 and var(X) = (b − a)²/12.

3. Suppose that Y ∼ Exp(λ). Verify that E[Y] = λ⁻¹ and var(Y) = λ⁻².

4. Suppose that X1 and X2 are independent Bernoulli random variables with parameter 1/2
   (i.e. P(Xi = 0) = P(Xi = 1) = 1/2 for both i = 1, 2) and define
   Y1 = X1 + X2 and Y2 = |X1 − X2|.

(a) Construct a table which gives the joint PMF of Y1 and Y2 .


(b) Show that Y1 and Y2 are uncorrelated.
(c) Determine whether or not Y1 and Y2 are independent.
(d) Construct a table which gives, for every y2 , the conditional probability
mass function of Y1 given Y2 = y2 .

5. Two people are asked to toss a fair coin twice and record each result (heads or
tails) as zero or one. The diligent person does as they are asked, but the lazy
one just tosses the coin once and records the same result twice. Let (D1 , D2 ) be
the results of the diligent person’s tosses and (L1 , L2 ) be the recorded results
of the lazy person’s tosses. Construct tables showing the joint PMFs of (D1 , D2 )
and of (L1 , L2 ). Hence show that D1 and L1 have the same distribution, that D2
and L2 have the same distribution, but that the joint distributions of (D1 , D2 )
and of (L1 , L2 ) are not the same.
Part I

Distributions of more than one


random variable
1 Joint distributions

Now we consider the case where we have more than one RV, say X : Ω → R and
Y : Ω → R. We are not simply interested in the properties of X and Y , but in their
joint behaviour.

Definition 1. The joint (cumulative) distribution function (joint CDF) of two


random variables X and Y is defined by

FX,Y (x, y) = P ({ω : X(ω) ⩽ x and Y (ω) ⩽ y})

= P ({ω : X(ω) ⩽ x} ∩ {ω : Y (ω) ⩽ y})

= P (X ⩽ x, Y ⩽ y), x, y ∈ R.

The CDF of X can be obtained from the joint CDF of X and Y :

FX(x) = P(X ⩽ x) = P(X ⩽ x, Y < ∞) = lim_{y→∞} FX,Y(x, y)   (= FX,Y(x, ∞))

and FX is called the marginal distribution function (marginal CDF) of X. The
marginal CDF of Y can be similarly obtained: FY(y) = FX,Y(∞, y).

Definition 2. Two RVs X and Y are said to be jointly continuous if there exists
a function fX,Y(x, y) (⩾ 0) with the property that, for every ‘nice’¹ set C ⊆ R²,

P((X, Y) ∈ C) = ∫∫_C fX,Y(x, y) dx dy.

¹ The details of ‘nice’ are beyond the scope of this module, but certainly rectangles and unions
and intersections of rectangles are included.

The function fX,Y is called the joint PDF of X and Y. It must satisfy fX,Y(x, y) ⩾ 0
for all (x, y) and ∫∫_{R²} fX,Y(u, v) du dv = 1.

Example 1. Suppose that X and Y have joint PDF

fX,Y(x, y) = α(x + y) for 0 ⩽ x ⩽ y ⩽ 1, and fX,Y(x, y) = 0 otherwise.

Determine the value of the constant α > 0.

Solution. To determine the value of α we use the fact that the PDF must integrate
to 1. A diagram of the domain of fX,Y is invariably useful for determining the bounds
of integration.

[Sketch: the unit square in the (x, y)-plane with the line y = x; fX,Y > 0 on the triangle 0 ⩽ x ⩽ y ⩽ 1 above the line.]

So we have

1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y(x, y) dx dy = ∫_0^1 ∫_0^y α(x + y) dx dy

  = α ∫_0^1 [x²/2 + xy]_{x=0}^{y} dy = α ∫_0^1 (3/2) y² dy

  = α [y³/2]_{y=0}^{1} = α · 1/2

and thus α = 2.
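(A quick numerical check, not part of the original notes: the double integral above can be evaluated in R, the package used elsewhere in this module; the value α = 2 and the use of integrate here are assumptions of this sketch.)

    # Sketch: numerically confirm that f(x, y) = 2 (x + y) on 0 <= x <= y <= 1 integrates to 1.
    inner <- function(y) {
      # for each y, integrate over x from 0 to y
      sapply(y, function(yy) integrate(function(x) 2 * (x + yy), lower = 0, upper = yy)$value)
    }
    total <- integrate(inner, lower = 0, upper = 1)$value
    total  # should be (numerically) 1, confirming alpha = 2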

The joint CDF can therefore be found as

FX,Y(x, y) = P(X ⩽ x, Y ⩽ y) = P(X ∈ (−∞, x], Y ∈ (−∞, y]) = ∫_{−∞}^{y} ∫_{−∞}^{x} fX,Y(u, v) du dv.

(emphasise dummy variables!)

Taking derivatives of both sides with respect to x and y, we get

fX,Y(x, y) = ∂² FX,Y(x, y) / ∂x ∂y.

We can also recover the PDF of X from fX,Y, since

fX(x) = (d/dx) FX(x) = (d/dx) FX,Y(x, ∞)
      = (d/dx) ∫_{−∞}^{x} ∫_{−∞}^{∞} fX,Y(u, y) dy du
      = ∫_{−∞}^{∞} fX,Y(x, y) dy   (note dummy vars)

and so fX is called the marginal PDF of X. The marginal PDF of Y can be similarly
obtained:

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy   and   fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.

Corresponding definitions and results hold for jointly discrete random variables.
The focus in this part of the module is on jointly continuous RVs, but all the concepts
we cover have analogues for the discrete case. (PDF ≡ PMF and ∫ ≡ Σ.)

Example 2. Find the joint CDF of X and Y if they have joint PDF

fX,Y(x, y) = e^{−x} for x ⩾ 0, 0 ⩽ y ⩽ 1, and fX,Y(x, y) = 0 otherwise.

Solution. To use the formula FX,Y(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fX,Y(u, v) du dv we need to
sketch the region of integration:

Separate for (i)–(iii)

[Sketch in the (u, v)-plane: fX,Y > 0 on the strip u ⩾ 0, 0 ⩽ v ⩽ 1; mark a point (x, y) in each of the three regions (i)–(iii) below.]

(i) (x, y) ∈ A = {(x, y) : fX,Y(x, y) > 0}:
FX,Y(x, y) = ∫_0^y ∫_0^x e^{−u} du dv = · · · = y(1 − e^{−x}).

(ii) x > 0 and y > 1:
FX,Y(x, y) = FX,Y(x, 1) = 1 − e^{−x}.

(iii) x < 0 or y < 0:
FX,Y(x, y) = 0.

Gathering these together, we have

FX,Y(x, y) = y(1 − e^{−x}) for x ⩾ 0, 0 ⩽ y ⩽ 1;  FX,Y(x, y) = 1 − e^{−x} for x ⩾ 0, y ⩾ 1;  and FX,Y(x, y) = 0 otherwise.

Visualising PDFs

A useful way of understanding a joint PDF, in 2 (or maybe 3) dimensions at least, is
to visualise it. For bivariate distributions a surface plot or a contour plot is the appropriate
tool for this. For example, the joint PDF in Example 1 looks like this:
[Figure: a surface plot and a contour plot of the joint PDF fX,Y(x, y) = 2(x + y) on 0 ⩽ x ⩽ y ⩽ 1 from Example 1, with axes x and y (and z for the surface plot).]

One needs to be a bit careful interpreting what happens at the edges of the domain
where fX,Y > 0, as most surface/contour plotters don’t deal with discontinuities very
well. Nevertheless, these are useful ways to get a feel for what the PDF ‘looks like’.
It is a very useful skill to be able to (roughly) visualise densities in your head; I
strongly recommend using a computer package to help practice, using this and some
of the other joint PDFs in this section. The above plots were generated using the R
commands in plotting1-jointPDF.R available on Moodle. (It also includes code to
generate a dynamic surface plot which you can drag around to explore.)
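(The following is a minimal sketch of the kind of R commands that produce such plots; it is not the Moodle script itself, and the grid size and viewing angles are arbitrary choices.)

    # Sketch: contour and surface plots of f(x, y) = 2 (x + y) on the triangle 0 <= x <= y <= 1.
    f <- function(x, y) ifelse(x >= 0 & y <= 1 & x <= y, 2 * (x + y), 0)
    x <- seq(0, 1, length.out = 101)
    y <- seq(0, 1, length.out = 101)
    z <- outer(x, y, f)                     # matrix of density values on the grid
    contour(x, y, z, xlab = "x", ylab = "y")
    persp(x, y, z, theta = 30, phi = 25, xlab = "x", ylab = "y", zlab = "f(x,y)")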
Now we continue with some more examples of the kinds of calculations that can
be done using joint PDFs.

Example 3. Suppose that

fX,Y(x, y) = 24x(1 − x − y) for x ⩾ 0, y ⩾ 0, x + y ⩽ 1, and fX,Y(x, y) = 0 otherwise.

Find (i) the marginal CDF of Y and (ii) P(X > Y).

Solution. First draw a diagram of the region (call it A) where fX,Y is non-zero.

[Sketch: the triangle A in the (x, y)-plane bounded by x = 0, y = 0 and x + y = 1; for a fixed y ∈ (0, 1), x runs from 0 to 1 − y.]

(i) The marginal PDF of Y is clearly fY(y) = 0 when y < 0 or y > 1.

For 0 ⩽ y ⩽ 1 we have

fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx = ∫_0^{1−y} 24x(1 − x − y) dx = · · · = 4(1 − y)³.

Now, since ∫_0^y 4(1 − u)³ du = 1 − (1 − y)⁴, the CDF of Y is

FY(y) = 0 for y ⩽ 0;  FY(y) = 1 − (1 − y)⁴ for 0 < y ⩽ 1;  and FY(y) = 1 for y ⩾ 1.

Note that since the region where fX,Y(x, y) > 0 is not a rectangle, the range of
integration is different for every y ∈ [0, 1].

(ii) To find P(X > Y) we need to integrate fX,Y(x, y) over the region where x > y.
Formally, if we write A = {(x, y) : fX,Y(x, y) > 0} and let C = {(x, y) : x > y}
then we need to calculate ∫∫_C fX,Y(x, y) dx dy = ∫∫_{A∩C} 24x(1 − x − y) dx dy.

To see what this means for the double integral we’ll have to do, draw a diagram!

[Sketch: the triangle A bounded by x = 0, y = 0 and x + y = 1, together with the line x = y; the region A ∩ C is the part of A below the line x = y.]

Now we can calculate

P(X > Y) = ∫∫_C fX,Y(x, y) dx dy = ∫∫_{A∩C} 24x(1 − x − y) dx dy
         = ∫_0^{1/2} ∫_y^{1−y} 24x(1 − x − y) dx dy
         = · · ·
         = 3/4.
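(A Monte Carlo check of this answer, not in the original notes; the rejection-sampling bound M = 6, i.e. the maximum of the density, and the sample size are assumptions of this sketch.)

    # Sketch: sample (X, Y) by rejection sampling and estimate P(X > Y).
    set.seed(1)
    n <- 200000
    x <- runif(n); y <- runif(n)                     # proposals, uniform on the unit square
    f <- ifelse(x + y <= 1, 24 * x * (1 - x - y), 0) # target density (0 outside the triangle)
    keep <- runif(n) < f / 6                         # accept with probability f / M, with M = 6
    mean(x[keep] > y[keep])                          # should be close to 3/4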

Next we do an example which shows how this technique can be used to find the
distribution of a function of two random variables. Find the CDF (a probability) as
above; differentiate to get PDF.

Example 4. Find the PDF of Z = X/Y, where

fX,Y(x, y) = e^{−(x+y)} for 0 < x, y < ∞, and fX,Y(x, y) = 0 otherwise.

Solution. First, X > 0 and Y > 0 so Z = X/Y > 0 and FZ(z) = 0 for all z ⩽ 0.
For z > 0 we need to find FZ(z) = P(Z ⩽ z) = P(X/Y ⩽ z).

(Draw a diagram to see what region we need to integrate fX,Y over:)

[Sketch: the first quadrant with the line x = yz, i.e. y = x/z; the region of integration {x ⩽ yz} is the part of the first quadrant to the left of this line. Note that z > 0 here.]

Since x/y ⩽ z ≡ x ⩽ yz, we see that

FZ(z) = P(X/Y ⩽ z) = ∫∫_{{(x,y) : x/y ⩽ z}} fX,Y(x, y) dx dy
      = ∫_0^∞ ∫_0^{yz} e^{−(x+y)} dx dy
      = · · ·
      = 1 − 1/(1 + z)

and so

fZ(z) = (d/dz) FZ(z) = (1 + z)^{−2} for z > 0, and fZ(z) = 0 for z ⩽ 0.
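(A simulation check of this PDF, not in the original notes; the sample size is arbitrary.)

    # Sketch: for independent X, Y ~ Exp(1), Z = X/Y should have CDF F_Z(z) = 1 - 1/(1 + z).
    set.seed(1)
    x <- rexp(100000); y <- rexp(100000)
    z <- x / y
    mean(z <= 1)   # compare with 1 - 1/2 = 0.5
    mean(z <= 3)   # compare with 1 - 1/4 = 0.75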

The same approach will work for any function of X and Y. If Z = g(X, Y) then

FZ(z) = P(g(X, Y) ⩽ z) = ∫∫_{{(x,y) : g(x,y) ⩽ z}} fX,Y(x, y) dx dy.

However, if fX,Y or g is complicated then this integral might be difficult or impossible


to evaluate.

1.1 Joint distributions for more than two RVs

All of the ideas and notation above can be extended to deal with more than 2
random variables. Rather than labelling the RVs X, Y , . . . , we use the notation
X1 , X2 , . . . , Xn (n ⩾ 2).
Joint CDF:

FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ⩽ x1 , . . . , Xn ⩽ xn ).

Joint PDF:

fX1,...,Xn(x1, . . . , xn) = ∂ⁿ FX1,...,Xn(x1, . . . , xn) / (∂x1 . . . ∂xn).

Marginals: for example,

FXi,Xj(xi, xj) = FX1,...,Xn(∞, . . . , ∞, xi, ∞, . . . , ∞, xj, ∞, . . . , ∞)

and

fXi,Xj(xi, xj) = ∫· · ·∫_{R^{n−2}} fX1,...,Xn(x1, . . . , xn) dx1 . . . dxi−1 dxi+1 . . . dxj−1 dxj+1 . . . dxn.

For C ⊆ Rⁿ,

P((X1, . . . , Xn) ∈ C) = ∫· · ·∫_C fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn.

The notation here can be rather cumbersome, so vector notation is often used.
Writing x = (x1 , . . . , xn ) and X = (X1 , . . . , Xn ) the above statements can be re-
written as, for example,

FX(x) = P(X ⩽ x),   fX(x) = F^(1)(x)   and   P(X ∈ C) = ∫_C fX(x) dx.

Here we need to interpret X ⩽ x elementwise and, for k = (k1, . . . , kn) with ki ∈ Z⁺,
F^(k)(x) as the mixed derivative of order ki with respect to xi.
This notation will not often be used in this module as we usually restrict attention
to cases with only 2 or 3 variables. Nevertheless, it is used in many textbooks and
other resources you might encounter, and also in some higher level study when dealing
with large collections of random variables.

1.2 Problems

1. Let the random variables X and Y have joint distribution function FX,Y . Show
the following:

(a) P (X > x, Y > y) = 1 − FX (x) − FY (y) + FX,Y (x, y);


(b) P (x1 < X ⩽ x2 , y1 < Y ⩽ y2 ) = FX,Y (x2 , y2 )+FX,Y (x1 , y1 )−FX,Y (x1 , y2 )−
FX,Y (x2 , y1 ).

2. Let X and Y have joint probability density function

   fX,Y(x, y) = 3x for 0 < y < x < 1, and fX,Y(x, y) = 0 otherwise.

Determine the marginal probability density functions of X and Y .

3. A theory of chemical reactions suggests that the variation in the quantities X
   and Y of two products C1 and C2 of a certain reaction is described by the joint
   probability density function

   fX,Y(x, y) = 2/(1 + x + y)³,   x ⩾ 0, y ⩾ 0.

On the basis of this theory, answer the following questions.

(a) What is the probability that at least one unit of each product is produced?
(b) Determine the probability that the quantity of C1 produced is less than
half that of C2 .
(c) Find the CDF for the total quantity of C1 and C2 .

4. The random variables X and Y have joint probability density function

   fX,Y(x, y) = 8xy for 0 < x < y < 1, and fX,Y(x, y) = 0 otherwise.

Find the probability density function of X + Y .



2 Independent Random Variables

Definition 3. The random variables X and Y are said to be independent if

P (X ⩽ x, Y ⩽ y) = P (X ⩽ x) P (Y ⩽ y)

(i.e. FX,Y (x, y) = FX (x) FY (y)) for all x, y ∈ R.

This can be interpreted as the joint behaviour of X and Y being determined by the
marginal behaviour of X and the marginal behaviour of Y . There is no ‘interaction’
between X and Y . The following equivalent characterisation is often more useful than
the definition involving CDFs.

Theorem 4. Jointly continuous random variables X and Y are independent if and


only if fX,Y (x, y) = fX (x) fY (y) for all (x, y) at which FX,Y (x, y) is differentiable.

In practice FX,Y (x, y) is very often differentiable everywhere; so independence of


X and Y is equivalent to fX,Y (x, y) = fX (x) fY (y) for all (x, y).
Note that to show dependence of two random variables it is enough to find one
(x, y) pair that does not satisfy fX,Y (x, y) = fX (x) fY (y).

Example 5. Determine whether the RVs defined by the following joint PDFs are
independent.

f1(x, y) = e^{−(x+y)} for x, y ⩾ 0, and 0 otherwise;      f2(x, y) = 8xy for x ∈ (0, 1), y ∈ (0, x), and 0 otherwise.

Solution. It looks like f1(x, y) = e^{−x} · e^{−y}. To verify that this is true, find the
marginal PDFs:

fX(x) = ∫_0^∞ e^{−(x+y)} dy = e^{−x} ∫_0^∞ e^{−y} dy = e^{−x}   (x ⩾ 0)

and similarly fY(y) = e^{−y} (y ⩾ 0).

Thus f1(x, y) = fX(x) fY(y) for all (x, y), so X and Y are independent.
The PDF f2 also looks, at first glance, to have the correct product form for
independence. However, the marginal densities are

fX(x) = ∫_0^x 8xy dy = 4x³   (x ∈ (0, 1))      and      fY(y) = ∫_y^1 8xy dx = 4y(1 − y²)   (y ∈ (0, 1)).

Immediately we can see that f2 ≠ fX · fY; for example f2(1/2, 1/4) = 8 · (1/2) · (1/4) = 1 but
fX(1/2) fY(1/4) = (1/2) · (15/16) = 15/32. Therefore X and Y are dependent.

There is an alternative way to show that the variables described by f2 above are
dependent. The (x, y) values with positive density form a triangle. Knowing the
value of one variable gives us information about possible values of the other variable,
contravening the idea of independence (see also Section 4.2). We can make this
argument rigorous by noting that X and Y both take values in (0, 1), but (X, Y)
does not take values in all of (0, 1) × (0, 1). For example, taking (x, y) = (1/4, 1/2) we get

0 = f2(1/4, 1/2) ≠ fX(1/4) fY(1/2) > 0,

so X and Y are dependent.

[Sketch: the unit square in the (x, y)-plane with the triangle {0 < y < x < 1} shaded; f2 > 0 only on this triangle.]

This type of argument will work whenever the possible values of one variable
depend on the value of another. More concretely, if the region where fX,Y > 0 cannot
be written as AX × AY, where AX (AY) is the set of possible X (Y) values, then X and
Y cannot be independent. (If AX and AY are both intervals then their Cartesian product
is a rectangle.)

This informal observation is in fact a consequence of the following result.

Theorem 5. Jointly continuous random variables X and Y are independent if and


only if their joint PDF fX,Y factorises as the product fX,Y (x, y) = g(x)h(y) of func-
tions of the single variables x and y alone.

Comment on domains of fX,Y, g, h, with reference to f2 above: f2 appears to ‘work’ with
g(x) = 8x (x ∈ (0, 1)) and h(y) = y (y ∈ (0, x)) — but the domain of h depends on x.

2.1 Sums of independent RVs

Suppose that X and Y are independent continuous RVs with PDFs fX and fY respectively. The CDF of X + Y is

FX+Y(z) = P(X + Y ⩽ z)
        = ∫∫_{{(x,y) : x+y ⩽ z}} fX(x) fY(y) dx dy
        = ∫_{−∞}^{∞} ( ∫_{−∞}^{z−y} fX(x) dx ) fY(y) dy   ( = ∫_{−∞}^{∞} fX(x) ( ∫_{−∞}^{z−x} fY(y) dy ) dx )
        = ∫_{−∞}^{∞} FX(z − y) fY(y) dy   ( = ∫_{−∞}^{∞} fX(x) FY(z − x) dx ).

Hence

fX+Y(z) = (d/dz) FX+Y(z)
        = ∫_{−∞}^{∞} (d/dz) FX(z − y) fY(y) dy   ( = ∫_{−∞}^{∞} fX(x) (d/dz) FY(z − x) dx )
        = ∫_{−∞}^{∞} fX(z − y) fY(y) dy   ( = ∫_{−∞}^{∞} fX(x) fY(z − x) dx ).

We have thus proved the following result.

Theorem 6. If X and Y are independent continuous RVs with PDFs fX and fY
then the PDF of Z = X + Y is

fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.

The function fZ is called the convolution of fX and fY.



Example 6. Find the PDF of Z = X + Y when X and Y are independent exponential
random variables with (rate) parameter λ > 0.

Solution. Recall that X ∼ Exp(λ) means fX(x) = λe^{−λx} (x ⩾ 0).

First, X ⩾ 0 and Y ⩾ 0, so Z ⩾ 0 and fZ(z) = 0 for z < 0.

For z ⩾ 0, we have

fZ(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.

Now, fX(x) > 0 when x > 0 and fY(z − x) > 0 when x < z,
i.e. both are positive when 0 < x < z.

So

fZ(z) = ∫_0^z λ e^{−λx} λ e^{−λ(z−x)} dx = λ² e^{−λz} ∫_0^z 1 dx = λ² z e^{−λz}.

Or draw a diagram:

[Sketch: the line x + y = z in the (x, y)-plane; fX,Y > 0 only in the first quadrant, so for z > 0 the integration runs over the segment 0 < x < z, while for z < 0 the line misses the first quadrant entirely.]
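(A simulation check of this convolution, not in the original notes; the value λ = 2 and the sample size are arbitrary choices. The PDF λ²ze^{−λz} is that of a Gamma(2, λ) distribution, consistent with Problem 2 of Section 2.4.)

    # Sketch: the sum of two independent Exp(lambda) variables should be Gamma(shape = 2, rate = lambda).
    set.seed(1)
    lambda <- 2
    z <- rexp(100000, rate = lambda) + rexp(100000, rate = lambda)
    hist(z, breaks = 100, freq = FALSE, main = "Sum of two independent exponentials")
    curve(dgamma(x, shape = 2, rate = lambda), add = TRUE, lwd = 2)  # lambda^2 * z * exp(-lambda * z)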

2.2 Functions of independent RVs

If two random variables X and Y are independent then it seems intuitively clear that
any function of X should be independent of any function of Y . This is indeed the
case.

Theorem 7. If X1 and X2 are independent then so are Y1 = g1 (X1 ) and Y2 = g2 (X2 );


for any ‘nice’ functions g1 and g2 .

(The proof is omitted.) The exact meaning of ‘nice’ involves the details of integration
theory, but it is a very weak condition. Examples of nice functions are those
which are continuous, monotone, or piecewise continuous with finitely many discontinuities.
The theorem can obviously be extended to deal with finitely many random
variables. (Important later.)

2.3 Mutual independence of a collection of RVs

To conclude this section we note the main generalisation of the notion of independence
to more than two random variables, that of mutual independence.

Definition 8. The random variables X1, X2, X3, . . . are said to be mutually independent if

FXi(1),...,Xi(r)(x1, . . . , xr) = ∏_{j=1}^{r} FXi(j)(xj)

for all finite subsets Xi(1), . . . , Xi(r) of the random variables and values xj ∈ R.

This definition captures the idea that no possible information about any finite
combination of any of the RVs gives any further information about likely values of
any finite combination of any of the other RVs. Loosely speaking, it says that all
pairs of RVs amongst X1 , X2 , . . . are independent, as are all triples, and all n-tuples.
This is, therefore, a more stringent requirement than just pairwise independence,
where every pair Xi , Xj of the collection of RVs is independent.
This leads on to the critical notion in statistics of a collection of random variables
X1 , X2 , . . . being independent and identically distributed (IID). A collection of
random variables X1 , X2 , . . . is said to be IID if the Xi ’s are mutually independent
and all have the same distribution.

2.4 Problems

1. Are the variables defined in Q3 of the Section 1 Problems independent?

2. Let X1 , · · · , Xn be n (⩾ 2) independent random variables, where each Xi has


the exponential distribution with parameter λi > 0. Let Yn = X1 + · · · + Xn .

(a) Find the probability density function of Y2 .


(b) Assume now that all λi are the same, say λi = λ, for i = 1, · · · , n. Use
induction to show that, for all n ⩾ 2, Yn is a Gamma(n, λ) random variable.

3. Determine the PDF of Z = X + Y , where X and Y are independent uniform


random variables on the interval (0, 1).
4. Write out a formal proof of the fZ(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx part of the
   convolution formula.

3 Expectation, Covariance and Correlation

3.1 Expectation and some properties

Theorem 9. For all ‘nice’ functions g : A ⊆ Rⁿ → R, the expectation of a function
g(X1, . . . , Xn) of the random variables X1, . . . , Xn is

E[g(X1, . . . , Xn)] = Σ_{x1,...,xn} g(x1, . . . , xn) pX1,...,Xn(x1, . . . , xn)   in the discrete case,

E[g(X1, . . . , Xn)] = ∫· · ·∫_{Rⁿ} g(x1, . . . , xn) fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn   in the continuous case,

so long as the sum/integral is unambiguously defined (e.g. is absolutely convergent or
unconditionally convergent). This can be written in vector notation:

E[g(X)] = Σ_x g(x) pX(x) in the discrete case, and E[g(X)] = ∫_{Rⁿ} g(x) fX(x) dx in the continuous case.

For the meaning of ‘nice’ see the discussion after Theorem 7. Basically, so long as
Y = g(X1 , . . . , Xn ) is a random variable then E[Y ] is defined by the appropriate sum
or integral. Also, we note that we can take g in this theorem to map Rn to Rm , so
long as we interpret the expectation of a vector as the vector of expectations. (g has
to be quite unpleasant for Y = g(X1 , . . . , Xn ) not to be a random variable.) (Also
note that this is a theorem, not a definition.)

To check that this agrees with the case of a single random variable, take the
function g to be g(X1, . . . , Xn) = g(X1). Then

E[g(X1)] = ∫· · ·∫_{Rⁿ} g(x1) fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn
         = ∫_{−∞}^{∞} g(x1) ( ∫· · ·∫_{R^{n−1}} fX1,...,Xn(x1, . . . , xn) dx2 . . . dxn ) dx1
         = ∫_{−∞}^{∞} g(x1) fX1(x1) dx1.

Example 7. Find the expectation of Z = XY, where X and Y have joint PDF

fX,Y(x, y) = 2(x + y) for 0 ⩽ x ⩽ y ⩽ 1, and fX,Y(x, y) = 0 otherwise.

Solution. With reference to the diagram from Example 1, Theorem 9 tells us that

E[Z] = E[XY] = ∫∫_{R²} xy fX,Y(x, y) dx dy = 2 ∫_0^1 ∫_0^y xy(x + y) dx dy = · · · = 1/3.
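(A Monte Carlo check of E[XY] = 1/3, not in the original notes; the rejection bound M = 4, the maximum of the density at (1, 1), and the sample size are assumptions of this sketch.)

    # Sketch: sample (X, Y) from f(x, y) = 2 (x + y) on 0 <= x <= y <= 1 and estimate E[XY].
    set.seed(1)
    n <- 200000
    x <- runif(n); y <- runif(n)
    f <- ifelse(x <= y, 2 * (x + y), 0)   # target density on the triangle
    keep <- runif(n) < f / 4              # rejection sampling: accept with probability f / M
    mean(x[keep] * y[keep])               # should be close to 1/3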

Since expectation is defined as an integral/sum, it inherits many properties of


these operations, particularly linearity.

Proposition 10. For all a, b, c ∈ R, we have

E[ag1 (X1 , . . . , Xn ) + bg2 (X1 , . . . , Xn ) + c]

= aE[g1 (X1 , . . . , Xn )] + bE[g2 (X1 , . . . , Xn )] + c,

or E[ag1 (X) + bg2 (X) + c] = aE[g1 (X)] + bE[g2 (X)] + c.

Proof. Omitted as an exercise. To avoid long sums and integrals one approach is to
prove each of E[ag1 (X)] = aE[g1 (X)] and E[g1 (X) + g2 (X)] = E[g1 (X)] + E[g2 (X)]
(of which E[a + g1 (X)] = a + E[g1 (X)] is a special case); then combine these.

It is also important to note that, in general, E[g(X)] ≠ g(E[X]) (where we interpret E[X] to
be the vector of expected values). This is because (in the 1-variable, continuous case) ∫ g(x) f(x) dx ≠
g(∫ x f(x) dx) in general. (e.g. E[X²] = .. and E[X]² = .. Mention Jensen’s inequality?)

Taking another special case for the function g, we can prove the following impor-
tant result concerning the expectation of a sum of random variables.

Proposition 11. For any random variables X1, . . . , Xn, we have

E[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} E[Xi].

Proof. Left as an exercise. Either take g(X1, . . . , Xn) = Σ_{i=1}^{n} Xi in Theorem 9 or use
Proposition 10 and induction.

No assumption about independence!

Theorem 12. If X and Y are independent then, for any functions g and h,

E[g(X)h(Y )] = E[g(X)]E[h(Y )].

Proof. We treat the case where X and Y are jointly continuous. The independence
of X and Y implies that fX,Y(x, y) = fX(x) fY(y) for all x, y. Therefore

E[g(X) h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) fX(x) fY(y) dx dy
             = ∫_{−∞}^{∞} g(x) fX(x) dx · ∫_{−∞}^{∞} h(y) fY(y) dy
             = E[g(X)] E[h(Y)].

A particularly important case of this is the following.

Proposition 13. If X and Y are independent then E[XY ] = E[X]E[Y ].

Proof. In Theorem 12, take g and h to be the identity function x 7→ x.

3.2 Covariance and correlation

We now look at covariance and correlation, which measure the extent to which the
values of two random variables are associated with each other.

Definition 14. The covariance of two RVs X and Y , denoted by cov(X, Y ), is defined
by
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]

(so long as the expectations above are defined).



By evaluating the expectation in this definition (see problems), one can arrive at
the alternative (and often more computationally convenient) formula

cov(X, Y ) = E[XY ] − E[X]E[Y ].

Also note that the covariance of a random variable with itself is just its variance: defs

cov(X, X) = E[(X − E[X])(X − E[X])]

= E[(X − E[X])2 ]

= var(X).

The following proposition gives some important properties of covariance.

Proposition 15. (i) cov(X, Y ) = cov(Y, X).

(ii) If X and Y are independent then cov(X, Y ) = 0. The converse is false in


general.

(iii) cov(aX + b, cY + d) = a c cov(X, Y ).


(iv) cov( Σ_{i=1}^{m} ai Xi , Σ_{j=1}^{n} bj Yj ) = Σ_{i=1}^{m} Σ_{j=1}^{n} ai bj cov(Xi, Yj). (So cov is bilinear.)

In particular,

(a) var( Σ_{i=1}^{n} ai Xi ) = Σ_{i=1}^{n} ai² var(Xi) + 2 Σ_{1⩽i<j⩽n} ai aj cov(Xi, Xj);

(b) if X1, . . . , Xn are independent then var( Σ_{i=1}^{n} ai Xi ) = Σ_{i=1}^{n} ai² var(Xi);

(c) if X1, . . . , Xn are IID then var( Σ_{i=1}^{n} ai Xi ) = var(X1) Σ_{i=1}^{n} ai².

Proof. Statement (i) follows immediately from the definition of covariance. Statement
(ii) follows from the definition and Proposition 13: (use the alt form and the Prop)

cov(X, Y ) = E[XY ] − E[X]E[Y ] = E[X]E[Y ] − E[X]E[Y ] = 0.



Statement (iii) follows from the linearity of expectation (Proposition 10):

cov(aX + b, cY + d) = E[((aX + b) − E[aX + b])((cY + d) − E[cY + d])]

= E[(aX + b − aE[X] − b)(cY + d − cE[Y ] − d)]

= E[(a(X − E[X]))(c(Y − E[Y ]))]

= ac E[(X − E[X])(Y − E[Y ])] = ac cov(X, Y ).

Proving (iv) is left as an exercise.

Example 8. Using the construction of a Binomial random variable as the sum of n


independent Bernoulli random variables, each with success probability p, compute the
variance of a Binomial random variable S ∼ Bin(n, p).

Solution. Write S = X1 + · · · + Xn, where the Xi are independent and Ber(p):

Xi = 1 with probability p, and Xi = 0 with probability 1 − p.

It is straightforward to show that E[Xi] = p and var(Xi) = p(1 − p). Thus

var(S) = var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} var(Xi) = n var(X1) = np(1 − p).
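(A quick simulation check, not in the original notes; n = 10 and p = 0.3 are arbitrary choices.)

    # Sketch: build Binomial variables as column sums of Bernoulli variables and check the variance.
    set.seed(1)
    n <- 10; p <- 0.3
    bern <- matrix(rbinom(n * 100000, size = 1, prob = p), nrow = n)  # 100000 columns of n Bernoullis
    s <- colSums(bern)
    var(s)   # compare with n * p * (1 - p) = 2.1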

Example 9. Suppose that X has mean 1 and variance 2, Y has mean 2 and variance
4, and cov(X, Y ) = 1. Find cov(2X + Y, 2Y − X).

Solution. Using Proposition 15, we find that

cov(2X + Y, 2Y − X)

= 2 · 2 cov(X, Y ) + 2 · (−1) cov(X, X) + 1 · 2 cov(Y, Y ) + 1 · (−1) cov(Y, X)

= 2 var(Y ) − 2 var(X) + 3 cov(X, Y )

= 2 · 4 − 2 · 2 + 3 · 1 = 7.

The following technical result will be used to get useful bounds on the covariance.

Lemma 16. If E[X] = E[Y ] = 0 and var(X) = var(Y ) = 1, then |E[XY ]| ⩽ 1.

Proof. First, observe that (Remember definitions and assumptions!)

0 ⩽ E[(X + Y )2 ] = E[X 2 ] + 2 E[XY ] + E[Y 2 ]

= var(X) + 2 E[XY ] + var(Y )

= 2(1 + E[XY ]),

so 1 + E[XY ] ⩾ 0, or E[XY ] ⩾ −1. Similarly,

0 ⩽ E[(X − Y )2 ] = · · · = 2(1 − E[XY ]),

so E[XY ] ⩽ 1. We therefore have −1 ⩽ E[XY ] ⩽ 1 or equivalently |E[XY ]| ⩽ 1.

Theorem 17. For any random variables X and Y we have

(cov(X, Y ))2 ⩽ var(X) var(Y ).

This relationship is known as the Cauchy-Schwarz inequality.

Proof. Let X̃ = (X − E[X]) / √var(X) and Ỹ = (Y − E[Y]) / √var(Y). Then, using Proposition 15,

cov(X̃, Ỹ) = cov( X / √var(X) , Y / √var(Y) ) = cov(X, Y) / ( √var(X) √var(Y) ).

Now, since E[X̃] = E[Ỹ] = 0 and var(X̃) = var(Ỹ) = 1, we can apply Lemma 16 to
find that

| cov(X, Y) | / ( √var(X) √var(Y) ) = | cov(X̃, Ỹ) | = | E[X̃ Ỹ] | ⩽ 1.

Hence (cov(X, Y))² ⩽ var(X) var(Y), as required.



Definition 18. Assume that var(X) var(Y) ≠ 0. The (Pearson) correlation of X and
Y, denoted by ρ(X, Y), is defined by

ρ(X, Y) = cov(X, Y) / ( √var(X) √var(Y) ).

Proposition 19. The correlation coefficient has the following properties.

(i) −1 ⩽ ρ(X, Y) ⩽ 1.

(ii) X and Y are independent =⇒ ρ(X, Y) = 0.

(iii) ρ(aX + b, cY + d) = a c cov(X, Y) / ( |a c| √(var(X) var(Y)) ), which equals ρ(X, Y) if a c > 0 and −ρ(X, Y) if a c < 0.

Proof. Part (i) follows from the Cauchy-Schwarz inequality:

ρ(X, Y)² = cov(X, Y)² / ( var(X) var(Y) ) ⩽ 1,

so |ρ(X, Y)| ⩽ 1. To prove part (ii), use (ii) of Proposition 15 to see that

ρ(X, Y) = cov(X, Y) / ( √var(X) √var(Y) ) = 0 / ( √var(X) √var(Y) ) = 0.

The proof of (iii) is left as an exercise.

Correlation is thus a normalised, unitless version of covariance. It is a measure of


the strength of linear association between two random variables.

Example 10. If Xi ∼ Poi(λi ), where λi > 0, are mutually independent for i =


0, 1, 2; find the means, standard deviations and correlations of Y1 = X0 + X1 and
Y2 = X 0 + X2 .

Solution. For i = 1, 2 we first have E[Yi] = E[X0 + Xi] = E[X0] + E[Xi] = λ0 + λi.

Similarly, var(Yi) = var(X0 + Xi) = var(X0) + var(Xi) = λ0 + λi, so sd(Yi) = √(λ0 + λi).

Lastly we have

cov(Y1, Y2) = cov(X0 + X1, X0 + X2)
            = cov(X0, X0) + cov(X0, X2) + cov(X1, X0) + cov(X1, X2)
            = λ0 + 0 + 0 + 0 = λ0,

so ρ(Y1, Y2) = cov(Y1, Y2) / √(var(Y1) var(Y2)) = λ0 / √((λ0 + λ1)(λ0 + λ2)).
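(A simulation check, not in the original notes; λ0 = 1, λ1 = 2, λ2 = 3 and the sample size are arbitrary choices.)

    # Sketch: check the correlation of Y1 = X0 + X1 and Y2 = X0 + X2 for independent Poissons.
    set.seed(1)
    n <- 100000
    x0 <- rpois(n, 1); x1 <- rpois(n, 2); x2 <- rpois(n, 3)
    y1 <- x0 + x1; y2 <- x0 + x2
    cor(y1, y2)   # compare with 1 / sqrt((1 + 2) * (1 + 3)), approximately 0.289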

3.3 Moments and Higher dimensions

To conclude this section we introduce some terminology which generalises some of


the ideas earlier in this section.

Definition 20. Let k be a positive integer. The k-th (raw) moment of X is defined
to be mk = E[X k ] and the k-th central moment of X is σk = E[(X − m1 )k ].

Commonly encountered moments are the mean µ = m1 and the variance σ 2 = σ2 .


These quantify the location and variability of the distribution of X. Third, fourth
and fifth moments also have interpretations, relating to the shape of X’s distribution.
We can similarly define joint moments of a random vector (X, Y ).

Definition 21. For positive integers j, k, the raw moments are mjk = E[X j Y k ] and
the central moments σjk = E[(X − m10 )j (Y − m01 )k ].

As well as being useful quantitative descriptors of probability distributions, mo-


ments can also be used in statistical settings (the method of moments).
All of the ideas in this section extend to higher dimensions where we consider a
random vector X = (X1 , X2 , . . . , Xn )⊤ of length n. In this case we write

E[X] = µ

for µ = (µ1 , µ2 , . . . , µn )⊤ if E[Xi ] = µi for all i = 1, 2, . . . , n. We also write

var(X) = E[(X − E[X]) (X − E[X])⊤ ]

for the matrix of variances and covariances of X, i.e. [var(X)]ij = cov(Xi , Xj ). Often
this variance-covariance matrix (or just covariance matrix) is called Σ; it then
follows that [Σ]ij = cov(Xi , Xj ) = ρij σi σj . Note that [Σ]ii = σi2 . But [Σ]ii ̸= σi .

3.4 Problems

1. Complete the proofs of Proposition 10, that E[a g1(X1, . . . , Xn) + b g2(X1, . . . , Xn) + c] =
   a E[g1(X1, . . . , Xn)] + b E[g2(X1, . . . , Xn)] + c for all a, b, c ∈ R; and Proposition 11,
   that E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi].

2. Find the correlation ρ(X, Y ) and justify why X and Y are dependent, where

(a) X and Y have joint PDF given by

    fX,Y(x, y) = 8xy for 0 < x < y < 1, and fX,Y(x, y) = 0 otherwise.

(b) X and Y are uniformly distributed on the triangle in R2 with corners (0, 0),
(1, 1) and (2, 0).

3. Derive the formula cov(X, Y ) = E[XY ] − E[X]E[Y ] from the definition of


covariance.

4. Prove Proposition 19: If a, b, c, d ∈ R and ρ(X, Y ) is well-defined, then ρ(aX +


b, cY + d) is undefined if ac = 0, equal to ρ(X, Y ) if ac > 0 and equal to
−ρ(X, Y ) if ac < 0.

5. Let X1, X2, · · · , Xn be random variables satisfying

   E[Xi] = µ, var(Xi) = σ² and cov(Xi, Xj) = c,

   where i, j = 1, · · · , n and i ≠ j, and let S = Σ_{i=1}^{n} Xi. Show that

   E[S] = nµ and var(S) = nσ² + n(n − 1)c.

6. Suppose that n skiers, each having a distinct pair of skis, are sharing a chalet.
At the end of a hard day on the piste, they each remove their skis and throw
them in a heap on the porch floor. The following morning, still befuddled by
the previous night’s après-ski, they each choose two skis completely at random
from the heap. Let N be the number of skiers that end up with a matching
pair of skis. Determine E[N ] and var(N ).
Hint: Begin by writing N as the sum of n variables, where the i-th variable is
1 or 0 according as the i-th skier does or does not get a matching pair of skis.

4 Conditional distributions and expectations

Conditional probability is about incorporating extra information into a probability


model after it has been specified. These lead naturally to conditional distributions
and conditional expectations; just as ordinary distributions and expectations follow
from probabilities.

4.1 Conditional distributions

Recall that the conditional probability of E given F is

P(E | F) = P(E ∩ F) / P(F), if P(F) > 0.

We can directly apply this in the following situations.


If X and Y are discrete, the conditional PMF of X given Y = y is

pX|Y(x | y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)   (defn)
            = pX,Y(x, y) / pY(y),

for y such that pY(y) > 0.


Thus, for C ⊆ R,

P(X ∈ C | Y = y) = Σ_{x∈C} pX|Y(x | y).

In particular, the conditional CDF of X given Y = y is

FX|Y(x | y) = P(X ⩽ x | Y = y) = Σ_{x′ ⩽ x} pX|Y(x′ | y).

If X and Y are jointly continuous then for C, D ⊆ R we can write

P(X ∈ C | Y ∈ D) = P(X ∈ C, Y ∈ D) / P(Y ∈ D),

so long as P(Y ∈ D) > 0.

For example, if C = (a, b] and D = (c, d] with P(Y ∈ (c, d]) > 0, then

P(X ∈ (a, b] | Y ∈ (c, d]) = P(X ∈ (a, b], Y ∈ (c, d]) / P(Y ∈ (c, d]) = ( ∫_a^b ∫_c^d fX,Y(u, v) dv du ) / ( ∫_{−∞}^{∞} ∫_c^d fX,Y(u, v) dv du ).

However, if X and Y are jointly continuous then conditioning on {Y = y} means working with a conditional probability P(E1 | E2) in which P(E2) = 0:

P(X ∈ A | Y = y) = P(X ∈ A ∩ Y = y) / P(Y = y) = 0/0,

so we cannot compute conditional probabilities such as P(X ∈ A | Y = y) directly.

The solution is to condition on an event of positive probability and then take a limit:

P(X ∈ A | Y ∈ (y, y + h]) = P(X ∈ A ∩ Y ∈ (y, y + h]) / P(Y ∈ (y, y + h])
                           = ( ∫_A ∫_y^{y+h} fX,Y(u, v) dv du ) / ( ∫_{−∞}^{∞} ∫_y^{y+h} fX,Y(u, v) dv du )
                           ≈ ( ∫_A h · fX,Y(u, y) du ) / ( h · fY(y) )   (for small h)
                           = ∫_A ( fX,Y(u, y) / fY(y) ) du.

This calculation justifies the following definition. (Note that the conditioning event has positive probability iff fY(y) > 0.)

Definition 22. If X and Y have a joint PDF fX,Y then the conditional PDF of
X, given that Y = y, is defined by

fX|Y(x | y) = fX,Y(x, y) / fY(y)

for y such that fY(y) > 0.

So, for C ⊆ R (cf. conditioning on an interval above),

P(X ∈ C | Y = y) = ∫_C fX|Y(x | y) dx.

In particular, taking C = (−∞, x], the conditional CDF of X, given Y = y, is

FX|Y(x | y) = P(X ⩽ x | Y = y) = ∫_{−∞}^{x} fX|Y(u | y) du.   (DV!)

If we think of fX | Y (x | y) as a function of x only then it has all the properties of any


other PDF. Similarly, FX | Y (x | y) as a function of x has all the properties of a CDF.

Example 11. Suppose that the joint PDF of X and Y is given by

fX,Y(x, y) = e^{−(x/y + y)} y^{−1} for 0 < x, y < ∞, and fX,Y(x, y) = 0 otherwise.

Find the conditional PDF fX|Y and hence P(X > 1 | Y = y).

Solution. We first find the marginal PDF of Y, then fX|Y = fX,Y / fY. Then we will
be able to find P(X > 1 | Y = y) by integrating fX|Y(x | y) in the region x > 1.

Clearly fY(y) = 0 for y ⩽ 0. For y > 0 we have

fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx = ∫_0^∞ e^{−(x/y + y)} y^{−1} dx = · · · = e^{−y}.

Hence, for y > 0,

fX|Y(x | y) = fX,Y(x, y) / fY(y) = e^{−x/y} y^{−1} for x > 0, and 0 for x ⩽ 0.

Now we can write

P(X > 1 | Y = y) = ∫_1^∞ fX|Y(x | y) dx = ∫_1^∞ e^{−x/y} y^{−1} dx
                 = y^{−1} [ e^{−x/y} / (−y^{−1}) ]_{x=1}^{∞}
                 = [ −e^{−x/y} ]_{x=1}^{∞} = −(0 − e^{−1/y}) = e^{−1/y}.


Note here how we write the domain of the conditional PDF fX | Y . The value
Y = y is given, so we think of it as a parameter. Thus fX | Y (x | y) is usually thought
of as a function of x only.
For example, if we have the joint density

fX,Y(x, y) = C exp(−x − y − αxy) for x, y > 0, and fX,Y(x, y) = 0 otherwise,

with α > 0 a parameter and C a normalising constant, then it can be shown (exercise!)
that, for given y > 0,

fX|Y(x | y) = (1 + αy) e^{−x(1+αy)} for x > 0, and 0 for x ⩽ 0.

4.2 Conditional distributions and independence

We now return briefly to the notion of independence (see Section 2) to consider how
it relates to the conditional distributions we have just introduced.
The initial intuition behind independence was that the joint distribution of (X, Y )
is determined by the marginal distributions: FX,Y (x, y) = FX (x) FY (y) for all x, y.
In Section 2 we also used the informal characterisation that knowing the value of
one variable gives no information about the possible values of the other. In terms of
probability functions, we might expect this to manifest as a relationship saying that
all conditional distributions are identical to the marginal distributions. Conditions 3
and 4 of the next theorem show that this is indeed the case.

Theorem 23. For jointly continuous random variables X and Y , the following are
equivalent:

1. X and Y are independent.

2. fX,Y (x, y) = fX (x) fY (y) for all (x, y).

3. fX|Y (x | y) = fX (x) for all x ∈ R and y with fY (y) > 0.

4. fY |X (y | x) = fY (y) for all y ∈ R and x with fX (x) > 0.

Proof. Theorem 4 establishes the equivalence of 1 and 2. We prove equivalence of 2


and 3. (The proof of 2 ⇐⇒ 4 is the same but with the roles of X and Y reversed.)
First suppose that fX,Y(x, y) = fX(x) fY(y) for all (x, y). Then from the definition
of fX|Y we have

fX|Y(x | y) = fX,Y(x, y) / fY(y) = fX(x) fY(y) / fY(y) = fX(x),

for every x ∈ R and y s.t. fY(y) > 0. So 2 implies 3.



Now suppose that fX|Y (x | y) = fX (x) for all x ∈ R and y with fY (y) > 0. Then

• for y with fY (y) > 0 we have fX,Y (x, y) = fX|Y (x | y) fY (y) = fX (x) fY (y);

• for y with fY (y) = 0 we have fX,Y (x, y) = 0 = fX (x) fY (y).

Therefore 3 implies 2, and so 3 is equivalent to 2.

4.3 Conditional expectation

Suppose we are told that Y = y. Conditional upon this, we can characterise the
‘new’ distribution of X: it is the conditional PDF fX|Y (x | y), which we think of as
a function of x. Just like any other random variable (or corresponding PDF) we can
find its expected value.

Definition 24. The conditional expectation of X given Y = y, written ψ(y) =
E[X | Y = y], is the mean of the conditional density function:

E[X | Y = y] = ∫_{−∞}^{∞} x fX|Y(x | y) dx,

valid for all y with fY(y) > 0.

Example 12. Determine E[X | Y = y] (y > 0) for the random variables X and Y in
Example 11.

Solution. From the solution to Example 11 we have that, for y > 0,

fX|Y(x | y) = fX,Y(x, y) / fY(y) = e^{−x/y} y^{−1} for x > 0, and 0 for x ⩽ 0.

Now

E[X | Y = y] = ∫_{−∞}^{∞} x fX|Y(x | y) dx = ∫_0^∞ (x/y) e^{−x/y} dx = · · · = y.

Note here that y is thought of as fixed, so E[X | Y = y] is a number. Recognising


that it is the outcome of a random variable motivates the following definition.

Definition 25. Recall that ψ(y) = E[X | Y = y]. Then ψ(Y ) is called the conditional
expectation of X given Y ; written as E[X | Y ].

As noted above, the function ψ maps y to E[X | Y = y], so ψ(y) = E[X | Y = y]


is a number. On the other hand, ψ(Y ) = E[X | Y ] is a function of a random variable
and so is also a random variable. The ‘conditional expectation’ sounds like a number,
but it is actually a random variable. This is a situation where the notational conven-
tions regarding random variables and their realisations become crucial. (Timeline of
imagining what E[X | Y = y] will be before we know Y ?)

Example 13. Determine E[X | Y ] and E[X 2 | Y ] for the random variables X and Y
in Example 11.

Solution. From the solution to Example 12 we have E[X | Y = y] = y. Substituting
Y for y we have E[X | Y] = Y.

To find E[X² | Y] we first find E[X² | Y = y] = ∫_{−∞}^{∞} x² fX|Y(x | y) dx = · · · = 2y²
for y > 0. Thus E[X² | Y] = 2Y².

Since E[X | Y ] is a random variable, we can take its expectation. The next result
(known as the tower property and the law of iterated expectations, amongst
other names) looks a bit odd at first. It says that if we want to calculate E[X] we
can first fix the value of Y and then average over this value later. We will soon see
that it can be very useful for finding properties of random variables which are defined
‘indirectly’, since then the conditional expectation can be easy to calculate.

Theorem 26. For any random variables X and Y , we have E[E[X|Y ]] = E[X].

Proof. We only consider the case when X and Y are jointly continuous.

E[E[X | Y]] = ∫_{−∞}^{∞} E[X | Y = y] fY(y) dy   (E[X | Y] = ψ(Y) is a function of the RV Y)
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX|Y(x | y) fY(y) dx dy   (defn of E[X | Y = y])
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX,Y(x, y) dx dy   (defn of fX|Y)
            = E[X].   (working defn of E[g(X, Y)])

E[X] = E[E[X | Y]] = E[ψ(Y)] = ∫ ψ(y) fY(y) dy = ∫ E[X | Y = y] fY(y) dy.

Example 14. Suppose that Y ∼ Exp(λ) and (X|Y = y) ∼ Poi(y). Find E[X].

Solution. First, since (X | Y = y) ∼ Poi(y), we have E[X | Y = y] = y.


Therefore E[X | Y ] = Y.
Now, using the tower property and the fact that Y ∼ Exp(λ) has mean 1/λ,

E[X] = E[E[X | Y ]] = E[Y ] = 1/λ.

Note that we have found E[X] without finding the distribution of X.
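(A simulation check, not in the original notes; λ = 2 and the sample size are arbitrary choices. Note that rpois accepts a vector of means, one per draw, which is what generates X given Y here.)

    # Sketch: Y ~ Exp(lambda) and X | Y = y ~ Poisson(y), so E[X] should equal 1/lambda.
    set.seed(1)
    lambda <- 2
    y <- rexp(100000, rate = lambda)
    x <- rpois(100000, lambda = y)   # conditional on each y, a Poisson(y) draw
    mean(x)                          # compare with 1/lambda = 0.5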

Example 15. Determine E[Xn], where X0, X1, X2, . . . is a sequence of random variables, with X0 = 0 and

Xn = Xn−1 + 1 with prob. 1/3,   Xn = Xn−1 + 2 with prob. 1/3,   Xn = Xn−1 + 3 with prob. 1/3.

Solution. The approach is to condition on Xn−1 and iterate. Given Xn−1 = k, we have

E[Xn | Xn−1 = k] = (1/3)(k + 1) + (1/3)(k + 2) + (1/3)(k + 3) = k + 2.

Hence E[Xn | Xn−1] = Xn−1 + 2, so

E[Xn] = E[E[Xn | Xn−1]] = E[Xn−1 + 2] = E[Xn−1] + 2.

Iterating this equation, we find that

E[Xn] = E[Xn−1] + 2 = (E[Xn−2] + 2) + 2
      = E[Xn−2] + 4
      = · · ·
      = E[X0] + 2n = 2n.

Again we have calculated the expectation of a RV without explicitly finding its


distribution.
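(A simulation check, not in the original notes; the number of steps n = 20 and the sample size are arbitrary choices.)

    # Sketch: X_n is the sum of n IID steps taking the values 1, 2, 3 with probability 1/3 each.
    set.seed(1)
    n_steps <- 20
    xn <- replicate(100000, sum(sample(1:3, n_steps, replace = TRUE)))
    mean(xn)   # compare with 2 * n_steps = 40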

Another important consequence of the tower property is the following generalisation
of the law of total probability. Recall the basic LTP P(A) = Σ_i P(A | Bi) P(Bi),
where (Bi) partition the sample space. If we take Bi to be the event that a discrete
random variable X takes the value i this gives P(A) = Σ_i P(A | X = i) P(X = i). We
cannot apply this directly to calculate P(A) by conditioning on a continuous variable
X since that would require summing over uncountably many terms.

Proposition 27. Let X be a continuous random variable and A be an event. Then

P(A) = ∫_{−∞}^{∞} P(A | X = x) fX(x) dx.

For the proof we need the notation 1A, a random variable called the indicator
function of the event A: for any event A we define

1A = 1 if A occurs, and 1A = 0 otherwise.

Proof. First note that E[1A] = 1 · P(A) + 0 · P(Aᶜ) = P(A).

By the same argument we also have E[1A | X = x] = P(A | X = x). Then apply
the tower property and use the fact that X is continuous to conclude that

P(A) = E[1A] = E[E[1A | X]] = ∫_{−∞}^{∞} P(A | X = x) fX(x) dx.

Example 16. As in Example 14, suppose that Y ∼ Exp(λ) and (X|Y = y) ∼ Poi(y).
Determine P (X = 0).

Solution. Conditioning on the outcome of Y (i.e. applying Proposition 27), we find that

P(X = 0) = ∫_{−∞}^{∞} P(X = 0 | Y = y) fY(y) dy = ∫_0^∞ e^{−y} λ e^{−λy} dy = · · · = λ/(λ + 1).

An immediate consequence of this is the following. (Another version of the tower
property / conditioning.)

Proposition 28. For jointly continuous random variables (X, Y),

fY(y) = ∫_{−∞}^{∞} fY|X(y | x) fX(x) dx.

Proof. Left as an exercise. Start by using Prop 27 to find the CDF (take A = {Y ⩽ y}).

Now we look at a version of a result known as Wald’s equation, which allows


us to investigate properties of random sums.

Theorem 29. Let X1, X2, . . . be a sequence of IID RVs and let N be a non-negative
integer-valued RV that is independent of the sequence {Xn : n ⩾ 1}. Define the
random sum T = Σ_{i=1}^{N} Xi. Then E[T] = E[N] E[X1].

Proof. We use the tower property E[T] = E[E[T | N]]. Firstly

E[T | N = n] = E[X1 + · · · + XN | N = n]
             = E[X1 + · · · + Xn]
             = Σ_{i=1}^{n} E[Xi]
             = n E[X1].

Hence E[T | N] = N E[X1] and so

E[T] = E[E[T | N]] = E[N E[X1]] = E[X1] E[N].

Here we note the convention that an empty sum takes the value 0, i.e. Σ_{i=1}^{0} Xi = 0.
This ensures that the random sum T is well defined when N can take the value 0.

Example 17. Suppose I have a chicken that lays N eggs per week, where the weights
of eggs are IID N(µE, σE²) random variables and N ∼ Poi(λ), independently of the
egg weights. Find the expected total weight of eggs laid in a week.

Solution. Write Xi for the weight of the i-th egg, and T = Σ_{i=1}^{N} Xi for the total
weight of eggs laid in a week.

Since N and all the Xi's are mutually independent, apply Wald’s equation to find
that E[T] = E[N] E[X1] = λµE.
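(A simulation check of Wald's equation in this setting, not in the original notes; λ = 5, µE = 60 and σE = 5 are arbitrary choices. Weeks with N = 0 contribute a total of 0, matching the empty-sum convention above.)

    # Sketch: N ~ Poisson(lambda) eggs per week, each weight ~ N(mu_E, sigma_E^2); check E[T] = lambda * mu_E.
    set.seed(1)
    lambda <- 5; mu_E <- 60; sigma_E <- 5
    total <- replicate(50000, sum(rnorm(rpois(1, lambda), mean = mu_E, sd = sigma_E)))
    mean(total)   # compare with lambda * mu_E = 300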

Lastly in this section we note some further properties of conditional expectation,


some of which we shall use in the next section.

Proposition 30. 1. E[aX + bY | Z] = aE[X | Z] + bE[Y | Z] for all a, b ∈ R,

2. if Y ⩾ 0 then E[Y | Z] ⩾ 0,

3. E[1 | Z] = 1,

4. if X and Z are independent then E[X | Z] = E[X],

5. for any ‘nice’ function g, E[Y g(Z) | Z] = g(Z)E[Y | Z],

6. E[X | Z] = E[E[X | Y, Z] | Z] = E[E[X | Z] | Y, Z].

Proof. We won’t prove these results here, but if you are interested in studying probability
further then I strongly suggest that you treat these proofs, assuming that all
relevant random variables are jointly continuous for simplicity, as an exercise. The
exception is item 3, where ‘1’ is a random variable which takes the value 1 with probability
1, and is therefore not continuous. In this case use the PMF, and note the
very simple PMF for this random variable.

4.4 Conditional variance

We have just been looking at expectations associated with conditional distributions;


we can also look at the corresponding variance.

Definition 31. The conditional variance associated with a conditional distribution


X | Y is defined to be

var(X | Y ) = E[(X − E[X | Y ])2 | Y ].

Note that both expected values in this expression are conditional expectations.
If we expand the squared term in the definition and (carefully) use the properties of
expectation, we can find a formula for conditional variance analogous to var(X) = E[X²] − E[X]²:

var(X | Y) = E[(X − E[X | Y])² | Y]   (use Prop 30)
           = E[ X² − 2X E[X | Y] + E[X | Y]² | Y ]
           = E[X² | Y] − 2 E[ X E[X | Y] | Y ] + E[ E[X | Y]² | Y ]
           = E[X² | Y] − 2 E[X | Y] E[X | Y] + E[X | Y]²
           = E[X² | Y] − (E[X | Y])².

As with the expectation notation E[X | Y = y] and E[X | Y], it is important to note
the distinction between var(X | Y = y) and var(X | Y). The former is a number (a
function of y), whereas the latter is a function of the random variable Y.
The following result is an analogue of the tower property for conditional expec-
tation, but for conditional variance. Like the tower property, it is useful for finding
properties of random variables that are defined indirectly or have complicated distri-
butions which mean that finding the PDF or PMF is difficult.

Theorem 32. For any random variables X and Y for which var(X) is defined, we
have
var(X) = E[var(X | Y )] + var(E[X | Y ]).

Proof. Taking expectations of the formula for var(X | Y ) and using the tower property,
we find that

E[var(X | Y )] = E[E[X 2 | Y ]] − E[(E[X | Y ])2 ] = E[X 2 ] − E[(E[X | Y ])2 ].

Considering the variance of E[X | Y ], we have

var(E[X | Y ]) = E[(E[X | Y ])2 ] − (E[E[X | Y ]])2 = E[(E[X | Y ])2 ] − (E[X])2 .

Putting these two results together we get

E[var(X | Y)] + var(E[X | Y]) = E[X²] − E[(E[X | Y])²] + E[(E[X | Y])²] − (E[X])²
                              = E[X²] − (E[X])² = var(X),

as required.

This result has many important applications in probability and statistics (perhaps
most immediately in Analysis of Variance and linear regression). Here we look at a
short probabilistic example.

Example 18. Recall Example 14, where X | Y ∼ Poi(Y ) with Y ∼ Exp(λ) and we
found that E[X] = E[Y ] = λ−1 . Find var(X).

Solution. Since (X | Y = y) ∼ Poi(y), we have

E[X | Y = y] = y and var(X | Y = y) = y.

Therefore E[X | Y ] = Y and var(X | Y ) = Y . Now we can use the law of total variance
to find that

var(X) = E[var(X | Y)] + var(E[X | Y])
       = E[Y] + var(Y)
       = 1/λ + 1/λ².

Note that here we have E[X] = E[Y ], but var(X) > var(Y ). In var(X) there is a
contribution from the variability of the Poisson distribution and also a contribution
from the fact that X is an average of many Poisson distributions.
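(Extending the simulation sketch given after Example 14, again with λ = 2 assumed, the variance can be checked in the same way; this is not part of the original notes.)

    # Sketch: Y ~ Exp(lambda), X | Y = y ~ Poisson(y); check var(X) = 1/lambda + 1/lambda^2.
    set.seed(1)
    lambda <- 2
    y <- rexp(100000, rate = lambda)
    x <- rpois(100000, lambda = y)
    var(x)   # compare with 1/lambda + 1/lambda^2 = 0.75
    var(y)   # compare with 1/lambda^2 = 0.25: var(X) exceeds var(Y), as noted above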
(Sketch: the Poisson mass functions P(X = x | Y = y) for means y = 1, 5, 10, say, with the
marginal P(X = x) drawn over the top — there is more variability in X than in any single one
of the X | Y = y distributions. A similar idea: X the height of a randomly chosen person, Y
their age.)
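A quick simulation check of Example 18 can be run in R; this is only a sketch, with the value
of λ and the sample size chosen purely for illustration.

    # Simulation check of Example 18: Y ~ Exp(lambda), X | Y ~ Poi(Y).
    set.seed(1)
    lambda <- 2                           # illustrative choice
    n <- 1e6
    y <- rexp(n, rate = lambda)
    x <- rpois(n, lambda = y)
    mean(x); var(x)                       # simulated E[X] and var(X)
    1/lambda; 1/lambda + 1/lambda^2       # theoretical values from Example 18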

Finding the variance of Xn in Example 15 and of T in Theorem 29 and Example 17


using this method are left as problems.
Diagram with ‘time’ evolving from (X, Y ) random to knowing Y and then X; and
considering what is the mean/distribution of X?
‘time’ distn mean
Both unknown: fX , E[X]
Y = y known: fX|Y , E[X|Y ]
X = x known:

4.5 Problems
1. Find P {Y > 1/2 | X < 1/2} if the joint PDF of X and Y is given by

    fX,Y (x, y) = (6/7) (x² + xy/2),   0 < x < 1, 0 < y < 2.

2. A boy and a girl arrange to meet at a certain location during the lunch hour.
Denote the boy’s arrival time by X and the girl’s by Y ; and assume that
 
    fX,Y (x, y) = (3/7) y (x + y/2),   0 ⩽ x ⩽ 1, 0 ⩽ y ⩽ 2.

(a) Find the marginal density functions of X and of Y .


(b) Given that the girl arrived at Y = y, 0 < y < 2, what is the probability
density function of the boy’s arrival time?
(c) Given that the girl arrived at Y = 1/2, what is the probability that the
boy arrived after her?
(d) Find the boy’s expected arrival time, given that the girl arrives at Y = 1/2.

3. Complete the exercise just after Example 11.

4. Repeat the justification of Definition 22, but conditioning on Y ∈ (y − h, y + h).


Why is this helpful/reassuring?

5. Suppose that X ∼ U(0, 1) and Y | X = x ∼ U(0, x). That is, X is uniform on


the interval (0, 1) and, given X = x, Y is uniform on (0, x).

(a) Write down the PDFs fX and fY |X .


(b) Find fX,Y and hence fX|Y .
(c) Use the tower property to find E[Y ].

6. Prove Proposition 28. (See the hint given in the ‘proof’ in the notes.)

7. Determine P (Y ⩽ 1), where Y | X ∼ Poi(X) and X ∼ exp(1).

8. (Exercise from Section 4.4.) Suppose that {Xi , i = 1, 2, . . . } are IID and inde-
pendent of the non-negative integer-valued RV N and set T = Σ_{i=1}^{N} Xi. Find
var(T) in terms of the moments of X1 and N.

9. Suppose that Xn is as in Example 15: X0 = 0 and, for n = 1, 2, . . . , Xn =


Xn−1 + Yn, where the Yi are IID and P (Yi = 1) = P (Yi = 2) = P (Yi = 3) = 1/3.
We showed in Example 15 that E[Xn ] = 2n. Determine var(Xn ).

Part II

Transformations, generating
functions and inequalities
5 Transformations
Very often the random variables that are of interest in a practical situation do not
have their distributions directly available. They may be defined in terms of simpler
random variables whose distributions we do know. For example, the lifetime of an
electronic appliance might be the minimum of the lifetimes of its components. A
simpler example involves rescaling: given the PDF of daily maximum temperatures
at a certain location in Fahrenheit, can we translate this into the PDF of the same
daily maximum temperatures in Celsius? If we know the distribution of velocities of
molecules in a gas then what can we say about the distribution of kinetic energies,
which are proportional to the squared velocities?

5.1 1-dimensional transformations

Recall the following programme (from MATH1001) for finding the PDF of Y = g(X),
where X is a continuous RV with known PDF fX and g : R → R.

1. Determine possible values for Y .


2. For each possible value y of Y , find FY (y) = P (Y ⩽ y) by writing the event
{Y ⩽ y} in terms of X and y and using the PDF or CDF of X.
3. Differentiate FY (y) to obtain fY (y).

Example 19. Find the PDF of Y = cX, where c > 0 and X ∼ exp(λ) with λ > 0.

Solution. First note that Y = cX ⩾ 0, since X ⩾ 0 and c > 0.


Now Y ⩽ y ⇐⇒ cX ⩽ y ⇐⇒ X ⩽ y/c, so for y > 0 we have

FY (y) = P (Y ⩽ y) = P (X ⩽ y/c) = FX (y/c) = 1 − e−λy/c .

Therefore fY (y) = (d/dy) FY (y) = (λ/c) e^{−(λ/c)y} for y > 0, and fY (y) = 0 otherwise. In
other words, Y ∼ exp(λ/c).
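A short simulation sketch of this example in R (the values of λ and c are illustrative choices,
not part of the example):

    # Example 19 check: if X ~ exp(lambda) then Y = c X ~ exp(lambda / c).
    set.seed(1)
    lambda <- 1.5; cc <- 2               # illustrative values
    x <- rexp(1e6, rate = lambda)
    y <- cc * x
    1 / mean(y)                          # empirical rate of Y
    lambda / cc                          # theoretical rate lambda / c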

Example 20. If X ∼ N(0, 1) and Y = g(X) = X², show that Y ∼ χ²_1, i.e.

    fY (y) = 0 for y ⩽ 0,   and   fY (y) = (1/√(2πy)) e^{−y/2} for y > 0.

Solution. First, X takes values in R, so Y = X 2 takes values in R+ = {y ∈ R : y ⩾


0}. We therefore have FY (y) = 0 for y < 0. For y ⩾ 0 we have

    Y ⩽ y ⇐⇒ X² ⩽ y ⇐⇒ −√y ⩽ X ⩽ √y,

so

    FY (y) = P (Y ⩽ y) = P (−√y ⩽ X ⩽ √y) = FX (√y) − FX (−√y).

Now we differentiate FY (y) to find that, for y > 0,

    fY (y) = (d/dy) FY (y) = (d/dy) (FX (√y) − FX (−√y))
           = (1/(2√y)) { fX (√y) + fX (−√y) }        (e.g. (d/dy) FX (√y) = fX (√y) · 1/(2√y))
           = (1/√(2πy)) e^{−y/2}.

Thus Y ∼ χ²_1, as required.

(Note how steps 1–3 of the programme above appear in each example, and how the ‘⇐⇒’ step and
the expression for FY (y) differ between the two examples.)
We now introduce some notation which will make our working in these problems
a bit simpler. We write A = {x : fX (x) > 0} for the values that X can take and
g(A) = {g(x) : x ∈ A} for the values that Y can take. A is called the support of
X or fX and g(A) is the support of Y = g(X) or fY . In the above examples A and
g(A) are as follows:

    Example 19:   A = R+,   g(A) = R+.

    Example 20:   A = R,    g(A) = R+.

Monotone transformations There is an important point to note about Exam-


ple 19 compared to Example 20. In the former case the calculation of which X values
correspond to {Y ⩽ y} is much simpler because the function g is monotone. Indeed
we can rewrite these calculations as

    FY (y) = P (Y ⩽ y) = P (g(X) ⩽ y)
           = P (X ⩽ g^{−1}(y)) = FX (g^{−1}(y))          if g is increasing,
           = P (X ⩾ g^{−1}(y)) = 1 − FX (g^{−1}(y))      if g is decreasing,

where g^{−1} is well-defined because g is monotone on the domain A. (sketches)


Now differentiate to obtain

    fY (y) = (d/dy) FX (g^{−1}(y)) = fX (g^{−1}(y)) · (d/dy) g^{−1}(y)               if g is increasing,
    fY (y) = (d/dy) [1 − FX (g^{−1}(y))] = −fX (g^{−1}(y)) · (d/dy) g^{−1}(y)         if g is decreasing.

Since fX ⩾ 0 and (d/dy) g^{−1}(y) is positive or negative according as g is increasing or
decreasing, we have established the following result. (It is an exercise to write this
up as a formal proof.)

Theorem 33. Let X be a continuous RV with support A and suppose that g : A → R


is strictly monotone and differentiable on A. Then the PDF of Y = g(X) is given by

    fY (y) = fX (g^{−1}(y)) · |(d/dy) g^{−1}(y)|   for y ∈ g(A),
    fY (y) = 0                                      otherwise.

Example 21. Suppose that X ∼ N(µ, σ²) and Y = (X − µ)/σ. Show that Y ∼ N(0, 1).

Solution. First observe that we can write Y = g(X), where g(x) = (x − µ)/σ. Now
g is strictly increasing (since σ > 0) and differentiable on A = R. Thus, since X is
continuous, we can apply Theorem 33 to find the distribution of Y .
First note that fX (x) = (1/(√(2π) σ)) exp{ −(1/2) ((x − µ)/σ)² } for all x ∈ R. To find g^{−1} we note
that
y = (x − µ)/σ ⇐⇒ x = µ + σy,

so g^{−1}(y) = µ + σy, and therefore (d/dy) g^{−1}(y) = σ.

Also, since A = R and y = x/σ − µ/σ is linear, g(A) = R. (picture)


Thus, for y ∈ R,

    fY (y) = fX (g^{−1}(y)) · |(d/dy) g^{−1}(y)|
           = (1/(√(2π) σ)) exp{ −(1/2) (((σy + µ) − µ)/σ)² } × σ
           = (1/√(2π)) e^{−y²/2},

i.e. Y ∼ N(0, 1).

Note that this result can easily be generalised to show that if X ∼ N(µ, σ 2 ) then
Y = a + bX has distribution Y ∼ N(a + bµ, b2 σ 2 ) for any a, b ∈ R with b ̸= 0 (see
problems).

Aside Theorem 33 can be applied, indirectly, to non-monotone transformations.


This is done by breaking up the domain of X into pieces upon which the transformation
g is monotone, and then adding up the resulting pieces of the PDF of Y.

5.2 2-dimensional transformations

Here we deal only with one-to-one transformations, leading to an analogue of Theo-


rem 33. We briefly discuss how non one-to-one transformations can be dealt with at
the end of the section.
First we need to recall some ideas and notation from multivariable calculus. Recall
that a function f : X → Y is called one-to-one or injective if (and only if) for all
a, b ∈ X, f (a) = f (b) =⇒ a = b. This means that f never maps distinct elements
of its domain to the same element of its target/codomain. Now suppose that the
transformation T : (x1 , x2 ) 7→ (y1 , y2 ) is one-to-one in some domain A ⊆ R2 . Then T
has an inverse function on the domain T (A), say T −1 = H : (y1 , y2 ) 7→ (x1 , x2 ) with
components H = (H1 , H2 ). The Jacobian determinant of H is defined by

    JH (y1, y2) = det [ ∂H1/∂y1 (y1, y2)    ∂H1/∂y2 (y1, y2) ]
                      [ ∂H2/∂y1 (y1, y2)    ∂H2/∂y2 (y1, y2) ].

As in the previous section, we let A = {(x1 , x2 ) : fX1 ,X2 (x1 , x2 ) > 0} be the
support of fX1 ,X2 and T (A) = {(y1 , y2 ) = T (x1 , x2 ) : (x1 , x2 ) ∈ A} be the support of
fY1 ,Y2 , i.e. the (y1 , y2 ) values that will have non-zero probability density.

Theorem 34. Let (X1 , X2 ) be a jointly continuous random vector with PDF fX1 ,X2 (x1 , x2 )
and suppose that T : A ⊆ R2 → R2 is one-to-one, so H = T −1 exists, and that H has
continuous first-order partial derivatives in T (A). Then (Y1 , Y2 ) = T (X1 , X2 ) has a
joint PDF given by

    fY1,Y2 (y1, y2) = fX1,X2 (H1(y1, y2), H2(y1, y2)) · |JH (y1, y2)|   for (y1, y2) ∈ T(A),
    fY1,Y2 (y1, y2) = 0                                                  otherwise.

Proof. The proof is omitted, but can be viewed as an application of the change of
variables formula from first year calculus and the definition of a joint PDF.

A point that sometimes makes calculations simpler is that, under the conditions
of the theorem, we have J_{T^{−1}} = (J_T)^{−1}. Similarly, under the other conditions of
the theorem, T has continuous first-order partial derivatives if and only if T −1 does.
(Think polar co-ordinates, etc.)

We now work through some examples using this theorem.

Example 22. Let X1 ∼ exp(1) and X2 ∼ exp(1) be independent and define

Y1 = X 1 + X2 , Y2 = X1 − X2 .

Find the joint PDF of Y1 and Y2 .

Solution. The joint PDF of X1 and X2 is



    fX1,X2 (x1, x2) = fX1 (x1) fX2 (x2) = e^{−x1−x2}   for x1 ⩾ 0, x2 ⩾ 0,   and 0 otherwise,

and T : (x1 , x2 ) 7→ (y1 , y2 ) is defined by

y 1 = x1 + x2 , y2 = x1 − x2 .

We can uniquely solve these equations for x1 and x2 : (simultaneous linear equations)

    x1 = (y1 + y2)/2 = H1(y1, y2),
    x2 = (y1 − y2)/2 = H2(y1, y2),

so T is one-to-one and H(y1 , y2 ) = (H1 (y1 , y2 ), H2 (y1 , y2 )) is its inverse. Since H is


linear it has continuous first-order partial derivatives and the Jacobian JH is given by

    JH (y1, y2) = det [ ∂H1/∂y1   ∂H1/∂y2 ] = det [ 1/2     1/2 ] = −1/2.
                      [ ∂H2/∂y1   ∂H2/∂y2 ]       [ 1/2    −1/2 ]

We can therefore apply Theorem 34.


To find T (A), first note that x1 , x2 ⩾ 0 implies that y1 = x1 + x2 ⩾ 0. Now we fix
y1 ⩾ 0 and determine what values of (x1 , x2 ) are possible, then see what values of y2
this implies.
Fix y1 = c ⩾ 0. Then y1 = x1 + x2 implies that x2 = c − x1 , and to have x1 , x2 ⩾ 0
we need x1 ⩾ 0 and c − x1 ⩾ 0, i.e. x1 ∈ [0, c]. Now

y2 = x1 − x2 = x1 − (c − x1 ) = 2x1 − c, x1 ∈ [0, c];

and 0 ⩽ x1 ⩽ c ⇐⇒ −c ⩽ 2x1 − c ⩽ c, i.e. y2 ∈ [−c, c] where c is a fixed value of y1 .


Therefore
T (A) = {(y1 , y2 ) ∈ R2 : y1 ⩾ 0, −y1 ⩽ y2 ⩽ y1 }.

A diagram helps to understand this:


[Diagram: in the (x1, x2)-plane, the line x1 + x2 = c for a fixed c ⩾ 0; in the (y1, y2)-plane,
the wedge T(A) between the lines y2 = y1 and y2 = −y1 (y1 ⩾ 0), with the line y1 = c meeting
the wedge in the segment −c ⩽ y2 ⩽ c.]

Thus, for (y1 , y2 ) ∈ T (A) we have

    fY1,Y2 (y1, y2) = fX1,X2 (H1(y1, y2), H2(y1, y2)) × |JH (y1, y2)| = e^{−(y1+y2)/2 − (y1−y2)/2} × 1/2,

so

    fY1,Y2 (y1, y2) = (1/2) e^{−y1}   for y1 ⩾ 0, −y1 ⩽ y2 ⩽ y1,   and 0 otherwise.

Example 23. Suppose that X1 and X2 are IID exponential RVs with parameter λ.
Find the joint PDF of Y1 = X1/X2 and Y2 = X1 + X2; and thus the PDF of Y1.

Solution. The joint PDF of X1 and X2 is



    fX1,X2 (x1, x2) = fX1 (x1) fX2 (x2) = λ² e^{−λ(x1+x2)}   for x1, x2 > 0,   and 0 otherwise,

and we define T : R²+ → R²+ by

    (x1, x2) ↦ (y1, y2) = ( x1/x2 , x1 + x2 ).

Solving for x1 and x2 , we find that

    y1 = x1/x2   and   y2 = x1 + x2.

From the first equation x1 = x2 y1, so y2 = x2 y1 + x2 = x2 (y1 + 1), and hence

    x2 = y2/(1 + y1)   and   x1 = y1 y2/(1 + y1),

so the solution is unique and T^{−1} = H is defined by H(y1, y2) = ( y1 y2/(1 + y1), y2/(1 + y1) ).
This function is clearly suitably differentiable on R²+, and

    JH (y1, y2) = det [ y2/(1 + y1)²     y1/(1 + y1) ] = ··· = y2/(1 + y1)².
                      [ −y2/(1 + y1)²    1/(1 + y1)  ]

Since X1 and X2 are exponential we have A = {(x1 , x2 ) : x1 , x2 > 0}. To



determine T (A) first note that T (A) ⊆ {(y1 , y2 ) : y1 , y2 > 0} is immediate from the
form of T . The formula above for H implies that for every (y1 , y2 ) with y1 , y2 > 0
there is a corresponding (x1 , x2 ) with x1 , x2 > 0; so T (A) = {(y1 , y2 ) : y1 , y2 > 0}.
We could also take the approach above. Since x1 , x2 ∈ (0, ∞) we have y1 =
x1 /x2 ∈ (0, ∞). Now fix y1 = c ∈ (0, ∞). The corresponding (x1 , x2 ) values are given
by x2 = c−1 x1 , for x1 ∈ (0, ∞) (since x2 ⩾ 0 automatically if x1 ⩾ 0). This implies
that y2 = x1 + x2 = (1 + c−1 )x1 for x1 ∈ (0, ∞), so y2 ∈ (0, ∞).
[Diagram: in the (x1, x2)-plane, the ray x2 = c^{−1} x1 (x1 > 0); in the (y1, y2)-plane, the
vertical line y1 = c, along which y2 ranges over (0, ∞).]

Thus, for (y1, y2) ∈ T(A) we have

    fY1,Y2 (y1, y2) = fX1,X2 (H1(y1, y2), H2(y1, y2)) · |JH (y1, y2)|
                    = fX1,X2 ( y1 y2/(1 + y1), y2/(1 + y1) ) · y2/(1 + y1)²
                    = λ² e^{−λ y2} y2/(1 + y1)²,

so

    fY1,Y2 (y1, y2) = λ² e^{−λ y2} y2/(1 + y1)²   for y1, y2 > 0,   and 0 otherwise.

So, for y1 > 0,

    fY1 (y1) = ∫_{−∞}^{∞} fY1,Y2 (y1, y2) dy2 = (1/(1 + y1)²) ∫_0^{∞} λ² e^{−λ y2} y2 dy2 = 1/(1 + y1)²,

that is, (the last integral equals 1 because the integrand is the Gamma(2, λ) PDF; note also that Y1 ⊥ Y2)

    fY1 (y1) = 1/(1 + y1)²   for y1 > 0,   and fY1 (y1) = 0 for y1 ⩽ 0.

Here we were interested only in the distribution of Y1 , but (x1 , x2 ) 7→ x1 /x2 is not
one-to-one. By including Y2 we get a transformation which is one-to-one, enabling
the use of this method. (We could also have used the method of Example 4: integrate
fX1 ,X2 over the region {(x1 , x2 ) : x1 /x2 ⩽ y1 } for arbitrary y1 ⩾ 0 to find the CDF
FY1 (y1 ), then differentiate the CDF to get the PDF.)
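A simulation sketch of this example (the value of λ below is an arbitrary illustration; the
distribution of Y1 does not depend on it):

    # Example 23 check: Y1 = X1/X2 for IID exponentials has CDF F(t) = t/(1 + t).
    set.seed(1)
    lambda <- 3                                   # illustrative value
    x1 <- rexp(1e6, rate = lambda)
    x2 <- rexp(1e6, rate = lambda)
    y1 <- x1 / x2
    t <- c(0.5, 1, 2)
    sapply(t, function(s) mean(y1 <= s))          # empirical CDF at t
    t / (1 + t)                                   # theoretical CDF values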

Aside In this section on 2-dimensional transformations we have only considered


one-to-one transformations. To deal with non one-to-one transformations we can
in principle apply an analogue of the direct approach we used for non-monotone
transformations of a single random variable, but this can quickly get very complicated.
The usual approach is to break the transformation up into sections of its domain upon
which it is one-to-one. We do not go into the details. (cf. aside at the end of Sec. 5.1.)

5.3 Extension to more than 2 dimensions

Unsurprisingly the methods of the previous section can be extended to deal with more
than 2 random variables at a time. The only additional complicating factor is the
slightly cumbersome notation (which can be avoided with vector notation). Here we
just state the general multidimensional version of Theorems 33 and 34.

Theorem 35. Suppose that the mapping T : A ⊆ R^n → R^n defined by yi = yi(x1, ..., xn)
is one-to-one and has a continuously differentiable inverse mapping H defined by
xi = xi(y1, ..., yn) = Hi(y1, ..., yn). Suppose also that (X1, ..., Xn) has joint
PDF fX1 ,...,Xn and support A. Then the joint PDF of (Y1 , . . . , Yn ) = T (X1 , . . . , Xn )
is

    fY1,...,Yn (y1, ..., yn) = fX1,...,Xn (H1(y1, ..., yn), ..., Hn(y1, ..., yn)) · |JH (y1, ..., yn)|
                               for (y1, ..., yn) ∈ T(A),

and fY1,...,Yn (y1, ..., yn) = 0 otherwise.

Using vector notation: fY (y) = fX (H(y)) · |∂H/∂y|, for y ∈ T(A).

5.4 Problems

1. If U is uniformly distributed on [−1, 1], find

(a) P {|U | > 1/2};


(b) P { sin(πU/2) > 1/3 };

(c) the probability density function of the random variable |U |.

2. A random variable X has distribution function FX . What is the distribution


function of Y = max{0, X}?

3. If X is a positive random quantity with probability density function

fX (x) = α β xβ−1 exp(−α xβ ), x ⩾ 0,

where α and β are two positive constants, use the monotone transformation
theorem to show that Y = X β has a well-known named distribution and find
its parameter/s.

4. Write out a formal (statement and) proof of Theorem 33 in the notes, which
gives the PDF of a monotone transformation of a continuous random variable.

5. Suppose that X is an exponential random variable with parameter λ > 0, that


is, fX (x) = λ e^{−λx} for x > 0, and fX (x) = 0 for x ⩽ 0.

Use the theorem on monotone transformations from the notes to find the prob-
ability density function of the random variable Y = log X.

6. Let X and Y be independent random variables, each having probability density


function f(x) = λ e^{−λx} for x > 0, and f(x) = 0 otherwise,

and let U = X + Y and V = X − Y .

(a) Find the joint PDF of U and V .


(b) Hence derive the marginal probability density functions of U and V .
(c) Are U and V independent? Justify your answer.

7. If X, Y and Z are independent random variables, each being exponentially


distributed with mean 1, derive the joint probability density function of U =
X + Y , V = X + Z and W = Y + Z. Hence, or otherwise, find the joint
probability density function of U and V .

8. If U is uniform on (0, 2π) and V, independent of U, is exponential with
parameter 1, show that (X, Y) = (√(2V) cos U, √(2V) sin U) ∈ R² are independent
standard normal random variables. (To show that (X, Y ) takes values in all of
R2 , it might help to note that (U, V ) 7→ (X, Y ) is (up to scalar multiples) the
transformation from polar to rectangular coordinate systems.)

6 Generating Functions

Generating functions are tools for studying probability distributions; they are defined
as the expectation of a certain function of a random variable with that distribution.
Using generating functions often makes calculations involving random variables much
simpler than dealing directly with CDFs, PDFs or PMFs. Generating functions are
particularly useful when analysing sums of independent random variables.

In MATH1001 you saw probability generating functions (PGF) of random vari-


ables which take values in {0, 1, 2, . . . }. The generating function we look at here
has similar properties and uses but applies to a much wider class of distributions /
random variables.

6.1 Moment generating functions

Definition 36. For any random variable X, the moment generating function
(MGF) of X, MX : A ⊆ R → [0, ∞) is defined by MX (t) = E[etX ], so long as E[etX ]
exists and is finite in an open interval containing t = 0. The domain A of MX consists
of all values t for which E[etX ] exists and is finite.

If E[etX ] is not finite in an open interval containing t = 0 then we say that the
MGF does not exist.

Aside The condition ‘E[etX ] exists and is finite in an open interval containing t = 0’
can be stated equivalently as ‘there exists h > 0 such that E[etX ] exists and is finite
for all t ∈ (−h, h)’.

Example 24. Find the MGF of X ∼ N (0, 1).

Solution. The approach is to apply the working definition of the expected value of
a function of a random variable and calculate the resulting integral.

For every t ∈ R, we have

    MX (t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx           (working definition of E)
           = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x² − 2xt)/2} dx                      (rearrange)
           = (1/√(2π)) ∫_{−∞}^{∞} e^{−((x − t)² − t²)/2} dx                 (completing the square)
           = e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x − t)²/2} dx               (rearrange)
           = e^{t²/2}.                                                       (since the integrand is a PDF)

So MX (t) = e^{t²/2} for all t ∈ R.

Example 25. Determine the MGF of a random variable X with an exponential dis-
tribution, i.e. fX (x) = λe−λx for x > 0, where λ > 0 is a parameter.

Solution. Again we just use the definition of the MGF:

    MX (t) = E[e^{tX}] = ∫_0^{∞} e^{tx} λ e^{−λx} dx = λ ∫_0^{∞} e^{x(t−λ)} dx.

Now ∫_0^{∞} e^{x(t−λ)} dx diverges if t − λ ⩾ 0, so MX (t) does not exist for t ⩾ λ. (Sketch e^{cx}.)
For t − λ < 0, i.e. t < λ, we have

    MX (t) = λ ∫_0^{∞} e^{x(t−λ)} dx = λ [ e^{x(t−λ)}/(t − λ) ]_{x=0}^{∞} = (λ/(t − λ)) (0 − 1) = λ/(λ − t).

So MX (t) = λ/(λ − t), for t < λ.

Example 26. Find the MGF of a standard Cauchy random variable, with PDF
fX (x) = 1/(π(1 + x²)), x ∈ R.

Solution. Here the calculation is

    MX (t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx}/(π(1 + x²)) dx.

So MX (0) = ∫_{−∞}^{∞} 1/(π(1 + x²)) dx = 1. However (verify at home):

  • when t > 0 we have e^{tx}/(π(1 + x²)) → ∞ as x → ∞, so E[e^{tX}] is not finite;

  • when t < 0 we have e^{tx}/(π(1 + x²)) → ∞ as x → −∞, so E[e^{tX}] is not finite.

Therefore E[etX ] exists and is finite for t = 0 only. This is not an open interval
containing 0, so the MGF does not exist.

6.2 Properties of MGFs

We now investigate some properties of moment generating functions, which will help
us to see how they can be useful.

1. If we evaluate a MGF at t = 0 we get MX (0) = E[e0 ] = E[1] = 1.


This is not so interesting in its own right, but it can act as a simple error-check
on an MGF we have calculated.

2. A key feature of the MGF involves its derivatives:

       M_X^{(n)}(t) = (d^n/dt^n) ∫_{−∞}^{∞} e^{tx} fX (x) dx
                    = ∫_{−∞}^{∞} (d^n/dt^n)(e^{tx}) fX (x) dx      (rearrange)
                    = ∫_{−∞}^{∞} x^n e^{tx} fX (x) dx               (evaluate the derivative)
                    = E[X^n e^{tX}].                                 (LUS)

   Setting t = 0 we get the relation

       M_X^{(n)}(0) = E[X^n].

3. Lastly, consider the Taylor series expansion of MX (t) about t0 = 0:

       MX (t) = Σ_{n=0}^{∞} a_n t^n = Σ_{n=0}^{∞} (M_X^{(n)}(0)/n!) t^n = Σ_{n=0}^{∞} (E[X^n]/n!) t^n.

This gives another possible way to find the (raw) moments of X from the MGF:
Equating coefficients of tn above we have E[X n ] = n! an . (If you can identify
the Maclaurin series for a MGF then this allows you to find moments without
having to do any differentiation.)

Consider for example the MGF MX (t) = λ/(λ − t), t < λ, of X ∼ Exp(λ).

1. MX (0) = λ/λ = 1.

2. The mean can be found as

       E[X] = M_X'(0) = [ λ/(λ − t)² ]_{t=0} = 1/λ,

   and we can differentiate again to find E[X²] and thus var(X).

3. From 1/(1 − x) = Σ_{n=0}^{∞} x^n we have MX (t) = λ/(λ − t) = 1/(1 − t/λ) = Σ_{n=0}^{∞} (t/λ)^n
   = Σ_{n=0}^{∞} (1/λ^n) t^n. It then follows that a_n = E[X^n]/n! = 1/λ^n, so E[X^n] = n!/λ^n.
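These moment formulae are easy to sanity-check by simulation; the following R sketch uses an
illustrative value of λ and sample size.

    # Check E[X^n] = n!/lambda^n for X ~ Exp(lambda).
    set.seed(1)
    lambda <- 2                                        # illustrative value
    x <- rexp(1e6, rate = lambda)
    sapply(1:3, function(n) mean(x^n))                 # simulated raw moments
    sapply(1:3, function(n) factorial(n) / lambda^n)   # theoretical moments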

Uniqueness of generating functions Because MGFs are power series, the rich
theory of power series (as widely used in applied mathematics) applies to them. This
includes results concerning the radius of convergence and the validity of term-by-term
integration and differentiation within that radius (which we have implicitly used in
our calculations above). One other such result is the Inversion Theorem, which tells
us how to calculate a CDF from a given MGF. (We don’t present this here as it
involves complex integration.) This has the crucial consequence that a probability
distribution is uniquely specified by its MGF; so long as the MGF exists.

Theorem 37. Suppose that the random variables X and Y have MGFs MX and MY .
Then X and Y have the same distribution (CDF) if and only if MX (t) = MY (t) for
all t in an open interval containing 0.

This means that, in most situations, the MGF of a random variable uniquely
specifies the distribution of the random variable. The condition ‘for all t in an open

interval containing 0’ (or ‘so long as the MGF exists’) is crucial though. See, for
example, the Cauchy distribution above (Example 26).
It is important to note that this theorem does not mean “if two distributions have
the same moments then they have the same distribution”. (This statement is false in
general.)
Next we touch briefly on linear transformations of random variables.

Theorem 38. If X has MGF MX then Y = a + bX has MGF MY (t) = eat MX (bt).

Proof. Using the definition of the MGF we have

MY (t) = E[etY ] = E[et(a+bX) ] = E[eat ebtX ]

= eat E[ebtX ] = eat MX (bt).

Also note that if b = 0 then Y = a and MY (t) = eat for all t ∈ R. If b ̸= 0


then MX (t) being defined for t ∈ (−h, h) implies that MY (t) is defined for t ∈
(−h/|b|, h/|b|). In either case the calculations above are therefore valid in a suitable
range of t values and the MGF of Y is well-defined.

We have seen in Section 5.1 that if X ∼ N(µ, σ 2 ) then a + bX is also Normally


distributed, with mean a + bµ and variance b2 σ 2 . Using this result and Theorem 38
above, it is an exercise to show that X ∼ N(µ, σ²) has MGF MX (t) = exp(µt + ½σ²t²).

6.3 Joint moment generating functions

Just as we can extend the ideas of probability distributions to joint distributions in


order to deal with more than one random variable at a time, we can extend MGFs
to joint MGFs. The appropriate generalisation is as follows.

Definition 39. The joint moment generating function MX1 ,...,Xn : A ⊆ Rn → R+


of the random variables X1 , . . . , Xn is defined by

MX1 ,...,Xn (t1 , . . . , tn ) = E[et1 X1 et2 X2 . . . etn Xn ] = E[et1 X1 +···+tn Xn ],

so long as the expectation exists and is finite in an open set containing the origin.

The joint MGF has properties corresponding to single variable MGFs and the
relationship between joint and marginal CDFs (and PDFs and PMFs):

1. There is a one-to-one correspondence between joint MGFs and joint CDFs.

2. The MGF of Xi (the marginal MGF) can be obtained from the joint MGF
MX1 ,...,Xn (t1 , . . . , tn ) by letting all the tj (j = 1, . . . , n) be zero except ti . This
is because

MX1 ,...,Xn (0, . . . , 0, t, 0, . . . , 0) = E[e0+···+tXi +···+0 ] = E[etXi ].

3. The moments of the joint distribution can be calculated by evaluating deriva-


tives of the MGF at 0. For example

       (∂/∂ti) MX1,...,Xn (t1, ..., tn) = (∂/∂ti) E[e^{t1 X1} ... e^{tn Xn}] = E[Xi e^{t1 X1} ... e^{tn Xn}],

   so (∂/∂ti) MX1,...,Xn (0, ..., 0) = E[Xi]. Similarly we can show that

       (∂²/∂ti²) MX1,...,Xn (0, ..., 0) = E[Xi²]   and   (∂²/∂ti ∂tj) MX1,...,Xn (0, ..., 0) = E[Xi Xj].
∂t2i ∂ti ∂tj

One of the main uses for generating functions is in calculations involving indepen-
dent random variables. A key property for these calculations is given by the following
theorem.

Theorem 40. X1 , . . . , Xn are mutually independent if and only if

MX1 ,...,Xn (t1 , . . . , tn ) = MX1 (t1 ) . . . MXn (tn ).

Proof. We give a proof for the case when X1 , . . . , Xn are jointly continuous, so the
RVs X1 , . . . , Xn are independent if and only if

fX1 ,...,Xn (x1 , . . . , xn ) = fX1 (x1 ) . . . fXn (xn ).



Only if Suppose that X1 , . . . , Xn are independent. Then

MX1 ,...,Xn (t1 , . . . , tn ) = E[et1 X1 . . . etn Xn ] (defn)

= E[et1 X1 ] . . . E[etn Xn ] (Thm 12)

= MX1 (t1 ) . . . MXn (tn ).

If Suppose that MX1 ,...,Xn (t1 , . . . , tn ) = MX1 (t1 ) . . . MXn (tn ). Then by the above
part we have that MX1 ,...,Xn is the same as the joint MGF of independent variables
with the same distribution as X1 , . . . , Xn . Since the MGF characterises the joint
distribution of X1 , . . . , Xn , we have that X1 , . . . , Xn must be independent.

6.4 Sums of independent random variables

Earlier in the module we looked at various properties of sums of random variables:


mean, variance and distribution. MGFs offer a very powerful way to compute the
distribution of sums of independent random variables. The following theorem is the
main reason for the usefulness of generating functions.
Theorem 41. Suppose that X1, ..., Xn are mutually independent and let Y = Σ_{i=1}^n ci Xi,
where c1, c2, ..., cn ∈ R are constants. Then

    MY (t) = Π_{i=1}^n MXi (ci t).

Proof.

    MY (t) = E[e^{tY}] = E[e^{t(c1 X1 + ··· + cn Xn)}]
           = MX1,...,Xn (c1 t, ..., cn t)
           = Π_{i=1}^n MXi (ci t)        (by independence).

An important special case is where c1 = · · · = cn = 1.



Theorem 42. If X1, ..., Xn are mutually independent and Y = Σ_{i=1}^n Xi then

    MY (t) = Π_{i=1}^n MXi (t).

Now we see a couple of examples demonstrating the application of this result.

Proposition 43. Suppose that Xi ∼ N(µi, σi²), i = 1, ..., n, are mutually independent
and Y = Σ_{i=1}^n ci Xi, where c1, ..., cn ∈ R. Then Y ∼ N(µ, σ²), where

    µ = Σ_{i=1}^n ci µi   and   σ² = Σ_{i=1}^n ci² σi².

Proof. From the comment after Theorem 38 we know that

    MXi (t) = exp( µi t + ½ σi² t² ),   t ∈ R.

Thus, by Theorem 41,

    MY (t) = Π_{i=1}^n MXi (ci t)
           = Π_{i=1}^n exp( µi ci t + ½ σi² (ci t)² )
           = exp{ (Σ_{i=1}^n ci µi) t + ½ (Σ_{i=1}^n ci² σi²) t² }
           = exp( µt + ½ σ² t² ),   t ∈ R.

So, by the uniqueness of MGFs, we have Y ∼ N(µ, σ²) as required.

An important special case of this is given by taking µi = µX, σi² = σX² and ci = 1/n;
so X1, ..., Xn are IID normal RVs and Y = (1/n) Σ_{i=1}^n Xi = X̄n is the mean of n IID
normal RVs. Then we have X̄n ∼ N(µX, σX²/n). This is the sampling distribution
of the sample mean from a normally distributed population, the probabilistic basis
of the one-sample Z-test for the mean and the one-sample confidence interval for the
mean of a normally distributed population with known population variance.
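As a quick illustration, the following R sketch (with illustrative values of µX, σX and n)
simulates many sample means and compares their mean and variance with µX and σX²/n.

    # Sampling distribution of the mean of n IID normals.
    set.seed(1)
    mu <- 5; sigma <- 2; n <- 10                 # illustrative values
    xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
    c(mean(xbar), var(xbar))                     # simulated mean and variance of X-bar
    c(mu, sigma^2 / n)                           # theoretical values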

Recall that we say that X has a chi-square(d) distribution with r ∈ {1, 2, 3, ...} degrees
of freedom, written X ∼ χ²_r, if

    fX (x) = (1/(2^{r/2} Γ(r/2))) x^{r/2 − 1} e^{−x/2}   for x > 0,   and fX (x) = 0 for x ⩽ 0.

It can be shown (exercise) that the MGF of X is MX (t) = 1/(1 − 2t)^{r/2} (t < 1/2).

Proposition 44. If Xi ∼ χ²_{ri}, i = 1, ..., n, are mutually independent then their sum
Y = Σ_{i=1}^n Xi has a χ²_r distribution, where r = Σ_{i=1}^n ri.

Proof. Using Theorem 41, the MGF of Y is

    MY (t) = Π_{i=1}^n MXi (t) = Π_{i=1}^n (1 − 2t)^{−ri/2} = (1 − 2t)^{−(1/2) Σ_{i=1}^n ri}   (for t < 1/2)
and the conclusion follows from the equivalence of MGFs and distributions.

It’s worth pausing for a moment to consider these last two proofs. How would you
prove these results without using generating functions (see Section 2.1)?
Repeated convolution: set Sn = X1 + ··· + Xn (for n = 1, 2, ...), so fS1 = fX and

    fS_{n+1}(s) = ∫_{−∞}^{∞} fX (x) fSn (s − x) dx.

So finding the distribution of Sn requires doing n − 1 convolutions.


Compare with the proofs of the last two propositions. Generating functions are easier!

6.5 Probability generating functions

We now look back to PGFs of N-valued random variables and see how they relate to
MGFs.

Definition 45. For a RV X taking values in a subset of {0, 1, 2, ...} with PMF pX,
the probability generating function (PGF) ϕX : [0, 1] → [0, 1] is defined by

    ϕX (s) = E[s^X] = Σ_{i=0}^{∞} s^i pX (i).

Note that there are no technical conditions about existence here. For s ∈ [0, 1] we
have 0 ⩽ Σ_{i=0}^{∞} s^i pX (i) ⩽ Σ_{i=0}^{∞} pX (i) = 1, so the expectation in the definition always
exists and is finite.
PGFs are useful because they have lots of properties (most of which have exact
MGF analogues). In the following list, all random variables X, Y , Z, Xi take values
in the non-negative integers.

1. X and Y have the same PGF if and only if they have the same PMF.

2. ϕX (1) = 1.

3. ϕ_X^{(n)}(1) = E[X(X − 1) ... (X − (n − 1))] (the n-th factorial moment of X).

4. ϕ_X^{(n)}(0) = n! pX (n). (We haven’t seen the analogue of this.)

5. If X1, ..., Xn are independent then Y = Σ_{i=1}^n Xi has PGF ϕY (s) = Π_{i=1}^n ϕXi (s).

6. Random variables X1, ..., Xn are independent if and only if their joint PGF
   ϕX1,...,Xn (s1, ..., sn) = E[ Π_{j=1}^n sj^{Xj} ] is given by the product of the marginal
   PGFs, i.e. ϕX1,...,Xn (s1, ..., sn) = ϕX1 (s1) ... ϕXn (sn).

PGFs and MGFs are related to each other in the following ways:

ϕX (s) = E[sX ] = E[(elog s )X ] = MX (log s)

and similarly MX (t) = ϕX (et ). (Exercise.)


When both generating functions are well-defined, these relationships may be used
to derive properties of one from those of the other.

Example 27. Use ϕX (s) = MX (log s) and the fact that E[X] = MX′ (0) to derive the
formula E[X] = ϕ′X (1).

Solution. Taking the derivative with respect to t in MX (t) = ϕX (et ) we have

    (d/dt) MX (t) = (d/dt) ϕX (e^t) = ϕ′X (e^t) · e^t.

Therefore
E[X] = MX′ (0) = ϕ′X (e0 ) · e0 = ϕ′X (1).
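For a concrete numerical illustration of E[X] = ϕ′X (1), here is a small R sketch using the
Poisson(µ) PGF ϕX (s) = exp(µ(s − 1)); the value of µ and the step size h are arbitrary choices.

    # Numerical check that phi'(1) = E[X] for X ~ Poisson(mu).
    mu  <- 3                               # illustrative value; E[X] = mu
    phi <- function(s) exp(mu * (s - 1))   # Poisson PGF
    h   <- 1e-6
    (phi(1) - phi(1 - h)) / h              # one-sided numerical derivative at s = 1; approx mu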

Why PGFs and MGFs? PGFs can be viewed as an easier to manage equivalent to
MGFs when the random variables being considered take non-negative integer values.

Aside: Other generating functions

Besides the moment generating function and probability generating function, there are
other generating functions that you might come across in further studies of probability
and/or statistics. These all have their own uses in particular areas of study, but the
most important of these is the characteristic function

ψX (θ) = E[eiθX ], θ∈R

where i2 = −1. The characteristic function has similar properties to other generat-
ing functions in that (i) there is a one-to-one correspondence between characteristic
functions and CDFs, and (ii) lots of properties of random variable/s can be expressed
through their characteristic functions. However, there are very few technical issues
regarding the existence of the expectation (convergence of the integral/sum) that
defines the characteristic function, so the theory is much neater in the sense that it
doesn’t have ‘so long as the expectation exists’ conditions floating around.

After a short detour into inequalities we will use (moment) generating functions
to help establish some key results in probability and applied statistics in sections 8
and 9.

6.6 Problems

1. (Exercise from notes) Show that if X ∼ χ²_r then MX (t) = 1/(1 − 2t)^{r/2} (t < 1/2).
Hint: You’ll need the integral to cancel out with the Gamma function.
Now adapt this to show that the MGF of Y ∼ Gamma(s, λ) is MY (t) = (λ/(λ − t))^s, t < λ.

2. (Exercise from notes) Given that Z ∼ N(0, 1) has MGF MZ (t) = exp(½t²)
(t ∈ R) and the formula for standardising an arbitrary Normal random variable,
show that the MGF of X ∼ N(µ, σ²) is MX (t) = exp(µt + ½σ²t²) (t ∈ R).

3. Using MGFs, find the distribution of Z = X + Y , where X ∼ Bin(n, p) and


Y ∼ Bin(m, p) are independent.

4. Derive the formula var(X) = ϕ′′(1) + ϕ′(1) − (ϕ′(1))² using the relationship
between PGFs and MGFs and the corresponding result for MGFs.

5. Suppose that random variables X and Y have joint moment generating function

MX,Y (s, t) = (1 − s − t)−α (1 − t)−β , s + t < 1, t < 1,

where α and β are strictly positive constants.

(a) Find the MGF of X. What is the distribution of X?


(b) Find the moment generating function of Y − X.
(c) Find the joint moment generating function of X and Y − X.
(d) Hence, show that X and Y − X are independent.

6. Suppose that X1 , X2 , · · · is an infinite sequence of IID exp(λ) random vari-


ables and that N ∼ Geom(p) (i.e. P (N = k) = (1 − p)k−1 p, k = 1, 2, · · · ) is
independent of the Xi ’s.
Determine the MGF of TN = X1 + · · · + XN and deduce the distribution of T .
(Hint: write the MGF as an expectation, then condition on N and use the tower
property. The PGF ϕN (s) = E[s^N] = ps/(1 − (1 − p)s) will also be useful.)
What is the MGF of T if N is Poisson(µ) distributed (instead of geometric)?
Use the convention that Tn = X1 + · · · + Xn = 0 for n = 0. (T does not have a
well-known distribution, but once you have found the MGF it is a good further
exercise to find its mean and variance.)

7 Markov’s and Chebychev’s Inequalities


In this section we introduce some basic probabilistic inequalities. If we cannot com-
pute probabilities of interest exactly then bounds are clearly going to be useful, to
get at least some information about the probabilities that we are interested in. We
shall also see in Section 9 that inequalities can also be useful theoretical tools. There
are a whole host of inequalities in probability theory, but we will focus on two of
the most fundamental: Markov’s inequality and Chebychev’s inequality (sometimes
called Chebychev’s first and Chebychev’s second inequalities). They provide tools for
bounding the probability of a random variable taking an ‘extreme’ value in a tail of
the distribution.

Theorem 46 (Markov’s inequality). For any non-negative random variable X and
constant c > 0 we have

    P (X ⩾ c) ⩽ E[X]/c.
Sometimes given as “for any X and for any c > 0, P (|X| ⩾ c) ⩽ E[|X|]/c”.

This gives a bound on the probability that X (or |X|) is ‘large’ which depends only
on the mean of X. It is a very powerful result because it makes no other assumption
on the distribution of X.

Proof. Start by noting that for any c > 0 we have X ⩾ c · 1{X⩾c} . (Consider cases)
This implies that

E[X] ⩾ E[c · 1{X⩾c} ]

= c · P (X ⩾ c) + 0 · P (X < c)

= c · P (X ⩾ c)

and we can divide by c > 0 to get the result.

Example 28. Use Markov’s inequality to bound the fraction of people who have in-
come more than three times the average income.

Solution. Letting X be a random variable denoting the income of a randomly chosen


UK resident, X is clearly non-negative so Markov’s inequality can be applied: Taking
c = 3E[X] in Theorem 46 gives
    P (X ⩾ 3E[X]) ⩽ E[X]/(3E[X]) = 1/3.

So no more than one third of the population can earn more than 3 times the
average income.

We can also use Markov’s inequality to show what it means for a random variable
to have zero variance.

Proposition 47. If the RV X has E[X] = µ and var(X) = 0 then P (X = µ) = 1.

Proof. Markov’s inequality gives, for every n = 1, 2, . . . ,

    P (|X − µ| ⩾ 1/n) = P (|X − µ|² ⩾ 1/n²) ⩽ n² var(X) = 0.

Therefore lim_{n→∞} P (|X − µ| ⩾ 1/n) = 0, or equivalently lim_{n→∞} P (|X − µ| < 1/n) = 1, and for this
to be true we must have P (X = µ) = 1.

Whilst Markov’s inequality is appealing and useful because of its simplicity, the
proof above suggests the following bound, obtained by considering the random variable
|X − E[X]|² (which is non-negative) and applying Markov’s inequality to it.

Theorem 48 (Chebychev’s inequality). For any random variable X with mean µ
and variance σ²,

    P (|X − µ| ⩾ c) ⩽ σ²/c²   for all c > 0.

Proof. As suggested above, we apply Markov’s inequality to the random variable


|X − µ|²:

    P (|X − µ| ⩾ c) = P (|X − µ|² ⩾ c²) ⩽ E[(X − µ)²]/c² = σ²/c².

Here’s an example to demonstrate that Chebychev’s inequality can give a better
bound than Markov’s.

Example 29. Compare the tail probability P (X ⩾ 10) with bounds for that quantity
from Markov’s and Chebychev’s inequalities when X ∼ exp(1/2).

Solution. From X ∼ exp(1/2) we immediately have µ = 2 and σ² = 4.

Markov’s inequality gives P (X ⩾ 10) ⩽ 2/10 = 1/5 = 0.2.

Chebychev’s inequality gives P (X ⩾ 10) = P (|X − 2| ⩾ 8) ⩽ 4/8² = 1/16 = 0.0625.

Since we know the distribution we can compare these with P (X ⩾ 10) = e^{−10/2} ≈ 0.0067
and see that (i) both bounds are correct, and (ii) Chebychev’s inequality gives a ‘better’
bound than Markov’s (though it also uses the variance, not just the mean).
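These numbers are easy to reproduce in R:

    # Example 29 in numbers.
    lambda <- 1/2
    exact  <- pexp(10, rate = lambda, lower.tail = FALSE)  # P(X >= 10) = exp(-5)
    markov <- (1/lambda) / 10                              # E[X]/c
    cheby  <- (1/lambda^2) / 8^2                           # sigma^2 / c^2
    c(exact = exact, markov = markov, chebychev = cheby)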

7.1 Problems

1. If a non-negative random variable X satisfies P (X < 10) = 1/4, find a bound for


E[X].

2. Suppose that X has mean 10 and takes values strictly between 5 and 15 with
probability 2/3. Determine a bound for the standard deviation of X.

3. Find a bound for the variance of a non-negative random variable Z which has
mean 20 and satisfies P (Z ⩾ 25) = 1/5.

4. If the random variable Xn has mean 4 and variance 3/n² (for all n = 1, 2, ...),
how large does n have to be so that Xn takes a value between 3.9 and 4.1 with
probability at least 0.9?
Check your answer by bounding P (Xn ∈ (3.9, 4.1)) for values of n near the
minimum value that you find.

5. Determine a bound for P (|X| > 10) if X is non-negative with mean 5. Deter-
mine an additional bound if you also know that E[|X|3 ] = 100.
(Hint: if g(·) is an increasing function then {X > a} ⇐⇒ {g(X) > g(a)}.)

8 Multivariate normal distributions

The multivariate normal distribution is a multidimensional generalisation of the nor-


mal distribution. It describes a set of possibly correlated real-valued random variables
which each tend to clump around a mean value. The definition we use is slightly ab-
stract (in that we don’t define a PDF directly), however this definition has some the-
oretical advantages which make it more mathematically appealing. We then explore
some of the consequences of this definition, like finding PDFs, MGFs and finding
some other properties of the multivariate normal distribution which are central to
much of multivariate statistics.

First we need to establish some notation and recall some results from linear al-
gebra. Let µ denote an n-dimensional real column vector µ = (µ1 , . . . , µn )⊤ and let
Σ = (σij ) be an n × n real, symmetric, positive semi-definite matrix. Recall that
an n × n symmetric matrix Σ is called positive semi-definite iff x⊤ Σx ⩾ 0 for
all column vectors x ∈ Rn , and this is equivalent to all eigenvalues of Σ being non-
negative. In this situation the following are equivalent: (i) Σ is positive definite
(i.e. x⊤ Σx > 0 for all non-zero x ∈ Rn ), (ii) Σ is of full rank, (iii) all eigenvalues of
Σ are positive and (iv) Σ is invertible. In this latter case Σ−1 is also symmetric and
positive definite. Lastly, an affine transformation T : Rn → Rm is one which can be
written in the form x 7→ y = Ax + b, where A ∈ Rm×n and b ∈ Rm . (µ ≈ µ, A ≈ σ,
B ≈ σ 2 . cf. Σ of Sec 3.3.)

8.1 Characterisation

Definition 49. A random vector X = (X1 , . . . , Xn )⊤ is said to have the n-dimensional


normal distribution if it can be written in the form X = µ + AZ, where µ ∈ Rn ,
A ∈ Rn×l is of rank l ⩽ n and Z = (Z1 , . . . , Zl )⊤ is a vector of independent stan-
dard normal random variables. If X has this distribution we set Σ = AA⊤ and write
X ∼ Nn (µ, Σ); often omitting the subscript if the dimension is clear from the context.

We note for later reference that, since A is of rank l, Σ = AA⊤ also has rank l.
We will focus mainly on the situation where l = n, i.e. Σ is of full rank. In Section 8.4
we will touch on what is special/unusual about the case when Σ is not of full rank.

Derivation of PDF

Since our definition of X ∼ Nn (µ, Σ) is as a transformation of Z, a vector of indepen-


dent standard normal RVs and thus with known joint PDF, we can use Theorem 35
to find the (joint) PDF of X. Here we must assume that Σ is of full rank; this implies
that (i) Z and X are both n × 1 vectors, (ii) Σ is positive definite and thus invertible
and (iii) |A| = |Σ|1/2 ̸= 0.
The transformation T : Rn → Rn defined by z 7→ x = µ + Az is affine and
therefore has inverse T −1 characterised by z = A−1 (x − µ). Since Z is a collection of
normal random variables it takes values in the whole of Rn and it then follows that
T (Rn ) = Rn . It also follows easily from z = A−1 (x − µ) that the Jacobian matrix of
first order derivatives is given by JT −1 (x) = A−1 . Then, since

    fZ (z) = Π_{j=1}^n (1/√(2π)) e^{−zj²/2} = (2π)^{−n/2} exp( −½ z⊤z ),

it follows from Theorem 35 that we have

    fX (x) = (2π)^{−n/2} exp( −½ (A^{−1}(x − µ))⊤ A^{−1}(x − µ) ) × |A^{−1}|
           = |A^{−1}| (2π)^{−n/2} exp( −½ (x − µ)⊤ (A^{−1})⊤ A^{−1} (x − µ) ).

Now, since Σ = AA⊤ is invertible we have Σ^{−1} = (AA⊤)^{−1} = (A⊤)^{−1} A^{−1} = (A^{−1})⊤ A^{−1},
and thus

    fX (x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −½ (x − µ)⊤ Σ^{−1} (x − µ) ),   x ∈ R^n.

Note that Σ being of full rank is crucial to this derivation. If it is not then Σ−1
does not exist and the formula for fX does not make sense.

Moment generating function

The (joint) MGF gives another characterisation of the multivariate normal distribu-
tion. It will also be very useful for finding properties of the distribution in the next
couple of sections. First we compute the joint MGF of Z, then use the representa-
tion X = µ + AZ. Since the components Zi of Z are independent standard normal

variables we have

    MZ (t) = Π_{i=1}^l MZi (ti) = Π_{i=1}^l e^{ti²/2} = e^{t⊤t/2},   t ∈ R^l.

Therefore the joint MGF of X = µ + AZ is, for t ∈ R^n,

    MX (t) = E[e^{t⊤X}] = E[exp( t⊤(µ + AZ) )]
           = exp( t⊤µ ) E[exp( (A⊤t)⊤Z )]
           = exp( t⊤µ ) MZ (A⊤t)
           = exp( t⊤µ + ½ (A⊤t)⊤(A⊤t) )
           = exp( t⊤µ + ½ t⊤Σt ).


This is another characterisation of a multivariate normal distribution: a random


vector has a multivariate normal distribution if and only if it has a MGF of this form.
(And note that this characterisation does not depend on the rank of Σ.)

8.2 Exploration

To get more of an understanding of the multivariate normal distribution, we now


calculate some of its moments and look at the PDF in a little more detail.

Moments

We can find the means, variances and covariances (and therefore correlations) asso-
ciated with X ∼ N(µ, Σ) using the moment generating function. For this we write
the MGF as (so far µ and Σ are just parameters)

    MX (t) = exp( Σ_i ti µi + ½ Σ_i Σ_j ti σij tj ).

We differentiate this to obtain (for all k = 1, 2, ..., n)

    (∂/∂tk) MX (t) = ( µk + Σ_i ti σik ) MX (t)

and (for all k, l = 1, 2, ..., n)

    (∂²/∂tk ∂tl) MX (t) = ( µk + Σ_i ti σik )( µl + Σ_i ti σil ) MX (t) + σkl MX (t).

Therefore

    E[Xk] = (∂/∂tk) MX (0) = (µk + 0) · 1 = µk.

Next, we have

    E[Xk²] = (∂²/∂tk²) MX (0) = µk² + σkk,

so var(Xk) = E[Xk²] − E[Xk]² = µk² + σkk − µk² = σkk. Lastly, for k ≠ l we have

    E[Xk Xl] = (∂²/∂tk ∂tl) MX (0) = µk µl + σkl,

so cov(Xk, Xl) = E[Xk Xl] − E[Xk]E[Xl] = µk µl + σkl − µk µl = σkl. We therefore
have ρkl = ρ(Xk, Xl) = σkl / √(σkk σll).
Note that (i) these calculations justify our interpretation of the parameters and
(ii) we often write var(Xk ) = σk2 = [Σ]kk = σkk (cf. Section 3.3).

Bivariate PDF

When n is small we can write out the entries of µ and Σ explicitly and expand out
the quadratic form in the PDF of X to get more of a ‘feel’ for the behaviour of a
multivariate normal random vector. Since we use the PDF here we must assume that
Σ is non-singular. When n = 2, we write
    Σ = [ σ1²        ρ σ1 σ2 ]
        [ ρ σ1 σ2    σ2²     ],

where ρ = ρ(X1, X2) and σi² = var(Xi). It can be shown (exercise) that Σ is non-singular
if and only if |ρ| < 1 and σ1² σ2² > 0. Then the joint PDF of X1 and X2 is
(calculate |Σ| and the exponent using x = (x1, x2)⊤, µ = (µ1, µ2)⊤ and Σ above, and substitute
into the formula for fX in Section 8.1)

    fX1,X2 (x1, x2) = (1/(2π σ1 σ2 √(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x1 − µ1)/σ1)²
                        − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)² ] }.

Below are contour plots of the density when µ = (1, 2)⊤ and Σ is such that X1
and X2 (i) are uncorrelated N (0, 1) variables and (ii) are positively correlated N (0, 1)
variables.
[Contour plots of the bivariate normal PDF: (i) σ1 = 1, σ2 = 1 and ρ = 0; (ii) σ1 = σ2 = 1 and ρ = 0.5.]

Aside It follows from the PDF that these contours are always ellipses; centred at
µ and with size and orientation determined by (the eigenvalues/vectors of) Σ.

To get some ‘feel’ for the bivariate normal distribution and how it is affected by
the various parameters you should investigate such contour plots and surface plots of
the PDF in a computer package. The R file plotting2-2dNormal.R on Moodle will
allow you to do this. You can get a feel for a 3-dimensional normal distribution by
generating realisations from it and looking at a 3-d scatterplot of them. See the R
file plotting3-3dNormal.R on Moodle for this.

If you change the numbers in those files you will need to take care that Σ is
positive definite; there is code in those R files to check the eigenvalues. A couple of
covariance matrices worth investigating are (the code provided lets you specify the σi’s
and ρij’s, not the σij’s directly)

    [ 1               −0.5 √(1·2)    0.4 √(1·4)  ]            [ 1      −0.7    0    ]
    [ −0.5 √(1·2)     2              −0.2 √(2·4) ]    and     [ −0.7   1       −0.7 ]
    [ 0.4 √(1·4)      −0.2 √(2·4)    4           ]            [ 0      −0.7    1    ].
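A minimal R sketch of the eigenvalue check (the Moodle files contain their own version of
this; the code below is illustrative, not a copy of theirs):

    # Check positive definiteness of a candidate covariance matrix via its eigenvalues.
    Sigma1 <- matrix(c( 1,              -0.5*sqrt(1*2),  0.4*sqrt(1*4),
                       -0.5*sqrt(1*2),   2,             -0.2*sqrt(2*4),
                        0.4*sqrt(1*4),  -0.2*sqrt(2*4),  4),
                     nrow = 3, byrow = TRUE)
    eigen(Sigma1)$values   # all strictly positive <=> Sigma1 is positive definite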

8.3 Properties

We now investigate some further properties of the multivariate normal distribution,


many of which underly its ubiquity in (frequentist) statistics.

Affine transformations

An important property in its own right in applications, as well as a useful theoretical


tool, the next result is an analogue of the fact that if X is normally distributed then
so is Y = cX + b.

Theorem 50. If X ∼ Nn (µ, Σ) and Y = CX + b, where b ∈ Rm and C ∈ Rm×n ,


then Y ∼ Nm (Cµ + b, C Σ C ⊤ ).

In words, this theorem says that an affine transformation of a multivariate normal


random vector has a multivariate normal distribution.

Proof. Since Y = CX + b and MX (t) = exp( t⊤µ + ½ t⊤Σt ), we have

    MY (t) = E[exp{t⊤Y}] = E[exp{t⊤(CX + b)}]
           = exp{t⊤b} E[exp{(C⊤t)⊤X}]
           = exp{t⊤b} MX (C⊤t).

Now using the known MGF of X we find that

    MY (t) = exp{t⊤b} exp( (C⊤t)⊤µ + ½ (C⊤t)⊤Σ(C⊤t) )
           = exp{t⊤b} exp( t⊤Cµ + ½ t⊤CΣC⊤t )
           = exp( t⊤(Cµ + b) + ½ t⊤(CΣC⊤)t ).

So Y ∼ Nm(Cµ + b, CΣC⊤).

Marginal distributions

If we take b = 0 and C = (0, . . . , 0, 1, 0, . . . , 0), a 1 × n matrix with a 1 in the ith


position, and apply Theorem 50 then we find that Y = Xi ∼ N (µi , σii ) . This shows
that all 1-dimensional marginal random variables Xi of a multivariate normal random
vector are normally distributed (with µi = E[Xi ] and σii = var(Xi ) as we saw above.)

To investigate the marginal distribution of m ⩾ 1 of the components of X we can
take C to be an m × n matrix of zeros, but with each row having a single one in a
column corresponding to a component of the marginal distribution we are interested
in. For example, if X ∼ N3(µ, Σ) then setting b = 0 and

    C = [ 0 1 0 ]
        [ 0 0 1 ]

we have

    Y = C (X1, X2, X3)⊤ = (X2, X3)⊤.

It then follows that Y = (X2, X3)⊤ has a multivariate normal distribution with mean

    µY = Cµ = C (µ1, µ2, µ3)⊤ = (µ2, µ3)⊤

and covariance matrix

    ΣY = CΣC⊤ = [ σ22   σ23 ]
                [ σ32   σ33 ].

This example demonstrates that the marginal distributions of a multivariate normal


distribution are again multivariate normal, with mean and covariance parameters
given by those parameters for the appropriate components of the original distribution.

Independence

Recall (from Section 3.2) that independent random variables are uncorrelated, but
the converse is not generally true. Next we establish that this converse is true for
random variables that follow a multivariate normal distribution.

Theorem 51. The components X1 , . . . , Xn of a multivariate normal random vector


X are independent of each other if and only if X1 , . . . , Xn are uncorrelated, i.e.
σij = cov(Xi , Xj ) = 0 for all 1 ⩽ i ̸= j ⩽ n.

Proof. As noted above, we only need to prove the ‘if’ part of the theorem.

Suppose that X1, ..., Xn are uncorrelated, so σij = 0 for i ≠ j. Then

    MX (t) = exp( t⊤µ + ½ t⊤Σt )
           = exp( Σ_i ti µi + ½ Σ_i Σ_j ti σij tj )
           = exp( Σ_i ti µi + ½ Σ_i σii ti² )              (since σij = 0 unless i = j)
           = Π_{i=1}^n exp( ti µi + ½ σii ti² )
           = Π_{i=1}^n MXi (ti),

so X1, ..., Xn are independent (since the joint MGF equals the product of the marginal MGFs).

A numerical example

Example 30. Suppose that X = (X1, X2, X3)⊤ ∼ N3(0, Σ), where

    Σ = [ 2 1 0 ]
        [ 1 4 0 ]
        [ 0 0 5 ].

(i) Determine the joint distribution of the random variables Y1 = X1 + X2 + 2 and
    Y2 = 2X1 − X2.

(ii) For which values of c ∈ R are Z1 = 2X1 + cX2 and Z2 = 2X1 + cX3 independent?

Solution. (i) Let

    C = [ 1  1  0 ]    and    b = (2, 0)⊤,
        [ 2 −1  0 ]

so Y = (Y1, Y2)⊤ = CX + b is an affine transformation of X. Then, by Theorem 50,
Y is normally distributed with mean

    µY = E[Y] = C µX + b = C · 0 + (2, 0)⊤ = (2, 0)⊤

and

    ΣY = var(Y) = CΣC⊤ = [ 1  1  0 ] [ 2 1 0 ] [ 1  2 ]
                         [ 2 −1  0 ] [ 1 4 0 ] [ 1 −1 ]   = ··· = [ 8 1 ]
                                     [ 0 0 5 ] [ 0  0 ]           [ 1 8 ].

That is, Y ∼ N(µY, ΣY) with µY and ΣY as above.

(ii) Let

    C = [ 2 c 0 ]    and    b = (0, 0)⊤,
        [ 2 0 c ]

so that Z = (Z1, Z2)⊤ = CX + b. Now, by Theorems 50 and 51, Z is MVN and so Z1 and Z2
are independent precisely when cov(Z1, Z2) = 0 (i.e. when ΣZ = CΣC⊤ is diagonal).

Now, cov(Z1, Z2) = cov(2X1 + cX2, 2X1 + cX3) = ··· = 2(4 + c). [Equivalently, the
off-diagonal entry of CΣC⊤ is 2(4 + c).]

So Z1 and Z2 are independent iff 4 + c = 0, i.e. c = −4.
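The matrix arithmetic in this example is easy to check in R:

    # Example 30: verify C Sigma C^T and the covariance 2(4 + c).
    Sigma <- matrix(c(2, 1, 0,
                      1, 4, 0,
                      0, 0, 5), nrow = 3, byrow = TRUE)
    C1 <- matrix(c(1,  1, 0,
                   2, -1, 0), nrow = 2, byrow = TRUE)
    C1 %*% Sigma %*% t(C1)                  # should be matrix(c(8, 1, 1, 8), 2, 2)
    covZ <- function(cc) {
      C2 <- matrix(c(2, cc, 0,
                     2, 0, cc), nrow = 2, byrow = TRUE)
      (C2 %*% Sigma %*% t(C2))[1, 2]        # cov(Z1, Z2), equal to 2 * (4 + cc)
    }
    covZ(-4); covZ(1)                       # zero exactly when c = -4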

8.4 The degenerate case

Here we investigate the situation where the covariance matrix Σ is not of full rank.

If Σ is not of full rank then X = µ+AZ with A being n×l (and Z being l ×1) for
l < n. Loosely speaking, since Z is l-dimensional, the n-dimensional random vector
X can only take values in a l-dimensional subspace of Rn . We investigate this further
firstly through an explicit numerical example and then a more general argument. A
bit like X = µ + σZ with σ = 0.

For example, consider X ∼ N3(µ, Σ) with µ = (1, 1, 1)⊤ and

    Σ = [ 1 0 2  ]
        [ 0 1 3  ]
        [ 2 3 13 ].

This looks like a perfectly respectable covariance matrix, but routine calculations
reveal that Σ is singular. Further calculation reveals that the null space of Σ is
spanned by the vector (2, 3, −1)⊤. Applying Theorem 50 with C = (2, 3, −1) and
b = 0 we can show that E[2X1 + 3X2 − X3] = 2µ1 + 3µ2 − µ3 = 4 and

    var(2X1 + 3X2 − X3) = (2, 3, −1) Σ (2, 3, −1)⊤ = ··· = 0.

This implies that P (2X1 + 3X2 − X3 = 4) = 1 or equivalently X3 = 2X1 + 3X2 − 4


with probability 1. So although X is notionally 3-dimensional (it takes values in R3 )
there is a 2-dimensional subset 2X1 + 3X2 − X3 = 4 in which X always takes its
values. In this sense X is really only 2-dimensional. (2 dimensions of randomness;
once 2 variables are known so is the other. See simulations.)
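A small base-R simulation sketch of this degenerate example (the factor A is built from the
eigendecomposition of Σ; the sample size is arbitrary):

    # Simulate the degenerate N3(mu, Sigma) above and check that X3 = 2 X1 + 3 X2 - 4.
    mu    <- c(1, 1, 1)
    Sigma <- matrix(c(1, 0, 2,
                      0, 1, 3,
                      2, 3, 13), nrow = 3, byrow = TRUE)
    e <- eigen(Sigma)
    A <- e$vectors %*% diag(sqrt(pmax(e$values, 0)))   # so that A %*% t(A) = Sigma
    set.seed(1)
    Z <- matrix(rnorm(3 * 5), nrow = 3)                # five independent N(0,1) triples
    X <- mu + A %*% Z                                  # columns are realisations of X
    2 * X[1, ] + 3 * X[2, ] - X[3, ]                   # all (numerically) equal to 4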

More generally, suppose that Σx = 0 for some (column) vector x ̸= 0, and


consider the variable Z = x⊤ X. Theorem 50 tells us that

Z is (univariate) normal, µZ = x⊤ µ, σZ2 = x⊤ Σx = 0.

Thus Z = x⊤X = Σ_{i=1}^n xi Xi = µZ (w.p. 1) and so one of the Xi’s can be written as
an affine function of the others (since x has at least one non-zero element).

If there are k such independent vectors x (i.e. Σ has nullity k) then there are k of
the Xi ’s that can be written as an affine function of the others; this is another heuristic
justification of saying that X is, in some sense, really only n − k = l-dimensional.

It is worth pointing out that this degenerate situation is not just a special case
that we consider in order to be more abstract. For example, degenerate multivariate
normal distributions play an important role in the theory of linear regression. (Joint
distribution of residuals.)

(Try simulations/contours for 2-d and simulations for 3d; letting a variance get
close to zero or a correlation get close to 1)

8.5 Some additional properties

Conditional distributions

We can also define and study conditional distributions involving multivariate nor-
mal random vectors. Crucially, these conditional distributions are also multivariate
normal.
For example, in the bivariate case we can show that the conditional PDF of X2 ,
given X1 = x1 , is the PDF of a
 
    N( µ2 + ρ (σ2/σ1)(x1 − µ1), σ2²(1 − ρ²) )

random variable. Similarly,

    (X1 | X2 = x2) ∼ N( µ1 + ρ (σ1/σ2)(x2 − µ2), σ1²(1 − ρ²) ).

There are analogous formulae for conditional distributions involving general mul-
tivariate normals, but these require lots of additional notation so are omitted here.

An alternative definition

We briefly note another common definition of the vector X being multivariate normal.
This is that all linear combinations of the elements of X (i.e. Σ_{i=1}^n ai Xi for all ai ∈ R)
follow a normal distribution.


We therefore now have 4 possible definitions for X to be multivariate normal. This
one, the one we used and the one specifying the MGF are all equivalent. We could
use a definition based on the (joint) PDF but this would not allow for the degenerate
case. (So long as we allow zero variance for the univariate normal distributions.)

Joint and marginal normality

Although (X, Y ) being jointly multivariate normal implies that each of X and Y are
(marginally) normal, the converse is not necessarily true. In other words, two random
variables which are both (marginally) normally distributed do not necessarily have a
joint multivariate normal distribution.
Often relevant in (applications of) statistics. In other words: joint specifies
marginals, but not vice-versa. The same argument applies in many dimensions.

8.6 Problems

1. Let X ∼ N(µ, Σ) and suppose that µ = (0, 0, 0, 0)⊤ and

       Σ = [  2 −2  1  0 ]
           [ −2  6  3  0 ]
           [  1  3  5  0 ]
           [  0  0  0 10 ].

(a) Determine the joint distribution of Y1 = X1 + 3 and Y2 = X1 + 2X2 .


(b) Determine whether or not the following pairs of random variables are in-
dependent: (i) X3 and X4; (ii) 2X1 + (4/3)X3 and X2.
2. Show that the 2 × 2 covariance matrix

       Σ = [ σ1²       ρ σ1 σ2 ]
           [ ρ σ1 σ2   σ2²     ]

   is of full rank if and only if σ1 σ2 > 0 and |ρ| ≠ 1.

3. Let X = (X1 , X2 , . . . , Xn )⊤ ∼ Nn (µ, Σ), and let

a = (a1 , a2 , · · · , an )⊤ and b = (b1 , b2 , . . . , bn )⊤

be two constant vectors. Define U = Σ_{i=1}^n ai Xi and V = Σ_{i=1}^n bi Xi.

(a) Find the joint probability density function of U and V .


(b) Show that U and V are independent if and only if a⊤ Σb = 0.

4. Suppose that X ∼ Nn (µ, Σ) and that Σx = 0 for some non-zero vector x =


(x1 , . . . , xn )⊤ . Show that one of the random variables, Xi say, can be written
as a deterministic function of the other random variables in X.
(Hint: What can you say about the random variable Y = x⊤ X?)

5. Determine the distribution of (X − µ)⊤ Σ−1 (X − µ), where X ∼ Nn (µ, Σ).


First use the spectral decomposition Σ = B⊤ΛB, where B is orthogonal and
Λ is diagonal with positive diagonal entries λi, to find the distribution of
Y = Λ^{−1/2} B (X − µ), where Λ^{−1/2} is diagonal with entries λi^{−1/2}. Then show that the
distribution of Y⊤Y is the distribution that you wish to characterise. What
assumption about Σ must be made for this to be valid?

9 Two limit theorems

We are now in a position to state (and prove) two of the most famous results in
probability: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT).
These are results that you will have come across before, but here we pay attention
to some additional technical detail, including the proofs. The LLN confirms our
intuition that if we repeat a random experiment with some numerical outcome (i.e.
make IID observations of a random variable) then the long-term average of these
outcomes settles down to the expected value. The CLT gives more detail about the
variation around that mean: it approximately follows a normal distribution. Looking
at these results in more detail also gives us an introduction to an important topic in
more advanced probability: modes of convergence of random variables. That is, ways
in which a sequence of random variables can have a meaningful limit. (cf. convergence
of sequences of functions)

Throughout this section we assume that $X_1, X_2, \ldots$ are IID RVs and define their
partial sums $S_n = \sum_{i=1}^n X_i$ ($n = 1, 2, \ldots$). We also assume that the variables have
finite mean $E[X_1] = \mu$.

The partial sums $S_n$ are also RVs; writing $\sigma^2 = \operatorname{var}(X_1)$ for the common variance of the $X_i$, we can find the mean and variance of $S_n$:

E[Sn ] = E[X1 + · · · + Xn ] = E[X1 ] + · · · + E[Xn ] = µ + · · · + µ = nµ,

var(Sn ) = var(X1 + · · · + Xn ) = var(X1 ) + · · · + var(Xn ) = σ 2 + · · · + σ 2 = nσ 2 .

Now, if we want a limit of some sort we are looking for quantities that do not depend on
n, so let us look at $S_n/n$ (dividing by $n$ makes the mean $n\mu/n = \mu$ independent of $n$):

\[
E[S_n/n] = \frac{1}{n} E[S_n] = \frac{1}{n} \cdot n\mu = \mu,
\qquad
\operatorname{var}(S_n/n) = \frac{1}{n^2} \operatorname{var}(S_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.
\]

So indeed the mean stays fixed and the variance gets small as n grows.

The (weak) LLN captures this idea: in the limit of large n, the sample mean
Sn /n converges to the true mean µ. In fact, the LLN comes in two different ver-
sions, which correspond to different interpretations (almost sure convergence and
convergence in probability) of a sequence of random variables converging to a
number. We will state both versions, but prove only the second.

Theorem 52 (Strong LLN). The event $\{S_n/n \to \mu\}$ has probability 1. In other
words, the set of $\omega$ for which $S_n(\omega)/n \not\to \mu$ has probability 0.

Theorem 53 (Weak LLN). For all $\epsilon > 0$, $\lim_{n \to \infty} P(|S_n/n - \mu| > \epsilon) = 0$.

Proof of the weak LLN. Fix $\epsilon > 0$. By Chebychev's inequality (note that this step also requires the variance $\sigma^2 = \operatorname{var}(X_1)$ to be finite), we have
\[
P(|S_n/n - \mu| > \epsilon) \leqslant \frac{\operatorname{var}(S_n/n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}.
\]
As $n \to \infty$ the right-hand side converges to 0, so the left-hand side must too; more
specifically we have
\[
0 \leqslant P(|S_n/n - \mu| > \epsilon) \leqslant \frac{\sigma^2}{n\epsilon^2} \quad \text{for all } n,
\]
and taking limits as $n \to \infty$ gives
\[
0 \leqslant \lim_{n \to \infty} P(|S_n/n - \mu| > \epsilon) \leqslant 0,
\]
so $\lim_{n \to \infty} P(|S_n/n - \mu| > \epsilon) = 0$.

For some interpretation of these results, consider a sequence of Bernoulli trials


with success probability 1/2, so Sn is the number of successes in the first n trials.
The strong LLN says that when we see realisations of the random variables, the
sequence $(S_n/n,\ n = 1, 2, \ldots)$ will converge to 1/2. There are possible sequences
which would give a different limit (or no limit at all), but they are so unlikely that collectively
they have zero probability. The weak LLN says that for any (small) ‘error’ $\epsilon > 0$, we
can make the probability of Sn /n being more than ϵ away from 1/2 as small as we
like by taking n big enough.
Other interpretations arise in things that we do all the time (in life and in probability):
approximating the probability of an event by the proportion of times we observe it
happen, or approximating the expected value of a random quantity by the average of
repeated observations. In these situations we are implicitly applying the LLN.
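A small Python sketch of the Bernoulli example (the seed and sample sizes are arbitrary choices): the running averages $S_n/n$ settle down towards $\mu = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_max = 100_000
x = rng.integers(0, 2, size=n_max)                     # Bernoulli(1/2) trials
running_mean = np.cumsum(x) / np.arange(1, n_max + 1)  # S_n / n

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, running_mean[n - 1])                      # approaches 0.5 as n grows
```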
These practical interpretations of $S_n/n \to \mu$ amount to $S_n/n \approx \mu$ for large $n$,
or alternatively $S_n \approx n\mu$ for large $n$. A natural next question to ask is how good
an approximation this is, i.e. what can we say about the differences $S_n - n\mu$? The
next result, the Central Limit Theorem, gives us a very detailed answer about the
distribution of these differences. But first we need to introduce another mode of
convergence of random variables and a way of characterising it.

Definition 54. Let $X$ and $X_1, X_2, \ldots$ be RVs with CDFs $F$ and $F_1, F_2, \ldots$ respectively.
Then $X_n$ converges in distribution to $X$ as $n \to \infty$, denoted by $X_n \xrightarrow{D} X$
as $n \to \infty$, if $\lim_{n\to\infty} F_n(x) = F(x)$ for all $x \in \mathbb{R}$ at which $F$ is continuous.

The following result shows that we can characterise convergence in distribution
using MGFs. We do not prove it, but it is a natural result since MGFs characterise distributions.

Theorem 55. Let $X$ and $X_1, X_2, \ldots$ be RVs with MGFs $M$ and $M_1, M_2, \ldots$ respectively.
Suppose that there exists an $h > 0$ such that $M_1(t), M_2(t), \ldots$ and $M(t)$ are all finite for all $t \in (-h, h)$.
Then $\lim_{n\to\infty} M_n(t) = M(t)$ for all $t \in (-h, h)$ is equivalent to $X_n \xrightarrow{D} X$ as $n \to \infty$.

The following example uses this result to give a rigorous characterisation of the
idea that the Poisson distribution can be used to approximate the Binomial distribu-
tion (when n is large and p is small).

[Figure: bar plots of the PMFs $P(X_n = x)$ for (i) $n = 5$, (ii) $n = 10$, (iii) $n = 20$, and (iv) the Poisson PMF $P(X = x)$, for $x = 0, 1, \ldots, 12$.]
Binomial PMF with p = 2/n and n = 5, 10, 20; and Poisson PMF with λ = 2.

Example 31. Suppose that $X_n \sim \operatorname{Bin}(n, \lambda/n)$ and that $X \sim \operatorname{Poisson}(\lambda)$. Show that
$X_n \xrightarrow{D} X$ as $n \to \infty$.

Solution. By Theorem 55, it is sufficient to show that $\lim_{n\to\infty} M_{X_n}(t) = M_X(t)$.
Recall that, for all $t \in \mathbb{R}$ (these can be found from the PGFs met in MATH1001),
\[
M_{X_n}(t) = \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\right)^{n}
\quad\text{and}\quad
M_X(t) = e^{-\lambda(1 - e^t)}.
\]
Now, using the identity $\lim_{n\to\infty} \left(1 + \frac{x}{n}\right)^n = e^x$ from calculus, we have
\[
\lim_{n\to\infty} M_{X_n}(t) = \lim_{n\to\infty} \left(1 + \frac{-\lambda(1 - e^t)}{n}\right)^{n} = e^{-\lambda(1 - e^t)} = M_X(t) \quad \forall t \in \mathbb{R},
\]

as required.
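The figure above can also be explored numerically. The following sketch (using scipy.stats, with $\lambda = 2$ as in the figure; the extra value $n = 100$ is an arbitrary addition) shows the largest difference between the $\operatorname{Bin}(n, \lambda/n)$ and $\operatorname{Poisson}(\lambda)$ PMFs shrinking as $n$ grows.

```python
import numpy as np
from scipy import stats

lam = 2.0
xs = np.arange(0, 13)

for n in [5, 10, 20, 100]:
    binom_pmf = stats.binom.pmf(xs, n, lam / n)
    poisson_pmf = stats.poisson.pmf(xs, lam)
    # Largest pointwise difference between the two PMFs on x = 0, ..., 12
    print(n, np.max(np.abs(binom_pmf - poisson_pmf)))
```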

Before we state and prove the CLT, recall our standing assumptions that $X_1, X_2, \ldots$
are IID with finite mean $\mu = E[X_1]$ and that $S_n = \sum_{i=1}^n X_i$. We now need to assume
that $\sigma^2 = \operatorname{var}(X_1)$ is also finite.

Theorem 56 (Central limit theorem). As $n \to \infty$ we have
\[
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} Z,
\]
where $Z \sim N(0, 1)$.

Proof. We prove the CLT under the assumption that the moment generating function
of X1 exists.
Our approach is to standardise the variables Xi to new variables Yi and write MZn
in terms of the MGFs of the standardised variables, then show that log MZn converges
to log MZ .
Define the standardised RVs $Y_i = (X_i - \mu)/\sigma$. Then $Y_1, Y_2, \ldots$ are IID with $E[Y_1] = 0$ and $\operatorname{var}(Y_1) = 1 = E[Y_1^2]$; and we can write
\[
Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i.
\]

Then, using Theorem 41 and the fact that the $Y_i$ are IID, we have
\[
M_{Z_n}(t) = \prod_{i=1}^n M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right) = \left(M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right)^{n}.
\]

Letting $n \to \infty$ here results in the indeterminate form $1^\infty$, so instead we work
with the logarithm of this MGF:
\[
\log M_{Z_n}(t) = n \log\!\left(M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right).
\]

Since $M_Z(t) = e^{t^2/2}$, to show $\lim_{n\to\infty} \log M_{Z_n}(t) = \log M_Z(t)$ we need to show that
\[
\lim_{n\to\infty} n \log\!\left(M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right) = \frac{t^2}{2},
\]
and, making the substitution $u = 1/\sqrt{n}$ (so that $n = u^{-2}$, turning the indeterminate form $\infty \cdot 0$ into one suitable for L'Hôpital's rule), this is equivalent to showing
\[
\lim_{u \downarrow 0} \frac{1}{u^2} \log\!\left(M_{Y_1}(ut)\right) = \frac{t^2}{2}. \tag{$*$}
\]

This limit involves the function $L(s) = \log(M_{Y_1}(s))$ as $s \to 0$. With a view towards
applying L'Hôpital's rule, we find that
\[
L'(s) = \frac{M_{Y_1}'(s)}{M_{Y_1}(s)}
\quad\text{and}\quad
L''(s) = \frac{M_{Y_1}(s)\, M_{Y_1}''(s) - (M_{Y_1}'(s))^2}{(M_{Y_1}(s))^2},
\]
and so (using properties of MGFs and the definition of the $Y_i$ variables),
\begin{align*}
L(0) &= \log(M_{Y_1}(0)) = \log 1 = 0, \\
L'(0) &= M_{Y_1}'(0) = E[Y_1] = 0, \\
L''(0) &= M_{Y_1}''(0) = E[Y_1^2] = 1.
\end{align*}

Now we can show ($*$) above: using L'Hôpital's rule (twice, each time on a $0/0$ form) we have
\begin{align*}
\lim_{n\to\infty} n\, L\!\left(\frac{t}{\sqrt{n}}\right)
= \lim_{u \downarrow 0} \frac{L(tu)}{u^2}
&= \lim_{u \downarrow 0} \frac{t\, L'(tu)}{2u} \\
&= \lim_{u \downarrow 0} \frac{t^2\, L''(tu)}{2} = \frac{t^2}{2}.
\end{align*}
(For which values of $t$ is this argument valid?)

The last thing to check is that this convergence happens for all $t \in \mathbb{R}$, since that is
the domain on which the MGF $M_Z(t) = e^{t^2/2}$ is defined. If the MGF $M_{X_1}(t)$ is defined
for $|t| < h$ then $M_{Y_1}(t)$ is defined for $|t| < \sigma h$ (see Theorem 38). It then follows from
the formula $M_{Z_n}(t) = (M_{Y_1}(t/\sqrt{n}))^n$ that this function is defined for $|t/\sqrt{n}| < \sigma h$, or
equivalently $|t| < \sigma h \sqrt{n}$. In the limit $n \to \infty$ this gives $t \in \mathbb{R}$, as required.
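As an illustrative sketch only (the Exp(1/2) distribution, with mean 2 and variance 4, anticipates Example 32 below; any IID sequence with finite variance would do, and the seed and sample sizes are arbitrary), the following simulation compares the empirical value of $P(Z_n \leqslant 1)$ with $\Phi(1)$ for increasing $n$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma = 2.0, 2.0                  # mean and standard deviation of Exp(1/2)
reps = 20_000

for n in [5, 50, 500]:
    x = rng.exponential(scale=2.0, size=(reps, n))
    s_n = x.sum(axis=1)
    z_n = (s_n - n * mu) / (sigma * np.sqrt(n))
    # Compare the empirical P(Z_n <= 1) with Phi(1); agreement improves with n
    print(n, np.mean(z_n <= 1.0), stats.norm.cdf(1.0))
```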

The CLT is an asymptotic result that tells us about the limiting distribution of
the $Z_n$'s; but like the LLN it suggests an approximation for the distribution of $S_n$
when $n$ is large but finite.

Proposition 57. For large n, the distribution of Sn is approximately N(nµ, nσ 2 ).

Proof. The CLT says that $Z_n = (S_n - n\mu)/\sqrt{n\sigma^2} \xrightarrow{D} Z$ with $Z$ standard normal.
Changing $\xrightarrow{D}$ (`converges in distribution to') to $\dot\sim$ (`is approximately distributed as'), we
have (cf. changing $\to$ to $\approx$)
\[
\frac{S_n - n\mu}{\sigma \sqrt{n}} \mathrel{\dot\sim} Z.
\]
Rearranging this we get $S_n \mathrel{\dot\sim} n\mu + \sigma\sqrt{n}\, Z$; since a linear transformation of a normal RV is normal, this gives $S_n \mathrel{\dot\sim} S_n' \sim N(n\mu, n\sigma^2)$.

To interpret and relate the LLN and CLT somewhat: the LLN says that for large
n, Sn is approximately its mean nµ. The CLT quantifies the typical deviation from
that mean.

It is also important to note that the only condition on the distribution of the
random variables $X_i$ in the CLT is that they have finite variance. Aside from that
they can have any distribution: discrete, continuous or neither; symmetric or
asymmetric. The precise distribution of the IID variables affects how quickly
the convergence in the CLT happens, but it always happens. (The rate of convergence
is related to the quality of the normal approximation for finite $n$.)

The central limit theorem is the basic reason for the pervasiveness of the normal
distribution, mainly because the same argument as above gives a normal approxima-
tion for X̄n = Sn /n, the sample mean.

Lastly, we look at a short example demonstrating approximation using the CLT.


Example 32. Let $S_n = \sum_{i=1}^n X_i$, where $X_i \sim \operatorname{Exp}(1/2)$ are IID. Use the CLT to find
the approximate probability that S50 is between 90 and 110, in terms of the standard
normal CDF Φ(x) = P (Z ⩽ x), where Z ∼ N(0, 1).

Solution. Since the $X_i$ are IID, the CLT justifies using a normal approximation
$Y_n \sim N(n\mu, n\sigma^2)$ for $S_n$. Here we have $\mu = E[X_i] = 2$, $\sigma^2 = \operatorname{var}(X_i) = 2^2$ and $n = 50$;
so $S_{50} \mathrel{\dot\sim} Y_{50} \sim N(100, 200)$.
Therefore
\begin{align*}
P(90 < S_{50} < 110) &\approx P(90 < Y_{50} < 110) \\
&= P\!\left(\frac{90 - 100}{\sqrt{200}} < Z < \frac{110 - 100}{\sqrt{200}}\right) \\
&= P\!\left(-\sqrt{\tfrac{1}{2}} < Z < \sqrt{\tfrac{1}{2}}\right) \\
&= \Phi\!\left(\sqrt{\tfrac{1}{2}}\right) - \Phi\!\left(-\sqrt{\tfrac{1}{2}}\right).
\end{align*}

Of course in practice the normal CDF can be easily and accurately evaluated by a
computer (or tables). In this case we also know the distribution of Sn (as the sum
of 50 independent Exp(1/2) RVs it is Gamma(50, 1/2); see Problem 2b at the end
of Section 2), so we can determine the probability exactly and see how good the
approximation is.
In this case the exact probability is 0.5210. . . , whilst the normal approximation
above gives 0.5205. . . (a relative error of about 0.0005/0.52 ≈ 0.1%).
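The quoted values can be reproduced with a few lines of scipy (a sketch only; here $S_{50} \sim$ Gamma with shape 50 and rate 1/2, i.e. scale 2):

```python
import numpy as np
from scipy import stats

# Exact: S_50 ~ Gamma(shape = 50, rate = 1/2), i.e. scale = 2
exact = stats.gamma.cdf(110, a=50, scale=2) - stats.gamma.cdf(90, a=50, scale=2)

# CLT approximation: S_50 is approximately N(100, 200)
z = 10 / np.sqrt(200)
approx = stats.norm.cdf(z) - stats.norm.cdf(-z)

print(exact, approx)   # approx 0.5210 and 0.5205, as stated in the notes
```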

9.1 Problems

1. Let $X_n \sim \operatorname{Bin}(n, 3/n)$. When $n$ is large, what distributions can you use to
approximate the distribution of $X_n$? Give your reasons.

2. A fair die is rolled repeatedly. Use the central limit theorem to determine the
minimum number of rolls required so that the probability that the sum of the
scores of all of the rolls is greater than or equal to 100 is at least 0.95.

3. Let $X$ be a random variable with probability density function $f_X(x) = \frac{1}{2}e^{-|x|}$, $x \in \mathbb{R}$.
Suppose that $X_1, X_2, \ldots$ are independent and each has the same distribution
as $X$. Define $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$, $n = 1, 2, \ldots$.

(a) Find the moment generating function of X.


(b) For each n, compute the moment generating function of Sn .
(c) Show, carefully, that $S_n \xrightarrow{D} S$ for some random variable $S$ as $n \to \infty$.
Determine the distribution of $S$.

4. Use a similar argument to the proof of the CLT to prove (a version of) the LLN.
Clearly state the theorem you prove too; i.e. be explicit about the assumptions
needed for the proof to be valid.
