MATH2010 2022/23 Autumn Notes (gappy version)
Contents

0 Background
  0.1 Random variables and distributions
  0.2 Standard continuous RVs (distributions)
  0.3 Bivariate distributions
  0.4 Problems

1 Joint distributions
  1.1 Joint distributions for more than two RVs
  1.2 Problems

5 Transformations
  5.4 Problems

6 Generating Functions
  6.6 Problems

  7.1 Problems

  8.1 Characterisation
  8.2 Exploration
  8.3 Properties
  8.6 Problems

  9.1 Problems
Preliminaries
These notes & support for your learning
These notes are the basis for the teaching and learning in the autumn semester of
MATH2010. You will be expected to read ahead through some set sections of the notes
in advance of most lectures. Then we can use the lecture time to focus on working
through some examples and discussing the more nuanced aspects of the material we
encounter.
You will need to bring a copy of these notes and some blank paper (or digital
equivalents) to every lecture.
See Moodle for more details of all aspects of the resources and support available to you for this module, including problem classes, computing classes, drop-in classes and the lecturer's office hours.
Problems
Working through problems is a key part of studying mathematics. You should begin
attempting problems as soon as we have worked through the relevant parts of the
notes. The suggested texts are an excellent source of further problems (and alternative
explanations and presentations of the lecture material).
0 Background
Hence, in particular,

This gives

F_X(x) = \begin{cases} \sum_{x_i \leq x} p_X(x_i), & \text{if } X \text{ is discrete}, \\ \int_{-\infty}^{x} f_X(u)\, du, & \text{if } X \text{ is continuous}. \end{cases}

(Note the dummy variable in the integral.)
If X is a continuous RV then

1. f_X(x) = \frac{d}{dx} F_X(x),

2. P(a < X ⩽ b) = \int_a^b f_X(x)\, dx,

3. P(X \in A \cup B) = \int_A f_X(x)\, dx + \int_B f_X(x)\, dx if A \cap B = \emptyset (key when f_X is defined piecewise),

4. P(X = a) = \int_a^a f_X(x)\, dx = 0.

This last point implies that P(X ∈ (a, b)) = P(X ∈ [a, b)) = P(X ∈ (a, b]) = P(X ∈ [a, b]) for continuous X; cf. property 2.
It follows that the expected value of a function g(·) of a random variable X is given by

E[g(X)] = \begin{cases} \sum_{x} g(x)\, p_X(x), & \text{if } X \text{ is discrete}, \\ \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx, & \text{if } X \text{ is continuous}. \end{cases}
An important expectation is the variance of the random variable X, defined by
var(X) = E[(X − E[X])2 ]. A useful equivalent form is var(X) = E[X 2 ] − (E[X])2 .
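These working definitions lend themselves to quick numerical sanity checks. Below is a minimal R sketch (the choice X ~ Exp(2) is purely illustrative and not taken from the notes) verifying the variance identity with integrate():

# Check var(X) = E[X^2] - (E[X])^2 numerically for X ~ Exp(lambda) with lambda = 2.
lambda <- 2
fX  <- function(x) lambda * exp(-lambda * x)              # PDF of Exp(lambda)
EX  <- integrate(function(x) x   * fX(x), 0, Inf)$value   # E[X]
EX2 <- integrate(function(x) x^2 * fX(x), 0, Inf)$value   # E[X^2]
EX2 - EX^2                                                # should be 1/lambda^2 = 0.25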
0.2 Standard continuous RVs (distributions)

• Normal distribution N(µ, σ²) (with µ ∈ R and σ² > 0):

f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)}, \quad x \in \mathbb{R};
• Gamma distribution Gamma(α, β) (or Γ(α, β), or G(α, β), with α, β > 0):

f_X(x) = \begin{cases} \frac{1}{\Gamma(\alpha)}\, \beta e^{-\beta x} (\beta x)^{\alpha-1}, & x ⩾ 0, \\ 0, & x < 0, \end{cases}

where \Gamma(s) = \int_0^{\infty} e^{-x} x^{s-1}\, dx;
• Beta distribution Beta(α, β) (with α, β > 0):

f_X(x) = \begin{cases} \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, & 0 ⩽ x ⩽ 1, \\ 0, & \text{otherwise}, \end{cases}

where B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\, dt = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta).
I will assume that you are familiar with the following distributions (their PDF/PMF,
mean, variance): normal, uniform, exponential, Poisson, geometric, binomial.
0.3 Bivariate distributions

Similar ideas to those in Section 0.1 apply to pairs of random variables: bivariate
distributions. In MATH1001 you saw, in the context of discrete distributions,
• Independence.
The first part of this module is about looking at these concepts in more detail, with
the focus on continuous distributions.
0.4 Problems
where α ∈ R is a constant.
2. Suppose that X ∼ U(a, b). Verify that E[X] = (a + b)/2 and var(X) = (b − a)²/12.

3. Suppose that Y ∼ Exp(λ). Verify that E[Y] = 1/λ and var(Y) = 1/λ².
4. Suppose that X1 and X2 are independent Bernoulli random variables with parameter 1/2 (i.e. P(Xi = 0) = P(Xi = 1) = 1/2 for both i = 1, 2) and define Y1 = X1 + X2 and Y2 = |X1 − X2|.
5. Two people are asked to toss a fair coin twice and record each result (heads or
tails) as zero or one. The diligent person does as they are asked, but the lazy
one just tosses the coin once and records the same result twice. Let (D1 , D2 ) be
the results of the diligent person’s tosses and (L1 , L2 ) be the recorded results
of the lazy person’s tosses. Construct tables showing the joint PMFs of (D1 , D2 )
and of (L1 , L2 ). Hence show that D1 and L1 have the same distribution, that D2
and L2 have the same distribution, but that the joint distributions of (D1 , D2 )
and of (L1 , L2 ) are not the same.
Part I

1 Joint distributions

Now we consider the case where we have more than one RV, say X : Ω → R and
Y : Ω → R. We are not simply interested in the properties of X and Y , but in their
joint behaviour.
F_{X,Y}(x, y) = P(X ⩽ x, Y ⩽ y), \quad x, y ∈ R.

F_X(x) = P(X ⩽ x) = P(X ⩽ x, Y < ∞)
Definition 2. Two RVs X and Y are said to be jointly continuous if there exists
a function fX,Y (x, y) (⩾ 0) with the property that, for every ‘nice’1 set C ⊆ R2 ,
P((X, Y) ∈ C) = \iint_C f_{X,Y}(x, y)\, dx\, dy.

¹The details of 'nice' are beyond the scope of this module, but certainly rectangles and unions and intersections of rectangles are included.
The function f_{X,Y} is called the joint PDF of X and Y. It must satisfy f_{X,Y}(x, y) ⩾ 0 for all (x, y) and \iint_{\mathbb{R}^2} f_{X,Y}(u, v)\, du\, dv = 1.
Solution. To determine the value of α we use the fact that the PDF must integrate
to 1. A diagram of the domain of fX,Y is invariably useful for determining the bounds
of integration.
[Sketch: the region where f_{X,Y} > 0 is the triangle 0 < x < y < 1, bounded by the lines x = 0, y = 1 and y = x.]
So we have

1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = \int_0^1 \int_0^y \alpha(x + y)\, dx\, dy
  = \alpha \int_0^1 \left[ \frac{x^2}{2} + xy \right]_{x=0}^{y} dy = \alpha \int_0^1 \frac{3}{2} y^2\, dy
  = \alpha \left[ \frac{y^3}{2} \right]_{y=0}^{1} = \alpha \cdot \frac{1}{2}

and thus α = 2.
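A quick numerical check of this calculation can be done in R by nesting integrate() calls (a sketch only, not part of the original notes):

# With alpha = 2, the joint PDF f(x, y) = alpha*(x + y) on 0 < x < y < 1 should integrate to 1.
alpha <- 2
inner <- function(y) sapply(y, function(yy)
  integrate(function(x) alpha * (x + yy), lower = 0, upper = yy)$value)
integrate(inner, lower = 0, upper = 1)$value   # should be (close to) 1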
F_{X,Y}(x, y) = P(X ⩽ x, Y ⩽ y)

f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y).
f_X(x) = \frac{d}{dx} F_X(x) = \frac{d}{dx} F_{X,Y}(x, \infty)
       = \frac{d}{dx} \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(u, y)\, dy\, du
       = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \quad \text{(note the dummy variables)}

and so f_X is called the marginal PDF of X. The marginal PDF of Y can be similarly obtained:

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx.
Corresponding definitions and results hold for jointly discrete random variables.
The focus in this part of the module is on jointly continuous RVs, but all the concepts we cover have analogues for the discrete case (PDF ≡ PMF and ∫ ≡ Σ).
Example 2. Find the joint CDF of X and Y if they have joint PDF
f_{X,Y}(x, y) = \begin{cases} e^{-x}, & x ⩾ 0,\ 0 ⩽ y ⩽ 1, \\ 0, & \text{otherwise}. \end{cases}
Solution. To use the formula F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\, du\, dv we need to sketch the region of integration:
Visualising PDFs
[Figure: a surface plot and a contour plot of one of the joint PDFs from this section, with axes x, y and z = f_{X,Y}(x, y).]
One needs to be a bit careful interpreting what happens at the edges of the domain
where fX,Y > 0, as most surface/contour plotters don’t deal with discontinuities very
well. Nevertheless, these are useful ways to get a feel for what the PDF ‘looks like’.
It is a very useful skill to be able to (roughly) visualise densities in your head; I
strongly recommend using a computer package to help practice, using this and some
of the other joint PDFs in this section. The above plots were generated using the R
commands in plotting1-jointPDF.R available on Moodle. (It also includes code to
generate a dynamic surface plot which you can drag around to explore.)
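For reference, here is a minimal base-R sketch of how such plots can be produced (this is not the Moodle script plotting1-jointPDF.R; the PDF f(x, y) = 2(x + y) on 0 < x < y < 1 from Example 1 is used as the illustration):

f <- function(x, y) ifelse(x > 0 & x < y & y < 1, 2 * (x + y), 0)
x <- seq(0, 1, length.out = 60)
y <- seq(0, 1, length.out = 60)
z <- outer(x, y, f)                       # matrix of density values over the grid
persp(x, y, z, theta = 30, phi = 25,      # surface plot
      xlab = "x", ylab = "y", zlab = "f(x,y)")
contour(x, y, z, xlab = "x", ylab = "y")  # contour plot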
Now we continue with some more examples of the kinds of calculations that can
be done using joint PDFs.
Solution. First draw a diagram of the region (call it A) where fX,Y is non-zero.
[Sketch: A is the triangle with vertices (0, 0), (1, 0) and (0, 1), bounded by the lines x = 0, y = 0 and x + y = 1. For the marginal of Y, integrate over x along a horizontal line at a fixed y ∈ (0, 1).]
= \cdots = 4(1 − y)^3.

Now, since \int_0^y 4(1 − u)^3\, du = 1 − (1 − y)^4, the CDF of Y is

F_Y(y) = \begin{cases} 0, & y ⩽ 0, \\ 1 − (1 − y)^4, & 0 < y ⩽ 1, \\ 1, & y ⩾ 1. \end{cases}
Note that since the region where fX,Y (x, y) > 0 is not a rectangle, the range of
integration is different for every y ∈ [0, 1].
(ii) To find P (X > Y ) we need to integrate fX,Y (x, y) over the region where x > y.
Formally, if we write A = {(x, y) : f_{X,Y}(x, y) > 0} and let C = {(x, y) : x > y} then we need to calculate \iint_C f_{X,Y}(x, y)\, dx\, dy = \iint_{A \cap C} 24x(1 − x − y)\, dx\, dy.
To see what this means for the double integral we’ll have to do, draw a diagram!
[Sketch: A ∩ C is the part of the triangle bounded by x = 0, y = 0 and x + y = 1 that lies below the line x = y, i.e. where x > y.]
= ···
= 3/4.
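The answer can be checked numerically in R (an illustrative sketch, nesting integrate() over the region A ∩ C):

# f(x, y) = 24 x (1 - x - y) on the triangle x, y >= 0, x + y <= 1; P(X > Y) should be 3/4.
f <- function(x, y) ifelse(x >= 0 & y >= 0 & x + y <= 1, 24 * x * (1 - x - y), 0)
inner <- function(y) sapply(y, function(yy)
  integrate(function(x) f(x, yy), lower = yy, upper = 1 - yy)$value)
integrate(inner, lower = 0, upper = 0.5)$value   # should be (close to) 0.75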
Next we do an example which shows how this technique can be used to find the
distribution of a function of two random variables. Find the CDF (a probability) as
above; differentiate to get PDF.
Solution. First, X > 0 and Y > 0 so Z = X/Y > 0 and FZ (z) = 0 for all z ⩽ 0.
For z > 0 we need to find FZ (z) = P (Z ⩽ z) = P (X/Y ⩽ z).
(Draw a diagram to see what region we need to integrate fX,Y over:)
[Sketch: hatch the region to the left of the line x = yz (equivalently y = x/z); note that z > 0 here.]
= \cdots = 1 − \frac{1}{1 + z}

and so

f_Z(z) = \frac{d}{dz} F_Z(z) = \begin{cases} (1 + z)^{-2}, & z > 0, \\ 0, & z ⩽ 0. \end{cases}
The same approach will work for any function of X and Y . If Z = g(X, Y ) then
F_Z(z) = P(g(X, Y) ⩽ z) = \iint_{\{(x, y)\,:\, g(x, y) ⩽ z\}} f_{X,Y}(x, y)\, dx\, dy.
1.1 Joint distributions for more than two RVs

All of the ideas and notation above can be extended to deal with more than two random variables. Rather than labelling the RVs X, Y, . . . , we use the notation X_1, X_2, . . . , X_n (n ⩾ 2).
Joint CDF:

Joint PDF:

and

f_{X_i, X_j}(x_i, x_j) = \int \cdots \int_{\mathbb{R}^{n-2}} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\, dx_k \ (k \neq i, j),

i.e. all the other variables are integrated out. For C ⊆ R^n,

P((X_1, \ldots, X_n) \in C) = \int \cdots \int_C f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\, dx_1 \ldots dx_n.
The notation here can be rather cumbersome, so vector notation is often used. Writing x = (x_1, . . . , x_n) and X = (X_1, . . . , X_n) the above statements can be re-written as, for example,

F_X(x) = P(X ⩽ x), \quad f_X(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X(x) \quad \text{and} \quad P(X \in C) = \int_C f_X(x)\, dx.
1.2 Problems
1. Let the random variables X and Y have joint distribution function FX,Y . Show
the following:
f_{X,Y}(x, y) = \frac{2}{(1 + x + y)^3}, \quad x ⩾ 0,\ y ⩾ 0.
(a) What is the probability that at least one unit of each product is produced?
(b) Determine the probability that the quantity of C1 produced is less than
half that of C2 .
(c) Find the CDF for the total quantity of C1 and C2 .
2 Independent Random Variables

P(X ⩽ x, Y ⩽ y) = P(X ⩽ x) P(Y ⩽ y)
This can be interpreted as the joint behaviour of X and Y being determined by the
marginal behaviour of X and the marginal behaviour of Y . There is no ‘interaction’
between X and Y . The following equivalent characterisation is often more useful than
the definition involving CDFs.
Example 5. Determine whether the RVs defined by the following joint PDFs are
independent.
f_1(x, y) = \begin{cases} e^{-(x+y)}, & x, y ⩾ 0, \\ 0, & \text{otherwise}, \end{cases}
\qquad
f_2(x, y) = \begin{cases} 8xy, & x \in (0, 1),\ y \in (0, x), \\ 0, & \text{otherwise}. \end{cases}
Solution. It looks like f1 (x, y) = e−x · e−y . To verify that this is true, find the
marginal PDFs:
f_X(x) = \int_0^{\infty} e^{-(x+y)}\, dy = e^{-x} \int_0^{\infty} e^{-y}\, dy = e^{-x} \quad (x ⩾ 0)
There is an alternative way to show that the variables described by f2 above are
dependent. The (x, y) values with positive density form a triangle. Knowing the
value of one variable gives us information about possible values of the other variable,
contravening the idea of independence (see also Section 4.2). We can make this
argument rigorous by noting that X and Y both take values in (0, 1), but (X, Y )
does not take values in all of (0, 1) × (0, 1). For example, taking (x, y) = (1/4, 1/2) we get

0 = f_2(1/4, 1/2) ≠ f_X(1/4) f_Y(1/2) > 0,
[Sketch: the support of f_2 is the triangle 0 < y < x < 1, which does not fill the square (0, 1) × (0, 1).]
This type of argument will work whenever the possible values of one variable depend on the value of another. More concretely, if the region where f_{X,Y} > 0 cannot be written as A_X × A_Y, where A_X (A_Y) is the set of possible X (Y) values, then X and Y cannot be independent. (If A_X and A_Y are both intervals then their cross product is a rectangle.)
2.1 Sums of independent RVs
Suppose that X and Y are independent continuous RVs with PDFs fX and fY re-
spectively. The CDF of X + Y is
F_{X+Y}(z) = P(X + Y ⩽ z)
           = \iint_{\{(x, y)\,:\, x + y ⩽ z\}} f_X(x) f_Y(y)\, dx\, dy
           = \int_{-\infty}^{\infty} \int_{-\infty}^{z-y} f_Y(y) f_X(x)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z-x} f_X(x) f_Y(y)\, dy\, dx
           = \int_{-\infty}^{\infty} F_X(z − y) f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x) F_Y(z − x)\, dx.

Hence

f_{X+Y}(z) = \frac{d}{dz} F_{X+Y}(z)
           = \int_{-\infty}^{\infty} \frac{d}{dz} F_X(z − y)\, f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x)\, \frac{d}{dz} F_Y(z − x)\, dx
           = \int_{-\infty}^{\infty} f_X(z − y) f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x) f_Y(z − x)\, dx.
= λ^2 z e^{-λz}.
Or draw a diagram: highlight the region where f_{X,Y} > 0 (the first quadrant) and the line x + y = z for z > 0, which meets the axes at (z, 0) and (0, z); the integration for F_{X+Y}(z) is over the part of the first quadrant below this line. (Consider z < 0 separately.)
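A simulation sketch of this result (the value lambda = 1.5 is an arbitrary illustrative choice): the sum of two independent Exp(lambda) variables should have the density lambda^2 z e^{-lambda z} derived above, i.e. a Gamma(2, lambda) distribution.

set.seed(1)
lambda <- 1.5
z <- rexp(1e5, rate = lambda) + rexp(1e5, rate = lambda)    # X + Y
hist(z, breaks = 80, freq = FALSE, main = "X + Y for X, Y ~ Exp(lambda)")
curve(lambda^2 * x * exp(-lambda * x), add = TRUE, lwd = 2) # target density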
2.2 Functions of independent RVs
If two random variables X and Y are independent then it seems intuitively clear that
any function of X should be independent of any function of Y . This is indeed the
case.
(The proof is omitted.) The exact meaning of ‘nice’ involves the details of inte-
gration theory; but it is a very weak condition. Examples of nice functions are those
which are continuous, monotone, or piecewise continuous with finitely many discontinuities. The theorem can obviously be extended to deal with finitely many random variables. (This will be important later.)
To conclude this section we note the main generalisation of the notion of independence
to more than two random variables, that of mutual independence.
for all finite subsets Xi(1) , . . . , Xi(r) of the random variables and values xj ∈ R.
This definition captures the idea that no possible information about any finite
combination of any of the RVs gives any further information about likely values of
any finite combination of any of the other RVs. Loosely speaking, it says that all
pairs of RVs amongst X1 , X2 , . . . are independent, as are all triples, and all n-tuples.
This is, therefore, a more stringent requirement than just pairwise independence,
where every pair Xi , Xj of the collection of RVs is independent.
This leads on to the critical notion in statistics of a collection of random variables
X1 , X2 , . . . being independent and identically distributed (IID). A collection of
random variables X1 , X2 , . . . is said to be IID if the Xi ’s are mutually independent
and all have the same distribution.
2.4 Problems
3 Expectation, Covariance and Correlation

For the meaning of 'nice' see the discussion after Theorem 7. Basically, so long as Y = g(X_1, . . . , X_n) is a random variable, E[Y] is defined by the appropriate sum or integral. Also, we note that we can take g in this theorem to map R^n to R^m, so long as we interpret the expectation of a vector as the vector of expectations. (g has to be quite unpleasant for Y = g(X_1, . . . , X_n) not to be a random variable.) (Also note that this is a theorem, not a definition.)
To check that this agrees with the case of a single random variable, take the
function g to be g(X1 , . . . , Xn ) = g(X1 ). Then
E[g(X_1)] = \int \cdots \int_{\mathbb{R}^n} g(x_1)\, f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \ldots dx_n
          = \int_{-\infty}^{\infty} g(x_1) \left[ \int \cdots \int_{\mathbb{R}^{n-1}} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_2 \ldots dx_n \right] dx_1
          = \int_{-\infty}^{\infty} g(x_1)\, f_{X_1}(x_1)\, dx_1.
Solution. With reference to the diagram from Example 1, Theorem 9 tells us that

E[Z] = E[XY] = \iint_{\mathbb{R}^2} xy\, f_{X,Y}(x, y)\, dx\, dy = 2 \int_0^1 \int_0^y xy(x + y)\, dx\, dy = \cdots = \frac{1}{3}.
Proof. Omitted as an exercise. To avoid long sums and integrals one approach is to
prove each of E[ag1 (X)] = aE[g1 (X)] and E[g1 (X) + g2 (X)] = E[g1 (X)] + E[g2 (X)]
(of which E[a + g1 (X)] = a + E[g1 (X)] is a special case); then combine these.
Taking another special case for the function g, we can prove the following impor-
tant result concerning the expectation of a sum of random variables.
Proof. Left as an exercise. Either take g(X_1, . . . , X_n) = \sum_{i=1}^n X_i in Theorem 9 or use Proposition 10 and induction.
Theorem 12. If X and Y are independent then, for any functions g and h,
Proof. We treat the case where X and Y are jointly continuous. The independence
of X and Y implies that fX,Y (x, y) = fX (x) fY (y) for all x, y. Therefore
E[g(X) h(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y) f_X(x) f_Y(y)\, dx\, dy
             = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx \int_{-\infty}^{\infty} h(y) f_Y(y)\, dy
3.2 Covariance and correlation

We now look at covariance and correlation, which measure the extent to which the values of two random variables are associated with each other.
Definition 14. The covariance of two RVs X and Y , denoted by cov(X, Y ), is defined
by
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
By evaluating the expectation in this definition (see problems), one can arrive at
the alternative (and often more computationally convenient) formula
Also note that the covariance of a random variable with itself is just its variance: from the definitions,

cov(X, X) = E[(X − E[X])(X − E[X])]
          = E[(X − E[X])^2]
          = var(X).
In particular,

(a) var\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2\, var(X_i) + 2 \sum_{1 ⩽ i < j ⩽ n} a_i a_j\, cov(X_i, X_j);

(b) if X_1, . . . , X_n are independent then var\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2\, var(X_i);

(c) if X_1, . . . , X_n are IID then var\left( \sum_{i=1}^n a_i X_i \right) = var(X_1) \sum_{i=1}^n a_i^2.
Proof. Statement (i) follows immediately from the definition of covariance. Statement (ii) follows from the definition and Proposition 13 (use the alternative form and the Proposition).

var(S) = var\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n var(X_i) = n\, var(X_1) = np(1 − p).
Example 9. Suppose that X has mean 1 and variance 2, Y has mean 2 and variance
4, and cov(X, Y ) = 1. Find cov(2X + Y, 2Y − X).
cov(2X + Y, 2Y − X) = 4 cov(X, Y) − 2 var(X) + 2 var(Y) − cov(Y, X)
                    = 2 var(Y) − 2 var(X) + 3 cov(X, Y) = 2 · 4 − 2 · 2 + 3 · 1 = 7.
The following technical result will be used to get useful bounds on the covariance.

Proof. Let \tilde{X} = \frac{X − E[X]}{\sqrt{var(X)}} and \tilde{Y} = \frac{Y − E[Y]}{\sqrt{var(Y)}}. Then, using Proposition 15,

cov(\tilde{X}, \tilde{Y}) = cov\left( \frac{X}{\sqrt{var(X)}}, \frac{Y}{\sqrt{var(Y)}} \right) = \frac{cov(X, Y)}{\sqrt{var(X)}\sqrt{var(Y)}}.

Now, since E[\tilde{X}] = E[\tilde{Y}] = 0 and var(\tilde{X}) = var(\tilde{Y}) = 1, we can apply Lemma 16 to find that

\frac{|cov(X, Y)|}{\sqrt{var(X)}\sqrt{var(Y)}} = |cov(\tilde{X}, \tilde{Y})| = |E[\tilde{X}\tilde{Y}]| ⩽ 1.
Definition 18. Assume that var(X) var(Y) ≠ 0. The (Pearson) correlation of X and Y, denoted by ρ(X, Y), is defined by

ρ(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)\, var(Y)}}.

(i) −1 ⩽ ρ(X, Y) ⩽ 1.

(iii) ρ(aX + b, cY + d) = \frac{ac\, cov(X, Y)}{|ac| \sqrt{var(X)\, var(Y)}} = \begin{cases} ρ(X, Y), & ac > 0, \\ −ρ(X, Y), & ac < 0. \end{cases}

ρ(X, Y)^2 = \frac{cov(X, Y)^2}{var(X)\, var(Y)} ⩽ 1,

ρ(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)\, var(Y)}} = \frac{0}{\sqrt{var(X)\, var(Y)}} = 0.
Lastly we have

cov(Y_1, Y_2) = cov(X_0 + X_1, X_0 + X_2) = λ_0 + 0 + 0 + 0 = λ_0,

so

ρ(Y_1, Y_2) = \frac{cov(Y_1, Y_2)}{\sqrt{var(Y_1)\, var(Y_2)}} = \frac{λ_0}{\sqrt{(λ_0 + λ_1)(λ_0 + λ_2)}}.
Definition 20. Let k be a positive integer. The k-th (raw) moment of X is defined
to be mk = E[X k ] and the k-th central moment of X is σk = E[(X − m1 )k ].
Definition 21. For positive integers j, k, the raw moments are mjk = E[X j Y k ] and
the central moments σjk = E[(X − m10 )j (Y − m01 )k ].
E[X] = µ
for the matrix of variances and covariances of X, i.e. [var(X)]ij = cov(Xi , Xj ). Often
this variance-covariance matrix (or just covariance matrix) is called Σ; it then
follows that [Σ]ij = cov(Xi , Xj ) = ρij σi σj . Note that [Σ]ii = σi2 . But [Σ]ii ̸= σi .
3.4 Problems
1. Complete the proofs of Proposition 10, that E[a g_1(X_1, . . . , X_n) + b g_2(X_1, . . . , X_n) + c] = a E[g_1(X_1, . . . , X_n)] + b E[g_2(X_1, . . . , X_n)] + c for all a, b, c ∈ R; and Proposition 11, that E[\sum_{i=1}^n X_i] = \sum_{i=1}^n E[X_i].
2. Find the correlation ρ(X, Y ) and justify why X and Y are dependent, where
(b) X and Y are uniformly distributed on the triangle in R2 with corners (0, 0),
(1, 1) and (2, 0).
where i, j = 1, . . . , n and i ≠ j, and let S = \sum_{i=1}^n X_i. Show that
6. Suppose that n skiers, each having a distinct pair of skis, are sharing a chalet.
At the end of a hard day on the piste, they each remove their skis and throw
them in a heap on the porch floor. The following morning, still befuddled by
the previous night’s après-ski, they each choose two skis completely at random
from the heap. Let N be the number of skiers that end up with a matching
pair of skis. Determine E[N ] and var(N ).
Hint: Begin by writing N as the sum of n variables, where the i-th variable is 1 or 0 according as the i-th skier does or does not get a matching pair of skis.
4 Conditional distributions and expectations
P(E | F) = \frac{P(E ∩ F)}{P(F)}, \quad \text{if } P(F) > 0.

p_{X|Y}(x | y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} \quad \text{(by definition)} \quad = \frac{p_{X,Y}(x, y)}{p_Y(y)},

P(X ∈ C | Y ∈ D) = \frac{P(X ∈ C, Y ∈ D)}{P(Y ∈ D)},

so long as P(Y ∈ D) > 0.
For example, if C = (a, b] and D = (c, d] with P(Y ∈ (c, d]) > 0, then

P(X ∈ (a, b] | Y ∈ (c, d]) = \frac{P(X ∈ (a, b], Y ∈ (c, d])}{P(Y ∈ (c, d])} = \frac{\int_a^b \int_c^d f_{X,Y}(u, v)\, dv\, du}{\int_{-\infty}^{\infty} \int_c^d f_{X,Y}(u, v)\, dv\, du}.
P(X ∈ A | Y = y) = \frac{P(X ∈ A, Y = y)}{P(Y = y)} = \frac{0}{0},

P(X ∈ A | Y ∈ (y, y + h]) = \frac{P(X ∈ A, Y ∈ (y, y + h])}{P(Y ∈ (y, y + h])}
  = \frac{\int_A \int_y^{y+h} f_{X,Y}(u, v)\, dv\, du}{\int_{-\infty}^{\infty} \int_y^{y+h} f_{X,Y}(u, v)\, dv\, du}
  ≈ \frac{h \int_A f_{X,Y}(u, y)\, du}{h\, f_Y(y)} \quad \text{(for small h)}
  = \int_A \frac{f_{X,Y}(u, y)}{f_Y(y)}\, du.
This calculation justifies the following definition. (The conditioning event above has positive probability, for small h, precisely when f_Y(y) > 0.)

Definition 22. If X and Y have a joint PDF f_{X,Y} then the conditional PDF of X, given that Y = y, is defined by

f_{X|Y}(x | y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
Solution. We first find the marginal PDF of Y , then fX|Y = fX,Y /fY . Then we will
be able to find P (X > 1 | Y = y) by integrating fX|Y (x | y) in the region x > 1.
Clearly fY (y) = 0 for y ⩽ 0. For y > 0 we have
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx = \int_0^{\infty} e^{-(x/y + y)}\, y^{-1}\, dx = \cdots = e^{-y}.
Note here how we write the domain of the conditional PDF fX | Y . The value
Y = y is given, so we think of it as a parameter. Thus fX | Y (x | y) is usually thought
of as a function of x only.
For example, if we have the joint density

f_{X,Y}(x, y) = \begin{cases} C \exp(−x − y − αxy), & x, y > 0, \\ 0, & \text{otherwise}, \end{cases}

with α > 0 a parameter and C a normalising constant, then it can be shown (exercise!) that, for given y > 0,

f_{X|Y}(x | y) = \begin{cases} (1 + αy)\, e^{-x(1 + αy)}, & x > 0, \\ 0, & x ⩽ 0. \end{cases}
4.2 Conditional distributions and independence

We now return briefly to the notion of independence (see Section 2) to consider how it relates to the conditional distributions we have just introduced.
The initial intuition behind independence was that the joint distribution of (X, Y )
is determined by the marginal distributions: FX,Y (x, y) = FX (x) FY (y) for all x, y.
In Section 2 we also used the informal characterisation that knowing the value of
one variable gives no information about the possible values of the other. In terms of
probability functions, we might expect this to manifest as a relationship saying that
all conditional distributions are identical to the marginal distributions. Conditions 3
and 4 of the next theorem show that this is indeed the case.
Theorem 23. For jointly continuous random variables X and Y , the following are
equivalent:
Now suppose that fX|Y (x | y) = fX (x) for all x ∈ R and y with fY (y) > 0. Then
• for y with fY (y) > 0 we have fX,Y (x, y) = fX|Y (x | y) fY (y) = fX (x) fY (y);
Suppose we are told that Y = y. Conditional upon this, we can characterise the
‘new’ distribution of X: it is the conditional PDF fX|Y (x | y), which we think of as
a function of x. Just like any other random variable (or corresponding PDF) we can
find its expected value.
Example 12. Determine E[X | Y = y] (y > 0) for the random variables X and Y in
Example 11.
Now

E[X | Y = y] = \int_{-\infty}^{\infty} x\, f_{X|Y}(x | y)\, dx = \int_0^{\infty} \frac{x}{y}\, e^{-x/y}\, dx = \cdots = y.
Definition 25. Recall that ψ(y) = E[X | Y = y]. Then ψ(Y ) is called the conditional
expectation of X given Y ; written as E[X | Y ].
Example 13. Determine E[X | Y ] and E[X 2 | Y ] for the random variables X and Y
in Example 11.
Since E[X | Y ] is a random variable, we can take its expectation. The next result
(known as the tower property and the law of iterated expectations, amongst
other names) looks a bit odd at first. It says that if we want to calculate E[X] we
can first fix the value of Y and then average over this value later. We will soon see
that it can be very useful for finding properties of random variables which are defined
‘indirectly’, since then the conditional expectation can be easy to calculate.
Theorem 26. For any random variables X and Y , we have E[E[X|Y ]] = E[X].
Proof. We only consider the case when X and Y are jointly continuous.

E[E[X | Y]] = \int_{-\infty}^{\infty} E[X | Y = y]\, f_Y(y)\, dy \quad (E[X | Y] = ψ(Y) \text{ is a function of the RV } Y)
            = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\, f_{X|Y}(x | y)\, f_Y(y)\, dx\, dy \quad (\text{definition of } E[X | Y = y])
            = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\, f_{X,Y}(x, y)\, dx\, dy \quad (\text{definition of } f_{X|Y})

E[X] = E[E[X | Y]] = E[ψ(Y)] = \int ψ(y) f_Y(y)\, dy = \int E[X | Y = y] f_Y(y)\, dy
Example 14. Suppose that Y ∼ Exp(λ) and (X|Y = y) ∼ Poi(y). Find E[X].
= E[X_{n−2}] + 4
  ⋮
= E[X_0] + 2n = 2n.
For the proof we need the notation 1A , a random variable called the indicator
function of the event A: for any event A we define
1_A = \begin{cases} 1, & \text{if } A \text{ occurs}, \\ 0, & \text{otherwise}. \end{cases}
Example 16. As in Example 14, suppose that Y ∼ Exp(λ) and (X|Y = y) ∼ Poi(y).
Determine P (X = 0).
Proof. Left as an exercise. Start by using Prop 27 to find the CDF (A = {Y ⩽ y})
E[T | N = n] = E[X_1 + · · · + X_N | N = n] = E[X_1 + · · · + X_n] = \sum_{i=1}^n E[X_i] = n\, E[X_1].
Here we note the convention that an empty sum takes the value 0, i.e. \sum_{i=1}^{0} X_i = 0. This ensures that the random sum T is well defined when N can take the value 0.
Example 17. Suppose I have a chicken that lays N eggs per week, where the weights
of eggs are IID N(µE , σE2 ) random variables and N ∼ Poi(λ), independently of the
egg weights. Find the expected total weight of eggs laid in a week.
Solution. Write X_i for the weight of the i-th egg, so T = \sum_{i=1}^{N} X_i is the total weight of eggs laid in a week.
Since N and all the Xi s are mutually independent, apply Wald’s equation to find
that E[T ] = E[N ] E[X1 ] = λµE .
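Wald's equation is easy to check by simulation. The sketch below uses the illustrative values lambda = 6, muE = 60 and sdE = 5 (these numbers are not from the notes):

set.seed(2)
lambda <- 6; muE <- 60; sdE <- 5
total_weight <- replicate(1e4, {
  N <- rpois(1, lambda)                        # number of eggs this week
  if (N == 0) 0 else sum(rnorm(N, muE, sdE))   # empty-sum convention when N = 0
})
mean(total_weight)   # should be close to lambda * muE = 360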
2. if Y ⩾ 0 then E[Y | Z] ⩾ 0,
3. E[1 | Z] = 1,
Proof. We won’t prove these results here, but if you are interested in studying prob-
ability further then I strongly suggest that you treat these proofs, assuming that all
relevant random variables are jointly continuous for simplicity, as an exercise. The
exception is item 3, where ‘1’ is a random variable which takes the value 1 with prob-
ability 1, and is therefore not continuous. In this case use the PMF, and note the
very simple PMF for this random variable.
4.4 Conditional variance

Note that both expected values in this expression are conditional expectations. If we expand the squared term in the definition and (carefully) use the properties of expectation, we can find a formula for conditional variance analogous to var(X) = E[X^2] − E[X]^2:

var(X | Y) = E[\, X^2 − 2X E[X | Y] + E[X | Y]^2 \mid Y\, ]
           = E[X^2 | Y] − 2 E[\, X E[X | Y] \mid Y\, ] + E[\, E[X | Y]^2 \mid Y\, ]
           = E[X^2 | Y] − 2 E[X | Y]\, E[X | Y] + E[X | Y]^2
           = E[X^2 | Y] − E[X | Y]^2.
Theorem 32. For any random variables X and Y for which var(X) is defined, we
have
var(X) = E[var(X | Y )] + var(E[X | Y ]).
Proof. Taking expectations of the formula for var(X | Y ) and using the tower property,
we find that
as required.
This result has many important applications in probability and statistics (perhaps
most immediately in Analysis of Variance and linear regression). Here we look at a
short probabilistic example.
Example 18. Recall Example 14, where X | Y ∼ Poi(Y ) with Y ∼ Exp(λ) and we
found that E[X] = E[Y ] = λ−1 . Find var(X).
Therefore E[X | Y ] = Y and var(X | Y ) = Y . Now we can use the law of total variance
to find that
var(X) = E[var(X | Y)] + var(E[X | Y]) = E[Y] + var(Y) = \frac{1}{λ} + \frac{1}{λ^2}.
Note that here we have E[X] = E[Y], but var(X) > var(Y). In var(X) there is a contribution from the variability of the Poisson distribution and also a contribution from the fact that X is a mixture of many different Poisson distributions (one for each value of Y).
[Sketch idea: the Poisson mass functions P(X = x | Y = y) for means y = 1, 5, 10, say, with P(X = x) overlaid on top; there is more variability in X than in any one of the X | Y = y's. One could also think of X as the height of a randomly chosen person and Y as their age.]
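A simulation sketch of Example 18 (lambda = 2 is an arbitrary illustrative value): simulate Y ~ Exp(lambda) and then, given Y, draw X ~ Poisson(Y).

set.seed(3)
lambda <- 2
Y <- rexp(1e5, rate = lambda)
X <- rpois(1e5, lambda = Y)   # one Poisson draw per simulated Y value
mean(X)   # should be close to 1/lambda = 0.5
var(X)    # should be close to 1/lambda + 1/lambda^2 = 0.75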
4.5 Problems
1. Find P {Y > 1/2 | X < 1/2} if the joint PDF of X and Y is given by
f_{X,Y}(x, y) = \frac{6}{7}\left( x^2 + \frac{xy}{2} \right), \quad 0 < x < 1,\ 0 < y < 2.
2. A boy and a girl arrange to meet at a certain location during the lunch hour.
Denote the boy’s arrival time by X and the girl’s by Y ; and assume that
f_{X,Y}(x, y) = \frac{3}{7}\, y \left( x + \frac{1}{2} y \right), \quad 0 ⩽ x ⩽ 1,\ 0 ⩽ y ⩽ 2.
6. Prove Proposition 28. (See the hint given in the ‘proof’ in the notes.)
8. (Exercise from Section 4.4.) Suppose that {X_i, i = 1, 2, . . . } are IID and independent of the non-negative integer-valued RV N and set T = \sum_{i=1}^{N} X_i. Find var(T) in terms of the moments of X_1 and N.
Part II
Transformations, generating
functions and inequalities
5 Transformations
Very often the random variables that are of interest in a practical situation do not
have their distributions directly available. They may be defined in terms of simpler
random variables whose distributions we do know. For example, the lifetime of an
electronic appliance might be the minimum of the lifetimes of its components. A
simpler example involves rescaling: given the PDF of daily maximum temperatures
at a certain location in Fahrenheit then can we translate this into the PDF of these
daily maximum temperatures in Celcius? If we know the distribution of velocities of
molecules in a gas then what can we say about the distribution of kinetic energies,
which are proportional to the squared velocities?
Recall the following programme (from MATH1001) for finding the PDF of Y = g(X),
where X is a continuous RV with known PDF fX and g : R → R.
Example 19. Find the PDF of Y = cX, where c > 0 and X ∼ exp(λ) with λ > 0.
Therefore f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{λ}{c}\, e^{-(λ/c)y} for y > 0 and f_Y(y) = 0 otherwise. In other words, Y ∼ Exp(λ/c).
Y ⩽ y \iff X^2 ⩽ y \iff −\sqrt{y} ⩽ X ⩽ \sqrt{y},

so

F_Y(y) = P(Y ⩽ y) = P(−\sqrt{y} ⩽ X ⩽ \sqrt{y}) = F_X(\sqrt{y}) − F_X(−\sqrt{y}).

f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left( F_X(\sqrt{y}) − F_X(−\sqrt{y}) \right)
       = \frac{1}{2\sqrt{y}} \left\{ f_X(\sqrt{y}) + f_X(−\sqrt{y}) \right\} \quad \left( \text{e.g. } \frac{d}{dy} F_X(\sqrt{y}) = f_X(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} \right)
       = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}.
(Point out steps 1, 2 and 3 from above, and the differences in the \iff and F_Y(y) = . . . steps compared with the previous example.)
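A simulation sketch of Example 20: the squares of standard normal draws should follow the density 1/sqrt(2*pi*y) * exp(-y/2) derived above (the chi-squared distribution with 1 degree of freedom).

set.seed(4)
y <- rnorm(1e5)^2
hist(y, breaks = 100, freq = FALSE, xlim = c(0, 6), main = "Y = X^2 for X ~ N(0, 1)")
curve(exp(-x / 2) / sqrt(2 * pi * x), from = 0.01, to = 6, add = TRUE, lwd = 2)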
We now introduce some notation which will make our working in these problems
a bit simpler. We write A = {x : fX (x) > 0} for the values that X can take and
g(A) = {g(x) : x ∈ A} for the values that Y can take. A is called the support of
X or fX and g(A) is the support of Y = g(X) or fY . In the above examples A and
g(A) are as follows:
F_Y(y) = P(Y ⩽ y) = P(g(X) ⩽ y) = \begin{cases} P(X ⩽ g^{-1}(y)) = F_X(g^{-1}(y)), & g \text{ increasing}, \\ P(X ⩾ g^{-1}(y)) = 1 − F_X(g^{-1}(y)), & g \text{ decreasing}, \end{cases}

f_Y(y) = \begin{cases} \frac{d}{dy} F_X(g^{-1}(y)) = f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y), & g \text{ increasing}, \\ \frac{d}{dy}\left( 1 − F_X(g^{-1}(y)) \right) = −f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y), & g \text{ decreasing}. \end{cases}
Since f_X ⩾ 0 and \frac{d}{dy} g^{-1}(y) is positive or negative according as g is increasing or decreasing, we have established the following result. (It is an exercise to write this up as a formal proof.)
Example 21. Suppose that X ∼ N(µ, σ²) and Y = (X − µ)/σ. Show that Y ∼ N(0, 1).
Solution. First observe that we can write Y = g(X), where g(x) = (x − µ)/σ. Now
g is strictly increasing (since σ > 0) and differentiable on A = R. Thus, since X is
continuous, we can apply Theorem 33 to find the distribution of Y .
First note that f_X(x) = \frac{1}{\sqrt{2\pi}\,σ} \exp\left\{ −\frac{1}{2} \left( \frac{x − µ}{σ} \right)^2 \right\} for all x ∈ R. To find g^{-1} we note that

y = (x − µ)/σ \iff x = µ + σy,

f_Y(y) = f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y)
       = \frac{1}{\sqrt{2\pi}\,σ} \exp\left\{ −\frac{1}{2} \left( \frac{(σy + µ) − µ}{σ} \right)^2 \right\} × σ
       = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2},
Note that this result can easily be generalised to show that if X ∼ N(µ, σ 2 ) then
Y = a + bX has distribution Y ∼ N(a + bµ, b2 σ 2 ) for any a, b ∈ R with b ̸= 0 (see
problems).
5.2 2-dimensional transformations

J_H(y_1, y_2) = \begin{vmatrix} \dfrac{\partial}{\partial y_1} H_1(y_1, y_2) & \dfrac{\partial}{\partial y_2} H_1(y_1, y_2) \\[6pt] \dfrac{\partial}{\partial y_1} H_2(y_1, y_2) & \dfrac{\partial}{\partial y_2} H_2(y_1, y_2) \end{vmatrix}.
As in the previous section, we let A = {(x1 , x2 ) : fX1 ,X2 (x1 , x2 ) > 0} be the
support of fX1 ,X2 and T (A) = {(y1 , y2 ) = T (x1 , x2 ) : (x1 , x2 ) ∈ A} be the support of
fY1 ,Y2 , i.e. the (y1 , y2 ) values that will have non-zero probability density.
Theorem 34. Let (X_1, X_2) be a jointly continuous random vector with PDF f_{X_1,X_2}(x_1, x_2) and suppose that T : A ⊆ R² → R² is one-to-one, so H = T^{-1} exists, and that H has continuous first-order partial derivatives in T(A). Then (Y_1, Y_2) = T(X_1, X_2) has a joint PDF given by

f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} f_{X_1,X_2}(H_1(y_1, y_2), H_2(y_1, y_2)) \cdot |J_H(y_1, y_2)|, & (y_1, y_2) ∈ T(A), \\ 0, & \text{otherwise}. \end{cases}
Proof. The proof is omitted, but can be viewed as an application of the change of
variables formula from first year calculus and the definition of a joint PDF.
A point that sometimes makes calculations simpler is that, under the conditions
of the theorem, we have JT −1 = (JT )−1 . Similarly, under the other conditions of
the theorem, T has continuous first-order partial derivatives if and only if T −1 does.
(Think polar co-ordinates, etc.)
Y_1 = X_1 + X_2, \quad Y_2 = X_1 − X_2.

y_1 = x_1 + x_2, \quad y_2 = x_1 − x_2.
We can uniquely solve these equations (simultaneous linear equations) for x_1 and x_2:

x_1 = \frac{y_1 + y_2}{2} = H_1(y_1, y_2), \qquad x_2 = \frac{y_1 − y_2}{2} = H_2(y_1, y_2),

J_H(y_1, y_2) = \begin{vmatrix} \frac{\partial H_1}{\partial y_1} & \frac{\partial H_1}{\partial y_2} \\[4pt] \frac{\partial H_2}{\partial y_1} & \frac{\partial H_2}{\partial y_2} \end{vmatrix} = \begin{vmatrix} \frac{1}{2} & \frac{1}{2} \\[4pt] \frac{1}{2} & −\frac{1}{2} \end{vmatrix} = −\frac{1}{2}.
[Sketch: A is the first quadrant x_1, x_2 > 0. The segment of the line x_1 + x_2 = c lying in A maps to the segment of the vertical line y_1 = c with −c ⩽ y_2 ⩽ c, so T(A) is the wedge y_1 ⩾ 0, −y_1 ⩽ y_2 ⩽ y_1, bounded by the lines y_2 = y_1 and y_2 = −y_1.]
f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(H_1(y_1, y_2), H_2(y_1, y_2)) × |J_H(y_1, y_2)| = e^{-\frac{1}{2}(y_1 + y_2) - \frac{1}{2}(y_1 − y_2)} × \frac{1}{2},

so

f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} \frac{1}{2}\, e^{-y_1}, & y_1 ⩾ 0,\ −y_1 ⩽ y_2 ⩽ y_1, \\ 0, & \text{otherwise}. \end{cases}
Example 23. Suppose that X_1 and X_2 are IID exponential RVs with parameter λ. Find the joint PDF of Y_1 = X_1/X_2 and Y_2 = X_1 + X_2; and thus the PDF of Y_1.
y_1 = x_1/x_2, \quad y_2 = x_1 + x_2.

From the first equation x_1 = x_2 y_1, so y_2 = x_2 y_1 + x_2 = x_2(y_1 + 1), giving

x_2 = \frac{y_2}{y_1 + 1} \quad \text{and then} \quad x_1 = \frac{y_1 y_2}{y_1 + 1},

so the solution is unique and T^{-1} = H is defined by H(y_1, y_2) = \left( \frac{y_1 y_2}{1 + y_1}, \frac{y_2}{1 + y_1} \right).

This function is clearly suitably differentiable on R²₊, and

J_H(y_1, y_2) = \begin{vmatrix} \frac{y_2}{(1 + y_1)^2} & \frac{y_1}{1 + y_1} \\[6pt] -\frac{y_2}{(1 + y_1)^2} & \frac{1}{1 + y_1} \end{vmatrix} = \cdots = \frac{y_2}{(1 + y_1)^2}.
To determine T(A), first note that T(A) ⊆ {(y_1, y_2) : y_1, y_2 > 0} is immediate from the form of T. The formula above for H implies that for every (y_1, y_2) with y_1, y_2 > 0 there is a corresponding (x_1, x_2) with x_1, x_2 > 0; so T(A) = {(y_1, y_2) : y_1, y_2 > 0}.
We could also take the approach above. Since x1 , x2 ∈ (0, ∞) we have y1 =
x1 /x2 ∈ (0, ∞). Now fix y1 = c ∈ (0, ∞). The corresponding (x1 , x2 ) values are given
by x2 = c−1 x1 , for x1 ∈ (0, ∞) (since x2 ⩾ 0 automatically if x1 ⩾ 0). This implies
that y2 = x1 + x2 = (1 + c−1 )x1 for x1 ∈ (0, ∞), so y2 ∈ (0, ∞).
[Sketch: in the (x_1, x_2) plane, the half-line x_2 = c^{-1} x_1 (x_1 > 0); in the (y_1, y_2) plane it maps to the vertical half-line y_1 = c, y_2 > 0.]
f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(H_1(y_1, y_2), H_2(y_1, y_2)) \cdot |J_H(y_1, y_2)|
                     = f_{X_1,X_2}\left( \frac{y_1 y_2}{1 + y_1}, \frac{y_2}{1 + y_1} \right) \cdot \frac{y_2}{(1 + y_1)^2}
                     = λ^2 e^{-λ y_2}\, \frac{y_2}{(1 + y_1)^2},

so

f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} λ^2 e^{-λ y_2}\, \dfrac{y_2}{(1 + y_1)^2}, & y_1, y_2 > 0, \\ 0, & \text{otherwise}. \end{cases}
Here we were interested only in the distribution of Y1 , but (x1 , x2 ) 7→ x1 /x2 is not
one-to-one. By including Y2 we get a transformation which is one-to-one, enabling
the use of this method. (We could also have used the method of Example 4: integrate
fX1 ,X2 over the region {(x1 , x2 ) : x1 /x2 ⩽ y1 } for arbitrary y1 ⩾ 0 to find the CDF
FY1 (y1 ), then differentiate the CDF to get the PDF.)
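A simulation sketch of the conclusion (lambda = 3 is an arbitrary choice; the distribution of Y1 should not depend on lambda): Y1 = X1/X2 has CDF 1 - 1/(1 + y) for y > 0, matching the PDF (1 + y)^{-2}.

set.seed(5)
lambda <- 3
y1 <- rexp(1e5, rate = lambda) / rexp(1e5, rate = lambda)
mean(y1 <= 1)   # should be close to 1 - 1/2 = 0.5
mean(y1 <= 4)   # should be close to 1 - 1/5 = 0.8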
Unsurprisingly the methods of the previous section can be extended to deal with more
than 2 random variables at a time. The only additional complicating factor is the
slightly cumbersome notation (which can be avoided with vector notation). Here we
just state the general multidimensional version of Theorems 33 and 34.
Using vector notation: f_Y(y) = f_X(H(y)) \cdot \left| \frac{\partial H}{\partial y} \right|, \quad y ∈ T(A),
5.4 Problems
where α and β are two positive constants, use the monotone transformation
theorem to show that Y = X β has a well-known named distribution and find
its parameter/s.
4. Write out a formal (statement and) proof of Theorem 33 in the notes, which
gives the PDF of a monotone transformation of a continuous random variable.
Use the theorem on monotone transformations from the notes to find the prob-
ability density function of the random variable Y = log X.
6 Generating Functions
Generating functions are tools for studying probability distributions; they are defined
as the expectation of a certain function of a random variable with that distribution.
Using generating functions often makes calculations involving random variables much
simpler than dealing directly with CDFs, PDFs or PMFs. Generating functions are
particularly useful when analysing sums of independent random variables.
6.1 Moment generating functions

Definition 36. For any random variable X, the moment generating function
(MGF) of X, MX : A ⊆ R → [0, ∞) is defined by MX (t) = E[etX ], so long as E[etX ]
exists and is finite in an open interval containing t = 0. The domain A of MX consists
of all values t for which E[etX ] exists and is finite.
If E[etX ] is not finite in an open interval containing t = 0 then we say that the
MGF does not exist.
Aside The condition ‘E[etX ] exists and is finite in an open interval containing t = 0’
can be stated equivalently as ‘there exists h > 0 such that E[etX ] exists and is finite
for all t ∈ (−h, h)’.
Solution. The approach is to apply the working definition of the expected value of
a function of a random variable and calculate the resulting integral.
So M_X(t) = e^{\frac{1}{2}t^2} for all t ∈ R.
Example 25. Determine the MGF of a random variable X with an exponential dis-
tribution, i.e. fX (x) = λe−λx for x > 0, where λ > 0 is a parameter.
Now, \int_0^{\infty} e^{x(t−λ)}\, dx diverges if t − λ ⩾ 0, so M_X(t) does not exist for t ⩾ λ. (Sketch e^{cx} for c ⩾ 0 and c < 0.)

For t − λ < 0, i.e. t < λ, we have

M_X(t) = λ \int_0^{\infty} e^{x(t−λ)}\, dx = λ \left[ \frac{e^{x(t−λ)}}{t − λ} \right]_{x=0}^{\infty} = \frac{λ}{t − λ}\,(0 − 1) = \frac{λ}{λ − t}.

So M_X(t) = \frac{λ}{λ − t}, \quad t < λ.
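A quick numerical check of this MGF (a sketch with the arbitrary choices lambda = 2 and t = 0.5):

lambda <- 2; t <- 0.5
integrate(function(x) exp(t * x) * lambda * exp(-lambda * x), 0, Inf)$value  # E[e^{tX}]
lambda / (lambda - t)   # both should be about 1.3333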
Example 26. Find the MGF of a standard Cauchy random variable, with PDF f_X(x) = \frac{1}{\pi(1 + x^2)}, x ∈ R.

• when t > 0 we have \frac{e^{tx}}{\pi(1 + x^2)} → ∞ as x → ∞, so E[e^{tX}] is not finite,

• when t < 0 we have \frac{e^{tx}}{\pi(1 + x^2)} → ∞ as x → −∞, so E[e^{tX}] is not finite.
Therefore E[etX ] exists and is finite for t = 0 only. This is not an open interval
containing 0, so the MGF does not exist.
6.2 Properties of MGFs

We now investigate some properties of moment generating functions, which will help us to see how they can be useful.
M_X^{(n)}(t) = \frac{d^n}{dt^n} \int_{-\infty}^{\infty} e^{tx} f_X(x)\, dx
             = \int_{-\infty}^{\infty} \frac{d^n}{dt^n} e^{tx}\, f_X(x)\, dx \quad \text{(rearrange)}
             = \int_{-\infty}^{\infty} x^n e^{tx} f_X(x)\, dx \quad \text{(evaluate the derivative)}

M_X^{(n)}(0) = E[X^n].
M_X(t) = \sum_{n=0}^{\infty} a_n t^n = \sum_{n=0}^{\infty} \frac{M_X^{(n)}(0)}{n!}\, t^n = \sum_{n=0}^{\infty} \frac{E[X^n]}{n!}\, t^n
This gives another possible way to find the (raw) moments of X from the MGF:
Equating coefficients of tn above we have E[X n ] = n! an . (If you can identify
the Maclaurin series for a MGF then this allows you to find moments without
having to do any differentiation.)
Consider for example the MGF M_X(t) = \frac{λ}{λ − t}, t < λ, of X ∼ Exp(λ).

1. M_X(0) = λ/λ = 1.

2. E[X] = M_X'(0) = \left. \frac{λ}{(λ − t)^2} \right|_{t=0} = \frac{1}{λ}.

3. From \frac{1}{1 − x} = \sum_{n=0}^{\infty} x^n we have M_X(t) = \frac{λ}{λ − t} = \frac{1}{1 − t/λ} = \sum_{n=0}^{\infty} \left( \frac{t}{λ} \right)^n = \sum_{n=0}^{\infty} \frac{1}{λ^n}\, t^n. It then follows that (a_n =) \frac{E[X^n]}{n!} = \frac{1}{λ^n}, so E[X^n] = \frac{n!}{λ^n}.
Uniqueness of generating functions Because MGFs are power series, the rich
theory of power series (as widely used in applied mathematics) applies to them. This
includes results concerning the radius of convergence and the validity of term-by-term
integration and differentiation within that radius (which we have implicitly used in
our calculations above). One other such result is the Inversion Theorem, which tells
us how to calculate a CDF from a given MGF. (We don’t present this here as it
involves complex integration.) This has the crucial consequence that a probability
distribution is uniquely specified by its MGF; so long as the MGF exists.
Theorem 37. Suppose that the random variables X and Y have MGFs MX and MY .
Then X and Y have the same distribution (CDF) if and only if MX (t) = MY (t) for
all t in an open interval containing 0.
This means that, in most situations, the MGF of a random variable uniquely
specifies the distribution of the random variable. The condition ‘for all t in an open
62 6. Generating Functions
interval containing 0’ (or ‘so long as the MGF exists’) is crucial though. See, for
example, the Cauchy distribution above (Example 26).
It is important to note that this theorem does not mean “if two distributions have
the same moments then they have the same distribution”. (This statement is false in
general.)
Next we touch briefly on linear transformations of random variables.
Theorem 38. If X has MGF MX then Y = a + bX has MGF MY (t) = eat MX (bt).
6.3 Joint moment generating functions

so long as the expectation exists and is finite in an open set containing the origin.
The joint MGF has properties corresponding to single variable MGFs and the relationship between joint and marginal CDFs (and PDFs and PMFs):

2. The MGF of X_i (the marginal MGF) can be obtained from the joint MGF M_{X_1,...,X_n}(t_1, . . . , t_n) by letting all the t_j (j = 1, . . . , n) be zero except t_i. This is because

\frac{\partial}{\partial t_i} M_{X_1,...,X_n}(t_1, . . . , t_n) = \frac{\partial}{\partial t_i} E[e^{t_1 X_1} \cdots e^{t_n X_n}]

so \frac{\partial}{\partial t_i} M_{X_1,...,X_n}(0, . . . , 0) = E[X_i]. Similarly we can show that

\frac{\partial^2}{\partial t_i^2} M_{X_1,...,X_n}(0, . . . , 0) = E[X_i^2] \quad \text{and} \quad \frac{\partial^2}{\partial t_i \partial t_j} M_{X_1,...,X_n}(0, . . . , 0) = E[X_i X_j].
One of the main uses for generating functions is in calculations involving indepen-
dent random variables. A key property for these calculations is given by the following
theorem.
Proof. We give a proof for the case when X1 , . . . , Xn are jointly continuous, so the
RVs X1 , . . . , Xn are independent if and only if
If: Suppose that M_{X_1,...,X_n}(t_1, . . . , t_n) = M_{X_1}(t_1) . . . M_{X_n}(t_n). Then by the above
part we have that MX1 ,...,Xn is the same as the joint MGF of independent variables
with the same distribution as X1 , . . . , Xn . Since the MGF characterises the joint
distribution of X1 , . . . , Xn , we have that X1 , . . . , Xn must be independent.
Proof.
Theorem 42. If X_1, . . . , X_n are mutually independent and Y = \sum_{i=1}^n X_i then

M_Y(t) = \prod_{i=1}^n M_{X_i}(t).
µ = \sum_{i=1}^n c_i µ_i \quad \text{and} \quad σ^2 = \sum_{i=1}^n c_i^2 σ_i^2.

= \prod_{i=1}^n \exp\left( µ_i c_i t + \frac{1}{2} σ_i^2 (c_i t)^2 \right)
= \exp\left\{ \left( \sum_{i=1}^n c_i µ_i \right) t + \frac{1}{2} \left( \sum_{i=1}^n c_i^2 σ_i^2 \right) t^2 \right\}
= \exp\left( µt + \frac{1}{2} σ^2 t^2 \right), \quad t ∈ R.
It can be shown (exercise) that the MGF of X is M_X(t) = \frac{1}{(1 − 2t)^{r/2}} \quad (t < \tfrac{1}{2}).
Proposition 44. If X_i ∼ χ²_{r_i}, i = 1, . . . , n, are mutually independent then their sum Y = \sum_{i=1}^n X_i has a χ²_r distribution, where r = \sum_{i=1}^n r_i.

(for t < \tfrac{1}{2})

and the conclusion follows from the equivalence of MGFs and distributions.
It's worth pausing for a moment to consider these last two proofs. How would you prove these results without using generating functions (see Section 2.1)? Repeated convolution: set S_n = X_1 + · · · + X_n (for n = 1, 2, . . . ), so f_{S_1} = f_X and

f_{S_{n+1}}(s) = \int_{-\infty}^{\infty} f_X(x)\, f_{S_n}(s − x)\, dx.
We now look back to PGFs of N-valued random variables and see how they relate to
MGFs.
ϕ_X(s) = E[s^X] = \sum_{i=0}^{\infty} s^i p_X(i).

Note that there are no technical conditions about existence here. For s ∈ [0, 1] we have 0 ⩽ \sum_{i=0}^{\infty} s^i p_X(i) ⩽ \sum_{i=0}^{\infty} p_X(i) = 1, so the expectation in the definition always exists and is finite.
PGFs are useful because they have lots of properties (most of which have exact
MGF analogues). In the following list, all random variables X, Y , Z, Xi take values
in the non-negative integers.
1. X and Y have the same PGF if and only if they have the same PMF.
2. ϕX (1) = 1.
3. ϕ_X^{(n)}(1) = E[X(X − 1) . . . (X − (n − 1))] (the n-th factorial moment of X).

4. ϕ_X^{(n)}(0) = n!\, p_X(n). (We haven't seen the analogue of this.)

5. If X_1, . . . , X_n are independent then Y = \sum_{i=1}^n X_i has PGF ϕ_Y(s) = \prod_{i=1}^n ϕ_{X_i}(s).
PGFs and MGFs are related to each other in the following ways:
Example 27. Use ϕX (s) = MX (log s) and the fact that E[X] = MX′ (0) to derive the
formula E[X] = ϕ′X (1).
\frac{d}{dt} M_X(t) = \frac{d}{dt} ϕ_X(e^t) = ϕ_X'(e^t) \cdot e^t.

Therefore

E[X] = M_X'(0) = ϕ_X'(e^0) \cdot e^0 = ϕ_X'(1).
Why PGFs and MGFs? PGFs can be viewed as an easier to manage equivalent to
MGFs when the random variables being considered take non-negative integer values.
Besides the moment generating function and probability generating function, there are
other generating functions that you might come across in further studies of probability
and/or statistics. These all have their own uses in particular areas of study, but the
most important of these is the characteristic function φ_X(t) = E[e^{itX}], where i² = −1. The characteristic function has similar properties to other generating functions in that (i) there is a one-to-one correspondence between characteristic
functions and CDFs, and (ii) lots of properties of random variable/s can be expressed
through their characteristic functions. However, there are very few technical issues
regarding the existence of the expectation (convergence of the integral/sum) that
defines the characteristic function, so the theory is much neater in the sense that it
doesn’t have ‘so long as the expectation exists’ conditions floating around.
After a short detour into inequalities we will use (moment) generating functions
to help establish some key results in probability and applied statistics in sections 8
and 9.
6.6 Problems
1. (Exercise from notes) Show that if X ∼ χ²_r then M_X(t) = 1/(1 − 2t)^{r/2} (t < 1/2).
Hint: You'll need the integral to cancel out with the Gamma function.
Now adapt this to show that the MGF of Y ∼ Gamma(s, λ) is M_Y(t) = (λ/(λ − t))^s, t < λ.

2. (Exercise from notes) Given that Z ∼ N(0, 1) has MGF M_Z(t) = exp(t²/2) (t ∈ R) and the formula for standardising an arbitrary Normal random variable, show that the MGF of X ∼ N(µ, σ²) is M_X(t) = exp(µt + σ²t²/2) (t ∈ R).
4. Derive the formula var(X) = ϕ''(1) + ϕ'(1) − (ϕ'(1))² using the relationship between PGFs and MGFs and the corresponding result for MGFs.
5. Suppose that random variables X and Y have joint moment generating function
7 Markov's and Chebychev's Inequalities

This gives a bound on the probability that X (or |X|) is 'large' which depends only
on the mean of X. It is a very powerful result because it makes no other assumption
on the distribution of X.
Proof. Start by noting that for any c > 0 we have X ⩾ c · 1_{\{X ⩾ c\}}. (Consider the cases X ⩾ c and X < c.) This implies that

E[X] ⩾ E[c · 1_{\{X ⩾ c\}}] = c · P(X ⩾ c) + 0 · P(X < c) = c · P(X ⩾ c),

and dividing through by c gives the result.
Example 28. Use Markov’s inequality to bound the fraction of people who have in-
come more than three times the average income.
P(X ⩾ 3E[X]) ⩽ \frac{E[X]}{3E[X]} = \frac{1}{3}.
So no more than one third of the population can earn more than 3 times the
average income.
We can also use Markov’s inequality to show what it means for a random variable
to have zero variance.
P\left( |X − µ| ⩾ \frac{1}{n} \right) = P\left( |X − µ|^2 ⩾ \frac{1}{n^2} \right) ⩽ n^2\, var(X) = 0.
Whilst Markov's inequality is appealing and useful because of its simplicity, the proof above suggests the following bound, obtained by considering the random variable |X − E[X]|² (which is non-negative) and applying Markov's inequality:

P(|X − µ| ⩾ c) = P(|X − µ|^2 ⩾ c^2) ⩽ \frac{E[(X − µ)^2]}{c^2} = \frac{σ^2}{c^2}.
Example 29. Compare the tail probability P (X ⩾ 10) with bounds for that quantity
from Markov’s and Chebychev’s inequalities when X ∼ exp(1/2).
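The comparison can be done numerically in R (a sketch; recall E[X] = 2 and var(X) = 4 for X ~ Exp(1/2)):

pexp(10, rate = 1/2, lower.tail = FALSE)  # exact P(X >= 10) = exp(-5), about 0.0067
2 / 10                                    # Markov bound E[X]/10 = 0.2
4 / (10 - 2)^2                            # Chebychev bound var(X)/(10 - 2)^2 = 0.0625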
7.1 Problems
2. Suppose that X has mean 10 and takes values strictly between 5 and 15 with
probability 32 . Determine a bound for the standard deviation of X.
3. Find a bound for the variance of a non-negative random variable Z which has
mean 20 and satisfies P (Z ⩾ 25) = 15 .
4. If the random variable Xn has mean 4 and variance 3/n² (for all n = 1, 2, . . . ), how large does n have to be so that Xn takes a value between 3.9 and 4.1 with
how large does n have to be so that Xn takes a value between 3.9 and 4.1 with
probability at least 0.9?
Check your answer by bounding P (Xn ∈ (3.9, 4.1)) for values of n near the
minimum value that you find.
5. Determine a bound for P (|X| > 10) if X is non-negative with mean 5. Deter-
mine an additional bound if you also know that E[|X|3 ] = 100.
(Hint: if g(·) is an increasing function then {X > a} ⇐⇒ {g(X) > g(a)}.)
8 Multivariate normal distributions
First we need to establish some notation and recall some results from linear al-
gebra. Let µ denote an n-dimensional real column vector µ = (µ1 , . . . , µn )⊤ and let
Σ = (σij ) be an n × n real, symmetric, positive semi-definite matrix. Recall that
an n × n symmetric matrix Σ is called positive semi-definite iff x⊤ Σx ⩾ 0 for
all column vectors x ∈ Rn , and this is equivalent to all eigenvalues of Σ being non-
negative. In this situation the following are equivalent: (i) Σ is positive definite
(i.e. x⊤ Σx > 0 for all non-zero x ∈ Rn ), (ii) Σ is of full rank, (iii) all eigenvalues of
Σ are positive and (iv) Σ is invertible. In this latter case Σ−1 is also symmetric and
positive definite. Lastly, an affine transformation T : R^n → R^m is one which can be written in the form x ↦ y = Ax + b, where A ∈ R^{m×n} and b ∈ R^m. (Compare the one-dimensional case: µ ≈ µ, A ≈ σ and Σ ≈ σ²; cf. the Σ of Section 3.3.)
8.1 Characterisation
We note for later reference that, since A is of rank l, Σ = AA⊤ also has rank l.
We will focus mainly on the situation where l = n, i.e. Σ is of full rank. In Section 8.4
we will touch on what is special/unusual about the case when Σ is not of full rank.
Derivation of PDF

f_Z(z) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z_j^2} = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} z^{\top} z \right),

Now, since Σ = AA^{\top} is invertible we have Σ^{-1} = (AA^{\top})^{-1} = (A^{\top})^{-1} A^{-1} = (A^{-1})^{\top} A^{-1}, and thus

f_X(x) = (2\pi)^{-n/2} |Σ|^{-1/2} \exp\left( -\frac{1}{2} (x − µ)^{\top} Σ^{-1} (x − µ) \right), \quad x ∈ R^n.
Note that Σ being of full rank is crucial to this derivation. If it is not then Σ−1
does not exist and the formula for fX does not make sense.
The (joint) MGF gives another characterisation of the multivariate normal distribu-
tion. It will also be very useful for finding properties of the distribution in the next
couple of sections. First we compute the joint MGF of Z, then use the representa-
tion X = µ + AZ. Since the components Zi of Z are independent standard normal
variables we have

M_Z(t) = \prod_{i=1}^{l} M_{Z_i}(t_i) = \prod_{i=1}^{l} e^{\frac{1}{2} t_i^2} = e^{\frac{1}{2} t^{\top} t}, \quad t ∈ R^l.

M_X(t) = E[e^{t^{\top} X}] = E[\exp\{ t^{\top}(µ + AZ) \}] = \exp\{ t^{\top} µ \}\, M_Z(A^{\top} t) = \exp\{ t^{\top} µ + \tfrac{1}{2} t^{\top} Σ t \}.
8.2 Exploration
Moments
We can find the means, variances and covariances (and therefore correlations) asso-
ciated with X ∼ N(µ, Σ) using the moment generating function. For this we write
the MGF as (so far µ and Σ are just parameters)
M_X(t) = \exp\left( \sum_i t_i µ_i + \frac{1}{2} \sum_i \sum_j t_i σ_{ij} t_j \right).
Therefore

E[X_k] = \frac{\partial}{\partial t_k} M_X(0) = (µ_k + 0) \cdot 1 = µ_k.

Next, we have

E[X_k^2] = \frac{\partial^2}{\partial t_k^2} M_X(0) = µ_k^2 + σ_{kk},

so var(X_k) = E[X_k^2] − E[X_k]^2 = µ_k^2 + σ_{kk} − µ_k^2 = σ_{kk}. Lastly, for k ≠ l we have

E[X_k X_l] = \frac{\partial^2}{\partial t_k \partial t_l} M_X(0) = µ_k µ_l + σ_{kl},
Bivariate PDF

When n is small we can write out the entries of µ and Σ explicitly and expand out the quadratic form in the PDF of X to get more of a 'feel' for the behaviour of a multivariate normal random vector. Since we use the PDF here we must assume that Σ is non-singular. When n = 2, we write

Σ = \begin{pmatrix} σ_1^2 & σ_1 σ_2 ρ \\ σ_1 σ_2 ρ & σ_2^2 \end{pmatrix},

where ρ = ρ(X_1, X_2) and σ_i^2 = var(X_i). It can be shown (exercise) that Σ is non-singular if and only if |ρ| < 1 and σ_1^2 σ_2^2 > 0. Then the joint PDF of X_1 and X_2 is (calculate |Σ| and the exponent using x = (x_1, x_2)^⊤, µ = (µ_1, µ_2)^⊤ and Σ above, and substitute into the formula for f_X given in Section 8.1)

f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi σ_1 σ_2 \sqrt{1 − ρ^2}} \exp\left\{ -\frac{1}{2(1 − ρ^2)} \left[ \left( \frac{x_1 − µ_1}{σ_1} \right)^2 − 2ρ \left( \frac{x_1 − µ_1}{σ_1} \right)\left( \frac{x_2 − µ_2}{σ_2} \right) + \left( \frac{x_2 − µ_2}{σ_2} \right)^2 \right] \right\}.
Below are contour plots of the density when µ = (1, 2)⊤ and Σ is such that X1
and X2 (i) are uncorrelated N (0, 1) variables and (ii) are positively correlated N (0, 1)
variables.
[Figure: two contour plots of the bivariate normal density, axes x_1 (horizontal) and x_2 (vertical).]
Aside It follows from the PDF that these contours are always ellipses; centred at
µ and with size and orientation determined by (the eigenvalues/vectors of) Σ.
To get some ‘feel’ for the bivariate normal distribution and how it is affected by
the various parameters you should investigate such contour plots and surface plots of
the PDF in a computer package. The R file plotting2-2dNormal.R on Moodle will
allow you to do this. You can get a feel for a 3-dimensional normal distribution by generating realisations from it and looking at a 3-d scatterplot of them. See the R file plotting3-3dNormal.R on Moodle for this.
If you change the numbers in those files you will need to take care that Σ is positive definite; there is code in those R files to check the eigenvalues. A couple of covariance matrices worth investigating are (the code provided lets you specify the σ_i's and ρ_{ij}'s, not the σ_{ij}'s directly)

\begin{pmatrix} 1 & −0.5\sqrt{1 \cdot 2} & 0.4\sqrt{1 \cdot 4} \\ −0.5\sqrt{1 \cdot 2} & 2 & −0.2\sqrt{2 \cdot 4} \\ 0.4\sqrt{1 \cdot 4} & −0.2\sqrt{2 \cdot 4} & 4 \end{pmatrix}, \qquad \begin{pmatrix} 1 & −0.7 & 0 \\ −0.7 & 1 & −0.7 \\ 0 & −0.7 & 1 \end{pmatrix}.
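As a sketch of the kind of eigenvalue check mentioned above (base R only, not the Moodle scripts), here is the second matrix:

Sigma <- matrix(c( 1,  -0.7,    0,
                  -0.7,   1, -0.7,
                   0,  -0.7,    1), nrow = 3, byrow = TRUE)
eigen(Sigma)$values   # all strictly positive, so Sigma is (just) positive definite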
8.3 Properties
Affine transformations
= exp{t⊤ b} MX (C ⊤ t).
So Y ∼ Nm (Cµ + b, CΣC ⊤ ).
Marginal distributions

we have

Y = \begin{pmatrix} X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}.

It then follows that Y = (X_2, X_3)^{\top} has a multivariate normal distribution with mean

µ_Y = Cµ = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} µ_1 \\ µ_2 \\ µ_3 \end{pmatrix} = \begin{pmatrix} µ_2 \\ µ_3 \end{pmatrix}
Independence
Recall (from Section 3.2) that independent random variables are uncorrelated, but
the converse is not generally true. Next we establish that this converse is true for
random variables that follow a multivariate normal distribution.
Proof. As noted above, we only need to prove the ‘if’ part of the theorem.
A numerical example
(ii) For which values of c ∈ R are Z1 = 2X1 +cX2 and Z2 = 2X1 +cX3 independent?
Solution. (i) Let C = \begin{pmatrix} 1 & 1 & 0 \\ 2 & −1 & 0 \end{pmatrix} and b = \begin{pmatrix} 2 \\ 0 \end{pmatrix}, so Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = CX + b is an affine transformation of X. Then, by Theorem 50, Y is normally distributed with mean

µ_Y = E[Y] = C µ_X + b = \begin{pmatrix} 1 & 1 & 0 \\ 2 & −1 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \\ 0 \end{pmatrix}

and

Σ_Y = var(Y) = CΣC^{\top} = \begin{pmatrix} 1 & 1 & 0 \\ 2 & −1 & 0 \end{pmatrix} \begin{pmatrix} 2 & 1 & 0 \\ 1 & 4 & 0 \\ 0 & 0 & 5 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 1 & −1 \\ 0 & 0 \end{pmatrix} = \cdots = \begin{pmatrix} 8 & 1 \\ 1 & 8 \end{pmatrix}.

(ii) Let C = \begin{pmatrix} 2 & c & 0 \\ 2 & 0 & c \end{pmatrix} and b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, so that Z = \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = CX + b. Now, by Theorems 50 and 51, Z is MVN and so Z_1 and Z_2 are independent precisely when cov(Z_1, Z_2) = 0 (i.e. when Σ_Z = CΣC^{\top} is diagonal).
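Part (i) can be verified with a couple of lines of R matrix algebra (a sketch):

C <- matrix(c(1,  1, 0,
              2, -1, 0), nrow = 2, byrow = TRUE)
Sigma <- matrix(c(2, 1, 0,
                  1, 4, 0,
                  0, 0, 5), nrow = 3, byrow = TRUE)
C %*% Sigma %*% t(C)   # should equal the 2x2 matrix with 8 on the diagonal and 1 off it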
8.4 The degenerate case

Here we investigate the situation where the covariance matrix Σ is not of full rank.
If Σ is not of full rank then X = µ+AZ with A being n×l (and Z being l ×1) for
l < n. Loosely speaking, since Z is l-dimensional, the n-dimensional random vector
X can only take values in an l-dimensional subspace of R^n. We investigate this further
firstly through an explicit numerical example and then a more general argument. A
bit like X = µ + σZ with σ = 0.
This looks like a perfectly respectable covariance matrix, but routine calculations
reveal that Σ is singular. Further calculation reveals that the null space of Σ is
spanned by the vector (2, 3, −1)⊤ . Applying Theorem 50 with C = (2, 3, −1) and
b = 0 we can show that E[2X1 + 3X2 − X3 ] = 2µ1 + 3µ2 − µ3 = 4 and
var(2X_1 + 3X_2 − X_3) = \begin{pmatrix} 2 & 3 & −1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & 3 \\ 2 & 3 & 13 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \\ −1 \end{pmatrix} = \cdots = 0.

Thus Z = x^{\top} X = \sum_{i=1}^n x_i X_i = µ_Z (w.p. 1) and so one of the X_i's can be written as an affine function of the others. (Since x has at least one non-zero element.)
If there are k such independent vectors x (i.e. Σ has nullity k) then there are k of
the Xi ’s that can be written as an affine function of the others; this is another heuristic
justification of saying that X is, in some sense, really only n − k = l-dimensional.
It is worth pointing out that this degenerate situation is not just a special case
that we consider in order to be more abstract. For example, degenerate multivariate
normal distributions play an important role in the theory of linear regression. (Joint
distribution of residuals.)
(Try simulations/contours for 2-d and simulations for 3d; letting a variance get
close to zero or a correlation get close to 1)
8.5 Some additional properties
Conditional distributions
We can also define and study conditional distributions involving multivariate nor-
mal random vectors. Crucially, these conditional distributions are also multivariate
normal.
For example, in the bivariate case we can show that the conditional PDF of X2 ,
given X1 = x1 , is the PDF of a
N\left( µ_2 + ρ\, \frac{σ_2}{σ_1} (x_1 − µ_1),\ \ σ_2^2 (1 − ρ^2) \right)
There are analogous formulae for conditional distributions involving general mul-
tivariate normals, but these require lots of additional notation so are omitted here.
An alternative definition
We briefly note another common definition of the vector X being multivariate normal.
This is that all linear combinations of elements of X (i.e. \sum_{i=1}^n a_i X_i for all a_i ∈ R) have (univariate) normal distributions.
Although (X, Y ) being jointly multivariate normal implies that each of X and Y are
(marginally) normal, the converse is not necessarily true. In other words, two random
variables which are both (marginally) normally distributed do not necessarily have a
joint multivariate normal distribution.
Often relevant in (applications of) statistics. In other words: joint specifies
marginals, but not vice-versa. The same argument applies in many dimensions.
8.6 Problems
Let X1, . . . , Xn have a multivariate normal distribution, and let a = (a1, . . . , an)⊤ and b = (b1, . . . , bn)⊤ be two constant vectors. Define U = $\sum_{i=1}^{n} a_i X_i$ and V = $\sum_{i=1}^{n} b_i X_i$.
9 Two limit theorems

We are now in a position to state (and prove) two of the most famous results in
probability: the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT).
These are results that you will have come across before, but here we pay attention
to some additional technical detail, including the proofs. The LLN confirms our
intuition that if we repeat a random experiment with some numerical outcome (i.e.
make IID observations of a random variable) then the long-term average of these
outcomes settles down to the expected value. The CLT gives more detail about the
variation around that mean: it approximately follows a normal distribution. Looking
at these results in more detail also gives us an introduction to an important topic in
more advanced probability: modes of convergence of random variables. That is, ways
in which a sequence of random variables can have a meaningful limit. (cf. convergence
of sequences of functions)
Throughout this section we assume that X1, X2, . . . are IID RVs and define their partial sums $S_n = \sum_{i=1}^{n} X_i$ (n = 1, 2, . . . ). We also assume that the variables have finite mean µ = E[X1] and finite variance σ² = var(X1).
The partial sums Sn are also RVs; and we can find their mean and variance:
Now, if we want a limit of some sort we are looking for quantities that do not depend on n, so let's look at Sn/n (dividing by n makes the mean nµ/n = µ not depend on n):
\[
E[S_n/n] = \frac{1}{n} E[S_n] = \frac{1}{n} \cdot n\mu = \mu,
\qquad
\operatorname{var}(S_n/n) = \frac{1}{n^2} \operatorname{var}(S_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.
\]
So indeed the mean stays fixed and the variance gets small as n grows.
The (weak) LLN captures this idea: in the limit of large n, the sample mean
Sn /n converges to the true mean µ. In fact, the LLN comes in two different ver-
sions, which correspond to different interpretations (almost sure convergence and
convergence in probability) of a sequence of random variables converging to a
number. We will state both versions, but prove only the second.
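As a quick illustration of the LLN (a sketch with an arbitrary choice of distribution, here Exp(1) so that µ = 1), the running mean Sn/n settles down near µ as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)

X = rng.exponential(scale=1.0, size=100_000)             # IID Exp(1), so mu = 1
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)   # S_n / n for each n

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>7}:  S_n/n = {running_mean[n - 1]:.4f}")
```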
Proof of the weak LLN. Fixing ϵ > 0, we have by Chebychev's inequality that
\[
P(|S_n/n - \mu| > \epsilon) \leqslant \frac{\operatorname{var}(S_n/n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}.
\]
As n → ∞ the right-hand side converges to 0, so the left-hand side must too; more specifically we have
\[
0 \leqslant P(|S_n/n - \mu| > \epsilon) \leqslant \frac{\sigma^2}{n\epsilon^2} \quad \text{for all } n,
\]
so letting n → ∞ and applying the sandwich rule gives P(|Sn/n − µ| > ϵ) → 0, which is exactly the required convergence in probability.
The following example uses this result to give a rigorous characterisation of the
idea that the Poisson distribution can be used to approximate the Binomial distribu-
tion (when n is large and p is small).
[Figure: four panels showing the PMFs P(Xn = x) for n = 5, 10, 20 and the Poisson PMF P(X = x), plotted against x = 0, 1, . . . , 12.]
Binomial PMF with p = 2/n and n = 5, 10, 20; and Poisson PMF with λ = 2.
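The figure can be reproduced along the following lines (a sketch using scipy.stats, a convenient assumption rather than a package introduced in these notes); printing the PMFs already shows how close the Binomial values get to the Poisson ones as n grows:

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 2.0
x = np.arange(0, 13)

for n in [5, 10, 20]:
    print(f"Bin({n}, 2/{n}): ", np.round(binom.pmf(x, n, lam / n), 3))

print("Poisson(2):  ", np.round(poisson.pmf(x, lam), 3))
```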
Example 31. Suppose that $X_n \sim \operatorname{Bin}\!\left(n, \frac{\lambda}{n}\right)$ and that X ∼ Poisson(λ). Show that $X_n \xrightarrow{D} X$ as n → ∞.
Solution. By Theorem 55, it is sufficient to show that limn→∞ MXn (t) = MX (t).
Recall that, for all t ∈ R, (PGFs from MATH1001)
\[
M_{X_n}(t) = \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^{t}\right)^{n}
\quad\text{and}\quad
M_X(t) = e^{-\lambda(1 - e^{t})}.
\]
Now, using the identity $\lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^{n} = e^{x}$ from calculus, we have
\[
\lim_{n\to\infty} M_{X_n}(t)
= \lim_{n\to\infty} \left(1 + \frac{-\lambda(1 - e^{t})}{n}\right)^{n}
= e^{-\lambda(1 - e^{t})}
= M_X(t) \quad \forall t \in \mathbb{R},
\]
as required.
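This convergence of MGFs is also easy to check numerically (a sketch; the values λ = 2, t = 1 and the range of n are arbitrary choices):

```python
import numpy as np

lam, t = 2.0, 1.0
target = np.exp(-lam * (1 - np.exp(t)))                   # MGF of Poisson(lam) at t

for n in [10, 100, 1_000, 10_000]:
    mgf_bin = (1 - lam / n + (lam / n) * np.exp(t)) ** n  # MGF of Bin(n, lam/n) at t
    print(f"n = {n:>6}:  {mgf_bin:.6f}   (limit {target:.6f})")
```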
Before we state and prove the CLT, recall our standing assumptions that X1, X2, . . . are IID with finite mean µ = E[X1] and that $S_n = \sum_{i=1}^{n} X_i$. We now need to assume, in addition, that the variance σ² = var(X1) is finite and non-zero. The CLT then states that
\[
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} Z \quad \text{as } n \to \infty,
\]
where Z ∼ N(0, 1).
Proof. We prove the CLT under the assumption that the moment generating function
of X1 exists.
Our approach is to standardise the variables Xi to new variables Yi and write MZn
in terms of the MGFs of the standardised variables, then show that log MZn converges
to log MZ .
Define the standardised RVs Yi = (Xi − µ)/σ. Then Y1, Y2, . . . are IID with E[Y1] = 0 and var(Y1) = 1 = E[Y1²]; and we can write
\[
Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i.
\]
Then, using Theorem 41 and the fact that the Yi are IID, we have
\[
M_{Z_n}(t) = \prod_{i=1}^{n} M_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right) = \left(M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right)^{n}.
\]
Since $M_Z(t) = e^{t^2/2}$, to show $\lim_{n\to\infty} \log M_{Z_n}(t) = \log M_Z(t)$ we need to show that
\[
\lim_{n\to\infty} n \log M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right) = \frac{t^2}{2},
\]
and, making the substitution $u = 1/\sqrt{n}$, this is equivalent to showing that ($\infty \cdot 0$ form; note $n = u^{-2}$)
\[
\lim_{u \downarrow 0} \frac{1}{u^2} \log\!\left(M_{Y_1}(ut)\right) = \frac{t^2}{2}. \tag{$*$}
\]
This limit involves the function L(s) = log(MY1 (s)) as s → 0. With a view towards
applying L’Hôpital’s rule, we find that
for what t?
The last thing to check is that this convergence happens for all t ∈ R, since that is the domain on which the MGF $M_Z(t) = e^{t^2/2}$ is defined. If the MGF $M_{X_1}(t)$ is defined for |t| < h then $M_{Y_1}(t)$ is defined for |t| < σh (see Theorem 38). It then follows from the formula $M_{Z_n}(t) = \left(M_{Y_1}(t/\sqrt{n})\right)^{n}$ that this function is defined for $|t/\sqrt{n}| < \sigma h$, or equivalently $|t| < \sigma h \sqrt{n}$. In the limit n → ∞ this gives t ∈ R, as required.
The CLT is an asymptotic result that tells us about the limiting distribution of the Zn's; but, like the LLN, it suggests an approximation for the distribution of Sn when n is large but finite.
Proof. The CLT says that $Z_n = (S_n - n\mu)/\sqrt{n\sigma^2} \xrightarrow{D} Z$ with Z standard normal. Changing $\xrightarrow{D}$ ('convergence in distribution') to $\mathrel{\dot\sim}$ ('approximately distributed as'), we have (cf. changing → to ≈)
\[
\frac{S_n - n\mu}{\sigma\sqrt{n}} \mathrel{\dot\sim} Z.
\]
Rearranging this we get $S_n \mathrel{\dot\sim} n\mu + \sigma\sqrt{n}\,Z$, or $S_n \mathrel{\dot\sim} S_n' \sim \mathrm{N}(n\mu, n\sigma^2)$. (Linear transformation of a normal.)
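A simulation sketch of this normal approximation (the choice of Uniform(0, 1) summands and of n is purely illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# X_i ~ Uniform(0, 1), so mu = 1/2 and sigma^2 = 1/12
n, reps = 40, 200_000
S = rng.random((reps, n)).sum(axis=1)            # 'reps' independent copies of S_n

mu, sigma2 = 0.5, 1 / 12
print("mean of S_n:", S.mean(), " theory:", n * mu)
print("var  of S_n:", S.var(),  " theory:", n * sigma2)

# Compare one probability with the approximation S_n ~ N(n*mu, n*sigma^2)
a, b = 18, 22
emp = np.mean((S > a) & (S < b))
approx = norm.cdf(b, n * mu, np.sqrt(n * sigma2)) - norm.cdf(a, n * mu, np.sqrt(n * sigma2))
print(f"P({a} < S_n < {b}):  simulated {emp:.4f},  normal approximation {approx:.4f}")
```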
To interpret and relate the LLN and CLT somewhat: the LLN says that for large
n, Sn is approximately its mean nµ. The CLT quantifies the typical deviation from
that mean.
It is also important to note that the only condition on the distribution of the
random variables Xi in the CLT is that they have finite variance. Aside from that
they can have any distribution; whether discrete, continuous or neither; symmetric or
asymmetric. The precise distribution that the IID variables have affects how quickly
the convergence in the CLT happens, but it always happens. (rate of cgce is related
to quality of approximation)
The central limit theorem is the basic reason for the pervasiveness of the normal
distribution, mainly because the same argument as above gives a normal approxima-
tion for X̄n = Sn /n, the sample mean.
Example. Let X1, . . . , X50 be IID Exp(1/2) RVs and let $S_{50} = \sum_{i=1}^{50} X_i$. Use the CLT to find an approximate value for P(90 < S50 < 110).

Solution. Since the Xi are IID the CLT justifies using a normal approximation Yn ∼ N(nµ, nσ²) for Sn. Here we have µ = E[Xi] = 2, σ² = var(Xi) = 2² and n = 50; so $S_{50} \mathrel{\dot\sim} Y_{50} \sim \mathrm{N}(100, 200)$.
Therefore
\[
P(90 < S_{50} < 110) \approx P(90 < Y_{50} < 110)
= P\!\left(\frac{90 - 100}{\sqrt{200}} < Z < \frac{110 - 100}{\sqrt{200}}\right)
= P\!\left(-\sqrt{\tfrac{1}{2}} < Z < \sqrt{\tfrac{1}{2}}\right)
= \Phi\!\left(\sqrt{\tfrac{1}{2}}\right) - \Phi\!\left(-\sqrt{\tfrac{1}{2}}\right).
\]
Of course in practice the normal CDF can be easily and accurately evaluated by a
computer (or tables). In this case we also know the distribution of Sn (as the sum
of 50 independent Exp(1/2) RVs it is Gamma(50, 1/2); see Problem 2b at the end
of Section 2), so we can determine the probability exactly and see how good the
approximation is.
In this case the exact probability is 0.5210. . . , whilst the normal approximation above gives 0.5205. . . (a relative error of about 0.0005/0.52 ≈ 0.1%).
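This comparison can be reproduced as follows (a sketch using scipy.stats; note that Gamma(50, 1/2), with rate 1/2, corresponds to shape a = 50 and scale = 2 in scipy's parametrisation):

```python
import numpy as np
from scipy.stats import gamma, norm

# Exact: S_50 ~ Gamma(shape = 50, rate = 1/2), i.e. shape a = 50 and scale = 2
exact = gamma.cdf(110, a=50, scale=2) - gamma.cdf(90, a=50, scale=2)

# Normal approximation: S_50 approximately N(100, 200)
approx = norm.cdf(110, loc=100, scale=np.sqrt(200)) - norm.cdf(90, loc=100, scale=np.sqrt(200))

print(f"exact  = {exact:.4f}")    # approximately 0.5210
print(f"approx = {approx:.4f}")   # approximately 0.5205
```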
9.1 Problems
1. Let Xn ∼ Bin(n, 3/n). When n is large, what distributions can you use to approximate the distribution of Xn? Give your reasons.
2. A fair die is rolled repeatedly. Use the central limit theorem to determine the
minimum number of rolls required so that the probability that the sum of the
scores of all of the rolls is greater than or equal to 100 is at least 0.95.
4. Use a similar argument to the proof of the CLT to prove (a version of) the LLN.
Clearly state the theorem you prove too; i.e. be explicit about the assumptions
needed for the proof to be valid.