G. A. Young
http://www2.imperial.ac.uk/~ayoung
September 2011
Contents

7 STATISTICAL ANALYSIS
  7.1 STATISTICAL SUMMARIES
  7.2 SAMPLING DISTRIBUTIONS
  7.3 HYPOTHESIS TESTING
    7.3.1 TESTING FOR NORMAL SAMPLES - THE Z-TEST
    7.3.2 HYPOTHESIS TESTING TERMINOLOGY
    7.3.3 THE t-TEST
    7.3.4 TEST FOR σ
    7.3.5 TWO SAMPLE TESTS
  7.4 POINT ESTIMATION
    7.4.1 ESTIMATION TECHNIQUES I: METHOD OF MOMENTS
    7.4.2 ESTIMATION TECHNIQUES II: MAXIMUM LIKELIHOOD
  7.5 INTERVAL ESTIMATION
    7.5.1 PIVOTAL QUANTITY
    7.5.2 INVERTING A TEST STATISTIC
CHAPTER 1
DEFINITIONS, TERMINOLOGY, NOTATION
Definition 1.1.3 The sample space, Ω, of an experiment is the set of all possible outcomes.
NOTE: Ω is a set in the mathematical sense, so set theory notation can be used. For example, if the sample outcomes are denoted ω1, ..., ωk, say, then

Ω = {ω1, ..., ωk} = {ωi : i = 1, ..., k}.
Since events are subsets of Ω, set theory operations are used to manipulate events in probability
theory. Consider events E, F ⊆ Ω. Then we can reasonably concern ourselves also with events
obtained from the three basic set operations:
Consider events E, F, G ⊆ Ω.
COMMUTATIVITY: E ∪ F = F ∪ E, E ∩ F = F ∩ E.

ASSOCIATIVITY: E ∪ (F ∪ G) = (E ∪ F) ∪ G, E ∩ (F ∩ G) = (E ∩ F) ∩ G.

DISTRIBUTIVITY: E ∪ (F ∩ G) = (E ∪ F) ∩ (E ∪ G), E ∩ (F ∪ G) = (E ∩ F) ∪ (E ∩ G).

DE MORGAN'S LAWS: (E ∪ F)′ = E′ ∩ F′, (E ∩ F)′ = E′ ∪ F′.
In Figure 1.1, we have Ω = ∪_{i=1}^{6} Ei.

In Figure 1.2, we have F = ∪_{i=1}^{6} (F ∩ Ei), but, for example, F ∩ E6 = ∅.
Events are subsets of Ω, but need all subsets of Ω be events? The answer is negative. But it
suffices to think of the collection of events as a subcollection A of the set of all subsets of Ω. This
subcollection should have the following properties:
(a) if A, B ∈ A then A ∪ B ∈ A;

(b) if A ∈ A then A′ ∈ A;
(c) ∅ ∈ A.
A collection A of subsets of Ω which satisfies these three conditions is called a field. It follows from the properties of a field that if A1, A2, ..., Ak ∈ A, then

∪_{i=1}^{k} Ai ∈ A.
So, A is closed under finite unions and hence under finite intersections also. To see this note that
if A1 , A2 ∈ A, then
A1′, A2′ ∈ A ⟹ A1′ ∪ A2′ ∈ A ⟹ (A1′ ∪ A2′)′ ∈ A ⟹ A1 ∩ A2 ∈ A.
This is fine when Ω is a finite set, but we require slightly more to deal with the common situation
when Ω is infinite. We require the collection of events to be closed under the operation of taking
countable unions, not just finite unions.
(I) ∅ ∈ A;

(II) if A1, A2, ... ∈ A then ∪_{i=1}^{∞} Ai ∈ A;

(III) if A ∈ A then A′ ∈ A.
To recap, with any experiment we may associate a pair (Ω, A), where Ω is the set of all possible
outcomes (or elementary events) and A is a σ−field of subsets of Ω, which contains all the events
in whose occurrences we may be interested. So, from now on, to call a set A an event is equivalent
to asserting that A belongs to the σ−field in question.
The triple (Ω, A, P (.)), consisting of a set Ω, a σ-field A of subsets of Ω and a probability function
P (.) on (Ω, A) is called a probability space.
1. P(E′) = 1 − P(E).

2. If E ⊆ F, then P(E) ≤ P(F).

3. In general, P(E ∪ F) = P(E) + P(F) − P(E ∩ F).

4. P(E ∩ F′) = P(E) − P(E ∩ F).

5. P(E ∪ F) ≤ P(E) + P(F).

6. P(E ∩ F) ≥ P(E) + P(F) − 1.
NOTE : The general addition rule 3 for probabilities and Boole’s Inequalities 5 and 6
extend to more than two events. Let E1 , ..., En be events in Ω. Then
P(∪_{i=1}^{n} Ei) = Σ_i P(Ei) − Σ_{i<j} P(Ei ∩ Ej) + Σ_{i<j<k} P(Ei ∩ Ej ∩ Ek) − ... + (−1)^{n+1} P(∩_{i=1}^{n} Ei),

and

P(∪_{i=1}^{n} Ei) ≤ Σ_{i=1}^{n} P(Ei).
Set F1 = E1 and

Fi = Ei ∩ (∪_{k=1}^{i−1} Ek)′

for i = 2, 3, ..., n. Then F1, F2, ..., Fn are disjoint, and ∪_{i=1}^{n} Ei = ∪_{i=1}^{n} Fi, so

P(∪_{i=1}^{n} Ei) = P(∪_{i=1}^{n} Fi) = Σ_{i=1}^{n} P(Fi).

Now

P(Fi) = P(Ei) − P(Ei ∩ (∪_{k=1}^{i−1} Ek)),

and the result follows by recursive expansion of the second term for i = 2, 3, ..., n.
NOTE : We will often deal with both probabilities of single events, and also probabilities for
intersection events. For convenience, and to reflect connections with distribution theory that will
be presented in Chapter 2, we will use the following terminology; for events E and F
Extension : Events E1 , ..., Ek are independent if, for every subset of events of size l ≤ k, indexed
by {i1 , ..., il }, say,
P(∩_{j=1}^{l} E_{ij}) = ∏_{j=1}^{l} P(E_{ij}).
PROOF
F = (F ∩ E1) ∪ ... ∪ (F ∩ Ek)

⟹ P(F) = Σ_{i=1}^{k} P(F ∩ Ei) = Σ_{i=1}^{k} P(F|Ei)P(Ei),
Extension: The theorem still holds if E1, E2, ... is a (countably) infinite partition of Ω, and F ⊆ Ω, so that

P(F) = Σ_{i=1}^{∞} P(F ∩ Ei) = Σ_{i=1}^{∞} P(F|Ei)P(Ei),
P(E|F) = P(F|E)P(E) / P(F).
PROOF
Extension: If E1 , ..., Ek are disjoint, with P (Ei ) > 0 for i = 1, ..., k, and form a partition of F ⊆ Ω,
then
P(Ei|F) = P(F|Ei)P(Ei) / Σ_{j=1}^{k} P(F|Ej)P(Ej).
Suppose that an experiment has N equally likely sample outcomes. If event E corresponds to a
collection of sample outcomes of size n(E), then
P(E) = n(E)/N,
so it is necessary to be able to evaluate n(E) and N in practice.
Example 1.1 If each of r trials of an experiment has N possible outcomes, then there are N^r possible sequences of outcomes in total. For example:

(i) If a multiple choice exam has 20 questions, each of which has 5 possible answers, then there are 5^20 different ways of completing the exam.

(ii) A set of m elements has 2^m subsets (as each element is either in the subset or not in the subset, which is equivalent to m trials each with two outcomes).
NOTE: The order in which the operations are carried out may be important: e.g. in a raffle with three prizes and 100 tickets, the ordered draw (45, 19, 76) is different from (19, 76, 45).
NOTE : The items may be distinct (unique in the collection), or indistinct (of a unique type in
the collection, but not unique individually).
e.g. The numbered balls in the National Lottery, or individual playing cards, are distinct. However
when balls in the lottery are regarded as “WINNING” or “NOT WINNING”, or playing cards are
regarded in terms of their suit only, they are indistinct.
RESULT 3 The number of combinations of r from n distinct items is the binomial coefficient

C^n_r = n! / (r!(n − r)!)    (as P^n_r = r! C^n_r).
– recall the Binomial Theorem, namely

(a + b)^n = Σ_{i=0}^{n} C^n_i a^i b^{n−i}.
Then the number of subsets of m items can be calculated as follows; for each 0 ≤ j ≤ m, choose a
subset of j items from m. Then
Total number of subsets = Σ_{j=0}^{m} C^m_j = (1 + 1)^m = 2^m.
If the items are indistinct, but each is of a unique type, say Type I, ..., Type κ (the so-called Urn Model), then
Special Case: if κ = 2, then the number of distinguishable permutations of the n1 objects of Type I and n2 = n − n1 objects of Type II is

C^n_{n1} = C^n_{n2} = n! / (n1!(n − n1)!).
RESULT 5 There are Crn ways of partitioning n distinct items into two “cells”, with r in one
cell and n − r in the other.
Example 1.2 A True/False exam has 20 questions. Let E = “16 answers correct at random”.
Then

P(E) = (Number of ways of getting 16 out of 20 correct) / (Total number of ways of answering 20 questions) = C^20_16 / 2^20 = 0.0046.
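(As an aside, this calculation is easy to check numerically; the following short Python sketch, which is illustrative only and not part of the original notes, reproduces the value.)

    from math import comb

    # P(16 of 20 True/False answers correct at random)
    p = comb(20, 16) / 2**20
    print(round(p, 4))  # 0.0046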
Example 1.3 Sampling without replacement. Consider an Urn Model with 10 Type I objects
and 20 Type II objects, and an experiment involving sampling five objects without replacement.
Let E = "precisely 2 Type I objects selected". We need to calculate N and n(E) in order to calculate P(E). In this case N is the number of ways of choosing 5 from 30 items, and hence

N = C^30_5.

To calculate n(E), we think of E occurring by first choosing 2 Type I objects from 10, and then choosing 3 Type II objects from 20, and hence, by the multiplication rule,

n(E) = C^10_2 × C^20_3.

Therefore

P(E) = C^10_2 C^20_3 / C^30_5 = 0.360.
This result can be checked using a conditional probability argument; consider event F ⊆ E, where
F = “sequence of objects 11222 obtained”. Then
F = ∩_{i=1}^{5} Fi,
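The counting argument above can also be checked numerically; a minimal Python sketch (illustrative only):

    from math import comb

    N = comb(30, 5)                  # ways of choosing 5 from 30
    n_E = comb(10, 2) * comb(20, 3)  # 2 Type I from 10, then 3 Type II from 20
    print(round(n_E / N, 3))         # 0.36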
Example 1.4 Sampling with replacement. Consider an Urn Model with 10 Type I objects and
20 Type II objects, and an experiment involving sampling five objects with replacement. Let E =
“precisely 2 Type I objects selected”. Again, we need to calculate N and n(E) in order to
calculate P(E). In this case N is the number of ways of choosing 5 from 30 items with
replacement, and hence
N = 305 .
To calculate n(E), we think of E occurring by first choosing 2 Type I objects from 10, and 3
Type II objects from 20 in any order. Consider such sequences of selection
Sequence     Number of ways
11222        10 · 10 · 20 · 20 · 20
12122        10 · 20 · 10 · 20 · 20
...          ...
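(The table is truncated in the source. Each admissible sequence contains 2 Type I and 3 Type II selections, so each contributes 10² · 20³ ways, and there are C^5_2 such sequences. A hedged numerical sketch of the resulting probability, assuming exactly this setup:)

    from math import comb

    n_E = comb(5, 2) * 10**2 * 20**3  # sequences with exactly 2 Type I objects
    N = 30**5                         # all ordered samples with replacement
    print(n_E / N)                    # 0.3292..., the Bin(5, 1/3) probability of 2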
This chapter contains the introduction of random variables as a technical device to enable the
general specification of probability distributions in one and many dimensions to be made. The key
topics and techniques introduced in this chapter include the following:
• EXPECTATION
• TRANSFORMATION
• STANDARDIZATION
• GENERATING FUNCTIONS
• JOINT MODELLING
• MARGINALIZATION
• MULTIVARIATE TRANSFORMATION
• SUMS OF VARIABLES
Of key importance is the moment generating function, which is a standard device for identification of probability distributions. Transformations are often used to transform a random
variable or statistic of interest to one of simpler form, whose probability distribution is more
convenient to work with. Standardization is often a key part of such simplification.
We are not always interested in an experiment itself, but rather in some consequence of its random
outcome. Such consequences, when real valued, may be thought of as functions which map Ω to
R, and these functions are called random variables.
Definition 2.1.1 A random variable (r.v.) X is a function X : Ω → R with the property that
{ω ∈ Ω : X(ω) ≤ x} ∈ A for each x ∈ R.
The point is that we defined the probability function P (.) on the σ−field A, so if A(x) = {ω ∈ Ω :
X(ω) ≤ x}, we cannot discuss P (A(x)) unless A(x) belongs to A. We generally pay no attention to
the technical condition in the definition, and just think of random variables as functions mapping
Ω to R.
with, for completeness, fX (x1 ) = FX (x1 ) . These relationships follow as events of the form [X ≤ xi ]
can be represented as countable unions of the events Ai . The first result therefore follows from
properties of the probability function. The second result follows immediately.
(ii) FX is continuous from the right (but not necessarily continuous) on R; that is, for x ∈ R,

lim_{h→0+} FX(x + h) = FX(x).
The key idea is that the functions fX and/or FX can be used to describe the probability distri-
bution of the random variable X. A graph of the function fX is non-zero only at the elements of
X. A graph of the function FX is a step-function which takes the value zero at minus infinity,
the value one at infinity, and is non-decreasing with points of discontinuity at the elements of X.
FX(x) = P[X ≤ x], x ∈ R, and for a continuous random variable FX is continuous everywhere:

lim_{h→0} FX(x + h) = FX(x).
From now on when we speak of a continuous random variable, we will implicitly assume the abso-
lutely continuous case, where a pdf exists.
(i) The pdf fX need not exist, but as indicated above, continuous r.v.’s where a pdf fX cannot
be defined in this way will be ignored. The function fX can be defined piecewise on intervals
of R .
(iv) If X is continuous,
It follows that a function fX is a pdf for a continuous random variable X if and only if

(i) fX(x) ≥ 0,  (ii) ∫_{−∞}^{∞} fX(x) dx = 1.
Example 2.1 Consider a coin tossing experiment where a fair coin is tossed repeatedly under
identical experimental conditions, with the sequence of tosses independent, until a Head is
obtained. For this experiment, the sample space, Ω is then the set of sequences
({H} , {T H} , {T T H} , {T T T H} ...) with associated probabilities 1/2, 1/4, 1/8, 1/16, ... .
The random variable X = "number of tosses until the first Head" has mass function fX(x) = (1/2)^x for x = 1, 2, ..., and zero otherwise. For x ≥ 1, let k(x) be the largest integer not greater than x; then

FX(x) = Σ_{xi ≤ x} fX(xi) = Σ_{i=1}^{k(x)} fX(i) = 1 − (1/2)^{k(x)}.
Graphs of the probability mass function (left) and cumulative distribution function (right) are
shown in Figure 2.1. Note that the mass function is only non-zero at points that are elements of
X, and that the cdf is defined for all real values of x, but is only continuous from the right. FX is
therefore a step-function.
Figure 2.1: PMF fX(x) = (1/2)^x, x = 1, 2, ..., and CDF FX(x) = 1 − (1/2)^{k(x)}.
Example 2.2 Consider an experiment to measure the length of time that an electrical component functions before failure. The sample space of outcomes of the experiment, Ω, is R+, and if Ax is the event that the component functions for longer than x > 0 time units, suppose that P(Ax) = exp{−x²}. Then FX(x) = 1 − P(Ax) = 1 − exp{−x²} for x > 0, and

fX(x) = [d FX(t)/dt]_{t=x} = 2x exp{−x²}, x > 0,

and zero otherwise.
Graphs of the probability density function (left) and cumulative distribution function (right) are
shown in Figure 2.2. Note that both the pdf and cdf are defined for all real values of x, and that
both are continuous functions.
Figure 2.2: PDF fX (x) = 2x exp{−x2 }, x > 0, and CDF FX (x) = 1 − exp{−x2 }, x > 0.
For a continuous random variable X with range X and pdf fX , the expectation or
expected value of X with respect to fX is defined by
E_{fX}[X] = ∫_{−∞}^{∞} x fX(x) dx = ∫_X x fX(x) dx.
NOTE : The sum/integral may not be convergent, and hence the expected value may be infinite.
It is important always to check that the integral is finite: a sufficient condition is the absolute
integrability of the summand/integrand, that is
Σ_x |x| fX(x) < ∞ ⟹ Σ_x x fX(x) = E_{fX}[X] < ∞,
PROPERTIES OF EXPECTATIONS
Let X be a random variable with mass function/pdf fX . Let g and h be real-valued functions
whose domains include X, and let a and b be constants. Then
EfX [ag(X) + bh(X)] = aEfX [g(X)] + bEfX [h(X)],
as (in the continuous case)
E_{fX}[ag(X) + bh(X)] = ∫_X [ag(x) + bh(x)] fX(x) dx = a ∫_X g(x) fX(x) dx + b ∫_X h(x) fX(x) dx.
SPECIAL CASES :
(ii) Consider g(x) = (x−EfX [X])2 . Write μ =EfX [X] (a constant that does not depend on x).
Then, expanding the integrand
E_{fX}[g(X)] = ∫ (x − μ)² fX(x) dx = ∫ x² fX(x) dx − 2μ ∫ x fX(x) dx + μ² ∫ fX(x) dx = ∫ x² fX(x) dx − 2μ² + μ² = ∫ x² fX(x) dx − μ²,
and EfX [(X − μ)k ] is the kth central moment of the distribution.
(iii) Consider g(x) = ax + b; then Var_{fX}[aX + b] = a² Var_{fX}[X].
Then IA is a random variable taking values 1 and 0 with probabilities P (A) and P (A0 ) respectively.
Also, IA has expectation P (A) and variance P (A){1 − P (A)}. The usefulness lies in the fact
that any discrete random variable X can be written as a linear combination of indicator random
variables:

X = Σ_i ai I_{Ai},
for some collection of events (Ai , i ≥ 1) and real numbers (ai , i ≥ 1). Sometimes we can obtain the
expectation and variance of a random variable X easily by expressing it in this way, then using
knowledge of the expectation and variance of the indicator variables IAi , rather than by direct
calculation.
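To illustrate the idea, here is a small simulation sketch (not from the notes; the event probability 0.3 is an arbitrary choice) showing that an indicator variable has the stated expectation and variance:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3
    I_A = (rng.random(10**6) < p).astype(float)  # indicator of an event with P(A) = 0.3
    print(I_A.mean())   # ≈ P(A) = 0.3
    print(I_A.var())    # ≈ P(A)(1 - P(A)) = 0.21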
that is, g −1 (A) is the set of points in X that map into A, and g −1 (y) is the set of points in X that
map to y, under transformation g. By construction, we have
P [Y ∈ A] = P [X ∈ g −1 (A)].
where Ay = {x ∈ X : g(x) ≤ y}. This result gives the “first principles” approach to computing
the distribution of the new variable. The approach can be summarized as follows:
FY (y) = P [Y ≤ y] = P [g(X) ≤ y] = P [X ∈ Ay ].
Figure 2.3: The transformation Y = sin(X), 0 < x < 2π, with the solutions x1 and x2 of sin x = y marked.
Note that it is usually a good idea to start with the cdf, not the pmf or pdf.
Ay = {x ∈ X : g(x) ≤ y} .
Example 2.3 Suppose that X is a continuous r.v. with range X ≡ (0, 2π) whose pdf fX is
constant,

fX(x) = 1/(2π), 0 < x < 2π,

and zero otherwise. This pdf has corresponding continuous cdf

FX(x) = x/(2π), 0 < x < 2π.
Consider the transformed r.v. Y = sin X. Then the range of Y , Y, is [−1, 1], but the
transformation is not 1-1. However, from first principles, we have
FY (y) = P [Y ≤ y] = P [sin X ≤ y] .
Now, by inspection of Figure 2.3, we can easily identify the required set Ay , y > 0 : it is the union
of two disjoint intervals
Ay = [0, x1] ∪ [x2, 2π] = [0, sin⁻¹ y] ∪ [π − sin⁻¹ y, 2π].
Hence
Figure 2.4: The transformation T = tan(X), 0 < x < 2π, with the solutions x1 and x2 of tan x = t marked.
FY(y) = (1/2π) sin⁻¹ y + [1 − (1/2π)(π − sin⁻¹ y)] = 1/2 + (1/π) sin⁻¹ y,
and hence, by differentiation,
fY(y) = (1/π) · 1/√(1 − y²).
[A symmetry argument verifies this for y < 0.]
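A quick Monte Carlo check of this density (an illustrative sketch; the evaluation point y0 = 0.4 is an arbitrary choice): sample X uniformly on (0, 2π), form Y = sin X, and compare the empirical cdf with FY(y) = 1/2 + (1/π) sin⁻¹ y.

    import numpy as np

    rng = np.random.default_rng(1)
    y = np.sin(rng.uniform(0, 2 * np.pi, 10**6))
    y0 = 0.4
    print((y <= y0).mean())               # empirical P[Y <= y0]
    print(0.5 + np.arcsin(y0) / np.pi)    # FY(y0) = 1/2 + (1/pi) arcsin(y0)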
Example 2.4 Consider transformed r.v. T = tan X. Then the range of T , T, is R, but the
transformation is not 1-1. However, from first principles, we have, for t > 0,
FT (t) = P [T ≤ t] = P [tan X ≤ t] .
Figure 2.4 helps identify the required set At : in this case it is the union of three disjoint intervals
At = [0, x1] ∪ (π/2, x2] ∪ (3π/2, 2π] = [0, tan⁻¹ t] ∪ (π/2, π + tan⁻¹ t] ∪ (3π/2, 2π],
(note, for values of t < 0, the union will be of only two intervals, but the calculation proceeds
identically). Therefore,
FT(t) = P[tan X ≤ t] = P[X ≤ x1] + P[π/2 < X ≤ x2] + P[3π/2 < X ≤ 2π]

= (1/2π) tan⁻¹ t + (1/2π){π + tan⁻¹ t − π/2} + (1/2π){2π − 3π/2} = (1/π) tan⁻¹ t + 1/2,
The following theorem gives the distribution for random variable Y = g(X) when g is 1-1.
d g⁻¹(t)/dt

is continuous and non-zero on Y.
Continuous case: function g is either (I) a monotonic increasing, or (II) a monotonic decreasing function.

Case (I): If g is increasing, then for x ∈ X and y ∈ Y, we have that

g(x) ≤ y ⟺ x ≤ g⁻¹(y).

Therefore, for y ∈ Y,

FY(y) = P[g(X) ≤ y] = P[X ≤ g⁻¹(y)] = FX(g⁻¹(y)),

so, differentiating, fY(y) = fX(g⁻¹(y)) (d/dy) g⁻¹(y).

Case (II): If g is decreasing, then for x ∈ X and y ∈ Y, we have that

g(x) ≤ y ⟺ x ≥ g⁻¹(y).

Therefore, for y ∈ Y,

FY(y) = P[X ≥ g⁻¹(y)] = 1 − FX(g⁻¹(y)),

so

fY(y) = −fX(g⁻¹(y)) (d/dy) g⁻¹(y) = fX(g⁻¹(y)) |[d g⁻¹(t)/dt]_{t=y}|, as d g⁻¹(t)/dt < 0.
J(y) = [d g⁻¹(t)/dt]_{t=y},
that is, the first derivative of g −1 evaluated at y = g(x). Note that the inverse transformation
g −1 : Y −→ X has Jacobian 1/J(x).
NOTE :
(i) The Jacobian is precisely the same term that appears as a change of variable term in an
integration.
(ii) In the Univariate Transformation Theorem, in the continuous case, we take the modulus of
the Jacobian
(iii) To compute the expectation of Y = g(X), we now have two alternative methods of computation: we either compute the expectation of g(X) with respect to the distribution of X, or compute the distribution of Y, and then its expectation. It is straightforward to demonstrate that the two methods are equivalent, that is,

E_{fY}[Y] = E_{fX}[g(X)].
IMPORTANT NOTE: The apparently appealing "plug-in" approach that sets

fY(y) = fX(g⁻¹(y))

will almost always fail, as the Jacobian term must be included. For example, if Y = e^X so that X = log Y, then merely setting

fY(y) = fX(log y)

is insufficient; we must have

fY(y) = fX(log y) × (1/y).
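The point is easy to demonstrate by simulation (a sketch, not part of the notes; X ∼ N(0, 1) and the evaluation point y0 = 1.5 are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    y = np.exp(rng.standard_normal(10**6))    # Y = e^X with X ~ N(0,1)
    y0, h = 1.5, 0.01
    f_emp = ((y > y0 - h) & (y < y0 + h)).mean() / (2 * h)
    print(f_emp)                        # empirical density of Y at y0
    print(norm.pdf(np.log(y0)) / y0)    # correct: f_X(log y) * (1/y)
    print(norm.pdf(np.log(y0)))         # "plug-in" value, visibly wrong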
if this expectation exists for all values of t ∈ (−h, h) for some h > 0, that is,

DISCRETE CASE: MX(t) = Σ_x e^{tx} fX(x),

CONTINUOUS CASE: MX(t) = ∫ e^{tx} fX(x) dx.
NOTE : It can be shown that if X1 and X2 are random variables taking values on X with
mass/density functions fX1 and fX2 , and mgfs MX1 and MX2 respectively, then
fX1 (x) ≡ fX2 (x), x ∈ X ⇐⇒ MX1 (t) ≡ MX2 (t), t ∈ (−h, h).
Hence there is a 1-1 correspondence between generating functions and distributions: this
provides a key technique for identification of probability distributions.
M_X^{(r)}(t) = (d^r/ds^r){MX(s)}_{s=t} = (d^r/ds^r){Σ_x e^{sx} fX(x)}_{s=t} = Σ_x x^r e^{tx} fX(x),

and hence

M_X^{(r)}(0) = Σ_x x^r fX(x) = E_{fX}[X^r].
and hence

M_X^{(r)}(0) = ∫ x^r fX(x) dx = E_{fX}[X^r].
(iii) From the general result for expectations of functions of random variables,
E_{fY}[e^{tY}] ≡ E_{fX}[e^{t(aX+b)}] ⟹ MY(t) = E_{fX}[e^{t(aX+b)}] = e^{bt} E_{fX}[e^{atX}] = e^{bt} MX(at).

Therefore, if Y = aX + b, then MY(t) = e^{bt} MX(at).
Theorem 2.7.1 Let X1 , ..., Xk be independent random variables with mgfs MX1 , ..., MXk
respectively. Then if the random variable Y is defined by Y = X1 + ... + Xk ,
MY(t) = ∏_{i=1}^{k} MXi(t).
Hence

MY(t) = E_{fY}[e^{tY}] = Σ_y e^{ty} fY(y) = Σ_y e^{ty} {Σ_{x1} fX2(y − x1) fX1(x1)}

= Σ_{x2} Σ_{x1} e^{t(x1+x2)} fX2(x2) fX1(x1)   (changing variables in the summation, x2 = y − x1)

= {Σ_{x1} e^{tx1} fX1(x1)}{Σ_{x2} e^{tx2} fX2(x2)} = MX1(t) MX2(t),
Special Case : If X1 , ..., Xk are identically distributed, then MXi (t) ≡ MX (t), say, for all i, so
MY(t) = ∏_{i=1}^{k} MX(t) = {MX(t)}^k.
Properties :
(i) Using similar techniques to those used for the mgf, it can be shown that

G_X^{(r)}(t) = (d^r/ds^r){GX(s)}_{s=t} = E_{fX}[X(X − 1)...(X − r + 1) t^{X−r}]

⟹ G_X^{(r)}(1) = E_{fX}[X(X − 1)...(X − r + 1)],
By definition
CX(t) = ∫_{x∈X} e^{itx} fX(x) dx = ∫_{x∈X} [cos tx + i sin tx] fX(x) dx = ∫_{x∈X} cos(tx) fX(x) dx + i ∫_{x∈X} sin(tx) fX(x) dx.
Suppose X and Y are random variables on the probability space (Ω, A, P (.)). Their distribution
functions FX and FY contain information about their associated probabilities. But how do we
describe information about their properties relative to each other? We think of X and Y as components of a random vector (X, Y) taking values in R², rather than as unrelated random variables
each taking values in R.
Example 2.5 Toss a coin n times and let Xi = 0 or 1, depending on whether the ith toss is a tail or a head. The random vector X = (X1, ..., Xn) describes the whole experiment. The total number of heads is Σ_{i=1}^{n} Xi.
We will consider, for simplicity, the case n = 2, without any loss of generality: the case n > 2 is
just notationally more cumbersome.
The joint distribution function FX,Y of the random vector (X, Y ) satisfies:
(i) lim_{x,y→−∞} FX,Y(x, y) = 0 and lim_{x,y→∞} FX,Y(x, y) = 1;

(ii) FX,Y is continuous from above: FX,Y(x + u, y + v) → FX,Y(x, y) as u, v → 0+;

(iii) lim_{y→∞} FX,Y(x, y) = FX(x) ≡ P(X ≤ x), and similarly lim_{x→∞} FX,Y(x, y) = FY(y).
FX and FY are the marginal distribution functions of the joint distribution FX,Y .
Definition 2.8.2 The random variables X and Y on (Ω, A, P (.)) are (jointly) discrete if (X, Y )
takes values in a countable subset of R2 only.
Definition 2.8.3 Discrete variables X and Y are independent if the events {X = x} and {Y = y}
are independent for all x and y.
Definition 2.8.4 The joint probability mass function fX,Y : R2 −→ [0, 1] of X and Y is
given by
fX,Y (x, y) = P (X = x, Y = y).
fX(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y fX,Y(x, y).
The definition of independence can be reformulated as: X and Y are independent iff fX,Y (x, y) =
fX (x)fY (y), for all x, y ∈ R.
More generally, X and Y are independent iff fX,Y (x, y) can be factorized as the product g(x)h(y)
of a function of x alone and a function of y alone.
Let X be the support of X and Y be the support of Y . Then Z = (X, Y ) has support Z = {(x, y) :
fX,Y (x, y) > 0}. In nice cases Z = X × Y, but we need to be alert to cases with Z ⊂ X × Y. In
general, given a random vector (X1 , . . . , Xk ) we will denote its range or support by X(k) .
Definition 2.8.6 The random variables X and Y on (Ω, A, P (.)) are called jointly continuous
if their joint distribution function can be expressed as
FX,Y(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} fX,Y(u, v) dv du,

for some function fX,Y, the joint probability density function of X and Y; at points where FX,Y is sufficiently differentiable,

fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y.
This is the usual case, which we will assume from now on.
Then:
(i) Since

FX(x) = ∫_{−∞}^{x} {∫_{−∞}^{∞} fX,Y(u, y) dy} du,

we see, differentiating with respect to x, that the marginal pdf of X is

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy.
We cannot, as we did in the discrete case, define independence of X and Y in terms of events
{X = x} and {Y = y}, as these have zero probability and are trivially independent.
So, X and Y are independent iff

FX,Y(x, y) = FX(x) FY(y) for all x, y ∈ R,

or (equivalently) iff

fX,Y(x, y) = fX(x) fY(y),

whenever FX,Y is differentiable at (x, y).
This is an appropriate point to remark that not all random variables are either continuous or
discrete, and not all distribution functions are either absolutely continuous or discrete. Many
practical examples exist of distribution functions that are partly discrete and partly continuous.
Example 2.6 We record the delay that a motorist encounters at a one-way traffic stop sign. Let
X be the random variable representing the delay the motorist experiences. There is a certain
probability that there will be no opposing traffic, so she will be able to proceed without delay.
However, if she has to wait, she could (in principle) have to wait for any positive amount of time.
The experiment could be described by assuming that X has distribution function
FX (x) = (1 − pe−λx )I[0,∞) (x). This has a jump of 1 − p at x = 0, but is continuous for x > 0:
there is a probability 1 − p of no wait at all.
We shall see later cases of random vectors, (X, Y ) say, where one component is discrete and the other
continuous: there is no essential complication in the manipulation of the marginal distributions etc.
for such a case.
fX1 ,X2 ,X3 (x1 , x2 , x3 ) = fX1 (x1 )fX2 |X1 (x2 |x1 )fX3 |X1 ,X2 (x3 |x1 , x2 ),
Equivalent relationships hold in the discrete case and can be extended to determine the explicit
relationship between joint, marginal, and conditional mass/density functions for any number of
random variables.
NOTE: the discrete equivalent of this result is a DIRECT consequence of the Theorem of Total
Probability; the event [X1 = x1 ] is partitioned into sub-events [(X1 = x1 ) ∩ (X2 = x2 ) ∩ (X3 = x3 )]
for all possible values of the pair (x2 , x3 ).
i.e. the expectation of g(X1 ) with respect to the conditional density of X1 given X2 = x2 , (possibly
giving a function of x2 ). The case g(x) ≡ x is a particular case.
Proof.

E_{fX1}[g(X1)] = ∫_{X1} g(x1) fX1(x1) dx1

= ∫_{X1} g(x1) {∫_{X2} fX1,X2(x1, x2) dx2} dx1

= ∫_{X1} g(x1) {∫_{X2} fX1|X2(x1|x2) fX2(x2) dx2} dx1

= ∫_{X2} {∫_{X1} g(x1) fX1|X2(x1|x2) dx1} fX2(x2) dx2

= ∫_{X2} {E_{fX1|X2}[g(X1)|X2 = x2]} fX2(x2) dx2

= E_{fX2}[E_{fX1|X2}[g(X1)|X2 = x2]],
so the expectation of g(X1) can be calculated by finding the conditional expectation of g(X1) given X2 = x2, giving a function of x2, and then taking the expectation of this function with respect to the marginal density for X2. Note that this proof only works if the conditional expectation and the marginal expectation are finite. This result extends naturally to k variables.
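A small numerical illustration of this result (a sketch under assumed distributions, not from the notes): take X2 ∼ Gamma with shape 2 and scale 1.5, so E[X2] = 3, and X1 | X2 = x2 ∼ Poisson(x2); then E[X1] = E[E[X1 | X2]] = E[X2] = 3.

    import numpy as np

    rng = np.random.default_rng(3)
    x2 = rng.gamma(2.0, 1.5, 10**6)   # X2 with E[X2] = 2 * 1.5 = 3
    x1 = rng.poisson(x2)              # X1 | X2 = x2 ~ Poisson(x2)
    print(x1.mean(), x2.mean())       # both ≈ 3, as the theorem predicts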
Let Y = (Y1 , ..., Yk ) be a vector of random variables defined by Yi = gi (X1 , ..., Xk ) for some
functions gi , i = 1, ..., k, where the vector function g mapping (X1 , ..., Xk ) to (Y1 , ..., Yk ) is a 1-1
transformation. Then the joint mass/density function of (Y1 , ..., Yk ) is given by
CONTINUOUS fY1 ,...,Yk (y1 , ..., yk ) = fX1 ,...,Xk (x1 , ..., xk ) |J(y1 , ..., yk )| ,
where x = (x1 , ..., xk ) is the unique solution of the system y = g(x), so that x = g−1 (y), and
where J(y1 , ..., yk ) is the Jacobian of the transformation, that is, the determinant of the k × k
matrix whose (i, j)th element is
[∂ g_i⁻¹(t)/∂t_j]_{t1=y1,...,tk=yk},

where g_i⁻¹ is the inverse function uniquely defined by Xi = g_i⁻¹(Y1, ..., Yk). Note again the modulus.
Proof. The discrete case proof follows the univariate case precisely. For the continuous case,
consider the equivalent events [X ∈ C] and [Y ∈ D], where D is the image of C under g. Clearly,
P [X ∈ C] = P [Y ∈ D]. Now, P [X ∈ C] is the k dimensional integral of the joint density fX1 ,...,Xk
over the set C, and P [Y ∈ D] is the k dimensional integral of the joint density fY1 ,...,Yk over the set
D. The result follows by changing variables in the first integral from x to y = g(x), and equating
the two integrands.
Note : As for single variable transformations, the ranges of the transformed variables must be
considered carefully.
Example 2.7 The multivariate transformation theorem provides a simple proof of the
convolution formula: if X and Y are independent continuous random variables with pdfs
fX (x) and fY (y), then the pdf of Z = X + Y is
fZ(z) = ∫_{−∞}^{∞} fX(w) fY(z − w) dw.
Let W = X. The Jacobian of the transformation from (X, Y ) to (Z, W ) is 1. So, the joint pdf of
(Z, W ) is
fZ,W (z, w) = fX,Y (w, z − w) = fX (w)fY (z − w).
Then integrate out W to obtain the marginal pdf of Z.
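As a numerical illustration (a sketch; the Exp(1) choice is an assumption for the example): with X, Y ∼ Exp(1) independent, the convolution formula gives fZ(z) = ∫₀^z e^{−w} e^{−(z−w)} dw = z e^{−z}, the Ga(2, 1) density, which a simulation reproduces.

    import numpy as np

    rng = np.random.default_rng(4)
    z = rng.exponential(1.0, 10**6) + rng.exponential(1.0, 10**6)
    z0, h = 2.0, 0.01
    f_emp = ((z > z0 - h) & (z < z0 + h)).mean() / (2 * h)
    print(f_emp, z0 * np.exp(-z0))    # both ≈ 0.271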
Example 2.8 Consider the case k = 2, and suppose that X1 and X2 are independent continuous
random variables with ranges X1 = X2 = [0, 1] and pdfs given respectively by

fX1(x1) = 6x1(1 − x1), 0 ≤ x1 ≤ 1,   fX2(x2) = 3x2², 0 ≤ x2 ≤ 1,

and zero elsewhere. In order to calculate the pdf of random variable Y1 defined by
Y1 = X1 X2 ,
using the transformation result, consider the additional random variable Y2 , where Y2 = X1 (note,
as X1 and X2 take values on [0, 1], X1 ≥ X1 X2 so Y1 ≤ Y2 ).
The transformation Y = g(X) is then specified by the two functions
g1 (t1 , t2 ) = t1 t2 , g2 (t1 , t2 ) = t1 ,
X1 = Y2, X2 = Y1/Y2,

giving

g1⁻¹(t1, t2) = t2, g2⁻¹(t1, t2) = t1/t2.
Hence

∂g1⁻¹/∂t1 = 0, ∂g1⁻¹/∂t2 = 1, ∂g2⁻¹/∂t1 = 1/t2, ∂g2⁻¹/∂t2 = −t1/t2²,

so |J(y1, y2)| = 1/y2, and

fY1,Y2(y1, y2) = fX1(y2) fX2(y1/y2) (1/y2) = 18y1²(1 − y2)/y2², 0 ≤ y1 ≤ y2 ≤ 1.

Integrating out y2,

fY1(y1) = ∫_{y1}^{1} 18y1²(1 − y2)/y2² dy2 = 18y1(1 − y1 + y1 log y1), 0 ≤ y1 ≤ 1.
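(A simulation sketch checking this marginal density; it relies on the pdfs fX1(x1) = 6x1(1 − x1) and fX2(x2) = 3x2² given above, which are the Be(2, 2) and Be(3, 1) densities respectively:)

    import numpy as np

    rng = np.random.default_rng(5)
    y1 = rng.beta(2, 2, 10**6) * rng.beta(3, 1, 10**6)   # Y1 = X1 * X2
    y0, h = 0.3, 0.01
    f_emp = ((y1 > y0 - h) & (y1 < y0 + h)).mean() / (2 * h)
    print(f_emp)                                  # ≈ 1.83
    print(18 * y0 * (1 - y0 + y0 * np.log(y0)))   # 1.829...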
PROPERTIES
(i) Let g and h be real-valued functions and let a and b be constants. Then, if fX ≡ fX1 ,...,Xk ,
EfX [ag(X1 , ..., Xk ) + bh(X1 , ..., Xk )] = aEfX [g(X1 , ..., Xk )] + bEfX [h(X1 , ..., Xk )].
(ii) Let X1 , ...Xk be independent random variables with mass functions/pdfs fX1 , ..., fXk respec-
tively. Let g1 , ..., gk be scalar functions of X1 , ..., Xk respectively (that is, gi is a function of Xi only
for i = 1, ..., k). If g(X1 , ..., Xk ) = g1 (X1 )...gk (Xk ), then
E_{fX}[g(X1, ..., Xk)] = ∏_{i=1}^{k} E_{fXi}[gi(Xi)],
where EfXi [gi (Xi )] is the marginal expectation of gi (Xi ) with respect to fXi .
(iii) Generally,
EfX [g(X1 )] ≡ EfX1 [g(X1 )],
so that the expectation over the joint distribution is the same as the expectation over the marginal
distribution. The proof is an immediate consequence of the fact that the marginal pdf fX1 is
obtained by integrating the joint density with respect to x2, ..., xk. So, whenever we wish, it is reasonable to denote the expectation as, say, E[g(X1)], rather than E_{fX1}[g(X1)] or E_{fX}[g(X1)]: we can 'drop subscripts'.
CovfX1 ,X2 [X1 , X2 ] = EfX1 ,X2 [(X1 − μ1 )(X2 − μ2 )] = EfX1 ,X2 [X1 X2 ] − μ1 μ2 ,
that is, the expectation of function g(x1 , x2 ) = x1 x2 with respect to the joint distribution fX1 ,X2 .
Definition 2.10.3 The correlation of X1 and X2 is denoted CorrfX1 ,X2 [X1 , X2 ], and is defined
by
Corr_{fX1,X2}[X1, X2] = Cov_{fX1,X2}[X1, X2] / √(Var_{fX1}[X1] Var_{fX2}[X2]).
If CovfX1 ,X2 [X1 , X2 ] = CorrfX1 ,X2 [X1 , X2 ] = 0 then variables X1 and X2 are uncorrelated.
Note that if random variables X1 and X2 are independent then E_{fX1,X2}[X1X2] = E_{fX1}[X1] E_{fX2}[X2], so

Cov_{fX1,X2}[X1, X2] = E_{fX1,X2}[X1X2] − E_{fX1}[X1] E_{fX2}[X2] = 0,

and so X1 and X2 are also uncorrelated (the converse does not hold).
NOTES:
(i) For random variables X1 and X2, with (marginal) expectations μ1 and μ2 respectively, and (marginal) variances σ1² and σ2² respectively, if random variables Z1 and Z2 are defined by Zi = (Xi − μi)/σi, i = 1, 2, then Z1 and Z2 are standardized variables. Then E_{fZi}[Zi] = 0, Var_{fZi}[Zi] = 1 and Cov_{fZ1,Z2}[Z1, Z2] = Corr_{fX1,X2}[X1, X2].
(ii) Extension to k variables: covariances can only be calculated for pairs of random variables, but if k variables have a joint probability structure it is possible to construct a k × k matrix, C say, of covariance values, whose (i, j)th element is

Cij = Cov_{fXi,Xj}[Xi, Xj],

for i, j = 1, ..., k, that captures the complete covariance structure in the joint distribution. If i ≠ j, then

Cov_{fXj,Xi}[Xj, Xi] = Cov_{fXi,Xj}[Xi, Xj],

so C is symmetric, and if i = j, Cii = Var_{fXi}[Xi].
(iii) If X = Σ_{i=1}^{k} ai Xi for constants a1, ..., ak, then

Var_{fX}[X] = Σ_{i=1}^{k} ai² Var_{fXi}[Xi] + 2 Σ_{i=1}^{k} Σ_{j=1}^{i−1} ai aj Cov_{fXi,Xj}[Xi, Xj] = aᵀCa, a = (a1, ..., ak)ᵀ.
(iv) Combining (i) and (iii) when k = 2, and defining standardized variables Z1 and Z2,

0 ≤ Var_{fZ1,Z2}[Z1 ± Z2] = Var_{fZ1}[Z1] + Var_{fZ2}[Z2] ± 2Cov_{fZ1,Z2}[Z1, Z2] = 2(1 ± Corr_{fX1,X2}[X1, X2]),

and hence

−1 ≤ Corr_{fX1,X2}[X1, X2] ≤ 1.
If this exists in a neighbourhood of the origin (0,0), then it has the same attractive properties as
the ordinary moment generating function. It determines the joint distribution of X and Y uniquely
and it also yields the moments:
[∂^{m+n} MX,Y(s, t)/∂s^m ∂t^n]_{s=t=0} = E(X^m Y^n).
Joint moment generating functions factorize for independent random variables. We have

MX,Y(s, t) = MX(s) MY(t) for all s, t

if and only if X and Y are independent. Note, MX(s) = MX,Y(s, 0), etc.
The definition of the joint moment generating function extends in an obvious way to three or more random variables, with the corresponding result for independence. For instance, the random variable X is independent of (Y, Z) if and only if the joint moment generating function MX,Y,Z(s, t, u) = E(e^{sX+tY+uZ}) factorizes as MX(s) MY,Z(t, u).
(I) Let X and Y be independent random variables. Let g(x) be a function only of x and h(y) be a
function only of y. Then the random variables U = g(X) and V = h(Y ) are independent.
Let Yk = X(k), the kth smallest of X1, ..., Xn,
and then X(1) , X(2) , . . . , X(n) are known as the order statistics of X1 , . . . , Xn . Two key results
are:
An informal proof of Result A is straightforward, using a symmetry argument based on the inde-
pendent, identically distributed nature of X1 , . . . , Xn . Result B follows on noting that the event
X(k) ≤ y occurs if and only if at least k of the Xi lie in (−∞, y]. Recalling the Binomial distribution,
this means that X(k) has distribution function
F(k)(y) = Σ_{j=k}^{n} C^n_j {FX(y)}^j {1 − FX(y)}^{n−j}.
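Result B is easy to verify by simulation (an illustrative sketch with Uniform(0, 1) samples, for which FX(y) = y):

    import numpy as np
    from math import comb

    rng = np.random.default_rng(6)
    n, k, y = 5, 3, 0.6
    x_k = np.sort(rng.uniform(0, 1, (10**6, n)), axis=1)[:, k - 1]  # kth order statistic
    print((x_k <= y).mean())                                        # empirical F_(k)(y)
    print(sum(comb(n, j) * y**j * (1 - y)**(n - j) for j in range(k, n + 1)))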
X ∼ Uniform(n)

fX(x) = 1/n, x ∈ X = {1, 2, ..., n},

and zero otherwise.
X ∼ Bernoulli(θ)

fX(x) = θ^x (1 − θ)^{1−x}, x ∈ X = {0, 1},

and zero otherwise.
NOTE The Bernoulli distribution is used for modelling when the outcome of an experiment is either a "success" or a "failure", where the probability of getting a success is equal to θ. Such an experiment is a Bernoulli trial. The mgf is

MX(t) = (1 − θ) + θe^t.
X ∼ Bin(n, θ)
fX(x) = C^n_x θ^x (1 − θ)^{n−x}, x ∈ X = {0, 1, 2, ..., n}, n ≥ 1, 0 ≤ θ ≤ 1.
NOTES
1. If X1, ..., Xk are independent and identically distributed (IID) Bernoulli(θ) random variables, and Y = X1 + ... + Xk, then by the standard result for mgfs,

MY(t) = {MX(t)}^k = (1 − θ + θe^t)^k,

so Y ∼ Bin(k, θ) by the uniqueness of mgfs. Thus the binomial distribution is used to model the total number of successes in a series of independent and identical experiments.
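This mgf identity can be confirmed symbolically (an illustrative sketch using sympy; k = 4 is an arbitrary choice):

    import sympy as sp

    t, theta = sp.symbols('t theta')
    M_X = 1 - theta + theta * sp.exp(t)          # Bernoulli(theta) mgf
    k = 4
    M_bin = sum(sp.binomial(k, x) * theta**x * (1 - theta)**(k - x) * sp.exp(t * x)
                for x in range(k + 1))           # Bin(k, theta) mgf from its pmf
    print(sp.simplify(sp.expand(M_X**k) - sp.expand(M_bin)))   # 0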
X ∼ Poisson(λ)

fX(x) = e^{−λ} λ^x / x!, x ∈ X = {0, 1, 2, ...}, λ > 0,

and zero otherwise.
NOTES
1. If X ∼ Bin(n, θ), let λ = nθ. Then

MX(t) = (1 − θ + θe^t)^n = {1 + λ(e^t − 1)/n}^n → exp{λ(e^t − 1)}

as n → ∞, which is the mgf of a Poisson random variable. Therefore, the Poisson distribution arises as the limiting case of the binomial distribution, when n → ∞, θ → 0 with nθ = λ constant (that is, for "large" n and "small" θ). So, if n is large and θ is small, we can reasonably approximate Bin(n, θ) by Poisson(λ).
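Numerically the approximation is already good for moderate n (an illustrative sketch with n = 1000, λ = 3):

    from math import comb, exp, factorial

    n, lam = 1000, 3.0
    theta = lam / n
    for x in range(5):
        b = comb(n, x) * theta**x * (1 - theta)**(n - x)   # Bin(n, theta) pmf
        p = exp(-lam) * lam**x / factorial(x)              # Poisson(lam) pmf
        print(x, round(b, 5), round(p, 5))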
so that Y ∼ Poisson(λ1 + λ2). Therefore, the sum of two independent Poisson random variables also has a Poisson distribution. This result can be extended easily; if X1, ..., Xk are independent random variables with Xi ∼ Poisson(λi) for i = 1, ..., k, then

Y = Σ_{i=1}^{k} Xi ⟹ Y ∼ Poisson(Σ_{i=1}^{k} λi).
The Poisson distribution arises as part of a larger modelling framework. Consider an experiment
involving events (such as radioactive emissions) that occur repeatedly and randomly in time. Let
X(t) be the random variable representing the number of events that occur in the interval [0, t], so
that X(t) takes values 0, 1, 2, .... Informally, in a Poisson process we have the following properties.
In the interval (t, t + h) there may or may not be events. If h is small, the probability of an event
in (t, t + h) is roughly proportional to h; it is not very likely that two or more events occur in a
small interval.
(b) P(X(t + h) = n + m | X(t) = n) = λh + o(h) if m = 1; o(h) if m > 1; 1 − λh + o(h) if m = 0.
(c) If s < t, the number of events X(t) − X(s) in (s, t] is independent of the number of events in
[0, s].
Here o(h) denotes any function satisfying

lim_{h→0} o(h)/h = 0.
Then, if

Pn(t) = P[X(t) = n] = P[n events occur in [0, t]],

it can be shown that

Pn(t) = e^{−λt}(λt)^n / n!
(that is, the random variable corresponding to the number of events that occurs in the interval
[0, t] has a Poisson distribution with parameter λt.)
X ∼ Geometric(θ)

fX(x) = (1 − θ)^{x−1} θ, x ∈ X = {1, 2, ...}, 0 ≤ θ ≤ 1,

and zero otherwise.

NOTES
1. The cdf is available analytically as FX(x) = 1 − (1 − θ)^x for x = 1, 2, ..., so that P[X > j] = (1 − θ)^j. Hence

P[X = x + j | X > j] = P[X = x + j, X > j]/P[X > j] = P[X = x + j]/P[X > j] = (1 − θ)^{x+j−1}θ/(1 − θ)^j = (1 − θ)^{x−1}θ = P[X = x].

So P[X = x + j | X > j] = P[X = x]. This property is unique (among discrete distributions) to the geometric distribution, and is called the lack of memory property.
fX(x) = φ^x (1 − φ), x = 0, 1, 2, ....
X ∼ NegBin(n, θ)

fX(x) = C^{x−1}_{n−1} θ^n (1 − θ)^{x−n}, x ∈ X = {n, n + 1, ...}, n ≥ 1, 0 ≤ θ ≤ 1.
NOTES
1. If X ∼ Bin(n, θ) and Y ∼ NegBin(r, θ), then for r ≤ n, P[X ≥ r] = P[Y ≤ n].
2. The Negative Binomial distribution is used to model the number, X, of independent, identical
Bernoulli trials needed to obtain exactly n successes. [The number of trials up to and including
the nth success].
4. If Xi ∼ Geometric(θ), for i = 1, ..., n, are i.i.d. random variables, and Y = X1 + ... + Xn, then Y ∼ NegBin(n, θ) (the result follows immediately using mgfs).
as n −→ ∞, hence the alternate form of the negative binomial distribution tends to the Poisson
distribution as n −→ ∞ with n(1 − θ)/θ = λ constant.
NOTES
1. The hypergeometric distribution is used as a model for experiments involving sampling
without replacement from a finite population. Specifically, consider a finite population of size N ,
consisting of R items of Type I and N − R of Type II; take a sample of size n without replacement, and let X be the number of Type I objects in the sample. The mass function for the hypergeometric distribution can be obtained by using combinatorial counting techniques. However, the form of the mass function does not lend itself readily to calculation of moments etc.
X ∼ Uniform(a, b)

fX(x) = 1/(b − a), a ≤ x ≤ b.
NOTES
1. The cdf is

FX(x) = (x − a)/(b − a), a ≤ x ≤ b.
X ∼ Exp(λ)

fX(x) = λe^{−λx}, x > 0, λ > 0.

NOTES
1. The cdf is

FX(x) = 1 − e^{−λx}, x > 0.
2. An alternative representation uses θ = 1/λ as the parameter of the distribution. This is
sometimes used because the expectation and variance of the Exponential distribution are
E_{fX}[X] = 1/λ = θ,  Var_{fX}[X] = 1/λ².
4. Suppose that X(t) is a Poisson process with rate parameter λ > 0, so that
P[X(t) = n] = e^{−λt}(λt)^n / n!
Let X1 , ..., Xn be random variables defined by X1 = “time that first event occurs”, and, for
i = 2, ..., n, Xi = “time interval between occurrence of (i − 1)st and ith events”. Then X1 , ..., Xn
are IID because of the assumptions underlying the Poisson process. So consider the distribution of X1; in particular, consider the probability P[X1 > x] for x > 0. The event [X1 > x] is equivalent to the event "No events occur in the interval (0, x]", which has probability e^{−λx}. But then FX1(x) = P[X1 ≤ x] = 1 − e^{−λx}, so X1 ∼ Exp(λ).
7. If X ∼ Exp(λ), and Y = X^{1/α} for α > 0, then Y has a (two-parameter) Weibull distribution, and

fY(y) = αλ y^{α−1} e^{−λy^α}, y > 0.
X ∼ Ga(α, β)

fX(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}, x > 0, α, β > 0,

where, for any real number α > 0, the gamma function Γ(.) is defined by

Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt.
NOTES
1. If X1 ∼ Ga(α1, β) and X2 ∼ Ga(α2, β) are independent random variables, and Y = X1 + X2, then Y ∼ Ga(α1 + α2, β) (immediate using mgfs).

2. Ga(1, β) ≡ Exp(β).

3. If X1, ..., Xn are IID Exp(λ) random variables and Y = X1 + ... + Xn, then Y ∼ Ga(n, λ).

4. The gamma function satisfies

Γ(α) = (α − 1)Γ(α − 1),

and hence if α = 1, 2, ..., then Γ(α) = (α − 1)!. A useful fact is that Γ(1/2) = √π.
5. Special Case: If α = 1, 2, ..., the Ga(α/2, 1/2) distribution is also known as the chi-squared distribution with α degrees of freedom, denoted by χ²_α.
6. If X1 ∼ χ²_{n1} and X2 ∼ χ²_{n2} are independent chi-squared random variables with n1 and n2 degrees of freedom respectively, then the random variable F defined as the ratio

F = (X1/n1)/(X2/n2)

has an F distribution with (n1, n2) degrees of freedom.
X ∼ Be(α, β)

fX(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}, 0 < x < 1, α, β > 0.
NOTES
1. If α = β = 1, Be(α, β) ≡ Uniform(0, 1).

2. If X1 ∼ Ga(α, λ) and X2 ∼ Ga(β, λ) are independent, then

Y = X1/(X1 + X2) ∼ Be(α, β).
3. Suppose that random variables X and Y have a joint probability distribution such that the
conditional distribution of X, given Y = y for 0 < y < 1, is binomial, Bin(n, y), and the marginal
distribution of Y is beta, Be(α, β), so that
fX|Y(x|y) = C^n_x y^x (1 − y)^{n−x}, x = 0, 1, ..., n,

fY(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1}(1 − y)^{β−1}, 0 < y < 1.
Note that this provides an example of a joint distribution of continuous Y and discrete X.
X ∼ N(μ, σ²)

fX(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}, x ∈ R, μ ∈ R, σ > 0.
NOTES
1. Special Case: If μ = 0, σ² = 1, then X has a standard or unit normal distribution. Usually, the pdf of the standard normal is written φ(x), and the cdf is written Φ(x).
3. The Central Limit Theorem Suppose X1 , ..., Xn are IID random variables with some mgf
MX , with EfX [Xi ] = μ and V arfX [Xi ] = σ 2 that is, the mgf and the expectation and variance of
the Xi ’s are specified, but the pdf is not. Let the standardized random variable Zn be defined by
Zn = (Σ_{i=1}^{n} Xi − nμ)/√(nσ²),

and let Zn have mgf MZn. Then, as n → ∞,

MZn(t) → exp{t²/2},
irrespective of the distribution of the Xi ’s, that is, the distribution of Zn tends to a standard
normal distribution as n tends to infinity. This theorem will be proved and explained in Chapter
6.
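The effect is easy to see by simulation (an illustrative sketch; Exp(1), for which μ = σ² = 1, is an arbitrary non-normal choice):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 50
    x = rng.exponential(1.0, (10**5, n))                 # decidedly non-normal Xi
    z = (x.sum(axis=1) - n * 1.0) / np.sqrt(n * 1.0)     # Zn with mu = sigma^2 = 1
    print(z.mean(), z.std())      # ≈ 0 and 1
    print((z <= 1.96).mean())     # close to Phi(1.96) = 0.975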
4. If X ∼ N(0, 1) and Y = X², then Y ∼ χ²₁, so that the square of a unit normal random variable has a chi-squared distribution with 1 degree of freedom.
5. If X ∼ N (0, 1), and Y ∼ N (0, 1) are independent random variables, and Z is defined by
Z = X/Y , then Z has a Cauchy distribution
fZ(z) = (1/π) · 1/(1 + z²), z ∈ R.
6. If X ∼ N(0, 1) and Y ∼ Ga(n/2, 1/2) for n = 1, 2, ... (so that Y ∼ χ²_n) are independent random variables, and T is defined by

T = X/√(Y/n),

then T has a Student-t distribution with n degrees of freedom, T ∼ St(n),

fT(t) = [Γ((n + 1)/2)/(Γ(n/2)(nπ)^{1/2})] (1 + t²/n)^{−(n+1)/2}, t ∈ R.
For purely notational reasons, it is convenient in this chapter to consider a random vector X as a column vector, X = (X1, ..., Xk)ᵀ, say.
for 0 ≤ xi ≤ 1 for all i such that x1 + ... + xk + xk+1 = 1, where α = α1 + ... + αk+1 and where xk+1 is defined by xk+1 = 1 − (x1 + ... + xk). This is the density function which reduces to the beta distribution if k = 1. It can also be shown that the marginal distribution of Xi is Be(αi, α − αi).
Properties
2. Since Σ is symmetric and positive definite, there exists a matrix Σ^{1/2} (the 'square root of Σ') such that: (i) Σ^{1/2} is symmetric; (ii) Σ = Σ^{1/2}Σ^{1/2}; (iii) Σ^{1/2}Σ^{−1/2} = Σ^{−1/2}Σ^{1/2} = I, the k × k identity matrix, with Σ^{−1/2} = (Σ^{1/2})^{−1}.

Then, if Z ∼ Nk(0, I), so that Z1, ..., Zk are IID N(0, 1), and X = μ + Σ^{1/2}Z, then X ∼ Nk(μ, Σ). Conversely, if X ∼ Nk(μ, Σ), then Σ^{−1/2}(X − μ) ∼ Nk(0, I).
4. Suppose we partition X as X = (Xa; Xb), with Xa = (X1, ..., Xm)ᵀ and Xb = (Xm+1, ..., Xk)ᵀ. We can similarly partition μ and Σ:

μ = (μa; μb),  Σ = [Σaa Σab; Σba Σbb].
Theorem 6.2.1 Suppose X1, ..., Xn are i.i.d. random variables with mgf MX, with E_{fX}[Xi] = μ and Var_{fX}[Xi] = σ². Define Zn = (Σ_{i=1}^{n} Xi − nμ)/√(nσ²). Then, as n → ∞, MZn(t) → exp{t²/2}, so that Zn converges in distribution to N(0, 1).
Proof. First, let Yi = (Xi − μ)/σ for i = 1, ..., n. Then Y1, ..., Yn are i.i.d. with mgf MY say, and by the elementary properties of expectation,

E_{fY}[Yi] = 0, E_{fY}[Yi²] = 1,

for each i. Using the power series expansion result for mgfs, we have that
MY(t) = 1 + t E_{fY}[Y] + (t²/2!) E_{fY}[Y²] + (t³/3!) E_{fY}[Y³] + (t⁴/4!) E_{fY}[Y⁴] + ...

= 1 + t²/2! + (t³/3!) E_{fY}[Y³] + (t⁴/4!) E_{fY}[Y⁴] + ...
Now, the random variable Zn can be rewritten

Zn = (1/√n) Σ_{i=1}^{n} (Xi − μ)/σ,

and thus, again by a standard mgf result, as Y1, ..., Yn are independent, we have that

MZn(t) = ∏_{i=1}^{n} MY(t/√n) = {1 + t²/(2n) + (t³/(6n^{3/2})) E_{fY}[Y³] + (t⁴/(24n²)) E_{fY}[Y⁴] + ...}^n.
Thus, as n → ∞, using the property of the exponential function that if an → a, then

(1 + an/n)^n → e^a,

we have

MZn(t) → exp{t²/2}.
Then the sequence {Xn} converges in distribution to the random variable X with cdf FX. This is denoted Xn →d X.
Convergence of a sequence of mgfs also implies convergence in distribution. That is, if, for all t at which MX(t) is defined, we have MXn(t) → MX(t) as n → ∞, then Xn →d X.
Definition 6.3.2 The sequence {Xn} of random variables converges in distribution to the constant c if the limiting distribution of Xn is degenerate at c, that is, Xn →d X where the random variable X satisfies P[X = c] = 1. This special type of convergence in distribution occurs when the limiting distribution is discrete, with the probability mass function only being non-zero at a single value.
Theorem 6.3.1 The sequence of random variables {Xn} converges in distribution to c if and only if, for all ε > 0,

lim_{n→∞} P[|Xn − c| < ε] = 1.
This theorem indicates that convergence in distribution to a constant c occurs if and only if the
probability becomes increasingly concentrated around c as n −→ ∞.
Proof. Using the properties of expectation, it can be shown that Yn has expectation μ and variance σ²/n, and hence by the Chebychev Inequality,

P[|Yn − μ| ≥ ε] ≤ σ²/(nε²) → 0, as n → ∞,

for all ε > 0. Hence

P[|Yn − μ| < ε] → 1, as n → ∞,

and Yn →P μ.
Proof. The proof of (a) is simple. Suppose that Xn →qm X. Fix ε > 0. Then, by Markov's inequality,

P(|Xn − X| > ε) = P((Xn − X)² > ε²) ≤ E(Xn − X)²/ε² → 0.

This holds for all ε > 0. Take the limit as ε → 0 and use the fact that FX is continuous at x to conclude that

lim_{n→∞} FXn(x) = FX(x).
Note that the reverse implications do not hold. Convergence in probability does not imply
convergence in quadratic mean. Also, convergence in distribution does not imply convergence in
probability.
CHAPTER 7
STATISTICAL ANALYSIS
Definition 7.1.2 A function T = t(X1, ..., Xn) of a random sample X1, ..., Xn, depending only on X1, ..., Xn, is a statistic. A statistic is a random variable. For example, the sample mean

X̄ = (X1 + X2 + ... + Xn)/n

is a statistic. A statistic T need not necessarily be constructed from a random sample (the random variables need not be independent, identically distributed), but that is the case encountered most often. In many circumstances it is necessary to consider statistics constructed from a collection of independent, but not identically distributed, random variables.
Using standard mgf results, the distribution of Y is derived to be normal with parameters

μY = Σ_{i=1}^{n} ai μi,  σY² = Σ_{i=1}^{n} ai² σi².

Now consider the special case of this result when X1, ..., Xn are independent, identically distributed with μi = μ and σi² = σ², and where ai = 1/n for i = 1, ..., n. Then

Y = (1/n) Σ_{i=1}^{n} Xi = X̄ ∼ N(μ, σ²/n).
Definition 7.2.2 For a random sample X1, ..., Xn from a probability distribution, the sample variance S² is the statistic defined by

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².
Theorem 7.2.2 Suppose that X1, ..., Xn is a random sample from a normal distribution, say Xi ∼ N(μ, σ²). Then the random variable

T = (X̄ − μ)/(S/√n)

has a Student-t distribution with n − 1 degrees of freedom. To prove this, write

V = (n − 1)S²/σ² ∼ χ²_{n−1},  Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1),

so that

T = Z/√(V/(n − 1)),

and use the properties of the normal distribution and related random variables (NOTE 6, following Definition 4.1.5). Also, see EXERCISES 5, Q4 (b).
• ONE SAMPLE Possible tests of interest are: μ = μ0 , σ = σ 0 for some specified constants
μ0 and σ 0 .
Recall from Theorem 7.2.1 that, if X1 , ..., Xn ∼ N (μ, σ 2 ) are the i.i.d. outcome random variables
of n experimental trials, then
X̄ ∼ N(μ, σ²/n) and (n − 1)S²/σ² ∼ χ²_{n−1},
with X̄ and S² statistically independent. Suppose we want to test the hypothesis that μ = μ0, for some specified constant μ0 (for example, μ0 = 20.0). More specifically, we want to test
So, we want to test whether H0 is true, or whether H1 is true. In the case of a Normal sample, the distribution of X̄ is Normal, and

X̄ ∼ N(μ, σ²/n) ⟹ Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1),
where Z is a random variable. Now, when we have observed the data sample, we can calculate x̄, and therefore we have a way of testing whether μ = μ0 is a plausible model; we calculate x̄ from x1, ..., xn, and then calculate

z = (x̄ − μ0)/(σ/√n).
If H0 is true, and μ = μ0, then the observed z should be an observation from an N(0, 1) distribution (as Z ∼ N(0, 1)); that is, it should be near zero with high probability. In fact, z should lie between −1.96 and 1.96 with probability 1 − α = 0.95, as P(−1.96 < Z < 1.96) = 0.95.
Figure 7.1: CRITICAL REGION IN A Z-TEST (taken from Schaum's ELEMENTS OF STATISTICS II, Bernstein & Bernstein).
hypothesis, only rejecting H0 in favour of the alternative H1 if evidence is clear i.e. when the data
represent a rare event under H0 .
As an alternative approach, we could calculate the probability p of observing a z value that is
more extreme than the z we did observe; this probability is given by
p = 2Φ(z) if z < 0, and p = 2(1 − Φ(z)) if z ≥ 0.
If this p is very small, say p ≤ α = 0.05, then again there is evidence that H0 is not true. This
approach is called significance testing.
In summary, we need to assess whether z is a surprising observation from an N(0, 1) distribution; if it is, then we reject H0. Figure 7.1 depicts the critical region in a Z-test (a short code sketch of the full procedure follows the summary below).
• α = 0.05 is the significance level of the test (choosing α = 0.01 gives a “stronger” test);
• the solution CR of Φ(CR ) = 1 − α/2 gives the critical values of the test ±CR . These
critical values define the boundary of a critical region: if the value z is in the critical
region we reject H0 .
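The whole procedure fits in a few lines of code (an illustrative sketch, with simulated data as a stand-in for a real sample):

    import numpy as np
    from scipy.stats import norm

    def z_test(x, mu0, sigma, alpha=0.05):
        # One-sample Z-test of H0: mu = mu0 when sigma is known.
        z = (np.mean(x) - mu0) / (sigma / np.sqrt(len(x)))
        p = 2 * norm.cdf(-abs(z))        # two-sided p-value
        c_r = norm.ppf(1 - alpha / 2)    # critical value, 1.96 for alpha = 0.05
        return z, p, abs(z) > c_r        # reject H0 if z lies in the critical region

    x = np.random.default_rng(8).normal(20.5, 2.0, 40)   # simulated sample
    print(z_test(x, mu0=20.0, sigma=2.0))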
T = Z/√(Y/ν)
has a Student-t distribution with ν degrees of freedom. Using this result, and recalling the sampling distributions of X̄ and S², we see that

T = [(X̄ − μ)/(σ/√n)] / √[((n − 1)S²/σ²)/(n − 1)] = (X̄ − μ)/(S/√n) ∼ t_{n−1}:

T has a Student-t distribution with n − 1 degrees of freedom, denoted St(n − 1), which does not depend on σ².
depend on σ 2 . Thus we can repeat the procedure used in the σ known case, but use the sampling
distribution of T rather than that of Z to assess whether the value of the test statistic is
“surprising” or not. Specifically, we calculate the observed value
t = (x̄ − μ0)/(s/√n)

and find the critical values for an α = 0.05 test by finding the ordinates corresponding to the 0.025 and 0.975 percentiles of a Student-t distribution St(n − 1) (rather than an N(0, 1) distribution).
H0 : σ = σ 0 ,
H1 : σ 6= σ 0 ,
we construct a test based on the estimate of variance, S². In particular, we saw from Theorem 7.2.1 that the random variable Q defined by

Q = (n − 1)S²/σ² ∼ χ²_{n−1},

if the data have an N(μ, σ²) distribution. Hence if we define the test statistic value q by

q = (n − 1)s²/σ0²,

then we can compare q with the critical values derived from a χ²_{n−1} distribution; we look for the 0.025 and 0.975 quantiles. Note that the chi-squared distribution is not symmetric, so we need two distinct critical values.
1. To test

H0: μX = μY,
H1: μX ≠ μY,

when σX = σY = σ is known, so the two samples come from normal distributions with the same, known, variance: from the sampling distributions theorem we have, under H0,

X̄ ∼ N(μX, σ²/nX), Ȳ ∼ N(μY, σ²/nY) ⟹ X̄ − Ȳ ∼ N(0, σ²/nX + σ²/nY),

since X̄ and Ȳ are independent. Hence by the properties of normal random variables,

Z = (X̄ − Ȳ)/(σ√(1/nX + 1/nY)) ∼ N(0, 1),

if H0 is true, giving us a test statistic z defined by

z = (x̄ − ȳ)/(σ√(1/nX + 1/nY)),

which we can compare with the standard normal distribution. If z is a surprising observation from N(0, 1), and lies in the critical region, then we reject H0. This procedure is the Two Sample Z-Test.
2. If we can assume that σX = σY, but the common value σ, say, is unknown, we parallel the one sample t-test by replacing σ by an estimate in the two sample Z-test. First, we obtain an estimate of σ by "pooling" the two samples; our estimate is the pooled estimate, sP², defined by

sP² = [(nX − 1)sX² + (nY − 1)sY²]/(nX + nY − 2),

which we then use to form the test statistic t defined by

t = (x̄ − ȳ)/(sP√(1/nX + 1/nY)).

It can be shown that if H0 is true then t should be an observation from a Student-t distribution with nX + nY − 2 degrees of freedom. Hence we can derive the critical values from the tables of the Student-t distribution.
3. If σX ≠ σY, but both parameters are known, we can use a similar approach to the one above to derive a test statistic z defined by

z = (x̄ − ȳ)/√(σX²/nX + σY²/nY).

4. If σX ≠ σY, but both parameters are unknown, we can use a similar approach to the one above to derive a test statistic t defined by

t = (x̄ − ȳ)/√(sX²/nX + sY²/nY).

The distribution of this statistic when H0 is true is not analytically available, but can be adequately approximated by a Student(m) distribution, where

m = (wX + wY)²/[wX²/(nX − 1) + wY²/(nY − 1)],

with

wX = sX²/nX, wY = sY²/nY.
Clearly, the choice of test we use depends on whether σ X = σ Y or not. We may test this
hypothesis formally; to test
H0 : σ X = σ Y ,
H1 : σ X 6= σ Y ,
we compute the test statistic

Q = SX²/SY²,
which has as null distribution the F distribution with (nX − 1, nY − 1) degrees of freedom. This
distribution can be denoted F (nX − 1, nY − 1), and its quantiles are tabulated. Hence we can
look up the 0.025 and 0.975 quantiles of this distribution (the F distribution is not symmetric),
and hence define the critical region. Informally, if the test statistic value q is very small or very
large, then it is a surprising observation from the F distribution and hence we reject the
hypothesis of equal variances.
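In practice these two sample tests are a one-liner (an illustrative sketch using scipy, with simulated samples standing in for data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    x = rng.normal(10.0, 2.0, 30)
    y = rng.normal(11.0, 2.0, 35)
    print(stats.ttest_ind(x, y, equal_var=True))    # pooled t-test, as in case 2 above
    print(stats.ttest_ind(x, y, equal_var=False))   # Welch approximation, as in case 4 above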
A statistic T = t(X1, ..., Xn) that is used to represent or estimate a function τ(θ) of θ based on an observed sample x1, ..., xn of the random variables is an estimator, and t = t(x1, ..., xn) is an estimate, τ̂(θ), of τ(θ). The estimator T = t(X1, ..., Xn) is said to be unbiased if E(T) = τ(θ); otherwise T is biased.
The rth sample moment, mr = (1/n) Σ_{i=1}^{n} Xi^r, is an estimator of μr = E[X^r].
We may, however, need l > k. Intuitively, and recalling the Weak Law of Large Numbers, it
is reasonable to suppose that there is a close relationship between the theoretical properties of a
probability distribution and estimates derived from a large sample. For example, we know that,
for large n, the sample mean converges in probability to the theoretical expectation.
Definition 7.4.3 Let L(θ) be the likelihood function derived from the joint/mass density
function of random variables X1 , ..., Xn , where θ ∈ Θ ⊆ Rk , say, and Θ is termed the parameter
space. Then for a fixed set of observed values x1 , ..., xn of the variables, the estimate of θ termed
the maximum likelihood estimate (MLE) of θ, θ̂, is defined by

θ̂ = arg max_{θ∈Θ} L(θ).
That is, the maximum likelihood estimate is the value of θ for which L(θ) is maximized in the
parameter space Θ.
DISCUSSION : The method of estimation involves finding the value of θ for which L(θ) is
maximized. This is generally done by setting the first partial derivatives of L(θ) with respect to
θj equal to zero, for j = 1, ..., k, and solving the resulting k simultaneous equations. But we must
be alert to cases where the likelihood function L(θ) is not differentiable, or where the maximum
occurs on the boundary of Θ! Typically, it is easier to obtain the MLE by maximising the
(natural) logarithm of L(θ): we maximise l(θ) = log L(θ), the log-likelihood.
THE FOUR STEP ESTIMATION PROCEDURE: Suppose a sample x1 , ..., xn has been
obtained from a probability model specified by mass or density function fX (x; θ) depending on
parameter(s) θ lying in parameter space Θ. The maximum likelihood estimate is produced as
follows;
4. Check that the estimate θ̂ obtained in STEP 3 truly corresponds to a maximum in the (log-)likelihood function by inspecting the second derivative of log L(θ) with respect to θ. In the single parameter case, if the second derivative of the log-likelihood is negative at θ = θ̂, then θ̂ is confirmed as the MLE of θ (other techniques may be used to verify that the likelihood is maximized at θ̂).
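As a worked instance of the four steps (a sketch; the Exp(λ) model and simulated data are assumptions for the example): for X1, ..., Xn IID Exp(λ), l(λ) = n log λ − λ Σxi, so dl/dλ = n/λ − Σxi = 0 gives λ̂ = 1/x̄, and d²l/dλ² = −n/λ² < 0 confirms a maximum.

    import numpy as np

    rng = np.random.default_rng(10)
    x = rng.exponential(scale=2.0, size=1000)   # true lambda = 1/2
    print(1 / x.mean())                         # MLE lambda-hat ≈ 0.5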
The techniques of the previous section provide just a single point estimate of the value of an
unknown parameter. Instead, we can attempt to provide a set or interval of values which
expresses our uncertainty over the unknown value.
Given data, x = (x1 , . . . , xn ), the realised value of X, we calculate the interval (t1 , t2 ), where
t1 = l1 (x1 , . . . , xn ) and t2 = l2 (x1 , . . . , xn ). NOTE: θ here is fixed (it has some true, fixed but
unknown value in Θ), and it is T1 , T2 that are random. The calculated interval (t1 , t2 ) either does,
or does not, contain the true value of θ. Under repeated sampling of X, the random interval
(T1 , T2 ) contains the true value a proportion 1 − α of the time. Typically, we will take α = 0.05 or
α = 0.01, corresponding to a 95% or 99% confidence interval.
If θ is a vector, then we use a confidence set, such as a sphere or ellipse, instead of an interval.
So, C(X) is a 1 − α confidence set for θ if P (θ ∈ C(X)) = 1 − α, for all possible θ.
The key problem is to develop methods for constructing a confidence interval. There are two
general procedures.
If Q = q(X; θ) is a pivotal quantity with a probability density function, then for any fixed 0 < α < 1 there will exist q1 and q2, depending on α, such that P(q1 < Q < q2) = 1 − α. Then, if for each possible sample value x = (x1, ..., xn), q1 < q(x1, ..., xn; θ) < q2 if and only if l1(x1, ..., xn) < θ < l2(x1, ..., xn), for functions l1 and l2 not depending on θ, then (T1, T2) is a 1 − α confidence interval for θ, where Ti = li(X1, ..., Xn), i = 1, 2.
A second method utilises a correspondence between hypothesis testing and interval estimation.
Definition 7.5.3 For each possible value θ0, let A(θ0) be the acceptance region of a test of H0: θ = θ0 of significance level α, so that A(θ0) is the set of data values x such that H0 is accepted in a test of significance level α. For each x, define a set C(x) by

C(x) = {θ0 : x ∈ A(θ0)}.
Example. Let X = (X1, ..., Xn), with the Xi IID N(μ, σ²), with σ² known. Both the above procedures yield a 1 − α confidence interval for μ of the form (X̄ − z_{α/2}σ/√n, X̄ + z_{α/2}σ/√n), where Φ(z_{α/2}) = 1 − α/2.
We know that

(X̄ − μ)/(σ/√n) ∼ N(0, 1),

and so is a pivotal quantity. We have

P(−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}) = 1 − α.

Rearranging gives

P(X̄ − z_{α/2}σ/√n < μ < X̄ + z_{α/2}σ/√n) = 1 − α,

which is true for every μ, so the confidence interval obtained by inverting the test is the same as that derived by the pivotal quantity method.
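The example interval is equally direct to compute (an illustrative sketch with simulated data):

    import numpy as np
    from scipy.stats import norm

    def normal_ci(x, sigma, alpha=0.05):
        # 1 - alpha confidence interval for mu when sigma is known
        z = norm.ppf(1 - alpha / 2)
        half = z * sigma / np.sqrt(len(x))
        return x.mean() - half, x.mean() + half

    x = np.random.default_rng(11).normal(5.0, 1.5, 50)
    print(normal_ci(x, sigma=1.5))   # should contain mu = 5.0 about 95% of the time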