
CMP5034A: Mathematical Statistics

ARISTIDIS K. NIKOLOULOPOULOS

September 2014

Acknowledgment
Parts of these notes are based on previous course notes, written by Gareth J. Janacek.

Contents

1 Probability and Distributions
  1.1 Sample space and events
    1.1.1 Combining events
  1.2 Probability
    1.2.1 Conditional Probability
  1.3 Random variables and their probability distributions
  1.4 Expectation
  1.5 Moments
  1.6 Some Discrete Distributions
    1.6.1 Discrete Uniform
    1.6.2 The Binomial B(n, p)
    1.6.3 The Poisson P(λ)
    1.6.4 The Geometric Geo(p)
  1.7 Some Continuous Distributions
    1.7.1 The Uniform U(α, β)
    1.7.2 The Exponential Exp(λ)
    1.7.3 The Gamma G(α, β)
    1.7.4 The Beta Be(α, β)
    1.7.5 The Normal N(µ, σ²)
    1.7.6 The Student t_ν
    1.7.7 The Chi-square χ²_ν
  1.8 Expected values and variances of the preceding distributions
  1.9 Generating functions
    1.9.1 Moment Generating Functions
    1.9.2 Probability generating functions
  1.10 Transformation of random variables
    1.10.1 The method of cumulative distribution function
    1.10.2 The method of transformation
    1.10.3 The method of moment generating function

2 Multivariate distributions
  2.1 Discrete case
  2.2 Continuous case
  2.3 Expected values
    2.3.1 Covariance
  2.4 Independence
  2.5 Expectation for sums of random variables
  2.6 Distribution of sums of independent random variables
    2.6.1 Moment Generating Functions and sums
    2.6.2 Central limit theorem

3 Methods of estimation
  3.1 Populations and samples
    3.1.1 Sampling distributions
  3.2 Estimation
  3.3 Properties of estimators
    3.3.1 Bias
    3.3.2 Consistency
  3.4 Method of Moments
  3.5 Maximum likelihood
    3.5.1 Likelihood function
    3.5.2 Maximum Likelihood Estimators (MLE's)
    3.5.3 Properties of MLE's
  3.6 Confidence intervals
    3.6.1 Asymptotic confidence intervals

4 Introduction to Hypothesis Testing
  4.1 Hypothesis Testing
    4.1.1 Type I and II errors
  4.2 A general approach
    4.2.1 Power
    4.2.2 Summary of the definitions
  4.3 Constructing tests
    4.3.1 A Binomial example
    4.3.2 A two tailed test!
    4.3.3 Normal small sample case
    4.3.4 Sample size choice

5 The Neyman Pearson Lemma
  5.1 The Neyman Pearson Lemma
    5.1.1 Discrete Distributions
  5.2 Uniformly Most Powerful Tests (UMP)

6 Likelihood Ratio Tests
  6.1 Likelihood Ratio Tests
  6.2 Asymptotic likelihood ratio test
  6.3 Goodness-of-fit tests
    6.3.1 Categories
    6.3.2 The multinomial distribution
    6.3.3 Maximum likelihood estimators of multinomial
    6.3.4 Likelihood ratio for the multinomial
    6.3.5 Goodness of fit - Non-multinomial distribution

7 Nonparametric inference
  7.1 The Kolmogorov-Smirnov test
  7.2 Permutation Tests
  7.3 The sign test
  7.4 Wilcoxon's Signed Rank Test
  7.5 Paired Tests
  7.6 Unpaired Samples - The Mann-Whitney Test

8 Bayesian inference
  8.1 Bayes theorem
    8.1.1 The Continuous Analogue
  8.2 Subjective Inference
  8.3 Choice of prior
    8.3.1 Conjugate class
    8.3.2 Noninformative priors
  8.4 Predictive distribution
Chapter 1

Probability and Distributions

1.1 Sample space and events


We take a pretty conventional view of probability. Suppose we have an experiment with
set of outcomes Ω. We call Ω the sample space. So
• If we toss a coin Ω consists of {Head, Tail}, we write Ω = {Head, Tail},

• If we roll a die Ω={ 1,2,3,4,5,6}

• When we roll two dice then Ω is the set of pairs

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)


(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

An event ω is a collection of outcomes of interest, for example rolling two dice and get-
ting a double. In this case the event ω is defined as
ω ={ (1,1),(2,2),(3,3),(4,4),(5,5),(6,6)}.

Suppose that the event ω is that the sum is less than 4 when we roll two dice, then

ω ={ (1,1),(1,2),(2,1)} .

If two events ω₁ and ω₂ have no elements in common then we say they are mutually
exclusive.
I prefer Roman symbols for events, so for example let A be the event {At least one 6}
A = {(1,6),(2,6),(3,6),(4,6),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)}
and define the event B as
B = {(2,3),(3,2)}
Then A and B are mutually exclusive.


1.1.1 Combining events


If we have two events, say A and B then
• It is handy to have a symbol for not A; we use A^C, but we are not very picky and "not A" is acceptable.

• The event A and B, often written A ∩B is the set of outcomes which belong both to
A and to B.

• The event A or B, often written A ∪ B is the set of outcomes which belong either to
A or to B or to both.
Example 1.1.1. Suppose Ω={0,1,2,3,4,5,6,7,8,9} then if we define A={1,3,5,7,9} and B={4,5,7}
we have
• A ∩ B = {5,7}

• While A ∪ B = {1,3,4,5,7,9}

• B^C = not B = {0,1,2,3,6,8,9}

1.2 Probability
To any subset ω ⊂ Ω we assign a measure P [ω] called the probability of ω. These mea-
sures satisfy the basic axioms of probability. For any A ⊂ Ω
1. 0 ≤ P [A] ≤ 1

2. P [Ω] = 1

3. If A and B are mutually exclusive then P [A ∪ B] = P [A] + P [B]


Note that for pairwise mutually exclusive events A_1, A_2, ...,

P[∪_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i].

Various useful results follow, for example


• P [not A] = 1 − P [A]

• P [A ∩ B] = P [A] + P [B] − P [A ∪ B].


While we have the properties of the probability measure and can prove theorems we
have said nothing about the calculation of probabilities. This is reasonably simple in
the case where the number of outcomes of an experiment is finite and each outcome
is equally likely. In this case the probability of each element of Ω is the same: if there are
n distinct outcomes, the probability of each is 1/n. The probability of any event can then be computed
by counting the number of outcomes which make up the event. While this reduces to a
counting problem it is worth bearing in mind that counting can be hard.
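As a small illustration of the counting approach, here is a sketch in R; the events follow the dice examples above, and the probabilities come out as counts over 36.

# all 36 equally likely outcomes when two dice are rolled
omega <- expand.grid(die1 = 1:6, die2 = 1:6)
nrow(omega)                               # 36
mean(omega$die1 == omega$die2)            # P[double] = 6/36 = 1/6
mean(omega$die1 == 6 | omega$die2 == 6)   # P[at least one 6] = 11/36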


1.2.1 Conditional Probability


Of course it could well be argued that we have to define probability in a conditional
sense. Thus the probability of A given B has occurred, written P [A|B] is defined as
P[A|B] = P[A ∩ B] / P[B].
The definition of independence is natural: A and B are independent if and only if
P [A ∩ B] = P [A]P [B].
While conditional probabilities can have interesting philosophical implications they
also allow one to do calculations. Thus

P[A] = P[A|B]P[B] + P[A|not B]P[not B]


or more generally, if B_1, ..., B_n are mutually exclusive events with ∪_{i=1}^n B_i = Ω, then

P[A] = Σ_{i=1}^n P[A|B_i] P[B_i].
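A quick numerical check of this formula on the two-dice sample space (a sketch in R; B_i is taken here to be the event that the first die shows i):

omega <- expand.grid(die1 = 1:6, die2 = 1:6)   # equally likely outcomes
A <- omega$die1 + omega$die2 < 4               # the event "sum less than 4"
mean(A)                                        # P[A] = 3/36
total <- 0
for (i in 1:6) {
  Bi <- omega$die1 == i
  total <- total + mean(A[Bi]) * mean(Bi)      # P[A|B_i] P[B_i]
}
total                                          # same value as P[A]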

1.3 Random variables and their probability distributions


Traditionally we have labeled the outcome ω in some way X (ω) and called the func-
tion we use, say X , a random variable (rv). We almost always omit the argument ω.
So a random variable X is a real valued function defined over a sample space. More
typically, a random variable is an entity whose observed values are the outcomes of a
random experiment. In this sense, their value cannot be a priori determined. That is,
we do not know what they are going to be before we collect the sample, run the exper-
iment, etc. The mechanism determining the probability or chance of observing each
individual value of the random variable is called a probability distribution (as it liter-
ally distributes the probability among all the possible values of the random variables).
Probability distributions are defined through frequency tables, graphs, or mathematical
expressions.
A random variable X is discrete if it can take only a finite or countably infinite num-
ber of distinct values, while it is continuous if it can take any value in an interval. Any
random variable has a probability distribution. There are two types of probability distri-
butions corresponding to the two kinds of random variables:
• Discrete probability distributions: These specify the chance of observing a small
countable number of possible values (e.g., race, sex).

• Continuous probability distributions: These handle cases where all possible (real)
numbers can be observed (e.g., height or weight).
Note in passing that large or infinite numbers of countable values are handled by con-
tinuous distributions.
If X is discrete, the function

p_X(x) = Pr(X = x),

is called the probability mass function (pmf) of X. If X is continuous and there is a non-negative function f_X(x) such that

Pr(X ∈ Δ) = ∫_Δ f_X(x) dx,

for every interval Δ, then the function f_X(x) is the density of the random variable X.
The pmfs have the following properties:

1. p_X(x) ≥ 0 for each x ∈ R.

2. Σ_{x=0}^{k} p_X(x) = 1, where k is the number of different atoms.

3. Pr(X ∈ B) = Σ_{x ∈ B} p_X(x), for any B ⊆ Ω.

The densities have the following properties:

1. f_X(x) ≥ 0 for each x ∈ R.

2. ∫_{−∞}^{∞} f_X(x) dx = 1.

We define the cumulative distribution function (cdf) F X (x) as,

F X (x) = P [X ≤ x], x ∈ R,

with the following properties:

1. Pr(a < X ≤ b) = F_X(b) − F_X(a).

2. If the rv X is continuous then,

F_X(x) = ∫_{−∞}^{x} f_X(t) dt  and  f_X(x) = F′_X(x) for each x ∈ R.

3. For a discrete rv X taking integer values, Pr(X = x) = F_X(x) − F_X(x − 1).

4. 0 ≤ F X (x) ≤ 1 for each x ∈ R.

5. F X (x) is increasing: If x1 < x2 then F X (x1 ) ≤ F X (x2 ).

6. F (−∞) = limx→−∞ F X (x) = 0 and F (∞) = limx→∞ F X (x) = 1.

In the sequel let F X (x) = F (x), f X (x) = f (x), and p X (x) = p(x) for shorthand notation.

Example 1.3.1. Suppose F (x) = 1 − exp(−λx), x ≥ 0, for some positive λ. Then if λ = 3,

• P [X ≤ 2] = F (2) = 1 − exp(−6).

• P[2 ≤ X ≤ 4] = F(4) − F(2) = 1 − exp(−12) − [1 − exp(−6)] = exp(−6) − exp(−12).

• The density function is f(x) = F′(x) = λ exp(−λx) and we can also write,

P[2 ≤ X ≤ 4] = ∫_2^4 3 exp(−3x) dx.

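A quick numerical check of Example 1.3.1 in R (a sketch; pexp is the exponential cdf):

lambda <- 3
pexp(2, rate = lambda)                              # P[X <= 2] = 1 - exp(-6)
pexp(4, rate = lambda) - pexp(2, rate = lambda)     # P[2 <= X <= 4]
integrate(function(x) lambda*exp(-lambda*x), 2, 4)$value   # same, by integrating the density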

1.4 Expectation
We can also view probability from the point of view of what happens in the long run.
Given a random variable X define the expected value of X written E [X ] as,
E[X] = Σ_{x=0}^{k} x p(x) if X is a discrete rv, and E[X] = ∫_{−∞}^{∞} x f(x) dx if X is a continuous rv,

where k is the number of different atoms for the discrete rv. The expected value can be
regarded as the long run average.
Example 1.4.1. If we roll a die and the outcome is X then P[X = i] = 1/6, i = 1, 2, ..., 6, and so

E[X] = 1 × 1/6 + 2 × 1/6 + ··· + 6 × 1/6 = 3.5.
You can be sure that if you roll a die you will never get 3.5, however if you rolled a die, say
N times and kept an average of the score you will find that this will approach 3.5 as N
increases.
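The "long run average" interpretation is easy to see by simulation (a sketch in R):

rolls <- sample(1:6, 10000, replace = TRUE)
running <- cumsum(rolls)/seq_along(rolls)    # running average after each roll
running[c(10, 100, 1000, 10000)]             # drifts towards 3.5 as N grows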
Example 1.4.2. For a coin we have Head and Tail. Suppose we count head as 1 and tail
as 0, then
P [X = 1] = 1/2 and P [X = 0] = 1/2
and so E[X] = 1 × 1/2 + 0 × 1/2 = 1/2.
Things work in much the same way in the continuous case by replacing the summa-
tion with an integral as described in the following example.
Example 1.4.3. If the density of the rv X is f(x) = (3/8)x², 0 ≤ x ≤ 2, and 0 elsewhere, then

E[X] = ∫_{−∞}^{∞} x f(x) dx = ∫_0^2 x (3/8)x² dx = 1.5.
The expected value of a function h(X ) of a rv X is defined similarly as below,
E[h(X)] = Σ_{x=0}^{k} h(x) p(x) if X is a discrete rv, and E[h(X)] = ∫_{−∞}^{∞} h(x) f(x) dx if X is a continuous rv;

k is the number of different atoms for the discrete rv.


Some properties of the expected value follow:
1. E[c] = c.

2. E[a h(X) + b] = a E[h(X)] + b.

3. E[h₁(X) + h₂(X)] = E[h₁(X)] + E[h₂(X)].

4. If h₁(X) ≥ h₂(X) then E[h₁(X)] ≥ E[h₂(X)].

5. |E[h(X)]| ≤ E|h(X)|.

6. If E[X^n] exists then E[X^m] exists for each m = 1, 2, ..., n − 1.


1.5 Moments
Some important expected values in statistics are the moments,

µ_r = E[X^r],   r = 1, 2, ...,

and the central moments,

µ′_r = E[(X − µ)^r],   r = 1, 2, ....

You will have met the mean µ = E[X] and the variance Var(X) = σ² = E[(X − µ)²]. Some
properties of the variance are,

1. Var(X) = E[X²] − (E[X])².

2. Var(c) = 0.

3. Var(aX + b) = a² Var(X).

The parameter σ is known as the standard deviation. We can prove an interesting link
between the mean µ and the variance σ². The result is known as Chebyshev's inequality,

P[|X − µ| > ε] ≤ (σ/ε)²     (1.1)

This tells us that departures from the mean have small probability when σ is small. Thus
the long run average is the mean and the size of departures from this long run average
is controlled by the variance.
The third and fourth central moments E[(X − µ)³], E[(X − µ)⁴] are less commonly used.

1.6 Some Discrete Distributions


We are assuming that you have some familiarity with probability distributions. We shall
run through some of the most common and important probability distributions.

1.6.1 Discrete Uniform


The discrete uniform distribution is a probability distribution whereby a finite number
of equally spaced values are equally likely to be observed. The random variable takes
the values 1, 2, 3, · · · , n with equal probability. That is,

p(x) = 1/n,

for each x in 1, 2, 3, · · · , n.


1.6.2 The Binomial B (n, p)


It is the discrete probability distribution of the number of successes in a sequence of
n independent yes/no experiments, each of which yields success with probability p.
Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial;
when n = 1, the binomial distribution is a Bernoulli distribution; the Binomial distribution describes n repeated Bernoulli trials. The probability of getting exactly x successes
in n trials is given by the probability mass function,

p(x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, 2, ..., n.

1.6.3 The Poisson P (λ)


The Poisson distribution expresses the probability of a given number of events occurring in a fixed
interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The probability mass function of the Poisson
distribution is,

p(x) = λ^x exp(−λ)/x!,   x = 0, 1, 2, ...,
where λ is a positive real number, equal to the expected number of occurrences during
the given interval.

1.6.4 The Geometric Geo(p)


It gives the probability that the first success requires x independent trials, each with success probability p. If the probability of success on each trial is p,
then the probability that the xth trial is the first success is,

p(x) = (1 − p)^{x−1} p,   x = 1, 2, ...
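These pmfs are available in R as dbinom, dpois and dgeom; a quick check follows (the parameter values are arbitrary illustrations; note that R's dgeom counts failures before the first success, so the pmf above corresponds to dgeom(x − 1, p)):

dbinom(3, size = 10, prob = 0.4)   # binomial: choose(10, 3) * 0.4^3 * 0.6^7
dpois(2, lambda = 3)               # Poisson: 3^2 exp(-3)/2!
p <- 0.2; x <- 4
dgeom(x - 1, prob = p)             # geometric: (1 - p)^(x - 1) p
(1 - p)^(x - 1) * p                # same, from the formula above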

1.7 Some Continuous Distributions

1.7.1 The Uniform U (α, β)


It is a family of probability distributions such that for each member of the family, all
intervals of the same length on the distribution’s support are equally probable. The sup-
port is defined by the two parameters, α and β, which are its minimum and maximum
values. The density function is,

f(x) = 1/(β − α),   α ≤ x ≤ β.


1.7.2 The Exponential E xp(λ)


It describes the time between events in a Poisson process, i.e. a process in which events
occur continuously and independently at a constant average rate. The density of an
exponential distribution is,

f (x) = λ exp(−λx), x ≥ 0, (1.2)

where λ > 0 is the parameter of the distribution, often called the rate parameter.

1.7.3 The Gamma G(α, β)


It is a two-parameter family of continuous probability distributions. It has a shape pa-
rameter α > 0 and a scale parameter β > 0. The gamma distribution is frequently a
probability model for waiting times; for instance, in life testing, the waiting time until
death is a random variable that is frequently modeled with a gamma distribution. The
probability density function of a gamma-distributed random variable X is,

f(x) = x^{α−1} exp(−x/β) / (β^α Γ(α)),   x ≥ 0,     (1.3)

where Γ(·) is the Gamma function. The exponential is a special case of G(α, β): the density in (1.3) is the same as the density in (1.2) when α = 1 and β = 1/λ, i.e.,
G(1, 1/λ) ≡ Exp(λ).

1.7.4 The Beta B e(α, β)


It is defined on the unit interval (0, 1) and parameterized by two positive shape param-
eters, typically denoted by α > 0 and β > 0. The Beta distribution can be suited to the
statistical modelling of proportions in applications where values of proportions equal to
0 or 1 do not occur. One theoretical case where the beta distribution arises is as the dis-
tribution of the ratio formed by one random variable having a Gamma distribution di-
vided by the sum of it and another independent random variable also having a Gamma
distribution with the same scale parameter (but possibly different shape parameter).
The probability density function of the beta distribution is,

f(x) = x^{α−1} (1 − x)^{β−1} / B(α, β),   0 < x < 1,
where B(α, β) is the Beta function.

1.7.5 The Normal N (µ, σ2 )


The most well known distribution is the Normal or Gaussian distribution. The graph of
the associated probability density function is "bell"-shaped, and is known as the Gaussian function or bell curve:

f(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   −∞ < x < ∞,


where the parameter µ is the mean and σ² is the variance (a measure of the width of the
distribution). The distribution with µ = 0 and σ² = 1 is called the standard normal, that is
N(0, 1). Note that one cannot evaluate the normal integral analytically; however, the standard
normal N(0, 1) is tabulated. If X is N(µ, σ²) then it is easy to show that Z = (X − µ)/σ is standard
Normal. So we transform to standard normal and use the appropriate tables. We must
use tables since the integral,

Φ(x) = P[Z ≤ x] = ∫_{−∞}^{x} (1/√(2π)) exp(−z²/2) dz,

is not simple.
The importance of this distribution comes from the Central Limit theorem, which
tells us that sums of random variables have distributions that tend to normality under a
wide range of conditions.
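A minimal R illustration of the standardization step (the numbers here are just illustrative; pnorm plays the role of the Φ tables):

mu <- 62; sigma <- 20; x <- 80
pnorm((x - mu)/sigma)               # P[X <= 80] via Z = (X - mu)/sigma
pnorm(x, mean = mu, sd = sigma)     # the same probability computed directly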

1.7.6 The student t ν


Student’s t-distribution is symmetric and bell-shaped, like the normal distribution, but
has heavier tails, meaning that it is more prone to producing values that fall far from its
mean; see Figure 1.1. Student’s t-distribution has the probability density function given
by,
f(x) = [Γ((ν+1)/2) / (√(νπ) Γ(ν/2))] (1 + x²/ν)^{−(ν+1)/2},   −∞ < x < ∞,

where ν is the number of degrees of freedom and Γ(·) is the Gamma function.

[Figure 1.1: Student t₂ vs standard normal N(0, 1) density.]

A final characteristic of the t distribution is that as ν gets larger, the t gets closer to the standard
normal.


1.7.7 The Chi-square χ2ν


The chi-square distribution with ν degrees of freedom arises as the distribution of a sum of ν squares
of independent standard normally distributed variables. Its shape is non-symmetrical and it covers
only the positive real numbers (being a sum of squared numbers); see Figure 1.2.

[Figure 1.2: The χ²₅ density.]

The Chi-square χ²_ν distribution has the probability density function given by,

f(x) = x^{ν/2−1} exp(−x/2) / (2^{ν/2} Γ(ν/2)),   x ≥ 0,     (1.4)

where ν is the number of degrees of freedom and Γ(·) is the Gamma function. Note
that the density in (1.4) is the same as the density in (1.3) for α = ν/2 and β = 2, i.e.,
χ²_ν ≡ G(ν/2, 2).

1.8 Expected values and variances of the preceding distributions

The expected values and variances of the preceding distributions are given below:

1. Discrete Uniform: µ = (n + 1)/2, σ² = (n² − 1)/12.

2. Binomial B(n, p): µ = np, σ² = np(1 − p).

3. Poisson P(λ): µ = λ, σ² = λ.

4. Geometric Geo(p): µ = 1/p, σ² = (1 − p)/p².

5. Uniform U(α, β): µ = (α + β)/2, σ² = (β − α)²/12.

6. Exponential Exp(λ): µ = 1/λ, σ² = 1/λ².

7. Gamma G(α, β): µ = αβ, σ² = αβ².

8. Beta Be(α, β): µ = α/(α + β), σ² = αβ/[(α + β)²(α + β + 1)].

9. Normal N(µ, σ²): mean µ, variance σ².

10. Student t_ν: µ = 0, σ² = ν/(ν − 2) for ν > 2.

Means and variances for more distributions are given in the Statistical Tables.
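A couple of these entries can be checked by simulation in R (a sketch; the parameter values are arbitrary illustrations):

x <- rbinom(1e5, size = 10, prob = 0.3)
mean(x); var(x)            # compare with np = 3 and np(1 - p) = 2.1
y <- rgamma(1e5, shape = 2, scale = 3)
mean(y); var(y)            # compare with alpha*beta = 6 and alpha*beta^2 = 18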

1.9 Generating functions


1.9.1 Moment Generating Functions
Suppose we have a random variable X . We define the moment generating function as,

M X (t ) = E [exp(t X )]. (1.5)

That is,

M_X(t) = ∫_{−∞}^{∞} f(x) exp(tx) dx,

for a continuous rv X, and,

M_X(t) = Σ_x P[X = x] exp(tx),

for a discrete rv X .

Example 1.9.1. If X is exponential with density f(x) = λ exp(−λx), then

M_X(t) = ∫_0^∞ λ exp(−λx) exp(tx) dx = ∫_0^∞ λ exp(−(λ − t)x) dx = λ/(λ − t),   for t < λ.


Example 1.9.2. For a Geometric variable with pmf,


p(x) = P[X = x] = (1 − p)p^x,   x = 0, 1, ...,

M_X(t) = Σ_{x=0}^∞ (1 − p)p^x exp(xt) = (1 − p) Σ_{x=0}^∞ (p exp(t))^x = (1 − p)/(1 − p exp(t)),   for p exp(t) < 1.
We also have the following interesting relation,

M_X(t) = 1 + µt + µ₂t²/2! + µ₃t³/3! + ···.     (1.6)

This means that if we know the mgf we can easily compute the kth moment using a
power series expansion or by using the fact that,

d^k M_X(t)/dt^k = µ_k   at t = 0.

In fact the mgf is just a Laplace transform, and we know that (if it exists) it is
uniquely defined by (and defines) the probability distribution.
Example 1.9.3. For a Gamma distribution with density,
f(x) = (λ^r / Γ(r)) x^{r−1} exp(−λx),

we find the mgf is,

M_X(t) = (λ/(λ − t))^r.

Using a Taylor expansion,

M_X(t) = 1 + (r/λ)t + (r/λ² + (1/2) r(r − 1)/λ²) t² + (r(r − 1)/λ³ + r/λ³ + (1/6) r(r − 1)(r − 2)/λ³) t³
       + ((3/2) r(r − 1)/λ⁴ + r/λ⁴ + (1/2) r(r − 1)(r − 2)/λ⁴ + (1/24) r(r − 1)(r − 2)(r − 3)/λ⁴) t⁴ + O(t⁵).
So if we look at (1.6) we have,
• µ = r/λ

• µ₂ = E[X²] = 2! (r/λ² + (1/2) r(r − 1)/λ²) = r(r + 1)/λ²

• µ₃ = E[X³] = 3! (r(r − 1)/λ³ + r/λ³ + (1/6) r(r − 1)(r − 2)/λ³) = r(r + 1)(r + 2)/λ³

or

dM(t)/dt = r (λ/(λ − t))^r (λ − t)^{−1},

so setting t = 0 gives,

µ = r/λ,

while

d²M(t)/dt² = r(r + 1) (λ/(λ − t))^r (λ − t)^{−2};

again setting t = 0 gives,

µ₂ = r(r + 1)/λ² = 2! (r/λ² + (1/2) r(r − 1)/λ²).
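The moments obtained from the mgf can be checked by simulation (a sketch in R; r and λ are arbitrary illustrative values):

r <- 3; lambda <- 2
x <- rgamma(1e6, shape = r, rate = lambda)
mean(x);   r/lambda                 # first moment mu
mean(x^2); r*(r + 1)/lambda^2       # second moment mu_2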


1.9.2 Probability generating functions


Suppose that X is a discrete random variable taking values on some subset of the non-
negative integers, {0, 1, . . .}, then the probability-generating function of X is defined as:

G_X(z) = E[z^X] = Σ_{k=0}^∞ z^k P[X = k].

Example 1.9.4. For the Poisson distribution, where P[X = k] = p_k = λ^k exp(−λ)/k!, we have

G(z) = Σ_{k=0}^∞ z^k λ^k exp(−λ)/k! = Σ_{k=0}^∞ (zλ)^k exp(−λ)/k! = exp(−λ) exp(λz) = exp[λ(z − 1)].
Example 1.9.5. For the Binomial with P[X = k] = \binom{n}{k} p^k q^{n−k},

G(z) = Σ_{k=0}^n z^k \binom{n}{k} p^k q^{n−k} = Σ_{k=0}^n \binom{n}{k} (zp)^k q^{n−k} = (q + pz)^n.

There are some simple results which follow directly from the definition,

1. G(0) = P[X = 0]

2. G(1) = 1

3. G′(1) = E[X]

In our view mgfs are much more useful; however, in some areas of probability the right
tool to use is the probability generating function. As in most areas of life it is a question
of picking the right tool.
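The pgf results above are easy to verify numerically for the Binomial (a sketch in R; the parameter values are arbitrary):

n <- 10; p <- 0.3; q <- 1 - p; z <- 0.7
sum(z^(0:n) * dbinom(0:n, n, p))   # G(z) by direct summation
(q + p*z)^n                        # the closed form above
sum((0:n) * dbinom(0:n, n, p))     # G'(1) = E[X] = np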

1.10 Transformation of random variables


Suppose we have a random variable X with a known distribution. However our interest
is in another random variable Y defined as Y = h(X ) where h is a known function. So
we want to find the distribution of Y . Below three methods to solve this problem are
outlined.

1.10.1 The method of cumulative distribution function


With this method (for a monotone increasing h) the cumulative distribution function of Y can be given as below,

F_Y(y) = Pr(Y ≤ y) = Pr(h(X) ≤ y) = Pr(X ≤ h^{−1}(y)) = F_X(h^{−1}(y)).


Example 1.10.1. Suppose X ∼ Exp(1). Below we will find the distribution of Y = 1 − exp(−X). Note in passing that Y can take values in [0, 1].

F_Y(y) = P(Y ≤ y) = P(1 − e^{−X} ≤ y) = P(e^{−X} ≥ 1 − y) = P(X ≤ −log(1 − y))
       = ∫_0^{−log(1−y)} e^{−x} dx = 1 − e^{log(1−y)} = 1 − (1 − y) = y,   0 < y < 1.

⇒ F_Y(y) = 0 for y ≤ 0, F_Y(y) = y for 0 < y < 1, F_Y(y) = 1 for y ≥ 1,

⇒ f_Y(y) = 1 for 0 < y < 1, and 0 otherwise,

⇒ Y ∼ U(0, 1).

1.10.2 The method of transformation


For continuous random variables, the density of Y can be given as below,
f_Y(y) = f_X(h^{−1}(y)) |dx/dy|.

Example 1.10.2. Suppose X ∼ E xp(1/θ). Find the distribution of Y = X /θ. We have that,
f_Y(y) = (1/θ) e^{−(1/θ)θy} |d(θy)/dy| = (1/θ) e^{−y} θ = e^{−y}   ⇒   Y ∼ Exp(1).

1.10.3 The method of moment generating function


With this method the mgf of Y can be given as below,
M_Y(t) = E[exp(tY)] = E[exp(t h(X))] = Σ_x exp(t h(x)) p_X(x)   (discrete case), or
M_Y(t) = ∫_{−∞}^{∞} exp(t h(x)) f_X(x) dx   (continuous case).

Example 1.10.3. Suppose X ∼ N (0, 1). Find the distribution of Y = X 2 . We have that,
M_Y(t) = ∫_{−∞}^{∞} exp(tx²) (1/√(2π)) exp(−x²/2) dx = (1/√(2π)) ∫_{−∞}^{∞} exp(−(1/2)x²(1 − 2t)) dx
       = (1/(1 − 2t)^{1/2}) (1/√(2π)) ∫_{−∞}^{∞} exp(−y²/2) dy = (1 − 2t)^{−1/2}.

Above, we transformed the integrand, i.e., we set,

y = x√(1 − 2t),   dy = √(1 − 2t) dx,   dx = (1 − 2t)^{−1/2} dy.

So Y ∼ G(α = 1/2, β = 2) ≡ χ²₁.

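A simulation check of Example 1.10.3 (a sketch in R):

z <- rnorm(1e5)
y <- z^2
mean(y); var(y)                 # chi-square(1) has mean 1 and variance 2
ks.test(y, "pchisq", df = 1)    # compares the sample with the chi-square(1) cdf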

Exercises
1. You and a friend play a game in which you each toss a fair coin. If the result is two
tails (TT), you win £1; if the result is two heads (HH), you win £2. If the result is TH
or HT, you lose £1 (i.e. win £−1).

(a) Find the probability distribution of your winnings, X , for a single play of the
game.
(b) Find the mean and variance of X for a single play.

2. Suppose that the continuous random variable X has the probability density func-
tion,

f(x) = cx for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.

(a) Find the value of c that makes f (x) a valid density.


(b) Find the cumulative distribution function F (x).
(c) Use F (x) to find P [1 ≤ X ≤ 2].
(d) Use F (x) to find P [1/2 ≤ X ≤ 1].

3. Suppose that the continuous random variable X has the probability density func-
tion,

f(x) = (3/2)x² + x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

(a) Find the cumulative distribution function F (x).


(b) Find the mean and variance of X .

4. The probability that a patient recovers from a certain disease is 0.8. Suppose that
20 people are known to have contracted the disease.

(a) What is the probability that exactly 14 recover?


(b) What is the probability that at least 10 recover?
(c) What is the probability that at least 14 but not more than 18 recover?

Hint: Requires tables of the Binomial distribution.

5. Customers arrive at a checkout counter in a shop according to a Poisson distribu-


tion at an average of seven per hour. During a given hour, what are the probabili-
ties that,

(a) no more than three customers arrive?


(b) at least two customers arrive?
(c) exactly five customers arrive?


Hint: Requires tables of the Poisson distribution.

6. A salesperson has found that the probability of a sale on a single contact is approx-
imately 0.03. Suppose the salesperson makes 100 contacts:

(a) Find the exact probability that he or she makes at least one sale.
(b) Use the Poisson approximation (P (np)) to the Binomial distribution to esti-
mate the probability of making at least one sale.

Hint: Requires tables of the Poisson distribution.

7. The cycle time for trucks hauling concrete to a building site is uniformly distributed
over the interval 50 to 70 minutes.

(a) What is the probability that the cycle time exceeds 65 minutes if it is known
that it exceeds 55 minutes?
(b) Find the mean cycle time and the variance of the cycle time.

8. The marks of a large population of students are approximately normally distributed


with mean 62 and standard deviation 20.

(a) What is the probability that a student has a mark greater than 80?
(b) A department has to lose the 10% of its students that have the lowest marks.
Where should it set its cut-off mark to achieve this?

Hint: Requires tables of the normal distribution.

9. If X has a geometric distribution with probability of success p, that is,

P[X = k] = p(1 − p)^{k−1},   k = 1, 2, ....

(a) Derive the moment generating function (mgf) for X .


(b) Using this mgf find the mean and variance of the geometric distribution.

10. If the mgf of X is M X (t ), show that the mgf of Y = aX +b is exp(bt )M X (at ). Hence
find the mgf of a Normal distribution with mean µ and variance σ² given that a
standard Normal has the mgf exp(t²/2).

11. If Z has a standard normal distribution show that E [Z k ] = 0 for all odd values of k.
12. If X ∼ N(µ, σ²) find the distribution of Z = (X − µ)/σ.

Chapter 2

Multivariate distributions

In this chapter we deal with more than one random variable, since it is clear that the
world is not one dimensional. To deal with this, we have to define the joint probabil-
ity distribution function for the random vector X = (X 1 , X 2 , . . . , X d ), where each X j , j =
1, . . . , d is a random variable. It is obvious that if each of X j is discrete then X is a discrete
random vector, and if each of X j is continuous then X is a continuous random vector.

2.1 Discrete case


In the discrete case usually all we do is to define the probability of the event {X₁ =
x₁, X₂ = x₂, ..., X_d = x_d}, where x_j is a realization (value) of the discrete rv X_j,

p X (x) = p X 1 ,X 2 ,...,X d (x1 , x2 , . . . , xd ) = Pr(X 1 = x1 , X 2 = x2 , . . . , X d = xd ) = Pr(X = x).

This function is called the joint probability mass function. It is obvious that,
p_X(x) ≥ 0,   and   Σ_{x₁,...,x_d} p_X(x) = 1.

Clearly there are situations where we just need the distribution of X j disregarding
the other random variables. The relevant distributions are known as the marginal pmfs
and are defined as,
Pr(X_j = x_j) = Σ_{x₁} ··· Σ_{x_{j−1}} Σ_{x_{j+1}} ··· Σ_{x_d} Pr(X₁ = x₁, X₂ = x₂, ..., X_d = x_d),

or

p_{X_j}(x_j) = Σ_{x₁} ··· Σ_{x_{j−1}} Σ_{x_{j+1}} ··· Σ_{x_d} p_X(x).

For the bivariate case (d = 2) the marginal pmfs are:


p_{X₁}(x₁) = Pr(X₁ = x₁) = Σ_{x₂} Pr(X₁ = x₁, X₂ = x₂) = Σ_{x₂} p_{X₁,X₂}(x₁, x₂),

and

p_{X₂}(x₂) = Pr(X₂ = x₂) = Σ_{x₁} Pr(X₁ = x₁, X₂ = x₂) = Σ_{x₁} p_{X₁,X₂}(x₁, x₂).

Finally, the joint cumulative distribution function of the d -variate random vector is
defined as below,
F_X(x) = Pr(X₁ ≤ x₁, X₂ ≤ x₂, ..., X_d ≤ x_d) = Σ_{y₁ ≤ x₁} ··· Σ_{y_d ≤ x_d} p_X(y₁, ..., y_d).

The joint cdf F_X has the same properties as the univariate cdf; see Section 1.3. We also
have some obvious conditional pmfs. For the bivariate case,

• The conditional pmf of X₁ given X₂ = x₂ is,

p_{X₁|X₂}(x₁|x₂) = Pr(X₁ = x₁ | X₂ = x₂) = Pr[X₁ = x₁, X₂ = x₂] / Pr[X₂ = x₂].

• The conditional pmf of X₂ given X₁ = x₁ is,

p_{X₂|X₁}(x₂|x₁) = Pr(X₂ = x₂ | X₁ = x₁) = Pr[X₂ = x₂, X₁ = x₁] / Pr[X₁ = x₁].

Example 2.1.1. We roll two dice and then Ω is the set of pairs

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)


(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

The joint pmf of (X 1 , X 2 ) is:

p(x1 , x2 ) = 1/36, x1 , x2 = 1, . . . , 6.

If we take the sum of p(x1 , x2 ) over x2 and x1 , respectively, we derive the marginal pmfs,

p(x1 ) = 1/6, x1 = 1, . . . , 6, and p(x2 ) = 1/6, x2 = 1, . . . , 6.

The joint pmf can be also be represented in a table as presented in the following exam-
ple.

Example 2.1.2. A bivariate pmf is given by the following table,

X 2 \X 1 1 2
1 1/8 1/4
2 1/8 1/2

The marginal distributions of X₁ and X₂ are conveniently the marginal sums of the
distribution table.


X 2 \X 1 1 2 p X 2 (x2 )
1 1/8 1/4 3/8
2 1/8 1/2 5/8
p X 1 (x1 ) 1/4 3/4 1

So for example, p X 2 (2) = Pr[X 2 = 2] = 5/8.


The conditional pmf of X₁ | X₂ = 1 is,

p_{X₁|X₂}(x₁|x₂ = 1) = Pr(X₁ = 1, X₂ = 1)/Pr(X₂ = 1) = (1/8)/(3/8) = 1/3   if x₁ = 1,
p_{X₁|X₂}(x₁|x₂ = 1) = Pr(X₁ = 2, X₂ = 1)/Pr(X₂ = 1) = (1/4)/(3/8) = 2/3   if x₁ = 2.

In the same way the conditional pmf of X₂ | X₁ = 2 is,

p_{X₂|X₁}(x₂|x₁ = 2) = Pr(X₂ = 1, X₁ = 2)/Pr(X₁ = 2) = (1/4)/(3/4) = 1/3   if x₂ = 1,
p_{X₂|X₁}(x₂|x₁ = 2) = Pr(X₂ = 2, X₁ = 2)/Pr(X₁ = 2) = (1/2)/(3/4) = 2/3   if x₂ = 2.
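The calculations of Example 2.1.2 can be reproduced in R by storing the joint pmf as a matrix (a sketch; rows index x₂, columns index x₁):

p <- matrix(c(1/8, 1/4,
              1/8, 1/2), nrow = 2, byrow = TRUE)
colSums(p)               # marginal pmf of X1: 1/4, 3/4
rowSums(p)               # marginal pmf of X2: 3/8, 5/8
p[1, ]/rowSums(p)[1]     # conditional pmf of X1 given X2 = 1: 1/3, 2/3
p[, 2]/colSums(p)[2]     # conditional pmf of X2 given X1 = 2: 1/3, 2/3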

Example 2.1.3. The joint pmf of two random variables X , Y is,


p(x, y) = c(2x + y) for x = 0, 1, 2 and y = 0, 1, 2, 3, and p(x, y) = 0 otherwise.

The task we will deal with is to calculate:

1. The constant c.

2. The probabilities Pr(X = 2, Y = 1), Pr(X ≥ 1, Y ≤ 2).

3. The conditional probabilities Pr(X = x|Y = y), Pr(Y = y|X = x)

We proceed as below:

1. We have P(X = x, Y = y) > 0 for x = 0, 1, 2 and y = 0, 1, 2, 3, so c > 0. Further, Σ_x Σ_y p(x, y) = 1, so

   Σ_{x=0}^{2} Σ_{y=0}^{3} c(2x + y) = 1 ⇒

   c·[(2·0+0)+(2·0+1)+(2·0+2)+(2·0+3)+(2·1+0)+(2·1+1)+(2·1+2)+(2·1+3)+(2·2+0)+(2·2+1)+(2·2+2)+(2·2+3)] = 1

   ⇒ c · 42 = 1 ⇒ c = 1/42.

   To this end,

   p(x, y) = Pr(X = x, Y = y) = (1/42)(2x + y) for x = 0, 1, 2 and y = 0, 1, 2, 3, and 0 elsewhere.


2. (a) P(X = 2, Y = 1) = (1/42)(2·2 + 1) = (1/42)·5 = 5/42.

   (b) P(X ≥ 1, Y ≤ 2) = Σ_{x≥1} Σ_{y≤2} P(X = x, Y = y) = Σ_{x=1}^{2} Σ_{y=0}^{2} P(X = x, Y = y)
       = (1/42)·{((2·1+0) + (2·1+1) + (2·1+2)) + ((2·2+0) + (2·2+1) + (2·2+2))} = (1/42)·24 = 4/7.
3. We know that Pr(X = x|Y = y) = Pr(X = x, Y = y)/Pr(Y = y). So we have to calculate the marginal
   pmf of Y. This is,

   p_Y(y) = Σ_x Pr(X = x, Y = y) = Σ_{x=0}^{2} (1/42)(2x + y) = (1/42){(2·0 + y) + (2·1 + y) + (2·2 + y)} = (6 + 3y)/42 = (1/14)(y + 2).

   Therefore, Pr(X = x|Y = y) = [(1/42)(2x + y)] / [(1/14)(y + 2)] = (1/3)·(2x + y)/(y + 2),   x = 0, 1, 2, y = 0, 1, 2, 3.

   In the same manner, Pr(Y = y|X = x) = Pr(X = x, Y = y)/Pr(X = x), where

   p_X(x) = Σ_y Pr(X = x, Y = y) = Σ_{y=0}^{3} (1/42)(2x + y) = (1/42){(2x + 0) + (2x + 1) + (2x + 2) + (2x + 3)} = (8x + 6)/42.

   So, P(Y = y|X = x) = [(1/42)(2x + y)] / [(1/42)(8x + 6)] = (1/2)·(2x + y)/(4x + 3),   x = 0, 1, 2, y = 0, 1, 2, 3.
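The same computations for Example 2.1.3 can be done in R (a sketch; rows index x and columns index y):

p <- outer(0:2, 0:3, function(x, y) (2*x + y)/42)
sum(p)                    # 1, so c = 1/42 is correct
p[3, 2]                   # P(X = 2, Y = 1) = 5/42
sum(p[2:3, 1:3])          # P(X >= 1, Y <= 2) = 4/7
rowSums(p)                # marginal pmf of X: (8x + 6)/42
colSums(p)                # marginal pmf of Y: (y + 2)/14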

2.2 Continuous case


If there is a non-negative real function f_X(x) such that,

Pr(X₁ ∈ Δ₁, X₂ ∈ Δ₂, ..., X_d ∈ Δ_d) = ∫_{Δ₁} ∫_{Δ₂} ··· ∫_{Δ_d} f_X(x₁, x₂, ..., x_d) dx₁ dx₂ ··· dx_d

for each Δ_i in Ω, then f_X(x) is the joint density of the random vector X. It is obvious that,

∫_{−∞}^{∞} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f_X(x₁, x₂, ..., x_d) dx₁ dx₂ ··· dx_d = 1.

The joint cdf is given as below,

FX (x) = FX (x1 , x2 , . . . , xd ) = Pr(X 1 ≤ x1 , X 2 ≤ x2 , . . . , X d ≤ xd ),

and we have the following relations between the joint cdf and the density and vice
versa:

F_X(x) = F_X(x₁, x₂, ..., x_d) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} ··· ∫_{−∞}^{x_d} f_X(t₁, t₂, ..., t_d) dt₁ dt₂ ··· dt_d,

f_X(x₁, x₂, ..., x_d) = ∂^d F_X(x₁, x₂, ..., x_d) / (∂x₁ ∂x₂ ··· ∂x_d).


It follows (but not quite obviously) that the marginal cdf for the rv X j is given as
below,

F_{X_j}(x_j) = Pr(X_j ≤ x_j) = Pr(X₁ ≤ ∞, ..., X_{j−1} ≤ ∞, X_j ≤ x_j, X_{j+1} ≤ ∞, ..., X_d ≤ ∞) = F_X(∞, ..., ∞, x_j, ∞, ..., ∞).

Similarly the marginal density for the rv X_j is given as below,

f_{X_j}(x_j) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f_X(x₁, x₂, ..., x_d) dx₁ ··· dx_{j−1} dx_{j+1} ··· dx_d.

Let us assume now that the random vector is bivariate, X = (X₁, X₂). Then we have:

• Pr(a₁ < X₁ ≤ b₁, a₂ < X₂ ≤ b₂) = F_X(b₁, b₂) − F_X(b₁, a₂) − F_X(a₁, b₂) + F_X(a₁, a₂).

• F X 1 (x1 ) = FX (x1 , ∞), F X 2 (x2 ) = FX (∞, x2 ).


• f_{X₁}(x₁) = ∫_{−∞}^{∞} f_X(x₁, x₂) dx₂,   f_{X₂}(x₂) = ∫_{−∞}^{∞} f_X(x₁, x₂) dx₁.

• f_X(x₁, x₂) = ∂²F_X(x₁, x₂)/(∂x₁ ∂x₂).

Example 2.2.1. Suppose (X , Y ) have the joint distribution,

f(x, y) = 2x,   0 ≤ x ≤ 1, 0 ≤ y ≤ 1.

Then, the marginal densities are:

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 2x dy = 2x,   0 ≤ x ≤ 1,

f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_0^1 2x dx = 1,   0 ≤ y ≤ 1.

The conditional densities are just:

f(x|y) = f(x, y)/f_Y(y) = 2x/1 = 2x,   0 ≤ x ≤ 1,

f(y|x) = f(x, y)/f_X(x) = 2x/2x = 1,   0 ≤ y ≤ 1.

The joint cdf is:

F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du = ∫_0^x ∫_0^y 2u dv du = x²y,   0 ≤ x ≤ 1, 0 ≤ y ≤ 1.

The marginal cdfs are:

F_X(x) = F(x, ∞) = F(x, 1) = x²,   0 ≤ x ≤ 1,

F_Y(y) = F(∞, y) = F(1, y) = y,   0 ≤ y ≤ 1.


Example 2.2.2. Suppose (X , Y ) have the joint distribution,


f(x, y) = 3x for 0 ≤ y ≤ x ≤ 1, and f(x, y) = 0 otherwise.

Then, the marginal distributions are:

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^x 3x dy = 3x²,   0 ≤ x ≤ 1,

f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_y^1 3x dx = (3/2)(1 − y²),   0 ≤ y ≤ 1.

The conditional distributions are just:

f(x|y) = f(x, y)/f_Y(y) = 3x/[(3/2)(1 − y²)] = 2x/(1 − y²),   0 ≤ y ≤ x ≤ 1,

f(y|x) = f(x, y)/f_X(x) = 3x/3x² = 1/x,   0 ≤ y ≤ x ≤ 1.

Note that f(x, y) is a valid density since,

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫_0^1 ∫_y^1 3x dx dy = ∫_0^1 ∫_0^x 3x dy dx = 1.

The secret here is to sketch the area over which X and Y are defined. This is called the
region of support. We will focus on this in the class.
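A numerical check of Example 2.2.2 in R (a sketch using one-dimensional integration, iterated over the triangular support):

f_X <- function(x) 3*x^2
f_Y <- function(y) 1.5*(1 - y^2)
integrate(f_X, 0, 1)$value        # 1: the marginal of X is a valid density
integrate(f_Y, 0, 1)$value        # 1: the marginal of Y is a valid density
inner <- function(x) integrate(function(y) rep(3*x, length(y)), 0, x)$value
integrate(Vectorize(inner), 0, 1)$value   # total mass of f(x, y) = 3x over 0 <= y <= x <= 1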

2.3 Expected values


As for the single variable case we can define expected values. The expected value of a
function h(X) of a random vector X is defined similarly as below,
E[h(X)] = Σ_{x₁} ··· Σ_{x_d} h(x) p(x) if X is a discrete random vector, and E[h(X)] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} h(x) f(x) dx if X is a continuous random vector.

For example, for the bivariate case, suppose we are interested in some function of the
random variables X and Y, say h(X, Y). Then the expected value of h(X, Y) is defined
as,

E[h(X, Y)] = Σ_x Σ_y h(x, y) p(x, y) if (X, Y) are discrete random variables, and E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy if (X, Y) are continuous random variables.

Note that,

E(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy = ∫_{−∞}^{∞} x f_X(x) dx,   E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy = ∫_{−∞}^{∞} y f_Y(y) dy,

if X, Y are continuous, and,

E(X) = Σ_x Σ_y x p(x, y) = Σ_x x p_X(x),   E(Y) = Σ_x Σ_y y p(x, y) = Σ_y y p_Y(y),

if X, Y are discrete.
It is useful to remember (it is also easy to prove) that for random variables X and Y
and constants a and b we have the following properties:

• E(aX + b) = aE(X) + b

• E(a h(X) + b g(Y)) = aE(h(X)) + bE(g(Y))

• Var(aX + b) = a² Var(X)

• Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab cov(X, Y)

where cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − E(X)E(Y) is the covariance term.

2.3.1 Covariance
The covariance measures the association/dependence between the variables. From the
properties of expected values it is obvious that,

Cov(X , Y ) = Cov(Y , X ), and Cov(X , X ) = Var(X ).

It is more common to use the scaled variant called the correlation ρ. This is defined
as,
ρ = cov(X, Y)/√(var(X) var(Y)) = cov(X, Y)/(σ_X σ_Y).
The correlation is popular because of the properties below, but it must be handled with care
(it can measure only linear association). In
outline:

• |ρ| ≤ 1.

• If Y = aX + b then |ρ| = 1 and the sign is the sign of a.

• If X and Y are independent then ρ = 0 (but ρ = 0 does not imply independence).

It is important to remember that ρ does not tell you anything about causality.

Example 2.3.1. The rvs X , Y have the following joint density,


f(x, y) = (3/5)x(x + y) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and f(x, y) = 0 otherwise.
We have to calculate:

1. E (X ), E (Y ).


2. f (x|y), f (y|x).

3. cov(X , Y ).

These can be solved as below:


1. E(X) = ∫_{−∞}^{+∞} x f_X(x) dx, where the marginal density is,

   f_X(x) = ∫_{−∞}^{+∞} f(x, y) dy = ∫_0^2 (3/5)(x² + xy) dy = (3/5)(x²[y]₀² + x[y²/2]₀²) = (3/5)(2x² + 2x) = (6/5)(x² + x).

   So,

   E(X) = ∫_0^1 x·(6/5)(x² + x) dx = (6/5)∫_0^1 (x³ + x²) dx = (6/5)([x⁴/4]₀¹ + [x³/3]₀¹) = (6/5)(1/4 + 1/3) = (6/5)·(7/12) = 7/10.

   Similarly E(Y) = ∫_{−∞}^{+∞} y f_Y(y) dy, where

   f_Y(y) = ∫_{−∞}^{+∞} f(x, y) dx = ∫_0^1 (3/5)(x² + xy) dx = (3/5)([x³/3]₀¹ + y[x²/2]₀¹) = (3/5)(1/3 + y/2) = (3/5)·(2 + 3y)/6 = (1/10)(2 + 3y).

   So,

   E(Y) = ∫_0^2 y·(1/10)(2 + 3y) dy = (1/10)∫_0^2 (2y + 3y²) dy = (1/10)(2[y²/2]₀² + 3[y³/3]₀²) = (1/10)(4 + 8) = 12/10 = 6/5.

2. f(x|y) = f(x, y)/f_Y(y) = [(3/5)x(x + y)] / [(1/10)(2 + 3y)] = 6x(x + y)/(2 + 3y),   0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and

   f(y|x) = f(x, y)/f_X(x) = [(3/5)x(x + y)] / [(6/5)x(x + 1)] = (x + y)/(2(x + 1)),   0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

3. We know that Cov(X, Y) = E(XY) − E(X)E(Y), where

   E(XY) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy f(x, y) dx dy = ∫_0^2 ∫_0^1 xy·(3/5)x(x + y) dx dy
         = (3/5) ∫_0^2 ∫_0^1 (x³y + x²y²) dx dy = (3/5) ∫_0^2 (y[x⁴/4]₀¹ + y²[x³/3]₀¹) dy
         = (3/5) ∫_0^2 (y/4 + y²/3) dy = (3/5)((1/4)[y²/2]₀² + (1/3)[y³/3]₀²) = (3/5)(1/2 + 8/9) = (3/5)·(25/18) = 5/6.

   So,

   Cov(X, Y) = E(XY) − E(X)E(Y) = 5/6 − (7/10)·(6/5) = 5/6 − 21/25 = −1/150.

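A crude grid (Riemann sum) check of Example 2.3.1 in R (a sketch):

x <- seq(0, 1, length.out = 401); y <- seq(0, 2, length.out = 801)
dx <- x[2] - x[1]; dy <- y[2] - y[1]
f <- outer(x, y, function(x, y) 0.6*x*(x + y))
sum(f)*dx*dy                                           # approximately 1
EX  <- sum(outer(x, y, function(x, y) x)   * f)*dx*dy  # approximately 7/10
EY  <- sum(outer(x, y, function(x, y) y)   * f)*dx*dy  # approximately 6/5
EXY <- sum(outer(x, y, function(x, y) x*y) * f)*dx*dy
EXY - EX*EY                                            # approximately -1/150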

2.4 Independence
The random variables X 1 , X 2 , . . . X d are independent if and only if,

F X 1 ,...,X d (x1 , . . . , xd ) = F X 1 (x1 ) · · · F X d (xd ),

for each (x1 , . . . , xd ) ∈ R d . One can also use densities/pmfs instead of cdfs to show that
the random variables X 1 , X 2 , . . . X d are independent:

• For discrete variables iff,

p X 1 ,...,X d (x1 , . . . , xd ) = p X 1 (x1 ) · · · p X d (xd ).

• For continuous variables iff,

f X 1 ,...,X d (x1 , . . . , xd ) = f X 1 (x1 ) · · · f X d (xd ).

If two random variables are independent we have the following properties:


• F_{X₁|X₂}(x₁|x₂) = F_{X₁,X₂}(x₁, x₂)/F_{X₂}(x₂) = F_{X₁}(x₁)F_{X₂}(x₂)/F_{X₂}(x₂) = F_{X₁}(x₁).

• p_{X₁|X₂}(x₁|x₂) = p_{X₁,X₂}(x₁, x₂)/p_{X₂}(x₂) = p_{X₁}(x₁)p_{X₂}(x₂)/p_{X₂}(x₂) = p_{X₁}(x₁).

• f_{X₁|X₂}(x₁|x₂) = f_{X₁,X₂}(x₁, x₂)/f_{X₂}(x₂) = f_{X₁}(x₁)f_{X₂}(x₂)/f_{X₂}(x₂) = f_{X₁}(x₁).

• E(XY) = E(X)E(Y), so cov(X, Y) = 0.

Example 2.4.1. The rvs X , Y have the following joint density,

f(x, y) = 2e^{−(x+2y)} for x > 0, y > 0, and f(x, y) = 0 otherwise.

We have to find:

1. f X (x), f Y (y).

2. Are X and Y independent?

3. f (x|y), f (y|x).

4. E (X |Y ), E (Y |X ).

These can be solved as below:

1. The marginal density of X is,

   f_X(x) = ∫_{−∞}^{+∞} f(x, y) dy = ∫_0^∞ 2e^{−(x+2y)} dy = 2e^{−x} ∫_0^∞ e^{−2y} dy = −e^{−x}[e^{−2y}]₀^∞ = −e^{−x}(0 − 1) = e^{−x},   x > 0.

   So, f_X(x) = e^{−x} for x > 0, and 0 otherwise.

   The marginal density of Y is,

   f_Y(y) = ∫_{−∞}^{+∞} f(x, y) dx = ∫_0^∞ 2e^{−(x+2y)} dx = 2e^{−2y} ∫_0^∞ e^{−x} dx = −2e^{−2y}[e^{−x}]₀^∞ = −2e^{−2y}(0 − 1) = 2e^{−2y},   y > 0.

   So, f_Y(y) = 2e^{−2y} for y > 0, and 0 otherwise.

2. X , Y are independent iff

f (x, y) = f X (x) f Y (y), for each (x, y).

We have that,
f_X(x) f_Y(y) = e^{−x}·2e^{−2y} = 2e^{−(x+2y)} for x > 0, y > 0, and 0 otherwise.
So, f (x, y) = f X (x) f Y (y) for each (x, y), and therefore X and Y are independent.
3. f(x|y) = f(x, y)/f_Y(y) = 2e^{−(x+2y)}/2e^{−2y} = e^{−x} for x > 0 and y > 0. So,

   f(x|y) = e^{−x} for x > 0, y > 0, and 0 otherwise.

   f(y|x) = f(x, y)/f_X(x) = 2e^{−(x+2y)}/e^{−x} = 2e^{−2y} for x > 0 and y > 0. So,

   f(y|x) = 2e^{−2y} for y > 0, x > 0, and 0 otherwise.

4. Integrating by parts,

   E(X|Y) = ∫_{−∞}^{∞} x f(x|y) dx = ∫_0^∞ x e^{−x} dx = [−xe^{−x}]₀^∞ + ∫_0^∞ e^{−x} dx = 0 + 1 = 1,

   E(Y|X) = ∫_{−∞}^{∞} y f(y|x) dy = ∫_0^∞ 2y e^{−2y} dy = [−ye^{−2y}]₀^∞ + ∫_0^∞ e^{−2y} dy = 0 + 1/2 = 1/2.

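A simulation check of Example 2.4.1 (a sketch in R; the marginals found above are Exp(1) and Exp(2)):

x <- rexp(1e5, rate = 1); y <- rexp(1e5, rate = 2)
mean(x); mean(y)     # approximately 1 and 1/2, matching E(X|Y) and E(Y|X)
cov(x, y)            # approximately 0, as expected for independent variables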

2.5 Expectation for sums of random variables


Given a set of random variables X 1 , X 2 , · · · , X d it is not difficult to show that

E[Σ_{j=1}^{d} X_j] = Σ_{j=1}^{d} E[X_j].     (2.1)

Indeed we can easily extend this to,

E[Σ_{j=1}^{d} h_j(X_j)] = Σ_{j=1}^{d} E[h_j(X_j)].

Suppose now we ask about the variance of the sum X₁ + X₂ + ··· + X_d. Surprisingly we have a simple
solution, at least in the independent case,

var(Σ_{j=1}^{d} X_j) = Σ_{j=1}^{d} var(X_j).

Recall for two variables,

E [X + Y ] = E [X ] + E [Y ] = µ X + µY .

So,
var(X + Y) = E[(X + Y − µ_X − µ_Y)²] = E[((X − µ_X) + (Y − µ_Y))²]

= E[(X − µ_X)²] + E[(Y − µ_Y)²] + 2E[(X − µ_X)(Y − µ_Y)]

= var(X) + var(Y) + 2cov(X, Y), which reduces to var(X) + var(Y) when X and Y are independent (since then cov(X, Y) = 0).

So if we now take d variables in the same way we have,


à !2
d
hX i d
X d
³X ´ h Xd i
E Xj = µ j so var Xj = E (X j − µ j )
j =1 j =1 j =1 j =1

d
hX d X
d i
(X j − µ j )2 + 2
X
=E (X j − µ j )(X k − µk ) ( j < k)
j =1 j =1 k=1

d
X d X
X d
= var(X j ) + cov(X i X j ) ( j 6= k).
j =1 j =1 k=1

When we have independence life is much simpler so we have,

d
³X ´ Xd
var Xj = var(X j ).
j =1 j =1


Example 2.5.1. Consider the independent rvs X₁, X₂, ..., X_d with E(X_j) = µ, Var(X_j) = σ², j = 1, ..., d. Show that E(X̄) = µ and Var(X̄) = σ²/d, where X̄ = (1/d) Σ_{j=1}^{d} X_j:

E[X̄] = (1/d) E[Σ_{j=1}^{d} X_j] = (1/d) Σ_{j=1}^{d} E[X_j] = (1/d) Σ_{j=1}^{d} µ = µ,

and

var(X̄) = var((1/d) Σ_{j=1}^{d} X_j) = (1/d²) Σ_{j=1}^{d} var(X_j) = dσ²/d² = σ²/d,

since the X_j are independent.
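A simulation check of Example 2.5.1 (a sketch in R; µ, σ² and d are arbitrary illustrative values):

d <- 25; mu <- 0; sigma2 <- 4
xbar <- replicate(10000, mean(rnorm(d, mean = mu, sd = sqrt(sigma2))))
mean(xbar); var(xbar)      # approximately mu and sigma2/d = 0.16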

2.6 Distribution of sums of independent random variables


2.6.1 Moment Generating Functions and sums
The most useful application of mgfs is when we need the distribution of sums of independent random variables.
Suppose we have X and Y which are independent variables. Then the mgf of Z = X + Y
is just M_Z(t) = E[exp{t(X + Y)}] which, from independence, is

M_Z(t) = E[exp(tX)]E[exp(tY)] = M_X(t)M_Y(t).

So the mgf is the product of the mgfs of the terms in the sum. This result extends to d
variables very simply. The mgf of Z = Σ_{j=1}^{d} X_j is just the product of the individual mgfs.
Thus,

M_Z(t) = Π_{j=1}^{d} M_{X_j}(t).

Example 2.6.1. Suppose that P [X = 1] = p and P [X = 0] = 1−p = q (i.e., X ∼ Bernoulli(p)).


Then we can compute the mgf as,

M(t ) = p exp(t ) + q.

Now assume that we have X₁, X₂, ..., X_d which are all independent and identically distributed with this distribution. The mgf of the sum Z = Σ_{j=1}^{d} X_j is just the product of the individual mgfs,

M_Z(t) = Π_{j=1}^{d} (p exp(t) + q) = (p exp(t) + q)^d.

Looking up the list of mgfs in Statistical Tables we find that Z ∼ B(d , p) (Binomial).

Example 2.6.2. Suppose we have X₁, X₂, ..., X_d which are all independent and distributed with Poisson distributions, say,

p[X_j = x] = λ_j^x exp(−λ_j)/x!,   x = 0, 1, 2, ....

Then the individual mgfs are,

M_{X_j}(t) = exp(λ_j(e^t − 1)).

So the mgf of the sum Z = Σ_{j=1}^{d} X_j is,

M_Z(t) = Π_{j=1}^{d} exp(λ_j(e^t − 1)) = exp(Λ(e^t − 1)),     (2.2)

where Λ = Σ_{j=1}^{d} λ_j. From the form of the mgf in (2.2) we can deduce that Z ∼ Poisson(Λ).

Example 2.6.3. Suppose we have X₁, X₂, ..., X_d which are all independent and distributed with Normal distributions, say

f(x_j) = (1/√(2πσ_j²)) exp{−(x_j − µ_j)²/(2σ_j²)};

then the individual mgfs are

M_{X_j}(t) = exp(µ_j t + (1/2)σ_j² t²).

So the mgf of the sum Z = Σ_{j=1}^{d} X_j is

M_Z(t) = Π_{j=1}^{d} exp(µ_j t + (1/2)σ_j² t²) = exp(At + (1/2)Bt²),     (2.3)

where A = Σ_{j=1}^{d} µ_j and B = Σ_{j=1}^{d} σ_j². From the form of the mgf in (2.3) we can deduce that
Z ∼ N(A, B).

We have managed to show that the Normal and the Poisson have reproductive prop-
erties i.e., sums of independent Poissons are Poisson and sums of independent normals
are normal. We can show using mgfs as above that,

• The sum of independent Normals is Normal.

• The sum of independent Poissons is Poisson.

• The sum of independent Chi-squared is Chi-squared.

• The sum of independent Gammas is Gamma.

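A simulation check of the reproductive property for the Poisson (a sketch in R):

z <- rpois(1e5, lambda = 2) + rpois(1e5, lambda = 3)
mean(z); var(z)                                  # both approximately 5 = 2 + 3
round(table(factor(z, levels = 0:5))/1e5, 3)     # empirical probabilities for z = 0,...,5
round(dpois(0:5, lambda = 5), 3)                 # theoretical Poisson(5) probabilities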

2.6.2 Central limit theorem


In some special cases, as we have seen in the preceding subsection, the distribution of
the sum of independent random variables S_d = Σ_{j=1}^{d} X_j is known, for example

• If the X_j are normal then their sum is normal. In fact E[S_d] = Σ_{j=1}^{d} E[X_j] and var[S_d] = Σ_{j=1}^{d} var[X_j].

• If the X_j are Poisson then their sum is Poisson. The parameter of the sum is the
sum of the parameters of the components, since E[S_d] = Σ_{j=1}^{d} E[X_j].

Often, however, the component variables X_j are not Normal or Poisson, etc., but we
would still like the distribution of S_d = Σ_{j=1}^{d} X_j. Finding the exact distribution is a hard
problem, but we have some useful results, in particular the Central Limit Theorem.
Suppose X 1 , X 2 , · · · , X d are independent and identically distributed with common
mean µ and variance σ². Define,

Z = (X̄ − µ)/(σ/√d).     (2.4)

Then the distribution of Z approaches that of a standard Normal variable for large d.
Informally the following are equivalent statements,

• Z is approximately N(0, 1).

• X̄ is approximately N(µ, σ²/d).

• S_d is approximately N(dµ, dσ²).

Example 2.6.4. Suppose X₁, X₂, ..., X_d are independent and identically distributed with an Exp(2) distribution. Let S₁₀₀ = Σ_{j=1}^{100} X_j and we seek the probability,

P[52 < S₁₀₀ < 58].

For the exponential distribution the mean is 1/λ = 1/2 and the variance 1/λ² = 1/4 - look
it up. From the central limit theorem we know that S₁₀₀ is approximately N(50, 25), so

P[52 < S₁₀₀ < 58] = P[2/5 < Z = (S₁₀₀ − 50)/5 < 8/5] = Φ(8/5) − Φ(2/5) = 0.2898.

Similarly, since Z = (X̄ − µ)/(σ/√100) is N(0, 1),

P[52 < S₁₀₀ < 58] = P[(0.52 − 1/2)/(1/(2√100)) < Z = (S₁₀₀/100 − 1/2)/(1/(2√100)) < (0.58 − 1/2)/(1/(2√100))]

= Φ(8/5) − Φ(2/5) = 0.2898.


Normal approximation to the Binomial

Suppose X 1 , X 2 , · · · , X d are iid and,

P [X j = 1] = p while P [X j = 0] = 1 − p.

Then
E[X_j] = p and var(X_j) = p(1 − p).

Using the central limit theorem we know that S_d = Σ_{j=1}^{d} X_j is approximately N(dp, dp(1 − p)). We have
proved that a Binomial variable B(d, p) can be approximated by a normal variable with
mean dp and variance dp(1 − p) for large d. This is known as the normal approximation
to the Binomial. You may find it easier in some cases to work with the fraction of ones,

p̂ = S_d/d.

In this case

z = (p̂ − p)/√(p(1 − p)/d)

is approximately standard normal.

Example 2.6.5. Suppose X is the number of 6's in 40 rolls of a die. Appealing to the CLT, approximate X with N(40/6, 40·(1/6)·(5/6)). Then

P[X < 5] = P[z < (5 − 20/3)/√(50/9)] = Φ(−0.7071) = 0.2398.

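Example 2.6.5 can be compared with the exact Binomial probability in R (a sketch):

pnorm((5 - 40/6)/sqrt(40*(1/6)*(5/6)))   # normal approximation: about 0.2398
pbinom(4, size = 40, prob = 1/6)         # exact P[X < 5] = P[X <= 4]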

Exercises
1. Let X and Y have the joint distribution given by the table of probabilities below

Y
X 1 2 3
1 0.1 0 0.1
2 0 0.2 0.2
3 0.1 0.2 0.1

(a) Find the marginal distributions.


(b) Find the conditional distribution of Y given X = 2.
(c) Are X and Y independent?

2. Let X and Y have the joint density function given by,


f(x, y) = k(1 − y) for 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.

(a) Find the value of k that makes this a valid density .


(b) Find Pr[X ≤ 3/4; Y ≥ 1/2].

3. Suppose that the random variables X and Y have the following joint density,
f(x, y) = k for 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 2y ≤ x, and f(x, y) = 0 otherwise.

(a) Sketch the area over which X and Y are defined (the region of support).
(b) Find the value of k that makes this a valid density.
(c) Find Pr[X ≥ 3Y ].
(d) Find cov(X , Y ).

4. It can be shown that


f(x, y) = 4xy for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise,

is a valid density function.

(a) Find the marginal density functions of X and Y .


(b) Find the conditional density function of X given Y = y.
(c) Find the conditional density function of Y given X = x.
(d) Find cov(X , Y )


5. X has a binomial distribution with probability of success p in n trials and Y has


a binomial distribution with probability of success p in m trials. If X and Y are
independent, then use mgfs to find the probability density of Z = X + Y .

6. Let X 1 , X 2 , · · · , X d be independent random variables such that each X j has a gamma


distribution with parameters α_j and β_j. Find the distribution of Z = Σ_{j=1}^d X_j if β_j = β, j = 1, . . . , d.

7. A soft drink vending machine is set so that the amount of drink dispensed is a random variable with mean 200 ml and a standard deviation of 15 ml. If a sample of 36 drinks is taken, what is the probability that the mean amount dispensed will be at least 204 ml? Find the value M for which the probability that the sample mean is less than M is 0.1.

8. A die is rolled 100 times. What is the probability that the number of sixes observed
is,

(a) More than 20?


(b) Fewer than 7?
(c) Between 10 and 20?

Chapter 3

Methods of estimation

3.1 Populations and samples


The traditional goal of statistics involves the making of inferences about a population.
Thus for example we might be interested in the number of people in East Anglia who
have mpeg players. There may be difficulties in defining a population (and tracking
down the members), but, if we can, the obvious approach is to examine every member
of the population. This is known as a census.
If, as is usual, we cannot examine the whole population we look at a subset of the
population. We then examine the subset and use the observation to make inferences
about the whole population. The subset is called a sample and much of statistics is
about the choice of samples and the quantifying of the inevitable errors that occur when
inferences are made.
Choosing samples from populations is a complex process; the unifying principle is that the samples should be random. How does one arrange this? The answer depends on the situation, but we will take as a principle:

"every member of the population is equally likely to be chosen"

3.1.1 Sampling distributions


Clearly in practice we need some random mechanism to select members of the population. Typically random numbers are produced either by a physical random mechanism or by algorithms for pseudo-random numbers. In R we can use
runif(n, min=0, max=1)

> runif(20,0,1)
[1] 0.566791256 0.041486462 0.549447211 0.214432339 0.637779016 0.899486283
[7] 0.054652610 0.557283329 0.170380431 0.079230045 0.914602292 0.764830734
[13] 0.009130727 0.764550706 0.344527487 0.730142699 0.483321483 0.784094527
[19] 0.226894059 0.077518982


Before universal computing people used tables of random numbers, see for example the
Statistical tables.
The ability to generate uniform numbers also gives us the chance to generate sam-
ples from other distributions. We need the following lemma:

Lemma 3.1.1. If X has a continuous cdf F then F(X) ∼ U(0, 1). Moreover, by the inversion method, X = F⁻¹(U) is a sample from F, where U ∼ U(0, 1).

Example 3.1.1. Suppose X ∼ E xp(λ), that is,

f (x) = λ exp(−λx), F (x) = 1 − exp(−λx), x ≥ 0.

Then, since U = F(X) = 1 − exp(−λX), we can see that X = −(1/λ) log(1 − U) is a sample from an Exponential distribution with parameter λ. Note that U is a sample from the standard uniform U(0, 1).
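A short R sketch of this inversion step (the rate λ = 2 below is an arbitrary illustrative choice):

lambda <- 2                     # illustrative rate
u <- runif(10000)               # standard uniform sample
x <- -log(1 - u)/lambda         # X = F^{-1}(U)
c(mean(x), var(x))              # should be close to 1/lambda and 1/lambda^2
# compare with R's built-in generator: rexp(10000, rate = lambda)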

3.2 Estimation
Suppose we are given a sample x1 , x2 , · · · , xn . We can think of our sample as being a
realization (observation) from a set of random variables X 1 , X 2 , · · · , X n whose joint dis-
tribution is f (x1 , x2 , · · · , xn ; θ). The question we ask ourselves is: What can we say about
the underlying distribution f (x1 , x2 , · · · , xn )? We will make a crucial assumption that the
sample is random. By this we mean that the joint density can be written as,
f (x1, x2, · · · , xn; θ) = Π_{i=1}^n f (xi; θ) = f (x1; θ) f (x2; θ) · · · f (xn; θ).

We say that the variables are independent and identically distributed (iid) and that we
have a sample from the distribution f (x; θ). In much of what we will discuss we will
make a second major assumption. This is that we know the form of the distribution
f (x; θ) except for some parameters θ. In such cases our interest is to estimate the pa-
rameters θ.

3.3 Properties of estimators


We do not want to get into too much detail here, as the subject can get complex. Usually
we prefer unbiased estimators and insist on consistent ones. If we have two (unbiased)
estimators we would prefer the one with smallest variance. In what follows we define
bias and consistency.

3.3.1 Bias
Suppose we have an estimator T of some parameter θ. We define the mean square error
(MSE) as,
E [(T − θ)2 ]. (3.1)


We would like to minimize this error so using a little algebra we have,

E [(T − θ)2 ] = E [((T − E [T ]) + (E [T ] − θ))2 ] = E [(T − E [T ])2 ] + (E [T ] − θ)2

or
var(T ) + b(θ)2 ,
where b(θ) = E [T ] − θ is called the bias. One possibility is to seek unbiased estimates,
that is estimates where b(θ) = E [T ] − θ = 0 or equivalently E [T ] = θ.
Example 3.3.1. So if we have a random sample from a distribution with mean µ we know
that E [ X̄ ] = µ so X̄ is an unbiased estimate of µ.
Example 3.3.2. Suppose X_1, X_2, · · · , X_n is a random sample from a distribution with mean µ and variance σ². The unbiased estimator of σ² is s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
We first show that,

Σ_{i=1}^n (X_i − µ)² = Σ_{i=1}^n [(X_i − X̄) + (X̄ − µ)]²
                     = Σ_{i=1}^n [(X_i − X̄)² + 2(X_i − X̄)(X̄ − µ) + (X̄ − µ)²]
                     = Σ_{i=1}^n (X_i − X̄)² + 2(X̄ − µ) Σ_{i=1}^n (X_i − X̄) + Σ_{i=1}^n (X̄ − µ)²
                     = Σ_{i=1}^n (X_i − X̄)² + 0 + n(X̄ − µ)².

So,

Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)².

Then the expected value is,

E[Σ_{i=1}^n (X_i − X̄)²] = E[Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²]
                        = E[Σ_{i=1}^n (X_i − µ)²] − n E[(X̄ − µ)²]
                        = Σ_{i=1}^n E[(X_i − µ)²] − n E[(X̄ − µ)²]
                        = Σ_{i=1}^n E[X_i² − 2µX_i + µ²] − n E[X̄² − 2µX̄ + µ²]
                        = Σ_{i=1}^n [E(X_i²) − 2µE(X_i) + µ²] − n [E(X̄²) − 2µE(X̄) + µ²]
                        = Σ_{i=1}^n [σ² + µ² − 2µ² + µ²] − n [σ²/n + µ² − 2µ² + µ²]
                        = (n − 1)σ².


Note that E(X_i²) = Var(X_i) + [E(X_i)]² = σ² + µ² and E(X̄²) = Var(X̄) + [E(X̄)]² = σ²/n + µ².

It follows that s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² is an unbiased estimate of σ². Of course when n is large it makes little practical difference if the divisor is n or n − 1.
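The following simulation sketch (illustrative values n = 10 and σ² = 4; not part of the original notes) shows the unbiasedness of s² and the downward bias of the divisor-n estimator:

set.seed(1)                                  # for reproducibility
n <- 10; sigma2 <- 4                         # hypothetical sample size and true variance
reps <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(s2 = var(x),                             # divisor n - 1 (R's default)
    s2.biased = (n - 1)/n * var(x))          # divisor n
})
rowMeans(reps)    # close to 4 and to (n - 1)/n * 4 = 3.6 respectively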

3.3.2 Consistency
If we take large samples we expect to have more information than with a small sample, and our estimators should improve in some sense. We formalize this using the idea of a
consistent estimate. We say that an estimate T of θ is consistent if

P[|T − θ| > ε] → 0 as the sample size n → ∞.

For unbiased estimates we can show that a sufficient condition is that,

var(T ) → 0 as the sample size n → ∞.

Example 3.3.3. Since, for a random sample, X̄ is unbiased for µ and var(X̄) = σ²/n → 0 as n → ∞, X̄ is consistent.

We have defined some desirable properties of estimators such as consistency. How-


ever, we have not seen any general purpose method for obtaining good estimators. The
method of moments estimator and the maximum likelihood estimator are two such general-purpose methods. They generally yield consistent estimators and are usually straight-
forward to calculate.

3.4 Method of Moments


The motivation behind the method of moments is that if θ̂ = (θ̂1, . . . , θ̂k) is a good estimator, the distribution indexed by θ̂ should be similar to the distribution indexed by the true θ, where similarity is judged by equality of moments. However, we do not know the moments of the distribution that corresponds to θ, since we do not know the value of θ. For this reason we approximate each theoretical moment by the corresponding sample moment, i.e. the moment computed from the given sample (which was generated from θ). In other words, we choose θ such that,
µ_r(θ) = m_r = (1/n) Σ_{i=1}^n x_i^r ,   r = 1, . . . , k,
where obviously µr and m r are the theoretical and sample moments, respectively. So in
case that θ is a vector of k components, i.e., we have k-unknown parameters, we need
to solve k equations. Therefore we require the equality of the first k moments.

Example 3.4.1. Suppose we have the following sample x1 , x2 , · · · , x20 from an exponential
distribution with parameter λ, i.e., X i ∼ E xp(λ), i = 1, . . . , 20,

0.043 1.200 0.219 1.883 1.230 0.349 0.592 0.433 0.858 0.394
0.001 0.376 2.560 0.170 0.906 0.302 0.104 0.247 0.327 0.844


The sample mean is X̄ =0.6519 so µ̂ = 1/λ̂ = 0.6519 and we deduce that λ̂ = 1.5340.
Example 3.4.2. Suppose we have the following sample x1 , x2 , · · · , x20 from a normal dis-
tribution, i.e., X i ∼ N (µ, σ2 ), i = 1, . . . , 20,
4.948 20.498 19.985 -8.206 13.288 12.563 3.162 5.753
0.276 9.508 9.101 14.090 -19.981 27.302 19.962 6.960
-13.619 15.464 6.813 13.347
with sample mean X̄ =8.0606 and sample standard deviation s = 11.7025, we deduce, µ̂ =
8.0606 and σ̂ = 11.7025.
Example 3.4.3. Suppose we have the following sample x_1, x_2, · · · , x_25 from a uniform distribution, i.e., X_i ∼ U(α, β), i = 1, . . . , 25,

3.204 2.241 5.554 4.863 2.168 3.101 4.557 5.337 5.265 3.933 2.953
4.836 2.412 5.432 4.733 2.074 2.742 5.864 4.825 5.934 3.500 3.496
5.358 4.889 5.633

with sample mean X̄ = 4.1962 and sample standard deviation s = 1.2860. Now µ̂ = (α̂ + β̂)/2 = 4.1962 and σ̂² = (β̂ − α̂)²/12 = 1.6537, and solving for α̂ and β̂ gives α̂ = X̄ − √3 σ̂ = 1.969 and β̂ = X̄ + √3 σ̂ = 6.424.
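A sketch in R of the same method-of-moments calculation, applied here to a simulated (hypothetical) U(2, 6) sample rather than the data above:

set.seed(2)
x <- runif(25, min = 2, max = 6)     # hypothetical data for illustration
xbar <- mean(x); s <- sd(x)
alpha.hat <- xbar - sqrt(3) * s      # from (alpha + beta)/2 = xbar
beta.hat  <- xbar + sqrt(3) * s      # and (beta - alpha)^2/12 = s^2
c(alpha.hat, beta.hat)               # should be near the true values 2 and 6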
The method of moments is the oldest method of deriving point estimators. It almost
always produces some asymptotically unbiased estimators, although they may not be
the best estimators.

3.5 Maximum likelihood


3.5.1 Likelihood function
Suppose we have a sample x1, . . . , xn with density f (x1, . . . , xn; θ1, θ2, . . . , θk). Since we
have already observed the data we can think of the density as a function of the param-
eters rather than the sample values x1 , . . . , xn . The resulting function is then called the
likelihood function,
L(θ1, θ2, . . . , θk) = f (x1, . . . , xn; θ1, θ2, . . . , θk) = Π_{i=1}^n f (xi; θ1, θ2, . . . , θk).

Example 3.5.1. If we toss a coin 3 times then the probability distribution of the number
of heads X is,

Pr[X = x] = C(3, x) θ^x (1 − θ)^(3−x),   x = 0, 1, 2, 3,

where θ = Pr[head] and C(3, x) denotes the binomial coefficient. If we observe 2 heads then x = 2 and the likelihood is

L(θ) = C(3, 2) θ²(1 − θ) = 3θ²(1 − θ).

This is a function of the (unknown) parameter θ.


Example 3.5.2. Suppose we have a random sample x1 , x2 , · · · , xn from the exponential


distribution with parameter θ,
f (x) = θ exp(−θx).

Then the likelihood is,

L(θ) = Π_{i=1}^n θ exp(−θxi)
     = θ exp(−θx1) θ exp(−θx2) θ exp(−θx3) · · · θ exp(−θxn)
     = θ^n exp(−θ Σ_{i=1}^n xi).

Example 3.5.3. For a random sample x1 , x2 , · · · , xn from a Normal distribution with mean
µ and variance σ2 the likelihood is,
L(µ, σ²) = Π_{i=1}^n (2πσ²)^{−1/2} exp{−(xi − µ)²/(2σ²)} = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}.

3.5.2 Maximum Likelihood Estimators (MLE’s)


If we toss a coin twice and observe one head and one tail then if the probability of a head
is θ we have L(θ) = 2θ(1 − θ). It makes sense to make the likelihood as large as possible
since we have observed one H and one T. To do so we must choose the parameter θ to
be 0.5. Estimates which are chosen in this way to maximize the likelihood are called
maximum likelihood estimators and have many nice properties.
Most of the time we will maximize the log-likelihood,

ℓ(θ1 , . . . , θk ) = log L(θ1 , . . . , θk ),

as it is simpler to do and the log-likelihood has the maximum in the same place as the
likelihood function itself. Below we summarize the steps to obtain the maximum like-
lihood estimates (MLE’s) of the parameters θ1 , . . . , θk . We will denote the MLE’s with
θ̂1 , . . . , θ̂k :

1. Write down the likelihood function,

L(θ1, θ2, . . . , θk) = Π_{i=1}^n f (xi; θ1, θ2, . . . , θk).

2. Calculate the log-likelihood,

ℓ(θ1 , . . . , θk ) = log L(θ1 , . . . , θk ).


3. Calculate the derivatives of ℓ with respect to the parameters and equate them to 0,

   ∂ℓ(θ1, θ2, . . . , θk)/∂θ1 = 0
   ∂ℓ(θ1, θ2, . . . , θk)/∂θ2 = 0
   ...
   ∂ℓ(θ1, θ2, . . . , θk)/∂θk = 0.     (3.2)

4. Solve the score equations in (3.2) to obtain the MLE’s, θ̂1 , . . . , θ̂k .

Example 3.5.4. Suppose that you have one observation x from the B(n, θ) distribution.
The above steps to derive the MLE of θ are as follows:

1. For a Binomial the likelihood is,


à !
n x
L(θ) = θ (1 − θ)(n−x) .
x

2. The log-likelihood is,


à !
n
ℓ(θ) = log + x log θ + (n − x) log(1 − θ).
x

3. The score equation is,

∂ℓ(θ)/∂θ = 0     (3.3)

x/θ − (n − x)/(1 − θ) = 0.

4. Solve the score equation in (3.3) to obtain the MLE,

   x/θ = (n − x)/(1 − θ), which gives θ̂ = x/n.
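The closed form θ̂ = x/n can be checked numerically in R; the values n = 10 and x = 7 below are purely illustrative:

n <- 10; x <- 7                      # illustrative data
loglik <- function(theta) dbinom(x, size = n, prob = theta, log = TRUE)
optimize(loglik, interval = c(1e-3, 1 - 1e-3), maximum = TRUE)$maximum
# close to x/n = 0.7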

Example 3.5.5. Suppose you have a sample x1 , x2 , · · · , xn from a Normal distribution with
mean µ and variance σ2 . The steps to derive the MLE’s of µ and σ2 are as follows:

1. For a normal distribution the likelihood is,


L(µ, σ²) = Π_{i=1}^n (2πσ²)^{−1/2} exp{−(xi − µ)²/(2σ²)} = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}.


2. Taking logs we have the log-likelihood,

ℓ(µ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (xi − µ)².

3. The score equations are,

   ∂ℓ(µ, σ²)/∂µ = 0,   ∂ℓ(µ, σ²)/∂σ² = 0,     (3.4)

   that is,

   (1/σ²) Σ_{i=1}^n (xi − µ) = 0
   −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (xi − µ)² = 0.

4. Solve the score equations in (3.4) to obtain the MLE’s,

µ̂ = x̄   and   σ̂² = s′² = (1/n) Σ_{i=1}^n (xi − x̄)².

3.5.3 Properties of MLE’s


Likelihood estimates have several desirable properties:

• Consistency.

• Asymptotic normality.

• Efficiency.

• Invariance.

Asymptotic normality

The likelihood estimates are asymptotically unbiased and asymptotically normal with a known variance. That is,

θ̂ ∼ N(θ, v²),

where

v² = 1/E[(∂ℓ(θ)/∂θ)²] = −1/E[∂²ℓ(θ)/∂θ²].     (3.5)
Example 3.5.6. From Example 3.5.4 we have that

∂ℓ(θ)/∂θ = x/θ − (n − x)/(1 − θ),

so

∂²ℓ(θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)².


Now when we take the expectation,

E[∂²ℓ(θ)/∂θ²] = −E[x]/θ² − E[n − x]/(1 − θ)²,

as E [x] = nθ we have,

E[∂²ℓ(θ)/∂θ²] = −n/θ − n/(1 − θ) = −n/(θ(1 − θ)),

and so the variance of the maximum likelihood estimate θ̂ is,

v² = −1/E[∂²ℓ(θ)/∂θ²] = θ(1 − θ)/n.

Efficiency and the Cramér-Rao lower bound

The Cramér-Rao theorem shows that (given certain regularity conditions concerning the distribution) the variance of any unbiased estimator T of a parameter θ must satisfy

Var(T) ≥ −1/E[∂²ℓ(θ)/∂θ²].

Hence any estimator that achieves this lower bound is efficient and no “better" estima-
tor is possible.
As you can see, the asymptotic variance of the MLE of θ in (3.5) is exactly equal to the Cramér-Rao lower bound. Therefore the MLE is (asymptotically) efficient.

Invariance

If the MLEs θ̂1, θ̂2, . . . , θ̂k exist, then the MLE of a function h(θ1, θ2, . . . , θk) is,

ĥ(θ1 , θ2 , . . . , θk ) = h(θ̂1 , θ̂2 , . . . , θ̂k ).

Example 3.5.7. Suppose you have a sample x1 , x2 , · · · , xn from a Normal distribution with
mean µ and variance σ2 . Find the MLE’s of (a) µ2 + σ2 , (b) Pr(X ≤ x0 ).
From Example 3.5.5 we have that µ̂ = X̄ and σ̂2 = s ′2 . So,

(a) The MLE of µ2 + σ2 is,


µ̂2 + σ̂2 = X̄ 2 + s ′2 .
(b) We have that Pr(X ≤ x0) = Pr((X − µ)/σ ≤ (x0 − µ)/σ) = Φ((x0 − µ)/σ). So the MLE is,

Φ((x0 − µ̂)/σ̂) = Φ((x0 − X̄)/s′).


3.6 Confidence intervals


Is it satisfactory to report only an estimated value of θ (such estimates are known as
point estimates)?
A point estimate, though it will represent our best guess for the true value of the
parameter, may be close to that true value but (for continuous case) will virtually never
actually equal it. Some measure of how close the point estimate is to the true value θ is
required. One solution is to report both the estimate and its estimated standard error.
We look for an alternative, a way to report a range of possible values.
Formally, for a random vector X = (X1, . . . , Xn), an interval estimator of a parameter θ with coverage probability 1 − α is a random interval (θ̂L(X), θ̂U(X)) where,

• θ̂L (X), θ̂U (X) are functions of the data.

• θ̂L (X) < θ̂U (X).


• Pr[θ ∈ (θ̂L(X), θ̂U(X))] = 1 − α, where α is a small probability known as the significance level.

The term 100(1 − α)% confidence interval is used to denote either an interval estimator
with coverage probability 1 − α or an interval estimate. The value α is usually assigned a small
value, e.g. 0.1, 0.05, or 0.01. There are three methods of deriving confidence intervals,

(a) The exact method.

(b) The asymptotic method.

(c) The method of bootstrap.

For this course we restrict ourselves to the asymptotic method.

3.6.1 Asymptotic confidence intervals


We know that the maximum likelihood estimate of θ is approximately normal with mean θ and variance v² = −1/E[∂²ℓ(θ)/∂θ²], i.e., θ̂ ∼ N(θ, v²). So,

(θ̂ − θ)/v ∼ N(0, 1).
Thus given α we can find the asymptotic confidence interval for θ since the coverage
probability involves critical values from the standard normal distribution (see Figure 3.1),

Pr[−z_{1−α/2} = −Φ⁻¹(1 − α/2) ≤ (θ̂ − θ)/v ≤ Φ⁻¹(1 − α/2) = z_{1−α/2}] = 1 − α.     (3.6)

Example 3.6.1. Suppose that we have a sample x1 , x2 , · · · , xn from an exponential distri-


bution with parameter 1/λ and we would like to find (a) the MLE of λ, (b) the variance of λ̂, and (c) the 100(1 − α)% confidence interval for λ.


Figure 3.1: The standard normal density curve. The red area indicates that |z| = |θ̂ − θ|/v > z_{1−α/2}, with probability α/2 in each tail and α in total.

(a) The steps to derive the MLE of λ are as follows:

1. For an Exp(1/λ) the likelihood is,

L(λ) = Π_{i=1}^n (1/λ) exp(−xi/λ) = (1/λ^n) exp(−(1/λ) Σ_{i=1}^n xi).

2. The log-likelihood is,


ℓ(λ) = −n log λ − (1/λ) Σ_{i=1}^n xi.

3. The score equation is,

∂ℓ(λ)/∂λ = 0     (3.7)

−n/λ + (1/λ²) Σ_{i=1}^n xi = 0.

4. Solve the score equation in (3.7) to obtain the MLE,

λ̂ = X̄ .

(b)
∂²ℓ(λ)/∂λ² = n/λ² − (2/λ³) Σ_{i=1}^n xi.


Now when we take the expectation,


E[∂²ℓ(λ)/∂λ²] = E[n/λ² − (2/λ³) Σ_{i=1}^n xi] = n/λ² − (2/λ³) E[Σ_{i=1}^n xi] = n/λ² − (2/λ³) nλ = −n/λ²,

as E[Σ_{i=1}^n xi] = E[x1] + . . . + E[xn] = λ + . . . + λ = nλ, and so the variance of the maximum likelihood estimate λ̂ is,

v² = −1/E[∂²ℓ(λ)/∂λ²] = λ²/n.

(c) Substituting into (3.6) we have,


Pr[−z_{1−α/2} ≤ (λ̂ − λ)/√(λ²/n) ≤ z_{1−α/2}] = 1 − α,

or

Pr[λ̂/(1 + z_{1−α/2}/√n) ≤ λ ≤ λ̂/(1 − z_{1−α/2}/√n)] = 1 − α.

So the 100(1 − α)% confidence interval for λ is,

(λ̂/(1 + z_{1−α/2}/√n),  λ̂/(1 − z_{1−α/2}/√n)).
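A sketch in R of this interval for a simulated sample (the true mean λ = 3 and n = 50 are illustrative choices, not from the notes):

set.seed(3)
n <- 50
x <- rexp(n, rate = 1/3)                 # Exp(1/lambda) with mean lambda = 3
lambda.hat <- mean(x)                    # the MLE
z <- qnorm(0.975)                        # 95% interval, alpha = 0.05
c(lower = lambda.hat/(1 + z/sqrt(n)),
  upper = lambda.hat/(1 - z/sqrt(n)))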


Exercises
1. Using the inversion method simulate from the Weibull distribution with cumulative distribution function,

   F_X(x) = 1 − exp(−x^β),   x ≥ 0.

2. Suppose we have a random sample from,

f (x, θ) = 1/(2θ),   −θ ≤ x ≤ θ.

Find the method of moments estimator of θ.

3. The dorsal length data below is thought to come from a lognormal distribution
with density,
f (y) = (1/(y√(2πσ²))) exp{−(log y − µ)²/(2σ²)},   y > 0.
The distribution is called the lognormal as it is the distribution of Y = exp(X) where X is normal with mean µ and variance σ².

(a) Show that E[Y^k] = exp(kµ + σ²k²/2).


(b) Hence estimate the parameters of the distribution using the data below:

Dorsal lengths in mm of distinct octopods Robson(1929)

10 12 18 64 54 19 115 73
190 15 21 84 44 18 37 43
55 57 60 73 82 175 50 80
65 19 43 54 16 10 17 52
43 63 23 44 95 20 41 17
15 70 36 82 29 29 61 22
40 12 22 16 30 16 116 28
32 17 11 95 27 16 55 8
11 33 26 29 85 20 67 27
44 49 29 30 35 17 26 32
76 16 82 27 5 6 51 75
23 150 6 84 22 47 9 10
28 29 21 73 52 130 50 45

Note: Ȳ = 44.67 and Σ Y² = 344223.

4. Find the maximum likelihood estimates of θ for a sample x1 , . . . , xn from the fol-
lowing distributions:

(a) f (x, θ) = θ(1 − θ)^x ,   x = 0, 1, 2, · · · .


(b) f (x, θ) = θ^x exp(−θ)/x! ,   x = 0, 1, 2, · · · .


(c) f (x, θ) = exp{−x²/(2θ²)} / √(2πθ²) ,   x > 0.

5. The data below:

8 6 2 3 6 4 7 5 4 4
2 4 7 3 10 6 10 6 8 5
5 2 5 3 3 5 4 5 2 8
6 10 1 5 0 5 4 4 5 6
6 4 9 4 6 4 6 4 4 12
4 5 3 5 5 8 3 7 9 3
6 6 3 4 3 5 4 6 5 3
3 3 9 4 1 4 5 5 6 4
7 3 6 5 6 4 4 6 7 5
3 7 2 7 5 6 7 5 3 3

is thought to have come from a Poisson distribution with parameter θ.

(a) Find the maximum likelihood estimate of θ.


(b) Find the variance of θ̂.

Note: You may assume that x̄ = 4.99 and s² = 2.1153².

Chapter 4

Introduction to Hypothesis Testing

4.1 Hypothesis Testing


Setting up and testing hypotheses is seen in most courses as an essential part of sta-
tistical inference. It is an important part of statistics and you need to be able to both
understand and construct statistical tests. We begin with some ideas and definitions.

Hypotheses

We often make assertions and, as in many cases we have incomplete information, the assertion is about a probability distribution. Such an assertion about a distribution is called a statistical hypothesis. It may be a simple hypothesis if it completely specifies the distribution, or composite if it is not simple.

Example 4.1.1. H0 : f (x) = θ exp(−θx) is a simple hypothesis if we add θ = 3.

Example 4.1.2. The hypothesis that X is N(100, σ²), for an unspecified σ, is composite.

Typically the hypothesis has been put forward, either because it is believed to be true
or because it is to be used as a basis for argument.

Example 4.1.3. Suppose we have a coin which we wish to check is fair, that is P[Head]=1/2.
If we assume the coin is fair we are assuming that the number of heads, X , is Binomial
with p = 1/2.

Example 4.1.4. We might be interested in the birth weight of babies born in Sheffield com-
pared with those born in Brighton. Since we know (from the medics) that birth weights
are Normal we can think of the hypothesis that the Sheffield and Brighton weights have
Normal distributions with the same mean. This is not a simple hypothesis as the means
and variances are not specified.

In each problem we shall consider, the question of interest is simplified into two
competing hypotheses between which we have a choice;

• The null hypothesis, denoted H0


• The alternative hypothesis, denoted H1, which is the complement (in the context of the problem) of the null.

These two competing hypotheses are not, however, treated on an equal basis; special consideration is given to the null hypothesis. Usually the experiment has been carried
out in an attempt to prove or disprove a particular hypothesis, the null hypothesis. For
example,

H0 : there is no difference in taste between coke and diet coke

against
H1 : there is a difference.
Of the two hypotheses the null is almost always simple, in that it completely specifies the underlying distribution; the alternative is often composite.

Example 4.1.5. H0 : X is Binomial (100,1/2) i.e. p is specified


H1 : X is Binomial (100,p) p ≤ 1/2.

Example 4.1.6. H0 : X is N(5, 20) i.e. µ and σ are specified


H1 : X is N(µ, 20) i.e. µ > 5.

4.1.1 Type I and II errors


As we are dealing with incomplete information errors are inevitable. The power of statis-
tics lies in the fact that we accept these errors and in the way we deal with them.
If we have two competing hypotheses there are two kinds of errors that may arise
and the following table gives a summary of possible results .

Truth
action H0 H1
Accept H0 ok Type II error
Reject H0 Type I error ok

The two errors are traditionally called the type one and type two errors and we will be
looking at their probabilities

1. α = P [ Type I error] = P [reject H0 |H0 true].

2. β = P [Type II error] = P [accept H0 |H0 false]

4.2 A general approach


Suppose we choose an acceptable value for the probability of a type I error,

α = P [ Type I error] = P [reject H0 |H0 true]


such as 0.05 or 0.01. Then take the sample space of outcomes (x1 , x2 , . . . , xn ) and split it
into two parts C and R where C ∩ R is empty and C ∪ R is the whole space.
We choose C - the critical region - to be the set of unlikely points, that is the set of
outcomes which are unlikely under H0 . Then if we observe a set of points in C we have
two options

• Either H0 is false

• Or we have observed an event of small probability.

In conventional testing we regard the second possibility as implausible; we reject the null hypothesis and accept the alternative.

By this we mean that it is rational on the evidence to believe that the null is not true.
Recall

• The probability of observing the unlikely event is α, the probability of a type I error, often referred to as the size of the test.

• As we have seen the probability of a type II error β = P [accept H0 |H1 ] is gener-


ally unknown and needs to be calculated. The alternative measure is the power
P [reject H0 |H1 ] but this also has to be computed.

If we do not reject the null hypothesis, it may still be false (a type II error) as the sample
may not be big enough to identify the falseness of the null hypothesis (especially if the
truth is very close to the hypothesis). You should bear in mind that for any given set of data, the type I and type II error probabilities are inversely related; the smaller the risk of one, the higher
the risk of the other.
As dealing with high dimensional spaces is difficult we usually base our tests on a
test statistic T computed from the observations. As above we find a critical region C
defined by
P [T ∈ C |H0 ] = α
Thus the region C is those values of T which are unlikely when H0 is true. Some workers
prefer the p-value. The probability value (p-value) of a statistical hypothesis test is the
probability of getting a value of the test statistic as extreme as or more extreme than that
observed by chance alone on the assumption that the null hypothesis H0 , is true.

4.2.1 Power
The power of a statistical hypothesis test measures the test’s ability to reject the null
hypothesis when it is actually false - that is, to make a correct decision. In other words,
the power of a hypothesis test is the probability of not committing a type II error, that is, γ = 1 − β. Usually statisticians think of the power as a function of the parameter. So if we have

H0 : θ = θ0 against H1 : θ = θ1

they would consider the power γ(θ1) as a function of the possible alternatives.


4.2.2 Summary of the definitions


• A statistical hypothesis is an assertion about a probability distribution.

• A simple hypothesis completely specifies the distribution.

• α = P [ Type I error] = P [reject H0 |H0 true].

• β = P [ Type II error] = P [accept H0 |H0 false].

• The critical region C is those values of T which are unlikely when H0 is true.

• The probability value (p-value) of a statistical hypothesis test is the probability of


getting a value of the test statistic as extreme as or more extreme than that ob-
served by chance alone on the assumption that the null hypothesis H0 , is true.

• The power of a statistical hypothesis test is the probability of rejecting the null
hypothesis when it is false.

4.3 Constructing tests


While the apparatus above is reasonable it does not answer the questions as to how one
might find a suitable statistic T and how we can be assured that T encapsulates the whole
problem. Most statistics books give a list of recipes. The procedure is typically:

1. Set up H0 and H1

2. Pick a suitable test statistic T whose distribution is known under the assumptions
of H0 .

3. Choose the size of the test α = P [reject H0 |H0 true]

4. Find the critical region using the distribution in 2

5. Compute T.

6. If T lies in the critical region reject H0 .

Often we can derive such a method from insight into the problem, as we shall see.

4.3.1 A Binomial example


We toss a coin 12 times and observe

HH TH TT TH HT HH


We assume that X, the number of heads, is Binomial B(12, p) and our null hypothesis is
H0 : P[Head] = p = 1/2

while the alternative is

H1 : P[Head] = p > 1/2.
1. One choice of statistic is just X the number of heads.

2. We choose α to be approximately 0.05.

3. To find a critical region C note

α = P [X > c |H0 : p = 1/2]

From tables of the Binomial we have


c P[ X > c]
0 0.99976
1 0.99683
2 0.98071
3 0.92700
4 0.80615
5 0.61279
6 0.38721
7 0.19385
8 0.07300
9 0.01929
10 0.0032
11 0.0002
12 0.0000
Now an α of 0.05 is not possible but we can have 0.0193. So we choose a critical region of the form X > 9. In our sample we have X = 7 heads. Since this does not lie in the critical region we accept H0 and conclude that p = 1/2.
Notice that if we had a simple alternative, say H1 : p = 0.7 we could compute the type
II error since β = P [X ≤ c|p = 0.7].

You might like to compute this yourself.

The power is just γ(p) = P [X > 9|p] which is

P γ(p)
0.5 0.01929
0.6 0.08344
0.7 0.25282
0.8 0.55835
0.9 0.88913


Figure 4.1: The power γ(p) as a function of the possible alternatives p, for p from 0.5 to 0.9 in 0.004 increments.
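The table and Figure 4.1 can be reproduced with a few lines of R (a sketch, not part of the original notes):

power <- function(p) 1 - pbinom(9, size = 12, prob = p)    # P[X > 9 | p]
round(power(c(0.5, 0.6, 0.7, 0.8, 0.9)), 5)                # matches the table above
p <- seq(0.5, 0.9, by = 0.004)
plot(p, power(p), type = "l", xlab = "p", ylab = "Power")  # reproduces Figure 4.1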

The critical region depends on the alternative - try finding the critical region for testing

H0 : p = 1/2 against the alternative H1 : p < 1/2.

4.3.2 A two tailed test!

Suppose we wish to test

H0 : p = 1/2 against the alternative H1 : p ≠ 1/2

then the critical region will consist of small values of X and large values of X. Hence we have a region of the form X < c1 or X > c2. We split our α between the two segments so
so

P [X < c1 |H0 : p = 1/2] = P [X > c2 |H0 : p = 1/2]

Extending the table above gives


c P[ X > c ] P[ X ≤ c ]
0 0.99976 0.00024
1 0.99683 0.00317
2 0.98071 0.01929
3 0.92700 0.07300
4 0.80615 0.19385
5 0.61279 0.38721
6 0.38721 0.61279
7 0.19385 0.80615
8 0.07300 0.92700
9 0.01929 0.98071
10 0.00317 0.99683
11 0.00024 0.99976
12 0.00000 1.00000

Now if we choose c2 = 9 as before and take the lower part of the region to be X ≤ 2, then α = 2 × 0.01929, or about 0.0386.


Notice the critical region was in both tails of the distribution. Tests with regions of
this form are often called two tailed tests.

A Normal example

Suppose we have a sample of size 100 from a Normal distribution. We wish to test

H0 : µ = 68 against H1 : µ ≠ 68

To simplify matters we assume that the standard deviation of the population is σ = 16.
A possible statistic is X̄ which is Normal N( µ, σ2 /n). Rather simpler is the standard-
ized random variable
z = (X̄ − 68)/(σ/√n)
which we know is standard Normal when H0 is true .
Now if the true mean is not 68 then z will be very different from zero. The distribution
of z is shown in Figure 4.2(c) and we see that the two areas in the tails must total α.
A little inspection gives us the critical regions as

z < −z 1−α/2 and z > z 1−α/2

Now if we choose α = 0.05 then we will reject H0 when z < −1.96 or z > 1.96. In our case
X̄ = 68.04. We know that σ = 16 so z = 0.025. This is not in the critical region so we
accept H0
This is somewhat unrealistic since, if we are unsure of µ, it is most unlikely that we know σ! Here we have a large sample, and for large samples (exceeding 50) we can use an estimate. Here σ̂ = 13.724, so
t = (X̄ − 68)/(σ̂/√n) = 0.029
and in consequence we accept H0 .
What is the critical region for the alternative H1 : µ > 68?


Figure 4.2: Critical regions for testing H0 : µ = µ0 against (a) H1 : µ > µ0 (reject when z > z_{1−α}), (b) H1 : µ < µ0 (reject when z < −z_{1−α}), and (c) H1 : µ ≠ µ0 (reject when |z| > z_{1−α/2}), based on z = (x̄ − µ0)/√(s²/n) ∼ N(0, 1).

4.3.3 Normal small sample case


For a small sample we would then have to find the distribution of
t = (X̄ − 68)/(σ̂/√n)
in order to compute the size of the critical region.
In fact we know that the distribution of t has a Student’s t distribution with n − 1
degrees of freedom so all we need to do is to find the critical region using tables of t
rather than Normal tables.
A sample of 10 batteries are randomly selected from a production batch and their
lifetimes found. The mean lifetime is 30.3 and the estimated variance is 16.08. The
manufacturer claims a lifetime of 36 months. Suppose we assume a normal popu-
lation with mean µ and test

H0 : µ = 36 against H1 : µ < 36

Then
t = (30.3 − 36)/√(16.08/10) = −4.49.
Now we find the critical region using the t distribution with 10-1=9 degrees of freedom.
The critical value is -1.833, for a test size of 0.05 (check this!) so we have a value of t in
the critical region and we reject H0 .
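In R the calculation for this battery example looks like the following sketch, using the summary figures quoted above:

xbar <- 30.3; s2 <- 16.08; n <- 10; mu0 <- 36
t.stat <- (xbar - mu0)/sqrt(s2/n)     # about -4.49
crit   <- qt(0.05, df = n - 1)        # about -1.833
t.stat < crit                         # TRUE, so we reject H0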

Summary

We have deduced the procedure given in statistical recipe books. Suppose we are given
a random sample X 1 , X 2 , · · · , X n from a N(µ, σ2 ) distribution and we wish to test

H0 : µ = µ0 against H1 : µ > µ0 , H1 : µ < µ0 , H1 : µ ≠ µ0 ,


then we proceed as follows:

• For large samples (n > 50) we compute
z = (x̄ − µ0)/√(s²/n)   where   s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²

and we reject H0 when z > z 1−α , z < −z 1−α , |z| > z 1−α/2 , see Figure 4.2.

• When the sample size is small, n < 50, we use the t distribution with n − 1 degrees
of freedom and compute
t = (x̄ − µ0)/√(s²/n)   where   s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²

and we reject H0 when t > t1−α , t < −t1−α , |t | > t1−α/2 using Student’s t with
n − 1 degrees of freedom, see Figure 4.3.

Figure 4.3: Critical regions for testing H0 : µ = µ0 against (a) H1 : µ > µ0 (reject when t > t_{1−α}), (b) H1 : µ < µ0 (reject when t < −t_{1−α}), and (c) H1 : µ ≠ µ0 (reject when |t| > t_{1−α/2}), based on t = (x̄ − µ0)/√(s²/n) ∼ t_{n−1}.

Variances

Suppose we are given a random sample X 1 , X 2 , · · · , X n from a N(µ, σ2 ) distribution and


we wish to test

H0 : σ² = σ0² against H1 : σ² > σ0² , H1 : σ² < σ0² , H1 : σ² ≠ σ0² ,

our critical regions will be respectively

X² > χ²_{1−α,n−1} ,   X² < χ²_{α,n−1} ,   X² > χ²_{1−α/2,n−1} or X² < χ²_{α/2,n−1} ,


where X² = (n − 1)s²/σ0² has a Chi-squared distribution with n − 1 degrees of freedom under H0.


Figure 4.4: Rejection regions for testing H0 : σ² = σ0² against (a) H1 : σ² > σ0² (X² > χ²_{1−α,n−1}), (b) H1 : σ² < σ0² (X² < χ²_{α,n−1}), and (c) H1 : σ² ≠ σ0² (X² > χ²_{1−α/2,n−1} or X² < χ²_{α/2,n−1}), based on X² = (n − 1)s²/σ0² ∼ χ²_{n−1}.

Example

We are given a random sample with n = 10 from a N(µ, σ2 ), where s 2 = 12.6. For α = 0.05
we test
H0 : σ2 = 9 against H1 : σ2 > 9.
The statistic is X² = (n − 1)s²/σ0² with critical region X² > χ²_{1−α,n−1}. Then X² = 9 × 12.6/9 = 12.6 and χ²_{0.95,9} = 16.919. So we have a value not in the critical region and we cannot reject H0.
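The same calculation in R (a sketch using the quoted summary values):

n <- 10; s2 <- 12.6; sigma0.sq <- 9
X2   <- (n - 1) * s2 / sigma0.sq      # 12.6
crit <- qchisq(0.95, df = n - 1)      # 16.919
X2 > crit                             # FALSE, so we cannot reject H0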

A binomial/normal example

For large samples we can use the normal approximation to the binomial to turn what are essentially binomial problems into normal ones. The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each of which yields success with probability p. If X ∼ B(n, p) (that is, X is a binomially distributed random variable), then the expected value of X is E(X) = np and the variance is Var(X) = np(1 − p). Therefore if n is large (n > 50) we can approximate the binomial distribution with the normal distribution N(µ = np, σ² = npq), where q = 1 − p, or use the standard normal Z = (X − np)/√(npq). According to this approximation we test

H0 : p = p0 against H1 : p > p0 , H1 : p < p0 , H1 : p ≠ p0

by computing
z = (x − np0)/√(np0(1 − p0))
and we reject H0 when z > z 1−α , z < −z 1−α , |z| > z 1−α/2 .


4.3.4 Sample size choice


In many situations we can use our definitions of the type I and type II probabilities to
specify the sample size we require.

A crossover trial

Suppose we wish to test the effects of two kinds of medication A and B on reducing blood
pressure in males. We intend to treat n patients for five weeks on A and five weeks on B.
The order of application will be randomized. The response is the average blood pressure
in the third week of each treatment.

If we pick α = 0.05 and β = 0.1 how large a sample do we need?

We assume that the responses A i , B i , i = 1, 2, · · · , n are normal and look at the differences
D i = A i − B i , i = 1, 2, · · · , n. The test is

H0 : µD = 0 against H1 : µD > 0

Now the background to this experiment is that A is the current treatment and we expect
to see a change from A to B whose size is about 1/2 a standard deviation.
Back to definitions - in general

α = P (z > z 1−a |H0 : µ = µ0 ) ⇐⇒ 0.05 = P (z > z 0.95 |H0 : µ = 0) ⇐⇒ P (z ≤ z 0.95 |H0 : µ = 0) = 0.95.

From the statistical tables (inverse normal) one can see that z 0.95 = 1.645. Therefore the
critical region is

z > 1.645 ⇐⇒ (D̄ − µ0)/(σ/√n) > 1.645 ⇐⇒ D̄ > µ0 + 1.645σ/√n.

Now in addition we require


β = 0.1 = P[D̄ ≤ µ0 + 1.645σ/√n | H1 : µ = µ1 = µ0 + σ/2] ⇐⇒

0.1 = P[(D̄ − µ1)/(σ/√n) ≤ (µ0 + 1.645σ/√n − µ0 − σ/2)/(σ/√n) | H1 : µ = µ1 = µ0 + σ/2] ⇐⇒

0.1 = P[z ≤ (1.645σ/√n − σ/2)/(σ/√n) | H1] ⇐⇒

0.1 = P[z ≤ 1.645 − √n/2 | H1].
From the statistical tables (inverse normal) one can see that z 0.1 = −1.2816. So
1.645 − √n/2 = −1.2816 ⇐⇒ √n = 5.853 ⇐⇒ n ≈ 34.3, so we take n = 35.
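In R the same sample-size calculation is one line (a sketch; delta = 0.5 is the anticipated change of half a standard deviation, as above):

alpha <- 0.05; beta <- 0.1; delta <- 0.5                   # effect size in sd units
ceiling(((qnorm(1 - alpha) + qnorm(1 - beta))/delta)^2)    # gives 35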


A Binomial/Normal case

A medic knows that of patients admitted to hospital for a cardiac problem, 60% will be readmitted on an emergency basis within 2 months. She believes that treatment with X
will reduce this readmission level by half, that is to 30%. To check her theory she must
experiment on patients and to gain permission to do so she must show that her experi-
ment has a reasonable chance of detecting a change and uses the minimum number of
patients.
This could be framed as a Binomial, with p the probability of readmission. Suppose
we take n patients and administer the drug to them. We set up the hypotheses

H0 : p = 0.6 against H1 : p < 0.6

We observe R of our n treated patients readmitted. We will assume that R is Binomial


and that we can approximate R by a Normal distribution.
Now how to specify the sensitivity of our procedure. We ask that the type II error
probability be
β = P [R ≥ k|p = 0.3] = 0.1

You should check that you understand this!

I am going to specify the test size as α = 0.05


The critical region is of the form R ≤ k and assuming normality we have the defini-
tion of test size
α = P [R ≤ k|p = 0.6] = 0.05
Using the normal approximation
0.05 = α = P[R ≤ k | p = 0.6] = P[z = (R − np)/√(np(1 − p)) < −z_{1−α} | p = 0.6] = P[z = (R − n × 0.6)/√(n × 0.6 × 0.4) < z_α | p = 0.6].
From the statistical tables (inverse normal) one can see that z 0.05 = −1.645. Therefore
the critical region is

R < n × 0.6 − 1.645 × √(n × 0.6 × 0.4).
For the type II error
0.1 = β = P[R ≥ n × 0.6 − 1.645 × √(n × 0.6 × 0.4) | p = 0.3] ⇐⇒

0.1 = P[z = (R − np)/√(np(1 − p)) ≥ (n × 0.6 − 1.645√(n × 0.6 × 0.4) − n × 0.3)/√(n × 0.7 × 0.3) | p = 0.3] ⇐⇒

0.9 = P[z < (n × 0.6 − 1.645√(n × 0.6 × 0.4) − n × 0.3)/√(n × 0.7 × 0.3)].

From the statistical tables (inverse normal) one can see that z_{0.9} = 1.2816. Therefore

1.2816 = (n × 0.6 − 1.645√(n × 0.6 × 0.4) − n × 0.3)/√(n × 0.7 × 0.3) ⇐⇒ · · · ⇐⇒ n = 22.
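A short R sketch that solves the same equation for n numerically (not part of the original notes):

f <- function(n) {
  crit <- n * 0.6 - qnorm(0.95) * sqrt(n * 0.6 * 0.4)    # boundary of the critical region
  (crit - n * 0.3)/sqrt(n * 0.3 * 0.7) - qnorm(0.9)      # zero when beta = 0.1
}
uniroot(f, c(5, 100))$root     # about 21.6, so take n = 22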


Exercises
1. Suppose we are given a random sample X 1 , X 2 , · · · , X 9 from a N(µ, 1) distribution.
To test
H0 : µ = 20 against H1 : µ = 21.5

the critical region is X̄ > 21.

(a) What is the probability of type I error?


(b) What is the probability of type II error?

2. Suppose we are given a random sample X 1 , X 2 , · · · , X 12 from a Poisson(θ) distribu-


tion. To test
H0 : θ = 1/2 against H1 : θ < 1/2
the critical region is Σ_{i=1}^{12} X_i ≤ 2.

(a) What is the probability of type I error?


(b) What is the power of the test?

Hint: Recall that the sum of independent Poissons is Poisson.

3. Suppose we are given a random sample X 1 , X 2 from an Exp(1/θ) distribution. To


test
H0 : θ = 2 against H1 : θ > 2

the critical region is X 1 + X 2 ≥ 9.5.

(a) What is the probability of type I error?


(b) What is the function of the power of the test?
Hint: Recall that the sum Σ_{i=1}^n X_i of independent Gamma(1, θ) variables is Gamma(n, θ).

4. Suppose we are given a random sample X 1 , X 2 , · · · , X n from a N(µ, σ2 ) distribution.


To test
H0 : µ = µ0 against H1 : µ1 > µ0
the critical region is Z = (X̄ − µ0)/(σ/√n) > z_{1−α} .

(a) What is the distribution of Z under H1 ?


(b) What is the power of the test?

5. Suppose we are given a random sample from a N(µ, 25) distribution. If x̄ = 42.6, n =
36 then with α = 0.05 test

H0 : µ = 45 against H1 : µ < 45.


6. Suppose we are given a random sample from a N(µ, σ2 ) distribution. If x̄ = 65, n =


12, s = 3 then with α = 0.05 test

H0 : µ = 66 against H1 : µ < 66.

7. A random sample of 10 is taken from a population with distribution N(µ, σ2), where
s 2 = 12.6.

(a) With α = 0.05 test


H0 : σ2 = 9 against H1 : σ2 > 9.

(b) If σ1² = 18 find the probability of Type II error.

8. A tobacco company believes that the percentage of the smokers which prefer its
cigarettes is at least 65%. To test this belief, 500 smokers were interviewed and 311
of them declared that they prefer this type of cigarettes. With α = 0.05 test the
belief of the tobacco company.

Chapter 5

The Neyman Pearson Lemma

5.1 The Neyman Pearson Lemma


The probability of a type I error is controlled by construction; it is at most α.

A good test should also have a small probability of type II error. In other words it
should also be a powerful test.

Definition 5.1.1 (Most Powerful). Let T be a statistical test for testing

H0 : θ = θ0 against H1 : θ = θ1

with significance level α and maximum power γ = 1 − β among all the other statistical
tests with the same significance level. Such a test T with the maximum power is said to
be the most powerful (MP) test.

The following theorem, which is known as the Neyman–Pearson Lemma, solves the problem of the existence and construction of MP tests for testing a simple null hypothesis against a simple alternative hypothesis.

Lemma 5.1.1. Suppose we have a sample x1 , x2 , · · · , xn and have two simple hypotheses
H0 and H1 . Suppose the likelihood is L(H0) under H0 and L(H1) under H1 . The most
powerful test of H0 against H1 has a critical region of the form

L(H0)/L(H1) ≤ a constant.     (5.1)

Proof.

We give the proof for continuous random variables. For discrete random variables
just replace integrals with sums. For the definition of the critical region of significance
level α,

α = P(X ∈ C | f0) = ∫···∫_C L(H0) dx.


Suppose that D is another critical region of significance level α. In other words,

α = ∫···∫_D L(H0) dx.

We will prove that


power of C = ∫···∫_C L(H1) dx ≥ ∫···∫_D L(H1) dx = power of D.

The following equalities are obvious,

C = (C ∩ D) ∪ (C ∩ D ′ ) and D = (D ∩C ) ∪ (D ∩C ′).

So,

α = ∫···∫_C L(H0) dx = ∫···∫_{C∩D} L(H0) dx + ∫···∫_{C∩D′} L(H0) dx

and

α = ∫···∫_D L(H0) dx = ∫···∫_{D∩C} L(H0) dx + ∫···∫_{D∩C′} L(H0) dx.

Therefore

∫···∫_{C∩D′} L(H0) dx = ∫···∫_{D∩C′} L(H0) dx.     (5.2)
Inside C we have the Neyman-Pearson region, so L(H0)/L(H1) ≤ k (a constant), or L(H1) ≥ L(H0)/k, while outside C we have the non Neyman-Pearson region, where L(H1) ≤ L(H0)/k. Accordingly, from Eq. (5.2),

∫···∫_{C∩D′} L(H1) dx ≥ ∫···∫_{C∩D′} L(H0)/k dx = ∫···∫_{D∩C′} L(H0)/k dx ≥ ∫···∫_{D∩C′} L(H1) dx.     (5.3)
Adding ∫···∫_{C∩D} L(H1) dx to both sides of Eq. (5.3),

∫···∫_{C∩D} L(H1) dx + ∫···∫_{C∩D′} L(H1) dx ≥ ∫···∫_{D∩C} L(H1) dx + ∫···∫_{D∩C′} L(H1) dx ⇐⇒

power of C = ∫···∫_C L(H1) dx ≥ ∫···∫_D L(H1) dx = power of D.

To reiterate, by most powerful we mean that any other test will have power which is less than or equal to the power of the test based on the Neyman-Pearson critical region.
This enables us to construct the critical region of the MP test in the sense that any other
test will have inferior power.
In the most common case we have a random sample from a distribution f (x). We
can then formulate the lemma as:


Proposition 5.1.1. Suppose we have a null hypothesis which completely specifies the dis-
tribution f (x), say
H0 : f (x) = f 0 (x)
where f 0 (x) is a known function. We further assume that the alternative is

H1 : f (x) = f 1 (x)

where again f 1 (x) is a known function.


Then the most powerful test of H0 : f (x) = f 0 (x) against H1 : f (x) = f 1 (x) has a critical
region of the form

Π_{i=1}^n f0(xi) / Π_{i=1}^n f1(xi) ≤ constant.     (5.4)

Example 5.1.1. Suppose we have one observation x from an exponential distribution


f (x) = θ exp(−θx) and we wish to test

H0 : θ = θ0 against H1 : θ = θ1 > θ0

Then the critical region takes the form

C = {x : L(H0)/L(H1) ≤ k1}
  = {x : θ0 exp(−θ0 x)/(θ1 exp(−θ1 x)) ≤ k1}
  = {x : (θ0/θ1) exp{(θ1 − θ0)x} ≤ k1}
  = {x : exp{(θ1 − θ0)x} ≤ k2}
  = {x : (θ1 − θ0)x ≤ k3}
  = {x : x ≤ k4}   since θ1 > θ0.

For the computation of k 4 we have that


α = P(x ≤ k4 | θ = θ0) = ∫_0^{k4} θ0 exp(−θ0 x) dx = 1 − exp(−θ0 k4) ⇐⇒ k4 = −log(1 − α)/θ0.
Therefore the test with statistic x and critical region {x : x ≤ − log(1 − α)/θ0 } is the MP
test of
H0 : θ = θ0 against H1 : θ = θ1 > θ0 .


Example 5.1.2. Suppose we have a sample x1 , x2 , · · · , xn from a normal distribution


f (x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   σ² known,

and we wish to test

H0 : µ = µ0 against H1 : µ = µ1 > µ0 , σ2 is known

The two values of the likelihood function, L(H0) and L(H1), are

L(H0) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ0)²},
L(H1) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ1)²}.
Then the critical region takes the form
C = {x : L(H0)/L(H1) ≤ k1}
  = {x : exp[(1/(2σ²)){Σ_{i=1}^n (xi − µ1)² − Σ_{i=1}^n (xi − µ0)²}] ≤ k1}
  = {x : (1/(2σ²)){Σ_{i=1}^n (xi − µ1)² − Σ_{i=1}^n (xi − µ0)²} ≤ k2}
  = {x : n(µ1² − µ0²) + 2(µ0 − µ1) Σ_{i=1}^n xi ≤ k3}
  = {x : x̄ ≥ k4}   since µ1 > µ0.

For the computation of k 4 we have that

α = P (x̄ ≥ k 4 |µ = µ0 ).

We know that x̄ ∼ N (µ, σ2 /n) and under the null hypothesis H0 ,


P[(x̄ − µ0)/(σ/√n) ≥ (k4 − µ0)/(σ/√n)] = α, i.e. (k4 − µ0)/(σ/√n) = z_{1−α}, ⇐⇒ k4 = µ0 + (σ/√n) z_{1−α}.
Therefore the test with statistic x̄ and critical region {x : x̄ ≥ µ0 + (σ/√n) z_{1−α}} is the MP test of
H0 : µ = µ0 against H1 : µ = µ1 > µ0 .


5.1.1 Discrete Distributions


You will have noticed that we have avoided discrete distributions. Suppose we have a Binomial problem, so f (x) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, 2, · · · , n, and we wish to test

H0 : p = p 0 against H1 : p = p 1 < p 0
For a given α Neyman Pearson gives the critical region
C = {x : L(H0)/L(H1) ≤ k1}
  = {x : [C(n, x) p0^x (1 − p0)^(n−x)] / [C(n, x) p1^x (1 − p1)^(n−x)] ≤ k1}
  = {x : x log(p0/p1) − x log((1 − p0)/(1 − p1)) ≤ k2}
  = {x : x log[p0(1 − p1)/(p1(1 − p0))] ≤ k2}
  = {x : x ≤ k3}   since p0(1 − p1)/(p1(1 − p0)) > 1 ⇐⇒ p1 < p0.

Now suppose we try to go a bit further, with p 0 = 0.5 and n = 12. If we choose α = 0.05
then
α = 0.05 = P [x ≤ k 3 ]
If we check the Binomial tables we see that such a k 3 is not possible. We can choose k 3
to give a variety of α values but we cannot have 0.05, as you can see from below

k 0 1 2 3 4 5
P [X ≤ k] 0.000 0.003 0.019 0.073 0.194 0.387

This is a theoretical problem and some would argue that this invalidates the whole scheme
so Neyman-Pearson cannot be applied in this case.
It is possible by using randomization methods to come up with a modified scheme
but I have never ever seen it used. In practice one chooses the best α value from those
available and we shall refer to these tests as being non-randomized when we want to be
quite precise.

Randomization procedure

As an aside we demonstrate the randomization process. In the example above we would


like α = 0.05 but we can only bracket the value with 0.019 and 0.073. Take a biased coin
where the probability of a head is π. Suppose then we toss the coin and


• If a head choose the critical region based on 0.019.

• If a tail choose the critical region based on 0.073.


Then on average our α is 0.019π + 0.073(1 − π). We can make this equal 0.05 simply by
choosing 0.019π+0.073(1−π) = 0.05 or 0.023 = 0.054π. So choosing π = 0.426 solves our
problem!

5.2 Uniformly Most Powerful Tests (UMP)


Definition 5.2.1 (Uniformly Most Powerful). Let T be a statistical test for testing a simple
null hypothesis H0 against a composite H1 with significance level α. Suppose also that for each θ ∈ H1 its power γ(θ) = 1 − β(θ) is greater than or equal to the power of any other statistical test with the same significance level. Such a test T is said to be the uniformly most powerful (UMP) test.
The next theorem introduces a general approach to finding a UMP test.
Theorem 5.2.1. Let H0 be a simple null hypothesis and H1 a composite alternative hy-
pothesis. Suppose also that the MP test of significance level α for testing H0 against the alternative θ = θ1 is the same for every θ1 ∈ H1 . In other words, the test statistic and critical region do not depend on the particular value of the alternative hypothesis. Then this test is the UMP test of signifi-
cance level α for testing H0 against the composite H1 .
Example 5.2.1. Suppose we have a sample x1 , x2 , · · · , xn from a normal distribution
f (x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}
and we wish to test
H0 : µ = µ0 against H1 : µ > µ0
First we will construct the MP test for testing:
H0 : µ = µ0 against H1 : µ = µ1 > µ0 .
We have seen in Example 5.1.2 (you need to follow exactly the same procedure) that the
MP test of
H0 : µ = µ0 against H1 : µ = µ1 > µ0 .
is the test with statistic x̄ and critical region {x : x̄ ≥ µ0 + (σ/√n) z_{1−α}}. Since the test statistic x̄ and the critical region do not depend on the particular value µ1 of H1, i.e., they are the same for each µ1 > µ0, the critical region {x : x̄ ≥ µ0 + (σ/√n) z_{1−α}} is the UMP region for testing
H0 : µ = µ0 against H1 : µ > µ0 .
There is not always a UMP test. This is easily seen using the same set-up as above but with

H0 : µ = µ0 against H1 : µ = µ1 ≠ µ0.

Now we have the two possibilities µ1 < µ0 and µ1 > µ0. Each of these alternatives has a UMP test, but they are based on different critical regions, so we cannot have a combined UMP test!


Exercises
1. Suppose we have one observation x from an exponential distribution f (x) = θ exp(−θx).

(a) Find the MP test for testing

H0 : θ = θ0 against H1 : θ = θ1 > θ0 .

(b) Find the power of the test.

2. Suppose we are given a random sample X 1, X 2 , · · · , X 10 from a N(0, σ2 ) distribution.


Find the MP critical region of significance level α = 0.05, for testing,

H0 : σ2 = 1 against H1 : σ2 = 2.

Hint: The square of a standard normal random variable X ∼ N(0, 1) has a chi-square distribution with 1 degree of freedom, i.e., X² ∼ χ²_1. The sum Σ_{i=1}^n X_i² of independent χ²_1 variables has a chi-square distribution with n degrees of freedom, i.e., Σ_{i=1}^n X_i² ∼ χ²_n.


P

3. Suppose we are given a random sample X 1 , X 2 , · · · , X n from a Exp(1/θ) ≡ G(1, θ)


distribution. Find the MP test for testing

H0 : θ = 2 against H1 : θ = 4.

Hint: Recall that the sum Σ_{i=1}^n X_i of independent G(1, 2) variables is G(n, 2) ≡ χ²_{2n}.

4. Suppose we are given a random sample X 1 , X 2 , · · · , X 16 from a N(µ, σ2 = 4) distri-


bution.

(a) Find the UMP test with significance level α = 0.05 for testing

H0 : µ = 1 against H1 : µ > 1

(b) Find the power of the test if µ = 3.

5. Suppose we are given a random sample X 1 , X 2 , · · · , X 20 from a P(θ) distribution.

(a) Show that the critical region Σ_{i=1}^{20} X_i ≥ 5 is the UMP critical region for testing

H0 : θ = 1/10 against H1 : θ > 1/10.

(b) Find the significance level α of the test.

Chapter 6

Likelihood Ratio Tests

6.1 Likelihood Ratio Tests


The Neyman Pearson approach is optimal but in rather limited circumstances. We can devise more widely applicable methods by extending the use of the likelihood ratio, either

1. to test a composite hypothesis against an alternative composite hypothesis, or

2. to construct a test of a simple hypothesis against an alternative composite hypothesis when a UMP test does not exist.

Suppose we have a sample x1 , x2 , · · · , xn from a distribution with density f (x, θ) where


θ = {θ1 , θ2 , · · · , θk }. We are interested in some test of a hypothesis

H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 .

The only restriction is that H0 is a simplified version of H1 .


For the Neyman-Pearson lemma we considered

λ = L(H0)/L(H1),
and we can do the same again for composite hypotheses. Of course there may be un-
specified parameters so we choose to consider

λ = (max_{θ∈Θ0} L) / (max_{θ∈Θ1} L).

That is we take the ratio


λ = L(H0)/L(H1),

where we assume that we have used the maximum likelihood estimates (under each hy-
pothesis) for the unspecified parameters.
As we have required that H0 is a special case of H1 it follows that 0 ≤ λ ≤ 1 and we can
envisage a critical region of the form λ ≤ constant. As you will see there are problems!


Example 6.1.1. X 1 , X 2 , · · · , X n is a random sample from a Normal distribution, say N(µ, σ2 ).


We wish to test
H0 : µ = µ0 against H1 : µ ≠ µ0

Both hypotheses are composite since σ2 is unknown. Therefore we will apply the likeli-
hood ratio test. The likelihood is
L(µ, σ² | x) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}.     (6.1)

To find max_{θ∈Θ0} L, substitute µ = µ0 in Eq. (6.1) and maximize with respect to σ², i.e., find the MLE of σ² when µ = µ0 is known. So,

log L(x; µ0, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (xi − µ0)²

and

∂ log L(x; µ0, σ²)/∂σ² = −n/(2σ²) + Σ_{i=1}^n (xi − µ0)²/(2σ⁴) = 0.

Therefore,

σ̂² = (1/n) Σ_{i=1}^n (xi − µ0)²   and   max_{θ∈Θ0} L = (2πσ̂²)^{−n/2} exp(−n/2).
To find max_{θ∈Θ1} L we are looking for the MLEs of µ and σ². It is known that the latter are

µ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)² := s′².

Therefore
max_{θ∈Θ1} L = (2πs′²)^{−n/2} exp(−n/2).
Substituting into the likelihood ratio gives

λ = [Σ_{i=1}^n (xi − x̄)² / Σ_{i=1}^n (xi − µ0)²]^{n/2}.

We note that (skipping some algebra)

Σ_{i=1}^n (xi − µ0)² = Σ_{i=1}^n (xi − x̄)² + n(x̄ − µ0)²,

which eventually gives


λ = [1 + t²/(n − 1)]^{−n/2},

where

t = (x̄ − µ0)/(s/√n)   and   s² = (1/(n − 1)) Σ (xi − x̄)².


is the usual t statistic. The critical region is


C = {x : λ ≤ k}
  = {x : [1 + t²/(n − 1)]^{−n/2} ≤ k}
  = {x : 1 + t²/(n − 1) ≥ k^{−2/n}}
  = {x : t² ≥ k1}
  = {x : |t| ≥ √k1 = k2}.

For the computation of k 2 we have that

α = P (|t | ≥ k 2 |µ = µ0 ). (6.2)

We know that t ∼ t_{n−1} under the null hypothesis H0 , and so from Eq. (6.2), k2 = t_{n−1,1−α/2} .
Therefore the test with statistic t and critical region {x : |t | ≥ tn−1,1−α/2 } is a likelihood
ratio test of
H0 : µ = µ0 against H1 : µ ≠ µ0 .

6.2 Asymptotic likelihood ratio test


It should be apparent that finding the distribution of λ is complex and probably impos-
sible to find in general. Our life is made very much easier by Wilks who proved that

Λ = −2 lnλ (6.3)

has a χ²_{r−s} distribution, where r is the number of parameters estimated under H1 and s the number of parameters estimated under H0 . This is a large sample approximation but
enables us to produce tests in a wide variety of situations.

Example 6.2.1. Suppose we have X 1 , X 2 , · · · , X n a random sample from a Normal distri-


bution, say N(µx , σ2x ). We also have a second, independent, random sample Y1 , Y2 , · · · , Ym
from N(µ y , σ2y ). We wish to test

H0 : σx = σy against H1 : σx ≠ σy

Both hypotheses are composite since σx , σy , µx , µy are unknown. Therefore we will apply


the likelihood ratio test. Since the two random samples are independent, the likelihood is,

L(σx, σy, µx, µy | x, y) = Lx(σx, µx) × Ly(σy, µy)
  = (2πσx²)^{−n/2} exp{−(1/(2σx²)) Σ_{i=1}^n (xi − µx)²} × (2πσy²)^{−m/2} exp{−(1/(2σy²)) Σ_{i=1}^m (yi − µy)²}.

The log-likelihood is,

log L(σx, σy, µx, µy | x, y) = −(n/2) log 2π − (n/2) log σx² − (1/(2σx²)) Σ_{i=1}^n (xi − µx)²
                              − (m/2) log 2π − (m/2) log σy² − (1/(2σy²)) Σ_{i=1}^m (yi − µy)².

To find max_{θ∈Θ0} L, substitute σx = σy = σ in log L(σx, σy, µx, µy | x, y) and maximize with respect to σ², µx and µy, i.e., find the MLEs of σ², µx and µy under H0. In other words, the MLEs will be derived by solving the following system of equations,

∂ log L(σ, µx, µy|x, y)/∂µx = (1/σ²) Σ_{i=1}^n (x_i − µx) = 0

∂ log L(σ, µx, µy|x, y)/∂µy = (1/σ²) Σ_{i=1}^m (y_i − µy) = 0

∂ log L(σ, µx, µy|x, y)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µx)²
                              −m/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^m (y_i − µy)² = 0

After some algebra,

µ̂x = x̄,   µ̂y = ȳ,   σ̂² = (1/(n + m)) ( Σ_{i=1}^n (x_i − x̄)² + Σ_{i=1}^m (y_i − ȳ)² ).

To find maxθ∈Θ1 L we are looking for the MLEs of µx , σ2x , µ y , σ2y . These are

µ̂x = x̄,   σ̂x² = (1/n) Σ_{i=1}^n (x_i − x̄)² := s′x²,
µ̂y = ȳ,   σ̂y² = (1/m) Σ_{i=1}^m (y_i − ȳ)² := s′y².
Substituting into the likelihood ratio and simplifying gives

λ = σ̂^{−(n+m)} / ( σ̂x^{−n} σ̂y^{−m} ).

Taking logs,

Λ = −2 log λ = 2(n + m) log(σ̂) − 2n log(σ̂x) − 2m log(σ̂y).

There are 4 parameters; all of them had to be estimated under H1 and 3 under H0. It follows that Λ is approximately χ²₁.
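A numerical sketch of this test, with made-up samples: compute the three variance estimates, form Λ as derived above and compare it with the 95% point of χ²₁.

```python
# A sketch with made-up samples: Lambda = -2 log(lambda) for H0: sigma_x = sigma_y,
# compared with the 95% point of chi-square with 4 - 3 = 1 degree of freedom.
import numpy as np
from scipy import stats

x = np.array([9.1, 10.4, 8.7, 11.2, 9.9, 10.6, 9.4])    # hypothetical sample 1
y = np.array([10.0, 12.8, 7.9, 13.5, 8.1, 11.9])        # hypothetical sample 2
n, m = len(x), len(y)

sx2 = np.mean((x - x.mean()) ** 2)         # MLE of sigma_x^2 under H1
sy2 = np.mean((y - y.mean()) ** 2)         # MLE of sigma_y^2 under H1
s2 = (n * sx2 + m * sy2) / (n + m)         # pooled MLE of sigma^2 under H0

Lambda = (n + m) * np.log(s2) - n * np.log(sx2) - m * np.log(sy2)
print(Lambda, stats.chi2.ppf(0.95, df=1))  # reject H0 if Lambda exceeds the critical value
```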

6.3 Goodness-of-fit tests
A common likelihood-ratio based test is the goodness-of-fit test.

6.3.1 Categories
Suppose we have an experiment which has k mutually exclusive outcomes which we will
label A 1 , A 2 , · · · , A k . Suppose further we repeat our experiment n times and find that the
number of outcomes in A j is n j , j = 1, 2, · · · , k. In addition we will write the probability
of an outcome falling in A_j as p_j, j = 1, 2, ⋯, k. Clearly Σ_{j=1}^k n_j = n and Σ_{j=1}^k p_j = 1.

This simple model has many applications, you could think of asking questions in a
survey with the A j as categories of answers, or the A j could correspond to the bins of a
histogram. To proceed any further we need a bit more theory.

6.3.2 The multinomial distribution


If we have the situation above of k mutually exclusive and exhaustive categories A_j, j = 1, 2, ⋯, k,
and we have
n 1 observations in A 1 and P[fall in A 1 ] = p 1
n 2 observations in A 2 and P[fall in A 2 ] = p 2
···
n j observations in A j and P[fall in A j ] = p j
···
n k observations in A k and P[fall in A k ] = p k
Then
P[n₁ in A₁, ⋯, n_k in A_k] = ( n!/(n₁! n₂! ⋯ n_k!) ) p₁^{n₁} p₂^{n₂} ⋯ p_k^{n_k}     (6.4)
It is reasonably clear that this is an extension of the Binomial and the distribution is
known as the multinomial distribution.


6.3.3 Maximum likelihood estimators of multinomial


We can find the maximum likelihood estimators by maximizing the log likelihood

ℓ(p) = Σ_{j=1}^k n_j log(p_j) + constant

subject to Σ_{j=1}^k p_j = 1. The constant is of course log(n!) − Σ_{j=1}^k log(n_j!).

Then consider the function

Φ(p, λ) = Σ_{j=1}^k n_j log(p_j) + λ( Σ_{j=1}^k p_j − 1 ) + constant.

The MLEs will be derived by solving the following system of equations,

∂Φ/∂p₁ = n₁/p₁ + λ = 0
∂Φ/∂p₂ = n₂/p₂ + λ = 0
  ⋮
∂Φ/∂p_k = n_k/p_k + λ = 0
∂Φ/∂λ = Σ_{j=1}^k p_j − 1 = 0

This is equivalent to

n₁/p₁ = n₂/p₂ = ⋯ = n_k/p_k = Σ n_j / Σ p_j = n/1.

Hence it is easy to show that

p̂_j = n_j/n.
Of course you could say that an observation either falls in A_j or not - a Binomial problem. The maximum likelihood estimate of p_j is just that for the Binomial probability, p̂_j = n_j/n.

6.3.4 Likelihood ratio for the multinomial


Suppose we now wish to test

H0 : P (A j ) = p j , j = 1, 2, · · · , k against H1 : probabilities are unspecified

The likelihood ratio is in general

λ = Π_{j=1}^k p̂_j^{n_j} / Π_{j=1}^k p̆_j^{n_j} = Π_{j=1}^k ( p̂_j / p̆_j )^{n_j}

where the p̂_j are the estimates under H0 and the p̆_j are the estimates under H1. Since the probabilities are unspecified under H1 we need their estimates; however we know that

p̆_j = n_j/n,   j = 1, 2, ⋯, k,

so

λ = Π_{j=1}^k p̂_j^{n_j} / Π_{j=1}^k (n_j/n)^{n_j} = Π_{j=1}^k ( n p̂_j / n_j )^{n_j}

or

Λ = −2 log λ = 2 Σ_{j=1}^k n_j log( n_j / (n p̂_j) ).
Our statistic is often written as the asymptotically equivalent form
X² = Σ_{j=1}^k (n_j − ê_j)²/ê_j,   where ê_j = n p̂_j.     (6.5)

Note here that if the probabilities p_j under the null hypothesis are known and do not need to be estimated, then the chi-square statistic reduces to the form

X² = Σ_{j=1}^k (n_j − e_j)²/e_j,   where e_j = n p_j.     (6.6)

Example 6.3.1. An experimenter bred flowers and found the following numbers in each
of the four possible classes

Class A1 A2 A3 A4
Number 120 48 36 13
Probability 9/16 3/16 3/16 1/16

The table also includes the probability of falling in each class - according to established
theory. We aim to test

H0 : probabilities are as given by the table   against   H1 : probabilities are unspecified.

In the following table we have calculated the expected frequencies:

class   p_j    n_j   e_j = np_j
1 9/16 120 122.06
2 3/16 48 40.69
3 3/16 36 40.69
4 1/16 13 13.56

The likelihood ratio, since the probabilities under the null hypothesis are given, is
λ = Π_{j=1}^k ( n p_j / n_j )^{n_j}

and we find, after some arithmetic using Eq. (6.6), that Λ is approximately 1.9. We know that Λ = −2 log(λ) is approximately chi-squared. Here we have k − 1 = 3 unspecified parameters under H1 and none under H0, so the number of degrees of freedom is 3. We will reject H0 if Λ is large, i.e., if it exceeds the 95% point of χ²₃, which is 7.815. In this case we accept H0.
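The arithmetic of this example is easy to reproduce. The sketch below uses scipy.stats.chisquare for the Pearson form of Eq. (6.6) and also computes Λ = −2 log λ directly; both are far below 7.815.

```python
# The flower counts of Example 6.3.1: Pearson statistic of Eq. (6.6) via scipy and
# the -2 log(lambda) form computed directly; both are well below 7.815.
import numpy as np
from scipy import stats

observed = np.array([120, 48, 36, 13])
p = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * p

X2, pval = stats.chisquare(observed, expected)                 # df = k - 1 = 3
Lambda = 2 * np.sum(observed * np.log(observed / expected))    # -2 log(lambda)
print(X2, Lambda, pval)
```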

6.3.5 Goodness of fit–Non multinomial distribution


We can use the ideas above for goodness-of-fit testing for various theoretical distributions. We need to split the x axis into k intervals (classes) A₁, A₂, . . . , A_k and calculate P(A₁), P(A₂), . . . , P(A_k) using the theoretical distribution.
Example 6.3.2. Suppose we have 200 random numbers which are supposed to be from a U(0, 1) distribution. The numbers in the intervals (0-0.1), (0.1-0.2), etc., are
class 0-0.1 0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6
frequency 19 18 20 16 26 18
class 0.6-0.7 0.7-0.8 0.8-0.9 0.9-1
frequency 19 19 23 22
This is just a test of

H0 : p_j = 0.1 for all j   against   an unspecified alternative.

We find that Λ = 2 Σ_{j=1}^k n_j log( n_j / e_j ) = 3.69, where e_j = np_j = 20 (the probabilities under the null hypothesis are given and do not need to be estimated); the asymptotically equivalent Pearson form of Eq. (6.6) gives the similar value 3.8. We know that Λ is χ²_{r−s} = χ²_{9−0} and we accept H0 at 5% since Λ does not lie in the critical region Λ ≥ χ²_{9,0.95} = 16.919, and conclude that the distribution is indeed Uniform.
Example 6.3.3. A survey of families with 5 children gave rise to the following distribution.
No Boys 0 1 2 3 4 5 total
no families 8 40 88 110 56 18 320
One model for the number of boys is the binomial, specifically that the number of boys X
has the distribution
P[X = x] = C(5, x) p^x (1 − p)^{5−x},   x = 0, 1, 2, 3, 4, 5,

where p = P [ boy ].

We can see if this distribution fits the data when p = 1/2. Calculating the expected probabilities,

P[0 boys in 5 births] = C(5, 0) (1/2)⁵ = 1/32
P[1 boy in 5 births] = C(5, 1) (1/2)⁵ = 5/32
⋯ ,

we obtain the following table,

No Boys 0 1 2 3 4 5 total
no families (n j ) 8 40 88 110 56 18 320
expected (e j = np j ) 10 50 100 100 50 10
The statistic of Eq. (6.6) is Λ = 11.96 and the degrees of freedom are (6 − 1) − 0 = 5: we only need to estimate 5 of the probabilities under H1 since the remaining one follows from the fact that they must add up to one. The upper 5% point of χ²₅ is 11.07, so in this case we reject H0 and conclude our model is wrong.
We know however that in general P[boy] > 1/2. Indeed, from our data the proportion of boys is p̂ = 0.5375. We can revisit our Binomial model, but in this case we use p̂ = 0.5375. The expectations are harder to compute - I get
No Boys 0 1 2 3 4 5 total
no families (n j ) 8 40 88 110 56 18 320
expected (ê j = n p̂ j ) 6.8 39.3 91.5 106.3 61.8 14.4 320
and Λ ≈ 1.9. We now have (6 − 1) − 1 = 4 degrees of freedom: we estimate 5 parameters under H1 and one under H0. The conclusion is that this fits the data very well.
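A sketch of this fit in code: estimate p from the data, build the expected counts from B(5, p̂) and compute the Pearson statistic of Eq. (6.5), to be compared with the 95% point of χ²₄.

```python
# Fit of B(5, p-hat) to the family data: p is estimated from the data, so the
# degrees of freedom are (6 - 1) - 1 = 4.
import numpy as np
from scipy import stats

counts = np.array([8, 40, 88, 110, 56, 18])        # families with 0..5 boys
k = np.arange(6)
n_fam = counts.sum()                               # 320 families
p_hat = np.sum(k * counts) / (5 * n_fam)           # 0.5375

expected = n_fam * stats.binom.pmf(k, 5, p_hat)    # e-hat_j = n p-hat_j
X2 = np.sum((counts - expected) ** 2 / expected)   # statistic of Eq. (6.5)
print(X2, stats.chi2.ppf(0.95, df=4))              # accept H0 if X2 is below the critical value
```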
Example 6.3.4. According to the data in the following table, can we assume that the parent distribution is normal?
classes frequencies
≤ 39.5 6
39.5–44.5 13
44.5–49.5 40
49.5–54.5 65
54.5–59.5 52
≥ 59.5 24

For testing the null hypothesis that the data follow a normal distribution against the alternative that they follow any other distribution we need to:
1. Derive the maximum likelihood estimators of µ and σ², i.e.,

µ̂ = x̄ = (1/n) Σ n_i x_i,   σ̂² = s′² = (1/n) ( Σ n_i x_i² − (Σ n_i x_i)²/n ),

where x_i is the centre of each interval and n_i the observed frequency. The centres of the classes are 37, 42, 47, 52, 57, 62 (e.g., for the class 44.5–49.5 we have (44.5 + 49.5)/2 = 47). After some calculations,

µ̂ = 52.4 and σ̂² = 36.77 = 6.06².

2. Calculate the estimated probabilities p̂_j from N(µ̂, σ̂²) = N(52.4, 36.77). For example,

P(x ≤ 39.5) = P( (x − 52.4)/6.06 ≤ (39.5 − 52.4)/6.06 ) = P(z ≤ −2.13) = 0.5 − P(0 < z < 2.13) = 0.0166.

3. Calculate the expected (estimated) frequencies ê_j = n p̂_j. For example, ê₁ = 200 × 0.0166.
So we derive the following table,


classes nj p̂ j ê j = n p̂ j
≤ 39.5 6 0.0166 3.32
39.5–44.5 13 0.0802 16.04
44.5–49.5 40 0.2188 43.76
49.5–54.5 65 0.3212 64.24
54.5–59.5 52 0.2422 48.44
≥ 59.5 24 0.1210 24.20
total 200 1 200

One useful rule of thumb is to try and ensure that none of the expected values are less than 5. This is a rather mysterious number, but if we examine our chi-squared approximation carefully we see it breaks down when we have small expected values. Therefore we merge the classes ≤ 39.5 and 39.5–44.5, i.e., we finally have 5 classes.
According to the table we find, using Eq. (6.5), that Λ = Σ_{j=1}^5 (n_j − ê_j)²/ê_j = 0.6021. We know that Λ is χ²_{r−s} = χ²_{4−2} and we accept H0 at 1% since Λ does not lie in the critical region Λ ≥ χ²_{2,0.99} = 9.210, and conclude that the distribution is indeed normal.
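The whole procedure of Example 6.3.4 can be scripted as below. The grouped-data estimates here follow the MLE formula of step 1, so the numbers differ from the notes only by rounding and the divisor used for σ̂²; the conclusion is unchanged.

```python
# Example 6.3.4 scripted: grouped-data estimates of mu and sigma^2, estimated class
# probabilities from N(mu-hat, sigma-hat^2), first two classes merged, 4 - 2 = 2 df.
import numpy as np
from scipy import stats

centres = np.array([37, 42, 47, 52, 57, 62])
freq = np.array([6, 13, 40, 65, 52, 24])
n = freq.sum()

mu = np.sum(freq * centres) / n
sigma2 = (np.sum(freq * centres ** 2) - np.sum(freq * centres) ** 2 / n) / n
sigma = np.sqrt(sigma2)

bounds = np.array([-np.inf, 39.5, 44.5, 49.5, 54.5, 59.5, np.inf])
p = np.diff(stats.norm.cdf(bounds, loc=mu, scale=sigma))   # estimated class probabilities

obs = np.array([freq[0] + freq[1], *freq[2:]])             # merge the two smallest classes
exp = n * np.array([p[0] + p[1], *p[2:]])
Lam = np.sum((obs - exp) ** 2 / exp)
print(mu, sigma2, Lam, stats.chi2.ppf(0.99, df=2))         # accept H0 if Lam < 9.210
```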


Exercises
1. Suppose we are given a random sample X 1 , X 2 , · · · , X n from a N(µ, σ2 ) distribution,
where σ2 is known. Find the critical region of significance level α for testing

H0 : µ = µ0   against   H1 : µ ≠ µ0.

2. Suppose we have X 1 , X 2 , · · · , X n a random sample from a Poisson distribution, say


P(m 1 ). We also have a second, independent, random sample Y1 , Y2 , · · · , Yn from
P(m 2 ). Find the test for testing

H0 : m₁ = m₂   against   H1 : m₁ ≠ m₂.

3. We toss a die 60 times. The results and their frequencies are in the following table,

result     1   2   3   4   5   6
frequency  13  19  11  8   5   4

Test the null hypothesis H0 : p_j = 1/6, j = 1, . . . , 6.

4. The table below is showing the number of accidents in a year for 400 drivers.

number of accidents 0 1 2 3 4 5 6 7
frequency 89 143 94 42 20 8 3 1

Would it be reasonable to assume that the distribution is Poisson?

Chapter 7

Nonparametric inference

In much inference and almost all that you have met so far we have assumed that the
form of the distributions f (x, θ1 , . . . θp ) is known except for some parameters θ1 , . . . θp . In
some cases we may feel that actually specifying the form of f (x, θ1 , . . . θp ) is unrealistic.
This is often so when the sample size is small. Our aim is to construct methods of inference which do not make distributional assumptions, or which are relatively insensitive to these assumptions. This leads us to nonparametric inference and to robust methods, as opposed to the classical parametric methods that you met in previous chapters. Another possibility is computationally intensive methods such as the bootstrap.

7.1 The Kolmogorov-Smirnov test


As we have seen, the chi-square goodness-of-fit tests are based on the frequencies of the observations in the different classes and not on the actual values of the data. Therefore not all of the information in the data is used. The goodness-of-fit test of Kolmogorov-Smirnov uses all the information in the raw data and applies successfully to continuous distributions; it is better than the chi-square goodness-of-fit test for small samples.
Suppose that we have a random sample X₁, . . . , X_n. We wish to test the null hypothesis that the random sample is drawn from a theoretical distribution against the alternative that it is not drawn from this distribution. The Kolmogorov-Smirnov statistic is

D_n = max_x |F_n(x) − F_0(x)|,

where F_n(x) is the empirical distribution and F_0(x) is the theoretical distribution under the null hypothesis. The empirical distribution is given by

F_n(x) = (1/n) Σ_{i=1}^n 1(X_i ≤ x) = (number of x_i ≤ x)/n.
Then the critical region of significance level α for testing

H0 : F(x) = F_0(x)   against   H1 : F(x) ≠ F_0(x)

is D_n ≥ D_{α,n}. The critical values of significance level α can be obtained using the following approximation formulas,

α        0.20      0.15      0.10      0.05      0.01
D_{α,n}  1.07/√n   1.14/√n   1.22/√n   1.36/√n   1.63/√n

Example 7.1.1. For example suppose we have the following data set of 10 observations.

31.0 31.4 33.3 33.4 33.5 33.7 34.4 34.9 36.2 37.0

Would it be reasonable to assume that the observations are drawn from N (µ = 32, σ2 =
3.24)?
We will use the Kolmogorov-Smirnov goodness-of-fit statistic to check whether the observations are drawn from N(µ = 32, σ² = 3.24). The critical region of significance level α is

D n ≥ D a,n where D n = max |Fn (x) − F0 (x)|.

Given that F0 (x) is the cumulative distribution function of the N (µ = 32, σ2 = 3.24),
F_0(x) = P(X ≤ x) = P( (X − µ)/σ ≤ (x − 32)/√3.24 ) = P( Z ≤ (x − 32)/1.80 )

and using the statistical tables we have the following results,


x 31.00 31.40 33.30 33.40 33.50 33.70 34.40 34.90 36.20 37.00
F0 (x) 0.29 0.37 0.76 0.78 0.80 0.83 0.91 0.95 0.99 1.00
The empirical distribution is

number of xi ≤ x
F10 (x) =
10
and is given in the following table,
x 31.00 31.40 33.30 33.40 33.50 33.70 34.40 34.90 36.20 37.00
F10 (x) 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00
The Kolmogorov-Smirnov goodness-of-fit statistic is,

D_n = max |F₁₀(x) − F_0(x)| = 0.4642,

while D_{0.05,10} = 1.36/√10 = 0.43. Therefore we reject the null hypothesis that the observations are drawn from N(µ = 32, σ² = 3.24).
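scipy provides this test directly, as in the sketch below. Note that scipy.stats.kstest takes the supremum over both one-sided limits of F_n, so its D can exceed the simple right-limit value 0.4642 computed above; the decision to reject is the same.

```python
# Example 7.1.1 with scipy.stats.kstest against the N(32, 1.8^2) cdf.
import numpy as np
from scipy import stats

x = np.array([31.0, 31.4, 33.3, 33.4, 33.5, 33.7, 34.4, 34.9, 36.2, 37.0])
D, pval = stats.kstest(x, stats.norm(loc=32, scale=1.8).cdf)
print(D, pval, D >= 1.36 / np.sqrt(len(x)))   # True => reject H0 at the 5% level
```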

7.2 Permutation Tests


The basic idea is quite simple, we find a plausible statistic, say T , for the hypotheses
of interest and a family of permutations of the data such that the probability of each
permutation can be found under the null hypothesis. Usually for simplicity we choose
permutations which are equally likely under the null hypothesis. We then find the values
of T under all possible permutations and then using this empirical distribution either
find a critical region and decide whether to accept H0 , or find the p-value, and make
our decision using this value.


Example 7.2.1. For example suppose we have the following data set of 12 observations.

2.29828 0.59061 1.39212 2.67940 0.55242 0.05688


2.42898 4.14296 0.01564 2.82009 0.79828 1.33614
We wish to test whether the mean of the population is 1, say

H0 : µ = 1   against   H1 : µ ≠ 1.

Subtracting 1, the hypothesized value of the mean from each value gives a set of deviations
(d i , i = 1, . . . , 12)

1.29828 -0.40939 0.39212 1.67940 -0.44758 -0.94312


1.42898 3.14296 -0.98436 1.82009 -0.20172 0.33614

One choice of statistic for our test is T = |d̄|, the absolute value of the mean of the deviations above.
Now we set up the permutations of the (modified) data:
• We first take the absolute values of all the deviations; this will give a new value for
T.

• Under the null hypothesis that the median of the population is 1, the modified data are symmetric about 0, i.e., each of these deviations is equally likely to be positive or negative. Therefore a rearrangement of the data is an assignment of a positive or negative sign to each observation. In total there are 2^n permutations (rearrangements):

1. Assign one negative sign to each value in turn and obtain n values for T.
2. Assign two negative signs to all pairs of values and obtain more values for T.
3. Assign minus signs to triples, etc., until we have the T value for all 2^n sign assignments.

• If we assume, as seems reasonable under the null hypothesis that the value of T from
each permutation of signs is equally likely we have a distribution of T values. We
may then choose between the hypotheses using the permutation distribution of T .
Thus if we had a sample of only 4 deviations:
1.29828 -0.40939 0.39212 1.67940
we have the 16 permutations and their corresponding values for T , see Table 7.1.
The actual sample value of T was T_obs = 0.74010, which exceeds 10 of the 16 values, or is in the top 6 of the 16. From the definition of the p-value, i.e., the probability of getting a value of the test statistic as extreme as or more extreme than that observed, by chance alone, on the assumption that the null hypothesis H0 is true, we have that

p-value = P(T ≥ T_obs | H0) = (number of permutations with T ≥ T_obs) / (total number of permutations).

In our example p-value = 6/16 = 0.375, which is greater than the significance level of the test α, say 20%, and thus we accept H0. Note in passing that the parametric test for the above problem is the one-sample t-test.
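A sketch of the sign-flip permutation test for the 4 deviations of Table 7.1; the same code with repeat=12 would enumerate all 2¹² permutations of the full data set.

```python
# Sign-flip permutation test for the 4 deviations of Table 7.1.
import itertools
import numpy as np

d = np.array([1.29828, -0.40939, 0.39212, 1.67940])
abs_d = np.abs(d)
T_obs = abs(d.mean())                                   # 0.74010

T_perm = [abs(np.mean(np.array(signs) * abs_d))
          for signs in itertools.product([1, -1], repeat=len(d))]
p_value = np.mean(np.array(T_perm) >= T_obs)
print(T_obs, p_value)                                   # p-value = 6/16 = 0.375
```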


Permutation T
1.29828 0.40939 0.39212 1.6794 0.94480
-1.29828 0.40939 0.39212 1.6794 0.29566
1.29828 -0.40939 0.39212 1.6794 0.74010
1.29828 0.40939 -0.39212 1.6794 0.74874
1.29828 0.40939 0.39212 -1.6794 0.10510
-1.29828 -0.40939 0.39212 1.6794 0.09096
-1.29828 0.40939 -0.39212 1.6794 0.09960
-1.29828 0.40939 0.39212 -1.6794 0.54404
1.29828 -0.40939 -0.39212 1.6794 0.54404
1.29828 -0.40939 0.39212 -1.6794 0.09960
1.29828 0.40939 -0.39212 -1.6794 0.09096
-1.29828 -0.40939 -0.39212 1.6794 0.10510
-1.29828 -0.40939 0.39212 -1.6794 0.74874
-1.29828 0.40939 -0.39212 -1.6794 0.74010
1.29828 -0.40939 -0.39212 -1.6794 0.29566
-1.29828 -0.40939 -0.39212 -1.6794 0.94480

Table 7.1: All possible permutations.

Example 7.2.2. Five patients were submitted to two different treatments A and B. The medics then rate their health status on a scale. The observations (ratings) are:
Treatment A (2 patients): 1, 2
Treatment B (3 patients): 3, 5, 9
Is there any difference between the two treatments?
In other words, we need to test whether the two mean values are different or not,

H0 : µ_A = µ_B   against   H1 : µ_A ≠ µ_B.

Under the null hypothesis the two treatments are the same and it does not matter which treatment was given to each patient. Therefore if we randomly rearrange the patients between the two treatments the results will be much the same. Furthermore, all rearrangements are equally likely under the null hypothesis. One choice of statistic for our test is T = |ā − b̄|, where ā, b̄ are the mean values of the ratings for the two treatments. If the null hypothesis is true the statistic is close to 0, while under the alternative hypothesis the statistic takes values away from 0.
We now look at rearrangements of the observed data. One possible rearrangement is 1,3 in the first sample and 2,5,9 in the second. For each rearrangement we compute the value of T. Note that there are C(5, 3) = 10 such rearrangements; see Table 7.2. Under the null hypothesis (that all observations come from the same distribution) all 10 rearrangements are equally likely, each with probability 1/10.

treatment A treatment B ā b̄ T
1,3 2,5,9 2 5.33 3.33
1,5 2,3,9 3 4.66 1.66
1,9 2,3,5 5 3.33 1.66
1,2 3,5,9 1.5 5.66 4.17
2,3 1,5,9 2.5 5 2.5
2,5 1,3,9 3.5 4.33 0.83
2,9 1,3,5 5.5 3 2.5
3,5 1,2,9 4 4 0
3,9 1,2,5 6 2.66 3.34
5,9 1,2,3 7 2 5

Table 7.2: All possible rearrangements of the patients in the two treatment groups.

The actual sample value of T was T_obs = 4.17, which exceeds 8 of the 10 values, or is in the top 2 of the 10. From the definition of the p-value, i.e., the probability of getting a value of the test statistic as extreme as or more extreme than that observed, by chance alone, on the assumption that the null hypothesis H0 is true, we have that

p-value = P(T ≥ T_obs | H0) = (number of permutations with T ≥ T_obs) / (total number of permutations).

In our example p-value = 2/10 = 0.2, which is greater than the significance level of the test α, say 10%, and thus we accept H0. Note in passing that the parametric test for the above problem is the two-sample t-test.
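The 10 rearrangements of Table 7.2 can be enumerated with itertools.combinations, as in the sketch below.

```python
# Two-sample permutation test for Example 7.2.2: all C(5, 3) = 10 splits of the
# five ratings into groups of sizes 2 and 3.
import itertools
import numpy as np

data = np.array([1, 2, 3, 5, 9])                    # A observed {1, 2}, B observed {3, 5, 9}
T_obs = abs(np.mean([1, 2]) - np.mean([3, 5, 9]))   # about 4.17

T_perm = []
for b_idx in itertools.combinations(range(5), 3):   # which 3 observations form group B
    b = data[list(b_idx)]
    a = np.delete(data, list(b_idx))
    T_perm.append(abs(a.mean() - b.mean()))

p_value = np.mean(np.array(T_perm) >= T_obs)
print(T_obs, p_value)                               # p-value = 2/10 = 0.2
```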

Some interesting remarks are the following:

1. In the definition of the p-value we have used the probability P(T ≥ T_obs | H0). Generally speaking this is not always appropriate, since small values of the statistic may also lead to rejection, i.e., we may also reject when T ≤ T_obs (two-tailed test). We have avoided these cases by choosing the test statistic appropriately, using the absolute value, so that we reject for large values of the statistic.

2. These permutation computations are only practical for small data sets. The obvious drawback of this test is the amount of computation required. For the one-sample problem with n observations we require 2^n permutations, while for the two-sample problem with m and n observations in the samples there are

C(m + n, m) = C(m + n, n)

possible rearrangements, each with its own value of the statistic.

3. We can however come up with a very much simpler procedure at the cost of some precision. A recent suggestion is that we do not look at all permutations, but rather at a randomly chosen subset of them.

7.3 The sign test


The only assumption of the test is that the probability that an observation is greater than δ0 equals the probability that it is smaller than δ0. That is,

P(X > δ0) = P(X < δ0) = 1/2.

In other words, δ0 is the median of the population.
Suppose we wish to test

H0 : δ = δ0

against

H1 : δ > δ0,   δ < δ0,   δ ≠ δ0.

We transform the data as follows: if a value exceeds δ0 we call it a plus, while if it lies below δ0 we call it a minus. If a value is equal to δ0 then we disregard this observation. Under H0 the probability that a value lies above the median δ0, and hence maps to a plus, is 0.5, and the corresponding probability that we have a minus is also 0.5. Therefore the set of signs constitutes a sample from a binomial distribution. Finally, our hypotheses are equivalent to
H0 : p = 1/2

against

H1 : p > 1/2,   p < 1/2,   p ≠ 1/2.
The procedure described above is known as the sign test. We test H0 : δ = δ0 by counting the number of observations that exceed the median (the number of plus signs) and the number that lie below the median (the number of minus signs). Then under H0 the number of plus signs out of the total number of signs is binomial.
The test is based on the number of positive or negative signs, since we work with the deviations from the median, x_i − δ0. Under H0 the probability of a plus, or the proportion of positive signs, is 0.5, and so the significance levels (p-values) can be calculated as indicated in the following steps:

1. Take the plus signs and suppose there are k of these.

2. For H1 : p > 1/2:
   p-value = P(X ≥ k | H0 is true),
   for H1 : p < 1/2:
   p-value = P(X ≤ k | H0 is true),
   for H1 : p ≠ 1/2: if k > n/2,
   p-value = P(X ≥ k | H0 is true) + P(X ≤ n − k | H0 is true) = 2 × P(X ≥ k | H0 is true) = 2 × P(X ≤ n − k | H0 is true),
   else
   p-value = 2 × P(X ≥ n − k | H0 is true) = 2 × P(X ≤ k | H0 is true).
   The probabilities are calculated using the binomial distribution B(n, p₀ = 1/2).

3. If the p-value is smaller than the significance level of the test α reject H0 , other-
wise accept H0 .

Example 7.3.1. Suppose we have the following data set of 12 observations which comes
from a non-normal distribution:

2.29 0.59 1.39 2.67 0.55 0.05


2.42 4.14 0.01 2.82 0.79 1.33

We wish to test whether the median of the population is 1, say

H0 : δ = 1 against H1 : δ > 1.

We transform the data as follows, if a value exceeds 1 we call it a plus while if it lies below
1 we call it a minus. Of our data we have 7 plus signs and 5 minus signs.
The p-value is

P(X ≥ k | p = 1/2) = P(X ≥ 7 | p = 1/2) = 0.387

and so we conclude that there is no reason to reject the null hypothesis that the median is 1.
Note: for the two-tailed test (H1 : p ≠ 1/2), p-value = 2 × 0.387.
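The binomial tail probability can be obtained from scipy, as in this minimal sketch:

```python
# Sign test of Example 7.3.1: 7 plus signs out of 12, one-sided p-value P(X >= 7).
from scipy import stats

n, k = 12, 7
p_one_sided = stats.binom.sf(k - 1, n, 0.5)   # P(X >= k) = 1 - P(X <= k - 1)
print(round(p_one_sided, 3))                  # 0.387
```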

Suppose that the sample size n is large (greater than 25). Then we use the normal approximation to the Binomial as follows: if X is the number of plus signs and p̂ = X/n the proportion of plus signs, then the test statistic

z = (X − np)/√(np(1 − p)) = (X − n/2)/√(n/4) = (p̂ − 1/2)/√(1/(4n))

is standard normal under the null hypothesis.
We may then base a test of H0 : p = 0.5 on z. Given α we have z 1−α and z 1−α/2 from
normal tables and

• For H1 : δ < δ0 or H1 : p < 0.5 reject H0 if z ≤ −z 1−α

• For H1 : δ > δ0 or H1 : p > 0.5 reject H0 if z ≥ z 1−α

• For H1 : δ 6= δ0 or H1 : p 6= 0.5 reject H0 if |z| ≥ z 1−α/2

Example 7.3.2. Suppose we have the following data set of 30 observations which comes
from a non-normal distribution:

19.5 19.8 18.9 20.4 20.2 21.5 19.9 20 18.1 20.5


18.3 19.5 18.3 19 21.7 19.5 18.2 23.9 17.5 19.7
20.9 25.3 18.4 19 23.2 17.8 24.2 19.4 20.3 23.2

Test with significance level α = 0.05 the following hypotheses,

H0 : δ = 20.8 against H1 : δ > 20.8.


We transform the data as follows, if a value exceeds 20.8 we call it a plus while if it
lies below 20.8 we call it a minus. Of our original deviations we have 9 plus signs and 21
minus signs. Our hypotheses are equivalent to,

H0 : p = 1/2   against   H1 : p > 1/2.

The test statistic is Z = (p̂ − 1/2)/√(1/(4n)) ∼ N(0, 1) under H0. We find that z = −2.19 and we accept H0 at 5% since z does not lie in the critical region z ≥ z_{0.95} = 1.64.

You will notice that the sign test makes few assumptions about the form of the parent
distribution and in consequence it is a valuable tool when we have non normal distri-
butions. However it is probably pretty clear that we must pay a price for the versatility
of the test. Indeed as we replace numerical values with indicators, the signs, it seems
inevitable that there will be some loss of sensitivity.

7.4 Wilcoxon’s Signed Rank Test


Suppose we wish to test the same set of hypotheses,

H0 : δ = δ0

against
H1 : δ > δ0,   δ < δ0,   δ ≠ δ0.
In the 1940’s Frank Wilcoxon came up with a modification of the sign test. The test is now
known as Wilcoxon’s Signed Rank Test. Wilcoxon suggested that rather than map the
deviations between the hypothetical median and the observations onto just two values
some assessment should be made of the magnitude of these deviations. He used the
rank of the deviations.
Wilcoxon took the view that a way to improve the test was to take into account the
size of deviation from the median. The procedure he adopted was to take the deviations
and give them a rank order ignoring the sign. Thus the smallest deviation gets rank 1, the next smallest rank 2, and so on. Deviations equal to zero are eliminated.
Now if the suggested median is the true one then the deviations should be scattered
around zero and the sum of the positive ranks R+ should be much the same as the sum
of the negative ranks R−. To test the hypothesis Wilcoxon actually worked out the probability distribution of W = min(R+, R−) on the assumption
H0 : the deviations have a symmetrical parent distribution with median zero,
and hence calculated the percentage points, a table of which is available.
The procedure is thus

1. Ignoring the sign assign to each difference its rank. If two or more values have the
same rank then the average rank is assigned to each.

2. Given the ranks in (1) form the signed ranks by giving a negative sign to those ranks
which come from negative values.


3. Compute
R + = sum of the positive ranks R − = sum of the negative ranks

4. (a) For a one tailed test, where the alternative hypothesis is that the median is
more than a given value, the test statistic W is R − .
(b) For a one tailed test, where the alternative hypothesis is that the median is
less than a given value, the test statistic W is R + .
(c) For a two tailed test the test statistic W is the minimum of R + and R − .

5. Reject H0 if W ≤ W0 , where W0 is the critical value given in Table 12 of the Statisti-


cal Tables.

Now check that R+ + R− is the same as n(n + 1)/2, where n is the number in the sample (having ignored the zeros). Given the tables the test is simple, the only assumption being that the parent distribution is symmetrical. If this is not true then the sign test is your best bet.
Example 7.4.1. Suppose we have the following data set of 10 observations
131 127 118 135 117 112 132 120 137 113

We suspect a median of 121, so we test

H0 : δ = 121   against   H1 : δ ≠ 121.

We first get the deviations from the median (under H0 )


10 6 -3 14 -4 -9 11 -1 16 -8

with 5 positive and 5 negative signs. The sign test would accept H0 . The data and the
ranks e.t.c., are laid out in Table 7.3.

Data Deviation |Deviation| Rank Signed rank


131 10 10 7 7
127 6 6 4 4
118 -3 3 2 -2
135 14 14 9 9
117 -4 4 3 -3
112 -9 9 6 -6
132 11 11 8 8
120 -1 1 1 -1
137 16 16 10 10
113 -8 8 5 -5

Table 7.3: The data and the ranks of Example 7.4.1.

For our data R+ = 38 and R− = 17. Therefore W = min(R+, R−) = 17. Table 12 of the Statistical Tables indicates that H0 would be accepted for α = 0.05 (W0 = 9).
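scipy implements the signed rank test; a sketch for this example is given below. Recent scipy versions report min(R+, R−) as the statistic for the default two-sided alternative.

```python
# Wilcoxon signed rank test of Example 7.4.1 on the deviations from 121.
import numpy as np
from scipy import stats

x = np.array([131, 127, 118, 135, 117, 112, 132, 120, 137, 113])
W, pval = stats.wilcoxon(x - 121)             # two-sided by default
print(W, pval)                                # W = 17 = min(R+, R-), so accept H0 at 5%
```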


For large sample sizes n we use a normal approximation, under the null hypothesis that the deviations D_i are independent and symmetrically distributed about zero. Then we have that

E[R+] = n(n + 1)/4
var(R+) = n(n + 1)(2n + 1)/24

and R+ (or R−) is asymptotically normal. Then the z statistic

z = (R+ − E(R+))/√var(R+) = (R+ − n(n + 1)/4)/√( n(n + 1)(2n + 1)/24 )

can be used as a test statistic and we reject H0 when z > z_{1−α}, z < −z_{1−α}, |z| > z_{1−α/2}, respectively.

7.5 Paired Tests


The most common situation where the Wilcoxon or sign tests are used is when we have paired samples. Examples might be measurements made before and after treatment on the same item, or paint lifetime on pairs of doors on the same car. Pairing is always a good idea as it increases efficiency dramatically.

Example 7.5.1. Suppose we have the following data set of 10 observations.

No. Before After Difference (d)


1 251 261 10
2 247 292 45
3 308 317 9
4 258 253 -5
5 267 271 4
6 256 305 49
7 230 238 8
8 268 320 52
9 269 267 -2
10 275 281 6

We could use the sign test or better still Wilcoxon’s test on the differences. Our null

H0 : differences have zero median

and the alternative


H1 : differences have non-zero median
The procedure is to take the two samples and compute the differences. Then H0: no difference in medians, is the same as assuming that the differences have zero median, and we can do a Wilcoxon test on the differences, giving


No. Before After Difference (d) Rank Signed Rank


1 251 261 10 7 7
2 247 292 45 8 8
3 308 317 9 6 6
4 258 253 -5 3 -3
5 267 271 4 2 2
6 256 305 49 9 9
7 230 238 8 5 5
8 268 320 52 10 10
9 269 267 -2 1 -1
10 275 281 6 4 4

Now R+ = 51 while R− = 4. Therefore W = min(R+, R−) = 4. Table 12 of the Statistical Tables indicates that H0 would be rejected for α = 0.05 (W0 = 9).
We could have done a sign test since, as is clear, we have 8 plus signs in 10. The p-value is

2 × P(X ≥ 8) = 2 × P(X ≤ 2) = 2 × 0.0547 = 0.1092.

This leads to acceptance of H0, illustrating the loss of efficiency which occurs with the sign test. We usually prefer the greater efficiency of Wilcoxon's test compared with the sign test, assuming a symmetric parent.

7.6 Unpaired Samples - The Mann-Whitney Test


Clearly not all comparisons between samples are made with the assumption of pairing.
The example below is a simple illustration.

Example 7.6.1. Speaking ability for patients in a study of Parkinson’s disease

Had operation 2.6 2 1.7 2.7 2.5 2.6 2.5 3


Had not 1.2 1.8 1.8 2.3 1.3 3 2.2 1.3
1.5 1.6 1.3 1.5 2.7 2

Suppose we have two samples one from population A of size n and one from popu-
lation B of size m (m < n) and we wish to test:

H0 : they come from populations with the same median


against
H1 : they come from populations with different medians
we use Mann-Whitney test as described in the steps below:

1. Rank all the data; for equal values of the observations we assign average ranks, just as for Wilcoxon's signed rank test.

2. Find the sum of the ranks for the smaller sample (with m observations) say R. If
m = n then pick the sample you prefer.


3. Find R ′ = m(m + n + 1) − R.

4. Compute W = min(R, R ′ ).

5. Reject H0 if W is less than the tabulated value in Table 13 or 14 of the Statistical


Tables. The tables give the 5% and 2.5% points of the distribution of W . That is
if the observed value of W (rounded) is less than or equal to the tabulated value
then we reject H0 the hypothesis of equal medians.

Example 7.6.2. Suppose we have two samples one from A of size m = 10 and one from B
of size n = 9,

A 7.05 14.25 8.57 10.5 11.9 4.5 37.6 9.4 8.1 45.2
B 9.25 2.05 2.75 2.5 6.4 6.9 10 8 33

Assign to each observation its rank in the combined set of observations as below

A 7.05 14.25 8.57 10.5 11.9 4.5 37.6 9.4 8.1 45.2
rank 7 16 10 14 15 4 18 12 9 19
B 9.25 2.05 2.75 2.5 6.4 6.9 10 8 33
rank 11 1 3 2 5 6 13 8 17

For the Mann-Whitney test we find the sum of the ranks for each sample: R_A = 124 while R_B = 66. So the smaller sample gives a rank sum of R = 66, giving R′ = 9(9 + 10 + 1) − 66 = 114. This gives W = min(66, 114) = 66, which is not quite significant at 5% (two-sided).

For large samples we can get quite good approximate values for the distribution of the rank sum of the sample of size m using the normal distribution with mean and variance given by the formulas below, for m A's and n B's:

µ = m(m + n + 1)/2   and   σ² = mn(m + n + 1)/12.
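The rank sums and scipy's version of the test, for the data of Example 7.6.2, can be computed as in this sketch. scipy reports the U statistic, which is the rank sum of the first sample minus its minimum possible value, so it is equivalent to the rank-sum W used above.

```python
# Rank sums and Mann-Whitney test for the data of Example 7.6.2.
import numpy as np
from scipy import stats

A = np.array([7.05, 14.25, 8.57, 10.5, 11.9, 4.5, 37.6, 9.4, 8.1, 45.2])
B = np.array([9.25, 2.05, 2.75, 2.5, 6.4, 6.9, 10, 8, 33])

ranks = stats.rankdata(np.concatenate([A, B]))      # ranks in the combined sample
R_B = ranks[len(A):].sum()                          # rank sum of the smaller sample: 66
R_prime = len(B) * (len(A) + len(B) + 1) - R_B      # 114
print(R_B, R_prime, min(R_B, R_prime))

U, pval = stats.mannwhitneyu(B, A, alternative='two-sided')
print(U, pval)                                      # compare the p-value with 0.05
```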

Exercises
1. Suppose we have the following data set of 10 observations.

3.94 4.22 2.60 8.65 0.18 0.43 0.09 8.68 2.64 2.77

Would it be reasonable to assume that the observations are drawn from E xp(λ =
1/3)?

2. New recruits to a call center are given initial training in answering customer calls.
Following this training they are independently assessed on their competence, and
are rated on a score of 1 to 10, 1 representing ‘totally incompetent’ to 10 ‘totally
competent’. It is usual for the trainees’ scores to be symmetrically distributed
about a median of 6. A new trainer has been appointed and the scores of her first
19 trainees are:


6 5 6 9 7 3 4 6 7 2 9 8 7 4 5 6 9 5 7

Is there evidence at the 5% level that the new trainer has made any difference?

3. It is recommended that women should not consume more than 70g of fat per day.
A random sample of 13 student nurses at St Clares were asked to estimate as care-
fully as possible how much fat they ate on one particular day. The results were
(measured in grams)

85 120 45 95 100 50 65 85 105 125 65 49

Is there evidence at the 5% level that student nurses at St Clares are consuming
more fat than they should?

4. Two methods A and B were used to determine the latent heat of fusion of ice. The
table below gives the change in total heat from ice at −72°C to water at 0°C.

A 79.98 80.04 80.02 80.04 80.03 80.03 80.04


79.97 80.05 80.03 80.02 80.00 80.02

B 80.02 79.94 79.98 79.79 79.97 80.03 79.95 79.97

Are the methods comparable?

Chapter 8

Bayesian inference

8.1 Bayes theorem


We begin by refreshing our memories on the subject of conditional probability. We de-
fine the probability of A given B has occurred, written P [A|B] as
P[A|B] = P[A ∩ B]/P[B].
While conditional probabilities can have interesting philosophical implications they also
allow one to do calculations. Thus

P [A] = P [A|B]P [B] + P [A|B c ]P [B c ]


or more generally, if ∪_{i=1}^n B_i = Ω, then

P[A] = Σ_{i=1}^n P[A|B_i] P[B_i].

We also have Bayes' Theorem,

P[B|A] = P[A|B] P[B] / P[A],

which implies that

P[B|A] ∝ P[A|B] P[B].     (8.1)

A slight generalization of Bayes' theorem is obtained by considering events with ∪_{i=1}^k B_i = Ω:

P(B_i|A) = P(A|B_i) P(B_i) / Σ_{j=1}^k P(A|B_j) P(B_j),   i = 1, . . . , k.

Example 8.1.1. In a population at high risk of HIV we try a new diagnostic method. Ten percent of the population is believed to be affected by HIV. The test is positive for 90% of the people who really are affected by HIV and negative for 85% of the people who are not affected by HIV. What are the probabilities of obtaining false negative and false positive results?

Suppose that A is the event that someone is affected by HIV and B the event that the test is positive. Then P(A) = 0.1, P(B|A) = 0.9 and P(B^c|A^c) = 0.85.

P(the test is false positive) = P(A^c|B) = P(B|A^c) P(A^c) / P(B)
  = (1 − P(B^c|A^c)) P(A^c) / ( P(B|A^c) P(A^c) + P(B|A) P(A) )
  = (1 − 0.85) × 0.90 / ( 0.15 × 0.90 + 0.90 × 0.10 ) = 0.6

Accordingly,

P(the test is false negative) = P(A|B^c) = P(B^c|A) P(A) / P(B^c)
  = (1 − P(B|A)) P(A) / ( P(B^c|A^c) P(A^c) + P(B^c|A) P(A) )
  = (1 − 0.9) × 0.10 / ( 0.85 × 0.90 + 0.10 × 0.10 ) = 0.0129

8.1.1 The Continuous Analogue


In the continuous case we usually consider distributions with a joint density. In the two-variable case, with density f(x, y),

P[(X, Y) ∈ C] = ∫∫_C f(x, y) dx dy.

Here we define the marginals as

f_x(x) = ∫ f(x, y) dy   and   f_y(y) = ∫ f(x, y) dx.

The conditional distributions are

f(x|y) = f(x, y)/f_y(y)   and   f(y|x) = f(x, y)/f_x(x).

Thus

f(y|x) = f(x, y)/f_x(x) = f(x|y) f_y(y)/f_x(x),

or f(y|x) is proportional to f(x|y) f_y(y):

f(y|x) ∝ f(x|y) f_y(y).

8.2 Subjective Inference


Despite the careful derivations of frequentist inference many people would argue that
one always starts an experiment with a prior belief and this is then modified by experi-
ence. This subjective view is generally known as Bayesian and can be developed to give
an alternative approach to inference.

Example 8.2.1. Consider three statistical experiments:


1. A music expert claims to be able to distinguish a page of Haydn score from a page of
Mozart score. In all 3 trials, she distinguished correctly.

2. A lady, who adds milk to her tea, claims to be able to tell whether the tea or the milk
was poured into the cup first. In all 3 trials, she correctly determined which was
poured first.

3. A drunkard claims to be able to predict the outcome of a flip of a fair coin. In all 3
trials, he guessed the outcomes correctly.

Our interest is in the probability of the person answering correctly. The distribution in each case is binomial B(3, p). The maximum likelihood estimator is p̂ = X/n = 3/3 = 1. Is this a satisfactory analysis for all three experiments?
There is no reason to doubt the conclusion for Experiment 1, but we have serious doubts for Experiment 3.

Suppose we have some parameter of interest θ. Unlike classical inference we have


some subjective, or prior belief about this parameter. To quantify this belief we construct
a probability distribution f (θ) which we call the prior distribution.
We perform some experiment with the aim of gaining more information about θ and
suppose the result of our experiment is some data

(x1 , x2 , x3 , . . . , xn )T = x

This has a likelihood of the form


f (x|θ)
Note we use the above notation since we perform the experiment given our prior belief
about θ.
We now use Bayes theorem to modify our belief, that is to derive a new distribution
for θ. If θ is continuous
f(θ|x) = f(x|θ) f(θ) / ∫ f(x|θ) f(θ) dθ,

and if θ is discrete,

f(θ|x) = f(x|θ) f(θ) / Σ_j f(x|θ_j) f(θ_j).
Since in the denominator we integrate or sum with respect to θ the denominator is a
function of x. Therefore for given data x, the denominator is constant, the so-called
constant of proportionality. Accordingly an alternative way of presenting the Bayes the-
orem is
f (θ|x) ∝ f (x|θ) f (θ)
where f (θ|x) is the posterior distribution, that is the distribution of θ after our view
has been modified by experiment. This is just an analogue of Bayes' theorem as discussed above: P[A|B] = P[B|A] P[A]/P[B].
The basic idea:


1. Treat θ as a random variable.

2. The prior f (θ) reflects the beliefs on the true value of θ before taking any observa-
tions.

3. The goal of Bayesian inference is to update the beliefs from prior f (θ) to posterior
f (θ|x), after taking observations from a random sample (x1 , x2 , x3 , . . . , x N )T = x.

Example 8.2.2. A coin is tossed 70 times, the number of heads is 34. The probability of
being a head is some unknown value θ. We have some prior belief about θ, i.e., E (θ) = 0.4
and V ar (θ) = 0.02 which we need to quantify. One way of doing this is to use a proba-
bility distribution to quantify our belief. This distribution f(θ), the prior distribution, is assumed to contain our prior beliefs about the parameter θ. A possible distribution for θ is the Beta distribution with density

f(θ) = θ^{a−1} (1 − θ)^{b−1} / Beta(a, b)   for 0 ≤ θ ≤ 1.     (8.2)
Since the mean and variance of the Beta prior distribution are known, i.e.,

E(θ) = a/(a + b) = 0.4,   Var(θ) = ab/( (a + b)² (a + b + 1) ) = 0.02,

we can easily derive the parameters a, b of the Beta distribution by solving the above system of equations. The resulting parameters are

a = (1 − m) m²/u − m   and   b = (1 − m)² m/u − (1 − m),
where m = E(θ) and u = Var(θ). For our example the parameters are a = 4.4 and b = 6.6. When we conduct the experiment of tossing a coin n times we have the probability (the likelihood when x, the number of heads, is known)

f(x|θ) = C(n, x) θ^x (1 − θ)^{n−x}.

Using Bayes' theorem we have

f(θ|x) ∝ C(n, x) θ^x (1 − θ)^{n−x} f(θ),

so if we had chosen

f(θ) = θ^{a−1} (1 − θ)^{b−1} / Beta(a, b)

we would have

f(θ|x) ∝ θ^{a+x−1} (1 − θ)^{n−x+b−1}.

We know that f(θ|x) is a distribution function, and hence

θ|x ∼ Beta(a + x, b + n − x).

The constant of proportionality is obtained easily since we know that the distribution
must integrate to 1. This latter function is the posterior distribution of the parameter θ
given the data - it is our belief given the data.
So if we start with a prior which is

θ ∼ B e t a(4.4, 6.6)

and we toss a coin 70 times and observe 34 heads then

θ|x ∼ B e t a(38.4, 42.6).

We will study now how the data change our prior beliefs by comparing the expected values
for the prior and posterior distributions:
E(θ) = a/(a + b),   E(θ|x) = (a + x)/(a + b + n).

In our example they are

E(θ) = 0.4,   E(θ|x) = 0.474.

So after taking observations from a random sample the prior estimate increases from 0.4 to 0.474. Note that the standard maximum likelihood estimate based on the data is x/n = 0.486. Summing up, the posterior is a combination of the data and the prior belief.
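The prior elicitation and posterior update of Example 8.2.2 can be reproduced in a few lines:

```python
# Example 8.2.2 in code: prior mean 0.4 and variance 0.02 give Beta(4.4, 6.6);
# 34 heads in 70 tosses update this to Beta(38.4, 42.6).
from scipy import stats

m, u = 0.4, 0.02                      # prior mean and variance
a = (1 - m) * m ** 2 / u - m          # 4.4
b = (1 - m) ** 2 * m / u - (1 - m)    # 6.6

n, x = 70, 34
post = stats.beta(a + x, b + n - x)
print(a / (a + b), post.mean(), x / n)   # 0.4, 0.474, 0.486
```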
Example 8.2.3. Drinks dispensed by a vending machine may overflow. Suppose the prior
distribution of p the proportion that overflow is

p 0.05 0.10 0.15


f (p) 0.30 0.50 0.20

If two of the next nine drinks dispensed overflow find the posterior distribution of p .
Assuming independence, the likelihood given the experiment is

f(x|p) = C(9, 2) p² (1 − p)⁷.

Thus the posterior is

f(p|x) = C(9, 2) p² (1 − p)⁷ × f(p) / Σ_{j=1}^3 C(9, 2) p_j² (1 − p_j)⁷ × f(p_j).
This gives

f(0.05|x) ∝ 0.05² (0.95)⁷ × 0.3
f(0.10|x) ∝ 0.10² (0.90)⁷ × 0.5
f(0.15|x) ∝ 0.15² (0.85)⁷ × 0.2

or pulling all the results together


p 0.05 0.10 0.15
f (p) 0.30 0.50 0.20
f (p|x) 0.12 0.55 0.33


8.3 Choice of prior


The Prior Distributions are the most critical and most criticized point of Bayesian anal-
ysis:
• The prior distribution is the key to Bayesian inference.

• In practice, it seldom occurs that the available prior information is precise enough
to lead to an exact determination of the prior distribution.

• There is no such thing as the prior distribution.

• The prior is a tool summarizing available information as well as uncertainty re-


lated with this information.

• Ungrounded prior distributions produce unjustified posterior inference.

8.3.1 Conjugate class


The other difficulty is the integration required to obtain the posterior density, i.e., the calculation of the constant of proportionality. We can overcome this to some extent by choosing the prior in a sufficiently clever way. It is not very restrictive to choose a prior from a conjugate class. Conjugate priors are specific parametric families with convenient analytical properties.
Definition 8.3.1. A family F of probability distributions on θ is conjugate for a likelihood
function f (x|θ) if, for every f ∈ F , the posterior distribution f (θ|x) also belongs to F .
Switching from prior to posterior distribution then reduces to an updating of the corresponding parameters; see Example 8.2.2, where, as you can see, the Beta distribution is a conjugate prior for the binomial likelihood. The justification for the use of conjugate priors is mainly their tractability and simplicity, along with the preservation of the structure of f(θ|x).
Table 8.1 lists the most common conjugate priors.
Example 8.3.1. Suppose a random sample x ∼ G amma(k, 1/θ), where k is known and
θ ∼ G amma(a, 1/b). Show that
θ|x ∼ Gamma( a + nk, 1/(b + Σ x_i) ).

The likelihood is

f(x|θ) = Π_{i=1}^n (θ^k/Γ(k)) x_i^{k−1} exp(−θ x_i)
       = Π_{i=1}^n (x_i^{k−1}/Γ(k)) θ^k exp(−θ x_i)
       ∝ Π_{i=1}^n θ^k exp(−θ x_i)
       = θ^{nk} exp(−θ Σ x_i).

f(x|θ)                               f(θ)             f(θ|x)

x ∼ B(n, θ)                          Beta(a, b)       Beta(a + x, b + n − x)

x₁, . . . , x_n ∼ P(θ)               Gamma(a, 1/b)    Gamma(a + Σ x_i, 1/(b + n))

x₁, . . . , x_n ∼ Gamma(k, 1/θ),     Gamma(a, 1/b)    Gamma(a + nk, 1/(b + Σ x_i))
  k is known

x₁, . . . , x_n ∼ Geom(θ)            Beta(a, b)       Beta(a + n, b + Σ x_i − n)

x ∼ NB(r, θ)                         Beta(a, b)       Beta(a + r, b + x)

x₁, . . . , x_n ∼ N(θ, 1/τ),         N(b, 1/c)        N( (cb + nτx̄)/(c + nτ), 1/(c + nτ) )
  τ is known

Table 8.1: List of conjugate priors.

So

f(θ|x) ∝ f(θ) × f(x|θ)
       ∝ θ^{a−1} exp(−bθ) × θ^{nk} exp(−θ Σ x_i)
       ∝ θ^{a+nk−1} exp{ −(b + Σ x_i) θ }.

Therefore

θ|x ∼ Gamma( a + nk, 1/(b + Σ x_i) ).

Example 8.3.2. Suppose a random sample x ∼ P (θ) of size n and θ ∼ G amma(a, 1/b).
Show that
θ|x ∼ Gamma( a + Σ x_i, 1/(b + n) ).

The likelihood is

f(x|θ) = Π_{i=1}^n exp(−θ) θ^{x_i} / x_i!
       ∝ Π_{i=1}^n exp(−θ) θ^{x_i}
       = exp(−nθ) θ^{Σ x_i}.

So

f(θ|x) ∝ f(θ) × f(x|θ)
       ∝ θ^{a−1} exp(−bθ) × exp(−nθ) θ^{Σ x_i}
       ∝ θ^{a+Σ x_i −1} exp{ −(b + n) θ }.

Therefore

θ|x ∼ Gamma( a + Σ x_i, 1/(b + n) ).
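A quick numerical illustration of this Poisson-Gamma update, with a hypothetical prior and hypothetical counts (both are assumptions, not taken from the notes):

```python
# Poisson-Gamma update with a hypothetical prior and hypothetical counts.
import numpy as np
from scipy import stats

a, b = 2.0, 1.0                                    # assumed prior: theta ~ Gamma(a, 1/b)
x = np.array([3, 1, 4, 2, 2, 5])                   # assumed Poisson counts
a_post, b_post = a + x.sum(), b + len(x)           # posterior parameters

posterior = stats.gamma(a_post, scale=1 / b_post)  # Gamma(a + sum x, 1/(b + n))
print(a_post, b_post, posterior.mean())            # posterior mean (a + sum x)/(b + n)
```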

Existence of conjugate priors

Under the assumption that the conjugate priors do not contradict our prior beliefs, and given that such a family of distributions exists, the computations can be simplified. But in which cases can a family of conjugate priors be obtained?
The only case in which conjugate priors can be derived easily is for models in the exponential family of distributions.
Definition 8.3.2. The family of distributions

f(x|θ) = h(x) g(θ) exp( t(x) c(θ) )

is called an exponential family, for functions h, g, t, c such that

∫ f(x|θ) dx = g(θ) ∫ h(x) exp( t(x) c(θ) ) dx = 1.

The exponential family includes the exponential, the Poisson, the Gamma with one parameter, the binomial and the normal distribution with known variance.
Suppose the prior distribution is f(θ); then

f(θ|x) ∝ f(θ) × f(x|θ)
       = f(θ) × Π_{i=1}^n h(x_i) g(θ) exp( t(x_i) c(θ) )
       = f(θ) × ( Π_{i=1}^n h(x_i) ) g(θ)^n exp( Σ t(x_i) c(θ) )
       ∝ f(θ) g(θ)^n exp( Σ t(x_i) c(θ) ).

So if we choose

f(θ) ∝ g(θ)^d exp( b c(θ) )     (8.3)

then the following posterior distribution will be obtained:

f(θ|x) ∝ g(θ)^{n+d} exp( ( Σ t(x_i) + b ) c(θ) )
       = g(θ)^{d̃} exp( b̃ c(θ) ),

which belongs to the same family of distributions as the prior, with adjusted parameters d̃ = n + d and b̃ = Σ t(x_i) + b. All the examples of conjugate priors that we have seen were obtained in this way.
Example 8.3.3. Consider the binomial distribution. The probability mass function is given by

f(x|θ) = C(n, x) θ^x (1 − θ)^{n−x}
       = C(n, x) (1 − θ)^n ( θ/(1 − θ) )^x
       = C(n, x) (1 − θ)^n exp( log( (θ/(1 − θ))^x ) )
       = C(n, x) (1 − θ)^n exp( x log( θ/(1 − θ) ) ).

In terms of Definition 8.3.2,

h(x) = C(n, x),   g(θ) = (1 − θ)^n,   t(x) = x,   c(θ) = log( θ/(1 − θ) ).

Therefore we can construct a conjugate prior, see Eq. (8.3), of the form

f(θ) ∝ g(θ)^d exp( b c(θ) )
     = ( (1 − θ)^n )^d exp( b log( θ/(1 − θ) ) )
     = (1 − θ)^{nd−b} θ^b,

which is a member of the Beta family of distributions.

8.3.2 Noninformative priors


What if all we know is that we know “nothing”? In the absence of prior information, prior distributions are derived solely from the sampling distribution f(x|θ).
Noninformative priors cannot be expected to represent exactly total ignorance about the problem at hand, but should rather be taken as reference or default priors, upon which everyone can fall back when prior information is missing.

The Jeffreys’ prior

One perspective on defining a noninformative prior distribution is invariance to one-to-one transformations of the parameters. This leads to the Jeffreys prior, which is based on the Fisher information

I(θ) = −E[ d² log f(x|θ)/dθ² ] = E[ ( d log f(x|θ)/dθ )² ].

Definition 8.3.3. The Jeffreys prior distribution is defined as

f_0(θ) ∝ |I(θ)|^{1/2}.

Proposition 8.3.1. The Jeffreys prior distribution is invariant under transformations of the parameters, i.e., if φ = g(θ) then

f_0(θ) = f_0(φ) |∂φ/∂θ|.
Example 8.3.4. Suppose that (x₁, x₂, x₃, . . . , x_n)^T = x ∼ N(θ, σ²), where σ² is known. Then

f(x|θ) ∝ exp( − Σ_{i=1}^n (x_i − θ)² / (2σ²) )

and

log f(x|θ) ∝ − Σ_{i=1}^n (x_i − θ)² / (2σ²).

Furthermore,

I(θ) = −E[ d² log f(x|θ)/dθ² ] = E[ n/σ² ] = n/σ².

Therefore f_0(θ) ∝ 1.
Example 8.3.5. Suppose that x|θ ∼ B(n, θ). Then

f(x|θ) ∝ θ^x (1 − θ)^{n−x}

and

log f(x|θ) ∝ x log θ + (n − x) log(1 − θ).

Furthermore,

I(θ) = −E[ d² log f(x|θ)/dθ² ]
     = −E[ −x/θ² − (n − x)/(1 − θ)² ]
     = nθ/θ² + (n − nθ)/(1 − θ)²   (because E(x) = nθ)
     = n θ^{−1} (1 − θ)^{−1}.

Therefore f_0(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}, and thus

θ ∼ Beta(1/2, 1/2).

8.4 Predictive distribution


With a prior distribution f (θ) the Bayes theorem leads to the posterior distribution f (θ|x).
Suppose we consider taking a further observation y. We know the likelihood so f (y|θ) is
known. The predictive distribution of y given x is
f(y|x) = ∫ f(y|θ) f(θ|x) dθ.

Example 8.4.1. Twenty windows in a high-rise office block broke in the first year of occupancy of the building. The question was how many of these were due to a specific defect D. If they were caused by D then the manufacturer of the windows will replace them, otherwise they will not. Only in 4 of the 20 windows was glass available for analysis; to this end we will assume a sample of 4 in 20. All 4 windows in our sample were found to have broken because of D.
In the subsequent legal wrangle a glass expert claimed that the distribution of θ, the probability that a window suffers from D, is

f(θ) = θ^{−3/4} (1 − θ)^{−1/4} / B(1/4, 3/4),   0 ≤ θ ≤ 1,   or θ ∼ Beta(1/4, 3/4).

For a sample of 4 with all 4 having defect D the likelihood is f(x|θ) = C(4, 4) θ⁴ (1 − θ)^{4−4} = θ⁴, so the posterior is

f(θ|x) = θ^{13/4} (1 − θ)^{−1/4} / B(1/4 + 4, 3/4 + 4 − 4),   0 ≤ θ ≤ 1,   or θ|x ∼ Beta(17/4, 3/4).
This is a conjugate prior; see Table 8.1 for the general case. The distribution of Y, the number of defectives among the remaining 16 windows, is (given θ)

f(y|θ) = C(16, y) θ^y (1 − θ)^{16−y}.

The predictive distribution is then

f(y|x) = ∫ f(y|θ) f(θ|x) dθ
       = ( C(16, y) / B(17/4, 3/4) ) ∫₀¹ θ^{y+13/4} (1 − θ)^{63/4−y} dθ
       = C(16, y) B(y + 13/4 + 1, 63/4 − y + 1) / B(17/4, 3/4)
       = C(16, y) B(y + 17/4, 67/4 − y) / B(17/4, 3/4).
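The predictive probabilities can be evaluated numerically; the sketch below checks that they sum to one.

```python
# Predictive distribution of Example 8.4.1: f(y|x) = C(16, y) B(y + 17/4, 67/4 - y) / B(17/4, 3/4).
import numpy as np
from scipy.special import betaln, gammaln

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

y = np.arange(17)
log_f = log_comb(16, y) + betaln(y + 17 / 4, 67 / 4 - y) - betaln(17 / 4, 3 / 4)
f = np.exp(log_f)
print(f.sum())     # 1.0 up to rounding
print(f[-1])       # predictive probability that all 16 remaining windows have defect D
```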


Exercises
1. Suppose we have one observation x from an exponential distribution f (x) = θ exp(−θx).
If the prior distribution is defined from the discrete probabilities:

p(θ = 1) = p(θ = 2) = 1/2,

find the posterior distribution of θ.

2. Suppose that
(x1 , . . . , xn ) = x ∼ Geom(θ)
and
θ ∼ B e t a(a, b).
Show that

θ|x ∼ Beta(a + n, b + Σ x_i − n).

3. Suppose that you have one observation

x ∼ NeB(r, θ)

and
θ ∼ B e t a(a, b).
Show that
θ|x ∼ B e t a(a + r, b + x − r ).

4. Suppose we have one observation x from a Pareto distribution, x ∼ P a(a, b), where
a is known and b is unknown. Then

f(x|b) = b a^b x^{−b−1},   x > a.

(a) Find the prior distribution of Jeffreys for b.


(b) Find the posterior distribution of b|x.

5. Suppose we have one observation x from binomial distribution B(n, π) and the
prior distribution of π is B e t a(p, q). Suppose that y is the next observation from
binomial distribution B(m, π).

(a) Find the posterior distribution of π|x.


(b) Find the predictive distribution of y|x.
