Professional Documents
Culture Documents
ARISTIDIS K. NIKOLOULOPOULOS
September 2014
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Acknowledgment
Parts of these notes are based on previous course notes, written by Gareth J. Janacek.
ii
Contents
2 Multivariate distributions 17
2.1 Discrete case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Continuous case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
iii
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
2.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Expectation for sums of random variables . . . . . . . . . . . . . . . . . . . 27
2.6 Distribution of sums of independent random variables . . . . . . . . . . . 28
2.6.1 Moment Generating Functions and sums . . . . . . . . . . . . . . . . 28
2.6.2 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Methods of estimation 35
3.1 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.1 Sampling distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Properties of estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Maximum Likelihood Estimators (MLE’s) . . . . . . . . . . . . . . . . 40
3.5.3 Properties of MLE’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.1 Asymptotic confidence intervals . . . . . . . . . . . . . . . . . . . . . 44
iv
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
7 Nonparametric inference 83
7.1 The Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Permutation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 The sign test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Wilcoxon’s Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.5 Paired Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.6 Unpaired Samples - The Mann-Whitney Test . . . . . . . . . . . . . . . . . 93
8 Bayesian inference 97
8.1 Bayes theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1.1 The Continuous Analogue . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.2 Subjective Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.3 Choice of prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.3.1 Conjugate class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.3.2 Noninformative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.4 Predictive distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
v
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
vi
Chapter 1
An event ω is a collection of outcomes of interest, for example rolling two dice and get-
ting a double. In this case the event ω is defined as
ω ={ (1,1),(2,2),(3,3),(4,4),(5,5),(6,6)}.
′
Suppose that the event ω is that the sum is less that 4 when we roll two dice, then
′
ω ={ (1,1),(1,2),(2,1)} .
′
If two events ω and ω have no elements in common then we say they are mutually
exclusive.
I prefer Roman symbols for events, so for example let A be the event {At least one 6}
A={(1,6),(2,6),(3,6),(4,6),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),(6,)}
and define the event B as
B={ (2,3),(5,7)}
Then A and B are mutually exclusive.
1
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
• The event A and B, often written A ∩B is the set of outcomes which belong both to
A and to B.
• The event A or B, often written A ∪ B is the set of outcomes which belong either to
A or to B or to both.
Example 1.1.1. Suppose Ω={0,1,2,3,4,5,6,7,8,9} then if we define A={1,3,5,7,9} and B={4,5,7}
we have
• A ∩ B = {5,7}
• While A ∪ B = {1,3,4,5,7,9}
• B C =not B = 0,1,2,3,6,8,9
1.2 Probability
To any subset ω ⊂ Ω we assign a measure P [ω] called the probability of ω. These mea-
sures satisfy the basic axioms of probability. For any A ⊂ Ω
1. 0 ≤ P [A] ≤ 1
2. P [Ω] = 1
2
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
• Continuous probability distributions: These handle cases were all possible (real)
numbers can be observed (e.g., height or weight).
Note in passing that large or infinite numbers of countable values are handled by con-
tinuous distributions.
If X is discrete the function,
3
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
for every interval of ∆, then the function f X (x) is the density of the random variable X .
The pmfs have the following properties:
P
3. Pr(X ∈ B) = X ∈B p X (x), B ∈ Ω.
F X (x) = P [X ≤ x], x ∈ R,
In the sequel let F X (x) = F (x), f X (x) = f (x), and p X (x) = p(x) for shorthand notation.
• P [X ≤ 2] = F (2) = 1 − exp(−6).
• The density function is f (x) = F ′ (x) = λ exp(−λx) and we can also write,
Z4
P [2 ≤ X ≤ 4] = 3 exp(−3x)d x.
2
4
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
1.4 Expectation
We can also view probability from the point of view of what happens in the long run.
Given a random variable X define the expected value of X written E [X ] as,
Pk
x=0 xp(x) if X is a discrete rv
E [X ] = R ,
∞
−∞ x f (x)d x if X is a continuous rv
where k is the number of different atoms for the discrete rv. The expected value can be
regarded as the long run average.
Example 1.4.1. If we roll a dice and the outcome is X then P [X = i , i = 1, 2, · · · , 6] = 1/6
and so
1 1 1
E [X ] = 1 × + 2 × + · · · + 6 × = 3.5
6 6 6
You can be sure that if you roll a die you will never get 3.5, however if you rolled a die, say
N times and kept an average of the score you will find that this will approach 3.5 as N
increases.
Example 1.4.2. For a coin we have Head and Tail. Suppose we count head as 1 and tail
as 0, then
P [X = 1] = 1/2 and P [X = 0] = 1/2
and so E [X ] = 1 × 12 + 0 × 12 = 12 .
Things work in much the same way in the continuous case by replacing the summa-
tion with an integral as described in the following example.
Example 1.4.3. If the density of the rv X is, f (x) = 83 x 2 , 0 ≤ x ≤ 2 and 0 elsewhere then,
Z∞ Z2
3
E [X ] = x f (x)d x = x x 2 d x = 1.5.
−∞ 0 8
The expected value of a function h(X ) of a rv X is defined similarly as below,
Pk
x=0 h(x)p(x) if X is a discrete rv
E [h(X )] = R ;
∞
−∞ h(x) f (x)d x if X is a continuous rv
2. E (ah(X ) + b] = aE [h(X )] + b.
5
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
1.5 Moments
Some important expected values in statistics are the moments,
µr = E [X r ] r = 1, 2, . . .,
You will have met the mean µ = E [X ] and the variance Var(X ) = σ2 = E [(X − µ)2 ]. Some
properties of the variance are,
1. Var(X ) = E X 2 − (E X )2 .
2. Var(c) = 0.
3. Var(aX + b) = a 2 VarX .
The parameter σ is known as the standard deviation. We can prove an interesting link
between the mean µ and the variance σ2 . The result s known as Chebyshev’s inequality,
h σ i2
P [|X − µ| > ǫ] ≤ (1.1)
ǫ
This tells us that departure from the mean have small probability when σ is small. Thus
the long run average is the mean and the size of departures from this long run average
are controlled by the variance.
The third and fourth moments E [(X − µ)3 ],E [(X − µ)4 ] are less commonly used.
1
p(x) = ,
n
for each x in 1, 2, 3, · · · , n.
6
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
p(x) = (1 − p)x−1 p, x = 1, 2, . . .
1
f (x) = , α ≤ x ≤ β.
β−α
7
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
where λ > 0 is the parameter of the distribution, often called the rate parameter.
x α−1 exp(−x/β)
f (x) = , x ≥ 0, (1.3)
βa Γ(α)
where Γ(·) is the Gamma function. The exponential is a special case of G(α, β). Note that
the density in (1.3) is the same with the density in (1.2), when α = 1 and β = 1/λ, i.e.,
G(1, 1/λ) ≡ E xp(λ).
x α−1 (1 − x)β−1
f (x) = , 0 < x < 1,
B(α, β)
where B(α, β) is the Beta function.
8
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
where parameter µ is the mean and σ2 is the variance (the measure of the width of the
distribution). The distribution with µ = 0 and σ2 = 1 is called the standard normal, that is
N (0, 1). Note one cannot evaluate the normal integral analytically however the standard
X −µ
normal N(0,1) is tabulated. If X is N (µ, σ2 ) then it is easy to show that Z = σ is standard
Normal. So we transform to standard normal and use the appropriate tables. We must
use tables since the integral,
Zx
1
Φ(x) = P [X ≤ x] = p exp(−z 2 /2)d z,
−∞ 2πσ 2
is not simple.
The importance of this distribution comes from the Central Limit theorem which
tells us that the distribution of sums of random variables have distributions that tend to
normality under a wide range of conditions.
N(0,1)
t, df=2
0.3
0.2
f(x)
0.1
0.0
−6 −4 −2 0 2 4 6
characteristic of the t distribution is that as ν gets larger, the t gets closer to the standard
normal.
9
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
0.15
0.10
f(x;df=5)
0.05
0.00
0 5 10 15 20
The Chi-square χ2ν distribution has the probability density function given by,
x ν/2−1 exp(−x/2)
f (x) = , x ≥ 0, (1.4)
2ν/2 Γ(ν/2)
where ν is the number of degrees of freedom and Γ(·) is the Gamma function. Note
that the density in (1.4) is the same with the density in (1.3) for α = ν/2 and β = 2, i.e.,
χ2ν ≡ G(ν/2, 2).
1. Discrete Uniform
µ = (n + 1)/2 σ2 = (n 2 − 1)/12.
2. Binomial B(n, p)
µ = np σ2 = np(1 − p)
10
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
3. Poisson P (λ)
µ=λ σ2 = λ
4. Geometric Geo(p)
µ = 1/p σ2 = (1 − p)/p 2
5. Uniform U (α, β)
µ = (α + β)/2 σ2 = (β − α)2 /12
6. Exponential E xp(λ)
µ = 1/λ σ2 = 1/λ2
7. Gamma G(α, β)
µ = αβ σ2 = αβ2
8. Beta B e(α, β)
µ = α/(α + β) σ2 = αβ/(α + β)2 /(α + β + 1)
9. Normal N (µ, σ2 )
µ=µ σ2 = σ2
10. Student tν
µ=0 σ2 = ν/(ν − 2), ν > 2
More distributions their means and variances are given in the Statistical Tables.
That is, Z∞
M X (t ) = f (x) exp(t x)d x,
−∞
for a continuous rv X , and,
X
M X (t ) = P [X = x] exp(t x),
x
for a discrete rv X .
11
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
³ ´
r (r −1)
• µ3 = E [X 3 ] = 3! λ3
+ λr3 + 1/6 r (r −1)(r
λ3
−2)
or
λ r
µ ¶
d M(t )
= r (λ − t )−1 ,
dt λ−t
so setting t = 0 gives,
µ = r /λ.
while
d 2 M(t ) λ r 3 λ r 2 λ r
µ ¶ µ ¶ µ ¶
−3 −3
= r (λ − t ) + 3 r (λ − t ) + 2 r (λ − t )−3
dt2 λ−t λ−t λ−t
again setting t=0 gives, µ ¶
r r (r − 1)
µ2 = 2! 2 + 1/2 .
λ λ2
12
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
There are some simple results which follow directly from the definition,
1. G(0) = p[X = 0]
2. G(1) = 1
′
3. G (1) = E [X ]
In our view Mgfs are much more useful however in some areas of probability the right
tool to use is the probability generating function. As in most areas of life it is a question
of picking the right tool.
13
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Example 1.10.2. Suppose X ∼ E xp(1/θ). Find the distribution of Y = X /θ. We have that,
¯ ¯
1 − 1 θy ¯¯ d ¯ 1
f Y (y) = e θ (θy)¯¯ = e −y θ = e −y ⇒ Y ∼ E xp(1).
θ ¯ dy θ
Example 1.10.3. Suppose X ∼ N (0, 1). Find the distribution of Y = X 2 . We have that,
Z∞ Z∞
1 1 ³ ´
M y (t ) = exp(t x 2 ) p exp(−x 2 /2) d x = p exp −(1/2)x 2 (1 − 2t ) d x
−∞ 2π 2π −∞
Z∞
1 1
= 1/2
p exp(−y 2 /2) d y = (1 − 2t )−1/2 .
(1 − 2t ) 2π −∞
14
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Exercises
1. You and a friend play a game in which you each toss a fair coin. If the result is two
tails (TT), you win £1; if the result is two heads (HH), you win £2. If the result is TH
or HT, you lose £1 (i.e. win £−1).
(a) Find the probability distribution of your winnings, X , for a single play of the
game.
(b) Find the mean and variance of X for a single play.
2. Suppose that the continuous random variable X has the probability density func-
tion, ½
cx , 0 ≤ x ≤ 2
f (x) =
0 , otherwise
3. Suppose that the continuous random variable X has the probability density func-
tion,
½ 3 2
f (x) = 2x +x , 0≤x ≤1
0 , otherwise
4. The probability that a patient recovers from a certain disease is 0.8. Suppose that
20 people are known to have contracted the disease.
15
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
6. A salesperson has found that the probability of a sale on a single contact is approx-
imately 0.03. Suppose the salesperson makes 100 contacts:
(a) Find the exact probability that he or she makes at least one sale.
(b) Use the Poisson approximation (P (np)) to the Binomial distribution to esti-
mate the probability of making at least one sale.
7. The cycle time for trucks hauling concrete to a building site is uniformly distributed
over the interval 50 to 70 minutes.
(a) What is the probability that the cycle time exceeds 65 minutes if it is known
that it exceeds 55 minutes?
(b) Find the mean cycle time and the variance of the cycle time.
(a) What is the probability that a student has a mark greater than 80?
(b) A department has to lose the 10% of its students that have the lowest marks.
Where should it set its cut-off mark to achieve this?
P [X = k] = p(1 − p)k−1 k = 1, 2, . . . .
10. If the mgf of X is M X (t ), show that the mgf of Y = aX +b is exp(bt )M X (at ). Hence
find the mgf of a Normal distribution with mean µ and variance σ2 given that a
standard Normal has the mgf exp(t 2 /2).
11. If Z has a standard normal distribution show that E [Z k ] = 0 for all odd values of k.
X −µ
12. If X ∼ N (µ, σ2 ) find the distribution of Z = σ .
16
Chapter 2
Multivariate distributions
In this chapter we deal with more than one random variables, since it is clear that the
world is not one dimensional. To deal with this, we have to define the joint probabil-
ity distribution function for the random vector X = (X 1 , X 2 , . . . , X d ), where each X j , j =
1, . . . , d is a random variable. It is obvious that if each of X j is discrete then X is a discrete
random vector, and if each of X j is continuous then X is a continuous random vector.
This function is called the joint probability mass function. It is obvious that,
X
p X (x) ≥ 0, and p X (x) = 1.
x1 ,...,xd
Clearly there are situations where we just need the distribution of X j disregarding
the other random variables. The relevant distributions are known as the marginal pmfs
and are defined as,
X X X X
Pr(X j = x j ) = · · · · · · Pr(X 1 = x1 , X 2 = x2 , . . . , X d = xd )
x1 x j −1 x j +1 xd
or
X X X X
p X j (x j ) = ··· ··· p X (x).
x1 x j −1 x j +1 xd
and
17
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
X X
p X 2 (x2 ) = Pr(X 2 = x2 ) = Pr(X 1 = x1 , X 2 = x2 ) = p X 1 ,X 2 (x1 , x2 ).
x1 x1
Finally, the joint cumulative distribution function of the d -variate random vector is
defined as below,
X X
FX (x) = Pr(X 1 ≤ x1 , X 2 ≤ x2 , . . . , X d ≤ xd ) = ··· p X (y 1 , . . . , y d ).
y 1 ≤x1 y d ≤xd
The joint cdf FX has the same properties with the univariate cdf; see Section 1.3. We also
have some obvious conditional pmfs. For the bivariate case,
Pr[X 1 = x1 , X 2 = x2 ]
p X 1 |X 2 (x1 |x2 ) = Pr(X 1 = x1 |X 2 = x2 ) = .
Pr[X 2 = x2 ]
Pr[X 2 = x2 , X 1 = x1 ]
p X 2 |X 1 (x2 |x1 ) = Pr(X 2 = x2 |X 1 = x1 ) = .
Pr[X 1 = x1 ]
Example 2.1.1. We roll two dice and then Ω is the set of pairs
p(x1 , x2 ) = 1/36, x1 , x2 = 1, . . . , 6.
If we take the sum of p(x1 , x2 ) over x2 and x1 , respectively, we derive the marginal pmfs,
The joint pmf can be also be represented in a table as presented in the following exam-
ple.
X 2 \X 1 1 2
1 1/8 1/4
2 1/8 1/2
The marginal distribution of X 1 and X 2 are conveniently the marginal sums of the
distribution table.
18
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
X 2 \X 1 1 2 p X 2 (x2 )
1 1/8 1/4 3/8
2 1/8 1/2 5/8
p X 1 (x1 ) 1/4 3/4 1
1. The constant c.
We proceed as below:
³ ´
c· (2·0+2)+(2·0+3)+(2·1+0)+(2·1+1)+(2·1+2)+(2·1+3)+(2·2+0)+(2·2+1)+(2·2+2)+(2·2+3) = 1
1
⇒ c · 42 = 1 ⇒ c =
42
To this end,
1
42 (2x + y), x = 0, 1, 2 and y = 0, 1, 2, 3
p(x, y) = Pr(X = x, Y = y) = .
0, elsewhere
19
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
1 1 5
2. (a) P (X = 2, Y = 1) = 42 (2 · 2 + 1) = 42 · 5 = 42 .
P P 2 P
P 2
P (X ≥ 1, Y ≤ 2) = P (X = x, Y = y) = P (X = x, Y = y) =
(b) x≥1 y≤2 x=1 y=0
1 1
= 42 · {((2 · 1 + 0) + (2 · 1 + 1) + (2 · 1 + 2)) + ((2 · 2 + 0) + (2 · 2 + 1) + (2 · 2 + 2))} = 42 · 24 = 47 .
Pr(X =x,Y =y)
3. We know that Pr(X = x|Y = y) = Pr(Y =y) . So we have to calculate the marginal
pmf of Y . This is,
2
P P 1
p Y (y) = Pr(X = x, Y = y) = 42 (2x + y) =
x x=0
1
© ª 6+3y 1
= 42 (2 · 0 + y) + (2 · 1 + y) + (2 · 2 + y) = 42 = 14 (y + 2).
1
42 (2x+y) 2x+y
Therefore, Pr(X = x|Y = y) = 1 = 13 · y+2 , x = 0,1,2, y = 0,1,2,3.
14 (y+2)
Pr(X =x,Y =y)
In the same manner, Pr(Y = y|X = x) = Pr(X =x)
, where
3
P P 1
p X (x) = Pr(X = x, Y = y) = 42 (2x + y) =
y y=0
1
= 42 {(2x + 0) + (2x + 1) + (2x + 2) + (2x + 3)} = 8x+6
42 .
1
42 (2x+y) 2x+y
So, P (Y = y|X = x) = 1 = 12 · 4x+3 , x = 0,1,2, y = 0,1,2,3
42 (8x+6)
for each ∆i in Ω, then f X (x) is the joint density of the random vector X. It is obvious that,
Z∞ Z∞ Z∞
··· f X (x1 , x2 , . . . , xd )d x1 d x2 . . . d xd = 1.
−∞ −∞ −∞
and given that we have the following relations between the joint cdf and density and vice
versa:
Zt 1 Zt 2 Ztd
FX (x) = FX (x1 , x2 , . . . , xd ) = ··· f X (t1 , t2 , . . . , td )d t1 d t2 . . . d td .
−∞ −∞ −∞
∂d FX (x1 , x2 , . . . , xd )
f X (x1 , x2 , . . . , xd ) = .
∂x1 ∂x2 , . . . , ∂xd
20
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
It follows (but not quite obviously) that the marginal cdf for the rv X j is given as
below,
Let assume now that the random vector is bivariate X = (X 1 , X 2 ). Then we have:
f (x, y) = 2x, 0 ≤ x ≤ 1 0 ≤ y ≤ 1.
f (x, y) 2x
f (x|y) = = = 2x 0 ≤ x ≤ 1,
f (y) 1
f (x, y) 2x
f (y|x) = = = 1 0 ≤ y ≤ 1.
f (x) 2x
The joint cdf is:
Zx Z y Zx Z y
F (x, y) = f (x, y)d xd y = 2xd xd y = x 2 y, 0 ≤ x ≤ 1 0 ≤ y ≤ 1.
−∞ −∞ 0 0
21
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
f (x, y) 3x 2x
f (x|y) = = 3 = 2)
, 0 ≤ y ≤ x ≤ 1,
f (y) (1 − y 2) (1 − y
2
f (x, y) 3x
f (y|x) = = 2 = 1/x, 0 ≤ y ≤ x ≤ 1.
f (x) 3x
Note that f (x, y) is valid density since,
Z∞ Z∞ Z1 Z1 Z1 Z x
f (x, y)d xd y = 3x d xd y = 3x d yd x = 1.
−∞ −∞ 0 y 0 0
The secret here is to sketch the area over which X and Y are defined. This is called the
region of support. We will focus on this in the class.
For example for the bivariate case, suppose we are interested in some function of the
random variables X and Y , say h(X , Y ). Then the expected value of h(X , Y ), is defined
as,
P P
x y h(x, y)p(x, y) if (X , Y ) are discrete random variables
E [h(X)] = R R .
∞ ∞
−∞ −∞ h(x, y) f (x, y)d xd y if (X , Y ) are continuous random variables
Note that,
Z∞ Z∞ Z∞ Z∞ Z∞ Z∞
E (X ) = x f (x, y)d xd y = x f X (x)d x, E (Y ) = y f (x, y)d xd y = y f Y (y)d x,
−∞ −∞ −∞ −∞ −∞ −∞
22
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
if X , Y are discrete.
It is useful to remember (it is also easy to prove) that for random variables X and Y
and constants a and b we have the following properties:
• E (aX + b) = aE X + b
¡ ¢
• E ah(X ) + bg (Y ) = aE (h(X )) + bE (g (Y ))
• Var(aX + b) = a 2 Var(X )
2.3.1 Covariance
The covariance measures the association/dependence between the variables. With the
properties of the expected values it is obviously that,
It is more common to use the scaled variant called the correlation ρ. This is defined
as,
cov(X , Y ) cov(X , Y )
ρ= p = .
v ar (X )v ar (Y ) σ X σY
The correlation is popular because the properties below but must be handled with care
(i.e., can measure only linear association, between normally distributed variables). In
outline:
• |ρ| ≤ 1.
It is important to remember that ρ does not tell you anything about causality.
1. E (X ), E (Y ).
23
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
2. f (x|y), f (y|x).
3. cov(X , Y ).
+∞ Z2 Ã · 2 ¸2 !
3 2 3 2 £ ¤2 y 3
Z
f X (x) = f (x, y)d y = (x + x y) dy = x y 0+x = (2x 2 + 2x).
5 5 2 0 5
−∞ 0
So,
R1 R1
E (X ) = x · 65 (x 2 + x)dx = 65 (x 3 +x 2 )dx =
µh 0 i h i ¶ 0
4 1 3 1
= 56 x4 + x3 = 65 14 + 31 = 65 · 12
7 7
¡ ¢
= 10 .
0 0
+∞
R
Similarly E (Y ) = y · f Y (y) dy where,
−∞
+∞ R1
f (x, y) dx = 35 (x 2 + x y) dx =
R
f Y (y) =
µh −∞i1 h 2 i1 ¶ 0 ¡ ³ ´
x3 y¢ 2+3y
= 5 3 + y x2
3
= 35 13 + 2 = 53 6 = 10 1
(2 + 3y).
0 0
So,
R2 1 1
R2
E (Y ) = y · 10 (2 + 3y)dy = 10 (2y + 3y 2 )dy =
0
µ h i 0
h 3 i2 ¶
1 y2 2 y 1
= 10 2 2 + 3 3 = 10 (4 + 8) = 12 6
10 = 5 .
0 0
3 3
f (x,y) 5 x(x+y) f (x,y) 5 x(x+y)
2. f (x|y) = f Y (y) = 1 , 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and f (y|x) = f X (x) = 6 =
10 (2+3y) 5 x(x+1)
(x+y)
2(x+1) , 0 ≤ x ≤ 1, 0 ≤ y ≤ 2.
+∞
R +∞ R2 R1
x y · 53 x(x + y)dxdy =
R
E (X Y ) = x y · f (x, y)dxdy =
−∞ −∞ µ 00
3
R2 R1 3 2 2 3
R2 h x 4 i1 h i1 ¶
2 x3
=5 (x y + x y )dxdy = 5 y 4 + y 3 dy =
0 0 0 0 0
R2 ³ 1 y 2 2 ´
µ h i h 3 i2 ¶
2 2
3 3 1 y y
= 5 y 4 + 3 y dy = 5 4 2 + 3 3 =
0 0 0
= 35 18 · 4 + 91 · 8 = 35 21 + 89 = 35 · 9+16 = 35 · 25 = 56 .
¡ ¢ ¡ ¢
18 18
So,
5 7 6 5 21 1
Cov(X , Y ) = E (X Y ) − E (X )E (Y ) = − · = − =− .
6 10 5 6 25 150
24
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
2.4 Independence
The random variables X 1 , X 2 , . . . X d are independent if and only if,
for each (x1 , . . . , xd ) ∈ R d . One can also use densities/pmfs instead of cdfs to show that
the random variables X 1 , X 2 , . . . X d are independent:
• E X Y = E X E Y , so cov(X , Y ) = 0.
1. f X (x), f Y (y).
3. f (x|y), f (y|x).
4. E (X |Y ), E (Y |X ).
25
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
e −x , x > 0
½
−x
£ −2y ¤∞ −x −x
= −e e 0 = −e (0 − 1) = e , x > 0. So, f X (x) = .
0, otherwise
The marginal density of Y is,
+∞
Z ∞
Z ∞
Z ∞
Z
−(x+2y) −2y −x −2y
f Y (y) = f (x, y)dx = 2e dx = 2e e dy = −2e e −x d( − x) =
−∞ 0 0 0
2e −2y , y > 0
½
= −2e −2y [e −x ]∞
0 = −2e
−2y
(0 − 1) = 2e −2y , y > 0. So, f Y (y) = .
0, otherwise
We have that,
2e −x e −2y = 2e −(x+2y) ,
½
x > 0, y > 0
f X (x) f Y (y) = .
0, otherwise
So, f (x, y) = f X (x) f Y (y) for each (x, y), and therefore X and Y are independent.
f (x,y) −(x+2y)
3. f (x|y) = f Y (y) = 2e2e −2y = e −x for x > 0 and y > 0. So,
e −x , x > 0, y > 0
½
f (x|y) = .
0, otherwise
f (x,y) −(x+2y)
f (y|x) = f X (x) = 2e e −x = 2e −2y for x > 0 and y > 0. So,
∞ ∞ ∞ ∞
µ ¶
−x −x ∞−x ′
−x
R R R R
4. E (X |Y ) = x f (x|y)dx = xe dx = − x (e ) dx = − [xe ]0 − e dx =
−∞ n³ 0 ´ 0³ ´o 0
− [xe −x ]∞ −x ∞ −x −x
¡ ¢
0 + [e ]0 = − lim xe − 0 + lim e − 1 =
x→∞ x→∞
n³ ´ o
= − lim e −x − 0 + 0 − 1 = 1
x→∞
∞ ∞ ∞ ∞
Z Z Z Z
′ ∞
y f (x|y)d y = 2ye −2y d y = − y e −2y d y = − ye −2y 0 − e −2y d y =
¡ ¢ £ ¤
E (Y |X ) =
−∞ 0 0 0
∞
µ ¶
¤∞ R ¤∞ ¤∞ ¢
ye −2y 0 − e −2y d y = − ye −2y 0 + 12 e −2y 0 =
£ ¡£ £
=−
0
½µ ¶ µ ¶¾ µ ¶
1
−2y −2y y 1
= − lim ye − 0 + 2 lim e − 1 = − lim e 2y − 2 =
y→∞ y→∞ y→∞
µ ¶
= − lim 2e12y − 12 = 21
y→∞
26
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
d
hX i Xd
E Xj = EXj. (2.1)
j =1 j =1
d
hX i Xd
E h j (X j ) = E [h j (X j )].
j =1 j =1
E [X + Y ] = E [X ] + E [Y ] = µ X + µY .
So,
¡ ¢2 ¡ ¢2
var(X + Y ) = E [ X + Y − µ X − µY ] = E [ (X − µ X ) + (Y − µY ) ]
d
hX d X
d i
(X j − µ j )2 + 2
X
=E (X j − µ j )(X k − µk ) ( j < k)
j =1 j =1 k=1
d
X d X
X d
= var(X j ) + cov(X i X j ) ( j 6= k).
j =1 j =1 k=1
d
³X ´ Xd
var Xj = var(X j ).
j =1 j =1
27
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
1 X d 1 X d 1 X d
E [ X̄ ] = E [ Xj] = E [X j ] = µ = µ,
d j =1 d j =1 d j =1
and
à !
1 X d 1 Xd
var( X̄ ) = var Xj = 2 var(X j ) = d σ2 /d 2 = σ2 /d ,
d j =1 d i =1
since X j are independent.
So the mgf is the product of the mgfs of the terms in the sum. This result extends to d
variables very simply. The mgf of Z = dj=1 X j is just the product of the individual mgfs.
P
Thus,
d
Y
M Z (t ) = M X j (t ).
j =1
M(t ) = p exp(t ) + q.
Now assume that we have X 1 , X 2 , · · · , X d which are all independent and identically dis-
tributed with this distribution. The mgf of the sum Z = dj=1 X j is just the product of the
P
individual mgfs,
d
(p exp(t ) + q) = (p exp(t ) + q)d .
Y
M Z (t ) =
j =1
Looking up the list of mgfs in Statistical Tables we find that Z ∼ B(d , p) (Binomial).
Example 2.6.2. Suppose we have X 1, X 2 , · · · , X d which are all independent and distributed
with a Poisson distributions, say,
λxj exp(−λ j )
p[X j = x] = , x = 0, 1, 2, · · · .
x!
28
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
M X j (t ) = exp λ j (e t − 1) .
¡ ¢
Pd
So the mgf of the sum Z = j =1 X j is,
d
exp λ j (e t − 1) = exp Λ(e t − 1) ,
Y ¡ ¢ ¡ ¢
M Z (t ) = (2.2)
j =1
Pd
where Λ = j =1 λ j . From the form of the mgf in (2.2) we can deduce that Z ∼ Poi sson(Λ).
Example 2.6.3. Supopose we have X 1, X 2 , · · · , X d which are all independent and distributed
with a Normal distributions, say
( )
1 (x j − µ j )2
f (x j ) = q exp − ,
2πσ2j 2σ2j
d
Y
µ
1 2 2
¶ µ
1 2
¶
M Z (t ) = exp µ j t + σ j t = exp At + B t , (2.3)
j =1 2 2
where A = nj=1 µ j and B = nj=1 σ2j . From the form of the mgf in (2.3) we can deduce that
P P
Z ∼ N (A, B).
We have managed to show that the Normal and the Poisson have reproductive prop-
erties i.e., sums of independent Poissons are Poisson and sums of independent normals
are normal. We can show using mgfs as above that,
29
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
• If the X j are Poisson then their sum is Poisson. The parameter of the sum is the
sum of the parameters of the components, since E [S d ] = dj=1 E [X j ].
P
Often however the component variables X j are not Normal or Poisson, e.t.c., but we
would still like the distribution of S d = dj=1 X j . Finding the exact distribution is a hard
P
problem but we have some useful results in particular the Central Limit Theorem.
Suppose X 1 , X 2 , · · · , X d are independent and identically distributed with common
mean µ and variance σ2 . Define,
X̄ − µ
Z= p . (2.4)
σ/ d
Then the distribution of Z approaches that of a standard Normal variable for large d .
Informally the following are equivalent statements,
• S d is approximately N (d µ, d σ2 ).
For the exponential distribution the mean is 1/λ = 1/2 and the variance 1/λ2 = 1/4 - look
it up. From the central limit theorem we know that S 100 is approximately N (50, 25), so
S 100 − 50
P [52 < S 100 < 58] = P [2/5 < Z = < 8/5] = Φ(8/5) − Φ(2/5) = 0.2898.
5
x̄−µ
Similarly, since Z = p is N (0, 1),
σ/ 100
· ¸
0.52 − 1/2 S 100 /100 − 1/2 0.58 − 1/2
P [52 < S 100 < 58] = P p <Z = p < p
1/(2 100) 1/(2 100) 1/(2 100)
30
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
P [X j = 1] = p while P [X j = 0] = 1 − p.
Then
E [X j ] = p and v ar (X j ) = p(1 − p).
Using the central limit theorem we know that S d = dj=1 X j is N (d p, d p(1 − p)). We have
P
proved that a Binomial variable B(d , p) can be approximated by a normal variable, with
mean d p, variance d p(1 − p) for a large d . This is known as the normal approximation
to the Binomial. You may find it easier in some cases to work with the fraction of ones,
1
p̂ = Sd .
d
In this case sµ ¶
p(1 − p)
z = (p̂ − p)/
d
is standard normal.
Example 2.6.5. Suppose X is number of 6’s in 40 rolls of a die. By calling the CLT, approx-
imate X with N ( 40 , 40 5 ). Then
6 6 6
5 − 20/3
P [X < 5] = P [z < p ] = Φ(−0.7071) = 0.2398.
50/9
31
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Exercises
1. Let X and Y have the joint distribution given by the table of probabilities below
Y
X 1 2 3
1 0.1 0 0.1
2 0 0.2 0.2
3 0.1 0.2 0.1
3. Suppose that the random variables X and Y have the following joint density,
½
k, 0 ≤ x ≤ 2 , 0 ≤ y ≤ 1, 2y ≤ x
f (x, y) = .
0, otherwise
(a) Sketch the area over which X and Y are defined (the region of support).
(b) Find the value of k that makes this a valid density.
(c) Find Pr[X ≥ 3Y ].
(d) Find cov(X , Y ).
32
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
β j = β, j = 1, . . . , d .
7. A soft drink vending machine is set so that the amount of drink dispensed is a
random variable with mean 200 ml and a standard deviation of 15 ml. If a sample
of 36 drinks is taken what is the probability that the mean amount dispensed will
be at least 204 ml. Find the value M for which the probability that the sample
mean is least that M is 0.1.
8. A die is rolled 100 times. What is the probability that the number of sixes observed
is,
33
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
34
Chapter 3
Methods of estimation
> runif(20,0,1)
[1] 0.566791256 0.041486462 0.549447211 0.214432339 0.637779016 0.899486283
[7] 0.054652610 0.557283329 0.170380431 0.079230045 0.914602292 0.764830734
[13] 0.009130727 0.764550706 0.344527487 0.730142699 0.483321483 0.784094527
[19] 0.226894059 0.077518982
35
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Before universal computing people used tables of random numbers, see for example the
Statistical tables.
The ability to generate uniform numbers also gives us the chance to generate sam-
ples from other distributions. We need the following lemma:
Lemma 3.1.1. if X has a continuous cdf F then F (X ) ∼ U (0, 1). Moreover by the inversion
method X = F −1 (U ) is a sample from F .
3.2 Estimation
Suppose we are given a sample x1 , x2 , · · · , xn . We can think of our sample as being a
realization (observation) from a set of random variables X 1 , X 2 , · · · , X n whose joint dis-
tribution is f (x1 , x2 , · · · , xn ; θ). The question we ask ourselves is: What can we say about
the underlying distribution f (x1 , x2 , · · · , xn )? We will make a crucial assumption that the
sample is random. By this we mean that the joint density can be written as,
n
Y
f (x1 , x2 , · · · , xn ; θ) = f (xi ; θ) = f (x1 ; θ) f (x2 ; θ) · · · f (xn ; θ).
i =1
We say that the variables are independent and identically distributed (iid) and that we
have a sample from the distribution f (x; θ). In much of what we will discuss we will
make a second major assumption. This is that we know the form of the distribution
f (x; θ) except for some parameters θ. In such cases our interest is to estimate the pa-
rameters θ.
3.3.1 Bias
Suppose we have an estimator T of some parameter θ. We define the mead square error
(MSE) as,
E [(T − θ)2 ]. (3.1)
36
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
or
var(T ) + b(θ)2 ,
where b(θ) = E [T ] − θ is called the bias. One possibility is to seek unbiased estimates,
that is estimates where b(θ) = E [T ] − θ = 0 or equivalently E [T ] = θ.
Example 3.3.1. So if we have a random sample from a distribution with mean µ we know
that E [ X̄ ] = µ so X̄ is an unbiased estimate of µ.
Example 3.3.2. Suppose X 1 , X 2 , · · · , X n is a random sample from a distribution with mean
1 Pn
¢2
µ and variance σ2 . The unbiased estimator of σ2 is s 2 = n−1
¡
i =1 X i − X̄ .
We first show that,
n ¡
X ¢2 n ³
X ´2
Xi − µ = (X i − X̄ ) + ( X̄ − µ)
i =1 i =1
n h i
(X i − X̄ )2 + 2(X i − X̄ )( X̄ − µ) + ( X̄ − µ)2
X
=
i =1
n n n
(X i − X̄ )2 + 2( X̄ − µ) ( X̄ − µ)2
X X X
= (X i − X̄ ) +
i =1 i =1 i =1
n
(X i − X̄ )2 + 0 + n( X̄ − µ)2
X
=
i =1
So,
n n ¡ ¢2
(X i − X̄ )2 = X i − µ − n( X̄ − µ)2 .
X X
i =1 i =1
Then the expected value is,
n
hX i n ¡
hX ¢2 i
E (X i − X̄ )2 = E X i − µ − n( X̄ − µ)2
i =1 i =1
n ¡
hX ¢2 i h i
=E X i − µ − nE ( X̄ − µ)2
i =1
n h¡ ¢2 i h i
E X i − µ − nE ( X̄ − µ)2
X
=
i =1
n h i h i
E X i2 − 2µX i + µ2 − nE X̄ 2 − 2µ X̄ + µ2
X
=
i =1
n h i h i
E (X i2 ) − 2µE (X i ) + µ2 − n E ( X̄ 2 ) − 2µE ( X̄ ) + µ2
X
=
i =1
n h i h i
σ2 + µ2 − 2µ2 + µ2 − n σ2 /n + µ2 − 2µ2 + µ2
X
=
i =1
= (n − 1)σ2 .
37
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
3.3.2 Consistency
If we take large samples we expect to have more information that with a small sample
and our estimators should improve in some sense. We formalize this using the idea of a
consistent estimate. We say that an estimate T of θ is consistent if
Example 3.3.3. Since for a random sample, X̄ is unbiased for µ and var( X̄ ) = σ2 /n then
X̄ is consistent.
Example 3.4.1. Suppose we have the following sample x1 , x2 , · · · , x20 from an exponential
distribution with parameter λ, i.e., X i ∼ E xp(λ), i = 1, . . . , 20,
0.043 1.200 0.219 1.883 1.230 0.349 0.592 0.433 0.858 0.394
0.001 0.376 2.560 0.170 0.906 0.302 0.104 0.247 0.327 0.844
38
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
The sample mean is X̄ =0.6519 so µ̂ = 1/λ̂ = 0.6519 and we deduce that λ̂ = 1.5340.
Example 3.4.2. Suppose we have the following sample x1 , x2 , · · · , x20 from a normal dis-
tribution, i.e., X i ∼ N (µ, σ2 ), i = 1, . . . , 20,
4.948 20.498 19.985 -8.206 13.288 12.563 3.162 5.753
0.276 9.508 9.101 14.090 -19.981 27.302 19.962 6.960
-13.619 15.464 6.813 13.347
with sample mean X̄ =8.0606 and sample standard deviation s = 11.7025, we deduce, µ̂ =
8.0606 and σ̂ = 11.7025.
Example 3.4.3. Suppose we have the following sample x1 , x2 , · · · , x20 from from a uniform
distribution, i.e., X i ∼ U (α, β), i = 1, . . . , 20,
3.204 2.241 5.554 4.863 2.168 3.101 4.557 5.337 5.265 3.933 2.953
4.836 2.412 5.432 4.733 2.074 2.742 5.864 4.825 5.934 3.500 3.496
5.358 4.889 5.633
with sample mean X̄ = 4.1962 and sample standard deviation s = 1.2860. Now µ̂ = (α̂ +
β̂)/2 = 4.1962 and σ̂2 = (β̂ − α̂)2 /12 = 1.6537 and we solve for α̂ and β̂. We get α̂ = 3.0138
and β̂ = 5.3785.
The method of moments is the oldest method of deriving point estimators. It almost
always produces some asymptotically unbiased estimators, although they may not be
the best estimators.
Example 3.5.1. If we toss a coin 3 times then the probability distribution of the number
of heads X is, Ã !
3 x
Pr[X = x] = θ (1 − θ)3−x x = 0, 1, 2, 3,
x
where θ = Pr[ head]. If we observe 2 heads the x = 2 and the likelihood is
à !
3 2
L(θ) = θ (1 − θ) = 3θ 2 (1 − θ)
2
39
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
n
Y
L(θ) = θ exp(−θxi )
i =1
= θ exp(−θx1 )θ exp(−θx2 )θ exp(−θx3 ) · · · θ exp(−θxn )
à !
n
n
X
= θ exp −θ xi .
i =1
Example 3.5.3. For a random sample x1 , x2 , · · · , xn from a Normal distribution with mean
µ and variance σ2 the likelihood is,
( )
n 1
½
1
¾
1 1 n
L(µ, σ2 ) = exp − 2 (xi − µ)2 = (xi − µ)2 .
Y X
2 )1/2 2 )n/2
exp − 2
i =1 (2πσ 2σ (2πσ 2σ i =1
as it is simpler to do and the log-likelihood has the maximum in the same place as the
likelihood function itself. Below we summarize the steps to obtain the maximum like-
lihood estimates (MLE’s) of the parameters θ1 , . . . , θk . We will denote the MLE’s with
θ̂1 , . . . , θ̂k :
n
Y
L(θ1 , θ2 , . . . , θk ) = f (xi ; θ1 , θ2 , . . . , θk ).
i =1
40
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
3. Calculate the derivatives of ℓ with respect to the parameters and equate them to
0, ∂ℓ(θ ,θ ,...,θ )
1 2 k
∂θ1 =0
∂ℓ(θ1 ,θ2 ,...,θk )
=0
∂θ2
. (3.2)
..
.
∂ℓ(θ1 ,θ2 ,...,θk )
=0
∂θk
4. Solve the score equations in (3.2) to obtain the MLE’s, θ̂1 , . . . , θ̂k .
Example 3.5.4. Suppose that you have one observation x from the B(n, θ) distribution.
The above steps to derive the MLE of θ are as follows:
∂ℓ(θ)
= 0 (3.3)
∂θ
x n−x
− = 0.
θ 1−θ
Example 3.5.5. Suppose you have a sample x1 , x2 , · · · , xn from a Normal distribution with
mean µ and variance σ2 . The steps to derive the MLE’s of µ and σ2 are as follows:
41
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
n 1 X n
ℓ(µ, θ) = − log(2πσ2 ) − 2 (xi − µ)2 .
2 2σ i =1
∂ℓ(µ,σ2 )
∂µ =0
(3.4)
∂ℓ(µ,σ2 )
∂σ2
=0
1 Pn
σ2 i =1 i
(x − µ) = 0
n
+ σ13
Pn
− µ)2 = 0
−σ i =1 (x i
1X n
µ̂ = x̄ and σˆ2 = s ′2 = (xi − x̄)2 .
n i =1
• Consistency.
• Asymptotic normality.
• Efficiency.
• Invariance.
Asymptotic normality
The likelihood estimates are asymptotically unbiased and are normal with known vari-
ance. That is,
θ̂ ∼ N (θ, v 2 ),
where,
h¡ ∂ℓ(θ) ¢ i h ∂2 ℓ(θ) i
2
v 2 = 1/E = −1/E . (3.5)
∂θ ∂θ 2
Example 3.5.6. From the Example 3.5.4 we have that
∂ℓ(θ) x n − x
= − ,
∂θ θ 1−θ
so
∂2 ℓ(θ) x n−x
2
=− 2 − .
∂θ θ (1 − θ)2
42
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
∂2 ℓ(θ)
· ¸
E [x] E [n − x]
E = − −
∂θ 2 θ2 (1 − θ)2
as E [x] = nθ we have,
∂2 ℓ(θ)
· ¸
n n n
E 2
=− − =−
∂θ θ (1 − θ) θ(1 − θ)
∂2 ℓ(θ)
¸·
2
v = −1/E = θ(1 − θ)/n.
∂θ 2
The Cramér-Rao theorem shows that (given certain regularity conditions concerning the
distribution) the variance of any estimator of a parameter θ must be at least as large as,
∂2 ℓ(θ)
· ¸
Var(θ) ≥ −1/E .
∂θ 2
Hence any estimator that achieves this lower bound is efficient and no “better" estima-
tor is possible.
As you can see the variance of the MLE of θ in (3.5) is exactly equal to the Cramér-Rao
Lower bound. Therefore MLE is efficient.
Invariance
Example 3.5.7. Suppose you have a sample x1 , x2 , · · · , xn from a Normal distribution with
mean µ and variance σ2 . Find the MLE’s of (a) µ2 + σ2 , (b) Pr(X ≤ x0 ).
From Example 3.5.5 we have that µ̂ = X̄ and σ̂2 = s ′2 . So,
x0 − µ̂ x0 − X̄
Φ( ) = Φ( ).
σ̂ s′
43
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
The term 100(1 − α)% confidence interval is used to denote either an interval estimator
with coverage probability 1 − α or an interval estimate. For α is usually assigned a small
value, e.g. 0.1, 0.05, or 0.01. There are three methods of deriving confidence intervals,
θ̂ − θ
∼ N (0, 1).
v
Thus given α we can find the asymptotic confidence interval for θ since the coverage
probability involves critical values from the standard normal distribution(see Figure
3.1),
h θ̂ − θ i
Pr −z 1−a/2 = −Φ−1 (1 − α/2) ≤ ≤ Φ−1 (1 − α/2) = z 1−a/2 = 1 − α. (3.6)
v
44
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
θ̂−θ
Figure 3.1: The standard normal density curve. The red area indicates that |z = v | >
z 1−α/2 with probability α/2 in each tail and α totaly.
Y n 1 ³ 1 ´ 1 ³ 1X n ´
L(λ) = exp − xi = n exp − xi .
i =1 λ λ λ λ i =1
∂ℓ(λ)
= 0 (3.7)
∂λ
n 1 Xn
− + 2 xi = 0.
λ λ i =1
λ̂ = X̄ .
(b)
∂2 ℓ(λ) n 2 X n
= − xi .
∂λ2 λ2 λ3 i =1
45
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
∂2 ℓ(λ)
· ¸
v 2 = −1/E 2
= λ2 /n.
∂λ
46
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Exercises
1. Using the inversion method simulate from the Weibull distribution with with cu-
mulative distribution function,
β
F X (x) = 1 − e −x , x ≥ 0.
1
f (x, θ) = , −θ ≤ x ≤ θ.
2θ
Find the method of moments estimator of θ.
3. The dorsal length data below is thought to come from a lognormal distribution
with density,
1
exp (log y − µ)2 ,
© ª
f (y) = p y > 0.
y 2πσ2
The distribution is called the lognormal as the distribution of Y = exp(X ) where X
is normal with mean µ and variance σ2 .
10 12 18 64 54 19 115 73
190 15 21 84 44 18 37 43
55 57 60 73 82 175 50 80
65 19 43 54 16 10 17 52
43 63 23 44 95 20 41 17
15 70 36 82 29 29 61 22
40 12 22 16 30 16 116 28
32 17 11 95 27 16 55 8
11 33 26 29 85 20 67 27
44 49 29 30 35 17 26 32
76 16 82 27 5 6 51 75
23 150 6 84 22 47 9 10
28 29 21 73 52 130 50 45
Y 2 = 344223.
P
Note: Ȳ = 44.67 and
4. Find the maximum likelihood estimates of θ for a sample x1 , . . . , xn from the fol-
lowing distributions:
47
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
8 6 2 3 6 4 7 5 4 4
2 4 7 3 10 6 10 6 8 5
5 2 5 3 3 5 4 5 2 8
6 10 1 5 0 5 4 4 5 6
6 4 9 4 6 4 6 4 4 12
4 5 3 5 5 8 3 7 9 3
6 6 3 4 3 5 4 6 5 3
3 3 9 4 1 4 5 5 6 4
7 3 6 5 6 4 4 6 7 5
3 7 2 7 5 6 7 5 3 3
48
Chapter 4
Hypotheses
We often make assertions and as in many cases we have incomplete information, the
assertion is about a probability distribution. Such an assertion abut a distribution is
called a statistical hypothesis. It may be a simple hypothesis if it completely specifies the
distribution or a composite if it is not simple.
Typically the hypothesis has been put forward, either because it is believed to be true
or because it is to be used as a basis for argument.
Example 4.1.3. Suppose we have a coin which we wish to check is fair, that is P[Head]=1/2.
If we assume the coin is fair we are assuming that the number of heads, X , is Binomial
with p = 1/2.
Example 4.1.4. We might be interested in the birth weight of babies born in Sheffield com-
pared with those born in Brighton. Since we know (from the medics) that birth weights
are Normal we can think of the hypothesis that the Sheffield and Brighton weights have
Normal distributions with the same mean. This is not a simple hypothesis as the means
and variances are not specified.
In each problem we shall consider, the question of interest is simplified into two
competing hypotheses between which we have a choice;
49
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
These two competing hypotheses are not however treated on an equal basis, special
consideration is given to the null hypothesis. Usually the experiment has been carried
out in an attempt to prove or disprove a particular hypothesis, the null hypothesis. For
example,
against
H1 : there is a difference.
Of the two hypotheses the null is almost always simple in that it completely specifies the
underlying distribution, the alternative is often composite.
Truth
action H0 H1
Accept H0 ok Type II error
Reject H0 Type I error ok
The two errors are traditionally called the type one and type two errors and we will be
looking at their probabilities
50
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
such as 0.05 or 0.01. Then take the sample space of outcomes (x1 , x2 , . . . , xn ) and split it
into two parts C and R where C ∩ R is empty and C ∪ R is the whole space.
We choose C - the critical region - to be the set of unlikely points, that is the set of
outcomes which are unlikely under H0 . Then if we observe a set of points in C we have
two options
• Either H0 is false
In conventional testing we assume the second and say we reject the null hypothesis
and accept the alternative.
By this we mean that it is rational on the evidence to believe that the null is not true.
Recall
• The probability of observing the unlikely event is α the probability of a type I error-
often referred to as the size of the test.
If we do not reject the null hypothesis, it may still be false (a type II error) as the sample
may not be big enough to identify the falseness of the null hypothesis (especially if the
truth is very close to hypothesis). You should bear in mind that for any given set of data,
the type I and type II errors are inversely related; so the smaller the risk of one, the higher
the risk of the other.
As dealing with high dimensional spaces is difficult we usually base our tests on a
test statistic T computed from the observations. As above we find a critical region C
defined by
P [T ∈ C |H0 ] = α
Thus the region C is those values of T which are unlikely when H0 is true. Some workers
prefer the p-value. The probability value (p-value) of a statistical hypothesis test is the
probability of getting a value of the test statistic as extreme as or more extreme than that
observed by chance alone on the assumption that the null hypothesis H0 , is true.
4.2.1 Power
The power of a statistical hypothesis test measures the test’s ability to reject the null
hypothesis when it is actually false - that is, to make a correct decision. In other words,
the power of a hypothesis test is the probability of not committing a type II error that is
γ = 1 − β. Usually statisticians think of the power as a function of the parameter, So if we
have
H0 : θ = θ0 against H1 : θ = θ1
they would consider the power as being γ(θ1 ) a function of possible alternatives.
51
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
• The critical region C is those values of T which are unlikely when H0 is true.
• The power of a statistical hypothesis test is the probability of rejecting the null
hypothesis when it is false.
1. Set up H0 and H1
2. Pick a suitable test statistic T whose distribution is known under the assumptions
of H0 .
5. Compute T.
Often we can derive such a method from insight into the problem, as we shall see.
HH TH TT TH HT HH
52
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
We assume that X the number of heads in Binomial B(12,p) and our null hypothesis is
1
H0 : P [ Head ] = p =
2
while the alternative is
1
H1 : P [ Head ] = p >
2
1. One choice of statistic is just X the number of heads.
P γ(p)
0.5 0.01929
0.6 0.08344
0.7 0.25282
0.8 0.55835
0.9 0.88913
53
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
0.8
0.6
Power
0.4
0.2
0.0
Figure 4.1: The power as being γ(p) a function of possible alternatives with parameter p
= 0.5 to 0.9 in 0.004 increments.
The critical region depends on the alternative - try finding the critical region for
testing
1 1
H0 : p = against the alternative H1 : p >
2 2
1 1
H0 : p = against the alternative H1 : p 6=
2 2
then the critical region will consist of small values of X and large values of X. Hence we
have a region of the form X < c1 and X > c2 . We split our α between the two segments
so
54
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
c P[ X > c ] P[ X ≤ c ]
0 0.99976 0.00024
1 0.99683 0.00317
2 0.98071 0.01929
3 0.92700 0.07300
4 0.80615 0.19385
5 0.61279 0.38721
6 0.38721 0.61279
7 0.19385 0.80615
8 0.07300 0.92700
9 0.01929 0.98071
10 0.00317 0.99683
11 0.00024 0.99976
12 0.00000 1.00000
A Normal example
Suppose we have a sample of size 100 from a Normal distribution. We wish to test
H0 : µ = 68 against H1 : µ 6= 68
To simplify matters we assume that the standard deviation of the population is σ = 16.
A possible statistic is X̄ which is Normal N( µ, σ2 /n). Rather simpler is the standard-
ized random variable
X̄ − 68
z= p
σ/ n
which we know is standard Normal when H0 is true .
Now if the true mean is not 68 then z will be very different from zero. The distribution
of z is shown in the Figure 4.2 (c) and we see that the two areas in the tails must total α.
A little inspection gives us the critical regions as
Now if we choose α = 0.05 then we will reject H0 when z < −1.96 or z > 1.96. In our case
X̄ = 68.04. We know that σ = 16 so z = 0.025. This is not in the critical region so we
accept H0
This is somewhat unrealistic since if we are unsure of µ is is most unlikely that we
know σ! Here we have a large sample and for for large samples (exceeding 50) we can
use an estimate. Here σ̂ = 13.724 so
X̄ − 68
t= p = 0.029
σ̂/ n
and in consequence we accept H0 .
What is the critical region for the alternative H1 : µ > 68?
55
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
(a) z > z 1−α (b) z < −z 1−α (c) |z| > z 1−α/2
Figure 4.2: Critical regions for testing H0 : µ = µ0 against (a) H1 : µ > µ0 , (b) H1 : µ < µ0 ,
x̄−µ
and (c) H1 : µ 6= µ0 , based on z = p 2 0 ∼ N (0, 1).
s /n
H0 : µ = 36 against H1 : µ < 36
Then
30.3 − 36
t=p = −4.49
16.08/10
Now we find the critical region using the t distribution with 10-1=9 degrees of freedom.
The critical value is -1.833, for a test size of 0.05 (check this!) so we have a value of t in
the critical region and we reject H0 .
Summary
We have deduced the procedure given in statistical recipe books. Suppose we are given
a random sample X 1 , X 2 , · · · , X n from a N(µ, σ2 ) distribution and we wish to test
56
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
then we compute
• For large samples ( n > 50) we compute
x̄ − µ0 1 X n
z=p where s 2 = (xi − x̄)2
2
s /n n − 1 i =1
and we reject H0 when z > z 1−α , z < −z 1−α , |z| > z 1−α/2 , see Figure 4.2.
• When the sample size is small, n < 50, we use the t distribution with n − 1 degrees
of freedom and compute
x̄ − µ0 1 X n
t=p where s 2 = (xi − x̄)2
2
s /n n − 1 i =1
and we reject H0 when t > t1−α , t < −t1−α , |t | > t1−α/2 using Student’s t with
n − 1 degrees of freedom, see Figure 4.3.
Figure 4.3: Critical regions for testing H0 : µ = µ0 against (a) H1 : µ > µ0 , (b) H1 : µ < µ0 ,
x̄−µ
and (c) H1 : µ 6= µ0 , based on t = p 2 0 ∼ tn−1 .
s /n
Variances
57
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
(a) X 2 > χ21−α,n−1 (b) X 2 < χ2α,n−1 (c) X 2 > χ21−α/2,n−1 or X 2 < χ2α/2,n−1
Figure 4.4: Rejection regions for testing H0 : σ2 = σ20 against (a) H1 : σ2 > σ20 , (b) H1 : σ2 <
2
σ20 , and (c) H1 : σ2 6= σ20 based on X 2 = (n−1)s
σ2
2
∼ X n−1 .
0
Example
We are given a random sample with n = 10 from a N(µ, σ2 ), where s 2 = 12.6. For α = 0.05
we test
H0 : σ2 = 9 against H1 : σ2 > 9.
(n−1)s 2 9×12.6
The statistic is X 2 = σ20
with critical region X 2 > χ21−α,n−1 . Then X 2 = 9 = 12.6
and χ20.95,9 = 16.919. So we have a value of not in the critical region and we cannot reject
H0 .
A binomial/normal example
We can use for large samples the normal approximation to the binomial, to turn what are
essentially binomial problems in to normal ones. The binomial distribution is the dis-
crete probability distribution of the number of successes in a sequence of n independent
experiments, each of which yields success with probability p. If X ∼ B(n, p) (that is, X
is a binomially distributed random variable), then the expected value of X is E (X ) = np
and the variance is V ar (X ) = np(1 − p). Therefore if n is large (n > 50) we can approxi-
mate the binomial distribution with the normal distribution N (µ = np, σ2 = npq) or the
p
standard normal Z = (X − np)/ npq. According to this approximation we test
by computing
x̄ − np 0
z=p
np 0 (1 − p 0 )
and we reject H0 when z > z 1−α , z < −z 1−α , |z| > z 1−α/2 .
58
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
A crossover trial
Suppose we wish to test the effects of two kinds of medication A and B on reducing blood
pressure in males. We intend to teat n patients for five weeks on A and five weeks on B.
The order of application will be randomized. The response is the average blood pressure
in the third week of each treatment.
We assume that the responses A i , B i , i = 1, 2, · · · , n are normal and look at the differences
D i = A i − B i , i = 1, 2, · · · , n. The test is
H0 : µD = 0 against H1 : µD > 0
Now the background to this experiment is that A is the current treatment and we expect
to see a change from A to B whose size is about 1/2 a standard deviation.
Back to definitions - in general
α = P (z > z 1−a |H0 : µ = µ0 ) ⇐⇒ 0.05 = P (z > z 0.95 |H0 : µ = 0) ⇐⇒ P (z ≤ z 0.95 |H0 : µ = 0) = 0.95.
From the statistical tables (inverse normal) one can see that z 0.95 = 1.645. Therefore the
critical region is
D̄ − µ0 p
z > 1.645 ⇐⇒ p > 1.645 ⇐⇒ D̄ > µ0 + 1.645σ/ n
σ/ n
p
D̄ − µ1 µ0 + 1.645σ/ n − µ0 − σ/2 σ
0.1 = P [ p ≤ p |H1 : µ = µ1 = µ0 + ] ⇐⇒
σ/ n σ/ n 2
p
1.645σ/ n − σ/2 σ
0.1 = P [z ≤ p |H1 : µ = ] ⇐⇒
σ/ n 2
p σ
0.1 = P [z ≤ 1.645 − n/2|H1 : µ = ].
2
From the statistical tables (inverse normal) one can see that z 0.1 = −1.2816. So
p
1.645 − n/2 = −1.2816 ⇐⇒ n = 35.
59
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
A Binomial/Normal case
A medic knows that of patients admitted to hospital for cardiac problem 60% will be
readmitted on an emergency basis within 2 months. She believes that treatment with X
will reduce this readmission level by half, that is to 30%. To check her theory she must
experiment on patients and to gain permission to do so she must show that her experi-
ment has a reasonable chance of detecting a change and uses the minimum number of
patients.
This could be framed as a Binomial, with p the probability of readmission. Suppose
we take n patients and administer the drug to them. We set up the hypotheses
p
h R − np n × 0.6 − 1.645 n × 0.6 × 0.4 − n × 0.3 i
0.1 = P z = p ≥ p |p = 0.3 ⇐⇒
np(1 − p) n × 0.7 × 0.3
p
h n × 0.6 − 1.645 n × 0.6 × 0.4 − n × 0.3 i
0.9 = P z < p .
n × 0.7 × 0.3
From the statistical tables (inverse normal) one can see that z 0.9 = 1.2816. Therefore
p
n × 0.6 − 1.645 n × 0.6 × 0.4 − n × 0.3
1.2816 = p ⇐⇒ · · · ⇐⇒ n = 22.
n × 0.7 × 0.3
60
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Exercises
1. Suppose we are given a random sample X 1 , X 2 , · · · , X 9 from a N(µ, 1) distribution.
To test
H0 : µ = 20 against H1 : µ = 21.5
5. Suppose we are given a random sample from a N(µ, 25) distribution. If x̄ = 42.6, n =
36 then with α = 0.05 test
61
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
7. A random sample of 10 is taken from a population with distribution N(µ, σ2), where
s 2 = 12.6.
8. A tobacco company believes that the percentage of the smokers which prefer its
cigarettes is at least 65%. To test this belief, 500 smokers were interviewed and 311
of them declared that they prefer this type of cigarettes. With α = 0.05 test the
belief of the tobacco company.
62
Chapter 5
A good test should also have a small probability of type II error. In other words it
should also be a powerful test.
H0 : θ = θ0 against H1 : θ = θ1
with significance level α and maximum power γ = 1 − β among all the other statistical
tests with the same significance level. Such a test T with the maximum power is said to
be the most powerful (MP) test.
The following theorem which is known as the Neyman–Pearson Lemma solve the
problem of the existence and construction of MP tests for testing a simple null hypoth-
esis against a simple alternative hypothesis.
Lemma 5.1.1. Suppose we have a sample x1 , x2 , · · · , xn and have two simple hypotheses
H0 and H1 . Suppose the likelihood, is L (H0 ) under H0 and L (H1 ) under H1 . The most
powerful test of H0 against H1 has a critical region of the form
L (H0 )
≤ a constant. (5.1)
L (H1 )
Proof.
We give the proof for continuous random variables. For discrete random variables
just replace integrals with sums. For the definition of the critical region of significance
level α, Z Z
α = P (X ∈ C | f 0 ) = ... L (H0 )d x.
C
63
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
C = (C ∩ D) ∪ (C ∩ D ′ ) and D = (D ∩C ) ∪ (D ∩C ′).
So, Z Z Z Z Z Z
α= ... L (H0 )d x = ... L (H0 )d x + ... L (H0 )d x
C C ∩D C ∩D ′
and Z Z Z Z Z Z
α= ... L (H0 )d x = ... L (H0 )d x + ... L (H0 )d x
D D∩C D∩C ′
Therefore Z Z Z Z
... L (H0 )d x = ... L (H0 )d x. (5.2)
C ∩D ′ D∩C ′
(H0 )
Inside C we have the Neyman-Pearson region so L L (H1 )
≤ k (a constant) or L (H1 ) ≥
L (H0 )/k, while outside C we have the non Neyman-Pearson region L (H1 ) ≤ L (H0 )/k.
Accordingly from Eq. (5.2),
Z Z Z Z Z Z Z Z
... L (H1 )d x ≥ ... L (H0 )/kd x = ... L (H0 )/kd x ≥ ... L (H1 )d x
C ∩D ′ C ∩D ′ D∩C ′ D∩C ′
(5.3)
R R
Adding . . . C ∩D L (H1 )d x in Eq. (5.3),
Z Z Z Z Z Z Z Z
... L (H1 )d x+ ... L (H1 )d x ≥ ... L (H1 )d x+ ... L (H1 )d x ⇐⇒
C ∩D C ∩D ′ D∩C D∩C ′
Z Z Z Z
power of C = ... L (H1 )d x ≥ ... L (H1 )d x = power of D.
C D
To reiterate, by most powerful we mean that any other test will have power which is
less that or equal to the power of the test based on the Neyman-Pearson critical region.
This enables us to construct the critical region of the MP test in the sense that any other
test will have inferior power.
In the most common case we have a random sample from a distribution f (x). We
can then formulate the lemma as:
64
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
Proposition 5.1.1. Suppose we have a null hypothesis which completely specifies the dis-
tribution f (x), say
H0 : f (x) = f 0 (x)
where f 0 (x) is a known function. We further assume that the alternative is
H1 : f (x) = f 1 (x)
H0 : θ = θ0 against H1 : θ = θ1 > θ0
n o
= x : (θ1 − θ0 )x ≤ k 3
n o
= x : x ≤ k 4 since θ1 > θ0
k 4 = − log(1 − α)/θ0 .
Therefore the test with statistic x and critical region {x : x ≤ − log(1 − α)/θ0 } is the MP
test of
H0 : θ = θ0 against H1 : θ = θ1 > θ0 .
65
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
The two values of the likelihood functions L (H0 ) and L (H1 ) are
¶n ( )
µ
1 1 Xn
2
L (H0 ) = p exp − 2 (xi − µ0 )
2πσ2 2σ i =1
¶n ( )
µ
1 1 Xn
L (H1 ) = p exp − 2 (xi − µ1 )2
2πσ2 2σ i =1
Then the critical region takes the form
nL (H0 ) o
C = x: ≤ k1
L (H1 )
( " #)
n 1 X n n o
2 2
X
= x : exp (x i − µ 1 ) − (x i − µ0 ) ≤ k 1
2σ2 i =1 i =1
à !
n 1 X n n o
2 2
X
= x: (x i − µ1 ) − (x i − µ 0 ) ≤ k 2
2σ2 i =1 i =1
n n o
x : n(µ21 − µ20 ) + 2(µ0 − µ1 )
X
= xi ≤ k3
i =1
n o
= x : x̄ ≥ k 4 since µ1 > µ0
α = P (x̄ ≥ k 4 |µ = µ0 ).
66
A. K. Nikoloulopoulos Mathematical Statistics CMP5034A
H0 : p = p 0 against H1 : p = p 1 < p 0
For a given α Neyman Pearson gives the critical region
n L (H0 ) o
C = x: ≤ k1
L (H1 )
¡n ¢ x
n
x
p 0 (1 − p 0 )n−x o
= x: n x
¡ ¢ ≤ k1
x
p 1 (1 − p 1 )n−x
µ ¶
n 1 − p0 o
= x : x log(p 0 /p 1 ) − x log ≤ k2
1 − p1
µ ¶
n p0 − p1 p0 o
= x : x log ≤ k2
p1 − p0 p1
µ ¶
n o p0 − p1 p0
= x : x ≤ k 3 since > 1 ⇐⇒ p 1 < p 0
p1 − p0 p1
Now suppose we try to go a bit further, with p_0 = 0.5 and n = 12. If we choose α = 0.05 then we need

α = 0.05 = P[x ≤ k_3].

Checking the Binomial tables we see that no such k_3 exists. We can choose k_3 to give a variety of α values, but we cannot attain 0.05, as the table below shows:

k          0      1      2      3      4      5
P[X ≤ k]   0.000  0.003  0.019  0.073  0.194  0.387

This is a theoretical problem, and some would argue that it invalidates the whole scheme, so that Neyman-Pearson cannot be applied in this case. It is possible, using randomization methods, to come up with a modified scheme, but I have never seen it used. In practice one chooses the best α value from those available; we shall refer to these tests as non-randomized when we want to be quite precise.
Randomization procedure

One rejects H0 whenever x ≤ 2 and, when x = 3, rejects with probability γ chosen so that 0.019 + γ(0.073 − 0.019) = 0.05, i.e. γ ≈ 0.574. The resulting randomized test then has size exactly 0.05 on average.
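The attainable α values and the randomization probability γ can be verified with a few lines of code (a sketch using scipy; the numbers match the table above):

    from scipy.stats import binom

    n, p0, alpha = 12, 0.5, 0.05

    # Attainable sizes P[X <= k] of the non-randomized test
    for k in range(6):
        print(k, round(binom.cdf(k, n, p0), 3))    # 0.000, 0.003, 0.019, 0.073, ...

    # Randomized test: reject if X <= 2; if X == 3, reject with probability gamma
    gamma = (alpha - binom.cdf(2, n, p0)) / binom.pmf(3, n, p0)
    print(gamma)                                    # approx 0.574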
Exercises

1. Suppose we have one observation x from an exponential distribution f(x) = θ exp(−θx). Find the most powerful test of

   H0 : θ = θ0 against H1 : θ = θ1 > θ0.

2. Find the most powerful test of

   H0 : σ² = 1 against H1 : σ² = 2.

3. Find the most powerful test of

   H0 : θ = 2 against H1 : θ = 4.

   Hint: recall that the sum Σ_{i=1}^n X_i of independent G(1, 2) random variables is G(n, 2) ≡ χ²_{2n}.

4. (a) Find the UMP test with significance level α = 0.05 for testing

   H0 : μ = 1 against H1 : μ > 1.
Chapter 6

Likelihood ratio tests

Suppose now we wish to test

H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1.

For simple hypotheses the Neyman-Pearson lemma uses the ratio

λ = L(H0)/L(H1),

and we can do the same again for composite hypotheses. Of course there may be unspecified parameters, so we choose to consider

λ = max_{θ∈Θ0} L / max_{θ∈Θ1} L.
Suppose we have a random sample x_1, . . . , x_n from N(μ, σ²) with σ² unknown, and we wish to test H0 : μ = μ0 against H1 : μ ≠ μ0. Both hypotheses are composite since σ² is unknown. Therefore we will apply the likelihood ratio test. The likelihood is

L(μ, σ² | x) = (1/√(2πσ²))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² }.   (6.1)
To find max_{θ∈Θ0} L, substitute μ = μ0 in Eq. (6.1) and maximize with respect to σ², i.e., find the MLE of σ² when μ = μ0 is known. So

log L(x; μ0, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − μ0)²

and

∂ log L(x; μ0, σ²)/∂σ² = −n/(2σ²) + Σ_{i=1}^n (x_i − μ0)²/(2σ⁴) = 0.

Therefore,

σ̂² = (1/n) Σ (x_i − μ0)²  and  max_{θ∈Θ0} L = (1/√(2πσ̂²))^n exp(−n/2).
To find max_{θ∈Θ1} L we are looking for the MLEs of μ and σ². It is known that the latter are

μ̂ = x̄,  σ̂² = (1/n) Σ (x_i − x̄)² := s′².

Therefore

max_{θ∈Θ1} L = (1/√(2πs′²))^n exp(−n/2).
Substituting into the likelihood ratio gives

λ = ( Σ_{i=1}^n (x_i − x̄)² / Σ_{i=1}^n (x_i − μ0)² )^{n/2}.
Writing Σ(x_i − μ0)² = Σ(x_i − x̄)² + n(x̄ − μ0)² and t = √n (x̄ − μ0)/s, where s² = Σ(x_i − x̄)²/(n − 1), the critical region {x : λ ≤ k} becomes

C = {x : (1 + t²/(n − 1))^{−n/2} ≤ k}
  = {x : 1 + t²/(n − 1) ≥ k^{−2/n}}
  = {x : t² ≥ k_1}
  = {x : |t| ≥ √k_1 = k_2},

with

α = P(|t| ≥ k_2 | μ = μ0).   (6.2)

We know that t ∼ t_{n−1} under the null hypothesis H0, so from Eq. (6.2), k_2 = t_{n−1,1−α/2}. Therefore the test with statistic t and critical region {x : |t| ≥ t_{n−1,1−α/2}} is a likelihood ratio test of

H0 : μ = μ0 against H1 : μ ≠ μ0.
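In practice this is the familiar one-sample t-test. A sketch with invented data (the values of x, μ0 and α below are assumptions for illustration only):

    import numpy as np
    from scipy import stats

    # Invented data and null value, for illustration only
    x = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.6, 0.7])
    mu0, alpha = 1.0, 0.05

    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    k2 = stats.t.ppf(1 - alpha / 2, df=n - 1)      # t_{n-1, 1-alpha/2}
    print(t, k2, abs(t) > k2)                      # reject H0 if the last entry is True

    # The same test via the library routine
    print(stats.ttest_1samp(x, popmean=mu0))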
A general large-sample result (Wilks' theorem) states that the statistic

Λ = −2 log λ   (6.3)

has a χ²_{r−s} distribution, where r is the number of parameters estimated under H1 and s the number of parameters estimated under H0. This is a large-sample approximation, but it enables us to produce tests in a wide variety of situations.
Suppose now we have independent random samples x_1, . . . , x_n from N(μx, σx²) and y_1, . . . , y_m from N(μy, σy²), and we wish to test

H0 : σx = σy against H1 : σx ≠ σy.
Both hypotheses are composite, so we apply the likelihood ratio test. Since the two random samples are independent, the likelihood is

L(σx, σy, μx, μy | x, y) = (1/√(2πσx²))^n exp{ −(1/(2σx²)) Σ_{i=1}^n (x_i − μx)² } × (1/√(2πσy²))^m exp{ −(1/(2σy²)) Σ_{i=1}^m (y_i − μy)² },
so that

log L(σx, σy, μx, μy | x, y) = −(n/2) log 2π − (n/2) log σx² − (1/(2σx²)) Σ_{i=1}^n (x_i − μx)²
                               − (m/2) log 2π − (m/2) log σy² − (1/(2σy²)) Σ_{i=1}^m (y_i − μy)².

To find max_{θ∈Θ0} L we set σx = σy = σ and solve

∂ log L(σ, μx, μy | x, y)/∂μx = (1/σ²) Σ_{i=1}^n (x_i − μx) = 0

∂ log L(σ, μx, μy | x, y)/∂μy = (1/σ²) Σ_{i=1}^m (y_i − μy) = 0

∂ log L(σ, μx, μy | x, y)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μx)² − m/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^m (y_i − μy)² = 0,

which give

μ̂x = x̄,  μ̂y = ȳ,  σ̂² = (1/(n + m)) ( Σ (x_i − x̄)² + Σ (y_i − ȳ)² ).
To find max_{θ∈Θ1} L we are looking for the MLEs of μx, σx², μy, σy². These are

μ̂x = x̄,  σ̂x² = (1/n) Σ (x_i − x̄)² := s′x²,
μ̂y = ȳ,  σ̂y² = (1/m) Σ (y_i − ȳ)² := s′y².
Substituting into the likelihood ratio, some algebra reduces this to

λ = σ̂^{−(n+m)} / ( σ̂x^{−n} σ̂y^{−m} ).

Taking logs,

Λ = −2 log λ = 2(n + m) log(σ̂) − 2n log(σ̂x) − 2m log(σ̂y).

There are 4 parameters, all of which had to be estimated under H1, and 3 under H0. It follows that Λ is approximately χ²₁.
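The statistic Λ is easy to compute directly. The following sketch uses invented samples x and y purely for illustration:

    import numpy as np
    from scipy.stats import chi2

    # Invented samples, for illustration only
    x = np.array([4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0])
    y = np.array([6.2, 4.0, 7.1, 3.5, 6.8, 4.9])
    n, m = len(x), len(y)

    sx2 = np.mean((x - x.mean()) ** 2)      # MLE of sigma_x^2 (divisor n)
    sy2 = np.mean((y - y.mean()) ** 2)      # MLE of sigma_y^2 (divisor m)
    s2 = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / (n + m)  # under H0

    # Lambda = 2(n+m) log sigma-hat - 2n log sigma_x-hat - 2m log sigma_y-hat,
    # written in terms of the variances (since 2 log sigma = log sigma^2)
    Lambda = (n + m) * np.log(s2) - n * np.log(sx2) - m * np.log(sy2)
    print(Lambda, 1 - chi2.cdf(Lambda, df=1))   # compare with chi^2_1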
6.3 Goodness-of-fit tests

A common likelihood-ratio based test is the goodness-of-fit test.

6.3.1 Categories

Suppose we have an experiment which has k mutually exclusive outcomes, which we label A_1, A_2, · · · , A_k. Suppose further that we repeat our experiment n times and find that the number of outcomes in A_j is n_j, j = 1, 2, · · · , k. In addition, we write the probability of an outcome being in A_j as p_j, j = 1, 2, · · · , k. Clearly Σ_{j=1}^k n_j = n and Σ_{j=1}^k p_j = 1.

This simple model has many applications: you could think of asking questions in a survey with the A_j as categories of answers, or the A_j could correspond to the bins of a histogram. To proceed any further we need a bit more theory.
The log-likelihood of this model is

ℓ(p) = Σ_{j=1}^k n_j log(p_j) + constant.

To maximize it subject to Σ_j p_j = 1 we introduce a Lagrange multiplier λ and set Φ = ℓ(p) + λ(Σ_j p_j − 1). Then

∂Φ/∂p_1 = n_1/p_1 + λ = 0
∂Φ/∂p_2 = n_2/p_2 + λ = 0
...
∂Φ/∂p_k = n_k/p_k + λ = 0
∂Φ/∂λ = Σ p_j − 1 = 0.

These equations imply

n_1/p_1 = n_2/p_2 = · · · = n_k/p_k = Σ n_j / Σ p_j = n/1 = n.

Hence it is easy to show that

p̂_j = n_j/n.
Of course you could say that an observation either falls in A_j or not, a Binomial problem. The maximum likelihood estimate of p_j is then just that for the Binomial probability, p̂_j = n_j/n.
To test a null hypothesis H0 which specifies the cell probabilities against the unrestricted alternative H1, the likelihood ratio is

λ = Π_{j=1}^k p̂_j^{n_j} / Π_{j=1}^k p̆_j^{n_j} = Π_{j=1}^k ( p̂_j / p̆_j )^{n_j},
where the p̂_j are the estimates under H0 and the p̆_j are the estimates under H1. Since H1 leaves the probabilities unspecified, we need their estimates; however we know that

p̆_j = n_j/n,  j = 1, 2, · · · , k,

so

λ = Π_{j=1}^k p̂_j^{n_j} / (n_j/n)^{n_j} = Π_{j=1}^k ( n p̂_j / n_j )^{n_j},
or

Λ = −2 log λ = 2 Σ_{j=1}^k n_j log( n_j / (n p̂_j) ).
Our statistic is often written in the asymptotically equivalent form

X² = Σ_{j=1}^k (n_j − ê_j)²/ê_j,  where ê_j = n p̂_j.   (6.5)

Note here that if the probabilities p_j under the null hypothesis are known and do not need to be estimated, then the chi-square statistic reduces to the form

X² = Σ_{j=1}^k (n_j − e_j)²/e_j,  where e_j = n p_j.   (6.6)
Example 6.3.1. An experimenter bred flowers and found the following numbers in each of the four possible classes:

Class        A1     A2    A3    A4
Number       120    48    36    13
Probability  9/16   3/16  3/16  1/16

The table also includes the probability of falling in each class, according to established theory. We aim to test whether these probabilities fit the data. The expected frequencies are:

class  p_j   n_j  e_j = n p_j
1      9/16  120  122.06
2      3/16  48   40.69
3      3/16  36   40.69
4      1/16  13   13.56

The likelihood ratio, since the probabilities under the null hypothesis are given, is

λ = Π_{j=1}^k ( n p_j / n_j )^{n_j},
and we find, after some arithmetic using Eq. (6.6), that Λ is approximately 1.9. We know that Λ = −2 log(λ) is chi-squared. Here we have k − 1 = 3 unspecified parameters under H1 and none under H0, so the number of degrees of freedom is 3. We reject H0 if Λ is large, i.e. if it exceeds the 95% point of χ²₃, which is 7.815. In this case we accept H0.
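Both Λ and the Pearson form X² of Eq. (6.6) can be checked numerically; a sketch (scipy.stats.chisquare computes the Pearson statistic):

    import numpy as np
    from scipy.stats import chi2, chisquare

    observed = np.array([120, 48, 36, 13])
    expected = observed.sum() * np.array([9, 3, 3, 1]) / 16   # 122.06, 40.69, 40.69, 13.56

    Lambda = 2 * np.sum(observed * np.log(observed / expected))  # LR statistic
    X2, pval = chisquare(observed, expected)                     # Pearson form, Eq. (6.6)
    print(Lambda, X2, chi2.ppf(0.95, df=3))                      # both about 1.9, vs 7.815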
As a further example, consider 320 families each with 5 children, and model the number of boys in a family as Binomial B(5, p), where p = P[boy]. We can see if this distribution fits the data when p = 1/2. Calculating the expected probabilities,

P[0 boys in 5] = (1/2)⁵ = 1/32
P[1 boy in 5] = 5 · (1/2)⁵ = 5/32
· · ·

we arrive at the following table:
No boys               0    1    2    3    4    5    total
no. families (n_j)    8    40   88   110  56   18   320
expected (e_j = np_j) 10   50   100  100  50   10   320

The statistic in Eq. (6.6) is Λ = 11.96, and the degrees of freedom are (6 − 1) − 0 = 5; we only need 5 of the probabilities since the remaining one follows from the fact that they must add up to one. The upper 5% point of χ²₅ is 11.07, so in this case we reject H0 and conclude our model is wrong.

We know however that in general P[boy] > 1/2. Indeed, from our data the proportion of boys is p̂ = 0.5375. We can revisit our Binomial model, but in this case we use p̂ = 0.5375. The expectations are harder to compute; I get

No boys                 0    1     2     3      4     5     total
no. families (n_j)      8    40    88    110    56    18    320
expected (ê_j = np̂_j)   6.8  39.3  91.5  106.3  61.8  14.4  320

and Λ = 1.93. We now have (6 − 1) − 1 = 4 degrees of freedom: we estimate 5 free probabilities under H1 and one parameter under H0. The conclusion is that this fits the data very well.
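A sketch reproducing both fits (the expected counts come from the binomial probability mass function; the approximate values in the comments are recomputed, not quoted):

    import numpy as np
    from scipy.stats import binom

    counts = np.array([8, 40, 88, 110, 56, 18])     # families with 0..5 boys
    n_fam = counts.sum()                             # 320

    for p in (0.5, 0.5375):
        expected = n_fam * binom.pmf(np.arange(6), 5, p)
        X2 = np.sum((counts - expected) ** 2 / expected)
        print(p, X2)
    # p = 0.5    : X2 about 12.0 > 11.07 (5% point of chi^2_5), so reject H0
    # p = 0.5375 : X2 about 1.9, with df = 4 after estimating p, so accept H0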
Example 6.3.4. Based on the data in the following table, can we assume that the parent distribution is normal?

classes     frequencies
≤ 39.5      6
39.5–44.5   13
44.5–49.5   40
49.5–54.5   65
54.5–59.5   52
≥ 59.5      24

To test the null hypothesis that the data follow a normal distribution, against the alternative that they follow any other distribution, we need to:

1. Derive the maximum likelihood estimates of μ, σ², i.e.,

   μ̂ = x̄ = (1/n) Σ n_i x_i,  σ̂² = s′² = (1/n) ( Σ n_i x_i² − (Σ n_i x_i)²/n ),

   where x_i is the centre of each interval and n_i the observed frequency. The centres of the classes are 37, 42, 47, 52, 57, 62 (e.g., for the class 44.5–49.5 we have (44.5 + 49.5)/2 = 47). After some calculations,

   μ̂ = 52.4 and σ̂² = 36.77 = 6.06².

2. Calculate the estimated probabilities p̂_j from N(μ̂, σ̂²) = N(52.4, 36.77). For example,

   P(x ≤ 39.5) = P( (x − 52.4)/6.06 ≤ (39.5 − 52.4)/6.06 ) = P(z ≤ −2.13) = 0.5 − P(0 < z < 2.13) = 0.0166.

3. Calculate the expected (estimated) frequencies ê_j = n p̂_j. For example, ê_1 = 200 × 0.0166 = 3.32.

So we derive the following table:
classes     n_j   p̂_j     ê_j = n p̂_j
≤ 39.5      6     0.0166  3.32
39.5–44.5   13    0.0802  16.04
44.5–49.5   40    0.2188  43.76
49.5–54.5   65    0.3212  64.24
54.5–59.5   52    0.2422  48.44
≥ 59.5      24    0.1210  24.20
total       200   1       200
One useful rule of thumb is to try to ensure that none of the expected values are less than 5. This is a rather mysterious number, but if we examine our chi-squared approximation carefully we see it breaks down when we have small expected values. Therefore we merge the classes ≤ 39.5 and 39.5–44.5, so that we finally have 5 classes.

According to the table we find, using Eq. (6.5), that Λ = Σ_{j=1}^5 (n_j − ê_j)²/ê_j = 0.6021. We know that Λ is χ²_{r−s} = χ²_{4−2} = χ²₂, and we accept H0 at the 1% level since Λ does not lie in the critical region Λ ≥ χ²_{2,0.99} = 9.210; we conclude that the distribution is indeed normal.
Exercises

1. Suppose we are given a random sample X_1, X_2, · · · , X_n from a N(μ, σ²) distribution, where σ² is known. Find the critical region of significance level α for testing

   H0 : μ = μ0 against H1 : μ ≠ μ0.

2. Find a likelihood ratio test of

   H0 : m_1 = m_2 against H1 : m_1 ≠ m_2.

3. We toss a die 60 times. The results and their frequencies are in the following table. Test whether the die is fair.

   result     1   2   3   4  5  6
   frequency  13  19  11  8  5  4

4. The table below shows the number of accidents in a year for 400 drivers. Test whether a Poisson distribution fits these data.

   number of accidents  0   1    2   3   4   5  6  7
   frequency            89  143  94  42  20  8  3  1
Chapter 7
Nonparametric inference
In much inference, and almost all that you have met so far, we have assumed that the form of the distribution f(x, θ_1, . . . , θ_p) is known except for some parameters θ_1, . . . , θ_p. In some cases we may feel that actually specifying the form of f(x, θ_1, . . . , θ_p) is unrealistic. This is often so when the sample size is small. Our aim is to construct methods of inference which do not make distributional assumptions, or which are relatively insensitive to these assumptions. This leads us to nonparametric inference and to robust methods, as opposed to the classical parametric methods that you met in previous chapters. Another possibility is computationally intensive methods such as the Bootstrap.
Example 7.1.1. For example, suppose we have the following data set of 10 observations:

31.0 31.4 33.3 33.4 33.5 33.7 34.4 34.9 36.2 37.0

Would it be reasonable to assume that the observations are drawn from N(μ = 32, σ² = 3.24)?

We will use the Kolmogorov-Smirnov goodness-of-fit statistic D_n = max_x |F_n(x) − F_0(x)|, the largest discrepancy between the empirical distribution function F_n(x) and the hypothesized cumulative distribution function F_0(x); the critical region of significance level α is {D_n ≥ d_{n,α}}, with d_{n,α} taken from tables. Given that F_0(x) is the cumulative distribution function of the N(μ = 32, σ² = 3.24),

F_0(x) = P(X ≤ x) = P( (X − μ)/σ ≤ (x − 32)/√3.24 ) = P( Z ≤ (x − 32)/1.80 ).

The empirical distribution function is

F_10(x) = (number of x_i ≤ x)/10

and is given in the following table:

x         31.00  31.40  33.30  33.40  33.50  33.70  34.40  34.90  36.20  37.00
F_10(x)   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00

The Kolmogorov-Smirnov goodness-of-fit statistic is then the largest discrepancy between F_10 and F_0 over the sample points.
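A sketch of the computation (the by-hand maximum mimics the table above; scipy's kstest routine gives the same statistic together with an approximate p-value):

    import numpy as np
    from scipy import stats

    x = np.array([31.0, 31.4, 33.3, 33.4, 33.5, 33.7, 34.4, 34.9, 36.2, 37.0])
    F0 = stats.norm.cdf(x, loc=32, scale=1.8)        # F0 at the ordered sample points

    n = len(x)
    D = max((np.arange(1, n + 1) / n - F0).max(),    # F_10 just after each jump
            (F0 - np.arange(0, n) / n).max())        # F_10 just before each jump
    print(D)

    # The same statistic via the library routine, with an approximate p-value
    print(stats.kstest(x, 'norm', args=(32, 1.8)))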
Example 7.2.1. Suppose we have a data set of 12 observations and we wish to test

H0 : μ = 1 against H1 : μ ≠ 1.

Subtracting 1, the hypothesized value of the mean, from each value gives a set of deviations (d_i, i = 1, . . . , 12). One choice of statistic for our test is T = |d̄|, the absolute value of the mean deviation.

Now we set up the permutations of the (modified) data:

• We first take the absolute values of all the deviations; this gives one value of T (the all-positive assignment).

• Under the null hypothesis that the median of the population is 1, the modified data are symmetric about 0, i.e., each of these deviations is equally likely to be positive or negative. Therefore a rearrangement of the data is an assignment of a positive or negative sign to each observation. In total there are 2^n permutations (rearrangements):

  1. Assign one negative sign to each value in turn and obtain n values for T.
  2. Assign two negative signs to all pairs of values and obtain more values for T.
  3. Assign minus signs to triples, etc., until we have the T value for all 2^n sign assignments.

• If we assume, as seems reasonable under the null hypothesis, that each permutation of signs is equally likely, we obtain a distribution of T values. We may then choose between the hypotheses using this permutation distribution of T.

Thus if we had a sample of only 4 deviations:

1.29828 −0.40939 0.39212 1.67940

we have the 16 permutations and their corresponding values for T; see Table 7.1.

The actual sample value of T was T_obs = 0.74010, which exceeds 10 of the 16 values, i.e. it is in the top 6 of the 16. From the definition of the p-value, i.e., the probability of getting a value of the test statistic as extreme as or more extreme than that observed, by chance alone, on the assumption that the null hypothesis H0 is true, we have

p-value = P(T ≥ T_obs | H0) = (number of permutations with T ≥ T_obs) / (total number of permutations).

In our example p-value = 6/16 = 0.375, which is greater than the significance level of the test α, say 20%, and thus we accept H0. Note in passing that the parametric test for the above problem is the one-sample t-test.
Permutation                                   T
 1.29828    0.40939    0.39212    1.67940    0.94480
−1.29828    0.40939    0.39212    1.67940    0.29566
 1.29828   −0.40939    0.39212    1.67940    0.74010
 1.29828    0.40939   −0.39212    1.67940    0.74874
 1.29828    0.40939    0.39212   −1.67940    0.10510
−1.29828   −0.40939    0.39212    1.67940    0.09096
−1.29828    0.40939   −0.39212    1.67940    0.09960
−1.29828    0.40939    0.39212   −1.67940    0.54404
 1.29828   −0.40939   −0.39212    1.67940    0.54404
 1.29828   −0.40939    0.39212   −1.67940    0.09960
 1.29828    0.40939   −0.39212   −1.67940    0.09096
−1.29828   −0.40939   −0.39212    1.67940    0.10510
−1.29828   −0.40939    0.39212   −1.67940    0.74874
−1.29828    0.40939   −0.39212   −1.67940    0.74010
 1.29828   −0.40939   −0.39212   −1.67940    0.29566
−1.29828   −0.40939   −0.39212   −1.67940    0.94480

Table 7.1: All 16 sign permutations and the corresponding values of T.
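The full enumeration of Table 7.1 takes only a few lines; a sketch that reproduces the 16 values of T and the p-value of 6/16:

    import numpy as np
    from itertools import product

    d = np.array([1.29828, -0.40939, 0.39212, 1.67940])   # the four deviations
    T_obs = abs(d.mean())                                  # 0.74010

    # All 2^n sign assignments applied to |d_i|
    Ts = [abs(np.mean(signs * np.abs(d)))
          for signs in product([-1, 1], repeat=len(d))]

    p_value = sum(T >= T_obs for T in Ts) / len(Ts)
    print(p_value)                                         # 6/16 = 0.375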
Example 7.2.2. Five patients were assigned to two different treatments A and B, and the medics then rated their health status on a scale. The observations (ratings) are:

Treatment A (2 patients): 1, 2
Treatment B (3 patients): 3, 5, 9

Is there any difference between the two treatments? In other words, we need to test whether the two mean values are different or not:

H0 : μ_A = μ_B against H1 : μ_A ≠ μ_B.

Under the null hypothesis the two treatments are the same, and it does not matter which treatment each patient received. Therefore, if we randomly rearrange the patients between the two treatments, the results should look much the same; furthermore, all rearrangements are equally likely under the null hypothesis. One choice of statistic for our test is T = |ā − b̄|, where ā, b̄ are the mean ratings for the two treatments. If the null hypothesis is true the statistic is close to 0, while under the alternative hypothesis the statistic takes values away from 0.

We now look at rearrangements of the observed data. One possible rearrangement is 1, 3 in the first sample and 2, 5, 9 in the second. For each rearrangement we compute the value of T. Note that there are

C(5, 3) = 10

such rearrangements; see Table 7.2. Under the null hypothesis (that all observations come from the same distribution) all 10 rearrangements are equally likely, each with probability 1/10.
treatment A  treatment B  ā    b̄     T
1,3          2,5,9        2    5.33  3.33
1,5          2,3,9        3    4.66  1.66
1,9          2,3,5        5    3.33  1.66
1,2          3,5,9        1.5  5.66  4.16
2,3          1,5,9        2.5  5     2.5
2,5          1,3,9        3.5  4.33  0.83
2,9          1,3,5        5.5  3     2.5
3,5          1,2,9        4    4     0
3,9          1,2,5        6    2.66  3.34
5,9          1,2,3        7    2     5

Table 7.2: All possible rearrangements of the patients in the two treatment groups.
The actual sample value of T was T_obs = 4.16, which exceeds 8 of the 10 values, i.e. it is in the top 2 of the 10. From the definition of the p-value we have

p-value = P(T ≥ T_obs | H0) = 2/10 = 0.2,

which is greater than the significance level of the test α, say 10%, and thus we accept H0. Note in passing that the parametric test for the above problem is the two-sample t-test.
Some remarks are in order:

1. In the definition of the p-value we have used the probability P(T ≥ T_obs | H0). Generally speaking this is not correct when small values of the statistic also lead to rejection, i.e., when we reject for T ≤ T_obs as well (two-tailed test). We have avoided these cases by selecting the test statistic appropriately, using the absolute value, so that we reject only for large values of the statistic.

2. These permutation computations are only practical for small data sets; the obvious drawback to this test is the amount of computation required. For the one-sample problem with n observations we require 2^n permutations, while for the two-sample problem with m and n observations in the samples there are

   C(m + n, m) = C(m + n, n)

   rearrangements.

3. We can however come up with a very much simpler procedure at the cost of some precision. A recent suggestion is not to look at all permutations, but rather at a randomly chosen subset of them, as in the sketch below.
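A minimal Monte Carlo version for the two-treatment data of Example 7.2.2 (a sketch; the number of random rearrangements B = 10000 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([1.0, 2.0])             # treatment A
    b = np.array([3.0, 5.0, 9.0])        # treatment B
    pooled = np.concatenate([a, b])
    T_obs = abs(a.mean() - b.mean())

    B = 10000                             # number of random rearrangements (arbitrary)
    hits = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        hits += abs(perm[:2].mean() - perm[2:].mean()) >= T_obs
    print(hits / B)                       # close to the exact p-value 0.2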
2. For H1 : p > 1/2:

   p-value = P(X ≥ k | H0 is true);

   for H1 : p < 1/2:

   p-value = P(X ≤ k | H0 is true);

   for H1 : p ≠ 1/2: if k > n/2,

   p-value = 2 P(X ≥ k | H0 is true),

   and if k ≤ n/2,

   p-value = 2 P(X ≤ k | H0 is true).
3. If the p-value is smaller than the significance level of the test α, reject H0; otherwise accept H0.
Example 7.3.1. Suppose we have a data set of 12 observations which come from a non-normal distribution, and we wish to test

H0 : δ = 1 against H1 : δ > 1,

where δ is the population median. We transform the data as follows: if a value exceeds 1 we call it a plus, while if it lies below 1 we call it a minus. Of our data we have 7 plus signs and 5 minus signs. The p-value is

P(X ≥ k | p = 1/2) = P(X ≥ 7 | p = 1/2) = 0.387,

and so we conclude that there is no reason to reject the null hypothesis that the median is 1.

Note: for the two-tailed test (H1 : p ≠ 1/2), p-value = 2 × 0.387.
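The binomial p-value is a one-liner; a sketch using scipy's binomial survival function:

    from scipy.stats import binom

    n, k = 12, 7                         # 12 observations, 7 plus signs
    p_one = binom.sf(k - 1, n, 0.5)      # P(X >= 7 | p = 1/2)
    print(p_one, 2 * p_one)              # 0.387, and the two-tailed version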
Suppose now that the sample size n is large (greater than 25). Then we use the normal approximation to the Binomial as follows. If X is the number of plus signs, then the test statistic

z = (X − np)/√(np(1 − p)) = (X − n/2)/√(n/4) = (p̂ − 1/2)/√(1/(4n)),

where p̂ = X/n is the proportion of plus signs, is standard normal under the null hypothesis. We may then base a test of H0 : p = 0.5 on z: given α we have z_{1−α} and z_{1−α/2} from normal tables, and we reject H0 when z > z_{1−α}, z < −z_{1−α} or |z| > z_{1−α/2}, according to the alternative.
Example 7.3.2. Suppose we have the following data set of 30 observations which come from a non-normal distribution:
We transform the data as follows: if a value exceeds 20.8 we call it a plus, while if it lies below 20.8 we call it a minus. Of our original deviations we have 9 plus signs and 21 minus signs. Our hypotheses are equivalent to

H0 : p = 1/2 against H1 : p > 1/2.

You will notice that the sign test makes few assumptions about the form of the parent distribution, and in consequence it is a valuable tool when we have non-normal distributions. However, it is probably pretty clear that we must pay a price for the versatility of the test. Indeed, as we replace numerical values with indicators, the signs, it seems inevitable that there will be some loss of sensitivity.
Consider testing

H0 : δ = δ0

against one of

H1 : δ > δ0,  δ < δ0,  δ ≠ δ0.

In the 1940s Frank Wilcoxon came up with a modification of the sign test, now known as Wilcoxon's Signed Rank Test. Wilcoxon suggested that rather than map the deviations between the hypothetical median and the observations onto just two values, some assessment should be made of the magnitude of these deviations; he used the rank of the deviations.

Wilcoxon took the view that a way to improve the test was to take into account the size of the deviation from the median. The procedure he adopted was to take the deviations and give them a rank order, ignoring the sign. Thus the smallest deviation gets rank 1, the next smallest rank 2, and so on. Deviations equal to zero are eliminated.
Now if the suggested median is the true one, then the deviations should be scattered around zero and the sum of the positive ranks R+ should be much the same as the sum of the negative ranks R−. To test the hypothesis, Wilcoxon worked out the probability distribution of W = min(R+, R−) under the assumption

H0 : the deviations have a symmetrical parent distribution with median zero,

and hence calculated the percentage points, a table of which is available.
The procedure is thus:

1. Ignoring the sign, assign to each deviation its rank. If two or more values are tied, the average rank is assigned to each.

2. Given the ranks in (1), form the signed ranks by giving a negative sign to those ranks which come from negative deviations.
3. Compute

   R+ = sum of the positive ranks,  R− = sum of the negative ranks.

4. (a) For a one-tailed test, where the alternative hypothesis is that the median is more than a given value, the test statistic W is R−.

   (b) For a one-tailed test, where the alternative hypothesis is that the median is less than a given value, the test statistic W is R+.

   (c) For a two-tailed test the test statistic W is the minimum of R+ and R−.

Now check that R+ + R− is the same as n(n + 1)/2, where n is the number in the sample (having ignored the zeros). Given the tables, the test is simple, the only assumption being that the parent distribution is symmetrical. If this is not true, then the sign test is your best bet.
Example 7.4.1. Suppose we have the following data set of 10 observations:

131 127 118 135 117 112 132 120 137 113

with 5 positive and 5 negative deviations from the hypothesized median; the sign test would accept H0. The data, the ranks, etc. are laid out in Table 7.3.

For our data R+ = 38 and R− = 17. Therefore W = min(R+, R−) = 17. Table 12 of the Statistical Tables indicates that H0 would be accepted for α = 0.05 (W_0 = 9).
For large sample sizes n we use a normal approximation: under the null hypothesis the deviations D_i are independent and symmetrically distributed about zero, and then

E[R+] = n(n + 1)/4,
var(R+) = n(n + 1)(2n + 1)/24,

and R+ (or R−) is asymptotically normal. The z statistic

z = (R+ − E(R+))/√(var(R+)) = (R+ − n(n + 1)/4) / √( n(n + 1)(2n + 1)/24 )

can be used as a test statistic, and we reject H0 when z > z_{1−α}, z < −z_{1−α}, or |z| > z_{1−α/2}, respectively.
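A sketch of the whole procedure on hypothetical deviations (the values of d below are invented for illustration; scipy.stats.wilcoxon gives the exact small-sample test for comparison):

    import numpy as np
    from scipy import stats

    # Invented deviations from the hypothesized median, for illustration only
    d = np.array([2.1, -0.7, 1.4, 3.3, -1.8, 0.9, 2.6, -0.4, 1.1, -2.9])
    d = d[d != 0]                                     # drop zero deviations
    n = len(d)

    ranks = stats.rankdata(np.abs(d))                 # average ranks for ties
    R_plus = ranks[d > 0].sum()
    R_minus = ranks[d < 0].sum()
    assert R_plus + R_minus == n * (n + 1) / 2        # the check from the notes

    z = (R_plus - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    print(R_plus, R_minus, z)

    # Exact small-sample test via the library routine, for comparison
    print(stats.wilcoxon(d))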
For paired samples we could use the sign test, or better still Wilcoxon's test, on the differences.
For the sign test the two-tailed p-value is

2 × P(X ≥ 8) = 2 × P(X ≤ 2) = 2 × 0.0547 = 0.1092.

This leads to acceptance of H0, illustrating the loss of efficiency which occurs with the sign test. We usually prefer the greater efficiency of Wilcoxon's test compared with the sign test, assuming a symmetric parent.
Suppose we have two samples, one from population A of size n and one from population B of size m (m < n), and we wish to test whether the two populations are identical. The procedure is:

1. Rank all the data; for tied observations we assign average ranks, just as for Wilcoxon's signed rank test.

2. Find the sum of the ranks for the smaller sample (with m observations), say R. If m = n then pick either sample.
3. Find R ′ = m(m + n + 1) − R.
4. Compute W = min(R, R ′ ).
Example 7.6.2. Suppose we have two samples, one from A of size n = 10 and one from B of size m = 9:

A  7.05  14.25  8.57  10.5  11.9  4.5  37.6  9.4  8.1  45.2
B  9.25  2.05   2.75  2.5   6.4   6.9  10    8    33

Assign to each observation its rank in the combined set of observations, as below:

A     7.05  14.25  8.57  10.5  11.9  4.5  37.6  9.4  8.1  45.2
rank  7     16     10    14    15    4    18    12   9    19

B     9.25  2.05  2.75  2.5  6.4  6.9  10  8  33
rank  11    1     3     2    5    6    13  8  17

For the Mann-Whitney test we find the sum of the ranks for each sample: R_A = 124 and R_B = 66. The smaller sample gives a rank sum of R = 66, giving R′ = 9(9 + 10 + 1) − 66 = 114. This gives W = min(66, 114) = 66, which is not quite significant at 5% (two-sided).
For large samples it is conventional to approximate the distribution of the rank sum of the sample of size m by a normal distribution; for m A's and n B's the mean and variance are

μ = m(m + n + 1)/2 and σ² = mn(m + n + 1)/12.
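For the data of Example 7.6.2 the rank sums can be verified directly (a sketch; scipy's mannwhitneyu reports the equivalent U statistic, U = R − m(m + 1)/2 = 21):

    import numpy as np
    from scipy import stats

    A = np.array([7.05, 14.25, 8.57, 10.5, 11.9, 4.5, 37.6, 9.4, 8.1, 45.2])
    B = np.array([9.25, 2.05, 2.75, 2.5, 6.4, 6.9, 10, 8, 33])

    ranks = stats.rankdata(np.concatenate([A, B]))
    R = ranks[len(A):].sum()                  # rank sum of the smaller sample B: 66
    m, n = len(B), len(A)
    R_prime = m * (m + n + 1) - R             # 114
    print(R, R_prime, min(R, R_prime))        # W = 66

    # Equivalent Mann-Whitney U via scipy: U = R - m(m+1)/2 = 21
    print(stats.mannwhitneyu(B, A, alternative='two-sided'))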
Exercises
1. Suppose we have the following data set of 10 observations.
3.94 4.22 2.60 8.65 0.18 0.43 0.09 8.68 2.64 2.77
Would it be reasonable to assume that the observations are drawn from Exp(λ = 1/3)?
2. New recruits to a call center are given initial training in answering customer calls.
Following this training they are independently assessed on their competence, and
are rated on a score of 1 to 10, 1 representing ‘totally incompetent’ to 10 ‘totally
competent’. It is usual for the trainees’ scores to be symmetrically distributed
about a median of 6. A new trainer has been appointed and the scores of her first
19 trainees are:
6 5 6 9 7 3 4 6 7 2 9 8 7 4 5 6 9 5 7
Is there evidence at the 5% level that the new trainer has made any difference?
3. It is recommended that women should not consume more than 70g of fat per day.
A random sample of 13 student nurses at St Clares were asked to estimate as care-
fully as possible how much fat they ate on one particular day. The results were
(measured in grams)
Is there evidence at the 5% level that student nurses at St Clares are consuming
more fat than they should?
4. Two methods A and B were used to determine the latent heat of fusion of ice. The table below gives the change in total heat from ice at −0.72°C to water at 0°C.
Chapter 8
Bayesian inference
If B_1, . . . , B_k form a partition of the sample space and A is an event with P(A) > 0, then Bayes theorem states that

P(B_i | A) = P(A | B_i) P(B_i) / Σ_{j=1}^k P(A | B_j) P(B_j),  i = 1, . . . , k.
Example 8.1.1. In a population at high risk of HIV infection we try a new diagnostic method. 10% of the population is believed to be affected by HIV. From trials of the test, the test is positive for 90% of the people who really are affected by HIV and negative for 85% of the people who are not affected by HIV. What are the probabilities of obtaining false negative and false positive results?
Suppose that A is the event that someone is affected by HIV and B the event that the test is positive. Then P(A) = 0.1, P(B | A) = 0.9 and P(B^c | A^c) = 0.85. Hence

P(the test is false positive) = P(A^c | B) = P(B | A^c) P(A^c) / P(B) = (0.15 × 0.9) / (0.9 × 0.1 + 0.15 × 0.9) = 0.135/0.225 = 0.6.
For continuous random variables,

f(x | y) = f(x, y)/f_y(y) and f(y | x) = f(x, y)/f_x(x).

Thus

f(y | x) = f(x, y)/f_x(x) = f(x | y) f_y(y) / f_x(x),

or: f(y | x) is proportional to f(x | y) f_y(y).
Consider three experiments:

1. A music expert claims to be able to distinguish a page of Haydn score from a page of Mozart score. In all 3 trials, she distinguished correctly.

2. A lady, who adds milk to her tea, claims to be able to tell whether the tea or the milk was poured into the cup first. In all 3 trials, she correctly determined which was poured first.

3. A drunkard claims to be able to predict the outcome of a flip of a fair coin. In all 3 trials, he guessed the outcomes correctly.

Our interest is in the probability of the person answering correctly. The distribution for each case is binomial B(3, p). The maximum likelihood estimator is p̂ = x/n = 3/3 = 1. Is this a satisfactory analysis for all three experiments? There is no reason to doubt the conclusion for Experiment 1, but we have serious doubts for Experiment 3.
The ingredients of a Bayesian analysis are:

1. The data (x_1, x_2, x_3, . . . , x_n)^T = x, a random sample from a distribution f(x | θ) which depends on an unknown parameter θ.
2. The prior f(θ), which reflects the beliefs about the true value of θ before taking any observations.

3. The goal of Bayesian inference: to update the beliefs from the prior f(θ) to the posterior f(θ | x), after taking observations from a random sample (x_1, x_2, x_3, . . . , x_n)^T = x.
Example 8.2.2. A coin is tossed 70 times and the number of heads is 34. The probability of being a head is some unknown value θ. We have some prior belief about θ, say E(θ) = 0.4 and Var(θ) = 0.02, which we need to quantify. One way of doing this is to use a probability distribution: the prior distribution f(θ) is assumed to contain our prior beliefs about the parameter θ. A possible distribution for θ is the Beta distribution with density

f(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b) for 0 ≤ θ ≤ 1,   (8.2)

where B(a, b) is the beta function. Since the mean and variance of the Beta prior are known, i.e.,

E(θ) = a/(a + b) = 0.4,  Var(θ) = ab/((a + b)²(a + b + 1)) = 0.02,

we can easily derive the parameters a, b by solving this system of equations. The resulting parameters are

a = (1 − m)m²/u − m and b = (1 − m)²m/u − (1 − m),

where m = E(θ) and u = Var(θ). For our example the parameters are a = 4.4 and b = 6.6.

When we conduct the experiment of tossing the coin n times we have the probability (the likelihood, when the number of heads x is known)

f(x | θ) = C(n, x) θ^x (1 − θ)^{n−x},

so if we had chosen

f(θ) = θ^{a−1}(1 − θ)^{b−1} / B(a, b)

we would have

f(θ | x) ∝ θ^{a+x−1} (1 − θ)^{n−x+b−1}.
We know that f(θ | x) is a distribution function.
The constant of proportionality is obtained easily since we know that the distribution
must integrate to 1. This latter function is the posterior distribution of the parameter θ
given the data - it is our belief given the data.
So if we start with the prior

θ ∼ Beta(4.4, 6.6),

then with x = 34 heads in n = 70 tosses the posterior is

θ | x ∼ Beta(4.4 + 34, 6.6 + 36) = Beta(38.4, 42.6).

We can now study how the data change our prior beliefs by comparing the expected values of the prior and posterior distributions:

E(θ) = a/(a + b),  E(θ | x) = (a + x)/(a + b + n).

In our example they are

E(θ) = 0.4,  E(θ | x) = 0.474.

So after taking the observations the prior estimate increases from 0.4 to 0.474. Note that the standard maximum likelihood estimate based on the data alone is x/n = 34/70 = 0.486. Summing up, the posterior is a combination of the data and the prior belief.
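The conjugate update is trivial to code; a sketch reproducing the prior and posterior means above:

    from scipy.stats import beta

    a, b = 4.4, 6.6                     # prior Beta(4.4, 6.6) from E = 0.4, Var = 0.02
    x, n = 34, 70                       # 34 heads in 70 tosses

    a_post, b_post = a + x, b + n - x   # conjugate update: Beta(a + x, b + n - x)
    print(beta.mean(a, b))              # prior mean 0.400
    print(beta.mean(a_post, b_post))    # posterior mean 0.474
    print(x / n)                        # MLE 0.486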
Example 8.2.3. Drinks dispensed by a vending machine may overflow. Suppose the prior distribution of p, the proportion that overflow, is given. If two of the next nine drinks dispensed overflow, find the posterior distribution of p. Assuming independence, the likelihood given the experiment is

f(x | p) = C(9, 2) p²(1 − p)⁷.
• In practice, it seldom occurs that the available prior information is precise enough to lead to an exact determination of the prior distribution.

Example 8.3.1. Suppose a random sample x ∼ Gamma(k, 1/θ) of size n, with k known, and prior θ ∼ Gamma(a, 1/b). The likelihood satisfies

f(x | θ) ∝ θ^{nk} exp(−θ Σ x_i).
So

f(θ | x) ∝ θ^{a−1} exp(−bθ) × θ^{nk} exp(−θ Σ x_i) ∝ θ^{a+nk−1} exp{−(b + Σ x_i)θ}.

Therefore

θ | x ∼ Gamma( a + nk, 1/(b + Σ x_i) ).
Example 8.3.2. Suppose a random sample x ∼ P(θ) of size n and θ ∼ Gamma(a, 1/b). Show that

θ | x ∼ Gamma( a + Σ x_i, 1/(b + n) ).

The likelihood is

f(x | θ) = Π_{i=1}^n exp(−θ) θ^{x_i} / x_i! ∝ Π_{i=1}^n exp(−θ) θ^{x_i} = exp(−nθ) θ^{Σ x_i}.
So

f(θ | x) ∝ θ^{a−1} exp(−bθ) × exp(−nθ) θ^{Σ x_i} ∝ θ^{a+Σx_i−1} exp{−(b + n)θ}.

Therefore

θ | x ∼ Gamma( a + Σ x_i, 1/(b + n) ).
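A sketch of this Poisson-Gamma update (the sample x and the prior parameters a, b below are invented for illustration; note that scipy parametrizes the Gamma by shape and scale = 1/rate):

    import numpy as np
    from scipy.stats import gamma

    x = np.array([3, 1, 4, 2, 2, 5, 3])    # invented Poisson counts
    a, b = 2.0, 1.0                         # invented prior: Gamma(a, 1/b), rate b

    a_post = a + x.sum()                    # theta | x ~ Gamma(a + sum x, 1/(b + n))
    rate_post = b + len(x)
    print(gamma.mean(a_post, scale=1 / rate_post))   # posterior mean (a + sum x)/(b + n)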
Provided such a family of distributions exists and the conjugate priors do not contradict our prior beliefs, the computations can be simplified considerably. But in which cases can a family of conjugate priors be obtained? The only case in which conjugate priors can be derived easily is for models in the exponential family of distributions.
Definition 8.3.2. The family of distributions

f(x | θ) = h(x) g(θ) exp( t(x) c(θ) )

is called the (one-parameter) exponential family. It includes the exponential, the Poisson, the Gamma with one parameter, the binomial and the normal distribution with known variance.
Suppose the prior distribution is f(θ); then the posterior satisfies

f(θ | x) ∝ f(θ) × Π_{i=1}^n h(x_i) g(θ) exp( t(x_i) c(θ) ) = f(θ) × ( Π_{i=1}^n h(x_i) ) g(θ)^n exp( Σ t(x_i) c(θ) ).

So if we choose

f(θ) ∝ g(θ)^d exp( b c(θ) ),   (8.3)

then the posterior

f(θ | x) ∝ g(θ)^{d̃} exp( b̃ c(θ) ),  with d̃ = d + n and b̃ = b + Σ t(x_i),
which belongs to the same family of distributions as the prior, with adjusted parameters. All the examples of conjugate priors that we have seen were obtained in this way.
Example 8.3.3. Consider the binomial distribution. The probability mass function is given by

f(x | θ) = C(n, x) θ^x (1 − θ)^{n−x}
         = C(n, x) (1 − θ)^n ( θ/(1 − θ) )^x
         = C(n, x) (1 − θ)^n exp( x log( θ/(1 − θ) ) ).

In the notation of Definition 8.3.2,

h(x) = C(n, x),  g(θ) = (1 − θ)^n,  t(x) = x,  c(θ) = log( θ/(1 − θ) ).

Therefore we can construct a conjugate prior, see Eq. (8.3), of the form

f(θ) ∝ g(θ)^d exp( b c(θ) ) = ( (1 − θ)^n )^d exp( b log( θ/(1 − θ) ) ) = (1 − θ)^{nd−b} θ^b,

which is a member of the Beta family of distributions.
For the binomial model, for instance, the Fisher information works out to I(θ) = nθ^{−1}(1 − θ)^{−1}.
Example 8.4.1. 20 windows in a high-rise office block broke in the first year of occupancy of the building. The question was how many of these were due to a specific defect D. If they were caused by D then the manufacturer of the windows will replace them, otherwise they will not. Only in 4 of the 20 windows was glass available for analysis; to this end we will assume a sample of 4 in 20. All 4 in our sample were found to have broken because of D.

In the subsequent legal wrangle a glass expert claimed that the distribution of θ, the probability a window suffers from D, is

f(θ) = θ^{−3/4}(1 − θ)^{−1/4} / B(1/4, 3/4),  0 ≤ θ ≤ 1,  or θ ∼ Beta(1/4, 3/4).

For a sample of 4 with 4 having defect D, the likelihood is f(x | θ) = C(4, 4) θ⁴(1 − θ)^{4−4} = θ⁴, so the posterior is

f(θ | x) = θ^{13/4}(1 − θ)^{−1/4} / B(1/4 + 4, 3/4 + 4 − 4),  0 ≤ θ ≤ 1,  or θ | x ∼ Beta(17/4, 3/4).

This is a conjugate prior; see Table 8.3 for the general case. The distribution of Y, the number of defectives among the remaining 16 windows, is (given θ)

f(y | θ) = C(16, y) θ^y (1 − θ)^{16−y}.

The predictive distribution is then

f(y | x) = ∫ f(y | θ) f(θ | x) dθ
         = ( C(16, y) / B(17/4, 3/4) ) ∫₀¹ θ^{y+13/4} (1 − θ)^{63/4−y} dθ
         = C(16, y) B(y + 13/4 + 1, 63/4 − y + 1) / B(17/4, 3/4)
         = C(16, y) B(y + 17/4, 67/4 − y) / B(17/4, 3/4).
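This predictive distribution is exactly the Beta-Binomial, which is available in scipy (version 1.4 or later); a sketch:

    import numpy as np
    from scipy.stats import betabinom

    pred = betabinom(n=16, a=17/4, b=3/4)   # Y | x: Beta-Binomial(16, 17/4, 3/4)

    y = np.arange(17)
    print(np.round(pred.pmf(y), 4))          # f(y | x) for y = 0, ..., 16
    print(pred.mean())                       # expected number of defectives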
Exercises

1. Suppose we have one observation x from an exponential distribution f(x) = θ exp(−θx). If the prior distribution is defined by the discrete probabilities

   p(θ = 1) = p(θ = 2) = 1/2,

   find the posterior distribution of θ.

2. Suppose that

   (x_1, . . . , x_n) = x ∼ Geom(θ)

   and

   θ ∼ Beta(a, b).

   Show that

   θ | x ∼ Beta(a + n, b + Σ x_i − n).

3. Suppose that

   x ∼ NeB(r, θ)

   and

   θ ∼ Beta(a, b).

   Show that

   θ | x ∼ Beta(a + r, b + x − r).

4. Suppose we have one observation x from a Pareto distribution, x ∼ Pa(a, b), where a is known and b is unknown. Then

5. Suppose we have one observation x from a binomial distribution B(n, π) and the prior distribution of π is Beta(p, q). Suppose that y is the next observation from a binomial distribution B(m, π). Find the predictive distribution of y.