Probability and Statistics II


Email: mazhuke@hku.hk

Department of Statistics and Actuarial Science

The University of Hong Kong

Contents

1 Basic concepts
  1.1 Random variable and probability function
  1.2 Discrete distribution
  1.3 Continuous distribution
  1.4 Empirical distribution
  1.5 Expectation
2 Preliminary
  2.1 Moment generating function
  2.2 Convergence
3 Point estimator
  3.1 Maximum likelihood estimator
  3.2 Method of moments estimator
  3.3 Estimator properties
    3.3.1 Unbiasedness
    3.3.2 Efficiency
    3.3.3 Consistency
4 Interval Estimation
  4.1 Basic concepts
  4.2 Confidence intervals for means
    4.2.1 One-sample case
    4.2.2 Two-sample case
  4.3 Confidence intervals for variances
    4.3.1 One-sample case
    4.3.2 Two-sample case
5 Hypothesis testing
  5.1 Basic concepts
  5.2 Most powerful tests
  5.3 Generalized likelihood ratio tests: One-sample case
    5.3.1 Testing for the mean: Variance known
    5.3.2 Testing for the mean: Variance unknown
    5.3.3 Testing for the variance
    5.3.4 Test and interval estimation
  5.4 Generalized likelihood ratio tests: Two-sample case
    5.4.1 Testing for the mean: Variance is known
    5.4.2 Testing for the mean: Variance is unknown
    5.4.3 Testing for the variance
  5.5 Generalized likelihood ratio tests: Large samples
    5.5.1 Goodness-of-fit tests
    5.5.2 Pearson Chi-squared test of independence

Chapter 1

Basic concepts

Abstract A random variable usually results from a random experiment, and its full information is determined by its cumulative distribution function or probability

density function, which however is unknown in practice. By repeating the random

experiment several times (say, n times), we obtain a random sample which consists

of the results of all repeated random experiments. Based on this random sample,

we are able to estimate the cumulative distribution function (by the empirical distri-

bution) and the probability density function (by the relative frequency histogram).

This chapter reviews these basic concepts in statistics. If you are familiar with these

concepts, you can skip this chapter.

A random variable is a measurable map

X : Ω → E, ω ↦ X(ω),

which assigns to each outcome ω in a set Ω a value X(ω) in a set E. Here, Ω is called the sample space, and E is called the state space, which is measurable. Any element in the sigma-algebra of Ω (denoted by σ(Ω)) is called an event, and generally speaking, an event A is a subset of Ω. The value of X is subject to variations due to chance, which can be measured by a probability function P on σ(Ω) that assigns to each event A a number P(A), called the probability of the event A, such that the following properties are satisfied:

(1) P(A) ≥ 0;

(2) P(Ω) = 1;

(3) P(A₁ ∪ A₂ ∪ ⋯) = P(A₁) + P(A₂) + ⋯ for any sequence of mutually exclusive events A₁, A₂, ….

If E contains only finitely or countably many elements, X is called a discrete random variable. If E contains intervals or a union of intervals, X is called a continuous random variable. From now on, we assume that E is a subset of R̄^s, where R̄ = [−∞, ∞] and the integer s ≥ 1.

Example 1.1. Toss a fair coin once. Then,

1. The result of a single toss is a discrete random variable X;
2. Ω = {Head (H), Tail (T)} and E = {0, 1} such that X(H) = 0 and X(T) = 1;
3. σ(Ω) = {∅, {H}, {T}, Ω};
4. P(H) = P(X = 0) = 1/2 and P(T) = P(X = 1) = 1/2 (i.e., the probability function assigns a probability 1/2 to each point of Ω).

Example 1.2. Toss a fair coin twice. Then,

1. The result of two tosses is a discrete random variable X = (X₁, X₂);
2. Ω = {HH, HT, TH, TT} and E = {(0, 0), (0, 1), (1, 0), (1, 1)} such that X(HH) = (0, 0), X(HT) = (0, 1), X(TH) = (1, 0), and X(TT) = (1, 1);
3. σ(Ω) = all subsets of Ω (please write them down by yourself);
4. P(HH) = P(X = (0, 0)) = 1/4, P(HT) = P(X = (0, 1)) = 1/4, P(TH) = P(X = (1, 0)) = 1/4, and P(TT) = P(X = (1, 1)) = 1/4 (i.e., the probability function assigns a probability 1/4 to each point of Ω).

Example 1.3. Observe the price of one stock. Then,

1. The price of one stock is a continuous random variable X;
2. Ω = E = (0, ∞) such that X(ω) = ω for any ω ∈ Ω;
3. σ(Ω) = σ((0, x]; x > 0);
4. P(X ∈ (0, x]) = W(x) (i.e., the probability function assigns a probability W(x) to the event (0, x] in σ(Ω)).

For a discrete random variable X, the function

f(x) = P(X = x)

is the probability density function (p.d.f.) of X. Clearly, f(x) tells us how likely the event {ω : X(ω) = x} is to occur.

Theorem 1.1. A univariate probability density function has the following properties:

(1) f(x) > 0 for x ∈ E;

(2) Σ_{x∈E} f(x) = 1;

(3) P(X ∈ A) = Σ_{x∈A} f(x), where A ⊂ E.

Remark 1.1. The proof of Theorem 1.1 follows directly from Definition 1.1, and the details are omitted.

Remark 1.2. Any function f(x) satisfying (1)-(3) of Theorem 1.1 can induce a random variable, which has a p.d.f. f(x).

Let f(x) = 0 for x ∉ E. Then, the domain of f(x) can be extended to R, and supp(f) = E. Here, supp(f) = {x : f(x) ≠ 0} is the support of f(x), i.e., supp(f) is the subset of the domain containing those elements which are not mapped to zero.

The graph of the p.d.f. of a discrete random variable X would be a plot of the points

{(x, f(x)) : x ∈ E}. However, it is easier to visualize the corresponding probabilities

if a vertical line segment is drawn from each (x, f (x)) to (x, 0), to form a bar graph

(see Fig.1.1 for an illustration).


Fig. 1.1 The top panel is the p.d.f. f(x) of a discrete random variable X, where f(x) = P(X = x) = x/6 for x = 1, 2, 3, and the bottom panel is the corresponding c.d.f. F(x).


For a discrete random variable X, define F(x) = P(X ≤ x) = Σ_{e∈E, e≤x} f(e).

The function F(x) is called the cumulative distribution function (c.d.f.) of the dis-

crete random variable X. Note that F(x) is a step function on R and the height of a

step at x, x E, equals the probability f (x) (see Fig.1.1 for an illustration).

From Theorem 1.1, we can obtain the following theorem.

Theorem 1.2. A cumulative distribution function has the following properties:

(1) 0 ≤ F(x) ≤ 1 for x ∈ R;

(2) F(x) is a nondecreasing function of x;

(3) F(∞) = 1 and F(−∞) = 0.

Remark 1.3. The p.d.f. f(x) and the c.d.f. F(x) are in one-to-one correspondence. We can first define the c.d.f. F(x), and then define the p.d.f. f(x) by f(x) = F(x) − F(x⁻), the size of the jump of F at x.

Example 1.4. Let the random variable X of the discrete type have the p.d.f. f(x) = x/6, x = 1, 2, 3. Then,

F(x) = 0 for x < 1; F(x) = 1/6 for 1 ≤ x < 2; F(x) = 1/2 for 2 ≤ x < 3; F(x) = 1 for x ≥ 3.

Example 1.5. Let X be a discrete random variable taking values at 1, 2, and 3. Suppose that F(x) is the c.d.f. of X, and it satisfies

F(1) = 1/2, F(2) = c, F(3) = c² − (5/6)c + 7/6.

The above definitions of p.d.f. and c.d.f. can be similarly extended to the discrete multivariate random variable X = (X₁, …, X_s) ∈ R^s for s > 1. Let x = (x₁, …, x_s) ∈ E ⊂ R^s be a realization of X. Then, the p.d.f. of X is

f(x₁, …, x_s) = P(X₁ = x₁, …, X_s = x_s).


Theorem 1.3. A multivariate probability density function has the following properties:

(1) f(x₁, …, x_s) > 0 for (x₁, …, x_s) ∈ E;

(2) Σ_{(x₁,…,x_s)∈E} f(x₁, …, x_s) = 1;

(3) P((X₁, …, X_s) ∈ A) = Σ_{(x₁,…,x_s)∈A} f(x₁, …, x_s), where A ⊂ E.

Let f(x₁, …, x_s) = 0 for (x₁, …, x_s) ∉ E. Then, the domain of f(x₁, …, x_s) can be extended to R^s, and supp(f) = E. From the joint p.d.f. f(x₁, …, x_s), we can obtain the p.d.f. of a single X_k:

f_k(x_k) = Σ_{x₁} ⋯ Σ_{x_{k−1}} Σ_{x_{k+1}} ⋯ Σ_{x_s} f(x₁, …, x_s).

We call f_k(x_k) the marginal p.d.f. of X_k. This marginal p.d.f. f_k(x_k) is calculated by summing f(x₁, …, x_s) over all x_i's except x_k. Further, we say that X₁, …, X_s are independent if and only if

f(x₁, …, x_s) = f₁(x₁) f₂(x₂) ⋯ f_s(x_s) for all (x₁, …, x_s) ∈ E.

A joint marginal p.d.f. of X_j and X_k is calculated similarly by summing f(x₁, …, x_s) over all x_i's except x_j and x_k; that is,

f_{j,k}(x_j, x_k) = Σ_{x₁} ⋯ Σ_{x_{j−1}} Σ_{x_{j+1}} ⋯ Σ_{x_{k−1}} Σ_{x_{k+1}} ⋯ Σ_{x_s} f(x₁, …, x_s).

Extensions of these marginal p.d.f.s to more than two random variables are made in an obvious way.

Based on the p.d.f. f(x₁, …, x_s), the c.d.f. of X = (X₁, …, X_s) ∈ R^s is defined by

F(x₁, …, x_s) = Σ_{(e₁,…,e_s)∈E, e₁≤x₁,…,e_s≤x_s} f(e₁, …, e_s).

Theorem 1.4. A multivariate cumulative distribution function has the following properties:

(1) 0 ≤ F(x₁, …, x_s) ≤ 1 for (x₁, …, x_s) ∈ R^s;

(2) F(x₁, …, x_s) is a nondecreasing function of each x_i;

(3) F(∞, …, ∞) = 1 and F(−∞, …, −∞) = 0.

Property 1.1. Two discrete random variables X and Y are independent if and only if

F(x, y) = F_X(x) F_Y(y) for all (x, y) ∈ E.

Proof. We only give the proof of the "if" part. Note that

f(x, y) = Σ_{e₁≤x, e₂≤y} f(e₁, e₂) − Σ_{e₁≤x, e₂<y} f(e₁, e₂) − Σ_{e₁<x, e₂≤y} f(e₁, e₂) + Σ_{e₁<x, e₂<y} f(e₁, e₂).

Hence,

f(x, y) = F_X(x)F_Y(y) − F_X(x)F_Y(y⁻) − F_X(x⁻)F_Y(y) + F_X(x⁻)F_Y(y⁻) = [F_X(x) − F_X(x⁻)][F_Y(y) − F_Y(y⁻)] = f_X(x) f_Y(y),

where F_X(x⁻) and F_Y(y⁻) denote the left limits of F_X and F_Y, so that X and Y are independent,

Property 1.2. Let X and Y be two independent discrete random variables. Then,

(a) for arbitrary countable sets A and B, P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B);

(b) for any real functions g(·) and h(·), g(X) and h(Y) are independent.

Proof. (a) Note that

P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} f(x, y) = Σ_{x∈A} Σ_{y∈B} f_X(x) f_Y(y) = Σ_{x∈A} P(X = x) Σ_{y∈B} P(Y = y) = P(X ∈ A) P(Y ∈ B),

as required.

(b) For any α and β, let A = {x : g(x) = α} and B = {y : h(y) = β}, and then the joint p.d.f. of g(X) and h(Y) is

P(g(X) = α, h(Y) = β) = P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) (by (a)) = P(g(X) = α) P(h(Y) = β).

It follows that g(X) and h(Y) are independent. This completes the proof.

Example 1.6. Let the joint p.d.f. of X and Y be

f(x, y) = xy²/30, x = 1, 2, 3, y = 1, 2.

The marginal p.d.f. of X is

f₁(x) = Σ_{y=1,2} xy²/30 = x/6, x = 1, 2, 3,

and the marginal p.d.f. of Y is

f₂(y) = Σ_{x=1,2,3} xy²/30 = y²/5, y = 1, 2.

Example 1.7. Let X and Y be two independent geometric random variables having respective p.d.f.s fX(x) = (1 − θ)θ^x and fY(y) = (1 − θ)θ^y for x ≥ 0 and y ≥ 0, where 0 < θ < 1. What is the p.d.f. of Z = min(X, Y)?

Solution. For z ≥ 0,

P(Z > z) = P(X > z, Y > z) = P(X > z)P(Y > z) (by Property 1.2(a)) = Σ_{x>z} fX(x) Σ_{y>z} fY(y) = θ^{z+1} · θ^{z+1} = θ^{2(z+1)}.

Hence, the p.d.f. of Z is

P(Z = z) = P(Z > z − 1) − P(Z > z) = θ^{2z} − θ^{2(z+1)} = (1 − θ²)(θ²)^z, z = 0, 1, 2, …,

so Z is again geometric, with parameter θ².

We introduce an important discrete distribution, called the multinomial distribu-

tion. Consider a sequence of repetitions of an experiment for which the following

conditions are satisfied:

1. The experiment has k possible outcomes that are mutually exclusive and exhaus-

tive, say A1 , A2 , , Ak .

2. n independent trials of this experiment are observed.

3. P(A_i) = p_i, i = 1, 2, …, k, on each trial, with Σ_{i=1}^k p_i = 1.

Note that a simple example of the multinomial distribution is a coin tossed repeatedly n times. In this case, there are two outcomes A₁ and A₂ in each independent

trial, where A1 = {Head} (i.e., the result of the trial is Head), A2 = {Tail} (i.e., the

result of the trial is Tail), P(A1 ) = p, and P(A2 ) = 1 p (e.g., typically p = 1/2 so

that there is an equal probability to get Head and Tail). Let the random variable X

be the number of times A1 occurs in the n trials. Then,

f(x) = P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, 1, 2, …, n.

The above distribution, denoted by B(n, p), is called the binomial distribution. Par-

ticularly, B(1, p) is called Bernoulli distribution.

The multinomial distribution is a natural extension of the binomial distribution.

Let the random variable Xi be the number of times Ai occurs in the n trials, i =

1, 2, , k. Then, the multinomial distribution of X1 , , Xk is defined by

f(x₁, …, x_k) = P(X₁ = x₁, …, X_k = x_k) = \binom{n}{x₁, x₂, …, x_k} p₁^{x₁} p₂^{x₂} ⋯ p_k^{x_k},

for nonnegative integers x₁, …, x_k with x₁ + x₂ + ⋯ + x_k = n.


Example 1.8. A bowl contains three red, four white, two blue, and five green balls.

One ball is drawn at random from the bowl and then replaced. This is repeated 20

independent times. Let X1 , X2 , X3 , and X4 denote the numbers of red, white, blue,

and green balls drawn, respectively. What is the joint p.d.f. of X1 , X2 , X3 , and X4 ?

Solution. For each draw, there are four outcomes (say, red, white, blue, and green) for the color of the ball, and it is easy to see that

p₁ = P(the drawn ball is red) = 3/14,
p₂ = P(the drawn ball is white) = 4/14,
p₃ = P(the drawn ball is blue) = 2/14,
p₄ = P(the drawn ball is green) = 5/14.

Thus, the joint p.d.f. of X₁, X₂, X₃, and X₄ is

f(x₁, x₂, x₃, x₄) = P(X₁ = x₁, X₂ = x₂, X₃ = x₃, X₄ = x₄) = \binom{20}{x₁, x₂, x₃, x₄} (3/14)^{x₁} (4/14)^{x₂} (2/14)^{x₃} (5/14)^{x₄},

for x₁ + x₂ + x₃ + x₄ = 20.
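As a quick numerical check of this joint p.d.f., the following sketch (assuming NumPy and SciPy are available; the particular counts chosen below are an arbitrary illustration, not from the text) evaluates the multinomial probability for one possible outcome of the 20 draws.

```python
import numpy as np
from scipy.stats import multinomial

# Probabilities of red, white, blue, green on a single draw (3+4+2+5 = 14 balls).
p = np.array([3, 4, 2, 5]) / 14

# A hypothetical outcome of the 20 draws: x1 red, x2 white, x3 blue, x4 green.
x = [4, 6, 3, 7]                      # must sum to n = 20

# P(X1=4, X2=6, X3=3, X4=7) from the multinomial p.d.f.
print(multinomial.pmf(x, n=20, p=p))
```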

1.3 Continuous distribution

For a continuous random variable X, the probability that X falls in an interval (a, b] is

P(a < X ≤ b) = ∫_a^b f(x) dx

for some non-negative function f(·). That is, the probability P(a < X ≤ b) is the area bounded by the graph of f(x), the x axis, and the lines x = a and x = b. We call f(x) the p.d.f. of the continuous random variable X.

Theorem 1.5. A univariate probability density function has the following properties:

(1) f(x) ≥ 0 for x ∈ R;

(2) ∫_R f(x) dx = 1;

(3) P(X ∈ A) = ∫_A f(x) dx for A ⊂ R.


The c.d.f. of a continuous random variable X is

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(e) de,

which also satisfies Theorem 1.2. From the fundamental theorem of calculus, we have F′(x) = f(x) whenever the derivative exists. Since there are no steps or jumps in a continuous c.d.f., it must be true that P(X = b) = 0 for all real values of b.

As you can see, the definition for the p.d.f. (or c.d.f.) of a continuous random

variable differs from the definition for the p.d.f. (or c.d.f.) of a discrete random

variable by simply changing the summations that appeared in the discrete case to

integrals in the continuous case.

A random variable X is said to have the uniform distribution over the interval [a, b] if

f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.

Property 1.3. Let X have the uniform distribution over [0, 1], and let F be a continuous c.d.f. with inverse F⁻¹. Then Y = F⁻¹(X) has c.d.f. F.

Proof.

P(Y ≤ x) = P(F⁻¹(X) ≤ x) = P(X ≤ F(x)) = F(x).

Note that this property helps us to generate a random variable from a given distribution.
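The inverse-transform idea above can be illustrated with a short sketch (assuming NumPy; the exponential target is an arbitrary illustrative choice): draw U ~ Uniform(0, 1) and apply F⁻¹ to obtain a sample with c.d.f. F.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: exponential distribution with rate 1, F(x) = 1 - exp(-x),
# whose inverse is F^{-1}(u) = -log(1 - u).
u = rng.uniform(0.0, 1.0, size=100_000)   # X ~ Uniform(0, 1)
y = -np.log(1.0 - u)                       # Y = F^{-1}(X)

# The sample mean and variance should both be close to 1 for Exp(1).
print(y.mean(), y.var())
```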

A random variable X is said to follow the normal distribution if its p.d.f. is

f(x) = (1/(√(2π)σ)) exp[−(x − μ)²/(2σ²)] for x ∈ R,

where μ ∈ R is the location parameter and σ > 0 is the scale parameter. Briefly, we say that X ~ N(μ, σ²). A simple illustration of f(x) with different values of μ and σ is given in Fig. 1.2.

Further, Z = (X − μ)/σ ~ N(0, 1) (the standard normal distribution), and the c.d.f. of Z is typically denoted by Φ(x), where

Φ(x) = P(Z ≤ x) = ∫_{−∞}^x (1/√(2π)) exp[−w²/2] dw.

Property 1.4. If the p.d.f. of a continuous random variable X is fX(x) for x ∈ R, then the p.d.f. of Y = aX + b for a ≠ 0 is fY(x) = (1/|a|) fX((x − b)/a) for x ∈ R.


Fig. 1.2 The p.d.f. f(x) of N(0, 1) and N(0, 4) (left panel), and of N(2, 1) and N(2, 4) (right panel).

Proof. For a > 0,

FY(x) = P(Y ≤ x) = P(aX + b ≤ x) = P(X ≤ (x − b)/a) = FX((x − b)/a)

for x ∈ R. Hence,

fY(x) = F′Y(x) = (1/a) F′X((x − b)/a) = (1/a) fX((x − b)/a);

the case a < 0 is similar.

The definitions of p.d.f. and c.d.f. for the continuous univariate random vari-

able can be similarly extended for the continuous multivariate random variable

X = (X₁, …, X_s) ∈ R^s, s > 1. The probability that X lies in a rectangle A = (a₁, b₁] × (a₂, b₂] × ⋯ × (a_s, b_s] is

P(X ∈ A) = ∫_{a₁}^{b₁} ∫_{a₂}^{b₂} ⋯ ∫_{a_s}^{b_s} f(x₁, x₂, …, x_s) dx₁ dx₂ ⋯ dx_s

for some non-negative function f (x1 , x2 , , xs ). For instance, when s = 2, the prob-

ability P(X A) is the volume of the solid over the region A in the x1 x2 plane and

bounded by the surface z = f (x1 , x2 ). We call f (x1 , x2 , , xs ) the joint p.d.f. of the

continuous random variables X1 , , Xs .

Theorem 1.6. A multivariate probability density function has the following properties:

(1) f(x₁, x₂, …, x_s) ≥ 0 for (x₁, x₂, …, x_s) ∈ R^s;

(2) ∫_{R^s} f(x₁, x₂, …, x_s) dx₁ dx₂ ⋯ dx_s = 1;

(3) P((X₁, …, X_s) ∈ A) = ∫_A f(x₁, x₂, …, x_s) dx₁ dx₂ ⋯ dx_s for A ⊂ R^s.


Similar to the discrete case, the marginal p.d.f. of any single X_k is given by the (s − 1)-fold integral

f_k(x_k) = ∫ ⋯ ∫ f(x₁, x₂, …, x_s) dx₁ ⋯ dx_{k−1} dx_{k+1} ⋯ dx_s for x_k ∈ R.

Based on the p.d.f. f(x₁, …, x_s), the c.d.f. of a continuous multivariate random variable X = (X₁, …, X_s) is

F(x₁, …, x_s) = P(X₁ ≤ x₁, …, X_s ≤ x_s) = ∫_{−∞}^{x₁} ⋯ ∫_{−∞}^{x_s} f(e₁, …, e_s) de₁ ⋯ de_s,

and conversely

f(x₁, …, x_s) = ∂^s F(x₁, …, x_s)/(∂x₁ ⋯ ∂x_s),

if the derivative exists.

Property 1.5. Two continuous random variables X and Y are independent if and only if

F(x, y) = F_X(x) F_Y(y) for all (x, y) ∈ R².

Proof. The proof of the "only if" part is obvious, and the proof of the "if" part follows from the fact that

f(x, y) = ∂²F(x, y)/(∂x ∂y) = ∂²[F_X(x)F_Y(y)]/(∂x ∂y) = F′_X(x) F′_Y(y) = f_X(x) f_Y(y).

This completes the proof.

Property 1.6. Let X and Y be two independent continuous random variables. Then,

(a) for arbitrary intervals A and B, P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B);

(b) for any real functions g(·) and h(·), g(X) and h(Y) are independent.

Proof. (a) Note that

P(X ∈ A, Y ∈ B) = ∫_{x∈A} ∫_{y∈B} f(x, y) dx dy (by Theorem 1.6(3))
= ∫_{x∈A} ∫_{y∈B} f_X(x) f_Y(y) dx dy (by independence)
= ∫_{x∈A} f_X(x) dx ∫_{y∈B} f_Y(y) dy
= P(X ∈ A) P(Y ∈ B) (by Theorem 1.6(3)).

(b) For any α and β, let A = {x : g(x) ≤ α} and B = {y : h(y) ≤ β}, and then the joint c.d.f. of g(X) and h(Y) is

P(g(X) ≤ α, h(Y) ≤ β) = P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) (by (a)) = P(g(X) ≤ α) P(h(Y) ≤ β).

It follows that g(X) and h(Y) are independent by Property 1.5. This completes the proof.

Consider the joint p.d.f.

f(x, y) = 2, 0 ≤ x ≤ y ≤ 1.

Then, for instance,

P(X ≤ 1/2, Y ≤ 1/2) = ∫_0^{1/2} ∫_0^{y} 2 dx dy = ∫_0^{1/2} 2y dy = 1/4,

or, integrating in the other order,

P(X ≤ 1/2, Y ≤ 1/2) = ∫_0^{1/2} ∫_x^{1/2} 2 dy dx = ∫_0^{1/2} 2(1/2 − x) dx = 1/4.

The marginal p.d.f.s are

f₁(x) = ∫_x^1 2 dy = 2(1 − x), 0 ≤ x ≤ 1, and f₂(y) = ∫_0^y 2 dx = 2y, 0 ≤ y ≤ 1.

Example. Let the joint p.d.f. of X and Y be

f(x, y) = c x^a, if 0 < x < 1, y < 1, x + y > 1, a > 1; and f(x, y) = 0, otherwise.

(a) Find the constant c; (b) find the joint c.d.f. F(x, y).

Solution. (a) Since ∫∫ f(x, y) dx dy = 1, this entails

c⁻¹ = ∫_0^1 ∫_{1−x}^1 x^a dy dx = (a + 2)⁻¹,

i.e., c = a + 2.

(b) By definition, for (x, y) in the support,

F(x, y) = ∫_{1−y}^{x} ∫_{1−u}^{y} c u^a dv du = ((a + 2)/(a + 1)) y x^{a+1} + x^{a+2} − ((a + 2)/(a + 1)) x^{a+1} + (1/(a + 1))(1 − y)^{a+2}.

1.4 Empirical distribution

Consider a random experiment associated with a random variable X whose c.d.f. is F(x). Repeating this experiment n independent times, we get n random variables X₁, …, X_n

associated with these outcomes. The collection of these random variables is called

a sample from a distribution with c.d.f. F(x) (or p.d.f. f (x)). The number n is called

the sample size.

As all random variables in a sample follow the same c.d.f. as X, we expect that

they can give us the information about the c.d.f of X. Next, we are going to show

that the empirical distribution of {X1 , , Xn } is close to F(x) in some probability

sense.

The empirical distribution of {X₁, …, X_n} is defined as

Fn(x) = (1/n) Σ_{k=1}^n I(X_k ≤ x)

for x ∈ R, where I(A) is an indicator function such that I(A) = 1 if A holds and I(A) = 0 otherwise. Obviously, Fn(x) assigns the probability 1/n to each X_k, and we can check that it satisfies Theorem 1.2 (please do it by yourself). Since Fn(x) is the relative frequency of the event {X ≤ x}, it is an approximation of the probability P(X ≤ x) = F(x). Thus, the following result is expected.

Theorem 1.7. As n → ∞, sup_{x∈R} |Fn(x) − F(x)| → 0 almost surely (a.s.); furthermore, for each x ∈ R,

√n (Fn(x) − F(x)) →d N(0, F(x)(1 − F(x))),

and this convergence can be further extended by Donsker's theorem, where →d stands for convergence in distribution.

The proof of the aforementioned theorem is omitted. This theorem shows that the empirical distribution function Fn(x) provides an estimate of the c.d.f. F(x). To see this more clearly, Fig. 1.3 plots the empirical distribution function Fn(x) based on a data sample {X₁, …, X_n} with X_i ~ N(0, 1). As a comparison, the c.d.f. Φ(x) of N(0, 1) is also included in Fig. 1.3. From this figure, we can see that Fn(x) gets closer to Φ(x) as the sample size n increases, and this is consistent with the conclusion in Theorem 1.7.


Fig. 1.3 The black step function is the empirical distribution function Fn(x) based on a data sample {X₁, …, X_n} with X_i ~ N(0, 1). The red solid line is the c.d.f. Φ(x) of N(0, 1).
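A small simulation in the spirit of Fig. 1.3 (a sketch assuming NumPy and SciPy; the sample sizes are arbitrary) compares the empirical distribution function with Φ(x) at a few points.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def ecdf(sample, x):
    """Empirical distribution function Fn(x) = (1/n) * #{k : X_k <= x}."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= np.asarray(x)[None, :], axis=0)

xs = np.array([-1.0, 0.0, 1.0])
for n in (20, 200, 2000):
    sample = rng.standard_normal(n)          # X_1, ..., X_n ~ N(0, 1)
    print(n, np.round(ecdf(sample, xs), 3), np.round(norm.cdf(xs), 3))
```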

Example 1.13. Let X denote the number of observed heads when four coins are tossed independently and at random. Recall that the distribution of X is B(4, 1/2).

One thousand repetitions of this experiment (actually simulated on the computer)

yielded the following results:

x   frequency
0   65
1   246
2   358
3   272
4   59

The corresponding empirical distribution function F₁₀₀₀(x) takes the following values:

interval   F₁₀₀₀(x)   interval   F₁₀₀₀(x)
(−∞, 0)    0.000      [2, 3)     0.669
[0, 1)     0.065      [3, 4)     0.941
[1, 2)     0.311      [4, ∞)     1.000

The graph of the empirical distribution function F1000 (x) and the theoretical distri-

bution function F(x) for the binomial distribution are very close (please check it by

yourself).

Example 1.14. The following numbers are a random sample of size 10 from some distribution:

−0.49, 0.90, 0.76, −0.97, −0.73, 0.93, 0.88, −0.75, −0.88, 0.96.

(a) Write down the empirical distribution; (b) use the empirical distribution to estimate P(X ≤ 0.5) and P(−0.5 ≤ X ≤ 0.5).

Solution. Sorting the sample in increasing order gives

−0.97, −0.88, −0.75, −0.73, −0.49, 0.76, 0.88, 0.90, 0.93, 0.96,

so the empirical distribution function F₁₀(x) is constant on the following intervals:

interval         F₁₀(x)   interval        F₁₀(x)
(−∞, −0.97)      0.0      [−0.49, 0.76)   0.5
[−0.97, −0.88)   0.1      [0.76, 0.88)    0.6
[−0.88, −0.75)   0.2      [0.88, 0.90)    0.7
[−0.75, −0.73)   0.3      [0.90, 0.93)    0.8
[−0.73, −0.49)   0.4      [0.93, 0.96)    0.9
                          [0.96, ∞)       1.0

Thus, P(X ≤ 0.5) = F(0.5) ≈ F₁₀(0.5) = 0.5 and P(−0.5 ≤ X ≤ 0.5) = F(0.5) − F(−0.5) ≈ F₁₀(0.5) − F₁₀(−0.5) = 0.5 − 0.4 = 0.1.

The question now is: how do we estimate the p.d.f. f(x)? The answer is the relative frequency histogram.

For the discrete random variable X, we can estimate f(x) = P(X = x) by the relative frequency of occurrences of x. That is,

f(x) ≈ f_n(x) = (1/n) Σ_{k=1}^n I(X_k = x).

Example 1.13. (cont) The relative frequency of observing x = 0, 1, 2, 3 or 4 is listed

in the second column, and it is close to the value of f (x), which is the p.d.f of

B(4, 1/2).


x   relative frequency f_n(x)   f(x)
0   0.065                       0.0625
1   0.246                       0.2500
2   0.358                       0.3750
3   0.272                       0.2500
4   0.059                       0.0625

By increasing the value of n, the difference between fn (x) and f (x) will become

small.

For the continuous random variable X, we first define the so-called class intervals.

Choose an integer l ≥ 1, and a sequence of real numbers c₀, c₁, …, c_l such that c₀ < c₁ < ⋯ < c_l. The class intervals are

(c₀, c₁], (c₁, c₂], …, (c_{l−1}, c_l].

Roughly speaking, the class intervals are a non-overlapping partition of the interval [X_min, X_max]. As f(x) = F′(x), we expect that when c_{j−1} and c_j are close,

f(x) ≈ (F(c_j) − F(c_{j−1}))/(c_j − c_{j−1}) for x ∈ (c_{j−1}, c_j], j = 1, 2, …, l.

Note that

F(c_j) − F(c_{j−1}) = P(X ∈ (c_{j−1}, c_j]) ≈ (1/n) Σ_{k=1}^n I(X_k ∈ (c_{j−1}, c_j]),

where the right-hand side is the relative frequency of occurrences of X_k ∈ (c_{j−1}, c_j]. Thus, we can approximate f(x) by

f(x) ≈ h_n(x) = Σ_{k=1}^n I(X_k ∈ (c_{j−1}, c_j]) / (n(c_j − c_{j−1})) for x ∈ (c_{j−1}, c_j], j = 1, 2, …, l.

We call hn (x) the relative frequency histogram. Clearly, the way that we define the

class intervals is not unique, and hence the value of hn (x) is not unique. When the

sample size n is large and the length of the class interval is small, hn (x) is expected

to be a good estimate of f (x).

The property of hn (x) is as follows:

(i) h_n(x) ≥ 0 for all x;

(ii) the total area bounded by the x axis and below h_n(x) equals one, i.e.,

∫_{c₀}^{c_l} h_n(x) dx = 1;

(iii) the probability for an event A, which is composed of a union of class intervals, can be estimated by the area above A bounded by h_n(x), i.e.,

P(A) ≈ ∫_A h_n(x) dx.


Example 1.15. Consider the following fifty high school cumulative grade point averages (GPAs).

3.77 2.78 3.40 2.20 3.26

3.00 2.85 2.65 3.08 2.92

3.69 2.83 2.75 3.97 2.74

2.90 3.38 2.38 2.71 3.31

3.92 3.29 4.00 3.50 2.80

3.57 2.84 3.18 3.66 2.86

2.81 3.10 2.84 2.89 2.59

2.95 2.77 3.90 2.82 3.89

2.83 2.28 3.20 2.47 3.00

3.78 3.48 3.52 3.20 3.30

(a) Construct a frequency table for these 50 GPAs using 10 intervals of equal length

with c0 = 2.005 and c10 = 4.005.

(b) Construct a relative frequency histogram for the grouped data.

(c) Estimate f (3) and f (4).

Solution. (a) and (b). The frequency and the relative frequency histogram based on

the class intervals are given in the following table:

class interval   frequency   relative frequency histogram
(2.005, 2.205]   1           0.1
(2.205, 2.405]   2           0.2
(2.405, 2.605]   2           0.2
(2.605, 2.805]   7           0.7
(2.805, 3.005]   14          1.4
(3.005, 3.205]   5           0.5
(3.205, 3.405]   6           0.6
(3.405, 3.605]   4           0.4
(3.605, 3.805]   4           0.4
(3.805, 4.005]   5           0.5

(c) From the table,

f(3) ≈ h₅₀(3) = 14/(50 × (3.005 − 2.805)) = 1.4,
f(4) ≈ h₅₀(4) = 5/(50 × (4.005 − 3.805)) = 0.5.
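A short sketch (assuming NumPy; the GPA values are the fifty numbers listed above) reproduces the relative frequency histogram h₅₀(x) on the ten class intervals.

```python
import numpy as np

gpa = np.array([
    3.77, 2.78, 3.40, 2.20, 3.26, 3.00, 2.85, 2.65, 3.08, 2.92,
    3.69, 2.83, 2.75, 3.97, 2.74, 2.90, 3.38, 2.38, 2.71, 3.31,
    3.92, 3.29, 4.00, 3.50, 2.80, 3.57, 2.84, 3.18, 3.66, 2.86,
    2.81, 3.10, 2.84, 2.89, 2.59, 2.95, 2.77, 3.90, 2.82, 3.89,
    2.83, 2.28, 3.20, 2.47, 3.00, 3.78, 3.48, 3.52, 3.20, 3.30,
])

# Ten class intervals of equal length with c0 = 2.005 and c10 = 4.005.
edges = np.linspace(2.005, 4.005, 11)
freq, _ = np.histogram(gpa, bins=edges)

# Relative frequency histogram: h_n(x) = frequency / (n * interval length).
h = freq / (len(gpa) * np.diff(edges))
print(freq)              # 1, 2, 2, 7, 14, 5, 6, 4, 4, 5
print(np.round(h, 2))    # 0.1, 0.2, 0.2, 0.7, 1.4, 0.5, 0.6, 0.4, 0.4, 0.5
```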

1.5 Expectation

If u(X₁, …, X_n) is a function of n discrete random variables that have a joint p.d.f. f(x₁, …, x_n), then

E[u(X₁, …, X_n)] = Σ_{(x₁,…,x_n)∈E} u(x₁, …, x_n) f(x₁, …, x_n),

where the summation is taken over all possible values of (x₁, …, x_n). If E[u(X₁, …, X_n)] exists, it is called the mathematical expectation (or expected value) of u(X₁, …, X_n).

Remark 1.4. E[u(X₁, …, X_n)] exists if

Σ_{(x₁,…,x_n)∈E} |u(x₁, …, x_n)| f(x₁, …, x_n) < ∞.

Some important special cases are the following.

(i) μ_i := E(X_i) = Σ_{x_i} x_i f_i(x_i), where the summation is taken over all possible values of X_i. We call μ_i the mean of X_i.

(ii) σ_i² := Var(X_i) = E((X_i − μ_i)²) = Σ_{x_i} (x_i − μ_i)² f_i(x_i), where the summation is taken over all possible values of X_i. We call σ_i² the variance of X_i, and σ_i the standard deviation of X_i.

(iii) σ_{ij} := Cov(X_i, X_j) = E((X_i − μ_i)(X_j − μ_j)) = Σ_{(x_i,x_j)} (x_i − μ_i)(x_j − μ_j) f_{i,j}(x_i, x_j), where the summation is taken over all possible values of (X_i, X_j). We call σ_{ij} the covariance of X_i and X_j.

(iv) If σ_i > 0 and σ_j > 0, then the correlation coefficient of X_i and X_j is

ρ(X_i, X_j) = Cov(X_i, X_j)/√(Var(X_i) Var(X_j)) = σ_{ij}/(σ_i σ_j).

Simple algebra shows that

Cov(X_i, X_j) = E(X_i X_j) − μ_i μ_j.

Example 1.16. Let X₁ and X₂ have the joint p.d.f.

f(x₁, x₂) = (x₁ + 2x₂)/18, x₁ = 1, 2, x₂ = 1, 2.

The marginal p.d.f. of X₁ is

f₁(x₁) = Σ_{x₂=1,2} (x₁ + 2x₂)/18 = (2x₁ + 6)/18, x₁ = 1, 2,

and the marginal p.d.f. of X₂ is

f₂(x₂) = Σ_{x₁=1,2} (x₁ + 2x₂)/18 = (3 + 4x₂)/18, x₂ = 1, 2.

Hence,

μ₁ = Σ_{x₁=1,2} x₁ f₁(x₁) = 8/18 + 2 · 10/18 = 14/9

and

σ₁² = Σ_{x₁=1,2} (x₁ − μ₁)² f₁(x₁) = (1 − 14/9)² · 8/18 + (2 − 14/9)² · 10/18 = 20/81;

μ₂ = Σ_{x₂=1,2} x₂ f₂(x₂) = 7/18 + 2 · 11/18 = 29/18

and

σ₂² = Σ_{x₂=1,2} (x₂ − μ₂)² f₂(x₂) = (1 − 29/18)² · 7/18 + (2 − 29/18)² · 11/18 = 77/324.

Moreover,

Cov(X₁, X₂) = E(X₁X₂) − μ₁μ₂ = Σ_{x₁=1,2} Σ_{x₂=1,2} x₁x₂(x₁ + 2x₂)/18 − μ₁μ₂ = −1/162,

so that

ρ(X₁, X₂) = Cov(X₁, X₂)/(σ₁σ₂) = (−1/162)/√((20/81)(77/324)) = −1/√1540 ≈ −0.025.
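The arithmetic in Example 1.16 can be verified with a short sketch (assuming NumPy), which builds the 2x2 table of joint probabilities and computes the means, variances, covariance, and correlation directly.

```python
import numpy as np

vals = np.array([1, 2])
# Joint p.d.f. f(x1, x2) = (x1 + 2*x2) / 18 on x1, x2 in {1, 2}.
f = np.array([[(x1 + 2 * x2) / 18 for x2 in vals] for x1 in vals])

f1 = f.sum(axis=1)                  # marginal p.d.f. of X1
f2 = f.sum(axis=0)                  # marginal p.d.f. of X2
mu1, mu2 = vals @ f1, vals @ f2     # means 14/9 and 29/18
var1 = (vals - mu1) ** 2 @ f1       # 20/81
var2 = (vals - mu2) ** 2 @ f2       # 77/324
exy = sum(x1 * x2 * f[i, j]
          for i, x1 in enumerate(vals) for j, x2 in enumerate(vals))
cov = exy - mu1 * mu2               # -1/162
rho = cov / np.sqrt(var1 * var2)    # about -0.025
print(mu1, mu2, var1, var2, cov, rho)
```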

Property 1.7. Let X be a discrete random variable with finite mean E(X), and let a and b be constants. Then,

(i) E(aX + b) = aE(X) + b;

(ii) if P(X = b) = 1, then E(X) = b;

(iii) if P(a < X ≤ b) = 1, then a < E(X) ≤ b;

(iv) if g(X) and h(X) have finite mean, then E[g(X) + h(X)] = E[g(X)] + E[h(X)].

Proof. (iii) In this case f(x) = 0 for x ∉ (a, b], so

E(X) = Σ_x x f(x) = Σ_{x∈(a,b]} x f(x) ≤ Σ_{x∈(a,b]} b f(x) = b,

and similarly E(X) = Σ_{x∈(a,b]} x f(x) > Σ_{x∈(a,b]} a f(x) = a.

(iv) Note that

E[g(X) + h(X)] = Σ_x [g(x) + h(x)] f(x) = Σ_x g(x) f(x) + Σ_x h(x) f(x) = E(g(X)) + E(h(X)) < ∞.

Property 1.8. Let X be a random variable taking non-negative integer values. Then E(X) = Σ_{x=1}^∞ P(X ≥ x) = Σ_{x=0}^∞ P(X > x).

Proof. By definition,

E(X) = Σ_{x=1}^∞ x f(x) = Σ_{x=1}^∞ f(x) Σ_{r=1}^∞ I(r ≤ x) = Σ_{x=1}^∞ Σ_{r=1}^∞ f(x) I(r ≤ x).

Interchanging the order of summation,

E(X) = Σ_{r=1}^∞ Σ_{x=1}^∞ f(x) I(r ≤ x) = Σ_{r=1}^∞ Σ_{x=r}^∞ f(x) = Σ_{r=1}^∞ P(X ≥ r).

Finally,

Σ_{r=1}^∞ P(X ≥ r) = Σ_{r=1}^∞ P(X > r − 1) = Σ_{r=0}^∞ P(X > r).

Example 1.17. A coin shows a head with probability p. Then, how many times do

you expect to toss the coin until it first shows a head?

Solution. Let the required number of tosses until the first head be T . Then, as each

toss is independent, P(T = x) = q^{x−1} p for x ≥ 1, where q = 1 − p. Hence,

E(T) = Σ_{x=1}^∞ x q^{x−1} p = 1/p.

Another simple way to calculate E(T) is by noting that P(T > x) = q^x for x ≥ 0 and

E(T) = Σ_{x=0}^∞ P(T > x) = Σ_{x=0}^∞ q^x = 1/p.
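A quick simulation sketch (assuming NumPy; p = 0.3 is an arbitrary choice) checks that the average number of tosses until the first head is close to 1/p.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                   # probability of a head on each toss

# numpy's geometric sampler returns the number of trials until the first
# success, which is exactly T in Example 1.17.
tosses = rng.geometric(p, size=200_000)
print(tosses.mean(), 1 / p)               # both should be close to 3.33...
```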

If u(X₁, …, X_n) is a function of n continuous random variables that have a joint p.d.f. f(x₁, …, x_n), then

E[u(X₁, …, X_n)] = ∫_{R^n} u(x₁, …, x_n) f(x₁, …, x_n) dx₁ dx₂ ⋯ dx_n.

If E[u(X₁, …, X_n)] exists, it is called the mathematical expectation (or expected value) of u(X₁, …, X_n).

Remark 1.5. E[u(X₁, …, X_n)] exists if

∫_{R^n} |u(x₁, …, x_n)| f(x₁, …, x_n) dx₁ dx₂ ⋯ dx_n < ∞.

Let X₁, …, X_n be continuous random variables with marginal p.d.f.s f_i(x_i), i = 1, 2, …, n.

(i) The mean of X_i is μ_i := E(X_i) = ∫_R x_i f_i(x_i) dx_i.

(ii) The variance of X_i is σ_i² := Var(X_i) = E((X_i − μ_i)²) = ∫_R (x_i − μ_i)² f_i(x_i) dx_i. Here, σ_i is the standard deviation of X_i.

(iii) The covariance of X_i and X_j is σ_{ij} := Cov(X_i, X_j) = E((X_i − μ_i)(X_j − μ_j)) = ∫_R ∫_R (x_i − μ_i)(x_j − μ_j) f_{i,j}(x_i, x_j) dx_i dx_j.

(iv) If σ_i > 0 and σ_j > 0, then the correlation coefficient of X_i and X_j is

ρ(X_i, X_j) = Cov(X_i, X_j)/√(Var(X_i) Var(X_j)) = σ_{ij}/(σ_i σ_j).

As in the discrete case, we say that X_i and X_j are uncorrelated if Cov(X_i, X_j) = 0, where

Cov(X_i, X_j) = E(X_i X_j) − μ_i μ_j.

Example 1.18. Let X have the N(μ, σ²) distribution. Then,

E(X) = ∫_{−∞}^{∞} (x/(√(2π)σ)) exp[−(x − μ)²/(2σ²)] dx
= ∫_{−∞}^{∞} ((σs + μ)/(√(2π)σ)) exp[−s²/2] d(σs + μ)   (letting s = (x − μ)/σ)
= σ ∫_{−∞}^{∞} (s/√(2π)) exp[−s²/2] ds + μ ∫_{−∞}^{∞} (1/√(2π)) exp[−s²/2] ds.

The first integrand is an odd function, and so the integral over R is zero. The second integral is one by some algebra. Hence, E(X) = μ.

Example 1.19. Let X have density f(x) = (θ − 1)x^{−θ} for x ≥ 1 and θ > 1. Then,

E(X) = ∫_1^∞ x (θ − 1) x^{−θ} dx = (θ − 1) lim_{n→∞} ∫_1^n x^{−(θ−1)} dx.

When θ ≤ 2 (i.e., θ − 1 ≤ 1), E(X) = ∞. When θ > 2, it is not hard to see that

E(X) = (θ − 1)/(θ − 2).


Property 1.9. Let X be a continuous random variable with finite mean E(X), c.d.f. F and p.d.f. f, let a and b be constants, and let g and h be functions. Then,

(i) if g(X) and h(X) have finite mean, then E[g(X) + h(X)] = E[g(X)] + E[h(X)];

(ii) if P(a ≤ X ≤ b) = 1, then a ≤ E(X) ≤ b;

(iii) if h is non-negative, then for a > 0, P(h(X) ≥ a) ≤ E(h(X))/a;

(iv) if g is convex, then g(E(X)) ≤ E(g(X)).

Proof. (ii) Since F(b) − F(a) = P(a ≤ X ≤ b) = 1 and F(x) ∈ [0, 1] for any x ∈ R, it follows that F(b) = 1 and F(a) = 0. By the monotonicity of F(x), we know that F(x) = 1 for x ≥ b, and F(x) = 0 for x ≤ a. Therefore,

|∫_{−∞}^{a} x f(x) dx| ≤ ∫_{−∞}^{a} |x| f(x) dx = lim_{n→∞} ∫_{−n}^{a} |x| f(x) dx ≤ lim_{n→∞} max(|n|, |a|) ∫_{−n}^{a} f(x) dx = lim_{n→∞} max(|n|, |a|)(F(a) − F(−n)) = 0.

Similarly, we can show that |∫_b^∞ x f(x) dx| = 0. Hence,

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{−∞}^{a} x f(x) dx + ∫_a^b x f(x) dx + ∫_b^{∞} x f(x) dx = ∫_a^b x f(x) dx.

Note that

∫_a^b x f(x) dx ≥ ∫_a^b a f(x) dx = a(F(b) − F(a)) = a,
∫_a^b x f(x) dx ≤ ∫_a^b b f(x) dx = b(F(b) − F(a)) = b.

(iii) Note that

P(h(X) ≥ a) = E(I(h(X) ≥ a)) ≤ E((h(X)/a) I(h(X) ≥ a)) ≤ E(h(X)/a),

since h(X)/a ≥ 1 on the event {h(X) ≥ a} and h is non-negative.

(iv) The proof is omitted.


Property 1.10. Let X be a non-negative random variable with c.d.f. F, p.d.f. f, and finite expected value E(X). Then, if lim_{n→∞} n(F(n) − 1) = 0,

E(X) = ∫_0^∞ (1 − F(x)) dx.

Proof.

E(X) = ∫_0^∞ x f(x) dx
= lim_{n→∞} ∫_0^n x f(x) dx
= lim_{n→∞} ∫_0^n x dF(x)
= lim_{n→∞} [nF(n) − ∫_0^n F(x) dx] (integration by parts)
= lim_{n→∞} [n(F(n) − 1) + ∫_0^n (1 − F(x)) dx],

from which it follows that E(X) = ∫_0^∞ (1 − F(x)) dx.

Property 1.11. Let a, b, c, and d be constants. Then,

(i) E(X²) = 0 if and only if P(X = 0) = 1;

(ii) Cov(aX + b, cY + d) = ac Cov(X, Y);

(iii) Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y);

(iv) if X and Y are independent, E(h(X)g(Y)) = E(h(X))E(g(Y)) provided that E(h(X)) < ∞ and E(g(Y)) < ∞;

(v) −1 ≤ ρ(X, Y) ≤ 1;

(vi) |ρ(X, Y)| = 1 if and only if P(X = aY + b) = 1 for some constants a and b;

(vii) ρ(aX + b, cY + d) = sgn(ac) ρ(X, Y), where sgn(x) denotes the sign of x;

(viii) if X and Y are independent, ρ(X, Y) = 0.

Proof. (i) The proof of the "if" part follows directly from Property 1.9(ii). We now give the proof of the "only if" part. Suppose that P(X = 0) < 1. Then, there exist constants a and b such that b ≥ a > 0 and P(a ≤ |X| ≤ b) > 0. Hence,

E(X²) = E(X² I(a ≤ |X| ≤ b)) + E(X² I(|X| < a)) + E(X² I(|X| > b)) ≥ E(X² I(a ≤ |X| ≤ b)) ≥ a² E(I(a ≤ |X| ≤ b)) = a² P(a ≤ |X| ≤ b) > 0,

which contradicts the condition that E(X²) = 0. This completes the proof of (i).

The proofs of the remaining parts are left as an exercise. (Hint: the proofs of (v) and (vi) rely on applying the Cauchy-Schwarz inequality below to the random variables X − E(X) and Y − E(Y).)


(Cauchy-Schwarz inequality) For any random variables X and Y with finite second moments,

[E(XY)]² ≤ E(X²)E(Y²).

Proof. Without loss of generality, we assume that E(Y²) > 0. Note that

0 ≤ E[(XE(Y²) − Y E(XY))²] = E(Y²)[E(X²)E(Y²) − (E(XY))²].

Since E(Y²) > 0, it follows that (E(XY))² ≤ E(X²)E(Y²).

Chapter 2

Preliminary

Abstract This chapter discusses the moment generating function and some important convergence results, including the law of large numbers and the central limit theorem. The moment generating function is in one-to-one correspondence with the cumulative distribution function, and hence it determines the distribution of a random variable. In many applications, the expected value of a random experiment is particularly important, and it is estimated by the sample mean. The law of large numbers ensures that the sample mean is a reasonable estimate of the expected value, and the variation of this estimate can be measured by the central limit theorem.

2.1 Moment generating function

Let r be a positive integer. The r-th moment about the origin of a random variable X is defined as μ_r = E(X^r). In order to calculate μ_r, we can make use of the moment generating function (m.g.f.).

Definition 2.1. (Moment generating function) The moment generating function of a random variable X is a function of t ∈ R defined by MX(t) = E(e^{tX}), if it exists.

Property 2.1. Suppose that MX(t) exists. Then,

(1) MX(t) = Σ_{r=0}^∞ μ_r t^r/r!;

(2) μ_r = M_X^{(r)}(0) for r = 1, 2, …;

(3) for constants a and b, M_{aX+b}(t) = e^{bt} MX(at).

Proof. (1) For a discrete random variable X,

MX(t) = Σ_x e^{tx} P(X = x) = Σ_x Σ_{r=0}^∞ ((tx)^r/r!) P(X = x) = Σ_{r=0}^∞ (t^r/r!) Σ_x x^r P(X = x) = Σ_{r=0}^∞ μ_r t^r/r!.


For a continuous random variable X, the proof is similar by using integrals instead

of sums.

(2) Make use of (1).

(3) M_{aX+b}(t) = E[e^{(aX+b)t}] = e^{bt} E(e^{atX}) = e^{bt} MX(at).

By Taylor's theorem,

MX(t) = MX(0) + M′X(0) t/1! + M″X(0) t²/2! + M‴X(0) t³/3! + ⋯,

so if the Maclaurin series expansion of MX(t) can be found, the r-th moment μ_r is the coefficient of t^r/r!; or, if MX(t) exists and the moments are given, we can frequently sum the Maclaurin series to obtain the closed form of MX(t).

Property 2.2. (Uniqueness) There is a one-to-one correspondence between the m.g.f. MX(t), provided it exists, and the p.d.f. f(x) (or c.d.f. F(x)).

The above property shows that we can determine the distribution of X by calculating its m.g.f.

Example 2.1. Let X ~ N(μ, σ²). Then

E(e^{tX}) = ∫_{−∞}^{+∞} e^{tx} (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx
= ∫_{−∞}^{+∞} (1/(√(2π)σ)) exp[−(x² − 2μx − 2σ²tx + μ²)/(2σ²)] dx
= exp[(2μσ²t + σ⁴t²)/(2σ²)] ∫_{−∞}^{+∞} (1/(√(2π)σ)) exp[−(x − μ − σ²t)²/(2σ²)] dx
= exp(μt + σ²t²/2),

because (1/(√(2π)σ)) exp[−(x − μ − σ²t)²/(2σ²)] is the density function of N(μ + σ²t, σ²).

Example 2.2. Find the moment generating function of a random variable X following a Poisson distribution with mean λ.

Solution.

MX(t) = E(e^{tX}) = Σ_{x=0}^∞ e^{tx} e^{−λ}λ^x/x! = e^{−λ} Σ_{x=0}^∞ (λe^t)^x/x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.


Example 2.3. Find the moment generating function of a random variable which has a (probability) density function given by

f(x) = e^{−x} for x > 0, and f(x) = 0 otherwise.

Solution.

MX(t) = E(e^{tX}) = ∫_0^{+∞} e^{tx} e^{−x} dx = [e^{(t−1)x}/(t − 1)]_0^{+∞} = 1/(1 − t) for t < 1; MX(t) does not exist for t ≥ 1.

Then,

μ₁ = M′X(0) = 1/(1 − t)² |_{t=0} = 1,
μ₂ = M″X(0) = 2/(1 − t)³ |_{t=0} = 2,
μ₃ = M‴X(0) = 2 · 3/(1 − t)⁴ |_{t=0} = 3!.

Property 2.3. If X₁, X₂, …, X_n are independent random variables, MX_i(t) exists for i = 1, 2, …, n, and Y = X₁ + X₂ + ⋯ + X_n, then MY(t) exists and

MY(t) = Π_{i=1}^n MX_i(t).

Example 2.4. Find the probability distribution of the sum of n independent random variables X₁, X₂, …, X_n following Poisson distributions with means λ₁, λ₂, …, λ_n respectively.

Solution. By Property 2.3,

MY(t) = Π_{i=1}^n MX_i(t) = Π_{i=1}^n e^{λ_i(e^t − 1)} = e^{(e^t − 1) Σ_{i=1}^n λ_i},

which is the m.g.f. of a Poisson random variable with mean Σ_{i=1}^n λ_i. Therefore, by Example 2.2 and Property 2.3, Y follows the Poisson distribution with mean Σ_{i=1}^n λ_i.


Example 2.5. For positive numbers α and λ, find the moment generating function of a gamma distribution Gamma(α, λ), whose density function is given by

f(x) = λ^α x^{α−1} e^{−λx}/Γ(α) for x > 0, and f(x) = 0 otherwise.

Solution.

MX(t) = E(e^{tX}) = ∫_0^{+∞} e^{tx} λ^α x^{α−1} e^{−λx}/Γ(α) dx
= ∫_0^{+∞} λ^α x^{α−1} e^{−(λ−t)x}/Γ(α) dx
= (λ^α/(λ − t)^α) ∫_0^{+∞} (λ − t)^α x^{α−1} e^{−(λ−t)x}/Γ(α) dx
= (λ/(λ − t))^α for t < λ; MX(t) does not exist for t ≥ λ,

where

∫_0^{+∞} (λ − t)^α x^{α−1} e^{−(λ−t)x}/Γ(α) dx = 1

is due to the fact that (λ − t)^α x^{α−1} e^{−(λ−t)x}/Γ(α) for x > 0 is the density function of Gamma(α, λ − t).

Example 2.6. Find the distribution of the sum of n independent random variables X₁, X₂, …, X_n, where X_i follows Gamma(α_i, λ), i = 1, 2, …, n, with the p.d.f. given by

f(x) = λ^{α_i} x^{α_i−1} e^{−λx}/Γ(α_i) for x > 0, and f(x) = 0 otherwise.

Solution. From the previous example, we know that the moment generating function of X_i is

MX_i(t) = (λ/(λ − t))^{α_i} for t < λ, i = 1, 2, …, n.

Hence, the moment generating function of X₁ + X₂ + ⋯ + X_n is

M_{Σ X_i}(t) = (λ/(λ − t))^{α₁ + α₂ + ⋯ + α_n} for t < λ,

which is the m.g.f. of Gamma(α₁ + ⋯ + α_n, λ); hence X₁ + X₂ + ⋯ + X_n follows the Gamma(α₁ + ⋯ + α_n, λ) distribution.


Example 2.7. Show that the sum of n independent random variables X₁, X₂, …, X_n, each following a Bernoulli distribution with parameter p, follows B(n, p), the binomial distribution with parameters n and p.

Proof. On one hand, for i = 1, 2, …, n, MX_i(t) = e^{0·t} P(X_i = 0) + e^{1·t} P(X_i = 1) = (1 − p) + e^t p = 1 − p + pe^t, and hence the moment generating function of X₁ + X₂ + ⋯ + X_n is

Π_{i=1}^n MX_i(t) = (1 − p + pe^t)^n.

On the other hand, the moment generating function of B(n, p) is

Σ_{x=0}^n e^{tx} \binom{n}{x} p^x (1 − p)^{n−x} = Σ_{x=0}^n \binom{n}{x} (pe^t)^x (1 − p)^{n−x} = (pe^t + 1 − p)^n.

Since the two m.g.f.s coincide, the conclusion follows from the uniqueness of the m.g.f. (Property 2.2).

2.2 Convergence

Functions of the random sample X = {X₁, …, X_n} are called statistics. Two important statistics are the sample mean X̄ and the sample variance S², where

X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/n) Σ_{i=1}^n (X_i − X̄)².

Although we can compute the observed values of these statistics, x̄ and s², we should recognize that each value is only one observation of the respective random variables X̄ and S². That is, each of X̄ and S² is also a random variable with its own distribution.

Suppose that the random sample X is from a distribution F(x) with mean μ = E(X) and variance σ² = Var(X). When n is large, Theorem 1.7 shows that F(x) can be well approximated by Fn(x). Meanwhile, we can easily show that X̄ and S² are the mean and variance of a random variable with distribution Fn(x). Therefore, it is expected that when n is large, μ and σ² can be well approximated by X̄ and S², respectively.

Definition 2.2. (Convergence in probability) Let (Z_n; n ≥ 1) be a sequence of random variables. We say the sequence Z_n converges in probability to Z if, for any ε > 0,

P(|Z_n − Z| > ε) → 0 as n → ∞.

For brevity, this is often written as Z_n →p Z.

Theorem 2.1. (Weak law of large numbers) Let (X_n; n ≥ 1) be a sequence of independent random variables having the same finite mean and variance, μ = E(X₁) and σ² = Var(X₁). Then, as n → ∞,

X̄ →p μ.

Proof. Write S_n = Σ_{i=1}^n X_i, so that X̄ = S_n/n. It suffices to show that, for any ε > 0,

P(|S_n/n − μ| > ε) → 0 as n → ∞. (2.1)

By Chebyshov's inequality (Property 2.4 below),

P(|S_n/n − μ| > ε) ≤ (1/ε²) E[(S_n/n − μ)²] = (1/(n²ε²)) E[(Σ_{i=1}^n (X_i − μ))²].

Moreover,

E[(Σ_{i=1}^n (X_i − μ))²] = E[Σ_{i=1}^n Σ_{j=1}^n (X_i − μ)(X_j − μ)] = E[Σ_{i=1}^n (X_i − μ)²] + E[Σ_{i=1}^n Σ_{j≠i} (X_i − μ)(X_j − μ)] = E[Σ_{i=1}^n (X_i − μ)²] (by Property 1.11(iv)) = nσ².

Hence,

P(|S_n/n − μ| > ε) ≤ nσ²/(n²ε²) = σ²/(nε²) → 0 as n → ∞,

which completes the proof.
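A small simulation sketch (assuming NumPy; exponential draws with mean 2 are an arbitrary choice) illustrates the weak law of large numbers: the sample mean settles near μ as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = 2.0                                    # population mean

for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=mu, size=n)   # X_1, ..., X_n with E(X_1) = 2
    print(n, x.mean())                      # sample mean approaches mu = 2
```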


Property 2.4. (Chebyshov's inequality) Suppose that E(X²) < ∞. Then, for any constant a > 0,

P(|X| ≥ a) ≤ E(X²)/a².

Proof. This is left as an exercise.

Property 2.5. If X_n →p α and Y_n →p β, then (i) X_n + Y_n →p α + β; (ii) X_nY_n →p αβ; (iii) X_n/Y_n →p α/β if Y_n ≠ 0 and β ≠ 0; (iv) g(X_n) →p g(α) for a continuous function g(·).

Proof. The proof is omitted.

Example 2.8. Let (X_n; n ≥ 1) be a sequence of independent random variables having the same finite mean μ = E(X₁), finite variance σ² = Var(X₁), and finite fourth moment μ₄ = E(X₁⁴). Show that

S² →p Var(X₁).

Definition 2.3. (Convergence in distribution) Let (Z_n; n ≥ 1) be a sequence of random variables. We say the sequence Z_n converges in distribution to Z if, as n → ∞,

P(Z_n ≤ x) → P(Z ≤ x) for all x at which the c.d.f. of Z is continuous.

For brevity, this is written as Z_n →d Z.

Theorem 2.2. (Central limit theorem) Let (X_n; n ≥ 1) be a sequence of independent random variables having the same finite mean and variance, μ = E(X₁) and σ² = Var(X₁). Then, as n → ∞,

(X̄ − E(X̄))/√(Var(X̄)) = √n(X̄ − μ)/σ →d N(0, 1).

To prove the above central limit theorem, we need the following lemma:

Lemma 2.1. If

1. MZ_n(t), the moment generating function of Z_n, exists, n = 1, 2, …,

2. lim_{n→∞} MZ_n(t) exists and equals the moment generating function of a random variable Z,

then

lim_{n→∞} GZ_n(x) = GZ(x) for all x at which GZ(x) is continuous,

where GZ_n(x) is the distribution function of Z_n, n = 1, 2, …, and GZ(x) is the distribution function of Z.

Proof (of Theorem 2.2). As (X_n; n ≥ 1) is a sequence of independent random variables having the same finite mean μ = E(X₁) and finite variance σ² = Var(X₁), simple algebra gives us that


E(X̄) = (1/n) Σ_{i=1}^n E(X_i) = μ and Var(X̄) = E[X̄²] − [E(X̄)]² = σ²/n.

Hence, it suffices to show that

Z_n := √n(X̄ − μ)/σ →d N(0, 1). (2.2)

A heuristic proof for (2.2) is based on Lemma 2.1. Let Y_i = (X_i − μ)/σ; then E(Y_i) = 0 and Var(Y_i) = 1, and suppose the moment generating function MY_i(t) exists. A Taylor expansion of MY_i(t) around 0 gives

MY_i(t) = MY_i(0) + t M′Y_i(0) + (t²/2) M″Y_i(ξ), for some 0 ≤ ξ ≤ t.

Since Z_n = (1/√n) Σ_{i=1}^n Y_i, the moment generating function of Z_n is thus given by

MZ_n(t) = Π_{i=1}^n MY_i(t/√n)
= [MY_i(t/√n)]^n
= [MY_i(0) + (t/√n) M′Y_i(0) + ((t/√n)²/2) M″Y_i(ξ)]^n
= [1 + (t/√n) E(Y_i) + (t²/(2n)) M″Y_i(ξ)]^n
= [1 + (t²/(2n)) M″Y_i(ξ)]^n,

where 0 ≤ ξ ≤ t/√n. As n → ∞, ξ → 0 and M″Y_i(ξ) → M″Y_i(0) = E(Y_i²) = 1. Hence,

lim_{n→∞} MZ_n(t) = lim_{n→∞} (1 + t²/(2n))^n = exp(t²/2) = exp(0·t + (1/2)·1·t²),

which is the moment generating function of an N(0, 1) random variable. Hence, the conclusion follows directly from Lemma 2.1.
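A simulation sketch (assuming NumPy and SciPy; exponential summands and n = 50 are arbitrary choices) illustrates Theorem 2.2 by comparing the empirical distribution of Z_n with Φ(z) at a few points.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                             # mean and sd of Exp(1) summands

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # standardized sample means

for c in (-1.0, 0.0, 1.0, 2.0):
    print(c, np.mean(z <= c), norm.cdf(c))       # empirical P(Z_n <= c) vs Phi(c)
```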

Chapter 3

Point estimator

Abstract A random variable X resulting from a random experiment is assumed to have a certain distribution with p.d.f. f(x; θ), where θ ∈ Θ ⊂ R^s is an unknown parameter taking values in a set Θ, and Θ is usually called the parameter space. For example, X is often assumed to follow a normal distribution N(μ, σ²); in this case θ = (μ, σ) is the unknown parameter, and the parameter space is Θ = {(μ, σ) : μ ∈ R, σ > 0}. For the experimenter, the important question is how to find a good estimator for the unknown parameter θ. Heuristically, a random sample X = {X₁, X₂, …, X_n}, which forms an empirical distribution, can elicit information about the distribution of X. Hence, it is natural to expect that we can estimate the unknown parameter θ based on a random sample X.

3.1 Maximum likelihood estimator

We first illustrate the idea of maximum likelihood estimation via the example below.

Example 3.1. Suppose that X follows a Bernoulli distribution so that the p.d.f. of X is

f(x; p) = p^x (1 − p)^{1−x}, x = 0, 1,

where the unknown parameter p ∈ Θ with Θ = {p : p ∈ (0, 1)}. Further, assume that we have a random sample X = {X₁, X₂, …, X_n} with the observable values x = {x₁, x₂, …, x_n}, respectively. Then, the probability that X = x is

L(x₁, …, x_n; p) = P(X₁ = x₁, X₂ = x₂, …, X_n = x_n) = Π_{i=1}^n p^{x_i}(1 − p)^{1−x_i} = p^{Σ_{i=1}^n x_i} (1 − p)^{n − Σ_{i=1}^n x_i},

which is the joint p.d.f. of X₁, X₂, …, X_n evaluated at the observed values. The joint p.d.f. is a function of p. Then, we want to find the value of p that maximizes this probability, namely,

p̂ = arg max_p L(x₁, …, x_n; p).

The way to propose p̂ is reasonable, because p̂ most likely has produced the sample values x₁, …, x_n. We call p̂ the maximum likelihood estimate, since "likelihood" is often used as a synonym for "probability" in informal contexts.

Conventionally, we denote L(p) = L(x₁, …, x_n; p), and p̂ is more easily computed by

p̂ = arg max_p log L(p).

[Note that the p̂ that maximizes log L(p) also maximizes L(p).] By simple algebra (see the example below), we can show that

p̂ = (1/n) Σ_{i=1}^n x_i,

and the corresponding estimator is called the maximum likelihood estimator (MLE) of p; that is,

p̂ = (1/n) Σ_{i=1}^n X_i.

Definition 3.1. (Likelihood function) Suppose that the random sample X = {X₁, X₂, …, X_n} has joint p.d.f. f(x₁, …, x_n; θ) with parameter θ, where the parameter θ is within a certain parameter space Θ. Then, the likelihood function of this random sample is defined as

L(θ) = f(X; θ) = f(X₁, X₂, …, X_n; θ)

for θ ∈ Θ. Moreover, ℓ(θ) = log L(θ) is called the log-likelihood function.

Definition 3.2. (Maximum likelihood estimator) Based on the likelihood function L(θ) for θ ∈ Θ, the maximum likelihood estimator (MLE) of θ is defined as θ̂ = arg max_{θ∈Θ} L(θ); its observed value is called the maximum likelihood estimate.

Example 3.2. Let X = {X₁, X₂, …, X_n} be a random sample from a Bernoulli distribution with parameter p with 0 < p < 1. Find the maximum likelihood estimator of p.


Solution. The likelihood function is

L(p) = Π_{i=1}^n p^{X_i}(1 − p)^{1−X_i},

so that the log-likelihood function is

ℓ(p) = ln L(p) = Σ_{i=1}^n [X_i ln p + (1 − X_i) ln(1 − p)] = ln p Σ_{i=1}^n X_i + ln(1 − p) Σ_{i=1}^n (1 − X_i).

Note that

dℓ(p)/dp = (1/p) Σ_{i=1}^n X_i − (1/(1 − p))(n − Σ_{i=1}^n X_i) = [(1 − p) Σ_{i=1}^n X_i − np + p Σ_{i=1}^n X_i]/[p(1 − p)] = n(X̄ − p)/[p(1 − p)].

Hence,

p < X̄ implies dℓ(p)/dp > 0, and p > X̄ implies dℓ(p)/dp < 0,

which imply that ℓ(p) attains its maximum at p = X̄. Therefore, the maximum likelihood estimator of p is X̄.
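A short sketch (assuming NumPy and SciPy; the simulated data use p = 0.4) checks numerically that the log-likelihood ℓ(p) is maximized at the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.4, size=500)          # Bernoulli(p = 0.4) sample

def neg_loglik(p):
    # -l(p) = -[sum(x)*ln(p) + (n - sum(x))*ln(1 - p)]
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                      # numerical maximizer vs closed-form X-bar
```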

Example 3.3. Let X = {X₁, X₂, …, X_n} be a random sample from the uniform distribution over the interval [0, θ]. Find the maximum likelihood estimator of θ.

Solution. Note that the uniform distribution over the interval [0, θ] has the p.d.f. given by

f(x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise; that is, f(x; θ) = (1/θ) I(0 ≤ x ≤ θ).

For the random sample X, the likelihood function is

L(θ) = Π_{i=1}^n (1/θ) I(0 ≤ X_i ≤ θ) = (1/θ^n) Π_{i=1}^n I(0 ≤ X_i ≤ θ),

which is positive only when

0 ≤ X_i ≤ θ, i = 1, 2, …, n.

Since 1/θ^n increases as θ decreases, we must select θ to be as small as possible subject to the previous constraint. Therefore, the maximizer of L(θ) should be selected as the maximum of X₁, X₂, …, X_n; that is, the maximum likelihood estimator is θ̂ = X_(n) = max_{1≤i≤n} X_i.

Example 3.4. Let X = {X₁, X₂, …, X_n} be a random sample from N(θ₁, θ₂), where θ = (θ₁, θ₂) and Θ = {(θ₁, θ₂) : θ₁ ∈ R, θ₂ > 0}. Find the MLEs of θ₁ and θ₂. [Note: here we let θ₁ = μ and θ₂ = σ².]

Solution. The likelihood function is

L(θ) = Π_{i=1}^n (1/√(2πθ₂)) exp[−(X_i − θ₁)²/(2θ₂)],

so that

ℓ(θ) = log L(θ) = −(n/2) log(2πθ₂) − Σ_{i=1}^n (X_i − θ₁)²/(2θ₂).

Setting the partial derivatives to zero gives

0 = ∂ℓ(θ)/∂θ₁ = (1/θ₂) Σ_{i=1}^n (X_i − θ₁),
0 = ∂ℓ(θ)/∂θ₂ = −n/(2θ₂) + (1/(2θ₂²)) Σ_{i=1}^n (X_i − θ₁)²,

whose solutions are

θ̂₁ = X̄ = (1/n) Σ_{i=1}^n X_i and θ̂₂ = S² = (1/n) Σ_{i=1}^n (X_i − X̄)².

By considering the usual condition on the second partial derivatives, these solutions do provide a maximum. Thus, the MLEs of θ₁ and θ₂ are

θ̂₁ = X̄ and θ̂₂ = S²,

respectively.

3.2 Method of moments estimator

In practice, we may not know the full information about X except for certain of its moments. Recall that μ_r = E(X^r) contains information about the unknown parameter θ. For example, if X ~ N(μ, σ²), we know that

μ₁ = μ and μ₂ = σ² + μ²,

or

μ = μ₁ and σ² = μ₂ − μ₁².

That is, the unknown parameters μ and σ² can be estimated if we find good estimators for μ₁ and μ₂. Note that by the weak law of large numbers (see Theorem 2.1),

m₁ = (1/n) Σ_{i=1}^n X_i →p E(X) = μ₁ and m₂ = (1/n) Σ_{i=1}^n X_i² →p E(X²) = μ₂,

and these estimators lead to the method of moments estimators defined below.

To generalize the idea above, we assume that the unknown parameter θ ∈ Θ ⊂ R^s can be expressed as

θ = h(μ₁, μ₂, …, μ_k), (3.1)

for some function h. In the normal example above, k = 2 and h = (h₁, h₂) with

h₁(μ₁, μ₂) = μ₁ and h₂(μ₁, μ₂) = μ₂ − μ₁².

Define the r-th sample moment

m_r = (1/n) Σ_{i=1}^n X_i^r, r = 1, 2, ….

Unlike μ_r, m_r always exists for any positive integer r. In view of (3.1), the method of moments estimator (MME) of θ is defined by

θ̂ = h(m₁, m₂, …, m_k).

Example 3.5. Let X be an independent random sample from a gamma distribution with the p.d.f. given by

f(x) = λ^α x^{α−1} e^{−λx}/Γ(α) for x > 0, and f(x) = 0 otherwise.

Find an MME of (α, λ).

Solution. Some simple algebra shows that the first two moments are

μ₁ = α/λ and μ₂ = (α² + α)/λ².

[Note: μ₁ and μ₂ can be obtained from Example 2.6.] Substituting α = λμ₁ into the second equation, we get

μ₂ = ((λμ₁)² + λμ₁)/λ² = μ₁² + μ₁/λ, or λ = μ₁/(μ₂ − μ₁²),

and hence

α = λμ₁ = μ₁²/(μ₂ − μ₁²).

Therefore, one MME of (α, λ) is (α̂, λ̂), where

α̂ = m₁²/(m₂ − m₁²) = X̄²/((1/n) Σ_{i=1}^n X_i² − X̄²) and λ̂ = m₁/(m₂ − m₁²) = X̄/((1/n) Σ_{i=1}^n X_i² − X̄²).
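A sketch (assuming NumPy; the true values α = 3, λ = 2 are arbitrary) computes the method of moments estimates from simulated gamma data.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, lam = 3.0, 2.0
# numpy parameterizes the gamma distribution by shape and scale = 1/lambda.
x = rng.gamma(shape=alpha, scale=1 / lam, size=10_000)

m1 = x.mean()                 # first sample moment
m2 = np.mean(x ** 2)          # second sample moment

alpha_hat = m1 ** 2 / (m2 - m1 ** 2)
lam_hat = m1 / (m2 - m1 ** 2)
print(alpha_hat, lam_hat)     # should be close to (3, 2)
```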

It is worth noting that the way to construct h in (3.1) is not unique. Usually, we use the lowest possible order moments to construct h, although this may not be the optimal way. To consider the optimal MME, one may refer to the generalized method of moments estimator for further reading.

3.3 Estimator properties

For the same unknown parameter θ, many different estimators may be obtained. Heuristically, some estimators are good and others bad. The question is how we establish a criterion of goodness to compare one estimator with another. The particular properties of estimators that we will discuss below are unbiasedness, efficiency, and consistency.

3.3.1 Unbiasedness

For an estimator θ̂ of θ, a desirable property is that its mean be equal to θ, namely, E(θ̂) = θ. That is, in practice, we would want E(θ̂) to be reasonably close to θ.

The bias of an estimator θ̂ of θ is defined as

Bias(θ̂) = E(θ̂) − θ.

If Bias(θ̂) = 0, i.e., E(θ̂) = θ, then θ̂ is called an unbiased estimator of θ; otherwise, θ̂ is biased. Moreover, θ̂ is called an asymptotically unbiased estimator if

lim_{n→∞} Bias(θ̂) = lim_{n→∞} [E(θ̂) − θ] = 0.

Example 3.3. (cont) (i) Show that θ̂ = X_(n) is an asymptotically unbiased estimator of θ; (ii) modify this estimator of θ to make it unbiased.

Solution. (i) Write Y = X_(n) = max_{1≤i≤n} X_i. For 0 ≤ y ≤ θ,

P(Y ≤ y) = Π_{i=1}^n P(X_i ≤ y) = (y/θ)^n.

Hence,

E(Y) = ∫_0^θ P(Y > y) dy = ∫_0^θ [1 − (y/θ)^n] dy = θ − θ/(n + 1) = nθ/(n + 1) → θ as n → ∞,

which implies that Y is an asymptotically unbiased estimator of θ.

(ii) Furthermore, since E((n + 1)Y/n) = θ, we know that (n + 1)Y/n is an unbiased estimator of θ.

Next, we check whether X̄ and S² are unbiased estimators of μ₁ = E(X₁) and σ² = Var(X₁). First,

E(X̄) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n μ₁ = μ₁,

so X̄ is an unbiased estimator of μ₁. Next, it is easy to see that

E(S²) = E[(1/n) Σ_{i=1}^n (X_i − X̄)²] = (1/n) Σ_{i=1}^n E[(X_i − X̄)²].

For each i,

E[(X_i − X̄)²] = E[(X₁ − X̄)²]
= Var(X₁ − X̄) (since E(X₁ − X̄) = 0)
= Var(X₁ − (X₁ + X₂ + ⋯ + X_n)/n)
= Var((n − 1)X₁/n − Σ_{i=2}^n X_i/n)
= (n − 1)²σ²/n² + (n − 1)σ²/n² (by the independence among the X_i's)
= (n − 1)σ²/n.

Hence, it follows that

E(S²) = (1/n) Σ_{i=1}^n E[(X_i − X̄)²] = ((n − 1)/n)σ²,

so S² is a biased but asymptotically unbiased estimator of σ².

3.3.2 Efficiency

Suppose that we have two unbiased estimators θ̂₁ and θ̂₂. The question is how to compare θ̂₁ and θ̂₂ in terms of a certain criterion. To answer this question, we first introduce the so-called mean squared error of a given estimator θ̂.

Definition 3.6. (Mean squared error) Suppose that θ̂ is an estimator of θ. The mean squared error of θ̂ is

MSE(θ̂) = E[(θ̂ − θ)²].

For a given estimator θ̂, MSE(θ̂) is the mean (expected) value of the square of the error (difference) θ̂ − θ. This criterion can be decomposed into two parts as shown below.

Property 3.1. If Var(θ̂) exists, then the mean squared error of θ̂ is

MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]².

Proof.

MSE(θ̂) = E[(θ̂ − θ)²]
= E({[θ̂ − E(θ̂)] + [E(θ̂) − θ]}²)
= E([θ̂ − E(θ̂)]² + 2[θ̂ − E(θ̂)][E(θ̂) − θ] + [E(θ̂) − θ]²)
= Var(θ̂) + 2E[θ̂ − E(θ̂)][E(θ̂) − θ] + [Bias(θ̂)]²
= Var(θ̂) + [Bias(θ̂)]².

Remark 3.1. The following result is straightforward: lim_{n→∞} MSE(θ̂) = 0 if and only if lim_{n→∞} Var(θ̂) = 0 and lim_{n→∞} Bias(θ̂) = 0.

As discussed before, unbiasedness is a desirable property for an estimator θ̂. Thus, it is reasonable to restrict our attention to unbiased estimators θ̂. In this case,

MSE(θ̂) = Var(θ̂)

by Property 3.1. Now, for two unbiased estimators θ̂₁ and θ̂₂, we only need to select the one with the smaller variance, and this motivates us to define the efficiency between θ̂₁ and θ̂₂.

Definition 3.7. (Efficiency) Suppose that θ̂₁ and θ̂₂ are two unbiased estimators of θ. The efficiency of θ̂₁ relative to θ̂₂ is defined by

Eff(θ̂₁, θ̂₂) = Var(θ̂₂)/Var(θ̂₁).

Example 3.6. Let (X_n; n ≥ 1) be a sequence of independent random variables having the same finite mean and variance, μ = E(X₁) and σ² = Var(X₁). We can show that X̄ is an unbiased estimator of μ, and Var(X̄) = σ²/n. Suppose that we now take two samples, one of size n₁ and one of size n₂, and denote the sample means as X̄⁽¹⁾ and X̄⁽²⁾, respectively. Then,

Eff(X̄⁽¹⁾, X̄⁽²⁾) = Var(X̄⁽²⁾)/Var(X̄⁽¹⁾) = n₁/n₂.

Therefore, the larger the sample size, the more efficient the sample mean is for estimating μ.

Example 3.3. (cont) Note that (n + 1)Y/n is an unbiased estimator of θ, where Y = X_(n) is the n-th order statistic. Show that

(i) 2X̄ is also an unbiased estimator of θ;
(ii) Compare the efficiency of these two estimators of θ.

Solution. (i) Since E(X̄) equals the population mean, which is θ/2, E(2X̄) = θ. Thus, 2X̄ is an unbiased estimator of θ.

(ii) First we must find the variance of the two estimators. Before, we have already obtained

P(Y ≤ y) = (y/θ)^n for 0 ≤ y ≤ θ.

Therefore, for Z = Y², it is not hard to show that

P(Z ≤ z) = P(Y ≤ √z) = (√z/θ)^n for 0 ≤ z ≤ θ².

Hence,

E(Z) = ∫_0^{θ²} [1 − (√z/θ)^n] dz
= ∫_0^{θ} 2y[1 − y^n/θ^n] dy (by setting y = √z)
= 2 ∫_0^{θ} [y − y^{n+1}/θ^n] dy
= 2[θ²/2 − θ^{n+2}/((n + 2)θ^n)]
= nθ²/(n + 2).

Or by a direct calculation, we can show that

E(Z) = ∫_0^{θ} y² d(y^n/θ^n) = (n/θ^n) ∫_0^{θ} y^{n+1} dy = (n/θ^n) · θ^{n+2}/(n + 2) = nθ²/(n + 2).

Hence,

Var((n + 1)Y/n) = E[((n + 1)Y/n)²] − [E((n + 1)Y/n)]²
= ((n + 1)/n)² · nθ²/(n + 2) − θ²
= [(n² + 2n + 1)/(n² + 2n) − 1] θ²
= θ²/(n(n + 2)).

Moreover,

Var(2X̄) = 4Var(X̄) = 4 · θ²/(12n) = θ²/(3n).

Therefore,

Eff((n + 1)Y/n, 2X̄) = Var(2X̄)/Var((n + 1)Y/n) = (n + 2)/3.

For n > 1, (n + 1)Y/n is more efficient than 2X̄, and for n = 1, (n + 1)Y/n and 2X̄ have the same efficiency.
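A simulation sketch (assuming NumPy; θ = 5 and n = 10 are arbitrary) compares the two unbiased estimators: both average to θ, but (n + 1)Y/n has the smaller variance, in line with Eff = (n + 2)/3.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 5.0, 10, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
est1 = (n + 1) / n * x.max(axis=1)            # (n+1)/n * X_(n)
est2 = 2 * x.mean(axis=1)                     # 2 * X-bar

print(est1.mean(), est2.mean())               # both near theta = 5
print(est2.var() / est1.var(), (n + 2) / 3)   # simulated efficiency vs (n+2)/3
```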

Ideally, we would like an unbiased estimator θ̂ that has the smallest variance among all unbiased estimators. This desirable estimator is usually called the uniformly minimum variance unbiased estimator (UMVUE), that is,

θ̂_UMVUE = arg min_{θ̂ : θ̂ is unbiased} Var(θ̂).

Clearly, a UMVUE is relatively more efficient than any other unbiased estimator.

Intuitively, finding the UMVUE is not an easy task. But in many cases, Var(θ̂) has a lower bound over all unbiased estimators; therefore, if the variance of an unbiased estimator achieves this lower bound, it must be a UMVUE.

In summary, we can conclude that an unbiased estimator θ̂ is a UMVUE if we can show that

(i) Var(θ̂) has a lower bound over all unbiased estimators; (3.2)

(ii) the variance of θ̂ achieves this lower bound. (3.3)

It is worth noting that conditions (3.2)-(3.3) are not necessary for the UMVUE, since there are cases in which the UMVUE cannot achieve the lower bound in (3.2).

To consider the lower bound of Var(θ̂), we need to introduce the so-called Fisher information.

The Fisher information of the random sample X about θ is defined as

I_n(θ) = E[(∂ℓ(θ)/∂θ)²],

where ℓ(θ) is the log-likelihood function of the random sample X = {X₁, …, X_n} from a population with p.d.f. f(x; θ). Then, under certain regularity conditions, we have the following conclusions.

(i) I_n(θ) = nI(θ), where

I(θ) = E[(∂ ln f(X; θ)/∂θ)²]

and X is a random variable with the same distribution as the population;

(ii) I(θ) = −E[∂² ln f(X; θ)/∂θ²];

(iii) (Cramér-Rao inequality) for any unbiased estimator θ̂ of θ,

Var(θ̂) ≥ 1/I_n(θ),

where 1/I_n(θ) is called the Cramér-Rao lower bound (CRLB).

Proof. (i) Let ∂ ln f(X; θ)/∂θ be the score function. Under certain regularity conditions, it can be shown that the first moment of the score is

E[∂ ln f(X; θ)/∂θ] = E[(∂f(X; θ)/∂θ)/f(X; θ)] = ∫ ((∂f(x; θ)/∂θ)/f(x; θ)) f(x; θ) dx = ∫ ∂f(x; θ)/∂θ dx = (∂/∂θ) ∫ f(x; θ) dx = (∂/∂θ) 1 = 0.

Hence, by the independence of X₁, …, X_n, it follows that

I_n(θ) = E[(∂ℓ(θ)/∂θ)²]
= E[(Σ_{i=1}^n ∂ ln f(X_i; θ)/∂θ)²]
= E[Σ_{i=1}^n (∂ ln f(X_i; θ)/∂θ)²] + Σ_{i≠j} E[∂ ln f(X_i; θ)/∂θ] E[∂ ln f(X_j; θ)/∂θ]
= E[Σ_{i=1}^n (∂ ln f(X_i; θ)/∂θ)²]
= nI(θ).

(ii) Note that

E[∂² ln f(X; θ)/∂θ²] = ∫ [(∂²f(x; θ)/∂θ²)/f(x; θ) − ((∂f(x; θ)/∂θ)/f(x; θ))²] f(x; θ) dx
= ∫ ∂²f(x; θ)/∂θ² dx − ∫ (∂ ln f(x; θ)/∂θ)² f(x; θ) dx
= (∂²/∂θ²) ∫ f(x; θ) dx − I(θ)
= (∂²/∂θ²) 1 − I(θ) = −I(θ).


(iii) Denote the joint p.d.f. of (X₁, …, X_n) by L(θ). For any unbiased estimator θ̂, we can write θ̂ = g(X₁, …, X_n) for some function g. Then, we have

0 = E(θ̂ − θ) = ∫ ⋯ ∫ [g(x₁, …, x_n) − θ] L(θ) dx₁ ⋯ dx_n.

Differentiating both sides with respect to θ,

0 = −∫ ⋯ ∫ L(θ) dx₁ ⋯ dx_n + ∫ ⋯ ∫ [g(x₁, …, x_n) − θ] (∂L(θ)/∂θ) dx₁ ⋯ dx_n,

which implies that

1 = ∫ ⋯ ∫ [g(x₁, …, x_n) − θ] (∂L(θ)/∂θ) dx₁ ⋯ dx_n.

Using the Cauchy-Schwarz inequality, (∫ s₁(x)s₂(x) dx)² ≤ ∫ s₁²(x) dx ∫ s₂²(x) dx, we have

1 = {∫ ⋯ ∫ [g(x₁, …, x_n) − θ] (∂L(θ)/∂θ) dx₁ ⋯ dx_n}²
= {∫ ⋯ ∫ [g(x₁, …, x_n) − θ] √L(θ) · (∂ log L(θ)/∂θ) √L(θ) dx₁ ⋯ dx_n}²
≤ ∫ ⋯ ∫ [g(x₁, …, x_n) − θ]² L(θ) dx₁ ⋯ dx_n · ∫ ⋯ ∫ (∂ log L(θ)/∂θ)² L(θ) dx₁ ⋯ dx_n.

Hence, 1 ≤ Var(θ̂) I_n(θ), which implies that the Cramér-Rao inequality holds.

Corollary 3.1. If θ̂ is an unbiased estimator of θ and Var(θ̂) = 1/I_n(θ), then θ̂ is a UMVUE of θ.

Example. Let X = {X₁, …, X_n} be a random sample from N(μ, σ²). We show that X̄ is a UMVUE of μ.

Solution. Since

f(x; μ) = (1/(√(2π)σ)) exp[−(1/2)((x − μ)/σ)²],

we have

ln f(x; μ) = −ln(√(2π)σ) − (1/2)((x − μ)/σ)² and ∂ ln f(x; μ)/∂μ = (x − μ)/σ².

Therefore,

I(μ) = E[(∂ ln f(X; μ)/∂μ)²] = E[(X − μ)²/σ⁴] = 1/σ²,

or

I(μ) = −E[∂² ln f(X; μ)/∂μ²] = −E(−1/σ²) = 1/σ².

Hence,

CRLB = 1/I_n(μ) = 1/(nI(μ)) = σ²/n.

Recall that E(X̄) = μ and Var(X̄) = σ²/n. Thus, X̄ is a UMVUE of μ.

Example. Let X = {X₁, …, X_n} be a random sample from a Bernoulli distribution with parameter θ ∈ (0, 1). We show that X̄ is a UMVUE of θ.

Solution. For x = 0 or 1,

f(x; θ) = θ^x (1 − θ)^{1−x},

so that

∂ ln f(x; θ)/∂θ = (∂/∂θ)[x ln θ + (1 − x) ln(1 − θ)] = x/θ − (1 − x)/(1 − θ) = x/(θ(1 − θ)) − 1/(1 − θ).

Noting that

E[X/(θ(1 − θ)) − 1/(1 − θ)] = 0,

we have

I(θ) = E[(∂ ln f(X; θ)/∂θ)²] = Var[X/(θ(1 − θ))] = θ(1 − θ)/(θ²(1 − θ)²) = 1/(θ(1 − θ)).

Hence,

CRLB = 1/I_n(θ) = 1/(nI(θ)) = θ(1 − θ)/n.

Since E(X̄) = θ and Var(X̄) = θ(1 − θ)/n, X̄ is a UMVUE of θ.

If we know the full information about the population distribution of X, the following theorem tells us that the MLE tends to be the first choice asymptotically.

Theorem. Let θ̂ be the MLE of θ based on a random sample from the population distribution. Then, under certain regularity conditions, as n → ∞,

(θ̂ − θ)/√(1/I_n(θ)) →d N(0, 1).

If θ̂ is an unbiased estimator of θ, the above theorem implies that Var(θ̂) ≈ 1/I_n(θ) when n is large. That is, the MLE can achieve the CRLB asymptotically.

3.3.3 Consistency

In the previous discussions, we have restricted our attention to the unbiased estimator, and proposed a way to check whether an unbiased estimator is a UMVUE. Now, we introduce another property of the estimator, called consistency.

Definition. (Consistency) An estimator θ̂ is called a consistent estimator of θ if θ̂ →p θ, i.e., for any ε > 0,

P(|θ̂ − θ| > ε) → 0 as n → ∞,

where →p denotes convergence in probability as defined in Definition 2.2.

Property 3.2. If E(θ̂) → θ as n → ∞ and Var(θ̂) → 0 as n → ∞, then θ̂ is a consistent estimator of θ.

We shall mention that the unbiasedness along does not imply the consistency. A

toy example is as follows. Suppose that

as takes value of either 1 or -1, p 0.

48 3 Point estimator

(i) p ;

(ii) p ;

(iii) / p / assuming that = 0 and = 0;

(iv) if g is any real-valued function that is continuous at , g( ) p g( ).

ables having the same finite mean = E(X1 ), finite variance 2 = Var(X1 ), and

finite fourth moment 4 = E(X14 ). Show that X is a consistent estimator of , and

S2 is a consistent estimator of 2 .

Solution. Note that E(X) = and Var(X) = n 0 as n . Hence, by Property

2

3.2, X is a consistent estimator of (This is just the weak law of large numbers in

Theorem 2.1).

For S2 , we have

1 n 1 n 1 n

S2 =

n i=1

(Xi X)2 = (Xi2 + (X)2 2Xi X) = Xi2 (X)2 .

n i=1 n i=1

1 n 2

Xi p 2 = E(X12 ).

n i=1

(X)2 p 2 .

ables, we can show that

nS2

= n1

2

,

2

2

where k2 is a chi-square distribution with k degrees of freedom. Therefore, E( nS

2

)=

n 1, which implies that E(S ) = n as n . That is, S is an asymp-

2 n1 2 2 2

2

Moreover, since Var( nS

2

) = 2(n 1), we can obtain that

( )

2 nS2 2 4 (n 1)

Var(S2 ) = Var = 0 as n .

n 2 n2

Chapter 4

Interval Estimation

for a unknown parameter , leading to the guess of a single value as the value of

. However, a point estimator for does not provide much information about the

accuracy of the estimator. It is desirable to generate a narrow interval that will cover

the unknown parameter with a large probability (confidence). This motivates us

to consider the interval estimator in this section.

val [L(X),U(X)], where L(X) := L(X1 , , Xn ) and U(X) := U(X1 , , Xn ) are two

statistics such that L(X) U(X) with probability one.

Definition 4.2. (Interval estimate) If X = x is observed, [L(x),U(x)] is the interval

estimate of .

Although the definition is based on a closed interval [L(X),U(X)], it will some-

times be more natural to use an open interval (L(X),U(X)), a half-open and half-

closed interval (L(X),U(X)] (or [L(X),U(X))), or an one-sided interval (,U(X)]

(or [L(X), )).

The next example shows that compared to the point estimator, the interval estima-

tor can have some confidence (or guarantee) of capturing the parameter of interest,

although it gives up some precision.

Example 4.1. For an independent random sample X1 , X2 , X3 , X4 from N( , 1), con-

sider an interval estimator of by [X 1, X + 1]. Then, the probability that is

covered by the interval [X 1, X + 1] can be calculated by

( ) ( )

P [X 1, X + 1] = P X 1 X + 1

( )

= P 1 X 1

49

50 4 Interval Estimation

( )

X

= P 2 2

1/4

= P (2 Z 2)

0.9544,

where Z N(0, 1) and we have used the fact that X N( , 1/4). Thus, we have

over a 95% change of covering the unknown parameter with our interval estimator.

Note that for any point estimator of , we have P( = ) = 0. Sacrificing some

precision in the interval estimator, in moving from a point to an interval, has resulted

in increased confidence that our assertion about is correct.

The certainty of the confidence (or guarantee) is quantified in the following defi-

nition.

the confidence coefficient of [L(X),U(X)], denoted by (1 ), is the infimum of

the coverage probabilities, that is,

1 = inf P ,

Remark 4.1. In many situations, the coverage probability P is free of , and hence

1 = P( [L(X),U(X)]).

coefficient), is sometimes known as confidence interval. So, the terminologies of in-

terval estimators and confidence intervals are interchangeable. A confidence interval

with confidence coefficient equal to 1 , is called a 1 confidence interval.

the interval [0, ], and Y = max Xi . We are interested in an interval estimator of .

1in

We consider two candidate estimators: [aY, bY ] for 1 a < b, and [Y + c,Y + d] for

0 c < d.

For the first interval, we have

( ) ( )

1 Y 1 1 1

P = P ( [aY, bY ]) = P =P T ,

b a b a

where we can show that T has the p.d.f. fT (x) = nxn1 for 0 x 1. Hence, it

follows that

( ) 1/a ( )n ( )n

1 1 1 1

P T = nxn1 dx = .

b a 1/b a b

4.2 Confidence intervals for means 51

( 1 )n ( 1 )n

That is, the coverage probability P of the first interval is free of , and a b

is the confidence coefficient of this interval.

For the second interval, we have

P = P ( [Y + c,Y + d]) = P ( d Y c)

( )

d c

= P 1 T 1

1c/

= nxn1 dx

1d/

( ) ( )

c n d n

= 1 1 .

Hence, we find that the coverage probability P of the second interval is not free of

. Moreover, since ( ) ( )

c n d n

lim 1 1 = 0,

we know that inf P = 0 (i.e., the confidence coefficient of this interval is 0).

Now, the question is how to construct the interval estimator. One important way

to do it is using the pivotal quantity.

Definition 4.4. (Pivotal Quantity) A random variable Q(X, ) = Q(X1 , , Xn , ) is

a pivotal quantity if the distribution of Q(X, ) is free of . That is, regardless of

the distribution of X, Q(X, ) has the same distribution for all values of .

Logically, when Q(X, ) is a pivotal quantity, we can easily construct a 1

confidence interval for Q(X, ) by

( )

1 = P L e Q(X, ) U

e , (4.1)

e in (4.1) are equivalent to the inequalities L(X) U(X), Then, from (??), a

U

1 confidence interval of is [L(X),U(X)].

In the rest of this section, we will use the pivotal quantity to construct our interval

estimators, and always assume that X = {X1 , , Xn } is an independent random

sample from the population N( , 2 ).

52 4 Interval Estimation

X N( , 2 /n). (4.2)

X

Hence, when 2 is known, Z = N(0, 1) is a pivotal quantity involving .

/ n

Let

( )

1 = P z /2 Z z /2

( )

X

= P z /2 z /2

/ n

( )

= P X z /2 X + z /2 ,

n n

where z satisfies

P(Z z ) =

for Z N(0, 1). Usually, we call z the upper percentile of N(0, 1) at the level ;

see Fig. 4.1. So, when 2 is known, a 1 confidence interval of is

[ ]

X z /2 , X + z /2 . (4.3)

n n

Given the observed value of X = x and the value of z /2 , we can calculate the

area is 1

area is /2 area is /2

z/2 z/2

interval estimate of by

4.2 Confidence intervals for means 53

[ ]

x z /2 , x + z /2 .

n n

As the point estimator, the 1 confidence interval is also not unique. Ideally, we

should choose it as narrow as possible in some sense, but in practice, we usually

choose the equal-tail confidence interval as in (4.3) for convenience, since tables for

selecting equal probabilities in the two tails are readily available.

Example 4.3. A publishing company has just published a new college textbook. Be-

fore the company decides the price of the book, it wants to know the average price

of all such textbooks in the market. The research department at the company took

a sample of 36 such textbooks and collected information on their prices. This in-

formation produced a mean price of $48.40 for this sample. It is known that the

standard deviation of the prices of all such textbooks is $4.50. Construct a 90% con-

fidence interval for the mean price of all such college textbooks assuming that the

underlying population is normal.

Solution. From the given information, n = 36, x = 48.40 and = 4.50. Now, 1 =

0.9, i.e., = 0.1, and by (4.3), the 90% confidence interval for the mean price of all

such college textbooks is given by

4.50 4.50

[x z /2 , x + z /2 ] = [48.40 z0.05 , 48.40 + z0.05 ]

n n 36 36

[47.1662, 49.6338].

Example 4.4. Suppose the bureau of the census and statistics of a city wants to esti-

mate the mean family annual income for all families in the city. It is known that

the standard deviation for the family annual income is 60 thousand dollars. How

large a sample should the bureau select so that it can assert with probability 0.99

that the sample mean will differ from by no more than 5 thousand dollars?

( ) ( )

1 = P z /2 X z /2 = P |X | z /2 ,

n n n

( )2 ( )2

60z /2 60 2.576

n = 955.5517.

5 5

Thus, the sample size should be at least 956. (Note that we have to round 955.5517

up to the next higher integer. This is always the case when determining the sample

size.)

54 4 Interval Estimation

pivotal quantity, we need to use the following result.

( )2

nS2 n Xi X

(ii) 2 = i=1 2 is n1

2 , where 2 is a chi-square distribution with k

k

degrees of freedom;

X

(iii) T = is tn1 , where tk is a t distribution with k degrees of freedom.

S/ n 1

k

Zi2 is k2 ;

i=1

Z

(ii) If Z is N(0, 1), U is k2 , and Z and U are independent, then T = is tk .

U/k

Proof of Property 4.1. (i) The proof of (i) is out of the scope of this course;

(ii) Note that

( )2 [ ]2

n

Xi n

Xi X X

W= = +

i=1 i=1

( ) 2 n ( )2

n

Xi X X

= +

i=1 i=1

nS2

= + Z2,

2

where we have used the fact that the cross-product term is equal to

n

(X )(Xi X) 2(X ) n ( )

2 = Xi X = 0.

i=1 2 2

i=1

nS2

Note that W is n2 and Z 2 is 12 by Property 4.2(i). Since and Z 2 are independent

2

nS2

by (i), we can show that the m.g.f. of 2 is the same as the one of n1

2 . Hence, (ii)

holds.

(iii) Note that

(X )/( / n)

T= .

nS2 / 2 1/(n 1)

Hence, by Property 4.2(ii), T is tn1 .

From Property 4.1(iii), we know that T is a pivotal quantity of . Let

4.2 Confidence intervals for means 55

( )

1 = P t /2,d f =n1 T t /2,d f =n1

( )

X

= P t /2,d f =n1 t /2,d f =n1

S/ n 1

( )

S S

= P X t /2,d f =n1 X + t /2,d f =n1 ,

n1 n1

where t ,d f =k satisfies

P(T t ,d f =k ) =

for a random variable T tk ; see Fig. 4.2. So, when 2 is unknown, a 1 confi-

dence interval of is

[ ]

S S

X t /2,d f =n1 , X + t /2,d f =n1 . (4.4)

n1 n1

Given the observed value of X = x, S = s, and the value of t /2,d f =n1 , we can

area is 1

area is /2 area is /2

t/2,df=n t/2,df=n

[ ]

s s

x t /2,d f =n1 , x + t /2,d f =n1 .

n1 n1

Remark 4.2. Usually there is a row with degrees of freedom in a t-distribution

table, which actually shows values of z . In fact, when n , the distribution

56 4 Interval Estimation

function of tn tends to that of N(0, 1); see Fig. 4.3. That is, in tests or exams, if

n is so large that the value of t ,d f =n cannot be found, you may use z instead.

t3

t10

t20

N (0, 1)

Example 4.5. A paint manufacturer wants to determine the average drying time of

a new brand of interior wall paint. If for 12 test areas of equal size he obtained a

mean drying time of 66.3 minutes and a standard deviation of 8.4 minutes, construct

a 95% confidence interval for the true population mean assuming normality.

t0.025,11 2.201, the 95% confidence interval for is

[ ]

8.4 8.4

66.3 2.201 , 66.3 + 2.201 ,

12 1 12 1

Example 4.6. Construct a 95% confidence interval for the mean hourly wage of ap-

prentice geologists employed by the top 5 oil companies. For a sample of 50 ap-

prentice geologists, x = 14.75 and s = 3.0 (in dollars).

s 3.0

t /2,d f =n1 2.010 = 0.8614.

n 50 1

4.2 Confidence intervals for means 57

Thus, the 95% confidence interval is [14.75 0.86, 14.75 + 0.86], or [13.89, 15.59].

Besides the confidence interval for the mean of one single normal distribution, we

shall also consider the problem of constructing confidence intervals for the differ-

ence of the means of two normal distributions when the variances are unknown.

Let X = {X1 , X2 , , Xn } and Y = {Y1 ,Y2 , ,Ym } be random samples from in-

dependent distributions N(X , X2 ) and N(Y , Y2 ), respectively. We are of interest

to construct the confidence interval for X Y when X2 = Y2 = 2 .

First, we can show that

(X Y ) (X Y )

Z=

2 /n + 2 /m

is N(0, 1). Also, by the independence of X and Y, from Property 4.1(ii), we know

that

nS2 mS2

U = 2X + 2Y

is n+m2

2 . Moreover, by Property 4.1(i), V and U are independent. Hence,

Z

T=

U/(n + m 2)

[(X Y ) (X Y )]/ 2 /n + 2 /m

=

(nSX2 + mSY2 )/[ 2 (n + m 2)]

(X Y ) (X Y )

=

R

is tn+m2 , where ( )

nSX2 + mSY2 1 1

R= + .

n+m2 n m

That is, T is a pivotal quantity of X Y . Let

( )

1 = P t /2,d f =n+m2 T t /2,d f =n+m2

( )

(X Y ) (X Y )

= P t /2,d f =n+m2 t /2,d f =n+m2

R

( )

= P (X Y ) t /2,d f =n+m2 R X Y (X Y ) + t /2,d f =n+m2 R ,

58 4 Interval Estimation

[ ]

(X Y ) t /2,d f =n+m2 R, (X Y ) + t /2,d f =n+m2 R . (4.5)

t /2,d f =n+m2 , we can calculate the interval estimate of by

[ ]

(x y) t /2,d f =n+m2 r, (x y) + t /2,d f =n+m2 r .

where ( )

ns2X + msY2 1 1

r= + .

n+m2 n m

Example 4.7. Suppose that scores on a standardized test in mathematics taken by

students from large and small high schools are N(X , 2 ) and N(Y , 2 ), respec-

tively, where 2 is unknown. If a random sample of n = 9 students from large high

schools yielded x = 81.31, s2X = 60.76 and a random sample of m = 15 students

from small high schools yielded y = 78.61, sY2 = 48.24, the endpoints for a 95%

confidence interval for X Y are given by

( )

9 60.76 + 15 48.24 1 1

81.31 78.61 2.074 +

22 9 15

because P(T 2.074) = 0.975. That is, the 95% confidence interval is [3.95, 9.35].

nS2

First, we consider the one-sample case. By Property 4.1(ii), n1

2 is a pivotal

2

quantity involving 2 . Let

( )

nS2

1 = P 12

/2,d f =n1 2

/2,d f =n1

2

( )

nS2 nS2

=P 2

2

,

2 /2,d f =n1 1 /2,d f =n1

where 2 ,d f =n satisfies

P(T 2 ,d f =n ) =

for a random variable T n2 ; see Fig. 4.4. So, a 1 confidence interval of 2 is

4.3 Confidence intervals for variances 59

[ ]

nS2 nS2

, 2 . (4.6)

2 /2,d f =n1 1 /2,d f =n1

Given the observed value of S = s and the values of 2 /2,d f =n1 and 1

2

/2,d f =n1 ,

we can calculate the interval estimate of by

2

[ ]

ns2 ns2

, 2 .

2 /2,d f =n1 1 /2,d f =n1

area is 1

area is

/2

area is /2

21/2,df=n 2/2,df=n

Example 4.8. A machine is set up to fill packages of cookies. A recently taken ran-

dom sample of the weights of 25 packages from the production line gave a variance

of 2.9 g2 . Construct a 95% confidence interval for the standard deviation of the

weight of a randomly selected package from the production line.

= 1.8420,

2 /2,d f =n1 0.025,d

2

f =24 39.36

ns2 25(2.9) 25(2.9)

= 5.8468,

1

2

/2,d f =n1 0.975,d

2

f =24 12.40

60 4 Interval Estimation

the 95% confidence interval for the population variance is (1.8420, 5.8468). Tak-

ing positive square roots, we obtain the 95% confidence interval for the population

standard deviation to be (1.3572, 2.4180).

Next, we consider the two-sample case. Let X = {X1 , X2 , , Xn } and Y = {Y1 ,Y2 , ,Ym }

be random samples from independent distributions N(X , X2 ) and N(Y , Y2 ), re-

spectively. We are of interest to construct the confidence interval for X2 /Y2 .

Property 4.3. Suppose that U r21 and V r22 are independent. Then,

U/r1

Fr1 ,r2 =

V /r2

By Property 4.1(ii),

nSX2 mSY2

n1

2

and m1

2

.

X2 Y2

[ ]/[ ]

mSY2 nSX2

Fm1,n1 ,

Y2 (m 1) X2 (n 1)

( [ ]/[ ] )

mSY2 nSX2

1 = P F1 /2,d f =(m1,n1) F /2,d f =(m1,n1)

Y2 (m 1) X2 (n 1)

( )

n(m 1)SX2 X2 n(m 1)SX2

=P F1 /2,d f =(m1,n1) 2 F /2,d f =(m1,n1) ,

m(n 1)SY2 Y m(n 1)SY2

P(T F ,d f =(m,n) ) =

for a random variable T Fm,n ; see Fig. 4.5. So, a 1 confidence interval of

X2 /Y2 is

[ ]

n(m 1)SX2 n(m 1)SX2

F1 /2,d f =(m1,n1) , F /2,d f =(m1,n1) . (4.7)

m(n 1)SY2 m(n 1)SY2

4.3 Confidence intervals for variances 61

and F1 /2,d f =(m1,n1) , we can calculate the interval estimate of X2 /Y2 by

[ ]

n(m 1)s2X n(m 1)s2X

F1 /2,d f =(m1,n1) , F /2,d f =(m1,n1) .

m(n 1)sY2 m(n 1)sY2

area is 1

area is

/2

area is /2

Chapter 5

Hypothesis testing

about the validity of theories or hypotheses concerning physical phenomena. For

examples, (i) Is the new drug effective in combating a certain disease? (ii) Are fe-

males more talented in music than males? e.t.c. To answer these questions, we need

use the hypothesis testing, which is a procedure used to determine (make a decision)

whether a hypothesis should be rejected (declared false) or not.

Hypothesis a statement (or claim) about a population;

Hypothesis test a rule that leads to a decision to or not to reject a hypothesis;

Simple hypothesis a hypothesis that completely specifies the distribution of the

population;

Composite hypothesis a hypothesis that does not completely specify the distribu-

tion of the population;

Null hypothesis (H0 ) a hypothesis that is assumed to be true before it can be re-

jected;

Alternative hypothesis (H1 or Ha ) a hypothesis that will be accepted if the null

hypothesis is rejected.

Often, a hypothesis has the special form the unknown distributional parameter

belongs to a set. There may be two competing hypotheses of this form:

H0 : 0 versus H1 : 1 ,

Example 5.1. Suppose that the score of STAT2602 follows N( , 10), and we want

to know whether the theoretical mean = 80. In this case,

63

64 5 Hypothesis testing

H0 : = 80 versus H1 : = 80.

consists of exactly one real number; and H1 is a composite hypothesis, because it

can not completely specify the distribution of the score.

If both the two hypotheses are simple, the null hypothesis H0 is usually chosen

to be a kind of default hypothesis, which one tends to believe unless given strong

evidence otherwise.

Example 5.2. Suppose that the score of STAT2602 follows N( , 10), and we want

to know whether the theoretical mean = 80 or 70. In this case,

H0 : = 80 versus H1 : = 70.

given strong evidence otherwise.

we need to use the test statistic defined by

Test statistic the statistic upon which the statistical decision will be based.

Usually, the test statistic is a functional on the random sample X = {X1 , , Xn }, and

it is denoted by W (X). Some important terms about the test statistic are as follows:

Rejection region or critical region the set of values of the test statistic for which

the null hypothesis is rejected;

Acceptance region the set of values of the test statistic for which the null hypoth-

esis is not rejected (is accepted);

Type I error rejection of the null hypothesis when it is true;

Type II error acceptance of the null hypothesis when it is false.

Accept H0 Reject H0

H0 is true No error Type I error

H0 is false Type II error No error

{W (X) R}.

tively when the true value of the parameter is . That is,

( ) = P (W (X) R) for 0 ;

( ) = P (W (X) Rc ) for 1 .

5.1 Basic concepts 65

{

( ), for 0 ;

( ) := P (W (X) R) =

1 ( ), for 1 .

Definition 5.1. (Power function) The power function, ( ), of a test of H0 is the

probability of rejecting H0 when the true value of the parameter is .

Example 5.3. A manufacturer of drugs has to decide whether 90% of all patients

given a new drug will recover from a certain disease. Suppose

(a) the alternative hypothesis is that 60% of all patients given the new drug will

recover;

(b) the test statistic is W , the observed number of recoveries in 20 trials;

(c) he will accept the null hypothesis when T > 14 and reject it otherwise.

Find the power function of W .

ses are

H0 : p = 0.9 versus H1 : p = 0.6.

The test statistic W follows a binomial distribution B(n, p) with parameters n = 20

and p. The rejection region is {W 14}. Hence,

(p) = P p (W 14)

= 1 P p (W > 14)

20 ( )

20 k

= 1 p (1 p)20k

k=15 k

{

0.0113, for p = 0.9;

0.8744, for p = 0.6.

(This implies that the probability of committing a type I and type II error are 0.0113

and 0.1256, respectively.)

Example 5.4. Let X1 , , Xn be a random sample from N( , 2 ), where 2 is

0

known. Consider a test statistic T = X

/ n

for hypotheses H0 : 0 versus H1 :

> 0 . Assume that the rejection region is {T K}. Then, the power function is

( ) = P (T K)

( )

X

= P K+ 0

/ n / n

( )

0

= P Z K+ ,

/ n

66 5 Hypothesis testing

The ideal power function is 0 for 0 and 1 for 1 . However, this ideal

can not be attained in general. For a fixed sample size, it is usually impossible to

make both types of error probability arbitrarily small. In searching for a good test,

it is common to restrict consideration to test that control the type I error probability

at a specified level. Within this class of tests we then search for tests that have type

II error probability that is as small as possible.

The size defined below is used to control the type I error probability.

Definition 5.2. (Size) For [0, 1], a test with power function ( ) is a size test

if

sup ( ) = .

0

simple hypothesis = 0 , then = (0 ).

Example 5.5. Suppose that we want to test the null hypothesis that the mean of a

normal population with 2 = 1 is 0 against the alternative hypothesis that it is 1 ,

where 1 > 0 .

(a) Find the value of K such that {X K} provides a rejection region with the level

of significance = 0.05 for a random sample of size n.

(b) For the rejection region found in (a), if 0 = 10, 1 = 11 and we need the type

II probability 0.06, what should n be?

By definition,

= (0 ) = P0 (X K)

( )

X 0 K 0

= P 0

/ n / n

( )

K 0

=P Z ,

1/ n

( )

K 0

0.05 = P Z ,

1/ n

which is equivalent to

K 0 1.645

= z0.05 1.645 or K 0 + .

1/ n n

(b) Note that H0 : = 10, H1 : = 11, and the rejection region is {X K}. By

definition,

5.1 Basic concepts 67

= (1 ) = P1 (X < K)

( )

X 1 K 1

= P 1 <

/ n / n

( )

K 1

=P Z<

/ n

( )

0 + 1.645

1

n

P Z< .

/ n

( ) ( )

P Z < n(0 1 ) + 1.645 = P Z < n + 1.645 .

Hence,

0.06 n + 1.645 z0.06 1.555

n (1.645 + 1.555)2 10.24,

Remark 5.2. In the above example, the value of K in the rejection region {X K}

is determined by the significance level . For the test statistic X, the value of K

uniquely decides whether the null hypothesis is rejected or not, and it is usually

called a critical value of this test.

(1) State the null and alternative hypotheses and the level of significance .

(2) Choose a test statistic.

(3) Determine the rejection region.

(4) Calculate the value of the test statistic according to the particular sample drawn.

(5) Make a decision: reject H0 if and only if the value of the test statistic falls in the

rejection region.

The key steps are (2) and (3), and they can be accomplished by using the like-

lihood ratio or generalized likelihood ratio. The test statistic, denoted by W (X),

is chosen case by case. The rejection region of W (X) usually has the form of

{W (X) K}, {W (X) K}, {|W (X)| K}, or {W (X) K1 } {W (X) K2 },

where the values of K, K1 and K2 are determined by the significance level .

Instead of using step (5), we can also use p-value to make a decision.

Definition 5.3. (p-value) Let W (x) be the observed value of the test statistic W (X).

Case 1: The rejection region is {W (X) K}, then

0

68 5 Hypothesis testing

0

0

more extreme than (in the direction supporting the alternative hypothesis) what

was actually observed, when the null hypothesis is true.

From the above example, we know that p-value does not depend on , and it

helps us to make a decision by comparing its value with .

0 0

p-value W (x) K

the observed value of W (X) falls in the rejection region

H0 is rejected at the significance level .

12 ( )

20

= (0.9)k (0.1)20k 0.0004.

k=0 k

( )

X 10.417

= P for = 10

/ n / n

( )

10.417 10

=P Z

1/ 11

0.0833.

5.2 Most powerful tests 69

rejected.

Definition 5.4. (Most powerful tests) A test concerning a simple null hypothesis

= 0 against a simple alternative hypothesis = 1 is said to be most powerful if

the power of the test at = 1 is a maximum.

of a random sample X = {X1 , X2 , , Xn } defined by

L( ) = f (X1 , X2 , . . . , Xn ; ),

L(0 )

from a population with a parameter . Consider the likelihood ratio . Intu-

L(1 )

itively speaking, the null hypothesis should be rejected when the likelihood ratio is

small.

dom sample of size n from a population with exactly one unknown parameter .

Suppose that there is a positive constant k and a region C such that

P {(X1 , X2 , . . . , Xn ) C} = for = 0 ,

and

f (x1 , x2 , . . . , xn ; 0 )

k when (x1 , x2 , . . . , xn ) C,

f (x1 , x2 , . . . , xn ; 1 )

f (x1 , x2 , . . . , xn ; 0 )

k when (x1 , x2 , . . . , xn ) C,

f (x1 , x2 , . . . , xn ; 1 )

Construct a test, called the likelihood ratio test, which rejects H0 : = 0 and

accepts H1 : = 1 if and only if (X1 , X2 , . . . , Xn ) C. Then any other test which

has significance level has power not more than that of this likelihood ratio

test. In other words, the likelihood ratio test is most powerful among all tests having

significance level .

Proof. Suppose D is the rejection region of any other test which has significance

level . We consider first the continuous case. Note that

= P {(X1 , X2 , . . . , Xn ) C} for = 0

= f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ;

C

70 5 Hypothesis testing

= P {(X1 , X2 , . . . , Xn ) D} for = 0

= f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn .

D

f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn

C

f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn .

D

Subtracting

f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ,

CD

we get

f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn

CD

f (x1 , x2 , . . . , xn ; 0 )dx1 dx2 dxn ,

C D

f (x1 , x2 , . . . , xn ; 1 )dx1 dx2 dxn

CD

f (x1 , x2 , . . . , xn ; 0 )

dx1 dx2 dxn

k

CD

f (x1 , x2 , . . . , xn ; 0 )

dx1 dx2 dxn

k

C D

f (x1 , x2 , . . . , xn ; 1 )dx1 dx2 dxn .

C D

Adding f (x1 , x2 , . . . , xn ; 1 )dx1 dx2 dxn , we finally obtain

CD

f (x1 , x2 , . . . , xn ; 1 )dx1 dx2 dxn

C

f (x1 , x2 , . . . , xn ; 1 )dx1 dx2 dxn ,

D

5.2 Most powerful tests 71

or

P {(X1 , X2 , . . . , Xn ) C} P {(X1 , X2 , . . . , Xn ) D}

for = 1 . The last inequality states that the power of the likelihood ratio test at

= 1 is at least as much as that corresponding to the rejection region D. The proof

for discrete case is similar, with sums taking places of integrals.

rejection region for the likelihood ratio test is

L(0 )

k (X1 , X2 , , Xn ) C W (X) R,

L(1 )

where the interval R is chosen so that the test has the significance level . Generally

speaking, the likelihood ratio helps us to determine the test statistic and the form of

its rejection region.

where 2 = 02 is known, is to be used to test the null hypothesis = 0 against the

alternative hypothesis = 1 , where 1 > 0 . Use the Neyman-Pearson Lemma to

construct the most powerful test.

( )n [ ]

1 1 n

L( ) = exp 2 (Xi ) .

2

0 2 20 i=1

L(0 )

The likelihood ratio test rejects the null hypothesis = 0 if and only if k,

L(1 )

that is,

{ }

1 n [ ]

exp (Xi 1 ) (Xi 0 ) k

202 i=1

2 2

n

(21 Xi + 12 + 20 Xi 02 ) 202 ln k

i=1

n

n(12 02 ) + 2(0 1 ) Xi 202 ln k

i=1

20 ln k n(1 0 )

2 2 2

X (since 1 > 0 ).

2n(0 1 )

K such that P (X K) = for = 0 , that is,

( )

X 0 K 0 K 0 0 z

P = = z K = 0 + .

0 / n 0 / n 0 / n n

72 5 Hypothesis testing

Therefore, the most powerful test having significance level is the one which

has the rejection region

{ } { }

0 z X 0

X 0 + or z .

n 0 / n

(Note that the rejection region found does not depend on the value of 1 ).

population given by a density

If 0 Xi 1 for i = 1, 2, . . . , n, find the form of the most powerful test with signifi-

cance level for testing

H0 : = 2 versus H1 : = 1.

( ) 1 ( ) 1

n n n

L( ) = n

Xi I(0 Xi 1) = Xi

n

.

i=1 i=1 i=1

n

L(2)

= 2n Xi .

L(1) i=1

n

The likelihood ratio test rejects H0 if and only if 2n Xi K where K is a positive

i=1

n

constant (or, equivalently, Xi k where k is a positive constant).

i=1

jection regions for testing a simple null hypothesis against a simple alternative hy-

pothesis, but it does not always apply to composite hypotheses. We shall present a

general method for constructing rejection regions for tests of composite hypotheses

which in most cases have very satisfactory properties, although they are not neces-

sarily uniformly most powerful.

Suppose that , where is the parametric space. Consider the following

hypotheses:

5.3 Generalized likelihood ratio tests: One-sample case 73

H0 : 0 versus H1 : 1 ,

where 1 is the complement of 0 with respect to (i.e., 1 = /0 ). Let

0

The generalized likelihood ratio is defined as

L(0 )

= .

L( )

we would expect to be small. A generalized likelihood ratio test states, therefore,

that the null hypothesis H0 is rejected if and only if falls in a rejection region of

the form k, where 0 k 1.

Example 5.8. Find the generalized likelihood ratio test for testing

H0 : = 0 versus H1 : = 0

hand, since 0 contains only 0 , it follows that

( )n [ ]

1 1 n

L(0 ) = exp 2 (Xi 0 ) .

2

0 2 20 i=1

that ( )n [ ]

1 1 n

L( ) = exp 2 (Xi X)2 .

0 2 20 i=1

Hence,

L(0 )

=

L( )

{ [ ]}

n n

1

= exp 2 (Xi 0 ) (Xi X)

2 2

20 i=1 i=1

[ ]

n(X 0 ) 2

= exp .

202

74 5 Hypothesis testing

{ }

Therefore, the rejection region is X 0 K . In order that the level of signifi-

cance is , that is, ( )

P X 0 K = for = 0 ,

0

we should let K = z /2 n

, so that

( )

( ) 0

P X 0 K = P X 0 z /2

n

( ) ( )

X 0 X 0

= P z /2 + P z /2

0 / n 0 / n

( ) ( )

= P Z z /2 + P Z z /2

= + =

2 2

for = 0 . So, the generalized likelihood ratio test has the rejection region

{ }

X 0

z /2

0 / n

From aforemention example and the similar technique, we can have the following

table:

Test H0 H1

Rejection region p-value

)

{ } (

X 0 x 0

Two-tailed = 0

= 0 z /2 P |Z|

{ / n } ( / n)

X 0 x 0

Left-tailed = 0 or 0 < 0 z P Z

{ / n } ( / n)

X 0 x 0

Right-tailed = 0 or 0 > 0 z P Z

/ n / n

Example 5.9. The standard deviation of the annual incomes of government employ-

ees is $1400. The mean is claimed to be $35,000. Now a sample of 49 employees

has been drawn and their average income is $35,600. At the 5% significance level,

can you conclude that the mean annual income of all government employees is not

$35,000?

Solution 1.

Step 1: The mean ... is not 35,000 can be written as = 35000, while the

mean ... is 35,000 can be written as = 35000. Since the null hypothesis should

include an equality, we consider hypothesis:

5.3 Generalized likelihood ratio tests: One-sample case 75

X 0 X 35000 X 35000

Z= = = ,

/ n 1400/ 49 200

Step 3: At the significance level = 5%, the rejection region is

{ }

|Z| z /2 {|Z| 1.960} .

35600 35000

= 3.

200

Step 5: Since |3| 1.960, we reject H0 and accept H1 . Therefore, we conclude

that the mean annual income of all government employees is not $35,000 at the 5%

level of significance.

Solution 2.

Step 1:

H0 : = 35000 versus H1 : = 35000.

Step 2: The test statistic is

X 0 X 35000 X 35000

Z= = = ,

/ n 1400/ 49 200

Step 3: Since x = 35600, the value of the test statistic is

35600 35000

= 3.

200

Step 4:

Step 5: Since 0.0026 0.05 = , we reject H0 and accept H1 . Therefore we

conclude that the mean annual income of all government employees is not $35,000.

Example 5.10. The chief financial officer in FedEx believes that including a stamped

self-addressed envelope in the monthly invoice sent to customers will reduce the

amount of time it takes for customers to pay their monthly bills. Currently, cus-

tomers return their payments in 24 days on average, with a standard deviation of 6

days. It was calculated that an improvement of two days on average would cover the

costs of the envelopes (because cheques can be deposited earlier). A random sample

76 5 Hypothesis testing

of 220 customers was selected and stamped self-addressed envelopes were included

in their invoice packs. The amounts of time taken for these customers to pay their

bills were recorded and their mean is 21.63 days. Assume that the corresponding

population standard deviation is still 6 days. Can the chief financial officer conclude

that the plan will be profitable at the 10% significance level?

Solution 1. The plan will be profitable when < 22, and not profitable when

22. Since the null hypothesis should include an equality, we have

x 0 21.63 22

= 0.9147.

/ n 6/ 220

Since 0.9147 > 1.282 z0.1 , H0 should not be rejected. The chief financial

officer cannot conclude that the plan is profitable at the 10% significance level.

Solution 2: Consider

where Z follows N(0, 1). Therefore, H0 should not be rejected. The chief financial

officer cannot conclude that the plan is profitable at the 10% significance level.

Example 5.11. Find the generalized likelihood ratio test for testing

H0 : = 0 versus H1 : > 0

Solution. Now

= {( , ) : 0 , > 0} ,

0 = {( , ) : = 0 , > 0} ,

1 = {( , ) : > 0 , > 0} .

( )n [ ]

1 1 n

L( , ) = = exp 2 (Xi ) ,

2

2 2 i=1

5.3 Generalized likelihood ratio tests: One-sample case 77

and hence,

ln L( , ) n

= 2 (X ),

ln L( , ) n 1 n

= + 3 (Xi )2 .

i=1

1 n

2 = (Xi 0 )2 .

n i=1

This is because is the maximum value of ln L(0 , ), by noting that for all > 0,

1 n ln L( , )

< (Xi )2 > 0,

n i=1

1 n ln L( , )

>

n i=1

(Xi )2

< 0.

Therefore, ( )n

1 ( n)

L(0 ) = L(0 , ) = exp .

2 2

On , the maximum value of L( , ) is L( , ), where (noting that L( , )

decreases with respect to when > X and increases with respect to when

< X) {

0 , if X 0 ;

=

X, if X > 0 ,

and

1 n

2 = (Xi )2 .

n i=1

Therefore, ( )n

1 ( n)

L( ) = L( , ) = exp .

2 2

Thus, we have

1, if X 0 ;

n n/2

( )n ( 2 )n/2

L(0 ) (Xi X)

2

= = = =

L( ) 2

i=1 , if X > 0 .

n

(Xi 0 )2

i=1

78 5 Hypothesis testing

{ nonnegative

} constant k < 1 (since we do

not want to be 1). Then { k} X > 0 and k is equivalent to

n n

(Xi X)2 (Xi X)2 1

i=1 i=1

k2/n n = n = ,

n(X0 )2

(Xi 0 )2 (Xi X)2 + n(X 0 )2 1+ n

i=1 i=1 (Xi X)2

i=1

that is,

(X 0 )2 n(X 0 )2

= n k2/n 1,

S2

(Xi X) 2

i=1

or (since X > 0 )

X 0

c,

S/ n 1

where c is the constant (n 1)(k2/n 1). In order that the level of significance

is , that is, ( )

X 0

P( , ) c = for = 0 ,

S/ n 1

we should let c = t ,n1 , since

( )

X 0

P( , ) t ,n1 = P (tn1 t ,n1 ) for = 0

S/ n 1

by Property 4.1(iii). So, the generalized likelihood ratio test has the rejection region

{ }

X 0

t ,n1

S/ n 1

0 ?

0 = {( , ) : 0 , > 0} ,

1 = {( , ) : > 0 , > 0} .

5.3 Generalized likelihood ratio tests: One-sample case 79

{

X, if X < 0 ;

=

0 , if X 0 ,

and

1 n

2 = (Xi )2 .

n i=1

Therefore, ( )n

1 ( n)

L(0 ) = L( , ) = exp .

2 2

On , the maximum value of L( , ) is L( , ), where

1 n

= X and 2 = (Xi X)2

n i=1

Therefore, ( )n

1 ( n)

L( ) = L( , ) = exp .

2

2

Thus, we have

1, if X 0 ;

n n/2

( )n ( 2 )n/2

=

L(0 )

=

=

=

(Xi X)2

L( ) 2

ni=1

, if X > 0 .

(Xi 0 )2

i=1

Hence the generalized likelihood ratio remains the same as that in the previous ex-

ample, and so does the rejection region.

From aforemention two examples and the similar technique, we can have the

following table:

Test H0 H1 Rejection

region } p-value

)

{ (

X 0 x 0

Two-tailed = 0 = 0 t /2,n1

P |tn1 |

{ S/ n 1 } ( s/ n 1)

X 0 x 0

Left-tailed = 0 or 0 < 0 t ,n1 P tn1

{S/ n 1 } ( s/ n 1 )

X 0 x 0

Right-tailed = 0 or 0 > 0 t ,n1 P tn1

S/ n 1 s/ n 1

Example 5.13. According to the last census in a city, the mean family annual in-

come was 316 thousand dollars. A random sample of 900 families taken this year

80 5 Hypothesis testing

produced a mean family annual income of 313 thousand dollars and a standard de-

viation of 70 thousand dollars. At the 2.5% significance level, can we conclude that

the mean family annual income has declined since the last census?

x 0 313 316

= 1.286.

s/ n 1 70/ 900 1

Since 1.286 > 1.963 = t0.025,899 , we do not reject H0 . Thus we cannot conclude

that the mean family annual income has declined since the last census at the 2.5%

level of significance.

Example 5.14. Given a random sample of size n from a normal population with

unknown mean and variance, find the generalized likelihood ratio test for testing the

null hypothesis = 0 (0 > 0) against the alternative hypothesis = 0 .

0 = {( , ) : < < , = 0 } ,

1 = {( , ) : < < , > 0, = 0 } ,

( )n [ ]

1 1 n

L( , ) = exp 2 (Xi ) . 2

2 2 i=1

L(0 ) = L( , 0 )

( )n [ ]

1 1 n

= exp 2 (Xi X)2 .

0 2 20 i=1

1 n

= X and 2 = (Xi X)2

n i=1

5.3 Generalized likelihood ratio tests: One-sample case 81

Therefore, ( )n

1 ( n)

L( ) = L( , ) = exp .

2 2

Thus, we have

n

L(0 )

(

2

)n/2 (Xi X)2

n

= = exp i=1 +

L( ) 02 202 2

n n/2 n

i i

(X X)2 (X X)2

n

i=1

= i=1 exp + .

n02 202 2

1 n

not want to be 1). Letting Y = (Xi X)2 ,

n02 i=1

( )

nY n

k Y n/2 exp + k,

2 2

Y exp(Y + 1) k2/n ,

k2/n

Y exp(Y ) .

e

For y > 0 define a function g(y) = yey . Then,

dg(y)

= ey yey = (1 y)ey .

dy

Since

dg(y)

y < 1 >0

dy

and

dg(y)

y > 1 < 0,

dy

g(y) will be small when y is close to zero or very large. Thus we reject the null

hypothesis = 0 when the value of Y (or nY ) is large or small, that is, the rejection

region of our generalized likelihood ratio test has the rejection region:

{nY K1 } {nY K2 }.

82 5 Hypothesis testing

nS2

Note that nY = . In order that the level of significance is , that is,

02

( ) ( 2 )

nS2 nS

P( , ) K1 + P K2 = for = 0 ,

02 02

we should let K1 = 1

2

/2,n1 and K2 = /2,n1 , since

2

( ) ( )

nS2

P( , ) K1 = P n1

2

1

2

/2,n1 =

02 2

and ( ) ( )

nS2

P( , ) K2 = P n1

2

2 /2,n1 =

02 2

for = 0 by using the fact that nY n1

2 from Property 4.1(ii).

From the aforemention example and the similar technique, we can have the fol-

lowing table:

Test H0 H1

Rejection region p-value

{ } ( ( )

nS2 ns2

Two-tailed = 0 = 0 1 /2,n1 2 min P n1 2 ,

2 2

02

{ } ( )0

nS2 ns2 )

/2,n1

2

P n1 2

2

{ 20 0 )

2

} (

nS ns2

Left-tailed = 0 or 0 < 0 2

1 ,n1 P 2

n1

{0 2 02 )

2

} (

nS ns2

Right-tailed = 0 or 0 > 0 2 ,n1 P n12

2

0 2 0

Example 5.15. One important factor in inventory control is the variance of the daily

demand for the product. A manager has developed the optimal order quantity and

reorder point, assuming that the variance is equal to 250. Recently, the company

has experienced some inventory problems, which induced the operations manager

to doubt the assumption. To examine the problem, the manager took a sample of 25

daily demands and found that s2 = 270.58. Do these data provide sufficient evidence

at the 5% significance level to infer that the management scientists assumption

about the variance is wrong?

5.4 Generalized likelihood ratio tests: Two-sample case 83

ns2 25 270.58

= 25.976.

02 250

Since 10.05/2,251

2 12.401 25.976 39.364 0.05/2,251

2 , we do not reject

H0 . There is not sufficient evidence at the 5% significance level to infer that the

management scientists assumption about the variance is wrong.

We can obtain the interval estimation by using the two-tailed hypothesis testing. For

example, consider hypotheses

H0 : = 0 versus = 0 .

{ }

X 0

< z /2 X z /2 < 0 < X + z /2 .

/ n n n

( )

P X z /2 < < X + z /2 = 1 .

n n

[ ]

That is, X z /2 n , X + z /2 n is the 1 confidence interval of .

Similarly, we can find the confidence interval of when the variance is unknown,

and 2 by using the two-tailed hypothesis testing.

In this section, we assume that there are two populations following N(1 , 12 ) and

N(2 , 22 ) respectively. A sample {Xi , i = 1, 2, . . . , n1 } is taken from the popula-

tion N(1 , 12 ) and a sample {Y j , j = 1, 2, . . . , n2 } is taken from the population

N(2 , 22 ). Assume that these two samples are independent (that is, X1 , X2 , . . . , Xn1 ,

Y1 ,Y2 , . . . ,Yn2 are independent).

We first consider the hypothesis testing for 1 2 when 1 and 2 are known.

84 5 Hypothesis testing

Example 5.16. Assume that 1 and 2 are known. Find the generalized likelihood

ratio for testing

H0 : 1 2 = versus H1 : 1 2 = .

0 = {(1 , 2 ) : 1 2 = } ,

1 = {(1 , 2 ) : 1 2 = } ,

= 0 1 = {(1 , 2 ) : < 1 < , < 2 < } .

( )n1 [ ]

1 1 n1

L(1 , 2 ) = exp 2 (Xi 1 )2

1 2 21 i=1

( )n2 [ ]

1 1 n2

exp 2 (Y j 2 )2 .

2 2 22 j=1

On 0 , we have

ln L(1 , 2 ) = ln L(1 , 1 )

1 n1 1 n2

= C

21 i=1

2

(Xi 1 )2 2

22

(Y j 1 + )2 ,

j=1

1 n1 1 n2

ln L(1 , 1 ) = 2 (Xi 1 ) + 2 (Y j 1 + )

1 1 i=1 2 j=1

n1 (X 1 ) n2 (Y 1 + )

= +

12 22

( )

n1 X n2 (Y + ) n1 n2

= 2 + + 1 .

1 22 12 22

n1 X n2 (Y + )

+

12 22

1 = n1 n2 ,

+

12 22

since

5.4 Generalized likelihood ratio tests: Two-sample case 85

1 < 1 ln L(1 , 1 ) > 0,

1

1 > 1 ln L(1 , 1 ) < 0.

1

of 2 is Y , since 1 = X maximizes

( )n1 [ ]

1 1 n1

exp 2 (Xi 1 )2 ,

1 2 21 i=1

and 2 = Y maximizes

( )n2 [ ]

n2

1 1

2 2

exp 2

22

(Y j 2 )

2

.

j=1

L(0 )

=

L( )

[ ]

1 n1 [ ] 1 [

n2 ]

= exp 2 (Xi 1 )2 (Xi X)2 2 (Y j 1 + ) (Y j Y )

2 2

21 i=1 22 j=1

[ ]

n1 (X 1 )2 n2 (Y 1 + )2

= exp

212 222

[ ]

= exp C (X Y )2 ,

n2

(X Y )

2

X 1 = 2 n1 n2 ,

+

12 22

n1

(Y + X)

12

Y 1 + = n1 n2 .

+ 2

1 2

2

{ }

Therefore the rejection region should be |X Y | K .

Under H0 , we have

( 2)

X follows N 1 , 1 ,

( n1

)

Y follows N 1 , 22 ,

n2

86 5 Hypothesis testing

( )

2 2

X Y follows N , 1 + 2 .

n1 n2

|X Y |

z /2 ,

12 22

+

n1 n2

X Y

where the test statistic is .

12 22

+

n1 n2

From the aforemention example and the similar technique, we can have the fol-

lowing table:

p-value

X Y

x y

1 2 = 1 2 =

Two-tailed

z /2 P |Z| 2

2

2

2

n1 + n2 n11 + n22

1 2

X Y xy

Left-tailed 1 2 = 1 2 < z P Z

1 + 2 12 22

2 2

n1 n2 n1+ n2

or 1 2

X Y xy

Right-tailed 1 2 = 1 2 > z P

Z 2

1 + 2 1 22

2 2

n1 n2 n1 + n2

or 1 2

We second consider the hypothesis testing for 1 2 when 1 and 2 are unknown

but equal.

5.4 Generalized likelihood ratio tests: Two-sample case 87

Example 5.17. Assume that 1 and 2 are unknown but equal to . Find the gener-

alized likelihood ratio for testing

H0 : 1 2 = versus H1 : 1 2 = .

0 = {(1 , 2 , ) : 1 2 = , > 0} ,

1 = {(1 , 2 , ) : 1 2 = , > 0} ,

= 0 1 = {(1 , 2 , ) : > 0} .

( )n1 +n2 { [ ]}

n1 n2

1 1

L(1 , 2 , ) = exp 2 (Xi 1 ) + (Y j 2 )

2 2

.

2 2 i=1 j=1

On 0 , we have

ln L(1 , 2 , ) = ln L(1 , 1 , )

[ ]

n1 n2

1

= C (n1 + n2 ) ln 2

2 (Xi 1 ) 2

+ (Y j 1 + )

2

,

i=1 j=1

n1 (X 1 ) + n2 (Y 1 + )

ln L(1 , 1 , ) =

1 2

n1 X + n2 (Y + ) n1 + n2

= 1 .

2 2

This implies that the maximum likelihood estimator of 1 is

n1 X + n2 (Y + )

1 = ,

n1 + n2

which does not depend on , since

1 < 1 ln L(1 , 1 , ) > 0,

1

1 > 1 ln L(1 , 1 , ) < 0.

1

likelihood estimator of . By direct calculation,

88 5 Hypothesis testing

[ ]

n1 + n2 1 n1 n2

ln L( 1 , 1 , ) = + 3 (Xi 1 ) + (Y j 1 + )

2 2

i=1 j=1

[ ]

n1 n2

1

= 3 (n1 + n2 ) + (Xi 1 ) + (Y j 1 + )

2 2 2

i=1 j=1

v [ ]

u

u 1 n1 n2

t

= (Xi 1 ) + (Y j 1 + ) ,

n1 + n2 i=1

2 2

j=1

since

> ln L( 1 , 1 , ) < 0.

Therefore,

n1 + n2

ln L(0 ) = C (n1 + n2 ) ln .

2

On , we have

[ ]

n1 n2

1

ln L(1 , 2 , ) = C (n1 + n2 ) ln 2

2 (Xi 1 ) 2

+ (Y j 2 ) 2

,

i=1 j=1

n1 (X 1 )

ln L(1 , 2 , ) = ,

1 2

n2 (Y 2 )

ln L(1 , 2 , ) = ,

2 2

[ ]

n1 + n2 1 n1 n2

ln L(1 , 2 , ) =

+ 3

(Xi 1 )2 + (Y j 2 )2 .

i=1 j=1

Hence, by following the same routine as before, we can show that the maximum

likelihood estimators are

1 = X,

2 = Y ,

[ ]

n1 n2

1

=

2

n1 + n2 (Xi X) 2

+ (Y j Y ) 2

.

i=1 j=1

Therefore,

n1 + n2

ln L( ) = C (n1 + n2 ) ln .

2

5.4 Generalized likelihood ratio tests: Two-sample case 89

( )(n1 +n2 )/2

L(0 ) (n1 +n2 ) 2

= = (n +n ) = .

L( ) 1 2 2

Note that

n1 n2

(Xi 1 )2 + (Y j 1 + )2

2 i=1 j=1

=

2 n1 n2

(Xi X)2 + (Y j Y )2

i=1 j=1

n1 n2

(Xi X)2 + n1 (X 1 )2 + (Y j Y )2 + n2 (Y 1 + )2

i=1 j=1

= n1 n2

(Xi X)2 + (Y j Y )2

i=1 j=1

n1 (X 1 )2 + n2 (Y 1 + )2

= 1+ n1 n2

(Xi X)2 + (Y j Y )2

i=1 j=1

[ ]2 [ ]2

n2 (X Y ) n1 (Y + X)

n1 + n2

n1 + n2 n1 + n2

= 1+ n1 n2

(Xi X)2 + (Y j Y )2

i=1 j=1

n1 n2

(X Y )2

n1 + n2

= 1+

n1 S12 + n2 S22

(X Y )2

= 1+ ( ) ,

1 1 [ 2 2

]

+ n1 S1 + n2 S2

n1 n2

{ }

where S12 and S22 are the sample variances of {Xi , i = 1, 2, . . . , n1 } and Y j , j = 1, 2, . . . , n2

respectively, Therefore H0 should be rejected when

|X Y |

1 2

is large.

1 2

+ n1 S1 + n2 S2

n1 n2

( ) ( )

2 2

Under H0 , X follows N 1 , and Y follows N 1 , , and thus

( ) n1 n2

2 2

X Y follows N , + , which implies that

n1 n2

90 5 Hypothesis testing

X Y

follows N(0, 1).

1 1

+

n1 n2

n1 S12 n2 S22

Besides, the fact that the two independent random variables and follows

2 2

n1 S12 + n2 S22

n21 1 and n22 1 respectively implies that follows n21 +n2 2 . Therefore,

2

X Y

n11 + n12 X Y

W= / =

n1 S12 + n2 S22 1 1 n1 S12 + n2 S22

(n1 + n2 2) +

2 n1 n2 n1 + n2 2

n1 n2

(Xi X)2 + (Y j Y )2

n1 S12 + n2 S22 i=1 j=1

S2p = = ,

n1 + n2 2 n1 + n2 2

{ } X Y

then the rejection region is |W | t /2,n1 +n2 2 , where the test statistic is W = .

1 1

Sp +

n1 n2

From the aforemention example and the similar technique, we can have the fol-

lowing table:

Test H0 H1 Rejection

region p-value

X Y

xy

Two-tailed 1 2 = 1 2 = t /2,n +n 2 P |tn +n 2 |

S 1

1 2 1 2

p

1

n1 + n2

s p n1 + n1

1 2

X Y xy

Left-tailed 1 2 = 1 2 < t ,n1 +n2 2 P tn1 +n2 2

S 1

+ 1 s 1

+ 1

p n1 n2 p n1 n2

or 1 2

X Y x y

Right-tailed 1 2 = 1 2 > t ,n1 +n2 2 P tn1 +n2 2

S 1

+ 1 s p n11 + n12

p n1 n2

or 1 2

Remark 5.3. S p is called the pooled sample variance, which is an unbiased estimator

of 2 .

Example 5.18. A consumer agency wanted to estimate the difference in the mean

amounts of caffeine in two brands of coffee. The agency took a sample of 15 500-

5.4 Generalized likelihood ratio tests: Two-sample case 91

gramme jars of Brand I coffee that showed the mean amount of caffeine in these

jars to be 80 mg per jar and the standard deviation to be 5 mg. Another sample of

12 500-gramme jars of Brand II coffee gave a mean amount of caffeine equal to

77 mg per jar and a standard deviation of 6 mg. Assuming that the two populations

are normally distributed with equal variances, check at the 5% significance level

whether the mean amount of caffeine in 500-gramme jars is greater for Brand 1

than for Brand 2.

1 and those of Brand II be referred to as population 2.

We consider the hypotheses:

H0 : 1 2 versus H1 : 1 > 2 ,

Note that

n1 = 15, x1 = 80, s1 = 5,

and

n2 = 12, x2 = 77, s2 = 6, = 0.05.

1 1

Hence, x1 x2 = 80 77 = 3, t /2,n1 +n2 2 = t0.025,25 2.060, + 0.3873,

n1 n2

and

n1 s21 + n2 s22 15 52 + 12 62

sp = = 5.4626.

n1 + n2 2 15 + 12 2

x x2 3

w= 1 = 1.42.

s p n11 + n12 5.4626 0.3873

As 1.42 < 2.06, we can not reject H0 . Thus, we conclude that the mean amount of

caffeine in 500-gramme jars is not greater for Brand 1 than for Brand 2 at the 5%

significance level.

to perform tests comparing 1 and 2 .

Example 5.19. Find the generalized likelihood ratio test for hypotheses

H0 : 1 = 2 versus H1 : 1 = 2 .

92 5 Hypothesis testing

Solution. It can be proved (details omitted) that the generalized likelihood ratio is

( )n1 /2

S2

C 12

S2

[ ](n1 +n2 )/2 ,

S12

n1 2 + n2

S2

where C is a constant.

For w > 0 define the function

wn1 /2

G(w) = .

[n1 w + n2 ](n1 +n2 )/2

Then,

n1 n1 + n2

ln G(w) = ln w ln [n1 w + n2 ] ,

2 2

d n1 n1 + n2 n1

ln G(w) =

dw 2w 2 n1 w + n2

n1 n2 (1 w)

= ,

2w [n1 w + n2 ]

which is negative when w > 1 and is positive when w < 1. Therefore, the value

of G(w) will be small when w is very large or very small. Therefore H0 should be

S2

rejected when 12 is large or small.

S2

n1 S12

n1 (n2 1)S12 (n1 1)12

When H0 is true, = follows Fn1 1,n2 1 by Property 4.3.

n2 (n1 1)S22 n2 S22

(n2 1)22

n1 (n2 1)S12

Thus, we let the test statistic be W = , and the rejection region is

n2 (n1 1)S22

Remark 5.4. Recall F ,m,n as the positive real number such that P(X F ,m,n ) =

where X follows Fm,n . Suppose X follows Fm,n . Then 1/X follows Fn,m and

1

F1 ,m,n = ,

F ,n,m

because

5.4 Generalized likelihood ratio tests: Two-sample case 93

( )

1 1

1 = P(F1 ,m,n < X) = P >

F1 ,m,n X

( )

1 1

= = P

F1 ,m,n X

1

= F1 ,m,n = .

F ,n,m

From the aforemention example and the similar technique, we can have the fol-

lowing table:

{ } ( ( )

n1 (n2 1)S12 n1 (n2 1)s21

Two-tailed 1 = 2 1 = 2 F /2,n 1,n 1 2 min P Fn 1,n 1 ,

{ n2 (n1 1)S22 n2 (n1 1)s

2 1 2 1 2 2

} ( ) )2

n1 (n2 1)S1 n1 (n2 1)s12

F1 /2,n1 1,n2 1 P Fn1 1,n2 1

{ n 2 (n1 1)S 2

2 } ( n2 (n1 1)s2 )

2

Left-tailed 1 = 2 1 < 2 F1 ,n1 1,n2 1 P Fn1 1,n2 1

n2 (n1 1)S22 n2 (n1 1)s22

or 1 2

{ } ( )

n1 (n2 1)S12 n1 (n2 1)s21

Right-tailed 1 = 2 1 > 2 F ,n1 1,n2 1 P Fn1 1,n2 1

n2 (n1 1)S22 n2 (n1 1)s22

or 1 2

Example 5.20. A study involves the number of absences per year among union and

non-union workers. A sample of 16 union workers has a sample standard deviation

of 3.0 days. A sample of 10 non-union workers has a sample standard deviation

of 2.5 days. At the 10% significance level, can we conclude that the variance of

the number of days absent for union workers is different from that for nonunion

workers?

Solution. Let all union workers be referred to as population 1 and all non-union

workers be referred to as population 2.

We consider the hypotheses:

H0 : 1 = 2 versus H1 : 1 = 2 ,

Note that n1 = 16, s1 = 3, n2 = 10, and s2 = 2.5. Hence, the value of the test

statistic is

n1 (n2 1) s21 3.02

= 0.96 = 1.3824.

n2 (n1 1) s22 2.52

Since

1

< 1 < 1.3824 < 3.006 f0.05,15,9 ,

f0.05,9,15

we cannot reject H0 . Thus we conclude that the data do not indicate that the variance

of the number of days absent for union workers is different from that for non-union

workers at the 10% significance level.

94 5 Hypothesis testing

H0 : 1 = 2 versus H1 : 1 = 2

Unfortunately, the likelihood ratio method does not always produce a test statistic

with a known probability distribution. Nevertheless, if the sample size is large, we

can obtain an approximation to the distribution of a generalized likelihood ratio.

versus

H1 : i = i,0 for at least one i = 1, 2, . . . , d

and that is the generalized likelihood ratio. Then, under very general conditions,

when H0 is true,

2 ln d d2 as n .

[number of parameters to estimate when determining L(0 )] .

against

H1 : i = i,0 for at least one i = 1, 2, . . . , m,

then d = m.

P(X = ai ) = pi , i = 1, 2, . . . , m,

5.5 Generalized likelihood ratio tests: Large samples 95

p1 + p2 + + pm = 1.

lation. For i = 1, 2, . . . , m, let Yi be the number of k such that Xk = ai . Then

n!

P(Yi = yi , i = 1, 2, . . . , m) = py1 py2 pymm

y1 !y2 ! ym ! 1 2

for non-negative integers yi , i = 1, 2, . . . , m, such that y1 + y2 + + ym = n.

We want to test

versus

H1 : pi = pi,0 for at least one i = 1, 2, . . . , m,

where pi,0 > 0 for i = 1, 2, . . . , m and

versus

H1 : pi = pi,0 for at least one i = 1, 2, . . . , m 1,

where pi,0 > 0 for i = 1, 2, . . . , m 1 and

( )( )nm1 Yi

m1 m1 i=1

n! pYi i 1 pi

i=1 i=1

L(p1 , p2 , . . . , pm1 ) = .

Y1 !Y2 ! Ym1 !(n Y1 Y2 Ym1 )!

Then,

( ) ( )

m1 m1 m1

ln L(p1 , p2 , . . . , pm1 ) = lnC + Yi ln pi + n Yi ln 1 pi ,

i=1 i=1 i=1

m1

ln L(p1 , p2 , . . . , pm1 ) Yi

n Yi

i=1

= , i = 1, 2, . . . , m 1,

pi pi m1

1 pi

i=1

96 5 Hypothesis testing

Denote the value of pi such that

ln L(p1 , p2 , . . . , pm1 )

=0

pi

as pi , i = 1, 2, . . . , m 1. Then,

( )

m1 m1

n Yi Y1 +Y2 + +Ym1 + n Yi

Y1 Y2 Ym1 Yi i=1

= = = = i=1

= ( ) =n

p1 p2 pm1 m1 pi m1

1 pi p1 + p2 + + pm1 + 1 pi

i=1 i=1

Yi

pi = for i = 1, 2, . . . , m 1.

n

Since

ln L(p1 , p2 , . . . , pm1 )

pi < pi > 0,

pi

ln L(p1 , p2 , . . . , pm1 )

pi > pi < 0,

pi

and each of these pi s does not depend on the others, the maximum of L(p1 , p2 , . . . , pm1 )

is thus L( p1 , p2 , . . . , pm1 ).

Therefore the generalized likelihood ratio is

=

L( p1 , p2 , . . . , pm1 )

( )nm1 Yi n(1m1

m1 i=1 m1 i=1 pi )

pi,0

Ym1

pY1,0 )n pi 1 pi,0

1 Y2

p2,0 pm1,0 1

m1 (

pi,0

i=1 i=1

= ( )nm1 Yi = .

pi m1

1 pi

m1 i=1 i=1

pi

Y

pY11 pY22 pm1

m1

1

i=1

i=1

From ( )

d x x 1 x

x ln = ln + x = ln + 1

dx x0 x0 x x0

and ( ) ( )

d2 x d x 1

x ln = ln = ,

dx2 x0 dx x0 x

we obtain

5.5 Generalized likelihood ratio tests: Large samples 97

x (x x0 )2

x ln = (x x0 ) + + ,

x0 2x0

and therefore,

x (x x0 )2

x ln (x x0 ) + when x x0 .

x0 2x0

Hence, when n ,

m1

1 pi

( ) ( )

m1 m1

pi

2 ln = 2n pi ln

pi,0

+ 2n 1 pi ln

i=1

m1

1 pi,0

i=1 i=1

i=1

[ ] [( ) ( )]

2

m1

( pi pi,0 ) m1 m1

2n ( pi pi,0 ) +

2pi,0

+ 2n 1 pi 1 pi,0

i=1 i=1 i=1

[( ) ( )]2

m1 m1

1 pi 1 pi,0

i=1 i=1

+2n ( )

m1

2 1 pi,0

i=1

m1 m1

(n pi npi,0 )2 m1

= 2n ( pi pi,0 ) + npi,0

+ 2n (pi,0 pi )

i=1 i=1 i=1

[ ( ) ( )]2

m1 m1

n 1 pi n 1 pi,0

i=1 i=1

+ ( )

m1

n 1 pi,0

i=1

[ ( ) ( )]2

m1 m1

n 1 pi n 1 pi,0

m1

(n pi npi,0 )2

i=1 i=1

= + ( )

npi,0 m1

pi,0

i=1

n 1

i=1

2 by Theorem 5.2.

If we let (O for observed frequency and E for expected frequency when H0 is

true)

Oi = Yi = n pi for i = 1, 2, . . . , m 1,

( )

m1 m1

Om = n Yi = n 1 pi ,

i=1 i=1

98 5 Hypothesis testing

Ei = npi,0 for i = 1, 2, . . . , m 1,

( )

m1

Em = n 1 pi,0 ,

i=1

then,

m

(Oi Ei )2 m

O2 m m m

O2

2 ln Ei

= i 2 Oi + Ei = i 2n + n

i=1 i=1 Ei i=1 i=1 i=1 Ei

m

O2

= Eii n.

i=1

{ }

(Oi Ei )2

m

Note that {2 ln K} is the rejection region, and ,m1

2

i=1 Ei

can serve as an approximate rejection region. Since this is only an approximate

result, it is suggested that all expected frequencies should be no less than 5, so that

the sample is large enough. To meet this rule, some categories may be combined

when to do so is logical.

Example 5.21. A journal reported that, in a bag of m&ms chocolate peanut candies,

there are 30% brown, 30% yellow, 10% blue, 10% red, 10% green and 10% orange

candies. Suppose you purchase a bag of m&ms chocolate peanut candies at a nearby

store and find 17 brown, 20 yellow, 13 blue, 7 red, 6 green and 9 orange candies, for

a total of 72 candies. At the 0.1 level of significance, does the bag purchased agree

with the distribution suggested by the journal?

H0 : the bag purchased agrees with the distribution suggested by the journal,

versus

H1 : the bag purchased does not agree with the distribution suggested by the journal.

Then we have the table below, in which all expected frequencies are at least 5.

Colour Oi Ei Oi Ei

Brown 17 72 30% = 21.6 -4.6

Yellow 20 72 30% = 21.6 -1.6

Blue 13 72 10% = 7.2 5.8

Red 7 72 10% = 7.2 -0.2

Green 6 72 10% = 7.2 -1.2

Orange 9 72 10% = 7.2 1.8

Total 72 72 0

5.5 Generalized likelihood ratio tests: Large samples 99

6

O2

2 ln Eii n

i=1

172 + 202132 + 72 + 62 + 92

= + 72

21.6 7.2

6.426 < 9.236 0.1,61

2

.

Alternatively,

6

(Oi Ei )2

2 ln Ei

i=1

(4.6)2 (1.6)2 5.82 (0.2)2 (1.2)2 1.82

= + + + + +

21.6 21.6 7.2 7.2 7.2 7.2

6.426 < 9.236 0.1,61

2

.

Hence we should not reject H0 . At the significance level 10%, we cannot conclude

that the bag purchased does not agree with the distribution suggested by the journal.

Example 5.22. A traffic engineer wishes to study whether drivers have a preference

for certain tollbooths at a bridge during non-rush hours. The number of automobiles

passing through each tollbooth lane was counted during a randomly selected 15-

minute interval. The sample information is as follows.

Tollbooth Lane 1 2 3 4 5 Total

Number of Cars observed 171 224 211 180 214 100

Can we conclude that there are differences in the numbers of cars selecting respec-

tively each of the lanes? Test at the 5% significance level.

versus

All the five expected frequencies equal 1000 5 = 200, which is not less than 5.

Therefore, as the sample is large enough,

5

O2i

2 ln n

i=1 Ei

1712 + 2242 + 2112 + 1802 + 2142

= 1000

200

10.67 9.488 0.05,51

2

.

100 5 Hypothesis testing

Hence, H0 should be rejected. At the significance level 5%, we can conclude that

there are differences in the numbers of cars selecting respectively each of the lanes.

usually are interested in testing whether some family of distributions seems appro-

priate and are not interested in the lack of fit due to the wrong parameter values.

Suppose we want to test

For calculating Ei s, we have to use the maximum likelihood estimate of the un-

known parameters. Then the rejection region is {2 ln K} or, approximately,

{ }

m

(Oi Ei )2

Ei ,m1k .2

i=1

Consider the following joint distribution of two discrete random variables X and Y :

Value of Y

Probability Row sum

b1 bj bc

a1 p1,1 p1, j p1,c p1.

Value of X ai pi,1 pi, j pi,c pi.

ar pr,1 pr, j pr,c pr.

Column sum p.1 p. j p.c 1

We want to test

H0 : X and Y are independent

versus

H1 : X and Y are not independent.

That is, we want to test

versus

where i = 1, 2, . . . , r 1 and j = 1, 2, . . . , c 1.

5.5 Generalized likelihood ratio tests: Large samples 101

dent vectors, or ordered pairs of random variables, (X1 ,Y1 ), (X2 ,Y2 ), . . ., (Xn ,Yn )

each following this distribution. From such a sample we obtain the following table,

where Oi, j (called the observed frequency of the (i, j)-th cell) is the number of k

such that Xk = ai and Yk = b j , i = 1, 2, . . . , r, j = 1, 2, . . . , c. A box containing an ob-

served frequency is called a cell. Such a two-way classification table is also called a

contingency table or cross-tabulation. Ours is an r c contingency table.

Value of Y

Observed frequency Row sum

b1 bj bc

a1 O1,1 O1, j O1,c n1.

Value of X ai Oi,1 Oi, j Oi,c ni.

ar Or,1 Or, j Or,c nr.

Column sum n.1 n. j n.c n

Let be the generalized likelihood ratio. Then it can be proved that

$$-2\ln\lambda \approx \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{O_{i,j}^2}{E_{i,j}} - n = n\left( \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{O_{i,j}^2}{n_{i\cdot}\, n_{\cdot j}} - 1 \right),$$

where $E_{i,j} = \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}$ is the expected frequency corresponding to $O_{i,j}$ when H0 is true,

i = 1, 2, . . . , r and j = 1, 2, . . . , c. The rejection region is approximately

$$\left\{ \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}} \geq \chi^2_{\alpha,\,(r-1)(c-1)} \right\},$$

where the number of degrees of freedom equals

[number of parameters to estimate when maximizing the likelihood without restriction]
− [number of parameters to estimate when maximizing the likelihood under H0]
= (rc − 1) − [(r − 1) + (c − 1)]
= rc − r − c + 1
= (r − 1)(c − 1).

As in the previous sections, we require each expected frequency to be at least 5.
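The expected frequencies and the test statistic follow directly from the row and column sums. Below is a minimal Python sketch on a hypothetical 2 × 3 table of observed counts (the numbers are made up purely for illustration).

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 contingency table of observed counts
O = np.array([[20, 30, 10],
              [25, 10, 5]])

row_sums = O.sum(axis=1)          # n_i.
col_sums = O.sum(axis=0)          # n_.j
n = O.sum()

# Expected frequencies under independence: E[i, j] = n_i. * n_.j / n
E = np.outer(row_sums, col_sums) / n

stat = ((O - E) ** 2 / E).sum()
df = (O.shape[0] - 1) * (O.shape[1] - 1)
critical = stats.chi2.ppf(0.95, df)
print(E)
print(stat, critical, stat >= critical)
```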

Example 5.23. Suppose we draw a sample of 360 students and obtain the following information. At the 0.01 level of significance, test whether a student's ability in mathematics is independent of the student's interest in statistics.


                                    Ability in Math
                                  Low   Average   High    Sum
                         Low       63      42      15     120
Interest in Statistics   Average   58      61      31     150
                         High      14      47      29      90
                         Sum      135     150      75     360

Solution. Consider the hypotheses

H0 : ability in mathematics and interest in statistics are independent

versus

H1 : ability in mathematics and interest in statistics are not independent (are related).

The table below shows the expected frequencies (where, for example, 45 = 120 × 135 / 360 and 50 = 120 × 150 / 360).

                                    Ability in Math
                                  Low     Average    High     Sum
                         Low       45       50        25      120
Interest in Statistics   Average  56.25     62.5     31.25    150
                         High     33.75     37.5     18.75     90
                         Sum      135      150        75      360

All expected frequencies are at least 5. Therefore, as the sample is large enough,

$$n\left( \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{O_{i,j}^2}{n_{i\cdot}\, n_{\cdot j}} - 1 \right) = 360\left( \frac{63^2}{120 \times 135} + \frac{42^2}{120 \times 150} + \cdots + \frac{29^2}{90 \times 75} - 1 \right) \approx 32.140 \geq 13.277 \approx \chi^2_{0.01,\,(3-1)(3-1)}.$$

Hence, at the significance level 1%, we reject H0 and conclude that there is a relationship between a student's ability in mathematics and the student's interest in statistics.

Alternatively, the value of the test statistic equals

$$\sum_{i=1}^{3}\sum_{j=1}^{3} \frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}} = \frac{(63-45)^2}{45} + \frac{(42-50)^2}{50} + \cdots + \frac{(29-18.75)^2}{18.75} \approx 32.140 \geq 13.277 \approx \chi^2_{0.01,\,(3-1)(3-1)}.$$
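These figures are easy to reproduce with scipy's contingency-table routine. The sketch below feeds in the observed table from Example 5.23; passing correction=False asks for the plain Pearson statistic (no continuity correction), matching the computation above.

```python
import numpy as np
from scipy import stats

# Observed counts from Example 5.23
# rows: interest in statistics (Low, Average, High)
# columns: ability in math (Low, Average, High)
O = np.array([[63, 42, 15],
              [58, 61, 31],
              [14, 47, 29]])

stat, p_value, df, expected = stats.chi2_contingency(O, correction=False)
critical = stats.chi2.ppf(0.99, df)
print(stat, df, critical)     # about 32.14, 4, about 13.277
print(expected)               # matches the expected-frequency table above
```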
