
UNIVERSITY OF CAPE COAST

DEPARTMENT OF STATISTICS
STA 403:
STATISTICAL METHODS II

delivered by

PROF. NATHANIEL HOWARD


COURSE OUTLINE
Further methods for discrete data: examples and formulation - binomial,
multinomial and Poisson distributions. Comparison of two binomials;
McNemar’s test for matched pairs; theory and transformations of
variables; multiple linear regression; selection of variables; use of dummy
variables. Introduction to logistic regression and generalized linear
modeling. Non-parametric methods. Use of least squares principle;
estimation of contrasts, two-way crossed classified data.
Pre-requisite: STA 303
(Statistical methods I)
Recommended Literature
1. Ott, R.L. (1992). An Introduction to Statistical Methods and Data
Analysis; 4th Ed.; Duxbury Press, Belmont, California, USA.
2. Freund, J.E. (1992). Mathematical Statistics; 5th Ed.; Prentice-Hall
Int. Inc., London, UK.
3. Milton, J.S., Corbet, J.J. and McTeer, P.M. (1986). Introduction to
Statistics. D.C. Heath & Co., Toronto, Canada.
4. Gordor, B. K. and Howard, N. K. (2006). Introduction to Statistical
Methods; Ghana Mathematical Group.
5. Wetherill, G.B. (1981). Intermediate Statistical Methods. Chapman
and Hall, London, UK.
Quiz Schedules
Quiz 1: (18th January, 2024)
[20%]
Quiz 2: (End of week 9)
[20%]
Schedules of Lectures
Tuesday: 12:30 – 1:30pm (LT 20)
Tuesday: 1:30 – 2:30pm (LT 20)
Wednesday: 4:30 – 5:30pm (LT 21)
Thursday: 8:30 – 9:30pm (PGR)
Thursday: 9:30 – 10:30pm (PGR)
Thursday: 10:30 – 11:30pm (PGR)
COURSE OBJECTIVES
By the end of this course, students should be able to:
• Perform hypothesis tests for quantitative and qualitative data sets.
• Perform and interpret correlation analysis.
• Perform and interpret simple and multiple linear regression analyses.
• Describe logistic regression models and solve problems involving them.
• Describe GLMs and solve problems involving them.
• Perform non-parametric tests.

HYPOTHESES AND TEST PROCEDURES

What is a Hypothesis?
A statistical hypothesis is a statement or assertion about the value of a
population parameter, which may or may not be true.
Example
The mean age $\mu$ of Master of Public Health students in the Department
of Physician Assistantship equals 24 years. That is, $\mu = 24$.

Types of Hypotheses

There are two types of hypotheses, namely the null hypothesis and the
alternative hypothesis.
The main hypothesis which we wish to test is called the null
hypothesis. It is denoted by $H_0$.
The hypothesis that will be accepted when $H_0$ is rejected is called
the alternative hypothesis. It is an assertion that contradicts the null
hypothesis. It is denoted by $H_1$.
Classification of Hypothesis Tests

In any hypothesis testing problem, the null and alternative
hypotheses may further be classified as simple or composite,
and the test as one-sided or two-sided, depending on how the test is
set up.

Simple and Composite Tests

• If $\theta$ can take on only a single value under each hypothesis, then
  both the null and alternative hypotheses are called simple hypotheses.
  Example: $H_0: \theta = 0$ or $H_1: \theta = 20$.

• If $\theta$ can take on multiple values under each hypothesis, then
  both the null and alternative hypotheses are called composite hypotheses.
  Example: $H_0: \theta > 100$ or $H_1: \theta \in \{10, 100\}$.
One-sided and two-sided tests

When both the null and alternative hypotheses are composite and each
represents one side of the parameter space around some value $\theta_0$,
the test is said to be a one-sided test. One-sided tests are also called
one-tailed tests.
Example
$H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$

When the null hypothesis is simple and the alternative hypothesis
represents the rest of the parameter space, the test is said to be a
two-sided test. Two-sided tests are also called two-tailed tests.
Example
$H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$

Errors in Hypothesis Testing
The purpose of hypothesis testing is to determine whether the evidence on
the basis of available data tends to refute H 0 . Since H 0 can either be true
or false and at the end of the experiment we can reject or fail to reject H 0 ,
there are four possible decisions that we can make.
Hypothesis Testing Decisions
1. Reject $H_0$ when it is true (wrong decision – Type I error).
2. Reject $H_0$ when it is false (correct decision).
3. Fail to reject $H_0$ when it is true (correct decision).
4. Fail to reject $H_0$ when it is false (wrong decision – Type II error).
Decision table for hypothesis testing

                         $H_0$ is true       $H_0$ is false
Reject $H_0$             Type I error        correct decision
Fail to reject $H_0$     correct decision    Type II error

Hypothesis testing involves the use of sample data to decide whether the
null hypothesis should be rejected or not. The decision to reject the null
hypothesis or not is based on the value of a test statistic. A test statistic is
an estimator whose value is calculated from sample data. Its distribution is
known under the assumption that the null hypothesis is true.
If there is a large difference between what is expected under the null
hypothesis and what is observed in a sample, then the null hypothesis is
rejected; and the result is said to be statistically significant. If, on the
other hand, the difference between what is expected and what is observed
is small, then there is not enough evidence to reject the null hypothesis;
and the result is said to be not statistically significant.

There are two approaches to determining whether to reject the null
hypothesis or not. The first involves determining the rejection (or
critical) region of the test. The rejection region is the set of values
of the test statistic for which we reject $H_0$. It is obtained by
using a pre-determined level of significance (or size of the test). The level
of significance, denoted by $\alpha$, is the probability of committing a
Type I error. The levels of significance often used in the literature include
$\alpha = 1\%$ (or 0.01), $\alpha = 5\%$ (or 0.05) and $\alpha = 10\%$ (or 0.10).

The second approach involves calculating the p-value of the test. The
p-value is the probability, computed under the null hypothesis, of
observing a test statistic at least as extreme as the one actually
observed. The null hypothesis is rejected for “small” p-values (usually
for $p < 0.05$). Generally, the null hypothesis is rejected at the level
of significance $\alpha$ if $p < \alpha$. For values of $p \geq \alpha$,
there is not enough evidence to reject the null hypothesis.
We shall limit ourselves to the first approach in this module.

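Although the module works with critical regions, the p-value rule described above can be illustrated in a few lines. The sketch below is only an illustration: the function name `z_p_value`, the `alternative` labels and the observed value 2.0 are assumptions made for this example and do not come from the notes. It uses the standard normal distribution from Python's standard library.

```python
from statistics import NormalDist

def z_p_value(z, alternative="two-sided"):
    """Return the p-value for an observed standard normal statistic z."""
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "less":      # H1 points to the lower tail
        return std_normal.cdf(z)
    if alternative == "greater":   # H1 points to the upper tail
        return 1 - std_normal.cdf(z)
    # two-sided H1: double the tail area beyond |z|
    return 2 * (1 - std_normal.cdf(abs(z)))

alpha = 0.05
p = z_p_value(2.0)       # illustrative observed value, not taken from the notes
reject_h0 = p < alpha    # reject H0 for "small" p-values
print(round(p, 4), reject_h0)
```

The same decision would be reached with the critical-region approach, since comparing $p$ with $\alpha$ is equivalent to comparing the statistic with the corresponding table value.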
In general, hypothesis testing in statistics involves the following steps:
Step 1: State the hypothesis that is to be questioned ($H_0$).
Step 2: State an alternative hypothesis which will be accepted if the null
hypothesis is rejected ($H_1$).
Step 3: Select the decision rule about when to reject $H_0$ and when to fail to
reject it.
Step 4: Evaluate the appropriate test statistic using sample data from the
population of interest.
Step 5: Make your decision according to the rule selected in Step 3.

TESTS CONCERNING A SINGLE POPULATION MEAN
Suppose that we wish to test the null hypothesis that the mean $\mu$ of a
normal population with variance $\sigma^2$ equals a specific value, $\mu_0$.

That is, we wish to test the null hypothesis $H_0: \mu = \mu_0$ against any
of the three alternatives $H_1: \mu < \mu_0$, $H_1: \mu > \mu_0$ or
$H_1: \mu \neq \mu_0$. We then need to perform one of the tests in the
table below, based on a random sample of size $n$ from this population.

Null hypothesis $H_0: \mu = \mu_0$ against various alternatives

(a)                     (b)                     (c)
$H_0: \mu = \mu_0$      $H_0: \mu = \mu_0$      $H_0: \mu = \mu_0$
$H_1: \mu < \mu_0$      $H_1: \mu > \mu_0$      $H_1: \mu \neq \mu_0$

Critical regions for testing $H_0$ are shown below.

$H_1$                  Reject $H_0$ if
$\mu < \mu_0$          $z < -z_{\alpha}$ (lower-tailed test)
$\mu > \mu_0$          $z > z_{\alpha}$ (upper-tailed test)
$\mu \neq \mu_0$       $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ (two-tailed test)

• $-z_{\alpha}$ is the value of $z$ that leaves an area of $\alpha$ to its
  left; $z_{\alpha}$ is the value of $z$ that leaves an area of $\alpha$ to
  its right.
• $-z_{\alpha/2}$ leaves an area of $\alpha/2$ to its left and
  $z_{\alpha/2}$ leaves an area of $\alpha/2$ to its right.
For the test of means from a single population, there are three possible
scenarios to be considered:
1. Tests for means from a single normal population with a known $\sigma$.
2. Tests for means from a single population with unknown $\sigma$ but large
   sample size.
3. Tests for means from a single population with unknown $\sigma$ but small
   sample size.

Test for means from a single normal population with a known $\sigma$

If the population we are sampling is normal and $\sigma$ is known, then the
test statistic is given by
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Example

The scores for some students in an examination have been normally
distributed for some time with mean 200 and standard deviation 25.
Currently some lecturers think that the performance has changed. To support
this claim, the scores of 100 students were sampled and their mean was found
to be 212.
(a) Set up the null and alternative hypotheses for the test.
(b) Will you agree with the lecturers’ claim at the 5% significance level?

Solution
(a) The null and alternative hypotheses are:
$H_0: \mu = 200$
$H_1: \mu \neq 200$

(b) Substituting $\bar{x} = 212$, $\mu_0 = 200$, $\sigma = 25$ and $n = 100$
into the formula gives
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{212 - 200}{25/\sqrt{100}} = 4.8$$
Note that $\alpha = 0.05$. Because the test is two-tailed, we find
$z_{\alpha/2} = z_{0.05/2} = z_{0.025}$. From the statistical table,
$z_{0.025} = 1.96$.

Since $z_{cal} = 4.8$ is greater than $z_{0.025} = 1.96$, we reject $H_0$ and
agree with the lecturers that the performance of the students has changed.

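The arithmetic of this example can be checked with a short sketch. The function name `one_sample_z` is an assumption made for illustration; the critical value is obtained from the standard normal quantile function rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(xbar, mu0, sigma, n):
    """z statistic for a single mean when sigma is known."""
    return (xbar - mu0) / (sigma / sqrt(n))

# Values from the examination-score example above
z = one_sample_z(xbar=212, mu0=200, sigma=25, n=100)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{0.025}, about 1.96
reject_h0 = abs(z) > z_crit                   # two-tailed critical region

print(round(z, 2), round(z_crit, 2), reject_h0)
```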
Test for means from a single population with unknown $\sigma$ but
large sample size

If the population standard deviation $\sigma$ is unknown but the sample size
is large ($n \geq 30$), the test statistic becomes
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},$$
where $s$ is the standard deviation of the sample data.

Example
A random sample of $n = 100$ observations taken from a population
with mean $\mu$ yielded the sample mean $\bar{x} = 18.9$ and sample standard
deviation $s = 12.6$. If the hypotheses are
$H_0: \mu = 16$ and $H_1: \mu \neq 16$,

(a) Calculate the value of the appropriate test statistic for this test;
(b) Hence determine whether $H_0$ should be rejected at the 1% level of
significance.

Solution

(a) In this problem, $\sigma$ is unknown, hence we replace it with $s$.
Therefore substituting $\bar{x} = 18.9$, $\mu_0 = 16$, $s = 12.6$ and
$n = 100$ in the formula gives
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{18.9 - 16}{12.6/\sqrt{100}} = 2.30$$

(b) Now we have $z_{\alpha/2} = z_{0.01/2} = z_{0.005}$. From the z-tables,
$z_{0.005} = 2.575$. Since $z_{cal} = 2.30$ is neither greater
than 2.575 nor less than $-2.575$, we cannot reject the null
hypothesis at the 1% level of significance.

Test for means from a single population with unknown $\sigma$ but
small sample size

If the population standard deviation $\sigma$ is unknown but the sample
size is small ($n < 30$), the test statistic becomes
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},$$
which has an approximate t-distribution with $n - 1$ degrees of
freedom.

The critical regions for such tests are shown below.

$H_1$                  Reject $H_0$ if
$\mu < \mu_0$          $t < -t_{\alpha}$
$\mu > \mu_0$          $t > t_{\alpha}$
$\mu \neq \mu_0$       $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$

Example

The manufacturer of a new fiberglass tire claims that its average life will
be at least 40,000 miles. To verify this claim, a sample of 12 tires is
selected, with their lifetimes (in 1000s of miles) as follows:
Tire   1     2     3     4     5     6     7     8     9     10    11    12
Life   36.1  40.2  33.8  38.5  42.0  35.8  37.0  41.0  36.8  37.2  33.0  36.0

Test the manufacturer’s claim at the 5% level of significance.

Solution
We wish to test $H_0: \mu \geq 40$ against $H_1: \mu < 40$.
Since $\sigma$ is unknown and the sample size $n = 12$ is small, we have to
use the t-distribution with $n - 1 = 11$ degrees of freedom. From the data,
$\bar{x} = 37.283$, $\mu_0 = 40$, $s = 2.732$ and $n = 12$.
Therefore,
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{37.283 - 40}{2.732/\sqrt{12}} = -3.445$$
From the one-tailed t-table, $t_{0.05}(11) = 1.796$.

Since $t = -3.445$ is less than $-t_{0.05}(11) = -1.796$, we reject $H_0$
and conclude that the average life of the new fiberglass tire is less than
40,000 miles.

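The sample mean, sample standard deviation and t statistic quoted in this solution can be reproduced from the raw tire data. This is a minimal sketch using only the standard library; the critical value 1.796 is the table value quoted in the notes (the standard library has no t-distribution quantile function), and the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Tire lifetimes in thousands of miles, from the example above
lifetimes = [36.1, 40.2, 33.8, 38.5, 42.0, 35.8,
             37.0, 41.0, 36.8, 37.2, 33.0, 36.0]
mu0 = 40.0

n = len(lifetimes)
xbar = mean(lifetimes)
s = stdev(lifetimes)               # sample standard deviation (divisor n - 1)
t = (xbar - mu0) / (s / sqrt(n))

t_crit = 1.796                     # t_{0.05} with 11 df, taken from the t-table
reject_h0 = t < -t_crit            # lower-tailed critical region

print(round(xbar, 3), round(s, 3), round(t, 3), reject_h0)
```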
TEST CONCERNING A POPULATION PROPORTION
(LARGE SAMPLE)

Suppose that we wish to test the null hypothesis $H_0: p = p_0$
against any of the alternatives
$H_1: p < p_0$ or $H_1: p > p_0$ or $H_1: p \neq p_0$.

If $n$ is large and $H_0$ is true, then the test statistic is given by
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}},$$
where $\hat{p}$ is the sample proportion of the characteristic of interest.

The critical regions for testing are shown below.

$H_1$                   Reject $H_0$ if
$H_1: p < p_0$          $z < -z_{\alpha}$
$H_1: p > p_0$          $z > z_{\alpha}$
$H_1: p \neq p_0$       $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

Example

An oil company claims that less than 20% of all car owners have not tried its
gasoline. Test the claim at the 0.01 level of significance, if a random check
reveals that 22 out of 200 car owners have not tried the company’s gasoline.

Solution
We wish to test $H_0: p = 0.20$ against $H_1: p < 0.20$.
Here, $p_0 = 0.20$, the number of successes is $x = 22$, and the
sample size is $n = 200$.
Thus,
$$\hat{p} = \frac{22}{200} = 0.11.$$
Substituting these into the formula gives:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.11 - 0.20}{\sqrt{0.20(1 - 0.20)/200}} = \frac{-0.09}{0.0283} = -3.1802$$
From the z-table, $z_{0.01} = 2.33$. Therefore the rejection region is
$z < -2.33$. Since $z = -3.1802 < -2.33$, we reject the null
hypothesis. We therefore conclude that less than 20% of all car owners
have not tried the company’s gasoline.

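As a check on the large-sample proportion test, the sketch below recomputes the statistic for the gasoline example. The function name `one_proportion_z` is an assumption for illustration. Carrying full precision in the standard error gives about $-3.182$, which differs from the hand calculation only because the latter rounds the denominator to 0.0283.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """Large-sample z statistic for H0: p = p0."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = one_proportion_z(x=22, n=200, p0=0.20)

alpha = 0.01
z_crit = NormalDist().inv_cdf(alpha)  # lower-tail cut-off, about -2.33
reject_h0 = z < z_crit

print(round(z, 3), round(z_crit, 2), reject_h0)
```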
TEST CONCERNING A SINGLE POPULATION
PROPORTION (SMALL SAMPLES)

Suppose that we wish to test the null hypothesis $H_0: p = p_0$ against
$H_1: p < p_0$ at the $\alpha$ significance level. Then the critical region
is $x \leq k_{\alpha}$, where $x$ is the number of observed successes and
$k_{\alpha}$ is the largest integer for which
$$\sum_{k=0}^{k_{\alpha}} B(k; n, p_0) \leq \alpha,$$
and $B(k; n, p_0)$ is the probability of observing $k$ successes in $n$
binomial trials when $p = p_0$.

If the alternative hypothesis was $H_1: p > p_0$, the critical region would be
$x \geq k^*_{\alpha}$, where $k^*_{\alpha}$ is the smallest integer for which
$$\sum_{k=k^*_{\alpha}}^{n} B(k; n, p_0) \leq \alpha.$$

Similarly, if the alternative hypothesis was $H_1: p \neq p_0$, the
critical region would be $x \leq k_{\alpha/2}$ or $x \geq k^*_{\alpha/2}$,
where $k_{\alpha/2}$ is the largest integer for which
$$\sum_{k=0}^{k_{\alpha/2}} B(k; n, p_0) \leq \frac{\alpha}{2},$$
and $k^*_{\alpha/2}$ is the smallest integer for which
$$\sum_{k=k^*_{\alpha/2}}^{n} B(k; n, p_0) \leq \frac{\alpha}{2}.$$
Example
It is claimed that 40% of patients that attend a certain clinic on
any day are smokers. Suppose that on a particular day, 3 out of
a sample of 13 patients attending the clinic were found to be smokers.
Test the hypothesis $H_0: p = 0.40$ against $H_1: p \neq 0.40$
at the 5% significance level.

Solution
In this problem, $x = 3$ and $n = 13$. Since $\alpha = 0.05$,
$k_{\alpha/2} = k_{0.025}$.
Now, from binomial tables,
$$\sum_{k=0}^{1} B(k; 13, 0.40) = B(0; 13, 0.40) + B(1; 13, 0.40) = 0.0013 + 0.0113 = 0.0126$$
Thus, $\sum_{k=0}^{1} B(k; 13, 0.40) = 0.0126 < 0.025$, whereas including
$B(2; 13, 0.40) = 0.0453$ would take the sum above 0.025. This implies that
the largest integer $k_{0.025}$ for which
$\sum_{k=0}^{k_{0.025}} B(k; 13, 0.40) \leq 0.025$ is 1.

Similarly, the smallest integer $k^*_{0.025}$ for which
$\sum_{k=k^*_{0.025}}^{13} B(k; 13, 0.40) \leq 0.025$ is 10. That is,
$$\sum_{k=10}^{13} B(k; 13, 0.40) = B(10; 13, 0.40) + \cdots + B(13; 13, 0.40) = 0.0065 + 0.0012 + 0.0001 + 0.0000 = 0.0078$$
To be able to reject the null hypothesis, the number of
successes, $x$, must be either less than or equal to 1, or greater than or
equal to 10.
Since $x = 3$ is neither less than or equal to 1 nor greater than or equal
to 10, we cannot reject the null hypothesis. We therefore conclude that the
data are consistent with the claim that 40% of patients that attend the
clinic on any day are smokers.

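The table look-ups in this solution can be reproduced by summing binomial probabilities directly. The sketch below is an illustration only: the helper names `binom_pmf`, `lower_cutoff` and `upper_cutoff` are assumptions, not part of the notes. Each helper accumulates tail probability from the relevant end and stops as soon as the bound $\alpha/2$ is exceeded.

```python
from math import comb

def binom_pmf(k, n, p):
    """B(k; n, p): probability of k successes in n binomial trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_cutoff(n, p0, tail):
    """Largest k with P(X <= k) <= tail, or None if no such k exists."""
    cum, k_best = 0.0, None
    for k in range(n + 1):
        cum += binom_pmf(k, n, p0)
        if cum > tail:
            break
        k_best = k
    return k_best

def upper_cutoff(n, p0, tail):
    """Smallest k with P(X >= k) <= tail, or None if no such k exists."""
    cum, k_best = 0.0, None
    for k in range(n, -1, -1):
        cum += binom_pmf(k, n, p0)
        if cum > tail:
            break
        k_best = k
    return k_best

n, p0, alpha, x = 13, 0.40, 0.05, 3        # values from the clinic example
k_low = lower_cutoff(n, p0, alpha / 2)     # k_{0.025}
k_high = upper_cutoff(n, p0, alpha / 2)    # k*_{0.025}
reject_h0 = ((k_low is not None and x <= k_low)
             or (k_high is not None and x >= k_high))

print(k_low, k_high, reject_h0)
```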
TEST CONCERNING A SINGLE
POPULATION VARIANCE
Suppose that we wish to test the null hypothesis $H_0: \sigma^2 = \sigma_0^2$
against any of the alternatives
$H_1: \sigma^2 < \sigma_0^2$ or $H_1: \sigma^2 > \sigma_0^2$ or
$H_1: \sigma^2 \neq \sigma_0^2$.

If the population we are sampling is normal, then the test statistic is
given by
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2},$$
where $\chi^2$ has $(n - 1)$ degrees of freedom.
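The critical regions for testing $H_0: \sigma^2 = \sigma_0^2$ are shown below.

$H_1$                            Reject $H_0$ if
$\sigma^2 < \sigma_0^2$          $\chi^2 < \chi^2_{1-\alpha}$
$\sigma^2 > \sigma_0^2$          $\chi^2 > \chi^2_{\alpha}$
$\sigma^2 \neq \sigma_0^2$       $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$
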
Example
Given that $n = 25$, $s^2 = 9$ and $\sigma_0^2 = 10$, test
$H_0: \sigma^2 = 10$ against the two-sided alternative
$H_1: \sigma^2 \neq 10$ at the 1% significance level.

Solution
Substituting $n = 25$ and $s^2 = 9$ into the formula gives:
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{(25 - 1)(9)}{10} = 21.6$$
The critical region is $\chi^2$ less than $\chi^2_{0.995}$, or greater than
$\chi^2_{0.005}$. From chi-square tables, the value of $\chi^2_{0.995}$
(with 24 df) is 9.886 and that for $\chi^2_{0.005}$ (with 24 df) is 45.558.

Since $\chi^2 = 21.6$ is neither less than 9.886 nor greater than 45.558, we
cannot reject the null hypothesis.

TEST CONCERNING TWO POPULATION MEANS
(INDEPENDENT SAMPLES)

Suppose that we have two independent random samples with means
$\bar{x}_1$ and $\bar{x}_2$ and respective sample sizes $n_1$ and $n_2$,
from normal populations with means $\mu_1$ and $\mu_2$ and variances
$\sigma_1^2$ and $\sigma_2^2$.

We can compare $\mu_1$ and $\mu_2$ by testing
$H_0: \mu_1 - \mu_2 = \delta$, where $\delta$ is a given constant, against
any of the alternatives
$H_1: \mu_1 - \mu_2 < \delta$ or $H_1: \mu_1 - \mu_2 > \delta$ or
$H_1: \mu_1 - \mu_2 \neq \delta$
under the following three conditions:
1. Large independent samples with known $\sigma_1^2$ and $\sigma_2^2$.
2. Large independent samples with unknown $\sigma_1^2$ and $\sigma_2^2$.
3. Small independent samples with unknown $\sigma_1^2$ and $\sigma_2^2$.


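Large independent samples with $\sigma_1^2$ and $\sigma_2^2$ known

The test statistic for testing two population means from independent samples
with $\sigma_1^2$ and $\sigma_2^2$ known is given by
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$
where $\delta = \mu_1 - \mu_2$ under $H_0$ and $z$ is the usual standard
normal random variable.
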
Example
A random sample of 100 observations is drawn from a normal population
with variance 16 and the sample mean was found to be 10.8. Another
sample of 64 observations is drawn from a second and independent
normal population with variance 25 and the sample mean was found to
be 9.6. Test the hypotheses:
$H_0$: the population means are equal
against
$H_1$: the population means are not equal.

Solution
The hypotheses above are equivalent to

$H_0: \mu_1 - \mu_2 = 0$
$H_1: \mu_1 - \mu_2 \neq 0$
We now evaluate the test statistic by substituting
$n_1 = 100$, $\bar{x}_1 = 10.8$, $\sigma_1^2 = 16$, $n_2 = 64$,
$\bar{x}_2 = 9.6$, $\sigma_2^2 = 25$ and $\delta = 0$ into the formula.
This gives
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(10.8 - 9.6) - 0}{\sqrt{\dfrac{16}{100} + \dfrac{25}{64}}} = \frac{1.2}{\sqrt{0.5506}} = 1.617$$
From the z-tables, $z_{\alpha/2} = z_{0.025} = 1.96$ (that is,
$\alpha = 0.05$), giving the critical region as $z < -1.96$ or $z > 1.96$.
Since $z = 1.617$ is neither less than $-1.96$ nor greater than 1.96, we fail
to reject $H_0$. We therefore conclude that there is no significant
difference between the population means.

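The statistic for two means with known variances is easy to evaluate directly. In the sketch below the function name `two_sample_z` is an assumption made for illustration; evaluated without intermediate rounding, the statistic for this example comes to about 1.617, well inside the interval $(-1.96, 1.96)$.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(xbar1, xbar2, var1, var2, n1, n2, delta=0.0):
    """z statistic for H0: mu1 - mu2 = delta with known variances."""
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return (xbar1 - xbar2 - delta) / standard_error

# Values from the example above
z = two_sample_z(xbar1=10.8, xbar2=9.6, var1=16, var2=25, n1=100, n2=64)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
reject_h0 = abs(z) > z_crit                   # two-tailed critical region

print(round(z, 3), reject_h0)
```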

Large independent samples with $\sigma_1^2$ and $\sigma_2^2$ unknown

The test statistic for testing two population means from independent
samples with $\sigma_1^2$ and $\sigma_2^2$ unknown is given by
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
where $s_1^2$ and $s_2^2$ are the respective sample estimates for
$\sigma_1^2$ and $\sigma_2^2$.

Example

Suppose that we have randomly selected two independent samples from
populations having means $\mu_1$ and $\mu_2$. If $\bar{x}_1 = 25$,
$\bar{x}_2 = 20$, $s_1 = 3$, $s_2 = 4$, $n_1 = 100$ and $n_2 = 100$, test
$H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$
at the 0.05 level of significance.

Solution

Substituting $\bar{x}_1 = 25$, $\bar{x}_2 = 20$, $s_1 = 3$, $s_2 = 4$,
$n_1 = 100$, $n_2 = 100$ and $\delta = 0$ in the formula, we obtain
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(25 - 20) - 0}{\sqrt{\dfrac{9}{100} + \dfrac{16}{100}}} = 10.$$
From the z-tables, we have $z_{\alpha} = z_{0.05} = 1.645$. This gives the
critical region as $z > 1.645$. Since $z = 10$ is greater than 1.645, we
reject the null hypothesis and conclude that $\mu_1$ is greater than $\mu_2$.

Small independent samples with $\sigma_1^2$ and $\sigma_2^2$ unknown

Tests concerning two population means (independent samples) with
$\sigma_1^2$ and $\sigma_2^2$ unknown and sample sizes $n_1 < 30$ and
$n_2 < 30$ can be performed under two different assumptions about the
population variances:
1. $\sigma_1^2$ and $\sigma_2^2$ are unknown and both are assumed to be equal
   to a common variance $\sigma^2$.
2. $\sigma_1^2$ and $\sigma_2^2$ are unknown and are assumed to be different
   from each other.
Assumption 1

Under Assumption 1, the appropriate test statistic for such a test is given by
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$
where $s_p$, called the pooled sample standard deviation, is given by
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},$$
with $s_1^2$ and $s_2^2$ the respective variances for samples 1 and 2, and
$t$ has the t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
Example
Two independent random samples of sizes $n_1 = 16$ and $n_2 = 10$ from
normal populations with unknown standard deviations have means
$\bar{x}_1 = 23.4$ and $\bar{x}_2 = 18.2$, with corresponding standard
deviations $s_1 = 3.5$ and $s_2 = 4.8$.
Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 10%
significance level, assuming that the population variances are equal.

Solution
We first evaluate $s_p$ by substituting $n_1 = 16$, $s_1 = 3.5$, $n_2 = 10$
and $s_2 = 4.8$ into the formula to give
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(16 - 1)(3.5)^2 + (10 - 1)(4.8)^2}{16 + 10 - 2}} = 4.04$$
Now substituting $s_p = 4.04$, $n_1 = 16$, $\bar{x}_1 = 23.4$, $n_2 = 10$,
$\bar{x}_2 = 18.2$ and $\delta = 0$ into the test statistic gives
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(23.4 - 18.2) - 0}{4.04\sqrt{\dfrac{1}{16} + \dfrac{1}{10}}} = 3.193$$
From the t-tables, $t_{0.10}$ with 24 degrees of freedom is 1.318. Thus the
critical region is $t > 1.318$. Since $t = 3.193 > 1.318$, we reject $H_0$ and
conclude that $\mu_1 > \mu_2$.

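The pooled calculation above can be sketched in code as follows. The function name `pooled_t` is an assumption for illustration, and the critical value 1.318 is the table value quoted in the notes. Because the code does not round $s_p$ to 4.04 before the second step, it returns roughly 3.195 rather than 3.193; the decision is unaffected.

```python
from math import sqrt

def pooled_t(xbar1, xbar2, s1, s2, n1, n2, delta=0.0):
    """Pooled two-sample t statistic, assuming equal population variances.

    Returns (s_p, t, degrees of freedom).
    """
    df = n1 + n2 - 2
    s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (xbar1 - xbar2 - delta) / (s_p * sqrt(1 / n1 + 1 / n2))
    return s_p, t, df

# Values from the example above
s_p, t, df = pooled_t(xbar1=23.4, xbar2=18.2, s1=3.5, s2=4.8, n1=16, n2=10)

t_crit = 1.318            # t_{0.10} with 24 df, taken from the t-table
reject_h0 = t > t_crit    # upper-tailed critical region

print(round(s_p, 2), round(t, 3), df, reject_h0)
```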
Under Assumption 2, the appropriate test statistic for such a test is given by
$$t^* = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
where $t^*$ has approximately a t-distribution with df, $v$, given by
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}.$$
If $v$ is not a whole number, then we have to round it to the nearest whole
number.
Example
Two independent random samples of sizes $n_1 = 16$ and $n_2 = 10$ from
normal populations with unknown standard deviations have means
$\bar{x}_1 = 23.4$ and $\bar{x}_2 = 18.2$, with corresponding standard
deviations $s_1 = 3.5$ and $s_2 = 4.8$.
Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 10%
significance level, assuming that the population variances are different.

Solution
Evaluate $t^*$ by substituting $n_1 = 16$, $s_1 = 3.5$, $\bar{x}_1 = 23.4$,
$n_2 = 10$, $s_2 = 4.8$, $\bar{x}_2 = 18.2$ and $\delta = 0$ into the test
statistic to give
$$t^* = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(23.4 - 18.2) - 0}{\sqrt{\dfrac{3.5^2}{16} + \dfrac{4.8^2}{10}}} = 2.9680$$
We evaluate $v$ by substituting $n_1 = 16$, $s_1 = 3.5$, $n_2 = 10$ and
$s_2 = 4.8$ to obtain
$$v = \frac{\left(\dfrac{3.5^2}{16} + \dfrac{4.8^2}{10}\right)^2}{\dfrac{\left(3.5^2/16\right)^2}{16 - 1} + \dfrac{\left(4.8^2/10\right)^2}{10 - 1}} = 14.98 \approx 15$$
From the t-tables, $t_{0.10}$ with 15 degrees of freedom is 1.341. Thus the
critical region is $t^* > 1.341$. Since $t^* = 2.9680 > 1.341$, we reject
$H_0$ and conclude that $\mu_1 > \mu_2$.

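The unequal-variance statistic and its approximate degrees of freedom can be computed together. The sketch below is an illustration; the function name `unequal_variance_t` is an assumption, and the critical value 1.341 is the table value quoted in the notes.

```python
from math import sqrt

def unequal_variance_t(xbar1, xbar2, s1, s2, n1, n2, delta=0.0):
    """t* statistic and approximate degrees of freedom v (unequal variances)."""
    a = s1**2 / n1   # estimated variance of the first sample mean
    b = s2**2 / n2   # estimated variance of the second sample mean
    t_star = (xbar1 - xbar2 - delta) / sqrt(a + b)
    v = (a + b)**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t_star, v

# Values from the example above
t_star, v = unequal_variance_t(xbar1=23.4, xbar2=18.2,
                               s1=3.5, s2=4.8, n1=16, n2=10)
df = round(v)                  # nearest whole number, as stated in the notes

t_crit = 1.341                 # t_{0.10} with 15 df, taken from the t-table
reject_h0 = t_star > t_crit    # upper-tailed critical region

print(round(t_star, 3), round(v, 2), df, reject_h0)
```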
TEST CONCERNING TWO POPULATION MEANS
(PAIRED DATA)

Suppose that $x_1, x_2, \ldots, x_n$ are the observations on $n$ individuals
before an experiment, and $y_1, y_2, \ldots, y_n$ are the corresponding
observations after the experiment. Then the pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ constitute a paired data set.

Consider the test of $H_0: \mu_1 - \mu_2 = \delta$ against the various
alternatives

(a)                              (b)                              (c)
$H_0: \mu_1 - \mu_2 = \delta$    $H_0: \mu_1 - \mu_2 = \delta$    $H_0: \mu_1 - \mu_2 = \delta$
$H_1: \mu_1 - \mu_2 < \delta$    $H_1: \mu_1 - \mu_2 > \delta$    $H_1: \mu_1 - \mu_2 \neq \delta$

By calculating the differences $d_i = y_i - x_i$ $(i = 1, 2, \ldots, n)$
between corresponding observations, the test reduces to testing
$H_0: \mu_d = \delta$ against various alternatives

(a)                      (b)                      (c)
$H_0: \mu_d = \delta$    $H_0: \mu_d = \delta$    $H_0: \mu_d = \delta$
$H_1: \mu_d < \delta$    $H_1: \mu_d > \delta$    $H_1: \mu_d \neq \delta$

Let $\mu_d$ be the mean of the normally distributed population of paired
differences, and let $\bar{d}$ and $s_d$ be the mean and standard deviation
of a sample of $n$ paired differences that have been selected.

Then the appropriate test statistic for conducting any of the tests in the
table above is given by
$$t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}},$$
where $t$ has the t-distribution with $n - 1$ degrees of freedom.

Example

The data below are the weights before and after ten boxers were fed with
a weight reducing diet:
$i$             1   2   3   4   5   6   7   8   9   10
Before, $x_i$   69  50  61  72  78  66  75  89  86  54
After, $y_i$    66  49  63  70  71  65  75  88  87  51

Test the null hypothesis $H_0: \mu_d = 0$ against the alternative
hypothesis $H_1: \mu_d < 0$ at the 5% level of significance.

Solution

$i$             1   2   3   4   5   6   7   8   9   10
$x_i$           69  50  61  72  78  66  75  89  86  54
$y_i$           66  49  63  70  71  65  75  88  87  51
$y_i - x_i$     -3  -1  2   -2  -7  -1  0   -1  1   -3

Considering the differences as one sample data set, we find that
$n = 10$, $\bar{d} = -1.5$, $s_d = 2.5$ and $\delta = 0$.

Substituting these into the test statistic gives
$$t = \frac{\bar{d} - \delta}{s_d/\sqrt{n}} = \frac{-1.5 - 0}{2.5/\sqrt{10}} = -1.8973$$
From the t-tables, $t_{0.05}$ with 9 degrees of freedom is 1.833. Thus the
critical region is $t < -1.833$. Since $t = -1.897 < -1.833$, we reject
$H_0$ and conclude that $\mu_2 < \mu_1$.

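The paired test can be carried out directly from the before and after weights. The sketch below is a minimal illustration using the standard library; the variable names are assumptions, and 1.833 is the table value quoted in the notes. Using the unrounded standard deviation ($s_d \approx 2.506$ rather than 2.5) gives $t \approx -1.893$, which leads to the same decision.

```python
from math import sqrt
from statistics import mean, stdev

# Boxers' weights before and after the diet, from the example above
before = [69, 50, 61, 72, 78, 66, 75, 89, 86, 54]
after = [66, 49, 63, 70, 71, 65, 75, 88, 87, 51]

d = [y - x for x, y in zip(before, after)]   # d_i = y_i - x_i
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                               # sample sd of the differences
delta = 0
t = (d_bar - delta) / (s_d / sqrt(n))

t_crit = 1.833             # t_{0.05} with 9 df, taken from the t-table
reject_h0 = t < -t_crit    # lower-tailed critical region

print(d, d_bar, round(s_d, 3), round(t, 3), reject_h0)
```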
TEST CONCERNING TWO POPULATION PROPORTIONS
(INDEPENDENT SAMPLES)

Suppose that we have two independent random samples of sizes $n_1$ and $n_2$
with proportions $\hat{p}_1$ and $\hat{p}_2$, where
$\hat{p}_1 = \dfrac{x_1}{n_1}$ and $\hat{p}_2 = \dfrac{x_2}{n_2}$, from two
populations.

We can compare $p_1$ and $p_2$ by testing $H_0: p_1 - p_2 = \delta$ against
any of the alternatives
$H_1: p_1 - p_2 < \delta$, or $H_1: p_1 - p_2 > \delta$, or
$H_1: p_1 - p_2 \neq \delta$
under two conditions if the sample sizes are large:
1. If $\delta = 0$, or
2. If $\delta \neq 0$.

Condition 1

If $\delta = 0$, the appropriate test statistic is given by
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$
where $\hat{p}$, called the combined sample proportion, is given by
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}.$$

Condition 2

If $\delta \neq 0$, the appropriate test statistic is given by
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}}}.$$

Example (a)

If $x_1 = 18$, $x_2 = 15$, $n_1 = 35$ and $n_2 = 42$, test the null hypothesis
$H_0: p_1 - p_2 = 0$
against
$H_1: p_1 - p_2 > 0$
at the 5% significance level.

Solution
Since $\delta = 0$, we first find the combined sample proportion as follows:
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{18 + 15}{35 + 42} = 0.4286$$
Therefore substituting
$\hat{p}_1 = \dfrac{18}{35} = 0.5143$, $\hat{p}_2 = \dfrac{15}{42} = 0.3571$,
$\hat{p} = 0.4286$ and $\delta = 0$
into the test statistic gives:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(0.5143 - 0.3571) - 0}{\sqrt{(0.4286)(0.5714)\left(\dfrac{1}{35} + \dfrac{1}{42}\right)}} = 1.3875$$
From the z-tables, $z_{0.05} = 1.645$, resulting in a critical region of
$z > 1.645$. Since $z = 1.3875 < 1.645$, we fail to reject $H_0$ and conclude
that there is no significant evidence that $p_1$ exceeds $p_2$.
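The combined-proportion statistic used when $\delta = 0$ can be sketched as below. The function name `pooled_two_proportion_z` is an assumption for illustration; the critical value is taken from the standard normal quantile function rather than a table.

```python
from math import sqrt
from statistics import NormalDist

def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0 using the combined sample proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)   # combined sample proportion
    standard_error = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error

# Values from Example (a)
z = pooled_two_proportion_z(x1=18, n1=35, x2=15, n2=42)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # z_{0.05}, about 1.645
reject_h0 = z > z_crit                    # upper-tailed critical region

print(round(z, 3), round(z_crit, 3), reject_h0)
```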
Example (b)

If $x_1 = 18$, $x_2 = 15$, $n_1 = 35$ and $n_2 = 42$, test the null hypothesis
$H_0: p_1 - p_2 = 0.15$
against
$H_1: p_1 - p_2 > 0.15$
at the 5% significance level. Interpret your result.

Solution

Since $\delta \neq 0$, we substitute
$\hat{p}_1 = \dfrac{18}{35} = 0.5143$, $\hat{p}_2 = \dfrac{15}{42} = 0.3571$,
$n_1 = 35$, $n_2 = 42$ and $\delta = 0.15$ into the test statistic to give:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}}} = \frac{(0.5143 - 0.3571) - 0.15}{\sqrt{\dfrac{(0.5143)(0.4857)}{35 - 1} + \dfrac{(0.3571)(0.6429)}{42 - 1}}} = \frac{0.0072}{0.1138} = 0.063$$
From the z-tables, $z_{0.05} = 1.645$, resulting in a critical region of
$z > 1.645$. Since $z = 0.063 < 1.645$, we fail to reject $H_0$.

That is, there is not enough evidence to conclude that $p_1$ exceeds $p_2$
by more than 15 percentage points.

TEST CONCERNING TWO POPULATION VARIANCES
(INDEPENDENT SAMPLES)

Suppose that we have two independent random samples of sizes $n_1$ and $n_2$
from populations with variances $\sigma_1^2$ and $\sigma_2^2$.

Then we can compare $\sigma_1^2$ and $\sigma_2^2$ by testing the null
hypothesis
$H_0: \sigma_1^2 = \sigma_2^2$
against any of the alternatives
$H_1: \sigma_1^2 < \sigma_2^2$, or $H_1: \sigma_1^2 > \sigma_2^2$, or
$H_1: \sigma_1^2 \neq \sigma_2^2$.

The appropriate test statistic is given by
$$F = \frac{s_1^2}{s_2^2},$$
where $s_1^2$ and $s_2^2$ are the sample variances.

The F-statistic has the F-distribution with
$n_1 - 1$ as numerator degrees of freedom and
$n_2 - 1$ as denominator degrees of freedom.

The critical regions for testing $H_0: \sigma_1^2 = \sigma_2^2$ are as shown:

$H_1$                                  Reject $H_0$ if
$H_1: \sigma_1^2 < \sigma_2^2$         $F < F_{1-\alpha}(n_1 - 1, n_2 - 1)$
$H_1: \sigma_1^2 > \sigma_2^2$         $F > F_{\alpha}(n_1 - 1, n_2 - 1)$
$H_1: \sigma_1^2 \neq \sigma_2^2$      $F < F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$ or
                                       $F > F_{\alpha/2}(n_1 - 1, n_2 - 1)$

You may find the following identity useful.
$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)}$$
Example
Suppose that observations from two independent random samples from
two normal populations yielded the following result:
$n_1 = 11$, $s_1^2 = 18.4$, $n_2 = 16$ and $s_2^2 = 13.5$.
Test the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against
$H_1: \sigma_1^2 \neq \sigma_2^2$ at the 10% significance level.
Solution
Substituting $s_1^2 = 18.4$ and $s_2^2 = 13.5$ into the test statistic gives:
$$F = \frac{s_1^2}{s_2^2} = \frac{18.4}{13.5} = 1.363.$$
The test is two-sided, so we need to evaluate
$F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$ and $F_{\alpha/2}(n_1 - 1, n_2 - 1)$.
From the F-distribution table, $F_{0.05}(10, 15) = 2.54$.
Using the identity
$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)}$$
we have
$$F_{0.95}(10, 15) = \frac{1}{F_{0.05}(15, 10)} = \frac{1}{2.85} = 0.35$$
Therefore the critical region is $F < 0.35$ or $F > 2.54$.

Since $F = 1.36$ is neither less than 0.35 nor greater than 2.54, we cannot
reject the null hypothesis. We therefore conclude that there is no
significant evidence against $\sigma_1^2 = \sigma_2^2$.

TEST ON CATEGORICAL DATA

Some common tests that can be considered appropriate for
qualitative data include the following:
1. The multinomial distribution
2. Goodness-of-fit tests (when categorical probabilities are
   completely defined)
3. Goodness-of-fit tests for the Poisson, binomial and normal
   distributions
4. Goodness-of-fit tests for independence

THE MULTINOMIAL DISTRIBUTION
The multinomial distribution is an extension of the binomial
distribution. Its properties are as follows:
1. The experiment consists of $n$ identical trials.
2. There are $k$ possible outcomes associated with
   each trial.
3. The probabilities of the $k$ outcomes, denoted by $p_1, p_2, \ldots, p_k$,
   remain constant from trial to trial, and $p_1 + p_2 + \cdots + p_k = 1$.
4. The $n$ trials are independent of each other.
5. The random variables of interest are the counts
   in each of the $k$ cells.
Example
The table below shows the market share for different brands of
television.

Brand of TV    Market share
LG             20%
Samsung        30%
Panasonic      35%
Sony           15%

Can this be said to follow a multinomial distribution?

Solution
It is clear that the brands of television are independent of each
other. The TV brands LG, Samsung, Panasonic and Sony have
distribution probabilities $p_1 = 0.20$, $p_2 = 0.30$, $p_3 = 0.35$
and $p_4 = 0.15$, respectively. So we have
$$0.20 + 0.30 + 0.35 + 0.15 = 1$$
and therefore, the distribution of brands of television sets follows a
multinomial distribution.

GOODNESS OF FIT TESTS (WHEN CATEGORICAL
PROBABILITIES ARE COMPLETELY DEFINED)

Suppose we wish to test the null hypothesis
$H_0: p_1 = p_2 = \cdots = p_k = \pi$, where $\pi$ denotes the hypothesized
value, against
$H_1$: at least one of the multinomial probabilities is not equal to its
hypothesized value.

Then the test statistic is given by
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i},$$
where
$k$ denotes the number of classes
$o_i$ denotes the number of observations in class $i$
$e_i$ denotes the number of expected observations
in class $i$ ($e_i = np_i$)
$p_i$ denotes the probability of observing an observation in class $i$
$n$ denotes the sample size

The test statistic has an approximate chi-square distribution with
$k - 1$ degrees of freedom.

NOTE:
The approximation is good if the sample size is large enough so
that, for every cell, the expected cell frequency is 5 or more.

Example
The head teacher of a primary school is interested in knowing whether
there exist colour preferences among the pupils in his school. A sample of
100 pupils was drawn from the school and shown identically shaped
objects, coloured red, blue, yellow, green or pink. When each child was
asked to pick the most preferred colour, 30 picked red, 18 blue, 12 yellow,
20 green and 20 pink. Test, at the 5% significance level, the hypotheses:
$H_0$: there are no colour preferences
against
$H_1$: colour preferences do exist
Solution

If there are no preferences, then the probability of choosing
any colour is the same. That is, $p_i = \dfrac{1}{5} = 0.20$
$(i = 1, 2, \ldots, 5)$.
Thus, we are testing the hypotheses
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = 0.20$
against
$H_1$: at least one of the $p_i$s is not 0.20

Colour            Red   Blue   Yellow   Green   Pink
Observed, $o_i$   30    18     12       20      20
Expected, $e_i$   20    20     20       20      20
$o_i - e_i$       10    -2     -8       0       0

Thus
$$\chi^2 = \sum_{i=1}^{5} \frac{(o_i - e_i)^2}{e_i} = \frac{10^2}{20} + \frac{(-2)^2}{20} + \frac{(-8)^2}{20} + \frac{0^2}{20} + \frac{0^2}{20} = 8.4$$
At the 5% significance level, from the chi-square tables,
$\chi^2_{0.05}$ at df $= 5 - 1 = 4$ is 9.49. Therefore, the critical region
is $\chi^2 > 9.49$. Since $\chi^2 = 8.4 < \chi^2_{0.05}(4) = 9.49$, we cannot
reject $H_0$. That is, we do not have enough evidence against the
null hypothesis. Therefore we conclude that there is no evidence of
colour preferences among the pupils.

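The goodness-of-fit statistic for the colour example can be computed in a few lines. This is a sketch under the equal-probability null hypothesis; the function name `chi_square_gof` is an assumption for illustration, and 9.49 is the table value quoted in the notes.

```python
def chi_square_gof(observed, probabilities):
    """Chi-square goodness-of-fit statistic for fully specified probabilities."""
    n = sum(observed)
    expected = [n * p for p in probabilities]   # e_i = n * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

# Observed counts for red, blue, yellow, green and pink
observed = [30, 18, 12, 20, 20]
probabilities = [0.20] * 5                      # H0: no colour preference

chi2, expected = chi_square_gof(observed, probabilities)
df = len(observed) - 1

chi2_crit = 9.49                # chi-square_{0.05} with 4 df, from the table
reject_h0 = chi2 > chi2_crit

print(round(chi2, 2), df, reject_h0)
```

Every expected frequency here is 20, so the condition that each expected cell frequency be 5 or more is satisfied.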
GOODNESS-OF-FIT TESTS FOR THE POISSON,
BINOMIAL AND NORMAL DISTRIBUTIONS
Goodness-of-fit tests can be applied to test whether a sample data set
comes from a population having a Poisson, binomial or
normal distribution. The test statistic is given by
\[
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i},
\]
where
$k$ denotes the number of classes
$o_i$ denotes the number of observations in class $i$
$e_i$ denotes the expected number of observations in class $i$ ($e_i = np_i$)
$p_i$ denotes the probability of observing an observation in class $i$
$n$ denotes the sample size
The test statistic has an approximate chi-square distribution with
$k - m - 1$ degrees of freedom, where $m$ is the number of
independent parameters that have to be estimated from the
sample.
NB: The approximation is good if the sample size is large
enough so that, for every cell, the expected cell frequency is 5 or
more.
Example 1
The weekly number of power failures reported in a certain district
over 50 weeks is recorded as follows:

Number of failures   Number of weeks
0                     6
1                     8
2                    13
3                    11
4                     7
5                     4
6                     1
Determine whether the weekly number of power failures in the district
follows a Poisson distribution at the 5% significance level.

Solution
We wish to test the hypothesis
$H_0$: the weekly number of power failures follows a Poisson distribution
against
$H_1$: the weekly number of power failures does not follow a Poisson
distribution.
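We first calculate the expected frequencies using the Poisson
probabilities given by
\[
p_i = \frac{\lambda^i e^{-\lambda}}{i!}, \quad i = 0, 1, 2, \ldots, 6,
\]
where $\lambda$ is the mean of the distribution. Since $\lambda$ is not specified, it is estimated by the sample mean, computed from the following table.

x        f     fx
0        6      0
1        8      8
2       13     26
3       11     33
4        7     28
5        4     20
6        1      6
Total   50    121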
Therefore, the mean $\bar{x}$ is given by
\[
\bar{x} = \frac{\sum fx}{\sum f} = \frac{121}{50} = 2.42 \approx 2.4
\]

We can calculate the various probabilities as follows:
\[
p_0 = \frac{(2.4)^0 e^{-2.4}}{0!} = 0.091
\]
\[
p_1 = \frac{(2.4)^1 e^{-2.4}}{1!} = 0.218
\]

The rest is summarized in the table below.
Number of failures, $i$   Number of weeks, $n_i$   Poisson probability, $p_i$   Expected frequency, $e_i = np_i$
0                          6                       0.091                         4.55
1                          8                       0.218                        10.90
2                         13                       0.261                        13.05
3                         11                       0.209                        10.45
4                          7                       0.125                         6.25
5                          4                       0.060                         3.00
6                          1                       0.024                         1.20
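
The table of Poisson probabilities and expected frequencies can be reproduced with a few lines of Python. This is a sketch under the same conventions as the worked example: the mean is rounded to 2.4 and the probabilities to three decimal places. The function name `poisson_probability` is an illustrative choice, not something defined in the notes.

```python
import math

failures = [0, 1, 2, 3, 4, 5, 6]
weeks = [6, 8, 13, 11, 7, 4, 1]          # observed frequencies

n = sum(weeks)                                                  # 50 weeks
sample_mean = sum(x * f for x, f in zip(failures, weeks)) / n   # 121 / 50 = 2.42
lam = round(sample_mean, 1)              # rounded to 2.4, as in the worked example


def poisson_probability(i, lam):
    """Poisson probability p_i = lam^i * exp(-lam) / i!."""
    return lam ** i * math.exp(-lam) / math.factorial(i)


# Probabilities are rounded to 3 decimals to mirror the table.
probabilities = [round(poisson_probability(i, lam), 3) for i in failures]
expected = [round(n * p, 2) for p in probabilities]   # e_i = n * p_i

for i, o, p, e in zip(failures, weeks, probabilities, expected):
    print(f"{i}  {o:2d}  {p:.3f}  {e:5.2f}")
```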

From the table above, three of the expected frequencies (43%) are less than
5. To satisfy the condition, we merge the last three classes, as shown
in the table below.

Number of failures, $i$   Number of weeks, $n_i$   Poisson probability, $p_i$   Expected frequency, $np_i$
0                          6                       0.091                         4.55
1                          8                       0.218                        10.90
2                         13                       0.261                        13.05
3                         11                       0.209                        10.45
4 to 6                    12                       0.209                        10.45
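The test statistic becomes
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} \\
       &= \frac{(6 - 4.55)^2}{4.55} + \frac{(8 - 10.90)^2}{10.90} + \cdots + \frac{(12 - 10.45)^2}{10.45} \\
       &= 0.4621 + 0.7716 + \cdots + 0.2299 \\
       &= 1.49
\end{aligned}
\]

The critical region for the test is
\[
\chi^2 > \chi^2_{\alpha}(k - m - 1) = \chi^2_{0.05}(5 - 1 - 1) = \chi^2_{0.05}(3) = 7.815,
\]
where $k = 5$ classes remain after merging and $m = 1$ because the mean was estimated from the sample.
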
Since 1.49 is less than 7.815, we fail to reject the null
hypothesis and conclude that the weekly number of power
failures follows a Poisson distribution.
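
The merging step and the resulting test can be sketched in Python as follows. The expected frequencies are copied from the table in the example, the helper `pool_tail` is a hypothetical name, and the critical value 7.815 is taken from the tables quoted above. Note that the first class keeps its expected frequency of 4.55, slightly below 5, exactly as in the worked solution.

```python
def pool_tail(values, start):
    """Merge values[start:] into a single class by summing them."""
    return values[:start] + [sum(values[start:])]


observed = [6, 8, 13, 11, 7, 4, 1]
expected = [4.55, 10.90, 13.05, 10.45, 6.25, 3.00, 1.20]

# Merge the classes for 4, 5 and 6 failures, as in the example.
pooled_observed = pool_tail(observed, 4)   # last class: 7 + 4 + 1
pooled_expected = pool_tail(expected, 4)   # last class: 6.25 + 3.00 + 1.20

statistic = sum((o - e) ** 2 / e
                for o, e in zip(pooled_observed, pooled_expected))

k = len(pooled_observed)   # number of classes after merging
m = 1                      # one parameter (the mean) estimated from the sample
df = k - m - 1
critical_value = 7.815     # tabulated value for alpha = 0.05, df = 3
reject_h0 = statistic > critical_value

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```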

Example 2
Four identical six-sided dice, each with faces marked 1 to 6, are rolled
200 times. At each rolling, a record is made of the number of dice
whose score on the uppermost face is even. The results are shown below.

Number of even scores, $x_i$    0    1    2    3    4
Frequency, $f_i$               10   41   70   57   22

Test, at the 5% level of significance, whether the number of even scores
follows a binomial distribution with $n = 4$ and $p = 0.5$.
Solution
We wish to test the hypothesis:
$H_0$: Number of even scores is $\sim B(4, 0.5)$
against
$H_1$: Number of even scores is not $\sim B(4, 0.5)$

We have
\[
p(x) = B(n, p) = \binom{n}{x} p^x (1 - p)^{n - x}
\]
Thus,
\[
\begin{aligned}
p(0) &= \binom{4}{0} (0.5)^0 (0.5)^4 = 0.0625 \\
p(1) &= \binom{4}{1} (0.5)^1 (0.5)^3 = 0.2500 \\
p(2) &= \binom{4}{2} (0.5)^2 (0.5)^2 = 0.3750 \\
p(3) &= \binom{4}{3} (0.5)^3 (0.5)^1 = 0.2500 \\
p(4) &= \binom{4}{4} (0.5)^4 (0.5)^0 = 0.0625
\end{aligned}
\]

We can now calculate the expected cell frequencies and
summarize them in a table as shown.

$i$   $o_i$   $p_i$     $e_i = np_i$
0      10     0.0625    12.50
1      41     0.2500    50.00
2      70     0.3750    75.00
3      57     0.2500    50.00
4      22     0.0625    12.50

The test statistic becomes
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} \\
       &= \frac{(10 - 12.50)^2}{12.50} + \cdots + \frac{(22 - 12.50)^2}{12.50} \\
       &= 0.500 + 1.620 + \cdots + 7.220 \\
       &= 10.653
\end{aligned}
\]

The critical region for the test is
\[
\chi^2 > \chi^2_{\alpha}(k - m - 1) = \chi^2_{0.05}(5 - 0 - 1) = \chi^2_{0.05}(4) = 9.488
\]

Since 10.653 is greater than 9.488, we reject the null hypothesis and conclude
that the number of even scores does not follow a $B(4, 0.5)$ distribution.
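
For readers who want to verify the binomial calculation, the following Python sketch recomputes the probabilities, expected frequencies and test statistic. The function name `binomial_probability` is illustrative, and the critical value 9.488 is copied from the tables quoted above. Because both $n$ and $p$ are specified by $H_0$, no parameter is estimated and $m = 0$.

```python
from math import comb


def binomial_probability(x, n, p):
    """Binomial probability C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


observed = [10, 41, 70, 57, 22]      # frequencies for 0, 1, 2, 3, 4 even scores
rolls = sum(observed)                # 200 rolls of the four dice

probabilities = [binomial_probability(x, 4, 0.5) for x in range(5)]
expected = [rolls * p for p in probabilities]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 0 - 1           # k - m - 1 with m = 0
critical_value = 9.488               # tabulated value for alpha = 0.05, df = 4
reject_h0 = statistic > critical_value

print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```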
Example 3
Three hundred marbled ducks in Quack town are weighed and the results are shown
in the following table.

Mass (g)              Frequency
$m < 470$                10
$470 \le m < 520$       158
$520 \le m < 570$       123
$m \ge 570$               9

Set up the hypotheses and test, at the 10% significance level, whether the mass of
marbled ducks can be modelled by a normal distribution with mean 520 g and
standard deviation 30 g.
Solution

$H_0$: The mass of the marbled ducks can be modelled by the normal distribution with
mean 520 and standard deviation 30.
against
$H_1$: The mass of the marbled ducks cannot be modelled by the normal distribution
with mean 520 and standard deviation 30.

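Now we calculate the required probabilities as follows, where $Z = (M - 520)/30$ is the standard normal variable:
\[
\begin{aligned}
\Pr(M < 470) &= \Pr\left(\frac{M - 520}{30} < \frac{470 - 520}{30}\right) \\
             &= \Pr(Z < -1.67) \\
             &= 0.5000 - 0.4525 \\
             &= 0.0475
\end{aligned}
\]
\[
\begin{aligned}
\Pr(470 \le M < 520) &= \Pr\left(\frac{470 - 520}{30} \le Z < \frac{520 - 520}{30}\right) \\
                     &= \Pr(-1.67 \le Z < 0) \\
                     &= 0.4525
\end{aligned}
\]
\[
\begin{aligned}
\Pr(520 \le M < 570) &= \Pr\left(\frac{520 - 520}{30} \le Z < \frac{570 - 520}{30}\right) \\
                     &= \Pr(0 \le Z < 1.67) \\
                     &= 0.4525
\end{aligned}
\]
\[
\begin{aligned}
\Pr(M \ge 570) &= \Pr\left(\frac{M - 520}{30} \ge \frac{570 - 520}{30}\right) \\
               &= \Pr(Z \ge 1.67) \\
               &= 0.5000 - 0.4525 \\
               &= 0.0475
\end{aligned}
\]
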
Mass (g)              Frequency   Probability
$m < 470$                10        0.0475
$470 \le m < 520$       158        0.4525
$520 \le m < 570$       123        0.4525
$m \ge 570$               9        0.0475
Calculate the expected frequencies using $e_i = np_i$.

Mass (g)              Frequency   Probability   $e_i = np_i$
$m < 470$                10        0.0475         14.25
$470 \le m < 520$       158        0.4525        135.75
$520 \le m < 570$       123        0.4525        135.75
$m \ge 570$               9        0.0475         14.25

We summarize the rest of the calculations as follows (the expected frequency 135.75 is taken as 135.80):

$o_i$    $e_i$     $o_i - e_i$   $(o_i - e_i)^2$   $(o_i - e_i)^2 / e_i$
 10      14.25       -4.25          18.06             1.268
158     135.80       22.20         492.84             3.629
123     135.80      -12.80         163.84             1.206
  9      14.25       -5.25          27.56             1.934
Total                                                 8.037

Thus,
$\chi^2 = 8.037$

From tables, $\chi^2_{\alpha}(k - m - 1) = \chi^2_{0.10}(3) = 6.251$, with $k = 4$ classes and $m = 0$ because both parameters are specified by $H_0$. Since $\chi^2_{Cal} = 8.037$ is greater
than $\chi^2_{0.10}(3) = 6.251$, we reject $H_0$ and conclude that the mass of the marbled
ducks cannot be modelled by the normal distribution with mean 520 and standard
deviation 30.
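
The class probabilities and the test statistic for this example can be sketched with `statistics.NormalDist` from the Python standard library. To mirror the use of printed tables, the sketch rounds the z-scores to two decimals and the probabilities to four; these rounding choices and the variable names are assumptions made for illustration. With expected frequencies of exactly 135.75 the statistic comes out at about 8.05 rather than 8.037, a difference that comes only from the rounding of 135.75 to 135.80 in the hand calculation and does not affect the decision.

```python
from statistics import NormalDist

mean, sd = 520, 30
boundaries = [470, 520, 570]          # class boundaries in grams
observed = [10, 158, 123, 9]          # m<470, 470<=m<520, 520<=m<570, m>=570
n = sum(observed)                     # 300 ducks

standard_normal = NormalDist()        # mean 0, standard deviation 1

# z-scores rounded to 2 decimals, as when reading a printed normal table
z_scores = [round((b - mean) / sd, 2) for b in boundaries]

# Cumulative probabilities at the class boundaries, padded with 0 and 1
cdf_values = [0.0] + [standard_normal.cdf(z) for z in z_scores] + [1.0]
probabilities = [round(upper - lower, 4)
                 for lower, upper in zip(cdf_values, cdf_values[1:])]

expected = [n * p for p in probabilities]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 6.251                # tabulated value for alpha = 0.10, df = 3
reject_h0 = statistic > critical_value

print(probabilities)
print(f"chi-square = {statistic:.3f}, reject H0: {reject_h0}")
```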

GOODNESS-OF-FIT TESTS FOR HOMOGENEITY

This test is used to determine whether frequency counts for a
given variable are distributed identically across different
populations. That is, a single categorical variable from two or
more populations is studied.

This approach is considered appropriate when the following
conditions hold.

1. The method for selecting a sample from each population is
   simple random sampling.

2. The variable under study is categorical.

3. The expected frequency for each cell should be at least 5.

The test statistic is given by
\[
\chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{\left(n_{ij} - E(n_{ij})\right)^2}{E(n_{ij})},
\]
where
$E(n_{ij})$ is the expected cell frequency for the $(ij)$th cell,
$n_{ij}$ is the number of observations that fall into the $(ij)$th cell, called the observed cell frequency.
The test statistic has a chi-square distribution with $(p - 1)(l - 1)$ degrees of freedom, where $p$ is the number of populations and $l$ is the number of categories.
Example
In a study of television viewing habits of children, a
developmental psychologist selects a random sample of 300
primary school pupils, 100 boys and 200 girls. Each child is
asked which of the following television programmes they like
best: The Talented kids, or The Pulpit, or Maths and Science
Quiz. The results are shown below

Viewing Preferences
         The Talented Kids   The Pulpit   Maths and Science Quiz
Boys            50               30                20
Girls           50               80                70

Do boys' preferences for the television programmes differ
significantly from the girls' preferences? Use the 0.05 level of
significance.

Solution
Viewing Preferences (expected frequencies in parentheses)
          The Talented Kids   The Pulpit   Maths & Science Quiz   Totals
Boys         50 (33.33)           30                20              100
Girls        50 (66.67)           80                70              200
Totals          100              110                90              300

The hypotheses we wish to test are as follows:

$H_{01}$: Proportion of boys who prefer The Talented Kids equals proportion of girls who
prefer The Talented Kids

$H_{02}$: Proportion of boys who prefer The Pulpit equals proportion of girls who
prefer The Pulpit

$H_{03}$: Proportion of boys who prefer Maths & Science Quiz equals proportion of
girls who prefer Maths & Science Quiz

against

$H_1$: At least one of the $H_0$'s is false
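We now calculate the expected frequencies, where $P_i$ is the total for population $i$ (row total) and $L_j$ is the total for category $j$ (column total):
\[
\begin{aligned}
E(n_{11}) &= \frac{P_1 \times L_1}{n} = \frac{100 \times 100}{300} = 33.33 \\
E(n_{12}) &= \frac{P_1 \times L_2}{n} = \frac{100 \times 110}{300} = 36.67 \\
E(n_{13}) &= \frac{P_1 \times L_3}{n} = \frac{100 \times 90}{300} = 30.00 \\
E(n_{21}) &= \frac{P_2 \times L_1}{n} = \frac{200 \times 100}{300} = 66.67 \\
E(n_{22}) &= \frac{P_2 \times L_2}{n} = \frac{200 \times 110}{300} = 73.33 \\
E(n_{23}) &= \frac{P_2 \times L_3}{n} = \frac{200 \times 90}{300} = 60.00
\end{aligned}
\]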
Substituting them into the test statistic gives
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{\left(n_{ij} - E(n_{ij})\right)^2}{E(n_{ij})} \\
       &= \frac{(50 - 33.33)^2}{33.33} + \cdots + \frac{(70 - 60)^2}{60} \\
       &= 8.3375 + \cdots + 1.6667 \\
       &= 19.3255
\end{aligned}
\]
The degrees of freedom is given by
\[
df = (p - 1)(l - 1) = (2 - 1)(3 - 1) = 2
\]
From the chi-square tables, the critical value of chi-square at the
0.05 level of significance with 2 degrees of freedom is 5.99.
Since 19.3255 is greater than 5.99, we reject the null
hypothesis and conclude that at least one of the null
hypotheses is false; that is, boys' preferences for the television programmes differ significantly from the girls' preferences.
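
The expected frequencies and the homogeneity statistic can be recomputed with the Python sketch below. The dictionary layout and variable names are illustrative choices, and the critical value 5.99 is copied from the tables quoted above. Because the sketch does not round the expected frequencies to two decimals, it gives a statistic of about 19.32, marginally different from the hand-calculated 19.3255; the conclusion is unchanged.

```python
# Observed counts: one row per population, one column per programme
# (The Talented Kids, The Pulpit, Maths and Science Quiz).
observed = {
    "Boys": [50, 30, 20],
    "Girls": [50, 80, 70],
}

row_totals = {group: sum(counts) for group, counts in observed.items()}
column_totals = [sum(column) for column in zip(*observed.values())]
n = sum(row_totals.values())          # grand total: 300 pupils

# E(n_ij) = (row total * column total) / n
expected = {
    group: [row_totals[group] * c / n for c in column_totals]
    for group in observed
}

statistic = sum(
    (o - e) ** 2 / e
    for group in observed
    for o, e in zip(observed[group], expected[group])
)

df = (len(observed) - 1) * (len(column_totals) - 1)   # (p - 1)(l - 1)
critical_value = 5.99                 # tabulated value for alpha = 0.05, df = 2
reject_h0 = statistic > critical_value

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```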

GOODNESS-OF-FIT TESTS FOR INDEPENDENCE

Goodness-of-fit tests for independence are applied to two categorical
variables from a single population.

In these tests, the null hypothesis is that the variables are
independent, against the alternative that the variables are not
independent.

Data for the goodness-of-fit test of independence are usually presented in a
contingency table. The following table is an $r \times c$ contingency table.

                          Variable 2
Variable 1    1          2          ...    c          Totals
1             $n_{11}$   $n_{12}$   ...    $n_{1c}$   $R_1$
2             $n_{21}$   $n_{22}$   ...    $n_{2c}$   $R_2$
...           ...        ...        ...    ...        ...
r             $n_{r1}$   $n_{r2}$   ...    $n_{rc}$   $R_r$
Totals        $C_1$      $C_2$      ...    $C_c$      $n$

The test statistic for the goodness-of-fit test is given by
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},
\]
where
$c$ denotes the number of columns
$r$ denotes the number of rows
$o_{ij}$ denotes the observed cell frequency
$e_{ij}$ denotes the expected cell frequency
$n$ denotes the grand total
\[
e_{ij} = \frac{R_i \times C_j}{n}
\]

The test statistic follows a chi-square distribution with $(c - 1)(r - 1)$
degrees of freedom. We reject the null hypothesis if
\[
\chi^2_{cal} > \chi^2_{\alpha}\big((c - 1)(r - 1)\big)
\]

Example

The following table is based on the classification by size and colour of a
sample of 120 shirts drawn from a large population.

                 Size
Colour     Small   Medium   Large
Red          10      13       12
Yellow       12      11       14
Green        18      20       10

Test, at the 5% significance level, the hypothesis that size and colour are
independent.
Solution

$H_0$: size and colour are independent
against
$H_1$: size and colour are not independent

The row marginal and column marginal totals are
$R_1 = 35$, $R_2 = 37$, $R_3 = 48$ and $C_1 = 40$, $C_2 = 44$, $C_3 = 36$
respectively. We now calculate the expected frequencies using
\[
e_{ij} = \frac{R_i \times C_j}{n}
\]
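\[
e_{11} = \frac{35 \times 40}{120} = 11.67 \qquad
e_{12} = \frac{35 \times 44}{120} = 12.83 \qquad
e_{13} = \frac{35 \times 36}{120} = 10.50
\]
\[
e_{21} = \frac{37 \times 40}{120} = 12.33 \qquad
e_{22} = \frac{37 \times 44}{120} = 13.57 \qquad
e_{23} = \frac{37 \times 36}{120} = 11.10
\]
\[
e_{31} = \frac{48 \times 40}{120} = 16.00 \qquad
e_{32} = \frac{48 \times 44}{120} = 17.60 \qquad
e_{33} = \frac{48 \times 36}{120} = 14.40
\]
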
The test statistic is given by
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \\
       &= \frac{(10 - 11.67)^2}{11.67} + \frac{(13 - 12.83)^2}{12.83} + \cdots + \frac{(10 - 14.40)^2}{14.40} \\
       &= 3.63
\end{aligned}
\]
The degrees of freedom is $df = (3 - 1)(3 - 1) = 4$.
Therefore, from chi-square tables, $\chi^2_{0.05}(4) = 9.49$.

Since $\chi^2 = 3.63$ is less than 9.49, we fail to reject $H_0$.

Therefore, the data are consistent with size and colour being independent.
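
As a final check, the Python sketch below recomputes the statistic for the shirts data and also records each cell's contribution, which shows where the observed and expected counts disagree most. The names `counts`, `contributions` and `largest_cell` are illustrative, and the critical value 9.49 is copied from the tables quoted above.

```python
sizes = ["Small", "Medium", "Large"]
counts = {
    "Red": [10, 13, 12],
    "Yellow": [12, 11, 14],
    "Green": [18, 20, 10],
}

row_totals = {colour: sum(row) for colour, row in counts.items()}   # R_i
column_totals = [sum(column) for column in zip(*counts.values())]   # C_j
n = sum(row_totals.values())                                        # 120 shirts

# Contribution of each cell: (o_ij - e_ij)^2 / e_ij with e_ij = R_i * C_j / n
contributions = {}
for colour, row in counts.items():
    for size, o, c in zip(sizes, row, column_totals):
        e = row_totals[colour] * c / n
        contributions[(colour, size)] = (o - e) ** 2 / e

statistic = sum(contributions.values())
largest_cell = max(contributions, key=contributions.get)

df = (len(counts) - 1) * (len(sizes) - 1)   # (r - 1)(c - 1)
critical_value = 9.49                       # tabulated value for alpha = 0.05, df = 4
reject_h0 = statistic > critical_value

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
print("largest contribution:", largest_cell)
```

The largest single contribution comes from the Green/Large cell, but the total still falls well below the critical value.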

