
HYPOTHESIS TESTING I

The Wilcoxon Rank Sum Test

Mathematics Project: CET/Schools Council



Before you begin

Research is being conducted in thousands of subject areas today. The researcher may be a scientist,
technologist, artist, educationalist or a social scientist and he usually starts with a hunch, a wild idea,
some divine inspiration or a rational thought that something may or may not be true. He would set this
down as a formal hypothesis which he would naturally wish to test.

In order to test the hypothesis he will analyse some data which will enable him to accept or reject the
hypothesis. In some situations he can establish the truth or falsity of the hypothesis with absolute certainty.
For example,

Hypothesis: The planet Jupiter has at least 12 moons. (This is true with 100% certainty.)

Hypothesis: In Great Britain more people attend church on Sundays than Football League
matches on Saturdays. (This is true with 100% certainty.)

These statements belong to the secure world of hard fact and indisputable evidence. But just as often,
the hypothesis cannot be established with absolute 100% certainty. For example, consider the following
hypothesis that might be postulated by a research worker.

Hypothesis: Blue-eyed people are taller than brown-eyed people of the same age and sex.

It would be virtually impossible to measure the height of every blue and brown-eyed person, and so we
cannot be absolutely 100% certain about the truth (or otherwise) of this hypothesis. In practice, a sample
of blue-eyed people would be compared with a sample of brown-eyed people by using a statistical test.
This test might lead to the conclusion that the one group is not really any taller than the other group
in fact, despite differences in height from person to person.

But this conclusion cannot be held with 100% certainty. Even if all the people from the two samples
together were exactly the same height, we still cannot be 100% certain about blue and brown-eyed people
in general, from the two relatively small samples.

Are men better drivers than women? Does indulgence in soft drugs lead to hard drugs? Is independent
learning more effective than conventional teacher-based learning? In each case, only a tentative conclusion
about the population as a whole can be reached from a statistical analysis of just a sample. Although a
degree of uncertainty is unavoidable, the science of statistics is concerned with calculating precise
probabilities for these uncertainties so that we can, at least, know the likelihood of our being in error.

In the course of this sequence of three units you will learn how to write down a suitable hypothesis for
a statistical test, and how to calculate the probability that your conclusion is in error.

This unit aims to teach the use of a 'non-parametric' test, namely the Wilcoxon Rank Sum Test for assessing
the significance of the difference between two independent samples. In this context, the objectives for this
unit are that you will

(a) be able to set up a suitable Null Hypothesis, H0 of the general form:

H0 : "there is no difference, on the average, between the size of the measurements
in population A and those in population B";

(b) possess an understanding of 'significance' and of 'significance level'. The latter is defined in the
unit as the probability of incorrectly rejecting H0;

(c) be able to use tables of critical values and to write down the corresponding rejection regions;
(tables at the 5% level only will be used)

(d) appreciate the justification of the critical values. This will be achieved by deriving the
Wilcoxon test from 'first principles' using a minimum of abstract mathematics;

(e) be able to interpret 'significant' and 'non-significant' values for the test statistic in each test
and also be aware of the dangers of drawing erroneous conclusions.

Objectives (b) and (e) are considered to be the major ones for a sensible and well-understood application
of any statistical test.

CM 41
What you need to know before starting this unit

You should be able to study successfully the three units in the sequence even if you have done very little
statistics before. In fact, all you need to know is some simple probability theory. If you can follow
and understand the probability arguments below then you are ready to tackle this unit.

"Since there are 5 vowels in the alphabet (A, E, I, O, U), then if I choose a letter completely
at random, the probability of obtaining a vowel is the fraction 5/26."

"The probability that I draw an ace from a shuffled pack of cards is 4/52 (which is 1/13)."

"If I throw two dice together then there are 36 possible pairs of results. For example,
1 and 1, 1 and 2, 1 and 3, 2 and 1, and so on. If I want the two numbers to add up to
seven then I will need either 1 and 6, 2 and 5, 3 and 4, 4 and 3, 5 and 2, or 6 and 1.
So out of the total of 36 possible pairs, just 6 of them will add up to 7. So the probability
that I get a sum of 7 from the two dice is 6/36 (which is 1/6)."
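
The dice argument can be checked by listing all the pairs. What follows is only a minimal sketch in Python; the names `pairs`, `sevens` and `prob_seven` are chosen here for illustration and are not part of the unit.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) results.
pairs = list(product(range(1, 7), repeat=2))

# Keep only the pairs whose two numbers add up to seven.
sevens = [pair for pair in pairs if sum(pair) == 7]

# Probability = favourable pairs / all pairs, kept as an exact fraction.
prob_seven = Fraction(len(sevens), len(pairs))

print(len(pairs), len(sevens), prob_seven)  # 36 6 1/6
```

The same counting idea (favourable cases divided by all equally likely cases) is all the probability theory that the rest of the unit relies on.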

The text is designed to test you as it teaches you. So you will find it broken into sections. These
sections are called frames, and they are numbered sequentially. Some frames put questions to you, and
give you the answers immediately below. Such frames are easily recognised by the dotted line which
separates the answer from the question, and by the arrow in the left-hand margin, which shows where the
answer begins.

To work through the text, then, you will need to equip yourself with some kind of a work book - referred
to in the text as your 'answer book' - and a piece of paper with which you can hide the answer in the text
until you have written your answer in your answer book.

Two ruled lines after a frame means that you have reached a convenient stopping place. By looking for
the next set of double lines, you can judge whether you have time to complete another section of work.
Try not to stop between frames so marked, or the rhythm of your work may be lost.

When you are satisfied that you have understood the material in this unit, read the Summary on pages 19
and 20. You should then tackle the Post-test on page 25 without referring to the text.

1 Here are four hypotheses that might be tested statistically:

(i) Men are better drivers than women.

(ii) Independent learning is more effective than teacher-based learning.

(iii) The addition of fluoride in drinking water reduces the incidence of tooth decay.

(iv) The use of soft drugs encourages the use of hard drugs.

These hypotheses are imprecise as they stand. For instance, how do you measure 'driving skill';
is independent learning more 'effective' for passing examinations or for giving a better understanding
of a subject?

Let's see how the first hypothesis above can be modified in order to make it suitable for a
statistical test.

Ever since the invention of the motor car, it has been a favourite contention of the male sex that
men are, on the whole, better drivers than women. It is often unwise to generalise, and there
are certainly many women drivers who are better than some men drivers. However, the
hypothesis put forward by some people is that the overall standard of driving of men is superior
to that of women. Can we test this by taking a sample of men drivers and a sample of women
drivers and comparing their driving skills?

First, we must define the term 'driving skill'. This is a highly subjective and personal quality
- one person would rate confidence, nerve and decisiveness highly, whilst someone else would
prefer caution, prudence and cool-headedness. These are difficult things to measure anyway.

One quality that most people would agree is necessary for good driving, is good reactions.
It is vital to be able to react quickly in an emergency, regardless of who is at fault.

This quality of 'good reactions' can be measured by the quantity 'reaction time'. This is the time
taken from the moment an emergency arises to the moment that one reacts to it. Most people
have a reaction time of between 200 and 500 ms. (1 ms = 1 millisecond, i.e. 1 thousandth of
a second.)

Though a low reaction time, in itself, does not mean that one is a good driver, reaction time is
measurable, and, being measurable, it provides, for our purposes, a convenient quantity for
comparing driving abilities.

So, imagine that a researcher formulates the hypothesis:

THE REACTION TIME OF MEN DRIVERS IS LESS, ON THE AVERAGE, THAN THE
REACTION TIME OF WOMEN DRIVERS OF THE SAME AGE.

Though there may be some men whose reaction time is greater than many women of the same age,
the researcher claims that, on the whole, men tend to react more quickly than women.


The above hypothesis is now precise, but it is not the hypothesis that would be tested in practice.
With a little thought you can see that there are three possible hypotheses for the situation.

(a) THE REACTION TIME OF MEN DRIVERS IS [LESS], ON THE AVERAGE, THAN THE
REACTION TIME OF WOMEN DRIVERS OF THE SAME AGE.

(b) THE REACTION TIME OF MEN DRIVERS IS [NO DIFFERENT], ON THE AVERAGE,
FROM THE REACTION TIME OF WOMEN DRIVERS OF THE SAME AGE.

(c) THE REACTION TIME OF MEN DRIVERS IS [GREATER], ON THE AVERAGE, THAN
THE REACTION TIME OF WOMEN DRIVERS OF THE SAME AGE.

These three alternatives cover all the possibilities, though only one of them is true, of course.

Imagine yourself as an uncommitted and fair researcher. Which hypothesis appears to be the most
unbiased and impartial one for testing?

Hypothesis (b). A woman driver might prefer to see hypothesis (c) proven and a man driver might
prefer (a). However, hypothesis (b) is the only impartial one of the three alternatives.

Let's turn to the second hypothesis in Frame 1.

INDEPENDENT LEARNING IS [MORE] EFFECTIVE THAN TEACHER-BASED LEARNING.

As in the driving skill example, this hypothesis is one of three alternatives. Write down the other two.

(a) INDEPENDENT LEARNING AND TEACHER-BASED LEARNING ARE [EQUALLY] EFFECTIVE.

(b) INDEPENDENT LEARNING IS [LESS] EFFECTIVE THAN TEACHER-BASED LEARNING.

You may have worded your answers in a different way, especially (a).

7 Which is the most suitable hypothesis, from the three alternatives, to adopt for testing?

INDEPENDENT LEARNING AND TEACHER-BASED LEARNING ARE EQUALLY EFFECTIVE.

It is the most suitable because the wording does not express favour towards either way of learning.

Even though we are very often looking for definite differences (say differences in driving skill between
men and women or differences in the effectiveness of independent and conventional learning) we should
be impartial, sceptical even, and assume there are no differences, on the average, until evidence
convinces us otherwise.

In general the impartial, 'no difference' hypothesis is called the NULL HYPOTHESIS and is written
as H0 for short (pronounced 'H-nought'). It is this hypothesis that is tested statistically (often with
a view to its rejection).

Write down as precisely as you can the Null Hypothesis, H0 , that would be tested for each of (iii)
and (iv) in Frame 1.

(iii) H0 : The addition of fluoride in drinking water has [no effect] on the incidence of tooth decay.

(iv) H0 : The use of soft drugs makes [no difference] to the use of hard drugs.

Again your answers may be phrased differently.

If we return to the driving skill controversy, we can test the truth of H0 for this example by applying
the Wilcoxon Rank Sum Test to sample data. The test is useful for both large and small samples,
but for simplicity we shall use small samples to begin with. Imagine that four men and six women
of the same age have their reaction times measured. This can be done by playing a single note
over headphones. The pitch of the note suddenly changes and the subject must press a button as
soon as possible after hearing the change. The time taken between the pitch change and the
pressing of the button is the reaction time.

The following results, in milliseconds, are obtained:

Males 270, 370, 320, 280.

Females 410, 340, 290, 400, 380, 450.

Remember that the researcher hopes to show that men drivers have smaller reaction times than
women drivers (see Frame 4), but, for the time being, he is assuming the Null Hypothesis - that
there is 'no difference' between the sexes.

10 Procedure

The ten measurements are set out in ascending order like this:

         M    M    F    M    F    M    F    F    F    F
Times   270  280  290  320  340  370  380  400  410  450

This is called RANKING the measurements. Then each measurement is given its RANK order.
The smallest reaction time has a rank of 1, and, as there are ten measurements, the largest has
a rank of 10. We now get:

         M    M    F    M    F    M    F    F    F    F
Times   270  280  290  320  340  370  380  400  410  450
Ranks     1    2    3    4    5    6    7    8    9   10

When the results are set out like this we can see that 3 out of the 4 M's are in the left half of the row.
Now if the four men had ranks 1, 2, 3 and 4 (i.e. first, second, third and fourth places) then most
people would feel that men drivers almost certainly have lower reaction times than women. There
would be less certainty if they had ranks 1, 2, 3 and 5 and our results give ranks 1, 2, 4 and 6.
Can we still conclude that men have lower reaction times or do we give the benefit of the doubt to
the Null Hypothesis?

To answer this question we combine the four ranks by simply adding them together. The total is
called the rank sum and is given the symbol W (after Wilcoxon, the originator of the test). What is
the rank sum W for the four men's results?

1 + 2 + 4 + 6 = 13. The rank sum of this sample is W = 13.
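
The ranking procedure of this frame can also be written out as a short program. The sketch below is only an illustration in Python: the function name `rank_sum` is ours, and it assumes that no two measurements are equal, as is the case for these ten reaction times.

```python
def rank_sum(sample, other):
    """Rank sum W of `sample` once it is pooled with `other`.

    Assumes that no two measurements are equal (no tied ranks).
    """
    pooled = sorted(sample + other)          # ascending order, rank 1 first
    return sum(pooled.index(x) + 1 for x in sample)


males = [270, 370, 320, 280]
females = [410, 340, 290, 400, 380, 450]

w = rank_sum(males, females)
print(w)  # 13
```

Calling `rank_sum(females, males)` instead gives the rank sum of the six women's results from the same pooled ranking.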

11 Now if H0 is true (that the reaction times of men and women are, on average, the same), then we
would expect an even mixture of M's and F's along the ranks with no 'bunching' at either end.

If, however, we find that the M's are excessively bunched at, for instance, the low end of the scale,
then we would have to reject H0 and conclude instead that men's reaction times are shorter than
women's. This bunching would reveal itself in a low value for W.

On the other hand a high value of W would be an indication of excessive bunching of the M's at the
high end of the scale. What could we conclude about men drivers' reaction times in such a situation?

Men drivers have longer reaction times than women.

12 To be consistent with H0, then, we expect W to be neither very low nor very high. In order to
make a judgement about what is 'low' and what is 'high', we need to find what is the lowest possible
value for W (and the highest).

What is the lowest possible value for the rank sum W, of the four M's, in general?

The lowest possible value is W = 10. W = 1 + 2 + 3 + 4 = 10. This will happen when the results
are arranged

         M    M    M    M    F    F    F    F    F    F
Ranks     1    2    3    4    5    6    7    8    9   10

13 What is the highest possible value for the rank sum of the four M's?

W = 7 + 8 + 9 + 10 = 34.
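
The same reasoning gives the lowest and highest possible rank sums for any pair of sample sizes: the lowest uses the first n ranks and the highest uses the last n ranks. The sketch below is only an illustration in Python; the function name `rank_sum_bounds` and the use of n for the sample being summed and m for the other sample are our own labels here.

```python
def rank_sum_bounds(n, m):
    """Lowest and highest possible rank sum of a sample of size n
    pooled with another sample of size m (no tied ranks)."""
    lowest = sum(range(1, n + 1))            # ranks 1, 2, ..., n
    highest = sum(range(m + 1, m + n + 1))   # ranks m+1, ..., m+n
    return lowest, highest


print(rank_sum_bounds(4, 6))  # (10, 34)
```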

14 This means that, in general, W can take any value from W = 10 up to W = 34. If the researcher
really wants to demonstrate that men drivers have better reactions than women drivers then he would
be interested in obtaining a fairly low value for W. But how low?

[Figure: number line of W from 10 to 34, with a 'reject' region and the 'values of W consistent with H0' marked.]

Clearly a line must be drawn somewhere. We must identify a definite CRITICAL VALUE for W,
so that when a rank sum is obtained that is smaller than this value, then H0 is to be rejected.

Statisticians have compiled tables that give these critical values for various sample sizes. In our
example (sample of 4 M's and 6 F's) we can say that:

if W is 13 or less

then H0 is rejected and we conclude that men have better reactions than women. (The reason why
the critical value of W is 13 will be given shortly.)

Should W have any value greater than 13, then, we do not reject H0 , unless, of course, it is near
to 34, the highest possible value. However, for low values of W, the range of values

W = 10, 11, 12 or 13

is called the REJECTION REGION for H0, because obtaining one of them entitles us to reject the
impartial H0 with reasonable confidence.

15 Returning to the value of the rank sum obtained by the researcher, W = 13, it can be seen that this
value is within the rejection region. The researcher has obtained an unacceptably low value for the
rank sum. He must therefore reject the 'no difference' H0, and turn to the alternative hypothesis
that men drivers do have better reactions than women drivers.

Two things should be pointed out at this stage. Firstly, we cannot reject H0 with absolute certainty.
It is only highly unlikely that H0 is true. Secondly, the data is fictitious, and, in fact, the insurance
companies make no distinction between the sexes in estimating premiums for car-drivers. Their
only considerations are the age of the driver and the type of car.

16 In the last example, the researcher was concerned with the rejection region W = 10, 11, 12 or 13.
However, a second researcher may wish to demonstrate the hypothesis that men drivers have greater
reaction times than women drivers. Would this second researcher adopt the same Null Hypothesis
as the first?

Yes - he too must assume that there is 'no difference' until evidence indicates otherwise.
17 The main difference between the two researchers is that the second hopes to obtain a high value of W
from his sample. In fact,

if W is 31 or more

then H0 is rejected.

This range of values is called the UPPER REJECTION REGION and W = 31 is called the UPPER
CRITICAL VALUE. This critical value has also been obtained from tables.

[Figure: number line of W from 10 to 34, with the UPPER CRITICAL VALUE and the UPPER REJECTION REGION marked.]

The following results in milliseconds are obtained by the second researcher (using different subjects
from the first).

Males 380, 430, 330, 420.

Females 370, 350, 410, 320, 260, 290.

Rank the measurements and find the rank sum of the 4 M's. (See Frame 10 if you have any
difficulty.)

         F    F    F    M    F    F    M    F    M    M
Times   260  290  320  330  350  370  380  410  420  430
Ranks     1    2    3    4    5    6    7    8    9   10

W = 4 + 7 + 9 + 10 = 30.

18 Can the second researcher reasonably reject the Null Hypothesis from his evidence?

No. The value of W obtained (W = 30) does not fall within the upper rejection region.
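
The decisions reached by the two researchers can be summarised as a simple comparison of W with the critical values. The sketch below is only an illustration in Python; the names `LOWER_CRITICAL`, `UPPER_CRITICAL`, `rank_sum` and `decide` are ours, and the values 13 and 31 are simply those quoted in the text for samples of 4 and 6 measurements.

```python
# Critical values quoted in the text for samples of 4 and 6 measurements.
LOWER_CRITICAL = 13
UPPER_CRITICAL = 31


def rank_sum(sample, other):
    """Rank sum of `sample` in the pooled ranking (assumes no ties)."""
    pooled = sorted(sample + other)
    return sum(pooled.index(x) + 1 for x in sample)


def decide(w, tail):
    """Compare W with the lower or the upper rejection region."""
    if tail == "lower":
        reject = w <= LOWER_CRITICAL
    elif tail == "upper":
        reject = w >= UPPER_CRITICAL
    else:
        raise ValueError("tail must be 'lower' or 'upper'")
    return "reject H0" if reject else "do not reject H0"


# Second researcher's data (Frame 17): he hopes for a high value of W.
males = [380, 430, 330, 420]
females = [370, 350, 410, 320, 260, 290]
w_second = rank_sum(males, females)

print(w_second, decide(w_second, "upper"))  # 30 do not reject H0
print(13, decide(13, "lower"))              # 13 reject H0
```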

19 In less technical language, this means that the second researcher's evidence is not strong enough for
him to doubt the hypothesis that men and women drivers do have the same reaction times, on the
average, despite differences from person to person.

If the second researcher had obtained a value W = 32, say, from his sample, then the Null Hypothesis
would be rejected by him. What alternative hypothesis should he then adopt?

That men drivers have greater reaction times than women drivers.

Exercise (The answers are on page 21.)

1. Is traffic less on Sundays than on Weekdays? Write down the Null Hypothesis for this question
in your answer book.

Here is some sample data to enable you to answer the above question.

(a) For Winter 1971, 6 Weekdays and 4 Sundays were chosen at random. The volume of
traffic (in millions of vehicle kilometres) for each of these days was as follows:

Weekday traffic (We) 436, 429, 440, 413, 444, 452.

Sunday traffic (S) 431, 424, 411, 392.

(Taken from 'Traffic census results for 1971' by J.B. Dunn.)

Rank the ten measurements in ascending order and find the rank sum W of the
4 S results. Use Frame 14 to decide what conclusion you can draw about

(i) rejecting or not rejecting H0,

(ii) the volume of Sunday traffic compared with the volume of Weekday traffic in Winter.

(b) However, in Summer 1971 the traffic situation is different. Again we compare 6 days
of Weekday traffic with 4 days of Sunday traffic.

Weekday traffic (We) 553, 546, 534, 517, 519, 545.

Sunday traffic (S) 562, 549, 548, 521.

Rank the ten measurements in ascending order and find the rank sum W of the 4 S results.

What conclusion can you draw about

(i) rejecting or not rejecting H0,

(ii) the relative volumes of Sunday and Weekday traffic in Summer?

2. In 1965, Capital Punishment was abolished in Great Britain. Some people are concerned that
the removal of such a deterrent causes an increase in the murder rate.

Home Office figures for the number of murder victims in England and Wales during 6 years
before 1965 and 4 years after 1965 are given below.

Sample A (pre '65)

("Murder 1957 to 1968", Home Office Statistical Division Report on Murder in England and Wales)

Year             1958  1959  1960  1961  1962  1963

No. of victims    144   135   123   118   129   122

Sample B (post '65)

("Social Trends", Central Statistical Office)

Year             1968  1969  1970  1971

No. of victims    148   119   136   177

(a) If one is contending that removal of the death penalty has led to an increase in murder
victims, what would be the Null Hypothesis?

(b) Rank the measurements and find the rank sum W of the 4 B results.

(c) What conclusion can you draw, from this evidence, about the probable effect of removing
Capital Punishment upon the murder rate?
3. A sample of four measurements (Sample A) is compared with a sample of six measurements
(Sample B) by using the Wilcoxon Rank Sum Test. It is found that the rank sum W of the
four A results is W = 15. What is the rank sum of the six B results?

(Hint: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55.)

20 The mathematics of the Wilcoxon Rank Sum Test will now be investigated in order to justify the
critical values W = 13 and W = 31.

We found that the lowest possible value for W is W = 10 when the four measurements in the smaller
sample have ranks 1, 2, 3, 4. No other set of four ranks will give a rank sum of 10. There is
also only one way to obtain a rank sum of W = 11. It could only have come from ranks 1, 2, 3, 5.

However, to obtain a sum W = 12 there are two ways to achieve this sum, namely from ranks
1, 2, 3, 6 or 1, 2, 4, 5. (Can you think of any more?)

There are three ways to obtain a sum of W = 13. Can you find them?

1, 2, 3, 7 or 1, 2, 4, 6 or 1, 3, 4, 5. All add up to W = 13.

21 I There are five ways to obtain a sum W = 14. What are they?

1, 2, 3, 8 or 1, 2, 4, 7 or 1, 2, 5, 6 or 1, 3, 4, 6 or 2, 3, 4, 5.

22 We could continue like this right up to W = 34. Here is an extract from all the possibilities:

W = 10 1 way
W = 11 1 way
W = 12 2 ways
W = 13 3 ways
W = 14 5 ways

W = 30 5 ways
W = 31 3 ways
W = 32 2 ways
W = 33 1 way
W = 34 1 way

210 = total number of ways

(Note for mathematicians - the total number of ways of selecting four numbers from ten is 10C4 = 210.)
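
The counts in this frame can be checked by listing every way of choosing four ranks from the numbers one to ten. The sketch below is only an illustration in Python; the names `ways` and `total` are ours.

```python
from collections import Counter
from itertools import combinations

# Every possible set of four ranks out of 1..10, tallied by its rank sum W.
ways = Counter(sum(ranks) for ranks in combinations(range(1, 11), 4))
total = sum(ways.values())

print(total)                               # 210
print([ways[w] for w in range(10, 15)])    # [1, 1, 2, 3, 5]
print([ways[w] for w in range(30, 35)])    # [5, 3, 2, 1, 1]
```

The two printed lists are mirror images of each other, which is why the lower and upper ends of the table in this frame match.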

23 Out of these 210 possible ways of choosing 4 numbers from the numbers one to ten, we have

 1 way  to obtain W = 10
 2 ways to obtain W = 10 or W = 11 .................................. (1+1)
 4 ways to obtain W = 10, W = 11 or W = 12 .......................... (1+1+2)
 7 ways to obtain W = 10, W = 11, W = 12 or W = 13 .................. (1+1+2+3)
12 ways to obtain W = 10, W = 11, W = 12, W = 13 or W = 14 .......... (1+1+2+3+5)

So the probability of obtaining a rank sum as low as W = 10 is 1/210. (This is because there is
1 way to obtain W = 10 out of the 210 ways altogether.) This probability is very nearly ½% and so,
if H0 is true, obtaining W = 10 is not very likely (i.e. once in 210).

What is the probability of obtaining a sum of W = 10 or W = 11, if H0 is true?

2/210. There are two ways to obtain W = 10 or W = 11 out of a total of 210 ways altogether.
The fraction is almost 1%.

24 We can represent the situation in diagram form. The values of W (from W = 10 to W = 34) are
drawn horizontally and the vertical lines represent the 'number of ways' to obtain each W value
(see Frame 22).

[Figure: diagram of the number of ways of obtaining each value of W, from W = 10 to W = 34.]

What is the probability, in each case, of obtaining a sum W in the regions

(a) W = 10, W = 11 or W = 12,

(b) W = 10, W = 11, W = 12 or W = 13,

(c) W = 10, W = 11, W = 12, W = 13 or W = 14? (Leave your answer in fraction form)

(a) 4/210, this is almost 2%.

(b) 7/210, this is almost 3.5%.

(c) 12/210, this is almost 6%.
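
These probabilities can be obtained by counting how many of the 210 equally likely sets of four ranks give a rank sum no greater than each value of W. The sketch below is only an illustration in Python; the names `rank_sums`, `total` and `ways_at_most` are ours.

```python
from itertools import combinations

# Rank sums of all 210 equally likely sets of four ranks out of 1..10.
rank_sums = [sum(ranks) for ranks in combinations(range(1, 11), 4)]
total = len(rank_sums)


def ways_at_most(w):
    """Number of sets of four ranks whose rank sum is w or less."""
    return sum(1 for s in rank_sums if s <= w)


for w in range(10, 15):
    ways = ways_at_most(w)
    print(f"W <= {w}: {ways}/{total} = {ways / total:.1%}")
# W <= 10: 1/210 = 0.5%
# W <= 11: 2/210 = 1.0%
# W <= 12: 4/210 = 1.9%
# W <= 13: 7/210 = 3.3%
# W <= 14: 12/210 = 5.7%
```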


25 Armed with these calculated probabilities, we are in a position to justify the critical value we have
been using, namely W = 13. We saw that we would be suspicious of very low values of W under
the Null Hypothesis H 0 of 'no difference'.

To obtain a rank sum as low as W = 10 (with H 0 being true) is highly unlikely. The probability of
its happening purely by chance is less than ½%. If we do obtain a rank sum as low as W = 10
in an actual experiment then we would argue that this is so unlikely to have happened by pure chance,
that we should suspect some other factor being at work instead. In other words, so small a rank
sum is much more likely to be due to men's ability to react more quickly than women in an emergency
than to the Null Hypothesis that there is 'no difference' between the sexes.

To obtain a rank sum in the region W = 10 or W = 11 is also very unlikely. The probability of its
happening purely by chance is less than 1%.

To obtain a rank sum in the range W = 10, W = 11 or W = 12 may also be considered unlikely.
The probability that it will happen purely by chance is less than 2%.

Where do we stop? When can we say that an event is not unlikely? A subjective judgement must
be made here. In many statistical tests it is customary to regard an event that has a probability of
5% or less as 'too unlikely to have happened purely by chance'. It is therefore presumed that some
other factor is influencing the results, and that the Null Hypothesis must be abandoned. In other
tests the line might be drawn at 2.5%, at 1% or even at 0.1%. But we shall adopt 5% as our
criterion for the present.

26 When we adopt as our criterion a probability of 5% or less as being too unlikely to have happened by
pure chance, we can see from Frame 24 that to obtain a value for W so low that it falls within the
region W = 10, 11, 12 or 13 is 'too unlikely', since the probability of this is almost 3.5%. To obtain a
value of W in the region W = 10, 11, 12, 13 or 14, however, is not, by our criterion, 'too unlikely
to have happened by pure chance', since this region has a probability of almost 6%. So W = 14 cannot be
included in the rejection region. The lower rejection region is therefore, by our criterion,

W = 10, 11, 12 or 13. (W = 13 is the critical value)

It must, however, be emphasised that the choice of 5% as the criterion is an arbitrary and a
subjective choice. Different criteria for 'unlikely' events are used according to circumstances.

Suppose, now, that we tighten up our criterion. What effect might this have upon the lower rejection
region? Suppose, for example, that we decide to consider an event to be 'too unlikely to have
happened by pure chance' only if it has a probability of 2.5% or less. What would be the lower
rejection region by this more exacting criterion? (See Frame 24)

W = 10, W = 11 or W = 12. From Frame 24, the probability of obtaining a value for W in this
region is almost 2%, which is less than 2.5%. By this criterion the critical value of W is lowered from
13 to 12.

27 These small probabilities which we choose to adopt to define an 'unlikely' event are called
SIGNIFICANCE LEVELS. We say "at the 5% significance level, the lower critical value of the
rank sum in the Wilcoxon Rank Sum Test is W = 13". Should a researcher be contending that men
have worse reactions than women (thus looking for a high value of W in the test) then he would need
the upper critical value of W. This is W = 31 at the 5% significance level.

[Figure: diagram of the number of ways of obtaining each value of W, with the lower and upper critical values marked.]

If a rank sum of, say, W = 29 had been obtained, high though it certainly is, it is not significantly
high. The Null Hypothesis of 'no difference' would still allow such a value to be obtained by
random variations.

Suppose, however, that we tighten up our criterion, as we did at the end of the previous frame.
What would be the upper critical value of W at the 2.5% significance level?

W = 32. The rejection region is W = 32, 33, 34, since this region has less than a 2% probability
of happening purely by chance.

28 Suppose we tighten up still further our criterion for rejecting the Null Hypothesis. Suppose we
decide to accept as 'unproven' any value of W which could happen by pure chance with a probability
of 1% or more. Suppose, in other words, we decide to regard as 'significant' rank sums which
could only arise by pure chance less than one in a hundred times. What then, at 'the 1% significance
level' is

(a) the lower rejection region for W? (b) the upper rejection region for W?

(a) W = 10 or 11. (b) W = 33 or 34.
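
The critical values at the 5%, 2.5% and 1% levels can all be found by the same counting argument: take the largest value of W whose lower region still has a probability no greater than the chosen significance level. The sketch below is only an illustration in Python; the function names are ours, the method lists every possible set of ranks and so is practical only for small samples, and the upper value is found from the symmetry of the rank sums about n(n + m + 1)/2.

```python
from fractions import Fraction
from itertools import combinations


def lower_critical_value(n, m, level):
    """Largest w with P(W <= w) <= level when H0 is true.

    n is the size of the sample being summed, m the size of the other.
    Returns None if even the lowest rank sum is too likely.
    """
    sums = sorted(sum(c) for c in combinations(range(1, n + m + 1), n))
    total = len(sums)
    critical = None
    for w in range(sums[0], sums[-1] + 1):
        tail = sum(1 for s in sums if s <= w)
        if Fraction(tail, total) > level:
            break
        critical = w
    return critical


def upper_critical_value(n, m, level):
    """Mirror image of the lower critical value."""
    lower = lower_critical_value(n, m, level)
    return None if lower is None else n * (n + m + 1) - lower


for percent in ("5", "2.5", "1"):
    level = Fraction(percent) / 100   # exact, so 5% means exactly 1/20
    print(percent, lower_critical_value(4, 6, level),
          upper_critical_value(4, 6, level))
# 5 13 31
# 2.5 12 32
# 1 11 33
```

Exact fractions are used instead of decimals so that a region whose probability equals the significance level exactly is not misjudged through rounding.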

29 The Wilcoxon Rank Sum Test is used to judge the significance, or otherwise, of the differences
between two 'samples'. The samples may be, as we have seen, the reaction times of four men and
of six women. They could be samples of temperature measurements, as we shall soon be working on.
Or they could be samples of milk yields of different herds of cattle. Indeed samples of any two
measurable quantities which bear comparison can be assessed for significant differences between them
with the help of the Wilcoxon Rank Sum Test.

So far we have been comparing a sample of 4 measurements with a sample of 6 measurements, and,
at the 5% level of significance, we found that W = 13 was the lower critical value of the rank sum
of the four measurements and 31 was the upper critical value. Clearly we need to be able to apply
the Wilcoxon Rank Sum Test to samples of other than 4 and 6 measurements. The tables on page 24
will help us to do so. Spend a minute or two looking at them.

You see that 'n' is used for the number in the smaller sample, and 'm' for the number in the larger
sample. So in all the cases we have examined so far n = 4 and m = 6. Find n = 4 on the top line.
Take your eye down the column until you are opposite m = 6. There you see 13 and 31, the lower
and upper critical values of the Rank Sum of the n (the smaller) measurements.

[Table extract: columns n = 2 to 7 and rows m = 4, 5, 6, with the entry 13, 31 at n = 4, m = 6.]

What at the 5% level of significance are the lower and upper critical values of W for two samples
of 5 and 7 measurements respectively?

W = 21 (lower) and W = 44 (upper).

30 What are the critical values of W if two samples are taken, one of size 10 and one of size 18?

W = 110 and W = 180.

31 "Lovely weather for the time of year ..."

Whatever the weather may be at the moment, it is almost certainly different on the other side of the
world.

How does the weather in New Zealand compare with British weather? Although one usually thinks
of Pacific Islands in terms of palm-trees and lagoons, in fact New Zealand is only a little nearer
the equator than Great Britain.

A meteorologist wishes to test the hypothesis that the January temperatures in New Zealand are greater
than the June temperatures in Great Britain. (Remember that January is a summer month in the
Southern Hemisphere.)

What is the Null Hypothesis?

H0 : There is no difference between the January temperatures in New Zealand and the June
temperatures in Great Britain.

32 In order to test this Null Hypothesis, we shall use data collected during 1951-1960 at the Greenwich
Observatory, London, and at Christchurch, New Zealand. At each weather centre, the average
monthly temperature is recorded.

Year                               1951  1952  1953  1954  1955  1956  1957  1958  1959  1960

June temperature, London           15.5  16.3  15.3  14.7  15.3  14.2  17.1  14.9  16.4  16.8

January temperature, Christchurch  15.1  15.4  15.2  16.5  17.6  19.6  17.7  15.6  17.4  17.3

(Source - 'World Weather Records' U.S. Department of Commerce.)

Rank the measurements and find the rank sum W of the London sample. (Note - two of the London
temperatures are the same. If they were slightly different they would have had ranks 6 and 7,
so assign ranks of 6½ to each.)

(L = London, Z = New Zealand)

          L    L    L    Z    Z    L    L    Z    L    Z    L    L    Z    L    L    Z    Z    Z    Z    Z
       14.2 14.7 14.9 15.1 15.2 15.3 15.3 15.4 15.5 15.6 16.3 16.4 16.5 16.8 17.1 17.3 17.4 17.6 17.7 19.6
Ranks     1    2    3    4    5   6½   6½    8    9   10   11   12   13   14   15   16   17   18   19   20

Rank sum of London sample = 1 + 2 + 3 + 6½ + 6½ + 9 + 11 + 12 + 14 + 15 = 80.

(Rank sum of New Zealand sample = 130.)
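
The ranking with tied values can be checked with a short Python sketch. The function name `average_ranks` is a hypothetical choice for this illustration; it gives tied measurements the average of the ranks they would otherwise share, as described above, and is applied to the temperatures from the table.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2  # ranks are counted from 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


london = [15.5, 16.3, 15.3, 14.7, 15.3, 14.2, 17.1, 14.9, 16.4, 16.8]
christchurch = [15.1, 15.4, 15.2, 16.5, 17.6, 19.6, 17.7, 15.6, 17.4, 17.3]

ranks = average_ranks(london + christchurch)
w_london = sum(ranks[:len(london)])
w_christchurch = sum(ranks[len(london):])
print(w_london, w_christchurch)  # 80.0 130.0
```

The two rank sums must add up to 1 + 2 + ... + 20 = 210, which gives a quick check on the arithmetic.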

33 Turn to the tables on page 24 and identify the lower critical value of W (for n = 10 and m = 10).
What conclusion can you draw about the Null Hypothesis, and then interpret this in terms of British
and New Zealand weather?

[Diagram: distribution curve of W with the lower rejection region shaded up to W = 82.]

Lower critical value is W = 82. Thus the value of W obtained is significantly low (at the 5%
significance level). We can thus reject H0 as being unlikely and prefer the alternative hypothesis
that January temperatures in New Zealand are higher than June temperatures in London. (The
individual bars of the previous diagrams have been 'smoothed out' in drawing the above curve.)

34 We concluded that the temperature recordings made in the two cities are significantly different from
each other. In other words, the differences are 'unlikely' to be due simply to random fluctuations
around temperatures which are basically equal. But we could be wrong.

In general, we reject H0 because W turns out to be significantly high or significantly low. It falls
within a 'rejection region'. But we cannot reject H0 with absolute certainty. As an analogy,
imagine that 12 throws of a single coin produce 11 heads. Most people would reject the (null) hypothesis
that the coin is an unbiased one, because of the significantly high number of heads produced. They
would conclude that the coin has a bias towards heads. The argument would be that an unbiased coin
is very unlikely to produce 11 or more heads from 12 throws. When a certain coin does behave like
this, it seems more reasonable to suppose that the coin is biased towards heads, rather than that the
coin is actually unbiased, but has behaved in an unlikely way. Nevertheless, we should bear in mind
the small possibility that the coin is actually unbiased and that a very unlikely event has happened.

Similarly, when we reject H0, we should always bear in mind the possibility that we are wrong, and
that, instead, a very unlikely event has happened.
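
The coin analogy can be made concrete with a short calculation. The sketch below assumes an unbiased coin, so that all 2^12 sequences of 12 throws are equally likely; the function name `prob_at_least` is a hypothetical choice for this illustration.

```python
from fractions import Fraction
from math import comb


def prob_at_least(heads, throws):
    """P(at least `heads` heads in `throws` throws of an unbiased coin)."""
    favourable = sum(comb(throws, k) for k in range(heads, throws + 1))
    return Fraction(favourable, 2 ** throws)


p = prob_at_least(11, 12)
print(p, float(p))  # 13/4096, roughly 0.003
```

So an unbiased coin produces 11 or more heads from 12 throws only about 3 times in 1000, which is why the result is regarded as evidence of bias.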

35 When, in the weather example, we obtained the 'significant' value of W = 80, we should bear in mind,
as we should whenever we use the Wilcoxon Rank Sum Test, that the value of W can be 82 or smaller
and H0 can still be true, though, of course, the probability of this happening when H0 is true is no more than 5%.

W = 82

The differences in temperature could be due to mere chance, though this is unlikely. But we can
put a definite figure on the chance that we have wrongly rejected H0. The chance is no more
than 5%. (You may find it useful at this point to read from Frame 25 again.)

36 If a second researcher adopts a 2.5% significance level, then he is being stricter than the first who
adopted a 5% level. This is because he has a smaller chance of being wrong if he rejects H0
(his chance of being wrong now is less than 2.5%).

However, the price he must pay for this extra certainty is that, before he can reject H0, his evidence
must be stronger.

Will the lower critical value of W at the 2.5% level be greater or smaller than the lower critical value
at the 5% level?

Smaller.

37 The diagrams below illustrate the reason why the 2.5% critical value is lower than the 5% value.
You should note, too, that for samples of 10 and 10, the lower critical value for W at the
2.5% significance level is 78.

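[Diagrams: the distribution of W with the lower rejection region shaded, first at the 5% level (W = 82 or less), then at the 2.5% level (W = 78 or less, with W = 82 marked for comparison).]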

The diagrams also illustrate the fact that the value of W obtained (W = 80), while being significantly
low at the 5% significance level, is not significantly low at the 2.5% level.

To the second researcher, the evidence is not strong enough to convince him that New Zealand is
warmer in the summer than Great Britain. He regards the differences in temperature as being
due to mere random effects. He does not think they are large enough to be explained as being due
to some real difference between the climates of the two countries.

38 This may puzzle you. One researcher has rejected H0. The other has not rejected H0. So who
is correct? The answer is that we cannot tell for sure. Statistical techniques cannot prove
anything. They can only give probabilities. The first researcher (the 5% man) has come to a
definite conclusion. He may be wrong. The chance that he is wrong is 5%. The second has not
really reached any positive conclusion. Being more cautious than the first, he is not prepared to
discard H0 until the evidence is more convincing. He is only accepting H0 by default, since the
evidence has not convinced him that he should reject it. HE HAS NOT PROVED THAT H0 IS TRUE,
OR EVEN THAT H0 IS 97.5% CERTAIN TO BE TRUE.

39 A value, W = 80, does not point to any significant difference in temperature at the 2.5% level.
Equally, it does not point to any significant similarity. After all, W = 80 is ominously close to the
critical value of W = 78, so any statement like "I am 97.5% certain that H0 is true", if made by the
second researcher would be an incorrect interpretation of the evidence at his command.

40 Unfortunately, statistical techniques are abused as much as they are used, and all too often incorrect
conclusions are drawn from data. The following guidelines should help you to avoid making errors
of judgement.

In general, a statistical test leads to a REJECTION of, or an ACCEPTANCE of the Null Hypothesis.
The decision is determined by the value of W calculated at the significance level chosen by the
research worker.

(a) REJECTION

If W falls within the rejection region then H0 is rejected, and an alternative hypothesis is
adopted. The significance level chosen represents the probability that H0 was wrongly
rejected. It is the probability that, when H0 is actually true, an unlikely event has
happened and W has fallen within the rejection region. Accordingly, the smaller the chosen
significance level is, the more convincing is the rejection of H0 when the evidence for
rejection arises.

(b) ACCEPTANCE

If the value of W does not fall within the rejection region then we must 'accept' H0.
If W is very near to the rejection region, the acceptance will be a grudging one, an acceptance
by default. A better term than 'acceptance' of H0 is 'non-rejection' of H0 to reflect the fact
that, in many cases, the value of W may be very close to the rejection region. In such cases,
'non-rejection' means that the data almost leads to a rejection of H0 (but not quite). Hence a
typical conclusion might be "the value of W obtained is not significant at the 5% level so the
evidence is not strong enough to reject the Null Hypothesis that ...".
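
The two outcomes (a) and (b) can be written as a small decision rule. This is only a sketch: the function name `decide` and the wording of the returned messages are hypothetical, and the critical values must still be taken from tables for the chosen significance level. A value of W equal to a critical value is treated as falling within the rejection region.

```python
def decide(w, lower_critical, upper_critical=None):
    """Return the outcome of the test for an observed rank sum w."""
    if w <= lower_critical:
        return "reject H0 (W significantly low)"
    if upper_critical is not None and w >= upper_critical:
        return "reject H0 (W significantly high)"
    return "do not reject H0"


# Weather example: W = 80 from samples of 10 and 10.
print(decide(80, lower_critical=82))  # 5% level
print(decide(80, lower_critical=78))  # 2.5% level
```

The same observed value leads to rejection at the 5% level and to non-rejection at the 2.5% level, as the two researchers found.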

41 Let's run through the 'logical' arguments involved in statistical testing once more. Then you will
have a chance to gain some practice in the exercise that follows.

Very often a researcher is looking for definite differences between two samples of measurements.
Despite this he assumes a Null Hypothesis, which states that there is 'no difference' between the two
samples other than random effects. Only when the evidence of the statistical test is strong enough
can he reject this hypothesis in favour of some alternative.

For example, a farmer may be hoping that a certain additive to the diet of his dairy cows will
increase milk yield.

(i) Null Hypothesis

H0: The additive makes no difference to the milk yield.

(ii) Collect the sample data

The farmer fed the additive to 4 cows (sample A) but not to 6 other cows (sample B).
The milk yields, in litres, during one day were as follows:

Sample A: 21, 17, 22, 19

Sample B: 16, 14, 20, 10, 13, 11

(iii) Apply the statistical test

Using the Wilcoxon Rank Sum Test, the 10 measurements are ranked in ascending order
and the rank sum of the smaller sample is calculated. The value obtained is

W = 32 (n = 4, m = 6).

(iv) Interpret the result

The researcher would conclude that this value is significantly high at the 5% level. This
means that, if H0 is true, the probability of obtaining this value of W or a higher one is
smaller than 5%. Since 5% is generally regarded as a small probability it is more likely
that H0 is not true. So it can be rejected.

Thus the additive seems to increase the yield of milk.

If, say, W = 29 had been obtained, H0 would not be rejected. This does not mean
that H0 is likely to be true, but rather that the results from the samples are not quite
convincing enough to reject the Null Hypothesis.
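
The figures in this example can be checked by listing every way of choosing 4 ranks from the 10 available. The sketch below assumes, as in the text, that all such choices are equally likely when H0 is true; the variable names are hypothetical choices for this illustration.

```python
from itertools import combinations

sample_a = [21, 17, 22, 19]          # cows given the additive
sample_b = [16, 14, 20, 10, 13, 11]  # cows not given the additive

# No two yields are equal, so a value's rank is its position in sorted order.
pooled = sorted(sample_a + sample_b)
w = sum(pooled.index(x) + 1 for x in sample_a)

# Under H0 every choice of 4 ranks out of 1..10 is equally likely.
all_sums = [sum(c) for c in combinations(range(1, 11), len(sample_a))]
as_extreme = sum(1 for s in all_sums if s >= w)

print(w, as_extreme, len(all_sums))  # 32 4 210
```

Only 4 of the 210 equally likely rank sums are as high as 32, a probability of about 1.9%, which is indeed smaller than 5%.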

Exercise (The answers are on pages 22 and 23.)

1. In the Wilcoxon test for n = 3, m = 6, the lowest possible value for the rank sum of the
3 results in the smaller sample is W = 6, i.e. 1 + 2 + 3. The highest possible value is
W = 24, i.e. 9 + 8 + 7. The numbers of different ways in which some of the values of W can
be obtained are

W = 6     1 way    (1, 2, 3)
W = 7     1 way    (1, 2, 4)
W = 8     2 ways   (1, 2, 5 or 1, 3, 4)
W = 9     3 ways   (1, 2, 6 or 1, 3, 5 or 2, 3, 4)

W = 21    3 ways   (4, 8, 9 or 5, 7, 9 or 6, 7, 8)
W = 22    2 ways   (5, 8, 9 or 6, 7, 9)
W = 23    1 way    (6, 8, 9)
W = 24    1 way    (7, 8, 9)

Total    84 ways

If H0 is true, then these 84 ways are equally likely. Find the probability in each case that

(i) W = 6 (ii) W = 6 or 7 (iii) W = 6, 7 or 8.

What, then, would be the lower rejection region at (a) 5%, (b) 2.5% significance levels for
values of W? Check your 5% answer with the tables.

(Note: It may help you to know that 1/84 = 1.2%. So, for example, 5/84 = 5 × 1.2% = 6%.)

2. In using the Wilcoxon test for n = 3, m = 6, a rank sum W = 20 is obtained. Write true (T)
or false (F) for each of these conclusions.

(i) W = 20 is not significantly high at the 5% level. Therefore H0 is proved to be true.

(ii) W = 20 is not significantly high at the 5% level. Therefore I am 95% sure that H0
is true.

(iii) W = 20 is not significantly high at the 5% level. Therefore I cannot reject H0
on the evidence.

3. In using the Wilcoxon test for n = 3 and m = 6 again, a rank sum of W = 22 is obtained this time.
Write T or F for each of these conclusions.

(i) W = 22 is significantly high at the 5% level. Therefore H0 is proved to be false.

(ii) W = 22 is significantly high at the 5% level. Therefore I reject H0 though there is
a one in twenty chance that H0 is really true.

What conclusion can you draw about H0 from W = 22 at the 2.5% significance level? (See
question 1 to assess the significance of W = 22 at the 2.5% level.)

4. A standard 'memory recall' test was administered to a group of sixth-formers and to a group
of people over forty. This was to determine if the memory of middle-aged people is worse
than that of sixth-formers.

(a) Write down the Null Hypothesis for this situation.

The memory recall test involved the experimenter reading a list of twelve words
just once to each group. Each person had to write down immediately as many of
the dozen words that could be recalled. The number of words correctly recalled
was recorded and the results for both groups were as follows*:

Number of words recalled

Group A                Group B
(7 middle-aged)        (8 sixth-formers)
8, 11, 3, 5            7, 4, 10, 6
9, 0, 10               10, 2, 12, 11

(b) Rank all 15 results in ascending order. (See (c) below for dealing with ties.)

(c) Tied ranks. Where two or more results are the same you should allocate ranks
as if they were slightly different and then average out the ties. For example,

Results 0 , 3 , 4 , 5 , 5 , 5 , 6 , 8 , 10 , 10 , 12
Ranks   1 , 2 , 3 , 5 , 5 , 5 , 7 , 8 , 9½ , 9½ , 11

* If you have the time and inclination you may like to generate your own sample data. Write
down twelve two-syllable words and slowly read them once to each of your 'subjects' and ask
them to recite as many back to you as they can remember. Your two samples do not have
to be of size seven and eight, of course, though no more than about seven in each is recommended.

(d) When using the Wilcoxon Rank Sum Test you should always select the smaller size
sample for calculating W. So in this case, find the rank sum W of the seven
middle-aged people (i.e. sample A).

(e) Use the tables of critical values to decide whether you can reject H0 or not at the
5% significance level.

(f) What conclusion can you draw from the evidence about the change in recall ability
with age?

5. Most people are poor at estimating the passing of time. Depending upon the circumstances,
a minute or so can seem like eternity or like an instant. A psychologist asked a group of
people to sit silently for 200 seconds. He then asked each one to estimate how long the time
interval had been. He was trying to demonstrate that, in such circumstances, people tend to
overestimate the passing of time.

(a) Write down the Null Hypothesis for this situation.

Here are the results of his experiment.

Underestimates (seconds)        Overestimates (seconds)

30, 0, 20, 10                   35, 25, 5, 50
15, 30                          40, 50, 30

For instance, the first one in the underestimates group estimated the 200 second
interval to be 170 seconds. Now if H0 is true, then we expect the results between
the two groups to be approximately equal - the underestimates should 'balance' the
overestimates. We can use the Wilcoxon test to see if there is any significant
difference between the two groups.

(b) Rank the results and find the rank sum W of the smaller sized sample.

(c) Use the tables to decide whether you can reject H0 or not.

(d) What conclusion do you draw from the evidence about people's estimation of a
200 second time interval?

(e) Estimate the probability that you have drawn an incorrect conclusion from the
sample data.

(Again this is an experiment that you may wish to do for yourself. Use no more than
15 subjects.)

6. Write true (T) or false (F) for each of these statements.

(a) A certain value of W is found to be significantly high at the 5% significance level.
This means that it must be significantly high at the 1% significance level also.

(b) A certain value of W is found to be significantly high at the 1% significance level.
This means that it must be significantly high at the 5% level also.

(c) The probability of obtaining 10 or more heads from 12 throws of an unbiased coin
is smaller than 5%. (This is true.) So if I throw a coin 12 times and I obtain
10 heads then this is convincing enough evidence to conclude (at the 5% significance
level) that the coin is not unbiased, but is biased towards heads.

(d) When testing the effectiveness of the safety precautions in the design for a nuclear
power station it is preferable to use a 0.1% significance level than a 5% significance
level.

When you are satisfied that you have understood the material in this unit, read the Summary on pages
19 and 20. Then work through the post-test on page 25 without referring to the text. Next, check
your answers to the post-test against those provided inside the back cover. Finally, discuss your
work with your tutor, if you have one, and decide if you are ready to begin the next unit in the sequence.

SUMMARY

There are three main steps in applying a statistical test.

(i) Writing down the Null Hypothesis, H0.

(ii) Performing the test at a particular significance level.

(iii) Interpreting the results.

(i) The Null Hypothesis (written 'H0' and pronounced 'H nought')

This, in general, is an assumption of neutrality, of scepticism and of non-commitment. A researcher
may suspect that there is a definite difference between two groups, but to test this theory using the
evidence from sample data, a Null Hypothesis of no difference must be assumed.

Thus, despite contending that

(a) Lung cancer is more common amongst smokers than non-smokers,

or (b) Middle class people live longer than working class people,

the hypothesis that is actually tested in each case is

(a) Lung cancer is as common amongst smokers as non-smokers,

(b) Middle class people have the same life expectancy as working class people.

Besides putting the researcher into the role of an unbiased judge, the Null Hypothesis is, generally,
the simplest of all possible testing hypotheses in mathematical terms.

(ii) Significance level

Any test involves computing a certain value, say X. Since the researcher is usually interested in
rejecting the Null Hypothesis, he will be looking for significantly high values of X, or significantly
low values of X.

At the '5% significance level', a significant value of X means that the value occurs in a region
that is unlikely if H0 is true. The probability of obtaining such a value of X when H0 is true
is 5% or less. Such an unlikely event is judged to be incompatible with H0. Thus the Null
Hypothesis is rejected.

However, the unlikely can happen and H0 could be true. The probability of rejecting H0 when it
is true after all is no greater than 5%.

A smaller significance level, therefore, will reduce this risk of incorrectly rejecting H0. The
price that is paid for making the test more strict is that the evidence needs to be more 'convincing'
before H0 can be rejected. The critical value of X becomes more extreme.

(iii) Interpretation

The choice is between the REJECTION OF H0 or the NON-REJECTION OF H0 .

A significant value of X implies rejection of H0. But a non-significant value of X should be
interpreted as inconclusive evidence and H0 is to be 'accepted' as true only until stronger
evidence can be presented.

The Wilcoxon Rank Sum Test

Two independent samples are taken and the measurements are ranked. The sum of the rank numbers of
the smaller sample is computed. This total is given the symbol 'W' after the name of the inventor of the
test. The tables list both the upper and lower critical values of W at the 5% level.

e.g. Sample A - 11, 12, 15, 17 (n = 4, m = 6)

Sample B - 13, 14, 16, 18, 19, 20

ranking the data

Sample   A   A   B   B   A   B   A   B   B   B
        11  12  13  14  15  16  17  18  19  20
Ranks    1   2   3   4   5   6   7   8   9  10

Sample A has fewer measurements, so we add up its rank numbers.

W = 1 + 2 + 5 + 7 = 15; this is not significantly low since the lower critical value is W = 13.
So the Null Hypothesis cannot be rejected.

ANSWERS TO THE EXERCISE ON PAGES 7 AND 8

1. H0: There is no difference between the volume of traffic on Sundays and Weekdays.

(a)      S    S   We    S   We    S   We   We   We   We
       392  411  413  424  429  431  436  440  444  452
Ranks    1    2    3    4    5    6    7    8    9   10

W = 1 + 2 + 4 + 6 = 13.

(i) W = 13 is within the rejection region of W = 10, 11, 12 or 13 and so we reject H0.

(ii) The volume of Sunday traffic is less than the volume of Weekday traffic in winter.

(b)     We   We    S   We   We   We    S    S   We    S
       517  519  521  534  545  546  548  549  553  562
Ranks    1    2    3    4    5    6    7    8    9   10

W = 3 + 7 + 8 + 10 = 28.

(i) W = 28 is not within the rejection region and so we do not reject H0.

(ii) There seems to be no difference between the volume of Sunday traffic and Weekday
traffic in summer, on the basis of the evidence.

2. (a) H0: The removal of the death penalty has made no difference to the number of murder victims.

(b)      A    B    A    A    A    A    B    A    B    B
       118  119  122  123  129  135  136  144  148  177
Ranks    1    2    3    4    5    6    7    8    9   10

W = 2 + 7 + 9 + 10 = 28.

(c) W = 28 is not within the rejection region and so we do not reject H0. Hence, this analysis
of the figures does not suggest that the removal of the death penalty has increased the
murder rate.

3. The numbers 1 to 10 add up to 55. If four of the numbers add up to 15 then the sum of the
other six must be 40.

ANSWERS TO THE EXERCISE ON PAGES 16, 17 AND 18

1. (i) 1/84 = 1.2% (ii) 2/84 = 2.4% (iii) 4/84 = 4.8%

(a) W = 6, 7 or 8. (b) W = 6 or 7.

2. (i) F. Statistics do not prove anything.

(ii) F. We assume H0 to be true for the purposes of argument and the evidence has not been
good enough for us to reject it. This is not the same as being 95% sure of H0.

(iii) T.

3. (i) F. (ii) T. We conclude that H0 cannot be rejected at the 2.5% significance level.

4. (a) H0: The memory of middle-aged people is no different to the memory of sixth-formers.

(b) and (c)

         A    B    A    B    A    B    B    A    A    A    B    B    B    A    B
         0    2    3    4    5    6    7    8    9   10   10   10   11   11   12
Ranks    1    2    3    4    5    6    7    8    9   11   11   11  13½  13½   15

(d) W = 1 + 3 + 5 + 8 + 9 + 11 + 13½ = 50½.

(e) The value of W obtained is not significant and so we cannot reject H0.

(f) We conclude that the evidence does not suggest that recall ability is different for middle-aged
people and sixth-formers.

5. (a) H0: People neither underestimate nor overestimate time, in general.

(b)      A    B    A    A    A    B    A    A    B    B    B    B    B
         0    5   10   15   20   25   30   30   30   35   40   50   50
Ranks    1    2    3    4    5    6    8    8    8   10   11  12½  12½

W = 1 + 3 + 4 + 5 + 8 + 8 = 29.

(c) The value of W is significantly low (since the critical value is W = 29 also). Hence we
reject H0.

(d) The evidence indicates that people tend to overestimate the 200 second time interval.

(e) No more than 5%.

6. (a) False.

For instance, a value X may lie within the 5% area (and so be significantly high at the 5% level)
but not within the 1% area.

(b) True.

(c) True.

(d) True - because it is better to be 1 in 1 000 sure (i.e. 0.1%) than 1 in 20 sure (i.e. 5%)
with such a potentially dangerous thing as a faulty nuclear power station.

THE WILCOXON RANK SUM TEST

Critical values * of the rank sum of the smaller sample

n = size of smaller sample
m = size of larger sample

For each value of m, the first line gives the lower critical values and the second line the upper critical values.

 m     n    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18

 4 lower         6   11
   upper        18   25
 5 lower    3    7   12   19
   upper   13   20   28   36
 6 lower    3    8   13   20   28
   upper   15   22   31   40   50
 7 lower    3    8   14   21   29   39
   upper   17   25   34   44   55   66
 8 lower    4    9   15   23   31   41   51
   upper   18   27   37   47   59   71   85
 9 lower    4   10   16   24   33   43   54   66
   upper   20   29   40   51   63   76   90  105
10 lower    4   10   17   26   35   45   56   69   82
   upper   22   32   43   54   67   81   96  111  128
11 lower    4   11   18   27   37   47   59   72   86  100
   upper   24   34   46   58   71   86  101  117  134  153
12 lower    5   11   19   28   38   49   62   75   89  104  120
   upper   25   37   49   62   76   91  106  123  141  160  180
13 lower    5   12   20   30   40   52   64   78   92  108  125  142
   upper   27   39   52   65   80   95  112  129  148  167  187  209
14 lower    6   13   21   31   42   54   67   81   96  112  129  147  166
   upper   28   41   55   69   84  100  117  135  154  174  195  217  240
15 lower    6   13   22   33   44   56   69   84   99  116  133  152  171  192
   upper   30   44   58   72   88  105  123  141  161  181  203  225  249  273
16 lower    6   14   24   34   46   58   72   87  103  120  138  156  176  197  219
   upper   32   46   60   76   92  110  128  147  167  188  210  234  258  283  309
17 lower    6   15   25   35   47   61   75   90  106  123  142  161  182  203  225  249
   upper   34   48   63   80   97  114  133  153  174  196  218  242  266  292  319  346
18 lower    7   15   26   37   49   63   77   93  110  127  146  166  187  208  231  255  280
   upper   35   51   66   83  101  119  139  159  180  203  226  250  275  302  329  357  386
19 lower    7   16   27   38   51   65   80   96  113  131  150  171  192  214  237  262  287
   upper   37   53   69   87  105  124  144  165  187  210  234  258  284  311  339  367  397
20 lower    7   17   28   40   53   67   83   99  117  135  155  175  197  220  243  268  294
   upper   39   55   72   90  109  129  149  171  193  217  241  267  293  320  349  378  408
21 lower    8   17   29   41   55   69   85  102  120  139  159  180  202  225  249  274  301
   upper   40   58   75   94  113  134  155  177  200  224  249  275  302  330  359  389  419

* At the 5% level of significance.
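
For small samples, entries in a table of this kind can be recomputed by listing every possible choice of n ranks. The sketch below is an illustration only: the function name `lower_critical_value` is hypothetical, and the rule used (the largest value w for which the probability of W being w or less does not exceed the significance level) is an assumption about how the table was constructed, although it reproduces the entries checked here. Listing every choice is practical only when n and m are small.

```python
from itertools import combinations


def lower_critical_value(n, m, level=0.05):
    """Largest w with P(W <= w) <= level under H0, or None if there is none."""
    sums = sorted(sum(c) for c in combinations(range(1, n + m + 1), n))
    total = len(sums)
    critical = None
    for w in range(sums[0], sums[-1] + 1):
        tail = sum(1 for s in sums if s <= w)
        if tail / total > level:
            break
        critical = w
    return critical


print(lower_critical_value(4, 6))               # 13
print(lower_critical_value(3, 6))               # 8
print(lower_critical_value(3, 6, level=0.025))  # 7
```

For n = 2 and m = 4 the function returns None, because even the lowest possible rank sum has a probability of 1 in 15, which is more than 5%.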

POST-TEST

Write down a suitable Null Hypothesis, H0, for each hypothesis below.

(i) As a generalisation, fair-haired people may have Nordic origins and dark-haired people may
have Mediterranean origins. So we can assert

Hypothesis: FAIR-HAIRED MEN ARE TALLER, ON THE AVERAGE, THAN
DARK-HAIRED MEN.

(ii) Although a wild animal can never be 'at home' in a zoo, the animal will enjoy a controlled
environment relatively free from disease and attack. So we can assert

Hypothesis: LIONS IN CAPTIVITY LIVE LONGER THAN LIONS IN THE WILD.

(iii) The advantages of children attending nursery school before the age of five are not so much
academic as allowing them to mix with a large group of other children at an early age.
So we can assert

Hypothesis: CHILDREN WHO ATTEND NURSERY SCHOOL ARE MORE EXTROVERT
THAN THOSE WHO DO NOT.

A certain statistical test involves calculating a value for a quantity called X. The lowest possible
value for X is X = 5 and the highest is X = 15. The table below shows the number of ways possible
to obtain each value of X.

Value of X                          5    6    7    8    9   10   11   12   13   14   15

No. of ways to obtain this value    1    1    1    3   10   25   30   25    1    2    1     Total = 100

If the Null Hypothesis is true, then all of the (100) possible ways are equally likely.

(i) One particular experiment is concerned with predicting significantly low values for X
in order to reject H0. What would be the rejection region at the (a) 5%, (b) 2.5%
significance level in each case?

(ii) What would be the corresponding critical value in each case?

(iii) Another experiment is concerned with significantly high values for X and the rejection region
is X = 14, 15. What is the level of significance; is it (a) 2%, (b) 3%, (c) 4%?

(a) Two researchers, A and B, are separately testing the same Null Hypothesis. On the basis
of his sample data, researcher A rejects H0 at the 5% significance level and researcher B
also rejects H0, from his own data, but at the 1% significance level. Which researcher
probably has the stronger evidence? Try and give your reasons in a sentence or so.

(b) A statistician is asked to analyse the results of two separate experiments. One is concerned
with the possible side-effects of brain damage when using a new drug. The second is
concerned with the effect of various coloured labels on the selling power of baked beans tins.
He decides to adopt significance levels of 5% and 0.1%. State which experiment received
the 5% level and which the 0.1% level and give your reasons in a sentence or so.

An advertising campaign is launched to boost the sales of BIO washing powder. The weekly sales
(in thousands of packets) before and after the campaign were as follows:

Before (7 weeks)        After (9 weeks)

31, 24, 29              35, 27, 32
32, 26, 32              23, 39, 38
34                      36, 39, 40

Use the Wilcoxon Rank Sum Test to assess whether the campaign has been successful. Do this by
ranking the results and finding the rank sum W of the smaller sample. (The lower critical value
of W is W = 43 at the 5% significance level. The lower critical value of W is W = 37 at the
1% significance level.)

ANSWERS TO POST-TEST

(i) H0: Fair-haired men are the same height, on the average, as dark-haired men.
(ii) H0: Lions in captivity live as long as lions in the wild.
(iii) H0: Children who attend nursery school are as extrovert as those who do not.

(i) (a) X = 5, 6 or 7 because this region covers the bottom 3% of values. (To include 8
within the region would exceed 5%.)
(b) X = 5 or 6.
(ii) (a) X = 7.
(b) X = 6.
(iii) 3%.

(a) Researcher B, because he has a smaller chance of incorrectly rejecting H0 (1%) than
researcher A (5%).
(b) The new drug ought to receive a stricter consideration than the baked beans label. Hence
the former should be tested at the 0.1% significance level and the latter at the 5% level.

We test the Null Hypothesis that the campaign has made no difference to sales.
Ranking the results (A = after, B = before).

         A    B    B    A    B    B    B    B    A    B    A    A    A
        23   24   26   27   29   31   32   32   32   34   35   36   38
Ranks    1    2    3    4    5    6    8    8    8   10   11   12   13

W = 2 + 3 + 5 + 6 + 8 + 8 + 10 = 42.

At the 5% significance level, this value is significantly low. Hence we can conclude that sales
have been improved by the advertising campaign; but there is a 1 in 20 chance that we are wrong
in this conclusion.
In fact, we cannot improve the 'odds' because if we choose the stricter significance level then
we cannot conclude that sales have been improved.
CATEGORY 3

Longman

This booklet is one of a collection of self-learning units prepared by the Continuing Mathematics Project
based at the University of Sussex, and sponsored by the Schools Council, the Council for Educational
Technology, the Department of Education and Science, the Scottish Education Department, and a number of
industrial and commercial companies.

Members of the project team: A.W. Fuller (Director 1971-73), R.W. Morris (Director 1974- ),
M.J. Gould, Mrs. B. Harmer, R.J. Hayter, E.B. Loy, Miss K. Oliver, C.J. Rutter, H.M. Semple,
K.L. Winter.

For a descriptive brochure listing other units prepared by the Continuing Mathematics Project, write to:

Publishing Manager,
Longman Group Limited - Resources Unit,
9-11 The Shambles,
York.

Schools Council Publications 1976. Printed in the United Kingdom

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording or otherwise without prior permission of the
publishers.