
EDEL812

STATISTICAL ANALYSIS IN RESEARCH


MODULE

UNIVERSITY OF KWAZULU-NATAL
PIETERMARITZBURG CAMPUS

2002

Compiled by:
Peter M Njuho, Ph.D.
Senior Lecturer
School of Statistics and Actuarial Science
University of KwaZulu-Natal
Private Bag X01
P O Box X01
Scottsville 3209
South Africa

Statistical Analysis in Research Module


E-mail: NjuhoP@ukzn.ac.za

PMNjuho

EDEL 812

STATISTICAL ANALYSIS IN RESEARCH

1. SURVEY AND DESIGN OF EXPERIMENT STUDY CONCEPTS


1.1 Survey versus designed experimental study
The difference between a survey study and a designed experimental study lies mainly in the
study objectives. The researcher should understand this difference before undertaking the
study. Failure to distinguish between the two forms of study leads to complicated data
analyses whose results may not tie in with the study objectives.
The primary objective in a survey study is to observe the characteristics of the population of
interest. For instance: Is the disease common across the different communities? Is the
level of education distributed the same across races? Is the distribution of land even
across different communities? What is the opinion of residents regarding new
rules on rubbish disposal? In such situations we are concerned with the level of
distribution rather than the actual difference. A question of great interest would be:
where is the variability high?
In a designed experimental study the primary interest is to
investigate the relative performance of certain factors. The key questions to be
answered are generally expressed as a statement of hypothesis that has to be verified or
disproved through experimentation. The interest would be in answering questions such as:
Are the three methods for treating the disease different? And if so, by how much?
Is the new teaching method significantly different from the old method?
In a survey study the researcher has no control over the responses; he or she acts as an
observer, and the outcomes are mainly considered random. Survey studies can be classified
into two types: the exploratory (or informal) survey and the formal survey. The exploratory
survey is mainly used to obtain information about the population of interest, for example
farmer circumstances. The approach places the interviewer in direct contact with the subjects
and allows the interviewer to observe the characteristics of the population. An
exploratory survey allows for quick gathering of information through informal interviews
with many people. The information from an exploratory survey is used to design a
well-focused formal survey by:

- identifying important topics bearing on research planning that should be the
  focus of the formal survey;
- ensuring that written questions in the formal survey are asked in a way that can
  be understood;
- designing and testing a sampling scheme.

Other important features of an exploratory survey


Towards the end of the exploratory survey, it should also be possible to give approximate
frequencies of use for a given practice among the target population (e.g. 0-10%, 10-25%,
25-50%, 50-75%, 75-100% of farmers).
The exploratory survey narrows down the data to be collected in the formal survey to those
that are essential for understanding present practices and prescreening technologies.
An important part of the exploratory survey is to formulate hypotheses.

Examples of such hypotheses:

- A larger area can be planted, as labour is a limiting factor at the planting
  period;
- There is a dry period three months after the start of the rains, and late plantings
  may survive this period better than plantings that flower at that time;
- Early plantings give an early supply of new food and are particularly important
  when the previous harvest has been poor.

1.2 Formal survey study concept


The purpose of a formal survey study is to verify and quantify information and to test
hypotheses formulated in the exploratory survey. A formal survey involves the use of a
well-designed questionnaire. The first step is to define the population of interest. It should be noted
that we interview a sample of respondents and use the information obtained from this
sample to make statements or inferences about the population of interest.
The following are general rules to be followed when developing a questionnaire.
Organizing the questionnaire:- The questionnaire should be divided into sections based
on the study themes. Section one should always be designed to collect the bio-data
(for example gender, age, education, marital status, etc.).
Language of the questionnaire:- The questions should be constructed using clear and
friendly language. The respondent must be given an opportunity to express himself or
herself in a language of choice. Leading questions should be avoided. Each question should
be put in such a way that the respondent will provide more information. For example, ask
"Did you apply fertilizer to wheat this year?" rather than "Do you use fertilizer on wheat?"
Length of the questionnaire:- Lengthy questionnaires should be avoided because they
may introduce fatigue. The questions should be compact yet
comprehensive.
The role of the questionnaire is to obtain estimates of how widespread those problems
and opinions are, and whether there are differences between groups of respondents. A
questionnaire must be pre-tested before the final version is produced for use in a formal
survey study.
Subjective data
Consider a survey study where interest is in obtaining farmers' opinions regarding a certain
technology. It should be noted that information on what farmers do is objective and
quantifiable, whereas farmers' opinions and perceptions about problems and technologies
are subjective.
Sampling procedures for a formal survey study

The aim is to select, at a reasonable cost, a group of respondents that is roughly
representative of all subjects in the population of interest. A representative sample must
be selected at random.
That is, each unit in the population or subgroup of the population has an equal chance of
selection. Such an assumption requires all the sampling units to be homogeneous and
non-overlapping. The nature of these non-overlapping units dictates the sampling
technique to use. The following are some of the sampling procedures:

Simple random sampling
Stratified random sampling
Multi-stage random sampling
Systematic random sampling
Cluster sampling

Stratification of the population is the process of dividing the population into relatively
homogeneous subgroups called strata, and then taking a separate sample from each
stratum.
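The difference between simple random sampling and stratified random sampling can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 60 farmers in three communities, the function names and the 20 % sampling fraction are all hypothetical and do not come from the module.

```python
import random

def simple_random_sample(units, n, seed=0):
    """Draw n units without replacement; every unit has the same chance."""
    rng = random.Random(seed)
    return rng.sample(units, n)

def stratified_sample(strata, fraction, seed=0):
    """Draw the same fraction, at random, from each stratum separately."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = max(1, round(fraction * len(units)))  # at least one unit per stratum
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical sampling frame: 60 farmers in three communities (the strata).
strata = {
    "community_A": [f"A{i}" for i in range(30)],
    "community_B": [f"B{i}" for i in range(20)],
    "community_C": [f"C{i}" for i in range(10)],
}
all_farmers = [u for units in strata.values() for u in units]

srs = simple_random_sample(all_farmers, 12)
strat = stratified_sample(strata, 0.2)
print(len(srs), {name: len(units) for name, units in strat.items()})
```

With simple random sampling a small community may, by chance, contribute no respondents at all; the stratified version guarantees that every stratum is represented.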
Sample size:- Depends on the variability within the population and not on the size of the
population. It should conform to the time and cost constraints of the survey.
Major costs:- Cost of developing the questionnaire, training enumerators, and establishing a
suitable sampling method.
Form of analysis:- Either to estimate population means, variance components or population
size, or to establish causal relationships, predictive models, frequencies, etc.
Commonly used in analysis:- Chi-square tests, mean estimation, non-parametric methods, regression
(e.g. logit, probit, logistic, etc.)

1.3 Designed experimental study concepts


The researcher has control over the factors to be tested and the form of data to be
collected. He/she sets up the experiment and observes the outcome.
An experiment is a planned inquiry set up to obtain new facts, to confirm or disprove results
from a previous experiment, or to verify certain biological phenomena.
Objectives:- The objectives must be clearly stated as questions to be answered, hypotheses
to be tested, and effects to be estimated. It is necessary to classify the objectives as major
or minor, since certain experimental designs give greater precision for some treatment
comparisons than others.
Precision:- Precision, sensitivity, or amount of information is measured as the reciprocal of
the variance of a mean. That is,

$$\text{Information} = \frac{1}{\operatorname{var}(\bar{y})}$$

Where $\operatorname{var}(\bar{y})$ denotes the variance of the sample mean $\bar{y}$.
As the variance of $y$, denoted by $\sigma^2$, increases, the amount of information decreases.
Similarly, as the sample size $n$ increases, the amount of information increases.
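Both statements follow from one line of algebra. The sketch below assumes, as is usual but not stated explicitly in the text, that the $n$ observations are independent and share the common variance $\sigma^2$:

```latex
\operatorname{var}(\bar{y}) = \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\text{Information} = \frac{1}{\operatorname{var}(\bar{y})} = \frac{n}{\sigma^2}
```

For example, doubling $n$ doubles the information, while doubling $\sigma^2$ halves it.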
Components of experimental design:- The following are components that any
researcher must clearly state when conducting a designed experimental study.

Treatment structure
Design structure
Experimental unit
Randomization
Replications
Assumptions

Treatment structure:- A treatment is a procedure whose effect is to be measured and
compared with other treatments, e.g. a standard ration, a spraying schedule, a
temperature-humidity combination, etc. A set of treatments may consist of, e.g., sources of
fertilizer such as DAP, CAN, TSP, manure, etc. Examples are a one-way treatment structure
(e.g. nitrogen levels, dairy meal levels), a two-way treatment structure (e.g. plant
population and different hybrids), and higher-order treatment structures. The interest is to
estimate effects, compare effects, predict, etc.
Experimental unit:- This is the smallest unit of material to which a treatment is applied,
e.g. an animal, 5 pigs in a pen, a half-leaf.
Sampling unit:- This is often referred to as the observational unit. The treatment effect is
measured on a sampling unit, which is basically a subunit of the experimental unit. Sometimes a
sampling unit is a complete experimental unit.
Experimental error:- This is a measure of the variation that exists among observations
on experimental units treated alike. Aim at reducing experimental error in order to
improve the power of the test.
Replication:- When a treatment appears more than once in an experiment, it is said to be
replicated. Replication is necessary to provide an estimate of experimental error,
which is required for tests of significance. Without replication there is no basis for
comparison. Valid replication requires that at least some sets of comparable units are
treated identically. There are many situations in which there can be
different levels of replication, providing different degrees of variation. It is necessary to
identify the different levels of replication, the correct level of replication, and situations in
which false replication is used. There are also many situations in which multiple levels of
replication are necessary and relevant to the analysis. Replication provides a means of
computing experimental error.

The amount of replication is determined by the extent to which the standard error must be
reduced, which is in turn determined by the size of the treatment difference that the
experiment should detect. The necessary amount of replication fixes the total
number of units in the experiment. In an analysis of variance, the variation between these
units is modelled by dividing the total degrees of freedom (sample size minus 1) into
control (design structure), questions (treatment structure) and error (random structure).
We should design experiments so that the error degrees of freedom are between 10 and
20. Experiments not satisfying this requirement are, to some degree, inefficient and should
be avoided.
Randomization:- This is done to ensure that we have a valid or unbiased estimate of
experimental error and of treatment means and the differences among them. In other words,
the procedure provides insurance against the possibility that the model for analysis is
invalid. It also provides a basis for randomisation test arguments in support of
statements about significance. Randomization provides a valid measure of
experimental error.
Design structure:- Involves techniques for controlling known variation among the
experimental units. Thus, experimental units are grouped into homogeneous groups,
referred to as blocks, such that variation within the groups is a minimum and between them is
a maximum. The following are examples of design structures:

Completely randomized design (CRD)
Randomized complete block design (RCBD)
Latin square design
Cross-over design
Incomplete block design
Experiments with more than one size of experimental unit, such as:
  Split plot design
  Strip plot design
  Repeated measures design
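To make the link between randomization and design structure concrete, the following Python sketch allocates treatments to units under the first two designs in the list. It is a minimal illustration only: the function names, the three replications and the three blocks are hypothetical choices, and the four fertilizer sources are borrowed from the treatment-structure example above.

```python
import random

def randomize_crd(treatments, reps, seed=0):
    """CRD: every treatment appears `reps` times, shuffled over all units."""
    rng = random.Random(seed)
    plan = [t for t in treatments for _ in range(reps)]
    rng.shuffle(plan)
    return plan  # plan[i] is the treatment assigned to unit i

def randomize_rcbd(treatments, n_blocks, seed=0):
    """RCBD: every treatment once per block, order shuffled within each block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)
        rng.shuffle(order)  # a fresh, independent randomization for each block
        layout.append(order)
    return layout

treatments = ["DAP", "CAN", "TSP", "Manure"]
crd = randomize_crd(treatments, reps=3)
rcbd = randomize_rcbd(treatments, n_blocks=3)

print("CRD :", crd)
for block_number, order in enumerate(rcbd, start=1):
    print("RCBD block", block_number, ":", order)
```

In the CRD the randomization is unrestricted, whereas in the RCBD it is restricted so that each block contains a complete set of treatments; this restriction is what allows block-to-block variation to be removed from the experimental error.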

Assumptions:- The design structure and treatment structure do not interact. The observed
values are independently and identically distributed normal with a constant variance.

1.4 Conceptual models in scientific research


Conceptual models serve to organise research approaches and direct data presentation.
Many inexperienced scientists do not make full use of conceptual models. Conceptual
models assume many different forms that are not mutually exclusive.
Different conceptual models may be dynamic and interactive. Working hypotheses are an
essential component of all scientific approaches and must be elucidated in advance of
more detailed research activities. Most working hypotheses may then be captured as either
mathematical or statistical models. Simple diagrams should be based on one or more
working hypotheses and constructed in advance of detailed research efforts to serve as a
framework; they may often evolve into more complex forms during the course of many
experiments and much thought.

Working hypotheses: Working hypotheses are reductionist word models based on logic
and are an essential component of all research. All scientific progress may be viewed as a
long chain of working hypotheses that were framed, tested and either accepted or rejected,
with conclusions that led to a more advanced hypothesis. A well-stated working
hypothesis is specific and directs treatment selection and measurements. Global
hypotheses are more general, often a restatement of overall objectives, and generate one to
many working hypotheses. A null hypothesis is stated in the negative context and is no
longer considered essential provided that a working hypothesis is stated in a manner that
may either be accepted or rejected. It should be noted that hypothesis testing is a formal
procedure by which we investigate research questions using inferential statistics to reach
decisions about the validity of the null and alternative hypotheses.
It is most reasonable for one scientist to ask another, "How do your findings reflect on
your working hypothesis?" Working hypotheses should be stated as simply as possible but
must be complete statements such as "X regulates Y under Z conditions".
Example 1.1
- Phosphorus availability is limiting maize production in nutrient-depleted, smallholder
  farms in the highly weathered, sandy soils of KZN.
- Maize streak virus infects a greater proportion of the maize stand and reduces
  crop yields to a greater extent under continuous maize cultivation than in
  maize-legume rotation.

Both of these statements in Example 1.1 may be summarised as

If A and B then C (and D, etc.)


Always remember that working hypotheses are intended to be either accepted or rejected
as a result of successful research and as such must be able to withstand various tests of
logic. Do not be defensive when a particular hypothesis is challenged, but rather feel
complimented that another scientist considers it worthy of discussion.
Beware of incomplete statements such as
"Use of fertiliser is better for farmers", or
"Maize streak virus is a serious problem".
Also, tautological statements such as "Sustainable agriculture results in long-term food
security" are unsatisfactory working hypotheses; such statements should rather be
included within introductions, justification sections or overall objectives.
Mathematical models
Mathematical equations may also serve as conceptual models. Equations attempt to
quantify cause and effect relationships. The cause(s) are referred to as the independent
variable(s), which direct an effect in the dependent variable. The mathematical relationship may also be
stated in more general terms as a working hypothesis. The relationship may be either
linear or non-linear. A general expression for a simple linear relationship is of the form
$$Y = \alpha + \beta X + \varepsilon$$
Where $\alpha$ is the intercept
$\beta$ is the slope
$\varepsilon$ is the random error
$Y$ is the dependent variable
$X$ is the independent variable
Many different equations define non-linear relationships. Examples of such equations are:

- Power functions ($y = a x^{b}$, where $a$ and $b$ are constants)
- Exponential growth curves ($y = a b^{x}$, where $a$ and $b$ are constants; $b$ can
  also be the exponential $e$)
- Hyperbolic functions, where $x$ is the reciprocal of $y$ ($y = 1/x$)
- Asymptotic decay curves, where $y$ approaches 0 as $x$ increases without limit
  ($y = a e^{-kx}$)
- Polynomial curves ($y = a_0 + a_1 x + a_2 x^2 + \dots + a_p x^p$)

In general, the researcher should identify a conceptual basis for selecting a given curvilinear
relationship, based upon the properties of the phenomenon under study.
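As an illustration of how the simple linear relationship is quantified in practice, the sketch below estimates the intercept and slope by ordinary least squares. The choice of least squares, the function name `fit_line` and the five data points are assumptions made for this example; none of them come from the module.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates of the intercept and slope of Y on X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sum of cross-products and corrected sum of squares of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: X could be a fertilizer level and Y a crop yield.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Several of the non-linear forms listed above can be handled with the same routine after a transformation; for instance, taking logarithms of both sides of a power function gives a straight line in log x and log y.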
Statistical models
Statistical models, particularly those that examine two or more factors simultaneously, are
also useful as conceptual models. Differential effects of one factor on another result in
interaction. For instance, in a study set up to investigate the effect of two factors, a statistical
model would have the following form, assuming a completely randomised design. Suppose $y$
is the response variable.
$$y_{ijk} = \mu + A_i + B_j + AB_{ij} + \varepsilon_{ijk}$$
Where
$\mu$ denotes the overall mean
$A_i$ is the effect of the $i$th level of factor A
$B_j$ is the effect of the $j$th level of factor B
$AB_{ij}$ is the $ij$th interaction effect
$\varepsilon_{ijk}$ is the random error.

A statistical model could also be considered as a process of partitioning the response
value into components due to inherent and random variation. Setting up the model
before the analysis enables the researcher to focus on the issues of interest. Interest
would be in estimating the main effects of factors A and B, and their interaction effect.
The random error would be used to construct statistical tests of whether these effects
differ significantly from zero or from any specified value.
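The partitioning described by the two-factor model can be sketched numerically. The example below assumes a balanced 2 x 2 experiment and the usual sum-to-zero constraints on the effects; both are assumptions added for illustration, and the cell means are hypothetical.

```python
# Hypothetical cell means for a balanced 2 x 2 experiment
# (rows: levels of factor A, columns: levels of factor B).
cell_means = [
    [10.0, 14.0],
    [12.0, 22.0],
]
a = len(cell_means)      # number of levels of factor A
b = len(cell_means[0])   # number of levels of factor B

grand_mean = sum(sum(row) for row in cell_means) / (a * b)
row_means = [sum(row) / b for row in cell_means]
col_means = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]

# Estimates under sum-to-zero constraints (an assumption of this sketch).
A = [m - grand_mean for m in row_means]   # main effects of factor A
B = [m - grand_mean for m in col_means]   # main effects of factor B
AB = [
    [cell_means[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(b)]
    for i in range(a)
]                                          # interaction effects

print("overall mean:", grand_mean)
print("A effects   :", A)
print("B effects   :", B)
print("AB effects  :", AB)
```

Each cell mean is recovered exactly as overall mean + A effect + B effect + AB effect; non-zero AB terms indicate that the effect of one factor depends on the level of the other.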
Exercise 1.1
1.1 Suppose you were approached to design a questionnaire to be used in a feasibility
study regarding the settlement of a group of urban dwellers on new land within
KZN.
a) What components would you include in your questionnaire?
b) What could be the sampling unit?
c) What could be the population of interest?
d) What could be the sample size?

1.2 Consider the newly constructed Casino in Pietermaritzburg. Suppose the manager
wishes to collect the views of the residents of Pietermaritzburg regarding the
business. Indicate how such information could be collected.
1.3 The Checkers in Scottsville, Pietermaritzburg underwent some renovation recently.
Suppose the manager wishes to collect the customers' views regarding this change.
Indicate how such information would be collected.
1.4 An experiment was conducted to determine the best way to manage citrus insect
pests and diseases on smallholder farms. Six farms were selected. In each farm,
10 trees infested with white flies were selected. The investigator was interested in
finding which treatment had the best effect in controlling the disease among the following four:
1) pruning, 2) fertiliser application, 3) pesticide application, and 4) farm activities
(intercropping practice).
a) Indicate how the treatments were applied.
b) What could have been the experimental unit?
c) How independent are these treatments?
d) What name could you give the experimental design used?
e) What possible questions could be answered in this study?

2. INFERENCES ABOUT ONE AND TWO POPULATION MEANS


2.1 Introduction to hypothesis testing
Hypothesis testing is an area of statistical inference in which we evaluate a conjecture,
which we call a hypothesis, about some characteristic of the parent population. The
hypothesis usually concerns the unknown parameters of the population.
The null hypothesis
This is the statement being tested and is denoted by $H_0$. It is usually stated as an equality,
implying no difference.
The alternative hypothesis
This is what is believed to be true if $H_0$ is rejected, and is denoted by $H_a$. Usually, the
investigator wishes to
establish that there is a difference between the parameter and the value being tested. Thus,
the alternative is also called the research hypothesis.
Consider, for example, the hypothesis that the mean per capita income in a certain town is
R800 per year. Suppose we denote the population mean by $\mu$. Suppose the investigator
believes the mean per capita income of the town is greater than R800. The two hypotheses
are stated as
$H_0: \mu = 800$ against $H_a: \mu > 800$
If the investigator believes the mean per capita income is less than R800, then the alternative
hypothesis is stated as
$H_a: \mu < 800$
The alternative is stated in support of what the investigator wishes to believe.
The significance level
This is the probability, denoted by $\alpha$, with which we are willing to reject the null
hypothesis when it is correct.
Type I error is committed if we reject the null hypothesis when it is in fact true.
Type II error is committed if we fail to reject the null hypothesis when it is false.
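The two types of error can be written as conditional probabilities. In the sketch below, the symbol $\beta$ for the Type II error probability is conventional notation that is assumed here; it does not appear elsewhere in this section.

```latex
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) \qquad \text{(Type I error)}
\beta  = P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) \qquad \text{(Type II error)}
\text{Power of the test} = 1 - \beta
```

For a fixed sample size, choosing a smaller $\alpha$ makes it harder to reject $H_0$, so $\beta$ tends to increase.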

2.2 Inferences about a population mean


In practice, we encounter situations where interest is in confirming a known hypothesis.
This relates to questions such as: has the average increased, decreased or remained static
over time? Sometimes, an investigator would like to compare characteristics of two
populations. Handling such investigations involves one- or two-sample situations.
Consider a single population that is normally distributed with mean $\mu$ and variance $\sigma^2$.
Suppose we want to test a hypothesis about $\mu$. Let $\mu_0$ denote a known mean.
Hypothesis: $H_0: \mu = \mu_0$ against $H_a: \mu \neq \mu_0$

Example 2.1
The scores on a college placement exam in mathematics are assumed to be normally
distributed with a mean of 70 and a standard deviation of 18. The exam is given to a
random sample of 50 high school students who have been admitted to college. Their
average score on the exam was 67. If this is a true random sample, is the evidence
sufficient to suggest that the population mean score is lower than 70?
Solution:
Let $\mu$ denote the true population mean of the placement exam. We wish to see whether
there is evidence that $\mu < 70$. This is the research hypothesis. Thus, $\mu_0 = 70$, $\sigma^2 = 324$,
and $n = 50$.
Hypothesis: $H_0: \mu = 70$ against $H_a: \mu < 70$
Significance level: $\alpha = 0.05$
Critical region: Reject $H_0$ if the p-value is less than 0.05, where p-value = the probability
that $\bar{X} \le 67$.
Test statistic: The sample mean, $\bar{X} = 67$
$$z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{67 - 70}{18/\sqrt{50}} = -1.18$$
Thus, $P(\bar{X} \le 67) = P(Z \le -1.18) = 0.119$
Conclusion:
We fail to reject $H_0$ since the p-value = 0.119 is not less than 0.05. Based on the results of a
random sample of 50 high school students, there is insufficient evidence to say that the
mean score on the college placement exam is lower than 70.
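The arithmetic in Example 2.1 can be checked with a short Python sketch that uses only the standard library. The variable names are illustrative; the figures are those given in the example.

```python
from math import sqrt
from statistics import NormalDist

# Figures from Example 2.1.
mu_0 = 70      # hypothesised population mean
sigma = 18     # known population standard deviation
n = 50         # sample size
x_bar = 67     # observed sample mean
alpha = 0.05   # significance level

# Test statistic for a mean when sigma is known.
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Lower-tail p-value, because the alternative hypothesis is mu < 70.
p_value = NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

For a two-sided alternative the p-value would instead be twice the smaller tail area.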

Exercise 2.1
2.1 For the future planning and control of automatic sorting machines, a member of the
General Post Office is instructed to take a random sample of the letters posted with a
10c stamp during a specific period of the year. The weights of these letters were
recorded as follows (in grams):
25.7 23.2 25.8 25.8 29.1 23.1 17.2 26.4 31.9 18.3 19.2 20.7 23.6 21.6 21.9
21.8
a) Test a claim that the average weight of such letters is 19.6 gms.
b) Test a claim that the average weight is greater than 21.6 gms.
c) Find 95 % and 99 % confidence intervals for the average weight of the letters.
d) Use your results in (c) to test whether the mean is 27.9 gms.
2.2 A certain department store conducts monthly checks amongst its branches to test
whether the mean balance outstanding on 30-day charge accounts complies with the
company policy of R100. For a particular branch store a sample of 100 accounts gave
the following results:
$\bar{x}$ = R104.19    $s$ = R22.13
a) Test the claim at 5 % significance level, that the branch was complying with
company policy.
b) The department store financial controller claims that the mean balance is
greater than R100. Test this claim. (Use $\alpha = 0.05$).
2.3 State the null and alternative hypotheses for the following research questions.
a) Are children who have strict parents more disciplined than children who do
not have strict parents?
b) Do babies with birth weights of 2.8kg and more have a greater mortality rate
than those with birth weights lower than 2.8kg?
2.4 The records of the National Road Traffic department reveal that the scores on the
learner driver test are normally distributed with a mean of 62% and a standard
deviation of 16. The traffic department is aware that people in KwaZulu-Natal tend to
be better at obeying traffic rules than people from other provinces around the country.
They administered the learner driver test to a random sample of 200 adults from
KZN, and noted that their mean score was 69.9%.
a) Do people from KZN perform better than the national mean? ($\alpha = 0.05$).
b) Conduct an analysis on the same data to test the research question that people
from KZN perform differently from expectation. ($\alpha = 0.05$).
c) Although the above tests have yielded similar conclusions, in what way do they
differ?
d) How would the chance of making Type I and Type II errors change if we
changed the significance level of the tests to $\alpha = 0.01$?
2.5 The weight of humans is normally distributed with a mean of 73kg and a variance of
144. To investigate whether the weight of rural South Africans is different from this
international mean, we draw a random sample of 100 rural South Africans and
calculate their mean weight to be 69kg.
a) Determine whether the weight of rural South Africans is different from the
international mean ($\alpha = 0.01$).

2.3 Paired t- test problem


Consider a situation where a researcher is interested in the effect of a treatment given to
randomly selected subjects. Measurements are made before and after application of the
treatment. The data are paired, and interest is in finding out whether the effect
is negative, nil or positive. Differencing eliminates the effect of the subject,
leaving the effect due to the treatment and random variation.
Suppose we have two treatments A and B applied to $n$ subjects randomly selected from a
normally distributed population with mean $\mu$ and variance $\sigma^2$.

Subject    Response                      Difference
           Treatment A    Treatment B    d = X - Y
1          X1             Y1             d1 = X1 - Y1
2          X2             Y2             d2 = X2 - Y2
3          X3             Y3             d3 = X3 - Y3
.          .              .              .
.          .              .              .
.          .              .              .
n          Xn             Yn             dn = Xn - Yn

We treat the new information on differences as a single-sample problem, and compute
estimates of the mean and the variance using the usual estimation formulae.
Assumption
The $d_i$'s ($i = 1, 2, \dots, n$) are a random sample from a normal population with mean $\mu_d$ and
variance $\sigma_d^2$.
Hypothesis: $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$ or $H_1: \mu_d < 0$ or $H_1: \mu_d > 0$, depending
on the available information. That is, the effect may be suspected to differ, or decrease or
increase.
Assuming $H_0: \mu_d = 0$ is true, we estimate the variance, $\sigma_d^2$, by $s_d^2$ computed as follows:
$$s_d^2 = \frac{\sum d_i^2 - \left(\sum d_i\right)^2 / n}{n - 1}$$
The calculated t-value, denoted by $t_{calc}$, is
$$t_{calc} = \frac{\bar{d} - 0}{s_d/\sqrt{n}}$$
Where
$\bar{d}$ = Average of the paired differences
$s_d$ = Standard deviation of the paired differences
$n$ = Number of pairs.
Reject $H_0$ if $|t_{calc}| > t_{\alpha/2,\,n-1}$ and conclude that there is enough evidence that the treatment had
a significant effect at level $\alpha$, where $\alpha$ measures the strength of evidence against $H_0$. The
values of the t distribution are given in Table B.

The following example illustrates the computation procedures and the type of inference
one can draw.

Example 2.2
A market research study was conducted in which each family was asked to record its total
monthly purchases at Pick n Pay and its total monthly purchases at Checkers. The study
wishes to estimate the difference in average monthly expenditures by families at the two
shopping centres. The data from 10 families selected at random are presented below, in
rands.
Family   Pick n Pay   Checkers   Difference, d   (d - dbar)^2
1        140          100         40             (40 - 5)^2  = 1 225
2        120          150        -30             (-30 - 5)^2 = 1 225
3        230          220         10             (10 - 5)^2  =    25
4         50           80        -30             (-30 - 5)^2 = 1 225
5         70          110        -40             (-40 - 5)^2 = 2 025
6        240          180         60             (60 - 5)^2  = 3 025
7        190          190          0             (0 - 5)^2   =    25
8        120          140        -20             (-20 - 5)^2 =   625
9        250          190         60             (60 - 5)^2  = 3 025
10       100          100          0             (0 - 5)^2   =    25
Sum                               50                          12 450

$$\bar{d} = \frac{\sum d}{n} = \frac{50}{10} = 5$$

$$s_d = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} = \sqrt{\frac{12450}{10 - 1}} = 37.193$$

Critical region:
The t-table value with 9 degrees of freedom at the 5 % significance level is t = 2.262. (See
Table B.) We would reject $H_0$ if $|t_{calc}| > t_{\alpha/2,\,n-1} = 2.262$.
Test statistic:

$$t_{calc} = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{5 - 0}{37.193/\sqrt{10}} = 0.425$$

Conclusion:

We fail to reject $H_0$ since $|t_{calc}| = 0.425$ is not greater than 2.262. We conclude that there is no
significant difference in average spending by families at the two shopping centres, based
on the available data.
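The computations in Example 2.2 can be reproduced with the following Python sketch, which uses only the standard library. The data and the critical value 2.262 are taken from the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Monthly purchases (rands) of the 10 families in Example 2.2.
pick_n_pay = [140, 120, 230, 50, 70, 240, 190, 120, 250, 100]
checkers = [100, 150, 220, 80, 110, 180, 190, 140, 190, 100]

# Paired differences, d = X - Y.
d = [x - y for x, y in zip(pick_n_pay, checkers)]
n = len(d)

d_bar = mean(d)   # average of the paired differences
s_d = stdev(d)    # sample standard deviation (divisor n - 1)

# Calculated t-value under H0: mu_d = 0.
t_calc = (d_bar - 0) / (s_d / sqrt(n))

t_crit = 2.262    # two-sided 5 % table value with 9 d.f., as quoted in the example
reject_h0 = abs(t_calc) > t_crit

print(f"d_bar = {d_bar}, s_d = {s_d:.3f}, t_calc = {t_calc:.3f}, reject H0: {reject_h0}")
```

The critical value is entered by hand because the Python standard library has no t distribution; a statistical package would normally supply it.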

Exercises 2.2

2.6 Given two independent samples with the following information


Item   Sample 1   Sample 2
1      19.6       21.3
2      22.1       17.4
3      19.5       19.0
4      20.0       21.2
5      21.5       20.1
6      20.2       23.5
7      17.9       18.9
8      23.0       22.4
9      12.5       14.3
10     19.0       17.8

a) State the null hypothesis.
b) What assumption would you make?
c) Based on these paired samples, test at the $\alpha = 0.10$ level whether the true
average paired difference is 0.
d) State your conclusions.

2.7 A random sample of 15 cars passed through an urban speed trap. The following
speeds in km per hour were recorded.
Car      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Speed   71  49  68  65  64  57  80  63  62  69  45  61  66  66  55

a) Estimate $\mu$, the true mean speed of cars passing that point.

b) Given that the speed limit is 60 km/h, test $H_0: \mu = 60$ to check if it is
reasonable.
c) Set 95 % confidence limits about the true mean.
d) What assumption did you have to make?

2.8 A consumer organisation has sampled 20 owners of TV sets and recorded the time in
years to the sets' first repair. The data are:
1.97  2.87  3.01  2.75  2.09  1.34  1.62  1.10  2.24  1.79
2.81  0.57  3.17  3.89  3.10  2.05  1.01  4.16  2.59  1.67

a) Estimate the mean time to first repair for the population sampled.
b) Set 99 % confidence limits for the true mean, $\mu$.
c) Use the results obtained in part (b) to test $H_0: \mu = 0$.

2.9 The following data arise from a survey of aged people in Durban. The variable
recorded per person is the monthly expenditure on medicines, recorded in rands.
34.42 9.66 40.40 31.00 6.30
52.82 2.20 20.00 6.50 48.24
57.13 24.64 37.80 36.00 58.16
a) The claim recently made in a local newspaper was that the mean monthly
expenditure on medicines for elderly people exceeds R 30.00 per month. Test
this claim at 5 % significance level.
b) Use the sample to estimate the mean annual expenditure on medicines for this
population and set 95 % confidence limits to this quantity.

2.10 The repainting of lines on freeways represents a large proportion of the
expenditure of a roads department. It is decided that a new, cheaper paint should be
tested. Twenty-five randomly chosen 1-km stretches are painted with the new paint.
After a month an assessment is done at each site. The durability of the paint is measured
by an instrument using a scale on which the current paint registers 39.2. For the
sample of 25 sites, the following calculations have been done.
$\bar{x}$ = 39.65    $s$ = 3.02
The department wishes to test (using $\alpha = 0.01$) whether the new paint is better than
the current paint.
a) State the appropriate null and alternative hypotheses.
b) Test your null hypothesis at the required level.
c) State your conclusions.

2.11 A random sample of nine local school children yielded the following sample
statistics for the random variable X = IQ.
$\bar{x} = 107$, $s = 3.88$
a) Find a 99 % confidence interval for $\mu$, the mean IQ, and use this interval to
test $H_0: \mu = 100$ at the 1 % significance level.

2.12 A random sample of 16 pharmacies was selected in the Witwatersrand area. The
price in rands charged for 100 tablets of a particular drug by each pharmacy in the
sample was:
3.75 4.10 10.40 7.50 2.95 5.75 7.50 8.90
5.85 7.65 8.10 6.50 7.50 5.50 8.00 4.50

a) Estimate the mean price of 100 tablets of this drug for pharmacies in the area.
b) Set 95 % confidence limits to your estimate.
c) Carry out a test of significance to assist you in deciding whether the mean
price of this drug in the Witwatersrand area is lower than R 7.95 (which is
known to be the mean selling price in Cape Town).

2.13 Suppose that, after sampling 20 records at random, a sociologist finds the
following durations (to the nearest tenth of a year) of marriage ending in divorce.
10.1 21.2 13.8 11.1 10.9 9.2 6.6 12.3 7.8 15.1
4.31 4.9 5.4 8.7 4.81 9.42 6.3 24.5 21.6 2.61
a) Set up an appropriate null hypothesis and alternative hypothesis.
b) Determine whether these data provide proof, at the 5 % significance level, that
the mean duration of marriage ending in divorce in the population has
decreased from an earlier value of 14.9.
c) What distribution assumption is made in applying the hypothesis test?

2.14 A designer claims that by smoothing out parts of a particular automobile body to
reduce air resistance, the average fuel consumption can be reduced below 8.0 litres
per 100 km. In an attempt to support the claim, the designer has obtained a sample of
fuel consumption for 15 modified automobiles. The sample mean was 7.4 l/100 km and
the standard error of the mean was 0.8 l/100 km.
a) Do these results provide sufficient evidence to support the claim?

2.15 To test the durability of a new paint for white centre lines, a highway department
painted test strips across heavily travelled roads in eight different locations, and
automatic counters showed that they deteriorated after having been crossed by the
following number of cars (in thousands).
142.6 167.8 136.5 108.3 126.4 133.7 162.0 149.0
a) Find 95 % confidence limits for $\mu$, the average number of crossings that this
paint can withstand before it deteriorates.
b) Find 99 % confidence limits for $\mu$, the average number of crossings that this
paint can withstand before it deteriorates.
c) Test the paint manufacturer's claim that $\mu = 160.0$.


2.4 Inferences about two population means

Oftentimes, investigators are interested in assessing the performance of one population
compared to another, for instance, comparing the performance of a new technology to an
old one or comparing a new variety against an old variety. The two populations are
assumed to be independent. Suppose we have two random samples, one of size $n_1$,
$X_1, X_2, X_3, \ldots, X_{n_1}$, drawn from a normally distributed population with mean $\mu_1$ and variance
$\sigma^2$, and the other of size $n_2$, $Y_1, Y_2, Y_3, \ldots, Y_{n_2}$, drawn from a normally distributed
population with mean $\mu_2$ and variance $\sigma^2$. Consider the sample means $\bar{x}$ and $\bar{y}$ as
unbiased estimators of the population means $\mu_1$ and $\mu_2$, respectively. Also, let the sample
variances $s_1^2$ and $s_2^2$ both be unbiased estimators of the population variance, $\sigma^2$.

Assumption
The two populations of Xs and Ys are independent and normally distributed with
possibly different means and a common variance.
The setting up of hypotheses depends on the study objectives. The following are possible
hypotheses:

Hypothesis: $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$ or $H_a: \mu_1 < \mu_2$ or $H_a: \mu_1 > \mu_2$


Test Statistic:
The difference $\bar{X} - \bar{Y}$ is normally distributed with mean $\mu_1 - \mu_2$ and variance
$\sigma^2 (1/n_1 + 1/n_2)$. We estimate $\sigma^2$ by a pooled variance, $s_p^2$, where
\[ s_p^2 = \frac{\text{Total Sum of Squares}}{\text{Total Degrees of Freedom}} = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \]
Testing the equality of the two means against an alternative hypothesis of not equal
demands that the standard error of the mean difference be computed first. For a combined
sample size $n_1 + n_2 < 30$, we use the t-distribution; otherwise, the normal distribution would
apply. The appropriate test statistic, assuming a common variance estimated by a pooled
variance, is computed as
\[ t_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Conclusion:
Reject $H_0: \mu_1 = \mu_2$ in support of $H_a: \mu_1 \neq \mu_2$ if $|t_{cal}| > t_{Table}$, obtained with $n_1 + n_2 - 2$
degrees of freedom at the $\alpha$-level of significance.
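The pooled two-sample calculation can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function name `pooled_t_statistic` and the two small samples are hypothetical and are not part of the module. The critical value must still be read from a t table.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (t_calc, degrees_of_freedom, standard_error) for two independent samples.

    Assumes both samples come from normal populations with a common variance.
    """
    n1, n2 = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Test statistic under H0: mu1 - mu2 = 0.
    t_calc = (mean(x) - mean(y)) / se
    return t_calc, n1 + n2 - 2, se


# Hypothetical samples, for illustration only.
sample_a = [5.1, 4.9, 5.6, 5.8, 5.3]
sample_b = [4.2, 4.8, 4.4, 4.6, 4.9]
t_calc, df, se = pooled_t_statistic(sample_a, sample_b)
print(f"t = {t_calc:.3f} with {df} degrees of freedom (S.E. = {se:.3f})")
```

The returned value of t is then compared with the table value at the chosen significance level, exactly as in the conclusion rule above.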


In case a common population variance cannot be assumed, the variances being,
say, $\sigma_1^2$ and $\sigma_2^2$, an approximate t-distribution is used with degrees of freedom, df,
computed as
\[ df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}} \]
The computed df are rounded down to the nearest integer and then the t-test is used, noting that
the fewer the degrees of freedom, the lower the power. The variance of the difference
between the two sample means is estimated as
\[ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}. \]
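As a rough illustration of the unequal-variance case, the sketch below computes the approximate degrees of freedom in the standard Welch-Satterthwaite form (squared terms in the denominator, added together). The function name `welch_degrees_of_freedom` and the sample values are hypothetical, chosen only to show the call.

```python
import math
from statistics import variance


def welch_degrees_of_freedom(x, y):
    """Return (df, df rounded down) for the approximate t-test with unequal variances."""
    n1, n2 = len(x), len(y)
    a = variance(x) / n1  # estimated variance of the first sample mean
    b = variance(y) / n2  # estimated variance of the second sample mean
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return df, math.floor(df)


# Hypothetical samples with visibly different spreads.
df, df_rounded = welch_degrees_of_freedom([12, 15, 11, 19, 14], [30, 52, 41, 25, 60, 38])
print(f"approximate df = {df:.2f}, use {df_rounded} in the t table")
```

The approximate df never exceeds $n_1 + n_2 - 2$, which is why this test is somewhat less powerful than the pooled test when the variances really are equal.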

Example 2.3

Consider data collected to study the heat-producing capacity of coal. The heat-producing
capacity (in millions of calories per ton) was measured on random samples of five
specimens each of coal from two mines. The following are the data and the test statistics.
Mine 1: 8380 8210 8360 7840 7910
Mine 2: 7540 7720 7750 8100 7690

Suppose we assume the sample from Mine 1 to be normally distributed with mean $\mu_1$ and
variance $\sigma^2$, and that from Mine 2 to be also normally distributed with mean $\mu_2$ and variance
$\sigma^2$.
Hypothesis:
Test $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$
Significance level: $\alpha = 0.05$
Critical region: Reject $H_0$ if $|t_{cal}| > t^*$, where $t^*$ is the t-table value corresponding to $2(n - 1) = 8$ degrees of freedom at the 5 % significance level.
Test statistics:
$\bar{x}_1 = 8140$ and $SS(x_1) = \sum (x - \bar{x})^2 = 253\,800$
$\bar{x}_2 = 7760$ and $SS(x_2) = 170\,600$
$n_1 = n_2 = n = 5$
$t^* = 2.306$ with 8 degrees of freedom at the 5 % significance level.
Thus, the pooled estimate of the variance $\sigma^2$ is
\[ s^2_{pooled} = \frac{SS(x_1) + SS(x_2)}{2(n - 1)} = \frac{253\,800 + 170\,600}{2(5 - 1)} = 53\,050 \]

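The estimated standard error of the mean difference is
\[ S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2 \left( \frac{1}{n} + \frac{1}{n} \right)} = \sqrt{53\,050 \left( \frac{2}{5} \right)} = 145.67 \]
The calculated value of t is
\[ t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{8140 - 7760}{145.67} = \frac{380}{145.67} = 2.61 \]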

Conclusion:
We reject $H_0$ since $|t_{calc}| = 2.61$ is greater than $t^* = 2.306$ and conclude that the
heat-producing capacity of the coal from the two mines is not the same. The coal from Mine 1
is superior by $380 \pm 145.67$ millions of calories per ton.

Example 2.4
A researcher wants to determine whether a given drug has any effect on the scores of
human subjects performing a task of psychomotor co-ordination. Nineteen subjects were
randomly selected from a subject pool and then randomly assigned to two groups. The
nine subjects in group 1 received an oral administration of the drug prior to being tested.
The ten subjects in group 2 received a placebo at the same time. The scored results were
as follows:
Group 1 ($n_1 = 9$): 12 14 10 8 16 5 3 9 11
Group 2 ($n_2 = 10$): 21 18 14 20 11 19 8 12 13 15

Total score for
Group 1: $\sum X = 88$
Group 2: $\sum X = 151$

On the assumption that the scores are distributed normally, we wish to test whether the
two groups are significantly different at 5 % significance level.
Hypothesis
$H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$
Significance level: $\alpha = 0.05$
Critical region:
Reject $H_0$ if $|t_{cal}| > t^*$, where $t^*$ is the t-table value corresponding to $n_1 + n_2 - 2 = 9 + 10 - 2 = 17$
degrees of freedom at the 5 % significance level. Thus $t^* = 2.110$.
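Test statistics:
The means and sums of squares, SS(x), for
Group 1: $\bar{x}_1 = 9.778$ and $SS(x_1) = 135.56$
Group 2: $\bar{x}_2 = 15.100$ and $SS(x_2) = 164.90$
\[ s^2_{pooled} = \frac{SS(x_1) + SS(x_2)}{n_1 + n_2 - 2} = \frac{135.56 + 164.90}{9 + 10 - 2} = 17.6742 \]
Hence
\[ S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{17.6742 \left( \frac{1}{9} + \frac{1}{10} \right)} = 1.93 \]
Thus,
\[ t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{9.778 - 15.10}{1.93} = -2.758 \]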

Conclusion:
We reject $H_0$ since $|t_{calc}| = 2.758$ is greater than $t^* = 2.110$ and conclude that the scores of
the experimental group are significantly lower than those of the control group, say by $5.32 \pm 1.93$ units.

2.5 The process of setting hypotheses and testing


The investigator sets up the study objectives, which are translated into questions that need
to be answered by the data collected. These questions are formulated in the form of
hypotheses. The null hypothesis always has the equality sign whereas the alternative
hypothesis is stated as either unequal, decrease or increase based on the available
information on the direction of the reaction. The basic forms of the null and alternative
hypotheses for a two-sample test are as follows.
Null hypothesis
$H_0: \mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$

Possible alternative hypotheses
i) $H_a: \mu_1 \neq \mu_2$ or $\mu_1 - \mu_2 \neq 0$ < two tailed test >
ii) $H_a: \mu_1 < \mu_2$ or $\mu_1 - \mu_2 < 0$ < one tailed test >
iii) $H_a: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$ < one tailed test >

A conventional rule is to state the null hypothesis with an equality sign and the alternative
hypothesis with a strict inequality.
The following are the necessary steps to follow when performing hypothesis test.
Step 1: State the assumptions associated with the random variable(s) related to the
population(s) under investigation. Often, the random effect is assumed to be
independently and identically distributed normal with a fixed mean and a constant
variance.
Step 2: State both the null and alternative hypotheses. Normally, the alternative
hypothesis is the statement we wish to prove.
Step 3: State the significance level. This is the probability of a type I error, that is, the
probability of rejecting the null hypothesis when it is true. It is commonly referred to as an
experimental error rate. The conventional levels are 10 %, 5 % and 1 %.
Step 4: Set the critical rule or decision rule. It is becoming traditional to use the p-value,
which is the probability, computed assuming the null hypothesis is true, of obtaining a test
statistic at least as extreme as the one observed. The
smaller the p-value, the stronger is the evidence against the null hypothesis. Reject the
null hypothesis when the p-value is less than the significance level. The rejection region
consists of those values of the test statistic that will lead to the rejection of the null
hypothesis.
Step 5: Compute the test statistics. These are sample mean(s), variance of the mean(s),
standard error of the mean or mean difference, and the degrees of freedom. In general, the
test statistic is calculated from the sample data and is used to test the null hypothesis.
Step 6: Draw the conclusions based on these statistics when compared to the critical
value(s). If the null hypothesis is rejected then declare that there is sufficient evidence
in support of the alternative hypothesis. Otherwise, there is not enough evidence.

Two probability distributions, namely the normal and t-distributions, are used. The
t-distribution is used when the variance(s) is/are unknown and the sample size is less than 30.
The normal distribution is used when the sample size is greater than or equal to 30. The
variance is still estimated if it is unknown.

2.6 Inferences about a population proportion.

Suppose P denotes the proportion in the population with the attribute. This is referred to as the
probability of success.
Suppose that a random sample of size n is drawn from a binomial distribution. Let y be
the number in the sample with the attribute. We estimate P using a statistic p as
\[ p = \frac{y}{n} = \text{Total number with the attribute divided by sample size.} \]
For large n, the sampling distribution of p is approximately normal with mean P and variance
\[ \frac{P(1 - P)}{n}. \]
Suppose the hypothesis is stated as
$H_0: P = P_0$ against $H_a: P \neq P_0$
Under the assumption that $H_0$ is true, the variance of p becomes
\[ \frac{P_0 (1 - P_0)}{n} \]
Thus,
\[ z_{calc} = \frac{p - P_0}{\sqrt{\frac{P_0 (1 - P_0)}{n}}} \]
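The test for a single proportion can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `proportion_z_test` and the counts in the usage lines are hypothetical. The normal tail probabilities are obtained from the complementary error function rather than from a printed table.

```python
import math


def proportion_z_test(successes, n, p0):
    """Return (z_calc, one-sided upper p-value, two-sided p-value) for H0: P = p0."""
    p_hat = successes / n
    # Standard error of p computed under H0, i.e. using p0 rather than p_hat.
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    two_sided = math.erfc(abs(z) / math.sqrt(2))    # 2 * P(Z >= |z|)
    return z, upper_tail, two_sided


# Hypothetical counts: 58 successes out of 150 trials, testing H0: P = 0.30.
z, p_upper, p_two = proportion_z_test(58, 150, 0.30)
print(f"z = {z:.3f}, one-sided p = {p_upper:.4f}, two-sided p = {p_two:.4f}")
```

Because the code keeps the standard error unrounded, its z value can differ in the second decimal place from a hand calculation that rounds the standard error first.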

Example 2.5

A recent report claimed that 20% of all college graduates find a job in their chosen area of
study. A survey of a random sample of 500 graduates found that 110 obtained work in
their area. Is there statistical evidence to refute the claim?
Solution:
If P denotes the proportion of college graduates who find a job in their area of study, then
$H_0: P = 0.20$ against $H_a: P \neq 0.20$
We denote the test statistic by p, the proportion of successes in the sample.
Thus,
\[ p = \frac{110}{500} = 0.22 \]
The standardised test statistic is
\[ z_{calc} = \frac{p - 0.20}{\sqrt{\frac{(0.20)(0.80)}{500}}} = \frac{0.22 - 0.20}{0.018} = 1.11 \]
In case of a one-sided test,
p-value = $P(z \geq 1.11) = 0.1335$.
For a two-sided test,
p-value = 2(0.1335) = 0.267
Conclusion: There is not enough evidence to reject the null hypothesis at the 5 % significance
level. Thus, we cannot refute the claim.

Testing with confidence intervals

The null hypothesis $H_0: P = P_0$ against $H_a: P \neq P_0$ is rejected at an $\alpha$ level of significance
if and only if the hypothesised value $P_0$ falls outside a $(1 - \alpha)100\%$ confidence interval for
P.
Example 2.6

A news report in a major city stated that 80% of all violent crimes in that city involves
firearms. A survey of all violent crimes in the city for the past 2 years revealed that of 283
violent crimes, 240 involved firearms. Determine with a confidence interval whether the
news report is correct.
Solution:

$H_0: P = 0.80$ against $H_a: P \neq 0.80$

Given,
$P_0 = 0.80$
n = 283
y = 240
A 95% confidence interval for P is
\[ p \pm 1.96 \sqrt{\frac{p(1 - p)}{n}} \]
But,
\[ p = \frac{240}{283} = 0.848 \]
Thus,
\[ 0.848 \pm 1.96 \sqrt{\frac{0.848(0.152)}{283}} = 0.848 \pm 0.042 \]
The 95% confidence interval for P is (0.806, 0.890). We reject $H_0$ at the 5% significance
level, because the hypothesised value, $P_0 = 0.80$, does not fall in the interval.
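The confidence-interval approach to testing a proportion can be sketched in Python as below. The function names `proportion_confidence_interval` and `rejects_by_interval` are hypothetical helpers, and the default multiplier 1.96 assumes a 95% interval based on the normal approximation.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (lower, upper) limits of the normal-approximation interval for P."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def rejects_by_interval(successes, n, p0, z=1.96):
    """Return True when the hypothesised value p0 falls outside the interval."""
    lower, upper = proportion_confidence_interval(successes, n, z)
    return not (lower <= p0 <= upper)


# Counts from the firearms example: 240 of 283 violent crimes, hypothesised P0 = 0.80.
lower, upper = proportion_confidence_interval(240, 283)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print("reject H0" if rejects_by_interval(240, 283, 0.80) else "fail to reject H0")
```

Note that the interval uses the sample proportion p in the standard error, whereas the z test uses the hypothesised value, so the two approaches can occasionally disagree for borderline samples.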

Exercises 2.3
2.16 A claim was made that 60% of the adult population thinks that there is too much
violence on television. A random sample of 200 adults found that 110 thought that
there is too much violence on television. Is this enough evidence to reject the claim?
2.17 The government believes that no more than 25% of all college students would
favour reducing the penalties for the use of marijuana. A sample of 2 400 college
students revealed that 750 favour reducing the penalties.

a) Set up null and alternative hypotheses to evaluate the government's claim.
b) Give the form of the standardised test statistic and calculate its value.
c) Compute the p-value and determine whether there is sufficient statistical
evidence to reject the government's claim.
d) State your conclusion.
2.18 A psychologist has developed a new aptitude test and believes that 80% of the
public should score above 50 on the test. From a sample of 200 people, 164 scored
above 50.

a) Is there statistical evidence that the claim made by the psychologist is not
valid?
b) For the results to be significant at the 5% level of significance, how many out
of 200 will have to score above 50 on the aptitude test?


3. ANALYSIS OF VARIANCE
3.1 Introduction to completely randomised design

Testing of two population means is achieved through t-distribution procedures. The
experimenter is at liberty to select the type I error rate (the probability of rejecting the null
hypothesis when it is true) when setting the critical region or rejection rule. We draw inference on
the population means based on the sample data. Using the t-test becomes complicated when
there are more than two population means. For instance, with four
treatments, we require $\binom{4}{2} = 6$ (which reads 4 choose 2) pair-wise comparisons, namely,
{(1,2), (1,3), (1,4), (2,3), (2,4), and (3,4)}. We have, say, $\alpha$, type I error for each
comparison. The probability of making at least one type I error increases rapidly with the
number of pair-wise comparisons. The analysis of variance is used as an alternative procedure for testing
simultaneously the equality of population means, while using the same type I error,
say, $\alpha$.
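To see how quickly the overall error rate grows, the sketch below counts the pair-wise comparisons and evaluates $1 - (1 - \alpha)^k$, the chance of at least one type I error in k comparisons. This formula assumes the k comparisons are independent, which pair-wise t-tests on the same data are not, so the figures are only indicative. The function names are hypothetical.

```python
import math


def pairwise_comparisons(t):
    """Number of distinct pairs among t treatments (t choose 2)."""
    return math.comb(t, 2)


def familywise_error_rate(alpha, comparisons):
    """Chance of at least one type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


for t in (3, 4, 5, 6):
    k = pairwise_comparisons(t)
    rate = familywise_error_rate(0.05, k)
    print(f"{t} treatments: {k} comparisons, overall error rate about {rate:.3f}")
```

A single F test in the analysis of variance avoids this build-up because all the means are compared at once at the chosen level.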
The design of an experiment is the process of planning a study. Conclusions are drawn
from such experiments. The analysis of variance is concerned with the comparison of t
population (treatment) means $\mu_1, \mu_2, \ldots, \mu_t$. We would like to use sample results to
draw inference on the means.

The model

A statistical model for an observation made on subject j receiving treatment i, denoted $y_{ij}$,
is expressed as
\[ y_{ij} = \mu_i + \varepsilon_{ij}, \quad \text{where } \mu_i = \mu + \tau_i, \quad i = 1, 2, \ldots, t, \quad j = 1, 2, \ldots, n_i \]
$\mu$ = Overall mean.
$\mu_i$ = Mean of the ith population or treatment.
$\tau_i$ = ith treatment effect.
$\varepsilon_{ij}$ = Random effect due to the jth replication receiving the ith treatment.
The statistical model can be written in two forms, namely the means model and
the fixed effects model. That is
Means model: $y_{ij} = \mu_i + \varepsilon_{ij}$
Fixed effects model: $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
The null hypothesis related to the fixed effects model is
$H_0: \tau_1 = \tau_2 = \ldots = \tau_t = 0$
where $\tau_i = \mu_i - \mu$, i = 1, 2, . . ., t.
The null hypothesis for the means model is stated as $H_0: \mu_1 = \mu_2 = \ldots = \mu_t = \mu$, and the
alternative hypothesis as $H_a$: Not all treatment means are equal (i.e. $\mu_i \neq \mu_{i'}$ for some
$i \neq i'$).
Assumptions

The statistical model for a completely randomised design is based on the following
assumptions.
Each population is normally distributed. That is, $y_{ij} \sim \text{i.i.d. } N(\mu_i, \sigma^2)$ for each i = 1, 2, . . ., t.
The variance, denoted $\sigma^2$, is the same for each population.
The observations must be independent.
Usually, the above assumptions are summarised by the following mathematical
expression:
$\varepsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2)$, i = 1, 2, . . ., t, j = 1, 2, . . ., $n_i$,
where the first i denotes independently, the second i denotes identically, d denotes
distributed, and N denotes the normal distribution, here with mean zero and
constant variance denoted by $\sigma^2$.

The design layout

Suppose an investigator intends to carry out an experiment to investigate the performance
of four varieties. Suppose the available experimental material allows for 12 homogeneous
experimental units. Thus, each variety occupies three units, say plots. Denote the 4
varieties by V1, V2, V3, and V4. A simple randomisation approach is to write down the
variety numbers on 12 pieces of paper (each variety three times), fold each of them and
shuffle them. Pick each at random and allocate the variety to the units sequentially. For
this example, the layout as a completely randomised design would be

V4 V1 V3 V2
V2 V3 V4 V1
V4 V2 V1 V3
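The paper-slip randomisation can be imitated on a computer. The sketch below is one possible way to do it with Python's standard `random` module; the function name `completely_randomised_layout`, the seed value and the 3 by 4 printing of the plots are illustrative choices, not part of the module.

```python
import random


def completely_randomised_layout(treatments, replications, seed=None):
    """Return a random allocation of treatments to units, each repeated `replications` times."""
    rng = random.Random(seed)  # a fixed seed makes the layout reproducible
    labels = [trt for trt in treatments for _ in range(replications)]
    rng.shuffle(labels)  # plays the role of shuffling the folded slips of paper
    return labels


layout = completely_randomised_layout(["V1", "V2", "V3", "V4"], replications=3, seed=2002)
# Print the 12 units as 3 rows of 4 plots.
for row_start in range(0, len(layout), 4):
    print(" ".join(layout[row_start:row_start + 4]))
```

Recording the seed alongside the field plan lets the same layout be regenerated later if the allocation ever needs to be checked.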

The estimation

Under $H_0: \mu_1 = \mu_2 = \ldots = \mu_t = \mu$, each sample observation would have been drawn from the
same normal probability distribution with mean $\mu$ and variance $\sigma^2$. Recall that the
sampling distribution of the sample mean, $\bar{y}$, for a simple random sample of size n from a
normal population is normally distributed with mean $\mu$ and standard deviation $\sigma_{\bar{y}} = \sigma / \sqrt{n}$.
The best estimate of the mean of the sampling distribution of $\bar{y}$ is the mean of the
individual sample means. That is
\[ \bar{y}_{..} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \ldots + \bar{y}_{t.}}{t} \]

The between samples variation provides a good estimate of $\sigma^2$ only if the null
hypothesis is true. If the null hypothesis is false, the between samples variation
overestimates $\sigma^2$. The within samples variation provides a good estimate of $\sigma^2$ in either
case. If the null hypothesis is true, both estimates will be similar and their ratio will be
close to 1. If the null hypothesis is false, the between samples estimate will be larger than the within
samples estimate, and their ratio will be large.
The analysis of variance is a statistical technique for testing the hypothesis that the means
of three or more populations (treatments) are equal. Also, it can be used to test the
hypothesis that the means of two populations are equal.
Pooling is the process of combining the results of two or more independent simple
random samples to provide an estimate of $\sigma^2$. When a simple random sample is selected
from each population, each of the sample variances provides an unbiased estimate of the
population variance $\sigma^2$. The estimate of $\sigma^2$ obtained by combining the individual
estimates into an overall estimate is called the within samples estimate.

The sample mean for the ith treatment
\[ \bar{y}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}, \quad i = 1, 2, \ldots, t \]
The sample variance for the ith treatment
\[ s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 \]
The total sample size is $n_T = n_1 + n_2 + \ldots + n_t$.
Recall that variance is a measure of the dispersion in a set of responses and is calculated
by determining the average squared distance of a set of responses from its mean.

3.2 Between samples estimate of population variance

Consider sample means each estimating the population mean for each treatment under
investigation. These sample means are statistics with a sampling distribution that is
normal. The sampling distribution of the sample mean for treatment i is
$\bar{y}_{i.} \sim N(\mu_i, \sigma^2 / n_i)$, i = 1, 2, . . ., t.
The investigator wishes to assess how each of these sample means differs from the estimate
$\bar{y}_{..}$, which estimates the overall population mean, $\mu$. The component that measures the
deviation of the sample means from the overall sample mean is called the mean square
between, denoted MSB. This is defined as
\[ MSB = \frac{SSB}{t - 1} = \frac{1}{t - 1} \sum_{i=1}^{t} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 \]
where SSB = Sum of squares between treatment means.
The squaring of the deviations is done to remove the negative signs, and the divisor, t - 1, is
the corresponding degrees of freedom. Each of the deviations is weighted by the
corresponding number of replications $n_i$.
The MSB is sometimes referred to as systematic variance and can be explained in terms of
the independent variables or independent groups or treatments. For instance, suppose we
wish to test the performance of a pressure cooker at three different pressure settings.
We run the pressure cooker at 20, 40 and 60 kilopascals and record the temperature at
which the water boils. We take five temperature readings at each pressure. The average
squared deviation from the overall mean of the means of the five readings at each pressure
provides the measure of systematic variance.
The component MSB is an unbiased estimator of $\sigma^2$ under $H_0$. The MSB is not an unbiased
estimator of $\sigma^2$, and overestimates it, if the means of the t populations are not equal.

3.3 Within samples estimate of population variance

The component that measures the deviation of each observation from its own treatment mean is
called the mean square within, denoted MSW. It is also an estimate of $\sigma^2$ and is defined as
\[ MSW = \frac{SSW}{n_T - t} = \frac{1}{n_T - t} \sum_{i=1}^{t} (n_i - 1) s_i^2 \]
where SSW = Sum of squares within the treatments, and
$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$ is the sample variance for treatment i.
The estimate MSW is not influenced by whether or not the null hypothesis is true, unlike
the MSB. It always provides an unbiased estimate of $\sigma^2$.
The MSW is referred to as the error variance or random error. This refers to the random
variation among the observations within each sample, which we find when we select random
samples from a population.
3.4 Comparing the variance estimates

The sampling distribution of the ratio $MSB/MSW$ of the two independent estimates of $\sigma^2$
follows an F distribution under $H_0$. Thus, under $H_0$, and when the assumptions are valid, the
sampling distribution of $MSB/MSW$ is an F distribution with numerator degrees of
freedom equal to t - 1 and denominator degrees of freedom equal to $n_T - t$. In general, an F
distribution is the distribution of a ratio of two independent chi-square random variables,
each divided by its degrees of freedom. Thus, the range of F is from zero to positive infinity.
The value of $MSB/MSW$ is inflated because MSB overestimates $\sigma^2$ when the means of the t
populations are not equal. Hence, we will reject $H_0$ if the resulting value of $MSB/MSW$
appears to be too large to have been selected at random from an F distribution with
degrees of freedom t - 1 in the numerator and $n_T - t$ in the denominator. The value of
$MSB/MSW$ that will cause us to reject $H_0$ depends on $\alpha$, the level of significance. Table D
provides the sampling distribution of $MSB/MSW$ and the rejection region associated with a
level of significance equal to $\alpha$, where $F_\alpha$ denotes the critical value.

To read the value of $F_\alpha$ from the table, you need the numerator and denominator degrees
of freedom and the level of significance, $\alpha$. Proceed to locate the $F_\alpha$ value that
corresponds to the within degrees of freedom in the first column and the between degrees
of freedom in the first row for a given $\alpha$-level. Often, the F table is provided for $\alpha$ =
0.05 or 0.01. You will note later that most statistical software produces all these statistics. An
important statistic among these is the p-value, which is the probability computed using the
calculated F-value. The decision to reject or not to reject the null hypothesis is based on
the comparison made between the p-value and the $\alpha$-level. We reject the null hypothesis
if p-value < $\alpha$ and otherwise fail to reject.

3.5 Computation formulae

The formulae previously discussed are difficult to apply. Equivalent formulae that are
easy to use are presented below.
\[ \text{Sum of squares total (SST)} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij}^2 - C.F. \]
where C.F. is the correction factor calculated as
\[ C.F. = \frac{1}{n_T} \left( \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij} \right)^2 . \]
\[ \text{Sum of squares treatment (SSTrt)} = \sum_{i=1}^{t} \frac{1}{n_i} y_{i.}^2 - C.F. \]
where $y_{i.} = \sum_{j=1}^{n_i} y_{ij}$ is the total for the ith treatment.
Sum of squares error (SSE) = SST - SSTrt.

The mean squares are computed as the ratio of the sum of squares to the degrees of freedom.
The analysis of variance table, denoted ANOVA, is a convenient display of calculations of
between, within and total sum of squares, the associated degrees of freedom and mean
squares. It is composed of non-negative values. A general ANOVA table follows:

Source of    Degrees of   Sum of     Mean
Variation    Freedom      Squares    Square    Fcalc      Ftable
Between      t - 1        SSB        MSB       MSB/MSW
Within       nT - t       SSW        MSW
Total        nT - 1       SST

The analysis of variance can be viewed as the process of partitioning the total sum of
squares and degrees of freedom into two sources, between and within. Dividing the sum
of squares by the appropriate degrees of freedom provides the variance estimates and the
F value used to test the hypothesis of equal population means. The degrees of freedom
and the sum of squares are the only additive columns. Thus, we need to compute only two of
the three entries in each of these columns and the third can be obtained by subtraction.
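The computation formulae and the partitioning described above can be sketched as a small Python function. This is an illustrative sketch only: the function name `one_way_anova`, the dictionary keys and the sample scores are hypothetical. The treatment sum of squares plays the role of SSB and the error sum of squares the role of SSW; the critical F value still has to be read from the F table.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of samples, one list per treatment."""
    t = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / n_total  # C.F.
    sst = sum(y ** 2 for g in groups for y in g) - correction
    ss_trt = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # between
    sse = sst - ss_trt  # within, obtained by subtraction
    df_between, df_within = t - 1, n_total - t
    msb = ss_trt / df_between
    msw = sse / df_within
    return {
        "df": (df_between, df_within, n_total - 1),
        "ss": (ss_trt, sse, sst),
        "ms": (msb, msw),
        "f_calc": msb / msw,
    }


# Hypothetical scores for three treatments with unequal replication.
table = one_way_anova([[12, 15, 11, 14], [18, 17, 20], [10, 9, 13, 12, 11]])
print("df (between, within, total):", table["df"])
print("F calculated:", round(table["f_calc"], 2))
```

Because the sums of squares are additive, the function computes only the total and treatment sums of squares directly and obtains the error sum of squares by subtraction.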
Example 3.1

To test if the mean time needed to mix a batch of material is the same for machines
produced by three manufacturers, the following data on the time (in minutes) needed to
mix the material were obtained.

Manufacturer                          1       2       3
                                     20      28      20
                                     26      26      19
                                     24      31      23
                                     22      27      22
Sample mean ($\bar{y}_{i.}$)         23      28      21
Sample variance ($s_i^2$)          6.67    4.67    3.33

Test if the population mean times needed to mix a batch of material differ for the three
manufacturers at 5% significance level.

Solution

Treatments, t = 3, and sample size per treatment, $n_1 = n_2 = n_3 = n = 4$.
\[ \bar{y}_{..} = (\bar{y}_{1.} + \bar{y}_{2.} + \bar{y}_{3.})/3 = (23 + 28 + 21)/3 = 24 \]
or
\[ \bar{y}_{..} = \frac{1}{n_T} \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij} = \frac{288}{12} = 24 \]
\[ SSB = \sum_{i=1}^{t} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 = 4(23 - 24)^2 + 4(28 - 24)^2 + 4(21 - 24)^2 = 104 \]
\[ MSB = \frac{SSB}{t - 1} = \frac{104}{2} = 52 \]
\[ SSW = \sum_{i=1}^{t} (n_i - 1) s_i^2 = 3(6.67) + 3(4.67) + 3(3.33) = 44.01 \]
\[ MSW = \frac{SSW}{n_T - t} = \frac{44.01}{12 - 3} = 4.89 \]
\[ F_{calc} = \frac{MSB}{MSW} = \frac{52}{4.89} = 10.63 \]
$F_{table, 0.05}(2, 9) = 4.26$


The ANOVA Table
Source of    Degrees of   Sum of     Mean
Variation    Freedom      Squares    Square    Fcalc     Ftable, 0.05
Between      2            104.00     52.00     10.63     4.26
Within       9            44.01      4.89
Total        11           148.01

Conclusion: Since Fcalc= 10.63 > Ftable, 0.05(2, 9) = 4.26, we reject the null hypothesis that
the mean time needed to mix a batch of material is the same for each manufacturer at 5%
significance level. This means that there is at least one significant difference between the
means.

The rejection of the null hypothesis using F does not pinpoint where the specific
differences are. Further analysis is therefore required to investigate which treatment
means are different. Multiple comparison tests (some are more conservative than
others) are used to achieve this. If the structure of the treatment means is known prior to
the experiment, contrast or regression techniques could be used. For instance, if the
treatments have a qualitative structure, then reasonable contrasts can be constructed. If the
structure is quantitative, then regression techniques can be applied. If the treatment
structure is not known at all, which is unusual, multiple comparison test techniques can be
used.


It should be noted that if we got a nonsignificant F test in the analysis of variance, it
would indicate the failure of the experiment to detect any difference among treatments.
A nonsignificant F test does not, in any way, prove that all treatments are the same, because
the failure to detect a treatment difference could be the result of either a very small or nil
treatment difference or a very large experimental error, or both. Thus, one needs to
examine the size of the experimental error and the numerical difference among treatment
means whenever the F test is nonsignificant.

Steps in testing hypothesis

Below are useful steps to follow when conducting a test of hypothesis.

State the statistical model and the associated assumptions based on the design of
experiment used and treatment structure.
State the null and alternative hypothesis, based on the interest of the investigator.
Choose the level of significance , which depends on the desired confidence to be
attached to the results.
Develop the critical region (rejection region) which depends on the alternative
hypothesis.
Compute the test statistics, say, sum of squares, mean squares, F-calculated and p-values.
Draw conclusions based on the analysis of variance results. Further statistical analyses
are directed by the outcome of the ANOVA results.

3.6 Advantages and disadvantages

A completely randomised design has the following advantages over other designs:

Easy to set up and analyse;


Provides maximum number of degrees of freedom for estimation of error variation;
Missing values cause no difficulty.

Disadvantages

The approach is insensitive when the experimental units are heterogeneous. This is
because it assumes the units to be homogeneous;
It is difficult to maintain homogeneity among units when the number of treatments is
large. Thus, the approach is suitable only for small numbers of treatments.
Exercise 3.1
3.1 Decide, using the F table, whether the following calculated F values would be significant
at the 0.01 significance level:
i) F at df1 = 14 and df2 = 100
ii) F at df1 = 2 and df2 = 40
iii) F at df1 = 9 and df2 = 30


3.2 Consider the following set of data on scores.
Group A: 62 60 50 48 47
Group B: 60 60 58 53 49
Group C: 59 49 49 47 42

i) Find the total sum of squares, the within groups sum of squares, and the
between groups sum of squares.
ii) Present your results in an analysis of variance table.
iii) Is there enough evidence at the 5 % significance level to suggest that the three
treatments differ significantly?
3.3 A researcher investigates emotional stability in three groups of children: a control
group who come from a stable background, children who have been physically
abused, and children who have been sexually abused. Higher scores indicate greater
stability. The researcher wants to test the hypothesis that any abused child shows less
emotional stability. The following are the data.

Control    Physically abused    Sexually abused
8          3                    4
9          4                    2
7          3                    2
8          2                    3
9          4                    3

a) State both the null and alternative hypotheses.
b) Set up an ANOVA table and test whether there is a significant difference between
the groups at a 5 % significance level.
3.4 A researcher wants to know what type of humour appeals most to students. She looks
at three different types: slapstick, puns and stand-up comedy. Three different groups
laughed as follows.
Slapstick
5
3
5
4
6

Puns Stand-up comedy


3
8
6
6
4
4
9
3
3
3

Conduct a one-way analysis of variance to test if the three types differ
significantly at 5 % significance level.
3.5 A study investigated the perception of corporate ethical values among individuals
specialising in marketing. The following data on scores were recorded where higher
scores indicate higher ethical values.
                   Marketing    Marketing
                   Managers     Research     Advertising
                   6            5            6
                   5            5            7
                   4            4            6
                   5            4            5
                   6            5            6
                   4            4            6
Sample mean        5            4.5          6
Sample variance    0.8          0.3          0.4

Using a 5 % significance level, test if there are significant differences in
perception for the groups of specialists.
3.6 As a result of the recent revisions to the tax law, investment in equity instruments has
become increasingly attractive. The accompanying table lists the annual internal rates
of return for several different investment portfolios managed by three separate
investment firms.
Firm A    Firm B    Firm C
16.9      15.1      10.0
15.0      12.5      13.1
16.2      13.0      12.3
15.8      11.8      10.2
17.1                 8.9

Carry out the analysis of the above data to test the equality of the three investment
firms with respect to the mean annual internal rate of return earned on portfolios. Use
a 5 % significance level.
3.7 Samples of peanut butter produced by three different manufacturers are tested for
aflatoxin content with the following results:

         Brand
A      B      C
2.5    2.5    2.3
6.3    1.8    1.5
3.1    3.6    0.4
2.7    4.1    3.8
5.5    1.2    2.2
4.3    0.7    1.0

a) Determine whether there is a significant difference between the brand
means at 5 % significance level.
b) Outline the assumptions for a valid analysis of variance.
3.8 The following are the litres per 100 kilometres which a test driver obtained with
measured quantities of five brands of petrol containing various additives:

                  Brand
S        T        C        M        E
8.71     8.11     8.71     9.80     9.40
11.20    8.71     8.71     11.20    11.76
10.69    7.35     9.80     10.23    11.20

Test the hypothesis that the five brands of petrol give the same results. Use the 1 %
significance level.
3.9 A postgraduate student in the Department of Dietetics studied the effect of diet on
blood sugar. Originally 32 subjects were selected for their uniformity and assigned
randomly to four diet groups; eight individuals per diet group. A mishap resulted in
the loss of the records for six subjects. The following are the results for the remaining
cases:

        Diet
I     II    III   IV
24    26    30    30
18    21    32    28
25    23    29    27
23    25    25    23
22    20    31    31
      24    33    25
      20    29
            28

Determine whether the four diets have different effects on blood sugar levels. Use a 5 %
significance level.
3.10 In an assessment of five different reading programmes, a number of children
judged to be equivalent in abilities on the basis of pre-testing were assigned at
random to the five programmes. Assessments on the reading capacities of the children
completing the programme produced the following scores:

        Programme
I     II    III   IV    V
63    81    72    59    62
67    71    77    65    71
59    74    79    70    73
60    70    83    71    67
72    73    70    67    68
58    83    82    60    61
65    79    71    62    68
64    80    77    66    66

Determine whether there are any differences in the five programmes. Use
a 5 % significance level.
3.11 A random sample of 16 observations was selected from each of four populations. A
portion of the ANOVA table is given below:
Source of
Variation
Between
Within
Total

Degrees of
Freedom

Sum of
Squares

Mean
Square
400

1 500

a) Complete the missing entries in the ANOVA table.


b) Test whether the treatment means of the four populations are equal,
using a 5 % significance level.
3.12 Random samples of 25 observations were selected from each of three populations.
For these data, sum of squares between (SSB) = 120 and sum of squares within
(SSW) = 216.
a) Set up the ANOVA table for this problem.
b) What is the critical value of F? Use a 5 % significance level.
c) Are the three population means equal, at 5 % significance level?


4. BEYOND ANALYSIS OF VARIANCE


4.1 Introduction

A common first step is to subject the data to an analysis of variance to determine whether
or not significant differences exist among the treatment means. The overall F test provides
statistical evidence of the existence of some significant difference between the treatments
under investigation. For instance, a rejection of the null hypothesis indicates that the
treatment means are not all equal. That is, either $\mu_1 \neq \mu_2$ or $\mu_1 \neq \mu_3$ or $\mu_2 \neq \mu_3$, or $\mu_1 \neq \mu_2 \neq \mu_3$.
It cannot tell us where the differences between the means lie. While the t test applies only
to two treatments, the F test applies to two or more treatments. Interest would be in finding
the source of the difference that contributed to an overall significant F test. Various
procedures are in use under such circumstances. A recent approach suggests the use of
regression techniques if the treatments are of a quantitative nature. If they are of a qualitative
nature and a priori information on the treatment structure is available, appropriate
contrasts can be formulated and tested through the ANOVA. If no structure
is known and the treatments are of a qualitative nature, multiple comparison procedures can
then be applied.

4.2 Multiple comparison procedures

After the analysis of variance, the data are further analysed in an attempt to explain the
nature of the response in more detail. A number of statistical procedures may be used for
this purpose. Among these are:

Fitting response functions using regression techniques.
Planned sets of contrasts among means, or groups of means.
Pairwise multiple comparison procedures.

Some of these procedures are appropriate with some kinds of treatments and entirely
inappropriate with others. These statistical test procedures are used under different
circumstances. Most commonly used are the post hoc tests which are modified t tests
known to control for familywise error rates.
Fisher's least significant difference (LSD)

This is the most widely used method for making pairwise comparisons of treatment
means. Suppose the overall F test led to a rejection of $H_0: \mu_1 = \mu_2 = \mu_3$. The following
could be the possible causes:
i) $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$
ii) $H_0: \mu_1 = \mu_3$ against $H_a: \mu_1 \neq \mu_3$
iii) $H_0: \mu_2 = \mu_3$ against $H_a: \mu_2 \neq \mu_3$

To test any of the above possibilities, t-test procedures can be applied. The test statistic
for Fisher's LSD at 5 % significance level is computed as follows:

\[ \mathrm{LSD}_{0.05} = t_{0.025} \times \mathrm{s.e.}(\bar{y}_i - \bar{y}_{i'}) \quad \text{with the MSW degrees of freedom,} \]

where

\[ \mathrm{s.e.}(\bar{y}_i - \bar{y}_{i'}) = \sqrt{\mathrm{MSW}\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)} \]

Reject $H_0: \mu_i = \mu_{i'}$ if $|\bar{y}_i - \bar{y}_{i'}| > \mathrm{LSD}_{0.05}$, in support of the alternative at 5 % significance
level.
Fisher's LSD test is commonly referred to as the protected or restricted LSD. It is
applied only when the overall F test is significant.
Example 4.1

Consider the information obtained in Example 3.1.

Treatment, i:                    1     2     3
Sample size, $n_i$:              4     4     4
Sample mean, $\bar{y}_{i.}$:     23    28    21

MSW = 4.89

\[ \mathrm{s.e.}(\bar{y}_i - \bar{y}_{i'}) = \sqrt{\mathrm{MSW}\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)} = \sqrt{4.89\left(\frac{1}{4} + \frac{1}{4}\right)} = \sqrt{2.445} = 1.5636 \]

The table value with 9 degrees of freedom at 5 % significance level is $t_{0.025} = 2.262$.
Thus,

\[ \mathrm{LSD}_{0.05} = 2.262 \times 1.5636 = 3.537 \]

Reject $H_0: \mu_i = \mu_{i'}$ if $|\bar{y}_i - \bar{y}_{i'}| > \mathrm{LSD}_{0.05}$.

Treatment difference                      Difference    Status
$\bar{y}_1 - \bar{y}_2$ = 23 - 28         -5            Significant
$\bar{y}_1 - \bar{y}_3$ = 23 - 21          2            Not significant
$\bar{y}_2 - \bar{y}_3$ = 28 - 21          7            Significant

The overall F test assured us that at least two of the treatment means are significantly
different at 5 % significance level. Further analysis using Fisher's LSD indicates that the
differences lie between treatment means 1 and 2 and between treatment means 2 and 3.
Treatment mean 1 is not significantly different from treatment mean 3 at 5 % significance level.
A confidence interval estimate of the form $(\bar{y}_i - \bar{y}_{i'}) \pm \mathrm{LSD}_{0.05}$ can also be used for the
same test. If the interval includes the value 0, we fail to reject the hypothesis that the
treatment means are equal. However, if the confidence interval does not include the value
0, we conclude that there is a difference between the treatment means.
Here, $(\bar{y}_i - \bar{y}_{i'}) \pm \mathrm{LSD}_{0.05}$ gives the following 95 % confidence intervals for the treatment
differences:

Trt 1 vs Trt 2:    (-5 ± 3.537) = (-8.537, -1.463)
Trt 1 vs Trt 3:    (2 ± 3.537) = (-1.537, 5.537)
Trt 2 vs Trt 3:    (7 ± 3.537) = (3.463, 10.537)
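
The LSD calculation for all pairs can be sketched as a short Python routine. This is a minimal illustration under stated assumptions: the function name `fisher_lsd` and its argument names are hypothetical, the inputs (means 23, 28, 21; n = 4 per treatment; MSW = 4.89) are those of Example 4.1, and the critical value 2.262 is taken from the t table rather than computed.

```python
from itertools import combinations
from math import sqrt


def fisher_lsd(means, sizes, msw, t_crit):
    """Return, per treatment pair: (difference, LSD, significant?, confidence interval)."""
    results = {}
    for i, j in combinations(sorted(means), 2):
        # Standard error of the difference between two treatment means
        se = sqrt(msw * (1 / sizes[i] + 1 / sizes[j]))
        lsd = t_crit * se
        diff = means[i] - means[j]
        # Significant when |difference| exceeds LSD, i.e. the interval excludes 0
        results[(i, j)] = (diff, lsd, abs(diff) > lsd, (diff - lsd, diff + lsd))
    return results


means = {1: 23, 2: 28, 3: 21}   # treatment means from Example 4.1
sizes = {1: 4, 2: 4, 3: 4}      # sample sizes from Example 4.1
out = fisher_lsd(means, sizes, msw=4.89, t_crit=2.262)  # t(0.025; 9 df) from the t table

for pair, (diff, lsd, significant, interval) in out.items():
    print(pair, diff, round(lsd, 3), significant, tuple(round(x, 3) for x in interval))
```

Because the standard error is computed per pair, the same routine also covers unequal sample sizes.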

Comparison-wise Type I error rate: This is the error rate that indicates the level of
significance associated with a single statistical test. Thus, the comparison-wise Type I
error rate remains $\alpha$, say $\alpha = 0.05$.
Experiment-wise Type I error rate: Suppose we conduct pairwise tests and, for each
single t-test, we set $\alpha = 0.05$. The probability that we will not make a Type I error is
1 - 0.05 = 0.95 for each test. Assuming the tests are independent, the probability that we
will not make a Type I error in two consecutive t-tests is (0.95)(0.95) = 0.9025. Thus, the
probability of making at least one Type I error is 1 - 0.9025 = 0.0975. When we sequentially
test two sets of hypotheses, the Type I error rate associated with this is not 0.05, but
actually 0.0975. This Type I error rate is called the experiment-wise Type I error rate.

In general, suppose we consider k treatments. The number of possible pairwise
comparisons, C, is

\[ C = \binom{k}{2} = \frac{k!}{(k-2)!\,2!} = \frac{k(k-1)}{2} \]

The probability of making at least one Type I error is

Experiment-wise Type I error rate, $\alpha_{EW} = 1 - (1 - \alpha)^C$

Fisher's LSD procedure leads to an experiment-wise Type I error rate that depends on
the comparison-wise Type I error rate, $\alpha$, and the number of comparisons, C.
Bonferroni adjustment: $\alpha_{EW} = 1 - (1 - \alpha)^C \le C\alpha$.

Thus, the maximum probability of making a Type I error for the overall experiment, $\alpha_{EW}$,
can be maintained if we use a comparison-wise Type I error rate of size $\alpha_{EW}/C$.
Example 4.2

Refer to the information in Example 4.1, using $\alpha = 0.05$.

Number of treatments, k = 3, thus the number of possible pairwise comparisons is

\[ C = \frac{k(k-1)}{2} = \frac{3(3-1)}{2} = 3. \]

\[ \alpha_{EW} = 1 - (1 - \alpha)^C = 1 - (1 - 0.05)^3 = 1 - (0.95)^3 = 0.143 \]

\[ C\alpha = 3(0.05) = 0.15 \]

We use a comparison-wise Type I error rate of size $\alpha_{EW}/C = 0.143/3 = 0.048$ (here C = k = 3),
since $\alpha_{EW} = 0.143 < C\alpha = 0.15$. To hold the experiment-wise rate at no more than 0.05
instead, each comparison would be tested at 0.05/3 = 0.0167.
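
The growth of the experiment-wise error rate with the number of treatments is easy to tabulate. The sketch below is illustrative only: the function name `experimentwise_rate` is hypothetical, and the formula assumes, as in the text, that the pairwise tests are independent.

```python
from math import comb


def experimentwise_rate(k, alpha):
    """Return (C, experiment-wise rate, Bonferroni bound) for k treatments."""
    c = comb(k, 2)                    # number of pairwise comparisons, k(k-1)/2
    alpha_ew = 1 - (1 - alpha) ** c   # P(at least one Type I error), independent tests
    bonferroni_bound = c * alpha      # upper bound C * alpha
    return c, alpha_ew, bonferroni_bound


c, alpha_ew, bound = experimentwise_rate(k=3, alpha=0.05)
per_comparison = 0.05 / c             # Bonferroni level holding the overall rate at 0.05
print(c, round(alpha_ew, 3), round(bound, 2), round(per_comparison, 4))

# The experiment-wise rate rises quickly as treatments are added
for k in (3, 4, 5, 6):
    print(k, experimentwise_rate(k, 0.05))
```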

Tukey's procedure: Allows one to perform tests of all possible pairwise comparisons
and still maintain an overall experiment-wise Type I error rate, such as $\alpha_{EW} = 0.05$. It
uses the studentised range probability distribution and assumes that all treatment means
are based on the same sample size, n, and have equal variances. However, a generalised
Tukey test can be used for the unequal sample size case. The sampling distribution of

\[ q = \frac{\bar{y}_{\max} - \bar{y}_{\min}}{\sqrt{\mathrm{MSW}/n}} \]

where

$\bar{y}_{\max}$ = largest sample mean,
$\bar{y}_{\min}$ = smallest sample mean, and
MSW = mean square within treatments,

follows a studentised range distribution. Tukey's significant difference is denoted

\[ \mathrm{TSD} = q \sqrt{\frac{\mathrm{MSW}}{n}} \]

where q is the critical value of the studentised range.
Tukey's procedure is an unprotected testing approach. Thus, Tukey's procedure provides
an alternative to analysis of variance for testing whether the treatment means of k populations
are equal. However, to use Tukey's procedure we need to estimate the population
variance using MSW.
Example 4.3

Consider the information given in Example 4.1.

Treatment, i:                    1     2     3
Sample size, $n_i$:              4     4     4
Sample mean, $\bar{y}_{i.}$:     23    28    21

MSW = 4.89
Error degrees of freedom = 9

The critical value of the studentised range, $q(\alpha, k, v)$, for k = 3 treatment means
and v = 9 error degrees of freedom at 5 % significance level is obtained from Table E. Thus,

\[ q = q(\alpha, k, v) = q(0.05, 3, 9) = 3.95 \]

Hence,

\[ \mathrm{TSD} = q \sqrt{\frac{\mathrm{MSW}}{n}} = 3.95 \sqrt{\frac{4.89}{4}} = 4.367 \]

$\bar{y}_{\max} = 28$ and $\bar{y}_{\min} = 21$

We reject $H_0: \mu_2 = \mu_3$ since $|\bar{y}_2 - \bar{y}_3| = 7 >$ TSD = 4.367 and conclude that the two
treatment means are significantly different at 5 % significance level.
Similarly, we reject $H_0: \mu_2 = \mu_1$ since $|\bar{y}_2 - \bar{y}_1| = 5 >$ TSD = 4.367 and conclude that the
two treatment means are significantly different at 5 % significance level.
But we fail to reject $H_0: \mu_1 = \mu_3$ since $|\bar{y}_1 - \bar{y}_3| = 2 <$ TSD = 4.367 and conclude that the
two treatment means are not significantly different at 5 % significance level.
These conclusions can be summarised as follows:
Ordered treatment means:

Trt 3    Trt 1    Trt 2
21       23       28
______________

Any two treatment means sharing the same line are not significantly different at 5 %
significance level.
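
The Tukey comparison in this example can be sketched as follows. It is a minimal illustration for the equal sample size case only: the function name `tukey_tsd` is hypothetical, and the critical value 3.95 is supplied from the studentised range table rather than computed.

```python
from itertools import combinations
from math import sqrt


def tukey_tsd(msw, n, q_crit):
    """Tukey's significant difference for a common sample size n."""
    return q_crit * sqrt(msw / n)


means = {1: 23, 2: 28, 3: 21}                  # treatment means from Example 4.1
tsd = tukey_tsd(msw=4.89, n=4, q_crit=3.95)    # q(0.05, 3, 9) read from the table

# A pair of means differs significantly when |difference| exceeds TSD
significant = {
    (i, j): abs(means[i] - means[j]) > tsd
    for i, j in combinations(sorted(means), 2)
}
print(round(tsd, 3), significant)
```

A single TSD value is compared against every pairwise difference, which is what keeps the experiment-wise error rate at the chosen level.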

Remark:
Multiple comparison tests are among the most often used, and most often misused, procedures.
Their purpose is to detect possible groups among a set of unstructured treatments. They are not
meant for quantitative treatments, for which fitting a response function by regression is more
appropriate. Nor are they intended to substitute for meaningful orthogonal comparisons, which
can be formulated in advance based on the treatments used. The following points should be
noted:

Care should be taken to select a statistical procedure which is appropriate for
the data being analysed.
For experiments involving factorial sets of treatments or graded levels of
quantitative factors there is almost always a statistical procedure which can be
specified in advance and which is more appropriate than a multiple comparison
test.
For experiments involving qualitative treatments it is often possible to form
planned sets of comparisons to answer the objectives of the experiment.
Multiple comparison tests may be useful for grouping means from experiments
involving unstructured qualitative treatments.
Indiscriminate use of multiple comparison tests can result in loss of
information and reduced efficiency when more appropriate procedures are
available.

Exercise 4.1

Refer to Exercise 3.1 to answer the following questions.

4.1 Refer to the results from question 3.2 to compute the LSD at 5 % significance level and
determine which treatment means are different.
4.2 Refer to the results from question 3.3 to compute Tukey's (TSD) critical value at 5 %
significance level and determine which treatment means are different.
4.3 Refer to the results from question 3.6 to construct a 95 % confidence interval for each
of the pairwise differences between treatment means. Use these intervals to test the equality of
these means.
4.4 Refer to the results from question 3.8 to compute Tukey's (TSD) critical value at 5 %
significance level and determine which treatment means are different.
4.5 Refer to the results from question 3.7 to compute the LSD at 5 % significance level and
determine which treatment means are different.


5. RANDOMISED COMPLETE BLOCK DESIGN


5.1 Introduction

Extraneous factors, not considered in the experiment, can inflate the mean square within
(MSE) component. This causes the F value to be small, thus signalling no significant
difference among treatment means when in fact such a difference exists. We wish to
compare the treatment means when all known variation is controlled, or rather eliminated,
from the experimental error. One way of eliminating the known variation from the
experimental error is by grouping the experimental units into homogeneous groups,
commonly known as blocks.
For instance, if an experiment is to be carried out in KZN and race or gender is
known to have an effect, the setting of the experiment should take this known
variation into consideration. Race could be used to form the blocks. The treatments under
study should be applied within each block, with an independent randomisation in each block.
Or if a study involves assessment of different types of fuel in a given city, the cars
should be considered as blocks because they are known to differ in fuel consumption.
Or if different management practices are to be compared within the farming community in
KZN, the size of farm (small, medium or large) should be considered as a
blocking factor. Farms of different sizes are known to differ, and this information should be
incorporated into the experiment.
Or if an experiment involves comparing different animal feeds, where the breed is
known to have an effect, then the breed should be used as a blocking factor.
Or if an agronomist wants to conduct an experiment on a field known to have different
levels of soil fertility, then this information should be used as a blocking factor. And so
on.
The randomised complete block design (RCBD) draws its name from the fact that the
treatments are allocated at random in each block. Independent randomisation is applied in
each block. Complete implies that each block contains a complete set of treatments.
This is an extension of a completely randomised design (CRD) in a situation where
experimental units are no longer homogeneous. The principle behind this design is to
divide all experimental units into homogeneous groups before applying the treatments.
Each group is referred to as a block or replication in case of a balanced design. Balanced,
because each treatment occurs equally often in each block. Differences between blocks
cancel out for any comparison of treatments. The criteria applied in grouping should
ensure that there is minimum variation within the blocks and maximum variation between
them. Differences between blocks are then removed from the random or unexplained
variation. The following should be noted with an RCBD:
a) Blocks should be laid perpendicular to the gradient in case of a directional variation.
b) Blocks need not be continuous.
c) It is possible to replicate within a block. That is to say, a treatment may appear more than
once in a block.
d) A block should signify a known variation that needs to be controlled by the experiment.
e) All the treatments should be randomised within each block, ensuring independent
randomisation in each block.
Even when no obvious natural blocks exist, it is still sensible to define blocks
representing major patterns of variation. Consider an experiment involving different
varieties. Harvesting may be carried out one block per day if it is impossible to
harvest all blocks on a single day. Such blocking controls the variability that may be
introduced from day to day (for example, due to rain).
Missing data can also occur in an RCBD. An advantage of the design is that the analysis
can still be performed in the event of losing a complete block.
A major restriction in the use of this design is the requirement that all treatments must
appear in each block.

5.2 Aspect of blocking

The analysis of a completely randomised design assumes that the experimental units are
homogeneous. Under this assumption, any difference between the treatment groups is
expected to be due to the treatments only. Hence, the within-treatments variation
is assumed to be purely random. The experimental error is overestimated if the
assumption is not true.
The blocking technique is meant to utilise a priori information concerning the nature of
the experimental units. Blocking is therefore defined as the process of grouping the
experimental units into homogeneous groups such that the variation within the blocks is
minimised and that between blocks is maximised. The approach aims at obtaining an
unbiased estimate of the experimental error.
The field layout

Consider an experiment set to investigate the effect of 5 nitrogen levels on the growth of a
new variety. Three types of soils are used as the blocking factor. Thus, 5 x 3 experimental
units were used. We denote the nitrogen levels by N0, N1, N2, N3, and N4. Suppose the
soil types were clay, loam and sand. The five nitrogen levels are randomly assigned to
each block. At each stage, a new randomisation scheme is used. The layout is presented
below.

Block 1:    N1    N3    N0    N2    N4

Block 2:    N2    N4    N1    N0    N3

Block 3:    N4    N0    N1    N3    N2
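
A layout of this kind can be generated with an independent shuffle of the treatments in each block. The sketch below is illustrative only: the function name `rcbd_layout`, the block labels and the seed value are hypothetical, and the resulting order will generally differ from the layout shown above.

```python
import random


def rcbd_layout(treatments, blocks, seed=None):
    """Assign every treatment once to each block, in an independent random order."""
    rng = random.Random(seed)      # a seed only makes the layout reproducible
    layout = {}
    for block in blocks:
        order = list(treatments)   # fresh copy: each block holds the complete set
        rng.shuffle(order)         # new randomisation for this block
        layout[block] = order
    return layout


treatments = ["N0", "N1", "N2", "N3", "N4"]    # the five nitrogen levels
blocks = ["Block 1", "Block 2", "Block 3"]     # one block per soil type
layout = rcbd_layout(treatments, blocks, seed=812)

for block, order in layout.items():
    print(block, order)
```

Copying the treatment list before each shuffle is what makes the design complete: no block can lose or repeat a treatment.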

5.3 The model

Suppose the experimental material is grouped into b homogeneous groups (referred to as
blocks) and the t treatments under investigation are randomly assigned, ensuring independent
randomisation in each block. Suppose $y_{ij}$ is the response variable corresponding to
treatment i measured on block j, where i = 1, 2, . . ., t and j = 1, 2, . . ., b. We assume one
measurement on each treatment in each block. Also, there is no treatment by block
interaction. The response variable $y_{ij}$ is partitioned into components due to the overall
mean, block, treatment and random error effects. The mathematical expression is

\[ y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} \]

where

$\mu$ = the overall mean
$\tau_i$ = the ith treatment effect
$\beta_j$ = the jth block effect
$\varepsilon_{ij}$ = the random effect

The random effect $\varepsilon_{ij}$ is assumed to be independently and identically distributed normal
with zero mean and constant variance (i.e. $\varepsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2)$). The model is also assumed
to be additive.
The data collected from b blocks involving t treatments are usually summarised in a two-way
table with treatment and block totals as follows:

                            Block                        Treatment
Treatment    1      2      3      ...    b               totals
1            y11    y12    y13    ...    y1b             y1.
2            y21    y22    y23    ...    y2b             y2.
.            .      .      .      ...    .               .
.            .      .      .      ...    .               .
.            .      .      .      ...    .               .
t            yt1    yt2    yt3    ...    ytb             yt.
Block
totals       y.1    y.2    y.3    ...    y.b             y..

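Notations

The marginal treatment and block totals are denoted

\[ y_{i.} = \sum_{j=1}^{b} y_{ij} \quad \text{and} \quad y_{.j} = \sum_{i=1}^{t} y_{ij}, \]

respectively. The overall total is denoted

\[ y_{..} = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}. \]

Similarly, the marginal means for the treatments and blocks are

\[ \bar{y}_{i.} = \frac{1}{b} \sum_{j=1}^{b} y_{ij} \quad \text{and} \quad \bar{y}_{.j} = \frac{1}{t} \sum_{i=1}^{t} y_{ij}, \]

respectively. The grand mean is

\[ \bar{y}_{..} = \frac{1}{bt} \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}. \]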

Definition formulae

A sum of squares is a squared deviation summed over the levels. Thus, the sum of
squares total is a measure of the overall deviation of each observation from the overall mean.
These deviations are summed over the levels of treatments and blocks.

\[ \text{Sum of squares total, } \mathrm{SSTotal} = \sum_{i=1}^{t} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{..})^2 \]

The total sum of squares is partitioned into three components: those due to blocks,
treatments and random effects. The sum of squares block is a measure of the deviation of
block means from the overall mean.

\[ \text{Sum of squares block, } \mathrm{SSBlk} = t \sum_{j=1}^{b} (\bar{y}_{.j} - \bar{y}_{..})^2 \]

Similarly, the sum of squares treatment is a measure of the deviation of treatment means from
the overall mean.

\[ \text{Sum of squares treatment, } \mathrm{SSTrt} = b \sum_{i=1}^{t} (\bar{y}_{i.} - \bar{y}_{..})^2 \]

The sum of squares error is a measure of within experimental unit variation. That is, the
random variation among units treated alike. It is also referred to as a measure of
uncontrollable variation within the experimental units.

\[ \text{Sum of squares error, } \mathrm{SSE} = \sum_{i=1}^{t} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2 \]
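
The definition formulae translate directly into code, which also makes it easy to confirm that the three components add up to the total. This is a sketch under stated assumptions: the function name `rcbd_sums_of_squares` is hypothetical, the data table is a made-up toy example (2 treatments in rows, 3 blocks in columns), and none of it comes from the module's data.

```python
def rcbd_sums_of_squares(table):
    """table[i][j] is the response for treatment i in block j."""
    t, b = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (b * t)
    trt_means = [sum(row) / b for row in table]
    blk_means = [sum(row[j] for row in table) / t for j in range(b)]

    cells = [(i, j) for i in range(t) for j in range(b)]
    ss_total = sum((table[i][j] - grand_mean) ** 2 for i, j in cells)
    ss_blk = t * sum((m - grand_mean) ** 2 for m in blk_means)
    ss_trt = b * sum((m - grand_mean) ** 2 for m in trt_means)
    # Residual after removing both the treatment mean and the block mean
    ss_err = sum(
        (table[i][j] - trt_means[i] - blk_means[j] + grand_mean) ** 2
        for i, j in cells
    )
    return ss_total, ss_blk, ss_trt, ss_err


# Hypothetical toy data: 2 treatments (rows) by 3 blocks (columns)
toy = [[4, 6, 8],
       [5, 9, 10]]
ss_total, ss_blk, ss_trt, ss_err = rcbd_sums_of_squares(toy)
print(ss_total, ss_blk, ss_trt, ss_err)
print(abs(ss_total - (ss_blk + ss_trt + ss_err)) < 1e-9)  # additivity check
```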

Computation formulae

The analysis using the definition formulae is tedious. Computation formulae that are
algebraically equivalent to the definition formulae are often used instead.

Usually the first item to compute is the correction factor, which is the sum of squares
for the mean. This requires adding all the bt observations, squaring the result and dividing it
by the total number of observations, bt. Thus,

\[ \text{Correction factor, } \mathrm{C.F.} = \frac{\left( \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij} \right)^2}{bt} \]

The sum of squares total requires each of the bt observations to be squared and summed, and
the correction factor to be subtracted from the result. Thus,

\[ \mathrm{SSTotal} = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}^2 - \mathrm{C.F.} \]

An easier way to compute the sum of squares block and the sum of squares treatment is to
construct a two-way table of totals, both body and marginal. To compute the sum of squares
block, square each block total, add the squares, divide the sum by the number of treatments, t,
and then subtract the correction factor. Thus,

\[ \mathrm{SSBlk} = \frac{1}{t} \sum_{j=1}^{b} y_{.j}^2 - \mathrm{C.F.} \]

Similarly, the sum of squares for treatment is obtained by squaring each treatment total,
adding the squares, dividing the sum by the number of blocks, b, and then subtracting the
correction factor. Thus,

\[ \mathrm{SSTrt} = \frac{1}{b} \sum_{i=1}^{t} y_{i.}^2 - \mathrm{C.F.} \]

The additivity of the model allows the sum of squares error to be computed by
subtracting both SSBlk and SSTrt from SSTotal. Thus,

SSE = SSTotal - SSBlk - SSTrt

The above sums of squares are called corrected or adjusted sums of squares. The unadjusted
or uncorrected sums of squares are obtained when the correction factor is not subtracted
during the computation.
The total degrees of freedom (df), computed by subtracting one from the total number of
observations, are bt - 1. These are partitioned into degrees of freedom due to treatments,
blocks and error: (t - 1) df due to treatments, (b - 1) df due to blocks and (b - 1)(t - 1) df due to
error.
Computation of mean squares

The mean squares are computed by dividing each sum of squares by its degrees of
freedom. Under the model assumptions, each sum of squares divided by the common variance,
$\sigma^2$, follows a chi-square distribution (for blocks and treatments, when the corresponding
null hypothesis is true).

\[ \text{Mean square blocks, } \mathrm{MSBlk} = \frac{1}{b-1} (\mathrm{SSBlk}) \]

The quantity SSBlk/$\sigma^2$ is distributed chi-square with b - 1 degrees of freedom.

\[ \text{Mean square treatment, } \mathrm{MSTrt} = \frac{1}{t-1} (\mathrm{SSTrt}) \]

Similarly, SSTrt/$\sigma^2$ is distributed chi-square with t - 1 degrees of freedom.

\[ \text{Mean square error, } \mathrm{MSE} = \frac{1}{(b-1)(t-1)} (\mathrm{SSE}) \]

Also, SSE/$\sigma^2$ is distributed chi-square with (b - 1)(t - 1) degrees of freedom.

Computation of the F-value

The ratio of MSTrt to MSE has an F distribution with (t - 1) numerator degrees of freedom
and (b - 1)(t - 1) denominator degrees of freedom. Both quantities MSTrt and MSE are
unbiased estimators of the common variance, $\sigma^2$, when the null hypothesis of
equality of treatment means is true. That is, $H_0: \tau_1 = \tau_2 = \dots = \tau_t = 0$, where $\tau_i$ is the
ith treatment effect. When the treatment effects are not equal, MSTrt tends to be larger than
MSE. The larger the ratio, the more likely we are to reject the null hypothesis in favour of
the alternative. Therefore, the calculated F value for testing the null hypothesis at a specified
significance level is computed as

\[ F_{calc} = \frac{\mathrm{MSTrt}}{\mathrm{MSE}} \]

The calculated F value is compared against an F table value obtained with (t - 1) numerator
df and (b - 1)(t - 1) denominator df, at the $\alpha$ significance level. The null hypothesis is rejected if
$F_{calc}$ is greater than $F_{Table}$.
Similarly, the ratio of MSBlk to MSE is distributed F with (b - 1) numerator degrees of
freedom and (b - 1)(t - 1) denominator degrees of freedom. Often, the test is not performed
simply because the information about the blocks is known a priori. The hypothesis tested by
this quantity depends on whether the blocks are considered random or fixed
effects. When blocks are considered fixed effects, the quantity

\[ F_{calc} = \frac{\mathrm{MSBlk}}{\mathrm{MSE}} \]

tests $H_0: \beta_1 = \beta_2 = \dots = \beta_b = 0$ against $H_a$: at least two block effects are different, where
$\beta_j$ is the jth block effect.
When the blocks are considered to be random effects, the interest would be in assessing the
block variability. This provides an indication of how effective the blocking was. The
hypothesis tested by the calculated F under this condition is

$H_0: \sigma_\beta^2 = 0$ against $H_a: \sigma_\beta^2 > 0$

Reject the null hypothesis in both cases (fixed or random effects) if $F_{calc} > F_{Table}$,
obtained using (b - 1) numerator df and (b - 1)(t - 1) denominator df, at the $\alpha$ significance level.
The above computations are summarised in a table called the analysis of variance (ANOVA)
table. The format of the ANOVA table is as follows:

Source of      Degrees of      Sum of       Mean        F
variation      freedom         squares      squares     calculated
Block          b - 1           SSBlk        MSBlk       F = MSBlk/MSE
Treatment      t - 1           SSTrt        MSTrt       F = MSTrt/MSE
Error          (b-1)(t-1)      SSE          MSE
Total          bt - 1          SSTotal

Example 5.1

An automobile dealer conducted a test to determine if the time needed to complete a
minor engine tune-up depends on whether a computerised engine analyser or an electronic
analyser is used. Because tune-up time varies among compact, intermediate, and full-size
cars, the three types of cars were used as blocks in the experiment. The data obtained are
presented below.

                          Analyser                  Block
Car                Computerised    Electronic      totals
Compact            50              42              92
Intermediate       55              44              99
Full-size          63              46              109
Treatment total    168             132             300

We consider cars to be our blocking factor and analysers to be the treatments under
investigation. Thus we have three blocks and two treatments. We wish to test the equality
of the two analyser methods at 5 % significance level. Note: this is a very insensitive
experiment because of the very few degrees of freedom for error.

Hypothesis: $H_0: \tau_1 = \tau_2 = 0$ against $H_a: \tau_1 \neq \tau_2$, where $\tau_i$ is the effect of analyser i.

Critical region: Reject $H_0: \tau_1 = \tau_2 = 0$ in favour of $H_a: \tau_1 \neq \tau_2$ if $F_{calc} > F_{Table}(0.05, 1, 2)$.
Computation of the sums of squares:

\[ \mathrm{C.F.} = \frac{\left( \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij} \right)^2}{bt} = \frac{(300)^2}{(2)(3)} = 15\,000 \]

\[ \mathrm{SSTotal} = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}^2 - \mathrm{C.F.} = 50^2 + 42^2 + \dots + 46^2 - \mathrm{C.F.} = 15\,310 - 15\,000 = 310 \]

\[ \mathrm{SSBlk} = \frac{1}{t} \sum_{j=1}^{b} y_{.j}^2 - \mathrm{C.F.} = \frac{1}{2}(92^2 + 99^2 + 109^2) - \mathrm{C.F.} = \frac{1}{2}(30\,146) - 15\,000 = 73 \]

\[ \mathrm{SSTrt} = \frac{1}{b} \sum_{i=1}^{t} y_{i.}^2 - \mathrm{C.F.} = \frac{1}{3}(168^2 + 132^2) - \mathrm{C.F.} = \frac{1}{3}(45\,648) - 15\,000 = 216 \]

SSE = SSTotal - SSBlk - SSTrt
= 310 - 73 - 216 = 21

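Computation of the mean squares:

\[ \mathrm{MSBlk} = \frac{1}{b-1}(\mathrm{SSBlk}) = \frac{1}{3-1}(73) = 36.5 \]

\[ \mathrm{MSTrt} = \frac{1}{t-1}(\mathrm{SSTrt}) = \frac{1}{2-1}(216) = 216 \]

\[ \mathrm{MSE} = \frac{1}{(b-1)(t-1)}(\mathrm{SSE}) = \frac{1}{(3-1)(2-1)}(21) = 10.5 \]

Computation of $F_{calc}$:

\[ F_{calc} = \frac{\mathrm{MSTrt}}{\mathrm{MSE}} = \frac{216}{10.5} = 20.571 \]

\[ F_{calc} = \frac{\mathrm{MSBlk}}{\mathrm{MSE}} = \frac{36.5}{10.5} = 3.476 \]

F table values:
$F_T(0.05, 1, 2) = 18.5$;    $F_T(0.05, 2, 2) = 19.0$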

The ANOVA Table

Source of     Degrees of    Sum of     Mean       F             F table,
variation     freedom       squares    squares    calculated    0.05
Cars          2             73         36.5       3.476         19.0
Analyser      1             216        216        20.571        18.5
Error         2             21         10.5
Total         5             310

Conclusions:
Reject $H_0: \tau_1 = \tau_2 = 0$ in favour of $H_a: \tau_1 \neq \tau_2$ since $F_{calc} = 20.571 > F_T(0.05, 1, 2) = 18.5$.
Thus, we have enough evidence that the two analyser methods are significantly
different at 5 % significance level.

If we assume the type of car to be a random effect, then we fail to reject $H_0: \sigma_\beta^2 = 0$
against $H_a: \sigma_\beta^2 > 0$, since $F_{calc} = 3.476 < F_T(0.05, 2, 2) = 19.0$. Thus, the variability
among the car types was not significantly different from zero.
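
The arithmetic of Example 5.1 can be checked with a short routine built on the computation formulae. This is a sketch, not the module's own procedure: the function name `rcbd_anova` and the dictionary keys are hypothetical, and the F table values 18.5 and 19.0 are entered by hand from the F table.

```python
def rcbd_anova(table):
    """ANOVA quantities for an RCBD; table[i][j] = treatment i in block j."""
    t, b = len(table), len(table[0])
    grand_total = sum(sum(row) for row in table)
    cf = grand_total ** 2 / (b * t)                    # correction factor
    ss_total = sum(y ** 2 for row in table for y in row) - cf
    trt_totals = [sum(row) for row in table]
    blk_totals = [sum(row[j] for row in table) for j in range(b)]
    ss_trt = sum(tot ** 2 for tot in trt_totals) / b - cf
    ss_blk = sum(tot ** 2 for tot in blk_totals) / t - cf
    ss_err = ss_total - ss_blk - ss_trt                # by additivity
    ms_trt = ss_trt / (t - 1)
    ms_blk = ss_blk / (b - 1)
    ms_err = ss_err / ((b - 1) * (t - 1))
    return {
        "CF": cf, "SSTotal": ss_total, "SSBlk": ss_blk, "SSTrt": ss_trt,
        "SSE": ss_err, "MSBlk": ms_blk, "MSTrt": ms_trt, "MSE": ms_err,
        "F_trt": ms_trt / ms_err, "F_blk": ms_blk / ms_err,
    }


# Example 5.1: rows are analysers, columns are compact, intermediate, full-size cars
tune_up = [[50, 55, 63],    # computerised analyser
           [42, 44, 46]]    # electronic analyser
res = rcbd_anova(tune_up)

reject_treatments = res["F_trt"] > 18.5   # F table value for (0.05, 1, 2)
reject_blocks = res["F_blk"] > 19.0       # F table value for (0.05, 2, 2)
print(res, reject_treatments, reject_blocks)
```

The same function applies to any complete table with one observation per treatment per block, so it can be reused to check hand calculations.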
Remark:
It should be noted that the multiple comparison tests discussed in Section 4 also apply to
the randomised complete block design.

Exercise 5.1

5.1 A nation-wide real estate chain is in the process of comparing townhouse prices in
four cities across the country. It is, however, known that the area of a townhouse is
also a determining factor in price fixing and should be controlled by using blocks.
Therefore in each city, the selling prices of a 90-square-metre, a 120-square-metre, a
150-square-metre, a 180-square-metre and a 210-square-metre townhouse are
randomly selected. The results are recorded to the nearest thousand Rand and are
shown below.

Townhouse                                    Port
size (m2)      Bloemfontein    Durban       Elizabeth    Joburg
90             165             185          173          200
120            198             193          181          196
150            251             215          197          278
180            312             268          229          332
210            405             381          294          446

Test if the townhouse prices in the four cities are significantly different at 5 %
significance level. (Hint: The cities are the treatments and the townhouse sizes
are the blocks).

5.2 Five different auditing procedures were compared with respect to total audit time. To
control for possible variation due to the person conducting the audit, four accountants
were selected randomly and treated as blocks in the experiment. The following values
were obtained using the ANOVA procedures:
SSTotal = 100; SSTrt = 45; SSBlk = 36.
a) Set up an ANOVA table, filling in the missing information.
b) Test to see if there is any significant difference in total audit time stemming from
the auditing procedure used. Use $\alpha$ = 0.05.
c) Determine which treatments could be significantly different, using Tukey's
procedure.
5.3 An important factor in selecting software for word-processing and data base
management systems is the time required to learn how to use a particular system. To
evaluate three file management systems, a firm designed a test involving five different
word-processing operators. Since operator variability was believed to be a significant
factor, each of the five operators was trained on each of the three file management
systems. The data obtained are presented below:

                  System
Operator      A      B      C
1            16     16     24
2            19     17     22
3            14     13     19
4            13     12     18
5            18     17     22

a) Carry out an analysis of variance and present your results in an ANOVA Table.
b) Using α = 0.05, test to see if there is any significant difference in mean
training times for the three systems.
c) Compute LSD at α = 0.05 and indicate which treatments could be significantly
different.
d) Compute TSD at α = 0.05 and indicate which treatments could be significantly
different.
e) Comment on the results obtained in parts (c) and (d).
5.4 Three groups of students are to be tested for percentage of high-level questions asked
by each group. As questions can be on various types of material, six lessons are
taught to each group and a record is made of the percentage of high-level questions
asked by each group on all six lessons.
a) Show a data layout for this situation.
b) Provide an ANOVA Table outline giving only the source of variation and degrees
of freedom.
5.5 Suppose the data from question 5.4 are as follows:

                 Group
Lesson       A      B      C
1           13     18      7
2           16     25     17
3           28     24     14
4           26     13     15
5           27     16     12
6           23     19      9

Carry out analysis of variance on this data treating each lesson as a block and
state your conclusions.
5.6 The effects of four types of graphite coaters on light box readings are to be studied. As
these readings might differ from day to day, observations are to be taken on each of
the four types every day for three days. The order of testing of the four types on any
given day can be randomised. The results are

            Graphite Coater Type
Day        M      A      K      L
1         4.0    4.8    5.0    4.6
2         4.8    5.0    5.2    4.6
3         4.0    4.8    5.6    5.0

a) State the null and alternative hypotheses to test equality of the four graphite
coater types.
b) Analyse the data as a randomised complete block design and present your
results in an ANOVA Table.
c) Determine whether the four types are significantly different at 1 % significance
level.
d) Determine which types are different at 1 % significance level using Tukey's
test procedure.
e) State your overall conclusions.
5.7 A study on a physical strength measurement in kilogrammes on seven subjects before
and after a specified training period gave the results shown below.
Subject     Pretest     Posttest
1            45.36       52.16
2            49.90       56.70
3            40.82       47.63
4            49.90       58.97
5            56.70       63.50
6            58.97       63.75
7            47.63       56.70

a) Carry out the analysis as a paired t-test, stating the hypothesis. Use α = 0.05.
b) Carry out the analysis as a randomised complete block design, using subjects
as blocks. Use α = 0.05.
c) Using the results from parts (a) and (b), verify that t² = F.

6. SPLIT-PLOT DESIGN
6.1 Introduction

A factor is a kind of treatment, and any factor can supply several treatments. For example,
if diet is a factor under consideration, then several diets can be used. If baking temperature
is a factor, then baking can be done at several temperatures. Such a factor provides a one-way
treatment structure. A researcher may be interested in determining the combined
effect of two or more factors. For instance, the interest may be in investigating the effect
of humidity on seed germination in the presence of temperature. Such a joint effect is
referred to as interaction. The process of formulating all possible combinations of the
levels of these factors produces treatment combinations, which are then randomly applied
to the experimental units. This process is called factorial arrangement.
6.2 The field layout

Consider the case of an agronomist who wishes to investigate the effect of spacing on maize
yield in the presence of nitrogen. Suppose 3 spacings (s1, s2, and s3) and 4 nitrogen levels
(n0, n1, n2, and n3) are considered. This is a two-way treatment structure with the two
factors being spacing (at 3 levels) and nitrogen (at 4 levels). We formulate all possible
combinations as
s1n0, s1n1, s1n2, s1n3, s2n0, s2n1, s2n2, s2n3, s3n0, s3n1, s3n2, s3n3
The 12 treatment combinations are randomly assigned to the experimental units according
to the experimental design used, say CRD or RCBD. These treatments should be
replicated in order to have an estimate of the experimental error needed for drawing
inferences. Sometimes it is not practical to assign these treatments completely at random
according to these designs. Suppose the study involves mechanisation (say m1, m2, m3,
etc.) as one factor and variety (v1, v2, v3, etc.) as another, where mechanisation may
refer to the method of land preparation. It is impractical to formulate these combinations and
then randomise them according to CRD or RCBD, especially when mechanisation
involves the use of farm machinery. An alternative approach is to randomise the
mechanisation factor first and then the variety over each level of the first factor. We
illustrate this point using 3 levels of one factor and 4 levels of the other factor.
Block I
    M1:   V2   V1   V4   V3
    M2:   V1   V3   V2   V4
    M3:   V4   V2   V3   V4

The process is repeated for the other replications, ensuring independent randomisation at
each stage. The process involves two stages of randomisation. In the case of RCBD, we first
randomise the three levels of mechanisation in each block and then the levels of variety
over each level of mechanisation.
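
The two-stage randomisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_plot_layout`, the number of blocks and the seed are hypothetical choices, and a real trial would use whatever randomisation procedure the researcher has documented.

```python
import random


def split_plot_layout(whole_levels, sub_levels, n_blocks, seed=None):
    """Return one independently randomised split-plot layout per block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        # Stage 1: randomise the whole-plot factor within the block.
        whole_order = rng.sample(whole_levels, len(whole_levels))
        # Stage 2: randomise the sub-plot factor separately in each whole plot.
        block = [(w, rng.sample(sub_levels, len(sub_levels))) for w in whole_order]
        layout.append(block)
    return layout


mechanisation = ["M1", "M2", "M3"]
varieties = ["V1", "V2", "V3", "V4"]
layout = split_plot_layout(mechanisation, varieties, n_blocks=3, seed=1)

for number, block in enumerate(layout, start=1):
    print(f"Block {number}")
    for whole, subs in block:
        print(f"  {whole}: {' '.join(subs)}")
```

Every whole plot receives each variety exactly once, and a fresh shuffle is drawn for every whole plot and every block, which is what independent randomisation at each stage requires.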
The design discussed above is called a split-plot design. The words treatment and factor
are used interchangeably in this case since they mean the same thing. The split-plot design
involves a two-way or higher-order treatment structure with an incomplete block design
structure and at least two different sizes of experimental units. The larger unit is
associated with the whole-plot treatment and the smaller unit with the sub-plot treatment. The
decision on which treatment to apply to a whole-plot or to a sub-plot is based on
practicability and on the precision required for each treatment. The treatment of greater
interest is placed on the smaller experimental unit and that of lesser interest on the larger
unit. The interaction is also measured with higher precision. Since in split-plot experiments
variation among sub-plots is expected to be less than among whole-plots, the factors
which require smaller amounts of experimental material, or which are of major
importance, or which are expected to exhibit smaller differences, or for which greater
precision is desired for any reason, are assigned to the sub-plots.
The selection of such a design depends on the practicability of the treatments, for example
applying fertilizer to a whole-plot and varieties to a sub-plot. The fact that there are two
sizes of experimental unit implies that there are two experimental errors, referred to here as
error (a) and error (b). The plot layout requires the whole-plot treatments to be randomly
applied to the whole-plots, and then the sub-plot treatments to be randomly applied within
each whole-plot. Each application demands an independent randomisation.
Split-plot designs are frequently used for factorial experiments. Such designs may
incorporate one or more of the completely random, randomised complete block, or Latin
square designs.

6.3 The model

Suppose we wish to investigate the joint effect of two factors, A and B, on the
yield of maize. Let r equal the number of blocks, a the number of levels of A (whole-plots
per block), and b the number of levels of B (sub-plots per whole-plot). Thus, we
have ab treatment combinations replicated r times, giving abr experimental units in
total.
Let $y_{ijk}$ be the observation associated with the ith block, the jth level of factor A, and
the kth level of factor B. The observation $y_{ijk}$ is expressed in mathematical form as
$$y_{ijk} = \mu + \rho_i + \alpha_j + \delta_{ij} + \beta_k + (\alpha\beta)_{jk} + \varepsilon_{ijk}$$
$i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, a$; $k = 1, 2, \ldots, b$
where $\mu$ = overall mean
$\rho_i$ = ith block effect
$\alpha_j$ = jth factor A effect
$\delta_{ij}$ = ijth random effect associated with the whole-plot factor
$\beta_k$ = kth factor B effect
$(\alpha\beta)_{jk}$ = jkth interaction effect
$\varepsilon_{ijk}$ = random effect associated with the sub-plot factor

The effects $\delta_{ij}$ and $\varepsilon_{ijk}$ are assumed to be normally and independently distributed about
zero means, with $\sigma_\delta^2$ as the common variance of the $\delta$'s, the whole-plot random
components, and with $\sigma_\varepsilon^2$ as the common variance of the $\varepsilon$'s, the sub-plot random
components.

The form of the analysis of variance for a two-factor split-plot experiment for a
randomised complete block design is presented below.
Source of       Degrees of           Sum of      Mean       F-Calculated
Variation       Freedom              Squares     Squares
Block           r - 1
Factor A        a - 1
Error (a)       (r - 1)(a - 1)
Factor B        b - 1
A*B             (a - 1)(b - 1)
Error (b)       a(r - 1)(b - 1)
Total           abr - 1

Error (a) is composed of the interaction between the whole-plot factor and the blocks
(Error (a) = A*Block). As mentioned earlier, a genuine factor A by block interaction is
assumed not to exist, so error (a) is used to test the equality of the level means of factor A.
Error (b) is composed of the factor B by block and the factor A by factor B by block
interactions (Error (b) = B*Block + A*B*Block). The effects of factor B and those of the
interaction between factors A and B are tested using error (b).
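
As a quick check on the degrees of freedom in the table, the sketch below evaluates each formula for given r, a and b and confirms that the components add up to abr - 1. The function name `split_plot_df` is illustrative only.

```python
def split_plot_df(r, a, b):
    """Degrees of freedom for a two-factor split-plot in r randomised blocks."""
    return {
        "Block": r - 1,
        "Factor A": a - 1,
        "Error (a)": (r - 1) * (a - 1),
        "Factor B": b - 1,
        "A*B": (a - 1) * (b - 1),
        "Error (b)": a * (r - 1) * (b - 1),
        "Total": a * b * r - 1,
    }


# Example: 4 blocks, 4 whole-plot levels, 2 sub-plot levels.
df = split_plot_df(r=4, a=4, b=2)
components = sum(v for source, v in df.items() if source != "Total")

for source, value in df.items():
    print(f"{source:10s} {value}")
print("Components sum to total:", components == df["Total"])
```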
6.4 The analysis

The analysis of variance is illustrated through an example as follows:


Consider an experiment in which 4 strains of perennial ryegrass were grown as swards at each
of two fertiliser levels. The 4 strains were S23, New Zealand, Kent and X. The fertiliser levels
were denoted by H, heavy, and A, average. The experiment was laid out as four blocks of four
whole plots for the strains, each split in two for application of fertiliser. The midsummer
dry matter yields, in units of 10 lb/acre, were as follows:

                                  Block
Strains         Manure      1      2      3      4
S23             H          299    318    284    279
                A          247    202    171    183
New Zealand     H          315    247    289    307
                A          257    175    188    174
X               H          403    439    355    324
                A          222    170    192    176
Kent            H          382    353    383    310
                A          233    216    200    143


The whole-plot factor A is Strain and the sub-plot factor B is Manure (fertiliser). For
our example, r = 4, a = 4, and b = 2.
Computation of whole-plot analysis

This requires setting up a two-way table of block by factor A treatment totals. Thus

                           Blocks
Strains            1       2       3       4     Strain Totals
S23               546     520     455     462        1983
New Zealand       572     422     477     481        1952
X                 625     609     547     500        2281
Kent              615     569     583     453        2220
Block Totals     2358    2120    2062    1896        8436

Correction factor (C.F.)

$$\text{C.F.} = \frac{\left(\sum_{i,j,k} y_{ijk}\right)^2}{rab} = \frac{(8436)^2}{32} = 2223940.5$$

Sum of squares for the whole-plots

$$SS(\text{Whole-plot}) = \frac{\sum_{i,j} y_{ij.}^2}{b} - \text{C.F.} = \frac{1}{2}(546^2 + 520^2 + \cdots + 583^2 + 453^2) - \text{C.F.}$$
$$= \frac{1}{2}(4510942) - 2223940.5 = 31530.5$$

Sum of squares due to blocks

$$SSBlk = \frac{\sum_{i} y_{i..}^2}{ab} - \text{C.F.} = \frac{1}{8}(2358^2 + 2120^2 + 2062^2 + 1896^2) - \text{C.F.}$$
$$= \frac{1}{8}(17901224) - 2223940.5 = 13712.5$$

Sum of squares due to strains

$$SS(\text{Strains}) = SS(A) = \frac{\sum_{j} y_{.j.}^2}{rb} - \text{C.F.} = \frac{1}{8}(1983^2 + 1952^2 + 2281^2 + 2220^2) - \text{C.F.}$$
$$= \frac{1}{8}(17873954) - 2223940.5 = 10303.7$$

Sum of squares for whole-plot error

SSE(a) = SS(Whole-plot) - SSBlk - SS(A)
= 31530.5 - 13712.5 - 10303.7
= 7514.3

Computation of sub-plot analysis

This section requires a two way table of factor A and factor B totals.

                        Manure (Factor B)
Strains (Factor A)        H         A       Strain Totals
S23                      1180       803         1983
New Zealand              1158       794         1952
X                        1521       760         2281
Kent                     1428       792         2220
Manure Totals            5287      3149         8436

Sum of squares due to factor B

$$SS(B) = \frac{\sum_{k} y_{..k}^2}{ra} - \text{C.F.} = \frac{1}{16}(5287^2 + 3149^2) - \text{C.F.}$$
$$= \frac{1}{16}(37868570) - 2223940.5 = 142845.1$$

Sum of squares due to factor A and B interaction

$$SS(AB) = \frac{\sum_{j,k} y_{.jk}^2}{r} - \text{C.F.} - SS(A) - SS(B) = \frac{1}{4}(1180^2 + 803^2 + \cdots + 792^2) - \text{C.F.} - SS(A) - SS(B)$$
$$= \frac{1}{4}(9566098) - 2223940.5 - 10303.75 - 142845.13 = 14435.1$$
Sum of squares total
$$SSTotal = \sum_{i,j,k} y_{ijk}^2 - \text{C.F.} = 299^2 + 318^2 + \cdots + 143^2 - 2223940.5$$
$$= 2420734.0 - 2223940.5 = 196793.5$$


Sum of squares for sub-plot error
SSE(b) = SSTotal - SS(Whole-plot) - SS(B) - SS(AB)
= 196793.5 - 31530.5 - 142845.1 - 14435.1
= 7982.8
The above calculations are summarised in the ANOVA Table as follows:
The ANOVA Table
Source of variation   D.F.   SS          MS          F-Calculated   F-Table, 0.05
Block                 3      13712.5     4570.8      5.47
Strains               3      10303.7     3434.6      4.11           FT(3, 9) = 3.86
Error (a)             9      7514.3      834.9
Manure                1      142845.1    142845.1    214.73         FT(1,12) = 4.75
Strain*Manure         3      14435.1     4811.7      7.23           FT(3,12) = 3.49
Error (b)             12     7982.8      665.2
Total                 31     196793.5
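
The sums of squares and F ratios in the table can be reproduced from the raw yields. The sketch below follows the same steps as the hand calculation (correction factor, whole-plot analysis, sub-plot analysis); the data structure and variable names are illustrative choices, not part of the original worked example.

```python
# yields[strain][manure] lists the observations for blocks 1 to 4.
yields = {
    "S23": {"H": [299, 318, 284, 279], "A": [247, 202, 171, 183]},
    "New Zealand": {"H": [315, 247, 289, 307], "A": [257, 175, 188, 174]},
    "X": {"H": [403, 439, 355, 324], "A": [222, 170, 192, 176]},
    "Kent": {"H": [382, 353, 383, 310], "A": [233, 216, 200, 143]},
}
r, a, b = 4, 4, 2  # blocks, strains (whole-plot), manure levels (sub-plot)

all_obs = [y for s in yields.values() for obs in s.values() for y in obs]
cf = sum(all_obs) ** 2 / (r * a * b)  # correction factor


def ss_from_totals(totals, n_per_total):
    """Sum of squared totals divided by observations per total, minus C.F."""
    return sum(t * t for t in totals) / n_per_total - cf


whole_totals = [sum(s[m][i] for m in s) for s in yields.values() for i in range(r)]
block_totals = [sum(s[m][i] for s in yields.values() for m in s) for i in range(r)]
strain_totals = [sum(sum(obs) for obs in s.values()) for s in yields.values()]
manure_totals = [sum(sum(s[m]) for s in yields.values()) for m in ("H", "A")]
cell_totals = [sum(s[m]) for s in yields.values() for m in s]

# Whole-plot analysis.
ss_total = sum(y * y for y in all_obs) - cf
ss_whole = ss_from_totals(whole_totals, b)
ss_block = ss_from_totals(block_totals, a * b)
ss_a = ss_from_totals(strain_totals, r * b)
ss_err_a = ss_whole - ss_block - ss_a

# Sub-plot analysis.
ss_b = ss_from_totals(manure_totals, r * a)
ss_ab = ss_from_totals(cell_totals, r) - ss_a - ss_b
ss_err_b = ss_total - ss_whole - ss_b - ss_ab

ms_err_a = ss_err_a / ((r - 1) * (a - 1))
ms_err_b = ss_err_b / (a * (r - 1) * (b - 1))
f_a = (ss_a / (a - 1)) / ms_err_a                 # strains tested against error (a)
f_b = (ss_b / (b - 1)) / ms_err_b                 # manure tested against error (b)
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_err_b   # interaction tested against error (b)

print(f"SSE(a) = {ss_err_a:.2f}, SSE(b) = {ss_err_b:.2f}")
print(f"F(strains) = {f_a:.2f}, F(manure) = {f_b:.2f}, F(strain*manure) = {f_ab:.2f}")
```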

Critical region:
Testing the four strains:
Reject $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ if F-Calculated > FT(3, 9) = 3.86 and conclude that the
strains are significantly different at the 5 % significance level.
Testing the effect of the two levels of manure:
Reject $H_0: \beta_1 = \beta_2 = 0$ if F-Calculated > FT(1,12) = 4.75 and conclude that the two levels of
manure are significantly different at the 5 % significance level.

Testing for the strain by manure interaction:
Reject $H_0: (\alpha\beta)_{11} = (\alpha\beta)_{12} = \cdots = (\alpha\beta)_{42} = 0$ if F-Calculated > FT(3,12) = 3.49 and conclude
that the interaction between the strains and manure levels is significant at the 5 %
significance level.

Conclusions:
We reject $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ since F-Calculated = 4.11 > FT(3, 9) = 3.86 and
conclude that the strains are significantly different at the 5 % significance level. Strain X
shows the highest performance, followed by Kent, based on the means.

We reject $H_0: \beta_1 = \beta_2 = 0$ since F-Calculated = 214.73 > FT(1,12) = 4.75 and conclude that
the two levels of manure are significantly different at the 5 % significance level. Level H
gives a higher yield than level A, based on the means.
We reject $H_0: (\alpha\beta)_{11} = (\alpha\beta)_{12} = \cdots = (\alpha\beta)_{42} = 0$ since F-Calculated = 7.23 > FT(3,12) = 3.49
and conclude that the interaction between the strains and manure levels is significant
at the 5 % significance level. The following is the graphical presentation of the
interaction. The heavy level H (plotted as Manure A) consistently performed better than the
average level A (plotted as Manure B). The average level appears to have a nearly constant
effect across the strains. It is hard to identify the source of the interaction from the graph.

[Figure: Strain by manure interaction. Mean yield (vertical axis, 0 to 400) plotted against
strains (horizontal axis), with one line each for Manure A and Manure B.]

Exercise 6.1
6.1 A researcher is interested in the effects of moisture and nitrogen on the growth of
wheat plants. In the experiment, a particular variety of wheat is planted in 10 tubs of
soil in the greenhouse. Each tub is divided into 3 parts, and different levels of nitrogen
(0, 10, 20) are applied randomly, one to each part. Five of the tubs are selected
randomly and given high moisture while the other 5 are given normal moisture.
a) Identify both the whole plot and subplot experimental units. Explain.
b) Make a sketch of the field layout and explain the randomisation process.
c) Give an outline of ANOVA Table (Source of variation and degrees of freedom
only).

6.2 An experiment was conducted using a split-plot design. The experiment consisted of 3
pairs of identical steers, each pair used as a block, 2 rations (A, B) as whole-plot
treatments, and 2 cooking methods (1, 2) as sub-plot treatments. Within each pair of
steers, one is assigned at random to ration A and one to ration B. After slaughter, two
identical roasts are obtained from each steer and are randomly assigned to the two cooking
methods. Recorded data are weight losses due to cooking. (Assume methods and rations
to be fixed effects.)

                              Block
Method     Ration     Pair1     Pair2     Pair3
1          A          11.0      17.0      11.0
2          A           2.5       9.0       6.5
1          B           5.0       8.0       8.0
2          B           3.5       4.0       4.5

a) Write down a mathematical model stating the necessary assumptions.
b) State the null hypotheses for testing methods, rations and their interaction.
c) Analyse the data and present your results in an ANOVA Table.
d) State the critical regions for testing the hypotheses stated in part (b).
e) Present a two-way table of treatment means.
f) Compute the standard errors for testing the mean differences for methods,
rations, and method by ration interactions.

7. NESTED DESIGNS
7.1 Introduction

Consider an experiment involving two fertiliser levels and three varieties. Thus, we have
2 x 3 = 6 treatment combinations. In such a case, the factors are said to be
crossed. This means that every level of every factor could be used in combination with
every level of every other factor. The intersections of these factor levels are the subclasses
or cells of the situation, wherein data arise. Absence of data from a cell does not imply
non-existence of that cell, only that it has no data. The total number of cells in a crossed
classification is the product of the number of levels of the various factors, noting that not
all of them may have observations in them.
Nesting in design structure occurs when we have sub-units within larger experimental
units. Examples: pigs within pens; plants within pots; pies within an oven; farms within a
region; technicians within a method; progeny within sires; insecticides within source, etc.
In general, levels of B are nested within levels of A. Thus, we do not have an A*B
interaction effect, but have an A effect and a B within A (denoted B(A)) effect. More often, in
the treatment structure, levels of A are crossed with levels of factor B.
The following example illustrates the concept of nested classification:
Example 7.1

Suppose that at a university a student survey is carried out to ascertain the reaction to
instructors' usage of a new computing facility. Suppose that all first years have to take
English or Geology or Chemistry in their first semester. All three courses in the first
semester are large and are divided into sections, each section with a different instructor,
and not all sections have the same number of students. Each student provided his or her
opinion, measured on a scale of 1-10, of his or her instructor's use of the computer. The
investigator's interest is whether the instructors differ in their use of the computers. A
schematic representation of this nested classification follows. The (nij) denotes the
number of students in section j of course i (i = 1, 2, 3; j = 1, ..., 4).

                      Course
English           Geology           Chemistry
Sec.1 (28)        Sec.1 (31)        Sec.1 (27)
Sec.2 (27)        Sec.2 (29)        Sec.2 (32)
Sec.3 (30)                          Sec.3 (29)
                                    Sec.4 (30)

A measure of the effect due to section j, say j = 1, taken across the English, Geology
and Chemistry courses would be meaningless.
This is because the three sections, composed of different groups of students, have nothing
in common other than that they are all numbered 1 within their respective courses.
The number is only for identification purposes. Section 1 of English is in no way related to
section 1 of Geology. The only thing in common is the number 1, which is purely an
identifier. These are not like the variety by fertiliser treatment combinations discussed
earlier. Fertiliser 1 on variety 1 was the same as fertiliser 1 on variety 2 and on variety 3.
The sections are not related in this way, and are identities within their own courses. They
are considered as sections within courses. Thus, sections are nested within courses. Similarly,
the students are nested within sections. An ANOVA outline would be:
Source of variation                          Degrees of freedom
Courses                                      2
Sections within Course                       2 + 1 + 3 = 6
Students within Sections within Course       By subtraction (254)
Total                                        262
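
The degrees of freedom in this outline follow directly from the section sizes in the schematic. The short sketch below recomputes them; the dictionary layout and variable names are illustrative only.

```python
# Number of students in each section, by course (from the schematic above).
section_sizes = {
    "English": [28, 27, 30],
    "Geology": [31, 29],
    "Chemistry": [27, 32, 29, 30],
}

n_courses = len(section_sizes)
n_sections = sum(len(sizes) for sizes in section_sizes.values())
n_students = sum(sum(sizes) for sizes in section_sizes.values())

df_courses = n_courses - 1
# Each course contributes (number of its sections - 1) degrees of freedom.
df_sections = sum(len(sizes) - 1 for sizes in section_sizes.values())
df_students = n_students - n_sections
df_total = n_students - 1

print(df_courses, df_sections, df_students, df_total)
```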

The main use of the design is in assessing the degree of variation due to
each component. An interesting question would be: is the variation greatest between plants
within pots or between pots within treatments?
Nested designs have the characteristic that interaction does not occur, but nesting does. For
instance, when we say A is nested in B, we cannot then say A interacts with B. Often
nesting is denoted by, say, A(B), A:B or A/B, meaning A is nested in B, and the degrees
of freedom are expressed as b(a-1), where a is the number of levels of A and b is the number
of levels of B. We say levels of one factor are nested within, or are subsamples of, levels of
another factor. Such experiments are also sometimes called hierarchical experiments. For
instance, in an on-farm experiment you may have farm types, farms nested within types and
replications nested within farms.

[Diagram: hierarchical tree with farm types at the top level, farms (1, 2, 3, ...) nested
within each type, and replications (1, 2, 3) nested within each farm.]

In general there is no limit to the degree of nesting that can be handled. The extent of its
use depends entirely on the data and the environment from which they came.
Example 7.2

Consider an experiment involving the product of three manufacturing plants in each of two
areas, A and B, and of two plants in area C. The observations on the quality of a product
made in the eight manufacturing plants in the three areas are presented below.

Area     Plant     Observations
A        I         6
         II        6, 8
         III       6, 7, 8
B        I         5, 7
         II        6, 7
         III       6
C        I         7
         II        7, 9

Two-way table of totals

                      Area
Plants            A      B      C      Plant Totals
I                 6     12      7          25
II               14     13     16          43
III              21      6      0          27
Area Totals      41     31     23          95

Total observations = 14

$$\sum_{i,j,k} y_{ijk} = 95, \qquad \sum_{i,j,k} y_{ijk}^2 = 6^2 + 6^2 + \cdots + 9^2 = 659$$

Correction factor, $\text{C.F.} = \frac{(95)^2}{14} = 644.64$

Total sum of squares

$$SSTotal = \sum_{i,j,k} y_{ijk}^2 - \text{C.F.} = 659 - 644.64 = 14.36$$
with 13 degrees of freedom.

Area sum of squares

$$SSArea = \frac{41^2}{6} + \frac{31^2}{5} + \frac{23^2}{3} - \text{C.F.} = 648.7 - 644.64 = 4.06$$
with (3 areas - 1 = 2) degrees of freedom.

Plants sum of squares ignoring areas

$$SSplants = 6^2 + \frac{(6 + 8)^2}{2} + \cdots + \frac{(7 + 9)^2}{2} - \text{C.F.} = 5.86$$
with (8 plants - 1 = 7) degrees of freedom.

Plants within area sum of squares

SSPlants(Area) = SSplants (ignoring areas) - SSArea
= 5.86 - 4.06
= 1.80
with (7 - 2 = 5) degrees of freedom.
Error sum of squares
SSE = SSTotal - SSplants (ignoring areas)
= 14.36 - 5.86 = 8.50
with 13 - 7 = 6 degrees of freedom.

The ANOVA table

Source of                     Degrees of    Sum of     Mean       F-calc    F-table,
Variation                     Freedom       Squares    Squares              0.05
Area                          2             4.06       2.03       5.639     5.79
Plants within areas           5             1.80       0.36
Observation within plants     6             8.50       1.42
Total                         13            14.36

Often, nested designs are meant to provide information about variability, and it therefore
makes no sense to compute an F value. However, if the areas are fixed, we can test the
equality of the area means using an F-test.
Estimation of experimental error is only possible if the replications are independent. In
this case, plants within areas are independent but observations within plants are not.
Therefore, we estimate experimental error using plants within areas. The F-value for
testing the equality of the areas is obtained as
$$F = \frac{MS_{Area}}{MS_{P(A)}} = \frac{2.03}{0.36} = 5.639$$
which is compared against F-T = 5.79 obtained using 2 df numerator and 5 df denominator
at the 5 % significance level. We fail to reject $H_0: \mu_1 = \mu_2 = \mu_3$ since F-calc = 5.639 is not
greater than F-T = 5.79 at the 5 % significance level.
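
The nested analysis with unequal numbers of observations can be reproduced with a short script. This is a sketch that follows the hand calculation step by step; the nested dictionary and variable names are illustrative. Working with unrounded mean squares gives an F ratio of about 5.63, slightly below the 5.639 obtained from the rounded values 2.03 and 0.36, and the conclusion is unchanged.

```python
# data[area][plant] lists the quality observations for that plant.
data = {
    "A": {"I": [6], "II": [6, 8], "III": [6, 7, 8]},
    "B": {"I": [5, 7], "II": [6, 7], "III": [6]},
    "C": {"I": [7], "II": [7, 9]},
}

plants = [obs for area in data.values() for obs in area.values()]
all_obs = [y for obs in plants for y in obs]
n = len(all_obs)
cf = sum(all_obs) ** 2 / n  # correction factor

ss_total = sum(y * y for y in all_obs) - cf

# Each total is squared and divided by the number of observations behind it.
ss_area = sum(
    sum(y for obs in area.values() for y in obs) ** 2
    / sum(len(obs) for obs in area.values())
    for area in data.values()
) - cf
ss_plants = sum(sum(obs) ** 2 / len(obs) for obs in plants) - cf

ss_plants_in_area = ss_plants - ss_area
ss_error = ss_total - ss_plants

df_area = len(data) - 1
df_plants_in_area = len(plants) - len(data)
df_error = n - len(plants)

ms_area = ss_area / df_area
ms_plants_in_area = ss_plants_in_area / df_plants_in_area
ms_error = ss_error / df_error

# Areas are tested against plants within areas, as argued in the text.
f_area = ms_area / ms_plants_in_area
print(f"SSArea = {ss_area:.2f}, SSPlants(Area) = {ss_plants_in_area:.2f}, SSE = {ss_error:.2f}")
print(f"F(area) = {f_area:.3f}; exceeds tabulated 5.79: {f_area > 5.79}")
```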
Suppose we assume observations within plants to be normally distributed with
zero mean and constant variance $\sigma^2$, and plants within areas to be normally distributed
with zero mean and variance $\sigma_p^2$. Some techniques, which are beyond the scope of this
manual, are available for estimating these variance components. The following estimates of
these variance components are obtained through such techniques.
The observation within plants variance component is estimated as
$$\hat{\sigma}^2 = 1.42$$
The size of the estimate suggests that the total variance is purely due to observations
within plants.
Similarly, an estimate of the plants within area variance component would be approximated
as
$$\hat{\sigma}_p^2 = \frac{MS_{P(A)} - MSE}{1.64} = \frac{0.36 - 1.42}{1.64} = -0.64$$
Since a variance can never be negative, we consider the estimate not to be significantly
different from zero. Thus $\hat{\sigma}_p^2 \approx 0$, indicating no variation between plants within areas.
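
The divisor 1.64 is not derived in the text. One way to obtain it, offered here as an assumption rather than as the author's method, is the usual coefficient of the plants-within-area component in the expected mean square for an unbalanced nested design, $(N - \sum_i \sum_j n_{ij}^2 / n_{i.}) / df_{P(A)}$. The sketch below shows that this formula reproduces 1.64 for the plant sizes in Example 7.2.

```python
# Number of observations per plant within each area (Example 7.2).
counts = {"A": [1, 2, 3], "B": [2, 2, 1], "C": [1, 2]}

n_total = sum(sum(c) for c in counts.values())
n_plants = sum(len(c) for c in counts.values())
df_plants_in_area = n_plants - len(counts)

# Assumed formula: coefficient of the plants-within-area variance component.
n0 = (n_total - sum(sum(x * x for x in c) / sum(c) for c in counts.values())) / df_plants_in_area

# Mean squares taken from the ANOVA table (MSE left unrounded as 8.50 / 6).
ms_plants_in_area = 0.36
ms_error = 8.50 / 6

sigma2_p_hat = (ms_plants_in_area - ms_error) / n0
print(f"n0 = {n0:.2f}, estimated plants-within-area component = {sigma2_p_hat:.2f}")
```

A negative estimate such as this one is conventionally reported as zero, which matches the conclusion drawn in the text.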

Exercise 7.1
7.1 An educator proposes a new teaching method and wishes to compare the achievement
of students using his method with that of students using a traditional method. Twenty
students are randomly placed into two groups with ten students per group. Tests are
given to all 20 students at the beginning of a semester, at the end of the semester, and
ten weeks after the end of the semester. The educator wishes to see whether there is a
difference in the average achievement between the two methods at each of the three
time periods.
a) Write a mathematical model for this situation.
b) Set up an ANOVA table and show the F tests that can be made.
7.2 In a study made of the characteristics associated with guidance competence versus
counselling competence, 144 students were divided into 9 groups of 16 each. These
nine groups represented all combinations of three levels of guidance ranking (high,
medium, low) and three levels of counselling ranking (high, medium, low). All
subjects were then given nine subtests. Assume the rankings as two fixed factors, the
subtests as fixed, and the subjects within the nine groups as random.
a) Present a schematic diagram for this information.
b) Give an outline of ANOVA table with source of variation and degrees of freedom
only.
7.3 Three days of sampling where each sample was subjected to two types of size graders
gave the following results, coded by subtracting 4 percent moisture and multiplying by
10.
Day              1              2              3
Grader        A     B        A     B        A     B
Sample 1      4    11        5    11        0     6
       2      6     7       17    13       -1    -2
       3      6    10        8    15        2     5
       4     13    11        3    14        8     6
       5      7    10       14    20        8    10
       6      7    11       11    19        4    10
       7     14    16        6    11        5    18
       8     12    10       11    17       10    13
       9      9    12       16     4       16    17
      10      6     9       -1     9        8    15
      11      8    13        3    14        7    11

Assume graders fixed, days random, and samples within days random.
a) State the necessary hypotheses.
b) Give an outline of the ANOVA table with source of variation
and degrees of freedom only.
c) Complete the ANOVA table by working out the calculations.

8. NONPARAMETRIC STATISTICS
8.1 Introduction
Nonparametric methods are often applicable in situations where the parametric methods
are not. They require less restrictive assumptions concerning the data and the form of the
probability distributions generating the data. The scale of measurement for the data
partly determines whether to use parametric or nonparametric methods. Most
parametric methods use interval or ratio-scaled data. Thus, means, medians, variances,
standard deviations, interquartile ranges, etc., can be computed and interpreted. Parametric
methods cannot be applied to nominal or ordinal-scaled data. Nonparametric methods are
the only way nominal or ordinal-scaled data can be statistically analysed and sound
conclusions made.

The form or type of assumptions made about how the data are generated also determines
whether to use a parametric or a nonparametric method. Many parametric methods require
assumptions. For instance, for a small sample case, a normal distribution with a constant
variance is required in order to apply the t-distribution. The nonparametric methods do not
require assumptions about the population probability distribution, and can be used when one
is not prepared to make distributional assumptions. This property has led to nonparametric
methods being referred to as distribution-free methods. The sign test, the Wilcoxon
signed-rank test, the Mann-Whitney-Wilcoxon test, the Kruskal-Wallis test, and Spearman rank
correlation are the nonparametric methods discussed.
8.2 Sign test

This test is best introduced through an example. Consider a study of consumer
preference for two brands of orange juice, where 12 people were given unmarked samples
of the two brands. The brand each individual tasted first was selected randomly. Each
individual stated a preference for one of the two brands. The question of interest is to
determine whether the preferences for the two products are equal.
Hypothesis

Ho : P = 0.5    (No difference in preference for one brand over the other exists.)
H1 : P ≠ 0.5    (A difference in preference for one brand over the other exists.)

where P = population proportion of consumers favouring one brand.

Suppose we denote a preference for brand A by + and a preference for brand B by -. The data
are recorded in the form of + and - signs, hence the name sign test.
Under Ho, the number of + signs is expected to equal the number of - signs. If we consider a
+ sign to denote a success, then with n = 12 and P = 0.5 we have a binomial probability
distribution. We can compute the probabilities of obtaining 0, 1, ..., 12 + signs, giving a
symmetric binomial distribution. This sampling distribution is used to determine a rejection
rule.

[Figure: Binomial Probabilities (P = 0.5, n = 12). Probability (vertical axis, 0 to 0.25)
plotted against the number of + signs (0 to 12).]

The rejection rule is established as follows. Suppose α = 0.05. For a two-tailed test,
we have 0.025 in one tail and 0.025 in the other. Starting at the lower end of the
distribution, the probability of obtaining 0, 1 or 2 + signs is 0.0002 + 0.0029 + 0.0161 = 0.0192.
Adding the probability of 3 + signs would give 0.0729, which exceeds the 0.025 allowed
for the lower tail. So we stop at 2 + signs. At the upper tail, we get a probability of 0.0192
corresponding to 10, 11 or 12 + signs. The closest we get to 0.05 without exceeding it is
0.0192 + 0.0192 = 0.0384. Thus, the rejection rule is
Reject Ho if the number of + signs is less than 3 or greater than 9.
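
The cut-off for the rejection rule can be found mechanically by accumulating binomial probabilities from the lower tail until α/2 would be exceeded. The sketch below does this with the standard library only; the function names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k + signs out of n when P(+) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_cutoff(n, alpha):
    """Largest c with P(X <= c) <= alpha / 2, and that tail probability.

    Returns c = -1 if even P(X = 0) exceeds alpha / 2.
    """
    cumulative, cutoff = 0.0, -1
    for k in range(n + 1):
        if cumulative + binom_pmf(k, n) > alpha / 2:
            break
        cumulative += binom_pmf(k, n)
        cutoff = k
    return cutoff, cumulative


n = 12
lower, tail = lower_cutoff(n, alpha=0.05)
upper = n - lower  # by symmetry of the binomial distribution with P = 0.5

print(f"Reject Ho if the number of + signs is <= {lower} or >= {upper}")
print(f"Probability in each tail: {tail:.4f}")
```

For n = 12 this gives the same rule as the text: reject when there are 2 or fewer, or 10 or more, + signs.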
The binomial probability distribution can be used for n ≤ 20 (small-sample case). For sample
sizes n greater than 20, the large-sample normal approximation of the binomial probabilities
can be used to determine the rejection rule for the sign test.
The normal approximation of the sampling distribution of the number of + signs when no
preference exists requires determination of
Mean: $\mu = 0.5n$

Standard deviation: $\sigma = \sqrt{0.25n}$

Thus, $Z = \dfrac{x - \mu}{\sigma}$, where x is the observed number of + signs.
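
A minimal sketch of the large-sample calculation follows. The sample figures (20 + signs out of 30) are hypothetical and chosen only to illustrate the arithmetic; 1.96 is the usual two-tailed 5 % critical value of the standard normal distribution.

```python
from math import sqrt


def sign_test_z(n_plus, n):
    """Z statistic for the sign test under Ho: P = 0.5 (large-sample case)."""
    mean = 0.5 * n
    std_dev = sqrt(0.25 * n)
    return (n_plus - mean) / std_dev


# Hypothetical example: 20 of 30 respondents prefer brand A.
z = sign_test_z(n_plus=20, n=30)
reject = abs(z) > 1.96

print(f"Z = {z:.3f}, reject Ho at the 5 % level: {reject}")
```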

Example 8.1

The following data show the preferences indicated by 10 individuals in taste tests
involving two brands of a product.
Individual

Brand A versus Brand B

1
2
3
4
5
6
7
8
9
10

+
+
+
+
+
+
+

We test for a significant difference in the preferences for the two brands at the 5%
significance level. A + indicates a preference for brand A over brand B.
Hypothesis

Ho : P = 0.5
H1 : P ≠ 0.5
where P = population proportion of consumers favouring brand A.
The binomial probabilities for P = 0.5 and n = 10

Number of + Signs     Binomial Probability
0                     0.0010
1                     0.0098
2                     0.0439
3                     0.1172
4                     0.2051
5                     0.2461
6                     0.2051
7                     0.1172
8                     0.0439
9                     0.0098
10                    0.0010

Starting at the lower end of the distribution: 0.0010 + 0.0098 = 0.0108 for 0 and 1 + signs. If we include 2, we get 0.0547, which exceeds 0.025, the probability allowed in each tail. Thus, we stop at 1. Similarly, from the upper end of the distribution we get 0.0010 + 0.0098 = 0.0108 for 9 and 10. Therefore, we reject Ho if the number of + signs is less than 2 or greater than 8.
We fail to reject Ho because we have 7 + signs. There is no evidence from these data that individuals' preferences differ significantly for the two brands at the 5% significance level.
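
The rejection rule above can be checked numerically. The following is a minimal Python sketch, assuming a two-sided test with P = 0.5; the function name `sign_test_rejection_region` is illustrative and not part of this module.

```python
from math import comb


def sign_test_rejection_region(n, alpha=0.05):
    """Two-sided sign test with P = 0.5.

    Returns (lower, upper, actual_alpha): reject Ho when the number of
    + signs is <= lower or >= upper.
    """
    # Binomial probabilities of k = 0..n plus signs when P = 0.5
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    cumulative = 0.0
    lower = -1  # -1 means no value of k fits in the lower tail
    for k in range(n + 1):
        # Stop before the lower-tail probability exceeds alpha / 2
        if cumulative + probs[k] > alpha / 2:
            break
        cumulative += probs[k]
        lower = k
    upper = n - lower  # the distribution is symmetric when P = 0.5
    return lower, upper, 2 * cumulative


lower, upper, actual_alpha = sign_test_rejection_region(10)
plus_signs = 7  # number of + signs observed in Example 8.1
reject = plus_signs <= lower or plus_signs >= upper
print(f"Reject Ho if + signs <= {lower} or >= {upper}; reject = {reject}")
```

For n = 10 this stops at 1 in the lower tail and at 9 in the upper tail, which is the rule "less than 2 or greater than 8" stated above.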

8.3 Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is the nonparametric alternative to the parametric paired-sample t test. In the parametric case, the population of differences between pairs of observations is assumed to be normally distributed. The nonparametric Wilcoxon Signed-Rank Test can be used when the appropriateness of the assumption of normality is in question.
The procedure is illustrated by the following example.
Example 8.2

A manufacturing firm is attempting to determine whether a difference between task-completion times exists for two production methods. A sample of 11 workers was selected
and each worker completed a production task using both production methods. The
production method that each worker used first was selected randomly. A positive
difference in task-completion times indicates that method 1 required more time and a
negative difference indicates that method 2 required more time.
Production task-completion times (Minutes)

Worker  Method 1  Method 2  Difference  Absolute Difference  Rank  Signed Rank
   1      10.2       9.5       0.7             0.7             8       +8
   2       9.6       9.8      -0.2             0.2             2       -2
   3       9.2       8.8       0.4             0.4             3.5     +3.5
   4      10.6      10.1       0.5             0.5             5.5     +5.5
   5       9.9      10.3      -0.4             0.4             3.5     -3.5
   6      10.2       9.3       0.9             0.9            10      +10
   7      10.6      10.5       0.1             0.1             1       +1
   8      10.0      10.0       0.0             0.0             -        -
   9      11.2      10.6       0.6             0.6             7       +7
  10      10.7      10.2       0.5             0.5             5.5     +5.5
  11      10.6       9.8       0.8             0.8             9       +9
                                               Sum of signed ranks    +44

Hypothesis

Ho: The populations are identical
H1: The populations are not identical
The first step is to rank the absolute differences between the two methods from lowest to highest, discarding any differences of zero. Tied differences are assigned average rank values. Each rank is then given the sign of the original difference in the data. Finally, the signed ranks are summed. For our example, the sum is +44.
If the populations representing task-completion times for each of the two methods are identical, we would expect the positive ranks and the negative ranks to cancel out. Thus, we wish to test whether the sum of signed ranks is significantly different from zero.
Let T denote the sum of the signed-rank values in a Wilcoxon signed-rank test. When the populations are identical and the number of pairs of data is 10 or more, the sampling distribution of T is approximately normal with

Mean: $\mu_T = 0$

Standard deviation: $\sigma_T = \sqrt{\dfrac{n(n+1)(2n+1)}{6}}$

Referring to the above example, with n = 10 pairs after discarding the zero difference,

$$\sigma_T = \sqrt{\frac{10(11)(21)}{6}} = 19.62$$

$$Z = \frac{T - \mu_T}{\sigma_T} = \frac{44 - 0}{19.62} = 2.24$$

Conclusion: Reject Ho since Zcal = 2.24 is greater than Ztable = 1.96 at the 5% significance level, and conclude that the two populations are not identical in terms of task-completion times. It is worth noting that the Wilcoxon signed-rank test does not enable us to conclude in what ways the populations differ.
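
The steps of Example 8.2 can be sketched in Python as follows. This is only an illustrative sketch: the helper `average_ranks` is our own name, and the differences are rounded to one decimal place (an assumption matching the precision of the data) so that tied differences compare as equal.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest (rank 1) upwards; ties get the average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


method1 = [10.2, 9.6, 9.2, 10.6, 9.9, 10.2, 10.6, 10.0, 11.2, 10.7, 10.6]
method2 = [9.5, 9.8, 8.8, 10.1, 10.3, 9.3, 10.5, 10.0, 10.6, 10.2, 9.8]

# Paired differences, rounded to the precision of the data
diffs = [round(a - b, 1) for a, b in zip(method1, method2)]
nonzero = [d for d in diffs if d != 0]          # discard zero differences
ranks = average_ranks([abs(d) for d in nonzero])

# Give each rank the sign of its difference and add them up
T = sum(r if d > 0 else -r for r, d in zip(ranks, nonzero))

n = len(nonzero)
sigma_T = sqrt(n * (n + 1) * (2 * n + 1) / 6)
z = (T - 0) / sigma_T
print(f"T = {T}, sigma_T = {sigma_T:.2f}, Z = {z:.2f}")
```

The Z value is then compared with the tabulated value 1.96 used in the conclusion above.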

Exercise 8.1
8.1 A test was conducted of two overnight mail-delivery services. Two samples of identical deliveries were set up such that both delivery services were notified of the need for a delivery at the same time. The number of hours required to make each delivery is shown below for each service.

Delivery   Service 1   Service 2
    1        24.5        28.0
    2        26.0        25.5
    3        28.0        32.0
    4        21.0        20.0
    5        18.0        19.5
    6        36.0        28.0
    7        25.0        29.0
    8        21.0        22.0
    9        24.0        23.5
   10        26.0        29.5
   11        31.0        30.0

Test at the 5% significance level whether the data suggest a difference in the delivery times for the two services.

8.4 Mann-Whitney-Wilcoxon Test

This is a nonparametric test used to determine whether there is a difference between two populations. Unlike the Wilcoxon signed-rank test, it is not based on paired samples. It uses two independent random samples, one from each population. In the corresponding parametric test, normally distributed populations and equal variances were assumed. The Mann-Whitney-Wilcoxon (MWW) test does not require either of these assumptions. However, it does require that the measurement scale for the data generated by the two independent random samples be at least ordinal.
Small-Sample Case: Appropriate when both sample sizes are less than or equal to 10. The following steps are taken in carrying out the test.

Combine the data from both samples and rank them, with the smallest value ranked 1 and the largest value given the highest rank. Sum the ranks for each sample separately. The test statistic T is the sum of the ranks for one of the samples (sample 1, of size n1). Its lower and upper critical values are denoted by TL and TU. Under Ho, the value of T is expected to be near the average of these two values, that is, near
(TL + TU)/2.
Tables of the lower critical value TL of the MWW T statistic exist for cases where both sample sizes are less than or equal to 10. In these tables, n1 corresponds to the sample whose rank sum is being used in the test. The upper critical value is
TU = n1(n1 + n2 + 1) - TL
Reject Ho if T is strictly less than TL or strictly greater than TU.
Large-Sample Case: Appropriate when both sample sizes are greater than or equal to 10. In this case, the sampling distribution of the MWW T statistic can be approximated by a normal distribution with

Mean: $\mu_T = \frac{1}{2}\, n_1 (n_1 + n_2 + 1)$

Standard deviation: $\sigma_T = \sqrt{\frac{1}{12}\, n_1 n_2 (n_1 + n_2 + 1)}$

General steps for MWW T test.


1. Rank the combined sample observations from lowest to the largest, with tied values
being assigned the average of the tied rankings.
2. Compute the T, the sum of the ranks for the first sample.
When we reject the hypothesis that the populations are identical using MWW test, we
cannot state how they differ. The populations could have different means, different
variances, and/or different forms. The MWW test has the advantage that it does not
require any probability distribution assumptions and can be used on ordinal data.
Example 8.3

Two fuel additives are being tested to determine their effect on fuel consumption. Seven cars were tested using additive 1 and another independent sample of nine cars was tested using additive 2. The data below show the kilometres per litre obtained using the additives. Use the MWW test to see if there is a significant difference in fuel consumption at the 5% significance level.
Additive 1   Rank      Additive 2   Rank
   17.3        2          18.7       8.5
   18.4        6          17.8       4
   19.1       10          21.3      15
   16.7        1          21.0      14
   18.2        5          22.1      16
   18.6        7          18.7       8.5
   17.5        3          19.8      11
                          20.7      13
                          20.2      12
   Sum        34          Sum      102

The combined samples are ranked and the rank sum for each sample obtained. This is a small-sample test since n1 = 7 and n2 = 9.
T = 34. With α = 0.05, n1 = 7 and n2 = 9, the table gives TL = 41, and TU = 7(7 + 9 + 1) - 41 = 78.

Conclusion: Since T = 34 < 41, we reject Ho and conclude that there is a significant difference in fuel consumption.
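
The ranking and the small-sample decision rule of Example 8.3 can be sketched in Python as below. The helper name `average_ranks` is our own, and the critical value TL = 41 is simply copied from the example rather than computed; treat this as an illustration under those assumptions. The normal approximation at the end is shown only to illustrate the large-sample formulas, since the sample sizes here fall in the small-sample case.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest (rank 1) upwards; ties get the average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


additive1 = [17.3, 18.4, 19.1, 16.7, 18.2, 18.6, 17.5]
additive2 = [18.7, 17.8, 21.3, 21.0, 22.1, 18.7, 19.8, 20.7, 20.2]

n1, n2 = len(additive1), len(additive2)
ranks = average_ranks(additive1 + additive2)   # rank the combined sample
T = sum(ranks[:n1])                            # rank sum for sample 1
T_other = sum(ranks[n1:])                      # rank sum for sample 2

T_L = 41                                       # table value quoted in Example 8.3
T_U = n1 * (n1 + n2 + 1) - T_L
reject = T < T_L or T > T_U
print(f"T = {T}, T_L = {T_L}, T_U = {T_U}, reject Ho = {reject}")

# Large-sample formulas, for illustration only
mu_T = n1 * (n1 + n2 + 1) / 2
sigma_T = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (T - mu_T) / sigma_T
```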

8.5 Kruskal-Wallis Test

The Kruskal-Wallis test is an extension of the Mann-Whitney-Wilcoxon test to three or more populations. The hypothesis is stated as follows:
Ho: All k populations are identical
H1: Not all populations are identical
Recall that a parametric test such as the analysis of a completely randomised design requires interval or ratio data. The Kruskal-Wallis test, which does not require the assumptions of normality and equal variance, can be used with ordinal data as well as with interval or ratio data.
The Kruskal-Wallis test statistic, which is based on the sum of ranks for each of the samples, is computed as follows:

$$W = \left[ \frac{12}{n_T (n_T + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} \right] - 3(n_T + 1)$$

where
k = the number of populations
$n_i$ = the number of items in sample i
$n_T$ = the total number of items in all samples
$R_i$ = the sum of the ranks for sample i.
Under Ho, the populations are identical and the sampling distribution of W can be approximated by a $\chi^2$ distribution with k - 1 degrees of freedom. The approximation works well if each of the sample sizes is greater than or equal to 5. See Table C.
The following example illustrates the computation procedure.

Example 8.4

Three products received the following performance ratings by a panel of 15 consumers. We wish to use the Kruskal-Wallis test to determine if there is a significant difference in the performance ratings for the products, at the 5% significance level.

   A   Rank       B   Rank       C   Rank
  50     4       80    11       60     7
  62     8       95    14       45     2
  75    10       98    15       30     1
  48     3       87    12       58     6
  65     9       90    13       57     5
  Sum = 34       Sum = 65       Sum = 21

The first step is to rank all the 15 data values, with the lowest ranked 1 and the largest ranked 15. The average rank is assigned to tied data.

Sum of ranks: RA = 34, RB = 65, RC = 21
Sample sizes: nA = 5, nB = 5, nC = 5
Total number of items in all samples: nT = 15
k = 3, thus degrees of freedom = 2

$$W = \frac{12}{15(16)} \left[ \frac{34^2}{5} + \frac{65^2}{5} + \frac{21^2}{5} \right] - 3(16) = 10.22$$

$\chi^2_{(2,\ 0.05)} = 5.99$


Conclusion: Since W = 10.22 is greater than 5.99, we reject Ho and conclude that the ratings for the products differ at the 5% significance level.
Note that the procedure could have been applied directly to the original data if the data had been the ordinal rankings of the 15 consumers. The step of constructing the rank orderings from the performance evaluation ratings would then have been omitted.
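
The computation of W in Example 8.4 can be sketched in Python as follows. The helper `average_ranks` and the variable names are our own, and the critical value 5.99 is taken from the example rather than computed.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upwards; ties get the average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


samples = {
    "A": [50, 62, 75, 48, 65],
    "B": [80, 95, 98, 87, 90],
    "C": [60, 45, 30, 58, 57],
}

# Rank all observations together, then split the ranks back by sample
pooled = [v for ratings in samples.values() for v in ratings]
ranks = average_ranks(pooled)
n_T = len(pooled)

rank_sums = {}
start = 0
for name, ratings in samples.items():
    rank_sums[name] = sum(ranks[start:start + len(ratings)])
    start += len(ratings)

# Kruskal-Wallis statistic
W = 12 / (n_T * (n_T + 1)) * sum(
    rank_sums[name] ** 2 / len(samples[name]) for name in samples
) - 3 * (n_T + 1)

chi2_critical = 5.99  # chi-square table value for 2 degrees of freedom, 5% level
print(rank_sums, round(W, 2), W > chi2_critical)
```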

8.6 Spearman Rank Correlation


Spearman's rank correlation is used to find a measure of association between two random variables when only ordinal data are available. The Spearman rank-correlation coefficient is computed using the following formula:

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where
n = the number of items or individuals being ranked
$x_i$ = the rank of item i with respect to one variable
$y_i$ = the rank of item i with respect to a second variable
$d_i = x_i - y_i$
and 6 is a constant.
While r is a measure of linear correlation between X and Y, $r_s$ is a measure of an increasing or decreasing relationship. The value of $r_s$ ranges from -1 to 1. Positive values near 1 indicate a strong positive association between the rankings. That is, as one rank increases the other rank also increases. Similarly, negative values near -1 indicate a strong negative association in the ranks.
When there is no rank correlation in the population, the sampling distribution of $r_s$ has

Mean: $\mu_{r_s} = 0$ and Standard deviation: $\sigma_{r_s} = \sqrt{\dfrac{1}{n-1}}$ for n ≥ 10.

$$Z = \frac{r_s - \mu_{r_s}}{\sigma_{r_s}}$$

is then approximately standard normal, with mean zero and unit variance.

Consider the following example to illustrate the computation procedures.

Example 8.5

At a wine tasting function, two judges were asked to independently rank the 10 wines on
exhibit from most desirable (rank=1) to least desirable (rank=10). The preferences were as
follows:

Judge A Rank   Judge B Rank   Difference di   di²
      6              5              1           1
      2              2              0           0
      8              7              1           1
     10              9              1           1
      7              6              1           1
      3              3              0           0
      1              1              0           0
      4              8             -4          16
      9             10             -1           1
      5              4              1           1

$\sum d_i^2 = 22$, n = 10

So,

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(22)}{10(10^2 - 1)} = 0.867$$

Conclusion: The high value of $r_s$ = 0.867 suggests that the two judges' preferences coincide very closely.
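
The calculation in Example 8.5 can be sketched in Python as follows. The two rank lists are our reading of the example's table (an assumption about how its columns line up), and the Z statistic uses the large-sample standard deviation given earlier in this section.

```python
from math import sqrt

# Ranks given by each judge to the same 10 wines (our reading of the table)
judge_a = [6, 2, 8, 10, 7, 3, 1, 4, 9, 5]
judge_b = [5, 2, 7, 9, 6, 3, 1, 8, 10, 4]

n = len(judge_a)
d = [xa - yb for xa, yb in zip(judge_a, judge_b)]   # d_i = x_i - y_i
sum_d2 = sum(di ** 2 for di in d)

# Spearman rank-correlation coefficient
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Large-sample test statistic: mean 0, standard deviation sqrt(1 / (n - 1))
sigma_rs = sqrt(1 / (n - 1))
z = (r_s - 0) / sigma_rs
print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.3f}, Z = {z:.2f}")
```

Comparing Z with 1.96 would give a 5% two-sided test of no rank correlation; the example itself stops at interpreting the size of $r_s$.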

9. REGRESSION ANALYSIS
9.1 Introduction
Regression analysis is a statistical procedure used to develop a mathematical equation showing how variables are related. The variable that is predicted using this mathematical equation is called the dependent variable, while the variable used to predict it is called the independent variable. Regression analysis involving only one independent and one dependent variable is called simple linear regression. Regression analysis involving two or more independent variables is called multiple regression analysis.

Consider the following examples of pairs of random variables where X is an independent variable and Y a dependent variable.

X                      Y
Advertising            Company turnover
Training               Labour productivity
Speed                  Fuel consumption
Hours worked           Machine output
Daily temperature      Electricity demand
Hours studied          Statistics results
Product X's price      Product X's sales level
Bond interest rate     Number of bond defaulters
Cost of living         Poverty

Several objectives exist for carrying out regression analysis, among them are to:

a) See if Xi affects Y. The objective would be to investigate whether there is a change in Y when the level of X is changed, thus establishing a functional relationship between the two variables. In this case, X is assumed to be a continuous variable. A scatter plot would show if a relationship exists between the two variables.

b) See how Xi affects Y. We would be interested in knowing by how much the value of Y changes per unit change in X.

c) Predict Y given Xi. The objective in this case is to provide a mathematical function that would be used in predicting values of Y for a given X.

Consider for example, an experiment to estimate the mean weight gain per month for
steers fed on a particular variety of feeds. The dependent variable, weight gain could be
affected by many factors such as initial weight of the steer, amount of feed offered per
day, protein content of the feed, water content of the feed, and so on.

9.2 Simple Linear Regression


Simple linear regression involves an independent variable denoted by X and a dependent variable denoted by Y. The X values are selected levels of the treatment under investigation, and the response corresponding to each level is measured. In simple linear regression we want to explain the behaviour of the dependent variable Y in terms of X by establishing a linear function of the independent variable X. The procedure involves fitting a simple linear regression model to the data, which means estimating its parameters. The suitability of the model is then assessed. The first step should be to plot the raw data in order to have an indication of the relation between Y and X. If such a relationship is not noticeable, then other reasons should be given for proceeding to fit the regression line.
The simplest type of model relating a response variable y to a single independent variable x is given by the following equation of a straight line:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
$\beta_0$ is the intercept (value of y when x = 0),
$\beta_1$ is the slope of the straight line (change in y for a unit change in x),
$\varepsilon$ is a random variable (the random error term).
Note that the random error term takes into account all unpredictable and unknown factors that are not included in the model.
The interest is mainly in estimating the two unknown parameters $\beta_0$ and $\beta_1$, whose estimates are denoted by a and b, respectively. The statistics a and b are computed from the data using a technique called the least squares estimation procedure. The least squares method is a procedure used to find the straight line that provides the best approximation for the relationship between the independent and dependent variables. This line is called the estimated regression line or the estimated regression equation.
The following equations have been shown, using calculus, to provide the minimum sum of squared deviations between the observed values of the dependent variable $y_i$ and the estimated values of the dependent variable $\hat{y}_i$:

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
<By definition>

$$b = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$$
<Used in computation>

$$a = \bar{y} - b\bar{x}$$

Example 9.1a

A property analyst is examining the relationship between the City Council's valuation of residential property and the market value (selling price) of the properties. A random sample of eight recent property transactions was examined. The data are as follows:

City council valuation   Market value
      (R1 000)             (R1 000)
         x                    y            x²        xy        y²
        12                   65           144       780      4225
        45                  220          2025      9900     48400
        32                  142          1024      4544     20164
        50                  310          2500     15500     96100
        28                  196           784      5488     38416
        56                  364          3136     20384    132496
        18                  116           324      2088     13456
        40                  260          1600     10400     67600
Total  281                 1673         11537     69084    420857

The scatter diagram of the above data is presented below.

[Figure: City council values against market values. Horizontal axis: City council values (R1 000); vertical axis: Market values (R1 000); fitted line y = 6.1912x - 8.3392]

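For the above example:

$\sum x = 281$, $\sum y = 1\,673$, $\sum xy = 69\,084$, $\sum x^2 = 11\,537$, $\sum y^2 = 420\,857$.

$$b = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{69084 - (281)(1673)/8}{11537 - (281)^2/8} = 6.1912$$

$$a = \bar{y} - b\bar{x} = 209.125 - (6.1912)(35.125) = -8.3392$$

(The value of a is obtained using the unrounded value of b.)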

Thus, the estimated regression line is $\hat{y} = -8.3392 + 6.1912x$.
The estimates of the intercept and slope, namely a and b, are unbiased estimators of the population parameters $\beta_0$ and $\beta_1$, respectively.
Caution: Extrapolation outside the range of x may lead to meaningless results. For instance, at x = 0, we get $\hat{y} = -8.3392$. That is, at a zero City Council valuation, we get a negative market value of R -8.3392 thousand.

The above regression line is meaningful only when x values fall within the observed interval $12 \le x \le 56$.
Note: A regression line obtained using the standardised values of X and Y passes through the origin, and thus has a zero intercept. The correlation coefficient r between standardised X and Y equals the slope b obtained using the same standardised values.
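
The least squares estimates for Example 9.1a can be reproduced with a few lines of Python. This sketch uses the computational form of the formula for b; the variable names are our own.

```python
# Data from Example 9.1a (both in R1 000)
x = [12, 45, 32, 50, 28, 56, 18, 40]          # City Council valuation
y = [65, 220, 142, 310, 196, 364, 116, 260]   # market value

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Computational forms of S_xy and S_xx
S_xy = sum_xy - sum_x * sum_y / n
S_xx = sum_x2 - sum_x ** 2 / n

b = S_xy / S_xx                   # slope
a = sum_y / n - b * sum_x / n     # intercept: y-bar minus b times x-bar
print(f"y-hat = {a:.4f} + {b:.4f} x")
```

Because the code keeps b unrounded when computing a, it gives the intercept -8.3392 quoted above.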
Example 9.1b

A substance used in biological and medical research is shipped by airfreight to users in cartons of 1,000 ampules. The data below, involving 10 shipments, were collected on the number of times the carton was transferred from one aircraft to another over the shipment route (X) and the number of ampules found to be broken upon arrival (Y). Assume a linear regression model.
i:     1    2    3    4    5    6    7    8    9   10
Xi:    1    0    2    0    3    1    0    1    2    0
Yi:   16    9   17   12   22   13    8   15   19   11

[Figure: Scatter plot of the number of broken ampules (Y, from 0 to 25) against the number of transfers (X, from 0 to 3.5)]

  i   Xi   Yi   Xi-X̄   Yi-Ȳ   (Xi-X̄)(Yi-Ȳ)   (Xi-X̄)²   (Yi-Ȳ)²    Ŷi    ei=Yi-Ŷi    ei²
  1    1   16     0     1.8        0             0        3.24    14.2     1.8      3.24
  2    0    9    -1    -5.2        5.2           1       27.04    10.2    -1.2      1.44
  3    2   17     1     2.8        2.8           1        7.84    18.2    -1.2      1.44
  4    0   12    -1    -2.2        2.2           1        4.84    10.2     1.8      3.24
  5    3   22     2     7.8       15.6           4       60.84    22.2    -0.2      0.04
  6    1   13     0    -1.2        0             0        1.44    14.2    -1.2      1.44
  7    0    8    -1    -6.2        6.2           1       38.44    10.2    -2.2      4.84
  8    1   15     0     0.8        0             0        0.64    14.2     0.8      0.64
  9    2   19     1     4.8        4.8           1       23.04    18.2     0.8      0.64
 10    0   11    -1    -3.2        3.2           1       10.24    10.2     0.8      0.64
Total 10  142                     40            10      177.6              0       17.6

Information required for computation:

n = 10, $\sum X = 10$, $\sum Y = 142$,
$S_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = 40$, $S_X = \sum (X - \bar{X})^2 = 10$, $S_Y = \sum (Y - \bar{Y})^2 = 177.6$

Computation

The estimate of the slope,

$$b = \frac{S_{XY}}{S_X} = \frac{40}{10} = 4$$

The estimate of the intercept,

$$a = \bar{Y} - b\bar{X} = 14.2 - 4(1) = 10.2$$

The estimated linear regression line is $\hat{Y}_i = a + bX_i = 10.2 + 4X_i$

$$MSE = \frac{SSE}{n-2} = \frac{17.6}{8} = 2.2$$

Regression analysis

Estimator   Coef   Std Error   t-value   P-value
a           10.2     0.663      15.38    <0.001
b            4       0.469       8.53    <0.001

[Figure: Fitted regression line for the broken-ampule data, with Y from 0 to 25 and X from 0 to 3.5]

9.3 Model and assumptions

It is important to distinguish between a deterministic model and a probabilistic model when testing for significance in regression analysis. In a deterministic model, the relationship between X and Y is such that if the value of the independent variable is specified, the value of the dependent variable is determined exactly. A model is probabilistic if we are unable to guarantee a single value of Y for each value of X. Thus, mathematically,
Deterministic model: $y = \beta_0 + \beta_1 x$ <A model with no error>
Probabilistic model: $y = \beta_0 + \beta_1 x + \varepsilon$ <A model that allows for uncontrollable components, denoted by $\varepsilon$>
The difference between the two models is in $\varepsilon$, which measures how far the actual y value is above or below the regression line.
The following are the assumptions about $\varepsilon$, the error term in the regression model.

a) The error term $\varepsilon$ is a random variable with a mean of zero.
b) The variance of $\varepsilon$, denoted $\sigma^2$, is the same for all values of x.
c) The values of $\varepsilon$ are independent.
d) The error term $\varepsilon$ is a normally distributed random variable.

We are more concerned with assessing how well the fitted model explains the real-life situation. The question of interest is: how close are the fitted values to the observed values? Thus, we aim at minimising the error term.
The above stated model assumes a straight line situation which often is not the case. A
non-linear model may turn out to explain the data more clearly than the straight line case.
The reliability of the final model depends on the validity of the underlying assumptions
and the adequacy of the fitted model in explaining more of the variation in the data.

The coefficient of determination, denoted by $r^2$ and expressed as the ratio of the regression sum of squares to the total sum of squares, is often used as a measure of the goodness of fit of the estimated regression line. A higher $r^2$ value is associated with a better fit. However, it does not allow us to conclude whether a regression relationship is statistically significant, because the computation of $r^2$ fails to take the sample size into consideration.

9.4 Partitioning the total sum of squares

The total sum of squares can be partitioned into the regression sum of squares and the residual sum of squares. That is:
Sum of squares about the mean = Sum of squares due to regression + Sum of squares for residual.

Sum of squares about the sample mean: $\sum (y - \bar{y})^2$

Sum of squares due to regression (the portion of the overall distance that can be attributed to the independent variable x): $\sum (\hat{y} - \bar{y})^2$

Sum of squares due to residual (that portion of the distance between y and $\bar{y}$ that cannot be accounted for by the independent variable x): $\sum (y - \hat{y})^2$

In summary,

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

<Total variability in y-values> = <Variability explained by model> + <Unexplained variability>

The following computations, obtained using the information given in Example 9.1a, illustrate the point.

  x     y      ŷ      (y-ŷ)    (y-ȳ)     (ŷ-ȳ)     (y-ŷ)²      (ŷ-ȳ)²      (y-ȳ)²
 12    65    65.96    -0.96   -144.12   -143.16      0.912    20496.16    20770.57
 45   220   270.26   -50.26     10.88     61.14   2526.550     3738.687     118.3744
 32   142   189.78   -47.78    -67.12    -19.34   2282.852      374.0665   4505.094
 50   310   301.22     8.78    100.88     92.10     77.074     8482.557   10176.77
 28   196   165.01    30.99    -13.12    -44.11    960.107     1945.304     172.1344
 56   364   338.37    25.63    154.88    129.25    656.999    16705.05    23987.81
 18   116   103.10    12.90    -93.12   -106.02    166.348    11239.73     8671.334
 40   260   239.31    20.69     50.88     30.19    428.126      911.3636   2588.774
Total                                              7098.970    63892.92    70990.88

where

$\sum (y - \bar{y})^2 = 70990.88$

$\sum (y - \hat{y})^2 = 7098.97$

$\sum (\hat{y} - \bar{y})^2 = 63892.92$

The results agree, except for the rounding errors.
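
The partition can also be checked numerically. The sketch below recomputes the three sums of squares for the property-valuation data without rounding the fitted values, so the last digits may differ slightly from the hand-worked table; the variable names are our own.

```python
# Data from Example 9.1a (both in R1 000)
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = S_xy / S_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]                         # fitted values

SST = sum((yi - y_bar) ** 2 for yi in y)                 # total
SSR = sum((fi - y_bar) ** 2 for fi in y_hat)             # explained by the model
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # unexplained

r2 = SSR / SST          # coefficient of determination
MSE = SSE / (n - 2)     # estimate of the error variance
print(f"SST = {SST:.2f}, SSR = {SSR:.2f}, SSE = {SSE:.2f}, r2 = {r2:.3f}")
```

With unrounded fitted values, SSR + SSE equals SST up to floating-point error.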

9.5 An estimate of the variance of residual term

The variance of $\varepsilon$, denoted by $\sigma^2$, is estimated using the sum of squares due to residual, SSE.

$$SSE = \sum (y - \hat{y})^2 = S_{yy} - bS_{xy}$$

The degrees of freedom indicate how many independent pieces of information, out of the n independent values, are used to compute the sum of squares. SSE is associated with n - 2 degrees of freedom because two parameters ($\beta_0$ and $\beta_1$) have to be estimated. A mean square is a number computed by dividing a sum of squares by its degrees of freedom. It has been shown that the mean square error, MSE or $s^2$, provides an estimate of $\sigma^2$. Thus,

$$MSE = \frac{SSE}{n-2}$$

From the above example,

$$MSE = \frac{SSE}{n-2} = \frac{7098.97}{8-2} = 1183.16$$

9.6 Inference about the $\beta_0$ and $\beta_1$ parameters

The main interest would be to test if the slope is significantly different from zero, which would indicate a change in y per unit change in x. An appropriate hypothesis is
Ho: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$
The above hypothesis can be tested using a t test, an F test or a confidence interval. We need to obtain b, the estimate of $\beta_1$, and the associated variance in order to conduct the appropriate test. The sampling distribution of the estimate b is normal with mean $\beta_1$ and variance $\sigma_b^2$, where

$$\sigma_b^2 = \frac{\sigma^2}{S_{xx}}, \qquad S_{xx} = \sum x_i^2 - \left(\sum x_i\right)^2 / n$$

Since $\sigma^2$ is hardly ever known, $\sigma_b^2$ is estimated by $s_b^2$, where $s^2$ replaces $\sigma^2$ in the above equation. Thus,

$$s_b^2 = \frac{s^2}{S_{xx}}$$

The test statistic is

$$t_{calc} = \frac{b - \beta_1}{s_b}$$

which follows a t distribution with n - 2 degrees of freedom.
The decision rule is to reject Ho if the absolute value of $t_{calc}$, denoted by $|t_{calc}|$, is greater than $t_{\alpha/2}$.
For the above example, b = 6.1912, and the standard error of b is denoted s.e.(b) = $\sqrt{s_b^2} = s_b$.

$$s = \sqrt{1183.16} = 34.4$$

$$S_{xx} = \sum x_i^2 - \left(\sum x_i\right)^2 / n = 11537 - (281)^2/8 = 1666.875$$

$$\text{s.e.}(b) = \sqrt{\frac{1183.16}{1666.875}} = 0.8425$$

Thus,

$$t_{calc} = \frac{6.1912 - 0}{0.8425} = 7.349$$

From the t table, the value of t corresponding to 6 degrees of freedom and α = 0.05 for a two-tailed test is $t_{0.025}$ = 2.447.
Conclusion: We reject Ho: $\beta_1 = 0$ since $|t_{calc}|$ is greater than $t_{0.025}$ = 2.447 and conclude that the slope is significantly different from zero at the 5% significance level.

An F test also exists for testing the above hypothesis. The t test and F test give the same results for a regression model with only one independent variable. This is due to the relation between the two distributions for one independent variable ($F = t^2$). The following computations are necessary in order to test the above hypothesis concerning the slope parameter.
Sum of squares due to regression, denoted by SSR = $\sum (\hat{y} - \bar{y})^2$, associated with 1 degree of freedom (number of parameters - 1).

Sum of squares due to residual, denoted by SSE = $\sum (y - \hat{y})^2$, associated with n - p degrees of freedom (n is the sample size and p is the number of regression parameters).

Sum of squares due to total, denoted by SST = $\sum (y - \bar{y})^2$, associated with n - 1 degrees of freedom.

The following are the corresponding mean squares:

$$MSR = \frac{SSR}{1} \quad \text{and} \quad MSE = \frac{SSE}{n-p}$$

Under Ho: $\beta_1 = 0$, MSR and MSE are two independent estimates of $\sigma^2$. The ratio of MSR to MSE is known to have a sampling distribution that is F with 1 and n - p degrees of freedom. (In this case p = 2.)

For the above example, we get MSR = 63892.92 and MSE = 1183.16, which implies that

$$F_{calc} = \frac{MSR}{MSE} = \frac{63892.92}{1183.16} = 54.0$$

Note: $F = t^2$ (i.e. $54.0 = 7.349^2$)

From the F distribution table we get F(1, 6; 0.05) = 5.99. We reject Ho: $\beta_1 = 0$ at the 5% significance level since Fcalc > F(1, 6; 0.05) = 5.99 and conclude that there is a statistically significant relationship between x and y.
Caution: Rejection of the null hypothesis does not imply that the relationship between the
x and y is linear. A proper way to phrase the statement is that, a linear relationship
explains a significant amount of the variability in y over the range of x values observed in
the sample.

Confidence Interval for $\beta_1$

A confidence interval provides an alternative to testing the hypothesis Ho: $\beta_1 = 0$ against Ha: $\beta_1 \neq 0$. The following is a 95% confidence interval for $\beta_1$:

$$b \pm t_{0.025}\ \text{s.e.}(b)$$

In reference to the above example, a 95% confidence interval for $\beta_1$ is

$$6.1912 \pm 2.447(0.8425)$$

Thus, (4.1296, 8.2528) is a 95% confidence interval for $\beta_1$. We reject Ho because the interval does not contain zero.
Similarly, the variance of the intercept estimate is given by the following formula:

$$s_a^2 = \frac{MSE \sum x_i^2}{n S_{xx}} = \frac{MSE \left( \sum x_i^2 / n \right)}{S_{xx}}$$
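
The t test, the F test and the confidence interval for the slope can be checked together. The following Python sketch recomputes everything from the raw data of Example 9.1a; the critical values 2.447 and 5.99 are copied from the tables quoted above, not computed, and the variable names are our own.

```python
from math import sqrt

# Data from Example 9.1a (both in R1 000)
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = S_xy / S_xx
a = y_bar - b * x_bar

SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
SSR = b * S_xy                  # equals the sum of (y-hat - y-bar)^2
MSE = SSE / (n - 2)
MSR = SSR / 1

s_b = sqrt(MSE / S_xx)          # standard error of the slope
t_calc = (b - 0) / s_b          # test of Ho: beta_1 = 0
F_calc = MSR / MSE

t_crit = 2.447                  # t table, 6 degrees of freedom, two-tailed 5%
F_crit = 5.99                   # F table, (1, 6) degrees of freedom, 5%
ci = (b - t_crit * s_b, b + t_crit * s_b)

print(f"s.e.(b) = {s_b:.4f}, t = {t_calc:.3f}, F = {F_calc:.1f}")
print(f"95% CI for the slope: ({ci[0]:.4f}, {ci[1]:.4f})")
```

All three approaches lead to the same decision: the t and F statistics exceed their critical values and the interval excludes zero.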

9.7 Confidence interval estimate of the mean value of y


There are two types of interval estimates, namely the confidence interval estimate and the prediction interval estimate. The former is an estimate of the mean value of y for a particular value of x, while the latter concerns the prediction of an individual value of y corresponding to a given value of x. The point estimates, computed using the equation $\hat{y} = a + bx$, are the same in both cases. The difference is only in the computation of the standard error.
Suppose we denote the estimate of the mean value by $\hat{y}_m$ and the individual value estimate by $\hat{y}_{ind}$. The corresponding values and their associated variances are computed using the following formulae:

Mean value: $\hat{y}_m = a + bx_m$

Estimated variance of $\hat{y}_m$:

$$s_m^2 = s^2 \left[ \frac{1}{n} + \frac{(x_m - \bar{x})^2}{\sum x^2 - (\sum x)^2 / n} \right]$$

Individual value: $\hat{y}_{ind} = a + bx_{ind}$

Estimated variance of $\hat{y}_{ind}$:

$$s_{ind}^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_{ind} - \bar{x})^2}{\sum x^2 - (\sum x)^2 / n} \right]$$

For our example, suppose we wish to estimate the mean value for a given value of $x_m$ = 30.

$$\hat{y}_m = a + bx_m = -8.3392 + 6.1912(30) = 177.3968$$

and

$$s_m^2 = s^2 \left[ \frac{1}{n} + \frac{(x_m - \bar{x})^2}{\sum x^2 - (\sum x)^2 / n} \right] = 1183.16 \left[ \frac{1}{8} + \frac{(30 - 35.125)^2}{1666.875} \right] = 166.5385$$

Suppose we wish to estimate the individual value for a given value of $x_{ind}$ = 30.

$$\hat{y}_{ind} = a + bx_{ind} = -8.3392 + 6.1912(30) = 177.3968$$

and

$$s_{ind}^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_{ind} - \bar{x})^2}{\sum x^2 - (\sum x)^2 / n} \right] = 1183.16 \left[ 1 + \frac{1}{8} + \frac{(30 - 35.125)^2}{1666.875} \right] = 1349.6980$$

Note: The variance associated with the individual value prediction is greater than that associated with the mean value. Consequently, the prediction interval for the individual value is wider than the confidence interval for the mean value.
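
The two variances, and the intervals they lead to, can be sketched in Python as follows. The inputs are the rounded summary values quoted in the worked example, the function names are our own, and the critical value $t_{0.025}$ = 2.447 (6 degrees of freedom) is taken from the t table used earlier. The intervals themselves are not worked out in the text, so treat them as an illustration only.

```python
from math import sqrt

# Summary values from the worked example (rounded as in the text)
n = 8
a, b = -8.3392, 6.1912
s2 = 1183.16            # MSE
x_bar = 35.125
S_xx = 1666.875
t_crit = 2.447          # t table, 6 degrees of freedom, two-tailed 5%


def predict(x0):
    """Point estimate, the same for the mean and the individual value."""
    return a + b * x0


def var_mean(x0):
    """Estimated variance of the estimated mean value of y at x0."""
    return s2 * (1 / n + (x0 - x_bar) ** 2 / S_xx)


def var_individual(x0):
    """Estimated variance for predicting an individual value of y at x0."""
    return s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / S_xx)


x0 = 30
y_hat = predict(x0)
half_mean = t_crit * sqrt(var_mean(x0))
half_ind = t_crit * sqrt(var_individual(x0))
ci_mean = (y_hat - half_mean, y_hat + half_mean)     # confidence interval
pi_ind = (y_hat - half_ind, y_hat + half_ind)        # prediction interval
print(f"y-hat = {y_hat:.4f}")
print(f"variance (mean) = {var_mean(x0):.4f}, variance (individual) = {var_individual(x0):.4f}")
```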

Exercise 9.1

9.1 A restaurant operating on a reservations only basis would like to use the number of
advance reservations x to predict the number of dinners y to be prepared. Data on
reservations and number of dinners served for one day chosen at random from each
week in a 100-week period gave the following results:
$\bar{x} = 150$, $\bar{y} = 120$

$\sum (x - \bar{x})^2 = 90\,000$, $\sum (y - \bar{y})^2 = 70\,000$

$\sum (x - \bar{x})(y - \bar{y}) = 60\,000$
a) Find the least squares estimates a and b for the linear regression line y$ = a + bx.
b) Predict the number of meals to be prepared if the number of reservations is 135.
c) Construct a 90 % confidence interval for the slope. Does information on x (number
of advance reservations) help in predicting y (number of dinners prepared)?
9.2 Interest rates charged for home mortgages have, in general, declined over recent
months. With the apparent favourable influence for new home building, the data
shown below are the prevailing mortgage interest rates and the number of housing
starts in a city over a period of 18 months.
Month   Interest rate   Number of housing starts
              x                    y
   1        10.5                  360
   2        10.3                  340
   3        10.6                  370
   4        11.4                  360
   5        11.8                  330
   6        11.3                  300
   7        11.0                  290
   8        10.5                  340
   9        10.2                  360
  10        10.0                  370
  11         9.8                  380
  12         9.8                  390
  13         9.9                  375
  14        10.0                  350
  15        10.0                  345
  16         9.9                  360
  17         9.8                  380
  18         9.7                  395
a) Plot the data.


b) Use these data to obtain a linear regression equation.
c) Is the slope significantly different from zero?
d) Predict the number of housing starts for interest rates of 10.2% and 9.5%.
e) Do you predict that the prevailing interest rate will increase or decrease next month
(month 19)?

9.8 Testing model assumptions

A residual is the difference between the actual value of the dependent variable $y_i$ and the value predicted by the regression equation, $\hat{y}_i$. The analysis of residuals plays an important role in validating the assumptions made in regression analysis. The hypothesis tests discussed above are valid only when the assumptions made about the regression equation are satisfied.
Residual plots are graphical presentations of the residuals that help reveal patterns and thus help determine whether the assumptions concerning the error component and the form of the regression model are satisfied. The following are the common residual plots:

a) A plot of residuals against the independent variable x.
b) A plot of residuals against the predicted value of the dependent variable.
c) A standardised residual plot in which each residual is standardised by dividing the residual by its standard deviation.

9.9 Diagnostic procedures


Residual plot against x

A residual plot against the independent variable x is constructed by placing x on the horizontal axis and the residuals on the vertical axis. The residual plot should give an overall impression of a horizontal band of points if the assumptions are valid and a linear relationship between x and y is appropriate.
Residual plot against x
40

Residual

20
0
0

10

20

30

40

50

60

-20
-40
-60
X

Using the Residual Plot


a) An overall impression of a horizontal band of points from a residual plot implies that
the model is valid and a linear relationship between x and the expected value of y
exists.
b) A cone-shaped pattern in the residual plot suggests that the variance is not constant. That
is to say, the variability about the regression line is greater for larger values of x.
c) A quadratic pattern in the residual plot suggests that the linear model is not adequate
and that a quadratic model should be considered.

Note that for simple linear regression, the residual plot against x and the residual plot
against the predicted value $\hat{y}$ provide the same information. With multiple regression
models, the residual plot against $\hat{y}$ is more widely used because there is more than one
independent variable.
Standardised residual plots are provided by most computer software. A random variable is
standardised by subtracting its mean and dividing the result by its standard deviation.
The standard deviation of the ith residual is
$$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$$
where
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2} \quad \text{and} \quad s^2 = \text{MSE}$$


If the normality assumption is satisfied, about 95 % of the computed standardised residuals
should lie between -2 and 2.
Outliers

Outliers represent observations that are suspect and warrant careful examination.
Sometimes they occur because of erroneous data recording. They may also indicate a
violation of the model assumptions, or they may simply be unusual values that occur by chance.

Example 9.2

Consider the following data set to illustrate the effect of an outlier.

x    1   1   2   3   3   3   4   4   5   6
y   45  55  50  75  40  45  30  35  25  15

[Figure: The effect of an outlier (scatter plot of y against x)]


A negative linear relationship exists between x and y, except for the observation at x = 3,
y = 75, which is out of the pattern. Most statistical software classifies an observation whose
standardised residual is less than -2 or greater than 2 as an outlier.
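The standardised residuals for the Example 9.2 data can be checked numerically. The following is a minimal Python sketch, not part of the original module; it applies the least squares formulas and the expression $s\sqrt{1 - h_i}$ given earlier, and all variable names are illustrative assumptions.

```python
import math

# Data from Example 9.2
x = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of the slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals: observed value minus predicted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# s^2 = MSE = SSE / (n - 2) for simple linear regression
mse = sum(e ** 2 for e in residuals) / (n - 2)
s = math.sqrt(mse)

# Leverage h_i and standardised residual e_i / (s * sqrt(1 - h_i))
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_residuals = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Flag observations whose standardised residual lies outside (-2, 2)
outliers = [i for i, r in enumerate(std_residuals) if abs(r) > 2]

for i in outliers:
    print(f"x = {x[i]}, y = {y[i]}, standardised residual = {std_residuals[i]:.2f}")
```

Under these assumptions only the observation at x = 3, y = 75 is flagged, which agrees with the visual impression from the scatter plot.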
Influential observations

An influential observation, which may also be an outlier, is typically one whose x value
lies far away from the mean of the x values.
Consider the following data to illustrate the aspect of influential observation.
x    10   10   15   20   20   25   70
y   125  130  120  115  120  110  100

Example 9.3
[Figure: A high leverage observation (scatter plot of y against x)]

The observation at x = 70, y = 100 has an extreme value of x and therefore has
high leverage. The leverage of the ith observation is computed using the following formula
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$$
An observation is declared to be influential if $h_i > 6/n$.


The appropriate approach to handling data with influential observations is to run the
regression analysis with and without the observation. Although time consuming, the
approach will reveal the influence of the observation on the results.
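The leverage rule can be illustrated with the x values of Example 9.3. This is a small Python sketch under the assumption that the 6/n cut-off quoted above is used; the variable names are illustrative only.

```python
# x values from Example 9.3
x = [10, 10, 15, 20, 20, 25, 70]
n = len(x)

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each observation: h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Rule quoted in the text: flag the observation if h_i > 6/n
cutoff = 6 / n
high_leverage = [i for i, h in enumerate(leverage) if h > cutoff]

for i, h in enumerate(leverage):
    flag = "  <- high leverage" if i in high_leverage else ""
    print(f"x = {x[i]:>3}, h = {h:.3f}{flag}")
```

Only the observation at x = 70 exceeds the cut-off, with a leverage of about 0.94. Refitting the regression with and without that point, as recommended above, would then show how much it changes the slope.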

Exercise 9.2

9.3 Consider the following data for two variables X and Y.


X   135  110  130  145  175  160  120
Y   145  100  120  120  130  130  110

a) Compute the standardised residuals for these data. Do there appear to be any
outliers in the data? Explain.
b) Plot the standardised residuals against $\hat{y}$. Does this plot reveal any outliers?
c) Develop a scatter plot for these data. Does the scatter diagram indicate any outliers
in the data? In general, what implications does this have for the simple linear
regression?
9.10 Polynomial models

The response in the dependent variable y will not always be linear when the independent
variable x is quantitative. Sometimes the response may be quadratic, cubic,
or of degree higher than three. For instance, a linear equation may not adequately represent the
relationship between yield and the amount of fertiliser applied to a plot. The following
data show the yield of tomatoes from plots receiving different amounts of fertiliser.
Plot   Amount of fertiliser, x   Yield, y
  1             12                  24
  2              5                  18
  3             15                  31
  4             17                  33
  5             20                  26
  6             14                  30
  7              6                  20
  8             23                  25
  9             11                  25
 10             13                  27
 11              8                  21
 12             18                  29
 13             22                  29
 14             25                  26

[Figure: Scatterplot of yield versus fertiliser (yield y on the vertical axis, amount of fertiliser x on the horizontal axis)]


A model describing the quadratic form shown in the above figure is
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$
A general polynomial regression model relating a dependent variable y to a single
quantitative independent variable x is given by
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon$$
The choice of p and hence the choice of an appropriate regression model will depend on
the experimental situation.
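To see how the choice of p might be examined for the fertiliser data, the sketch below fits the linear (p = 1) and quadratic (p = 2) models by least squares and compares their residual sums of squares. It is an illustrative Python sketch, not the module's own procedure; the helper names `fit_polynomial` and `sse` are assumptions introduced here, and the normal equations are solved directly to keep the example free of external libraries.

```python
# Fertiliser (x) and yield (y) data from the table above
x = [12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8, 18, 22, 25]
y = [24, 18, 31, 33, 26, 30, 20, 25, 25, 27, 21, 29, 29, 26]


def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., bp] from the normal equations."""
    m = degree + 1
    # Augmented matrix [X'X | X'y], where column j of X holds x**j
    a = [[sum(xi ** (r + c) for xi in x) for c in range(m)]
         + [sum(yi * xi ** r for xi, yi in zip(x, y))]
         for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[r][m] / a[r][r] for r in range(m)]


def sse(x, y, coefs):
    """Sum of squared residuals for the fitted polynomial."""
    return sum((yi - sum(b * xi ** j for j, b in enumerate(coefs))) ** 2
               for xi, yi in zip(x, y))


linear = fit_polynomial(x, y, 1)
quadratic = fit_polynomial(x, y, 2)

print("linear SSE:   ", round(sse(x, y, linear), 2))
print("quadratic SSE:", round(sse(x, y, quadratic), 2))
print("quadratic b0, b1, b2:", [round(b, 4) for b in quadratic])
```

A negative estimate of the quadratic coefficient together with a clearly smaller residual sum of squares would support the curved pattern seen in the scatter plot. A formal comparison would still require a significance test on the quadratic term.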

9.11 Multiple regression

The probabilistic model for multiple regression analysis is a direct extension of the simple linear
regression model. For p independent variables, we have
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
The estimated regression equation is
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$

It is referred to as a multiple regression model because it involves more than one independent
variable. For example, consider an experiment set up to study the yield of a tomato crop.
Several independent variables, say amount of fertiliser ($x_1$), amount of water ($x_2$), and
hours of sunlight on clear days ($x_3$), could all have an effect on the yield.
The multiple regression model that relates a dependent variable y to a set of quantitative
independent variables is a direct extension of a polynomial regression model in one
independent variable. Any independent variable may be a power of another independent
variable or a cross-product term; for example, $x_2$ might be $x_1^2$, or $x_3$ might be the
cross-product $x_1 x_2$. A point to note is that no x may be a perfect linear function of the other xs.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

In general, $\beta_j$ ($j \neq 0$) represents the expected change in y for a unit increase in $x_j$
while holding all other xs constant.


The simplest model that allows for interaction between $x_1$ and $x_2$ is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

For a given value, say $x_2 = 2$, the expected value of y, denoted E(y), is expressed as
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 (2) + \beta_3 x_1 (2) = (\beta_0 + 2\beta_2) + (\beta_1 + 2\beta_3) x_1$$
Here the intercept and slope are $(\beta_0 + 2\beta_2)$ and $(\beta_1 + 2\beta_3)$, respectively.
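The effect of the interaction term can be seen by evaluating the intercept and slope at several fixed values of $x_2$. The coefficients in the Python sketch below are hypothetical, chosen only for illustration; they are not estimates from any data in this module.

```python
# Hypothetical coefficients, chosen only for illustration
b0, b1, b2, b3 = 10.0, 2.0, 3.0, 0.5


def expected_y(x1, x2):
    """E(y) under the interaction model b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def line_in_x1(x2):
    """Intercept and slope of E(y) as a function of x1 when x2 is held fixed."""
    return b0 + b2 * x2, b1 + b3 * x2


for x2 in (0, 2, 4):
    intercept, slope = line_in_x1(x2)
    print(f"x2 = {x2}: E(y) = {intercept} + {slope} x1")
```

With these assumed values the slope in $x_1$ changes from 2.0 to 3.0 to 4.0 as $x_2$ goes from 0 to 2 to 4. This is what interaction means: the effect of one variable depends on the level of the other.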

10. INTRODUCTION TO MULTIVARIATE ANALYSIS


10.1 An overview

Multivariate data occur in all branches of science. Almost all data collected by today's
researchers can be classified as multivariate data. For example, a marketing researcher
might be interested in identifying characteristics of individuals that would enable the
researcher to determine whether a certain individual is likely to purchase a specific
product. A wheat breeder might be interested in more than just the yields of some new
varieties of wheat. The wheat breeder may also be interested in these varieties' resistance
to insect damage and drought. A social scientist might be interested in studying
relationships between teenage girls' dating behaviours and their fathers' attitudes.
The objectives of scientific investigations for which multivariate techniques most
naturally lend themselves, include the following:

- Data reduction or structural simplification.
- Sorting and grouping.
- Investigation of the dependence among variables.
- Prediction.
- Hypothesis construction and testing.

Multivariate techniques are applicable when more than one variable is measured on an
experimental unit. Such variables could be correlated and univariate analysis would not be
helpful in extracting relevant information. Multivariate techniques are classified into two
categories, namely variable-directed and individual or experimental unit directed. Some of
these techniques are:
Variable directed

- Principal component analysis (PCA)
- Factor analysis (FA)
- Canonical correlation analysis (CCA)
- Multiple regression analysis (MRA)

Individual directed

- Discriminant analysis (DA)
- Cluster analysis (CA)
- Multivariate analysis of variance (MANOVA)

The above techniques will be discussed with examples later in Section 9.3.

10.2 Possible areas of applications


Medicine and health


Example 10.1a

A study was conducted to investigate the reactions of cancer patients to radiotherapy.
Measurements were made on 6 reaction variables for 98 patients. Interest: data
reduction.
Example 10.1b

Research on the genetic basis for alcoholism. One group has found that the activity of
two enzymes (monoamine oxidase and adenylate cyclase) produced by platelets was
significantly reduced in alcoholics. The results of this study hold promise for the
development of a simple screening test for the early detection of alcoholism. Interest: to
identify and measure physiological variables that could be used effectively to discriminate
alcoholics from nonalcoholics.
Sociology
Example 10.2a

Competing current theories suggest that one strong socioeconomic dimension and a few
minor unexplored dimensions determine the structure of American occupations.
Measurements on 25 variables for 583 occupations were analysed using multivariate
methods in order to provide support for one or two of the positions. Interest: hypothesis
verification.
Example 10.2b

In a study of mobility, counts of the number of foreign-born and second-generation US
residents in 1970 were tabulated by country of origin and state of residence. Interest: to
find natural homogeneous groupings.
Business and economics
Example 10.3a

Measurements of 6 accounting and financial variables were used in developing a
multivariate model to help insurance regulators identify potentially insolvent
property-liability insurers. Using the model, an insurance company could be classified as solvent or
distressed and remedial steps could then be taken to prevent bankruptcy of the distressed
firm. Interest: to obtain a classification rule for distinguishing solvent firms from
distressed firms.
Example 10.3b

Knowledge of the relationships among policy instruments and goals for underdeveloped
countries can aid the process of national development and modernisation. Data from 74
non-communist underdeveloped countries allowed an investigator to find the subsets of
goals and instruments most closely associated with each other and to estimate the nature
of the simultaneous relationships between the two subsets. Interest: to determine the
dependence between two sets of variables corresponding to goals and instruments.

Education
Example 10.4

Scholastic Aptitude Test (SAT) scores and high school academic performance are often
used as indicators of academic success in college. Measurements on 5 precollege predictor
variables and 4 college performance criterion variables were used to determine the
association between the predictor and criterion scores. The study was concerned with
substantiating the usefulness of test scores and high school achievement as predictors of
college performance. Interest: prediction of college performance variables based on the
set of predictor variables.

Biology
Example 10.5a

Two species of chickweed have proved difficult to identify. Measurements on 4 variables
for chickweed plants, known to belong to the two species, were used to construct a
function whose values allowed one to separate the two groups. Consequently, the function
could be used to classify a new candidate plant as belonging to one species or the other.
Interest: sorting or classification.
Example 10.5b

In plant breeding it is necessary, after the end of one generation, to select those plants that
will be the parents of the next generation. The selection is to be done in such a way that
the succeeding generation will be improved in a number of characteristics over that of the
previous generation. Many characteristics are often measured and evaluated. The plant
breeder's goal is to maximise the genetic gain in the minimum amount of time.
Multivariate techniques were used in a bean-breeding programme to convert
measurements on several variables relating to yield and protein content into a selection
index. Scores on this index were then used to determine parents of the subsequent
families of beans. Interest: construction of an index to replace measurements on many
variables and the development of a sorting rule.
Environmental studies
Example 10.6

The atmospheric concentrations of air pollutants in the Los Angeles area have been
extensively studied. In one study, daily measurements on seven pollution-related
variables were recorded over an extended period of time. Of immediate interest was
whether the levels of air pollutants were roughly constant throughout the week or whether
there was a noticeable difference between weekdays and weekends. Interest: hypothesis
testing and data reduction.
Other areas where multivariate techniques apply are in meteorology, geology, psychology
and sports.

10.2 Principal component analysis

Principal component analysis is useful for discovering the dimensionality of the data,
for data screening, for checking clusters and for finding abnormalities. It groups together
variables that are highly correlated: the variables within a group are
highly correlated, while variables in different groups are uncorrelated. The new variables are
expressed as linear combinations of the p original variables.
Principal component scores are used as inputs in other analyses. Multiple regression
analysis can suffer from a multicollinearity problem, which comes about as a result of
predictor variables being correlated. In such a situation, the selected PC scores are used as
regressors.
Plots of the first PC scores help to identify outliers and clusters that may be associated
with the data.
10.3 Factor analysis

Factor analysis follows the same principle as PCA. The main difference is that the
former has distributional properties whereas the latter does not. A few factors explain
the original variables with little loss of information. When the new factors cannot be
interpreted, rotation techniques, some of which are orthogonal, are applied. The PCs selected
using PCA can be used as the new factors.
10.4 Discriminant analysis

Discriminant analysis is a multivariate procedure used to develop a rule that separates two
or more groups of individuals, given measurements for these individuals on several
variables. Discriminant analysis is similar to regression analysis except that the dependent
variable is categorical rather than continuous. In regression analysis the interest is in
predicting the value of a variable based on a set of predictor variables. In discriminant
analysis, the interest is in predicting class membership of an individual observation based
on a set of predictor variables. Several rules exist: a likelihood rule, the linear
discriminant function rule, a Mahalanobis distance rule, a posterior probability rule, etc.
The groups are known beforehand.

10.5 Cluster Analysis

Suppose a study on the farming system in a given area has been conducted. Variables
measured on each farm in the data set might include the period the farm has been farmed, number
of animals, fertiliser used, type of trees, average income, soil types, crops grown, size of
the family, labour, etc. The researcher wants to use this information to partition farmers
into subgroups, so that farmers that fall into the same subgroup have similar characteristics
with respect to the measured variables. The partition would allow for efficient use of the
resources by the farmers.
In more general terms, suppose a researcher has data collected on a large number of
experimental units. Basic questions posed for cluster analysis would be whether it is
possible to devise a classification or grouping scheme, that would allow for partitioning of
the experimental units into classes or groups, called clusters, so that the units within a
class or group are similar to one another while those in distinct classes or groups are not
similar to those in the other groups.
Cluster analysis involves techniques that produce classifications from data that are
initially unclassified, and must not be confused with discriminant analysis where one
initially knows how many distinct groups exist and where one has data that is known to
come from each of these distinct groups.

11. CATEGORICAL DATA ANALYTIC METHODS


11.1 Introduction

In many studies measurements are made on binary rather than numerical scales. Examples
are studies of attitudes or opinions in which the two categories for the response variable
are agree or disagree. Other forms of response are exposed or not exposed, yes or
no, present or absent, improved or unimproved. The type of data collected relates to
responses to questions such as: How many have the attribute? How many said yes? We end
up with frequency counts. Analysis of such data uses the chi-square distribution, denoted
$\chi^2$. This distribution is defined as that of the sum of squares of independent, normally
distributed variables with zero mean and unit variance. The table values are found at the
intersection of the $\alpha$-value and the respective degrees of freedom. For example, for $\alpha$ = 0.05
and 6 degrees of freedom, the table value from Table cc is 12.6.
There are three areas of inferential statistics in which the chi-square test for significance is
commonly applied. They are

- Tests for independence of association;
- Tests for equality of proportions in more than two populations; and
- Goodness of fit tests.

The chi-square statistic tests the null hypothesis by comparing a set of observed
frequencies, which are based on sample findings, to a set of expected frequencies, which
describe the null hypothesis. It measures the extent to which the observed and expected
frequencies differ. Large differences will result in the null hypothesis being rejected.
The chi-square statistic is computed as
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the ith observed count,
$E_i$ is the ith expected count, and
k is the number of categories.
The calculated $\chi^2$ is compared against a table value obtained using k-1 degrees of
freedom and a specified $\alpha$-level. In the case of a contingency table, the cells of the table
constitute the k categories over which the sum is taken, but the degrees of freedom are
(r-1)(c-1) for a table with r rows and c columns.

11.2 Test for independence of association.

This test is applied when an investigator wishes to determine the independence of two
random variables. Independence implies that outcomes of one random variable in no way
influence the outcomes of a second random variable. The null hypothesis and alternative
hypothesis are stated as follows:
H0: The two variables are independent.
Against
Ha: The two variables are dependent.
The procedure is illustrated through the following example.
Example 11.1

A certain brewery company manufactures and distributes three types of beer, which are
categorised as 1) a low-calorie light beer, 2) a regular beer and 3) a dark beer. In an analysis
of the market segments for the three beers, the firm's market research group has raised the
question of whether preferences for the three beers differ between male and female beer
drinkers. If beer preference is independent of the sex of the beer drinker, one advertising
campaign will be initiated for all their beers. However, if beer preference depends on the
sex of the beer drinker, the company will tailor its promotions towards different target
markets.
The hypotheses of this test are stated as
H0: Beer preference is independent of the sex of the beer drinker.
Against
Ha: Beer preference is not independent of the sex of the beer drinker (i.e., males
and females differ in their preference).
A sample is selected and each individual is asked to state his or her preference among the
company's three beers. Every individual in the sample will be placed in one of the six
cells (3 x 2 = 6). The table generated by the 3 x 2 cells is called a contingency table. The
test of independence makes use of the contingency table format and for this reason is
sometimes referred to as a contingency table test.

Suppose that a simple random sample of 150 beer drinkers has been selected. After
taste-testing the three beers, the individuals in the sample are asked to state their preference, or
first choice. The responses are presented in a contingency table below:

Observed frequencies ($O_{ij}$)

                 Beer Preference
Sex       Light   Regular   Dark   Totals
Male        20       40       20      80
Female      30       30       10      70
Totals      50       70       30     150


Expected frequencies for the cells of the contingency table are based on the following
rationale. Assume the null hypothesis. Under this assumption we have 50/150 = 1/3 of the
beer drinkers prefer light beer, 70/150 = 7/15 prefer regular beer, and 30/150 = 1/5 prefer
dark beer. If the independence assumption is valid, these same fractions must be applicable
to both male and female beer drinkers. Thus, under the assumption of independence, we would
expect the 80 male drinkers to show that (1/3)(80) = 26.67 prefer light beer, (7/15)(80) = 37.33
prefer regular beer, and (1/5)(80) = 16 prefer dark beer. A similar argument follows for female
beer drinkers.
Expected frequencies if beer preference is independent of the sex of the beer drinker ($E_{ij}$)

                 Beer Preference
Sex       Light   Regular   Dark    Totals
Male      26.67    37.33    16.00      80
Female    23.33    32.67    14.00      70
Totals    50       70       30        150


The general formula for computing expected frequencies for a contingency table in the
test for independence is
$$E_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample size}}$$
In general, the contingency table test statistic is computed as
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where
$O_{ij}$ is the observed frequency for the contingency table category in row i and column j, and
$E_{ij}$ is the expected frequency for the contingency table category in row i and column j, based
on the assumption of independence.
With r rows and c columns in the contingency table, the test statistic has a chi-square
distribution with (r-1)(c-1) degrees of freedom provided the expected frequencies are 5 or
more for all categories.
Referring back to our example, we note that all expected frequencies are at least 5. Thus,
the sample size is adequate and we can proceed to calculate the chi-square statistic.

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(20 - 26.67)^2}{26.67} + \frac{(40 - 37.33)^2}{37.33} + \cdots + \frac{(10 - 14.00)^2}{14.00}$$
$$= 1.67 + 0.19 + \cdots + 1.14 = 6.13$$
Degrees of freedom = (r-1)(c-1) = (2-1)(3-1) = 2.
Using $\alpha$ = 0.05, $\chi^2$-Table = 5.99.
We reject H0 since $\chi^2$-calc = 6.13 is greater than $\chi^2$-Table = 5.99 and conclude that the
preference for the beers is not independent of the sex of the beer drinkers.
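The calculation in Example 11.1 can be reproduced with a short script. The following Python sketch is illustrative only; the variable names are assumptions, and the critical value 5.99 is simply the table value quoted above. Because the expected frequencies are not rounded to two decimals, the statistic comes out at about 6.12 rather than 6.13.

```python
# Observed frequencies from Example 11.1 (rows: Male, Female; columns: Light, Regular, Dark)
observed = [
    [20, 40, 20],
    [30, 30, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E_ij = (row i total)(column j total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom (r - 1)(c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 5.99  # table value for alpha = 0.05 and 2 degrees of freedom, as quoted in the text

print("expected frequencies:", [[round(e, 2) for e in row] for row in expected])
print("chi-square =", round(chi_square, 2), "with", df, "degrees of freedom")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```

The same steps apply to any r x c contingency table, provided every expected frequency is at least 5.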

Exercise 11.1

11.1 The following data is on the distribution of employment status in five areas denoted
by polygon codes, from KZN. The study involved a random sample of 2942 persons.

                                Polygons
Employment Status   5010012   5010013   5010014   5010015   5010016
EMPLOYED               344       435       291        30       257
UNEMPLOYED              13        25        14         1        15
NOT_WORKIN             211       276       189        72       130
UNSPECIFIE             178       218       125        14       104
TOTAL                  746       954       619       117       506

Using $\alpha$ = 0.05, test whether there is an association between employment status and
the five areas.
11.2 The Abacus Media Company publishes 4 magazines for the teenager (between 13
and 17 years of age) market. The executive editor of Abacus would like to know
whether a readership preference for the four magazines is independent of gender. A
survey of 200 teenagers in stationery stores was carried out. Randomly selected
teenagers who bought at least one of the four magazines were asked to indicate which
of the four magazines they preferred. Their responses are presented below.
               Magazine Preference
Gender    Beat   Youth   Grow   Live
Girls      18      12     20     28
Boys       38      26     34     24

Using $\alpha$ = 0.05, test whether there is an association between gender and
magazine preference.
11.3 A motor vehicle distributor wishes to find out if the size of car bought is in any way
related to the age of the buyer. From sales invoices over the past two years, a
sample of 300 buyers was classified by size of the car bought and buyer's age.
The following contingency table was constructed.

                  Car size bought
Buyer's Age   Small   Medium   Large
Under 30        10      22       34
30-45           24      42       48
Over 45         52      32       36

Using $\alpha$ = 0.05, test whether car size bought and buyer's age are independent.
Interpret your results.
11.4 A sample of parts provided the following contingency table data concerning part
quality and production shift.

               Number
Shift     Good   Defective
First      368       32
Second     285       15
Third      176       24

Use $\alpha$ = 0.05 and test whether part quality is independent of the production shift.
What is your conclusion?

11.3 Tests for equality of proportions in more than two populations

Earlier sections discussed the case of comparing two population proportions using either
normal or t-distributions. The situation is different when more than two population
proportions are to be compared. The Chi-square distribution is used in such a situation.
The test for equality of proportions in more than two populations is equivalent to the test
for independence of association. The null hypothesis is stated as no differences exist
between the proportions of a given category of one random variable examined across all
categories of a second random variable.
The following example illustrates the procedures used to test for the equality of
proportions in more than two populations.

Example 11.2

A local air carrier would like to know if there is any difference between the proportion of
travellers classified as business or non-business making reservations for each of their four
classes. A survey of 300 reservations over the past week shows the following use of each
class of travel by passengers.
The observed frequencies

                              Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business               32        22         42      32       128
Non-Business           48        26         68      30       172
Column Totals          80        48        110      62       300

Let
P1 = Proportion of business travellers in Emerald class.
P2 = Proportion of business travellers in Amethyst class.
P3 = Proportion of business travellers in Diamond class.
P4 = Proportion of business travellers in Ruby class.

Hypothesis
H0 : P1 = P2 = P3 = P4
H1 : At least one population proportion is different.
Note the null hypothesis could also be stated as: type of traveller is independent of the
class of travel used.
The expected frequencies

                              Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business              34.1      20.5       46.9    26.5      128
Non-Business          45.9      27.5       63.1    35.5      172
Column Totals         80        48        110      62        300


Test statistic
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(32 - 34.1)^2}{34.1} + \frac{(22 - 20.5)^2}{20.5} + \cdots + \frac{(30 - 35.5)^2}{35.5}$$
$$= 0.1293 + 0.1096 + \cdots + 0.8521 = 3.3028$$
Using $\alpha$ = 0.05 and (2-1)(4-1) = 3 degrees of freedom, $\chi^2$-Table = 7.815. We fail to reject
H0 since the calculated $\chi^2$ = 3.3028 is not greater than the table value $\chi^2$-Table = 7.815. We
conclude that there is no evidence that the proportion of business travellers differs between
the classes of travel. This finding is equivalent to concluding that type of traveller and class
of travel are independent in an independence of association hypothesis test.
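The equality-of-proportions view of Example 11.2 can be sketched as follows. Under H0 every class shares one pooled proportion of business travellers, and the expected counts follow from it. This Python sketch is illustrative, with assumed variable names; because it does not round the expected frequencies to one decimal, it gives a statistic of about 3.36 rather than the hand-computed 3.3028, and the conclusion is unchanged.

```python
# Observed counts from Example 11.2, in the order Emerald, Amethyst, Diamond, Ruby
business = [32, 22, 42, 32]
non_business = [48, 26, 68, 30]

class_totals = [b + nb for b, nb in zip(business, non_business)]

# Sample proportion of business travellers in each class
sample_props = [b / t for b, t in zip(business, class_totals)]

# Under H0 all classes share one proportion, estimated by pooling the classes
pooled = sum(business) / sum(class_totals)

# Expected counts are pooled * n_j (business) and (1 - pooled) * n_j (non-business)
chi_square = 0.0
for b, nb, t in zip(business, non_business, class_totals):
    chi_square += (b - pooled * t) ** 2 / (pooled * t)
    chi_square += (nb - (1 - pooled) * t) ** 2 / ((1 - pooled) * t)

# (2 - 1)(4 - 1) = 3 degrees of freedom
df = len(class_totals) - 1

print("sample proportions:", [round(p, 3) for p in sample_props])
print("pooled proportion:", round(pooled, 3))
print("chi-square =", round(chi_square, 2), "with", df, "degrees of freedom")
```

The expected counts obtained this way are identical to those from the row total times column total formula, which is why the two tests are equivalent.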
Exercise 11.2

11.5 An insurance organisation sampled its field sales force in the four provinces
concerning their attitudes towards compensation. Respondents were given the choice
between the present method (fixed salary plus year-end bonus) and a proposed new
method (straight commission).
                            Province
Response preference   Cape   Transvaal   OFS   Natal
Present method         68       135       47     79
New method             32        50       23     31


a) Test, at the 5 % level of significance, whether there is any difference in the proportion
of sales staff between the four provinces who prefer the present method?
b) Interpret your findings.

11.4 Goodness of fit tests

The following are the general steps used to conduct a goodness of fit test for any
hypothesised probability distribution:

1. Formulate a null hypothesis indicating a hypothesised distribution for k classes or
   categories of a population.
2. Select a simple random sample of n items, and record the observed frequencies
   for each of the k classes or categories.
3. Based on the assumption that the null hypothesis is true, determine the expected
   frequencies for each category.
4. Use the observed and expected frequencies to compute a value of $\chi^2$ for the test.
5. Reject H0 if the calculated $\chi^2$ value is greater than the table $\chi^2$ value obtained with k-1
   degrees of freedom at the $\alpha$ level of significance.

We illustrate the computation through the following example.


Example 11.3
Patients that arrive for treatment at the emergency room of a large metropolitan hospital
are assigned to one of the following three categories based on the seriousness of their
condition.
Category 1: Patient condition is stable; immediate treatment by a physician is not
required.
Category 2: Patient condition is serious; immediate treatment is not required, but patient
should be monitored for vital signs until a physician is available.
Category 3: Patient condition is critical; the patient's life will be endangered without
immediate treatment.
The population of interest is a multinomial population since the condition of each patient
is classified into one and only one of the three categories stable, serious, and critical. The
available information over the last year indicates that 50 % of the patients who arrived for
treatment were classified as stable, 30 % were classified as serious, and 20 % were
classified as critical.
There has been an increased volume for the emergency room due to recent improvement.
The director of the hospital is concerned that the percentages of patients classified as
having stable, serious, or critical conditions may also have changed. Validation of this
claim is required.
Let P1 = fraction of patients classified as stable.
P2 = fraction of patients classified as serious.
P3 = fraction of patients classified as critical.
Hypothesis
H0 : P1 = 0.5, P2 = 0.30, P3 = 0.20
H1 : The population proportions are not P1 = 0.5, P2 = 0.30, P3 = 0.20
Suppose the hospital selected a sample of 200 patients who have been tested since the
volume increased in the emergency room. The following are observed frequencies.
Stable   Serious   Critical
  98        48        54

The expected frequencies for each category under H0 are

Stable              Serious            Critical
200(0.50) = 100     200(0.30) = 60     200(0.20) = 40

The goodness of fit test focuses on the differences between the observed frequencies and
the expected frequencies. With the expected frequencies greater than 5 for all three
categories, the sample size requirement is satisfied and we proceed to compute the test
statistic.

Test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(98 - 100)^2}{100} + \frac{(48 - 60)^2}{60} + \frac{(54 - 40)^2}{40}$$
$$= 0.04 + 2.40 + 4.90 = 7.34$$
Using $\alpha$ = 0.05 and k - 1 = 3 - 1 = 2 degrees of freedom, $\chi^2$-Table = 5.99.
We reject H0 since $\chi^2$ = 7.34 is larger than the critical value 5.99. In rejecting H0 we
conclude that the percentages of patients whose conditions are stable, serious, or critical
have changed since the increase in volume for the emergency room.
The goodness of fit test uses the chi-square distribution to determine whether a
hypothesised probability distribution for a population provides a good fit. Acceptance or
rejection of the hypothesised probability distribution depends on the differences between
the observed frequencies in a sample and the expected frequencies based on the assumed
probability distribution.
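
The whole procedure for the emergency-room example can be sketched in Python as follows. This is an illustrative sketch rather than part of the original module. The function name `chi_square_gof` is hypothetical, and the critical value 5.991 is read from Table C (2 degrees of freedom, upper-tail probability 0.05) rather than computed.

```python
def chi_square_gof(observed, proportions):
    """Return (test statistic, degrees of freedom) for a goodness of fit test."""
    if len(observed) != len(proportions):
        raise ValueError("observed and proportions must have the same length")
    n = sum(observed)
    # Expected frequency of each category under H0.
    expected = [n * p for p in proportions]
    # Sum of (O - E)^2 / E over the categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


observed = [98, 48, 54]              # stable, serious, critical
h0_proportions = [0.50, 0.30, 0.20]  # proportions stated in H0
critical_value = 5.991               # Table C: 2 df, P = 0.05

statistic, df = chi_square_gof(observed, h0_proportions)
reject_h0 = statistic > critical_value
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```

The same function can be reused for the exercises below by changing the observed frequencies, the hypothesised proportions, and the critical value taken from Table C.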
Exercise 11.3

11.6 Conduct a test of the following hypotheses using the chi-square goodness of fit test.
H0 : PA = 0.40, PB = 0.40, PC = 0.20
H1 : The population proportions are not PA = 0.40, PB = 0.40, PC = 0.20
11.7 A sample of size 200 yielded 60 in category A, 120 in category B, and 20 in category
C. Using α = 0.01, test whether the proportions are as stated in H0 of Exercise 11.6.
11.8 A manufacturer has adopted a new container design. Colour preferences indicated
in a sample of 150 individuals are as follows.

Colour      Red   Blue   Green
Frequency    40     64      46

Test using α = 0.10 to see if the colour preferences are different. (Hint: Formulate
the null hypothesis as H0 : P1 = P2 = P3 = 1/3.)
11.9 Grade distribution guidelines for a statistics course at a major university are as
follows:
10 % A, 30 % B, 40 % C, 15 % D, and 5 % F.
A sample of 120 statistics grades at the end of a semester showed 18 As, 30 Bs,
40 Cs, 22 Ds, and 10 Fs.
Test using α = 0.05 to see if the actual grades deviate significantly from the grade
distribution guidelines.
11.10 An accountant for a department store knows from past experience that 23 % of the
store's customers pay cash for their purchases, 35 % write cheques, and the
remaining 42 % use credit cards. The accountant examines a random sample of
200 sales receipts for the week before Christmas and makes the following sales
summary.

Method of payment     Cash   Cheque   Credit cards
Number of customers     37       47            116

Use the chi-square goodness of fit test to see if the preceding percentages fit these
observations. Use α = 0.05.
11.11 Consider the following data on age distribution in the two polygons.

Age group   Polygon 5010001   Polygon 5090061
            Frequency         Frequency
0-10              177                75
11-20             231                54
21-30             240                34
31-40             141                14
41-50             169                18
51-60             124                10
61-70              38                 9
71-80               8                 1
81-90               1                 0
91-100              0                 0
Over 101            0                 0
UN                  2                 0
TOTAL            1131               215

Use the chi-square goodness of fit test to see if the age group distribution for
polygon 5090061 follows a Poisson distribution. Use α = 0.05.


APPENDIXES
TABLE A
The Normal Distribution

$$\Pr(Z \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw$$

$$\Phi(-z) = 1 - \Phi(z)$$

  z       Φ(z)   |    z       Φ(z)   |    z       Φ(z)
 0.00    0.500   |   1.10    0.864   |   2.05    0.980
 0.05    0.520   |   1.15    0.875   |   2.10    0.982
 0.10    0.540   |   1.20    0.885   |   2.15    0.984
 0.15    0.560   |   1.25    0.894   |   2.20    0.986
 0.20    0.579   |   1.282   0.900   |   2.25    0.988
 0.25    0.599   |   1.30    0.903   |   2.30    0.989
 0.30    0.618   |   1.35    0.911   |   2.326   0.990
 0.35    0.637   |   1.40    0.919   |   2.35    0.991
 0.40    0.655   |   1.45    0.926   |   2.40    0.992
 0.45    0.674   |   1.50    0.933   |   2.45    0.993
 0.50    0.691   |   1.55    0.939   |   2.50    0.994
 0.55    0.709   |   1.60    0.945   |   2.55    0.995
 0.60    0.726   |   1.645   0.950   |   2.576   0.995
 0.65    0.742   |   1.65    0.951   |   2.60    0.995
 0.70    0.758   |   1.70    0.955   |   2.65    0.996
 0.75    0.773   |   1.75    0.960   |   2.70    0.997
 0.80    0.788   |   1.80    0.964   |   2.75    0.997
 0.85    0.802   |   1.85    0.968   |   2.80    0.997
 0.90    0.816   |   1.90    0.971   |   2.85    0.998
 0.95    0.829   |   1.95    0.974   |   2.90    0.998
 1.00    0.841   |   1.960   0.975   |   2.95    0.998
 1.05    0.853   |   2.00    0.977   |   3.00    0.999

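
Entries of Table A can be reproduced numerically. The sketch below is an illustration in Python and is not part of the original module. It relies on the standard identity Φ(z) = ½[1 + erf(z/√2)] and on `math.erf` from the Python standard library. The function name `phi` is an assumption made for this example.

```python
import math


def phi(z):
    """Standard normal cumulative probability Pr(Z <= z) via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Reproduce a few entries of Table A to three decimal places.
for z in (0.00, 1.00, 1.645, 1.960, 2.576):
    print(f"z = {z:5.3f}   Phi(z) = {phi(z):.3f}")

# The symmetry relation Phi(-z) = 1 - Phi(z) holds for any z.
print(abs(phi(-1.25) - (1.0 - phi(1.25))) < 1e-12)
```

Values of z that are not tabulated can be evaluated directly in this way instead of interpolating between rows.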

TABLE B
The t-Distribution

$$\Pr(T \le t) = \int_{-\infty}^{t} \frac{\Gamma[(r + 1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)\,(1 + w^2/r)^{(r+1)/2}}\, dw$$

$$[\Pr(T \le -t) = 1 - \Pr(T \le t)]$$

                     Pr(T ≤ t)
 r     0.90     0.95     0.975     0.99     0.995
 1    3.078    6.314    12.706   31.821    63.657
 2    1.886    2.920     4.303    6.965     9.925
 3    1.638    2.353     3.182    4.541     5.841
 4    1.533    2.132     2.776    3.747     4.604
 5    1.476    2.015     2.571    3.365     4.032
 6    1.440    1.943     2.447    3.143     3.707
 7    1.415    1.895     2.365    2.998     3.499
 8    1.397    1.860     2.306    2.896     3.355
 9    1.383    1.833     2.262    2.821     3.250
10    1.372    1.812     2.228    2.764     3.169
11    1.363    1.796     2.201    2.718     3.106
12    1.356    1.782     2.179    2.681     3.055
13    1.350    1.771     2.160    2.650     3.012
14    1.345    1.761     2.145    2.624     2.977
15    1.341    1.753     2.131    2.602     2.947
16    1.337    1.746     2.120    2.583     2.921
17    1.333    1.740     2.110    2.567     2.898
18    1.330    1.734     2.101    2.552     2.878
19    1.328    1.729     2.093    2.539     2.861
20    1.325    1.725     2.086    2.528     2.845
21    1.323    1.721     2.080    2.518     2.831
22    1.321    1.717     2.074    2.508     2.819
23    1.319    1.714     2.069    2.500     2.807
24    1.318    1.711     2.064    2.492     2.797
25    1.316    1.708     2.060    2.485     2.787
26    1.315    1.706     2.056    2.479     2.779
27    1.314    1.703     2.052    2.473     2.771
28    1.313    1.701     2.048    2.467     2.763
29    1.311    1.699     2.045    2.462     2.756
30    1.310    1.697     2.042    2.457     2.750

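
The integral that defines Table B has no simple closed form for general r, but it can be approximated numerically. The following Python sketch is an illustration only and is not part of the original module. It applies Simpson's rule to the density shown in the table heading and uses the symmetry of the t-distribution about zero. The function names and the number of integration steps are assumptions made for this example.

```python
import math


def t_density(w, r):
    """Density of Student's t with r degrees of freedom (the integrand of Table B)."""
    coeff = math.gamma((r + 1) / 2) / (math.sqrt(math.pi * r) * math.gamma(r / 2))
    return coeff * (1 + w * w / r) ** (-(r + 1) / 2)


def t_cdf(t, r, steps=2000):
    """Approximate Pr(T <= t) by Simpson's rule on [0, |t|] plus symmetry.

    steps must be even for Simpson's rule.
    """
    a = abs(t)
    h = a / steps
    total = t_density(0.0, r) + t_density(a, r)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, r)
    half = total * h / 3  # approximates Pr(0 <= T <= |t|)
    return 0.5 + half if t >= 0 else 0.5 - half


# Check two tabulated points: both should be close to 0.975.
print(round(t_cdf(2.228, 10), 3))
print(round(t_cdf(12.706, 1), 3))
```

A tabulated value t can be verified by confirming that `t_cdf(t, r)` is close to the probability at the head of its column.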

TABLE C
The Chi-square Distribution
Upper Probability Points
$$P = \Pr(\chi^2_v \ge \chi^2_{v,P})$$

Entries in the table are the values χ²(v, P) of the χ²-distribution for various degrees of
freedom v and one-tailed probabilities P.

                                          P
 v    0.99    0.975    0.95    0.90    0.50    0.10    0.05   0.025    0.01   0.005
 1   0.000   0.001   0.004   0.016   0.455   2.706   3.841   5.024   6.635   7.879
 2   0.020   0.051   0.103   0.211   1.386   4.605   5.991   7.378   9.210  10.597
 3   0.115   0.216   0.352   0.584   2.366   6.251   7.815   9.348  11.345  12.838
 4   0.297   0.484   0.711   1.064   3.357   7.779   9.488  11.143  13.277  14.860
 5   0.554   0.831   1.145   1.610   4.351   9.236  11.070  12.833  15.086  16.750

 6   0.872   1.237   1.635   2.204   5.348  10.645  12.592  14.449  16.812  18.548
 7   1.239   1.690   2.167   2.833   6.346  12.017  14.067  16.013  18.475  20.278
 8   1.646   2.180   2.733   3.490   7.344  13.362  15.507  17.535  20.090  21.955
 9   2.088   2.700   3.325   4.168   8.343  14.684  16.919  19.023  21.666  23.589
10   2.558   3.247   3.940   4.865   9.342  15.987  18.307  20.483  23.209  25.188

11   3.053   3.816   4.575   5.578  10.341  17.275  19.675  21.920  24.725  26.757
12   3.571   4.404   5.226   6.304  11.340  18.549  21.026  23.337  26.217  28.300
13   4.107   5.009   5.892   7.042  12.340  19.812  22.362  24.736  27.688  29.819
14   4.660   5.629   6.571   7.790  13.339  21.064  23.685  26.119  29.141  31.319
15   5.229   6.262   7.261   8.547  14.339  22.307  24.996  27.488  30.578  32.801

16   5.812   6.908   7.962   9.312  15.338  23.542  26.296  28.845  32.000  34.267
17   6.408   7.564   8.672  10.085  16.338  24.769  27.587  30.191  33.409  35.718
18   7.015   8.231   9.390  10.865  17.338  25.989  28.869  31.526  34.805  37.156
19   7.633   8.907  10.117  11.651  18.338  27.204  30.144  32.852  36.191  38.582
20   8.260   9.591  10.851  12.443  19.337  28.412  31.410  34.170  37.566  39.997

21   8.897  10.283  11.591  13.240  20.337  29.615  32.671  35.479  38.932  41.401
22   9.542  10.982  12.338  14.041  21.337  30.813  33.924  36.781  40.289  42.796
23  10.196  11.689  13.091  14.848  22.337  32.007  35.172  38.076  41.638  44.181
24  10.856  12.401  13.848  15.659  23.337  33.196  36.415  39.364  42.980  45.559
25  11.524  13.120  14.611  16.473  24.337  34.382  37.652  40.646  44.314  46.928

26  12.198  13.844  15.379  17.292  25.336  35.563  38.885  41.923  45.642  48.290
27  12.879  14.573  16.151  18.114  26.336  36.741  40.113  43.195  46.963  49.645
28  13.565  15.308  16.928  18.939  27.336  37.916  41.337  44.461  48.278  50.993
29  14.256  16.047  17.708  19.768  28.336  39.087  42.557  45.722  49.588  52.336
30  14.953  16.791  18.493  20.599  29.336  40.256  43.773  46.979  50.892  53.672

For v > 30, $\sqrt{2\chi^2} - \sqrt{2v - 1}$ is approximately distributed as normal (0, 1).
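
The large-v approximation can be turned into a formula for critical values. Setting the approximately normal quantity equal to a standard normal point z and solving for χ² gives χ² ≈ ½(z + √(2v − 1))². The Python sketch below is an illustration and not part of the original module. The function name is hypothetical, and z must be taken from Table A for the required upper-tail probability.

```python
import math


def chi2_upper_point_approx(v, z):
    """Approximate upper chi-square point for v degrees of freedom.

    Solves sqrt(2 * chi2) - sqrt(2 * v - 1) = z for chi2, where z is the
    standard normal point with the same upper-tail probability.
    """
    return 0.5 * (z + math.sqrt(2 * v - 1)) ** 2


# z = 1.645 has Phi(z) = 0.950 in Table A, so it matches upper-tail P = 0.05.
approx_30 = chi2_upper_point_approx(30, 1.645)
print(f"v = 30: approximation = {approx_30:.2f}, Table C value = 43.773")

# The approximation is intended for v > 30, for example v = 40.
print(f"v = 40: approximation = {chi2_upper_point_approx(40, 1.645):.2f}")
```

At v = 30, the last row of the table, the approximation gives about 43.5 against the tabulated 43.773, which indicates the size of error to expect near the lower end of its range.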


TABLE D
The F-Distribution

$$\Pr(F \le b) = \int_{0}^{b} \frac{\Gamma[(r_1 + r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,(1 + r_1 w/r_2)^{(r_1 + r_2)/2}}\, dw$$

Columns give r1; rows give r2 and Pr(F ≤ b).

                                                   r1
r2   Pr(F ≤ b)      1      2      3      4      5      6      7      8      9     10     12     15
 1   0.95         161    200    216    225    230    234    237    239    241    242    244    246
     0.975        648    800    864    900    922    937    948    957    963    969    977    985
     0.99        4052   4999   5403   5625   5764   5859   5928   5982   6023   6056   6106   6157

 2   0.95        18.5   19.0   19.2   19.2   19.3   19.3   19.4   19.4   19.4   19.4   19.4   19.4
     0.975       38.5   39.0   39.2   39.2   39.3   39.3   39.4   39.4   39.4   39.4   39.4   39.4
     0.99        98.5   99.0   99.2   99.2   99.3   99.3   99.4   99.4   99.4   99.4   99.4   99.4

 3   0.95        10.1   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.74   8.70
     0.975       17.4   16.0   15.4   15.1   14.9   14.7   14.6   14.5   14.5   14.4   14.3   14.3
     0.99        34.1   30.8   29.5   28.7   28.2   27.9   27.7   27.5   27.3   27.2   27.1   26.9

 4   0.95        7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.91   5.86
     0.975       12.2   10.6   9.98   9.60   9.36   9.20   9.07   8.98   8.90   8.84   8.75   8.66
     0.99        21.2   18.0   16.7   16.0   15.5   15.2   15.0   14.8   14.7   14.5   14.4   14.2

 5   0.95        6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.68   4.62
     0.975       10.0   8.43   7.76   7.39   7.15   6.98   6.85   6.76   6.68   6.62   6.52   6.43
     0.99        16.3   13.3   12.1   11.4   11.0   10.7   10.5   10.3   10.2   10.1   9.89   9.72

 6   0.95        5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.00   3.94
     0.975       8.81   7.26   6.60   6.23   5.99   5.82   5.70   5.60   5.52   5.46   5.37   5.27
     0.99        13.7   10.9   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.72   7.56

 7   0.95        5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.57   3.51
     0.975       8.07   6.54   5.89   5.52   5.29   5.12   4.99   4.90   4.82   4.76   4.67   4.57
     0.99        12.2   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.72   6.62   6.47   6.31

 8   0.95        5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.28   3.22
     0.975       7.57   6.06   5.42   5.05   4.82   4.65   4.53   4.43   4.36   4.30   4.20   4.10
     0.99        11.3   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.91   5.81   5.67   5.52

 9   0.95        5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   3.07   3.01
     0.975       7.21   5.71   5.08   4.72   4.48   4.32   4.20   4.10   4.03   3.96   3.87   3.77
     0.99        10.6   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.35   5.26   5.11   4.96

10   0.95        4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.91   2.85
     0.975       6.94   5.46   4.83   4.47   4.24   4.07   3.95   3.85   3.78   3.72   3.62   3.52
     0.99        10.0   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.71   4.56

12   0.95        4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.69   2.62
     0.975       6.55   5.10   4.47   4.12   3.89   3.73   3.61   3.51   3.44   3.37   3.28   3.18
     0.99        9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.39   4.30   4.16   4.01

15   0.95        4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.48   2.40
     0.975       6.20   4.77   4.15   3.80   3.58   3.41   3.29   3.20   3.12   3.06   2.96   2.86
     0.99        8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.67   3.52
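
As with the t-distribution, entries of Table D can be checked by numerical integration of the density in the table heading. The Python sketch below is an illustration only and is not part of the original module. It uses Simpson's rule and is restricted to r1 ≥ 2, because for r1 = 1 the integrand is unbounded at w = 0 and this simple rule does not apply. The function names and step count are assumptions made for this example.

```python
import math


def f_density(w, r1, r2):
    """Density of the F-distribution with r1 and r2 degrees of freedom."""
    coeff = (math.gamma((r1 + r2) / 2) * (r1 / r2) ** (r1 / 2)
             / (math.gamma(r1 / 2) * math.gamma(r2 / 2)))
    return coeff * w ** (r1 / 2 - 1) / (1 + r1 * w / r2) ** ((r1 + r2) / 2)


def f_cdf(b, r1, r2, steps=2000):
    """Approximate Pr(F <= b) by Simpson's rule on [0, b] (steps must be even)."""
    if r1 < 2:
        raise ValueError("this sketch requires r1 >= 2")
    h = b / steps
    total = f_density(0.0, r1, r2) + f_density(b, r1, r2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_density(i * h, r1, r2)
    return total * h / 3


# Two entries from the 0.95 rows of Table D; both results should be close to 0.95.
for r1, r2, b in [(2, 10, 4.10), (5, 10, 3.33)]:
    print(f"r1 = {r1}, r2 = {r2}, b = {b}: Pr(F <= b) = {f_cdf(b, r1, r2):.3f}")
```

Small differences from the tabulated probability are expected because the table entries are rounded to three significant figures.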


