
Introduction to Data Analysis

By M.A. Onabid
University of Dschang
Dept of Maths/Computer Science

Course outline:
- Recall Descriptive statistics
o (Observation, visual display, measures of central tendency and dispersion)
- Sampling techniques
o (Sampling plan, sampling methods)
- Sample comparison
o (Simple parametric and non-parametric tests; T-test and Wilcoxon tests)
o Comparing two small samples
o Comparing two large samples
o Comparing more than two samples
 ▪ One-way analysis of variance
 ▪ Equal and unequal sample sizes
o Duncan’s multiple range test
o Bartlett’s test
 ▪ Two-way analysis of variance
 ▪ With and without replication
- Regression and Correlation
o Regression analysis
o Correlation analysis
- Multiple regression
- Multiple correlation ( Partial Correlation)
- Robust procedures
- Practical lessons using SPSS, R and a Programming Language

References

1) Statistics and Computing: Introductory Statistics with R; Peter Dalgaard, 2002
2) Applied Statistics with Microsoft Excel; Gerald Keller, 2001
3) Statistics, Data Analysis and Decision Modelling; James Evans & David Olson, 2000
4) Introduction to Statistics, 3rd edition; Ronald E. Walpole, 1982
5) Introduction to Statistics and Experimental Design for the Life Sciences; M.G. Kelly & J.G.A. Onyeka
6) Statistics for Beginners; S.O. Adamu & T.L. Johnson, Evans Publishers, 1992
7) SPSS for Windows Made Simple, Release 10; Paul R. Kinnear & Colin D. Gray, 2000
8) Generalized Additive Models: An Introduction with R; Simon Wood, 2006
9) Statistical packages such as SYSTAT and STATGRAPHICS will also be needed.

Chapter one
Introduction

Data analysis, also known as analysis of data or data analytics, is a process of inspecting,
cleansing, transforming, and modelling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, in different business,
science, and social science domains.

Data mining is a particular data analysis technique that focuses on modelling and knowledge
discovery for predictive rather than purely descriptive purposes, while business intelligence covers
data analysis that relies heavily on aggregation, focusing on business information.[1] In statistical
applications data analysis can be divided into descriptive statistics, exploratory data analysis (EDA),
and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and
CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of
statistical models for predictive forecasting or classification, while text analytics applies statistical,
linguistic, and structural techniques to extract and classify information from textual sources, a species
of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used as a synonym for
data modelling.
Descriptive statistics and graphics
Descriptive statistics refers to a collection of quantitative measures and ways of describing data. This includes frequency distributions and histograms, measures of central tendency (mean, median, mode and midrange) and measures of dispersion (range, variance, standard deviation).

Measures of central tendency: these provide an estimate of a single value that in some fashion represents the entire distribution (mean, median, mode, midrange, and the median absolute deviation, MAD).

Dispersion: this refers to the degree of variation in the data, that is the numerical spread (or
compactness) of the data.

Measure of dispersion (range, variance, standard deviation).

Range: this is the difference between the largest and smallest data values.

Variance: its computation depends on all of the data points, so it is a better measure than the range, which depends on only the two extreme values.

Standard deviation (STDEV): defined as the square root of the variance. It is easier to interpret than the variance.

The population standard deviation is given as $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$ and, for samples, the sample standard deviation is given as $s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$. ((n − 1) is used to provide an unbiased estimate of $\sigma^2$; if n is used instead, $s^2$ will tend to underestimate $\sigma^2$.)
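As a minimal sketch in R (the vector x below is an assumed illustration, not data from these notes), the two versions of the formula can be compared directly:

    x <- c(5, 7, 8, 4, 6, 9, 3)                      # assumed illustrative data

    n <- length(x)
    sample_var <- sum((x - mean(x))^2) / (n - 1)     # divides by n - 1 (unbiased)
    pop_var    <- sum((x - mean(x))^2) / n           # divides by n (tends to underestimate sigma^2)

    sample_var; sqrt(sample_var)                     # agrees with var(x) and sd(x)
    pop_var;    sqrt(pop_var)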

The standard deviation provides an indication of how closely the data are clustered around the mean.

NOTE: one of the most important results in Statistics is that, “almost all” the data for any
distribution will fall within three stdev of the mean.

The larger the standard deviation, the more the data are spread out from the mean. The standard deviation is therefore a useful measure of risk, particularly in financial analysis:

Small stdev ⟹ little risk of loss

Large stdev ⟹ greater risk of achieving a significantly lower return, but at the same time greater potential for a higher return.

Coefficient of variation: provides a relative measure of the dispersion in the data relative to the mean, that is, the standard deviation expressed as a percentage of the mean:

Coefficient of variation (CV) = $\dfrac{\sigma}{\bar{x}} \times 100$
Measure of shape (Pearson’s coefficient of skewness): this measures the degree of asymmetry of a distribution around its mean. It is computed as $sk = \dfrac{3(\bar{x} - \tilde{x})}{s}$ or $sk = \dfrac{3(\mu - \tilde{\mu})}{\sigma}$, where $\tilde{x}$ (or $\tilde{\mu}$) denotes the median. When the mean is equal to the median, sk = 0.0; the value of sk lies between −3 and +3.

+ve coefficient of skewness ⟹ the distribution is positively skewed


-ve coefficient of skewness ⟹ the distribution is negatively skewed
The closer the coefficient is to zero, the less the degree of skewness in the distribution.

- If the coefficient of skewness is > +1 or < −1, the distribution is generally regarded as highly skewed.
- Between 0.5 and 1 or −0.5 and −1, it is moderately skewed.
- A coefficient between −0.5 and 0.5 indicates relative symmetry.

NB: A perfectly symmetric distribution will have the same mean, median and mode.
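A small R sketch of the coefficient of variation and Pearson’s coefficient of skewness (the vector x is again an assumed illustration):

    x <- c(12, 15, 11, 19, 14, 13, 25, 16)           # assumed illustrative data

    cv <- sd(x) / mean(x) * 100                      # standard deviation as a % of the mean
    sk <- 3 * (mean(x) - median(x)) / sd(x)          # Pearson's coefficient of skewness

    cv; sk                                           # here sk > 0, i.e. positively skewed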

Visual Display of Statistical measures


Stem-and-leaf plots
A good way to present both continuous and discrete data for sample sizes of less than 200 or so is
to use a stem-and-leaf plot. This plot is similar to a bar chart or histogram, but contains more
information. As with a histogram, we normally want 5–12 intervals of equal size which span the
observations. However, for a stem-and-leaf plot, the widths of these intervals must be 0.2, 0.5 or
1.0 times a power of 10, and we are not free to choose the end-points of the bins. They are best
explained in the context of an example.
Example
Consider the data on seed germination given as:
No. germinating 85 86 87 88 89 90 91 92 93 94
Frequency 3 1 5 2 3 6 11 4 4 1

Since the data have a range of 9, an interval width of 2 (= 0.2 × 10¹) seems reasonable. To form the
plot, draw a vertical line towards the left of the plotting area. On the left of this mark the interval
boundaries in increasing order, noting only those digits that are common to all of the observations
within the interval. This is called the stem of the plot. Next go through the observations one by
one, noting down the next significant digit on the right-hand side of the corresponding stem.

8 5 5 5
8 7 7 7 7 6 7
8 8 9 8 9 9
9 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 1 1
9 3 2 2 2 3 3 3 2
9 4

For example, the first stem contains any values of 84 and 85, the second stem contains any values of
86 and 87, and so on. The digits to the right of the vertical line are known as the leaves of the plot,
and each digit is known as a leaf.
Now re-draw the plot with all of the leaves corresponding to a particular stem ordered increasingly. At the top of the plot, mark the sample size, and at the bottom, mark the stem and leaf units. These are such that an observation corresponding to any leaf can be calculated as

Observation = StemLabel × StemUnits + LeafDigit × LeafUnits, to the nearest leaf unit.

n = 40
8 5 5 5
8 6 7 7 7 7 7
8 8 8 9 9 9
9 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
9 2 2 2 2 3 3 3 3
9 4
Stem Units = 10 seeds,
Leaf Units = 1 seed.

The main advantages of using a stem-and-leaf plot are that it shows the general shape of the data (like a bar chart or histogram), and that all of the data can be recovered (to the nearest leaf unit).
For example, we can see from the plot that there is only one value of 94, and three values of 89.
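A comparable display can be produced in R with the base stem() function; the sketch below rebuilds the 40 germination counts from the frequency table above (the exact layout printed by stem() may differ slightly from the hand-drawn plot):

    counts <- c(3, 1, 5, 2, 3, 6, 11, 4, 4, 1)       # frequencies for 85, 86, ..., 94
    germinating <- rep(85:94, times = counts)        # reconstruct the 40 observations

    stem(germinating, scale = 2)                     # stem-and-leaf plot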

1.3.3 Box-and-whisker plots

A box-and-whisker plot graphically displays five key descriptive measures of a data set: the minimum, 1st quartile, median, 3rd quartile, and maximum. It is a useful graphical description of the main features of a set of observations. There are many variations on the box plot. The simplest form is constructed by drawing a rectangular box which stretches from the lower quartile to the upper quartile, and is divided in two at the median. From each end of the box, a line is drawn to the maximum and minimum observations. These lines are sometimes called whiskers, hence the name. To determine what the outliers are, some box plots fit a Normal distribution to the data and label as outliers any data that fall below the 1% or above the 99% quantiles of the fitted Normal distribution.

Example

Consider the plots of a data set on leaf size. Box plots for this data and the logs are given below.
Notice how the asymmetry of the original distribution shows up very clearly on the left plot, and
the symmetry of the distribution of the logs, on the right plot.
[Figure: Box plots of raw and transformed leaf area data (n = 200).]

Box-and-whisker plots are particularly useful for comparing several groups of observations. A box
plot is constructed for each group and these are displayed on a common scale. At least 10
observations per group are required in order for the plot to be meaningful.

A fence is defined as $Q_3 + k(Q_3 - Q_1)$ and $Q_1 - k(Q_3 - Q_1)$, where k is either 1.5 or 3.

The inner fences are $Q_3 + 1.5(Q_3 - Q_1)$ and $Q_1 - 1.5(Q_3 - Q_1)$; the outer fences are $Q_3 + 3(Q_3 - Q_1)$ and $Q_1 - 3(Q_3 - Q_1)$.

The semi-interquartile range is $Q = \dfrac{Q_3 - Q_1}{2}$, where $Q_1$ is the value at the $\left(\frac{n+1}{4}\right)^{th}$ position, $Q_2$ at the $2\left(\frac{n+1}{4}\right)^{th}$ position, and $Q_3$ at the $3\left(\frac{n+1}{4}\right)^{th}$ position.

Any data value found beyond an inner fence or beyond an outer fence is called an outlier. The interquartile range contains only the middle 50% of the data and is therefore not influenced by extreme values.

NB: this is sometimes used as an alternative to the standard deviation as a measure of dispersion.
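A minimal R sketch of the quartiles, fences and box plot (x is an assumed illustrative sample containing one deliberate extreme value):

    x <- c(2, 5, 6, 7, 8, 9, 10, 11, 12, 30)         # assumed data; 30 is a deliberate outlier

    q   <- quantile(x, c(0.25, 0.50, 0.75))          # Q1, median, Q3
    iqr <- q[3] - q[1]                               # interquartile range, same as IQR(x)

    inner <- c(q[1] - 1.5 * iqr, q[3] + 1.5 * iqr)   # inner fences
    outer <- c(q[1] - 3.0 * iqr, q[3] + 3.0 * iqr)   # outer fences

    x[x < inner[1] | x > inner[2]]                   # values flagged as outliers
    boxplot(x)                                       # default whisker rule is 1.5 * IQR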

Frequency Distribution chart


(Recall of frequency table, tally chart, histogram, frequency polygon, bar chart,
maybe pie chart, Q-Q plots)

Consider a record of the ages of 50 individuals found to be infected with Guinea worm in a small village around Dschang.

78 32 41 42 56 28 21 47 64 46

26 35 56 24 53 54 43 32 18 75

42 22 41 27 48 63 43 53 55 51

36 47 45 46 54 55 35 58 45 68

31 44 54 34 64 44 57 77 63 34

For these data to make sense, we need to reduce them to a summary and present them as tables, histograms, graphs, etc. The ages have been measured to the nearest year since most of the villagers do not even know exactly when they were born. The first task prior to summarizing is to arrange the values in a particular order, either ascending or descending; this is just a rearrangement.
1) Arrange data in ascending order
2) Place ages into classes and determine the number of observations falling in each class;
a. Use at least five (5) classes, but too many classes may render interpretation difficult
b. Use ten (10) classes in this case.

Characteristics of classes

(Effectively, the number of classes should fall between 4 and 20.) As a rule of thumb, the number of classes for n data points can be calculated as $\dfrac{\ln(n)}{\ln(2)}$.
i) Class limits and boundaries: The limits are the extreme values which can be recorded. Thus
the class interval (10 - 19) should include values from 10 to 19, with 10 as the lower limit and 19 as
the upper limit. The value of 9.5 is called the lower class boundary while 19.5 is the upper class
boundary. However 19.5 is also the lower class boundary for the interval 20 – 29. The class width is
the difference between the upper and the lower boundaries. The class interval size can easily be
calculated as the difference between successive values of lower class limits.
ii) Class midpoint (or mark, or representative value): this is the midpoint between the upper and the lower class limits. In the case of a continuous variable, the midpoint can be calculated for practical purposes as the sum of two successive lower class limits divided by two. To illustrate these facts,
a. Draw a tally chart for the age data arranged above
b. Tabulate the characteristic of the classes defined for the data;

Class limit Class midpoint No of observations
10 – 19 15 1
20 – 29 25 7
c. Draw a frequency table, i.e. simply showing each class and its number of observations.
d. Draw a histogram of the age distribution of the 50 individuals found to be infected with Guinea worm.

Steps to draw histogram

1. Draw a vertical line at the lower boundary of each class interval up to the height representing the frequency;
2. Draw a horizontal line representing the width of each class interval on top of each vertical line drawn in step 1;
3. Extend the vertical line of each class to complete a rectangle.
e. Construct the frequency polygon, obtained by joining up the midpoints of the tops of the bars of the frequency histogram (a sketch of these steps in R follows below).
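As a hedged sketch of these steps in R, using the 50 ages listed above (the class boundaries 9.5, 19.5, ... implement the classes 10-19, 20-29, ...):

    ages <- c(78, 32, 41, 42, 56, 28, 21, 47, 64, 46,
              26, 35, 56, 24, 53, 54, 43, 32, 18, 75,
              42, 22, 41, 27, 48, 63, 43, 53, 55, 51,
              36, 47, 45, 46, 54, 55, 35, 58, 45, 68,
              31, 44, 54, 34, 64, 44, 57, 77, 63, 34)

    breaks <- seq(9.5, 79.5, by = 10)                # class boundaries
    table(cut(ages, breaks))                         # frequency table by class
    hist(ages, breaks = breaks,
         main = "Ages of 50 Guinea worm cases")      # histogram of the age distribution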

Home Work
Construct bar chart, showing frequency of occurrences of score in 100 tosses of a fair die ( to be
presented in class the next day)
Statistical relationships

Two variables have a strong statistical relationship with one another if they appear to move together.
Correlation is a measure of strength of linear relationship between two variables X and Y and is
measured by the (population) Correlation coefficient given as:-

$\rho_{xy} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$

The numerator is called the Covariance and is the average of the product of the deviation of each
observation from its respective mean.

$\mathrm{Cov}(X, Y) = \dfrac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{N}$

Similarly, the sample correlation coefficient is computed as

$r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$

r will range from -1 to +1. A correlation of 0 indicates that the two variables have no linear
relationship to each other.

A correlation coefficient of +1 indicates perfect positive relationship

A correlation coefficient of −1 also indicates a perfect linear relationship, except that as one variable increases the other decreases.
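A minimal R sketch of covariance and correlation (x and y are assumed illustrative vectors, chosen to be roughly linearly related):

    x <- c(2, 4, 6, 8, 10)
    y <- c(1.9, 4.2, 5.8, 8.5, 9.6)                          # roughly linear in x

    n <- length(x)
    cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # sample covariance, as cov(x, y)
    r      <- cov_xy / (sd(x) * sd(y))                       # sample correlation, as cor(x, y)

    cov_xy; r                                                # r should be close to +1 here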

Chapter Two
Sampling
Sampling design: The first step in sampling is to design an effective sampling plan that will yield
representative samples of the population under study.

A sampling plan: is a description of the approach that will be used to obtain samples from a
population prior to any data collection activity.

It states;

- Objectives of sampling activity (to estimate the population mean, stdev, or differences
between two populations)
- The target population and the population frame, the complete list of all members (the list from which the sample is selected)
- The method of sampling
- The operational procedure for collecting the data and the statistical tools that will be used to analyse them.

Sampling methods: these may be subjective or probabilistic. Examples of subjective methods are judgement sampling, where expert judgement is used to select the sample, and convenience sampling, where samples are selected based on the ease with which the data can be collected.

 Probabilistic sampling involves selecting the items in a sample using some random procedure.
 Simple random sampling refers to selecting the items in a sample so that each has an equal chance of being selected. Sampling may be with or without replacement.

Other sampling methods

- Systematic sampling: select items periodically from the population.


- Stratified sampling: applies to populations that are divided into natural subsets (strata) and allocates the appropriate proportion of the sample to each stratum. Consider a country divided into regions with different population sizes: a stratified sample would choose a sample of individuals in each region proportionate to its size. This approach ensures that each stratum is weighted by its size relative to the population, and it can provide better results than simple random sampling when the strata differ from one another but the items within each stratum are relatively homogeneous.

Cluster sampling: is based on dividing a population into subgroups (clusters) and taking a simple random sample of the clusters. The items within each sampled cluster become the members of the sample.
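A hedged R sketch of simple random and systematic selection from a hypothetical frame of 100 units (the frame and the sample sizes are assumptions made for illustration only):

    set.seed(1)                                      # reproducible illustration
    frame <- 1:100                                   # hypothetical sampling frame

    srs <- sample(frame, size = 10)                  # simple random sample without replacement

    k <- length(frame) / 10                          # systematic sampling interval
    start <- sample(1:k, 1)                          # random starting point
    systematic <- frame[seq(start, length(frame), by = k)]

    srs; systematic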
Compare t-test and Wilcoxon test
Parametric and non parametric tests

- Parametric tests have assumptions that must be fulfilled by the data set, for example that it is normally distributed; non-parametric tests do not rely on such distributional assumptions.
- Because no parameters are estimated in a non-parametric test, it does not produce confidence intervals.

NB: If the model assumptions of the parametric test are fulfilled, then it is somewhat more efficient (on the order of 5% in large samples) to use the parametric test. Note also that:

1. Unless the sample size is 6 or more, the signed-rank test simply cannot become significant at the 5% level.
2. The apparent lack of assumptions for these tests sometimes misleads people into using them for data where the observations are not independent or where a comparison is biased by an important covariate.
3. The Wilcoxon tests are susceptible to the problem of ties, where several observations share the same value. In such cases, you simply use the average of the tied ranks. For the large-sample normal approximation this is not a problem, but the exact small-sample distributions become much more difficult to calculate, and standard implementations of the Wilcoxon test cannot do so.

One sample t-test


We have data $x_1, x_2, \ldots, x_n$, assumed to be independent realizations of random variables with distribution $N(\mu, \sigma^2)$, which denotes the normal distribution with mean $\mu$ and variance $\sigma^2$, and we wish to test the null hypothesis that $\mu = \mu_0$.

The key concept is that of the standard error of the mean, SEM, which describes the variation of the average of n random values with mean $\mu$ and variance $\sigma^2$. Its value is

$\mathrm{SEM} = \dfrac{\sigma}{\sqrt{n}}$

For normally distributed data, the rule of thumb is that there is a 95% probability of staying within $\mu \pm 2\sigma$, so we expect that, if $\mu_0$ were the true mean, then $\bar{x}$ should be within 2 SEM of it.

Formally, you calculate $t = \dfrac{\bar{x} - \mu_0}{\mathrm{SEM}}$

and see whether t falls within the acceptance region, outside which t should fall with probability equal to a specified significance level. This is often chosen as 5%, in which case the acceptance region is almost, but not exactly, the interval from −2 to 2 (±1.96). The tabular value can be looked up in the t-table with f = n − 1 degrees of freedom. If t falls outside the acceptance region, then we reject the null hypothesis at the chosen significance level.

P-values can also be calculated, i.e. the probability of obtaining a value numerically as large as, or larger than, the observed t; we reject the hypothesis if the p-value is less than the significance level.
Confidence Interval

The confidence interval is obtained by inverting the t-test, solving for the values of $\mu_0$ that cause t to lie within its acceptance region. For a 95% CI, the solution is

$\bar{x} - t_{0.975}(f) \times \mathrm{SEM} < \mu < \bar{x} + t_{0.975}(f) \times \mathrm{SEM}$
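A minimal sketch of the one-sample t-test in R, done both by hand and with the built-in t.test(); the data vector and the hypothesised mean mu0 = 50 are assumptions for illustration:

    x   <- c(52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.6)   # assumed data
    mu0 <- 50

    sem    <- sd(x) / sqrt(length(x))                # standard error of the mean
    t_stat <- (mean(x) - mu0) / sem
    ci     <- mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sem

    t_stat; ci
    t.test(x, mu = mu0)                              # same statistic, p-value and 95% CI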

Wilcoxon rank-sum test (also called the Wilcoxon two-sample test): a very simple non-parametric procedure for the comparison of the means of two continuous, non-normal populations when independent samples are selected from the populations. (It should not be confused with the Wilcoxon signed-rank test, which applies to paired or one-sample data.)


Null hypothesis to be tested is 𝐻0 : 𝜇1 = 𝜇2

 Let the sample size be n1 for the smaller sample and n2 for the larger sample. In case n1 and n2 are equal, there is no need to distinguish small and large.
 Arrange the n1 + n2 combined observations in ascending order and substitute a rank of 1, 2, ..., n1 + n2 for each observation. Where there is a tie, replace each observation by the mean of the tied ranks; e.g. if the 2nd and 3rd observations are equal, replace each of them with 2.5.
 Let W1 be the sum of the ranks corresponding to n1 observations and W2 the sum of the n2
observations.
 W1 + W2 depends only on the number of observations in the two samples.
Therefore if n1=3 and n2 = 4, then W1+W2 = 1+2+3+...+7 = 28
Hence $W_1 + W_2 = \dfrac{(n_1+n_2)(n_1+n_2+1)}{2}$, the arithmetic sum of the integers 1, 2, ..., n1 + n2.

If we choose repeated samples of n1 and n2, we would expect w1 and w2 to vary. We could
therefore think of w1 and w2 as values of random variables W1 and W2 respectively.

 The null hypothesis 𝜇1 = 𝜇2 will be rejected in favour of the alternative 𝜇1 < 𝜇2 only if w1 is
small and w2 is large.

Equally the alternative 𝜇1 > 𝜇2 can be accepted only if w1 is large and w2 is small.

In the case of a two-tailed test, H0 is rejected in favour of H1 if w1 is small and w2 is large, or if w1 is large and w2 is small. That is, $\mu_1 < \mu_2$ is accepted if w1 is sufficiently small, $\mu_1 > \mu_2$ is accepted if w2 is sufficiently small, and the alternative $\mu_1 \neq \mu_2$ is accepted if the minimum of w1 and w2 is sufficiently small.
Most often our decision is based on the values

$u_1 = W_1 - \dfrac{n_1(n_1+1)}{2}$  or  $u_2 = W_2 - \dfrac{n_2(n_2+1)}{2}$

If the observed value of $u_1$, $u_2$ or $u$ is less than or equal to the tabled critical value, the null hypothesis is rejected at the level of significance indicated by the table.

Example: consider two samples with n1 = 3, n2 = 5 and w1 = 8. Test the hypothesis $\mu_1 = \mu_2$ against the one-sided alternative $\mu_1 < \mu_2$ at the 0.05 level of significance.

Solution:
$u_1 = W_1 - \dfrac{n_1(n_1+1)}{2}$, i.e. $u_1 = 8 - \dfrac{3(4)}{2} = 2$, using Table A9 from Walpole.

From the table, we reject the null hypothesis of equal means when $u_1 \leq 1$. Since $u_1 = 2$ falls in the acceptance region, the null hypothesis cannot be rejected.

Example: The nicotine content of two brands of cigarettes, measured in milligrams, was found to be as follows.
Brand A: 2.1 4.0 6.3 5.4 4.8 3.7 6.1 3.3
Brand B: 4.1 0.6 3.1 2.5 4.0 6.2 1.6 2.2 1.9 5.4
Test the hypothesis at 0.05 level of significance, that the average nicotine content of the two brands is
equal against the alternative that they are unequal.

Solution: Proceed by the six-step rule with n1 =8, n2 =10


1. $H_0: \mu_1 = \mu_2$
2. $H_1: \mu_1 \neq \mu_2$
3. α = 0.05
4. Critical region: $u \leq 17$ (from the table)
5. Computations: The observations are arranged in ascending order and ranks from 1 to 18 are assigned.

The ranks of the observations from the smaller sample have been starred. Hence

W1 = 4 + 8 + 9 + 10.5 + 13 + 14.5 + 16 + 18 = 93, and
W2 = [(18)(19)/2] − W1, since $W_1 + W_2 = \dfrac{(n_1+n_2)(n_1+n_2+1)}{2}$;
thus W2 = [(18)(19)/2] − 93 = 171 − 93 = 78,
therefore $u_1 = 93 - \dfrac{8(9)}{2} = 57$
and $u_2 = 78 - \dfrac{10(11)}{2} = 23$,
so that $u = \min(u_1, u_2) = \min(57, 23) = 23$.

Original observations    Ranks
0.6                      1
1.6                      2
1.9                      3
2.1                      4*
2.2                      5
2.5                      6
3.1                      7
3.3                      8*
3.7                      9*
4.0                      10.5*
4.0                      10.5
4.1                      12
4.8                      13*
5.4                      14.5*
5.4                      14.5
6.1                      16*
6.2                      17
6.3                      18*

6. Decision: Accept H0 and conclude that there is no difference in the average nicotine content of the two brands of cigarettes.

NB: as n1 and n2 increase in size, the distributions of $u_1$ and $u_2$ approach the normal distribution with mean $\mu_{u_1} = \dfrac{n_1 n_2}{2}$ and variance $\sigma_{u_1}^2 = \dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}$. Consequently, when n2 > 20 and n1 is at least 10, one could use the statistic $Z = \dfrac{u_1 - \mu_{u_1}}{\sigma_{u_1}}$, with the critical region falling on both tails of the standard normal curve.

 The Wilcoxon rank-sum test is not restricted to non-normal populations. It can be used in place of the two-sample t-test when the populations are normal, although the probability of committing a Type II error will be larger.
 The Wilcoxon test is always superior to the t-test for decidedly non-normal populations.
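As a sketch, the same comparison can be run in R with the built-in wilcox.test(); its statistic W corresponds to u1 above, and because of the tied values in these data it falls back to a normal approximation with a warning:

    brand_a <- c(2.1, 4.0, 6.3, 5.4, 4.8, 3.7, 6.1, 3.3)
    brand_b <- c(4.1, 0.6, 3.1, 2.5, 4.0, 6.2, 1.6, 2.2, 1.9, 5.4)

    wilcox.test(brand_a, brand_b)                    # two-sided rank-sum (Mann-Whitney) test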

Hypothesis testing: this involves formulating two competing hypotheses about the value of a population parameter and using sample data to make a decision about which hypothesis can be supported.

Home work

1) The following data represent the number of hours that 2 different types of scientific pocket
calculators operate before a recharge is required.
Calculator A 5.5 5.6 6.3 4.6 5.3 5.0 6.2 5.8 5.1
Calculator B 3.8 4.8 4.3 4.2 4.0 4.9 4.5 5.2 4.5
Use the Wilcoxon rank sum test with α = 0.01 to determine if calculator A operates longer than
calculator B on a full battery charge.

2) A cigarette manufacturer claims that the tar content of brand B cigarette is lower than that of
brand A. To test this claim, the following determination of tar content in milligram were
recorded
Brand A 12, 09, 13, 11, 14
Brand B 08, 10, 07

Use the Wilcoxon rank sum test with α = 0.05 to test whether the claim is valid.

Chapter Three
TEST OF HYPOTHESIS (Theory)

Consider the following:-

1) A medical researcher may be required to decide, on the basis of experimental evidence, whether a certain vaccine is superior to one presently being marketed.
2) An engineer might have to decide on the basis of sample data whether there is a difference
between the accuracy of two kinds of gauges.
3) A Sociologist might wish to collect appropriate data to enable him to decide whether the
blood type and the eye colour of an individual are independent variables.

In all these, the problem will be the formulation of a set of rules that leads to a decision culminating
in the acceptance or rejection of some statement or hypothesis about the population. The procedure
for establishing this set of rules comprises a major area of statistical inference called Hypothesis
Testing

Definition: A statistical hypothesis is an assertion or conjecture concerning one or more populations.


In most cases, it is impracticable to examine the entire population, instead we take random samples
from the population and use information contained in it to decide whether the hypothesis is likely to
be true or false.

Note that the acceptance of a statistical hypothesis is a result of insufficient evidence to reject it and does not necessarily imply that it is true; e.g. tossing a coin 100 times with p = 0.5 for heads, in order to test whether the coin is balanced.

Note also that, the rejection of a hypothesis is to conclude that it is false, while acceptance of a
hypothesis merely implies that we have no evidence to believe otherwise.

With this in mind, a statistician or researcher will often state as his hypothesis that which he hopes to reject. For example, to test a new vaccine, he should assume that it is no better than the one now in use. Equally, to prove that one teaching technique is superior to another, we test the hypothesis that there is no difference between the two techniques.

Hypotheses formulated with the hope that they will be rejected are called null hypotheses, denoted H0, while the alternative is called the alternative hypothesis, denoted H1. Null hypotheses for population parameters are always specific, while the alternatives have many options. For example, if the null hypothesis is p = 0.5, the alternative H1 might be p < 0.5, p > 0.5 or p ≠ 0.5.

Definition:
Type I error: The rejection of the null hypothesis when it should be retained. It is usually caused by
setting the level of acceptance or rejecting the hypothesis very low.
Type II error: Retaining of null hypothesis when it should be rejected. Caused by setting the level of
acceptance or rejection too high.
The probability of committing a Type I error is called the level of significance denoted by α,
sometimes the level of significance is called the size of the critical region
For fixed sample sizes, a decrease in the probability of one error will always result in an increase in
the probability of the other error.
The probability of committing both types of errors can be reduced by increasing sample size.
One Tailed and Two tailed Test
A test of hypothesis of the kind 𝐻0 : 𝜇1 = 𝜇2 and 𝐻1 : 𝜇1 > 𝜇2 or 𝐻0 : 𝜇1 = 𝜇2 and 𝐻1 : 𝜇1 < 𝜇2 is
called a one-tailed test. The critical region for the alternative hypothesis 𝜇1 > 𝜇2 , lies entirely in the
right tail of the distribution while the critical region for 𝜇1 < 𝜇2 lies to the left tail of the distribution.
A test of any statistical hypothesis where the alternative is two-sided, such as
$H_0: \mu_1 = \mu_2$ and
$H_1: \mu_1 \neq \mu_2$,
is called a two-tailed test, with the critical region split equally between the two tails of the distribution. The alternative $\mu_1 \neq \mu_2$ states that either $\mu_1 < \mu_2$ or $\mu_1 > \mu_2$.

Example 1: The manufacturer of a certain brand of cigarette claims that the average nicotine content
does not exceed 2.5 milligrams. State the null and alternative hypotheses to be used in testing the
claim and determine where the critical region is located.

Solution: The manufacturer's claim should be rejected only if μ > 2.5 milligrams and should be accepted if μ ≤ 2.5 milligrams. Since the null hypothesis is always a single value of the parameter,
we test 𝐻0 : 𝜇 = 2.5 and
𝐻1 : 𝜇 > 2.5
Though H0 has been stated with an equal sign, it is understood to include any value not specified by
the alternative hypothesis.

Example 2: A real estate agent claims that 60% of all private residences being built today are 3- bed
rooms. To test his claim, a large sample of new residences is inspected; the proportion of these homes
with 3 bedrooms is recorded and used as our test statistics. State the null and alternative hypotheses
to be used in this test and determine the location of the critical region.

Solution: If the test statistic is substantially higher or lower than p = 0.6, we would reject the agent's claim. Hence we shall make the test:
𝐻0 : 𝑝 = 0.6 and
𝐻1 : 𝑝 ≠ 0.6
H1 implies a two-tailed test with critical region equally on both sides of the distribution of p.
NB: In testing hypotheses in which the test statistic is discrete, the critical region may be chosen arbitrarily and its size determined. If the size α is too large, it can be reduced by making an adjustment in the critical value. The sample size may be increased to offset the increase that automatically occurs in β. If the test statistic is continuous, it is common to choose the value of α to be 0.05 or 0.01 and then find the critical region. E.g. in a two-tailed test at the 0.05 level of significance, the critical values for a statistic having a standard normal distribution will be −Z0.025 = −1.96 and Z0.025 = 1.96. In terms of z-values, the critical region of size 0.05 will be z < −1.96 and z > 1.96.

- If the null hypothesis is rejected at the 0.05 level of significance, the test is said to be significant;
it is considered to be highly significant if the null hypothesis is rejected at the 0.01 level of
significance.
- The steps for testing a hypothesis concerning a population parameter θ against some alternative may be summarised as follows:

1) State the null hypothesis H0 that θ = θ0;
2) Choose an appropriate alternative from θ < θ0, θ > θ0 or θ ≠ θ0;
3) Choose a significance level of size α;
4) Select the appropriate test statistic and establish the critical region;
5) Compute the value of the test statistic from the sample data;
6) Decision: reject H0 if the test statistic has a value in the critical region, otherwise accept H0.

Recall of sample comparison ( Samples and the Normal)

Comparing two small samples from normal distribution.

If the population variances of both samples are equal, then the t-test becomes simpler. Equality of the variances is checked using the variance ratio (F) test.

F is the ratio of the larger variance to the smaller one, i.e. $F = \dfrac{S_1^2}{S_2^2}$ if $S_1^2 > S_2^2$; refer to the F-table with degrees of freedom n − 1 for both the numerator and the denominator, F(v1, v2).

NB: F-tables are designed for a one-tailed test; thus, to use them for a two-sided test, we must halve the chosen probability. Thus to test equality of variances at 5% we use the F-table at the 2.5% (0.025) level of significance.

If Fcal > Ftab, we reject equality of variances (H0).

a) If variances are assumed equal ( Fcal < Ftab)

In this case, the test statistic is calculated as follows:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, where S is the estimate of the combined (pooled) standard deviation, obtained as the square root of the pooled variance

$S^2 = \dfrac{\left(\sum X_1^2 - (\sum X_1)^2/n_1\right) + \left(\sum X_2^2 - (\sum X_2)^2/n_2\right)}{n_1 + n_2 - 2}$.

We then refer to the t-table at $n_1 + n_2 - 2$ degrees of freedom and the chosen level of significance.

The confidence interval for the difference of the two means is $\bar{x}_1 - \bar{x}_2 \pm t\left(S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\right)$.

b) Variance not assumed equal (Fcal, > Ftab)

If the variances are not equal, we obtain an approximate solution by calculating d (which is treated like t) with reduced degrees of freedom given by

$\dfrac{1}{f} = \dfrac{u^2}{n_1 - 1} + \dfrac{(1-u)^2}{n_2 - 1}$, where $u = \dfrac{S_1^2/n_1}{S_1^2/n_1 + S_2^2/n_2}$.

The confidence interval for the difference is given by $\bar{x}_1 - \bar{x}_2 \pm d\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}$, where d is treated as t with f degrees of freedom.

Example: In a drug test, the reaction times of two new drugs were each tested on 15 individuals with identical characteristics. The reaction times were recorded as follows:
Drug1 41 44 47 44 45 46 42 48 42 44 47 43 46 42 44
Drug2 32 34 36 34 33 37 34 34 38 36 35 34 37 34 36

Test whether the variances may be assumed to be equal and then test whether the reactions times
differ significantly.

Solution:
Drug type    Variance         Mean
drug 1       S1² = 4.52       x̄1 = 44.33
drug 2       S2² = 2.78       x̄2 = 34.93

$F = \dfrac{4.52}{2.78} = 1.62$. From the F-table, F(14, 14) at the 0.025 level of significance is 2.56. Since Fcal < Ftab, that is 1.62 < 2.56, we accept the hypothesis of equivalence and conclude that the variances are not significantly different. We then proceed, under the assumption of equal variances, to calculate the common (pooled) variance.

$\sum X_1 = 665$, $\sum X_2 = 524$, $\sum X_1^2 = 29545$, $\sum X_2^2 = 18344$. Thus the common variance can be calculated as

$S^2 = \dfrac{\left(\sum X_1^2 - (\sum X_1)^2/n_1\right) + \left(\sum X_2^2 - (\sum X_2)^2/n_2\right)}{n_1 + n_2 - 2} = \dfrac{\left(29545 - 665^2/15\right) + \left(18344 - 524^2/15\right)}{15 + 15 - 2} = \dfrac{63.33 + 38.93}{28} = \dfrac{102.26}{28} = 3.65$,

so the pooled standard deviation is $S = \sqrt{3.65} = 1.91$.

Therefore the test statistic is $t = \dfrac{44.33 - 34.93}{1.91\sqrt{\dfrac{1}{15} + \dfrac{1}{15}}} = \dfrac{9.4}{0.698} \approx 13.47$.

From the t-table at p = 0.05 (two-tailed) with 28 df we find t(0.025, 28) = 2.048.

Since tcal > ttab, i.e. 13.47 > 2.048, we reject the hypothesis of equal means and conclude that the reaction times of the two drugs differ significantly.
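As a check, the same analysis can be sketched in R: var.test() gives the variance-ratio (F) test and t.test() with var.equal = TRUE gives the pooled two-sample t-test.

    drug1 <- c(41, 44, 47, 44, 45, 46, 42, 48, 42, 44, 47, 43, 46, 42, 44)
    drug2 <- c(32, 34, 36, 34, 33, 37, 34, 34, 38, 36, 35, 34, 37, 34, 36)

    var.test(drug1, drug2)                           # F test for equality of variances
    t.test(drug1, drug2, var.equal = TRUE)           # pooled-variance two-sample t-test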

Exercises:
1) Samples of fish of the same age are taken from two different ponds in a fish farm, giving the following
data;
Pond 1 Pond2
Mean 50cm 47 cm
Variance 70 74
Sample size 36 36
Determine whether the fish population at that age in the two ponds differ in length.
2) A study was carried out to determine the effect of small quantities of a toxic industrial pollutant on
mosquito larvae. Water samples taken from two areas gave the following data on concentration of the
pollutant. (in mgl-1)
Area 1 1.2 4.3 0.9 5.2 3.8
Area 2 5.7 5.2 9.0 4.1
Determine first whether the variances may be assumed to be equal and then use an appropriate test to
determine whether the concentration of the pollutant differ in the two areas.
3) The mean weight increase of a large population of laboratory rats fed a standard diet for a fortnight is known to be equal to 20 units, with a standard deviation of 16 units. When 36 rats were housed in newly-installed cages, their mean weight increase was found to be 21.4 units. Determine whether this differs significantly from the previously known population mean.

Chapter Four: Sample Comparison
One way Analysis of Variance
Introduction
We discussed one-sample and two-sample mean tests in the previous chapters. The analysis of variance (ANOVA) is an extension of those tests: it enables us to compare the means of any number of independent random samples. The analysis of variance (in its parametric form) assumes normality of the distributions and homoscedasticity (identical variances). If these conditions are not met, then we must use the non-parametric Kruskal-Wallis test, which is the analogue of the one-factor analysis of variance; it does not assume normality of the distributions, but its disadvantage is lower sensitivity. In experiments where there are more than two treatments, it would be difficult to use the t-test to compare the means, since it is a pairwise comparison method and would involve a great deal of tedious work repeating the same calculations many times. Moreover, the more times we repeat the same test, the higher the risk of rejecting the null hypothesis by mistake.
A better way of testing the equality of several means simultaneously is the Analysis of Variance (ANOVA), developed by the Englishman Sir Ronald Fisher (1890-1962), who wanted a way to analyse large amounts of data on crop varieties, fertilizer and pesticide treatments, and so on.
Though the purpose of the ANOVA technique is to determine whether differences exist between population means, it ironically works by analysing the sample variances, hence its name. In its simplest form, the basic question it asks is: is the variation between samples greater or smaller than the variation within each sample? ANOVA is a method for splitting the total variation of our data into meaningful components that measure different sources of variation. ANOVA requires the assumptions that the m groups or factor levels being studied represent populations whose outcome measures are randomly and independently obtained, are normally distributed, and have equal variances. The principle behind ANOVA is to compare ratios of variation at different levels.
The null hypothesis is that, as the factor level varies, the means of all outcomes remain equal; the alternative hypothesis is that at least one mean differs.
H0: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$
H1: at least one mean is different from the others.
Further analysis could be performed in case there is a difference between treatments in order to
identify which is responsible for the difference.
If the classification of observations is based on a single criterion, we talk of One-Way
Classification. If the classification is based on two criteria, it is termed Two-Way classification

ONE-WAY CLASSIFICATION

Consider random samples of size n selected from each of k populations, assumed to be independent and normally distributed with means $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$ and common variance $\sigma^2$. Our intention is to test the null hypothesis
H0: $\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$ against the alternative
H1: at least one mean is different from the others (at least two means are not equal).
Let Xi j denotes the jth observation from the ith population such that X3,5 is the 5th observation in the 3rd
population.
Let 𝑋̅𝑖 . : be the mean of the ith population
𝑋̅. . : be the grand mean (mean of all the observations)
𝑇. . : Total of all nk observation
𝑇𝑖 . : be the total of the ith population
Each observation may therefore be written as
Xi,j = 𝜇𝑖 + 𝜀𝑖,𝑗 with 𝜀𝑖,𝑗 ~ 𝑁 (0, 𝜎 2 )
where 𝜀𝑖,𝑗 measures the deviation of the jth observation of the ith sample from the corresponding
population mean. Alternatively, each observation can be decomposed by substituting 𝜇𝑖 = 𝜇 + 𝛼𝑖
hence

$X_{i,j} = \mu + \alpha_i + \varepsilon_{i,j}$

with $\mu = \dfrac{\sum_{i=1}^{k}\mu_i}{k}$, $\alpha_i = \mu_i - \mu$ (the deviation of the group mean from the grand mean) and $\varepsilon_{i,j} = X_{i,j} - \mu_i$, such that $\sum_{i=1}^{k}\alpha_i = \sum_{i=1}^{k}(\mu_i - \mu) = 0$.

Usually, the 𝛼𝑖 are referred to as the effect of the ith population.

A One-Way ANOVA can be summarised in tabular form as follows:-


Columns: K random samples from k populations

        1       2       3      ...     i      ...     k
        X1,1    X2,1    X3,1   ...     Xi,1   ...     Xk,1
        X1,2    X2,2    X3,2   ...     Xi,2   ...     Xk,2
        X1,3    X2,3    X3,3   ...     Xi,3   ...     Xk,3
        ...     ...     ...    ...     ...    ...     ...
        X1,n    X2,n    X3,n   ...     Xi,n   ...     Xk,n
Total   T1.     T2.     T3.    ...     Ti.    ...     Tk.      T..
Mean    X̄1.     X̄2.     X̄3.    ...     X̄i.    ...     X̄k.      X̄..

NB: The homogeneity of variance can be tested by computing the F-ratio

$F = \dfrac{\text{larger sample variance}}{\text{smaller sample variance}}$

and then comparing it with the tabular F value with $v_1$, $v_2$ degrees of freedom at the $\alpha/2$ level of significance.

Assumptions of ANOVA
1) The data are normally distributed;
2) All treatments have more or less equal variance;
3) The effect of a treatment is additive.
a. A large sample size will tend to make the data approximately normal;
b. The variance increases as the mean increases;
c. Treatments are assumed to be independent of unexamined factors.

The computations of ANOVA are usually summarized in tabular form as shown below:

Source of variation   Sums of squares   Degrees of freedom   Mean square               Computed F
Column means          SSC               k − 1                S1² = SSC / (k − 1)       F = S1² / S2²
Error                 SSE               k(n − 1)             S2² = SSE / [k(n − 1)]
Total                 SST               nk − 1

The computed F (Fcal) is then compared with the tabular F (Ftab) with (k − 1, k(n − 1)) degrees of freedom at the given level of significance, to decide whether to accept or reject the null hypothesis, depending on whether Fcal < Ftab or Fcal > Ftab respectively.

Note that the F-test is one-sided for ANOVA. The estimates of 𝜎 2 are usually referred to as mean
squares.

Example 1: The table below shows the number of hours of pain relief provided by 5 different brands of headache tablets administered to 25 subjects. The 25 subjects were randomly divided into 5 groups and each group was treated with a different brand.

Tablets
A   B   C   D   E
5   9   3   2   7
4   7   5   3   6
8   8   2   4   9
6   6   3   1   4
3   9   7   4   7

Perform the analysis of variance and test the hypothesis, at the 0.05 level of significance, that the mean number of hours of relief provided by the tablets is the same for all five brands. In case the null hypothesis is rejected, use an appropriate test to determine the source(s) of the difference.

Solution: the sum-of-squares identity can be represented symbolically by the equation

SST = SSC + SSE

where

$SST = \sum_{i=1}^{k}\sum_{j=1}^{n} X_{i,j}^2 - \dfrac{T_{..}^2}{nk}$

$SSC = \dfrac{\sum_{i=1}^{k} T_{i.}^2}{n} - \dfrac{T_{..}^2}{nk}$

Hence SSE = SST − SSC.

Applying these identities to our headache tablet data and following the six steps in hypothesis testing,
proceed as follows:

i) H0: 𝑋̅𝐴 = 𝑋̅𝐵 = 𝑋̅𝐶 = 𝑋̅𝐷 = 𝑋̅𝐸


ii) H1: at least one mean is different from the others
iii) Significance level 𝛼 = 0.05
iv) Critical region (tabular F) f(v1, v2) = f(k-1, k(n-1)) = f(4,20) > 2.87
v) Computations:
a. Correction factor: $CF = \dfrac{T_{..}^2}{nk} = \dfrac{132^2}{5 \times 5} = \dfrac{17424}{25} = 696.960$
b. $SST = 5^2 + 4^2 + 8^2 + \cdots + 7^2 - 696.960 = 834 - 696.960 = 137.040$
c. $SSC = \dfrac{26^2 + 39^2 + 20^2 + 14^2 + 33^2}{5} - 696.960 = 776.4 - 696.960 = 79.440$
d. SSE = SST − SSC = 137.040 − 79.440 = 57.600
These results and the rest of the computations are summarized in tabular form as follows:-

Source of variation   Sums of squares   Degrees of freedom   Mean square                   Computed F
Tablet means          79.440            4                    S1² = 79.440 / 4 = 19.86      F = 19.86 / 2.88 = 6.896
Error                 57.600            20                   S2² = 57.600 / 20 = 2.88
Total                 137.040           24

vi) Decision: Since Fcal > Ftab, i.e. 6.896 > 2.87, reject H0 and conclude that the average number of hours of pain relief provided by the tablets is not the same for all five brands.
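A hedged sketch of the same one-way ANOVA in R using aov(); the data are entered brand by brand, and the resulting F value should agree with the hand computation above up to rounding:

    relief <- c(5, 4, 8, 6, 3,                       # brand A
                9, 7, 8, 6, 9,                       # brand B
                3, 5, 2, 3, 7,                       # brand C
                2, 3, 4, 1, 4,                       # brand D
                7, 6, 9, 4, 7)                       # brand E
    brand <- factor(rep(c("A", "B", "C", "D", "E"), each = 5))

    summary(aov(relief ~ brand))                     # ANOVA table: sums of squares, mean squares, F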

MULTIPLE RANGE TEST

In the case that the null hypothesis is rejected, it may be necessary to investigate the origin of the difference, that is, to know which of the means are not equal. One of the most powerful tests used in this situation is Duncan's multiple range test.

Assume that ANOVA has rejected H0. Assume also that k-random samples are of equal size.
The range of any subset of p sample means must exceed a certain value before we consider any of the p population means as being different. This value is called the least significant range for the p means and is denoted by $R_p$, where $R_p = r_p \cdot S_{\bar{x}} = r_p\sqrt{\dfrac{S^2}{n}}$. The sample variance $S^2$, which is an estimate of $\sigma^2$, is obtained from the mean square error (MSE) of the ANOVA. The quantity $r_p$, called the least significant studentized range, depends on the desired level of significance and the number of degrees of freedom of the MSE. These values are obtained from statistical tables for p = 2, 3, ..., 10 means.

To illustrate this procedure, consider the example on different kinds of headache tablets.

1) Arrange the means in increasing order of magnitude


X̄4     X̄3     X̄1     X̄5     X̄2
2.8    4.0    5.2    6.6    7.8

2) Obtain $S^2$ from the mean square error (MSE) of the ANOVA table, with the required degrees of freedom, i.e. $S^2 = 2.88$, df = 20.

3) Let α = 0.05; then from the statistical table of $r_p$ with 20 df, for p = 2, 3, 4, 5 we obtain the values below.

4) $R_p$ is obtained by multiplying each $r_p$ by $\sqrt{\dfrac{S^2}{n}} = \sqrt{\dfrac{2.880}{5}} = 0.76$.

P 2 3 4 5
𝑟𝑝 2.950 3.097 3.190 3.255
𝑅𝑝 2.24 2.35 2.42 2.47

Comparing these Least Significant ranges with the differences in ordered means, the following
conclusions can be arrived at:-

1) Since 𝑋̅2 − 𝑋̅5 = 1.2 < 𝑅2 = 2.24, conclude that 𝑋̅2 and 𝑋̅5 are not significantly different,
2) Since 𝑋̅2 − 𝑋̅1 = 2.6 > 𝑅3 = 2.35, conclude that 𝑋̅2 is significantly larger than 𝑋̅1 and
therefore µ2 > µ1 . It follows that µ2 > µ3 and µ2 > µ4 .
3) Since 𝑋̅5 − 𝑋̅1 = 1.4 < 𝑅2 = 2.24, conclude that 𝑋̅5 𝑎𝑛𝑑 𝑋̅1 are not significantly different.
4) Since 𝑋̅5 − 𝑋̅3 = 2.6 > 𝑅3 = 2.35, conclude that 𝑋̅5 is significantly larger that 𝑋̅3 and
therefore µ5 > µ3 also µ5 > µ4
5) Since𝑋̅1 − 𝑋̅3 = 1.2 < 𝑅2 = 2.24, conclude that 𝑋̅1 𝑎𝑛𝑑 𝑋̅3 are not significantly different.
6) Since 𝑋̅1 − 𝑋̅4 = 2.4 > 𝑅3 = 2.35, conclude that 𝑋̅1 is significantly larger than 𝑋̅4 and
therefore µ1 > µ4
7) Since 𝑋̅3 − 𝑋̅4 = 1.2 < 𝑅2 = 2.24, conclude that 𝑋̅3 𝑎𝑛𝑑 𝑋̅4 are not significantly different.

These conclusions can be summarized by drawing lines under subset of adjacent means that are
not significantly different. We therefore have

𝑋̅4 𝑋̅3 ̅𝑋1 𝑋̅5 𝑋̅2

2.8 4.0 5.2 6.6 7.8

Steps in the calculation of One-way ANOVA

1. Arrange data in tabular form as shown above;


2. Calculate treatment totals T1, T2, ..., Tk
3. Calculate grand total (GT) as T1 + T2 + T3 + ... + Tk
4. Calculate $\sum x^2$, the sum of the squared values of all replicates of all treatments;
5. Calculate $T^2/n$ for each treatment and then $\sum(T^2/n)$;
6. Calculate the correction factor (CF) as $CF = GT^2/kn$, or $GT^2/\sum n$ in the case of unequal sample sizes or numbers of replicates per treatment;
7. Calculate the total sum of squares as $SST = \sum x^2 - CF$;
8. Calculate the treatment sum of squares as $SSC = \sum(T^2/n) - CF$;
9. Calculate the error sum of squares as $SSE = SST - SSC$;
10. Draw the summary table as shown above. Note: the degrees of freedom (df) of SSE are $\sum(n-1)$ rather than k(n − 1) if the numbers of replicates per treatment are not equal;
11. Compare Fcal with Ftab for k − 1 and k(n − 1) df. H0 is rejected if Fcal > Ftab. NB: in case the numbers of replicates per treatment are different, the following modifications are made in the calculation: $CF = GT^2/\sum n$ (i.e. replace kn by $\sum n$) and the degrees of freedom of SSE become $\sum(n-1)$ instead of k(n − 1).
Unequal sample sizes

In case the numbers of observations in an experiment are not the same in the various samples, the one-way ANOVA can still be carried out. For example:

Example: Consider the tabulated data, taken from a car manufacturer, on defects in 3 models of cars.

Car models
 A    B    C
 4    5    8
 7    1    6
 6    3    8
 6    5    9
      3    5
      4
Total  23   21   36    (Grand total 80)

Test the hypothesis, at the 0.05 level of significance, that the average number of defects is the same for all 3 car models.

Solution: Using the six step of hypothesis testing, proceed as follows:


1) H0: 𝜇1 = 𝜇2 = 𝜇3 against the alternative
2) H1: at least two of the mean are not equal
3) Significance level α = 0.05
4) Critical region: Ftab > 3.89
5) Computations:
a. Correction factor: $CF = \dfrac{GT^2}{\sum n} = \dfrac{80^2}{15} = \dfrac{6400}{15} = 426.667$
b. $SST = \sum x^2 - CF = 4^2 + 7^2 + 6^2 + \cdots + 5^2 - CF = 492 - 426.667 = 65.333$
c. $SSC = \sum\left(\dfrac{T^2}{n_i}\right) - CF = \dfrac{23^2}{4} + \dfrac{21^2}{6} + \dfrac{36^2}{5} - CF = 464.95 - 426.667 = 38.283$
d. $SSE = SST - SSC = 65.333 - 38.283 = 27.050$
These results and the rest of the computations are summarized in tabular form as follows:-
Source of variation   Sums of squares   Degrees of freedom   Mean square                    Computed F
Car models            38.283            2                    S1² = 38.283 / 2 = 19.142      F = 19.142 / 2.254 = 8.492
Error                 27.050            12                   S2² = 27.050 / 12 = 2.254
Total                 65.333            14

6) Decision: Since Fcal > Ftab, i.e. 8.492 > 3.89, reject H0 and conclude that the average number of defects is not the same for all three car models.
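A sketch of the same analysis in R; aov() handles the unequal group sizes directly:

    defects <- c(4, 7, 6, 6,                         # model A (n = 4)
                 5, 1, 3, 5, 3, 4,                   # model B (n = 6)
                 8, 6, 8, 9, 5)                      # model C (n = 5)
    model <- factor(rep(c("A", "B", "C"), times = c(4, 6, 5)))

    summary(aov(defects ~ model))                    # F should be close to the 8.49 computed by hand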

Advantages of equal sample size

1) The F ratio is insensitive to departures from the assumption of equal variances for the k populations when the sample sizes are equal;
2) The choice of equal sample sizes minimizes the probability of committing a Type II error;
3) The computation of SSC is simplified if the sample sizes are equal.

Test for the equality of several variances (Bartlett’s Test)
Though the F-ratio test obtained from the ANOVA procedure is insensitive to departures from the assumption of equal variances for k normal populations when the samples are of equal sizes, it is prudent to run a preliminary test for homogeneity of variances. Such a test is certainly advisable in the case of unequal sample sizes if there is reasonable doubt concerning the homogeneity of the population variances. One of the tests used in this situation is called Bartlett's test.
Suppose therefore that we wish to test the null hypothesis

$H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \cdots = \sigma_k^2$ against the alternative

𝐻1 : The variances are not all equal

The Bartlett’s test is based on a statistics whose distribution provides exact critical values when the
sample sizes are equal. These critical values for equal sample sizes can also be used to yield highly
accurate approximations to the critical values for unequal sample sizes.

First we consider the k sample variances $S_1^2, S_2^2, S_3^2, \ldots, S_k^2$ from samples of sizes $n_1, n_2, n_3, \ldots, n_k$ with $\sum_{i=1}^{k} n_i = N$.

Second, combine the sample variances to give the pooled estimate $S_p^2 = \dfrac{\sum_{i=1}^{k}(n_i - 1)S_i^2}{N - k}$.

Now $b = \dfrac{\left[(S_1^2)^{n_1-1}(S_2^2)^{n_2-1}(S_3^2)^{n_3-1}\cdots(S_k^2)^{n_k-1}\right]^{1/(N-k)}}{S_p^2}$ is a value of a random variable B having the Bartlett distribution. For the special case when $n_1 = n_2 = n_3 = \cdots = n_k = n$, we reject H0 at the α level of significance if $b < b_k(\alpha; n)$, where $b_k(\alpha; n)$ is the critical value leaving an area of α in the left tail of the Bartlett distribution.

Statistical tables give the critical values $b_k(\alpha; n)$ for α = 0.01 and 0.05; k = 2, 3, 4, ..., 10; and selected values of n from 3 to 100.

When the sample sizes are unequal, the null hypothesis is rejected at the α level of significance if $b < b_k(\alpha; n_1, n_2, n_3, \ldots, n_k)$, where

$b_k(\alpha; n_1, n_2, n_3, \ldots, n_k) \approx \dfrac{n_1 b_k(\alpha; n_1) + n_2 b_k(\alpha; n_2) + n_3 b_k(\alpha; n_3) + \cdots + n_k b_k(\alpha; n_k)}{N}$

with all the $b_k(\alpha; n_i)$ for sample sizes $n_1, n_2, n_3, \ldots, n_k$ obtained from statistical tables.

Example: Use Bartlett test to test the hypothesis that the variances of the three populations in the
example on defects on three (3) car models are equal.

Car models
A B C
4 5 8
7 1 6
6 3 8
6 5 9
3 5
4
Total 23 21 36 80
Solution: using the six steps of hypothesis testing proceed as follows

1) H0 : 𝜎12 = 𝜎22 = 𝜎32


2) H1: At least one variance is not equal to the others.
3) Significance level 𝛼 = 0.05
4) Critical region: to obtain this, recall that n1= 4, n2 = 6, n3 = 5; and N= n1+n2+n3 = 4+6+5=15
and k =3. Therefore H0 is rejected when
$b < b_3(0.05; 4, 6, 5) \approx \dfrac{4(0.4699) + 6(0.6484) + 5(0.5762)}{15} = 0.5767$
5) Computations: First compute the k sample variances: $S_1^2 = 1.583$, $S_2^2 = 2.300$, $S_3^2 = 2.706$, and then

$S_p^2 = \dfrac{(3)(1.583) + (5)(2.300) + (4)(2.706)}{N - k} = \dfrac{27.07}{15 - 3} = 2.256$

Now $b = \dfrac{\left[(1.583)^3 (2.300)^5 (2.706)^4\right]^{1/12}}{2.256} \approx 0.98$

6) Decision: Since bcal > btab, i.e. 0.98 > 0.5767, accept the null hypothesis and conclude that the variances of the populations are equal.

NB: The value obtained from the calculator option is the sample standard deviation and must
be squared in order to obtain the variance. 𝑥𝜎𝑛−1 = 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑡𝑑𝑒𝑣; 𝑥𝜎𝑛 = 𝑝𝑜𝑝 𝑠𝑡𝑑𝑒𝑣
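In R, Bartlett's test is available as bartlett.test(); a sketch with the same car-defect data follows. Note that R reports Bartlett's chi-squared statistic rather than the tabulated b value used above, so the numbers differ although the conclusion should agree:

    defects <- c(4, 7, 6, 6,                         # model A
                 5, 1, 3, 5, 3, 4,                   # model B
                 8, 6, 8, 9, 5)                      # model C
    model <- factor(rep(c("A", "B", "C"), times = c(4, 6, 5)))

    bartlett.test(defects ~ model)                   # H0: equal variances across the three models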


Two way Analysis of Variance (2-WAY ANOVA)


Consider the classification of observations according to two criteria at once by means of a rectangular
array in which the columns represent one criterion and the rows the other.

For example, if we observed the yields of three varieties of maize, using four different kinds of
fertilizers. The yields can be represented in a tabular form with each treatment combination defining
a cell for which we have obtained a single observation.

Yields of maize in kg per plot


Fertilizer Varieties of maize Total
kinds V1 V2 V3
64 72 74 210
55 57 47 159
59 66 58 183
58 57 53 168
Total 236 252 232 720
We shall need to test whether the variation in our yields is caused by the different varieties of maize, the different kinds of fertilizer, or differences in both. The effect of one factor is called the "treatment effect" and that of the other is called the "block effect".

Summary table for 2-way ANOVA without replication

                                  columns
Row     1      2      3     ...    j     ...    C        Total    Mean
1       x11    x12    x13   ...    x1j   ...    x1c      T1.      x̄1.
2       x21    x22    x23   ...    x2j   ...    x2c      T2.      x̄2.
3       x31    x32    x33   ...    x3j   ...    x3c      T3.      x̄3.
...
i       xi1    xi2    xi3   ...    xij   ...    xic      Ti.      x̄i.
...
r       xr1    xr2    xr3   ...    xrj   ...    xrc      Tr.      x̄r.
Total   T.1    T.2    T.3   ...    T.j   ...    T.c      T..
Mean    x̄.1    x̄.2    x̄.3   ...    x̄.j   ...    x̄.c               x̄..

This is a rectangular array consisting of r rows and c columns with xij denoting an observation in the
ith row and the jth column. Xij are assumed to be normally distributed with mean 𝜇𝑖𝑗 and common
variance 𝜎 2 .

Ti. ⇒ Total at the ith row,


̅𝒊. ⇒ Mean of the ith row
𝒙
T.j ⇒ Total at the jth column
̅.𝒋 ⇒ Mean at the jth column
𝒙
T.. ⇒ Total of all rc observations
̅ . . ⇒ Mean of all rc observation.
𝑿

The c column represents the “Treatment” while the r rows represent the blocks. Two null hypotheses
could be treated in this case that for rows and column means.

The null hypotheses will be:-


$H_0': \bar{X}_{1.} = \bar{X}_{2.} = \bar{X}_{3.} = \cdots = \bar{X}_{r.} = \bar{X}$
$H_0'': \bar{X}_{.1} = \bar{X}_{.2} = \bar{X}_{.3} = \cdots = \bar{X}_{.c} = \bar{X}$;
and the alternative hypotheses will be given as
𝐻1′ ∶ At least two row means are not equal
𝐻1′′ ∶ At least two column means are not equal

The sum-of-squares identity may be represented symbolically by the equation


SST = SSR + SSC + SSE
Where

$SST = \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar{X}_{..})^2$ = total sum of squares

$SSR = c\sum_{i=1}^{r}(\bar{X}_{i.} - \bar{X}_{..})^2$ = sum of squares for row means

$SSC = r\sum_{j=1}^{c}(\bar{X}_{.j} - \bar{X}_{..})^2$ = sum of squares for column means

$SSE = \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})^2$ = error sum of squares

The sum-of-squares computational formulas are as follows:

$SST = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - \dfrac{T_{..}^2}{rc}$ (total sum of squares)

$SSR = \dfrac{\sum_{i=1}^{r} T_{i.}^2}{c} - \dfrac{T_{..}^2}{rc}$ (sum of squares of the row effect)

$SSC = \dfrac{\sum_{j=1}^{c} T_{.j}^2}{r} - \dfrac{T_{..}^2}{rc}$ (sum of squares of the column effect)

$SSE = SST - SSC - SSR$

These computations can be summarized in tabular form as follows

Source of variation                Sums of squares   Degrees of freedom   Mean square                    Computed F
Column means (treatment effect)    SSC               c − 1                MSC = SSC / (c − 1)            Fc = MSC / MSE
Row means (block effect)           SSR               r − 1                MSR = SSR / (r − 1)            Fr = MSR / MSE
Error                              SSE               (c − 1)(r − 1)       MSE = SSE / [(c − 1)(r − 1)]
Total                              SST               cr − 1

To illustrate this procedure, consider solving the maize/fertilizer problem.


Solution: Let α represent the row (fertilizer) effect and β represent the column (maize) effect.

1) 𝐻0′ ∶ α1 = α2 = α3 = α4 = 0
𝐻0′′ ∶ β1 = β2 = β3 = 0
2) 𝐻1′ ∶ At least one α𝑖 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
𝐻1′′ ∶ At least one β𝑗 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
3) Significance level α = 0.05
4) Critical regions: there are two critical regions, one for the row (fertilizer) effect and one for the column (maize) effect: a) $F_{1(v_1,v_2)} = F_{1(3,6)} = 4.76$; b) $F_{2(v_1,v_2)} = F_{2(2,6)} = 5.14$
5) Computations:

i) Correction factor $CF = \dfrac{GT^2}{rc} = \dfrac{720^2}{12} = 43200$

ii) $SST = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - \dfrac{T_{..}^2}{rc} = 64^2 + 55^2 + \cdots + 53^2 - 43200 = 662$

iii) $SSR = \dfrac{\sum_{i=1}^{r} T_{i.}^2}{c} - \dfrac{T_{..}^2}{rc} = \dfrac{210^2 + 159^2 + 183^2 + 168^2}{3} - 43200 = 43698 - 43200 = 498$

iv) $SSC = \dfrac{\sum_{j=1}^{c} T_{.j}^2}{r} - \dfrac{T_{..}^2}{rc} = \dfrac{236^2 + 252^2 + 232^2}{4} - 43200 = 43256 - 43200 = 56$

v) $SSE = SST - SSC - SSR = 662 - 498 - 56 = 108$

These results and the rest of the computations are summarized in tabular form as follows:-
Source of variation                 Sums of squares   Degrees of freedom   Mean square            Computed F
Maize varieties (treatment effect)  56                2                    MSC = 56 / 2 = 28      Fc = 28 / 18 = 1.56
Fertilizer kinds (block effect)     498               3                    MSR = 498 / 3 = 166    Fr = 166 / 18 = 9.22
Error                               108               6                    MSE = 108 / 6 = 18
Total                               662               11

6) Decisions:
a. Since Fc(cal) < Fc(tab), i.e. 1.56 < 5.14, accept $H_0''$ and conclude that the different varieties of maize have no significant effect on the variation in the yields.
b. Since Fr(cal) > Fr(tab), i.e. 9.22 > 4.76, reject $H_0'$ and conclude that the different kinds of fertilizer have a significant effect on the yield of the maize.
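A sketch of the same two-way (randomized block) analysis in R; the yields are entered row by row from the table above, and the F values should agree with the hand computation up to rounding:

    yield <- c(64, 72, 74,
               55, 57, 47,
               59, 66, 58,
               58, 57, 53)
    fertilizer <- factor(rep(paste0("F", 1:4), each = 3))       # rows (blocks)
    variety    <- factor(rep(c("V1", "V2", "V3"), times = 4))   # columns (treatments)

    summary(aov(yield ~ fertilizer + variety))       # additive two-way ANOVA without replication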

Steps in calculating 2-way ANOVA


(This is also suitable for analysing randomized block design)
1) Arrange data in tabular form as shown above;
2) Calculate Treatment totals T1, T2, ..., Tc
3) Calculate Block totals B1, B2, ..., Br
4) Calculate grand total (GT) as T1 + T2 + T3 + ... + Tc
5) Calculate ∑ 𝑥 2 , as sum of squared values of all blocks of all treatments

2 2
6) Calculate 𝑇 ⁄𝑟 for all treatments and calculate ∑(𝑇 ⁄𝑟);
2 2
7) Calculate 𝐵 ⁄𝑐 for each block and calculate ∑(𝐵 ⁄𝑐 )
2
8) Calculate Corrector Factor (CF) as 𝐶𝐹 = 𝐺𝑇 ⁄𝑟𝑐 .
9) Calculate Total Sum of Squares as 𝑆𝑆𝑇 = ∑ 𝑥 2 − 𝐶𝐹
2
10) Calculate Treatment / Column Sum of Squares as 𝑆𝑆𝐶 = ∑(𝑇 ⁄𝑟) − 𝐶𝐹;
2
11) Calculate Block / Row Sum of Squares as 𝑆𝑆𝐶 = ∑(𝐵 ⁄𝑐 ) − 𝐶𝐹
12) Calculate Error Sum of square as 𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝐶 − 𝑆𝑆𝑅;
13) Draw the summary table as shown above.
14) Equate Fcal to Ftab for both Treatment (Column) and Block (Row) respectively. H0 is rejected if
fcal> ftab
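These steps can also be carried out directly in R (one of the packages used for the practical lessons). The sketch below is illustrative only: the object names and the placeholder numbers are assumptions, not the course data, and aov() is used in place of the hand computations above.

# Minimal R sketch of a two-way ANOVA without replication (randomized block design).
# The numbers below are placeholders; replace them with the observed values,
# keeping exactly one observation per block x treatment combination.
yield     <- c(12, 15, 14,  10, 13, 11,  9, 12, 10,  11, 14, 13)   # one value per cell
block     <- factor(rep(c("B1", "B2", "B3", "B4"), each = 3))      # row (block) factor
treatment <- factor(rep(c("T1", "T2", "T3"), times = 4))           # column (treatment) factor

fit <- aov(yield ~ block + treatment)   # additive model: no replication, so no interaction term
summary(fit)                            # ANOVA table: SS, df, MS and F (compare each Fcal with Ftab)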
Exercises
1) To study the flowering behaviour of potatoes plants in the Dschang University farm, a student
noticed differences in the number of berries produced by some varieties. She decided to test
this by experiment and her results were as follows:-
| Varieties | Block I | Block II | Block III | Block IV | Block V |
|---|---|---|---|---|---|
| Ukama | 0.301 | 0.845 | 1.415 | 0.903 | 1.000 |
| Fatima | 1.663 | 1.431 | 0.000 | 0.000 | 0.000 |
| Estima | 1.041 | 0.301 | 0.845 | 0.477 | 0.602 |
| Desire | 1.380 | 0.864 | 1.000 | 0.000 | 0.845 |
| Roseline | 0.699 | 0.000 | 0.301 | 0.000 | 0.602 |
| Pbs 63-18 | 0.000 | 1.146 | 1.544 | 1.613 | 1.505 |
| B 89-49-1 | 1.531 | 1.580 | 1.477 | 1.381 | 0.000 |

a) State a suitable null hypothesis.
b) Display the data in a more helpful form.
c) What do these results indicate?
d) Comment on the differences between the varieties.

TWO - WAY ANOVA with REPLICATION


(Also suitable for n² factorial experiments)

Advantages of this type of experiment are:-

1) The sensitivity of the experiment is increased


2) The experiment is not totally ruined if one replicate is lost and
3) It is now possible to detect interactions between the two factors.

Summary table for 2-way ANOVA with replication.

| Row | Column 1 | Column 2 | ... | Column j | ... | Column c | Total | Mean |
|---|---|---|---|---|---|---|---|---|
| 1 | X111, X112, ..., X11n | X121, ..., X12n | ... | X1j1, ..., X1jn | ... | X1c1, ..., X1cn | T1.. | X̄1.. |
| 2 | X211, ..., X21n | X221, ..., X22n | ... | X2j1, ..., X2jn | ... | X2c1, ..., X2cn | T2.. | X̄2.. |
| ⋮ | | | | | | | ⋮ | ⋮ |
| i | Xi11, ..., Xi1n | Xi21, ..., Xi2n | ... | Xij1, ..., Xijn | ... | Xic1, ..., Xicn | Ti.. | X̄i.. |
| ⋮ | | | | | | | ⋮ | ⋮ |
| r | Xr11, ..., Xr1n | Xr21, ..., Xr2n | ... | Xrj1, ..., Xrjn | ... | Xrc1, ..., Xrcn | Tr.. | X̄r.. |
| Total | T.1. | T.2. | ... | T.j. | ... | T.c. | T... | |
| Mean | X̄.1. | X̄.2. | ... | X̄.j. | ... | X̄.c. | | X̄... |

Notation
Tij. ⇒ sum of the observations in the (i,j)th cell
Ti.. ⇒ sum of the observations in the ith row
T.j. ⇒ sum of the observations in the jth column
T... ⇒ sum of all rcn observations
X̄... ⇒ mean of all rcn observations
X̄ij. ⇒ mean of the observations in the (i,j)th cell
X̄i.. ⇒ mean of the observations in the ith row
X̄.j. ⇒ mean of the observations in the jth column

Each observation in the table may be written in the form


𝑋𝑖𝑗𝑘 = 𝜇𝑖𝑗 + 𝜀𝑖𝑗𝑘
Where 𝜀𝑖𝑗𝑘 measures the deviation of the observed 𝑋𝑖𝑗𝑘 values in the ijth cell from the population
mean 𝜇𝑖𝑗 . If we let (𝛼𝛽)𝑖𝑗 denote the interaction effect of the ith row and the jth column and 𝜇 the
overall mean, we can write
𝜇𝑖𝑗 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 and then

𝑋𝑖𝑗𝑘 = 𝜇 + 𝛼𝑖 + 𝛽𝑗 + (𝛼𝛽)𝑖𝑗 + 𝜀𝑖𝑗𝑘

On which we impose the restrictions

$$\sum_{i=1}^{r}\alpha_i = 0; \qquad \sum_{j=1}^{c}\beta_j = 0; \qquad \sum_{i=1}^{r}(\alpha\beta)_{ij} = 0; \qquad \sum_{j=1}^{c}(\alpha\beta)_{ij} = 0$$

The three hypotheses to be tested are as follows:-


1) 𝐻0′ ∶ α1 = α2 = α3 = α4 = ⋯ = α𝑟 = 0
𝐻0′′ ∶ β1 = β2 = β3 = β4 = ⋯ = β𝑐 = 0
𝐻0′′′ ∶ (𝛼𝛽)11 = (𝛼𝛽)12 = (𝛼𝛽)13 = ⋯ = (𝛼𝛽)𝑟𝑐 = 0 against the alternatives
2) 𝐻1′ ∶ At least one α𝑖 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
𝐻1′′ ∶ At least one β𝑗 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
𝐻1′′′ ∶ 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 (𝛼𝛽)𝑖𝑗 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
These shall be tested based on a comparison of independent estimates of σ², provided by splitting the total sum of squares of our data into four components by means of the identity SST = SSR + SSC + SS(RC) + SSE, which should be computed using the following formulas:

1) Correction factor: $CF = \frac{GT^2}{rcn} = \frac{T_{...}^2}{rcn}$; where r = number of rows, c = number of columns, and n = number of replicates per cell.

2) $SST = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n} X_{ijk}^2 - \frac{T_{...}^2}{rcn}$; total sum of squares

3) $SSR = \frac{\sum_{i=1}^{r} T_{i..}^2}{cn} - \frac{T_{...}^2}{rcn}$; sum of squares of row effect

4) $SSC = \frac{\sum_{j=1}^{c} T_{.j.}^2}{rn} - \frac{T_{...}^2}{rcn}$; sum of squares of column effect

5) $SS(RC) = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{T_{ij.}^2}{n} - \frac{\sum_{i=1}^{r} T_{i..}^2}{cn} - \frac{\sum_{j=1}^{c} T_{.j.}^2}{rn} + \frac{T_{...}^2}{rcn}$; interaction sum of squares

6) $SSE = SST - SSC - SSR - SS(RC)$, with degrees of freedom partitioned as $rcn - 1 = (c-1) + (r-1) + (r-1)(c-1) + rc(n-1)$, corresponding to $SST = SSC + SSR + SS(RC) + SSE$.
Example: Consider the following tabulated data on lacquer concentration against standing time (two replicates per cell):

| Standing time | Concentration ½ | Concentration 1 | Concentration 1½ | Concentration 2 | Ti.. |
|---|---|---|---|---|---|
| 30 | 16, 14 | 12, 11 | 17, 19 | 13, 11 | 113 |
| 20 | 15, 15 | 14, 17 | 15, 18 | 12, 14 | 120 |
| 10 | 10, 09 | 07, 06 | 10, 14 | 09, 13 | 78 |
| T.j. | 79 | 67 | 93 | 72 | T... = 311 |
| ∑X²ijk | 1083 | 835 | 1495 | 880 | |

From the table, use a 0.05 level of significance to test the following:-
a) There is no difference in the average effect of the lacquer concentrations. (H0′)
b) There is no difference in the average effect of the standing times. (H0′′)
c) There is no interaction between lacquer concentration and standing time. (H0′′′)

In many experiments the assumption of additivity does not hold, and a two-way ANOVA then leads to erroneous conclusions. In the maize/fertilizer problem, suppose variety V2 produces on average 5 kg of maize per plot more than variety V1 when fertilizer treatment T1 is used, but produces on average 2 kg per plot less than V1 when fertilizer treatment T2 is used. The varieties of maize and the kinds of fertilizer are then said to interact. An inspection of the data may suggest the presence of interaction; this could be real or may be due to experimental error. If the total variability of our data was in part the result of an interaction, this source of variation remains part of the error sum of squares, causing the error mean square to overestimate σ² and thereby increasing the probability of committing a type II error.

If we believe that the varieties of maize and the kinds of fertilizer interact, we could repeat the whole experiment twice more, i.e. using 36 one-acre plots rather than 12, and record the results. It is then customary to say that the experiment has been replicated three times.

Example: If the results of all three experiments were recorded as follows:-

| Fertilizer treatment | V1 | V2 | V3 | Total |
|---|---|---|---|---|
| T1 | 64, 66, 70 | 72, 81, 64 | 74, 51, 65 | |
| T2 | 65, 63, 58 | 57, 43, 52 | 47, 58, 67 | |
| T3 | 59, 68, 65 | 66, 71, 59 | 58, 39, 42 | |
| T4 | 58, 41, 46 | 57, 61, 53 | 53, 59, 38 | |
| Total | | | | |

Using a 0.05 level of significance to test the following hypotheses:-


a) 𝐻0′ : There is no difference in the average yield of maize when different kinds of fertilizers are used.
b) 𝐻0′′ : There is no difference in the average yield of the maize as a result of the varieties of maize.
c) H0′′′: There is no interaction between the different kinds of fertilizers and the different varieties of maize.

Solution: using the six steps in hypothesis testing, proceed as follows;


1) 𝐻0′ ∶ α1 = α2 = α3 = α4 = 0
𝐻0′′ ∶ β1 = β2 = β3 = 0
𝐻0′′′ ∶ (𝛼𝛽)11 = (𝛼𝛽)12 = (𝛼𝛽)13 = ⋯ = (𝛼𝛽)43 = 0
2) 𝐻1′ ∶ At least one α𝑖 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
𝐻1′′ ∶ At least one β𝑗 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
𝐻1′′′ ∶ 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 (𝛼𝛽)𝑖𝑗 𝑖𝑠 𝑛𝑜𝑡 𝑒𝑞𝑢𝑎𝑙 𝑡𝑜 𝑧𝑒𝑟𝑜
3) Significance level α = 0.05
4) Critical regions:
   a. Column (maize) effect: reject H0′′ if Fc > F0.05(2, 24) = 3.40;
   b. Row (fertilizer) effect: reject H0′ if Fr > F0.05(3, 24) = 3.01;
   c. Interaction effect: reject H0′′′ if Frc > F0.05(6, 24) = 2.51.

5) Computations: from the data, first construct a table of totals as follows:

| Fertilizer treatment | V1 | V2 | V3 | Total |
|---|---|---|---|---|
| T1 | 200 | 217 | 190 | 607 |
| T2 | 186 | 152 | 172 | 510 |
| T3 | 192 | 196 | 139 | 527 |
| T4 | 145 | 171 | 150 | 466 |
| Total | 723 | 736 | 651 | 2110 |

Now

1) Correction factor: $CF = \frac{2110^2}{36} = 123,669$
2) $SST = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n} X_{ijk}^2 - CF = 64^2 + 66^2 + 70^2 + \dots + 38^2 - 123,669 = 127,448 - 123,669 = 3,779$
3) $SSR = \frac{\sum_{i=1}^{r} T_{i..}^2}{cn} - CF = \frac{607^2 + 510^2 + 527^2 + 466^2}{9} - 123,669 = 124,826 - 123,669 = 1,157$
4) $SSC = \frac{\sum_{j=1}^{c} T_{.j.}^2}{rn} - CF = \frac{723^2 + 736^2 + 651^2}{12} - 123,669 = 124,019 - 123,669 = 350$
5) $SS(RC) = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{T_{ij.}^2}{n} - \frac{\sum_{i=1}^{r} T_{i..}^2}{cn} - \frac{\sum_{j=1}^{c} T_{.j.}^2}{rn} + CF = \frac{200^2 + 186^2 + \dots + 150^2}{3} - 124,826 - 124,019 + 123,669 = 771$
6) $SSE = SST - SSC - SSR - SS(RC) = 3,779 - 350 - 1,157 - 771 = 1,501$

These results and the remaining computations are summarized in the table below:

| Source of variation | Sums of squares | Degrees of freedom | Mean square | Computed F |
|---|---|---|---|---|
| Maize varieties (treatment effect) | 350 | 2 | MSC = 350/2 = 175.000 | Fc = 175.000/62.542 = 2.80 |
| Fertilizer kinds (block effect) | 1157 | 3 | MSR = 1157/3 = 385.667 | Fr = 385.667/62.542 = 6.17 |
| Interaction | 771 | 6 | MS(RC) = 771/6 = 128.500 | Frc = 128.500/62.542 = 2.05 |
| Error | 1501 | 24 | MSE = 1501/24 = 62.542 | |
| Total | 3779 | 35 | | |

7) Decisions:
   a) Since Fr(cal) > Fr(tab), i.e. 6.17 > 3.01, reject H0′ and conclude that the average yields of maize are not the same when different kinds of fertilizer are used.
   b) Since Fc(cal) < Fc(tab), i.e. 2.80 < 3.40, accept H0′′ and conclude that the variation in the average yield of maize is not a result of differences between the maize varieties.
   c) Since Frc(cal) < Frc(tab), i.e. 2.05 < 2.51, accept H0′′′ and conclude that the variation in the average yield of maize is not a result of an interaction between the maize varieties and the kinds of fertilizer.
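The same analysis can be reproduced in R with aov(), entering the 36 observations from the table above; the factor and level names below are mine, only the yields come from the example.

# R sketch of the replicated maize/fertilizer ANOVA (3 replicates per cell).
yield <- c(64, 66, 70,  72, 81, 64,  74, 51, 65,   # T1: V1, V2, V3
           65, 63, 58,  57, 43, 52,  47, 58, 67,   # T2: V1, V2, V3
           59, 68, 65,  66, 71, 59,  58, 39, 42,   # T3: V1, V2, V3
           58, 41, 46,  57, 61, 53,  53, 59, 38)   # T4: V1, V2, V3
fertilizer <- factor(rep(c("T1", "T2", "T3", "T4"), each = 9))
variety    <- factor(rep(rep(c("V1", "V2", "V3"), each = 3), times = 4))

fit <- aov(yield ~ fertilizer * variety)   # main effects plus the fertilizer:variety interaction
summary(fit)   # sums of squares agree (up to rounding) with SSR = 1157, SSC = 350, SS(RC) = 771, SSE = 1501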

Exercise:

1) The following data represents the final grade obtained by 5 students in Maths, English,
French and Biology.

Subjects
Students Mathematics English French Biology
1 68 57 73 61
2 87 94 91 86
3 72 81 63 59
4 55 73 77 66
5 92 68 75 87

Use a 0.05 level of significance to test the hypothesis that


a) The courses are of equal difficulty
b) The students have equal ability.

2) The following data represents the results of 4 quizzes obtained by 5 students in Mathematics,
English, French and Biology.

Students Mathematics English French Biology


1 88, 63, 79, 80 51, 58, 72, 65 73, 81, 77, 77 87, 81, 92, 76
2 79, 96, 56, 68 85, 95, 67, 88 82, 36, 80, 68 80, 93, 62, 67
3 67, 66, 51, 89 74, 47, 59, 82 91, 95, 92,59 77, 70, 84, 73
4 35, 60, 70, 64 76, 49, 26, 76 43, 52, 42, 32 55, 49, 53, 56
5 99, 77, 87, 95 84, 94, 83, 76 95, 81, 98, 96 83, 87, 76, 80

Use a 0.05 level of significance to test the hypotheses that:-


a) The courses are of equal difficulty;
b) The students have equal ability;
c) The students' strengths and subject difficulty do not interact.

3) The Environmental Protection Agency of a large suburban county is studying coliform


bacteria counts (in parts per thousand) at beaches within the county. Three types of beaches
are to be considered -- ocean, bay, and sound -- in three geographical areas of the county --
west, central, and east. Two beaches of each type are randomly selected in each region of the
county. The coliform bacteria counts at each beach on a particular day were as follows:

| Type of Beach | West | Central | East |
|---|---|---|---|
| Ocean | 25, 20 | 9, 6 | 3, 6 |
| Bay | 32, 39 | 18, 24 | 9, 13 |
| Sound | 27, 30 | 16, 21 | 5, 7 |
Enter the data and save as the file

At the 0.05 level of significance, is there an


a. effect due to type of beach?
H0: ________________________ H1: _______________________
p-value: _____________________ Decision: __________________
b. effect due to type of geographical area?
H0: ________________________ H1: _______________________
p-value: _____________________ Decision: __________________
c. effect due to type of beach and geographical area? OPTIONAL
H0: ________________________ H1: _______________________
p-value: _____________________ Decision: __________________
d. Based on your results, what conclusions concerning average bacteria count can be reached?

4) A videocassette recorder (VCR) repair service wished to study the effect of VCR brand and
service center on the repair time measured in minutes. Three VCR brands (A, B, C) were
specifically selected for analysis. Three service centers were also selected. Each service center
was assigned to perform a particular repair on two VCRs of each brand. The results were as
follows:

| Service Center | Brand A | Brand B | Brand C |
|---|---|---|---|
| 1 | 52, 57 | 48, 39 | 59, 67 |
| 2 | 51, 43 | 61, 52 | 58, 64 |
| 3 | 37, 46 | 44, 50 | 65, 69 |
Enter the data and save the file
At the .05 level of significance:
(a) Is there an effect due to service centers?
(b) Is there an effect due to VCR brand?
(c) Is there an interaction due to service center and VCR brand?

5) The board of education of a large state wishes to study differences in class size between
elementary, intermediate, and high schools of various cities. A random sample of three cities
within the state was selected. Two schools at each level were chosen within each city, and the
average class size for the school was recorded with the following results:

Education Level City A City B City C
Elementary 32, 34 26, 30 20, 23
Intermediate 35, 39 33, 30 24, 27
High School 43, 38 37, 34 31, 28
Enter the data and save the file .
At the .05 level of significance:
(a) Is there an effect due to education level?
(b) Is there an effect due to cities?
(c) Is there an interaction due to educational level and city?

6) The quality control director for a clothing manufacturer wanted to study the effect of operators
and machines on the breaking strength (in pounds) of wool serge material. A batch of material
was cut into square yard pieces and these were randomly assigned, three each, to all twelve
combinations of four operators and three machines chosen specifically for the equipment. The
results were as follows:

| Operator | Machine I | Machine II | Machine III |
|---|---|---|---|
| A | 115, 115, 119 | 111, 108, 114 | 109, 110, 107 |
| B | 117, 114, 114 | 105, 102, 106 | 110, 113, 114 |
| C | 109, 110, 106 | 100, 103, 101 | 103, 102, 105 |
| D | 112, 115, 111 | 105, 107, 107 | 108, 111, 110 |
Enter the data and save the file

At the .05 level of significance:


(a) Is there an effect due to operator?
(b) Is there an effect due to machine?
(c) Is there an interaction due to operator and machine?

Chapter Five

REGRESSION AND CORRELATION

There are many statistical investigations in which the main objective is to determine whether a relationship exists between two variables. If such a relationship can be expressed by a mathematical formula, we will then be able to use it for the purpose of making predictions.

A mathematical equation that allows us to predict values of one dependent variable from known values of one or more independent variables is called a Regression Equation. Here we are interested in the problem of estimating or predicting the value of a dependent variable Y on the basis of a known measurement of an independent and frequently controlled variable X.

When given any set of data to investigate, we first plot the data to see whether there is any linear relationship between the variables; such a plot is called a Scatter Diagram. Once a reasonable linear relationship has been ascertained, we try to express it mathematically by a straight-line equation called the Linear Regression Line.

From high school mathematics, we know that the slope-intercept form of a straight line can be written as Ŷ = a + bX, where a and b represent the Y intercept and slope respectively; Ŷ is used to distinguish the estimated value from the actual observed Y for some value of X.

Because most experimental data rarely fit a perfect straight line, we resort to drawing a "line of best fit" through the data. Fitting a line by eye can be quite accurate if the added precaution is taken of ensuring that it passes through the mean values of X and Y. The process of drawing a line of best fit entails drawing a line that keeps as close as possible to our points.

REGRESSION ANALYSIS

In our discussion of regression analysis, we will first focus our discussion on simple linear regression
and then expand to multiple linear regression. The reason for this ordering is not because simple
linear regression is so simple, but because we can illustrate our discussion about simple linear
regression in two dimensions and once the reader has a good understanding of simple linear
regression, the extension to multiple regression will be facilitated. It is important for the reader to
understand that simple linear regression is a special case of multiple linear regression. Regression
models are frequently used for making statistical predictions -- this will be addressed at the end of
this chapter.
Simple Linear Regression

Simple linear regression analysis is used when one wants to explain and/or forecast the variation in a
variable as a function of another variable. To simplify, suppose you have a variable that exhibits
variable behavior, i.e. it fluctuates. If there is another variable that helps explain (or drive) the
variation, then regression analysis could be utilized.

An Example
Suppose you are a manager for the Pinkham family, which distributes a product whose
sales volume varies from year to year, and you wish to forecast next year's sales
volume. Using your knowledge of the company and the fact that its marketing efforts
focus mainly on advertising, you theorize that sales might be a linear function of
advertising and other outside factors. Hence, the model’s mathematical function is:

SALESt = B0 + B1 ADVERTt + Errort

Where: SALESt represents Sales Volume in year t


ADVERTt represents advertising expenditures in year t
B0 and B1 are constants (fixed numbers)
and Errort is the difference between the actual sales volume
value in year t and the fitted sales volume value in year t

Note: the Errort term can account for influences on sales volume other than advertising.

Ignoring the error term one can clearly see that what is being proposed is a linear equation (straight
line) where the SALESt value depends on the value of ADVERTt. Hence, we refer to SALESt as the
dependent variable and ADVERTt as the explanatory variable.

To see if the proposed linear relationship seems appropriate we gather some data and plot the data to
see if a linear relationship seems appropriate. The data collected is yearly, from 1907 - 1960, hence,
54 observations. That is for each year we have a value for sales volume and a value for advertising
expenditures, which means we have 54 pairs of data.

Year Advert Sales

1907 608 1016


1908 451 921
. . .
. . .
. . .
. . .
1959 644 1387
1960 564 1289

To get a feel for the data, we plot the data as shown in Figure 1; such a display is called a scatter plot. (Hereafter, the scatter plot will simply be called a plot.)

Figure 1. Scatter Plot of Sales vs. Advertising (Sales on the vertical axis; Advertising, ×1000, on the horizontal axis)

As can be seen, there appears to be a fairly good linear relationship between sales (SALES) and advertising (ADVERT), at least for advertising less than 1200 (note the ×1000 scaling factor for ADVERT). At this point, we are ready to conclude the specification phase and move on to the estimation phase, where we estimate the best fitting line.
Summary: For a simple linear regression model, the functional relationship is: Yt = B0 + B1 Xt + Et
and for our example the dependent variable Yt is SALESt and the explanatory variable is ADVERTt.
We suggested our proposed model in the example based upon theory and confirmed it via a visual
inspection of the scatter plot for SALESt and ADVERTt. Note: In interpreting the model we are
saying that SALES depends upon ADVERT in the same time period and some other influences,
which are accounted for by the ERROR term.

Estimation

We utilize the computer to perform the estimation phase. In particular, the computer will calculate
the “best” fitting line, which means it will calculate the estimates for B0 and B1. The results are

Regression Analysis - Linear model: Y = a + b*X


-----------------------------------------------------------------------------
Dependent variable: sales
Independent variable: advert
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
Intercept 488.833 127.439 3.83582 0.0003
Slope 1.43459 0.126866 11.3079 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 1.50846E7 1 1.50846E7 127.87 0.0000
Residual 6.13438E6 52 117969.0
-----------------------------------------------------------------------------
Total (Corr.) 2.1219E7 53
Table 1.
Correlation Coefficient = 0.843149
R-squared = 71.0901 percent
Standard Error of Est. = 343.466

Since B0 is the intercept term and B1 represents the slope we can see that the fitted line is:
SALESt = 488.8 + 1.4 ADVERTt

The rest of the information presented in Table 1 can be used in the diagnostic checking phase that we
discuss next.

Diagnostic Checking

Once again the purpose of the diagnostic checking phase is to evaluate the model’s adequacy. To do
so, at this time we will restrict our analysis to just a few pieces of information in Table 1.

First of all, to see how well the estimated model fits the observed data, we examine the R-squared
(R2) value, which is commonly referred to as the coefficient of determination. The R2 value denotes
the amount of variation in the dependent variable that is explained by the fitted model. Hence, for
our example, 71.09 percent of the variation in SALES is explained by our fitted model. Another way
of viewing the same thing is that the fitted model does not explain 28.91 percent of the variation in
SALES.

A second question we are able to address is whether the explanatory variable, ADVERTt, is a
significant contributor to the model in explaining the dependent variable, SALESt. Thus, for our
example, we ask whether ADVERTt is a significant contributor to our model in terms of explaining
SALESt. The mathematical test of this question can be denoted by the hypothesis:

H0: B1 = 0
H1: B1 ≠ 0

which makes sense, given the previous statements, when one remembers that the model we proposed
is:

SALESt = B0 + B1 ADVERTt + ERRORt

Note: If B1 = 0 (i.e. the null hypothesis is true), then changes in ADVERTt will not produce a change in SALESt. From Table 1, we note that the p-value (probability level) for the hypothesis test, which resides on the line labeled Slope, is 0.0000 (truncated). Since the p-value is less than α = .05, we reject the null hypothesis and conclude that ADVERTt is a significant explanatory variable for the model, where SALESt is the dependent variable.
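The estimates, t statistics, p-values and R² discussed above can also be obtained in R with lm(). The sketch below is only illustrative: it enters just the four Pinkham rows printed in the data listing earlier, so the full 54-year series would be needed to reproduce the figures in Table 1.

# Minimal R sketch of fitting SALES_t = B0 + B1*ADVERT_t + Error_t by least squares.
# Only the four rows printed in the data listing above are entered here; the full
# 1907 - 1960 series is required to reproduce the Table 1 output.
pinkham <- data.frame(
  year   = c(1907, 1908, 1959, 1960),
  advert = c(608, 451, 644, 564),
  sales  = c(1016, 921, 1387, 1289)
)

fit <- lm(sales ~ advert, data = pinkham)   # least-squares estimates of the intercept (B0) and slope (B1)
summary(fit)                                # coefficients, standard errors, t statistics, p-values, R-squared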

An Example

To further illustrate the topic of simple linear regression and the model building
process, we consider another model using the same data set. However, instead
of using advertising to explain the variation in sales, we hypothesize that a good
explanatory variable is to use sales lagged one year. Recall that our time series
data is in yearly intervals, hence, what we are proposing is a model where the
value of sales is explained by its amount one time period (year) ago. This may
not make as much theoretical sense [to many] as the previous model we
considered, but when one considers that it is common in business for variables
to run in cycles, it can be seen to be a valid possibility.

Figure 2. Plot of Sales vs. Lag(Sales,1) (sales on the vertical axis; lag(sales,1) on the horizontal axis)

Looking at Figure 2 as shown above, one can see that there appears to be a linear relationship
between sales and sales one time period before. Thus the model being specified is:
SALESt = B0 + B1 SALESt-1 + Errort

Where: SALESt represents sales volume in year t


SALESt-1 represents sales volume in year t-1
B0 and B1 are constants (fixed numbers)
and Errort is the difference between the actual sales volume value in year t and the fitted
sales volume value in year t

Estimation

Using statistical software, we are able to estimate the parameters B0 and B1 as shown in Table 2.

Regression Analysis - Linear model: Y = a + b*X


-----------------------------------------------------------------------------
Dependent variable: sales
Independent variable: lag(sales,1)
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
Intercept 148.303 98.74 1.50196 0.1393
Slope 0.922186 0.050792 18.1561 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 1.77921E7 1 1.77921E7 329.64 0.0000
Residual 2.75265E6 51 53973.5
-----------------------------------------------------------------------------
Total (Corr.) 2.05447E7 52

Correlation Coefficient = 0.9306


R-squared = 86.6017 percent
Standard Error of Est. = 232.322

hence, the fitted model is:

SALESt = 148.30 + 0.92 SALESt-1

Diagnostic Checking

In evaluating the attributes of this estimated model, we can see that we are now able to fit the variation in sales better, as R², the amount of explained variation in sales, has increased from 71.09 percent to 86.60 percent. Also, as one probably expects, the test of whether SALESt-1 does not have a significant linear relationship with SALESt is rejected. That is, the p-value for

H0: B1 = 0
H1: B1 ≠ 0

is less than alpha (.0000 < .05). There are other diagnostic checks that can be performed, but we will postpone those discussions until we consider multiple linear regression. Remember: simple linear regression is a specific case of multiple linear regression.
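In R, the lagged explanatory variable can be built by shifting the sales series one year before fitting. The numbers below are placeholders so the sketch runs as written; the actual 1907 - 1960 series would be substituted.

# Sketch of the lagged model SALES_t = B0 + B1*SALES_{t-1} + Error_t.
sales <- c(1000, 1100, 1050, 1200, 1300, 1250)   # placeholder series; substitute the real yearly sales
sales_lag1 <- c(NA, head(sales, -1))             # sales shifted forward by one year

fit_lag <- lm(sales ~ sales_lag1)                # the first year is dropped automatically (NA lag)
summary(fit_lag)                                 # compare the layout of the output with Table 2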

Update

At this point, we have specified, estimated and diagnostically checked (evaluated) two simple linear
regression models. Depending upon one’s objective, either model may be utilized for explanatory or
forecasting purposes.

Using the Model

As discussed previously, the end result of regression analysis is to be able to explain the variation of sales and/or to forecast values of SALESt. We have now discussed how both of these end results can be achieved.

Explanation

As suggested by Tables 1 and 2, when estimating a simple linear regression model one is calculating estimates for the intercept and slope of the fitted line (B0 and B1 respectively). The interpretation associated with the slope (B1) is that, for a unit change in the explanatory variable, it represents the corresponding change in the dependent variable along the fitted line. Of course, this interpretation only holds in the region where the model has been fitted to the data. The usual interpretation for the intercept is that it represents the fitted value of the dependent variable when the independent (explanatory) variable takes on a value of zero. This is correct only when the data used for the explanatory variable include values near zero. When one does not use values of the explanatory variable near zero to estimate the model, then it does not make sense to even attempt to interpret the intercept of the fitted line.

Referring back to our examples, neither data set examined values for the explanatory variables
(ADVERTt and SALESt-1) near zero, hence we do not even attempt to give an economic
interpretation to the intercepts. With regards to the model:

SALESt = 488.83 + 1.43 ADVERTt

the interpretation of the estimated slope is that a unit change in ADVERT ($1,000) will generate, on
the average, a change of 1.43 units in SALESt ($1,000). For instance, when ADVERTt increases
(decreases) by $1,000 the average effect on SALESt will be an increase (decrease) of $1,430. One
caveat, this interpretation is only valid over the range of values considered for ADVERT, which is
the range from 339 to 1941 (i.e., minimum and maximum values of ADVERT).

Forecasting

Calculating the point estimate with a linear regression is a very simple process. All one needs to do
is substitute the specific value of the explanatory variable, which is being forecasted, into the fitted
model and the output is the point estimate.

For example, referring back to the model:

SALESt = 488.8 + 1.4 ADVERTt

if one wishes to forecast a point estimate for a time period when ADVERT will be 1200 then the
point estimate is

2168.8 = 488.8 + 1.4 (1200)

Deriving a point estimate is useful, but managers usually find more information in confidence
intervals. For regression models, there are two sets of confidence intervals for point forecasts that are
of use as shown in Figure 3 on the next page.
Figure 3. Regression of Sales on Advertising (fitted line with the two sets of confidence limits; Sales on the vertical axis, Advertising ×1000 on the horizontal axis)

Viewing Figure 3,¹ one can see two sets of dotted lines, each set being symmetric about the
fitted line. The inner set represents the limits (upper and lower) for the mean response for a given
input, while the other set represents the limits of an individual response for a given input. It is the
outer set that most managers are concerned with, since it represents the limits for an individual value.
For right now, it suffices to have an intuitive idea of what the confidence limits represent and
graphically what they look like. So for an ADVERT value of 1200 (input), one can visually see that
the limits are approximately 1500 and 2900. (The values are actually 1511 and 2909.) Hence, when
advertising is $1,200 for a time period (ADVERTt = 1,200) then we are 95 percent confident that
sales volume (SALESt) will be between approximately 1,500 and 2,900.

1
Figure 3 was obtained by selecting Plot of Fitted Line under the Graphical Options icon.
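Continuing the earlier R sketch (and with the same caveat that the full series is needed for the actual numbers), the point estimate and the two kinds of limits described above can be obtained with predict():

# Point forecast and interval limits for ADVERT = 1200, using the fitted object `fit`
# from the earlier sketch of the sales-on-advertising regression.
new_x <- data.frame(advert = 1200)

predict(fit, newdata = new_x)                            # point estimate for ADVERT = 1200
predict(fit, newdata = new_x, interval = "confidence")   # limits for the mean response (inner band)
predict(fit, newdata = new_x, interval = "prediction")   # limits for an individual response (outer band)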
MARKET MODEL - Stock Beta’s

An important application of simple linear regression in business is the calculation of the β of a stock.² The β's are measures of risk used by portfolio managers when selecting stocks.

The model used (specified) to calculate a stock β is:

Rj,t = α + β Rm,t + εt

Where: Rj,t is the rate of return for the jth stock in time period t
Rm,t is the market rate of return in time period t
εt is the error term in time period t
α and β are constants

To illustrate the above model, we will use data that resides in the data file SLR.SF3. In particular, we will calculate β's for the Anheuser Busch Corporation, the Boeing Corporation, and American Express, using the New York Stock Exchange (NYSE - Finance) as the "market" portfolio. The data in the file SLR.SF3 have already been converted from monthly values of the individual stock prices and dividends to monthly rates of return (starting with December 1986).

For all three stocks, the model being specified and estimated follows the form stated in the equation shown above; the individual stock's rate of return is used as the dependent variable and the NYSE rate of return as the independent variable.

2
For an additional explanation on the concept of stock beta’s, refer to the Appendix.
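As a sketch of how such a β could be estimated in R: the return vectors below are placeholders, not the SLR.SF3 data, and the slope of the fitted line plays the role of β.

# Estimating a stock beta from the market model R_j,t = alpha + beta*R_m,t + e_t.
r_market <- c(0.021, -0.013, 0.034, 0.008, -0.025, 0.017)   # placeholder monthly market returns
r_stock  <- c(0.030, -0.010, 0.041, 0.012, -0.030, 0.020)   # placeholder monthly stock returns

beta_fit <- lm(r_stock ~ r_market)   # least-squares fit of the market model
coef(beta_fit)["r_market"]           # the slope coefficient is the estimated beta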
Multiple Linear Regression
Referring back to the Pinkham data, suppose you decided that ADVERTt contained information about SALESt that the lagged value of SALESt (i.e. SALESt-1) did not, and vice versa, and that you wished to regress SALESt on both ADVERTt and SALESt-1; the solution would be to use a multiple regression model. Hence, we need to generalize our discussion of simple linear regression models by now allowing for more than one explanatory variable, hence the name multiple regression. [Note: more than one explanatory variable; we are not limited to just two explanatory variables.]

Specification: Going back to our example, if we specify a multiple linear regression model where
SALESt is again the dependent variable and ADVERTt and SALESt-1 are the explanatory variables,
then the model is:

SALESt = B0 + B1 ADVERTt + B2 SALESt-1 + ERRORt

where: B0, B1, and B2 are parameters (coefficients).

Estimation: To obtain estimates for B0, B1, and B2 via StatGraphics, the criterion of least squares still applies; the mathematics employed involves matrix algebra. It suffices for the student to understand what the computer is doing on an intuitive level, i.e. the best fitting line is being generated. The results from the estimation phase are shown in Table 7.

Multiple Regression Analysis


-----------------------------------------------------------------------------
Dependent variable: sales
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT 138.691 95.6602 1.44982 0.1534
lag(sales,1) 0.759307 0.0914561 8.30242 0.0000
advert 0.328762 0.155672 2.11189 0.0397
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 1.80175E7 2 9.00875E6 178.23 0.0000
Residual 2.52722E6 50 50544.3
-----------------------------------------------------------------------------
Total (Corr.) 2.05447E7 52

R-squared = 87.699 percent


R-squared (adjusted for d.f.) = 87.2069 percent
Standard Error of Est. = 224.821
Mean absolute error = 173.307
Durbin-Watson statistic = 0.916542

Table 7
Diagnostic Checking

We still utilize the diagnostic checks we discussed for simple linear regression. We are now going to expand that list and include additional diagnostic checks; some require more than one explanatory variable, but most also pertain to simple linear regression. We waited to introduce some of these checks [that also pertain to simple linear regression] because we didn't want to introduce too much at one time, and most of the corrective measures involve knowledge of multiple regression as an alternative model.

The first diagnostic we consider involves focusing on whether any of the explanatory variables
should be removed from the model. To make these decision(s) we test whether the coefficient
associated with each variable is significantly different from zero, i.e. for the ith explanatory variable:

H0: Bi = 0

H1: Bi ≠ 0

As discussed in simple linear regression, this involves a t-test. Looking at Table 7, the p-values for the tests associated with determining the significance of SALESt-1 and ADVERTt are 0.0000 and 0.0397, respectively, so we can ascertain that neither explanatory variable should be eliminated from the model. If one of the explanatory variables had a p-value greater than α = .05, then we would designate that variable as a candidate for deletion from the model and go back to the specification phase.

Another attribute of the model we are interested in is the R2 adjusted value that in Table 7 is 0.8721,
or 87.21 percent. Since we are now considering multiple linear regression models, the R2 value that
we calculate represents the amount of variation in the dependent variable (SALESt) that is explained
by the fitted model, which includes all of the explanatory variables jointly (ADVERTt and SALESt-
1). At this point we choose to ignore the adjusted (ADJ) factor included in the printout.
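For completeness, a multiple regression of this form can be fitted in R by listing both explanatory variables in the formula. The sketch below uses placeholder values rather than the Pinkham series, so only the mechanics (not the Table 7 numbers) carry over.

# Sketch of the two-variable model SALES_t = B0 + B1*ADVERT_t + B2*SALES_{t-1} + Error_t.
advert <- c(600, 450, 700, 650, 500, 720)         # placeholder advertising values
sales  <- c(1000, 1100, 1050, 1200, 1300, 1250)   # placeholder sales values
sales_lag1 <- c(NA, head(sales, -1))              # sales lagged one year

fit_mr <- lm(sales ~ advert + sales_lag1)   # least squares with two explanatory variables
summary(fit_mr)                             # t-test for each coefficient, R-squared and adjusted R-squared

plot(fitted(fit_mr), resid(fit_mr),         # residuals vs fitted values: look for any remaining pattern
     xlab = "Fitted values", ylab = "Residuals")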

Having already asked whether anything should be deleted from the model, the next question is whether anything is missing from the model, i.e. should we add anything to it. To answer this question we should use theory, but from an empirical perspective we look at the residuals to see if they have a pattern, which, as we discussed previously, would imply there is information. If we find missing information for the model (i.e. a pattern in the residuals), then we go back to the specification phase, incorporate that information into the model, and then cycle through the three-phase process again with the revised model. We will illustrate this in greater detail in our next example; however, the process involved is very similar to what we employed earlier in the semester. We illustrate the residual analysis with a new example.

Example

The purpose behind looking at this example is to allow us to work with some cross sectional data and
also to look in greater detail at analyzing the residuals. The data set contains three variables that have
been recorded by a firm that presents seminars. Each record focuses on a seminar with the fields
representing:

 number of people enrolled (ENROLL)


 number of mailings sent out (MAIL)
 lead time (in weeks) of 1st mailing (LEAD)
The theory being suggested is that the variation in the number of enrollments is an approximate linear
function of the number of mailings and the lead-time. As recommended earlier, we look at the scatter
plots of the data to see if our assumptions seem valid. Since we are working with two explanatory
variables, a three dimensional plot would be required to see all three variables simultaneously, which
can be done in StatGraphics with the PLOTTING FUNCTIONS, X-Y-Z LINE and SCATTER PLOT
options (note the dependent variable is usually Z). See Figure 7 for this plot.

Figure 7. Plot of Enroll vs. Mail & Lead (three-dimensional scatter plot of enroll against mail and lead)

This plot provides some insight, but for beginners, it is usually more beneficial to view multiple two-
dimensional plots where the dependent variable ENROLL is plotted against the different explanatory
variables, as is shown in Figures 8 and 9.

Figure 8. Plot of Enroll vs. Mail (enroll on the vertical axis; mail on the horizontal axis)

Figure 9. Plot of Enroll vs. Lead (enroll on the vertical axis; lead on the horizontal axis)

Looking at Figure 9, which plots ENROLL against LEAD, we notice that there is a dip for the largest LEAD values, which may economically suggest diminishing returns, i.e. beyond a point a larger lead time is counterproductive. This suggests that ENROLL and LEAD may have a parabolic relationship. Since the general equation of a parabola is:

y = ax² + bx + c

we may want to consider including a squared term of LEAD in the model. However, at this point we are not going to do so, with the strategy that if it is needed we will see it when we examine the residuals: having ignored some information in the data, that information should surface when we analyze the residuals. (In other words, we wish to show that if a term should be included in a model but is not identified, one should be able to identify it as missing when examining the residuals of a model estimated without it.)

Specification

Thus the model we tentatively specify is:

ENROLLi = B0 + B1 MAILi + B2 LEADi + ERRORi

Estimation
Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: enroll
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT 14.8523 2.1596 6.87733 0.0000
lead 0.627318 0.165436 3.79191 0.0008
mail 1.27378 0.233677 5.45103 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 1985.35 2 992.674 56.87 0.0000
Residual 453.824 26 17.4548
-----------------------------------------------------------------------------
Total (Corr.) 2439.17 28

R-squared = 81.3943 percent


R-squared (adjusted for d.f.) = 79.9631 percent
Standard Error of Est. = 4.17789
Mean absolute error = 3.33578
Durbin-Watson statistic = 1.03162

Table 8
Note that MAIL and LEAD are both significant, since their p-values are 0.0000 and 0.0008,
respectively. Hence, there is no need at this time to eliminate either from the model. Also, note that
R2adj is 79.96 percent.

To see if there is anything that should be added to the model, we analyze the residuals to see if they contain any information. Utilizing the graphics options icon, one can obtain a plot of the standardized residuals versus LEAD (select residuals versus X). Plotting against the predicted values is similar to looking for departures from the fitted line. For our example, since we entertained the idea of some curvature (a parabola) when plotting ENROLL against LEAD, we now plot the residuals against LEAD. This plot is shown as Figure 10.

Figure 10. Residual Plot for Enroll against Lead (studentized residuals on the vertical axis; lead on the horizontal axis)

What we are looking for in the plot is whether there is any information in LEAD that is missing from the fitted model. If one sees that curvature still exists, then it suggests that one needs to add another variable, actually a transformation of LEAD, to the model. Hence we go back to the specification phase, based upon the information just discovered, and specify the model as:

ENROLLi = B0 + B1 MAILi + B2 LEADi + B3 LEADi² + ERRORi

The estimation of the revised model generates the output presented in Table 9.

Multiple Regression Analysis


-----------------------------------------------------------------------------
Dependent variable: enroll
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT 0.226184 2.89795 0.0780495 0.9384
lead 4.50131 0.675669 6.66201 0.0000
mail 0.645073 0.189375 3.40633 0.0022
lead * lead -0.132796 0.022852 -5.81115 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 2246.12 3 748.707 96.96 0.0000
Residual 193.053 25 7.72211
-----------------------------------------------------------------------------
Total (Corr.) 2439.17 28

R-squared = 92.0853 percent


R-squared (adjusted for d.f.) = 91.1356 percent
Standard Error of Est. = 2.77887
Mean absolute error = 2.081
Durbin-Watson statistic = 1.121

Table 9

Diagnostic Checking

At this point we go through the diagnostic checking phase again. Note that all three explanatory
variables are significant and that the R2adj value has increased to 91.13 percent from 79.96 percent.
For our purposes at this point, we are going to stop our discussion of this example, although the
reader should be aware that the diagnostic checking phase has not been completed. Residual plots
should be examined again, and other diagnostic checks we still need to discuss should be considered.

Before we proceed, however, it should be pointed out that the last model is still a multiple linear regression model. Many students think that by including the squared term to incorporate the curvature we may have violated the linearity condition. This is not the case: when we say "linear" we mean linear with regard to the coefficients. An intuitive explanation is to think like the computer; all LEAD² represents is the squared values of LEAD, so the calculations are the same as if LEAD² were simply another explanatory variable.
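In R, the squared term can be added inside the formula with I(), which keeps lead^2 as the literal square of LEAD. The data frame and the values below are assumptions used only to make the sketch self-contained; the real file has one row per seminar.

# Sketch of the revised model ENROLL = B0 + B1*MAIL + B2*LEAD + B3*LEAD^2 + Error.
seminars <- data.frame(                 # placeholder rows, not the seminar firm's data
  enroll = c(28, 35, 42, 51, 47, 39),
  mail   = c(5, 8, 10, 14, 12, 9),
  lead   = c(4, 8, 12, 18, 22, 16)
)

fit2 <- lm(enroll ~ mail + lead + I(lead^2), data = seminars)   # I() protects the square in the formula
summary(fit2)   # the coefficient on I(lead^2) plays the role of B3; the model is still linear in the B's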

The next three multiple regression topics we discuss will be illustrated with data that were part of a survey of houses conducted in Eugene, Oregon, during the 1970's. The variables measured (recorded) for each house are sales price (price), square feet (sqft), number of bedrooms (bed), number of bathrooms (bath), total number of rooms (total), age in years (age), whether the house has an attached garage (attach), and whether the house has a nice view (view).

Dummy Variables
Prior to this current example, all the regression variables we have considered have been either ratio or
interval data, which means they are non-qualitative variables. However, we now want to incorporate
qualitative variables into our analysis. To do this we create dummy variables, which are binary
variables that take on values of either zero or one. Hence, the dummy variable (attach) is defined as:

attach = 1 if garage is attached to house

0 otherwise (i.e. not attached)

and

view = 1 if house has a nice view

0 otherwise

Note that each qualitative attribute (attached garage and view) cited above has two possible outcomes
(yes or no) but there is only 1 dummy variable for each. That is because there must always be, at
maximum, one less dummy variable than there are possible outcomes for a particular qualitative
attribute. We mention this because there are going to be situations, for other examples, where one
wants to incorporate a qualitative attribute that has more than two possible outcomes in the analysis.
For example, if one is explaining sales and has quarterly data they might want to include the season
as an explanatory variable. Since there are four seasons (Fall, Winter, Spring, and Summer) there
will be three (four minus one) dummy variables. To define these three dummy variables, we
arbitrarily select one season to “withhold” and create dummy variables for each of the other seasons.
For example, if summer was “withheld” then our three dummy variables could be

D1 = 1 if Fall
0 otherwise
D2 = 1 if Winter
0 otherwise
D3 = 1 if Spring
0 otherwise
Now, what happens when we withhold a season is not that we ignore that season; rather, the other seasons are compared against the one that is withheld.
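A brief R sketch of this idea is given below; the season vector and variable names are illustrative assumptions, not part of the survey data.

# Sketch of seasonal dummy variables with Summer withheld as the base category.
season <- c("Fall", "Winter", "Spring", "Summer", "Fall", "Winter")   # placeholder values

d1 <- as.numeric(season == "Fall")     # D1
d2 <- as.numeric(season == "Winter")   # D2
d3 <- as.numeric(season == "Spring")   # D3
cbind(season, d1, d2, d3)              # Summer rows carry 0 on all three dummies

# Equivalently, R constructs such dummies automatically when a factor is used in a model,
# e.g. lm(sales ~ advert + factor(season)), withholding one level as the base category.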
Outliers
When an observation has an undue amount of influence on the fitted regression model (coefficients)
then it is called an outlier. Ideally, each observation has an equal amount of influence on the
estimation of the fitted lines. When we have an outlier, the first question one needs to ask is “Why is
that observation an outlier?” The answer to that question will frequently dictate what type of action
the model builder should take.

One reason an observation may be an outlier is because of a recording (inputting) error. For instance, it is easy to mistakenly input an extra zero, transpose two digits, etc. When this is the cause, corrective action can clearly be taken. Don't always assume the data are correct! Another source is some extraordinary event that we do not expect to occur again, or the observation may not be part of the population we wish to make interpretations/forecasts about. In these cases, the observation may be "discarded."

If the data are cross-sectional, then the observation may be eliminated, thereby decreasing the number of observations by one. If the data are time series, one does not eliminate observations by "discarding the impact" of an observation, since doing so may affect lagging relationships; instead, one can set a dummy variable equal to one (1) for that observation and zero (0) otherwise.

At other times, the outcome, which is classified as an outlier, is recorded correctly, may very well
occur again, and is indeed part of the concerned population. In this case, one would probably want to
leave the observation in the model construction process. In fact, if an outlier or set of outliers
represents a source of specific variation then one should incorporate that specific variation into the
model via an additional variable. Keep in mind, just because an observation is an outlier does not
mean that it should be discarded. These observations contain information that should not be ignored
just so “the model looks better.”
Now that we have defined what an outlier is and what action to take or not take for outliers, the next step is to discuss how to determine which observations are outliers. Although a number of criteria exist for classifying outliers, we limit our discussion to two specific criteria - standardized residuals and leverage.

The theory behind using standardized residuals is that outliers are equated with observations that have large residuals. To determine what is large, we standardize the residuals and then use the rule that any standardized residual outside the bounds of -2 to 2 is considered an outlier. [Why do we use -2 and 2? Could we use -3 and 3?]

The theory behind the leverage criterion is that a large residual may not necessarily equate with an outlier. Hence, the leverage value measures the amount of influence that each observation has on the set of estimates. It is not intuitive, but it can be shown mathematically, that the sum of the leverage values is equal to the number of B coefficients in the model (P). Since there are N observations, under ideal conditions each observation should have a leverage value of P/N. Hence, using our criterion of "large" being outside two standard deviations, the decision rule for declaring outliers by means of leverage values is to declare an observation a potential outlier if its leverage value exceeds 2*P/N. StatGraphics employs a cut-off of 3*P/N.
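Both screens are easy to compute in R once a model has been fitted; the sketch below uses a small set of placeholder house records (not the survey data) purely to make it runnable.

# Sketch of the two outlier screens: standardized residuals and leverage values.
houses <- data.frame(                                   # placeholder records, not the Eugene survey
  price = c(45, 52, 60, 75, 88, 120, 64, 58, 95, 70),
  sqft  = c(10, 12, 15, 20, 24, 34, 16, 14, 27, 18),
  bed   = c(2, 2, 3, 3, 4, 4, 3, 2, 4, 3)
)
fit_h <- lm(price ~ sqft + bed, data = houses)

std_res <- rstandard(fit_h)        # standardized residuals
lev     <- hatvalues(fit_h)        # leverage values; they sum to P, the number of coefficients
p <- length(coef(fit_h))
n <- nrow(houses)

which(abs(std_res) > 2)            # flagged by the standardized-residual rule (|value| > 2)
which(lev > 2 * p / n)             # flagged by the leverage rule (cut-off 2*P/N)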

To illustrate identifying outliers, we estimate the model:

PRICEi = B0 + B1 SQFTi + B2 BEDi + ERRORi

Multiple Regression Analysis


-----------------------------------------------------------------------------
Dependent variable: price
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT -15.4038 7.34394 -2.09749 0.0414
sqft 3.52674 0.269104 13.1055 0.0000
bed 7.64828 2.78697 2.7443 0.0086
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 29438.5 2 14719.3 140.65 0.0000
Residual 4918.52 47 104.649
-----------------------------------------------------------------------------
Total (Corr.) 34357.0 49

R-squared = 85.6841 percent


R-squared (adjusted for d.f.) = 85.0749 percent
Standard Error of Est. = 10.2298
Mean absolute error = 7.19612
Durbin-Watson statistic = 1.682

Table 10
With the results shown in Table 10, in our data set of houses, clearly some houses are going to influence the estimates more than others. Those with undue influence will be classified as potential outliers. Again, the standardized residuals outside the bounds -2, +2 (i.e. absolute value greater than 2), and the leverage values greater than 3 × 3/50 (P = 3 since we estimated coefficients for two (2) explanatory variables and the intercept, and N = 50 since there were 50 observations) will be flagged. After estimating the model we select the "unusual residuals" and "influential points" options under the tabular options icon. Note that from Tables 11 and 12, observations 8, 42, 44, 47, 49 and 50 are classified as outliers.

Unusual Residuals
--------------------------------------------------------------
Predicted Studentized
Row Y Y Residual Residual
--------------------------------------------------------------
44 111.3 85.482 25.818 2.73
47 115.2 92.1828 23.0172 2.40
49 129.0 89.2508 39.7492 5.03
--------------------------------------------------------------
Table 11

Influential Points
------------------------------------------------
Mahalanobis
Row Leverage Distance DFITS
------------------------------------------------
8 0.0816156 3.28611 0.560007
42 0.144802 7.14775 0.58652
49 0.0947427 4.04401 1.62728
50 0.339383 23.6798 0.0932134
------------------------------------------------
Average leverage of single data point = 0.06

Table 12

Once the outliers are identified one then needs to decide what, if anything, needs to be modified in
the data or model. This involves checking the accuracy of the data and/or determining if the outliers
represent a specific source of variation. To ascertain any sources of specific variation one looks to
see if there is anything common in the set, or subset, of observations flagged as outliers. In Tables 11 and 12³ one can see that some of the latter observations (42, 44, 47, 49, and 50) were flagged. Since the data (n = 50) were entered in ascending order of price, one can see that the higher priced homes were flagged. As a result, for this example, the higher priced homes are receiving a large amount of influence. Hence, since this is cross-sectional data, one might want to split the analysis into two models - one for "lower" priced homes and a second for "higher" priced homes.

Multicollinearity

When selecting a set of explanatory variables for a model, one would ideally like each explanatory variable to provide unique information that is not provided by the other explanatory variable(s). When
explanatory variables provide duplicate information about the dependent variable, then we encounter
a situation called multicollinearity. For example, consider our house data again, where the following
model is proposed:

Price = B0 + B1 SQFT + B2 BATH + B3 TOTAL + ERROR

Clearly there is a relationship among the three (3) explanatory variables. What problems might this
create? To answer this, consider the estimation results, which are shown on the following page.

³ StatGraphics also uses two other techniques for identifying outliers (Mahalanobis Distance and DFITS), which we have elected not to discuss since, on an intuitive level, they are similar to the standardized residual/leverage criteria.

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: price
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT -42.6274 9.50374 -4.48533 0.0000
sqft 3.02471 0.296349 10.2066 0.0000
bath -10.0432 3.49189 -2.87614 0.0061
total 10.7836 2.06048 5.23351 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 30780.2 3 10260.1 131.95 0.0000
Residual 3576.84 46 77.7575
-----------------------------------------------------------------------------
Total (Corr.) 34357.0 49

R-squared = 89.5892 percent


R-squared (adjusted for d.f.) = 88.9102 percent
Standard Error of Est. = 8.81802
Mean absolute error = 5.89115
Durbin-Watson statistic = 1.53269

Table 13

If one were to start interpreting the coefficients individually and noticed the bath has a negative
coefficient, they might come to the conclusion that one way to increase the sales price is to eliminate
a bathroom. Of course, this doesn’t make sense, but it does not mean the model is not useful. After
all, when the BATH is altered so are the TOTAL and SQFT. So a problem with multicollinearity is
one of interpretation when other associated changes are not considered. One important fact to remember is that the presence of multicollinearity does not mean the model cannot be used for meaningful forecasting, provided the forecasts are within the data region considered for constructing the model.
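One quick way to see the overlap among SQFT, BATH and TOTAL is to look at their pairwise correlations. The sketch below uses the same kind of placeholder house records as before; with the real survey data the correlations would quantify the duplication of information just described.

# Sketch of checking for multicollinearity among the explanatory variables.
houses <- data.frame(                                   # placeholder records, not the Eugene survey
  sqft  = c(10, 12, 15, 20, 24, 34, 16, 14, 27, 18),
  bath  = c(1, 1, 2, 2, 2, 3, 2, 1, 3, 2),
  total = c(5, 6, 6, 7, 8, 10, 7, 6, 9, 7)
)

cor(houses)   # pairwise correlations; values close to +1 or -1 signal strongly overlapping
              # (collinear) explanatory variables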

Predicting Values with Multiple Regression

Regression models are frequently used for making statistical predictions. A multiple regression
model is developed, by the method of least squares, to predict the values of a dependent, response,
variable based on two or more independent, explanatory variables.

Research data can be classified as cross-sectional data or as time series data. Cross-sectional data
has no time dimension, or it is ignored. Consider collecting data on a group of subjects. You are
interested in their age, weight, height, gender, and whether they tend to be left-handed. The time
dimension in collecting the data is not important and would probably be ignored; even though
researchers tend to collect the data within a reasonably short time period.

Time series data is a sequence of observations collected from a process with equally spaced periods
of time. For example, in collecting sales data, the data would be collected weekly with the time (the
specific week of the year) and sales being recorded in pairs.

Using Cross-sectional Data for Predictions

When using regression models for making predictions with cross-sectional data, it is imperative that
you use only the relevant range of the predictor variable(s). When predicting the value of the
response variable for a given value of the explanatory variable, one may interpolate within the range
of the explanatory variables. However, contrary to when using time series data, one may not
extrapolate beyond the range of the explanatory variables. (To predict beyond the range of an
explanatory variable is to assume that the relationship continues to hold true below and/or above the
range -- something that is not known nor can it be determined. To make such an interpretation is
meaningless and, at best, subject to gross error.)

An Example: Using a Regression Model to Predict

Consider the following research problem - a real estate firm is interested in developing a model to
predict, or forecast, the selling price of a home in a local community. Data was collected on 50
homes in a local community over a three week period.

The data can consist of both qualitative and quantitative values. Quantitative variables are
measurable whereas qualitative variables are descriptive. For example: your height, a quantitative
value, is measurable whereas the color of your hair, a qualitative variable, is descriptive.

For our real estate example, Table 13 lists the dependent variable (selling price) and the explanatory
variables (square feet, number of bathrooms, and total number of rooms), together with the range of
values (low to high) for each. All of the variables are quantitative; none of the data are qualitative.

Table 13. Variables With Range of Values

-----------------------------------------------------
Variable                           Range of Values
-----------------------------------------------------
Price (selling) ($1000)            30.6 - 165
Square feet (100 ft2)              8 - 40
Number of bathrooms                1 - 3
Total number of rooms              5 - 12
-----------------------------------------------------
As a review, the multiple regression model can be expressed as:

Yi = β0 + β1X1 + β2X2 + β3X3 + εi

The slope βi, known as a net regression coefficient, represents the change in Y per unit change in
Xi, taking into account (or, holding constant) the effect of the remaining explanatory variables. In
our real estate problem, b1, the coefficient of X1 (square feet), represents the change in selling
price per unit change in square feet, holding constant the number of bathrooms and the total number
of rooms.

The resulting model fitting equation is shown in Table 14.

Multiple Regression Analysis


-----------------------------------------------------------------------------
Dependent variable: price
-----------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
-----------------------------------------------------------------------------
CONSTANT -42.6274 9.50374 -4.48533 0.0000
sqft 3.02471 0.296349 10.2066 0.0000
bath -10.0432 3.49189 -2.87614 0.0061
total 10.7836 2.06048 5.23351 0.0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source Sum of Squares Df Mean Square F-Ratio P-Value
-----------------------------------------------------------------------------
Model 30780.2 3 10260.1 131.95 0.0000
Residual 3576.84 46 77.7575
-----------------------------------------------------------------------------
Total (Corr.) 34357.0 49

R-squared = 89.5892 percent


R-squared (adjusted for d.f.) = 88.9102 percent
Standard Error of Est. = 8.81802
Mean absolute error = 5.89115
Durbin-Watson statistic = 1.53269
Table 14

Multiple regression analysis is conducted to determine whether the null hypothesis, written as
H0: β1 = β2 = β3 = 0, can be rejected. If the null hypothesis can be rejected, then there is sufficient
evidence of a relationship (or an association) between the response variable and the explanatory
variables in the sample. Table 14 also displays the resulting analysis of variance (ANOVA) for the
multiple regression model using the explanatory variables listed in Table 13.

The ANOVA for the full multiple regression shows a p-value equal to 0.0000, thus H0 can be
rejected (because the p-value is less than α = 0.05). Since the null hypothesis may be rejected, there
is sufficient evidence of a relationship (or an association) between selling price and the three
explanatory variables in the sample of 50 houses.
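
The same model can also be fitted in R (one of the packages used in the practical lessons). The
sketch below is illustrative only; it assumes the 50 observations of HOUSE.SF have been placed in a
data frame called house with columns price, sqft, bath and total (hypothetical names chosen to
mirror Table 14).

# Minimal sketch, assuming a data frame `house` with columns price, sqft,
# bath and total (a hypothetical export of the Statgraphics file HOUSE.SF).
fit <- lm(price ~ sqft + bath + total, data = house)

summary(fit)   # coefficient estimates, standard errors, t statistics,
               # p-values, R-squared and the overall F-test, as in Table 14
anova(fit)     # sequential sums of squares; the residual line matches Table 14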

CAUTION: As stated, when using regression models for making predictions with
cross-sectional data, use only the relevant range of the explanatory variable(s). To
predict outside the range of an explanatory variable is to assume that the relationship
continues to hold true below and/or above the range -- something that is not known
nor can be determined. To make such an interpretation is meaningless and, at best,
subject to gross error.

Suppose one wishes to obtain a point estimate, along with confidence intervals for both the individual
forecast and the mean, for a home with the following attributes:
1500 square feet, 1 bath, 6 total rooms.
To do this using Statgraphics, all one needs to do is add an additional row to the data file
(HOUSE.SF). In particular, one would insert a 15 in the sqft column (remember that square feet is
recorded in hundreds of square feet), a 1 in the bath column and a 6 in the total column. We leave
the other columns blank, especially the price column, since Statgraphics will treat it as a missing
value and hence estimate it. To see the desired output, one runs the regression using the additional
data point, goes to the tabular options icon and selects the "report" option. Table 15 shows the
forecasting results for our example.

Regression Results for price


------------------------------------------------------------------------------------------------------
Fitted Stnd. Error Lower 95.0% CL Upper 95.0% CL Lower 95.0% CL Upper 95.0% CL
Row Value for Forecast for Forecast for Forecast for Mean for Mean
------------------------------------------------------------------------------------------------------
51 57.4014 9.1313 39.021 75.7818 52.6282 62.1746
------------------------------------------------------------------------------------------------------

Table 15
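
The same point estimate and interval limits can be reproduced in R with predict(); a minimal
sketch, assuming the fitted model fit from the earlier sketch.

# New home: 1500 sq ft (15 hundreds of square feet), 1 bath, 6 rooms in total.
newhome <- data.frame(sqft = 15, bath = 1, total = 6)

predict(fit, newdata = newhome, interval = "prediction", level = 0.95)
# -> fitted value with 95% limits for an individual forecast

predict(fit, newdata = newhome, interval = "confidence", level = 0.95)
# -> fitted value with 95% limits for the mean response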

Summary

In the introduction to this section, cross-sectional data and time series data were defined. With
cross-sectional data, the time dimension in collecting the data is not important and can be ignored,
even though researchers tend to collect the data within a reasonably short time period. When
predicting the value of the response variable for a given value of the explanatory variable with cross-
sectional data, a researcher is restricted to interpolating within the range of the explanatory variables.
A researcher may not extrapolate beyond the range of the explanatory variables, because it cannot be
assumed (or validated) that the relationship continues to hold true below and/or above that range.
Cross-sectional forecasting is stationary; it does not change over time.

On the other hand, time series data is a sequence of observations collected from a process at
equally spaced periods of time. Contrary to the restrictions placed on cross-sectional data, when
using time series data a major purpose of forecasting is to extrapolate beyond the range of the
explanatory variables. Time series forecasting is dynamic; it does change over time.

Practice Problem

As part of your job as personnel manager for a company that produces an industrial
product, you have been assigned the task of analyzing the salaries of workers involved in
the production process. To accomplish this, you have decided to develop the "best"
model, utilizing the concept of parsimony, to predict their weekly salaries. Using the
personnel files, you select, based on systematic sampling, a sample of 49 workers
involved in the production process. The data, entered in the file COMPANY,
correspond to their weekly salaries, lengths of employment, ages, gender, and job
classifications.

a. ŷ = _________________________________________________________

b. H0: ______________________ H1: _______________________

p-value: ___________________ Decision: __________________

c. In the final model, state the value and interpret for R2adj. R2adj: ________ %

d. In the final model, state the value and interpret for b1 . b1 = ________

e. Predict the weekly salaries for the following employees:

Category                              Employee #1      Employee #2
Length of employment (in months)      10               125
Age (in years)                        23               33
Gender                                female           male
Job classification                    technical        clerical

Employee        95% LCL        ŷ        95% UCL


#1
#2

[Check documentation on file to ascertain gender coding for female and male. Also check for proper
coding for job classification.]

Stepwise Regression
When there is a large number of potential explanatory variables, a good exploratory technique
one can utilize is known as stepwise regression. This technique involves introducing or deleting
variables one at a time. There are two general procedures under the umbrella of stepwise regression -
- forward selection and backward elimination. A hybrid of forward selection and backward
elimination also exists and is generally just known as stepwise.

In the sections below, we describe the three (3) procedures cited above. In order to follow the
discussion, we first need to review the t- test for regression coefficients. Recall that for the model

Yi = β0 + β1X1,i + β2X2,i + ... + βkXk,i + εi

the t-test for:   H0: βk = 0
                  H1: βk ≠ 0

actually tests whether the variable Xk should be included in the model. If one rejects H0, then the
decision is to keep Xk in the model, whereas if one does not reject H0 the decision is to eliminate Xk
from the model. Since H0 is usually rejected when either t ≤ -2.0 or t ≥ 2.0, one can see that keeping
a variable in the model is equated to having a t-value with an absolute value greater than 2.
Likewise, if a variable has a corresponding t-value that is less than or equal to 2 in absolute value,
it should be eliminated from the model.
To simplify the programming for the stepwise procedures, software packages generally rely on
the fact that squaring a t-distributed statistic gives an F-distributed statistic. Hence, the discussion
above about the t-value and whether to keep or eliminate the corresponding variable can be expressed
as:

If the F-statistic (F = t²) is greater than 4.0, then the corresponding variable
should be included in the model. If the F-statistic is less than 4.0, then the
corresponding variable should not be included in the model.
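
For illustration, the F = t² relationship can be checked in R on the house-price model fitted earlier
(a sketch only, using the object fit defined above).

tvals <- coef(summary(fit))[, "t value"]   # t statistic for each coefficient
tvals^2                                    # squared t statistics

drop1(fit, test = "F")   # partial F-test for dropping each term; for a term
                         # with one degree of freedom this F equals t squared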

Given this background information, we now discuss the three (3) stepwise procedures.

Forward Selection
This procedure starts with no explanatory variables in the model, only a constant. It then calculates
an F-statistic for each candidate variable and focuses on the variable with the highest F-value. If
that highest F-value is greater than 4.0, the corresponding variable is entered into the model; if it is
less than 4.0, the process stops. Assuming the first variable is entered, an F-statistic is then
calculated for each of the variables not yet in the model, conditioned on the variable already
selected; the procedure again focuses on the variable with the highest F-value and asks whether it
exceeds 4.0. If it does, that variable is entered and the process repeats: F-statistics are recalculated
for the variables still outside the model, conditioned on all of the variables already included, and the
largest is compared with 4.0. This continues until either all of the variables have been entered into
the model or none of the remaining variables is significant.
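
A sketch of forward selection in R is shown below. Note that R's step() function selects variables by
AIC rather than the F > 4.0 rule described above, so its path may differ slightly; the variable names
are those of the hypothetical house-price data frame used earlier.

null_model <- lm(price ~ 1, data = house)            # constant only
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = ~ sqft + bath + total),
                    direction = "forward")
summary(forward_fit)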

Backward Elimination
This procedure starts with all of the explanatory variables in the model and successively drops one
variable at a time. With all of the explanatory variables in the model, the "full" regression is run
and an F-statistic is calculated for each explanatory variable. Attention then focuses on the variable
with the smallest F-value: if that F-value is less than 4.0, the variable is eliminated and a new,
smaller regression is estimated. The F-statistics from this smaller regression are examined in turn,
attention again focuses on the variable with the smallest F-value, and if it is less than 4.0 that
variable is eliminated and the model re-estimated. This process continues until either all of the
explanatory variables have been eliminated from the model or all of the remaining explanatory
variables are significant.
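
Backward elimination can be sketched in R the same way; step() again selects by AIC, while drop1()
with an F-test corresponds more directly to the smallest-F rule described above.

full_model <- lm(price ~ sqft + bath + total, data = house)

drop1(full_model, test = "F")    # inspect the smallest partial F by hand
backward_fit <- step(full_model, direction = "backward")
summary(backward_fit)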

Stepwise
This procedure is a hybrid of forward selection and backward elimination. It operates in the same
way as forward selection, except that at each stage the possibility of deleting a variable, as in
backward elimination, is considered. Hence, a variable that enters at one stage may be eliminated at
a later stage (for example, because of multicollinearity).
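
The hybrid procedure is obtained in R with direction = "both"; a minimal sketch using the objects
defined in the two sketches above.

both_fit <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ sqft + bath + total),
                 direction = "both")
summary(both_fit)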

Summary

Generally, all three stepwise procedures will arrive at the same model. Under extreme collinearity
among the explanatory variables, however, the final results may differ. Keep in mind that stepwise
procedures are good exploratory techniques that provide the model builder with some insight. One
should not be fooled into thinking that stepwise models are the best simply because "the computer
generated the models." Stepwise procedures fail to consider things such as outliers, residual patterns,
autocorrelation, and theoretical considerations.

RELATIONSHIPS BETWEEN SERIES

When building models one frequently desires to utilize variables that have significant linear
relationships. In this section we discuss correlation as it pertains to cross sectional data,
autocorrelation for a single time series (demonstrated in the previous chapter), and cross
correlation, which deals with correlations of two series. Hopefully, the reader will note the
relationship between correlation, autocorrelation, and cross correlation.

Correlation
As we mentioned previously, when we talk of statistical correlation we are discussing a value which
measures the linear relationship between two variables. The statistic

rxy = Σ (xi − x̄)(yi − ȳ) / (Sx Sy)

where Sy and Sx represent the sample standard deviation of Y and X respectively, measures the
strength of the linear relationship between the variables Y and X. Again we are not going to dwell
on the mathematics, but will be primarily concerned with the interpretation.

To interpret the correlation coefficient, it is important to note that the denominator is included so that
the values generated are not sensitive to the choice of measurement units (i.e. inches vs. feet, ounces
vs. pounds, cents vs. dollars, etc.). As a result, the possible values of the correlation coefficient range
from -1.0 to 1.0.

Since the denominator is always positive, one can interpret the sign of the correlation coefficient as
an indicator of how X and Y move together. For instance, a positive correlation coefficient indicates
that positive (negative) changes in X tend to accompany positive (negative) changes in Y (i.e. X and
Y move in the same direction). Likewise, a negative correlation coefficient indicates that positive
(negative) changes in X tend to accompany negative (positive) changes in Y (i.e. X and Y move in
opposite directions).

The absolute value of the correlation coefficient indicates how strong of a linear relationship two
variables have. The closer the absolute value is to 1.0 the stronger the linear relationship.

To summarize we consider the plots in Figure 1, where we show five different values for the
correlation coefficient. Note that (1) the sign indicates whether the variables move in the same
direction and (2) the absolute value indicates the strength of the linear relationship.
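
In R the correlation coefficient is computed with cor(); a small illustration with made-up vectors
x and y follows.

x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 4.1, 6.3, 8.2, 9.9)

cor(x, y)                       # Pearson correlation coefficient
cov(x, y) / (sd(x) * sd(y))     # the same value: covariance scaled by the
                                # two standard deviations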

Autocorrelation
As indicated by its name, the autocorrelation function will calculate the correlation coefficient for a
series and itself in previous time periods. Hence, when analyzing one series and determining how
(linear) information is carried over from one time period to another, we will rely on the
autocorrelation function.

The autocorrelation function is defined as:

r(k) = Σ (xt − x̄)(xt−k − x̄) / (Sxt Sxt−k)

where again Sxt and Sxt−k are the sample standard deviations of Xt and Xt−k which, if you think
about it, are the same value. Hence, when you substitute Xt and Xt−k into the correlation equation in
place of Y and X, you can see the similarity. The one difference is the time element and hence the
inclusion of k, the "lag" factor. So r(1) is the sample autocorrelation between a time series and itself
1 time period ago, r(2) is the sample autocorrelation between the series and itself 2 time periods ago,
r(3) the autocorrelation with itself 3 time periods ago, and so on.

To illustrate the value of the autocorrelation function, consider the series TSDATA.BUBBLY
(a StatGraphics sample data set), which represents the monthly champagne sales volume for a firm.
The plot of this series shows a strong seasonal component, as shown in Figure 2.

[Time sequence plot omitted: TSDATA.bubbly data values plotted against Time]

Figure 2. Time Sequence Plot for Bubbly Data

The autocorrelation function can be displayed numerically, as in Table 1 below:

Table 1. Estimated autocorrelations for TSDATA.bubbly


----------------------------------------------------------------
Lag Estimate Stnd.Error Lag Estimate Stnd.Error
----------------------------------------------------------------
1 .48933 .10911 2 .05787 .13269
3 -.15498 .13299 4 -.25001 .13512
5 -.03906 .14052 6 .03647 .14065
7 -.03773 .14076 8 -.24633 .14088
9 -.18132 .14592 10 -.00307 .14858
11 .37333 .14858 12 .80455 .15935
13 .40606 .20200 14 .02545 .21150
15 -.17323 .21153 16 -.24418 .21322
17 -.05609 .21652 18 .02920 .21669
19 -.03339 .21674 20 -.20632 .21680
21 -.14682 .21913 22 -.01295 .22029
23 .27869 .22030 24 .60181 .22446
----------------------------------------------------------------

The autocorrelation function can also be displayed graphically (where dotted lines --
symmetric about 0 -- represent the significance limits) as shown in Figure 3.

[Autocorrelation plot omitted: estimated autocorrelation coefficient plotted against lag (0 to 24),
with significance limits]

Figure 3. Estimated Autocorrelations

From the display, the autocorrelations at lags 1, 11, 12, 13, and 24 are all significant (α = 0.05).
Hence, one can conclude that there is a linear relationship between sales in the current time period
and sales 1, 11, 12, 13, and 24 time periods earlier. The significant values at lags 11, 12, 13, and 24
are connected with a yearly cycle (every 12 months).
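
A sketch of the same analysis in R, assuming the champagne sales have been read into a monthly
time series object bubbly (a hypothetical stand-in for TSDATA.BUBBLY, with a hypothetical file name).

bubbly <- ts(scan("bubbly.dat"), frequency = 12)   # hypothetical data file

acf(bubbly, lag.max = 24)                  # correlogram with significance limits,
                                           # as in Figure 3
acf(bubbly, lag.max = 24, plot = FALSE)    # the numerical estimates, as in Table 1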

Stationarity

The next topic we wish to discuss in this section is the cross correlation function, which will be used
to examine the relationship between two series displaced by k time periods. This will allow us to
begin identifying leading indicators. However in order to discuss the cross correlation function, we
first need to review what it means for a series to be stationary. This discussion is necessary because
the interpretation of the cross correlation function only makes useful sense if both series involved are
stationary.

Recall that a series is stationary if it has a constant mean and a constant variance. The most
common departures from stationarity (i.e. non-stationary series) are a changing mean and a changing
variance.

When a series is nonstationary because of a changing variance, one can treat this problem by taking
logs of the data [logs in this course will be natural logs (Ln), not common logs (base 10)]. When a
series is nonstationary due to a changing mean, one can take differences to treat that problem. If
seasonality exists, then in addition to taking differences of consecutive time periods, one may take
seasonal differences.
If a nonstationary series has a nonconstant mean and a nonconstant variance then differences and
logs may both be required to achieve a transformation to a stationary series. When taking both logs
and differences one must take the logs first (i.e. treat the nonconstant variance first and then attack
the nonconstant mean). Why?
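
These transformations are one line each in R; a sketch with a simulated nonstationary series x
(growing mean and growing variance), included purely for illustration.

set.seed(1)
x <- ts(100 * exp(cumsum(rnorm(120, mean = 0.02, sd = 0.05))), frequency = 12)

y <- diff(log(x))               # logs first (stabilise the variance), then
                                # differences (remove the changing mean)
y12 <- diff(log(x), lag = 12)   # a seasonal difference, if a 12-month cycle exists
plot(y)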

Cross Correlation

With the knowledge gained in the autocorrelation and stationarity sections, we are now prepared to
discuss the cross correlation function, which, as stated before, is designed to measure the linear
relationship between two series when they are displaced by k time periods. The cross correlation
function is:

rxy(k) = Σ (Yt − Ȳ)(Xt−k − X̄) / (SYt SXt−k)
To interpret what is being measured by the cross correlation function, one needs to combine what we
discussed about the correlation function and the autocorrelation function. Again note that, as in the
autocorrelation function, k takes on integer values, only now k can be both positive and negative.

For instance, let Y represent SALES and X represent ADVERTISING for a firm. If k = 1, then we
are measuring the correlation between SALES in time period t and ADVERTISING in time period
t-1, i.e. we are looking at the correlation between SALES in a time period and ADVERTISING in
the previous time period. If k = 2, we would be measuring the correlation between SALES in time
period t and ADVERTISING two time periods prior. What if k = 3, k = 4, ...? Note that when k is
zero we are considering the relationship between SALES and ADVERTISING in the same time period.

When k takes on negative values, the interpretations are the same as above, except that now we are
looking at cases where Y (SALES) is a leading indicator for X (ADVERTISING). This is the
"opposite" of what we were doing with the positive values of k. Note that the cross correlation
function is not symmetric about 0, i.e.

rxy(k) ≠ rxy(-k) for all x, y and k ≠ 0

An Example

To illustrate the cross correlation function, we consider the data TSDATA.units and
TSDATA.leadind. This data is sample data from Statgraphics and resides on the
network.

The joint plot of units and leadind is shown in Figure 4 below. Note how leadind
"leads" units and how both series are nonstationary. Given that at least one of the
series is nonstationary, the cross correlation function will be meaningless if it is
applied to the original data. Since both series can be transformed to stationary
series by simple differencing (verify this), we will apply the cross correlation
function to the differenced series.
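
A sketch of this analysis in R, assuming the two series have been read in as time series objects
units and leadind (hypothetical stand-ins for the Statgraphics samples TSDATA.units and
TSDATA.leadind).

d_units   <- diff(units)      # simple differences to achieve stationarity
d_leadind <- diff(leadind)

# ccf(x, y) estimates the correlation between x[t + k] and y[t], so with units
# as x and leadind as y, significant values at positive lags k = 2 and 3 would
# indicate, as described below, that leadind leads units by two to three periods.
ccf(d_units, d_leadind, lag.max = 17)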

[Time sequence plot omitted: units (left scale) and leadind (right scale) plotted against Time]

Figure 4. Time Sequence Plot of Lead and Lag Indicators

Looking at the CCF (cross correlation) plot displayed in Figure 5 below, we can see significant
cross correlation values at lags 2 and 3. Given that leadind was the input (Xt−k) and units the
output (Yt), we can conclude that leadind is a leading indicator of units by 2 to 3 time periods. So a
change in leadind will result in a change in units two and three time periods later; note that it takes
two time periods for a change in leadind to show up in units.

(Note: for a situation where it is of interest to determine whether advertising leads sales, then
advertising would be the input and sales would be the output.)

[Cross correlation plot omitted: estimated cross-correlation coefficient plotted against lag
(-13 to 17), with significance limits]

Figure 5. Estimated Cross-Correlations

Questions:

- Does units lead leadind?
- What do you think would be the relationship between sales and advertising for a firm?
- In the units/leadind example, what does the CCF value for k = 0 mean?

Mini-Case

Herr Andres Lüthi owns a bank in Bern, Switzerland. One of Herr Lüthi’s requirements of his
employees is that they continually solicit unnumbered accounts from foreign investors. Herr Lüthi
prefers to call such accounts “CDs” because they have time limits similar to certificates of deposit
used in the United States.

Being very computer literate, Herr Lüthi created a file, CD, to store his data. In this historical file, he
maintains data on the sales volume of CDs (volume) for his bank. All the data are maintained on a
monthly basis. Included in the data set are call (the number of cold calls Herr Lüthi’s employees
made each month during the period January 1990 through July 1995), rate (the average rate for a
CD), and mail (the number of mailings Herr Lüthi sent out to potential customers). Because of the
excellent services provided by the bank, it is the norm for customers to roll their CDs over into new
CDs when their original CDs expire.

It should be noted that several years ago Herr Lüthi took many of his employees on an extended ski
vacation. Records show that the ski vacation was in 1992, February through May. The few non-
skiers, who opted to take their holidays in Spain, continued soliciting CDs. They were, of course,
credited with any walk-in traffic and any roll over accounts.

You were recently offered a position at Herr Lüthi’s bank. As part of your responsibilities, you are to
construct a regression model that can be used to analyze the bank’s performance with regard to
selling CDs. When the Board of Directors met last week, they projected the following for next
month:

Number of Cold Calls         900
Average Rate for a CD        3.50
Number of Mailings           4,500

Prepare your analysis for Herr Lüthi.

