
CHAPTER V

5. ANALYSIS OF VARIANCE
5.1. INTRODUCTION
Analysis of variance (ANOVA) is a procedure for testing the hypothesis that several
populations have the same mean; i.e., it is used to test the equality of several means. It is
one of the most powerful techniques in statistical analysis and was developed by
R. A. Fisher; the test it leads to is called the F-test. There are two types of classification
involved in analysis of variance.
One-way analysis of variance refers to the situation in which only one factor variable is
considered. For example, in testing for differences in sales among three salesmen, we are
considering only one factor: the salesmen's selling ability.
In the second type of classification, the response variable may be affected by more than
one factor. For example, sales may be affected not only by the salesman's selling ability,
but also by the price charged or the extent of advertising in a given area. For the sake of
simplicity, our discussion will be limited to one-way analysis of variance.
In order to use ANOVA, we assume the following:
1. All the samples were randomly selected and are independent of one another.
2. The populations from which the samples were drawn are normally distributed. If,
however, the sample sizes are large enough, we do not need the assumption of
normality.
3. All the population variances are equal.
5.1.1. COMPARISON OF THE MEANS OF MORE THAN TWO POPULATIONS
ANOVA is based on a comparison of two different estimates of the variance, σ², of the
overall population.
1. The variance obtained by calculating the variation within the samples themselves
– Mean Square within (MSW).
2. The variance obtained by calculating the variation among sample means – Mean
Square between (MSB).
Since both are estimates of σ2, they should be approximately equal in value when the null
hypothesis is true. If the null hypothesis is not true, these two estimates will differ
considerably.
The three steps in ANOVA, then, are:
1. Determine one estimate of the population variance from the variation among
sample means
2. Determine a second estimate of the population variance from the variation within
the samples
3. Compare these two estimates. If they are approximately equal in value, accept the
null hypothesis.
Calculating the Variance among the Sample Means – MSB
The variance among the sample means is called Between Column Variance or Mean
Square between (MSB).

Sample variance = S² = Σ(X − X̄)² / (n − 1)
Now, because we are working with sample means and the grand mean, let's substitute X̄
for X, the grand mean X̿ for X̄, and K (the number of samples) for n to get the formula for
the variance among the sample means:

Variance among sample means = SX̄² = Σ(X̄ − X̿)² / (K − 1)
In the sampling distribution of the mean we calculated the standard error of the mean as
σX̄ = σ/√n. Cross-multiplying the terms gives σ = σX̄ · √n, and squaring both sides gives
σ² = σX̄² · n.
In ANOVA, we do not have all the information needed to use the above equation to find
σ². Specifically, we do not know σX̄². We could, however, calculate the variance among
the sample means, SX̄², using SX̄² = Σ(X̄ − X̿)² / (K − 1). So, why not substitute SX̄² for
σX̄² and calculate an estimate of the population variance? This gives us:


σ̂² = SX̄² · n = n Σ(X̄ − X̿)² / (K − 1),   if n1, n2, …, nk are equal.

Which sample size to use?


There is a slight difficulty in using this equation as it stands: n represents the sample size,
but which sample size should we use when different samples have different sizes? We
solve this problem by weighting each (X̄j − X̿)² by its appropriate nj, and hence σ̂²
becomes:
MSB = σ̂² = Σ nj (X̄j − X̿)² / (K − 1)
Where:
σ̂² = the first estimate of the population variance based on the variation among the sample
means (the Between Column Variance – MSB)
nj = the size of the jth sample
X̄j = the sample mean of the jth sample
X̿ = the grand mean
K = the number of samples
K − 1 = the degrees of freedom associated with SSB
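
To make this concrete, here is a minimal Python sketch of the MSB calculation; the helper name msb and its list-based inputs are illustrative assumptions, not part of the text.

# Between-column variance (MSB) from the sample sizes and sample means.
def msb(sizes, means):
    k = len(sizes)                    # number of samples, K
    n_t = sum(sizes)                  # total sample size, nT
    # Grand mean: the mean of all observations, via the weighted sample means.
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_t
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    return ssb / (k - 1)              # SSB divided by its K - 1 degrees of freedom

With the data of Example 1 below, msb([6, 6, 6], [45.17, 48.00, 41.67]) returns about 60.3, in line with the hand computation there.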
Calculating the Variance within the Samples (MSW)
It is based on the variation of the sample observations within each sample. It is called the
within column variance or Mean Square within (MSW). We calculate the sample
variance for each sample as S² = Σ(X − X̄)² / (n − 1).
Since we have assumed that the variances of the populations from which samples have
been drawn are equal, we could use any one of the sample variances as the second
estimate of the population variance. Statistically, we can get a better estimate of the
population variance by using a weighted average of all sample variances. The general
formula for this second estimate of σ² is:
MSW = σ̂² = Σ (nj − 1) Sj² / (nT − k)
If n1, n2, …, nk are all equal to n, this reduces to:
MSW = σ̂² = Σ (n − 1) Sj² / [k(n − 1)] = (Σ Sj²) / k
Where:
σ̂² = the second estimate of the population variance based on the variation within the
samples (the Within Column Variance – MSW)
nj = the size of the jth sample
nj − 1 = the degrees of freedom in each sample
nT − k = the degrees of freedom associated with SSW
Sj² = the sample variance of the jth sample
K = the number of samples
nT = Σnj = the total sample size = n1 + n2 + … + nk
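
The within-column estimate can be sketched the same way, assuming the individual sample variances have already been computed (again, the helper name msw is an illustrative assumption):

# Within-column variance (MSW): a weighted average of the sample variances.
def msw(sizes, variances):
    k = len(sizes)                    # number of samples
    n_t = sum(sizes)                  # total sample size, nT
    ssw = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
    return ssw / (n_t - k)            # SSW divided by its nT - k degrees of freedom

For Example 1 below, msw([6, 6, 6], [30.17, 47.60, 31.07]) returns 36.28.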
The estimate of the population variance based on the variation that exists between the
sample means (MSB) is somewhat suspect because it is based on the notion that all the
populations have the same mean. That is, MSB is a good estimate of σ² only if Ho is true
and all the population means are equal: μ1 = μ2 = μ3 = … = μk.
If the unknown population means are not equal, and perhaps are radically different from
one another, then the sample means (X̄j) will most likely be radically different from each
other too. This difference will have a marked effect on MSB: the X̄j values will vary a
great deal, and the (X̄j − X̿)² terms will be large. Thus, if the population means are not all
equal, the MSB estimate will be large relative to the MSW estimate; that is, if MSB is
large relative to MSW, then the hypothesis that all the population means are equal is not
likely to be true. The important questions are, of course: how large is "large," and how do
we measure the relative sizes of the two variance estimates? The answer to these
questions is given by the F-distribution.
If k samples of nj (j = 1, 2, …, k) items each are taken from k normal populations that
have equal variances and for which the hypothesis Ho: μ1 = μ2 = …= μk is true, then the
ratio of the MSB to the MSW is an F-value that follows an F-probability distribution.
F = MSB / MSW
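
Building on the msb and msw sketches above, the sample F-ratio is simply their quotient:

# Sample F-ratio for one-way ANOVA: the ratio of the two variance estimates.
def one_way_f(sizes, means, variances):
    return msb(sizes, means) / msw(sizes, variances)

Under Ho the two estimates should be close, so the ratio should be near 1; a ratio much larger than 1 is evidence against Ho.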
5.2. THE F-DISTRIBUTION
Characteristics of F-distribution
1. It is a continuous probability distribution
2. It is uni-modal
3. It has two parameters: a pair of degrees of freedom, ν1 and ν2
ν1 = the number of degrees of freedom in the numerator of F-ratio; ν1 = k – 1
ν2 = the number of degrees of freedom in the denominator of F-ratio; ν2 = nT- k
4. It is a positively skewed distribution, and tends to get more symmetrical as the
degrees of freedom in the numerator and denominator increase.
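
The tabled critical values used in the examples below can be reproduced from the F-distribution directly; this is a sketch assuming the SciPy library is available.

# Upper-tail critical values of the F-distribution via SciPy.
from scipy.stats import f

print(round(f.ppf(0.95, 2, 15), 2))   # 3.68: alpha = 0.05, v1 = 2, v2 = 15 (Example 1)
print(round(f.ppf(0.99, 3, 23), 2))   # 4.76: alpha = 0.01, v1 = 3, v2 = 23 (Example 2)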
Example
1. The training director of a company is trying to evaluate three different methods of
training new employees. The first method assigns each to an experienced employee for
individual help in the factory. The second method puts all new employees in a training
room separate from the factory, and the third method uses training films and programmed
learning materials. The training director chooses 18 new employees assigned at random
to the three training methods and records their daily production after they complete the
programs. Below are productivity measures for individuals trained by each method.
Method 1 Method 2 Method 3
45 59 41
40 43 37
50 47 43
39 51 40
53 39 52
44 49 37
271 288 250
X̄1 = 45.17    X̄2 = 48.00    X̄3 = 41.67    X̿ = 44.94
S1² = 30.17    S2² = 47.60    S3² = 31.07

At the 0.05 level of significance, do the three training methods lead to different levels of
productivity?
Solution
1. Ho: μ1 = μ2 = μ3
Ha: μ1, μ2, and μ3 are not all equal
2. α = 0.05

ν1 = K − 1 = 3 − 1 = 2
ν2 = nT − k = 18 − 3 = 15
F(0.05, 2, 15) = 3.68
Reject Ho if sample F > 3.68
3. Sample F
MSB = Σ nj (X̄j − X̿)² / (K − 1)
    = 6[(45.17 − 44.94)² + (48.00 − 44.94)² + (41.67 − 44.94)²] / (3 − 1)
    = 120.66 / 2 = 60.33
MSW = Σ (nj − 1) Sj² / (nT − K)
    = 5(30.17 + 47.60 + 31.07) / (18 − 3) = 5(108.84) / 15 = 36.28
F = MSB / MSW = 60.33 / 36.28 = 1.663
4. Do not reject Ho.
There are no differences in the effects of the three training programs (methods) on
employee productivity.
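
As a cross-check, SciPy's built-in one-way ANOVA gives the same F-ratio from the raw data (a sketch, assuming scipy is installed):

# Example 1 revisited with SciPy's one-way ANOVA.
from scipy.stats import f_oneway

method1 = [45, 40, 50, 39, 53, 44]
method2 = [59, 43, 47, 51, 39, 49]
method3 = [41, 37, 43, 40, 52, 37]

f_stat, p_value = f_oneway(method1, method2, method3)
print(f_stat)    # about 1.66, matching the hand computation above
print(p_value)   # well above 0.05, so Ho is not rejected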
2. A department store chain is considering building a new store at one of the four different
sites. One of the important factors in the decision is the annual household income of the
residents of the four areas. Suppose that, in a preliminary study, various residents in each
area are asked what their annual household incomes are. The results are shown in the
accompanying table below. Is there sufficient evidence to conclude that differences exist
in the average annual household incomes among the four communities? Use α = 0.01
Area 1    Area 2    Area 3    Area 4
25        32        27        18
27        35        32        23
21        30        48        29
17        46        25        26
29        32        20        42
30        22        12
          19        18
          51
          27
149       294       182       138
X̄1 = 24.83    X̄2 = 32.67    X̄3 = 26.00    X̄4 = 27.60    X̿ = 28.26
S1² = 24.96    S2² = 107.50    S3² = 136.33    S4² = 81.30

Solution
1. Ho: μ1 = μ2 = μ3 = μ4
Ha: μ1, μ2, μ3, and μ4 are not all equal
2. α = 0.01

ν1 = K − 1 = 4 − 1 = 3
ν2 = nT − k = 27 − 4 = 23
F(0.01, 3, 23) = 4.76
Reject Ho if sample F > 4.76
3. Sample F
MSB = Σ nj (X̄j − X̿)² / (K − 1)
    = [6(24.83 − 28.26)² + 9(32.67 − 28.26)² + 7(26.00 − 28.26)² + 5(27.60 − 28.26)²] / (4 − 1)
    = 283.55 / 3 = 94.52
MSW = Σ (nj − 1) Sj² / (nT − K)
    = [5(24.96) + 8(107.50) + 6(136.33) + 4(81.30)] / (27 − 4)
    = 2127.98 / 23 = 92.52
F = MSB / MSW = 94.52 / 92.52 = 1.02
4. Do not reject Ho.
No difference exists in the average annual household incomes among the four
communities.
CHAPTER SIX
6. REGRESSION AND CORRELATION
6.1. INTRODUCTION
Regression analysis is almost certainly the most important tool at the econometrician's
disposal. But what is regression analysis? In very general terms, regression is concerned
with describing and evaluating the relationship between a given variable and one or more
other variables. More specifically, regression is an attempt to explain movements in a
variable by reference to movements in one or more other variables. Regression analysis
can be defined as the process of developing a mathematical model that can be used to
predict one variable by using another variable. Before we start developing the regression
model, we should first make sure that a relationship exists between the two variables. If
such a relationship does not exist between the two variables, there is no point in
developing a regression model.
The relationship between two variables can be tested graphically using a scatter diagram
or statistically using correlation analysis.
 A Scatter Diagram is a chart that portrays the relationship between the two variables.
It is a simple two-dimensional graph of the values of the dependent variable and the
independent variable.
 The Dependent Variable is the variable being predicted or estimated.
 The Independent Variable provides the basis for estimation; it is the predictor
variable.
Example
The sales manager of Copier Sales of America, which has a large sales force throughout
the United States and Canada, wants to determine whether there is a relationship between
the number of sales calls made in a month and the number of copiers sold that month.
The manager selects a random sample of 10 representatives and determines the number
of sales calls each representative made last month and the number of copiers sold.
a. CORRELATION ANALYSIS
 The Coefficient of Correlation (r) is a measure of the strength of the relationship
between two variables. It requires interval or ratio-scaled data.
 It can range from -1.00 to 1.00.
 Values of -1.00 or 1.00 indicate perfect correlation.
 Values close to 0.0 indicate weak correlation.
 Negative values indicate an inverse relationship and positive values indicate a
direct relationship
i. Karl Pearson’s Coefficient Of Correlation (Simple Correlation)
 It is the most widely used method of measuring the degree of relationship between
two variables. This coefficient assumes the following:
i. that there is a linear relationship between the two variables;
ii. that the two variables are causally related, which means that one of the variables
is independent and the other one is dependent; and
iii. that a large number of independent causes are operating in both variables so as to
produce a normal distribution.

Karl Pearson's coefficient of correlation (or r) = (n∑XY − ∑X∑Y) / √[(n∑X² − (∑X)²)(n∑Y² − (∑Y)²)]

Example: a study was conducted to find whether there is any relationship between
sales and advertising cost. The following data were collected from 10 companies
on their monthly sales and advertising costs.
Company    Sales (000)    Advertising cost (000)
A 25 8
B 35 12
C 29 11
D 24 5
E 38 14
F 12 3
G 18 6
H 27 8
I 17 4
J 30 9

SOLUTION
COMPANY    Y    X    Y²    X²    XY
A 25 8 625 64 200
B 35 12 1225 144 420
C 29 11 841 121 319
D 24 5 576 25 120
E 38 14 1444 196 532
F 12 3 144 9 36
G 18 6 324 36 108
H 27 8 729 64 216
I 17 4 289 16 68
J 30 9 900 81 270
∑Y = 255    ∑X = 80    ∑Y² = 7097    ∑X² = 756    ∑XY = 2289

r = (10(2289) − (80)(255)) / √[(10(756) − (80)²)(10(7097) − (255)²)]
  = 2490 / √(1160 × 5945) = 0.9482
This indicates that there is a strong positive correlation between advertising cost
and volume of sales.
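
The hand computation above can be verified with a few lines of Python (a sketch; the variable names are illustrative):

# Karl Pearson's r for the sales/advertising data above.
y = [25, 35, 29, 24, 38, 12, 18, 27, 17, 30]   # sales (000)
x = [8, 12, 11, 5, 14, 3, 6, 8, 4, 9]          # advertising cost (000)
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
print(round(r, 4))                             # 0.9482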
ii. Charles Spearman's Coefficient Of Correlation (Rank Correlation)
 It is the technique of determining the degree of correlation between two variables
with ordinal data, where ranks are given to the different values of the variables.
The main objective of this coefficient is to determine the extent to which the two
sets of rankings are similar or dissimilar.
This coefficient is determined as follows:
 Spearman's coefficient of correlation (or rs) = 1 − (6∑d²) / (n(n² − 1))
 Where: d = the difference between the ranks of each pair of observations; n = the number of pairs
Rank correlation is a non-parametric technique for measuring relationship between paired
observations of two variables when data are in the ranked form.
Note: the two variables must be ranked in the same order, giving rank 1 either to the
largest (smallest) value, rank 2 to the second largest (smallest) value and so on.
If there are ties, we assign to each of the tied observations the mean of the ranks which
they jointly occupy; thus, if the 3rd and 4th ordered values are identical, we assign each the
rank of (3 + 4)/2 = 3.5.

Example: two students are considering applying to the same six universities (A, B, C, D,
E, and F) to study zoology. Their order of preference is as follows
Student 1 B E A F D C

Student 2 F C A B D E

Calculate the rank correlation coefficient and interpret your result


Solution
We first assign ranks:
University A B C D E F
student 1 3 1 6 5 2 4
student 2 3 4 2 5 6 1
d     0   -3    4   0   -4    3
d²    0    9   16   0   16    9      ∑d² = 50

rs = 1 − (6 × 50) / (6(6² − 1)) = 1 − 300/210 = −0.429. This implies the students have a slight
disagreement over their university preferences.
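
A short Python sketch reproduces this rank correlation (illustrative variable names):

# Spearman's rank correlation for the two students' rankings of universities A-F.
rank1 = [3, 1, 6, 5, 2, 4]    # student 1
rank2 = [3, 4, 2, 5, 6, 1]    # student 2
n = len(rank1)
d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))   # sum of squared rank differences = 50
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rs, 3))           # -0.429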


b. REGRESSION ANALYSIS
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
 The relationship between the variables is linear.
 Both variables must be at least interval scale.
 The least squares criterion is used to determine the equation.
A regression equation is an equation that expresses the linear relationship between two
variables. We can predict the values of Y given the values of X by using the regression
equation Y = a + bX, where:

 b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²)        a = Ȳ − bX̄ = ∑Y/n − b(∑X/n)

Example: scores made by students in a statistics class on the midterm and final
examinations are given here. Develop the regression equation that may be used to
predict final examination scores from midterm scores.
Student Midterm Final
1 98 90
2 66 74
3 100 98
4 96 88
5 88 80
6 45 62
7 76 78
8 60 74
9 74 86
10 82 80
Solution
Student    X    Y    X²    XY
1 98 90 9604 8820
2 66 74 4356 4884
3 100 98 10000 9800
4 96 88 9216 8448
5 88 80 7744 7040
6 45 62 2025 2790
7 76 78 5776 5928
8 60 74 3600 4440
9 74 86 5476 6364
10 82 80 6724 6560
∑X = 785    ∑Y = 810    ∑X² = 64521    ∑XY = 65074

b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²) = (10(65074) − (785)(810)) / (10(64521) − (785)²)
  = 14890 / 28985 = 0.5137
a = ∑Y/n − b(∑X/n) = 81 − 0.5137(78.5) = 40.67
We can use this equation to find the projected or estimated final scores of the students. For
example, for a midterm score of 50 the projected final score is
y = 40.67 + 0.5137(50) = 66.36
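
The least-squares coefficients can be verified with a short Python sketch (illustrative variable names, using only the standard library):

# Least-squares regression of final scores (Y) on midterm scores (X).
x = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]   # midterm
y = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]    # final
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
a = sy / n - b * sx / n                         # intercept
print(round(b, 4), round(a, 2))                 # about 0.5137 and 40.67
print(round(a + b * 50, 2))                     # predicted final score for a midterm of 50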

You might also like