9.0 INTRODUCTION
In Chapter 8 we dealt with tests of hypotheses for parameters of one and two
populations. In this chapter we extend tests of hypotheses to more than two
populations. Specifically, we consider what are referred to as chi-square tests.
These apply when we test the equality of several proportions. We also look into
cases where the equality of several population means is being tested; in that case we
use what is called Analysis of Variance (ANOVA for short). We start with
tests for several proportions, then treat tests for several means. A
test for several variances is also given.
Let O1, O2, ..., Ok be the actual or observed frequencies and e1, e2, ..., ek be the
expected frequencies. The chi-square test statistic relevant to the above hypothesis is
given by

    X² = Σ_{i=1}^{k} (Oi − ei)²/ei
That is, for each category we square the difference between the observed and the expected
frequency, divide the result by the expected frequency, and sum over the k categories. This test
statistic is closely approximated by the chi-square distribution with k − 1 degrees of
freedom.
If X² = 0, there is perfect agreement between the observed and expected
frequencies; the larger the value of X², the greater the disagreement. Hence we reject Ho
in favour of Ha if the calculated X² is greater than the table value X²α with k − 1
degrees of freedom.
2
b) The sample size must be reasonably large in order that the differences between
the actual and expected observations be approximately normally distributed. A sample size of at
least 50 is recommended.
Town   Observed (O)   Expected (e)   O − e   (O − e)²/e
A      150            200            -50     12.50
B      180            200            -20      2.00
C      250            200             50     12.50
D      230            200             30      4.50
E      190            200            -10      0.50
Total  1000           1000             0     X² = 32.00
Table 9.1b: One-way chi-square test.
The degrees of freedom are k − 1 = 5 − 1 = 4, and from the table X²0.05 = 9.49. Our hypothesis is
formulated as follows:
Ho: P1 = P2 = P3 = P4 = P5
Ha: At least one of the equalities does not hold; that is, at least one proportion is
different.
Since X² = 32.00 > X²0.05 = 9.49, we reject Ho. That is, the sales potentials in the five
towns are not the same.
Example 2: The daily demand for loaves of bread at a bakery is given in the following
table. Test the hypothesis that the number of loaves of bread sold does not depend
on the day of the week. Use α = 0.01.
Day of the week Number of loaves sold
Monday 3100
Tuesday 3500
Wednesday 3300
Thursday 4800
Friday 4300
Saturday 5000
Total 24000
Table 9.2a. Table for example 2.
Solution: There are six observations, and hence our hypothesis is:
Ho: P1 = P2 = P3 = P4 = P5 = P6
Ha: At least one of the equalities does not hold.
The expected demand, e, is 24000/6 = 4000 for each day. Hence the table below:
Day    Observed frequency (O)   Expected frequency (e)   O − e   (O − e)²
Mon    3100                     4000                     -900     810000
Tue    3500                     4000                     -500     250000
Wed    3300                     4000                     -700     490000
Thur   4800                     4000                      800     640000
Fri    4300                     4000                      300      90000
Sat    5000                     4000                     1000    1000000
Total  24000                    24000                       0    3280000
Table 9.2b: Table for solution of example 2.
The calculated chi-square is X² = 3280000/4000 = 820. The table value, with α = 0.01 and
k − 1 = 5 degrees of freedom, is X²0.01 = 15.09. Since the calculated X² is greater than
the table value, Ho is rejected.
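The whole calculation can be checked with a few lines of Python (a sketch in plain Python; the critical value 15.09 is the tabulated upper 1% point of chi-square with 5 degrees of freedom, not computed by the script):

```python
# One-way chi-square test for the bread-demand data of Example 2.
observed = {"Mon": 3100, "Tue": 3500, "Wed": 3300,
            "Thu": 4800, "Fri": 4300, "Sat": 5000}

n = sum(observed.values())       # 24000 loaves in total
e = n / len(observed)            # expected count per day under Ho: 4000.0

chi_sq = sum((o - e) ** 2 / e for o in observed.values())
print(chi_sq)                    # 820.0

CRITICAL = 15.09                 # table value: alpha = 0.01, df = 6 - 1 = 5
print("reject Ho" if chi_sq > CRITICAL else "do not reject Ho")   # reject Ho
```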
Note that since the expected frequencies in all cells except those in the last row and the
last column can be determined from the row totals, the column totals, and the other cell
frequencies, the number of degrees of freedom for a contingency table having r
rows and c columns is rc − r − c + 1 = (r − 1)(c − 1).
Here the chi-square test statistic is calculated as

    X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (Oij − eij)²/eij
where Oij = observed frequency in the ith row and jth column, and
eij = expected frequency in the (i, j)th cell, calculated by the formula

    eij = (ni.)(n.j)/n
where ni. = total observed frequency in the ith row,
n.j = total observed frequency in the jth column, and
n = sample size.
Note that the number of rows and the number of columns need not be the same.
Solution: On the hypothesis that father's education and size of family are
statistically independent, the expected frequencies may be calculated as:

e11 = (83 × 45)/200 = 18.675;  e12 = (83 × 96)/200 = 39.84;  e13 = (83 × 59)/200 = 24.485
e21 = (78 × 45)/200 = 17.55;   e22 = (78 × 96)/200 = 37.44;  e23 = (78 × 59)/200 = 23.01
e31 = (39 × 45)/200 = 8.775;   e32 = (39 × 96)/200 = 18.72;  e33 = (39 × 59)/200 = 11.505
The X² statistic is calculated in the table below:

Classification           O     e        O − e    (O − e)²/e
Elementary and 0-3       14    18.675   -4.675   1.170
Elementary and 4-7       37    39.840   -2.840   0.202
Elementary and over 7    32    24.485    7.515   2.307
Secondary and 0-3        19    17.550    1.450   0.120
Secondary and 4-7        42    37.440    4.560   0.555
Secondary and over 7     17    23.010   -6.010   1.570
University and 0-3       12     8.775    3.225   1.185
University and 4-7       17    18.720   -1.720   0.158
University and over 7    10    11.505   -1.505   0.197
Total                    200   200.000   0       X² = 7.464
Table 9.4b: Two-way chi-square test
The degrees of freedom are (3 − 1)(3 − 1) = 4, and X²0.05 = 9.49. Since the calculated
X² = 7.46 is less than the table value, we do not reject Ho: there is no evidence that the
father's education level and the size of the family are related.
Example 4: Test the hypothesis that the number of defective and non-defective items
produced in the three centres are the same. Use α = 0.01.
Solution: On the hypothesis that the three centres produce the same number of
defective and non-defective items, the expected frequencies may be calculated as:

e11 = (34 × 200)/600 = 11.33;    e12 = (34 × 150)/600 = 8.50;     e13 = (34 × 250)/600 = 14.17
e21 = (566 × 200)/600 = 188.67;  e22 = (566 × 150)/600 = 141.50;  e23 = (566 × 250)/600 = 235.83
The X² statistic is calculated in Table 9.5:

Classification   Observed frequency (O)   Expected frequency (e)   O − e   (O − e)²/e
D and C1           6                       11.33                   -5.33    2.5074
D and C2           8                        8.50                   -0.50    0.0294
D and C3          20                       14.17                    5.83    2.3987
N and C1         194                      188.67                    5.33    0.1506
N and C2         142                      141.50                    0.50    0.0018
N and C3         230                      235.83                   -5.83    0.1441
Total            600                      600.00                    0      X² = 5.232
Table 9.5: Table for solution of example 4.
The number of degrees of freedom is (2 − 1)(3 − 1) = 2.
Now X²0.01 with 2 degrees of freedom is 9.21. Since the calculated X² is less than the
table value, the hypothesis that the three centres produce the same number of defective
and non-defective items cannot be rejected.
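The same arithmetic can be verified with a short script (a sketch; it carries the expected frequencies at full precision, so the result differs in the second decimal from the hand computation, which rounded each eij to two places):

```python
# Two-way (contingency-table) chi-square test for Example 4.
observed = [[6, 8, 20],          # defective items in centres C1, C2, C3
            [194, 142, 230]]     # non-defective items

row_tot = [sum(r) for r in observed]          # [34, 566]
col_tot = [sum(c) for c in zip(*observed)]    # [200, 150, 250]
n = sum(row_tot)                              # 600

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n       # e_ij = n_i. * n.j / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_sq, 2), df)                   # 5.24 2
```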
When we talk of goodness of fit, we are referring to a situation where we
compare an observed sample distribution with an expected probability distribution such
as the binomial, Poisson, normal, and so on. The chi-square statistic is then used to
judge how well the observed sample fits the assumed distribution.
Example 5: The following is a distribution of daily demand for a product in a store.
X 0 1 2 3 4 5 6 or more
F 35 65 62 43 21 10 8
With α = 0.05, test the hypothesis that the distribution of daily demand is Poisson.
Solution: The estimate of the mean (λ) of the daily demand is

    λ = Σxf/Σf = 500/244 = 2.05

Hence the Poisson distribution fitted to the data is

    f(x) = (2.05)^x e^(−2.05) / x!

The expected frequency for each value of x is e = n f(x), where n = Σf = 244.
The table of probabilities and the expected and observed frequencies is given below:

Daily demand x   Probability Pi   Expected frequency e   Observed frequency O
0                0.1287           31.4028                35
1                0.2638           64.3757                65
2                0.2704           65.9851                62
3                0.1848           45.0898                43
4                0.0947           23.1085                21
5                0.0388            9.4745                10
6 or more        0.0188            4.5874                 8
We can now compute the chi-square statistic from the table below
x          O     e         O − e     (O − e)²/e
0          35    31.4028    3.5972   0.4121
1          65    64.3757    0.6243   0.0061
2          62    65.9851   -3.9851   0.2408
3          43    45.0898   -2.0898   0.0969
4          21    23.1085   -2.1085   0.1924
5 or more  18    14.0619    3.9381   1.1029
Total      244   244        0        X² = 2.051
Since we estimated the mean of the Poisson distribution from the data, the degrees
of freedom are 6 − 1 − 1 = 4. If we use α = 0.05, then X²0.05 = 9.488. Since the
calculated X² is less than the table value, the hypothesis that daily demand follows a
Poisson distribution cannot be rejected.
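The fit can be reproduced in a few lines (a sketch: demand for x ≥ 5 is pooled into one cell, with its expected count obtained by complement, so the statistic differs slightly from the rounded hand computation):

```python
import math

# Chi-square goodness-of-fit of a Poisson distribution (Example 5).
freq = {0: 35, 1: 65, 2: 62, 3: 43, 4: 21, 5: 10, 6: 8}  # 6 stands for "6 or more"
n = sum(freq.values())                                    # 244
lam = sum(x * f for x, f in freq.items()) / n             # 500/244, about 2.05

def poisson(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

expected = [n * poisson(x, lam) for x in range(5)]
expected.append(n - sum(expected))                 # pooled cell for x >= 5
observed = [freq[x] for x in range(5)] + [freq[5] + freq[6]]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1        # one extra df lost for estimating lam
print(round(chi_sq, 2), df)       # well below the table value 9.488
```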
population, which has one mean, as the inherent variation in the experiment, and try
to estimate its variance. A similar estimate is made for the variability between the
sample means. If the variability within populations and the variability between means
are of the same order of magnitude, then there is no difference between the population
means. If the variation between means is significantly larger than the within-population
variation, then we conclude that there is a difference between the population means.
In terms of a test of hypothesis, we have the following:
Ho: μ1 = μ2 = ... = μk
Ha: At least one of the equalities is violated,
where k is the number of populations considered and μi (i = 1, 2, ..., k) is the mean of
population i.
Let Ti be the total of the ith sample, T the grand total, ni the size of the ith sample, and
n = n1 + ... + nk. Then

    SST = Σi Σj (Xij − X̄)²  = Σi Σj X²ij − T²/n

    SSB = Σi ni (X̄i − X̄)²  = Σi T²i/ni − T²/n

    SSW = Σi Σj (Xij − X̄i)² = Σi Σj X²ij − Σi T²i/ni
Now, if our null hypothesis is true, then all k sample means would be close to
each other and very close to the grand mean X̄. This would mean that MSB would
be small compared to MSW. But if Ho is not true, MSB would be large compared
to MSW. The ratio MSB/MSW, which is a ratio of variances, gives the test statistic
(an F statistic) needed to carry out our test.
The results of the test are summarized in the ANOVA (Analysis of Variance) table
below.
ANOVA TABLE

Source of variation   Sum of squares   Degrees of freedom   Mean squares        F-ratio
Between means         SSB              k − 1                MSB = SSB/(k − 1)   F = MSB/MSW
Within means          SSW              n − k                MSW = SSW/(n − k)
Table 5a
Type of Machine

Employee   M1    M2      M3      M4     Total   Mean
E1         40    36      45      30     151     37.75
E2         38    42      50      41     171     42.75
E3         36    30      48      35     149     37.25
E4         46    47      52      44     189     47.25
Total      160   155     195     150    660     41.25
Mean       40    38.75   48.75   37.5
Perform a one-way ANOVA on this problem; that is, test the hypothesis that the
population mean is the same for the four machines.
Solution: Ho: μ1 = μ2 = μ3 = μ4
Ha: At least one equality is not true.
Here n1 = n2 = n3 = n4 = 4. Also k = 4 and n = 16.
Hence Σ T²i/ni = 110150/4 = 27537.50 and Σ Σ X²ij = 27900. Also T²/n = 660²/16 = 27225.
SSB = 27537.50 − 27225 = 312.50
SSW = 27900 − 27537.50 = 362.50
SST = SSB + SSW = 675
MSB = 312.5/3 = 104.167
MSW = 362.5/12 = 30.208
Hence F = MSB/MSW = 3.448.
Source of variation   Sum of squares   Degrees of freedom   Mean squares   F-ratio
Between machines      312.5            3                    104.167        3.448
Within machines       362.5            12                    30.208
Total                 675              15
Table 5b: ANOVA table for the problem

Now F0.95; 3,12 = 3.49 and F0.99; 3,12 = 5.95.
Since F = 3.448 < 3.49, we cannot reject the null hypothesis.
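The whole analysis can be reproduced from first principles (a sketch; no statistics library is assumed):

```python
# One-way ANOVA for the machine data of Table 5b.
groups = {                        # production figures, one list per machine
    "M1": [40, 38, 36, 46],
    "M2": [36, 42, 30, 47],
    "M3": [45, 50, 48, 52],
    "M4": [30, 41, 35, 44],
}

n = sum(len(g) for g in groups.values())               # 16
T = sum(sum(g) for g in groups.values())               # grand total, 660
cf = T ** 2 / n                                        # T^2/n = 27225.0

ssb = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cf   # 312.5
sst = sum(x ** 2 for g in groups.values() for x in g) - cf      # 675.0
ssw = sst - ssb                                                 # 362.5

k = len(groups)
f_ratio = (ssb / (k - 1)) / (ssw / (n - k))            # MSB / MSW
print(round(f_ratio, 3))                               # 3.448
```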
MSR = SSR/(r − 1).
3. Sum of squares for Columns (SSC) and Column Mean Squares (MSC)
This is variability between column means
    SSC = r Σ_{j=1}^{c} (X̄.j − X̄)² = Σ_{j=1}^{c} T²j/r − T²/rc
MSC = SSC/(c-1)
4. Sum of squares for Error (SSE) and Error Mean Square
This is variability due to chance
SSE = SST-SSR-SSC
MSE = SSE/[(r − 1)(c − 1)]
If we wish to test the null hypothesis that the row effects are all equal, we
compute the ratio MSR/MSE and compare it with the value of F read from the table at the
given level of significance and degrees of freedom. Similarly, the F-ratio
MSC/MSE gives us the test statistic for comparing the column effects. The results
may be summarized in an ANOVA table as shown here.
Example 7: Using the data in the previous example (i.e. in the one-way analysis), test:
a) the hypothesis that the mean production is the same for the four machines;
b) the hypothesis that the four employees do not differ with respect to mean
productivity.
Solution: Here r = c = 4, so that rc = 4 × 4 = 16 and

    T²/rc = 27225

Hence SST = 27900 − 27225 = 675.

Now Σ T²C/r = 110150/4 = 27537.5 and Σ T²R/c = 109514/4 = 27378.5, so

SSC = 27537.5 − 27225 = 312.5
SSR = 27378.5 − 27225 = 153.5
SSE = SST − SSC − SSR = 675 − 312.5 − 153.5 = 209

Since c = r = 4, (r − 1)(c − 1) = 3 × 3 = 9. Hence

MSC = SSC/(c − 1) = 312.5/3 = 104.167
MSR = SSR/(r − 1) = 153.5/3 = 51.167
MSE = SSE/[(c − 1)(r − 1)] = 209/9 = 23.222
Then compute

    b = 2.3026 q / h

where

    q = (n − k) log10 S²p − Σ_{i=1}^{k} (ni − 1) log10 S²i

and

    h = 1 + [1/(3(k − 1))] [ Σ_{i=1}^{k} 1/(ni − 1) − 1/(n − k) ]

(S²p is the pooled sample variance and S²i the variance of the ith sample.)
Here b is a value of a random variable having approximately the chi-square
distribution with k − 1 degrees of freedom. The quantity q is large when the sample
variances differ greatly and is small when all the sample variances are nearly equal.
Therefore reject Ho at the α level of significance when b > X²α.
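As a sketch, Bartlett's statistic can be coded directly from the formulas above (base-10 logarithms, hence the factor 2.3026 ≈ ln 10; the sample lists in the last line are hypothetical illustration data, chosen so that all sample variances are equal and b comes out as 0):

```python
import math

def bartlett_b(samples):
    """Bartlett's test statistic for equality of several variances."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    n = sum(sizes)

    def var(s):                   # sample variance with divisor n_i - 1
        m = sum(s) / len(s)
        return sum((x - m) ** 2 for x in s) / (len(s) - 1)

    v = [var(s) for s in samples]
    sp2 = sum((ni - 1) * vi for ni, vi in zip(sizes, v)) / (n - k)  # pooled

    q = (n - k) * math.log10(sp2) - sum(
        (ni - 1) * math.log10(vi) for ni, vi in zip(sizes, v))
    h = 1 + (sum(1 / (ni - 1) for ni in sizes) - 1 / (n - k)) / (3 * (k - 1))
    return 2.3026 * q / h         # compare with chi-square, k - 1 df

print(bartlett_b([[1, 2, 3, 4], [2, 3, 4, 5], [0, 1, 2, 3]]))
```

Equal sample variances give q = 0, so the printed value is 0 up to rounding error; Ho is rejected only when b exceeds the tabulated chi-square point.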
Sample
A B C
4 5 8
7 1 6
6 3 8
6 5 9
3 5
4
Solution: Ho: σ²1 = σ²2 = σ²3
Several tests are available to determine which pairs of population means
are not equal. What is considered in this section is Duncan's multiple-range
test. The procedure for this test is as follows.
First, let us assume that the ANOVA procedure has led to rejection of Ho, which
means that not all population means are equal. We also assume that the k random
samples are all of equal size n. The range of any subset of p sample means must
exceed a certain value before we consider any of those population means to be
different. This value is called the least significant range for the p means and is
denoted by Rp, where

    Rp = rp √(S²/n)
Here rp is the least significant studentized range obtained from the table; it depends on
the desired level of significance and the degrees of freedom. S² is the sample variance,
which is the estimate of the common variance σ² and is obtained from the error mean
square in the analysis of variance.
Example 9: Let us illustrate the above test with the following data.

Sample
       A     B     C     D     E
       5     9     3     2     7
       4     7     5     3     6
       8     8     2     4     9
       6     6     3     1     4
       3     9     7     4     7
Mean   5.2   7.8   4.0   2.8   6.6
If we arrange the sample means in increasing order of magnitude, we have the
following:

X̄D    X̄C    X̄A    X̄E    X̄B
2.8   4.0   5.2   6.6   7.8

Also, if you perform a one-way ANOVA on the data, you will find that the error
mean square is 2.880. Let us put α = 0.05. The values of rp from the table, with
20 degrees of freedom, for p = 2, 3, 4 and 5 are given below.
Having obtained rp, and with knowledge of S², we can compute Rp; hence the
following results, shown in the table below:
P 2 3 4 5
rp 2.950 3.097 3.190 3.255
Rp 2.24 2.35 2.42 2.47
Comparing these least significant ranges with the differences in the ordered means, we
arrive at the following conclusions:
a) X̄B − X̄E = 1.2 < R2 = 2.24; X̄B and X̄E are not significantly different.
b) X̄B − X̄A = 2.6 > R3 = 2.35; X̄B is significantly larger than X̄A, hence μB > μA.
Since also X̄B − X̄C = 3.8 > R4 and X̄B − X̄D = 5.0 > R5, we have μB > μC and μB > μD.
c) X̄E − X̄A = 1.4 < R2 = 2.24; X̄E and X̄A are not significantly different.
d) X̄E − X̄C = 2.6 > R3 = 2.35; X̄E is significantly larger than X̄C, hence μE > μC.
e) X̄A − X̄C = 1.2 < R2 = 2.24; X̄A and X̄C are not significantly different.
f) X̄A − X̄D = 2.4 > R3 = 2.35; X̄A is significantly larger than X̄D, hence μA > μD.
g) X̄C − X̄D = 1.2 < R2 = 2.24; we conclude that X̄C and X̄D are not
significantly different.
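The comparisons above can be generated mechanically (a sketch; the rp values are the tabulated ones quoted in the text):

```python
import math

# Duncan's multiple-range comparisons for Example 9.
means = {"A": 5.2, "B": 7.8, "C": 4.0, "D": 2.8, "E": 6.6}
s2, n = 2.880, 5                     # error mean square and common sample size
rp = {2: 2.950, 3: 3.097, 4: 3.190, 5: 3.255}

R = {p: r * math.sqrt(s2 / n) for p, r in rp.items()}   # least significant ranges

order = sorted(means, key=means.get)                    # ['D', 'C', 'A', 'E', 'B']
for i in range(len(order)):
    for j in range(i + 1, len(order)):
        lo, hi = order[i], order[j]
        p = j - i + 1                                   # number of means spanned
        diff = means[hi] - means[lo]
        verdict = "differ" if diff > R[p] else "do not differ"
        print(f"{hi} vs {lo}: {diff:.1f} vs R{p} = {R[p]:.2f} -> {verdict}")
```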
9.5 TWO-WAY ANOVA WITH INTERACTIONS
In the two-way ANOVA treated earlier, we assumed that the row and the column effects
were additive. In many cases this assumption does not hold; when we have several
observations per cell, the presence of interaction can be tested. Presented below is the
general ANOVA table that takes care of such a situation.
Source of variation   Sum of squares   Degrees of freedom   Mean square                     F-ratio
Row means             SSR              r − 1                S²1 = SSR/(r − 1)               S²1/S²4
Column means          SSC              c − 1                S²2 = SSC/(c − 1)               S²2/S²4
Interaction           SS(RC)           (r − 1)(c − 1)       S²3 = SS(RC)/[(r − 1)(c − 1)]   S²3/S²4
Error                 SSE              rc(n − 1)            S²4 = SSE/[rc(n − 1)]
The sums of squares are usually obtained by means of the following computational
formulae, where n is the number of observations per cell, Ti.. the total of the ith row,
T.j. the total of the jth column, Tij. the total of cell (i, j), and T the grand total:

    SST = Σi Σj Σk X²ijk − T²/rcn

    SSR = Σi T²i../cn − T²/rcn

    SSC = Σj T².j./rn − T²/rcn

    SS(RC) = Σi Σj T²ij./n − Σi T²i../cn − Σj T².j./rn + T²/rcn

    SSE = SST − SSR − SSC − SS(RC)
QUESTIONS
1. Use the following data to test whether the number of defective items
produced by two machines is independent of the machine on which they were
made.
________________________________________________________
Machine output
Defective Non Defective Total
________________________________________________________
Machine A 25 375 400
Machine B 42 558 600
________________________________________________________
Total 67 933 1000
________________________________________________________
2. Given the following data, use the X 2 test to determine whether the number of
accidents in a group of factories is independent of the age of the worker
_______________________________________________________
Number of                      Age
accidents      Under 21     21 – 44     45 – 65
________________________________________________________
0 120 360 220
1 40 28 2
2 13 5 2
3 or more 7 2 1
________________________________________________________
3. A box contains red, green and white balls, which are identical in every respect
but colour. 60 balls are drawn from the box, each ball being replaced immediately
after it is drawn and its colour noted. The numbers of the different colours obtained are:
Red 15, Green 26, White 19. Test the hypothesis that the number of balls of each
colour in the box is the same.
4. Test the hypothesis that the observed distribution is drawn from a population
with a Poisson distribution
_________________________________________________________
Number of defects per metre of cable 0 1 2 3 4 5 6
_________________________________________________________
Number of metre lengths 14 25 30 17 5 5 3
_________________________________________________________
produced 98, 104, 113, 97 and 103 parts in five days. Assuming normal
populations with equal variances, determine at the 0.05 level of significance
whether the three operators are producing at the same average daily rate.
7. A company wishes to study the differences in the selling abilities of its four
salesmen A, B, C and D, as well as the differences among its three sales districts S1, S2
and S3, all of the same size. The weekly sales in Naira for the four salesmen are
given below:
given below:
______________________________________________
Salesmen
______________________________________________
District A B C D
______________________________________________
S1 550 450 700 500
S2 300 350 550 400
S3 350 550 400 600
______________________________________________
Test the null hypothesis that there are no differences between sales districts and
that there are no differences between the selling abilities of the salesmen.
CHAPTER TEN
REGRESSION AND CORRELATION
10.1 INTRODUCTION
Businessmen, economists, scientists and sociologists have always been concerned
with problems of prediction. Marketing executives, for example, are constantly
analyzing sales data in the hope of predicting or forecasting future sales with a high
degree of accuracy. Measurements from sales data are used in turn by the production
division of the company, enabling the firm concerned to plan its output. One
may use the JAMB scores of students entering a university to predict their success
later in the university. These types of prediction problems are referred to by
statisticians as regression problems.
Regression problems come in many forms. The one we treat now deals
with the prediction of a dependent variable (say, Y) on the basis of a related
independent variable (say, X). This is the case of simple regression analysis. In
cases where more than one independent variable is involved, you speak of
multiple regression.
If two variables X and Y are supposed to be related (linearly), you may examine the
relationship as follows. Let X1, X2, ..., Xn be the observed values of X and Y1, Y2,
..., Yn be the corresponding values of Y. For instance, let X be the price of a product
and Y its corresponding sales for the past eight years, as shown in the table below:
Year Price (N): X Sales (in thousands): Y
1970 79 50
1971 74 60
1972 70 65
1973 80 54
1974 83 50
1975 86 48
1976 88 47
1977 92 45
Table 10:1 Regression Problem
The plot of the set of these pairs (X1, Y1), (X2, Y2),…..(Xn, Yn) of values of
X and Y is called a scatter diagram. In our example, n is 8. The scatter diagram is
shown below:
The difference between the actual Y – values and the corresponding computed
values of Y predicted from the line is called an Error or a Residual or a Deviation.
10.2 Method of Least Squares
Now that we have decided to use a linear prediction equation, we must consider the
problem of deriving computational formulas for determining the point estimates a
and b from the available sample points. The procedure used here is called the
method of least squares. This is the method of fitting the 'best line' to a given
set of n pairs of points, in such a way that the sum of squares of the errors between
the actual y-values and the predicted y-values is minimized. That is, if we let e1
represent the distance obtained by subtracting the predicted y-value from the
observed y-value for the first point, e2 a similar distance corresponding to the
second point, and so forth, the method of least squares yields formulas for
calculating a and b so that

    Σe² = e²1 + e²2 + ... + e²n is a minimum.
By means of differential calculus, the values of a and b are obtained from these
two simultaneous equations, called the normal equations:

    an + b Σx = Σy
    a Σx + b Σx² = Σxy
Estimation of Parameters
The least-squares estimates of a and b for the prediction equation y = a + bx are
obtained from the formulas

    a = ȳ − b x̄

and

    b = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²]   or   b = [Σxy − n x̄ ȳ] / [Σx² − n x̄²]
Note that you have to compute b before a. To obtain the estimates of the
parameters a and b for our example, build the following table:
x y xy x2 y2
79 50 3950 6241 2500
74 60 4440 5476 3600
70 65 4550 4900 4225
80 54 4320 6400 2916
83 50 4150 6889 2500
86 48 4128 7396 2304
88 47 4136 7744 2209
92 45 4140 8464 2025
Total = 652 419 33814 53510 22279
Table 10.2. Estimation of parameters
Hence x̄ = 652/8 = 81.50 and ȳ = 419/8 = 52.375.
Therefore b = (Σxy − n x̄ ȳ)/(Σx² − n x̄²) becomes

    b = (33814 − 8(81.50)(52.375)) / (53510 − 8(81.50)²)
      = (33814 − 34148.5) / (53510 − 53138)
      = −334.5/372 = −0.899
and a = ȳ − b x̄ = 52.375 − (−0.899)(81.50) = 52.375 + 73.2685 = 125.644.
The linear prediction equation, or the regression equation of y on x, is y = 125.644 −
0.899x. By substituting any value of x into this equation, you obtain the predicted
value of y. For instance, when x = 90, y = 125.644 − 0.899(90) = 125.644 − 80.91 =
44.734.
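The fitted line can be checked in a few lines of Python (a sketch; the small difference in a comes from carrying b at full precision rather than rounding it to −0.899 first):

```python
# Least-squares fit for the price/sales data of Table 10.2.
x = [79, 74, 70, 80, 83, 86, 88, 92]     # price
y = [50, 60, 65, 54, 50, 48, 47, 45]     # sales (thousands)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n                            # 81.5, 52.375
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar       # -334.5
sxx = sum(a * a for a in x) - n * xbar ** 2                    # 372.0

b = sxy / sxx                         # -0.899
a = ybar - b * xbar                   # 125.66 (125.644 if b is rounded first)
print(round(b, 3), round(a, 2))       # -0.899 125.66

print(round(a + b * 90, 2))           # prediction at x = 90 -> 44.73
```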
This variance is known as the sum of squares for errors, SSE; that is, the sum of
squares of the deviations of the actual y-values from their corresponding predicted
values on the regression line.
Note that we have to divide the expression by n − 2 because, since the
constants a and b of the regression equation have been estimated from the data, the
number of degrees of freedom associated with SSE is reduced by 2.
There are various ways SSE, or S², can be calculated. Most of these
methods are tedious. We present below a shorter way, using this formula:
    S² = SSE/(n − 2) = [1/(n − 2)] [ (Σy² − n ȳ²) − b (Σxy − n x̄ ȳ) ]
The data needed to calculate S² are found in our last table; for our example,

    S² = (1/6) [ (22279 − 8(52.375)²) − (−0.899)(−334.5) ]
       = (1/6) (22279 − 21945.125 − 300.7155)
       = 33.1595/6 = 5.527
Variances of a and b
Let S²a and S²b be the variances of a and b respectively. It can be shown that

    S²a = S² [ 1/n + x̄²/(Σx² − n x̄²) ]

and

    S²b = S² / (Σx² − n x̄²)

Hence

    S²a = 5.527 (1/8 + (81.50)²/372) = 5.527(17.9805) = 99.378,  so Sa = √99.378 = 9.969

and

    S²b = 5.527/372 = 0.0149,  so Sb = √0.0149 = 0.122
10.1.4 Confidence Intervals for a and b
A (1 − α)100% confidence interval for the parameter a lies between a − tα/2 Sa
and a + tα/2 Sa, and a (1 − α)100% confidence interval for the parameter b lies between
b − tα/2 Sb and b + tα/2 Sb.
For instance, for the 95% and 99% levels the confidence intervals are as follows.
With 6 degrees of freedom, t0.025 = 2.4469 and t0.005 = 3.7074, so the confidence
intervals for a at these two levels are 125.644 − 2.4469(9.969) < a < 125.644 +
2.4469(9.969), or 101.251 < a < 150.037, and 88.684 < a < 162.604 respectively.
The confidence intervals for b at these two levels are respectively
−1.198 < b < −0.600
and −1.351 < b < −0.447.
You will recall that as soon as we computed the values of a and b we
predicted the value of y for x = 90: y = 44.734. Since this value
is an estimate, one would like to compute the standard error associated with this
prediction. The variance of the predicted value is given by

    Var(ŷ) = S² [ 1/n + (xo − x̄)²/(Σx² − n x̄²) ]
The total sum of squares

    SST = Σ (y − ȳ)²

is broken into the sum of squares for regression, SSR, and the sum of squares for
errors, SSE. That is,

    SST = SSR + SSE

It can be shown that

    SSR = b² (Σx² − n x̄²)

and from our example you have

    SSR = (−0.899)² (372) = 300.651
This partitioning of the sum of squares may be summarized in a table called the
analysis of variance (ANOVA) table, as shown below:

Source of variation               Sum of squares   Degrees of freedom   Mean squares        F-ratio
Due to regression (SSR)           Σ(ŷ − ȳ)²        1                    MSR = SSR/1         MSR/MSE
Deviation from regression (SSE)   Σ(y − ŷ)²        n − 2                MSE = SSE/(n − 2)

To test whether the slope b exists, that is, that it is different from zero, you calculate
the test statistic b/Sb = −0.899/0.122 = −7.369, which is 'significant' at the 90%, 95%
and 99% levels. This means the parameter b does exist.
Similarly, if you want to test whether the intercept a exists, that is, that it is
different from zero, you calculate the test statistic

    a/Sa = 125.644/9.969 = 12.603

which is also significant at the 90%, 95% and 99% levels. The parameter a actually
exists. Note also that the F-value in the ANOVA table is significant at the 95% and
99% levels.
10.2 CORRELATION
Finally, one would like to measure the degree of relationship between the two variables
x and y. This is done with the coefficient of correlation, r, which is a measure of the
linear correlation between the two variables. r is given by the formula

    r = (Σxy − n x̄ ȳ) / √[ (Σx² − n x̄²)(Σy² − n ȳ²) ]

The value of r lies between −1 and +1 inclusive. If the value of r is equal to plus one
or minus one, there is a perfect linear correlation in the same or opposite direction
respectively. If the value of r is zero, there is no linear relationship between the two
variables.
In our example, the value of r is

    r = −334.5 / √(372 × 333.875) = −334.5/352.422 = −0.949

which shows a high degree of (negative) correlation.
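The value of r (and of r², discussed next) can be verified with a short sketch:

```python
import math

# Correlation coefficient for the price/sales data.
x = [79, 74, 70, 80, 83, 86, 88, 92]
y = [50, 60, 65, 54, 50, 48, 47, 45]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar   # -334.5
sxx = sum(a * a for a in x) - n * xbar ** 2                # 372.0
syy = sum(b * b for b in y) - n * ybar ** 2                # 333.875

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3), round(r * r, 2))     # -0.949 0.9
```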
of the variance in sales is determined by the linear association between volume of
sales and price of the commodity, for the data given. Notice that if r² = 1, the
sum of squares of errors is zero; r² lies between zero and plus one inclusive.
Solving the above equations for b and a yields:

    b = (ȳa − ȳb) / (x̄a − x̄b)   and   a = ȳa − b x̄a

In our example, ȳa = 57.25, ȳb = 47.5, x̄a = 75.75 and x̄b = 87.25. Hence

    b = (229/4 − 190/4) / (303/4 − 349/4) = (229 − 190)/(303 − 349) = 39/(−46) = −0.848

and

    a = ȳa − b x̄a = 57.25 − (−0.848)(75.75) = 121.488
Define the quantities

    m11 = n Σx²1 − (Σx1)²
    m12 = n Σx1x2 − (Σx1)(Σx2)
    m22 = n Σx²2 − (Σx2)²
    my1 = n Σx1y − (Σx1)(Σy)
    my2 = n Σx2y − (Σx2)(Σy)
    my  = n Σy² − (Σy)²

Then it can be shown (from the normal equations, not shown here) that

    a = ȳ − b1 x̄1 − b2 x̄2
    b1 = (my1 m22 − my2 m12) / (m11 m22 − m²12)
    b2 = (my2 m11 − my1 m12) / (m11 m22 − m²12)

and

    S²b1 = S² m22 / (m11 m22 − m²12)
    S²b2 = S² m11 / (m11 m22 − m²12)
One thing a wise student should note is that the factor m11 m22 − m²12 appears in
the denominators of many of the above formulas. It should be computed once and
stored, rather than recalculated every time it is run into.
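A quick check of the formulas (a sketch on hypothetical data constructed so that y = 2 + 3x1 + x2 exactly, so the computation should recover a = 2, b1 = 3 and b2 = 1):

```python
# Multiple-regression coefficients via the m-quantities defined above.
x1 = [1, 2, 3, 4]
x2 = [1, 0, 1, 0]
y = [6, 8, 12, 14]          # constructed as 2 + 3*x1 + x2
n = len(y)

m11 = n * sum(a * a for a in x1) - sum(x1) ** 2
m22 = n * sum(a * a for a in x2) - sum(x2) ** 2
m12 = n * sum(a * b for a, b in zip(x1, x2)) - sum(x1) * sum(x2)
my1 = n * sum(a * b for a, b in zip(x1, y)) - sum(x1) * sum(y)
my2 = n * sum(a * b for a, b in zip(x2, y)) - sum(x2) * sum(y)

denom = m11 * m22 - m12 ** 2        # the shared denominator: compute it once
b1 = (my1 * m22 - my2 * m12) / denom
b2 = (my2 * m11 - my1 * m12) / denom
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
print(b1, b2, a)                    # 3.0 1.0 2.0
```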
10.4.3. CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
With the parameters and their variances, and hence their standard deviations,
calculated, the confidence intervals are obtained as in the simple regression case.
The existence of these parameters is examined by hypothesis tests, with test statistics
determined as in the case of simple regression. All you need for the hypothesis testing
here are the values of the parameters and their corresponding standard deviations.
Also SST = my, so that

    SSE = SST − SSR = my − b²1 m11 − b²2 m22 − 2 b1 b2 m12

and

    F-ratio = [SSR/(k − 1)] / [SSE/(n − k)] = (SSR/2) / (SSE/(n − 3))

where k is the number of parameters. A significant F-value means both b1 and b2 do
exist.
Coefficient of Determination
The coefficient of determination, R², is defined by

    R² = SSR/SST = 1 − SSE/SST

This is a measure used to describe how well the sample regression line fits the
observed data. R² lies between zero and plus one inclusive. However, some people
prefer to measure the goodness of fit in the multiple regression case by a formula
known as the corrected coefficient of determination.
This is R̄², given by

    R̄² = R² − [(k − 1)/(n − k)] (1 − R²)

This measure takes into account the number of explanatory variables in relation to
the number of observations.
Like the product-moment correlation coefficient, the Spearman coefficient of rank
correlation takes values from −1 to +1 inclusive, and it has the same interpretation as
in the former case.
Example:
Item:      A    B     C     D    E    F
x (rank):  4    5     6     3    1    2
y (rank):  2    4.5   4.5   6    1    3
d:         2    0.5   1.5   -3   0    -1    (Σd = 0)
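A minimal sketch using the standard Spearman formula ρ = 1 − 6Σd²/(n(n² − 1)) (assumed here from the usual definition, since the formula itself does not appear in this excerpt):

```python
# Spearman rank correlation for the six items above.
x_rank = [4, 5, 6, 3, 1, 2]
y_rank = [2, 4.5, 4.5, 6, 1, 3]
n = len(x_rank)

d = [a - b for a, b in zip(x_rank, y_rank)]          # 2, 0.5, 1.5, -3, 0, -1
rho = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
print(round(rho, 3))                                 # 0.529
```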
Statisticians refer to this constant variance as homoscedasticity, and to non-constant
variance of the residuals as heteroscedasticity.
3. The residuals should be normally distributed.
4. The residuals should have an expected value of zero for any given observation.
10.7 Questions
1) Suppose that in the example given for simple regression it is now observed that the
volume of sales depends not only on price x (now x1) but also on the level of
advertising expenditure (x2). Perform a multiple regression on the data shown
below. Compute R², S² and the ANOVA table.
Sales (in thousands) Price Advertising in Naira
50 79 4.5
60 74 5
65 70 6
54 80 4.8
50 83 4.2
48 86 4
47 88 4.1
45 92 7.5
2) Compute the multiple regression analysis on the data given below. Test whether
the parameters are significant.
Customers (y) Student Enrollment x1 Student’s contributions
to fund (x2)
10 700 50
15 750 65
20 760 80
15 800 100
20 842 105
16 910 85
18 965 90
22 1010 100
24 1070 110
30 1100 100
3) Fit a linear relationship between the variables on the data given below.
Size of party   Entertainment cost
1 5.00
2 10.00
3 15.35
4 20.50
5 25.95
6 32.20
7 38.50
8 46.00
9 53.80
10 62.00
4) Fit the data in exercise (3) in the form y = ab^x.
5) The following data for ten towns show (i) the proportion of households owning
a car and (ii) an index of social class.
Town % with cars Social class index
A 51 106
B 48 104
C 43 100
D 36 96
E 30 90
F 24 86
G 22 84
H 16 74
I 12 64
J 10 66
Calculate the coefficient of rank correlation
The following values for two variables x and y were obtained:
x: 0.125 0.250 0.500 1.000 2.000
y: 3.9 2.1 0.95 0.6 0.3
_____________________________