9.0 INTRODUCTION
In Chapter 8 we dealt with tests of hypotheses for parameters of one and two
populations. In this chapter we extend tests of hypotheses to more than two
populations. Specifically, we consider what are referred to as chi-square tests.
These apply when we test the equality of several proportions. We also look into
cases where the equality of several population means is being tested; in that case we
use what is called Analysis of Variance (ANOVA for short). We start with
tests for several proportions, then treat tests for several means. A
test for several variances is also given.
Let O1, O2, ..., Ok be the actual or observed frequencies and e1, e2, ..., ek be the
expected frequencies. The chi-square test statistic relevant to the above hypothesis is
given by

    X² = Σ_{i=1}^{k} (Oi − ei)²/ei
That is, for each category we square the difference between the observed and the expected
frequency, divide the result by the expected frequency, and sum over the k categories. This test
statistic is closely approximated by the chi-square distribution with k − 1 degrees of
freedom.
If X² = 0, there is perfect agreement between the observed and expected
frequencies; the larger the value of X², the greater the disagreement. Hence we reject Ho
in favour of Ha if the calculated X² is greater than the table value X²α with k − 1
degrees of freedom.
2
b) The sample size must be reasonably large in order that the differences between
the actual and expected observations be approximately normally distributed. A sample size of at
least 50 is recommended.
Town   Observed (O)   Expected (e)   O − e   (O − e)²/e
A      150            200            -50     12.50
B      180            200            -20      2.00
C      250            200             50     12.50
D      230            200             30      4.50
E      190            200            -10      0.50
Total  1000           1000             0     X² = 32.00
Table 9.1b: One-way chi-square test.
The degrees of freedom are k − 1 = 5 − 1 = 4, and from the table X²0.05 = 9.49. Our hypothesis is
formulated as follows:
Ho: P1 = P2 = P3 = P4 = P5
Ha: At least one of the equalities does not hold; that is, at least one proportion is
different.
Since X² = 32.00 > X²0.05 = 9.49, we reject Ho. That is, the sales potentials in the five
towns are not the same.
Example 2: The daily demand for loaves of bread at a bakery is given in the following
table. Test the hypothesis that the number of loaves of bread sold does not depend
on the day of the week. Use α = 0.01.
Day of the week Number of loaves sold
Monday 3100
Tuesday 3500
Wednesday 3300
Thursday 4800
Friday 4300
Saturday 5000
Total 24000
Table 9.2a. Table for example 2.
Solution: There are six observations, and hence our hypothesis is:
Ho: P1 = P2 = P3 = P4 = P5 = P6
Ha: At least one of the equalities does not hold.
The expected demand, e, is 24000/6 = 4000 for each day. Hence the table below:
Day    Observed frequency (O)   Expected frequency (e)   O − e   (O − e)²
Mon    3100                     4000                     -900     810000
Tue    3500                     4000                     -500     250000
Wed    3300                     4000                     -700     490000
Thur   4800                     4000                      800     640000
Fri    4300                     4000                      300      90000
Sat    5000                     4000                     1000    1000000
Total  24000                    24000                       0    3280000
Table 9.2b: Table for solution of example 2.
The calculated chi-square is X² = 3280000/4000 = 820. The table value, with α = 0.01 and
k − 1 = 5 degrees of freedom, is X²0.01 = 15.09. Since the calculated X² is greater than
the table value, Ho is rejected.
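The whole calculation can be checked with a few lines of Python (a sketch in plain Python; the critical value 15.09 is the tabulated upper 1% point of chi-square with 5 degrees of freedom, not computed by the script):

```python
# One-way chi-square test for the bread-demand data of Example 2.
observed = {"Mon": 3100, "Tue": 3500, "Wed": 3300,
            "Thu": 4800, "Fri": 4300, "Sat": 5000}

n = sum(observed.values())       # 24000 loaves in total
e = n / len(observed)            # expected count per day under Ho: 4000.0

chi_sq = sum((o - e) ** 2 / e for o in observed.values())
print(chi_sq)                    # 820.0

CRITICAL = 15.09                 # table value: alpha = 0.01, df = 6 - 1 = 5
print("reject Ho" if chi_sq > CRITICAL else "do not reject Ho")   # reject Ho
```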
Note that since the expected frequencies in all cells except those in the last row and the
last column can be determined from the row totals, the column totals, and the other cell
frequencies, the number of degrees of freedom for a contingency table having r
rows and c columns is rc − r − c + 1 = (r − 1)(c − 1).
Here the chi-square test statistic is calculated as

    X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (Oij − eij)²/eij
where Oij = observed frequency in the ith row and jth column, and
eij = expected frequency in the (i, j)th cell, calculated by the formula

    eij = (ni.)(n.j)/n
where ni. = total observed frequency in the ith row,
n.j = total observed frequency in the jth column, and
n = sample size.
Note that the number of rows and the number of columns need not be the same.
Solution: On the hypothesis that father's education and size of family are
statistically independent, the expected frequencies may be calculated as:

e11 = (83 × 45)/200 = 18.675;  e12 = (83 × 96)/200 = 39.84;  e13 = (83 × 59)/200 = 24.485
e21 = (78 × 45)/200 = 17.55;   e22 = (78 × 96)/200 = 37.44;  e23 = (78 × 59)/200 = 23.01
e31 = (39 × 45)/200 = 8.775;   e32 = (39 × 96)/200 = 18.72;  e33 = (39 × 59)/200 = 11.505
The X² statistic is calculated in the table below:

Classification           O     e        O − e    (O − e)²/e
Elementary and 0-3       14    18.675   -4.675   1.170
Elementary and 4-7       37    39.840   -2.840   0.202
Elementary and over 7    32    24.485    7.515   2.307
Secondary and 0-3        19    17.550    1.450   0.120
Secondary and 4-7        42    37.440    4.560   0.555
Secondary and over 7     17    23.010   -6.010   1.570
University and 0-3       12     8.775    3.225   1.185
University and 4-7       17    18.720   -1.720   0.158
University and over 7    10    11.505   -1.505   0.197
Total                    200   200.000   0       X² = 7.464
Table 9.4b: Two-way chi-square test
The degrees of freedom are (3 − 1)(3 − 1) = 4, and X²0.05 = 9.49. Since the calculated
X² = 7.46 is less than the table value, we do not reject Ho: there is no evidence that the
father's education level and the size of the family are related.
Example 4: Test the hypothesis that the number of defective and non-defective items
produced in the three centres are the same. Use α = 0.01.
Solution: On the hypothesis that the three centres produce the same number of
defective and non-defective items, the expected frequencies may be calculated as:

e11 = (34 × 200)/600 = 11.33;    e12 = (34 × 150)/600 = 8.50;     e13 = (34 × 250)/600 = 14.17
e21 = (566 × 200)/600 = 188.67;  e22 = (566 × 150)/600 = 141.50;  e23 = (566 × 250)/600 = 235.83
The X² statistic is calculated in Table 9.5:

Classification   Observed frequency (O)   Expected frequency (e)   O − e   (O − e)²/e
D and C1           6                       11.33                   -5.33    2.5074
D and C2           8                        8.50                   -0.50    0.0294
D and C3          20                       14.17                    5.83    2.3987
N and C1         194                      188.67                    5.33    0.1506
N and C2         142                      141.50                    0.50    0.0018
N and C3         230                      235.83                   -5.83    0.1441
Total            600                      600.00                    0      X² = 5.232
Table 9.5: Table for solution of example 4.
The number of degrees of freedom is (2 − 1)(3 − 1) = 2.
Now X²0.01 with 2 degrees of freedom is 9.21. Since the calculated X² is less than the
table value, the hypothesis that the three centres produce the same number of defective
and non-defective items cannot be rejected.
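The same arithmetic can be verified with a short script (a sketch; it carries the expected frequencies at full precision, so the result differs in the second decimal from the hand computation, which rounded each eij to two places):

```python
# Two-way (contingency-table) chi-square test for Example 4.
observed = [[6, 8, 20],          # defective items in centres C1, C2, C3
            [194, 142, 230]]     # non-defective items

row_tot = [sum(r) for r in observed]          # [34, 566]
col_tot = [sum(c) for c in zip(*observed)]    # [200, 150, 250]
n = sum(row_tot)                              # 600

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n       # e_ij = n_i. * n.j / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_sq, 2), df)                   # 5.24 2
```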
When we talk of goodness of fit, we are referring to a situation where we
compare an observed sample distribution with an expected probability distribution such
as the binomial, Poisson, normal, and so on. The chi-square statistic is then used to
judge how well the observed sample fits the assumed distribution.
Example 5: The following is a distribution of daily demand for a product in a store.
X 0 1 2 3 4 5 6 or more
F 35 65 62 43 21 10 8
With α = 0.05, test the hypothesis that the distribution of daily demand is Poisson.
Solution: The estimate of the mean (λ) of the daily demand is

    λ = Σxf/Σf = 500/244 = 2.05

Hence the Poisson distribution fitted to the data is

    f(x) = (2.05)^x e^(−2.05) / x!

The expected frequency for each value of x is e = n f(x), where n = Σf = 244.
The table of probabilities and the expected and observed frequencies is given below:

Daily demand x   Probability Pi   Expected frequency e   Observed frequency O
0                0.1287           31.4028                35
1                0.2638           64.3757                65
2                0.2704           65.9851                62
3                0.1848           45.0898                43
4                0.0947           23.1085                21
5                0.0388            9.4745                10
6 or more        0.0188            4.5874                 8
We can now compute the chi-square statistic from the table below
x          O     e         O − e     (O − e)²/e
0          35    31.4028    3.5972   0.4121
1          65    64.3757    0.6243   0.0061
2          62    65.9851   -3.9851   0.2408
3          43    45.0898   -2.0898   0.0969
4          21    23.1085   -2.1085   0.1924
5 or more  18    14.0619    3.9381   1.1029
Total      244   244        0        X² = 2.051
Since we estimated the mean of the Poisson distribution from the data, the degrees
of freedom are 6 − 1 − 1 = 4. If we use α = 0.05, then X²0.05 = 9.488. Since the
calculated X² is less than the table value, the hypothesis that daily demand follows a
Poisson distribution cannot be rejected.
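The fit can be reproduced in a few lines (a sketch: demand for x ≥ 5 is pooled into one cell, with its expected count obtained by complement, so the statistic differs slightly from the rounded hand computation):

```python
import math

# Chi-square goodness-of-fit of a Poisson distribution (Example 5).
freq = {0: 35, 1: 65, 2: 62, 3: 43, 4: 21, 5: 10, 6: 8}  # 6 stands for "6 or more"
n = sum(freq.values())                                    # 244
lam = sum(x * f for x, f in freq.items()) / n             # 500/244, about 2.05

def poisson(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

expected = [n * poisson(x, lam) for x in range(5)]
expected.append(n - sum(expected))                 # pooled cell for x >= 5
observed = [freq[x] for x in range(5)] + [freq[5] + freq[6]]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1        # one extra df lost for estimating lam
print(round(chi_sq, 2), df)       # well below the table value 9.488
```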
population, which has one mean, as the inherent variation in the experiment, and try
to estimate its variance. A similar estimate is made for the variability between the
sample means. If the variability within populations and the variability between means
are of the same order of magnitude, then there is no difference between the population
means. If the variation between means is significantly larger than the within-population
variation, then we conclude that there is a difference between the population means.
In terms of a test of hypothesis, we have the following:
Ho: μ1 = μ2 = ... = μk
Ha: At least one of the equalities is violated,
where k is the number of populations considered and μi (i = 1, 2, ..., k) is the mean of
population i.
Let Ti be the total of the ith sample, T the grand total, ni the size of the ith sample, and
n = n1 + ... + nk. Then

    SST = Σi Σj (Xij − X̄)²  = Σi Σj X²ij − T²/n

    SSB = Σi ni (X̄i − X̄)²  = Σi T²i/ni − T²/n

    SSW = Σi Σj (Xij − X̄i)² = Σi Σj X²ij − Σi T²i/ni
Now, if our null hypothesis is true, then all k sample means would be close to
each other and very close to the grand mean X̄. This would mean that MSB would
be small compared to MSW. But if Ho is not true, MSB would be large compared
to MSW. The ratio MSB/MSW, which is a ratio of variances, gives the test statistic
(an F statistic) needed to carry out our test.
The results of the test are summarized in the ANOVA (Analysis of Variance) table
below.
ANOVA TABLE

Source of variation   Sum of squares   Degrees of freedom   Mean squares        F-ratio
Between means         SSB              k − 1                MSB = SSB/(k − 1)   F = MSB/MSW
Within means          SSW              n − k                MSW = SSW/(n − k)
Table 5a
Type of Machine

Employee   M1    M2      M3      M4     Total   Mean
E1         40    36      45      30     151     37.75
E2         38    42      50      41     171     42.75
E3         36    30      48      35     149     37.25
E4         46    47      52      44     189     47.25
Total      160   155     195     150    660     41.25
Mean       40    38.75   48.75   37.5
Perform a one-way ANOVA on this problem; that is, test the hypothesis that the
population mean is the same for the four machines.
Solution: Ho: μ1 = μ2 = μ3 = μ4
Ha: At least one equality is not true.
Here n1 = n2 = n3 = n4 = 4. Also k = 4 and n = 16.
Hence Σ T²i/ni = 110150/4 = 27537.50 and Σ Σ X²ij = 27900. Also T²/n = 660²/16 = 27225.
SSB = 27537.50 − 27225 = 312.50
SSW = 27900 − 27537.50 = 362.50
SST = SSB + SSW = 675
MSB = 312.5/3 = 104.167
MSW = 362.5/12 = 30.208
Hence F = MSB/MSW = 3.448.
Source of variation   Sum of squares   Degrees of freedom   Mean squares   F-ratio
Between machines      312.5            3                    104.167        3.448
Within machines       362.5            12                    30.208
Total                 675              15
Table 5b: ANOVA table for the problem

Now F0.95; 3,12 = 3.49 and F0.99; 3,12 = 5.95.
Since F = 3.448 < 3.49, we cannot reject the null hypothesis.
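The whole analysis can be reproduced from first principles (a sketch; no statistics library is assumed):

```python
# One-way ANOVA for the machine data of Table 5b.
groups = {                        # production figures, one list per machine
    "M1": [40, 38, 36, 46],
    "M2": [36, 42, 30, 47],
    "M3": [45, 50, 48, 52],
    "M4": [30, 41, 35, 44],
}

n = sum(len(g) for g in groups.values())               # 16
T = sum(sum(g) for g in groups.values())               # grand total, 660
cf = T ** 2 / n                                        # T^2/n = 27225.0

ssb = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cf   # 312.5
sst = sum(x ** 2 for g in groups.values() for x in g) - cf      # 675.0
ssw = sst - ssb                                                 # 362.5

k = len(groups)
f_ratio = (ssb / (k - 1)) / (ssw / (n - k))            # MSB / MSW
print(round(f_ratio, 3))                               # 3.448
```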
MSR = SSR/(r − 1).
3. Sum of squares for Columns (SSC) and Column Mean Squares (MSC)
This is variability between column means
    SSC = r Σ_{j=1}^{c} (X̄.j − X̄)² = Σ_{j=1}^{c} T²j/r − T²/rc
MSC = SSC/(c-1)
4. Sum of squares for Error (SSE) and Error Mean Square
This is variability due to chance
SSE = SST-SSR-SSC
MSE = SSE/[(r − 1)(c − 1)]
If we wish to test the null hypothesis that the row effects are all equal, we
compute the ratio MSR/MSE and compare it with the value of F read from the table at the
given level of significance and degrees of freedom. Similarly, the F-ratio
MSC/MSE gives us the test statistic for comparing the column effects. The results
may be summarized in an ANOVA table as shown here.
Example 7: Using the data in the previous example (i.e. in the one-way analysis), test:
a) the hypothesis that the mean production is the same for the four machines;
b) the hypothesis that the four employees do not differ with respect to mean
productivity.
Solution: Here r = c = 4, so that rc = 4 × 4 = 16 and

    T²/rc = 27225

Hence SST = 27900 − 27225 = 675.

Now Σ T²C/r = 110150/4 = 27537.5 and Σ T²R/c = 109514/4 = 27378.5, so

SSC = 27537.5 − 27225 = 312.5
SSR = 27378.5 − 27225 = 153.5
SSE = SST − SSC − SSR = 675 − 312.5 − 153.5 = 209

Since c = r = 4, (r − 1)(c − 1) = 3 × 3 = 9. Hence

MSC = SSC/(c − 1) = 312.5/3 = 104.167
MSR = SSR/(r − 1) = 153.5/3 = 51.167
MSE = SSE/[(c − 1)(r − 1)] = 209/9 = 23.222
Then compute

    b = 2.3026 q / h

where

    q = (n − k) log10 S²p − Σ_{i=1}^{k} (ni − 1) log10 S²i

and

    h = 1 + [1/(3(k − 1))] [ Σ_{i=1}^{k} 1/(ni − 1) − 1/(n − k) ]

(S²p is the pooled sample variance and S²i the variance of the ith sample.)
Here b is a value of a random variable having approximately the chi-square
distribution with k − 1 degrees of freedom. The quantity q is large when the sample
variances differ greatly and is small when all the sample variances are nearly equal.
Therefore reject Ho at the α level of significance when b > X²α.
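As a sketch, Bartlett's statistic can be coded directly from the formulas above (base-10 logarithms, hence the factor 2.3026 ≈ ln 10; the sample lists in the last line are hypothetical illustration data, chosen so that all sample variances are equal and b comes out as 0):

```python
import math

def bartlett_b(samples):
    """Bartlett's test statistic for equality of several variances."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    n = sum(sizes)

    def var(s):                   # sample variance with divisor n_i - 1
        m = sum(s) / len(s)
        return sum((x - m) ** 2 for x in s) / (len(s) - 1)

    v = [var(s) for s in samples]
    sp2 = sum((ni - 1) * vi for ni, vi in zip(sizes, v)) / (n - k)  # pooled

    q = (n - k) * math.log10(sp2) - sum(
        (ni - 1) * math.log10(vi) for ni, vi in zip(sizes, v))
    h = 1 + (sum(1 / (ni - 1) for ni in sizes) - 1 / (n - k)) / (3 * (k - 1))
    return 2.3026 * q / h         # compare with chi-square, k - 1 df

print(bartlett_b([[1, 2, 3, 4], [2, 3, 4, 5], [0, 1, 2, 3]]))
```

Equal sample variances give q = 0, so the printed value is 0 up to rounding error; Ho is rejected only when b exceeds the tabulated chi-square point.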
Sample
A B C
4 5 8
7 1 6
6 3 8
6 5 9
3 5
4
Solution: Ho: σ²1 = σ²2 = σ²3
Several tests are available to determine which pairs of population means
are not equal. What is considered in this section is Duncan's multiple-range
test. The procedure for this test is as follows.
First, let us assume that the ANOVA procedure has led to rejection of Ho, which
means that not all population means are equal. We also assume that the k random
samples are all of equal size n. The range of any subset of p sample means must
exceed a certain value before we consider any of those population means to be
different. This value is called the least significant range for the p means and is
denoted by Rp, where

    Rp = rp √(S²/n)
Here rp is the least significant studentized range obtained from the table; it depends on
the desired level of significance and the degrees of freedom. S² is the sample variance,
which is the estimate of the common variance σ² and is obtained from the error mean
square in the analysis of variance.
Example 9: Let us illustrate the above test with the following data.

Sample
       A     B     C     D     E
       5     9     3     2     7
       4     7     5     3     6
       8     8     2     4     9
       6     6     3     1     4
       3     9     7     4     7
Mean   5.2   7.8   4.0   2.8   6.6
If we arrange the sample means in increasing order of magnitude, we have the
following:

X̄D    X̄C    X̄A    X̄E    X̄B
2.8   4.0   5.2   6.6   7.8

Also, if you perform a one-way ANOVA on the data, you will find that the error
mean square is 2.880. Let us put α = 0.05. The values of rp from the table, with
20 degrees of freedom, for p = 2, 3, 4 and 5 are given below.
Having obtained rp, and with knowledge of S², we can compute Rp; hence the
following results, shown in the table below:
P 2 3 4 5
rp 2.950 3.097 3.190 3.255
Rp 2.24 2.35 2.42 2.47
Comparing these least significant ranges with the differences in the ordered means, we
arrive at the following conclusions:
a) X̄B − X̄E = 1.2 < R2 = 2.24; X̄B and X̄E are not significantly different.
b) X̄B − X̄A = 2.6 > R3 = 2.35; X̄B is significantly larger than X̄A, hence μB > μA.
Since also X̄B − X̄C = 3.8 > R4 and X̄B − X̄D = 5.0 > R5, we have μB > μC and μB > μD.
c) X̄E − X̄A = 1.4 < R2 = 2.24; X̄E and X̄A are not significantly different.
d) X̄E − X̄C = 2.6 > R3 = 2.35; X̄E is significantly larger than X̄C, hence μE > μC.
e) X̄A − X̄C = 1.2 < R2 = 2.24; X̄A and X̄C are not significantly different.
f) X̄A − X̄D = 2.4 > R3 = 2.35; X̄A is significantly larger than X̄D, hence μA > μD.
g) X̄C − X̄D = 1.2 < R2 = 2.24; we conclude that X̄C and X̄D are not
significantly different.
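The comparisons above can be generated mechanically (a sketch; the rp values are the tabulated ones quoted in the text):

```python
import math

# Duncan's multiple-range comparisons for Example 9.
means = {"A": 5.2, "B": 7.8, "C": 4.0, "D": 2.8, "E": 6.6}
s2, n = 2.880, 5                     # error mean square and common sample size
rp = {2: 2.950, 3: 3.097, 4: 3.190, 5: 3.255}

R = {p: r * math.sqrt(s2 / n) for p, r in rp.items()}   # least significant ranges

order = sorted(means, key=means.get)                    # ['D', 'C', 'A', 'E', 'B']
for i in range(len(order)):
    for j in range(i + 1, len(order)):
        lo, hi = order[i], order[j]
        p = j - i + 1                                   # number of means spanned
        diff = means[hi] - means[lo]
        verdict = "differ" if diff > R[p] else "do not differ"
        print(f"{hi} vs {lo}: {diff:.1f} vs R{p} = {R[p]:.2f} -> {verdict}")
```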
9.5 TWO-WAY ANOVA WITH INTERACTIONS
In the two-way ANOVA treated earlier, we assumed that the row and the column effects
were additive. In many cases this assumption does not hold; when we have several
observations per cell, the presence of interaction can be tested. Presented below is the
general ANOVA table that takes care of such a situation.
Source of variation   Sum of squares   Degrees of freedom   Mean square                     F-ratio
Row means             SSR              r − 1                S²1 = SSR/(r − 1)               S²1/S²4
Column means          SSC              c − 1                S²2 = SSC/(c − 1)               S²2/S²4
Interaction           SS(RC)           (r − 1)(c − 1)       S²3 = SS(RC)/[(r − 1)(c − 1)]   S²3/S²4
Error                 SSE              rc(n − 1)            S²4 = SSE/[rc(n − 1)]
The sums of squares are usually obtained by means of the following computational
formulae, where n is the number of observations per cell, Ti.. the total of the ith row,
T.j. the total of the jth column, Tij. the total of cell (i, j), and T the grand total:

    SST = Σi Σj Σk X²ijk − T²/rcn

    SSR = Σi T²i../cn − T²/rcn

    SSC = Σj T².j./rn − T²/rcn

    SS(RC) = Σi Σj T²ij./n − Σi T²i../cn − Σj T².j./rn + T²/rcn

    SSE = SST − SSR − SSC − SS(RC)
QUESTIONS
1. Use the following data to test whether the number of defective items
produced by two machines is independent of the machine on which they were
made.
________________________________________________________
Machine output
Defective Non Defective Total
________________________________________________________
Machine A 25 375 400
Machine B 42 558 600
________________________________________________________
Total 67 933 1000
________________________________________________________
2. Given the following data, use the X 2 test to determine whether the number of
accidents in a group of factories is independent of the age of the worker
_______________________________________________________
Number of                      Age
accidents      Under 21     21 – 44     45 – 65
________________________________________________________
0 120 360 220
1 40 28 2
2 13 5 2
3 or more 7 2 1
________________________________________________________
3. A box contains red, green and white balls, which are identical in every respect
but colour. 60 balls are drawn from the box, each ball being replaced immediately
after it is drawn and its colour noted. The numbers of the different colours obtained are:
Red 15, Green 26, White 19. Test the hypothesis that the number of balls of each
colour in the box is the same.
4. Test the hypothesis that the observed distribution is drawn from a population
with a Poisson distribution
_________________________________________________________
Number of defects per metre of cable 0 1 2 3 4 5 6
_________________________________________________________
Number of metre lengths 14 25 30 17 5 5 3
_________________________________________________________
produced 98, 104, 113, 97 and 103 parts in five days. Assuming normal
populations with equal variances, determine at the 0.05 level of significance
whether the three operators are producing at the same average daily rate.
7. A company wishes to study the differences in the selling abilities of its four
salesmen A, B, C and D, as well as the differences among its three sales districts S1, S2
and S3, all of the same size. The weekly sales in Naira for the four salesmen are
given below:
given below:
______________________________________________
Salesmen
______________________________________________
District A B C D
______________________________________________
S1 550 450 700 500
S2 300 350 550 400
S3 350 550 400 600
______________________________________________
Test the null hypothesis that there are no differences between sales districts and
that there are no differences between the selling abilities of the salesmen.
CHAPTER TEN
REGRESSION AND CORRELATION
10.1 INTRODUCTION
Businessmen, economists, scientists and sociologists have always been concerned
with problems of prediction. Marketing executives, for example, are constantly
analyzing sales data in the hope of predicting or forecasting future sales with a high
degree of accuracy. Measurements from sales data are used in turn by the production
division of the company, enabling the firm concerned to plan its output. One
may use the JAMB scores of students entering a university to predict their success
later in the university. These types of prediction problems are referred to by
statisticians as regression problems.
Regression problems come in many forms. The one we treat now deals
with the prediction of a dependent variable (say, Y) on the basis of a related
independent variable (say, X). This is the case of simple regression analysis. In
cases where more than one independent variable is involved, you speak of
multiple regression.
If two variables X and Y are supposed to be related (linearly), you may examine the
relationship as follows. Let X1, X2, ..., Xn be the observed values of X and Y1, Y2,
..., Yn be the corresponding values of Y. For instance, let X be the price of a product
and Y its corresponding sales for the past eight years, as shown in the table below:
Year Price (N): X Sales (in thousands): Y
1970 79 50
1971 74 60
1972 70 65
1973 80 54
1974 83 50
1975 86 48
1976 88 47
1977 92 45
Table 10:1 Regression Problem
The plot of the set of these pairs (X1, Y1), (X2, Y2),…..(Xn, Yn) of values of
X and Y is called a scatter diagram. In our example, n is 8. The scatter diagram is
shown below:
The difference between the actual Y – values and the corresponding computed
values of Y predicted from the line is called an Error or a Residual or a Deviation.
10.2 Method of Least Squares
Now that we have decided to use a linear prediction equation, we must consider the
problem of deriving computational formulas for determining the point estimates a
and b from the available sample points. The procedure used here is called the
method of least squares. This is the method of fitting the 'best line' to a given
set of n pairs of points, in such a way that the sum of squares of the errors between
the actual y-values and the predicted y-values is minimized. That is, if we let e1
represent the distance obtained by subtracting the predicted y-value from the
observed y-value for the first point, e2 a similar distance corresponding to the
second point, and so forth, the method of least squares yields formulas for
calculating a and b so that

    Σe² = e²1 + e²2 + ... + e²n is a minimum.
By means of differential calculus, the values of a and b are obtained from these
two simultaneous equations, called the normal equations:

    an + b Σx = Σy
    a Σx + b Σx² = Σxy
Estimation of Parameters
The least-squares estimates of a and b for the prediction equation y = a + bx are
obtained from the formulas

    a = ȳ − b x̄

and

    b = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²]   or   b = [Σxy − n x̄ ȳ] / [Σx² − n x̄²]
Note that you have to compute b before a. To obtain the estimates of the
parameters a and b for our example, build the following table:
x y xy x2 y2
79 50 3950 6241 2500
74 60 4440 5476 3600
70 65 4550 4900 4225
80 54 4320 6400 2916
83 50 4150 6889 2500
86 48 4128 7396 2304
88 47 4136 7744 2209
92 45 4140 8464 2025
Total = 652 419 33814 53510 22279
Table 10.2. Estimation of parameters
Hence x̄ = 652/8 = 81.50 and ȳ = 419/8 = 52.375.
Therefore b = (Σxy − n x̄ ȳ)/(Σx² − n x̄²) becomes

    b = (33814 − 8(81.50)(52.375)) / (53510 − 8(81.50)²)
      = (33814 − 34148.5) / (53510 − 53138)
      = −334.5/372 = −0.899
and a = ȳ − b x̄ = 52.375 − (−0.899)(81.50) = 52.375 + 73.2685 = 125.644.
The linear prediction equation, or the regression equation of y on x, is y = 125.644 −
0.899x. By substituting any value of x into this equation, you obtain the predicted
value of y. For instance, when x = 90, y = 125.644 − 0.899(90) = 125.644 − 80.91 =
44.734.
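The fitted line can be checked in a few lines of Python (a sketch; the small difference in a comes from carrying b at full precision rather than rounding it to −0.899 first):

```python
# Least-squares fit for the price/sales data of Table 10.2.
x = [79, 74, 70, 80, 83, 86, 88, 92]     # price
y = [50, 60, 65, 54, 50, 48, 47, 45]     # sales (thousands)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n                            # 81.5, 52.375
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar       # -334.5
sxx = sum(a * a for a in x) - n * xbar ** 2                    # 372.0

b = sxy / sxx                         # -0.899
a = ybar - b * xbar                   # 125.66 (125.644 if b is rounded first)
print(round(b, 3), round(a, 2))       # -0.899 125.66

print(round(a + b * 90, 2))           # prediction at x = 90 -> 44.73
```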
This variance is known as the sum of squares for errors, SSE; that is, the sum of
squares of the deviations of the actual y-values from their corresponding predicted
values on the regression line.
Note that we have to divide the expression by n − 2 because, since the
constants a and b of the regression equation have been estimated from the data, the
number of degrees of freedom associated with SSE is reduced by 2.
There are various ways SSE, or S², can be calculated. Most of these
methods are tedious. We present below a shorter way, using this formula:
    S² = SSE/(n − 2) = [1/(n − 2)] [ (Σy² − n ȳ²) − b (Σxy − n x̄ ȳ) ]
The data needed to calculate S² are found in our last table; for our example,

    S² = (1/6) [ (22279 − 8(52.375)²) − (−0.899)(−334.5) ]
       = (1/6) (22279 − 21945.125 − 300.7155)
       = 33.1595/6 = 5.527
Variances of a and b
Let S²a and S²b be the variances of a and b respectively. It can be shown that

    S²a = S² [ 1/n + x̄²/(Σx² − n x̄²) ]

and

    S²b = S² / (Σx² − n x̄²)

Hence

    S²a = 5.527 (1/8 + (81.50)²/372) = 5.527(17.9805) = 99.378,  so Sa = √99.378 = 9.969

and

    S²b = 5.527/372 = 0.0149,  so Sb = √0.0149 = 0.122
10.1.4 Confidence Intervals for a and b
A (1 − α)100% confidence interval for the parameter a lies between a − tα/2 Sa
and a + tα/2 Sa, and a (1 − α)100% confidence interval for the parameter b lies between
b − tα/2 Sb and b + tα/2 Sb.
For instance, for the 95% and 99% levels the confidence intervals are as follows.
With 6 degrees of freedom, t0.025 = 2.4469 and t0.005 = 3.7074, so the confidence
intervals for a at these two levels are 125.644 − 2.4469(9.969) < a < 125.644 +
2.4469(9.969), or 101.251 < a < 150.037, and 88.684 < a < 162.604 respectively.
The confidence intervals for b at these two levels are respectively
−1.198 < b < −0.600
and −1.351 < b < −0.447.
You will recall that as soon as we computed the values of a and b we
predicted the value of y for x = 90: y = 44.734. Since this value
is an estimate, one would like to compute the standard error associated with this
prediction. The variance of the predicted value is given by

    Var(ŷ) = S² [ 1/n + (xo − x̄)²/(Σx² − n x̄²) ]
The total sum of squares

    SST = Σ (y − ȳ)²

is broken into the sum of squares for regression, SSR, and the sum of squares for
errors, SSE. That is,

    SST = SSR + SSE

It can be shown that

    SSR = b² (Σx² − n x̄²)

and from our example you have

    SSR = (−0.899)² (372) = 300.651
This partitioning of the sum of squares may be summarized in a table called the
analysis of variance (ANOVA) table, as shown below:

Source of variation               Sum of squares   Degrees of freedom   Mean squares        F-ratio
Due to regression (SSR)           Σ(ŷ − ȳ)²        1                    MSR = SSR/1         MSR/MSE
Deviation from regression (SSE)   Σ(y − ŷ)²        n − 2                MSE = SSE/(n − 2)

To test whether the slope b exists, that is, that it is different from zero, you calculate
the test statistic b/Sb = −0.899/0.122 = −7.369, which is 'significant' at the 90%, 95%
and 99% levels. This means the parameter b does exist.
Similarly, if you want to test whether the intercept a exists, that is, that it is
different from zero, you calculate the test statistic

    a/Sa = 125.644/9.969 = 12.603

which is also significant at the 90%, 95% and 99% levels. The parameter a actually
exists. Note also that the F-value in the ANOVA table is significant at the 95% and
99% levels.
10.2 CORRELATION
Finally, one would like to measure the degree of relationship between the two variables
x and y. This is done with the coefficient of correlation, r, which is a measure of the
linear correlation between the two variables. r is given by the formula

    r = (Σxy − n x̄ ȳ) / √[ (Σx² − n x̄²)(Σy² − n ȳ²) ]

The value of r lies between −1 and +1 inclusive. If the value of r is equal to plus one
or minus one, there is a perfect linear correlation in the same or opposite direction
respectively. If the value of r is zero, there is no linear relationship between the two
variables.
In our example, the value of r is

    r = −334.5 / √(372 × 333.875) = −334.5/352.422 = −0.949

which shows a high degree of (negative) correlation.
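The value of r (and of r², discussed next) can be verified with a short sketch:

```python
import math

# Correlation coefficient for the price/sales data.
x = [79, 74, 70, 80, 83, 86, 88, 92]
y = [50, 60, 65, 54, 50, 48, 47, 45]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar   # -334.5
sxx = sum(a * a for a in x) - n * xbar ** 2                # 372.0
syy = sum(b * b for b in y) - n * ybar ** 2                # 333.875

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3), round(r * r, 2))     # -0.949 0.9
```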
of the variance in sales is determined by the linear association between volume of
sales and price of the commodity, for the data given. Notice that if r² = 1, the
sum of squares of errors is zero; r² lies between zero and plus one inclusive.
Solving the above equations for b and a yields:

    b = (ȳa − ȳb) / (x̄a − x̄b)   and   a = ȳa − b x̄a

In our example, ȳa = 57.25, ȳb = 47.5, x̄a = 75.75 and x̄b = 87.25. Hence

    b = (229/4 − 190/4) / (303/4 − 349/4) = (229 − 190)/(303 − 349) = 39/(−46) = −0.848

and

    a = ȳa − b x̄a = 57.25 − (−0.848)(75.75) = 121.488
Define the quantities

    m11 = n Σx²1 − (Σx1)²
    m12 = n Σx1x2 − (Σx1)(Σx2)
    m22 = n Σx²2 − (Σx2)²
    my1 = n Σx1y − (Σx1)(Σy)
    my2 = n Σx2y − (Σx2)(Σy)
    my  = n Σy² − (Σy)²

Then it can be shown (from the normal equations, not shown here) that

    a = ȳ − b1 x̄1 − b2 x̄2
    b1 = (my1 m22 − my2 m12) / (m11 m22 − m²12)
    b2 = (my2 m11 − my1 m12) / (m11 m22 − m²12)

and

    S²b1 = S² m22 / (m11 m22 − m²12)
    S²b2 = S² m11 / (m11 m22 − m²12)
One thing a wise student should note is that the factor m11 m22 − m²12 appears in
the denominators of many of the above formulas. It should be computed once and
stored, rather than recalculated every time it is run into.
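A quick check of the formulas (a sketch on hypothetical data constructed so that y = 2 + 3x1 + x2 exactly, so the computation should recover a = 2, b1 = 3 and b2 = 1):

```python
# Multiple-regression coefficients via the m-quantities defined above.
x1 = [1, 2, 3, 4]
x2 = [1, 0, 1, 0]
y = [6, 8, 12, 14]          # constructed as 2 + 3*x1 + x2
n = len(y)

m11 = n * sum(a * a for a in x1) - sum(x1) ** 2
m22 = n * sum(a * a for a in x2) - sum(x2) ** 2
m12 = n * sum(a * b for a, b in zip(x1, x2)) - sum(x1) * sum(x2)
my1 = n * sum(a * b for a, b in zip(x1, y)) - sum(x1) * sum(y)
my2 = n * sum(a * b for a, b in zip(x2, y)) - sum(x2) * sum(y)

denom = m11 * m22 - m12 ** 2        # the shared denominator: compute it once
b1 = (my1 * m22 - my2 * m12) / denom
b2 = (my2 * m11 - my1 * m12) / denom
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
print(b1, b2, a)                    # 3.0 1.0 2.0
```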
10.4.3. CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
With the parameters and their variances, and hence their standard deviations,
calculated, the confidence intervals are obtained as in the simple regression case.
The existence of these parameters is examined by hypothesis tests, with test statistics
determined as in the case of simple regression. All you need for the hypothesis testing
here are the values of the parameters and their corresponding standard deviations.
Also SST = my, so that

    SSE = SST − SSR = my − b²1 m11 − b²2 m22 − 2 b1 b2 m12

and

    F-ratio = [SSR/(k − 1)] / [SSE/(n − k)] = (SSR/2) / (SSE/(n − 3))

where k is the number of parameters. A significant F-value means both b1 and b2 do
exist.
Coefficient of Determination
The coefficient of determination, R², is defined by

    R² = SSR/SST = 1 − SSE/SST

This is a measure used to describe how well the sample regression line fits the
observed data. R² lies between zero and plus one inclusive. However, some people
prefer to measure the goodness of fit in the multiple regression case by a formula
known as the corrected coefficient of determination.
This is R̄², given by

    R̄² = R² − [(k − 1)/(n − k)] (1 − R²)

This measure takes into account the number of explanatory variables in relation to
the number of observations.
Like the product-moment correlation coefficient, the Spearman coefficient of rank
correlation takes values from −1 to +1 inclusive, and it has the same interpretation as
in the former case.
Example:
Item:      A    B     C     D    E    F
x (rank):  4    5     6     3    1    2
y (rank):  2    4.5   4.5   6    1    3
d:         2    0.5   1.5   -3   0    -1    (Σd = 0)
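A minimal sketch using the standard Spearman formula ρ = 1 − 6Σd²/(n(n² − 1)) (assumed here from the usual definition, since the formula itself does not appear in this excerpt):

```python
# Spearman rank correlation for the six items above.
x_rank = [4, 5, 6, 3, 1, 2]
y_rank = [2, 4.5, 4.5, 6, 1, 3]
n = len(x_rank)

d = [a - b for a, b in zip(x_rank, y_rank)]          # 2, 0.5, 1.5, -3, 0, -1
rho = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
print(round(rho, 3))                                 # 0.529
```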
Statisticians refer to this constant variance as homoscedasticity, and to non-constant
variance of the residuals as heteroscedasticity.
3. The residuals should be normally distributed.
4. The residuals should have an expected value of zero for any given observation.
10.7 Questions
1) Suppose that in the example given for simple regression it is now observed that the
volume of sales depends not only on price x (now x1) but also on the level of
advertising expenditure (x2). Perform a multiple regression on the data shown
below. Compute R², S² and the ANOVA table.
Sales (in thousands) Price Advertising in Naira
50 79 4.5
60 74 5
65 70 6
54 80 4.8
50 83 4.2
48 86 4
47 88 4.1
45 92 7.5
2) Compute the multiple regression analysis on the data given below. Test whether
the parameters are significant.
Customers (y) Student Enrollment x1 Student’s contributions
to fund (x2)
10 700 50
15 750 65
20 760 80
15 800 100
20 842 105
16 910 85
18 965 90
22 1010 100
24 1070 110
30 1100 100
3) Fit a linear relationship between the variables on the data given below.
Size of party   Entertainment cost
1 5.00
2 10.00
3 15.35
4 20.50
5 25.95
6 32.20
7 38.50
8 46.00
9 53.80
10 62.00
4) Fit the data in exercise (3) in the form y = ab^x.
5) The following data for ten towns show (i) the proportion of households owning
a car and (ii) an index of social class.
Town % with cars Social class index
A 51 106
B 48 104
C 43 100
D 36 96
E 30 90
F 24 86
G 22 84
H 16 74
I 12 64
J 10 66
Calculate the coefficient of rank correlation
The following values for two variables x and y were obtained:
x: 0.125 0.250 0.500 1.000 2.000
y: 3.9 2.1 0.95 0.6 0.3
_____________________________