Page 1 of 22

Course: Statistics



Unit 6

Chi-Square Distribution










Table of Contents
6.1. Learning Objectives
6.2. Introduction
6.3. Chi Square Distribution
6.3.1. Properties of χ² Distribution
6.3.2. Characteristics of χ² Test
6.3.3. Degrees of Freedom
6.3.4. Restrictions and Conditions in Applying χ² Test
6.3.5. Levels of Significance
6.3.6. Steps in Solving χ² Problems
6.3.7. Interpretation
6.4. Uses of χ² Test
6.5. Application of χ² Test
6.5.1. Tests for Independence of Attributes
6.5.2. Test of Goodness of Fit
6.5.3. Test for Specified Variance
6.6. Summary
6.7. Reference
6.7.1. Recommended Textbooks
6.7.2. Web References















6.1. Learning Objectives

By the end of this unit, you should be able to:

- Recognise the importance of Chi-Square test
- Recall Chi-Square distribution and its properties
- List the conditions under which the test can be applied
- Apply Chi-square as a test of Independence
- Apply Chi-square as a test of goodness of fit
- Apply Chi-square as a test of specified variance

Case-1:

The ABC soap manufacturing company produces three varieties of soaps with
different ingredients and flavours. The Company's Marketing General Manager
wants to know the age-wise preference of the consumers with respect to the
varieties. He consults a Statistician and collects the following data as per his
instruction:

Table 6.1
Product    Age (yrs)
           20-30    30-40    40 & above
S1         70       120      70
S2         130      200      130
S3         120      190      20

The G.M. is also interested in knowing the distribution of complaints received in a week
by the firm. The Statistician collects the following information:

Table 6.2
Complaints          0      1     2     3     4
Numbers received    210    90    40    15    10

(Cont. in topic Degrees of Freedom)

6.2. Introduction

In the previous units, we learned how to test hypotheses using data from either one or two
samples. We used one-sample tests to determine whether a mean or a proportion was significantly
different from a hypothesized value. In the two-sample tests, we examined the difference between
either two means or two proportions, and we tried to learn whether this difference was significant.

Suppose we have proportions from five populations instead of only two. In this case, the methods
for comparing proportions described for two-sample hypothesis tests do not apply; we
must use the chi-square (χ²) test. χ² tests enable us to test whether more than two population
proportions can be considered equal.

Actually, χ² tests allow us to do a lot more than just test for the equality of several
proportions. If we classify a population into several categories with respect to two attributes (such
as age and job performance), we can then use a χ² test to determine whether the two
attributes are independent of each other.

6.3. Chi Square Distribution

The square of a standard normal variate is called a chi-square variate with 1 degree
of freedom. That is, if X is normally distributed with mean μ and standard
deviation σ, then ((X - μ)/σ)² is a χ² variate with df = 1.

If X1, X2, ..., Xn are n independent random variables following the normal
distribution with mean μ and SD σ, then the χ² variate is given by:

$$\chi^2 = \left(\frac{X_1 - \mu}{\sigma}\right)^2 + \left(\frac{X_2 - \mu}{\sigma}\right)^2 + \cdots + \left(\frac{X_n - \mu}{\sigma}\right)^2$$

Chi-square is thus the sum of the squares of n independent standard normal variates,
and it follows the χ² distribution with n degrees of freedom.
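
The definition can be tried out numerically. The following is a minimal sketch, not part of the original unit: the function names, the chosen mean and SD, the number of draws and the seed are all assumptions made for illustration. It standardizes each value, squares it, and sums; averaging many such sums should land near the degrees of freedom n.

```python
import random

def chi_square_variate(xs, mu, sigma):
    """Sum of squared standardized values: sum(((x - mu) / sigma) ** 2)."""
    return sum(((x - mu) / sigma) ** 2 for x in xs)

def simulate_mean(n, draws=20000, mu=10.0, sigma=2.0, seed=0):
    """Average the chi-square variate over many simulated normal samples of size n.

    mu, sigma, draws and seed are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        total += chi_square_variate(sample, mu, sigma)
    return total / draws

# (1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2 = 2.0
print(chi_square_variate([1, 2, 3], mu=2, sigma=1))
# The simulated average should be close to n = 3, the degrees of freedom.
print(simulate_mean(n=3))
```

The simulation is only a sanity check; it illustrates the definition and does not replace the χ² table.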


6.3.1. Properties of χ² Distribution

1. Mean of χ² distribution = degrees of freedom = ν.
2. S.D. of χ² distribution = √(2ν).
3. Median of χ² distribution divides the area of the curve into two equal parts, each part
being 0.5.
4. Mode of χ² distribution is equal to degrees of freedom less 2, that is, ν - 2.
5. The χ² distribution is always positively skewed.
6. χ² values increase with the increase in the degrees of freedom; there is a new χ² distribution
with every increase in the number of degrees of freedom.
7. The lowest value of χ² is zero and the highest is infinity, i.e. 0 ≤ χ² < ∞.
8. When two chi-squares χ1² and χ2² are independent and follow χ² distributions with n1
and n2 degrees of freedom, their sum χ1² + χ2² follows a χ² distribution with n1 + n2
degrees of freedom.
9. When ν > 30, √(2χ²) - √(2ν - 1) approximately follows the standard normal distribution.

6.3.2. Characteristics of χ² Test

- The χ² test is based on frequencies and not on parameters.
- It is a non-parametric test: no rigid assumptions about population parameters are
required.
- The additive property is also found in the χ² test.
- The χ² test is useful to test the hypothesis about the independence of attributes.
- The χ² test can be used in complex contingency tables.
- The χ² test is very widely used for research purposes in behavioral and social sciences
including business research.
- It is defined as χ² = Σ (O - E)² / E.
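
The numerical properties listed above are easy to tabulate. The sketch below is illustrative only; the function names are hypothetical. It takes the mode as zero when there are fewer than 2 degrees of freedom, and it reads the normal approximation as applying when the degrees of freedom exceed 30.

```python
import math

def chi2_summary(v):
    """Mean, standard deviation and mode of a chi-square distribution with v d.f."""
    return {
        "mean": v,
        "sd": math.sqrt(2 * v),
        "mode": max(v - 2, 0),  # v - 2, taken as 0 when v < 2
    }

def normal_approx_z(chi2_value, v):
    """sqrt(2 * chi2) - sqrt(2v - 1), roughly standard normal for large v."""
    return math.sqrt(2 * chi2_value) - math.sqrt(2 * v - 1)

print(chi2_summary(8))
# For v = 40 and an observed chi-square of 50 the z-score is 10 - sqrt(79).
print(round(normal_approx_z(50.0, 40), 3))
```

A z-score from `normal_approx_z` would then be judged against the standard normal table instead of the χ² table.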



6.3.3. Degrees of Freedom

The number of degrees of freedom for n observations is n - k and is usually
denoted by ν, where k is the number of independent linear constraints imposed
upon them. Suppose we are asked to write any four numbers; then all four
numbers are of our choice. If a restriction is imposed on the choice, namely that
the sum of these numbers should be 50, then the freedom of choice is
reduced to three numbers only, and so the degrees of freedom are now 3.

If χ² is defined as the sum of the squares of n independent standardized normal variates and
k independent linear constraints are imposed upon them (such as the
estimation of some population parameter), then n is replaced
by n - k. If the sum of squares is taken about the sample mean instead of the
population mean, then n is replaced by n - 1 = ν, since one linear constraint has been imposed.

(Cont. from topic Introduction)

In the case study the degrees of freedom are given by (3 - 1)(3 - 1) = 4. At the 5% level of
significance the tabulated value is 9.488.

(Cont. in topic Tests for Independence of Attributes)

6.3.4. Restrictions and Conditions in Applying χ² Test

Restrictions

The sample observations should be independently and normally distributed. For this either the
parent population should be sufficiently large (say, greater than 50) or sampling should be done
with replacement.

Constraints imposed upon the observations must be of linear character. For example,

$$\sum O_i = \sum E_i$$

The χ² distribution is essentially a continuous distribution, but its character of continuity is
maintained only when the individual frequencies of the variate values remain greater than or
equal to 5. So in applying the χ² test in the testing of goodness of fit or in a contingency
table, the cell frequency should not be less than 5. In practical problems we can combine a
few values of small frequencies into one to get a pooled frequency of at least 5.
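
The pooling rule can be sketched in a few lines. This is an assumption-laden illustration, not a procedure given in the unit: the function name is hypothetical, the frequencies are made up, and the sketch only merges classes from the tail backwards, which suits count data where the sparse classes sit at the end.

```python
def pool_small_cells(observed, expected, minimum=5.0):
    """Merge trailing classes into their neighbour until the last expected
    frequency is at least `minimum`. Only the tail is pooled in this sketch."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_o, last_e = obs.pop(), exp.pop()
        obs[-1] += last_o
        exp[-1] += last_e
    return obs, exp

# Hypothetical frequencies: the last class has an expected count below 5.
obs, exp = pool_small_cells([40, 30, 20, 7, 3], [42.0, 31.0, 18.0, 6.5, 2.5])
print(obs, exp)  # [40, 30, 20, 10] [42.0, 31.0, 18.0, 9.0]
```

Each class removed by pooling also reduces the number of classes used when counting degrees of freedom.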



Conditions:

1) The frequencies used in the chi-square test must be absolute and not in relative terms.
2) The total number of observations collected for this test must be large.
3) Each of the observations which make up the sample of this test must be independent of
the others.
4) As the χ² test is based wholly on sample data, no assumption is made concerning the
population distribution. In other words it is a non-parametric test.
5) The χ² test is wholly dependent on degrees of freedom.
6) The expected frequency of any item or cell must not be less than 5; the frequencies of
adjacent items or cells should be pooled together in order to make it 5 or more.
7) The data should be expressed in original units for convenience of comparison, and the
given distribution should not be replaced by relative frequencies or proportions.

This test is used only for drawing inferences through tests of hypotheses, so it cannot be
used for estimation of parameter values.

6.3.5. Levels of Significance

Tables have been prepared for the values of P, the probability of getting a value of
χ² greater than or equal to χ0², where χ0² is an observed value. From these
tables, we can find the value of P corresponding to an observed value of χ² and
then proceed to test whether the difference between observed and theoretical
frequencies is significant or not. The smaller the value of P, the greater the divergence
between fact and theory, so that small values lead us to suspect the hypothesis. Not
only do small values of P lead us to suspect the hypothesis, but a value of P very near
to unity may also lead to a similar result. Thus if P = 1, χ² = 0, showing that there
is perfect agreement between fact and theory, which is a very improbable event.

The two conventional levels of significance are 5 percent and 1 percent. If P is less
than 0.05, we say that the observed value of χ² is significant at the 5 percent level
of significance. Similarly, if P is less than 0.01, the value is significant at the 1% level.

The formula for calculating χ² is given by:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where f_o is the observed frequency and f_e is the expected frequency.
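
Where a printed table is not at hand, the probability P can be computed directly. The sketch below is an illustration under stated assumptions, not the method of this unit. The function name is hypothetical and it handles whole-number degrees of freedom only. It relies on the standard recurrence for the upper regularized incomplete gamma function, Q(a + 1, t) = Q(a, t) + t^a e^(-t) / Γ(a + 1), with t = χ²/2.

```python
import math

def chi2_p_value(x, v):
    """P(chi-square with v degrees of freedom >= x), for integer v >= 1, x >= 0.

    Starts from Q(1/2, t) = erfc(sqrt(t)) for odd v or Q(1, t) = exp(-t)
    for even v, then steps a up to v / 2 with the recurrence above.
    """
    t = x / 2.0
    if v % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(t))
    else:
        a, q = 1.0, math.exp(-t)
    while a < v / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1.0
    return q

# The familiar 5% points should give P close to 0.05.
for x, v in [(3.841, 1), (5.991, 2), (9.488, 4)]:
    print(v, round(chi2_p_value(x, v), 4))
```

An observed χ² is significant at the 5 percent level when the returned P is below 0.05, and at the 1 percent level when it is below 0.01.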


6.3.6. Steps in Solving χ² Problems

1) Calculate the expected frequencies. In general the expected frequency for any cell can
be calculated from the following expression:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

2) Take the difference between observed and expected frequencies and obtain the squares
of these differences, (O - E)².

3) Divide the values obtained in step 2 by the respective expected frequency and add all
the values to get the value according to the formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

6.3.7. Interpretation

After ascertaining the χ² value, compare it with the tabulated value. The χ² table
comprises columns headed with symbols such as χ²(0.05) for the 5% level of significance,
χ²(0.01) for the 1% level of significance and so on. The left
hand side indicates the degrees of freedom. If the calculated value of χ² falls in the
acceptance region, the null hypothesis H0 is accepted, and vice versa.

Figure 6.1
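
Steps 2 and 3 and the final comparison can be sketched as follows. The function names are hypothetical; the observed and expected frequencies are those of Table 6.6 in Example 6.1, and 9.49 is the tabulated value quoted there for 4 degrees of freedom at the 5% level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def decide(chi2_cal, chi2_tab):
    """H0 is rejected when the calculated value exceeds the tabulated value."""
    return "reject H0" if chi2_cal > chi2_tab else "accept H0"

# Observed and expected frequencies from Table 6.6 (Example 6.1).
observed = [5, 10, 15, 15, 20, 25, 20, 20, 20]
expected = [8, 10, 12, 16, 20, 24, 16, 20, 24]

chi2_cal = chi_square_statistic(observed, expected)
print(round(chi2_cal, 4), decide(chi2_cal, 9.49))
```

The expected frequencies are supplied directly here; step 1 (deriving them from row and column totals) is a separate calculation.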

6.4. Uses of χ² Test

The χ² test is used broadly to:

- Test goodness of fit for one-way classification or for one variable only
- Test independence or interaction for more than one row or column in the form of a
contingency table concerning several attributes
- Test population variance σ² through confidence intervals suggested by the χ² test

6.5. Application of χ² Test

6.5.1. Tests for Independence of Attributes

The number of degrees of freedom is given by:

$$\text{DOF} = (\text{Number of rows} - 1)(\text{Number of columns} - 1)$$

The expected value is given by:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
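
As a rough illustration of these two formulas, the sketch below derives the expected frequencies and the degrees of freedom from a table of observed counts. The function names are hypothetical; the counts are those of Table 6.5 in Example 6.1 (rows are territories, columns are salesmen).

```python
def expected_frequencies(table):
    """E = row total x column total / grand total, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def degrees_of_freedom(table):
    """(number of rows - 1) x (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Observed sales from Table 6.5 (Example 6.1).
sales = [
    [5, 15, 20],
    [10, 20, 20],
    [15, 25, 20],
]
for row in expected_frequencies(sales):
    print(row)
print("d.o.f. =", degrees_of_freedom(sales))
```

The first row of expected values works out to 8, 16 and 16, which are the figures used in Table 6.6.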


(Cont. from topic Degrees of Freedom)

Consider the case on hand. The method of solution is as follows:

Table 6.3
Product         Age (yrs)
                20-30    30-40    40 & above    Row Total
S1              70       120      70            260
S2              130      200      130           460
S3              120      190      20            330
Column Total    320      510      220           1050

Table 6.4
Observed Value (O)    Expected Value (E)            (O-E)²/E
70                    320 x 260/1050 = 79.24        1.077
130                   320 x 460/1050 = 140.19       0.741
120                   320 x 330/1050 = 100.57       3.754
120                   510 x 260/1050 = 126.29       0.313
200                   510 x 460/1050 = 223.43       2.457
190                   510 x 330/1050 = 160.29       5.507
70                    220 x 260/1050 = 54.48        4.421
130                   220 x 460/1050 = 96.38        11.728
20                    220 x 330/1050 = 69.14        34.925
χ² calculated                                       64.923

1. Null hypothesis H0: Variety is independent of age
   Alternate hypothesis HA: Variety is dependent on age
2. Level of significance 5% and D.O.F = (3 - 1)(3 - 1) = 4; χ²tab = 9.488
3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Test: χ²cal = 64.923
5. Conclusion: Since χ²cal (64.923) > χ²tab (9.488), H0 is rejected.
   The variety is dependent on age. It is observed that the 30 to 40 yrs age
   group has more liking for the varieties.

(Cont. in topic Test of Goodness of Fit)
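
The arithmetic of Table 6.4 can be checked cell by cell. The sketch below is illustrative; the function and variable names are hypothetical. It works from unrounded expected values, so the total may differ from the tabulated 64.923 in the second decimal place. It also reports which cell contributes most to the statistic.

```python
def cell_contributions(table):
    """Return (O - E)^2 / E for every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    result = []
    for row, r in zip(table, row_totals):
        result.append([
            (o - r * c / grand_total) ** 2 / (r * c / grand_total)
            for o, c in zip(row, col_totals)
        ])
    return result

# Table 6.3: rows are products S1..S3, columns are the three age groups.
soap = [
    [70, 120, 70],
    [130, 200, 130],
    [120, 190, 20],
]
parts = cell_contributions(soap)
chi2_cal = sum(sum(row) for row in parts)
# (value, row index, column index) of the largest contribution.
largest = max((value, i, j) for i, row in enumerate(parts) for j, value in enumerate(row))
print(round(chi2_cal, 2), largest)
```

The largest contribution comes from product S3 in the 40 & above group (observed 20 against roughly 69 expected), which accounts for more than half of the statistic.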

Example 6.1:

The following table gives the sales of a product by 3 salesmen in 3
territories. Test at 5% level of significance whether salesmen and territories are
independent.

Table 6.5
Territories    Salesman
               1     2     3     Total
I              5     15    20    40
II             10    20    20    50
III            15    25    20    60
Total          30    60    60    150

Solution:

Table 6.6
Observed Value (O)    Expected Value (E)     (O-E)²    (O-E)²/E
5                     40 x 30/150 = 8        9         1.1250
10                    50 x 30/150 = 10       0         0.0000
15                    60 x 30/150 = 12       9         0.7500
15                    40 x 60/150 = 16       1         0.0625
20                    50 x 60/150 = 20       0         0.0000
25                    60 x 60/150 = 24       1         0.0417
20                    40 x 60/150 = 16       16        1.0000
20                    50 x 60/150 = 20       0         0.0000
20                    60 x 60/150 = 24       16        0.6667
χ²                                                     3.6459

1. Null hypothesis H0: The salesmen and territories are independent
   Alternate hypothesis HA: They are dependent
2. Level of significance 5% and D.O.F = (3 - 1)(3 - 1) = 4; χ²tab = 9.49
3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Test: χ²cal = 3.6459
5. Conclusion: Since χ²cal (3.6459) < χ²tab (9.49), H0 is accepted.
   The attributes are independent.




Example 6.2:

Out of 1000 people surveyed, 600 belonged to urban areas and the rest to rural areas.
Among the 500 who visited other states, 400 belonged to urban areas. Test at 5%
level of significance whether area and visiting other states are dependent.

Solution: The given information can be tabulated as follows:

Table 6.7
Other States    Urban    Rural    Total
Visited         400      100      500
Not Visited     200      300      500
Total           600      400      1000

Table 6.8
Observed Value (O)    Expected Value (E)    (O-E)²    (O-E)²/E
400                   300                   10000     33.33
200                   300                   10000     33.33
100                   200                   10000     50.00
300                   200                   10000     50.00
χ²cal                                                 166.67

1. Null hypothesis H0: Area and visit are independent
   Alternate hypothesis HA: They are dependent
2. Level of significance 5% and D.O.F = (2 - 1)(2 - 1) = 1; χ²tab = 3.841
3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Test: χ²cal = 166.67
5. Conclusion: Since χ²cal (166.67) > χ²tab (3.841), H0 is rejected.
   They are dependent.

Example

Five hundred students in a school were graded according to their intelligence and the economic
conditions of their homes. Examine whether there is any association between economic conditions
at home and intelligence at 5% level of significance.

          Intelligence
          Good    Bad    Total
Rich      85      75     160
Poor      165     175    340
Total     250     250    500

Solution:

Step 1: Set up the null and alternative hypothesis

H0: There is no association between economic conditions at home and intelligence
v/s
H1: There is an association between economic conditions at home and intelligence.

Step 2: Level of significance = 0.05

Step 3: Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2 \text{ with } (r - 1)(c - 1) \text{ degrees of freedom}$$

Where,
O = observed values
E = expected values = (Row Total x Column Total) / Grand Total

Step 4: Calculation

O      E      (O-E)    (O-E)²    (O-E)²/E
85     80     5        25        0.3125
75     80     -5       25        0.3125
165    170    -5       25        0.147059
175    170    5        25        0.147059
Total                            0.919118

Step 5: Inference

The tabulated χ² (1 degree of freedom) at 5% level is 3.841, and the calculated value is 0.919118.
Since the tabulated value is greater than the calculated value, there is no sample evidence to reject
H0.
Therefore, there is no association between economic conditions at home and intelligence.
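
For a 2 x 2 table there is a well-known shortcut that avoids computing the expected frequencies one by one. It is not stated in this unit and is offered only as a cross-check. With cells a, b in the first row, c, d in the second row and N = a + b + c + d, the statistic is N(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]. The function name in the sketch is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Table 6.7: visited / not visited by urban / rural.
print(round(chi_square_2x2(400, 100, 200, 300), 2))
# Rich / poor by good / bad intelligence.
print(round(chi_square_2x2(85, 75, 165, 175), 6))
```

Both results agree with the cell-by-cell sums: about 166.67 for the area and visit data, and about 0.919118 for the student data.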

6.5.2. Test of Goodness of Fit

When the observed frequencies are tested against a uniform distribution over n classes:

Degrees of freedom = n - 1
Expected value = average of the observed values

When a parameter of the fitted distribution is estimated from the data, one further
degree of freedom is lost for each parameter estimated.

(Cont. from topic Tests for Independence of Attributes)

From the nature of the data the Statistician observes that it is likely to be
close to a Poisson distribution. Therefore he fits a Poisson distribution to the
observed data.

Table 6.9
No. of complaints (X)    No. of times received (f)    f x X
0                        210                          0
1                        90                           90
2                        40                           80
3                        15                           45
4                        10                           40
Total                    365                          255

$$m = \bar{X} = \frac{\sum fX}{\sum f} = \frac{255}{365} = 0.6986 \approx 0.7$$

$$P(X = 0) = \frac{e^{-m} m^0}{0!} = \frac{e^{-0.7} (0.7)^0}{0!} = 0.49658$$

$$P(X = 1) = 0.49658 \times \frac{0.7}{1} = 0.3476$$

$$P(X = 2) = 0.3476 \times \frac{0.7}{2} = 0.1217$$

$$P(X = 3) = 0.1217 \times \frac{0.7}{3} = 0.0284$$

$$P(X = 4) = 0.0284 \times \frac{0.7}{4} = 0.0050$$

Table 6.10
Observed Value (O)    Expected Value (E)          (O-E)²/E
210                   0.49658 x 365 = 181.3       4.543
90                    0.3476 x 365 = 126.9        10.72
40                    0.1217 x 365 = 44.4         0.44
15 + 10 = 25          0.0341 x 365 = 12.46        12.61
χ² calculated                                     28.33

Note: Since the expected frequency for the last class (4 complaints) is less than 5, it is combined
with the previous class, namely 3 complaints, as per one of the conditions
for applying the χ² test. The pooled probability 0.0341 is the remainder
1 - (0.49658 + 0.3476 + 0.1217).

1. Null hypothesis H0: It is a good fit
   Alternate hypothesis HA: It is not a good fit
2. Level of significance 5% and D.O.F = (4 - 1 - 1) = 2, since four classes remain after
   pooling and the mean m is estimated from the data; χ²tab = 5.99
3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Test: χ²cal = 28.33
5. Conclusion: Since χ²cal (28.33) > χ²tab (5.99), H0 is rejected.
   In other words the data does not represent a Poisson distribution.
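
The Poisson fit can be reproduced with a short script. This is a sketch under assumptions: the function name and the `tail_from` parameter are hypothetical, m is rounded to 0.7 as in the text, and the last class is taken as "3 or more complaints" with the remaining probability. Working from unrounded expected values, individual cells differ slightly from Table 6.10 while the total comes out at about 28.33.

```python
import math

def poisson_expected(freqs, m, tail_from):
    """Observed and expected frequencies for classes 0..tail_from-1 plus one
    pooled class 'tail_from or more', using P(x) = P(x - 1) * m / x."""
    total = sum(freqs)
    probs = [math.exp(-m)]                 # P(X = 0)
    for x in range(1, tail_from):
        probs.append(probs[-1] * m / x)    # recurrence for P(X = x)
    probs.append(1.0 - sum(probs))         # pooled tail probability
    observed = list(freqs[:tail_from]) + [sum(freqs[tail_from:])]
    expected = [p * total for p in probs]
    return observed, expected

complaints = [210, 90, 40, 15, 10]  # Table 6.9 frequencies for X = 0..4
m = 0.7                             # 255 / 365 = 0.6986, rounded as in the text
observed, expected = poisson_expected(complaints, m, tail_from=3)
chi2_cal = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(observed)
print([round(e, 2) for e in expected])
print(round(chi2_cal, 2))
```

The calculated value is then compared with the tabulated 5.99 for 2 degrees of freedom.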

Example 6.3:

A personnel manager is interested in trying to determine whether absenteeism is
greater on one day of the week than on another day of the week. He has the
following record for the past years.

Table 6.11
Days of Week        Mon    Tue    Wed    Thur    Fri
No. of Absentees    66     57     54     48      75

Test whether absenteeism is uniformly distributed over the week.

Solution:

If the absenteeism is uniformly distributed over the week, then the expected number
of absentees per day should be:

E = (66 + 57 + 54 + 48 + 75) / 5 = 60

Table 6.12
Observed Value (O)    Expected Value (E)    (O-E)²    (O-E)²/E
66                    60                    36        0.6000
57                    60                    9         0.1500
54                    60                    36        0.6000
48                    60                    144       2.4000
75                    60                    225       3.7500
χ²cal                                                 7.5000

1. Null hypothesis H0: Absenteeism is uniformly distributed over the week (it is
   independent of the day of the week)
   Alternate hypothesis HA: Absenteeism depends on the day of the week

2. Level of significance 5% and D.O.F = (5 - 1) = 4; χ²tab = 9.49

3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Test: χ²cal = 7.50

5. Conclusion:
   Since χ²cal (7.5) < χ²tab (9.49), H0 is accepted.
   Absenteeism and days of the week are independent.

Example

300 digits were chosen at random and found to give the following distribution:

Digits       0     1     2     3     4     5     6     7     8     9
Frequency    18    32    28    34    42    50    17    23    27    29

Test the hypothesis that the digits were distributed in equal numbers in the table from which the data
were collected. Test at 1% level of significance.

Solution:

Step 1: Set up the null and alternative hypothesis

H0: The digits were distributed in equal numbers in the table.
v/s
H1: The digits were not distributed in equal numbers in the table.

Step 2: Level of significance = 0.01

Step 3: Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2 \text{ with } (n - 1) \text{ degrees of freedom}$$

Where,
O = observed values
E = expected values

Step 4: Calculation

E = (18 + 32 + 28 + 34 + 42 + 50 + 17 + 23 + 27 + 29) / 10 = 300 / 10 = 30

O     E     O-E    (O-E)²    (O-E)²/E
18    30    -12    144       4.8000
32    30    2      4         0.1333
28    30    -2     4         0.1333
34    30    4      16        0.5333
42    30    12     144       4.8000
50    30    20     400       13.3333
17    30    -13    169       5.6333
23    30    -7     49        1.6333
27    30    -3     9         0.3000
29    30    -1     1         0.0333
Total                        31.3333

Step 5: Inference

χ²cal = 31.33 and the tabulated value for 9 degrees of freedom at the 1% level is 21.666.
Since the tabulated value is less than the calculated value, H0 is rejected.
Therefore, the digits were not distributed in equal numbers in the table.
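
A test against equal expected frequencies follows the same pattern each time, so it can be wrapped in one helper. The sketch below is illustrative; the function name is hypothetical, and the data are the absentee counts of Table 6.11.

```python
def uniform_goodness_of_fit(observed):
    """Return (expected per class, chi-square, degrees of freedom) when every
    class is expected to hold the same share of the total."""
    n = len(observed)
    expected = sum(observed) / n
    chi2_cal = sum((o - expected) ** 2 / expected for o in observed)
    return expected, chi2_cal, n - 1

absentees = [66, 57, 54, 48, 75]  # Mon..Fri, Table 6.11
expected, chi2_cal, dof = uniform_goodness_of_fit(absentees)
print(expected, round(chi2_cal, 4), dof)
```

The returned statistic and degrees of freedom are then compared with the tabulated value at the chosen level of significance.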












Example 6.4:

According to a theory in Genetics the proportion of beans of A, B, C and D types
in a generation should be 9:3:3:1. In an experiment with 1600 beans the
frequency of beans of A, B, C and D type was observed to be 882, 313, 287 and
118 respectively. Does the result support the theory?

Solution:

1. Null hypothesis H0: The result supports the theory
   Alternate hypothesis HA: The result does not support the theory
2. Level of significance 5% and D.O.F = (4 - 1) = 3; χ²tab = 7.81
3. Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. By the null hypothesis, E = Total No. x Corresponding ratio

Table 6.13
Observed Value (O)    Expected Value (E)       (O-E)²    (O-E)²/E
882                   1600 x 9/16 = 900        324       0.36
313                   1600 x 3/16 = 300        169       0.56
287                   1600 x 3/16 = 300        169       0.56
118                   1600 x 1/16 = 100        324       3.24
χ²cal                                                    4.72

5. Test: χ²cal = 4.72
6. Conclusion: Since χ²cal (4.72) < χ²tab (7.81), H0 is accepted.
   The result supports the theory.

Example:

In a particular industry the undergraduates, graduates, and postgraduates are in the ratio 5:3:2. A firm
belonging to the industry had 1050, 550 and 400 undergraduates, graduates and postgraduates on its
pay-roll. Does the firm follow the earlier observation (ratio) about the industry? Test at 5% level of
significance.

Solution:

Step 1: Set up the null and alternative hypothesis

H0: The observations are in the ratio 5:3:2
v/s
H1: The observations are not in the ratio 5:3:2

Step 2: Level of significance = 0.05

Step 3: Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2 \text{ with } (n - 1) \text{ degrees of freedom}$$

Where,
O = observed values
E = expected values

Step 4: Calculation

Category          O       E       (O-E)    (O-E)²    (O-E)²/E
Undergraduates    1050    1000    50       2500      2.5
Graduates         550     600     -50      2500      4.166667
Postgraduates     400     400     0        0         0
Total                                                6.666667

Step 5: Inference

The tabulated χ² (2 degrees of freedom) at 5% level is 5.99 and the calculated value is 6.67.
Since the tabulated value is less than the calculated value, H0 is rejected.
Therefore, the observations are not in the ratio 5:3:2.
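
When the hypothesis specifies a ratio, the expected frequencies are the total split in that ratio. The following sketch is illustrative, with hypothetical function names; it recomputes the expected values and the statistic for the bean data and the pay-roll data. Working from unrounded terms, the bean statistic comes out near 4.73 against the 4.72 obtained from rounded entries.

```python
def expected_from_ratio(total, ratio):
    """Split `total` in the proportions given by `ratio`."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]

def ratio_goodness_of_fit(observed, ratio):
    """Return (expected frequencies, chi-square) for a hypothesised ratio."""
    expected = expected_from_ratio(sum(observed), ratio)
    chi2_cal = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2_cal

beans_expected, beans_chi2 = ratio_goodness_of_fit([882, 313, 287, 118], [9, 3, 3, 1])
staff_expected, staff_chi2 = ratio_goodness_of_fit([1050, 550, 400], [5, 3, 2])
print(beans_expected, round(beans_chi2, 4))
print(staff_expected, round(staff_chi2, 4))
```

The degrees of freedom are the number of classes less one in both cases, since no parameter is estimated from the data.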


6.5.3. Test for Specified Variance

Suppose we want to test whether the population has a given variance σ0². Then:

H0: σ² = σ0²
HA: σ² ≠ σ0²

and

$$\chi^2 = \sum \left(\frac{X - \bar{X}}{\sigma_0}\right)^2 = \frac{\sum (X - \bar{X})^2}{\sigma_0^2} = \frac{nS^2}{\sigma_0^2}$$

If the calculated value lies between K1 and K2, then H0 is accepted. The K1 and K2 values are read
from the table.



Example 6.5:

The standard deviation of heights of plants is known to be 2 cms. Eight
randomly selected plants have heights 172, 156, 154, 163, 170, 169, 170 and
164 cms. Test whether the sample standard deviation differs significantly from this value.

Solution:

Table 6.14
X        d = X - 160    d²
172      12             144
156      -4             16
154      -6             36
163      3              9
170      10             100
169      9              81
170      10             100
164      4              16
Total    38             502

$$S^2 = \frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2 = \frac{502}{8} - \left(\frac{38}{8}\right)^2 = 40.1875, \qquad nS^2 = 321.5$$

1. Null hypothesis H0: σ² = σ0²
   Alternate hypothesis HA: σ² ≠ σ0²
2. Level of significance 5% and D.O.F = (8 - 1) = 7; K1 = 1.69, K2 = 16.01
3. Test statistic:

$$\chi^2 = \frac{nS^2}{\sigma_0^2}$$

4. Test: χ²cal = 321.5 / 4 = 80.375
5. Conclusion: Since χ²cal lies outside K1 and K2, H0 is rejected.
   The sample S.D. differs significantly.
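
The variance test can be sketched in code as follows. The function names are hypothetical. The sum of squares is taken about the sample mean directly, which gives the same nS² as the assumed-mean shortcut d = X - 160 used in Table 6.14; K1 and K2 are the tabulated values quoted in the example.

```python
def variance_chi_square(xs, sigma0):
    """chi-square = n * S^2 / sigma0^2, where n * S^2 = sum((x - mean)^2)."""
    n = len(xs)
    mean = sum(xs) / n
    n_s2 = sum((x - mean) ** 2 for x in xs)
    return n_s2 / sigma0 ** 2

def two_sided_decision(chi2_cal, k1, k2):
    """H0 is accepted only when the calculated value lies between K1 and K2."""
    return "accept H0" if k1 <= chi2_cal <= k2 else "reject H0"

heights = [172, 156, 154, 163, 170, 169, 170, 164]
chi2_cal = variance_chi_square(heights, sigma0=2)
print(chi2_cal, two_sided_decision(chi2_cal, k1=1.69, k2=16.01))
```

With the heights above the sum of squares about the mean is 321.5, so the statistic is 321.5 / 4 = 80.375, well above K2.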


6.6. Summary

The χ² test is a non-parametric test. It is used to test the independence of attributes, goodness of
fit and specified variance. It assumes that samples are drawn at random and that external forces,
if any, act on them in equal magnitude. The sample size should be very large. None of the
theoretical expected values calculated should be less than five.

6.7. Reference

6.7.1. Recommended Textbooks

Table 6.15
Sl. No    Text Books/Reference Books                                                  Year    Publisher           Edition
1         Statistics for Management by Levin, Rubin (Chapter 11; pg. 567-584)         2008    Prentice Hall       7th edition
2         Statistics for Managers Using Excel (Chapter 11; pg. 459-476)               2005    Prentice Hall       4th edition
3         Complete Business Statistics by Amir D. Aczel and Jayavel Sounderpandian    2006    Tata McGraw         6th edition
4         Basic Econometrics by Damodar Gujarati                                      2007    Tata McGraw Hill    4th edition

6.7.2. Web References

1. www.statistics.com

2. http://onlinestatbook.com/

3. http://www.statsoft.com/textbook/stathome.html

4. http://home.ubalt.edu/ntsbarsh/Business-stat/

5. http://home.kku.ac.th/wichuda/Stat1/