Chi Squared Tests

Mathematics Term 3
STPM Chapter 6 Chi-squared Tests
6.1 The Chi-squared Distribution
Each hypothesis test discussed in the last chapter involves a null hypothesis stated in terms of a population parameter and a test statistic having a known probability distribution. Such tests are called parametric tests. However, not all hypotheses can be stated in terms of population parameters. In this chapter, we shall discuss a non-parametric test called the chi-squared test, which is performed using the chi-squared distribution.
Let $x_1, x_2, \ldots, x_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then the sampling distribution of the statistic
$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}$$
is called the chi-squared distribution with $n$ degrees of freedom. The probability density function is given by
$$f(\chi^2_\nu) = c\,(\chi^2_\nu)^{\frac{\nu}{2} - 1}\, e^{-\frac{\chi^2_\nu}{2}}$$
where $c$ is a constant, $\chi^2_\nu$ is the chi-squared statistic with $\nu$ degrees of freedom and $e$ is the base of the natural logarithm. The constant $c$ is a normalising factor chosen so that the area under the chi-squared curve is equal to one.
Examples of chi-squared distributions with various degrees of freedom are shown in the figure below. The curve for degrees of freedom $\nu = n - 1 = 3 - 1 = 2$ represents the distribution of chi-squared values computed from all possible samples of size 3. Likewise, the curve for 10 degrees of freedom corresponds to the distribution for samples of size 11.
The chi-squared distribution has the following properties:
• The values of $\chi^2$ cannot be negative.
• The curve is not symmetric.
• All chi-squared curves are positively skewed.
• As $\nu$ gets larger, the degree of skewness decreases.
• The mean of the distribution is equal to the number of degrees of freedom: $\mu = \nu$.
• The variance is equal to two times the number of degrees of freedom: $\sigma^2 = 2\nu$.
• When the degrees of freedom are greater than or equal to 2, the maximum value of the density occurs when $\chi^2_\nu = \nu - 2$.
• As the degrees of freedom increase, the chi-squared curve approaches a normal distribution.
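The mean and variance properties can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the population values `mu` and `sigma`, the seed and the number of repetitions are all arbitrary assumptions chosen for illustration. It draws many samples of size 6 from a normal population, computes the statistic from the definition, and compares the simulated mean and variance with $\nu$ and $2\nu$.

```python
import random
import statistics

def chi_squared_statistic(sample, mu, sigma):
    """Sum of squared standardised deviations, as in the definition of the statistic."""
    return sum(((x - mu) / sigma) ** 2 for x in sample)

rng = random.Random(2024)     # arbitrary fixed seed so the run is repeatable
mu, sigma = 50.0, 4.0         # hypothetical population mean and standard deviation
n = 6                         # sample size, hence 6 degrees of freedom

values = []
for _ in range(20000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    values.append(chi_squared_statistic(sample, mu, sigma))

sample_mean = statistics.fmean(values)
sample_variance = statistics.pvariance(values)

print(min(values) >= 0)            # chi-squared values are never negative
print(round(sample_mean, 2))       # should be close to n = 6
print(round(sample_variance, 2))   # should be close to 2n = 12
```

Because the values are simulated, the printed mean and variance will only be approximately 6 and 12.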
The area under the curve between 0 and a particular chi-squared value is the cumulative probability associated with that chi-squared value. For example, the figure below is a graph of the chi-squared distribution with 6 degrees of freedom. The shaded area represents the cumulative probability associated with a chi-squared statistic equal to $x$; that is, it is the probability that the value of a chi-squared statistic will fall between 0 and $x$.
The $\chi^2$-distribution table gives values of $\chi^2$ for various values of $\alpha$ and $\nu$, where $\alpha$ is the area under the curve to the left of the tabulated value and $\nu$ is the number of degrees of freedom. The areas $\alpha$ are the column headings, the degrees of freedom $\nu$ are given in the left column, and the table entries are the $\chi^2$ values. Hence the $\chi^2$ value with 6 degrees of freedom that leaves an area of 0.05 to the left is $\chi^2 = 1.635$. Owing to the lack of symmetry, we must also use the table to find $\chi^2 = 12.592$ for $\alpha = 0.95$.
Critical values for the $\chi^2$-distribution

If $X$ has a $\chi^2$-distribution with $\nu$ degrees of freedom, then for each pair of values of $p$ and $\nu$, the tabulated value of $x$ is such that
$$P(X < x) = p.$$

 ν \ p    0.01       0.025      0.05      0.9     0.95    0.975   0.99    0.995   0.999
  1     0.0001571  0.0009821  0.003932   2.706   3.841   5.024   6.635   7.879   10.83
  2     0.02010    0.05064    0.1026     4.605   5.991   7.378   9.210   10.60   13.82
  3     0.1148     0.2158     0.3518     6.251   7.815   9.348   11.34   12.84   16.27
  4     0.2971     0.4844     0.7107     7.779   9.488   11.14   13.28   14.86   18.47
  5     0.5543     0.8312     1.145      9.236   11.07   12.83   15.09   16.75   20.51
  6     0.8721     1.237      1.635      10.64   12.59   14.45   16.81   18.55   22.46
  7     1.239      1.690      2.167      12.02   14.07   16.01   18.48   20.28   24.32
  8     1.647      2.180      2.733      13.36   15.51   17.53   20.09   21.95   26.12
  9     2.088      2.700      3.325      14.68   16.92   19.02   21.67   23.59   27.88
 10     2.558      3.247      3.940      15.99   18.31   20.48   23.21   25.19   29.59
 11     3.053      3.816      4.575      17.28   19.68   21.92   24.73   26.76   31.26
 12     3.571      4.404      5.226      18.55   21.03   23.34   26.22   28.30   32.91
 13     4.107      5.009      5.892      19.81   22.36   24.74   27.69   29.82   34.53
 14     4.660      5.629      6.571      21.06   23.68   26.12   29.14   31.32   36.12
 15     5.229      6.262      7.261      22.31   25.00   27.49   30.58   32.80   37.70
 16     5.812      6.908      7.962      23.54   26.30   28.85   32.00   34.27   39.25
 17     6.408      7.564      8.672      24.77   27.59   30.19   33.41   35.72   40.79
 18     7.015      8.231      9.390      25.99   28.87   31.53   34.81   37.16   42.31
 19     7.633      8.907      10.12      27.20   30.14   32.85   36.19   38.58   43.82
 20     8.260      9.591      10.85      28.41   31.41   34.17   37.57   40.00   45.31
 21     8.897      10.28      11.59      29.62   32.67   35.48   38.93   41.40   46.80
 22     9.542      10.98      12.34      30.81   33.92   36.78   40.29   42.80   48.27
 23     10.20      11.69      13.09      32.01   35.17   38.08   41.64   44.18   49.73
 24     10.86      12.40      13.85      33.20   36.42   39.36   42.98   45.56   51.18
 25     11.52      13.12      14.61      34.38   37.65   40.65   44.31   46.93   52.62
 26     12.20      13.84      15.38      35.56   38.89   41.92   45.64   48.29   54.05
 27     12.88      14.57      16.15      36.74   40.11   43.19   46.96   49.65   55.48
 28     13.56      15.31      16.93      37.92   41.34   44.46   48.28   50.99   56.89
 29     14.26      16.05      17.71      39.09   42.56   45.72   49.59   52.34   58.30
 30     14.95      16.79      18.49      40.26   43.77   46.98   50.89   53.67   59.70
Example 1
The curve of the chi-squared distribution with $\nu = 3$ degrees of freedom is shown below. Find the critical value of $\chi^2$ such that the area in the shaded region is 0.025.
Solution
An area of 0.025 in the right tail leaves an area of 0.975 to the left. Proceed down the left column of the table, headed $\nu$ (degrees of freedom), to $\nu = 3$. Then move to the right until the column labelled 0.975 is found. The result is 9.348. Thus we have $P(\chi^2 > 9.348) = 0.025$.
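Table values such as the one in Example 1 can be checked without a table. The sketch below is an illustration only and is not the method used in the text: it evaluates the cumulative probability $P(X < x)$ from the standard power series of the regularised lower incomplete gamma function, using only the Python standard library. The function name `chi2_cdf` is a hypothetical helper introduced here.

```python
import math

def chi2_cdf(x, df):
    """P(X < x) for a chi-squared variable with df degrees of freedom.

    Uses the power series of the regularised lower incomplete gamma
    function P(a, z) with a = df / 2 and z = x / 2.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a          # first term of the series: 1 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)  # next term: previous term times z / (a + n)
        total += term
    # Multiply the series by z^a * e^(-z) / Gamma(a), computed in log form.
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

right_tail = 1.0 - chi2_cdf(9.348, 3)   # Example 1: area to the right of 9.348
print(round(right_tail, 3))

print(round(chi2_cdf(1.635, 6), 3))     # left-tail area for 6 degrees of freedom
print(round(chi2_cdf(12.592, 6), 3))
```

The series converges for every positive $x$, although more terms are needed as $x$ grows.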
Example 2
A factory produces a particular type of drill. On average, the useful operating life is 5.5 hours, with a standard deviation of 0.47 hour. The quality control department runs a test by randomly selecting six drills. The standard deviation of the selected drills is 0.61 hour. Determine the chi-squared statistic represented by this test.
Solution
Given $\sigma = 0.47$ hour, $s = 0.61$ hour, and the number of sample observations $n = 6$, the chi-squared statistic is
$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{6(0.61^2)}{0.47^2} = 10.107$$
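The arithmetic in Example 2 can be reproduced in a few lines. This is a small sketch under the formula used in the solution, $\chi^2 = nS^2/\sigma^2$; the function name is a hypothetical helper introduced here.

```python
def variance_chi_squared(n, sample_sd, population_sd):
    """Chi-squared statistic n * S^2 / sigma^2 from the worked solution."""
    return n * sample_sd ** 2 / population_sd ** 2

chi2 = variance_chi_squared(n=6, sample_sd=0.61, population_sd=0.47)
print(round(chi2, 3))
```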
Exercise 6.1
1. Find the 95th percentile of the chi-squared distribution with 9 degrees of freedom.
2. Using the chi-squared distribution table, find
(a) (b) (c)
< 18.4s), P(X1, > 1e.81), P(X'r, ) 32.67).

P(x:,
3. Given $\nu$ and $\alpha$, find the critical value(s) for each case.
(a)
(b)
(c)
4.
Using the chi-squared distribution table, find the value of k such that
(a) (b) (c)

5.
< k) = 0.0t P(x1, > k) = o.es P(k < x2s < 9.39) = o.o4
P(X1,
(a) Find the mean and the standard deviation of a chi-squared distribution with 8 degrees of freedom.
(b) Which one of the following chi-squared distributions looks the most like a normal distribution?
(i) A chi-squared distribution with 1 degree of freedom
(ii) A chi-squared distribution with 2 degrees of freedom
(iii) A chi-squared distribution with 5 degrees of freedom
(iv) A chi-squared distribution with 10 degrees of freedom
6. A random sample of 30 observations from a normal population with variance $\sigma^2 = 8.3$ is found to have a sample variance $s^2 = 11.72$. Determine the chi-squared statistic from this experiment.
The chi-squared test can be used to test how good a fit there is between observed frequencies and expected frequencies. Observed frequencies are the actual frequencies observed from a random sample. Expected frequencies are theoretical frequencies based on a distribution under the null hypothesis, which is presumed to be true until statistical evidence indicates otherwise. As an example, what would we expect from flipping a coin 12 times? By chance alone, we would expect six heads and six tails. If we observe one head and eleven tails in this experiment, is this outcome attributable merely to chance, or is it due to the coin being biased? The chi-squared test can help to provide an answer.
Before discussing the chi-squared test, we have several assumptions to make. First, frequency data is used to represent the actual number of elements in each category. Second, categories are mutually exclusive; that is, whatever is being tallied can only be in one cell and cannot overlap. Third, categorical data is a grouping of data according to similar characteristics in a way that shows the frequencies of each category.
Let us look at an example to see how we use the chi-squared test to determine whether the frequencies observed across the categories differ significantly from what is expected theoretically. Consider the tossing of a six-sided dice. We have the null hypothesis that the dice is fair, which is equivalent to the hypothesis that the distribution of outcomes is uniform. Suppose that the dice is thrown 60 times and each outcome is recorded. The observed frequency $o_i$ for each face of the dice is shown in the table below:
Faces                        1    2    3    4    5    6
Observed frequency $o_i$     8   12   14    7    9   10
The chi-squared test will compare the observed frequencies $o_i$ with the corresponding expected frequencies $e_i$. The table above lists the observed frequencies, and the expected frequencies need to be determined.
To calculate the expected frequency for each outcome, we make use of the hypothesis that the outcome of a fair dice is uniformly distributed. Since the probability of each outcome is one-sixth and there are a total of 60 rolls of the dice, we have
$$\text{Expected frequency } e_i = \frac{1}{6} \times 60 = 10$$
Note that the expected frequencies are anticipated only in a theoretical sense. It is not practical to expect the observed frequencies to match the expected frequencies perfectly. The table below lists the observed and expected frequencies for each category:
Faces                        1    2    3    4    5    6
Observed frequency $o_i$     8   12   14    7    9   10
Expected frequency $e_i$    10   10   10   10   10   10
Now, we need to decide whether the observed frequencies are reasonably close to the expected frequencies or really different from them. The hypothesis to be tested is how well the observed frequencies fit a given pattern or a theoretical distribution. The test is called a goodness-of-fit test.
A useful measure for the overall discrepancy between the observed and expected frequencies is the chi-squared test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
where $\chi^2$ is a value of a random variable whose sampling distribution is approximated very closely by the chi-squared distribution with $k - 1$ degrees of freedom, and $k$ is the number of categories. The symbols $o_i$ and $e_i$ represent the observed and expected frequencies respectively for the $i$th category.
For the chi-squared goodness-of-fit test, the number of degrees of freedom shows the number of independent free choices which can be made in allocating values to the expected frequencies. In this example of tossing a dice, there are six expected frequencies (one for each face, that is, 1 to 6). Only five of the expected frequencies can vary independently, and the sixth one must take whatever value is required to fulfil the constraint of total frequency. Thus, the degrees of freedom $\nu$ = number of categories − number of constraints. Here there are six categories and one constraint, so $\nu = 6 - 1 = 5$.
To calculate the chi-squared test statistic, we first subtract the expected frequency $e_i$ from the observed frequency $o_i$. Then we square the difference and divide the squared difference by the expected frequency $e_i$, before finally adding the quotients. This is done in the table below:
Faces   $o_i$   $e_i$   $(o_i - e_i)$   $(o_i - e_i)^2$   $(o_i - e_i)^2 / e_i$
  1       8      10         -2                4                  0.4
  2      12      10          2                4                  0.4
  3      14      10          4               16                  1.6
  4       7      10         -3                9                  0.9
  5       9      10         -1                1                  0.1
  6      10      10          0                0                  0
                                                        $\chi^2$ = 3.4
This means the value of $\chi^2$ with 5 degrees of freedom is 3.4. In the goodness-of-fit test, if the observed frequencies are the same as the expected frequencies, then $\chi^2 = 0$. Thus, if the $\chi^2$ value is small, there is a high degree of compatibility between the expected and observed frequencies, indicating a good fit. If the $\chi^2$ value is large, there is a low degree of matching between the two sets of frequencies and the fit is poor. This also implies that the critical region falls in the right tail of the chi-squared distribution. At the 10% significance level, we find the critical value $\chi^2 = 9.236$ for 5 degrees of freedom using the $\chi^2$ table. Since the calculated value $\chi^2 = 3.4$ is less than 9.236, the data support the hypothesis that the outcomes of the dice are uniformly distributed. In other words, there is no significant evidence that the dice is unfair.
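The dice calculation can be sketched in code as follows. The observed frequencies are those of the worked table, and the critical value 9.236 is read from the $\chi^2$ table rather than computed; the variable names are illustrative only.

```python
observed = [8, 12, 14, 7, 9, 10]          # faces 1 to 6, 60 throws in total
total = sum(observed)
expected = [total / 6] * 6                # fair dice: each face expected 10 times

# Sum of (o - e)^2 / e over the six categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                    # six categories, one constraint
critical_value = 9.236                    # table value for p = 0.9, 5 degrees of freedom
reject_h0 = chi2 > critical_value

print(round(chi2, 1), df, reject_h0)
```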
Note: To perform a chi-squared test, the expected frequency for each category should be at least 5. This restriction may require combining adjacent categories, resulting in a reduction of the number of degrees of freedom.
Example 3
A quality supervisor at a glass manufacturing factory inspects a random sample of 60 sheets of glass to check for any minor defects. The number of flaws in each glass sheet is recorded. The results are as follows:
Number of flaws       0    1    2    3
Observed frequency   32   15    9    4
Use a 5% significance level to test the hypothesis that these data follow a Poisson distribution.
A test procedure is as follows.
Step 1: State the hypotheses
H0: The number of flaws follows a Poisson distribution
H1: The number of flaws does not follow a Poisson distribution
Step 2: Specify the significance level
Here $\alpha = 0.05$
Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared goodness-of-fit test to determine whether observed sample frequencies differ significantly from expected frequencies specified in the null hypothesis.
The mean of the presumed Poisson distribution is unknown, so it must be estimated from the data by the sample mean,
$$\lambda = \frac{\sum ox}{\sum o} = \frac{(32)(0) + (15)(1) + (9)(2) + (4)(3)}{32 + 15 + 9 + 4} = \frac{45}{60} = 0.75$$
Hence, with $\lambda = 0.75$,
$$P(X = x) = \frac{e^{-0.75}\, 0.75^{x}}{x!}, \quad x = 0, 1, 2, 3, \ldots$$
which gives the probability associated with each class. The corresponding expected frequency is obtained by multiplying the appropriate Poisson probability by the sample size $n = 60$.
$x_i$            0       1       2       3 or more
$P(X = x_i)$     0.472   0.354   0.133   0.041
$e_i$            28.32   21.24   7.98    2.46
If an expected frequency is less than 5, two or more classes can be combined. In the above situation, the expected frequency in the last class is less than 5, so we combine the last two classes to get
Number of flaws       0       1       2 or more
Observed frequency    32      15      13
Expected frequency    28.32   21.24   10.44
The chi-squared value can now be calculated:
$$\chi^2 = \sum \frac{(o - e)^2}{e} = \frac{(32 - 28.32)^2}{28.32} + \frac{(15 - 21.24)^2}{21.24} + \frac{(13 - 10.44)^2}{10.44} = 2.94$$
Step 4: Determine the critical region
Since both the total frequency and the mean of the Poisson distribution are taken from the observed data, the number of degrees of freedom is $k - 2$. Here we have 3 classes, so the chi-squared statistic has $3 - 2 = 1$ degree of freedom. Using a significance level of 0.05, the critical value of $\chi^2$ with 1 degree of freedom from the chi-squared distribution table is 3.841.
Step 5: Make a decision
As $\chi^2 = 2.94 < 3.841$, we conclude that there is no real evidence to suggest that the data do not follow a Poisson distribution.
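The steps of Example 3 can be sketched in code. This is an illustration under the example's own data. It uses unrounded Poisson probabilities, so the statistic comes out slightly different from the 2.94 obtained above with probabilities rounded to three decimal places. The pooling rule (merge the last class into its neighbour while its expected frequency is below 5) is a simple assumption that suits this example.

```python
import math

flaws = [0, 1, 2, 3]
observed = [32, 15, 9, 4]
n = sum(observed)

# Estimate the Poisson mean by the sample mean
lam = sum(x * o for x, o in zip(flaws, observed)) / n

# Poisson probabilities for 0, 1, 2 flaws; the last class is "3 or more"
probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in flaws[:-1]]
probs.append(1.0 - sum(probs))
expected = [n * p for p in probs]

# Combine the last class with its neighbour while its expected frequency is below 5
obs_pooled, exp_pooled = observed[:], expected[:]
while len(exp_pooled) > 1 and exp_pooled[-1] < 5:
    last_e = exp_pooled.pop()
    last_o = obs_pooled.pop()
    exp_pooled[-1] += last_e
    obs_pooled[-1] += last_o

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = len(obs_pooled) - 2          # constraints: total frequency and estimated mean
critical_value = 3.841            # table value for p = 0.95, 1 degree of freedom

print(lam, obs_pooled, [round(e, 2) for e in exp_pooled])
print(round(chi2, 2), df, chi2 > critical_value)
```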
Example 4
The lengths of a random sample of 50 telephone calls are summarised in the table below. The sample mean is $\bar{x} = 14$ minutes and the sample standard deviation is $s = 6.4$ minutes. Determine whether there is significant evidence, at the 5% significance level, to reject the null hypothesis that the call length has a normal distribution.
Call length (in minutes)   0-5   5-10   10-15   15-20   20-25   25-30
Frequency                   4     9      16      13       5       3
We proceed with the steps of a test procedure as follows:
Step 1: State the hypotheses
H0: The telephone call lengths follow a normal distribution
H1: The telephone call lengths do not follow a normal distribution
Step 2: Specify the significance level
Here $\alpha = 0.05$
Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared goodness-of-fit test to determine whether observed sample frequencies differ significantly from expected frequencies specified in the null hypothesis.
The distribution of call lengths may be approximated by the normal distribution. The sample mean and sample standard deviation will be used for $\mu$ and $\sigma$ in calculating $z$ values corresponding to the class boundaries. The expected frequency for each class (category) listed in the given table can be obtained from a normal curve. The $z$ values corresponding to the boundaries of the second class are
$$z_1 = \frac{5 - 14}{6.4} = -1.406, \qquad z_2 = \frac{10 - 14}{6.4} = -0.625$$
From the normal table, the area between $z_1 = -1.406$ and $z_2 = -0.625$ is
$$P(-1.406 < Z < -0.625) = P(Z < -0.625) - P(Z < -1.406) = 0.266 - 0.08 = 0.186$$
Thus, the expected frequency for the second class is $e_2 = 0.186 \times 50 = 9.3$.
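The expected frequency for each class can also be computed directly instead of being read from a normal table. The sketch below assumes the sample mean 14, standard deviation 6.4 and sample size 50 used in the calculation above, and evaluates the standard normal cumulative probability with `math.erf`. The open-ended first and last classes take the whole tail, as the solution describes.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, sd, n = 14.0, 6.4, 50
boundaries = [5, 10, 15, 20, 25]      # interior class boundaries in minutes

# Cumulative probabilities at each boundary, with 0 and 1 added for the two tails
cumulative = [0.0] + [normal_cdf((b - mean) / sd) for b in boundaries] + [1.0]

# Expected frequency of a class = n * (area between its two boundaries)
expected = [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]

print([round(e, 1) for e in expected])
```

The second entry reproduces the value 9.3 found above. The other entries may differ slightly in the last digit from table-based values because no intermediate rounding is applied.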
The expected frequency for the first class interval is obtained by using the total area under the normal curve to the left of the boundary 5. For the last class interval, we use the total area to the right of the boundary 25. All other expected frequencies can be found by the method described above for the second class. The calculations needed to find the expected frequency in each class are summarised in the table below. Note that we have combined adjacent classes in the table where the expected frequencies are less than 5. As a result, the total number of classes is reduced from 6 to 4.
Class boundaries   $o_i$   $e_i$
Below 10            13     13.3
10-15               16     14.8
15-20               13     13.2
Above 20             8      8.8
The following table shows the detailed calculations for the chi-squared value.
Class boundaries   $o_i$   $e_i$   $(o_i - e_i)$   $(o_i - e_i)^2$   $(o_i - e_i)^2 / e_i$
Below 10            13     13.3       -0.3             0.09              0.0068
10-15               16     14.8        1.2             1.44              0.0973
15-20               13     13.2       -0.2             0.04              0.0030
Above 20             8      8.8       -0.8             0.64              0.0727
                                                              $\chi^2$ = 0.180
Step 4: Determine the critical region
Altogether three constraints apply, because the total frequency, the sample mean and the sample standard deviation have all been taken from the sample data. The number of degrees of freedom is therefore equal to $k - 3 = 4 - 3 = 1$. Using a significance level of 0.05, the critical value of chi-squared with 1 degree of freedom is 3.841.
Step 5: Make a decision
As $\chi^2 = 0.180 < 3.841$, we have no reason to reject the null hypothesis and conclude that the normal distribution offers a good fit for the distribution of telephone call lengths.
Exercise 6.2
1. Assume that a chi-squared goodness-of-fit test is conducted. Determine the critical value of the chi-squared test statistic for each of the following cases.
(a) Number of categories = 7, $\alpha = 0.01$
(b) Number of categories = 10, $\alpha = 0.10$
2. A random sample of 500 observations is obtained and distributed into 4 categories as follows:
Category    1     2     3     4
$x_i$      49   263   146    42
Use $\alpha = 0.05$ to test the null hypothesis H0: $p_1 = 0.10$, $p_2 = 0.50$, $p_3 = 0.30$, $p_4 = 0.10$.
3. Three coins are tossed 150 times, and the observed frequencies of 0, 1, 2 and 3 heads per toss are 14, 43, 67 and 26 times respectively. Use a 5% significance level to test whether the three coins are balanced.
4. An experiment is to draw a card from a regular deck of 52 cards that has been thoroughly shuffled, and to record whether it is a spade, heart, diamond, or club. This process is repeated 40 times, each time replacing the card just drawn. After 40 trials, 9 spades, 13 hearts, 11 diamonds and 7 clubs are obtained. Test the hypothesis that the deck is honest at the 10% significance level.
5. Each package of beans sold in the supermarket is supposed to mix red beans, mung beans, black beans and black-eyed beans in the ratio of 5:3:1:1. A random sample of 400 mixed beans selected from these packages is found to have 210 red beans, 124 mung beans, 30 black beans and 36 black-eyed beans. Test the hypothesis that the package contains the mixed beans in the ratio 5:3:1:1 at the 0.05 significance level.
6. A boy buys a bag of 100 jelly beans. This bag has 5 different colours of jelly beans in it. Assume all five colours are equally likely to be put in the bag. The boy is curious about the colour distribution and opens the bag. He finds out that he has 17 brown, 24 yellow, 10 red, 31 green, and 18 white. Test the hypothesis that the colours of the jelly beans occur with equal frequency at a significance level of 5%.
7. The number of road accidents per week at a junction is monitored by the public traffic department. The table below shows the frequency of accidents per week in 60 weeks.
Number of accidents    0    1    2    3
Observed frequency    28   15   12    5
(a) Determine the mean number of accidents per week.
(b) Test the hypothesis that the data follow a Poisson distribution at the 5% significance level.
8. The following frequency distribution table represents the number of days during a year that a total of 50 employees at a company are absent from work due to illness. It is thought that the data follow a normal distribution with population mean $\mu = 7$ and standard deviation $\sigma = 3$.
Number of days absent   0-3   3-6   6-9   9-12   12-15
Number of employees      4    13    24     7      2
Test the goodness of fit between the observed class frequencies and the corresponding expected frequencies of a normal distribution at the 5% significance level.
9. A paper shop has several retail stores in a city. The following table shows the number of boxes shipped per day for the last 100 days.
Number of packages shipped   0-5   5-10   10-15   15-20   20-25   25-30   30-35
Number of days                5     13     28      23      18      10       3
(a) Calculate the sample mean and sample standard deviation of the number of packages shipped per day.
(b) Use a 5% significance level to test the goodness of fit between the observed class frequencies and the corresponding expected frequencies of a normal distribution.
10. The table below shows the number of rain days in January for the years from 1953 to 2004.
Number of rain days   0    1    2    3    4    5
Observed frequency    9    7   14   15    6    1
(a) Find the mean number of rain days.
(b) Test the hypothesis that the recorded data may be fitted by the Poisson distribution at the 10% significance level.
11. A recent study reports the number of hours of personal computer usage per week for a sample of 60 persons. Excluded from the study are people who work in the office and use the computer as part of their work.
1.1   4.3   6.3   2.4   4.3
6.7   4.5   2.1   2.4   9.7
2.2   9.3   2.7   4.7   7.7
2.6   5.3   0.4   1.7   5.2
9.8   6.3   5.1   2.0   1.7
6.4   8.8   5.6   6.7   8.5
4.9   6.5   5.4   3.7   4.2
5.2   0.6   4.8   3.3   5.5
4.5   5.2   2.1   1.1   9.2
9.3   6.6  10.1   2.7   8.5
7.9   9.3   1.3   6.7   6.0
4.6   4.3   5.6   6.5   8.1
(a) Organise the data into a frequency distribution.
(b) Compute the sample mean and sample standard deviation of the number of hours of computer usage per week.
(c) It is thought that the data follow a normal distribution. Test the hypothesis at the 5% significance level.
When two attributes (variables) are observed for each element of a random sample, the data can be simultaneously classified with respect to these attributes in a two-way classification table called a contingency table. We can then determine whether there is a significant association between the two attributes.
Suppose we take a random sample of 200 persons and classify them based on gender as well as whether these persons own handphones. The observed frequencies are presented in the following 2 × 2 contingency table.
           Own handphone (yes)   Own handphone (no)   Total
Male               70                    60            130
Female             30                    40             70
Total             100                   100            200
A contingency table can be of any size. In general, a contingency table with $r$ rows and $c$ columns is denoted as an $r \times c$ table. The row and column totals in the above table are called marginal frequencies. It is common practice to refer to each possible outcome of an experiment as a cell. Hence in our example we have four cells.
Let us test the hypothesis of independence between a person's gender and a person's possession of a handphone. To perform this test, we first calculate the expected frequencies for each of the four cells of the above 2 × 2 contingency table under the assumption that the hypothesis is true.
Let $M$ represent the event that an individual selected from the sample is male. Let $Y$ represent the event that an individual selected owns a handphone.
Since $M$ and $Y$ are independent events, $P(M \cap Y) = P(M)P(Y)$.
But $P(M \cap Y) = \frac{e_{11}}{200}$, $P(M) = \frac{130}{200}$, and $P(Y) = \frac{100}{200}$. Thus, we have
$$\frac{e_{11}}{200} = \left(\frac{130}{200}\right)\left(\frac{100}{200}\right)$$
which we can rearrange as
$$e_{11} = \frac{130 \times 100}{200} = \frac{(\text{First row total})(\text{First column total})}{\text{Total sample size}}$$
where $e_{11}$ is the expected frequency for the cell in row 1 and column 1.
The general formula for obtaining the expected frequency of any cell is given by
$$\text{Expected frequency} = \frac{(\text{Row total})(\text{Column total})}{\text{Total sample size}}$$
The expected frequency for each cell is recorded in parentheses beside the actual observed value in the table shown below.
           Own handphone (yes)   Own handphone (no)   Total
Male             70 (65)               60 (65)         130
Female           30 (35)               40 (35)          70
Total             100                   100            200
Note that the expected frequencies in any row or column add up to the appropriate marginal total. We need to calculate only the one expected frequency in the top row of the table and then find the others by subtraction. The number of degrees of freedom associated with the chi-squared test used here is equal to the number of cell frequencies that may be filled in freely when we are given the marginal totals and the grand total, and in this illustration that number is 1. A simple formula providing the correct number of degrees of freedom is
$$\nu = (r - 1)(c - 1).$$
Hence, for our example, $\nu = (2 - 1)(2 - 1) = 1$ degree of freedom.
We want to measure how much the observed frequencies differ collectively from their corresponding expected frequencies. We do this with the chi-squared test statistic
$$\chi^2 = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$$
where the summation extends over all the cells in the $r \times c$ contingency table. We have
$$\chi^2 = \frac{(70 - 65)^2}{65} + \frac{(60 - 65)^2}{65} + \frac{(30 - 35)^2}{35} + \frac{(40 - 35)^2}{35} = 2.1978$$
Using a chi-squared table, we can see that for $\nu = 1$, the critical value for the 5% significance level is $\chi^2 = 3.841$. Since the calculated value of $\chi^2 = 2.1978$ does not fall within the critical region, we do not reject the hypothesis that there is no relationship between a person's gender and the person's possession of a handphone.
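The handphone calculation can be sketched as follows. As noted above, only one expected frequency needs the formula in a 2 × 2 table; the remaining three follow by subtraction from the marginal totals. The variable names are illustrative, and the critical value 3.841 is taken from the table.

```python
observed = [[70, 60],     # Male:   owns handphone (yes), (no)
            [30, 40]]     # Female: owns handphone (yes), (no)

row_totals = [sum(row) for row in observed]           # [130, 70]
col_totals = [sum(col) for col in zip(*observed)]     # [100, 100]
n = sum(row_totals)                                   # 200

# One cell from the formula, the other three by subtraction
e11 = row_totals[0] * col_totals[0] / n
e12 = row_totals[0] - e11
e21 = col_totals[0] - e11
e22 = row_totals[1] - e21
expected = [[e11, e12], [e21, e22]]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

critical_value = 3.841    # table value for p = 0.95, 1 degree of freedom
print(expected, round(chi2, 4), chi2 > critical_value)
```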
Example 5
The following data show the attitude of housewives in various parts of the country to a certain brand of detergent.
Attitude       North   Central   South
Like             46       21       31
Indifferent      25       58       35
Dislike          16       37       42
Test the hypothesis that the attitude to the newly introduced detergent is independent of geographical area of residence at the 1% significance level.
The given table is arranged to include the row and column totals.
Attitude       North   Central   South   Total
Like             46       21       31      98
Indifferent      25       58       35     118
Dislike          16       37       42      95
Total            87      116      108     311
Step 1: State the hypotheses
H0: There is no association between attitude and location
H1: There is an association between attitude and location
Step 2: Specify the significance level
Given $\alpha = 0.01$
Step 3: Select the appropriate test statistic and calculate its value
Use the chi-squared test for independence to determine whether there is any significant association between the two categorical variables.
As with the goodness-of-fit test described earlier, the key idea of the chi-squared test for independence is a comparison of observed and expected frequencies. The expected frequency for each cell of the table can be generated using the following formula:
$$\text{Expected frequency} = \frac{(\text{Row total})(\text{Column total})}{\text{Total sample size}}$$
In fact, for a 3 × 3 contingency table, only four expected values in the top two rows of the table are calculated and the remaining five expected values are found by subtraction. For example, the expected frequency for attitude "like" and location "north" is
$$\frac{98 \times 87}{311} = 27.41.$$
In this way, the table of both observed and expected frequencies is as shown below.
Attitude        North         Central        South        Total
Like          46 (27.41)    21 (36.55)    31 (34.04)       98
Indifferent   25 (33.01)    58 (44.01)    35 (40.98)      118
Dislike       16 (26.58)    37 (35.44)    42 (32.98)       95
Total            87            116           108          311
The number of degrees of freedom is $\nu = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4$.
The chi-squared test statistic is
$$\chi^2 = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$$
$$= \frac{(46 - 27.41)^2}{27.41} + \frac{(21 - 36.55)^2}{36.55} + \frac{(31 - 34.04)^2}{34.04} + \frac{(25 - 33.01)^2}{33.01} + \frac{(58 - 44.01)^2}{44.01} + \frac{(35 - 40.98)^2}{40.98} + \frac{(16 - 26.58)^2}{26.58} + \frac{(37 - 35.44)^2}{35.44} + \frac{(42 - 32.98)^2}{32.98}$$
$$= 33.5057$$
Step 4: Determine the critical region
From the chi-squared table, the critical value of $\chi^2$ for 4 degrees of freedom at the 1% level is 13.28.
Step 5: Make a decision
As the calculated value 33.51 is greater than the critical value 13.28, we conclude that there is evidence to reject H0; that is, attitude to the new detergent and geographical area of residence are not independent.
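For larger tables it is convenient to wrap the calculation in a function. The sketch below is one possible arrangement, with the hypothetical helper name `chi_squared_independence`, applied to the data of Example 5. It computes every expected frequency from the formula without rounding, so the statistic is about 33.49 rather than the 33.5057 obtained above from expected values rounded to two decimal places; the decision is unchanged.

```python
def chi_squared_independence(table):
    """Return (chi2, df, expected) for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency = (row total)(column total) / (total sample size)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Rows: Like, Indifferent, Dislike; columns: North, Central, South
attitudes = [[46, 21, 31],
             [25, 58, 35],
             [16, 37, 42]]

chi2, df, expected = chi_squared_independence(attitudes)
critical_value = 13.28            # table value for p = 0.99, 4 degrees of freedom
reject_h0 = chi2 > critical_value

print(round(chi2, 2), df, reject_h0)
```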
Exercise 6.3
1. An experiment has 500 observations and the data are classified into a 4 × 6 contingency table. Suppose we conduct a chi-squared test of independence at the 1% significance level. Assume the calculated value of the chi-squared test statistic is 39.2.
(a) Determine the number of degrees of freedom.
(b) Find the critical value for the chi-squared test of independence.
(c) Determine whether the chi-squared test value falls into the critical region.
2. The following 3 × 2 contingency table contains observed values for a sample of size 250. Determine whether the row and column variables are independent using the chi-squared test with $\alpha = 0.025$.
X
A
B 25
55
,Y
)/
32 38
63
3. A research group performs a study on gender and handedness (right- or left-handed). 800 individuals are randomly chosen from a very large population. The following contingency table displays the distribution of the two categories.
           Right-handed   Left-handed
Male           344             72
Female         352             32
Test the hypothesis that gender is independent of handedness at the 5% significance level.
4.
Consider a sample of 200 customers. For each customer, we have information on gender and preference of food. A contingency table for these data is shown below.
           Indian   Japanese   Western
Male         40        20         50
Female       20        50         20
Carry out a test, at the 5% significance level, to determine whether there is any association between gender and preference of food.
5. In an experiment to study the association between diabetes and smoking habits, the following data are recorded.
                    Diabetes   No diabetes
Nonsmokers             25          40
Moderate smokers       30          21
Heavy smokers          18          16
Using a 1% significance level, test the hypothesis that there is no association between cigarette smoking and the risk of diabetes.
6. A camera manufacturer has four suppliers of lenses. The table below shows the numbers of good and defective lenses supplied by the suppliers.
              Good   Defective
Supplier 1     95        5
Supplier 2    180       15
Supplier 3    134       16
Supplier 4    138        7
Test, at the 5% significance level, whether the supplier is associated with the lens quality. What is your advice to the purchasing department based on the test result?
7. The table shows the result of a taste test in which a random sample of 500 people in two age groups is asked which of four formulations of a chocolate drink they prefer.
Age group   Formulation A   Formulation B   Formulation C   Formulation D
7-25              30              69             116              78
26-50             28              36              70              73
Use a 0.01 significance level to test whether the preference for the different formulations changes with age.
8. Fruit trees are subject to a bacteria-caused disease. Several different treatments for this disease are adopted. Treatment A: no action taken. Treatment B: careful removal of clearly affected branches. Treatment C: frequent spraying of the leaves with an antibiotic in addition to careful removal of clearly affected branches. There are a few different outcomes from the disease. Outcome 1: tree dies in the same year as the disease is noticed. Outcome 2: tree dies 2-4 years after the disease is noticed. Outcome 3: tree survives beyond 4 years. A group of 200 trees are assigned to one of the treatments and over the next few years the outcome is recorded. The results are displayed in the following contingency table.
Treatment
Outcome
1
A
37
B
24 20
15
t7
32
2 J
l6
J
36
Determine whether there is any substantial evidence to conclude that outcome is independent of treatment. Use a 5% significance level for this test.
9. The table below shows the observed distribution of blood types A, B, AB, and O in three samples of Malays living in Kedah, Selangor and Johor.
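[Table of blood type (A, B, AB, O) against state (Kedah, Selangor, Johor). The cell layout cannot be recovered; the twelve frequencies, in the order extracted, are 205, 184, 41, 37, 14, 16, 3, 51, 232, 11, 17, 51.]
Test, at the 5% significance level, whether the distribution of blood type is different across the three states.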
10. A manufacturer operates four assembly machines on three separate shifts daily. The table below gives the number of machine breakdowns recorded in the past year.
[Table of breakdowns for Machines 1 to 4 on the first, second and third shifts. The cell layout cannot be recovered; the twelve frequencies, in the order extracted, are 75, 89, 108, 43, 63, 28, 59, 90, 141, 175, 121, 141.]
Determine whether these data provide sufficient evidence, at the 2.5% significance level, to infer that machine breakdown is independent of shift.
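The tests of independence in the exercises above all rest on the same arithmetic: under the null hypothesis, the expected frequency of a cell is (row total × column total) / grand total, and the statistic is compared with a tabulated critical value. The following is a minimal Python sketch of that calculation. The function names and the 2 × 3 table of counts are hypothetical and are not taken from any exercise; the degrees of freedom are assumed to be (rows − 1)(columns − 1).

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_independence(table):
    """Return (chi-squared statistic, degrees of freedom) for an r x c table."""
    expected = expected_frequencies(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Assumed rule: df = (rows - 1) * (columns - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 3 contingency table (made-up counts).
observed = [[20, 30, 50],
            [30, 30, 40]]
statistic, df = chi_squared_independence(observed)

CRITICAL_5_PERCENT_DF2 = 5.991  # tabulated chi-squared value for df = 2 at the 5% level
print(f"chi-squared = {statistic:.3f}, df = {df}")
print("reject H0" if statistic > CRITICAL_5_PERCENT_DF2 else "do not reject H0")
```

For these made-up counts the expected frequencies in each row are 25, 30 and 45, and the statistic is about 3.11, which is below 5.991, so the hypothesis of independence would not be rejected.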
Summary
1. The chi-squared distribution has one parameter, called the degrees of freedom.
2. The chi-squared distribution curve lies to the right of the vertical axis and is skewed to the right.
3. In a goodness-of-fit test, we test the null hypothesis that the observed frequencies follow a certain distribution.
4. In a test of independence, we test the null hypothesis that two attributes are independent.
5. General test procedure in a chi-squared test:
• State the hypotheses.
• Specify the significance level.
• Calculate the value of the chi-squared test statistic $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$, where $o_i$ and $e_i$ are the observed and expected frequencies. (Combine any adjacent classes where necessary.)
• Determine the critical region based on the number of degrees of freedom and the significance level.
• Make a decision.
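The general procedure in point 5 can be sketched in Python for a goodness-of-fit test. This is an illustration under stated assumptions rather than a definitive implementation: the function names and the die-rolling counts are hypothetical, the test is fixed at the 5% level using a few tabulated critical values, and the degrees of freedom are assumed to be the number of classes minus one, minus the number of parameters estimated from the data.

```python
# Tabulated 5% critical values of the chi-squared distribution,
# keyed by degrees of freedom (only the first few rows are copied here).
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_squared_statistic(observed, expected):
    """Sum of (o_i - e_i)^2 / e_i over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def goodness_of_fit(observed, probabilities, estimated_parameters=0):
    """Test at the 5% level; returns (statistic, df, reject_h0)."""
    total = sum(observed)
    # Expected frequency of each class under the null hypothesis.
    expected = [total * p for p in probabilities]
    statistic = chi_squared_statistic(observed, expected)
    # Assumed rule: df = classes - 1 - parameters estimated from the data.
    df = len(observed) - 1 - estimated_parameters
    # Critical region: statistic greater than the tabulated value.
    reject_h0 = statistic > CRITICAL_5_PERCENT[df]
    return statistic, df, reject_h0


# Hypothetical data: a die rolled 120 times (made-up counts).
# H0: the die is fair, so each face has probability 1/6.
observed = [15, 22, 25, 18, 21, 19]
statistic, df, reject_h0 = goodness_of_fit(observed, [1 / 6] * 6)
print(f"chi-squared = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```

With these made-up counts every expected frequency is 20, so the statistic is 60/20 = 3.0, far below the critical value 11.070 for 5 degrees of freedom, and the null hypothesis is not rejected. Combining adjacent classes, where needed, would be done before calling the function.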
REVISION EXERCISE
1. (a) Find $P(0.83 < \chi^2_{[\ldots]} < 12.8)$.
(b) Determine the value of $k$ such that $P(6.447 < \chi^2_{[\ldots]} < k) = [\ldots]$.
2.
Three identical dice are thrown 150 times. The number of dice whose scores on the top faces at each throw are odd is recorded. The results are as follows:

Number of odd scores     0    1    2    3
Frequency               33   59   43   15

Using a 5% significance level, test the hypothesis that all three dice are unbiased.
3.
A departmental store sells men's shirts and stocks these shirts in five different sizes: S, M, L, XL, and XXL. The number of shirts sold each week is recorded.

Size                 S    M    L    XL   XXL
Number of shirts    21   24   39   25    13

Test, at a 10% significance level, the hypothesis that the number of shirts sold is uniformly distributed.
4.
Cars heading to a certain junction may go straight, turn left or turn right. A road transport department officer asserts that 60% of the cars will go straight at the intersection and, of the remaining 40%, equal proportions will turn left and right. One hundred cars are randomly monitored and it is found that 51 cars go straight, 17 cars turn left and 32 cars turn right. Test, at the 5% significance level, the hypothesis that the proportions of cars going straight, turning left and turning right do not differ significantly from those asserted by the officer.
5.
A pharmaceutical company conducts a trial on 200 patients to determine the effectiveness of a new cough remedy. Of these patients, 100 are randomly selected to be given the standard cough remedy and the remaining 100 are assigned the new cough remedy. The results are recorded as shown.
[Table of cough remedy (standard, new) against result (no relief, some relief, full relief). The cell layout cannot be recovered; the six frequencies, in the order extracted, are 53, 37, 34, 13, 44, 19.]
Carry out a test, at a significance level of 5%, to investigate whether the two cough remedies are equally effective.
6.
A football fan keeps a record of the goals scored per match by his favourite team. The results are shown below.
[Table of goals scored per match against number of matches. The goal counts and cell layout cannot be recovered; the five frequencies, in the order extracted, are 34, 11, 16, 25, 14.]
(a) Compute the mean number of goals scored per match.
(b) Using a 5% significance level, perform a test of the hypothesis that the number of goals per match has a Poisson distribution.
7.
The following table gives the cumulative frequency distribution of the lives (in years) of 40 notebook batteries tested by a battery manufacturer.
[Table of cumulative frequency against battery life not greater than 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5 and 5.0 years. Only part of the table is legible; the legible cumulative frequencies, in the order extracted, are 2, 22, 32 and 40.]
Based on previous experience, it is believed that a normal distribution with mean 3.5 years and standard deviation 0.7 year provides a good approximation. Perform a chi-squared test, at the 5% significance level, to determine whether the normal distribution gives a good fit for these data.
8.
The table below shows the frequency distribution of marks for a paper obtained by 178 candidates.

Mark, x       Number of candidates
50<x<60         5
40<x<50        19
30<x<40        34
20<x<30        63
10<x<20        47
0<x<10         10

The population mean and standard deviation of the distribution of marks for the paper are 26.0 and 11.5 respectively. Test, at the 10% significance level, the hypothesis that the distribution of marks for the paper is normal.
9.
A botanist sows three seeds in each of 80 pots. The number of seeds which germinate in each pot is recorded. The results of all the 80 pots are given in the following table.
[Table of number of seeds germinated against number of pots. The cell layout cannot be recovered; the entries, in the order extracted, are 0, 25, 20, 29.]
(a) Estimate the probability that an individual seed germinates.
(b) Using a 1% significance level, test the hypothesis that the data may be fitted by the binomial distribution.
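10.
The distribution of marks for a paper in an examination has mean $\mu$ and standard deviation $\sigma$. Each candidate is assigned one of the five grades A, B, C, D, E as follows:

Mark, x                                                      Grade
$x \ge \mu + \frac{3\sigma}{2}$                                A
$\mu + \frac{\sigma}{2} \le x < \mu + \frac{3\sigma}{2}$       B
$\mu - \frac{\sigma}{2} \le x < \mu + \frac{\sigma}{2}$        C
$\mu - \frac{3\sigma}{2} \le x < \mu - \frac{\sigma}{2}$       D
$x < \mu - \frac{3\sigma}{2}$                                  E

The table below summarises the grades of a random sample of 198 candidates.
[Table of grade (A, B, C, D, E) against number of candidates. The cell layout cannot be recovered; the five frequencies, in the order extracted, are 55, 81, 33, 17, 12.]
Determine, at the 1% significance level, the adequacy of a normal distribution as a model for these data.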
11.
The lengths (in millimetres) in a random sample of 50 leaves of a certain plant are recorded as follows:

132 150 168 138 150 145 155 138 163 156
133 136 177 135 147 125 144 165 147 142
157 158 118 153 128 165 147 154 146 144
138 143 151 148 152 140 148 146 126 163
121 140 140 173 142 135 145 151 135 161

Test the hypothesis that the leaf length can be approximately modelled by a normal distribution. Use a 0.05 significance level.
12.
The table below shows the number of individuals exposed to a certain virus and the number of individuals who develop the disease.

                      Development of disease
Exposure to virus       Yes      No
Yes                      44      19
No                      116     128

Conduct a test of hypothesis, at the 1% significance level, to determine whether there is an association between exposure to the virus and development of the disease.
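13.
The table below shows the number of males and females in each of three employment categories at a manufacturing company.
[Table of gender (male, female) against employment category (managerial, support, worker). The cell layout cannot be recovered and one of the six frequencies is missing; the legible entries, in the order extracted, are 10, 285, 624, 39, 52.]
Using a 1% significance level, test whether there is any association between gender and employment categories.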
14. A researcher in a study of heart disease in males links subjects to socioeconomic status and smoking habits. The results are summarised in the contingency table below.
[Table of smoking habits (current, former, never) against socioeconomic status (high, middle, low). The cell layout cannot be recovered and two of the nine frequencies are illegible; the legible entries, in the order extracted, are 29, 27, 55, 36, 30, 66, 19.]
Perform a chi-squared test on association between smoking habits and socioeconomic status. Use a significance level of 2.5%.
15. A hypermarket wants to study the relationship between the method of payment and the age group of its customers. A random sample of 250 customers is taken and the results are summarised in the table below.

                Payment method
Age group       Card     Cash
18-25            18       14
26-35            36       27
36-45            25       33
Over 46          30       67

Carry out a test, at the 5% significance level, to find out whether the method of payment is independent of age group.
16.
The School of Biological Sciences of a university records the level of exposure to a certain pollutant and the number of brain abnormalities for laboratory mice. The data are summarised in the table below.
[Table of level of exposure to pollutant (high, medium, low) against number of brain abnormalities (0-2, 3-4, 5-6). The cell layout cannot be recovered and one of the nine frequencies is missing; the legible entries, in the order extracted, are 18, 7, 8, 39, 13, 8, 12, 8.]
Test, at the 5% significance level, whether there is an association between the level of exposure to the pollutant and the number of brain abnormalities found in the laboratory mice.
17.
The table below summarises the number of hours of sleep at night for a random sample of adults of different age groups.
[Table of age group (25-44, 45-54, 55 and over) against number of hours of sleep (less than 6, 6 to 8, more than 8). The assignment of rows and columns cannot be recovered; the frequencies, in the order extracted, are 70, 62, 43; 41, 34, 76; 85, 77, 69.]
Carry out a test, at the 1% significance level, to determine whether the number of hours of sleep is independent of the age of an adult.
18.
A plant expert collects samples of rice from a large field of 600 plots. One part of his investigation is based on the sterility observed and the genotype used for each plot.

                       Sterility
Genotype     No problem   Moderate   Severe
I                16          77        57
II               30         102        18
III              21          90        39
IV               19         120        11

Test, at a 1% significance level, whether sterility is independent of genotype.