
CHAPTER 6

CHI-SQUARE (χ²) TEST
Chi-Square Test
• There are three main uses of the chi-square
distribution:
1. The test of independence is used to determine
whether two variables are related or are
independent.
2. It can be used as a goodness-of-fit test, in order to
determine whether the frequencies of a distribution
are the same as the hypothesized frequencies.
3. The homogeneity of proportions test is used to
determine if several proportions are all equal when
samples are selected from different populations.
Hypothesis Tests for Qualitative Data

[Decision tree: for qualitative data, tests of proportions use a Z test when one or two populations are compared and a χ² test when more than two populations are compared (homogeneity of proportions); tests of independence use a χ² test.]


Characteristics of the Chi-Square
Distribution

1. It is not symmetric.
2. The shape of the chi-square distribution depends on the
degrees of freedom, just like Student’s t-distribution.
3. As the number of degrees of freedom increases, the chi-
square distribution becomes more nearly symmetric.
4. The values of χ² are nonnegative,
i.e., the values of χ² are greater than or equal to 0.
Goodness of Fit
Chi-Square Goodness-of-Fit Test

To calculate the test statistic for the chi-square goodness-of-fit test, the
observed frequencies and the expected frequencies are used.

• The observed frequency O of a category is the frequency for the category observed in the sample data.
• The expected frequency E of a category is the calculated frequency for
the category.

• The expected frequency for the ith category is
Ei = n · pi
where n is the number of trials (the sample size) and pi is the assumed probability of the ith category.
Test Statistic for Goodness-of-Fit Tests
Let
Oi = the observed count for category i,
Ei = the expected count for category i,
k = the number of categories, and
n = the number of independent trials of the experiment. Then,

χ² = Σ (Oi – Ei)² / Ei,   i = 1, 2, …, k

with k – 1 degrees of freedom

Critical Value for Goodness-of-Fit Tests
All goodness-of-fit tests are right-tailed tests with
degrees of freedom (df) = k – 1; the critical value χ²* cuts off an area of α in the right tail.
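As a quick check of tabulated critical values, the right-tail critical value can be computed with SciPy's chi2.ppf (a minimal sketch; the significance level and number of categories below are only illustrative):

    from scipy.stats import chi2

    alpha = 0.05   # significance level (illustrative)
    k = 6          # number of categories, e.g., the six faces of a die
    df = k - 1     # degrees of freedom for a goodness-of-fit test

    # Right-tailed test: the critical value cuts off an area alpha in the upper tail
    critical_value = chi2.ppf(1 - alpha, df)
    print(critical_value)   # about 11.07 for df = 5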
Assumptions

• The assumptions for the chi-square independence and homogeneity tests:
1. The data are obtained from a random sample.
2. No more than 20% of the expected counts are less than 5.
The Chi-Square Goodness-of-Fit Test

Example 1: Fair Die?

Suppose you roll a die 60 times and get 12 ones, 9 twos, 10 threes, 6 fours, 11 fives, and 12 sixes.

Would you reject the null hypothesis that the die is fair
at the 5% level of significance?
Chi-Square Goodness-of-Fit Test
Example: Fair Die?

If a fair die is rolled 60 times, you “expect” to get each face of the die on 1/6 of the
60 rolls, or 10 times each.
Outcome Observed Frequency, O Expected Frequency, E
1 12 10
2 9 10
3 10 10
4 6 10
5 11 10
6 12 10
Total 60 60

The die is unfair if the observed frequencies are far from the expected
frequencies.
Chi-Square Goodness-of-Fit Test
The Test Statistic: χ² = Σ (O – E)² / E
Chi-Square Goodness-of-Fit Test
Example: The Test Statistic for the Die

Compute χ² for the fair die.

Outcome   Observed Frequency, O   Expected Frequency, E   O – E   (O – E)²/E
1 12 10 2 0.4
2 9 10 -1 0.1
3 10 10 0 0.0
4 6 10 -4 1.6
5 11 10 1 0.1
6 12 10 2 0.4
Total 60 60 0 2.6

χ² = 2.6
Chi-Square Goodness-of-Fit Test

Example: Fair Die?

Critical Values

at the 5% level of significance

Determine the degrees of freedom (the number of categories minus 1):
df = 6 - 1 = 5.
χ²* = 11.07

Since the test statistic χ² = 2.6 is smaller than the critical value χ²* = 11.07,
we fail to reject the null hypothesis that the die is fair at the α = 0.05 level.
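This example can be verified with SciPy's chisquare function (a minimal sketch; the observed counts are the ones given above):

    from scipy.stats import chisquare

    observed = [12, 9, 10, 6, 11, 12]   # counts of each face in 60 rolls
    expected = [10] * 6                 # a fair die: 60 * (1/6) = 10 per face

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(stat)      # 2.6, matching the hand calculation
    print(p_value)   # well above 0.05, so we fail to reject H0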
Chi-Square Goodness-of-Fit Test

The Chi-Square Goodness of Fit Test


State the hypotheses

H0: The proportions in the population are equal to the proportions in your model.
H1: At least one proportion in the population is not equal to the corresponding proportion in your model.
Chi-Square Goodness-of-Fit Test
Example 2: Milk Chocolate Versus Peanut Butter

Milk chocolate M&Ms are 13% red, 14% yellow, 16% green, 20% orange, 13%
brown, and 24% blue. A random sample of 200 peanut butter M&Ms yielded this
distribution:

Color Observed Number of M&Ms


Red 25
Yellow 37
Green 45
Orange 34
Brown 19
Blue 40
Total 200

Do we have convincing evidence that the distribution of peanut butter M&Ms is different from the distribution of milk chocolate M&Ms?
Chi-Square Goodness-of-Fit Test
Example: Milk Chocolate Versus Peanut Butter

Use the proportions of each color of milk chocolate M&Ms and the
sample size (n = 200) to compute the expected number of each color.

Color   Observed Number of M&Ms   Expected Number of M&Ms
Red 25 0.13(200) = 26
Yellow 37 0.14(200) = 28
Green 45 0.16(200) = 32
Orange 34 0.20(200) = 40
Brown 19 0.13(200) = 26
Blue 40 0.24(200) = 48
Total 200 200
The Chi-Square Goodness-of-Fit Test
Example: Milk Chocolate Versus Peanut Butter

State the hypotheses:

H0: The proportions in the population of peanut butter M&Ms are equal to the proportions in the model (milk chocolate M&Ms).

H1: At least one proportion in the population of peanut butter M&Ms is not equal to the corresponding proportion in the model (milk chocolate M&Ms).
The Chi-Square Goodness-of-Fit Test
Example: Milk Chocolate Versus Peanut Butter

Compute the test statistic, approximate the P-value, and draw a sketch.

The test statistic is χ² = Σ (O – E)² / E

Color   Observed Number of M&Ms   Expected Number of M&Ms   (O – E)²/E
Red 25 0.13(200) = 26 0.038
Yellow 37 0.14(200) = 28 2.893
Green 45 0.16(200) = 32 5.281
Orange 34 0.20(200) = 40 0.900
Brown 19 0.13(200) = 26 1.885
Blue 40 0.24(200) = 48 1.333
Total 200 200 χ² = 12.331
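The same goodness-of-fit calculation in SciPy (a sketch; the color proportions and counts are those given above):

    from scipy.stats import chisquare

    observed = [25, 37, 45, 34, 19, 40]                   # peanut butter M&M sample
    proportions = [0.13, 0.14, 0.16, 0.20, 0.13, 0.24]    # milk chocolate model
    expected = [p * 200 for p in proportions]             # n = 200

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(stat)      # about 12.33, matching the table above
    print(p_value)   # below 0.05

Since 12.331 exceeds the critical value of 11.07 for 5 degrees of freedom, H0 is rejected: there is convincing evidence that the peanut butter distribution differs from the milk chocolate model.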
Chi-Square Test of Independence

• Shows whether a relationship exists between two qualitative variables
• Uses a two-way contingency table
• The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in the population (the variables are statistically independent).
• The alternative hypothesis (H1) proposes that the two variables are related in the population (the variables are dependent).
Assuming the two variables are independent, you can use the
contingency table to find the expected frequency for each cell.

The Expected Frequency for Contingency Table Cells

The expected frequency for cell Er,c in a contingency table is

Er,c = (Sum of row r) × (Sum of column c) / Sample size
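A small sketch of this formula with NumPy, using the outer product of the row and column totals (the counts below are made-up placeholder values):

    import numpy as np

    # Illustrative observed counts for a 2 x 3 contingency table (made-up numbers)
    observed = np.array([[20, 30, 50],
                         [30, 20, 50]])

    row_totals = observed.sum(axis=1)    # sum of each row r
    col_totals = observed.sum(axis=0)    # sum of each column c
    n = observed.sum()                   # sample size

    # E[r, c] = (row r total) * (column c total) / sample size
    expected = np.outer(row_totals, col_totals) / n
    print(expected)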
Test Statistic for the Test of Independence
Let:
Oi = the observed number of counts in the ith cell,
Ei = the expected number of counts in the ith cell.

Then,

χ² = Σ (Oi – Ei)² / Ei

with degrees of freedom = (r – 1)(c – 1)

where: r is the number of rows
c is the number of columns
Critical Region for the Test of Independence
1. Hypotheses
• H0: Variables are independent
• Ha: Variables are related (dependent)
2. Test Statistic

χ² = Σ (Oij – Eij)² / Eij

where Oij is the observed count and Eij is the expected count in row i, column j.

3. Degrees of Freedom: (r – 1)(c – 1), where r is the number of rows and c is the number of columns
Example:
The following contingency table shows a random sample of 321 fatally
injured passenger vehicle drivers by age and gender. The expected
frequencies are displayed in parentheses. At  = 0.05, can you conclude that
the drivers’ ages are related to gender in such accidents?

                                       Age
Gender    16–20     21–30     31–40     41–50     51–60     61 and older   Total
Male      32        51        52        43        28        10             216
          (30.28)   (49.12)   (57.20)   (43.07)   (25.57)   (10.77)
Female    13        22        33        21        10        6              105
          (14.72)   (23.88)   (27.80)   (20.93)   (12.43)   (5.23)
Total     45        73        85        64        38        16             321
Example continued:
Because each expected frequency is at least 5 and the drivers were randomly
selected, the chi-square independence test can be used to test whether the
variables are independent.

H0: The drivers' ages are independent of gender.
Ha: The drivers' ages are dependent on gender.

d.f. = (r – 1)(c – 1)
= (2 – 1)(6 – 1)
=5
With d.f. = 5 and α = 0.05, the critical value is χ²₀ = 11.071.
Continued.
Example continued:

Rejection region: α = 0.05, reject H0 if χ² > χ²₀ = 11.071.

O     E       O – E    (O – E)²   (O – E)²/E
32    30.28    1.72     2.9584     0.0977
51    49.12    1.88     3.5344     0.0720
52    57.20   -5.20    27.0400     0.4727
43    43.07   -0.07     0.0049     0.0001
28    25.57    2.43     5.9049     0.2309
10    10.77   -0.77     0.5929     0.0551
13    14.72   -1.72     2.9584     0.2010
22    23.88   -1.88     3.5344     0.1480
33    27.80    5.20    27.0400     0.9727
21    20.93    0.07     0.0049     0.0002
10    12.43   -2.43     5.9049     0.4751
6      5.23    0.77     0.5929     0.1134

χ² = Σ (O – E)²/E ≈ 2.84

Fail to reject H0. There is not enough evidence at the 5% level to conclude that age is dependent on gender in such accidents.
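A sketch of the same test with SciPy's chi2_contingency, which returns the statistic, P-value, degrees of freedom, and expected counts (the observed counts are those from the table above):

    from scipy.stats import chi2_contingency

    # Rows: gender (male, female); columns: age group (16-20 through 61 and older)
    observed = [[32, 51, 52, 43, 28, 10],
                [13, 22, 33, 21, 10, 6]]

    stat, p_value, df, expected = chi2_contingency(observed)
    print(stat)      # roughly 2.84, matching the hand calculation above
    print(df)        # (2 - 1)(6 - 1) = 5
    print(p_value)   # well above 0.05, so we fail to reject H0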
You’re a marketing research analyst. You ask a random
sample of 286 consumers if they purchase Diet Pepsi or Diet
Coke. At the 0.05 level of significance, is there evidence of
a relationship?

Diet Pepsi
Diet Coke No Yes Total
No 84 32 116
Yes 48 122 170
Total 132 154 286
• H0: No Relationship
• Ha: Relationship
• α = .05
• df = (2 - 1)(2 - 1) = 1
• Critical value: χ² = 3.841 (reject H0 if the test statistic falls in the right-tail rejection region, α = .05)
• Test statistic, decision, and conclusion: computed below.

Check the assumption: Eij ≥ 5 in all cells.

Expected counts, Eij = (row total)(column total)/286:
E(No, No) = 116·132/286 ≈ 53.5      E(No, Yes) = 116·154/286 ≈ 62.5
E(Yes, No) = 170·132/286 ≈ 78.5     E(Yes, Yes) = 170·154/286 ≈ 91.5

                    Diet Pepsi
                 No             Yes
Diet Coke    Obs.   Exp.    Obs.   Exp.    Total
No            84    53.5     32    62.5     116
Yes           48    78.5    122    91.5     170
Total        132    132     154    154      286
χ² = Σ (Oi – Ei)² / Ei
   = (O11 – E11)²/E11 + (O12 – E12)²/E12 + … + (O22 – E22)²/E22
   = (84 – 53.5)²/53.5 + (32 – 62.5)²/62.5 + … + (122 – 91.5)²/91.5
   = 54.29
H0: No Relationship
Ha: Relationship
α = .05
df = (2 - 1)(2 - 1) = 1
Critical value: χ² = 3.841 (right-tailed rejection region, α = .05)

Test statistic: χ² = 54.29
Decision: Reject H0 at α = .05.
Conclusion: There is evidence of a relationship.
Tabulated statistics: Coke, Pepsi
Using frequencies in frequency

Rows: Coke Columns: Pepsi

No Yes All
No 84 32 116
53.5 62.5 116.0
Yes 48 122 170
78.5 91.5 170.0
All 132 154 286
132.0 154.0 286.0

Cell Contents: Count


Expected count

Pearson Chi-Square = 54.150, DF = 1, P-Value = 0.000


Likelihood Ratio Chi-Square = 55.783, DF = 1, P-Value = 0.000
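A sketch reproducing these numbers with SciPy's chi2_contingency; correction=False turns off the Yates continuity correction so the plain Pearson statistic is reported, matching the output above:

    from scipy.stats import chi2_contingency

    # Diet Coke (rows: No, Yes) vs. Diet Pepsi (columns: No, Yes)
    observed = [[84, 32],
                [48, 122]]

    stat, p_value, df, expected = chi2_contingency(observed, correction=False)
    print(stat)      # about 54.15 (Pearson chi-square)
    print(df)        # 1
    print(p_value)   # essentially 0, so H0 of no relationship is rejected
    print(expected)  # approximately [[53.5, 62.5], [78.5, 91.5]]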
