
Hypothesis Testing

Lesson 6: Tests Involving Frequency Data: Chi-Square Test for Independence


In the previous lessons, the data gathered were assumed to come from a normal distribution. Thus, in terms of measurement scales, the data had to be interval or ratio variables in order to use the methods outlined earlier. In many cases, however, the data gathered are nominal in scale, so the tests in earlier sections cannot be used. When the data are nominal, their numerical measurements are mostly frequencies (counts). The Chi-Square Test ($\chi^2$) is often used as a test of significance for data that are expressed as frequencies, or as percentages or proportions that can be readily transformed into frequencies.
The Chi-Square Test for Independence is a test of whether a relationship exists between two variables when these variables are measured using only frequency counts.
The test is used to determine whether one variable or characteristic is independent of another variable or characteristic. If it is not, the two variables are related. The formula for the test is provided below.
FORMULA FOR CHI-SQUARE TEST FOR INDEPENDENCE
Test Statistic:
$$\chi^2 = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
with degrees of freedom (df) $v = (r - 1)(c - 1)$, where
$$e_{ij} = \frac{(R_i)(C_j)}{GT}$$
$H_0$: Variables X and Y are independent. (There is no relationship between the two variables.)
$H_a$: Variables X and Y are dependent. (The variables are related.)
Rejection Region: $\chi^2 > \chi^2_{(\alpha, v)}$

Remarks:
1. In a 2 x 2 contingency table, the degrees of freedom are $v = (2 - 1)(2 - 1) = 1$, so a correction factor, called Yates' correction for continuity, is applied. The corrected formula is as follows:
$$\chi^2_{\text{(corrected)}} = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(|o_{ij} - e_{ij}| - 0.5)^2}{e_{ij}}$$
2. In general, for an r x c contingency table, it is required that no more than 20% of the cells have an expected frequency of less than five (5), and that no cell have an expected frequency of less than one (1). Columns are combined (if meaningful) or pooled to meet these assumptions.
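The rule of thumb in Remark 2 can be checked mechanically once the expected frequencies are known. The following is a minimal sketch, assuming the usual reading of the rule (at most 20% of cells with an expected frequency below 5, and no cell below 1). The function name `check_expected_counts` and both example tables are hypothetical and are not taken from the lesson.

```python
def check_expected_counts(expected):
    """Return True if a table of expected frequencies meets the rule of thumb.

    Assumed reading of the rule: at most 20% of the cells may have an
    expected frequency below 5, and no cell may fall below 1.
    """
    cells = [e for row in expected for e in row]
    share_below_five = sum(e < 5 for e in cells) / len(cells)
    none_below_one = all(e >= 1 for e in cells)
    return share_below_five <= 0.20 and none_below_one


# Hypothetical expected-frequency tables, made up for illustration only.
ok_table = [[12.0, 18.0],
            [28.0, 42.0]]
bad_table = [[0.8, 9.2],
             [7.2, 82.8]]   # one cell below 1, and 25% of cells below 5

print(check_expected_counts(ok_table))   # True
print(check_expected_counts(bad_table))  # False
```

When the check fails, the remedy described in Remark 2 (combining or pooling categories) is applied to the observed table before the expected frequencies are recomputed.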

The summation extends over all rc cells in the r x c contingency table given below.
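An $r \times c$ Contingency Table
Row            Column Variable $Y$
Variable $X$   $y_1$      $y_2$      $\cdots$   $y_j$      $\cdots$   $y_c$      Row Total
$x_1$          $o_{11}$   $o_{12}$   $\cdots$   $o_{1j}$   $\cdots$   $o_{1c}$   $R_1$
$x_2$          $o_{21}$   $o_{22}$   $\cdots$   $o_{2j}$   $\cdots$   $o_{2c}$   $R_2$
$\vdots$       $\vdots$   $\vdots$              $\vdots$              $\vdots$   $\vdots$
$x_i$          $o_{i1}$   $o_{i2}$   $\cdots$   $o_{ij}$   $\cdots$   $o_{ic}$   $R_i$
$\vdots$       $\vdots$   $\vdots$              $\vdots$              $\vdots$   $\vdots$
$x_r$          $o_{r1}$   $o_{r2}$   $\cdots$   $o_{rj}$   $\cdots$   $o_{rc}$   $R_r$
Column Total   $C_1$      $C_2$      $\cdots$   $C_j$      $\cdots$   $C_c$      $GT$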
where
$x_i$ is the ith category of variable X;
$y_j$ is the jth category of variable Y;
$o_{ij}$ is the observed frequency for cell ij;
$R_i$ is the ith row total;
$C_j$ is the jth column total;
$GT$ is the grand total;
$e_{ij}$ is the expected frequency for cell ij;
$v$ is the degrees of freedom for the chi-square statistic;
$r$ is the number of rows;
$c$ is the number of columns.

The following steps are carried out before doing the test:

1. Compute the jth column total ($C_j$) and the ith row total ($R_i$) of the contingency table.
2. Obtain the grand total ($GT$) by summing all the row totals or all the column totals.
3. Compute the expected frequency for each cell. The expected frequency in the ith row and jth column is
$$e_{ij} = \frac{(R_i)(C_j)}{GT}$$
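The three preparatory steps can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function name `expected_frequencies` and the small 2 x 2 table of counts are hypothetical and do not come from the lesson.

```python
def expected_frequencies(observed):
    """Return (row_totals, col_totals, grand_total, expected) for a table of counts."""
    row_totals = [sum(row) for row in observed]            # step 1: R_i
    col_totals = [sum(col) for col in zip(*observed)]      # step 1: C_j
    grand_total = sum(row_totals)                          # step 2: GT
    expected = [[r * c / grand_total for c in col_totals]  # step 3: e_ij = R_i * C_j / GT
                for r in row_totals]
    return row_totals, col_totals, grand_total, expected


# Hypothetical 2 x 2 table of observed counts, made up for illustration only.
table = [[10, 20],
         [30, 40]]
rows, cols, gt, exp = expected_frequencies(table)
print(rows, cols, gt)  # [30, 70] [40, 60] 100
print(exp)             # [[12.0, 18.0], [28.0, 42.0]]
```

Note that the expected frequencies in each row and each column add up to the same totals as the observed frequencies, which is a quick way to check the arithmetic by hand.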
Example 1. To study the relationship between the sex of an individual and the number of hours spent per day on TV viewing, a random sample of 180 individuals was interviewed. The results are shown below.
                Number of hours per day of TV viewing
Sex             < 2 hours    2 to 4 hours    > 4 hours
Male            18           40              20
Female          12           50              40

Test the hypothesis that the number of hours spent per day in TV viewing is independent of the sex of the TV
viewer. Use 𝛼 = 0.05.

Given: $\alpha = 0.05$ and the observed frequencies with their totals:
                Number of hours per day of TV viewing
Sex             < 2 hours     2 to 4 hours    > 4 hours     Row Total
Male            18            40              20            78 ($R_1$)
Female          12            50              40            102 ($R_2$)
Column Total    30 ($C_1$)    90 ($C_2$)      60 ($C_3$)    180 ($GT$)
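The expected frequencies are computed as follows:
$$e_{11} = \frac{(R_1)(C_1)}{GT} = \frac{(78)(30)}{180} = 13 \qquad e_{12} = \frac{(R_1)(C_2)}{GT} = \frac{(78)(90)}{180} = 39 \qquad e_{13} = \frac{(R_1)(C_3)}{GT} = \frac{(78)(60)}{180} = 26$$
$$e_{21} = \frac{(R_2)(C_1)}{GT} = \frac{(102)(30)}{180} = 17 \qquad e_{22} = \frac{(R_2)(C_2)}{GT} = \frac{(102)(90)}{180} = 51 \qquad e_{23} = \frac{(R_2)(C_3)}{GT} = \frac{(102)(60)}{180} = 34$$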

Solution: Following the steps in hypothesis testing, we have:


Steps:
1. Hypotheses:
$H_0$: The number of hours spent per day on TV viewing is independent of the sex of an individual.
$H_a$: The number of hours spent per day on TV viewing is dependent on the sex of an individual.
2. Significance Level: $\alpha = 0.05$
3. Test Statistic: Since the observations are classified in a 2 x 3 contingency table and this is a test of independence, the appropriate test statistic is
$$\chi^2 = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
with $v = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$.

4. Critical Region: The critical region is given by $\chi^2 > \chi^2_{(\alpha, v)}$, where $\chi^2_{(\alpha, v)} = \chi^2_{(0.05, 2)} = 5.99$ (refer to the chi-square table).

Thus, we reject $H_0$ if $\chi^2 > 5.99$.

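5. Computation: Using the formula in step 3, the actual value of the test statistic is:
$$\chi^2 = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(o_{11} - e_{11})^2}{e_{11}} + \frac{(o_{21} - e_{21})^2}{e_{21}} + \frac{(o_{12} - e_{12})^2}{e_{12}} + \frac{(o_{22} - e_{22})^2}{e_{22}} + \frac{(o_{13} - e_{13})^2}{e_{13}} + \frac{(o_{23} - e_{23})^2}{e_{23}}$$
$$= \frac{(18 - 13)^2}{13} + \frac{(12 - 17)^2}{17} + \frac{(40 - 39)^2}{39} + \frac{(50 - 51)^2}{51} + \frac{(20 - 26)^2}{26} + \frac{(40 - 34)^2}{34} = 5.88$$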

6. Statistical Decision: Since $\chi^2 = 5.88$ is NOT greater than 5.99 (meaning, it is NOT in the critical region), the null hypothesis $H_0$ is NOT rejected.
7. Conclusion: Therefore, at $\alpha = 0.05$, there is not enough evidence to conclude that the number of hours spent on TV viewing depends on the sex of an individual. The data are consistent with independence.
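The arithmetic of Example 1 can be reproduced with a short script. This is a sketch in plain Python, not part of the original lesson; the variable names are illustrative, and the critical value 5.99 is simply copied from step 4 rather than computed.

```python
# Observed frequencies from Example 1 (rows: Male, Female;
# columns: < 2 hours, 2 to 4 hours, > 4 hours).
observed = [[18, 40, 20],
            [12, 50, 40]]

row_totals = [sum(row) for row in observed]        # [78, 102]
col_totals = [sum(col) for col in zip(*observed)]  # [30, 90, 60]
grand_total = sum(row_totals)                      # 180

# Expected frequency for each cell: e_ij = R_i * C_j / GT
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Test statistic: sum of (o_ij - e_ij)^2 / e_ij over all cells
chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
critical_value = 5.99  # taken from step 4 (alpha = 0.05, v = 2), not computed here
reject_h0 = chi_square > critical_value

print(round(chi_square, 2), df, reject_h0)  # 5.88 2 False
```

The script reaches the same decision as step 6: the statistic falls just short of the critical value, so the null hypothesis is not rejected.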

Example 2. A serum is claimed to cure a certain disease. To test this claim, 200 people infected with the disease were divided into two groups, A and B, with an equal number of people. The serum was given to group A but not to group B. Group B is called the control group. The table below shows that 75 people in Group A and 65 people in Group B recovered from the disease.
                           Number of People Recovered by the Serum
Group                      Recovered    Did Not Recover
Group A (with serum)       75           25
Group B (without serum)    65           35

Test the hypothesis that recovery is independent of the use of the serum using 𝛼 = 0.05.

Given: $\alpha = 0.05$ and the observed frequencies with their totals:
                           Number of People Recovered by the Serum
Group                      Recovered      Did Not Recover    Row Total
Group A (with serum)       75             25                 100 ($R_1$)
Group B (without serum)    65             35                 100 ($R_2$)
Column Total               140 ($C_1$)    60 ($C_2$)         200 ($GT$)
The expected frequencies are computed as follows:
$$e_{11} = \frac{(R_1)(C_1)}{GT} = \frac{(100)(140)}{200} = 70 \qquad e_{12} = \frac{(R_1)(C_2)}{GT} = \frac{(100)(60)}{200} = 30$$
$$e_{21} = \frac{(R_2)(C_1)}{GT} = \frac{(100)(140)}{200} = 70 \qquad e_{22} = \frac{(R_2)(C_2)}{GT} = \frac{(100)(60)}{200} = 30$$

Solution: Following the steps in hypothesis testing, we have:


Steps:
1. Hypotheses:
$H_0$: Recovery is independent of the use of the serum.
$H_a$: Recovery is dependent on the use of the serum.
2. Significance Level: $\alpha = 0.05$
3. Test Statistic: Since the observations are classified in a 2 x 2 contingency table and this is a test of independence, the appropriate test statistic is
$$\chi^2_{\text{(corrected)}} = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(|o_{ij} - e_{ij}| - 0.5)^2}{e_{ij}}$$
with $v = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$.

4. Critical Region: The critical region is given by $\chi^2 > \chi^2_{(\alpha, v)}$, where $\chi^2_{(\alpha, v)} = \chi^2_{(0.05, 1)} = 3.84$ (refer to the chi-square table).

Thus, we reject $H_0$ if $\chi^2 > 3.84$.
5. Computation: Using the formula in step 3, the actual value of the test statistic is:
$$\chi^2_{\text{(corrected)}} = \sum_{j=1}^{c} \sum_{i=1}^{r} \frac{(|o_{ij} - e_{ij}| - 0.5)^2}{e_{ij}} = \frac{(|o_{11} - e_{11}| - 0.5)^2}{e_{11}} + \frac{(|o_{21} - e_{21}| - 0.5)^2}{e_{21}} + \frac{(|o_{12} - e_{12}| - 0.5)^2}{e_{12}} + \frac{(|o_{22} - e_{22}| - 0.5)^2}{e_{22}}$$
$$= \frac{(|75 - 70| - 0.5)^2}{70} + \frac{(|65 - 70| - 0.5)^2}{70} + \frac{(|25 - 30| - 0.5)^2}{30} + \frac{(|35 - 30| - 0.5)^2}{30} = 1.93$$
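6. Statistical Decision: Since $\chi^2_{\text{(corrected)}} = 1.93$ is NOT greater than 3.84 (meaning, it is NOT in the critical region), the null hypothesis $H_0$ is NOT rejected.
7. Conclusion: Therefore, at $\alpha = 0.05$, there is not enough evidence to conclude that recovery depends on the use of the serum. The data are consistent with independence.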
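Example 2 can be checked the same way, this time with Yates' correction for continuity. The script below is a sketch in plain Python with illustrative variable names; the critical value 3.84 is copied from step 4. The uncorrected statistic is also computed for comparison only; it is not part of the original solution.

```python
# Observed frequencies from Example 2 (rows: Group A, Group B;
# columns: Recovered, Did not recover).
observed = [[75, 25],
            [65, 35]]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [140, 60]
grand_total = sum(row_totals)                      # 200

# Expected frequency for each cell: e_ij = R_i * C_j / GT
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pair each observed count with its expected count
pairs = [(o, e)
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]

# Yates-corrected statistic: sum of (|o_ij - e_ij| - 0.5)^2 / e_ij
chi_square_corrected = sum((abs(o - e) - 0.5) ** 2 / e for o, e in pairs)

# Uncorrected statistic, shown only for comparison (not in the original solution)
chi_square_uncorrected = sum((o - e) ** 2 / e for o, e in pairs)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1)(2 - 1) = 1
critical_value = 3.84  # taken from step 4 (alpha = 0.05, v = 1), not computed here
reject_h0 = chi_square_corrected > critical_value

print(round(chi_square_corrected, 2), df, reject_h0)  # 1.93 1 False
print(round(chi_square_uncorrected, 2))               # 2.38
```

The correction lowers the statistic (1.93 against 2.38 without it), which makes the test more conservative. In this example the decision is the same either way, since both values are below 3.84.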

Reference: Supe, A., et al. (2013). Elementary Statistics. Central Book Supply Inc.

Prepared by:

JOBELLE S. SIMBLANTE
Stat 26 Instructor
