
Categorical Data Analysis

Topic 11 ST1232 Statistics for Life Sciences 1 / 47


1 Introduction

2 Chi-squared (χ²) Test for 2 × 2 Contingency Tables
    How was Data Collected?
    χ² Test
    Small Sample Sizes
    Response versus Explanatory

3 Measures of Association for Categorical Data
    Prospective Versus Retrospective Studies
    Measures of Association For Prospective Studies
    Measure of Association For Prospective And Retrospective Studies: Odds Ratio
    Confidence Interval for OR

4 r × c tables

5 Summary



Categorical Data Analysis
The purpose of this chapter is to identify any association between two categorical
variables.
Example 1 (Chest Pain And Gender)
Suppose that 1073 patients at NUH were sampled for a study in which the
onset of severe chest pain in patients at high risk for cardiovascular disease
(CVD) was recorded for each subject.
The 1073 patients were queried on two aspects:
  - Have they experienced the onset of severe chest pain in the preceding 6
    months? (yes/no)
  - Gender? (male/female)

            Chest Pain    No Chest Pain    Total
Male            46             474           520
Female          37             516           553
Total           83             990          1073



Conditional Proportions and Associations

The two categorical variables are gender and presence/absence of chest pain.
We can compute conditional proportions from the table in Example 1.

            Chest Pain    No Chest Pain    Total
Male           8.8%           91.2%         100%
Female         6.7%           93.3%         100%

8.8% is a point estimate of P(chest pain|male). Similarly, 6.7% is a point
estimate of P(chest pain|female). We are interested in knowing whether these two
population quantities are similar.
If they are (very) different from one another, we say that there is an
association between gender and chest pain. If they are similar, we say that
there is no association, or that the two variables are independent.
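
As a minimal sketch in Python (standard library only; the variable names are illustrative and not part of the slides), the conditional proportions for Example 1 can be computed directly from the observed counts:

```python
# Observed counts from Example 1: (chest pain, no chest pain) for each gender.
counts = {"Male": (46, 474), "Female": (37, 516)}

# Conditional proportion of chest pain within each gender (row proportions).
cond_prop = {}
for gender, (pain, no_pain) in counts.items():
    cond_prop[gender] = pain / (pain + no_pain)

for gender, p in cond_prop.items():
    print(f"Estimate of P(chest pain | {gender.lower()}): {p:.1%}")
```

The two printed values are the row percentages in the table above; the question for the rest of the chapter is whether their difference is larger than chance alone would produce.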



Independence and Dependence

Definition 1 (Independence and Dependence (Association))


Two categorical variables are independent if the population conditional
distributions for one of them are identical at each category of the other.
The variables are dependent, or associated, if the conditional distributions
are not identical.
We shall now learn about a hypothesis test for association between two categorical
variables.



Data in a 2 × 2 Table

In Example 1, data were collected via a random sample of patients at NUH,
and each patient was then assigned to one of the 4 “cells” in the 2 × 2 table.
It is also possible that the data were collected in one of the following ways:
  - Collecting based on gender (explanatory variable)
      - A random sample of male patients who are at high risk for CVD was obtained.
      - A random sample of female patients who are at high risk for CVD was obtained.
      - Each subject was queried on whether he/she had experienced severe chest pain in
        the preceding 6 months.
  - Collecting based on the chest pain status (response variable)
      - A random sample of people with chest pain was obtained.
      - A random sample of people without chest pain was obtained.
      - Each subject was queried on their gender.
No matter how the data are collected, as long as simple random samples were
obtained, the test in this section is valid.
The difference will be important in the next section, when we wish to
quantify the association between two categorical variables.



Expected Cell Counts

The test we are about to learn has the following hypotheses:

H0 : The two variables are independent


H1 : The two variables are dependent

In order to compute the test statistic, we shall have to compute the expected
cell counts, under the assumption of independence, and compare them to the
observed cell counts.

Definition 2 (Expected Cell Counts)

For a particular cell, the expected cell count is
\[
\text{Expected cell count} = \frac{\text{Row total} \times \text{Column total}}{\text{Total sample size}}
\]



Chest Pain Expected Cell Counts

In Example 1, the expected cell counts are:


            Chest Pain    No Chest Pain    Total
Male           40.2           479.8          520
Female         42.8           510.2          553
Total          83             990           1073
Notice that expected cell counts are not necessarily integers.
The question we need to answer is, how different are the expected cell counts
from the observed cell counts?
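
The expected counts above can be reproduced with a short Python sketch of Definition 2 (standard library only; the variable names are illustrative assumptions, not from the slides):

```python
# Observed counts from Example 1.
observed = [[46, 474],   # Male:   chest pain, no chest pain
            [37, 516]]   # Female: chest pain, no chest pain

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = row total * column total / total sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```

Rounded to one decimal place, the result agrees with the table of expected cell counts for Example 1.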



Test Statistic
The χ² test statistic summarises how far the observed cell counts are from the
expected cell counts, under the null hypothesis of independence.

Definition 3 (χ² Test Statistic)

The general formula is
\[
\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}
\]
In other words,
For each cell, square the difference between the observed and expected
counts, and then divide that squared value by the expected count.
After calculating this term for every cell, sum the terms.

There are variations on this formula throughout this chapter; be aware of
when to use which variation.
The χ² test statistic is never negative; it equals 0 only when every observed
count equals its expected count.
A larger value of the test statistic gives stronger evidence against the null
hypothesis.
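
As a hedged illustration, the general formula in Definition 3 can be applied to the Example 1 counts with a few lines of Python (standard library only; variable names are illustrative):

```python
# Observed counts from Example 1 (rows: male, female; columns: pain, no pain).
observed = [[46, 474], [37, 516]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over all four cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi2, 3))
```

This is the uncorrected (Pearson) version of the statistic; it should agree with the "Pearson Chi-Square" value of 1.744 reported in the SPSS output for this example.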
Steps for testing independence in a 2 × 2 table

Assumptions: Two categorical variables.
             Data obtained via randomisation.
             Expected cell counts at least 5 for all cells.
Hypothesis: H0 : Two variables are independent.
            H1 : Two variables are dependent.
Test statistic:
\[
\chi^2 = \sum \frac{(|\text{observed count} - \text{expected count}| - 0.5)^2}{\text{expected count}}
\]
This is known as the χ² statistic with continuity correction.
p-value: The right tail probability of the χ² distribution with 1 degree of
freedom.
Conclusion: Reject or do not reject H0 according to the pre-determined α-level.
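
The steps above can be sketched in Python for the Example 1 counts. This is a minimal illustration using only the standard library. It relies on the fact that, for 1 degree of freedom, the right tail probability of the χ² distribution at x equals erfc(sqrt(x/2)), because a χ² variable with 1 degree of freedom is the square of a standard normal variable. The variable names and the choice of α = 0.05 are assumptions made for the example.

```python
from math import erfc, sqrt

# Observed counts from Example 1 (rows: male, female; columns: pain, no pain).
observed = [[46, 474], [37, 516]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-squared statistic with continuity correction.
chi2_cc = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        expected = r * c / n
        chi2_cc += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected

# Right tail probability of the chi-squared distribution with 1 df.
p_value = erfc(sqrt(chi2_cc / 2))

alpha = 0.05  # illustrative significance level, not specified in the slides
print(f"chi2 = {chi2_cc:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

The statistic and p-value should match the "Continuity Correction" row of the SPSS output for this example (1.456 and .228), so at the illustrative 5% level H0 is not rejected.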



SPSS Output, Chest Pain Example

gender * pain Crosstabulation

                                         pain
                                   no pain      pain      Total
gender   female   Count              516          37        553
                  Expected Count     510.2        42.8      553.0
                  % within gender    93.3%        6.7%      100.0%
         male     Count              474          46        520
                  Expected Count     479.8        40.2      520.0
                  % within gender    91.2%        8.8%      100.0%
Total             Count              990          83        1073
                  Expected Count     990.0        83.0      1073.0
                  % within gender    92.3%        7.7%      100.0%

Chi-Square Tests

                                 Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square               1.744 (a)   1    .187
Continuity Correction (b)        1.456       1    .228
Likelihood Ratio                 1.745       1    .186
Fisher's Exact Test                                             .209         .114
Linear-by-Linear Association     1.743       1    .187
N of Valid Cases                 1073

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 40.22.
b. Computed only for a 2x2 table


Fisher’s Exact Test

One of the assumptions of the χ² test is that the expected cell counts are all at
least 5.
When this is not satisfied, we can use an alternative test, known as Fisher’s
Exact Test.
The basic idea is similar: we wish to assess how different the expected
cell counts are from the observed ones. However, we do not use the same test
statistic, and we do not compare it to a χ² distribution.



Claritin Trial

Example 2 (Claritin and Nervousness)


Claritin is a drug for treating allergies. However, it has a side effect of
inducing nervousness in patients.
From a sample of 450 subjects, 188 of them were randomly assigned to take
Claritin, and the remaining were assigned to take the placebo.
The following data were recorded:

            Nervous    Not Nervous    Total
Claritin       4           184         188
Placebo        2           260         262



SPSS Output, Claritin Example

drug * nervousness Crosstabulation

                                         nervousness
                                   not nervous   nervous    Total
drug     Placebo    Count              260          2         262
                    Expected Count     258.5        3.5       262.0
                    % within drug      99.2%        0.8%      100.0%
         Claritin   Count              184          4         188
                    Expected Count     185.5        2.5       188.0
                    % within drug      97.9%        2.1%      100.0%
Total               Count              444          6         450
                    Expected Count     444.0        6.0       450.0
                    % within drug      98.7%        1.3%      100.0%

Chi-Square Tests

                                 Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square               1.549 (a)   1    .213
Continuity Correction (b)        .685        1    .408
Likelihood Ratio                 1.529       1    .216
Fisher's Exact Test                                             .241         .203
Linear-by-Linear Association     1.545       1    .214
N of Valid Cases                 450

a. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 2.51.
b. Computed only for a 2x2 table



Fisher’s Exact Test

Assumptions: Two binary categorical variables.
             Data obtained via randomisation.
Hypothesis: H0 : Two variables are independent.
            H1 : Two variables are dependent.
Test statistic: The test statistic is in fact the first cell count, as this determines
the others, given the margin totals.
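
To make the idea concrete, here is a minimal Python sketch for the Claritin data (standard library only). Under H0 and with the margin totals fixed, the first cell count follows a hypergeometric distribution. The sketch assumes the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one; the slides do not state which convention is used, so treat that choice, and the function and variable names, as assumptions.

```python
from math import comb

def hypergeom_pmf(k, row1, col1, n):
    """P(first cell = k) under independence, with all margin totals fixed."""
    return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

# Claritin example: first cell = nervous subjects in the Claritin group.
row1, col1, n = 188, 6, 450   # Claritin total, nervous total, sample size
observed_cell = 4

# Range of values the first cell can take given the margins.
support = range(max(0, row1 + col1 - n), min(row1, col1) + 1)
pmf = {k: hypergeom_pmf(k, row1, col1, n) for k in support}

# One-sided p-value: first cell at least as large as observed.
p_one_sided = sum(p for k, p in pmf.items() if k >= observed_cell)

# Two-sided p-value: tables no more probable than the observed table.
p_obs = pmf[observed_cell]
p_two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9))

print(f"one-sided: {p_one_sided:.3f}, two-sided: {p_two_sided:.3f}")
```

Under these assumptions the two p-values should agree with the "Fisher's Exact Test" row of the SPSS output for the Claritin example (.203 one-sided, .241 two-sided).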



Symmetric Nature of the Test

In Example 1, the response variable was chest pain and the explanatory
variable was gender.
In Example 2, the response was nervousness and the explanatory variable was
the drug used.
However, neither the χ² test nor Fisher’s exact test distinguishes
which variable is the response and which is the explanatory variable.



Quantification of the Effect

The χ² test identifies if there is a significant association between the two
categorical variables.
However, it does not tell us the strength and direction of the association.
In this section we shall focus on a specific kind of 2 × 2 table, where the
explanatory variable is exposure to a condition, and the response is whether
or not someone contracts a disease.
We shall refer to these as disease-exposure tables.



Disease Exposure Table

               Positive Outcome    Negative Outcome
Exposure              a                   b
No exposure           c                   d



Example of Exposure and Disease Variables

Here are some examples of exposure variables and their possible values.
Smoker versus non-smoker.
Hypertensive versus non-hypertensive.
Use of oral contraceptive (OC) versus non-use of OC.
High salt intake versus Low salt intake.
Here are some examples of disease variables and their possible values.
Lung cancer versus no lung cancer.
Cardiovascular disease (CVD) versus no CVD.



How Strong Is the Association?

If we find that there is an association between disease and exposure, then we
would be interested in knowing the strength of this association.
In other words, we want to know how different P(disease|exposure) is from
P(disease|no exposure).
If P(disease|exposure) is 10 times greater than P(disease|no exposure), we
would recommend that people reduce their exposure, e.g. stop smoking, reduce
salt, etc.
If P(disease|exposure) is only 1.05 times greater than P(disease|no exposure),
then we would not be so worried about controlling this exposure.



Point Estimates

               Positive Outcome (Disease)    Negative Outcome (No disease)
Exposure                   a                              b
No exposure                c                              d

Let p̂1 be our point estimate of P(disease|exposure):
\[
\hat{p}_1 = \frac{a}{a+b}
\]
Let p̂2 be our point estimate of P(disease|no exposure):
\[
\hat{p}_2 = \frac{c}{c+d}
\]



Prospective Study

1 Sample subjects from a population.
2 Either randomly assign the exposure variable to the subjects, or record their
  exposure variable status.
3 Follow the subjects over time to see if they develop the disease.
This is the main advantage:
p̂1 and p̂2 computed from the 2 × 2 table are valid estimates of
P(disease|exposure) and P(disease|no exposure).



Retrospective Study

1 Sample a group of cases (people with the disease).
2 Sample a group of controls (people without the disease).
3 Check each subject to see if they were exposed or not.
This is also known as a case-control study. These are the advantages:
Cheap
Quick
Fewer subjects involved, especially if disease is rare.
However, the huge disadvantage is that p̂1 and p̂2 computed from the 2 × 2 table
are not valid estimates of P(disease|exposure) and P(disease|no exposure), because
the numbers of cases and controls are fixed by the sampling design.



Retrospective or Prospective?

Sample 5000 OC users and 5000 non-OC users, and follow them for 15 years
to see if they develop any form of myocardial infarction.

Identify and sample breast cancer cases in mothers at a hospital. From the
same hospital, identify and obtain a sample of similarly aged mothers who
do not have breast cancer. Then record the age at which each mother had her
first child (recorded as greater than 30 or not).



Relative Risk and Difference of Proportions

In a prospective study, p̂1 and p̂2 are legitimate estimates of
P(disease|exposure) and P(disease|no exposure).
Hence we can measure the strength of association using
  - the difference of proportions:
    p̂1 − p̂2
  - the relative risk:
    p̂1 / p̂2
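
As an illustration, both measures can be computed for the Claritin trial of Example 2, which is prospective because subjects were randomly assigned to the drug before the outcome was observed. This is a minimal Python sketch; treating Claritin as the "exposure" and nervousness as the "disease" is a labelling choice made for the example.

```python
# Claritin trial (Example 2) laid out as a disease-exposure table.
a, b = 4, 184   # Claritin: nervous, not nervous
c, d = 2, 260   # Placebo:  nervous, not nervous

p1_hat = a / (a + b)   # estimate of P(nervous | Claritin)
p2_hat = c / (c + d)   # estimate of P(nervous | placebo)

diff_of_proportions = p1_hat - p2_hat
relative_risk = p1_hat / p2_hat

print(f"difference of proportions = {diff_of_proportions:.4f}")
print(f"relative risk = {relative_risk:.2f}")
```

A relative risk above 1 and a positive difference both point in the same direction: a higher estimated proportion of nervousness in the Claritin group.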



Definition of Odds

Definition 4 (Odds)
For a categorical variable with 2 possible values, define one of them to be the
“success” and the other to be the “failure”.
Let p be the probability of success, and 1 − p be the probability of failure.
Then the odds of success is defined to be
\[
\text{odds} = \frac{p}{1-p}
\]
Odds equal to 0 corresponds to probability of success equal to 0.
Odds equal to 1 corresponds to probability of success equal to 0.5.
Odds equal to ∞ corresponds to probability of success equal to 1.

Remember that all of the above are population quantities.
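
The correspondence between probabilities and odds in Definition 4 can be checked with a small Python sketch. The function names are illustrative; the inverse map p = odds / (1 + odds) is not stated on the slide but follows from solving the definition for p.

```python
def odds_from_prob(p):
    """Odds of success for a probability of success p, with 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1 (odds are infinite at p = 1)")
    return p / (1 - p)

def prob_from_odds(odds):
    """Inverse map: probability of success for given (finite) odds >= 0."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

for p in (0.0, 0.2, 0.5, 0.9):
    print(f"p = {p:.1f}  ->  odds = {odds_from_prob(p):.3f}")
```

As p approaches 1 the odds grow without bound, which is the sense in which odds of ∞ correspond to a probability of 1.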



Computing Odds Ratio (OR)

In a disease exposure table, the odds of disease for exposed individuals is
estimated as
a/b
In a disease exposure table, the odds of disease for unexposed individuals is
estimated as
c/d
In a disease exposure table, the odds ratio is computed as
\[
\text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}
\]
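
As a quick check of the two equivalent forms, here is a minimal Python sketch using the Claritin counts from Example 2 (Claritin treated as the exposure and nervousness as the disease, which is a labelling choice for this illustration):

```python
from math import isclose

# Claritin trial (Example 2) as a disease-exposure table.
a, b = 4, 184   # exposed (Claritin): nervous, not nervous
c, d = 2, 260   # unexposed (placebo): nervous, not nervous

odds_exposed = a / b
odds_unexposed = c / d

or_from_odds = odds_exposed / odds_unexposed   # (a/b) / (c/d)
or_cross_product = (a * d) / (b * c)           # ad / bc

print(f"OR = {or_cross_product:.3f}")
print("Both forms agree:", isclose(or_from_odds, or_cross_product))
```

The cross-product form ad/bc is usually the more convenient one for hand calculation.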



Implications of OR

If an estimated OR is 1, it means that

p̂1 = p̂2

If an estimated OR is more than 1, it means that

p̂1 > p̂2

If an estimated OR is less than 1, it means that

p̂1 < p̂2



Why So Indirect?

The good thing about odds ratios is that it is legitimate to compute them whether
we have a prospective or retrospective study. The reason is that ad/bc does not
change when a whole row or a whole column of the table is multiplied by a constant.
Hence no matter what the sampling design (retrospective or prospective), we
can compare the population p1 to p2 via the estimated odds ratio!
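
The following Python sketch illustrates why the sampling design does not matter for the OR. It starts from the Claritin counts of Example 2 and then multiplies the "disease" column by 10 to mimic a design that deliberately over-samples cases. The factor of 10 and the resulting counts are hypothetical, chosen only for illustration, and are not data from any study.

```python
from math import isclose

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

# Original table (Example 2): exposed row (a, b), unexposed row (c, d).
a, b, c, d = 4, 184, 2, 260

# Hypothetical over-sampling of cases: multiply the disease column by k.
k = 10
a_cc, c_cc = k * a, k * c

or_original = odds_ratio(a, b, c, d)
or_oversampled = odds_ratio(a_cc, b, c_cc, d)
rr_original = relative_risk(a, b, c, d)
rr_oversampled = relative_risk(a_cc, b, c_cc, d)

print(f"OR: {or_original:.3f} vs {or_oversampled:.3f}")   # unchanged
print(f"RR: {rr_original:.3f} vs {rr_oversampled:.3f}")   # changes
```

The odds ratio is the same in both tables, while the relative risk moves, which is why the relative risk is reserved for prospective designs.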



Computing a CI for OR

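It is difficult to compute the standard error for the OR directly. Instead, the
calculation is done on the ln scale before converting back.
The estimated standard error for ln(OR) is
\[
\text{se}(\ln(\text{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
\]
The formula for a (1 − α)100% confidence interval is
\[
\exp\left( \ln\frac{ad}{bc} \pm q_{1-\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \right)
\]
where q_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.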



Smoking and Lung Cancer
Example 3
Consider the following data from a case-control study:

               Cancer    No cancer
Smoker          1301       1205
Non-smoker        56        152

OR = 2.93
The estimated variance of ln(OR) is
\[
\frac{1}{1301} + \frac{1}{56} + \frac{1}{1205} + \frac{1}{152} = 0.026
\]
A 95% CI is given by
\[
e^{\ln(2.93) \pm 1.96 \times \sqrt{0.026}} = (2.14, 4.02)
\]
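
The calculation in Example 3 can be reproduced with a short Python sketch (standard library only). The function name `or_confidence_interval` is an illustrative assumption; the normal quantile comes from `statistics.NormalDist`, which gives approximately 1.96 for a 95% interval.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def or_confidence_interval(a, b, c, d, level=0.95):
    """Odds ratio and its CI, computed on the ln scale and transformed back."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    lower = exp(log(odds_ratio) - q * se_log_or)
    upper = exp(log(odds_ratio) + q * se_log_or)
    return odds_ratio, lower, upper

# Example 3: smokers (cancer, no cancer), non-smokers (cancer, no cancer).
odds_ratio, lower, upper = or_confidence_interval(1301, 1205, 56, 152)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the whole interval lies above 1, the data are consistent with higher odds of cancer among smokers than among non-smokers.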



Beyond 2 × 2 tables

The χ² test can be extended to tables larger than 2 by 2.
In general, suppose that we have two categorical variables that define a table
with r rows and c columns.
The expected count in each cell is computed in exactly the same way.
The only difference is that the reference distribution is the χ² distribution
with (r − 1)(c − 1) degrees of freedom.
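
A minimal Python sketch of the general r × c calculation is given below (standard library only; the function name is an illustrative assumption). It computes the uncorrected statistic and the degrees of freedom for a table of any size. It is applied here to the 2 × 2 counts of Example 1 purely as a sanity check, since those are counts that appear in these slides.

```python
def chi2_statistic(table):
    """Uncorrected chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / n          # same formula as in the 2 x 2 case
            chi2 += (table[i][j] - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Sanity check on Example 1 (2 x 2), where df should be (2-1)(2-1) = 1.
chi2, df = chi2_statistic([[46, 474], [37, 516]])
print(f"chi2 = {chi2:.3f}, df = {df}")
```

For a larger table only the input changes; the p-value is then the right tail probability of the χ² distribution with the returned degrees of freedom, which in this course is read from the SPSS output.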



Chest Pain versus Race
Suppose we collected the following data



What You Should Know

How to retrieve the test statistic for 2 × 2 and r × c tables from SPSS.
How to identify case-control and cohort studies.
Computing confidence intervals for OR.
Interpreting these confidence intervals.
