Categorical Data Analysis: Topic 11 ST1232 Statistics For Life Sciences


1 Introduction
2 Chi-squared (χ²) Test for 2 × 2 Contingency Tables
    How was Data Collected?
    χ² Test
    Small Sample Sizes
    Response versus Explanatory
3 Measures of Association for Categorical Data
    Prospective Versus Retrospective Studies
    Measures of Association For Prospective Studies
    Measure of Association For Prospective And Retrospective Studies: Odds Ratio
    Confidence Interval for OR
4 r × c tables
5 Summary


Categorical Data Analysis
The purpose of this chapter is to identify any association between two categorical
variables.
Example 1 (Chest Pain And Gender)
Suppose that 1073 patients at NUH who are at high risk for cardiovascular disease
(CVD) were sampled for a study that recorded the onset of severe chest pain
for each subject.
The 1073 patients were queried on two aspects:
- Have they experienced the onset of severe chest pain in the preceding 6
  months? (yes/no)
- Gender? (male/female)

          Chest Pain   No Chest Pain   Total
Male          46            474          520
Female        37            516          553
Total         83            990         1073

Conditional Proportions and Associations
The two categorical variables are gender and presence/absence of chest pain.
We can compute the conditional proportions (row percentages) for the table in
Example 1.

          Chest Pain   No Chest Pain   Total
Male         8.8%          91.2%        100%
Female       6.7%          93.3%        100%

8.8% is a point estimate of P(chest pain|male). Similarly, 6.7% is a point
estimate of P(chest pain|female). We are interested in knowing if these two
population quantities are similar.
If they are (very) different from one another, we say that there is an
association between gender and chest pain. If they are similar, we say that
there is no association, or that the two variables are independent.
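The conditional proportions above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary layout and the helper name `conditional_proportions` are illustrative choices, not part of the course material.

```python
# Observed counts from Example 1.
# Each row is (chest pain, no chest pain) for one gender.
observed = {
    "Male": (46, 474),
    "Female": (37, 516),
}


def conditional_proportions(row):
    """Return each count in a row as a proportion of the row total."""
    total = sum(row)
    return tuple(count / total for count in row)


proportions = {gender: conditional_proportions(row) for gender, row in observed.items()}

for gender, (pain, no_pain) in proportions.items():
    print(f"{gender}: P(chest pain) = {pain:.1%}, P(no chest pain) = {no_pain:.1%}")
```

Each row is divided by its own total, which is why every row of the percentage table sums to 100%.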

Independence and Dependence
Definition 1 (Independence and Dependence (Association))

Two categorical variables are independent if the population conditional
distributions for one of them are identical at each category of the other.
The variables are dependent, or associated, if the conditional distributions
are not identical.
We shall now learn about a hypothesis test for association between two categorical
variables.


Data in a 2 × 2 Table
In Example 1, data were collected via a random sample of patients at NUH,
and each patient was then classified into one of the 4 “cells” in the 2 × 2 table.
It is also possible that the data were collected in one of the following ways:
- Collecting based on gender (explanatory variable)
    - A random sample of male patients who are at high risk for CVD was obtained.
    - A random sample of female patients who are at high risk for CVD was obtained.
    - Each subject was queried on whether he/she had experienced severe chest pain in
      the preceding 6 months.
- Collecting based on the chest pain status (response variable)
    - A random sample of people with chest pain was obtained.
    - A random sample of people without chest pain was obtained.
    - Each subject was queried on their gender.
No matter how the data are collected, as long as simple random samples were
obtained, the test in this section is valid.
The difference will be important in the next section, when we wish to
quantify the association between two categorical variables.


Expected Cell Counts
The test we are about to learn has the following hypotheses:
H0 : The two variables are independent
H1 : The two variables are dependent
In order to compute the test statistic, we shall have to compute the expected
cell counts, under the assumption of independence, and compare them to the
observed cell counts.
Definition 2 (Expected Cell Counts)

For a particular cell, the expected cell count is
\[
\text{Expected cell count} = \frac{\text{Row total} \times \text{Column total}}{\text{Total sample size}}
\]

Chest Pain Expected Cell Counts
In Example 1, the expected cell counts are:

          Chest Pain   No Chest Pain   Total
Male         40.2          479.8         520
Female       42.8          510.2         553
Total        83            990          1073
Notice that expected cell counts are not necessarily integers.
The question we need to answer is, how different are the expected cell counts
from the observed cell counts?
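The expected counts in the table above follow directly from Definition 2. The sketch below applies that formula to the Example 1 counts; the function name `expected_counts` and the list-of-rows layout are assumptions made for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Example 1: rows are (Male, Female), columns are (chest pain, no chest pain).
observed = [[46, 474], [37, 516]]
expected = expected_counts(observed)

for label, row in zip(("Male", "Female"), expected):
    print(label, [round(value, 1) for value in row])
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on the arithmetic.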

Test Statistic
The χ2 test statistic summarises how far the observed cell counts are from the
expected cell counts, under the null hypothesis of independence.
Definition 3 (χ2 Test Statistic)

The general formula is
\[
\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}
\]
where the sum is taken over all cells of the table.
In other words,
For each cell, square the difference between the observed and expected
counts, and then divide that squared value by the expected count.
After calculating this term for every cell, sum the terms.
There are variations on this formula throughout this chapter; be aware of
when to use which variation.
The χ² test statistic is never negative; it equals 0 only when every observed
count equals its expected count.
A larger value of the test statistic gives stronger evidence against the null
hypothesis.
Steps for testing independence in a 2 × 2 table
Assumptions: Two categorical variables.
             Data obtained via randomisation.
             Expected cell counts greater than or equal to 5 for all cells.
Hypotheses: H0 : The two variables are independent.
            H1 : The two variables are dependent.
Test statistic:
\[
\chi^2 = \sum \frac{(|\text{observed count} - \text{expected count}| - 0.5)^2}{\text{expected count}}
\]
This is known as the χ² statistic with continuity correction.
p-value: The right-tail probability of the χ² distribution with 1 degree of
freedom.
Conclusion: Reject or do not reject H0 according to the pre-determined α-level.
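The steps above can be sketched in Python for the chest pain data. This is an illustrative sketch rather than the course's prescribed procedure: the function name `chi_squared_2x2` is hypothetical, and the p-value uses the identity that a χ² variable with 1 degree of freedom is the square of a standard normal, so its right-tail probability at x equals erfc(√(x/2)).

```python
import math


def chi_squared_2x2(table, continuity=True):
    """Chi-squared statistic and p-value (1 degree of freedom) for a 2 x 2 table.

    With continuity=True the formula (|O - E| - 0.5)^2 / E is applied exactly
    as written in the steps above; it is not guarded for |O - E| < 0.5.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if continuity:
                diff -= 0.5
            statistic += diff ** 2 / expected

    # Right-tail probability of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


observed = [[46, 474], [37, 516]]  # Example 1
corrected = chi_squared_2x2(observed, continuity=True)
uncorrected = chi_squared_2x2(observed, continuity=False)
print(f"with correction:    chi2 = {corrected[0]:.3f}, p = {corrected[1]:.3f}")
print(f"without correction: chi2 = {uncorrected[0]:.3f}, p = {uncorrected[1]:.3f}")
```

The two results correspond to the "Continuity Correction" and "Pearson Chi-Square" rows of the SPSS output for this example. At α = 0.05, neither p-value leads to rejecting H0.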

SPSS Output, Chest Pain Example
gender * pain Crosstabulation
                                      pain
                                 no pain    pain     Total
gender  female  Count              516       37       553
                Expected Count     510.2     42.8     553.0
                % within gender    93.3%     6.7%     100.0%
        male    Count              474       46       520
                % within gender    91.2%     8.8%     100.0%
Total           Count              990       83       1073
                % within gender    92.3%     7.7%     100.0%

Chi-Square Tests
                               Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square (a)         1.744    1    .187
Continuity Correction (b)      1.456    1    .228
Likelihood Ratio               1.745    1    .186
Fisher's Exact Test                                        .209         .114
Linear-by-Linear Association   1.743    1    .187
N of Valid Cases               1073

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 40.22.
b. Computed only for a 2x2 table

Fisher’s Exact Test
One of the assumptions of the χ² test is that the expected cell counts are all at
least 5.
When this is not satisfied, we can use an alternative test, known as Fisher’s
Exact Test.
The basic idea is similar: we wish to assess how different the observed cell
counts are from the expected ones, but we do not use the same test statistic,
and we do not compare it to a χ² distribution.

Claritin Trial
Example 2 (Claritin and Nervousness)

Claritin is a drug for treating allergies. However, it has a side effect of
inducing nervousness in patients.
From a sample of 450 subjects, 188 were randomly assigned to take
Claritin, and the remaining 262 were assigned to take the placebo.
The following data were recorded:

            Nervous   Not Nervous   Total
Claritin       4          184        188
Placebo        2          260        262

SPSS Output, Claritin Example
drug * nervousness Crosstabulation
                                  nervousness
                             not nervous   nervous    Total
drug   Placebo   Count           260           2        262
                 % within drug   99.2%        0.8%     100.0%
       Claritin  Count           184           4        188
                 % within drug   97.9%        2.1%     100.0%
Total            Count           444           6        450
                 % within drug   98.7%        1.3%     100.0%

Chi-Square Tests
                               Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square (a)         1.549    1    .213
Continuity Correction (b)      .685     1    .408
Likelihood Ratio               1.529    1    .216
Fisher's Exact Test                                        .241         .203
Linear-by-Linear Association   1.545    1    .214
N of Valid Cases               450

a. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 2.51.
b. Computed only for a 2x2 table

Fisher’s Exact Test
Assumptions: Two binary categorical variables.
             Data obtained via randomisation.
Hypotheses: H0 : The two variables are independent.
            H1 : The two variables are dependent.
Test statistic: The test statistic is in fact the first cell count, as this determines
the others, given the margin totals.
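To make the role of the first cell count concrete, the sketch below computes Fisher's exact p-values for the Claritin data. It assumes the standard construction, which the slides do not spell out: with the margins fixed, the first cell follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one. The function name `fisher_exact_2x2` is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (two-sided, one-sided) exact p-values for the table [[a, b], [c, d]]."""
    row1 = a + b          # first row total
    col1 = a + c          # first column total
    n = a + b + c + d     # total sample size

    def prob(k):
        # Hypergeometric probability that the first cell equals k,
        # given the fixed row and column totals.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    p_observed = prob(a)

    # Two-sided: all tables at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    two_sided = sum(
        prob(k) for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    # One-sided: the smaller of the two tail probabilities.
    upper_tail = sum(prob(k) for k in range(a, highest + 1))
    lower_tail = sum(prob(k) for k in range(lowest, a + 1))
    return two_sided, min(upper_tail, lower_tail)


# Example 2: Claritin row (4 nervous, 184 not), placebo row (2 nervous, 260 not).
two_sided, one_sided = fisher_exact_2x2(4, 184, 2, 260)
print(f"two-sided p = {two_sided:.3f}, one-sided p = {one_sided:.3f}")
```

Under these assumptions the results agree with the "Fisher's Exact Test" row of the SPSS output for the Claritin example.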


Symmetric Nature of the Test
In Example 1, the response variable was chest pain and the explanatory
variable was gender.
In Example 2, the response was nervousness and the explanatory variable was
the drug used.
However, neither the χ² test nor Fisher’s Exact Test distinguishes which
variable is the response and which is the explanatory variable.


Quantification of the Effect
The χ² test identifies whether there is a significant association between the two
categorical variables.
However, it does not tell us the strength or direction of the association.
In this section we shall focus on a specific kind of 2 × 2 table, where the
explanatory variable is exposure to a condition, and the response is whether
or not someone contracts a disease.
We shall refer to these as disease-exposure tables.

Disease Exposure Table
              Positive Outcome   Negative Outcome
Exposure             a                  b
No exposure          c                  d

Example of Exposure and Disease Variables
Here are some examples of exposure variables and their possible values.
Smoker versus non-smoker.
Hypertensive versus non-hypertensive.
Use of oral contraceptive (OC) versus non-use of OC.
High salt intake versus Low salt intake.
Here are some examples of disease variables and their possible values.
Lung cancer versus no lung cancer.
Cardiovascular disease (CVD) versus no CVD.

How Strong Is the Association?
If we find that there is an association between disease and exposure, then we
would be interested in knowing the strength of this association.
In other words, we want to know how different P(disease|exposure) is from
P(disease|no exposure).
If P(disease|exposure) is 10 times greater than P(disease|no exposure), we
would recommend people to reduce their exposure, e.g. stop smoking, reduce
salt, etc.
If P(disease|exposure) is only 1.05 times greater than P(disease|no exposure),
then we would not be so worried about controlling this exposure.

Point Estimates
              Positive Outcome (Disease)   Negative Outcome (No disease)
Exposure                  a                             b
No exposure               c                             d

Let p̂1 be our point estimate of P(disease|exposure):
\[
\hat{p}_1 = \frac{a}{a + b}
\]
Let p̂2 be our point estimate of P(disease|no exposure):
\[
\hat{p}_2 = \frac{c}{c + d}
\]


Prospective Study
1 Sample subjects from a population.
2 Either randomly assign the exposure variable to the subjects, or record their
exposure variable status.
3 Follow the subjects over time to see if they develop the disease.
The main advantage is that p̂1 and p̂2 from the 2 × 2 table are valid estimates of
P(disease|exposure) and P(disease|no exposure).

Retrospective Study
1 Sample a group of cases (people with the disease).
2 Sample a group of controls (people without the disease).
3 Check each subject to see if they were exposed or not.
This is also known as a case-control study. These are the advantages:
Cheap
Quick
Fewer subjects involved, especially if disease is rare.
However, the huge disadvantage is that we cannot obtain valid estimates of
P(disease|exposure) and P(disease|no exposure) from the 2 × 2 table, because the
numbers of cases and controls are fixed by the sampling design rather than
reflecting how common the disease is in the population.

Retrospective or Prospective?
- Sample 5000 OC users and 5000 non-OC users, and follow them for 15 years
  to see if they develop any form of myocardial infarction.
- Identify and sample breast cancer cases among mothers at a hospital. From the
  same hospital, identify and obtain a sample of similarly aged mothers who
  do not have breast cancer. Then record the age at which each mother had her
  first child (recorded as greater than 30 or not).


Relative Risk and Difference of Proportions
In a prospective study, we can legitimately estimate P(disease|exposure) and
P(disease|no exposure) using p̂1 and p̂2.
Hence we can measure the strength of association using
- the difference of proportions:
\[
\hat{p}_1 - \hat{p}_2
\]
- the relative risk:
\[
\hat{p}_1 / \hat{p}_2
\]
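As a worked illustration, the sketch below computes both measures for the Claritin trial of Example 2, which is prospective because the drug was randomly assigned. Applying the measures to that example is a choice made here for illustration, and the function name `risk_measures` is hypothetical.

```python
def risk_measures(a, b, c, d):
    """Return (p1_hat, p2_hat, difference of proportions, relative risk).

    a, b: exposed subjects with and without the outcome
    c, d: unexposed subjects with and without the outcome
    """
    p1_hat = a / (a + b)
    p2_hat = c / (c + d)
    return p1_hat, p2_hat, p1_hat - p2_hat, p1_hat / p2_hat


# Example 2: exposure = Claritin, outcome = nervousness.
p1_hat, p2_hat, difference, relative_risk = risk_measures(4, 184, 2, 260)
print(f"p1_hat = {p1_hat:.4f}, p2_hat = {p2_hat:.4f}")
print(f"difference of proportions = {difference:.4f}")
print(f"relative risk = {relative_risk:.2f}")
```

The two measures can tell different stories: here the estimated risk of nervousness is almost three times higher on Claritin, yet the absolute difference is only a little over one percentage point. Recall also that the tests for Example 2 did not find a significant association.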


Definition of Odds
Definition 4 (Odds)
For a categorical variable with 2 possible values, define one of them to be the
“success” and the other to be the “failure”.
Let p be the probability of success, and 1 − p be the probability of failure.
Then the odds of success is defined to be
\[
\text{odds} = \frac{p}{1 - p}
\]
Odds equal to 0 corresponds to probability of success equal to 0.
Odds equal to 1 corresponds to probability of success equal to 0.5.
Odds equal to ∞ corresponds to probability of success equal to 1.
Remember that all of the above are population quantities.
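The correspondence between odds and probability can be checked with a small sketch. The two function names are hypothetical, and the inverse formula p = odds / (1 + odds) is obtained by rearranging the definition above.

```python
import math


def odds_from_probability(p):
    """Odds of success, p / (1 - p); returns infinity when p equals 1."""
    if p == 1:
        return math.inf
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert the definition: p = odds / (1 + odds)."""
    if math.isinf(odds):
        return 1.0
    return odds / (1 + odds)


for p in (0.0, 0.5, 0.8, 1.0):
    print(f"p = {p}: odds = {odds_from_probability(p)}")
```

Because the mapping is increasing in p, comparing two odds is equivalent to comparing the two underlying probabilities.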

Computing Odds Ratio (OR)
In a disease exposure table, the odds of disease for exposed individuals is
estimated as
\[
a / b
\]
In a disease exposure table, the odds of disease for unexposed individuals is
estimated as
\[
c / d
\]
In a disease exposure table, the odds ratio is computed as
\[
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}
\]

Implications of OR
If an estimated OR is 1, it means that
p̂1 = p̂2
If an estimated OR is more than 1, it means that
p̂1 > p̂2
If an estimated OR is less than 1, it means that
p̂1 < p̂2

Why So Indirect?
The good thing about the OR is that it is legitimate to compute it whether
we have a prospective or a retrospective study.
Hence no matter what the sampling design (retrospective or prospective), we
can compare the population p1 to p2 via the estimated odds ratio!
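One way to see why the sampling design does not matter is that the odds ratio comes out the same whether we compare the odds of disease across exposure groups (rows) or the odds of exposure across disease groups (columns); both reduce to ad/bc. The sketch below checks this numerically with the smoking and lung cancer counts that appear in Example 3; the variable names are illustrative.

```python
import math

# Counts from Example 3 (case-control study):
# a = smokers with cancer, b = smokers without cancer,
# c = non-smokers with cancer, d = non-smokers without cancer.
a, b, c, d = 1301, 1205, 56, 152

# Row view: odds of disease among exposed versus unexposed subjects.
or_by_rows = (a / b) / (c / d)

# Column view: odds of exposure among cases versus controls.
or_by_columns = (a / c) / (b / d)

# Both views reduce to the cross-product ratio ad / bc.
cross_product = (a * d) / (b * c)

print(f"{or_by_rows:.2f} {or_by_columns:.2f} {cross_product:.2f}")
```

In a case-control study only the column view is estimated validly, but since it equals the row view, the estimated OR still describes how the odds of disease differ with exposure.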


Computing a CI for OR
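It is difficult to compute the standard error of the OR directly. Instead, the
confidence interval is computed on the ln scale and then converted back.
The estimated standard error for ln(OR) is
\[
\mathrm{se}(\ln(\mathrm{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
\]
The formula for a (1 − α)100% confidence interval is
\[
\exp\left( \ln\frac{ad}{bc} \pm q_{1-\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \right)
\]
where q_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution
(1.96 for a 95% interval).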

Smoking and Lung Cancer
Example 3
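Consider the following data from a case-control study:

               Cancer   No cancer
Smoker          1301      1205
Non-smoker        56       152

The estimated odds ratio is
\[
\mathrm{OR} = \frac{1301 \times 152}{1205 \times 56} = 2.93
\]
The estimated variance of ln(OR) is
\[
\frac{1}{1301} + \frac{1}{56} + \frac{1}{1205} + \frac{1}{152} = 0.026
\]
A 95% CI is given by
\[
e^{\ln(2.93) \pm 1.96 \times \sqrt{0.026}} = (2.14, 4.02)
\]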
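The calculation in Example 3 can be reproduced with the sketch below. The function name `odds_ratio_ci` and its default `z=1.96` (the standard normal quantile for a 95% interval) are illustrative choices.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, se of ln(OR), lower limit, upper limit) for a 2 x 2 table."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Build the interval on the ln scale, then convert back with exp.
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper


# Example 3: smokers (1301 cancer, 1205 no cancer), non-smokers (56, 152).
odds_ratio, se_log_or, lower, upper = odds_ratio_ci(1301, 1205, 56, 152)
print(f"OR = {odds_ratio:.2f}, variance of ln(OR) = {se_log_or ** 2:.3f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval lies entirely above 1, the data suggest that the odds of lung cancer are higher for smokers than for non-smokers.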


Beyond 2 × 2 tables
The χ² test can be extended to tables larger than 2 × 2.
In general, suppose that we have r rows and c columns that define two
categorical random variables.
The expected count in each cell is computed in exactly the same way.
The only difference is that the reference distribution is the χ² distribution with
(r − 1)(c − 1) degrees of freedom.
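A minimal sketch of the general case is given below. It computes the χ² statistic from Definition 3 (without continuity correction) and the degrees of freedom for a table of any size; the function name `chi_squared_rxc` is hypothetical. To avoid inventing data, it is applied to the 2 × 2 table from Example 1, where it reduces to the earlier result with 1 degree of freedom.

```python
def chi_squared_rxc(table):
    """Return (chi-squared statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence, as in the 2 x 2 case.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Example 1 treated as a general r x c table with r = c = 2.
statistic, degrees_of_freedom = chi_squared_rxc([[46, 474], [37, 516]])
print(f"chi2 = {statistic:.3f}, df = {degrees_of_freedom}")
```

The p-value is the right-tail probability of the χ² distribution with the returned degrees of freedom. It can be read from statistical software such as the SPSS output, or computed with a library routine such as `scipy.stats.chi2.sf`.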

Chest Pain versus Race
Suppose we collected the following data


What You Should Know
How to retrieve the test statistic for 2 × 2 and r × c tables from SPSS.
How to identify case-control (retrospective) and cohort (prospective) studies.
Computing confidence intervals for OR.
Interpreting these confidence intervals.
