Categorical Data Analysis: Topic 11 ST1232 Statistics For Life Sciences
4 r × c tables
5 Summary
The two categorical variables are gender and presence/absence of chest pain.
For Example 1, we can compute the conditional proportions of chest pain within
each gender.
          Chest Pain    No Chest Pain    Total
Male         8.8%           91.2%         100%
Female       6.7%           93.3%         100%
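As a small check, the row percentages above can be recomputed from the observed counts reported in the SPSS crosstab for this example (46 of 520 males and 37 of 553 females with chest pain). The Python sketch below is illustrative only; the names `counts`, `row_percentages` and `conditional` are assumptions and do not come from the slides.

```python
# Observed counts for Example 1, taken from the SPSS crosstab in these slides.
# Each entry is (chest pain, no chest pain).
counts = {
    "Male": (46, 474),
    "Female": (37, 516),
}


def row_percentages(row):
    """Return each cell as a percentage of its row total."""
    total = sum(row)
    return [100 * cell / total for cell in row]


# Conditional proportions of pain / no pain within each gender.
conditional = {group: row_percentages(row) for group, row in counts.items()}

for group, (pain, no_pain) in conditional.items():
    print(f"{group:<7} {pain:5.1f}% {no_pain:5.1f}%")
```

Rounded to one decimal place, these values agree with the table above.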
In order to compute the test statistic, we must compute the expected cell
counts under the assumption of independence and compare them to the observed
cell counts.
                                       pain
gender                         no pain    pain      Total
female   Count                   516       37       553
         Expected Count          510.2     42.8     553.0
         % within gender         93.3%     6.7%     100.0%
male     Count                   474       46       520
         Expected Count          479.8     40.2     520.0
         % within gender         91.2%     8.8%     100.0%
Total    Count                   990       83       1073
         Expected Count          990.0     83.0     1073.0
         % within gender         92.3%     7.7%     100.0%
Chi-Square Tests
                                Value    df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                               (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square (a)          1.744    1     .187
Continuity Correction (b)       1.456    1     .228
Linear-by-Linear Association    1.743    1     .187
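The expected counts and the first two rows of the Chi-Square Tests output can be reproduced by hand. The sketch below uses only the observed counts from the crosstab above; the variable names are assumptions, and the p-value helper relies on the fact that, for 1 degree of freedom, the upper tail of the χ² distribution equals erfc(√(x/2)). It is not a description of how SPSS computes its output.

```python
import math

# Observed counts from the crosstab above: rows are gender,
# columns are (no pain, pain).
observed = [
    [516, 37],   # female
    [474, 46],   # male
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pair every observed cell with its expected count.
cells = [
    (o, e)
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in cells)

# Continuity-corrected statistic for a 2 x 2 table: shrink |O - E| by 0.5.
chi_sq_cc = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)


def p_value_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


print("expected counts:", [[round(e, 1) for e in row] for row in expected])
print("Pearson chi-square:", round(chi_sq, 3), "p =", round(p_value_df1(chi_sq), 3))
print("continuity corrected:", round(chi_sq_cc, 3), "p =", round(p_value_df1(chi_sq_cc), 3))
```

Rounded to three decimals, the two statistics and their p-values agree with the Pearson Chi-Square and Continuity Correction rows above.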
One of the assumptions on slide 14 is that the expected cell counts are all at
least 5.
When this is not satisfied, we can use an alternative test, known as Fisher’s
Exact Test.
The basic idea is similar, that we wish to assess how different the expected
cell counts are from the observed, but we do not use the same test statistic,
and we do not compare it to a χ² distribution.
                                     nervousness
drug                           not nervous    nervous    Total
Placebo    Count                   260           2        262
           Expected Count          258.5         3.5      262.0
           % within drug           99.2%         0.8%     100.0%
Claritin   Count                   184           4        188
           Expected Count          185.5         2.5      188.0
           % within drug           97.9%         2.1%     100.0%
Total      Count                   444           6        450
           Expected Count          444.0         6.0      450.0
           % within drug           98.7%         1.3%     100.0%
Chi-Square Tests
                                Value    df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                               (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square (a)          1.549    1     .213
Continuity Correction (b)       .685     1     .408
Linear-by-Linear Association    1.545    1     .214
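For this table, two of the expected counts are below 5, so Fisher's Exact Test is the appropriate choice. A minimal sketch of the idea follows: with all row and column totals held fixed, the count in one cell follows a hypergeometric distribution, and the p-value sums the probabilities of tables at least as unlikely as the observed one. That two-sided rule is one common convention and is an assumption here; the slides do not state which definition SPSS uses, and all names in the code are illustrative.

```python
from math import comb

# Observed counts from the crosstab above: rows are drug,
# columns are (not nervous, nervous).
observed = [
    [260, 2],   # Placebo
    [184, 4],   # Claritin
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, to check the "at least 5" assumption.
expected = [[r * c / n for c in col_totals] for r in row_totals]
small_cells = sum(e < 5 for row in expected for e in row)

nervous_total = col_totals[1]
placebo_total, claritin_total = row_totals


def table_probability(k):
    """Hypergeometric probability that k nervous subjects are on placebo,
    given the fixed row and column totals."""
    return (
        comb(placebo_total, k)
        * comb(claritin_total, nervous_total - k)
        / comb(n, nervous_total)
    )


# All values of k that are possible with these margins.
k_min = max(0, nervous_total - claritin_total)
k_max = min(nervous_total, placebo_total)
probs = {k: table_probability(k) for k in range(k_min, k_max + 1)}

k_observed = observed[0][1]
p_observed = probs[k_observed]

# Two-sided p-value (assumed convention): add up every table whose
# probability does not exceed that of the observed table. The small
# relative tolerance guards against floating-point ties.
p_two_sided = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-7))

print("cells with expected count below 5:", small_cells)
print("two-sided exact p-value:", round(p_two_sided, 3))
```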
In Example 1, the response variable was chest pain and the explanatory
variable was gender.
In Example 2, the response was nervousness and the explanatory variable was
the drug used.
However, neither the χ² test nor Fisher’s Exact Test distinguishes which
variable is the response and which is the explanatory variable.
Here are some examples of exposure variables and their possible values.
Smoker versus non-smoker.
Hypertensive versus non-hypertensive.
Use of oral contraceptive (OC) versus non-use of OC.
High salt intake versus low salt intake.
Here are some examples of disease variables and their possible values.
Lung cancer versus no lung cancer.
Cardiovascular disease (CVD) versus no CVD.
Sample 5000 OC users and 5000 non-OC users, and follow them for 15 years
to see if they develop any form of myocardial infarction.
Identify and sample breast cancer cases among mothers at a hospital. From the
same hospital, identify and obtain a sample of similarly aged mothers who do
not have breast cancer. Then record the age at which each mother had her
first child (recorded as greater than 30 or not).
Definition 4 (Odds)
For a categorical variable with 2 possible values, define one of them to be the
“success” and the other to be the “failure”.
Let p be the probability of success, and 1 − p be the probability of failure.
Then the odds of success is defined to be
$$\text{odds} = \frac{p}{1 - p}$$
Odds equal to 0 corresponds to probability of success equal to 0.
Odds equal to 1 corresponds to probability of success equal to 0.5.
Odds equal to ∞ corresponds to probability of success equal to 1.
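The three correspondences above follow directly from the definition. As a small illustration, here is a Python sketch of the conversion in both directions; the function names `odds` and `probability` are assumptions made for this example.

```python
import math


def odds(p):
    """Odds of success for a success probability p in [0, 1]."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    if p == 1:
        return math.inf  # certain success: the odds are infinite
    return p / (1 - p)


def probability(odds_value):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    if odds_value == math.inf:
        return 1.0
    return odds_value / (1 + odds_value)


for p in (0, 0.5, 1):
    print(f"p = {p}: odds = {odds(p)}")
```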
$$\text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}$$
p̂1 = p̂2
A useful property of the OR is that it is legitimate to compute it whether
we have a prospective or a retrospective study.
Hence, whatever the sampling design (retrospective or prospective), we
can compare the population p1 to p2 via the estimated odds ratio.
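One way to see why the design does not matter: ad/bc is unchanged when every count in one column (or one row) is multiplied by the same factor, which is roughly what happens when diseased and non-diseased subjects are sampled at different rates. The sketch below illustrates this with the smoking and cancer counts from the worked example in these slides; the sampling factor of 10 is purely hypothetical, and the function names are assumptions.

```python
from fractions import Fraction


def odds_ratio(a, b, c, d):
    """Exact odds ratio ad / bc for a 2 x 2 table [[a, b], [c, d]]."""
    return Fraction(a * d, b * c)


def row_proportions(a, b, c, d):
    """Proportion falling in the first column, within each row."""
    return a / (a + b), c / (c + d)


# Counts from the smoking example: rows are smoker / non-smoker,
# columns are cancer / no cancer.
original = (1301, 1205, 56, 152)

# Hypothetical design that samples ten times as many cancer cases:
# only the first column (a and c) is scaled.
factor = 10
a, b, c, d = original
rescaled = (factor * a, b, factor * c, d)

print("row proportions, original:", row_proportions(*original))
print("row proportions, rescaled:", row_proportions(*rescaled))
print("OR, original:", float(odds_ratio(*original)))
print("OR, rescaled:", float(odds_ratio(*rescaled)))
```

The within-row proportions change with the sampling factor, while the odds ratio does not.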
The standard error of the OR itself is difficult to compute. Instead, the
interval is computed on the ln scale and then converted back.
The estimated standard error for ln(OR) is
$$\text{se}(\ln(\text{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
The formula for a (1 − α)100% confidence interval is
$$\exp\left( \ln\frac{ad}{bc} \pm q_{1-\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \right)$$
              Cancer    No cancer
Smoker         1301       1205
Non-smoker       56        152
OR = (1301 × 152) / (1205 × 56) = 2.93
The estimated variance of ln(OR) is
$$\frac{1}{1301} + \frac{1}{56} + \frac{1}{1205} + \frac{1}{152} = 0.026$$
A 95% CI is given by
$$e^{\ln(2.93) \pm 1.96 \times \sqrt{0.026}} = (2.14, 4.02)$$
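The same calculation can be carried out in a few lines of Python. This is a sketch of the formula from these slides, not a library routine: the function name `odds_ratio_ci` and its argument order (a, b, c, d read across the rows of the table) are assumptions, and `z` defaults to 1.96, the quantile used above for a 95% interval.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Estimated OR with a confidence interval computed on the ln scale.

    a, b are the first row of the 2 x 2 table and c, d the second row;
    z is the standard normal quantile (1.96 for a 95% interval).
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, se_log, lower, upper


# Smoking example: smoker row (1301, 1205), non-smoker row (56, 152).
or_hat, se_log, lower, upper = odds_ratio_ci(1301, 1205, 56, 152)

print(f"OR = {or_hat:.2f}")
print(f"variance of ln(OR) = {se_log ** 2:.3f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval lies entirely above 1, so the data are not consistent with equal odds of cancer for smokers and non-smokers.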
5 Summary
How to retrieve the test statistic for 2 × 2 and r × c tables from SPSS.
How to identify case-control and cohort studies.
Computing confidence intervals for OR.
Interpreting these confidence intervals.