Associations Between Categorical Variables

Case where both the explanatory (independent)
variable and the response (dependent) variable
are qualitative (Chapter 7 includes the case
where both are binary, i.e. have 2 levels)
Association: The distributions of responses
differ among the levels of the explanatory
variable (e.g. party affiliation by gender)
Contingency Tables
Cross-tabulations of frequency counts where the
rows (typically) represent the levels of the
explanatory variable and the columns represent
the levels of the response variable.
Numbers within the table represent the numbers
of individuals falling in the corresponding
combination of levels of the two variables
Row and column totals are called the marginal
distributions for the two variables
Example - Cyclones Near Antarctica
Period of Study: September 1973 - May 1975
Explanatory Variable: Region (40-49, 50-59, 60-79
degrees South latitude)
Response: Season (Autumn (4), Winter (5), Spring (4), Summer (8);
number of months in parentheses)
Units: Cyclones in the study area
The observed cyclones are treated as a random
sample of all cyclones that could have occurred
Source: Howarth (1983), An Analysis of the Variability of Cyclones around Antarctica and Their
Relation to Sea-Ice Extent, Annals of the Association of American Geographers, Vol. 73, pp. 519-537
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
For each region (row) we can compute the percentage of storms
occurring during each season, the conditional distribution. Of the
1517 cyclones in the 40-49S band, 370 occurred in Autumn, a
proportion of 370/1517 = .244, or 24.4%.
Region\Season   Autumn   Winter   Spring   Summer   Total% (n)
40-49S            24.4     29.8     18.0     27.8   100.0 (1517)
50-59S            19.3     22.9     18.8     38.9   100.0 (2722)
60-79S            19.9     24.4     20.2     35.5   100.0 (4926)
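The row percentages above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the names `counts`, `row_percentages` and `row_pct` are illustrative assumptions, and the counts are copied from the cyclone table.

```python
# Observed cyclone counts (rows: regions, columns: seasons).
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}

def row_percentages(row):
    """Return each cell as a percentage of its row total."""
    total = sum(row)
    return [100.0 * cell / total for cell in row]

# Conditional distribution of season within each region.
row_pct = {region: row_percentages(row) for region, row in counts.items()}

for region, pcts in row_pct.items():
    cells = "  ".join(f"{p:5.1f}" for p in pcts)
    print(f"{region}  {cells}  (n={sum(counts[region])})")
```

Each printed row sums to 100 up to rounding, which is a quick check that the percentages were taken within the explanatory (row) variable.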
[Figure: Graphical Conditional Distributions for Regions. Bar chart of regpct (percentage of cyclones, axis ticks from 10 to 40) by season (Autumn, Winter, Spring, Summer), with one bar per region (40-49S, 50-59S, 60-79S).]

Guidelines for Contingency Tables
Compute percentages for the response (column)
variable within the categories of the explanatory
(row) variable. Note that in journal articles, rows
and columns may be interchanged.
Divide the cell counts by the row (explanatory
category) total and multiply by 100 to obtain
percentages. The row percentages will add to 100.
Give the table a title and clearly define the variables and
categories.
Include the row (explanatory) total sample sizes.
Independence & Dependence
Statistically Independent: Population conditional
distributions of one variable are the same across
all levels of the other variable
Statistically Dependent: Conditional Distributions
are not all equal
When testing, researchers typically wish to
demonstrate dependence (alternative hypothesis),
and wish to refute independence (null hypothesis)
Pearson's Chi-Square Test
Can be used for nominal or ordinal explanatory and
response variables
Variables can have any number of distinct levels
Tests whether the distribution of the response
variable is the same for each level of the explanatory
variable (H0: No association between the variables)
r = # of levels of explanatory variable
c = # of levels of response variable
Intuition behind test statistic

Obtain marginal distribution of outcomes for the
response variable
Apply this common distribution to all levels of
the explanatory variable, by multiplying each
proportion by the corresponding sample size
Measure the difference between actual cell counts
and the expected cell counts in the previous step
Notation to obtain test statistic
Rows represent explanatory variable (r levels)
Cols represent response variable (c levels)
         1     2     ...   c     Total
1        n11   n12   ...   n1c   n1.
2        n21   n22   ...   n2c   n2.
...      ...   ...   ...   ...   ...
r        nr1   nr2   ...   nrc   nr.
Total    n.1   n.2   ...   n.c   n..

Observed frequency (fo): The number of
individuals falling in a particular cell
Expected frequency (fe): The number we would
expect in that cell, given the sample sizes
observed in the study and the assumption of
independence.
Computed by multiplying the row total and the
column total, and dividing by the overall sample size:
$f_e = n_{i\cdot}\, n_{\cdot j} / n_{\cdot\cdot}$ for the cell in row $i$ and column $j$.
This applies the overall marginal probability of the
response category to the sample size of the explanatory
category
Large-sample test (all fe > 5)
H0: Variables are statistically independent
(No association between variables)
Ha: Variables are statistically dependent
(Association exists between variables)
Test Statistic: $\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$ (sum over all cells)
P-value: Area above $\chi^2_{obs}$ in the chi-squared
distribution with $(r-1)(c-1)$ degrees of
freedom. (Critical values in Table 8.5)
Observed Cell Counts (fo):
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
Note that overall, (1876/9165)100% = 20.47% (about 20.5%) of all cyclones
occurred in Autumn. If we apply that percentage to the 1517 that
occurred in the 40-49S band, we would expect (0.2047)(1517) = 310.5
to have occurred in the first cell of the table. The full table of fe:
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S           310.5    376.7    294.8    535.0    1517
50-59S           557.2    676.0    529.0    959.9    2722
60-79S          1008.3   1223.3    957.3   1737.1    4926
Total             1876     2276     1781     3232    9165
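The expected counts can be sketched in Python as (row total)(column total)/(overall total) for every cell. The function name `expected_counts` and the variable names are assumptions made for this illustration; the observed counts are the cyclone data above.

```python
# Observed cyclone counts (rows: 40-49S, 50-59S, 60-79S;
# columns: Autumn, Winter, Spring, Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)

for row in expected:
    print("  ".join(f"{fe:7.1f}" for fe in row))
```

The expected counts in each row add back to the observed row total, because the same marginal distribution is applied to every row.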
Computation of $\chi^2_{obs}$
Region   Season   fo     fe       (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn   370    310.5    3540.25     11.4017713
40-49S   Winter   452    376.7    5670.09     15.0520042
40-49S   Spring   273    294.8    475.24      1.61207598
40-49S   Summer   422    535.0    12769       23.8672897
50-59S   Autumn   526    557.2    973.44      1.74702082
50-59S   Winter   624    676.0    2704        4
50-59S   Spring   513    529.0    256         0.48393195
50-59S   Summer   1059   959.9    9820.81     10.2310762
60-79S   Autumn   980    1008.3   800.89      0.79429733
60-79S   Winter   1200   1223.3   542.89      0.44379138
60-79S   Spring   995    957.3    1421.29     1.4846861
60-79S   Summer   1751   1737.1   193.21      0.11122561
Total                                         71.2291706
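The same sum can be sketched in Python. This is an illustrative implementation, and the function name `pearson_chi_square` is an assumption. It uses unrounded expected counts, so it gives about 71.19 instead of the 71.23 obtained from expected counts rounded to one decimal place.

```python
# Observed cyclone counts (rows: regions, columns: seasons).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]

def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected count
            stat += (fo - fe) ** 2 / fe             # cell contribution
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

chi_sq, df = pearson_chi_square(observed)
print(f"chi-square = {chi_sq:.3f}, df = {df}")
```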
H0: Seasonal distribution of cyclone occurrences
is independent of latitude band
Ha: Seasonal distributions of cyclone occurrences
differ among latitude bands
Test Statistic: $\chi^2_{obs} = 71.2$
P-value: Area in the chi-squared distribution with
(3-1)(4-1) = 6 degrees of freedom above 71.2
From Table 8.5, $P(\chi^2 \ge 22.46) = .001$, so $P < .001$
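Instead of a table lookup, the upper-tail area can be computed directly. The sketch below relies on a closed form that holds only when the degrees of freedom are even (true here, df = 6): $P(\chi^2_{2k} > x) = e^{-x/2} \sum_{i=0}^{k-1} (x/2)^i / i!$. The function name is an assumption for this illustration; for odd df a statistics library would be needed.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of a chi-squared distribution with EVEN df.

    Uses exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!, which is exact
    only for even degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

p_critical = chi2_upper_tail_even_df(22.46, 6)  # tabled .001 critical value
p_observed = chi2_upper_tail_even_df(71.2, 6)   # observed statistic
print(p_critical, p_observed)
```

The first value is close to .001, in line with the tabled critical value, and the second is far smaller, consistent with P < .001.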
SPSS Output - Cyclone Example
REGION * SEASON Crosstabulation
                                              SEASON
                                   Autumn   Winter   Spring   Summer    Total
REGION  40-49S  Count                 370      452      273      422     1517
                Expected Count      310.5    376.7    294.8    535.0   1517.0
                % within REGION     24.4%    29.8%    18.0%    27.8%   100.0%
        50-59S  Count                 526      624      513     1059     2722
                Expected Count      557.2    676.0    529.0    959.9   2722.0
                % within REGION     19.3%    22.9%    18.8%    38.9%   100.0%
        60-79S  Count                 980     1200      995     1751     4926
                Expected Count     1008.3   1223.3    957.3   1737.1   4926.0
                % within REGION     19.9%    24.4%    20.2%    35.5%   100.0%
Total           Count                1876     2276     1781     3232     9165
                Expected Count     1876.0   2276.0   1781.0   3232.0   9165.0
                % within REGION     20.5%    24.8%    19.4%    35.3%   100.0%
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.189a    6   .000
Likelihood Ratio               71.337     6   .000
Linear-by-Linear Association   23.418     1   .000
N of Valid Cases               9165
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 294.79.
(The Asymp. Sig. column is the P-value.)
Misuses of the chi-squared Test
Expected frequencies too small (all
expected counts should be above 5; this is not
necessary for the observed counts)
Dependent samples (the same individuals
are in each row; see McNemar's test)
Can be used for nominal or ordinal
variables, but more powerful methods exist
when both variables are ordinal and a
directional association is hypothesized
Residual Analysis
Once dependence has been determined from a
chi-squared test, we are often interested in determining which
cells contributed
Residual: $f_o - f_e$ measures the difference between the
observed and expected counts
Positive implies observed more than expected
A residual's practical importance depends on the level of fe
Adjusted Residual (computed for each cell):
$\frac{f_o - f_e}{\sqrt{f_e \,(1 - \text{row proportion})(1 - \text{column proportion})}}$
Adjusted residuals above 3 in absolute value give strong evidence against independence in
that cell
Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165 = 0.1655
Column proportion for Season Autumn: 1876/9165 = 0.2047
Region   Season   fo     fe       row prop   col prop   adj res
40-49S   Autumn   370    310.5    0.1655     0.2047     4.144837
40-49S   Winter   452    376.7    0.1655     0.2483     4.898484
40-49S   Spring   273    294.8    0.1655     0.1943     -1.54843
40-49S   Summer   422    535      0.1655     0.3526     -6.64664
50-59S   Autumn   526    557.2    0.297      0.2047     -1.76769
50-59S   Winter   624    676      0.297      0.2483     -2.75125
50-59S   Spring   513    529      0.297      0.1943     -0.92433
50-59S   Summer   1059   959.9    0.297      0.3526     4.741291
60-79S   Autumn   980    1008.3   0.5375     0.2047     -1.4695
60-79S   Winter   1200   1223.3   0.5375     0.2483     -1.12983
60-79S   Spring   995    957.3    0.5375     0.1943     1.996065
60-79S   Summer   1751   1737.1   0.5375     0.3526     0.609481
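A short Python sketch of the adjusted-residual calculation follows. The function name `adjusted_residuals` is an assumption for illustration. Because it keeps the expected counts and proportions unrounded, its results agree with the table to about two decimal places, not digit for digit.

```python
import math

# Observed cyclone counts (rows: regions, columns: seasons).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]

def adjusted_residuals(table):
    """(fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            row_prop = row_totals[i] / n
            col_prop = col_totals[j] / n
            out_row.append((fo - fe) / math.sqrt(fe * (1 - row_prop) * (1 - col_prop)))
        result.append(out_row)
    return result

adj = adjusted_residuals(observed)

# Flag cells whose adjusted residual exceeds 3 in absolute value.
large = [(i, j) for i, row in enumerate(adj) for j, a in enumerate(row) if abs(a) > 3]
print(large)
```

With these data the flagged cells are 40-49S in Autumn, Winter and Summer, and 50-59S in Summer, matching the four table entries that exceed 3 in absolute value.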
2x2 Tables
Each variable has 2 levels
Explanatory Variable: Groups (typically based
on demographics, exposure, or treatment)
Response Variable: Outcome (typically
presence or absence of a characteristic)
Measures of association
Relative Risk (Prospective Studies)
Odds Ratio (Prospective or Retrospective)
Absolute Risk (Prospective Studies)
2x2 Tables - Notation
                Outcome   Outcome   Group
                Present   Absent    Total
Group 1         n11       n12       n1.
Group 2         n21       n22       n2.
Outcome Total   n.1       n.2       n..
Relative Risk
Ratio of the probability that the outcome
characteristic is present for one group, relative to
the other
Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1\cdot}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2\cdot}}$
Relative Risk
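Estimated Relative Risk:
$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$
95% Confidence Interval for Population
Relative Risk:
$\left( RR \cdot e^{-1.96\sqrt{v}} \;,\; RR \cdot e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1 - \hat{\pi}_1}{n_{11}} + \frac{1 - \hat{\pi}_2}{n_{21}}$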
Relative Risk
Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 1
Example - Coccidioidomycosis and
TNF-antagonists
Research Question: Is there a risk of developing
coccidioidomycosis associated with arthritis
therapy?
Groups: Patients receiving tumor necrosis
factor (TNF) antagonists versus patients not receiving
them (all patients arthritic)
         COC   No COC   Total
TNF        7      240     247
Other      4      734     738
Total     11      974     985
Source: Bergstrom, et al (2004)
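Example - Coccidioidomycosis and
TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24 \qquad v = \frac{1 - .0283}{7} + \frac{1 - .0054}{4} = .3874$
95% CI: $\left( 5.24\, e^{-1.96\sqrt{.3874}} \;,\; 5.24\, e^{1.96\sqrt{.3874}} \right) = (1.55 , 17.76)$
Entire CI is above 1, so conclude higher risk if on TNF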
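The relative risk and its interval can be sketched in Python as follows. The function name `relative_risk_ci` and its argument names are assumptions for this illustration. The code carries the sample proportions at full precision, so it gives roughly RR = 5.23 and (1.54, 17.71), slightly different from the slide values that were computed from proportions rounded to four decimals.

```python
import math

def relative_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (RR, lower, upper) for a 2x2 table.

    n11, n21: counts with the outcome present in groups 1 and 2.
    n1_total, n2_total: group sample sizes.
    """
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    rr = p1 / p2
    v = (1 - p1) / n11 + (1 - p2) / n21   # estimated variance of ln(RR)
    margin = math.exp(z * math.sqrt(v))
    return rr, rr / margin, rr * margin

# TNF example: 7 of 247 patients on TNF, 4 of 738 patients not on TNF.
rr, rr_lower, rr_upper = relative_risk_ci(7, 247, 4, 738)
print(f"RR = {rr:.2f}, 95% CI = ({rr_lower:.2f}, {rr_upper:.2f})")
```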

Odds Ratio
Odds of an event is the probability it occurs
divided by the probability it does not occur
Odds ratio is the odds of the event for group 1
divided by the odds of the event for group 2
Sample odds of the outcome for each group:
$odds_1 = \frac{n_{11}/n_{1\cdot}}{n_{12}/n_{1\cdot}} = \frac{n_{11}}{n_{12}} \qquad odds_2 = \frac{n_{21}}{n_{22}}$
Odds Ratio
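Estimated Odds Ratio:
$OR = \frac{odds_1}{odds_2} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
95% Confidence Interval for
Population Odds Ratio:
$\left( OR \cdot e^{-1.96\sqrt{v}} \;,\; OR \cdot e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$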
Odds Ratio
Interpretation
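Conclude that the odds of the outcome being
present are higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the odds of the outcome being
present are lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the odds of the
outcome differ for the two groups if the interval
contains 1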
Example - NSAIDs and GBM
Case-Control Study (Retrospective)
Cases: 137 Self-Reporting Patients with Glioblastoma
Multiforme (GBM)
Controls: 401 Population-Based Individuals matched to
cases with respect to demographic factors
                 GBM Present   GBM Absent   Total
NSAID User                32          138     170
NSAID Non-User           105          263     368
Total                    137          401     538
Source: Sivak-Sears, et al
Example - NSAIDs and GBM
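$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58$
$v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$
95% CI: $\left( 0.58\, e^{-1.96\sqrt{0.0518}} \;,\; 0.58\, e^{1.96\sqrt{0.0518}} \right) = (0.37 , 0.91)$
Interval is entirely below 1: NSAID
use appears to be lower among
cases than among controls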
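A minimal Python sketch of the odds ratio and its interval, using the NSAID and GBM counts above. The function name `odds_ratio_ci` and its argument names are assumptions for this illustration.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Return (OR, lower, upper) for the 2x2 table [[n11, n12], [n21, n22]]."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22   # estimated variance of ln(OR)
    margin = math.exp(z * math.sqrt(v))
    return odds_ratio, odds_ratio / margin, odds_ratio * margin

# Rows: NSAID user / non-user; columns: GBM present / absent.
or_hat, or_lower, or_upper = odds_ratio_ci(32, 138, 105, 263)
print(f"OR = {or_hat:.2f}, 95% CI = ({or_lower:.2f}, {or_upper:.2f})")
```

The odds ratio is the same whether the table is read by rows or by columns, which is why it can be used with retrospective (case-control) data.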
Absolute Risk
Difference between the proportions with
an outcome characteristic for 2 groups
Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1\cdot}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2\cdot}}$
Absolute Risk
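Estimated Absolute Risk:
$AR = \hat{\pi}_1 - \hat{\pi}_2$
95% Confidence Interval for Population
Absolute Risk:
$AR \pm 1.96 \sqrt{ \frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_{1\cdot}} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_{2\cdot}} }$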
Absolute Risk
Interpretation
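Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is positive
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is negative
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 0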
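Example - Coccidioidomycosis and
TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$
95% CI: $.0229 \pm 1.96 \sqrt{ \frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738} }$
$= .0229 \pm .0213 = (0.0016 , 0.0442)$
Interval is entirely positive: TNF is associated
with higher risk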
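The absolute risk interval can be checked with a short Python sketch. The function name `absolute_risk_ci` and its argument names are assumptions for this illustration; the counts are those of the TNF example.

```python
import math

def absolute_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (AR, lower, upper) for the difference in proportions p1 - p2."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    ar = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    return ar, ar - z * se, ar + z * se

# TNF example: 7 of 247 patients on TNF, 4 of 738 patients not on TNF.
ar, ar_lower, ar_upper = absolute_risk_ci(7, 247, 4, 738)
print(f"AR = {ar:.4f}, 95% CI = ({ar_lower:.4f}, {ar_upper:.4f})")
```

The upper limit is the estimate plus the margin, .0229 + .0213 = .0442, and the lower limit is .0229 - .0213 = .0016.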
Ordinal Explanatory and Response
Variables
Pearson's Chi-square test can be used to test
associations among ordinal variables, but more
powerful methods exist
When theories exist that the association is
directional (positive or negative), measures exist
to describe and test for these specific alternatives
to independence:
Gamma
Kendall's $\tau_b$
Concordant and Discordant Pairs
Concordant Pairs - Pairs of individuals where one
individual scores higher on both ordered variables
than the other individual
Discordant Pairs - Pairs of individuals where one
individual scores higher on one ordered variable
and the other individual scores higher on the other
C = # Concordant Pairs D = # Discordant Pairs
Under Positive association, expect C > D
Under Negative association, expect C < D
Under No association, expect $C \approx D$
Example - Alcohol Use and Sick Days
Alcohol Risk (Without Risk, Hardly any Risk,
Some to Considerable Risk)
Sick Days (0, 1-6, 7+)
Concordant Pairs - Pairs of respondents where one
scores higher on both alcohol risk and sick days
than the other
Discordant Pairs - Pairs of respondents where one
scores higher on alcohol risk and the other scores
higher on sick days
Source: Hermansson, et al (2003)
ALCOHOL * SICKDAYS Crosstabulation
Count
                                        SICKDAYS
                                 0 days   1-6 days   7+ days   Total
ALCOHOL  Without Risk               347        113       145     605
         Hardly any Risk            154         63        56     273
         Some-Considerable Risk      52         25        34     111
Total                               553        201       235     989
Concordant Pairs: Each individual in a
given cell is concordant with each individual
in cells Southeast of theirs (below and to the right)
Discordant Pairs: Each individual in a given
cell is discordant with each individual in
cells Southwest of theirs (below and to the left)
$C = 347(63 + 56 + 25 + 34) + 113(56 + 34) + 154(25 + 34) + 63(34) = 83164$
$D = 145(154 + 63 + 52 + 25) + 113(154 + 52) + 56(52 + 25) + 63(52) = 73496$
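The Southeast/Southwest counting rule can be sketched in Python for a table of any size. The function name `count_pairs` is an assumption for this illustration. It requires rows and columns to be listed from the lowest to the highest category, as in the crosstabulation above.

```python
# Alcohol risk (rows, low to high) by sick days (columns, low to high).
table = [
    [347, 113, 145],  # Without Risk
    [154, 63, 56],    # Hardly any Risk
    [52, 25, 34],     # Some-Considerable Risk
]

def count_pairs(table):
    """Return (C, D), the numbers of concordant and discordant pairs."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):   # rows below cell (i, j)
                # Cells below and to the right (Southeast) are concordant.
                concordant += table[i][j] * sum(table[k][j + 1:])
                # Cells below and to the left (Southwest) are discordant.
                discordant += table[i][j] * sum(table[k][:j])
    return concordant, discordant

C, D = count_pairs(table)
print(C, D)
```

Pairs that are tied on either variable (same row or same column) are counted in neither C nor D.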
Measures of Association
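Goodman and Kruskal's Gamma:
$\hat{\gamma} = \frac{C - D}{C + D} \qquad -1 \le \hat{\gamma} \le 1$
Kendall's $\tau_b$:
$\hat{\tau}_b = \frac{C - D}{0.5 \sqrt{ \left( n^2 - \sum_i n_{i\cdot}^2 \right) \left( n^2 - \sum_j n_{\cdot j}^2 \right) }}$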
When there's no association between the ordinal variables,
the population values of these measures are 0.
Statistical software packages provide these tests.
$\hat{\gamma} = \frac{C - D}{C + D} = \frac{83164 - 73496}{83164 + 73496} = 0.0617$
Symmetric Measures
                                      Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Ordinal by Ordinal  Kendall's tau-b   .035    .030                   1.187          .235
                    Gamma             .062    .052                   1.187          .235
N of Valid Cases                      989
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
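The two point estimates in the output above can be reproduced from C, D and the marginal totals. This is an illustrative sketch only: the variable names are assumptions, and it does not compute the standard errors or test statistics that the software reports.

```python
import math

# Alcohol risk (rows) by sick days (columns), both ordered low to high.
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]
C, D = 83164, 73496  # concordant and discordant pairs from the example

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Goodman and Kruskal's gamma.
gamma = (C - D) / (C + D)

# Kendall's tau-b, with the denominator written as in the formula
# 0.5 * sqrt((n^2 - sum of squared row totals)(n^2 - sum of squared column totals)).
denominator = 0.5 * math.sqrt(
    (n ** 2 - sum(r ** 2 for r in row_totals))
    * (n ** 2 - sum(c ** 2 for c in col_totals))
)
tau_b = (C - D) / denominator

print(f"gamma = {gamma:.3f}, tau-b = {tau_b:.3f}")
```

Both values are small and positive. With an approximate significance of .235 in the output, there is no strong evidence of a directional association in these data.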
