
Contingency Analysis:

association between categorical variables

Contingency Analysis
Mosaic Plots
Odds
Odds Ratio
SE & CI for Odds Ratio
χ² Contingency Test
R example
Assumptions of χ²
Correction for Continuity
Fisher's Exact Test
G-tests

Contingency Analysis

There are many examples in biology where we wish to
relate two variables that are categorical.

For example:

1. Do bright and drab butterflies differ in their probability of
   being eaten?
2. Is fur color (tan, brown, black) related to gender?
3. Is tree death related to slope aspect?

These questions are best approached using contingency
analysis, which allows us to determine whether two or
more categorical variables are independent.

Mosaic Plots

The Titanic disaster provides a simple example of the use
of mosaic plots for examining the structure of frequency
data.

Plots are composed of a series of graphical blocks or
boxes. The area of each box is proportional to the number
of elements in that group. Groups can be compared side by
side (row-wise or column-wise).

The plot clearly shows that women experienced a greater
survival rate than men.
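
One way to draw a plot like this is with R's built-in Titanic table. This is only a sketch: it assumes the built-in data set is comparable to the data behind the slide, and the object names sex_survival and survival_rate are illustrative.

```r
# Titanic is a built-in 4-way table: Class x Sex x Age x Survived.
# Collapse it to a 2-way Sex x Survived table (dimensions 2 and 4).
sex_survival <- margin.table(Titanic, margin = c(2, 4))

# Box areas are proportional to the count in each cell.
mosaicplot(sex_survival, main = "Titanic: survival by sex", color = TRUE)

# Row-wise proportions put numbers on what the plot shows.
survival_rate <- prop.table(sex_survival, margin = 1)
survival_rate
```

Comparing the "Yes" column of survival_rate between the Female and Male rows gives the same comparison the mosaic plot makes visually.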


Odds
Let's consider a variable (e.g., our previous coin toss
example) for which a random trial yields one of two
outcomes: success or failure (heads or tails).

The probability of success is p and the probability of failure
is 1 - p. The odds of success (O) are the probability of
success divided by the probability of failure:

$$O = \frac{p}{1 - p}$$

The estimate of the odds is calculated from a random
sample of trials using the observed proportion of
successes (p-hat):

$$\hat{O} = \frac{\hat{p}}{1 - \hat{p}}$$

Odds
- Example -

It is well established that there is a link between the use
of aspirin and decreased risk of heart attack. A
suggestion was made that there may also be a link with
reduction in cancer risk. A total of 39,876 women were
split into two groups: about half took aspirin, half a placebo.
After 10 years the prevalence of cancer was assessed
in the two groups:

Odds
- Example -

             Aspirin   Placebo
Cancer          1438      1427
No cancer      18496     18515
Total          19934     19942

Odds
- Example -

The estimated proportion of the aspirin group that did not get
cancer (and its complement, the proportion that did) is:

$$p_1 = \frac{18496}{19934} = 0.9279$$

$$1 - p_1 = 1 - 0.9279 = 0.0721$$

The odds of not getting cancer while taking aspirin are:

$$O_1 = \frac{p_1}{1 - p_1} = \frac{0.9279}{0.0721} = 12.87$$

So, the odds are ca. 12.87:1 of not getting cancer if taking aspirin.

Odds
- Example -

But what are the odds of not getting cancer if taking the placebo?

$$p_2 = \frac{18515}{19942} = 0.9284$$

$$1 - p_2 = 1 - 0.9284 = 0.0716$$

$$O_2 = \frac{p_2}{1 - p_2} = \frac{0.9284}{0.0716} = 12.97$$

The difference between 12.87 and 12.97 is negligible,
so aspirin is not likely to influence cancer rate.
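
The odds calculations above can be reproduced in R. This is a minimal sketch that assumes the counts implied by the worked example (18496 of 19934 women cancer-free on aspirin, 18515 of 19942 on placebo); the variable names are illustrative.

```r
# Counts taken from the worked example; "success" = did not get cancer.
no_cancer <- c(aspirin = 18496, placebo = 18515)
cancer    <- c(aspirin = 1438,  placebo = 1427)

# Observed proportion of successes in each group.
p_hat <- no_cancer / (no_cancer + cancer)

# Odds = probability of success / probability of failure.
odds <- p_hat / (1 - p_hat)

round(p_hat, 4)
round(odds, 2)
```

Working from unrounded proportions gives odds of about 12.86 for the aspirin group; the 12.87 in the worked example comes from dividing the rounded values 0.9279 and 0.0721.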

Odds Ratio

But, as statisticians, we are seldom convinced by just a
small difference (with such a large sample size, even a small
difference could still be significant). We can use the odds
ratio (OR) to compare the odds of success in one group
with the odds of success in the other:

$$OR = \frac{O_1}{O_2}$$

If the odds ratio is equal to one, then the odds of success
in the response variable are the same for both groups.

Odds Ratio
- Example -

$$OR = \frac{O_1}{O_2} = \frac{12.87}{12.97} = 0.992$$

The OR suggests that the odds of remaining cancer-free
while taking aspirin were about the same as while taking
the placebo. Because the value is slightly less than one,
the odds of avoiding cancer were, if anything, marginally
lower in the aspirin group.

We're still left with the question of whether this small
difference is statistically significant. We can evaluate
this using the SE and CI around OR.

Odds Ratio
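Because the sampling distribution of the OR is highly skewed, we
must convert the OR to its natural log form and then calculate
the SE, from which we can derive the CI. Here a, b, c and d are
the four cell frequencies of the 2 × 2 table:

$$SE[\ln(OR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

$$SE[\ln(OR)] = \sqrt{\frac{1}{1438} + \frac{1}{1427} + \frac{1}{18496} + \frac{1}{18515}}$$

$$SE[\ln(OR)] = 0.03878$$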

Odds Ratio
Now that we have a SE calculated, we can calculate the
95% CI. With ln(OR) = ln(0.992) = -0.00803:

$$-0.00803 - 1.96(0.03878) < \ln(OR) < -0.00803 + 1.96(0.03878)$$

$$-0.084 < \ln(OR) < 0.068$$

$$e^{-0.084} < OR < e^{0.068}$$

$$0.92 < OR < 1.07$$

The CI is tightly bounded around 1.0, so the data provide good
evidence that aspirin has little or no effect on the probability of
developing cancer.
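
The odds ratio, its standard error and the 95% CI can be computed in R along the following lines. This is a sketch, assuming the four cell counts from the aspirin example; the variable names are illustrative, and qnorm(0.975) stands in for the 1.96 used above.

```r
# Cell counts from the aspirin example.
cancer_aspirin    <- 1438
cancer_placebo    <- 1427
no_cancer_aspirin <- 18496
no_cancer_placebo <- 18515

# Odds of NOT getting cancer in each group, and their ratio.
odds_aspirin <- no_cancer_aspirin / cancer_aspirin
odds_placebo <- no_cancer_placebo / cancer_placebo
odds_ratio   <- odds_aspirin / odds_placebo

# Standard error of ln(OR): square root of the summed reciprocal counts.
se_log_or <- sqrt(1 / cancer_aspirin + 1 / cancer_placebo +
                  1 / no_cancer_aspirin + 1 / no_cancer_placebo)

# 95% CI on the log scale, then back-transformed with exp().
z      <- qnorm(0.975)
ci_log <- log(odds_ratio) + c(-1, 1) * z * se_log_or
ci_or  <- exp(ci_log)

round(se_log_or, 5)
round(ci_or, 2)
```

Without intermediate rounding, ln(OR) is about -0.0087 rather than -0.00803, but the interval still rounds to the same limits of 0.92 and 1.07.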

χ² Contingency Test

The most commonly used frequency data analysis
method is the chi-square contingency test for
association.

You may also see this test referred to in the literature as
an R × C (row-by-column) association test. R can have
two or more categories and C can have two or more
categories.

This test is widely adaptable to a variety of tests dealing
with the comparison of categorical data (and can be
expanded to 3+ dimensions = log-linear analysis).

χ² Contingency Test
- Example -

Example 9.3 provides a biological example involving the
infection of fish with a parasite and their risk of predation by
birds as a function of their position in the water column.

The two variables of interest are infection status (uninfected,
lightly infected, and highly infected) and predation (eaten, not
eaten).

The corresponding hypotheses:

H0: Parasite infection and being eaten are independent.
HA: Parasite infection and being eaten are not independent.

χ² Contingency Test
- Example -

$$\Pr[\text{uninfected}] = 50/141 = 0.3546$$

$$\Pr[\text{eaten}] = 48/141 = 0.3404$$

$$\Pr[\text{uninfected and eaten}] = 0.3546 \times 0.3404 = 0.1207$$

$$\text{Expected}[\text{uninfected and eaten}] = 0.1207 \times 141 = 17.0$$
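
The same multiplication can be done for every cell at once. The sketch below assumes the fish counts used in the R example for this data set (eaten: 1, 10, 37; not eaten: 49, 35, 9); outer() builds the table of products of the marginal probabilities, and the names pr_row, pr_col and expected are illustrative.

```r
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

n <- sum(fish)

# Marginal probabilities for each row (predation) and column (infection).
pr_row <- rowSums(fish) / n
pr_col <- colSums(fish) / n

# Under independence, Pr[row and column] = Pr[row] * Pr[column];
# multiplying by n turns each probability into an expected frequency.
expected <- outer(pr_row, pr_col) * n

round(expected, 1)
```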


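χ² Statistic

Now that we have observed frequencies and expected
frequencies, we can generate a chi-square test using our
general formula:

$$\chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{(\text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}})^2}{\text{Expected}_{\text{column},\text{row}}}$$

$$\chi^2 = \frac{(1 - 17.0)^2}{17.0} + \frac{(49 - 33.0)^2}{33.0} + \cdots + \frac{(9 - 30.3)^2}{30.3} = 69.5$$

$$\chi^2_{2,\,0.05} = 5.99, \text{ therefore reject } H_0$$

NB: df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2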
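
The hand calculation can be checked step by step in R. This is a sketch, assuming the fish counts from the R example for this data set; qchisq() and pchisq() supply the critical value and the p-value, and the variable names are illustrative.

```r
observed <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
                   dimnames = list(Predation = c("Eaten", "Not Eaten"),
                                   Infection = c("Uninfected", "Light", "Heavy")))

# Expected frequency for each cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (observed - expected)^2 / expected over every cell.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical value at alpha = 0.05 and the upper-tail p-value.
critical <- qchisq(0.95, df)
p_value  <- pchisq(chi_sq, df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, critical = critical, p_value = p_value)
```

With unrounded expected frequencies the statistic comes out near 69.76; the 69.5 above results from rounding the expected values to one decimal place before summing.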

Example
How would we solve this problem in R? Basically, a row x
column table is a matrix; so in keeping with the approach
of using vectors for data, we create an array using the
matrix function. By default the data are read in by columns
(note how R cycles through the data to create a matrix from
a vector):

> fish<-matrix(c(1,49,10,35,37,9),nrow=2)
> fish
     [,1] [,2] [,3]
[1,]    1   10   37
[2,]   49   35    9

Example
While we have a perfectly workable matrix, let's prettify it and
add the appropriate variable names and levels:

> fish<-matrix(c(1,49,10,35,37,9), nrow=2,
+   dimnames=list("Predation"=c("Eaten", "Not Eaten"),
+   "Infection" = c("Uninfected", "Light", "Heavy")))
> fish
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9

Example

And the chi-square test...

> chisq.test(fish)

Pearson's Chi-squared test

data: fish
X-squared = 69.7557, df = 2, p-value = 7.124e-16

The object returned by chisq.test() has a number of components
that we can take further advantage of:

> chisq.test(fish)$observed
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9
> chisq.test(fish)$expected
           Infection
Predation   Uninfected    Light    Heavy
  Eaten       17.02128 15.31915 15.65957
  Not Eaten   32.97872 29.68085 30.34043

> mosaicplot(t(fish),cex=1.25,color=TRUE)


The chi-square contingency test makes the same assumptions
as the goodness of fit test:

1. No more than 20% of the cells can have an expected frequency
   less than 5, and
2. No cell can have an expected frequency less than one.

If either is violated, the response is the same: (a) combine
rows or columns [if the array is bigger than 2 × 2], (b) if the
table is 2 × 2, use Fisher's Exact Test, or (c) use a
randomization procedure (discussed at end of course).
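
Both rules can be checked directly from the expected frequencies that chisq.test() returns. The helper below, check_expected_counts(), is a hypothetical function written for illustration, not part of R; it is applied here to the fish counts from the R example.

```r
# Hypothetical helper: applies the two rules of thumb to a contingency table.
check_expected_counts <- function(tab) {
  # suppressWarnings() hides the approximation warning R gives for sparse tables.
  expected <- suppressWarnings(chisq.test(tab)$expected)
  share_below_5 <- mean(expected < 5)   # proportion of cells with expected < 5
  any_below_1   <- any(expected < 1)    # TRUE if any cell has expected < 1
  list(share_below_5 = share_below_5,
       any_below_1   = any_below_1,
       ok            = share_below_5 <= 0.20 && !any_below_1)
}

fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

result <- check_expected_counts(fish)
result$ok
```

Note that the rules apply to expected, not observed, frequencies: the fish table contains an observed count of 1, yet its smallest expected frequency is about 15.3.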


Correction for Continuity

When the contingency table is 2 × 2, most statisticians
recommend the use of a continuity correction factor. This
modification is known as the Yates Correction for
Continuity:

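$$\chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{\left( \left| \text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}} \right| - \frac{1}{2} \right)^2}{\text{Expected}_{\text{column},\text{row}}}$$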
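
In R, chisq.test() applies this correction by default to 2 × 2 tables (argument correct = TRUE). The sketch below computes the corrected statistic by hand and compares it with R's result. It assumes the 2 × 2 counts from the aspirin example earlier in these notes; the variable names are illustrative.

```r
aspirin <- matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                  dimnames = list(Outcome   = c("Cancer", "No cancer"),
                                  Treatment = c("Aspirin", "Placebo")))

# Expected frequencies under independence.
expected <- outer(rowSums(aspirin), colSums(aspirin)) / sum(aspirin)

# Yates correction: subtract 1/2 from each |observed - expected| before squaring.
yates_by_hand <- sum((abs(aspirin - expected) - 0.5)^2 / expected)

# R's corrected and uncorrected statistics for comparison.
r_corrected   <- unname(chisq.test(aspirin, correct = TRUE)$statistic)
r_uncorrected <- unname(chisq.test(aspirin, correct = FALSE)$statistic)

c(by_hand = yates_by_hand, corrected = r_corrected, uncorrected = r_uncorrected)
```

The correction always shrinks the statistic, which makes the test slightly more conservative.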

Fisher's Exact Test

Fisher's Exact Test is used specifically for 2 × 2 contingency
tests. The test is an improvement over the normal chi-square
in cases where the expected cell frequencies are too
low to meet the regular assumptions. Thus, this test is used
for small data sets comparing two categorical variables.

Let's look at Example 9.4, which examines the feeding
habits of vampire bats. The main question is whether
or not cows in estrus have a greater chance of being
attacked by bats compared to cows not in estrus.

> bats<-matrix(c(15,7,6,322),nrow=2)
> bats
     [,1] [,2]
[1,]   15    6
[2,]    7  322

> fisher.test(bats)

Fisher's Exact Test for Count Data

data: bats
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  29.94742 457.26860
sample estimates:
odds ratio
  108.3894
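
To see why Fisher's Exact Test is the appropriate choice here, it helps to look at the expected frequencies for the bat data. The sketch below also pulls individual results out of the fisher.test() object; the names expected and ft are illustrative.

```r
bats <- matrix(c(15, 7, 6, 322), nrow = 2)

# Expected frequencies under independence; R warns that the chi-square
# approximation may be incorrect, so the warning is suppressed here.
expected <- suppressWarnings(chisq.test(bats)$expected)
round(expected, 2)

# Share of cells with an expected frequency below 5 (rule 1 allows at most 20%).
mean(expected < 5)

# Fisher's Exact Test, keeping the result so its components can be reused.
ft <- fisher.test(bats)
ft$p.value    # exact two-sided p-value
ft$estimate   # conditional maximum likelihood estimate of the odds ratio
ft$conf.int   # 95% confidence interval for the odds ratio
```

One of the four cells has an expected frequency below 5, which exceeds the 20% limit, so the chi-square approximation is not trustworthy for this table.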
G-tests

The G-test is another contingency test seen frequently in the
literature. The G-test gives results very similar to those of the
chi-square test across a wide range of circumstances. It utilizes
the natural logarithm (ln) in its calculation.

The G-test may not be as powerful as the chi-square test for
small sample sizes.

R code for the G-test is available, but it is not part of the
base stats package or its related packages.
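
The statistic is simple enough to compute by hand in base R. The sketch below uses the standard form G = 2 Σ O ln(O / E), compared against a chi-square distribution with (r - 1)(c - 1) degrees of freedom. It assumes the fish counts from the chi-square example, and the variable names are illustrative; a table containing zero counts would need extra handling, because 0 * log(0) is not defined in R.

```r
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

# Expected frequencies under independence, as for the chi-square test.
expected <- outer(rowSums(fish), colSums(fish)) / sum(fish)

# G = 2 * sum over cells of observed * ln(observed / expected).
G <- 2 * sum(fish * log(fish / expected))

# Same degrees of freedom and reference distribution as the chi-square test.
df      <- (nrow(fish) - 1) * (ncol(fish) - 1)
p_value <- pchisq(G, df, lower.tail = FALSE)

c(G = G, df = df, p_value = p_value)
```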
