Contingency PDF

Contingency Analysis:
association between categorical variables
Contingency Analysis
Mosaic Plots
Odds
Odds Ratio
SE & CI for Odds Ratio
χ² Contingency Test
R example
Assumptions of χ²
Correction for Continuity
Fisher's Exact Test
G-tests
Contingency Analysis
There are many examples in biology where we wish to
relate two variables that are categorical.
For example:
1. Do bright and drab butterflies differ in their probability of
being eaten?
2. Is fur color (tan, brown, black) related to gender?
3. Is tree death related to slope aspect?
These questions are best approached using contingency
analysis, which allows us to determine whether two or
more categorical variables are independent.
Mosaic Plots
The Titanic disaster provides a simple example of the use
of mosaic plots for examining the structure of frequency
data.
Plots are composed of a series of graphical blocks or
boxes. The area of each box is proportional to the number
of elements in that group. Groups can be compared side by
side (row-wise or column-wise).
The plot clearly shows that women experienced a greater
survival rate than men.
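A mosaic plot of this kind can be sketched with the `Titanic` table that ships with R's datasets package. This is an assumption: the slide's figure may have been drawn from different data or with different options. The names `sex_by_survival` and `survival_rate` are illustrative.

```r
# Collapse the built-in 4-way Titanic table (Class x Sex x Age x Survived)
# to a 2-way table of counts: Sex x Survived.
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))

# Proportion surviving within each sex (each row sums to 1).
survival_rate <- prop.table(sex_by_survival, margin = 1)
print(round(survival_rate, 3))

# Box areas are proportional to the counts in each Sex x Survived group.
mosaicplot(sex_by_survival, color = TRUE, main = "Titanic: survival by sex")
```

Collapsing over class and age first keeps the plot to the single comparison made on the slide (women versus men).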
Odds
Let's consider a variable (e.g., our previous coin toss
example) for which a random trial yields one of two
outcomes: success or failure (heads or tails).
The probability of success is p and the probability of failure
is 1 - p. The odds of success (O) are the probability of
success divided by the probability of failure:
\[ O = \frac{p}{1 - p} \]
The estimate of the odds is calculated from a random
sample of trials using the observed proportion of
successes (p-hat):
\[ \hat{O} = \frac{\hat{p}}{1 - \hat{p}} \]
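The definition translates directly into a one-line R function. The function name `odds` and the coin-toss counts are made up for illustration.

```r
# Convert a probability of success into the odds of success: O = p / (1 - p).
odds <- function(p) {
  stopifnot(all(p >= 0 & p < 1))  # odds are undefined at p = 1
  p / (1 - p)
}

odds(0.5)   # fair coin: odds of 1 (i.e., 1:1)
odds(0.75)  # odds of 3 (i.e., 3:1)

# Estimated odds from a sample: 7 heads in 10 tosses (hypothetical data).
p_hat <- 7 / 10
odds(p_hat)
```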
Odds
- Example -
It is well established that there is a link between the use
of aspirin and decreased risk of heart attack. A
suggestion was made that there may also be a link with
reduction in cancer risk. A total of 39,876 women were
split into two groups: about half took aspirin, half a placebo.
After 10 years the prevalence of cancer was assessed
in the two groups:

             Aspirin   Placebo
Cancer          1438      1427
No cancer      18496     18515
Total          19934     19942
Odds
- Example -
The estimated proportion that did not get cancer while taking aspirin
(and the complement, those that did get cancer) is:
\[ \hat{p}_1 = \frac{18496}{19934} = 0.9279 \]
\[ 1 - \hat{p}_1 = 1 - 0.9279 = 0.0721 \]
The odds of not getting cancer while taking aspirin are:
\[ \hat{O}_1 = \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{0.9279}{0.0721} = 12.87 \]
So, the odds are ca. 12.87:1 of not getting cancer if taking aspirin.
Odds
- Example -
But what are the odds of not getting cancer if taking the placebo?
\[ \hat{p}_2 = \frac{18515}{19942} = 0.9284 \]
\[ 1 - \hat{p}_2 = 1 - 0.9284 = 0.0716 \]
\[ \hat{O}_2 = \frac{\hat{p}_2}{1 - \hat{p}_2} = \frac{0.9284}{0.0716} = 12.97 \]
The difference between 12.87 and 12.97 is negligible,
so aspirin is not likely to influence cancer rate.
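The two sets of odds can be reproduced in R from the cell counts. This is a minimal sketch. The object names are illustrative, and because it works from unrounded proportions the aspirin odds come out at about 12.86 rather than the 12.87 obtained from the rounded values above.

```r
# Counts from the aspirin example (rows: outcome, columns: treatment).
aspirin <- matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                  dimnames = list(Outcome = c("Cancer", "No cancer"),
                                  Treatment = c("Aspirin", "Placebo")))

# Proportion of each group that did not get cancer.
p_hat <- aspirin["No cancer", ] / colSums(aspirin)

# Odds of not getting cancer in each group.
odds_hat <- p_hat / (1 - p_hat)

round(p_hat, 4)
round(odds_hat, 2)
```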
Odds Ratio
But, as statisticians, we are seldom convinced by the size of a
difference alone (with such a large sample, even a small
difference could be significant). We can use the odds ratio (OR)
to compare the odds of success in one group with the odds of
success in the other:
\[ \widehat{OR} = \frac{\hat{O}_1}{\hat{O}_2} \]
If the odds ratio is equal to one, then the odds of success
in the response variable are the same for both groups.
Odds Ratio
- Example -
\[ \widehat{OR} = \frac{\hat{O}_1}{\hat{O}_2} = \frac{12.87}{12.97} = 0.992 \]
The OR suggests that the odds of remaining cancer-free
while taking aspirin were about the same as while taking
the placebo. Because the value is slightly less than one,
those odds were, if anything, marginally lower in the aspirin
group (1438 cancers versus 1427), so there is no sign of a benefit.
We're still left with the question of whether aspirin has
a significant effect on cancer risk (even if
small). We can evaluate this using the SE and CI
around the OR.
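As a quick check, the odds ratio can be computed in R directly from the four cell counts. The vector names are illustrative. Working from unrounded counts gives roughly 0.991 rather than 0.992, a difference that comes only from rounding the odds first.

```r
# Cell counts from the aspirin example.
cancer    <- c(aspirin = 1438,  placebo = 1427)
no_cancer <- c(aspirin = 18496, placebo = 18515)

# Odds of remaining cancer-free in each group.
odds_hat <- no_cancer / cancer

# Odds ratio: aspirin group relative to placebo group.
or_hat <- odds_hat[["aspirin"]] / odds_hat[["placebo"]]
or_hat

# Natural log of the odds ratio, used for the SE and CI.
log(or_hat)
```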
Odds Ratio
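Because the sampling distribution of the OR is highly skewed, we convert the OR
to its natural log form and then calculate the SE, from which we
can derive the CI:
\[ SE[\ln(OR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]
where a, b, c, and d are the four cell frequencies of the 2 × 2 table.
\[ SE[\ln(OR)] = \sqrt{\frac{1}{1438} + \frac{1}{1427} + \frac{1}{18496} + \frac{1}{18515}} \]
\[ SE[\ln(OR)] = 0.03878 \]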
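The standard error is a one-liner in R once the four counts are in a vector. The names `cells` and `se_log_or` are illustrative.

```r
# The four cell frequencies of the aspirin 2 x 2 table.
cells <- c(a = 1438, b = 1427, c = 18496, d = 18515)

# SE of ln(OR): square root of the sum of the reciprocal counts.
se_log_or <- sqrt(sum(1 / cells))
round(se_log_or, 5)
```

Because each count enters as a reciprocal, the two smallest cells (the cancer cases) dominate the standard error.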
Odds Ratio
Now that we have the SE, we can calculate
the 95% CI, starting from ln(OR) = ln(0.992) = -0.00803:
\[ -0.00803 - 1.96(0.03878) < \ln(OR) < -0.00803 + 1.96(0.03878) \]
\[ -0.084 < \ln(OR) < 0.068 \]
\[ e^{-0.084} < OR < e^{0.068} \]
\[ 0.92 < OR < 1.07 \]
The CI is tightly bounded around 1.0, so the data provide good
evidence that aspirin has little or no effect on the probability of
developing cancer.
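The interval can be reproduced in R. This sketch uses unrounded intermediate values and `qnorm(0.975)` in place of 1.96, so the log-scale limits differ slightly in the third decimal but the back-transformed interval rounds to the same 0.92 and 1.07. Object names are illustrative.

```r
# The four cell frequencies of the aspirin 2 x 2 table.
cells <- c(a = 1438, b = 1427, c = 18496, d = 18515)

# ln(OR) for the odds of remaining cancer-free, aspirin relative to placebo.
log_or <- log((cells[["c"]] / cells[["a"]]) / (cells[["d"]] / cells[["b"]]))
se_log_or <- sqrt(sum(1 / cells))

# 95% CI on the log scale, then back-transformed to the OR scale.
z <- qnorm(0.975)
ci_log <- log_or + c(-1, 1) * z * se_log_or
ci_or <- exp(ci_log)
round(ci_or, 2)
```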
χ² Contingency Test
The most commonly used frequency data analysis
method is the chi-square contingency test for
association.
You may also see this test referred to in the literature as
an R x C (row-by-column) association test. R can have
two or more categories and C can have two or more
categories.
This test is widely adaptable to a variety of tests dealing
with the comparison of categorical data (and can be
expanded to 3+ dimensions = log-linear analysis).
χ² Contingency Test
- Example -
Example 9.3 provides a biological example involving the
infection of fish with a parasite and their risk of predation by
birds as a function of their position in the water column.
The two variables of interest are infection status (uninfected,
lightly infected, and highly infected) and predation (eaten, not
eaten).
The corresponding hypotheses:
H0: Parasite infection and being eaten are independent.
HA: Parasite infection and being eaten are not independent.
Observed frequencies:

             Uninfected   Lightly infected   Highly infected   Total
Eaten                 1                 10                37      48
Not eaten            49                 35                 9      93
Total                50                 45                46     141
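χ² Contingency Test
- Example -
Under H0 (independence), the probability that a fish is both
uninfected and eaten is the product of the two marginal probabilities:
\[ \Pr[\text{uninfected}] = 50/141 = 0.3546 \]
\[ \Pr[\text{eaten}] = 48/141 = 0.3404 \]
\[ \Pr[\text{uninfected and eaten}] = 0.3546 \times 0.3404 = 0.1207 \]
\[ \text{Expected}[\text{uninfected and eaten}] = 0.1207 \times 141 = 17.0 \]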
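The same calculation can be done for every cell at once. A minimal sketch in R, assuming the fish counts are entered as a 2 x 3 matrix. The names `fish`, `n`, and `expected` are illustrative.

```r
# Observed frequencies: predation (rows) by infection status (columns).
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))
n <- sum(fish)

# Under independence: expected = Pr[row] * Pr[column] * n
#                              = (row total * column total) / n
expected <- outer(rowSums(fish), colSums(fish)) / n
round(expected, 2)
```

`outer()` forms every row-total by column-total product, so dividing by `n` gives the full table of expected frequencies in one step.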
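χ² Statistic
Now that we have observed frequencies and expected
frequencies, we can generate a chi-square test using our
general formula:
\[ \chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{(\text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}})^2}{\text{Expected}_{\text{column},\text{row}}} \]
\[ \chi^2 = \frac{(1 - 17.0)^2}{17.0} + \frac{(49 - 33.0)^2}{33.0} + \cdots + \frac{(9 - 30.3)^2}{30.3} = 69.5 \]
\[ \chi^2_{2,\,0.05} = 5.99, \text{ therefore reject } H_0 \]
NB: df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2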
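The hand calculation can be checked in R. This sketch uses unrounded expected frequencies, so the statistic comes out near 69.76 rather than the 69.5 obtained from expected values rounded to one decimal. Object names are illustrative.

```r
# Observed frequencies for the fish example.
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))
expected <- outer(rowSums(fish), colSums(fish)) / sum(fish)

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi_sq <- sum((fish - expected)^2 / expected)

# Degrees of freedom, critical value at alpha = 0.05, and P-value.
df <- (nrow(fish) - 1) * (ncol(fish) - 1)
critical <- qchisq(0.95, df)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, critical = critical, p_value = p_value)
```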
Example
How would we solve this problem in R? Basically, a row x
column table is a matrix; so in keeping with the approach
of using vectors for data, we create an array using the
matrix function. By default the data are read by columns
(note how R works through the vector, filling the matrix one
column at a time):
> fish <- matrix(c(1,49,10,35,37,9), nrow=2)
> fish
     [,1] [,2] [,3]
[1,]    1   10   37
[2,]   49   35    9
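If the counts are easier to type one table row at a time, `byrow = TRUE` produces the same matrix. A short sketch with illustrative object names:

```r
# Default: the vector fills the matrix column by column.
by_col <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)

# byrow = TRUE: the vector fills the matrix row by row instead,
# so the counts are listed as they read across the table.
by_row <- matrix(c(1, 10, 37, 49, 35, 9), nrow = 2, byrow = TRUE)

identical(by_col, by_row)  # TRUE: both give the same 2 x 3 matrix
```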
Example
While we have a perfectly workable matrix, let's prettify it and
add the appropriate variable names and levels:
> fish <- matrix(c(1,49,10,35,37,9), nrow=2,
    dimnames=list("Predation"=c("Eaten", "Not Eaten"),
                  "Infection"=c("Uninfected", "Light", "Heavy")))
> fish
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9
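With the labels in place, the row and column totals used in the expected-frequency calculation (48 eaten, 50 uninfected, 141 fish in all) can be displayed alongside the counts. A minimal sketch using `addmargins()` from the stats package. The object name `totals` is illustrative.

```r
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

# Append a "Sum" row and a "Sum" column holding the marginal totals.
totals <- addmargins(as.table(fish))
totals
```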
Example
And the chi-square test...
> chisq.test(fish)

        Pearson's Chi-squared test

data:  fish
X-squared = 69.7557, df = 2, p-value = 7.124e-16
The object returned by chisq.test() has a number of components
that we can take further advantage of:
> chisq.test(fish)$observed
           Infection
Predation   Uninfected Light Heavy
  Eaten              1    10    37
  Not Eaten         49    35     9
> chisq.test(fish)$expected
           Infection
Predation   Uninfected    Light    Heavy
  Eaten       17.02128 15.31915 15.65957
  Not Eaten   32.97872 29.68085 30.34043
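Rather than rerunning the test for each component, the result can be stored once and queried. A sketch with an illustrative object name, `res`. The component names are those documented for `chisq.test()`.

```r
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))

# Run the test once and keep the returned object.
res <- chisq.test(fish)

res$statistic            # chi-square statistic
res$parameter            # degrees of freedom
res$p.value              # P-value
round(res$residuals, 2)  # Pearson residuals: (O - E) / sqrt(E)
```

The residuals show which cells contribute most to the statistic. Large positive values mark cells with more observations than expected under independence.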
> mosaicplot(t(fish),cex=1.25,color=TRUE)
The chi-square contingency test makes the same assumptions
as the goodness of fit test:
1. No more than 20% of the cells can have an expected frequency less
than 5, and
2. No cell can have an expected frequency less than one.
If either is violated, the response is the same: (a) combine
rows or columns [if the table is bigger than 2 × 2], (b) if the table is 2 × 2
use Fisher's Exact Test, or (c) use a randomization procedure
(discussed at end of course).
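Both rules are easy to check before running the test. The helper below is a hypothetical convenience function (`expected_ok` is not part of R), applied to the fish counts and to the vampire bat counts from the Fisher's Exact Test example in this deck.

```r
# TRUE if a table of observed counts meets both chi-square assumptions.
expected_ok <- function(observed) {
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Rule 1: at most 20% of cells with expected frequency below 5.
  # Rule 2: no cell with expected frequency below 1.
  mean(expected < 5) <= 0.20 && min(expected) >= 1
}

fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2)
bats <- matrix(c(15, 7, 6, 322), nrow = 2)

expected_ok(fish)  # TRUE: every expected frequency exceeds 15
expected_ok(bats)  # FALSE: one of four cells (25%) is expected to be below 5
```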
Correction for Continuity
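When the contingency table is 2 × 2, most statisticians
recommend the use of a continuity correction factor. This
modification is known as the Yates Correction for
Continuity:
\[ \chi^2 = \sum_{\text{column}=1}^{c} \sum_{\text{row}=1}^{r} \frac{\left( \left| \text{Observed}_{\text{column},\text{row}} - \text{Expected}_{\text{column},\text{row}} \right| - \frac{1}{2} \right)^2}{\text{Expected}_{\text{column},\text{row}}} \]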
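The correction can be sketched by hand in R and compared with `chisq.test()`, which applies it to 2 x 2 tables when `correct = TRUE` (its default). The aspirin table from the odds ratio example is reused here only as a convenient 2 x 2 table. Object names are illustrative.

```r
aspirin <- matrix(c(1438, 18496, 1427, 18515), nrow = 2,
                  dimnames = list(Outcome = c("Cancer", "No cancer"),
                                  Treatment = c("Aspirin", "Placebo")))
expected <- outer(rowSums(aspirin), colSums(aspirin)) / sum(aspirin)

# Yates: subtract 1/2 from each |O - E| before squaring.
chi_sq_yates <- sum((abs(aspirin - expected) - 0.5)^2 / expected)

# Built-in versions, with and without the correction.
with_correction    <- chisq.test(aspirin, correct = TRUE)$statistic
without_correction <- chisq.test(aspirin, correct = FALSE)$statistic

c(by_hand = chi_sq_yates,
  corrected = unname(with_correction),
  uncorrected = unname(without_correction))
```

The corrected statistic is always the smaller of the two, which makes the test more conservative.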
Fisher's Exact Test
Fisher's Exact Test is used specifically for 2 × 2 contingency
tests. The test is an improvement over the ordinary chi-square
test in cases where the expected cell frequencies are too
low to meet the regular assumptions. Thus, this test is used
for small data sets comparing two categorical variables.
Let's look at Example 9.4, which examines the feeding
habits of vampire bats. The main question is whether
or not cows in estrus have a greater chance of being
attacked by bats compared to cows not in estrus.
> bats <- matrix(c(15,7,6,322), nrow=2)
> bats
     [,1] [,2]
[1,]   15    6
[2,]    7  322
> fisher.test(bats)

        Fisher's Exact Test for Count Data

data:  bats
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  29.94742 457.26860
sample estimates:
odds ratio
  108.3894
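As with `chisq.test()`, the result can be stored and its parts extracted. One point worth knowing: the odds ratio that `fisher.test()` reports is a conditional maximum-likelihood estimate, so it does not equal the simple cross-product ratio of the four counts. A sketch with illustrative object names:

```r
bats <- matrix(c(15, 7, 6, 322), nrow = 2)

# Run Fisher's Exact Test once and keep the result.
res <- fisher.test(bats)
res$p.value
res$estimate  # conditional maximum-likelihood estimate of the odds ratio
res$conf.int  # 95% confidence interval for the odds ratio

# Simple sample odds ratio (cross-product ratio) for comparison.
sample_or <- (bats[1, 1] * bats[2, 2]) / (bats[1, 2] * bats[2, 1])
sample_or     # (15 * 322) / (6 * 7) = 115
```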
G-tests
The G-test is another contingency test seen frequently in the
literature. The G-test gives results very similar to those of the
chi-square test across a wide range of circumstances. It uses the natural
logarithm (ln) in its calculation.
The G-test may not be as powerful as the chi-square test for
small sample sizes.
R code for the G-test is available, but it is not part of the
normal base or stats packages.
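The statistic itself is short enough to compute by hand in base R. This sketch uses the standard definition G = 2 Σ O ln(O / E), which is not written out on the slide, and applies it to the fish data. It assumes every observed count is greater than zero, and the object names are illustrative.

```r
fish <- matrix(c(1, 49, 10, 35, 37, 9), nrow = 2,
               dimnames = list(Predation = c("Eaten", "Not Eaten"),
                               Infection = c("Uninfected", "Light", "Heavy")))
expected <- outer(rowSums(fish), colSums(fish)) / sum(fish)

# G statistic: twice the sum of O * ln(O / E) over all cells.
g_stat <- 2 * sum(fish * log(fish / expected))

# Compared against the chi-square distribution with (r - 1)(c - 1) df.
df <- (nrow(fish) - 1) * (ncol(fish) - 1)
p_value <- pchisq(g_stat, df, lower.tail = FALSE)

c(G = g_stat, df = df, p_value = p_value)
```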