
Determining what analysis to use?
• For every model we will cover you must ask a series of
questions
1. What type of dependent variable do I have?
2. How many independent variables do I have?
3. What type of independent variable do I have?
• If the independent variable is categorical, how many categories does it have?
4. Were subjects measured more than once (repeated measures)?
• The next slide provides a decision chart which summarizes
these questions and leads to a statistical method
• We will only cover a few of the models shown; the other methods
are FYI
• This lecture will focus on contingency analysis
Contingency
analysis
Contingency analysis
• Contingency analysis estimates and tests for
associations between two or more categorical
variables
Contingency analysis
• Contingency analysis is complemented by
• Contingency table
• Graphing: mosaic plots or grouped barplots
Contingency analysis
• The effect size associated with contingency analysis is the odds ratio
Contingency analysis
• Assumptions of contingency analysis are linked to
the test statistic used
• It is commonplace to apply the X2 (chi-squared) test.
• Assumptions include:
• Random sample
• Sufficiently large sample size
• Expected cell counts ≥ 5 (Yates’s correction can be used)
• The observations are assumed to be independent
Contingency analysis
• At its heart, contingency analysis is the
investigation of the independence of the variables
• If two variables are independent, then the state of
one variable tells nothing about the probability of
the different values of the other variable
Contingency analysis
Example: how did women fare on the Titanic?

Observed frequencies
  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092
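As a minimal sketch, the observed frequencies above can be typed directly into R rather than read from a file. The object name `observed` is an assumption chosen for illustration and does not come from the lecture.

```r
# Observed frequencies from the table above (rows: fate, columns: sex).
# matrix() fills column by column: Men first, then Women.
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

# Add the row and column sums shown in the table
addmargins(observed)

# Proportion of each sex that died or survived (each column sums to 1)
round(prop.table(observed, margin = 2), 3)
```

The column proportions already hint at the association: the share of men who died is much larger than the share of women who died.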
X2 contingency test
• Ho: Survival was independent of sex on the Titanic
• Ha: Survival was not independent of sex on the
Titanic
• To perform a X2 contingency test we calculate the
expected frequencies and compare them to the
observed frequencies
• The expected frequencies are those under the
assumption the null hypothesis is true
X2 contingency test
• What would we expect if sex and death were
independent?
• If they were independent, a mosaic plot would show equal
proportions of men and women dying on the Titanic, but this
isn't what happened
X2 contingency test
• However, in reality sex and death were not
independent
• More men died
X2 contingency test
• Calculating the expected frequencies
• If two events are independent, then by definition, the
probability of both occurring is equal to the probability
of one event occurring times the probability of the other
event occurring
• Thus:

Pr[male and dead] = Pr[male] X Pr[dead]


X2 contingency test
• Calculating the expected frequencies

Pr[male] = 1667/2092 = 0.797

Observed frequencies
  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092
X2 contingency test
• Calculating the expected frequencies

Pr[dead] = 1438/2092 = 0.687

Observed frequencies
  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092
X2 contingency test
• Calculating the expected frequencies

Pr[male and dead] = Pr[male] X Pr[dead]


Pr[male and dead] = 0.797 X 0.687 = 0.548
X2 contingency test
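• To get the expected frequency, multiply this probability by the total sample size (2092):
Expected[male and dead] = 0.548 X 2092 = 1146.416, or 1145.863 when the probabilities are not rounded first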

• You would do this for every cell and construct this table:

Expected frequencies
  Men Women Sum
Died 1145.863 292.137 1438.000
Survived 521.137 132.863 654.000
Sum 1667.000 425.000 2092.000
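The expected frequencies can be reproduced with a short sketch, assuming the observed counts are typed in by hand. The names `observed` and `expected` are illustrative assumptions. Each expected count is the row total times the column total divided by the grand total, which is the same as Pr[row] X Pr[column] X n.

```r
# Observed frequencies (rows: fate, columns: sex)
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

n <- sum(observed)  # total number of individuals (2092)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

round(expected, 3)
```

Working from the unrounded totals avoids the small rounding error that appears when 0.797 and 0.687 are multiplied by hand.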
X2 contingency test
• X2 statistic

$$X^2 = \frac{(1329 - 1145.863)^2}{1145.863} + \frac{(338 - 521.137)^2}{521.137} + \frac{(109 - 292.137)^2}{292.137} + \frac{(316 - 132.863)^2}{132.863} = 460.866$$

Observed frequencies
  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092

Expected frequencies
  Men Women Sum
Died 1145.863 292.137 1438.000
Survived 521.137 132.863 654.000
Sum 1667.000 425.000 2092.000
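The sum over the four cells can be checked with a short sketch. The object names are assumptions for illustration; the final line compares the hand calculation with R's built-in `chisq.test()`, called without the continuity correction.

```r
# Observed frequencies (rows: fate, columns: sex)
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

# Expected frequencies under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# X2 statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
round(chi_sq, 3)

# The built-in test (no Yates's correction) gives the same statistic
builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)
all.equal(chi_sq, builtin)
```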
X2 contingency test
• Degrees of freedom (df)- the number of values in the final
calculation of a statistic that are free to vary
• Part of almost all P-value estimation
• More is better
• For X2

Df = (number of rows - 1)*(number of columns - 1)


Df = (2-1)(2-1) = 1
X2 contingency test
• The critical value for the X2 distribution with
df = 1 at significance level α = 0.05 is 3.841
• Our observed value, X2 = 460.866, is far larger than the critical value
• Therefore, we reject the null hypothesis
• P < 0.01
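The critical value and the P-value can be looked up in R instead of a printed table. This is a minimal sketch using the base functions `qchisq()` and `pchisq()`; the variable names are assumptions for illustration.

```r
chi_sq <- 460.866          # observed X2 statistic from the slides
df <- (2 - 1) * (2 - 1)    # (rows - 1) * (columns - 1)
alpha <- 0.05

# Critical value: the 95th percentile of the X2 distribution with 1 df
critical <- qchisq(1 - alpha, df = df)
round(critical, 3)

# P-value: probability of a statistic at least this large under the null
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value

# Reject the null hypothesis when the statistic exceeds the critical value
chi_sq > critical
```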
X2 contingency test
• Assumptions of the X2 contingency test
• No more than 20% of the cells have an expected
frequency less than 5
• No cell can have an expected frequency less than one
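• Evaluate the expected frequencies in R: they are stored in the
“expected” component of the chisq.test() result, and the “addmargins”
command adds the row and column sums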
• If these assumptions are violated there are three options
X2 contingency test
• 1st, if the table is larger than a 2X2 then combine
rows or columns

[Figure: a table with columns Col 1, Col 2 and Col 3 (rows Row 1 and Row 2) is collapsed into a table with columns Col 1 and Col 2]

• This should be done carefully, as it can make the data
meaningless
X2 contingency test
• 2nd, if the table is a 2X2 then the Fisher exact test
should be used instead
• Better than X2 in cases where the expected cell
frequencies are too low to meet the assumptions of X2
• In R use “fisher.test()”

• 3rd, a permutation test (a type of resampling
procedure) can be performed
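The assumption check and two of the alternatives can be sketched in R as follows. The Titanic table is used only to show the calls, since its expected frequencies are all large. The simulated P-value from `chisq.test(simulate.p.value = TRUE)` is a Monte Carlo resampling procedure offered here as a stand-in for a permutation test; it is an assumption that this matches what the lecture intends.

```r
# Observed frequencies (rows: fate, columns: sex)
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

# Expected frequencies are stored in the test result
result <- chisq.test(observed, correct = FALSE)
expected <- result$expected

# Rule 1: no more than 20% of cells with an expected frequency below 5
share_below_5 <- mean(expected < 5)
share_below_5 <= 0.20

# Rule 2: no cell with an expected frequency below 1
all(expected >= 1)

# Option for 2X2 tables: Fisher's exact test
fisher_result <- fisher.test(observed)
fisher_result$p.value

# Resampling option: Monte Carlo P-value based on B simulated tables
set.seed(1)  # makes the simulation reproducible
sim_result <- chisq.test(observed, simulate.p.value = TRUE, B = 2000)
sim_result$p.value
```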
Odds ratios
• In order to understand contingency analysis we
need to get a grasp of odds
• Consider a variable for which a single random trial
yields one of two possible outcomes: success or
failure
• The odds of success are the probability of success
divided by the probability of failure
Odds ratios

$$\hat{O} = \frac{\hat{p}}{1 - \hat{p}}$$

• Where:
• $\hat{p}$ is the probability of success
• $1 - \hat{p}$ is the probability of failure
• $\hat{O}$ are the odds of success
Odds ratios

$$\hat{O} = \frac{\hat{p}}{1 - \hat{p}}$$

• What is the difference between
probability and odds?
• Probability expresses chance as a ratio of the number of
desired outcomes to the total number of possible
outcomes
• Odds is the chance as a ratio of success to failure, the
number of desired outcomes to the number of undesired
outcomes
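The relationship between probability and odds can be sketched with two small helper functions. Both `odds()` and `prob_from_odds()` are hypothetical names introduced for illustration; they are not part of base R or of the lecture.

```r
# Odds of success: probability of success divided by probability of failure
odds <- function(p) p / (1 - p)

# Inverse conversion: recover the probability from the odds
prob_from_odds <- function(o) o / (1 + o)

odds(0.5)   # even odds: success and failure are equally likely
odds(0.8)   # success is four times as likely as failure

prob_from_odds(odds(0.8))  # converts back to the original probability
```

A probability is bounded between 0 and 1, while odds run from 0 upward without limit.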
Odds ratios
• 1st – estimate the proportion of men that died

  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092

$$\hat{p}_1 = \frac{1329}{1667} = 0.797$$
Odds ratios
• 2nd – estimate the proportion of men that lived

$$1 - \hat{p}_1 = 1 - 0.797 = 0.203$$
Odds ratios
• 3rd – estimate the odds of dying on the Titanic if
the individual was male

$$\hat{O}_1 = \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{0.797}{0.203} = 3.93$$
Odds ratios
• Now repeat this process for estimating the odds
that a woman died on the Titanic

  Men Women Sum
Died 1329 109 1438
Survived 338 316 654
Sum 1667 425 2092

$$\hat{p}_2 = \frac{109}{425} = 0.256$$

$$1 - \hat{p}_2 = 1 - 0.256 = 0.744$$

$$\hat{O}_2 = \frac{\hat{p}_2}{1 - \hat{p}_2} = \frac{0.256}{0.744} = 0.345$$
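The two sets of odds can be computed in one step, as in this sketch. The object names are assumptions for illustration. Because the calculation uses unrounded proportions, the results can differ in the last digit from values worked out by hand with rounded inputs.

```r
# Observed frequencies (rows: fate, columns: sex)
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

# Proportion of each sex that died
p_died <- observed["Died", ] / colSums(observed)
round(p_died, 3)

# Odds of dying for each sex: p / (1 - p)
odds_died <- p_died / (1 - p_died)
round(odds_died, 3)
```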
Odds ratios
• Now that we have the odds we can calculate the
odds ratio
• The odds ratio measures the magnitude of
association between two categorical variables
when each variable has only two categories

$$OR = \frac{\hat{O}_1}{\hat{O}_2}$$
Odds ratios
• The odds ratio is a measure of effect size
describing the strength of association (non-independence)
between two binary variables

$$OR = \frac{\hat{O}_1}{\hat{O}_2}$$
Odds ratios
• If the odds ratio is equal to one, the odds of success
in the response variable are independent of
treatment
• If > 1, the event has higher odds in the first group
than the second
• If < 1, then the odds are higher in the second group

$$OR = \frac{\hat{O}_1}{\hat{O}_2}$$
Odds ratios
$$OR = \frac{\hat{O}_1}{\hat{O}_2} = \frac{3.932}{0.345} = 11.399$$

• OR is > 1, thus men had higher odds (more than 11 times higher!) of
dying on the Titanic than women
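The odds ratio can also be computed by hand from the four cell counts, as in this sketch. The 95% confidence interval is an addition not worked through on the slides: it uses the standard Wald approximation, in which the standard error of log(OR) is the square root of the sum of the reciprocal cell counts. The object names are assumptions for illustration.

```r
# Observed frequencies (rows: fate, columns: sex)
observed <- as.table(matrix(c(1329, 338, 109, 316), nrow = 2,
                            dimnames = list(fate = c("Died", "Survived"),
                                            sex  = c("Men", "Women"))))

# Odds of dying for men and for women
odds_men   <- observed["Died", "Men"]   / observed["Survived", "Men"]
odds_women <- observed["Died", "Women"] / observed["Survived", "Women"]

# Odds ratio: odds in the first group divided by odds in the second
odds_ratio <- odds_men / odds_women
round(odds_ratio, 3)

# Wald 95% confidence interval, built on the log scale
se_log_or <- sqrt(sum(1 / observed))
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)
round(ci, 2)
```

An interval that excludes 1 agrees with the rejection of independence in the X2 test.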
This is how you do this in R
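# Read the Titanic data from the textbook website
titanic <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter09/chap09f1.1Titanic.csv"))

# Contingency table: survival in rows, sex in columns
titanicTable <- table(titanic$survival, titanic$sex)

# Add the row and column sums
addmargins(titanicTable)

# X2 contingency test without Yates's continuity correction
chisq.test(titanic$survival, titanic$sex, correct = FALSE)

# Odds ratio with a Wald confidence interval (requires the epitools package)
library(epitools)
oddsratio(titanicTable, method = "wald")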
