
CATEGORICAL DATA ANALYSIS

Dr. Martin L. William

1.7 CONTINGENCY TABLES


Consider two categorical response variables X and Y, where X has I categories and Y has J
categories. Suppose the two variables are measured on each subject in a sample. A
rectangular table with I rows and J columns, in which each of the IJ cells contains the frequency
count (the number of subjects falling in that category combination), is known as a Contingency
Table or Cross-Classification Table. More specifically, it is called an I x J contingency table.
1.7.1 Contingency Tables And Distributions
Let π_ij denote the probability that (X, Y) falls in the cell of row i and column j. The probability
distribution {π_ij} is the joint distribution of X and Y. The marginal distributions are given by the
row totals and column totals. That is,

The marginal distribution of X is {π_i+}, where π_i+ = Σ_{j=1}^J π_ij, i = 1, 2, …, I

The marginal distribution of Y is {π_+j}, where π_+j = Σ_{i=1}^I π_ij, j = 1, 2, …, J

Clearly, Σ_i π_i+ = Σ_j π_+j = Σ_i Σ_j π_ij = 1

In most contingency tables, variable Y is a response variable and X is an explanatory variable.


For any fixed category of X, the response variable Y has a probability distribution. Let π_j|i be
the conditional probability of classification in category j of Y given that a subject belongs to
category i of X. The probabilities {π_1|i, π_2|i, …, π_J|i} form the conditional distribution of Y
at category i of X. A principal aim of many studies is to compare the conditional distributions of Y
at various levels of the explanatory variable(s).
Notation for Actual [or Population] joint, marginal & conditional probabilities
______________________________________________________________________
                              COLUMN
              1           2        . . .      J          Total
______________________________________________________________________
 R    1     π_11        π_12       . . .    π_1J         π_1+
 O          (π_1|1)     (π_2|1)    . . .    (π_J|1)      (1.0)
 W    2     π_21        π_22       . . .    π_2J         π_2+
            (π_1|2)     (π_2|2)    . . .    (π_J|2)      (1.0)
      .
      I     π_I1        π_I2       . . .    π_IJ         π_I+
            (π_1|I)     (π_2|I)    . . .    (π_J|I)      (1.0)
______________________________________________________________________
 Total      π_+1        π_+2       . . .    π_+J         1.0
______________________________________________________________________
The numbers in brackets are the conditional probabilities. It may be noted that π_j|i = π_ij / π_i+
Defn: Two categorical variables are said to be independent if π_ij = π_i+ π_+j for all i, j.
Under independence, π_j|1 = π_j|2 = . . . = π_j|I for j = 1, 2, …, J. That is, the probability of a
particular column response is the same in each row. Independence is also referred to as
homogeneity of the conditional distributions.
[For sample distributions we use π̂ in place of π.]
Notation for cell frequencies (observed in sample): The frequency in the (i, j)th cell is denoted n_ij.
If n is the total sample size, then n = Σ_{i=1}^I Σ_{j=1}^J n_ij. Also, n_i+ = Σ_{j=1}^J n_ij = ith row
total and n_+j = Σ_{i=1}^I n_ij = jth column total.

Clearly, π̂_ij = n_ij / n,  π̂_i+ = n_i+ / n,  π̂_j|i = π̂_ij / π̂_i+ = n_ij / n_i+

1.7.2 Sampling Schemes & Distributions


Poisson Sampling Scheme: When neither the total sample size nor the sample sizes for the
categories of the explanatory variable X (row totals) are fixed, but each (i, j) combination is
treated independently and we simply observe the frequencies in the cells, then the 'count' in each
cell may be viewed as a Poisson r.v. Specifically, denoting the number of 'events' in the (i, j)th
cell as Y_ij, the p.m.f. of Y_ij is given by

P(Y_ij = n_ij) = exp(–μ_ij) μ_ij^{n_ij} / n_ij!

and the joint p.m.f. of the IJ counts is the product of these marginal p.m.f.'s.


Multinomial Sampling Scheme: When the total sample size is fixed, but the row and column
totals are not fixed, the joint distribution of the IJ cell counts naturally follows a multinomial
distribution with p.m.f.

[ n! / ∏_{i=1}^I ∏_{j=1}^J n_ij! ] ∏_i ∏_j π_ij^{n_ij}    - - - - (1.7.1)

Independent Multinomial Sampling Scheme: Often, observations on the response (dependent)
variable Y occur separately at each setting i of the explanatory variable X. In this case, the
number of observations under each i (i.e., n_i+) is fixed. This naturally leads to the conditional
probabilities of Y given X at each i, namely {π_1|i, π_2|i, …, π_J|i}. The joint p.m.f. of the
counts in the cells at setting i of X is the multinomial

[ n_i+! / ∏_j n_ij! ] ∏_j π_j|i^{n_ij}    - - - - (1.7.2)

When the responses occur separately at each setting of X, the samples at different settings are
independent and the joint p.m.f. of the IJ counts is the product of the above independent
multinomial p.m.f.'s for i = 1, 2, …, I.
Sometimes the n_+j values are fixed; in such cases we consider the conditional probabilities of
X given Y (at each j), and the joint p.m.f. is the product of J independent multinomial p.m.f.'s,
where the jth term in the product is the multinomial for the jth column.
Conditional Independent Multinomial Sampling Scheme: Suppose that the IJ counts {n_ij}
result from independent Poisson sampling with means {μ_ij}, or from multinomial sampling
over the IJ cells with probabilities {π_ij = μ_ij / Σ_a Σ_b μ_ab}. When X is an explanatory variable (as is
often the case), inference can be performed conditional on the totals {n_i+ = Σ_{j=1}^J n_ij} even when
the n_i+ values are not fixed by the sampling design. Conditional on {n_i+}, the cell counts
{n_i1, n_i2, …, n_iJ} have the multinomial distribution in Equation (1.7.2) with the response
probabilities {π_j|i = μ_ij / μ_i+, j = 1, 2, …, J}, and the counts from different rows are independent.
With this conditioning, the row totals are treated as fixed and the overall joint p.m.f. is the
product of (1.7.2) for i = 1, 2, …, I.
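The link between the Poisson and multinomial schemes can be checked numerically: the product of independent Poisson p.m.f.'s factors into a Poisson p.m.f. for the grand total times a multinomial p.m.f. with cell probabilities μ_ij/μ. A minimal sketch, in which the means and counts are arbitrary illustrative values:

```python
import math

# Check numerically that independent Poisson cell counts factor into a Poisson
# law for the total times a multinomial law over the cells:
#   prod_ij Pois(n_ij; mu_ij) = Pois(n; mu) * Mult(n, {mu_ij / mu})
def pois_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

mu = [[1.2, 0.7], [2.0, 0.5]]    # Poisson means mu_ij (illustrative)
nij = [[2, 1], [3, 0]]           # observed counts n_ij (illustrative)

mu_tot = sum(map(sum, mu))       # mu = sum of all cell means
n_tot = sum(map(sum, nij))       # n  = grand total count

lhs = math.prod(pois_pmf(nij[i][j], mu[i][j])
                for i in range(2) for j in range(2))

mult = math.factorial(n_tot)     # multinomial p.m.f. with p_ij = mu_ij / mu
for i in range(2):
    for j in range(2):
        mult *= (mu[i][j] / mu_tot) ** nij[i][j] / math.factorial(nij[i][j])

rhs = pois_pmf(n_tot, mu_tot) * mult
print(abs(lhs - rhs) < 1e-12)    # the two factorizations agree
```

This is exactly the conditioning argument above: given the total, Poisson counts behave multinomially.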
Example: Researchers in the Highway Department plan to study the relationship between ‘seat-
belt use’ (yes / no) and ‘outcome of an automobile accident’ (fatality / non-fatality) for drivers
driving on a new highway. They will summarize their results in the format shown below:
________________________________________
Result of crash
Seat-belt use Fatality Non-fatality____
Yes
No
________________________________________________________________________________________________

• Suppose they plan to catalog all accidents on the highway for the period of study (say, a
year), classifying each accident according to the two variables (seat-belt use and outcome
of crash); then the total sample size itself is not fixed. In such a case, the numbers of
observations (accidents) at the four combinations may be treated as independent Poisson
variables with unknown means, say, λ11, λ12, λ21, λ22. This is the Poisson scheme.
• Suppose the researchers randomly sample 200 (or any other fixed number of) police
records of crashes and classify each crash according to the two variables; then the total
sample size is fixed, but neither the row nor the column totals are fixed. This is the
'Multinomial Sampling Scheme'.
• Suppose the researchers take a sample of 100 fatal accidents and 100 non-fatal accidents
and, within each of these two groups, classify the accidents according to seat-belt use; then
the column totals (n_+j) are fixed and each column is an independent multinomial
[actually, binomial] sample. This is the 'Independent Multinomial Sampling Scheme'.
• Another approach is to take 200 people, randomly assign 100 to wear seat-belt and the
other 100 to drive without seat-belt [This is traditional experimental design approach].

Here the row totals ( ni+ ) are fixed and each row is an independent binomial sample.
Again, this is the ‘Independent Multinomial Sampling Scheme’. [However, note that it
is not ethical to ‘force’ people to wear or not to wear seat-belt].
1.7.3 Types of Studies
Case–Control Study: A study which looks into the past given the (different) outcomes on the
response variable is known as ‘Case-control study’. It is also called ‘Retrospective study’.
(Eg.) In a study on the link between lung cancer and smoking, 709 patients admitted with lung
cancer were queried about their smoking habits. A patient was classified as a 'smoker' if he/she
had smoked at least one cigarette a day for at least one year, and as a non-smoker otherwise.
Simultaneously, the investigators identified 709 non-cancer patients matched for age and gender
with the cancer patients and classified them as smokers and non-smokers. The lung cancer
patients are called 'Cases' while the non-cancer patients are called 'Controls'. The data are
summarized in the following cross-classification table.
_____________________________
              Lung Cancer
Smoker     Cases     Controls
_____________________________
Yes         688        650
No           21         59
Total       709        709
_____________________________

Normally, 'lung cancer' status is the response variable and smoking behavior an explanatory
variable. However, in this study the marginal distribution of lung cancer is fixed by the sampling
design, and the outcome measured is smoking status in the past. This is a 'retrospective'
design.
In the above case-control study, the cases and controls were ‘matched’. However, unmatched
case-control studies are possible.
Note: A retrospective study treats the totals { n + j } for Y as fixed and regards each column of ‘I’
counts as a multinomial sample on X.

Cohort Study: In a 'Cohort study', subjects make their own choices about which 'X' category
they belong to, and their 'Y' categories are observed at a future time. It is a 'Prospective study'.
(Eg.) A sample of people is selected and classified as smokers or non-smokers. After a
number of years, they are observed as to whether or not they have developed lung cancer.

Clinical Trials: In a ‘clinical trial’, subjects are randomly allocated to the ‘X’ categories and
their ‘Y’ categories are observed at a future time. This is also a ‘Prospective study’.
(Eg.) Subjects are randomly given one of the treatments (say, different drugs which are the ‘X’
categories) and the effect ‘Y’ is observed after a period of time.
Note: In prospective studies, usually we condition on the row totals { n i + } of the ‘X’ categories
and regard each row of ‘J’ counts as an independent multinomial sample on Y.

Cross-sectional Study: In a 'Cross-sectional study', subjects are sampled and classified
simultaneously on both variables.
(Eg.) Subjects are chosen, and it is noted whether they are/were smokers and whether they
have lung cancer.
Note: In cross-sectional studies, the total sample size ‘n’ is fixed but the row & column totals are
not fixed and the counts in the ‘IJ’ cells constitute a multinomial sample.
Note: A ‘Clinical Trial’ is an experimental study while the other three are observational studies.
In a clinical trial the investigator has experimental control over which subjects fall in each of the
‘I’ categories of ‘X’. In such studies, randomization helps to make the groups balance on other
variables that may be associated with the response variable. Observational studies are common
but have more potential for biases of various types.

1.8 COMPARING TWO PROPORTIONS


Many studies are designed to compare groups on a binary response (or dependent) variable. The
DV ‘Y’ has only two categories (Success / Failure or Good/ Bad). With two groups (rows) to be
compared, a 2 x 2 contingency table displays the results.

1.8.1 Difference of Proportions


For subjects in row i, π_1|i is the probability of outcome '1' (success) and π_2|i = 1 – π_1|i is the
probability of outcome '2'. For simplicity, we write π_i for π_1|i.
The 'difference of proportions (or probabilities) of successes', namely π_1 – π_2, is of interest.
It lies between –1 and +1 and equals zero when the two rows have identical (conditional)
distributions, i.e., when Y is statistically independent of the row classification.
Note that a comparison on failures is 'equivalent' to a comparison on successes, since the difference
(1 – π_1) – (1 – π_2) = π_2 – π_1.
Sample version: The sample version (estimate) of the difference of proportions is
n_11/n_1+ – n_21/n_2+. This quantity does not change when any row is multiplied by a
constant. However, it changes when any column is multiplied by a constant, and under
row-column interchange.

1.8.2 Relative Risk


The Relative Risk is defined as π_1 / π_2, or as (1 – π_1) / (1 – π_2), with correspondingly
different interpretations.
'Relative Risk' gives more insight than 'difference of proportions' when comparing
proportions. For example, if π_1 = 0.01 and π_2 = 0.001, then π_1 – π_2 = 0.009 while π_1/π_2 = 10. As
another example, if π_1 = 0.410 and π_2 = 0.401, then π_1 – π_2 = 0.009 (the same) whereas π_1/π_2 =
1.02 only.

As seen in the two illustrations above, the 'difference of proportions' fails to reveal the huge
difference between π_1 and π_2 in the first example, whereas the 'Relative Risk' reveals it
very well: the probability of success in row 1 is 10 times that in row 2.
When π_1 and π_2 are close to 0 or 1, the 'Relative Risk' does a better job of exposing a
genuinely large difference.
Sample version: The sample version (estimate) of the 'Relative Risk' is (n_11 / n_1+) / (n_21 / n_2+).
This quantity does not change when any row is multiplied by a constant. However, it
changes when any column is multiplied by a constant, and under row-column interchange.

1.8.3 Odds & Log odds


If π is the probability of success, we say the 'odds' in favour of success are Ω = π / (1 – π).
Clearly Ω can lie between 0 and ∞, and it tells how likely a success is compared to a
failure. For example, if π = 3/4, then Ω = 3, meaning that a success is three times as likely as a failure; we
expect three successes for every failure.
In a 2 x 2 table, within row i, the odds of success are Ω_i = π_i / (1 – π_i). For joint
distributions with cell probabilities {π_ij}, the odds of success in row i are Ω_i = π_i1 / π_i2.

The log odds of success is simply the logarithm of the odds. It can vary over (–∞, ∞).

1.8.4 Odds Ratio


1  /(1 −  1 )
Refer to a 2 x 2 table. The ‘Odds Ratio’ is defined as θ = = 1 - - - - (2.2.1)
2  2 /(1 −  2 )
This ratio helps us to measure the increment or reduction in the odds favouring success when we
move between the rows (i.e., between the groups being compared).
 /  
For joint distributions, the odds ratio is θ = 11 12 . It is also equal to 11 22 - - - (2.2.2) and
 21 /  22  12  21
is called ‘cross-product ratio’ since it equals the ratio of the products from diagonally opposite
cells.
Properties of Odds Ratio
1. The Odds Ratio θ can lie between 0 & ∞. When one cell has zero probability, θ equals 0
or ∞.
2. θ = 1 corresponds to independence of X and Y. Values of θ farther from 1 represent
stronger association.
3. θ >1 corresponds to the situation when subjects in row 1 are more likely to have ‘success’
than those in row 2. For example, when θ = 4, the odds for success in row 1 are four
times the odds in row 2. [This does not mean that the probability for success in row 1 is
four times as in row 2, which will be the interpretation if the ‘Relative Risk’ is 4].

4. 0 <θ< 1 corresponds to the situation when in row 1 success is less likely than in row 2.
5. Two values of θ represent the same strength of association but in opposite directions
when one value is the reciprocal of the other. For instance, when θ = ¼, the odds for
success in row 1 are one-fourth of the odds in row 2 or equivalently, the odds in row 2 are
4 times that in row 1.
When the two rows or two columns are reversed, the new value for θ is the reciprocal of
the original value.
6. It is often convenient to use log θ instead of θ. Independence corresponds to log θ = 0. The 'log
odds ratio' is symmetric about 0: reversal of rows or of columns changes only its
sign. For instance, two values of log θ that are the same except for sign [such as log 4 =
1.39 and log(1/4) = –1.39] represent the same strength of association.
7. The ‘odds ratio’ is invariant to the reversal of orientation (i.e, rows becoming columns
and vice versa), as found in the ‘symmetric’ expression [cross-product ratio in Eq(2.2.2)].
8. The odds ratio, which is defined using the conditional probabilities of Y given X, can also be
defined using the reverse conditional probabilities. [Note that, with a joint distribution,
conditional distributions exist in each direction.] We see that

θ = (π_11 / π_12) / (π_21 / π_22) = [P(Y=1|X=1)/P(Y=2|X=1)] / [P(Y=1|X=2)/P(Y=2|X=2)]
  = [P(X=1|Y=1)/P(X=2|Y=1)] / [P(X=1|Y=2)/P(X=2|Y=2)]
9. Sample version of the Odds Ratio: For cell counts {n_ij}, the sample odds ratio is
θ̂ = n_11 n_22 / (n_12 n_21). This quantity does not change when any row or any column is
multiplied by a non-zero constant. An implication of this property is that the sample odds
ratio (properly) estimates the same quantity θ even when the sample is
disproportionately large or small for the marginal categories of either variable.

(Eg.) A health study was conducted on whether regular aspirin intake reduces mortality from
cardiovascular disease. It was a randomized, blind clinical study in which the subjects were
randomly put under two regimens: one group (of 11037 subjects) receiving aspirin and the other
(of 11034 subjects) receiving placebo, with no one knowing which they were taking. The results
are summarized below:
__________________________________________________
                     Myocardial Infarction
                 Fatal      Nonfatal       No
Treatment        Attack     Attack         Attack
__________________________________________________
Aspirin             5           99         10933
Placebo            18          171         10845
__________________________________________________

Among the 11037 subjects who received aspirin, the proportion who suffered a heart attack is
π̂_1 = 104/11037 = 0.0094, while in the placebo group it is π̂_2 = 189/11034 = 0.0171.
The difference of proportions is 0.0171 – 0.0094 = 0.0077.
The Relative Risk is 0.0094 / 0.0171 = 0.55.

[There is a 45% reduction in the probability of heart attack when a person is put under aspirin
treatment]
The odds of heart attack in the aspirin group are Ω̂_1 = π̂_1 / (1 – π̂_1) = 0.0094/0.9906 = 0.0095
and in the placebo group Ω̂_2 = π̂_2 / (1 – π̂_2) = 0.0171/0.9829 = 0.0174.
The Odds Ratio is (104 × 10845) / (189 × 10933) = 0.546.
[The odds of attack are reduced by approximately 45% among those taking aspirin compared to
those under placebo.]
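The three summary measures for this 2 x 2 table can be reproduced in a few lines; the counts below come from the table above, pooling fatal and nonfatal attacks into 'attack':

```python
# Aspirin study: rows = (aspirin, placebo), columns = (attack, no attack).
n11, n12 = 104, 10933     # aspirin: 5 fatal + 99 nonfatal = 104 attacks
n21, n22 = 189, 10845     # placebo: 18 fatal + 171 nonfatal = 189 attacks

p1 = n11 / (n11 + n12)    # pihat_1 = 104/11037
p2 = n21 / (n21 + n22)    # pihat_2 = 189/11034

diff = p2 - p1                           # difference of proportions
rr = p1 / p2                             # relative risk
odds_ratio = (n11 * n22) / (n12 * n21)   # cross-product ratio

print(round(diff, 4), round(rr, 2), round(odds_ratio, 3))
# -> 0.0077 0.55 0.546
```

Note that the odds ratio (0.546) and the relative risk (0.55) nearly coincide here, since both π̂_i are close to zero.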

Relevance of Odds Ratio


1. It should be noted that the 'Odds Ratio' can be interpreted in the 'direction of
interest' even with retrospective studies, while quantities like the 'difference of
proportions' and the 'relative risk' for the outcome of interest cannot really be estimated
from retrospective studies.
(Eg.) Consider the case-control study relating lung cancer to smoking habit,
discussed in Section 1.7.3. In this study, Y = lung cancer status and X = smoking status.
The data obtained were two independent binomial samples on X given the Y levels. Thus,
one can estimate the probability that a subject was a smoker given the outcome on lung
cancer status: this is 688/709 for the cases and 650/709 for the controls. We cannot estimate
the probability of lung cancer given smoking status, which is the more interesting quantity.
Thus, we cannot estimate differences and relative risks for cancer, but only for smoking
status, which is not really the quantity of interest.
However, we can compute the odds ratio and interpret it in the 'direction of interest' even
though, in the retrospective study, the odds ratio is actually computed only for the
reverse direction [this follows from Property 8 of the odds ratio]. The odds ratio is found to
be (688 × 59) / (21 × 650) = 3.0.
We infer that the odds of lung cancer among smokers were 3 times the odds among non-
smokers.
 1−  2 
2. We have the relation Odds Ratio = Relative Risk X  
 1−  
 1 
We see that, when both  i are close to zero, Odds Ratio ≈ Relative Risk. In the ‘clinical
trial’ relating aspirin and heart attack, we find that this is true.
For retrospective studies, ‘Relative Risk’ cannot be directly estimated but can be
approximated by the ‘Odds Ratio’ which is estimable, if both  i are close to zero. In the
Case-control study on Lung cancer v/s smoking, if the probability for lung cancer is small
regardless of smoking behavior, we find that 3.0 is a rough estimate of ‘Relative risk’;
That is, the chance for a smoker to get cancer is roughly 3 times that for a non-smokers.

Sensitivity / Specificity: In 2 x 2 tables where Y denotes the 'predicted' category of a subject
(based on a diagnostic procedure) and X denotes the 'actual' category (vis-à-vis the presence or
absence of a disease), it is of interest to measure the efficiency of the diagnostic procedure.
Sensitivity measures the probability that the diagnosis correctly predicts the presence of
disease, and Specificity the probability that the diagnosis correctly predicts the absence
of disease. That is, Sensitivity = P{diagnosing that disease is present | disease is actually present}
and Specificity = P{diagnosing that disease is absent | disease is actually absent}.
If 1 and 2 for Y denote a positive and a negative diagnosis respectively, and 1 and
2 for X denote presence and absence of disease respectively, as in the following table,
_____________________________________
Predicted (Diagnosed) Category
1 (yes) 2(no)
_____________________________________

Actual 1(yes)
Category
2 (no)

then, with the notation used earlier, we have Sensitivity = π_1|1 and Specificity = π_2|2.
Note that cell (1, 2) in the above 2 x 2 table represents 'false negatives', while
cell (2, 1) represents 'false positives'.
Exercises:
1) An article about the PSA blood test for detecting prostate cancer stated as follows:
“The test fails to detect prostate cancer in 1 in 4 men who have the disease (false negative
results) and as many as two-thirds of the healthy men tested receive false-positive results.”
Let C be the event of having prostate cancer, C' the event of not having prostate cancer,
+ denote a positive test result, and – a negative test result.
Which is true: P(– | C) = 1/4 or P(C | –) = 1/4? P(C' | +) = 2/3 or P(+ | C') = 2/3?
Determine sensitivity & specificity.
Solution: The first part of the sentence means that among those with the cancer, 1 out of 4
gets a negative result. So P(– | C) = 1/4 is true.
The second part means that among healthy men tested, two-thirds get a false-positive
result. A false positive means 'predicted as having the disease given that there is no disease'.
Thus, P(+ | C') = 2/3 is true.
Sensitivity = P(+ | C) = 3/4 and Specificity = P(– | C') = 1/3.
2) A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease
status and the diagnostic test results.
Solution: In the notation introduced earlier, we have π_1|1 = π_2|2 = 0.8.
Equivalently, in the simplified notation, π_1 = 0.8 and π_2 = 0.2 [since π_1 = π_1|1 and π_2 = π_1|2 = 1 – π_2|2].

The odds ratio is given by θ = Ω_1 / Ω_2 = [ π_1 / (1 – π_1) ] / [ π_2 / (1 – π_2) ] = (0.8/0.2) / (0.2/0.8) = 16

3) The following table is based on records of accidents that occurred in a state in one year.
Identify the response variable and find and interpret the difference of proportions, relative risk
and odds ratio. Why are the relative risk and odds ratio approximately equal?
______________________________________________________
Safety equipment                  Injury
in use                    Fatal          Non-fatal
______________________________________________________
No equipment used         1601           1,62,527
Equipment used             510           4,12,368
______________________________________________________

Answer: Clearly, the response variable is the 'Injury' type, with two categories: Fatal and Non-fatal.
Denoting the outcome 'fatal injury' as 1 and 'non-fatal injury' as 2, and likewise 'no equipment
used' as 1 and 'equipment used' as 2, we have π̂_1|1 = 1601/164128 = 0.009755 = π̂_1 and
π̂_1|2 = 510/412878 = 0.001235 = π̂_2 (in our simplified notation).
Difference of proportions = π̂_1 – π̂_2 = 0.009755 – 0.001235 = 0.008520
We see that the proportion of fatal injuries is greater when no safety equipment is used and lower
when safety equipment is used.
Relative Risk = π̂_1 / π̂_2 = 7.897 ≈ 8
The probability of fatal injury when no safety equipment is used is almost 8 times the
probability of fatal injury when safety equipment is used; equivalently, there is a 700%
increase in the risk of fatal injury without safety equipment.
Odds Ratio = [ π̂_1 / (1 – π̂_1) ] / [ π̂_2 / (1 – π̂_2) ] = 7.965 ≈ 8 [almost the same as the Relative Risk].
The interpretation is that the odds favouring fatal injury when no safety equipment is used are
almost 8 times the odds when safety equipment is used.
The Odds Ratio is almost the same as the Relative Risk because both π̂_1 and π̂_2 are close to zero.

4)A 20-year cohort study noted that the proportion of respondents per year who were affected by
lung cancer was 0.00140 for smokers and 0.00010 for nonsmokers. The proportion of
respondents who were affected by heart disease was 0.00669 for smokers and 0.00413 for
nonsmokers.
(a) Describe the association of smoking with each of lung cancer and heart disease, using
difference of proportions, relative risk and odds ratio. Interpret.
(b) Which response is more strongly related to cigarette smoking in terms of the reduction in the
number of diseased people that would occur with the elimination of cigarettes? Explain.
Answer: (a) First consider the response 'lung cancer'. Denoting smoking as '1' and non-smoking
as '2', and lung cancer as '1' and no lung cancer as '2', we have π̂_1 = 0.00140 and π̂_2 = 0.00010.
Difference of proportions = π̂_1 – π̂_2 = 0.00140 – 0.00010 = 0.00130 [shows that cancer is
more probable for smokers]
Relative Risk = π̂_1 / π̂_2 = 14
The risk of lung cancer among smokers is 14 times that among nonsmokers; equivalently, there is a
1300% increase in the risk of lung cancer due to smoking.
Odds Ratio = [ π̂_1 / (1 – π̂_1) ] / [ π̂_2 / (1 – π̂_2) ] ≈ 14 [almost the same as the Relative Risk].
Next consider the response 'heart disease'. Denoting smoking as '1' and non-smoking as '2', and
heart disease as '1' and no heart disease as '2', we have π̂_1 = 0.00669 and π̂_2 = 0.00413.
Difference of proportions = π̂_1 – π̂_2 = 0.00669 – 0.00413 = 0.00256 [shows that heart disease
is more probable for smokers]
Relative Risk = π̂_1 / π̂_2 ≈ 1.62
The risk of heart disease among smokers is 1.6 times that among nonsmokers; equivalently, there is a
60% increase in the risk of heart disease due to smoking.
Odds Ratio = [ π̂_1 / (1 – π̂_1) ] / [ π̂_2 / (1 – π̂_2) ] ≈ 1.6 [almost the same as the Relative Risk].
(b) The reduction in the number of diseased people is measured by the difference of proportions,
which is larger for heart disease (0.00256) than for lung cancer (0.00130). So, although lung cancer
is more strongly related to smoking in terms of relative risk and odds ratio, elimination of cigarettes
would prevent more cases of heart disease.

5) For a diagnostic test of a certain disease, π1 denotes the probability that the diagnosis is positive
given that a subject has the disease, and π2 the probability that the diagnosis is positive given that a
subject does not have it. Let ρ be the probability that a subject does have the disease.
a] Given that the diagnosis is positive, show that the probability that a subject does have the disease is
π1 ρ / [ π1 ρ + π2 (1 – ρ) ]
b] Suppose that a diagnostic test for a disease has both sensitivity and specificity 0.95, and
ρ = 0.005. Find the probability that a subject truly has the disease given that the diagnostic test is positive.
Solution: Let + denote a positive diagnosis, – a negative diagnosis, D actual disease, and D' no disease.
We have P(+ | D) = π1, P(+ | D') = π2 and P(D) = ρ.
Thus, P(+) = P(+ | D) P(D) + P(+ | D') P(D') = π1 ρ + π2 (1 – ρ)
a] We need P(D | +) = P(+ | D) P(D) / P(+) = π1 ρ / [ π1 ρ + π2 (1 – ρ) ]
b] It is given that sensitivity (π1) = 0.95, specificity (1 – π2) = 0.95 and ρ = 0.005.
We need P(D | +) = (0.95 × 0.005) / [ (0.95 × 0.005) + (0.05 × 0.995) ] = 0.087156
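Part b] is a direct application of Bayes' theorem and is easy to script; the numbers below are those of the exercise:

```python
# P(D | +) for a diagnostic test, via Bayes' theorem.
sens = 0.95          # P(+ | D): sensitivity, i.e. pi_1
spec = 0.95          # P(- | D'): specificity, so pi_2 = 1 - spec
rho = 0.005          # P(D): disease prevalence

p_pos = sens * rho + (1 - spec) * (1 - rho)   # P(+) by total probability
ppv = sens * rho / p_pos                      # P(D | +), positive predictive value
print(round(ppv, 6))   # -> 0.087156
```

Despite the excellent sensitivity and specificity, the low prevalence drives the positive predictive value below 9%.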

1.9 ODDS RATIOS FOR I X J TABLES


For I x J tables, a set of odds ratios or some other summary index can describe the features of the
association.

One may think of each of the C(I, 2) row pairs in combination with the C(J, 2) column pairs and
form C(I, 2) × C(J, 2) odds ratios, but this set of odds ratios contains much redundant information.

Method 1:
Consider the subset of (I – 1) x (J – 1) local odds ratios (using cells in adjacent rows and adjacent
columns), namely

θ_ij = [ π_ij π_{i+1, j+1} ] / [ π_{i, j+1} π_{i+1, j} ],  i = 1, 2, …, I – 1;  j = 1, 2, …, J – 1    . . . . (1.9.1)

These ratios determine all odds ratios that can be formed from any pair of rows and any pair of
columns. For instance, let us determine the odds ratio for rows a and b (a < b) and columns c and d
(c < d), which equals (π_ac π_bd) / (π_bc π_ad). It is easily seen that

[π_ac π_{a+1,c+1}]/[π_{a,c+1} π_{a+1,c}] × [π_{a,c+1} π_{a+1,c+2}]/[π_{a,c+2} π_{a+1,c+1}] × ··· × [π_{a,d–1} π_{a+1,d}]/[π_{a,d} π_{a+1,d–1}] = [π_ac π_{a+1,d}] / [π_{a+1,c} π_{a,d}]    . . . (1.9.2)

The left-hand side of (1.9.2) is the product of the local odds ratios for row-pair (a, a+1) and
columns c, c+1, …, d – 1. In the same way, multiplying the local odds ratios over columns
c, c+1, …, d – 1 for each of the row-pairs (a+1, a+2), (a+2, a+3), …, (b–1, b), we get

[π_{a+1,c} π_{a+2,d}]/[π_{a+2,c} π_{a+1,d}],  [π_{a+2,c} π_{a+3,d}]/[π_{a+3,c} π_{a+2,d}],  …,  [π_{b–1,c} π_{b,d}]/[π_{b,c} π_{b–1,d}]    . . . . (1.9.3)

Multiplying the right-hand side of (1.9.2) by all the ratios in (1.9.3), we get (π_ac π_bd) / (π_bc π_ad).
Thus, the (I – 1) x (J – 1) local odds ratios are enough to obtain the odds ratio for any row-pair and
column-pair combination.

Method 2:
Consider the subset of (I – 1) x (J – 1) baseline odds ratios (using cells in any row paired with the
last row and any column paired with the last column), namely

α_ij = [ π_ij π_IJ ] / [ π_Ij π_iJ ],  i = 1, 2, …, I – 1;  j = 1, 2, …, J – 1    . . . . (1.9.4)

By suitable multiplication of these baseline odds ratios we can obtain the odds ratio for any row-
pair and column-pair combination.
Note: Clearly, the cell probabilities determine the odds ratios. Also, given the marginal
distributions {π_i+} and {π_+j}, the odds ratios determine the cell probabilities.
We note that (I – 1) x (J – 1) cell probabilities determine everything, given the marginals. In the
same way, the (I – 1) x (J – 1) local or baseline odds ratios determine everything, given the marginals.
Thus, (I – 1) x (J – 1) parameters can describe any association in an I x J table.
Note: Independence of X and Y is equivalent to all the local odds ratios θ_ij equaling 1, and
likewise to all the baseline odds ratios α_ij equaling 1.
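Method 1 can be verified numerically: for a made-up 3 x 4 probability table, the product of the local odds ratios over rows a, …, b–1 and columns c, …, d–1 equals the odds ratio for rows a, b and columns c, d (indices are 0-based in the code):

```python
# Verify that local odds ratios determine any row-pair / column-pair odds ratio.
# The table entries are arbitrary illustrative values; normalisation is
# irrelevant here, since odds ratios are scale-invariant.
p = [[0.10, 0.05, 0.08, 0.02],
     [0.04, 0.12, 0.06, 0.10],
     [0.07, 0.09, 0.15, 0.12]]

def local(i, j):
    # theta_ij = (p_ij p_{i+1,j+1}) / (p_{i,j+1} p_{i+1,j})   -- Eq. (1.9.1)
    return p[i][j] * p[i + 1][j + 1] / (p[i][j + 1] * p[i + 1][j])

a, b, c, d = 0, 2, 1, 3   # rows a < b, columns c < d

prod = 1.0                 # product of the local odds ratios in the block
for i in range(a, b):
    for j in range(c, d):
        prod *= local(i, j)

direct = p[a][c] * p[b][d] / (p[b][c] * p[a][d])   # (pi_ac pi_bd)/(pi_bc pi_ad)
print(abs(prod - direct) < 1e-9)   # the telescoping product recovers it
```

The inner products telescope exactly as in (1.9.2) and (1.9.3), so the two quantities agree up to floating-point error.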

1.10 OTHER MEASURES OF ASSOCIATION–NOMINAL VARIABLES
The most interpretable indices for nominal variables have the same structure as R² for interval-scale
variables. These measures describe the proportional reduction in variation from the marginal
distribution of the response Y to the conditional distributions of Y given an explanatory variable X.
Proportional Reduction in Variation: Let V(Y) denote any measure of variation for the
marginal distribution {π_+j} of Y, and let V(Y | i) denote this measure for the
conditional distribution {π_1|i, π_2|i, …, π_J|i} of Y at the ith setting of X. A 'Measure of
Proportional Reduction in Variation' has the form

[ V(Y) – E[V(Y | X)] ] / V(Y)    - - - (1.10.1)

where E[V(Y | X)] = Σ_{i=1}^I V(Y | X = i) π_i+. This varies between 0 and 1.

1.10.1 Entropy & Uncertainty Coefficient 'U'


Theil's variation measure, 'Entropy', is defined as the negative expected value of the log of the
p.m.f. If p(·) denotes the marginal p.m.f. of Y, then the entropy is

V(Y) = – E[log p(Y)] = – Σ_{j=1}^J π_+j log π_+j

This quantity may also be expressed as – Σ_{j=1}^J Σ_{i=1}^I π_ij log π_+j

The conditional entropy at X = i is V(Y | X = i) = – Σ_{j=1}^J π_j|i log π_j|i

and E[V(Y | X)] = – Σ_{i=1}^I π_i+ Σ_{j=1}^J π_j|i log π_j|i
               = – Σ_i Σ_j π_ij log π_j|i = – Σ_i Σ_j π_ij log ( π_ij / π_i+ )

So, V(Y) – E[V(Y | X)] = Σ_i Σ_j π_ij [ log ( π_ij / π_i+ ) – log π_+j ]
                       = Σ_i Σ_j π_ij log [ π_ij / (π_i+ π_+j) ]

Thus, the 'Proportional Reduction in Entropy' equals

U = { Σ_i Σ_j π_ij log [ π_ij / (π_i+ π_+j) ] } / { – Σ_j π_+j log π_+j }

and is also called the Uncertainty Coefficient.

13
CATEGORICAL DATA ANALYSIS
Dr. Martin L. William
Note: The uncertainty coefficient is well defined when more than one π_+j > 0. Also, 0 ≤ U ≤ 1.
U = 0 corresponds to independence of X and Y.
U = 1 corresponds to 'total dependence' of Y on X.
[That is, for each fixed X = i, Y takes only one value j, giving π_j|i = 1 for that j. Then
E[V(Y | X)] = 0, so that
Numerator = V(Y) – E[V(Y | X)] = – Σ_{j=1}^J π_+j log π_+j = Denominator]

1.10.2 Gini Concentration Index & Tau (Concentration Coefficient)


Tau [Concentration Coefficient]: The 'Gini Concentration Index' is a measure of variation
defined as

V(Y) = Σ_{j=1}^J π_+j (1 – π_+j) = 1 – Σ_{j=1}^J π_+j²

It is the probability that two independent observations on Y fall in different categories.


[Note that V(Y) = 0 when π_+j = 1 for some j; the maximum value of V(Y) is (J – 1)/J, which
happens when π_+j = 1/J for all j.]

For this measure, V(Y | X = i) = Σ_{j=1}^J π_j|i (1 – π_j|i), so that

E[V(Y | X)] = Σ_{i=1}^I Σ_{j=1}^J π_j|i (1 – π_j|i) π_i+ = 1 – Σ_{i=1}^I Σ_{j=1}^J π_ij² / π_i+

The resulting 'Proportional Reduction in Variation', [ V(Y) – E[V(Y | X)] ] / V(Y), is called the
'Concentration Coefficient' and is denoted 'τ'.
Note: τ = 0 corresponds to independence.

1.10.3 Prediction Error & Lambda


Lambda: The 'Prediction Error' [when the 'most likely' category is taken as the predicted
category] is a measure of variation given by

V(Y) = 1 – max_j { π_+j }

For this measure, V(Y | X = i) = 1 – max_j { π_j|i }, so that
E[V(Y | X)] = Σ_{i=1}^I [ 1 – max_j { π_j|i } ] π_i+

The resulting 'Proportional Reduction in Variation', [ V(Y) – E[V(Y | X)] ] / V(Y), is called
'Lambda' and is denoted 'λ'.
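Both measures above follow the proportional-reduction template (1.10.1) and can be sketched together; as a sanity check, both τ and λ should be 0 for an independent table (the numbers below are made up):

```python
def tau(p):
    """Concentration coefficient: PRV based on the Gini index."""
    row = [sum(r) for r in p]                                   # p_i+
    col = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    v_y = 1 - sum(c * c for c in col)                           # V(Y)
    e_v = 1 - sum(p[i][j] ** 2 / row[i]
                  for i in range(len(p)) for j in range(len(p[0])))
    return (v_y - e_v) / v_y

def lam(p):   # 'lam' since 'lambda' is a Python keyword
    """Lambda: PRV based on prediction error with the modal category."""
    col = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    v_y = 1 - max(col)                                          # V(Y)
    e_v = 1 - sum(max(row) for row in p)                        # E[V(Y|X)]
    return (v_y - e_v) / v_y

rows, cols = [0.4, 0.6], [0.5, 0.3, 0.2]
p_indep = [[r * c for c in cols] for r in rows]   # independent table
print(abs(tau(p_indep)) < 1e-12, abs(lam(p_indep)) < 1e-12)   # -> True True
```

For a perfectly diagonal 2 x 2 table, such as [[0.5, 0], [0, 0.5]], both measures equal 1, reflecting total dependence of Y on X.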
______________________________________________________________________________
