The marginal distribution of Y is { π_+j } where π_+j = Σ_{i=1}^{I} π_ij , j = 1, 2, …, J
Clearly,
 Σ_i π_i+ = Σ_j π_+j = Σ_i Σ_j π_ij = 1
 Row 1:   π_11      π_12      . . .   π_1J        π_1+
          (π_{1|1}) (π_{2|1}) . . .   (π_{J|1})   (1.0)
 Row 2:   π_21      π_22      . . .   π_2J        π_2+
          (π_{1|2}) (π_{2|2}) . . .   (π_{J|2})   (1.0)
 …
 Row I:   π_I1      π_I2      . . .   π_IJ        π_I+
          (π_{1|I}) (π_{2|I}) . . .   (π_{J|I})   (1.0)
CATEGORICAL DATA ANALYSIS
Dr. Martin L. William
The numbers in brackets are the conditional probabilities. It may be noted that π_{j|i} = π_ij / π_i+.
Defn: Two categorical variables are said to be independent if π_ij = π_i+ π_+j for all i, j.
Under independence, π_{j|1} = π_{j|2} = . . . = π_{j|I} for j = 1, 2, …, J. That is, the probability of a
particular column response is the same in each row. Independence is also referred to as
homogeneity of the conditional distributions.
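As a quick numerical check of this definition, the sketch below (with made-up probability tables, not data from these notes) tests whether a joint table factors into the product of its margins:

```python
# Check whether a joint probability table satisfies pi_ij = pi_i+ * pi_+j.
# Both example tables are hypothetical illustrations.

def is_independent(pi, tol=1e-12):
    """pi: list of rows of joint probabilities pi_ij."""
    row_tot = [sum(row) for row in pi]            # pi_i+
    col_tot = [sum(col) for col in zip(*pi)]      # pi_+j
    return all(abs(pi[i][j] - row_tot[i] * col_tot[j]) < tol
               for i in range(len(pi)) for j in range(len(pi[0])))

# Independent table: pi_ij = pi_i+ * pi_+j by construction.
rows, cols = [0.3, 0.7], [0.2, 0.5, 0.3]
indep = [[r * c for c in cols] for r in rows]
print(is_independent(indep))     # True

# Perturbed table (cells still sum to 1) -> no longer independent.
dep = [[0.10, 0.15, 0.05], [0.10, 0.35, 0.25]]
print(is_independent(dep))       # False
```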
[ For sample distributions we use π̂ in the place of π ]
Notation for cell frequencies (observed in sample): The frequency in the (i, j)th cell is denoted n_ij.
If 'n' is the total sample size then n = Σ_{i=1}^{I} Σ_{j=1}^{J} n_ij. Also, n_i+ = Σ_{j=1}^{J} n_ij = ith row total and n_+j = Σ_{i=1}^{I} n_ij = jth column total.
Clearly, π̂_ij = n_ij / n , π̂_i+ = n_i+ / n , π̂_{j|i} = π̂_ij / π̂_i+ = n_ij / n_i+
Under multinomial sampling with the total 'n' fixed, the joint p.m.f. of the counts { n_ij } is
 p({ n_ij }) = [ n! / Π_{i=1}^{I} Π_{j=1}^{J} n_ij! ] Π_{i=1}^{I} Π_{j=1}^{J} π_ij^{n_ij}
Independent Multinomial Sampling Scheme: When the row totals { n_i+ } are fixed, the counts in the ith row have the multinomial p.m.f.
 p( n_i1 , …, n_iJ ) = [ n_i+! / Π_{j=1}^{J} n_ij! ] Π_{j=1}^{J} π_{j|i}^{n_ij} . . . . (2.1.2)
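The totals and sample proportions just defined can be computed directly; the sketch below uses a hypothetical 2 × 3 table of counts:

```python
# Compute the totals and sample proportions defined above for a table of
# counts n_ij.  The 2x3 table below is hypothetical.
n = [[20, 30, 10],
     [15, 15, 10]]

total   = sum(sum(row) for row in n)         # n
row_tot = [sum(row) for row in n]            # n_i+
col_tot = [sum(col) for col in zip(*n)]      # n_+j

pi_hat      = [[nij / total for nij in row] for row in n]    # pi^_ij
pi_hat_cond = [[n[i][j] / row_tot[i]                         # pi^_j|i
                for j in range(len(n[0]))]
               for i in range(len(n))]

print(total, row_tot, col_tot)    # 100 [60, 40] [35, 45, 20]
print(pi_hat_cond[0][1])          # 0.5  (= 30/60)
```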
When the responses occur separately at each setting of X, the samples at different settings are
independent and the joint p.m.f. of the 'IJ' counts is the product of the above independent
multinomial p.m.f.s for i = 1, 2, …, I.
Sometimes, the n+j values are fixed and in such cases, we have to consider the conditional
probabilities of X given Y (at each ‘j' ) and the joint p.m.f. will be the product of ‘J’ independent
multinomial p.m.f.’s where the jth term in the product is a multinomial for the jth column.
Conditional Independent Multinomial Sampling Scheme: Suppose that the 'IJ' counts { n_ij }
result from independent Poisson sampling with means { μ_ij } or from multinomial sampling
over the 'IJ' cells with probabilities { π_ij = μ_ij / n }. When X is an explanatory variable (as is
often the case), inference can be performed conditional on the totals { n_i+ = Σ_{j=1}^{J} n_ij } even when
the n_i+ values are not fixed by the sampling design. Conditional on { n_i+ }, the cell counts
{ n_i1 , n_i2 , . . . , n_iJ } have the multinomial distribution in Equation (2.1.2) with the response
probabilities { π_{j|i} = π_ij / π_i+ , j = 1, 2, …, J } and the counts from different rows are independent.
With this conditioning, the row totals are treated as fixed and the overall joint p.m.f. is the
product of (2.1.2) for i = 1, 2, …, I.
Example: Researchers in the Highway Department plan to study the relationship between ‘seat-
belt use’ (yes / no) and ‘outcome of an automobile accident’ (fatality / non-fatality) for drivers
driving on a new highway. They will summarize their results in the format shown below:
 ________________________________________
                Result of crash
 Seat-belt use    Fatality    Non-fatality
 Yes
 No
 ________________________________________
• Suppose they plan to catalog all accidents on the highway for the period of study (say, a
year) classifying each accident according to the two variables (Seat-belt use & outcome
of crash), then the total sample size itself is not fixed. In such a case, the numbers of
observations (accidents) at the four combinations may be treated as independent Poisson
variables with unknown means, say, λ11, λ12, λ21, λ22. This is the 'Poisson sampling scheme'.
• Suppose the researchers randomly sample 200 (or any other fixed number) of police
records of crashes and classify each crash according to the two variables, then the total
sample size is fixed but neither the row totals nor the column totals are fixed. This is the
‘Multinomial sampling scheme’.
• Suppose the researchers take a sample of 100 fatal accidents and 100 non-fatal accidents
and within each of these two groups classify the accidents according to seat-belt use, it
means the column totals ( n+j) are fixed and each column is an independent multinomial
[actually, binomial] sample. This is the ‘Independent Multinomial Sampling Scheme’.
• Another approach is to take 200 people, randomly assign 100 to wear seat-belt and the
other 100 to drive without seat-belt [This is traditional experimental design approach].
Here the row totals ( ni+ ) are fixed and each row is an independent binomial sample.
Again, this is the ‘Independent Multinomial Sampling Scheme’. [However, note that it
is not ethical to ‘force’ people to wear or not to wear seat-belt].
1.7.3 Types of Studies
Case–Control Study: A study which looks into the past given the (different) outcomes on the
response variable is known as ‘Case-control study’. It is also called ‘Retrospective study’.
(Eg.) In a study on the link between lung cancer and smoking, 709 cancer patients, who were
admitted with lung cancer, were queried about their smoking habits. A patient was classified as a
‘smoker’ if he/ she had smoked at least one cigarette a day for at least one year and as a non-
smoker otherwise. Simultaneously, the investigators identified 709 non-cancer patients matching
for age and gender with the cancer patients and classified them as 'smokers' and 'non-smokers'.
The lung cancer patients are called 'Cases' while the non-cancer patients are called
‘Controls’. The data are summarized in the following cross-classification table.
 ____________________________________
                Lung Cancer
 Smoker       Cases      Controls
 Yes           688         650
 No             21          59
 Total         709         709
 ____________________________________
Normally, ‘lung cancer’ status is the response variable and smoking behavior is an explanatory
variable. However, in this study, the marginal distribution of lung cancer is fixed by sampling
design and the outcome that is measured is ‘smoking status’ in the past. This is ‘retrospective’
design.
In the above case-control study, the cases and controls were ‘matched’. However, unmatched
case-control studies are possible.
Note: A retrospective study treats the totals { n_+j } for Y as fixed and regards each column of 'I'
counts as a multinomial sample on X.
Cohort Study: In a ‘Cohort study’ subjects make their own choices about which ‘X’ category
they belong to and their ‘Y’ categories are observed at a future time. It is a ‘Prospective study’.
(Eg.) A sample of people is selected and classified as to whether they are smokers or not. After a
number of years they are observed as to whether they have developed lung cancer or not.
Clinical Trials: In a ‘clinical trial’, subjects are randomly allocated to the ‘X’ categories and
their ‘Y’ categories are observed at a future time. This is also a ‘Prospective study’.
(Eg.) Subjects are randomly given one of the treatments (say, different drugs which are the ‘X’
categories) and the effect ‘Y’ is observed after a period of time.
Note: In prospective studies, usually we condition on the row totals { n_i+ } of the 'X' categories
and regard each row of 'J' counts as an independent multinomial sample on Y.
Cross-sectional Study: In a ‘Cross-sectional study’, subjects are sampled and classified
simultaneously on both variables.
(Eg.) Subjects are chosen and noted whether they are / were smokers or not and whether they
have lung cancer or not.
Note: In cross-sectional studies, the total sample size ‘n’ is fixed but the row & column totals are
not fixed and the counts in the ‘IJ’ cells constitute a multinomial sample.
Note: A ‘Clinical Trial’ is an experimental study while the other three are observational studies.
In a clinical trial the investigator has experimental control over which subjects fall in each of the
‘I’ categories of ‘X’. In such studies, randomization helps to make the groups balance on other
variables that may be associated with the response variable. Observational studies are common
but have more potential for biases of various types.
As seen in the above two illustrations, the 'difference in proportions' fails to reveal the huge
difference between π_1 and π_2 in the first example, whereas the 'Relative Risk' reveals it very
well: the probability of success in row 1 is 10 times that in row 2.
When π_1 and π_2 are close to 0 or 1, the 'Relative Risk' does a better job of conveying how big
the difference really is.
Sample version: The sample version (estimate) of the 'Relative Risk' is (n_11 / n_1+) / (n_21 / n_2+).
It is seen that this quantity does not change when any row(s) is / are multiplied by a constant. However, it
changes with multiplication of any column(s) and with row-column interchange.
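These invariance properties are easy to verify numerically; the sketch below uses hypothetical counts:

```python
# Sample relative risk r = (n11/n1+) / (n21/n2+) for a 2x2 table of counts,
# illustrating the invariance properties stated above (hypothetical counts).

def relative_risk(t):
    (n11, n12), (n21, n22) = t
    return (n11 / (n11 + n12)) / (n21 / (n21 + n22))

t = [[30, 70], [10, 90]]
print(relative_risk(t))          # ~ 3.0

# Multiplying a row by a constant leaves it unchanged ...
t_row = [[300, 700], [10, 90]]
print(relative_risk(t_row))      # ~ 3.0

# ... but multiplying a column changes it.
t_col = [[300, 70], [100, 90]]
print(relative_risk(t_col))      # no longer ~ 3
```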
Log odds of success is nothing but the logarithm of odds. It can vary over (–∞, ∞).
4. 0 <θ< 1 corresponds to the situation when in row 1 success is less likely than in row 2.
5. Two values of θ represent the same strength of association but in opposite directions
when one value is the reciprocal of the other. For instance, when θ = ¼, the odds for
success in row 1 are one-fourth of the odds in row 2 or equivalently, the odds in row 2 are
4 times that in row 1.
When the two rows or two columns are reversed, the new value for θ is the reciprocal of
the original value.
6. It is convenient to use log θ instead of θ. Independence corresponds to log θ = 0. The 'log
odds ratio' is symmetric about 0 – reversal of rows or of columns results in a change of its
sign. For instance, two values of log θ that are the same except for sign [such as log 4 =
1.39 & log(1/4) = – 1.39] represent the same strength of association.
7. The ‘odds ratio’ is invariant to the reversal of orientation (i.e, rows becoming columns
and vice versa), as found in the ‘symmetric’ expression [cross-product ratio in Eq(2.2.2)].
8. Odds ratio, which is defined using conditional probabilities of Y given X, can also be
defined using reverse conditional probabilities. [Note that, with joint distribution,
conditional distributions exist in each direction]. We see that,
 θ = (π_11 / π_12) / (π_21 / π_22)
   = [ P(Y=1|X=1) / P(Y=2|X=1) ] / [ P(Y=1|X=2) / P(Y=2|X=2) ]
   = [ P(X=1|Y=1) / P(X=2|Y=1) ] / [ P(X=1|Y=2) / P(X=2|Y=2) ]
9. Sample version of Odds Ratio: For cell counts { n_ij }, the sample odds ratio is
θ̂ = n_11 n_22 / (n_12 n_21). This quantity does not change when any row or any column is
multiplied by a non-zero constant. An implication of this property is that the sample odds
ratio (properly) estimates the same quantity (θ) even when the sample is
disproportionately large or small for marginal categories of any of the variables.
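The scaling invariance in point 9 can be checked numerically (hypothetical counts):

```python
# Sample odds ratio theta^ = (n11*n22)/(n12*n21), illustrating invariance
# under scaling of any row or column (hypothetical counts).

def odds_ratio(t):
    (n11, n12), (n21, n22) = t
    return (n11 * n22) / (n12 * n21)

t = [[30, 70], [10, 90]]
print(odds_ratio(t))             # 27/7 ~ 3.857

# Scale row 1 by 10 and then column 2 by 5: theta^ is unchanged.
t_scaled = [[30 * 10, 70 * 10 * 5], [10, 90 * 5]]
print(odds_ratio(t_scaled))      # same value
```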
(Eg.) A health study was conducted on whether regular aspirin intake reduces mortality from
cardiovascular disease. It was a randomized and blind 'clinical' study wherein the subjects were
randomly put under two regimens: one group (of 11037 subjects) receiving aspirin and the other
(of 11034 subjects) receiving placebo, with no one knowing which they were taking. The results
are summarized below:
 ____________________________________________
             Myocardial Infarction
              Fatal    Nonfatal    No
 Treatment    Attack   Attack      Attack
 ____________________________________________
 Aspirin         5        99       10933
 Placebo        18       171       10845
 ____________________________________________
Among the 11037 who received aspirin, the proportion who suffered heart attack is
π̂_1 = 104/11037 = 0.0094, while in the placebo group it is π̂_2 = 189/11034 = 0.0171.
The difference of proportions is 0.0171 – 0.0094 = 0.0077
The Relative Risk is 0.0094 / 0.0171 = 0.55
[There is a 45% reduction in the probability of heart attack when a person is put under aspirin
treatment]
The odds for heart attack in the aspirin group are Ω_1 = π̂_1 / (1 – π̂_1) = 0.0094/0.9906 = 0.0095 and in
the placebo group Ω_2 = π̂_2 / (1 – π̂_2) = 0.0171/0.9829 = 0.0174
The Odds Ratio is (104 X 10845) / (189 X 10933) = 0.546
[The odds for attack reduce by approximately 45% among those taking aspirin compared to
those put under placebo].
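The computations above can be reproduced directly from the table, pooling fatal and nonfatal attacks into 'attack' as in the text:

```python
# Reproduce the aspirin-study measures from the table above.
# "Attack" pools fatal and nonfatal myocardial infarctions.
aspirin_attack, aspirin_no = 5 + 99, 10933     # 104 attacks out of 11037
placebo_attack, placebo_no = 18 + 171, 10845   # 189 attacks out of 11034

p1 = aspirin_attack / (aspirin_attack + aspirin_no)   # ~ 0.0094
p2 = placebo_attack / (placebo_attack + placebo_no)   # ~ 0.0171
diff = p2 - p1                                        # ~ 0.0077
rr   = p1 / p2                                        # ~ 0.55
odds_ratio = (aspirin_attack * placebo_no) / (placebo_attack * aspirin_no)  # ~ 0.546

print(round(p1, 4), round(p2, 4), round(diff, 4))
print(round(rr, 2), round(odds_ratio, 3))
```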
Sensitivity / Specificity: In 2 x 2 tables where Y denotes the 'predicted' category of a subject
(based on a diagnostic procedure) and X denotes the 'actual' category (vis-à-vis the presence or
absence of a disease), it is interesting to measure the efficiency of the diagnostic procedure.
The Sensitivity measures the probability that the diagnosis correctly predicts the presence of
disease and Specificity measures the probability that the diagnosis correctly predicts the absence
of disease. That is, Sensitivity = P{diagnosing that disease is present |disease is actually present}
and specificity = P{ diagnosing that disease is absent | disease is actually absent}.
If 1 & 2 for Y denote positive diagnosis result & negative diagnosis result respectively, and 1 &
2 for X denote presence & absence of disease respectively, as in the following table,
 _____________________________________
           Predicted (Diagnosed) Category
                1 (yes)      2 (no)
 _____________________________________
 Actual    1 (yes)
 Category  2 (no)
 _____________________________________
then, with the notation we used earlier, we have Sensitivity = π_{1|1} and Specificity = π_{2|2}.
Note that, the cell position (1,2) in the above 2 x 2 table represents ‘false negatives’ while
position (2,1) represents ‘false positives’.
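A small sketch of these estimates from a 2 × 2 table of counts (the counts are hypothetical, not from the text):

```python
# Estimate sensitivity and specificity from a 2x2 (actual x diagnosed)
# table of counts laid out as above; the counts are hypothetical.
#            diagnosed yes   diagnosed no
table = [[90, 10],     # actually diseased: cell (1,2) = false negatives
         [30, 70]]     # actually healthy:  cell (2,1) = false positives

sensitivity = table[0][0] / sum(table[0])   # P(diagnose yes | disease)
specificity = table[1][1] / sum(table[1])   # P(diagnose no  | no disease)
false_neg_rate = 1 - sensitivity
false_pos_rate = 1 - specificity
print(sensitivity, specificity)             # 0.9 0.7
```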
Exercises:
1) An article about the PSA blood test for detecting prostate cancer stated as follows:
“The test fails to detect prostate cancer in 1 in 4 men who have the disease (false negative
results) and as many as two-thirds of the healthy men tested receive false-positive results.” Let
C: be the event of having prostate cancer, C' : be the event of not having prostate cancer
+: denote positive test result , – : denote negative test result.
Which is true: P ( – | C ) = 1/4 or P( C | – ) = 1/4 ? P (C'| + ) = 2/3 or P( + | C' ) = 2/3 ?
Determine sensitivity & specificity.
Solution: The first part of the sentence means that ‘among those with cancer disease’, 1 out of 4
get a ‘negative’ result. So, P ( – | C ) = 1/4 is true.
The 2nd part of the sentence means that 'among healthy men tested', two-thirds get a false-positive
result. A false positive means 'predicted as having the disease given that there is no disease'. Thus,
P( + | C' ) = 2/3 is true.
Sensitivity = P( + | C) = 3/4 & Specificity = P ( – | C') = 1/3.
2) A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease
status and the diagnostic test results.
Solution: In the notation introduced earlier, we have π_{1|1} = π_{2|2} = 0.8.
Equivalently, writing π_1 = π_{1|1} and π_2 = π_{1|2} = 1 – π_{2|2}, we have π_1 = 0.8, π_2 = 0.2.
The odds ratio is given by
 θ = [ π_1 / (1 − π_1) ] / [ π_2 / (1 − π_2) ] = (0.8 / 0.2) / (0.2 / 0.8) = 16
3) The following table is based on records of accidents that occurred in a state in one year.
Identify the response variable and find and interpret the difference of proportions, relative risk
and odds ratio. Why are the relative risk and odds ratio approximately equal?
 ______________________________________________
                          Injury
 Safety equipment     Fatal      Nonfatal
 No                    1601       162,527
 Yes                    510       412,368
 ______________________________________________
Answer: Clearly, the response variable is 'Injury' type. It has two categories: Fatal & Nonfatal.
Denoting the outcome 'Fatal injury' as 1 & 'Nonfatal injury' as 2, and likewise 'No equipment
use' as 1 & 'Equipment use' as 2, we have π̂_{1|1} = 1601/164128 = 0.009755 = π̂_1 and
π̂_{1|2} = 510/412878 = 0.001235 = π̂_2 (in our simplified notation).
Difference of proportions = π̂_1 – π̂_2 = 0.009755 – 0.001235 = 0.008520
We see that the proportion of fatal injuries is greater when no safety equipment is used and lower
when safety equipment is used.
Relative risk = π̂_1 / π̂_2 = 7.897 ≈ 8
We see that the probability of fatal injury when no safety equipment is used is almost 8 times the
probability of fatal injury when safety equipment is used. We also say that there is a 700%
increase in risk of fatal injury without safety equipment.
Odds ratio = [ π̂_1 / (1 − π̂_1) ] / [ π̂_2 / (1 − π̂_2) ] = 7.965 ≈ 8 [almost the same as the relative risk].
The interpretation is that the odds of fatal injury when no safety equipment is used are
almost 8 times the odds when safety equipment is used.
The odds ratio is almost the same as the relative risk because both π̂_1 & π̂_2 are close to zero.
4) A 20-year cohort study noted that the proportion of respondents per year who were affected by
lung cancer was 0.00140 for smokers and 0.00010 for nonsmokers. The proportion of
respondents who were affected by heart disease was 0.00669 for smokers and 0.00413 for
nonsmokers.
(a) Describe the association of smoking with each of lung cancer and heart disease, using
difference of proportions, relative risk and odds ratio. Interpret.
(b) Which response is more strongly related to cigarette smoking, in terms of reduction in the
number of diseased that would occur with elimination of cigarettes? Explain.
Answer: (a) First consider the response 'lung cancer'. Denoting smoking as '1' & non-smoking
as '2' and also lung cancer as '1' & no lung cancer as '2', we have π̂_1 = 0.00140 & π̂_2 = 0.00010.
Difference of proportions = π̂_1 – π̂_2 = 0.00140 – 0.00010 = 0.00130 [shows that lung cancer is
more probable for smokers].
Relative risk = π̂_1 / π̂_2 = 14
The risk of lung cancer among smokers is 14 times that among nonsmokers; equivalently, there is a
1300% increase in risk of lung cancer due to smoking.
Odds ratio = [ π̂_1 / (1 − π̂_1) ] / [ π̂_2 / (1 − π̂_2) ] ≈ 14 [almost the same as the relative risk].
Next consider the response 'heart disease'. Denoting smoking as '1' & non-smoking as '2' and
also heart disease as '1' & no heart disease as '2', we have π̂_1 = 0.00669 & π̂_2 = 0.00413.
Difference of proportions = π̂_1 – π̂_2 = 0.00669 – 0.00413 = 0.00256 [shows that heart disease
is more probable for smokers].
Relative risk = π̂_1 / π̂_2 ≈ 1.62
The risk of heart disease among smokers is about 1.6 times that among nonsmokers; equivalently,
there is roughly a 60% increase in risk of heart disease due to smoking.
Odds ratio = [ π̂_1 / (1 − π̂_1) ] / [ π̂_2 / (1 − π̂_2) ] ≈ 1.6 [almost the same as the relative risk].
(b) The relative risk (and the odds ratio) shows that 'lung cancer' is more strongly associated with
smoking. However, the reduction in the number of diseased people that would occur with
elimination of cigarettes is governed by the difference of proportions, which is larger for heart
disease (0.00256) than for lung cancer (0.00130). So, in the sense asked, elimination of cigarettes
would prevent more cases of heart disease than of lung cancer.
5) For a diagnostic test of a certain disease, π1 denotes the prob that the diagnosis is positive
given that a subject has the disease, and π2 is the prob that the diagnosis is positive given that a
subject does not have it. Let ρ be the prob that a subject does have the disease.
a] Given that the diagnosis is positive, show that the prob that a subject does have the disease is
π1 ρ / [ π1 ρ + π2 ( 1 – ρ)]
b] Suppose that a diagnostic test for a disease has both sensitivity & specificity 0.95 and
ρ = 0.005. Find the prob that a subject truly has the disease given that diagnostic test is positive.
Solution: Let + : diagnosis positive , – : diagnosis negative, D: actual disease, D ': no disease
We have P(+ | D) = π1, P(+ | D ') = π2 & P(D) = ρ
Thus, P(+) = P(+ | D) P(D) + P(+ | D ') P(D ') = π1 ρ + π2 (1 – ρ)
a] We need P( D | +) = P(+ | D) P(D) / P(+) = π1 ρ / [π1 ρ + π2 (1 – ρ)]
b] It is given that sensitivity (π1) = 0.95, specificity ( 1 – π2) = 0.95 & ρ = 0.005
We need P( D | +) = 0.95 × 0.005 / [ (0.95 × 0.005) + (0.05 × 0.995) ] = 0.087156
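Part b] can also be checked numerically:

```python
# Part b] of the exercise: P(D | +) = pi1*rho / [pi1*rho + pi2*(1 - rho)]
pi1 = 0.95         # sensitivity = P(+ | D)
pi2 = 1 - 0.95     # P(+ | D') = 1 - specificity
rho = 0.005        # prevalence P(D)

ppv = (pi1 * rho) / (pi1 * rho + pi2 * (1 - rho))
print(round(ppv, 6))    # 0.087156
```

Despite the high sensitivity and specificity, the low prevalence ρ makes the positive predictive value small, which is the point of the exercise.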
One may think of each of the (I C 2) row-pairs in combination with the (J C 2) column-pairs and
get (I C 2) × (J C 2) odds ratios, but this set of odds ratios contains much redundant information.
Method 1:
Consider the subset of (I – 1) × (J – 1) local odds ratios (using cells in adjacent rows and adjacent
columns), namely
 θ_ij = ( π_ij π_{i+1, j+1} ) / ( π_{i, j+1} π_{i+1, j} ) , i = 1, 2, …, I – 1, j = 1, 2, …, J – 1 . . . . (1.9.1)
These ratios determine all odds ratios that can be formed from any pairs of rows and pairs of
columns. For instance, let us determine the odds ratio for rows a & b (a < b) and columns c & d
(c < d) which is actually equal to (πac πbd) / (πbc πad).It may be easily seen that
 [ π_{a,c} π_{a+1,c+1} / π_{a,c+1} π_{a+1,c} ] × [ π_{a,c+1} π_{a+1,c+2} / π_{a,c+2} π_{a+1,c+1} ] × [ π_{a,c+2} π_{a+1,c+3} / π_{a,c+3} π_{a+1,c+2} ] × ··· × [ π_{a,d−1} π_{a+1,d} / π_{a,d} π_{a+1,d−1} ] = π_{a,c} π_{a+1,d} / ( π_{a+1,c} π_{a,d} ) …(1.9.2)
We see that the RHS of the above equation is obtained by multiplying the local odds ratios for
row a and columns c, c+1, …,d – 1. In the same way, by multiplying local odds ratios for each
of the rows a+1, a+2, ...,b – 1 and columns c, c+1, …, d – 1 we get
 ( π_{a+1,c} π_{a+2,d} ) / ( π_{a+2,c} π_{a+1,d} ) , ( π_{a+2,c} π_{a+3,d} ) / ( π_{a+3,c} π_{a+2,d} ) , …, ( π_{b−1,c} π_{b,d} ) / ( π_{b,c} π_{b−1,d} ) . . . . . (1.9.3)
Multiplying the RHS of (1.9.2) and all the ratios in (1.9.3), we get (πac πbd) / (πbc πad).
Thus, the (I – 1)x(J – 1) local odds ratios are enough to get the odds ratios for any row-pair &
column-pair combination.
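The telescoping argument above can be verified numerically; the 3 × 4 table of cell probabilities below is hypothetical:

```python
# Verify that products of local odds ratios give the odds ratio for
# arbitrary row/column pairs, as in (1.9.1)-(1.9.3).  The 3x4 table of
# cell probabilities is hypothetical (entries sum to 1).
pi = [[0.10, 0.05, 0.08, 0.02],
      [0.06, 0.12, 0.04, 0.08],
      [0.09, 0.03, 0.13, 0.20]]

def local_or(i, j):          # theta_ij of (1.9.1), 0-based indices
    return (pi[i][j] * pi[i+1][j+1]) / (pi[i][j+1] * pi[i+1][j])

def direct_or(a, b, c, d):   # (pi_ac pi_bd)/(pi_bc pi_ad)
    return (pi[a][c] * pi[b][d]) / (pi[b][c] * pi[a][d])

# OR for rows (0,2) and columns (1,3) as a product of local odds ratios:
prod = 1.0
for i in range(0, 2):        # rows a, ..., b-1
    for j in range(1, 3):    # columns c, ..., d-1
        prod *= local_or(i, j)

print(abs(prod - direct_or(0, 2, 1, 3)) < 1e-9)   # True
```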
Method 2:
Consider the subset of (I – 1) × (J – 1) baseline odds ratios (using cells in any row with the last row
& any column with the last column), namely
 α_ij = ( π_ij π_IJ ) / ( π_Ij π_iJ ) , i = 1, 2, …, I – 1, j = 1, 2, …, J – 1 . . . . (1.9.4)
By suitable multiplication of these baseline odds ratios we can get the odds ratios for any row-
pair & column-pair combination.
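A numerical check of this claim: the combination α_{ac} α_{bd} / (α_{ad} α_{bc}), with α taken as 1 whenever a row index is I or a column index is J, recovers (π_ac π_bd)/(π_bc π_ad) because all the last-row and last-column factors cancel. The table is hypothetical:

```python
# Verify that baseline odds ratios alpha_ij = (pi_ij pi_IJ)/(pi_Ij pi_iJ)
# recover the odds ratio for any row pair (a,b) and column pair (c,d) via
# alpha_ac * alpha_bd / (alpha_ad * alpha_bc).  Hypothetical 3x4 table.
pi = [[0.10, 0.05, 0.08, 0.02],
      [0.06, 0.12, 0.04, 0.08],
      [0.09, 0.03, 0.13, 0.20]]
I, J = 3, 4

def alpha(i, j):             # baseline OR, 0-based; last row/column -> 1
    if i == I - 1 or j == J - 1:
        return 1.0
    return (pi[i][j] * pi[I-1][J-1]) / (pi[I-1][j] * pi[i][J-1])

def direct_or(a, b, c, d):
    return (pi[a][c] * pi[b][d]) / (pi[b][c] * pi[a][d])

a, b, c, d = 0, 1, 1, 2
recovered = alpha(a, c) * alpha(b, d) / (alpha(a, d) * alpha(b, c))
print(abs(recovered - direct_or(a, b, c, d)) < 1e-9)   # True
```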
Note: Clearly, the cell probabilities determine the odds ratios. Also, given the marginal
distributions { i + } and { + j }, the odds ratios determine the cell probabilities.
We note that (I – 1)x(J – 1) cell probabilities determine everything, given the marginals. In the
same way, (I – 1)x(J – 1) local or baseline ratios determine everything, given the marginals.
Thus, (I – 1)x(J – 1) parameters can describe any association in an I x J table.
Note: Independence of X & Y is equivalent to all the local odds ratios θ_ij equaling 1, and equally to all the baseline odds ratios α_ij equaling 1.
1.10 OTHER MEASURES OF ASSOCIATION–NOMINAL VARIABLES
The most interpretable indices for nominal variables have the same structure as R² for interval
variables. These measures describe the proportional reduction in variance from the marginal
distribution of response Y to the conditional distribution of Y given any explanatory variable X.
Proportional Reduction in Variation: Let V(Y) denote any measure of variation for the
marginal distribution of Y, namely { π_+j }, and let V(Y | i) denote this measure for the
conditional distribution { π_{1|i} , π_{2|i} , . . . , π_{J|i} } of Y at the ith setting of X. A 'Measure of
Proportional Reduction in Variation' has the form
 [ V(Y) − E[ V(Y | X) ] ] / V(Y) - - - (1.10.1)
where E[ V(Y | X) ] = Σ_{i=1}^{I} V(Y | X = i) π_i+ . This varies between 0 and 1.
When the variation measure is the entropy, V(Y) = − Σ_{j=1}^{J} π_+j log(π_+j), which may also be
expressed as − Σ_{j=1}^{J} Σ_{i=1}^{I} π_ij log(π_+j).
The conditional entropy at X = i is V(Y | X = i) = − Σ_{j=1}^{J} π_{j|i} log(π_{j|i})
and, E[ V(Y | X) ] = − Σ_{i=1}^{I} π_i+ Σ_{j=1}^{J} π_{j|i} log(π_{j|i})
 = − Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij log(π_{j|i}) = − Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij log( π_ij / π_i+ )
So, V(Y) − E[ V(Y | X) ] = Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij [ log( π_ij / π_i+ ) − log(π_+j) ]
 = Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij log[ π_ij / (π_i+ π_+j) ]
Thus, the 'Proportional Reduction in Entropy', called the 'Uncertainty Coefficient', equals
 U = − [ Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij log( π_ij / (π_i+ π_+j) ) ] / [ Σ_{j=1}^{J} π_+j log(π_+j) ]
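The coefficient U can be computed directly from this formula; the 2 × 2 table below is hypothetical:

```python
# Compute the uncertainty coefficient U for a table of cell probabilities
# (hypothetical), directly from the formula above.
from math import log

pi = [[0.2, 0.1], [0.1, 0.6]]
I, J = 2, 2
row = [sum(r) for r in pi]            # pi_i+
col = [sum(c) for c in zip(*pi)]      # pi_+j

num = sum(pi[i][j] * log(pi[i][j] / (row[i] * col[j]))
          for i in range(I) for j in range(J))
den = sum(col[j] * log(col[j]) for j in range(J))
U = -num / den
print(0 <= U <= 1)     # True
```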
Note: The uncertainty coefficient is well defined when more than one π_+j > 0. Also, 0 ≤ U ≤ 1.
U = 0 corresponds to independence of X & Y.
U = 1 corresponds to 'total dependence' of Y on X.
[ That is, for each fixed X = i, Y takes only one 'j', giving π_{j|i} = 1 for that 'j', so that every
V(Y | X = i) = 0 and hence
 Numerator = V(Y) − E[ V(Y | X) ] = V(Y) = − Σ_{j=1}^{J} π_+j log(π_+j) = Denominator ]
Concentration Coefficient: When the variation measure is the Gini variation,
V(Y) = Σ_{j=1}^{J} π_+j (1 − π_+j) = 1 − Σ_{j=1}^{J} π_+j² , we have
 E[ V(Y | X) ] = Σ_{i=1}^{I} Σ_{j=1}^{J} π_{j|i} (1 − π_{j|i}) π_i+ = 1 − Σ_{i=1}^{I} Σ_{j=1}^{J} π_ij² / π_i+
The resulting 'Proportional Reduction in Variation' [ V(Y) − E[ V(Y | X) ] ] / V(Y) is called the
'Concentration Coefficient' and is denoted 'τ'.
Note: τ = 0 corresponds to independence.
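A sketch of τ from these formulas (hypothetical table):

```python
# Compute the concentration coefficient tau (Gini-variation version of the
# proportional reduction in variation) for a hypothetical table.
pi = [[0.2, 0.1], [0.1, 0.6]]
I, J = 2, 2
row = [sum(r) for r in pi]            # pi_i+
col = [sum(c) for c in zip(*pi)]      # pi_+j

V_Y = 1 - sum(c * c for c in col)                      # Gini variation of Y
E_V = 1 - sum(pi[i][j] ** 2 / row[i]
              for i in range(I) for j in range(J))     # E[V(Y|X)]
tau = (V_Y - E_V) / V_Y
print(round(tau, 4))
```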
Lambda: When the variation measure is based on the modal probability, V(Y) = 1 − max_j π_+j ,
the resulting 'Proportional Reduction in Variation' [ V(Y) − E[ V(Y | X) ] ] / V(Y) is called
'Lambda' and is denoted 'λ'.
______________________________________________________________________________