Goodman and Kruskal's Gamma
A correlation coefficient for ordinal categorical variables
An example

Effect of smoking at 45 years of age on self-reported health five years later.

Variable    Categories
Smoking     1  Never smoked
            2  Stopped smoking
            3  1-14 cigarettes/day
            4  15-24 cigarettes/day
            5  25+ cigarettes/day
SRH         1  Very good
            2  Good
            3  Fair
            4  Bad

Both variables have ordinal categories. Expected monotonic association:
increasing codes on Smoking go together with increasing codes on SRH.
Data on males from the Glostrup surveys

SMOKE45            B:HEALTH51
            Vgood   Good   Fair   Bad |  TOTAL
Never          16     73      6     1 |     96
  row%       16.7   76.0    6.3   1.0 |  100.0
No more        15     75      6     0 |     96
  row%       15.6   78.1    6.3   0.0 |  100.0
1-14           13     59      7     1 |     80
  row%       16.3   73.8    8.8   1.3 |  100.0
15-24          10     81     17     3 |    111
  row%        9.0   73.0   15.3   2.7 |  100.0
25+             1     29      3     1 |     34
  row%        2.9   85.3    8.8   2.9 |  100.0
TOTAL          55    317     39     6 |    417
  row%       13.2   76.0    9.4   1.4 |  100.0

X² = 16.2, df = 12, p = 0.182

No evidence of association, even though the expected monotonic
relationship is plain as the nose on your face.
Correlation coefficients for ordinal categorical variables

Pearson's correlation coefficient:
1) Measures linear association, which is not meaningful for ordinal variables.
2) Evaluation of significance requires normal distributions.

Rank correlations (Kendall's τ and Spearman's ρ) are more appropriate, but
require continuous data with very little risk of ties.
Goodman and Kruskal's γ for ordinal categorical data

1) Similar to Kendall's τ.
2) Related to the odds ratio for 2×2 tables.
3) Well-known asymptotic properties.
4) A partial γ coefficient measuring conditional monotonic relationships
   among ordinal variables is available.
Monotonic relationships

Y increases when X increases/decreases.

Two variables: X, Y
Probabilities: p_xy = P(X = x, Y = y)
X and Y are independent: p_xy = P(X = x) P(Y = y)

What exactly do we mean when we say that there is a monotonic relationship
between X and Y?
Concordance and discordance

Compare outcomes on (X, Y) for two stochastically independent cases
(X1, Y1) and (X2, Y2):

Concordance (C)  if X1 < X2 and Y1 < Y2, or X1 > X2 and Y1 > Y2
Discordance (D)  if X1 < X2 and Y1 > Y2, or X1 > X2 and Y1 < Y2
Tie (T)          if X1 = X2 or Y1 = Y2
Concordance = same trend in X and Y
Discordance = different trends in X and Y

Probabilities of concordance and discordance:

  p_C = Σ p_{x1 y1} p_{x2 y2}   summed over (x1, y1, x2, y2) ∈ C
  p_D = Σ p_{x1 y1} p_{x2 y2}   summed over (x1, y1, x2, y2) ∈ D

Positive relationship: p_C > p_D
Negative relationship: p_C < p_D
The gamma coefficient

A measure of the strength of the monotonic relationship:

  γ = (p_C − p_D) / (p_C + p_D)

Satisfies all conventional requirements of correlation coefficients:
  −1 ≤ γ ≤ +1
  γ = 0 if X and Y are independent
  Positive association if γ > 0
  Negative association if γ < 0

Change the order of the Y categories:
γ after recoding = −γ before recoding

Interpretation of γ

  P(C | C ∪ D) = P(C) / (P(C) + P(D))
  P(D | C ∪ D) = P(D) / (P(C) + P(D))

such that

  γ = P(C | C ∪ D) − P(D | C ∪ D)
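A small numerical check of the identity above (the joint distribution and all function names are mine, chosen for illustration): γ computed from p_C and p_D equals the difference of the two conditional probabilities, and independence forces p_C = p_D.

```python
from itertools import product

# An illustrative joint distribution p[x][y] over ordinal X (rows) and Y
# (columns); the numbers are made up, not data from the lecture.
p = [[0.20, 0.05],
     [0.10, 0.15],
     [0.05, 0.45]]

def pc_pd(p):
    """Return (p_C, p_D): probabilities of concordance and discordance for
    two independent draws (x1, y1), (x2, y2) from the joint distribution p."""
    rows, cols = len(p), len(p[0])
    cells = list(product(range(rows), range(cols)))
    pC = pD = 0.0
    for (x1, y1), (x2, y2) in product(cells, repeat=2):
        if (x1 < x2 and y1 < y2) or (x1 > x2 and y1 > y2):
            pC += p[x1][y1] * p[x2][y2]
        elif (x1 < x2 and y1 > y2) or (x1 > x2 and y1 < y2):
            pD += p[x1][y1] * p[x2][y2]
    return pC, pD

pC, pD = pc_pd(p)
gamma = (pC - pD) / (pC + pD)
# gamma is exactly the difference of the two conditional probabilities
assert abs(gamma - (pC / (pC + pD) - pD / (pC + pD))) < 1e-12

# independence => p_xy = P(X=x)P(Y=y) => p_C = p_D => gamma = 0
r_m, c_m = [0.3, 0.7], [0.2, 0.5, 0.3]
q = [[rm * cm for cm in c_m] for rm in r_m]
qC, qD = pc_pd(q)
assert abs(qC - qD) < 1e-12
```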
γ is the difference between two conditional probabilities.

Estimation of γ

Pairwise comparison of all persons in the data set:
  nC = number of concordances
  nD = number of discordances
  nT = number of ties

Relative frequencies:
  hC = nC / (nC + nD + nT)
  hD = nD / (nC + nD + nT)
  hT = nT / (nC + nD + nT)
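The pairwise comparison can be sketched directly (a minimal implementation; the data and function name are mine):

```python
def concordance_counts(xs, ys):
    """Compare all n*(n-1)/2 unordered pairs of persons and count
    concordant, discordant and tied pairs."""
    nC = nD = nT = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = (xs[i] > xs[j]) - (xs[i] < xs[j])   # sign of the X difference
            dy = (ys[i] > ys[j]) - (ys[i] < ys[j])   # sign of the Y difference
            if dx == 0 or dy == 0:
                nT += 1
            elif dx == dy:
                nC += 1
            else:
                nD += 1
    return nC, nD, nT

# made-up toy data: ordinal X and Y codes for 8 persons
xs = [1, 1, 2, 3, 3, 4, 5, 5]
ys = [1, 2, 1, 2, 3, 3, 3, 4]
nC, nD, nT = concordance_counts(xs, ys)
assert nC + nD + nT == len(xs) * (len(xs) - 1) // 2

# the ties cancel out of the estimate: (hC - hD)/(hC + hD) = (nC - nD)/(nC + nD)
total = nC + nD + nT
hC, hD = nC / total, nD / total
G = (nC - nD) / (nC + nD)
assert abs((hC - hD) / (hC + hD) - G) < 1e-12
```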
The estimate of γ

  G = (hC − hD) / (hC + hD) = (nC − nD) / (nC + nD)

A little bit of notation:

  n_xy = number of persons with X = x and Y = y,  x = 1,…,r and y = 1,…,c

  A_xy = Σ_{i<x, j<y} n_ij + Σ_{i>x, j>y} n_ij   (cells concordant with (x,y))
  D_xy = Σ_{i<x, j>y} n_ij + Σ_{i>x, j<y} n_ij   (cells discordant with (x,y))
Number of concordances and discordances (each pair is counted twice in the
sums, hence the factor ½):

  nC = ½ Σ_xy n_xy A_xy
  nD = ½ Σ_xy n_xy D_xy

The γ coefficient for 2×2 tables:

    a  b
    c  d

  nC = ad,  nD = bc
  G = (nC − nD) / (nC + nD) = (ad − bc) / (ad + bc) = (OR − 1) / (OR + 1)

where OR = ad/bc is the odds ratio, so that

  OR > 1  ⇔  γ > 0
  OR = 1  ⇔  γ = 0
  OR < 1  ⇔  γ < 0
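The cell-based counts and the odds-ratio identity can be checked in a few lines (function name and the 2×2 cell counts are mine):

```python
def conc_disc_from_table(t):
    """nC and nD from a contingency table via the A_xy / D_xy sums:
    A_xy sums the cells above-left and below-right of (x, y),
    D_xy sums the cells above-right and below-left of (x, y).
    Each pair is counted twice in the double sum, hence the division by 2."""
    r, c = len(t), len(t[0])
    nC2 = nD2 = 0
    for x in range(r):
        for y in range(c):
            A = sum(t[i][j] for i in range(r) for j in range(c)
                    if (i < x and j < y) or (i > x and j > y))
            D = sum(t[i][j] for i in range(r) for j in range(c)
                    if (i < x and j > y) or (i > x and j < y))
            nC2 += t[x][y] * A
            nD2 += t[x][y] * D
    return nC2 // 2, nD2 // 2

# 2x2 check: nC = ad, nD = bc, so G = (OR - 1) / (OR + 1)
a, b, c, d = 16, 6, 10, 17          # hypothetical cell counts
nC, nD = conc_disc_from_table([[a, b], [c, d]])
assert (nC, nD) == (a * d, b * c)
G = (nC - nD) / (nC + nD)
OR = (a * d) / (b * c)
assert abs(G - (OR - 1) / (OR + 1)) < 1e-12
```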
Gamma, odds ratio and logit values

  gamma   oddsratio   logit = LN(oddsratio)
  -1.00      0.00        -∞
  -0.90      0.05       -2.94
  -0.80      0.11       -2.20
  -0.70      0.18       -1.73
  -0.60      0.25       -1.39
  -0.50      0.33       -1.10
  -0.40      0.43       -0.85
  -0.30      0.54       -0.62
  -0.20      0.67       -0.41
  -0.10      0.82       -0.20
   0.00      1.00        0.00
   0.10      1.22        0.20
   0.20      1.50        0.41
   0.30      1.86        0.62
   0.40      2.33        0.85
   0.50      3.00        1.10
   0.60      4.00        1.39
   0.70      5.67        1.73
   0.80      9.00        2.20
   0.90     19.00        2.94
   1.00       ∞           ∞

Note: logit ≈ 2·gamma in the interval [−0.30, 0.30].
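The whole table follows from inverting γ = (OR − 1)/(OR + 1), i.e. OR = (1 + γ)/(1 − γ); a short sketch regenerating it (assuming nothing beyond that relation):

```python
import math

# OR = (1 + gamma) / (1 - gamma) inverts gamma = (OR - 1) / (OR + 1);
# logit = ln(OR).  gamma = +/-1 is excluded (OR would be 0 or infinite).
rows = []
for g10 in range(-9, 10):
    g = g10 / 10
    OR = (1 + g) / (1 - g)
    rows.append((g, round(OR, 2), round(math.log(OR), 2)))

# spot checks against the tabulated values
assert rows[-1] == (0.9, 19.0, 2.94)
assert rows[10] == (0.1, 1.22, 0.2)

# logit is roughly 2*gamma when |gamma| <= 0.30
assert all(abs(math.log((1 + k / 10) / (1 - k / 10)) - 2 * k / 10) < 0.02
           for k in range(-3, 4))
```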
Properties of the estimate of γ

The estimate is consistent and asymptotically normally distributed with
standard error s1 given by

  s1² = 4 / (nC + nD)^4 · Σ_xy n_xy (nD·A_xy − nC·D_xy)²

If X and Y are independent, then the standard error, s0, of G is given by

  s0² = 1 / (nC + nD)² · [ Σ_xy n_xy (A_xy − D_xy)² − 4(nC − nD)²/n ]
Statistical inference

95% confidence interval:  G ± 1.96·s1

Test of significance: if X and Y are independent, then

  Z = G / s0 ~ N(0, 1)

Notice that confidence intervals and assessment of significance use
different estimates of the standard error.
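Putting the pieces together for the smoking table (the function name is mine; the standard-error formulas are the ones above, written for unordered pair counts):

```python
import math

def gamma_inference(t):
    """G with its two standard errors: s1 for confidence intervals and
    s0 for the test of independence."""
    r, c = len(t), len(t[0])
    N = sum(sum(row) for row in t)
    A = [[sum(t[i][j] for i in range(r) for j in range(c)
              if (i < x and j < y) or (i > x and j > y)) for y in range(c)]
         for x in range(r)]
    D = [[sum(t[i][j] for i in range(r) for j in range(c)
              if (i < x and j > y) or (i > x and j < y)) for y in range(c)]
         for x in range(r)]
    nC = sum(t[x][y] * A[x][y] for x in range(r) for y in range(c)) // 2
    nD = sum(t[x][y] * D[x][y] for x in range(r) for y in range(c)) // 2
    G = (nC - nD) / (nC + nD)
    s1 = math.sqrt(4 * sum(t[x][y] * (nD * A[x][y] - nC * D[x][y]) ** 2
                           for x in range(r) for y in range(c))) / (nC + nD) ** 2
    s0 = math.sqrt(sum(t[x][y] * (A[x][y] - D[x][y]) ** 2
                       for x in range(r) for y in range(c))
                   - 4 * (nC - nD) ** 2 / N) / (nC + nD)
    return G, s1, s0

# The SMOKE45 x HEALTH51 table from the lecture (rows Never ... 25+);
# the lecture reports G = 0.24, p < 0.0005 for these data.
t = [[16, 73, 6, 1], [15, 75, 6, 0], [13, 59, 7, 1],
     [10, 81, 17, 3], [1, 29, 3, 1]]
G, s1, s0 = gamma_inference(t)
ci = (G - 1.96 * s1, G + 1.96 * s1)   # 95% confidence interval
z = G / s0                            # compare to N(0, 1) under independence
```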
The example

Smoking at 45 by self-reported health five years later (the table above):

  X² = 16.2, df = 12, p = 0.182
  Gam = 0.24, p < 0.0005

Very strong evidence of an effect of smoking on health.

For ordinal variables, γ is much more powerful than X²-distributed test
statistics.
Exact conditional inference

The problem: can the distributions of estimates and test statistics be
approximated by asymptotic distributions in small and moderate samples?

The small number of persons with bad health would result in warnings from
most statistical programs that the asymptotics probably do not work.

If in doubt, use exact conditional tests instead of asymptotic tests.
The hypergeometric distribution

The contingency table: n_xy, x = 1,…,r and y = 1,…,c

The margins of the table:
  n_{x+} = Σ_y n_xy
  n_{+y} = Σ_x n_xy
  n      = Σ_xy n_xy

The probability of the table:

  P(n_11, …, n_rc) = n! / Π_xy n_xy! · Π_xy p_xy^{n_xy}
H0: p_xy = p_{x+} p_{+y}

  P(n_11, …, n_rc) = n! / Π_xy n_xy! · Π_x p_{x+}^{n_{x+}} · Π_y p_{+y}^{n_{+y}}

The marginal tables, n_{x+} and n_{+y}, are sufficient under H0:

  P(n_11, …, n_rc | n_{1+}, …, n_{r+}, n_{+1}, …, n_{+c})
      = Π_x n_{x+}! · Π_y n_{+y}! / ( n! · Π_xy n_xy! )

does not depend on unknown parameters.
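The conditional probability is easy to evaluate directly; a sketch (function name mine), checked against the ordinary hypergeometric probability in the 2×2 case:

```python
from math import comb, factorial, prod

def conditional_prob(t):
    """P(table | margins) = prod n_{x+}! * prod n_{+y}! / (n! * prod n_xy!)"""
    row_m = [sum(r) for r in t]
    col_m = [sum(c) for c in zip(*t)]
    n = sum(row_m)
    num = prod(factorial(m) for m in row_m) * prod(factorial(m) for m in col_m)
    den = factorial(n) * prod(factorial(v) for r in t for v in r)
    return num / den

# 2x2 special case: this is the ordinary hypergeometric probability
t = [[3, 0], [1, 3]]
p = conditional_prob(t)
assert abs(p - comb(4, 3) * comb(3, 0) / comb(7, 4)) < 1e-12   # = 4/35

# the conditional probabilities of all tables with these margins sum to 1
total = sum(conditional_prob([[x, 3 - x], [4 - x, x]]) for x in range(4))
assert abs(total - 1.0) < 1e-12
```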
The exact conditional test procedure (1)

Find all tables with the same marginal tables as the observed table.
For each of these tables calculate:
  - the conditional probability of the table
  - the test statistics of interest

The exact p-value = the sum of probabilities of tables with test statistics
that are at least as extreme as the test statistic of the observed table.
The exact conditional test procedure (2)

Test statistic T(M), where M is an r×c table.
Observed test statistic: t_obs.

The exact p-value:

  p_exact = Σ P(M | n_{1+}, …, n_{r+}, n_{+1}, …, n_{+c})

summed over all M with m_{x+} = n_{x+}, m_{+y} = n_{+y} and T(M) ≥ t_obs.

For 2×2 tables this is Fisher's exact test. The procedure is also
appropriate for r×c tables, but may be very time consuming due to the
number of tables fitting the margins.
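For a 2×2 table the enumeration is trivial, since the table is determined by one cell; a one-sided sketch (function name mine, "extreme" taken as a large (1,1) cell):

```python
from math import comb

def fisher_exact_onesided(a, b, c, d):
    """Enumerate all 2x2 tables with the observed margins and sum the
    hypergeometric probabilities of tables whose (1,1) cell is at least
    as large as the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):                       # P(cell11 = x | margins)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    hi = min(r1, c1)                   # largest feasible (1,1) cell
    return sum(prob(x) for x in range(a, hi + 1))

p = fisher_exact_onesided(3, 0, 1, 3)
assert abs(p - 4 / 35) < 1e-12         # only one table is as extreme
```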
The Monte Carlo test

Since the conditional probabilities are known exactly, we may ask the
computer to generate a random sample consisting of a large number of
independent tables from this distribution.

The MC test procedure:
  Generate tables M1, …, M_Nsim.
  Calculate the test statistic for each table: Ti = T(Mi), i = 1, …, Nsim.
  Count the number of random test statistics which are at least as extreme
  as the observed statistic:

    S = Σ_{i=1..Nsim} 1(Ti ≥ t_obs)

  p_MC = S / Nsim is an unbiased estimate of p_exact.
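One convenient way to draw tables from the conditional distribution is to permute the Y values: a permutation preserves both margins, and under H0 every assignment of the observed Y values to the observed X values is equally likely. A sketch with γ as the test statistic (function names, data and Nsim are mine):

```python
import random

def mc_pvalue(xs, ys, stat, nsim=2000, seed=1):
    """Monte Carlo conditional test: shuffling ys preserves both margins,
    so the permuted data are draws from the conditional null distribution.
    Returns S / nsim, the unbiased estimate of p_exact."""
    rng = random.Random(seed)
    t_obs = stat(xs, ys)
    ys = list(ys)                      # work on a copy
    S = 0
    for _ in range(nsim):
        rng.shuffle(ys)
        if stat(xs, ys) >= t_obs:
            S += 1
    return S / nsim

def gamma_stat(xs, ys):
    nC = nD = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx = (xs[i] > xs[j]) - (xs[i] < xs[j])
            dy = (ys[i] > ys[j]) - (ys[i] < ys[j])
            nC += dx * dy > 0
            nD += dx * dy < 0
    return (nC - nD) / (nC + nD) if nC + nD else 0.0

# made-up, strongly concordant data -> a small one-sided p-value
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 3, 3, 4, 4]
p = mc_pvalue(xs, ys, gamma_stat)
assert 0.0 <= p <= 1.0
```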
Sequential Monte Carlo tests

The standard error of p_MC depends on Nsim.

A sequential Monte Carlo test interrupts the Monte Carlo procedure when it
becomes obvious that the test statistic will not be significant.

Example: Nsim = 10,000, critical level of the test = 5%. The sequential
Monte Carlo test interrupts the procedure when the number of tables with
T(Mi) ≥ t_obs reaches 501, since then

  p_MC ≥ 501/10,000 > 0.05

no matter how the remaining simulations turn out.
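The stopping rule itself is a few lines (a sketch; the function name and the stream-of-statistics interface are mine):

```python
from itertools import islice

def sequential_mc(stats, t_obs, nsim=10_000, alpha=0.05):
    """Interrupt the MC test as soon as significance is impossible:
    once S = #{T_i >= t_obs} reaches floor(alpha * nsim) + 1 (= 501 for
    nsim = 10,000, alpha = 5%), p_MC > alpha is guaranteed.
    Returns (p-value estimate, number of simulations used)."""
    s_stop = int(alpha * nsim) + 1
    S = 0
    i = 0
    for i, t in enumerate(islice(stats, nsim), start=1):
        S += t >= t_obs
        if S == s_stop:
            return S / nsim, i         # interrupted: cannot be significant
    return S / nsim, i

# a stream where every simulated statistic exceeds t_obs: stop at 501
p, used = sequential_mc(iter([0.0] * 10_000), t_obs=-1.0)
assert (p, used) == (0.0501, 501)
```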
Repeated Monte Carlo tests

The repeated Monte Carlo test interrupts the Monte Carlo procedure when the
"risk" of a significant p_MC-value has become very small.

Parameters of the repeated Monte Carlo test:
  Nsim   = the total number of tables to be generated
  Nstart = the minimum number of tables to be generated
  Critical value
  Max risk of stopping too soon (default = 1%)
Smoking and self-reported health

Same table of SMOKE45 by B:HEALTH51 as above:

  X² = 16.2, df = 12, p = 0.182
  Gam = 0.24, p = 0.000

Confounding?
Analysis of the conditional association given self-reported health
at 45 years

HEALTH45 = Very good

SMOKE45            B:HEALTH51
           V.good   Good   Fair   Bad |  TOTAL
Never           4     12      0     0 |     16
  row%       25.0   75.0    0.0   0.0 |  100.0
No more         5      7      0     0 |     12
  row%       41.7   58.3    0.0   0.0 |  100.0
1-14            9      6      0     0 |     15
  row%       60.0   40.0    0.0   0.0 |  100.0
15-24           2      3      0     0 |      5
  row%       40.0   60.0    0.0   0.0 |  100.0
25+             0      3      0     0 |      3
  row%        0.0  100.0    0.0   0.0 |  100.0
TOTAL          20     31      0     0 |     51
  row%       39.2   60.8    0.0   0.0 |  100.0

X² = 6.04, df = 4, p = 0.196
Gam = -0.18, p = 0.188
HEALTH45 = Good

SMOKE45            B:HEALTH51
           V.good   Good   Fair   Bad |  TOTAL
Never          11     55      5     1 |     72
  row%       15.3   76.4    6.9   1.4 |  100.0
No more        10     59      5     0 |     74
  row%       13.5   79.7    6.8   0.0 |  100.0
1-14            3     50      4     1 |     58
  row%        5.2   86.2    6.9   1.7 |  100.0
15-24           6     76      8     1 |     91
  row%        6.6   83.5    8.8   1.1 |  100.0
25+             1     25      1     0 |     27
  row%        3.7   92.6    3.7   0.0 |  100.0
TOTAL          31    265     23     3 |    322
  row%        9.6   82.3    7.1   0.9 |  100.0

X² = 9.77, df = 12, p = 0.636
Gam = 0.17, p = 0.041

HEALTH45 = Fair

SMOKE45            B:HEALTH51
           V.good   Good   Fair   Bad |  TOTAL
Never           1      6      1     0 |      8
  row%       12.5   75.0   12.5   0.0 |  100.0
No more         0      6      1     0 |      7
  row%        0.0   85.7   14.3   0.0 |  100.0
1-14            1      3      3     0 |      7
  row%       14.3   42.9   42.9   0.0 |  100.0
15-24           2      1      6     2 |     11
  row%       18.2    9.1   54.5  18.2 |  100.0
25+             0      1      2     1 |      4
  row%        0.0   25.0   50.0  25.0 |  100.0
TOTAL           4     17     13     3 |     37
  row%       10.8   45.9   35.1   8.1 |  100.0

X² = 17.52, df = 12, p = 0.131
Gam = 0.52, p = 0.001
HEALTH45 = Bad

SMOKE45            B:HEALTH51
           V.good   Good   Fair   Bad |  TOTAL
Never           0      0      0     0 |      0
No more         0      3      0     0 |      3
  row%        0.0  100.0    0.0   0.0 |  100.0
1-14            0      0      0     0 |      0
15-24           0      1      3     0 |      4
  row%        0.0   25.0   75.0   0.0 |  100.0
25+             0      0      0     0 |      0
TOTAL           0      4      3     0 |      7
  row%        0.0   57.1   42.9   0.0 |  100.0

X² = 3.94, df = 1, p = 0.047
Gam = 1.00, p = 0.003

** Local test results for strata defined by HEALTH45 (G) **

                              p-values            p-values (1-sided)
G: HEALTH45      X²   df   asympt    exact    Gamma   asympt    exact
1: V.good      6.04    4   0.1960   0.2050    -0.18   0.1884   0.1880
2: Good        9.77   12   0.6358   0.6310     0.17   0.0410   0.0290
3: Fair       17.52   12   0.1311   0.1430     0.52   0.0006   0.0010
4: Bad         3.94    1   0.0472   0.1540     1.00   0.0030   0.1220
Tests of conditional independence

H0: P(X, Y | Z = z) = P(X | Z = z) P(Y | Z = z) for all z

Count concordances and discordances separately in each stratum of Z:

  Z = 1:  N1C and N1D,   γ1 = (N1C − N1D) / (N1C + N1D)
  Z = 2:  N2C and N2D,   γ2 = (N2C − N2D) / (N2C + N2D)
  …
  Z = k:  NkC and NkD,   γk = (NkC − NkD) / (NkC + NkD)

Under conditional independence, all stratum-specific test statistics must
be insignificant.

Global tests of conditional independence:

  global X² = Σ_z X²_z,   df = Σ_z df_z
The partial γ coefficient

  partial γ = (Σ_z NzC − Σ_z NzD) / (Σ_z NzC + Σ_z NzD)

The partial γ is a weighted mean of the stratum-specific coefficients,

  partial γ = Σ_z w_z γ_z,   where w_z = (NzC + NzD) / Σ_i (NiC + NiD)

Asymptotic normal distribution. Monte Carlo approximation of exact
conditional p-values as for two-way tables.
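The equivalence of the two expressions for the partial γ can be verified directly (function name and the stratum counts are mine):

```python
def partial_gamma(strata_counts):
    """strata_counts: list of (NzC, NzD) pairs, one per stratum.
    partial gamma = (sum NzC - sum NzD) / (sum NzC + sum NzD)."""
    NC = sum(c for c, d in strata_counts)
    ND = sum(d for c, d in strata_counts)
    return (NC - ND) / (NC + ND)

# hypothetical concordance/discordance counts for two strata
strata = [(40, 20), (100, 60)]
g = partial_gamma(strata)

# identical to the weighted mean of the local gammas with
# w_z = (NzC + NzD) / sum_i (NiC + NiD)
tot = sum(c + d for c, d in strata)
wmean = sum((c + d) / tot * (c - d) / (c + d) for c, d in strata)
assert abs(g - wmean) < 1e-12
```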
** Local test results for strata defined by HEALTH45 (G) **

(the stratum-specific X² and Gamma results shown above)

Global X² = Σ_z X²_z = 37.3,  df = 4 + 12 + 12 + 1 = 29,
  p = 0.139,  p_exact = 0.148

partial γ = 0.17,  p = 0.034,  p_exact = 0.027
Are the local γ coefficients homogeneous?

Least squares estimate: Gamma = 0.1998, s.e. = 0.0777

G: HEALTH45    Gamma     s.e.
1: V.good      -0.18    0.152
2: Good         0.17    0.0987
3: Fair         0.52    0.1625
4: Bad          1.00    standard error is not available

Incomplete set of Gammas.

Test for partial association: X² = 7.5, df = 2, p = 0.023

Pairwise comparisons of strata:
Significant difference between strata 1+2 and 3, p = 0.025

Notice the similarity of the analysis of γ coefficients and the
Mantel-Haenszel analysis of odds ratios.