CROSSTABS
The notation and statistics refer to bivariate subtables defined by a row variable X
and a column variable Y, unless specified otherwise. By default, CROSSTABS
deletes cases with missing values on a table-by-table basis.
Notation

The following notation is used throughout this chapter unless otherwise stated:

$$f_{ij} \quad \text{the observed count of cell } (i, j)$$

$$c_j = \sum_{i=1}^{R} f_{ij}\,, \quad \text{the } j\text{th column subtotal}$$

$$r_i = \sum_{j=1}^{C} f_{ij}\,, \quad \text{the } i\text{th row subtotal}$$

$$W = \sum_{j=1}^{C} c_j = \sum_{i=1}^{R} r_i\,, \quad \text{the grand total}$$

Count

$$\text{count} = f_{ij}$$
Expected Count

$$E_{ij} = \frac{r_i c_j}{W}$$
Row Percent

$$\text{row percent} = 100 \times f_{ij} / r_i$$

Column Percent

$$\text{column percent} = 100 \times f_{ij} / c_j$$

Total Percent

$$\text{total percent} = 100 \times f_{ij} / W$$
Residual

$$R_{ij} = f_{ij} - E_{ij}$$
Standardized Residual

$$SR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}$$
Adjusted Residual

$$AR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1 - \dfrac{r_i}{W}\right)\left(1 - \dfrac{c_j}{W}\right)}}$$
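As a quick numeric sketch of the counts and residuals above (the 2 × 3 table here is invented for illustration, not taken from the manual):

```python
import math

# Invented 2x3 table of observed counts f[i][j], used only to illustrate
# the expected counts and the three residual types defined above.
f = [[10, 20, 30],
     [20, 30, 40]]
R, C = len(f), len(f[0])
r = [sum(row) for row in f]                                  # row subtotals r_i
c = [sum(f[i][j] for i in range(R)) for j in range(C)]       # column subtotals c_j
W = sum(r)                                                   # grand total

E = [[r[i] * c[j] / W for j in range(C)] for i in range(R)]  # expected counts E_ij
res = [[f[i][j] - E[i][j] for j in range(C)] for i in range(R)]              # R_ij
sr = [[res[i][j] / math.sqrt(E[i][j]) for j in range(C)] for i in range(R)]  # SR_ij
ar = [[res[i][j] / math.sqrt(E[i][j] * (1 - r[i] / W) * (1 - c[j] / W))
       for j in range(C)] for i in range(R)]                                 # AR_ij
```

Because the expected counts reproduce the row and column margins, the residuals in each row (and each column) sum to zero.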
Chi-Square Statistics

Pearson's Chi-Square

$$\chi_P^2 = \sum_{i,j}\frac{\left(f_{ij} - E_{ij}\right)^2}{E_{ij}}$$

Likelihood-Ratio Chi-Square

$$\chi_{LR}^2 = 2\sum_{i,j} f_{ij}\ln\!\left(f_{ij}/E_{ij}\right)$$
If the table is a 2 × 2 table, not resulting from a larger table with missing cells,
with at least one expected cell count less than 5, then the Fisher exact test is
calculated. See Appendix 5 for details.
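A minimal sketch of the two chi-square statistics (the 2 × 2 counts below are invented):

```python
import math

# Invented 2x2 table; E_ij = r_i c_j / W as defined earlier.
f = [[15, 5],
     [10, 20]]
R, C = len(f), len(f[0])
r = [sum(row) for row in f]
c = [sum(f[i][j] for i in range(R)) for j in range(C)]
W = sum(r)
E = [[r[i] * c[j] / W for j in range(C)] for i in range(R)]

# Pearson chi-square
chi2_p = sum((f[i][j] - E[i][j]) ** 2 / E[i][j] for i in range(R) for j in range(C))

# Likelihood-ratio chi-square; cells with f_ij = 0 contribute nothing.
chi2_lr = 2 * sum(f[i][j] * math.log(f[i][j] / E[i][j])
                  for i in range(R) for j in range(C) if f[i][j] > 0)
```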
Yates Continuity-Corrected Chi-Square

$$\chi_c^2 = \begin{cases} \dfrac{W\left(\left|f_{11}f_{22} - f_{12}f_{21}\right| - 0.5\,W\right)^2}{r_1 r_2 c_1 c_2} & \text{if } \left|f_{11}f_{22} - f_{12}f_{21}\right| > 0.5\,W \\[2ex] 0 & \text{otherwise} \end{cases}$$
Mantel-Haenszel Chi-Square

$$\chi_{MH}^2 = \left(W - 1\right) r^2$$

where r is the Pearson correlation coefficient, to be defined later. The degrees of freedom are 1.
Phi Coefficient

$$\varphi = \sqrt{\frac{\chi_P^2}{W}}$$

For a 2 × 2 table only, $\varphi$ is set equal to the Pearson correlation coefficient so that the sign of $\varphi$ matches that of the correlation coefficient.
Coefficient of Contingency

$$CC = \left(\frac{\chi_P^2}{\chi_P^2 + W}\right)^{1/2}$$
Cramér's V

$$V = \left(\frac{\chi_P^2}{W\left(q - 1\right)}\right)^{1/2}$$

where $q = \min\{R, C\}$.
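The chi-square-based measures can be sketched directly from $\chi_P^2$ (the numbers below are placeholders, not from the manual):

```python
import math

# Placeholder values for an R x C table with Pearson chi-square chi2_p.
chi2_p, W, R, C = 8.5, 100, 3, 4
q = min(R, C)

phi = math.sqrt(chi2_p / W)              # phi coefficient
cc = math.sqrt(chi2_p / (chi2_p + W))    # coefficient of contingency
v = math.sqrt(chi2_p / (W * (q - 1)))    # Cramer's V
```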
Lambda

Let $f_{im}$ be the largest cell count in row $i$ and $f_{mj}$ the largest cell count in column $j$. Also, let $r_m$ be the largest row subtotal and $c_m$ the largest column subtotal. Define $\lambda_{Y|X}$ as the proportion of relative error in predicting an individual's $Y$ category that can be eliminated by knowledge of the $X$ category. $\lambda_{Y|X}$ is computed as

$$\lambda_{Y|X} = \frac{\displaystyle\sum_{i=1}^{R} f_{im} - c_m}{W - c_m}$$

Its standard errors are

$$ASE_0 = \frac{\sqrt{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta_{ij} - \delta_j\right)^2 - \frac{1}{W}\left(\sum_{i=1}^{R} f_{im} - c_m\right)^2}}{W - c_m}$$

$$ASE_1 = \frac{\sqrt{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta_{ij} - \delta_j + \lambda_{Y|X}\,\delta_j\right)^2 - W\lambda_{Y|X}^2}}{W - c_m}$$

where

$$\delta_{ij} = \begin{cases} 1 & \text{if } j \text{ is the column index for } f_{im} \\ 0 & \text{otherwise} \end{cases} \qquad \delta_j = \begin{cases} 1 & \text{if } j \text{ is the index for } c_m \\ 0 & \text{otherwise} \end{cases}$$

The symmetric version of lambda is

$$\lambda = \frac{\displaystyle\sum_{i=1}^{R} f_{im} + \sum_{j=1}^{C} f_{mj} - c_m - r_m}{2W - r_m - c_m}$$
Its standard errors are

$$ASE_0 = \frac{\sqrt{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta_{ij}^{\,r} + \delta_{ij}^{\,c} - \delta_i^{\,r} - \delta_j^{\,c}\right)^2 - \frac{1}{W}\left(\sum_{i=1}^{R} f_{im} + \sum_{j=1}^{C} f_{mj} - c_m - r_m\right)^2}}{2W - r_m - c_m}$$

$$ASE_1 = \frac{\sqrt{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta_{ij}^{\,r} + \delta_{ij}^{\,c} - \delta_i^{\,r} - \delta_j^{\,c} + \lambda\left(\delta_i^{\,r} + \delta_j^{\,c}\right)\right)^2 - 4W\lambda^2}}{2W - r_m - c_m}$$

where

$$\delta_{ij}^{\,r} = \begin{cases} 1 & \text{if } i \text{ is the row index for } f_{mj} \\ 0 & \text{otherwise} \end{cases} \qquad \delta_i^{\,r} = \begin{cases} 1 & \text{if } i \text{ is the index for } r_m \\ 0 & \text{otherwise} \end{cases}$$

and where

$$\delta_{ij}^{\,c} = \begin{cases} 1 & \text{if } j \text{ is the column index for } f_{im} \\ 0 & \text{otherwise} \end{cases} \qquad \delta_j^{\,c} = \begin{cases} 1 & \text{if } j \text{ is the index for } c_m \\ 0 & \text{otherwise} \end{cases}$$
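A sketch of the point estimates for both lambdas (the 3 × 3 counts are invented; the standard errors are omitted):

```python
# Invented 3x3 table.
f = [[30, 10, 5],
     [5, 25, 10],
     [10, 5, 20]]
R, C = len(f), len(f[0])
r = [sum(row) for row in f]
c = [sum(f[i][j] for i in range(R)) for j in range(C)]
W = sum(r)

f_im = [max(row) for row in f]                             # largest count in row i
f_mj = [max(f[i][j] for i in range(R)) for j in range(C)]  # largest count in column j
r_m, c_m = max(r), max(c)                                  # largest subtotals

lam_y_given_x = (sum(f_im) - c_m) / (W - c_m)
lam_sym = (sum(f_im) + sum(f_mj) - c_m - r_m) / (2 * W - r_m - c_m)
```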
Goodman and Kruskal's Tau

Goodman and Kruskal's tau with $Y$ dependent is

$$\tau_{Y|X} = \frac{v}{\delta}$$

in which

$$\delta = W^2 - \sum_{j=1}^{C} c_j^2 \qquad \text{and} \qquad v = W\sum_{i,j} f_{ij}^2/r_i - \sum_{j=1}^{C} c_j^2$$

with standard error

$$ASE_1 = \frac{2}{\delta^2}\sqrt{\sum_{i,j} f_{ij}\left\{\left(v - \delta\right)\left(\frac{1}{r_i}\sum_{j=1}^{C} f_{ij} c_j - c_j\right) - \frac{W\delta}{r_i^2}\left(\sum_{j=1}^{C} f_{ij}^2 - r_i f_{ij}\right)\right\}^2}$$
$\tau_{X|Y}$ and its standard error can be obtained by interchanging the roles of $X$ and $Y$.
Uncertainty Coefficient

The uncertainty coefficient with $Y$ dependent is

$$U_{Y|X} = \frac{U(X) + U(Y) - U(XY)}{U(Y)}$$

where

$$U(X) = -\sum_{i=1}^{R}\frac{r_i}{W}\ln\frac{r_i}{W}$$

$$U(Y) = -\sum_{j=1}^{C}\frac{c_j}{W}\ln\frac{c_j}{W}$$

and

$$U(XY) = -\sum_{i,j}\frac{f_{ij}}{W}\ln\frac{f_{ij}}{W}\,, \quad \text{for } f_{ij} > 0$$
The asymptotic standard errors are

$$ASE_1 = \frac{1}{W\,U(Y)^2}\sqrt{\sum_{i,j} f_{ij}\left\{U(Y)\ln\frac{f_{ij}}{r_i} + \left[U(X) - U(XY)\right]\ln\frac{c_j}{W}\right\}^2}$$

$$ASE_0 = \frac{\sqrt{P - W\left[U(X) + U(Y) - U(XY)\right]^2}}{W\,U(Y)}$$

where

$$P = \sum_{i,j} f_{ij}\left(\ln\frac{r_i c_j}{W f_{ij}}\right)^2$$
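A sketch of the entropy terms and the asymmetric coefficient (counts invented; standard errors omitted):

```python
import math

# Invented 2x2 table.
f = [[20, 5],
     [5, 20]]
R, C = len(f), len(f[0])
r = [sum(row) for row in f]
c = [sum(f[i][j] for i in range(R)) for j in range(C)]
W = sum(r)

U_X = -sum(ri / W * math.log(ri / W) for ri in r)
U_Y = -sum(cj / W * math.log(cj / W) for cj in c)
U_XY = -sum(f[i][j] / W * math.log(f[i][j] / W)
            for i in range(R) for j in range(C) if f[i][j] > 0)

u_y_given_x = (U_X + U_Y - U_XY) / U_Y
```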
The symmetric version of the uncertainty coefficient is

$$U = 2\left[\frac{U(X) + U(Y) - U(XY)}{U(X) + U(Y)}\right]$$

with asymptotic standard errors

$$ASE_1 = \frac{2}{W\left[U(X) + U(Y)\right]^2}\sqrt{\sum_{i,j} f_{ij}\left\{U(XY)\ln\frac{r_i c_j}{W^2} - \left[U(X) + U(Y)\right]\ln\frac{f_{ij}}{W}\right\}^2}$$

$$ASE_0 = \frac{2\sqrt{P - W\left[U(X) + U(Y) - U(XY)\right]^2}}{W\left[U(X) + U(Y)\right]}$$

where $P$ is as defined above.
Cohen's Kappa

Cohen's kappa $(\kappa)$, defined only for square tables $(R = C)$, is computed as

$$\kappa = \frac{W\displaystyle\sum_{i=1}^{R} f_{ii} - \sum_{i=1}^{R} r_i c_i}{W^2 - \displaystyle\sum_{i=1}^{R} r_i c_i}$$
with variance

$$\text{var}_1 = W\left\{\frac{\displaystyle\sum_i f_{ii}\left(W - \sum_i f_{ii}\right)}{\left(W^2 - \sum_i r_i c_i\right)^2} + \frac{2\left(W - \displaystyle\sum_i f_{ii}\right)\left(2\displaystyle\sum_i f_{ii}\sum_i r_i c_i - W\sum_i f_{ii}\left(r_i + c_i\right)\right)}{\left(W^2 - \sum_i r_i c_i\right)^3} + \frac{\left(W - \displaystyle\sum_i f_{ii}\right)^2\left(W\displaystyle\sum_{i,j} f_{ij}\left(r_j + c_i\right)^2 - 4\left(\sum_i r_i c_i\right)^2\right)}{\left(W^2 - \sum_i r_i c_i\right)^4}\right\}$$

and, for testing the null hypothesis $\kappa = 0$,

$$\text{var}_0 = \frac{W^2\displaystyle\sum_i r_i c_i + \left(\sum_i r_i c_i\right)^2 - W\displaystyle\sum_i r_i c_i\left(r_i + c_i\right)}{W\left(W^2 - \displaystyle\sum_i r_i c_i\right)^2}$$
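A sketch of the point estimate (the square agreement table is invented; the variances are omitted):

```python
# Invented 3x3 agreement table.
f = [[40, 5, 5],
     [5, 30, 5],
     [0, 5, 25]]
R = len(f)
r = [sum(row) for row in f]
c = [sum(f[i][j] for i in range(R)) for j in range(R)]
W = sum(r)

sum_diag = sum(f[i][i] for i in range(R))    # observed agreement
sum_rc = sum(r[i] * c[i] for i in range(R))  # chance agreement term
kappa = (W * sum_diag - sum_rc) / (W ** 2 - sum_rc)
```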
For the measures of ordinal association below, define

$$D_r = W^2 - \sum_{i=1}^{R} r_i^2$$

$$D_c = W^2 - \sum_{j=1}^{C} c_j^2$$

$$C_{ij} = \sum_{h<i}\sum_{k<j} f_{hk} + \sum_{h>i}\sum_{k>j} f_{hk}$$

$$D_{ij} = \sum_{h<i}\sum_{k>j} f_{hk} + \sum_{h>i}\sum_{k<j} f_{hk}$$

$$P = \sum_{i,j} f_{ij} C_{ij}$$

$$Q = \sum_{i,j} f_{ij} D_{ij}$$
Note: the P and Q listed above are double the “usual” P (number of concordant
pairs) and Q (number of discordant pairs). Likewise, Dr is double the “usual”
P + Q + X 0 (the number of concordant pairs, discordant pairs, and pairs on which
the row variable is tied) and Dc is double the “usual” P + Q + Y0 (the number of
concordant pairs, discordant pairs, and pairs on which the column variable is tied).
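The doubling described in the note can be checked directly on a small invented table:

```python
# Invented 2x2 table.
f = [[3, 1],
     [2, 4]]
R, C = len(f), len(f[0])

def C_ij(i, j):
    # counts in cells concordant with cell (i, j)
    return (sum(f[h][k] for h in range(i) for k in range(j)) +
            sum(f[h][k] for h in range(i + 1, R) for k in range(j + 1, C)))

def D_ij(i, j):
    # counts in cells discordant with cell (i, j)
    return (sum(f[h][k] for h in range(i) for k in range(j + 1, C)) +
            sum(f[h][k] for h in range(i + 1, R) for k in range(j)))

P = sum(f[i][j] * C_ij(i, j) for i in range(R) for j in range(C))
Q = sum(f[i][j] * D_ij(i, j) for i in range(R) for j in range(C))

# Direct pair count: a concordant pair takes one observation from cell (i, j)
# and one from a cell strictly below and to the right.
concordant = sum(f[i][j] * f[h][k]
                 for i in range(R) for j in range(C)
                 for h in range(i + 1, R) for k in range(j + 1, C))
```

For this table P = 24 and Q = 4, exactly twice the 12 concordant and 2 discordant pairs.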
Kendall's Tau-b

$$\tau_b = \frac{P - Q}{\sqrt{D_r D_c}}$$

with standard errors

$$ASE_1 = \frac{1}{D_r D_c}\sqrt{\sum_{i,j} f_{ij}\left(2\sqrt{D_r D_c}\left(C_{ij} - D_{ij}\right) + \tau_b v_{ij}\right)^2 - W^3\tau_b^2\left(D_r + D_c\right)^2}$$

where

$$v_{ij} = r_i D_c + c_j D_r$$

and

$$ASE_0 = \frac{2}{\sqrt{D_r D_c}}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}\left(P - Q\right)^2}$$
Kendall's Tau-c

$$\tau_c = \frac{q\left(P - Q\right)}{W^2\left(q - 1\right)}$$

with standard errors

$$ASE_1 = ASE_0 = \frac{2q}{\left(q - 1\right)W^2}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}\left(P - Q\right)^2}$$

where $q = \min\{R, C\}$.
Gamma

Gamma $(\gamma)$ is estimated by

$$\gamma = \frac{P - Q}{P + Q}$$

with standard errors

$$ASE_1 = \frac{4}{\left(P + Q\right)^2}\sqrt{\sum_{i,j} f_{ij}\left(Q\,C_{ij} - P\,D_{ij}\right)^2}$$

$$ASE_0 = \frac{2}{P + Q}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}\left(P - Q\right)^2}$$
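A sketch of the point estimates for tau-b, tau-c, and gamma, reusing the invented 2 × 2 table from the note on P and Q (for that table P = 24, Q = 4):

```python
import math

f = [[3, 1],
     [2, 4]]
R, C = len(f), len(f[0])
r = [sum(row) for row in f]
c = [sum(f[i][j] for i in range(R)) for j in range(C)]
W = sum(r)
P, Q = 24, 4                 # doubled concordant / discordant pair counts
Dr = W ** 2 - sum(ri ** 2 for ri in r)
Dc = W ** 2 - sum(cj ** 2 for cj in c)
q = min(R, C)

tau_b = (P - Q) / math.sqrt(Dr * Dc)
tau_c = q * (P - Q) / (W ** 2 * (q - 1))
gamma = (P - Q) / (P + Q)
```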
Somers' d

Somers' d with row variable X as the independent variable is calculated as

$$d_{Y|X} = \frac{P - Q}{D_r}$$

with standard errors

$$ASE_1 = \frac{2}{D_r^2}\sqrt{\sum_{i,j} f_{ij}\left[D_r\left(C_{ij} - D_{ij}\right) - \left(P - Q\right)\left(W - r_i\right)\right]^2}$$

$$ASE_0 = \frac{2}{D_r}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}\left(P - Q\right)^2}$$

By interchanging the roles of X and Y, the formulas for Somers' d with X as the dependent variable can be obtained.

The symmetric version of Somers' d is

$$d = \frac{P - Q}{\frac{1}{2}\left(D_c + D_r\right)}$$

The standard errors are

$$ASE_1 = \frac{2\sqrt{D_r D_c}}{D_r + D_c}\,\hat{\sigma}_{\tau_b}$$

where $\hat{\sigma}_{\tau_b}$ is the $ASE_1$ of Kendall's tau-b, and

$$ASE_0 = \frac{4}{D_c + D_r}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}\left(P - Q\right)^2}$$
Pearson's r

The Pearson product-moment correlation r is computed as

$$r = \frac{\text{cov}(X, Y)}{S(X)\,S(Y)} \equiv \frac{S}{T}$$

where

$$\text{cov}(X, Y) = \sum_{i,j} X_i Y_j f_{ij} - \left(\sum_{i=1}^{R} X_i r_i\right)\left(\sum_{j=1}^{C} Y_j c_j\right)\Big/ W$$

$$S(X) = \sqrt{\sum_{i=1}^{R} X_i^2 r_i - \left(\sum_{i=1}^{R} X_i r_i\right)^2\Big/ W}$$

and

$$S(Y) = \sqrt{\sum_{j=1}^{C} Y_j^2 c_j - \left(\sum_{j=1}^{C} Y_j c_j\right)^2\Big/ W}$$

Here $X_i$ and $Y_j$ are the numeric scores attached to row $i$ and column $j$.
The variance of r is

$$\text{var}_1 = \frac{1}{T^4}\sum_{i,j} f_{ij}\left\{T\left(X_i - \bar{X}\right)\left(Y_j - \bar{Y}\right) - \frac{S}{2T}\left[\left(X_i - \bar{X}\right)^2 S^2(Y) + \left(Y_j - \bar{Y}\right)^2 S^2(X)\right]\right\}^2$$

and, under the hypothesis of independence,

$$\text{var}_0 = \frac{\displaystyle\sum_{i,j} f_{ij} X_i^2 Y_j^2 - \frac{1}{W}\left(\sum_{i,j} f_{ij} X_i Y_j\right)^2}{\left(\displaystyle\sum_i r_i X_i^2\right)\left(\displaystyle\sum_j c_j Y_j^2\right)}$$

where

$$\bar{X} = \sum_{i=1}^{R} X_i r_i\Big/ W \qquad \text{and} \qquad \bar{Y} = \sum_{j=1}^{C} Y_j c_j\Big/ W$$

The significance of r is tested with

$$t = r\sqrt{\frac{W - 2}{1 - r^2}}$$

which, under the hypothesis of independence, has an asymptotic t distribution with $W - 2$ degrees of freedom.
Spearman Correlation

The Spearman rank correlation coefficient $r_s$ is computed by using rank scores $R_i$ for $X_i$ and $C_j$ for $Y_j$. These rank scores are defined as follows:

$$R_i = \sum_{k<i} r_k + \left(r_i + 1\right)/2 \quad \text{for } i = 1, 2, \ldots, R$$

$$C_j = \sum_{h<j} c_h + \left(c_j + 1\right)/2 \quad \text{for } j = 1, 2, \ldots, C$$

The formulas for $r_s$ and its asymptotic variance can be obtained from the Pearson formulas by substituting $R_i$ and $C_j$ for $X_i$ and $Y_j$, respectively.
Eta

The asymmetric $\eta$ with the column variable Y as dependent is

$$\eta_Y = \sqrt{1 - \frac{S_{YW}}{S^2(Y)}}$$

where

$$S_{YW} = \sum_{i=1}^{R}\sum_{j=1}^{C} Y_j^2 f_{ij} - \sum_{i=1}^{R}\frac{1}{r_i}\left(\sum_{j=1}^{C} Y_j f_{ij}\right)^2$$

and $S^2(Y)$ is as defined for Pearson's r.
Relative Risk

Consider a 2 × 2 table (that is, $R = C = 2$). In a case-control study, the relative risk is estimated as

$$R_0 = \frac{f_{11} f_{22}}{f_{12} f_{21}}$$

The $100(1-\alpha)$ percent CI for the relative risk is obtained as

$$\left(R_0\exp\left(-z_{1-\alpha/2}\sqrt{v}\right),\; R_0\exp\left(z_{1-\alpha/2}\sqrt{v}\right)\right)$$

where

$$v = \frac{1}{f_{11}} + \frac{1}{f_{12}} + \frac{1}{f_{21}} + \frac{1}{f_{22}}$$
The relative risk ratios in a cohort study are computed for both columns. For column 1, the risk is

$$R_1 = \frac{f_{11}\left(f_{21} + f_{22}\right)}{f_{21}\left(f_{11} + f_{12}\right)}$$

and the corresponding $100(1-\alpha)$ percent CI is

$$\left(R_1\exp\left(-z_{1-\alpha/2}\sqrt{v}\right),\; R_1\exp\left(z_{1-\alpha/2}\sqrt{v}\right)\right)$$

where

$$v = \frac{f_{12}}{f_{11}\left(f_{11} + f_{12}\right)} + \frac{f_{22}}{f_{21}\left(f_{21} + f_{22}\right)}$$

The relative risk for column 2 and the confidence interval are computed similarly.
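A sketch of the case-control estimate and its CI (counts invented; 1.959964 approximates the z value for α = 0.05):

```python
import math

# Invented 2x2 case-control table.
f11, f12, f21, f22 = 20, 10, 5, 15
z = 1.959964          # z_{1 - alpha/2} for alpha = 0.05

R0 = (f11 * f22) / (f12 * f21)
v = 1 / f11 + 1 / f12 + 1 / f21 + 1 / f22
lo = R0 * math.exp(-z * math.sqrt(v))
hi = R0 * math.exp(z * math.sqrt(v))
```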
McNemar's Test

Suppose the test sample consists of the $n$ pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Let

$$n_1 = \#\{i : x_i < y_i,\ i = 1, \ldots, n\}\,,$$

$$n_2 = \#\{i : x_i > y_i,\ i = 1, \ldots, n\}\,,$$

and

$$r = \min(n_1, n_2)\,.$$

Probability

If there is no real difference between the two trials, we expect the frequencies $n_1$ and $n_2$ to be related as 1:1. Deviations from this ratio can be tested by using the binomial distribution. The two-tailed probability level is

$$2 \times \sum_{i=0}^{r}\binom{n_1 + n_2}{i}\left(1/2\right)^{n_1 + n_2}$$

Note: this is a generalized version of McNemar's test. The original version is for a 2 × 2 table.
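The two-tailed probability above can be sketched with an exact binomial sum (n1 and n2 are invented):

```python
from math import comb

n1, n2 = 3, 9
r = min(n1, n2)
n = n1 + n2

# Two-tailed binomial probability of a split at least as extreme as (n1, n2).
p_two_tailed = 2 * sum(comb(n, i) for i in range(r + 1)) * 0.5 ** n
```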
Conditional Independence and Homogeneity

For $2 \times 2 \times K$ tables, let $f_{ijk}$ be the count of cell $(i, j)$ in stratum $k$, with row subtotals $r_{ik}$, column subtotals $c_{jk}$, and stratum total $n_k$. Define

$$E\left(f_{ijk}\right) = \frac{r_{ik} c_{jk}}{n_k}\,,$$

the expected cell count of the $i$th row and $j$th column of the $k$th stratum,

$$\hat{p}_{ik} = \frac{f_{i1k}}{r_{ik}}\,, \qquad d_k = \hat{p}_{1k} - \hat{p}_{2k}\,, \qquad \hat{p}_k = \frac{c_{1k}}{n_k}\,,$$

and

$$w_k = \frac{r_{1k} r_{2k}}{n_k}\,.$$
Cochran's Statistic

Cochran's (1954) statistic is

$$C = \frac{\displaystyle\sum_{k=1}^{K} w_k d_k\Big/\sum_{k=1}^{K} w_k}{\sqrt{\displaystyle\sum_{k=1}^{K} w_k\hat{p}_k\left(1 - \hat{p}_k\right)}\Big/\displaystyle\sum_{k=1}^{K} w_k} = \frac{\displaystyle\sum_{k=1}^{K} w_k d_k}{\sqrt{\displaystyle\sum_{k=1}^{K} w_k\hat{p}_k\left(1 - \hat{p}_k\right)}}\,.$$

Since $w_k d_k = f_{11k} - E_{11k}$, this can also be written as

$$C = \frac{\displaystyle\sum_{k=1}^{K}\left(f_{11k} - E_{11k}\right)}{\sqrt{\displaystyle\sum_{k=1}^{K} w_k\hat{p}_k\left(1 - \hat{p}_k\right)}}\,.$$

When the number of strata is fixed as the sample sizes within each stratum increase, Cochran's statistic is asymptotically standard normal, and thus its square is asymptotically distributed as a chi-squared distribution with 1 d.f.
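A sketch of Cochran's statistic for invented strata, using the $w_k$, $d_k$, $\hat{p}_k$ notation above:

```python
import math

# Invented K = 2 strata, each a 2x2 table (f11, f12, f21, f22).
strata = [(10, 5, 4, 8), (6, 3, 2, 7)]

num = 0.0
den = 0.0
for f11, f12, f21, f22 in strata:
    r1, r2 = f11 + f12, f21 + f22
    n = r1 + r2
    w = r1 * r2 / n             # w_k
    d = f11 / r1 - f21 / r2     # d_k = p1k_hat - p2k_hat
    p = (f11 + f21) / n         # p_k_hat = c1k / n_k
    num += w * d                # equals f11k - E11k
    den += w * p * (1 - p)

cochran_c = num / math.sqrt(den)
```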
Mantel and Haenszel's Statistic

Mantel and Haenszel's (1959) statistic is

$$M = \frac{\left\{\left|\displaystyle\sum_{k=1}^{K}\left(f_{11k} - E_{11k}\right)\right| - 0.5\right\}\,\text{sgn}\left\{\displaystyle\sum_{k=1}^{K}\left(f_{11k} - E_{11k}\right)\right\}}{\sqrt{\displaystyle\sum_{k=1}^{K}\frac{r_{1k} r_{2k}}{n_k - 1}\,\hat{p}_k\left(1 - \hat{p}_k\right)}}\,,$$

where

$$\text{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

When the number of strata is fixed as the sample sizes within each stratum increase, or when the sample sizes within each stratum are fixed as the number of strata increases, this statistic is asymptotically standard normal, and thus its square is asymptotically distributed as a chi-squared distribution with 1 d.f.
Breslow-Day's Statistic

The Breslow-Day statistic for a common odds ratio estimator $\hat{\theta}$ is

$$B = \sum_{k=1}^{K}\frac{\left(f_{11k} - E\left(f_{11k}\mid c_{1k};\,\hat{\theta}\right)\right)^2}{V\left(f_{11k}\mid c_{1k};\,\hat{\theta}\right)}\,.$$

E and V are based on the exact moments, but it is customary to replace them with the asymptotic expectation and variance. Let $\hat{E}$ and $\hat{V}$ mean the estimated asymptotic expectation and the estimated asymptotic variance, respectively. Given the Mantel-Haenszel common odds ratio estimator $\hat{\theta}_{MH}$, we use the following statistic as the Breslow-Day statistic:

$$B = \sum_{k=1}^{K}\frac{\left(f_{11k} - \hat{E}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right)\right)^2}{\hat{V}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right)}$$

where $\hat{E}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right) = \hat{f}_{11k}$ is the solution of the quadratic equation

$$\hat{f}_{11k}\left(n_k - r_{1k} - c_{1k} + \hat{f}_{11k}\right) = \hat{\theta}_{MH}\left(r_{1k} - \hat{f}_{11k}\right)\left(c_{1k} - \hat{f}_{11k}\right)$$

satisfying

$$\hat{f}_{11k} > 0, \quad r_{1k} - \hat{f}_{11k} > 0, \quad c_{1k} - \hat{f}_{11k} > 0, \quad n_k - r_{1k} - c_{1k} + \hat{f}_{11k} > 0;$$
and

$$\hat{V}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right) = \left(\frac{1}{\hat{f}_{11k}} + \frac{1}{\hat{f}_{12k}} + \frac{1}{\hat{f}_{21k}} + \frac{1}{\hat{f}_{22k}}\right)^{-1}$$

with constraints such that

$$\hat{f}_{11k} > 0, \quad \hat{f}_{12k} = r_{1k} - \hat{f}_{11k} > 0, \quad \hat{f}_{21k} = c_{1k} - \hat{f}_{11k} > 0, \quad \hat{f}_{22k} = n_k - r_{1k} - c_{1k} + \hat{f}_{11k} > 0.$$
All strata such that $r_{1k} = 0$ or $c_{1k} = 0$ are excluded. Strata such that $\hat{f}_{11k} = 0$ are also excluded. If every stratum is excluded, B is undefined.
Breslow-Day’s statistic is asymptotically distributed as a chi-squared random
variable with K-1 degrees of freedom under the null hypothesis of a constant odds
ratio.
Tarone’s Statistic
Tarone (1985) proposes an adjustment to the Breslow-Day statistic when the
common odds ratio estimator is consistent but inefficient, specifically when we
have the Mantel-Haenszel common odds ratio estimator. The adjusted statistic,
Tarone’s statistic, for θˆMH is
$$T = B - \frac{\left(\displaystyle\sum_{k=1}^{K}\left(f_{11k} - \hat{E}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right)\right)\right)^2}{\displaystyle\sum_{k=1}^{K}\hat{V}\left(f_{11k}\mid c_{1k};\,\hat{\theta}_{MH}\right)}$$
Estimation of the Common Odds Ratio

In a $2 \times 2 \times K$ table, the conditional odds ratio of stratum k is

$$\theta_k = \frac{p_{1k}\left(1 - p_{2k}\right)}{\left(1 - p_{1k}\right)p_{2k}}$$

for $k = 1, \ldots, K$, where $p_{ik}$ is the population counterpart of $\hat{p}_{ik}$. Assuming that the true common odds ratio exists, $\theta = \theta_1 = \cdots = \theta_K$, Mantel and Haenszel's (1959) estimator of this common odds ratio is

$$\hat{\theta}_{MH} = \frac{\displaystyle\sum_{k=1}^{K} f_{11k} f_{22k}/n_k}{\displaystyle\sum_{k=1}^{K} f_{12k} f_{21k}/n_k}\,.$$
The (natural) log of the estimated common odds ratio is asymptotically normal. Note, however, that if $f_{11k} = 0$ or $f_{22k} = 0$ in every stratum, then $\hat{\theta}_{MH}$ is zero and $\log\left(\hat{\theta}_{MH}\right)$ is undefined.
Robins et al. (1986) give an estimated asymptotic variance for $\log\left(\hat{\theta}_{MH}\right)$ that is appropriate in both asymptotic cases:

$$\hat{\sigma}^2\left[\log\left(\hat{\theta}_{MH}\right)\right] = \frac{\displaystyle\sum_{k=1}^{K}\left(f_{11k} + f_{22k}\right)f_{11k}f_{22k}/n_k^2}{2\left(\displaystyle\sum_{k=1}^{K} f_{11k}f_{22k}/n_k\right)^2} + \frac{\displaystyle\sum_{k=1}^{K}\left[\left(f_{11k} + f_{22k}\right)f_{12k}f_{21k} + \left(f_{12k} + f_{21k}\right)f_{11k}f_{22k}\right]/n_k^2}{2\left(\displaystyle\sum_{k=1}^{K} f_{11k}f_{22k}/n_k\right)\left(\displaystyle\sum_{k=1}^{K} f_{12k}f_{21k}/n_k\right)} + \frac{\displaystyle\sum_{k=1}^{K}\left(f_{12k} + f_{21k}\right)f_{12k}f_{21k}/n_k^2}{2\left(\displaystyle\sum_{k=1}^{K} f_{12k}f_{21k}/n_k\right)^2}\,.$$
The $100(1-\alpha)$ percent CI for the common odds ratio is

$$\left(\hat{\theta}_{MH}\exp\left(-z(\alpha/2)\,\hat{\sigma}\left[\log\left(\hat{\theta}_{MH}\right)\right]\right),\; \hat{\theta}_{MH}\exp\left(z(\alpha/2)\,\hat{\sigma}\left[\log\left(\hat{\theta}_{MH}\right)\right]\right)\right)$$

where $z(\alpha/2)$ is the upper $\alpha/2$ critical value for the standard normal distribution. All these computations are valid only if $\hat{\theta}_{MH}$ is defined and greater than 0.
The asymptotic p-value for testing the hypothesis $\theta = \theta_0$ is

$$\Pr\left(|Z| > \frac{\left|\log\left(\hat{\theta}_{MH}\right) - \log\left(\theta_0\right)\right|}{\hat{\sigma}\left[\log\left(\hat{\theta}_{MH}\right)\right]}\right)$$

given that $\log\left(\hat{\theta}_{MH}\right)$ is defined. Alternatively, we can consider using $\hat{\theta}_{MH}$ and the estimated exact variance of $\hat{\theta}_{MH}$, which is still consistent in both limiting cases:

$$\Pr\left(|Z| > \frac{\left|\hat{\theta}_{MH} - \theta_0\right|}{\hat{\sigma}\left[\log\left(\hat{\theta}_{MH}\right)\right]\theta_0}\right)\,.$$

The caveat for this formula is that $\hat{\theta}_{MH}$ may be quite skewed even in moderate sample sizes (Robins et al., 1986, p. 314).
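A sketch of $\hat{\theta}_{MH}$ and the Robins et al. variance of its log (the strata are invented):

```python
# Invented K = 2 strata, each a 2x2 table (f11, f12, f21, f22).
strata = [(10, 5, 4, 8), (6, 3, 2, 7)]

A = sum(f11 * f22 / (f11 + f12 + f21 + f22) for f11, f12, f21, f22 in strata)
B = sum(f12 * f21 / (f11 + f12 + f21 + f22) for f11, f12, f21, f22 in strata)
theta_mh = A / B   # Mantel-Haenszel common odds ratio estimator

# The three terms of the Robins-Breslow-Greenland variance of log(theta_mh).
t1 = sum((f11 + f22) * f11 * f22 / (f11 + f12 + f21 + f22) ** 2
         for f11, f12, f21, f22 in strata)
t2 = sum(((f11 + f22) * f12 * f21 + (f12 + f21) * f11 * f22)
         / (f11 + f12 + f21 + f22) ** 2 for f11, f12, f21, f22 in strata)
t3 = sum((f12 + f21) * f12 * f21 / (f11 + f12 + f21 + f22) ** 2
         for f11, f12, f21, f22 in strata)

var_log_theta = t1 / (2 * A ** 2) + t2 / (2 * A * B) + t3 / (2 * B ** 2)
```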
References

Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley.

Brown, M. B., and Benedetti, J. K. (1977). Sampling behavior of tests for correlation in two-way contingency tables. Journal of the American Statistical Association, 72, 309-315.

Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10, 417-451.

Hauck, W. (1989). Odds ratio inference from stratified samples. Communications in Statistics - Theory and Methods, 18, 767-800.

Mantel, N., and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

Robins, J., Breslow, N., and Greenland, S. (1986). Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics, 42, 311-323.

Snedecor, G. W., and Cochran, W. G. (1980). Statistical Methods, 7th ed. Ames, Iowa: Iowa State University Press.

Tarone, R. E. (1985). On heterogeneity tests based on efficient scores. Biometrika, 72, 91-95.