Professional Documents
Culture Documents
Spring 2019
c Cecilia Cotton 2019
This material is for the personal use of students enrolled in the Spring 2019 offering of Stat 431. Distribution or reproduction of these material for
commercial or non-commercial means is strictly prohibited.
Log Linear Models
Previously we used a Poisson GLM to model count data arising from a time
homogeneous Poisson process:
• Ni (ti ) = the number of events observed over (0, ti ]
We will consider three other types of data we can analyze with a Poisson GLM:
1. Approximating binomial data
2. Time non homogeneous Poisson processes
3. Contingency tables/Multinomial data (next class)
2/1
1. Poisson Approximation to the Binomial
• Suppose: Y ∼ Bin(m, π) so that E [Y ] = mπ
• Set µ = mπ and examine pmf of Y in terms of µ
m y
f (y ) = π (1 − π)m−y
y
(m)(m − 1) · · · (m − y )(m − y − 1) · · · (1) µ y µ m−y
= 1−
(m − y )! y ! m m
y −y
(m)(m − 1) · · · (m − (y − 1)) µ µ m µ
= 1− 1−
(m)(m) · · · (m) y! m m
• Recall a n
lim 1+ = ea
n→∞ n
• Therefore as m → ∞ with µ = mπ fixed:
µy e −µ
f (y ) → i.e. Y → Poisson(µ)
y!
3/1
Poisson Approximation to the Binomial
• So for Y ∼ Bin(m, π) as m → ∞ and π → 0 we have
Y ∼ Poisson(µ = mπ)
with E [Y ] = µ = mπ
• Using a Poisson GLM (with log link):
4/1
Example: Poisson Approximation to the Binomial
Non Melanoma Skin Cancer
Schwarz (2015) gives the incidence of non melanoma skin cancer among women in the early
1970s in Minneapolis-St Paul and Dallas-Fort Worth
City Age Class Age Mid Count Pop Size
msp 15-25 20 1 172675
msp 25-34 30 16 123065
msp 35-44 40 30 96216
msp 45-54 50 71 92051
msp 55-64 60 102 72159
msp 65-74 70 130 54722
msp 75-84 80 133 32185
msp 85+ 90 40 8328
dfw 15-25 20 4 181343
dfw 25-34 30 38 146207
dfw 35-44 40 119 121374
dfw 45-54 50 221 111353
dfw 55-64 60 259 83004
dfw 65-74 70 310 55932
dfw 75-84 80 226 29007
dfw 85+ 90 65 7538
5/1
6/1
R Code
# C r e a t e melanoma d a t a f r a m e : C i t y and A g e C l a s s a r e c a t e g o r i c a l / f a c t o r v a r i a b l e s
City = f a c t o r ( c ( r e p ( " msp " ,8) , r e p ( " dfw " ,8) ) )
AgeClass = f a c t o r ( r e p ( c ( " 15 -25 " ," 25 -34 " ," 35 -44 " ," 45 -54 " ," 55 -64 " ," 65 -74 " ," 75 -84 " ," 85+ " ) ,2) )
AgeMid = r e p ( s e q (20 ,90 ,10) ,2)
Count = c (1 ,16 ,30 ,71 ,102 ,130 ,133 ,40 ,4 ,38 ,119 ,221 ,259 ,310 ,226 ,65)
Pop = c (172675 , 123065 , 96216 , 92051 , 72159 , 54722 , 32185 , 8328 ,181343 , 146207 , 121374 , 111353 , 83004 ,
55932 , 29007 , 7538)
# P l o t t h e i n c i d e n c e v e r s u s t h e m i d p o i n t o f t h e age c a t e g o r i e s
p l o t ( AgeMid , Count / Pop , pch =1+9* a s . n u m e r i c ( City == " msp " ) )
l e g e n d ( x =20 , y =0.008 , pch = c (1 ,10) , l e g e n d = c ( " msp " ," dfw " ) )
# F i t a b i n o m i a l ( l o g i s t i c ) r e g r e s s i o n model
fit . b i n o m i a l = glm ( resp ~ City + l o g ( AgeMid ) , f a m i l y = b i n o m i a l ( l i n k = logit ) , d a t a = melanoma )
summary( fit . b i n o m i a l )
# F i t a P o i s s o n ( l o g−l i n e a r ) r e g r e s s i o n model
fit . p o i s s o n = glm ( Count ~ City + l o g ( AgeMid ) + o f f s e t ( l o g ( Pop ) ) , f a m i l y = p o i s s o n , d a t a = melanoma )
summary( fit . p o i s s o n )
# F i t t e d OR, RR , and p r e d i c t e d c o u n t s f o r i n t e r p r e t a t i o n
exp ( - fit . b i n o m i a l $ c o e f f [2])
exp ( - fit . p o i s s o n $ c o e f f [2])
expit (sum( fit . b i n o m i a l $ c o e f f * c (1 ,0 , l o g (30) ) ) ) *146207
exp (sum( fit . p o i s s o n $ c o e f f * c (1 ,0 , l o g (30) ) ) ) *146207
7/1
Binomial Model
πi
log = β0 + β1 xi1 + β2 xi2
1 − πi
> summary( fit . b i n o m i a l )
Call :
glm ( f o r m u l a = resp ˜ City + l o g ( AgeMid ) , f a m i l y = b i n o m i a l ( l i n k = logit ) , d a t a = melanoma )
Deviance Residuals :
Min 1Q Median 3Q Max
-3.2539 -0.9718 -0.2358 0.2556 2.0784
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -19.49446 0.34284 -56.86 <2e -16 ∗∗∗
Citymsp -0.81288 0.05226 -15.55 <2e -16 ∗∗∗
l o g ( AgeMid ) 3.35654 0.08298 40.45 <2e -16 ∗∗∗
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
Call :
glm ( f o r m u l a = Count ˜ City + l o g ( AgeMid ) + o f f s e t ( l o g ( Pop ) ) , f a m i l y = p o i s s o n , d a t a = melanoma )
Deviance Residuals :
Min 1Q Median 3Q Max
-3.2690 -0.9863 -0.2324 0.2620 2.0957
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -19.46463 0.34207 -56.90 <2e -16 ∗∗∗
Citymsp -0.81014 0.05218 -15.53 <2e -16 ∗∗∗
l o g ( AgeMid ) 3.34813 0.08277 40.45 <2e -16 ∗∗∗
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
9/1
Example: Non Melanoma Skin Cancer
1. What is the OR and RR for developing non melanoma skin cancer for women in
Dallas-Forth Worth versus those in Minneapolis-St Paul, controlling for age?
OR
d = exp(−β̂1 ) = exp(0.81288) = 2.254381
RR
c = exp(−α̂1 ) = exp(0.81014) = 2.248222
2. What is the predicted number of skin cancer cases in Dallas-Fort Worth among
women age 25-34?
Ŷi = π̂i mi = expit(β̂0 + β̂2 log(30))(146207) = 45.34367
10/1
2. Time Non Homogeneous Poisson Processes
• Now consider
E [N(t)] = µ(t) = λ(t) (not = λt)
• The rate is now a function of time
• Lots of possible ways to model the rate λ(t)
• Piecewise constant
12/1
Times to tumors in days (number of tumors detected)
Treatment Group Control Group
ID Days of Tumor Detection ID Days of Tumor Detection
(2)
1 122 1 3, 42, 59, 61 , 112, 119
2 - 2 28, 31, 35, 45, 52, 59(2) , 77, 85, 107, 112
3 3, 88 3 31, 38, 48, 52, 74, 77, 101(2) , 119
4 92 4 11, 114
5 70, 74, 85, 92 5 35, 45, 74(2) , 77, 80, 85, 90(2)
6 38, 92, 122 6 8(2) , 70, 77
7 28, 35, 45, 70, 77, 107 7 17, 35, 52, 77, 101, 114
8 92 8 21, 24, 66, 74, 101(2) , 114
9 21 9 8, 17, 38, 42(3)
10 11, 24, 66, 74, 92 10 52
11 56, 70 11 28(2) , 31, 38, 52, 74(2) , 77(2) , 80(2) , 92(2)
12 31 12 17, 119
13 3, 8, 24, 35, 92 13 52
14 45, 92 14 11(2) , 14, 17, 52, 56(2) , 80(2) , 107
15 3, 42, 92 15 17, 35, 66, 90
16 3, 17, 52, 80 16 28, 66, 70(2) , 74
17 17, 59, 92, 101, 107 17 3, 14, 24(2) , 28, 31, 35, 48, 74, 77, 119
18 45, 52, 85, 101, 122 18 21, 28, 45, 56, 63, 80, 85, 92, 101(2) , 119
19 92 19 28, 35, 52, 59, 66(2) , 90, 97, 119
20 21, 35 20 8(2) , 24, 42, 45, 59, 63(2) , 77, 101, 119, 122
21 24, 31, 42, 48, 70, 74 21 80
22 - 22 92, 122(2)
23 31 23 21
24 3, 28, 74
25 24, 74, 122
Timeline plots for data from Gail et al. (1980)
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
ï
Rat Tumor Data Structure
The entries in the data frame rats.pw for the first five rats in the treated group are
given below:
> rats.pw[1:20,]
id interval count len trt
1 1 1 0 30 1
2 1 2 0 30 1
3 1 3 0 30 1
4 1 4 1 32 1
5 2 1 0 30 1 • Consider four time intervals
6 2 2 0 30 1
7 2 3 0 30 1 • One line of data per interval
8 2 4 0 32 1
9
10 3
3 1
2
1 30
0 30
1
1
• count = number events in interval
11 3 3 1 30 1
12 3 4 0 32 1 • len = days spent in interval
13 4 1 0 30 1
14 4 2 0 30 1 • trt = treatment group
15 4 3 0 30 1
16 4 4 1 32 1
17 5 1 0 30 1
18 5 2 0 30 1
19 5 3 3 30 1
20 5 4 1 32 1
15/1
R Code I
rats = r e a d . t a b l e ( " rats . dat " , header = F )
dimnames ( rats ) [[2]] = c ( " id " ," start " ," stop " ," status " ," enum " ," trt " )
# f u n c t i o n t o c o v e r t d a t a t o s t r u c t u r e we want f o r analysis
gd . pw . f = f u n c t i o n ( indata ) {
pid = s o r t ( u n i q u e ( indata $ id ) )
16/1
R Code II
rats . pw [1:20 ,]
# Model C o n t r o l and T r e a t m e n t G r o u p s ( p f i t )
pfit = glm ( c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) + trt , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a = rats . pw ,
epsilon = 1e -004)
summary( pfit , corr = F )
17/1
1. Model Control Group Only (pfitC)
log µi = xT
i β + offset(log ti )
xi1 = I [interval 2]
xi2 = I [interval 3]
xi3 = I [interval 4]
• Include offset(log(len)) to account for the fact that different intervals are of
different durations
18/1
> pfitC = glm ( c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a = rats . pw ,
s u b s e t =( trt ==0) , epsilon = 1e -004)
> summary( pfitC , corr = F )
Call :
glm ( f o r m u l a = c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) , f a m i l y = p o i s s o n ( l i n k = l o g ) ,
d a t a = rats . pw , s u b s e t = ( trt == 0) , epsilon = 1e -04)
Deviance Residuals :
Min 1Q Median 3Q Max
-1.91834 -1.57481 -0.27360 0.62620 2.89586
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -3.09371 0.17137 -18.0530 <2e -16 ∗∗∗
f a c t o r ( interval ) 2 0.16254 0.23284 0.6981 0.4851
f a c t o r ( interval ) 3 0.30228 0.22593 1.3380 0.1809
f a c t o r ( interval ) 4 -0.15690 0.24794 -0.6328 0.5268
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
λ(interval 2)
exp(β1 ) = = exp(0.16254) = 1.176
λ(interval 1)
• Notice none of β1 , β2 , β3 are statistically significant
• There is a trend of a slightly higher rate in intervals 2 and 3 (versus interval 1) but
the event rate does not differ significantly across follow-up time in the control rats
20/1
2. Model Control and Treatment Groups (pfit)
• exp(β1 ) is now RR of events for interval 2 versus interval 1, for two rats of the
same treatment group
21/1
> pfit = glm ( c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) + trt , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a = rats . pw ,
epsilon = 1e -004)
> summary( pfit , corr = F )
Call :
glm ( f o r m u l a = c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) + trt ,
f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a = rats . pw , epsilon = 1e -04)
Deviance Residuals :
Min 1Q Median 3Q Max
-1.83356 -1.19944 -0.33024 0.47007 3.05508
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -3.088179 0.150548 -20.5130 < 2.2 e -16 ∗∗∗
f a c t o r ( interval ) 2 0.171860 0.195454 0.8793 0.3792
f a c t o r ( interval ) 3 0.206350 0.193905 1.0642 0.2872
f a c t o r ( interval ) 4 -0.064533 0.203686 -0.3168 0.7514
trt -0.822988 0.151130 -5.4456 5.164 e -08 ∗∗∗
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
λ(treatment)
exp(β4 ) = = exp(−0.8230) = 0.44
λ(control)
• Controlling for interval of follow-up, the rate of tumor development in treated rats
in 0.44 times that of control rats
• i.e. Treatment looks beneficial
• Note that β4 is statistically significant
• β1 , β2 , β3 still not statistically significant
• Consider do we really need to use a time non homogeneous model for this data?
23/1
3. Time Homogeneous Model (fit)
24/1
> fit = glm ( c o u n t ˜ o f f s e t ( l o g ( len ) ) + trt , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a = rats . pw , epsilon =
0.0001)
> summary( fit , corr = F )
Call :
glm ( f o r m u l a = c o u n t ˜ o f f s e t ( l o g ( len ) ) + trt , f a m i l y = p o i s s o n ( l i n k = l o g ) ,
d a t a = rats . pw , epsilon = 1e -04)
Deviance Residuals :
Min 1Q Median 3Q Max
-1.78004 -1.14210 -0.42348 0.40094 3.26727
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -3.005609 0.081219 -37.0062 < 2.2 e -16 ∗∗∗
trt -0.822995 0.151160 -5.4445 5.194 e -08 ∗∗∗
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
• No not reject H0
• Conclude that the time homogeneous model is probably OK in this case
• However, we retain it for generality and for the following analysis
26/1
4. Time Non Homogeneous Model with Treatment Interaction (ifit)
• Model pfit (time non homogeneous, without interaction) is nested within this
model (consider ifit with β5 = β6 = β7 = 0)
27/1
> ifit = glm ( c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) ∗ trt , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a =
rats . pw , epsilon = 0.0001)
> summary( ifit , corr = F )
Call :
glm ( f o r m u l a = c o u n t ˜ o f f s e t ( l o g ( len ) ) + f a c t o r ( interval ) ∗ trt , f a m i l y = p o i s s o n ( l i n k = l o g ) , d a t a =
rats . pw , epsilon = 1e -04)
Deviance Residuals :
Min 1Q Median 3Q Max
-1.91834 -1.21584 -0.32409 0.51249 2.89586
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ) -3.093712 0.171368 -18.0530 < 2e -16 ∗∗∗
f a c t o r ( interval ) 2 0.162536 0.232840 0.6981 0.48514
f a c t o r ( interval ) 3 0.302284 0.225926 1.3380 0.18090
f a c t o r ( interval ) 4 -0.156902 0.247935 -0.6328 0.52684
trt -0.803850 0.316133 -2.5428 0.01100 ∗
f a c t o r ( interval ) 2: trt 0.031556 0.428219 0.0737 0.94126
f a c t o r ( interval ) 3: trt -0.376186 0.443551 -0.8481 0.39637
f a c t o r ( interval ) 4: trt 0.286455 0.436610 0.6561 0.51177
---
Signif . codes : 0 ∗∗∗ 0.001 ∗∗ 0.01 ∗ 0.05 . 0.1 1
• No not reject H0
• We do not have evidence that the treatment effect varies across the time intervals
29/1
Summary of Rat Tumor Data Analysis
• Interpretation: The relative rate for tumor development in treated versus control
rats is
exp(β̂4 ) = exp(−0.822995) = 0.439
• i.e. treatment is beneficial (treated rates get fewer tumors)
• Prediction: Expected number of tumors for a control rat observed for 70 days?
30/1