Part IV: Theory of Generalized Linear Models
⋆ estimate β via least squares and perform inference via the CLT
$$V[Y_i \mid X_i] = \mu_i (1 - \mu_i)$$
• Recall, $\hat\beta_{WLS}$ is the solution to
$$0 = \frac{\partial}{\partial \beta} RSS(\beta; W)$$
$$0 = \frac{\partial}{\partial \beta} \sum_{i=1}^{n} w_i (y_i - X_i^T \beta)^2$$
$$0 = \sum_{i=1}^{n} X_i w_i (y_i - X_i^T \beta)$$
with weights
$$w_i = \frac{1}{\mu_i (1 - \mu_i)}$$
$$\hat\beta_{GLS} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y$$
⋆ in practice, we use the IWLS algorithm to estimate $\hat\beta_{GLS}$
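When Σ is known, the estimator can also be computed directly from the closed form above; a minimal R sketch, assuming hypothetical objects X (design matrix), Sigma (covariance matrix), and Y (response vector):

> ## direct computation of the GLS estimator (sketch)
> Sigma_inv <- solve(Sigma)
> beta_gls <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% Y)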
⋆ to get the MLE, we take derivatives, set them equal to zero and solve
⋆ following the algebra trail we find that
$$\frac{\partial}{\partial \beta} \ell(\beta \mid y) = \sum_{i=1}^{n} \frac{X_i}{\mu_i (1 - \mu_i)} (Y_i - \mu_i)$$
⋆ but our current specification of the mean model doesn't impose any restrictions on $\mu_i$
Q: Is this a problem?
Summary
Definition
• Random component: $\mu_i = E[Y_i \mid X_i]$
• Systematic component: $g(\mu_i) = X_i^T \beta$
• For the Bernoulli distribution:
$$f_Y(y; \pi) = \pi^y (1 - \pi)^{1-y}$$
$$\theta = \log \frac{\pi}{1 - \pi}, \qquad a(\phi) = 1, \qquad c(y, \phi) = 0$$
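Writing the Bernoulli pmf in exponential-family form makes these identifications explicit:
$$f_Y(y; \pi) = \pi^y (1 - \pi)^{1-y} = \exp\left\{ y \log \frac{\pi}{1 - \pi} + \log(1 - \pi) \right\},$$
so that $\theta = \log\frac{\pi}{1-\pi}$ and $b(\theta) = -\log(1 - \pi) = \log(1 + \exp\{\theta\})$.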
• The canonical parameter has key relationships with both E[Y ] and V[Y ]
⋆ it typically varies across study units, so we index θ by i: $\theta_i$
Properties
• The log-likelihood is
$$\ell(\theta_i, \phi; y_i) = \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)$$
• The first partial derivative with respect to $\theta_i$ is the score function for $\theta_i$ and is given by
$$\frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi; y_i) = U(\theta_i) = \frac{y_i - b'(\theta_i)}{a_i(\phi)}$$
$$E[U(\theta_i)] = 0$$
$$V[U(\theta_i)] = E\left[U(\theta_i)^2\right] = -E\left[\frac{\partial U(\theta_i)}{\partial \theta_i}\right]$$
⋆ this latter expression is the (i, i)th component of the Fisher information matrix
$$E[Y_i] = b'(\theta_i)$$
$$\frac{\partial^2}{\partial \theta_i^2} \ell(\theta_i, \phi; y_i) = -\frac{b''(\theta_i)}{a_i(\phi)}$$
⋆ the observed information for the canonical parameter from the ith observation
• This is also the expected information, and using the above properties it follows that
$$V[U(\theta_i)] = V\left[\frac{Y_i - b'(\theta_i)}{a_i(\phi)}\right] = \frac{b''(\theta_i)}{a_i(\phi)},$$
so that
$$V[Y_i] = b''(\theta_i)\, a_i(\phi)$$
• For the Bernoulli distribution, $a(\phi) = 1$ and
$$E[Y] = b'(\theta) = \frac{\exp\{\theta\}}{1 + \exp\{\theta\}} = \mu$$
$$V(\mu) = \mu(1 - \mu)$$
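The variance function follows by differentiating the cumulant function once more:
$$V[Y] = b''(\theta) = \frac{\exp\{\theta\}}{(1 + \exp\{\theta\})^2} = \mu(1 - \mu)$$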
• For the exponential dispersion family, the pdf/pmf has the following form:
$$f_Y(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right\}$$
$$g(\mu_i) = X_i^T \beta$$
• You’ll often find the linear predictor called the ‘systematic component’
⋆ e.g., McCullagh and Nelder (1989) Generalized Linear Models
• Constructing the linear predictor for a GLM follows the same process one
uses for linear regression
• For the most part, the decision of which covariates to include should be
driven by scientific considerations
⋆ is the goal estimation or prediction?
⋆ is there a primary exposure of interest?
⋆ which covariates are predictors of the response variable?
⋆ are any of the covariates effect modifiers? confounders?
• How one includes them in the model will also depend on a mixture of
scientific and practical considerations
• Suppose we are interested in the relationship between birth weight and risk
of death within the first year of life
⋆ infant mortality
$$\eta = \beta_0 + \beta_1 X_w$$
$$\eta = \beta_0 + \beta_1 X_w + \beta_2 X_w^2 + \beta_3 X_w^3$$
$$\eta = \beta_0 + \beta_1 X_{lbw}$$
$$\eta = \beta_0 + \beta_1 X_{cat,1} + \ldots + \beta_K X_{cat,K}$$
⋆ R formulas for these four linear predictors are sketched below
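A sketch of how the four linear predictors above might be written as R model formulas, assuming hypothetical variables Xw (birth weight), Xlbw (a low-birth-weight indicator), and Xcat (a K-level categorization):

> ## continuous, linear
> fit <- glm(Y ~ Xw, family=binomial)
> ## continuous, cubic polynomial
> fit <- glm(Y ~ Xw + I(Xw^2) + I(Xw^3), family=binomial)
> ## binary indicator
> fit <- glm(Y ~ Xlbw, family=binomial)
> ## K-level categorical
> fit <- glm(Y ~ factor(Xcat), family=binomial)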
• As we’ve noted, the range of values that µi can take on may be restricted
⋆ binary data: µi ∈ (0, 1)
⋆ count data: µi ∈ (0, ∞)
$$g(\mu_i) = X_i^T \beta$$
• The inverse of the link function provides the specification of the model on the scale of $\mu_i$
$$\mu_i = g^{-1}\left(X_i^T \beta\right)$$
identity: $\mu_i = X_i^T \beta$
log: $\log(\mu_i) = X_i^T \beta$, with inverse $\mu_i = \exp\{X_i^T \beta\}$
• Recall that the mean and the canonical parameter are linked via the derivative of the cumulant function
⋆ $E[Y_i] = \mu_i = b'(\theta_i)$
• The canonical link chooses $g(\mu_i) = \theta(\mu_i)$, so that $\theta_i = X_i^T \beta$
• We'll see later that this choice results in some mathematical convenience
Choosing g(·)
• For binary response data, one might choose a link function from among
the following:
identity: g(µi ) = µi
log: g(µi ) = log(µi )
logit: $g(\mu_i) = \log \frac{\mu_i}{1 - \mu_i}$
probit: g(µi ) = probit(µi )
complementary log-log: g(µi ) = log {−log(1 − µi )}
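These links are all available in R, either through make.link() or directly via the link argument of the family constructor; a brief sketch:

> ## inspect a link's transformation and its inverse
> L <- make.link("cloglog")
> L$linkfun(0.5)   ## g(0.5)
> L$linkinv(0)     ## g^{-1}(0)
> ## specify the link when constructing the family
> binomial(link = "probit")
> binomial(link = "cloglog")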
Estimation
• The log-likelihood is
$$\ell(\beta, \phi; y) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right]$$
• The MLE solves the score equations
$$\left( \frac{\partial \ell(\beta, \phi; y)}{\partial \beta_0}, \; \cdots, \; \frac{\partial \ell(\beta, \phi; y)}{\partial \beta_p}, \; \frac{\partial \ell(\beta, \phi; y)}{\partial \phi} \right)^T = 0$$
$$\frac{\partial \ell(\beta, \phi; y_i)}{\partial \theta_i} = \frac{y_i - \mu_i}{a_i(\phi)}$$
$$\frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i) = \frac{V[Y_i]}{a_i(\phi)}$$
$$\frac{\partial \eta_i}{\partial \beta_j} = X_{j,i}$$
$$\frac{\partial \ell(\beta, \phi; y)}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - \mu_i) \frac{\partial \mu_i}{\partial \eta_i} \frac{X_{j,i}}{V(\mu_i)\, a_i(\phi)}$$
$$\frac{\partial \ell(\beta, \phi; y)}{\partial \phi} = \sum_{i=1}^{n} \left[ -\frac{w_i (y_i \theta_i - b(\theta_i))}{\phi^2} + c'(y_i, \phi) \right] = 0$$
$$\frac{\partial \ell(\beta, \phi; y)}{\partial \beta_j} = \sum_{i=1}^{n} w_i (y_i - \mu_i) \frac{\partial \mu_i}{\partial \eta_i} \frac{X_{j,i}}{V(\mu_i)} = 0$$
$$\hat\beta_{MLE} = (X^T X)^{-1} X^T Y$$
• To get the asymptotic variance, we first need to derive expressions for the second partial derivatives:
$$\frac{\partial^2 \ell(\beta, \phi; y_i)}{\partial \beta_j \partial \beta_k} = \frac{\partial}{\partial \beta_k} \left[ (y_i - \mu_i) \frac{\partial \mu_i}{\partial \eta_i} \frac{X_{j,i}}{V(\mu_i)\, a_i(\phi)} \right]$$
$$= (y_i - \mu_i) \frac{\partial}{\partial \beta_k} \left[ \frac{\partial \mu_i}{\partial \eta_i} \frac{X_{j,i}}{V(\mu_i)\, a_i(\phi)} \right] - \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \frac{X_{j,i} X_{k,i}}{V(\mu_i)\, a_i(\phi)}$$
$$\frac{\partial^2 \ell(\beta, \phi; y_i)}{\partial \phi \, \partial \phi} = \frac{\partial}{\partial \phi} \left[ -(y_i \theta_i - b(\theta_i)) \frac{a_i'(\phi)}{a_i(\phi)^2} + c'(y_i, \phi) \right]$$
$$= -(y_i \theta_i - b(\theta_i)) \frac{a_i(\phi)^2 a_i''(\phi) - 2 a_i(\phi) a_i'(\phi)^2}{a_i(\phi)^4} + c''(y_i, \phi)$$
$$-E\left[ \frac{\partial^2 \ell(\beta, \phi; y)}{\partial \phi \, \partial \phi} \right] = \sum_{i=1}^{n} \left\{ K(\phi) \left( b'(\theta_i) \theta_i - b(\theta_i) \right) - E[c''(Y_i, \phi)] \right\}$$
where the components of $I_{\beta\beta}$ are given by the first expression on the previous slide and $I_{\phi\phi}$ is given by the last expression on the previous slide
• The block-diagonal structure of $V[\hat\beta_{MLE}, \hat\phi_{MLE}]$ indicates that, for GLMs, valid inference for β can proceed without regard to how φ is estimated
⋆ we typically don't plug in $\hat\sigma^2_{MLE}$ but, rather, an unbiased estimate:
$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (Y_i - X_i^T \hat\beta_{MLE})^2$$
• The variance of $\hat\beta_{MLE}$ is estimated via the corresponding sub-matrix:
$$\hat{V}[\hat\beta_{MLE}] = \hat{I}_{\beta\beta}^{-1}$$
Matrix notation
• If we set
$$W_i = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \frac{1}{V(\mu_i)\, a_i(\phi)}$$
and let $W = \text{diag}(W_1, \ldots, W_n)$, then
$$I_{\beta\beta} = X^T W X$$
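For a fitted glm object this is easy to verify numerically: the weights component holds the $W_i$ from the final IWLS iteration, so $(X^T W X)^{-1}$ should reproduce the unscaled covariance matrix (a sketch, assuming a fitted binomial model fit):

> X <- model.matrix(fit)
> W <- diag(fit$weights)
> solve(t(X) %*% W %*% X)   ## compare with summary(fit)$cov.unscaled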
Hypothesis testing
$$H_0: \beta_1 = \beta_{1,0} \quad \text{vs} \quad H_a: \beta_1 \neq \beta_{1,0}$$
⋆ length of $\beta_1$ is $q \leq (p + 1)$
⋆ $\beta_2$ is left arbitrary
• Wald test:
⋆ under $H_0$,
$$(\hat\beta_{1,MLE} - \beta_{1,0})^T \, \hat{V}[\hat\beta_{1,MLE}]^{-1} \, (\hat\beta_{1,MLE} - \beta_{1,0}) \; \xrightarrow{d} \; \chi^2_q$$
where $\hat{V}[\hat\beta_{1,MLE}]$ is the $q \times q$ sub-matrix of $I_{\beta\beta}^{-1}$ corresponding to $\beta_1$, evaluated at the MLE
• Score test:
⋆ let $\hat\beta_{0,MLE} = (\beta_{1,0}, \hat\beta_{2,MLE})$ denote the MLE under $H_0$
⋆ under $H_0$,
$$U(\hat\beta_{0,MLE}; y)^T \, I(\hat\beta_{0,MLE})^{-1} \, U(\hat\beta_{0,MLE}; y) \; \xrightarrow{d} \; \chi^2_q$$
• Likelihood ratio test: under $H_0$,
$$2\left( \ell(\hat\beta_{MLE}; y) - \ell(\hat\beta_{0,MLE}; y) \right) \; \xrightarrow{d} \; \chi^2_q$$
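All three tests are readily available in R for nested glm fits; a sketch, assuming nested fits fit0 and fit1 and a coefficient named X:

> ## likelihood ratio test
> anova(fit0, fit1, test = "LRT")
> ## score (Rao) test
> anova(fit0, fit1, test = "Rao")
> ## Wald test for a single coefficient
> z2 <- (coef(fit1)["X"] / sqrt(diag(vcov(fit1))["X"]))^2
> z2   ## compare with qchisq(0.95, df = 1)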
$$\frac{\partial \ell(\beta, \phi; y)}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - \mu_i) \frac{\partial \mu_i}{\partial \eta_i} \frac{X_{j,i}}{V(\mu_i)\, a_i(\phi)} = 0$$
• For a GLM, the Hessian is the derivative of the score function with respect to β
⋆ these form the components of the observed information matrix $I_{\beta\beta}$
• Suppose the current estimate of β is $\hat\beta^{(r)}$
⋆ compute the following:
$$\eta_i^{(r)} = X_i^T \hat\beta^{(r)}$$
$$\mu_i^{(r)} = g^{-1}\left( \eta_i^{(r)} \right)$$
$$W_i^{(r)} = \left( \left. \frac{\partial \mu_i}{\partial \eta_i} \right|_{\eta_i^{(r)}} \right)^2 \frac{1}{V\left( \mu_i^{(r)} \right)}$$
$$z_i^{(r)} = \eta_i^{(r)} + \left( y_i - \mu_i^{(r)} \right) \left. \frac{\partial \eta_i}{\partial \mu_i} \right|_{\mu_i^{(r)}}$$
⋆ $z_i$ is called the 'adjusted response variable'
• Update $\hat\beta$ via a weighted least squares fit of $z^{(r)}$ on X:
$$\hat\beta^{(r+1)} = (X^T W^{(r)} X)^{-1} X^T W^{(r)} z^{(r)}$$
• Iterate until the value of $\hat\beta$ converges
⋆ i.e. the difference between $\hat\beta^{(r+1)}$ and $\hat\beta^{(r)}$ is 'small'
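A bare-bones version of this loop for logistic regression might look as follows (a minimal sketch, assuming a design matrix X that includes an intercept column and a binary response vector y):

> ## minimal IWLS for logistic regression (sketch)
> beta <- rep(0, ncol(X))
> for (r in 1:25) {
+   eta <- drop(X %*% beta)
+   mu  <- 1 / (1 + exp(-eta))      ## inverse logit
+   W   <- mu * (1 - mu)            ## (dmu/deta)^2 / V(mu) for the logit link
+   z   <- eta + (y - mu) / W       ## adjusted response
+   beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
+   if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
+   beta <- beta_new
+ }
> drop(beta)   ## compare with coef(glm(y ~ X - 1, family=binomial))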
• ‘data’ specifies the data frame containing the response and covariate data
• The call for a standard logistic regression for binary data might be of the
form:
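> ## e.g., assuming a data frame 'mydata' with columns Y and X
> fit0 <- glm(Y ~ X, family=binomial, data=mydata)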
> ##
> ?family
> poisson()
Family: poisson
Link function: log
> ##
> myFamily <- binomial()
> myFamily
Family: binomial
Link function: logit
> names(myFamily)
[1] "family" "link" "linkfun" "linkinv" "variance"
"dev.resids" "aic"
[8] "mu.eta" "initialize" "validmu" "valideta" "simulate"
> myFamily$link
[1] "logit"
> myFamily$variance
function (mu)
mu * (1 - mu)
>
> ## Changing the link function
> ## * for a true ’log-linear’ model we’d need to make appropriate
> ## changes to the other components of the family object
> ##
> myFamily$link <- "log"
>
> ## Standard logistic regression
> ##
> fit0 <- glm(Y ~ X, family=binomial)
>
> ## log-linear model for binary data
> ##
> fit1 <- glm(Y ~ X, family=binomial(link = "log"))
>
> ## which is (currently) not the same as
> ##
> fit1 <- glm(Y ~ X, family=myFamily)
> ##
> names(fit0)
[1] "coefficients" "residuals" "fitted.values" "effects"
[5] "R" "rank" "qr" "family"
[9] "linear.predictors" "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights" "df.residual"
[17] "df.null" "y" "converged" "boundary"
[21] "model" "call" "formula" "terms"
[25] "data" "offset" "control" "method"
[29] "contrasts" "xlevels"
>
> ##
> names(summary(fit0))
[1] "call" "terms" "family" "deviance" "aic"
[6] "contrasts" "df.residual" "null.deviance" "df.null" "iter"
[11] "deviance.resid" "coefficients" "aliased" "dispersion" "df"
[16] "cov.unscaled" "cov.scaled"
The deviance
$$\ell(\theta_i, \phi; y_i) = \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)$$
• Reparameterizing in terms of the mean: $\ell(\theta_i, \phi; y_i) \Rightarrow \ell(\mu_i, \phi; y_i)$
• The log-likelihood evaluated at the fitted means is
$$\ell(\hat\mu, \phi; y) = \sum_{i=1}^{n} \ell(\hat\mu_i, \phi; y_i),$$
while under the saturated model, with $\mu_i = y_i$,
$$\ell(y, \phi; y) = \sum_{i=1}^{n} \ell(y_i, \phi; y_i)$$
$$D^*(y, \hat\mu) = 2 \left[ \ell(y, \phi; y) - \ell(\hat\mu, \phi; y) \right]$$
• Let
⋆ $\tilde\theta_i$ be the value of $\theta_i$ based on setting $\mu_i = y_i$
⋆ $\hat\theta_i$ be the value of $\theta_i$ based on setting $\mu_i = \hat\mu_i$
$$D^*(y, \hat\mu) = \sum_{i=1}^{n} \frac{2 w_i}{\phi} \left[ y_i (\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i) \right] = \frac{D(y, \hat\mu)}{\phi}$$
• For the Normal distribution, the deviance is the sum of squared residuals:
$$D(y, \hat\mu) = \sum_{i=1}^{n} (y_i - \hat\mu_i)^2$$
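This identity is easy to check numerically; a sketch with simulated data:

> ## for a Gaussian GLM the deviance equals the residual sum of squares
> x <- rnorm(100)
> y <- 1 + 2*x + rnorm(100)
> fit <- glm(y ~ x, family=gaussian)
> deviance(fit)
> sum((y - fitted(fit))^2)   ## same value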
• Hypothesis tests comparing nested models can be based on the asymptotic distribution of the difference in the deviance between two models
Residuals
• Response residual:
$$r_i = y_i - \hat\mu_i$$
• Pearson residual:
$$r_i^p = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$$
⋆ the Pearson $\chi^2$ statistic for goodness-of-fit is equal to $\sum_i (r_i^p)^2$
• Pierce and Schafer (JASA, 1986) examined various residuals for GLMs
⋆ conclude that deviance residuals are ‘a very good choice’
⋆ very nearly normally distributed after one allows for the discreteness
⋆ a continuity correction replaces
$$y_i \Rightarrow y_i \pm \frac{1}{2}$$
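In R, the various residual types are available via the type argument of residuals(); a sketch, assuming a fitted binomial model fit:

> ## response, Pearson, and deviance residuals
> r_resp <- residuals(fit, type = "response")
> r_pear <- residuals(fit, type = "pearson")
> r_dev  <- residuals(fit, type = "deviance")
> sum(r_pear^2)   ## the Pearson chi-squared statistic
> sum(r_dev^2)    ## equals deviance(fit)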
$$Y_i \mid X_i \sim f_Y(y; \mu_i, \phi)$$
$$E[Y_i \mid X_i] = g^{-1}(X_i^T \beta) = \mu_i$$
$$V[Y_i \mid X_i] = V(\mu_i)\, a_i(\phi)$$
$$\pi(\beta, \phi) = \pi(\beta \mid \phi)\, \pi(\phi)$$
⋆ an informative prior
∗ e.g., $\beta \sim \text{MVN}(\beta_0, \Sigma_\beta)$
∗ convenient choice given the computational methods described below
Computation
• We’ve seen that the Gibbs sampler and the Metropolis-Hastings algorithm
are powerful tools for generating samples from the posterior distribution
⋆ need to specify a proposal distribution
⋆ need to specify starting values for the Markov chain(s)
• Towards this, let $\tilde\theta = (\tilde\beta, \tilde\phi)$ denote the posterior mode
• A Taylor series expansion of the log-posterior around the mode gives
$$\log \pi(\theta \mid y) = \log \pi(\tilde\theta \mid y) + (\theta - \tilde\theta)^T \left. \frac{\partial \log \pi(\theta \mid y)}{\partial \theta} \right|_{\theta = \tilde\theta} + \frac{1}{2} (\theta - \tilde\theta)^T \left. \frac{\partial^2 \log \pi(\theta \mid y)}{\partial \theta \, \partial \theta^T} \right|_{\theta = \tilde\theta} (\theta - \tilde\theta) + \ldots$$
• Ignore the $\log \pi(\tilde\theta \mid y)$ term because, as a function of θ, it is constant
• The linear term in the expansion disappears because the first derivative of
the log-posterior at the mode is equal to 0
• Writing $I(\tilde\theta)$ for the negative second-derivative matrix, i.e. the observed information matrix, evaluated at the mode, we obtain
$$\log \pi(\theta \mid y) \approx -\frac{1}{2} (\theta - \tilde\theta)^T I(\tilde\theta) (\theta - \tilde\theta)$$
Q: How can we make use of this for sampling from the posterior π(β, φ|y)?
⋆ there are many approaches that one could take
⋆ we’ll describe three
$$(\tilde\beta, \tilde\phi) \equiv (\hat\beta_{MLE}, \hat\phi_{MLE})$$
Approach #1
• If we believe that $\tilde\pi(\beta \mid y)$ is a good approximation, we could simply report summary statistics directly from the multivariate Normal distribution
$$\beta \mid y \sim \text{Normal}\left( \tilde\beta, \; V_\beta(\tilde\beta, \tilde\phi) \right)$$
⋆ similarly for φ, using the $\tilde\pi(\phi \mid y)$ Normal approximation
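A sketch of this approach in R: minimize the negative log-posterior with optim() to obtain the mode and observed information, then summarize the approximating Normal. Here log_post() is a hypothetical function returning the log-posterior for β up to a constant, and p is the number of regression coefficients:

> ## Normal (Laplace) approximation to the posterior (sketch)
> neg_log_post <- function(b) -log_post(b)
> opt <- optim(rep(0, p), neg_log_post, hessian=TRUE)
> beta_tilde <- opt$par           ## posterior mode
> V_beta <- solve(opt$hessian)    ## inverse observed information
> draws <- MASS::mvrnorm(5000, mu=beta_tilde, Sigma=V_beta)
> apply(draws, 2, quantile, probs=c(0.025, 0.5, 0.975))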
Approach #2
⋆ if the proposal $\beta^*$ is rejected, set $\beta^{(r+1)} = \beta^{(r)}$
⋆ if accepted, set $\beta^{(r+1)} = \beta^*$
Approach #3
• Use the approximations to generate starting values for the Markov chain(s)
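For instance, the Normal approximation can serve both as the proposal in an independence Metropolis-Hastings sampler and as a source of starting values; a sketch reusing the hypothetical log_post(), beta_tilde, and V_beta from above:

> ## independence M-H with the Normal approximation as proposal (sketch)
> Vinv <- solve(V_beta)
> prop_ld <- function(b)   ## log proposal density, up to a constant
+   -0.5 * drop(t(b - beta_tilde) %*% Vinv %*% (b - beta_tilde))
> beta_cur <- beta_tilde   ## start the chain at the mode
> draws <- matrix(NA, 5000, length(beta_tilde))
> for (r in 1:5000) {
+   beta_star <- drop(MASS::mvrnorm(1, mu=beta_tilde, Sigma=V_beta))
+   log_alpha <- log_post(beta_star) - log_post(beta_cur) -
+                prop_ld(beta_star) + prop_ld(beta_cur)
+   if (log(runif(1)) < log_alpha) beta_cur <- beta_star
+   draws[r, ] <- beta_cur
+ }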