
Chapter 12

Linear mixed effect model

12.1 Fixed effects and random effects

• When we have observations $y_1, \cdots, y_n$, we are mainly interested in the 'average', i.e. $E(y_i) = \mu_i$, and the 'variation', i.e. $\mathrm{var}(y_i) = \sigma_i^2$. The average and the variation can be further factored out when additional information is supplied.

• For example, suppose the observations actually have a structure so that $y_{11}, \cdots, y_{1m}$ are observations from females and $y_{21}, \cdots, y_{2m}$ are observations from males, with $2m = n$. Then, we can factor out the mean such as $E(y_{ij}) = \mu + \alpha_i$ for $i = 1, 2$ and $j = 1, \cdots, m$. Here $\alpha_i$ can be interpreted as the average contribution by 'gender'. We call $E(y_{ij}) = \mu + \alpha_i$ a model or a mean model.

◦ When the model is linear in the parameters, we call it a linear model (LM).

◦ When the model is not necessarily linear, we call it a generalized linear model (GLM).

◦ In LM and GLM, the parameters in the 'mean' are unknown constants. Depending on the situation, they can be random. Then, the model is called a linear mixed model (LMM) or a generalized linear mixed model (GLMM).

• Non-random parameters are used to attribute the contribution to the mean from the additional information. Random parameters are used to attribute the contribution to the variation from the additional information.



• What is additional information? Additional information (covariates or predictors) can be the usual covariates (continuous, discrete) or categories (classifications). The second type is often used for attributing variation.

◦ Categorical variates are also called factors. Factors are groupings that divide the data accordingly (e.g. gender).

◦ The individual classes in a factor are called 'levels' (e.g. female and male in 'gender').

◦ A unit categorized by a combination of factor levels is called a 'cell'. Consider a model with two factors: gender (female, male) and race (white, non-white). Each factor has two levels. Then there are 2 × 2 = 4 cells.

◦ Nested/crossed factors
Nested factor example: Consider students' GPA by gender for each section in English and Math courses. There are three sections (A, B, C). Then, we have three factors (gender, section, subject). Within each subject, there are three sections. The section factor is nested within the subject factor since section A in the English class has no connection to section A in the Math class.

Crossed factor example: Consider cancer rates by gender for each of 16 age groups in white and non-white groups. There are three factors (age group, race, gender). The 16 age groups in white and non-white groups are the same. The age group and race factors are said to be crossed.

• The effect of a factor is its contribution to the mean and/or the variation. There are two kinds of effects: fixed effects and random effects.

◦ The effect of a factor is called a fixed effect when the parameters associated with the factor are fixed constants. Fixed effects contribute to the mean. In this case, the corresponding factor has a finite number of levels. When a continuous covariate is considered and the corresponding coefficient is a fixed constant, it is also a fixed effect parameter.

◦ The effect of a factor is called a random effect when the parameters associated with the factor are random. Random effects contribute to the variation. In this case, we regard the available levels (a finite number of levels) of the factor as a sample of levels from an infinite set of levels.

• Example: Let $y_{ijk}$ be the volume of the $k$th loaf of bread in the $j$th batch at the $i$th temperature level, $i = 1, 2, 3$, $j = 1, \cdots, 6$ and $k = 1, \cdots, 4$. A researcher wants to investigate the changes in the volume of bread at different temperatures. There are three different temperatures. At each temperature level, four loaves of bread in each of 6 batches are baked.

◦ There are two factors: the temperature factor (3 levels) and the batch factor (6 levels). We use fixed effects for the temperature factor and random effects for the batch factor. One can argue that the effect of the temperature level contributes to the mean of the volume, while the effect of the batch level contributes to the variation of the volume.

◦ When both fixed effects and random effects are considered in the model, it is called a mixed effects model. The above example is a mixed effects model.

• Examples of a fixed effect model and a random effect model

◦ Placebo and drug: Let $y_{ij}$ be the number of seizures of patient $j$ receiving treatment $i$, with $i = 1$: placebo, $i = 2$: a drug. This is a one-factor model. Consider a model $E(y_{ij}) = \mu_i$.
$\mu_i$ is the mean number of seizures expected from someone receiving treatment $i$. If we write $E(y_{ij}) = \mu + \alpha_i$, then $\mu$ is the general mean (global mean) and $\alpha_i$ is the effect on the mean number of seizures due to treatment $i$. These are fixed parameters of interest.

◦ Clinical trial: Let $y_{ij}$ be the number of seizures of patient $j$ at the $i$th clinic in the city of Seoul who is treated by a drug, say $i = 1, \cdots, 20$.
Consider a model $E(y_{ij} \mid a_i) = \mu + a_i$.
One may consider the 20 clinics as a sample of 20 clinics in Seoul. Then, $a_i$ is the random effect due to clinic $i$. We often assume that the random effects satisfy $a_i \overset{\text{i.i.d.}}{\sim} (0, \sigma_a^2)$ for all $i$, i.e. $E(a_i) = 0$, $\mathrm{var}(a_i) = \sigma_a^2$ and $\mathrm{cov}(a_i, a_{i'}) = 0$ for $i \neq i'$.

• How do random effects contribute to the variation of the data?
For simplicity, assume that
$$y_{ij} \mid a_i \overset{\text{i.i.d.}}{\sim} N(\mu + a_i, \sigma^2), \quad j = 1, \cdots, J,$$
$$a_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_a^2), \quad i = 1, \cdots, I.$$
Alternatively,
$$y_{ij} = \mu + a_i + e_{ij}, \qquad a_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_a^2), \qquad e_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \qquad a_i \perp e_{ij}.$$

By the law of total variance/covariance,
$$\mathrm{var}(y_{ij}) = \mathrm{var}(E(y_{ij} \mid a_i)) + E(\mathrm{var}(y_{ij} \mid a_i)) = \sigma_a^2 + \sigma^2,$$
$$\mathrm{cov}(y_{ij}, y_{ij'}) = \mathrm{cov}(E(y_{ij} \mid a_i), E(y_{ij'} \mid a_i)) + E(\mathrm{cov}(y_{ij}, y_{ij'} \mid a_i)) = \mathrm{cov}(\mu + a_i, \mu + a_i) = \sigma_a^2.$$
$\mathrm{cov}(y_{ij}, y_{ij'})$ is called the intra-class covariance.

◦ The intra-class correlation is
$$\frac{\mathrm{cov}(y_{ij}, y_{ij'})}{\sqrt{\mathrm{var}(y_{ij})}\sqrt{\mathrm{var}(y_{ij'})}} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2}.$$

◦ Simulated data with $I = 5$, $J = 10$, $\mu = 0$, $\sigma^2 = 1$. DATA1: intra-class correlation 0.1 ($\sigma_a^2 = 0.11$); DATA2: intra-class correlation 0.9 ($\sigma_a^2 = 9$). Sample mean (sd): DATA1 $= -0.24\ (1.03)$, DATA2 $= 0.93\ (2.45)$. A simulation sketch is given below.
[Figure: boxplots of DATA1 and DATA2 overall, and by class (X1–X5) within each data set.]
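The following is a minimal simulation sketch of the setup above (an illustration assuming NumPy; the reported numbers come from one particular random draw, so a re-run will not reproduce them exactly):

```python
import numpy as np

def simulate_one_factor(I, J, mu, sigma2, rho, rng):
    """Simulate y_ij = mu + a_i + e_ij with intra-class correlation rho."""
    sigma2_a = rho * sigma2 / (1.0 - rho)             # rho = sigma2_a / (sigma2_a + sigma2)
    a = rng.normal(0.0, np.sqrt(sigma2_a), size=I)    # class-level random effects
    e = rng.normal(0.0, np.sqrt(sigma2), size=(I, J)) # within-class errors
    return mu + a[:, None] + e                        # I x J data matrix

rng = np.random.default_rng(1)
data1 = simulate_one_factor(I=5, J=10, mu=0.0, sigma2=1.0, rho=0.1, rng=rng)  # sigma2_a = 0.11
data2 = simulate_one_factor(I=5, J=10, mu=0.0, sigma2=1.0, rho=0.9, rng=rng)  # sigma2_a = 9
for name, d in [("DATA1", data1), ("DATA2", data2)]:
    print(name, round(d.mean(), 2), round(d.std(ddof=1), 2))
```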

◦ A small intra-class correlation ($\sigma_a^2$ relatively smaller than $\sigma^2$) indicates weak clustering within the classes of the random effect, while a large intra-class correlation ($\sigma_a^2$ relatively larger than $\sigma^2$) indicates strong clustering within the classes.

12.2 Inference

◦ If we are interested in the distribution of the data, we estimate parameters. In some situations, we want to 'estimate' random effects; in this case, we use the term 'prediction' and we predict the realization of the random effects.

◦ Typically, estimation uses the likelihood (ML or REML).

◦ Testing is used for comparing levels of fixed effects (i.e. linear functions of the fixed effects) or for whether the variation due to random effects is zero or not (i.e. $\sigma_a^2 = 0$).

• Best linear unbiased estimator (BLUE): Let $Y$ be a random vector with $E(Y) = X\beta$ and $\mathrm{cov}(Y) = V$, where $X$ is a known $n \times p$ design matrix, $\beta \in \mathbb{R}^p$ and $V$ is a known non-singular covariance matrix.

A real-valued linear estimator $t^T Y$ is said to be the best linear unbiased estimator of its expectation if and only if $\mathrm{var}(t^T Y) \le \mathrm{var}(a^T Y)$ for all linear estimators with $E(a^T Y) = E(t^T Y)$.

◦ $\widehat{X\beta}^{BLUE} = X(X^T V^{-1} X)^{-} X^T V^{-1} Y$, where $(X^T V^{-1} X)^{-}$ is a g-inverse of $X^T V^{-1} X$.
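As a small numerical sketch of the BLUE formula (assuming NumPy; the design matrix, covariance and response below are made up for illustration, and `np.linalg.pinv` stands in for a g-inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # known n x p design matrix
V = 0.5 * np.ones((n, n)) + 0.5 * np.eye(n)             # known non-singular covariance
Y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), V)

Vinv = np.linalg.inv(V)
# X beta_hat^BLUE = X (X^T V^{-1} X)^- X^T V^{-1} Y
Xbeta_blue = X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ Y
print(Xbeta_blue)
```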

• Best predictor (BP): Let $Y = (y_1, \cdots, y_n)^T$ be a vector of $n$ random variables. We want to use them to predict $y_0$, where $y_0$ is also a random variable. The best predictor is the one that minimizes the mean squared error (MSE), which turns out to be the conditional mean.

◦ When the joint distribution of $y_0$ and $Y$ is available,
$$\hat{y}_0^{BP} = E(y_0 \mid Y),$$
since for any predictor $g(Y)$ of $y_0$, one can show $E(y_0 - g(Y))^2 \ge E(y_0 - E(y_0 \mid Y))^2$.

• Best linear predictor (BLP): When only means and covariances are available, let $\mathrm{cov}(Y) = V$, where $V$ is a non-singular covariance matrix, $\mathrm{cov}(y_0, Y) = v_0$, and $E(y_i) = \mu_i$, $i = 0, 1, \cdots, n$. Let $\mu = (\mu_1, \cdots, \mu_n)^T$. Then, the best linear predictor of $y_0$ is given by
$$\hat{y}_0^{BLP} = \mu_0 + \mathrm{cov}(y_0, Y)\,\mathrm{cov}(Y)^{-1}(Y - E(Y)) = \mu_0 + v_0 V^{-1}(Y - \mu).$$
(Proof) Without loss of generality, we write an arbitrary linear predictor as $g(Y) = a + b^T(Y - \mu)$. Then, we have
$$E(y_0 - g(Y))^2 = E(y_0 - a - b^T(Y - \mu))^2 = E(y_0 - \mu_0 - b^T(Y - \mu) + (\mu_0 - a))^2 = E(y_0 - \mu_0 - b^T(Y - \mu))^2 + (\mu_0 - a)^2.$$
For unbiasedness, $a = \mu_0$. For simplicity, we further assume $\mu_0 = 0$, $\mu = 0$. Then,
$$E(y_0 - g(Y))^2 = E(y_0 - b^T Y)^2 = E(y_0 - v_0 V^{-1} Y + v_0 V^{-1} Y - b^T Y)^2 = E(y_0 - v_0 V^{-1} Y)^2 + E(v_0 V^{-1} Y - b^T Y)^2 \ge E(y_0 - v_0 V^{-1} Y)^2.$$
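As a quick special case (worked out here for illustration, not part of the original notes), take $n = 1$ with a single variable $y_1$; the BLP then reduces to the familiar regression-toward-the-mean formula
$$\hat{y}_0^{BLP} = \mu_0 + \frac{\mathrm{cov}(y_0, y_1)}{\mathrm{var}(y_1)}(y_1 - \mu_1) = \mu_0 + \rho\,\frac{\sigma_0}{\sigma_1}(y_1 - \mu_1),$$
where $\sigma_0^2 = \mathrm{var}(y_0)$, $\sigma_1^2 = \mathrm{var}(y_1)$ and $\rho$ is the correlation between $y_0$ and $y_1$.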

• Best linear unbiased predictor (BLUP): When the covariance structure is known, $a_0 + a^T Y$ is the BLUP of $y_0$ if $a_0 + a^T Y$ is unbiased (i.e. $E(a_0 + a^T Y) = E(y_0)$) and, for any other linear predictor $b_0 + b^T Y$, $E(y_0 - a_0 - a^T Y)^2 \le E(y_0 - b_0 - b^T Y)^2$.

◦ First assume the following model:
$$\begin{pmatrix} Y \\ y_0 \end{pmatrix} \sim \left( \begin{pmatrix} X\beta \\ x_0\beta \end{pmatrix},\ \mathrm{Cov} \right)$$
and $x_0 = c^T X$ for some $c$, where $X$ and $\mathrm{Cov}$ are known, $\mathrm{cov}(Y)$ is non-singular and $\beta$ is unknown. Then,
$$\hat{y}_0^{BLUP} = x_0\hat{\beta}^{BLUE} + \mathrm{cov}(y_0, Y)\,\mathrm{cov}(Y)^{-1}\big(Y - X\hat{\beta}^{BLUE}\big).$$
Let $\mathrm{cov}(Y) = V$, $\mathrm{cov}(y_0, Y) = v_0$ and $L_\beta(Y) = x_0\beta + v_0 V^{-1}(Y - X\beta)$.
For any arbitrary linear unbiased predictor $a^T Y$ of $y_0$, one can show
$$E(y_0 - a^T Y)^2 = E(y_0 - L_\beta(Y))^2 + E(L_\beta(Y) - a^T Y)^2 \ge E(y_0 - L_\beta(Y))^2.$$

• One can use an EM algorithm to estimate the parameters in a linear mixed effects model by considering the random effects as latent (missing) variables.

12.3 Linear mixed effects model

• Linear mixed effects model: A general linear mixed effects model is of the following form:
$$Y = X\beta + Z\gamma + e,$$
where $X, Z$ are known design matrices, $\beta$ is an unobservable vector of fixed effects, and $\gamma$ is an unobservable vector of random effects with $E(\gamma) = 0$, $\mathrm{cov}(\gamma) = D$, $\mathrm{cov}(\gamma, e) = 0$ and $\mathrm{cov}(e) = R$.

• $\mathrm{cov}(Y) = V = \mathrm{cov}(Z\gamma + e) = ZDZ^T + R$ and
$$\begin{pmatrix} Y \\ \gamma \end{pmatrix} \sim \left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix},\ \begin{pmatrix} ZDZ^T + R & ZD \\ DZ^T & D \end{pmatrix} \right).$$

• Example: Consider math test scores of students by gender in 4 ninth-grade classes of fifteen high schools. Let $y_{tijk}$ be the math test score of the $k$th student in gender group $t$ in the $j$th class at the $i$th school, $i = 1, \cdots, 15$, $j = 1, \cdots, 4$, $t = 1, 2$ and $k = 1, \cdots, n_{tij}$.

◦ We can consider three sources of variability in the data:

1. among schools ($s_i$)

2. among classes within each school ($c_{ij}$)

3. among students within each class, plus some unexplained variability ($e_{tijk}$)

One plausible model is
$$y_{tijk} = \beta_t + s_i + c_{ij} + e_{tijk},$$
where the $\beta_t$ are fixed effects and $s_i$, $c_{ij}$ are random effects.

◦ Let $\gamma_1 = (s_1, \cdots, s_{15})^T$ and $\gamma_2 = (c_{1,1}, \cdots, c_{1,4}, c_{2,1}, \cdots, c_{15,4})^T$. Then, $\gamma = (\gamma_1^T, \gamma_2^T)^T$, which consists of $15 + 15 \times 4 = 75$ random effects. $Z$ is also partitioned as $Z = [Z_1, Z_2]$ such that $Z\gamma = Z_1\gamma_1 + Z_2\gamma_2$.

◦ We need to further restrict the structure of $D$ and $R$ so that they can be estimated. For example, $D$ can be simplified by assuming $\mathrm{cov}(\gamma_1) = \sigma_s^2 I_{15}$, $\mathrm{cov}(\gamma_2) = \sigma_c^2 I_{60}$ and $\mathrm{cov}(\gamma_1, \gamma_2) = 0$. For $R$, one can assume $R = \sigma^2 I_N$, where $N$ is the total number of observations. Then, $\mathrm{cov}(Y) = ZDZ^T + R = \sigma_s^2 Z_1 Z_1^T + \sigma_c^2 Z_2 Z_2^T + \sigma^2 I$.
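A small sketch of how $\mathrm{cov}(Y) = \sigma_s^2 Z_1 Z_1^T + \sigma_c^2 Z_2 Z_2^T + \sigma^2 I$ could be assembled in code (assuming NumPy; for simplicity two students per class are used instead of the general $n_{tij}$, and the variance components are made-up values):

```python
import numpy as np

n_school, n_class, n_per_class = 15, 4, 2
N = n_school * n_class * n_per_class                      # 120 observations

# school and class index of each observation
school = np.repeat(np.arange(n_school), n_class * n_per_class)
klass = np.repeat(np.arange(n_school * n_class), n_per_class)

Z1 = np.zeros((N, n_school));           Z1[np.arange(N), school] = 1.0  # school effects
Z2 = np.zeros((N, n_school * n_class)); Z2[np.arange(N), klass] = 1.0   # class effects

sigma2_s, sigma2_c, sigma2 = 4.0, 2.0, 1.0                # assumed variance components
V = sigma2_s * Z1 @ Z1.T + sigma2_c * Z2 @ Z2.T + sigma2 * np.eye(N)
print(V.shape)                                            # (120, 120)
```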

• Variance component model: In the linear mixed effects model $Y = X\beta + Z\gamma + e$, consider $Z\gamma = \sum_{l=1}^{r} Z_l\gamma_l$ such that $\mathrm{cov}(\gamma) = D = \mathrm{diag}\{\sigma_l^2 I_{q_l}\}_{l=1}^{r}$ and $V = \mathrm{cov}(Y) = \sum_{l=1}^{r}\sigma_l^2 Z_l Z_l^T + R = \sum_{l=1}^{r}\sigma_l^2 Z_l Z_l^T + \sigma_0^2 I_N$, where $N$ is the total number of observations. Such a simplified linear mixed effects model is called a variance component model. $V$ can also be written as $V = \sum_{l=0}^{r}\sigma_l^2 Z_l Z_l^T$ with $Z_0 = I_N$.

• Estimation of a variance component model

◦ ML method
The log-likelihood for a variance component model is
$$\ell = -\frac{1}{2}(Y - X\beta)^T V^{-1}(Y - X\beta) - \frac{1}{2}\log|V| - \frac{N}{2}\log(2\pi).$$
Then, we have
$$\frac{\partial \ell}{\partial \beta} = -X^T V^{-1} X\beta + X^T V^{-1} Y,$$
$$\frac{\partial \ell}{\partial \sigma_l^2} = -\frac{1}{2}(Y - X\beta)^T \frac{\partial V^{-1}}{\partial \sigma_l^2}(Y - X\beta) - \frac{1}{2}\frac{\partial}{\partial \sigma_l^2}\log|V| = \frac{1}{2}(Y - X\beta)^T V^{-1} Z_l Z_l^T V^{-1}(Y - X\beta) - \frac{1}{2}\mathrm{tr}(V^{-1} Z_l Z_l^T)$$
for $l = 0, \cdots, r$. Then, the likelihood estimating equations are
$$X^T V^{-1} X\beta = X^T V^{-1} Y,$$
$$\mathrm{tr}(V^{-1} Z_l Z_l^T) = (Y - X\beta)^T V^{-1} Z_l Z_l^T V^{-1}(Y - X\beta) = Y^T P Z_l Z_l^T P Y,$$
for $l = 0, \cdots, r$, where $P = V^{-1} - V^{-1}X(X^T V^{-1} X)^{-} X^T V^{-1}$, and
$$X\hat{\beta}_{ML} = X(X^T V^{-1} X)^{-} X^T V^{-1} Y.$$
When $Z_l = 0$ for $l = 1, \cdots, r$, $V = \sigma_0^2 I$ and $\hat{\sigma}_0^2 = \frac{1}{N}\mathrm{SSE}$.

◦ Restricted maximum likelihood (REML) method
The idea is to use the likelihood of the residuals so that we deal with a likelihood involving only the variance component parameters.
First, we find $B$ such that
$$B^T X = 0,$$
$B^T$ is an $(N - r) \times N$ full row rank matrix, where $r = \mathrm{rank}(X)$, and $B^T V B$ is nonsingular.

Also note that $B(B^T V B)^{-1} B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$, which is invariant to the choice of $B$.

Then, the log-likelihood of $B^T Y$ satisfies
$$\ell_R \propto -\frac{1}{2} Y^T B(B^T V B)^{-1} B^T Y - \frac{1}{2}\log|B^T V B|,$$
$$\frac{\partial \ell_R}{\partial \sigma_l^2} = \frac{1}{2} Y^T B(B^T V B)^{-1} B^T Z_l Z_l^T B(B^T V B)^{-1} B^T Y - \frac{1}{2}\mathrm{tr}\big((B^T V B)^{-1} B^T Z_l Z_l^T B\big),$$
for $l = 0, \cdots, r$.

Let $P = B(B^T V B)^{-1} B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$. Then, the likelihood estimating equations become
$$\mathrm{tr}(P Z_l Z_l^T) = Y^T P Z_l Z_l^T P Y$$
for $l = 0, \cdots, r$.

When $Z_l = 0$ for $l = 1, \cdots, r$, $V = \sigma_0^2 I$ and $P = I - X(X^T X)^{-1} X^T$, so that $\hat{\sigma}_{0R}^2 = \frac{1}{N - p}\mathrm{SSE}$.
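A numerical sanity check of the identity $B(B^T V B)^{-1}B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$ (a sketch assuming NumPy/SciPy; any $B$ whose columns span the null space of $X^T$ satisfies $B^T X = 0$):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
N, p = 8, 3
X = rng.normal(size=(N, p))                      # full column rank with probability one
A = rng.normal(size=(N, N))
V = A @ A.T + N * np.eye(N)                      # a positive definite covariance matrix

B = null_space(X.T)                              # N x (N - p), columns satisfy X^T b = 0
Vinv = np.linalg.inv(V)
P1 = B @ np.linalg.inv(B.T @ V @ B) @ B.T
P2 = Vinv - Vinv @ X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv
print(np.allclose(P1, P2))                       # True
```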

• BLUE and BLUP in a linear mixed effects model: We are interested in finding $\widehat{X\beta}^{BLUE}$ and $\widehat{Z\gamma}^{BLUP}$.

◦ Mixed model equation (MME)
$\widehat{X\beta}^{BLUE}$ and $\widehat{Z\gamma}^{BLUP}$ are obtained by minimizing
$$(Y - X\beta - Z\gamma)^T R^{-1}(Y - X\beta - Z\gamma) + \gamma^T D^{-1}\gamma,$$
which yields the following MME:
$$\begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & D^{-1} + Z^T R^{-1} Z \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = \begin{pmatrix} X^T R^{-1} Y \\ Z^T R^{-1} Y \end{pmatrix}.$$
A solution to the MME is
$$\hat{\beta} = (X^T V^{-1} X)^{-} X^T V^{-1} Y, \qquad \hat{\gamma} = D Z^T V^{-1}(Y - X\hat{\beta}).$$

◦ Theorem: If $\hat{\beta}$ and $\hat{\gamma}$ are the solution to the MME, then $X\hat{\beta}$ is the BLUE of $X\beta$ and $Z\hat{\gamma}$ is the BLUP of $Z\gamma$.

(Proof) Using $(I + uv)^{-1} = I - u(I + vu)^{-1}v$,
$$V^{-1} = (ZDZ^T + R)^{-1} = (R^{-1}ZDZ^T + I)^{-1}R^{-1} = \big(I - R^{-1}Z(I + DZ^T R^{-1} Z)^{-1} D Z^T\big) R^{-1} = \big(I - R^{-1}Z(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T\big) R^{-1},$$
$$DZ^T(ZDZ^T + R)^{-1} = DZ^T R^{-1} - DZ^T R^{-1} Z(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} = \big(D(D^{-1} + Z^T R^{-1} Z) - DZ^T R^{-1} Z\big)(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} = (D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1}.$$
The second identity shows that the $\gamma$-equation of the MME, $\hat{\gamma} = (D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1}(Y - X\hat{\beta})$, is the same as $\hat{\gamma} = DZ^T V^{-1}(Y - X\hat{\beta})$ above.
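A small numerical sketch (assuming NumPy; the design, $D$, $R$ and response are made up) checking that the MME solution matches $\hat{\beta} = (X^TV^{-1}X)^{-}X^TV^{-1}Y$ and $\hat{\gamma} = DZ^TV^{-1}(Y - X\hat{\beta})$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 12, 2, 3
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.zeros((N, q)); Z[np.arange(N), rng.integers(0, q, size=N)] = 1.0  # one random factor
D, R = 2.0 * np.eye(q), 1.0 * np.eye(N)
Y = rng.normal(size=N)

# Mixed model equations
Rinv, Dinv = np.linalg.inv(R), np.linalg.inv(D)
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Dinv + Z.T @ Rinv @ Z]])
rhs = np.concatenate([X.T @ Rinv @ Y, Z.T @ Rinv @ Y])
beta_mme, gamma_mme = np.split(np.linalg.solve(lhs, rhs), [p])

# Direct BLUE/BLUP formulas
V = Z @ D @ Z.T + R
Vinv = np.linalg.inv(V)
beta = np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ Y
gamma = D @ Z.T @ Vinv @ (Y - X @ beta)
print(np.allclose(beta_mme, beta), np.allclose(gamma_mme, gamma))   # True True
```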

• Example: In the one-factor model $y_{ij} = \mu + a_i + e_{ij}$,
$$\hat{a}_i = E(a_i \mid Y) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/n_i}(\bar{y}_{i\cdot} - \mu),$$
where $n_i$ is the number of observations in the $i$th level of the factor and $\bar{y}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ is the sample mean of the observations in the $i$th level of the factor.

◦ Note that we do not know $\sigma^2$, $\sigma_a^2$ and $\mu$. A naive remedy is to replace them with estimators (e.g. the MLE):
$$\tilde{a}_i = \frac{\hat{\sigma}_a^2}{\hat{\sigma}_a^2 + \hat{\sigma}^2/n_i}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}).$$
If $a_i$ were a fixed effect parameter,
$$\hat{a}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}$$
with the constraint $\sum_i n_i\hat{a}_i = 0$.
Note that the additional factor $\dfrac{\hat{\sigma}_a^2}{\hat{\sigma}_a^2 + \hat{\sigma}^2/n_i} < 1$, so $\tilde{a}_i$ is $\hat{a}_i$ shrunk toward zero.
If $\hat{\sigma}_a^2$ is relatively large, $\tilde{a}_i \approx \hat{a}_i$: a large variation between classes does not produce much shrinkage. On the other hand, a relatively small $\hat{\sigma}_a^2$ makes $|\tilde{a}_i| \ll |\hat{a}_i|$ (more shrinkage).
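A tiny numerical sketch of the shrinkage effect (assuming NumPy; the class means, class sizes and variance components below are made up, and the variance components are treated as known rather than estimated):

```python
import numpy as np

ybar_i = np.array([1.2, -0.5, 0.3])       # class means  ybar_{i.}
n_i = np.array([4, 4, 20])                # class sizes  n_i
ybar = np.average(ybar_i, weights=n_i)    # grand mean   ybar_{..}

sigma2, sigma2_a = 1.0, 0.25              # treated as known for illustration
shrink = sigma2_a / (sigma2_a + sigma2 / n_i)
a_fixed = ybar_i - ybar                   # fixed-effect estimate  a_hat
a_pred = shrink * a_fixed                 # shrunken predictor     a_tilde
print(shrink.round(2))                    # larger n_i => factor closer to 1 (less shrinkage)
print(a_pred.round(2))
```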
Chapter 13

Generalized linear model

13.1 Three components of the model assumption in GLM

GLM generalizes the linear model by extending it to nonnormal outcomes and nonlinear relationships. Model specification requires three elements, namely, the random component, the systematic component and the link function.

• Random component

◦ $Y$ has a distribution in the exponential family, taking the form
$$f_Y(Y; \theta, \phi) = \exp\{(Y\theta - b(\theta))/\phi + c(Y, \phi)\}.$$

◦ Check $\theta$ and $b(\theta)$ for the normal, binomial, Poisson, and Gamma distributions.

◦ The mean and variance of an exponential family distribution can be obtained by differentiating the equation $\int f(y; \theta, \phi)\,dy = 1$, yielding $E(Y) = b'(\theta)$ and $\mathrm{Var}(Y) = b''(\theta)\phi$ (see the Poisson example below).

◦ The mean and variance functions uniquely determine the distribution within the exponential family. Higher moments are also functions of $b(\theta)$.
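For example (a check worked out here for the Poisson case): writing the Poisson($\mu$) mass function in the exponential family form gives
$$f(y; \theta) = \exp\{y\log\mu - \mu - \log y!\} = \exp\{(y\theta - b(\theta))/\phi + c(y, \phi)\} \quad\text{with}\quad \theta = \log\mu,\ b(\theta) = e^{\theta},\ \phi = 1,$$
so that $E(Y) = b'(\theta) = e^{\theta} = \mu$ and $\mathrm{Var}(Y) = b''(\theta)\phi = \mu$.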

• Systematic component

◦ $E(Y \mid X)$, say $\mu$, is related to $X$ through a linear combination of $X$:
$$\eta_i = \sum_{j=1}^{p} X_{ij}\beta_j.$$
Counterexample (not a linear predictor): $\exp(\beta_0 + X_i^{\beta_1})$.

• Link function: The link function relates the linear predictor and the conditional mean, i.e., $\eta = g(\mu)$. $g(\cdot)$ is called the link function and should be twice differentiable. Link functions determine the interpretation of the coefficients (e.g., logit link, log link).

◦ Examples of link functions for binary regression are the logit, probit, and complementary log-log functions.

◦ The canonical link function is the link function for which $\theta = \eta$.

13.2 Likelihood Inference on GLM

• The estimator for $\beta$ is obtained by maximizing
$$L(\beta(\theta), \phi) = \prod_{i=1}^{n} f(y_i \mid \theta_i, \phi) = \exp\Big(\sum_{i=1}^{n}\big[\{y_i\theta_i - b(\theta_i)\}/\phi + c(y_i, \phi)\big]\Big),$$
$$\log L(\beta(\theta), \phi) = \sum_{i=1}^{n}\big[\{y_i\theta_i - b(\theta_i)\}/\phi + c(y_i, \phi)\big],$$
where $\theta^T = (\theta_1, \theta_2, \cdots, \theta_n)$. Until we specify otherwise, assume that $\phi$ is known.

• Score function

The score function of the GLM has the form
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n}\frac{\partial \eta_i}{\partial \beta}\frac{\partial \theta_i}{\partial \eta_i}\frac{\partial}{\partial \theta_i}\log L(\beta, \phi) = \sum_{i=1}^{n}\frac{\partial \eta_i}{\partial \beta}\frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial}{\partial \theta_i}\log L(\beta, \phi),$$
where
$$\frac{\partial \eta_i}{\partial \beta} = X_i^T, \qquad \frac{\partial \mu_i}{\partial \eta_i} = 1/g'(\mu_i), \qquad \frac{\partial \theta_i}{\partial \mu_i} = 1/b''(\theta_i),$$
$$\frac{\partial \theta_i}{\partial \eta_i} = 1\Big/\frac{\partial \eta_i}{\partial \theta_i} = 1/[g'\{b'(\theta_i)\}b''(\theta_i)] = 1/\{g'(\mu_i)b''(\theta_i)\},$$
$$\frac{\partial}{\partial \theta_i}\log L(\beta, \phi) = \frac{Y_i - b'(\theta_i)}{\phi} = \frac{Y_i - \mu_i}{\phi}.$$
Hence,
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n} X_i^T\,\frac{1}{g'(\mu_i)b''(\theta_i)}\,\frac{Y_i - b'(\theta_i)}{\phi} = \sum_{i=1}^{n} X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}(Y_i - \mu_i).$$

Note: If $\theta_i = \eta_i$, i.e., the canonical link, $\frac{\partial \theta_i}{\partial \eta_i} = 1$ and the score function is
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n} X_i^T(Y_i - \mu_i)/\phi.$$

• Information

The second derivative of the negative log-likelihood function is the observed information:
$$I(\beta) = \sum_{i=1}^{n} -\frac{\partial^2 \log f(Y_i; \theta, \phi)}{\partial \beta\,\partial \beta^T}.$$
The expected value of $I(\beta)$, say $i(\beta)$, is called the expected information or Fisher information. Hence,
$$I(\beta) = -\frac{\partial}{\partial \beta^T}U(\beta) = -\sum_{i=1}^{n}\Big[ X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\frac{\partial}{\partial \beta^T}(Y_i - \mu_i) + X_i^T\Big[\frac{\partial}{\partial \beta^T}\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\Big](Y_i - \mu_i)\Big]$$
$$= \sum_{i=1}^{n}\Big[ X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\frac{\partial \mu_i}{\partial \beta^T} - X_i^T\Big[\frac{\partial}{\partial \beta^T}\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\Big](Y_i - \mu_i)\Big],$$
where
$$\frac{\partial \mu_i}{\partial \beta^T} = \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta^T} = \{g'(\mu_i)\}^{-1} X_i,$$
and
$$i(\beta) = E\{I(\beta)\} = \sum_{i=1}^{n} X_i^T\, g'(\mu_i)^{-1}\mathrm{Var}(Y_i)^{-1} g'(\mu_i)^{-1} X_i.$$
i =1

For the canonical link,
$$I(\beta) = \sum_{i=1}^{n} X_i^T b''(\theta_i) X_i/\phi = i(\beta),$$
i.e., the log-likelihood function is strictly concave.

• Computation of the MLE

In general, the equation $U(\theta) = 0$ does not have a closed form solution. Then one should solve the equation iteratively. Starting from an initial value $\theta^{(0)}$, one can iteratively update the estimate by
$$\theta^{(p+1)} = \theta^{(p)} + I(\theta^{(p)})^{-1}U(\theta^{(p)}),$$
$p = 0, 1, \cdots$. When the update becomes very small, i.e., $|\theta^{(p+1)} - \theta^{(p)}| < c$, where $c$ is a pre-specified small value, for example $c = 10^{-8}$, the iteration is stopped and $\theta^{(p+1)}$ is declared the solution. This computational algorithm is called the Newton-Raphson algorithm.

• Iteratively reweighted least squares estimate

Motivation: the score function of the GLM has the form of a weighted linear regression. Can we obtain the MLE as a weighted least squares estimator?

Consider the following 'pseudo dependent variable':
$$z_i = \eta_i + \frac{\partial \eta_i}{\partial \mu_i}(Y_i - \mu_i).$$
Then
$$E(z_i) = \eta_i = X_i\beta, \qquad \mathrm{Var}(z_i) = \Big[\frac{\partial \eta_i}{\partial \mu_i}\Big]^2\mathrm{Var}(Y_i) = g'(\mu_i)^2\,\mathrm{Var}(Y_i).$$

Then we can obtain the following weighted least squares estimator:
$$\hat{\beta} = \Big(\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}z_i\Big).$$
Note that both $z$ and $\mathrm{Var}(z)$ are also functions of $\beta$. Start with $\beta^{(0)}$ and update:
$$\begin{aligned}
\beta^{(p+1)} &= \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big[\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}\{\eta_i + g'(\mu_i)(Y_i - \mu_i)\}\Big]\\
&= \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big[\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}X_i\beta^{(p)} + \sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}g'(\mu_i)(Y_i - \mu_i)\Big]\\
&= \beta^{(p)} + \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big(\sum_{i=1}^{n} X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}(Y_i - \mu_i)\Big)\\
&= \beta^{(p)} + i(\beta^{(p)})^{-1}U(\beta^{(p)}),
\end{aligned}$$
where the quantities on the right-hand side are evaluated at $\beta^{(p)}$.

• We can tell that iterative reweighting leads to the MLE since it is numerically equivalent to the Newton-Raphson algorithm (with the observed information replaced by the expected information, i.e., Fisher scoring).
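A minimal IRLS sketch for logistic regression (assuming NumPy; this illustrates the update above and is not production code):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))           # inverse logit
        w = mu * (1.0 - mu)                       # 1 / Var(z_i), since phi = 1
        z = eta + (y - mu) / w                    # pseudo dependent variable
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(3)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x))))
print(irls_logistic(X, y))                        # roughly (0.5, 1.0) up to sampling error
```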

• Summary of results

1. $E\{U(\beta)\} = 0$, $\mathrm{Var}\{U(\beta)\} = i(\beta)$

2. $n^{-1/2}U(\beta) \sim N(0, n^{-1}i(\beta))$

3. $\hat{\beta}$ is consistent.

4. $n^{1/2}(\hat{\beta} - \beta_0) \sim N(0, n\,i(\beta)^{-1})$

13.3 Example. Binary regression, canonical link

Assume $Y_i \sim B(1, \mu_i)$ independently. The likelihood function is
$$L(\beta, Y) = \prod_{i=1}^{n}\mu_i^{Y_i}(1 - \mu_i)^{1 - Y_i},$$
where $Y = (Y_1, Y_2, \cdots, Y_n)^T$,
$$\eta_i = \log\frac{\mu_i}{1 - \mu_i}, \qquad E(Y_i \mid x_i) = \mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)},$$
$\beta = (\beta_0, \beta_1)^T$, and $\eta_i = \beta_0 + \beta_1 x_i$. The interpretation of $\beta_1$ is the log odds ratio,
$$\beta_1 = \log\frac{P(Y_i = 1 \mid x_i = 1)\,P(Y_i = 0 \mid x_i = 0)}{P(Y_i = 1 \mid x_i = 0)\,P(Y_i = 0 \mid x_i = 1)}.$$

The score function is
$$\sum_{i=1}^{n}\begin{pmatrix}1\\ x_i\end{pmatrix}(Y_i - \mu_i)$$
and the observed and expected information matrices are the same,
$$\sum_{i=1}^{n}\begin{pmatrix}1\\ x_i\end{pmatrix}\mu_i(1-\mu_i)\begin{pmatrix}1 & x_i\end{pmatrix} = \begin{pmatrix}\sum_{i=1}^{n}\mu_i(1-\mu_i) & \sum_{i=1}^{n}x_i\mu_i(1-\mu_i)\\ \sum_{i=1}^{n}x_i\mu_i(1-\mu_i) & \sum_{i=1}^{n}x_i^2\mu_i(1-\mu_i)\end{pmatrix}.$$
Note that, unlike linear regression, the information is a function of $\beta$. The adjusted dependent variable is
$$z_i = \eta_i + \frac{\partial \eta_i}{\partial \mu_i}(Y_i - \mu_i) = \beta_0 + \beta_1 x_i + \{\mu_i(1-\mu_i)\}^{-1}(Y_i - \mu_i),$$
$$\frac{\partial \eta_i}{\partial \mu_i} = \frac{\partial \eta_i}{\partial \theta_i}\frac{\partial \theta_i}{\partial \mu_i} = \{\mu_i(1-\mu_i)\}^{-1}.$$

• Difference between GLM and nonlinear regression: Generalized linear models are nonlinear models, but a nonlinear regression model in the usual sense has the following form:
$$Y_i = f(X_i, \beta) + e_i,$$
and the distributional assumption is imposed on $e_i$. To estimate $\beta$, the sum of squared errors $\sum_{i=1}^{n} e_i^2$, a weighted sum of squared errors $\sum_{i=1}^{n} w_i e_i^2$ (with known or unknown weights), or maximum likelihood can be used.

• Model checking is challenging due to non-identically distributed errors.

• Hypothesis testing can be conducted using the Wald test, the score test, and the likelihood ratio test.

• Score statistic

The score statistic is based on the null distribution of the score function. In testing the entire vector,
$$H_0: \beta = \beta_0,$$
the score statistic is
$$T_S = U(\beta_0)^T\,\mathrm{Var}\{U(\beta_0)\}^{-1}\,U(\beta_0).$$
The test statistic $T_S$ is asymptotically chi-squared distributed with $p$ degrees of freedom. Note that the score function is evaluated at the null value.

◦ Now, in the presence of nuisance parameters, consider
$$H_0: \beta_1 = \beta_{10},$$
where $\beta_1$ is the $q \times 1$ parameter of interest. Consider a partition of the score function
$$U(\beta) = \begin{pmatrix} U_1(\beta_1, \beta_2) \\ U_2(\beta_1, \beta_2) \end{pmatrix},$$
where $U_1(\beta_1, \beta_2)$ is the $q$-variate sub-vector of the score function $U(\beta)$ corresponding to $\beta_1$.

The score statistic is
$$T_S = U_1(\beta_{10}, \tilde{\beta}_2)^T\,\mathrm{Var}\{U_1(\beta_{10}, \tilde{\beta}_2)\}^{-1}\,U_1(\beta_{10}, \tilde{\beta}_2),$$
where $\tilde{\beta}_2$ is the solution of $U_2(\beta_{10}, \beta_2) = 0$, i.e., the MLE of $\beta_2$ under the restriction $\beta_1 = \beta_{10}$.

Consider a similar partition for $i(\beta)$:
$$i(\beta) = -E\begin{pmatrix} \dfrac{\partial U_1}{\partial \beta_1^T} & \dfrac{\partial U_1}{\partial \beta_2^T} \\[6pt] \dfrac{\partial U_2}{\partial \beta_1^T} & \dfrac{\partial U_2}{\partial \beta_2^T} \end{pmatrix} = \begin{pmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{pmatrix}.$$
It can be easily verified that $i_{21} = i_{12}^T$.

To obtain the variance of $U_1(\beta_{10}, \tilde{\beta}_2)$, we can consider the following expansion:
$$U_1(\beta_{10}, \tilde{\beta}_2) \approx U_1(\beta_{10}, \beta_{20}) + \Big[\frac{\partial}{\partial \beta_2^T}U_1(\beta_1, \beta_2)\Big](\tilde{\beta}_2 - \beta_{20}) = U_1(\beta_{10}, \beta_{20}) - i_{12}\,i_{22}^{-1}\,U_2(\beta_{10}, \beta_{20}).$$
Then
$$\mathrm{Var}\{U_1(\beta_{10}, \tilde{\beta}_2)\} = i_{11} - i_{12}\,i_{22}^{-1}\,i_{21}.$$
The test statistic $T_S$ is distributed as chi-squared with $q$ degrees of freedom.

One weakness is that the computation of the score statistic requires a specialized program.

◦ Example

Consider a simple logistic regression where $H_0: \beta_1 = 0$ and $\beta_0$ is a nuisance parameter. Note that $\tilde{\beta}_0 = \log\dfrac{Y_+}{n - Y_+}$ and $\tilde{\mu}_i = \tilde{\mu} = \dfrac{\exp(\tilde{\beta}_0)}{1 + \exp(\tilde{\beta}_0)}$, where $Y_+ = \sum_{i=1}^{n} Y_i$. We can obtain
$$U_1(\tilde{\beta}_0, 0) = \sum_{i=1}^{n} X_i\Big[Y_i - \frac{\exp(\tilde{\beta}_0)}{1 + \exp(\tilde{\beta}_0)}\Big],$$
$$\mathrm{Var}\{U_1(\tilde{\beta}_0, 0)\} = \sum_{i=1}^{n} X_i^2\tilde{\mu}_i(1 - \tilde{\mu}_i) - \Big\{\sum_{i=1}^{n} X_i\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}\Big\{\sum_{i=1}^{n}\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}^{-1}\Big\{\sum_{i=1}^{n} X_i\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}.$$
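A sketch of this score test in code (assuming NumPy; a scalar covariate $X_i$ and the hypothesis $H_0: \beta_1 = 0$):

```python
import numpy as np

def score_test_beta1(x, y):
    """Score (Rao) test of H0: beta_1 = 0 in logistic regression with an intercept."""
    mu = np.full(len(y), y.mean())            # restricted MLE: mu_tilde = Y_+ / n
    w = mu * (1.0 - mu)
    u1 = np.sum(x * (y - mu))                 # U_1(beta_0_tilde, 0)
    var_u1 = np.sum(x**2 * w) - np.sum(x * w) ** 2 / np.sum(w)
    return u1**2 / var_u1                     # compare to chi-squared with 1 df

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x))))
print(score_test_beta1(x, y))
```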
