
Chapter 12

Linear mixed effect model

12.1 Fixed effects and random effects

• When we have observations $y_1, \cdots, y_n$, we are mainly interested in the 'average', i.e. $E(y_i) = \mu_i$, and the 'variation', i.e. $\mathrm{var}(y_i) = \sigma_i^2$. The average and the variation can be further factored out when additional information is supplied.

• For example, suppose the observations actually have a structure so that $y_{11}, \cdots, y_{1m}$ are observations from females and $y_{21}, \cdots, y_{2m}$ are observations from males, with $2m = n$. Then, we can factor out the mean such as $E(y_{ij}) = \mu + \alpha_i$ for $i = 1, 2$ and $j = 1, \cdots, m$. Here $\alpha_i$ can be interpreted as the average contribution by 'gender'. We call $E(y_{ij}) = \mu + \alpha_i$ a model or a mean model.

◦ When the model is linear in the parameters, we call it a linear model (LM).

◦ When the model is not necessarily linear, we call it a generalized linear model (GLM).

◦ In LM and GLM, the parameters in the 'mean' are unknown constants. Depending on the situation, they can be random. Then, the model is called a linear mixed model (LMM) or a generalized linear mixed model (GLMM).

• Non-random parameters are used to attribute the contribution to the mean from the additional information. Random parameters are used to attribute the contribution to the variation from the additional information.



• What is additional information? Additional information (covariates or predictors) can be the usual covariates (continuous, discrete) or categories (classifications). The second type is often used for attributing variation.

◦ Categorical variates are also called factors. Factors are groupings that divide the data accordingly (e.g. gender).

◦ The individual classes in a factor are called 'levels' (e.g. female and male in 'gender').

◦ A unit categorized by a combination of factor levels is called a 'cell'. Consider a model with two factors: gender (female, male) and race (white, non-white). Each factor has two levels. Then there are 2 × 2 = 4 cells.

◦ Nested/crossed factors
Nested factor example: Consider students' GPA by gender for each section in English and Math courses. There are three sections (A, B, C). Then, we have three factors (gender, section, subject). Within each subject, there are three sections. The section factor is nested within the subject factor since section A in the English class has no connection to section A in the Math class.

Crossed factor example: Consider cancer rates by gender for each of 16 age groups in white and non-white groups. There are three factors (age group, race, gender). The 16 age groups in white and non-white groups are the same. The age group and race factors are said to be crossed.

• The effect of a factor is its contribution to the mean and/or the variation. There are two kinds of effects: fixed effects and random effects.

◦ The effect of a factor is called a fixed effect when the parameters associated with the factor are fixed constants. Fixed effects contribute to the mean. In this case, the corresponding factor has a finite number of levels. When a continuous covariate is considered and the corresponding coefficient is a fixed constant, it is also a fixed effect parameter.

◦ The effect of a factor is called a random effect when the parameters associated with the factor are random. Random effects contribute to the variation. In this case, we regard the available levels (a finite number of levels) of the factor as a sample of levels from an infinite set of levels.

• Example: Let $y_{ijk}$ be the volume of the $k$th loaf of bread in the $j$th batch at the $i$th temperature level, $i = 1, 2, 3$, $j = 1, \cdots, 6$ and $k = 1, \cdots, 4$. A researcher wants to investigate the changes in the volume of bread at different temperatures. There are three different temperatures. At each temperature level, four loaves of bread in each of 6 batches are baked.

◦ There are two factors: the temperature factor (3 levels) and the batch factor (6 levels). We use fixed effects for the temperature factor and random effects for the batch factor. One can argue that the effect of the temperature level contributes to the mean of the volume, while the effect of the batch level contributes to the variation of the volume.

◦ When both fixed effects and random effects are considered in the model, it is called a mixed effects model. The above example is a mixed effects model.

• Examples of a fixed effect model and a random effect model

◦ Placebo and drug: Let $y_{ij}$ be the number of seizures of patient $j$ receiving treatment $i$, with $i = 1$: placebo, $i = 2$: a drug. This is a one-factor model. Consider a model $E(y_{ij}) = \mu_i$.
$\mu_i$ is the mean number of seizures expected from someone receiving treatment $i$. If we write $E(y_{ij}) = \mu + \alpha_i$, then $\mu$ is the general mean (global mean) and $\alpha_i$ is the effect on the mean number of seizures due to treatment $i$. These are fixed parameters of interest.

◦ Clinical trial: Let $y_{ij}$ be the number of seizures of patient $j$ at the $i$th clinic in the city of Seoul who is treated by a drug, say $i = 1, \cdots, 20$.
Consider a model $E(y_{ij} \mid a_i) = \mu + a_i$.
One may consider the 20 clinics as a sample of 20 clinics in Seoul. Then, $a_i$ is the random effect due to clinic $i$. We often assume that the random effects satisfy $a_i \overset{\text{i.i.d.}}{\sim} (0, \sigma_a^2)$ for all $i$, i.e. $E(a_i) = 0$, $\mathrm{var}(a_i) = \sigma_a^2$ and $\mathrm{cov}(a_i, a_{i'}) = 0$ for $i \neq i'$.

• How do random effects contribute to the variation of the data?
For simplicity, assume that
$$y_{ij} \mid a_i \overset{\text{i.i.d.}}{\sim} N(\mu + a_i, \sigma^2), \quad j = 1, \cdots, J,$$
$$a_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_a^2), \quad i = 1, \cdots, I.$$
Alternatively,
$$y_{ij} = \mu + a_i + e_{ij}, \qquad a_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma_a^2), \qquad e_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \qquad a_i \perp e_{ij}.$$

By the law of total variance/covariance,
$$\mathrm{var}(y_{ij}) = \mathrm{var}(E(y_{ij} \mid a_i)) + E(\mathrm{var}(y_{ij} \mid a_i)) = \sigma_a^2 + \sigma^2,$$
$$\mathrm{cov}(y_{ij}, y_{ij'}) = \mathrm{cov}(E(y_{ij} \mid a_i), E(y_{ij'} \mid a_i)) + E(\mathrm{cov}(y_{ij}, y_{ij'} \mid a_i)) = \mathrm{cov}(\mu + a_i, \mu + a_i) = \sigma_a^2.$$
$\mathrm{cov}(y_{ij}, y_{ij'})$ is called the intra-class covariance.

◦ The intra-class correlation is
$$\frac{\mathrm{cov}(y_{ij}, y_{ij'})}{\sqrt{\mathrm{var}(y_{ij})}\sqrt{\mathrm{var}(y_{ij'})}} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2}.$$

◦ Simulated data with $I = 5$, $J = 10$, $\mu = 0$, $\sigma^2 = 1$. DATA1: intra-class correlation 0.1 ($\sigma_a^2 = 0.11$); DATA2: intra-class correlation 0.9 ($\sigma_a^2 = 9$). Sample mean (sd): DATA1 $= -0.24\ (1.03)$, DATA2 $= 0.93\ (2.45)$. A simulation sketch is given below.
[Figure: boxplots of DATA1 and DATA2 overall, and by class (X1–X5) within each data set.]
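The following is a minimal simulation sketch of the setup above (an illustration assuming NumPy; the reported numbers come from one particular random draw, so a re-run will not reproduce them exactly):

```python
import numpy as np

def simulate_one_factor(I, J, mu, sigma2, rho, rng):
    """Simulate y_ij = mu + a_i + e_ij with intra-class correlation rho."""
    sigma2_a = rho * sigma2 / (1.0 - rho)             # rho = sigma2_a / (sigma2_a + sigma2)
    a = rng.normal(0.0, np.sqrt(sigma2_a), size=I)    # class-level random effects
    e = rng.normal(0.0, np.sqrt(sigma2), size=(I, J)) # within-class errors
    return mu + a[:, None] + e                        # I x J data matrix

rng = np.random.default_rng(1)
data1 = simulate_one_factor(I=5, J=10, mu=0.0, sigma2=1.0, rho=0.1, rng=rng)  # sigma2_a = 0.11
data2 = simulate_one_factor(I=5, J=10, mu=0.0, sigma2=1.0, rho=0.9, rng=rng)  # sigma2_a = 9
for name, d in [("DATA1", data1), ("DATA2", data2)]:
    print(name, round(d.mean(), 2), round(d.std(ddof=1), 2))
```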

◦ A small intra-class correlation ($\sigma_a^2$ relatively smaller than $\sigma^2$) indicates weak clustering within the classes of the random effect, while a large intra-class correlation ($\sigma_a^2$ relatively larger than $\sigma^2$) indicates strong clustering within the classes.

12.2 Inference

◦ If we are interested in the distribution of the data, we estimate parameters. In some situations, we want to 'estimate' random effects; in this case, we use the term 'prediction' and we predict the realization of the random effects.

◦ Typically, estimation uses the likelihood (ML or REML).

◦ Testing is used for comparing levels of fixed effects (i.e. linear functions of the fixed effects) or for whether the variation due to random effects is zero or not (i.e. $\sigma_a^2 = 0$).

• Best linear unbiased estimator (BLUE): Let $Y$ be a random vector with $E(Y) = X\beta$ and $\mathrm{cov}(Y) = V$, where $X$ is a known $n \times p$ design matrix, $\beta \in \mathbb{R}^p$ and $V$ is a known non-singular covariance matrix.

A real-valued linear estimator $t^T Y$ is said to be the best linear unbiased estimator of its expectation if and only if $\mathrm{var}(t^T Y) \le \mathrm{var}(a^T Y)$ for all linear estimators with $E(a^T Y) = E(t^T Y)$.

◦ $\widehat{X\beta}^{BLUE} = X(X^T V^{-1} X)^{-} X^T V^{-1} Y$, where $(X^T V^{-1} X)^{-}$ is a g-inverse of $X^T V^{-1} X$.
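As a small numerical sketch of the BLUE formula (assuming NumPy; the design matrix, covariance and response below are made up for illustration, and `np.linalg.pinv` stands in for a g-inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # known n x p design matrix
V = 0.5 * np.ones((n, n)) + 0.5 * np.eye(n)             # known non-singular covariance
Y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), V)

Vinv = np.linalg.inv(V)
# X beta_hat^BLUE = X (X^T V^{-1} X)^- X^T V^{-1} Y
Xbeta_blue = X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ Y
print(Xbeta_blue)
```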

• Best predictor (BP): Let $Y = (y_1, \cdots, y_n)^T$ be a vector of $n$ random variables. We want to use them to predict $y_0$, where $y_0$ is also a random variable. The best predictor is the one that minimizes the mean squared error (MSE), which turns out to be the conditional mean.

◦ When the joint distribution of $y_0$ and $Y$ is available,
$$\hat{y}_0^{BP} = E(y_0 \mid Y),$$
since for any predictor $g(Y)$ of $y_0$, one can show $E(y_0 - g(Y))^2 \ge E(y_0 - E(y_0 \mid Y))^2$.

• Best linear predictor (BLP): When only means and covariances are available, let $\mathrm{cov}(Y) = V$, where $V$ is a non-singular covariance matrix, $\mathrm{cov}(y_0, Y) = v_0$, and $E(y_i) = \mu_i$, $i = 0, 1, \cdots, n$. Let $\mu = (\mu_1, \cdots, \mu_n)^T$. Then, the best linear predictor of $y_0$ is given by
$$\hat{y}_0^{BLP} = \mu_0 + \mathrm{cov}(y_0, Y)\,\mathrm{cov}(Y)^{-1}(Y - E(Y)) = \mu_0 + v_0 V^{-1}(Y - \mu).$$
(Proof) Without loss of generality, we write an arbitrary linear predictor as $g(Y) = a + b^T(Y - \mu)$. Then, we have
$$E(y_0 - g(Y))^2 = E(y_0 - a - b^T(Y - \mu))^2 = E(y_0 - \mu_0 - b^T(Y - \mu) + (\mu_0 - a))^2 = E(y_0 - \mu_0 - b^T(Y - \mu))^2 + (\mu_0 - a)^2.$$
For unbiasedness, $a = \mu_0$. For simplicity, we further assume $\mu_0 = 0$, $\mu = 0$. Then,
$$E(y_0 - g(Y))^2 = E(y_0 - b^T Y)^2 = E(y_0 - v_0 V^{-1} Y + v_0 V^{-1} Y - b^T Y)^2 = E(y_0 - v_0 V^{-1} Y)^2 + E(v_0 V^{-1} Y - b^T Y)^2 \ge E(y_0 - v_0 V^{-1} Y)^2.$$
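As a quick special case (worked out here for illustration, not part of the original notes), take $n = 1$ with a single variable $y_1$; the BLP then reduces to the familiar regression-toward-the-mean formula
$$\hat{y}_0^{BLP} = \mu_0 + \frac{\mathrm{cov}(y_0, y_1)}{\mathrm{var}(y_1)}(y_1 - \mu_1) = \mu_0 + \rho\,\frac{\sigma_0}{\sigma_1}(y_1 - \mu_1),$$
where $\sigma_0^2 = \mathrm{var}(y_0)$, $\sigma_1^2 = \mathrm{var}(y_1)$ and $\rho$ is the correlation between $y_0$ and $y_1$.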

• Best linear unbiased predictor (BLUP): When the covariance structure is known, $a_0 + a^T Y$ is the BLUP of $y_0$ if $a_0 + a^T Y$ is unbiased (i.e. $E(a_0 + a^T Y) = E(y_0)$) and, for any other linear predictor $b_0 + b^T Y$, $E(y_0 - a_0 - a^T Y)^2 \le E(y_0 - b_0 - b^T Y)^2$.

◦ First assume the following model:
$$\begin{pmatrix} Y \\ y_0 \end{pmatrix} \sim \left( \begin{pmatrix} X\beta \\ x_0\beta \end{pmatrix},\ \mathrm{Cov} \right)$$
and $x_0 = c^T X$ for some $c$, where $X$ and $\mathrm{Cov}$ are known, $\mathrm{cov}(Y)$ is non-singular and $\beta$ is unknown. Then,
$$\hat{y}_0^{BLUP} = x_0\hat{\beta}^{BLUE} + \mathrm{cov}(y_0, Y)\,\mathrm{cov}(Y)^{-1}\big(Y - X\hat{\beta}^{BLUE}\big).$$
Let $\mathrm{cov}(Y) = V$, $\mathrm{cov}(y_0, Y) = v_0$ and $L_\beta(Y) = x_0\beta + v_0 V^{-1}(Y - X\beta)$.
For any arbitrary linear unbiased predictor $a^T Y$ of $y_0$, one can show
$$E(y_0 - a^T Y)^2 = E(y_0 - L_\beta(Y))^2 + E(L_\beta(Y) - a^T Y)^2 \ge E(y_0 - L_\beta(Y))^2.$$

• One can use an EM algorithm to estimate the parameters in a linear mixed effects model by considering the random effects as latent (missing) variables.

12.3 Linear mixed effects model

• Linear mixed effects model: A general linear mixed effects model is of the following form:
$$Y = X\beta + Z\gamma + e,$$
where $X, Z$ are known design matrices, $\beta$ is an unobservable vector of fixed effects, and $\gamma$ is an unobservable vector of random effects with $E(\gamma) = 0$, $\mathrm{cov}(\gamma) = D$, $\mathrm{cov}(\gamma, e) = 0$ and $\mathrm{cov}(e) = R$.

• $\mathrm{cov}(Y) = V = \mathrm{cov}(Z\gamma + e) = ZDZ^T + R$ and
$$\begin{pmatrix} Y \\ \gamma \end{pmatrix} \sim \left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix},\ \begin{pmatrix} ZDZ^T + R & ZD \\ DZ^T & D \end{pmatrix} \right).$$

• Example: Consider math test scores of students by gender in 4 ninth-grade classes of fifteen high schools. Let $y_{tijk}$ be the math test score of the $k$th student in gender group $t$ in the $j$th class at the $i$th school, $i = 1, \cdots, 15$, $j = 1, \cdots, 4$, $t = 1, 2$ and $k = 1, \cdots, n_{tij}$.

◦ We can consider three sources of variability in the data:

1. among schools ($s_i$)

2. among classes within each school ($c_{ij}$)

3. among students within each class, plus some unexplained variability ($e_{tijk}$)

One plausible model is
$$y_{tijk} = \beta_t + s_i + c_{ij} + e_{tijk},$$
where the $\beta_t$ are fixed effects and $s_i$, $c_{ij}$ are random effects.

◦ Let $\gamma_1 = (s_1, \cdots, s_{15})^T$ and $\gamma_2 = (c_{1,1}, \cdots, c_{1,4}, c_{2,1}, \cdots, c_{15,4})^T$. Then, $\gamma = (\gamma_1^T, \gamma_2^T)^T$, which consists of $15 + 15 \times 4 = 75$ random effects. $Z$ is also partitioned as $Z = [Z_1, Z_2]$ such that $Z\gamma = Z_1\gamma_1 + Z_2\gamma_2$.

◦ We need to further restrict the structure of $D$ and $R$ so that they can be estimated. For example, $D$ can be simplified by assuming $\mathrm{cov}(\gamma_1) = \sigma_s^2 I_{15}$, $\mathrm{cov}(\gamma_2) = \sigma_c^2 I_{60}$ and $\mathrm{cov}(\gamma_1, \gamma_2) = 0$. For $R$, one can assume $R = \sigma^2 I_N$, where $N$ is the total number of observations. Then, $\mathrm{cov}(Y) = ZDZ^T + R = \sigma_s^2 Z_1 Z_1^T + \sigma_c^2 Z_2 Z_2^T + \sigma^2 I$.
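A small sketch of how $\mathrm{cov}(Y) = \sigma_s^2 Z_1 Z_1^T + \sigma_c^2 Z_2 Z_2^T + \sigma^2 I$ could be assembled in code (assuming NumPy; for simplicity two students per class are used instead of the general $n_{tij}$, and the variance components are made-up values):

```python
import numpy as np

n_school, n_class, n_per_class = 15, 4, 2
N = n_school * n_class * n_per_class                      # 120 observations

# school and class index of each observation
school = np.repeat(np.arange(n_school), n_class * n_per_class)
klass = np.repeat(np.arange(n_school * n_class), n_per_class)

Z1 = np.zeros((N, n_school));           Z1[np.arange(N), school] = 1.0  # school effects
Z2 = np.zeros((N, n_school * n_class)); Z2[np.arange(N), klass] = 1.0   # class effects

sigma2_s, sigma2_c, sigma2 = 4.0, 2.0, 1.0                # assumed variance components
V = sigma2_s * Z1 @ Z1.T + sigma2_c * Z2 @ Z2.T + sigma2 * np.eye(N)
print(V.shape)                                            # (120, 120)
```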

• Variance component model: In the linear mixed effects model $Y = X\beta + Z\gamma + e$, consider $Z\gamma = \sum_{l=1}^{r} Z_l\gamma_l$ such that $\mathrm{cov}(\gamma) = D = \mathrm{diag}\{\sigma_l^2 I_{q_l}\}_{l=1}^{r}$ and $V = \mathrm{cov}(Y) = \sum_{l=1}^{r}\sigma_l^2 Z_l Z_l^T + R = \sum_{l=1}^{r}\sigma_l^2 Z_l Z_l^T + \sigma_0^2 I_N$, where $N$ is the total number of observations. Such a simplified linear mixed effects model is called a variance component model. $V$ can also be written as $V = \sum_{l=0}^{r}\sigma_l^2 Z_l Z_l^T$ with $Z_0 = I_N$.

• Estimation of a variance component model

◦ ML method
The log-likelihood for a variance component model is
$$\ell = -\frac{1}{2}(Y - X\beta)^T V^{-1}(Y - X\beta) - \frac{1}{2}\log|V| - \frac{N}{2}\log(2\pi).$$
Then, we have
$$\frac{\partial \ell}{\partial \beta} = -X^T V^{-1} X\beta + X^T V^{-1} Y,$$
$$\frac{\partial \ell}{\partial \sigma_l^2} = -\frac{1}{2}(Y - X\beta)^T \frac{\partial V^{-1}}{\partial \sigma_l^2}(Y - X\beta) - \frac{1}{2}\frac{\partial}{\partial \sigma_l^2}\log|V| = \frac{1}{2}(Y - X\beta)^T V^{-1} Z_l Z_l^T V^{-1}(Y - X\beta) - \frac{1}{2}\mathrm{tr}(V^{-1} Z_l Z_l^T)$$
for $l = 0, \cdots, r$. Then, the likelihood estimating equations are
$$X^T V^{-1} X\beta = X^T V^{-1} Y,$$
$$\mathrm{tr}(V^{-1} Z_l Z_l^T) = (Y - X\beta)^T V^{-1} Z_l Z_l^T V^{-1}(Y - X\beta) = Y^T P Z_l Z_l^T P Y,$$
for $l = 0, \cdots, r$, where $P = V^{-1} - V^{-1}X(X^T V^{-1} X)^{-} X^T V^{-1}$, and
$$X\hat{\beta}_{ML} = X(X^T V^{-1} X)^{-} X^T V^{-1} Y.$$
When $Z_l = 0$ for $l = 1, \cdots, r$, $V = \sigma_0^2 I$ and $\hat{\sigma}_0^2 = \frac{1}{N}\mathrm{SSE}$.

◦ Restricted maximum likelihood (REML) method
The idea is to use the likelihood of the residuals so that we deal with a likelihood involving only the variance component parameters.
First, we find $B$ such that
$$B^T X = 0,$$
$B^T$ is an $(N - r) \times N$ full row rank matrix, where $r = \mathrm{rank}(X)$, and $B^T V B$ is nonsingular.

Also note that $B(B^T V B)^{-1} B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$, which is invariant to the choice of $B$.

Then, the log-likelihood of $B^T Y$ satisfies
$$\ell_R \propto -\frac{1}{2} Y^T B(B^T V B)^{-1} B^T Y - \frac{1}{2}\log|B^T V B|,$$
$$\frac{\partial \ell_R}{\partial \sigma_l^2} = \frac{1}{2} Y^T B(B^T V B)^{-1} B^T Z_l Z_l^T B(B^T V B)^{-1} B^T Y - \frac{1}{2}\mathrm{tr}\big((B^T V B)^{-1} B^T Z_l Z_l^T B\big),$$
for $l = 0, \cdots, r$.

Let $P = B(B^T V B)^{-1} B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$. Then, the likelihood estimating equations become
$$\mathrm{tr}(P Z_l Z_l^T) = Y^T P Z_l Z_l^T P Y$$
for $l = 0, \cdots, r$.

When $Z_l = 0$ for $l = 1, \cdots, r$, $V = \sigma_0^2 I$ and $P = I - X(X^T X)^{-1} X^T$, so that $\hat{\sigma}_{0R}^2 = \frac{1}{N - p}\mathrm{SSE}$.
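A numerical sanity check of the identity $B(B^T V B)^{-1}B^T = V^{-1} - V^{-1}X(X^T V^{-1}X)^{-}X^T V^{-1}$ (a sketch assuming NumPy/SciPy; any $B$ whose columns span the null space of $X^T$ satisfies $B^T X = 0$):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
N, p = 8, 3
X = rng.normal(size=(N, p))                      # full column rank with probability one
A = rng.normal(size=(N, N))
V = A @ A.T + N * np.eye(N)                      # a positive definite covariance matrix

B = null_space(X.T)                              # N x (N - p), columns satisfy X^T b = 0
Vinv = np.linalg.inv(V)
P1 = B @ np.linalg.inv(B.T @ V @ B) @ B.T
P2 = Vinv - Vinv @ X @ np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv
print(np.allclose(P1, P2))                       # True
```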

• BLUE and BLUP in a linear mixed effects model: We are interested in finding $\widehat{X\beta}^{BLUE}$ and $\widehat{Z\gamma}^{BLUP}$.

◦ Mixed model equation (MME)
$\widehat{X\beta}^{BLUE}$ and $\widehat{Z\gamma}^{BLUP}$ are obtained by minimizing
$$(Y - X\beta - Z\gamma)^T R^{-1}(Y - X\beta - Z\gamma) + \gamma^T D^{-1}\gamma,$$
which yields the following MME:
$$\begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & D^{-1} + Z^T R^{-1} Z \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = \begin{pmatrix} X^T R^{-1} Y \\ Z^T R^{-1} Y \end{pmatrix}.$$
A solution to the MME is
$$\hat{\beta} = (X^T V^{-1} X)^{-} X^T V^{-1} Y, \qquad \hat{\gamma} = D Z^T V^{-1}(Y - X\hat{\beta}).$$

◦ Theorem: If $\hat{\beta}$ and $\hat{\gamma}$ are the solution to the MME, then $X\hat{\beta}$ is the BLUE of $X\beta$ and $Z\hat{\gamma}$ is the BLUP of $Z\gamma$.

(Proof) Using $(I + uv)^{-1} = I - u(I + vu)^{-1}v$,
$$V^{-1} = (ZDZ^T + R)^{-1} = (R^{-1}ZDZ^T + I)^{-1}R^{-1} = \big(I - R^{-1}Z(I + DZ^T R^{-1} Z)^{-1} D Z^T\big) R^{-1} = \big(I - R^{-1}Z(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T\big) R^{-1},$$
$$DZ^T(ZDZ^T + R)^{-1} = DZ^T R^{-1} - DZ^T R^{-1} Z(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} = \big(D(D^{-1} + Z^T R^{-1} Z) - DZ^T R^{-1} Z\big)(D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} = (D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1}.$$
The second identity shows that the $\gamma$-equation of the MME, $\hat{\gamma} = (D^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1}(Y - X\hat{\beta})$, is the same as $\hat{\gamma} = DZ^T V^{-1}(Y - X\hat{\beta})$ above.
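A small numerical sketch (assuming NumPy; the design, $D$, $R$ and response are made up) checking that the MME solution matches $\hat{\beta} = (X^TV^{-1}X)^{-}X^TV^{-1}Y$ and $\hat{\gamma} = DZ^TV^{-1}(Y - X\hat{\beta})$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 12, 2, 3
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.zeros((N, q)); Z[np.arange(N), rng.integers(0, q, size=N)] = 1.0  # one random factor
D, R = 2.0 * np.eye(q), 1.0 * np.eye(N)
Y = rng.normal(size=N)

# Mixed model equations
Rinv, Dinv = np.linalg.inv(R), np.linalg.inv(D)
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Dinv + Z.T @ Rinv @ Z]])
rhs = np.concatenate([X.T @ Rinv @ Y, Z.T @ Rinv @ Y])
beta_mme, gamma_mme = np.split(np.linalg.solve(lhs, rhs), [p])

# Direct BLUE/BLUP formulas
V = Z @ D @ Z.T + R
Vinv = np.linalg.inv(V)
beta = np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ Y
gamma = D @ Z.T @ Vinv @ (Y - X @ beta)
print(np.allclose(beta_mme, beta), np.allclose(gamma_mme, gamma))   # True True
```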

• Example: In the one-factor model $y_{ij} = \mu + a_i + e_{ij}$,
$$\hat{a}_i = E(a_i \mid Y) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2/n_i}(\bar{y}_{i\cdot} - \mu),$$
where $n_i$ is the number of observations in the $i$th level of the factor and $\bar{y}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ is the sample mean of the observations in the $i$th level of the factor.

◦ Note that we do not know $\sigma^2$, $\sigma_a^2$ and $\mu$. A naive remedy is to replace them with estimators (e.g. the MLE):
$$\tilde{a}_i = \frac{\hat{\sigma}_a^2}{\hat{\sigma}_a^2 + \hat{\sigma}^2/n_i}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}).$$
If $a_i$ were a fixed effect parameter,
$$\hat{a}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}$$
with the constraint $\sum_i n_i\hat{a}_i = 0$.
Note that the additional factor $\dfrac{\hat{\sigma}_a^2}{\hat{\sigma}_a^2 + \hat{\sigma}^2/n_i} < 1$, so $\tilde{a}_i$ is $\hat{a}_i$ shrunk toward zero.
If $\hat{\sigma}_a^2$ is relatively large, $\tilde{a}_i \approx \hat{a}_i$: a large variation between classes does not produce much shrinkage. On the other hand, a relatively small $\hat{\sigma}_a^2$ makes $|\tilde{a}_i| \ll |\hat{a}_i|$ (more shrinkage).
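A tiny numerical sketch of the shrinkage effect (assuming NumPy; the class means, class sizes and variance components below are made up, and the variance components are treated as known rather than estimated):

```python
import numpy as np

ybar_i = np.array([1.2, -0.5, 0.3])       # class means  ybar_{i.}
n_i = np.array([4, 4, 20])                # class sizes  n_i
ybar = np.average(ybar_i, weights=n_i)    # grand mean   ybar_{..}

sigma2, sigma2_a = 1.0, 0.25              # treated as known for illustration
shrink = sigma2_a / (sigma2_a + sigma2 / n_i)
a_fixed = ybar_i - ybar                   # fixed-effect estimate  a_hat
a_pred = shrink * a_fixed                 # shrunken predictor     a_tilde
print(shrink.round(2))                    # larger n_i => factor closer to 1 (less shrinkage)
print(a_pred.round(2))
```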
Chapter 13

Generalized linear model

13.1 Three components of the model assumption in GLM

GLM generalizes the linear model by extending it to nonnormal outcomes and nonlinear relationships. Model specification requires three elements, namely, the random component, the systematic component and the link function.

• Random component

◦ $Y$ has a distribution in the exponential family, taking the form
$$f_Y(Y; \theta, \phi) = \exp\{(Y\theta - b(\theta))/\phi + c(Y, \phi)\}.$$

◦ Check $\theta$ and $b(\theta)$ for the normal, binomial, Poisson, and Gamma distributions.

◦ The mean and variance of an exponential family distribution can be obtained by differentiating the equation $\int f(y; \theta, \phi)\,dy = 1$, yielding $E(Y) = b'(\theta)$ and $\mathrm{Var}(Y) = b''(\theta)\phi$ (see the Poisson example below).

◦ The mean and variance functions uniquely determine the distribution within the exponential family. Higher moments are also functions of $b(\theta)$.
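For example (a check worked out here for the Poisson case): writing the Poisson($\mu$) mass function in the exponential family form gives
$$f(y; \theta) = \exp\{y\log\mu - \mu - \log y!\} = \exp\{(y\theta - b(\theta))/\phi + c(y, \phi)\} \quad\text{with}\quad \theta = \log\mu,\ b(\theta) = e^{\theta},\ \phi = 1,$$
so that $E(Y) = b'(\theta) = e^{\theta} = \mu$ and $\mathrm{Var}(Y) = b''(\theta)\phi = \mu$.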

• Systematic component

◦ $E(Y \mid X)$, say $\mu$, is related to $X$ through a linear combination of $X$:
$$\eta_i = \sum_{j=1}^{p} X_{ij}\beta_j.$$
Counterexample (not a linear predictor): $\exp(\beta_0 + X_i^{\beta_1})$.

• Link function: The link function relates the linear predictor and the conditional mean, i.e., $\eta = g(\mu)$. $g(\cdot)$ is called the link function and should be twice differentiable. Link functions determine the interpretation of the coefficients (e.g., logit link, log link).

◦ Examples of link functions for binary regression are the logit, probit, and complementary log-log functions.

◦ The canonical link function is the link function for which $\theta = \eta$.

13.2 Likelihood Inference on GLM

• The estimator for $\beta$ is obtained by maximizing
$$L(\beta(\theta), \phi) = \prod_{i=1}^{n} f(y_i \mid \theta_i, \phi) = \exp\Big(\sum_{i=1}^{n}\big[\{y_i\theta_i - b(\theta_i)\}/\phi + c(y_i, \phi)\big]\Big),$$
$$\log L(\beta(\theta), \phi) = \sum_{i=1}^{n}\big[\{y_i\theta_i - b(\theta_i)\}/\phi + c(y_i, \phi)\big],$$
where $\theta^T = (\theta_1, \theta_2, \cdots, \theta_n)$. Until we specify otherwise, assume that $\phi$ is known.

• Score function

The score function of the GLM has the form
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n}\frac{\partial \eta_i}{\partial \beta}\frac{\partial \theta_i}{\partial \eta_i}\frac{\partial}{\partial \theta_i}\log L(\beta, \phi) = \sum_{i=1}^{n}\frac{\partial \eta_i}{\partial \beta}\frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial}{\partial \theta_i}\log L(\beta, \phi),$$
where
$$\frac{\partial \eta_i}{\partial \beta} = X_i^T, \qquad \frac{\partial \mu_i}{\partial \eta_i} = 1/g'(\mu_i), \qquad \frac{\partial \theta_i}{\partial \mu_i} = 1/b''(\theta_i),$$
$$\frac{\partial \theta_i}{\partial \eta_i} = 1\Big/\frac{\partial \eta_i}{\partial \theta_i} = 1/[g'\{b'(\theta_i)\}b''(\theta_i)] = 1/\{g'(\mu_i)b''(\theta_i)\},$$
$$\frac{\partial}{\partial \theta_i}\log L(\beta, \phi) = \frac{Y_i - b'(\theta_i)}{\phi} = \frac{Y_i - \mu_i}{\phi}.$$
Hence,
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n} X_i^T\,\frac{1}{g'(\mu_i)b''(\theta_i)}\,\frac{Y_i - b'(\theta_i)}{\phi} = \sum_{i=1}^{n} X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}(Y_i - \mu_i).$$

Note: If $\theta_i = \eta_i$, i.e., the canonical link, $\frac{\partial \theta_i}{\partial \eta_i} = 1$ and the score function is
$$\frac{\partial}{\partial \beta}\log L(\beta, \phi) = \sum_{i=1}^{n} X_i^T(Y_i - \mu_i)/\phi.$$

• Information

The second derivative of the negative log-likelihood function is the observed information:
$$I(\beta) = \sum_{i=1}^{n} -\frac{\partial^2 \log f(Y_i; \theta, \phi)}{\partial \beta\,\partial \beta^T}.$$
The expected value of $I(\beta)$, say $i(\beta)$, is called the expected information or Fisher information. Hence,
$$I(\beta) = -\frac{\partial}{\partial \beta^T}U(\beta) = -\sum_{i=1}^{n}\Big[ X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\frac{\partial}{\partial \beta^T}(Y_i - \mu_i) + X_i^T\Big[\frac{\partial}{\partial \beta^T}\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\Big](Y_i - \mu_i)\Big]$$
$$= \sum_{i=1}^{n}\Big[ X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\frac{\partial \mu_i}{\partial \beta^T} - X_i^T\Big[\frac{\partial}{\partial \beta^T}\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}\Big](Y_i - \mu_i)\Big],$$
where
$$\frac{\partial \mu_i}{\partial \beta^T} = \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta^T} = \{g'(\mu_i)\}^{-1} X_i,$$
and
$$i(\beta) = E\{I(\beta)\} = \sum_{i=1}^{n} X_i^T\, g'(\mu_i)^{-1}\mathrm{Var}(Y_i)^{-1} g'(\mu_i)^{-1} X_i.$$
i =1

For the canonical link,
$$I(\beta) = \sum_{i=1}^{n} X_i^T b''(\theta_i) X_i/\phi = i(\beta),$$
i.e., the log-likelihood function is strictly concave.

• Computation of the MLE

In general, the equation $U(\theta) = 0$ does not have a closed form solution. Then one should solve the equation iteratively. Starting from an initial value $\theta^{(0)}$, one can iteratively update the estimate by
$$\theta^{(p+1)} = \theta^{(p)} + I(\theta^{(p)})^{-1}U(\theta^{(p)}),$$
$p = 0, 1, \cdots$. When the update becomes very small, i.e., $|\theta^{(p+1)} - \theta^{(p)}| < c$, where $c$ is a pre-specified small value, for example $c = 10^{-8}$, the iteration is stopped and $\theta^{(p+1)}$ is declared the solution. This computational algorithm is called the Newton-Raphson algorithm.

• Iteratively reweighted least squares estimate

Motivation: the score function of the GLM has the form of a weighted linear regression. Can we obtain the MLE as a weighted least squares estimator?

Consider the following 'pseudo dependent variable':
$$z_i = \eta_i + \frac{\partial \eta_i}{\partial \mu_i}(Y_i - \mu_i).$$
Then
$$E(z_i) = \eta_i = X_i\beta, \qquad \mathrm{Var}(z_i) = \Big[\frac{\partial \eta_i}{\partial \mu_i}\Big]^2\mathrm{Var}(Y_i) = g'(\mu_i)^2\,\mathrm{Var}(Y_i).$$

Then we can obtain the following weighted least squares estimator:
$$\hat{\beta} = \Big(\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}z_i\Big).$$
Note that both $z$ and $\mathrm{Var}(z)$ are also functions of $\beta$. Start with $\beta^{(0)}$ and update:
$$\begin{aligned}
\beta^{(p+1)} &= \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big[\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}\{\eta_i + g'(\mu_i)(Y_i - \mu_i)\}\Big]\\
&= \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big[\sum_{i=1}^{n} X_i^T\mathrm{Var}(z_i)^{-1}X_i\beta^{(p)} + \sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}g'(\mu_i)(Y_i - \mu_i)\Big]\\
&= \beta^{(p)} + \Big[\sum_{i=1}^{n} X_i^T\{g'(\mu_i)^2\mathrm{Var}(Y_i)\}^{-1}X_i\Big]^{-1}\Big(\sum_{i=1}^{n} X_i^T\{g'(\mu_i)\mathrm{Var}(Y_i)\}^{-1}(Y_i - \mu_i)\Big)\\
&= \beta^{(p)} + i(\beta^{(p)})^{-1}U(\beta^{(p)}),
\end{aligned}$$
where the quantities on the right-hand side are evaluated at $\beta^{(p)}$.

• We can tell that iterative reweighting leads to the MLE since it is numerically equivalent to the Newton-Raphson algorithm (with the observed information replaced by the expected information, i.e., Fisher scoring).
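A minimal IRLS sketch for logistic regression (assuming NumPy; this illustrates the update above and is not production code):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))           # inverse logit
        w = mu * (1.0 - mu)                       # 1 / Var(z_i), since phi = 1
        z = eta + (y - mu) / w                    # pseudo dependent variable
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(3)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x))))
print(irls_logistic(X, y))                        # roughly (0.5, 1.0) up to sampling error
```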

• Summary of results

1. $E\{U(\beta)\} = 0$, $\mathrm{Var}\{U(\beta)\} = i(\beta)$

2. $n^{-1/2}U(\beta) \sim N(0, n^{-1}i(\beta))$

3. $\hat{\beta}$ is consistent.

4. $n^{1/2}(\hat{\beta} - \beta_0) \sim N(0, n\,i(\beta)^{-1})$

13.3 Example. Binary regression, canonical link

Assume $Y_i \sim B(1, \mu_i)$ independently. The likelihood function is
$$L(\beta, Y) = \prod_{i=1}^{n}\mu_i^{Y_i}(1 - \mu_i)^{1 - Y_i},$$
where $Y = (Y_1, Y_2, \cdots, Y_n)^T$,
$$\eta_i = \log\frac{\mu_i}{1 - \mu_i}, \qquad E(Y_i \mid x_i) = \mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)},$$
$\beta = (\beta_0, \beta_1)^T$, and $\eta_i = \beta_0 + \beta_1 x_i$. The interpretation of $\beta_1$ is the log odds ratio,
$$\beta_1 = \log\frac{P(Y_i = 1 \mid x_i = 1)\,P(Y_i = 0 \mid x_i = 0)}{P(Y_i = 1 \mid x_i = 0)\,P(Y_i = 0 \mid x_i = 1)}.$$

The score function is
$$\sum_{i=1}^{n}\begin{pmatrix}1\\ x_i\end{pmatrix}(Y_i - \mu_i)$$
and the observed and expected information matrices are the same,
$$\sum_{i=1}^{n}\begin{pmatrix}1\\ x_i\end{pmatrix}\mu_i(1-\mu_i)\begin{pmatrix}1 & x_i\end{pmatrix} = \begin{pmatrix}\sum_{i=1}^{n}\mu_i(1-\mu_i) & \sum_{i=1}^{n}x_i\mu_i(1-\mu_i)\\ \sum_{i=1}^{n}x_i\mu_i(1-\mu_i) & \sum_{i=1}^{n}x_i^2\mu_i(1-\mu_i)\end{pmatrix}.$$
Note that, unlike linear regression, the information is a function of $\beta$. The adjusted dependent variable is
$$z_i = \eta_i + \frac{\partial \eta_i}{\partial \mu_i}(Y_i - \mu_i) = \beta_0 + \beta_1 x_i + \{\mu_i(1-\mu_i)\}^{-1}(Y_i - \mu_i),$$
$$\frac{\partial \eta_i}{\partial \mu_i} = \frac{\partial \eta_i}{\partial \theta_i}\frac{\partial \theta_i}{\partial \mu_i} = \{\mu_i(1-\mu_i)\}^{-1}.$$

• Difference between GLM and nonlinear regression: Generalized linear models are nonlinear models, but a nonlinear regression model in the usual sense has the following form:
$$Y_i = f(X_i, \beta) + e_i,$$
and the distributional assumption is imposed on $e_i$. To estimate $\beta$, the sum of squared errors $\sum_{i=1}^{n} e_i^2$, a weighted sum of squared errors $\sum_{i=1}^{n} w_i e_i^2$ (with known or unknown weights), or maximum likelihood can be used.

• Model checking is challenging due to non-identically distributed errors.

• Hypothesis testing can be conducted using the Wald test, the score test, and the likelihood ratio test.

• Score statistic

The score statistic is based on the null distribution of the score function. In testing the entire vector,
$$H_0: \beta = \beta_0,$$
the score statistic is
$$T_S = U(\beta_0)^T\,\mathrm{Var}\{U(\beta_0)\}^{-1}\,U(\beta_0).$$
The test statistic $T_S$ is asymptotically chi-squared distributed with $p$ degrees of freedom. Note that the score function is evaluated at the null value.

◦ Now, in the presence of nuisance parameters, consider
$$H_0: \beta_1 = \beta_{10},$$
where $\beta_1$ is the $q \times 1$ parameter of interest. Consider a partition of the score function
$$U(\beta) = \begin{pmatrix} U_1(\beta_1, \beta_2) \\ U_2(\beta_1, \beta_2) \end{pmatrix},$$
where $U_1(\beta_1, \beta_2)$ is the $q$-variate sub-vector of the score function $U(\beta)$ corresponding to $\beta_1$.

The score statistic is
$$T_S = U_1(\beta_{10}, \tilde{\beta}_2)^T\,\mathrm{Var}\{U_1(\beta_{10}, \tilde{\beta}_2)\}^{-1}\,U_1(\beta_{10}, \tilde{\beta}_2),$$
where $\tilde{\beta}_2$ is the solution of $U_2(\beta_{10}, \beta_2) = 0$, i.e., the MLE of $\beta_2$ under the restriction $\beta_1 = \beta_{10}$.

Consider a similar partition for $i(\beta)$:
$$i(\beta) = -E\begin{pmatrix} \dfrac{\partial U_1}{\partial \beta_1^T} & \dfrac{\partial U_1}{\partial \beta_2^T} \\[6pt] \dfrac{\partial U_2}{\partial \beta_1^T} & \dfrac{\partial U_2}{\partial \beta_2^T} \end{pmatrix} = \begin{pmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{pmatrix}.$$
It can be easily verified that $i_{21} = i_{12}^T$.

To obtain the variance of $U_1(\beta_{10}, \tilde{\beta}_2)$, we can consider the following expansion:
$$U_1(\beta_{10}, \tilde{\beta}_2) \approx U_1(\beta_{10}, \beta_{20}) + \Big[\frac{\partial}{\partial \beta_2^T}U_1(\beta_1, \beta_2)\Big](\tilde{\beta}_2 - \beta_{20}) = U_1(\beta_{10}, \beta_{20}) - i_{12}\,i_{22}^{-1}\,U_2(\beta_{10}, \beta_{20}).$$
Then
$$\mathrm{Var}\{U_1(\beta_{10}, \tilde{\beta}_2)\} = i_{11} - i_{12}\,i_{22}^{-1}\,i_{21}.$$
The test statistic $T_S$ is distributed as chi-squared with $q$ degrees of freedom.

One weakness is that the computation of the score statistic requires a specialized program.

◦ Example

Consider a simple logistic regression where $H_0: \beta_1 = 0$ and $\beta_0$ is a nuisance parameter. Note that $\tilde{\beta}_0 = \log\dfrac{Y_+}{n - Y_+}$ and $\tilde{\mu}_i = \tilde{\mu} = \dfrac{\exp(\tilde{\beta}_0)}{1 + \exp(\tilde{\beta}_0)}$, where $Y_+ = \sum_{i=1}^{n} Y_i$. We can obtain
$$U_1(\tilde{\beta}_0, 0) = \sum_{i=1}^{n} X_i\Big[Y_i - \frac{\exp(\tilde{\beta}_0)}{1 + \exp(\tilde{\beta}_0)}\Big],$$
$$\mathrm{Var}\{U_1(\tilde{\beta}_0, 0)\} = \sum_{i=1}^{n} X_i^2\tilde{\mu}_i(1 - \tilde{\mu}_i) - \Big\{\sum_{i=1}^{n} X_i\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}\Big\{\sum_{i=1}^{n}\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}^{-1}\Big\{\sum_{i=1}^{n} X_i\tilde{\mu}_i(1 - \tilde{\mu}_i)\Big\}.$$
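A sketch of this score test in code (assuming NumPy; a scalar covariate $X_i$ and the hypothesis $H_0: \beta_1 = 0$):

```python
import numpy as np

def score_test_beta1(x, y):
    """Score (Rao) test of H0: beta_1 = 0 in logistic regression with an intercept."""
    mu = np.full(len(y), y.mean())            # restricted MLE: mu_tilde = Y_+ / n
    w = mu * (1.0 - mu)
    u1 = np.sum(x * (y - mu))                 # U_1(beta_0_tilde, 0)
    var_u1 = np.sum(x**2 * w) - np.sum(x * w) ** 2 / np.sum(w)
    return u1**2 / var_u1                     # compare to chi-squared with 1 df

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x))))
print(score_test_beta1(x, y))
```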
