
Introduction

Generalized linear models are a central part of a statistician's toolbox. They cover topics such as linear
models (regression and ANOVA), log-linear models (count data), logistic regression (proportions) and
survival analysis. They also form the basis of more advanced analyses such as generalized linear mixed
models and hierarchical models.

The key development behind GLMs was the realisation that several types of analysis share a similar
underlying structure, and that the distributions being fitted can all be shown to be members of
a general class of distributions, the exponential family. The properties of this family can be used to
develop the model fitting once, and the results can then be applied to the different specific cases. This
explains why we need the maths at the start, before moving on to the "proper" statistics later.

There are several books on GLMs available. I have used the following:

McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, 2nd Ed. Chapman & Hall.
The all-time classic. Nelder was behind the development of the GLM ideas, and this is a
comprehensive book. Every professional statistician should have it on their shelf.
Dobson, A.J. (1990). An Introduction to Generalized Linear Models. Chapman & Hall.
Aimed at undergraduates, the first third of the book is about linear models and inference. There is
a second edition out now which I have not seen, but which looks more comprehensive.
Lindsey, J.K. (1997). Applying Generalized Linear Models. Springer.
An overview of GLMs and some of their offshoots. It covers a broad range, and explains the
basics nicely, but lacks depth.

Exponential Family of Distributions


All Generalized Linear Models are based on a family of distributions called the exponential family. The
reason for this is that the family includes several important distributions, and also has several 'nice'
properties.

If we have a random variable Y taken from a distribution that is a member of the exponential family,
and it depends on a single parameter θ, then the probability density (or mass) function for Y,
f(y|θ), can be written as

f  y∣=exp  a  y b c d  y

If a(y)=y then the distribution is said to be in canonical form, and b(θ) is called the natural
parameter. In this form, the distribution can also be written as

f  y∣=exp
 yk −m 
h
n y ,

which is sometimes more convenient (as we will see later).
Examples

Normal distribution
The p.d.f. is
f  y∣=
1
 2 2 
exp
1
2 2
 y−2

which can be written as

 y2 y  2 1
f  y∣ ,  =exp − 2  2 − 2 − log 2   2  .
2  2 2 
This is in canonical form, with the natural parameter b(μ)=μ/σ2 and the other terms are
2 1 y2
c = 2 − log 2   and d  y= 2 .
2

2 2 2
2
In the alternative formulation, h= , so that k(μ)=μ,
2
 
2
1 y
m∗= and n  y , =− log 2  2  .
2 2  2
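As a quick numerical sanity check (my own illustration, not part of the original notes; the values of μ, σ² and y are arbitrary), the snippet below evaluates the canonical-form decomposition above and compares it with the normal density from scipy. The two should agree.

    import numpy as np
    from scipy import stats

    # Check that exp{a(y)b(mu) + c(mu) + d(y)} reproduces the N(mu, sigma^2) density.
    mu, sigma2, y = 1.5, 2.0, 0.7

    b = mu / sigma2                                                # b(mu), the natural parameter
    c = -mu**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)   # c(mu)
    d = -y**2 / (2 * sigma2)                                       # d(y)
    f_exp_family = np.exp(y * b + c + d)                           # a(y) = y (canonical form)

    f_direct = stats.norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))
    print(f_exp_family, f_direct)                                  # identical up to rounding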

Binomial Distribution
If Y is the number of successes in n independent trials, where π is the (constant) probability of
success, then


f  y∣= n  1−
y
y n− y

which can be re-written as


f  y∣=exp y log

1−
n log 1−log n
 
y
,

so the natural parameter is b =log 1−


, and c =n log 1− and d  y =log  ny  .

Some Results
Expected Value and Variance of a(Y)

Let l = l(θ|y) = log f(y|θ ), i.e. the log-likelihood, and define U=dl/dθ . U is called the score, and
Var(U) is called the information. We need a couple of results:

E(U) = 0, and Var(U) = E(U²) = E(−U').

The log-likelihood for a member of the exponential family is:

l = log f(y|θ) = a(y)b(θ) + c(θ) + d(y)

So that

U = \frac{dl}{d\theta} = a(y)\,b'(\theta) + c'(\theta) \quad \text{and} \quad U' = \frac{d^2 l}{d\theta^2} = a(y)\,b''(\theta) + c''(\theta) ,

and hence E(U) = b'(θ)E[a(y)] + c'(θ).
Moments of U
To find the moments we use the identity

\frac{d \log f(y \mid \theta)}{d\theta} = \frac{1}{f(y \mid \theta)} \frac{d f(y \mid \theta)}{d\theta} \qquad (1)

E(U): Take expectations:

E(U) = \int \frac{d \log f(y \mid \theta)}{d\theta}\, f(y \mid \theta)\, dy = \int \frac{d f(y \mid \theta)}{d\theta}\, dy

Under suitable regularity conditions,

\int \frac{d f(y \mid \theta)}{d\theta}\, dy = \frac{d}{d\theta} \int f(y \mid \theta)\, dy = \frac{d}{d\theta}\, 1 = 0

since f(y|θ) is a probability distribution. Hence, E(U) = 0.

Var(U): Differentiate (1) w.r.t. θ, and take expectations. If we can interchange the order of integration and differentiation, then

\int \frac{d}{d\theta}\left[ \frac{d \log f(y \mid \theta)}{d\theta}\, f(y \mid \theta) \right] dy = \frac{d^2}{d\theta^2} \int f(y \mid \theta)\, dy .

The right-hand side is 0, and the left-hand side is

\int \frac{d^2 \log f(y \mid \theta)}{d\theta^2}\, f(y \mid \theta)\, dy + \int \frac{d \log f(y \mid \theta)}{d\theta}\, \frac{d f(y \mid \theta)}{d\theta}\, dy .

Substituting (1) into the second term:

\int \frac{d^2 \log f(y \mid \theta)}{d\theta^2}\, f(y \mid \theta)\, dy + \int \left( \frac{d \log f(y \mid \theta)}{d\theta} \right)^2 f(y \mid \theta)\, dy = 0 .

Hence,

E\left( -\frac{d^2 \log f(y \mid \theta)}{d\theta^2} \right) = E\left[ \left( \frac{d \log f(y \mid \theta)}{d\theta} \right)^2 \right] .

Or, E(−U') = E(U²). And, since E(U) = 0, E(−U') = E(U²) = Var(U).
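These two results are easy to check by simulation. The sketch below (my own, with arbitrary parameter values) uses a normal distribution with known σ², for which the score with respect to μ is U = (y − μ)/σ² and U' = −1/σ²:

    import numpy as np

    # Monte Carlo check that E(U) = 0 and Var(U) = E(-U') for Y ~ N(mu, sigma^2)
    # with sigma^2 known, where U = dl/dmu = (y - mu)/sigma^2 and U' = -1/sigma^2.
    rng = np.random.default_rng(0)
    mu, sigma2 = 2.0, 3.0
    y = rng.normal(mu, np.sqrt(sigma2), size=100_000)

    U = (y - mu) / sigma2
    print(U.mean())              # close to 0
    print(U.var(), 1 / sigma2)   # Var(U) and E(-U') = 1/sigma^2 agree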

Since E(U) = 0,

E[a(y)] = -\frac{c'(\theta)}{b'(\theta)} .

Also, Var(U) = [b'(θ)]² Var(a(y)) and E(−U') = −b''(θ)E[a(y)] − c''(θ), so that

[b'(\theta)]^2\, \mathrm{Var}(a(y)) = -b''(\theta)\, E[a(y)] - c''(\theta)

or

\mathrm{Var}(a(y)) = \frac{b''(\theta)\, c'(\theta) - b'(\theta)\, c''(\theta)}{[b'(\theta)]^3} .

The meanings of these become clearer when they are written in the alternative form:

E(Y) = m'(\theta) \quad \text{and} \quad \mathrm{Var}(Y) = m''(\theta)\, h(\phi) .

The variance therefore depends on the natural parameter θ only through m(θ), and on a function h(φ) that
is independent of θ. h(φ) is commonly expressed as h(φ) = φ/w, where w is a known prior weight which can vary
between observations. φ is called the dispersion parameter, and is constant over observations.
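As an illustration of my own (not from the notes): for the binomial written for the proportion Y = (number of successes)/n we have θ = log(π/(1−π)), m(θ) = log(1 + e^θ) and h(φ) = 1/n, so m'(θ) = π and m''(θ) = π(1−π). A quick simulation confirms E(Y) = m'(θ) and Var(Y) = m''(θ)h(φ):

    import numpy as np

    # Check E(Y) = m'(theta) and Var(Y) = m''(theta) h(phi) for the binomial
    # proportion Y = (successes)/n, with theta = log(pi/(1-pi)) and h(phi) = 1/n.
    rng = np.random.default_rng(0)
    n, pi = 20, 0.3
    theta = np.log(pi / (1 - pi))

    m1 = np.exp(theta) / (1 + np.exp(theta))   # m'(theta)  = pi
    m2 = m1 * (1 - m1)                         # m''(theta) = pi (1 - pi)

    Y = rng.binomial(n, pi, size=200_000) / n  # simulated proportions
    print(Y.mean(), m1)                        # E(Y) vs m'(theta)
    print(Y.var(), m2 / n)                     # Var(Y) vs m''(theta) h(phi)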

Generalized Linear Models


Assume we have a set of N observations, Y1,...,YN, taken from a distribution in the exponential family,
where the Yi's are in the canonical form of the distribution, and each depends on a single parameter θi. Then

f(y_i \mid \theta_i) = \exp\left\{ \frac{y_i \theta_i - m(\theta_i)}{h(\phi)} + n(y_i, \phi) \right\} .

The joint p.d.f. of Y1,...,YN is therefore

f(y_1, \ldots, y_N \mid \theta_1, \ldots, \theta_N) = \exp\left\{ \frac{\sum_{i=1}^{N} y_i \theta_i - \sum_{i=1}^{N} m(\theta_i)}{h(\phi)} + \sum_{i=1}^{N} n(y_i, \phi) \right\} .

This has one parameter per observation, which does not provide an efficient summary of the data. We
therefore consider a smaller set of parameters β1,...,βp (where p<N) such that a linear combination of the
β's is equal to a function of the expected value (μi) of Yi, i.e.

g(\mu_i) = \eta_i = x_i^T \beta ,

where xi is a p×1 vector of explanatory variables, β is a p×1 vector of parameters, and g is a monotone
and differentiable function called the link function. As noted above, we often write h(φ) as φ/w, with
w known, and φ either known or estimated as a nuisance parameter.

This gives us all the components of a generalized linear model:


1. The random component. The yi's, whose distribution is specified.
2. The systematic component. The linear predictor, η, which is defined by the covariates, xi, and the
parameters, β.
3. The link function, which links the systematic component with the expected value of the random
component, i.e. E(yi) = μi = g⁻¹(ηi).
4. The dispersion. The residual variance (which in many cases is treated as known).
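To make components 1 to 3 concrete, here is a minimal simulation sketch of my own (the coefficient values are made up) for a binomial model with a logit link:

    import numpy as np

    # The three main components of a GLM, illustrated for a logistic (binomial) model.
    rng = np.random.default_rng(1)
    N = 100
    X = np.column_stack([np.ones(N), rng.normal(size=N)])  # covariates x_i (with an intercept)
    beta = np.array([-0.5, 1.2])                           # parameters (values made up)

    eta = X @ beta                  # systematic component: eta_i = x_i^T beta
    mu = 1 / (1 + np.exp(-eta))     # link: logit, so mu_i = g^{-1}(eta_i)
    y = rng.binomial(1, mu)         # random component: y_i ~ Binomial(1, mu_i)
    print(y.mean(), mu.mean())      # observed proportion tracks the mean of mu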

Examples

Normal Distribution
The random components are the yi's, with p.d.f.

f  y∣ , =exp
2

 y −½ 2 1 y 2
 2

2  2
log 2  2 

so that for N observations of y, with common variance, the joint p.d.f. is
  
N N N

∑ yi i − 12 ∑ i2 1
∑ y i2
f  y 1 , y N∣1 , N , 2 =exp − log 2  2 
i=1 i=1 i=1

 
2 2
2
The link function is the identity function, and the dispersion parameter is σ2.

Binomial Distribution
If we have N observations, the i-th consisting of yi successes out of ni trials, then the p.d.f. in the
alternative form is

f(y_1, \ldots, y_N \mid \pi_1, \ldots, \pi_N) = \exp\left\{ \sum_{i=1}^{N} \frac{ \frac{y_i}{n_i} \log\left( \frac{\pi_i}{1-\pi_i} \right) + \log(1-\pi_i) }{ 1/n_i } + \sum_{i=1}^{N} \log\binom{n_i}{y_i} \right\}

so that the link function is log(π/(1−π)), the logit link function, and the dispersion parameter is 1/ni.

We could, if we so wished, use a different link function, as long as it maps from [0, 1] to (−∞, +∞).
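For instance (my own sketch, not from the notes), the logit, probit and complementary log-log functions all map probabilities to the whole real line, and all are used in practice:

    import numpy as np
    from scipy import stats

    # Three link functions that map (0, 1) onto (-inf, +inf).
    p = np.array([0.1, 0.5, 0.9])

    logit = np.log(p / (1 - p))          # canonical (logit) link
    probit = stats.norm.ppf(p)           # probit link
    cloglog = np.log(-np.log(1 - p))     # complementary log-log link
    print(logit, probit, cloglog)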

Estimation - Maximum Likelihood.

The reason for discussing these models is that we want to fit them to data. It is sensible to do this in a
likelihood framework, and here we will focus exclusively on maximum likelihood methods.

Assume we have N observations y1,...,yN=y, and that we will fit a model with p parameters, θ1,...,θp=θ, and joint
p.d.f. f(y|θ). The likelihood is defined as L(θ|y) = f(y|θ), but normalising constants can be omitted for
brevity. It is also usually more convenient to work with the log-likelihood, l(θ|y) = log f(y|θ).
If the observations are independent, then

L(\theta \mid y) = \prod_{i=1}^{N} L(\theta \mid y_i), \quad \text{or equivalently} \quad l(\theta \mid y) = \sum_{i=1}^{N} l(\theta \mid y_i) .

This will be useful later.
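A one-line check of my own (illustration only, using a small normal sample with arbitrary parameter values) that independent contributions add on the log scale:

    import numpy as np
    from scipy import stats

    # For independent observations, the joint log-likelihood is the sum of the
    # individual log-likelihoods.
    rng = np.random.default_rng(0)
    mu, sigma = 0.5, 1.0
    y = rng.normal(mu, sigma, size=5)

    l_sum = stats.norm.logpdf(y, mu, sigma).sum()
    l_joint = np.log(stats.norm.pdf(y, mu, sigma).prod())
    print(l_sum, l_joint)   # equal up to floating point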

Our aim is now to find the ML estimates of the parameters, i.e. to find the vector θ̂ for which

L(\hat{\theta} \mid y) \ge L(\theta \mid y) \quad \forall\, \theta .

A useful property of ML estimates is their invariance property, i.e. the ML estimate of any function g(θ)
of θ is g(θ̂). Hence, we can work with the log-likelihood, and find θ̂ by maximising l(θ|y).

It is obvious that θ̂ can be found by solving the following p equations:

\frac{\partial l(\theta \mid y)}{\partial \theta_i} = 0 \quad \text{for } i = 1, \ldots, p .

Further, for this to be a maximum, the matrix of second derivatives,

\frac{\partial^2 l(\theta \mid y)}{\partial \theta_i\, \partial \theta_j} ,

evaluated at θ̂ should be negative definite. Note that ∂l/∂θ = U, the score discussed earlier, so the m.l.e. θ̂
satisfies the equation U(θ̂) = 0. As E(−U') = Var(U), we can approximate the variance-covariance
matrix of the m.l.e.s by the inverse of the matrix of expected negative second derivatives, i.e.

\mathrm{Var}(\hat{\theta}) \approx I(\hat{\theta})^{-1}, \quad \text{where} \quad I(\theta) = -E\left( \frac{\partial^2 l(\theta \mid y)}{\partial \theta_i\, \partial \theta_j} \right) .

I(θ) is known as Fisher's Information. The approximation relies on an assumption of asymptotic normality, which is
often reasonable because of the Central Limit Theorem(s).
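In practice these equations are usually solved iteratively. The sketch below (my own illustration, with made-up data) applies Fisher scoring, θ ← θ + U(θ)/I(θ), to a single binomial parameter π, for which U(π) = y/π − (n−y)/(1−π) and I(π) = n/[π(1−π)]; for this model the iteration lands on the closed-form m.l.e. y/n immediately.

    import numpy as np

    # Fisher scoring for a single binomial probability pi, with Var(pi_hat) ~ 1/I(pi_hat).
    y, n = 7, 20                 # observed successes and number of trials (made-up data)

    def score(pi):               # U(pi) = dl/dpi
        return y / pi - (n - y) / (1 - pi)

    def information(pi):         # I(pi) = -E(U')
        return n / (pi * (1 - pi))

    pi_hat = 0.5                 # starting value
    for _ in range(10):
        pi_hat += score(pi_hat) / information(pi_hat)

    print(pi_hat, y / n)                 # agrees with the closed-form m.l.e.
    print(1 / information(pi_hat))       # approximate Var(pi_hat) = I(pi_hat)^{-1}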

Example: Normal distribution

The log-likelihood for N observations yi drawn from a normal distribution with a common mean
μ and variance σ² is

l(\mu, \sigma^2 \mid y) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \sum_{i=1}^{N} \frac{(y_i - \mu)^2}{2\sigma^2} ,

so

\frac{\partial l}{\partial \mu} = \frac{\sum_{i=1}^{N} (y_i - \mu)}{\sigma^2} \quad \text{and} \quad \frac{\partial l}{\partial \sigma} = -\frac{N}{\sigma} + \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{\sigma^3} .

Equating the derivatives to zero gives

\hat{\mu} = \bar{y} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2 .

The second derivatives are

\frac{\partial^2 l}{\partial \mu^2} = -\frac{N}{\sigma^2}, \quad \frac{\partial^2 l}{\partial \mu\, \partial \sigma} = -\frac{2 \sum_{i=1}^{N} (y_i - \mu)}{\sigma^3}, \quad \frac{\partial^2 l}{\partial \sigma^2} = \frac{N}{\sigma^2} - \frac{3 \sum_{i=1}^{N} (y_i - \mu)^2}{\sigma^4} .

So, taking expectations and evaluating at (\hat{\mu}, \hat{\sigma}), we have

I(\theta) = \begin{bmatrix} \dfrac{N}{\hat{\sigma}^2} & 0 \\[1ex] 0 & \dfrac{2N}{\hat{\sigma}^2} \end{bmatrix} .

ML estimates for GLMs

The likelihood for a single observation from a GLM is L(θ|y) = exp{a(y)b(θ) + c(θ) + d(y)}, so the log-likelihood is
l(θ|y) = a(y)b(θ) + c(θ) + d(y) (d(y) does not depend on θ). For the single-parameter case, with N independent
observations yi, the m.l.e. θ̂ is found by solving

b'(\hat{\theta}) \sum_{i=1}^{N} a(y_i) + N c'(\hat{\theta}) = 0 .

This is a maximum because the second derivative is

b''(\hat{\theta}) \sum_{i=1}^{N} a(y_i) + N c''(\hat{\theta}) = -b''(\hat{\theta}) \frac{N c'(\hat{\theta})}{b'(\hat{\theta})} + N c''(\hat{\theta}) \qquad \left( \text{as } \sum_{i=1}^{N} a(y_i) = -\frac{N c'(\hat{\theta})}{b'(\hat{\theta})} \right) .

From much earlier,

\mathrm{Var}(a(y)) = \frac{b''(\theta)\, c'(\theta) - b'(\theta)\, c''(\theta)}{[b'(\theta)]^3} ,

and hence the second derivative is

-N\, [b'(\hat{\theta})]^2\, \mathrm{Var}(a(y)) \big|_{\hat{\theta}} \le 0 .

Consequently the stationary point is a (local) maximum and, if the parameter space is unrestricted, it is the
unique solution and hence the m.l.e.
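The score equation can also be solved numerically. As a sketch of my own (using the binomial form derived earlier, where a(y) = y, b(π) = log(π/(1−π)) and c(π) = n log(1−π), with simulated data), a one-dimensional root-finder recovers the closed-form m.l.e.:

    import numpy as np
    from scipy.optimize import brentq

    # Solve b'(theta) * sum a(y_i) + N c'(theta) = 0 for the binomial, where
    # b'(pi) = 1/[pi(1-pi)] and c'(pi) = -n/(1-pi).
    rng = np.random.default_rng(3)
    n, pi_true, N = 10, 0.4, 50
    y = rng.binomial(n, pi_true, size=N)    # simulated numbers of successes

    def score_equation(pi):
        b_prime = 1 / (pi * (1 - pi))
        c_prime = -n / (1 - pi)
        return b_prime * y.sum() + N * c_prime

    pi_hat = brentq(score_equation, 1e-6, 1 - 1e-6)
    print(pi_hat, y.sum() / (N * n))        # matches the closed-form m.l.e.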
