Generalized linear models are a central part of a statistician's toolbox. They cover topics such as linear
models (regression and ANOVA), log-linear models (count data), logistic regression (proportions) and
survival analysis. They also form the basis of more advanced analyses such as generalized linear mixed
models and hierarchical models.
The key development behind GLMs was to notice that several types of analysis had a similar
underlying structure, and that the distributions that were being fitted could be shown to be members of
a general class of distributions, the exponential family. The properties of this family can be used to
develop the model fitting, and then can be applied to the different specific cases. This explains why we
need the maths at the start, before moving on to the "proper" statistics later.
There are several books on GLMs available. I have used the following:
McCullagh, P. & Nelder, J.A. (1989) Generalized Linear Models, 2nd Ed. Chapman & Hall.
The all-time classic. Nelder was behind the development of the GLM ideas, and this is a
comprehensive book. Every professional statistician should have it on their shelf.
Dobson, A.J. (1990). An Introduction to Generalized Linear Models. Chapman & Hall.
Aimed at undergraduates, the first third of the book is about linear models and inference. There is
a second edition out now which I have not seen, but which looks more comprehensive.
Lindsey, J.K. (1997). Applying Generalized Linear Models. Springer.
An overview of GLMs and some of their offshoots. It covers a broad range, and explains the
basics nicely, but lacks depth.
If we have a continuous random variable Y taken from a distribution that is a member of the
exponential family, and it depends on a single parameter θ, then the probability density function for Y,
f(y|θ), can be written as

    f(y|θ) = exp{ a(y) b(θ) + c(θ) + d(y) }.

If a(y) = y then the distribution is said to be in canonical form, and b(θ) is called the natural
parameter. In this form, the distribution can also be written as

    f(y|θ) = exp{ [y k(θ) − m(θ)] / h(ϕ) + n(y, ϕ) },

which is sometimes more convenient (as we will see later).
Examples
Normal distribution
The p.d.f. is

    f(y|μ) = (1/√(2πσ²)) exp{ −(y − μ)²/(2σ²) },

which can be written as

    f(y|μ, σ²) = exp{ yμ/σ² − μ²/(2σ²) − y²/(2σ²) − ½ log(2πσ²) }.

This is in canonical form, with the natural parameter b(μ) = μ/σ², and the other terms are

    c(μ) = −μ²/(2σ²) − ½ log(2πσ²)  and  d(y) = −y²/(2σ²).

In the alternative formulation, h(ϕ) = σ², so that k(μ) = μ,

    m(μ) = μ²/2  and  n(y, ϕ) = −y²/(2σ²) − ½ log(2πσ²).
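As a quick sanity check, the decomposition above can be verified numerically. This is an illustrative sketch only; the parameter values are arbitrary, not taken from the notes:

```python
import numpy as np

# Check that the exponential-family decomposition of the normal p.d.f.
# reproduces the usual density formula (mu, sigma2, y chosen arbitrarily).
mu, sigma2, y = 1.3, 2.0, 0.7

# Direct density: (2*pi*sigma^2)^(-1/2) * exp(-(y - mu)^2 / (2*sigma^2))
direct = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Pieces: a(y) = y, b(mu) = mu/sigma^2,
# c(mu) = -mu^2/(2*sigma^2) - (1/2) log(2*pi*sigma^2), d(y) = -y^2/(2*sigma^2)
a = y
b = mu / sigma2
c = -mu**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)
d = -y**2 / (2 * sigma2)
decomposed = np.exp(a * b + c + d)

print(abs(direct - decomposed) < 1e-12)  # True
```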
Binomial Distribution
If Y is the number of successes in n independent trials, where π is the (constant) probability of
success, then

    f(y|π) = C(n, y) π^y (1−π)^(n−y),

where C(n, y) is the binomial coefficient. This can be written as

    f(y|π) = exp{ y log(π/(1−π)) + n log(1−π) + log C(n, y) },

which is in canonical form, with natural parameter b(π) = log(π/(1−π)).
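The same numerical check works here; a minimal sketch with made-up values of n, y and π:

```python
import math

# Check that the binomial p.m.f. equals its exponential-family rewriting
# (n, y, pi chosen arbitrarily for the illustration).
n, y, pi = 10, 3, 0.4

direct = math.comb(n, y) * pi**y * (1 - pi) ** (n - y)

# a(y) = y, b(pi) = log(pi/(1-pi)), c(pi) = n log(1-pi), d(y) = log C(n, y)
decomposed = math.exp(
    y * math.log(pi / (1 - pi)) + n * math.log(1 - pi) + math.log(math.comb(n, y))
)

print(abs(direct - decomposed) < 1e-12)  # True
```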
Some Results
Expected Value and Variance of a(Y)
Let l = l(θ|y) = log f(y|θ), i.e. the log-likelihood, and define U = dl/dθ. U is called the score, and
Var(U) is called the information. We need a couple of results, derived below under "Moments of U":
E(U) = 0 and Var(U) = E(−U').

Differentiating the log-likelihood,

    U = dl/dθ = a(y) b'(θ) + c'(θ)  and  U' = d²l/dθ² = a(y) b''(θ) + c''(θ),

and hence E(U) = b'(θ) E[a(y)] + c'(θ). But E(U) = 0, so

    E[a(y)] = −c'(θ) / b'(θ).

Also,

    Var(U) = [b'(θ)]² Var(a(y))  and  E(−U') = −b''(θ) E[a(y)] − c''(θ),

so that

    [b'(θ)]² Var(a(y)) = −b''(θ) E[a(y)] − c''(θ),

or, substituting E[a(y)] = −c'(θ)/b'(θ),

    Var(a(y)) = [b''(θ) c'(θ) − b'(θ) c''(θ)] / [b'(θ)]³.

Moments of U
To find the moments we use the identity

    d log f(y|θ)/dθ = [1/f(y|θ)] df(y|θ)/dθ.    (1)

E(U): Take expectations:

    E(U) = ∫ [d log f(y|θ)/dθ] f(y|θ) dy = ∫ [df(y|θ)/dθ] dy.

Under suitable regularity conditions,

    ∫ [df(y|θ)/dθ] dy = (d/dθ) ∫ f(y|θ) dy = (d/dθ) 1 = 0.

Var(U): Differentiating ∫ [d log f(y|θ)/dθ] f(y|θ) dy = 0 with respect to θ, and using (1) again,
gives

    ∫ [d² log f(y|θ)/dθ²] f(y|θ) dy + ∫ [d log f(y|θ)/dθ]² f(y|θ) dy = 0.

Hence,

    E[−d² log f(y|θ)/dθ²] = E{[d log f(y|θ)/dθ]²},

or E(−U') = E(U²). And, since E(U) = 0, E(−U') = E(U²) = Var(U).
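These formulae can be checked on a concrete member of the family. The sketch below works through the binomial, where b(π) = log(π/(1−π)) and c(π) = n log(1−π), and confirms that the general expressions reduce to the familiar mean nπ and variance nπ(1−π); the values of n and π are arbitrary:

```python
# For the binomial: b(pi) = log(pi/(1-pi)), c(pi) = n log(1-pi), a(y) = y.
# Then E[a(y)] = -c'/b' and Var(a(y)) = (b''c' - b'c'') / (b')^3 should
# reduce to n*pi and n*pi*(1-pi).
n, pi = 12, 0.3

b1 = 1 / (pi * (1 - pi))                       # b'(pi)
b2 = (2 * pi - 1) / (pi ** 2 * (1 - pi) ** 2)  # b''(pi)
c1 = -n / (1 - pi)                             # c'(pi)
c2 = -n / (1 - pi) ** 2                        # c''(pi)

mean = -c1 / b1
var = (b2 * c1 - b1 * c2) / b1 ** 3

print(abs(mean - n * pi) < 1e-9)            # True: E(Y) = n*pi
print(abs(var - n * pi * (1 - pi)) < 1e-9)  # True: Var(Y) = n*pi*(1-pi)
```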
The meanings of these become clearer when they are written in the alternative form (with the density
parameterised by the natural parameter, so that k(θ) = θ):

    E(Y) = m'(θ)  and  Var(Y) = m''(θ) h(ϕ).

The variance therefore depends on θ only through m(θ), and on a function h(ϕ) that is independent
of θ. h(ϕ) is commonly expressed as h(ϕ) = ϕ/w, where w is a known prior weight which can vary
between observations. ϕ is called the dispersion parameter, and is constant over observations.
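A quick Monte Carlo check of these moment formulae for the normal case (where k(μ) = μ, m(μ) = μ²/2 and h(ϕ) = σ²), with arbitrary parameter values and seed:

```python
import numpy as np

# Monte Carlo check of E(Y) = m'(theta) and Var(Y) = m''(theta) h(phi) for
# the normal case: m(mu) = mu^2/2, so m' = mu and m'' = 1, with h(phi) = sigma^2.
mu, sigma2 = 1.5, 4.0
rng = np.random.default_rng(0)
y = rng.normal(mu, np.sqrt(sigma2), 200_000)

m1 = mu   # m'(mu) = mu
m2 = 1.0  # m''(mu) = 1

print(abs(y.mean() - m1) < 0.05)          # E(Y) close to m'
print(abs(y.var() - m2 * sigma2) < 0.15)  # Var(Y) close to m'' h(phi)
```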
Now suppose we have independent observations Y1, ..., YN, each from the same exponential family
distribution written in terms of its natural parameter, so that

    f(y_i|θ_i) = exp{ [y_i θ_i − m(θ_i)] / h(ϕ) + n(y_i, ϕ) }.

The joint p.d.f. of Y1, ..., YN is therefore

    f(y_1, ..., y_N | θ_1, ..., θ_N) = exp{ [Σ_i y_i θ_i − Σ_i m(θ_i)] / h(ϕ) + Σ_i n(y_i, ϕ) }.

This has one parameter per observation, which does not provide an efficient summary of the data. We
therefore consider a smaller set of parameters β_1, ..., β_p (where p < N) such that a linear combination of the
β's is equal to a function of the expected value (μ_i) of Y_i, i.e.

    g(μ_i) = η_i = x_i^T β,

where x_i is a p×1 vector of explanatory variables, β is a p×1 vector of parameters and g is a monotone
and differentiable function called the link function. As noted above, we often write h(ϕ) as ϕ/w, with
w known, and ϕ either known or estimated as a nuisance parameter.
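The structural part of the model can be sketched in a few lines: build the linear predictor η = Xβ, then apply the inverse link to get the fitted means. The design matrix, coefficients and choice of logit link below are illustrative assumptions, not values from the notes:

```python
import numpy as np

# The structural part of a GLM: eta_i = x_i^T beta, then mu_i = g^{-1}(eta_i).
# Here g is the logit link; X and beta are made-up illustrative values.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])       # N x p design matrix (intercept + covariate)
beta = np.array([-1.0, 0.8])

eta = X @ beta                   # linear predictor, one value per observation
mu = 1 / (1 + np.exp(-eta))      # inverse logit maps the real line to (0, 1)

print(np.all((mu > 0) & (mu < 1)))  # True: each fitted mean is a probability
```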
Examples
Normal Distribution
The random components are the y_i's, with p.d.f.

    f(y|μ, σ²) = exp{ (yμ − ½μ²)/σ² − y²/(2σ²) − ½ log(2πσ²) },

so that for N observations of y, with common variance, the joint p.d.f. is

    f(y_1, ..., y_N | μ_1, ..., μ_N, σ²) = exp{ [Σ_i y_i μ_i − ½ Σ_i μ_i²]/σ² − Σ_i y_i²/(2σ²) − (N/2) log(2πσ²) }.

The link function is the identity function, and the dispersion parameter is σ².
Binomial Distribution
If we have N observations of n_i trials, with y_i successes in the ith trial, then the p.d.f. in the
alternative form is

    f(y|π) = exp{ Σ_i [ (y_i/n_i) log(π_i/(1−π_i)) + log(1−π_i) ] / (1/n_i) + Σ_i log C(n_i, y_i) },

so that the link function is log(π/(1−π)), the logit link function, and the dispersion parameter is 1/n_i.

We could, if we so wished, use a different link function, as long as it mapped from [0, 1] to
(−∞, +∞).
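The logit link and its inverse illustrate the required mapping; any monotone differentiable function from (0, 1) onto the real line would do equally well (the probit, for instance). A minimal sketch:

```python
import math

# The logit link maps (0, 1) onto the whole real line, and its inverse maps
# back; any monotone differentiable map with this property could serve.
def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

pi = 0.85
print(abs(inv_logit(logit(pi)) - pi) < 1e-12)  # True: round trip recovers pi
print(logit(0.001) < -6 and logit(0.999) > 6)  # True: tails stretch out
```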
The reason for discussing these models is that we want to fit them to data. It is sensible to do this in a
likelihood framework, and here we will focus exclusively on maximum likelihood methods.
Assume we have N observations y_1, ..., y_N = y, and we will fit a model with p parameters θ_1, ..., θ_p = θ,
with joint p.d.f. f(y|θ). The likelihood is defined as L(θ|y) = f(y|θ), but normalising constants can be
omitted for brevity. It is also usually more convenient to work with the log-likelihood, l(θ|y) = ln f(y|θ).

If the observations are independent, then L(θ|y) = Π_i L(θ|y_i), or equivalently
l(θ|y) = Σ_i l(θ|y_i). This will be useful later.

Our aim is now to find the ML estimates of the parameters, i.e. find the vector θ̂ for which
L(θ̂|y) ≥ L(θ|y), ∀θ. A useful property of ML estimates is their invariance property: the ML
estimate of any function g(θ) of θ is g(θ̂). Hence, we can work with the log-likelihood, and
estimate θ by maximising l(θ|y).
The log-likelihood for N observations y_i drawn from a normal distribution with a common mean
μ and variance σ² is

    l(μ, σ|y) = −(N/2) log(2π) − N log σ − Σ_i (y_i − μ)²/(2σ²),

so

    ∂l/∂μ = Σ_i (y_i − μ)/σ²  and  ∂l/∂σ = −N/σ + Σ_i (y_i − μ)²/σ³.

Equating the derivatives to zero gives

    μ̂ = ȳ  and  σ̂² = (1/N) Σ_i (y_i − ȳ)².
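These closed-form estimates can be checked against the log-likelihood directly: at the estimates, small perturbations should always make the negative log-likelihood worse. A sketch using simulated data (the seed and true parameters are arbitrary):

```python
import numpy as np

# Check the closed-form ML estimates against small perturbations of the
# log-likelihood, using simulated data (seed and parameters arbitrary).
rng = np.random.default_rng(42)
y = rng.normal(loc=2.0, scale=1.5, size=200)
N = len(y)

mu_hat = y.mean()                          # mu_hat = ybar
sigma2_hat = ((y - mu_hat) ** 2).mean()    # note the divisor N, not N - 1

def negloglik(mu, sigma2):
    return 0.5 * N * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / (2 * sigma2)

best = negloglik(mu_hat, sigma2_hat)
perturbed = [negloglik(mu_hat + dm, sigma2_hat + ds)
             for dm in (-0.05, 0.05) for ds in (-0.05, 0.05)]
print(all(p > best for p in perturbed))  # True: (mu_hat, sigma2_hat) is a maximum
```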
The second derivatives are

    ∂²l/∂μ² = −N/σ²,  ∂²l/∂μ∂σ = −(2/σ³) Σ_i (y_i − μ),  ∂²l/∂σ² = N/σ² − (3/σ⁴) Σ_i (y_i − μ)².

So, when taking expectations and evaluating at (μ̂, σ̂), we have

    I = [ N/σ̂²    0
          0        2N/σ̂² ].
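The expected information can be checked by Monte Carlo: averaging the negative Hessian of the log-likelihood over many simulated samples should approach diag(N/σ², 2N/σ²). A sketch with arbitrary parameter values:

```python
import numpy as np

# Monte Carlo check of the expected information for the normal model:
# the average negative Hessian should be approximately diag(N/s^2, 2N/s^2).
mu, sigma = 0.0, 2.0
N, reps = 50, 5000
rng = np.random.default_rng(1)

acc = np.zeros((2, 2))
for _ in range(reps):
    r = rng.normal(mu, sigma, N) - mu
    d2_mumu = N / sigma ** 2                                       # -d2l/dmu2
    d2_musig = 2 * r.sum() / sigma ** 3                            # -d2l/dmu dsigma
    d2_sigsig = -N / sigma ** 2 + 3 * (r ** 2).sum() / sigma ** 4  # -d2l/dsigma2
    acc += np.array([[d2_mumu, d2_musig], [d2_musig, d2_sigsig]])

avg = acc / reps
expected = np.array([[N / sigma ** 2, 0.0], [0.0, 2 * N / sigma ** 2]])
print(np.allclose(avg, expected, atol=1.0))  # close to the expected information
```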
The likelihood for a GLM is L(θ|y) = exp{ a(y) b(θ) + c(θ) + d(y) }, so the log-likelihood is
l(θ|y) = a(y) b(θ) + c(θ) + d(y) (d(y) is a constant). For the single-parameter case, with the y_i's
independent, the ML estimate can be calculated from

    b'(θ) Σ_i a(y_i) + N c'(θ) = 0.

This is a maximum because the second derivative is

    b''(θ) Σ_i a(y_i) + N c''(θ) = −N b''(θ) c'(θ)/b'(θ) + N c''(θ)

(as Σ_i a(y_i) = −N c'(θ)/b'(θ) at the solution). From much earlier,
Var(a(y)) = [b''(θ) c'(θ) − b'(θ) c''(θ)] / [b'(θ)]³, and hence the second derivative is
−N [b'(θ)]² Var(a(y)), which is negative, confirming a maximum.
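The score equation can be solved numerically for a concrete case. A minimal sketch for N binomial observations with a common π (the data are made up), solved by bisection; the root should be the obvious estimate π̂ = Σ y_i / (N n):

```python
# Solve the score equation b'(pi) * sum a(y_i) + N * c'(pi) = 0 for the
# binomial by bisection; the root should be pi_hat = sum(y_i) / (N * n).
n = 10                  # trials per observation
ys = [3, 5, 4, 6, 2]    # made-up success counts
N = len(ys)

def score(pi):
    b1 = 1 / (pi * (1 - pi))   # b'(pi)
    c1 = -n / (1 - pi)         # c'(pi)
    return b1 * sum(ys) + N * c1

lo, hi = 1e-6, 1 - 1e-6        # score is decreasing in pi on (0, 1)
for _ in range(100):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
pi_hat = (lo + hi) / 2

print(abs(pi_hat - sum(ys) / (N * n)) < 1e-9)  # True: root is ybar / n
```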