
Generalized Linear Models

Ariel Alonso Abad

Interuniversity Institute for Biostatistics and statistical Bioinformatics, KU Leuven

Alonso, A. GLM 1 / 568

Outline

1 Description of the course
2 Statistical inference I
3 Statistical inference II
4 Linear Regression
5 Logistic Regression I
6 Logistic Regression II
7 Logistic Regression III
8 Poisson Regression
9 Generalized linear models

Alonso, A. GLM 2 / 568


Teaching Plan

Lecture 1: Statistical inference I

Lecture 2: Statistical inference II

Lecture 3: Linear Regression

Lecture 4: Logistic Regression I

Lecture 5: Logistic Regression II

Lecture 6: Logistic Regression III

Lecture 7: Poisson Regression

Lecture 8: Generalized linear models

Lecture 9: Project
Alonso, A. GLM 3 / 568

Tutorial groups

After the first lecture, students organize themselves in tutorial groups within 5 days

Email Prof. Alonso a list with the members of each group (names and student numbers)

One project is assigned to each group

All assignments are on Toledo → Course Documents

Report with a detailed discussion of the analysis (no more than 10 pages)

Alonso, A. GLM 4 / 568


Evaluation

Oral and written:

Written: Project

Oral: Questions about project and course

All information about the course on Toledo

Alonso, A. GLM 5 / 568

Aims of the course

Extend the linear regression model to models that can handle binary, count, and other non-Gaussian responses

Deal with overdispersion in the data

Analyses will be done with the R software

R programs are given in the slides

Alonso, A. GLM 6 / 568


References

Agresti (1990). Categorical Data Analysis. John Wiley & Sons

Faraway (2000). Linear Models with R. Chapman & Hall/CRC

Fahrmeir, Kneib, Lang and Marx (2013). Regression: Models, Methods and Applications. Springer

McCullagh and Nelder (1989). Generalized Linear Models (2nd Edition). Chapman & Hall/CRC

Morel and Neerchal (2012). Overdispersion Models in SAS. SAS Press

Alonso, A. GLM 7 / 568



Outline

1 Description of the course
2 Statistical inference I
3 Statistical inference II
4 Linear Regression
5 Logistic Regression I
6 Logistic Regression II
7 Logistic Regression III
8 Poisson Regression
9 Generalized linear models

Alonso, A. GLM 9 / 568

Statistical Inference

Ariel Alonso Abad

Catholic University of Leuven

Alonso, A. Inference I 10 / 568


Randomness

Randomness: The occurrence of events for which no final cause is found

Physics: Quantum Mechanics

Biology: Theory of Evolution

Psychology: Synchronicity

Statistics and Mathematics

Alonso, A. Inference I 11 / 568

Frequency vs. probability distribution


Intelligence quotient
The intelligence quotient (IQ) of 10,000 children is determined. The results are shown in a histogram with class width 10 IQ points.

[Figure: histogram of the IQ scores; x-axis: intelligence (0–200), y-axis: frequency density]

Alonso, A. Inference I 12 / 568



Frequency vs. probability distribution


Probability distributions
If the number of observations increases indefinitely, then one can make the class width smaller and smaller; in the limit a smooth curve arises.

Normal distribution
Student's t-distribution
Chi-squared distribution
F-distribution

[Figure: histogram of the IQ scores with the limiting smooth density curve superimposed]

Alonso, A. Inference I 12 / 568


Normal distribution: Two parameters (μ, σ)

f(x) = (1 / (σ√(2π))) e^{−(x−μ)² / (2σ²)}

Two parameters (μ, σ)
Symmetric
Mode = Median = Mean

[Figure: normal density with the areas per σ-band: 34.1% between μ and μ ± 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, 0.1% beyond 3σ]

Alonso, A. Inference I 13 / 568

Normal distribution: Two parameters (μ, σ)


Two parameters (μ, σ)
Symmetric
Mode = Median = Mean

[Figure: normal density; about 68% of the probability lies within μ ± 1σ and about 95% within μ ± 2σ]

Alonso, A. Inference I 13 / 568


Normal distribution

In the eighteenth century, mathematicians already found out that a large number of measurements often showed a behavior that follows a pattern similar to this Gaussian distribution

The maximum temperature on August 5 in De Bilt, measured over several years

Body length

The intelligence, as measured by the IQ, of a large group of subjects of the same age

Why?

Alonso, A. Inference I 14 / 568

The calculation of probabilities


Stanford–Binet IQ scores: normally distributed with μ = 100, σ = 15

[Figure: density of the Stanford–Binet IQ scores]

Alonso, A. Inference I 15 / 568


The calculation of probabilities

Stanford–Binet IQ scores: normally distributed with μ = 100, σ = 15

High school: P(97 ≤ IQ ≤ 115) = 0.4206045

University: P(115 ≤ IQ ≤ 125) = 0.1108649

PhD: P(IQ ≥ 125) = 0.04779035

Stephen Hawking, Garry Kasparov, Albert Einstein, Judit Polgar: P(IQ ≥ 160) = 0.00003

[Figure: Stanford–Binet IQ density with the corresponding regions shaded]
Alonso, A. Inference I 15 / 568
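These probabilities can be reproduced with pnorm(); the sketch below is my own illustration using the stated μ = 100 and σ = 15.

## Sketch: reproducing the IQ probabilities with pnorm() (mu = 100, sd = 15)
mu <- 100; sigma <- 15
pnorm(115, mu, sigma) - pnorm(97, mu, sigma)    # P(97 <= IQ <= 115)  ~ 0.4206
pnorm(125, mu, sigma) - pnorm(115, mu, sigma)   # P(115 <= IQ <= 125) ~ 0.1109
1 - pnorm(125, mu, sigma)                       # P(IQ >= 125)        ~ 0.0478
1 - pnorm(160, mu, sigma)                       # P(IQ >= 160)        ~ 0.00003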


BMI: Body mass index

Penman and Johnson, Prev Chronic Dis. 2006; 3(3):A74


Alonso, A. Inference I 16 / 568

Birth weight

O’Cathain A., British Medical Journal 2002; 324, 643–646.


Alonso, A. Inference I 17 / 568
Inferential Statistics

Inferential Statistics
Inferential statistics makes quantitative statements about the characteristics of a population, but...

It is often impossible to examine an entire population

Thus, conclusions frequently have to be drawn, or decisions taken, about the entire population based on the results of a sample

Alonso, A. Inference I 18 / 568

The frequentist approach

It is associated with the frequentist interpretation of probability

Conceptual framework: any given experiment can be considered as one of an infinite sequence of possible repetitions of the same experiment, each capable of producing statistically independent results

Frequentist inference requires that the correct conclusion should be drawn with a given (high) probability, among this notional set of repetitions

In a frequentist approach unknown parameters are often, but not always, treated as having fixed but unknown values that are not capable of being treated as random variables in any sense

Alonso, A. Inference I 19 / 568


Notation

To simplify notation we will use y for both the random variable and the realized value

In the general case one has a sample of n random variables: y = (y_1, . . . , y_n)

Suppose that the y_i (i = 1, . . . , n) have a distribution/density p(y | θ). For the moment, let us assume that dim(θ) = 1

Suppose also that the y_i are conditionally independent given θ, i.e.

p(y | θ) = ∏_{i=1}^n p(y_i | θ)

The purpose is to draw inference about θ


Alonso, A. Inference I 20 / 568

Point estimation

Point estimation: Involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter

Formally, θ̂(y) is a random variable that aims to estimate the parameter θ

Alonso, A. Inference I 21 / 568


Point estimation

The point estimator θ̂(y) has a sampling distribution, resulting from the sampling distribution of y

A good estimator should have a sampling distribution concentrated close to the true value θ

Popular measure of closeness: the mean squared error (MSE) of θ̂(y), defined as

MSE[θ̂(y)] = E_{y|θ}[(θ̂(y) − θ)²] = Var_{y|θ}[θ̂(y)] + bias[θ̂(y)]²

Alonso, A. Inference I 22 / 568

Point estimation: Bias

The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated

bias[θ̂(y)] = E_{y|θ}[θ̂(y)] − θ

Important to notice: the expectations are taken wrt the sampling distribution of y given the true value of θ

An estimator is called unbiased when bias[θ̂(y)] = 0

Alonso, A. Inference I 23 / 568


Point estimation: Bias

Suppose that the y_i (i = 1, . . . , n) are independent and identically distributed (i.i.d.)

Interest is in estimating the parameters μ = E(y) and σ² = Var(y) = E(y − μ)²

Point estimators: ȳ = (1/n) ∑_{i=1}^n y_i and S² = (1/n) ∑_{i=1}^n (y_i − ȳ)²

E(ȳ) = μ and E(S²) = ((n − 1)/n) σ² < σ²

Alonso, A. Inference I 24 / 568
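A small simulation makes the downward bias of S² concrete. The sketch below is my own illustration (not from the slides), drawing many samples of size n = 10 from a standard normal and comparing the average of S² (divisor n) against the theoretical value (n − 1)/n · σ².

## Sketch: bias of the variance estimator with divisor n (own illustration)
set.seed(123)
n <- 10
nsim <- 50000
s2 <- replicate(nsim, {
  y <- rnorm(n)                  # true sigma^2 = 1
  mean((y - mean(y))^2)          # S^2 with divisor n
})
mean(s2)                         # close to (n - 1)/n = 0.9, i.e., biased downwards
(n - 1) / n                      # theoretical expectation of S^2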

Point estimation: Bias

All else being equal, an unbiased estimator is often preferable to a biased estimator

In practice all else is not equal, and biased estimators are frequently used

Unbiasedness of estimators is a debatable property, since biased estimators may have a smaller MSE than unbiased estimators

Alonso, A. Inference I 25 / 568


Bias: Estimating a Poisson probability

The Poisson distribution may be useful to model events such as

The number of meteorites greater than 1 meter in diameter that strike Earth in a year

The number of patients arriving in an emergency room between 10 and 11 pm

The number of incoming calls at a telephone switchboard per minute

y ∼ P(y | λ), where P(y | λ) = λ^y e^{−λ} / y! and y = 0, 1, 2, . . .

E(y) = Var(y) = λ

Alonso, A. Inference I 26 / 568

Bias: Estimating a Poisson probability

Suppose one wants to estimate θ = P(y = 0 | λ)² = e^{−2λ} with sample size one

When incoming calls per minute at a telephone switchboard are modeled as a Poisson process with λ denoting the average number of calls per minute, then θ = e^{−2λ} is the probability that no calls arrive in the next two minutes

The only unbiased estimator is

θ̂(y) = (−1)^y, i.e., 1 if y is even and −1 if y is odd

Alonso, A. Inference I 27 / 568
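This estimator is unbiased but useless as a point estimate, since it only takes the values ±1. The short simulation below is my own sketch (with an assumed λ = 1.5), contrasting it with the biased but sensible plug-in estimator e^{−2y}.

## Sketch: unbiased estimator (-1)^y versus plug-in exp(-2y) for theta = exp(-2*lambda)
set.seed(1)
lambda <- 1.5                      # assumed true value, for illustration only
theta  <- exp(-2 * lambda)
y      <- rpois(100000, lambda)    # many samples of size one
mean((-1)^y)                       # average is close to theta: unbiased
mean(exp(-2 * y))                  # clearly biased, but always a valid probability
theta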


Important frequentist concepts

Consistency: If θ̂_n is an estimator of θ based on a random sample of size n, then we say that θ̂_n is consistent for θ when

lim_{n→∞} P(|θ̂_n − θ| > ε) = 0 for any ε > 0

Law of Large Numbers: If y_1, y_2, . . . , y_n are independent and identically distributed (i.i.d.) with mean μ = E(y), then the sample mean ȳ_n = (1/n) ∑_{i=1}^n y_i is consistent for μ

Alonso, A. Inference I 28 / 568

Bias versus consistency

Unbiased but not consistent: y_i with i = 1, . . . , n and E(y_i) = μ. The estimator μ̂(y) = T(y_1, y_2, . . . , y_n) = y_1 is unbiased but not consistent

Biased but consistent: the sample variance S² = (1/n) ∑_{i=1}^n (y_i − ȳ)²

The sample variance is asymptotically unbiased, i.e., the bias goes to zero as n increases. However, one can have a consistent, biased estimator whose bias does not go to zero with n

Alonso, A. Inference I 29 / 568


Sample means

Trying to understand the behavior of sample means

Are there certain regularities or laws?

To gain insight into the behavior of the sample mean we will study
how sample means from a known population behave

From a known population draw several samples of n units and look at the
averages of those samples

Alonso, A. Inference I 30 / 568

Sample means

Simulations: from each of four parent distributions (normal, gamma, uniform and beta), draw 20 000 samples of n units and compute the corresponding 20 000 sample means

X_1, X_2, . . . , X_n,   X̄ = (X_1 + X_2 + · · · + X_n)/n,   X̄_1, X̄_2, . . . , X̄_20000

[Figures: histograms of the 20 000 sample means for n = 4, n = 9 and n = 25; as n grows the histograms become narrower and increasingly bell-shaped, whatever the parent distribution]
Alonso, A. Inference I 31 / 568
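The R sketch below (my own illustration) reproduces this kind of experiment for one parent distribution; the gamma parameters are assumptions, not necessarily those used for the slides.

## Sketch: sampling distribution of the mean for a gamma parent (assumed parameters)
set.seed(42)
nsim <- 20000
par(mfrow = c(1, 3))
for (n in c(4, 9, 25)) {
  xbar <- replicate(nsim, mean(rgamma(n, shape = 2, rate = 1)))
  hist(xbar, breaks = 50, freq = FALSE,
       main = paste("n =", n), xlab = "sample mean")
}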

Central limit theorem (CLT)

CLT: Univariate
If y_1, y_2, . . . , y_n are i.i.d. random variables with mean μ = E(y) and variance σ², then √n (ȳ_n − μ) converges in distribution to N(0, σ²), and we write

√n (ȳ_n − μ) →d N(0, σ²)

or also

ȳ_n →d N(μ, σ²/n) as n → ∞

CLT: Multivariate
If y_1, y_2, . . . , y_n are i.i.d. random vectors with mean μ = E(y) and covariance matrix Σ, then √n (ȳ_n − μ) converges in distribution to a multivariate normal distribution, and we write

√n (ȳ_n − μ) →d N(0, Σ)
Alonso, A. Inference I 32 / 568


Why is the world normal?

Body length
Genetic factors

Economic factors

...

Body length = Average Effect

Alonso, A. Inference I 33 / 568

Delta Method

Let y_n = (y_{1n}, . . . , y_{pn})^T be a sequence of random vectors with E(y_n) = μ and √n (y_n − μ) →d z

Further consider the map ψ: R^p → R^k, i.e., ψ(y) = (ψ_1(y), . . . , ψ_k(y))^T, with J(y) = ∂ψ/∂y = (∂ψ_i/∂y_j) (a k × p matrix of derivatives) the Jacobian matrix of the transformation ψ. The delta method gives the asymptotic distribution of the new random sequence ψ(y_n) as

√n ( ψ(y_n) − ψ(μ) ) →d J(μ) z

If z ∼ N_p(0, Σ), then

√n ( ψ(y_n) − ψ(μ) ) →d N_k( 0, J(μ) Σ J(μ)^T )
Alonso, A. Inference I 34 / 568
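As a quick numerical check of this result, the sketch below (my own example, with assumed values) applies the univariate delta method to ψ(x) = log(x) for a sample mean, comparing the analytic variance J(μ)² · σ²/n with msm::deltamethod, the same function used later in the Titanic example.

## Sketch: delta method for psi(x) = log(x), own example with assumed values
library(msm)
mu    <- 5                 # assumed mean of y
sigma <- 2                 # assumed standard deviation of y
n     <- 50
var_mean <- sigma^2 / n    # Var(ybar)
(1 / mu)^2 * var_mean                                    # analytic: J(mu)^2 * Var(ybar)
deltamethod(~ log(x1), mean = mu, cov = var_mean)^2      # same value via msm (returns the SE)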


CLT Example: Bernoulli Distribution

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable y which takes the value 1 with probability π and the value 0 with probability 1 − π

The probability mass function f of this distribution is

f(k; π) = π^k (1 − π)^{1−k} for k ∈ {0, 1}

E(y) = π and Var(y) = π(1 − π)

Alonso, A. Inference I 35 / 568

Example: Bernoulli Distribution

If y_1, . . . , y_n are i.i.d. random variables, all Bernoulli trials with success probability π, then their sum is distributed according to a binomial distribution with parameters n and π

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The binomial probability mass function f gives the probability of getting exactly k successes in n trials

f(k, n, π) = P(y = k) = (n choose k) π^k (1 − π)^{n−k}

E(y) = nπ and Var(y) = nπ(1 − π)

Alonso, A. Inference I 36 / 568


Normal approximation

The binomial distribution is the sum of i.i.d. Bernoulli random variables

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The Central Limit Theorem asserts that if n is large enough, then approximately

y ∼ N( nπ, nπ(1 − π) )

Alonso, A. Inference I 37 / 568

Normal approximation

y ∼ B(n = 100, π = 0.4) thus approximately y ∼ N (40, 24)

## Binomial normal approximation
x1 <- 36:45                      # central values, highlighted in black
x2 <- c(25:35, 46:55)            # tail values, drawn in red
x1x2 <- seq(25, 55, by = .01)    # grid for the normal density

## Binomial probabilities as vertical bars, normal density used only to set up the frame
plot(x1x2, dnorm(x1x2, 40, sqrt(24)), type = "n", xlab = "x",
     ylab = "Binomial Probability")
lines(x2, dbinom(x2, 100, .4), type = "h", col = 2)
lines(x1, dbinom(x1, 100, .4), type = "h", lwd = 2)

## Same plot with the approximating N(40, 24) density drawn as a curve
plot(x1x2, dnorm(x1x2, 40, sqrt(24)), type = "l", xlab = "x",
     ylab = "Binomial Probability")
lines(x2, dbinom(x2, 100, .4), type = "h", col = 2)
lines(x1, dbinom(x1, 100, .4), type = "h", lwd = 2)

Alonso, A. Inference I 38 / 568


Normal approximation

[Figure: Bin(100, 0.4) probabilities shown as vertical bars with the approximating N(40, 24) density overlaid]

Alonso, A. Inference I 39 / 568

Delta Method Example: Titanic

On 10 April 1912 the largest passenger steamship in the world left Southampton, England, for New York City. At 23:40 on 14 April, it struck an iceberg and sank at 2:20 the following morning, resulting in the deaths of 1,517 people in one of the deadliest peacetime maritime disasters in history.
Alonso, A. Inference I 40 / 568

Odds of Survival

Let y_1, . . . , y_n be i.i.d. random variables, all Bernoulli trials with success probability π = P(y_i = 1), where y_i = 1 if passenger i survived and zero otherwise

Odds of surviving

Θ_survival = π / (1 − π)

⇒ If Θ_survival = 2, then for every 2 passengers that survive 1 dies

⇒ If Θ_survival = 0.5, then for every 1 passenger that survives 2 die

The term comes from horse racing

Alonso, A. Inference I 41 / 568


Delta Method Example: Titanic

If y_1, . . . , y_n are i.i.d. random variables, all Bernoulli trials with success probability π = P(y_i = 1), then

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The Central Limit Theorem implies

π̂ ∼ N( π, π(1 − π)/n )

The point estimate for the odds of surviving is

Θ̂_survival = π̂ / (1 − π̂)
Alonso, A. Inference I 42 / 568

Delta Method Example: Titanic

Consider now the function

g(x) = ln( x / (1 − x) ); it is easy to show that g′(x) = 1 / (x(1 − x))

Applying the delta method one gets that asymptotically

ln(Θ̂_survival) ∼ N( ln(Θ_survival), g′(π)² · π(1 − π)/n )

Thus, asymptotically, Var[ ln(Θ̂_survival) ] = 1 / (π(1 − π) n)

Note: It can be shown that no unbiased estimator of the log-odds exists
Alonso, A. Inference I 43 / 568
Delta Method Example: Titanic

> install.packages("msm")
> library(msm)
> titanic.path="C:/Equizo/Courses/KULeuven/titanicmissing.txt"
> titanic<- read.table(titanic.path, header=T, sep=",")
> head(titanic,5)
survived pclass sex age
1 1 1st 0 29.0000
2 0 1st 0 2.0000
3 0 1st 1 30.0000
4 0 1st 0 25.0000
5 1 1st 1 0.9167
> p_survival=mean(titanic$survived)
> log_survival_odds=log(p_survival/(1-p_survival))
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n
> se_odds_delta=deltamethod(g=~log(x1/(1-x1)), mean=p_survival, cov=var_p_survival)
> se_odds_delta^2
[1] 0.003384579
> log_survival_odds
[1] -0.6545499
> survival_odds
[1] 0.5196759
> A.
Alonso, Inference I 44 / 568

Confidence intervals

A confidence interval is an interval estimate of a parameter

In contrast to a point estimate, a confidence interval gives a set of plausible values (estimates) for the parameter

Of all the realizations of the interval some will contain the true value of the parameter, but others will not

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

Alonso, A. Inference I 45 / 568


Confidence intervals

Suppose also that the y_i are conditionally independent given θ, i.e.

p(y | θ) = ∏_{i=1}^n p(y_i | θ)

Interval estimator: [a(y), b(y)] aims to include the true value with a pre-specified probability

For instance, a 100(1 − α)% confidence interval (CI) is an interval [a(y), b(y)] that satisfies, for all θ,

P( θ ∈ [a(y), b(y)] ) = 1 − α

Alonso, A. Inference I 46 / 568

Interval estimator

The probability statement is given wrt the sampling distribution of y

The probability 1 − α is called the coverage probability; classically α = 0.05, leading to the 95% CI

The adjective 95% refers to a probability statement under repeated sampling. For a particular sample the 95% CI either includes or excludes the true value θ

Many 95% CIs have only asymptotically the correct coverage. For small samples their good behavior needs to be established via simulations

Alonso, A. Inference I 47 / 568


Confidence intervals

Confidence interval for the mean μ

CI_{1−α} = [ X̄ − T_{1−α/2, n−1} · S/√n , X̄ + T_{1−α/2, n−1} · S/√n ]

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

P(μ lies in CI_{1−α}) = 1 − α

Often α = 0.05, thus...
Alonso, A. Inference I 48 / 568

Confidence intervals

Confidence interval for the mean μ

CI_{0.95} = [ X̄ − T_{0.975, n−1} · S/√n , X̄ + T_{0.975, n−1} · S/√n ]

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

P(μ lies in CI_{0.95}) = 0.95

95% of all realizations of the interval will contain the true value of the population mean μ

Alonso, A. Inference I 49 / 568
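A minimal R sketch (my own example, with simulated data) computes this interval both from the formula and with t.test().

## Sketch: 95% t-interval for the mean, own simulated example
set.seed(7)
x <- rnorm(100, mean = 0, sd = 1)
n <- length(x)
xbar <- mean(x); s <- sd(x)
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # interval from the formula
t.test(x)$conf.int                                      # same interval via t.test()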


Simulations: μ = 0.0, σ = 1, n = 100, 1 − α = 0.95

Alonso, A. Inference I 50 / 568

1 − α = 0.95 versus 1 − α = 0.99

Demonstration of confidence intervals: with 1 − α = 0.99, the observed coverage rate over the 100 simulated samples was 0.990 (99 of the 100 intervals contained μ)

[Figure: the realized intervals [x̄ − T_{1−α/2} S/√n , x̄ + T_{1−α/2} S/√n] plotted against the sample index, for 1 − α = 0.95 and 1 − α = 0.99]
Alonso, A. Inference I 51 / 568
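This demonstration is easy to reproduce; the sketch below (my own code, using the settings μ = 0, σ = 1, n = 100 from the previous slide) estimates the empirical coverage of the 95% t-interval.

## Sketch: empirical coverage of the 95% t-interval (mu = 0, sigma = 1, n = 100)
set.seed(2024)
mu <- 0; sigma <- 1; n <- 100; nsim <- 10000; alpha <- 0.05
covered <- replicate(nsim, {
  x  <- rnorm(n, mu, sigma)
  hw <- qt(1 - alpha / 2, n - 1) * sd(x) / sqrt(n)   # half-width of the interval
  (mean(x) - hw <= mu) & (mu <= mean(x) + hw)
})
mean(covered)   # should be close to 0.95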
Likelihood

The likelihood function is a particular function of the parameters of a statistical model given the data

The likelihood plays a central role in all the main paradigms of statistics:

Bayesian paradigm

Frequentist paradigm

Likelihood paradigm

Information-theoretic paradigm (AIC)

Maximum likelihood: Intuitively, the maximum likelihood estimator selects the values of the parameters of a model that give the observed data the largest possible probability

Alonso, A. Inference I 52 / 568

Likelihood
The probability that the sample y = (y_1, . . . , y_n) has happened under a p-dimensional θ is given by p(y | θ)

As a function of θ, p(y | θ) is called the likelihood and is denoted as L(θ | y)

L(θ | y) = ∏_{i=1}^n p(y_i | θ)

Maximum likelihood:
Given a data set y, the value of θ that maximizes L(θ | y) is called the maximum likelihood estimator (MLE) and is denoted as θ̂

To find θ̂ we rather maximize

ℓ(θ | y) ≡ log L(θ | y) = ∑_{i=1}^n log p(y_i | θ)

Alonso, A. Inference I 53 / 568


Likelihood: Related concepts

To explore the sampling properties of the MLE, we need to look at the properties of ℓ(θ | y), i.e., as a function of the random variable y

ℓ(θ | y) will be denoted as ℓ(θ), but now it is considered a random variable

Basically, one also needs to know the sampling behavior of

the 1st derivative of ℓ(θ): the score function

the 2nd derivative of ℓ(θ): the Fisher (expected) information matrix

We look at the i.i.d. case, but the MLE properties can be extended to the non-i.i.d. case
Alonso, A. Inference I 54 / 568

Score function
The score function is defined as

S(θ) = ∂ℓ(θ)/∂θ = ( ∂ℓ(θ)/∂θ_1 , . . . , ∂ℓ(θ)/∂θ_p ) = [ s_1(θ), . . . , s_p(θ) ]

The MLE is typically computed as the solution of the score equation

S(θ) = ∂/∂θ ∑_{i=1}^n log p(y_i | θ) = ∑_{i=1}^n s_i(θ) = 0

The Hessian matrix is the matrix of the second derivatives of the log-likelihood

H(θ) = ∂²ℓ(θ)/∂θ∂θ′ = ∑_{i=1}^n ∂²/∂θ∂θ′ log p(y_i | θ)

Alonso, A. Inference I 55 / 568


Fisher’s (expected) information matrix
Fisher's expected information matrix in a sample of size n is the p × p matrix

I(θ) ≡ I_n(θ) = −E[H(θ)] = −E[ ∂²ℓ(θ)/∂θ∂θ′ ]

i.e., minus the expected value of the Hessian matrix

In the i.i.d. data case

I(θ) = −E[ ∑_{i=1}^n ∂²/∂θ∂θ′ log p(y_i | θ) ] = n I_1(θ)

I_1(θ) denotes the expected Fisher information from one observation, i.e.,

I_1(θ) = −E[ ∂²/∂θ∂θ′ log p(y | θ) ]
Alonso, A. Inference I 56 / 568

Score function
If the data were generated by p(y | θ_0), then

E[S(θ_0)] = 0

Var[S(θ_0)] = ∑_{i=1}^n Var[s_i(θ_0)] = ∑_{i=1}^n E[ s_i(θ_0)^T s_i(θ_0) ]

In the i.i.d. data case Var[S(θ_0)] = n E[ s(θ_0)^T s(θ_0) ], where s(θ_0) = ∂/∂θ log p(y | θ_0)

Moreover, the so-called Information Matrix Equality (IME) holds

E[ s(θ_0)^T s(θ_0) ] = −E[ ∂²/∂θ∂θ′ log p(y | θ_0) ] = I_1(θ_0)

and, hence, Var[S(θ_0)] = n I_1(θ_0) = I(θ_0)

Alonso, A. Inference I 57 / 568


Score function

If the data were generated by p(y | θ_0), then

Asymptotically: n^{−1/2} S(θ_0) ∼ N( 0, I_1(θ_0) )

For large samples S(θ_0) ∼ N( 0, I(θ_0) ) with I(θ_0) = −E[H(θ_0)]

An important problem is the estimation of I(θ_0). Three estimators are available:

Î_1(θ̂) = −H(θ̂),   Î_2(θ̂) = ∑_{i=1}^n s_i(θ̂)^T s_i(θ̂),   Î_3(θ̂) = I(θ̂)

Provided θ̂ is a consistent estimator of θ_0 (not necessarily the MLE), each of these estimators converges to I(θ_0)

The matrix −H(θ̂) is called the observed information matrix

Alonso, A. Inference I 58 / 568

Cramér-Rao bound

The Cramér–Rao bound (CRB), also called the information inequality, expresses a lower bound on the variance of unbiased estimators

In its simplest form, the bound states that the variance of any unbiased estimator is at least as high as the inverse of the Fisher information

An unbiased estimator which achieves this lower bound is said to be (fully) efficient

Such an estimator achieves the lowest possible mean squared error among all unbiased estimators and is the minimum variance unbiased (MVU) estimator

In some cases, no unbiased estimator exists which achieves the bound

Alonso, A. Inference I 59 / 568


Cramér-Rao bound
Cramér–Rao bound
Let T(y) be an unbiased estimator of a vector function of parameters ψ(θ). The Cramér–Rao bound states that the covariance matrix of T(y) satisfies

cov[T(y)] ≥ J(θ) I(θ)^{−1} J(θ)^T

where J(θ) = (∂ψ_i/∂θ_j) is the Jacobian matrix

The matrix inequality A ≥ B is understood to mean that the matrix A − B is positive semidefinite

If θ̂ = T(y) is an unbiased estimator of θ (i.e., ψ(θ) = θ), then the Cramér–Rao bound reduces to

cov[θ̂(y)] ≥ I(θ)^{−1}

Alonso, A. Inference I 60 / 568

Properties MLE

MLEs have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter value

MLEs possess a number of attractive limiting properties. As the sample size increases to infinity, sequences of maximum likelihood estimators have these properties:

Consistency: The MLE converges in probability to the value being estimated

Efficiency: It achieves the CRB when the sample size tends to infinity. This means that no unbiased estimator has lower asymptotic mean squared error (the MLE is asymptotically unbiased)

Asymptotic Normality: The MLE is asymptotically normal

Alonso, A. Inference I 61 / 568


Properties MLE

Consistency: Under general regularity conditions, if the data were generated by p(y | θ_0) and we have a sufficiently large number of observations n, then it is possible to find the value of θ_0 with arbitrary precision, i.e., θ̂_mle(y) →p θ_0

Efficiency and Asymptotic Normality: The MLE is √n-consistent and asymptotically efficient, meaning that it reaches the CRB:

√n ( θ̂_mle(y) − θ_0 ) →d N( 0, I_1(θ_0)^{−1} )

Therefore, θ̂_mle(y) ∼ N( θ_0, I(θ_0)^{−1} ) with I(θ_0) = −E[H(θ_0)] for large samples

For carrying out inferences I(θ_0) can be estimated using Î_1, Î_2 or Î_3

Alonso, A. Inference I 62 / 568

Regularity conditions

Identifiability: different values of θ determine different distributions

Range of the data cannot depend on unknown parameters

True parameter must lie in the interior of the parameter space

The number of parameters must not increase with the sample size, at
least not too quickly

Alonso, A. Inference I 63 / 568


Further properties of MLE

Equivariant estimator: If θ̂_mle is the MLE of θ and ψ = ψ(θ) is a 1-to-1 transformation, then ψ̂_mle = ψ(θ̂_mle) is the MLE obtained from L(ψ | y). In addition, because of the delta method,

ψ̂_mle →d N_p( ψ, J(θ) I^{−1}(θ) J(θ)^T ) as n → ∞

For inference we may replace J(θ) by J(θ̂) and the expected information by Î_1, Î_2 or Î_3

The MLE is invariant wrt 1-to-1 transformations of the data
Alonso, A. Inference I 64 / 568


Maximizing the likelihood

Given the parameter value θ_k at iteration k, the Newton–Raphson (NR) algorithm approximates S(θ) using a Taylor series expansion about θ_k

S(θ) ≈ S(θ_k) + H(θ_k)(θ − θ_k)

An MLE is typically obtained as a solution of the likelihood equation S(θ) = 0, thus we solve the system of linear equations

0 ≈ S(θ_k) + H(θ_k)(θ − θ_k)

The previous equation can be used to update θ_k, and this leads to the Newton–Raphson algorithm

θ_{k+1} ≈ θ_k − H(θ_k)^{−1} S(θ_k)

Alonso, A. Inference I 65 / 568
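A minimal implementation of this update is sketched below (my own example): Newton–Raphson for the MLE of a Poisson mean, where the score and Hessian have simple closed forms and the iteration converges to the sample mean.

## Sketch: Newton-Raphson for the MLE of a Poisson mean (own example)
set.seed(11)
y <- rpois(50, lambda = 3)
score   <- function(l) sum(y) / l - length(y)   # S(lambda)
hessian <- function(l) -sum(y) / l^2            # H(lambda)
lambda <- 1                                     # starting value
for (k in 1:20) {
  step   <- score(lambda) / hessian(lambda)
  lambda <- lambda - step                       # theta_{k+1} = theta_k - H^{-1} S
  if (abs(step) < 1e-10) break
}
c(newton = lambda, closed_form = mean(y))       # both equal the sample mean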

Solving the score equations

Alonso, A. Inference I 66 / 568


Maximizing the likelihood

Fisher's scoring: replaces H(θ) with its expected value, when the formula for the expected value is known

Convergence is quadratic: as the method converges on the root, the difference between the root and the approximation is squared at each step

Failure of the method to converge to the root:

Stationary point: the Hessian is singular

Poor initial estimate

Overshoot: if the Hessian is not well behaved in the neighborhood of a particular root, the method may overshoot and diverge from that root

Alonso, A. Inference I 67 / 568

Surgery example: Binomial likelihood


New but rather complicated surgical technique. The surgeon operates on n = 12 patients with y = 9 successes

Binomial distribution Bin(n, π):

P(y | π) = (n choose y) π^y (1 − π)^{n−y}   (y = 0, 1, . . . , n)

Binomial likelihood (function):

L(π | 9) = (12 choose 9) π^9 (1 − π)^3

Binomial log-likelihood (function):

ℓ(π | 9) = log (12 choose 9) + 9 log(π) + 3 log(1 − π)
Alonso, A. Inference I 68 / 568
Surgery example: Binomial likelihood
MLE: maximize L(π | y) or, better, ℓ(π | y)

ℓ(π | y) = y ln π + (n − y) ln(1 − π) + constant

d/dπ ℓ(π | y) = y/π − (n − y)/(1 − π) = 0 ⇒ π̂ = y/n

For y = 9 and n = 12 ⇒ π̂ = 0.75

Fisher information: from the 2nd derivative of ℓ(π | y)

H(π) = d²ℓ/dπ² = −y/π² − (n − y)/(1 − π)²

I(π) = −E[ d²ℓ/dπ² ] = n / (π(1 − π))

Variance of the MLE (evaluated at the MLE): π̂(1 − π̂)/n
Alonso, A. Inference I 69 / 568

Surgery example: Binomial likelihood

## Maximum likelihood, binomial distribution

## Negative log-likelihood function
n.size=12
y.su=9
llik2 <- function(p)-sum(dbinom(y.su,prob=p,size=n.size,log=TRUE))
p_MLE=nlm(llik2,p=c(0.5), hessian = TRUE)

> p_MLE
$minimum
[1] 1.354394

$estimate
[1] 0.7499995

$gradient
[1] -1.190159e-07

$hessian
         [,1]
[1,] 64.03399

Alonso, A. Inference I 70 / 568


Surgery example: Binomial likelihood

## Maximum likelihood, binomial distribution

## Negative log-likelihood function
n.size=12
y.su=9
llik2 <- function(p)-sum(dbinom(y.su,prob=p,size=n.size,log=TRUE))
p_MLE=nlm(llik2,p=c(0.5), hessian = TRUE)

## Evaluate the negative log-likelihood on a grid of p values
p_vec <- seq(0.01, 1, by = 0.01)
llik3=rep(0,length(p_vec))
for(i in 1:length(p_vec)) llik3[i]=llik2(p_vec[i])

## Plot the curve and mark the MLE
par(las = 1, cex.lab = 1.2)
plot(p_vec, llik3, type = "l", xlab = "p", ylab = "-log-Likelihood")
points(p_MLE$estimate, p_MLE$minimum, pch = 19, col = "red")
segments(x0=p_MLE$estimate, y0 =0, x1 =p_MLE$estimate, y1 = p_MLE$minimum,
         lwd = 2, col = "red")

Alonso, A. Inference I 71 / 568

Surgery example: Binomial likelihood


[Figure: negative log-likelihood of π for y = 9 and n = 12, with the minimum at π̂ = 0.75 marked in red]

Alonso, A. Inference I 72 / 568


Surgery example: Binomial likelihood

Asymptotic distribution of the MLE:

π̂ ∼ N[ π, π̂(1 − π̂)/n ]

Asymptotic 95% CI for π:

[ π̂ − 1.96 √(π̂(1 − π̂)/n) , π̂ + 1.96 √(π̂(1 − π̂)/n) ]

Suppose θ = log[π/(1 − π)], then θ̂ = log[π̂/(1 − π̂)]

And the same (asymptotic) properties hold for θ̂ as for π̂


Alonso, A. Inference I 73 / 568

Surgery example: Binomial likelihood

Asymptotic distribution of the MLE:

π̂ ∼ N[ π, π̂(1 − π̂)/n ]

Asymptotic 95% CI for π: [0.51, 0.99]

Suppose θ = log[π/(1 − π)], then θ̂ = 1.1

And the same (asymptotic) properties hold for θ̂ as for π̂


Alonso, A. Inference I 74 / 568
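The numbers on this slide can be reproduced directly; a short sketch (my own code) is given below.

## Sketch: Wald 95% CI and log-odds for the surgery example (y = 9, n = 12)
n <- 12; y <- 9
pi_hat <- y / n                        # 0.75
se     <- sqrt(pi_hat * (1 - pi_hat) / n)
pi_hat + c(-1, 1) * 1.96 * se          # approximately [0.51, 0.99]
log(pi_hat / (1 - pi_hat))             # approximately 1.1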
