
Generalized Linear Models

Ariel Alonso Abad

Interuniversity Institute for Biostatistics and statistical Bioinformatics, KU Leuven

Alonso, A. GLM 1 / 568

Outline

1 Description of the course
2 Statistical inference I
3 Statistical inference II
4 Linear Regression
5 Logistic Regression I
6 Logistic Regression II
7 Logistic Regression III
8 Poisson Regression
9 Generalized linear models

Alonso, A. GLM 2 / 568


Teaching Plan

Lecture 1: Statistical inference I

Lecture 2: Statistical inference II

Lecture 3: Linear Regression

Lecture 4: Logistic Regression I

Lecture 5: Logistic Regression II

Lecture 6: Logistic Regression III

Lecture 7: Poisson Regression

Lecture 8: Generalized linear models

Lecture 9: Project
Alonso, A. GLM 3 / 568

Tutorial groups

After the first lecture, students organize themselves in tutorial groups within 5 days

Email Prof. Alonso a list with the members of each group (names and student numbers)

One project is assigned to each group

All assignments are on Toledo → Course Documents

Report with a detailed discussion of the analysis (no more than 10 pages)

Alonso, A. GLM 4 / 568


Evaluation

Oral and written:

Written: Project

Oral: Questions about project and course

All information about the course on Toledo

Alonso, A. GLM 5 / 568

Aims of the course

Extend the linear regression model to models that can handle binary, count, and other non-Gaussian responses

Deal with overdispersion in the data

Analyses will be done with the R software

R programs are given in the slides

Alonso, A. GLM 6 / 568


References

Agresti (1990). Categorical Data Analysis. John Wiley & Sons

Faraway (2000). Linear Models with R. Chapman & Hall/CRC

Fahrmeir, Kneib, Lang and Marx (2013). Regression: Models, Methods and Applications. Springer

McCullagh and Nelder (1989). Generalized Linear Models (2nd Edition). Chapman & Hall/CRC

Morel and Neerchal (2012). Overdispersion Models in SAS. SAS Press

Alonso, A. GLM 7 / 568



Outline

1 Description of the course
2 Statistical inference I
3 Statistical inference II
4 Linear Regression
5 Logistic Regression I
6 Logistic Regression II
7 Logistic Regression III
8 Poisson Regression
9 Generalized linear models

Alonso, A. GLM 9 / 568

Statistical Inference

Ariel Alonso Abad

Catholic University of Leuven

Alonso, A. Inference I 10 / 568


Randomness

Randomness: The occurrence of events for which no final cause is found

Physics: Quantum Mechanics

Biology: Theory of Evolution

Psychology: Synchronicity

Statistics and Mathematics

Alonso, A. Inference I 11 / 568

Frequency vs. probability distribution


Intelligence quotient
The intelligence quotient (IQ) of 10,000 children is determined. The results are shown in a histogram with class width 10 IQ points.

[Figure: histogram of the IQ scores; x-axis: intelligence (0–200), y-axis: frequency density]

Alonso, A. Inference I 12 / 568



Frequency vs. probability distribution


Probability distributions
If the number of observations increases indefinitely, then one can make the class width smaller and smaller; in the limit a smooth curve arises.

Normal distribution
Student's t-distribution
Chi-squared distribution
F-distribution

[Figure: histogram of the IQ scores with the limiting smooth density curve superimposed]

Alonso, A. Inference I 12 / 568


Normal distribution: Two parameters (μ, σ)

f(x) = (1 / (σ√(2π))) e^{−(x−μ)² / (2σ²)}

Two parameters (μ, σ)
Symmetric
Mode = Median = Mean

[Figure: normal density with the areas per σ-band: 34.1% between μ and μ ± 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ, 0.1% beyond 3σ]

Alonso, A. Inference I 13 / 568

Normal distribution: Two parameters (μ, σ)


Two parameters (μ, σ)
Symmetric
Mode = Median = Mean

[Figure: normal density; about 68% of the probability lies within μ ± 1σ and about 95% within μ ± 2σ]

Alonso, A. Inference I 13 / 568


Normal distribution

In the eighteenth century, mathematicians already found out that a large number of measurements often showed a behavior that follows a pattern similar to this Gaussian distribution

The maximum temperature on August 5 in De Bilt, measured over several years

Body length

The intelligence, as measured by the IQ, of a large group of subjects of the same age

Why?

Alonso, A. Inference I 14 / 568

The calculation of probabilities


Stanford–Binet IQ scores: normally distributed with μ = 100, σ = 15

[Figure: density of the Stanford–Binet IQ scores]

Alonso, A. Inference I 15 / 568


The calculation of probabilities

Stanford–Binet IQ scores: normally distributed with μ = 100, σ = 15

High school: P(97 ≤ IQ ≤ 115) = 0.4206045

University: P(115 ≤ IQ ≤ 125) = 0.1108649

PhD: P(IQ ≥ 125) = 0.04779035

Stephen Hawking, Garry Kasparov, Albert Einstein, Judit Polgar: P(IQ ≥ 160) = 0.00003

[Figure: Stanford–Binet IQ density with the corresponding regions shaded]
Alonso, A. Inference I 15 / 568
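These probabilities can be reproduced with pnorm(); the sketch below is my own illustration using the stated μ = 100 and σ = 15.

## Sketch: reproducing the IQ probabilities with pnorm() (mu = 100, sd = 15)
mu <- 100; sigma <- 15
pnorm(115, mu, sigma) - pnorm(97, mu, sigma)    # P(97 <= IQ <= 115)  ~ 0.4206
pnorm(125, mu, sigma) - pnorm(115, mu, sigma)   # P(115 <= IQ <= 125) ~ 0.1109
1 - pnorm(125, mu, sigma)                       # P(IQ >= 125)        ~ 0.0478
1 - pnorm(160, mu, sigma)                       # P(IQ >= 160)        ~ 0.00003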


BMI: Body mass index

Penman and Johnson, Prev Chronic Dis. 2006; 3(3):A74


Alonso, A. Inference I 16 / 568

Birth weight

O’Cathain A., British Medical Journal 2002; 324, 643–646.


Alonso, A. Inference I 17 / 568
Inferential Statistics

Inferential Statistics
Inferential statistics makes quantitative statements about the characteristics of a population, but...

It is often impossible to examine an entire population

Thus, conclusions frequently have to be drawn, or decisions taken, about the entire population based on the results of a sample

Alonso, A. Inference I 18 / 568

The frequentist approach

It is associated with the frequentist interpretation of probability

Conceptual framework: any given experiment can be considered as one of an infinite sequence of possible repetitions of the same experiment, each capable of producing statistically independent results

Frequentist inference requires that the correct conclusion should be drawn with a given (high) probability, among this notional set of repetitions

In a frequentist approach unknown parameters are often, but not always, treated as having fixed but unknown values that are not capable of being treated as random variables in any sense

Alonso, A. Inference I 19 / 568


Notation

To simplify notation we will use y for both the random variable and the realized value

In the general case one has a sample of n random variables: y = (y_1, . . . , y_n)

Suppose that the y_i (i = 1, . . . , n) have a distribution/density p(y | θ). For the moment, let us assume that dim(θ) = 1

Suppose also that the y_i are conditionally independent given θ, i.e.

p(y | θ) = ∏_{i=1}^n p(y_i | θ)

The purpose is to draw inference about θ


Alonso, A. Inference I 20 / 568

Point estimation

Point estimation: Involves the use of sample data to calculate a single value which is to serve as a “best guess” or “best estimate” of an unknown population parameter

Formally, θ̂(y) is a random variable that aims to estimate the parameter θ

Alonso, A. Inference I 21 / 568


Point estimation

The point estimator θ̂(y) has a sampling distribution, resulting from the sampling distribution of y

A good estimator should have a sampling distribution concentrated close to the true value θ

Popular measure of closeness: the mean squared error (MSE) of θ̂(y), defined as

MSE[θ̂(y)] = E_{y|θ}[(θ̂(y) − θ)²] = Var_{y|θ}[θ̂(y)] + bias[θ̂(y)]²

Alonso, A. Inference I 22 / 568

Point estimation: Bias

The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated

bias[θ̂(y)] = E_{y|θ}[θ̂(y)] − θ

Important to notice: the expectations are taken wrt the sampling distribution of y given the true value of θ

An estimator is called unbiased when bias[θ̂(y)] = 0

Alonso, A. Inference I 23 / 568


Point estimation: Bias

Suppose that the y_i (i = 1, . . . , n) are independent and identically distributed (i.i.d.)

Interest is in estimating the parameters μ = E(y) and σ² = Var(y) = E(y − μ)²

Point estimators: ȳ = (1/n) ∑_{i=1}^n y_i and S² = (1/n) ∑_{i=1}^n (y_i − ȳ)²

E(ȳ) = μ and E(S²) = ((n − 1)/n) σ² < σ²

Alonso, A. Inference I 24 / 568
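A small simulation makes the downward bias of S² concrete. The sketch below is my own illustration (not from the slides), drawing many samples of size n = 10 from a standard normal and comparing the average of S² (divisor n) against the theoretical value (n − 1)/n · σ².

## Sketch: bias of the variance estimator with divisor n (own illustration)
set.seed(123)
n <- 10
nsim <- 50000
s2 <- replicate(nsim, {
  y <- rnorm(n)                  # true sigma^2 = 1
  mean((y - mean(y))^2)          # S^2 with divisor n
})
mean(s2)                         # close to (n - 1)/n = 0.9, i.e., biased downwards
(n - 1) / n                      # theoretical expectation of S^2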

Point estimation: Bias

All else being equal, an unbiased estimator is often preferable to a biased estimator

In practice all else is not equal, and biased estimators are frequently used

Unbiasedness of estimators is a debatable property, since biased estimators may have a smaller MSE than unbiased estimators

Alonso, A. Inference I 25 / 568


Bias: Estimating a Poisson probability

The Poisson distribution may be useful to model events such as

The number of meteorites greater than 1 meter in diameter that strike Earth in a year

The number of patients arriving in an emergency room between 10 and 11 pm

The number of incoming calls at a telephone switchboard per minute

y ∼ P(y | λ), where P(y | λ) = λ^y e^{−λ} / y! and y = 0, 1, 2, . . .

E(y) = Var(y) = λ

Alonso, A. Inference I 26 / 568

Bias: Estimating a Poisson probability

Suppose one wants to estimate θ = P(y = 0 | λ)² = e^{−2λ} with sample size one

When incoming calls per minute at a telephone switchboard are modeled as a Poisson process with λ denoting the average number of calls per minute, then θ = e^{−2λ} is the probability that no calls arrive in the next two minutes

The only unbiased estimator is

θ̂(y) = (−1)^y, i.e., 1 if y is even and −1 if y is odd

Alonso, A. Inference I 27 / 568
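This estimator is unbiased but useless as a point estimate, since it only takes the values ±1. The short simulation below is my own sketch (with an assumed λ = 1.5), contrasting it with the biased but sensible plug-in estimator e^{−2y}.

## Sketch: unbiased estimator (-1)^y versus plug-in exp(-2y) for theta = exp(-2*lambda)
set.seed(1)
lambda <- 1.5                      # assumed true value, for illustration only
theta  <- exp(-2 * lambda)
y      <- rpois(100000, lambda)    # many samples of size one
mean((-1)^y)                       # average is close to theta: unbiased
mean(exp(-2 * y))                  # clearly biased, but always a valid probability
theta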


Important frequentist concepts

Consistency: If θ̂_n is an estimator of θ based on a random sample of size n, then we say that θ̂_n is consistent for θ when

lim_{n→∞} P(|θ̂_n − θ| > ε) = 0 for any ε > 0

Law of Large Numbers: If y_1, y_2, . . . , y_n are independent and identically distributed (i.i.d.) with mean μ = E(y), then the sample mean ȳ_n = (1/n) ∑_{i=1}^n y_i is consistent for μ

Alonso, A. Inference I 28 / 568

Bias versus consistency

Unbiased but not consistent: y_i with i = 1, . . . , n and E(y_i) = μ. The estimator μ̂(y) = T(y_1, y_2, . . . , y_n) = y_1 is unbiased but not consistent

Biased but consistent: the sample variance S² = (1/n) ∑_{i=1}^n (y_i − ȳ)²

The sample variance is asymptotically unbiased, i.e., the bias goes to zero as n increases. However, one can have a consistent, biased estimator whose bias does not go to zero with n

Alonso, A. Inference I 29 / 568


Sample means

Trying to understand the behavior of sample means

Are there certain regularities or laws?

To gain insight into the behavior of the sample mean we will study
how sample means from a known population behave

From a known population draw several samples of n units and look at the
averages of those samples

Alonso, A. Inference I 30 / 568

Sample means

Simulations: from each of four parent distributions (normal, gamma, uniform and beta), draw 20 000 samples of n units and compute the corresponding 20 000 sample means

X_1, X_2, . . . , X_n,   X̄ = (X_1 + X_2 + · · · + X_n)/n,   X̄_1, X̄_2, . . . , X̄_20000

[Figures: histograms of the 20 000 sample means for n = 4, n = 9 and n = 25; as n grows the histograms become narrower and increasingly bell-shaped, whatever the parent distribution]
Alonso, A. Inference I 31 / 568
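The R sketch below (my own illustration) reproduces this kind of experiment for one parent distribution; the gamma parameters are assumptions, not necessarily those used for the slides.

## Sketch: sampling distribution of the mean for a gamma parent (assumed parameters)
set.seed(42)
nsim <- 20000
par(mfrow = c(1, 3))
for (n in c(4, 9, 25)) {
  xbar <- replicate(nsim, mean(rgamma(n, shape = 2, rate = 1)))
  hist(xbar, breaks = 50, freq = FALSE,
       main = paste("n =", n), xlab = "sample mean")
}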

Central limit theorem (CLT)

CLT: Univariate
If y_1, y_2, . . . , y_n are i.i.d. random variables with mean μ = E(y) and variance σ², then √n (ȳ_n − μ) converges in distribution to N(0, σ²), and we write

√n (ȳ_n − μ) →d N(0, σ²)

or also

ȳ_n →d N(μ, σ²/n) as n → ∞

CLT: Multivariate
If y_1, y_2, . . . , y_n are i.i.d. random vectors with mean μ = E(y) and covariance matrix Σ, then √n (ȳ_n − μ) converges in distribution to a multivariate normal distribution, and we write

√n (ȳ_n − μ) →d N(0, Σ)
Alonso, A. Inference I 32 / 568


Why is the world normal?

Body length
Genetic factors

Economic factors

...

Body length = Average Effect

Alonso, A. Inference I 33 / 568

Delta Method

Let y_n = (y_{1n}, . . . , y_{pn})^T be a sequence of random vectors with E(y_n) = μ and √n (y_n − μ) →d z

Further consider the map ψ: R^p → R^k, i.e., ψ(y) = (ψ_1(y), . . . , ψ_k(y))^T, with J(y) = ∂ψ/∂y = (∂ψ_i/∂y_j) (a k × p matrix of derivatives) the Jacobian matrix of the transformation ψ. The delta method gives the asymptotic distribution of the new random sequence ψ(y_n) as

√n ( ψ(y_n) − ψ(μ) ) →d J(μ) z

If z ∼ N_p(0, Σ), then

√n ( ψ(y_n) − ψ(μ) ) →d N_k( 0, J(μ) Σ J(μ)^T )
Alonso, A. Inference I 34 / 568
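As a quick numerical check of this result, the sketch below (my own example, with assumed values) applies the univariate delta method to ψ(x) = log(x) for a sample mean, comparing the analytic variance J(μ)² · σ²/n with msm::deltamethod, the same function used later in the Titanic example.

## Sketch: delta method for psi(x) = log(x), own example with assumed values
library(msm)
mu    <- 5                 # assumed mean of y
sigma <- 2                 # assumed standard deviation of y
n     <- 50
var_mean <- sigma^2 / n    # Var(ybar)
(1 / mu)^2 * var_mean                                    # analytic: J(mu)^2 * Var(ybar)
deltamethod(~ log(x1), mean = mu, cov = var_mean)^2      # same value via msm (returns the SE)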


CLT Example: Bernoulli Distribution

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable y which takes the value 1 with probability π and the value 0 with probability 1 − π

The probability mass function f of this distribution is

f(k; π) = π^k (1 − π)^{1−k} for k ∈ {0, 1}

E(y) = π and Var(y) = π(1 − π)

Alonso, A. Inference I 35 / 568

Example: Bernoulli Distribution

If y_1, . . . , y_n are i.i.d. random variables, all Bernoulli trials with success probability π, then their sum is distributed according to a binomial distribution with parameters n and π

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The binomial probability mass function f gives the probability of getting exactly k successes in n trials

f(k, n, π) = P(y = k) = (n choose k) π^k (1 − π)^{n−k}

E(y) = nπ and Var(y) = nπ(1 − π)

Alonso, A. Inference I 36 / 568


Normal approximation

The binomial distribution is the sum of i.i.d. Bernoulli random variables

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The Central Limit Theorem asserts that if n is large enough, then approximately

y ∼ N( nπ, nπ(1 − π) )

Alonso, A. Inference I 37 / 568

Normal approximation

y ∼ B(n = 100, π = 0.4) thus approximately y ∼ N (40, 24)

## Binomial normal approximation
x1 <- 36:45                      # central values, highlighted in black
x2 <- c(25:35, 46:55)            # tail values, drawn in red
x1x2 <- seq(25, 55, by = .01)    # grid for the normal density

## Binomial probabilities as vertical bars, normal density used only to set up the frame
plot(x1x2, dnorm(x1x2, 40, sqrt(24)), type = "n", xlab = "x",
     ylab = "Binomial Probability")
lines(x2, dbinom(x2, 100, .4), type = "h", col = 2)
lines(x1, dbinom(x1, 100, .4), type = "h", lwd = 2)

## Same plot with the approximating N(40, 24) density drawn as a curve
plot(x1x2, dnorm(x1x2, 40, sqrt(24)), type = "l", xlab = "x",
     ylab = "Binomial Probability")
lines(x2, dbinom(x2, 100, .4), type = "h", col = 2)
lines(x1, dbinom(x1, 100, .4), type = "h", lwd = 2)

Alonso, A. Inference I 38 / 568


Normal approximation

[Figure: Bin(100, 0.4) probabilities shown as vertical bars with the approximating N(40, 24) density overlaid]

Alonso, A. Inference I 39 / 568

Delta Method Example: Titanic

On 10 April 1912 the largest passenger steamship in the world left Southampton, England, for New York City. At 23:40 on 14 April, it struck an iceberg and sank at 2:20 the following morning, resulting in the deaths of 1,517 people in one of the deadliest peacetime maritime disasters in history.
Alonso, A. Inference I 40 / 568

Odds of Survival

Let y_1, . . . , y_n be i.i.d. random variables, all Bernoulli trials with success probability π = P(y_i = 1), where y_i = 1 if passenger i survived and zero otherwise

Odds of surviving

Θ_survival = π / (1 − π)

⇒ If Θ_survival = 2, then for every 2 passengers that survive 1 dies

⇒ If Θ_survival = 0.5, then for every 1 passenger that survives 2 die

The term comes from horse racing

Alonso, A. Inference I 41 / 568


Delta Method Example: Titanic

If y_1, . . . , y_n are i.i.d. random variables, all Bernoulli trials with success probability π = P(y_i = 1), then

y = ∑_{k=1}^n y_k ∼ Bin(n, π)

The Central Limit Theorem implies

π̂ ∼ N( π, π(1 − π)/n )

The point estimate for the odds of surviving is

Θ̂_survival = π̂ / (1 − π̂)
Alonso, A. Inference I 42 / 568

Delta Method Example: Titanic

Consider now the function

g(x) = ln( x / (1 − x) ); it is easy to show that g′(x) = 1 / (x(1 − x))

Applying the delta method one gets that asymptotically

ln(Θ̂_survival) ∼ N( ln(Θ_survival), g′(π)² · π(1 − π)/n )

Thus, asymptotically, Var[ ln(Θ̂_survival) ] = 1 / (π(1 − π) n)

Note: It can be shown that no unbiased estimator of the log-odds exists
Alonso, A. Inference I 43 / 568
Delta Method Example: Titanic

> install.packages("msm")
> library(msm)
> titanic.path="C:/Equizo/Courses/KULeuven/titanicmissing.txt"
> titanic<- read.table(titanic.path, header=T, sep=",")
> head(titanic,5)
survived pclass sex age
1 1 1st 0 29.0000
2 0 1st 0 2.0000
3 0 1st 1 30.0000
4 0 1st 0 25.0000
5 1 1st 1 0.9167
> p_survival=mean(titanic$survived)
> log_survival_odds=log(p_survival/(1-p_survival))
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n
> se_odds_delta=deltamethod(g=~log(x1/(1-x1)), mean=p_survival, cov=var_p_survival)
> se_odds_delta^2
[1] 0.003384579
> log_survival_odds
[1] -0.6545499
> survival_odds
[1] 0.5196759
> A.
Alonso, Inference I 44 / 568

Confidence intervals

A confidence interval is an interval estimate of a parameter

In contrast to a point estimate, a confidence interval gives a set of plausible values (estimates) for the parameter

Of all the realizations of the interval some will contain the true value of the parameter, but others will not

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

Alonso, A. Inference I 45 / 568


Confidence intervals

Suppose also that the y_i are conditionally independent given θ, i.e.

p(y | θ) = ∏_{i=1}^n p(y_i | θ)

Interval estimator: [a(y), b(y)] aims to include the true value with a pre-specified probability

For instance, a 100(1 − α)% confidence interval (CI) is an interval [a(y), b(y)] that satisfies, for all θ,

P( θ ∈ [a(y), b(y)] ) = 1 − α

Alonso, A. Inference I 46 / 568

Interval estimator

The probability statement is given wrt the sampling distribution of y

The probability 1 − α is called the coverage probability; classically α = 0.05, leading to the 95% CI

The adjective 95% refers to a probability statement under repeated sampling. For a particular sample the 95% CI either includes or excludes the true value θ

Many 95% CIs have only asymptotically the correct coverage. For small samples their good behavior needs to be established via simulations

Alonso, A. Inference I 47 / 568


Confidence intervals

Confidence interval for the mean μ

CI_{1−α} = [ X̄ − T_{1−α/2, n−1} · S/√n , X̄ + T_{1−α/2, n−1} · S/√n ]

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

P(μ lies in CI_{1−α}) = 1 − α

Often α = 0.05, thus...
Alonso, A. Inference I 48 / 568

Confidence intervals

Confidence interval for the mean μ

CI_{0.95} = [ X̄ − T_{0.975, n−1} · S/√n , X̄ + T_{0.975, n−1} · S/√n ]

The probability that the stochastic interval will contain the true value of the parameter is called the confidence level of the interval

P(μ lies in CI_{0.95}) = 0.95

95% of all realizations of the interval will contain the true value of the population mean μ

Alonso, A. Inference I 49 / 568
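A minimal R sketch (my own example, with simulated data) computes this interval both from the formula and with t.test().

## Sketch: 95% t-interval for the mean, own simulated example
set.seed(7)
x <- rnorm(100, mean = 0, sd = 1)
n <- length(x)
xbar <- mean(x); s <- sd(x)
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # interval from the formula
t.test(x)$conf.int                                      # same interval via t.test()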


Simulations: μ = 0.0, σ = 1, n = 100, 1 − α = 0.95

Alonso, A. Inference I 50 / 568

1 − α = 0.95 versus 1 − α = 0.99

Demonstration of confidence intervals: with 1 − α = 0.99, the observed coverage rate over the 100 simulated samples was 0.990 (99 of the 100 intervals contained μ)

[Figure: the realized intervals [x̄ − T_{1−α/2} S/√n , x̄ + T_{1−α/2} S/√n] plotted against the sample index, for 1 − α = 0.95 and 1 − α = 0.99]
Alonso, A. Inference I 51 / 568
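This demonstration is easy to reproduce; the sketch below (my own code, using the settings μ = 0, σ = 1, n = 100 from the previous slide) estimates the empirical coverage of the 95% t-interval.

## Sketch: empirical coverage of the 95% t-interval (mu = 0, sigma = 1, n = 100)
set.seed(2024)
mu <- 0; sigma <- 1; n <- 100; nsim <- 10000; alpha <- 0.05
covered <- replicate(nsim, {
  x  <- rnorm(n, mu, sigma)
  hw <- qt(1 - alpha / 2, n - 1) * sd(x) / sqrt(n)   # half-width of the interval
  (mean(x) - hw <= mu) & (mu <= mean(x) + hw)
})
mean(covered)   # should be close to 0.95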
Likelihood

The likelihood function is a particular function of the parameters of a statistical model given the data

The likelihood plays a central role in all the main paradigms of statistics:

Bayesian paradigm

Frequentist paradigm

Likelihood paradigm

Information-theoretic paradigm (AIC)

Maximum likelihood: Intuitively, the maximum likelihood estimator selects the values of the parameters of a model that give the observed data the largest possible probability

Alonso, A. Inference I 52 / 568

Likelihood
The probability that the sample y = (y_1, . . . , y_n) has happened under a p-dimensional θ is given by p(y | θ)

As a function of θ, p(y | θ) is called the likelihood and is denoted as L(θ | y)

L(θ | y) = ∏_{i=1}^n p(y_i | θ)

Maximum likelihood:
Given a data set y, the value of θ that maximizes L(θ | y) is called the maximum likelihood estimator (MLE) and is denoted as θ̂

To find θ̂ we rather maximize

ℓ(θ | y) ≡ log L(θ | y) = ∑_{i=1}^n log p(y_i | θ)

Alonso, A. Inference I 53 / 568


Likelihood: Related concepts

To explore the sampling properties of the MLE, we need to look at the properties of ℓ(θ | y), i.e., as a function of the random variable y

ℓ(θ | y) will be denoted as ℓ(θ), but now it is considered a random variable

Basically, one also needs to know the sampling behavior of

the 1st derivative of ℓ(θ): the score function

the 2nd derivative of ℓ(θ): the Fisher (expected) information matrix

We look at the i.i.d. case, but the MLE properties can be extended to the non-i.i.d. case
Alonso, A. Inference I 54 / 568

Score function
The score function is defined as

S(θ) = ∂ℓ(θ)/∂θ = ( ∂ℓ(θ)/∂θ_1 , . . . , ∂ℓ(θ)/∂θ_p ) = [ s_1(θ), . . . , s_p(θ) ]

The MLE is typically computed as the solution of the score equation

S(θ) = ∂/∂θ ∑_{i=1}^n log p(y_i | θ) = ∑_{i=1}^n s_i(θ) = 0

The Hessian matrix is the matrix of the second derivatives of the log-likelihood

H(θ) = ∂²ℓ(θ)/∂θ∂θ′ = ∑_{i=1}^n ∂²/∂θ∂θ′ log p(y_i | θ)

Alonso, A. Inference I 55 / 568


Fisher’s (expected) information matrix
Fisher's expected information matrix in a sample of size n is the p × p matrix

I(θ) ≡ I_n(θ) = −E[H(θ)] = −E[ ∂²ℓ(θ)/∂θ∂θ′ ]

i.e., minus the expected value of the Hessian matrix

In the i.i.d. data case

I(θ) = −E[ ∑_{i=1}^n ∂²/∂θ∂θ′ log p(y_i | θ) ] = n I_1(θ)

I_1(θ) denotes the expected Fisher information from one observation, i.e.,

I_1(θ) = −E[ ∂²/∂θ∂θ′ log p(y | θ) ]
Alonso, A. Inference I 56 / 568

Score function
If the data were generated by p(y | θ_0), then

E[S(θ_0)] = 0

Var[S(θ_0)] = ∑_{i=1}^n Var[s_i(θ_0)] = ∑_{i=1}^n E[ s_i(θ_0)^T s_i(θ_0) ]

In the i.i.d. data case Var[S(θ_0)] = n E[ s(θ_0)^T s(θ_0) ], where s(θ_0) = ∂/∂θ log p(y | θ_0)

Moreover, the so-called Information Matrix Equality (IME) holds

E[ s(θ_0)^T s(θ_0) ] = −E[ ∂²/∂θ∂θ′ log p(y | θ_0) ] = I_1(θ_0)

and, hence, Var[S(θ_0)] = n I_1(θ_0) = I(θ_0)

Alonso, A. Inference I 57 / 568


Score function

If the data were generated by p(y | θ_0), then

Asymptotically: n^{−1/2} S(θ_0) ∼ N( 0, I_1(θ_0) )

For large samples S(θ_0) ∼ N( 0, I(θ_0) ) with I(θ_0) = −E[H(θ_0)]

An important problem is the estimation of I(θ_0). Three estimators are available:

Î_1(θ̂) = −H(θ̂),   Î_2(θ̂) = ∑_{i=1}^n s_i(θ̂)^T s_i(θ̂),   Î_3(θ̂) = I(θ̂)

Provided θ̂ is a consistent estimator of θ_0 (not necessarily the MLE), each of these estimators converges to I(θ_0)

The matrix −H(θ̂) is called the observed information matrix

Alonso, A. Inference I 58 / 568

Cramér-Rao bound

The Cramér–Rao bound (CRB), also called the information inequality, expresses a lower bound on the variance of unbiased estimators

In its simplest form, the bound states that the variance of any unbiased estimator is at least as high as the inverse of the Fisher information

An unbiased estimator which achieves this lower bound is said to be (fully) efficient

Such an estimator achieves the lowest possible mean squared error among all unbiased estimators and is the minimum variance unbiased (MVU) estimator

In some cases, no unbiased estimator exists which achieves the bound

Alonso, A. Inference I 59 / 568


Cramér-Rao bound
Cramér–Rao bound
Let T(y) be an unbiased estimator of a vector function of parameters ψ(θ). The Cramér–Rao bound states that the covariance matrix of T(y) satisfies

cov[T(y)] ≥ J(θ) I(θ)^{−1} J(θ)^T

where J(θ) = (∂ψ_i/∂θ_j) is the Jacobian matrix

The matrix inequality A ≥ B is understood to mean that the matrix A − B is positive semidefinite

If θ̂ = T(y) is an unbiased estimator of θ (i.e., ψ(θ) = θ), then the Cramér–Rao bound reduces to

cov[θ̂(y)] ≥ I(θ)^{−1}

Alonso, A. Inference I 60 / 568

Properties MLE

MLEs have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter value

MLEs possess a number of attractive limiting properties. As the sample size increases to infinity, sequences of maximum likelihood estimators have these properties:

Consistency: The MLE converges in probability to the value being estimated

Efficiency: It achieves the CRB when the sample size tends to infinity. This means that no unbiased estimator has lower asymptotic mean squared error (the MLE is asymptotically unbiased)

Asymptotic Normality: The MLE is asymptotically normal

Alonso, A. Inference I 61 / 568


Properties MLE

Consistency: Under general regularity conditions, if the data were generated by p(y | θ_0) and we have a sufficiently large number of observations n, then it is possible to find the value of θ_0 with arbitrary precision, i.e., θ̂_mle(y) →p θ_0

Efficiency and Asymptotic Normality: The MLE is √n-consistent and asymptotically efficient, meaning that it reaches the CRB:

√n ( θ̂_mle(y) − θ_0 ) →d N( 0, I_1(θ_0)^{−1} )

Therefore, θ̂_mle(y) ∼ N( θ_0, I(θ_0)^{−1} ) with I(θ_0) = −E[H(θ_0)] for large samples

For carrying out inferences I(θ_0) can be estimated using Î_1, Î_2 or Î_3

Alonso, A. Inference I 62 / 568

Regularity conditions

Identifiability: different values of θ determine different distributions

Range of the data cannot depend on unknown parameters

True parameter must lie in the interior of the parameter space

The number of parameters must not increase with the sample size, at
least not too quickly

Alonso, A. Inference I 63 / 568


Further properties of MLE

Equivariant estimator: If θ̂_mle is the MLE of θ and ψ = ψ(θ) is a 1-to-1 transformation, then ψ̂_mle = ψ(θ̂_mle) is the MLE obtained from L(ψ | y). In addition, because of the delta method,

ψ̂_mle →d N_p( ψ, J(θ) I^{−1}(θ) J(θ)^T ) as n → ∞

For inference we may replace J(θ) by J(θ̂) and the expected information by Î_1, Î_2 or Î_3

The MLE is invariant wrt 1-to-1 transformations of the data
Alonso, A. Inference I 64 / 568


Maximizing the likelihood

Given the parameter value θ_k at iteration k, the Newton–Raphson (NR) algorithm approximates S(θ) using a Taylor series expansion about θ_k

S(θ) ≈ S(θ_k) + H(θ_k)(θ − θ_k)

An MLE is typically obtained as a solution of the likelihood equation S(θ) = 0, thus we solve the system of linear equations

0 ≈ S(θ_k) + H(θ_k)(θ − θ_k)

The previous equation can be used to update θ_k, and this leads to the Newton–Raphson algorithm

θ_{k+1} ≈ θ_k − H(θ_k)^{−1} S(θ_k)

Alonso, A. Inference I 65 / 568
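A minimal implementation of this update is sketched below (my own example): Newton–Raphson for the MLE of a Poisson mean, where the score and Hessian have simple closed forms and the iteration converges to the sample mean.

## Sketch: Newton-Raphson for the MLE of a Poisson mean (own example)
set.seed(11)
y <- rpois(50, lambda = 3)
score   <- function(l) sum(y) / l - length(y)   # S(lambda)
hessian <- function(l) -sum(y) / l^2            # H(lambda)
lambda <- 1                                     # starting value
for (k in 1:20) {
  step   <- score(lambda) / hessian(lambda)
  lambda <- lambda - step                       # theta_{k+1} = theta_k - H^{-1} S
  if (abs(step) < 1e-10) break
}
c(newton = lambda, closed_form = mean(y))       # both equal the sample mean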

Solving the score equations

Alonso, A. Inference I 66 / 568


Maximizing the likelihood

Fisher's scoring: replaces H(θ) with its expected value, when the formula for the expected value is known

Convergence is quadratic: as the method converges on the root, the difference between the root and the approximation is squared at each step

Failure of the method to converge to the root:

Stationary point: the Hessian is singular

Poor initial estimate

Overshoot: if the Hessian is not well behaved in the neighborhood of a particular root, the method may overshoot and diverge from that root

Alonso, A. Inference I 67 / 568

Surgery example: Binomial likelihood


New but rather complicated surgical technique. The surgeon operates on n = 12 patients with y = 9 successes

Binomial distribution Bin(n, π):

P(y | π) = (n choose y) π^y (1 − π)^{n−y}   (y = 0, 1, . . . , n)

Binomial likelihood (function):

L(π | 9) = (12 choose 9) π^9 (1 − π)^3

Binomial log-likelihood (function):

ℓ(π | 9) = log (12 choose 9) + 9 log(π) + 3 log(1 − π)
Alonso, A. Inference I 68 / 568
Surgery example: Binomial likelihood
MLE: maximize L(π | y) or, better, ℓ(π | y)

ℓ(π | y) = y ln π + (n − y) ln(1 − π) + constant

d/dπ ℓ(π | y) = y/π − (n − y)/(1 − π) = 0 ⇒ π̂ = y/n

For y = 9 and n = 12 ⇒ π̂ = 0.75

Fisher information: from the 2nd derivative of ℓ(π | y)

H(π) = d²ℓ/dπ² = −y/π² − (n − y)/(1 − π)²

I(π) = −E[ d²ℓ/dπ² ] = n / (π(1 − π))

Variance of the MLE (evaluated at the MLE): π̂(1 − π̂)/n
Alonso, A. Inference I 69 / 568

Surgery example: Binomial likelihood

## Maximum likelihood, binomial distribution

## Negative log-likelihood function
n.size=12
y.su=9
llik2 <- function(p)-sum(dbinom(y.su,prob=p,size=n.size,log=TRUE))
p_MLE=nlm(llik2,p=c(0.5), hessian = TRUE)

> p_MLE
$minimum
[1] 1.354394

$estimate
[1] 0.7499995

$gradient
[1] -1.190159e-07

$hessian
         [,1]
[1,] 64.03399

Alonso, A. Inference I 70 / 568


Surgery example: Binomial likelihood

## Maximum likelihood, binomial distribution

## Negative log-likelihood function
n.size=12
y.su=9
llik2 <- function(p)-sum(dbinom(y.su,prob=p,size=n.size,log=TRUE))
p_MLE=nlm(llik2,p=c(0.5), hessian = TRUE)

## Evaluate the negative log-likelihood on a grid of p values
p_vec <- seq(0.01, 1, by = 0.01)
llik3=rep(0,length(p_vec))
for(i in 1:length(p_vec)) llik3[i]=llik2(p_vec[i])

## Plot the curve and mark the MLE
par(las = 1, cex.lab = 1.2)
plot(p_vec, llik3, type = "l", xlab = "p", ylab = "-log-Likelihood")
points(p_MLE$estimate, p_MLE$minimum, pch = 19, col = "red")
segments(x0=p_MLE$estimate, y0 =0, x1 =p_MLE$estimate, y1 = p_MLE$minimum,
         lwd = 2, col = "red")

Alonso, A. Inference I 71 / 568

Surgery example: Binomial likelihood


[Figure: negative log-likelihood of π for y = 9 and n = 12, with the minimum at π̂ = 0.75 marked in red]

Alonso, A. Inference I 72 / 568


Surgery example: Binomial likelihood

Asymptotic distribution of the MLE:

π̂ ∼ N[ π, π̂(1 − π̂)/n ]

Asymptotic 95% CI for π:

[ π̂ − 1.96 √(π̂(1 − π̂)/n) , π̂ + 1.96 √(π̂(1 − π̂)/n) ]

Suppose θ = log[π/(1 − π)], then θ̂ = log[π̂/(1 − π̂)]

And the same (asymptotic) properties hold for θ̂ as for π̂


Alonso, A. Inference I 73 / 568

Surgery example: Binomial likelihood

Asymptotic distribution of the MLE:

π̂ ∼ N[ π, π̂(1 − π̂)/n ]

Asymptotic 95% CI for π: [0.51, 0.99]

Suppose θ = log[π/(1 − π)], then θ̂ = 1.1

And the same (asymptotic) properties hold for θ̂ as for π̂


Alonso, A. Inference I 74 / 568
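The numbers on this slide can be reproduced directly; a short sketch (my own code) is given below.

## Sketch: Wald 95% CI and log-odds for the surgery example (y = 9, n = 12)
n <- 12; y <- 9
pi_hat <- y / n                        # 0.75
se     <- sqrt(pi_hat * (1 - pi_hat) / n)
pi_hat + c(-1, 1) * 1.96 * se          # approximately [0.51, 0.99]
log(pi_hat / (1 - pi_hat))             # approximately 1.1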
