
GENERALIZED LINEAR MODEL

INTRODUCTION TO GENERALIZED
LINEAR MODELS

Program Studi S1 Ilmu Aktuaria

Departemen Matematika
FMIPA UI

ATA 2021/2022

Dr. Fevi Novkaniza, S.Si, M.Si SCAK603104-MODEL LINIER LANJUT


GENERALIZED LINEAR MODEL Introduction

Course Description

This course presents students with an introduction to the GLM
methodology and its applications of interest to actuarial science.
- GLM is a methodology for modeling relationships between the
  response and the explanatory variables, also called predictors or
  covariates (in some contexts these are called risk factors or
  rating factors)
- GLMs extend the linear modeling framework to variables that
  are not normally distributed, which insurance analysts
  typically encounter
- GLMs are most commonly used to model binary or count
  data, so we will focus on models for these types of data
- GLMs also allow non-normal errors such as
  Binomial, Poisson and Gamma errors


Course Outline

1. Introduction to Generalized Linear Models
2. Model Fitting
3. Exponential Family and Generalized Linear Models
4. Estimation and Inference
5. Binary Variables and Logistic Regression
6. Nominal and Ordinal Logistic Regression
7. Poisson Regression, Log-Linear Models, Negative Binomial
Regression
8. GLM and its Application


Text Books and Software

1. An Introduction to Generalized Linear Models (Dobson & Barnett;
   CRC Press, 2018)
2. Foundations of Linear and Generalized Linear Models (Alan
   Agresti; Wiley, 2015)
3. Generalized Linear Models for Insurance Data (de Jong & Heller;
   Cambridge University Press, 2008)

Software: R / Python


GLM: Model Fitting and Inference Models

1. Review of Linear Models
2. Exponential Dispersion Family Distributions for a GLM
3. Likelihood and Asymptotic Distributions for GLMs
4. Likelihood-Ratio/Wald/Score Methods for GLM Parameters
5. Deviance of a GLM, Model Comparison, and Model Checking
6. Fitting GLM
7. Selecting Explanatory Variables for a GLM
8. Example : Building a GLM
9. Exercises


Models for Binary Data

1. Link Function for Binary Data
2. Logistic Regression
3. Inference About Parameters of Logistic Regression
4. Logistic Regression Model Fitting
5. Deviance and Goodness of Fit for Binary GLMs
6. Probit and Complementary Log-Log Models
7. Examples
8. Exercises


Multinomial Response Models

1. Nominal Responses: Baseline-Category Logit Models
2. Ordinal Responses: Cumulative Logit and Probit Models
3. Examples
4. Exercises


Models for Count Data

1. Poisson GLMs for Counts and Rates
2. Poisson/Multinomial Models for Contingency Tables
3. Negative Binomial GLMs
4. Example
5. Exercises


Review : The General Linear Models

In a general linear model

$$ y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \epsilon_i \tag{1} $$

the response $y_i$, $i = 1, \dots, n$, is modelled by a linear function of
explanatory variables $x_j$, $j = 1, \dots, p$, plus an error term.

General and Linear
- General refers to the dependence on potentially more than one
  explanatory variable, vs. the simple linear model
- The model is linear in the parameters:

$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \epsilon_i \tag{2} $$

$$ y_i = \beta_0 + \gamma_1 \delta_1 x_{1i} + \exp(\beta_2) x_{2i} + \epsilon_i \tag{3} $$

- but not: $y_i = \beta_0 + \beta_1 x_{1i}^{\beta_2} + \epsilon_i$

Error Structure

- We assume that the errors $\epsilon_i$ are independent and identically
  distributed such that $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$
- Typically we assume $\epsilon_i \sim N(0, \sigma^2)$ as a basis for inference
  (e.g. t-tests on parameters)


Restrictions of Linear Models

Although a very useful framework, there are some situations where
general linear models are not appropriate:
- the range of Y is restricted (e.g. binary, count)
- the variance of Y depends on the mean
Generalized linear models extend the general linear model
framework to address both of these issues.


Generalized Linear Model (GLM)

A generalized linear model is made up of a linear predictor

$$ \eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{4} $$

and two functions:
- a link function that describes how the mean, $E(Y_i) = \mu_i$,
  depends on the linear predictor: $g(\mu_i) = \eta_i$
- a variance function that describes how the variance depends on the
  mean: $\mathrm{Var}(Y_i) = \phi V(\mu_i)$, where the dispersion parameter $\phi$ is a
  constant
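
As a minimal sketch of how these pieces fit together, the Python snippet below evaluates a linear predictor, inverts a log link to obtain the mean, and applies the variance function V(mu) = mu. The choice of a Poisson-type model, the coefficient values, the covariate values and the function names are illustrative assumptions, not part of the course material.

```python
import math

def linear_predictor(beta, x):
    """eta = beta_0 + beta_1*x_1 + ... + beta_p*x_p (x excludes the intercept)."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

def inverse_log_link(eta):
    """Log link g(mu) = log(mu), so mu = exp(eta)."""
    return math.exp(eta)

def poisson_variance(mu, phi=1.0):
    """Var(Y) = phi * V(mu) with V(mu) = mu."""
    return phi * mu

beta = [0.5, 0.2, -0.1]   # hypothetical coefficients (beta_0, beta_1, beta_2)
x = [1.0, 3.0]            # hypothetical covariate values (x_1, x_2)

eta = linear_predictor(beta, x)
mu = inverse_log_link(eta)
var = poisson_variance(mu)
print(eta, mu, var)
```

Swapping the inverse link or the variance function changes the model family while the linear predictor stays the same.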


Components of GLM

Response $Y_i$ and independent variables $X_i = (x_{1i}, \dots, x_{pi})$ for
$i = 1, \dots, n$
1. Random Component: $Y_i$, $1 \le i \le n$, independent with density
   from an exponential family distribution, i.e.
   $$ f(y; \theta, \phi) = \exp\left[ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \tag{5} $$
   where $\phi$ is a dispersion parameter and the functions $b(\cdot)$, $a(\cdot)$ and
   $c(\cdot)$ are known
2. Systematic Component:
   $$ \eta_i(\beta) = x_i^T \beta = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{6} $$
   the linear predictor, with regression parameters $\beta = (\beta_0, \dots, \beta_p)$
3. Parametric Link Component: The link function
   $g(\mu_i) = \eta_i = x_i^T \beta$ connects the linear predictor with the mean $\mu_i$ of
   $y_i$. The link is canonical if $\theta_i = \eta_i$.

Exponential Family

Most of the commonly used statistical distributions, such as the Normal,
Binomial and Poisson, are members of the exponential family of
distributions, whose densities can be written in the form

$$ f(y; \theta, \phi) = \exp\left[ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right] \tag{7} $$

where $\phi$ is the dispersion parameter and $\theta$ is the canonical or
natural parameter.
Often $a(\phi) = 1$ and $c(y_i, \phi) = c(y_i)$, giving the natural
exponential family of the form

$$ f(y; \theta) = h(y) \exp[y\theta - b(\theta)] \tag{8} $$

with $h(y) = \exp[c(y)]$.


It can be shown that the mean and variance in a GLM can be
represented as
$$ \mu_i = E(Y_i) = b'(\theta_i) \tag{9} $$
$$ \mathrm{Var}(Y_i) = a(\phi)\, b''(\theta_i) \tag{10} $$
$V(\theta) := b''(\theta)$ is called the variance function of the GLM.
Proof: ....
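
The following is a numerical sanity check, not a proof. It takes the Poisson family as an assumed example, with b(theta) = exp(theta) and a(phi) = 1, and compares finite-difference approximations of b' and b'' with the mean and variance computed directly from the probability mass function. The value lambda = 3 and the helper names are illustrative.

```python
import math

def b(theta):
    """Cumulant function of the Poisson family: b(theta) = exp(theta)."""
    return math.exp(theta)

def derivative(f, t, h=1e-4):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

def second_derivative(f, t, h=1e-3):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

lam = 3.0                 # illustrative Poisson mean
theta = math.log(lam)     # natural parameter of the Poisson family

# Moments from the Poisson pmf, truncated far out in the tail
pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(100)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(mean, derivative(b, theta))         # both close to lam
print(var, second_derivative(b, theta))   # both close to lam, since a(phi) = 1
```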


Canonical Link Function

For a GLM where the response follows an exponential family distribution
we have

$$ \eta_i = g(\mu_i) = g(b'(\theta_i)) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{11} $$

The link function of a GLM connects the random component and
the linear predictor; that is, a GLM relates the linear predictor
$\eta_i = \sum_{j=1}^{p} \beta_j x_{ij}$ to $\mu_i$ by $\eta_i = g(\mu_i)$, for a link function $g$.
The link function $g$ that transforms the mean $\mu_i$ to the natural
parameter $\theta_i$ is called the canonical link:

$$ g = (b')^{-1} \implies g(\mu_i) = \theta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{12} $$

This direct relationship equates the natural parameter to the linear
predictor.
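
To illustrate the relation g = (b')^{-1}, the sketch below checks numerically that the log link undoes b' for the Poisson family and that the logit link undoes b' for the Bernoulli family. The cumulant functions b(theta) = exp(theta) and b(theta) = log(1 + exp(theta)) are the standard ones for these two families; the function names and the grid of theta values are illustrative assumptions.

```python
import math

def b_prime_poisson(theta):
    # b(theta) = exp(theta)  =>  mu = b'(theta) = exp(theta)
    return math.exp(theta)

def b_prime_bernoulli(theta):
    # b(theta) = log(1 + exp(theta))  =>  mu = b'(theta) = exp(theta) / (1 + exp(theta))
    return math.exp(theta) / (1.0 + math.exp(theta))

def log_link(mu):
    return math.log(mu)

def logit_link(mu):
    return math.log(mu / (1.0 - mu))

thetas = (-2.0, -0.5, 0.0, 1.0, 2.5)
# Applying the canonical link to mu = b'(theta) should return theta
recovered = [(log_link(b_prime_poisson(t)), logit_link(b_prime_bernoulli(t))) for t in thetas]
for t, (via_log, via_logit) in zip(thetas, recovered):
    print(t, via_log, via_logit)
```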

Normal General Linear Model as a Special Case

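- For the general linear model with $\epsilon \sim N(0, \sigma^2)$ we have the
  linear predictor

$$ \eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \tag{13} $$

- the link function $g(\mu_i) = \mu_i$
- the variance function $V(\mu_i) = 1$
Proof:

$$ Y_i = x_i^T \beta + \epsilon_i = \mu_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \text{ iid}, \quad i = 1, \dots, n \tag{14} $$

The density of $Y_i$ has exponential family form since:

$$ f(y_i; \mu_i, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - \mu_i)^2 \right\} \tag{15} $$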


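$$ = \exp\left\{ \frac{y_i \mu_i - \mu_i^2 / 2}{\sigma^2} - \frac{1}{2} \left[ \ln(2\pi\sigma^2) + \frac{y_i^2}{\sigma^2} \right] \right\} \tag{16} $$

This implies, for $\theta_i = \mu_i$ and $\phi = \sigma^2$,

$$ b(\theta_i) = \frac{\mu_i^2}{2} = \frac{\theta_i^2}{2}, \quad a(\phi) = \sigma^2, \quad c(y_i, \phi) = -\frac{1}{2} \left[ \ln(2\pi\phi) + \frac{y_i^2}{\phi} \right] \tag{17} $$

We have the identity as link function, i.e. $g(\mu_i) = \mu_i$

Some canonical link functions:
- log link for the Poisson distribution
- logit link for the Binomial distribution
Proof: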


Modelling Binomial Data

- Suppose $Y_i \sim \mathrm{Binomial}(n_i, p_i)$ and we wish to model the
  proportions $Y_i / n_i$; then
  $$ E(Y_i / n_i) = p_i, \quad \mathrm{Var}(Y_i / n_i) = \frac{1}{n_i} p_i (1 - p_i) \tag{18} $$
- Variance function is $V(\mu_i) = \mu_i (1 - \mu_i)$
- Link function must map from $(0, 1) \to (-\infty, \infty)$
- A common choice is
  $$ g(\mu_i) = \mathrm{logit}(\mu_i) = \log\left( \frac{\mu_i}{1 - \mu_i} \right) \tag{19} $$
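
As a small check of the moments of the proportion Y/n and of the range of the logit link, the sketch below sums over the binomial probability mass function exactly and evaluates the logit at a few points. The values n = 12 and p = 0.3 and the function names are arbitrary illustrations.

```python
import math

def proportion_moments(n, p):
    """Mean and variance of Y/n for Y ~ Binomial(n, p), by summing over the pmf."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * q for k, q in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

def logit(mu):
    """Maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1 - mu))

n, p = 12, 0.3   # illustrative values
mean, var = proportion_moments(n, p)
print(mean, var, p * (1 - p) / n)   # mean matches p, var matches p(1 - p)/n

# The logit is 0 at mu = 0.5 and antisymmetric around it
print(logit(0.1), logit(0.5), logit(0.9))
```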


Modelling Poisson Data

- Suppose $Y_i \sim \mathrm{Poisson}(\lambda_i)$; then
  $$ E(Y_i) = \lambda_i, \quad \mathrm{Var}(Y_i) = \lambda_i \tag{20} $$
- Variance function is $V(\mu_i) = \mu_i$
- Link function must map from $(0, \infty) \to (-\infty, \infty)$
- A natural choice is
  $$ g(\mu_i) = \log(\mu_i) \tag{21} $$


Exercise 1

Prove that the following distributions belong to the exponential family
of distributions:
1. Binomial Distribution
2. Poisson Distribution
3. Normal Distribution
4. Gamma Distribution
5. Negative Binomial Distribution


Exercise 2

Data are generated from the exponential distribution with density
$f(y) = \lambda \exp(-\lambda y)$, where $\lambda, y > 0$. The distribution is a member
of the exponential family.
1. Identify the specific form of $\theta$, $\phi$, $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ for the
   exponential distribution
2. What are the canonical link and variance function for a GLM
   with a response following the exponential distribution?
3. Identify a practical difficulty that may arise when using the
   canonical link in this instance


Transformation vs. GLM

In some situations a response variable can be transformed to
improve linearity and homogeneity of variance so that a general
linear model can be applied. This approach has some drawbacks:
- the response variable has changed!
- transformations must simultaneously improve linearity and
  homogeneity of variance
- transformations may not be defined on the boundaries of the
  sample space


Likelihood and Asymptotic Distributions for GLMs

For n independent observations, the log-likelihood for the sample
$y_1, \dots, y_n$ is:

$$ L(\beta) = \sum_{i=1}^{n} L_i = \sum_{i=1}^{n} \log f(y_i; \theta_i, \phi) = \sum_{i=1}^{n} \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + \sum_{i=1}^{n} c(y_i, \phi) \tag{22} $$

The notation $L(\beta)$ reflects the dependence of $\theta$ on the model
parameters $\beta$.
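
As an illustration of this log-likelihood, the sketch below evaluates it for an assumed Poisson model with canonical log link and a single covariate, once in exponential-family form and once directly from the probability mass function. Here theta_i = beta_0 + beta_1 x_i, b(theta) = exp(theta), a(phi) = 1 and c(y, phi) = -log(y!). The data, the parameter values and the function names are made up for the example.

```python
import math

def poisson_loglik(beta, xs, ys):
    """Sum of [y_i*theta_i - b(theta_i)] / a(phi) + c(y_i, phi) for the Poisson
    family with log link: theta_i = beta_0 + beta_1*x_i, b(theta) = exp(theta),
    a(phi) = 1 and c(y, phi) = -log(y!)."""
    total = 0.0
    for x, y in zip(xs, ys):
        theta = beta[0] + beta[1] * x
        total += y * theta - math.exp(theta) - math.lgamma(y + 1)
    return total

def poisson_loglik_direct(beta, xs, ys):
    """Same quantity computed from the pmf mu^y * exp(-mu) / y!."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = math.exp(beta[0] + beta[1] * x)
        total += math.log(mu**y * math.exp(-mu) / math.factorial(y))
    return total

xs = [0.0, 1.0, 2.0, 3.0]   # made-up covariate values
ys = [1, 2, 4, 7]           # made-up counts
beta = (0.1, 0.6)           # made-up parameter values

ll_family = poisson_loglik(beta, xs, ys)
ll_direct = poisson_loglik_direct(beta, xs, ys)
print(ll_family, ll_direct)   # the two forms agree up to rounding
```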


Likelihood Equations for a GLM


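- For a GLM $\eta_i = \sum_{j=1}^{p} \beta_j x_{ij} = g(\mu_i)$ with link function $g$, the
  likelihood equations are
  $$ \frac{\partial L(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial L_i}{\partial \beta_j} = 0, \quad \forall j \tag{23} $$
- To differentiate the log-likelihood (22), we use the chain rule:
  $$ \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \cdot \frac{\partial \theta_i}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \beta_j} \tag{24} $$
- Since $\partial L_i / \partial \theta_i = [y_i - b'(\theta_i)] / a(\phi)$, and since $\mu_i = b'(\theta_i)$ and
  $\mathrm{Var}(y_i) = b''(\theta_i)\, a(\phi)$, then
  $$ \frac{\partial L_i}{\partial \theta_i} = \frac{y_i - \mu_i}{a(\phi)}, \quad \frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i) = \frac{\mathrm{Var}(y_i)}{a(\phi)} \tag{25} $$
- We also know that, since $\eta_i = \sum_{j=1}^{p} \beta_j x_{ij}$, $\partial \eta_i / \partial \beta_j = x_{ij}$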

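Finally, since $\eta_i = g(\mu_i)$, $\partial \mu_i / \partial \eta_i$ depends on the link function for the
model. Then

$$ \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \cdot \frac{\partial \theta_i}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \beta_j} \tag{26} $$

$$ = \frac{y_i - \mu_i}{a(\phi)} \cdot \frac{a(\phi)}{\mathrm{Var}(y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} = \frac{(y_i - \mu_i)\, x_{ij}}{\mathrm{Var}(y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \tag{27} $$

Summing over the n observations yields the likelihood equations for
a GLM:

$$ \frac{\partial L(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_{ij}}{\mathrm{Var}(y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, 2, \dots, p \tag{28} $$

where $\eta_i = \sum_{j=1}^{p} \beta_j x_{ij} = g(\mu_i)$ for link function $g$.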


- Let $V$ denote the diagonal matrix of variances of the
  observations
- Let $D$ denote the diagonal matrix with elements $\partial \mu_i / \partial \eta_i$
- For the GLM $\eta = X\beta$, these likelihood equations have the
  form:
  $$ X^T D V^{-1} (y - \mu) = 0 \tag{29} $$
- Different link functions yield different sets of equations
- The likelihood equations are nonlinear functions of $\beta$ that
  must be solved iteratively
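
One common way to solve such equations iteratively is Fisher scoring, a Newton-type method. The slides do not name a specific algorithm here, so its use below is an assumption. The sketch fits an assumed Poisson model with log link, an intercept and one covariate, for which d mu/d eta = mu and Var(y) = mu, so the score reduces to X^T (y - mu) and the information matrix to X^T diag(mu) X. The data and the function name are made up.

```python
import math

def fit_poisson_log_link(xs, ys, n_iter=25, tol=1e-10):
    """Fisher scoring for eta_i = b0 + b1*x_i with log link and Poisson variance."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0   # start from the intercept-only fit
    for _ in range(n_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector X^T D V^{-1} (y - mu); here D V^{-1} is the identity
        s0 = sum(y - m for y, m in zip(ys, mus))
        s1 = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
        # Information matrix X^T W X with w_i = mu_i (2 x 2, symmetric)
        i00 = sum(mus)
        i01 = sum(x * m for x, m in zip(xs, mus))
        i11 = sum(x * x * m for x, m in zip(xs, mus))
        det = i00 * i11 - i01 * i01
        # Solve the 2 x 2 linear system for the update step
        d0 = (i11 * s0 - i01 * s1) / det
        d1 = (-i01 * s0 + i00 * s1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up covariate values
ys = [1, 3, 4, 9, 14]            # made-up counts
b0, b1 = fit_poisson_log_link(xs, ys)
mus = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1)
```

At the returned estimates both components of the score are numerically zero, which is what the likelihood equations require.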


Asymptotic Distribution of $\hat{\beta}$ for GLM $\eta = X\beta$

- $\hat{\beta}$ has an approximate $N[\beta, (X^T W X)^{-1}]$ distribution, where
  $W$ is the diagonal matrix with elements $w_i = (\partial \mu_i / \partial \eta_i)^2 / \mathrm{Var}(y_i)$
- The asymptotic covariance matrix is estimated by:

$$ \widehat{\mathrm{Var}}(\hat{\beta}) = (X^T \hat{W} X)^{-1} \tag{30} $$

where $\hat{W}$ is $W$ evaluated at $\hat{\beta}$
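
A minimal sketch of this estimate, assuming a Poisson model with log link, an intercept and one covariate, so that w_i = mu_i^2 / mu_i = mu_i. The estimate beta_hat used below is a placeholder rather than a value fitted from real data, and the function name and covariate values are illustrative.

```python
import math

def poisson_log_link_covariance(beta_hat, xs):
    """(X^T W X)^{-1} for eta_i = b0 + b1*x_i with log link and Poisson variance."""
    mus = [math.exp(beta_hat[0] + beta_hat[1] * x) for x in xs]
    a = sum(mus)                                   # entry (0, 0) of X^T W X
    b = sum(x * m for x, m in zip(xs, mus))        # off-diagonal entry
    c = sum(x * x * m for x, m in zip(xs, mus))    # entry (1, 1)
    det = a * c - b * b
    # Closed-form inverse of a symmetric 2 x 2 matrix
    return [[c / det, -b / det], [-b / det, a / det]]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up covariate values
beta_hat = (0.3, 0.6)            # placeholder estimate, not fitted from real data
cov = poisson_log_link_covariance(beta_hat, xs)
std_errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
print(cov)
print(std_errors)
```

The square roots of the diagonal entries are the approximate standard errors that Wald-type inference on the parameters relies on.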


Exercise 3

Consider a study intended to investigate racial discrimination in
the calling of fouls by referees in the NBA. n black referees and n white
referees were randomly selected (n > 0). For each referee, k foul
calls were randomly selected to count how many calls were given to
black players and how many calls were given to white players.
Therefore, we have the dataset $(Y_i, X_i)$, $i = 1, \dots, n$, where $Y_i$ is
the number of foul calls given to black players by the ith referee, and $X_i$
indicates whether the ith referee is black.
1. Please construct a GLM to serve the study goal:
   - Specify $\theta_i$, $\phi$, $b(\theta_i)$, $a(\phi)$ and $c(y_i)$, which define the exponential
     family distribution of $Y_i$
   - State the canonical link function and variance function
2. Using the canonical link function, suppose we have
   $\hat{\theta}_i = 1.2 - 0.5 X_i$ estimated from the data. Interpret the result


Exercise 4

- You have studied how many emails 20 smartphone users
  (Device A and Device B) sent to someone from their mobile
  devices during the period of the study.
- You have the average number of emails per day for each
  participant, and Device and DataPlan are treated as
  categorical variables
- Construct a GLM for testing the hypothesis that the number
  of emails sent from mobile devices is influenced by the
  device used and by whether the participant had an
  unlimited data plan
