
Generalised linear models

Generalised linear models, abbreviated GLMs, are a generalisation of
the linear model that can be used when your response variable has a
non-normal error distribution.
Typical examples are when your response variable is binary (only takes
two values, e.g. 0 or 1), or is a count of something.
Fitting GLMs in R is more or less entirely analogous to fitting linear
models, but model diagnostics are very different.
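
To give a feel for the syntax before we start, here is a minimal sketch using a hypothetical data frame d with a response y and an explanatory variable x (these names are placeholders, not part of the wine example below); the only real difference from lm is the family argument:

# d, y and x are placeholders for illustration only.
# Ordinary linear model (normal error distribution):
lm(y ~ x, data = d)

# Logistic regression for a binary response:
glm(y ~ x, data = d, family = binomial)

# Poisson regression for a count response:
glm(y ~ x, data = d, family = poisson)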

Modelling proportions: Logistic regression


As the first example of binary data, we will consider the wine quality
dataset wine from Cortez et al. (2009), which is available in the UCI
Machine Learning Repository at
http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains
measurements on white and red vinho verde wine samples from
northern Portugal.
The data is also posted on Blackboard as .csv files.

We start by loading the data. It is divided into two separate .csv files,
one for white wines and one for red, which we have to merge.

Reading data in Excel

# Import data about white and red wines. If the data is stored as an
# Excel file, it can be read with the readxl package:
install.packages("readxl")
library(readxl)

# Alternatively, the .csv files can be read with read.csv(). First check
# the current working directory:
> getwd()
[1] "C:/Users/User/OneDrive/Documents"

# Import data about white and red wines:

> white <- read.csv("C:/Users/User/OneDrive/Desktop/winedata1.csv",
                    header = TRUE, sep = ",")

> red <- read.csv("C:/Users/User/OneDrive/Desktop/winedata2.csv",
                  header = TRUE, sep = ",")

# Add a type variable:
> white$type <- "white"
> red$type <- "red"

# Merge the datasets:
> wine <- rbind(white, red)
> wine$type <- factor(wine$type)
# Check the result:
> summary(wine)

• We are interested in seeing if measurements like pH (pH) and
alcohol content (alcohol) can be used to determine the colour of
the wine.
• The colour is represented by the type variable, which is binary. Our
model is that the type of a randomly selected wine is binomially
distributed (Bernoulli distributed), where the success probability π
depends on explanatory variables like pH and alcohol content.
A common model for this situation is a logistic regression model. Given
n observations of p explanatory variables, the model is:
\[
\mathrm{logit}(\pi) = \ln\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\]

Whereas in linear regression we model the expected value of the
response variable as a linear function of the explanatory variables, we
now model a function of the expected value of the response variable
(here $\ln\!\left(\frac{\pi}{1-\pi}\right)$, where $\pi$ is the expected value of the binary response).

In GLM terminology, this function is known as a link function. Logistic
regression models can be fitted using the glm function. To specify what
our model is, we use the argument family = binomial:

m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

summary(m)
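
The coefficients reported by summary(m) are on the log-odds (logit) scale. As a small sketch of how the link function works, we can convert the linear predictor back to probabilities with plogis, R's built-in inverse-logit function:

# The linear predictor is on the log-odds scale; plogis() is the
# inverse of the logit link and converts it back to a probability:
eta <- predict(m)   # linear predictor (log-odds) for each wine
head(plogis(eta))   # same values as predict(m, type = "response")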

The coefficients of a logistic regression model aren't as
straightforward to interpret as those in a linear model. If we let $\beta$
denote the coefficient corresponding to an explanatory variable $x$, then:

If $\beta$ is positive, then $\ln\!\left(\frac{\pi}{1-\pi}\right)$ increases when $x$ increases.
If $\beta$ is negative, then $\ln\!\left(\frac{\pi}{1-\pi}\right)$ decreases when $x$ increases.

The odds ratio $e^{\beta}$ shows how much the odds $\frac{\pi}{1-\pi}$ change when $x$ is
increased by one unit.
We can extract the coefficients and odds ratios using:

coef(m)

exp(coef(m))  # gives the odds ratios
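
If we also want confidence intervals for the odds ratios, one simple option (a sketch, using Wald-type intervals based on the estimated standard errors) is to exponentiate the intervals for the coefficients:

# Wald-type 95% confidence intervals on the log-odds scale,
# exponentiated to the odds-ratio scale:
exp(confint.default(m))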

To find the fitted probability that an observation belongs to the
second class, we can use:

predict(m, type = "response")

# Check which class is the second one:

levels(wine$type)

# "white" is the second class!

# Get fitted probabilities:

probs <- predict(m, type = "response")

# Check what the average prediction is for the two groups:

mean(probs[wine$type == "red"])

mean(probs[wine$type == "white"])

It turns out that the model predicts that most wines are white - even
the red ones!
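
One way to see this concretely (a minimal sketch, using an arbitrary 0.5 cut-off) is to cross-tabulate the actual types against the predicted ones:

# Classify a wine as "white" if its fitted probability exceeds 0.5,
# and compare with the actual type:
table(Actual = wine$type, PredictedWhite = probs > 0.5)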

Model diagnostics
It is difficult to assess model fit for GLMs, because the behaviour of
the residuals is very different from that of residuals in ordinary linear
models. In the case of logistic regression, the response variable is
always 0 or 1, meaning that there will be two bands of residuals:

# Store predicted values, deviance residuals and Cook's distances:
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
res <- data.frame(Predicted = predict(m),
                  Residuals = residuals(m, type = "deviance"),
                  Index = 1:nrow(wine),
                  CooksDistance = cooks.distance(m))

# Plot fitted values against the deviance residuals:

library(ggplot2)
ggplot(res, aes(Predicted, Residuals)) + geom_point()

# Plot index against the deviance residuals:

ggplot(res, aes(Index, Residuals)) + geom_point()

Plots of raw residuals are of little use in logistic regression models. A
better option is to use a binned residual plot, in which the observations
are grouped into bins based on their fitted values. The average residual
in each bin can then be computed, which will tell us which parts of
the model have a poor fit. A function for this is available in the arm
package:

# Install and load the arm package:
install.packages("arm")
library(arm)

# Binned residual plot of fitted probabilities against response residuals:
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))

The grey lines show confidence bounds which are supposed to contain
about 95 % of the bins. If too many points fall outside these bounds,
it’s a sign that we have a poor model fit. In this case, there are a few
points outside the bounds. Most notably, the average residuals are
fairly large for the observations with the lowest fitted values, i.e.
among the observations with the lowest predicted probability of being
white wines.

Let’s compare the above plot to that for a model with more explanatory
variables:
m2 <- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
          data = wine, family = binomial)

binnedplot(predict(m2, type = "response"),
           residuals(m2, type = "response"))

This looks much better - adding more explanatory variables appears to
have improved the model fit.

• It's worth repeating that if your main interest is hypothesis
testing, you shouldn't fit multiple models and then pick the one
that gives the best results.

• However, if you're doing an exploratory analysis or are interested
in predictive modelling, you can and should try different models.
It can then be useful to do a formal hypothesis test of the null
hypothesis that m and m2 fit the data equally well, against the
alternative that m2 has a better fit.

• If both fit the data equally well, we'd prefer m, since it is a
simpler model.

We can use anova to perform a likelihood ratio (deviance) test, which
tests this:

anova(m, m2, test = "LRT")

The p-value is very low, and we conclude that m2 has a better model
fit.
• Another useful function is cooks.distance, which computes the
Cook's distance for each observation and is useful for finding
influential observations.

In this case, I've chosen to print the row numbers for the observations
with a Cook's distance greater than 0.004. This cut-off has been chosen
arbitrarily, in order to highlight only the observations with the highest
Cook's distances.
res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))

# Plot index against the Cook's distance to find influential points:
ggplot(res, aes(Index, CooksDistance)) + geom_point() +
  geom_text(aes(label = ifelse(CooksDistance > 0.004, rownames(res), "")),
            hjust = 1.1)

Exercise: Investigate the residuals for your sharks.csv model. Are
there any problems with the model fit? Any influential points?

Prediction
Just as for linear models, we can use predict to make predictions for
new observations using a GLM. To begin with, let’s randomly sample 10
rows from the wine data and fit a model using all data except those ten
observations:

# Randomly select 10 rows from the wine data:

rows <- sample(1:nrow(wine), 10)


m <- glm(type ~ pH + alcohol, data = wine[-rows,], family = binomial)

We can now use predict to make predictions for the ten observations:

preds <- predict(m, wine[rows, ])
preds

Logistic regression models are often used for prediction, in what is
known as classification.
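
Note that predict, used as above, returns values on the log-odds scale. When using the model as a classifier, we typically ask for probabilities instead and apply a cut-off; a minimal sketch (again with an arbitrary 0.5 cut-off):

# Predicted probabilities of being white for the ten held-out wines:
pred_probs <- predict(m, wine[rows, ], type = "response")

# Classify using a 0.5 cut-off and compare with the actual types:
pred_class <- ifelse(pred_probs > 0.5, "white", "red")
data.frame(Actual = wine$type[rows], Predicted = pred_class)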
