
Week 5

Modelling count data


• Logistic regression is but one of many types of GLMs used in
  practice.
• Let's have a look at the shark attack data in sharks.csv. It
  contains data about shark attacks in South Africa, downloaded
  from The Global Shark Attack File
  (http://www.sharkattackfile.net/incidentlog.htm).

To load it, we download the file and set file_path to the path of
sharks.csv:

sharks <- read.csv(file_path, sep =";")


For example, if the file was saved to the Downloads folder on a Windows machine:

sharks <- read.csv("C:\\Users\\User\\Downloads\\GSAF5.csv",
                   header = TRUE, stringsAsFactors = FALSE)

# Compute number of attacks per year:
attacks <- aggregate(Type ~ Year, data = sharks, FUN = length)

# Keep data for 1960-2019:
attacks <- subset(attacks, Year >= 1960)

The number of attacks in a year is not binary but a count that, in
principle, can take any nonnegative integer as its value.

Are there any trends over time for the number of reported
attacks?

# Plot data from 1960-2019:
library(ggplot2)
ggplot(attacks, aes(Year, Type)) +
  geom_point() +
  ylab("Number of attacks")

No trend is evident. To confirm this, let's fit a regression model with
Type (the number of attacks) as the response variable and Year as an
explanatory variable.
• For count data like this, a good first model to use is Poisson
  regression. Let μi denote the expected value of the response
  variable given the explanatory variables.
• Given observations of the explanatory variables, the Poisson
  regression model is

  log(μi) = β0 + β1xi1 + β2xi2 + ⋯ + βpxip,   i = 1, …, n

To fit it, we use glm as before, but this time with family = poisson:

m <- glm(Type ~ Year, data = attacks, family = poisson)
summary(m)

Interpretation
A 1-unit increase in X is associated with an average change of approximately
100×β1% in Y.

Explanation
We want to know how Y changes when we increase X by 1 unit, i.e. when
X becomes (X + 1). Here Y stands for the expected value of the response,
which is what the model describes on the log scale.

Let's call:

Ynew: the value of Y after increasing X by 1 unit
Yold: the value of Y before increasing X

Since log(Y) = β0 + β1X under the model,

log(Ynew) − log(Yold) = β0 + β1(X + 1) − (β0 + β1X)
log(Ynew) − log(Yold) = β1

And since log(a) − log(b) = log(a/b), then:

log(Ynew/Yold) = β1

So Ynew/Yold = e^β1.

Now let's transform Ynew/Yold into a percent change in Y, by subtracting 1
and multiplying by 100. So, when X increases by 1 unit, the percent change
in Y will be (e^β1 − 1) × 100.

When |β1| is small (say below 0.1): (e^β1 − 1) × 100 ≈ 100×β1.

The interpretation becomes:

When X increases by 1 unit, Y changes by approximately 100×β1%.
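
As a quick sketch of this in R (an addition to these notes; b1 is just an
illustrative variable name), the exact and approximate percent changes can be
computed directly from the fitted coefficient for Year:

# Estimated coefficient for Year in the Poisson model:
b1 <- coef(m)["Year"]

# Exact percent change in the expected number of attacks per one-year increase:
(exp(b1) - 1) * 100

# Small-coefficient approximation:
100 * b1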

We can add the curve corresponding to the fitted model to our
scatterplot as follows:

attacks_pred <- data.frame(Year = attacks$Year,
                           at_pred = predict(m, type = "response"))

ggplot(attacks, aes(Year, Type)) +
  geom_point() +
  ylab("Number of attacks") +
  geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
            colour = "red")

The fitted model seems to confirm our view that there is no trend over
time in the number of attacks.

For model diagnostics, we can use a binned residual plot and a plot of
Cook's distance to find influential points:

# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
           residuals(m, type = "response"))

# Plot index against the Cook's distance to find
# influential points:
res <- data.frame(Index = 1:nrow(m$data),
                  CooksDistance = cooks.distance(m))

ggplot(res, aes(Index, CooksDistance)) +
  geom_point() +
  geom_text(aes(label = ifelse(CooksDistance > 0.1,
                               rownames(res), "")), hjust = 1.1)

A common problem in Poisson regression models is excess zeros, i.e.
more observations with value 0 than what is predicted by the model. To
check the distribution of counts in the data, we can draw a histogram:

ggplot(attacks, aes(Type)) +
  geom_histogram(binwidth = 1, colour = "black")

If there are a lot of zeroes in the data, we should consider using
another model, such as a hurdle model or a zero-inflated Poisson
regression. Both of these are available in the pscl package.
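
A rough sketch of how these could be fitted (assuming the pscl package is
installed; note that both functions require the response to actually contain
zero counts, which may not be the case for these yearly totals):

# install.packages("pscl")
library(pscl)

# Zero-inflated Poisson regression:
m_zip <- zeroinfl(Type ~ Year, data = attacks, dist = "poisson")

# Hurdle model with a Poisson count component:
m_hurdle <- hurdle(Type ~ Year, data = attacks, dist = "poisson")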

Another common problem is overdispersion, which occurs when there
is more variability in the data than what is predicted by the GLM.
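
An informal check (not a formal test, and not part of the original notes) is
to compare the residual deviance of the fitted model to its residual degrees
of freedom; under the Poisson assumption their ratio should be roughly 1:

# Rough dispersion estimate; values well above 1 suggest overdispersion:
deviance(m) / df.residual(m)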

A formal test of overdispersion is provided by dispersiontest in the
AER package. The null hypothesis is that there is no overdispersion,
and the alternative is that there is overdispersion:

install.packages("AER")
library(AER)
dispersiontest(m, trafo = 1)
• There are several alternative models that can be considered in
  the case of overdispersion.
• One of them is negative binomial regression, which uses the same
  link function as Poisson regression.
• We can fit it using the glm.nb function from MASS:

library(MASS)
m_nb <- glm.nb(Type ~ Year, data = attacks)
summary(m_nb)

For the shark attack data, the predictions from the two models are
virtually identical, meaning that both are equally applicable in this case:

attacks_pred <- data.frame(Year = attacks$Year,
                           at_pred = predict(m, type = "response"))
attacks_pred_nb <- data.frame(Year = attacks$Year,
                              at_pred = predict(m_nb, type = "response"))

ggplot(attacks, aes(Year, Type)) +
  geom_point() +
  ylab("Number of attacks") +
  geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
            colour = "red") +
  geom_line(data = attacks_pred_nb, aes(x = Year, y = at_pred),
            colour = "blue", linetype = "dashed")
