Week 4 GLM
We start by loading the data. It is divided into two separate .csv files, one for white wines and one for red, which we have to merge:
> white <- read.csv("C:/Users/User/OneDrive/Desktop/winedata1.csv",
                    header = TRUE, sep = ",")
> red <- read.csv("C:/Users/User/OneDrive/Desktop/winedata2.csv",
                  header = TRUE, sep = ",")
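The merge itself is just a matter of tagging each data frame with its wine type and stacking the rows. A minimal sketch, using small stand-in data frames in place of the full CSV files (the column names here are placeholders):

```r
# Stand-in data frames; in the document these come from read.csv above
white <- data.frame(pH = c(3.0, 3.2), alcohol = c(10.5, 11.0))
red   <- data.frame(pH = c(3.3, 3.5), alcohol = c(9.8, 12.1))

# Tag each data frame with its type, then stack the rows
white$type <- "white"
red$type   <- "red"
wine <- rbind(white, red)
wine$type <- factor(wine$type)  # the binomial family expects a factor (or 0/1)
```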
now model a function of the expected value of the response variable (that is, a function of 𝜋): the log-odds ln(𝜋/(1−𝜋)).
In GLM terminology, this function is known as a link function. Logistic
regression models can be fitted using the glm function. To specify what
our model is, we use the argument family = binomial:
summary(m)
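The call that produces m is not shown above. A minimal sketch, assuming the two-predictor formula that the larger model m2 later in the document extends, with simulated data standing in for the wine data frame:

```r
set.seed(1)
# Simulated stand-in for the wine data frame
wine <- data.frame(pH      = rnorm(200, 3.2, 0.2),
                   alcohol = rnorm(200, 10.5, 1),
                   type    = factor(sample(c("red", "white"), 200, replace = TRUE)))

# Fit a logistic regression; family = binomial uses the logit link by default
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
summary(m)

# Coefficients on the log-odds scale, and the corresponding odds ratios
coef(m)
exp(coef(m))
```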
If 𝛽 is positive, then ln(𝜋/(1−𝜋)) increases when xi increases.
If 𝛽 is negative, then ln(𝜋/(1−𝜋)) decreases when xi increases.
The odds ratio shows how much the odds change when xi is increased by one step.
We can extract the coefficients using coef(m), and the odds ratios by exponentiating them with exp(coef(m)).
levels(wine$type)
mean(probs[wine$type == "red"])
mean(probs[wine$type == "white"])
It turns out that the model predicts that most wines are white - even
the red ones!
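The vector probs used above holds the fitted probabilities of being a white wine. A sketch of how it is obtained (the model and data are simulated stand-ins, as before):

```r
set.seed(1)
# Simulated stand-in for the wine data frame
wine <- data.frame(pH      = rnorm(200, 3.2, 0.2),
                   alcohol = rnorm(200, 10.5, 1),
                   type    = factor(sample(c("red", "white"), 200, replace = TRUE)))
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

# type = "response" returns probabilities on the 0-1 scale rather than log-odds
probs <- predict(m, type = "response")
mean(probs[wine$type == "red"])
mean(probs[wine$type == "white"])
```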
Model diagnostics
It is difficult to assess model fit for GLMs, because the behaviour of the residuals is very different from that of residuals in ordinary linear models.
In the case of logistic regression, the response variable is always 0 or
1, meaning that there will be two bands of residuals:
# Plot index against the deviance residuals:
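A sketch of that plot, using deviance residuals from a fitted model m (simulated data stand in for the wine data frame):

```r
set.seed(1)
# Simulated stand-in for the wine data frame
wine <- data.frame(pH      = rnorm(200, 3.2, 0.2),
                   alcohol = rnorm(200, 10.5, 1),
                   type    = factor(sample(c("red", "white"), 200, replace = TRUE)))
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)

# Index on the x-axis, deviance residuals on the y-axis;
# the 0/1 response produces the two visible bands of points
dres <- residuals(m, type = "deviance")
plot(dres, ylab = "Deviance residuals")
```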
The grey lines show confidence bounds which are supposed to contain
about 95 % of the bins. If too many points fall outside these bounds,
it’s a sign that we have a poor model fit. In this case, there are a few
points outside the bounds. Most notably, the average residuals are
fairly large for the observations with the lowest fitted values, i.e.
among the observations with the lowest predicted probability of being
white wines.
Let’s compare the above plot to that for a model with more explanatory
variables:
m2 <- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
          data = wine, family = binomial)
library(arm)  # binnedplot comes from the arm package
binnedplot(predict(m2, type = "response"),
           residuals(m2, type = "response"))
The p-value is very low, and we conclude that m2 has a better model
fit.
Another useful function is cooks.distance, which computes the Cook's distance for each observation; this is useful for finding influential observations.
In this case, I’ve chosen to print the row numbers for the observations with a Cook’s distance greater than 0.004. This cutoff was chosen arbitrarily, simply to highlight the observations with the highest Cook’s distances.
res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))
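Filtering res for the high-influence rows mentioned above can then be done with subset; a sketch, with simulated data standing in for the wine data frame:

```r
set.seed(1)
# Simulated stand-in for the wine data frame
wine <- data.frame(pH      = rnorm(200, 3.2, 0.2),
                   alcohol = rnorm(200, 10.5, 1),
                   type    = factor(sample(c("red", "white"), 200, replace = TRUE)))
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))

# Row numbers of observations with Cook's distance above the (arbitrary) cutoff
subset(res, CooksDistance > 0.004)$Index
```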
Prediction
Just as for linear models, we can use predict to make predictions for
new observations using a GLM. To begin with, let’s randomly sample 10
rows from the wine data and fit a model using all data except those ten
observations:
We can now use predict to make predictions for the ten observations:
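A sketch of those two steps, with the model formula and data assumed as before:

```r
set.seed(1)
# Simulated stand-in for the wine data frame
wine <- data.frame(pH      = rnorm(200, 3.2, 0.2),
                   alcohol = rnorm(200, 10.5, 1),
                   type    = factor(sample(c("red", "white"), 200, replace = TRUE)))

# Hold out ten randomly chosen rows, and fit the model on the rest
rows <- sample(1:nrow(wine), 10)
m3 <- glm(type ~ pH + alcohol, data = wine[-rows, ], family = binomial)

# Predicted probabilities of being white for the ten held-out observations
predict(m3, newdata = wine[rows, ], type = "response")
```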