
02418 Week 4, solution

Exercise 4.38

We look at sulphur dioxide data from 41 cities.

data <- read.table("exercise4_38.dat",header=FALSE)[[1]]


qqnorm(data)

[Figure: Normal Q-Q plot of the data (Sample Quantiles against Theoretical Quantiles).]

The data is clearly not Gaussian, and one approach is to use a data transformation. We start with an exploratory approach using the Box-Cox transformation $y^{(\lambda)} = (y^\lambda - 1)/\lambda$.

## Exploratory
par(mfrow=c(2,2))
qqnorm(data)
qqnorm((sqrt(data)-1)/0.5) ## lambda=0.5
qqnorm(log(data)) ## lambda=0
qqnorm((1/sqrt(data)-1)/(-0.5)) ## lambda=-0.5

[Figure: Normal Q-Q plots of the raw data and of the transformed data for λ = 0.5, λ = 0 (log) and λ = −0.5.]

## Log seems to be a good option
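For convenience the transformation, including the λ = 0 special case, can be wrapped in a small helper (our own sketch; the name bc.trans is our choice):

bc.trans <- function(y, lambda){
    if(lambda == 0) log(y) else (y^lambda - 1) / lambda
}
qqnorm(bc.trans(data, 0)) ## reproduces the log panel above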

Using the library MASS we can get a profile likelihood of λ directly

## box-cox profile likelihood
library(MASS)
boxcox(lm(data~1))

[Figure: profile log-likelihood of λ from boxcox, with the 95% interval indicated.]

Again the log transformation is a good option, but λ = −0.5 would also be reasonable.


If we want to program the likelihood ourselves we need to take into account that we change the data when choosing a new λ, i.e. we apply the change of variables formula for random variables (see p. 102 of our textbook). The log-likelihood is given by

$$l(\lambda, \mu, \sigma^2) = \sum_i \left[ \log p\left(y_i^{(\lambda)}\right) + (\lambda - 1) \log(y_i) \right] \qquad (1)$$

where p is the density (here the normal density) and the last term comes from the change of variables, i.e.

$$f(y) = p\left(y^{(\lambda)}\right) \left| \frac{\partial y^{(\lambda)}}{\partial y} \right| = p\left(y^{(\lambda)}\right) y^{\lambda - 1} \qquad (2)$$

taking the log and summing over all observations gives the log-likelihood.
For each fixed λ we can find the optimal values of µ and σ² by optimising the likelihood with respect to µ and σ². These are given by

$$\hat{\mu}(\lambda) = \frac{1}{n} \sum_i y_i^{(\lambda)} \qquad (3)$$

$$\hat{\sigma}^2(\lambda) = \frac{1}{n} \sum_i \left( y_i^{(\lambda)} - \hat{\mu}(\lambda) \right)^2 \qquad (4)$$

hence we do not need the usual inner optimisation, and the Box-Cox profile log-likelihood can be implemented by

ll.lambda <- function(lambda,y){
    n <- length(y)                        ## number of obs
    y.lambda <- (y^(lambda)-1)/(lambda)   ## Transform data
    if(lambda==0){y.lambda <- log(y)}     ## lambda=0 is a special case
    yl.bar <- mean(y.lambda)              ## Estimate of mu.hat(lambda)
    sigmasq <- mean((y.lambda-yl.bar)^2)  ## Estimate of sigma.hat^2(lambda)
    ## profile log-likelihood
    - n/2 * log(sigmasq) - n/2 + (lambda-1)*sum(log(y))
}
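As a sanity check (our own addition; the use of optimize and the search interval are our choices) the maximiser of this profile log-likelihood should agree with the peak of the boxcox plot above:

## find the lambda maximising our profile log-likelihood
opt.lambda <- optimize(ll.lambda, interval = c(-1, 1),
                       y = data, maximum = TRUE)
opt.lambda$maximum ## should match the maximum of the boxcox curve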

and finally we can plot the profile log-likelihood

## Which lambdas to use


lambda <- seq(-1,0.5,by=0.01)

## profile likelihood
ll <- sapply(lambda, ll.lambda, y=data)

## Plot
plot(lambda, ll-max(ll),type="l",ylim=c(-5,0))
lines(range(lambda), -qchisq(0.95,df=1)/2 * c(1,1),lty=2)

[Figure: normalised profile log-likelihood ll − max(ll) against λ, with the 95% cut-off shown as a dashed line.]
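The 95% confidence limits for λ can also be extracted numerically (a sketch of our own; uniroot and the bracketing intervals are our choices) as the points where the profile log-likelihood crosses the cut-off:

## solve ll(lambda) = max(ll) - qchisq(0.95, df=1)/2 on each side
cutoff <- max(ll) - qchisq(0.95, df = 1) / 2
f <- function(l){ ll.lambda(l, y = data) - cutoff }
lam.hat <- lambda[which.max(ll)] ## lambda at the profile maximum
c(uniroot(f, c(min(lambda), lam.hat))$root,
  uniroot(f, c(lam.hat, max(lambda)))$root)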

Exercise 6.5

We start by plotting the data

data <- read.table("exercise6_5.dat",header=TRUE)


plot(data)

[Figure: scatter plot of y against x (dose).]

It is quite clear that a linear model in x (dose) will not work, but let's have a look at the residuals anyway

fit1 <- lm(y~x,data=data)


par(mfrow=c(1,2))
qqnorm(fit1$residuals)
qqline(fit1$residuals)

plot(fitted(fit1),fit1$residuals)

[Figure: Normal Q-Q plot of the residuals (left) and residuals against fitted values (right) for the linear model.]

Even though the distributional assumption is fine, we see quite clear patterns in the residuals vs. fitted values. We now implement the non-linear model proposed in the exercise, which can be formulated as

$$Y_i \sim N(\mu_i, \sigma^2) \qquad (5)$$

with $\mu_i$ given by

$$\mu_i(x) = \frac{\beta_0}{1 + e^{-\beta_1 (x_i - \mu)}} \qquad (6)$$
The mean value function is implemented by

mu.i <- function(beta0, beta1, mu, x){
    beta0 / (1 + exp(- beta1 * (x - mu)))
}

and the negative log-likelihood is implemented by

nll <- function(theta,data){
m <- mu.i(theta[1],theta[2],theta[3],data$x)
- sum(dnorm(data$y, mean = m, sd = sqrt(theta[4]),
log = TRUE))
}

Before we optimise the parameters we should choose some reasonable initial values: β0 is the maximum of µ(x), hence a reasonable initial guess is β0 = max(y); µ is the value where the mean value function attains half its maximum, which from the plot is around x = 60. β1 and σ² are chosen more arbitrarily, but with β1 < 0. We now find the optimal parameters

opt <- nlminb(c(max(data$y),-1,60,1), nll,
              lower = c(-Inf,-Inf,-Inf,0), data=data)

We can compare the two models by calculating the AIC for each

AIC(fit1)

## [1] 23.40353

2 * opt$objective + 2 * 4

## [1] 19.20323

Hence we should choose the non-linear model.


We calculate the standard errors in the usual way

library(numDeriv)
H <- hessian(nll, opt$par, data=data)
(se.theta <- sqrt(diag(solve(H)))) ## Standard errors

## [1] 0.803290930 0.003852883 6.130495493 0.060800704

opt$par ## Parameters

## [1] 10.98598626 -0.02956138 65.26138265 0.14893068

We see that a 95% confidence interval would not cover 0 for any of the parameters.
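As a quick tabulation (our own addition, using the usual quadratic approximation with plus/minus two standard errors):

## approximate 95% Wald intervals for all parameters (sketch)
cbind(estimate = opt$par,
      lower = opt$par - 2 * se.theta,
      upper = opt$par + 2 * se.theta)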
For the profile likelihood confidence interval of β1 we need the (negative) profile log-likelihood

nllp <- function(beta1,data){
fun.tmp <- function(theta,beta1,data){
nll(c(theta[1],beta1,theta[2:3]),data)
}
nlminb(c(max(data$y), 60, 1), fun.tmp,
lower = c(-Inf, -Inf, 0),
beta1 = beta1, data = data)$objective
}

which we can plot by

beta1 <- opt$par[2] + seq(-3 * se.theta[2], 3 * se.theta[2],
                          length=100)    ## range of beta1
nllprofile <- sapply(beta1, nllp,
                     data=data)          ## profile log-likelihood
plot(beta1, exp(-(nllprofile-opt$objective)),
     type = "l")                         ## Plot
lines(range(beta1), exp(-qchisq(0.95, df = 1) / 2) * c(1, 1),
      lty = 2, col = 4)                  ## Cut off
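The limits of the profile likelihood interval for β1 can be found numerically in the same way (a sketch of our own; uniroot and the brackets are our choices):

## nllp crosses opt$objective + qchisq(0.95, df=1)/2 at the limits
cutoff <- opt$objective + qchisq(0.95, df = 1) / 2
g <- function(b1){ nllp(b1, data = data) - cutoff }
c(uniroot(g, c(min(beta1), opt$par[2]))$root,
  uniroot(g, c(opt$par[2], max(beta1)))$root)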

[Figure: normalised profile likelihood exp(−(nllprofile − opt$objective)) against β1, with the 95% cut-off shown as a dashed line.]

The Wald confidence interval is given by

opt$par[2] + 2 * se.theta[2] * c(-1,1)

## [1] -0.03726715 -0.02185562

which is quite close to the profile likelihood confidence interval.


Finally, we look at the model together with the data

x <- seq(0,150)
plot(x, mu.i(opt$par[1], opt$par[2], opt$par[3], x),
lty = 2, col = 2, type = "l")
points(data)

[Figure: fitted mean function mu.i against x, with the observed data points overlaid.]

and the residuals

par(mfrow=c(1,2))
res <- data$y-mu.i(opt$par[1],opt$par[2],opt$par[3],data$x)
qqnorm(res)
qqline(res)
plot(mu.i(opt$par[1],opt$par[2],opt$par[3],data$x),res)

[Figure: Normal Q-Q plot of the residuals (left) and residuals against fitted means (right) for the non-linear model.]

There are still some systematic effects that could be investigated further, but the magnitude of the residuals is smaller in the non-linear model.

Exercise 6.14

We start by taking a look at the data

data <- read.table("exercise6_14.dat",header=TRUE)


plot(accident ~ month, pch=19,col=data$year-1969,data=data)

[Figure: accidents against month, with points coloured by year.]

From the data we see that there are fewer accidents in later years, and there is also a clear seasonal variation.

a)

Based on the plot above it is reasonable to try a model with year as a linear effect and month as a factor (i.e. a separate parameter for each month).

fit <- glm(accident~ year + factor(month), data = data, family = poisson)


summary(fit)

##
## Call:
## glm(formula = accident ~ year + factor(month), family = poisson,
## data = data)
##

## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93674 -0.49986 0.03137 0.65552 2.25070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 269.25450 78.92569 3.411 0.000646 ***
## year -0.13474 0.04005 -3.365 0.000766 ***
## factor(month)2 -0.34484 0.14177 -2.432 0.014998 *
## factor(month)3 -0.21278 0.13654 -1.558 0.119138
## factor(month)4 -0.39304 0.14380 -2.733 0.006272 **
## factor(month)5 -0.31015 0.14035 -2.210 0.027110 *
## factor(month)6 -0.47000 0.14720 -3.193 0.001408 **
## factor(month)7 -0.23361 0.13733 -1.701 0.088921 .
## factor(month)8 -0.35667 0.14226 -2.507 0.012169 *
## factor(month)9 -0.14310 0.13397 -1.068 0.285460
## factor(month)10 0.19877 0.13515 1.471 0.141358
## factor(month)11 0.13935 0.13731 1.015 0.310183
## factor(month)12 0.18911 0.13549 1.396 0.162805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 120.4 on 32 degrees of freedom
## Residual deviance: 40.9 on 20 degrees of freedom
## AIC: 242.72
##
## Number of Fisher Scoring iterations: 4

From the coefficients we see that the number of accidents decreases with time (year), but also that there is considerable variation over the year, with fewer accidents (than in January) from Feb-Aug and more accidents in Oct-Dec.
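Since the link is the log, the monthly coefficients are multiplicative effects relative to January; exponentiating them (our own addition) makes the interpretation explicit:

## rate ratios for each month relative to January
round(exp(coef(fit)[-(1:2)]), 2)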

b)

In order to predict the outcomes in the last 3 months of 1972 we need the coefficients for Oct-Dec, the slope with respect to year, and the link function (log), i.e.

coef(fit)

## (Intercept) year factor(month)2 factor(month)3
## 269.2544981 -0.1347396 -0.3448405 -0.2127808
## factor(month)4 factor(month)5 factor(month)6 factor(month)7
## -0.3930426 -0.3101549 -0.4700036 -0.2336149
## factor(month)8 factor(month)9 factor(month)10 factor(month)11
## -0.3566749 -0.1431008 0.1987693 0.1393459
## factor(month)12
## 0.1891074

exp(coef(fit)[1] + coef(fit)[2] * 1972 + coef(fit)[11:13])

## factor(month)10 factor(month)11 factor(month)12
## 42.38806 39.94260 41.98048
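Equivalently (a sketch of our own) the built-in predict function gives the same numbers from a new data frame; factor(month) in the formula handles the conversion:

newdat <- data.frame(year = 1972, month = 10:12)
predict(fit, newdata = newdat, type = "response")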

c)

In order to assess the goodness of fit we need the expected number of accidents in each month and year

e <- predict(fit,type="response")

and we can calculate the goodness-of-fit test statistic by

(chi2 <- sum((data$accident-e)^2/e))

## [1] 39.83656

and finally we can calculate the p-value relating to the goodness of fit by

df <- length(e) - length(coef(fit))


## p-value
1 - pchisq(chi2,df=df)

## [1] 0.005238489
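As a cross-check (our own addition) the residual deviance provides a similar goodness-of-fit test:

## deviance-based goodness-of-fit test
1 - pchisq(deviance(fit), df.residual(fit))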

Hence there seems to be a lack of fit; one explanation could be that the dependence on month changes from year to year.
