
02418 Week 4, solution

Exercise 4.38

We look at sulphur dioxide data from 41 cities.

data <- read.table("exercise4_38.dat",header=FALSE)[[1]]


qqnorm(data)

[Figure: Normal Q-Q plot of the data (Sample Quantiles against Theoretical Quantiles).]

The data is clearly not Gaussian, and one approach is to use a data transformation. We start with an exploratory approach using the Box-Cox transformation $y^{(\lambda)} = (y^\lambda - 1)/\lambda$.

## Exploratory
par(mfrow=c(2,2))
qqnorm(data)
qqnorm((sqrt(data)-1)/0.5) ## lambda=0.5
qqnorm(log(data)) ## lambda=0
qqnorm((1/sqrt(data)-1)/(-0.5)) ## lambda=-0.5

[Figure: Normal Q-Q plots of the raw data and of the transformed data for λ = 0.5, λ = 0 (log) and λ = −0.5.]

## Log seems to be a good option
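For convenience the transformation, including the λ = 0 special case, can be wrapped in a small helper (our own sketch; the name bc.trans is our choice):

bc.trans <- function(y, lambda){
    if(lambda == 0) log(y) else (y^lambda - 1) / lambda
}
qqnorm(bc.trans(data, 0)) ## reproduces the log panel above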

Using the library MASS we can get a profile likelihood of λ directly

## box-cox profile likelihood
library(MASS)
boxcox(lm(data~1))

[Figure: profile log-likelihood of λ from boxcox, with the 95% interval indicated.]

Again the log transformation is a good option, but λ = −0.5 would also be reasonable.


If we want to program the likelihood ourselves we need to take into account that we change the data when choosing a new λ, i.e. we apply the change of variables formula for random variables (see p. 102 of our textbook). The log-likelihood is given by

$$l(\lambda, \mu, \sigma^2) = \sum_i \left[ \log p\left(y_i^{(\lambda)}\right) + (\lambda - 1) \log(y_i) \right] \qquad (1)$$

where p is the density (here the normal density) and the last term comes from the change of variables, i.e.

$$f(y) = p\left(y^{(\lambda)}\right) \left| \frac{\partial y^{(\lambda)}}{\partial y} \right| = p\left(y^{(\lambda)}\right) y^{\lambda - 1} \qquad (2)$$

taking the log and summing over all observations gives the log-likelihood.
For each fixed λ we can find the optimal values of µ and σ² by optimising the likelihood with respect to µ and σ². These are given by

$$\hat{\mu}(\lambda) = \frac{1}{n} \sum_i y_i^{(\lambda)} \qquad (3)$$

$$\hat{\sigma}^2(\lambda) = \frac{1}{n} \sum_i \left( y_i^{(\lambda)} - \hat{\mu}(\lambda) \right)^2 \qquad (4)$$

hence we do not need the usual inner optimisation, and the Box-Cox profile log-likelihood can be implemented by

ll.lambda <- function(lambda,y){
    n <- length(y)                        ## number of obs
    y.lambda <- (y^(lambda)-1)/(lambda)   ## Transform data
    if(lambda==0){y.lambda <- log(y)}     ## lambda=0 is a special case
    yl.bar <- mean(y.lambda)              ## Estimate of mu.hat(lambda)
    sigmasq <- mean((y.lambda-yl.bar)^2)  ## Estimate of sigma.hat^2(lambda)
    ## profile log-likelihood
    - n/2 * log(sigmasq) - n/2 + (lambda-1)*sum(log(y))
}
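As a sanity check (our own addition; the use of optimize and the search interval are our choices) the maximiser of this profile log-likelihood should agree with the peak of the boxcox plot above:

## find the lambda maximising our profile log-likelihood
opt.lambda <- optimize(ll.lambda, interval = c(-1, 1),
                       y = data, maximum = TRUE)
opt.lambda$maximum ## should match the maximum of the boxcox curve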

and finally we can plot the profile log-likelihood

## Which lambdas to use


lambda <- seq(-1,0.5,by=0.01)

## profile likelihood
ll <- sapply(lambda, ll.lambda, y=data)

## Plot
plot(lambda, ll-max(ll),type="l",ylim=c(-5,0))
lines(range(lambda), -qchisq(0.95,df=1)/2 * c(1,1),lty=2)

[Figure: normalised profile log-likelihood ll − max(ll) against λ, with the 95% cut-off shown as a dashed line.]
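The 95% confidence limits for λ can also be extracted numerically (a sketch of our own; uniroot and the bracketing intervals are our choices) as the points where the profile log-likelihood crosses the cut-off:

## solve ll(lambda) = max(ll) - qchisq(0.95, df=1)/2 on each side
cutoff <- max(ll) - qchisq(0.95, df = 1) / 2
f <- function(l){ ll.lambda(l, y = data) - cutoff }
lam.hat <- lambda[which.max(ll)] ## lambda at the profile maximum
c(uniroot(f, c(min(lambda), lam.hat))$root,
  uniroot(f, c(lam.hat, max(lambda)))$root)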

Exercise 6.5

We start by plotting the data

data <- read.table("exercise6_5.dat",header=TRUE)


plot(data)

[Figure: scatter plot of y against x (dose).]

It is quite clear that a linear model in x (dose) will not work, but let's have a look at the residuals anyway

fit1 <- lm(y~x,data=data)


par(mfrow=c(1,2))
qqnorm(fit1$residuals)
qqline(fit1$residuals)

plot(fitted(fit1),fit1$residuals)

[Figure: Normal Q-Q plot of the residuals (left) and residuals against fitted values (right) for the linear model.]

Even though the distributional assumption is fine, we see quite clear patterns in the residuals vs. fitted values. We now implement the non-linear model proposed in the exercise, which can be formulated as

$$Y_i \sim N(\mu_i, \sigma^2) \qquad (5)$$

with $\mu_i$ given by

$$\mu_i(x) = \frac{\beta_0}{1 + e^{-\beta_1 (x_i - \mu)}} \qquad (6)$$
The mean value function is implemented by

mu.i <- function(beta0, beta1, mu, x){
    beta0 / (1 + exp(- beta1 * (x - mu)))
}

and the negative log-likelihood is implemented by

nll <- function(theta,data){
m <- mu.i(theta[1],theta[2],theta[3],data$x)
- sum(dnorm(data$y, mean = m, sd = sqrt(theta[4]),
log = TRUE))
}

Before we optimise the parameters we should choose some reasonable initial values: β0 is the maximum of µ(x), hence a reasonable initial guess is β0 = max(y); µ is the value where the mean value function attains half its maximum, which from the plot is around x = 60. β1 and σ² are chosen more arbitrarily, but with β1 < 0. We now find the optimal parameters

opt <- nlminb(c(max(data$y),-1,60,1), nll,
              lower = c(-Inf,-Inf,-Inf,0), data=data)

We can compare the two models by calculating the AIC for each

AIC(fit1)

## [1] 23.40353

2 * opt$objective + 2 * 4

## [1] 19.20323

Hence we should choose the non-linear model.


We calculate the standard errors in the usual way

library(numDeriv)
H <- hessian(nll, opt$par, data=data)
(se.theta <- sqrt(diag(solve(H)))) ## Standard errors

## [1] 0.803290930 0.003852883 6.130495493 0.060800704

opt$par ## Parameters

## [1] 10.98598626 -0.02956138 65.26138265 0.14893068

We see that a 95% confidence interval would not cover 0 for any of the parameters.
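As a quick tabulation (our own addition, using the usual quadratic approximation with plus/minus two standard errors):

## approximate 95% Wald intervals for all parameters (sketch)
cbind(estimate = opt$par,
      lower = opt$par - 2 * se.theta,
      upper = opt$par + 2 * se.theta)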
For the profile likelihood confidence interval of β1 we need the (negative) profile log-likelihood

nllp <- function(beta1,data){
fun.tmp <- function(theta,beta1,data){
nll(c(theta[1],beta1,theta[2:3]),data)
}
nlminb(c(max(data$y), 60, 1), fun.tmp,
lower = c(-Inf, -Inf, 0),
beta1 = beta1, data = data)$objective
}

which we can plot by

beta1 <- opt$par[2] + seq(-3 * se.theta[2], 3 * se.theta[2],
                          length=100)    ## range of beta1
nllprofile <- sapply(beta1, nllp,
                     data=data)          ## profile log-likelihood
plot(beta1, exp(-(nllprofile-opt$objective)),
     type = "l")                         ## Plot
lines(range(beta1), exp(-qchisq(0.95, df = 1) / 2) * c(1, 1),
      lty = 2, col = 4)                  ## Cut off
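The limits of the profile likelihood interval for β1 can be found numerically in the same way (a sketch of our own; uniroot and the brackets are our choices):

## nllp crosses opt$objective + qchisq(0.95, df=1)/2 at the limits
cutoff <- opt$objective + qchisq(0.95, df = 1) / 2
g <- function(b1){ nllp(b1, data = data) - cutoff }
c(uniroot(g, c(min(beta1), opt$par[2]))$root,
  uniroot(g, c(opt$par[2], max(beta1)))$root)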

[Figure: normalised profile likelihood exp(−(nllprofile − opt$objective)) against β1, with the 95% cut-off shown as a dashed line.]

The Wald confidence interval is given by

opt$par[2] + 2 * se.theta[2] * c(-1,1)

## [1] -0.03726715 -0.02185562

which is quite close to the profile likelihood confidence interval.


Finally, we look at the model together with the data

x <- seq(0,150)
plot(x, mu.i(opt$par[1], opt$par[2], opt$par[3], x),
lty = 2, col = 2, type = "l")
points(data)

[Figure: fitted mean function mu.i against x, with the observed data points overlaid.]

and the residuals

par(mfrow=c(1,2))
res <- data$y-mu.i(opt$par[1],opt$par[2],opt$par[3],data$x)
qqnorm(res)
qqline(res)
plot(mu.i(opt$par[1],opt$par[2],opt$par[3],data$x),res)

[Figure: Normal Q-Q plot of the residuals (left) and residuals against fitted means (right) for the non-linear model.]

There are still some systematic effects that could be investigated further, but the magnitude of the residuals is smaller in the non-linear model.

Exercise 6.14

We start by taking a look at the data

data <- read.table("exercise6_14.dat",header=TRUE)


plot(accident ~ month, pch=19,col=data$year-1969,data=data)

[Figure: accidents against month, with points coloured by year.]

From the data we see that there are fewer accidents in later years, and there is also a clear seasonal variation.

a)

Based on the plot above it is reasonable to try a model with year as a linear effect and month as a factor (i.e. a separate parameter for each month).

fit <- glm(accident~ year + factor(month), data = data, family = poisson)


summary(fit)

##
## Call:
## glm(formula = accident ~ year + factor(month), family = poisson,
## data = data)
##

## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93674 -0.49986 0.03137 0.65552 2.25070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 269.25450 78.92569 3.411 0.000646 ***
## year -0.13474 0.04005 -3.365 0.000766 ***
## factor(month)2 -0.34484 0.14177 -2.432 0.014998 *
## factor(month)3 -0.21278 0.13654 -1.558 0.119138
## factor(month)4 -0.39304 0.14380 -2.733 0.006272 **
## factor(month)5 -0.31015 0.14035 -2.210 0.027110 *
## factor(month)6 -0.47000 0.14720 -3.193 0.001408 **
## factor(month)7 -0.23361 0.13733 -1.701 0.088921 .
## factor(month)8 -0.35667 0.14226 -2.507 0.012169 *
## factor(month)9 -0.14310 0.13397 -1.068 0.285460
## factor(month)10 0.19877 0.13515 1.471 0.141358
## factor(month)11 0.13935 0.13731 1.015 0.310183
## factor(month)12 0.18911 0.13549 1.396 0.162805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 120.4 on 32 degrees of freedom
## Residual deviance: 40.9 on 20 degrees of freedom
## AIC: 242.72
##
## Number of Fisher Scoring iterations: 4

From the coefficients we see that the number of accidents decreases with time (year), but also that there is considerable variation over the year, with fewer accidents (than in January) from Feb-Aug and more accidents in Oct-Dec.
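Since the link is the log, the monthly coefficients are multiplicative effects relative to January; exponentiating them (our own addition) makes the interpretation explicit:

## rate ratios for each month relative to January
round(exp(coef(fit)[-(1:2)]), 2)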

b)

In order to predict the outcomes in the last 3 months of 1972 we need the coefficients for Oct-Dec, the slope with respect to year, and the link function (log), i.e.

coef(fit)

## (Intercept) year factor(month)2 factor(month)3
## 269.2544981 -0.1347396 -0.3448405 -0.2127808
## factor(month)4 factor(month)5 factor(month)6 factor(month)7
## -0.3930426 -0.3101549 -0.4700036 -0.2336149
## factor(month)8 factor(month)9 factor(month)10 factor(month)11
## -0.3566749 -0.1431008 0.1987693 0.1393459
## factor(month)12
## 0.1891074

exp(coef(fit)[1] + coef(fit)[2] * 1972 + coef(fit)[11:13])

## factor(month)10 factor(month)11 factor(month)12
## 42.38806 39.94260 41.98048
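Equivalently (a sketch of our own) the built-in predict function gives the same numbers from a new data frame; factor(month) in the formula handles the conversion:

newdat <- data.frame(year = 1972, month = 10:12)
predict(fit, newdata = newdat, type = "response")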

c)

In order to assess the goodness of fit we need the expected number of accidents in each month and year

e <- predict(fit,type="response")

and we can calculate the goodness-of-fit test statistic by

(chi2 <- sum((data$accident-e)^2/e))

## [1] 39.83656

and finally we can calculate the p-value relating to the goodness of fit by

df <- length(e) - length(coef(fit))


## p-value
1 - pchisq(chi2,df=df)

## [1] 0.005238489
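As a cross-check (our own addition) the residual deviance provides a similar goodness-of-fit test:

## deviance-based goodness-of-fit test
1 - pchisq(deviance(fit), df.residual(fit))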

Hence there seems to be a lack of fit; one explanation could be that the dependence on month changes from year to year.
