
1. One of the assumptions for a linear regression is the independence
of observations. Describe the test we used to check this assumption.
[5 marks]

We used the Durbin-Watson test, which assumes the model errors follow
an autoregressive model of order 1,
$$\varepsilon_i = \rho\,\varepsilon_{i-1} + u_i,$$
where the $u_i$ are independent, normal and have constant variance. The
null hypothesis of the DW test is that $\rho = 0$. However, the actual
test statistic is approximately $DW \approx 2(1 - \hat\rho)$, thus ranging
between 0 and 4. If $DW \approx 2$ then we can assume independence,
$DW \approx 4$ is equivalent to negative autocorrelation, and
$DW \approx 0$ is equivalent to positive autocorrelation. The test
provides bounds, depending on the confidence level and the number of
explanatory variables, which give a meaning to the $\approx$ sign.
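
As an illustration beyond the original answer, a minimal R sketch of
this check, assuming the lmtest package is installed (the built-in
cars data stand in for an arbitrary fitted regression):

    ## Durbin-Watson test for first-order autocorrelation of the errors.
    library(lmtest)                       # assumed to be installed
    fit <- lm(dist ~ speed, data = cars)  # stand-in model
    dwtest(fit)                           # H0: rho = 0 (independent errors)

    ## The statistic computed directly from the residuals:
    e <- residuals(fit)
    sum(diff(e)^2) / sum(e^2)             # approximately 2 * (1 - rho_hat)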

2. The data set peru.df contains information for 39 Peruvian Indians.
For each individual five measurements have been taken:

BP: systolic blood pressure (mm Hg)

age: age of the individual (years)

years: years since the individual migrated from the mountains

weight: weight (kg)

height: height (m)

It has been suspected that the blood pressure of Peruvian Indians
strongly increases after they leave the mountains and then gradually
decreases the longer they stay at lower altitudes. We fit a preliminary
model and use APR (all possible regressions) to find a suitable model.
The R output is given below.

  adjRsq    Cp    AIC    BIC      CV age years weight height
1  0.252 8.777 47.777 51.104 402.827   0     0      1      0
2  0.389 1.822 40.822 45.812 356.715   0     1      1      0
3  0.380 3.333 42.333 48.987 374.670   0     1      1      1
4  0.368 5.000 44.000 52.318 405.990   1     1      1      1

Describe the five measures of model quality given above, and provide
the geometric form of the model suggested by all measures. [6 marks]

All measures balance goodness of fit, as measured by the residual sum
of squares (RSS), against model complexity, as measured by the number
of regression coefficients, $p = k + 1$.

adjRsq: the adjusted $R^2$, given by
$$\bar R_p^2 = 1 - \frac{n-1}{n-p}\,(1 - R_p^2),$$
where $R_p^2$ is the classical $R^2$ value. The adjusted $R^2$ is at
most 1 (unlike $R^2$ it can become negative for very poor models), and
we choose the model with the highest value.

Cp: Mallows' $C_p$ is the ratio of the residual sum of squares of the
submodel to the residual mean square of the full model, penalised by
the number of parameters:
$$C_p = \frac{RSS_p}{RMS_{full}} + 2p - n.$$
The most appropriate model minimises $C_p$.

AIC: The Akaike Information Criterion is $C_p$ plus the number of data
points, $n$. Again, minimal is best.

BIC: The Bayesian Information Criterion penalises the number of
parameters more strongly than AIC and thus prefers simpler models:
$$BIC = \frac{RSS_p}{RMS_{full}} + p \log(n).$$
Again, the minimal value indicates the best model.

CV: Cross-validation. The data are split into a training set and a
test set. Goodness of fit is measured by the prediction error of the
model fitted to the training set when predicting the test set. Once
again, we want to minimise this error. A commonly used form is 10-fold
cross-validation, in which each tenth of the data serves as the test
set in turn.
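
As an illustration beyond the original answer, a minimal sketch of
10-fold cross-validation in base R, assuming peru.df is loaded (the
fold assignment and the squared-error measure are choices made here,
not part of the original output):

    ## 10-fold cross-validation of the prediction error.
    set.seed(1)                                 # reproducible fold split
    n     <- nrow(peru.df)
    folds <- sample(rep(1:10, length.out = n))  # assign each row to a fold

    cv_error <- 0
    for (f in 1:10) {
      test  <- peru.df[folds == f, ]
      train <- peru.df[folds != f, ]
      fit   <- lm(BP ~ years + weight, data = train)
      ## accumulate the squared prediction error on the held-out fold
      cv_error <- cv_error + sum((test$BP - predict(fit, newdata = test))^2)
    }
    cv_error                                    # total CV prediction error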

All measures prefer model 2, given as:

lm(BP ~ years + weight).
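
For illustration, such an all-possible-regressions table can be
produced along the following lines, assuming the leaps package is
installed and peru.df is loaded (note that leaps reports BIC on a
different scale than the output above):

    ## All possible regressions with the 'leaps' package.
    library(leaps)
    aprs <- regsubsets(BP ~ age + years + weight + height, data = peru.df)
    s    <- summary(aprs)

    ## Best subset of each size with adjusted R^2, Mallows' Cp and BIC.
    cbind(s$which, adjRsq = s$adjr2, Cp = s$cp, BIC = s$bic)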

3. We continue with the preferred model and check for outliers using the
R function influence.measures. The output is given below.

      dfb.1_  dfb.yers dfb.wght  dffit cov.r   cook.d    hat
1  -0.846011 -1.099890  1.02239 1.3338 0.712 4.96e-01 0.1805
5  -0.027798 -0.066052  0.04064 0.0771 1.208 2.03e-03 0.1031
8   0.008411  0.007793 -0.00878 0.0105 1.325 3.75e-05 0.1790
38  0.024567  0.640509 -0.10510 0.7213 1.239 1.70e-01 0.2369
39 -0.440644  0.165194  0.40156 0.5953 1.581 1.19e-01 0.3494

Note: qf(0.5,3,36)=0.887

Describe each of the measures given, and state when the associated
value becomes critical. For each point identify which measures become
critical (if any). Are any of the points influential? [6 marks]

The first six values are based on the leave-one-out principle,
indicating how the coefficients and the fit change when an observation
is removed from the analysis. The first three columns show the effects
on the coefficients (the $b_j$) and are given by
$$\mathrm{dfb.j} = \frac{b_j - b_{j[i]}}{se(b_j)}.$$
Observations for which the absolute value of dfb.j exceeds 1 are
considered influential for this coefficient. The fourth column looks
at the fitted values:
$$\mathrm{dffit} = \frac{\hat y_i - \hat y_{i[i]}}{se(\hat y_i)}.$$
dffit becomes critical if its absolute value exceeds
$3\sqrt{(k+1)/(n-k-1)} = 3\sqrt{3/36} = 0.866$.

The fifth column is the covariance ratio, which measures the change in
the standard errors of the estimated coefficients and becomes critical
when it is smaller than $1 - 3(k+1)/n = 0.769$ or larger than
$1 + 3(k+1)/n = 1.231$.

The sixth column is Cook's D, which measures the overall change in the
coefficients and is problematic if it exceeds the median of the
$F$-distribution with $k+1$ and $n-k-1$ degrees of freedom, in this
case the given 0.887.

The final column contains the hat values, the diagonal entries of the
hat matrix. These values measure the influence that the $i$-th
observation has on the $i$-th fitted value. A value is considered
extreme if it exceeds $3(k+1)/n = 0.231$; observations exceeding this
bound are high-leverage points.

Observation 1 is critical for dfb.yers, dfb.wght, dffit and cov.r.
Observations 38 and 39 have high leverage. All but observation 5 are
considered extreme for the covariance ratio. No observation exceeds
the Cook's D bound of 0.887, but observation 1, being flagged by four
measures, is the one point we would regard as influential.
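
A minimal R sketch beyond the original answer, reproducing these
diagnostics and the critical bounds used above, assuming peru.df is
loaded:

    ## Influence diagnostics for the preferred model.
    fit <- lm(BP ~ years + weight, data = peru.df)
    n   <- nrow(peru.df)
    k   <- 2                          # number of explanatory variables

    im <- influence.measures(fit)     # dfbetas, dffits, cov.r, Cook's D, hat
    summary(im)                       # rows flagged as potentially influential

    ## The critical bounds from the answer above:
    abs(dfbetas(fit)) > 1                                  # coefficients
    abs(dffits(fit))  > 3 * sqrt((k + 1) / (n - k - 1))    # fitted values
    covratio(fit) < 1 - 3 * (k + 1) / n |
      covratio(fit) > 1 + 3 * (k + 1) / n                  # covariance ratio
    cooks.distance(fit) > qf(0.5, k + 1, n - k - 1)        # Cook's D
    hatvalues(fit) > 3 * (k + 1) / n                       # leverage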

4. Finally, we want to test for normality of the residuals. As the
measure of interest we choose the Weisberg-Bingham test. What exactly
is this test measuring? Which range of values does it take? [3 marks]

The Weisberg-Bingham test statistic is the squared correlation of the
normal plot, i.e. of the ordered residuals against the corresponding
normal quantiles. It measures how straight the plot is. Its values lie
between 0 and 1, and values close to 1 indicate normality. The null
hypothesis of the test is normality, i.e. $WB = 1$. The test is a
variant of the Shapiro-Wilk test.
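
As an illustration beyond the original answer, the statistic can be
computed directly from this definition; the plotting positions via
ppoints() are an assumption about the exact variant used:

    ## Weisberg-Bingham statistic: squared correlation of the normal plot.
    fit <- lm(BP ~ years + weight, data = peru.df)
    e   <- residuals(fit)

    q_theo <- qnorm(ppoints(length(e)))  # theoretical normal quantiles
    WB     <- cor(sort(e), q_theo)^2     # squared Q-Q correlation

    WB   # close to 1 supports normality; compare with shapiro.test(e)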
