Multivariate Regression (3)

Problems with Linear Models

• Typical problems in linear models

▶ non-linear response-predictor relation

▶ correlation of error terms

▶ non-constant variance of error terms

▶ outliers

▶ high-leverage points

▶ collinearity

. . . Problems with Linear Models

• Non-linear response-predictor relation

▶ check residual plot vs fitted values

▶ solution: add interaction terms / transform predictors and/or response, eg the Box-Cox transform

f_λ(x) = (x^λ − 1)/λ   if λ ≠ 0
f_λ(x) = log x          if λ = 0

• Correlation of error terms

▶ may result in underestimation of the true SEs ⇝ CIs that are too narrow

▶ test for auto-correlation in residuals (eg Durbin-Watson; see the sketch below)

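Both checks can be run in R. A minimal sketch, assuming the house-price data live in a data frame called houses (a hypothetical name) and that the add-on package lmtest is installed; MASS ships with R:

```r
library(MASS)    # provides boxcox()
library(lmtest)  # assumed installed; provides dwtest()

fit <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)

# Box-Cox: plot the profile log-likelihood over lambda and pick a value
# near its maximum as the exponent for transforming the response
boxcox(fit, lambda = seq(-2, 2, by = 0.1))

# Durbin-Watson test for first-order autocorrelation in the residuals
dwtest(fit)
```
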
. . . Problems with Linear Models

[Figure: "Residuals vs Fitted" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

• Non-constant variance of error terms

▶ heteroskedasticity will affect SEs ⇝ CIs, tests of hypotheses

▶ check the plot of (standardized) residuals vs fitted values (scale-location plot); look for changes in the magnitude of the residuals

▶ transform the response variable / weight observations (weighted least squares)

• Outliers

▶ actual response value far from the predicted value

▶ remove the observation or not? it can affect CIs, hypothesis tests

▶ check residual plots (see the sketch below)

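The residual plots on the surrounding slides are R's built-in diagnostics for lm objects. A minimal sketch, reusing fit from the sketch above:

```r
# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
# which = 3: Scale-Location,      5: Residuals vs Leverage
par(mfrow = c(2, 2))  # arrange the four plots on one device
plot(fit, which = c(1, 2, 3, 5))
```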

. . . Problems with Linear Models

[Figure: "Scale-Location" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

[Figure: "Normal Q-Q" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

• High-leverage points

▶ unusual values of the regressors

▶ difficult to spot when p > 1 ⇝ check the residuals vs leverage plot

▶ influential observations: unusual covariates & response ⇝ will weigh heavily on the regression

• Collinearity

▶ two or more covariates are close to linearly dependent (highly correlated)

▶ misleading results

▶ affects the accuracy of parameter estimates ⇝ t stats (a VIF sketch follows)

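A standard collinearity diagnostic is the variance inflation factor VIF_j = 1/(1 − R_j²), where R_j² is the R² from regressing covariate j on the other covariates. A minimal base-R sketch for one covariate (car::vif(fit) computes all of them at once):

```r
# VIF for Taxes: regress it on the remaining covariates
aux <- lm(Taxes ~ Beds + Baths + New + Size, data = houses)
1 / (1 - summary(aux)$r.squared)  # rule of thumb: values above ~5-10 are worrying
```
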
. . . Problems with Linear Models

[Figure: "Residuals vs Leverage" diagnostic plot with Cook's distance contours for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

Model Building

• What variables to use?

▶ quantitative

▶ categorical

▶ interactions ⇝ joint effects

▶ transformation of variables

• Is the model good?

▶ statistical significance

▶ practical significance (effect size)

▶ simplicity vs complexity

▶ model assumptions

. . . Model Building

• By including more explanatory variables, R² always increases

▶ even if they are not significant! (a quick demonstration is sketched below)

▶ R² is not a good criterion for model selection

▶ keeping non-significant explanatory variables may not improve predictions

▶ . . . and makes the model unnecessarily complex

▶ highly correlated covariates may be significant ⇝ causality vs association

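A quick demonstration of the first point: appending a covariate of pure noise cannot decrease R², even though it is useless for prediction. A sketch, again reusing houses and fit:

```r
set.seed(1)
houses$noise <- rnorm(nrow(houses))  # pure noise, unrelated to Price
fit2 <- update(fit, . ~ . + noise)

summary(fit)$r.squared   # R^2 of the original model
summary(fit2)$r.squared  # never smaller, despite the meaningless covariate
```
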
Causality vs Association

• Design of a statistical investigation:

▶ experimental: usually randomized, the researcher decides the values of the variables to be assigned to the statistical units (eg blood coagulation example) ⇝ causality can be proved

▶ observational: variables are observed without the researcher's intervention ⇝ usually the case in social sciences (eg house prices example) ⇝ only association between variables can be established

. . . Causality vs Association

• Example. In a study, the response variable prestige is related to education and income

▶ education is causal for both income and prestige

▶ the association between prestige and income stems from the common
prior education

▶ the association between prestige and income is spurious (not causal)

▶ education is a confounding variable ⇝ need to control for it

. . . Model Building

• With p explanatory variables (number of parameters ≥ p),

▶ including/omitting each covariate: 2^p possible models

▶ large number of parameters due to interactions/categorical variables

▶ many different ways to select a model exist

• What we know:

▶ t-test: remove one variable

▶ F-test: remove a group of variables (even all)

▶ R², R²_adj: compare models

. . . Model Building

• Best subset selection: breaks up model selection by number of covariates (sketched below)

1 M_0 is the model with no covariates

2 for l = 1, . . . , p, fit all (p choose l) models with l covariates and select the best model M_l (eg largest R²) among them

3 choose the best overall model among M_0, M_1, . . . , M_p using a criterion that accounts for model size (eg largest R²_adj, or smallest AIC or BIC)

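A hedged sketch of best subset selection in R, assuming the add-on package leaps is installed:

```r
library(leaps)  # assumed installed; provides regsubsets()

# Best model of each size l = 1, ..., 5, then compare across sizes
best <- regsubsets(Price ~ Taxes + Beds + Baths + New + Size,
                   data = houses, nvmax = 5)
s <- summary(best)
s$adjr2           # adjusted R^2 of the best model of each size
which.min(s$bic)  # size of the best overall model according to BIC
```
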
. . . Model Building

• Two frequently used metrics for model comparison ⇝ trade-off between goodness-of-fit and model complexity

• Parametric model M with d parameters, fitted with maximized log-likelihood ℓ̂ on a sample of size n

▶ Akaike information criterion (AIC)

AIC(M) = 2d − 2ℓ̂

▶ Bayesian information criterion (BIC), aka Schwarz information criterion

BIC(M) = log(n)·d − 2ℓ̂

• Prefer the model with the lower metric (see the sketch below)

• Models need not be nested!

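Both criteria are built into R. A minimal sketch comparing two (not necessarily nested) candidate models:

```r
fit_a <- lm(Price ~ Taxes + Size + New, data = houses)
fit_b <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)

AIC(fit_a, fit_b)  # 2d - 2*loglik for each model; lower is better
BIC(fit_a, fit_b)  # log(n)*d - 2*loglik; penalizes complexity more heavily
```
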
. . . Model Building

• Backward search:

1 fit the model with all p covariates

2 remove one of the p regressors according to some criterion (eg largest p-value)

3 if the current model has s covariates, remove one of them according to some criterion (eg largest p-value)

4 continue (3) until a stopping rule is reached (eg the largest p-value falls below a threshold); see the sketch below

• May not be feasible if the number of parameters of the full model exceeds n

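R's built-in step() performs backward elimination, using AIC rather than p-values as its criterion. A minimal sketch:

```r
full <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)
step(full, direction = "backward")  # drops one term at a time while AIC improves
```
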
. . . Model Building

• Forward search

1 fit the model with no covariates

2 fit the p models with one covariate and choose one according to some criterion (eg largest R²)

3 if the current model has s covariates, fit the p − s models adding one of the excluded p − s regressors; choose one according to some criterion (eg largest R²)

4 continue (3) until a stopping rule is satisfied (eg R² greater than a given threshold)

. . . Model Building

• Stepwise regression algorithm: combine backward and forward ⇝ variables can be both inserted and excluded (see the sketch below)

1 fit the model with no covariates

2 fit the p models with one covariate and choose one according to some criterion (eg smallest p-value, provided it is smaller than a threshold)

3 if the current model has s ≥ 1 covariates, fit the p − s models adding one of the excluded p − s regressors; choose one according to some criterion (eg smallest p-value, provided it is smaller than a threshold)

4 from the model in (3), remove one variable according to some criterion (eg largest p-value, provided it is larger than a threshold)

5 repeat steps (3) and (4) until all possible additions and deletions are performed

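step() also covers forward and stepwise search; starting from the empty model it needs a scope of candidate terms. A sketch:

```r
null <- lm(Price ~ 1, data = houses)
step(null,
     scope = ~ Taxes + Beds + Baths + New + Size,
     direction = "both")  # each step considers both adding and dropping a term
# direction = "forward" gives pure forward search from the same starting model
```
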
. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 1st step: use one covariate

model   new variable   p-value
size    size           5.00 × 10⁻²⁷
tax     tax            5.17 × 10⁻²⁸
beds    beds           5.00 × 10⁻⁵
baths   baths          1.59 × 10⁻⁹
new     new            6.60 × 10⁻⁷

▶ select tax

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 2nd step: add a second covariate

model         new variable   p-value
tax + size    size           1.15 × 10⁻⁶
tax + beds    beds           0.9163
tax + baths   baths          0.1915
tax + new     new            0.0020

▶ select size

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 3rd step: add a third covariate

model                new variable   p-value
tax + size + beds    beds           0.0688
tax + size + baths   baths          0.6294
tax + size + new     new            0.0058

▶ select new

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 4th step: add a fourth covariate

model                      new variable   p-value
tax + size + new + beds    beds           0.1939
tax + size + new + baths   baths          0.6545

▶ no p-value is below the threshold ⇝ stop the algorithm (an add1()-based sketch follows)

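Tables like the ones above can be produced with add1(), which fits every one-term addition and reports an F test; its p-value is the criterion used here. A hedged sketch (covariate names as in the lm calls above):

```r
scope <- ~ Taxes + Beds + Baths + New + Size
cur <- lm(Price ~ 1, data = houses)
add1(cur, scope = scope, test = "F")  # 1st step: Taxes has the smallest p-value

cur <- update(cur, . ~ . + Taxes)
add1(cur, scope = scope, test = "F")  # 2nd step: Size wins
# repeat until no addition has a p-value below the 5% threshold
```
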
. . . Model Building

• Automatic selection

▶ useful when p is large

▶ a computer can execute it very rapidly

▶ different algorithms can lead to conflicting results

▶ creates an incentive not to use subjective information

▶ ⇝ ideally combine judgement and problem knowledge with automatic selection

Cross-validation

• How good is a model's prediction? Need to test on a data set different from the one used for fitting

▶ train set: used to fit the model(s)

▶ test set: check the prediction of the trained model(s) and choose one

▶ validation set: assess chosen model on unused data

▶ production set: use model for actual work

. . . Cross-validation

• Cross-validation (CV):

▶ separate available data into train and test subsets

▶ calculate prediction error on test

▶ repeat by changing the subset

• Most common criterion: predicted (or test) mean squared error: for each test set compute avg. (actual − predicted)², then eg average across the test sets

. . . Cross-validation

• Leave-one-out cross-validation (LOOCV): for each i = 1, . . . , n

▶ remove observation i, (y_i, x_i1, . . . , x_ip), from the sample

▶ fit the model on the remaining data (n − 1 obs.)

▶ use the fitted model to predict the response at (x_i1, . . . , x_ip) ⇝ ŷ_(i)

▶ calculate (y_i − ŷ_(i))²

• Test MSE approximated by (1/n) PRESS = (1/n) Σ_{i=1}^n (y_i − ŷ_(i))²

• Requires fitting n models! (see the sketch below)

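A base-R sketch of LOOCV for the house-price model, following the steps above literally (one refit per observation):

```r
n <- nrow(houses)
press <- 0
for (i in 1:n) {
  fit_i  <- lm(Price ~ Taxes + Beds + Baths + New + Size,
               data = houses[-i, ])                # fit without observation i
  yhat_i <- predict(fit_i, newdata = houses[i, ])  # predict the held-out response
  press  <- press + (houses$Price[i] - yhat_i)^2
}
press / n  # LOOCV estimate of the test MSE
```
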
. . . Cross-validation

• k-fold cross-validation: divide the sample into k subsets ("folds") of the same size h

▶ remove fold l = 1, . . . , k from the sample

▶ fit the model on the remaining k − 1 folds (n − h obs.)

▶ use the fitted model to predict the response on the excluded fold's observations

▶ calculate the test MSE_l

▶ repeat by excluding a different fold

• Test MSE approximated by (1/k) Σ_{l=1}^k MSE_l

• Typical choices: k = 5, k = 10

• Folds chosen randomly or based on the data (see the sketch below)

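A base-R sketch of k-fold CV with randomly assigned folds:

```r
k <- 5
set.seed(2)
fold <- sample(rep(1:k, length.out = nrow(houses)))  # random fold labels
mse <- numeric(k)
for (l in 1:k) {
  fit_l  <- lm(Price ~ Taxes + Beds + Baths + New + Size,
               data = houses[fold != l, ])           # fit on the other k - 1 folds
  pred   <- predict(fit_l, newdata = houses[fold == l, ])
  mse[l] <- mean((houses$Price[fold == l] - pred)^2)
}
mean(mse)  # k-fold estimate of the test MSE
```
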
. . . Cross-validation

• Predicted Sum of Squares (PRESS)

PRESS = Σ_{i=1}^n (y_i − ŷ_(i))²

where ŷ_(i) is the prediction for observation i from the model fitted without it

▶ small PRESS ⇝ good predictive performance of a model

▶ compare alternative models based on the PRESS value: house prices example (a computational shortcut is sketched below the table)

covariates                     R²      R²_adj   PRESS
size, tax, beds, baths, new    0.793   0.782    2.91
size, tax, beds, new           0.793   0.785    2.85
size, tax, new                 0.790   0.783    2.67

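For a linear model PRESS needs no refitting: the leave-one-out residual satisfies y_i − ŷ_(i) = e_i / (1 − h_ii), where e_i is the ordinary residual and h_ii the leverage. A minimal sketch, reusing fit:

```r
e <- residuals(fit)
h <- hatvalues(fit)
sum((e / (1 - h))^2)  # PRESS without fitting n separate models
```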