Multivariate Regression (3)

Problems with Linear Models

• Typical problems in linear models

▶ non-linear response-predictor relation

▶ correlation of error terms

▶ non-constant variance of error terms

▶ outliers

▶ high-leverage points

▶ collinearity

. . . Problems with Linear Models

• Non-linear response-predictor relation

▶ check residual plot vs fitted values

▶ solution: add interaction terms / transform predictors and/or response, eg the Box-Cox transform

f_λ(x) = (x^λ − 1)/λ   if λ ≠ 0
f_λ(x) = log x          if λ = 0

• Correlation of error terms

▶ may result in underestimation of the true SEs ⇝ CIs that are too narrow

▶ test for auto-correlation in residuals (eg Durbin-Watson; see the sketch below)

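Both checks can be run in R. A minimal sketch, assuming the house-price data live in a data frame called houses (a hypothetical name) and that the add-on package lmtest is installed; MASS ships with R:

```r
library(MASS)    # provides boxcox()
library(lmtest)  # assumed installed; provides dwtest()

fit <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)

# Box-Cox: plot the profile log-likelihood over lambda and pick a value
# near its maximum as the exponent for transforming the response
boxcox(fit, lambda = seq(-2, 2, by = 0.1))

# Durbin-Watson test for first-order autocorrelation in the residuals
dwtest(fit)
```
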
. . . Problems with Linear Models

[Figure: "Residuals vs Fitted" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

• Non-constant variance of error terms

▶ heteroskedasticity will affect SEs ⇝ CIs, tests of hypotheses

▶ check the plot of (standardized) residuals vs fitted values (scale-location plot); look for changes in the magnitude of the residuals

▶ transform the response variable / weight observations (weighted least squares)

• Outliers

▶ actual response value far from the predicted value

▶ remove the observation or not? it can affect CIs, hypothesis tests

▶ check residual plots (see the sketch below)

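The residual plots on the surrounding slides are R's built-in diagnostics for lm objects. A minimal sketch, reusing fit from the sketch above:

```r
# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
# which = 3: Scale-Location,      5: Residuals vs Leverage
par(mfrow = c(2, 2))  # arrange the four plots on one device
plot(fit, which = c(1, 2, 3, 5))
```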

. . . Problems with Linear Models

[Figure: "Scale-Location" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

[Figure: "Normal Q-Q" diagnostic plot for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

. . . Problems with Linear Models

• High-leverage points

▶ unusual values of the regressors

▶ difficult to spot when p > 1 ⇝ check the residuals vs leverage plot

▶ influential observations: unusual covariates & response ⇝ will weigh heavily on the regression

• Collinearity

▶ two or more covariates are close to linearly dependent (highly correlated)

▶ misleading results

▶ affects the accuracy of parameter estimates ⇝ t stats (a VIF sketch follows)

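A standard collinearity diagnostic is the variance inflation factor VIF_j = 1/(1 − R_j²), where R_j² is the R² from regressing covariate j on the other covariates. A minimal base-R sketch for one covariate (car::vif(fit) computes all of them at once):

```r
# VIF for Taxes: regress it on the remaining covariates
aux <- lm(Taxes ~ Beds + Baths + New + Size, data = houses)
1 / (1 - summary(aux)$r.squared)  # rule of thumb: values above ~5-10 are worrying
```
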
. . . Problems with Linear Models

[Figure: "Residuals vs Leverage" diagnostic plot with Cook's distance contours for lm(Price ~ Taxes + Beds + Baths + New + Size); observations 6, 9 and 64 are flagged.]

Model Building

• What variables to use?

▶ quantitative

▶ categorical

▶ interactions ⇝ joint effects

▶ transformation of variables

• Is the model good?

▶ statistical significance

▶ practical significance (effect size)

▶ simplicity vs complexity

▶ model assumptions

. . . Model Building

• By including more explanatory variables, R² always increases

▶ even if they are not significant! (a quick demonstration is sketched below)

▶ R² is not a good criterion for model selection

▶ keeping non-significant explanatory variables may not improve predictions

▶ . . . and makes the model unnecessarily complex

▶ highly correlated covariates may be significant ⇝ causality vs association

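A quick demonstration of the first point: appending a covariate of pure noise cannot decrease R², even though it is useless for prediction. A sketch, again reusing houses and fit:

```r
set.seed(1)
houses$noise <- rnorm(nrow(houses))  # pure noise, unrelated to Price
fit2 <- update(fit, . ~ . + noise)

summary(fit)$r.squared   # R^2 of the original model
summary(fit2)$r.squared  # never smaller, despite the meaningless covariate
```
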
Causality vs Association

• Design of a statistical investigation:

▶ experimental: usually randomized, the researcher decides the values of the variables to be assigned to the statistical units (eg blood coagulation example) ⇝ causality can be proved

▶ observational: variables are observed without the researcher's intervention ⇝ usually the case in social sciences (eg house prices example) ⇝ only association between variables can be established

. . . Causality vs Association

• Example. In a study, the response variable prestige is related to education and income

▶ education is causal for both income and prestige

▶ the association between prestige and income stems from the common
prior education

▶ the association between prestige and income is spurious (not causal)

▶ education is a confounding variable ⇝ need to control for it

. . . Model Building

• With p explanatory variables (number of parameters ≥ p),

▶ including/omitting each covariate: 2^p possible models

▶ large number of parameters due to interactions/categorical variables

▶ many different ways to select a model exist

• What we know:

▶ t-test: remove one variable

▶ F-test: remove a group of variables (even all)

▶ R², R²_adj: compare models

. . . Model Building

• Best subset selection: breaks up model selection by number of covariates (sketched below)

1 M_0 is the model with no covariates

2 for l = 1, . . . , p, fit all (p choose l) models with l covariates and select the best model M_l (eg largest R²) among them

3 choose the best overall model among M_0, M_1, . . . , M_p using a criterion that accounts for model size (eg largest R²_adj, or smallest AIC or BIC)

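A hedged sketch of best subset selection in R, assuming the add-on package leaps is installed:

```r
library(leaps)  # assumed installed; provides regsubsets()

# Best model of each size l = 1, ..., 5, then compare across sizes
best <- regsubsets(Price ~ Taxes + Beds + Baths + New + Size,
                   data = houses, nvmax = 5)
s <- summary(best)
s$adjr2           # adjusted R^2 of the best model of each size
which.min(s$bic)  # size of the best overall model according to BIC
```
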
. . . Model Building

• Two frequently used metrics for model comparison ⇝ trade-off between goodness-of-fit and model complexity

• Parametric model M with d parameters, fitted with maximized log-likelihood ℓ̂ on a sample of size n

▶ Akaike information criterion (AIC)

AIC(M) = 2d − 2ℓ̂

▶ Bayesian information criterion (BIC), aka Schwarz information criterion

BIC(M) = log(n)·d − 2ℓ̂

• Prefer the model with the lower metric (see the sketch below)

• Models need not be nested!

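Both criteria are built into R. A minimal sketch comparing two (not necessarily nested) candidate models:

```r
fit_a <- lm(Price ~ Taxes + Size + New, data = houses)
fit_b <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)

AIC(fit_a, fit_b)  # 2d - 2*loglik for each model; lower is better
BIC(fit_a, fit_b)  # log(n)*d - 2*loglik; penalizes complexity more heavily
```
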
. . . Model Building

• Backward search:

1 fit the model with all p covariates

2 remove one of the p regressors according to some criterion (eg largest p-value)

3 if the current model has s covariates, remove one of them according to some criterion (eg largest p-value)

4 continue (3) until a stopping rule is reached (eg the largest p-value falls below a threshold); see the sketch below

• May not be feasible if the number of parameters of the full model exceeds n

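R's built-in step() performs backward elimination, using AIC rather than p-values as its criterion. A minimal sketch:

```r
full <- lm(Price ~ Taxes + Beds + Baths + New + Size, data = houses)
step(full, direction = "backward")  # drops one term at a time while AIC improves
```
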
. . . Model Building

• Forward search

1 fit the model with no covariates

2 fit the p models with one covariate and choose one according to some criterion (eg largest R²)

3 if the current model has s covariates, fit the p − s models adding one of the excluded p − s regressors; choose one according to some criterion (eg largest R²)

4 continue (3) until a stopping rule is satisfied (eg R² greater than a given threshold)

. . . Model Building

• Stepwise regression algorithm: combine backward and forward ⇝ variables can be both inserted and excluded (see the sketch below)

1 fit the model with no covariates

2 fit the p models with one covariate and choose one according to some criterion (eg smallest p-value, provided it is smaller than a threshold)

3 if the current model has s ≥ 1 covariates, fit the p − s models adding one of the excluded p − s regressors; choose one according to some criterion (eg smallest p-value, provided it is smaller than a threshold)

4 from the model in (3), remove one variable according to some criterion (eg largest p-value, provided it is larger than a threshold)

5 repeat steps (3) and (4) until all possible additions and deletions are performed

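step() also covers forward and stepwise search; starting from the empty model it needs a scope of candidate terms. A sketch:

```r
null <- lm(Price ~ 1, data = houses)
step(null,
     scope = ~ Taxes + Beds + Baths + New + Size,
     direction = "both")  # each step considers both adding and dropping a term
# direction = "forward" gives pure forward search from the same starting model
```
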
. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 1st step: use one covariate

model   new variable   p-value
size    size           5.00 × 10⁻²⁷
tax     tax            5.17 × 10⁻²⁸
beds    beds           5.00 × 10⁻⁵
baths   baths          1.59 × 10⁻⁹
new     new            6.60 × 10⁻⁷

▶ select tax

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 2nd step: add a second covariate

model         new variable   p-value
tax + size    size           1.15 × 10⁻⁶
tax + beds    beds           0.9163
tax + baths   baths          0.1915
tax + new     new            0.0020

▶ select size

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 3rd step: add a third covariate

model                new variable   p-value
tax + size + beds    beds           0.0688
tax + size + baths   baths          0.6294
tax + size + new     new            0.0058

▶ select new

. . . Model Building

• Example. House price, forward selection (criterion: p-value, threshold 5%); response: price; covariates: size, tax, beds, baths, new

▶ 4th step: add a fourth covariate

model                      new variable   p-value
tax + size + new + beds    beds           0.1939
tax + size + new + baths   baths          0.6545

▶ no p-value is below the threshold ⇝ stop the algorithm (an add1()-based sketch follows)

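Tables like the ones above can be produced with add1(), which fits every one-term addition and reports an F test; its p-value is the criterion used here. A hedged sketch (covariate names as in the lm calls above):

```r
scope <- ~ Taxes + Beds + Baths + New + Size
cur <- lm(Price ~ 1, data = houses)
add1(cur, scope = scope, test = "F")  # 1st step: Taxes has the smallest p-value

cur <- update(cur, . ~ . + Taxes)
add1(cur, scope = scope, test = "F")  # 2nd step: Size wins
# repeat until no addition has a p-value below the 5% threshold
```
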
. . . Model Building

• Automatic selection

▶ useful when p is large

▶ a computer can execute it very rapidly

▶ different algorithms can lead to conflicting results

▶ creates an incentive not to use subjective information

▶ ⇝ ideally combine judgement and problem knowledge with automatic selection

Cross-validation

• How good is a model's prediction? Need to test on a data set different from the one used for fitting

▶ train set: used to fit the model(s)

▶ test set: check the prediction of the trained model(s) and choose one

▶ validation set: assess chosen model on unused data

▶ production set: use model for actual work

. . . Cross-validation

• Cross-validation (CV):

▶ separate available data into train and test subsets

▶ calculate prediction error on test

▶ repeat by changing the subset

• Most common criterion: predicted (or test) mean squared error: for each test set compute avg. (actual − predicted)², then eg average across the test sets

. . . Cross-validation

• Leave-one-out cross-validation (LOOCV): for each i = 1, . . . , n

▶ remove observation i, (y_i, x_i1, . . . , x_ip), from the sample

▶ fit the model on the remaining data (n − 1 obs.)

▶ use the fitted model to predict the response at (x_i1, . . . , x_ip) ⇝ ŷ_(i)

▶ calculate (y_i − ŷ_(i))²

• Test MSE approximated by (1/n) PRESS = (1/n) Σ_{i=1}^n (y_i − ŷ_(i))²

• Requires fitting n models! (see the sketch below)

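A base-R sketch of LOOCV for the house-price model, following the steps above literally (one refit per observation):

```r
n <- nrow(houses)
press <- 0
for (i in 1:n) {
  fit_i  <- lm(Price ~ Taxes + Beds + Baths + New + Size,
               data = houses[-i, ])                # fit without observation i
  yhat_i <- predict(fit_i, newdata = houses[i, ])  # predict the held-out response
  press  <- press + (houses$Price[i] - yhat_i)^2
}
press / n  # LOOCV estimate of the test MSE
```
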
. . . Cross-validation

• k-fold cross-validation: divide the sample into k subsets ("folds") of the same size h

▶ remove fold l = 1, . . . , k from the sample

▶ fit the model on the remaining k − 1 folds (n − h obs.)

▶ use the fitted model to predict the response on the excluded fold's observations

▶ calculate the test MSE_l

▶ repeat by excluding a different fold

• Test MSE approximated by (1/k) Σ_{l=1}^k MSE_l

• Typical choices: k = 5, k = 10

• Folds chosen randomly or based on the data (see the sketch below)

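A base-R sketch of k-fold CV with randomly assigned folds:

```r
k <- 5
set.seed(2)
fold <- sample(rep(1:k, length.out = nrow(houses)))  # random fold labels
mse <- numeric(k)
for (l in 1:k) {
  fit_l  <- lm(Price ~ Taxes + Beds + Baths + New + Size,
               data = houses[fold != l, ])           # fit on the other k - 1 folds
  pred   <- predict(fit_l, newdata = houses[fold == l, ])
  mse[l] <- mean((houses$Price[fold == l] - pred)^2)
}
mean(mse)  # k-fold estimate of the test MSE
```
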
. . . Cross-validation

• Predicted Sum of Squares (PRESS)

PRESS = Σ_{i=1}^n (y_i − ŷ_(i))²

where ŷ_(i) is the prediction for observation i from the model fitted without it

▶ small PRESS ⇝ good predictive performance of a model

▶ compare alternative models based on the PRESS value: house prices example (a computational shortcut is sketched below the table)

covariates                     R²      R²_adj   PRESS
size, tax, beds, baths, new    0.793   0.782    2.91
size, tax, beds, new           0.793   0.785    2.85
size, tax, new                 0.790   0.783    2.67

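For a linear model PRESS needs no refitting: the leave-one-out residual satisfies y_i − ŷ_(i) = e_i / (1 − h_ii), where e_i is the ordinary residual and h_ii the leverage. A minimal sketch, reusing fit:

```r
e <- residuals(fit)
h <- hatvalues(fit)
sum((e / (1 - h))^2)  # PRESS without fitting n separate models
```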