
Lecture 5: ANOVA and Regression II: Model Selection and Model Checking

or, How to choose a model, and then find out it's wrong
Bob O'Hara

Model Selection
We could fit all effects into a model, but this would be difficult to understand
which factors are important?

Instead, we want to remove the effects that are not important, to leave the interesting ones
How do we do this?

Models
What do we use models for?
Description
Prediction
Testing theories

These can lead to slightly different models


Testing theories often means removing the effect of nuisance variables; for prediction, these same variables would be treated as useful

What's a Good Model?


Should fit to the data
obvious?

Simple
easier to understand

Trade-off between model fit and complexity
Also: interpretability


its scientific importance depends on the purpose of the model

Criteria for Comparing Models


F-tests, from the ANOVA table
test individual effects
can have problems with the order of terms

Information Criteria
AIC, BIC
Made up of two terms: xIC = Deviance + Complexity
Deviance = -2 × log-likelihood: measures goodness of fit
Complexity: penalises the number of parameters

Information Criteria
Try to minimise xIC
Better model fit: lower deviance
More parameters: higher penalty
For n observations, p parameters:

AIC = Deviance + 2p
tends to select too many parameters

BIC = Deviance + (ln n)p
leads to smaller models - perhaps too small?
can over-penalise factors with many levels
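The two criteria above differ only in the penalty term, which is easy to see numerically. A minimal Python sketch (the lecture works in R; the deviance value here is made up for illustration):

```python
import math

def aic(deviance, p):
    """Akaike Information Criterion: deviance plus 2 per parameter."""
    return deviance + 2 * p

def bic(deviance, p, n):
    """Bayesian Information Criterion: the penalty grows with ln(n)."""
    return deviance + math.log(n) * p

# Once n > e^2 (about 7.4) observations, BIC penalises each parameter
# more heavily than AIC, so it prefers smaller models.
print(aic(100.0, 3))      # 106.0
print(bic(100.0, 3, 50))  # 100 + 3*ln(50), about 111.74
```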

How do we choose models?


Use AIC, BIC, or F-tests to compare models
Compare all possible models
only feasible with a small number of candidate terms

Automatic procedures
Forward selection
Backward selection
Stepwise selection

The Principle of Marginality


If we have an interaction, then always keep its main effects in the model
removing a main effect forces that effect to be exactly 0

If we have a polynomial term, always keep the lower-order terms


with x² in the model but no x, the slope at the origin is forced to be 0

Lower-order terms are said to be marginal


and should be kept in...

Selection
Forward selection
Start with no factors; add the best unselected factor; repeat until the present model is the best
use AIC, BIC, or F-ratios to decide which is best

Backward selection
Start with all factors in the model; eliminate the worst covariates one by one until all remaining covariates are good
again use AIC etc.

Stepwise Selection
Start with the full model
Use backward selection
try to remove a term

Use forward selection


try to add a term

Iterate, trying to remove and add terms
Stop when the model doesn't change
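The loop described above (try every single-term addition and removal, take the best, stop at a fixed point) can be written down compactly. A minimal sketch in Python rather than the lecture's R; the `toy_score` function is a made-up stand-in for AIC, chosen so the best model is {"Eth", "Age"}:

```python
def stepwise(candidates, score, max_iter=20):
    """Greedy stepwise selection: at each step try every single addition
    or removal, take the move that most lowers the score, and stop when
    no move improves on the current model."""
    current = set()
    best = score(current)
    for _ in range(max_iter):
        moves = []
        for term in candidates - current:            # forward moves
            moves.append((score(current | {term}), current | {term}))
        for term in current:                         # backward moves
            moves.append((score(current - {term}), current - {term}))
        if not moves:
            break
        new_best, new_model = min(moves, key=lambda m: m[0])
        if new_best >= best:                         # fixed point: stop
            break
        best, current = new_best, new_model
    return current, best

# Toy score standing in for AIC: minimised at exactly {"Eth", "Age"}.
def toy_score(model):
    return len(model ^ {"Eth", "Age"})   # symmetric difference

model, s = stepwise({"Eth", "Age", "Sex", "Lrn"}, toy_score)
print(sorted(model))  # ['Age', 'Eth']
```

In practice `score` would refit the model and return its AIC or BIC; the control flow is the same.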

Selection Algorithms: Problems


Not guaranteed to find the optimum
Stepwise: try from different starting points

The methods are automatic


they don't take scientific knowledge into account, and can lead to unrealistic models

They don't take effect size into account
Therefore: treat them as a guide.
Don't believe them

How To Do Model Selection


(sort of)
Remove unimportant effects
on scientific grounds
if effects are highly correlated, only one is needed
are higher-order interactions worth including?
not an excuse to only look at main effects!

Is there enough data to look at big models?

Nuisance parameters can be added first, to remove their effects

Then...
Do the more automatic stuff
Stepwise selection
F-statistics

If you use the ANOVA table:


be careful about the order of the effects
try different orders

Always keep the main effects if you have an interaction


unless you have a good reason not to

An Example
Truancy in New South Wales
Number of days absent from school in a year, from a large town in NSW, Australia
Children classified by 4 factors:
Age (primary, first, second, third form)
Eth (Aboriginal, non-Aboriginal)
Lrn (slow or average learner)
Sex (male, female)

Comments
Layout unbalanced
no slow learners in the third form

Heteroscedastic
use the transformation log(Days + 2.5)

What is the best model?

Automatic Model Selection


Use AIC as the criterion
Try 2 starting points:
just a constant
the full model (all terms and interactions)

This can be done automatically in R

Starting from Nothing


Initial AIC (just a constant): -43.66

Step 1: +Eth -57.1, +Age -44.3, <none> -43.7, +Sex -42.5, +Lrn -41.7
Add Eth to the model

Step 2: +Age -57.4, <none> -57.1, +Sex -56.0, +Lrn -55.1, -Eth -43.7
Add Age to the model

Step 3: +Eth.Age -61.1, <none> -57.4, -Age -57.1, +Lrn -56.2, +Sex -55.9, -Eth -44.3
Add the Eth.Age interaction

Step 4: <none> -61.1, +Lrn -60.1, +Sex -59.6, -Eth.Age -57.4
Stop here!

Try from different starting points


Start from a constant in the model
end with: Eth + Age + Eth.Age

Start from all main effects in the model
end with: all main effects + Eth.Age + Sex.Age + Age.Lrn

Start from the full model
end with: all main effects + all first-order interactions + Eth.Sex.Lrn + Eth.Age.Lrn

The last one has the lowest AIC

Model Checking
Linear models have several assumptions:
Normally distributed errors
Additive effects
Independent observations
Constant variance of errors

If these assumptions are not met, the analysis may be wrong


so we need to check these assumptions

The Tools
Predicted Values
the expected value, from the parameter estimates

Residuals
observed - expected
can be standardised by the variance

Hat Matrix
the matrix that maps observed values to predicted values
used in the examination of influential observations
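For a simple one-covariate regression all three tools above have closed forms. A minimal Python sketch (the lecture uses R; the data here are made up for illustration):

```python
def simple_lm(x, y):
    """Ordinary least squares for y = a + b*x; returns fitted values,
    residuals (observed - expected), and the hat-matrix diagonal
    (leverage) of each observation."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # For simple regression the hat-matrix diagonal has a closed form:
    # points far from the mean of x have high leverage.
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return fitted, resid, leverage

x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.1]
fitted, resid, lev = simple_lm(x, y)
print(round(sum(lev), 6))  # leverages sum to p = 2 (intercept + slope)
```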

Plots
If the model is right, then there should be no structure in the residuals
Check by looking for structure in plots
Plot residuals against:
predicted values
covariates
covariates not in the model (e.g. time)

What to look for


Overall pattern:
linearity
constant variance
outliers

A Good Fit
[Figure: residuals vs predicted values for a well-fitting model: no structure in the residuals]

An Outlier
[Figure: residuals vs predicted values, with a single outlier far from the other points]

Curved Relationship
Data simulated from y = a + bx² + e

[Figure: fitting a straight line leaves a curved pattern in the residuals vs predicted values]

Heteroscedasticity
[Figure: residuals vs predicted values; the spread of the residuals increases with the predicted value]

Influential Observation
[Figure: residuals vs predicted values, with one high-leverage observation far from the rest in x]

Bad Influential Observation


[Figure: a high-leverage observation pulling the fitted line away from the other points]

Cook's D
Measures the effect of deleting a point
Look for large values
larger than the rest

No formal tests. Sorry.

Similar measure: DFFITS


compare with a t-distribution

Another measure: DFBETA


a point is influential if |DFBETA| > 3/n
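Cook's D combines a point's residual and its leverage into one number. A minimal Python sketch for the simple-regression case, using the standard formula D = (e²/(p·s²))·h/(1-h)²; the data are made up, with one gross outlier:

```python
def cooks_distance(x, y):
    """Cook's D for a simple linear regression y = a + b*x."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s2 = sum(e ** 2 for e in resid) / (n - p)   # residual variance estimate
    # D combines a large residual with high leverage.
    return [(e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
            for e, h in zip(resid, lev)]

# Four points near a line plus one gross outlier at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 20.0]
d = cooks_distance(x, y)
print(d.index(max(d)))  # 4: the outlier has by far the largest distance
```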

The Example (again)


We've already found a good model, but does it fit? Look at some figures...

Residual Plot
[Figure: residuals vs predicted values for the truancy model]

Normal Probability Plot


[Figure: Normal Q-Q plot of the residuals: sample quantiles against theoretical quantiles]

Cook's D

[Figure: Cook's distance plotted against observation number; observations 32, 98 and 14 stand out. The largest: a female, Aboriginal, slow-learning, primary-age child, the only one in that cell (6 days off, mean 16.4)]

Problems
Outliers
only expect about 5% of points with a standardised residual > 2, and under 1% > 3

Influential observations
Relationship not linear
Heteroscedasticity


variance not constant

Solutions I: Outliers
Why are they outliers?
if they are wrong (typos etc.), then deal with them

Remove them
worth doing, to see if they have a big effect
report the analysis with and without them?

Use robust regression


if there are several, the distribution may not be Normal

Solutions II: Influential Observations


As with outliers, try removing them
they will almost certainly have an effect (by definition)

Could try robust regression methods
Whatever you do, treat them with caution
would you believe a scientific conclusion based on a single observation?

Solutions III: Non-Linearity


Box-Cox transformation
transform the response variable to a power: y → y^λ
λ = 0 corresponds to ln(y)
if the relationship curves upwards, use λ < 1

May not need to be too precise
can often find an approximate value that can be interpreted scientifically
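The transformation is usually written in the form (y^λ - 1)/λ, which tends smoothly to ln(y) as λ → 0. A minimal Python sketch:

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform: (y^lam - 1)/lam, with lam = 0 meaning
    ln(y); defined for y > 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

print(box_cox(4.0, 0.5))             # (sqrt(4) - 1)/0.5 = 2.0
print(round(box_cox(4.0, 0), 6))     # ln(4), about 1.386294
```

The (y^λ - 1)/λ scaling only shifts and rescales y^λ, so the fitted model is equivalent; its advantage is the continuous limit at λ = 0.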

Solutions IV: Heteroscedasticity


Non-constant variance
Can use a Box-Cox transformation
if the variance increases with the mean, use λ < 1
this also affects curvature

Solution 2: use weighted regression


give less weight to values with larger variance
use covariates as weights
if the variance is proportional to x, use x⁻¹ as the weight
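Weighted regression has a simple closed form in the one-covariate case: replace ordinary sums with weighted ones. A minimal Python sketch with made-up, noiseless data (so the fit recovers the true line exactly, whatever the weights):

```python
def wls(x, y, w):
    """Weighted least squares for y = a + b*x: observations with larger
    variance get smaller weight, so they pull the fit less."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
    a = yw - b * xw
    return a, b

x = [1.0, 2.0, 4.0, 8.0]
y = [3.0, 5.0, 9.0, 17.0]               # exactly y = 1 + 2x
a, b = wls(x, y, [1 / xi for xi in x])  # weight 1/x: variance ∝ x
print(round(a, 6), round(b, 6))         # 1.0 2.0
```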

Plots against Covariates


Similar to plots against predicted values
If only one covariate is misbehaving, we can identify it
Can be easier to spot influential observations
Can transform the variable, or add in polynomial terms (x², x³ etc.)

Check of Normality
Normal Probability Plots
already seen them

Use to detect:
outliers
skew
kurtosis problems
tails too thick or too thin

[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]

Dependence
Could be temporal
the order in which samples are taken could matter

Spatial
where the samples were kept

Due to another factor


e.g. which machine was used

Detection of Non-Independence
By modelling the residuals
e.g. if another factor is suspected, the residuals can be plotted against that factor

Temporal dependence
Durbin-Watson test
model the residuals as a time series

Spatial dependence
more difficult. Look at semi-variograms?
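The Durbin-Watson statistic mentioned above is just a ratio of sums over the residuals: Σ(e_t - e_{t-1})² / Σe_t². Values near 2 suggest no lag-1 autocorrelation, near 0 positive autocorrelation, near 4 negative autocorrelation. A minimal Python sketch with made-up residuals:

```python
def durbin_watson(resid):
    """Durbin-Watson statistic for lag-1 autocorrelation in residuals:
    sum of squared successive differences over the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Perfectly alternating residuals: strong negative autocorrelation,
# so the statistic is well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0]))  # 3.2
```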

Solution to Non-Independence
Model Expansion
add some more terms, and see if they make a difference

One general strategy for model checking


for non-linearity, adding polynomial terms is another example

Back to Model Selection...


We can select models, and then find out that they're wrong
What do we do?
go back to model selection

Iterate between selecting a model, and then checking it and making any changes
e.g. removing outliers, transforming variables

Normally only 1 or 2 iterations are needed
