
Lecture 5: ANOVA and Regression II: Model Selection and Model Checking

or, How to choose a model, and then find out it's wrong
Bob O'Hara

Model Selection
We could fit all effects into a model, but this would be difficult to understand
which factors are important?

Instead, we want to remove the effects that are not important, to leave the interesting ones
How do we do this?

Models
What do we use models for?
Description
Prediction
Testing theories

These can lead to slightly different models


Testing theories often means removing the effect of nuisance variables; for prediction, these same variables would be treated as useful

What's a Good Model?


Should fit to the data
obvious?

Simple
easier to understand

Trade-off between model fit and complexity
Also: interpretability


its scientific importance depends on the purpose of the model

Criteria for Comparing Models


F-tests, from the ANOVA table
test individual effects
can have problems with the order of terms

Information Criteria
AIC, BIC
Made up of two terms: xIC = Deviance + Complexity
Deviance = -2 × log-likelihood: measures goodness of fit
Complexity: penalises the number of parameters

Information Criteria
Try to minimise xIC
Better model fit: lower deviance
More parameters: higher penalty
For n observations, p parameters:

AIC = Deviance + 2p
tends to select too many parameters

BIC = Deviance + (ln n)p
leads to smaller models - perhaps too small?
can over-penalise factors with many levels
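The two criteria above differ only in the penalty term, which is easy to see numerically. A minimal Python sketch (the lecture works in R; the deviance value here is made up for illustration):

```python
import math

def aic(deviance, p):
    """Akaike Information Criterion: deviance plus 2 per parameter."""
    return deviance + 2 * p

def bic(deviance, p, n):
    """Bayesian Information Criterion: the penalty grows with ln(n)."""
    return deviance + math.log(n) * p

# Once n > e^2 (about 7.4) observations, BIC penalises each parameter
# more heavily than AIC, so it prefers smaller models.
print(aic(100.0, 3))      # 106.0
print(bic(100.0, 3, 50))  # 100 + 3*ln(50), about 111.74
```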

How do we choose models?


Use AIC, BIC, or F-tests to compare models
Compare all possible models
only feasible with a small number of candidate terms

Automatic procedures
Forward selection
Backward selection
Stepwise selection

The Principle of Marginality


If we have an interaction, then always keep its main effects in the model
removing a main effect forces that effect to be exactly 0

If we have a polynomial term, always keep the lower-order terms


with x² in the model but no x, the slope at the origin is forced to be 0

Lower-order terms are said to be marginal


and should be kept in...

Selection
Forward selection
Start with no factors; add the best unselected factor; repeat until the present model is the best
use AIC, BIC, or F-ratios to decide which is best

Backward selection
Start with all factors in the model; eliminate the worst covariates one by one until all remaining covariates are good
again use AIC etc.

Stepwise Selection
Start with the full model
Use backward selection
try to remove a term

Use forward selection


try to add a term

Iterate, trying to remove and add terms
Stop when the model doesn't change
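The loop described above (try every single-term addition and removal, take the best, stop at a fixed point) can be written down compactly. A minimal sketch in Python rather than the lecture's R; the `toy_score` function is a made-up stand-in for AIC, chosen so the best model is {"Eth", "Age"}:

```python
def stepwise(candidates, score, max_iter=20):
    """Greedy stepwise selection: at each step try every single addition
    or removal, take the move that most lowers the score, and stop when
    no move improves on the current model."""
    current = set()
    best = score(current)
    for _ in range(max_iter):
        moves = []
        for term in candidates - current:            # forward moves
            moves.append((score(current | {term}), current | {term}))
        for term in current:                         # backward moves
            moves.append((score(current - {term}), current - {term}))
        if not moves:
            break
        new_best, new_model = min(moves, key=lambda m: m[0])
        if new_best >= best:                         # fixed point: stop
            break
        best, current = new_best, new_model
    return current, best

# Toy score standing in for AIC: minimised at exactly {"Eth", "Age"}.
def toy_score(model):
    return len(model ^ {"Eth", "Age"})   # symmetric difference

model, s = stepwise({"Eth", "Age", "Sex", "Lrn"}, toy_score)
print(sorted(model))  # ['Age', 'Eth']
```

In practice `score` would refit the model and return its AIC or BIC; the control flow is the same.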

Selection Algorithms: Problems


Not guaranteed to find the optimum
Stepwise: try from different starting points

The methods are automatic


they don't take scientific knowledge into account, and can lead to unrealistic models

They don't take effect size into account
Therefore: treat them as a guide.
Don't believe them

How To Do Model Selection


(sort of)
Remove unimportant effects
on scientific grounds
if effects are highly correlated, only one is needed
are higher-order interactions worth including?
not an excuse to only look at main effects!

Is there enough data to look at big models?

Nuisance parameters can be added first, to remove their effects

Then...
Do the more automatic stuff
Stepwise selection
F-statistics

If you use the ANOVA table:


be careful about the order of the effects
try different orders

Always keep the main effects if you have an interaction


unless you have a good reason not to

An Example
Truancy in New South Wales
Number of days absent from school in a year, from a large town in NSW, Australia
Children classified by 4 factors:
Age (primary, first, second, third form)
Eth (Aboriginal, non-Aboriginal)
Lrn (slow or average learner)
Sex (male, female)

Comments
Layout unbalanced
no slow learners in the third form

Heteroscedastic
use the transformation log(Days + 2.5)

What is the best model?

Automatic Model Selection


Use AIC as the criterion
Try 2 starting points:
just a constant
the full model (all terms and interactions)

This can be done automatically in R

Starting from Nothing


Initial AIC (just a constant): -43.66

Step 1: +Eth -57.1, +Age -44.3, <none> -43.7, +Sex -42.5, +Lrn -41.7
Add Eth to the model

Step 2: +Age -57.4, <none> -57.1, +Sex -56.0, +Lrn -55.1, -Eth -43.7
Add Age to the model

Step 3: +Eth.Age -61.1, <none> -57.4, -Age -57.1, +Lrn -56.2, +Sex -55.9, -Eth -44.3
Add the Eth.Age interaction

Step 4: <none> -61.1, +Lrn -60.1, +Sex -59.6, -Eth.Age -57.4
Stop here!

Try from different starting points


Start from a constant in the model
end with: Eth + Age + Eth.Age

Start from all main effects in the model
end with: all main effects + Eth.Age + Sex.Age + Age.Lrn

Start from the full model
end with: all main effects + all first-order interactions + Eth.Sex.Lrn + Eth.Age.Lrn

The last one has the lowest AIC

Model Checking
Linear models have several assumptions:
Normally distributed errors
Additive effects
Independent observations
Constant variance of errors

If these assumptions are not met, the analysis may be wrong


so we need to check these assumptions

The Tools
Predicted Values
the expected value, from the parameter estimates

Residuals
observed - expected
can be standardised by the variance

Hat Matrix
the matrix that maps observed values to predicted values
used in the examination of influential observations
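For a simple one-covariate regression all three tools above have closed forms. A minimal Python sketch (the lecture uses R; the data here are made up for illustration):

```python
def simple_lm(x, y):
    """Ordinary least squares for y = a + b*x; returns fitted values,
    residuals (observed - expected), and the hat-matrix diagonal
    (leverage) of each observation."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # For simple regression the hat-matrix diagonal has a closed form:
    # points far from the mean of x have high leverage.
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return fitted, resid, leverage

x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.1]
fitted, resid, lev = simple_lm(x, y)
print(round(sum(lev), 6))  # leverages sum to p = 2 (intercept + slope)
```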

Plots
If the model is right, then there should be no structure in the residuals
Check by looking for structure in plots
Plot residuals against:
predicted values
covariates
covariates not in the model (e.g. time)

What to look for


Overall pattern:
linearity
constant variance
outliers

A Good Fit
[Figure: residuals vs predicted values for a well-fitting model: no structure in the residuals]

An Outlier
[Figure: residuals vs predicted values, with a single outlier far from the other points]

Curved Relationship
Data simulated from y = a + bx² + e

[Figure: fitting a straight line leaves a curved pattern in the residuals vs predicted values]

Heteroscedasticity
[Figure: residuals vs predicted values; the spread of the residuals increases with the predicted value]

Influential Observation
[Figure: residuals vs predicted values, with one high-leverage observation far from the rest in x]

Bad Influential Observation


[Figure: a high-leverage observation pulling the fitted line away from the other points]

Cook's D
Measures the effect of deleting a point
Look for large values
larger than the rest

No formal tests. Sorry.

Similar measure: DFFITS


compare with a t-distribution

Another measure: DFBETA


a point is influential if |DFBETA| > 3/n
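Cook's D combines a point's residual and its leverage into one number. A minimal Python sketch for the simple-regression case, using the standard formula D = (e²/(p·s²))·h/(1-h)²; the data are made up, with one gross outlier:

```python
def cooks_distance(x, y):
    """Cook's D for a simple linear regression y = a + b*x."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s2 = sum(e ** 2 for e in resid) / (n - p)   # residual variance estimate
    # D combines a large residual with high leverage.
    return [(e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
            for e, h in zip(resid, lev)]

# Four points near a line plus one gross outlier at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 20.0]
d = cooks_distance(x, y)
print(d.index(max(d)))  # 4: the outlier has by far the largest distance
```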

The Example (again)


We've already found a good model, but does it fit? Look at some figures...

Residual Plot
[Figure: residuals vs predicted values for the truancy model]

Normal Probability Plot


[Figure: Normal Q-Q plot of the residuals: sample quantiles against theoretical quantiles]

Cook's D

[Figure: Cook's distance plotted against observation number; observations 32, 98 and 14 stand out. The largest: a female, Aboriginal, slow-learning, primary-age child, the only one in that cell (6 days off, mean 16.4)]

Problems
Outliers
only expect about 5% of points with a standardised residual > 2, and under 1% > 3

Influential observations
Relationship not linear
Heteroscedasticity


variance not constant

Solutions I: Outliers
Why are they outliers?
if they are wrong (typos etc.), then deal with them

Remove them
worth doing, to see if they have a big effect
report the analysis with and without them?

Use robust regression


if there are several, the distribution may not be Normal

Solutions II: Influential Observations


As with outliers, try removing them
they will almost certainly have an effect (by definition)

Could try robust regression methods
Whatever you do, treat them with caution
would you believe a scientific conclusion based on a single observation?

Solutions III: Non-Linearity


Box-Cox transformation
transform the response variable to a power: y → y^λ
λ = 0 corresponds to ln(y)
if the relationship curves upwards, use λ < 1

May not need to be too precise
can often find an approximate value that can be interpreted scientifically
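The transformation is usually written in the form (y^λ - 1)/λ, which tends smoothly to ln(y) as λ → 0. A minimal Python sketch:

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform: (y^lam - 1)/lam, with lam = 0 meaning
    ln(y); defined for y > 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

print(box_cox(4.0, 0.5))             # (sqrt(4) - 1)/0.5 = 2.0
print(round(box_cox(4.0, 0), 6))     # ln(4), about 1.386294
```

The (y^λ - 1)/λ scaling only shifts and rescales y^λ, so the fitted model is equivalent; its advantage is the continuous limit at λ = 0.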

Solutions IV: Heteroscedasticity


Non-constant variance
Can use a Box-Cox transformation
if the variance increases with the mean, use λ < 1
this also affects curvature

Solution 2: use weighted regression


give less weight to values with larger variance
use covariates as weights
if the variance is proportional to x, use x⁻¹ as the weight
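Weighted regression has a simple closed form in the one-covariate case: replace ordinary sums with weighted ones. A minimal Python sketch with made-up, noiseless data (so the fit recovers the true line exactly, whatever the weights):

```python
def wls(x, y, w):
    """Weighted least squares for y = a + b*x: observations with larger
    variance get smaller weight, so they pull the fit less."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
    a = yw - b * xw
    return a, b

x = [1.0, 2.0, 4.0, 8.0]
y = [3.0, 5.0, 9.0, 17.0]               # exactly y = 1 + 2x
a, b = wls(x, y, [1 / xi for xi in x])  # weight 1/x: variance ∝ x
print(round(a, 6), round(b, 6))         # 1.0 2.0
```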

Plots against Covariates


Similar to plots against predicted values
If only one covariate is misbehaving, we can identify it
Can be easier to spot influential observations
Can transform the variable, or add in polynomial terms (x², x³ etc.)

Check of Normality
Normal Probability Plots
already seen them

Use to detect:
outliers
skew
kurtosis problems
tails too thick or too thin

[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles]

Dependence
Could be temporal
the order in which samples are taken could matter

Spatial
where the samples were kept

Due to another factor


e.g. which machine was used

Detection of Non-Independence
By modelling the residuals
e.g. if another factor is suspected, the residuals can be plotted against that factor

Temporal dependence
Durbin-Watson test
model the residuals as a time series

Spatial dependence
more difficult. Look at semi-variograms?
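The Durbin-Watson statistic mentioned above is just a ratio of sums over the residuals: Σ(e_t - e_{t-1})² / Σe_t². Values near 2 suggest no lag-1 autocorrelation, near 0 positive autocorrelation, near 4 negative autocorrelation. A minimal Python sketch with made-up residuals:

```python
def durbin_watson(resid):
    """Durbin-Watson statistic for lag-1 autocorrelation in residuals:
    sum of squared successive differences over the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Perfectly alternating residuals: strong negative autocorrelation,
# so the statistic is well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0]))  # 3.2
```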

Solution to Non-Independence
Model Expansion
add some more terms, and see if they make a difference

One general strategy for model checking


for non-linearity, adding polynomial terms is another example

Back to Model Selection...


We can select models, and then find out that they're wrong
What do we do?
go back to model selection

Iterate between selecting a model, and then checking it and making any changes
e.g. removing outliers, transforming variables

Normally only 1 or 2 iterations are needed
