or, How to choose a model, and then find out it's wrong
Bob O'Hara
Model Selection
We could fit all effects into a model
But this would be difficult to understand
which factors are important?
Instead, we want to remove the effects which are not important, to leave the interesting ones
How do we do this?
Models
What do we use models for?
Description
Prediction
Testing theories
Simple
easier to understand
Information Criteria
AIC, BIC
Made up of two terms: xIC = Deviance + Complexity
Deviance = -2 x log-likelihood = Goodness of Fit
Complexity penalises for number of parameters
Information Criteria
Try to minimise xIC
Better model fit, lower deviance
More parameters, higher penalisation
For n observations, p parameters
AIC = Deviance + 2p
tends to overestimate number of parameters
BIC = Deviance + p log(n)
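As a rough illustration of how the two criteria trade fit against complexity, here is a minimal sketch. The BIC penalty used is the standard p log(n); the deviances and parameter counts are made-up numbers, not results from any real fit.

```python
import math


def aic(deviance, p):
    """AIC = deviance + 2p, where p is the number of parameters."""
    return deviance + 2 * p


def bic(deviance, p, n):
    """BIC = deviance + p * log(n), where n is the number of observations."""
    return deviance + p * math.log(n)


# Hypothetical models fitted to the same n = 100 observations.
n = 100
small = {"deviance": 250.0, "p": 3}
large = {"deviance": 245.0, "p": 6}

for name, m in (("small", small), ("large", large)):
    print(name,
          "AIC =", aic(m["deviance"], m["p"]),
          "BIC =", round(bic(m["deviance"], m["p"], n), 1))
```

In this toy case the larger model fits slightly better (lower deviance), but both criteria prefer the smaller model, and BIC penalises the extra parameters more heavily than AIC does once n is larger than about 7.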
Automatic procedures
Forward Selection
Backward Selection
Stepwise Selection
Selection
Forward selection
Start with no factors
add the best unselected factor
until the present model is the best
use AIC, BIC, F-ratios to decide the best
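The forward procedure above can be sketched as a greedy loop. This is only an illustration: `forward_select` and `toy_score` are hypothetical names, and the score function stands in for whatever criterion (AIC, BIC, ...) a real fitting routine would return. The "gains" are invented numbers.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    candidates: iterable of term names.
    score: function taking a frozenset of terms and returning a
           criterion value such as AIC (lower is better).
    """
    selected = frozenset()
    best = score(selected)
    remaining = set(candidates)
    while remaining:
        # Score every model obtained by adding one unselected term.
        trials = {term: score(selected | {term}) for term in remaining}
        term = min(trials, key=trials.get)
        if trials[term] >= best:
            break  # no single addition improves the criterion
        selected, best = selected | {term}, trials[term]
        remaining.remove(term)
    return selected, best


# Made-up stand-in for a fitted model's criterion: each term lowers the
# deviance by a fixed amount and costs a penalty of 2 (as in AIC).
GAIN = {"A": 10.0, "B": 5.0, "C": 1.0}


def toy_score(terms):
    return 100.0 - sum(GAIN[t] for t in terms) + 2 * len(terms)


selected, best = forward_select(GAIN, toy_score)
print(sorted(selected), best)
```

With these toy numbers the loop adds A, then B, and stops because adding C would cost more in penalty than it gains in fit.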
Backward selection
Start with all factors in the model
eliminate the worst covariates one by one
until all remaining covariates are good
again use AIC etc.
Stepwise Selection
Start with full model
Use backward selection
try and remove a term
Iterate, trying to remove and add terms
Stop when the model doesn't change
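A minimal sketch of the stepwise idea, assuming each iteration considers every single-term removal and every single-term addition and takes the best one. `stepwise` and `toy_score` are hypothetical names, and the score is a made-up stand-in for a real criterion such as AIC.

```python
def stepwise(start, candidates, score):
    """Stepwise selection: at each step try removing each term in the
    model and adding each term outside it; move to the best neighbour
    if it improves the criterion (lower is better), otherwise stop."""
    current = frozenset(start)
    best = score(current)
    while True:
        moves = [current - {t} for t in current]
        moves += [current | {t} for t in candidates if t not in current]
        trial = min(moves, key=score, default=current)
        trial_score = score(trial)
        if trial_score >= best:
            return current, best  # the model no longer changes
        current, best = trial, trial_score


# Made-up criterion: fixed deviance reduction per term, penalty of 2 each.
GAIN = {"A": 10.0, "B": 5.0, "C": 1.0}


def toy_score(terms):
    return 100.0 - sum(GAIN[t] for t in terms) + 2 * len(terms)


final, value = stepwise(start=GAIN, candidates=GAIN, score=toy_score)
print(sorted(final), value)
```

Starting from the full toy model, the loop drops C and then stops, since neither dropping A or B nor re-adding C lowers the score. Because the score must strictly decrease at each step, the loop always terminates.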
Don't take effect size into account
Therefore: treat them as a guide.
Don't believe them
Then...
Do the more automatic stuff
Stepwise Selection
F-stats
An Example
Truancy in New South Wales
Number of days absent from school in a year, for children from a large town in NSW, Australia
Children classified by 4 factors
Age (primary, first, second, third form)
Eth (Aboriginal, non-Aboriginal)
Lrn (slow or average learner)
Sex (Male, Female)
Comments
Layout unbalanced
no slow learners in third form
Heteroscedastic
use a transformation to log(Days+2.5)
Can do automatically in R
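The log(Days + 2.5) transformation mentioned above can be written out directly. This is a small sketch: the offset of 2.5 comes from the slide, while the function name and the day counts are made up for illustration. The offset keeps the logarithm defined for children with zero days absent.

```python
import math


def log_days(days, offset=2.5):
    """Transform a count of days absent to log(days + offset)."""
    return math.log(days + offset)


# Made-up counts, only to show the effect of the transformation.
days_absent = [0, 5, 20, 60]
transformed = [log_days(d) for d in days_absent]
print([round(t, 2) for t in transformed])
```

The transformation compresses large counts much more than small ones, which is why it can help when the spread of the response grows with its mean.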
-Eth -43.7
Carry on...
Add Age to the model
Step 3:
  + Eth.Age   -61.1
  0           -57.4
  - Age       -57.1
  + Lrn       -56.2
  + Sex       -55.9
  - Eth       -44.3
Add Eth.Age interaction
Step 4:
  0           -61.1
  + Lrn       -60.1
  + Sex       -59.6
  - Eth.Age   -57.4
Stop Here!
Model Checking
Linear Models have several assumptions
Normally distributed errors
Additive effects
Independent Observations
Constant Variance of Errors
The Tools
Predicted Values
Expected value from the parameter estimates
Residuals
Observed - Expected
Can standardise by dividing by their standard deviation
Hat Matrix
matrix that maps the observed values to the predicted values
used in examination of influential observations
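For a straight-line fit the three tools above can be computed by hand. The sketch below assumes a simple regression y = a + bx, for which the diagonal of the hat matrix (the leverage) has the closed form 1/n + (x - mean(x))^2 / Sxx. The function name and the data are made up for illustration.

```python
import math


def line_diagnostics(x, y):
    """Least-squares line plus predicted values, residuals, leverages
    (hat-matrix diagonal) and standardised residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Leverage of each point: diagonal element of the hat matrix.
    hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Residual variance with n - 2 degrees of freedom (two parameters).
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    std_resid = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, hat)]
    return fitted, resid, hat, std_resid


# Made-up data lying close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted, resid, hat, std_resid = line_diagnostics(x, y)
print([round(h, 2) for h in hat])
```

The leverages sum to the number of fitted parameters (2 here), and the points at the ends of the x range have the highest leverage, which is why they are the ones to watch when looking for influential observations.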
Plots
If the model is right, then there should be no structure in the residuals
Check by looking for structure in plots
Plot residuals against:
Predicted values
covariates
covariates not in the model
e.g. time
A Good Fit
[Figure: residuals plotted against predicted values]
An Outlier
[Figure: residuals plotted against predicted values]
Curved Relationship
y = a + bx^2 + e
[Figure: y, and residuals plotted against predicted values]
Heteroscedasticity
[Figure: residuals plotted against predicted values]
Influential Observation
[Figures: residuals plotted against predicted values]
Cook's D
Measure the effect of deleting a point
Look for large values
larger than the rest
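Cook's distance combines the size of a residual with the leverage of the point. A minimal sketch for a straight-line fit, using the usual formula D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2) with p = 2 parameters. The function name and the data are made up, with the last point placed far from the others in x and off the trend of the rest.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a least-squares straight line."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages
    s2 = sum(e ** 2 for e in resid) / (n - p)           # residual variance
    return [e ** 2 * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, hat)]


# Made-up data: five points near y = x, plus one high-leverage point.
x = [1, 2, 3, 4, 5, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]
d = cooks_distance(x, y)
worst = max(range(len(d)), key=d.__getitem__)
print("largest Cook's D at index", worst)
```

Note that the last point has a small residual, because it drags the fitted line towards itself. Its Cook's distance is still far larger than the rest, which is the pattern to look for.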
Residual Plot
[Figure: residuals plotted against theoretical quantiles]
Cook's D
[Figure: Cook's distance plot, Cook's distance against observation number; points labelled 32, 98 and 14]
Female, Aborigine, Slow learner, Primary Age. Only One. (6 days off, mean 16.4)
Problems
Outliers
only expect about 5% of points to have a standardised residual beyond ±2, and about 0.3% beyond ±3
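The expected proportions can be checked against the standard normal distribution. A small sketch, assuming the standardised residuals are approximately standard normal; the function name is hypothetical.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z, via the complementary
    error function: erfc(z / sqrt(2))."""
    return math.erfc(z / math.sqrt(2))


for z in (2, 3):
    print(f"P(|Z| > {z}) = {two_sided_tail(z):.4f}")
```

So in a data set of 150 observations, a handful of standardised residuals beyond ±2 is unremarkable, while even one beyond ±3 deserves a closer look.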
Solutions I: Outliers
Why are they outliers?
if they are wrong, then deal with them
typos etc.
Remove them
Worth doing, to see if they have a big effect Report the analysis with and without them?
Could try robust regression methods
Whatever, treat them with caution
would you believe a scientific conclusion based on a single observation?
Check of Normality
Normal Probability Plots
already seen them
Use to detect:
outliers
skew
kurtosis problems
tails too thick or too thin
[Figure: normal probability plot, sample quantiles against theoretical quantiles]
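The coordinates of a normal probability plot are easy to compute without a plotting library. This sketch assumes plotting positions (i + 0.5)/n, which is one common convention among several; the function name and the residuals are made up.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each sorted residual (sample quantile) with the matching
    standard normal quantile (theoretical quantile)."""
    n = len(residuals)
    sample = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


# Made-up residuals with one large positive value.
points = normal_qq_points([-1.2, 0.3, 0.8, -0.4, 2.9, -0.1, 0.5])
for q, r in points:
    print(f"{q:6.2f} {r:6.2f}")
```

If the residuals are roughly normal the points fall near a straight line. An outlier shows up as a point well off the line at one end, skew as a curve, and thick or thin tails as an S shape.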
Dependence
Could be temporal
order in which samples are taken could matter
Spatial
where the samples were kept
Detection of Non-Independence
By modelling the residuals
e.g. if another factor is suspected, then they can be plotted against the factor
Temporal dependence
Durbin-Watson test
model residuals as time series
Spatial dependence
more difficult. Look at semi-variograms?
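For the temporal case, the Durbin-Watson statistic mentioned above is simple to compute from residuals taken in time order. A minimal sketch with made-up residuals; it gives only the statistic, not the critical values needed for a formal test.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by their sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den


# Made-up residual sequences, in the order the samples were taken.
drifting = [1, 1, 1, -1, -1, -1]   # neighbours tend to be alike
alternating = [1, -1, 1, -1]       # neighbours tend to differ
print(durbin_watson(drifting), durbin_watson(alternating))
```

The statistic lies between 0 and 4. Values near 2 are what independent residuals would give, values well below 2 suggest positive autocorrelation (as in the drifting sequence), and values well above 2 suggest negative autocorrelation.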
Solution to Non-Independence
Model Expansion
add some more terms, and see if they make a difference
Iterate between selecting a model, and then checking it and making any changes
e.g. removing outliers, transforming variables