Model Selection
• Given a data set with many potential predictors, we need to decide which
ones to include in our model and which ones to leave out.
• Statistical algorithms may be used to find the best set of predictors.
Common selection methods are:
• Best Subsets (All possible models)
• Forward Selection (Automatic procedure)
• Backward Elimination (Automatic procedure)
• Stepwise Selection (Automatic procedure)
Best Subsets
Considering all possible models is time consuming unless the number of
predictors is small: with p candidate predictors there are 2^p possible
linear regression models, and we require procedures for choosing one (or a
small number) of them.
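To see how quickly the number of candidate models grows, the following minimal sketch enumerates every subset of a set of predictors. The predictor names are hypothetical and used only for illustration.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every subset of the predictors, including the empty
    (intercept-only) model."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield subset


# Hypothetical predictor names, for illustration only.
predictors = ["x1", "x2", "x3"]
models = list(all_subsets(predictors))
n_models = len(models)  # 2**3 = 8 candidate models

# With 20 predictors there would already be 2**20 = 1,048,576 models.
n_models_20 = 2 ** 20
```

Each extra predictor doubles the number of models, which is why exhaustive search is only practical for small p.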
It is still difficult to choose the “best” model, as many test results will
be available, possibly giving conflicting information.
The best models can be selected based on Adjusted R^2, Mallows' C_p, AIC or
BIC. Adjusted R^2 is used instead of R^2 because R^2 never decreases when a
predictor is added, whereas Adjusted R^2 penalises for the number of
parameters relative to the sample size.
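As a small illustration of the penalty, the sketch below uses the standard formula Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors (excluding the intercept). The function name and the numbers are assumptions chosen for the example.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (plus an intercept)
    fitted to n observations. Requires n > p + 1."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same raw R^2 of 0.8 on n = 21 observations, different model sizes.
small_model = adjusted_r2(0.8, n=21, p=1)
large_model = adjusted_r2(0.8, n=21, p=5)
```

For the same raw R^2, the model with more predictors receives the lower adjusted value, so extra predictors must earn their place.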
There are usually too many models to consider manually, so an automatic
system is needed for deciding which models to consider and in which order.
It is better to use a logical procedure like forward selection, backward
elimination or stepwise selection, where each test is acted upon
sequentially, and not to ignore any ‘substantive theory’.
Forward Selection
In Step 1, the predictor which has the most significant
association with the response is entered into the model.
In subsequent steps, the remaining predictors are
considered; the predictor which gives the greatest increase
in R^2 is added.
The algorithm stops when adding predictors no longer has a
significant effect on R^2.
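The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes that "significant" is judged with a partial F statistic compared against a fixed F-to-enter threshold (the default of 4.0 is an assumption), and the function names and simulated data are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares for a least-squares fit with an
    intercept plus the columns of X listed in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def forward_selection(X, y, f_to_enter=4.0):
    """Add one predictor at a time until no addition is significant.
    f_to_enter is an assumed threshold, not a value from the notes."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    current_rss = rss(X, y, selected)
    while remaining:
        # Largest drop in RSS is the same as the largest gain in R^2.
        best_rss, best_j = min((rss(X, y, selected + [j]), j)
                               for j in remaining)
        df_resid = n - len(selected) - 2  # df of the enlarged model
        if df_resid <= 0:
            break
        f_stat = (current_rss - best_rss) / (best_rss / df_resid)
        if f_stat < f_to_enter:
            break  # no remaining predictor has a significant effect
        selected.append(best_j)
        remaining.remove(best_j)
        current_rss = best_rss
    return selected


# Simulated data: only columns 0 and 2 truly affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
chosen = forward_selection(X, y)
```

The returned list gives the column indices in the order they entered the model.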
Backward Elimination
In Step 1, all predictors are entered into the model.
In subsequent steps, the predictor whose removal results in
the smallest decrease in R^2 is removed.
The algorithm stops when removing predictors would result
in a significant drop in R^2.
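A minimal sketch of this procedure is given below. As an assumption, a "significant drop" is judged with a partial F statistic against a fixed F-to-remove threshold (4.0 by default); the function names and simulated data are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares for a least-squares fit with an
    intercept plus the columns of X listed in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def backward_elimination(X, y, f_to_remove=4.0):
    """Start from the full model and drop one predictor at a time.
    Requires n > p + 1 so the full model has residual degrees of
    freedom. f_to_remove is an assumed threshold."""
    n, p = X.shape
    selected = list(range(p))
    current_rss = rss(X, y, selected)
    while selected:
        # Smallest rise in RSS is the same as the smallest drop in R^2.
        reduced_rss, drop_j = min(
            (rss(X, y, [k for k in selected if k != j]), j)
            for j in selected
        )
        df_resid = n - len(selected) - 1  # df of the current model
        f_stat = (reduced_rss - current_rss) / (current_rss / df_resid)
        if f_stat >= f_to_remove:
            break  # every remaining predictor is significant
        selected.remove(drop_j)
        current_rss = reduced_rss
    return selected


# Simulated data: only columns 0 and 2 truly affect the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
kept = backward_elimination(X, y)
```

Because it starts from the maximal model, this approach needs more observations than predictors.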
Stepwise Selection
Choose an initial model – usually the null or
maximal model.
Include the most significant variable not in the
model.
Remove the least significant variable if it is not
significant at a certain level.
Repeat the last two steps until the model does not change.
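One way to sketch the stepwise loop is shown below, starting from the null model. The partial F thresholds are assumptions (the notes only say "a certain level"), and the removal threshold is set slightly below the entry threshold so that a variable is not removed immediately after being added. The function names, the iteration cap and the simulated data are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares for a least-squares fit with an
    intercept plus the columns of X listed in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def stepwise_selection(X, y, f_to_enter=4.0, f_to_remove=3.9,
                       max_iter=100):
    """Alternate an add step and a remove step, starting from the
    null model, until the model stops changing."""
    n, p = X.shape
    selected = []
    for _ in range(max_iter):  # cap is a safeguard against cycling
        changed = False
        # Add step: include the most significant variable not in the model.
        remaining = [j for j in range(p) if j not in selected]
        df_add = n - len(selected) - 2
        if remaining and df_add > 0:
            current = rss(X, y, selected)
            new_rss, add_j = min((rss(X, y, selected + [j]), j)
                                 for j in remaining)
            if (current - new_rss) / (new_rss / df_add) >= f_to_enter:
                selected.append(add_j)
                changed = True
        # Remove step: drop the least significant variable if it is
        # not significant at the chosen level.
        if selected:
            current = rss(X, y, selected)
            reduced, drop_j = min(
                (rss(X, y, [k for k in selected if k != j]), j)
                for j in selected
            )
            df_rem = n - len(selected) - 1
            if (reduced - current) / (current / df_rem) < f_to_remove:
                selected.remove(drop_j)
                changed = True
        if not changed:
            break
    return sorted(selected)


# Simulated data: only columns 0 and 2 truly affect the response.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
final_model = stepwise_selection(X, y)
```

Unlike pure forward selection, a variable that entered early can later be removed if it loses significance once other predictors are in the model.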