VARIABLE SELECTION
Variable selection also occurs when the competing models differ in which variables should be included but agree on the mathematical form to be used for each variable. For example, temperature might or might not be included as a predictor, but there is no question about whether, if it is, we would use temperature, temperature², or log(temperature).
Model Selection
The complete regression analysis depends on the explanatory variables present in the model. It is understood in regression analysis that only correct and important explanatory variables appear in the model. In practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory variables that possibly influence the process or experiment.
However, in most practical problems not all such candidate variables are used in the regression modelling; rather, a subset of explanatory variables is chosen from this pool. Determining an appropriate subset of explanatory variables to use in the regression is called the problem of variable selection.
While choosing a subset of explanatory variables, there are two possible options:
• To make the model as realistic as possible, the analyst may include as many explanatory variables as possible.
• To make the model as simple as possible, one may include only a small number of explanatory variables.
So, model building and subset selection have conflicting objectives.
• When a large number of variables is included in the model, all of these factors can influence the prediction of the study variable $y$, but the variance of the predictions increases.
• When a small number of variables is included, the predictive variance of $\hat{y}$ decreases, but omitting relevant variables can introduce bias.
• When observations on a larger number of variables are to be collected, more cost, time, labour, etc. are involved.
A compromise between these consequences is struck to select the “best regression equation”.
• What is “best”?
• There are a number of ways to choose the “best”; they will not all yield the same results.
• What about the other potential problems with the model that might have been ignored
while selecting the “best” model?
The problem of variable selection is addressed assuming that the functional form of each explanatory variable, such as $x^2$, $\log x$, $1/x$, etc., is known, and that no outliers or influential observations are present in the data.
Various statistical tools, such as residual analysis, identification of influential or high-leverage observations, and model adequacy checks, are linked to variable selection. In principle, all these issues should be addressed simultaneously.
Usually, these steps are employed iteratively. In the first step, a strategy for variable selection is adopted and the model is fitted with the selected variables. The fitted model is then checked for functional form, outliers, influential observations, etc. Based on the outcome, the model is re-examined and the selection of variables is reviewed again.
Data Collection
The data collection requirements for building a regression model vary with the nature of the
study.
1. Controlled Experiments
The researcher controls the treatments by assigning them to experimental units and observes the response.
For example, a researcher studied the effects of the size of a graphic presentation and of the time allowed for analysis on the accuracy with which the analysis of the presentation is carried out. The response variable is a measure of the accuracy of the analysis, and the explanatory variables are the size of the graphic presentation and the time allowed.
A treatment consisted of a particular combination of size of presentation and length of
time allowed.
Data Preparation
Once the data have been collected, edit them and run checks to identify gross errors as well as extreme outliers.
Difficulties with data errors are especially prevalent in large data sets and should be corrected
or resolved before the model building begins.
• A regression model with many explanatory variables may be difficult to maintain, while a regression model with a limited number of explanatory variables is easier to work with and understand.
• The presence of many highly inter-correlated explanatory variables may increase the sampling variation of the regression coefficients, increase problems of round-off errors, and fail to improve, or even worsen, the model’s predictive ability.
• An actual worsening of the model’s predictive ability can occur when explanatory variables that are not related to the response variable are kept in the regression model.
• Once one has tentatively decided the functional form of the regression relations (linear, quadratic, etc., and whether interaction terms are needed), the next step is to identify a few “good” subsets of X variables, which may include first-order, quadratic or other curvature terms, or interaction terms.
Suppose there are $K$ candidate regressors $x_1, x_2, \ldots, x_K$ and $n \geq K + 1$ observations. The full model is

$$y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \varepsilon_i$$
We will assume:
• The list of candidate regressors includes all the important variables (no unmeasured confounders).
• The intercept, $\beta_0$, will always be in the model.
Suppose we delete $r$ regressors from the model and retain $p = K - r$ ($p + 1$ total terms including the intercept). The model may be rewritten as

$$y = X_{p+1}\beta_{p+1} + X_r\beta_r + \varepsilon$$

where the $X$ matrix has been partitioned into $X_{p+1}$ and $X_r$, and $\beta$ has been partitioned into $\beta_{p+1}$ and $\beta_r$.
For the full model, the least squares estimator is $\hat{\beta}^* = (X'X)^{-1}X'y$, with residual variance estimate

$$\hat{\sigma}^{*2} = \frac{y'\left(I - X(X'X)^{-1}X'\right)y}{n - K - 1} \qquad (8.4)$$

For the subset model, the corresponding estimates are

$$\hat{\beta}_{p+1} = \left(X_{p+1}'X_{p+1}\right)^{-1}X_{p+1}'y \qquad (8.6)$$

$$\hat{\sigma}^2 = \frac{y'\left(I - X_{p+1}\left(X_{p+1}'X_{p+1}\right)^{-1}X_{p+1}'\right)y}{n - p - 1} \qquad (8.7)$$

2. $Var\left(\hat{\beta}_{p+1}\right) = \sigma^2\left(X_{p+1}'X_{p+1}\right)^{-1}$ and $Var\left(\hat{\beta}^*\right) = \sigma^2(X'X)^{-1}$.
3. The estimate $\hat{\sigma}^{*2}$ from the full model is an unbiased estimate of $\sigma^2$. For the subset model,

$$E\left(\hat{\sigma}^2\right) = \sigma^2 + \frac{\beta_r' X_r'\left(I - X_{p+1}\left(X_{p+1}'X_{p+1}\right)^{-1}X_{p+1}'\right)X_r\beta_r}{n - p - 1}$$

so $\hat{\sigma}^2$ is biased upward unless $\beta_r = 0$.
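The full-model and subset-model estimates in (8.4), (8.6), and (8.7) can be sketched in a few lines of numpy. This is an illustrative example on simulated data; the sample size, coefficients, and the choice of which regressor to drop are assumptions for the sketch, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 40, 3
# design matrix: intercept plus K simulated candidate regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
beta_true = np.array([1.0, 2.0, -1.5, 0.0])   # last regressor is irrelevant
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def sigma2_hat(X, y):
    """y'(I - X(X'X)^{-1}X')y / (n - #columns), as in (8.4) and (8.7)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    r = (np.eye(len(y)) - H) @ y              # residual vector
    return y @ r / (len(y) - X.shape[1])

sigma2_full = sigma2_hat(X, y)                # full model: divisor n - K - 1
X_sub = X[:, :3]                              # retain p = 2 regressors (r = 1 deleted)
sigma2_sub = sigma2_hat(X_sub, y)             # subset model: divisor n - p - 1
beta_sub = np.linalg.solve(X_sub.T @ X_sub, X_sub.T @ y)   # (8.6)
```

Because the deleted regressor's true coefficient is zero here, the subset estimate stays close to the full-model estimate; when a relevant regressor is deleted, the bias term in the expectation of the subset estimate pushes it upward.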
Let $R_p^2$ denote the $R^2$ for a subset regression model with $p$ parameters (i.e. $p - 1$ X variables plus the intercept). Using the $R_p^2$ criterion is equivalent to using $SSE_p$ as the criterion: subsets for which $SSE_p$ is small are considered good.
$$R_p^2 = 1 - \frac{SSE_p}{SSTO} \qquad (8.8)$$
Since $R_p^2$ does not take account of the number of parameters in the regression model, and since $R_p^2$ can never decrease as $p$ increases, the adjusted coefficient of multiple determination $R_{a,p}^2$ is often used instead:

$$R_{a,p}^2 = 1 - \frac{MSE_p}{SSTO/(n-1)} \qquad (8.9)$$
Mallows' $C_p$ criterion is

$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_K)} - (n - 2p) \qquad (8.10)$$

where $SSE_p$ is the error sum of squares for the fitted subset regression model with $p$ parameters (i.e. with $p - 1$ X variables), and the denominator is the MSE of the full model containing all $K$ candidate X variables.
When there is no bias in the regression model with $p - 1$ X variables, so that $E(\hat{Y}_i) \equiv \mu_i$, the expected value of $C_p$ is approximately $p$: $E(C_p) \approx p$.
In using the $C_p$ criterion, we seek to identify subsets of X variables for which (1) the $C_p$ value is small and (2) the $C_p$ value is near $p$. Subsets with small $C_p$ values have a small total mean squared error, and when the $C_p$ value is also near $p$, the bias of the regression model is small.
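The criteria above can be sketched numerically. The following minimal numpy example uses simulated data (all sizes and coefficients are illustrative assumptions, not from the text): for each candidate subset it computes $R_p^2$ (8.8), $R_{a,p}^2$ (8.9), and $C_p$ (8.10), where $p$ counts parameters including the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# three simulated candidate regressors; only the first two matter
X_all = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X_all[:, 0] - 1.0 * X_all[:, 1] + rng.normal(scale=0.7, size=n)

def sse(cols):
    """Error sum of squares for the subset model using the given columns."""
    X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

ssto = np.sum((y - y.mean()) ** 2)            # total sum of squares
mse_full = sse((0, 1, 2)) / (n - 4)           # MSE of the full model (4 parameters)

for cols in [(0,), (1,), (0, 1), (0, 1, 2)]:
    p = len(cols) + 1                         # parameters, including intercept
    sse_p = sse(cols)
    r2 = 1 - sse_p / ssto                               # (8.8)
    r2_adj = 1 - (sse_p / (n - p)) / (ssto / (n - 1))   # (8.9)
    cp = sse_p / mse_full - (n - 2 * p)                 # (8.10)
    print(cols, round(r2, 3), round(r2_adj, 3), round(cp, 2))
```

For the full model, $C_p$ equals the number of parameters exactly by construction; biased subsets (here, any subset omitting column 0 or 1) tend to have $C_p$ far above $p$.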
The $PRESS_p$ (prediction sum of squares) criterion is a measure of how well the fitted values for a subset model can predict the observed responses $Y_i$. The error sum of squares, $SSE = \sum \left(Y_i - \hat{Y}_i\right)^2$, is also such a measure.
The PRESS measure differs from SSE in that each fitted value $\hat{Y}_i$ for the PRESS criterion is obtained by deleting the $i$-th case from the data set, estimating the regression function from the remaining $n - 1$ cases, and then using that fitted function to predict the $i$-th response.

Another criterion is the mean squared error of the subset model,

$$MSE_p = \frac{SSE_p}{n - p} \qquad (8.14)$$
So $MSE_p$ can be used as a criterion for model selection, like $SSE_p$. $SSE_p$ decreases as $p$ increases, and similarly, as $p$ increases, $MSE_p$ initially decreases, then stabilizes, and finally may increase if the reduction in $SSE_p$ is not sufficient to compensate for the loss of one degree of freedom in the divisor $(n - p)$.
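Although PRESS sounds expensive (it seems to require $n$ refits), for least squares each deleted residual can be recovered from a single fit via the identity $e_{i(i)} = e_i / (1 - h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of the hat matrix. A hedged sketch on simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
# simulated design: intercept plus two regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 1.5, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # ordinary residuals
sse_p = e @ e
mse_p = sse_p / (n - X.shape[1])        # (8.14), with p = number of parameters
# deleted residual identity: e_{i(i)} = e_i / (1 - h_ii)
press = np.sum((e / (1 - np.diag(H))) ** 2)
```

PRESS always exceeds SSE, since $|e_i/(1 - h_{ii})| \geq |e_i|$; a PRESS value much larger than SSE flags a model whose fit depends heavily on individual cases.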
Example 1: A hospital surgical unit was interested in predicting survival in patients undergoing
a particular type of liver operation. A random selection of 108 patients was available for
analysis. From each patient record, the following information was extracted from the
preoperation evaluation:
• $R_{a,p}^2 = 0.823$: seven- or eight-parameter model
• min $C_p$: $C_7 = 5.541$
• min $AIC_p$: $AIC_7 = -163.834$
• min $SBC_p$: $SBC_5 = -153.406$
• min $PRESS_p$: $PRESS_5 = 2.738$
− When the pool of potential X variables is very large, say greater than 30, an all-subsets search may require excessive processing time.
− Under these conditions, one of the stepwise regression procedures is suggested to assist in the selection of X variables.
− A remedy is to explore and identify other candidate models with approximately the same number of explanatory variables as the model identified by the automatic procedure.
Step 1: Fit a simple linear regression model for each of the potential X variables. For each simple linear regression model, test whether or not the slope equals zero using

$$t_k^* = \frac{\hat{\beta}_k}{s\left(\hat{\beta}_k\right)} \qquad (8.15)$$

The X variable with the largest $t_k^*$ value is the candidate for first addition. If this $t_k^*$ value exceeds a predetermined level, the X variable is added. Otherwise, the program terminates with no X variable considered to enter the regression model.
Step 2: Assume $X_7$ is the variable entered at Step 1. The stepwise regression routine now fits all regression models with two X variables, where $X_7$ is one of the pair. For each such regression model, the $t_k^*$ statistic corresponding to the newly added predictor $X_k$ is obtained. The X variable with the largest $t_k^*$ value (smallest p-value) is the candidate for addition at the second stage. If this $t_k^*$ value exceeds a predetermined level, the second X variable is added. Otherwise, the program terminates.
Step 3: Suppose $X_3$ is added at the second stage. The stepwise regression routine now examines whether any of the other variables already in the model should be dropped. At this step there is only one other variable in the model, $X_7$, so only one $t^*$ test statistic is obtained:

$$t_7^* = \frac{\hat{\beta}_7}{s\left(\hat{\beta}_7\right)} \qquad (8.16)$$

At later stages, there would be a number of these $t^*$ statistics, one for each of the variables in the model besides the one last added. The variable for which this value is smallest is the candidate for deletion.
NOTE: The maximum acceptable α limit for adding a variable is 0.10, and the minimum acceptable α limit for removing a variable is 0.15.
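Steps 1–3 can be sketched as a small forward stepwise routine. This is an illustrative implementation on simulated data, not the text's software: the entry limit 0.10 and removal limit 0.15 follow the NOTE above, and the sample size, number of candidates, and coefficients are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, K = 60, 4
X_all = rng.normal(size=(n, K))
# true model uses regressors 0 and 2 only
y = 2.0 * X_all[:, 0] - 1.0 * X_all[:, 2] + rng.normal(size=n)

def t_stats(cols):
    """t* = beta_hat_k / s(beta_hat_k), as in (8.15), for each regressor in cols."""
    X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    mse = np.sum((y - X @ beta) ** 2) / (n - X.shape[1])
    se = np.sqrt(mse * np.diag(XtX_inv))
    return (beta / se)[1:], n - X.shape[1]   # skip the intercept

selected = []
while True:
    candidates = [j for j in range(K) if j not in selected]
    if not candidates:
        break
    # try each candidate; record |t*| of the newly added variable
    trials = []
    for j in candidates:
        t, df = t_stats(selected + [j])
        trials.append((abs(t[-1]), j, df))
    best_t, best_j, df = max(trials)
    p_value = 2 * stats.t.sf(best_t, df)
    if p_value > 0.10:                       # entry limit
        break
    selected.append(best_j)
    # backward check: drop any entered variable whose p-value exceeds 0.15
    t, df = t_stats(selected)
    p_vals = 2 * stats.t.sf(np.abs(t), df)
    selected = [j for j, p in zip(selected, p_vals) if p <= 0.15]
```

On this simulated data the two truly relevant regressors enter; whether a noise regressor also slips in depends on the sample, which is exactly the behaviour the α limits are meant to control.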
Example 2: The forward stepwise regression procedure for the surgical unit example.
$$MSPR = \frac{\sum_{i=1}^{n^*}\left(Y_i - \hat{Y}_i\right)^2}{n^*} \qquad (8.17)$$

where $Y_i$ is the observed response for the $i$-th validation case, $\hat{Y}_i$ is its predicted value based on the model-building data set, and $n^*$ is the number of cases in the validation data set.
• If the mean squared prediction error MSPR is fairly close to MSE based on the regression
fit to the model-building data set, then the error mean square MSE for the selected
regression model is not seriously biased and gives an appropriate indication of the
predictive ability of the model.
• If the mean squared prediction error is much larger than MSE, one should rely on the mean
squared prediction error as an indicator of how well the selected regression model will
predict in the future.
• If a replication study is carried out under conditions that differ slightly or substantially, and the regression results are still similar, this indicates that the regression results can be generalized to apply under substantially varying conditions.
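The MSPR check in (8.17) amounts to scoring the frozen model on held-out cases and comparing with the training-set MSE. A minimal simulated sketch (sizes and coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_star = 80, 40
# simulated model-building and validation designs (intercept + 2 regressors)
X_train = np.column_stack([np.ones(n_train), rng.normal(size=(n_train, 2))])
X_new = np.column_stack([np.ones(n_star), rng.normal(size=(n_star, 2))])
beta_true = np.array([1.0, 0.8, -0.6])
y_train = X_train @ beta_true + rng.normal(scale=0.5, size=n_train)
y_new = X_new @ beta_true + rng.normal(scale=0.5, size=n_star)

beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]   # fit on training data only
mse = np.sum((y_train - X_train @ beta) ** 2) / (n_train - 3)
mspr = np.sum((y_new - X_new @ beta) ** 2) / n_star       # (8.17)
```

Here the model is correctly specified, so MSPR lands close to MSE; a much larger MSPR would signal that MSE understates the true prediction error.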
Data Splitting
• The preferred method to validate a regression model is through the collection of new data; often, however, this is neither practical nor feasible.
• If the data set is large enough, an alternative is to split the data into two sets. The first set, called the model-building set or training sample, is used to develop the model. The second set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model.
• The validation data set is used for validation in the same way as when new data are collected.
• The regression coefficients can be re-estimated for the selected model and then compared
for consistency with the coefficients obtained from the model-building data set.
• When splitting the data, it is important that the model-building data set be the larger one, so that a reliable model can be developed.
• If the model-building data set is reasonably large, the variances of the estimated regression coefficients generally will not be much larger than those based on the entire data set.
• Once the model has been validated, it is customary practice to use the entire data set for
estimating the final regression model.
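The splitting-and-revalidation workflow above can be sketched as follows. The 80/40 split and the coefficients are illustrative assumptions; the point is the pattern: fit on the model-building half, re-estimate on the validation half, compare for consistency, then fit the final model on all the data.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
# simulated design: intercept plus two regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.2, -0.8]) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)
train, valid = idx[:80], idx[80:]          # larger model-building set

beta_train = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
beta_valid = np.linalg.lstsq(X[valid], y[valid], rcond=None)[0]
# coefficients from the two halves should agree up to sampling variation;
# after validation, the final model is fitted on the entire data set
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

If `beta_train` and `beta_valid` disagree by much more than their standard errors, the selected model has not validated and the selection step should be revisited.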