Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology
Vellore Campus, Vellore - 632 014
India
Types
1 Omission/exclusion of relevant variables.
2 Inclusion of irrelevant variables.
Consequences
Biased parameter estimates.
Incomplete understanding of the relationship between variables.
Noise in the model.
Let there be k candidate explanatory variables out of which suppose r variables are included
and (k − r ) variables are to be deleted from the model. We can partition X and β as:
X = [X1  X2]  and  β = (β1ᵀ, β2ᵀ)ᵀ
where:
X1 : Matrix of explanatory variables to be included (size: n × r )
X2 : Matrix of explanatory variables to be deleted (size: n × (k − r ))
β1 : Coefficients corresponding to variables in X1 (size: r × 1)
β2 : Coefficients corresponding to variables in X2 (size: (k − r ) × 1)
y = X1 β1 + X2 β2 + ε
y : Response variable
X1 β1 : Contribution of the included variables
X2 β2 : Contribution of the deleted variables
ε: Error term
After dropping the (k − r) explanatory variables, the new model becomes:
y = X1 β1 + δ
where:
X1 : Matrix of explanatory variables to be included
δ: Error term
The OLS estimator for β1 , denoted as β̂1 , is given by:
β̂1 = (X1ᵀX1)⁻¹X1ᵀy
Its expectation is E(β̂1) = β1 + (X1ᵀX1)⁻¹X1ᵀX2β2, so β̂1 is biased unless X1ᵀX2 = 0 or β2 = 0.
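A quick numeric check of this omitted-variable bias (a minimal numpy sketch; the coefficient values and variable names are illustrative, not from the source). Regressing y on x1 alone, when the data-generating process also involves a correlated x2, shifts the estimated slope away from its true value:

```python
import numpy as np

# Omitted-variable bias: regress y on x1 only, when the true model also
# involves x2 and x2 is correlated with x1.
rng = np.random.default_rng(6)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)        # correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])    # subset model: intercept + x1
b = np.linalg.lstsq(X1, y, rcond=None)[0]
print(b[1])   # close to 2.6 = 1.0 + 0.8*2.0, not the true coefficient 1.0
```

The bias 0.8 × 2.0 is exactly (X1ᵀX1)⁻¹X1ᵀX2β2 in expectation for this design.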
Sometimes, due to enthusiasm and an attempt to make the model more realistic, analysts may
include explanatory variables that are not very relevant to the model. Such variables may
contribute very little to the explanatory power of the model. This can lead to the following
consequences:
Reduction in degrees of freedom (n − k).
Questionable validity of inference drawn from the model.
Increase in the coefficient of determination (R 2 ), falsely indicating improvement in the
model.
Consider the true model y = Xβ + ε, which comprises k explanatory variables. Now, suppose r additional explanatory variables are
added to the model, resulting in a false model:
y = Xβ + Zγ + δ
where:
Z is an n × r matrix of n observations on each of the r additional explanatory variables.
γ is a r × 1 vector of regression coefficients associated with Z .
δ is the disturbance term.
In regression analysis, the OLS estimate of the regression coefficient remains unbiased even when irrelevant variables are included in the model, but the variances of the estimates are inflated.
Dr. Jisha Francis Module 5 9 / 38
Comparison: Exclusion vs. Inclusion of Variables
R² = SSreg / SST = 1 − SSres / SST
where SSreg , SSres and SST are the sums of squares due to regression, residuals, and total,
respectively, in a subset model based on p − 1 explanatory variables.
Selection of Subset of Explanatory Variables using R 2
Since there are k explanatory variables available and we select only (p − 1) of them, there
are (k choose p − 1) possible subsets. Each such choice produces one subset model.
Moreover, the coefficient of determination (R²) tends to increase as p increases.
To proceed:
1 Choose an appropriate value of p, fit the model and obtain R 2 (R12 ).
2 Add one variable, fit the model and again obtain R 2 (R22 ).
3 Necessarily, R₂² ≥ R₁², since adding a variable cannot decrease R².
4 If R₂² − R₁² is small, then stop and choose that value of p for the subset regression.
5 If R₂² − R₁² is large, keep adding variables until an additional variable no longer
produces a large change in R², i.e., until the increment in R² becomes small.
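The stopping rule above can be sketched in a few lines of numpy (illustrative data; the helper `r_squared` is my name, not from the source):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (intercept column included in X)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # an irrelevant candidate variable
y = 2 + 3 * x1 + rng.normal(size=n)

ones = np.ones(n)
R2_1 = r_squared(np.column_stack([ones, x1]), y)
R2_2 = r_squared(np.column_stack([ones, x1, x2]), y)
print(R2_2 - R2_1)   # small increment -> stop adding variables
```

Because R² never decreases when a column is added, only the *size* of the increment carries information.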
Adjusted Coefficient of Determination
The adjusted coefficient of determination (R²adj) has certain advantages over the usual
coefficient of determination.
The adjusted coefficient of determination based on a p-term model is given by:
R²adj = 1 − ((n − 1) / (n − p)) (1 − R²p)
If r more explanatory variables are added to a p-term model, then R²adj increases if
and only if the partial F-statistic for testing the significance of the r additional explanatory
variables exceeds 1.
Subset selection based on R²adj can be carried out in the same way as with R².
The value of p corresponding to the maximum value of R²adj is chosen for the subset model.
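A hedged sketch of the adjustment (the helper name and the numbers are illustrative, not from the source). Unlike R², the adjusted value can fall when a nearly useless variable is added, because the tiny rise in R² does not offset the lost degree of freedom:

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a p-term model fitted on n observations."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

n = 30
print(adjusted_r2(0.800, n, p=3))
print(adjusted_r2(0.801, n, p=4))   # smaller than the p=3 value
```

This is the numerical face of the partial-F condition above: an added variable helps R²adj only when its partial F exceeds 1.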
A model is said to have a better fit if residuals are small. This is reflected in the sum of
squares due to residuals (SSres ). A model with smaller SSres is preferable.
Based on this, the residual mean square (MSres ) based on a p-variable subset regression model
is defined as:
MSres (p) = SSres (p) / (n − p)
So MSres (p) can be used as a model-selection criterion like SSres (p). SSres (p) decreases as
p increases, whereas MSres (p) initially decreases, then stabilizes, and finally may increase
when the reduction in SSres (p) from adding a variable is no longer sufficient to compensate
for the loss of one degree of freedom in the divisor n − p.
Each of the following criteria compares the subset model y = X1 β1 + δ with the full model
y = X1 β1 + X2 β2 + ε = Xβ + ε.
The AIC for a linear regression model with k parameters is defined as:
AIC = n ln(SSres /n) + 2k
A model with a smaller value of AIC is preferable.
Similar to AIC, the Bayesian Information Criterion (BIC) is based on maximizing the posterior
distribution of the model given the observations y . In the case of a linear regression model, it
is defined as:
BIC = n ln(SSres /n) + k ln(n)
where SSres is the sum of squared residuals, k is the number of parameters in the model, and n
is the sample size.
A model with a smaller value of BIC is preferable, as it indicates a better balance between
model fit and complexity.
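Both criteria can be computed directly from SSres; a small sketch (function name and numbers are illustrative, using the n·ln(SSres/n) forms given above). Because BIC's penalty k·ln(n) grows with n while AIC's stays at 2k, the two can disagree on a marginal variable:

```python
import numpy as np

def aic_bic(ss_res, n, k):
    """AIC and BIC of a linear model with k parameters; smaller is better."""
    aic = n * np.log(ss_res / n) + 2 * k
    bic = n * np.log(ss_res / n) + k * np.log(n)
    return aic, bic

# A fourth parameter buys only a tiny drop in SSres (12.5 -> 12.2):
a3, b3 = aic_bic(ss_res=12.5, n=100, k=3)
a4, b4 = aic_bic(ss_res=12.2, n=100, k=4)
print(a4 < a3, b4 < b3)   # AIC accepts the extra variable, BIC rejects it
```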
Since the residuals and residual sum of squares act as a criterion for subset model selection,
similarly, the Prediction Error Sum of Squares (PRESS) can also be used for subset model
selection.
The PRESS statistic based on a subset model with p explanatory variables is given by:
PRESSp = Σ_{i=1}^{n} (yᵢ − ŷ₍ᵢ₎)² = Σ_{i=1}^{n} (eᵢ / (1 − hᵢᵢ))²
where ŷ(i) is the predicted value of yi obtained from the model when the ith observation is
excluded from the fitting process, and hii is the ith diagonal element of the hat matrix
H = X(XᵀX)⁻¹Xᵀ.
A subset regression model with a smaller value of PRESS is preferable.
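The hat-matrix identity above means PRESS needs no refitting; a numpy sketch with illustrative data (the helper name is mine):

```python
import numpy as np

def press(X, y):
    """PRESS: sum of squared leave-one-out prediction errors, obtained
    from the ordinary residuals via e_i / (1 - h_ii), without n refits."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                       # ordinary residuals
    h = np.diag(H)
    return np.sum((e / (1 - h)) ** 2)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=40)
print(press(X, y))   # at least as large as the residual sum of squares
```

Since 0 < hᵢᵢ < 1, each leave-one-out error eᵢ/(1 − hᵢᵢ) is at least as large in magnitude as eᵢ, so PRESS ≥ SSres.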
Computational Techniques for Variable Selection
In order to select a subset model, several techniques based on computational procedures and
algorithms are available. They are essentially based on two ideas:
1 select all possible explanatory variables
2 select the explanatory variables stepwise.
Stepwise Selection:
This methodology is based on choosing the explanatory variables in the subset model in steps,
which can be either adding one variable at a time or deleting one variable at a time. Based on
this, there are three procedures:
Forward Selection
Backward Elimination
Stepwise Selection
Forward Selection:
This methodology assumes that there is no explanatory variable in the model except an
intercept term. It adds variables one by one and tests the fitted model at each step using some
suitable criterion. It has the following steps:
1 Consider only the intercept term and insert one variable at a time.
2 Calculate the simple correlations of xi with y , i = 1, 2, . . . , k.
3 Choose xi which has the largest correlation with y .
4 Suppose x1 is the variable which has the highest correlation with y . Since the F-statistic
given by
F0 = (R² / (1 − R²)) · ((n − k) / (k − 1))
so x1 will produce the largest value of F in testing the significance of a regression.
5 Choose a prespecified cut-off value of F, say FIN (F-to-enter).
6 If F > FIN , then accept x1 and so x1 enters into the model.
7 Adjust the effect of x1 on y and re-compute the correlations of remaining xi with y and
obtain the partial correlations.
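Steps 1–7 can be sketched as a greedy loop (a simplified illustration; the function name, the FIN value, and the data are mine, not from the source). At each step the candidate with the largest partial F enters, until no candidate clears FIN:

```python
import numpy as np

def forward_select(X, y, F_in=8.0):
    """Greedy forward selection: add the candidate column with the largest
    partial F-statistic; stop when the best candidate falls below F_in."""
    n, k = X.shape
    selected, remaining = [], list(range(k))

    def ss_res(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ beta
        return r @ r

    current = ss_res(selected)
    while remaining:
        scores = {}
        for j in remaining:
            new = ss_res(selected + [j])
            df = n - (len(selected) + 2)   # params: intercept + selected + j
            scores[j] = (current - new) / (new / df)   # partial F, 1 df
        best = max(scores, key=scores.get)
        if scores[best] < F_in:
            break
        selected.append(best)
        remaining.remove(best)
        current = ss_res(selected)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=80)
print(forward_select(X, y))   # typically columns 0 and 2 enter
```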
Backward Elimination:
This methodology is contrary to the forward selection procedure. The forward selection
procedure starts with no explanatory variable in the model and keeps on adding one variable at
a time until a suitable model is obtained. The backward elimination methodology begins with
all explanatory variables and keeps on deleting one variable at a time until a suitable model is
obtained. It is based on the following steps:
1 Consider all k explanatory variables and fit the model.
2 Compute partial F -statistics for each explanatory variable as if it were the last variable to
enter the model.
3 An explanatory variable that was added at an earlier step may now become insignificant
due to its relationship with currently present explanatory variables in the model.
4 If the partial F -statistic for an explanatory variable is smaller than FOUT , then this
variable is deleted from the model.
5 Stepwise regression needs two cut-off values, FIN and FOUT . The choice of these values
is crucial. Sometimes FIN = FOUT or FIN > FOUT is used. Using FIN > FOUT makes it
relatively more difficult to add an explanatory variable than to delete one.
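The backward elimination steps can be sketched similarly (an illustrative helper with my own FOUT value and data, not the source's): start from all variables and repeatedly drop the one whose partial F, computed as if it entered last, is smallest:

```python
import numpy as np

def backward_eliminate(X, y, F_out=4.0):
    """Backward elimination: drop the column with the smallest partial
    F-statistic until every remaining column's F exceeds F_out."""
    n, k = X.shape
    cols = list(range(k))

    def ss_res(c):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in c])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ beta
        return r @ r

    while cols:
        full = ss_res(cols)
        df = n - (len(cols) + 1)          # params: intercept + cols
        # partial F of each variable as if it were the last one to enter
        F = {j: (ss_res([c for c in cols if c != j]) - full) / (full / df)
             for j in cols}
        worst = min(F, key=F.get)
        if F[worst] >= F_out:
            break
        cols.remove(worst)
    return cols

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4))
y = 3 * X[:, 1] + rng.normal(size=80)
print(backward_eliminate(X, y))   # the truly relevant column 1 survives
```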
1 None of the methods among the forward selection, backward elimination, or stepwise
regression guarantees the best subset model.
2 The order in which explanatory variables enter or leave the model does not indicate
their order of importance.
3 In forward selection, no explanatory variable can be removed if entered in the model.
Similarly, in backward elimination, no explanatory variable can be added if removed from
the model.
4 All procedures may lead to different models. Different model selection criteria may give
different subset models.
The location of observations in x-space can play an important role in determining the
regression coefficients.
Consider a situation like in the following figure:
The point A in this figure is remote in x-space from the rest of the sample but it lies
almost on the regression line passing through the rest of the sample points. This is a
leverage point.
This point does not affect the estimates of the regression coefficients.
It affects the model summary statistics, e.g., R 2 , standard errors of regression coefficients,
etc.
A point with a moderately unusual x-coordinate whose y-value is also unusual is an
influence point.
It has a noticeable impact on the model coefficients.
It pulls the regression model in its direction.
Leverage:
The location of points in x-space affects the model properties like parameter estimates,
standard errors, predicted values, summary statistics, etc.
The hat matrix H = X (X T X )−1 X T plays an important role in identifying influential
observations.
The hat matrix diagonal is a standardized measure of the distance of an observation from the
center of the x-space. Large hat diagonals reveal observations that are potentially influential
because they are remote in x-space from the rest of the sample.
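A short numpy illustration of hat diagonals flagging a remote point (the data are mine, and the 2p/n cut-off is a common rule of thumb rather than something stated in the source):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20)
x[0] = 8.0                           # one point remote in x-space
X = np.column_stack([np.ones(20), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                       # hat diagonals; they sum to p = trace(H)
p = X.shape[1]
print(np.where(h > 2 * p / 20)[0])   # rule of thumb flags the remote point
```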
Influential Point:
Not all leverage points are influential on the regression coefficients.
Hat diagonal examines only the location of observations in x-space, so we can look at the
studentized residual or R-student in conjunction with the hat diagonal.
Observations with a large hat diagonal and large residuals are likely to be influential.
1 Cook’s D-statistic: Cook’s distance measure is a deletion diagnostic, i.e., it measures the
influence of the ith observation as if it were removed from the sample.
2 DFFITS: measures the deletion influence of the ith observation on the predicted (fitted) value.
3 DFBETAS: indicates how much the jth regression coefficient changes if the ith
observation were deleted. A large (in magnitude) value of DFBETASj,i indicates that the
ith observation has considerable influence on the jth regression coefficient.
4 If the data point is an outlier, then R-student will be large in magnitude.
5 If the data point has high leverage, then hii will be close to unity.
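These diagnostics can all be computed from a single OLS fit using standard formulas (the helper name and the planted data point are illustrative, not from the source):

```python
import numpy as np

def deletion_diagnostics(X, y):
    """Cook's D and DFFITS for each observation, from one OLS fit."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse = e @ e / (n - p)
    r = e / np.sqrt(mse * (1 - h))               # studentized residuals
    cooks_d = r ** 2 * h / (p * (1 - h))
    # externally studentized (R-student) residuals for DFFITS
    s2_i = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))
    dffits = t * np.sqrt(h / (1 - h))
    return cooks_d, dffits

rng = np.random.default_rng(4)
x = rng.normal(size=25)
y = 1 + 2 * x + rng.normal(size=25) * 0.5
x[0], y[0] = 4.0, -5.0                           # plant an influential outlier
X = np.column_stack([np.ones(25), x])
D, F = deletion_diagnostics(X, y)
print(np.argmax(D), np.argmax(np.abs(F)))        # the planted point dominates
```

The planted point has both a remote x-coordinate (high leverage) and an unusual y-value (outlier), which is exactly the combination flagged above.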