
Regression Analysis

Regression Analysis: Variable Selection and Model Building

Dr. Jisha Francis

Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology
Vellore Campus, Vellore - 632 014
India



Learning Goals



Introduction

Regression analysis relies on selecting correct and important explanatory variables.


Analysts start with a pool of potential variables, but only a subset is chosen for the model.
Two broad options: include many variables for realism, or only a few for simplicity.
The trade-offs involved: realism vs. simplicity, accuracy of prediction vs. variance of the
predictions, and the cost of collecting data on additional variables.



Variable Selection Process

1 Ensure correct functional form of the model.


2 Choose a subset of explanatory variables from the pool.
3 Iteratively employ statistical tools: residual analysis, outlier identification, model
adequacy.
4 Fit the model, check for functional form, outliers, influential observations, etc.
5 Review and adjust variable selection based on outcomes.
6 Iterate until a satisfactory model is achieved.



Incorrect Model Specifications

Types
1 Omission/exclusion of relevant variables.
2 Inclusion of irrelevant variables.

Consequences
Biased parameter estimates.
Incomplete understanding of the relationship between variables.
Noise in the model.



Omission/exclusion of relevant variables

Let there be k candidate explanatory variables out of which suppose r variables are included
and (k − r ) variables are to be deleted from the model. We can partition X and β as:
X = [X1  X2]   and   β = (β1T, β2T)T

where:
X1 : Matrix of explanatory variables to be included (size: n × r )
X2 : Matrix of explanatory variables to be deleted (size: n × (k − r ))
β1 : Coefficients corresponding to variables in X1 (size: r × 1)
β2 : Coefficients corresponding to variables in X2 (size: (k − r ) × 1)



Full Model and Misspecified Model
The full model or true model is expressed as:

y = X1 β1 + X2 β2 + ε

y : Response variable
X1 β1 : Contribution of the included variables
X2 β2 : Contribution of the deleted variables
ε: Error term
After dropping the (k − r) explanatory variables in X2, the misspecified model becomes:

y = X1 β1 + δ



OLS Estimation for False Model

Given the false model:


y = X1 β1 + δ

where:
X1 : Matrix of explanatory variables to be included
δ: Error term
The OLS estimator of β1 , denoted as β̂1 , is given by:

β̂1 = (X1T X1 )−1 X1T y

Under the true model, E(β̂1 ) = β1 + (X1T X1 )−1 X1T X2 β2 , so β̂1 is biased in general
(unbiased only when X1T X2 β2 = 0). The efficiency of estimation also generally declines.
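As a quick illustration of this result (a minimal simulation sketch, not part of the original slides; the data-generating values are arbitrary), the code below repeatedly fits the misspecified model that omits a relevant regressor x2 correlated with x1, and the average of β̂1 drifts away from the true β1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
beta1, beta2 = 2.0, 3.0

est = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # x2 is correlated with x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])           # misspecified design: x2 omitted
    est[r] = np.linalg.lstsq(X1, y, rcond=None)[0][1]

print("true beta1:", beta1)
print("average beta1_hat under omission:", round(est.mean(), 3))  # systematically too large here
```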



Inclusion of Irrelevant Variables

Sometimes, due to enthusiasm and an attempt to make the model more realistic, analysts may
include explanatory variables that are not very relevant to the model. Such variables may
contribute very little to the explanatory power of the model. This can lead to the following
consequences:
Reduction in degrees of freedom (n − k).
Questionable validity of inference drawn from the model.
Increase in the coefficient of determination (R²), falsely indicating an improvement in the
model (a small sketch illustrating this follows below).
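The following minimal sketch (made-up data, with pure-noise columns standing in for the irrelevant regressors) shows that R² never decreases when such variables are added, even though they carry no information about y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)           # true model uses only x

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

X_true = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_true, rng.normal(size=(n, 5))])   # add 5 irrelevant columns

print("R^2, correct model    :", round(r_squared(X_true, y), 4))
print("R^2, + 5 junk columns :", round(r_squared(X_big, y), 4))   # at least as large
```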



Example: True Model vs. False Model
Let the true model be:
y = Xβ + ε

which comprises k explanatory variables. Now, suppose r additional explanatory variables are
added to the model, resulting in a false model:

y = Xβ + Zγ + δ

where:
Z is an n × r matrix of n observations on each of the r additional explanatory variables.
γ is an r × 1 vector of regression coefficients associated with Z .
δ is the disturbance term.
In regression analysis, the estimates of the regression coefficients remain unbiased even when
irrelevant variables are included in the model; however, the efficiency of the estimates declines,
as summarized in the comparison below.
Comparison: Exclusion vs. Inclusion of Variables

Aspect                                      Exclusion type                   Inclusion type
Estimation of coefficients                  Biased                           Unbiased
Efficiency                                  Generally declines               Declines
Estimation of the disturbance term          Over-estimated                   Unbiased
Conventional tests of hypothesis
and confidence regions                      Invalid and faulty inferences    Valid though erroneous



Evaluation of Subset Regression Models
After selecting subsets of candidate variables for the model, the question arises: how to judge
which subset yields a better regression model? Various criteria have been proposed in the
literature to evaluate and compare subset regression models:
1 Coefficient of Determination (R²):
The R² is the square of the multiple correlation coefficient between the study variable y and
the set of explanatory variables X1 , X2 , . . . , Xp−1 .
Note: An intercept term is needed in the model (X0 = 1); without it, R² cannot be used.
The R² based on (p − 1) explanatory variables and one intercept term is calculated as:

R² = SSreg / SST = 1 − SSres / SST

where SSreg , SSres and SST are the sums of squares due to regression, residuals and total,
respectively, in a subset model based on (p − 1) explanatory variables.
Selection of Subset of Explanatory Variables using R²
Since there are k candidate explanatory variables and only (p − 1) of them are selected, there
are C(k, p − 1) (i.e., "k choose p − 1") possible subsets. Each such choice produces one subset
model. Moreover, the coefficient of determination (R²) has a tendency to increase as p
increases.
To proceed:
1 Choose an appropriate value of p, fit the model and obtain R² (call it R₁²).
2 Add one variable, fit the model and again obtain R² (call it R₂²).
3 Obviously, R₂² ≥ R₁².
4 If R₂² − R₁² is small, then stop and choose that value of p for the subset regression.
5 If R₂² − R₁² is large, then keep adding variables up to the point where an additional
variable does not produce a large change in R², i.e., the increment in R² becomes small
(a minimal numerical sketch of this stopping rule follows this list).
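A minimal numerical sketch of this stopping rule, assuming simulated data and a fixed order in which candidate variables are tried (the 0.01 cut-off for a "small" increment is arbitrary):

```python
import numpy as np

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(2)
n, k = 150, 6
X_all = rng.normal(size=(n, k))
y = 2 * X_all[:, 0] + X_all[:, 1] + rng.normal(size=n)   # only the first two variables matter

Xp = np.ones((n, 1))          # start from the intercept-only model
prev_r2 = 0.0
for j in range(k):            # add candidate variables one at a time
    Xp = np.column_stack([Xp, X_all[:, j]])
    r2 = r_squared(Xp, y)
    print(f"p = {Xp.shape[1]}: R^2 = {r2:.4f}, increment = {r2 - prev_r2:.4f}")
    if r2 - prev_r2 < 0.01:   # increment has become small: stop adding variables
        break
    prev_r2 = r2
```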
Adjusted Coefficient of Determination
The adjusted coefficient of determination (R²adj) has certain advantages over the usual
coefficient of determination.
The adjusted coefficient of determination based on a p-term model is given by:

R²adj = 1 − [ (n − 1) / (n − p) ] (1 − R²p)

An advantage of R²adj is that it does not necessarily increase as p increases.
If r more explanatory variables are added to a p-term model, then R²adj increases if and only
if the partial F-statistic for testing the significance of the r additional explanatory variables
exceeds 1.
Subset selection based on R²adj can be made in the same way as with R².
The value of p corresponding to the maximum value of R²adj is chosen for the subset model.
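A short sketch (simulated data) comparing R² and R²adj when an irrelevant variable is added; unlike R², the adjusted version can decrease and so penalises the useless addition:

```python
import numpy as np

def r2_and_adj(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return r2, 1 - (n - 1) / (n - p) * (1 - r2)    # adjusted R^2 for a p-term model

rng = np.random.default_rng(3)
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)    # x2 is irrelevant
y = 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, x2])

for name, X in [("x1 only     ", X_small), ("x1 + junk x2", X_big)]:
    r2, r2a = r2_and_adj(X, y)
    print(f"{name}: R^2 = {r2:.4f}, adjusted R^2 = {r2a:.4f}")
```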



Model Selection Using Residual Mean Square

A model is said to have a better fit if residuals are small. This is reflected in the sum of
squares due to residuals (SSres ). A model with smaller SSres is preferable.
Based on this, the residual mean square (MSres ) based on a p-variable subset regression model
is defined as:
MSres(p) = SSres(p) / (n − p)

So MSres(p) can be used as a criterion for model selection, just like SSres. SSres(p) decreases
with an increase in p. As p increases, MSres(p) initially decreases, then stabilizes, and may
finally increase when further reductions in SSres(p) are no longer large enough to compensate
for the loss of a degree of freedom in the divisor (n − p).
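A small sketch computing MSres(p) = SSres(p)/(n − p) for a nested sequence of subset models on simulated data; the values drop sharply while genuine variables enter and then roughly level off:

```python
import numpy as np

def ms_res(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return (resid @ resid) / (n - p)     # residual mean square of the p-term model

rng = np.random.default_rng(4)
n, k = 100, 5
X_all = rng.normal(size=(n, k))
y = 3 * X_all[:, 0] - 2 * X_all[:, 1] + rng.normal(size=n)   # only two real signals

Xp = np.ones((n, 1))
for j in range(k):
    Xp = np.column_stack([Xp, X_all[:, j]])
    print(f"p = {Xp.shape[1]}: MS_res(p) = {ms_res(Xp, y):.4f}")
```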



Model Selection Using Residual Mean Square

Plot MSres(p) versus p.

Choose p corresponding to the minimum value of MSres(p).
Choose p for which MSres(p) is approximately equal to MSres based on the full model.
Choose p near the point where the smallest value of MSres(p) turns upward.

Such a minimum value of MSres(p) will produce an adjusted coefficient of determination (R²adj)
with the maximum value.



Akaike's Information Criterion (AIC)
The Akaike information criterion statistic is given as:

AICp = n ln( SSres(p) / n ) + 2p

where SSres(p) is the residual sum of squares of the subset model,

SSres(p) = yT (I − H1) y,   with   H1 = X1 (X1T X1 )−1 X1T,

and is based on the subset model y = X1 β1 + δ, derived from the full model
y = X1 β1 + X2 β2 + ε = X β + ε.
The AIC is defined as:

AIC = −2(maximized log likelihood) + 2(number of parameters)


Bayesian Information Criterion (BIC)

Similar to AIC, the Bayesian Information Criterion (BIC) is based on maximizing the posterior
distribution of the model given the observations y . In the case of a linear regression model, it
is defined as:
BIC = n ln(SSres ) + (k − n) ln(n)

where SSres is the sum of squared residuals, k is the number of parameters in the model, and n
is the sample size.
A model with a smaller value of BIC is preferable, as it indicates a better balance between
model fit and complexity.
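A minimal sketch computing both criteria for a nested sequence of subsets, using AICp = n ln(SSres(p)/n) + 2p and the BIC above in the equivalent form n ln(SSres/n) + p ln(n) (here p counts every parameter of the subset model, playing the role of k in the slide; the data are simulated). Smaller values are preferred for both:

```python
import numpy as np

def information_criteria(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    aic = n * np.log(ss_res / n) + 2 * p          # Gaussian -2 log-likelihood up to a constant
    bic = n * np.log(ss_res / n) + p * np.log(n)  # heavier per-parameter penalty once n > e^2
    return aic, bic

rng = np.random.default_rng(5)
n, k = 200, 4
X_all = rng.normal(size=(n, k))
y = X_all[:, 0] + 0.5 * X_all[:, 1] + rng.normal(size=n)

Xp = np.ones((n, 1))
for j in range(k):
    Xp = np.column_stack([Xp, X_all[:, j]])
    aic, bic = information_criteria(Xp, y)
    print(f"p = {Xp.shape[1]}: AIC = {aic:8.2f}  BIC = {bic:8.2f}")
```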



Prediction Error Sum of Squares (PRESS) Statistic

Since the residuals and residual sum of squares act as a criterion for subset model selection,
similarly, the Prediction Error Sum of Squares (PRESS) can also be used for subset model
selection.
The PRESS statistic based on a subset model with p explanatory variables is given by:
PRESSp = Σi (yi − ŷ(i))² = Σi [ ei / (1 − hii) ]²,   i = 1, 2, . . . , n,

where ŷ(i) is the predicted value of yi obtained from the model when the ith observation is
excluded from the fitting process, ei is the ordinary residual of the ith observation, and hii is
the ith diagonal element of the hat matrix H = X(XT X)−1 XT.
A subset regression model with a smaller value of PRESS is preferable.
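A minimal sketch computing PRESS from a single fit via the deleted residuals ei/(1 − hii), i.e., without refitting the model n times (the data are simulated for illustration):

```python
import numpy as np

def press(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # hat matrix diagonal
    return np.sum((resid / (1 - h)) ** 2)            # sum of squared deleted residuals

rng = np.random.default_rng(6)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)      # x2 is irrelevant
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_good = np.column_stack([np.ones(n), x1])
X_over = np.column_stack([np.ones(n), x1, x2])
print("PRESS, x1 only :", round(press(X_good, y), 2))
print("PRESS, x1 + x2 :", round(press(X_over, y), 2))   # the smaller PRESS is preferred
```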
Computational Techniques for Variable Selection

In order to select a subset model, several techniques based on computational procedures and
algorithms are available. They are essentially based on two ideas:
1 consider all possible subsets of explanatory variables, or
2 select the explanatory variables stepwise.



Computational Techniques for Variable Selection

Use All Possible Explanatory Variables:


Fit a model with one explanatory variable.
Fit a model with two explanatory variables.
Fit a model with three explanatory variables, and so on.
Choose a suitable criterion for model selection and evaluate each of the fitted regression
equations with the selection criterion.
The total number of models to be fitted rises sharply with an increase in k (with k candidate
variables there are 2^k possible subsets). Therefore, such models are evaluated using a model
selection criterion with the help of efficient computational algorithms on computers (a
brute-force sketch is given below).
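A brute-force sketch of this all-possible-subsets idea for a small k, enumerating every non-empty subset with itertools and ranking the fits by R²adj (any of the criteria above could be substituted):

```python
import numpy as np
from itertools import combinations

def adj_r2(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (n - 1) / (n - p) * (1 - r2)

rng = np.random.default_rng(7)
n, k = 100, 4
X_all = rng.normal(size=(n, k))
y = 2 * X_all[:, 0] - X_all[:, 2] + rng.normal(size=n)

results = []
for r in range(1, k + 1):                          # subsets of every size
    for subset in combinations(range(k), r):
        X = np.column_stack([np.ones(n), X_all[:, list(subset)]])
        results.append((adj_r2(X, y), subset))

for score, subset in sorted(results, reverse=True)[:3]:   # three best subsets
    print(f"variables {subset}: adjusted R^2 = {score:.4f}")
```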



Computational Techniques for Variable Selection

Stepwise Selection:
This methodology is based on choosing the explanatory variables in the subset model in steps,
which can be either adding one variable at a time or deleting one variable at a time. Based on
this, there are three procedures:

Forward Selection
Backward Elimination
Stepwise Selection



Forward Selection Procedure (Part 1)

This methodology assumes that there is no explanatory variable in the model except an
intercept term. It adds variables one by one and tests the fitted model at each step using some
suitable criterion. It has the following steps:
1 Consider only the intercept term and insert one variable at a time.
2 Calculate the simple correlations of xi with y , i = 1, 2, . . . , k.
3 Choose xi which has the largest correlation with y .



Forward Selection Procedure (Part 2)

4 Suppose x1 is the variable which has the highest correlation with y . Since the F-statistic
for testing the significance of the regression,

F0 = [ R² / (1 − R²) ] · [ (n − k) / (k − 1) ],

is an increasing function of R², x1 will also produce the largest value of F0.
5 Choose a prespecified cut-off value of F, say FIN (F-to-enter).
6 If F0 > FIN , then accept x1, and x1 enters into the model.
7 Adjust the effect of x1 on y and re-compute the correlations of remaining xi with y and
obtain the partial correlations.



Forward Selection Procedure (Part 3)
8 Choose xi with the second-largest correlation with y , i.e., the variable with the highest
value of partial correlation with y .
9 Suppose this variable is x2 . Then the largest partial F-statistic is

F = SSreg(x2 | x1) / MSres(x1, x2)

10 If F > FIN , then x2 enters into the model.


11 These steps are repeated. At each step, the partial correlations are computed, and the
explanatory variable corresponding to the highest partial correlation with y is chosen to be
added into the model. Equivalently, the partial F-statistics are calculated, and the largest
F-statistic given the other explanatory variables in the model is chosen. The corresponding
explanatory variable is added into the model if the partial F-statistic exceeds FIN .
12 Continue with such selection until either, at a particular step, the largest partial F-statistic
does not exceed FIN, or the last candidate explanatory variable has been added to the model
(a sketch of the full procedure follows below).
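A minimal sketch of the forward selection procedure described above, using partial F-statistics and the cut-off FIN = 4 that is mentioned later in these slides as a popular choice (the data and signal structure are made up):

```python
import numpy as np

def ss_res(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def forward_selection(X_all, y, f_in=4.0):
    n, k = X_all.shape
    selected, remaining = [], list(range(k))
    while remaining:
        X_cur = np.column_stack([np.ones(n)] + [X_all[:, j] for j in selected])
        best_f, best_j = -np.inf, None
        for j in remaining:
            X_new = np.column_stack([X_cur, X_all[:, j]])
            p = X_new.shape[1]
            # partial F: drop in SS_res divided by the new model's residual mean square
            f = (ss_res(X_cur, y) - ss_res(X_new, y)) / (ss_res(X_new, y) / (n - p))
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_in:          # no candidate clears F_IN: stop
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(8)
n = 150
X_all = rng.normal(size=(n, 6))
y = 3 * X_all[:, 1] + 2 * X_all[:, 4] + rng.normal(size=n)
print("variables entered:", forward_selection(X_all, y))   # typically [1, 4]
```

Raising FIN makes the procedure more conservative about letting variables in; it plays the same role as a significance threshold for the partial F-test.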
Backward Elimination Procedure (Part 1)

This methodology is contrary to the forward selection procedure. The forward selection
procedure starts with no explanatory variable in the model and keeps on adding one variable at
a time until a suitable model is obtained. The backward elimination methodology begins with
all explanatory variables and keeps on deleting one variable at a time until a suitable model is
obtained. It is based on the following steps:
1 Consider all k explanatory variables and fit the model.
2 Compute partial F -statistics for each explanatory variable as if it were the last variable to
enter the model.



Backward Elimination Procedure (Part 2)

3 Choose a preselected value FOUT .


4 Compare the smallest of the partial F -statistics with FOUT . If it is less than FOUT ,
remove the corresponding explanatory variable from the model.
5 The model will now have (k − 1) explanatory variables.
6 Fit the model with these (k − 1) variables, compute the partial F -statistics for this new
model, and compare the smallest of them with FOUT . If it is less than FOUT , then remove
the corresponding variable from the model.



Backward Elimination Procedure (Part 3)

7 Repeat this procedure.


8 Stop the procedure when the smallest partial F -statistic exceeds FOUT .
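A matching sketch of backward elimination: start from all k candidates, compute each variable's partial F-statistic as if it were the last to enter, and delete the weakest variable while its partial F is below FOUT (simulated data, FOUT = 4):

```python
import numpy as np

def ss_res(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def backward_elimination(X_all, y, f_out=4.0):
    n, k = X_all.shape
    selected = list(range(k))                      # start with all k regressors
    while selected:
        X_cur = np.column_stack([np.ones(n)] + [X_all[:, j] for j in selected])
        ms_full = ss_res(X_cur, y) / (n - X_cur.shape[1])
        f_stats = {}
        for j in selected:                         # partial F of each variable, given the others
            others = [m for m in selected if m != j]
            X_red = np.column_stack([np.ones(n)] + [X_all[:, m] for m in others])
            f_stats[j] = (ss_res(X_red, y) - ss_res(X_cur, y)) / ms_full
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] > f_out:               # smallest partial F exceeds F_OUT: stop
            break
        selected.remove(weakest)
    return selected

rng = np.random.default_rng(9)
n = 150
X_all = rng.normal(size=(n, 6))
y = 3 * X_all[:, 1] + 2 * X_all[:, 4] + rng.normal(size=n)
print("variables retained:", backward_elimination(X_all, y))   # typically [1, 4]
```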



Stepwise Regression Procedure (Part 1)

Stepwise regression is a combination of forward selection and backward elimination procedures.


It is a modification of the forward selection procedure and has the following steps:
1 Start with an initial model that contains no explanatory variables (or, depending on the
approach chosen, a model containing all available explanatory variables). At each step,
consider all the explanatory variables entered into the model at the previous step.
2 Add a new variable, as in forward selection, and then reassess all the variables already in
the model through their partial F -statistics.



Stepwise Regression Procedure (Part 2)

3 An explanatory variable that was added at an earlier step may now become insignificant
due to its relationship with currently present explanatory variables in the model.
4 If the partial F -statistic for an explanatory variable is smaller than FOUT , then this
variable is deleted from the model.
5 Stepwise regression requires two cut-off values, FIN and FOUT . The choice of these values
is crucial. Sometimes FIN = FOUT or FIN > FOUT is used; choosing FIN > FOUT makes it
relatively more difficult to add an explanatory variable than to delete one.



General Comments

1 None of the methods among the forward selection, backward elimination, or stepwise
regression guarantees the best subset model.
2 The order in which the explanatory variables enter or leave the models does not indicate
the order of importance of the explanatory variable.
3 In forward selection, no explanatory variable can be removed if entered in the model.
Similarly, in backward elimination, no explanatory variable can be added if removed from
the model.
4 All procedures may lead to different models. Different model selection criteria may give
different subset models.



Stopping Rules: Comments

Choice of FIN and/or FOUT provides stopping rules for algorithms.


Some computer software allows the analyst to specify these values directly.
Some algorithms require type I error rates to be specified in order to generate FIN and/or
FOUT . Taking α as the level of significance can sometimes be misleading, because several
correlated partial F -variables are considered at each step and the maximum among them is
examined.
Some analysts prefer small values of FIN and FOUT , whereas others prefer larger ones. A
popular choice is FIN = FOUT = 4, which roughly corresponds to the 5% level of significance
of the F -distribution.



Diagnostic for Leverage and Influence

The location of observations in x-space can play an important role in determining the
regression coefficients.
Consider a situation like in the following figure:

Figure: leverage point



Diagnostic for Leverage and Influence

The point A in this figure is remote in x-space from the rest of the sample but it lies
almost on the regression line passing through the rest of the sample points. This is a
leverage point.
This point does not affect the estimates of the regression coefficients.
It affects the model summary statistics, e.g., R 2 , standard errors of regression coefficients,
etc.



Diagnostic for Leverage and Influence (continued)

Now consider the point B in the following figure:

Figure: Influence point



Diagnostic for Leverage and Influence (continued)

This point has a moderately unusual x-coordinate and the y -value is also unusual. This is
an influence point.
It has a noticeable impact on the model coefficients.
It pulls the regression model in its direction.



Diagnostic for Leverage and Influence (continued)

Leverage:
The location of points in x-space affects the model properties like parameter estimates,
standard errors, predicted values, summary statistics, etc.
The hat matrix H = X (X T X )−1 X T plays an important role in identifying influential
observations.
The hat matrix diagonal is a standardized measure of the distance of an observation from the
center of the x-space. Large hat diagonals reveal observations that are potentially influential
because they are remote in x-space from the rest of the sample.
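A small sketch computing the hat diagonals hii directly from H = X(XT X)−1 XT; the cut-off 2p/n used for flagging "large" values is a common rule of thumb, not something stated on this slide:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 30
x = rng.normal(size=n)
x[0] = 8.0                                     # one observation remote in x-space
y = 1 + 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # hat matrix diagonal
p = X.shape[1]
cutoff = 2 * p / n                             # rule-of-thumb threshold for high leverage

for i in np.where(h > cutoff)[0]:
    print(f"observation {i}: h_ii = {h[i]:.3f} (cutoff {cutoff:.3f})")
```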



Diagnostic for Leverage and Influence (continued)

Influential Point:
Not all leverage points are influential on the regression coefficients.
The hat diagonal examines only the location of observations in x-space, so the studentized
residual or R-student should be examined in conjunction with the hat diagonal.
Observations with a large hat diagonal and large residuals are likely to be influential.



Measures of Influence

1 Cook’s D-statistics: Cook’s distance measure is a deletion diagnostic, i.e., it measures the
influence of the ith observation if it is removed from the sample.
2 DFFITS : the deletion influence of the ith observation on the predicted or fitted value.
3 DFBETAS: indicates how much the jth regression coefficient changes if the ith observation
were deleted. A large (in magnitude) value of DFBETASj,i indicates that the ith observation
has considerable influence on the jth regression coefficient.
4 If the data point is an outlier, then R-student will be large in magnitude.
5 If the data point has high leverage, then hii will be close to unity.
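A minimal sketch computing R-student, Cook's D and DFFITS from their standard closed-form expressions for a single deliberately contaminated observation; DFBETAS can be obtained analogously from the rows of (XT X)−1 XT (the data here are simulated):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
x = rng.normal(size=n)
x[0] = 6.0                                    # high-leverage x value
y = 1 + 2 * x + rng.normal(size=n)
y[0] += 8.0                                   # and an unusual y value: an influential point

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
ms_res = (e @ e) / (n - p)

r = e / np.sqrt(ms_res * (1 - h))             # internally studentized residuals
s2_del = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)   # leave-one-out variance estimate
t = e / np.sqrt(s2_del * (1 - h))             # R-student (externally studentized residual)

cooks_d = r**2 * h / (p * (1 - h))            # Cook's distance
dffits = t * np.sqrt(h / (1 - h))             # DFFITS

print("obs 0: h =", round(h[0], 3), " R-student =", round(t[0], 2),
      " Cook's D =", round(cooks_d[0], 2), " DFFITS =", round(dffits[0], 2))
```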

