Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology
Vellore Campus, Vellore - 632 014
India
Types
1 Omission/exclusion of relevant variables.
2 Inclusion of irrelevant variables.
Consequences
Biased parameter estimates.
Incomplete understanding of the relationship between variables.
Noise in the model.
Let there be k candidate explanatory variables out of which suppose r variables are included
and (k − r ) variables are to be deleted from the model. We can partition X and β as:
X = [X1  X2]  and  β = (β1ᵀ, β2ᵀ)ᵀ
where:
X1 : Matrix of explanatory variables to be included (size: n × r )
X2 : Matrix of explanatory variables to be deleted (size: n × (k − r ))
β1 : Coefficients corresponding to variables in X1 (size: r × 1)
β2 : Coefficients corresponding to variables in X2 (size: (k − r ) × 1)
y = X1 β1 + X2 β2 + ε
y : Response variable
X1 β1 : Contribution of the included variables
X2 β2 : Contribution of the deleted variables
ε: Error term
After dropping the (k − r) explanatory variables, the new model becomes:
y = X1 β1 + δ
where:
X1 : Matrix of explanatory variables to be included
δ: Error term
The OLS estimator for β1 , denoted as β̂1 , is given by:
β̂1 = (X1ᵀX1)⁻¹X1ᵀy
Its expectation is E(β̂1) = β1 + (X1ᵀX1)⁻¹X1ᵀX2β2, so β̂1 is biased unless X1ᵀX2 = 0 or β2 = 0.
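A quick numeric check of this omitted-variable bias (a minimal numpy sketch; the coefficient values and variable names are illustrative, not from the source). Regressing y on x1 alone, when the data-generating process also involves a correlated x2, shifts the estimated slope away from its true value:

```python
import numpy as np

# Omitted-variable bias: regress y on x1 only, when the true model also
# involves x2 and x2 is correlated with x1.
rng = np.random.default_rng(6)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)        # correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])    # subset model: intercept + x1
b = np.linalg.lstsq(X1, y, rcond=None)[0]
print(b[1])   # close to 2.6 = 1.0 + 0.8*2.0, not the true coefficient 1.0
```

The bias 0.8 × 2.0 is exactly (X1ᵀX1)⁻¹X1ᵀX2β2 in expectation for this design.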
Sometimes, due to enthusiasm and an attempt to make the model more realistic, analysts may
include explanatory variables that are not very relevant to the model. Such variables may
contribute very little to the explanatory power of the model. This can lead to the following
consequences:
Reduction in degrees of freedom (n − k).
Questionable validity of inference drawn from the model.
Increase in the coefficient of determination (R 2 ), falsely indicating improvement in the
model.
Consider the true model y = Xβ + ε, which comprises k explanatory variables. Now, suppose r additional explanatory variables are
added to the model, resulting in a false model:
y = Xβ + Zγ + δ
where:
Z is an n × r matrix of n observations on each of the r additional explanatory variables.
γ is a r × 1 vector of regression coefficients associated with Z .
δ is the disturbance term.
In regression analysis, the OLS estimate of the regression coefficient remains unbiased even when irrelevant variables are included in the model, but the variances of the estimates are inflated.
Dr. Jisha Francis Module 5 9 / 38
Comparison: Exclusion vs. Inclusion of Variables
R² = SSreg / SST = 1 − SSres / SST
where SSreg , SSres and SST are the sums of squares due to regression, residuals, and total,
respectively, in a subset model based on p − 1 explanatory variables.
Selection of Subset of Explanatory Variables using R 2
Since there are k explanatory variables available and we select only (p − 1) of them, there
are (k choose p − 1) possible subsets. Each such choice produces one subset model.
Moreover, the coefficient of determination (R²) tends to increase as p increases.
To proceed:
1 Choose an appropriate value of p, fit the model and obtain R 2 (R12 ).
2 Add one variable, fit the model and again obtain R 2 (R22 ).
3 Necessarily, R₂² ≥ R₁², since adding a variable cannot decrease R².
4 If R₂² − R₁² is small, then stop and choose that value of p for the subset regression.
5 If R₂² − R₁² is large, keep adding variables until an additional variable no longer
produces a large change in R², i.e., until the increment in R² becomes small.
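The stopping rule above can be sketched in a few lines of numpy (illustrative data; the helper `r_squared` is my name, not from the source):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (intercept column included in X)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # an irrelevant candidate variable
y = 2 + 3 * x1 + rng.normal(size=n)

ones = np.ones(n)
R2_1 = r_squared(np.column_stack([ones, x1]), y)
R2_2 = r_squared(np.column_stack([ones, x1, x2]), y)
print(R2_2 - R2_1)   # small increment -> stop adding variables
```

Because R² never decreases when a column is added, only the *size* of the increment carries information.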
Adjusted Coefficient of Determination
The adjusted coefficient of determination (R²adj) has certain advantages over the usual
coefficient of determination.
The adjusted coefficient of determination based on a p-term model is given by:
R²adj = 1 − ((n − 1) / (n − p)) (1 − R²p)
If r more explanatory variables are added to a p-term model, then R²adj increases if
and only if the partial F-statistic for testing the significance of the r additional explanatory
variables exceeds 1.
Subset selection based on R²adj can be carried out in the same way as with R².
The value of p corresponding to the maximum value of R²adj is chosen for the subset model.
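A hedged sketch of the adjustment (the helper name and the numbers are illustrative, not from the source). Unlike R², the adjusted value can fall when a nearly useless variable is added, because the tiny rise in R² does not offset the lost degree of freedom:

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a p-term model fitted on n observations."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

n = 30
print(adjusted_r2(0.800, n, p=3))
print(adjusted_r2(0.801, n, p=4))   # smaller than the p=3 value
```

This is the numerical face of the partial-F condition above: an added variable helps R²adj only when its partial F exceeds 1.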
A model is said to have a better fit if residuals are small. This is reflected in the sum of
squares due to residuals (SSres ). A model with smaller SSres is preferable.
Based on this, the residual mean square (MSres ) based on a p-variable subset regression model
is defined as:
MSres (p) = SSres (p) / (n − p)
So MSres (p) can be used as a model-selection criterion like SSres (p). SSres (p) decreases as
p increases, whereas MSres (p) initially decreases, then stabilizes, and finally may increase
when the reduction in SSres (p) from adding a variable is no longer sufficient to compensate
for the loss of one degree of freedom in the divisor n − p.
Each of the following criteria compares the subset model y = X1 β1 + δ with the full model
y = X1 β1 + X2 β2 + ε = Xβ + ε.
The AIC for a linear regression model with k parameters is defined as:
AIC = n ln(SSres /n) + 2k
A model with a smaller value of AIC is preferable.
Similar to AIC, the Bayesian Information Criterion (BIC) is based on maximizing the posterior
distribution of the model given the observations y . In the case of a linear regression model, it
is defined as:
BIC = n ln(SSres /n) + k ln(n)
where SSres is the sum of squared residuals, k is the number of parameters in the model, and n
is the sample size.
A model with a smaller value of BIC is preferable, as it indicates a better balance between
model fit and complexity.
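Both criteria can be computed directly from SSres; a small sketch (function name and numbers are illustrative, using the n·ln(SSres/n) forms given above). Because BIC's penalty k·ln(n) grows with n while AIC's stays at 2k, the two can disagree on a marginal variable:

```python
import numpy as np

def aic_bic(ss_res, n, k):
    """AIC and BIC of a linear model with k parameters; smaller is better."""
    aic = n * np.log(ss_res / n) + 2 * k
    bic = n * np.log(ss_res / n) + k * np.log(n)
    return aic, bic

# A fourth parameter buys only a tiny drop in SSres (12.5 -> 12.2):
a3, b3 = aic_bic(ss_res=12.5, n=100, k=3)
a4, b4 = aic_bic(ss_res=12.2, n=100, k=4)
print(a4 < a3, b4 < b3)   # AIC accepts the extra variable, BIC rejects it
```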
Since the residuals and residual sum of squares act as a criterion for subset model selection,
similarly, the Prediction Error Sum of Squares (PRESS) can also be used for subset model
selection.
The PRESS statistic based on a subset model with p explanatory variables is given by:
PRESSp = Σ_{i=1}^{n} (yᵢ − ŷ₍ᵢ₎)² = Σ_{i=1}^{n} (eᵢ / (1 − hᵢᵢ))²
where ŷ(i) is the predicted value of yi obtained from the model when the ith observation is
excluded from the fitting process, and hii is the ith diagonal element of the hat matrix
H = X(XᵀX)⁻¹Xᵀ.
A subset regression model with a smaller value of PRESS is preferable.
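The hat-matrix identity above means PRESS needs no refitting; a numpy sketch with illustrative data (the helper name is mine):

```python
import numpy as np

def press(X, y):
    """PRESS: sum of squared leave-one-out prediction errors, obtained
    from the ordinary residuals via e_i / (1 - h_ii), without n refits."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                       # ordinary residuals
    h = np.diag(H)
    return np.sum((e / (1 - h)) ** 2)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=40)
print(press(X, y))   # at least as large as the residual sum of squares
```

Since 0 < hᵢᵢ < 1, each leave-one-out error eᵢ/(1 − hᵢᵢ) is at least as large in magnitude as eᵢ, so PRESS ≥ SSres.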
Computational Techniques for Variable Selection
In order to select a subset model, several techniques based on computational procedures and
algorithms are available. They are essentially based on two ideas:
1 select all possible explanatory variables
2 select the explanatory variables stepwise.
Stepwise Selection:
This methodology is based on choosing the explanatory variables in the subset model in steps,
which can be either adding one variable at a time or deleting one variable at a time. Based on
this, there are three procedures:
Forward Selection
Backward Elimination
Stepwise Selection
Forward Selection:
This methodology assumes that there is no explanatory variable in the model except an
intercept term. It adds variables one by one and tests the fitted model at each step using some
suitable criterion. It has the following steps:
1 Consider only the intercept term and insert one variable at a time.
2 Calculate the simple correlations of xi with y , i = 1, 2, . . . , k.
3 Choose xi which has the largest correlation with y .
4 Suppose x1 is the variable which has the highest correlation with y . Since the F-statistic
given by
F0 = (R² / (1 − R²)) · ((n − k) / (k − 1))
so x1 will produce the largest value of F in testing the significance of a regression.
5 Choose a prespecified cut-off value of F, say FIN (F-to-enter).
6 If F > FIN , then accept x1 and so x1 enters into the model.
7 Adjust the effect of x1 on y and re-compute the correlations of remaining xi with y and
obtain the partial correlations.
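Steps 1–7 can be sketched as a greedy loop (a simplified illustration; the function name, the FIN value, and the data are mine, not from the source). At each step the candidate with the largest partial F enters, until no candidate clears FIN:

```python
import numpy as np

def forward_select(X, y, F_in=8.0):
    """Greedy forward selection: add the candidate column with the largest
    partial F-statistic; stop when the best candidate falls below F_in."""
    n, k = X.shape
    selected, remaining = [], list(range(k))

    def ss_res(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ beta
        return r @ r

    current = ss_res(selected)
    while remaining:
        scores = {}
        for j in remaining:
            new = ss_res(selected + [j])
            df = n - (len(selected) + 2)   # params: intercept + selected + j
            scores[j] = (current - new) / (new / df)   # partial F, 1 df
        best = max(scores, key=scores.get)
        if scores[best] < F_in:
            break
        selected.append(best)
        remaining.remove(best)
        current = ss_res(selected)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=80)
print(forward_select(X, y))   # typically columns 0 and 2 enter
```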
Backward Elimination:
This methodology is contrary to the forward selection procedure. The forward selection
procedure starts with no explanatory variable in the model and keeps on adding one variable at
a time until a suitable model is obtained. The backward elimination methodology begins with
all explanatory variables and keeps on deleting one variable at a time until a suitable model is
obtained. It is based on the following steps:
1 Consider all k explanatory variables and fit the model.
2 Compute partial F -statistics for each explanatory variable as if it were the last variable to
enter the model.
3 An explanatory variable that was added at an earlier step may now become insignificant
due to its relationship with currently present explanatory variables in the model.
4 If the partial F -statistic for an explanatory variable is smaller than FOUT , then this
variable is deleted from the model.
5 Stepwise regression needs two cut-off values, FIN and FOUT . The choice of these values
is crucial. Sometimes FIN = FOUT or FIN > FOUT is used. Using FIN > FOUT makes it
relatively more difficult to add an explanatory variable than to delete one.
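The backward elimination steps can be sketched similarly (an illustrative helper with my own FOUT value and data, not the source's): start from all variables and repeatedly drop the one whose partial F, computed as if it entered last, is smallest:

```python
import numpy as np

def backward_eliminate(X, y, F_out=4.0):
    """Backward elimination: drop the column with the smallest partial
    F-statistic until every remaining column's F exceeds F_out."""
    n, k = X.shape
    cols = list(range(k))

    def ss_res(c):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in c])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ beta
        return r @ r

    while cols:
        full = ss_res(cols)
        df = n - (len(cols) + 1)          # params: intercept + cols
        # partial F of each variable as if it were the last one to enter
        F = {j: (ss_res([c for c in cols if c != j]) - full) / (full / df)
             for j in cols}
        worst = min(F, key=F.get)
        if F[worst] >= F_out:
            break
        cols.remove(worst)
    return cols

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4))
y = 3 * X[:, 1] + rng.normal(size=80)
print(backward_eliminate(X, y))   # the truly relevant column 1 survives
```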
1 None of the methods among the forward selection, backward elimination, or stepwise
regression guarantees the best subset model.
2 The order in which explanatory variables enter or leave the model does not indicate
their order of importance.
3 In forward selection, no explanatory variable can be removed if entered in the model.
Similarly, in backward elimination, no explanatory variable can be added if removed from
the model.
4 All procedures may lead to different models. Different model selection criteria may give
different subset models.
The location of observations in x-space can play an important role in determining the
regression coefficients.
Consider a situation like in the following figure:
The point A in this figure is remote in x-space from the rest of the sample but it lies
almost on the regression line passing through the rest of the sample points. This is a
leverage point.
This point does not affect the estimates of the regression coefficients.
It affects the model summary statistics, e.g., R 2 , standard errors of regression coefficients,
etc.
A point with a moderately unusual x-coordinate whose y-value is also unusual is an
influence point.
It has a noticeable impact on the model coefficients.
It pulls the regression model in its direction.
Leverage:
The location of points in x-space affects the model properties like parameter estimates,
standard errors, predicted values, summary statistics, etc.
The hat matrix H = X (X T X )−1 X T plays an important role in identifying influential
observations.
The hat matrix diagonal is a standardized measure of the distance of an observation from the
center of the x-space. Large hat diagonals reveal observations that are potentially influential
because they are remote in x-space from the rest of the sample.
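A short numpy illustration of hat diagonals flagging a remote point (the data are mine, and the 2p/n cut-off is a common rule of thumb rather than something stated in the source):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20)
x[0] = 8.0                           # one point remote in x-space
X = np.column_stack([np.ones(20), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                       # hat diagonals; they sum to p = trace(H)
p = X.shape[1]
print(np.where(h > 2 * p / 20)[0])   # rule of thumb flags the remote point
```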
Influential Point:
Not all leverage points are influential on the regression coefficients.
Hat diagonal examines only the location of observations in x-space, so we can look at the
studentized residual or R-student in conjunction with the hat diagonal.
Observations with a large hat diagonal and large residuals are likely to be influential.
1 Cook’s D-statistic: Cook’s distance measure is a deletion diagnostic, i.e., it measures the
influence of the ith observation as if it were removed from the sample.
2 DFFITS: measures the deletion influence of the ith observation on the predicted (fitted) value.
3 DFBETAS: indicates how much the jth regression coefficient changes if the ith
observation were deleted. A large (in magnitude) value of DFBETASj,i indicates that the
ith observation has considerable influence on the jth regression coefficient.
4 If the data point is an outlier, then R-student will be large in magnitude.
5 If the data point has high leverage, then hii will be close to unity.
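These diagnostics can all be computed from a single OLS fit using standard formulas (the helper name and the planted data point are illustrative, not from the source):

```python
import numpy as np

def deletion_diagnostics(X, y):
    """Cook's D and DFFITS for each observation, from one OLS fit."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse = e @ e / (n - p)
    r = e / np.sqrt(mse * (1 - h))               # studentized residuals
    cooks_d = r ** 2 * h / (p * (1 - h))
    # externally studentized (R-student) residuals for DFFITS
    s2_i = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))
    dffits = t * np.sqrt(h / (1 - h))
    return cooks_d, dffits

rng = np.random.default_rng(4)
x = rng.normal(size=25)
y = 1 + 2 * x + rng.normal(size=25) * 0.5
x[0], y[0] = 4.0, -5.0                           # plant an influential outlier
X = np.column_stack([np.ones(25), x])
D, F = deletion_diagnostics(X, y)
print(np.argmax(D), np.argmax(np.abs(F)))        # the planted point dominates
```

The planted point has both a remote x-coordinate (high leverage) and an unusual y-value (outlier), which is exactly the combination flagged above.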