
Stepwise Multiple Regression

Slide 1

Different Methods for Entering Variables in Multiple Regression

Different types of multiple regression are distinguished by the method for entering the
independent variables into the analysis.

In standard (or simultaneous) multiple regression, all of the independent variables are
entered into the analysis at the same time.

In hierarchical (or sequential) multiple regression, the independent variables are entered in
an order prescribed by the analyst.

In stepwise (or statistical) multiple regression, the independent variables are entered
according to their statistical contribution in explaining the variance in the dependent
variable.

No matter what method of entry is chosen, a multiple regression that includes the same
independent variables and the same dependent variable will produce the same multiple
regression equation.

The number of cases required for stepwise regression is greater than the number for the
other forms. We will use the norm of 40 cases for each independent variable.

Slide 2

Purpose of Stepwise Multiple Regression

Stepwise regression is designed to find the most parsimonious set of predictors that are
most effective in predicting the dependent variable.

Variables are added to the regression equation one at a time, using the statistical criterion
of maximizing the R² of the included variables.

After each variable is entered, each of the included variables is tested to see whether the
model would be better off if it were excluded. This does not happen often.

The process of adding more variables stops when all of the available variables have been
included or when it is not possible to make a statistically significant improvement in R²
using any of the variables not yet included.

Since variables will not be added to the regression equation unless they make a
statistically significant addition to the analysis, all of the independent variables selected for
inclusion will have a statistically significant relationship to the dependent variable.
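
As a rough sketch of this add-and-test logic, the loop below illustrates forward entry with a
removal check in Python using statsmodels. It is my own illustration, not SPSS's actual
implementation; the 0.05 entry and 0.10 removal criteria are assumed defaults, and y and X
stand for a hypothetical outcome Series and predictor DataFrame.

import statsmodels.api as sm

def stepwise_select(y, X, p_enter=0.05, p_remove=0.10):
    """Forward entry with a removal check; y is a pandas Series, X a DataFrame."""
    included = []
    while True:
        changed = False
        # Entry: add the candidate with the smallest p-value, if it is significant.
        candidates = [c for c in X.columns if c not in included]
        best_p, best_var = 1.0, None
        for c in candidates:
            fit = sm.OLS(y, sm.add_constant(X[included + [c]]), missing="drop").fit()
            if fit.pvalues[c] < best_p:
                best_p, best_var = fit.pvalues[c], c
        if best_var is not None and best_p <= p_enter:
            included.append(best_var)
            changed = True
        # Removal: drop an included variable if it no longer meets the criterion.
        if included:
            fit = sm.OLS(y, sm.add_constant(X[included]), missing="drop").fit()
            worst = fit.pvalues[included].idxmax()
            if fit.pvalues[worst] > p_remove:
                included.remove(worst)
                changed = True
        if not changed:
            return included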

An example of how SPSS does stepwise regression is shown below.

Slide 3

Stepwise Multiple Regression in SPSS

Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new
step or model, i.e. there will be one model and result for each variable included in the
analysis.

SPSS provides a table of variables included in the analysis and a table of variables
excluded from the analysis. It is possible that none of the variables will be included. It is
possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance.

Once a variable is included, its interpretation in stepwise regression is the same as it
would be using other methods for including regression variables.

Slide 4

Pros and Cons of Stepwise Regression

Stepwise multiple regression can be used when the goal is to produce a predictive model
that is parsimonious and accurate because it excludes variables that do not contribute to
explaining differences in the dependent variable.

Stepwise multiple regression is less useful for testing hypotheses about statistical
relationships. It is widely regarded as atheoretical, and its use for hypothesis testing is not
recommended.

Stepwise multiple regression can be useful in finding relationships that have not been
tested before. Its findings invite one to speculate on why an unusual relationship makes
sense.

It is not legitimate to do a stepwise multiple regression and present the results as though
one were testing a hypothesis that included the variables found to be significant in the
stepwise regression.

Using statistical criteria to determine relationships is vulnerable to over-fitting the data set
used to develop the model at the expense of generalizability.

When stepwise regression is used, some form of validation analysis is a necessity. We
will use 75/25% cross-validation.

Slide 5

75/25% Cross-validation

To do cross-validation, we randomly split the data set into a 75% training sample and a
25% validation sample. We use the training sample to develop the model, and we then
test the model on the validation sample to assess its applicability to cases not used to
develop it.

In order for the validation to be successful, the following two questions must be answered
affirmatively:

Did the stepwise regression of the training sample produce the same subset of
predictors produced by the regression model of the full data set?

If yes, compare the R² for the 25% validation sample to the R² for the 75% training
sample. If the shrinkage (R² for the 75% training sample minus R² for the 25% validation
sample) is 2% (0.02) or less, we conclude that validation was successful.

Note: shrinkage may be a negative value, indicating that the accuracy rate for the
validation sample is larger than the accuracy rate for the training sample. Negative
shrinkage (increase in accuracy) is evidence of a successful validation analysis.
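
As a small illustration of this decision rule (a sketch only; the R² values shown here are the
ones reported for this problem later in the presentation):

def validation_successful(r2_training, r2_validation, threshold=0.02):
    """Shrinkage = training R-squared minus validation R-squared; a value of
    2% (0.02) or less, including negative shrinkage, counts as successful."""
    return (r2_training - r2_validation) <= threshold

print(validation_successful(0.131, 0.162))   # True: shrinkage = -0.031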

If the validation is successful, we base our interpretation on the model that included all
cases.

Slide 6

Correlations between dependent variable and independent variables

We have two independent variables, IV1 and IV2, each of which has a relationship to the
dependent variable (DV). The areas of IV1 and IV2 which overlap with DV are r² values,
i.e. the proportion of the variance in the DV that is explained by the IV.

DV and IV1 are correlated at r = .70. The area of overlap is r² = .49.

DV and IV2 are correlated at r = .40. The area of overlap is r² = .16.

Slide 7

Correlations between independent variables


The two independent variables, IV1 and IV2, are correlated at r = .20. This correlation
represents redundant information in the independent variables.

Slide 8

Variance in the dependent variable explained by the independent variables


The variance explained in DV is divided into three areas. The total variance explained is
the sum of the three areas.

The green area is the variance in DV uniquely explained by IV1.

The orange area is the variance in DV uniquely explained by IV2.

The brown area is the variance in DV that is explained by both IV1 and IV2.

Slide 9

Correlations at step 1 of the stepwise regression

Since IV1 had the stronger relationship with DV (.70 versus .40), it will be the variable
entered first in the stepwise regression. As the only variable in the regression equation, it
is given full credit (.70) for its relationship to DV. The partial correlation and the part
correlation have the same value as the zero-order correlation at .70.

Slide 10

Change in variance explained when a second variable is included

At step 2, IV2 enters the model, increasing the total variance explained from .49 to .56, an
increase of .07. By itself, IV2 explained .16 of the variance in DV, but since it was itself
correlated with IV1, a portion of what it could explain had already been attributed to IV1.
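
These figures can be checked directly from the correlations given on slides 6 and 7. The
arithmetic below is my own verification using the standard two-predictor formulas; it is not
SPSS output.

import math

r_y1, r_y2, r_12 = 0.70, 0.40, 0.20   # DV-IV1, DV-IV2, and IV1-IV2 correlations

# R-squared with both predictors in the model
r2_full = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Part (semi-partial) correlation of IV2 when IV1 is already in the model
part_iv2 = (r_y2 - r_y1 * r_12) / math.sqrt(1 - r_12**2)

print(round(r2_full, 2))              # 0.56
print(round(r2_full - r_y1**2, 2))    # 0.07 increase over IV1 alone (.49)
print(round(part_iv2**2, 2))          # 0.07, the squared part correlation of IV2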

Slide 11

Differences in correlations when a second variable is entered

While the zero-order correlations do not change, both the partial and the part correlations
decrease.

Partial correlation represents the relationship between the dependent variable and an
independent variable when the relationship between the dependent variable and other
independent variables has been removed from the variance of both the dependent and
the independent variable.

Part (or semi-partial) correlation reflects the portion of the total variance in the dependent
variable that is explained by only that independent variable. The square of the part
correlation is the amount of change in R² produced by including this variable.
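
Both quantities can be computed by residualizing. The sketch below is a minimal
illustration in Python; the numpy arrays y, x, and control are hypothetical stand-ins for the
dependent variable, the independent variable of interest, and a single already-entered
variable.

import numpy as np

def partial_and_part(y, x, control):
    """Partial and part (semi-partial) correlations of x with y, removing one control."""
    def residuals(target, c):
        slope, intercept = np.polyfit(c, target, 1)
        return target - (intercept + slope * c)
    e_y = residuals(y, control)              # y with the control's influence removed
    e_x = residuals(x, control)              # x with the control's influence removed
    partial = np.corrcoef(e_y, e_x)[0, 1]    # control removed from both y and x
    part = np.corrcoef(y, e_x)[0, 1]         # control removed from x only
    return partial, part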

Slide 12

Zero-order, partial, and part correlations


The zero-order correlation is based on the relationship between the independent variable
and the dependent variable, ignoring all other independent variables.

The partial correlation for IV1 is the green area divided by the area in DV and IV1 that is
not part of IV2, i.e. green divided by green + yellow.

The part correlation for IV1 is the green area divided by all parts of DV, i.e. including the
areas associated with IV2.

NOTE: the diagrams are scaled to r² rather than r.

Slide 13

Zero-order, partial, and part correlations

The zero-order correlation is based on the relationship between the independent variable
and the dependent variable, ignoring all other independent variables.

The partial correlation for IV2 is the orange area divided by the area in DV and IV2 that is
not part of IV1, i.e. orange divided by orange + yellow.

The part correlation for IV2 is the orange area divided by all parts of DV, i.e. including the
areas associated with IV1.

Slide 14

How SPSS Stepwise Regression Chooses Variables - 1


We can use the table of
correlations to identify which
variable will be entered at the
first step of the stepwise
regression.

The table of Correlations shows that the variable with the strongest individual relationship
with the dependent variable is RACE OF HOUSEHOLD=WHITE, with a correlation of
-.247. Provided that the relationship between this variable and the dependent variable is
statistically significant, this will be the variable that enters first.

Slide 15

How SPSS Stepwise Regression Chooses Variables - 2

The correlation between RACE OF HOUSEHOLD=WHITE and importance of ethnic
group to R is statistically significant at p < .001. It will be the first variable entered into the
regression equation.

Slide 16

How SPSS Stepwise Regression Chooses Variables - 3

Model 1 contains the variable RACE OF HOUSEHOLD=WHITE, with a Multiple R of .247,
producing an R² of .061 (.247²), which is statistically significant at p < .001.

We cannot use the table of correlations to show which variable will be entered second,
since the variable entered second must take into account its correlation with the
independent variable entered first.

Slide 17

How SPSS Stepwise Regression Chooses Variables - 4

The table of Excluded Variables, however, shows the Partial Correlation between each
candidate for entry and the dependent variable. In this example, RACE OF
HOUSEHOLD=BLACK has the largest Partial Correlation (.252) and is statistically
significant at p < .001, so it will be entered on the next step.

Partial correlation is a measure of the relationship of the dependent variable to an
independent variable, where the variance explained by previously entered independent
variables has been removed from both.

Slide 18

How SPSS Stepwise Regression Chooses Variables - 5

As expected, Model 2 contains the variables RACE OF HOUSEHOLD=WHITE and
RACE OF HOUSEHOLD=BLACK. The R² for Model 2 increased by 0.059 to a total of
.120. The increase in R² was statistically significant at p < .001.

Slide 19

How SPSS Stepwise Regression Chooses Variables - 6

The increase in R² of .059 is the square of the Part Correlation for RACE OF
HOUSEHOLD=BLACK (.244² = 0.059). Part correlation, also referred to as semi-partial
correlation, is the unique relationship between this independent variable and the
dependent variable.

Slide 20

How SPSS Stepwise Regression Chooses Variables - 7

In the table of Excluded Variables for Model 2 (see the Sig. column and the Partial
Correlation column), the next largest partial correlation is HOW OFTEN R ATTENDS
RELIGIOUS SERVICES at .149.

This is the variable that will be added in Model 3 because the relationship is statistically
significant at p = .032.

Slide 21

How SPSS Stepwise Regression Chooses Variables - 8

As expected, Model 3 contains the variables RACE OF HOUSEHOLD=WHITE, RACE
OF HOUSEHOLD=BLACK, and HOW OFTEN R ATTENDS RELIGIOUS SERVICES.
The R² for Model 3 increased by 0.019 to a total of .140. The increase in R² was
statistically significant at p = .032.

Slide 22

How SPSS Stepwise Regression Chooses Variables - 9

In the table of Excluded Variables for Model 3 (see the Sig. column and the Partial
Correlation column), the next largest partial correlation is THINK OF SELF AS LIBERAL
OR CONSERVATIVE at .089.

However, the partial correlation is not significant (p = .203), so no additional variables will
be added to the model.

Slide 23

What SPSS Displays when Nothing is Significant

If none of the independent variables has a statistically significant relationship to the
dependent variable, SPSS displays an empty table for Variables Entered/Removed.

Slide 24

The Problem in BlackBoard - 1


The introductory problem statement tells us:
the data set to use: GSS2002_PrejudiceAndAltruism.SAV
the method for including variables in the regression
the dependent variable for the analysis
the list of independent variables that stepwise regression will select from

Slide 25

This Week's Problems

The problems this week take the 13 questions on prejudice from the general social survey
and explore the relationship of each to the demographic characteristics of age, education,
income, political views (conservative versus liberal), religiosity (attendance at church),
socioeconomic index, gender, and race.

I had no specific hypothesis about which demographic factors would be related to which
question on prejudice, beyond an expectation that race would be a significant contributor
to explaining differences on each of the questions.

My analyses were exploratory (to identify what demographic characteristics were
associated with different aspects of prejudice) and, thus, appropriate for stepwise
regression.

Slide 26

The Problem in BlackBoard - 2

In these problems, we will assume that our data satisfy the assumptions required by
multiple regression without explicitly testing them. We should recognize that failing to use
a needed transformation could preclude a variable from being selected as a predictor.
In your analyses, you would, of course, want to test for conformity to all of the
assumptions.

Slide 27

The Problem in BlackBoard - 3

The next sequence of specific instructions tells us whether each variable should be
treated as metric or non-metric, along with the reference category to use when
dummy-coding non-metric variables. Though we will not use the script to test for
assumptions, we can use it to do the dummy coding that we need for the problem.

Slide 28

The Problem in BlackBoard - 4

The next pair of instructions tells us the probability values to use for alpha, both for the
tests of statistical relationships and for the diagnostic tests.

Slide 29

The Problem in BlackBoard - 4

The final instruction tells us the random number seed to use in the validation analysis. If
you do not use this number for the seed, it is likely that you will get different results from
those shown in the feedback.

Slide 30

The Statement about Level of Measurement

The first statement in the problem asks about level of measurement. Stepwise multiple
regression requires that the dependent variable and the metric independent variables be
interval level, and that the non-metric independent variables be dummy-coded if they are
not dichotomous. The only way we would violate the level of measurement requirement
would be to use a nominal variable as the dependent variable, or to attempt to
dummy-code an interval level variable that was not grouped.

Slide 31

Marking the Statement about Level of Measurement - 1

Stepwise multiple regression requires that the dependent variable and the metric
independent variables be interval level, and that the non-metric independent variables be
dummy-coded if they are not dichotomous.

Mark the check box as a correct statement because:


"Importance of ethnic identity" [ethimp] is ordinal level, but the
problem calls for treating it as metric, applying the common
convention of treating ordinal variables as interval level.
The metric independent variable "age" [age] was interval level,
satisfying the requirement for independent variables.
The metric independent variable "highest year of school
completed" [educ] was interval level, satisfying the requirement
for independent variables.
"Income" [rincom98] is ordinal level, but the problem calls for
treating it as metric, applying the common convention of
treating ordinal variables as interval level.

Slide 32

Marking the Statement about Level of Measurement - 2

In addition:
"Description of political views" [polviews] is ordinal level, but the
problem calls for treating it as metric, applying the common
convention of treating ordinal variables as interval level.
"Frequency of attendance at religious services" [attend] is
ordinal level, but the problem calls for treating it as metric,
applying the common convention of treating ordinal variables as
interval level.
The metric independent variable "socioeconomic index" [sei]
was interval level, satisfying the requirement for independent
variables.
The non-metric independent variable "sex" [sex] was
dichotomous level, satisfying the requirement for independent
variables.
The non-metric independent variable "race of the household"
[hhrace] was nominal level, but will satisfy the requirement for
independent variables when dummy coded.

Slide 33

The Statement for Sample Size

The statement for sample size indicates that the available data satisfy the requirement.

Because of the tendency for stepwise regression to over-fit the data, we have a larger
sample size requirement, i.e. 40 cases per independent variable (Tabachnick and Fidell,
p. 117). To obtain the number of cases available for this analysis, we run the stepwise
regression.

Slide 34

Using the Script to Create Dummy-coded Variables - 1

Before we can run the stepwise regression, we need to dummy-code sex and race. We
will use the script to create the dummy-coded variables.

Select the Run Script command from the Utilities menu.

Slide 35

Using the Script to Create Dummy-coded Variables - 2

Navigate to the My Documents folder, if necessary.

Highlight the script file SatisfyingRegressionAssumptionsWithMetricAndNonMetricVariables.SBS.

Click on the Run button to open the script.

Slide 36

Using the Script to Create Dummy-coded Variables - 3

Move the non-metric variable "sex" [sex] to the Non-metric independent variables list
box.

With the variable highlighted, select the reference category, 2=FEMALE, from the
Reference category drop down menu.

Slide 37

Using the Script to Create Dummy-coded Variables - 4

Move the non-metric variable "race of the household" [hhrace] to the Non-metric
independent variables list box.

With the variable highlighted, select the reference category, 3=OTHER, from the
Reference category drop down menu.

The OK button to run the regression is deactivated until we select a dependent variable.

Slide 38

Using the Script to Create Dummy-coded Variables - 5

We select the dependent variable "importance of ethnic identity" [ethimp], though since
we are not going to interpret the output, we could select any variable.

To have the script save the dummy-coded variables, clear the check box Delete variables
created in this analysis.
Slide 39

Using the Script to Create Dummy-coded Variables - 6

Click on the OK button to run the regression, creating the dummy-coded variables as a
by-product.

Slide 40

The Dummy-Coded Variables in the Data Editor

If we scroll the variable list to the right, we see that the three dummy-coded variables
have been added to the data set.

Slide 41

Run the Stepwise Regression - 1

To run the regression, select Regression > Linear from the Analyze menu.

Slide 42

Run the Stepwise Regression - 2

Move the dependent variable "importance of ethnic identity" [ethimp] to the Dependent
text box.

Move the independent variables:
"age" [age],
"highest year of school completed" [educ],
"income" [rincom98],
"description of political views" [polviews],
"frequency of attendance at religious services" [attend],
"socioeconomic index" [sei],
"survey respondents who were male" [sex_1],
"survey respondents who were white" [hhrace_1],
"survey respondents who were black" [hhrace_2]
to the Independent(s) list box.

Slide 43

Run the Stepwise Regression - 3


The critical step to produce a
stepwise regression is the selection
of the method for entering
variables.

Select Stepwise from the


Method drop down menu.

Slide 44

Run the Stepwise Regression - 4

Click on the Statistics


button to specify additional
output.

Slide 45

Run the Stepwise Regression - 5

We mark the check boxes for


optional statistics:
R squared change,
Descriptives,
Part and partial
correlations,
Collinearity diagnostics, and
Durbin-Watson.

Click on the
Continue button to
close the dialog box.

Slide 46

Run the Stepwise Regression - 6

Click on the OK
button to produce
the output.
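
Outside SPSS, the final stepwise model can be checked with a short sketch like the one
below. It assumes the data set has been saved with the dummy-coded variables already
created by the script (the file name is the one given in the problem statement), and it fits
only the three predictors that the stepwise procedure will retain (identified on the following
slides) rather than re-running the selection itself.

import pandas as pd
import statsmodels.formula.api as smf

# Assumes hhrace_1, hhrace_2, and sex_1 were added by the script and saved in the file;
# convert_categoricals=False keeps the numeric codes rather than value labels.
df = pd.read_spss("GSS2002_PrejudiceAndAltruism.SAV", convert_categoricals=False)

# Fit the three predictors retained in Model 3 of the stepwise regression.
model = smf.ols("ethimp ~ hhrace_1 + hhrace_2 + attend", data=df).fit()
print(model.summary())   # coefficients, t tests, and R-squared for comparison with SPSS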

Slide 47

Answering the Sample Size Question

The analysis included 9 independent variables (6 metric independent variables plus 3
dummy-coded variables). The number of cases available for the analysis was 209, not
satisfying the requirement for 360 cases based on the rule of thumb that the number of
cases for stepwise multiple regression should be 40 times the number of independent
variables, as recommended by Tabachnick and Fidell (p. 117).

We should consider mentioning the sample size issue as a limitation of the analysis.

Slide 48

Marking the Statement for Sample Size

The check box is not marked because we did not satisfy the sample size requirement.

Slide 49

Statements about Variables Included in Stepwise Regression

Three statements in the problem list different combinations of the variables included in
the stepwise regression. To determine which is correct, we look at the table of Variables
Entered and Removed in the SPSS output.

Slide 50

Answering the Question about Variables Included in Stepwise Regression - 1

Three independent variables satisfied the statistical criteria for entry into the model. The
variable "survey respondents who were white" [hhrace_1] had the largest individual
impact on the dependent variable "importance of ethnic identity" [ethimp]. The second
variable included in the model was "survey respondents who were black" [hhrace_2]. The
third variable included in the model was "frequency of attendance at religious services"
[attend].

The column for Variables Removed is empty, telling us that no variables were removed
after being entered.

Slide 51

Marking the Statement about Variables Included in Stepwise Regression

Three independent variables satisfied the statistical criteria for entry into the model. The
variable "survey respondents who were white" [hhrace_1] had the largest individual
impact on the dependent variable "importance of ethnic identity" [ethimp]. The second
variable included in the model was "survey respondents who were black" [hhrace_2]. The
third variable included in the model was "frequency of attendance at religious services"
[attend].

We mark the check box for the first of the three statements.

Slide 52

Statement about the Strength of the Relationship

The next two statements focus on the strength of the overall relationship between the
dependent variable and the set of predictors that are selected in the stepwise entry of
variables. The statement assumes that the overall relationship will be statistically
significant, which will be true if any variables are selected for the model.

We will use Cohen's scale for assigning an adjective to the strength of the relationship:
less than .10 = trivial
.10 up to .30 = weak
.30 up to .50 = moderately strong
.50 or greater = strong

Slide 53

Statement about the Strength of the Relationship


Three independent variables satisfied the statistical criteria for inclusion in the model. We
interpret the results for the last step for all of the questions about statistical relationships
(Model 3 in this example).

Applying Cohen's criteria for effect size, the relationship was correctly characterized as
moderately strong (Multiple R = .374).

The overall relationship was statistically significant (F(3, 205) = 11.11, p < .001). The null
hypothesis that "all of the partial slopes (b coefficients) = 0" is rejected, supporting the
research hypothesis that "at least one of the partial slopes (b coefficients) is not equal to
0".
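
The reported F can be reproduced from the R² alone. The check below is my own
arithmetic using the rounded Model 3 R² of .14 with k = 3 predictors and n = 209 cases; it
is not part of the SPSS output.

k, n, r_sq = 3, 209, 0.14
f_stat = (r_sq / k) / ((1 - r_sq) / (n - k - 1))
print(round(f_stat, 2))   # about 11.12, consistent with the reported F(3, 205) = 11.11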

Slide 54

Marking the Statement about the Strength of the Relationship

The Multiple R of .374 translates to a moderately strong relationship, so we mark the
check box for the second statement on strength of the relationship.

Slide 55

Statements about Relationships to the Dependent Variable for Individual Predictors

The next set of statements focuses on individual relationships between predictors and
the dependent variable. In order for a statement to be true, the variable must have a
statistically significant individual relationship (i.e. it entered into the model), and the
direction of the relationship must be interpreted correctly.

Slide 56

Answering Question about Relationship of RACE OF HOUSEHOLD=WHITE


Again, we base our interpretation about statistical relationships on the last model for
variables entered, i.e. Model 3 for this problem.

The statement that "survey respondents who were white attached less importance to
ethnic identity compared to the average for all survey respondents" is correct. The
individual relationship between the independent variable "survey respondents who were
white" [hhrace_1] and the dependent variable "importance of ethnic identity" [ethimp] was
statistically significant, β = -.290, t(199) = -4.38, p < .001.

We reject the null hypothesis that the partial slope (b coefficient) for the variable "survey
respondents who were white" = 0 and conclude that the partial slope (b coefficient) for the
variable "survey respondents who were white" is not equal to 0. The negative sign of the
b coefficient (-0.518) means that survey respondents who were white attached less
importance to ethnic identity compared to the average for all survey respondents.

Slide 57

Marking the Statement about Relationship of RACE OF HOUSEHOLD=WHITE

Since the statement "survey respondents who were white attached less importance to
ethnic identity compared to the average for all survey respondents" is supported by our
statistical results, we mark the check box.

Slide 58

Answering Question about Relationship of RACE OF HOUSEHOLD=BLACK

The statement that "survey respondents who were black


attached greater importance to ethnic identity compared to
the average for all survey respondents" is correct. The
individual relationship between the independent variable
"survey respondents who were black" [hhrace_2] and the
dependent variable "importance of ethnic identity" [ethimp]
was statistically significant, = .225, t(199) = 3.37, p < .001.

We reject the null hypothesis that the partial slope (b coefficient)


for the variable "survey respondents who were black" = 0 and
conclude that the partial slope (b coefficient) for the variable
"survey respondents who were black" is not equal to 0. The
positive sign of the b coefficient (0.524) means that survey
respondents who were black attached greater importance to ethnic
identity compared to the average for all survey respondents.
Slide 59

Marking the Statement about Relationship of RACE OF HOUSEHOLD=BLACK

Since the statement "survey respondents who were black attached greater importance to
ethnic identity compared to the average for all survey respondents" is supported by our
statistical results, we mark the check box.

Since the previous statement was correct, this statement cannot be true, so the check
box is not marked.

Slide 60

Answering Question about Relationship of ATTEND RELIGIOUS SERVICES

The statement that "survey respondents who attended


religious services more often attached greater importance to
ethnic identity" is correct. The individual relationship between
the independent variable "frequency of attendance at religious
services" [attend] and the dependent variable "importance of
ethnic identity" [ethimp] was statistically significant, = .141,
t(199) = 2.16, p = .032.

We reject the null hypothesis that the partial slope (b coefficient) for the
variable "frequency of attendance at religious services" = 0 and conclude that
the partial slope (b coefficient) for the variable "frequency of attendance at
religious services" is not equal to 0. The positive sign of the b coefficient
(0.062) means that higher values of frequency of attendance at religious
services were associated with higher values of "importance of ethnic
identity".
Slide 61

Marking the Statement about Relationship of ATTEND RELIGIOUS SERVICES

Since the statement "survey respondents who attended religious services more often
attached greater importance to ethnic identity" is supported by our statistical results, we
mark the check box.

The following check box is not marked because the statement contradicts the finding we
have just made.

Slide 62

Answering Question about Relationship of AGE

The statement that "survey respondents who


were older attached greater importance to
ethnic identity" is not correct. The variable
"age" [age] was not among the list of
variables included in the stepwise model.

Slide 63

Marking the Statement for Age

The check box for the statement for age is not marked because the variable did not enter
the model in the stepwise regression.

Slide 64

Statement about Cross-validation

The findings from our analysis are generalizable to the extent that they are applicable to
cases not included in the analysis. Since we cannot collect new cases, we will divide our
sample into two subsets, using one subset to create the model and testing the findings on
the second subset of cases, which were not included in the analysis that created the
model.

The final statement concerns the generalizability of our findings to the larger population.
To answer this question, we will do a 75/25% cross-validation.

Slide 65

Creating the Training Sample and the Validation Sample - 1


The 75/25% cross-validation requires that we randomly divide the cases for this analysis
into two parts: 75% of the cases will be used to run the stepwise regression (the training
sample), which will be tested for accuracy on the remaining 25% of the cases (the
validation sample).

To set the seed for the random number generator, select Random Number Generator
from the Transform menu.

NOTE: you must use the random number seed that is stated in the problem in order to
produce the same results that I found. Any other seed will generate a different random
sequence that can produce results that are very different from mine.

Slide 66

Creating the Training Sample and the Validation Sample - 2

First, mark the check box for Set Starting Point. Second, select the option button for a
Fixed Value. Third, type the seed number provided in the problem directions: 726201.
Fourth, click on the OK button to complete the action.

NOTE: SPSS does not provide any feedback that the seed has been set or changed. If
you are in doubt, you can reopen the dialog box and see what it indicates.

Slide 67

Creating the Training Sample and the Validation Sample - 3


We will create a variable that will contain the information about whether a case is in the
training sample or the validation sample. We will name this variable split and use a value
of 1 to indicate the training sample and a value of 0 to indicate the validation sample.

To create the new variable, select Compute from the Transform menu.

Slide 68

Creating the Training Sample and the Validation Sample - 4

Type the name of the new variable, split, in the Target Variable text box. Type the formula
as shown in the Numeric Expression text box.

The formula uses the SPSS UNIFORM function to create a uniform distribution of decimal
numbers between 0 and 1. If the generated number for a case is less than or equal to
0.75, the statement in the text box is true and the split variable will be assigned a 1 for
that case. If the generated number is larger than 0.75, the statement is false and the
case will be assigned a 0 for split.

Click on the OK
button to create
the variable.
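
For comparison, the same kind of seeded 75/25 split can be produced outside SPSS with
the sketch below. numpy's generator is not the generator SPSS uses, so the 1/0
assignments will not match SPSS case for case; the seed value and the 0.75 cutoff are the
ones given in the problem, and df stands for the data set loaded into a pandas DataFrame.

import numpy as np

rng = np.random.default_rng(726201)                 # seed from the problem directions
uniform_draws = rng.uniform(size=len(df))           # one draw per case, between 0 and 1
df["split"] = (uniform_draws <= 0.75).astype(int)   # 1 = training case, 0 = validation case
print(df["split"].mean())                           # roughly, not exactly, 0.75
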
Slide 69

Creating the Training Sample and the Validation Sample - 5

If we scroll the data editor window to the right, we see the split variable in a new column.

Slide 70

Creating the Training Sample and the Validation Sample - 6

If we create a frequency distribution for the split variable, we see that the breakdown is
approximately, not exactly, correct. This is a consequence of generating random
numbers: we have no control over the sequence that is generated beyond setting an
initial seed.

Though I have done it to create specific results for homework problems, it is not
acceptable to run repeated series of random numbers until one gets a sequence that has
desirable properties.

Slide 71

An Additional Task before Running the Stepwise Regression on the Training Sample

Before we run the regression on the training sample, we need an additional step that will
enable us to compare the accuracy of the model for the training sample to the accuracy of
the model for the validation sample, using the R2 for each as our measure of accuracy.

We need to exclude from the analysis cases that are missing data for any of the variables
that we have designated as candidates for inclusion. If we don't specifically do this, SPSS
may include different cases in predicting values for the dependent variable than it does in
determining which variables to include in the model.

In model building, SPSS does listwise exclusion of missing data and omits any cases that
have missing data for any variable. In predicting scores on the dependent variable, it
excludes cases that are missing data for only the variables included in the stepwise model.
Thus, when selecting variables, SPSS assumes that only respondents who answer all
questions are valid cases; in predicting scores, it assumes that failing to answer a question
on a variable that is not included has no importance in the analysis.

Slide 72

Selecting Cases with Valid Data for All Variables in the Analysis - 1

To include only those cases that have valid data for all variables in the analysis, choose
the Select Cases command from the Data menu.

Slide 73

Selecting Cases with Valid Data for All Variables in the Analysis - 2
First, mark the
option button for If
condition is
satisfied.

Second, click on
the If button to
add the
condition.

Slide 74

Selecting Cases with Valid Data for All Variables in the Analysis - 3

Type NMISS(ethimp,age,educ,rincom98,polviews,attend,sei,sex_1,hhrace_1,hhrace_2) = 0
in the condition text box. In the parentheses, we type the names of the dependent
variable and all of the independent variables.

The SPSS NMISS function counts the number of variables in the list that have missing
data. Telling SPSS to include cases for which this calculation results in 0 indicates that
the case was not missing data for any of the variables.
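
The same listwise filter can be expressed outside SPSS as a sketch (df again stands for
the data set in a pandas DataFrame; a case is kept only when none of the listed variables
is missing, mirroring the NMISS(...) = 0 condition):

variables = ["ethimp", "age", "educ", "rincom98", "polviews",
             "attend", "sei", "sex_1", "hhrace_1", "hhrace_2"]
complete_cases = df[df[variables].notna().all(axis=1)]   # keep rows with no missing values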

Slide 75

Selecting Cases with Valid Data for All Variables in the Analysis - 4

Click on the
Continue button to
close the dialog box.

Slide 76

Selecting Cases with Valid Data for All Variables in the Analysis - 5

Click on the OK
button to
execute the
command.

Slide 77

Selecting Cases with Valid Data for All Variables in the Analysis - 6

The excluded cases have a slash through the case number.

Slide 78

Run the Stepwise Regression on the Training Sample - 1

To run the regression, select Regression > Linear from the Analyze menu.

Slide 79

Run the Stepwise Regression on the Training Sample - 2

Move the dependent variable "importance of ethnic identity" [ethimp] to the Dependent
text box.

Move the independent variables:
"age" [age],
"highest year of school completed" [educ],
"income" [rincom98],
"description of political views" [polviews],
"frequency of attendance at religious services" [attend],
"socioeconomic index" [sei],
"survey respondents who were male" [sex_1],
"survey respondents who were white" [hhrace_1],
"survey respondents who were black" [hhrace_2]
to the Independent(s) list box.

Slide 80

Run the Stepwise Regression on the Training Sample - 3


The critical steps to produce a stepwise regression on the training sample are the
selection of the stepwise method for entering variables and the inclusion of the training
sample cases.

Select Stepwise from the Method drop down menu.

Slide 81

Run the Stepwise Regression on the Training Sample - 4


To select the training sample, we move the split variable to the Selection Variable text
box.

First, highlight the split variable. Second, click on the right arrow button to the left of the
Selection Variable text box.

Slide 82

Run the Stepwise Regression on the Training Sample - 5

Click on the Rule button to specify the value that we want split to use to select cases.

Slide 83

Run the Stepwise Regression on the Training Sample - 6

First, type 1 in the Value text box. Recall that this is the value of split indicating training
cases.

Second, click on the Continue button to close the dialog box.

Slide 84

Run the Stepwise Regression on the Training Sample - 7

Click on the Statistics button to specify additional output.

Slide 85

Run the Stepwise Regression on the Training Sample - 8


We mark the check boxes for
optional statistics:
R squared change,
Descriptives,
Part and partial
correlations,
Collinearity diagnostics, and
Durbin-Watson.

Click on the
Continue button to
close the dialog box.

Slide 86

Run the Stepwise Regression on the Training Sample - 9

We mark the check boxes for


optional statistics:
R squared change,
Descriptives,
Part and partial
correlations,
Collinearity diagnostics, and
Durbin-Watson.

Click on the
Continue button to
close the dialog box.

Slide 87

Run the Stepwise Regression on the Training Sample - 10

Click on the OK
button to produce
the output.

Slide 88

Validating the Model - 1

The first step in our validation is to make certain that the model based on the training
sample reasonably approximates the model based on the full sample. Here we see that
both models included 3 variables.

If the number of models (steps) were different, the validation would fail.

Slide 89

Validating the Model - 2

Second, we verify that the model based on the training sample included the same three
variables as the model based on the full data set. We do not require that the variables be
entered in the same order, as the difference in samples can easily result in small shifts.

The same variables that entered into the stepwise regression using the full sample also
entered into the stepwise regression of the training sample ("frequency of attendance at
religious services" [attend], "survey respondents who were black" [hhrace_2], and
"survey respondents who were white" [hhrace_1]).

Slide 90

Validating the Model - 3


Third, we compare the accuracy of the model for the validation sample to the accuracy of
the model for the training sample.

We have to calculate the R² for the validation sample (split ~= 1.0) by hand from the
Multiple R: .402² = .162. The R² for the 75% training sample was 0.131 and the R² for the
25% validation sample was 0.162, resulting in a value of .131 - .162 = -.031 for
shrinkage. Since -.031 is <= .02, the validation is successful.

If the shrinkage were greater than .02 (2%), the validation fails.

Slide 91

Marking the Check Box for the Cross-validation Statement

The validation analysis supported the generalizability of the findings of the analysis to the
population represented by the sample in the data set. We mark the check box for the
validation.
Slide 92

The Question Graded in Blackboard

When the problem was submitted, BlackBoard confirmed that all marked answers were
correct.

Slide 93

Logic Diagram for Solving Homework Problems: Level of Measurement

Run the script to dummy-code non-metric variables, if needed.

Level of measurement ok?
No: Do not mark the check box. Mark "Inappropriate application of the statistic." Stop.
Yes: continue.

Ordinal level variable treated as metric?
Yes: Consider the limitation in the discussion of findings, then continue.
No: continue.

Run the stepwise regression.

Slide 94

Logic Diagram for Solving Homework Problems: Sample Size and Overall Relationship

Sample size ok (number of IVs x 40)?
No: Do not mark the check box. Consider the limitation in the discussion of findings.
Yes: Mark the check box for correct sample size.

1+ variables entered in the model? (The model will be statistically significant if any
variables entered.)
No: Stop (no significant predictors).
Yes: continue.

Model is not trivial (Multiple R >= .10)?
No: Stop (model is not usable).
Yes: continue.

Slide 95

Logic Diagram for Solving Homework Problems: Strength of Overall Relationship

Subset of entered variables correctly identified?
No: Do not mark the check box for the correct subset.
Yes: Mark the check box for the correct subset.

Strength of model correctly characterized?
No: Do not mark the check box.
Yes: Mark the check box for the correct strength.

Slide 96

Logic Diagram for Solving Homework Problems: Individual Relationships

Variable entered and not removed?
No: Do not mark the check box for the individual relationship.
Yes: Correct interpretation of the direction of the relationship?
No: Do not mark the check box for the individual relationship.
Yes: Mark the check box for the individual relationship.

Additional variables entered?
Yes: repeat for the next variable.
No: done.

Slide 97

Logic Diagram for Solving Homework Problems: Cross-validation

Create the split variable using the specified seed.

Select cases with no missing values for all variables.

Run the stepwise regression on the training sample.

Same variables entered as in the full model?
No: Do not mark the check box for supporting validation.
Yes: Shrinkage < or = 2%?
No: Do not mark the check box for supporting validation.
Yes: Mark the check box for supporting validation.

Slide 98
