
# SW388R7 Data Analysis & Computers II Slide 1

Stepwise Multiple Regression

Differences between stepwise and other methods of multiple regression

Sample problem

Steps in stepwise multiple regression

Homework Problems

SW388R7 Data Analysis & Computers II Slide 2

Types of multiple regression 

Different types of multiple regression are distinguished by the method for entering the independent variables into the analysis. In standard (or simultaneous) multiple regression, all of the independent variables are entered into the analysis at the same time. In hierarchical (or sequential) multiple regression, the independent variables are entered in an order prescribed by the analyst. In stepwise (or statistical) multiple regression, the independent variables are entered according to their statistical contribution in explaining the variance in the dependent variable. No matter what method of entry is chosen, a multiple regression that includes the same independent variables and the same dependent variable will produce the same multiple regression equation.

SW388R7 Data Analysis & Computers II Slide 3

Stepwise multiple regression 

Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the dependent variable. Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables. The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R² using any of the variables not yet included. Since variables will not be added to the regression equation unless they make a statistically significant addition to the analysis, all of the independent variables selected for inclusion will have a statistically significant relationship to the dependent variable.
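The selection loop described above can be sketched in plain Python. This is a minimal illustration, not the SPSS procedure: it only adds variables (the removal step is omitted), it uses a fixed F-to-enter cutoff (`f_to_enter=3.84`, an assumed value standing in for the probability-of-F criterion), and the helper names and toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R-squared of a least squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum((y[i] - sum(b * x for b, x in zip(beta, design[i]))) ** 2 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(predictors, y, f_to_enter=3.84):
    """Add one predictor per step while the F test of the R-squared change passes."""
    n = len(y)
    selected, r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        # Try each remaining variable and keep the one that maximizes R-squared.
        trials = {name: r_squared([predictors[s] for s in selected + [name]], y)
                  for name in remaining}
        best = max(trials, key=trials.get)
        df_resid = n - len(selected) - 2  # n - (predictors after entry) - 1
        f_change = (trials[best] - r2) / ((1.0 - trials[best]) / df_resid)
        if f_change < f_to_enter:
            break  # no remaining variable makes a sufficient improvement
        selected.append(best)
        remaining.remove(best)
        r2 = trials[best]
    return selected, r2


# Hypothetical toy data: y depends on x1; x2 adds nothing once x1 is included.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [1, -1, -1, 1, 1, -1, -1, 1]
y = [2 * a + e for a, e in zip(x1, noise)]
selected, r2 = forward_stepwise({"x1": x1, "x2": x2}, y)
print(selected, round(r2, 3))
```

In this toy run only `x1` should enter, because `x2` produces essentially no R² change once `x1` is in the equation.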

SW388R7 Data Analysis & Computers II Slide 4

Differences in statistical outputs

Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new step or model, i.e. there will be one model and result for each variable included in the analysis. SPSS provides a table of variables included in the analysis and a table of variables excluded from the analysis. It is possible that none of the variables will be included. It is possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance. Once a variable is included, its interpretation in stepwise regression is the same as it would be using other methods for including regression variables.

SW388R7 Data Analysis & Computers II Slide 5

Differences in solving stepwise regression problems

The level of significance for the analysis is included in the specifications for the statistical analysis. While we will use 0.05 as the level of significance for our problems, a different level of significance can be entered in the SPSS Options dialog box.

The preferred sample size requirement is larger for stepwise regression, i.e. 50 x the number of independent variables.

While multicollinearity for all variables can be examined, it is really only a problem for the variables not included in the analysis. If a variable is included in the stepwise analysis, it will not have a collinear relationship.

Stepwise procedures are notorious for over-fitting the sample to the detriment of generalizability. Validation analysis is absolutely necessary. If generalizability is compromised, it is permissible to interpret the variables included in the 75% training analysis (though we will not do this in our problems).

SW388R7 Data Analysis & Computers II Slide 6

A stepwise regression problem

When the problem asks us to identify the best set of predictors, we will do stepwise multiple regression. Multiple regression is feasible if the dependent variable is metric and the independent variables (both predictors and controls) are metric or dichotomous, and the available data is sufficient to satisfy the sample size requirements.

SW388R7 Data Analysis & Computers II Slide 7

Level of measurement - answer

Stepwise multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous. True with caution is the correct answer.

SW388R7 Data Analysis & Computers II Slide 8

Sample size - question

The second question asks about the sample size requirements for multiple regression. To answer this question, we will run the initial or baseline multiple regression to obtain some basic data about the problem and solution.

SW388R7 Data Analysis & Computers II Slide 9

The baseline regression - 1

After we check for violations of assumptions and outliers, we will make a decision whether we should interpret the model that includes the transformed variables and omits outliers (the revised model), or whether we will interpret the model that uses the untransformed variables and includes all cases including the outliers (the baseline model).

In order to make this decision, we run the baseline regression before we examine assumptions and outliers, and record the R² for the baseline model. If using transformations and omitting outliers substantially improves the analysis (a 2% increase in R²), we interpret the revised model. If the increase is smaller, we interpret the baseline model.

To run the baseline model, select Regression | Linear... from the Analyze menu.
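The decision rule between the baseline and revised models can be written as a small function. This is only a sketch: it assumes the "2% increase" means an absolute gain of 0.02 in R², and the function name and example values are hypothetical.

```python
def choose_model(r2_baseline, r2_revised, min_gain=0.02):
    """Pick the model to interpret under the 2% R-squared improvement rule.

    Assumption: the threshold is an absolute gain in R-squared (0.02),
    not a relative gain.
    """
    gain = r2_revised - r2_baseline
    return "revised" if gain >= min_gain else "baseline"


# Hypothetical R-squared values for illustration only.
print(choose_model(0.25, 0.30))
print(choose_model(0.25, 0.26))
```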

SW388R7 Data Analysis & Computers II Slide 10

The baseline regression - 2

First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

SW388R7 Data Analysis & Computers II Slide 11

The baseline regression - 3

Click on the Statistics... button to specify the statistics options that we want.

SW388R7 Data Analysis & Computers II Slide 12

The baseline regression - 4

First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us the contribution of each additional variable that the stepwise procedure adds to the analysis.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Fifth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 13

The baseline regression - 5

Next, we need to specify the statistical criteria to use for including variables in the analysis. Click on the Options button.

SW388R7 Data Analysis & Computers II Slide 14

The baseline regression - 6

First, the default level of significance for entering variables to the regression equation is .05. Since that is the alpha level for our problem we do not need to make any change. The criterion for removing a variable from the analysis is usually set at twice the level for including variables.

Second, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 15

The baseline regression - 7

Click on the OK button to request the regression output.

SW388R7 Data Analysis & Computers II Slide 16

R² for the baseline model

Prior to any transformations of variables to satisfy the assumptions of multiple regression or the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 25.7%. The R² of 0.257 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.

In stepwise regression, the model number corresponds to the number of variables included in the stepwise analysis. Two variables are included in this problem.

In stepwise regression, the relationship will always be significant if any variables are included because the variables can only be included if they contributed to a statistically significant relationship.

SW388R7 Data Analysis & Computers II Slide 17

Sample size evidence and answer

[SPSS output: Descriptive Statistics table (Mean, Std. Deviation, N) for RINCOM98, HRS1, WRKSLF, and PRESTG80; N = 145 for each variable]

Stepwise multiple regression requires that the minimum ratio of valid cases to independent variables be at least 5 to 1. The ratio of valid cases (145) to number of independent variables (3) was 48.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.

However, the ratio of 48.3 to 1 did not satisfy the preferred ratio of 50 cases per independent variable. A caution should be added to the interpretation of the analysis and validation analysis should be conducted.

True with caution is the correct answer.
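The sample size arithmetic can be checked with a few lines of Python. The function name is hypothetical; the 5 to 1 minimum, the 50 to 1 preferred ratio, and the counts (145 valid cases, 3 independent variables) are the ones stated in the slide.

```python
def sample_size_check(valid_cases, n_predictors, minimum=5, preferred=50):
    """Return the cases-to-variables ratio and whether each threshold is met."""
    ratio = valid_cases / n_predictors
    return ratio, ratio >= minimum, ratio >= preferred


ratio, meets_minimum, meets_preferred = sample_size_check(145, 3)
print(f"{ratio:.1f} to 1", meets_minimum, meets_preferred)
```

Meeting the minimum but not the preferred ratio corresponds to the "true with caution" answer.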

SW388R7 Data Analysis & Computers II Slide 18

Assumption of normality for the dependent variable - question

Having satisfied the level of measurement and sample size requirements, we turn our attention to conformity with three of the assumptions of multiple regression: normality, linearity, and homoscedasticity. First, we will evaluate the assumption of normality for the dependent variable.

SW388R7 Data Analysis & Computers II Slide 19

Run the script to test normality

First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement.

Second, click on the Assumption of Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality.

Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption.

Fourth, click on the OK button to produce the output.

SW388R7 Data Analysis & Computers II Slide 20

Normality of the dependent variable: respondent's income

[SPSS output: Descriptives table for RESPONDENTS INCOME (mean with 95% confidence interval, 5% trimmed mean, median, variance, std. deviation, minimum, maximum, range, interquartile range, skewness, kurtosis)]

The dependent variable "income" [rincom98] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.686) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.253) was between -1.0 and +1.0.

True is the correct answer.

SW388R7 Data Analysis & Computers II Slide 21

Normality of the independent variable: hrs1

Next, we will evaluate the assumption of normality for the independent variable, number of hours worked in the past week.

SW388R7 Data Analysis & Computers II Slide 22

Normality of the independent variable: number of hours worked in the past week

[SPSS output: Descriptives table for NUMBER OF HOURS WORKED LAST WEEK (mean with 95% confidence interval, 5% trimmed mean, median, variance, std. deviation, minimum, maximum, range, interquartile range, skewness, kurtosis)]

The independent variable "number of hours worked in the past week" [hrs1] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.324) was between -1.0 and +1.0 and the kurtosis of the distribution (0.935) was between -1.0 and +1.0.

True is the correct answer.

SW388R7 Data Analysis & Computers II Slide 23

Normality of the independent variable: prestg80

Finally, we will evaluate the assumption of normality for the independent variable "occupational prestige score" [prestg80].

SW388R7 Data Analysis & Computers II Slide 24

Normality of the second independent variable: occupational prestige score

[SPSS output: Descriptives table for RS OCCUPATIONAL PRESTIGE SCORE (1980) (mean with 95% confidence interval, 5% trimmed mean, median, variance, std. deviation, minimum, maximum, range, interquartile range, skewness, kurtosis)]

The independent variable "occupational prestige score" [prestg80] satisfied the criteria for a normal distribution. The skewness of the distribution (0.401) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.630) was between -1.0 and +1.0.

True is the correct answer.
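The normality criterion applied to the three metric variables can be expressed as a simple check. The function name is hypothetical, and treating the -1.0 and +1.0 bounds as inclusive is an assumption; the skewness and kurtosis values are the ones reported in the slides.

```python
def is_approximately_normal(skewness, kurtosis, limit=1.0):
    """Apply the rule of thumb: both statistics must lie between -limit and +limit.

    Assumption: the bounds themselves count as acceptable.
    """
    return -limit <= skewness <= limit and -limit <= kurtosis <= limit


# (skewness, kurtosis) as reported for each metric variable.
reported = {
    "rincom98": (-0.686, -0.253),
    "hrs1": (-0.324, 0.935),
    "prestg80": (0.401, -0.630),
}
results = {name: is_approximately_normal(s, k) for name, (s, k) in reported.items()}
print(results)
```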

SW388R7 Data Analysis & Computers II Slide 25

Assumption of linearity for respondent's income and number of hours worked last week - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

SW388R7 Data Analysis & Computers II Slide 26

Run the script to test linearity

First, click on the Assumption of Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity. When the linearity option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

SW388R7 Data Analysis & Computers II Slide 27

Linearity test: respondent's income and number of hours worked last week

[SPSS output: Correlations table (Pearson Correlation, Sig. (2-tailed), N) for RESPONDENTS INCOME, NUMBER OF HOURS WORKED LAST WEEK, and three transformations of HRS1: Logarithm of Reflected Values [LG10(81-HRS1)], Square Root of Reflected Values [SQRT(81-HRS1)], and Inverse of Reflected Values [-1/(81-HRS1)]. Correlation is significant at the 0.01 level (2-tailed).]

The correlation between "number of hours worked in the past week" and "income" was statistically significant (r=.337, p<0.001). A linear relationship exists between these variables.

True is the correct answer.

SW388R7 Data Analysis & Computers II Slide 28

Assumption of linearity for respondent's income and occupational prestige score - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

SW388R7 Data Analysis & Computers II Slide 29

Linearity test: respondent's income and occupational prestige score

[SPSS output: Correlations table (Pearson Correlation, Sig. (2-tailed), N) for RESPONDENTS INCOME, RS OCCUPATIONAL PRESTIGE SCORE (1980), and three transformations of PRESTG80: Logarithm [LG10(PRESTG80)], Square Root [SQRT(PRESTG80)], and Inverse [-1/(PRESTG80)]. Correlation is significant at the 0.01 level (2-tailed).]

The correlation between "occupational prestige score" and "income" was statistically significant (r=.440, p<0.001). A linear relationship exists between these variables.

True is the correct answer.
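The correlation table in the linearity test compares the dependent variable with the raw predictor and with its logarithm, square root, and inverse. A minimal sketch of that comparison follows; it is not the course script, the function and dictionary names are hypothetical, and the toy data are invented so that the square root transformation fits perfectly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# The raw variable plus the three transformations named in the table.
transformations = {
    "raw": lambda v: v,
    "log10": math.log10,
    "sqrt": math.sqrt,
    "inverse": lambda v: -1.0 / v,
}

# Hypothetical toy data: y is exactly the square root of x.
x = [1, 4, 9, 16, 25]
y = [1, 2, 3, 4, 5]
correlations = {name: pearson_r([f(v) for v in x], y)
                for name, f in transformations.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```

In the slides the raw correlation was already statistically significant, so no transformation was needed; the comparison matters only when the raw relationship is not linear.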

SW388R7 Data Analysis & Computers II Slide 30

Assumption of homogeneity of variance - question

Self-employment is the only dichotomous independent variable in the analysis. We will test it for homogeneity of variance using income as the dependent variable.

SW388R7 Data Analysis & Computers II Slide 31

Run the script to test homogeneity of variance

First, click on the Assumption of Homogeneity option button to request that SPSS produce the output needed to evaluate the assumption of homogeneity of variance. When the homogeneity of variance option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

SW388R7 Data Analysis & Computers II Slide 32

Assumption of homogeneity of variance

Based on the Levene Test, the variance in "income" [rincom98] is homogeneous for the categories of "self-employment" [wrkslf]. The probability associated with the Levene Statistic (p=0.076) is greater than the level of significance (0.01), so we fail to reject the null hypothesis that the variance is equal across groups, and conclude that the homoscedasticity assumption is satisfied.

The homogeneity of variance assumption was satisfied. True is the correct answer.

SW388R7 Data Analysis & Computers II Slide 33

Detection of outliers - question

In multiple regression, an outlier in the solution can be defined as a case that has a large residual because the equation did a poor job of predicting its value.

We will run the baseline regression again and have SPSS compute the standardized residual for each case. Cases with a standardized residual larger than +/-3.0 will be treated as outliers.

SW388R7 Data Analysis & Computers II Slide 34

Re-running the baseline regression - 1

Having decided to use the baseline model for the interpretation of this analysis, the SPSS regression output was re-created.

To run the baseline model, select Regression | Linear... from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 35

Re-running the baseline regression - 2

First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

SW388R7 Data Analysis & Computers II Slide 36

Re-running the baseline regression - 3

Click on the Statistics... button to specify the statistics options that we want.

SW388R7 Data Analysis & Computers II Slide 37

Re-running the baseline regression - 4

First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us whether or not the variables added after the controls have a relationship to the dependent variable.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the checkbox for the Casewise diagnostics, which will be used to identify outliers.

Fifth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Sixth, click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 38

Re-running the baseline regression - 5

Click on the Save button to save the standardized residuals to the data editor.

SW388R7 Data Analysis & Computers II Slide 39

Re-running the baseline regression - 6

Mark the checkbox for Standardized Residuals so that SPSS saves a new variable in the data editor. We will use this variable to omit outliers in the revised regression model.

Click on the Continue button to close the dialog box.

SW388R7 Data Analysis & Computers II Slide 40

Re-running the baseline regression - 7

Click on the OK button to request the regression output.

SW388R7 Data Analysis & Computers II Slide 41

Outliers in the analysis

If cases have a standardized residual larger than +/-3.0, SPSS creates a table titled Casewise Diagnostics, in which it lists the cases and values that result in their being an outlier. If there are no outliers, SPSS does not print the Casewise Diagnostics table. There was no table for this problem. The answer to the question is true.

We can verify that all standardized residuals were less than +/-3.0 by looking at the minimum and maximum standardized residuals in the table of Residual Statistics. Both the minimum and maximum fell in the acceptable range.

Since there were no outliers, the correct answer is true.
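The outlier rule amounts to flagging any case whose standardized residual exceeds 3.0 in absolute value. A minimal sketch, with a hypothetical function name and invented residuals standing in for the saved standardized residual variable:

```python
def find_outliers(standardized_residuals, cutoff=3.0):
    """Return the positions of cases whose |standardized residual| exceeds the cutoff."""
    return [i for i, z in enumerate(standardized_residuals) if abs(z) > cutoff]


# Hypothetical standardized residuals, for illustration only.
zresid = [-2.41, 0.37, 1.05, -0.88, 2.96]
outliers = find_outliers(zresid)
print(outliers, min(zresid), max(zresid))
```

An empty list corresponds to the case in the slides where no Casewise Diagnostics table is printed; checking the minimum and maximum mirrors the Residual Statistics check.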

SW388R7 Data Analysis & Computers II Slide 42

Selecting the model to interpret - question

Since there were no transformations used and there were no outliers, we can use the baseline regression for our interpretation. The correct answer is false.

SW388R7 Data Analysis & Computers II Slide 43

Assumption of independence of errors - question

We can now check the assumption of independence of errors for the analysis we will interpret.

SW388R7 Data Analysis & Computers II Slide 44

Assumption of independence of errors: evidence and answer

[SPSS output: Model Summary, dependent variable RINCOM98. Model 1 (predictors: constant, PRESTG80): R = .424, R Square = .180, Adjusted R Square = .174, R Square Change = .180, F Change = 31.350, df1 = 1, df2 = 143, Sig. F Change = .000. Model 2 (predictors: constant, PRESTG80, HRS1): R = .507, R Square = .257, Adjusted R Square = .247, R Square Change = .077, F Change = 14.788, df1 = 1, df2 = 142, Sig. F Change = .000. Durbin-Watson = 1.866.]

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals or differences between the actual score for a case and the score estimated by the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50.

The Durbin-Watson statistic for this problem is 1.866 which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors. True is the correct answer.

If the Durbin-Watson statistic was not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.

SW388R7 Data Analysis & Computers II Slide 45

Multicollinearity - question

The final condition that can have an impact on our interpretation is multicollinearity.

SW388R7 Data Analysis & Computers II Slide 46

Multicollinearity evidence and answer

Multicollinearity occurs when one independent variable is so strongly correlated with one or more other independent variables that its separate contribution cannot be estimated reliably.

Since multicollinearity will result in a variable not being included in the analysis, our examination of tolerances focuses on the table of excluded variables.

The tolerance values for all of the independent variables are larger than 0.10: "number of hours worked in the past week" [hrs1] (.954), "self-employment" [wrkslf] (.979) and "occupational prestige score" [prestg80] (.954).

Multicollinearity is not a problem in this regression analysis. True is the correct answer.
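Tolerance is one minus the R² obtained when a predictor is regressed on the other predictors, and the variance inflation factor (VIF) is its reciprocal. The sketch below applies the 0.10 cutoff to the tolerances reported above; the function names are hypothetical.

```python
def tolerance(r2_with_other_predictors):
    """Tolerance = 1 - R-squared of this predictor regressed on the other predictors."""
    return 1.0 - r2_with_other_predictors


def vif(tol):
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1.0 / tol


# Tolerance values reported in the slide.
tolerances = {"hrs1": 0.954, "wrkslf": 0.979, "prestg80": 0.954}
for name, tol in tolerances.items():
    print(name, tol > 0.10, round(vif(tol), 3))
```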

SW388R7 Data Analysis & Computers II Slide 47

Overall relationship between dependent variable and independent variables - question

The first finding we want to confirm concerns the overall relationship between the dependent variable and one or more of the independent variables.

SW388R7 Data Analysis & Computers II Slide 48

Overall relationship between dependent variable and independent variables evidence and answer 1

[SPSS output: ANOVA table (Sum of Squares, df, Mean Square, F, Sig.) for models 1 and 2; dependent variable RINCOM98; model 1 predictors: constant, PRESTG80; model 2 predictors: constant, PRESTG80, HRS1]

Stepwise multiple regression was performed to identify the best predictors of the dependent variable "income" [rincom98] among the independent variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "occupational prestige score" [prestg80].

Based on the results in the ANOVA table (F(2, 142) = 24.581, p<0.001), there was an overall relationship between the dependent variable "income" [rincom98] and one or more of independent variables. Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the Multiple R for all independent variables was equal to 0 was rejected. The purpose of the analysis, to identify a relationship between some of independent variables and the dependent variable, was supported.

SW388R7 Data Analysis & Computers II Slide 49

Overall relationship between dependent variable and independent variables evidence and answer 2

The Multiple R for the relationship between the independent variables included in the analysis and the dependent variable was 0.507, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

The relationship between the independent variables and the dependent variable was correctly characterized as moderate.

Caution in interpreting the relationship should be exercised because of inclusion of ordinal variables, and cases to variables ratio less than 50:1. True with caution is the correct answer.
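The rule of thumb for describing the strength of a correlation maps directly onto a chain of comparisons. The function name is hypothetical, and taking the absolute value (so negative correlations are graded by magnitude) is an assumption not stated in the slide.

```python
def describe_strength(r):
    """Label a correlation using the rule-of-thumb cut points from the slide."""
    size = abs(r)  # assumption: strength is judged on magnitude, ignoring sign
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.507))
```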

SW388R7 Data Analysis & Computers II Slide 50

Best subset of predictors - question

The next finding concerns the list of independent variables that are statistically significant.

SW388R7 Data Analysis & Computers II Slide 51

Best subset of predictors evidence and answer

[SPSS output: Coefficients table for models 1 and 2 (Unstandardized Coefficients B and Std. Error, Standardized Coefficients Beta, t, Sig., Collinearity Statistics Tolerance and VIF); dependent variable RINCOM98]

The best predictors of scores for the dependent variable "income" [rincom98] were "occupational prestige score" [prestg80], and "number of hours worked in the past week" [hrs1].

The variable "number of hours worked in the past week" [hrs1] was not included in the list of predictors in the question, so false is the correct answer.

SW388R7 Data Analysis & Computers II Slide 52

Relationship of the first independent variable and the dependent variable - question

In the stepwise regression problems, we will focus on the entry order of the independent variables and the interpretation of individual relationships of independent variables on the dependent variable.

SW388R7 Data Analysis & Computers II Slide 53

Relationship of the first independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "number of hours worked in the past week" [hrs1] was added to the regression equation in model 2. The increase in R Square as a result of including this variable was .077, which was statistically significant, F(1, 142) = 14.788, p<0.001.

SW388R7 Data Analysis & Computers II Slide 54

Relationship of the first independent variable and the dependent variable evidence and answer 2

[SPSS output: Coefficients table for models 1 and 2 (Unstandardized Coefficients B and Std. Error, Standardized Coefficients Beta, t, Sig., Collinearity Statistics Tolerance and VIF); dependent variable RINCOM98]

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "number of hours worked in the past week" [hrs1], was .118, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "number of hours worked in the past week" [hrs1] are associated with higher numeric values for the dependent variable "income" [rincom98].

The statement in the problem that "survey respondents who worked longer hours in the past week had higher incomes" is correct.

Caution in interpreting the relationship should be exercised because of cases to variables ratio less than 50:1, and an ordinal variable treated as metric. True with caution is the correct answer.
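The F test for the R Square change reported when hrs1 entered the equation can be approximated from the rounded values in the slides. This is a sketch of the standard F-change formula with a hypothetical function name; because it uses R² values rounded to three decimals, it lands close to, but not exactly on, the reported F(1, 142) = 14.788.

```python
def f_change(r2_change, r2_full, n_cases, n_predictors_full, n_added=1):
    """F statistic for the increase in R-squared when n_added predictors enter.

    F = (R2 change / n_added) / ((1 - R2 full) / (n - k - 1)),
    where k is the number of predictors in the fuller model.
    """
    df2 = n_cases - n_predictors_full - 1
    f = (r2_change / n_added) / ((1.0 - r2_full) / df2)
    return f, n_added, df2


# Model 2: hrs1 added to prestg80; 145 valid cases, total R-squared of .257.
f, df1, df2 = f_change(0.077, 0.257, 145, 2)
print(f"F({df1}, {df2}) = {f:.3f}")
```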

SW388R7 Data Analysis & Computers II Slide 55

Relationship of the second independent variable and the dependent variable - question

SW388R7 Data Analysis & Computers II Slide 56

Relationship of the second independent variable and the dependent variable evidence and answer

The independent variable "self-employment" [wrkslf] was not included in the regression equation. It did not increase the percentage of variance explained in the dependent variable by an amount large enough to be statistically significant.

False is the correct answer.

SW388R7 Data Analysis & Computers II Slide 57

Relationship of the third independent variable and the dependent variable - question

SW388R7 Data Analysis & Computers II Slide 58

Relationship of the third independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "occupational prestige score" [prestg80] was added to the regression equation in model 1. The increase in R Square as a result of including this variable was .180, which was statistically significant, F(1, 143) = 31.350, p<0.001.

SW388R7 Data Analysis & Computers II Slide 59

Relationship of the third independent variable and the dependent variable evidence and answer 2

[SPSS output: Coefficients table for models 1 and 2 (Unstandardized Coefficients B and Std. Error, Standardized Coefficients Beta, t, Sig., Collinearity Statistics Tolerance and VIF); dependent variable RINCOM98]

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "occupational prestige score" [prestg80], was .135, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "occupational prestige score" [prestg80] are associated with higher numeric values for the dependent variable "income" [rincom98].

The statement in the problem that "survey respondents who had more prestigious occupations had lower incomes" is incorrect. The direction of the relationship is stated incorrectly.

False is the correct answer.

SW388R7 Data Analysis & Computers II Slide 60

Validation analysis - question

The problem states the random number seed to use in the validation analysis.

SW388R7 Data Analysis & Computers II Slide 61

Validation analysis: set the random number seed

Validate the results of your regression analysis by conducting a 75/25% cross-validation, using 200070 as the random number seed.

To set the random number seed, select the Random Number Seed... command from the Transform menu.

SW388R7 Data Analysis & Computers II Slide 62

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

SW388R7 Data Analysis & Computers II Slide 63

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

SW388R7 Data Analysis & Computers II Slide 64

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75. If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
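The split formula can be sketched outside SPSS as follows. This is only an illustration in Python: the function name `make_split` and the case count of 145 are assumptions made for the example, and Python's random number generator is not the one SPSS uses, so the same seed will not reproduce the SPSS split.

```python
import random


def make_split(n_cases, seed, train_fraction=0.75):
    """Return one 0/1 flag per case (1 = training sample, 0 = validation sample)."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    # Draw a uniform number in [0, 1) for each case and compare it to the
    # cutoff, mirroring the "uniform(1) <= 0.75" comparison on the slide.
    return [1 if rng.random() <= train_fraction else 0 for _ in range(n_cases)]


# Hypothetical usage: 145 cases is an illustrative count, not a value from the slides.
split = make_split(n_cases=145, seed=200070)
training_rows = [i for i, flag in enumerate(split) if flag == 1]
validation_rows = [i for i, flag in enumerate(split) if flag == 0]
```

Because the flags are drawn at random, the training sample will be close to, but usually not exactly, 75% of the cases.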

SW388R7 Data Analysis & Computers II Slide 65

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zero's and one's.

To select the cases for the training sample, we select the cases where split = 1.

SW388R7 Data Analysis & Computers II Slide 66

Repeat the regression for the validation

To repeat the multiple regression analysis for the validation sample, select Regression | Linear from the Analyze tool button.

SW388R7 Data Analysis & Computers II Slide 67

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.

SW388R7 Data Analysis & Computers II Slide 68

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule… button to enter a value for split.

SW388R7 Data Analysis & Computers II Slide 69

Completing the value selection

First, type the value for the training sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.

SW388R7 Data Analysis & Computers II Slide 70

Requesting output for the validation analysis

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.

SW388R7 Data Analysis & Computers II Slide 71

Validation - Overall Relationship

The validation analysis requires that the regression model for the 75% training sample replicate the pattern of statistical significance found for the full data set.

In the analysis of the 75% training sample, the relationship between the set of independent variables and the dependent variable was statistically significant, F(2, 105) = 20.195, p<0.001, as was the overall relationship in the analysis of the full data set, F(2, 142) = 24.581, p<0.001.
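As a rough check on the training-sample result, the overall F statistic can be recovered from the multiple R and the degrees of freedom using the standard relationship F = (R²/k) / ((1 - R²)/df residual). The sketch below is illustrative Python, not SPSS output; the function name `overall_f` is an assumption, and the input R of .527 is the rounded training-sample value reported in the validation comparison.

```python
def overall_f(r, n_predictors, df_residual):
    """Overall F statistic for a regression, computed from the multiple R."""
    r_squared = r ** 2
    explained_per_predictor = r_squared / n_predictors
    unexplained_per_df = (1 - r_squared) / df_residual
    return explained_per_predictor / unexplained_per_df


# Training sample: R = .527, 2 predictors, 105 residual degrees of freedom.
f_training = overall_f(r=0.527, n_predictors=2, df_residual=105)
```

The result lands close to the reported F(2, 105) = 20.195; the small gap comes from using R rounded to three decimals.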

SW388R7 Data Analysis & Computers II Slide 72

Validation - Relationship of Individual Independent Variables to Dependent Variable

In stepwise multiple regression, the pattern of individual relationships between the dependent variable and the independent variables will be the same if the same variables are selected as predictors for the analysis using the full data set and the analysis using the 75% training sample.

In this analysis, the same two variables entered into the regression model: "occupational prestige score" [prestg80], and "number of hours worked in the past week" [hrs1].

SW388R7 Data Analysis & Computers II Slide 73

Validation - Comparison of Training Sample and Validation Sample

The total proportion of variance explained in the model using the training sample was 27.8% (.527²), compared to 70.7% (.841²) for the validation sample. The value of R² for the validation sample was actually larger than the value of R² for the training sample, implying a better fit than obtained for the training sample.

This supports a conclusion that the regression model would be effective in predicting scores for cases other than those included in the sample.

The validation analysis supported the generalizability of the findings of the analysis to the population represented by the sample in the data set.

The answer to the question is true.
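The arithmetic behind this comparison can be written out as a short Python sketch. The variable names are illustrative; the two R values are the ones quoted above, and the 2% limit is the shrinkage criterion used in the validation flow chart.

```python
def r_squared_percent(r):
    """Convert a multiple R into percent of variance explained."""
    return 100 * r ** 2


training_pct = r_squared_percent(0.527)    # training sample, R = .527
validation_pct = r_squared_percent(0.841)  # validation sample, R = .841

# Shrinkage is training R² minus validation R²; a negative value means the
# validation sample was fit better than the training sample.
shrinkage = training_pct - validation_pct
shrinkage_acceptable = shrinkage < 2.0
```

Here the shrinkage is negative, so the criterion is satisfied.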

SW388R7 Data Analysis & Computers II Slide 74

Steps in complete stepwise regression analysis

The following flow charts depict the process for solving the complete regression problem and determining the answer to each of the questions encountered in the complete analysis.

Text in italics (e.g. True, False, True with caution, Incorrect application of a statistic) represent the answers to each specific question.

Many of the steps in stepwise regression analysis are identical to the steps in standard regression analysis. Steps that are different are identified with a magenta background, with the specifics of the difference underlined.

SW388R7 Data Analysis & Computers II Slide 75

Complete stepwise multiple regression analysis: level of measurement
Question: do variables included in the analysis satisfy the level of measurement requirements?

Is the dependent variable metric and the independent variables metric or dichotomous? (Examine all independent variables - controls as well as predictors.)
No: Incorrect application of a statistic
Yes: Ordinal variables included in the relationship?
  Yes: True with caution
  No: True

SW388R7 Data Analysis & Computers II Slide 76

Complete stepwise multiple regression analysis: sample size
Question: Number of variables and cases satisfy sample size requirements?

Compute the baseline regression in SPSS. Include both controls and predictors in the count of independent variables.

Ratio of cases to independent variables at least 5 to 1?
No: Inappropriate application of a statistic
Yes: Ratio of cases to independent variables at preferred sample size of at least 50 to 1?
  No: True with caution
  Yes: True
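The sample size decision can be sketched as a small Python function. The function name `sample_size_answer` and the case counts in the usage lines are assumptions for illustration; the 5 to 1 and 50 to 1 thresholds are the ones in the flow chart.

```python
def sample_size_answer(n_cases, n_independent_vars,
                       minimum_ratio=5, preferred_ratio=50):
    """Return the flow chart answer for the sample size question."""
    ratio = n_cases / n_independent_vars
    if ratio < minimum_ratio:
        return "Inappropriate application of a statistic"
    if ratio < preferred_ratio:
        return "True with caution"
    return "True"


# Hypothetical counts, chosen only to exercise each branch.
too_small = sample_size_answer(n_cases=8, n_independent_vars=2)
adequate = sample_size_answer(n_cases=145, n_independent_vars=3)
preferred = sample_size_answer(n_cases=150, n_independent_vars=3)
```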

SW388R7 Data Analysis & Computers II Slide 77

Complete stepwise multiple regression analysis: assumption of normality
Question: each metric variable satisfies the assumption of normality?
Test the dependent variable and independent variables.

The variable satisfies criteria for a normal distribution?
Yes: True
No: False. Log, square root, or inverse transformation satisfies normality?
  Yes: Use transformation in revised model, no caution needed
  No: Use untransformed variable in analysis, add caution to interpretation for violation of normality

If more than one transformation satisfies normality, use one with smallest skew.

SW388R7 Data Analysis & Computers II Slide 78

Complete stepwise multiple regression analysis: assumption of linearity
Question: relationship between dependent variable and metric independent variable satisfies assumption of linearity?
If dependent variable was transformed for normality, use transformed dependent variable in the test for linearity. If independent variable was transformed to satisfy normality, skip check for linearity.

If more than one transformation satisfies linearity, use one with largest r

Probability of Pearson correlation (r) <= level of significance?
Yes: True
No: Probability of correlation (r) for relationship with any transformation of IV <= level of significance?
  Yes: Use transformation in revised model
  No: Weak relationship. No caution needed

SW388R7 Data Analysis & Computers II Slide 79

Complete stepwise multiple regression analysis: assumption of homogeneity of variance
Question: variance in dependent variable is uniform across the categories of a dichotomous independent variable?
If dependent variable was transformed for normality, substitute transformed dependent variable in the test for the assumption of homogeneity of variance

Probability of Levene statistic <= level of significance?
Yes: False. Do not test transformations of dependent variable, add caution to interpretation for violation of homoscedasticity
No: True

SW388R7 Data Analysis & Computers II Slide 80

Complete stepwise multiple regression analysis: detecting outliers
Question: After incorporating any transformations, no outliers were detected in the regression analysis.

If any variables were transformed for normality or linearity, substitute transformed variables in the regression for the detection of outliers.

Is the standardized residual for any case greater than +/-3.00?
Yes: False. Remove outliers and run revised regression again.
No: True
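A minimal Python sketch of the outlier check follows. It assumes that a standardized residual is the raw residual divided by the standard error of the estimate (the square root of the residual sum of squares over n - k - 1); the function names and the residual values are made up for illustration and do not come from the sample problem.

```python
import math


def standardized_residuals(residuals, n_predictors):
    """Divide each residual by the standard error of the estimate."""
    df_residual = len(residuals) - n_predictors - 1
    std_error = math.sqrt(sum(e * e for e in residuals) / df_residual)
    return [e / std_error for e in residuals]


def outlier_cases(residuals, n_predictors, cutoff=3.0):
    """Return the positions of cases whose standardized residual exceeds +/- cutoff."""
    z = standardized_residuals(residuals, n_predictors)
    return [i for i, value in enumerate(z) if abs(value) > cutoff]


# Made-up residuals: twenty small ones and a single large one.
residuals = [-0.5] * 20 + [10.0]
flagged = outlier_cases(residuals, n_predictors=2)
```

Any flagged cases would be removed before running the revised regression.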

SW388R7 Data Analysis & Computers II Slide 81

Complete stepwise multiple regression analysis: picking regression model for interpretation
Question: interpretation based on model that includes transformation of variables and removes outliers?

R² for revised regression greater than R² for baseline regression by 2% or more?
Yes: Pick revised regression with transformations and omitting outliers for interpretation. True
No: Pick baseline regression with untransformed variables and all cases for interpretation. False

SW388R7 Data Analysis & Computers II Slide 82

Complete stepwise multiple regression analysis: assumption of independence of errors
Question: serial correlation of errors is not a problem in this regression analysis?

Residuals are independent, Durbin-Watson between 1.5 and 2.5?
No: False. NOTE: caution for violation of assumption of independence of errors
Yes: True
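For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below applies the 1.5 to 2.5 rule from the flow chart; the function names and the residual values are illustrative assumptions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in case order."""
    squared_differences = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    squared_residuals = sum(e * e for e in residuals)
    return squared_differences / squared_residuals


def errors_independent(residuals, lower=1.5, upper=2.5):
    """Apply the flow chart rule: Durbin-Watson between 1.5 and 2.5."""
    return lower <= durbin_watson(residuals) <= upper


# Made-up residuals: the first set has no serial pattern, the second trends upward.
no_pattern = [1, -1, -1, 1, 1, -1, -1, 1]
trending = [1, 2, 3, 4]
```

Values near 2 suggest no serial correlation; values near 0 or 4 suggest positive or negative serial correlation respectively.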

SW388R7 Data Analysis & Computers II Slide 83

Complete stepwise multiple regression analysis: multicollinearity
Question: Multicollinearity is not a problem in this regression analysis?

Tolerance for all IV's greater than 0.10, indicating no multicollinearity?
No: False. NOTE: halt the analysis if it is not okay to simply exclude the variable from the analysis.
Yes: True
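Tolerance for an independent variable is 1 minus the R² obtained when that variable is predicted from the other independent variables, and VIF is its reciprocal. With only two predictors, that R² is simply their squared correlation, which keeps the sketch below short. The Python function names and the data values are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_two_predictors(x1, x2):
    """Tolerance when the model has exactly two predictors: 1 - r squared."""
    return 1 - pearson_r(x1, x2) ** 2


# Made-up predictor values.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
tolerance = tolerance_two_predictors(x1, x2)
vif = 1 / tolerance
no_multicollinearity = tolerance > 0.10
```

With more than two predictors, each tolerance would need a separate regression of that predictor on all the others.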

SW388R7 Data Analysis & Computers II Slide 84

Complete stepwise multiple regression analysis: overall relationship
Question: Finding about overall relationship between dependent variable and independent variables.

Probability of F test of regression for last model <= level of significance?
No: False
Yes: Strength of relationship for included variables interpreted correctly?
  No: False
  Yes: Small sample, ordinal variables, or violation of assumption in the relationship?
    Yes: True with caution
    No: True

SW388R7 Data Analysis & Computers II Slide 85

Complete stepwise multiple regression analysis: subset of best predictors
Question: Finding about list of best subset of predictors?

Listed variables match variables in table of variables entered/removed.
No: False
Yes: Small sample, ordinal variables, or violation of assumption in the relationship?
  Yes: True with caution
  No: True

SW388R7 Data Analysis & Computers II Slide 86

Complete stepwise multiple regression analysis: individual relationships - 1
Question: Finding about individual relationship between independent variable and dependent variable.

Order of entry into regression equation stated correctly?
No: False
Yes: Significance of R² change for variable <= level of significance?
  No: False
  Yes: continue with the direction of the relationship

SW388R7 Data Analysis & Computers II Slide 87

Complete stepwise multiple regression analysis: individual relationships - 2

Direction of relationship between included variables and DV interpreted correctly?
No: False
Yes: Small sample, ordinal variables, or violation of assumption in the relationship?
  Yes: True with caution
  No: True

SW388R7 Data Analysis & Computers II Slide 88

Complete stepwise multiple regression analysis: validation analysis - 1
Question: The validation analysis supports the generalizability of the findings?

Set the random seed and randomly split the sample into 75% training sample and 25% validation sample.

Probability of ANOVA test for training sample <= level of significance?
No: False
Yes: continue with the comparison of the variables entered

SW388R7 Data Analysis & Computers II Slide 89

Complete stepwise multiple regression analysis: validation analysis - 2

Same variables entered into regression equation in training sample?
No: False
Yes: Shrinkage in R² (R² for training sample - R² for validation sample) < 2%?
  No: False
  Yes: True
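The three validation checks can be strung together as in the Python sketch below. The function name `validation_supported`, the level of significance of 0.05, and the stand-in probability of 0.0005 (the slides report only p<0.001) are assumptions for illustration; the variable names and R² values are those of the sample problem.

```python
def validation_supported(p_training, alpha, full_vars, training_vars,
                         r2_training, r2_validation, max_shrinkage=0.02):
    """Walk through the validation flow chart and return True or False."""
    # 1. The overall relationship must be significant in the training sample.
    if p_training > alpha:
        return False
    # 2. The same variables must enter the model; order of entry is ignored.
    if set(full_vars) != set(training_vars):
        return False
    # 3. Shrinkage (training R² minus validation R²) must be under 2%.
    return (r2_training - r2_validation) < max_shrinkage


answer = validation_supported(
    p_training=0.0005,              # stand-in for the reported p<0.001
    alpha=0.05,                     # assumed level of significance
    full_vars=["prestg80", "hrs1"],
    training_vars=["hrs1", "prestg80"],
    r2_training=0.527 ** 2,
    r2_validation=0.841 ** 2,
)
```

Comparing the variable lists as sets reflects the rule that the same variables must be included, though not necessarily in the same order.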

SW388R7 Data Analysis & Computers II Slide 90

Homework Problems: Multiple Regression Stepwise Problems - 1

The stepwise regression homework problems parallel the complete standard regression problems and the complete hierarchical problems. The only assumption made in the problems is that there is no problem with missing data.

The complete stepwise multiple regression will include:
Testing assumptions of normality and linearity
Testing for outliers
Determining whether to use transformations or exclude outliers
Testing for independence of errors
Checking for multicollinearity, and
Validating the generalizability of the analysis.

SW388R7 Data Analysis & Computers II Slide 91

Homework Problems: Multiple Regression Stepwise Problems - 2

The statement of the stepwise regression problem identifies the dependent variable and the independent variables from which we will extract a parsimonious subset.

SW388R7 Data Analysis & Computers II Slide 92

Homework Problems: Multiple Regression Stepwise Problems - 3

The findings, which must all be correct for a problem to be true, include:
an ordered listing of the included independent variables
an interpretive statement about each of the independent variables
a statement about the strength of the overall relationship.

SW388R7 Data Analysis & Computers II Slide 93

Homework Problems: Multiple Regression Stepwise Problems - 4

The first prerequisite for a problem is the satisfaction of the level of measurement and minimum sample size requirements. Failing to satisfy either of these requirements results in an inappropriate application of a statistic.

SW388R7 Data Analysis & Computers II Slide 94

Homework Problems: Multiple Regression Stepwise Problems - 5

The assumption of normality requires that each metric variable be tested. If the variable is not normal, transformations should be examined to see if we can improve the distribution of the variable. If transformations are unsuccessful, a caution is added to any true findings.

SW388R7 Data Analysis & Computers II Slide 95

Homework Problems: Multiple Regression Stepwise Problems - 6

The assumption of linearity is examined for any metric independent variables that were not transformed for the assumption of normality.

SW388R7 Data Analysis & Computers II Slide 96

Homework Problems: Multiple Regression Stepwise Problems - 7

After incorporating any transformations, we look for outliers using standardized residuals as the criterion.

SW388R7 Data Analysis & Computers II Slide 97

Homework Problems: Multiple Regression Stepwise Problems - 8

We compare the results of the regression without transformations and exclusion of outliers to the model with transformations and excluding outliers to determine whether we will base our interpretation on the baseline or the revised analysis.

SW388R7 Data Analysis & Computers II Slide 98

Homework Problems: Multiple Regression Stepwise Problems - 9

We test for the assumption of independence of errors and the presence of multicollinearity. If we violate the assumption of independence, we attach a caution to our findings. If there is a multicollinearity problem, we halt the analysis, since we may be reporting erroneous findings.

SW388R7 Data Analysis & Computers II Slide 99

Homework Problems: Multiple Regression Stepwise Problems - 9

In stepwise regression, we interpret the R² for the overall relationship at the step or model when the last statistically significant variable was entered.

SW388R7 Data Analysis & Computers II Slide 100

Homework Problems: Multiple Regression Stepwise Problems - 10

The primary purpose of stepwise regression is to identify the best subset of predictors and the order in which variables were included in the regression equation. The order tells us the relative importance of the predictors, i.e. best predictor, second best, …

SW388R7 Data Analysis & Computers II Slide 101

Homework Problems: Multiple Regression Stepwise Problems - 11

The relationships between predictor independent variables and the dependent variable stated in the problem must be statistically significant, and worded correctly for the direction of the relationship. The interpretation of individual predictors is the same for standard, hierarchical, and stepwise regression.

SW388R7 Data Analysis & Computers II Slide 102

Homework Problems: Multiple Regression Stepwise Problems - 12

We use a 75-25% validation strategy to support the generalizability of our findings. The validation must support: the significance of the overall relationship; the inclusion of the same variables in the validation model that were included in the full model, though not necessarily in the same order; and a shrinkage in R², with the R² for the validation sample not more than 2% less than the R² for the training sample.

SW388R7 Data Analysis & Computers II Slide 103

Homework Problems: Multiple Regression Stepwise Problems - 13

Cautions are added as limitations to the analysis, if needed.