Exercises

Quantitative Research, March 2009

TOPICS

1. Simple and multiple linear regression (SPSS)
2. Preparing the data for analysis
3. Inspecting frequencies and diagnostic plots
4. Outliers and influential cases – diagnostics and solutions
5. Interpreting the regression output
6. Saving predicted and residual regression scores
7. Introducing non-linear effects in the linear regression framework (recoding ordinal and interval level predictors into multiple categories, interaction effects, polynomial terms)
8. Variable transformations (ln, square)
9. Centering predictors
10. R square change
11. Multicollinearity and solutions

Databases for the lab examples and the homework assignment are available here: http://web.me.com/paula.tufis/QR

EXAMPLE 1 – SIMPLE LINEAR REGRESSION

Model: Effects of education on occupational prestige

Dataset: ISSP_1999_Slovakia.sav (International Social Survey Programme, 1999, Inequality Module, www.issp.org) - subsample of employed respondents with nonfarm origins.

Variables:
Outcome variable: SIOPS_R (Respondent’s occupational prestige – coded using the Standard International Prestige Score - SIOPS, with a theoretical range from 6 – occupations with low prestige to 78 – occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent’s schooling measured in completed years of education)

Analysis steps:

1. Conceptualization and specification of the model (hypotheses, operationalization of variables and concepts, include relevant predictors, exclude irrelevant predictors).
• Hypothesis: people with higher levels of education tend to have jobs with higher occupational prestige
• Education is measured using completed years of schooling and occupational prestige is measured using SIOPS.
• This is a simple example, so the model fails to include some relevant predictors: while education might be a relevant predictor of occupational prestige, it probably isn’t the only one.

2. Inspect and recode the data (missing on NA/DK, invalid variable values, direction of scaling, transforming highly skewed variables, dichotomizing categorical predictors, and so on)
• Both variables are already recoded for this example. In real-life situations use your statistical data manipulation software of choice to recode variables (in SPSS, version 18.0: use Transform → Compute Variable or Transform → Recode into Same/Different Variable).
• Visual inspection of histograms for each variable (SPSS: Graphs → Chart Builder → Histogram or use the Charts option in Analyze → Descriptive Statistics → Frequencies)

3. Inspect the relationship between the two variables using a scatterplot for departures from linearity and for potential outliers. In the multiple regression case, use a scatterplot matrix.
• SPSS: Graphs → Chart Builder → Scatter/Dot
• Potential outlier (case no. 267) (we’ll do more extensive tests later)

4. Run OLS regression (SPSS: Analyze → Regression → Linear)
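Steps 3 and 4 can also be run from a syntax window instead of the menus. The following is a minimal sketch, assuming the Example 1 dataset is open and that the variables are named SIOPS_R and EDUC_YRS as listed above; it uses the legacy GRAPH command rather than the Chart Builder, which is a substitution made here for brevity.

```spss
* Sketch only, assuming the Example 1 dataset is the active dataset.
* Step 3, scatterplot of occupational prestige against years of education.
GRAPH
  /SCATTERPLOT(BIVAR)=educ_yrs WITH siops_r
  /MISSING=LISTWISE.

* Step 4, simple OLS regression of prestige on education.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs.
```

Keeping the commands in a syntax file makes it easy to re-run the same model after recoding or after excluding a suspect case.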

5. Diagnostics for possible violations of OLS regression assumptions
• Scatterplot/scatterplot matrix for checking the assumption of linearity of relationships
• Plot the distribution of the regression standardized residual to check the assumption of normally distributed residuals (distribution should be normal). Alternative: look at the Normal PP plot (expected cumulated probability plotted against observed cumulated probability of standardized residuals – line should be at 45 degrees)
• SPSS: In the Plots option of the Linear Regression menu choose “Histogram” and “Normal Probability Plot” under “Standardized Residual Plots”, or save standardized residuals (use the “Save” option of the Linear Regression menu) and look at the distribution of the saved variable
• Plot standardized residuals (ZRESID) against standardized predicted values (ZPRED) to check the assumptions of homoskedasticity and linearity (SPSS: Linear Regression → Plots). Funnel shapes denote heteroskedasticity and curved shapes indicate the relationships might be nonlinear.
• Durbin Watson test (SPSS: Linear Regression → Statistics → Durbin Watson) to test the assumption of independence of errors (values close to 2 indicate no error autocorrelation). The Durbin Watson test is relevant in the case of time-series data. In cross-sectional samples autocorrelation is likely to be less of a problem than in time-series data, but error autocorrelations might be present for spatially clustered data (in which case, run a hierarchical linear regression model).
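The diagnostic requests listed above correspond to subcommands of the REGRESSION command. This is a sketch under the assumption that the Example 1 variables (SIOPS_R, EDUC_YRS) are being analysed; it is one way of asking for the plots and the Durbin Watson statistic, not the only one.

```spss
* Sketch only, assuming the Example 1 dataset is the active dataset.
* Requests the ZRESID by ZPRED plot, the residual histogram,
  the normal probability plot and the Durbin Watson statistic.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE ZRESID.
```

The SAVE subcommand adds the standardized residuals to the dataset as a new variable, so their distribution can also be inspected with Frequencies.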

6. Detect outliers and influential cases
• To detect outliers, SPSS: Linear Regression → Statistics → Casewise Diagnostics (detect cases with standardized residuals with absolute values greater than 3 standard deviations)
• Same case identified on the scatterplot. You might consider excluding it if you think the case is affected by measurement error (examine variable values for this case and influence statistics). For a group of outliers you might need a separate model. Variable transformations can also help in some cases to pull in the outlier cases.
• To detect influential cases, there are a variety of statistics. Some of the most often used: Cook’s Distance, Leverage Values, DFBeta. SPSS: Linear Regression → Save → Cook’s Distance (cases with values over 1 are influential cases), Leverage Values (cases with values more than 3 times the value of the average leverage or values greater than .5 are cause for concern), DFBeta (cases with absolute values greater than 2 are cause for concern)

7. If all OK interpret OLS regression results
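The outlier and influence checks in step 6 can be sketched in syntax as follows. The variable names come from Example 1; the name COO_1 is the default SPSS assigns to the first saved Cook's distance in a session and is an assumption here (it will differ if influence statistics were already saved earlier).

```spss
* Sketch only, assuming the Example 1 dataset is the active dataset.
* Lists cases with standardized residuals beyond 3 standard deviations
  and saves Cook's distance, leverage and DFBeta as new variables.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT siops_r
  /METHOD=ENTER educ_yrs
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE COOK LEVER DFBETA.

* List cases above the Cook's distance cutoff used in this lab.
* COO_1 is the assumed default name of the saved variable.
TEMPORARY.
SELECT IF (COO_1 > 1).
LIST VARIABLES=siops_r educ_yrs COO_1.
```

TEMPORARY limits the selection to the LIST command, so no cases are removed from the dataset.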

INDIVIDUAL ASSIGNMENT 1 (OPTIONAL)

Re-estimate and interpret on your own the simple regression model discussed above using data for Romania. Before interpreting regression coefficients examine diagnostic tests and plots to check for possible violations of regression assumptions.

Dataset: VF2008_Romania_r80 (Family Life, 2008, Soros Foundation Romania), random 80% subsample from the nationally representative sample.

Subsample for analysis: employed respondents with nonfarm origins

Since the dataset contains both employed and unemployed respondents and respondents with farm and nonfarm origins, you need to select just the respondents that are employed and with nonfarm origins and run your analyses on this subsample. The variable EMPL identifies employed and unemployed respondents and the variable NONFARM identifies respondents with farm and nonfarm origins.

To select a subsample in SPSS: Data → Select cases → If condition is satisfied → If → Type “empl=1 and nonfarm=1” without the quotes in the upper right box → Continue. You can choose either to filter out cases (default and recommended option - unselected cases remain in the dataset, but statistical analyses will not take them into account), or to move selected cases to a new dataset (in this case save the resulting dataset with a new name), or to delete cases from the existing dataset (!!!Warning: using this option followed by the “Save” option overwrites the original dataset.)

Variables:
Outcome variable: SIOPS_R (Respondent’s occupational prestige – coded using the Standard International Prestige Score - SIOPS, with a theoretical range from 6 – occupations with low prestige to 78 – occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent’s schooling measured in completed years of education)

EXAMPLE 2 – MULTIPLE LINEAR REGRESSION

Model: Effects of socio-demographic characteristics on income.

Dataset: EVS_2008_Romania_r80.sav (European Values Survey 2008) – random 80% subsample from the nationally representative sample.

Variables:
Outcome variable: INCOMER (Respondent’s personal income)
Predictor variables:
• GENDERR (Respondent’s gender, 1=Male, 2=Female)
• REDUC (Respondent’s educational level, 1 = No schooling … 9=university and postgraduate degree)
• EMPL (Employment status, 1 = full time employment, 2 = part time employment, 3 = self employed, 4=inactive/unemployed, 99=DK/NA)
• MARSTAT (Marital status, 1 = married, 2 = cohabiting, 3 = single)
• AGER (Age, 18-91, 99=NA)
• LOCSIZE (Locality type and size, 1 = big city, 2 = medium sized city, 3 = small city, 4 = village – administrative center, 5 = other villages)
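For Individual Assignment 1, the menu-based subsample selection (filtering, the recommended option) is equivalent to a few lines of syntax. This is a sketch assuming the variables EMPL and NONFARM are coded as described in that assignment; filter_$ is the name SPSS itself uses for the filter variable when the menus are used.

```spss
* Sketch only, assuming the Individual Assignment 1 dataset is active.
* Filter on employed respondents with nonfarm origins.
* Unselected cases stay in the dataset but are ignored by analyses.
USE ALL.
COMPUTE filter_$=(empl=1 AND nonfarm=1).
FILTER BY filter_$.
EXECUTE.

* To return to the full sample later.
FILTER OFF.
USE ALL.
EXECUTE.
```

Because filtering never deletes cases, saving the dataset afterwards does not lose the unselected respondents.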

Some notes on variable recoding

Income variables are generally positively skewed and need to be transformed – recommended transformation: ln(incomer).
• Examine the distribution of the INCOMER variable (SPSS: Analyze → Descriptive Statistics → Frequencies → Charts → Histogram with normal curve).
• To construct the natural logarithm of INCOMER, in SPSS: Transform → Compute Variable → Fill in a new name under “Target Variable” and fill in “ln(incomer)” without the quotes under “Numeric Expression”

Dichotomous predictors such as GENDER should be dummy coded (for example code 0 for females, code 1 for males).

For the purposes of this example, we’ll look at differences between employed (EMPL=1 thru 3) and unemployed people (EMPL=4) (so 2 categories and 1 dummy variable introduced as a predictor)
• SPSS: Transform → Recode into Different…

Categorical predictors (such as MARSTAT) should be dummy coded, resulting in k dummy variables (where k is the number of categories) and using k-1 dummy variables as predictors in the regression equation. For the purposes of this example use “single” as the reference category (omitted dummy variable)

Locality size could be considered an ordinal level variable with a clear ordering according to population size in the first 3 categories (big, medium, and small cities), but the presence of the last two categories (administrative center villages and other villages) makes the variable only partially ordered. As such, we will treat it as a categorical variable, and construct 5 dummy variables. Use “other villages” as the reference category.
• For variables with several categories, explore whether there are statistically significant differences by interpreting the sizes and significances of dummy variable coefficients and by varying the reference category used.
• You can examine the overall statistical significance of a set of dummy variables (e.g. the locality type dummies) by introducing the associated dummy variables in a separate block in the regression equation. Request the R squared change statistic under “Statistics” in the Linear Regression procedure in SPSS. Interpret overall significance of the variable by using the F test for R square change.

Regression assumption diagnostics

The same as for the simple regression case. In addition, examine the partial regression plots to assess linearity and the presence of outliers (applicable only for interval level predictors).

Multicollinearity – diagnostics and solutions

Start by inspecting a bivariate correlation matrix between predictor variables. Correlations over .8 or .9 suggest a moderate to high degree of collinearity of predictors. In SPSS, in the linear regression procedure, under “Statistics”, ask for Collinearity diagnostics (Tolerance – less than .2 suggests high collinearity, VIF – greater than 4 (conservative) or 7 (liberal criterion) suggests high collinearity, condition index – over 30 suggests multicollinearity)

Solutions in case of multicollinearity: depending on the context, either construct a scale using the highly collinear variables (if the variables measure the same dimension) or delete one of the highly collinear variables from the model. If you have a substantive interest in estimating the effects of all of the highly collinear variables but they do not form a scale, you can use a block regression with these predictors in a separate block and compare regression results with and without these variables.
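The recodes and the block regression described in these notes can be sketched in syntax. The source variable names (INCOMER, GENDERR, EMPL, MARSTAT, AGER, LOCSIZE) come from the Example 2 variable list; every new variable name (lnincome, male, employed, married, cohabit, bigcity, medcity, smallcity, admvill) is hypothetical and chosen only for illustration. Treating AGER as the only interval level predictor is also an assumption made to keep the sketch short.

```spss
* Sketch only. New variable names below are illustrative assumptions.
MISSING VALUES ager empl (99).

* Log transform of income.
COMPUTE lnincome=LN(incomer).

* Dummy coding. Codes not listed in a RECODE stay system missing.
RECODE genderr (1=1) (2=0) INTO male.
RECODE empl (1 thru 3=1) (4=0) INTO employed.
* Marital status, with single as the reference category.
RECODE marstat (1=1) (2,3=0) INTO married.
RECODE marstat (2=1) (1,3=0) INTO cohabit.
* Locality type, with other villages as the reference category.
RECODE locsize (1=1) (2 thru 5=0) INTO bigcity.
RECODE locsize (2=1) (1,3,4,5=0) INTO medcity.
RECODE locsize (3=1) (1,2,4,5=0) INTO smallcity.
RECODE locsize (4=1) (1,2,3,5=0) INTO admvill.
EXECUTE.

* Block 1 holds the other predictors, block 2 the locality dummies.
* CHANGE gives the R square change F test. COLLIN and TOL give
  the collinearity diagnostics (tolerance, VIF, condition index).
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE COLLIN TOL
  /DEPENDENT lnincome
  /METHOD=ENTER male employed ager married cohabit
  /METHOD=ENTER bigcity medcity smallcity admvill
  /PARTIALPLOT ALL.
```

The F test for R square change reported for the second block is the overall test for the set of locality dummies.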

Example 2a. Recode variables and estimate the regression model in one block. Examine effects of dummy variables and decide which categories to collapse. Re-estimate the model using blocks for sets of dummies and examine diagnostic tests and plots.

Handling nonlinear effects

You can split ordinal and interval level predictors into categories if you suspect that their effect on the dependent variable is non-linear. For example, the relationship between age and income might be nonlinear. You can explore that by using 10-year or 5-year age groups instead of the original age variable. It might also be appropriate to account for the squared effect of age on income if the relationship is curvilinear. Interaction effects can also be introduced in the linear regression model if you think the effect of one predictor is moderated by another predictor.

Centering predictors

Centering is subtracting the mean from variable values. It is done for two different purposes:
• To make the intercept meaningful (the sizes of slopes are not affected by centering)
• To avoid multicollinearity between a predictor and power terms of the predictor (age and age squared)
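Both approaches to a nonlinear age effect can be sketched for the Example 2 data. The age variable AGER and its 99=NA code come from the variable list; the new names (agegrp, age_mean, age_c, age_c_sq) and the particular 10-year cut points are assumptions for illustration. The sample mean is added with AGGREGATE so that it does not have to be typed in by hand.

```spss
* Sketch only. New variable names and cut points are assumptions.
MISSING VALUES ager (99).

* Option 1, 10-year age groups to be dummy coded afterwards.
RECODE ager (18 thru 29=1) (30 thru 39=2) (40 thru 49=3)
  (50 thru 59=4) (60 thru 69=5) (70 thru 91=6) INTO agegrp.
EXECUTE.

* Option 2, centered age and its square.
* AGGREGATE with an empty BREAK adds the overall mean to every case.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=
  /age_mean=MEAN(ager).
COMPUTE age_c=ager-age_mean.
COMPUTE age_c_sq=age_c*age_c.
EXECUTE.
```

Entering age_c and age_c_sq together, rather than the raw age and its square, usually lowers the VIF values for the two terms without changing the fitted curve.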

INDIVIDUAL ASSIGNMENT 2 (OPTIONAL)

1. Estimate a regression model predicting satisfaction with life, using 2005 data from the Public Opinion Barometer (dataset: bd bop noiembrie 2005.sav). Satisfaction with life is variable V22. You might explore as predictors of life satisfaction: marital status (V55), number of children (V56), gender (V235), age (V237), educational level (V238), household income per capita (V253 household income and b65 number of persons in the household).

Tips for recoding variables: Start by assigning system missing values to NA/DK values on each variable. Recode marital status in this example using a simple dichotomy between legally married respondents and all other respondents. Dichotomize the gender variable. Recode the educational level variable into fewer categories so that the resulting variable is an ordinal level variable (in the original version the categories of the variable are only partially ordered). For the purposes of this example, use a smaller number of broad categories (this will make it easier to manage the dichotomized version of the variable later in the exercise). Household income per capita is computed by dividing the household income by the number of persons in the household.

* SPSS syntax for recoding education.
RECODE v238 (99=SYSMIS) (1=1) (2 thru 3=2) (4 thru 6=3) (7 thru 9=4) (10 thru 12=5) (13 thru 14=6) INTO educ.
EXECUTE.

*SPSS syntax for computing household income per capita.
COMPUTE hhinc=v253/b65.
EXECUTE.

2. Explore the possibility of a nonlinear effect of age on satisfaction with life by using a squared age variable as predictor in addition to age. Ask for collinearity diagnostics and look at VIF and Tolerance values for AGE and AGE squared. Center the age variable and recompute the age squared variable using the centered age. Re-run the model and take a look again at the values of the collinearity diagnostics.

Tip for computing power terms: Use the Transform → Compute menu.

*SPSS syntax for computing age squared.
COMPUTE agesq=v237*v237.
EXECUTE.

Tip for centering variables: Run a frequency for the variable you want to center and ask for the mean of the variable in the Statistics menu of the Frequency procedure. Compute the new, centered variable by subtracting the mean from the original variable.

*Requesting mean for the age variable.
FREQUENCIES VARIABLES=v237 /FORMAT=NOTABLE /STATISTICS=MEAN /ORDER=ANALYSIS.

*Constructing the age centered and age centered squared terms.
COMPUTE age_c=v237-48.68.
COMPUTE age_c_sq=age_c*age_c.
EXECUTE.

3. Explore whether marital status acts as a moderator in the relationship between education and satisfaction with life (test for an interaction effect between marital status and education).

Tip: For OLS regression in SPSS you have to manually construct the interaction term (as a product of the two main variables). To test the interaction between two interval level variables, you will construct a single interaction term, equal to the product of the two variables. To test the interaction between an interval level variable and a dichotomous variable, the strategy is the same as for two interval level variables. To test the interaction between an interval level variable and a nominal/ordinal variable with k-1 dummies in the regression, you will have to construct k-1 interaction terms (interval level variable multiplied by the first dummy, then by the second dummy, and so on). Same strategy applies for a dichotomous variable interacted with a nominal/ordinal variable with k-1 dummies.

*SPSS syntax for constructing the interaction term between education and marital status.
COMPUTE interact1=educ*married.
EXECUTE.

4. Explore the possibility that education has nonlinear effects on satisfaction with life by dichotomizing educational level categories and introducing the education dummies in a separate block in the regression equation. Delete the previous interaction from the model (education is measured here in a different way and consequently the interaction term no longer makes sense).

Tip: Collapse the first two education categories since the first category contains a very small number of cases. Use EDUC1 resulting from the syntax below as the reference category in the regression equation.

*SPSS syntax for constructing education dummies.
** The first two education categories are collapsed since the first category contains a very small number of cases.
RECODE educ (SYSMIS=SYSMIS) (1 thru 2=1) (ELSE=0) INTO educ1.
RECODE educ (SYSMIS=SYSMIS) (3=1) (ELSE=0) INTO educ2.
RECODE educ (SYSMIS=SYSMIS) (4=1) (ELSE=0) INTO educ3.
RECODE educ (SYSMIS=SYSMIS) (5=1) (ELSE=0) INTO educ4.
RECODE educ (SYSMIS=SYSMIS) (6=1) (ELSE=0) INTO educ5.
EXECUTE.

5. Re-test for an interaction effect between marital status and education (measured with dummies).

Tip: You will have to construct 4 separate interaction terms: educ2*married, educ3*married, educ4*married, and educ5*married. Introduce these interaction terms in a separate block in the regression equation in order to test for the overall effect of the interaction between marital status and education measured in this way.

*SPSS syntax for constructing the 4 interaction terms.
COMPUTE educ2mar=educ2*married.
COMPUTE educ3mar=educ3*married.
COMPUTE educ4mar=educ4*married.
COMPUTE educ5mar=educ5*married.
EXECUTE.

Introduce these 4 interaction terms in a separate block in the regression equation and interpret the R squared change statistic.

6. Choose what you think is the best model and interpret regression results for this model (regression coefficients, R squared coefficients).

HOMEWORK ASSIGNMENT (DUE NEXT WEEK)

Using a database you are familiar with, run a multiple regression model on a dependent variable of your choice with several predictors that you consider relevant. You can use the Public Opinion Barometer from 2005 (link available on the lab webpage). Examine the diagnostic tests and plots that we discussed during the lab and write a short interpretation of these. You might try some of the solutions we discussed if you find that there are marked violations of the regression assumptions. Interpret regression coefficients and the R square for your model. Try to think about and present substantive findings of your model. Please include a print-out of relevant parts of the output and a discussion of your results of half a page to one page when you turn in your assignment.
