# Applied Linear Regression

Exercises
Quantitative Research, March 2009

TOPICS
1. Simple and multiple linear regression (SPSS)
2. Preparing the data for analysis
3. Inspecting frequencies and diagnostic plots
4. Outliers and influential cases – diagnostics and solutions
5. Interpreting the regression output
6. Saving predicted and residual regression scores
7. Introducing non-linear effects in the linear regression framework (recoding ordinal and interval level predictors into multiple categories, interaction effects, polynomial terms)
8. Variable transformations (ln, square)
9. Centering predictors
10. R square change
11. Multicollinearity and solutions

Databases for the lab examples and the homework assignment are available here: http://web.me.com/paula.tufis/QR

EXAMPLE 1 – SIMPLE LINEAR REGRESSION

Model: Effects of education on occupational prestige
Dataset: ISSP_1999_Slovakia.sav (International Social Survey Programme, 1999, Inequality Module, www.issp.org), subsample of employed respondents with nonfarm origins.

Variables:
Outcome variable: SIOPS_R (Respondent’s occupational prestige, coded using the Standard International Prestige Score (SIOPS), with a theoretical range from 6 – occupations with low prestige to 78 – occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent’s schooling measured in completed years of education)

Analysis steps:

1. Conceptualization and specification of the model (hypotheses, operationalization of variables and concepts, include relevant predictors, exclude irrelevant predictors).
• Hypothesis: people with higher levels of education tend to have jobs with higher occupational prestige.
• This is a simple example, so the model fails to include some relevant predictors: while education might be a relevant predictor of occupational prestige, it probably isn’t the only one.
• Education is measured using completed years of schooling and occupational prestige is measured using SIOPS.

2. Inspect and recode the data (setting NA/DK answers to missing, invalid variable values, direction of scaling, transforming highly skewed variables, dichotomizing categorical predictors, and so on).
• Both variables are already recoded for this example. In real-life situations use your statistical data manipulation software of choice to recode variables (SPSS¹: use Transform → Compute Variable or Transform → Recode into Same/Different Variable).
• Visual inspection of histograms for each variable (SPSS: Graphs → Chart Builder → Histogram, or use the Charts option in Analyze → Descriptive Statistics → Frequencies).

3. Inspect the relationship between the two variables using a scatterplot for departures from linearity and for potential outliers. In the multiple regression case, use a scatterplot matrix.
• SPSS: Graphs → Chart Builder → Scatter/Dot
• The scatterplot shows a potential outlier (case no. 267); we’ll do more extensive tests later.

4. Run OLS regression (SPSS: Analyze → Regression → Linear).

¹ SPSS version 18.0

5. Diagnostics for possible violations of OLS regression assumptions
• Scatterplot / scatterplot matrix for checking the assumption of linearity of relationships.
• Plot the distribution of the regression standardized residual to check the assumption of normally distributed residuals (distribution should be normal). Alternative: look at the Normal P-P plot (expected cumulative probability plotted against observed cumulative probability of standardized residuals; the points should fall along the 45-degree line).
• SPSS: In the Plots option of the Linear Regression menu choose “Histogram” and “Normal Probability Plot” under “Standardized Residual Plots”, or save standardized residuals (use the “Save” option of the Linear Regression menu) and look at the distribution of the saved variable.
• Plot standardized residuals (ZRESID) against standardized predicted values (ZPRED) to check the assumptions of homoskedasticity and linearity (SPSS: Linear Regression → Plots). Funnel shapes denote heteroskedasticity and curved shapes indicate the relationships might be nonlinear.
• Durbin-Watson test (SPSS: Linear Regression → Statistics → Durbin-Watson) to test the assumption of independence of errors (values close to 2 indicate no error autocorrelation). The Durbin-Watson test is relevant in the case of time-series data. In cross-sectional samples autocorrelation is likely to be less of a problem than in time-series data, but error autocorrelations might be present for spatially clustered data (in which case, run a hierarchical (multilevel) linear regression model).

6. Detect outliers and influential cases
• To detect outliers, SPSS: Linear Regression → Statistics → Casewise Diagnostics (detect cases with standardized residuals with absolute values greater than 3 standard deviations). In this example, this flags the same case identified on the scatterplot. You might consider excluding it if you think the case is affected by measurement error (examine variable values for this case and influence statistics). For a group of outliers you might need a separate model. Variable transformations can also help in some cases to pull in the outlier cases.
• To detect influential cases, there are a variety of statistics. Some of the most often used: Cook’s Distance, Leverage Values, DFBeta. SPSS: Linear Regression → Save → Cook’s Distance (cases with values over 1 are influential cases), Leverage Values (cases with values more than 3 times the value of the average leverage or values greater than .5 are cause for concern), DFBeta (cases with absolute values greater than 2 are cause for concern).

7. If all OK → interpret OLS regression results.
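Steps 4 and 5 can also be checked by hand outside SPSS. The following is a minimal Python sketch (standard library only) that fits a simple regression and computes R square and the Durbin-Watson statistic from the residuals. The function names and the data values are made up for illustration; they are not taken from the ISSP dataset.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared, residuals).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot, residuals


def durbin_watson(residuals):
    """Sum of squared successive residual differences over the residual sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical values standing in for EDUC_YRS and SIOPS_R.
educ_yrs = [8, 10, 11, 12, 12, 13, 15, 16, 17, 18]
siops = [25, 30, 34, 38, 36, 42, 50, 55, 58, 64]

a, b, r2, res = fit_simple_ols(educ_yrs, siops)
print(f"intercept={a:.2f} slope={b:.2f} R2={r2:.3f} DW={durbin_watson(res):.2f}")
```

The Durbin-Watson value only makes sense when the cases have a meaningful order (for example time), which is why the handout treats it as mainly relevant for time-series data.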

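The influence statistics from step 6 can be sketched for the simple regression case as follows. This is an illustrative Python version using the textbook formulas (leverage h = 1/n + (x - mean)^2 / Sxx, and Cook's distance with p = 2 estimated parameters); the function name and data are hypothetical. As far as I know, SPSS saves a centered leverage value (h minus 1/n), so saved values may differ from the ones below by that constant.

```python
from statistics import mean


def influence_stats(x, y):
    """Return (leverage, cooks_distance) lists for a simple OLS regression."""
    n, p = len(x), 2  # p = number of estimated parameters (intercept + slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Uncentered leverage: diagonal of the hat matrix.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return leverage, cooks


# Made-up data: the last case lies far from the others on x.
x = [1, 2, 3, 4, 10]
y = [2, 4, 5, 8, 30]
lev, cook = influence_stats(x, y)
for i, (h, d) in enumerate(zip(lev, cook), start=1):
    print(f"case {i}: leverage={h:.2f} cook={d:.2f}")
```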
INDIVIDUAL ASSIGNMENT 1 (OPTIONAL)

Re-estimate and interpret on your own the simple regression model discussed above using data for Romania. Before interpreting regression coefficients examine diagnostic tests and plots to check for possible violations of regression assumptions.

Dataset: EVS_2008_Romania_r80.sav (European Values Survey 2008), random 80% subsample from the nationally representative sample.
Subsample for analysis: employed respondents with nonfarm origins

Since the dataset contains both employed and unemployed respondents and respondents with farm and nonfarm origins, you need to select just the respondents that are employed and with nonfarm origins and run your analyses on this subsample. The variable EMPL identifies employed and unemployed respondents and the variable NONFARM identifies respondents with farm and nonfarm origins.

To select a subsample in SPSS: Data → Select cases → If condition is satisfied → If → Type “empl=1 and nonfarm=1” without the quotes in the upper right box → Continue. You can then choose:
• to filter out cases (default and recommended option: unselected cases remain in the dataset, but statistical analyses will not take them into account),
• to move selected cases to a new dataset (in this case save the resulting dataset with a new name), or
• to delete cases from the existing dataset (Warning: using this option followed by the “Save” option overwrites the original dataset).

Variables:
Outcome variable: SIOPS_R (Respondent’s occupational prestige, coded using the Standard International Prestige Score (SIOPS), with a theoretical range from 6 – occupations with low prestige to 78 – occupations with high prestige)
Predictor variable: EDUC_YRS (Respondent’s schooling measured in completed years of education)

EXAMPLE 2 – MULTIPLE LINEAR REGRESSION

Model: Effects of socio-demographic characteristics on income.
Dataset: VF2008_Romania_r80 (Family Life, 2008, Soros Foundation Romania), random 80% subsample from the nationally representative sample.

Variables:
Outcome variable: INCOMER (Respondent’s personal income)
Predictor variables:
• GENDERR (Respondent’s gender, 1 = Male, 2 = Female)
• REDUC (Respondent’s educational level, 1 = No schooling … 9 = university and postgraduate degree)
• EMPL (Employment status, 1 = full time employment, 2 = part time employment, 3 = self employed, 4 = inactive/unemployed, 99 = DK/NA)
• MARSTAT (Marital status, 1 = married, 2 = cohabiting, 3 = single)
• AGER (Age, 18-91, 99 = NA)
• LOCSIZE (Locality type and size, 1 = big city, 2 = medium sized city, 3 = small city, 4 = village – administrative center, 5 = other villages)

Some notes on variable recoding

Income variables are generally positively skewed and need to be transformed. Recommended transformation: ln(incomer).
• Examine the distribution of the INCOMER variable (SPSS: Analyze → Descriptive Statistics → Frequencies → Charts → Histogram with normal curve).
• To construct the natural logarithm of INCOMER, in SPSS: Transform → Compute Variable → Fill in a new name under “Target Variable” and fill in “ln(incomer)” without the quotes under “Numeric Expression”.

Dichotomous predictors such as GENDERR should be dummy coded (for example code 0 for females, code 1 for males).

For the purposes of this example, we’ll look at differences between employed (EMPL=1 thru 3) and unemployed people (EMPL=4), so 2 categories and 1 dummy variable introduced as a predictor.
• SPSS: Transform → Recode into Different…

Categorical predictors (such as MARSTAT) should be dummy coded, resulting in k dummy variables (where k is the number of categories), of which k-1 are used as predictors in the regression equation. For the purposes of this example use “single” as the reference category (omitted dummy variable).

Locality size could be considered an ordinal level variable with a clear ordering according to population size in the first 3 categories (big, medium, and small cities), but the presence of the last two categories (administrative center villages and other villages) makes the variable only partially ordered. As such, we will treat it as a categorical variable and construct 5 dummy variables. Use “other villages” as the reference category.
• For variables with several categories, explore whether there are statistically significant differences by interpreting the sizes and significances of dummy variable coefficients and by varying the reference category used.
• You can examine the overall statistical significance of a set of dummy variables (e.g. the locality type dummies) by introducing the associated dummy variables in a separate block in the regression equation. Interpret the overall significance of the variable by using the F test for R square change. Request the R squared change statistic under “Statistics” in the Linear Regression procedure in SPSS.

Regression assumption diagnostics

The same as for the simple regression case. In addition, examine the partial regression plots to assess linearity and the presence of outliers (applicable only for interval level predictors).

Multicollinearity – diagnostics and solutions

Start by inspecting a bivariate correlation matrix between predictor variables. Correlations over .8 or .9 suggest a moderate to high degree of collinearity of predictors.

In SPSS, in the linear regression procedure, under “Statistics”, ask for Collinearity diagnostics:
• Tolerance: values less than .2 suggest high collinearity
• VIF: values greater than 4 (conservative) or 7 (liberal criterion) suggest high collinearity
• Condition index: values over 30 suggest multicollinearity

Solutions in case of multicollinearity: depending on the context, either construct a scale using the highly collinear variables (if the variables measure the same dimension) or delete one of the highly collinear variables from the model. If you have a substantive interest in estimating the effects of all of the highly collinear variables but they do not form a scale, you can use a block regression with these predictors in a separate block and compare regression results with and without these variables.
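To see where Tolerance and VIF come from, here is a small Python sketch for the special case of exactly two predictors, where the tolerance of each predictor equals 1 minus their squared correlation and VIF is the reciprocal of tolerance. With more predictors the squared correlation is replaced by the R square from regressing one predictor on all the others, which this sketch does not cover. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def tolerance_and_vif(x1, x2):
    """Two-predictor case only: tolerance = 1 - r^2, VIF = 1 / tolerance."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2
    return tolerance, 1 / tolerance


# Made-up predictors that are strongly, but not perfectly, correlated.
educ_level = [1, 2, 3, 4, 5, 6, 7, 8]
educ_years = [4, 8, 8, 10, 12, 12, 16, 18]
tol, vif = tolerance_and_vif(educ_level, educ_years)
print(f"tolerance={tol:.3f} VIF={vif:.2f}")
```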

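The k-1 dummy coding rule described in the recoding notes can be sketched in Python as below. The helper name `dummy_code` and the output variable naming (`d_<code>`) are assumptions made for illustration; missing values are represented as `None` and stay missing on every dummy.

```python
def dummy_code(values, reference):
    """Build k-1 dummy variables, omitting the reference category.

    Returns a dict mapping a hypothetical name 'd_<code>' to a list of 0/1
    values (None where the original value is missing).
    """
    categories = sorted({v for v in values if v is not None and v != reference})
    return {
        f"d_{c}": [None if v is None else int(v == c) for v in values]
        for c in categories
    }


# MARSTAT-style codes: 1 = married, 2 = cohabiting, 3 = single (reference).
marstat = [1, 2, 3, 1, None, 3]
for name, column in dummy_code(marstat, reference=3).items():
    print(name, column)
```

Changing the `reference` argument and re-estimating the model is the code equivalent of "varying the reference category used".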
Example 2a. Recode variables and estimate the regression model in one block. Examine effects of dummy variables and decide which categories to collapse. Re-estimate the model using blocks for sets of dummies and examine diagnostic tests and plots.

Handling nonlinear effects

You can split ordinal and interval level predictors into categories if you suspect that their effect on the dependent variable is non-linear. For example, the relationship between age and income might be nonlinear. You can explore that by using 10-year or 5-year age groups instead of the original age variable. It might also be appropriate to account for the squared effect of age on income if the relationship is curvilinear.

Interaction effects can also be introduced in the linear regression model if you think the effect of one predictor is moderated by another predictor.

Centering predictors

Centering is subtracting the mean from variable values. It is done for two different purposes:
• To make the intercept meaningful (the sizes of slopes are not affected by centering)
• To avoid multicollinearity between a predictor and power terms of the predictor (age and age squared)
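The second purpose of centering can be illustrated with a short Python sketch: the correlation between age and age squared is very high on the raw scale and drops sharply once age is centered. The ages below are made up and symmetric around their mean, so the centered correlation happens to be exactly zero; with real data it is usually small rather than zero.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


ages = [20, 30, 40, 50, 60, 70]  # hypothetical ages
age_c = center(ages)

r_raw = pearson_r(ages, [a * a for a in ages])
r_centered = pearson_r(age_c, [a * a for a in age_c])
print(f"raw: r={r_raw:.3f}  centered: r={r_centered:.3f}")
```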

INDIVIDUAL ASSIGNMENT 2 (OPTIONAL)

1. Estimate a regression model predicting satisfaction with life, using 2005 data from the Public Opinion Barometer (dataset: bd bop noiembrie 2005.sav). Satisfaction with life is variable V22. You might explore as predictors of life satisfaction: marital status (V55), number of children (V56), gender (V235), age (V237), educational level (V238), household income per capita (V253 household income and b65 number of persons in the household).

Tips for recoding variables: Start by assigning system missing values to NA/DK values on each variable. Recode marital status in this example using a simple dichotomy between legally married respondents and all other respondents. Dichotomize the gender variable. Recode the educational level variable into fewer categories so that the resulting variable is an ordinal level variable (in the original version the categories of the variable are only partially ordered). For the purposes of this example, use a smaller number of broad categories (this will make it easier to manage the dichotomized version of the variable later in the exercise). Household income per capita is computed by dividing the household income by the number of persons in the household.

* SPSS syntax for recoding education.
RECODE v238 (99=SYSMIS) (1=1) (2 thru 3=2) (4 thru 6=3) (7 thru 9=4) (10 thru 12=5) (13 thru 14=6) INTO educ.
EXECUTE.

* SPSS syntax for computing household income per capita (V253 = household income).
COMPUTE hhinc=v253/b65.
EXECUTE.

2. Explore the possibility of a nonlinear effect of age on satisfaction with life by using a squared age variable as predictor in addition to age. Ask for collinearity diagnostics and look at VIF and Tolerance values for AGE and AGE squared. Center the age variable and recompute the age squared variable using the centered age. Re-run the model and take a look again at the values of the collinearity diagnostics.

Tip for computing power terms: Use the Transform → Compute menu.

* SPSS syntax for computing age squared.
COMPUTE agesq=v237*v237.
EXECUTE.

Tip for centering variables: Run a frequency for the variable you want to center and ask for the mean of the variable in the Statistics menu of the Frequency procedure. Compute the new, centered variable by subtracting the mean from the original variable.

* Requesting mean for the age variable.
FREQUENCIES VARIABLES=v237 /FORMAT=NOTABLE /STATISTICS=MEAN /ORDER=ANALYSIS.

* Constructing the age centered and age centered squared terms.
COMPUTE age_c=v237-48.68.
EXECUTE.
COMPUTE age_c_sq=age_c*age_c.
EXECUTE.

3. Explore whether marital status acts as a moderator in the relationship between education and satisfaction with life (test for an interaction effect between marital status and education).

Tip: For OLS regression in SPSS you have to manually construct the interaction term (as a product of the two main variables).
• To test the interaction between two interval level variables, you will construct a single interaction term, equal to the product of the two variables.
• To test the interaction between an interval level variable and a dichotomous variable, the strategy is the same as for two interval level variables.
• To test the interaction between an interval level variable and a nominal/ordinal variable with k-1 dummies in the regression, you will have to construct k-1 interaction terms (interval level variable multiplied by the first dummy, then by the second dummy, and so on).
• The same strategy applies for a dichotomous variable interacted with a nominal/ordinal variable with k-1 dummies.

* SPSS syntax for constructing the interaction term between education and marital status.
COMPUTE interact1=educ*married.
EXECUTE.

4. Explore the possibility that education has nonlinear effects on satisfaction with life by dichotomizing educational level categories and introducing the education dummies in a separate block in the regression equation. Delete the previous interaction from the model (education is measured here in a different way and consequently the interaction term no longer makes sense).

Tip: Collapse the first two education categories since the first category contains a very small number of cases. Use EDUC1 resulting from the syntax below as the reference category in the regression equation.

* SPSS syntax for constructing education dummies.
** The first two education categories are collapsed since the first category contains a very small number of cases.
RECODE educ (SYSMIS=SYSMIS) (1 thru 2=1) (ELSE=0) INTO educ1.
RECODE educ (SYSMIS=SYSMIS) (3=1) (ELSE=0) INTO educ2.
RECODE educ (SYSMIS=SYSMIS) (4=1) (ELSE=0) INTO educ3.
RECODE educ (SYSMIS=SYSMIS) (5=1) (ELSE=0) INTO educ4.
RECODE educ (SYSMIS=SYSMIS) (6=1) (ELSE=0) INTO educ5.
EXECUTE.

5. Re-test for an interaction effect between marital status and education (measured with dummies). Introduce these interaction terms in a separate block in the regression equation in order to test for the overall effect of the interaction between marital status and education measured in this way.

Tip: You will have to construct 4 separate interaction terms: educ2*married, educ3*married, educ4*married, and educ5*married. Introduce these 4 interaction terms in a separate block in the regression equation and interpret the R squared change statistic.

* SPSS syntax for constructing the 4 interaction terms.
COMPUTE educ2mar=educ2*married.
COMPUTE educ3mar=educ3*married.
COMPUTE educ4mar=educ4*married.
COMPUTE educ5mar=educ5*married.
EXECUTE.

6. Choose what you think is the best model and interpret regression results for this model (regression coefficients, R squared coefficients).

HOMEWORK ASSIGNMENT (DUE NEXT WEEK)

Using a database you are familiar with, run a multiple regression model on a dependent variable of your choice with several predictors that you consider relevant. You can use the Public Opinion Barometer from 2005 (link available on the lab webpage). Examine the diagnostic tests and plots that we discussed during the lab and write a short interpretation of these. You might try some of the solutions we discussed if you find that there are marked violations of the regression assumptions. Interpret regression coefficients and the R square for your model. Try to think about and present substantive findings of your model. Please include a print-out of relevant parts of the output and a discussion of your results of half a page to one page when you turn in your assignment.
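The R squared change test used whenever a block of predictors (a set of dummies or interaction terms) is added can be written out explicitly. The sketch below uses the standard F formula for comparing nested OLS models; the function name and the numbers in the example are made up for illustration and do not come from any of the lab datasets.

```python
def r2_change_f(r2_reduced, r2_full, n, k_full, q):
    """F statistic for the R square change between two nested OLS models.

    r2_reduced: R square of the model without the added block
    r2_full:    R square of the model with the added block
    n:          number of cases used in both models
    k_full:     number of predictors in the full model (intercept excluded)
    q:          number of predictors in the added block
    Returns (F, df1, df2).
    """
    df1 = q
    df2 = n - k_full - 1
    f_stat = ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
    return f_stat, df1, df2


# Hypothetical example: adding 4 interaction terms to a model that
# ends up with 10 predictors, estimated on 1200 cases.
f_stat, df1, df2 = r2_change_f(r2_reduced=0.180, r2_full=0.190,
                               n=1200, k_full=10, q=4)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The resulting F value is compared with the F distribution with df1 and df2 degrees of freedom; SPSS reports the corresponding significance as "Sig. F Change".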