
Regression Analysis

with SPSS

1
Outline
1. Conceptualization
2. Schematic Diagrams of Linear Regression processes
3. Using SPSS, we plot and test relationships for linearity
4. Nonlinear relationships are transformed to linear ones
5. General Linear Model
6. Derivation of the sums of squares and ANOVA; derivation of the intercept and regression coefficients
7. The Prediction Interval and its derivation
8. Model Assumptions
1. Explanation
2. Testing
3. Assessment
9. Alternatives when assumptions are unfulfilled

2
Conceptualization of
Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition of effects

3
Hypothesis Testing
• For example, hypothesis 1: X is statistically significantly related to Y.
– The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
– The magnitude of the relationship is small, medium, or large.
If the magnitude is small, then a unit change in X is associated with a small change in Y.
4
Regression Analysis
Have a clear notion of what you can and cannot do with
regression analysis
• Conceptualization
– A Path Model of a Regression Analysis

Path Diagram of a Linear Regression Analysis

[Path diagram: predictors X1, X2, and X3 each point to Y, and an error term also points to Y.]

$Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i$

5
A Path Analysis
Decomposition of Effects into Direct, Indirect, Spurious, and Total Effects

[Path diagram: exogenous variables X1 and X2 and endogenous variables Y1, Y2, and Y3, each endogenous variable with its own error term; paths A through F connect the variables, with paths C, E, and F leading directly into Y3.]

Direct effects: paths C, E, F
Indirect effects: paths AC, BE, DF
Total effects: the sum of direct and indirect effects
Spurious effects are due to common (antecedent) causes.

In a path analysis, Yi is endogenous; it is the outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total effects = direct + indirect effects
6
Interaction Analysis

[Path diagram: X1, X2, and their product X1*X2 each point to Y.]

$Y = k + a X_1 + b X_2 + c X_1 X_2$

Interaction coefficient: c
X1 and X2 must both be in the model for the interaction to be properly specified.
7
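Below is a minimal sketch (outside SPSS, in Python with statsmodels and simulated data) of how such an interaction model can be fit; the variable names and data here are purely illustrative.

```python
# Illustrative sketch (not part of the SPSS workflow): fitting
# Y = k + a*X1 + b*X2 + c*X1*X2 with statsmodels. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
# True model: k=1, a=2, b=-1, c=0.5, plus noise
df["Y"] = 1 + 2*df["X1"] - 1*df["X2"] + 0.5*df["X1"]*df["X2"] + rng.normal(size=n)

# "X1 * X2" expands to X1 + X2 + X1:X2, so both main effects stay in the model,
# which is what proper specification of the interaction requires.
model = smf.ols("Y ~ X1 * X2", data=df).fit()
print(model.params)   # intercept, X1, X2, and X1:X2 (the interaction coefficient c)
```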
A Precursor to Modeling with
Regression
• Data Exploration: Run a scatterplot matrix
and search for linear relationships with the
dependent variable.

8
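As an optional parallel to the SPSS scatterplot matrix, the sketch below builds the same kind of display in Python with pandas on simulated data; the variable names are illustrative only.

```python
# Optional parallel to the SPSS scatterplot matrix, using pandas on simulated data.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 1.5*x1 - 2.0*x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

scatter_matrix(df, figsize=(6, 6), diagonal="hist")  # look for roughly linear panels vs. y
plt.show()
```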
Click on graphs and then on
scatter

9
When the scatterplot dialog box
appears, select Matrix

10
A Matrix of Scatterplots will
appear

Search for distinct linear relationships

11
12
13
Decomposition of the Sums of
Squares

14
Graphical Decomposition of Effects

[Figure: scatterplot of Y against X with the fitted regression line, showing for one case the segments that make up the decomposition.]

$\hat{y} = a + bx$
$y_i - \hat{y}_i = \text{error}$
$y_i - \bar{y} = \text{total effect}$
$\hat{y}_i - \bar{y} = \text{regression effect}$
15
Decomposition of the sum of squares

$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$
total effect = error effect + regression (model) effect, per case i

Squaring and summing over all cases in the data set:

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$

16
Decomposition of the sum of squares

• Total SS = model SS + error SS, and if we divide each sum of squares by its degrees of freedom:

$\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}, \qquad \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-k-1}, \qquad \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{k}$

• This yields the variance decomposition: the total variance, the error variance, and the model (regression) variance, i.e., the mean squares that enter the ANOVA table.

17
F test for significance and R2 for magnitude of effect

• R2 = model SS / total SS:

$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

• F test for model significance (model variance / error variance):

$F_{(k,\,n-k-1)} = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$
18
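The following Python sketch (simulated data, not part of the SPSS workflow) checks the decomposition and the R2 and F formulas above numerically.

```python
# Numerical check: total SS = model SS + error SS, plus the R^2 and F formulas.
# Simulated data; k = number of predictors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.normal(size=(n, k))
y = 2 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])            # design matrix with intercept
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # OLS estimates
yhat = Xd @ beta

ss_total = np.sum((y - y.mean())**2)
ss_model = np.sum((yhat - y.mean())**2)
ss_error = np.sum((y - yhat)**2)
print(np.isclose(ss_total, ss_model + ss_error))  # True

r2 = ss_model / ss_total
F = (r2 / k) / ((1 - r2) / (n - k - 1))           # same as (SS_model/k)/(SS_error/(n-k-1))
p_value = stats.f.sf(F, k, n - k - 1)
print(r2, F, p_value)
```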
ANOVA tests the significance of the
Regression Model

19
The Multiple Regression
Equation
• We proceed to the derivation of its components:
– The intercept: a
– The regression parameters, b1 and b2

$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$

20
Derivation of the Intercept

$y_i = a + b x_i + e_i$
$e_i = y_i - a - b x_i$

Summing over all n cases:

$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i$

Because by definition $\sum_{i=1}^{n} e_i = 0$:

$0 = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i$
$na = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$
$a = \bar{y} - b\bar{x}$
21
Derivation of the Regression Coefficient

Given: $y_i = a + b x_i + e_i$
$e_i = y_i - a - b x_i$
$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$

Differentiating with respect to b (with x and y expressed as deviations from their means, so the intercept term drops out) and setting the derivative to zero:

$\frac{\partial \sum e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i^2 = 0$

$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$
22
• If we recall that the formula for the
correlation coefficient can be expressed as
follows:

23
$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \; \sum_{i=1}^{n} y_i^2}}$

where $x = x_i - \bar{x}$ and $y = y_i - \bar{y}$ (deviation scores), and

$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$

from which it can be seen that the regression coefficient b is a function of r:

$b = r \cdot \frac{sd_y}{sd_x}$
24
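A quick numerical check of b = r·(sd_y/sd_x), using simulated data in Python:

```python
# Verify b = r * (sd_y / sd_x) in the bivariate case (simulated data).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 4 + 2.5*x + rng.normal(size=500)

r = np.corrcoef(x, y)[0, 1]
b_from_r = r * y.std(ddof=1) / x.std(ddof=1)
b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
print(np.isclose(b_from_r, b_ols))  # True
```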
Extending the bivariate case
To the Multiple linear regression case

25
$b_{y x_1 \cdot x_2} = \frac{r_{y x_1} - r_{y x_2}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$

$b_{y x_2 \cdot x_1} = \frac{r_{y x_2} - r_{y x_1}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \frac{sd_y}{sd_{x_2}} \qquad (7)$

It is also easy to extend the bivariate intercept to the multivariate case as follows:

$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$

26
Significance Tests for the
Regression Coefficients
1. We find the significance of the parameter
estimates by using the F or t test.

2. The R2 is the proportion of variance explained.

3. Adjusted R2 corrects for the number of predictors:

$R_{adj}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$

where n = sample size and p = number of parameters in the model.

27
F and t tests for significance of the overall model

$F = \frac{\text{model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$

where p = number of parameters and n = sample size.

$t = \sqrt{F}$; in the bivariate case, $F = \frac{(n-2)\,r^2}{1 - r^2}$
28
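The following short Python sketch (simulated data) illustrates the adjusted R2, the overall F, and the t = √F relationship for the bivariate case:

```python
# Adjusted R^2, overall F, and t = sqrt(F) in the bivariate (one-predictor) case.
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.normal(size=n)
y = 1 + 0.8*x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]
r2 = r**2
p = 1                                         # one predictor
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
F = (r2 / p) / ((1 - r2) / (n - p - 1))       # equals (n-2) r^2 / (1 - r^2) here
t = np.sqrt(F)                                # t statistic for the slope
print(adj_r2, F, t)
```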
Significance tests
• If we are using Type II sums of squares, we are dealing with the ballantine (the Venn-diagram representation of overlapping variance). The variance of the DV explained = a + b.

29
Significance tests
t tests for statistical significance of the parameter estimates:

$t = \frac{a - 0}{SE_a}$

$t = \frac{b - 0}{SE_b}$

30
Significance tests
Standard error of the intercept:

$SE_a = \hat{\sigma}\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

Standard error of the regression coefficient:

$SE_b = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

where $\hat{\sigma}$ = standard deviation of the residuals, estimated by

$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$
31
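A Python sketch of the two standard-error formulas above for the simple-regression case, using simulated data:

```python
# Standard errors of the intercept and slope in simple regression, computed
# directly from the formulas above (simulated data).
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.normal(loc=5, scale=2, size=n)
y = 3 + 1.2*x + rng.normal(size=n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
resid = y - (a + b*x)
sigma_hat = np.sqrt(np.sum(resid**2) / (n - 2))      # std dev of the residuals

se_a = sigma_hat * np.sqrt(1/n + x.mean()**2 / np.sum((x - x.mean())**2))
se_b = sigma_hat / np.sqrt(np.sum((x - x.mean())**2))
print(a, se_a, b, se_b)
```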
Programming Protocol
After invoking SPSS, proceed to File, Open, Data

32
Select a Data Set (we choose
employee.sav) and click on open

33
We open the data set

34
To inspect the variable formats,
click on variable view on the lower
left

35
Because gender is a string
variable, we need to recode gender
into a numeric format

36
We autorecode gender by clicking
on transform and then autorecode

37
We select gender and move it into
the variable box on the right

38
Give the variable a new name and
click on add new name

39
Click on ok and the numeric
variable sex is created

It has values 1 for female and 2 for male, and those value labels are inserted.
40
To invoke Regression analysis,
Click on Analyze

41
Click on Regression and then
linear

42
Select the dependent variable:
Current Salary

43
Enter it in the dependent
variable box

44
Entering independent variables
• These variables are entered in blocks. First come the potentially confounding covariates that have to be entered.
• We enter time on job, beginning salary, and previous experience.

45
After entering the covariates, we
click on next

46
We now enter the hypotheses we
wish to test
• We are testing for minority or sex
differences in salary after controlling for
the time on job, previous experience, and
beginning salary.
• We enter minority and numeric gender
(sex)

47
After entering these variables, click
on statistics

48
We select the following statistics
from the dialog box and click on
continue

49
Click on plots to obtain the plots
dialog box

50
We click on OK to run the
regression analysis

51
Navigation window (left) and output
window (right)
This shows that SPSS is reading the variables correctly

52
Variables Entered and Model
Summary

53
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis

54
Full Model
Coefficients

CurSal = −12036.3 + 1.83·BeginSal + 165.17·Jobtime − 23.64·Exper + 2882.84·gender − 1419.7·Minority

55
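For readers working outside SPSS, the sketch below shows one way the same model might be reproduced in Python with statsmodels. It assumes the usual employee.sav variable names (gender, salary, salbegin, jobtime, prevexp, minority), that the pyreadstat package is installed so pandas can read .sav files, and that a simple category recode stands in for the AUTORECODE step; treat it as illustrative, not as the SPSS procedure itself.

```python
# Sketch of reproducing the SPSS model outside SPSS. Variable names and file
# path are assumptions based on the standard employee.sav data set.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("employee.sav")                            # path is illustrative
# Rough stand-in for AUTORECODE; the resulting 1/2 coding assumes the same
# alphabetical ordering that SPSS would use.
df["sex"] = df["gender"].astype("category").cat.codes + 1

full = smf.ols("salary ~ salbegin + jobtime + prevexp + sex + minority", data=df).fit()
print(full.summary())                                        # compare with the SPSS coefficients table
```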
We omit insignificant variables and rerun the
analysis to obtain trimmed model coefficients

CurSal = −12126.5 + 1.85·BeginSal + 163.20·Jobtime − 24.36·Exper + 2694.30·gender
56
Beta weights
• These are standardized regression coefficients, used to compare each predictor's contribution to explaining the variance of the dependent variable within the model.

57
T tests and signif.
• These are the tests of significance for
each parameter estimate.

• The significance levels have to be less


than .05 for the parameter to be
statistically significant.

58
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model
(no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual
variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
59
Explanation of the Assumptions
1. Linear functional form
– Regression does not detect curvilinear relationships
2. Independent observations
– Requires a representative sample
– Autocorrelation inflates the t, r, and F statistics and warps the significance tests
3. Normality of the residuals
– Permits proper significance testing
4. Equality of variance
– Heteroskedasticity precludes generalization and external validity
– It also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

60
Diagnostic Tests for the Regression Assumptions
1. Linearity tests: regression curve fitting
– No level shifts: one regime
2. Independence of observations: runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's general specification test
5. No autocorrelation of residuals: Durbin-Watson test, or the ACF or PACF of the residuals
6. Multicollinearity: correlation matrix of independent variables; condition index or condition number
7. No serious outlier influence: tests of additive outliers (pulse dummies)
– Plot residuals and look for high-leverage cases
– Lists of standardized residuals
– Lists of studentized residuals
– Cook's distance or leverage statistics

61
Explanation of Diagnostics
1. Plots show linearity or nonlinearity of the relationship.
2. The correlation matrix shows whether the independent variables are collinear (highly intercorrelated).
3. A representative sample is obtained through probability sampling.

62
Explanation of Diagnostics
Tests for normality of the residuals: the residuals are saved and then subjected to either of the following.
Kolmogorov-Smirnov test: compares the cumulative distribution of the residuals against the theoretical cumulative normal distribution.
In SPSS: Nonparametric Tests, 1-Sample K-S test.

63
Collinearity Diagnostics

Tolerance = 1 − R²j, where R²j comes from regressing predictor j on the other independent variables; small tolerances imply problems.

Variance Inflation Factor: VIF = 1 / Tolerance

Small intercorrelations among the independent variables mean VIF ≈ 1.
VIF ≥ 10 signifies problems.

64
More Collinearity Diagnostics

Condition number k = maximum eigenvalue / minimum eigenvalue.
If the condition number is between 100 and 1000, there is moderate to strong collinearity.

Condition index = √k, where k = condition number.
If the condition index > 30, there is strong collinearity.

65
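A Python sketch of these collinearity diagnostics (tolerance, VIF, and a condition number/index) on simulated predictors. Note that packages differ in exactly how they scale the design matrix before computing the condition index, so treat the last step as one common convention rather than an exact reproduction of the SPSS output.

```python
# Tolerance, VIF, and a condition number/index computed directly (simulated predictors).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9*x1 + 0.2*rng.normal(size=n)        # deliberately collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for i in range(1, X.shape[1]):              # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"x{i}: VIF = {vif:.2f}, tolerance = {1/vif:.3f}")

# Condition number and condition index from the eigenvalues of X'X
# (one common convention; SPSS applies its own column scaling first).
eigvals = np.linalg.eigvalsh(X.T @ X)
cond_number = eigvals.max() / eigvals.min()
cond_index = np.sqrt(cond_number)
print(cond_number, cond_index)
```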
Outlier Diagnostics
1. Residuals
– The actual value minus the predicted value; otherwise known as the error.
2. Studentized residuals
– The residuals divided by their standard errors, computed without the ith observation.
3. Leverage, called the hat diagonal
– The measure of influence of each observation.
4. Cook's distance
– The change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.

66
Outlier detection
• Outlier detection involves determining whether the residual (error = actual − predicted) is an extreme negative or positive value.
• After running the regression, we may plot the residuals against the fitted values to determine which errors are large.

67
Create Standardized Residuals
• A standardized residual is a residual divided by the standard deviation of the residuals:

$resid_{standardized} = \frac{y_i - \hat{y}_i}{s}$

where s = standard deviation of the residuals.

68
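A minimal Python illustration of standardized residuals, using a simulated bivariate regression:

```python
# Standardized residuals by hand: residual divided by the residual standard deviation.
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
y = 2 + x + rng.normal(size=n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
resid = y - (a + b*x)

z_resid = resid / resid.std(ddof=1)
print(np.abs(z_resid).max())        # flag cases with |z| > 3.5 as outliers
```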
Limits of Standardized
Residuals
If the standardized residuals have values beyond ±3.5, they are outliers.

If the absolute values are less than 3.5, as these are, then there are no outliers.

While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.

69
Outlier Influence
• Suppose we had a different data set with
two outliers.
• We tabulate the standardized residuals
and obtain the following output:

70
Outlier a does not distort and outlier
b does.

71
Studentized Residuals
• Alternatively, we could form studentized
residuals. These are distributed as a t
distribution with df=n-p-1, though they are not
quite independent. Therefore, we can
approximately determine if they are
statistically significant or not.
• Belsley et al. (1980) recommended the use of
studentized residuals.

72
Studentized Residual

$e_i^s = \frac{e_i}{\sqrt{s_{(i)}^2\,(1 - h_i)}}$

where
$e_i^s$ = studentized residual
$s_{(i)}$ = standard deviation of the residuals with the ith observation deleted
$h_i$ = leverage statistic

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator can be formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier. The Stata command to generate studentized residuals, here called rstudt, is: predict rstudt, rstudent

73
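Outside Stata, externally studentized residuals can be obtained, for example, from statsmodels; the sketch below uses simulated data.

```python
# Externally studentized residuals via statsmodels (an alternative to the Stata
# command quoted above); simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
rstudent = infl.resid_studentized_external   # case i is deleted when estimating s(i)
print(np.abs(rstudent).max())                # compare against the t distribution, df = n - p - 1
```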
Influence of Outliers
1. Leverage is measured by the diagonal elements of the hat matrix.
2. The hat matrix comes from the formula for the regression of Y:

$\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$

where $X(X'X)^{-1}X'$ = the hat matrix, H. Therefore,

$\hat{Y} = HY$

74
Leverage and the Hat Matrix
1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix equals the number of parameters estimated in the model (see the sketch below).
6. When the leverage > 2p/n there is high leverage, according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.

75
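The sketch below (Python, simulated data) computes the hat matrix directly, confirms that its trace equals the number of parameters, and applies the 2p/n rule.

```python
# Leverage: the diagonal of the hat matrix H = X (X'X)^{-1} X'.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, p = 100, 3                                  # p = number of parameters (incl. constant)
X = sm.add_constant(rng.normal(size=(n, p - 1)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(np.isclose(leverage.sum(), p))           # trace of H equals the number of parameters
print(np.where(leverage > 2 * p / n)[0])       # high-leverage cases by the 2p/n rule
```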
Cook's D
1. Another measure of influence.
2. This is a popular one. The formula for it is:

$Cook's\ D_i = \frac{1}{p}\left(\frac{h_i}{1 - h_i}\right)\frac{e_i^2}{s^2(1 - h_i)}$

Cook and Weisberg (1982) suggested that values of D exceeding the 50th percentile of the F distribution (df = p, n − p) are large.

76
Using Cook's D in SPSS
• Cook's distance is available as a Save option in the regression dialog (under Distances).
• Finding the influential outliers: list the cases for which Cook's D > 4/n.
• Belsley suggests 4/(n − k − 1) as a cutoff.

77
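A Python sketch of Cook's D with the two cutoffs mentioned above (the median of the relevant F distribution and the rougher 4/n rule), using simulated data with one deliberately influential case:

```python
# Cook's D from statsmodels, with the two cutoffs discussed on the slides.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(9)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
X[0, 1] = 6.0                                  # give the first case unusual leverage
y = X @ np.array([1.0, 0.7, -0.3]) + rng.normal(size=n)
y[0] -= 10                                     # ...and a large residual, so it is influential

fit = sm.OLS(y, X).fit()
cooks_d, _ = fit.get_influence().cooks_distance
p = X.shape[1]

print(np.where(cooks_d > stats.f.ppf(0.5, p, n - p))[0])   # Cook & Weisberg cutoff
print(np.where(cooks_d > 4 / n)[0])                        # rougher 4/n screening rule
```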
DFbeta
• One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate when that observation is deleted.

$DFbeta_{ij} = b_j - b_{(i)j} = \frac{u_{ij}\, e_i}{(1 - h_i)\sum_{k} u_{kj}^2}$

where $u_j$ = the residuals from regressing $x_j$ on the remaining x's, and $e_i$ is the ith regression residual.

78
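A Python sketch of DFBETAS from statsmodels on simulated data; the 2/√n screening cutoff used below is the conventional Belsley-style rule of thumb rather than anything specific to SPSS.

```python
# DFBETAS via statsmodels: standardized change in each coefficient when case i is deleted.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
dfbetas = fit.get_influence().dfbetas          # shape: (n cases, number of parameters)
flagged = np.where(np.abs(dfbetas) > 2 / np.sqrt(n))
print(list(zip(*flagged)))                     # (case, coefficient) pairs worth inspecting
```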
Programming Diagnostic Tests
Testing homoskedasticity
Select histogram, normal probability plot, and insert *zresid
in Y
and *zpred in X

Then click on continue


79
Click on Save to obtain the
Save dialog box

80
We select the following

Then we click on continue, go back to the Main


Regression Menu and click on OK 81
Check for linear Functional
Form
• Run a matrix plot of the dependent
variable against each independent
variable to be sure that the relationship is
linear.

82
Move the variables to be graphed into the box on
the upper right, and click on OK

83
Residual Autocorrelation Check

The Durbin-Watson d statistic tests for first-order autocorrelation of the residuals:

$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

See the significance tables for this statistic.

84
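A short Python check that the Durbin-Watson formula above matches the packaged statistic (simulated white-noise residuals):

```python
# Durbin-Watson on a residual series; values near 2 suggest no first-order autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(11)
e = rng.normal(size=200)                       # stand-in for the saved regression residuals

d_manual = np.sum(np.diff(e)**2) / np.sum(e**2)
print(d_manual, durbin_watson(e))              # the two agree
```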
Run the autocorrelation function from the Trends
Module for a better analysis

85
Testing for Homogeneity of variance

86
Normality of residuals can be visually inspected
from the histogram with the superimposed normal
curve.
Here we check the skewness for symmetry and the
kurtosis for peakedness

87
Kolmogorov-Smirnov test: an objective test of normality

88
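A scripted parallel to the SPSS one-sample K-S test, applied in Python to a stand-in for the saved residuals; strictly speaking, estimating the mean and SD from the same data calls for the Lilliefors correction, so treat the p-value as approximate.

```python
# Kolmogorov-Smirnov test of the residuals against a normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
resid = rng.normal(size=300)                   # stand-in for the saved regression residuals

# Parameters are estimated from the data, so the p-value is approximate (Lilliefors issue).
stat, pvalue = stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1)))
print(stat, pvalue)                            # p > .05: no evidence against normality
```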
89
90
Multicollinearity test with the correlation matrix

91
92
93
Alternatives to Violations of
Assumptions
• 1. Nonlinearity: Transform to linearity if there is nonlinearity, or run a nonlinear regression.
• 2. Nonnormality: Run a least absolute deviations regression or a median regression (available in other packages or via generalized linear models: S-PLUS glm, Stata glm, or SAS PROC MODEL or PROC GENMOD).
• 3. Heteroskedasticity: Use weighted least squares regression (SPSS) or the White estimator (SAS, Stata, S-PLUS). One can also use a robust regression procedure (SAS, Stata, or S-PLUS) to downweight the effect of outliers in the estimation.
• 4. Autocorrelation: Run AREG in the SPSS Trends module, or the Prais or Newey-West procedures in Stata.
• 5. Multicollinearity: Use principal components regression, ridge regression, or proxy variables; or use 2SLS in SPSS, ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.

94
Model Building Strategies
• Specific to General: Cohen and Cohen
• General to Specific: Hendry and Richard
• Extreme Bounds analysis: E. Leamer.

95
Nonparametric Alternatives
1. If there is nonlinearity, transform to linearity first.
2. If there is heteroskedasticity, use robust standard errors with Stata, SAS, or S-PLUS.
3. If there is non-normality, use quantile regression with bootstrapped standard errors in Stata or S-PLUS.
4. If there is autocorrelation of residuals, use Newey-West regression or a first-order autocorrelation correction with AREG. If there is higher-order autocorrelation, use Box-Jenkins ARIMA modeling.

96
