
Regression Analysis

with SPSS
Robert A. Yaffee, Ph.D.
Statistics, Mapping and Social Science Group
Academic Computing Services
Information Technology Services
New York University
Office: 75 Third Ave Level C3
Tel: 212.998.3402
E-mail: yaffee@nyu.edu
February 04

1
Outline
1. Conceptualization
2. Schematic Diagrams of Linear Regression
processes
3. Using SPSS, we plot and test relationships for
linearity
4. Nonlinear relationships are transformed to
linear ones
5. General Linear Model
6. Derivation of Sums of Squares and ANOVA
Derivation of intercept and regression
coefficients
7. The Prediction Interval and its derivation
8. Model Assumptions
1. Explanation
2. Testing
3. Assessment
9. Alternatives when assumptions are unfulfilled

2
Conceptualization of
Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition
of effects

3
Hypothesis Testing

• For example, hypothesis 1: X is statistically significantly related to Y.
– The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
– The magnitude of the relationship is small, medium, or large. If the magnitude is small, then a unit change in X is associated with a small change in Y.

4
Regression Analysis
Have a clear notion of what you can and
cannot do with regression analysis

• Conceptualization
– A Path Model of a Regression
Analysis

Path Diagram of a Linear Regression Analysis

[Path diagram: X1, X2, and X3 each point to Y; an error term also points to Y]

Yi = k + b1X1 + b2X2 + b3X3 + ei
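The path diagram above corresponds to an ordinary multiple regression of Y on X1-X3. As a minimal sketch of how such a model can be fit and the coefficients k, b1-b3 recovered, here is a Python example; the synthetic data, the use of numpy/statsmodels, and all numeric values are illustrative assumptions, not part of the slides (which use SPSS).

# Minimal sketch of the model in the path diagram: Y = k + b1*X1 + b2*X2 + b3*X3 + e
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                      # X1, X2, X3
y = 5 + 1.5*X[:, 0] - 2.0*X[:, 1] + 0.5*X[:, 2] + rng.normal(scale=2, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()      # add_constant supplies the intercept k
print(model.params)                              # [k, b1, b2, b3]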

5
A Path Analysis
Decomposition of Effects into Direct,
Indirect, Spurious, and Total Effects

[Path diagram: X1 and X2 are exogenous; Y1, Y2, and Y3 are endogenous, each with its own error term; the arrows are labeled A through F]

Direct effects: paths C, E, F
Indirect effects: paths AC, BE, DF
Total effects: the sum of direct and indirect effects
Spurious effects are due to common (antecedent) causes

In a path analysis, Yi is endogenous: it is the outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total effects = direct + indirect effects


Interaction Analysis

[Diagram: X1, X2, and their product X1*X2 each point to Y; the interaction path is labeled C]

Y = K + AX1 + BX2 + CX1*X2

Interaction coefficient: C
X1 and X2 must be in the model for the interaction to be properly specified (see the sketch below).
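A minimal Python sketch of this specification (synthetic data and statsmodels are illustrative assumptions): the product term X1*X2 is entered alongside both main effects, and its coefficient estimates C.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 0.8*x1 - 1.2*x2 + 0.5*x1*x2 + rng.normal(size=n)

X = np.column_stack([x1, x2, x1*x2])             # both main effects plus the product term
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                                # [K, A, B, C]; the last entry estimates C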
7
A Precursor to Modeling
with Regression
• Data Exploration: Run a
scatterplot matrix and search
for linear relationships with the
dependent variable.

8
Click on graphs and
then on scatter

9
When the scatterplot
dialog box appears, select
Matrix

10
A Matrix of Scatterplots
will appear

Search for distinct linear relationships

11
12
13
Decomposition of the
Sums of Squares

14
Graphical Decomposition
of Effects

[Figure: Decomposition of Effects; for one observation, the vertical distances to the fitted line and to the mean of Y are shown against X]

ŷ = a + bx
yi − ŷi = error
yi − ȳ = total effect
ŷi − ȳ = regression effect

15
Decomposition of the
sum of squares

Y − Ȳ = (Y − Ŷ) + (Ŷ − Ȳ)
total effect = error effect + regression (model) effect

Yi − Ȳ = (Yi − Ŷi) + (Ŷi − Ȳ)   per case i

Squaring and summing over all cases (the cross-product term sums to zero) gives, for the data set:

Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)²   (sums over i = 1, ..., n)

16
Decomposition of the sum
of squares
• Total SS = model SS + error SS
and if we divide by df

Σ(Yi − Ȳ)² / (n − 1)        [total variance]
Σ(Yi − Ŷi)² / (n − k − 1)   [error variance]
Σ(Ŷi − Ȳ)² / k              [model variance]

• This yields the variance decomposition: total variance = model variance + error variance.
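The decomposition can be checked numerically. Below is a minimal Python sketch (synthetic data and numpy are assumptions for illustration; the slides do this inside SPSS) showing that the model and error sums of squares add up to the total sum of squares.

import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 3 + 2*x + rng.normal(size=n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b*x

ss_total = np.sum((y - y.mean())**2)
ss_model = np.sum((y_hat - y.mean())**2)
ss_error = np.sum((y - y_hat)**2)
print(ss_total, ss_model + ss_error)             # equal up to rounding error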

17
F test for significance and
R2 for magnitude of effect
• R2 = Model var/total var
R² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

• F test for model significance = Model Var / Error Var

F(k, n−k−1) = (R²/k) / ((1 − R²)/(n − k − 1))
18
ANOVA tests the
significance of the
Regression Model

19
The Multiple
Regression Equation
• We proceed to the derivation of its
components:
– The intercept: a
– The regression parameters, b1 and b2

Yi = a + b1X1 + b2X2 + ei

20
Derivation of the Intercept
y = a + bx + e
e = y − a − bx

Summing over i = 1, ..., n:

Σei = Σyi − Σa − bΣxi

Because by definition Σei = 0:

0 = Σyi − Σa − bΣxi
Σa = Σyi − bΣxi
na = Σyi − bΣxi

a = ȳ − b·x̄
21
Derivation of the
Regression Coefficient
Given: yi = a + bxi + ei
ei = yi − a − bxi

Σei = Σ(yi − a − bxi)
Σei² = Σ(yi − a − bxi)²

∂(Σei²)/∂b = −2Σxi·yi + 2bΣxi·xi

Setting this derivative to zero:

0 = −2Σxi·yi + 2bΣxi·xi

b = Σxi·yi / Σxi²

(with x and y expressed as deviations from their means, as defined on the next slide)
• If we recall that the formula for
the correlation coefficient can
be expressed as follows:

23
r = Σ(x·y) / √(Σx² · Σy²)

where
x = xi − x̄
y = yi − ȳ

b = Σ(x·y) / Σx²

from which it can be seen that the regression coefficient b is a function of r:

b = r · (sdy / sdx)
Extending the bivariate case
To the Multiple linear regression case

25
βyx1.x2 = [(ryx1 − ryx2·rx1x2) / (1 − r²x1x2)] · (sdy / sdx1)   (6)

βyx2.x1 = [(ryx2 − ryx1·rx1x2) / (1 − r²x1x2)] · (sdy / sdx2)   (7)

It is also easy to extend the bivariate intercept to the multivariate case as follows:

a = Ȳ − b1x̄1 − b2x̄2   (8)
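A hedged Python sketch verifying equations (6)-(8): with two correlated predictors, the coefficients built from the pairwise correlations and standard deviations match an ordinary OLS fit (synthetic data, numpy/statsmodels are assumptions).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5*x1 + rng.normal(size=n)                 # correlated predictors
y = 2 + 1.5*x1 - 0.8*x2 + rng.normal(size=n)

r_yx1 = np.corrcoef(y, x1)[0, 1]
r_yx2 = np.corrcoef(y, x2)[0, 1]
r_x1x2 = np.corrcoef(x1, x2)[0, 1]
sdy, sd1, sd2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)

b1 = (r_yx1 - r_yx2*r_x1x2) / (1 - r_x1x2**2) * sdy / sd1    # eq. (6)
b2 = (r_yx2 - r_yx1*r_x1x2) / (1 - r_x1x2**2) * sdy / sd2    # eq. (7)
a = y.mean() - b1*x1.mean() - b2*x2.mean()                   # eq. (8)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print([a, b1, b2])
print(fit.params)                                # same values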

26
Significance Tests for the
Regression Coefficients
1. We find the significance of the
parameter estimates by using the
F or t test.

2. The R² is the proportion of variance explained.

3. Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − p − 1)

where n = sample size
p = number of parameters in the model

27
F and T tests for
significance for overall
model
F = Model variance / Error variance
  = (R²/p) / ((1 − R²)/(n − p − 1))

where
p = number of parameters
n = sample size

For the bivariate case, t = √F, where F = (n − 2)·r² / (1 − r²)
28
Significance tests

• If we are using a Type II sum of squares, we are dealing with the ballantine: DV variance explained = a + b.

29
Significance tests

T tests for statistical significance

 0
t
sea
b0
t
seb

30
Significance tests

Standard error of the intercept:

SEa = √[ (Σ(Yi − Ŷi)²/(n − 2)) · (Σxi² / (n·Σ(xi − x̄)²)) ]

Standard error of the regression coefficient:

SEb = σ̂ / √(Σ(xi − x̄)²)

where σ̂ = standard deviation of the residuals:

σ̂² = Σei² / (n − 2)
Programming Protocol
After invoking SPSS, proceed to File, Open, Data

32
Select a Data Set (we
choose employee.sav)
and click on open

33
We open the data set

34
To inspect the variable
formats, click on variable
view on the lower left

35
Because gender is a
string variable, we need to
recode gender into a
numeric format

36
We autorecode gender by
clicking on transform and
then autorecode

37
We select gender and
move it into the variable
box on the right

38
Give the variable a new
name and click on add
new name

39
Click on ok and the
numeric variable sex is
created

It has values 1 for female and 2 for male, and those value labels are inserted.

40
To invoke Regression
analysis,
Click on Analyze

41
Click on Regression
and then linear

42
Select the dependent
variable: Current Salary

43
Enter it in the
dependent variable box

44
Entering independent
variables
• These variables are entered in blocks. First, the potentially confounding covariates have to be entered.
• We enter time on job,
beginning salary, and previous
experience.

45
After entering the
covariates, we click on
next

46
We now enter the
hypotheses we wish to
test
• We are testing for minority or
sex differences in salary after
controlling for the time on job,
previous experience, and
beginning salary.
• We enter minority and numeric
gender (sex)

47
After entering these
variables, click on
statistics

48
We select the following
statistics from the dialog
box and click on continue

49
Click on plots to obtain
the plots dialog box

50
We click on OK to run
the regression analysis

51
Navigation window (left)
and output window (right)
This shows that SPSS is reading the variables
correctly

52
Variables Entered and
Model Summary

53
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis

54
Full Model
Coefficients

CurSal = −12036.3 + 1.83·BeginSal
  + 165.17·Jobtime − 23.64·Exper
  + 2882.84·gender − 1419.7·Minority

55
We omit insignificant variables and
rerun the analysis to obtain trimmed
model coefficients

CurSal = −12126.5 + 1.85·BeginSal
  + 163.20·Jobtime − 24.36·Exper
  + 2694.30·gender
56
Beta weights

• These are standardized regression coefficients, used to compare the relative contribution of each predictor to the explanation of the variance of the dependent variable within the model.

57
T tests and signif.

• These are the tests of significance for each parameter estimate.

• The significance level has to be less than .05 for the parameter to be statistically significant.

58
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper
specification of the model (no
omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors
(homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
59
Explanation of the
Assumptions
1. Linear functional form
   1. Does not detect curvilinear relationships
2. Independent observations
   1. Representative samples
   2. Autocorrelation inflates the t, r, and F statistics and warps the significance tests
3. Normality of the residuals
   1. Permits proper significance testing
4. Equality of variance
   1. Heteroskedasticity precludes generalization and external validity
   2. This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

60
Diagnostic Tests for the
Regression Assumptions
1. Linearity tests: regression curve fitting
   1. No level shifts: one regime
2. Independence of observations: runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's general specification test
5. No autocorrelation of residuals: Durbin-Watson test, or ACF or PACF of the residuals
6. Multicollinearity: correlation matrix of the independent variables; condition index or condition number
7. No serious outlier influence: tests of additive outliers: pulse dummies
   1. Plot residuals and look for high-leverage residuals
   2. Lists of standardized residuals
   3. Lists of studentized residuals
   4. Cook's distance or leverage statistics

61
Explanation of
Diagnostics
1. Plots show linearity or
nonlinearity of relationship
2. Correlation matrix shows
whether the independent
variables are collinear and
correlated.
3. A representative sample is obtained with probability sampling.

62
Explanation of
Diagnostics
Tests for normality of the residuals: the residuals are saved and then subjected to either of:
Kolmogorov-Smirnov test: compares the cumulative distribution of your residuals against the theoretical cumulative normal distribution.
Nonparametric Tests
1-sample K-S test
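As a hedged Python sketch of the same idea (synthetic data; statsmodels and scipy are assumptions, the slides use the SPSS nonparametric menu): the saved residuals are compared against a normal distribution with the one-sample K-S test.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 1 + X @ np.array([0.5, -0.3]) + rng.normal(size=200)

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
stat, pvalue = stats.kstest(resid, 'norm', args=(resid.mean(), resid.std(ddof=1)))
print(stat, pvalue)                              # large p-value: no evidence against normality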

63
Collinearity Diagnostics

Tolerance = 1 − R²  (R² from regressing each independent variable on the others)

Small tolerances imply problems.

Variance Inflation Factor (VIF) = 1 / Tolerance

Small intercorrelations among the independent variables mean VIF ≈ 1.
VIF > 10 signifies problems.

64
More Collinearity
Diagnostics
Condition number k = maximum eigenvalue / minimum eigenvalue (of X'X).
If condition numbers are between 100 and 1000, there is moderate to strong collinearity.

Condition index = √k, where k = condition number.
If the condition index > 30, there is strong collinearity.
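A hedged Python sketch of these collinearity diagnostics (synthetic, deliberately collinear data; numpy/statsmodels are assumptions). The condition number here is computed from the raw X'X matrix as a simplification; analysts often rescale the columns of X first.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9*x1 + 0.1*rng.normal(size=n)             # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for j in range(1, X.shape[1]):                   # skip the constant column
    vif = variance_inflation_factor(X, j)
    print(f"x{j}: VIF={vif:.1f}, tolerance={1/vif:.3f}")

eigvals = np.linalg.eigvalsh(X.T @ X)            # eigenvalues of X'X
cond_number = eigvals.max() / eigvals.min()
print("condition number:", cond_number, "condition index:", np.sqrt(cond_number))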

65
Outlier Diagnostics
1. Residuals.
1. The predicted value minus the actual
value. This is otherwise known as the
error.
2. Studentized Residuals
1. the residuals divided by their
standard errors without the ith
observation
3. Leverage, called the Hat diag
1. This is the measure of influence of
each observation
4. Cook’s Distance:
1. the change in the statistics that
results from deleting the observation.
Watch this if it is much greater than
1.0.

66
Outlier detection

• Outlier detection involves determining whether the residual (error = predicted − actual) is an extreme negative or positive value.
• After running the regression, we may plot the residuals against the fitted values to determine which errors are large.

67
Create Standardized
Residuals
• A standardized residual is one
divided by its standard deviation.

residstandardized = (ŷi − yi) / s

where s = std dev of the residuals
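A minimal Python sketch (synthetic data with one planted outlier; numpy/statsmodels are assumptions): the residuals are scaled by their standard deviation and flagged when they exceed the ±3.5 limit discussed on the next slide.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 2))
y = 2 + X @ np.array([1.0, -0.5]) + rng.normal(size=150)
y[10] += 12                                      # plant one outlier

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
z = resid / resid.std(ddof=1)                    # standardized residuals
print(np.where(np.abs(z) > 3.5)[0])              # indices of flagged cases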

68
Limits of Standardized
Residuals
If the standardized residuals have values beyond ±3.5, they are outliers. If the absolute values are less than 3.5, as these are, then there are no outliers.
While outliers by themselves
only distort mean prediction
when the sample size is small
enough, it is important to
gauge the influence of outliers.
69
Outlier Influence

• Suppose we had a different data set with two outliers.
• We tabulate the standardized residuals and obtain the following output:

70
Outlier a does not distort
and outlier b does.

71
Studentized Residuals

• Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n − p − 1, though they are not quite independent. Therefore, we can approximately determine whether or not they are statistically significant.
• Belsley et al. (1980) recommended the use of studentized residuals.

72
Studentized Residual

ei(studentized) = ei / √( s²(i)·(1 − hi) )

where
s(i) = standard deviation of the residuals with the ith observation deleted
hi = leverage statistic

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator can be formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier.
In Stata, the command to generate studentized residuals, here named rstudt, is:
predict rstudt, rstudent
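As an alternative sketch in Python (synthetic data with a planted outlier; statsmodels is an assumption), the externally studentized residuals, each residual scaled by s(i)·√(1 − hi), can be obtained directly from the fitted model's influence object.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 2))
y = 1 + X @ np.array([0.6, 0.4]) + rng.normal(size=100)
y[5] += 8                                        # plant an outlier

fit = sm.OLS(y, sm.add_constant(X)).fit()
rstudent = fit.get_influence().resid_studentized_external
print(np.argsort(-np.abs(rstudent))[:3])         # the three most extreme cases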
73
Influence of Outliers

1. Leverage is measured by the diagonal components of the hat matrix.
2. The hat matrix comes from the formula for the regression prediction of Y:

Ŷ = Xβ = X(X'X)⁻¹X'Y

where X(X'X)⁻¹X' = the hat matrix, H. Therefore,

Ŷ = HY
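A minimal Python sketch of the hat matrix and its diagonal (synthetic data; numpy/statsmodels are assumptions), including the 2p/n rule of thumb discussed on the next slide.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 80
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
leverage = np.diag(H)
p = X.shape[1]                                   # number of parameters, = trace(H)
print(np.trace(H), p)
print(np.where(leverage > 2*p/n)[0])             # high-leverage cases (Belsley et al. cutoff)

# statsmodels gives the same diagonal directly:
print(np.allclose(leverage, sm.OLS(y, X).fit().get_influence().hat_matrix_diag))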

74
Leverage and the Hat
matrix
1. The hat matrix transforms Y into the
predicted scores.
2. The diagonals of the hat matrix indicate
which values will be outliers or not.
3. The diagonals are therefore measures of
leverage.
4. Leverage is bounded by two limits: 1/n and
1. The closer the leverage is to unity, the
more leverage the value has.
5. The trace of the hat matrix = the number of parameters (p) in the model.
6. When the leverage > 2p/n, there is high leverage, according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.

75
Cook’s D

1. Another measure of influence.
2. This is a popular one. The formula for it is:

Cook's Di = (1/p) · (hi/(1 − hi)) · (ei² / (s²(1 − hi)))

Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n − p) are large.
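A hedged Python sketch (synthetic data with one planted influential point; numpy/statsmodels are assumptions): Cook's D computed from the formula above matches the library's values, and the 4/n screening cutoff mentioned on the next slide flags the suspect cases.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 120
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)
y[3] += 10                                       # plant an influential point

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
h = infl.hat_matrix_diag
e = fit.resid
p = X.shape[1]
s2 = np.sum(e**2) / (n - p)
cooks = (1/p) * (h / (1 - h)) * (e**2 / (s2 * (1 - h)))
print(np.allclose(cooks, infl.cooks_distance[0]))
print(np.where(cooks > 4/n)[0])                  # cases worth inspecting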

76
Using Cook’s D in
SPSS
• Cook is the option /R
• Finding the influential outliers
• List cook, if cook > 4/n
• Belsley suggests 4/(n-k-1) as a cutoff

77
DFbeta

• One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.

DFbetaj = bj − b(i)j

i.e., the change in the estimate of bj when the ith observation is deleted; in standardized form this difference is scaled using uj, the residuals of the regression of xj on the remaining x's, and the leverage.
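A hedged Python sketch (synthetic data; statsmodels is an assumption): statsmodels reports the standardized DFBETAS of Belsley et al., the change in each coefficient from deleting case i scaled by its standard error, which serves the same diagnostic purpose.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
X = sm.add_constant(rng.normal(size=(90, 2)))
y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=90)

fit = sm.OLS(y, X).fit()
dfbetas = fit.get_influence().dfbetas            # shape (n, p): one row per deleted case
print(np.where(np.abs(dfbetas) > 2/np.sqrt(len(y))))   # a common size-adjusted cutoff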
78
Programming Diagnostic
Tests
Testing homoskedasticity
Select histogram, normal probability plot,
and insert *zresid in Y
and *zpred in X

Then click on continue

79
Click on Save to obtain
the Save dialog box

80
We select the following

Then we click on Continue, go back to the main Regression menu, and click on OK
81
Check for linear
Functional Form
• Run a matrix plot of the
dependent variable against
each independent variable to
be sure that the relationship is
linear.

82
Move the variables to be graphed
into the box on the upper right, and
click on OK

83
Residual
Autocorrelation check

The Durbin-Watson d statistic tests for first-order autocorrelation of the residuals:

d = Σt=2..n (et − et−1)² / Σt=1..n et²

See the significance tables for this statistic.
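A minimal Python sketch (white-noise "residuals" generated for illustration; numpy/statsmodels are assumptions): the formula above and the library function give the same d, and values near 2 indicate no first-order autocorrelation.

import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(14)
e = rng.normal(size=200)                         # white-noise residuals for illustration

d_manual = np.sum(np.diff(e)**2) / np.sum(e**2)
print(d_manual, durbin_watson(e))                # identical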

84
Run the autocorrelation function from
the Trends Module for a better analysis

85
Testing for Homogeneity of variance
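The slides check this visually with the plot of *ZRESID against *ZPRED. As a numerical complement, here is a hedged Python sketch of White's general specification test (mentioned in the diagnostics list) and the Breusch-Pagan test; the synthetic heteroskedastic data and the use of statsmodels are assumptions for illustration. Small p-values suggest heteroskedasticity.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

rng = np.random.default_rng(15)
n = 300
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5*x + rng.normal(scale=0.3*x, size=n)  # error variance grows with x

exog = sm.add_constant(x)
resid = sm.OLS(y, exog).fit().resid
lm_stat, lm_p, f_stat, f_p = het_white(resid, exog)
print("White test p-value:", lm_p)
bp = het_breuschpagan(resid, exog)
print("Breusch-Pagan p-value:", bp[1])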

86
Normality of residuals can be visually
inspected from the histogram with the
superimposed normal curve.
Here we check the skewness for
symmetry and the kurtosis for
peakedness

87
Kolmogorov Smirnov Test: An
objective test of normality

88
89
90
Multicollinearity test with the
correlation matrix

91
92
93
Alternatives to Violations
of Assumptions
• 1. Nonlinearity: Transform to linearity if there is nonlinearity, or run a nonlinear regression.
• 2. Nonnormality: Run a least absolute deviations regression or a median regression (available in other packages), or generalized linear models (S-PLUS glm, Stata glm, or SAS PROC MODEL or PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares regression (SPSS) or the White estimator (SAS, Stata, S-PLUS). One can use a robust regression procedure (SAS, Stata, or S-PLUS) to downweight the effect of outliers in the estimation.
• 4. Autocorrelation: Run AREG in the SPSS Trends module, or either the Prais or Newey-West procedure in Stata.
• 5. Multicollinearity: principal components regression, ridge regression, or proxy variables; 2SLS in SPSS, ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.

94
Model Building
Strategies
• Specific to General: Cohen
and Cohen
• General to Specific: Hendry
and Richard
• Extreme Bounds analysis: E.
Leamer.

95
Nonparametric
Alternatives
1. If there is nonlinearity, transform
to linearity first.
2. If there is heteroskedasticity, use
robust standard errors with
STATA or SAS or SPLUS.
3. If there is non-normality, use
quantile regression with
bootstrapped standard errors in
STATA or SPLUS.
4. If there is autocorrelation of the residuals, use Newey-West standard errors or a first-order autocorrelation correction with AREG. If there is higher-order autocorrelation, use Box-Jenkins ARIMA modeling.

96
