
Regression Analysis

with SPSS
Robert A. Yaffee, Ph.D.
Statistics, Mapping and Social Science Group
Academic Computing Services
Information Technology Services
New York University
Office: 75 Third Ave Level C3
Tel: 212.998.3402
E-mail: yaffee@nyu.edu
February 04

1
Outline
1. Conceptualization
2. Schematic Diagrams of Linear Regression
processes
3. Using SPSS, we plot and test relationships for
linearity
4. Nonlinear relationships are transformed to
linear ones
5. General Linear Model
6. Derivation of Sums of Squares and ANOVA
Derivation of intercept and regression
coefficients
7. The Prediction Interval and its derivation
8. Model Assumptions
1. Explanation
2. Testing
3. Assessment
9. Alternatives when assumptions are unfulfilled

2
Conceptualization of
Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition
of effects

3
Hypothesis Testing

• For example, hypothesis 1: X is statistically significantly related to Y.
– The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
– The magnitude of the relationship is small, medium, or large. If the magnitude is small, then a unit change in X is associated with a small change in Y.

4
Regression Analysis
Have a clear notion of what you can and
cannot do with regression analysis

• Conceptualization
– A Path Model of a Regression
Analysis

Path Diagram of a Linear Regression Analysis

[Path diagram: X1, X2, and X3 each point to Y; an error term also points to Y]

Yi = k + b1X1 + b2X2 + b3X3 + ei
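The path diagram above corresponds to an ordinary multiple regression of Y on X1-X3. As a minimal sketch of how such a model can be fit and the coefficients k, b1-b3 recovered, here is a Python example; the synthetic data, the use of numpy/statsmodels, and all numeric values are illustrative assumptions, not part of the slides (which use SPSS).

# Minimal sketch of the model in the path diagram: Y = k + b1*X1 + b2*X2 + b3*X3 + e
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                      # X1, X2, X3
y = 5 + 1.5*X[:, 0] - 2.0*X[:, 1] + 0.5*X[:, 2] + rng.normal(scale=2, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()      # add_constant supplies the intercept k
print(model.params)                              # [k, b1, b2, b3]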

5
A Path Analysis
Decomposition of Effects into Direct,
Indirect, Spurious, and Total Effects

[Path diagram: X1 and X2 are exogenous; Y1, Y2, and Y3 are endogenous, each with its own error term; the arrows are labeled A through F]

Direct effects: paths C, E, F
Indirect effects: paths AC, BE, DF
Total effects: the sum of direct and indirect effects
Spurious effects are due to common (antecedent) causes

In a path analysis, Yi is endogenous: it is the outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total effects = direct + indirect effects


Interaction Analysis

[Diagram: X1, X2, and their product X1*X2 each point to Y; the interaction path is labeled C]

Y = K + AX1 + BX2 + CX1*X2

Interaction coefficient: C
X1 and X2 must be in the model for the interaction to be properly specified (see the sketch below).
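A minimal Python sketch of this specification (synthetic data and statsmodels are illustrative assumptions): the product term X1*X2 is entered alongside both main effects, and its coefficient estimates C.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 0.8*x1 - 1.2*x2 + 0.5*x1*x2 + rng.normal(size=n)

X = np.column_stack([x1, x2, x1*x2])             # both main effects plus the product term
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                                # [K, A, B, C]; the last entry estimates C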
7
A Precursor to Modeling
with Regression
• Data Exploration: Run a
scatterplot matrix and search
for linear relationships with the
dependent variable.

8
Click on graphs and
then on scatter

9
When the scatterplot
dialog box appears, select
Matrix

10
A Matrix of Scatterplots
will appear

Search for distinct linear relationships

11
12
13
Decomposition of the
Sums of Squares

14
Graphical Decomposition
of Effects

[Figure: Decomposition of Effects; for one observation, the vertical distances to the fitted line and to the mean of Y are shown against X]

ŷ = a + bx
yi − ŷi = error
yi − ȳ = total effect
ŷi − ȳ = regression effect

15
Decomposition of the
sum of squares

Y − Ȳ = (Y − Ŷ) + (Ŷ − Ȳ)
total effect = error effect + regression (model) effect

Yi − Ȳ = (Yi − Ŷi) + (Ŷi − Ȳ)   per case i

Squaring and summing over all cases (the cross-product term sums to zero) gives, for the data set:

Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)²   (sums over i = 1, ..., n)

16
Decomposition of the sum
of squares
• Total SS = model SS + error SS
and if we divide by df

Σ(Yi − Ȳ)² / (n − 1)        [total variance]
Σ(Yi − Ŷi)² / (n − k − 1)   [error variance]
Σ(Ŷi − Ȳ)² / k              [model variance]

• This yields the variance decomposition: total variance = model variance + error variance.
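The decomposition can be checked numerically. Below is a minimal Python sketch (synthetic data and numpy are assumptions for illustration; the slides do this inside SPSS) showing that the model and error sums of squares add up to the total sum of squares.

import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 3 + 2*x + rng.normal(size=n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b*x

ss_total = np.sum((y - y.mean())**2)
ss_model = np.sum((y_hat - y.mean())**2)
ss_error = np.sum((y - y_hat)**2)
print(ss_total, ss_model + ss_error)             # equal up to rounding error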

17
F test for significance and
R2 for magnitude of effect
• R2 = Model var/total var
R² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

• F test for model significance = Model Var / Error Var

F(k, n−k−1) = (R²/k) / ((1 − R²)/(n − k − 1))
18
ANOVA tests the
significance of the
Regression Model

19
The Multiple
Regression Equation
• We proceed to the derivation of its
components:
– The intercept: a
– The regression parameters, b1 and b2

Yi = a + b1X1 + b2X2 + ei

20
Derivation of the Intercept
y = a + bx + e
e = y − a − bx

Summing over i = 1, ..., n:

Σei = Σyi − Σa − bΣxi

Because by definition Σei = 0:

0 = Σyi − Σa − bΣxi
Σa = Σyi − bΣxi
na = Σyi − bΣxi

a = ȳ − b·x̄
21
Derivation of the
Regression Coefficient
Given: yi = a + bxi + ei
ei = yi − a − bxi

Σei = Σ(yi − a − bxi)
Σei² = Σ(yi − a − bxi)²

∂(Σei²)/∂b = −2Σxi·yi + 2bΣxi·xi

Setting this derivative to zero:

0 = −2Σxi·yi + 2bΣxi·xi

b = Σxi·yi / Σxi²

(with x and y expressed as deviations from their means, as defined on the next slide)
• If we recall that the formula for
the correlation coefficient can
be expressed as follows:

23
r = Σ(x·y) / √(Σx² · Σy²)

where
x = xi − x̄
y = yi − ȳ

b = Σ(x·y) / Σx²

from which it can be seen that the regression coefficient b is a function of r:

b = r · (sdy / sdx)
Extending the bivariate case
To the Multiple linear regression case

25
βyx1.x2 = [(ryx1 − ryx2·rx1x2) / (1 − r²x1x2)] · (sdy / sdx1)   (6)

βyx2.x1 = [(ryx2 − ryx1·rx1x2) / (1 − r²x1x2)] · (sdy / sdx2)   (7)

It is also easy to extend the bivariate intercept to the multivariate case as follows:

a = Ȳ − b1x̄1 − b2x̄2   (8)
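A hedged Python sketch verifying equations (6)-(8): with two correlated predictors, the coefficients built from the pairwise correlations and standard deviations match an ordinary OLS fit (synthetic data, numpy/statsmodels are assumptions).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5*x1 + rng.normal(size=n)                 # correlated predictors
y = 2 + 1.5*x1 - 0.8*x2 + rng.normal(size=n)

r_yx1 = np.corrcoef(y, x1)[0, 1]
r_yx2 = np.corrcoef(y, x2)[0, 1]
r_x1x2 = np.corrcoef(x1, x2)[0, 1]
sdy, sd1, sd2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)

b1 = (r_yx1 - r_yx2*r_x1x2) / (1 - r_x1x2**2) * sdy / sd1    # eq. (6)
b2 = (r_yx2 - r_yx1*r_x1x2) / (1 - r_x1x2**2) * sdy / sd2    # eq. (7)
a = y.mean() - b1*x1.mean() - b2*x2.mean()                   # eq. (8)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print([a, b1, b2])
print(fit.params)                                # same values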

26
Significance Tests for the
Regression Coefficients
1. We find the significance of the
parameter estimates by using the
F or t test.

2. The R² is the proportion of variance explained.

3. Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − p − 1)

where n = sample size
p = number of parameters in the model

27
F and T tests for
significance for overall
model
F = Model variance / Error variance
  = (R²/p) / ((1 − R²)/(n − p − 1))

where
p = number of parameters
n = sample size

For the bivariate case, t = √F, where F = (n − 2)·r² / (1 − r²)
28
Significance tests

• If we are using a Type II sum of squares, we are dealing with the ballantine: DV variance explained = a + b.

29
Significance tests

T tests for statistical significance

 0
t
sea
b0
t
seb

30
Significance tests

Standard error of the intercept:

SEa = √[ (Σ(Yi − Ŷi)²/(n − 2)) · (Σxi² / (n·Σ(xi − x̄)²)) ]

Standard error of the regression coefficient:

SEb = σ̂ / √(Σ(xi − x̄)²)

where σ̂ = standard deviation of the residuals:

σ̂² = Σei² / (n − 2)
Programming Protocol
After invoking SPSS, proceed to File, Open, Data

32
Select a Data Set (we
choose employee.sav)
and click on open

33
We open the data set

34
To inspect the variable
formats, click on variable
view on the lower left

35
Because gender is a
string variable, we need to
recode gender into a
numeric format

36
We autorecode gender by
clicking on transform and
then autorecode

37
We select gender and
move it into the variable
box on the right

38
Give the variable a new
name and click on add
new name

39
Click on ok and the
numeric variable sex is
created

It has values 1 for female and 2 for male, and those value labels are inserted.

40
To invoke Regression
analysis,
Click on Analyze

41
Click on Regression
and then linear

42
Select the dependent
variable: Current Salary

43
Enter it in the
dependent variable box

44
Entering independent
variables
• These variables are entered in blocks. First, the potentially confounding covariates have to be entered.
• We enter time on job,
beginning salary, and previous
experience.

45
After entering the
covariates, we click on
next

46
We now enter the
hypotheses we wish to
test
• We are testing for minority or
sex differences in salary after
controlling for the time on job,
previous experience, and
beginning salary.
• We enter minority and numeric
gender (sex)

47
After entering these
variables, click on
statistics

48
We select the following
statistics from the dialog
box and click on continue

49
Click on plots to obtain
the plots dialog box

50
We click on OK to run
the regression analysis

51
Navigation window (left)
and output window (right)
This shows that SPSS is reading the variables
correctly

52
Variables Entered and
Model Summary

53
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis

54
Full Model
Coefficients

CurSal = −12036.3 + 1.83·BeginSal
  + 165.17·Jobtime − 23.64·Exper
  + 2882.84·gender − 1419.7·Minority

55
We omit insignificant variables and
rerun the analysis to obtain trimmed
model coefficients

CurSal = −12126.5 + 1.85·BeginSal
  + 163.20·Jobtime − 24.36·Exper
  + 2694.30·gender
56
Beta weights

• These are standardized regression coefficients, used to compare the relative contribution of each predictor to the explanation of the variance of the dependent variable within the model.

57
T tests and signif.

• These are the tests of significance for each parameter estimate.

• The significance level has to be less than .05 for the parameter to be statistically significant.

58
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper
specification of the model (no
omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors
(homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
59
Explanation of the
Assumptions
1. Linear functional form
   1. Does not detect curvilinear relationships
2. Independent observations
   1. Representative samples
   2. Autocorrelation inflates the t, r, and F statistics and warps the significance tests
3. Normality of the residuals
   1. Permits proper significance testing
4. Equality of variance
   1. Heteroskedasticity precludes generalization and external validity
   2. This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

60
Diagnostic Tests for the
Regression Assumptions
1. Linearity tests: regression curve fitting
   1. No level shifts: one regime
2. Independence of observations: runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's general specification test
5. No autocorrelation of residuals: Durbin-Watson test, or ACF or PACF of the residuals
6. Multicollinearity: correlation matrix of the independent variables; condition index or condition number
7. No serious outlier influence: tests of additive outliers: pulse dummies
   1. Plot residuals and look for high-leverage residuals
   2. Lists of standardized residuals
   3. Lists of studentized residuals
   4. Cook's distance or leverage statistics

61
Explanation of
Diagnostics
1. Plots show linearity or
nonlinearity of relationship
2. Correlation matrix shows
whether the independent
variables are collinear and
correlated.
3. A representative sample is obtained with probability sampling.

62
Explanation of
Diagnostics
Tests for normality of the residuals: the residuals are saved and then subjected to either of:
Kolmogorov-Smirnov test: compares the cumulative distribution of your residuals against the theoretical cumulative normal distribution.
Nonparametric Tests
1-sample K-S test
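As a hedged Python sketch of the same idea (synthetic data; statsmodels and scipy are assumptions, the slides use the SPSS nonparametric menu): the saved residuals are compared against a normal distribution with the one-sample K-S test.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 1 + X @ np.array([0.5, -0.3]) + rng.normal(size=200)

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
stat, pvalue = stats.kstest(resid, 'norm', args=(resid.mean(), resid.std(ddof=1)))
print(stat, pvalue)                              # large p-value: no evidence against normality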

63
Collinearity Diagnostics

Tolerance = 1 − R²  (R² from regressing each independent variable on the others)

Small tolerances imply problems.

Variance Inflation Factor (VIF) = 1 / Tolerance

Small intercorrelations among the independent variables mean VIF ≈ 1.
VIF > 10 signifies problems.

64
More Collinearity
Diagnostics
Condition number k = maximum eigenvalue / minimum eigenvalue (of X'X).
If condition numbers are between 100 and 1000, there is moderate to strong collinearity.

Condition index = √k, where k = condition number.
If the condition index > 30, there is strong collinearity.
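A hedged Python sketch of these collinearity diagnostics (synthetic, deliberately collinear data; numpy/statsmodels are assumptions). The condition number here is computed from the raw X'X matrix as a simplification; analysts often rescale the columns of X first.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9*x1 + 0.1*rng.normal(size=n)             # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for j in range(1, X.shape[1]):                   # skip the constant column
    vif = variance_inflation_factor(X, j)
    print(f"x{j}: VIF={vif:.1f}, tolerance={1/vif:.3f}")

eigvals = np.linalg.eigvalsh(X.T @ X)            # eigenvalues of X'X
cond_number = eigvals.max() / eigvals.min()
print("condition number:", cond_number, "condition index:", np.sqrt(cond_number))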

65
Outlier Diagnostics
1. Residuals.
1. The predicted value minus the actual
value. This is otherwise known as the
error.
2. Studentized Residuals
1. the residuals divided by their
standard errors without the ith
observation
3. Leverage, called the Hat diag
1. This is the measure of influence of
each observation
4. Cook’s Distance:
1. the change in the statistics that
results from deleting the observation.
Watch this if it is much greater than
1.0.

66
Outlier detection

• Outlier detection involves determining whether the residual (error = predicted − actual) is an extreme negative or positive value.
• After running the regression, we may plot the residuals against the fitted values to determine which errors are large.

67
Create Standardized
Residuals
• A standardized residual is one
divided by its standard deviation.

residstandardized = (ŷi − yi) / s

where s = std dev of the residuals
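A minimal Python sketch (synthetic data with one planted outlier; numpy/statsmodels are assumptions): the residuals are scaled by their standard deviation and flagged when they exceed the ±3.5 limit discussed on the next slide.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 2))
y = 2 + X @ np.array([1.0, -0.5]) + rng.normal(size=150)
y[10] += 12                                      # plant one outlier

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
z = resid / resid.std(ddof=1)                    # standardized residuals
print(np.where(np.abs(z) > 3.5)[0])              # indices of flagged cases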

68
Limits of Standardized
Residuals
If the standardized residuals have values beyond ±3.5, they are outliers. If the absolute values are less than 3.5, as these are, then there are no outliers.
While outliers by themselves
only distort mean prediction
when the sample size is small
enough, it is important to
gauge the influence of outliers.
69
Outlier Influence

• Suppose we had a different data set with two outliers.
• We tabulate the standardized residuals and obtain the following output:

70
Outlier a does not distort
and outlier b does.

71
Studentized Residuals

• Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n − p − 1, though they are not quite independent. Therefore, we can approximately determine whether or not they are statistically significant.
• Belsley et al. (1980) recommended the use of studentized residuals.

72
Studentized Residual

ei(studentized) = ei / √( s²(i)·(1 − hi) )

where
s(i) = standard deviation of the residuals with the ith observation deleted
hi = leverage statistic

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator can be formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier.
In Stata, the command to generate studentized residuals, here named rstudt, is:
predict rstudt, rstudent
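As an alternative sketch in Python (synthetic data with a planted outlier; statsmodels is an assumption), the externally studentized residuals, each residual scaled by s(i)·√(1 − hi), can be obtained directly from the fitted model's influence object.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 2))
y = 1 + X @ np.array([0.6, 0.4]) + rng.normal(size=100)
y[5] += 8                                        # plant an outlier

fit = sm.OLS(y, sm.add_constant(X)).fit()
rstudent = fit.get_influence().resid_studentized_external
print(np.argsort(-np.abs(rstudent))[:3])         # the three most extreme cases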
73
Influence of Outliers

1. Leverage is measured by the diagonal components of the hat matrix.
2. The hat matrix comes from the formula for the regression prediction of Y:

Ŷ = Xβ = X(X'X)⁻¹X'Y

where X(X'X)⁻¹X' = the hat matrix, H. Therefore,

Ŷ = HY
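A minimal Python sketch of the hat matrix and its diagonal (synthetic data; numpy/statsmodels are assumptions), including the 2p/n rule of thumb discussed on the next slide.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 80
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
leverage = np.diag(H)
p = X.shape[1]                                   # number of parameters, = trace(H)
print(np.trace(H), p)
print(np.where(leverage > 2*p/n)[0])             # high-leverage cases (Belsley et al. cutoff)

# statsmodels gives the same diagonal directly:
print(np.allclose(leverage, sm.OLS(y, X).fit().get_influence().hat_matrix_diag))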

74
Leverage and the Hat
matrix
1. The hat matrix transforms Y into the
predicted scores.
2. The diagonals of the hat matrix indicate
which values will be outliers or not.
3. The diagonals are therefore measures of
leverage.
4. Leverage is bounded by two limits: 1/n and
1. The closer the leverage is to unity, the
more leverage the value has.
5. The trace of the hat matrix = the number of parameters (p) in the model.
6. When the leverage > 2p/n, there is high leverage, according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.

75
Cook’s D

1. Another measure of influence.
2. This is a popular one. The formula for it is:

Cook's Di = (1/p) · (hi/(1 − hi)) · (ei² / (s²(1 − hi)))

Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n − p) are large.
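A hedged Python sketch (synthetic data with one planted influential point; numpy/statsmodels are assumptions): Cook's D computed from the formula above matches the library's values, and the 4/n screening cutoff mentioned on the next slide flags the suspect cases.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 120
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)
y[3] += 10                                       # plant an influential point

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
h = infl.hat_matrix_diag
e = fit.resid
p = X.shape[1]
s2 = np.sum(e**2) / (n - p)
cooks = (1/p) * (h / (1 - h)) * (e**2 / (s2 * (1 - h)))
print(np.allclose(cooks, infl.cooks_distance[0]))
print(np.where(cooks > 4/n)[0])                  # cases worth inspecting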

76
Using Cook’s D in
SPSS
• Cook is the option /R
• Finding the influential outliers
• List cook, if cook > 4/n
• Belsley suggests 4/(n-k-1) as a cutoff

77
DFbeta

• One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.

DFbetaj = bj − b(i)j

i.e., the change in the estimate of bj when the ith observation is deleted; in standardized form this difference is scaled using uj, the residuals of the regression of xj on the remaining x's, and the leverage.
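A hedged Python sketch (synthetic data; statsmodels is an assumption): statsmodels reports the standardized DFBETAS of Belsley et al., the change in each coefficient from deleting case i scaled by its standard error, which serves the same diagnostic purpose.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
X = sm.add_constant(rng.normal(size=(90, 2)))
y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=90)

fit = sm.OLS(y, X).fit()
dfbetas = fit.get_influence().dfbetas            # shape (n, p): one row per deleted case
print(np.where(np.abs(dfbetas) > 2/np.sqrt(len(y))))   # a common size-adjusted cutoff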
78
Programming Diagnostic
Tests
Testing homoskedasticity
Select histogram, normal probability plot,
and insert *zresid in Y
and *zpred in X

Then click on continue

79
Click on Save to obtain
the Save dialog box

80
We select the following

Then we click on Continue, go back to the main Regression menu, and click on OK
81
Check for linear
Functional Form
• Run a matrix plot of the
dependent variable against
each independent variable to
be sure that the relationship is
linear.

82
Move the variables to be graphed
into the box on the upper right, and
click on OK

83
Residual
Autocorrelation check

The Durbin-Watson d statistic tests for first-order autocorrelation of the residuals:

d = Σt=2..n (et − et−1)² / Σt=1..n et²

See the significance tables for this statistic.
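A minimal Python sketch (white-noise "residuals" generated for illustration; numpy/statsmodels are assumptions): the formula above and the library function give the same d, and values near 2 indicate no first-order autocorrelation.

import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(14)
e = rng.normal(size=200)                         # white-noise residuals for illustration

d_manual = np.sum(np.diff(e)**2) / np.sum(e**2)
print(d_manual, durbin_watson(e))                # identical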

84
Run the autocorrelation function from
the Trends Module for a better analysis

85
Testing for Homogeneity of variance
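The slides check this visually with the plot of *ZRESID against *ZPRED. As a numerical complement, here is a hedged Python sketch of White's general specification test (mentioned in the diagnostics list) and the Breusch-Pagan test; the synthetic heteroskedastic data and the use of statsmodels are assumptions for illustration. Small p-values suggest heteroskedasticity.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

rng = np.random.default_rng(15)
n = 300
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5*x + rng.normal(scale=0.3*x, size=n)  # error variance grows with x

exog = sm.add_constant(x)
resid = sm.OLS(y, exog).fit().resid
lm_stat, lm_p, f_stat, f_p = het_white(resid, exog)
print("White test p-value:", lm_p)
bp = het_breuschpagan(resid, exog)
print("Breusch-Pagan p-value:", bp[1])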

86
Normality of residuals can be visually
inspected from the histogram with the
superimposed normal curve.
Here we check the skewness for
symmetry and the kurtosis for
peakedness

87
Kolmogorov Smirnov Test: An
objective test of normality

88
89
90
Multicollinearity test with the
correlation matrix

91
92
93
Alternatives to Violations
of Assumptions
• 1. Nonlinearity: Transform to linearity if there is nonlinearity, or run a nonlinear regression.
• 2. Nonnormality: Run a least absolute deviations regression or a median regression (available in other packages), or generalized linear models (S-PLUS glm, Stata glm, or SAS PROC MODEL or PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares regression (SPSS) or the White estimator (SAS, Stata, S-PLUS). One can use a robust regression procedure (SAS, Stata, or S-PLUS) to downweight the effect of outliers in the estimation.
• 4. Autocorrelation: Run AREG in the SPSS Trends module, or either the Prais or Newey-West procedure in Stata.
• 5. Multicollinearity: principal components regression, ridge regression, or proxy variables; 2SLS in SPSS, ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.

94
Model Building
Strategies
• Specific to General: Cohen
and Cohen
• General to Specific: Hendry
and Richard
• Extreme Bounds analysis: E.
Leamer.

95
Nonparametric
Alternatives
1. If there is nonlinearity, transform
to linearity first.
2. If there is heteroskedasticity, use
robust standard errors with
STATA or SAS or SPLUS.
3. If there is non-normality, use
quantile regression with
bootstrapped standard errors in
STATA or SPLUS.
4. If there is autocorrelation of the residuals, use Newey-West standard errors or a first-order autocorrelation correction with AREG. If there is higher-order autocorrelation, use Box-Jenkins ARIMA modeling.

96
