Regression Analysis with SPSS
Robert A. Yaffee, Ph.D.
Statistics, Mapping and Social Science Group
Academic Computing Services
Information Technology Services
New York University
Office: 75 Third Ave Level C3
Tel: 212.998.3402
E-mail: yaffee@nyu.edu
February 2004
Outline
1. Conceptualization
2. Schematic diagrams of linear regression processes
3. Using SPSS, we plot and test relationships for linearity
4. Nonlinear relationships are transformed to linear ones
5. General Linear Model
6. Derivation of sums of squares and ANOVA; derivation of the intercept and regression coefficients
7. The prediction interval and its derivation
8. Model assumptions: explanation, testing, assessment
9. Alternatives when assumptions are unfulfilled
Conceptualization of
Regression Analysis
• Hypothesis testing
• Path analytical decomposition of effects
Hypothesis Testing
Regression Analysis
Have a clear notion of what you can and
cannot do with regression analysis
• Conceptualization
– A Path Model of a Regression
Analysis
[Path diagram: predictors X1, X2, and X3, together with an error term, pointing to Y]

$Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i$
A Path Analysis
Decomposition of Effects into Direct,
Indirect, Spurious, and Total Effects
[Path diagram: exogenous variables X1 and X2 linked by paths A through F to endogenous variables Y1, Y2, and Y3, each endogenous variable with its own error term]
[Path diagram: X1 and X2, with an interaction path C leading to Y]

Interaction coefficient: C. X1 and X2 must be in the model for the interaction to be properly specified.
A Precursor to Modeling
with Regression
• Data Exploration: Run a
scatterplot matrix and search
for linear relationships with the
dependent variable.
Click on Graphs and then on Scatter
When the scatterplot
dialog box appears, select
Matrix
A Matrix of Scatterplots
will appear
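For readers who prefer syntax to the menus, a minimal sketch of the same scatterplot matrix; the variable names are assumptions based on employee.sav:

    GRAPH
      /SCATTERPLOT(MATRIX)=salary salbegin jobtime prevexp.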
Decomposition of the
Sums of Squares
Graphical Decomposition
of Effects
[Figure: scatterplot of Y against X with fitted regression line $\hat{y} = a + bx$, decomposing, for one case, the deviations $y_i - \hat{y}_i$ (error), $y_i - \bar{y}$ (total effect), and $\hat{y}_i - \bar{y}$ (regression effect)]
Decomposition of the
sum of squares
$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$ per case $i$

total effect = regression (model) effect + error effect

Squaring and summing over the whole data set (the cross-product term sums to zero):

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
Decomposition of the sum
of squares
• Total SS = model SS + error SS, and if we divide each sum of squares by its degrees of freedom we obtain the mean squares:

$\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} \qquad \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-k-1} \qquad \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{k}$

(total, error, and model mean squares, respectively, where $k$ is the number of predictors)
F test for significance and
R2 for magnitude of effect
• $R^2$ = model variance / total variance
• The overall F test compares the model mean square with the error mean square:

$F_{k,\,n-k-1} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \,/\, k}{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \,/\, (n-k-1)}$
The Multiple
Regression Equation
• We proceed to the derivation of its
components:
– The intercept: a
– The regression parameters, b1 and b2
$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$
Derivation of the Intercept
$y = a + bx + e$

$e = y - a - bx$

$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i$

Because by definition $\sum_{i=1}^{n} e_i = 0$,

$0 = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i$

$na = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$

$a = \bar{y} - b\bar{x}$
Derivation of the
Regression Coefficient
Given: $y_i = a + b x_i + e_i$

$e_i = y_i - a - b x_i$

$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a - b x_i)$

$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$

Differentiating with respect to $b$ (with $x$ and $y$ in deviation form) and setting the result to zero:

$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i^2$

$0 = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i^2$

$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$
• If we recall that the formula for
the correlation coefficient can
be expressed as follows:
$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$

where $x = x_i - \bar{x}$ and $y = y_i - \bar{y}$ are deviation scores, then combining it with

$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$

shows that

$b_j = r \cdot \frac{sd_y}{sd_x}$
Extending the Bivariate Case to the Multiple Linear Regression Case
$b_{yx_1.x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$

$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$
Significance Tests for the
Regression Coefficients
We find the significance of the parameter estimates by using the F or t test.
F and t Tests for Significance of the Overall Model
$F = \frac{\text{model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$

where $p$ = number of parameters and $n$ = sample size. For the bivariate model,

$t = \sqrt{F} = \sqrt{\frac{(n-2)\, r^2}{1 - r^2}}$
Significance tests
Each coefficient is tested against zero by dividing it by its standard error:

$t = \frac{a}{se_a} \qquad t = \frac{b}{se_b}$
Significance tests
$SE_a = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n-2}} \cdot \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

$SE_b = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} x_i^2}} \qquad \text{where } \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$

($x$ in deviation form in the formula for $SE_b$)
Programming Protocol
After invoking SPSS, proceed to File, Open, Data
Select a data set (we choose employee.sav) and click on Open
We open the data set
To inspect the variable formats, click on Variable View on the lower left
Because gender is a
string variable, we need to
recode gender into a
numeric format
We autorecode gender by clicking on Transform and then Autorecode
We select gender and
move it into the variable
box on the right
Give the variable a new name and click on Add New Name
Click on OK and the numeric variable sex is created.
It has values 1 for female and 2 for male, and the value labels are inserted.
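The same recode can be run from syntax; a minimal sketch, assuming the string variable is named gender as in employee.sav:

    AUTORECODE VARIABLES=gender /INTO sex.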
To invoke regression analysis, click on Analyze
Click on Regression and then Linear
Select the dependent
variable: Current Salary
Enter it in the
dependent variable box
Entering independent
variables
• These variables are entered in blocks. First come the potentially confounding covariates that have to be entered.
• We enter time on job, beginning salary, and previous experience.
After entering the
covariates, we click on
next
We now enter the
hypotheses we wish to
test
• We are testing for minority or sex differences in salary after controlling for time on job, previous experience, and beginning salary.
• We enter minority and the numeric gender variable (sex).
After entering these variables, click on Statistics
We select the following
statistics from the dialog
box and click on continue
Click on Plots to obtain the Plots dialog box
We click on OK to run
the regression analysis
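The equivalent syntax is sketched below; the variable names salary, jobtime, salbegin, prevexp, minority, and sex are assumptions based on employee.sav and the autorecode step. The two METHOD=ENTER subcommands reproduce the two blocks: covariates first, hypothesis variables second.

    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA CHANGE COLLIN TOL
      /DEPENDENT salary
      /METHOD=ENTER jobtime salbegin prevexp
      /METHOD=ENTER minority sex
      /RESIDUALS DURBIN HISTOGRAM(ZRESID).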
Navigation window (left) and output window (right)
This shows that SPSS is reading the variables correctly.
Variables Entered and
Model Summary
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis
Full Model
Coefficients
We omit insignificant variables and
rerun the analysis to obtain trimmed
model coefficients
t Tests and Significance
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper
specification of the model (no
omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors
(homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Explanation of the
Assumptions
1. Linear functional form
   - Does not detect curvilinear relationships
2. Independent observations
   - Representative samples
   - Autocorrelation inflates the t, r, and F statistics and warps the significance tests
3. Normality of the residuals
   - Permits proper significance testing
4. Equality of variance
   - Heteroskedasticity precludes generalization and external validity
   - It also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.
Diagnostic Tests for the
Regression Assumptions
1. Linearity tests: regression curve fitting
   - No level shifts: one regime
2. Independence of observations: runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's general specification test
5. No autocorrelation of residuals: Durbin-Watson test, or the ACF or PACF of the residuals
6. Multicollinearity: correlation matrix of the independent variables; condition index or condition number
7. No serious outlier influence: tests of additive outliers, pulse dummies
   - Plot residuals and look for high leverage of residuals
   - Lists of standardized residuals
   - Lists of studentized residuals
   - Cook's distance or leverage statistics
Explanation of
Diagnostics
1. Plots show linearity or nonlinearity of the relationship.
2. The correlation matrix shows whether the independent variables are collinear and correlated.
3. A representative sample is obtained with probability sampling.
Explanation of
Diagnostics
Tests for normality of the residuals: the residuals are saved and then subjected to either of:

Kolmogorov-Smirnov test: compares the theoretical cumulative normal distribution against the empirical distribution of your residuals.

Menu path: Nonparametric Tests, 1-Sample K-S test
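In syntax, a minimal sketch, assuming the residuals were saved under SPSS's default name res_1:

    NPAR TESTS /K-S(NORMAL)=res_1.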
Collinearity Diagnostics
Tolerance $= 1 - R_j^2$, where $R_j^2$ is obtained by regressing independent variable $j$ on the other independent variables. Low tolerance signals collinearity.
More Collinearity
Diagnostics
Condition number $\kappa$ = maximum eigenvalue / minimum eigenvalue.

If condition numbers are between 100 and 1000, there is moderate to strong collinearity.

Condition index $= \sqrt{\kappa}$, where $\kappa$ = the condition number.

If the condition index > 30, then there is strong collinearity.
Outlier Diagnostics
1. Residuals
   - The actual value minus the predicted value, otherwise known as the error.
2. Studentized residuals
   - The residuals divided by their standard errors computed without the ith observation.
3. Leverage, called the hat diagonal
   - The measure of the influence of each observation.
4. Cook's distance
   - The change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.
Outlier detection
Create Standardized
Residuals
• A standardized residual is one
divided by its standard deviation.
$resid_{standardized} = \frac{y_i - \hat{y}_i}{s}$

where $s$ = standard deviation of the residuals
Limits of Standardized
Residuals
If the standardized residuals have values in excess of 3.5 or below -3.5, they are outliers. If the absolute values are less than 3.5, as they are here, then there are no outliers.

While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
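A sketch of screening the saved standardized residuals in syntax, assuming they were saved under SPSS's default name zre_1:

    TEMPORARY.
    SELECT IF (ABS(zre_1) > 3.5).
    LIST VARIABLES=id zre_1.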
Outlier Influence
Outlier a does not distort
and outlier b does.
Studentized Residuals
Studentized Residual
$e_i^s = \frac{e_i}{\sqrt{s_{(i)}^2 (1 - h_i)}}$

where
$e_i^s$ = studentized residual
$s_{(i)}$ = standard deviation with the ith observation deleted
$h_i$ = leverage statistic
Leverage and the Hat
matrix
1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix = the number of variables in the model.
6. When the leverage > 2p/n, there is high leverage according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.
Cook’s D
$Cook's\ D_i = \frac{h_i}{1-h_i} \cdot \frac{e_i^2}{p\, s^2 (1-h_i)} = \frac{h_i\, e_i^2}{p\, s^2 (1-h_i)^2}$
Using Cook’s D in
SPSS
• Cook's D is obtained from the Save dialog (the /SAVE COOK subcommand in syntax).
• To find the influential outliers, list the cases where Cook's D > 4/n.
• Belsley suggests 4/(n-k-1) as a cutoff.
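A sketch in syntax, assuming Cook's D was saved under SPSS's default name coo_1 and using 4/474 as the cutoff for the 474 cases of employee.sav:

    TEMPORARY.
    SELECT IF (coo_1 > 4/474).
    LIST VARIABLES=id coo_1.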
DFbeta
Click on Save to obtain
the Save dialog box
We select the following
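In syntax, the checked boxes correspond to keywords on the SAVE subcommand; exactly which boxes are checked here is an assumption, but a typical diagnostic set is:

    REGRESSION
      /DEPENDENT salary
      /METHOD=ENTER jobtime salbegin prevexp minority sex
      /SAVE PRED ZRESID SRESID COOK LEVER DFBETA.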
Move the variables to be graphed
into the box on the upper right, and
click on OK
Residual
Autocorrelation check
The Durbin-Watson $d$ tests first-order autocorrelation of the residuals:

$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

A $d$ near 2 indicates no first-order autocorrelation.
Run the autocorrelation function from
the Trends Module for a better analysis
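A minimal sketch, assuming the residuals were saved as res_1 and the Trends module is installed:

    ACF VARIABLES=res_1.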
Testing for Homogeneity of variance
Normality of residuals can be visually
inspected from the histogram with the
superimposed normal curve.
Here we check the skewness for
symmetry and the kurtosis for
peakedness
Kolmogorov Smirnov Test: An
objective test of normality
Multicollinearity test with the
correlation matrix
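A syntax sketch of the correlation matrix of the independent variables, with names again assumed from employee.sav:

    CORRELATIONS /VARIABLES=jobtime salbegin prevexp minority sex.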
Alternatives to Violations
of Assumptions
• 1. Nonlinearity: transform to linearity if there is nonlinearity, or run a nonlinear regression.
• 2. Nonnormality: run a least absolute deviations regression or a median regression (available in other packages), or generalized linear models (SPLUS glm, Stata glm, SAS PROC MODEL, or PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares regression (SPSS; see the sketch after this list) or the White estimator (SAS, Stata, SPLUS). One can also use a robust regression procedure (SAS, Stata, or SPLUS) to downweight the effect of outliers in the estimation.
• 4. Autocorrelation: run AREG in the SPSS Trends module, or either the Prais or Newey-West procedure in Stata.
• 5. Multicollinearity: components regression, ridge regression, or proxy variables; 2SLS in SPSS, ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.
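As one illustration of these remedies, a weighted least squares sketch using the /REGWGT subcommand of SPSS REGRESSION; the weight variable wt is hypothetical and must be constructed first (e.g., as an inverse estimated error variance):

    REGRESSION
      /REGWGT=wt
      /DEPENDENT salary
      /METHOD=ENTER jobtime salbegin prevexp.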
Model Building
Strategies
• Specific to general: Cohen and Cohen
• General to specific: Hendry and Richard
• Extreme bounds analysis: E. Leamer
Nonparametric
Alternatives
1. If there is nonlinearity, transform to linearity first.
2. If there is heteroskedasticity, use robust standard errors with Stata, SAS, or SPLUS.
3. If there is non-normality, use quantile regression with bootstrapped standard errors in Stata or SPLUS.
4. If there is autocorrelation of residuals, use Newey-West standard errors or a first-order autocorrelation correction with AREG. If there is higher-order autocorrelation, use Box-Jenkins ARIMA modeling.