with SPSS
1
Outline
1. Conceptualization
2. Schematic Diagrams of Linear Regression processes
3. Using SPSS, we plot and test relationships for linearity
4. Nonlinear relationships are transformed to linear ones
5. General Linear Model
6. Derivation of Sums of Squares and ANOVA
Derivation of intercept and regression coefficients
7. The Prediction Interval and its derivation
8. Model Assumptions
1. Explanation
2. Testing
3. Assessment
9. Alternatives when assumptions are unfulfilled
2
Conceptualization of
Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition of effects
3
Hypothesis Testing
• For example: hypothesis 1 : X is statistically
significantly related to Y.
– The relationship is positive (as X increases, Y
increases) or negative (as X decreases, Y
increases).
– The magnitude of the relationship is small,
medium, or large.
If the magnitude is small, then a unit change in X is
associated with a small change in Y.
4
Regression Analysis
Have a clear notion of what you can and cannot do with
regression analysis
• Conceptualization
– A Path Model of a Regression Analysis
[Path diagram: X1, X2, and X3 each point to Y, which also receives an error term.]
$$Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i$$
5
A Path Analysis
Decomposition of Effects into Direct,
Indirect, Spurious, and Total Effects
[Path diagram: X1, X2, Y1, Y2, and Y3 connected by paths labeled A, C, D, E, and F, with error terms on the endogenous variables.]
[Path diagram: X1 and X2 point to Y, with paths labeled A and C.]
Interaction coefficient: C
X1 and X2 must be in the model for the interaction to be properly
specified.
7
A Precursor to Modeling with
Regression
• Data Exploration: Run a scatterplot matrix
and search for linear relationships with the
dependent variable.
8
Click on graphs and then on
scatter
9
When the scatterplot dialog box
appears, select Matrix
10
A Matrix of Scatterplots will
appear
11
12
13
Decomposition of the Sums of
Squares
14
Graphical Decomposition of Effects
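[Figure: scatterplot of Y against X with the fitted regression line, marking the following components for one case.]
$$\hat{y} = a + bx$$
$$y_i - \hat{y}_i = \text{error}$$
$$y_i - \bar{y} = \text{total effect}$$
$$\hat{y}_i - \bar{y} = \text{regression effect}$$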
15
Decomposition of the sum of
squares
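$$Y - \bar{Y} = (Y - \hat{Y}) + (\hat{Y} - \bar{Y})$$
total effect = error effect + regression (model) effect
$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}) \quad \text{per case } i$$
Squaring both sides and summing over the n cases (the cross-product term sums to zero under least squares) gives, for the data set:
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$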
16
Decomposition of the sum of
squares
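• Total SS = model SS + error SS
and if we divide each sum of squares by its df, we obtain the corresponding variances (mean squares), where k is the number of predictors:
$$\text{total variance} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$
$$\text{error variance} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2}{n-k-1}$$
$$\text{model variance} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{k}$$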
17
F test for significance and R2 for
magnitude of effect
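• R2 = model SS / total SS
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
• F = model variance / error variance
$$F = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 / k}{\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 / (n-k-1)}$$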
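The decomposition can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set (the values of `x` and `y` are hypothetical and not taken from employee.sav) and a single predictor, so k = 1.

```python
# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, k = len(x), 1  # k = number of predictors

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for one predictor
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Sums of squares: total, error, and model
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)

r_squared = ss_model / ss_total
f_stat = (ss_model / k) / (ss_error / (n - k - 1))

print("total SS:", ss_total, "model SS:", ss_model, "error SS:", ss_error)
print("R2:", r_squared, "F:", f_stat)
```

With these toy values the total SS splits into a model part and an error part, and F is the ratio of the two mean squares.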
19
The Multiple Regression
Equation
• We proceed to the derivation of its components:
– The intercept: a
– The regression parameters, b1 and b2
$$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$$
20
Derivation of the Intercept
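$$y = a + bx + e$$
$$e = y - a - bx$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b \sum_{i=1}^{n} x_i$$
Because by definition $\sum_{i=1}^{n} e_i = 0$,
$$0 = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} a = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$$
$$na = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$$
$$a = \bar{y} - b\bar{x}$$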
21
Derivation of the Regression
Coefficient
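Given: $y_i = a + b x_i + e_i$
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a - b x_i)$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Expressing $x_i$ and $y_i$ as deviations from their means (so that a drops out) and differentiating with respect to b:
$$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i x_i$$
Setting the derivative to zero:
$$0 = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i x_i$$
$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
22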
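As a numerical check on the intercept and slope derivations, here is a minimal sketch in plain Python. The data are hypothetical, and the variable names (`x_dev`, `y_dev`, `errors`) are introduced only for this illustration.

```python
# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation scores, so that the intercept drops out of the slope formula
x_dev = [xi - x_bar for xi in x]
y_dev = [yi - y_bar for yi in y]

# Slope: sum of cross-products over sum of squared x deviations
b = sum(xd * yd for xd, yd in zip(x_dev, y_dev)) / sum(xd ** 2 for xd in x_dev)
# Intercept: mean of y minus slope times mean of x
a = y_bar - b * x_bar

# The errors should sum to zero and be uncorrelated with x
errors = [yi - a - b * xi for xi, yi in zip(x, y)]
sum_e = sum(errors)
sum_xe = sum(xd * e for xd, e in zip(x_dev, errors))

print("a =", a, "b =", b)
print("sum of errors:", sum_e, "sum of x*error:", sum_xe)
```

Both printed sums should be zero up to floating-point rounding, which is what the two first-order conditions require.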
• Recall that the formula for the
correlation coefficient can be expressed as
follows:
23
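$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}}$$
where $x = x_i - \bar{x}$ and $y = y_i - \bar{y}$ are deviation scores. Then
$$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
$$b_j = r \cdot \frac{sd_y}{sd_x}$$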
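The link between the correlation coefficient and the slope can be verified on a small example. This sketch assumes hypothetical data and uses sample standard deviations (dividing by n - 1).

```python
import math

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products of the deviation scores
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
sum_yy = sum((yi - y_bar) ** 2 for yi in y)

r = sum_xy / math.sqrt(sum_xx * sum_yy)
sd_x = math.sqrt(sum_xx / (n - 1))
sd_y = math.sqrt(sum_yy / (n - 1))

b_from_sums = sum_xy / sum_xx   # slope from the least-squares formula
b_from_r = r * sd_y / sd_x      # slope from r and the two standard deviations

print("r =", r)
print("b (sums) =", b_from_sums, "b (r * sd_y / sd_x) =", b_from_r)
```

The two slope values agree because the (n - 1) terms in the standard deviations cancel.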
24
Extending the bivariate case
To the Multiple linear regression case
25
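$$b_{yx_1 . x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$$
$$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$$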
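For two predictors, equations (6) and (8) can be tried out directly. The sketch below assumes hypothetical data and helper names (`mean`, `sd`, `corr`, `partial_slope`) that are not part of the original slides. It then checks that the resulting errors satisfy the least-squares conditions.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Sample standard deviation (divides by n - 1)
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def corr(u, v):
    # Pearson correlation from deviation scores
    mu, mv = mean(u), mean(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = math.sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

def partial_slope(y, xa, xb):
    # Coefficient of xa controlling for xb, following equation (6)
    r_ya, r_yb, r_ab = corr(y, xa), corr(y, xb), corr(xa, xb)
    return (r_ya - r_yb * r_ab) / (1 - r_ab ** 2) * sd(y) / sd(xa)

# Hypothetical toy data, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.0, 4.0, 8.0, 7.0, 12.0, 11.0]

b1 = partial_slope(y, x1, x2)
b2 = partial_slope(y, x2, x1)
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)  # equation (8)

errors = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]
print("a =", a, "b1 =", b1, "b2 =", b2)
```

If the coefficients are the least-squares solution, the errors sum to zero and are orthogonal to both predictors.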
26
Significance Tests for the
Regression Coefficients
1. We find the significance of the parameter
estimates by using the F or t test.
27
F and T tests for significance for
overall model
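$$F = \frac{\text{model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$$
where
p = number of parameters (excluding the intercept)
n = sample size
In the bivariate case (p = 1):
$$t = \sqrt{F} = \sqrt{\frac{(n-2)\, r^2}{1 - r^2}}$$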
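A short worked example of the F formula and its relation to t in the bivariate case. The function name `f_from_r2` and the input values are assumptions made for illustration.

```python
import math

def f_from_r2(r2, n, p):
    # F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Hypothetical values: one predictor, five cases, R^2 = 0.6
r2, n = 0.6, 5
f_stat = f_from_r2(r2, n, 1)

# With a single predictor, t is the square root of F
t_from_f = math.sqrt(f_stat)
t_from_r = math.sqrt((n - 2) * r2 / (1 - r2))

print("F =", f_stat, "t =", t_from_f)
```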
28
Significance tests
• If we are using a Type II sum of squares,
we are dealing with the ballantine (a Venn diagram of shared variance).
DV variance explained = a + b
29
Significance tests
T tests for statistical significance
$$t = \frac{a - 0}{se_a}$$
$$t = \frac{b - 0}{se_b}$$
30
Significance tests
Standard Error of intercept
$$SE_a = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} \cdot \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)}$$
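Standard error of regression coefficient
$$SE_b = \frac{\hat{\sigma}}{\sqrt{\sum x_i^2}}$$
where x is in deviation form and
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$
31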
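The standard errors and t tests can be computed by hand for a bivariate regression. This sketch assumes hypothetical data and the usual textbook expressions for simple regression, with the error variance estimated on n - 2 degrees of freedom.

```python
import math

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

# Error variance estimate with n - 2 degrees of freedom
errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e ** 2 for e in errors) / (n - 2)

# Standard errors of the slope and of the intercept
se_b = math.sqrt(sigma2 / s_xx)
se_a = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / s_xx))

# t statistics against a null value of zero
t_a = a / se_a
t_b = b / se_b

print("se_a =", se_a, "t_a =", t_a)
print("se_b =", se_b, "t_b =", t_b)
```

In this one-predictor case the squared t for the slope equals the overall F statistic.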
Programming Protocol
After invoking SPSS, proceed to File, Open, Data
32
Select a Data Set (we choose
employee.sav) and click on open
33
We open the data set
34
To inspect the variable formats,
click on variable view on the lower
left
35
Because gender is a string
variable, we need to recode gender
into a numeric format
36
We autorecode gender by clicking
on transform and then autorecode
37
We select gender and move it into
the variable box on the right
38
Give the variable a new name and
click on add new name
39
Click on ok and the numeric
variable sex is created
It has values 1 for female and 2 for male, and those value labels
are inserted.
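The effect of the autorecode step can be mimicked outside SPSS. The sketch below is not SPSS syntax. It is a plain Python illustration that assumes distinct string values are numbered in ascending alphabetical order, and the function name `autorecode` and the sample values are hypothetical.

```python
def autorecode(values):
    # Number the distinct strings 1, 2, ... in ascending alphabetical order
    codes = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

# Hypothetical string variable with "f" for female and "m" for male
gender = ["m", "f", "f", "m"]
sex, value_labels = autorecode(gender)

print(sex)           # numeric codes, one per case
print(value_labels)  # mapping from original string to numeric code
```

The returned dictionary plays the role of the value labels attached to the new numeric variable.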
40
To invoke Regression analysis,
Click on Analyze
41
Click on Regression and then
linear
42
Select the dependent variable:
Current Salary
43
Enter it in the dependent
variable box
44
Entering independent variables
• These variables are entered in blocks.
First, the potentially confounding covariates
have to be entered.
• We enter time on job, beginning salary,
and previous experience.
45
After entering the covariates, we
click on next
46
We now enter the hypotheses we
wish to test
• We are testing for minority or sex
differences in salary after controlling for
the time on job, previous experience, and
beginning salary.
• We enter minority and numeric gender
(sex)
47
After entering these variables, click
on statistics
48
We select the following statistics
from the dialog box and click on
continue
49
Click on plots to obtain the plots
dialog box
50
We click on OK to run the
regression analysis
51
Navigation window (left) and output
window (right)
This shows that SPSS is reading the variables correctly
52
Variables Entered and Model
Summary
53
Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis
54
Full Model
Coefficients
55
We omit insignificant variables and rerun the
analysis to obtain trimmed model coefficients
57
T tests and signif.
• These are the tests of significance for
each parameter estimate.
58
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model
(no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual
variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
59
Explanation of the Assumptions
1. Linear Functional form
1. A linear model does not detect curvilinear relationships
2. Independent observations
1. Representative samples
2. Autocorrelation inflates the t, r, and F statistics and warps the significance
tests
tests
3. Normality of the residuals
1. Permits proper significance testing
4. Equality of variance
1. Heteroskedasticity precludes generalization and external validity
2. This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude
computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: If outliers have high influence and
the sample is not large enough, then they may seriously bias the parameter
estimates
60
Diagnostic Tests for the Regression
Assumptions
1. Linearity tests: Regression curve fitting
1. No level shifts: One regime
2. Independence of observations: Runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov Test
4. Homogeneity of variance of the residuals: White’s General Specification
test
5. No autocorrelation of residuals: Durbin-Watson or ACF or PACF of
residuals
6. Multicollinearity: Correlation matrix of independent variables.
Condition index or condition number
7. No serious outlier influence: tests of additive outliers: Pulse dummies.
1. Plot residuals and look for high leverage of residuals
2. Lists of Standardized residuals
3. Lists of Studentized residuals
4. Cook’s distance or leverage statistics
61
Explanation of Diagnostics
1. Plots show linearity or nonlinearity of
relationship
2. Correlation matrix shows whether the
independent variables are collinear and
correlated.
3. A representative sample is obtained through
probability sampling
62
Explanation of Diagnostics
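Tests for Normality of the residuals. The
residuals are saved and then subjected to
either the Shapiro-Wilk or the Kolmogorov-Smirnov test.
Kolmogorov-Smirnov Test: Compares the empirical cumulative
distribution of your residuals with the theoretical cumulative
normal distribution, using the largest absolute difference
between the two as the test statistic.
In SPSS: Nonparametric Tests, then 1 sample K-S test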
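To make the Kolmogorov-Smirnov idea concrete, here is a minimal sketch that computes only the test statistic D against a normal distribution with a given mean and standard deviation. The function names and the sample residuals are hypothetical. No p-value is computed, and note that estimating the mean and standard deviation from the same residuals changes the null distribution of D.

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(residuals, mu=0.0, sigma=1.0):
    # Largest gap between the empirical CDF and the normal CDF
    xs = sorted(residuals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf((x - mu) / sigma)
        # The empirical CDF jumps from (i - 1)/n to i/n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical saved residuals, for illustration only
residuals = [-1.2, -0.3, 0.1, 0.4, 1.0]
d_stat = ks_statistic(residuals)
print("K-S D =", d_stat)
```

A larger D means the residual distribution departs further from the reference normal curve.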
63
Collinearity Diagnostics
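$$\text{Tolerance} = 1 - R^2$$
where R2 comes from regressing that independent variable on the other independent variables.
Small tolerances imply problems.
$$\text{Variance Inflation Factor (VIF)} = \frac{1}{\text{Tolerance}}$$
Small intercorrelations among the independent variables mean VIF ≈ 1.
VIF > 10 signifies problems.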
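With only two predictors, the R2 of one predictor regressed on the other is just their squared correlation, so tolerance and VIF can be computed in a few lines. This is a sketch under that two-predictor assumption, with hypothetical data and function names.

```python
import math

def corr(u, v):
    # Pearson correlation from deviation scores
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = math.sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

def tolerance_and_vif(x1, x2):
    # Two-predictor case: R^2 of x1 on x2 equals the squared correlation
    r = corr(x1, x2)
    tolerance = 1 - r ** 2
    return tolerance, 1 / tolerance

# Hypothetical correlated predictors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
tol, vif = tolerance_and_vif(x1, x2)

# Hypothetical uncorrelated predictors: VIF should be 1
tol0, vif0 = tolerance_and_vif([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])

print("tolerance =", tol, "VIF =", vif)
print("uncorrelated case: tolerance =", tol0, "VIF =", vif0)
```

With more than two predictors, each tolerance would instead need the R2 from a full auxiliary regression.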
64
More Collinearity Diagnostics
Condition number = maximum eigenvalue / minimum eigenvalue.
If condition numbers are between 100 and
1000, there is moderate to strong collinearity.
$$\text{condition index} = \sqrt{k}$$
where k = condition number
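A small worked case: for two standardized predictors with correlation r, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so the condition number has a closed form. The sketch assumes the eigenvalues are taken from the predictor correlation matrix. Statistical packages may instead use a scaled cross-products matrix that includes the intercept, which gives different numbers.

```python
import math

def condition_number(r):
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|
    eig_max = 1 + abs(r)
    eig_min = 1 - abs(r)
    return eig_max / eig_min

def condition_index(r):
    # Square root of the condition number
    return math.sqrt(condition_number(r))

# Hypothetical correlations between two predictors
for r in (0.6, 0.9, 0.99):
    print(r, condition_number(r), condition_index(r))
```

The condition number grows quickly as the correlation approaches 1, which is why it flags near-collinear predictors.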
65
Outlier Diagnostics
1. Residuals.
1. The predicted value minus the actual value. This is
otherwise known as the error.
2. Studentized Residuals
1. the residuals divided by their standard errors
without the ith observation
3. Leverage, called the Hat diag
1. This is the measure of influence of each observation
4. Cook’s Distance:
1. the change in the statistics that results from deleting
the observation. Watch this if it is much greater than
1.0.
66
Outlier detection
• Outlier detection involves the
determination whether the residual (error =
predicted – actual) is an extreme negative
or positive value.
• After running the regression, we may plot the
residuals versus the fitted values to determine
which errors are large.
67
Create Standardized Residuals
• A standardized residual is a residual divided by its
standard deviation.
$$resid_{standardized} = \frac{\hat{y}_i - y_i}{s}$$
where s = std dev of residuals
68
Limits of Standardized
Residuals
If the standardized residuals have values greater than
3.5 or less than -3.5, they are outliers.
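The standardizing and flagging steps might look as follows in plain Python. The function names and residual values are hypothetical, and the sketch assumes the sample standard deviation of the residuals (dividing by n - 1). Regression software often divides by the residual degrees of freedom instead.

```python
import statistics

def standardize(residuals):
    # Divide each residual by the standard deviation of the residuals
    s = statistics.stdev(residuals)
    return [e / s for e in residuals]

def flag_outliers(z_scores, limit=3.5):
    # Positions of standardized residuals beyond +limit or -limit
    return [i for i, z in enumerate(z_scores) if abs(z) > limit]

# Hypothetical residuals (predicted minus actual), for illustration only
residuals = [0.8, -0.6, -1.0, 0.6, 0.2]
z = standardize(residuals)
print(z)
print(flag_outliers(z))
```

The `limit` argument defaults to the 3.5 cutoff and can be lowered for a stricter screen.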
69
Outlier Influence
• Suppose we had a different data set with
two outliers.
• We tabulate the standardized residuals
and obtain the following output:
70
Outlier a does not distort and outlier
b does.
71
Studentized Residuals
• Alternatively, we could form studentized
residuals. These are distributed as a t
distribution with df=n-p-1, though they are not
quite independent. Therefore, we can
approximately determine if they are
statistically significant or not.
• Belsley et al. (1980) recommended the use of
studentized residuals.
72
Studentized Residual
$$e_i^{s} = \frac{e_i}{\sqrt{s^2_{(i)}\,(1 - h_i)}}$$
where
$e_i^{s}$ = studentized residual
$s_{(i)}$ = standard deviation where ith obs is deleted
$h_i$ = leverage statistic
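A sketch of the studentized residual for a bivariate regression, where the deleted standard deviation is obtained by literally refitting the model without each observation. The data and function names are hypothetical. The leverage uses the closed form that holds for one predictor plus an intercept, and residuals follow the predicted minus actual convention.

```python
import math

def fit(x, y):
    # Least-squares intercept and slope for one predictor
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b

def studentized_residuals(x, y):
    n, p = len(x), 2  # p counts the intercept and the slope
    a, b = fit(x, y)
    xb = sum(x) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    out = []
    for i in range(n):
        e_i = (a + b * x[i]) - y[i]              # predicted minus actual
        h_i = 1 / n + (x[i] - xb) ** 2 / sxx     # leverage of case i
        # Refit without case i to get the deleted error variance
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a_d, b_d = fit(x_del, y_del)
        sse_d = sum((yj - (a_d + b_d * xj)) ** 2 for xj, yj in zip(x_del, y_del))
        s2_del = sse_d / (n - 1 - p)
        out.append(e_i / math.sqrt(s2_del * (1 - h_i)))
    return out

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_resid = studentized_residuals(x, y)
print(t_resid)
```

Refitting n times is wasteful for large data sets but keeps the definition visible.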
73
Influence of Outliers
1. Leverage is measured by the diagonal
components of the hat matrix.
2. The hat matrix comes from the formula for
the regression of Y.
$$\hat{Y} = X\hat{\beta} = X (X'X)^{-1} X' Y$$
where $X (X'X)^{-1} X'$ = the hat matrix, H
Therefore,
$$\hat{Y} = HY$$
74
Leverage and the Hat matrix
1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will
be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer
the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix = the number of parameters in the
model (p, counting the intercept).
6. When the leverage > 2p/n then there is high leverage
according to Belsley et al. (1980) cited in Long, J.F. Modern
Methods of Data Analysis (p.262). For smaller samples,
Velleman and Welsch (1981) suggested that 3p/n is the
criterion.
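These leverage properties can be illustrated without matrix software. For a regression with one predictor and an intercept, the diagonal of the hat matrix has the closed form used below. The data and names are hypothetical, and p = 2 counts the intercept and the slope.

```python
def leverages(x):
    # Hat-matrix diagonal for one predictor plus an intercept
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Hypothetical predictor values with one extreme case
x = [1.0, 2.0, 3.0, 4.0, 20.0]
p = 2  # parameters: intercept and slope
h = leverages(x)

cutoff = 2 * p / len(x)
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]

print("leverages:", h)
print("sum of leverages:", sum(h))
print("cases above 2p/n:", high_leverage)
```

The leverages sum to p, each lies between 1/n and 1, and only the extreme x value exceeds the 2p/n cutoff.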
75
Cook’s D
1. Another measure of influence.
2. This is a popular one. The formula for it is:
$$\text{Cook's } D_i = \frac{1}{p} \cdot \frac{h_i}{1 - h_i} \cdot \frac{e_i^2}{s^2 (1 - h_i)}$$
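The Cook's D formula can be compared with its deletion-based meaning: how far all fitted values move when one case is dropped. The sketch below does both for a bivariate regression with hypothetical data. The function names are illustrative, and s2 is the error variance on n - p degrees of freedom.

```python
def fit(x, y):
    # Least-squares intercept and slope for one predictor
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b

def cooks_distance(x, y):
    # Formula version: (1/p) * h/(1-h) * e^2 / (s^2 * (1-h))
    n, p = len(x), 2
    a, b = fit(x, y)
    xb = sum(x) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(ei ** 2 for ei in e) / (n - p)
    h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]
    return [(1 / p) * (hi / (1 - hi)) * ei ** 2 / (s2 * (1 - hi))
            for hi, ei in zip(h, e)]

def cooks_by_deletion(x, y):
    # Deletion version: squared shift in all fitted values, scaled by p * s^2
    n, p = len(x), 2
    a, b = fit(x, y)
    fitted = [a + b * xi for xi in x]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    out = []
    for i in range(n):
        a_d, b_d = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a_d + b_d * xi)) ** 2 for fi, xi in zip(fitted, x))
        out.append(shift / (p * s2))
    return out

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
d_formula = cooks_distance(x, y)
d_deletion = cooks_by_deletion(x, y)
print(d_formula)
print(d_deletion)
```

The two lists agree, which shows that the formula summarizes the effect of deleting each case without any refitting.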
76
Using Cook’s D in SPSS
• Cook is the option /R
• Finding the influential outliers
• List cook, if cook > 4/n
• Belsley suggests 4/(n-k-1) as a cutoff
77
DFbeta
• One can use the DFbetas to ascertain the
magnitude of influence that an observation has
on a particular parameter estimate if that
observation is deleted.
$$DFbeta_j = \frac{(b_j - b_{(i)j})\, u_j}{\sqrt{\sum u_j^2 \,(1 - h_j)}}$$
where $u_j$ = residuals of
regression of x on remaining xs.
78
Programming Diagnostic Tests
Testing homoskedasticity
Select histogram and normal probability plot, and insert *zresid
in Y
and *zpred in X
80
We select the following
82
Move the variables to be graphed into the box on
the upper right, and click on OK
83
Residual Autocorrelation check
Durbin Watson d
tests first order
autocorrelation of residuals
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
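The Durbin-Watson statistic is simple enough to compute by hand. A minimal sketch, assuming the residuals are listed in time order. The function name and the example residual series are hypothetical.

```python
def durbin_watson(residuals):
    # Sum of squared successive differences over the sum of squared residuals
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual series, for illustration only
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # strong positive autocorrelation

print(durbin_watson(alternating))
print(durbin_watson(constant))
```

Values of d near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.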
84
Run the autocorrelation function from the Trends
Module for a better analysis
85
Testing for Homogeneity of variance
86
Normality of residuals can be visually inspected
from the histogram with the superimposed normal
curve.
Here we check the skewness for symmetry and the
kurtosis for peakedness
87
Kolmogorov Smirnov Test: An objective test of
normality
88
89
90
Multicollinearity test with the correlation matrix
91
92
93
Alternatives to Violations of
Assumptions
• 1. Nonlinearity: Transform to linearity if there is nonlinearity, or run
a nonlinear regression.
• 2. Nonnormality: Run a least absolute deviations regression or a
median regression (available in other packages), or generalized
linear models (SPLUS glm, STATA glm, or SAS PROC MODEL or
PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares regression (SPSS)
or White estimator (SAS, STATA, SPLUS). One can use a robust
regression procedure (SAS, STATA, or SPLUS) to downweight
outlier effects in the estimation.
• 4. Autocorrelation: Run AREG in SPSS Trends module or either
Prais or Newey-West procedure in STATA.
• 5. Multicollinearity: components regression or ridge regression or
proxy variables. 2SLS in SPSS or ivreg in STATA or SAS PROC MODEL
or PROC SYSLIN.
94
Model Building Strategies
• Specific to General: Cohen and Cohen
• General to Specific: Hendry and Richard
• Extreme Bounds analysis: E. Leamer.
95
Nonparametric Alternatives
1. If there is nonlinearity, transform to linearity first.
2. If there is heteroskedasticity, use robust standard
errors with STATA or SAS or SPLUS.
3. If there is non-normality, use quantile regression
with bootstrapped standard errors in STATA or
SPLUS.
4. If there is autocorrelation of residuals, use Newey-
West autoregression or First order autocorrelation
correction with Areg. If there is higher order
autocorrelation, use Box Jenkins ARIMA modeling.
96