1
Regression Analysis
with SAS
Robert A. Yaffee, Ph.D.
Statistics, Mapping, and Social Science
Group
Academic Computing Services
ITS
NYU
251 Mercer Street
New York, NY 10012
Office: 75 Third Avenue, SB
p. 212.998.3402
Email: yaffee@nyu.edu
2
Outline
1. Conceptualization of Regression Analysis
2. Plotting the data
3. Linear Regression Theory
1. Derivation of the intercept
2. Derivation of the slope
3. Multiple linear Regression
4. Significance tests
5. Assumptions
6. Diagnostic Tests of the assumptions
7. Assessment of robustness of model
4. Interaction model
1. Conceptualization
2. Path diagram
3. Program syntax
5. Note on polynomial regression
6. Model Building Strategies
7. Robust Alternatives
1. WLS
2. White Estimators
3. Median Regression Proc Robustreg
3
Regression Analysis
Have a clear notion of what you can and
cannot do with regression analysis
• Conceptualization
– A Path Model of a Regression
Analysis
Path Diagram of a Linear Regression Analysis
[Figure: path diagram in which X1, X2, and x3 each point to Y, with an error term also pointing to Y]
$$Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i$$
4
Hypothesis Testing
• For example: hypothesis 1 : X is
statistically significantly related to
Y.
– The relationship is positive (as X
increases, Y increases) or negative
(as X decreases, Y increases).
– The magnitude of the relationship is
small, medium, or large.
If the magnitude is small, then a unit
change in x is associated with a
small change in Y.
5
Plotting the Data: Testing
for Functional form
1. The relationship between the Y
and each X needs to be
examined.
2. One needs to ascertain whether
it is linear or not
3. If it is not linear, can it be
transformed into linearity?
4. If so, then that may be the
easiest recourse.
5. If not, then perhaps linear
regression is not appropriate
6
I. Exploring the functional form of the relationship.
It is necessary to plot the dependent variable against each of the independent
variables.
1. The purpose is to determine the functional form of the relationship. If
the relationship is linear, then the distribution of data points on the graph will
have the approximate appearance of a straight line. There may be some
dispersion about the line. For the most part, the distribution can be
approximated by a straight line.
2. If the distribution has a curved form, the question arises as to whether the
actual functional form approximating the distribution is a
curve or not.
3. If it is some sort of curve, it may be transformed into a straight line. If it can,
then this transformation of the original variable may be necessary.
4. To ascertain what the best functional form of these individual relationships
is, the analyst may wish to run the segmented regression over portions
of each independent variable.
5. The analyst may test a number of functional forms to see which provides the
best fit. He may refer to the R^2 statistic for each functional form and select
the form with the maximum R^2.
6. To demonstrate how this procedure works, we show an example of curve
estimation with a polynomial of order three.
The real data are distributed as a cubic.
a. We may plot the data with a plot procedure, called
proc plot; The command syntax for PROC PLOT follows:
PROC PLOT;
PLOT Y*X;
RUN;
b. If we are using high resolution graphics, called
proc gplot, we may have much more control over the appearance
of the output. The command syntax for PROC GPLOT looks
like the following:
PROC GPLOT;
PLOT Y*X;
RUN;
Explore the
relationship
7
Graphical Exploration
The generation of the data is done with a do loop:
data plot1;
do x = 1 to 30;
y = x**3;
output;
end;
proc plot;
plot y*x; run;
We examine the plot to look for functional form,
outlier patterns, and other peculiarities.
8
Graphical Plot
• The output from this plot
appears in figure 1:
Figure 1 A plot of a nonlinear relationship between the dependent
variable Y, and an independent variable, X.
9
Testing Functional
Form with Curve Fitting
1b. Curve Fitting
1. The data are subjected to a number of tests of functional form.
Application of the data to such a process
presumes either a linear relationship or a
relationship that can be transformed into a linear
one.
2. From a regression analysis of the
relationship, an r square is generated. This is
the square of the multiple correlation
coefficient between the dependent variable
and the independent variables in the model.
This r square is the proportion of variance of
the dependent variable explained by the
functional form. The higher the r square, the
closer the functional form is to the actual
relationship inherent in the data.
3. The program to set up the curve fitting may be
found in figure 3 below.
10
We may fit functional
forms
data curves;
a = 1; b = 2; /* illustrative constants; set these to suit your data */
Do x = 1 to 100;
y = x**3; /* cubic test data, as in the earlier plot example */
liny = a + x;
quady = a + x**2;
cubicy = a + x**3;
lnx = log(x);
invx = 1/x;
expx = exp(x);
compound = a*b**x;
power = a*x**b;
sshapex = exp( a + x**(-1)) ;
growth = exp(a + b*x);
output;
End;
run;
Proc print;
run;
Then run different models. For example,
Proc Reg;
model y = quady;
run;
Use the transformation that yields the highest R^2
11
2. Command syntax for Curve Fitting
The SAS programming syntax to set up the test of the data against
the functional forms may be found in figure 4.
12
Curve Fitting Output &
Interpretation
Output and interpretation
The step
proc print label data=bbb;
var _model_ p rsq;
run;
produces the output in figure 5 on the next page.
While all three forms have significant components, it can be seen that the
R^2 approaches 1.00 only with the cubic and power functional forms. The
regression coefficient for this curve equals 1.00 and we have identified the
functional form of the regression analysis. The cubic functional form is
the third power of the function, so it is not surprising that the coefficient
of determination (R^2) for that will be 1.00 also.
The nice aspect of curve fitting is that it provides a clear objective
criterion against which to assess functional form.
13
14
With the aid of this tool, one may obtain a better idea as to which
transformation to apply in order to render a nonlinear relationship
amenable to linear regression, in the event of apparent nonlinearity.
Linear Regression analysis
A. The General Linear Model
B. Derivation of the SS and Anova
C. Derivation of the coefficients: a and b
1. Bivariate Case
2. Multivariate Case: a, b1 and b2
D. The significance tests
E. The Prediction interval and its derivation
F. The Assumptions of Regression Analysis
G. Testing those assumptions
H. Robustness of the Model in face of their violation
I. Fixes:
1. Weighted Least Squares estimation for heteroskedasticity
2. Autoregression for autocorrelation
3. Nonlinear regression for nonlinearity
15
Fixes continued
4. Nonparametric regression for bivariate cases where assumptions don't hold
5. Logistic or Probit regression for dichotomous dependent variables
6. Poisson Regression for Count or Rare event data
16
The General Linear
Model
A. The General Linear Model
Includes regression, anova, and ancova. The main difference is the
Levels of measurement of the independent variables.
1. Regression: IVs are continuous
2. Anova: IVs are discrete
3. Ancova: IVs are both discrete and continuous
B. Decomposition of the Sums of squares
If we look at the graph of the regression line, we may geometrically
represent the error, the regression effect, and the total effect.
The graphical representation below depicts the sums of squares decomposition of
the equation :
SS total = SS regression + SS error
If we divide each sum of squares by its
respective df, we obtain the MS
MS total = SS total/(n - 1), MS regression = SS regression/k, MS error = SS error/(n - k - 1)
Where n = sample size and k = # independent vars in model
Since MS = variance, these are the total variance, the regression variance, and the error variance.
The sums of squares and their df are additive; the mean squares themselves are not.
We divide both sides of the SS equation by SS total to obtain
1 = R square regression + R square error
F = Regression Variance / Error Variance
17
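[Figure: plot of Y against X with the fitted regression line $\hat{y} = a + bx$. For an observation $y_i$, brackets mark the error $y_i - \hat{y}_i$, the regression effect $\hat{y}_i - \bar{y}$, and the total effect $y_i - \bar{y}$.]
Decomposition of Effects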
18
Decomposition of the
sum of squares
$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$
total effect = error effect + regression (model) effect, per case i
Squaring both sides and summing over the n cases (the cross-product term sums to zero under least squares) gives, for the data set,
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
19
Decomposition of the sum
of squares
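• Total SS = model SS + error SS
and if we divide each SS by its df
$$\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}, \qquad \frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-k-1}, \qquad \frac{\sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2}{k}$$
• This yields the variance components: the total variance, the error variance, and the model variance. The SS and the df are additive; the model and error variances are compared through their ratio, F.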
20
F test for significance and R^2 for magnitude of effect
• R^2 = Model SS/Total SS
$$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$
• F test for model significance
= Model Var/Error Var
$$F_{(k,\,n-k-1)} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 / k}{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 / (n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}$$
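As a worked illustration of the R^2 and F formulas, take hypothetical values that are not from the slides: SS total = 100, SS model = 80, SS error = 20, with n = 23 cases and k = 2 independent variables.

```latex
\begin{align*}
R^2 &= \frac{SS_{model}}{SS_{total}} = \frac{80}{100} = 0.80 \\
F_{(2,\,20)} &= \frac{SS_{model}/k}{SS_{error}/(n-k-1)} = \frac{80/2}{20/20} = 40 \\
F_{(2,\,20)} &= \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{0.80/2}{0.20/20} = 40
\end{align*}
```

Both routes to F agree, because the second is the first with numerator and denominator divided by SS total.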
21
ANOVA tests the significance of the
Regression Model
22
Derivation of the Intercept
$$y_i = a + b x_i + e_i$$
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b\sum_{i=1}^{n} x_i$$
Because by definition $\sum_{i=1}^{n} e_i = 0$,
$$0 = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} a = \sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i$$
$$na = \sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i$$
$$a = \bar{y} - b\bar{x}$$
23
Derivation of the
Regression Coefficient
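Given: $y_i = a + b x_i + e_i$, with $x_i$ and $y_i$ expressed as deviations from their means so that the term in $a$ drops out of the derivative,
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a - b x_i)$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
$$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2\sum_{i=1}^{n} x_i y_i + 2b\sum_{i=1}^{n} x_i x_i$$
$$0 = -2\sum_{i=1}^{n} x_i y_i + 2b\sum_{i=1}^{n} x_i x_i$$
$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$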
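A small worked example may make the slope and intercept formulas concrete. The five data points below are hypothetical and chosen only for easy arithmetic: X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5, so that the mean of X is 3 and the mean of Y is 4.

```latex
\begin{align*}
x_i = X_i - \bar{X} &: \; -2,\,-1,\,0,\,1,\,2 \\
y_i = Y_i - \bar{Y} &: \; -2,\,0,\,1,\,0,\,1 \\
\sum x_i y_i &= (-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1) = 6 \\
\sum x_i^2 &= 4 + 1 + 0 + 1 + 4 = 10 \\
b &= \frac{\sum x_i y_i}{\sum x_i^2} = \frac{6}{10} = 0.6 \\
a &= \bar{Y} - b\bar{X} = 4 - (0.6)(3) = 2.2
\end{align*}
```

The fitted line for these illustrative data is therefore predicted Y = 2.2 + 0.6 X.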
24
The Multiple
Regression Equation
• We proceed to the derivation of its
components:
– The intercept: a
– The regression parameters, b1 and b2
$$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$$
25
The Multiple
Regression Formula
• If we recall that the formula for
the correlation coefficient can
be expressed as follows:
26
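$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right)}}$$
where $x_i$ and $y_i$ are deviation scores: $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$.
$$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
from which it can be seen that the regression coefficient b
is a function of r:
$$b_j = r \cdot \frac{sd_y}{sd_x}$$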
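To check the link between b and r numerically, here is a sketch with hypothetical data (X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5; not from the slides). In deviation form, the cross-product sum is 6, the sum of squared x is 10, and the sum of squared y is 6.

```latex
\begin{align*}
r &= \frac{\sum x_i y_i}{\sqrt{(\sum x_i^2)(\sum y_i^2)}} = \frac{6}{\sqrt{(10)(6)}} \approx 0.7746 \\
sd_x &= \sqrt{\frac{10}{5-1}} \approx 1.5811, \qquad sd_y = \sqrt{\frac{6}{5-1}} \approx 1.2247 \\
b &= r \cdot \frac{sd_y}{sd_x} \approx 0.7746 \times \frac{1.2247}{1.5811} = 0.6 \\
b &= \frac{\sum x_i y_i}{\sum x_i^2} = \frac{6}{10} = 0.6
\end{align*}
```

Both expressions give the same slope, as the algebra implies.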
27
Substitute the partial r for r:
$$b_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$$
$$b_{yx_2 \cdot x_1} = \frac{r_{yx_2} - r_{yx_1}\, r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_2}} \qquad (7)$$
It is also easy to extend the bivariate intercept
to the multivariate case as follows.
$$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$$
28
Significance Tests for the
Regression Coefficients
1. We find the significance of the
parameter estimates by using the
F or t test.
2. The R^2 is the proportion of
variance explained.
3. Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)
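As a quick numerical illustration of the adjustment, assume hypothetical values R^2 = 0.80, n = 23, and p = 2 (none of these come from the slides):

```latex
\begin{align*}
\text{Adjusted } R^2 &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \\
&= 1 - (0.20)\,\frac{22}{20} = 1 - 0.22 = 0.78
\end{align*}
```

The adjusted value is slightly below R^2, and the gap widens as p grows relative to n.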
29
F and T tests for
significance for overall
model
$$F = \frac{\text{Model variance}}{\text{error variance}} = \frac{R^2/p}{(1-R^2)/(n-p-1)}$$
where
p = number of parameters
n = sample size
For the bivariate case,
$$t = \sqrt{F} = \sqrt{\frac{(n-2)\, r^2}{1-r^2}}$$
30
Significance tests
• If we are using a type II sum of
squares, we are dealing with
the ballantine. DV Variance
explained = a + b
31
Significance tests
T tests for statistical significance
$$t = \frac{a - 0}{se_a}$$
$$t = \frac{b - 0}{se_b}$$
32
Significance tests
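Standard Error of intercept
$$SE_a = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} \cdot \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Standard error of regression coefficient
$$SE_b = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} x_i^2}}$$
with $x_i$ in deviation form, where $\hat{\sigma}$ = std dev of residual:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$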
33
SAS Regression
Command Syntax
34
SAS Regression syntax
Proc reg simple data=regdat;
model y = x1 x2 x3 /spec dw corrb collin r
influence ; /* INFLUENCE prints DFFITS and DFBETAS */
output out=resdat p=pred
r=resid rstudent=rstud;
run;
Data check;
set resdat;
run;
Proc univariate normal plot;
var resid;
Title 'Check of Normality of Residuals';
Run;
Proc Freq data=resdat; tables rstud;
Title 'Check for Outliers';
run;
Proc ARIMA data=resdat;
identify var=resid;
Title 'Check of Autocorrelation of the errors';
Run;
35
SAS Regression Output
& Interpretation
36
More Simple Statistics
37
Omnibus ANOVA
Statistics
38
Parameter Estimates
39
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper
specification of the model (no
omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors
(homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
40
Explanation of the
Assumptions
1. Linear Functional form
1. Does not detect curvilinear relationships
2. Independent observations
1. Representative samples
2. Autocorrelation inflates the t and r and f statistics and
warps the significance tests
3. Normality of the residuals
1. Permits proper significance testing
4. Equality of variance
1. Heteroskedasticity precludes generalization and
external validity
2. This also warps the significance tests
5. Multicollinearity prevents proper parameter
estimation. It may also preclude computation of the
parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: If outliers
have high influence and the sample is not large
enough, then they may seriously bias the parameter
estimates
41
Diagnostic Tests for the
Regression Assumptions
1. Linearity tests: Regression curve fitting
1. No level shifts: One regime
2. Independence of observations: Runs test
3. Normality of the residuals: Shapiro-Wilk or
Kolmogorov-Smirnov Test
4. Homogeneity of variance of the residuals: White’s
General Specification test
5. No autocorrelation of residuals: Durbin-Watson or
ACF or PACF of residuals
6. Multicollinearity: Correlation matrix of the independent variables.
Condition index or condition number
7. No serious outlier influence: tests of additive outliers:
Pulse dummies.
1. Plot residuals and look for high leverage of residuals
2. Lists of Standardized residuals
3. Lists of Studentized residuals
4. Cook’s distance or leverage statistics
42
Explanation of
Diagnostics
1. Plots show linearity or
nonlinearity of relationship
2. Correlation matrix shows
whether the independent
variables are collinear and
correlated.
3. Representative sample is done
with probability sampling
43
Explanation of
Diagnostics
Tests for Normality of the residuals.
The residuals are saved and then
subjected to either of:
Kolmogorov-Smirnov Test: Tests the
limit of the theoretical cumulative
normal distribution against your
residual distribution.
Shapiro-Wilk Test
Proc reg;
model y = x1 x2;
Output out=resdat r=resid p=pred;
run;
Data check;
set resdat;
run;
Proc univariate normal plot; var resid;
Title 'Test of Normality of Residuals';
Run;
44
Test for Homogeneity of
variance
White’s General
Test
Proc reg;
model y = x1
x2/spec;
Run;
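White's test regresses the squared residuals on the regressors, their squares, and their cross-product:
$$e_i^2 = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2$$
Chow test
$$F_{chow} = \frac{(R^2_{ur} - R^2_{r})/(p_{ur} - p_{r})}{(1 - R^2_{ur})/(n - p_{ur})}$$
where ur = unrestricted model and r = restricted model.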
45
Weighted Least
squares
A fix for heteroskedasticity is
Weighted least squares or WLS
There are two ways to do this.
If the variance in y is
proportional to x, then form the
weight by wt = 1/x;
Proc reg;
weight wt;
Model y = x;
Run;
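The PROC REG step above needs a weight variable to exist in the input data. A minimal sketch of the full sequence follows; the data set name mydata and the assumption that the error variance is proportional to x are illustrative, not taken from the slides.

```sas
/* Minimal WLS sketch. mydata is a hypothetical data set containing y and x. */
/* Assumption: the error variance is proportional to x, so wt = 1/x.         */
data wlsdat;
   set mydata;
   if x > 0 then wt = 1/x;   /* weight = inverse of the assumed variance */
run;

proc reg data=wlsdat;
   weight wt;                /* cases with missing or nonpositive wt are excluded */
   model y = x;
run;
quit;
```

If the residual spread instead grows in proportion to x itself (standard deviation proportional to x), the corresponding weight would be 1/x**2.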
46
Collinearity Diagnostics
Tolerance = $1 - R^2$
small tolerances imply problems
Variance Inflation Factor (VIF) = $\frac{1}{\text{Tolerance}}$
Small intercorrelations among indep vars
means VIF $\approx 1$
VIF > 10 signifies problems
47
More Collinearity
Diagnostics
Watch for eigenvalues much
greater than 1
condition numbers = maximum
eigenvalue/minimum
eigenvalue.
If condition numbers are between
100 and 1000, there is moderate
to strong collinearity
condition index = $\sqrt{k}$
where k = condition number
If Condition index > 30 then there is strong collinearity
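These collinearity statistics can be requested directly on the MODEL statement of PROC REG. The sketch below reuses the data set and variable names from the regression syntax slide (regdat, y, x1 to x3) as stand-ins for your own data.

```sas
/* Collinearity diagnostics sketch; regdat, y, and x1-x3 are placeholder names. */
proc reg data=regdat;
   model y = x1 x2 x3 / tol vif collin;
   /* TOL    prints the tolerance of each independent variable */
   /* VIF    prints the variance inflation factors             */
   /* COLLIN prints eigenvalues and condition indices          */
run;
quit;
```

Reading the output against the rules of thumb on these slides, a VIF above 10 or a condition index above 30 would flag a collinearity problem.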
48
Collinearity Diagnostics
49
Outlier Diagnostics
1. Residuals.
1. The actual value minus the predicted
value. This is otherwise known as the
error.
2. Studentized Residuals
1. the residuals divided by their
standard errors without the ith
observation
3. Leverage, called the Hat diag
1. This is the measure of influence of
each observation
4. Cook’s Distance:
1. the change in the statistics that
results from deleting the observation.
Watch this if it is much greater than
1.0.
50
Outlier detection
• Outlier detection involves the
determination of whether the residual
(error = actual – predicted) is an
extreme negative or positive value.
• We may plot the residual versus
the fitted plot to determine which
errors are large, after running the
regression.
• The command syntax was already
demonstrated with the graph on
page 16: rvfplot, border yline(0)
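In SAS, a residual-versus-fitted plot can be sketched as follows. It assumes the residuals and predicted values were saved as resid and pred in a data set named resdat, as in the OUTPUT statement of the regression syntax slide.

```sas
/* Residual-versus-fitted plot sketch; resdat, resid, and pred are assumed */
/* to come from an earlier PROC REG OUTPUT statement.                     */
proc gplot data=resdat;
   plot resid*pred / vref=0;   /* horizontal reference line at zero */
   title 'Residuals versus Fitted Values';
run;
quit;
```

Points far above or below the zero line are the large positive or negative residuals to inspect.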
51
Create Standardized
Residuals
• A standardized residual is one
divided by its standard deviation.
$$resid_{standardized} = \frac{y_i - \hat{y}_i}{s}$$
where s = std dev of residuals
52
Limits of Standardized
Residuals
If the standardized residuals
have values in excess of 3.5
or below -3.5, they are outliers.
If the absolute values are less
than 3.5, as these are, then
there are no outliers
While outliers by themselves
only distort mean prediction
when the sample size is small
enough, it is important to
gauge the influence of outliers.
53
Outlier Influence
• Suppose we had a different
data set with two outliers.
• We tabulate the standardized
residuals and obtain the
following output:
54
Outlier a does not distort
and outlier b does.
55
Studentized Residuals
• Alternatively, we could form
studentized residuals. These are
distributed as a t distribution with
df = n - p - 1, though they are not
quite independent. Therefore, we
can approximately determine if
they are statistically significant or
not.
• Belsley et al. (1980)
recommended the use of
studentized residuals.
56
Studentized Residual
$$e_i^{s} = \frac{e_i}{\sqrt{s_{(i)}^2\,(1 - h_i)}}$$
where
$e_i^{s}$ = studentized residual
$s_{(i)}$ = standard deviation where ith obs is deleted
$h_i$ = leverage statistic
These are useful in estimating the statistical significance
of a particular observation, of which a dummy variable
indicator is formed. The t value of the studentized residual
will indicate whether or not that observation is a significant
outlier.
In SAS, studentized residuals are saved with the RSTUDENT= keyword on the OUTPUT statement of PROC REG, for example:
output out=resdat rstudent=rstud;
57
Influence of Outliers
1. Leverage is measured by the
diagonal components of the hat
matrix.
2. The hat matrix comes from the
formula for the regression of Y.
$$\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$$
where $X(X'X)^{-1}X'$ = the hat matrix, H.
Therefore,
$$\hat{Y} = HY$$
58
Leverage and the Hat
matrix
1. The hat matrix transforms Y into the
predicted scores.
2. The diagonals of the hat matrix indicate
which values will be outliers or not.
3. The diagonals are therefore measures of
leverage.
4. Leverage is bounded by two limits: 1/n and
1. The closer the leverage is to unity, the
more leverage the value has.
5. The trace of the hat matrix = the number of
variables in the model.
6. When the leverage > 2p/n then there is high
leverage according to Belsley et al. (1980)
cited in Long, J.F. Modern Methods of
Data Analysis (p.262). For smaller samples,
Velleman and Welsch (1981) suggested that
3p/n is the criterion.
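For a regression with one independent variable, the diagonal of the hat matrix has a simple closed form, which makes the bounds and the trace property easy to verify by hand. The example below uses hypothetical values X = 1, 2, 3, 4, 5 (n = 5), with p = 2 parameters (intercept and slope).

```latex
\begin{align*}
h_i &= \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2},
  \qquad \bar{x} = 3, \quad \sum_{j}(x_j - \bar{x})^2 = 10 \\
h_1 = h_5 &= \tfrac{1}{5} + \tfrac{4}{10} = 0.6, \qquad
h_2 = h_4 = \tfrac{1}{5} + \tfrac{1}{10} = 0.3, \qquad
h_3 = \tfrac{1}{5} + 0 = 0.2 \\
\sum_{i=1}^{n} h_i &= 0.6 + 0.3 + 0.2 + 0.3 + 0.6 = 2 = p \\
\frac{2p}{n} &= \frac{4}{5} = 0.8
\end{align*}
```

Every leverage lies between 1/n = 0.2 and 1, the leverages sum to the number of parameters, and no case exceeds the 2p/n cutoff in this illustration.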
59
Cook’s D
1. Another measure of influence.
2. This is a popular one. The
formula for it is:
$$\text{Cook's } D_i = \left(\frac{1}{p}\right)\left(\frac{h_i}{1-h_i}\right)\left(\frac{e_i^2}{s^2\,(1-h_i)}\right)$$
Cook and Weisberg (1982) suggested that values of
D that exceeded 50% of the F distribution (df = p, n - p)
are large.
60
Using Cook’s D in SAS
• Cook is the option /R
• Finding the influential outliers
• List cook, if cook > 4/n
• Belsley suggests 4/(n - k - 1) as a cutoff
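One way to list the cases above the 4/n cutoff is sketched below. The data set and variable names (regdat, y, x1 to x3) follow the regression syntax slide; infl, flagged, and cook are hypothetical names chosen for this example.

```sas
/* Sketch: save Cook's D and list cases above 4/n. */
proc reg data=regdat noprint;
   model y = x1 x2 x3;
   output out=infl cookd=cook;   /* COOKD= saves Cook's D for each case */
run;
quit;

data flagged;
   set infl nobs=n;              /* NOBS= stores the number of observations in n */
   if cook > 4/n;                /* keep only cases above the 4/n cutoff */
run;

proc print data=flagged;
   title 'Cases with Cook''s D above 4/n';
run;
```

To apply the alternative cutoff instead, replace 4/n with 4/(n - k - 1), where k is the number of independent variables.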
61
Graphical Exploration of
Outlier Influence
The two influential outliers can be found easily here
in the upper right.
One can plot the leverage against the standardized
residual to see if the outlier is problematic
62
DFbeta
• One can use the DFbetas to
ascertain the magnitude of
influence that an observation has
on a particular parameter estimate
if that observation is deleted.
$$DFbeta_j = \frac{(b_j - b_{(i)j})\,u_j}{\sqrt{\sum u_j^2\,(1 - h_j)}}$$
where $u_j$ = residuals of
regression of $x_j$ on remaining xs.
63
Alternatives to Violations
of Assumptions
• 1. Nonlinearity: Transform to
linearity if there is nonlinearity or
run a nonlinear regression
• 2. Nonnormality: Run a least
absolute deviations regression or a
median regression
• 3. Heteroskedasticity: weighted
least squares regression or White
estimator. One can use Proc
Robustreg to downweight
outlier effects in the estimation.
• 4. Autocorrelation: Newey-West
estimators or autoregression model
• 5. Multicollinearity: components
regression or ridge regression or
proxy variables
64
The Interaction model
• The product vector x1*x2 is an
interaction term.
• This is the joint effect over and above
the main effects (x1 and x2).
• The main effects must be in the model
for the interaction to be properly
specified, regardless of whether the
main effects remain statistically
significant.
$$Y_i = a + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + e_i$$
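One consequence of the interaction model is that the effect of x1 on Y is no longer a single number: it depends on the level of x2. A short sketch, with coefficient values (b1 = 2, b3 = 0.5) that are hypothetical and chosen only for illustration:

```latex
\begin{align*}
\frac{\partial Y}{\partial x_1} &= b_1 + b_3 x_2 \\
\text{at } x_2 = 0 &: \quad 2 + (0.5)(0) = 2 \\
\text{at } x_2 = 4 &: \quad 2 + (0.5)(4) = 4
\end{align*}
```

This is why b1 alone should be read as the effect of x1 when x2 equals zero, not as an overall main effect.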
65
Path Diagram of an
Interaction model
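[Figure: path diagram titled "Interaction Analysis" in which X1 (path A), X2 (path B), and the product X1*X2 (path C) point to Y, along with an error term]
Y = K + AX1 + BX2 + CX1*X2 + Error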
66
SAS Regression
Interaction Model Syntax
data one;
input y x1 x2 x3 x4;
lincrossprodx1x2=x1*x2;
x1sq=x1*x1;
time+1;
datalines;
112 113 114 39 10
322 230 310 43 23
323 340 250 33 33
112 122 125 144 45
99 100 89 55 34
14 13 10 249 40
40 34 98 39 30
30 32 34 40 40
90 80 93 50 50
89 90 91 60 44
120 130 43 100 34
444 432 430 20 44
;
proc print;
run;
proc reg;
model y= x1 x2 lincrossprodx1x2/r collin stb spec influence;
output out=resdat r=resid p=pred stdr=rstd student=rstud cookd=cooksd;
run;
67
SAS Regression
Diagnostic syntax
data outck;
set resdat;
degfree=7;
if cooksd > 4/7; /* if cd > p/(n-p-1)
where p = # parms, n=sample size */
proc freq; tables rstud;
title 'Outlier Indicator';
run;
axis1 label=(a=90 'Cooks D Influence Stat');
proc gplot data=resdat;
plot cooksd * rstud/vaxis=axis1;
title 'Leverage and the Outlier';
run;
68
Regression Model test of
linear interaction
69
SAS Polynomial
Regression Syntax
proc reg data=one;
model y= x1 x1sq;
title 'The Polynomial Regression';
run;
70
Model Building Strategies
• Specific to General: Cohen
and Cohen
• General to Specific: Hendry
and Richard
• Extreme Bounds analysis: E.
Leamer.
• F. Harrell, Jr. approach:
Regression Modeling
Strategies (Springer, 2001).
Loess, Splines for model fitting, data
reduction, missing data imputation,
validation & simplification.
71
Goodness of Fit indices
• R^2 = 1 – SSE/SST
– Explained Variance divided by
total variance
• AIC = -2LL + 2p
SC = -2LL + p log(n)
• Nested models can be
compared on the basis of
these criteria to determine
which model is best
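To illustrate how the two information criteria penalize model size, suppose a model has log likelihood LL = -50 with p = 3 parameters fitted to n = 100 cases (hypothetical values, with log taken as the natural logarithm):

```latex
\begin{align*}
AIC &= -2LL + 2p = -2(-50) + 2(3) = 106 \\
SC &= -2LL + p\log(n) = -2(-50) + 3\log(100) \approx 100 + 13.82 = 113.82
\end{align*}
```

Smaller values indicate the preferred model. SC penalizes each added parameter more heavily than AIC whenever log(n) exceeds 2, that is, for n of 8 or more.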
72
Robust Regression
1. This procedure permits 4 types of
robust regression: M, least
trimmed squares, S, and MM
estimation
2. These methods downweight the
outliers.
3. M estimation, introduced by Huber, is
performed with
proc robustreg data=stack;
model y = x1 x2 x3 / diagnostics
leverage; id x1; test x3; run;
4. Estimation is done by IRLS.
Outline
1. 2. 3. Conceptualization of Regression Analysis Plotting the data Linear Regression Theory
1. 2. 3. 4. 5. 6. 7. Derivation of the intercept Derivation of the slope Multiple linear Regression Significance tests Assumptions Diagnostic Tests of the assumptions Assessment of robustness of model Conceptualization Path diagram Program syntax
4.
Interaction model
1. 2. 3.
5. 6. 7.
Note on polynomial regression Model Building Strategies Robust Alternatives
1. 2. 3. WLS White Estimators Median Regression Proc Robustreg
2
Regression Analysis
Have a clear notion of what you can and cannot do with regression analysis • Conceptualization
– A Path Model of a Regression Analysis
Path Diagram of A Linear Regression Analysis
X1
YY
X2
error
x3
Yi k b1 x1 b2 x2 b3 x3 ei
3
then a unit change in x is associated with a small change in Y. 4 . or large. Y increases).Hypothesis Testing • For example: hypothesis 1 : X is statistically significantly related to Y. medium. – The magnitude of the relationship is small. – The relationship is positive (as X increases. If the magnitude is small. Y increases) or negative (as X decreases.
Plotting the Data: Testing for Functional form 1. then perhaps linear regression is not appropriate 5 . 5. The relationship between the Y and each X needs to be examined. If so. If not. 2. If it is not linear. then that may be the easiest recourse. One needs to ascertain whether it is linear or not 3. can it be transformed into linearity? 4.
If we are using high resolution graphics. it may be transformed into a straight line. 1. 5. This functional form estimation procedure on the relationship to see which functional form provides the best fit. a. then the distribution of data points on the graph will have the approximate appearance of a straight line. 6. It is necessary to plot the dependent variable against each of the independent variables. PLOT Y*X.Explore the relationship I. we may have much more control over the appearance of the output. We may plot the data with a plot procedure. we may show example of employing the curve estimation with a polynomial of order three. PLOT Y*X. RUN: b. To demonstrate how this procedure works with data provided. The purpose is to determine the functional form of the relationship. 4. If it is some sort of curve. RUN: 6 . the distribution can be approximated by a straight line. If it can. He may refer to the R2 statistic for each functional form and select that with the maximum. called proc gplot. To ascertain what the best functional form of these individual relationships is. The command syntax for PROC GPLOT looks like the following: PROC GPLOT. The command syntax for PROC PLOT follows: PROC PLOT. then this transformation of the original variable may be necessary. The analystmay a test a number of functional forms to see which produces the best R square. 2 If the distribution has a curved form. For the most part. called proc plot. the analyst may wish to run the segmented regression over portions of each independent variable. The real data are distributed as a cubic. There may be some dispersion about the line. Exploring the functional form of the relationship. 3. the question arises as to whether actual functional form approximating the distribution is one of a curve or not. If the relationship is linear.
run. end. proc plot.Graphical Exploration We examine the plot to look for functional form. output. do x = 1 to 30. 7 . and other peculiarities. y = x**3. outlier patterns. plot y*x. The generation of the data is done with a do loop: data plot1.
X.Graphical Plot • The output from this plot appears in figure 1: Figure 1 A plot of a nonlinear relationship between the dependent variable Y. and an independent variable. 8 .
3. 9 . Curve Fitting 2. The higher the r square. 1. The program to set up the curve fitting may be found in figure 3 below. From a regression analysis of the relationship. What is done here is that the data are subjected to a number of tests of functional form. an r square is generated. 1b. Application of the data to such a process presumes either a linear relationship or a relationship that can be transformed into a linear one. This is the square of the multiple correlation coefficient between the dependent variable and the independent variables in the model. This r square is the proportion of variance of the dependent variable explained by the functional form. the closer the functional form is to the actual relationship inherent in the data.Testing Functional Form with Curve Fitting 1.
sshapex = exp( a + x**(1)) . Then run different models. model y = quadx. compound = a*b**x. For example. power = a*x**b. lnx = log(x). liny = a + x. Proc print. output. run. quady = a + x**2. End. run.We may fit functional forms Do x = 1 to 100. invx = 1/x. cubicy = a + x**3. Use the transformation that yields the highest R2 10 . growth = exp(a + b*x). expx = exp(x). Proc Reg.
Command syntax for Curve Fitting The SAS programming syntax to set up the test of the data against the functional forms may be found in figure 4. 11 .2.
12 . The nice aspect of curve fitting is that it provides a clear objective criterion against which to assess functional form. so it is not surprising that the coefficient of determination (R 2 ) for that will be 1. Var _model_ p rsq. The cubic functional form is the third power of the function. Produces the output in figure 5 on the next page. Run. While all three forms have significant components. it can be seen that the R 2 approaches 1.00 and we have identified the functional form of the regression analysis.Curve Fitting Output & Interpretation Output and interpretation The proc print label data =bbb.00 only with the cubic and power functional forms. The regression coefficient for this curve equals 1.00 also.
13 .
Nonlinear regression for nonlinearity 14 . Robustness of the Model in face of their violation I. in the event of apparent nonlinearity. Linear Regression analysis A. one may obtain a better idea as to which transformation to apply in order to render a nonlinear relationship amenable to linear regression. Multivariate Case: a. Derivation of the coefficients: a and b 1. Testing those assumptions H. Weighted Least Squares estimation for heteroskedasticity 2. b1 and b2 D.Derviation of the SS and Anova C. The Assumptions of Regression Analysis G. Bivariate Case 2.The significance tests E. The General Linear Model B. Autoregression for autocorrelation 3. Fixes: 1.The Prediction interval and its derivation F.With the aid of this tool.
Nonparametric regression for bivariate cases where assumptions don’t Hold 5. Logistic or Probit regression for dichotomous dependent variables 6. Poisson Regression for Count or Rare event data 15 .Fixescontinued 4.
the regression effect. The main difference is the Levels of measurement of the independent variables. The General Linear Model Includes regression. Ancova: IVs are both discrete and continuous B. we obtain the MS SS total/n 1 = SS regression/k + SS error/n. and ancova. Regression: IVs are continuous 2. Anova: IVs are discrete 3.The General Linear Model A. 1. we may geometrically represent the error. anova.1 Where n = sample size and k = # independent vars in model MS total = MS regression + MS error Since MS = variance Variance total = Regression Variance + error variance We divide both sides by the total variance to obtain 1 = R square regression + R square error F = Regression Variance Error Variance 16 . Decomposition of the Sums of squares If we look at the graph of the regression line.k . The graphical representation below depicts the sums of squares decomposition of the equation : SS total = SS regression + SS error If we divide both sides of the equation by the Respective df. and the total effect.
Decomposition of Effects yi y Total Effect Y Y Yi ˆ y a bx ˆ yi yi error ˆ y y regression effect X X 17 .
Decomposition of the sum of squares ˆ ˆ Y Y Y Y Y Y total effect error effects regression (model ) effect ˆ ˆ Yi Y Yi Yi Yi Y per case i ˆ ˆ (Y Y ) 2 (Y Y ) 2 (Y Y ) 2 per case i i i i i (Y Y ) i 1 i n 2 (Yˆ Y ) i 1 i i n 2 (Yˆ Y ) i 1 i n 2 for data set 18 .
Decomposition of the sum of squares • Total SS = model SS + error SS and if we divide by df (Yi Y ) 2 i 1 n n1 (Yi Yi ) 2 ˆ i 1 n nk 1 (Yi Y ) 2 ˆ i 1 n k • This yields the Variance Decomposition: We have the total variance= model variance + error variance 19 .
F test for significance and R2 for magnitude of effect • R2 = Model var/total var (Yi Y ) 2 ˆ i 1 n R2 k n (Y Y ) 2 ˆ i 1 i i nk 1 •F test for model significance = Model Var/Error Var F( k .n k 1) R k 1 R2 nk 1 20 2 .
ANOVA tests the significance of the Regression Model 21 .
Derivation of the Intercept y a bx e e y a bx n n n n e i 1 i y a i 1 i i 1 i b xi i 1 n Because by definition ei 0 i 1 0 n y a i 1 n n n i i 1 i b xi i 1 n ai yi b xi i 1 i 1 i 1 n na yi b xi i 1 i 1 n n a y bx 22 .
Derivation of the Regression Coefficient
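Given:
$$y_i = a + b x_i + e_i$$
$$e_i = y_i - a - b x_i$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
With x and y expressed as deviations from their means, a = 0, and differentiating with respect to b gives
$$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = \frac{\partial \sum_{i=1}^{n} (y_i - b x_i)^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i x_i$$
Setting the derivative to zero:
$$0 = -2 \sum_{i=1}^{n} x_i y_i + 2b \sum_{i=1}^{n} x_i x_i$$
$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$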
23
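The two derivations reduce to closed-form arithmetic. A minimal Python sketch (not SAS), on made-up data, computes the slope from deviation scores and the intercept from the means, then confirms that the residuals sum to zero as the intercept derivation assumes.

```python
# Least squares slope and intercept from the closed-form results.
# The data are illustrative only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation scores: x_i - x_bar and y_i - y_bar.
dx = [xi - x_bar for xi in x]
dy = [yi - y_bar for yi in y]

b = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)  # slope
a = y_bar - b * x_bar                                            # intercept

residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
print("a =", a, "b =", b, "sum of residuals =", sum(residuals))
```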
The Multiple Regression Equation
• We proceed to the derivation of its components:
– The intercept: a
– The regression parameters, b1 and b2
$$Y_i = a + b_1 x_1 + b_2 x_2 + e_i$$
24
The Multiple Regression Formula
• If we recall that the formula for the correlation coefficient can be expressed as follows:
25
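$$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \, \sum_{i=1}^{n} y_i^2}}$$
where x and y are deviations from their means, $x = x_i - \bar{x}$ and $y = y_i - \bar{y}$, and
$$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
from which it can be seen that the regression coefficient b is a function of r:
$$b_j = r \cdot \frac{sd_y}{sd_x}$$
26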
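With two predictors, substitute the partial (standardized) coefficient for r:
$$b_{yx_1 . x_2} = \frac{r_{yx_1} - r_{yx_2} r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_1}} \qquad (6)$$
$$b_{yx_2 . x_1} = \frac{r_{yx_2} - r_{yx_1} r_{x_1 x_2}}{1 - r^2_{x_1 x_2}} \cdot \frac{sd_y}{sd_{x_2}} \qquad (7)$$
It is also easy to extend the bivariate intercept to the multivariate case as follows.
$$a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 \qquad (8)$$
27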
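Equations (6) through (8) can be tried out directly. The sketch below is Python rather than SAS and uses made-up data; the helper names `mean`, `sd`, and `corr` are introduced here for illustration. It builds both slopes from the three pairwise correlations and the standard deviations, and the resulting residuals satisfy the least squares conditions (they sum to zero and are uncorrelated with each predictor).

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def corr(u, v):
    mu, mv = mean(u), mean(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

# Illustrative data only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.0, 4.0, 8.0, 7.0, 12.0, 10.0]

r_y1, r_y2, r_12 = corr(y, x1), corr(y, x2), corr(x1, x2)

b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * sd(y) / sd(x1)  # equation (6)
b2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2) * sd(y) / sd(x2)  # equation (7)
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)                   # equation (8)

residuals = [yi - a - b1 * u - b2 * v for yi, u, v in zip(y, x1, x2)]
print("a =", a, "b1 =", b1, "b2 =", b2)
```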
Significance Tests for the Regression Coefficients
1. We find the significance of the parameter estimates by using the F or t test.
2. The R2 is the proportion of variance explained.
3. Adjusted R2:
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$$
28
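F and T tests for significance for overall model
$$F = \frac{\text{Model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2)/(n-p-1)}$$
where p = number of parameters (the predictors, not counting the intercept) and n = sample size.
In the bivariate case,
$$t = \sqrt{F} = \sqrt{\frac{(n-2)\, r^2}{1 - r^2}}$$
29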
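These formulas are simple enough to wrap in small helper functions. The sketch below is Python, not SAS, and the function names are hypothetical. The example values (R2 = 0.6, n = 5, one predictor) are made up for illustration.

```python
from math import sqrt

def adjusted_r2(r2, n, p):
    """Adjusted R-square for n cases and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F with (p, n - p - 1) degrees of freedom."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

def t_from_r(r, n):
    """t statistic for a bivariate correlation r."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r2, n, p = 0.6, 5, 1
adj = adjusted_r2(r2, n, p)
f_value = f_statistic(r2, n, p)
t_value = t_from_r(sqrt(r2), n)

print(adj, f_value, t_value)
```

With a single predictor, the squared t equals the overall F, which is why the slide writes t as the square root of F.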
Significance tests
• If we are using a type II sum of squares, we are dealing with the ballantine.
[Figure: ballantine diagram with areas a and b]
DV Variance explained = a + b
30
Significance tests
T tests for statistical significance
$$t = \frac{a - 0}{se_a}$$
$$t = \frac{b - 0}{se_b}$$
31
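Significance tests
Standard error of intercept
$$SE_a = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)}$$
Standard error of regression coefficient
$$SE_b = \frac{\hat{\sigma}}{\sqrt{\sum (x_i - \bar{x})^2}}$$
where $\hat{\sigma}$ = std dev of residual, with
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$
32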
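As a worked illustration of the standard errors and t tests for a bivariate regression, the Python sketch below (not SAS) uses the textbook formulas with n - 2 residual degrees of freedom. The data are made up.

```python
from math import sqrt

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = sqrt(sse / (n - 2))            # std dev of the residuals

se_b = sigma_hat / sqrt(sxx)               # standard error of the slope
se_a = sigma_hat * sqrt(1 / n + x_bar ** 2 / sxx)  # standard error of the intercept

t_a = (a - 0) / se_a
t_b = (b - 0) / se_b
print("se_a =", se_a, "se_b =", se_b, "t_a =", t_a, "t_b =", t_b)
```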
SAS Regression Command Syntax
33
SAS Regression syntax
Proc reg simple data=regdat;
model y = x1 x2 x3 /spec dw corrb collin r dffits influence;
output out=resdat p=pred r=resid rstudent=rstud;
run;
Data check;
set resdat;
Proc univariate normal plot;
var resid;
Title 'Check of Normality of Residuals';
Run;
Proc Freq data=resdat;
tables rstud;
Title 'Check for Outliers';
Run;
Proc ARIMA data=resdat;
identify var=resid;
Title 'Check of Autocorrelation of the errors';
Run;
34
SAS Regression Output & Interpretation
35
More Simple Statistics
36
Omnibus ANOVA Statistics
37
Parameter Estimates
38
Assumptions of the Linear Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
39
Explanation of the Assumptions
1. Linear Functional form
– A linear model does not detect curvilinear relationships
2. Independent observations
– Representative samples
– Autocorrelation inflates the t and r and F statistics and warps the significance tests
3. Normality of the residuals
– Permits proper significance testing
4. Equality of variance
– Heteroskedasticity precludes generalization and external validity
– This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, then they may seriously bias the parameter estimates.
40
Diagnostic Tests for the Regression Assumptions
1. Linearity tests: Regression curve fitting
– No level shifts: One regime
2. Independence of observations: Runs test
3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's General Specification test
5. No autocorrelation of residuals: Durbin-Watson or ACF or PACF of residuals
6. Multicollinearity: Correlation matrix of the independent variables. Condition index or condition number
7. No serious outlier influence: tests of additive outliers: Pulse dummies.
– Plot residuals and look for high leverage of residuals
– Lists of Standardized residuals
– Lists of Studentized residuals
– Cook's distance or leverage statistics
41
Explanation of Diagnostics
1. Plots show linearity or nonlinearity of relationship
2. Correlation matrix shows whether the independent variables are collinear and correlated.
3. Representative sample is done with probability sampling
42
Explanation of Diagnostics
Tests for Normality of the residuals. The residuals are saved and then subjected to either of:
Kolmogorov-Smirnov Test: Tests the limit of the theoretical cumulative normal distribution against your residual distribution.
Shapiro-Wilk Test
Proc reg;
model y = x1 x2;
Output out=resdat r=resid p=pred;
Run;
Data check;
set resdat;
Proc univariate normal plot;
var resid;
Title 'Test of Normality of Residuals';
Run;
43
Test for Homogeneity of variance
White's General Test
Proc reg;
model y = x1 x2/spec;
Run;
$$e_i^2 = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2$$
Chow test
$$F_{chow} = \frac{(R^2_{ur} - R^2_{r})/(p_{ur} - p_{r})}{(1 - R^2_{ur})/(n - p_{ur})}$$
44
Weighted Least Squares
A fix for heteroskedasticity is weighted least squares, or WLS.
There are two ways to do this.
If the variance in y is proportional to x, then form the weight as the reciprocal of that variance, wt = 1/x (for positive x), and run:
Proc reg;
weight wt;
Model y = x;
Run;
45
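To show what the weight statement does to the estimates, here is a minimal Python sketch (not SAS) of weighted least squares for one predictor. The function name `wls` and the data are made up. The weights are taken as 1/x, which assumes, purely for illustration, that the error variance is proportional to x.

```python
def wls(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return y_w - b * x_w, b

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

wt = [1.0 / xi for xi in x]                   # assumed: variance proportional to x
a_wls, b_wls = wls(x, y, wt)
a_ols, b_ols = wls(x, y, [1.0] * len(x))      # equal weights reproduce OLS

print("OLS:", a_ols, b_ols)
print("WLS:", a_wls, b_wls)
```

Only the relative sizes of the weights matter: multiplying every weight by the same constant leaves the estimates unchanged.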
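Collinearity Diagnostics
Tolerance = 1 - R2, where R2 comes from regressing that independent variable on the other independent variables. Small tolerances imply problems.
Variance Inflation Factor (VIF) = 1/Tolerance
Small intercorrelations among indep vars means VIF ≈ 1
VIF > 10 signifies problems
46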
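With only two predictors, the R2 from regressing one on the other is just their squared correlation, so tolerance and VIF can be computed by hand. The Python sketch below (not SAS) uses made-up data and a hypothetical helper `corr`.

```python
from math import sqrt

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

# Illustrative predictors only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

r_squared = corr(x1, x2) ** 2   # R-square of x1 regressed on x2
tolerance = 1 - r_squared
vif = 1 / tolerance

print("tolerance =", tolerance, "VIF =", vif)
```

With more than two predictors the R2 would have to come from a full auxiliary regression of each predictor on all the others.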
More Collinearity Diagnostics
Watch for eigenvalues much greater than 1.
Condition number = maximum eigenvalue/minimum eigenvalue.
If condition numbers are between 100 and 1000, there is moderate to strong collinearity.
$$\text{condition index} = \sqrt{k}, \quad \text{where } k = \text{condition number}$$
If condition index > 30 then there is strong collinearity.
47
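For two predictors the eigenvalues of their correlation matrix are 1 + r and 1 - r, so the condition number and condition index can be sketched without any matrix library. This Python illustration (not SAS) uses made-up data. It works from the plain correlation matrix, which is an assumption of the sketch; SAS computes its collinearity diagnostics from a scaled cross-products matrix, so its printed values need not match.

```python
from math import sqrt

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

# Illustrative predictors only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

r = abs(corr(x1, x2))
# Eigenvalues of the 2 x 2 correlation matrix [[1, r], [r, 1]].
eig_max, eig_min = 1 + r, 1 - r

condition_number = eig_max / eig_min
condition_index = sqrt(condition_number)

print("condition number =", condition_number, "condition index =", condition_index)
```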
Collinearity Diagnostics
48
Outlier Diagnostics
1. Residuals.
– The actual value minus the predicted value. This is otherwise known as the error.
2. Studentized Residuals
– the residuals divided by their standard errors without the ith observation
3. Leverage, called the Hat diag
– This is the measure of influence of each observation
4. Cook's Distance:
– the change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.
49
Outlier detection
• Outlier detection involves the determination whether the residual (error = actual – predicted) is an extreme negative or positive value.
• We may plot the residual versus the fitted plot to determine which errors are large, after running the regression.
• The command syntax (in Stata) was already demonstrated with the graph on page 16:
rvfplot, border yline(0)
50
Create Standardized Residuals
• A standardized residual is a residual divided by its standard deviation.
$$resid_{standardized} = \frac{y_i - \hat{y}_i}{s}$$
where s = std dev of residuals
51
they are outliers.5. it is important to gauge the influence of outliers. then there are no outliers While outliers by themselves only distort mean prediction when the sample size is small enough. as these are. 52 . If the absolute values are less than 3.Limits of Standardized Residuals If the standardized residuals have values in excess of 3.5.5 and 3.
Outlier Influence
• Suppose we had a different data set with two outliers.
• We tabulate the standardized residuals and obtain the following output:
53
Outlier a does not distort and outlier b does.
54
Studentized Residuals
• Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n-p-1, though they are not quite independent. Therefore, we can approximately determine if they are statistically significant or not.
• Belsley et al. (1980) recommended the use of studentized residuals.
55
Studentized Residual
$$e_i^{s} = \frac{e_i}{\sqrt{s_{(i)}^2 \, (1 - h_i)}}$$
where
$e_i^{s}$ = studentized residual
$s_{(i)}$ = standard deviation where the ith obs is deleted
$h_i$ = leverage statistic
These are useful in estimating the statistical significance of a particular observation, of which a dummy variable indicator is formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier.
The command to generate studentized residuals, called rstudt, is (in Stata):
predict rstudt, rstudent
In the SAS syntax shown earlier, the rstudent= keyword of the output statement saves them.
56
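The "ith observation deleted" part of the formula can be made concrete by actually refitting the regression without each case. The Python sketch below (not SAS) does this for a one-predictor model on made-up data. The helper names `fit` and `mse` are hypothetical, and the leverage uses the simple-regression form of the hat diagonal.

```python
from math import sqrt

def fit(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b

def mse(x, y, a, b, p=2):
    """Residual variance with len(x) - p degrees of freedom."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - p)

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 2

a, b = fit(x, y)
e = [yi - a - b * xi for xi, yi in zip(x, y)]
s2 = mse(x, y, a, b)

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]   # leverage (hat diagonal)

s2_deleted, rstudent = [], []
for i in range(n):
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a_i, b_i = fit(x_del, y_del)
    s2_i = mse(x_del, y_del, a_i, b_i)              # variance without case i
    s2_deleted.append(s2_i)
    rstudent.append(e[i] / sqrt(s2_i * (1 - h[i])))

print(rstudent)
```

In practice the deleted variance is obtained from an algebraic shortcut rather than n separate refits; the loop is used here only to make the definition visible.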
Influence of Outliers
1. Leverage is measured by the diagonal components of the hat matrix.
2. The hat matrix comes from the formula for the regression of Y.
$$\hat{Y} = X\hat{\beta} = X (X'X)^{-1} X' Y$$
where $X (X'X)^{-1} X'$ = the hat matrix, H. Therefore,
$$\hat{Y} = HY$$
57
Leverage and the Hat matrix
1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix = the number of parameters p in the model (the independent variables plus the intercept).
6. When the leverage > 2p/n then there is high leverage according to Belsley et al. (1980) cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested that 3p/n is the criterion.
58
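For a model with an intercept and one predictor, the hat diagonal has the closed form 1/n + (x_i - mean)^2 / sum of squared deviations, which makes the properties listed above easy to check. The Python sketch below (not SAS) uses made-up x values with one deliberately extreme case.

```python
# Leverage (hat diagonal) for a one-predictor model with an intercept.
# The x values are illustrative; the last one is deliberately extreme.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
n, p = len(x), 2  # p = parameters (intercept and slope)

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

trace = sum(h)                   # equals p
cutoff = 2 * p / n               # the 2p/n criterion
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]

print("leverage:", h)
print("trace:", trace, "cutoff:", cutoff, "high leverage cases:", high_leverage)
```

Only the last case exceeds 2p/n here, and every leverage value lies between 1/n and 1.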
Cook's D
1. Another measure of influence.
2. This is a popular one. The formula for it is:
$$Cook's\ D_i = \frac{1}{p} \left( \frac{h_i}{1 - h_i} \right) \frac{e_i^2}{s^2 (1 - h_i)}$$
Cook and Weisberg (1982) suggested that values of D that exceeded 50% of the F distribution (df = p, n-p) are large.
59
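Cook's D can be read as the total shift in all fitted values when one case is dropped, scaled by p times the residual variance. The Python sketch below (not SAS) computes it both from the formula and by refitting without each case, on made-up data with one predictor. The helper `fit` is hypothetical.

```python
def fit(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(x), 2

a, b = fit(x, y)
e = [yi - a - b * xi for xi, yi in zip(x, y)]
s2 = sum(ei ** 2 for ei in e) / (n - p)

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Formula from the slide.
cooks_formula = [(1 / p) * (hi / (1 - hi)) * ei ** 2 / (s2 * (1 - hi))
                 for hi, ei in zip(h, e)]

# Same quantity from the deletion definition.
cooks_deletion = []
for i in range(n):
    a_i, b_i = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum(((a + b * xj) - (a_i + b_i * xj)) ** 2 for xj in x)
    cooks_deletion.append(shift / (p * s2))

print(cooks_formula)
```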
Using Cook's D in SAS
• Cook's D is requested with the /r option on the model statement
• Finding the influential outliers:
• List cook, if cook > 4/n
• Belsley suggests 4/(n-k-1) as a cutoff
60
Graphical Exploration of Outlier Influence
One can plot the leverage against the standardized residual to see if the outlier is problematic.
The two influential outliers can be found easily here in the upper right.
61
DFbeta
• One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.
$$DFbeta_j = b_j - b_{(i)j} = \frac{u_{ij}\, e_i}{\sum u_j^2 \,(1 - h_i)}$$
where $u_j$ = residuals of the regression of $x_j$ on the remaining xs, $e_i$ = residual of observation i, and $h_i$ = its leverage.
62
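For the slope of a one-predictor model, the "residuals of x on the remaining xs" are just the deviations of x from its mean, so the change in the slope from deleting case i can be compared with an actual refit. The Python sketch below (not SAS) assumes the unscaled form u_i * e_i / (sum of u squared * (1 - h_i)) and made-up data. The helper `fit` is hypothetical.

```python
def fit(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

a, b = fit(x, y)
e = [yi - a - b * xi for xi, yi in zip(x, y)]

x_bar = sum(x) / n
u = [xi - x_bar for xi in x]          # residuals of x regressed on the constant
sum_u2 = sum(ui ** 2 for ui in u)
h = [1 / n + ui ** 2 / sum_u2 for ui in u]

# Change in the slope from the closed-form expression.
dfbeta_formula = [ui * ei / (sum_u2 * (1 - hi)) for ui, ei, hi in zip(u, e, h)]

# Change in the slope from refitting without each case.
dfbeta_deletion = [b - fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])[1]
                   for i in range(n)]

print(dfbeta_formula)
```

Software usually reports a scaled version of this difference (divided by a standard error), so printed DFBETAS values will generally differ from these raw changes.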
Alternatives to Violations of Assumptions
• 1. Nonlinearity: Transform to linearity if there is nonlinearity or run a nonlinear regression
• 2. Nonnormality: Run a least absolute deviations regression or a median regression
• 3. Heteroskedasticity: weighted least squares regression or White estimator. One can use Proc Robustreg to obtain a downweighted outlier effect in the estimation.
• 4. Autocorrelation: Newey-West estimators or autoregression model
• 5. Multicollinearity: components regression or ridge regression or proxy variables
63
The Interaction model
$$Y_i = a + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + e_i$$
• The product vector x1*x2 is an interaction term.
• This is the joint effect over and above the main effects (x1 and x2).
• The main effects must be in the model for the interaction to be properly specified, regardless of whether the main effects remain statistically significant.
64
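One consequence of the interaction term is that the slope of x1 is no longer a single number: it equals b1 + b3*x2. The Python sketch below (not SAS) fits the interaction model by solving the normal equations on made-up, noise-free data generated from known coefficients (1, 2, 3, 0.5), so the fit should recover them. The helper names `solve` and `slope_of_x1` are hypothetical.

```python
def solve(A, c):
    """Solve A b = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= f * M[col][j]
    b = [0.0] * n
    for i in range(n - 1, -1, -1):
        b[i] = (M[i][n] - sum(M[i][j] * b[j] for j in range(i + 1, n))) / M[i][i]
    return b

# Illustrative grid of predictor values and a noise-free response.
rows = [(float(v1), float(v2)) for v1 in range(4) for v2 in range(3)]
X = [[1.0, v1, v2, v1 * v2] for v1, v2 in rows]   # last column is the product term
y = [1.0 + 2.0 * v1 + 3.0 * v2 + 0.5 * v1 * v2 for v1, v2 in rows]

# Normal equations: (X'X) b = X'y.
XtX = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]
a, b1, b2, b3 = solve(XtX, Xty)

def slope_of_x1(x2_value):
    """Conditional effect of x1 at a given value of x2."""
    return b1 + b3 * x2_value

print(a, b1, b2, b3, slope_of_x1(0.0), slope_of_x1(2.0))
```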
Path Diagram of an Interaction model
[Path diagram, "Interaction Analysis": X1 (path A), X2 (path B), and their interaction (path C) lead to Y, along with an error term.]
Y = K + A*X1 + B*X2 + C*X1*X2
65
SAS Regression Interaction Model Syntax
data one;
input y x1 x2 x3 x4;
lincrossprodx1x2=x1*x2;
x1sq=x1*x1;
time+1;
datalines;
112 113 114 39 10
322 230 310 43 23
323 340 250 33 33
112 122 125 144 45
99 100 89 55 34
14 13 10 249 40
40 34 98 39 30
30 32 34 40 40
90 80 93 50 50
89 90 91 60 44
120 130 43 100 34
444 432 430 20 44
;
proc print;
run;
proc reg;
model y= x1 x2 lincrossprodx1x2/r collin stb spec influence;
output out=resdat r=resid p=pred stdr=rstd student=rstud cookd=cooksd;
run;
66
SAS Regression Diagnostic syntax
data outck;
set resdat;
degfree=7;
if cooksd > 4/7; /* if cd > p/(n-p-1) where p = # parms, n=sample size */
proc freq;
tables rstud;
title 'Outlier Indicator';
run;
axis1 label=(a=90 'Cooks D Influence Stat');
proc gplot data=resdat;
plot cooksd * rstud/vaxis=axis1;
title 'Leverage and the Outlier';
run;
67
Regression Model test of linear interaction
68
SAS Polynomial Regression Syntax
proc reg data=one;
model y= x1 x1sq;
title 'The Polynomial Regression';
run;
69
Model Building Strategies
• Specific to General: Cohen and Cohen
• General to Specific: Hendry and Richard
• Extreme Bounds analysis: E. Leamer
• F. Harrell, Jr. approach: Regression Modeling Strategies (Springer, 2001). Loess, splines for model fitting, data reduction, missing data imputation, validation & simplification.
70
Goodness of Fit indices
• R2 = 1 - SSE/SST
– Explained variance divided by total variance
• AIC = -2LL + 2p
• SC = -2LL + p log(n)
• Nested models can be compared on the basis of these criteria to determine which model is best
71
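To see how these criteria trade fit against the number of parameters, the Python sketch below (not SAS) evaluates them for a regression with normal errors, where -2LL at the maximum likelihood estimates is n*(log(2*pi) + log(SSE/n) + 1). The function name `fit_criteria` and the two SSE values are made up, and whether p should also count the error variance is a convention this sketch does not settle.

```python
from math import log, pi

def fit_criteria(sse, n, p):
    """Return (-2LL, AIC, SC) for a normal-errors regression."""
    neg2ll = n * (log(2 * pi) + log(sse / n) + 1)
    aic = neg2ll + 2 * p
    sc = neg2ll + p * log(n)
    return neg2ll, aic, sc

# Illustrative comparison: intercept-only model versus one added predictor.
neg2ll_null, aic_null, sc_null = fit_criteria(sse=6.0, n=5, p=1)
neg2ll_full, aic_full, sc_full = fit_criteria(sse=2.4, n=5, p=2)

print("null:", aic_null, sc_null)
print("full:", aic_full, sc_full)
```

Smaller values are better. Software may drop the constant terms from -2LL, which changes the printed values but not the ranking of models fit to the same data.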
Robust Regression
1. This procedure permits 4 types of robust regression: M, least trimmed squares, S, and MM estimation.
2. These methods downweight the outliers.
3. Estimation is done by IRLS.
4. M (median absolute deviation) estimation used by Huber is performed with:
proc robustreg data=stack;
model y = x1 x2 x3 / diagnostics leverage;
id x1;
test x3;
run;
72
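The idea of downweighting by IRLS can be sketched in a few lines. The Python code below (not SAS) is a simplified illustration of Huber-type M estimation for one predictor, and it is not a reproduction of what Proc Robustreg does internally. The tuning constant 1.345, the MAD-based scale, the fixed iteration count, and the data (an exact line y = 2 + 3x with one contaminated point) are all assumptions made for the example.

```python
from statistics import median

def wls_fit(x, y, w):
    """Weighted least squares intercept and slope for one predictor."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x)))
    return y_w - b * x_w, b

def huber_irls(x, y, c=1.345, iterations=50):
    """Iteratively reweighted least squares with Huber weights."""
    a, b = wls_fit(x, y, [1.0] * len(x))          # start from OLS
    for _ in range(iterations):
        resid = [yi - a - b * xi for xi, yi in zip(x, y)]
        med = median(resid)
        scale = median([abs(r - med) for r in resid]) / 0.6745   # MAD scale
        if scale == 0:
            break
        # Weight 1 for small residuals, shrinking weight for large ones.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
        a, b = wls_fit(x, y, w)
    return a, b

# Illustrative data: an exact line with one outlier in the last case.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[9] += 50.0

a_ols, b_ols = wls_fit(x, y, [1.0] * len(x))
a_rob, b_rob = huber_irls(x, y)

print("OLS slope:", b_ols, "robust slope:", b_rob)
```

The outlier pulls the OLS slope well above 3, while the reweighted fit gives that case progressively less weight and moves the slope back toward the line followed by the other nine points.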