
Multiple Regression

Definition

A linear relationship between a dependent variable y and two or more independent variables (x1, x2, x3, ..., xk)

Model:
y = β0 + β1x1 + β2x2 + ... + βkxk + e

Multiple Regression Equation

ŷ = b0 + b1x1 + b2x2 + ... + bkxk
(General form of the estimated multiple regression equation)

Notation
n = sample size
k = number of independent variables
ŷ = predicted value of the dependent variable y
x1, x2, x3, ..., xk are the independent variables
β0 = the y-intercept, or the value of y when all of the predictor variables are 0
b0 = estimate of β0 based on the sample data
β1, β2, β3, ..., βk are the population coefficients of the independent variables x1, x2, x3, ..., xk
b1, b2, b3, ..., bk are the sample estimates of the coefficients β1, β2, β3, ..., βk

[Scatterplots of crime rate against Percent High School (R = .47) and against Urbanization (R = .68)]
Overall Regression Analysis: A Test of the Multiple R (R = .686)

Analysis of Variance

Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             2   24732            12366         28.54     <.0001
Error            64   27730            433.28847
Corrected Total  66   52462

Root MSE        20.81558   R-Square   0.4714
Dependent Mean  52.40299   Adj R-Sq   0.4549
Coeff Var       39.72213

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   59.11807             28.36531          2.08     0.0411
hs           1   -0.58338             0.47246          -1.23     0.2214
Urb          1   0.68250              0.12321           5.54     <.0001

Notice that the slope of hs is negative.
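As a worked illustration of how the estimated equation is used, take the first record of the crime data listed later in these notes (crate = 104, hs = 82.7, Urb = 73.2):

ŷ = 59.11807 − 0.58338(82.7) + 0.68250(73.2) ≈ 59.12 − 48.25 + 49.96 ≈ 60.8

This county's observed crime rate of 104 exceeds its predicted rate by about 43 points; that difference is its residual.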

Parameter Estimates (model: crate = hs)

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   -50.85690            24.45065         -2.08     0.0415
hs           1   1.48598              0.34908           4.26     <.0001

When hs is by itself the slope is positive.

Pearson Correlation Coefficients, N = 67
Prob > |r| under H0: Rho=0

         crate     incom     hs        Urb
crate    1.00000   0.43375   0.46691   0.67737
                   0.0002    <.0001    <.0001
incom    0.43375   1.00000   0.79262   0.73070
         0.0002              <.0001    <.0001
hs       0.46691   0.79262   1.00000   0.79072
         <.0001    <.0001              <.0001
Urb      0.67737   0.73070   0.79072   1.00000
         <.0001    <.0001    <.0001
Multiple Regression SAS Setup

• proc corr; run;
• proc reg; model crate= hs Urb;
• plot crate*hs; run;
• proc reg; model crate= hs; run;
(A consolidated, runnable version of these steps is sketched below, after the definitions.)

Adjusted R²: Definitions

• Multiple coefficient of determination: a measure of how well the multiple regression equation fits the sample data
• Adjusted coefficient of determination: the multiple coefficient of determination R² modified to account for the number of variables and the sample size
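As promised above, here is a minimal consolidated sketch of that setup, assuming the crime data set (variables crate, incom, hs, Urb) created later in these notes:

proc corr data=crime;          /* correlation matrix for all numeric variables */
run;

proc reg data=crime;
  model crate = hs Urb;        /* two-predictor model */
  plot crate*hs;               /* crime rate against percent high school */
run;

proc reg data=crime;
  model crate = hs;            /* one-predictor model for comparison */
run;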

Adjusted R²

Adjusted R² = 1 − (1 − R²) · (n − 1) / [n − (k + 1)]

where n = sample size
      k = number of independent (x) variables

Including the Three Variables

Analysis of Variance

Source           DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             3   24804            8268.16424    18.83     <.0001
Error            63   27658            439.00995
Corrected Total  66   52462
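To see the adjustment at work, here is a worked check using the two-predictor run above (n = 67, k = 2, R² = 0.4714):

Adjusted R² = 1 − (1 − 0.4714) · 66/64 = 1 − 0.5286 · 1.0313 ≈ 0.4549

which matches the Adj R-Sq of 0.4549 that SAS reports.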
Root MSE        20.95256   R-Square   0.4728
Dependent Mean  52.40299   Adj R-Sq   0.4477
Coeff Var       39.98353

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   59.71473             28.58953          2.09     0.0408
incom        1   -0.38309             0.94053          -0.41     0.6852
hs           1   -0.46729             0.55443          -0.84     0.4025
Urb          1   0.69715              0.12913           5.40     <.0001

Overall Test: Tests of Hypotheses

• Overall test:
  Null: All of the population regression weights are zero.
  Alternative: Not all are zero.

The overall F test:

F(k, n−k−1) = (R²/k) / [(1 − R²)/(n − k − 1)]
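Worked check with the two-predictor model (R² = 0.4714, k = 2, n = 67):

F(2, 64) = (0.4714/2) / (0.5286/64) = 0.2357 / 0.00826 ≈ 28.5

which agrees with the F Value of 28.54 in that model's ANOVA table.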

Individual Tests

• Test for b1: compare
  Y = b0 + b1x1 + b2x2 + b3x3 + e
  Y = b0 + b2x2 + b3x3 + e
• Test for b2: compare
  Y = b0 + b1x1 + b2x2 + b3x3 + e
  Y = b0 + b1x1 + b3x3 + e
• Test for b3: compare
  Y = b0 + b1x1 + b2x2 + b3x3 + e
  Y = b0 + b1x1 + b2x2 + e

F Test for Restricted Models

F(k−g, n−k−1) = [(R²f − R²r)/(k − g)] / [(1 − R²f)/(n − k − 1)]

where R²f comes from the full model with k predictors and R²r from the restricted model with g predictors.
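Worked check against the joint test reported later (testing βincom = βhs = 0): the full model has R²f = 0.4728 with k = 3, and the restricted model crate = Urb has R²r ≈ 0.67737² ≈ 0.4588 (from the correlation table), so

F(2, 63) = [(0.4728 − 0.4588)/2] / [(1 − 0.4728)/63] ≈ 0.0070 / 0.00837 ≈ 0.84

matching the F Value of 0.84 from SAS's test statement.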
More on SAS

• Model options:
  model y = x1 x2 / r partial p stb;
  r – residual analysis
  partial – partial regression scatter plots
  p – predicted values
  stb – standardized regression weights

Standardized Regression Weights

b*i = bi (Sxi / Sy)

Generally, the standardized regression weights fall between −1 and 1. However, they can be larger than 1 (or less than −1).

Obtaining the Standardized Regression Weights

• Model Statement:
  – model crate = incom hs urb / stb;

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
Intercept    1   59.71473             28.58953          2.09     0.0408      0
incom        1   -0.38309             0.94053          -0.41     0.6852     -0.06363
hs           1   -0.46729             0.55443          -0.84     0.4025     -0.14683
Urb          1   0.69715              0.12913           5.40     <.0001      0.83996
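A quick sanity check that is available from the outputs in these notes: in a one-predictor model the standardized weight equals the Pearson correlation, so model crate = hs / stb; should report a standardized estimate of 0.46691, the crate–hs correlation from the table above.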

Partial Regression Plots

• Plot of two residuals.
• The regression line in this plot corresponds to the regression weight in the overall model.
• Model Statement: / partial

[Partial regression plot: crate − b1(inc) − b2(urb) on the vertical axis against hs − b1(inc) − b2(urb) on the horizontal axis]
[Partial regression plot: crate − b1(inc) − b2(hs) against Urb − b1(inc) − b2(hs)]

Interaction of Two Variables

• Just as in ANOVA, we can have interaction effects in a multiple regression analysis.
• For quantitative variables, interaction is present when the relationship between the explanatory variable and the response changes as the level of another variable changes.
• Consider crime rate as a function of hs and urb. If the relationship (slope) between crime rate and urb changes as hs changes, we have an interaction between urb and hs.

Testing for an Interaction Effect

• We test for an interaction effect by comparing a model with interaction to a model without interaction:
  Y = β0 + β1x1 + β2x2 + β3(x1·x2) + e
  Y = β0 + β1x1 + β2x2 + e

SAS Setup for Interaction

• Create the product variable in the data statement:
  – data new; input y x z; xz=x*z; cards;
  – model y = x z xz;

Testing the Interaction

Model: crate = hs Urb hsurb (looking for an interaction)

Root MSE        20.82583   R-Square   0.4792
Dependent Mean  52.40299   Adj R-Sq   0.4544
Coeff Var       39.74168

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   19.31754             49.95871          0.39     0.7003
hs           1   0.03396              0.79381           0.04     0.9660
Urb          1   1.51431              0.86809           1.74     0.0860
hsurb        1   -0.01205             0.01245          -0.97     0.3367

SAS Setup for Interaction

data crime; input crate incom hs Urb; hsurb= hs*Urb; cards;
104 22.1 82.7 73.2
;
proc reg; model crate= hs Urb hsurb;
run;

(First data record shown; the remaining records are omitted here.)
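A worked check on this test: adding the single product term raised R² from 0.4714 to 0.4792, so the restricted-model F test gives

F(1, 63) = (0.4792 − 0.4714) / [(1 − 0.4792)/63] ≈ 0.0078 / 0.00827 ≈ 0.94

which equals the square of the hsurb t value, (−0.97)² ≈ 0.94. With a single added term, the F test and the t test are the same test.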

Assessing the Fit of the Model: Looking further at residuals

Residual Plots in Regression

[Plot of yresid*hs: residuals against hs (50–85). Legend: A = 1 obs, B = 2 obs, etc.]

[Plot of yresid*Urb: residuals against Urb (0–100). Legend: A = 1 obs, B = 2 obs, etc.]

[Plot of yresid*incom: residuals against incom (15.0–37.5). Legend: A = 1 obs, B = 2 obs, etc.]
SAS Setup

• proc reg;
• model crate= incom hs Urb;
• output out=new p=yhat r=yresid;
• proc plot data=new;
• plot yresid*hs;
• plot yresid*urb;
• plot yresid*incom; run;

Assumptions

• Linearity – the relationship between the dependent variable and the independent variables is linear.
• Normality – y is independently normally distributed; the random errors are independently distributed with a mean of zero (a check is sketched below).
• Homoskedasticity – the conditional variances of y given x are all equal.
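The normality assumption can be probed directly in SAS. Here is a minimal sketch, assuming the output data set new with residuals yresid created by the setup above:

proc univariate data=new normal;   /* NORMAL requests tests of normality */
  var yresid;                      /* residuals saved by OUTPUT OUT=new R=yresid */
  qqplot yresid;                   /* normal quantile-quantile plot of the residuals */
run;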

Investigating Multicollinearity

• SAS Model Statement:
  model crate= incom hs Urb / vif collin;
• When the explanatory variables are highly correlated (multicollinearity), the standard errors of the regression weights tend to get very large.

Collinearity Diagnostics

                        Condition   -------- Proportion of Variation --------
Number   Eigenvalue     Index       Intercept    incom        hs           Urb
1        3.78327        1.00000     0.00053670   0.00082225   0.00029978   0.00648
2        0.20397        4.30678     0.00735      0.00127      0.00107      0.39725
3        0.00983        19.61619    0.22868      0.81261      0.00944      0.29914
4        0.00293        35.92811    0.76343      0.18530      0.98919      0.29712

Using the Multicollinearity Indices

• Look for condition indices larger than 30.
• If an index is larger than 30, identify variables with variance proportions larger than .90.
  – The proportion of variance in each coefficient attributable to the condition index.

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1   59.71473             28.58953          2.09     0.0408      0
incom        1   -0.38309             0.94053          -0.41     0.6852      2.91618
hs           1   -0.46729             0.55443          -0.84     0.4025      3.62675
Urb          1   0.69715              0.12913           5.40     <.0001      2.89274


Variance Inflation Factor (VIF)

R²i.rest = (VIF − 1) / VIF

VIF – the variance of the weight is inflated by this quantity; R²i.rest is the R² from regressing predictor i on the remaining predictors.

Model: hs = incom Urb

Root MSE        4.72386    R-Square   0.7243
Dependent Mean  69.48955   Adj R-Sq   0.7157
Coeff Var       6.79794

Collinearity Diagnostics

                        Condition   --- Proportion of Variation ---
Number   Eigenvalue     Index       Intercept   incom     Urb
1        2.80034        1.00000     0.00293     0.00203   0.01645
2        0.19004        3.83868     0.03655     0.00499   0.50952
3        0.00962        17.06408    0.96052     0.99299   0.47402

Test that a Subset of Regression Weights Are Equal to Zero

• SAS Test Statement:
  proc reg; model crate= incom hs urb;
  test incom=0, hs=0;
  run;
Or,
  test incom, hs;
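Worked check: for hs, VIF = 3.62675, so R²hs.rest = (3.62675 − 1)/3.62675 ≈ 0.7243, exactly the R-Square of 0.7243 from the model hs = incom Urb above.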

Results from the Joint Test: βhs = βincom = 0

Test 1 Results for Dependent Variable crate

Source        DF   Mean Square   F Value   Pr > F
Numerator      2   366.72473     0.84      0.4385
Denominator   63   439.00995

Partial Correlations

• In a partial correlation, a variable is partialed out of both variables: rx1x2.x3

r²12.3 = (R²1.23 − R²1.3) / (1 − R²1.3)
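Worked example, under the assumption that variable 1 = crate, 2 = hs, 3 = Urb: R²1.23 = 0.4714 and R²1.3 ≈ 0.67737² ≈ 0.4588, so

r²12.3 ≈ (0.4714 − 0.4588) / (1 − 0.4588) ≈ 0.0126 / 0.5412 ≈ 0.023

Controlling for urbanization, hs uniquely explains only about 2% of the remaining variance in crime rate.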
[Partial regression plot: crate − b1(inc) − b2(hs) against Urb − b1(inc) − b2(hs)]

[Diagram: Ry1y2.xz – the partial correlation of y1 and y2 with x and z partialed out of both]

Semipartial Correlations

• The variable is partialed out from only one of the variables. The squared semipartial correlation is given by

r²1(2.3) = R²1.23 − R²1.3

Relationship between Semipartials and the Multiple R²

R²y.1234 = r²y1 + r²y(2.1) + r²y(3.12) + r²y(4.123)
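Worked example, again taking 1 = crate, 2 = hs, 3 = Urb: r²1(2.3) ≈ 0.4714 − 0.4588 ≈ 0.013; that is, adding hs to a model that already contains Urb raises R² by a little over one percentage point.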

Finding the Best Multiple Regression Equation

1. Use common sense and practical considerations to include or exclude variables, and always plot the data.
2. Instead of including almost every available variable, include relatively few independent (x) variables, weeding out independent variables that don't have an effect on the dependent variable; keep collinearity in mind.
3. Select an equation having a value of adjusted R² with this property: if an additional independent variable is included, the value of adjusted R² does not increase by a substantial amount.
4. For a given number of independent (x) variables, select the equation with the largest value of adjusted R².
5. You want overall significance, with all of the regression weights being individually significant as well.
