
Multiple Linear Regression

Topics
Explanatory vs. predictive modeling with regression
Example: prices of Toyota Corollas
Fitting a predictive model
Assessing predictive accuracy
Explanatory Modeling
Goal: Explain the relationship between predictors
(explanatory variables) and the target

Familiar use of regression in data analysis

Model Goal: Fit the data well and understand the
contribution of explanatory variables to the model
"goodness-of-fit": R2, residual analysis, p-values


Predictive Modeling
Goal: predict target values in other data where we have
predictor values, but not target values
Classic data mining context
Model Goal: Optimize predictive accuracy
Train model on training data
Assess performance on validation (hold-out) data
Explaining role of predictors is not primary purpose
(but useful)
Multiple Regression Model
The equation that describes how the dependent variable
y is related to the independent variables x1, x2, . . . xp and
an error term is called the multiple regression model.

y = 0 + 1x1 + 2x2 + . . . + pxp + 

where:
0, 1, 2, . . . , p are the parameters, and
 is a random variable called the error term
Estimation Process

Multiple Regression Model
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

Multiple Regression Equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
Unknown parameters are β0, β1, β2, . . . , βp

Sample Data:
x1  x2  . . .  xp  y
.   .          .   .
.   .          .   .

Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
Sample statistics are b0, b1, b2, . . . , bp

b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp
Multiple Regression Model

Example: Programmer Salary Survey

A software firm collected data for a sample
of 20 computer programmers. A suggestion
was made that regression analysis could
be used to determine if salary was related
to the years of experience and the score
on the firm's programmer aptitude test.
The years of experience, score on the aptitude
test, and corresponding annual salary ($1000s) for a
sample of 20 programmers are shown on the next
slide.
Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.
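As a quick sketch of how the estimated equation is applied (the example inputs below are made up, not taken from the survey data):

```python
# Sketch: applying the estimated regression equation
# SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE), salary in $1000s.

def predict_salary(exper, score):
    """Predicted annual salary ($1000s) for given years of
    experience and aptitude-test score."""
    return 3.174 + 1.404 * exper + 0.251 * score

# Hypothetical programmer: 4 years of experience, test score 78.
print(round(predict_salary(4, 78), 3))  # 28.368, i.e. about $28,368
```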


Interpreting the Coefficients

In multiple regression analysis, we interpret each
regression coefficient as follows:

bi represents an estimate of the change in y
corresponding to a 1-unit increase in xi when all
other independent variables are held constant.
Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for
each additional year of experience (when the variable
score on the programmer aptitude test is held constant).

b2 = 0.251

Salary is expected to increase by $251 for each
additional point scored on the programmer aptitude
test (when the variable years of experience is held
constant).
Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
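The decomposition can be checked numerically. The sketch below fits a one-predictor least-squares line to small made-up data and verifies that SST = SSR + SSE:

```python
# Sketch: verifying SST = SSR + SSE for a least-squares fit,
# using small made-up data (x = predictor, y = target).
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 13.0, 17.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least-squares slope and intercept
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)         # total
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)     # regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error
print(sst, ssr, sse)  # SST equals SSR + SSE
```

Note that the identity holds for least-squares fits with an intercept; arbitrary fitted values would not satisfy it.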
Multiple Coefficient of Determination

R2 = SSR/SST

R2 = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient
of Determination

Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1)

Ra² = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
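A quick sketch of both the R² and adjusted-R² computations, using the sums of squares from the slides (n = 20 programmers, p = 2 predictors):

```python
# Sketch: R2 and adjusted R2 for the programmer-salary example.
ssr, sst = 500.3285, 599.7855
n, p = 20, 2

r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 6), round(r2_adj, 6))  # 0.834179 0.814671
```

The adjustment penalizes R² for the number of predictors, so adding a useless predictor can lower Ra² even though plain R² never decreases.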
Testing for Significance: F Test

Hypotheses:     H0: β1 = β2 = . . . = βp = 0
                Ha: One or more of the parameters
                    is not equal to zero.

Test Statistic: F = MSR/MSE

Rejection Rule: Reject H0 if p-value < α or if F > Fα,
where Fα is based on an F distribution
with p d.f. in the numerator and
n - p - 1 d.f. in the denominator.
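The F statistic can be computed directly from the sums of squares on the earlier R² slide (SSE is obtained as SST − SSR):

```python
# Sketch: overall F statistic for the programmer-salary example
# (n = 20 observations, p = 2 predictors).
sst, ssr = 599.7855, 500.3285
n, p = 20, 2

sse = sst - ssr
msr = ssr / p             # mean square due to regression
mse = sse / (n - p - 1)   # mean square error
f_stat = msr / mse
print(round(f_stat, 2))  # 42.76
```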
Testing for Significance: t Test

Hypotheses:     H0: βi = 0
                Ha: βi ≠ 0

Test Statistic: t = bi / sbi

Rejection Rule: Reject H0 if p-value < α,
or if t < -tα/2 or t > tα/2, where tα/2
is based on a t distribution
with n - p - 1 degrees of freedom.
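A minimal sketch of the t statistic; the standard error below is a hypothetical placeholder, not a value taken from the slides:

```python
# Sketch: t statistic for an individual coefficient.
b_i = 1.404    # estimated coefficient (b1 from the salary example)
s_bi = 0.20    # its standard error -- hypothetical placeholder
t_stat = b_i / s_bi
print(round(t_stat, 2))  # 7.02
```

A statistic this large would far exceed the tα/2 cutoff for 17 degrees of freedom, so H0: βi = 0 would be rejected.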
Qualitative Independent Variables

Example: Programmer Salary Survey


As an extension of the problem involving the
computer programmer salary survey, suppose
that management also believes that the
annual salary is related to whether the
individual has a graduate degree in
computer science or information systems.
The years of experience, the score on the programmer
aptitude test, whether the individual has a relevant
graduate degree, and the annual salary ($1000) for each
of the sampled 20 programmers are shown on the next
slide.
Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree
     1 if individual does have a graduate degree

x3 is a dummy variable
Qualitative Independent Variables

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat    P-value
Intercept     7.94485    7.3808      1.0764    0.2977
Experience    1.14758    0.2976      3.8561    0.0014
Test Score    0.19694    0.0899      2.1905    0.04364
Grad. Degr.   2.28042    1.98661     1.1479    0.26789   (not significant)
More Complex Qualitative Variables

If a qualitative variable has k levels, k - 1 dummy
variables are required to be included in the model,
with each dummy variable being coded as 0 or 1.
The excluded dummy variable will serve as a reference
for comparison.
For example, a variable indicating level of education could
be represented by d1 and d2 values as follows:

Highest Degree    d1    d2
Bachelor's         0     0
Master's           1     0
Ph.D.              0     1
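A minimal sketch of this coding scheme for the education example, with Bachelor's as the reference category:

```python
# Sketch: encoding a k-level qualitative variable (k = 3 here)
# with k - 1 dummies; "Bachelor's" is the reference category.
def encode_degree(degree):
    """Return the (d1, d2) dummy pair for the highest degree."""
    return {"Bachelor's": (0, 0),
            "Master's":   (1, 0),
            "Ph.D.":      (0, 1)}[degree]

print(encode_degree("Master's"))  # (1, 0)
```

Each non-reference level gets its own dummy; the reference level is the all-zeros row, so its effect is absorbed into the intercept.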
Required Conditions for the Error Variable

The error is normally distributed.
The mean is equal to zero (linearity).
Errors have a constant variance σ2 for all settings of the
independent variables (homoscedasticity).
Errors are independent.

Note: Checking regression assumptions is more
important in explanatory modeling than in predictive
modeling.
Example: Prices of Toyota Corollas
ToyotaCorolla.xls

Goal: predict prices of used Toyota Corollas
based on their specifications

Data: Prices of 1442 used Toyota Corollas,
with their specification information
Data Sample
(showing only the variables to be used in analysis)

Price Age KM Fuel_Type HP Metallic Automatic cc Doors Quarterly_Tax Weight


13500 23 46986 Diesel 90 1 0 2000 3 210 1165
13750 23 72937 Diesel 90 1 0 2000 3 210 1165
13950 24 41711 Diesel 90 1 0 2000 3 210 1165
14950 26 48000 Diesel 90 0 0 2000 3 210 1165
13750 30 38500 Diesel 90 0 0 2000 3 210 1170
12950 32 61000 Diesel 90 0 0 2000 3 210 1170
16900 27 94612 Diesel 90 1 0 2000 3 210 1245
18600 30 75889 Diesel 90 1 0 2000 3 210 1245
21500 27 19700 Petrol 192 0 0 1800 3 100 1185
12950 23 71138 Diesel 69 0 0 1900 3 185 1105
20950 25 31461 Petrol 192 0 0 1800 3 100 1185
Variables Used
Price in Euros
Age in months as of 8/04
KM (kilometers)
Fuel Type (diesel, petrol, CNG)
HP (horsepower)
Metallic color (1=yes, 0=no)
Automatic transmission (1=yes, 0=no)
CC (cylinder volume)
Doors
Quarterly_Tax (road tax)
Weight (in kg)
Preprocessing
Fuel type is categorical, must be transformed into binary
variables

Diesel (1=yes, 0=no)

CNG (1=yes, 0=no)

None needed for “Petrol” (reference category)


Subset of the records selected for training
partition (limited # of variables shown)

Id  Model                                       Price  Age_08_04  Mfg_Month  Mfg_Year  KM     Fuel_Type_Diesel  Fuel_Type_Petrol
1   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors      13500     23         10        2002    46986         1                 0
4   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors      14950     26          7        2002    48000         1                 0
5   TA Corolla 2.0 D4D HATCHB SOL 2/3-Doors     13750     30          3        2002    38500         1                 0
6   TA Corolla 2.0 D4D HATCHB SOL 2/3-Doors     12950     32          1        2002    61000         1                 0
9   OTA Corolla 1800 T SPORT VVT I 2/3-Doors    21500     27          6        2002    19700         0                 1
10  TA Corolla 1.9 D HATCHB TERRA 2/3-Doors     12950     23         10        2002    71138         1                 0
12  8 16V VVTLI 3DR T SPORT BNS 2/3-Doors       19950     22         11        2002    43610         0                 1
17  olla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors    22750     30          3        2002    34000         0                 1

60% training data / 40% validation data
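A minimal sketch of such a random partition (the row ids below are made up; the slides' actual partition was produced by the data mining tool):

```python
# Sketch: a 60% / 40% random train/validation split.
import random

rows = list(range(1000))   # stand-in row ids, not the real data
random.seed(1)             # fixed seed so the split is reproducible
random.shuffle(rows)

cut = int(0.6 * len(rows))
train, valid = rows[:cut], rows[cut:]
print(len(train), len(valid))  # 600 400
```

The model is fit on `train` only; `valid` is held out and used solely to estimate predictive accuracy.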


The Fitted Regression Model
Input variables Coefficient Std. Error p-value SS
Constant term -3608.418457 1458.620728 0.0137 97276410000
Age_08_04 -123.8319168 3.367589 0 8033339000
KM -0.017482 0.00175105 0 251574500
Fuel_Type_Diesel 210.9862518 474.9978333 0.6571036 6212673
Fuel_Type_Petrol 2522.066895 463.6594238 0.00000008 4594.9375
HP 20.71352959 4.67398977 0.00001152 330138600
Met_Color -50.48505402 97.85591125 0.60614568 596053.75
Automatic 178.1519013 212.0528565 0.40124047 19223190
cc 0.01385481 0.09319961 0.88188446 1272449
Doors 20.02487946 51.0899086 0.69526076 39265060
Quarterly_Tax 16.7742424 2.09381151 0 160667200
Weight 15.41666317 1.40446579 0 214696000
Error Reports

Training Data Scoring - Summary Report

Total sum of squared errors   RMS Error     Average Error
1514553377                    1325.527246   -0.000426154

Validation Data Scoring - Summary Report

Total sum of squared errors   RMS Error     Average Error
1021587500                    1334.079894   116.3728779
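The two summary numbers are linked by RMS Error = sqrt(total squared error / n). As a sketch, the validation record count can even be backed out from the report's own totals:

```python
# Sketch: relating the report's total squared error to its RMS error.
import math

total_sq_err = 1021587500.0   # validation total sum of squared errors
rms_error = 1334.079894       # validation RMS error

n_valid = total_sq_err / rms_error ** 2
print(round(n_valid))  # number of validation records implied by the report
```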


Multicollinearity

Problem: If one predictor is a linear combination of
other predictor(s), model estimation will fail
Note that in such a case, we have at least one redundant
predictor

Solution: Remove extreme redundancies (by dropping
highly correlated predictors)
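One simple screen for such redundancy is to compute pairwise correlations between predictors before fitting; a pure-Python sketch with made-up data:

```python
# Sketch: flagging highly correlated predictor pairs with the
# Pearson correlation coefficient.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

# Two made-up predictors that move almost in lockstep:
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson(x1, x2)
print(round(r, 3))  # close to 1 -> candidate for dropping one predictor
```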
Warning Signs of Multicollinearity
A regression coefficient is not significant even though,
theoretically, that variable should be highly correlated
with Y.
When you add or delete an X variable, the regression
coefficients change dramatically.
You see a negative regression coefficient when your
response should increase along with X.
You see a positive regression coefficient when the
response should decrease as X increases.
Your X variables have high pairwise correlations.
Summary
Linear regression models are very popular tools, not
only for explanatory modeling, but also for prediction
A good predictive model has high predictive accuracy
(to a useful practical level)
Predictive models are built using a training data set,
and evaluated on a separate validation data set
Removing redundant predictors is key to achieving
predictive accuracy and robustness
