
The Bucharest University of Economic Studies



Bucharest Business School
Romanian - French INDE MBA Program

Prof. univ. dr. Constantin MITRUT
STATISTICS 1


LESSON 9

Multiple Regression

2
Introduction
In this chapter we extend the simple linear regression model and allow for any number of independent variables.

We expect to build a model that fits the data better than the simple linear regression model.

3
Introduction
We shall use computer printout to
 Assess the model
  How well it fits the data
  Is it useful?
  Are any required conditions violated?
 Employ the model
  Interpreting the coefficients
  Predictions using the prediction equation
  Estimating the expected value of the dependent variable
4
Model and Required
Conditions
We allow for k independent variables to potentially be related to the dependent variable:

y = b0 + b1x1 + b2x2 + … + bkxk + e

where y is the dependent variable, x1, …, xk are the independent variables, b0, b1, …, bk are the coefficients, and e is the random error variable.

5
Required conditions for the error
variable

The error e is normally distributed.
The mean of e is equal to zero, and its standard deviation (se) is constant for all values of y.
The errors are independent.

6
Estimating the Coefficients
and Assessing the Model
The procedure used to perform regression analysis:
 Obtain the model coefficients and statistics using statistical software.
 Diagnose violations of required conditions. Try to remedy problems when identified.
 Assess the model fit using statistics obtained from the sample.
 If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
7
Estimating the
Coefficients and
Assessing the Model
Example 1: Where to locate a new motor inn?
 La Quinta Motor Inns is planning an expansion.
 Management wishes to predict which sites are likely to be profitable.
 Several areas in which predictors of profitability can be identified are:
  Competition
  Market awareness
  Demand generators
Profitability is measured by operating margin (Margin). The candidate predictors, grouped by category:
 Competition: Rooms - number of hotel/motel rooms within 3 miles of the site.
 Market awareness: Nearest - distance to the nearest La Quinta inn.
 Demand generators: Office space, College enrollment.
 Community: Income - median household income.
 Physical: Disttwn - distance to downtown.
Estimating the Coefficients
and Assessing the Model,
Example
Data were collected from 100 randomly selected La Quinta inns, and the following suggested model was run:

Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College + b5Income + b6Disttwn
Margin Number Nearest Office Space Enrollment Income Distance
55.5 3203 4.2 549 8 37 2.7
33.8 2810 2.8 496 17.5 35 14.4
49 2890 2.4 254 20 35 2.6
31.9 3422 3.3 434 15.5 38 12.1
57.4 2687 0.9 678 15.5 42 6.9
49 3759 2.9 635 19 33 10.8
11
Regression Analysis, Excel
Output
SUMMARY OUTPUT

This is the sample regression equation (sometimes called the prediction equation):

Margin = 38.14 - 0.0076Number + 1.65Nearest + 0.020Office Space + 0.21Enrollment + 0.41Income - 0.23Distance

Regression Statistics
Multiple R 0.7246
R Square 0.5251
Adjusted R Square 0.4944
Standard Error 5.51
Observations 100

ANOVA
df SS MS F Significance F
Regression 6 3123.8 520.6 17.14 0.0000
Residual 93 2825.6 30.4
Total 99 5949.5

Coefficients Standard Error t Stat P-value
Intercept 38.14 6.99 5.45 0.0000
Number -0.0076 0.0013 -6.07 0.0000
Nearest 1.65 0.63 2.60 0.0108
Office Space 0.020 0.0034 5.80 0.0000
Enrollment 0.21 0.13 1.59 0.1159
Income 0.41 0.14 2.96 0.0039
Distance -0.23 0.18 -1.26 0.2107
12
Model Assessment
The model is assessed using three
tools:
 The standard error of estimate
 The coefficient of determination
 The F-test of the analysis of variance
The standard error of estimate is used in building the other two tools.

13
Standard Error of Estimate

The standard deviation of the error is estimated by the Standard Error of Estimate:

se = sqrt( SSE / (n - k - 1) )

The magnitude of se is judged by comparing it to the mean of y.
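As a quick check, the standard error of estimate can be recomputed from the ANOVA numbers in the La Quinta printout (SSE = 2825.6, n = 100, k = 6). A minimal Python sketch:

```python
import math

# Values taken from the La Quinta ANOVA printout
SSE = 2825.6     # residual sum of squares
n, k = 100, 6    # observations, independent variables

# Standard error of estimate: s_e = sqrt(SSE / (n - k - 1))
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 2))  # 5.51, matching the printout's Standard Error
```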
14
Standard Error of Estimate

From the printout, se = 5.51.
Calculating the mean value of y, we have ȳ = 45.739.
It seems se is not particularly small.
Question: Can we conclude the model does not fit the data well?
15
Coefficient of Determination

The definition is

R2 = 1 - SSE / Σ(yi - ȳ)2

From the printout, R2 = 0.5251: 52.51% of the variation in operating margin is explained by the six independent variables; 47.49% remains unexplained.

When adjusted for degrees of freedom,
Adjusted R2 = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)] = 0.4944 = 49.44%
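Both numbers can be verified directly from the printout's sums of squares (a Python sketch; SSE and SS(Total) come from the ANOVA table above):

```python
# Sums of squares from the La Quinta ANOVA table
SSE, SS_total = 2825.6, 5949.5
n, k = 100, 6

r2 = 1 - SSE / SS_total                                    # coefficient of determination
adj_r2 = 1 - (SSE / (n - k - 1)) / (SS_total / (n - 1))    # adjusted for degrees of freedom
print(round(r2, 4), round(adj_r2, 4))  # 0.5251 0.4944
```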
Testing the Validity of the
Model
We pose the question:
Is there at least one independent variable
linearly related to the dependent variable?
To answer the question we test the hypothesis

H0: b1 = b2 = … = bk = 0
H1: At least one bi is not equal to zero.

If at least one bi is not equal to zero, the model has some validity.
17
Testing the Validity of the La
Quinta Inns Regression Model
The hypotheses are tested by an ANOVA procedure (see the Excel output). The test statistic is F = MSR/MSE.

ANOVA
df SS MS F Significance F
Regression k = 6 SSR = 3123.8 MSR = SSR/k = 520.6 17.14 0.0000
Residual n-k-1 = 93 SSE = 2825.6 MSE = SSE/(n-k-1) = 30.4
Total n-1 = 99 5949.5

18
Testing the Validity of the La
Quinta Inns Regression Model
[Variation in y] = SSR + SSE.
Large F results from a large SSR. Then,
much of the variation in y is explained by the
regression model; the model is useful, and
thus, the null hypothesis should be rejected.
Therefore, the rejection region is…
F = (SSR/k) / (SSE/(n-k-1)) = MSR/MSE

Rejection region: F > Fα,k,n-k-1

19
Testing the Validity of the La
Quinta Inns Regression Model

Fα,k,n-k-1 = F0.05,6,100-6-1 = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000.
Reject the null hypothesis.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the bi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.

20
Interpreting the Coefficients
b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by .0076% when the other variables are held constant.
21
Interpreting the Coefficients
b2 = 1.65. In this model, for each additional mile that the nearest competitor is from a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant.
b3 = 0.020. For each additional 1000 sq-ft of office space, the operating margin increases on average by .02% when the other variables are held constant.
b4 = 0.21. For each additional thousand students, the operating margin increases on average by .21% when the other variables are held constant.
22
Interpreting the Coefficients
b5 = 0.41. For each additional $1000 increase in median household income, the operating margin increases on average by .41% when the other variables remain constant.
b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by .23% when the other variables are held constant.
23
Testing the Coefficients
The hypothesis for each bi is

H0: bi = 0
H1: bi ≠ 0

Test statistic: t = bi / s(bi), with d.f. = n - k - 1

Excel printout:

Coefficients Standard Error t Stat P-value
Intercept 38.14 6.99 5.45 0.0000
Number -0.0076 0.0013 -6.07 0.0000
Nearest 1.65 0.63 2.60 0.0108
Office Space 0.020 0.0034 5.80 0.0000
Enrollment 0.21 0.13 1.59 0.1159
Income 0.41 0.14 2.96 0.0039
Distance -0.23 0.18 -1.26 0.2107
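Each t Stat in the printout is simply the coefficient divided by its standard error. Recomputing a few from the printed values (which are themselves rounded, so the results only approximate the printout) can be sketched as:

```python
# (coefficient, standard error) pairs as printed, already rounded
printout = {
    "Nearest":  (1.65, 0.63),
    "Income":   (0.41, 0.14),
    "Distance": (-0.23, 0.18),
}

for name, (b, s_b) in printout.items():
    t = b / s_b   # test statistic for H0: beta_i = 0, d.f. = n - k - 1
    print(name, round(t, 2))
# Lands close to the printed t Stats (2.60, 2.96, -1.26); the small
# differences are rounding in the printed coefficients.
```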
24
Using the Linear Regression
Equation
The model can be used for making predictions by
 Producing a prediction interval estimate for a particular value of y, for given values of xi.
 Producing a confidence interval estimate for the expected value of y, for given values of xi.

The model can also be used to learn about the relationships between the independent variables xi and the dependent variable y.
La Quinta Inns, Predictions
Predict the average operating margin of an
inn at a site with the following characteristics:
 3815 rooms within 3 miles,
 Closest competitor .9 miles away,
 476,000 sq-ft of office space,
 24,500 college students,
 $35,000 median household income,
 11.2 miles distance to downtown center.

MARGIN = 38.14 - 0.0076(3815) + 1.65(.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1%
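The point prediction above is just the regression equation evaluated at the site's characteristics; a Python sketch (note that office space, enrollment, and income enter in thousands):

```python
# Coefficients from the sample regression equation
b0 = 38.14
coefs = {"Number": -0.0076, "Nearest": 1.65, "OfficeSpace": 0.020,
         "Enrollment": 0.21, "Income": 0.41, "Distance": -0.23}

# Site characteristics, in the model's units
site = {"Number": 3815, "Nearest": 0.9, "OfficeSpace": 476,
        "Enrollment": 24.5, "Income": 35, "Distance": 11.2}

margin = b0 + sum(coefs[name] * site[name] for name in coefs)
print(round(margin, 1))  # 37.1 (percent)
```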
26
La Quinta Inns, Predictions
Interval estimates (from the computer output):

Predicted value: 37.1
Prediction interval: lower limit 25.4, upper limit 48.8.
It is predicted, with 95% confidence, that the operating margin will lie between 25.4% and 48.8%.

Interval estimate of expected value: lower limit 33.0, upper limit 41.2.
It is estimated that the average operating margin of all sites that fit this category falls between 33% and 41.2%.

Interpretation: the average inn would not be profitable (margin less than 50%).
MBA Program Admission
Policy
The dean of a large university wants to raise
the admission standards to the popular MBA
program.
She plans to develop a method that can
predict an applicant’s performance in the
program.
She believes a student’s success can be
predicted by:
 Undergraduate GPA
 Graduate Management Admission Test (GMAT)
score 28
MBA Program Admission
Policy
A random sample of students who completed the MBA program was selected (see the MBA data file).
MBA GPA UnderGPA GMAT Work
8.43 10.89 584 9
6.58 10.38 483 7
8.15 10.39 484 4
8.88 10.73 646 6
. . . .
. . . .

 Develop a plan to decide which applicants to admit.
29
MBA Program Admission
Policy
Solution
 The model to estimate is:
y = b0 +b1x1+ b2x2+ b3x3+e

y = MBA GPA
x1 = undergraduate GPA [UnderGPA]
x2 = GMAT score [GMAT]
x3 = years of work experience [Work]

 The estimated model:


MBA GPA = b0 + b1UnderGPA + b2GMAT + b3Work

30
MBA Program Admission
Policy – Model Diagnostics
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.6808
R Square 0.4635
Adjusted R Square 0.4446
Standard Error 0.788
Observations 89

ANOVA
df SS MS F Significance F
Regression 3 45.60 15.20 24.48 0.0000
Residual 85 52.77 0.62
Total 88 98.37

Coefficients Standard Error t Stat P-value
Intercept 0.466 1.506 0.31 0.7576
UnderGPA 0.063 0.120 0.52 0.6017
GMAT 0.011 0.001 8.16 0.0000
Work 0.093 0.031 3.00 0.0036

We estimate the regression model, then we check the normality of the errors using a histogram of the standardized residuals (figure omitted).
31
MBA Program Admission
Policy – Model Diagnostics
The printout is the same as on the previous slide. We also check that the variance of the error variable is constant, by plotting the residuals against the predicted values of y (figure omitted).
32
MBA Program Admission
Policy – Model Diagnostics
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.6808
R Square 0.4635
Adjusted R Square 0.4446
Standard Error 0.788
Observations 89

ANOVA
df SS MS F Significance F
Regression 3 45.60 15.20 24.48 0.0000
Residual 85 52.77 0.62
Total 88 98.37

Coefficients Standard Error t Stat P-value


Intercept 0.466 1.506 0.31 0.7576
UnderGPA 0.063 0.120 0.52 0.6017
GMAT 0.011 0.001 8.16 0.0000
Work 0.093 0.031 3.00 0.0036

33
MBA Program Admission
Policy – Model Assessment
From the printout (same output as above):
• 46.35% of the variation in MBA GPA is explained by the model.
• The model is valid (Significance F = 0.0000).
• GMAT score and years of work experience are linearly related to MBA GPA.
• There is insufficient evidence of a linear relationship between undergraduate GPA and MBA GPA.
Regression Diagnostics - II

The conditions required for the model assessment to apply must be checked.
 Is the error variable normally distributed? Draw a histogram of the residuals.
 Is the error variance constant? Plot the residuals versus ŷ.
 Are the errors independent? Plot the residuals versus the time periods.
 Can we identify outliers?
 Is multicollinearity (intercorrelation) a problem?
35
Diagnostics: Multicollinearity
Example: Predicting house price
 A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size.
 A random sample of 100 houses was drawn and the data recorded.
Price Bedrooms H Size Lot Size
124100 3 1290 3900
218300 4 2080 6600
117800 3 1250 3750
. . . .
. . . .

36
Diagnostics: Multicollinearity
The proposed model is

PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7483
R Square 0.5600
Adjusted R Square 0.5462
Standard Error 25023
Observations 100

The model is valid, but no variable is significantly related to the selling price?!

ANOVA
df SS MS F Significance F
Regression 3 76501718347 25500572782 40.73 0.0000
Residual 96 60109046053 626135896
Total 99 136610764400

Coefficients Standard Error t Stat P-value


Intercept 37718 14177 2.66 0.0091
Bedrooms 2306 6994 0.33 0.7423
House Size 74.30 52.98 1.40 0.1640
Lot Size -4.36 17.02 -0.26 0.7982
37
Diagnostics: Multicollinearity
Multicollinearity is found to be a problem. The correlation matrix:

          Price   Bedrooms  H Size  Lot Size
Price     1
Bedrooms  0.6454  1
H Size    0.7478  0.8465    1
Lot Size  0.7409  0.8374    0.9936  1

• Multicollinearity causes two kinds of difficulties:
 - The t statistics appear to be too small.
 - The b coefficients cannot be interpreted as "slopes".
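The near-1 correlation between H Size and Lot Size is what signals the problem. A pairwise correlation can be computed with a small pure-Python helper; the data below are hypothetical values in the spirit of the house-price table, not the actual sample:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical house sizes and lot sizes that move almost in lockstep
h_size = [1290, 2080, 1250, 1740, 1980]
lot_size = [3900, 6600, 3750, 5400, 6000]
print(round(pearson(h_size, lot_size), 3))  # close to 1: strong intercorrelation
```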
38
Remedying Violations of the
Required Conditions
Nonnormality or heteroscedasticity can be
remedied using transformations on the y
variable.
The transformations can improve the linear
relationship between the dependent variable
and the independent variables.
Many computer software systems allow us to
make the transformations easily.

39
Reducing Nonnormality by
Transformations
A brief list of transformations:
 y' = log y (for y > 0)
  Use when se increases with y, or
  Use when the error distribution is positively skewed
 y' = y2
  Use when se2 is proportional to E(y), or
  Use when the error distribution is negatively skewed
 y' = y1/2 (for y > 0)
  Use when se2 is proportional to E(y)
 y' = 1/y
  Use when se2 increases significantly when y increases beyond some critical value.
40
Durbin - Watson Test:
Are the Errors Autocorrelated?
This test detects first order autocorrelation between consecutive residuals in a time series. If autocorrelation exists, the error variables are not independent.

d = Σ(i=2..n) (ei - ei-1)2 / Σ(i=1..n) ei2

where ei is the residual at time i. The range of d is 0 ≤ d ≤ 4.
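The d statistic defined above is straightforward to compute from a residual series (a sketch; the residuals here are toy values, not from any regression in this lesson):

```python
def durbin_watson(residuals):
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Similar consecutive residuals -> small d (positive autocorrelation)
print(round(durbin_watson([1, 1, 1, -1, -1, -1]), 2))  # 0.67
# Alternating residuals -> large d (negative autocorrelation)
print(round(durbin_watson([1, -1, 1, -1, 1, -1]), 2))  # 3.33
```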
Positive First Order
Autocorrelation

(Plot of residuals over time; figure omitted.)

Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
42
Negative First Order
Autocorrelation

(Plot of residuals over time; figure omitted.)

Negative first order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
43
One tail test for Positive First
Order Autocorrelation
If d < dL, there is enough evidence to show that positive first-order correlation exists.
If d > dU, there is not enough evidence to show that positive first-order correlation exists.
If d falls between dL and dU, the test is inconclusive.

0 … [positive correlation exists] … dL … [inconclusive] … dU … [positive correlation does not exist]
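The three-way decision rule above can be captured in a small helper; the usage line anticipates the ski-resort example later in this lesson (d = 0.5931, with dL = 1.10 and dU = 1.54 for n = 20, k = 2):

```python
def positive_autocorrelation_test(d, d_L, d_U):
    """One-tail Durbin-Watson decision for positive first order autocorrelation."""
    if d < d_L:
        return "positive first order correlation exists"
    if d > d_U:
        return "no evidence of positive first order correlation"
    return "inconclusive"

print(positive_autocorrelation_test(0.5931, d_L=1.10, d_U=1.54))
# -> positive first order correlation exists
```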
44
One Tail Test for Negative
First Order Autocorrelation
If d > 4 - dL, negative first order correlation exists.
If d < 4 - dU, negative first order correlation does not exist.
If d falls between 4 - dU and 4 - dL, the test is inconclusive.

… [does not exist] … 4-dU … [inconclusive] … 4-dL … [negative correlation exists]
45
Two-Tail Test for First Order
Autocorrelation
If d < dL or d > 4 - dL, first order autocorrelation exists.
If d falls between dL and dU, or between 4 - dU and 4 - dL, the test is inconclusive.
If d falls between dU and 4 - dU, there is no evidence for first order autocorrelation.

0 … dL … dU … 2 … 4-dU … 4-dL … 4
46
Testing the Existence of
Autocorrelation, Example
Example
How does the weather affect the sales of lift tickets in a ski resort?
 Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
 The model hypothesized was
TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + e
 Regression analysis yielded the following results:
47
The Regression Equation –
Assessment (I)
SUMMARY OUTPUT

The model seems to be very poor:
• R-square = 0.1200
• It is not valid (Signif. F = 0.3373)
• No variable is linearly related to Sales

Regression Statistics
Multiple R 0.3465
R Square 0.1200
Adjusted R Square 0.0165
Standard Error 1712
Observations 20

ANOVA
df SS MS F Signif. F
Regression 2 6793798 3396899 1.16 0.3373
Residual 17 49807214 2929836
Total 19 56601012

Coefficients Standard Error t Stat P-value
Intercept 8308.0 903.73 9.19 0.0000
Snowfall 74.59 51.57 1.45 0.1663
Temperature -8.75 19.70 -0.44 0.6625
Diagnostics: The Error
Distribution
(Histogram of the errors; figure omitted.)

The errors may be normally distributed.
49
Diagnostics:
Heteroscedasticity
(Plot of residuals versus predicted y; figure omitted.)

It appears there is no problem of heteroscedasticity (the error variance seems to be constant).
50
Diagnostics: First Order
Autocorrelation
(Plot of residuals over time; figure omitted.)

The errors are not independent!
51
Diagnostics: First Order
Autocorrelation
Using the computer (Excel):
Tools > Data Analysis > Regression (check the residual option, then OK)
Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK

Test for positive first order autocorrelation:
n = 20, k = 2. From the Durbin-Watson table we have dL = 1.10, dU = 1.54.
The statistic d = 0.5931.
Conclusion: Because d < dL, there is sufficient evidence to infer that positive first order autocorrelation exists.
52
The Modified Model: Time
Included
The modified regression model
TICKETS=b0+ b1SNOWFALL+ b2TEMPERATURE+ b3TIME+e

• All the required conditions are met for this model.
• The fit of this model is high: R2 = 0.7410.
• The model is valid (Significance F = .0001).
• SNOWFALL and TIME are linearly related to ticket sales.
• TEMPERATURE is not linearly related to ticket sales.


53
