
Chapter 15 (15.1 - 15.2, 15.4 - 15.6): Multiple Regression and Model Building

Tanvir Quadir

School of Mathematics and Statistics

STAT 2601: Business Statistics I


15.1 The Multiple Regression Model

Multivariate statistics are a class of statistics that involve the analysis of at least three variables.

Multivariate statistics are computationally formidable, yet are an important and powerful tool.

One widely used multivariate statistical tool is multiple regression.

Multiple regression is an extension of simple regression that allows more than one predictor variable.



15.1 The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent variable (y) and 2 or more independent variables (xi).



15.1 The Multiple Regression Model

■ Least Squares Estimates and Prediction:

ŷ = b0 + b1 x1 + b2 x2 + · · · + bk xk is the point estimate of the mean value of the dependent variable when the values of the independent variables are x1 , x2 , . . . , xk (a short fitting sketch follows this list).

It is also the point prediction of an individual value of the dependent variable when the values of the independent variables are x1 , x2 , . . . , xk .

b0 , b1 , b2 , . . . , bk are the least squares point estimates of the parameters β0 , β1 , β2 , . . . , βk .

x1 , x2 , . . . , xk are specified values of the independent predictor variables.
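As a concrete illustration, here is a minimal Python sketch (using the statsmodels library and a small made-up data set, neither of which comes from the course examples) of obtaining the least squares point estimates b0, b1, b2 and a point prediction ŷ:

```python
# Minimal sketch: fit a multiple regression and compute a point prediction.
# The data below are hypothetical, chosen only to make the code runnable.
import numpy as np
import statsmodels.api as sm

y  = np.array([12.0, 15.5, 14.1, 18.3, 20.0, 22.4])   # dependent variable
x1 = np.array([ 1.0,  2.0,  2.0,  3.0,  4.0,  5.0])   # first predictor
x2 = np.array([ 3.0,  2.5,  4.0,  3.5,  5.0,  4.5])   # second predictor

X = sm.add_constant(np.column_stack([x1, x2]))        # prepend intercept column
model = sm.OLS(y, X).fit()

print(model.params)                   # b0, b1, b2: least squares point estimates
print(model.predict([[1.0, 3.0, 4.0]]))   # point prediction of y at x1 = 3, x2 = 4
```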



15.1 The Multiple Regression Model

House Price: Example 1 (Simple Linear Regression Model)

Predicted Price = 14,349.48 + 48,218.91 × Bedrooms

R² = 21.4%, se = 68,432.21 with df = 1057 − 2 = 1055

The predictor Bedrooms can explain only 21.4% of the variation in Price.

Perhaps the inclusion of another factor can account for a portion of the remaining variation. Try a Multiple Regression Model!



15.1 The Multiple Regression Model
House Price: Example 1 (Multiple Linear Regression Model)

Predicted Price = 20,986.09 − 7,483.10 × Bedrooms + 93.84 × Living Area

R² = 57.8%, se = 50,142.4 with df = 1057 − 3 = 1054

Now the model (i.e., Bedrooms and Living Area) accounts for 57.8% of the variation in Price.

Price drops with increasing bedrooms? Counterintuitive?

Take-away: The meaning of the coefficients in multiple regression can be subtly different than in simple regression.

15.1 The Multiple Regression Model

■ House Price: Example 1 (Multiple Linear Regression Model)

Predicted Price = 20,986.09 − 7,483.10 × Bedrooms + 93.84 × Living Area

■ So, what’s the correct answer to the question: “Do more bedrooms tend to increase or decrease the price of a home?”

Simple Linear Regression Model: “increase” if “Bedrooms” is the only predictor (“more bedrooms” may mean “bigger house”, after all!)

Multiple Linear Regression Model: “decrease” if “Bedrooms” increases for fixed Living Area (“more bedrooms” may mean “smaller, more-cramped rooms” which devalue the home).



15.1 The Multiple Regression Model
■ House Price: Example 1 (Summarizing)

Multiple regression coefficients must be interpreted in terms of the other predictors in the model.

Simple Linear Regression Model (Predictor: Bedrooms only):
Predicted Price = 14,349.48 + 48,218.91 × Bedrooms
On average, we’d expect the price to increase by $48,218.91 for each additional bedroom in the house.

Multiple Linear Regression Model (Predictors: Bedrooms and Living Area):
Predicted Price = 20,986.09 − 7,483.10 × Bedrooms + 93.84 × Living Area
On average, we’d expect the price to decrease by $7,483.10 for each additional bedroom in the house, holding constant the effect of the living area variable. (A short numerical comparison of the two fitted equations follows below.)
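To make the contrast concrete, the short Python sketch below plugs the two fitted equations from this example into functions and compares the effect of one extra bedroom; the fixed living area of 2,000 sq ft is an arbitrary illustrative choice, not a value from the data:

```python
# The two fitted equations from this example, written as plain Python functions.
def price_simple(bedrooms):
    return 14_349.48 + 48_218.91 * bedrooms

def price_multiple(bedrooms, living_area):
    return 20_986.09 - 7_483.10 * bedrooms + 93.84 * living_area

# Simple model: one extra bedroom raises the predicted price by $48,218.91.
print(price_simple(4) - price_simple(3))                   # about 48218.91

# Multiple model: one extra bedroom, holding living area fixed at an
# arbitrary 2,000 sq ft, lowers the predicted price by $7,483.10.
print(price_multiple(4, 2000) - price_multiple(3, 2000))   # about -7483.10
```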
15.1 The Multiple Regression Model

House Price: Example 2



15.1 The Multiple Regression Model
■ House Price: Example 2 (Correlation Matrix)
Correlation Coefficient: A quantitative measure of strength of the
linear relationship between two variables.
r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}\;\sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}}

Correlation Matrix: A table showing the pairwise correlations between all variables (dependent and independent).
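A minimal sketch of how such a matrix can be produced with pandas; the small data frame below is hypothetical and is not the Example 2 data set:

```python
# Pairwise correlation matrix for all variables (dependent and independent).
import pandas as pd

df = pd.DataFrame({
    "Price":      [152_000, 189_000, 173_500, 215_000, 199_900],  # hypothetical
    "LivingArea": [1_450, 1_820, 1_600, 2_250, 2_050],
    "Bedrooms":   [3, 4, 3, 4, 4],
})

print(df.corr())   # table of pairwise correlation coefficients r
```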



15.1 The Multiple Regression Model

House Price: Example 2 (Scatter Plots)



15.1 The Multiple Regression Model

House Price: Example 2 (EXCEL Output)



15.1 The Multiple Regression Model

■ House Price: Example 2

The estimate of the multiple regression model:

ŷ = 31,128 + 63.07x1 − 1,144x2 − 8,410x3 + 3,522x4 + 28,204x5

Point estimate (price prediction):

ŷ = 31,128 + (63.07 × 2,100) − (1,144 × 15) − (8,410 × 4) + (3,522 × 3) + (28,204 × 2) = $179,749
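The arithmetic above can be checked in a couple of lines of Python using the coefficients and predictor values shown on this slide:

```python
# Reproducing the point estimate from the fitted coefficients on this slide.
b = [31_128, 63.07, -1_144, -8_410, 3_522, 28_204]   # b0, b1, ..., b5
x = [1, 2_100, 15, 4, 3, 2]                          # leading 1 pairs with the intercept

y_hat = sum(bi * xi for bi, xi in zip(b, x))
print(round(y_hat))   # 179749
```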



15.1 The Multiple Regression Model
Pie Sale: Example 3

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Data are collected for 15 weeks.



15.1 The Multiple Regression Model
Pie Sale: Example 3 (MINITAB Output)



15.1 The Multiple Regression Model

■ Pie Sale: Example 3 (Interpretation of Estimated Coefficients)

Slope (bi):
(i) Estimates that the average value of y changes by bi units for each 1-unit increase in xi, holding all other variables constant (or while the other variables are in the model).
(ii) Example: If b1 = −25, then sales (y) are expected to decrease by an estimated 25 pies per week for each $1 increase in selling price (x1), holding constant the effect of advertising (x2).

y-intercept (b0): The estimated average value of y when all xi = 0 (assuming all xi = 0 is within the range of observed values).



15.2 Quality of Fit
Coefficient of Determination: R²

R² = SSR / SST = 1 − SSE / SST

Interpretation: the fraction of the total variation in y accounted for by the model (all the predictor variables included).
Valid range: 0 ≤ R² ≤ 1.

Caveat: adding new predictor variables to a model never decreases R² and may increase it (see the sketch after this example).

Example: Weight and Height
Predicted Weight = 103.40 + 6.38 Height (over 5 ft),  R² = 0.74
Add a new, completely nonsensical variable to the equation: the campus post office box number (Box#):
Predicted Weight = 102.35 + 6.36 Height (over 5 ft) + 0.02 Box#,  R² = 0.75
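A minimal simulation sketch of this caveat (the simulated heights, weights, and box numbers below are assumptions, not the textbook data): adding a pure-noise predictor cannot lower R².

```python
# Show that adding a meaningless "Box#"-style predictor never decreases R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
height = rng.uniform(0, 12, n)                      # inches over 5 ft (simulated)
weight = 103 + 6.4 * height + rng.normal(0, 10, n)  # simulated response
box_no = rng.integers(1, 5000, n)                   # pure-noise predictor

r2_height = sm.OLS(weight, sm.add_constant(height)).fit().rsquared
r2_both   = sm.OLS(weight,
                   sm.add_constant(np.column_stack([height, box_no]))).fit().rsquared

print(r2_height, r2_both)   # r2_both >= r2_height, even though box_no is nonsense
```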
15.2 Quality of Fit
Adjusted Coefficient of Determination: Adjusted R²

\bar{R}^2 = \left(R^2 - \frac{k}{n-1}\right)\left(\frac{n-1}{n-k-1}\right) = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}

R̄² permits a more equitable comparison between models of different sizes.

Each additional variable results in the loss of one degree of freedom (n − k − 1). The lower the degrees of freedom, the less reliable the estimates are likely to be.

Thus, the increase in the quality of fit needs to be compared to the decrease in the degrees of freedom. The R̄² takes this cost into account and adjusts the R² value accordingly.

When comparing models, an increase in R̄² indicates that the marginal benefit of adding a variable exceeds the cost, while a decrease in R̄² indicates that the marginal cost exceeds the benefit.

R̄² will always be less than R².
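A small computation of R̄², using the algebraically equivalent form 1 − (1 − R²)(n − 1)/(n − k − 1) and the Example 1 house-price values R² = 0.578, n = 1057, k = 2:

```python
# Adjusted R^2 from R^2, the sample size n, and the number of predictors k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.578, 1057, 2))   # about 0.577, slightly below R^2 = 0.578
```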
15.4 The Overall F Test
Test 1: Is the Overall Model Significant?
Hypotheses:
H0: β1 = β2 = · · · = βk = 0
HA: At least one βi ≠ 0
Test Statistic:
F = \frac{\text{Explained Variation}/k}{\text{Unexplained Variation}/(n-k-1)} = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}

SSR = Sum of Squares Regression


SSE = Sum of Squares Error
n = Sample Size
k = Number of Independent Variables
ANOVA Table:

Source       SS    df           MS    F-ratio
Regression   SSR   k            MSR   MSR/MSE
Residual     SSE   n − k − 1    MSE
Total        SST   n − 1
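A minimal sketch of carrying out the overall F test from the ANOVA quantities; the SSR and SSE values below are hypothetical placeholders, not numbers from the house-price output:

```python
# Overall F test: compare MSR/MSE against an F distribution with (k, n-k-1) df.
from scipy.stats import f

SSR, SSE = 4.2e12, 2.6e12     # hypothetical sums of squares
n, k = 1057, 2                # sample size and number of independent variables

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
p_value = f.sf(F, k, n - k - 1)   # P(F with (k, n-k-1) df exceeds the observed F)

print(F, p_value)             # reject H0 if p_value < alpha
```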
15.4 The Overall F Test

Test 1: Is the Overall Model Significant? (House Price Model)



15.5 Testing the Significance of an Independent Variable
Test 2: Are the Individual Variables Significant?
If the multiple regression F-test leads to a rejection of the null hypothesis, then perform a t-test for each regression coefficient.
Hypotheses:
H0: βj = 0, given all other variables are in the model
HA: βj ≠ 0, given all other variables are in the model



15.5 Testing the Significance of an Independent Variable

Test 2: Are the Individual Variables Significant? (House Price Model)



15.5 Testing the Significance of an Independent Variable

Tricky Parts of the t-test

SEs are harder to compute (let technology do it! See the sketch below.)

The meaning of a coefficient depends on the other predictors in the model:

(i) If we fail to reject H0: βj = 0 based on its t-test, it does not mean that xj has no linear relationship to y.

(ii) Rather, it means that xj contributes nothing to modeling y after allowing for the other predictors.
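A minimal sketch (with simulated data, not the house-price data) of reading the per-coefficient t statistics and p-values from a fitted model with statsmodels; each p-value tests H0: βj = 0 given the other predictors in the model:

```python
# Per-coefficient t tests in a multiple regression fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=30), rng.normal(size=30)
y = 5 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=30)   # x2 truly contributes nothing

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.tvalues)   # t statistic for each coefficient (const, x1, x2)
print(fit.pvalues)   # two-sided p-values for H0: beta_j = 0
```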



THANK YOU!

