
11

Predictive Analytics:
Regression Analysis

Business Statistics:
Communicating with Numbers, 3e

By Sanjiv Jaggia and Alison Kelly

6/16/2023
McGraw-Hill/Irwin Copyright © 2019 by The McGraw-Hill Companies, Inc. All rights reserved.
Chapter 11 Learning Objectives (LOs)
LO 11.1 The linear regression model.
LO 11.2 Estimate and interpret a simple linear
regression model.
LO 11.3 Estimate and interpret a multiple linear
regression model.
LO 11.4 Goodness-of-Fit Measures.

BUSINESS STATISTICS: COMMUNICATING WITH NUMBERS, 3e | Jaggia, Kelly Copyright © 2019


Introductory Case: Consumer Debt Payments

• A study in 2010 showed that consumers in 26 cities
made debt payments from $763 to $1,285 per month.
• Economist Madelyn Davis believes that income
differences are the main reason for the disparity.
• She is less sure about the impact of unemployment.
• She uses regression analysis to learn more.

Introductory Case: Consumer Debt Payments

• Using the sample information:
1. Determine if debt payments and income are
correlated.
2. Use regression analysis to make predictions for
debt payments for given values of income and
the unemployment rate.
3. Use various goodness-of-fit measures to
determine the regression model that best fits
the data.

11.1 The Linear Regression Model
• While the correlation coefficient may establish a linear
relationship, it does not suggest that one variable causes the
other.
• With regression analysis, we explicitly assume that
one variable, called the response variable, is
influenced by other variables, called the explanatory
variables.
• Using regression analysis, we may predict the
response variable given values for our explanatory
variables.

11.1 The Linear Regression Model
• If the value of the response variable is uniquely
determined by the values of the explanatory variables,
we say that the relationship is deterministic.
• If the relationship is inexact due to omission of relevant
(sometimes not measurable) factors, we say that the
relationship is stochastic.
• In regression analysis, we include a stochastic error
term that accounts for all variables omitted from the
deterministic component of the relationship.

11.2 The Simple Linear Regression
Model
The Simple Linear Regression Model
• The simple linear regression model is defined as

$$y = \beta_0 + \beta_1 x + \varepsilon,$$
where y and x are the response variable and the
explanatory variable, respectively, and ε is the
random error term.
• The coefficients β0 and β1 are the unknown
parameters to be estimated.

11.2 The Simple Linear Regression
Model
The Sample Regression Equation
• By fitting the sample data to the model, we obtain
the equation:

$$\hat{y} = b_0 + b_1 x,$$

where b0 and b1 are the estimates of β0 and β1.


• Since the predictions cannot be perfectly accurate,
the difference between the observed and the
predicted values of y represents the residual e; that
is, e = y − ŷ.

11.2 The Simple Linear Regression
Model
Example (Introductory Case)
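• A scatterplot of debt payments against income with
a superimposed sample regression equation:
[Figure: scatterplot of debt payments against income with the sample regression line superimposed]
• The vertical distance between y and ŷ represents the
residual, e.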

11.2 The Simple Linear Regression
Model
The Least Squares Estimates
• The method of least squares, also referred to as ordinary
least squares (OLS), is used to estimate the
parameters β0 and β1.
• The parameters β0 and β1 are estimated by minimizing
the error sum of squares, SSE:
$$SSE = \sum (y_i - \hat{y}_i)^2$$
• Using OLS, the estimated slope b1 and the intercept b0
are calculated as:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
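The slope and intercept formulas can be sketched in a few lines of Python. This is only an illustration: the function name `fit_simple_ols` and the five (x, y) pairs are hypothetical and are not the 26-city debt data from the introductory case.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")  # y_hat = 2.2000 + 0.6000 x
```

For these toy values the sums are Σ(x − x̄)(y − ȳ) = 6 and Σ(x − x̄)² = 10, so b1 = 0.6 and b0 = 4 − 0.6 × 3 = 2.2.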

11.2 The Simple Linear Regression
Model
Example (Introductory Case): Results
• The sample regression equation:
$$\widehat{\text{Debt}} = 210.2977 + 10.4411 \cdot \text{Income}$$
• The estimated slope 10.4411 suggests a positive
relationship between income and debt payments:
• If median household income increases by $1,000, consumer
debt payments are predicted to increase by $10.44
(Income is measured in $1,000s).
• The estimated intercept 210.2977 suggests that if
income is zero, then predicted debt payments are
$210.30.
• Predicted debt payments for an income of $80,000 are
$$\widehat{\text{Debt}} = 210.2977 + 10.4411 \times 80 = \$1{,}045.59.$$
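The prediction can be reproduced with a short Python sketch. The coefficients are the ones reported on the slide. The function name `predict_debt` is hypothetical, and Income is assumed to be in $1,000s, as the slide's arithmetic implies.

```python
# Estimated coefficients from the slide (Income assumed to be in $1,000s)
B0 = 210.2977
B1 = 10.4411


def predict_debt(income_thousands):
    """Predicted monthly debt payment, in dollars, from the simple model."""
    return B0 + B1 * income_thousands


prediction = predict_debt(80)  # income of $80,000
print(f"${prediction:,.2f}")   # $1,045.59
```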
11.3 The Multiple Linear Regression
Model
The Multiple Linear Regression Model
• If there is more than one explanatory variable
available, we use multiple regression.
• Multiple regression allows us to explore how
several variables influence the response
variable.
• For example, we analyzed how debt payments are
influenced by income, but ignored the possible effect
of unemployment.

11.3 The Multiple Linear Regression
Model
The Multiple Linear Regression Model
• Suppose there are k explanatory variables.
• The multiple linear regression model is defined as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon,$$
where y is the response variable, x1, x2 ,…, xk are
the k explanatory variables, and ε is the random
error term.
• The coefficients β0, β1, ... , βk are the unknown
parameters to be estimated.
11.3 The Multiple Linear Regression
Model
The Sample Regression Equation
• By fitting the sample data to the model, we obtain the
equation:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k,$$
where b0, b1,…, bk are the estimates of β0, β1, ... , βk.
• In multiple regression, there is a slight modification of
the interpretation of the slopes.
• bj measures the change in the predicted value ŷ given
a unit increase in the associated explanatory variable
xj, holding all other explanatory variables constant.
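The slides do not give closed-form estimates for more than one explanatory variable. As a rough sketch, the least-squares coefficients can be obtained by solving the normal equations (X'X)b = X'y, which minimizes SSE. The function name `fit_ols` and the data are hypothetical. The toy responses are generated without noise from y = 1 + 2·x1 + 3·x2, so the fit should recover those values.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for b0
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(xi[a] * xi[b] for xi in X) for b in range(p)]
        + [sum(xi[a] * yi for xi, yi in zip(X, y))]
        for a in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical, noise-free toy data: y = 1 + 2*x1 + 3*x2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

coefs = fit_ols(rows, y)
print([round(c, 4) for c in coefs])
```

Each slope is read as on the slide: b1 is the change in ŷ for a unit increase in x1 with x2 held constant. In practice a statistics package would be used instead of a hand-written solver.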
11.3 The Multiple Linear Regression
Model
Example (Introductory Case): Results
• The sample regression equation:
$$\widehat{\text{Debt}} = 198.9956 + 10.5122 \cdot \text{Income} + 0.6186 \cdot \text{Unemployment}$$
• The estimated slope on Income is 10.5122 and
suggests that if income increases by $1,000, then debt
payments are predicted to increase by $10.51, holding
the unemployment rate constant.
• The estimated slope on Unemployment is 0.6186 and
suggests that a 1 percentage point increase in the
unemployment rate leads to a predicted increase in
debt payments of $0.62, holding income constant.
11.3 The Multiple Linear Regression
Model
Example (Introductory Case): Results
• The sample regression equation:
$$\widehat{\text{Debt}} = 198.9956 + 10.5122 \cdot \text{Income} + 0.6186 \cdot \text{Unemployment}$$
• We can use the estimated regression equation to
predict debt payments given values of median income
and the unemployment rate.
• For example, predict debt payments if income is
$80,000 and the unemployment rate is 7.5%.
• Plug these values into the estimated regression equation:
$$\widehat{\text{Debt}} = 198.9956 + 10.5122 \times 80 + 0.6186 \times 7.5 = 1{,}044.61$$
• That is, debt payments are predicted to be $1,044.61.
11.4 Goodness-of-Fit Measures

• We will introduce three measures to judge
how well the sample regression fits the
data:
1. The standard error of the estimate.
2. The coefficient of determination, R².
3. The adjusted R².

11.4 Goodness-of-Fit Measures
The Standard Error of the Estimate
• In general, the standard error of the estimate
measures the standard deviation of the residual.
• The standard error of the estimate se is:
$$s_e = \sqrt{\frac{SSE}{n-k-1}},$$
where
$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$$
is the error sum of squares.
• se can take any value between 0 and infinity. The
closer it is to 0, the better the model fits the data.
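A minimal sketch of the calculation in Python. The function name is hypothetical, and so are the data: five toy observations together with their fitted values from a least-squares line with one explanatory variable (k = 1). They are not the debt payments data.

```python
import math


def standard_error_of_estimate(y, y_hat, k):
    """s_e = sqrt(SSE / (n - k - 1)) for k explanatory variables."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / (n - k - 1))


# Hypothetical observed values and fitted values (y_hat = 2.2 + 0.6 x, x = 1..5)
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

se = standard_error_of_estimate(y, y_hat, k=1)
print(round(se, 4))  # 0.8944
```

Here SSE = 2.4 and n − k − 1 = 3, so se = √0.8 ≈ 0.8944.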
11.4 Goodness-of-Fit Measures
The Standard Error of the Estimate
• Virtually all statistical software packages
report se.
• Excel reports se in the Regression Statistics
portion of the regression output and refers to
it as Standard Error.
• R reports se in the bottom portion of the
regression output and refers to it as Residual
standard error.
11.4 Goodness-of-Fit Measures
The Coefficient of Determination R²
• The coefficient of determination, R², quantifies the
proportion of the sample variation in the response
variable y that is explained by the sample regression
equation.
• It is computed as the ratio of the explained variation of
the response variable (SSR) to its total variation (SST):
$$R^2 = \frac{SSR}{SST},$$
where
$$SST = \sum (y_i - \bar{y})^2, \quad SSR = \sum (\hat{y}_i - \bar{y})^2, \quad SSE = \sum (y_i - \hat{y}_i)^2.$$
• The value of R² falls between 0 and 1; the closer the
value is to 1, the better the fit.
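As an illustration, the ratio SSR/SST can be computed from observed and fitted values. The function name and the data are hypothetical (the same five toy points and least-squares fitted values, not the debt data). For a least-squares fit with an intercept, SST = SSR + SSE, so 1 − SSE/SST gives the same number. The sketch prints both as a check.

```python
def r_squared(y, y_hat):
    """Coefficient of determination computed as SSR / SST."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)      # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)  # explained variation
    return ssr / sst


# Hypothetical observed values and least-squares fitted values
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)

# Cross-check with 1 - SSE/SST (valid for a least-squares fit with intercept)
y_bar = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
sst = sum((yi - y_bar) ** 2 for yi in y)
r2_check = 1 - sse / sst

print(round(r2, 4), round(r2_check, 4))  # 0.6 0.6
```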

11.4 Goodness-of-Fit Measures
The Adjusted R²
• Adding explanatory variables never lowers R² and
typically raises it.
• But some of these variables may be unimportant
and should not be in the model.
• The adjusted R² accounts for the number of
explanatory variables k included in the model.
• It is used for model selection because it imposes
a penalty for any additional explanatory variable
included in the analysis.

11.4 Goodness-of-Fit Measures
The Adjusted R²
• The adjusted R² is calculated as:
$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}.$$
• The adjusted R²:
1. Penalizes the addition of explanatory
variables to the model.
2. Is used to compare linear regressions with
different numbers of explanatory variables.
• The higher the adjusted R², the better the model.
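A short sketch of the formula and its penalty. The function name and the inputs (R² = 0.6, n = 5) are hypothetical values chosen for illustration. Holding R² fixed, a larger k gives a smaller adjusted R².

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs: same R^2 and sample size, different numbers of variables
adj_one = adjusted_r_squared(0.6, n=5, k=1)
adj_two = adjusted_r_squared(0.6, n=5, k=2)

print(round(adj_one, 4), round(adj_two, 4))  # 0.4667 0.2
```

A second explanatory variable would therefore have to raise R² above 0.6 by enough to offset the penalty before the larger model would be preferred on this measure.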
