
# BUSI 410

Business Analytics

## Module 18: Identifying Drivers of Business Outcomes

Last lecture

• Approximate 95% CI for independent-data population mean difference
• Conservative, approximate 95% CI for population proportion

Business decision in action:
Identifying drivers of business outcomes
Starbucks currently owns 24,000 retail outlets in 72 countries.
When choosing a new location, Starbucks carefully examines the
profitability of each candidate location.

Sample factors affecting revenue:
(1) Demographics (population, age, income, etc.)
(2) Nearby Starbucks stores
(3) Nearby office buildings
(4) Nearby colleges

Regression is useful in
explaining phenomena and forecasting

• Regression helps explain drivers of performance
– What drives defects in a factory?
– What drives profits of Starbucks stores?
– What determines employees’ pay?
• Regression helps forecast performance
– Profit, sales, etc.

Two types of data

• Cross-sectional data – collected on multiple subjects at the same time (e.g., October sales and street traffic density of 100 Starbucks stores)
• Time series data – collected on one subject over time (e.g., sales and street traffic density of one Starbucks store over 100 months)

Hybrid auto sales

We want to quantify the impact of gas price (independent variable) on hybrid sales (dependent variable).
Linear regression: finding the line that minimizes the mean squared vertical distance from the sample points to the line.

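The least-squares line described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Excel procedure used in this module; the function name `fit_line` and the four data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the line minimizing the mean squared vertical error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through the point of means (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample (NOT the course data): gas price in $/gal, hybrid sales in thousands.
gas_price = [2.00, 2.50, 3.00, 3.50]
hybrid_sales = [17.0, 24.0, 32.0, 39.0]

b0, b1 = fit_line(gas_price, hybrid_sales)
print(f"hybrid sales (K) = {b0:.1f} + {b1:.1f} * gas price ($/gal)")
```

Any other line through these points would have a larger mean squared vertical error than the one returned here.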
Hybrid auto sales:
Linear regression

In Excel, from the Data tab, open the Data Analysis tools and select Regression. Then specify:
1. The Input Y Range (hybrid sales)
2. The Input X Range (gas price)
3. Check the “Labels” box if appropriate.
4. Designate where the output is to go.
5. Check the “Residuals” and “Residual Plots” boxes.
6. Click OK.

Hybrid auto sales:
Regression output

Hybrid sales is the dependent variable, gas price is the independent variable, and the fitted equation for predicted hybrid sales is:
hybrid sales (K) = –13.8 + 15.2 gas price (\$/gal)
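
To see how the fitted equation is used, the sketch below plugs a gas price into it. The coefficients come from the equation above; the function name `predict_hybrid_sales` and the example price of \$3.00 are assumptions for illustration.

```python
# Coefficients from the fitted equation above.
INTERCEPT = -13.8  # thousands of units
SLOPE = 15.2       # thousands of units per $1/gal


def predict_hybrid_sales(gas_price):
    """Predicted hybrid sales (K) for a gas price in $ per gallon."""
    return INTERCEPT + SLOPE * gas_price


# Illustrative input: at $3.00/gal the equation gives -13.8 + 15.2 * 3.00 = 31.8 (K).
print(round(predict_hybrid_sales(3.00), 1))
```

The slope is the difference between predictions one dollar apart: `predict_hybrid_sales(3.00) - predict_hybrid_sales(2.00)` equals 15.2 (K), up to floating-point rounding.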

Hybrid auto sales:
Gas price drives hybrid sales
[Figure: hybrid sales (K) plotted against gas price (\$ per gallon)]
hybrid sales (K) = –13.8 + 15.2 gas price (\$/gal)

Hybrid auto sales:
Predict hybrid sales using gas price
[Figure: hybrid sales (K) plotted against gas price (\$ per gallon)]
hybrid sales (K) = –13.8 + 15.2 gas price (\$/gal)

Hybrid auto sales:
Residuals
[Figure: hybrid sales (K) plotted against gas price (\$ per gallon)]
Residual: the difference between the actual value and the prediction

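As a small worked example of the residual definition, the sketch below computes residuals as actual minus predicted, using the fitted equation from this module. The three observations are hypothetical, not the course data.

```python
def predict(gas_price):
    """Fitted equation from this module: hybrid sales (K) = -13.8 + 15.2 * gas price."""
    return -13.8 + 15.2 * gas_price


# Hypothetical observations (NOT the course data): (gas price in $/gal, actual sales in K).
observations = [(2.00, 18.0), (2.50, 22.0), (3.00, 35.0)]

# Residual = actual value - prediction. Positive means the line under-predicted.
residuals = [actual - predict(price) for price, actual in observations]

for (price, actual), res in zip(observations, residuals):
    print(f"price={price:.2f}  actual={actual:.1f}  "
          f"predicted={predict(price):.1f}  residual={res:+.1f}")
```

A point above the line has a positive residual and a point below it has a negative one.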
What exactly did we do?

• Assumed gas price can explain hybrid sales in a linear way:
$\text{hybrid sales} = \beta_0 + \beta_1 \cdot \text{gas price} + \varepsilon$
where the random variable $\varepsilon$ represents the impact of all other factors
• Found the best-fitting (minimum mean squared error, MMSE) line based on a sample
– The estimated intercept and coefficient are random variables
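
The last point, that the estimates are random variables, can be illustrated with a short simulation. This is a sketch under stated assumptions: the "true" coefficients, the noise level, the seed, and the helper `fit_line` are all hypothetical and are not taken from the course data.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical "true" model, assumed only for this simulation:
# hybrid sales = -14 + 15 * gas price + epsilon, with epsilon ~ Normal(0, 5).
TRUE_B0, TRUE_B1, NOISE_SD = -14.0, 15.0, 5.0

rng = random.Random(410)  # fixed seed so the run is repeatable
prices = [2.0 + 0.1 * i for i in range(12)]  # 12 gas prices from 2.0 to 3.1

estimates = []
for _ in range(5):
    # Each sample uses the same prices but fresh noise (the "other factors").
    sales = [TRUE_B0 + TRUE_B1 * p + rng.gauss(0, NOISE_SD) for p in prices]
    estimates.append(fit_line(prices, sales))

for b0, b1 in estimates:
    print(f"intercept={b0:7.2f}  slope={b1:6.2f}")
```

Each simulated sample yields a different intercept and slope, which is why a single fitted line comes with confidence intervals rather than exact values.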

What else can we do?

• If we assume $\varepsilon$ is normal, has mean zero, is independent of the driver, and is independent over time, then we can use the sample to
– Construct confidence intervals for our prediction of the dependent variable

Hypothesis test and CI for $\beta_1$

The p-value is for the hypothesis test:
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
(meaning this is a two-tailed p-value)

The 95% confidence interval for $\beta_1$ is included
(we can also specify any other confidence level)
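
The sketch below shows, under stated assumptions, how the test statistic and the 95% CI for the slope can be computed by hand. The 12 data points are hypothetical (not the course data), and the critical value 2.228 is the two-tailed 5% t-table value for 10 degrees of freedom, hard-coded here in place of Excel's `T.INV.2T(0.05, 10)`. The p-value itself is not computed, since it needs the t distribution, which the Python standard library does not provide.

```python
import math

# Hypothetical sample of 12 months (NOT the course data).
gas_price = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1]
hybrid_sales = [17.5, 16.0, 21.0, 19.5, 24.0, 22.0, 27.5, 25.0, 30.0, 28.5, 33.0, 31.0]

n = len(gas_price)
x_bar = sum(gas_price) / n
y_bar = sum(hybrid_sales) / n
s_xx = sum((x - x_bar) ** 2 for x in gas_price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas_price, hybrid_sales))

b1 = s_xy / s_xx         # estimated slope
b0 = y_bar - b1 * x_bar  # estimated intercept

# Standard error of the estimate: sqrt(SSE / (n - k - 1)) with k = 1 driver.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(gas_price, hybrid_sales))
s = math.sqrt(sse / (n - 2))

se_b1 = s / math.sqrt(s_xx)  # standard error of the slope
t_stat = b1 / se_b1          # test statistic for H0: beta_1 = 0

# Two-tailed 5% critical value of t with n - 2 = 10 degrees of freedom (t-table).
T_CRIT = 2.228

ci_low = b1 - T_CRIT * se_b1
ci_high = b1 + T_CRIT * se_b1
reject_h0 = abs(t_stat) > T_CRIT  # same decision as "the 95% CI excludes zero"

print(f"slope={b1:.2f}  t={t_stat:.2f}  95% CI=[{ci_low:.2f}, {ci_high:.2f}]  reject H0: {reject_h0}")
```

Rejecting $H_0$ at the 5% level and finding that the 95% CI excludes zero are the same decision, so either one can be read off the output.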

Hypothesis test and CI for $\beta_1$

• If we reject $\beta_1 = 0$: we can conclude that gas price drives hybrid sales (the driver is significant)
• If we cannot reject $\beta_1 = 0$: we cannot conclude that gas price drives hybrid sales (the driver is insignificant)
• (Leave $\beta_0$ alone!)
• The 95% CI for $\beta_1$ is [8.7, 21.7]: with 95% confidence, a \$1 increase in gas price increases monthly hybrid sales by between 8.7k and 21.7k units

Constructing prediction
interval for the dependent variable

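• For confidence level $1 - \alpha$, the following interval contains the actual value of $y$ (for a given $x$):

$$\hat{y} \pm \text{T.INV.2T}(\alpha,\, N - k - 1) \times SE$$

where $N$ is the number of observations, $k$ is the number of independent variables ($k = 1$ for simple regression), and $SE$ is the standard error of the prediction.
• Approximate 95% PI: $\hat{y} \pm 2 \times SE$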
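
A minimal sketch of the approximate 95% prediction interval, using the fitted equation from this module. The standard error of 5.0 and the gas price of \$3.00 are hypothetical values chosen for illustration; in practice the standard error comes from the regression output.

```python
def predict(gas_price):
    """Fitted equation from this module: hybrid sales (K) = -13.8 + 15.2 * gas price."""
    return -13.8 + 15.2 * gas_price


def approx_95_pi(gas_price, se):
    """Approximate 95% prediction interval: y_hat +/- 2 * SE."""
    y_hat = predict(gas_price)
    return y_hat - 2 * se, y_hat + 2 * se


SE_PREDICTION = 5.0  # hypothetical standard error (K), NOT from the course output

low, high = approx_95_pi(3.00, SE_PREDICTION)
print(f"approximate 95% PI at $3.00/gal: [{low:.1f}, {high:.1f}] (K)")
```

With these assumed inputs the prediction is 31.8 (K) and the interval is 31.8 ± 10. The multiplier 2 stands in for the exact T.INV.2T value, which is somewhat larger when the sample is small.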

R Square and Standard Error

The Regression Statistics section of the output tells us:
1. R Square – the proportion of variation in the dependent variable (y) explained by the variation in the independent variable (x)
2. Standard Error (of the estimate) – the standard deviation of the residuals of the regression, useful for constructing confidence intervals for predictions
3. Observations – the number of (x, y) data pairs used to run this linear regression
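
The three quantities above can be computed by hand, as the sketch below shows. The four data points are hypothetical (not the course data), and the formulas are the usual definitions: R Square = 1 − SSE/SST and Standard Error = sqrt(SSE / (n − k − 1)).

```python
import math

# Hypothetical sample (NOT the course data): gas price in $/gal, hybrid sales in thousands.
gas_price = [2.00, 2.50, 3.00, 3.50]
hybrid_sales = [17.0, 24.0, 32.0, 39.0]

n = len(gas_price)  # Observations
k = 1               # number of independent variables

x_bar = sum(gas_price) / n
y_bar = sum(hybrid_sales) / n
s_xx = sum((x - x_bar) ** 2 for x in gas_price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas_price, hybrid_sales))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# SSE: variation left unexplained by the line; SST: total variation in y.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(gas_price, hybrid_sales))
sst = sum((y - y_bar) ** 2 for y in hybrid_sales)

r_square = 1 - sse / sst
standard_error = math.sqrt(sse / (n - k - 1))

print(f"R Square={r_square:.4f}  Standard Error={standard_error:.4f}  Observations={n}")
```

The divisor n − k − 1 rather than n reflects the two estimated coefficients (intercept and slope) in simple regression.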

“ANOVA” –
Analysis of Variance

Significance F – the p-value for $H_0$: all slopes are zero (R Square = 0) vs. $H_1$: at least one slope is non-zero (R Square > 0). In simple linear regression, Significance F = the driver's p-value.
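
The link between Significance F and the driver's p-value can be made explicit. The following is the standard textbook form of the F statistic, given here as a supplement rather than taken from the output; $N$ is the number of observations, $k$ the number of independent variables, and $t_{\beta_1}$ the t statistic of the slope.

```latex
F = \frac{R^2 / k}{(1 - R^2) / (N - k - 1)}
\qquad \text{and, for } k = 1, \qquad
F = t_{\beta_1}^{2}
```

Because F is the square of the slope's t statistic when there is a single driver, the F test and the two-tailed t test give the same p-value.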

Regression equation in standard format

$$\widehat{\text{hybrid sales}}\ (\text{k}) = -13.8 + 15.2^{*} \times \text{gas price}\ (\$/\text{gal})$$

R Square: .5*
\* Significant at .05

• Mark the prediction (hat over the dependent variable)
• Specify units
• Specify R Square
• Specify the significance of each driver and of the model as a whole (R Square)

1. Use the required or a common significance level
2. Use multiple significance levels if necessary, and mark insignificant drivers as such

How it’s done in real life

Regression and correlation

• Correlation (correl(x,y) in Excel):

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

• In simple linear regression, R Square = $r_{xy}^2$
• Correlation ≠ regression coefficient
– Correlation measures how much two values tend to move together; bounded by [-1, 1]
– Regression coefficient measures how much an outcome changes with a unit change of a driver; unbounded
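
The sketch below checks both claims numerically on a small hypothetical sample (not the course data): R Square equals the squared correlation, while the regression coefficient is a different, unbounded quantity.

```python
import math

# Hypothetical sample (NOT the course data): gas price in $/gal, hybrid sales in thousands.
gas_price = [2.00, 2.50, 3.00, 3.50]
hybrid_sales = [17.0, 24.0, 32.0, 39.0]

n = len(gas_price)
x_bar = sum(gas_price) / n
y_bar = sum(hybrid_sales) / n
s_xx = sum((x - x_bar) ** 2 for x in gas_price)
s_yy = sum((y - y_bar) ** 2 for y in hybrid_sales)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gas_price, hybrid_sales))

# Correlation: unit-free, always between -1 and 1.
r_xy = s_xy / math.sqrt(s_xx * s_yy)

# Regression coefficient: change in y per unit change in x, in the units of y per unit of x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R Square from the fitted line: 1 - SSE / SST (SST equals s_yy).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(gas_price, hybrid_sales))
r_square = 1 - sse / s_yy

print(f"r={r_xy:.4f}  r^2={r_xy ** 2:.4f}  R Square={r_square:.4f}  slope={slope:.1f}")
```

Rescaling hybrid sales from thousands to units would multiply the slope by 1,000 but leave the correlation unchanged.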

Correlation ≠ causation

• A may cause B: Cold weather correlated with flu pandemic
• B may cause A: Flu pandemic correlated with cold weather
• C may cause A and B: Coke sales correlated with gas price
• Mixture of all: Poor education correlated with poverty
• Use intuition to build linear regression models; use caution when interpreting results

Confirming assumptions
on residuals

We assumed $\varepsilon$ is normal, has mean zero, is independent of the driver, and is independent over time. Is it so?

[Figure: residuals plotted against gas price (\$ per gallon)]
[Figure: residuals distribution vs. % if Normal; Skew: 1.2]

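One way to put a number on the normality check is the skewness of the residuals. The sketch below implements the adjusted sample skewness, the formula Excel documents for its SKEW function. Whether the skew of 1.2 reported above was computed this way is an assumption, and both residual lists are hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((v - mean) / s) ** 3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


# Hypothetical residuals (NOT the course data); both lists have mean zero.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [-3.0, -2.0, -2.0, -1.0, 8.0]

print(f"symmetric:    skew = {sample_skewness(symmetric):.2f}")
print(f"right-skewed: skew = {sample_skewness(right_skewed):.2f}")
```

A normal distribution is symmetric, so its skewness is zero. A clearly positive value means a few large positive residuals are pulling the distribution to the right, which casts doubt on the normality assumption.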
Linear regression is like
fitting a watermelon in a box…

You can always find the best fit, but sometimes even the best fit is a bad one

For next class
