
BUSI 410

Business Analytics

Module 18: Identifying Drivers of Business Outcomes

1
Last lecture

• Confidence interval for population mean

• Approximate 95% CI for independent-data population mean difference

• CI for paired-data population mean difference

• Conservative, approximate 95% CI for population proportion

2
Business decision in action:
Identifying drivers of business outcomes
Starbucks currently owns 24,000 retail outlets in 72 countries.
When choosing a new location, Starbucks carefully examines the
profitability of each candidate location.

Sample factors affecting revenue:
(1) Demographics (population, age, income, etc.)
(2) Nearby Starbucks stores
(3) Nearby office buildings
(4) Nearby colleges

3
Regression is useful in
explaining phenomena and forecasting

• Regression helps explain drivers of performance
– What drives defects in a factory?
– What drives profits of Starbucks stores?
– What determines employees’ pay?

• Regression helps forecast performance
– Profit, sales, etc.

4
Two types of data

• Cross-sectional data – collected on multiple subjects at the same time (e.g., October sales and street traffic density of 100 Starbucks stores)

• Time series data – collected on one subject over time (e.g., sales and street traffic density of one Starbucks store over 100 months)

5
Hybrid auto sales

We want to quantify the impact of gas price (independent variable) on hybrid sales (dependent variable)
Linear regression: finding the LINE that minimizes the mean squared vertical distance between the sample points and the line
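As a minimal sketch of what "the line that minimizes the mean squared vertical distance" means, the least-squares intercept and slope can be computed in closed form. The data values and the helper names `fit_line` and `mean_squared_error` are made up for illustration; they are not the lecture's hybrid-sales data or the Excel procedure.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Mean squared vertical distance from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical sample (NOT the lecture's data): gas price ($/gal), hybrid sales (K)
gas_price = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
hybrid_sales = [17.0, 20.0, 22.0, 27.0, 28.0, 32.0]

b0, b1 = fit_line(gas_price, hybrid_sales)
best = mean_squared_error(gas_price, hybrid_sales, b0, b1)
# Any other line, for example a steeper one, has a larger mean squared error
worse = mean_squared_error(gas_price, hybrid_sales, b0, b1 + 1.0)
print(f"intercept={b0:.2f}, slope={b1:.2f}, MSE={best:.3f} (steeper line: {worse:.3f})")
```

Changing the intercept or slope away from the fitted values can only increase the mean squared error, which is what "best fit" means here.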
6
Hybrid auto sales:
Linear regression

From the Data tab, open the Data Analysis tools and select Regression. Then specify:
1. The Input Y Range (hybrid sales)
2. The Input X Range (gas price)
3. Check the “Labels” box if appropriate.
4. Designate where the output is to go.
5. Check the “Residuals” and “Residual Plots” boxes.
6. Click OK.

7
Hybrid auto sales:
Regression output

Hybrid sales is the dependent variable, gas price is the independent variable, and the estimated equation for the relationship between the two is:
hybrid sales (K) = –13.8 + 15.2 gas price ($/gal)

8
Hybrid auto sales:
Gas price drives hybrid sales
[Figure: hybrid sales (K) plotted against gas price ($ per gallon)]
hybrid sales (K) = –13.8 + 15.2 gas price ($/gal)
9
Hybrid auto sales:
Predict hybrid sales using gas price
[Figure: hybrid sales (K) plotted against gas price ($ per gallon)]
hybrid sales (K) = –13.8 + 15.2 gas price ($/gal)
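The prediction step can be sketched in a few lines. The coefficients come from the regression output above; the function name `predict_hybrid_sales` and the three example prices are chosen here for illustration only.

```python
# Coefficients taken from the lecture's regression output
INTERCEPT = -13.8  # hybrid sales (K) when gas price is 0 (not meaningful on its own)
SLOPE = 15.2       # change in hybrid sales (K) per $1/gal change in gas price


def predict_hybrid_sales(gas_price):
    """Predicted hybrid sales (K) for a gas price in $/gal."""
    return INTERCEPT + SLOPE * gas_price


for price in (2.00, 2.50, 3.00):
    print(f"gas price ${price:.2f}/gal -> predicted sales {predict_hybrid_sales(price):.1f}K")
```

For example, at $2.50 per gallon the equation gives –13.8 + 15.2 × 2.50 = 24.2K. Predictions for prices far outside the range covered by the sample deserve extra caution.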
10
Hybrid auto sales:
Residuals
[Figure: hybrid sales (K) plotted against gas price ($ per gallon)]
Residual: the difference between the actual value and the prediction
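A short sketch of the residual calculation, using the fitted equation from the regression output. The three observations are hypothetical, not the lecture's data, and the helper name `residuals` is chosen for illustration.

```python
def residuals(xs, ys, intercept, slope):
    """Residual = actual value minus the line's prediction, per observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical observations (NOT the lecture's data)
gas_price = [2.0, 2.5, 3.0]        # $/gal
actual_sales = [18.0, 22.0, 35.0]  # K units

# Coefficients from the lecture's fitted equation
res = residuals(gas_price, actual_sales, -13.8, 15.2)
for x, y, e in zip(gas_price, actual_sales, res):
    side = "above" if e > 0 else "below"
    print(f"price {x:.2f}: actual {y:.1f}K, residual {e:+.1f}K ({side} the line)")
```

A positive residual means the actual point lies above the fitted line; a negative residual means it lies below.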
11
What exactly did we do?

• Assumed gas price can explain hybrid sales in a linear way:
hybrid sales = 𝛽0 + 𝛽1 ∗ gas price + 𝜀
where the random variable 𝜀 represents the impact of all other factors

• Found the best-fitting (minimum mean squared error, MMSE) line based on a sample
– The estimated intercept and coefficient are random variables (they change from sample to sample)

12
What else can we do?

• If we assume 𝜀 is normal, has mean zero, is independent of the driver, and is independent over time, then we can use the sample to
– Test whether 𝛽1 is non-zero
– Construct a confidence interval for 𝛽1
– Construct confidence intervals for our prediction of the dependent variable
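The quantities behind the slope test can be sketched as follows, under the assumptions listed above. This is an illustration on made-up data, not the Excel output: the function name `slope_inference` is hypothetical, and the interval uses the rough "estimate ± 2 standard errors" rule rather than the exact t quantile.

```python
import math


def slope_inference(xs, ys):
    """Return (slope, SE of slope, t statistic, approximate 95% CI) for simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation with n - 2 degrees of freedom (one driver)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    t_stat = b1 / se_b1  # compared against a t distribution with n - 2 df
    approx_ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)  # rough 95% interval
    return b1, se_b1, t_stat, approx_ci


# Hypothetical data (NOT the lecture's)
b1, se_b1, t_stat, ci = slope_inference([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
print(f"slope={b1:.2f}, SE={se_b1:.3f}, t={t_stat:.2f}, approx CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

A t statistic far from zero corresponds to a small two-tailed p-value, which is what Excel reports next to each coefficient.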

13
Hypothesis test and CI for 𝜷𝟏

The p-value is for the hypothesis test:
H0: 𝛽1 = 0 vs. H1: 𝛽1 ≠ 0
(meaning this is a two-tailed p-value)

95% confidence interval for 𝛽1 included
(we can also specify any other confidence level)

14
Hypothesis test and CI for 𝜷𝟏

• If we reject 𝛽1 = 0: We can conclude that gas price drives hybrid sales (driver is significant)
• If we cannot reject 𝛽1 = 0: We cannot conclude that gas price drives hybrid sales (driver is insignificant)
• (Leave 𝛽0 alone!)

• 95% CI for 𝛽1 is [8.7, 21.7]: With 95% confidence, a $1 increase in gas price is associated with an increase in expected monthly hybrid sales of between 8.7K and 21.7K units
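The two-tailed test and the confidence interval are two views of the same evidence: at the 5% level, H0: 𝛽1 = 0 is rejected exactly when the 95% CI excludes zero. A small sketch using the numbers reported in the lecture (the helper name `ci_excludes_zero` is made up for illustration):

```python
def ci_excludes_zero(lower, upper):
    """True when 0 lies outside the interval [lower, upper]."""
    return lower > 0 or upper < 0


# Values reported in the lecture
slope_estimate = 15.2
ci_lower, ci_upper = 8.7, 21.7

midpoint = (ci_lower + ci_upper) / 2  # should match the slope estimate
margin = (ci_upper - ci_lower) / 2    # margin of error around the estimate
print(f"midpoint = {midpoint:.1f}, margin of error = {margin:.1f}")
print("reject H0: beta1 = 0 at the 5% level?", ci_excludes_zero(ci_lower, ci_upper))
```

The interval is centered on the estimated slope of 15.2 with a margin of 6.5, and because it lies entirely above zero the driver is significant at the 5% level.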

15
Constructing prediction
interval for the dependent variable

• For confidence level 1 − 𝛼, the following interval contains the actual value of 𝑦 (for a given 𝑥):
$$\hat{y} \pm \text{T.INV.2T}(\alpha,\ N - k - 1) \times SE$$
where $\hat{y}$ is the (mean) prediction, $N$ is the number of observations, $k$ is the number of independent variables (= 1 for simple regression), and $SE$ is the standard error of the prediction

• Approximate 95% PI: $\hat{y} \pm 2 \times SE$
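A sketch of the approximate 95% prediction interval. The prediction uses the lecture's fitted equation at $2.50 per gallon; the standard error value of 6.0 is hypothetical and is not taken from the lecture's output. The exact interval would replace the factor 2 with the t quantile from T.INV.2T.

```python
def approx_95_prediction_interval(y_hat, se):
    """Approximate 95% prediction interval: y_hat +/- 2 * SE."""
    return y_hat - 2 * se, y_hat + 2 * se


y_hat = -13.8 + 15.2 * 2.50  # prediction from the lecture's equation at $2.50/gal
se = 6.0                     # HYPOTHETICAL standard error, not from the lecture
low, high = approx_95_prediction_interval(y_hat, se)
print(f"predicted {y_hat:.1f}K, approximate 95% PI [{low:.1f}K, {high:.1f}K]")
```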

16
R Square and Standard Error

The Regression Statistics section of the output tells us:
1. R Square – the proportion of variation in the dependent variable (y) explained by the variation in the independent variable (x)
2. Standard Error (of the estimate) – the standard deviation of the residuals of the regression, useful for constructing confidence intervals
3. Observations – the number of (x, y) data pairs used to run this linear regression
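These three statistics can be reproduced from the actual and fitted values. The sketch below assumes the fitted values come from a least-squares line with an intercept, so that R Square equals 1 − SSE/SST; the data and the function name `regression_statistics` are made up for illustration.

```python
import math


def regression_statistics(ys, y_hats, k=1):
    """Return (R Square, standard error of the estimate, observations)."""
    n = len(ys)
    y_bar = sum(ys) / n
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    r_square = 1 - sse / sst
    # Standard deviation of the residuals, using n - k - 1 degrees of freedom
    std_error = math.sqrt(sse / (n - k - 1))
    return r_square, std_error, n


# Hypothetical actual values and least-squares fitted values (NOT the lecture's data)
actual = [2.0, 4.0, 5.0, 8.0]
fitted = [1.9, 3.8, 5.7, 7.6]
r2, se, n = regression_statistics(actual, fitted)
print(f"R Square = {r2:.3f}, Standard Error = {se:.3f}, Observations = {n}")
```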

17
“ANOVA” –
Analysis of Variance

Significance F – the p-value for H0: all slopes are zero (R Square = 0) vs. H1: at least one slope is non-zero (R Square > 0). In simple linear regressions, Significance F = driver p-value

18
Regression equation in
standard format
$$\widehat{\text{hybrid sales}}\ (\text{K}) = -13.8 + 15.2^{*} \times \text{gas price (\$/gal)}$$
R Square: .5*
* Significant at .05

• Mark the prediction (hat over the dependent variable)
• Specify units
• Specify R Square
• Specify the significance of each driver and of the model as a whole (R Square)

1. Use the required or a common significance level
2. Use multiple significance levels if necessary, including a label for insignificant results
19
How it’s done in real life

20
Regression and correlation

• Correlation (correl(x,y) in Excel):
$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
• In simple linear regressions, R Square = $r_{xy}^2$
• Correlation ≠ regression coefficient
– Correlation measures how much two values tend to
move together; bounded by [-1,1]
– Regression coefficient measures how much an
outcome changes with a unit change of a driver;
unbounded
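One way to see the difference is to change the units of the outcome: the correlation stays the same, while the regression coefficient scales with the units. The sketch below uses made-up data and hypothetical helper names `correlation` and `slope`.

```python
import math


def correlation(xs, ys):
    """Pearson correlation between xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope(xs, ys):
    """Least-squares regression coefficient of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data (NOT the lecture's)
x = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
y = [17.0, 20.0, 22.0, 27.0, 28.0, 32.0]
y_in_units = [1000 * v for v in y]  # same sales, recorded in units instead of K

print(correlation(x, y), correlation(x, y_in_units))  # correlation is unchanged
print(slope(x, y), slope(x, y_in_units))              # slope is 1000 times larger
```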

21
Correlation ≠ causation

• A may cause B: Cold weather correlated with flu pandemic
• B may cause A: Flu pandemic correlated with cold
weather
• C may cause A and B: Coke sales correlated with
gas price
• Mixture of all: Poor education correlated with
poverty
• Use intuition to build linear regression models; use
caution when interpreting results
22
Confirming assumptions
on residuals

We assumed 𝜀 is normal, has mean zero, is independent of the driver, and is independent over time. Is it so?
[Figure: residuals plotted against gas price ($ per gallon), and the residuals' distribution compared with the % expected if normal. Skew: 1.2]
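Two of these checks, mean zero and symmetry, can be sketched numerically. The skewness below uses the adjusted sample formula that Excel's SKEW function applies; whether the slide's skew figure was computed that way is an assumption. The residual values are hypothetical, not the lecture's.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((v - mean) / s) ** 3)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std. dev.
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


# HYPOTHETICAL residuals with one large positive value
residuals = [-4.0, -3.0, -2.5, -1.0, -0.5, 0.5, 1.5, 9.0]
mean_residual = sum(residuals) / len(residuals)
print(f"mean residual = {mean_residual:.2f}, skew = {sample_skewness(residuals):.2f}")
```

A skew near zero is consistent with a symmetric, normal-looking distribution; a clearly positive skew points to a few large positive residuals and casts doubt on the normality assumption.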
23
Linear regression is like
fitting a watermelon in a box…

You can always find the best fit, but sometimes even the best fit is a bad one
24
For next class

• Group Case Project due 11/7 prior to class

• Read Chapter 10 of the textbook

25