You are on page 1of 45

Chapter

Linear Regression

Shirin Aslani, Fall 2021

1
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The statistical Sommelier

• Large differences in price and quality between years,


although wine is produced in a similar way

• Meant to be aged, so hard to tell if wine will be good when


it is on the market

• Expert tasters predict which ones will be good

• Can analytics be used to come up with a different system


for judging wine?

2
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The statistical Sommelier

• Large differences in price and quality between years,


although wine is produced in a similar way

• Meant to be aged, so hard to tell if wine will be good when


it is on the market

• Expert tasters predict which ones will be good

• Can analytics be used to come up with a different system


for judging wine?

• March 1990 – Orley Ashenfelter, a Princeton economics


professor, claims he can predict wine quality without tasting
the wine

3
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The statistical Sommelier

• Ashenfelter used a method called linear regression


• Predicts an outcome variable, or dependent variable
• Predicts using a set of independent variables

• Dependent variable: typical price in 1990-1991 wine


auctions (approximates quality)

• Independent variables:
• Age – older wines are more expensive
• Weather

• Average Growing Season Temperature


• Harvest Rain
• Winter Rain

4
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The statistical Sommelier

• Robert Parker, the world's most influential wine expert:

“Ashenfelter is an absolute total sham”

• “rather like a movie critic who never goes to see the movie
but tells you how good it is based on the actors and the
director”

5
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
One-Variable Linear Regression

8.5

8
Logarithm of Price

7.5

6.5

6
14.5 15 15.5 16 16.5 17 17.5 18

Average Growing Season Temp

6
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
One-Variable Linear Regression

8.5

8
Logarithm of Price

7.5

6.5

6
14.5 15 15.5 16 16.5 17 17.5 18

Average Growing Season Temp

7
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
One-Variable Linear Regression

8.5

8
Logarithm of Price

7.5

6.5

6
14.5 15 15.5 16 16.5 17 17.5 18

Average Growing Season Temp

8
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Regression Model

• The best model (choice of coefficients) has the


smallest error terms

9
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Selecting the Best Model

10
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Selecting the Best Model

11
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Other Error Measures

• SSE can be hard to interpret


• Depends on N
• Units are hard to understand

• Root-Mean-Square Error (RMSE)

• Normalized by N, units of dependent variable


12
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
R

• Compares the best


model to a “baseline”
model
• The baseline model
does not use any
variables
• Predicts same outcome
(price) regardless of
the independent
variable (temperature)

13
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
R

14
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
15
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Available Independent Variables

• Many different independent variables could be


used
• Average Growing Season Temperature
• Harvest Rain
• Winter Rain
• Age of Wine (in 1990)
• Population of France

16
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Available Independent Variables

17
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Regression Model

• Multiple linear regression model with k variables

• Best model coefficients selected to minimize SSE

18
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Adding Variables

• Adding more variables can improve the model


• Diminishing returns as more variables are added

19
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Adding Variables

• Not all available variables should be used


• Each new variable requires more data
• Causes overfitting: high R2 on data used to create
model, but bad performance on unseen data

20
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Correlation

• Correlation measures linear association, not


causation.
• The correlation (usually denoted by r) between
two variables (call them x and y) is a unit-free
measure of the strength of the linear relationship
between x and y.
• The correlation between any two variables is
always between –1 and +1.
• Although the exact formula used to compute the
correlation between two variables isn’t very
important, interpreting the correlation between
the variables is.
• A correlation near +1 means that x and y have a
strong positive linear relationship. 21
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Correlation

22
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Predictive Ability

23
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
24
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Result

• Parker:
• 1986 is “very good to sometimes exceptional”
• Ashenfelter:
• 1986 is mediocre
• 1989 will be “the wine of the century” and 1990 will be
even better!
• In wine auctions,
• 1989 sold for more than twice the price of 1986
• 1990 sold for even higher prices!
• Later, Ashenfelter predicted 2000 and 2003
would be great
• Parker has stated that “2000 is the greatest
vintage Bordeaux has ever
• produced” 25
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

• Meadowfoam is a small plant found growing in moist


meadows of the U.S. Pacific Northwest.
• It has been domesticated at Oregon State University for its
seed oil, which is unique among vegetable oils for its long
carbon strings which is nongreasy and highly stable.
• One study in a series designed to find out how to elevate
meadowfoam production to a profitable crop.
• In a controlled growth chamber, they focused on the effects
of two light-related factors:
• light intensity, 150, 300, 450, 600, 750, and 900 µmol/m2/sec;
• the timing of the onset of the light treatment, either at
photoperiodic floral induction (PFI)— the time at which the
photo period was increased from 8 to 16 hours per day to
induce flowering—or 24 days before PFI.

26
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

27
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

• 12 treatment groups—6 light intensities at 2 timing levels.


• Ten seedlings randomly assigned to each treatment group.
• The number of flowers per plant is the primary measure of
production, measured by averaging the numbers of flowers
produced by the 10 seedlings in each group.
• The entire experiment was repeated.
• No difference was found between the first and second runs
of the experiment.

• The 2 observations in each cell are thought of as


independent replicates under the specified conditions.
• What are the effects of differing light intensity levels?
• What is the effect of the timing? 28
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

Statistical Conclusion

• Increasing light intensity decreased the mean number of


flowers per plant by an estimated 4.0 flowers per plant per
100 µmol/m2/sec (95% confidence interval from 3.0 to 5.1).
• Beginning the light treatments 24 days prior to PFI increased
the mean numbers of flowers by an estimated 12.2 flowers
per plant (95% confidence interval from 6.7 to 17.6).
29
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

Statistical Conclusion

• The data provide no evidence that the effect of light intensity


depends on the timing of its initiation (two-sided p-
value=0.91, from a t -test for interaction, 20 degrees of
freedom).

30
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Effects of light on Meadowfoam Flowering

Scope of Inference
• The researchers can infer that the effects above were caused
by the light intensity and timing manipulations, because this
was a randomized experiment.

31
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Multiple Linear Regression

• In multiple regression there is a single response variable


and multiple explanatory variables.

• Although they should not be thought of as the truth, one or


two models may adequately approximate the mean of the
response as a function of the explanatory variables, and
conveniently allow for the questions of interest to be
investigated.

32
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Interpretation of Regression Coefficients

33
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
An Indicator Variable to Distinguish Between Two
Groups
• An indicator variable (or dummy variable) takes on one of two
values: “1” (one) indicates that an attribute is present, and “0”
(zero) indicates that the attribute is absent.

• Consider the regression model

• If time=0, then early = 0, and the regression line is

• If time = 24, then early = 1, and the regression line is

• Because the slopes are the same the model is called the parallel
lines regression model 34
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
MONEYBALL

• Moneyball tells the story of the


Oakland A’s in 2002
• One of the poorest teams in baseball
• New ownership and budget cuts in 1995
• But they were improving

• How were they doing it?


• Was it just luck?
• In 2002, the A’s lost three key
players
• Could they continue winning?
35
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Problem

• Rich teams have four times


the salary of poor teams
• The Oakland A’s can’t
afford the all-stars, but they
are still making it to the
playoffs. How?
• They take a quantitative
approach and find
undervalued players

36
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
A Different Approach

• The A’s started using a different method to select


players
• The traditional way was through scouting
• Scouts would go watch high school and college players
• Report back about their skills
• A lot of talk about speed and athletic build
• The A’s selected players based on their statistics,
not on their looks
“The statistics enabled you to find your way
past all sorts of sight-based scouting prejudices.”

37
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Billy Beane

• The general manager since 1997


• Played major league baseball, but never made it
big
• Billy Beane succeeded in using analytics
• Had a management position
• Understood the importance of statistics – hired Paul
DePodesta (a Harvard graduate) as his assistant
• Didn’t care about being ostracized

38
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Taking a Quantitative View

• Paul DePodesta spent a lot of time looking at the


data
• His analysis suggested that some skills were
undervalued and some skills were overvalued
• If they could detect the undervalued skills, they
could find players at a bargain

39
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Taking a Quantitative View

40
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Making it to the Playoffs

• How many games does a team need to win in the


regular season to make it to the playoffs?
• “Paul DePodesta reduced the regular season to a
math problem. He judged how many wins it
would take to make it to the playoffs: 95.”

41
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Scoring Runs

• How does a team score more runs?


• The A’s discovered that two baseball statistics
were significantly more important than anything
else
• On-Base Percentage (OBP)
• Percentage of time a player gets on base
(including walks)
• Slugging Percentage (SLG)
• How far a player gets around the bases on
his turn (measures power)

42
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Predicting Runs and Wins

• Can we predict how many games the 2002


Oakland A’s will win using our models?
• The models for runs use team statistics
• Each year, a baseball team is different
• We need to estimate the new team statistics
using past player performance
• Assumes past performance correlates with future
performance
• Assumes few injuries
• We can estimate the team statistics for 2002 by
using the 2001 player statistics

43
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
Predicting Runs Scored

• At the beginning of the 2002 season, the Oakland


A’s had 24 batters on their roster
• Using the 2001 regular season statistics for these
players
• Team OBP is 0.339
• Team SLG is 0.430

• At the beginning of the 2002 season, the Oakland


A’s had 17 pitchers on their roster
• Using the 2001 regular season statistics for these
players
• Team OOBP is 0.307
• Team OSLG is 0.373
44
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology
The Goal of a Baseball Team

• Why isn’t the goal to win the World Series?


• Billy and Paul see their job as making sure the
team makes it to the playoffs – after that all bets
are off
• The A’s made it to the playoffs in 2000, 2001, 2002,
2003
• But they didn’t win the World Series
• Why?

45
Business Analytics– Dr. Shirin Aslani – GSME, Sharif University of Technology

You might also like