
Travel Behavior

(LV 0000001540)

Session 3
21 November 2022
Regression Analysis

Rolf Moeckel | Professor of Travel Behavior | Department of Mobility Systems Engineering | Technical University of Munich
Statistical learning
Shown are Sales vs. TV, Radio and Newspaper advertising budgets, with a blue linear-regression line fit separately to each.

Can we predict Sales using these three? Perhaps we can do better using a model
Sales = f(TV, Radio, Newspaper)
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
Notation (1)
Here, Sales is a dependent variable that we wish to predict. We generically refer to the response as 𝑌.

TV is an independent variable that we call X₁. We name Radio as X₂, and Newspaper as X₃.

We can refer to the input vector collectively as

X = (X₁, X₂, X₃)ᵀ

Now we write our model as

Y = f(X) + ε

where ε captures measurement errors and other discrepancies.

Notation (2)
Is there an ideal f(x)? In particular, what is a good value for f(x) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (or average) of Y given X = 4.

This ideal f(x) = E(Y | X = x) is called the regression function.
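As a quick numerical illustration (a sketch with simulated data; the value f(4) = 10 and the noise level are assumptions, not from the lecture), E(Y | X = 4) can be approximated by averaging many Y values drawn at X = 4:

```python
import random

random.seed(0)

# Assume the true regression function gives f(4) = 10 and the noise is N(0, 1).
f_at_4 = 10.0
ys = [f_at_4 + random.gauss(0, 1) for _ in range(100_000)]

# E(Y | X = 4) is approximated by the sample average of Y at X = 4.
estimate = sum(ys) / len(ys)
print(round(estimate, 2))  # close to f(4) = 10
```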

Regression function 𝑓(𝑋)
It is similarly defined for a vector X; e.g.

f(x) = f(x₁, x₂, x₃) = E(Y | X₁ = x₁, X₂ = x₂, X₃ = x₃)

The ideal or optimal predictor of Y with regard to mean-squared prediction error is f(x) = E(Y | X = x): it is the function that minimizes E[(Y − f̂(X))² | X = x] over all functions f̂ at all points X = x.

Y stands for the observed value (the “true” answer). The difference between the observed value and the prediction is squared so that every deviation counts as positive and large errors are emphasized.

Regression function 𝑓(𝑋)
ε = Y − f(x) is the irreducible error, i.e. even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.

For any estimate f̂(x) of f(x), we have

E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε)

The first term is the reducible error (the difference between the regression function and our estimate); the second is the irreducible error (the spread of the observed values around the regression function).
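This decomposition can be checked with a small simulation (the values for f(x), the estimate and the noise level are assumptions for illustration):

```python
import random

random.seed(1)

f_x = 5.0      # true regression function value at X = x (assumed)
f_hat_x = 5.5  # an imperfect estimate of f(x)
sigma = 2.0    # standard deviation of the irreducible error eps

# Draw many Y values at X = x and accumulate the squared prediction errors.
sq_errors = [(f_x + random.gauss(0, sigma) - f_hat_x) ** 2
             for _ in range(200_000)]
mse = sum(sq_errors) / len(sq_errors)

reducible = (f_x - f_hat_x) ** 2  # (f(x) - f_hat(x))^2 = 0.25
irreducible = sigma ** 2          # Var(eps) = 4.0
print(round(mse, 2))              # close to reducible + irreducible = 4.25
```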

How to estimate 𝑓(𝑋)
Typically, we have few if any data points with X = 4 exactly. Therefore, we cannot compute E(Y | X = x). Instead, relax the definition and let

f̂(x) = Ave(Y | X ∈ N(x))

where N(x) is some neighborhood of x.
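A minimal sketch of this neighborhood average on simulated data (the linear relationship Y = 2X + ε and the window width are assumptions for illustration):

```python
import random

random.seed(2)

# Simulated observations from an assumed relationship Y = 2X + eps.
data = []
for _ in range(5_000):
    x = random.uniform(0, 10)
    data.append((x, 2 * x + random.gauss(0, 1)))

def f_hat(x, width=0.25):
    """Estimate E(Y | X = x) by averaging y_i over the neighborhood N(x)."""
    ys = [y for (xi, y) in data if abs(xi - x) <= width]
    return sum(ys) / len(ys)

print(round(f_hat(4.0), 1))  # close to the true value 2 * 4 = 8
```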

Limitations of nearest neighbor averaging
Nearest neighbor averaging can be pretty good for small p (= number of independent variables), e.g. p ≤ 4, and a large-ish sample size N.

Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.

• We need a reasonable fraction of the N values of yᵢ to average, e.g. 10%, in order to bring the variance down.
• A 10% neighborhood in high dimensions (i.e., when the number of independent variables p is large) need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
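The effect can be quantified: a hyper-cube containing 10% of uniformly distributed data must span 0.1^(1/p) of each axis in p dimensions, so the "neighborhood" soon covers almost the whole range of every variable:

```python
# Fraction of each axis a cube must span to hold 10% of the unit hyper-cube.
for p in (1, 2, 10, 100):
    side = 0.1 ** (1 / p)
    print(f"p = {p:3d}: side = {side:.2f}")
# p=1: 0.10, p=2: 0.32, p=10: 0.79, p=100: 0.98
```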

Curse of dimensionality

Figure: a 10% neighborhood in one dimension (x₁) versus a 10% neighborhood in two dimensions (x₁ and x₂).
Linear regression model

f_L(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ

• A linear model is specified in terms of p + 1 parameters β₀, β₁, β₂, …, βₚ.


• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation
to the unknown true function 𝑓 𝑋 .
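Fitting such a model to training data can be sketched as a least-squares problem (the data are simulated; the true parameters β = (1, 2, 3) are an assumption used only to check the result):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data with assumed true parameters beta = (1, 2, 3).
n = 1_000
X = rng.uniform(0, 1, size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 0.1, n)

# Add an intercept column and estimate the p + 1 parameters by least squares.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta_hat, 1))  # close to [1., 2., 3.]
```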

Fitting a model to training data
A linear model

f̂_L(X) = β̂₀ + β̂₁X

gives a reasonable fit.

A quadratic model

f̂_Q(X) = β̂₀ + β̂₁X + β̂₂X²

may fit slightly better.
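On data with curvature, the quadratic term reduces the training error; a sketch with an assumed true curve (the coefficients and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from an assumed curved relationship.
x = rng.uniform(-2, 2, 300)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.2, 300)

def training_mse(degree):
    """Fit a polynomial of the given degree by least squares; return its MSE."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

mse_linear = training_mse(1)     # f_hat_L: beta0 + beta1 * x
mse_quadratic = training_mse(2)  # f_hat_Q: adds beta2 * x^2
print(mse_linear > mse_quadratic)  # True: the quadratic fits better here
```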

Trade-offs of fitting data
Assume red dots represent
observed data, and the
blue surface was fitted to
these data.

income = f(education, seniority) + ε

Linear fitting of data
A linear regression fits the data fairly well but misses non-linear relationships between years of education and income.

income = f̂(education, seniority) + ε

Overfitting the data
It is also possible to overfit the data by replicating every little dent in the surface (here achieved with a spline regression). Such a model adjusts to any irregularity in the data, which is usually not desired: it would replicate measurement errors and model misspecifications.

income = f̂(education, seniority) + ε
Trade-offs to make when fitting data
• Prediction accuracy versus interpretability: Linear models are easy to interpret; thin-plate splines are not.
• Parsimony versus black-box: We often prefer a simpler model involving fewer variables over a black-box
predictor involving them all.
• Good fit versus over-fit or under-fit: How do we know when the fit is just right?

Assessing model accuracy
Suppose we fit a model f̂(x) to some training data Tr = {(xᵢ, yᵢ), i = 1, …, N}, and we wish to see how well it performs. We can compute the average squared prediction error over the training data Tr:

MSE_Tr = Ave_{i∈Tr} (yᵢ − f̂(xᵢ))²

This may be biased towards models that overfit. Instead, we should – when possible – compute it using fresh test data Te:

MSE_Te = Ave_{i∈Te} (yᵢ − f̂(xᵢ))²

To create test data: data records are randomly sampled, and – for example – 80% of all records are used for model estimation, while the remaining records, which were not used for estimation, are used for model testing.
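The 80/20 split and the two error measures can be sketched as follows (the records are simulated; the linear relationship and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# 500 simulated records from an assumed linear relationship.
x = rng.uniform(0, 10, 500)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, 500)

# Randomly sample 80% of the records for estimation; hold out the rest.
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

# Fit a linear model on the training records only.
coefs = np.polyfit(x[train], y[train], 1)

mse_tr = float(np.mean((y[train] - np.polyval(coefs, x[train])) ** 2))
mse_te = float(np.mean((y[test] - np.polyval(coefs, x[test])) ** 2))
print(round(mse_tr, 1), round(mse_te, 1))  # both near Var(eps) = 1.0
```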

Danger of overfitting a model (Example 1)
Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full training data set) plotted against model flexibility, from simpler to more complex models; the yellow line marks the linear model.

Black curve is “truth.” Orange, blue and green curves/squares correspond to fits of different flexibility.

Danger of overfitting a model (Example 2)

Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full training data set) plotted against model flexibility, from simpler to more complex models.
Here, the truth is smoother (or simpler). The linear model does really well. Simple models are generally preferred over complex models.
Danger of overfitting a model (Example 3)

Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full data set) plotted against model flexibility, from simpler to more complex models.

Here, the truth is wiggly and the noise is low, so the more flexible fits do the best job.
Bias-variance trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x₀, y₀) be a test observation drawn from the population. If the true model is Y = f(X) + ε with f(x) = E(Y | X = x), then

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem by a much simpler model.

Figure: observations scattered around the true relationship; the estimated curve differs from the true relationship by the bias, while ε separates the observations from the true relationship.

There is a bias-variance trade-off. Typically, as f̂ becomes more complex, its variance increases and its bias decreases.
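A small simulation can make variance and bias concrete. Here a deliberately simple, high-bias estimator – predicting the overall mean of y regardless of x, chosen for illustration – is refit on many training sets drawn from an assumed true model:

```python
import random
import statistics

random.seed(3)

def f(x):  # assumed true regression function
    return x ** 2

def draw_training_set(n=50):
    """Draw one training set from the assumed population model Y = f(X) + eps."""
    return [(x, f(x) + random.gauss(0, 0.5))
            for x in (random.uniform(0, 2) for _ in range(n))]

x0 = 1.0  # the test point at which we evaluate the estimator
preds = []
for _ in range(2_000):
    data = draw_training_set()
    # High-bias, low-variance estimator: ignore x and predict the mean of y.
    preds.append(sum(y for _, y in data) / len(data))

variance = statistics.pvariance(preds)  # spread across training sets
bias = statistics.mean(preds) - f(x0)   # systematic offset at x0
print(round(bias, 2), round(variance, 3))
```

Because the estimator barely reacts to any single training set, its variance is small, but it is systematically off at x₀: the trade-off from the formula above.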
Bias-variance trade-offs for the three examples
Example 1 | Example 2 | Example 3

MSE = Var + Bias² + Var(ε)

Here, Bias is the difference between the true curve and the estimated curve, and Var is the variability of the estimated curve across training sets. The irreducible error Var(ε) sets a floor below which the test MSE cannot fall.

Figure: decomposition of the test MSE into variance, squared bias and irreducible error ε for each example, plotted from simpler to more complex models.

