
Travel Behavior

(LV 0000001540)

Session 3
21 November 2022
Regression Analysis

Rolf Moeckel | Professor of Travel Behavior | Department of Mobility Systems Engineering | Technical University of Munich
Statistical learning
Shown are Sales vs. TV, Radio and Newspaper advertising budgets, with a blue linear-regression line fit separately to each.

Can we predict Sales using these three? Perhaps we can do better using a model
Sales = f(TV, Radio, Newspaper)
Source: Trevor Hastie and Robert Tibshirani (2014) Statistical Learning
Notation (1)
Here, Sales is a dependent variable that we wish to predict. We generically refer to the response as 𝑌.

TV is an independent variable that we call X₁. We name Radio as X₂, and Newspaper as X₃.

We can refer to the input vector collectively as

X = (X₁, X₂, X₃)ᵀ

Now we write our model as

Y = f(X) + ε

where ε captures measurement errors and other discrepancies.

Notation (2)
Is there an ideal f(x)? In particular, what is a good value for f(x) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (or average) of Y given X = 4.

This ideal f(x) = E(Y | X = x) is called the regression function.
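As a quick numerical illustration (a sketch with simulated data; the value f(4) = 10 and the noise level are assumptions, not from the lecture), E(Y | X = 4) can be approximated by averaging many Y values drawn at X = 4:

```python
import random

random.seed(0)

# Assume the true regression function gives f(4) = 10 and the noise is N(0, 1).
f_at_4 = 10.0
ys = [f_at_4 + random.gauss(0, 1) for _ in range(100_000)]

# E(Y | X = 4) is approximated by the sample average of Y at X = 4.
estimate = sum(ys) / len(ys)
print(round(estimate, 2))  # close to f(4) = 10
```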

Regression function 𝑓(𝑋)
It is similarly defined for a vector X; e.g.

f(x) = f(x₁, x₂, x₃) = E(Y | X₁ = x₁, X₂ = x₂, X₃ = x₃)

The ideal or optimal predictor of Y with regard to mean-squared prediction error is f(x) = E(Y | X = x): it is the function that minimizes E[(Y − f̂(X))² | X = x] over all functions f̂ at all points X = x.

Y stands for the observed value (the “true” answer). The difference between the observed value and the prediction is squared so that every deviation counts as positive and large errors are emphasized.

Regression function 𝑓(𝑋)
ε = Y − f(x) is the irreducible error, i.e. even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.

For any estimate f̂(x) of f(x), we have

E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε)

The first term is the reducible error (the difference between the regression function and our estimate); the second is the irreducible error (the spread of the observed values around the regression function).
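This decomposition can be checked with a small simulation (the values for f(x), the estimate and the noise level are assumptions for illustration):

```python
import random

random.seed(1)

f_x = 5.0      # true regression function value at X = x (assumed)
f_hat_x = 5.5  # an imperfect estimate of f(x)
sigma = 2.0    # standard deviation of the irreducible error eps

# Draw many Y values at X = x and accumulate the squared prediction errors.
sq_errors = [(f_x + random.gauss(0, sigma) - f_hat_x) ** 2
             for _ in range(200_000)]
mse = sum(sq_errors) / len(sq_errors)

reducible = (f_x - f_hat_x) ** 2  # (f(x) - f_hat(x))^2 = 0.25
irreducible = sigma ** 2          # Var(eps) = 4.0
print(round(mse, 2))              # close to reducible + irreducible = 4.25
```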

How to estimate 𝑓(𝑋)
Typically, we have few if any data points with X = 4 exactly. Therefore, we cannot compute E(Y | X = x). Instead, relax the definition and let

f̂(x) = Ave(Y | X ∈ N(x))

where N(x) is some neighborhood of x.
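A minimal sketch of this neighborhood average on simulated data (the linear relationship Y = 2X + ε and the window width are assumptions for illustration):

```python
import random

random.seed(2)

# Simulated observations from an assumed relationship Y = 2X + eps.
data = []
for _ in range(5_000):
    x = random.uniform(0, 10)
    data.append((x, 2 * x + random.gauss(0, 1)))

def f_hat(x, width=0.25):
    """Estimate E(Y | X = x) by averaging y_i over the neighborhood N(x)."""
    ys = [y for (xi, y) in data if abs(xi - x) <= width]
    return sum(ys) / len(ys)

print(round(f_hat(4.0), 1))  # close to the true value 2 * 4 = 8
```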

Limitations of nearest neighbor averaging
Nearest neighbor averaging can be pretty good for small p (= number of independent variables), e.g. p ≤ 4, and a large-ish sample size N.

Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.

• We need a reasonable fraction of the N values of yᵢ to average, e.g. 10%, in order to bring the variance down.
• A 10% neighborhood in high dimensions (i.e., when the number of independent variables p is large) need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
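The effect can be quantified: a hyper-cube containing 10% of uniformly distributed data must span 0.1^(1/p) of each axis in p dimensions, so the "neighborhood" soon covers almost the whole range of every variable:

```python
# Fraction of each axis a cube must span to hold 10% of the unit hyper-cube.
for p in (1, 2, 10, 100):
    side = 0.1 ** (1 / p)
    print(f"p = {p:3d}: side = {side:.2f}")
# p=1: 0.10, p=2: 0.32, p=10: 0.79, p=100: 0.98
```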

Curse of dimensionality

Figure: a 10% neighborhood in one dimension (x₁) versus a 10% neighborhood in two dimensions (x₁ and x₂).
Linear regression model

f_L(X) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ

• A linear model is specified in terms of p + 1 parameters β₀, β₁, β₂, …, βₚ.


• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation
to the unknown true function 𝑓 𝑋 .
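Fitting such a model to training data can be sketched as a least-squares problem (the data are simulated; the true parameters β = (1, 2, 3) are an assumption used only to check the result):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data with assumed true parameters beta = (1, 2, 3).
n = 1_000
X = rng.uniform(0, 1, size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 0.1, n)

# Add an intercept column and estimate the p + 1 parameters by least squares.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta_hat, 1))  # close to [1., 2., 3.]
```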

Fitting a model to training data
A linear model

f̂_L(X) = β̂₀ + β̂₁X

gives a reasonable fit.

A quadratic model

f̂_Q(X) = β̂₀ + β̂₁X + β̂₂X²

may fit slightly better.
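On data with curvature, the quadratic term reduces the training error; a sketch with an assumed true curve (the coefficients and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from an assumed curved relationship.
x = rng.uniform(-2, 2, 300)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.2, 300)

def training_mse(degree):
    """Fit a polynomial of the given degree by least squares; return its MSE."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

mse_linear = training_mse(1)     # f_hat_L: beta0 + beta1 * x
mse_quadratic = training_mse(2)  # f_hat_Q: adds beta2 * x^2
print(mse_linear > mse_quadratic)  # True: the quadratic fits better here
```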

Trade-offs of fitting data
Assume red dots represent
observed data, and the
blue surface was fitted to
these data.

income = f(education, seniority) + ε

Linear fitting of data
A linear regression fits the data fairly well but misses non-linear relationships between years of education and income.

income = f̂(education, seniority) + ε

Overfitting the data
It is also possible to overfit the data by replicating every little dent in the surface (here achieved with a spline regression). Such a model adjusts to any irregularity in the data, which is usually not desired: it would replicate measurement errors and model misspecifications.

income = f̂(education, seniority) + ε
Trade-offs to make when fitting data
• Prediction accuracy versus interpretability: Linear models are easy to interpret; thin-plate splines are not.
• Parsimony versus black-box: We often prefer a simpler model involving fewer variables over a black-box
predictor involving them all.
• Good fit versus over-fit or under-fit: How do we know when the fit is just right?

Assessing model accuracy
Suppose we fit a model f̂(x) to some training data Tr = {(xᵢ, yᵢ), i = 1, …, N}, and we wish to see how well it performs. We can compute the average squared prediction error over the training data Tr:

MSE_Tr = Ave_{i∈Tr} (yᵢ − f̂(xᵢ))²

This may be biased towards models that overfit. Instead, we should – when possible – compute it using fresh test data Te:

MSE_Te = Ave_{i∈Te} (yᵢ − f̂(xᵢ))²

To create test data: data records are randomly sampled, and – for example – 80% of all records are used for model estimation, while the remaining records, which were not used for estimation, are used for model testing.
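The 80/20 split and the two error measures can be sketched as follows (the records are simulated; the linear relationship and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# 500 simulated records from an assumed linear relationship.
x = rng.uniform(0, 10, 500)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, 500)

# Randomly sample 80% of the records for estimation; hold out the rest.
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

# Fit a linear model on the training records only.
coefs = np.polyfit(x[train], y[train], 1)

mse_tr = float(np.mean((y[train] - np.polyval(coefs, x[train])) ** 2))
mse_te = float(np.mean((y[test] - np.polyval(coefs, x[test])) ** 2))
print(round(mse_tr, 1), round(mse_te, 1))  # both near Var(eps) = 1.0
```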

Danger of overfitting a model (Example 1)
Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full training data set) plotted against model flexibility, from simpler to more complex models; the yellow line marks the linear model.

Black curve is “truth.” Orange, blue and green curves/squares correspond to fits of different flexibility.

Danger of overfitting a model (Example 2)

Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full training data set) plotted against model flexibility, from simpler to more complex models.
Here, the truth is smoother (or simpler). The linear model does really well. Simple models are generally preferred over complex models.
Danger of overfitting a model (Example 3)

Figure: MSE_Te (computed on a separate test data set) and MSE_Tr (computed on the full data set) plotted against model flexibility, from simpler to more complex models.

Here, the truth is wiggly and the noise is low, so the more flexible fits do the best job.
Bias-variance trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x₀, y₀) be a test observation drawn from the population. If the true model is Y = f(X) + ε with f(x) = E(Y | X = x), then

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem by a much simpler model.

Figure: observations scattered around the true relationship; the estimated curve differs from the true relationship by the bias, while ε separates the observations from the true relationship.

There is a bias-variance trade-off. Typically, as f̂ becomes more complex, its variance increases and its bias decreases.
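A small simulation can make variance and bias concrete. Here a deliberately simple, high-bias estimator – predicting the overall mean of y regardless of x, chosen for illustration – is refit on many training sets drawn from an assumed true model:

```python
import random
import statistics

random.seed(3)

def f(x):  # assumed true regression function
    return x ** 2

def draw_training_set(n=50):
    """Draw one training set from the assumed population model Y = f(X) + eps."""
    return [(x, f(x) + random.gauss(0, 0.5))
            for x in (random.uniform(0, 2) for _ in range(n))]

x0 = 1.0  # the test point at which we evaluate the estimator
preds = []
for _ in range(2_000):
    data = draw_training_set()
    # High-bias, low-variance estimator: ignore x and predict the mean of y.
    preds.append(sum(y for _, y in data) / len(data))

variance = statistics.pvariance(preds)  # spread across training sets
bias = statistics.mean(preds) - f(x0)   # systematic offset at x0
print(round(bias, 2), round(variance, 3))
```

Because the estimator barely reacts to any single training set, its variance is small, but it is systematically off at x₀: the trade-off from the formula above.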
Bias-variance trade-offs for the three examples
Example 1 | Example 2 | Example 3

MSE = Var + Bias² + Var(ε)

Here, Bias is the difference between the true curve and the estimated curve, and Var is the variability of the estimated curve across training sets. The irreducible error Var(ε) sets a floor below which the test MSE cannot fall.

Figure: decomposition of the test MSE into variance, squared bias and irreducible error ε for each example, plotted from simpler to more complex models.

