
CEE 561

Transportation Modelling
Lecture 2

Continuous vs. Discrete Goods

[Figure: two panels. Continuous Goods: indifference curves u1, u2, u3 on axes x1 and x2. Discrete Goods: axes bus and auto, each taking the value 0 or 1.]

Outline
• Data
• Modeling Principles
  – Assumptions
  – Estimates
  – Statistical Tests
  – Potential data problems

Data
• Cross-section
  – Activities of individual persons, firms, or other units at a single point in time
    ◦ One observation per individual
• Time series
  – Movement of a variable over time
    ◦ Annual, quarterly, monthly, weekly observations, etc.
    ◦ Mostly used for observations aggregated at the national or regional level
• Pooled/panel
  – Combination of time series and cross-section
  – Behavior of individual persons, firms, or other units over time

Examples
• Cross-sectional data
Vehicle Miles Travelled in 2008

ID   VMT    No. of cars in HH   HH income   No. of HH members   No. of children in HH   TAZ
1    1000   2                   >80k        4                   2                       101
2    1200   2                   30-50k      3                   1                       101
3    600    1                   30-50k      2                   0                       104
:
Examples
• Time series data
Vehicle Miles Travelled, 2000-2008

Year   Avg fuel price/L   Average VMT/person   Average car ownership
2000   20                 500                  0.01
2001   22                 575                  0.02
2002   25                                      0.04
:
2006   45                 800                  0.06
2007   50                 900                  0.06
2008   77                 1000                 0.06

Examples
• Pooled/Panel data
Vehicle Miles Travelled, 2006-2008

ID   Year   Avg fuel price/L   VMT    No. of cars in HH   HH income   No. of HH members   No. of children in HH   TAZ
1    2006   30                 800    2                   >80k        4                   2                       101
1    2007   50                 900    2                   >80k        4                   2                       101
1    2008   77                 1000   2                   >80k        4                   2                       101
2    2006   30                 1000   2                   30-50k      3                   1                       101
2    2007   50                 1500   2                   30-50k      3                   1                       101
2    2008   77                 1200   2                   30-50k      3                   1                       101
:

Examples
• Pseudo panel data
Vehicle Miles Travelled, 2006-2008

ID   Year   Avg fuel price/L   VMT    No. of cars in HH   HH income   No. of HH members   No. of children in HH   TAZ
1    2006   30                 800    1                   >80k        4                   1                       101
2    2007   50                 900    2                   50-80k      3                   2                       104
3    2008   77                 1000   2                   >80k        4                   2                       101
4    2006   30                 1000   3                   30-50k      3                   2                       108
5    2007   50                 1500   1                   30-50k      2                   1                       101
6    2008   77                 1200   1                   50-80k      3                   1                       101
:

Modeling Principles
• Hypothesis:
  – Example: VMT = f(fuel cost, no. of cars, HH income, HH size)
• Linear relationship:
  – Example: $VMT = \alpha + \beta_{cost} \cdot cost + \beta_{car} \cdot carNo + \beta_{inc} \cdot hhInc + \beta_{size} \cdot hhSize$
• Non-linear relationship:
  – Example: $VMT = \alpha + \beta_{car} \cdot carNo + \frac{\beta_{cost} \cdot cost}{1 + \beta_{inc} \cdot hhInc}$
  – In this course we will deal with linear relationships only
• In regression analysis, we estimate the $\alpha$ and $\beta$s that 'best fit' the observed data using estimators

Estimators
• Our interest: the population
  Available: a sample (or samples) from the population
  – Use sample information to obtain the best possible estimates
• Estimator: a rule that gives a reasonable estimate for each and every possible sample
  – Estimators are rules
  – Estimates are numbers produced by the estimator
• Desirable properties
  – Unbiasedness
  – Efficiency
  – Consistency (only for large samples)

Desirable Properties

• We want our estimators to be:
  – Unbiased: the expected value of the estimator equals the true population value
    ◦ Bias = $E(\beta^*) - \beta_{population}$
  – Efficient: for a given sample size, the variance is smaller than that of any other unbiased estimator
    ◦ Higher efficiency means the results can be relied on more
  – Consistent: as N increases, $\beta^* \to \beta_{population}$

This assumption is required when we do statistical tests (e.g. t-test)
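The properties above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the estimator is the sample mean, the population is normal, and the population mean (10) and standard deviation (4) are hypothetical values chosen for illustration.

```python
import random
import statistics

def sample_mean_estimator(sample):
    """Estimator: a rule that maps any sample to an estimate (here, the sample mean)."""
    return sum(sample) / len(sample)

def simulate(true_mean, sigma, n, replications, seed=0):
    """Draw many samples of size n and collect the estimate from each one."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimates.append(sample_mean_estimator(sample))
    return estimates

TRUE_MEAN = 10.0  # hypothetical population value
SIGMA = 4.0       # hypothetical population standard deviation

small = simulate(TRUE_MEAN, SIGMA, n=5, replications=1000)
large = simulate(TRUE_MEAN, SIGMA, n=200, replications=1000)

# Bias = average estimate minus the true value (near zero for an unbiased rule).
bias_small = statistics.mean(small) - TRUE_MEAN
bias_large = statistics.mean(large) - TRUE_MEAN
# Spread of the estimates shrinks as N grows (consistency).
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)

print(f"N=5:   bias={bias_small:+.3f}, spread={spread_small:.3f}")
print(f"N=200: bias={bias_large:+.3f}, spread={spread_large:.3f}")
```

The bias stays near zero at both sample sizes, while the spread of the estimates around the true value narrows as N grows.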


Examples of Estimators
• Least/minimum error
  $\min \sum_{i=1}^{N} (Y_i - Y_i^*)$

• Least/minimum absolute error
  $\min \sum_{i=1}^{N} |Y_i - Y_i^*|$

• Ordinary least squares (OLS)
  $\min \sum_{i=1}^{N} (Y_i - Y_i^*)^2$

• Weighted least squares (WLS)
  $\min \sum_{i=1}^{N} w_i (Y_i - Y_i^*)^2$
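The four criteria above can be compared on a toy example. This sketch assumes hypothetical observed values and two hypothetical sets of predictions. It shows one reason the plain sum of errors is a weak criterion: positive and negative errors cancel.

```python
def sum_error(y, y_pred):
    """Least/minimum error criterion: sum of raw errors."""
    return sum(yi - pi for yi, pi in zip(y, y_pred))

def sum_abs_error(y, y_pred):
    """Least absolute error criterion."""
    return sum(abs(yi - pi) for yi, pi in zip(y, y_pred))

def sum_sq_error(y, y_pred):
    """OLS criterion: sum of squared errors."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))

def weighted_sum_sq_error(y, y_pred, w):
    """WLS criterion: squared errors weighted by w_i."""
    return sum(wi * (yi - pi) ** 2 for yi, pi, wi in zip(y, y_pred, w))

y = [10.0, 12.0, 6.0, 8.0]         # hypothetical observations
good_fit = [10.5, 11.5, 6.5, 7.5]  # errors of +/-0.5
bad_fit = [15.0, 7.0, 11.0, 3.0]   # errors of +/-5
weights = [1.0, 1.0, 2.0, 2.0]     # hypothetical weights

for name, pred in (("good", good_fit), ("bad", bad_fit)):
    print(name,
          sum_error(y, pred),
          sum_abs_error(y, pred),
          sum_sq_error(y, pred),
          weighted_sum_sq_error(y, pred, weights))
```

Both sets of predictions give a raw error sum of zero, while the absolute and squared criteria separate the good fit from the bad one.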

Two Variable Linear Regression Model

Model:
$Y_i = \alpha + \beta X_i + \varepsilon_i$
$X_i$ = non-stochastic
$\varepsilon_i$ = stochastic random term (often follows a certain distribution)


Error (ε)
• Variables cannot provide perfect explanations
• Errors are things that influence $Y_i$ other than $X_i$
• Reasons
  – Simplification of reality
    ◦ e.g. VMT = f(no. of cars, HH income, HH size, HH children, location)
    ◦ Omitted variables: individual tastes, education, lifestyle patterns, and many more
  – Measurement errors
    ◦ Privacy issues
    ◦ Poor record keeping, etc.

Error ( ε )

• Prediction error: $\varepsilon_i = Y_i - Y_i^*$
  $Y_i^*$ = predicted value of the dependent variable = $\alpha + \beta X_i$
  Sum of squared errors (SSE): $\sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} (Y_i - Y_i^*)^2$

• In OLS, we minimize SSE


Two Variable Linear Regression Model

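Model:
$Y_i = \alpha + \beta X_i + \varepsilon_i$
$X_i$ = non-stochastic
$\varepsilon_i$ = stochastic random term (often follows a certain distribution)

Solution:
$\beta = \frac{\sum x_i y_i}{\sum x_i^2}$
$\alpha = \frac{\sum Y_i}{N} - \beta \frac{\sum X_i}{N}$
where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are deviations from the sample means.

If Y varies a lot when X varies little, β will be large.
In other words, β measures the magnitude of the influence of X on Y.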
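The closed-form solution can be sketched in a few lines. This assumes the slope formula is read in deviation-from-mean form (each X and Y measured from its sample mean), which is the standard OLS result for a model with an intercept. The data are the three households from the cross-sectional example (number of cars and VMT); the function name is hypothetical.

```python
def ols_two_variable(x, y):
    """Return (alpha, beta) minimizing the sum of squared errors for Y = alpha + beta*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums over deviations from the sample means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

cars = [2, 2, 1]          # no. of cars in HH (households 1-3)
vmt = [1000, 1200, 600]   # VMT (households 1-3)

alpha, beta = ols_two_variable(cars, vmt)
print(f"alpha = {alpha:.1f}, beta = {beta:.1f}")  # alpha = 100.0, beta = 500.0
```

With only three observations this is purely illustrative; it says nothing about how reliable the estimates are.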

Statistical Significance
• How dependable are the estimates?
• How significant is X in explaining Y?
  – If there is a high probability that β is not 0, then $\beta^*$ is statistically significant
  – The smaller the standard errors (square roots of the estimated variances) are relative to the coefficients, the more confidence we have in the estimates

• How to test?
  – Use the t-statistic (t-test)
  – $t = \frac{\beta^* - \beta}{\text{std. error of } \beta^*}$, where β is the hypothesized value (β = 0 when testing whether X matters)
  – Compare $|t|$ with $t_{critical}$ (at the 95% or 90% confidence level) with N − k degrees of freedom (N = number of observations, k = number of estimated parameters)
    ◦ $|t| > t_{critical}$: statistically significant
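A minimal sketch of the test for the slope of a two-variable model follows. It assumes the usual form t = (estimate − hypothesized value) / standard error with a hypothesized value of 0, and the standard OLS standard error sqrt(s² / Σx²) with s² = SSE / (N − k). The data are hypothetical, and the critical value is taken from a standard t table rather than computed.

```python
import math

def slope_t_stat(x, y, beta_null=0.0):
    """Return (beta_hat, std_error, t_stat) for the slope of Y = alpha + beta*X."""
    n = len(x)
    k = 2  # estimated parameters: alpha and beta
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    sse = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - k)                # estimated error variance
    std_error = math.sqrt(s2 / sxx)   # standard error of the slope
    return beta_hat, std_error, (beta_hat - beta_null) / std_error

x = [1, 2, 3, 4, 5]               # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Two-sided 95% critical value for N - k = 3 degrees of freedom (t table).
T_CRITICAL_95_DOF3 = 3.182

beta_hat, std_error, t_stat = slope_t_stat(x, y)
significant = abs(t_stat) > T_CRITICAL_95_DOF3
print(f"beta* = {beta_hat:.3f}, se = {std_error:.4f}, t = {t_stat:.2f}, significant: {significant}")
```

The critical value has to be looked up again whenever N or k changes, since it depends on the degrees of freedom.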


Goodness-of-Fit
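• How well the model fits the data
• Measure: $R^2 = 1 - \frac{\sum \varepsilon_i^2}{\sum y_i^2}$, where $y_i = Y_i - \bar{Y}$ is the deviation of $Y_i$ from its sample mean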
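As a sketch, R² can be computed directly from observed and predicted values. This assumes the usual centered definition, in which the denominator sums squared deviations of Y from its sample mean; the numbers are hypothetical.

```python
def r_squared(y, y_pred):
    """R^2 = 1 - SSE / (sum of squared deviations of Y from its mean)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]        # hypothetical observations
y_pred = [1.5, 1.5, 3.5, 3.5]   # hypothetical predictions

print(r_squared(y, y_pred))     # SSE = 1, total = 5
```

A perfect fit gives 1, and predicting the sample mean for every observation gives 0.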

Multivariate Linear Regression Model

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \varepsilon_i$
In matrix notation:
$Y = X\beta + \varepsilon$
$X = [1 \; X_{1i} \; X_{2i} \; X_{3i} \; \ldots]$
OLS solution:
$\beta = (X'X)^{-1}(X'Y)$
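The matrix solution can be sketched without external libraries by solving the normal equations (X'X)β = X'Y, which is equivalent to applying the inverse. All helper names and the data are hypothetical; in practice a linear algebra library would normally be used instead of hand-written elimination.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a*z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfect collinearity?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(x_rows, y):
    """Return [alpha, beta1, beta2, ...] from the normal equations."""
    X = [[1.0] + list(row) for row in x_rows]  # leading column of ones for alpha
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2 (no error term).
x_rows = [(1, 0), (0, 1), (1, 1), (2, 1), (3, 5)]
y = [3, 4, 6, 8, 22]

coefficients = ols(x_rows, y)
print([round(c, 6) for c in coefficients])
```

Because the data contain no error term, the estimates recover the generating coefficients up to rounding.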


Goodness-of-Fit
• $R^2$ never decreases as we add new variables
• Measure: $\bar{R}^2$ (adjusted $R^2$), which accounts for k (number of estimated parameters)
• The model with the higher $\bar{R}^2$ has the better goodness-of-fit in absolute terms
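The slide does not preserve the formula for the adjusted measure. As an assumption, the sketch below uses the standard definition, adjusted R² = 1 − (1 − R²)(N − 1)/(N − k), with k counting all estimated parameters including the constant. The R² values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 under the standard definition (assumed, not taken from the slide)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical case: adding a fourth parameter raises R^2 only slightly.
before = adjusted_r_squared(0.800, n=21, k=3)
after = adjusted_r_squared(0.805, n=21, k=4)

print(f"before: {before:.4f}, after: {after:.4f}")
```

Here R² rises from 0.800 to 0.805, yet the adjusted value falls, so the extra variable does not improve the fit once the lost degree of freedom is counted.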

Example: Chicago Trip Generation
• Dependent variable:
  – Average trips per occupied dwelling unit
• Independent variables:
  – Average car ownership
  – Average household size
  – Three zonal social indices


Assignment
• Variations you can try:
  – Add other variables
  – Use interaction terms
  – Use logs of variables
  – Piecewise linear formulation
• Evaluation criteria:
  – Correct signs
  – Improvement in goodness-of-fit
  – t-test
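The variations listed above all amount to building new regressors before estimation. The sketch below is illustrative only: the income variable, the knot at 50, and the function and key names are hypothetical and not part of the assignment data.

```python
import math

def build_features(cars, hh_size, income, income_knot=50.0):
    """Build transformed regressors for one zone (all inputs hypothetical)."""
    return {
        "cars": cars,
        "hh_size": hh_size,
        # Interaction term: lets the effect of cars depend on household size.
        "cars_x_hh_size": cars * hh_size,
        # Log transform: effect of income in proportional terms.
        "log_income": math.log(income),
        # Piecewise linear: separate slopes below and above the knot.
        "income_below_knot": min(income, income_knot),
        "income_above_knot": max(income - income_knot, 0.0),
    }

zone_a = build_features(cars=2, hh_size=4, income=80.0)
zone_b = build_features(cars=1, hh_size=2, income=30.0)
print(zone_a)
print(zone_b)
```

Each transformed column enters the regression as an ordinary variable, so the model stays linear in its parameters.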

Assumptions of Classical LR Model

1. The relationship between X and Y is linear
2. X is non-stochastic, and no exact linear relationship exists between two or more independent variables
3. The error has zero expected value (errors cancel out): $E(\varepsilon) = 0$
4. The error has constant variance for all observations: $E(\varepsilon^2) = \sigma^2$
5. No correlation among errors: $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$


Gauss Markov Theorem


• If assumptions 1-5 are fulfilled, OLS is BLUE:
  – Best
  – Linear
  – Unbiased
  – Estimator

Violation 1: Collinearity
• Types
  – Perfect correlation
  – Other high interdependence: multicollinearity
• Examples
  – e.g. GPA = f(X1, X2, X3, X4, X5)
    ◦ X1 = parents' education level
    ◦ X2 = average hours of study per day
    ◦ X3 = average hours of study per week
    ◦ X4 = parents' income
    ◦ X5 = school
  – X2 and X3 are perfectly collinear
  – X1 and X4 can be multicollinear


Violation 1: Collinearity (cont)

• Effect:
  – Perfect collinearity:
    ◦ Coefficients cannot be estimated
  – Multicollinearity:
    ◦ Coefficients are difficult to interpret
    ◦ Affects statistical significance
• Solution:
  – Drop one variable
  – Caution: dropping a variable may introduce bias
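A quick check for collinearity is the pairwise correlation between regressors. The sketch below uses hypothetical values for the GPA example: weekly study hours are exactly 7 times daily hours (perfect collinearity), while parents' education and income are strongly but not perfectly related.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

hours_per_day = [1.0, 2.0, 3.5, 4.0, 2.5]        # X2 (hypothetical)
hours_per_week = [7 * h for h in hours_per_day]  # X3 = 7 * X2
parent_education = [12, 16, 18, 14, 16]          # X1, years (hypothetical)
parent_income = [40, 75, 90, 55, 70]             # X4, thousands (hypothetical)

r_23 = correlation(hours_per_day, hours_per_week)
r_14 = correlation(parent_education, parent_income)
print(f"corr(X2, X3) = {r_23:.4f}")
print(f"corr(X1, X4) = {r_14:.4f}")
```

A correlation of exactly 1 means one variable adds no information beyond the other, so X'X cannot be inverted. A high value below 1 still allows estimation, but the individual coefficients become hard to separate.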

Violation 2: Heteroscedasticity
• Homoscedastic = constant error variance
• Heteroscedastic = error variance not constant
• Example:
  – Large firm: bigger errors
  – Larger TAZ: bigger errors

• Effect: estimators unbiased but inefficient

• Solution: weighted least squares (WLS)
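A minimal WLS sketch for the two-variable model follows. It assumes the error variance of each observation is known and uses weights of 1/variance, a common choice that the slides do not specify. The data, variances, and function name are hypothetical.

```python
def wls_two_variable(x, y, w):
    """Return (alpha, beta) minimizing sum of w_i * (Y_i - alpha - beta*X_i)^2."""
    w_sum = sum(w)
    # Weighted means replace the ordinary sample means.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

x = [1, 2, 3, 4]
y = [2, 4, 6, 20]                     # last observation is very noisy
variances = [1.0, 1.0, 1.0, 100.0]    # hypothetical error variances
weights = [1.0 / v for v in variances]

alpha_ols, beta_ols = wls_two_variable(x, y, [1.0] * len(x))  # equal weights = OLS
alpha_wls, beta_wls = wls_two_variable(x, y, weights)
print(f"OLS slope = {beta_ols:.3f}, WLS slope = {beta_wls:.3f}")
```

Down-weighting the high-variance observation pulls the slope back toward the pattern in the other three points (slope 2).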


Violation 3: Serial Correlation

• Occurs in both cross-section and time series data
• Can be positive or negative
  – e.g. Positive error: incorrect mileage reading
  – Negative error: mileage data taken in Jan 2009 instead of Dec 2008; overestimation of 2008 VMT, underestimation of 2009 VMT
• Effect: estimators unbiased but inefficient
• Solution:
  – Prais-Winsten, Cochrane-Orcutt, Durbin's method
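One pass of the Cochrane-Orcutt idea can be sketched as follows: fit OLS, estimate the first-order correlation ρ of the residuals, quasi-difference the data, and re-fit. This is a simplified single iteration on a hypothetical series with alternating (negatively correlated) errors; the full procedure repeats the steps until ρ settles.

```python
def ols(x, y):
    """Two-variable OLS: return (alpha, beta)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta

def cochrane_orcutt_step(x, y):
    """One Cochrane-Orcutt iteration: return (alpha, beta, rho)."""
    alpha, beta = ols(x, y)
    resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    # rho: regression of e_t on e_(t-1) without an intercept.
    rho = (sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
           / sum(r ** 2 for r in resid[:-1]))
    # Quasi-difference the data; the first observation is dropped.
    x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    intercept_star, beta_co = ols(x_star, y_star)
    # The transformed intercept equals alpha * (1 - rho).
    return intercept_star / (1 - rho), beta_co, rho

# Hypothetical series: Y = 2 + 3*X plus errors alternating +1, -1.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [6, 7, 12, 13, 18, 19, 24, 25]

alpha_ols, beta_ols = ols(x, y)
alpha_co, beta_co, rho = cochrane_orcutt_step(x, y)
print(f"OLS:             alpha = {alpha_ols:.3f}, beta = {beta_ols:.3f}")
print(f"Cochrane-Orcutt: alpha = {alpha_co:.3f}, beta = {beta_co:.3f}, rho = {rho:.3f}")
```

Prais-Winsten differs mainly in that it keeps the first observation through a separate transformation rather than dropping it.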
