
Module 2. Introduction to machine learning econometrics

Computational issues
Introduction

In a data-rich situation, the best approach is to split the data into three parts:

▪ Training set (50%): used to fit the models
▪ Validation set (25%): used to estimate prediction error for model selection
▪ Test set (25%): used for assessment of the final chosen model
Computational issues
The validation set approach

The n observations (1, 2, 3, …, n-2, n-1, n) are randomly split in half:

▪ n/2 observations form the training set, on which the machine learning model is fitted
▪ n/2 observations form the validation set, on which the fitted model is evaluated
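A minimal sketch of the validation set approach in Python, assuming scikit-learn and purely illustrative placeholder data X, y:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data: 100 observations of one predictor (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(size=100)

# Randomly split the n observations in half
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit on the training set, evaluate on the validation set
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation MSE: {val_mse:.3f}")
```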
Computational issues
The validation set approach

Disadvantages:
▪ The validation estimates can be highly variable, depending on which observations are included in the training set
▪ The validation set error rate may overestimate the test error rate, since statistical methods tend to perform worse when trained on fewer observations
Computational issues
Cross-validation

Alternatives to the validation set approach:

▪ Leave-one-out cross-validation (LOOCV)

▪ K-Fold cross-validation

Computational issues
Cross-validation: leave-one-out cross-validation

Each observation takes its turn as the validation set:

▪ 1 observation forms the validation set
▪ The remaining n-1 observations form the training set, on which the machine learning model is fitted; the fitted model is then evaluated on the held-out observation
Computational issues
Cross-validation: leave-one-out cross-validation

LOOCV estimate for the test MSE:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$$

where $\mathrm{MSE}_i = (y_i - \hat{y}_i)^2$ is the error on the i-th held-out observation.
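A minimal LOOCV sketch, again assuming scikit-learn and reusing the placeholder X, y from the earlier snippet:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# cross_val_score returns negative MSE by scikit-learn convention
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
cv_n = -scores.mean()  # CV(n): average of the n held-out MSE_i values
print(f"LOOCV estimate of test MSE: {cv_n:.3f}")
```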
Computational issues
Cross-validation: leave-one-out cross-validation

Advantages:

▪ It does not overestimate the test error rate
▪ There is no randomness in the training/validation split

Disadvantage:
▪ It can be very time consuming if n is large
Computational issues
Cross-validation: K-Fold cross-validation

The observations are divided into k folds (here k = 5), and each fold takes its turn as the validation set:

▪ 1 fold forms the validation set
▪ The remaining k-1 folds form the training set, on which the machine learning model is fitted; the fitted model is then evaluated on the held-out fold
Computational issues
Cross-validation: K-Fold cross-validation

K-fold CV estimate for the test MSE:

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$

If k = n, K-fold CV reduces to LOOCV.
Usually k = 5 or k = 10.
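A corresponding K-fold sketch under the same assumptions (scikit-learn, placeholder X, y):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
cv_k = -scores.mean()  # CV(k): average MSE over the k folds
print(f"5-fold CV estimate of test MSE: {cv_k:.3f}")
```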
Computational issues
Cross-validation: K-Fold cross-validation

Advantages:

▪ Much less time consuming than LOOCV
▪ More accurate estimates of the test error rate
▪ Lower variance than LOOCV
Computational issues
Assessing model fitting

▪ There is no single best method

▪ Mean squared error (MSE):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives on $x_i$. Computed on the training observations, this is the training MSE.

▪ The model fits the data well if the MSE is low

▪ We are more interested in the test MSE (the MSE we obtain when applying the method to unseen data)
Computational issues
Assessing model fitting

▪ Goal: minimise the test MSE

▪ Problems: (I) sometimes test observations are not available;
(II) sometimes the training MSE is small while the test MSE is much larger
Computational issues
Assessing model fitting

Example (figure: three fits to simulated data, with Y plotted against X):

▪ Black curve: the true f
▪ Orange curve: linear regression fit
▪ Blue and green curves: smoothing splines with two different levels of smoothness
Computational issues
Assessing model fitting
(Figure: mean squared error plotted against model flexibility.)

▪ Grey curve: training MSE
▪ Red curve: test MSE
▪ Square dots: MSE values for the three methods of the previous slide
▪ Horizontal line: the irreducible error Var(ε), i.e. the minimum value that the test MSE can attain

Overfitting the data:
The blue curve from the previous slide is the model that best fits the data.
Computational issues
Assessing model fitting

Remarks:

▪ If the training MSE is small but the test MSE is large, the model is overfitting

▪ The training MSE is almost always smaller than the test MSE, because machine learning methods are fitted so as to minimise the training MSE

▪ Usually test data are not available
Computational issues
Replication or resampling: Bootstrap

Bootstrap: a general procedure for assessing the statistical accuracy of a parameter estimate or prediction.

Notation:
▪ $z_i = (x_i, y_i)$: the i-th training observation
▪ $B$: number of bootstrapped samples
▪ $S(Z^{*b})$: quantity of interest computed on the b-th bootstrapped sample $Z^{*b}$
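A minimal bootstrap sketch (numpy only; the statistic S here is the sample mean of placeholder data y, purely as an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=100)  # placeholder data

B = 1000  # number of bootstrapped samples
n = len(y)
stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # sample n indices with replacement
    stats[b] = y[idx].mean()           # S(Z*b): statistic on the b-th sample

# Bootstrap estimate of the standard error of the sample mean
print(f"Bootstrap SE of the mean: {stats.std(ddof=1):.3f}")
```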
Computational issues
Replication or resampling: Bagging

Bagging (Bootstrap AGGregation): an improvement on the bootstrap; a procedure for reducing the variance of a prediction.

Training data: $Z = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$

$\hat{f}^{*b}(x)$: our function fitted on the b-th bootstrapped training data set

$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$$

Bagging averages the predictions $\hat{f}^{*b}(x)$ over the B bootstrap samples.
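A minimal bagging sketch. A decision-tree base learner from scikit-learn is assumed here for illustration, since averaging mainly helps high-variance learners; X, y are again placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # placeholder training data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)

B, n = 100, len(y)
models = []
for b in range(B):
    idx = rng.integers(0, n, size=n)           # bootstrapped training set Z*b
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def f_bag(x_new):
    # Average the B predictions f*b(x) over the bootstrap samples
    return np.mean([m.predict(x_new) for m in models], axis=0)

print(f_bag(np.array([[5.0]])))
```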
Computational issues
Replication or resampling: Bumping

Bumping: a technique for finding a better single model. It uses bootstrap sampling to choose the model that best fits the data.

$Z^{*1}, Z^{*2}, \dots, Z^{*b}, \dots, Z^{*B}$: bootstrapped samples

$\hat{f}^{*b}(x)$: the function fitted on the b-th bootstrapped training data set, for each $b = 1, \dots, B$

The best model is the one that produces the smallest prediction error, averaged over the original training set.
Computational issues
Replication or resampling: Bumping
The best model comes from the $\hat{b}$-th bootstrap sample, where:

$$\hat{b} = \arg\min_{b} \sum_{i=1}^{N} \bigl(y_i - \hat{f}^{*b}(x_i)\bigr)^2$$

The model predictions are then $\hat{f}^{*\hat{b}}(x)$.

Remark:
▪ The original training sample is included among the bootstrapped samples, so the procedure can pick it if it yields the lowest training error
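A minimal bumping sketch under the same assumptions (numpy, decision-tree base learner, placeholder X, y; the original sample is added as one of the candidates):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)

B, n = 50, len(y)
candidates = []
for b in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrapped sample Z*b
    candidates.append(
        DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
# Include the model fitted on the original training sample
candidates.append(DecisionTreeRegressor(max_depth=3).fit(X, y))

# b_hat: candidate with the smallest error on the ORIGINAL training set
errors = [np.sum((y - m.predict(X)) ** 2) for m in candidates]
f_best = candidates[int(np.argmin(errors))]
```

The tree depth is capped here on purpose: an unconstrained tree fitted on the original sample would interpolate the training data and trivially always be selected.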
Machine learning linear estimation
Introduction

Standard linear model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

The standard method to fit this model is least squares.

Remarks:
▪ Prediction accuracy:
➢ low bias
➢ if n ≫ p, also low variance
However:
➢ if n is not much larger than p, high variance
➢ if p > n, the method cannot be used
Machine learning linear estimation
Introduction

▪ Model interpretability:
➢ The model sometimes includes irrelevant variables; the complexity increases and interpretability suffers
Machine learning linear estimation
Introduction

Alternative: shrinkage methods

▪ Techniques for improving a least-squares estimator that reduce the variance by adding constraints on the values of the coefficients.

▪ Only those variables that improve the fit deserve a nonzero coefficient, and consequently only they appear in the fitted linear model.
Machine learning linear estimation
Shrinkage methods

The main shrinkage techniques:

▪ Ridge regression

▪ Lasso

Machine learning linear estimation
Shrinkage methods – Ridge regression

In ridge regression, the estimates $\hat{\beta}^R$ minimise:

$$\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

where $\lambda \geq 0$ is the tuning parameter and $\lambda \sum_{j=1}^{p}\beta_j^2$ is the shrinkage penalty.

For each value of $\lambda$ there is a set of coefficient estimates $\hat{\beta}_\lambda^R$.

If:
▪ $\lambda = 0$: the shrinkage penalty vanishes and $\hat{\beta}^R = \hat{\beta}$, the least-squares estimates
▪ $\lambda \to \infty$: the shrinkage penalty grows and $\hat{\beta}^R \to 0$
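A minimal ridge sketch, assuming scikit-learn (where the tuning parameter $\lambda$ is called alpha) and the placeholder X, y from earlier:

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardising the predictors matters for ridge (see next slide)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # alpha = lambda
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)
```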
Machine learning linear estimation
Shrinkage methods – Ridge regression

The $\hat{\beta}^R$ are not scale invariant, so standardise the predictors $x_{ij}$:

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_{ij} - \bar{x}_j\bigr)^2}}$$
Machine learning linear estimation
Shrinkage methods – Ridge regression

Advantages:

▪ As $\lambda$ increases, the variance decreases (but the bias increases)
▪ It can be used when p > n

(Figure: squared bias (black line), variance (green line) and test mean squared error (purple line) plotted against $\lambda$.)

Disadvantage:

▪ All p predictors are included in the model, which is a challenge for interpretability when p is high
Machine learning linear estimation
Shrinkage methods – Lasso regression

In lasso regression, the estimates $\hat{\beta}^L$ minimise:

$$\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$

where $\lambda$ is the tuning parameter and the shrinkage penalty $\lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$ is an $\ell_1$ penalty.

The $\ell_1$ penalty forces some $\hat{\beta}^L = 0$ when $\lambda$ is sufficiently high.
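The corresponding lasso sketch under the same assumptions (scikit-learn, alpha playing the role of $\lambda$, placeholder X, y):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)
coefs = lasso.named_steps["lasso"].coef_
print("Nonzero coefficients:", np.flatnonzero(coefs))  # l1 penalty zeroes some out
```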
Machine learning linear estimation
Shrinkage methods – Selection of 𝜆

Steps:
▪ Define a grid of values for $\lambda$
▪ Calculate the cross-validation error for each value of $\lambda$
▪ Select the $\lambda$ for which the cross-validation error is smallest
▪ Re-fit the model with ridge or lasso regression using the selected $\lambda$, as sketched below
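A minimal sketch of these steps, assuming scikit-learn's LassoCV (RidgeCV works analogously) and the placeholder X, y:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Grid of candidate lambda values, 5-fold CV error for each,
# then re-fit on the full data with the best lambda
lambdas = np.logspace(-3, 1, 50)
model = make_pipeline(StandardScaler(), LassoCV(alphas=lambdas, cv=5))
model.fit(X, y)
print("Selected lambda:", model.named_steps["lassocv"].alpha_)
```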
