
Module 3.
Non-linear machine learning econometrics:
Tree-based estimation

Machine-learning non-linear estimation methods
Introduction

▪ Linear estimation methods are relatively simple to describe and implement; however, they have limitations in terms of predictive power

▪ The linearity assumption is almost always an approximation

▪ Ridge regression and the lasso can improve the models, but they are still linear models

Machine-learning non-linear estimation methods
Introduction

When the assumption of linearity is relaxed, we obtain non-linear models:

▪ Polynomial regression
▪ Generalized additive models
▪ Decision trees
▪ Support vector machines
▪ etc.

Tree-based estimation
Introduction

▪ Tree methods are commonly used in data science to understand patterns within data and to build predictive models

▪ Tree-based methods involve segmenting the predictor space into a number of simple binary regions

The building blocks of a tree are:
▪ Internal nodes
▪ Branches
▪ Terminal nodes (leaves)
Tree-based estimation
Introduction

Graphically: [figure: an example decision tree with internal nodes, branches and terminal nodes]
Tree-based estimation
Introduction
Advantages:

▪ Can be displayed graphically
▪ Very easy to explain to the public
▪ Mirrors human decision-making more closely than other methods
▪ Can handle qualitative predictors without creating dummy variables

Disadvantage:

▪ Trees typically have a lower level of predictive accuracy than other methods
Tree-based estimation
Introduction

[figure: tree-method illustration]
Source: "Making data science accessible - Machine Learning – Tree Methods", posted by Dan Kellett on 12.04.2016
Tree-based estimation
Introduction

Tree-based methods can be used for:

▪ Regression for quantitative response variables

▪ Classification for qualitative response variables

Tree-based estimation
Regression trees

How it works:

1. Divide the predictor space into J distinct and non-overlapping regions R_1, R_2, ..., R_J.

2. For every test observation that falls into region R_j, predict the mean of the response values for the training observations in R_j.
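A minimal sketch of these two steps using scikit-learn's DecisionTreeRegressor (the data below is simulated purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two predictors
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Step 1: fitting the tree partitions the predictor space into regions R_1..R_J
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Step 2: a test observation is predicted with the mean training response
# of the region (leaf) it falls into
x_new = np.array([[2.5, 7.0]])
print(tree.predict(x_new))
```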

Tree-based estimation
Regression trees

R_1, R_2, ..., R_J have to be defined such that they minimise the residual sum of squares:

RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2

where \hat{y}_{R_j} is the mean response for the training observations within the j-th region.
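For a fixed partition, the RSS objective can be computed directly; a small numpy sketch, where the region assignment `region_ids` is a hypothetical pre-computed labelling:

```python
import numpy as np

def partition_rss(y, region_ids):
    """RSS = sum_j sum_{i in R_j} (y_i - mean of y over R_j)^2."""
    rss = 0.0
    for j in np.unique(region_ids):
        y_j = y[region_ids == j]                # responses falling in region R_j
        rss += np.sum((y_j - y_j.mean()) ** 2)  # squared deviations from region mean
    return rss
```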

Problem:
It is computationally infeasible to consider every possible partition of the feature space into J regions.

Solution: recursive binary splitting

Tree-based estimation
Regression trees: Recursive binary splitting

▪ Top-down approach (it begins at the top of the tree, where all observations fall into a single region)

▪ At each step of the tree-building process, the best split is made:

• for each splitting variable j and threshold s, the pair of half-planes R_1(j, s) = \{X \mid X_j < s\} and R_2(j, s) = \{X \mid X_j \geq s\} is defined, and

• we seek the values of j and s that minimise:

RSS = \sum_{i:\, x_i \in R_1(j,s)} \left( y_i - \hat{y}_{R_1} \right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left( y_i - \hat{y}_{R_2} \right)^2

where \hat{y}_{R_1} and \hat{y}_{R_2} are the mean responses for the training observations in R_1(j, s) and R_2(j, s)

• the process stops when a stopping criterion is reached (for instance, when no region contains more than 5 observations); a sketch of one splitting step follows below
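A minimal, unoptimised sketch of one splitting step: an exhaustive search over variables j and thresholds s, as described above (function and variable names are illustrative):

```python
import numpy as np

def best_split(X, y):
    """Find the pair (j, s) minimising RSS(R1) + RSS(R2)."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):                 # every candidate variable
        for s in np.unique(X[:, j]):            # every candidate threshold
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                        # skip degenerate splits
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss
```

Recursive binary splitting then applies this same search within each of the two resulting regions, and so on, until the stopping criterion is met.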
Tree-based estimation
Regression trees

Disadvantages:

▪ The tree may be too complex

▪ Overfitting of the training data

▪ Poor fit on test data

Tree-based estimation
Regression trees – tree pruning

▪ Tree pruning means growing a very large tree T_0 and pruning it back to obtain a simpler subtree (with a smaller number of terminal nodes)

▪ To prune the tree we use cost-complexity pruning (weakest-link pruning): for each value of the tuning parameter \alpha there corresponds a subtree T \subset T_0 that minimises:

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|

where |T| is the number of terminal nodes of T, R_m is the subset of the predictor space corresponding to the m-th terminal node, and \hat{y}_{R_m} is the mean of the training observations in R_m
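In scikit-learn, cost-complexity pruning is exposed through cost_complexity_pruning_path and the ccp_alpha parameter; a sketch with simulated data and illustrative settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Grow the large tree T0 and compute the sequence of effective alphas,
# each corresponding to a weakest-link pruned subtree of T0
big_tree = DecisionTreeRegressor(random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)

# One pruned subtree per candidate alpha
subtrees = [
    DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y)
    for a in path.ccp_alphas
]
```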
Tree-based estimation
Regression trees – tree pruning

▪ \alpha controls the trade-off between the subtree's complexity (|T|) and its goodness-of-fit to the training data:

• \alpha = 0: T = T_0
• as \alpha increases, the criterion is minimised by a smaller subtree

▪ \alpha plays a similar role to \lambda in lasso regression (cost function) and is selected by cross-validation:

• divide the training observations into K folds and, for each k = 1, ..., K:
• repeat the tree-growing and pruning steps on all the data except the k-th fold
• evaluate the mean squared prediction error on the data in the left-out k-th fold, as a function of \alpha
• average the results for each value of \alpha and pick \alpha so as to minimise the average error (see the sketch below)
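A sketch of the cross-validation step with scikit-learn's GridSearchCV (K = 5 folds; the alpha grid here is illustrative, in practice it would come from the pruning path above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": np.linspace(0.0, 0.05, 20)},  # illustrative grid
    cv=5,                                  # K = 5 folds
    scoring="neg_mean_squared_error",      # mean squared prediction error
)
cv.fit(X, y)
print(cv.best_params_)                     # alpha minimising the average CV error
```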

Tree-based estimation
Classification trees

▪ For qualitative response variables

▪ The predicted response is given by the most commonly occurring (modal) class of the training observations in the region (instead of the mean response of the training observations)

▪ The class proportions among the training observations in each region are also of interest

▪ Classification trees can be grown just as regression trees

▪ Recursive binary splitting can be used, but instead of RSS we use the Gini index or cross-entropy (a sketch follows below)
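A minimal scikit-learn sketch with simulated data; criterion="gini" is the default, while criterion="entropy" uses cross-entropy instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # two classes

clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

x_new = np.array([[0.5, -0.2]])
print(clf.predict(x_new))                  # modal class of the leaf region
print(clf.predict_proba(x_new))            # class proportions in that region
```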
Tree-based estimation
Classification trees

Gini index:

G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)

where \hat{p}_{mk} is the proportion of training observations in the m-th region that are from the k-th class.

Cross-entropy:

D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

Both are measures of node purity: a low value indicates that the node contains predominantly observations from a single class.
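Both measures are easy to compute from a node's class proportions; a short numpy sketch:

```python
import numpy as np

def gini(p):
    """G = sum_k p_mk * (1 - p_mk) for one node's class proportions p."""
    p = np.asarray(p)
    return float(np.sum(p * (1 - p)))

def cross_entropy(p):
    """D = -sum_k p_mk * log(p_mk), with 0 * log(0) taken as 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A pure node scores 0 on both; an evenly mixed node scores high
print(gini([1.0, 0.0]), cross_entropy([1.0, 0.0]))   # 0.0 0.0
print(gini([0.5, 0.5]), cross_entropy([0.5, 0.5]))   # 0.5 ~0.693
```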

Tree-based estimation
Bootstrap and bagging

▪ As presented in Module 2, the bootstrap and bagging can be useful for improving decision trees

Notation:
B = number of samples bootstrapped (regression trees) from the training data set
\hat{f}^{*b}(x) = our function fitted on the b-th bootstrapped training data set

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)

▪ Trees are not pruned

▪ Each individual tree has high variance but low bias; averaging them reduces the variance (a sketch follows below)
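A minimal sketch of bagging regression trees that follows the formula directly (simulated data; B = 100 here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

B, n = 100, len(y)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)        # bootstrap sample, drawn with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # grown deep, not pruned

# f_bag(x) = (1/B) * sum_b f*b(x): average the B individual predictions
x_new = np.array([[2.5, 7.0]])
print(np.mean([t.predict(x_new) for t in trees], axis=0))
```

scikit-learn packages the same idea as sklearn.ensemble.BaggingRegressor.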
Tree-based estimation
Bootstrap and bagging

▪ With qualitative response variables we record the class predicted for a given test observation by each tree and take a majority vote (see the sketch after this list)

▪ B is chosen sufficiently large (e.g. 100)

Disadvantage:
▪ The resulting model is difficult to interpret
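A tiny sketch of the majority vote itself (the predicted classes below are made up for illustration):

```python
import numpy as np

# Suppose B = 5 trees predicted these classes for one test observation
votes = np.array(["A", "B", "A", "A", "B"])
values, counts = np.unique(votes, return_counts=True)
print(values[np.argmax(counts)])           # "A" wins the majority vote
```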

Tree-based estimation
Bootstrap and bagging

Variable importance measure:

▪ Quantitative response: the total amount by which the RSS is decreased due to splits over a given predictor, averaged over all B trees

▪ Qualitative response: the total amount by which the Gini index is decreased due to splits over a given predictor, averaged over all B trees

A high value means an important variable.
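scikit-learn exposes this impurity-based measure as feature_importances_; a sketch with simulated data in which only the first predictor matters (max_features=None makes the forest behave like bagging):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = 3 * X[:, 0] + 0.1 * rng.standard_normal(300)   # only predictor 0 matters

bagged = RandomForestRegressor(n_estimators=100, max_features=None, random_state=0)
bagged.fit(X, y)
# Average decrease in RSS attributable to splits on each predictor,
# normalised to sum to 1; predictor 0 should dominate here
print(bagged.feature_importances_)
```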

Tree-based estimation
Random forest

▪ The random forest method is an improvement of the bagging method

▪ It considers only a subset (random sample) of m predictors at each split (bagging considers all the predictors)

▪ Usually m \approx \sqrt{p}, where p = total number of predictors (if m = p then we have bagging)

▪ This decorrelates the trees (helpful when we have a large number of correlated predictors)

Tree-based estimation
Random forest

How random forest works:

▪ Construct B regression trees using B bootstrapped training sets

▪ For each tree b = 1, ..., B, only a random subset of m of the p available predictors is considered at each split

▪ Average the resulting predictions (a sketch follows below)
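A sketch with scikit-learn's RandomForestRegressor (simulated data; max_features="sqrt" gives m ≈ √p candidate predictors per split, while max_features=None would reduce to bagging):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 9))                  # p = 9 predictors
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(300)

forest = RandomForestRegressor(
    n_estimators=500,          # B = 500 bootstrapped trees
    max_features="sqrt",       # m = sqrt(p) = 3 predictors tried at each split
    random_state=0,
).fit(X, y)

print(forest.predict(X[:3]))                       # average over the 500 trees
```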
