
Course on Machine Learning and Central Banking

CEMLA and Deutsche Bundesbank – October 2021


Prof. Gabriela Alves Werb, Prof. Stefan Bender, Sebastian Seltmann
Agenda

▪ Introduction (Prof. Gabriela Alves Werb and Prof. Stefan Bender)

▪ Structure and Organization of the Course

▪ Machine Learning and Central Banking (Prof. Stefan Bender)

▪ Train, Test and Validation Samples (Prof. Gabriela Alves Werb)

▪ Break

▪ Model Validation Measures

▪ Hands-on Exercise

Structure and Organization of the Course

▪ Day 1: October 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II
▪ Day 2: October 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II
▪ Day 3: October 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II

Structure and Organization of the Course

▪ Day 4: October 7
▪ Random Forest
▪ Break
▪ Gradient Boosting
▪ Day 5: October 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A

▪ Each module is followed by a hands-on exercise.


▪ Let's keep it interactive: please ask questions at any time!

Main Idea: Toolbox

Machine Learning and Central Banking
Machine learning and record linkage in the data life cycle of the Bundesbank
Stefan Bender, Research Data and Service Center (RDSC), Deutsche Bundesbank
Overview

▪ Machine Learning: Introduction


▪ Improving Data Quality with Machine Learning by Tobias Cagala
▪ Record Linkage: Introduction
▪ Linking Deutsche Bundesbank Company Data by Christopher-Johannes Schild
▪ Conclusion

Machine Learning: Introduction
Machine Learning (Rayid Ghani)

Machine Learning

▪ "Field of study that gives computers the ability to learn without being explicitly programmed." (Samuel 1959, p. 2010).

▪ "An agent is learning if it improves its performance on future tasks after making observations about the world." (Russell 2016, p. 693).

▪ "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (Mitchell 1997, p. 2).

Required skills of a computer program to be "intelligent":

▪ Perceive → the environment (dataset)
▪ Learn → a hypothesis from the perceived data
▪ Apply → the hypothesis to related data
▪ Decide → which is the best-performing hypothesis
THX to Frank Raulf

Why Machine Learning?

▪ Goal: adaptive, scalable systems that are cost-effective to build and maintain

▪ Rules-based systems are rigid and expensive

▪ Lots of data is available to “train” the system

Process

▪ Understand “business” problem

▪ Map to machine learning problem

▪ Understand the data

▪ Explore and prepare the data

▪ “Feature” development

▪ Method Selection

▪ Evaluation

▪ Deployment

Types of Machine Learning I

▪ Unsupervised: Clustering, PCA, …
▪ "Weakly" supervised: Anomaly Detection
▪ Fully supervised: Classification, Regression

Types of Machine Learning II

▪ Supervised Learning:
Given are pairs of values $(y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)$, from which a machine can learn $\hat{y} = h(\mathbf{x})$.
Required: a training dataset and a test dataset!
Either classification or regression.
(Example: Linear Regression)

▪ Unsupervised Learning:
Given is a set of values $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with no assigned outcomes. Hence, the algorithm searches for patterns within the data to generate artificial ys.
(Example: Clustering)

▪ Reinforcement Learning:
An algorithm receives a reward if it behaves correctly and/or a punishment if it makes a mistake.
(Example: Ant algorithm)
THX to Frank Raulf
Clustering

▪ A good clustering method will produce clusters with

▪ High intra-cluster similarity

▪ Low inter-cluster similarity

▪ K-Means is the simplest and the most common algorithm

K-means algorithm

▪ Given k, the k-means algorithm works as follows:

1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
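A minimal sketch of these steps in R, using the built-in kmeans() function on simulated data (the three-cluster data and k = 3 are assumptions for illustration):

  # Simulate three clusters of two-dimensional points
  set.seed(42)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

  # kmeans() iterates steps 2)-4) until the centroids converge;
  # nstart = 20 repeats the random initialization of step 1) 20 times
  fit <- kmeans(x, centers = 3, nstart = 20)

  fit$centers        # final cluster centers
  table(fit$cluster) # cluster sizes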

K-means example

K-means example, start

[Figure: scatter plot of the data points with initial centroids k1, k2, k3]
K-means example, initial step

[Figure: the cluster centers k1, k2, k3 move to the means of their current clusters]

Supervised learning framework

y = f(x): the learned function f maps the features x to the output y

▪ Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f that minimizes future generalization (out-of-sample) error

▪ Testing: apply f to a new test example x and output the predicted value y = f(x)
Slide credit: L. Lazebnik

Classification Task
Supervised Learning

The task is to classify whether or not there was a change in perimeter.

What is required?

1) A concept.
2) A correct dataset, to be divided into:
   a) a training dataset and
   b) a test dataset.
3) An efficient programming language (such as R, Python, Matlab, …).
4) Time.

However, there is always the possibility that the applied methods do not lead to a solution.

THX to Frank Raulf


Steps
Data Preparation → Features → Build Model → Validate Model → Deploy Model

Modeling & Validation

Training (Building): Training Data → Features (+ Training Labels) → Learned Model
Testing (Validating): Test Data → Features → Learned Model → Prediction

Factors to consider

▪ Complexity

▪ Overfitting

▪ Robustness

▪ Interpretability

▪ Training Time

▪ Test Time

Classification Challenges (Christen 2015)

▪ In many cases there are no training data available


▪ Possible to use results of earlier matching projects?
▪ Or from manual clerical review process?
▪ How confident can we be about correct manual classification of potential matches?

▪ Often there is no gold standard available (no data sets with true
known match status)

▪ No large test data set collection available (like in information retrieval or machine learning)

Improving Data Quality with Machine Learning
Tobias Cagala, Deutsche Bundesbank
Background

Use out-of-sample predictions with machine learning algorithms for…

1) Data Quality Management (DQM): identify and correct measurement errors

2) Closing data gaps: impute missing values

THX to Tobias Cagala

Application to DQM

Securities Holdings Statistics


▪ German banks provide monthly reports of securities holdings (security-by-security)

▪ DQM with labor-intensive manual case-by-case evaluations

THX to Tobias Cagala

Data

Plausibility check

Securities reported after the maturity date

Two data sources

I. Outcome of evaluation by compiler: Acceptance or Flag


II. Features of the security

➢ Linkage of securities data with data on compiler decisions

THX to Tobias Cagala

Data

Dataset

▪ Reported securities: 4,495
▪ Accepted: 4,110
▪ Flagged: 385

THX to Tobias Cagala

Descriptive Analysis of Patterns
Number of Reporting Banks, Days since Maturity

THX to Tobias Cagala


Taking Advantage of the Probability

THX to Tobias Cagala
Taking Advantage of the Probability

Advantages
▪ All of the 28 (out of 827) securities in the Top 35
▪ Improvement in efficiency and effectiveness

Experience
▪ 50% reduction in time for checks and increased effectiveness of evaluations

THX to Tobias Cagala


Conclusion: Application to DQM

Application to DQM

▪ Application to DQM is feasible and straightforward


▪ Large potential for improvements of efficiency

▪ Inclusion in production process can be a challenge

THX to Tobias Cagala


Other Machine Learning Applications at the Bundesbank

▪ Seasonal Adjustment
▪ "Is This Time Series Seasonal? – How Random Forests Can Improve Seasonality Tests" by Daniel Ollech, Karsten Webel / DG Statistics

▪ Identification of Holdings
▪ by Frank Raulf

▪ Record Linkage
▪ (we will see in a few minutes)

General Insights

▪ Data-driven solutions can improve efficiency drastically

▪ An effective application of ML methods requires the availability of training data

▪ Machine learning algorithms are no silver bullet:
▪ Performance gains are contingent on datasets with complex data structure
▪ Many algorithms are a black box
THX to Tobias Cagala


Record Linkage: Introduction
Motivation for Record Linkage (Christen 2015)

▪ Large amounts of data are being collected (big data).

▪ Analyzing such data can provide huge benefits.

▪ Data are from different sources (need for record linkage).

▪ Lack of unique entity identifiers: linking based on personal information.

▪ The linking of databases is challenged by data quality, database size, privacy and
confidentiality concerns.

Definition of Record Linkage and Major Challenges

▪ RL is finding records in different data sets that represent the same entity and linking them.

▪ RL is also known as data matching, entity resolution, object identification, duplicate detection, identity uncertainty, merge-purge.

▪ The major challenge is that (clean) unique entity identifiers are not available in the databases to be linked.

The basic record linkage process

There is no perfect world

▪ In a perfect world, there would only be:
True Positives (TP) and True Negatives (TN)

▪ But we do not live in a perfect world:
True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN)

Record Linkage Challenges (Christen 2012)

▪ No unique entity identifiers available


▪ Real-world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
▪ Scalability
▪ Naïve comparison of all record pairs is quadratic
▪ Remove likely no-matches as efficiently as possible
▪ No training data in many linkage applications
▪ No record pairs with known true match status
▪ Privacy and confidentiality
▪ (because personal information, like names and addresses, are commonly required for
linking)

Record Linkage Technique (Christen 2015)

▪ Deterministic matching
▪ Rule-based matching (complex to build and maintain)
▪ Probabilistic record linkage (Fellegi and Sunter, 1969)
▪ Use available attributes for linking (often personal information, like names, addresses,
dates of birth, etc.)
▪ Calculate match weights for attributes
▪ “Computer science” approaches
▪ Based on machine learning, data mining, database, or information retrieval techniques

The extended record linkage process

Christen (2012)

Importance of Preprocessing

▪ "In situations of reasonably high-quality data, preprocessing can yield a greater improvement in matching efficiency than string comparators and 'optimized parameters'. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing." (Winkler 2009, p. 370)

▪ "Inability or lack of time and resources for cleaning up files in preparation of matching are often the main reasons that matching projects fail." (Winkler 2009, p. 366)

Shares of effort within linkage process

▪ 5% matching and linking efforts


▪ 20% checking that the computer matching is correct
▪ 75% cleaning and parsing the two input files

(Gill 2001, p. 31)

Caveats of Record Linkage

▪ Imperfect matching variables (like typos)
▪ Variables may be coded differently in both data sources
▪ E.g., years of education vs. degrees received
▪ Data may require significant amounts of processing and data cleaning prior to linkage
▪ Not always a 1-to-1 match, but a 1-to-1 matched set can be extracted in a post-processing step
▪ An (admin) record may not exist

Privacy Issues I

▪ Sensitive data (names, address)

▪ Informed consent, data avoidance, purpose limitation of data

▪ Circle of trust

▪ “Formalities” (like contracts, terms and conditions)

▪ The Five Safes

Privacy Issues II: The 5 Safes

▪ Safe people: fit and proper, expertise to do the work.

▪ Safe projects: formal ethical review, public benefit, scientific merit.

▪ Safe environment: restrict data access, data security.

▪ Safe data: Privacy Preserving Record Linkage.

▪ Safe results: results should not directly or indirectly identify any individual or
organisation.

Sources for Deeper Knowledge

Linking Deutsche Bundesbank Company Data
Dr. Christopher-Johannes Schild, FDSZ 1-5
Bundesbank’s relevant microdata sources: Company Data

THX to Christopher-Johannes Schild
1. Input Data

Dataset (-family)        Source                    Size (entities in master data)
AWMUS (MiDi / SITS)      S2                        ~260,000 (MiDi: ~22,000 German entities, 2013)
BAKIS-M (MiMiK)          B                         ~350,000
Jalys / Corep (USTAN)    S3                        ~230,000 (~24,000 in 2013)
KUSYS                    S1                        ~8,500
EGR                      Destatis                  ~60,000
DAFNE                    S3 (from Bureau van Dijk) ~220,000 (~56,000 in 2012)
Hoppenstedt              S3 (from Hoppenstedt)     ~90,000 (~20,000 in 2013)
AMADEUS                  Bureau van Dijk           ~1,600,000
ZENTK, RIAD (Banks)      S1                        ~3,000 (~1,800 in 2016)
RIAD (Non-Banks)         S1                        ~35,000
LEI                      GLEIF                     ~45,000
URS Company Register     Destatis                  ~4,800,000
Kantwert / Trade Reg.    Kantwert GmbH             ~4,300,000

THX to Christopher-Johannes Schild
1. Data linkage

Company data (non-financial institutions, NFI):

There is no common unique firm identifier in Germany.
(The company business register ID is not stable.)

We have to match firm data…
▪ … that do not have a common unique identifier / key
▪ … by using alternative identifiers (such as names, addresses, sectors, legal forms)

The RDSC has matched several NFI microdatasets (from Statistics, Banking Supervision, and external data) with an advanced machine learning algorithm and generated a matching table (with probabilistic matching scores).

Goal:
▪ Improve data quality, increase the analytical value of the data
▪ A more general and flexible record linkage system
▪ Historicized matching tables

THX to Christopher-Johannes Schild

Record Linkage Process
Input Data A + Input Data B
→ Preprocessing
→ Blocking
→ Match Candidates + Predictors (Features)
→ Classification model (built on training data, checked on test data)
→ Automatic Evaluation / Manual Review
→ Postprocessing
THX to Christopher-Johannes Schild
Linkage of Company Data by FDSZ ("Record Linkage")

▪ Duplicate detection with supervised machine learning
▪ Decision tree algorithms (Random Forests, Gradient Boosting Trees)
▪ Training and test data from common IDs
▪ Comparison features:
▪ Firm names, string comparison algorithms
▪ Georeferenced addresses
▪ Economic sector codes
▪ Legal form
▪ Balance sheet information
▪ Data pre- and postprocessing with SAS
▪ Machine learning uses Python ML packages ("scikit-learn")

THX to Christopher-Johannes Schild
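To illustrate the string comparison step, a minimal sketch assuming the R package stringdist (the firm name variants are made up; the actual pipeline described above uses Python/scikit-learn):

  library(stringdist)

  # Jaro-Winkler similarity between two firm name variants (1 = identical)
  1 - stringdist("Mueller Maschinenbau GmbH",
                 "Müller Maschinenbau G.m.b.H.", method = "jw")

  # Similarity scores like this enter the classifier as comparison features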


2. Set up for the Record Linkage (Machine Learning)

Training (Building): Training Data → Features (+ Training Labels) → Learned Model
Testing (Validating): Test Data → Features → Learned Model → Prediction

THX to Christopher-Johannes Schild


4. Classification

Bias vs Overfitting

THX to Christopher-Johannes Schild


4. Classification

[Figure: distributions of predicted match probabilities, with regions for true negatives (TN), true positives (TP), false positives (FP), and false negatives (FN)]

THX to Christopher-Johannes Schild


5. Evaluation

The linkage pipeline: 1) Harmonization, 2) Indexing / Blocking ("coarse filter"), 3) Detailed comparison ("fine filter"), 4) Classification, 5) Evaluation

Confusion matrix of the classifier:
TN = 179,203    FP = 2,553
FN = 8,856      TP = 125,292

▪ Precision = TP / (TP + FP) = 98.0%
▪ Recall / Coverage = TP / (TP + FN) = 93.3%
THX to Christopher-Johannes Schild
5. Evaluation

[Figure: tradeoff between precision and recall / coverage]

THX to Christopher-Johannes Schild

Bundesbank’s relevant microdata sources: Company Data

THX to Christopher-Johannes Schild
Datauniverses (I)

THX to Christopher-Johannes Schild
Datauniverses (II)

THX to Christopher-Johannes Schild
Train, Validation, and Test Subsamples
Splitting the Data

▪ Golden rule in Machine Learning: evaluate the model on data that was not used to train it

▪ Most common: randomly split the dataset into train and test data

▪ E.g., when modeling loan default, some borrowers land in the train data, others in the test data

▪ Usually: 75%-80% train, 25%-20% test

▪ Better approach: randomly split dataset into train, validation, and test data

▪ Usually: 60% train, 20% validation, 20% test

▪ Even better in some settings: also have out-of-time data

▪ Completely separate dataset, e.g., collected a few months later
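A minimal sketch of a random 60/20/20 split in base R; the data frame loans is a placeholder for your dataset:

  set.seed(123)
  n <- nrow(loans)

  # Randomly assign each observation to one subsample (proportions are approximate)
  idx <- sample(c("train", "validation", "test"), size = n,
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

  train_data      <- loans[idx == "train", ]
  validation_data <- loans[idx == "validation", ]
  test_data       <- loans[idx == "test", ]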


Train, Validation, and Test Subsamples
Splitting the Data

▪ Train data

▪ Data used to train the model

▪ The model learns from the underlying patterns and relationships in this subsample

▪ Validation data

▪ After optimizing several models in the train data, see how they perform in the validation
data

▪ Go back to training and iterate until you are happy with the results

▪ The results in this subsample motivate future modeling decisions

▪ Validation data is “contaminated” → No unbiased estimate of the error on truly new data

Train, Validation, and Test Subsamples
Splitting the Data

▪ Test data

▪ Only to be used in the very end

▪ Performance of the model in this subsample provides an unbiased estimate of the error on
new data
Original dataset:
▪ 60% train data → train model
▪ 20% validation data → evaluate model (repeat: go back to training and iterate)
▪ 20% test data → test the final model
Model Validation and Evaluation Measures
Cross-Validation

▪ Idea: use many subsamples of the data to estimate model error

▪ K subsamples = K-Fold cross-validation

▪ Measure error (e.g., error rate for classification or Sum of Squared Residuals for regression)

▪ Average the errors or sum them (e.g., “risk” measure in RPART)


[Figure: the training data split into folds F1, …, F10; each fold serves once as the validation fold, the rest as training folds, yielding errors $Error_1, \ldots, Error_{10}$]

$\widehat{Error} = \sum_{k=1}^{10} Error_k \qquad \text{or} \qquad \widehat{Error} = \sum_{k=1}^{10} \frac{Error_k}{10}$
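A minimal sketch of 10-fold cross-validation in base R, with a logistic regression as the model and the error rate as the error measure (the data frame train_data with a 0/1 column default is a placeholder):

  set.seed(123)
  k <- 10
  folds <- sample(rep(1:k, length.out = nrow(train_data)))  # fold assignment

  errors <- numeric(k)
  for (i in 1:k) {
    fit  <- glm(default ~ ., data = train_data[folds != i, ], family = binomial)
    prob <- predict(fit, newdata = train_data[folds == i, ], type = "response")
    pred <- as.numeric(prob > 0.5)
    errors[i] <- mean(pred != train_data$default[folds == i])  # error rate in fold i
  }

  mean(errors)  # cross-validated estimate of the error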
Model Validation and Evaluation Measures
Cross-Validation

▪ But, beware!

▪ If you preselect the variables that flow into the model, then cross-validation must also be applied at the variable selection step (not only later)!

▪ Otherwise, cross-validation does not accurately estimate the prediction error

Model Validation and Evaluation Measures
Confusion Matrix for Binary Classification

Confusion Matrix
                            Actual Data
                        No Default    Default
Predictions  No Default     TN           FN
             Default        FP           TP

▪ Example: Loan Default Prediction

▪ TN: true negatives / TP: true positives

▪ FN: false negatives / FP: false positives

▪ Based on them, we can compute several other metrics

Model Validation and Evaluation Measures
Accuracy, Recall, Specificity

▪ What is the share of correctly predicted cases?
▪ Accuracy = (TN + TP) / Total, where Total = TP + TN + FP + FN

▪ Which share of the true default cases is correctly predicted?
▪ Sensitivity (or Recall) = TP / (TP + FN)

▪ Which share of the true non-default cases is correctly predicted?
▪ Specificity = TN / (TN + FP)

Model Validation and Evaluation Measures
Precision, F1-Score

▪ What is the share of correct "default" predictions?
▪ Precision = TP / (TP + FP)

▪ Can the model identify true "default" cases without many false alarms?
▪ F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
▪ The F1-Score provides a tradeoff between precision and recall
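All of these metrics follow directly from the four cells of the confusion matrix; a minimal sketch in base R, using the logistic regression counts from the comparison tables later in the course:

  TN <- 84915; FN <- 2151; FP <- 161; TP <- 18868  # validation-data counts

  accuracy    <- (TN + TP) / (TN + TP + FN + FP)
  recall      <- TP / (TP + FN)  # sensitivity
  specificity <- TN / (TN + FP)
  precision   <- TP / (TP + FP)
  f1          <- 2 * precision * recall / (precision + recall)

  round(c(accuracy = accuracy, recall = recall, specificity = specificity,
          precision = precision, f1 = f1), 3)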

Model Validation and Evaluation Measures
ROC Curve
▪ Receiver Operating Characteristic (ROC) Curve
▪ Plots the tradeoff between Sensitivity and Specificity for different probability thresholds
▪ The ideal corner: 100% sensitivity (no false negatives) and 100% specificity (no false positives)
▪ Decrease the threshold: increase TP, but also FP
▪ Increase the threshold: increase TN, but also FN

▪ AUC (Area under the ROC Curve)
▪ Probability that a randomly chosen positive case (e.g., "default") receives from the model a higher score (predicted probability) than a randomly chosen negative case (e.g., "non-default")

Model Validation and Evaluation Measures
Precision-Recall (PR) Curve

▪ Precision-Recall (PR) Curve


▪ Plots the tradeoff between Sensitivity (Recall) and Precision for different probability thresholds

▪ Comparison to ROC curve:


▪ ROC curves are appropriate for balanced datasets
▪ PR curves are also appropriate for imbalanced datasets
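A minimal sketch of both curves in R, assuming the pROC and PRROC packages; prob holds predicted default probabilities and y the true 0/1 outcomes (both placeholders):

  library(pROC)   # ROC curve and AUC
  library(PRROC)  # precision-recall curve

  roc_obj <- roc(response = y, predictor = prob)
  plot(roc_obj)
  auc(roc_obj)  # area under the ROC curve

  pr_obj <- pr.curve(scores.class0 = prob[y == 1],  # scores of the positives
                     scores.class1 = prob[y == 0],  # scores of the negatives
                     curve = TRUE)
  plot(pr_obj)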

Model Validation and Evaluation Measures
Example: Loan Default Prediction

▪ Dependent variable: default (yes / no)


▪ 30 independent variables:
▪ Loan characteristics (e.g., funded amount, term, loan amount)
▪ Borrower characteristics (e.g., employment length, annual income, home ownership)

                 Original Data        Train Data           Validation Data      Test Data
                 # Obs.     % of      # Obs.     % of      # Obs.     % of      # Obs.    % of
Default          100,209    20.3%     68,773     20.6%     10,417     19.8%     10,417    19.1%
No Default       393,818    79.7%     264,528    79.4%     85,076     80.2%     44,214    80.9%
Total            494,027    100%      333,301    100%      106,095    100%      54,631    100%
Share of
Original Data    100%                 67.5%                21.5%                11%

The share of the most frequent class ("No Default") is the "no information rate" in the data.
Model Validation and Evaluation Measures
Example: Loan Default Prediction

▪ No information rate: the accuracy if we simply predict that all observations belong to the most frequent class in the data

▪ Example here: predict "no default" for all observations

▪ This rate is an important benchmark to assess the accuracy of our models

Model Validation and Evaluation Measures
Example: Loan Default Prediction

[Figures: ROC curve and PR curve for the loan default model]

Hands-on Exercises
References

Davis, J. & Goadrich, M. (2006), The Relationship Between Precision-Recall and ROC Curves, in: Proceedings of the 23rd International Conference on Machine Learning, 233-240.

Dietterich, T. G. (1998), Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10(7), 1895-1923.

Raschka, S. (2018), Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning, Working Paper.

Day 2
Shrinkage
Agenda

▪ Introduction

▪ Lasso

▪ Ridge

▪ Hands-on Exercise

▪ Break

▪ Elastic Net

▪ Multiclass Problems

▪ Hands-on Exercise

Shrinkage I
Introduction to Shrinkage

▪ Handle the “curse” of dimensionality

▪ p >> n problems

▪ Too many coefficients → Overparametrization, risk of overfitting

▪ Extreme case: millions of independent variables → computational problems

▪ Also called regularization methods

▪ Not exclusive to machine learning, similar ideas in “traditional” econometrics:

▪ Partial Least Squares (PLS)

▪ Principal Component Regression (PCR)

▪ Horseshoe prior in Bayesian statistics

Shrinkage I
Ridge Regression

▪ Hoerl and Kennard (1970): Impose penalty on coefficients’ magnitude

▪ Idea: tradeoff between goodness of fit (squared residuals) and model complexity (squared
coefficients)

$\hat{\boldsymbol{\beta}}^{Ridge} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert_2^2$

(first term: Sum of Squared Residuals; penalty: sum of squared coefficients)

▪ What happens if the independent variables are not on the same scale?

▪ Unfair penalties! So, we usually standardize X and center y
▪ Therefore, remove the intercept (it becomes simply $\bar{y}$)

▪ Also called L2 regularization (L2 norm: Euclidean distance)

Shrinkage I
Ridge Regression

$\hat{\boldsymbol{\beta}}^{Ridge} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \, \boldsymbol{\beta}'\boldsymbol{\beta}$

▪ The solution depends on λ, the size of the penalty (a hyperparameter that can be tuned)

▪ Use cross-validation to choose the optimal λ

▪ What happens if λ → 0?
▪ We are back to the OLS solution

▪ What happens if λ → ∞?
▪ Then all coefficients approach zero

Shrinkage I
Implications for Bias-Variance Tradeoff

▪ Shrinkage reduces variance, but introduces bias!

[Figure: sampling distributions of $\hat{\boldsymbol{\beta}}^{OLS}$ (unbiased, but large variance) and $\hat{\boldsymbol{\beta}}^{Ridge}$ (lower variance, but biased)]

▪ The bias increases as λ increases
▪ The variance decreases as λ increases

▪ Ridge regression is not helpful for variable selection
▪ Even if the true coefficient is zero, Ridge will shrink it, but it will not set it to zero
▪ Good predictions, but difficult to interpret the resulting coefficients
▪ No concept of statistical significance (no standard errors provided)

Shrinkage I
LASSO Regression

▪ Tibshirani (1996): Least Absolute Shrinkage and Selection Operator

$\hat{\boldsymbol{\beta}}^{LASSO} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} |\beta_j| = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert_1$

(first term: Sum of Squared Residuals; penalty: sum of absolute coefficients)

▪ LASSO also performs variable selection: coefficients can become exactly zero

▪ Sparse solutions: many zero coefficients, equivalent to excluding the variables from the
model

▪ Computational advantage: independent variables with zero coefficients can be ignored

▪ Also called L1 regularization (L1 norm: Manhattan distance)

Shrinkage I
LASSO Regression

▪ Efron et al. (2004): LARS (Least Angle Regression) to compute LASSO efficiently

▪ But, with highly correlated independent variables:

▪ LASSO arbitrarily selects one variable and reduces the other coefficients to zero

▪ Under these conditions, even small values of λ give many zero coefficients

Shrinkage I
LASSO vs Ridge Regression
Contours of the least squares error function

LASSO: $|\beta_1| + |\beta_2| \le t$        Ridge: $\beta_1^2 + \beta_2^2 \le t$

Source: Hastie, Tibshirani and Friedman (2009)

Shrinkage I
Example: Loan Default Prediction – Ridge Regression

Shrinkage I
Example: Loan Default Prediction – Ridge Regression

Shrinkage I
Example: Loan Default Prediction – LASSO Regression

Shrinkage I
Example: Loan Default Prediction – LASSO Regression

Hands-on Exercise
Shrinkage II
Elastic Net

▪ Zou and Hastie (2005)

▪ Ridge often has a “grouping” effect

▪ Strongly correlated independent variables tend to be in or out of the model together

▪ Combine both methods: convex combination of Ridge and LASSO penalties

▪ Shrinks together the coefficients of correlated predictors like Ridge

▪ Selects variables like LASSO but suffers less from the multi-collinearity problem

$\hat{\boldsymbol{\beta}}^{Elastic\,Net} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \left[ \alpha |\beta_j| + (1-\alpha)\beta_j^2 \right], \quad 0 \le \alpha \le 1$

Shrinkage II
Elastic Net

Ridge: $\beta_1^2 + \beta_2^2 \le t$    LASSO: $|\beta_1| + |\beta_2| \le t$    Elastic Net: $\alpha(|\beta_1| + |\beta_2|) + (1-\alpha)(\beta_1^2 + \beta_2^2) \le t$

Shrinkage II
Example: Loan Default Prediction

▪ Using package glmnet in R

▪ Ridge: set parameter alpha = 0

▪ LASSO: set parameter alpha = 1

▪ Elastic Net: set parameter 0 < alpha < 1

▪ As with the lasso and ridge, we typically do not penalize the intercept term, and
standardize the predictors for the penalty to be meaningful.

▪ The parameter α determines the mix of the penalties and is often pre-chosen on qualitative grounds, or chosen by cross-validation.
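A minimal sketch, assuming a model matrix x and a 0/1 response y built from the train data (placeholders); cv.glmnet() picks λ by cross-validation:

  library(glmnet)

  x <- model.matrix(default ~ ., data = train_data)[, -1]  # drop the intercept column
  y <- train_data$default

  cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)    # Ridge
  cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)    # LASSO
  cv_enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)  # Elastic Net

  cv_lasso$lambda.min              # lambda with the lowest cross-validated error
  coef(cv_lasso, s = "lambda.min") # sparse coefficients: many are exactly zero

  # Predicted default probabilities on the validation data
  x_val <- model.matrix(default ~ ., data = validation_data)[, -1]
  prob  <- predict(cv_lasso, newx = x_val, s = "lambda.min", type = "response")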

Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression

Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression

Shrinkage II
Example: Loan Default Prediction – Confusion Matrix Comparison
(Validation data)
                             Logistic Regression   Ridge      LASSO      Elastic Net
TN (pred. No Def., act. No Def.)   84,915           85,049     84,931     84,987
FN (pred. No Def., act. Def.)       2,151            8,780      2,168      2,456
FP (pred. Def., act. No Def.)         161               27        145         89
TP (pred. Def., act. Def.)         18,868           12,239     18,851     18,563

Accuracy                            97.8%            91.7%      97.8%      97.6%
Precision                           99.2%            99.8%      99.2%      99.5%
Recall                              89.8%            58.2%      89.7%      88.3%
F1-Score                            94.2%            73.5%      94.2%      93.6%

• Remember: No information rate in the validation data is 80.2%.

Shrinkage II
Multiclass Problems

[Figures: extending binary classification to multiclass problems]
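The same penalties carry over to multiclass problems; a minimal sketch with glmnet's multinomial family (x, a factor response y, and new data x_new are placeholders):

  library(glmnet)

  # y is a factor with more than two classes
  cv_multi <- cv.glmnet(x, y, family = "multinomial", alpha = 1)

  coef(cv_multi, s = "lambda.min")  # one coefficient vector per class
  predict(cv_multi, newx = x_new, s = "lambda.min", type = "class")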
Hands-on Exercise
References

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), Least Angle Regression, The Annals of Statistics, 32(2), 407-499.

Hoerl, A. E., & Kennard, R. W. (1970), Ridge Regression: Biased Estimation for Nonorthogonal
Problems, Technometrics, 12(1), 55-67.

Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal
Statistical Society, Series B (Methodological), 58(1), 267-288.

Zou, H., & Hastie, T. (2005), Regularization and Variable Selection via the Elastic Net, Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

Day 3
Decision Trees
Agenda

▪ Introduction

▪ Decision Trees (CART)

▪ Bootstrapping

▪ Break

▪ Introduction to Ensemble Methods


▪ Hands-on Exercise

Decision Trees (CART)
Are They Really a New Method?

▪ Decision trees are essentially not a brand-new idea

▪ First algorithms were developed decades ago

▪ AID (Morgan & Sonquist, 1963)

▪ CHAID (Kass, 1980)

▪ CART (Breiman et al., 1984)

▪ ID3 (Precursor of C5.0) (Quinlan, 1986)

▪ But now they are more widely known and used (why?)

Decision Trees (CART)
What Are Decision Trees?

▪ Belong to “divide and conquer” algorithms (greedy algorithms)

▪ Divide the “big” problem into smaller “subproblems”

▪ Recursively solve the “subproblems” and combine the solutions

▪ Recursive partitioning of the training data into smaller subsets until:

▪ All subsets predominantly belong to one value of Y (homogeneous nodes), or

▪ A pre-specified parameter (stop criterion) is reached – e.g., a minimum number of observations in a node

▪ The resulting tree represents a set of rules – "if this, then that"

Decision Trees (CART)
What Are Decision Trees?

▪ Classification Trees – Our focus today

▪ Y is a categorical variable, e.g. represents categories or binary choices

▪ E.g. transportation mode (car, bike, subway), default (y/n), buy (y/n)

▪ Regression Trees

▪ Y is a numerical variable, e.g. represents discrete or continuous quantities

▪ E.g., number of products purchased, duration of customer relationship

Decision Trees (CART)
Example: Will a Borrower Default?

▪ Example: Will a borrower default?
▪ Classification tree: predict default (yes/no)
▪ Method: CART (RPART in R)

▪ Root node: beginning of the tree
▪ Decision nodes: nodes in which a choice is made – lead either to an outcome or to another decision node
▪ Leaf/terminal nodes: outcomes

▪ Question formats
▪ x ≥ a (or x < a)
▪ x = a
▪ x ∈ A, where A are partitions of the values x takes in the training data
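A minimal sketch of fitting such a tree with rpart in R (the loan data frame is a placeholder; rpart grows the tree and prints the resulting rules):

  library(rpart)

  # Classification tree: method = "class" for a categorical outcome
  tree <- rpart(default ~ ., data = train_data, method = "class")

  print(tree)             # the set of "if this, then that" rules
  plot(tree); text(tree)  # plot of the tree structure

  # Predicted classes on the validation data
  pred <- predict(tree, newdata = validation_data, type = "class")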

Decision Trees (CART)
Example: Will a Borrower Default?

Example nodes:
▪ Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default
▪ A leaf node: 275,913 observations (82.7%); 39,918 (14.5%) default, 235,995 (85.5%) non-default
▪ A leaf node: 1,672 observations (0.5%); 1,672 (100%) default, 0 (0%) non-default

Decision Trees (CART)
How to Build a Decision Tree?

Fundamental steps to build a decision tree:

1. Which split criterion to choose?
2. For every decision node:
   • On which independent variable to split?
   • On which value to split (i.e., which value is the cut-off value)?
3. Depth of the tree (i.e., number of decision nodes)?
4. What value to predict at each leaf node?

Decision Trees (CART)
1. Which Split Criterion to Choose?

▪ Impurity using the Gini Index: CART and its implementation in R, RPART

$i(t) = 1 - \sum_{j=1}^{k} p(j|t)^2$, where $p(j|t)$ is the relative frequency of category j in node t and k is the number of categories in the sample

▪ Minimum value: all observations belong to one category ("pure" node)
▪ Maximum value: observations are equally distributed across all categories (maximal dispersion)

▪ Other criteria (we won't discuss these in detail)
▪ Significance tests: χ² (CHAID), permutation tests (CTREE)
▪ Entropy (C5.0)
▪ Sum of squares (for regression)
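A minimal sketch of the Gini index as an R function, evaluated at the two extremes for a binary Y:

  # Gini impurity of a node, given the counts of each category in the node
  gini <- function(counts) {
    p <- counts / sum(counts)
    1 - sum(p^2)
  }

  gini(c(100, 0))  # pure node: 0 (minimum)
  gini(c(50, 50))  # maximal dispersion: 0.5 (maximum for two categories)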
Decision Trees (CART)
2. On Which Value to Split?

▪ Goal: generate child nodes that are "purer" than their parents

▪ The split finds the independent variable and its cut-off value that maximize the decrease in node impurity

▪ Decrease in node impurity from split s in node t (Gini Index):

$\Delta i(s,t) = i(t) - \sum_{n=1}^{N} p_{t_n} i(t_n)$  for multiway splits (N child nodes)

$\Delta i(s,t) = i(t) - p_{t_l} i(t_l) - p_{t_r} i(t_r)$  for binary splits ($t_l$: left node, $t_r$: right node)

▪ The split minimizes the second term of the equation

$p_{t_n}$: proportion of observations in node t assigned to child node $t_n$, with $\sum_{n=1}^{N} p_{t_n} = 1$.

Decision Trees (CART)
2. On Which Value to Split - Example Loan Default Prediction
▪ Loan default prediction with RPART: decrease in node impurity from the first split

Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default

$i(t) = 1 - \left(\frac{68{,}773}{333{,}301}\right)^2 - \left(\frac{264{,}528}{333{,}301}\right)^2 \approx 0.33$

Right child node: 57,388 observations (17%); 28,855 (50.3%) default, 28,533 (49.7%) non-default

$i(t_r) = 1 - \left(\frac{28{,}855}{57{,}388}\right)^2 - \left(\frac{28{,}533}{57{,}388}\right)^2 \approx 0.50$

Left child node: 275,913 observations (82.7%); 39,918 (14.5%) default, 235,995 (85.5%) non-default

$i(t_l) = 1 - \left(\frac{39{,}918}{275{,}913}\right)^2 - \left(\frac{235{,}995}{275{,}913}\right)^2 \approx 0.25$

Decrease in node impurity from the first split:¹

$\Delta i(s,t) = 0.33 - \frac{275{,}913}{333{,}301} \cdot 0.25 - \frac{57{,}388}{333{,}301} \cdot 0.50 = 0.04$

¹ The "improve" reported under summary() in R shows the decrease in node impurity multiplied by the number of obs. in the parent node.

Decision Trees (CART)
3. How Deep Should the Tree be?

▪ Without a stop criterion (e.g., a minimum number of observations in a node):

▪ The tree can potentially grow until all observations either have
▪ The same value for the dependent variable (perfect prediction), or
▪ The same value for the independent variables

▪ Extreme case: each observation has its own leaf node → overfitting!
▪ Predictive power is high on training data, but poor on new data – the model is not generalizable

▪ Solution: pruning (removing branches from the tree)

Decision Trees (CART)
3. How Deep Should the Tree be?

Pre-pruning: stop tree growth during the tree building process

▪ Pre-define thresholds either in

▪ Split criterion (e.g., minimum improvement in split criterion)

▪ Tree characteristics (e.g., number of leaf nodes, number of splits, minimum number of observations in leaf nodes)

Decision Trees (CART)
3. How Deep Should the Tree be?

Post-pruning: grow the tree to its maximum and then trim the nodes in a bottom‐up fashion

▪ Merge leaf nodes while considering prediction error (shouldn’t increase much)

▪ Cost-complexity pruning with cross-validation: CART

▪ Pessimistic error-based pruning with binomial confidence limit: C5.0

▪ Usually possible to weight the prediction errors differently

▪ E.g., it might be worse to predict "no default" for a borrower who ends up defaulting than to predict "default" for a borrower who doesn't

Decision Trees (CART)
4. What Value to Predict at Each Leaf Node?

Predictions for every leaf node

▪ Majority voting: category with the most “votes” within the

leaf node is the prediction (mode of dependent variable)

▪ Assumption: Equal cost of type I and type II errors

▪ When is this (not) a reasonable assumption?

▪ Predicted probabilities (relative frequency of category in

the leaf node)

Source: twitter.com/freakonometrics

Decision Trees (CART)
4. What Value to Predict at Each Leaf Node - Example Loan Default Prediction

[Figure: the fitted tree with four leaf nodes, predicting No Default, No Default, Default, and Default]

Decision Trees (CART)
Handling Missing Values

▪ Surrogate variables
▪ Tree grows with the selected splits (primary splits) using observations with no missing
values on the split variables
▪ Weakness: CART is biased toward selecting variables with many missing values for a
primary split (Kim & Loh, 2001)
▪ When a split variable is missing for an observation:
▪ Surrogate split: use a surrogate variable instead of the primary split variable
▪ A surrogate variable is another independent variable and cut-off value whose split most
resembles the primary split
▪ Ideally, they should send exactly the same observations to each child node

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction
Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default

Now: 100,000 missings in total_pymnt

▪ Randomly force 100,000 values of total_pymnt to be missing

▪ Surrogate variables are used when we set na.action = na.rpart

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction

▪ Output from R – using the summary() command:

  Primary splits:
      total_pymnt < 5317.716 to the right, improve=8636.406, (100,000 missing)
  Surrogate splits:
      loan_amnt < 4712.5 to the right, agree=0.917, adj=0.515, (100,000 split)

▪ Observations with a missing total_pymnt are split on loan_amnt instead

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction

▪ Agreement: proportion of observations sent to the "correct" leaf node when the surrogate is used (instead of the primary split)

$Agree = \frac{\#\,correct\ surrogate}{\#\,obs\ in\ parent\ node}$

▪ Adjusted agreement: deducts the "go with the majority" baseline from the surrogate agreement

$Adj = \frac{\#\,correct\ surrogate - \#\,correct\ \text{“go with majority”}}{\#\,obs\ in\ parent\ node - \#\,correct\ \text{“go with majority”}}$

Decision Trees (CART)
Discussion

▪ Impurity Criterion
▪ Gini Index is biased toward selecting split variables with many missing values

▪ Required Assumptions
▪ Observations are independent
▪ Joint distribution of X and Y in the training data is the same as in the test data

▪ Overfitting
▪ Single trees have high variance: unstable predictions
▪ Small perturbation in the data → large changes in leaf nodes
▪ Solution? Ensemble learning (more on that after the break!)

Detour: Bias-Variance Tradeoff
Bias-Variance Tradeoff
Expected Prediction Error
▪ What is the expected prediction error for a new observation with $X = x_0$?

▪ Consider $Y = f(X) + \varepsilon$, s.t. $E[\varepsilon] = 0$ and $Var(\varepsilon) = \sigma_\varepsilon^2$

▪ Using a quadratic loss function (squared error):

$Error(x_0) = E\left[(Y - \hat{f}(x_0))^2\right] = \sigma_\varepsilon^2 + \left(E[\hat{f}(x_0)] - f(x_0)\right)^2 + E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right]$

▪ $\sigma_\varepsilon^2$: irreducible error – remains unless $\sigma_\varepsilon^2 = 0$!
▪ $\left(E[\hat{f}(x_0)] - f(x_0)\right)^2$: squared bias – result of misspecifying $f(X)$
▪ $E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right] = Var(\hat{f}(x_0))$ – result of using a sample to estimate $f$

▪ Goal in Machine Learning: minimize this combination

▪ Complex models: usually low bias but high variance

Bias-Variance Tradeoff
Examples

▪ Overfitting: low bias, high variance
▪ Underfitting: high bias, low variance
▪ Good balance: low bias, low variance

Source: towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Bias-Variance Tradeoff
Examples

Source: https://www.kdnuggets.com/2016/08/bias-variance-tradeoff-overview.html

Bias-Variance Tradeoff
Changes in Error with Model Complexity

Source: scott.fortmann-roe.com/docs/BiasVariance.html

Bias-Variance Tradeoff
What Are Your Options?

▪ Use cross-validation to estimate the validation error

▪ Test and validate your model on new observations (set part of the data aside for that)

▪ Combine multiple models

▪ This combination is known as ensemble learning

▪ Most influential development in Machine Learning in the past decade (Seni & Elder,
2010)

▪ Main idea: average predictions of several models

Introduction to Ensemble Methods
Overview

▪ Again (!) not really a “new” development

▪ Bagging (Breiman,1996)

▪ Boosting (Freund & Schapire, 1997)

▪ Random Forests (Breiman, 2001)

▪ Gradient Boosting (Friedman, 2001)

▪ Group of models that together deliver one aggregate prediction

▪ Idea: combine a large number of “weak” models to produce a stronger, more stable
model

▪ Use average of predictions (numerical dep. var.) or majority voting (categorical dep. var.)

Introduction to Ensemble Methods
Overview

▪ Ensemble methods retain many advantages of Machine Learning methods

▪ Model higher order interactions automatically

▪ Helpful in p >>n problems

▪ Robust to outliers

▪ Multicollinearity is not a problem

▪ But…

▪ More difficult to visualize and interpret results

▪ Usually known as a “black box”

Introduction to Ensemble Methods
Overview

▪ Ideal world: infinitely many datasets available


▪ Estimate one model on each dataset

▪ Sadly, that’s not our reality…

▪ So, what’s the next best option?

Introduction to Ensemble Methods
Bagging Weak Models

▪ Bagging: bootstrap aggregating (Breiman, 1996)

▪ Idea: use your dataset to artificially generate “new” datasets

▪ Sample with replacement from training data (bootstrapping)

[Figure: bootstrap replicates drawn with replacement from the original training dataset]

Introduction to Ensemble Methods
Bagging Weak Models

▪ Estimate a model on each “new” bootstrapped dataset

▪ Aggregate predictions (average or majority voting)

▪ Improves accuracy and model stability

▪ Often used for decision trees, but you can use it with any weak supervised model
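A minimal sketch of bagging by hand in R, with rpart trees as the weak models (train_data and validation_data with a factor default are placeholders):

  library(rpart)

  set.seed(1)
  B <- 25                 # number of bootstrap replicates
  n <- nrow(train_data)

  # One tree per bootstrap replicate (sampling rows with replacement)
  trees <- lapply(1:B, function(b) {
    boot <- train_data[sample(n, n, replace = TRUE), ]
    rpart(default ~ ., data = boot, method = "class")
  })

  # Aggregate the B predictions by majority voting
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata = validation_data, type = "class")))
  pred <- apply(votes, 1, function(v) names(which.max(table(v))))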


Introduction to Ensemble Methods
Example: Loan Default Prediction – Confusion Matrix Comparison
(Validation data)
                                Logistic Regression   Decision Tree   Bagged Trees
TN (pred. No Def., act. No Def.)     84,915              84,872          84,869
FN (pred. No Def., act. Def.)         2,151              13,788          13,785
FP (pred. Def., act. No Def.)           161                 204             207
TP (pred. Def., act. Def.)           18,868               7,231           7,234

Accuracy                              97.8%               86.8%           86.8%
Precision                             99.2%               97.3%           97.2%
Recall                                89.8%               34.3%           34.4%
F1-Score                              94.2%               50.8%           50.8%

• Remember: No information rate in the validation data is 80.2%.

Hands-on Exercise
References

Alfaro, E., Gamez, M. & Garcia, N. (2013), adabag: An R Package for Classification with
Boosting and Bagging, Journal of Statistical Software, 54 (2), 1-35.

Alfaro, E., Garcia, N., Gamez, M. & Elizondo, D. (2008), Bankruptcy Forecasting: An Empirical Comparison of AdaBoost and Neural Networks, Decision Support Systems, 45, 110-122.

Breiman, L. (1996), Bagging predictors, Machine Learning, 24(2), 123-140.


Breiman, L. (1998), Arcing classifiers, The Annals of Statistics, 26(3), 801-849.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Breiman, L., Friedman, J. H., Stone, C. J. & Olshen, R. A. (1984), Classification and Regression
Trees. Belmont, CA, Wadsworth.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.

Hothorn, T., Hornik, K. & Zeileis, A. (2006), Unbiased Recursive Partitioning: A Conditional
Inference Framework, Journal of Computational and Graphical Statistics, 15(3), 651-674.

References

Loh, W.-Y. (2014), Fifty Years of Classification and Regression Trees, International Statistical Review, 82(3), 329-348.

Loh, W.-Y. & Shih, Y.-S. (1997), Split Selection Methods for Classification Trees, Statistica Sinica, 7(4), 815-840.
Quinlan, J. R. (1986), Induction of Decision Trees, Machine Learning, 1(1), 81-106.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA, Morgan
Kaufmann.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Therneau, T. M. & Atkinson, E. J. (1997), An Introduction to Recursive Partitioning Using the RPART Routines, Technical Report 61, Mayo Clinic, Section of Statistics, Rochester, Minnesota.

Day 4
Ensemble Methods
Agenda

▪ Random Forests

▪ Hands-on Exercise

▪ Break

▪ Gradient Boosting

▪ Hands-on Exercise

Recap
Bias-Variance Tradeoff
Changes in Error with Model Complexity

Source: scott.fortmann-roe.com/docs/BiasVariance.html

Ensemble Methods
Combining Several Weak Models to Generate a Stronger Model

Source: Lantz, Brett (2015), "Machine learning with R", Birmingham: Packt Publishing, Chapter 11.

Ensemble Methods
Bagging Weak Models

▪ Estimate a model on each “new” bootstrapped dataset

▪ Aggregate predictions (average or majority voting)

▪ Improves accuracy and model stability

▪ Often used for decision trees, but you can use it with any weak supervised model


Random Forests
Random Forests
Motivation

▪ Bagging helps to reduce model variance

▪ But…

▪ Trees often have many splits in common

▪ Consequence: similar predictions

▪ Idea: why not also introduce randomness in the selection of the candidate variables for a
split?

▪ This idea is implemented in Random Forests (Breiman, 2001)

Random Forests
Overview

▪ Use bootstrap to generate 𝑵 “new” datasets

▪ For each tree and split:

▪ Only give the algorithm a random set of 𝒎 independent variables to choose from (out of
the total 𝑴 independent variables in the dataset)

▪ Estimate 𝑵 trees, each with a different dataset

▪ Aggregate predictions with averaging or majority voting

Random Forests
Overview

[Diagram: the train data is bootstrapped into subsamples 1-4; one tree is grown per subsample; the trees' predictions are combined into an aggregated prediction. At each split, a tree considers only a fraction of all the available independent variables (m variables out of the total M variables).]

Random Forests
How to Choose m?

▪ How to choose m?

▪ Breiman's suggested heuristic: m = √n

▪ n: total number of independent variables

▪ Use cross-validation to find the "best" value of m for the data / problem at hand (see the sketch below)

▪ This is called (hyper)parameter tuning

▪ When m = n, you're back to bagging

▪ m in the R implementation: the mtry parameter
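
A sketch of such a grid search with the caret package; the data and variable names are hypothetical, and the mtry grid is arbitrary:

  library(caret)  # wraps randomForest when method = "rf"
  set.seed(42)

  # 5-fold cross-validation over a grid of mtry values
  ctrl <- trainControl(method = "cv", number = 5)
  grid <- expand.grid(mtry = c(2, 4, 6, 8))

  rf_tuned <- train(default ~ ., data = train, method = "rf",
                    trControl = ctrl, tuneGrid = grid, ntree = 500)
  rf_tuned$bestTune  # mtry with the best cross-validated performance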

Random Forests
Out-of-Bag Observations

▪ Another nice feature of Random Forests: OOB (Out-Of-Bag) observations

▪ Refers to observations in the original dataset that were not part of the respective bootstrap replicate

▪ So, OOB observations were not used to construct the model

▪ Therefore, they are great for estimating the generalization error!

▪ By default, OOB ≈ 1/3 of observations

▪ But you can adjust this as you wish

▪ OOB observations are also used to compute variable importance measures

[Diagram: original train dataset next to a bootstrap replicate; in this bootstrap replicate, the gray observation is OOB.]
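
A minimal sketch of reading off the OOB estimates with the randomForest package (data and variable names hypothetical):

  library(randomForest)
  set.seed(42)
  rf <- randomForest(default ~ ., data = train, ntree = 500)

  rf$err.rate[rf$ntree, "OOB"]  # OOB estimate of the generalization error
  rf$confusion                  # confusion matrix computed on OOB predictions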

Random Forests
Parameters

▪ ntree: number of trees

▪ mtry: the m parameter, i.e. the number of randomly selected candidate variables at each split

▪ nodesize: minimum number of observations in each leaf node
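
A minimal randomForest call setting these three parameters (the values are illustrative; data and variable names hypothetical):

  library(randomForest)
  set.seed(42)

  rf <- randomForest(default ~ ., data = train,
                     ntree = 500,    # number of trees
                     mtry = 4,       # candidate variables per split (m)
                     nodesize = 10)  # minimum observations in each leaf node
  pred <- predict(rf, newdata = valid)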

Random Forests
Example: Loan Default Prediction – Confusion Matrix Comparison
                 Logistic Regression    Decision Tree          Bagged Trees           Random Forest
                 Validation Data        Validation Data        Validation Data        Validation Data
Predictions      No Default  Default    No Default  Default    No Default  Default    No Default  Default
  No Default     84,915      2,151      84,872      13,788     84,869      13,785     85,069      2,396
  Default        161         18,868     204         7,231      207         7,234      7           18,623

Accuracy         97.8%                  86.8%                  86.8%                  97.7%
Precision        99.2%                  97.3%                  97.2%                  99.9%
Recall           89.8%                  34.3%                  34.4%                  88.6%
F1-Score         94.2%                  50.8%                  50.8%                  93.9%

• Remember: No information rate in the validation data is 80.2%.

Random Forests
Discussion

▪ Random Forests reduce variance by

▪ Training trees on different subsets of the training dataset

▪ Limiting the subspace of independent variables considered for a split

▪ But, there’s no free lunch…

▪ Bias may slightly increase (remember the potentially biased splits in each tree)

▪ More difficult to interpret

Random Forests
Discussion

▪ Bias can be reduced with tuning

▪ Find optimal parameters for your model / data (e.g. tree depth, number of trees, number of
observations in leaf nodes)

▪ Usually done with a grid search (test a finite combination of parameters)

▪ Interpretability can be tackled with interpretability methods

▪ E.g., variable importance measures, partial dependence plots, etc.
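
A sketch of these interpretability tools with the randomForest package (data and variable names hypothetical; loan_amnt stands in for any predictor of interest):

  library(randomForest)
  set.seed(42)
  rf <- randomForest(default ~ ., data = train, ntree = 500, importance = TRUE)

  importance(rf)  # variable importance measures
  varImpPlot(rf)  # plot them
  partialPlot(rf, pred.data = train, x.var = "loan_amnt")  # partial dependence plot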

Hands-on Exercise
Gradient Boosting
Gradient Boosting
First Idea of Boosting

▪ Ensemble method proposed by Freund & Schapire (1997)

▪ Main idea

▪ Sequentially train new models, giving more importance to observations that are "difficult to predict"

▪ Weights or residuals reflect how "difficult" it is to predict the outcome for a specific observation

▪ Dataset is the same, only the weights change at each iteration

▪ Often applied in the context of decision trees, but applicable to any “weak” model

▪ Trees are usually smaller than in Random Forests (often “stumps”: only one split)

Gradient Boosting
Boosting for Classification

▪ Objective: minimize classification error

▪ Weights are adjusted and renormalized at each iteration

▪ Aggregated prediction: Sum the “votes” of all models

▪ “Voting power” of each model is a function of its accuracy

▪ Models that accurately predict many observations change weights more significantly and
have a higher influence in the aggregated prediction
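
A minimal AdaBoost-style sketch with the adabag package (Alfaro, Gamez & Garcia, 2013); data and variable names are hypothetical, and the outcome must be a factor:

  library(adabag)
  set.seed(42)

  ada <- boosting(default ~ ., data = train,
                  mfinal = 100,          # number of boosting iterations / trees
                  coeflearn = "Breiman") # formula for each tree's voting power
  pred <- predict(ada, newdata = valid)
  pred$confusion  # confusion matrix on the validation data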

Gradient Boosting
Boosting for Classification

[Diagram: training begins with all weights equal. Tree 1 is fit; incorrectly (correctly) classified observations gain a higher (lower) weight. Tree 2 is fit on the reweighted data and the weights are adjusted again, and so on: each subsequent tree focuses on predicting the observations with higher weights. The trees' votes are combined into an aggregated prediction, where voting power matters: the vote of "good" models counts more.]

Gradient Boosting
Other Views of Boosting

▪ Friedman, Hastie & Tibshirani (2000)


▪ Adaboost ~ optimization method to minimize a particular exponential loss function (for
classification)
▪ Find that exponential loss ~ Bernoulli likelihood
▪ So, maximize Bernoulli likelihood instead (for classification)

▪ Breiman (1999)
▪ Boosting ~ gradient descent with a special loss function

▪ Friedman (2001)
▪ Generalize: from Adaboost to Gradient Boosting
▪ Handle many other loss functions
▪ Very important development: link “obscure” computational learning to standard
statistics (likelihood) and function optimization

Gradient Boosting
Slightly Change the Target Function

▪ Instead of training h(X) on the residuals of F(X), i.e. on y − F̂(X) ...

▪ ... train h(X) on the gradient of the loss function: L(y, F(X)) = (y − F(X))² / 2

▪ Goal: minimize J = Σ_i L(y_i, F(X_i)) over all observations i by finding an appropriate F(X_i)

▪ This yields the optimization problem: ∂J/∂F(X_i) = F(X_i) − y_i

▪ Residuals are the negative gradients of L:

  y_i − F(X_i) = −∂J/∂F(X_i)

Gradient Boosting
Slightly Change the Target Function

▪ y_i − F(X_i) = −∂J/∂F(X_i) ... does this look familiar?

▪ Hint, gradient descent: θ_i := θ_i − ρ ∂J/∂θ_i

▪ Gradient Boosting = Boosting + Gradient Descent

Source: https://www.oreilly.com/library/view/learn-arcore-/9781788830409/e24a657a-a5c6-4ff2-b9ea-9418a7a5d24c.xhtml

Gradient Boosting
Generalization Using Gradient Descent

▪ Changing from residuals to gradient descent → Gradient Boosting

▪ Generalize the method to other loss functions (e.g. log loss, absolute loss)

▪ So, in our new framework:

▪ Start with conservative estimate for 𝐹(𝑋) (e.g. mean of 𝑦)

▪ Compute the negative gradient (i.e. the residual): −g(X) = −∂L(y, F(X))/∂F(X) = y − F(X)

▪ Fit a new model h(X) to −g(X)

▪ Update F(X): F(X) := F(X) + ρ h(X)

▪ ρ is the size of the step

Gradient Boosting
Generalization Using Gradient Descent

▪ By restricting the tree depth we can control the order of approximation of 𝐹 𝑋

▪ E.g., Gradient Boosting with stumps (only one split) yields a first-order (linear) approximation

▪ But, if we fit the training data “too closely”: overfitting!

▪ Solution: regularization (or shrinkage)

▪ Introduce a parameter to slow down the incorporation of new results to the aggregate
model

▪ Each update is "scaled" or "shrunk" by the learning-rate parameter ν:

  F(X) := F(X) + ν · ρ h(X)
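
A minimal sketch of this loop for squared loss, fitting rpart stumps to the residuals and shrinking each update by ν; for squared loss, the step size ρ is absorbed into the leaf means. The data and variable names (train, y) are hypothetical:

  library(rpart)

  nu <- 0.1   # learning rate (shrinkage)
  M  <- 100   # number of boosting iterations

  X    <- train[, setdiff(names(train), "y")]  # predictors
  Fhat <- rep(mean(train$y), nrow(train))      # conservative start: mean of y
  for (m in seq_len(M)) {
    df  <- cbind(X, r = train$y - Fhat)        # residual = negative gradient
    fit <- rpart(r ~ ., data = df,
                 control = rpart.control(maxdepth = 1))  # a stump
    Fhat <- Fhat + nu * predict(fit, newdata = df)       # shrunken gradient step
  }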

Gradient Boosting
Generalization Using Gradient Descent

▪ How to choose the learning rate?

▪ Tradeoff between learning rate and number of trees

▪ Best to optimally choose parameters (e.g. using cross-validation)

▪ Stochastic Gradient Boosting (Friedman, 2002)

▪ Gradient Boosting + randomness in the train data seen by each model (as in Bagging)

Gradient Boosting
Parameters

▪ Parameters to choose (names in italic: inputs to gbm function in R)

▪ Loss function (distribution)

▪ Number of trees (n.trees)

▪ Learning rate (shrinkage)

▪ Tree depth - shorter trees usually give better results! (interaction.depth)

▪ Minimum number of observations in leaf nodes (n.minobsinnode)

▪ Fraction of observations in training data randomly selected to grow the next tree
(bag.fraction)
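
A gbm call setting these parameters (values are illustrative; data and variable names hypothetical, with default coded as 0/1 for the Bernoulli loss):

  library(gbm)
  set.seed(42)

  gbm_fit <- gbm(default ~ ., data = train,
                 distribution = "bernoulli",  # loss function (0/1 outcome)
                 n.trees = 1000,              # number of trees
                 shrinkage = 0.05,            # learning rate
                 interaction.depth = 2,       # tree depth
                 n.minobsinnode = 10,         # min. observations per leaf node
                 bag.fraction = 0.5,          # stochastic gradient boosting
                 cv.folds = 5)                # for choosing n.trees

  best_iter <- gbm.perf(gbm_fit, method = "cv")  # CV-optimal number of trees
  p_hat <- predict(gbm_fit, newdata = valid, n.trees = best_iter,
                   type = "response")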

Gradient Boosting
Example: Loan Default Prediction – Confusion Matrix Comparison
                 Logistic Regression    Decision Tree          Random Forest          Gradient Boosting
                 Validation Data        Validation Data        Validation Data        Validation Data
Predictions      No Default  Default    No Default  Default    No Default  Default    No Default  Default
  No Default     84,915      2,151      84,872      13,788     85,069      2,396      84,980      1,370
  Default        161         18,868     204         7,231      7           18,623     96          19,649

Accuracy         97.8%                  86.8%                  97.7%                  98.6%
Precision        99.2%                  97.3%                  99.9%                  99.5%
Recall           89.8%                  34.3%                  88.6%                  93.5%
F1-Score         94.2%                  50.8%                  93.9%                  96.4%

• Remember: No information rate in the validation data is 80.2%.

Gradient Boosting
Discussion

▪ Gradient Boosting reduces bias by

▪ Training trees sequentially on the training dataset

▪ Stochastic Gradient Boosting reduces variance by

▪ Training trees on different subsets of the training dataset

▪ What kind of problems can it solve?

▪ Any (!) loss function for which you can compute the gradient

Gradient Boosting
Discussion

▪ But:

▪ Usual shortcomings of ensembles: bias, interpretability

▪ Gradient Boosting requires more computing time

▪ Remedies:

▪ Tuning

▪ Tools and measures for interpretability

▪ A few other tricks possible (see xgboost)

Hands-on Exercise
References

Breiman, L. (1999), Prediction Games and Arcing Algorithms, Neural Computation, 11(7), 1493-1517.

Breiman, L. (2001), Random Forests, Machine Learning, 45(1), 5-32.

Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Freund, Y. & Schapire, R. E. (1997), A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting, Journal of Computer and System Sciences, 55(1), 119-139.

Friedman, J., Hastie, T. & Tibshirani, R. (2000), Additive Logistic Regression: A Statistical View of
Boosting, Annals of Statistics, 28(2), 337-374.

Friedman, J. H. (2001), Greedy Function Approximation: A Gradient Boosting Machine, Annals of
Statistics, 29(5), 1189-1232.

Friedman, J. H. (2002), Stochastic Gradient Boosting, Computational Statistics & Data Analysis, 38(4),
367-378.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.

Ridgeway, G. (1999), The State of Boosting, Computing Science and Statistics, 31, 172-181.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Day 5
Wrap Up
Agenda

▪ Machine Learning in Practice (Serafín Martínez-Jaramillo, CEMLA)

▪ Break

▪ Variable Importance

▪ Wrap Up

▪ Q&A

Machine Learning in Practice
Wrap Up & Q&A
Where Are We Headed?
Current Challenges and Opportunities

▪ Internal stakeholders: “what do we gain from using AI / machine learning?”

▪ Systems, talent, regulation (e.g. privacy)

▪ Markets are complex: regimes change, trends come and go

▪ Unstable models (asset management: “no longer than 3 weeks”)

▪ Use networks to better assess systemic risk

▪ Need for explainable machine learning

Where Are We Headed?
Explainable Machine Learning

▪ Avoid poor decisions

▪ Text mining at an investment fund: confusing Berkshire Hathaway with Anne Hathaway

▪ IBM “Watson for Oncology”

▪ Tyndaris Investments automated trading: ~ US$20 M daily losses

▪ Prevent algorithm bias: is the model doing what it is supposed to?

▪ Bias in recruiting tools (e.g., Amazon AI recruiting tool)

▪ Predicting the likelihood of a criminal reoffending (e.g., US COMPAS)

Will Machine Learning Solve All Our Problems?

Structure and Organization of the Course

▪ Day 1: October, 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II

▪ Day 2: October, 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II

▪ Day 3: October, 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II

Structure and Organization of the Course

▪ Day 4: October, 7
▪ Random Forest
▪ Break
▪ Gradient Boosting

▪ Day 5: October, 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A

Main Idea: Toolbox

Process

▪ Understand “business” problem

▪ Map to machine learning problem

▪ Understand the data

▪ Explore and prepare the data

▪ “Feature” development

▪ Method Selection

▪ Evaluation

▪ Deployment

Steps
Data Preparation

Features

Build Model

Validate Model

Deploy Model

Examples

2020 Banca d'Italia and Federal Reserve Board Joint Conference on Nontraditional Data & Statistical
Learning with Applications to Macroeconomics – Oct. 2021

• Forecasting UK inflation bottom up, A. Joseph, Eleni Kalamara, G. Potjagailo, and G. Kapetanios
• The Macroeconomy as a Random Forest, P. Goulet Coulombe
• Teaching Machines to Measure Economic Activities from Satellite Images: Challenges and Solution, D. Ahn,
M. Cha, S. Han, J. Kim, S. Sang Lee, S. Park, S. Park, H. Yang, and J. Yang
• The Knowledge Graph for Macroeconomic Analysis with Alternative Big Data, Y. Yang, Y. Pang, G. Huang,
and Weinan E
• Machine Learning for Zombie Hunting: Firms' Failures and Financial Constraints, F. J.
Bargagli Stoffi, M. Riccaboni, and A. Rungi
• Deciphering the Fed Communication via Text-Analysis of Alternative FOMC Statements, T. Doh, D. Song,
and S.-K. Yang
• Measuring central banks' sentiment and its spillover effects with a network approach, G. Tizzanini, P.
Lorenzini, M. Priola, L. Zicchino
• Application of text mining to the analysis of climate-related disclosures, Á. I. Moreno, and T. Caminero

BIG THANK YOU!!!!!
- To all participants, for the lively discussion, deep
comments and/or just for staying with us
- Of course to Serafín Martínez Jaramillo, Eréndira
Fuentes Hernández and all the colleagues "unknown" to us
- And not to forget: the translators and the IT team

References

Athey, S. (In Press), The Impact of Machine Learning on Economics, In A. K. Agrawal, J. Gans &
A. Goldfarb (Eds.), The Economics of Artificial Intelligence: An Agenda, University of Chicago
Press.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Backup Day 1
Model Validation and Evaluation Measures
Comparing Competing Models

▪ How to tell if the difference in performance is really meaningful?

▪ Type I and type II errors

Source: Ellis, P.D. (2010), “Effect Size FAQs”, www.effectsizefaq.com

Model Validation and Evaluation Measures
Comparing Competing Models

▪ Many popular tests have a high probability of type I error

▪ I.e., detecting a difference between models when no difference exists

▪ Examples:

▪ Test for the difference of two proportions (each model estimated one time)

▪ Paired-differences t test using several random train-test splits

▪ Alternatives:

▪ McNemar Test (McNemar, 1947)

▪ 5x2-Fold Cross-Validated Paired t-Test (Dietterich, 1998)


Model Validation and Evaluation Measures
Comparing Competing Models

McNemar Test (“within-subjects chi-squared test”)

▪ Do both models disagree in the same way?

▪ H0: Both models have the same performance: Prob(CI) = Prob(IC)

                        Model 1
                        Correct     Incorrect
  Model 2  Correct      CC          CI
           Incorrect    IC          II

  Total = CC + CI + IC + II

▪ You only need to estimate each model once

▪ Requires CI + IC ≥ 25

▪ If CI + IC < 25, use the exact binomial test instead

▪ In R: mcnemar.test and binom.test
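
A minimal sketch in R; pred_model1, pred_model2 and valid are hypothetical stand-ins for two sets of validation predictions and the validation data:

  # Was each validation observation classified correctly by each model?
  correct1 <- pred_model1 == valid$default
  correct2 <- pred_model2 == valid$default

  tab <- table(Model1 = correct1, Model2 = correct2)
  mcnemar.test(tab)  # chi-squared version

  # Exact binomial test on the discordant pairs (the CI and IC cells)
  binom.test(tab["FALSE", "TRUE"], tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"])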
Model Validation and Evaluation Measures
Comparing Competing Models

5x2-Fold Cross-Validated Paired t-Test

▪ Split the dataset 5 times into train and test sets (50% of observations each)
▪ In each of the 5 iterations:
▪ Step A: estimate Model 1 and Model 2 on the train set and evaluate performance on the test set
▪ ΔACC_A,i = ACC_A,i,Model 1 − ACC_A,i,Model 2 for each iteration i
▪ Step B: swap train and test (estimate on the test set and evaluate on the train set)
▪ ΔACC_B,i = ACC_B,i,Model 1 − ACC_B,i,Model 2 for each iteration i
▪ ΔACC_avg,i = (ΔACC_A,i + ΔACC_B,i) / 2 and s_i² = (ΔACC_A,i − ΔACC_avg,i)² + (ΔACC_B,i − ΔACC_avg,i)²

▪ Test statistic: t = ΔACC_A,1 / √((1/5) Σ_{i=1}^{5} s_i²), where ΔACC_A,1 is ΔACC_A from the first iteration

▪ t follows a t distribution with 5 degrees of freedom
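
A minimal sketch of this procedure in R; dat, y and the predict_model1 / predict_model2 helpers are hypothetical stand-ins for the dataset and the two competing models:

  set.seed(42)
  s2   <- numeric(5)
  d_A1 <- NA

  for (i in 1:5) {
    idx   <- sample(nrow(dat), floor(nrow(dat) / 2))  # 50/50 split
    half1 <- dat[idx, ]
    half2 <- dat[-idx, ]

    # Accuracy difference between the two models on a given train/test pair;
    # predict_model1/predict_model2 are hypothetical train-and-predict helpers
    acc_diff <- function(tr, te) {
      a1 <- mean(predict_model1(tr, te) == te$y)
      a2 <- mean(predict_model2(tr, te) == te$y)
      a1 - a2
    }

    d_A   <- acc_diff(half1, half2)  # Step A
    d_B   <- acc_diff(half2, half1)  # Step B: train and test swapped
    d_avg <- (d_A + d_B) / 2
    s2[i] <- (d_A - d_avg)^2 + (d_B - d_avg)^2
    if (i == 1) d_A1 <- d_A
  }

  t_stat <- d_A1 / sqrt(mean(s2))         # t statistic with 5 degrees of freedom
  p_val  <- 2 * pt(-abs(t_stat), df = 5)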

Backup Day 3
Decision Trees (CART)
Split Criterion: Impurity vs. Entropy

Only for classification problems (categorical dependent variable)

▪ Minimum value: all observations belong to one category (“pure” node)

▪ Maximum value: observations are equally distributed across all categories (maximal
dispersion)

▪ Gini Index (CART, RPART)

  i(t) = 1 − Σ_{j=1}^{k} p(j|t)²

  p(j|t): relative frequency of category j in node t
  k: number of categories in the sample

▪ Information Gain (C5.0¹)

  Entropy(t) = Σ_{j=1}^{k} −p(j|t) log₂ p(j|t)

¹ C5.0 uses a normalized version of Information Gain, the "Gain-Ratio"
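
Both measures are easy to compute for a node; a minimal R sketch:

  node_stats <- function(y) {
    p <- as.numeric(table(y) / length(y))  # relative frequency of each category
    c(gini    = 1 - sum(p^2),
      entropy = -sum(p[p > 0] * log2(p[p > 0])))
  }

  node_stats(c("A", "A", "A", "B"))  # fairly pure node: low values
  node_stats(c("A", "B", "A", "B"))  # maximal dispersion (two categories)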

Decision Trees (CART)
Split Criterion: Impurity vs. Entropy

1. Splitting Criterion: Sum of Squares (Regression)

Only for regression problems (numerical dependent variable)

▪ Minimum value: all observations in the node have the same value of the dependent variable
(homogeneous node)

  SS(t) = Σ_{i=1}^{n_t} (y_i − ȳ)²

  (the variance of y within node t, up to scaling)

  ȳ: the average value of y_i in node t, which contains n_t observations
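
A one-line R version, plus the implied split criterion (left is a hypothetical logical index of observations sent to the left child):

  ss <- function(y) sum((y - mean(y))^2)  # within-node sum of squares

  # Reduction in SS achieved by splitting node t into left/right children
  ss_reduction <- function(y, left) ss(y) - ss(y[left]) - ss(y[!left])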

Decision Trees
Handling Missing Values in C5.0

▪ Proportional weighting

▪ Tree grows with all observations – lower information gain from variables with many
missings

▪ Observation with a missing in the split variable is sent down both child nodes

▪ In each child node: weight this observation by the proportion of observations with no
missings in the split variable

▪ Aggregate predictions of all leaf nodes the observation reached, using the weights
gained in each followed path

▪ Weighted average for numeric variables, or category with the highest probability, for
categorical variables

▪ Implemented in C5.0 and its precursors (e.g., C4.5)


Backup Day 4
Boosting for Regression

▪ Objective: minimize squared error

▪ Subsequent model tries to “correct” the errors of the previous model

▪ Predict the residuals of the previous model

▪ Aggregated prediction: Sum the predictions of all models

[Diagram: training begins with the actual values of the dependent variable. Tree 1 is fit; the residuals of Tree 1 become the dependent variable in Tree 2; the residuals of Tree 2 become the dependent variable in Tree 3, and so on: each subsequent tree focuses on predicting the observations with higher prediction error. The aggregated prediction is the sum of the predictions of all trees.]

Backup Day 5
Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Focus on causal explanation | Focus on empirical prediction
▪ Mostly parametric | Non-parametric
▪ Fail for n < p and overfit for p < n with large p | Capable of handling both n < p and large-p problems (e.g. hundreds of covariates)
▪ Not always feasible with big data | Work well with big data
▪ Model selection based on nested models or stepwise methods | Systematic model selection based on cross-validation
▪ In-sample goodness of fit | Out-of-sample goodness of fit

Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Strong assumptions about the data generating process (DGP) | Little to no DGP assumptions
▪ If assumptions are not appropriate, conclusions may be seriously wrong | Supervised case: data is independent; the joint distribution of X and Y is the same in training and test data
▪ E.g. how appropriate is a linear, logistic or Cox model for the data? | Better results for large data or complex relationships
▪ Why should they work: asymptotic theory | Why should they work: open question
▪ Results are directly interpretable | Need additional tools for interpretation
▪ Provide confidence intervals | Usually don't provide confidence intervals
intervals
Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Only complete observations (no missings); have to drop or impute them | Missings are (usually) not a problem
▪ Model interpretation: coefficients, counterfactual analysis | Model interpretation: variable importance measures, Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE) plots, Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive Explanations (SHAP)


Variable Importance
Variable Importance with Surrogates

▪ How could a variable potentially improve model predictions?

▪ The potential improvement (here: Gini) is calculated for primary and surrogate splits

▪ Surrogate splits are weighted by their agreement

▪ Keep in mind: importance measures are often biased due to bias in splits!

Variable Importance (RPART Example)

Primary splits:
  total_pymnt < 5317.716 to the right, improve=8636.406,  (1e+05 missing)
  installment < 167.24   to the left,  improve=13731.080, (0 missing)
Surrogate splits:
  loan_amnt < 4712.5 to the right, agree=0.917, adj=0.515, (1e+05 split)
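
Output of this kind might be produced roughly as follows; the dataset and variables are hypothetical stand-ins matching the loan example above:

  library(rpart)
  fit <- rpart(default ~ ., data = train,
               control = rpart.control(maxsurrogate = 5,  # surrogate splits to keep
                                       usesurrogate = 2)) # use them for missings

  summary(fit)             # prints primary and surrogate splits per node, as above
  fit$variable.importance  # importance aggregated over primary and surrogate splits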

Variable Importance (Random Forest Example)
