▪ Break
▪ Hands-on Exercise
Prof. Gabriela Alves Werb, Prof. Stefan Bender, Sebastian Seltmann. Deutsche Bundesbank
04.10.2021
Page 2
Structure and Organization of the Course
Structure and Organization of the Course
▪ Day 1: October 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II
▪ Day 2: October 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II
▪ Day 3: October 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II
Structure and Organization of the Course
▪ Day 4: October 7
▪ Random Forest
▪ Break
▪ Gradient Boosting
▪ Day 5: October 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A
Main Idea: Toolbox
Machine Learning and Central Banking
Machine learning and record linkage in the data life cycle of the Bundesbank
Stefan Bender, Research Data and Service Center (RDSC), Deutsche Bundesbank
Overview
Machine Learning: Introduction
Machine Learning (Rayid Ghani)
Machine Learning
▪ "Field of study that gives computers the ability to learn without being explicitly programmed." (Samuel 1959, p. 210)
▪ "An agent is learning if it improves its performance on future tasks after making observations about the world." (Russell 2016, p. 693)
▪ "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (Mitchell 1997, p. 2)
Why Machine Learning?
▪ Goal: adaptive, scalable systems that are cost-effective to build and maintain
Process
▪ “Feature” development
▪ Method Selection
▪ Evaluation
▪ Deployment
Types of Machine Learning I
Types of Machine Learning II
▪ Supervised Learning:
Given are pairs of values (y₁, x₁), …, (yₙ, xₙ), from which a machine can learn ŷ = h(x).
Required: a training dataset and a test dataset!
Either classification or regression.
(Example: Linear Regression)
▪ Unsupervised Learning:
Given is a set of values x₁, …, xₙ with no assigned labels. Hence, the algorithm searches for patterns within the data to generate artificially constructed ys.
(Example: Clustering)
▪ Reinforcement Learning:
An algorithm receives a reward if it behaves correctly and/or a penalty if it makes a mistake.
(Example: Ant algorithm)
Thanks to Frank Raulf
Clustering
K-means algorithm
K-means example
K-means example, start: three initial cluster centers k1, k2, k3
K-means example, initial step: move the cluster centers k1, k2, k3 to their cluster means
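The two alternating steps illustrated on these slides (assign each point to its nearest center, then move each center to its cluster mean) can be written out directly. A minimal NumPy sketch; the toy data, the fixed initial centers, and k = 2 are illustrative choices, not from the slides.

```python
import numpy as np

def kmeans(X, centers, n_iter=20):
    """Plain k-means: alternate the assignment step and the update step."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each cluster center to the mean of its cluster.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two illustrative, well-separated blobs; initial centers one in each blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
centers, labels = kmeans(X, centers=X[[0, 10]])
```

With well-separated data the loop converges in a few iterations; in general, k-means only finds a local optimum, so the result depends on the initial centers.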
Supervised learning framework
▪ Training: estimate the prediction function y = f(x) (output = learned function applied to the features)
▪ Testing: apply f to a new test example x and output the predicted value y = f(x)
Slide credit: L. Lazebnik
Classification Task
Supervised Learning
What is required?
1) A concept.
2) A correct dataset, which is to be divided into
1) a training dataset and
2) a test dataset.
3) An efficient programming language (such as R, Python, Matlab, …).
4) Time.
However, there is always the possibility that the applied methods do not lead to a solution.
Pipeline: Features → Build Model → Validate Model → Deploy Model
Modeling & Validation
Training (building): Training Data → Features → Training (with Training Labels) → Learned model
Testing (validating): Test Data → Features → Learned model → Prediction
Factors to consider
▪ Complexity
▪ Overfitting
▪ Robustness
▪ Interpretability
▪ Training Time
▪ Test Time
Classification Challenges (Christen 2015)
▪ Often there is no gold standard available (no data sets with true
known match status)
▪ No large test data set collection available (like in information retrieval or machine learning)
Improving Data Quality with Machine Learning
Tobias Cagala, Deutsche Bundesbank
Background
Application to DQM
Data
Plausibility check
Data
Dataset
▪ Flagged: 385
Descriptive Analysis of Patterns
Number of Reporting Banks, Days since Maturity
Taking Advantage of the Probability
Advantages
▪ All of the 28 out of 827 securities in the Top 35
▪ Improvement in efficiency and effectiveness
Experience
▪ 50% reduction in time for checks and increased effectiveness of evaluations
Application to DQM
▪ Seasonal Adjustment
▪ "Is This Time Series Seasonal? How Random Forests Can Improve Seasonality Tests" by Daniel Ollech, Karsten Webel / DG Statistics
▪ Identification of Holdings
▪ by Frank Raulf
▪ Record Linkage
General Insights
▪ The linking of databases is challenged by data quality, database size, and privacy and confidentiality concerns.
Definition of Record Linkage and Major Challenges
▪ Record linkage (RL) means finding records in different data sets that represent the same entity and linking them.
▪ The major challenge is that (clean) unique entity identifiers are not available in the databases to be linked.
The basic record linkage process
There is no perfect world
▪ In a perfect world, there would be only True Positives (TP) and True Negatives (TN)
Record Linkage Challenges (Christen 2012)
Record Linkage Technique (Christen 2015)
▪ Deterministic matching
▪ Rule-based matching (complex to build and maintain)
▪ Probabilistic record linkage (Fellegi and Sunter, 1969)
▪ Use available attributes for linking (often personal information, like names, addresses,
dates of birth, etc.)
▪ Calculate match weights for attributes
▪ “Computer science” approaches
▪ Based on machine learning, data mining, database, or information retrieval techniques
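The probabilistic idea above (compare the linking attributes, weight the similarities, then threshold the score) can be illustrated with the standard library's difflib. A toy sketch: the records, the field weights, and the threshold are invented for illustration and are a stand-in for proper Fellegi-Sunter match weights, not an implementation of them.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, weights=None):
    """Weighted sum of attribute similarities (toy stand-in for match weights)."""
    weights = weights or {"name": 0.7, "city": 0.3}
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

# Hypothetical company records with slightly different spellings.
a = {"name": "Mueller GmbH", "city": "Frankfurt"}
b = {"name": "Müller GmbH", "city": "Frankfurt am Main"}
score = match_score(a, b)  # high, but below a perfect 1.0
```

A record pair would then be classified as a match if its score exceeds a chosen cut-off; picking that cut-off is exactly where the classification step of the linkage process comes in.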
The extended record linkage process
Christen (2012)
Importance of Preprocessing
Shares of effort within linkage process
Caveats of Record Linkage
Privacy Issues I
▪ Circle of trust
Privacy Issues II: The 5 Safes
▪ Safe results: results should not directly or indirectly identify any individual or
organisation.
Source for a Deeper Knowledge
Linking Deutsche Bundesbank Company Data
Dr. Christopher-Johannes Schild, FDSZ 1-5
Bundesbank’s relevant microdata sources: Company Data
1. Input Data
1. Data linkage
The RDSC has matched several NFI microdatasets (from Statistics, Banking Supervision, and external data) with an advanced machine learning algorithm and generated a matching table (with probabilistic matching scores)
Goal:
▪ Improve data quality, increase analytical value of data
▪ More general and flexible Record Linkage System
▪ Historicized matching tables
Record Linkage Process
Input Data A + Input Data B
→ Preprocessing
→ Blocking
→ Match Candidates / Predictors (Features)
→ Classification model (training data / test data)
→ Automatic Evaluation / Manual Review
→ Postprocessing
Thanks to Christopher-Johannes Schild
Linkage of Company Data by FDSZ ("Record Linkage")
Testing (validating): Test Data → Features → Learned model → Prediction
Bias vs. overfitting
Process steps: 1. Harmonization, 2. Indexing / Blocking ("coarse filter"), 3. Detailed comparison ("fine filter")
Resulting confusion matrix: TN = 179,203; FP = 2,553; FN = 8,856; TP = 125,292 → precision = TP / (TP + FP)
Data universes (I)
Data universes (II)
Train, Validation, and Test Subsamples
Splitting the Data
Train, Validation, and Test Subsamples
Splitting the Data
▪ Golden rule in Machine Learning: evaluate the model on data that was not used to train it
▪ Most common: randomly split the dataset into train and test data
▪ E.g., when modeling loan default, some borrowers land in the train, others land in the test
data
▪ Better approach: randomly split dataset into train, validation, and test data
▪ Train data
▪ The model learns from the underlying patterns and relationships in this subsample
▪ Validation data
▪ After optimizing several models in the train data, see how they perform in the validation
data
▪ Go back to training and iterate until you are happy with the results
▪ Validation data is “contaminated” → No unbiased estimate of the error on truly new data
Train, Validation, and Test Subsamples
Splitting the Data
▪ Test data
▪ Performance of the model in this subsample provides an unbiased estimate of the error on
new data
60% Train Data → Train Model
20% Validation Data → Evaluate Model (repeat)
▪ Measure error (e.g., error rate for classification or Sum of Squared Residuals for regression)
▪ But, beware!
▪ If you preselect the variables that flow into the model, then cross-validation must also be applied at the variable selection step (not only later)!
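The 60/20/20 split described on these slides can be sketched in a few lines of plain Python; the dataset size and the seed are arbitrary illustrative choices.

```python
import random

def split_indices(n, train=0.6, val=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffle before cutting: a random split
    n_train, n_val = int(n * train), int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(100)
```

The test indices are set aside once and touched only for the final, unbiased error estimate; model tuning iterates on the train and validation parts.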
Model Validation and Evaluation Measures
Confusion Matrix for Binary Classification
Confusion Matrix (rows: Predictions, columns: Actual Data)
                        Actual: No Default   Actual: Default
Predicted No Default           TN                  FN
Predicted Default              FP                  TP
Model Validation and Evaluation Measures
Accuracy, Recall
▪ What is the share of correctly predicted cases?
▪ Accuracy = (TN + TP) / Total, where Total = TP + TN + FP + FN
▪ Which share of the true default cases is correctly predicted?
▪ Sensitivity (or Recall) = TP / (TP + FN)
▪ Which share of the true non-default cases is correctly predicted?
▪ Specificity = TN / (TN + FP)
Model Validation and Evaluation Measures
Precision, F1-Score
▪ What is the share of correct "default" predictions?
▪ Precision = TP / (TP + FP)
▪ Can the model identify true "default" cases without many false alarms?
▪ F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
▪ The F1-Score provides a tradeoff between precision and recall
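All of these measures follow directly from the four confusion-matrix counts. A short sketch, fed here with the TN/FP/FN/TP counts reported on the record-linkage slide earlier in the deck:

```python
def metrics(tp, tn, fp, fn):
    """Binary-classification measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the record-linkage confusion matrix shown earlier in the deck.
m = metrics(tp=125292, tn=179203, fp=2553, fn=8856)
```

Because the F1-Score is the harmonic mean of precision and recall, it always lies between the two, closer to the smaller one.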
Model Validation and Evaluation Measures
ROC Curve
▪ Receiver Operating Characteristic (ROC) curve
▪ Plots the tradeoff between Sensitivity and Specificity for different probability thresholds
▪ Ideal point: 100% sensitivity (no false negatives) and 100% specificity (no false positives)
Model Validation and Evaluation Measures
Precision-Recall (PR) Curve
Model Validation and Evaluation Measures
Example: Loan Default Prediction
▪ No information rate: accuracy if we simply predict that all observations belong to the most frequent class
Model Validation and Evaluation Measures
Example: Loan Default Prediction
Hands-on Exercises
References
Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves.
In Proceedings of the 23rd international conference on Machine learning (pp. 233-240).
Raschka, S. (2018), Model Evaluation, Model Selection, and Algorithm Selection in Machine
Learning, Working Paper.
Day 2
Shrinkage
Agenda
▪ Introduction
▪ Lasso
▪ Ridge
▪ Hands-on Exercise
▪ Break
▪ Elastic Net
▪ Multiclass Problems
▪ Hands-on Exercise
Shrinkage I
Shrinkage I
Introduction to Shrinkage
▪ 𝑝 >> 𝑛 problems
Shrinkage I
Ridge Regression
▪ Idea: tradeoff between goodness of fit (squared residuals) and model complexity (squared coefficients)
▪ β̂_Ridge = argmin_β (Y − Xβ)′(Y − Xβ) + λ Σ_{j=1}^p β_j² = argmin_β (Y − Xβ)′(Y − Xβ) + λ‖β‖₂²
▪ What happens if the independent variables are not on the same scale?
▪ Therefore, remove the intercept (it becomes simply ȳ)
▪ Also called L2 regularization (L2 norm: Euclidean distance)
Shrinkage I
Ridge Regression
β̂_Ridge = argmin_β (Y − Xβ)′(Y − Xβ) + λ β′β
▪ What happens if 𝝀 → 𝟎?
▪ What happens if 𝝀 → ∞?
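Both questions can be checked numerically with the closed-form ridge solution β̂ = (X′X + λI)⁻¹X′Y; the simulated data, the true coefficients, and the two λ values below are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) b = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated data with known coefficients (illustrative, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

b_small = ridge(X, y, lam=1e-8)  # lam -> 0: approaches the OLS estimate
b_large = ridge(X, y, lam=1e6)   # lam -> inf: coefficients shrink toward 0
```

So as λ → 0 ridge reproduces ordinary least squares, and as λ → ∞ every coefficient is shrunk toward (but never exactly to) zero.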
Shrinkage I
Implications for Bias-Variance Tradeoff
▪ Introduces bias! OLS is unbiased but has large variance; Ridge has lower variance but is biased
▪ The bias increases as λ increases
▪ Even if the true coefficient is zero, Ridge will shrink it, but it will not set it to zero
Shrinkage I
LASSO Regression
▪ β̂_LASSO = argmin_β (Y − Xβ)′(Y − Xβ) + λ Σ_{j=1}^p |β_j| = argmin_β (Y − Xβ)′(Y − Xβ) + λ‖β‖₁
(sum of squared residuals + penalty: sum of absolute coefficients)
▪ LASSO also performs variable selection: coefficients can become exactly zero
▪ Sparse solutions: many zero coefficients, equivalent to excluding the variables from the
model
Shrinkage I
LASSO Regression
▪ Efron et al. (2004): LARS (Least Angle Regression) to compute LASSO efficiently
▪ Among highly correlated variables, LASSO arbitrarily selects one and shrinks the other coefficients to zero
▪ Under these conditions, even small values of λ give many zero coefficients
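Why LASSO can set coefficients exactly to zero is easiest to see in the soft-thresholding operator that coordinate-descent LASSO solvers apply. This is a sketch of that operator itself, not of the LARS algorithm mentioned above.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: shrink z toward zero by lam, and set it
    exactly to zero whenever |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Any coefficient whose magnitude is at or below lam becomes exactly 0.
shrunk = [soft_threshold(z, 1.0) for z in (3.0, 0.4, -0.9, -2.5)]
```

Contrast this with ridge, which multiplies coefficients by a factor smaller than one: they get small, but never exactly zero.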
Shrinkage I
LASSO vs Ridge Regression
Contours of the least squares error function
Shrinkage I
Example: Loan Default Prediction – Ridge Regression
Shrinkage I
Example: Loan Default Prediction – Ridge Regression
Shrinkage I
Example: Loan Default Prediction – LASSO Regression
Shrinkage I
Example: Loan Default Prediction – LASSO Regression
Hands-on Exercise
Shrinkage II
Shrinkage II
Elastic Net
▪ Selects variables like LASSO but suffers less from the multi-collinearity problem
β̂_ElasticNet = argmin_β (y − Xβ)′(y − Xβ) + λ Σ_{j=1}^p [α|β_j| + (1 − α)β_j²], with 0 ≤ α ≤ 1
Shrinkage II
Elastic Net
Shrinkage II
Example: Loan Default Prediction
▪ As with the lasso and ridge, we typically do not penalize the intercept term, and
standardize the predictors for the penalty to be meaningful.
▪ The parameter 𝛼 determines the mix of the penalties, and is often pre-chosen on qualitative
grounds, or chosen by cross-validation.
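The mixed penalty is straightforward to compute directly. A small sketch in which the coefficient vector and λ are made up, showing that α = 1 recovers the LASSO penalty and α = 0 the ridge penalty:

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * sum_j (alpha*|b_j| + (1-alpha)*b_j^2)."""
    return lam * sum(alpha * abs(b) + (1 - alpha) * b * b for b in beta)

beta = [1.0, -2.0, 0.0]
lasso_pen = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # sum of |b_j|
ridge_pen = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # sum of b_j^2
mixed_pen = elastic_net_penalty(beta, lam=1.0, alpha=0.5)  # halfway blend
```

Intermediate α values interpolate linearly between the two penalties, which is why α is either fixed on qualitative grounds or tuned by cross-validation as described above.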
Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression
Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression
Shrinkage II
Example: Loan Default Prediction – Confusion Matrix Comparison
                    Logistic Regression      Ridge                 LASSO                 Elastic Net
Validation Data     No Default  Default    No Default  Default    No Default  Default    No Default  Default
Pred. No Default    84,915      2,151      85,049      8,780      84,931      2,168      84,987      2,456
Pred. Default       161         18,868     27          12,239     145         18,851     89          18,563
Shrinkage II
Multiclass Problems
Shrinkage II
Multiclass Problems
Shrinkage II
Multiclass Problems
Shrinkage II
Multiclass Problems
Hands-on Exercise
References
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004), Least Angle Regression, The Annals
of statistics, 32(2), 407-499.
Hoerl, A. E., & Kennard, R. W. (1970), Ridge Regression: Biased Estimation for Nonorthogonal
Problems, Technometrics, 12(1), 55-67.
Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal
Statistical Society, Series B (Methodological), 58(1), 267-288.
Zou, H., & Hastie, T. (2005), Regularization and Variable Selection via the Elastic Net, Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Day 3
Decision Trees
Agenda
▪ Introduction
▪ Bootstrapping
▪ Break
Decision Trees (CART)
Decision Trees (CART)
Are They Really a New Method?
▪ But now they are more widely known and used (why?)
Decision Trees (CART)
What Are Decision Trees?
Decision Trees (CART)
What Are Decision Trees?
▪ E.g. transportation mode (car, bike, subway), default (y/n), buy (y/n)
▪ Regression Trees
Decision Trees (CART)
Example: Will a Borrower Default?
Question formats
▪ x ≥ a (or x < a)
▪ x = a
▪ x ∈ A, where A is a partition of the values x takes in the training data
Terminal nodes (or leaves)
Decision Trees (CART)
Example: Will a Borrower Default?
Decision node
Decision Trees (CART)
How to Build a Decision Tree?
Decision Trees (CART)
1. Which Split Criterion to Choose?
▪ Impurity using the Gini Index: CART and its implementation in R, RPART
▪ i(t) = 1 − Σ_{j=1}^k p(j|t)², where p(j|t) is the relative frequency of category j in node t and k is the number of categories in the sample
▪ Maximum value: observations are equally distributed across all categories (maximal dispersion)
▪ Goal: generate child nodes that are "purer" than their parents
▪ The split finds the independent variable and its cut-off value that maximize the decrease in node impurity
▪ Δi(s, t) = i(t) − Σ_{n=1}^N p(t_n) i(t_n) for multiway splits (N child nodes)
▪ Δi(s, t) = i(t) − p(t_l) i(t_l) − p(t_r) i(t_r) for binary splits (t_l left node, t_r right node)
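The Gini impurity and the binary-split decrease can be checked with a few lines; the class counts below are illustrative, not from the loan data.

```python
def gini(counts):
    """Gini impurity i(t) = 1 - sum_j p(j|t)^2 from class counts in a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def decrease_binary(parent, left, right):
    """Binary-split decrease: i(t) - p_l * i(t_l) - p_r * i(t_r)."""
    n = sum(parent)
    return gini(parent) - sum(left) / n * gini(left) - sum(right) / n * gini(right)

# A perfectly separating split on a balanced two-class node:
# parent impurity 0.5 drops to 0 in both children, so the decrease is 0.5.
dec = decrease_binary(parent=[50, 50], left=[50, 0], right=[0, 50])
```

Note the maximum impurity behavior from the slide: equal counts across k categories give i(t) = 1 − 1/k, e.g. 0.5 for two classes and 0.75 for four.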
Decision Trees (CART)
2. On Which Value to Split - Example Loan Default Prediction
▪ Loan default prediction with RPART: Decrease in node impurity from the first split
1 The “improve” reported under summary() in R shows the decrease in node impurity multiplied by the number of obs. in the parent node
Decision Trees (CART)
3. How Deep Should the Tree be?
▪ Extreme case: each observation has its own leaf node → overfitting!
▪ Predictive power is high on training data, but poor on new data – the model is not generalizable
Decision Trees (CART)
3. How Deep Should the Tree be?
▪ Tree characteristics (e.g., number of leaf nodes, number of splits, minimum number of observations per node)
Decision Trees (CART)
3. How Deep Should the Tree be?
Post-pruning: grow the tree to its maximum and then trim the nodes in a bottom-up fashion
▪ Merge leaf nodes while considering the prediction error (it shouldn't increase much)
▪ E.g., it might be worse to wrongly predict a borrower who ends up defaulting than to wrongly flag a borrower who does not default
Decision Trees (CART)
4. What Value to Predict at Each Leaf Node?
Source: twitter.com/freakonometrics
Decision Trees (CART)
4. What Value to Predict at Each Leaf Node - Example Loan Default Prediction
Decision Trees (CART)
Handling Missing Values
▪ Surrogate variables
▪ Tree grows with the selected splits (primary splits) using observations with no missing
values on the split variables
▪ Weakness: CART is biased toward selecting variables with many missing values for a
primary split (Kim & Loh, 2001)
▪ When a split variable is missing for an observation:
▪ Surrogate split: use a surrogate variable instead of the primary split variable
▪ A surrogate variable is another independent variable and cut-off value whose split most
resembles the primary split
▪ Ideally, they should send exactly the same observations to each child node
Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction
333,301 observations (100%)
68,773 (20.6%) default
264,528 (79.4%) non-default
Now: 100,000 missings in total_pymnt
Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction
Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction
▪ Agreement: proportion of observations sent to the "correct" child node when the surrogate is used (instead of the primary split)
  Agree = correct surrogate / # obs. in parent node
▪ Adjusted agreement: deducts the "go with the majority" baseline from the surrogate agreement
  Adj = (correct surrogate − correct "go with majority") / (# obs. in parent node − correct "go with majority")
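Both agreement measures are simple ratios of counts. A sketch with invented counts (not from the loan-default example):

```python
def agreement(correct_surrogate, n_parent):
    """Share of observations the surrogate sends the same way as the primary split."""
    return correct_surrogate / n_parent

def adjusted_agreement(correct_surrogate, correct_majority, n_parent):
    """Surrogate agreement net of the 'go with the majority' baseline."""
    return (correct_surrogate - correct_majority) / (n_parent - correct_majority)

# Hypothetical node: 100 obs., surrogate agrees on 90,
# while sending everyone with the majority would agree on 60.
agree = agreement(90, 100)
adj = adjusted_agreement(90, 60, 100)
```

A surrogate that does no better than the majority rule gets an adjusted agreement of zero, which is why RPART discards such surrogates.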
Decision Trees (CART)
Discussion
▪ Impurity Criterion
▪ Gini Index is biased toward selecting split variables with many missing values
▪ Required Assumptions
▪ Observations are independent
▪ Joint distribution of X and Y in the training data is the same as in the test data
▪ Overfitting
▪ Single trees have high variance: unstable predictions
▪ Small perturbation in the data → large changes in leaf nodes
▪ Solution? Ensemble learning (more on that after the break!)
Detour: Bias-Variance Tradeoff
Bias-Variance Tradeoff
Expected Prediction Error
▪ Expected prediction error for a new observation with 𝑿 = 𝒙𝟎 ?
Bias-Variance Tradeoff
Examples
Source: towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
Bias-Variance Tradeoff
Examples
Source: https://www.kdnuggets.com/2016/08/bias-variance-tradeoff-overview.html
Bias-Variance Tradeoff
Changes in Error with Model Complexity
Source: scott.fortmann-roe.com/docs/BiasVariance.html
Bias-Variance Tradeoff
What Are Your Options?
▪ Test and validate your model on new observations (set part of the data aside for that)
▪ Most influential development in Machine Learning in the past decade (Seni & Elder,
2010)
Introduction to Ensemble Methods
Introduction to Ensemble Methods
Overview
▪ Bagging (Breiman, 1996)
▪ Idea: combine a large number of "weak" models to produce a stronger, more stable model
▪ Use the average of predictions (numerical dep. var.) or majority voting (categorical dep. var.)
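Bagging's two ingredients (bootstrap replicates and majority voting) can be sketched with the standard library. Here each "weak model" is deliberately trivial (it just votes the majority class of its bootstrap replicate), and the labels are invented; with real trees, only the model-fitting line would change.

```python
import random
from collections import Counter

def bootstrap_replicate(data, rng):
    """Sample len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_majority(labels, n_models=25, seed=0):
    """One trivial 'weak model' per bootstrap replicate (majority class
    of the replicate), aggregated by majority vote over the models."""
    rng = random.Random(seed)
    votes = [Counter(bootstrap_replicate(labels, rng)).most_common(1)[0][0]
             for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]

labels = ["default"] * 30 + ["no_default"] * 70
prediction = bagged_majority(labels)
```

Averaging over many bootstrap replicates is what smooths out the high variance of a single tree noted on the previous slides.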
Introduction to Ensemble Methods
Overview
▪ Robust to outliers
▪ But…
Introduction to Ensemble Methods
Overview
Introduction to Ensemble Methods
Bagging Weak Models
Original Training Dataset → Bootstrap Replicates
Introduction to Ensemble Methods
Bagging Weak Models
▪ Often used for decision trees, but you can use it with any weak supervised model
…
Introduction to Ensemble Methods
Example: Loan Default Prediction – Confusion Matrix Comparison
                    Logistic Regression      Decision Tree         Bagged Trees
Validation Data     No Default  Default    No Default  Default    No Default  Default
Pred. No Default    84,915      2,151      84,872      13,788     84,869      13,785
Pred. Default       161         18,868     204         7,231      207         7,234
Hands-on Exercise
References
Alfaro, E., Gamez, M. & Garcia, N. (2013), adabag: An R Package for Classification with
Boosting and Bagging, Journal of Statistical Software, 54 (2), 1-35.
Alfaro, E., Garcia, N., Gamez, M. & Elizondo, D. (2008), Bankruptcy forecasting: An empirical
comparison of AdaBoost and neural networks, Decision Support Systems, 45, 110-122.
Breiman, L., Friedman, J. H., Stone, C. J. & Olshen, R. A. (1984), Classification and Regression
Trees. Belmont, CA, Wadsworth.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.
Hothorn, T., Hornik, K. & Zeileis, A. (2006), Unbiased Recursive Partitioning: A Conditional
Inference Framework, Journal of Computational and Graphical Statistics, 15(3), 651-674.
References
Loh, W.-Y. (2014), Fifty Years of Classification and Regression Trees, International Statistical
Review, 82(3), 329-348.
Loh, W.-Y. & Shih, Y.-S. (1997), Split Selection Methods for Classification Trees, Statistica
Sinica, 7(4), 815-840.
Quinlan, J. R. (1986), Induction of Decision Trees, Machine Learning, 1(1), 81-106.
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA, Morgan
Kaufmann.
Day 4
Ensemble Methods
Agenda
▪ Random Forests
▪ Hands-on Exercise
▪ Break
▪ Gradient Boosting
▪ Hands-on Exercise
Recap
Bias-Variance Tradeoff
Changes in Error with Model Complexity
Source: scott.fortmann-roe.com/docs/BiasVariance.html
Ensemble Methods
Combining Several Weak Models to Generate a Stronger Model
Source: Lantz, Brett (2015), "Machine learning with R", Birmingham: Packt Publishing, Chapter 11.
Ensemble Methods
Bagging Weak Models
▪ Often used for decision trees, but you can use it with any weak supervised model
…
Random Forests
Random Forests
Motivation
▪ But…
▪ Idea: why not also introduce randomness in the selection of the candidate variables for a
split?
Random Forests
Overview
▪ Only give the algorithm a random set of 𝒎 independent variables to choose from (out of
the total 𝑴 independent variables in the dataset)
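The key mechanic can be sketched in a few lines (illustrative Python; a real implementation repeats this draw at every split of every tree):

```python
import random

def candidate_variables(M, m, rng):
    # At each split the tree may only choose among m randomly drawn
    # variables out of the M available ones.
    return rng.sample(range(M), m)

rng = random.Random(42)
for split in range(3):
    # a fresh draw of candidate variables for every split
    print(candidate_variables(10, 3, rng))
```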
Random Forests
Overview
Random Forests
How to Choose m?
▪ How to choose 𝑚?
▪ Common heuristics: m = √M for classification, m = M/3 for regression
▪ Use cross-validation to find the “best” value of 𝑚 for the data / problem at hand
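The search over m can be sketched as follows (illustrative Python; cv_score stands in for a full cross-validation routine, which a real analysis would have to supply):

```python
def choose_m(M, cv_score):
    # cv_score(m) stands in for a cross-validation routine returning the
    # mean validation accuracy when each split may pick among m variables.
    return max(range(1, M + 1), key=cv_score)

# Toy illustration where the (made-up) score peaks at m = 3:
print(choose_m(10, lambda m: -(m - 3) ** 2))   # → 3
```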
Random Forests
Out-of-Bag Observations
Random Forests
Parameters
Random Forests
Example: Loan Default Prediction – Confusion Matrix Comparison
All models evaluated on the validation data (rows: predictions, columns: actual outcome):

                 Logistic Regression    Decision Tree         Bagged Trees          Random Forest
Predictions      No Default   Default   No Default   Default  No Default   Default  No Default   Default
No Default           84,915     2,151       84,872    13,788      84,869    13,785      85,069     2,396
Default                 161    18,868          204     7,231         207     7,234           7    18,623
Random Forests
Discussion
▪ Bias may slightly increase (remember the potentially biased splits in each tree)
Random Forests
Discussion
▪ Find optimal parameters for your model / data (e.g. tree depth, number of trees, number of
observations in leaf nodes)
Hands-on Exercise
Gradient Boosting
Gradient Boosting
First Idea of Boosting
▪ Main idea
▪ Weights or residuals reflect how “difficult” it is to predict the outcome for a specific
observation
▪ Often applied in the context of decision trees, but applicable to any “weak” model
▪ Trees are usually smaller than in Random Forests (often “stumps”: only one split)
Gradient Boosting
Boosting for Classification
▪ Models that accurately predict many observations change weights more significantly and
have a higher influence in the aggregated prediction
Gradient Boosting
Boosting for Classification
Gradient Boosting
Other Views of Boosting
▪ Breiman (1999)
▪ Boosting ~ gradient descent with a special loss function
▪ Friedman (2000)
▪ Generalize: from Adaboost to Gradient Boosting
▪ Handle many other loss functions
▪ Very important development: link “obscure” computational learning to standard
statistics (likelihood) and function optimization
Gradient Boosting
Slightly Change the Target Function
▪ Instead of training h(X) on the residuals of F(X), y - F(X)
▪ Train h(X) on the gradient of the loss function: L(y, F(X)) = (y - F(X))² / 2
▪ Yields an optimization problem: ∂J/∂F(X_i) = F(X_i) - y_i
▪ Equivalently: y_i - F(X_i) = -∂J/∂F(X_i)
Gradient Boosting
Slightly Change the Target Function
▪ y_i - F(X_i) = -∂J/∂F(X_i). Does this look familiar?
▪ Hint: Gradient Descent: θ_i := θ_i - ρ ∂J/∂θ_i
Source: https://www.oreilly.com/library/view/learn-arcore-/9781788830409/e24a657a-a5c6-4ff2-b9ea-9418a7a5d24c.xhtml
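The hinted-at gradient descent update, as a runnable sketch on a toy objective (plain Python; the objective is made up for illustration):

```python
def gradient_descent(grad, theta, rho, steps):
    # theta := theta - rho * dJ/dtheta, repeated `steps` times
    for _ in range(steps):
        theta = theta - rho * grad(theta)
    return theta

# Minimize J(theta) = (theta - 4)^2, whose gradient is 2 * (theta - 4):
print(gradient_descent(lambda t: 2 * (t - 4), theta=0.0, rho=0.1, steps=100))  # ≈ 4.0
```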
Gradient Boosting
Generalization Using Gradient Descent
▪ Generalize the method to other loss functions (e.g. log loss, absolute loss)
▪ Compute the negative gradient (i.e. the residual): -g(X) = -∂L(y, F(X))/∂F(X) = y - F(X)
▪ Update F(X): F(X) := F(X) + ρ h(X)
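The loop (fit h(X) to the negative gradient, then update F(X)) can be sketched in plain Python. For brevity the weak learner h is the weakest one possible, a constant; a real implementation would fit a small tree here:

```python
def gradient_boost(y, rho, rounds):
    # Start from F(X) = 0; in every round fit a weak model h to the negative
    # gradient of the squared loss, which is simply the residual y - F(X).
    F = [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - Fi for yi, Fi in zip(y, F)]   # -g(X) = y - F(X)
        h = sum(residuals) / len(residuals)             # constant weak learner
        F = [Fi + rho * h for Fi in F]                  # F(X) := F(X) + rho * h(X)
    return F

# With a constant learner the predictions converge to the mean of y (here 2.0):
print(gradient_boost([1.0, 2.0, 3.0], rho=0.5, rounds=50))
```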
Gradient Boosting
Generalization Using Gradient Descent
▪ E.g., Gradient Boosting with stumps (only one split) is a first-order (linear)
approximation
▪ Introduce a parameter to slow down the incorporation of new results into the aggregate
model
▪ F(X) := F(X) + v · ρ h(X)
Gradient Boosting
Generalization Using Gradient Descent
▪ Gradient Boosting + randomness in the train data seen by each model (as in Bagging)
Gradient Boosting
Parameters
▪ Fraction of observations in training data randomly selected to grow the next tree
(bag.fraction)
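What bag.fraction does can be sketched as follows (illustrative Python; next_tree_sample is a made-up name, not a gbm function):

```python
import random

def next_tree_sample(n, bag_fraction, rng):
    # Indices of the training observations used to grow the next tree;
    # bag_fraction mirrors the gbm parameter of the same name.
    return rng.sample(range(n), int(bag_fraction * n))

print(next_tree_sample(10, 0.5, random.Random(1)))   # 5 of 10 indices, without replacement
```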
Gradient Boosting
Example: Loan Default Prediction – Confusion Matrix Comparison
All models evaluated on the validation data (rows: predictions, columns: actual outcome):

                 Logistic Regression    Decision Tree         Random Forest         Gradient Boosting
Predictions      No Default   Default   No Default   Default  No Default   Default  No Default   Default
No Default           84,915     2,151       84,872    13,788      85,069     2,396      84,980     1,370
Default                 161    18,868          204     7,231           7    18,623          96    19,649
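From these matrices the overall accuracy follows directly; a plain-Python check using the logistic regression and gradient boosting counts from the table:

```python
def accuracy(cm):
    # cm[row][col]: rows are the predictions (No Default, Default),
    # columns the actual outcomes in the validation data.
    return (cm[0][0] + cm[1][1]) / (sum(cm[0]) + sum(cm[1]))

logit = [[84915, 2151], [161, 18868]]     # Logistic Regression
gboost = [[84980, 1370], [96, 19649]]     # Gradient Boosting
print(round(accuracy(logit), 4))    # → 0.9782
print(round(accuracy(gboost), 4))   # → 0.9862
```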
Gradient Boosting
Discussion
▪ Any (!) loss function for which you can compute the gradient
Gradient Boosting
Discussion
▪ But:
▪ Remedies:
▪ Tuning
Hands-on Exercise
References
Breiman, L. (1999), Prediction Games and Arcing Algorithms, Neural Computation, 11(7), 1493-1517.
Breiman, L. (2001), Random Forests, Machine Learning, 45(1), 5-32.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.
Friedman, J., Hastie, T. & Tibshirani, R. (2000), Special Invited Paper. Additive Logistic Regression: A
Statistical View of Boosting, Annals of Statistics, 28 (2), 337-374.
Friedman, J.H. (2001), Greedy Function Approximation: A Gradient Boosting Machine, Annals of
Statistics, 29 (5), 1189-1232.
Friedman, J.H. (2002), Stochastic Gradient Boosting, Computational Statistics & Data Analysis, 38 (4),
367-378.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data Mining,
Inference, and Prediction, Springer.
Ridgeway, G. (1999), The State of Boosting, Computing Science and Statistics, 31, 172-181.
Day 5
Wrap Up
Agenda
▪ Break
▪ Variable Importance
▪ Wrap Up
▪ Q&A
Machine Learning in Practice
Wrap Up & Q&A
Where Are We Headed?
Current Challenges and Opportunities
Where Are We Headed?
Explainable Machine Learning
Will Machine Learning Solve All Our Problems?
Structure and Organization of the Course
▪ Day 1: October, 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II
▪ Day 2: October, 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II
▪ Day 3: October, 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II
Structure and Organization of the Course
▪ Day 4: October, 7
▪ Random Forest
▪ Break
▪ Gradient Boosting
▪ Day 5: October, 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A
Main Idea: Toolbox
Process
▪ “Feature” development
▪ Method Selection
▪ Evaluation
▪ Deployment
Steps
Data Preparation
Features
Build Model
Validate Model
Deploy Model
Examples
2020 Banca d'Italia and Federal Reserve Board Joint Conference on Nontraditional Data & Statistical
Learning with Applications to Macroeconomics – Oct. 2021
• Forecasting UK inflation bottom up, A. Joseph, Eleni Kalamara, G. Potjagailo, and G. Kapetanios
• The Macroeconomy as a Random Forest, P. Goulet Coulombe
• Teaching Machines to Measure Economic Activities from Satellite Images: Challenges and Solution, D. Ahn,
M. Cha, S. Han, J. Kim, S. Sang Lee, S. Park, S. Park, H. Yang, and J. Yang
• The Knowledge Graph for Macroeconomic Analysis with Alternative Big Data, Y. Yang, Y. Pang, G. Huang,
and Weinan E
• Machine Learning for Zombie Hunting. Firms' Failures and Financial Constraints, F. J. Bargagli Stoffi,
M. Riccaboni, and A. Rungi
• Deciphering the Fed Communication via Text-Analysis of Alternative FOMC Statements, T. Doh, D. Song,
and S.-K. Yang
• Measuring central banks' sentiment and its spillover effects with a network approach, G. Tizzanini, P.
Lorenzini, M. Priola, L. Zicchino
• Application of text mining to the analysis of climate-related disclosures, Á. I. Moreno, and T. Caminero
BIG THANK YOU!!!!!
- To all participants for the lively discussion, deep comments, and/or just for staying with us
- Of course to Serafín Martínez Jaramillo, Eréndira Fuentes Hernández, and all colleagues unknown to us
- And we do not forget: the translators and the IT staff
References
Athey, S. (In Press), The Impact of Machine Learning on Economics, In A.K. Agrawal, J. Gans &
A. Goldfarb (Eds.),The Economics of Artificial Intelligence: An Agenda, University of Chicago
Press.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.
Backup Day 1
Model Validation and Evaluation Measures
Comparing Competing Models
Model Validation and Evaluation Measures
Comparing Competing Models
▪ Examples:
▪ Test for the difference of two proportions (each model estimated one time)
▪ Alternatives:
▪ H0: Both models have the same performance: Prob(CI) = Prob(IC)

                          Model 1
                    Correct   Incorrect
Model 2  Correct       CC        CI
         Incorrect     IC        II

Total = CC + CI + IC + II
▪ Requires CI + IC ≥ 25 (in R: mcnemar.test)
▪ If CI + IC < 25, use the binomial test instead (in R: binom.test)
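A minimal Python sketch of McNemar's test statistic (the course uses R's mcnemar.test; the counts below are made up):

```python
def mcnemar_statistic(ci, ic):
    # Chi-squared statistic (1 df) with continuity correction; only the
    # discordant cells CI and IC enter it. Valid when CI + IC >= 25.
    return (abs(ci - ic) - 1) ** 2 / (ci + ic)

# Made-up counts: Model 1 correct where Model 2 fails 40 times, the reverse 20 times:
print(mcnemar_statistic(40, 20))   # 361/60 ≈ 6.02, above the 5% critical value 3.84
```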
Model Validation and Evaluation Measures
Comparing Competing Models
▪ Split the dataset 5 times into train and test sets (50% of observations each)
▪ In each of the 5 iterations:
▪ Step A: Estimate Model 1 and Model 2 on the train set and evaluate performance on the
test set
▪ ACC_{A,i} = ACC_{A,i,Model 1} - ACC_{A,i,Model 2} for each iteration i
▪ Step B: Swap between train and test (estimate on the test set and evaluate on the train set)
▪ ACC_{B,i} = ACC_{B,i,Model 1} - ACC_{B,i,Model 2} for each iteration i
▪ ACC_{avg,i} = (ACC_{A,i} + ACC_{B,i}) / 2,  s_i² = (ACC_{A,i} - ACC_{avg,i})² + (ACC_{B,i} - ACC_{avg,i})²
▪ Test statistic: t = ACC_{A,1} / √((1/5) Σ_{i=1}^{5} s_i²), where ACC_{A,1} is ACC_A from the first iteration
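The statistic can be computed directly from the ten accuracy differences (plain Python; the input numbers are made up):

```python
def five_by_two_t(acc_a, acc_b):
    # acc_a[i] / acc_b[i]: accuracy difference Model 1 - Model 2 from
    # step A / step B of iteration i (five iterations in total).
    s2 = [(a - (a + b) / 2) ** 2 + (b - (a + b) / 2) ** 2
          for a, b in zip(acc_a, acc_b)]
    return acc_a[0] / (sum(s2) / 5) ** 0.5

# Made-up accuracy differences of roughly two percentage points:
print(five_by_two_t([0.02, 0.03, 0.02, 0.01, 0.02],
                    [0.01, 0.02, 0.03, 0.02, 0.02]))   # ≈ 3.16
```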
Backup Day 3
Decision Trees (CART)
Split Criterion: Impurity vs. Entropy
▪ Maximum value: observations are equally distributed across all categories (maximal
dispersion)

Entropy(t) = - Σ_{j=1}^{c} p(j|t) log₂ p(j|t)

¹ C5.0 uses a normalized version of Information Gain, the “Gain-Ratio”
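A minimal Python sketch of the entropy formula:

```python
import math

def entropy(proportions):
    # proportions: share of each category j in node t; empty categories
    # contribute nothing (0 * log2(0) is taken as 0).
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # → 1.0: equally distributed, maximal dispersion
print(entropy([0.9, 0.1]))   # ≈ 0.47: a much purer node
```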
Decision Trees (CART)
Split Criterion: Impurity vs. Entropy
1. Splitting Criterion: Sum of Squares (Regression)
▪ Minimum value: all observations in the node have the same value of the dependent variable
(homogeneous node)

SS(t) = Σ_{i=1}^{n_t} (y_i - ȳ)²   (variance of y within node t)
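A minimal Python sketch of SS(t):

```python
def sum_of_squares(y):
    # SS(t) for the observations falling into node t.
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)

print(sum_of_squares([3.0, 3.0, 3.0]))   # → 0.0: homogeneous node, minimum value
print(sum_of_squares([1.0, 2.0, 3.0]))   # → 2.0
```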
Decision Trees
Handling Missing Values in C5.0
▪ Proportional weighting
▪ Tree grows with all observations – lower information gain from variables with many
missing values
▪ An observation with a missing value in the split variable is sent down both child nodes
▪ In each child node: weight this observation by the proportion of observations with no
missing values in the split variable
▪ Aggregate predictions of all leaf nodes the observation reached, using the weights
gained in each followed path
▪ Weighted average for numeric variables, or category with the highest probability, for
categorical variables
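A sketch of the proportional-weighting idea for a numeric dependent variable (illustrative Python; the weights and leaf predictions are made-up values):

```python
def fractional_prediction(w_left, pred_left, pred_right):
    # An observation with a missing split variable travels down BOTH child
    # nodes; the leaf predictions are combined with the weights picked up
    # along each path (numeric dependent variable shown).
    return w_left * pred_left + (1.0 - w_left) * pred_right

# Made-up example: 75% of the non-missing observations went to the left child:
print(fractional_prediction(0.75, 10.0, 20.0))   # → 12.5
```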
Backup Day 5
Traditional Econometric Methods vs. Machine Learning
Traditional Econometric Methods:
▪ Fail for n < p and overfit for p < n with large p
▪ Not always feasible with big data
Machine Learning:
▪ Capable of handling both n < p and large p problems (e.g. hundreds of covariates)
▪ Work well with big data
Traditional Econometric Methods vs. Machine Learning
Traditional Econometric Methods:
▪ E.g. how appropriate is a linear, logistic specification?
▪ Why should they work: asymptotic theory
Machine Learning:
▪ Better results for large data or complex problems
▪ Why should they work: open question

Variable Importance
– Potential improvement (here: Gini) calculated for primary and surrogate splits
– Keep in mind: importance measures are often biased due to bias in splits!
Variable Importance (RPART Example)
Primary splits:
total_pymnt < 5317.716 to the right, improve=8636.406, (1e+05 missing)
installment < 167.24 to the left, improve=13731.080, (0 missing)
Surrogate splits:
loan_amnt < 4712.5 to the right, agree=0.917, adj=0.515, (1e+05 split)
Variable Importance (Random Forest Example)