
Course on Machine Learning and Central Banking

CEMLA and Deutsche Bundesbank – October 2021


Prof. Gabriela Alves Werb, Prof. Stefan Bender, Sebastian Seltmann
Agenda

▪ Introduction (Prof. Gabriela Alves Werb and Prof. Stefan Bender)

▪ Structure and Organization of the Course

▪ Machine Learning and Central Banking (Prof. Stefan Bender)

▪ Train, Test and Validation Samples (Prof. Gabriela Alves Werb)

▪ Break

▪ Model Validation Measures

▪ Hands-on Exercise

Structure and Organization of the Course

▪ Day 1: October 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II
▪ Day 2: October 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II
▪ Day 3: October 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II

Structure and Organization of the Course

▪ Day 4: October 7
▪ Random Forest
▪ Break
▪ Gradient Boosting
▪ Day 5: October 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A

▪ Each module is followed by a hands-on exercise.


▪ Let's keep it interactive: please ask questions at any time!

Main Idea: Toolbox

Machine Learning and Central Banking
Machine learning and record linkage in the data life cycle of the Bundesbank
Stefan Bender, Research Data and Service Center (RDSC), Deutsche Bundesbank
Overview

▪ Machine Learning: Introduction


▪ Improving Data Quality with Machine Learning by Tobias Cagala
▪ Record Linkage: Introduction
▪ Linking Deutsche Bundesbank Company Data by Christopher-Johannes Schild
▪ Conclusion

Machine Learning: Introduction
Machine Learning (Rayid Ghani)

Machine Learning

▪ "Field of study that gives computers the ability to learn without being explicitly programmed." (Samuel 1959, p. 2010).

▪ "An agent is learning if it improves its performance on future tasks after making observations about the world." (Russell 2016, p. 693).

▪ "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (Mitchell 1997, p. 2).

Required skills of a computer program to be "intelligent":

▪ Perceive → the environment (dataset)
▪ Learn → a hypothesis from the perceived data
▪ Apply → the hypothesis to related data
▪ Decide → which is the best-performing hypothesis
THX to Frank Raulf

Why Machine Learning?

▪ Goal: adaptive, scalable systems that are cost-effective to build and maintain

▪ Rules-based systems are rigid and expensive

▪ Lots of data is available to “train” the system

Process

▪ Understand “business” problem

▪ Map to machine learning problem

▪ Understand the data

▪ Explore and prepare the data

▪ “Feature” development

▪ Method Selection

▪ Evaluation

▪ Deployment

Types of Machine Learning I

▪ Unsupervised: Clustering, PCA, …
▪ "Weakly" supervised: Anomaly Detection
▪ Fully supervised: Classification, Regression

Types of Machine Learning II

▪ Supervised Learning:
Given are pairs of values $(y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)$, from which a machine can learn $\hat{y} = h(\mathbf{x})$.
Required: a training dataset and a test dataset!
Either classification or regression.
(Example: Linear Regression)

▪ Unsupervised Learning:
Given is a set of values $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with no assigned outcomes. Hence, the algorithm searches for patterns within the data to generate artificial ys.
(Example: Clustering)

▪ Reinforcement Learning:
An algorithm receives a reward if it behaves correctly and/or a punishment if it makes a mistake.
(Example: Ant algorithm)
THX to Frank Raulf
Clustering

▪ A good clustering method will produce clusters with

▪ High intra-cluster similarity

▪ Low inter-cluster similarity

▪ K-Means is the simplest and the most common algorithm

K-means algorithm

▪ Given k, the k-means algorithm works as follows:

1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
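A minimal sketch of these steps in R, using the built-in kmeans() function on simulated data (the three-cluster data and k = 3 are assumptions for illustration):

  # Simulate three clusters of two-dimensional points
  set.seed(42)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

  # kmeans() iterates steps 2)-4) until the centroids converge;
  # nstart = 20 repeats the random initialization of step 1) 20 times
  fit <- kmeans(x, centers = 3, nstart = 20)

  fit$centers        # final cluster centers
  table(fit$cluster) # cluster sizes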

K-means example

K-means example, start

[Figure: scatter plot of the data points with initial centroids k1, k2, k3]
K-means example, initial step

[Figure: the cluster centers k1, k2, k3 move to the means of their current clusters]

Supervised learning framework

y = f(x): the learned function f maps the features x to the output y

▪ Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f that minimizes future generalization (out-of-sample) error

▪ Testing: apply f to a new test example x and output the predicted value y = f(x)
Slide credit: L. Lazebnik

Classification Task
Supervised Learning

The task is to classify whether or not there was a change in perimeter.

What is required?

1) A concept.
2) A correct dataset, to be divided into:
   a) a training dataset and
   b) a test dataset.
3) An efficient programming language (such as R, Python, Matlab, …).
4) Time.

However, there is always the possibility that the applied methods do not lead to a solution.

THX to Frank Raulf


Steps
Data Preparation → Features → Build Model → Validate Model → Deploy Model

Modeling & Validation

Training (Building): Training Data → Features (+ Training Labels) → Learned Model
Testing (Validating): Test Data → Features → Learned Model → Prediction

Factors to consider

▪ Complexity

▪ Overfitting

▪ Robustness

▪ Interpretability

▪ Training Time

▪ Test Time

Classification Challenges (Christen 2015)

▪ In many cases there are no training data available


▪ Possible to use results of earlier matching projects?
▪ Or from manual clerical review process?
▪ How confident can we be about correct manual classification of potential matches?

▪ Often there is no gold standard available (no data sets with true
known match status)

▪ No large test data set collection available (like in information retrieval or machine learning)

Improving Data Quality with Machine Learning
Tobias Cagala, Deutsche Bundesbank
Background

Use out-of-sample predictions with machine learning algorithms for…

1) Data Quality Management (DQM): identify and correct measurement errors

2) Closing data gaps: impute missing values

THX to Tobias Cagala

Application to DQM

Securities Holdings Statistics


▪ German banks provide monthly reports of securities holdings (security-by-security)

▪ DQM with labor-intensive manual case-by-case evaluations

THX to Tobias Cagala

Data

Plausibility check

Securities reported after the maturity date

Two data sources

I. Outcome of evaluation by compiler: Acceptance or Flag


II. Features of the security

➢ Linkage of securities data with data on compiler decisions

THX to Tobias Cagala

Data

Dataset

▪ Reported securities: 4,495
▪ Accepted: 4,110
▪ Flagged: 385

THX to Tobias Cagala

Descriptive Analysis of Patterns
Number of Reporting Banks, Days since Maturity

THX to Tobias Cagala


Taking Advantage of the Probability

THX to Tobias Cagala
Taking Advantage of the Probability

Advantages
▪ All of the 28 (out of 827) securities in the Top 35
▪ Improvement in efficiency and effectiveness

Experience
▪ 50% reduction in time for checks and increased effectiveness of evaluations

THX to Tobias Cagala


Conclusion: Application to DQM

Application to DQM

▪ Application to DQM is feasible and straightforward


▪ Large potential for improvements of efficiency

▪ Inclusion in production process can be a challenge

THX to Tobias Cagala


Other Machine Learning Applications at the Bundesbank

▪ Seasonal Adjustment
▪ "Is This Time Series Seasonal? – How Random Forests Can Improve Seasonality Tests" by Daniel Ollech, Karsten Webel / DG Statistics

▪ Identification of Holdings
▪ by Frank Raulf

▪ Record Linkage
▪ (we will see in a few minutes)

General Insights

▪ Data-driven solutions can improve efficiency drastically

▪ An effective application of ML methods requires the availability of training data

▪ Machine learning algorithms are no silver bullet:
▪ Performance gains are contingent on datasets with complex data structure
▪ Many algorithms are a black box
THX to Tobias Cagala


Record Linkage: Introduction
Motivation for Record Linkage (Christen 2015)

▪ Large amounts of data are being collected (big data).

▪ Analyzing such data can provide huge benefits.

▪ Data are from different sources (need for record linkage).

▪ Lack of unique entity identifiers: linking based on personal information.

▪ The linking of databases is challenged by data quality, database size, privacy and
confidentiality concerns.

Definition of Record Linkage and Major Challenges

▪ RL is finding records in different data sets that represent the same entity and linking them.

▪ RL is also known as data matching, entity resolution, object identification, duplicate detection, identity uncertainty, merge-purge.

▪ The major challenge is that (clean) unique entity identifiers are not available in the databases to be linked.

The basic record linkage process

There is no perfect world

▪ In a perfect world, there would only be:
True Positives (TP) and True Negatives (TN)

▪ But we do not live in a perfect world:
True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN)

Record Linkage Challenges (Christen 2012)

▪ No unique entity identifiers available


▪ Real-world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
▪ Scalability
▪ Naïve comparison of all record pairs is quadratic
▪ Remove likely no-matches as efficiently as possible
▪ No training data in many linkage applications
▪ No record pairs with known true match status
▪ Privacy and confidentiality
▪ (because personal information, like names and addresses, are commonly required for
linking)

Record Linkage Technique (Christen 2015)

▪ Deterministic matching
▪ Rule-based matching (complex to build and maintain)
▪ Probabilistic record linkage (Fellegi and Sunter, 1969)
▪ Use available attributes for linking (often personal information, like names, addresses,
dates of birth, etc.)
▪ Calculate match weights for attributes
▪ “Computer science” approaches
▪ Based on machine learning, data mining, database, or information retrieval techniques

The extended record linkage process

Christen (2012)

Importance of Preprocessing

▪ "In situations of reasonably high-quality data, preprocessing can yield a greater improvement in matching efficiency than string comparators and 'optimized parameters'. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing." (Winkler 2009, p. 370)

▪ "Inability or lack of time and resources for cleaning up files in preparation of matching are often the main reasons that matching projects fail." (Winkler 2009, p. 366)

Shares of effort within linkage process

▪ 5% matching and linking efforts


▪ 20% checking that the computer matching is correct
▪ 75% cleaning and parsing the two input files

(Gill 2001, p. 31)

Caveats of Record Linkage

▪ Imperfect matching variables (like typos)
▪ Variables may be coded differently in both data sources
▪ E.g., years of education vs. degrees received
▪ Data may require significant amounts of processing and data cleaning prior to linkage
▪ Not always a 1-to-1 match, but a 1-to-1 matched set can be extracted in a post-processing step
▪ An (admin) record may not exist

Privacy Issues I

▪ Sensitive data (names, address)

▪ Informed consent, data avoidance, purpose limitation of data

▪ Circle of trust

▪ “Formalities” (like contracts, terms and conditions)

▪ The Five Safes

Privacy Issues II: The 5 Safes

▪ Safe people: fit and proper, expertise to do the work.

▪ Safe projects: formal ethical review, public benefit, scientific merit.

▪ Safe environment: restrict data access, data security.

▪ Safe data: Privacy Preserving Record Linkage.

▪ Safe results: results should not directly or indirectly identify any individual or
organisation.

Sources for Deeper Knowledge

Linking Deutsche Bundesbank Company Data
Dr. Christopher-Johannes Schild, FDSZ 1-5
Bundesbank’s relevant microdata sources: Company Data

THX to Christopher-Johannes Schild
1. Input Data

Dataset (-family)        Source                    Size (entities in master data)
AWMUS (MiDi / SITS)      S2                        ~260,000 (MiDi: ~22,000 German entities, 2013)
BAKIS-M (MiMiK)          B                         ~350,000
Jalys / Corep (USTAN)    S3                        ~230,000 (~24,000 in 2013)
KUSYS                    S1                        ~8,500
EGR                      Destatis                  ~60,000
DAFNE                    S3 (from Bureau van Dijk) ~220,000 (~56,000 in 2012)
Hoppenstedt              S3 (from Hoppenstedt)     ~90,000 (~20,000 in 2013)
AMADEUS                  Bureau van Dijk           ~1,600,000
ZENTK, RIAD (Banks)      S1                        ~3,000 (~1,800 in 2016)
RIAD (Non-Banks)         S1                        ~35,000
LEI                      GLEIF                     ~45,000
URS Company Register     Destatis                  ~4,800,000
Kantwert / Trade Reg.    Kantwert GmbH             ~4,300,000

THX to Christopher-Johannes Schild
1. Data linkage

Company data (non-financial institutions, NFI):

There is no common unique firm identifier in Germany.
(The company business register ID is not stable.)

We have to match firm data…
▪ … that do not have a common unique identifier / key
▪ … by using alternative identifiers (such as names, addresses, sectors, legal forms)

The RDSC has matched several NFI microdatasets (from Statistics, Banking Supervision, and external data) with an advanced machine learning algorithm and generated a matching table (with probabilistic matching scores).

Goal:
▪ Improve data quality, increase the analytical value of the data
▪ A more general and flexible record linkage system
▪ Historicized matching tables

THX to Christopher-Johannes Schild

Record Linkage Process
Input Data A + Input Data B
→ Preprocessing
→ Blocking
→ Match Candidates + Predictors (Features)
→ Classification model (built on training data, checked on test data)
→ Automatic Evaluation / Manual Review
→ Postprocessing
THX to Christopher-Johannes Schild
Linkage of Company Data by FDSZ ("Record Linkage")

▪ Duplicate detection with supervised machine learning
▪ Decision tree algorithms (Random Forests, Gradient Boosting Trees)
▪ Training and test data from common IDs
▪ Comparison features:
▪ Firm names, string comparison algorithms
▪ Georeferenced addresses
▪ Economic sector codes
▪ Legal form
▪ Balance sheet information
▪ Data pre- and postprocessing with SAS
▪ Machine learning uses Python ML packages ("scikit-learn")

THX to Christopher-Johannes Schild
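To illustrate the string comparison step, a minimal sketch assuming the R package stringdist (the firm name variants are made up; the actual pipeline described above uses Python/scikit-learn):

  library(stringdist)

  # Jaro-Winkler similarity between two firm name variants (1 = identical)
  1 - stringdist("Mueller Maschinenbau GmbH",
                 "Müller Maschinenbau G.m.b.H.", method = "jw")

  # Similarity scores like this enter the classifier as comparison features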


2. Set up for the Record Linkage (Machine Learning)

Training (Building): Training Data → Features (+ Training Labels) → Learned Model
Testing (Validating): Test Data → Features → Learned Model → Prediction

THX to Christopher-Johannes Schild


4. Classification

Bias vs Overfitting

THX to Christopher-Johannes Schild


4. Classification

[Figure: distributions of predicted match probabilities, with regions for true negatives (TN), true positives (TP), false positives (FP), and false negatives (FN)]

THX to Christopher-Johannes Schild


5. Evaluation

The linkage pipeline: 1) Harmonization, 2) Indexing / Blocking ("coarse filter"), 3) Detailed comparison ("fine filter"), 4) Classification, 5) Evaluation

Confusion matrix of the classifier:
TN = 179,203    FP = 2,553
FN = 8,856      TP = 125,292

▪ Precision = TP / (TP + FP) = 98.0%
▪ Recall / Coverage = TP / (TP + FN) = 93.3%
THX to Christopher-Johannes Schild
5. Evaluation

[Figure: tradeoff between precision and recall / coverage]

THX to Christopher-Johannes Schild

Bundesbank’s relevant microdata sources: Company Data

THX to Christopher-Johannes Schild
Datauniverses (I)

THX to Christopher-Johannes Schild
Datauniverses (II)

THX to Christopher-Johannes Schild
Train, Validation, and Test Subsamples
Splitting the Data

▪ Golden rule in Machine Learning: evaluate the model on data that was not used to train it

▪ Most common: randomly split the dataset into train and test data

▪ E.g., when modeling loan default, some borrowers land in the train data, others in the test data

▪ Usually: 75%-80% train, 25%-20% test

▪ Better approach: randomly split dataset into train, validation, and test data

▪ Usually: 60% train, 20% validation, 20% test

▪ Even better in some settings: also have out-of-time data

▪ Completely separate dataset, e.g., collected a few months later
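A minimal sketch of a random 60/20/20 split in base R; the data frame loans is a placeholder for your dataset:

  set.seed(123)
  n <- nrow(loans)

  # Randomly assign each observation to one subsample (proportions are approximate)
  idx <- sample(c("train", "validation", "test"), size = n,
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

  train_data      <- loans[idx == "train", ]
  validation_data <- loans[idx == "validation", ]
  test_data       <- loans[idx == "test", ]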


Train, Validation, and Test Subsamples
Splitting the Data

▪ Train data

▪ Data used to train the model

▪ The model learns from the underlying patterns and relationships in this subsample

▪ Validation data

▪ After optimizing several models in the train data, see how they perform in the validation
data

▪ Go back to training and iterate until you are happy with the results

▪ The results in this subsample motivate future modeling decisions

▪ Validation data is “contaminated” → No unbiased estimate of the error on truly new data

Train, Validation, and Test Subsamples
Splitting the Data

▪ Test data

▪ Only to be used in the very end

▪ Performance of the model in this subsample provides an unbiased estimate of the error on
new data
Original dataset:
▪ 60% train data → train model
▪ 20% validation data → evaluate model (repeat: go back to training and iterate)
▪ 20% test data → test the final model
Model Validation and Evaluation Measures
Cross-Validation

▪ Idea: use many subsamples of the data to estimate model error

▪ K subsamples = K-Fold cross-validation

▪ Measure error (e.g., error rate for classification or Sum of Squared Residuals for regression)

▪ Average the errors or sum them (e.g., “risk” measure in RPART)


[Figure: the training data split into folds F1, …, F10; each fold serves once as the validation fold, the rest as training folds, yielding errors $Error_1, \ldots, Error_{10}$]

$\widehat{Error} = \sum_{k=1}^{10} Error_k \qquad \text{or} \qquad \widehat{Error} = \sum_{k=1}^{10} \frac{Error_k}{10}$
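A minimal sketch of 10-fold cross-validation in base R, with a logistic regression as the model and the error rate as the error measure (the data frame train_data with a 0/1 column default is a placeholder):

  set.seed(123)
  k <- 10
  folds <- sample(rep(1:k, length.out = nrow(train_data)))  # fold assignment

  errors <- numeric(k)
  for (i in 1:k) {
    fit  <- glm(default ~ ., data = train_data[folds != i, ], family = binomial)
    prob <- predict(fit, newdata = train_data[folds == i, ], type = "response")
    pred <- as.numeric(prob > 0.5)
    errors[i] <- mean(pred != train_data$default[folds == i])  # error rate in fold i
  }

  mean(errors)  # cross-validated estimate of the error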
Model Validation and Evaluation Measures
Cross-Validation

▪ But, beware!

▪ If you preselect the variables that flow into the model, then cross-validation must also be applied at the variable selection step (not only later)!

▪ Otherwise, cross-validation does not accurately estimate the prediction error

Model Validation and Evaluation Measures
Confusion Matrix for Binary Classification

Confusion Matrix
                            Actual Data
                        No Default    Default
Predictions  No Default     TN           FN
             Default        FP           TP

▪ Example: Loan Default Prediction

▪ TN: true negatives / TP: true positives

▪ FN: false negatives / FP: false positives

▪ Based on them, we can compute several other metrics

Model Validation and Evaluation Measures
Accuracy, Recall, Specificity

▪ What is the share of correctly predicted cases?
▪ Accuracy = (TN + TP) / Total, where Total = TP + TN + FP + FN

▪ Which share of the true default cases is correctly predicted?
▪ Sensitivity (or Recall) = TP / (TP + FN)

▪ Which share of the true non-default cases is correctly predicted?
▪ Specificity = TN / (TN + FP)

Model Validation and Evaluation Measures
Precision, F1-Score

▪ What is the share of correct "default" predictions?
▪ Precision = TP / (TP + FP)

▪ Can the model identify true "default" cases without many false alarms?
▪ F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
▪ The F1-Score provides a tradeoff between precision and recall
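All of these metrics follow directly from the four cells of the confusion matrix; a minimal sketch in base R, using the logistic regression counts from the comparison tables later in the course:

  TN <- 84915; FN <- 2151; FP <- 161; TP <- 18868  # validation-data counts

  accuracy    <- (TN + TP) / (TN + TP + FN + FP)
  recall      <- TP / (TP + FN)  # sensitivity
  specificity <- TN / (TN + FP)
  precision   <- TP / (TP + FP)
  f1          <- 2 * precision * recall / (precision + recall)

  round(c(accuracy = accuracy, recall = recall, specificity = specificity,
          precision = precision, f1 = f1), 3)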

Model Validation and Evaluation Measures
ROC Curve
▪ Receiver Operating Characteristic (ROC) Curve
▪ Plots the tradeoff between Sensitivity and Specificity for different probability thresholds
▪ The ideal corner: 100% sensitivity (no false negatives) and 100% specificity (no false positives)
▪ Decrease the threshold: increase TP, but also FP
▪ Increase the threshold: increase TN, but also FN

▪ AUC (Area under the ROC Curve)
▪ Probability that a randomly chosen positive case (e.g., "default") receives from the model a higher score (predicted probability) than a randomly chosen negative case (e.g., "non-default")

Model Validation and Evaluation Measures
Precision-Recall (PR) Curve

▪ Precision-Recall (PR) Curve


▪ Plots the tradeoff between Sensitivity (Recall) and Precision for different probability thresholds

▪ Comparison to ROC curve:


▪ ROC curves are appropriate for balanced datasets
▪ PR curves are also appropriate for imbalanced datasets
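A minimal sketch of both curves in R, assuming the pROC and PRROC packages; prob holds predicted default probabilities and y the true 0/1 outcomes (both placeholders):

  library(pROC)   # ROC curve and AUC
  library(PRROC)  # precision-recall curve

  roc_obj <- roc(response = y, predictor = prob)
  plot(roc_obj)
  auc(roc_obj)  # area under the ROC curve

  pr_obj <- pr.curve(scores.class0 = prob[y == 1],  # scores of the positives
                     scores.class1 = prob[y == 0],  # scores of the negatives
                     curve = TRUE)
  plot(pr_obj)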

Model Validation and Evaluation Measures
Example: Loan Default Prediction

▪ Dependent variable: default (yes / no)


▪ 30 independent variables:
▪ Loan characteristics (e.g., funded amount, term, loan amount)
▪ Borrower characteristics (e.g., employment length, annual income, home ownership)

                 Original Data        Train Data           Validation Data      Test Data
                 # Obs.     % of      # Obs.     % of      # Obs.     % of      # Obs.    % of
Default          100,209    20.3%     68,773     20.6%     10,417     19.8%     10,417    19.1%
No Default       393,818    79.7%     264,528    79.4%     85,076     80.2%     44,214    80.9%
Total            494,027    100%      333,301    100%      106,095    100%      54,631    100%
Share of
Original Data    100%                 67.5%                21.5%                11%

The share of the most frequent class ("No Default") is the "no information rate" in the data.
Model Validation and Evaluation Measures
Example: Loan Default Prediction

▪ No information rate: the accuracy if we simply predict that all observations belong to the most frequent class in the data

▪ Example here: predict "no default" for all observations

▪ This rate is an important benchmark to assess the accuracy of our models

Model Validation and Evaluation Measures
Example: Loan Default Prediction

[Figures: ROC curve and PR curve for the loan default model]

Hands-on Exercises
References

Davis, J. & Goadrich, M. (2006), The Relationship Between Precision-Recall and ROC Curves, in: Proceedings of the 23rd International Conference on Machine Learning, 233-240.

Dietterich, T. G. (1998), Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10(7), 1895-1923.

Raschka, S. (2018), Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning, Working Paper.

Day 2
Shrinkage
Agenda

▪ Introduction

▪ Lasso

▪ Ridge

▪ Hands-on Exercise

▪ Break

▪ Elastic Net

▪ Multiclass Problems

▪ Hands-on Exercise

Shrinkage I
Introduction to Shrinkage

▪ Handle the “curse” of dimensionality

▪ p >> n problems

▪ Too many coefficients → Overparametrization, risk of overfitting

▪ Extreme case: millions of independent variables → computational problems

▪ Also called regularization methods

▪ Not exclusive to machine learning, similar ideas in “traditional” econometrics:

▪ Partial Least Squares (PLS)

▪ Principal Component Regression (PCR)

▪ Horseshoe prior in Bayesian statistics

Shrinkage I
Ridge Regression

▪ Hoerl and Kennard (1970): Impose penalty on coefficients’ magnitude

▪ Idea: tradeoff between goodness of fit (squared residuals) and model complexity (squared
coefficients)

$\hat{\boldsymbol{\beta}}^{Ridge} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert_2^2$

(first term: Sum of Squared Residuals; penalty: sum of squared coefficients)

▪ What happens if the independent variables are not on the same scale?

▪ Unfair penalties! So, we usually standardize X and center y
▪ Therefore, remove the intercept (it becomes simply $\bar{y}$)

▪ Also called L2 regularization (L2 norm: Euclidean distance)

Shrinkage I
Ridge Regression

$\hat{\boldsymbol{\beta}}^{Ridge} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \, \boldsymbol{\beta}'\boldsymbol{\beta}$

▪ The solution depends on λ, the size of the penalty (a hyperparameter that can be tuned)

▪ Use cross-validation to choose the optimal λ

▪ What happens if λ → 0?
▪ We are back to the OLS solution

▪ What happens if λ → ∞?
▪ Then all coefficients approach zero

Shrinkage I
Implications for Bias-Variance Tradeoff

▪ Shrinkage reduces variance, but introduces bias!

[Figure: sampling distributions of $\hat{\boldsymbol{\beta}}^{OLS}$ (unbiased, but large variance) and $\hat{\boldsymbol{\beta}}^{Ridge}$ (lower variance, but biased)]

▪ The bias increases as λ increases
▪ The variance decreases as λ increases

▪ Ridge regression is not helpful for variable selection
▪ Even if the true coefficient is zero, Ridge will shrink it, but it will not set it to zero
▪ Good predictions, but difficult to interpret the resulting coefficients
▪ No concept of statistical significance (no standard errors provided)

Shrinkage I
LASSO Regression

▪ Tibshirani (1996): Least Absolute Shrinkage and Selection Operator

$\hat{\boldsymbol{\beta}}^{LASSO} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} |\beta_j| = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert_1$

(first term: Sum of Squared Residuals; penalty: sum of absolute coefficients)

▪ LASSO also performs variable selection: coefficients can become exactly zero

▪ Sparse solutions: many zero coefficients, equivalent to excluding the variables from the
model

▪ Computational advantage: independent variables with zero coefficients can be ignored

▪ Also called L1 regularization (L1 norm: Manhattan distance)

Shrinkage I
LASSO Regression

▪ Efron et al. (2004): LARS (Least Angle Regression) to compute LASSO efficiently

▪ But, with highly correlated independent variables:

▪ LASSO arbitrarily selects one variable and reduces the other coefficients to zero

▪ Under these conditions, even small values of λ give many zero coefficients

Shrinkage I
LASSO vs Ridge Regression
Contours of the least squares error function

LASSO: $|\beta_1| + |\beta_2| \le t$        Ridge: $\beta_1^2 + \beta_2^2 \le t$

Source: Hastie, Tibshirani and Friedman (2009)

Shrinkage I
Example: Loan Default Prediction – Ridge Regression

Shrinkage I
Example: Loan Default Prediction – Ridge Regression

Shrinkage I
Example: Loan Default Prediction – LASSO Regression

Shrinkage I
Example: Loan Default Prediction – LASSO Regression

Hands-on Exercise
Shrinkage II
Elastic Net

▪ Zou and Hastie (2005)

▪ Ridge often has a “grouping” effect

▪ Strongly correlated independent variables tend to be in or out of the model together

▪ Combine both methods: convex combination of Ridge and LASSO penalties

▪ Shrinks together the coefficients of correlated predictors like Ridge

▪ Selects variables like LASSO but suffers less from the multi-collinearity problem

$\hat{\boldsymbol{\beta}}^{Elastic\,Net} = \underset{\beta}{\operatorname{argmin}} \; (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \left[ \alpha |\beta_j| + (1-\alpha)\beta_j^2 \right], \quad 0 \le \alpha \le 1$

Shrinkage II
Elastic Net

Ridge: $\beta_1^2 + \beta_2^2 \le t$    LASSO: $|\beta_1| + |\beta_2| \le t$    Elastic Net: $\alpha(|\beta_1| + |\beta_2|) + (1-\alpha)(\beta_1^2 + \beta_2^2) \le t$

Shrinkage II
Example: Loan Default Prediction

▪ Using package glmnet in R

▪ Ridge: set parameter alpha = 0

▪ LASSO: set parameter alpha = 1

▪ Elastic Net: set parameter 0 < alpha < 1

▪ As with the lasso and ridge, we typically do not penalize the intercept term, and
standardize the predictors for the penalty to be meaningful.

▪ The parameter α determines the mix of the penalties and is often pre-chosen on qualitative grounds, or chosen by cross-validation.
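A minimal sketch, assuming a model matrix x and a 0/1 response y built from the train data (placeholders); cv.glmnet() picks λ by cross-validation:

  library(glmnet)

  x <- model.matrix(default ~ ., data = train_data)[, -1]  # drop the intercept column
  y <- train_data$default

  cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)    # Ridge
  cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)    # LASSO
  cv_enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)  # Elastic Net

  cv_lasso$lambda.min              # lambda with the lowest cross-validated error
  coef(cv_lasso, s = "lambda.min") # sparse coefficients: many are exactly zero

  # Predicted default probabilities on the validation data
  x_val <- model.matrix(default ~ ., data = validation_data)[, -1]
  prob  <- predict(cv_lasso, newx = x_val, s = "lambda.min", type = "response")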

Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression

Shrinkage II
Example: Loan Default Prediction – Elastic Net Regression

Shrinkage II
Example: Loan Default Prediction – Confusion Matrix Comparison
(Validation data)
                             Logistic Regression   Ridge      LASSO      Elastic Net
TN (pred. No Def., act. No Def.)   84,915           85,049     84,931     84,987
FN (pred. No Def., act. Def.)       2,151            8,780      2,168      2,456
FP (pred. Def., act. No Def.)         161               27        145         89
TP (pred. Def., act. Def.)         18,868           12,239     18,851     18,563

Accuracy                            97.8%            91.7%      97.8%      97.6%
Precision                           99.2%            99.8%      99.2%      99.5%
Recall                              89.8%            58.2%      89.7%      88.3%
F1-Score                            94.2%            73.5%      94.2%      93.6%

• Remember: No information rate in the validation data is 80.2%.

Shrinkage II
Multiclass Problems

[Figures: extending binary classification to multiclass problems]
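The same penalties carry over to multiclass problems; a minimal sketch with glmnet's multinomial family (x, a factor response y, and new data x_new are placeholders):

  library(glmnet)

  # y is a factor with more than two classes
  cv_multi <- cv.glmnet(x, y, family = "multinomial", alpha = 1)

  coef(cv_multi, s = "lambda.min")  # one coefficient vector per class
  predict(cv_multi, newx = x_new, s = "lambda.min", type = "class")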
Hands-on Exercise
References

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), Least Angle Regression, The Annals of Statistics, 32(2), 407-499.

Hoerl, A. E., & Kennard, R. W. (1970), Ridge Regression: Biased Estimation for Nonorthogonal
Problems, Technometrics, 12(1), 55-67.

Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal
Statistical Society, Series B (Methodological), 58(1), 267-288.

Zou, H., & Hastie, T. (2005), Regularization and Variable Selection via the Elastic Net, Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

Day 3
Decision Trees
Agenda

▪ Introduction

▪ Decision Trees (CART)

▪ Bootstrapping

▪ Break

▪ Introduction to Ensemble Methods


▪ Hands-on Exercise

Decision Trees (CART)
Are They Really a New Method?

▪ Decision trees are essentially not a brand-new idea

▪ First algorithms were developed decades ago

▪ AID (Morgan & Sonquist, 1963)

▪ CHAID (Kass, 1980)

▪ CART (Breiman et al., 1984)

▪ ID3 (Precursor of C5.0) (Quinlan, 1986)

▪ But now they are more widely known and used (why?)

Decision Trees (CART)
What Are Decision Trees?

▪ Belong to “divide and conquer” algorithms (greedy algorithms)

▪ Divide the “big” problem into smaller “subproblems”

▪ Recursively solve the “subproblems” and combine the solutions

▪ Recursive partitioning of the training data into smaller subsets until:

▪ All subsets predominantly belong to one value of Y (homogeneous nodes), or

▪ A pre-specified parameter (stop criterion) is reached – e.g., a minimum number of observations in a node

▪ The resulting tree represents a set of rules – "if this, then that"

Decision Trees (CART)
What Are Decision Trees?

▪ Classification Trees – Our focus today

▪ Y is a categorical variable, e.g. represents categories or binary choices

▪ E.g. transportation mode (car, bike, subway), default (y/n), buy (y/n)

▪ Regression Trees

▪ Y is a numerical variable, e.g. represents discrete or continuous quantities

▪ E.g., number of products purchased, duration of customer relationship

Decision Trees (CART)
Example: Will a Borrower Default?

▪ Example: Will a borrower default?
▪ Classification tree: predict default (yes/no)
▪ Method: CART (RPART in R)

▪ Root node: beginning of the tree
▪ Decision nodes: nodes in which a choice is made – lead either to an outcome or to another decision node
▪ Leaf/terminal nodes: outcomes

▪ Question formats
▪ x ≥ a (or x < a)
▪ x = a
▪ x ∈ A, where A are partitions of the values x takes in the training data
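A minimal sketch of fitting such a tree with rpart in R (the loan data frame is a placeholder; rpart grows the tree and prints the resulting rules):

  library(rpart)

  # Classification tree: method = "class" for a categorical outcome
  tree <- rpart(default ~ ., data = train_data, method = "class")

  print(tree)             # the set of "if this, then that" rules
  plot(tree); text(tree)  # plot of the tree structure

  # Predicted classes on the validation data
  pred <- predict(tree, newdata = validation_data, type = "class")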

Decision Trees (CART)
Example: Will a Borrower Default?

Example nodes:
▪ Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default
▪ A leaf node: 275,913 observations (82.7%); 39,918 (14.5%) default, 235,995 (85.5%) non-default
▪ A leaf node: 1,672 observations (0.5%); 1,672 (100%) default, 0 (0%) non-default

Decision Trees (CART)
How to Build a Decision Tree?

Fundamental steps to build a decision tree:

1. Which split criterion to choose?
2. For every decision node:
   • On which independent variable to split?
   • On which value to split (i.e., which value is the cut-off value)?
3. Depth of the tree (i.e., number of decision nodes)?
4. What value to predict at each leaf node?

Decision Trees (CART)
1. Which Split Criterion to Choose?

▪ Impurity using the Gini Index: CART and its implementation in R, RPART

$i(t) = 1 - \sum_{j=1}^{k} p(j|t)^2$, where $p(j|t)$ is the relative frequency of category j in node t and k is the number of categories in the sample

▪ Minimum value: all observations belong to one category ("pure" node)
▪ Maximum value: observations are equally distributed across all categories (maximal dispersion)

▪ Other criteria (we won't discuss these in detail)
▪ Significance tests: χ² (CHAID), permutation tests (CTREE)
▪ Entropy (C5.0)
▪ Sum of squares (for regression)
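A minimal sketch of the Gini index as an R function, evaluated at the two extremes for a binary Y:

  # Gini impurity of a node, given the counts of each category in the node
  gini <- function(counts) {
    p <- counts / sum(counts)
    1 - sum(p^2)
  }

  gini(c(100, 0))  # pure node: 0 (minimum)
  gini(c(50, 50))  # maximal dispersion: 0.5 (maximum for two categories)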
Decision Trees (CART)
2. On Which Value to Split?

▪ Goal: generate child nodes that are "purer" than their parents

▪ The split finds the independent variable and its cut-off value that maximize the decrease in node impurity

▪ Decrease in node impurity from split s in node t (Gini Index):

$\Delta i(s,t) = i(t) - \sum_{n=1}^{N} p_{t_n} i(t_n)$  for multiway splits (N child nodes)

$\Delta i(s,t) = i(t) - p_{t_l} i(t_l) - p_{t_r} i(t_r)$  for binary splits ($t_l$: left node, $t_r$: right node)

▪ The split minimizes the second term of the equation

$p_{t_n}$: proportion of observations in node t assigned to child node $t_n$, with $\sum_{n=1}^{N} p_{t_n} = 1$.

Decision Trees (CART)
2. On Which Value to Split - Example Loan Default Prediction
▪ Loan default prediction with RPART: decrease in node impurity from the first split

Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default

$i(t) = 1 - \left(\frac{68{,}773}{333{,}301}\right)^2 - \left(\frac{264{,}528}{333{,}301}\right)^2 \approx 0.33$

Right child node: 57,388 observations (17%); 28,855 (50.3%) default, 28,533 (49.7%) non-default

$i(t_r) = 1 - \left(\frac{28{,}855}{57{,}388}\right)^2 - \left(\frac{28{,}533}{57{,}388}\right)^2 \approx 0.50$

Left child node: 275,913 observations (82.7%); 39,918 (14.5%) default, 235,995 (85.5%) non-default

$i(t_l) = 1 - \left(\frac{39{,}918}{275{,}913}\right)^2 - \left(\frac{235{,}995}{275{,}913}\right)^2 \approx 0.25$

Decrease in node impurity from the first split:¹

$\Delta i(s,t) = 0.33 - \frac{275{,}913}{333{,}301} \cdot 0.25 - \frac{57{,}388}{333{,}301} \cdot 0.50 = 0.04$

¹ The "improve" reported under summary() in R shows the decrease in node impurity multiplied by the number of obs. in the parent node.

Decision Trees (CART)
3. How Deep Should the Tree be?

▪ Without a stop criterion (e.g., a minimum number of observations in a node):

▪ The tree can potentially grow until all observations either have
▪ The same value for the dependent variable (perfect prediction), or
▪ The same value for the independent variables

▪ Extreme case: each observation has its own leaf node → overfitting!
▪ Predictive power is high on training data, but poor on new data – the model is not generalizable

▪ Solution: pruning (removing branches from the tree)

Decision Trees (CART)
3. How Deep Should the Tree be?

Pre-pruning: stop tree growth during the tree building process

▪ Pre-define thresholds either in

▪ Split criterion (e.g., minimum improvement in split criterion)

▪ Tree characteristics (e.g., number of leaf nodes, number of splits, minimum number of observations in leaf nodes)

Decision Trees (CART)
3. How Deep Should the Tree be?

Post-pruning: grow the tree to its maximum and then trim the nodes in a bottom‐up fashion

▪ Merge leaf nodes while considering prediction error (shouldn’t increase much)

▪ Cost-complexity pruning with cross-validation: CART

▪ Pessimistic error-based pruning with binomial confidence limit: C5.0

▪ Usually possible to weight the prediction errors differently

▪ E.g., it might be worse to predict "no default" for a borrower who ends up defaulting than to predict "default" for a borrower who doesn't

Decision Trees (CART)
4. What Value to Predict at Each Leaf Node?

Predictions for every leaf node

▪ Majority voting: category with the most “votes” within the

leaf node is the prediction (mode of dependent variable)

▪ Assumption: Equal cost of type I and type II errors

▪ When is this (not) a reasonable assumption?

▪ Predicted probabilities (relative frequency of category in

the leaf node)

Source: twitter.com/freakonometrics

Decision Trees (CART)
4. What Value to Predict at Each Leaf Node - Example Loan Default Prediction

[Figure: the fitted tree with four leaf nodes, predicting No Default, No Default, Default, and Default]

Decision Trees (CART)
Handling Missing Values

▪ Surrogate variables
▪ Tree grows with the selected splits (primary splits) using observations with no missing
values on the split variables
▪ Weakness: CART is biased toward selecting variables with many missing values for a
primary split (Kim & Loh, 2001)
▪ When a split variable is missing for an observation:
▪ Surrogate split: use a surrogate variable instead of the primary split variable
▪ A surrogate variable is another independent variable and cut-off value whose split most
resembles the primary split
▪ Ideally, they should send exactly the same observations to each child node

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction
Root node: 333,301 observations (100%); 68,773 (20.6%) default, 264,528 (79.4%) non-default

Now: 100,000 missings in total_pymnt

▪ Randomly force 100,000 values of total_pymnt to be missing

▪ Surrogate variables are used when we set na.action = na.rpart

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction

▪ Output from R – using the summary() command:

  Primary splits:
      total_pymnt < 5317.716 to the right, improve=8636.406, (100,000 missing)
  Surrogate splits:
      loan_amnt < 4712.5 to the right, agree=0.917, adj=0.515, (100,000 split)

▪ Observations with a missing total_pymnt are split on loan_amnt instead

Decision Trees (CART)
Handling Missing Values – Example Loan Default Prediction

▪ Agreement: proportion of observations sent to the "correct" leaf node when the surrogate is used (instead of the primary split)

$Agree = \frac{\#\,correct\ surrogate}{\#\,obs\ in\ parent\ node}$

▪ Adjusted agreement: deducts the "go with the majority" baseline from the surrogate agreement

$Adj = \frac{\#\,correct\ surrogate - \#\,correct\ \text{“go with majority”}}{\#\,obs\ in\ parent\ node - \#\,correct\ \text{“go with majority”}}$

Decision Trees (CART)
Discussion

▪ Impurity Criterion
▪ Gini Index is biased toward selecting split variables with many missing values

▪ Required Assumptions
▪ Observations are independent
▪ Joint distribution of X and Y in the training data is the same as in the test data

▪ Overfitting
▪ Single trees have high variance: unstable predictions
▪ Small perturbation in the data → large changes in leaf nodes
▪ Solution? Ensemble learning (more on that after the break!)

Detour: Bias-Variance Tradeoff
Bias-Variance Tradeoff
Expected Prediction Error
▪ What is the expected prediction error for a new observation with $X = x_0$?

▪ Consider $Y = f(X) + \varepsilon$, s.t. $E[\varepsilon] = 0$ and $Var(\varepsilon) = \sigma_\varepsilon^2$

▪ Using a quadratic loss function (squared error):

$Error(x_0) = E\left[(Y - \hat{f}(x_0))^2\right] = \sigma_\varepsilon^2 + \left(E[\hat{f}(x_0)] - f(x_0)\right)^2 + E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right]$

▪ $\sigma_\varepsilon^2$: irreducible error – remains unless $\sigma_\varepsilon^2 = 0$!
▪ $\left(E[\hat{f}(x_0)] - f(x_0)\right)^2$: squared bias – result of misspecifying $f(X)$
▪ $E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right] = Var(\hat{f}(x_0))$ – result of using a sample to estimate $f$

▪ Goal in Machine Learning: minimize this combination

▪ Complex models: usually low bias but high variance

Bias-Variance Tradeoff
Examples

▪ Overfitting: low bias, high variance
▪ Underfitting: high bias, low variance
▪ Good balance: low bias, low variance

Source: towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Bias-Variance Tradeoff
Examples

Source: https://www.kdnuggets.com/2016/08/bias-variance-tradeoff-overview.html

Bias-Variance Tradeoff
Changes in Error with Model Complexity

Source: scott.fortmann-roe.com/docs/BiasVariance.html

Bias-Variance Tradeoff
What Are Your Options?

▪ Use cross-validation to estimate the validation error

▪ Test and validate your model on new observations (set part of the data aside for that)

▪ Combine multiple models

▪ This combination is known as ensemble learning

▪ Most influential development in Machine Learning in the past decade (Seni & Elder,
2010)

▪ Main idea: average predictions of several models

Introduction to Ensemble Methods
Overview

▪ Again (!) not really a “new” development

▪ Bagging (Breiman,1996)

▪ Boosting (Freund & Schapire, 1997)

▪ Random Forests (Breiman, 2001)

▪ Gradient Boosting (Friedman, 2001)

▪ Group of models that together deliver one aggregate prediction

▪ Idea: combine a large number of “weak” models to produce a stronger, more stable
model

▪ Use average of predictions (numerical dep. var.) or majority voting (categorical dep. var.)

Introduction to Ensemble Methods
Overview

▪ Ensemble methods retain many advantages of Machine Learning methods

▪ Model higher order interactions automatically

▪ Helpful in p >>n problems

▪ Robust to outliers

▪ Multicollinearity is not a problem

▪ But…

▪ More difficult to visualize and interpret results

▪ Usually known as a “black box”

Introduction to Ensemble Methods
Overview

▪ Ideal world: infinitely many datasets available


▪ Estimate one model on each dataset

▪ Sadly, that’s not our reality…

▪ So, what’s the next best option?

Introduction to Ensemble Methods
Bagging Weak Models

▪ Bagging: bootstrap aggregating (Breiman, 1996)

▪ Idea: use your dataset to artificially generate “new” datasets

▪ Sample with replacement from training data (bootstrapping)

[Figure: bootstrap replicates drawn with replacement from the original training dataset]

Introduction to Ensemble Methods
Bagging Weak Models

▪ Estimate a model on each “new” bootstrapped dataset

▪ Aggregate predictions (average or majority voting)

▪ Improves accuracy and model stability

▪ Often used for decision trees, but you can use it with any weak supervised model
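A minimal sketch of bagging by hand in R, with rpart trees as the weak models (train_data and validation_data with a factor default are placeholders):

  library(rpart)

  set.seed(1)
  B <- 25                 # number of bootstrap replicates
  n <- nrow(train_data)

  # One tree per bootstrap replicate (sampling rows with replacement)
  trees <- lapply(1:B, function(b) {
    boot <- train_data[sample(n, n, replace = TRUE), ]
    rpart(default ~ ., data = boot, method = "class")
  })

  # Aggregate the B predictions by majority voting
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata = validation_data, type = "class")))
  pred <- apply(votes, 1, function(v) names(which.max(table(v))))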


Introduction to Ensemble Methods
Example: Loan Default Prediction – Confusion Matrix Comparison
(Validation data)
                                Logistic Regression   Decision Tree   Bagged Trees
TN (pred. No Def., act. No Def.)     84,915              84,872          84,869
FN (pred. No Def., act. Def.)         2,151              13,788          13,785
FP (pred. Def., act. No Def.)           161                 204             207
TP (pred. Def., act. Def.)           18,868               7,231           7,234

Accuracy                              97.8%               86.8%           86.8%
Precision                             99.2%               97.3%           97.2%
Recall                                89.8%               34.3%           34.4%
F1-Score                              94.2%               50.8%           50.8%

• Remember: No information rate in the validation data is 80.2%.

Hands-on Exercise
References

Alfaro, E., Gamez, M. & Garcia, N. (2013), adabag: An R Package for Classification with
Boosting and Bagging, Journal of Statistical Software, 54 (2), 1-35.

Alfaro, E., Garcia, N., Gamez, M. & Elizondo, D. (2008), Bankruptcy Forecasting: An Empirical Comparison of AdaBoost and Neural Networks, Decision Support Systems, 45, 110-122.

Breiman, L. (1996), Bagging predictors, Machine Learning, 24(2), 123-140.


Breiman, L. (1998), Arcing classifiers, The Annals of Statistics, 26(3), 801-849.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Breiman, L., Friedman, J. H., Stone, C. J. & Olshen, R. A. (1984), Classification and Regression
Trees. Belmont, CA, Wadsworth.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.

Hothorn, T., Hornik, K. & Zeileis, A. (2006), Unbiased Recursive Partitioning: A Conditional
Inference Framework, Journal of Computational and Graphical Statistics, 15(3), 651-674.

References

Loh, W.-Y. (2014), Fifty Years of Classification and Regression Trees, International Statistical Review, 82(3), 329-348.

Loh, W.-Y. & Shih, Y.-S. (1997), Split Selection Methods for Classification Trees, Statistica Sinica, 7(4), 815-840.
Quinlan, J. R. (1986), Induction of Decision Trees, Machine Learning, 1(1), 81-106.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA, Morgan
Kaufmann.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Therneau, T. M. & Atkinson, E. J. (1997), An Introduction to Recursive Partitioning Using the RPART Routines, Technical Report 61, Mayo Clinic, Section of Statistics, Rochester, Minnesota.

Day 4
Ensemble Methods
Agenda

▪ Random Forests

▪ Hands-on Exercise

▪ Break

▪ Gradient Boosting

▪ Hands-on Exercise

Recap
Bias-Variance Tradeoff
Changes in Error with Model Complexity

Source: scott.fortmann-roe.com/docs/BiasVariance.html

Ensemble Methods
Combining Several Weak Models to Generate a Stronger Model

Source: Lantz, Brett (2015), "Machine learning with R", Birmingham: Packt Publishing, Chapter 11.

Ensemble Methods
Bagging Weak Models

▪ Estimate a model on each “new” bootstrapped dataset

▪ Aggregate predictions (average or majority voting)

▪ Improves accuracy and model stability

▪ Often used for decision trees, but you can use it with any weak supervised model


Random Forests
Random Forests
Motivation

▪ Bagging helps to reduce model variance

▪ But…

▪ Trees often have many splits in common

▪ Consequence: similar predictions

▪ Idea: why not also introduce randomness in the selection of the candidate variables for a
split?

▪ This idea is implemented in Random Forests (Breiman, 2001)

Random Forests
Overview

▪ Use bootstrap to generate 𝑵 “new” datasets

▪ For each tree and split:

▪ Only give the algorithm a random set of 𝒎 independent variables to choose from (out of
the total 𝑴 independent variables in the dataset)

▪ Estimate 𝑵 trees, each with a different dataset

▪ Aggregate predictions with averaging or majority voting

Random Forests
Overview

[Diagram: the train data is bootstrapped into subsamples 1-4; one tree is grown per subsample; the trees' predictions are combined into an aggregated prediction. At each split, a tree considers only a fraction of all the available independent variables (m variables out of the total M variables).]

Random Forests
How to Choose m?

▪ How to choose m?

▪ Breiman's suggested heuristic: m = √n

▪ n: total number of independent variables

▪ Use cross-validation to find the "best" value of m for the data / problem at hand (see the sketch below)

▪ This is called (hyper)parameter tuning

▪ When m = n, you're back to bagging

▪ m in the R implementation: the mtry parameter
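
A sketch of such a grid search with the caret package; the data and variable names are hypothetical, and the mtry grid is arbitrary:

  library(caret)  # wraps randomForest when method = "rf"
  set.seed(42)

  # 5-fold cross-validation over a grid of mtry values
  ctrl <- trainControl(method = "cv", number = 5)
  grid <- expand.grid(mtry = c(2, 4, 6, 8))

  rf_tuned <- train(default ~ ., data = train, method = "rf",
                    trControl = ctrl, tuneGrid = grid, ntree = 500)
  rf_tuned$bestTune  # mtry with the best cross-validated performance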

Random Forests
Out-of-Bag Observations

▪ Another nice feature of Random Forests: OOB (Out-Of-Bag) observations

▪ Refers to observations in the original dataset that were not part of the respective bootstrap replicate

▪ So, OOB observations were not used to construct the model

▪ Therefore, they are great for estimating the generalization error!

▪ By default, OOB ≈ 1/3 of observations

▪ But you can adjust this as you wish

▪ OOB observations are also used to compute variable importance measures

[Diagram: original train dataset next to a bootstrap replicate; in this bootstrap replicate, the gray observation is OOB.]
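
A minimal sketch of reading off the OOB estimates with the randomForest package (data and variable names hypothetical):

  library(randomForest)
  set.seed(42)
  rf <- randomForest(default ~ ., data = train, ntree = 500)

  rf$err.rate[rf$ntree, "OOB"]  # OOB estimate of the generalization error
  rf$confusion                  # confusion matrix computed on OOB predictions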

Random Forests
Parameters

▪ ntree: number of trees

▪ mtry: the m parameter, i.e. the number of randomly selected candidate variables at each split

▪ nodesize: minimum number of observations in each leaf node
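
A minimal randomForest call setting these three parameters (the values are illustrative; data and variable names hypothetical):

  library(randomForest)
  set.seed(42)

  rf <- randomForest(default ~ ., data = train,
                     ntree = 500,    # number of trees
                     mtry = 4,       # candidate variables per split (m)
                     nodesize = 10)  # minimum observations in each leaf node
  pred <- predict(rf, newdata = valid)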

Random Forests
Example: Loan Default Prediction – Confusion Matrix Comparison
                 Logistic Regression    Decision Tree          Bagged Trees           Random Forest
                 Validation Data        Validation Data        Validation Data        Validation Data
Predictions      No Default  Default    No Default  Default    No Default  Default    No Default  Default
  No Default     84,915      2,151      84,872      13,788     84,869      13,785     85,069      2,396
  Default        161         18,868     204         7,231      207         7,234      7           18,623

Accuracy         97.8%                  86.8%                  86.8%                  97.7%
Precision        99.2%                  97.3%                  97.2%                  99.9%
Recall           89.8%                  34.3%                  34.4%                  88.6%
F1-Score         94.2%                  50.8%                  50.8%                  93.9%

• Remember: No information rate in the validation data is 80.2%.

Random Forests
Discussion

▪ Random Forests reduce variance by

▪ Training trees on different subsets of the training dataset

▪ Limiting the subspace of independent variables considered for a split

▪ But, there’s no free lunch…

▪ Bias may slightly increase (remember the potentially biased splits in each tree)

▪ More difficult to interpret

Random Forests
Discussion

▪ Bias can be reduced with tuning

▪ Find optimal parameters for your model / data (e.g. tree depth, number of trees, number of
observations in leaf nodes)

▪ Usually done with a grid search (test a finite combination of parameters)

▪ Interpretability can be tackled with interpretability methods

▪ E.g., variable importance measures, partial dependence plots, etc.
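
A sketch of these interpretability tools with the randomForest package (data and variable names hypothetical; loan_amnt stands in for any predictor of interest):

  library(randomForest)
  set.seed(42)
  rf <- randomForest(default ~ ., data = train, ntree = 500, importance = TRUE)

  importance(rf)  # variable importance measures
  varImpPlot(rf)  # plot them
  partialPlot(rf, pred.data = train, x.var = "loan_amnt")  # partial dependence plot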

Hands-on Exercise
Gradient Boosting
Gradient Boosting
First Idea of Boosting

▪ Ensemble method proposed by Freund & Schapire (1997)

▪ Main idea

▪ Sequentially train new models, giving more importance to observations that are "difficult to predict"

▪ Weights or residuals reflect how "difficult" it is to predict the outcome for a specific observation

▪ Dataset is the same, only the weights change at each iteration

▪ Often applied in the context of decision trees, but applicable to any “weak” model

▪ Trees are usually smaller than in Random Forests (often “stumps”: only one split)

Gradient Boosting
Boosting for Classification

▪ Objective: minimize classification error

▪ Weights are adjusted and renormalized at each iteration

▪ Aggregated prediction: Sum the “votes” of all models

▪ “Voting power” of each model is a function of its accuracy

▪ Models that accurately predict many observations change weights more significantly and
have a higher influence in the aggregated prediction
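
A minimal AdaBoost-style sketch with the adabag package (Alfaro, Gamez & Garcia, 2013); data and variable names are hypothetical, and the outcome must be a factor:

  library(adabag)
  set.seed(42)

  ada <- boosting(default ~ ., data = train,
                  mfinal = 100,          # number of boosting iterations / trees
                  coeflearn = "Breiman") # formula for each tree's voting power
  pred <- predict(ada, newdata = valid)
  pred$confusion  # confusion matrix on the validation data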

Gradient Boosting
Boosting for Classification

[Diagram: training begins with all weights equal. Tree 1 is fit; incorrectly (correctly) classified observations gain a higher (lower) weight. Tree 2 is fit on the reweighted data and the weights are adjusted again, and so on: each subsequent tree focuses on predicting the observations with higher weights. The trees' votes are combined into an aggregated prediction, where voting power matters: the vote of "good" models counts more.]

Gradient Boosting
Other Views of Boosting

▪ Friedman, Hastie & Tibshirani (2000)


▪ Adaboost ~ optimization method to minimize a particular exponential loss function (for
classification)
▪ Find that exponential loss ~ Bernoulli likelihood
▪ So, maximize Bernoulli likelihood instead (for classification)

▪ Breiman (1999)
▪ Boosting ~ gradient descent with a special loss function

▪ Friedman (2001)
▪ Generalize: from Adaboost to Gradient Boosting
▪ Handle many other loss functions
▪ Very important development: link “obscure” computational learning to standard
statistics (likelihood) and function optimization

Gradient Boosting
Slightly Change the Target Function

▪ Instead of training h(X) on the residuals of F(X), i.e. on y − F̂(X) ...

▪ ... train h(X) on the gradient of the loss function: L(y, F(X)) = (y − F(X))² / 2

▪ Goal: minimize J = Σ_i L(y_i, F(X_i)) over all observations i by finding an appropriate F(X_i)

▪ This yields the optimization problem: ∂J/∂F(X_i) = F(X_i) − y_i

▪ Residuals are the negative gradients of L:

  y_i − F(X_i) = −∂J/∂F(X_i)

Gradient Boosting
Slightly Change the Target Function

▪ y_i − F(X_i) = −∂J/∂F(X_i) ... does this look familiar?

▪ Hint, gradient descent: θ_i := θ_i − ρ ∂J/∂θ_i

▪ Gradient Boosting = Boosting + Gradient Descent

Source: https://www.oreilly.com/library/view/learn-arcore-/9781788830409/e24a657a-a5c6-4ff2-b9ea-9418a7a5d24c.xhtml

Gradient Boosting
Generalization Using Gradient Descent

▪ Changing from residuals to gradient descent → Gradient Boosting

▪ Generalize the method to other loss functions (e.g. log loss, absolute loss)

▪ So, in our new framework:

▪ Start with conservative estimate for 𝐹(𝑋) (e.g. mean of 𝑦)

▪ Compute the negative gradient (i.e. the residual): −g(X) = −∂L(y, F(X))/∂F(X) = y − F(X)

▪ Fit a new model h(X) to −g(X)

▪ Update F(X): F(X) := F(X) + ρ h(X)

▪ ρ is the size of the step

Gradient Boosting
Generalization Using Gradient Descent

▪ By restricting the tree depth we can control the order of approximation of 𝐹 𝑋

▪ E.g., Gradient Boosting with stumps (only one split) yields a first-order (linear) approximation

▪ But, if we fit the training data “too closely”: overfitting!

▪ Solution: regularization (or shrinkage)

▪ Introduce a parameter to slow down the incorporation of new results to the aggregate
model

▪ Each update is "scaled" or "shrunk" by the learning-rate parameter ν:

  F(X) := F(X) + ν · ρ h(X)
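
A minimal sketch of this loop for squared loss, fitting rpart stumps to the residuals and shrinking each update by ν; for squared loss, the step size ρ is absorbed into the leaf means. The data and variable names (train, y) are hypothetical:

  library(rpart)

  nu <- 0.1   # learning rate (shrinkage)
  M  <- 100   # number of boosting iterations

  X    <- train[, setdiff(names(train), "y")]  # predictors
  Fhat <- rep(mean(train$y), nrow(train))      # conservative start: mean of y
  for (m in seq_len(M)) {
    df  <- cbind(X, r = train$y - Fhat)        # residual = negative gradient
    fit <- rpart(r ~ ., data = df,
                 control = rpart.control(maxdepth = 1))  # a stump
    Fhat <- Fhat + nu * predict(fit, newdata = df)       # shrunken gradient step
  }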

Gradient Boosting
Generalization Using Gradient Descent

▪ How to choose the learning rate?

▪ Tradeoff between learning rate and number of trees

▪ Best to optimally choose parameters (e.g. using cross-validation)

▪ Stochastic Gradient Boosting (Friedman, 2002)

▪ Gradient Boosting + randomness in the train data seen by each model (as in Bagging)

Gradient Boosting
Parameters

▪ Parameters to choose (names in italic: inputs to gbm function in R)

▪ Loss function (distribution)

▪ Number of trees (n.trees)

▪ Learning rate (shrinkage)

▪ Tree depth - shorter trees usually give better results! (interaction.depth)

▪ Minimum number of observations in leaf nodes (n.minobsinnode)

▪ Fraction of observations in training data randomly selected to grow the next tree
(bag.fraction)
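
A gbm call setting these parameters (values are illustrative; data and variable names hypothetical, with default coded as 0/1 for the Bernoulli loss):

  library(gbm)
  set.seed(42)

  gbm_fit <- gbm(default ~ ., data = train,
                 distribution = "bernoulli",  # loss function (0/1 outcome)
                 n.trees = 1000,              # number of trees
                 shrinkage = 0.05,            # learning rate
                 interaction.depth = 2,       # tree depth
                 n.minobsinnode = 10,         # min. observations per leaf node
                 bag.fraction = 0.5,          # stochastic gradient boosting
                 cv.folds = 5)                # for choosing n.trees

  best_iter <- gbm.perf(gbm_fit, method = "cv")  # CV-optimal number of trees
  p_hat <- predict(gbm_fit, newdata = valid, n.trees = best_iter,
                   type = "response")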

Gradient Boosting
Example: Loan Default Prediction – Confusion Matrix Comparison
                 Logistic Regression    Decision Tree          Random Forest          Gradient Boosting
                 Validation Data        Validation Data        Validation Data        Validation Data
Predictions      No Default  Default    No Default  Default    No Default  Default    No Default  Default
  No Default     84,915      2,151      84,872      13,788     85,069      2,396      84,980      1,370
  Default        161         18,868     204         7,231      7           18,623     96          19,649

Accuracy         97.8%                  86.8%                  97.7%                  98.6%
Precision        99.2%                  97.3%                  99.9%                  99.5%
Recall           89.8%                  34.3%                  88.6%                  93.5%
F1-Score         94.2%                  50.8%                  93.9%                  96.4%

• Remember: No information rate in the validation data is 80.2%.

Gradient Boosting
Discussion

▪ Gradient Boosting reduces bias by

▪ Training trees sequentially on the training dataset

▪ Stochastic Gradient Boosting reduces variance by

▪ Training trees on different subsets of the training dataset

▪ What kind of problems can it solve?

▪ Any (!) loss function for which you can compute the gradient

Gradient Boosting
Discussion

▪ But:

▪ Usual shortcomings of ensembles: bias, interpretability

▪ Gradient Boosting requires more computing time

▪ Remedies:

▪ Tuning

▪ Tools and measures for interpretability

▪ A few other tricks possible (see xgboost)

Hands-on Exercise
References

Breiman, L. (1999), Prediction Games and Arcing Algorithms, Neural Computation, 11(7), 1493-1517.

Breiman, L. (2001), Random Forests, Machine Learning, 45(1), 5-32.

Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Freund, Y. & Schapire, R. E. (1997), A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting, Journal of Computer and System Sciences, 55(1), 119-139.

Friedman, J., Hastie, T. & Tibshirani, R. (2000), Additive Logistic Regression: A Statistical View of
Boosting, Annals of Statistics, 28(2), 337-374.

Friedman, J. H. (2001), Greedy Function Approximation: A Gradient Boosting Machine, Annals of
Statistics, 29(5), 1189-1232.

Friedman, J. H. (2002), Stochastic Gradient Boosting, Computational Statistics & Data Analysis, 38(4),
367-378.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.

Ridgeway, G. (1999), The State of Boosting, Computing Science and Statistics, 31, 172-181.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Day 5
Wrap Up
Agenda

▪ Machine Learning in Practice (Serafín Martínez-Jaramillo, CEMLA)

▪ Break

▪ Variable Importance

▪ Wrap Up

▪ Q&A

Machine Learning in Practice
Wrap Up & Q&A
Where Are We Headed?
Current Challenges and Opportunities

▪ Internal stakeholders: “what do we gain from using AI / machine learning?”

▪ Systems, talent, regulation (e.g. privacy)

▪ Markets are complex: regimes change, trends come and go

▪ Unstable models (asset management: “no longer than 3 weeks”)

▪ Use networks to better assess systemic risk

▪ Need for explainable machine learning

Where Are We Headed?
Explainable Machine Learning

▪ Avoid poor decisions

▪ Text mining at an investment fund: confusing Berkshire Hathaway with Anne Hathaway

▪ IBM “Watson for Oncology”

▪ Tyndaris Investments automated trading: ~ US$20 M daily losses

▪ Prevent algorithm bias: is the model doing what it is supposed to?

▪ Bias in recruiting tools (e.g., Amazon AI recruiting tool)

▪ Predicting the likelihood of a criminal reoffending (e.g., US COMPAS)

Will Machine Learning Solve All Our Problems?

Structure and Organization of the Course

▪ Day 1: October, 4
▪ Introduction I
▪ Machine Learning and Central Banking
▪ Break
▪ Introduction II

▪ Day 2: October, 5
▪ Shrinkage I
▪ Break
▪ Shrinkage II

▪ Day 3: October, 6
▪ Decision Trees I
▪ Break
▪ Decision Trees II

Structure and Organization of the Course

▪ Day 4: October, 7
▪ Random Forest
▪ Break
▪ Gradient Boosting

▪ Day 5: October, 8
▪ Machine Learning in Practice (Elizabeth Téllez León, CEMLA)
▪ Break
▪ Wrap Up & Q&A

Main Idea: Toolbox

Process

▪ Understand “business” problem

▪ Map to machine learning problem

▪ Understand the data

▪ Explore and prepare the data

▪ “Feature” development

▪ Method Selection

▪ Evaluation

▪ Deployment

Steps
Data Preparation

Features

Build Model

Validate Model

Deploy Model

Examples

2020 Banca d'Italia and Federal Reserve Board Joint Conference on Nontraditional Data & Statistical
Learning with Applications to Macroeconomics – Oct. 2021

• Forecasting UK inflation bottom up, A. Joseph, Eleni Kalamara, G. Potjagailo, and G. Kapetanios
• The Macroeconomy as a Random Forest, P. Goulet Coulombe
• Teaching Machines to Measure Economic Activities from Satellite Images: Challenges and Solution, D. Ahn,
M. Cha, S. Han, J. Kim, S. Sang Lee, S. Park, S. Park, H. Yang, and J. Yang
• The Knowledge Graph for Macroeconomic Analysis with Alternative Big Data, Y. Yang, Y. Pang, G. Huang,
and Weinan E
• Machine Learning for Zombie Hunting: Firms' Failures and Financial Constraints, F. J.
Bargagli Stoffi, M. Riccaboni, and A. Rungi
• Deciphering the Fed Communication via Text-Analysis of Alternative FOMC Statements, T. Doh, D. Song,
and S.-K. Yang
• Measuring central banks' sentiment and its spillover effects with a network approach, G. Tizzanini, P.
Lorenzini, M. Priola, L. Zicchino
• Application of text mining to the analysis of climate-related disclosures, Á. I. Moreno, and T. Caminero

BIG THANK YOU!!!!!
- To all participants, for the lively discussion, deep
comments and/or just for staying with us
- Of course to Serafín Martínez Jaramillo, Eréndira
Fuentes Hernández and all the colleagues "unknown" to us
- And not to forget: the translators and the IT team

References

Athey, S. (In Press), The Impact of Machine Learning on Economics, In A. K. Agrawal, J. Gans &
A. Goldfarb (Eds.), The Economics of Artificial Intelligence: An Agenda, University of Chicago
Press.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16(3), 199-231.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning. Data
Mining, Inference, and Prediction, Springer.

Shmueli, G. (2010), To Explain or to Predict?, Statistical Science, 25(3), 289-310.

Backup Day 1
Model Validation and Evaluation Measures
Comparing Competing Models

▪ How to tell if the difference in performance is really meaningful?

▪ Type I and type II errors

Source: Ellis, P.D. (2010), “Effect Size FAQs”, www.effectsizefaq.com

Model Validation and Evaluation Measures
Comparing Competing Models

▪ Many popular tests have a high probability of type I error

▪ I.e., detecting a difference between models when no difference exists

▪ Examples:

▪ Test for the difference of two proportions (each model estimated one time)

▪ Paired-differences t test using several random train-test splits

▪ Alternatives:

▪ McNemar Test (McNemar, 1947)

▪ 5x2-Fold Cross-Validated Paired t-Test (Dietterich, 1998)


Model Validation and Evaluation Measures
Comparing Competing Models

McNemar Test (“within-subjects chi-squared test”)

▪ Do both models disagree in the same way?

▪ H0: Both models have the same performance: Prob(CI) = Prob(IC)

                        Model 1
                        Correct     Incorrect
  Model 2  Correct      CC          CI
           Incorrect    IC          II

  Total = CC + CI + IC + II

▪ You only need to estimate each model once

▪ Requires CI + IC ≥ 25

▪ If CI + IC < 25, use the exact binomial test instead

▪ In R: mcnemar.test and binom.test
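
A minimal sketch in R; pred_model1, pred_model2 and valid are hypothetical stand-ins for two sets of validation predictions and the validation data:

  # Was each validation observation classified correctly by each model?
  correct1 <- pred_model1 == valid$default
  correct2 <- pred_model2 == valid$default

  tab <- table(Model1 = correct1, Model2 = correct2)
  mcnemar.test(tab)  # chi-squared version

  # Exact binomial test on the discordant pairs (the CI and IC cells)
  binom.test(tab["FALSE", "TRUE"], tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"])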
Model Validation and Evaluation Measures
Comparing Competing Models

5x2-Fold Cross-Validated Paired t-Test

▪ Split the dataset 5 times into train and test sets (50% of observations each)
▪ In each of the 5 iterations:
▪ Step A: estimate Model 1 and Model 2 on the train set and evaluate performance on the test set
▪ ΔACC_A,i = ACC_A,i,Model 1 − ACC_A,i,Model 2 for each iteration i
▪ Step B: swap train and test (estimate on the test set and evaluate on the train set)
▪ ΔACC_B,i = ACC_B,i,Model 1 − ACC_B,i,Model 2 for each iteration i
▪ ΔACC_avg,i = (ΔACC_A,i + ΔACC_B,i) / 2 and s_i² = (ΔACC_A,i − ΔACC_avg,i)² + (ΔACC_B,i − ΔACC_avg,i)²

▪ Test statistic: t = ΔACC_A,1 / √((1/5) Σ_{i=1}^{5} s_i²), where ΔACC_A,1 is ΔACC_A from the first iteration

▪ t follows a t distribution with 5 degrees of freedom
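
A minimal sketch of this procedure in R; dat, y and the predict_model1 / predict_model2 helpers are hypothetical stand-ins for the dataset and the two competing models:

  set.seed(42)
  s2   <- numeric(5)
  d_A1 <- NA

  for (i in 1:5) {
    idx   <- sample(nrow(dat), floor(nrow(dat) / 2))  # 50/50 split
    half1 <- dat[idx, ]
    half2 <- dat[-idx, ]

    # Accuracy difference between the two models on a given train/test pair;
    # predict_model1/predict_model2 are hypothetical train-and-predict helpers
    acc_diff <- function(tr, te) {
      a1 <- mean(predict_model1(tr, te) == te$y)
      a2 <- mean(predict_model2(tr, te) == te$y)
      a1 - a2
    }

    d_A   <- acc_diff(half1, half2)  # Step A
    d_B   <- acc_diff(half2, half1)  # Step B: train and test swapped
    d_avg <- (d_A + d_B) / 2
    s2[i] <- (d_A - d_avg)^2 + (d_B - d_avg)^2
    if (i == 1) d_A1 <- d_A
  }

  t_stat <- d_A1 / sqrt(mean(s2))         # t statistic with 5 degrees of freedom
  p_val  <- 2 * pt(-abs(t_stat), df = 5)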

Backup Day 3
Decision Trees (CART)
Split Criterion: Impurity vs. Entropy

Only for classification problems (categorical dependent variable)

▪ Minimum value: all observations belong to one category (“pure” node)

▪ Maximum value: observations are equally distributed across all categories (maximal
dispersion)

▪ Gini Index (CART, RPART)

  i(t) = 1 − Σ_{j=1}^{k} p(j|t)²

  p(j|t): relative frequency of category j in node t
  k: number of categories in the sample

▪ Information Gain (C5.0¹)

  Entropy(t) = Σ_{j=1}^{k} −p(j|t) log₂ p(j|t)

¹ C5.0 uses a normalized version of Information Gain, the "Gain-Ratio"
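
Both measures are easy to compute for a node; a minimal R sketch:

  node_stats <- function(y) {
    p <- as.numeric(table(y) / length(y))  # relative frequency of each category
    c(gini    = 1 - sum(p^2),
      entropy = -sum(p[p > 0] * log2(p[p > 0])))
  }

  node_stats(c("A", "A", "A", "B"))  # fairly pure node: low values
  node_stats(c("A", "B", "A", "B"))  # maximal dispersion (two categories)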

Decision Trees (CART)
Split Criterion: Impurity vs. Entropy

1. Splitting Criterion: Sum of Squares (Regression)

Only for regression problems (numerical dependent variable)

▪ Minimum value: all observations in the node have the same value of the dependent variable
(homogeneous node)

  SS(t) = Σ_{i=1}^{n_t} (y_i − ȳ)²

  (the variance of y within node t, up to scaling)

  ȳ: the average value of y_i in node t, which contains n_t observations
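
A one-line R version, plus the implied split criterion (left is a hypothetical logical index of observations sent to the left child):

  ss <- function(y) sum((y - mean(y))^2)  # within-node sum of squares

  # Reduction in SS achieved by splitting node t into left/right children
  ss_reduction <- function(y, left) ss(y) - ss(y[left]) - ss(y[!left])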

Decision Trees
Handling Missing Values in C5.0

▪ Proportional weighting

▪ Tree grows with all observations – lower information gain from variables with many
missings

▪ Observation with a missing in the split variable is sent down both child nodes

▪ In each child node: weight this observation by the proportion of observations with no
missings in the split variable

▪ Aggregate predictions of all leaf nodes the observation reached, using the weights
gained in each followed path

▪ Weighted average for numeric variables, or category with the highest probability, for
categorical variables

▪ Implemented in C5.0 and its precursors (e.g., C4.5)


Backup Day 4
Boosting for Regression

▪ Objective: minimize squared error

▪ Subsequent model tries to “correct” the errors of the previous model

▪ Predict the residuals of the previous model

▪ Aggregated prediction: Sum the predictions of all models

[Diagram: training begins with the actual values of the dependent variable. Tree 1 is fit; the residuals of Tree 1 become the dependent variable in Tree 2; the residuals of Tree 2 become the dependent variable in Tree 3, and so on: each subsequent tree focuses on predicting the observations with higher prediction error. The aggregated prediction is the sum of the predictions of all trees.]

Backup Day 5
Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Focus on causal explanation | Focus on empirical prediction
▪ Mostly parametric | Non-parametric
▪ Fail for n < p and overfit for p < n with large p | Capable of handling both n < p and large-p problems (e.g. hundreds of covariates)
▪ Not always feasible with big data | Work well with big data
▪ Model selection based on nested models or stepwise methods | Systematic model selection based on cross-validation
▪ In-sample goodness of fit | Out-of-sample goodness of fit

Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Strong assumptions about the data generating process (DGP) | Little to no DGP assumptions
▪ If assumptions are not appropriate, conclusions may be seriously wrong | Supervised case: data is independent; the joint distribution of X and Y is the same in training and test data
▪ E.g. how appropriate is a linear, logistic or Cox model for the data? | Better results for large data or complex relationships
▪ Why should they work: asymptotic theory | Why should they work: open question
▪ Results are directly interpretable | Need additional tools for interpretation
▪ Provide confidence intervals | Usually don't provide confidence intervals
intervals
Traditional Econometric Methods vs. Machine Learning

Traditional Econometric Methods | Machine Learning

▪ Only complete observations (no missings); have to drop or impute them | Missings are (usually) not a problem
▪ Model interpretation: coefficients, counterfactual analysis | Model interpretation: variable importance measures, Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE) plots, Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive Explanations (SHAP)


Variable Importance
Variable Importance with Surrogates

▪ How could a variable potentially improve model predictions?

▪ The potential improvement (here: Gini) is calculated for primary and surrogate splits

▪ Surrogate splits are weighted by their agreement

▪ Keep in mind: importance measures are often biased due to bias in splits!

Variable Importance (RPART Example)

Primary splits:
  total_pymnt < 5317.716 to the right, improve=8636.406,  (1e+05 missing)
  installment < 167.24   to the left,  improve=13731.080, (0 missing)
Surrogate splits:
  loan_amnt < 4712.5 to the right, agree=0.917, adj=0.515, (1e+05 split)
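
Output of this kind might be produced roughly as follows; the dataset and variables are hypothetical stand-ins matching the loan example above:

  library(rpart)
  fit <- rpart(default ~ ., data = train,
               control = rpart.control(maxsurrogate = 5,  # surrogate splits to keep
                                       usesurrogate = 2)) # use them for missings

  summary(fit)             # prints primary and surrogate splits per node, as above
  fit$variable.importance  # importance aggregated over primary and surrogate splits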

Variable Importance (Random Forest Example)
