Approaches for Credit Scorecard Calibration: An Empirical Analysis

Article in Knowledge-Based Systems · July 2017


DOI: 10.1016/j.knosys.2017.07.034



Approaches for Credit Scorecard Calibration: An
Empirical Analysis

Authors, Affiliations, and Postal address:

Artem Bequé¹, Kristof Coussement², Ross Gayler³, Stefan Lessmann¹

¹ School of Business and Economics, Humboldt-University of Berlin, Unter-den-Linden 6, 10099 Berlin, Germany
² IESEG School of Management, Université Catholique de Lille (LEM, UMR CNRS 9221), Department of Marketing, 3 Rue de la Digue, F-59000, Lille, France
³ Independent researcher, Melbourne, Australia
Email:

artemlive@live.com
k.coussement@ieseg.fr
r.gayler@gmail.com
stefan.lessmann@hu-berlin.de

Corresponding author:

Artem Bequé
School of Business and Economics, Humboldt-University of Berlin, Unter-den-Linden 6,
10099 Berlin, Germany
Email: artemlive@live.com
Tel.: +49 (0)30 2093 5742
Fax: +49 (0)30 2093 5741

Approaches for Credit Scorecard Calibration: An
Empirical Analysis

Abstract

Financial institutions use credit scorecards for risk management. A scorecard is a data-driven
model for predicting default probabilities. Scorecard assessment concentrates on how well a
scorecard discriminates good and bad risk. Whether predicted and observed default
probabilities agree (i.e., calibration) is an equally important yet often overlooked dimension of
scorecard performance. Surprisingly, no attempt has been made to systematically explore
different calibration methods and their implications in credit scoring. The goal of the paper is
to integrate previous work on probability calibration, to re-introduce available calibration
techniques to the credit scoring community, and to empirically examine the extent to which
they improve scorecards. More specifically, using real-world credit scoring data, we first
develop scorecards using different classifiers, next apply calibration methods to the classifier
predictions, and then measure the degree to which they improve calibration. To evaluate
performance, we measure the accuracy of predictions in terms of the Brier Score before and
after calibration, and employ repeated measures analysis of variance to test for significant
differences between group means. Furthermore, we check calibration using reliability plots and
decompose the Brier Score to clarify the origin of performance differences across calibrators.
The observed results suggest that post-processing scorecard predictions using a calibrator is
beneficial. Calibrators improve scorecard calibration while the discriminatory ability remains
unaffected. Generalized additive models are particularly suitable for calibrating classifier
predictions.

Keywords: credit scoring, classification, calibration, probability of default

1 Introduction
Credit scoring helps to improve the efficiency of loan officers, reduce human bias in
lending decisions, quantify expected losses, and, more generally, manage financial risks
effectively and responsibly [13]. Today, almost all lenders rely upon scoring systems to assess
financial risks [44]. In retail lending, for example, credit scoring is widely used to decide on
applications for personal credit cards, consumer loans, and mortgages [26]. A lender employs
data from past transactions to predict the chance that an applicant will default. To decide on the
application, the lender then compares the predicted probability of default (PD) to a cut-off
value, granting credit if the prediction is below the cut-off and rejecting the application otherwise [35].
Many techniques for scorecard development have been proposed and studied. Examples
include artificial neural networks [2,51,55], support vector machines [16,36], multiple
classifier systems [27], hybrid models [4,18], or genetic programming [1]. In general, any
classification algorithm facilitates the construction of a scorecard and PD modelling in
particular [15]. Logistic regression is the most widely used approach in industry [44], although
other, more sophisticated classification algorithms have been shown to predict credit risks more
accurately [6,45]. A comprehensive review of 214 articles, books, and theses on application
credit scoring [3] further supports the view that more advanced techniques (e.g., genetic
algorithms) outperform conventional models (e.g., logistic regression). However, the authors
also report on studies that find similar performance in terms of predictive accuracy [3].
In addition to predictive accuracy, the suitability of a scorecard also depends on other
dimensions such as comprehensibility and compliance [34] or the selection of key variables for
classifiers by mitigating noisy data and redundant attributes [69]. This paper, however,
concentrates on one specific dimension of scorecard performance: calibration.
A well-calibrated scorecard is one which produces probabilistic forecasts that correspond
with observed probabilities [20]. For example, consider one hundred loans in a band of
predicted PD estimated to be ten percent by some scorecard. If the scorecard is well-calibrated,
the actual number of eventually defaulting loans in this band should be close to ten.
Scorecard calibration is important for many reasons. Regulatory frameworks such as the
Basel Accord require financial institutions to verify that their internal rating systems produce
calibrated risk predictions. Poor calibration, therefore, is penalized with higher regulatory
capital requirements [22]. Calibration is also relevant from a lending decision making point of
view [19]. At a micro-level, well-calibrated risk predictions are essential to evaluate credit
applications in economic terms (e.g., through calculating expected gains/losses), which is more
relevant to the business than an evaluation in terms of statistical accuracy measures only

[12,33]. At a macro-level, calibration is important for portfolio risk management and default
rate estimation [64]. In particular, to forecast the default rate of a credit portfolio, one may
adopt a classify-and-count strategy [10]. This approach derives the portfolio default rate
forecast from individual level (single loan) risk predictions and thus benefits from calibration
[65]. Furthermore, approaches to support managerial decisions have to account for the
cognitive abilities and limitations of decision makers [50]. Although far from perfect,
probabilities (rather than, say, log-odds) are a format to represent information that decision
makers understand and process relatively well [43]. Thus, a credit analyst is likely to distil
more information from a well-calibrated PD estimate. Last, scorecards are developed from
loans granted in the past and used to forecast the risk of lending to novel applicants [37]. Due
to changes in customer behavior, economic conditions, etc. default rates may differ across the
corresponding distributions. Calibration is a way to account for the differences in prior
probabilities [20].
The Institute of International Finance, Inc. and the International Swaps and Derivatives
Association have called for a higher recognition of calibration when choosing among
scorecards [40]. However, while we find multiple studies that concentrate on, e.g., balancing
accuracy and complexity [76], improving existing classifiers [69], or offering new multiple
classifier systems [6], studies devoted to calibration are rare. In [3] the authors conclude that
the receiver operating characteristic curve and the Gini coefficient are the most popular
performance evaluation criteria in credit scoring. That is why we argue that the relevance of
calibration is still not sufficiently reflected in the credit scoring literature. To further support
this point, we consider a recent review of more than forty empirical credit scoring studies
published between 2003 and 2014 [49]. Among the articles reviewed in [49], we find only one
study [46] that explicitly raises the issue of calibration and uses suitable evaluation metrics
such as Brier Score. More recent literature published after 2014 shows the same pattern. We
find only two studies that use Brier Score to measure classifier performance [5,6]. However,
both studies concentrate on developing novel classification systems, which are assessed in
terms of the Brier Score, amongst others. Neither [5] nor [6] consider techniques to improve
calibration, which supports the view that calibration methods have not been examined
sufficiently in credit scoring; or, in the words of Van Hoorde et al. [68]: calibration is often
overlooked in risk modeling.
There is ample evidence that especially advanced learning algorithms such as random
forest, which enjoy much popularity in credit scoring, produce predictions that are poorly
calibrated [47,54,60,74]. This suggests a trade-off between predictive accuracy and calibration.

Calibration assumes that the relationship between the raw score, which a classification model
produces, and the true PD is monotonic. Therefore, calibration consists of estimating a
monotonic function to map raw scores to (calibrated) PDs. Given that the calibration function
is monotonic, it maintains the ordering of the cases by raw score and consequently has no effect
on the discriminative power of classifiers [71]. Examples of calibration techniques include
isotonic regression or Platt scaling [54]. They promise to overcome the accuracy-calibration-
trade-off and seem to have potential for credit scoring. To the best of our knowledge, no attempt
has been made to systematically explore this potential in prior work in credit scoring.
The goal of this paper is to close this research gap. More specifically, we aim at examining
the degree to which alternative algorithms for scorecard development suffer from poor
calibration, evaluating techniques for improving calibration, and, thereby, contributing towards
increasing the fit of advanced classifiers for real-world banking requirements. In pursuing these
objectives, we make the following contributions. First, we establish the difference between
accuracy and calibration measures. This helps to understand the conceptual differences
between the two and to emphasize the need to address calibration in scorecard development.
Second, we introduce several methods to improve calibration, subsequently called calibrators,
to the credit scoring community and systematically assess their performance through empirical
experimentation. Third, we examine the interaction between classifiers and calibrators. This
allows us to identify synergies between the modelling approaches and to provide specific
recommendations which techniques work well together. Last, relying upon reliability analysis
and a decomposition of the Brier Score, we shed light on the determinants of calibrator
effectiveness and provide insight into why and when calibrators work well.
The remainder of the paper is organized as follows: Section 2 introduces relevant
methodology and the calibrators in particular. Section 3 describes the experimental design
before empirical results are presented in Section 4. Section 5 concludes the paper.

2 Calibration Methods
A classifier or a scorecard estimates a functional relationship between the probability
distribution of a binary class label - good or bad risk - and a set of explanatory variables, which
profile the applicant’s characteristics and behavior. For example, bad risks are commonly
defined as customers who miss three consecutive payments [66]. Calibration serves two
purposes. First, some classification algorithms are unable to produce probabilistic predictions.
For instance, support vector machines output a confidence score on the real interval (−∞; +∞),
whereby the sign of the prediction indicates the class assignment and the magnitude the
confidence of the classifier in this assignment. For example, a positive confidence score might
indicate that the classifier considers a credit applicant a bad risk, whereby a large (small)
confidence score indicates that the classifier is certain (uncertain) about this prediction [58].
Second, some classifiers provide predictions in the interval [0; 1], which can be interpreted as
probabilities, but suffer from biases and thus display poor calibration. Examples include the
random forest classifier, the predictions of which habitually exhibit a characteristic sigmoid-
shaped distortion [54]. Therefore, we define calibration as the process of converting the
confidence scores or the raw (uncalibrated) probabilistic predictions – hereafter referred to as
the credit risk output scores – to calibrated credit risk probabilities.
To demonstrate the technique behind calibration, Table 1 presents a theoretical example
of credit risk output scores of a classifier before and after the application of calibration. Table
1 also gives the actual class. Values of 1 and 0 indicate default and non-default events,
respectively. Recall that this example is only valid for a classifier that generates probabilistic
predictions, meaning that the output of the classifier must be in the interval [0; 1].

Table 1 Theoretical example of calibration technique


Applicant Actual Class Raw Output Score Calibrated Prediction
1 0 .35 .08
2 1 .68 .93
3 1 .70 .89
4 1 .81 .95
5 0 .20 .09
Represents a theoretical example that exemplifies the calibration procedure. The table includes five theoretical
applicants that belong to either the negative (i.e., good risk) or the positive (i.e., bad risk) actual class, the raw
output score that represents the probability of default generated by a theoretical classifier, and the calibrated
prediction that presents the calibrated default probability of that classifier.

Table 1 illustrates that calibration improves the quality of the probabilistic predictions of
the theoretical classifier in the sense that the calibrated predictions move closer to the true
target class. For example, the prediction for the first applicant gets closer to 0 (from .35 to .08)
and the one for the second applicant gets closer to 1 (from .68 to .93). To be more precise, the
Brier Score of the raw output scores is 0.0782 versus 0.0068 for the calibrated predictions.
Thus, calibration improves the Brier Score (i.e., a measure of the quality of probabilistic
predictions). We elaborate more on the Brier Score in later sections of the paper.
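As a quick check, the following R snippet (our own illustration, not part of the study's code) reproduces the two Brier Scores quoted above from the values in Table 1.

```r
# Brier Scores for the five theoretical applicants of Table 1
actual     <- c(0, 1, 1, 1, 0)           # actual class (1 = default)
raw_score  <- c(.35, .68, .70, .81, .20)
calibrated <- c(.08, .93, .89, .95, .09)

mean((raw_score  - actual)^2)   # 0.0782
mean((calibrated - actual)^2)   # 0.0068
```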
To introduce specific calibrators, we use the following notation. Define a training set
D_train = {(x_i, y_i), i = 1, …, N} consisting of N observations. Each observation (x_i, y_i) is a
combination of an input vector x_i representing the explanatory variables and a binary class
label y_i ∈ {0, 1} corresponding to whether a default event has been observed. The prior
probabilities of the classes, p(y = 1) and p(y = 0), are approximated by their empirical
frequencies, that is, the fractions of applicants belonging to the group of defaulters (y = 1)
and non-defaulters (y = 0) in D_train. During the training phase, a classification model f
is built on D_train. The model produces, for every applicant x_i in D_train, a credit risk output
score s_i = f(x_i).

For the test set D_test consisting of M observations, the classification model f
is also applied to every observation in D_test to obtain a credit risk output score s_j. The
purpose of calibration is to adjust the individual credit risk output scores of observations in
D_test to the true posterior credit risk probability. This is done by applying a calibrator that has
previously been developed on D_train to optimally adjust s to p(y = 1 | s).¹

In this study, we empirically benchmark six calibrators previously used in binary
classification settings [20,53,54,75], namely the rescaling algorithm (RS), and calibrators
based on logistic regression (LR), Platt scaling (PS), generalized additive models (GAM),
isotonic regression (IR), and Gaussian Naïve Bayes (GNB).

2.1 Rescaling algorithm


Saerens et al. [61] propose the RS calibrator based on Bayes' rule, assuming that the credit
risk output scores depend in a non-linear way on the prior probability distribution of the
class labels, p(y = 1) and p(y = 0). Therefore, if there is a change in the prior probability
distribution of the classes, there is an expected change in the posterior default probabilities of
the classification algorithm [61]. In line with [20,61], RS defines the calibrated default
probability p(y = 1 | s) as

    p(y = 1 \mid s) = \frac{p(1)\, s}{p(0)\,(1 - s) + p(1)\, s}    (1)

where s stands for the credit risk output score produced by the classification algorithm, and
p(1) and p(0) denote the prior probabilities of the class labels.
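As an illustration, the following R sketch implements the rescaling step as given in equation (1), with the priors estimated from the training labels; the function name and arguments are ours.

```r
# Rescaling (RS) calibrator: a minimal sketch of equation (1)
rs_calibrate <- function(score, y_train) {
  p1 <- mean(y_train == 1)                        # prior probability of default
  p0 <- 1 - p1                                    # prior probability of non-default
  (p1 * score) / (p0 * (1 - score) + p1 * score)  # calibrated default probability
}
```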

¹ Note that this is a minor simplification of the actual calibration process. Instead of using the training set for
both developing a classifier and a calibrator, it is preferable to develop the classifier and calibrator in a cross-
validation process to protect against possible overfitting of the calibrator [58].

2.2 Logistic regression
LR uses logistic regression, a well-known classification algorithm for predicting a
binary dependent variable, which is also well suited as a calibrator [48]. LR estimates the
calibrated probability of default as

    p(y = 1 \mid s) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 s)\right)}    (2)

with s the credit risk output score produced by the classification algorithm, β₁ the parameter
estimate of s, and β₀ the intercept. We estimate β₀ and β₁ from the credit risk output scores
on the training set D_train using the maximum likelihood procedure.
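A minimal R sketch of the LR calibrator, fitting β₀ and β₁ on the training scores with R's glm function and mapping new scores through equation (2); the wrapper function names are ours.

```r
# LR calibrator: logistic regression of the default indicator on the raw score
lr_fit <- function(score_train, y_train) {
  glm(y_train ~ score_train, family = binomial)
}
lr_calibrate <- function(fit, score_test) {
  predict(fit, newdata = data.frame(score_train = score_test), type = "response")
}
```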

2.3 Platt scaling


PS was originally introduced to calibrate the output scores of support vector machines
[58], but can be applied as a calibrator for other classification algorithms [54]. PS is closely
linked to LR, where the calibrated risk probabilities are defined as

    p(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)}    (3)

where the parameters A and B are obtained by minimizing the negative log-likelihood of the
data using the Levenberg-Marquardt algorithm [59]. During PS calibrator training, the labels
of the dependent variable are transformed into class probabilities to prevent the calibrator from
overfitting. Concretely, the labels y = 1 and y = 0 of D_train are replaced by (N₊ + 1)/(N₊ + 2)
and 1/(N₋ + 2), respectively, where N₊ and N₋ denote the number of positive and negative
training cases. For a detailed justification of these target values, we kindly refer to [58].
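For illustration, a sketch of PS in R: the 0/1 labels are replaced by the smoothed targets described above, and the parameters A and B are found by minimizing the negative log-likelihood; we use optim() as a generic stand-in for the Levenberg-Marquardt routine of the original method.

```r
# Platt scaling (PS): fit sigma(A*s + B) with smoothed targets (equation (3))
ps_fit <- function(score_train, y_train) {
  n_pos <- sum(y_train == 1)
  n_neg <- sum(y_train == 0)
  t <- ifelse(y_train == 1, (n_pos + 1) / (n_pos + 2), 1 / (n_neg + 2))
  nll <- function(par) {                      # negative log-likelihood in A = par[1], B = par[2]
    p <- 1 / (1 + exp(par[1] * score_train + par[2]))
    -sum(t * log(p) + (1 - t) * log(1 - p))
  }
  optim(c(-1, 0), nll, method = "BFGS")$par   # returns c(A, B)
}
ps_calibrate <- function(par, score_test) 1 / (1 + exp(par[1] * score_test + par[2]))
```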

2.4 Generalized additive models


An alternative to the standard LR-based calibrator is GAM [20], which is based on generalized
additive models. GAM relaxes the linearity constraint and applies a non-parametric, non-linear
fit to the data [38]. This means that the relationship between s and y in D_train determines the
functional form of the GAM calibrator. Methodologically, GAM replaces the linear
predictor in the logit equation (2) with an additive component. This corresponds to:

    p(y = 1 \mid s) = \frac{1}{1 + \exp\left(-f(s)\right)}    (4)

where f(s) is represented by penalized regression splines that estimate the non-parametric trend
in the dependency of y on s by maximizing the penalized likelihood using the penalized
iteratively reweighted least squares algorithm [45,70].
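A minimal sketch of the GAM calibrator using the mgcv package (the implementation named in Section 3); the wrapper functions are ours.

```r
# GAM calibrator: penalized regression spline of the default indicator on the raw score
library(mgcv)
gam_fit <- function(score_train, y_train) {
  gam(y_train ~ s(score_train), family = binomial)
}
gam_calibrate <- function(fit, score_test) {
  predict(fit, newdata = data.frame(score_train = score_test), type = "response")
}
```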

2.5 Isotonic regression


Similar to GAM, IR makes use of isotonic regression, i.e., a non-parametric form of
regression in which the dependency of y on s is chosen from the class of isotonic
(monotonically non-decreasing) functions. However, IR estimates a non-smooth, piecewise
constant function, while GAM estimates a smooth non-parametric curve. IR has been used in
previous calibration studies, e.g., [74]. Concretely, IR computes an isotonic calibrator function
m that maps s to m(s) (equal to p(y = 1 | s), the probability of default) such that if s_i ≥ s_j
then m(s_i) ≥ m(s_j). This study uses the commonly used pair-adjacent violators algorithm
to find the piecewise constant calibrator function [8].
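As an illustration, base R's isoreg() (a pair-adjacent violators implementation) can serve as a stand-in for the isotone package employed in the study; the resulting step function is the piecewise constant calibration map.

```r
# IR calibrator: monotonically non-decreasing step function fitted on the training scores
ir_fit <- function(score_train, y_train) {
  as.stepfun(isoreg(score_train, y_train))   # piecewise constant calibration map
}
ir_calibrate <- function(step_fn, score_test) step_fn(score_test)
```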

2.6 Gaussian Naïve Bayes


Bayesian classifiers are intuitive classification algorithms widely used in practice [42].
Gaussian Naïve Bayes (GNB) has been used and recognized in the literature as a reliable
calibration method [29,48]. The GNB calibrator learns the conditional probabilities
p(s | y = 1) of each applicant's score given the dependent variable y. Because the explanatory
variable s is continuous, the conditional probabilities p(s | y = 1) are assumed to follow
a normal (Gaussian) distribution. The training part of GNB estimates the class-conditional
mean μ₁ and standard deviation σ₁ of s. A new applicant is then scored by using Bayes'
rule to compute the probability of default p(y = 1 | s) given the credit risk output score s:

    p(y = 1 \mid s) = \frac{p(s \mid y = 1)\, p(y = 1)}{p(s)}    (5)

with the likelihood of s for defaulters assumed to be Gaussian:

    p(s \mid y = 1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{(s - \mu_1)^2}{2\sigma_1^2}\right)    (6)

where the mean μ₁ and the standard deviation σ₁ are estimated using the maximum
likelihood method.
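A manual R sketch of the GNB calibrator (the study uses the klaR package; this illustration fits the class-conditional Gaussians directly and applies Bayes' rule as in equations (5) and (6)); note that sd() uses the n − 1 denominator rather than the strict maximum likelihood estimate.

```r
# GNB calibrator: class-conditional Gaussians on the raw score combined via Bayes' rule
gnb_fit <- function(score_train, y_train) {
  list(mu1 = mean(score_train[y_train == 1]), sd1 = sd(score_train[y_train == 1]),
       mu0 = mean(score_train[y_train == 0]), sd0 = sd(score_train[y_train == 0]),
       p1  = mean(y_train == 1))                          # prior probability of default
}
gnb_calibrate <- function(m, score_test) {
  post1 <- dnorm(score_test, m$mu1, m$sd1) * m$p1         # p(s | y = 1) p(y = 1)
  post0 <- dnorm(score_test, m$mu0, m$sd0) * (1 - m$p1)   # p(s | y = 0) p(y = 0)
  post1 / (post1 + post0)                                 # p(y = 1 | s)
}
```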

3 Experimental setup
We examine the relative effectiveness of the calibrators for credit scoring through empirical
experimentation. Our experimental design draws inspiration from Lessmann et al. [49]. Their

study compares 41 classification algorithms across eight retail credit scoring data sets using
different indicators of classification performance. The data sets are well known in credit
scoring and have – at least partially – been used in several prior studies, e.g.,
[9,11,15,32,47,67,72]. We use the same data sets in this study.
The data sets belong to the field of application scoring, meaning that the task is to categorize
credit applications into good and bad risks. More specifically, the Australian Credit (AC) and
German Credit (GC) data sets come from the UCI Library² and the Thomas (TH) data set has
been provided by [67]. Three other data sets, Bene-1, Bene-2, and UK have been used in [10].
They originate from major financial institutions in the Benelux and the UK. Finally, PAK and
GMC come from the 2010 PAKDD data mining challenge³ and the “Give me some credit”
Kaggle competition⁴, respectively. This selection provides a wide range of real-world credit
scoring data sets of different sizes and origins.
Every data set includes a binary response variable to indicate the observed status of a
granted credit (good/bad) and a number of attributes concerning loan (e.g., loan amount or
interest rate), debtor (e.g., demographics, number of accounts, account balances, etc.),
collateral (presence, value, etc.), and possibly other characteristics of the credit application.
Table 2 summarizes the credit scoring data sets, including the number of cases, number of
attributes, and prior default rate.

Table 2 Summary of the credit scoring data sets


Name      Cases      No. of Attributes      Prior default rate
AC          690             14                    .445
GC        1,000             20                    .300
TH        1,225             17                    .264
Bene-1    3,123             27                    .667
Bene-2    7,190             28                    .300
UK       30,000             14                    .040
PAK      50,000             37                    .261
GMC     150,000             12                    .067
Presents a summary of the credit scoring data sets: the acronym of each data set, the number of observations, the
number of explanatory variables, and the corresponding prior default rate.

² A. Asuncion, D. J. Newman, UCI Machine Learning Repository, School of Information and Computer
Science (2010), University of California, Irvine, CA.
³ http://sede.neurotech.com.br/PAKDD2010/
⁴ http://www.kaggle.com/c/GiveMeSomeCredit

To prepare the data for subsequent analysis, standard pre-processing operations have been
applied. In detail, we consider imputation of missing values using a mean/mode replacement
for numeric/nominal attributes as well as the transformation of nominal variables using weight-
of-evidence coding. Another important concern relates to data partitioning. We follow
recommendations from the industry [9] and randomly partition every data set into a training
set (60%) and a hold-out test set (40%). These partitions are used for classifier/calibrator
development and subsequent evaluations, respectively. To use the training data efficiently, we
perform 5x2 cross-validation on the training set [23]. This avoids reserving further data for
validation purposes [17]. The sampling procedure is somewhat complicated since it takes care
of meta-parameter tuning (e.g., Table 3) and calibrator development simultaneously. Details
are available in the Appendix.
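As an illustration of the nominal-variable transformation mentioned above, here is a minimal R sketch of weight-of-evidence coding under the usual definition WoE = log(share of goods / share of bads) per category; the function is ours and ignores edge cases such as empty categories.

```r
# Weight-of-evidence (WoE) coding of a nominal attribute x against a 0/1 default label y
woe_encode <- function(x, y) {
  tab  <- table(x, y)                   # categories x {0 = good, 1 = bad}
  good <- tab[, "0"] / sum(tab[, "0"])  # distribution of goods over categories
  bad  <- tab[, "1"] / sum(tab[, "1"])  # distribution of bads over categories
  woe  <- log(good / bad)
  unname(woe[as.character(x)])          # replace each category by its WoE value
}
```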
There are many classification algorithms that have been applied in credit scoring. For this
study, we select classifiers based on prior benchmarking studies e.g., [9,49] because these
examine the classifiers in multi-faceted experiments and study their relative merits in terms of
different performance estimates. More specifically, to cover a range of different approaches,
we choose classifiers from different families. As an individual classifier, we use an artificial neural
network, MLP, because it is a powerful scoring method that has received much attention in
financial applications [7,25,63]. Ensemble classifiers, which combine many models from the
same classification algorithm [56], and multiple classifier systems [27], which use different
algorithms, have also received much attention in the literature and have shown their suitability
for credit scoring, e.g., [27,42,69]. We select random forest (RF) and bagged hill-climbing
ensemble selection (BAGES) as representatives for ensemble classifiers and multiple classifier
systems. Both techniques have been shown to outperform several alternative approaches in
their respective family [49]. In addition, we consider logistic regression (LRE), which can be
seen as industry standard in credit scoring [24,41,44,52,73]. Including LRE is also useful to
verify the common view that LRE produces well-calibrated predictions, e.g., [48] and to test
whether an additional calibration step can improve LRE predictions, respectively. Last, we
consider an ensemble classifier that is given by averaging over the uncalibrated risk predictions
of all individual and homogenous classification models developed in [49]. The motivation
behind considering this approach, which we call AVG, is that it embodies a large set of different
classification algorithms (see [49]; Table 2) and thus gives an overall indication of the extent
to which alternative classifiers produce calibrated predictions on average.
In view of the fact that the paper relies on [49] it is important to clarify the differences
between the two studies. First, we clarify how this study relies on [49], namely in that we use
the same data sets and select classifiers that perform well in [49]. The data sets – at least some

of them – have been used in many prior studies (e.g., [9,11,15,32,47, 67,72]) and can thus be
considered a domain standard. With respect to the selection of classifiers it is important to note
that i) classification algorithms are not central to this paper, which concentrates on calibration
algorithms, and ii) our selection is also supported by several other studies (e.g.,
[7,25,27,42,56,63,69]), which use the same techniques that we employ here and/or find them
to perform well. Second, to explain differences between this study and [49], we note that [49]
focuses on classification algorithms but ignores calibration. We, on the other hand, concentrate
on calibration, which is a different step in a scorecard development process (i.e., calibration
follows the development of a scorecard using a classification algorithm). In particular, we
examine the ability of different calibrators to improve scorecards. As detailed in the
introduction, calibration represents an important dimension of scorecard performance, which
most previous credit scoring studies including [49] do not account for. Consequently, this study
focuses on fundamentally different methodology than [49] (calibrators as opposed to
classifiers). Accordingly, the empirical findings provided below and their implications are
orthogonal to [49] and contribute original insights to the body of knowledge in credit scoring
concerning, for example, the relative merits of alternative calibration methods and their
interaction with classification algorithms. To the best of our knowledge, these points have not
been considered in any previous study in credit scoring.
To provide more details on the involved classification algorithms, Table 3 summarizes the
number of models, meta-parameters, and candidate settings per classification algorithm
considered in the study.

Table 3 Classifiers and their meta-parameters


Classifier   No. of models   Meta-parameter                       Candidate settings
MLP          171             No. of hidden nodes                  2, 3, …, 20
                             Regularization penalty               10^(−4, −3, …, 0)
RF           30              No. of CART trees                    100, 250, 500, 750, 1000
                             No. of randomly sampled variables    default × 0.1, 0.25, 0.5, 1, 2, 4
BAGES        16              No. of bagging iterations            5, 25
                             No. of base models per iteration     5%, 20% of library size
LRE          1               -                                    -
AVG          1               -                                    -
Presents a summary of classifier tuning. The table lists the acronyms of the classification algorithms employed
here, the number of developed models per classification algorithm, the corresponding meta-parameters, and the
candidate settings of these meta-parameters.

We train both classifiers and calibration models on the training set and then apply them to the
out-of-sample test set. Classifier and calibrator training is done using R (version R-3.2.2),
whereby we use the glm, isotone, klaR, and mgcv packages to implement LR, IR, GNB, and
GAM, respectively. Other calibrators have been implemented in custom functions.
We pair classifiers and calibrators in a full-factorial setup. More specifically, we first let
the classifiers produce forecasts for cases in the test set. This produces a set of raw credit risk
output scores ( ). Next, we post-process the raw output scores using the six calibrators. This
way, we obtain seven versions of test set predictions for each classifier and data set.
To examine the appropriateness of different calibrators and their interaction with
classification algorithms, we consider three performance measures: the area under the receiver
operating characteristic curve (AUC), the Brier score (BS), and the logarithmic loss (LL). AUC
is widely used in credit scoring and measures a classifier’s ability to discriminate between good
and bad risks. However, concentrating on the relative ranking of risk predictions (i.e., whether
bad risks receive a higher output score than good risk), AUC does not capture whether a
classifier produces well-calibrated predictions. BS and LL, on the other hand, compare
estimated default probabilities to a zero-one coded response variable and aggregate the
deviations per case across a data sample. Therefore, BS and LL both assess calibration and a
classifier’s ability to predict class probabilities with high accuracy. BS is defined as the mean-
squared error of a class probability prediction and a zero-one coded binary target variable.

    BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2    (7)

where p_i denotes the estimated default probability of case i and y_i ∈ {0, 1} the actual class
label. BS is low for a well-calibrated classifier, which predicts class membership probabilities
close to one (zero) for defaults (non-defaults). It ranges in the interval [0; 1], with the endpoints
corresponding to a perfectly accurate and a completely inaccurate forecast, respectively [39,62].
LL follows the same concept and is defined as:

    LL = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]    (8)

where p_i and y_i have the same meaning as in (7).
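For reference, both measures can be computed in a few lines of R; the clipping constant in log_loss is our own guard against log(0) and not part of the definition.

```r
# Brier Score (7) and logarithmic loss (8) for PD estimates p and 0/1 labels y
brier_score <- function(p, y) mean((p - y)^2)
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # guard against log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
```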

4 Empirical results
The experimental results consist of the performance estimates for every combination of
the factors: classifier (5 levels), calibrator (7 levels), credit scoring data set (8 levels), and
performance measure (3 levels). The performance measures capture the degree to which
classifiers discriminate good and bad credit risk with high accuracy and are well calibrated,
respectively.

4.1 Differences between discriminative and calibration ability
The first experiment exemplifies the difference between discrimination and calibration in
credit scoring. To that end, we create a ranking of classifiers on the basis of their raw credit
risk output scores s. More specifically, we create three rankings per data set using AUC, LL
and BS as ranking criteria. For example, the classifier achieving the highest AUC on a data set
receives the first rank, the second best classifier rank two, and so on. We then calculate the
arithmetic mean of classifier ranks across data sets to obtain an average rank for each classifier.
Ranking classifiers in this way follows the recommendations of García and Herrera [31]. Last,
we compute the correlation between classifier ranks in terms of AUC, LL and BS using
Kendall's τ. Table 4 reports the resulting correlations. The values in brackets denote p-values
corresponding to tests of the significance of the correlation coefficients.

Table 4: Correlation across data sets of classifier ranks as per performance measure

         AUC            LL             BS
LL       -.369 (.001)   1
BS       -.357 (.001)   .964 (.000)    1
Presents the correlation estimates obtained using Kendall's tau between classifier ranks in terms of AUC, LL, and
BS across all involved classification algorithms and credit scoring data sets.
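One plausible way to reproduce this analysis in R, assuming matrices auc, bs, and ll of dimension (data sets × classifiers) holding the raw-score performance estimates; these object names are hypothetical and the exact pooling of ranks is our reading of the procedure.

```r
# Rank classifiers within each data set (rank 1 = best) and correlate the rankings
rank_auc <- apply(-auc, 1, rank)   # higher AUC is better, hence the sign flip
rank_bs  <- apply( bs,  1, rank)   # lower BS is better
rank_ll  <- apply( ll,  1, rank)   # lower LL is better

cor.test(as.vector(rank_auc), as.vector(rank_bs), method = "kendall")  # tau and p-value
cor.test(as.vector(rank_bs),  as.vector(rank_ll), method = "kendall")
```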

Table 4 shows that classifier ranks in terms of BS and LL are highly correlated. The reason
that both measures are not perfectly correlated lies in the different weighting of (large)
deviations between probabilistic forecasts and actual default events. The BS builds on squared-
loss, which is upper-bounded at one. Being theoretically unbounded from above, LL assigns
larger weight to large errors. However, a classifier that performs well in one metric also
achieves a good rank in the other metric. In comparison, the correlation between AUC and BS
or AUC and LL is much lower. This shows that AUC emphasizes a different notion of classifier
performance than BS and LL. AUC is only sensitive to the ranking of cases and is invariant
under any monotone transformation of the scores. BS and LL, on the other hand, are sensitive
to the actual score values in addition to the ranking of cases. Note that the negative sign
originates from the fact that higher (lower) AUC values indicate better (worse) performance,
whereas the opposite is true for both BS and LL. Also note that the correlations are significantly
different from zero at the one percent significance level.
To further investigate the relationship behind different performance metrics, Figure 1
shows a scatterplot of AUC (on the x-axis) versus BS and LL (on the y-axis).

Figure 1 Scatterplot: AUC vs BS

Beyond further supporting the findings from Table 4 (i.e., the high correlation between BS
and LL), Figure 1 reveals two important findings. First, we observe that there are some
classifiers that deliver high discrimination power (i.e., high AUC) and, at the same time, well-
calibrated risk predictions (i.e., low BS). However, Figure 1 also shows a sizeable number of
cases where high AUC values are associated with relatively high BS values and thus poor
calibration. This supports the view that classifiers with high discriminatory power do not
necessarily provide well-calibrated risk predictions. This suggests that focusing exclusively on
measures of discriminative performance might be misleading in that it disregards the quality
of the estimated default probabilities (i.e., calibration), which is also a relevant criterion [14].
For example, when an analyst has to choose between scorecards where one has a small
advantage in AUC while the other is much better calibrated, it is at least debatable whether
preference should be given to the one with slightly higher AUC.
Arguably, practitioners are well aware of the importance of calibration and are required to
address risk model calibration (or a lack thereof) by regulation. However, the dominance of
AUC and other measures of the discriminative ability in the credit scoring literature e.g., [49]
suggests that corresponding studies give misleading advice in regard to which classifiers are
suitable for developing credit risk models. In this sense, we strongly recommend paying more
attention to calibration measures in empirical classifier comparisons, which are commonly
performed in the credit scoring literature.

4.2 Calibrator-based post-processing of classifier predictions
We now examine the potential of the calibration models to deliver better calibrated PDs.
Given the high correlation between BS and LL (= .964 in Table 4), we rely upon BS in the
remainder of the paper to assess calibration; where lower (higher) values indicate relatively
better (poorer) calibration. The motivation to use BS rather than LL comes from the fact that
BS is decomposable into different parts, which can be analysed individually (see below).
To examine the relative merit of alternative calibration algorithms and the degree to which
the choice of a suitable calibrator depends on the classifier (i.e., test for a calibrator-classifier-
interaction), we perform a repeated measures analysis of variance (ANOVA).
In our experiment, the repeated measures are individual loan applications from the credit
scoring data sets. The independent variables (i.e., treatments) are calibrators and classifiers,
with seven and five levels, respectively. The seven levels for the factor calibrator follow from
the fact that our setup includes, in addition to the six calibration algorithms of Section 2, the
zero-normalized uncalibrated classifier output scores. The dependent variable is the squared
difference between the actual value of the zero-one response variable (good/bad) and the
corresponding prediction. This measure is equivalent to calculating BS for an individual
observation (i.e., credit application). To account for the fact that the cases originate from
different credit scoring data sets, we include the identity of the data set as a blocking factor in
our setup e.g., [57]. Table 5 presents the summary of ANOVA results.

Table 5 Summary of ANOVA


Source Type III SS df MSE F Sig.
Calibrator * Classifier 36.6 2.5 14.5 1,938.2 .000
Error (Calibrator * Classifier) 2,298.9 306,587 .007
Calibrator 32.7 1.2 26.7 847.7 .000
Error (Calibrator) 4,703.1 149,334 .031
Classifier 10.4 1.9 5.2 239.8 .000
Error (Classifier) 5302.2 241,634 .022
Presents the summary of the ANOVA with the independent variables calibrator and classifier with seven and five
levels, respectively. The dependent variable is the squared difference between the actual value of the zero-one
coded response variable and the corresponding prediction.

The results of the ANOVA reveal a significant interaction between the factors classifier
and calibrator (F(2.5, 306,587) = 1,938.2; MSE = 14.5; p < .001). Note that the degrees of
freedom follow from a Greenhouse-Geisser correction, which we apply because Mauchly's
Test indicates a violation of the sphericity assumption (p < .001). In accordance with the
significant factor interaction, we can also observe that the main effects of the factor classifier
(F(2.0, 241,634) = 239.8; MSE = 5.2; p < .001) and calibrator (F(1.2, 149,334) = 847.7;
MSE = 26.7; p < .001) are significant. Prior work in credit scoring has established the
effect of the classification algorithm on predictive accuracy [28,30,49]. The significant main
effect of the factor classifier complements such findings in that it shows that the choice of the
classifier also determines the degree to which risk predictions are well-calibrated. In a similar
manner, the ANOVA confirms that the decision to use a calibrator for post-processing risk
predictions and the choice of a specific calibration algorithm also affect the calibration quality
of PD predictions. Prior to analysing the factor interaction in detail (see Section 4.3), we seek
to gain some more insight into the relative performance differences among calibrators. To that
end, we estimate the pairwise differences between alternative calibrators in terms of BS. To
illustrate the computation of the corresponding results, let denote the squared difference
between the zero-one coded actual outcome of credit application (good/bad) and a classifier’s
risk prediction for application after post-processing with calibrator . Then, to estimate the
pairwise difference between two calibrators, and , we compute

1
(9)
∗ ∗∑

where and index the 5 classifiers and 8 data sets, respectively, and denotes
the number of credit applications in data set .
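A sketch of this computation in R, assuming a data frame err with one row per (application, classifier, data set) and one squared-error column per calibrator; this layout and the function name are hypothetical, not the study's own objects.

```r
# Pairwise calibrator comparison as in equation (9): average the per-application
# difference in squared error within each classifier/data-set cell, then across cells
pairwise_delta <- function(err, a, b) {
  per_cell <- aggregate(err[[a]] - err[[b]],
                        by = list(classifier = err$classifier, dataset = err$dataset),
                        FUN = mean)
  mean(per_cell$x)   # average over the 5 x 8 classifier/data-set cells
}
```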
Table 6 depicts the resulting values Δ_{a,b} for all pairs of calibrators. Given that the e_{i,k,d}^{c}
represent prediction errors, lower values indicate better performance (i.e., higher accuracy
and better calibration). Hence, a positive value in Table 6 indicates that the row calibrator
outperforms the column calibrator on average (across data sets and classifiers). For example,
the value of -.021 in the lower right cell of Table 6, showing the difference between RS and
IR, reveals that predictions post-processed using IR have, on average, lower error and are better
calibrated compared to calibrated predictions using RS. More specifically, the value of -.021
represents the average difference in squared error of RS and IR at the level of an individual
credit application. When processing a large number of applications, for example in the credit
card business, the small pairwise differences of Table 6 may translate into sizeable advantages
of one calibrator over another. Note that we use italic face to indicate that a pairwise difference
is not significant at the five percent significance level, using Sidak’s correction to control the
family-wise error.

Table 6: Pairwise difference of calibrators’ performance in terms of BS

RAW GAM GNB PS LR IR
GAM .024
GNB .023 -.001
PS .023 .000 .000
LR .023 .000 .000 .000
IR .002 -.004 -.003 -.003 -.003
RS -.001 -.025 -.024 -.024 -.024 -.021
Presents the difference in the performance between the calibrators when measured in BS. The numbers represent
the average difference in squared error between the calibrators at the level of an individual credit application.
Consequently, if the number is negative, then the calibrator that is presented in a column performs better than that
presented in the row and vice versa, if the number is positive.

Table 6 provides evidence on the suitability of the calibration algorithms. The leftmost column


corresponds to the zero-normalized, un-calibrated classifier output scores (denoted as RAW in
Table 6). Compared to this benchmark, all calibrators but RS improve the calibration of risk
predictions. In particular, the positive entries in the leftmost column of Table 6 indicate that
after post-processing raw risk predictions using some calibrator, the BS is less than that of the
un-calibrated classifier output scores (RS being an exception). Moreover, the observed
differences are significant.
Table 6 indicates poor performance of the RS calibrator. In a previous study, Coussement
and Buckinx [19] found GAM to outperform RS. We further advance this result by showing
that several other calibrators also outperform RS, including the naïve approach of scaling
predictions to the unit interval (i.e., RAW in Table 6). Considering Table 6, all entries in the
row corresponding to RS are negative and statistically significant at the five percent level. This
indicates that the RS approach to retransform the raw credit risk scores is unduly simplistic.
Other algorithms embody a more advanced logic toward calibration, which, considering the
data sets employed here, gives better results. Consequently, we caution against using RS to
calibrate risk predictions.
Considering the remaining calibrators, we find that IR achieves significantly poorer
calibration than LR, PS, GNB, and GAM and is thus dominated by these methods. The pairwise
differences of LR, PS, GNB, and GAM, on the other hand, are close to zero. Furthermore,
several comparisons do not provide sufficient empirical evidence to reject the null hypothesis
that observed differences between two calibrators are random variation, for example in the case
of LR vs. GAM, LR vs. GNB, PS vs. GAM, and PS vs. GNB. The estimated marginal means
also suggest that LR and PS achieve almost the same degree of calibration as GAM (average
BS of .135 for LR and PS compared to .134 for GAM). However, in view of the small
differences between these methods, it seems inappropriate to draw conclusions related to their
relative merits on the basis of Table 6. Rather, we proceed with examining possible interactions

between calibrators and classifiers.

4.3 Interaction between classifiers and calibrators


To identify synergy effects between the modelling approaches and to provide specific
recommendations regarding which techniques work well together, we examine the interactions
between classifiers and calibrators. Figure 2 summarizes corresponding results. To obtain it,
we first average the values of BS of every classifier without application of calibration (i.e.,
scaled raw credit risk predictions) and, subsequently, after application of all calibrators across
all data sets. In this way, we get a set of six calibrated and one un-calibrated curve of BS for
every classifier. Recall that lower BS values indicate better calibrated risk predictions.

Figure 2 Interaction between classifiers and calibrators

Figure 2 gives rise to the following conclusions. First, we observe the beneficial effect of
calibration. With the exception of RS, all calibrators improve the quality of risk predictions in
terms of BS. The magnitude of this effect, however, differs across classifiers. For example, BS
of BAGES is improved from .30 to less than .15 by GAM, LR, GNB, and PS. We observe
similar result for AVG. Second, we find moderate evidence for an interaction between
calibrators and classifiers. Although calibrators improve the performance of classifiers, they
seem to produce similar impact on all classifiers. Put differently, all classifiers achieve similar
BS after application of a calibrator. Third, we can further confirm that RS scarcely has an
impact on the performance of any classifier in terms of calibration. The interaction plot clearly

supports the findings of Table 6 and we, therefore, caution against using RS. Figure 2 also
confirms that IR is less successful compared to GAM, LR, GNB, and PS. When considering
the calibration of credit risk estimates, we argue that practitioners need to choose the best
calibrator. Thus, we cannot recommend IR. To further explore the interactions between
classifiers and calibrators, we proceed with examining calibration plots.

4.4 Interaction between classifiers and calibrators demonstrated on calibration plots


A calibration plot or reliability diagram is a graph of predicted credit risk probabilities
p(y = 1 | s) (on the x-axis) against observed event frequencies (on the y-axis), e.g., [74]. For
every classifier, we create a grid of plots showing the un-calibrated predictions (leftmost chart)
and predictions after post-processing with each of the six calibrators considered in the study.
Each plot includes the results from the eight credit scoring data sets. That is, we once again
pool observed events and corresponding predictions across all data sets. Note that we partition
the range of predictions into ten bins. A well-calibrated classifier gathers positive examples in
the upper right corner of the calibration plot and negative in the lower left corner. A perfectly
calibrated classifier is represented by a diagonal line. In this sense, the purpose of the calibrator
(when shown on the calibration plot) is to handle the sparsity of the un-calibrated predictions
of the classifiers and allocate them on this diagonal line. Thus, any upper or lower deviation
from the diagonal will indicate over/under-estimation of the probabilities in a given bin.
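For illustration, the points of such a reliability diagram can be computed in R as follows (a sketch with ten equal-width bins; the function name is ours).

```r
# Reliability diagram data: mean predicted PD vs. observed default rate per bin
reliability_points <- function(p, y, n_bins = 10) {
  bin <- cut(p, breaks = seq(0, 1, length.out = n_bins + 1), include.lowest = TRUE)
  data.frame(mean_predicted = as.numeric(tapply(p, bin, mean)),
             observed_rate  = as.numeric(tapply(y, bin, mean)))
}
# plot(reliability_points(p, y)); abline(0, 1)   # the diagonal marks perfect calibration
```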
In the interest of brevity, the main body of the paper includes results for one classifier, RF,
whereas results for other classifiers are available in Appendix 1. We choose RF because results
of Lessmann et al. [49] suggest that RF is a particularly well-suited method to predict default
risks in credit scoring.
Figure 3 supports the results of previous analyses and also provides some novel insights.
First, we observe the raw predictions of RF (left-most panel) to be biased. Across all bins,
estimated probabilities are below the diagonal, meaning that RF consistently overestimates
actual default probabilities. Second, all calibrators but RS correct this bias; at least to some
extent. Third, the calibrators face difficulties in de-biasing RF predictions in the last (right-
most) bin. This phenomenon can be explained with the sparsity of the data in higher score
ranges. In other words, RF forecasts default probabilities of 80 percent or above in only a small
number of cases, which makes it difficult for the calibrators to infer the relationship between
raw classifier output scores and true PDs. However, Figure 3 reveals that GAM handles
sparsity in higher score ranges much better than any other calibrator. In fact, after calibration
with GAM, RF predictions display almost perfect calibration across all bins. In this sense,

Figure 3 augments the ANOVA results of Section 4.2, where empirical evidence was
insufficient to reject the null-hypothesis that GAM and PS, or GAM and LR perform alike.
In contrast, Figure 3 provides strong evidence that GAM outperforms these two as well as the
other calibrators. The results of the other classifiers (available
in the Appendix), support this view. Independent of the classifier, GAM performs as good as
other calibrators among lower ranges of estimated probabilities and much better for larger
estimated probabilities. For example, GAM calibrates the probability estimates from the 7th
bin onward more successfully than all other calibrators for MLP and LRE. In the cases of
BAGES and AVG, the superiority of GAM is even more pronounced than in Figure 3.
Last, we note that Figure 3 as well as the calibration plots in the Appendix also indicate
that GNB achieves slightly better calibration than LR and PS. Again, differences occur only
for larger estimated probabilities. Kuhn and Johnson [48] argue that GNB itself does not tend
to produce well-calibrated class probabilities, but can be surprisingly effective in calibrating
probability estimates of other classifiers. Our analysis further supports this view.

Figure 3 Calibration plot for RF: raw and calibrated predictions across all data sets
4.5 Decomposition of BS
Decomposition of BS can further support the quantification of probability forecast quality.
In particular, the BS can be decomposed into the following parts [62]: i) uncertainty, a measure
of the degree to which the outcome is predictable; ii) resolution, a measure of the extent to
which conditional probabilities differ from the overall average (calculated as mean squared
difference; the higher the term, the better the forecast); and iii) calibration, which measures
how close the forecast probabilities are to the true probabilities. The lower the calibration term,
the better the forecast.
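For reference, a sketch of this (Murphy) decomposition in R with ten equal-width probability bins; the binning choice and function name are ours, not the study's own code.

```r
# Decomposition of BS into uncertainty, resolution, and calibration (reliability)
brier_decomposition <- function(p, y, n_bins = 10) {
  bin  <- cut(p, breaks = seq(0, 1, length.out = n_bins + 1), include.lowest = TRUE)
  n_k  <- tapply(y, bin, length)   # number of cases per bin
  o_k  <- tapply(y, bin, mean)     # observed default rate per bin
  p_k  <- tapply(p, bin, mean)     # mean forecast per bin
  keep <- !is.na(n_k)              # drop empty bins
  n_k <- n_k[keep]; o_k <- o_k[keep]; p_k <- p_k[keep]
  o_bar <- mean(y); N <- length(y)
  c(uncertainty = o_bar * (1 - o_bar),
    resolution  = sum(n_k * (o_k - o_bar)^2) / N,
    calibration = sum(n_k * (p_k - o_k)^2) / N)
}
```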
Table 7 reports the reliability and the resolution terms of BS across the classifiers and
calibrators averaged over the data sets. Note that the uncertainty term is not presented in Table
7, as it does not differ across calibrators and is always equal to .1731. Therefore, uncertainty is
not relevant for subsequent analysis. Table 7 also presents the summation of both calibration
and resolution of every calibrator across classifiers.

Table 7 Decomposition of BS
Calibrator / Classifier   Calibration   Resolution      Calibrator / Classifier   Calibration   Resolution
LR .0121 .2006 GAM .0100 .2011
MLP .0021 .0379 MLP .0021 .0377
RF .0024 .0405 RF .0025 .0410
BAGES .0024 .0420 BAGES .0019 .0426
LRE .0029 .0395 LRE .0019 .0393
AVG .0022 .0406 AVG .0016 .0406
PS .0117 .2000 RS .1293 .1983
MLP .0022 .0379 MLP .0092 .0392
RF .0024 .0405 RF .0108 .0405
BAGES .0022 .0418 BAGES .0675 .0390
LRE .0028 .0393 LRE .0057 .0394
AVG .0021 .0405 AVG .0361 .0402
IR .0204 .1963 GNB .0117 .1992
MLP .0039 .0375 MLP .0018 .0379
RF .0033 .0399 RF .0027 .0405
BAGES .0038 .0396 BAGES .0020 .0418
LRE .0048 .0382 LRE .0026 .0382
AVG .0046 .0410 AVG .0026 .0409
Presents the summary of calibrator performance based on the decomposition of BS into its calibration and
resolution components. Both are reported per classification algorithm, averaged across all data sets, with the
calibrator-level totals given in the header rows.

Analysis of the performance estimates presented in Table 7 provides the following insights.
First, Table 7 shows that PS outperforms LR in terms of calibration. The total result (as per all
classifiers) of PS versus LR for calibration is .0117 c.f. .0121, meaning that PS calibrates
classifier predictions better than LR. Table 7 also indicates that PS has the same overall
calibration power as GNB (.0117) but outperforms GNB in terms of resolution (.2000 c.f.

.1992). An interesting point to note about the PS versus GNB comparison is that PS calibration
is worse than that of GNB for LRE (.0028 c.f. .0026) and MLP (.0022 c.f. .0018). Both
classifiers use a logistic link function to produce probabilistic risk predictions. It is plausible
that the calibration is less important for risk predictions produced in this manner. Classifiers
such as RF and AVG, which do not incorporate a logistic link function, show opposite results.
More specifically, PS shows a tendency to perform better with classifiers that need auxiliary
calibration the most. We take this as evidence that PS might be preferable to GNB, although
the overall calibration performance of the two is similar. Finally, and most importantly, we
reemphasize that GAM is superior to all other calibrators in terms of both calibration and
resolution. GAM produces the lowest calibration and the highest resolution in grand total.
GAM consistently achieves the best results when interacting with BAGES, LRE, and AVG,
and competitive results when applied to MLP and RF. In view of these results, we suggest that,
out of the techniques considered here, GAM is the most suitable approach to calibrate risk
predictions of classification models.

4.6 BS per individual data set


Finally, we present the performance of the calibrators in terms of BS for every individual
data set. Table 8 reports the performance of the classifiers before (“raw”) and after the
application of the calibration techniques per data set. Table 8 reveals some new findings and
supports previous ones at the same time. First, we observe that the raw predictions of the
classifiers are never better (lower) than those after calibration. This supports the previous
finding that calibration contributes to a better fit of the classifiers in terms of BS. A novel
insight is that RS seems to perform better than other calibrators when dealing with a classifier
that delivers well-calibrated predictions by nature: it achieves the lowest BS values for
MLP and LRE on the AC data set and the lowest BS for MLP on the Bene-2 data set. This
finding, however, is mainly relevant for smaller data sets. GNB also delivers appealing results,
which can be seen on the larger data sets, e.g., PAK and UK; GNB performs well with
classifiers that deliver well-calibrated as well as poorly calibrated predictions. While LR
performs well on smaller data sets (i.e., AC and GC), IR successfully calibrates MLP and AVG
on TH. However, GAM delivers the most promising results, for the following reasons.
First, GAM obtains the largest number of lowest BS values, which indicates that GAM
operates well independently of the size of the data set. Second, GAM shows the best results
when dealing with poorly calibrated classifiers. For example, GAM delivers the best results
when calibrating BAGES across all data sets (PAK being the exception). GAM calibrates AVG
most successfully on bigger data sets and deals quite well with RF on almost every data set.
Finally, GAM shows competitive results when calibrating LRE and MLP. For example, GAM
delivers the best BS estimates for LRE on TH and Bene-1 and for MLP on PAK and GMC.

Table 8 BS across the data sets
Data set   Classifier   Raw   LR   PS   IR   GAM   RS   GNB      Data set   Classifier   Raw   LR   PS   IR   GAM   RS   GNB

MLP .1119 .1135 .1132 .1203 .1135 .1114 .1157 MLP .1636 .1660 .1660 .1654 .1642 .1636 .1645
RF .1151 .1119 .1124 .1138 .1119 .1157 .1143 RF .1654 .1654 .1654 .1640 .1647 .1661 .1645
AC BAGES .1897 .1066 .1070 .1185 .1054 .1871 .1071 Bene - 2 BAGES .2851 .1610 .1609 .1614 .1594 .2905 .1602
LRE .1023 .1039 .1049 .1107 .1039 .1013 .1110 LRE .1664 .1674 .1674 .1664 .1666 .1664 .1670
AVG .1315 .1064 .1066 .1108 .1069 .1331 .1082 AVG .1924 .1638 .1638 .1632 .1625 .1947 .1622

MLP .1756 .1672 .1678 .1813 .1722 .1747 .1679 MLP .0464 .0410 .0410 .0408 .0409 .0465 .0408
RF .1701 .1594 .1595 .1668 .1588 .1673 .1590 RF .0553 .0411 .0410 .0410 .0409 .0606 .0409
GC BAGES .1847 .1592 .1595 .1740 .1592 .1803 .1596 UK BAGES .0959 .0414 .0414 .0418 .0408 .0970 .0410
LRE .1682 .1690 .1687 .1860 .1674 .1693 .1674 LRE .0420 .0414 .0414 .0415 .0409 .0420 .0412
AVG .1738 .1573 .1574 .1689 .1586 .1712 .1593 AVG .1528 .0410 .0410 .0438 .0409 .1547 .0411

MLP .2158 .1942 .1943 .1884 .1942 .2231 .1917 MLP .1913 .1859 .1859 .1859 .1858 .1917 .1860
RF .1950 .1904 .1900 .1927 .1903 .1936 .1924 RF .2182 .1861 .1861 .1861 .1861 .2204 .1862
TH BAGES .2798 .1911 .1911 .1938 .1892 .2892 .1928 PAK BAGES .2341 .1847 .1847 .1850 .1847 .2357 .1846
LRE .1962 .1911 .1916 .1936 .1904 .1980 .1977 LRE .1977 .1888 .1888 .1885 .1887 .1983 .1884
AVG .2077 .1934 .1936 .1906 .1928 .2131 .1974 AVG .2115 .1862 .1862 .1872 .1861 .2124 .1861

MLP .1813 .1788 .1787 .1828 .1788 .1815 .1787 MLP .0527 .0523 .0523 .0510 .0510 .0527 .0511
RF .1745 .1750 .1747 .1777 .1741 .1743 .1752 RF .0502 .0511 .0511 .0501 .0501 .0502 .0501
Bene - 1 BAGES .2525 .1724 .1721 .1739 .1712 .2530 .1715 GMC BAGES .0803 .0512 .0512 .0500 .0499 .0804 .0501
LRE .1817 .1725 .1721 .1741 .1710 .1818 .1712 LRE .0576 .0582 .0582 .0567 .0568 .0576 .0566
AVG .2088 .1782 .1780 .1786 .1755 .2089 .1742 AVG .0644 .0515 .0515 .0505 .0502 .0644 .0504
Presents the summary on the performance of calibrators summarized across the eight credit scoring data sets.In every data set the performance of calibrators is presented on every involved classification
algorithm. Bold numbers indicate the best calibration of the classification algorithms within the corresponding data set.
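
For illustration only, the following Python sketch mimics the general recipe underlying the figures in Table 8: train a classifier, fit a flexible calibrator on held-out scores, and compare the BS before and after calibration. A spline-basis logistic regression from scikit-learn stands in for the GAM calibrator (the study itself uses a proper GAM, cf. [38, 70]); the synthetic data, split sizes, and hyper-parameters are illustrative assumptions, and the sketch does not reproduce the reported numbers.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.metrics import brier_score_loss, roc_auc_score

# synthetic stand-in for a credit scoring data set (imbalanced binary outcome)
X, y = make_classification(n_samples=10000, n_features=20, weights=[0.8], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
X_fit, X_cal, y_fit, y_cal = train_test_split(X_train, y_train, test_size=0.5, random_state=1)

clf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_fit, y_fit)
s_cal = clf.predict_proba(X_cal)[:, 1]     # held-out scores used to fit the calibrator
s_test = clf.predict_proba(X_test)[:, 1]   # raw test scores to be calibrated

# GAM-like calibrator: smooth spline mapping from raw score to default probability
calibrator = make_pipeline(SplineTransformer(n_knots=8, degree=3),
                           LogisticRegression(C=1e6, max_iter=1000))
calibrator.fit(s_cal.reshape(-1, 1), y_cal)
p_test = calibrator.predict_proba(s_test.reshape(-1, 1))[:, 1]

print("AUC raw vs. calibrated:", roc_auc_score(y_test, s_test), roc_auc_score(y_test, p_test))
print("BS  raw vs. calibrated:", brier_score_loss(y_test, s_test), brier_score_loss(y_test, p_test))

In line with the results above, the calibrated scores should show a lower BS while the AUC remains essentially unchanged, because the fitted score-to-probability mapping is close to monotone.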

5 Conclusion
We set out to examine how different calibration methods contribute to the improvement of
the probability forecast quality of classification algorithms in credit scoring. Calibration can be
seen as one dimension of scorecard quality and is a regulatory requirement. Given that only a
few credit scoring studies raise the issue of calibration and that the assessment of scorecards in
this literature is predominantly based on indicators that do not capture calibration, our study
aimed to fill this research gap by empirically comparing several calibration procedures that have
been proposed in different strands of the literature.
The empirical results emphasize the importance of calibration. Post-processing risk
predictions using one of the calibrators considered in the study consistently improves
calibration (in terms of the BS) without hurting discriminatory ability. Therefore, we strongly
encourage researchers and practitioners to routinely employ calibrators to post-process scorecard
predictions and improve calibration.
We also find that the performance of alternative calibrators is often similar. The only
exception is RS, which performs worse than any other calibrator. Therefore, we strongly
discourage using this method in future work. In contrast, recommending one specific
calibrator is difficult. In several cases, the empirical evidence is insufficient to conclude that
some calibrators differ significantly. However, the analysis of calibration plots and the BS
decomposition provide some evidence in favor of GAM. We observe consistently good results
with this calibrator and find it to perform equally well with the classification algorithms
considered in the study. The BS decomposition reveals that competitive calibrators such as PS
or GNB lack this level of consistency. In this sense, the empirical results suggest GAM to be
the most appropriate calibrator. This recommendation agrees with and further supports the
findings of Coussement and Buckinx [20] in the field of direct marketing. As shown here, the
appealing performance of GAM generalizes to risk modeling tasks in credit scoring.
Other calibration studies to which this paper relates include Kuhn and Johnson [48], who highlight
the ability of GNB to facilitate excellent calibration. Our results confirm this view. Moreover,
Niculescu-Mizil and Caruana [54] provide empirical results from a comparison of PS versus IR based
on data from medical applications. The authors conclude that PS outperforms IR when applied to
smaller data sets, whereas IR is superior for large data sets. We cannot corroborate this result
for credit scoring. Rather, we find PS to consistently outperform IR.
Our paper also contains a number of limitations that offer avenues for future research. First,

conclusions are limited to the data sets employed in the study. Any attempt to replicate our
results in credit scoring using other data sets, and more generally in other domains, will add to
the literature in a valuable way. Second, we focus on a two-step approach. We build a classification
model and then calibrate its predictions. Several studies in machine learning have successfully
proposed new algorithms that integrate specific task requirements such as class imbalance or
asymmetric misclassification costs. Future research could thus strive to develop methodologies
to embed the calibration step directly into scorecard development.

References
[1] H. A. Abdou, Genetic programming for credit scoring: The case of Egyptian public sector
banks, Expert Systems with Applications 36 (9) (2009) 11402-11417.
[2] H. A. Abdou, J. Pointon, A. El-Masry, Neural nets versus conventional techniques in
credit scoring in Egyptian banking, Expert Systems with Applications 35 (3) (2008)
1275-1292.
[3] H. A. Abdou, J. Pointon, Credit scoring, statistical techniques and evaluation criteria: A
review of the literature, Intelligent Systems in Accounting, Finance & Management 18 (2-3)
(2011) 59-88.
[4] H. A. Abdou, M. D. Dongmo Tsafack, C. G. Ntim, R. D. Baker, Predicting
creditworthiness in retail banking with limited scoring data, Knowledge-Based
Systems 103 (1) (2016) 89-103.
[5] M. Ala’raj, M. F. Abbod, Classifiers consensus system approach for credit scoring,
Knowledge-Based Systems 104 (2016) 89-105.
[6] M. Ala'raj, M. F. Abbod, A new hybrid ensemble credit scoring model based on
classifiers consensus system approach, Expert Systems with Applications 64 (2016)
36-55.
[7] E. I. Altman, G. Marco, F. Varetto, Corporate distress diagnosis: comparisons using linear
discriminant analysis and neural networks (the Italian experience), Journal of Banking
& Finance 18 (3) (1994) 505-529.
[8] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, E. Silverman, An empirical distribution
function for sampling with incomplete information, The Annals of Mathematical
Statistics 26 (4) (1955) 641-647.
[9] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen, Benchmarking
state-of-the-art classification algorithms for credit scoring, Journal of the Operational
Research Society 54 (6) (2003) 627-635.
[10] J. Barranquero, J. Díez, J. José del Coz, Quantification-oriented learning based on
reliable classifiers, Pattern Recognition 48 (2) (2015) 591-604.
[11] A. C. Bahnsen, D. Aouada, B. Ottersten, Example-dependent cost-sensitive logistic
regression for credit scoring, 13th International Conference on Machine Learning and
Applications (2014) 263-269.
[12] A. Ben-David, E. Frank, Accuracy of machine learning models versus "hand crafted"
expert systems - A credit scoring case study, Expert Systems with Applications: An
International Journal 36 (3) (2009) 5264-5271.
[13] A. Blöchlinger, M. Leippold, A new goodness-of-fit test for event forecasting and its
application to credit defaults, Management Science 57 (3) (2011) 487-505.
[14] A. Blöchlinger, M. Leippold, Economic benefit of powerful credit scoring, Journal of
Banking & Finance 30 (3) (2006) 851-873.
[15] I. Brown, C. Mues, An experimental comparison of classification algorithms for
imbalanced credit scoring data sets, Expert Systems with Applications 39 (3) (2012)
3446-3453.
[16] E. Carrizosa, A. Nogales-Gómez, D. Romero Morales, Clustering categories in support
vector machines, Omega 66 (A) (2017) 28-37.

[17] R. Caruana, A. Munson, A. Niculescu-Mizil, Getting the most out of ensemble selection,
in: Proc. of the 6th Intern. Conf. on Data Mining, Hong Kong, China, 2006.
[18] Y.-S. Chen, C.-H. Cheng, Hybrid models based on rough set classifiers for setting credit
rating decision rules in the global banking industry, Knowledge-Based Systems 39
(2013) 224-239.
[19] S. A. Cole, M. Kanz, L. F. Klapper, Incentivizing calculated risk taking: Evidence from
an experiment with commercial bank loan officers, The Journal of Finance 70 (2)
(2012) 537-575.
[20] K. Coussement, W. Buckinx, A probability-mapping algorithm for calibrating the
posterior probabilities: A direct marketing application, European Journal of
Operational Research 214 (3) (2011) 732-738.
[21] J. N. Crook, D. B. Edelman, L. C. Thomas, Recent developments in consumer credit risk
assessment, European Journal of Operational Research 183 (3) (2007) 1447-1465.
[22] M. Crouhy, D. Galai, R. Mark, A comparative analysis of current credit risk models,
Journal of Banking & Finance 24 (1-2) (2000) 59-117.
[23] T. G. Dietterich, Approximate statistical tests for comparing supervised classification
learning algorithms, Neural Computation 10 (7) (1998) 1895-1923.
[24] G. Dong, K. K. Lai, J. Yen, Credit scorecard based on logistic regression with
random coefficients, Procedia Computer Science 1 (1) (2010) 2463-2468.
[25] M. A. Doori, B. Beyrouti, Credit scoring model based on back propagation neural
network using various activation and error function, International Journal of
Computer Science and Network Security 14 (3) (2014) 16-24.
[26] L. Einav, M. Jenkins, J. Levin, The impact of credit scoring on consumer lending, The
RAND Journal of Economics 44 (2) (2013) 249-274.
[27] S. Finlay, Multiple classifier architectures and their application to credit risk assessment,
European Journal of Operational Research 210 (2) (2011) 368-378.
[28] T. Fitzpatrick, C. Mues, An empirical comparison of classification algorithms for
mortgage default prediction: evidence from a distressed mortgage market, European
Journal of Operational Research 249 (2) (2016) 427-439.
[29] P. A. Flach, N. Lachiche, Naive Bayesian classification of structured data, Machine
Learning 57 (3) (2004) 233-269.
[30] R. Florez-Lopez, J. M. Ramon-Jeronimo, Enhancing accuracy and interpretability of
ensemble strategies in credit risk assessment. A correlated-adjusted decision forest
proposal, Expert Systems with Applications 42 (13) (2015) 5737-5753.
[31] S. García, F. Herrera, An extension on "Statistical comparisons of classifiers over
multiple data sets" for all pairwise comparisons, Journal of Machine Learning
Research 9 (2008) 2677-2694.
[32] V. García, A. I. Marqués, J. S. Sánchez, Improving risk predictions by preprocessing
imbalanced credit data, International Conference on Neural Information Processing
(2012) 68-75.
[33] C. W. J. Granger, M. H. Pesaran, Economic and statistical measures of forecast
accuracy, Journal of Forecasting 19 (7) (2000) 537-560.

[34] D. J. Hand, Classifier technology and the illusion of progress, Statistical Science 21 (1)
(2006) 1-15.
[35] D. J. Hand, Good practice in retail credit scorecard assessment, Journal of the
Operational Research Society 56 (9) (2005) 1109-1117.
[36] H. Harris, Credit scoring using the clustered support vector machine, Expert Systems
with Applications 42 (2) (2015) 741-750.
[37] K. R. Hasan, Development of a credit scoring model for retail loan granting financial
institutions from frontier markets, International Journal of Business and Economics
Research 5 (5) (2016) 135-142.
[38] T. Hastie, R. Tibshirani, Generalized additive models, Statistical Science 1 (3) (1986)
297-318.
[39] J. Hernández-Orallo, P. Flach, C. Ferri, A unified view of performance metrics:
translating threshold choice into expected classification loss, Journal of Machine
Learning Research 13 (2012) 2813-2869.
[40] B. J. Hirtle, M. Levonian, M. Saidenberg, S. Walter, D. Wright, Using credit risk models
for regulatory capital: issues and options, Economic Policy Review 7 (1) (2001) 19-
36.
[41] A. I. Irimia-Dieguez, A. Blanco-Oliver, M. J. Vazquez-Cueto, A comparison of
classification/regression trees and logistic regression in failure models, Procedia
Economics and Finance 23 (2014) 9-14.
[42] G. H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers. In
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence,
Morgan Kaufmann Publishers Inc., ISBN:1-55860-385-9, (1995), 338-345.
[43] J. E. V. Johnson, A. C. Bruce, Calibration of subjective probability judgments in a
naturalistic setting, Journal of Organizational Behaviour and Human Decision
Processes 85 (2) (2001) 265-290.
[44] K. M. Jung, L. C. Thomas, M. C. So, When to rebuild or when to adjust scorecards,
Journal of the Operational Research Society 66 (10) (2015) 1656-1668.
[45] R. Krause, G. Tutz, Simultaneous selection of variables and smoothing parameters in
additive regression models, Computational Statistics & Data Analysis 53 (1) (2008)
61-81.
[46] J. Kruppa, A. Schwarz, G. Arminger, A. Ziegler, Consumer credit risk: Individual
probability estimates using machine learning, Expert Systems with Applications 40
(13) (2013) 5125-5131.
[47] J. Kruppa, Y. Liu, G. Biau, M. Kohler, I. R. König, J. D. Malley, A. Ziegler, Probability
estimation with machine learning methods for dichotomous and multicategory
outcome: Theory, Biometrical Journal 56 (4) (2014) 534-563.
[48] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer, New York, 2013.
[49] S. Lessmann, B. Baesens, H.-V. Seow, L. C. Thomas, Benchmarking state-of-the-art
classification algorithms for credit scoring: An update of research, European Journal
of Operational Research 247 (1) (2015) 124-136.

[50] Y. Li, J. Gao, A. Z. Enkavi, L. Zaval, E. U. Weber, E. J. Johnson, Sound credit scores
and financial decisions despite cognitive aging, Proceedings of the National Academy of
Sciences 112 (1) (2015) 65-69.
[51] R. Malhotra, D. K. Malhotra, Evaluating consumer loans using neural networks, Omega
31 (2) (2003) 83-96.
[52] J. F. Martínes Sánchez, G. Pérez Lechuga, Assessment of a credit scoring system for
popular bank savings and credit, Contaduría y Administración 61 (2) (2016) 391-417.
[53] M. P. Naeini, G. F. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using
Bayesian binning, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence,
Austin, TX, USA, 2015, pp. 2901-2907.
[54] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning,
in: L. De Raedt, S. Wrobel (Eds.), Proc. of the 22nd Intern. Conf. on Machine Learning,
ACM Press, Bonn, Germany, 2005, pp. 625-632.
[55] V. Pacelli, M. Azzollini, An artificial neural network approach for credit risk
management, Journal of Intelligent Learning Systems and Applications 3 (2011) 103-
112.
[56] G. Paleologo, A. Elisseeff, G. Antonini, Subagging for credit scoring models, European
Journal of Operational Research 201 (2) (2010) 490-499.
[57] J. Perols, K. Chari, M. Agrawal, Information market-based decision fusion, Management
Science 55 (2009) 827-842.
[58] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods, in: A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.),
Advances in Large Margin Classifiers, MIT Press, Cambridge, 2000, pp. 61-74.
[59] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes: The
Art of Scientific Computing, (1992). Cambridge (UK) and New York: Cambridge
University Press.
[60] F. Provost, P. Domingos, Well-trained PETs: Improving probability estimation trees,
Working paper, Stern School of Business, New York University, New York, 2001.
[61] M. Saerens, P. Latinne, C. Decaestecker, Adjusting the outputs of a classifier to new a
priori probabilities: a simple procedure, Neural Computation 14 (1) (2002) 21-41.
[62] D. B. Stephenson, C. A. S. Coelho, I. T. Jolliffe, Two extra components in the Brier
Score Decomposition, Weather and Forecasting 23 (4) (2008) 752-757.
[63] K. Y. Tam, M. Kiang, Managerial applications of neural networks: the case of bank
failure predictions, Management Science 38 (7) (1992) 926-947.
[64] N. Tarashev, H. Zhu, Specification and calibration errors in measures of portfolio credit
risk: The case of the ASRF Model, International Journal of Central Banking 4 (2008)
129-173.
[65] L. C. Thomas, A survey of credit and behavioural scoring: forecasting financial risk of
lending to consumers, International Journal of Forecasting 16 (2000) 149-172.
[66] L. C. Thomas, Consumer Credit Models: Pricing, Profit and Portfolios, (2009), Oxford,
UK: Oxford University Press.

[67] L.C. Thomas, D. B. Edelman, J. N. Crook, Credit Scoring and its Applications, (2002)
Philadelphia: Siam.
[68] K. Van Hoorde, S. Van Huffel, D. Timmerman, T. Bourne, B. Van Calster, A spline-
based tool to assess and visualize the calibration of multiclass risk predictions, Journal
of Biomedical Informatics 54 (2015) 283-293.
[69] G. Wang, J. Ma, L. Huang, K. Xu, Two credit scoring models based on dual strategy
ensemble trees. Knowledge-Based Systems 26 (2012) 61-68.
[70] S. N. Wood, Modelling and smoothing parameter estimation with multiple quadratic
penalties, Journal of the Royal Statistical Society: Series B (Statistical Methodology)
62 (2) (2000) 413-428.
[71] B. W. Yap, S. H. Ong, N. H. Mohamed Husain, Using data mining to improve
assessment of credit worthiness via credit scoring models, Expert Systems with
Applications: An International Journal 38 (10) (2011) 13274-13283.
[72] W. Xiao, Q. Zhao, Q. Fei, A comparative study of data mining methods in consumer loans
credit scoring management, Journal of Systems Science and Systems Engineering 15 (2006)
419-435.
[73] L. Yu, X. Li, L. Tang, Z. Zhang, G. Kou, Social credit: a comprehensive literature
review, (2015) Financial Innovation, 1:6, doi: 10.1186/s40854-015-0005-6.
[74] B. Zadrozny, C. Elkan, Transforming Classifier Scores into Accurate Multiclass
Probability Estimates. In D. Hand, D. Keim and R. Ng (Eds.), Proc. of the 8th ACM
SIGKDD Intern. Conf. Knowledge Discovery and Data Mining (pp. 694-699), (2002).
Edmonton, Alberta, Canada: ACM Press.
[75] L. W. Zhong, T. J. Kwok, Accurate Probability Calibration for Multiple Classifiers. In
Proceedings of the Twenty-Third International Joint Conference on Artificial
Intelligence (pp. 1939-1945), (2013). Beijing, China.
[76] X. Zhu, J. Li, D. Wu, H. Wang, C. Liang, Balancing accuracy, complexity and
interpretability in consumer credit decision making: A C-TOPSIS classification
approach, Knowledge-Based Systems 52 (2013) 258-267.

Appendix: Calibration Plots for Every Classifier across All Data Sets

Calibration Plot for MLP

Raw and calibrated predictions of the MLP classifier across all data sets

Calibration Plot for BAGES

Raw and calibrated predictions of the BAGES classifier across all data sets

Calibration Plot for LRE

Raw and calibrated predictions of the LRE classifier across all data sets

Calibration Plot for AVG

Raw and calibrated predictions of the AVG classifier across all data sets

Pseudo Code for the Sampling Procedure Used in the Main Part of the Paper
The following pseudo code illustrates our sampling procedure and data organization. Essentially, we randomly partition
every data set into a training set (60%) and a hold-out test set (40%). The latter is used for evaluation. The former enters
a 5x2 cross-validation [23] stage for classifier and calibrator development.
1:  # Given: 8 credit scoring data sets
2:  # Given: list of classifiers and classifier meta-parameters (see Table 3)
3:  # Given: list of calibrators
4:  # iterate through data sets
5:  for each dataset
6:      # create random training and test set partition
7:      trainset = select random sub-sample of 60%
8:      testingset = remaining data not part of trainset
9:      # begin training of classifier
10:     for each classifier c out of the list of all classifiers
11:         # begin 5x2 cross-validation
12:         for each meta-parameter setting of c
13:             for i = 1 to 5
14:                 # randomly split training set in half
15:                 trainfold = random sample of 50% of training set
16:                 testfold = remaining 50% of training set
17:                 classifier = train classifier on trainfold
18:                 prediction = apply classifier to testfold
19:                 # assess discrimination ability of classifier using the current testfold
20:                 discrim = calculate AUC from prediction
21:                 for each calibrator in list of calibrators
22:                     # use (out-of-sample) prediction to estimate calibrator
23:                     calibrator = estimate calibrator using classifier predictions (from testfold)
24:                     # apply (i.e., test) calibrator using trainfold
25:                     # note that trainfold can be considered out-of-sample for the calibrator
26:                     cal_prediction = apply calibrator to trainfold
27:                     # assess calibration ability of calibrator-classifier combination
28:                     calib = calculate Brier Score and Log-Loss from cal_prediction
29:                 end for # loop over calibrators
30:             end for # 5x2 cross-validation loop
31:         end for # loop over classifier meta-parameters
32:     end for # loop over classifiers
33: end for # loop over data sets

The procedure above produces hold-out estimates of a classifier’s discrimination ability for each candidate setting of
classifier meta-parameters (see line 20). These values facilitate selecting the best meta-parameter configuration per
classifier (see Table 3). In the same manner, line 26 produces hold-out estimates of the calibration ability of a
calibration method (when applied to the focal classifier). Combined with the estimates of discrimination ability, this
allows us to examine the correlation between discrimination measures such as the AUC and calibration measures such as
the Brier Score (see, e.g., Table 4). Note that it is important to control the sequence of random numbers in the overall data
partitioning (lines 7 and 8) and the 5x2 cross-validation loop (line 13) to ensure that all classifiers and calibrators
operate on the same sub-folds of the data.

Having identified the best meta-parameter setting per classifier, we use the five classification models corresponding to
these meta-parameters (from 5x2 cross-validation; see line 17) to predict the hold-out test set. In particular, we average
test set predictions over the five classification models. Caruana et al. [17] recommend this approach of reusing cross-
validated classification models and provide empirical evidence that it performs better than developing novel
classification models from the full training set. We also use this approach to calibrate test set predictions. In particular,
we re-use the calibration models produced during 5x2 cross-validation (for the best meta-parameter configuration of
the classifier) to convert test set predictions into calibrated probability estimates. Since cross-validation produces one
calibration model per calibration algorithm per fold, we once again average over the five calibrated test set predictions.
In the main part of the paper, results in Sections 4.2 to 4.6 are then based on these hold-out test set predictions.
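
A minimal sketch of this reuse-and-average step, under simplifying assumptions (a single classifier with fixed meta-parameters, a simple logistic Platt-style calibrator in place of the full calibrator list, and synthetic data), might look as follows in Python. It is meant to illustrate the data flow described above, not to reproduce the original implementation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(42)                 # fixed seed, cf. lines 7, 8, and 13 above
X, y = make_classification(n_samples=8000, n_features=20, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

raw_preds, cal_preds = [], []
for i in range(5):                              # five repetitions of the two-fold split
    idx = rng.permutation(len(X_train))
    half = len(idx) // 2
    tr, te = idx[:half], idx[half:]             # trainfold / testfold

    model = RandomForestClassifier(n_estimators=300, random_state=i)
    model.fit(X_train[tr], y_train[tr])
    score_te = model.predict_proba(X_train[te])[:, 1]

    # estimate the calibrator on out-of-sample classifier scores (the testfold)
    calibrator = LogisticRegression(max_iter=1000)
    calibrator.fit(score_te.reshape(-1, 1), y_train[te])

    # reuse model and calibrator to score the hold-out test set
    score_test = model.predict_proba(X_test)[:, 1]
    raw_preds.append(score_test)
    cal_preds.append(calibrator.predict_proba(score_test.reshape(-1, 1))[:, 1])

raw_avg = np.mean(raw_preds, axis=0)            # average over the five cross-validated models
cal_avg = np.mean(cal_preds, axis=0)            # average over the five calibrated predictions
print("AUC raw vs. calibrated:", roc_auc_score(y_test, raw_avg), roc_auc_score(y_test, cal_avg))
print("BS  raw vs. calibrated:", brier_score_loss(y_test, raw_avg), brier_score_loss(y_test, cal_avg))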

