
European Journal of Operational Research 183 (2007) 1447–1465

www.elsevier.com/locate/ejor

Recent developments in consumer credit risk assessment


Jonathan N. Crook a,*, David B. Edelman b, Lyn C. Thomas c

a Credit Research Centre, Management School and Economics, 50 George Square, University of Edinburgh, Edinburgh EH8 9JY, United Kingdom
b Caledonia Credit Consultancy, 42 Milverton Road, Glasgow G46 7LP, United Kingdom
c School of Management, Highfield, Southampton SO17 1BJ, United Kingdom

Received 1 July 2006; accepted 1 September 2006


Available online 15 February 2007

Abstract

Consumer credit risk assessment involves the use of risk assessment tools to manage a borrower's account from the time of pre-screening a potential application through to the management of the account during its life and possible write-off. The riskiness of lending to a credit applicant is usually estimated using a logistic regression model, though researchers have considered many other types of classifier, and whilst preliminary evidence suggests support vector machines seem to be the most accurate, data quality issues may prevent these laboratory based results from being achieved in practice. The training of a classifier on a sample of accepted applicants rather than on a sample representative of the applicant population seems not to result in bias, though it does result in difficulties in setting the cut-off. Profit scoring is a promising line of research, and the Basel 2 accord has had profound implications for the way in which credit applicants are assessed and bank policies adopted.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Finance; OR in banking; Risk analysis

1. Introduction

Between 1970 and 2005 the volume of consumer credit outstanding in the US increased by 231% and the volume of bank loans secured on real estate increased by 705%.¹ Of the $3617.0 billion of outstanding commercial bank loans secured on real estate and of consumer loans in the personal sector in December 2005, the former made up 80%.² The growth in debt outstanding in the UK has also been dramatic. Between 1987 and 2005 the volume of consumer credit (that is excluding mortgage debt) outstanding increased by 182% and the growth in credit card debt was 416%.³ Mortgage debt outstanding increased by 125%.⁴ Within Europe growth rates have varied. For example between 2001 and 2004 nominal loans to households and the non-profit sector increased by 36% in Italy, but in Germany they increased by only 2.3%, with the Netherlands and France at 28% and 22% respectively.⁵

* Corresponding author. Tel.: +44 131 6503820; fax: +44 131 6683053. E-mail addresses: j.crook@ed.ac.uk (J.N. Crook), davide24@ntlworld.com (D.B. Edelman), l.thomas@soton.ac.uk (L.C. Thomas).
¹ Data from the Federal Reserve Board (series bcablci_ba.m and bcablcr_ba.m deflated), H8, Assets and Liabilities of Commercial Banks in the United States.
² Data from the Federal Reserve Board (series bcablci_ba.m and bcablcr_ba.m deflated), H8, Assets and Liabilities of Commercial Banks in the United States.
³ Data from ONS Online.
⁴ Data from the Council of Mortgage Lenders.
⁵ Source: OECD Statistics, Financial Balance Sheets – consolidated dataset 710.
0377-2217/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.ejor.2006.09.100

These generally large increases in lending and associated applications for loans have been underpinned by one of the most successful applications of statistics and operations research: credit scoring. Credit scoring is the assessment of the risk associated with lending to an organization or an individual. Every month almost every adult in the US and the UK is scored several times to enable a lender to decide whether to mail information about new loan products, to evaluate whether a credit card company should increase one's credit limit, and so on. Whilst the extension of credit goes back to Babylonian times (Lewis, 1992) the history of credit scoring begins in 1941 with the publication by Durand (1941) of a study that distinguished between good and bad loans made by 37 firms. Since then the already established techniques of statistical discrimination have been developed and an enormous number of new classificatory algorithms have been researched and tested. Virtually all major banks use credit scoring, with specialised consultancies providing credit scoring services and offering powerful software to score applicants, monitor their performance and manage their accounts. In this review we firstly explain the basic ideas of credit scoring and then discuss a selection of very exciting current research topics. A number of previous surveys have been published albeit with different emphases. Rosenberg and Gleit (1994) review different types of classifiers, Hand and Henley (1997) cover earlier results concerning the performance of classifiers, whilst Thomas (2000) additionally reviews behavioural and profit scoring. Thomas et al. (2002) discuss research questions in credit scoring especially those concerning Basel 2.

2. Basic ideas of credit scoring

Consumer credit risk assessment involves the use of risk assessment tools to manage a borrower's account from the time of direct mailing of marketing material about a consumer loan through to the management of the borrower's account during its lifetime. The same techniques are used in building these tools even though they involve different information and are applied to different decisions. Application scoring helps a lender discriminate between those applicants whom the lender is confident will repay a loan or card or manage their current account properly and those applicants about whom the lender is insufficiently confident. The lender uses a rule to distinguish between these two subgroups which make up the population of applicants. Usually the lender has a sample of borrowers who applied, were made an offer of a loan, who accepted the offer and whose subsequent repayment performance has been observed. Information is available on many sociodemographic characteristics (such as income and years at address) of each borrower at the time of application from his/her application form and typically on the repayment performance of each borrower on other loans and of individuals who live in the same neighbourhood. We will denote each characteristic as x_i, which may take on one of several values (called "attributes") x_ij for case i.

The most common method of deriving a classification rule is logistic regression where, using maximum likelihood, the analyst estimates the parameters in the equation:

\log\left(\frac{p_{gi}}{1 - p_{gi}}\right) = \beta_0 + \beta^{T} x_i, \qquad (1)

where p_gi is the probability that case i is a good. This implies that the probability of case i being a good is

p_{gi} = \frac{e^{\beta_0 + \beta^{T} x_i}}{1 + e^{\beta_0 + \beta^{T} x_i}}.

Traditionally the values of the characteristics for each individual were used to predict a value for p_gi which is compared with a critical or cut-off value and a decision made. Since a binary decision is required the model is required to provide no more than a ranking. p_gi may be used in other ways. For example in the case of a card product, p_gi will also determine a multiple of salary for the credit limit. For a current account, p_gi will determine the number of cheques in a cheque book and the type of card that is issued. For a loan product p_gi will often determine the interest rate to be charged.
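As a concrete illustration of Eq. (1), the sketch below fits a logistic regression scorecard to a tiny, entirely hypothetical training sample and converts the fitted linear score into p_gi; the characteristic names, data and cut-off are assumptions made for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical application data: each row is an applicant, columns are
# characteristics x_i (income in thousands, years at address).
X = np.array([[35, 2], [51, 10], [22, 1], [60, 7], [28, 4], [45, 12]])
# Observed outcome: 1 = good (repaid), 0 = bad.
y = np.array([1, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# p_gi = exp(b0 + b'x_i) / (1 + exp(b0 + b'x_i)), as implied by Eq. (1).
p_good = model.predict_proba(X)[:, 1]

# A simple accept/reject rule: compare p_gi with a cut-off value.
cut_off = 0.5
accept = p_good >= cut_off
print(np.round(p_good, 3), accept)
```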
In practice the definition of a good varies enormously but is typically taken as a borrower who has not
defaulted within a given time period, usually within one year. Default may be taken as having missed three scheduled payments in the period or having missed three successive payments within the period.

In practice the characteristics in Eq. (1) may be transformed. Typically each variable, whether measured at nominal, ordinal or ratio level, is divided into ranges according to similarity of the proportion of goods at each value. Inferential tests are available to test the null hypothesis that any one aggregation of values into a range separates the goods from the bads more clearly than another aggregation. This process is known as "coarse classification" and the available tests include the chi-square test, the information statistic and the Somers' D concordance statistic. A transformation of the proportion that is good is performed for the cases in each band, a typical example being to take the log of the proportion that are good divided by the proportion that are bad. The transformed value, called the weight of evidence, is used in place of the raw values for all cases in the range. This has the advantage of preserving degrees of freedom in a small dataset. However most datasets are sufficiently large that this method confers few benefits over using dummy variables to represent each range (for both nominal and continuous variables). Coarse classification also has the advantage of allowing a missing value (i.e. no response to a question) to be incorporated in the analysis of the responses to that question.
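The following sketch illustrates coarse classification and one common form of the weight of evidence transformation described above; the income bands, the data and the exact WoE definition used (log of the goods share over the bads share in each band) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one characteristic (income) and a good/bad outcome.
df = pd.DataFrame({
    "income": [12, 18, 25, 33, 41, 58, 64, 72, 15, 29, 47, 80],
    "good":   [0,  1,  1,  1,  1,  1,  0,  1,  0,  0,  0,  1],
})

# Coarse classify: band the characteristic (band edges chosen for illustration).
df["band"] = pd.cut(df["income"], bins=[0, 20, 40, 60, np.inf])

# Weight of evidence per band: log of (share of goods) / (share of bads).
n_good, n_bad = df["good"].sum(), (1 - df["good"]).sum()
grp = df.groupby("band", observed=True)["good"]
woe = np.log((grp.sum() / n_good) / ((grp.count() - grp.sum()) / n_bad))

# Replace the raw values by the WoE of their band.
df["income_woe"] = df["band"].map(woe)
print(woe)
```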
The model is parameterised on a training sample and tested on a holdout sample to avoid over-parameterisation, whereby the estimated model fits the nuances in the training sample which are not repeated in the population. Validation of an application scoring model involves assessing how well it separates the distributions of observed goods and bads over the characteristics: its discriminatory power. Fig. 1 shows two such observed distributions produced from a scoring model like Eq. (1).

Fig. 1. (Distributions f(S_fitted) of the scores, score = f(x), for the good (G) and bad (B) groups, with cut-off S_c dividing the rejected region A_B from the accepted region A_G and areas B_W and G_W marking the misclassified cases.)

S_c is a cut-off score used to partition the score range into those to be labelled goods (i.e. to be accepted), this group to be labelled A_G, and those to be labelled bads (and so rejected), labelled A_B. If the distributions did not overlap the separation, and so the discrimination, would be perfect. If they do overlap there are four subsets of cases: bads classified as bad, bads classified as good, goods classified as good and goods classified as bad. The proportion of bads classified as good is the area under the B curve to the right of S_c, labelled B_W. The proportion of goods classified as bad is the area under the G curve to the left of S_c, labelled G_W. One measure used in practice is to estimate the sum of these proportions for a given cut-off. A two by two matrix of the four possibilities is constructed, called a confusion matrix, an example of which is shown in Table 1.

Table 1
A confusion matrix

Predicted class | Observed class: Good | Observed class: Bad | Row total
Good            | n_GG                 | n_GB                | n_GG + n_GB
Bad             | n_BG                 | n_BB                | n_BG + n_BB
Column total    | n_GG + n_BG          | n_GB + n_BB         | n_GG + n_BG + n_GB + n_BB

In this table G_W is n_BG/(n_GG + n_BG) and B_W is n_GB/(n_GB + n_BB). The cut-off may be set in many ways, for example such that the error rate is minimised or equal to the observed error rate in the training sample or in the holdout sample. Notice that the appropriate test sample is the population of all cases whereas in practice this is not available and we must make do with a sample. Now the expected value of the error rate estimated from all holdout samples is an unbiased estimate of the error rate in the population, if the sample is selected randomly from all possible applicants. However since, in practice, the sample typically consists only of accepted applicants, the expected error rate from all holdout samples may be a biased estimate of the error rate in the population of all applicants (see reject inference below).
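A minimal sketch of how the counts of Table 1, the proportions G_W and B_W and an error rate might be computed for a chosen cut-off; the scores, labels and cut-off value are hypothetical.

```python
import numpy as np

def confusion_counts(scores, is_good, cut_off):
    """Return (n_GG, n_GB, n_BG, n_BB) for a given cut-off, following Table 1:
    cases scoring at or above the cut-off are predicted good."""
    pred_good = scores >= cut_off
    n_GG = np.sum(pred_good & is_good)      # goods predicted good
    n_GB = np.sum(pred_good & ~is_good)     # bads predicted good (the B_W cases)
    n_BG = np.sum(~pred_good & is_good)     # goods predicted bad (the G_W cases)
    n_BB = np.sum(~pred_good & ~is_good)    # bads predicted bad
    return n_GG, n_GB, n_BG, n_BB

scores = np.array([0.91, 0.80, 0.78, 0.65, 0.55, 0.42, 0.30, 0.22])
is_good = np.array([True, True, False, True, True, False, False, False])

n_GG, n_GB, n_BG, n_BB = confusion_counts(scores, is_good, cut_off=0.6)
g_w = n_BG / (n_GG + n_BG)   # proportion of goods classified as bad
b_w = n_GB / (n_GB + n_BB)   # proportion of bads classified as good
error_rate = (n_GB + n_BG) / len(scores)
print(g_w, b_w, error_rate)
```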
A weakness of the error rate is that the predictive accuracy of the estimated model depends on the cut-off value chosen. On the other hand, the statistics associated with the receiver operating curve (ROC) give a summary measure of the discrimination of the scorecard since all possible cut-off scores are considered. The ROC curve is a plot of the proportion of bads classified as bad against the proportion of goods classified as bad at all values of S_c. The former is called "sensitivity" and the latter is 1 − "specificity", since specificity is the proportion of goods predicted to be good. We can see these values in Table 1, where sensitivity is n_BB/(n_GB + n_BB) and 1 − specificity is n_BG/(n_GG + n_BG). If the value of S_c is increased from its lowest possible value, the value of B_W increases faster than the value of G_W decreases; eventually the reverse happens. This gives the shape of the ROC curve shown in Fig. 2. If the distributions are totally separated then, as S_c increases, all of the bads will be correctly classified before any of the goods are misclassified and the ROC curve will lie over the vertical axis until point A and then along the upper horizontal axis. If there is no separation at all and the distributions are identical, the proportion of bads classified as bad will equal the proportion of goods classified as bad and the ROC curve will lie over the diagonal straight line. One measure of the separation yielded by a scoring model is therefore the proportion of the area below the ROC curve which is above the diagonal line. The Gini coefficient is double this value. The Gini coefficient can be compared between scoring models to indicate which gives better separation. If one is interested in predictive performance over all cut-off values the Gini coefficient is more informative than looking at ROC plots since the latter may cross. A useful property is that the area under the ROC curve down to the right hand corner of Fig. 2 equals the Mann–Whitney statistic. Assuming this is normally distributed one can place confidence intervals on this area. But the Gini may be very misleading when we are interested only in a small range of cut-off values around where the predicted bad rate is acceptable, because the Gini is the integral over all cut-offs. When we are interested in predictive performance only in a narrow range of cut-offs, the ROC plots would give more information.

Fig. 2. The receiver operating curve. (The % of bads classified as bad is plotted against the % of goods classified as bad, from (0,0) to (1,1); a perfectly separating scorecard rises along the vertical axis to point A at (0,1).)

Other important measures of separation are available. One is the Kolmogorov–Smirnov statistic (KS). From Fig. 1 we can plot the cumulative distributions of scores for goods and bads from the scoring model, which are shown in Fig. 3, where F(s|G) denotes the probability that a good has a score less than s and F(s|B) denotes the probability that a bad has a score less than s. The KS is the maximum difference between F(s|G) and F(s|B) at any score. Intuitively the KS says: across all scores, what is the maximum difference between the probability that a case is a good and is rejected and the probability that one is a bad and is rejected. However, like the Gini, the KS may also give the maximum separation at cut-offs that the lender is, for operational reasons, not interested in. Many other measures are available; see Thomas et al. (2002) and Tasche (2005).

Fig. 3. The Kolmogorov–Smirnov statistic. (The cumulative score distributions F(s|B) and F(s|G) are plotted against s = f(x); the KS is their maximum vertical separation.)
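The ROC-based measures discussed above can be computed as in the following sketch, which assumes scikit-learn and hypothetical holdout data; the Gini is taken as 2 × AUC − 1 and the KS as the maximum gap between the two cumulative score distributions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical holdout scores and outcomes (1 = good, 0 = bad).
y_good = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 1])
score = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.85, 0.3, 0.65])

# Area under the ROC curve (a rescaling of the Mann-Whitney statistic)
# and the Gini coefficient = 2 * AUC - 1.
auc = roc_auc_score(y_good, score)
gini = 2 * auc - 1

# Kolmogorov-Smirnov statistic: maximum distance between the cumulative
# score distributions of goods and bads, F(s|G) and F(s|B).
fpr, tpr, _ = roc_curve(y_good, score)   # tpr = 1 - F(s|G), fpr = 1 - F(s|B) at each threshold
ks = np.max(np.abs(tpr - fpr))
print(round(auc, 3), round(gini, 3), round(ks, 3))
```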
Hand (2005) has cast doubt on the value of any measure of scorecard performance which uses the distribution of scores. Since the aim of, say, an application model is simply to divide applicants into those to be accepted and the rest, it is only the numbers of cases above and below the cut-off that is relevant. The distance between the score and the cut-off is not. The proportion of accepted bad cases uses just this information and so is to be preferred over the area under the ROC curve, Gini or KS measure. If, on the other hand, we are interested in dividing a sample into two groups where we can follow the subsequent ex post performance of both groups, then we can find the proportion of bad cases that turned out to be good and the proportion of good cases that turned out to be bad. Given the costs of each type of misclassification the total such


costs from a scorecard can be estimated. Again only the sign of the classification and the costs of misclassification are used; the distribution of scores is not.

Once an applicant has been accepted and has been through several repayment cycles a lender may wish to decide if he/she is likely to default in the next time period. Unlike application scoring, the lender now has information on the borrower's repayment performance and usage of the credit product over time period t and may use this in addition to the application characteristic values to predict the chance of default over time period t + 1. This may be used to decide whether to increase a credit limit, grant a new loan or even post a mail shot. A weakness of this approach is that it assumes that a model estimated from a set of policies towards the borrower in the past (size of loan or limit) is just as applicable when the policy is changed (on the basis of the behavioural score – see below). This may well be false.

Another approach is to model the behaviour of the borrower as a Markov chain (MC). Each transition probability, p_t(i, j), is the probability that an account moves from state i to state j one month later. The repayment states may be "no credit", "has credit and is up to date", "has credit and is one payment behind", ..., "default". Typically "written off" and "loan paid off" are absorption states. Alternatively MCs may be more complex. Many banks may use a compound definition, for example how many payments are missed, whether the account is active, how much money is owed and so on. A good example, a model developed by Bank One, is given by Trench et al. (2003). Notice that there are typically many structural zeros: transitions which are not possible between two successive months. For example an account cannot move from being up to date to being more than one period behind.

Those transition matrices that have been published typically show that, apart from those who have missed one or more payments, at least 90% of credit card accounts (Till and Hand, 2001) or personal loan accounts (Ho et al., 2004) remain up to date, that is move from up to date in period t to up to date in period t + 1. Till and Hand (2001) found that for credit cards, apart from those who missed 1 or 2 payments in t, the majority of those who missed 3 or more payments in t missed a further payment in t + 1, and as we consider more missed payments this proportion of cases goes up until five missed payments, when the proportion levels off. They also found that the proportion of cases in a state that move directly to paid up to date decreases up until five missed payments, when it increases again. This last result may be due to people taking out a consolidating loan to pay off previous debts.⁶

⁶ Or to the fact that some systems actually have the original debt cleared as the case is passed to an external debt collection agency or written off.

If the transition probability matrix, P_t, is known for each t one can predict the distribution of the states of accounts that a customer has n periods into the future. A kth order non-stationary MC may be written as

P\{X_{t+k} = j_{t+k} \mid X_{t+k-1} = j_{t+k-1}, X_{t+k-2} = j_{t+k-2}, \ldots, X_1 = j_1\}
= p_t(j_t, \ldots, j_{t+k-1}, j_{t+k})
= p_t(\{j_t, \ldots, j_{t+k-1}\}, \{j_{t+1}, \ldots, j_{t+k}\}), \qquad (2)

where X_t takes on values representing the different possible repayment states.

A MC is stationary if the elements within the transition matrix do not vary with time. A kth order MC is one where the probability that a customer is in a state j in period t + k depends on the states that the customer has visited in the previous k periods. If an MC is non-stationary (but first order) the probability that a customer moves from state i to state j over k periods is the product of the transition matrices which each contain the transition probabilities between two states for every two adjacent periods. One may represent a kth order MC by redefining each "state" to consist of every possible combination of values of the originally defined states X_t, X_{t+1}, ..., X_{t+k}. Then the transition matrix of probabilities over t to t + k periods is the product of the matrix containing transition probabilities over the (t, ..., t + k − 1) periods and the matrix for the (t + 1, ..., t + k) periods. Anderson and Goodman (1957) give tests for stationarity and order. Evidence suggests that MCs describing repayment states are not first order and may not even be second order (Till and Hand, 2001). Ho et al. (2004) find that a MC describing states for a bank is not first, second or third order, where each state is defined as a combination of how delinquent and how active the account is and the outstanding balance owed.
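A small sketch of the point made above for a first order, non-stationary chain: the transition matrix over two periods is the product of the adjacent one-period matrices, from which the distribution of repayment states some months ahead can be read off. The states and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical monthly transition matrices P_t over repayment states
# (0 = up to date, 1 = one payment behind, 2 = default, treated as absorbing).
P1 = np.array([[0.92, 0.07, 0.01],
               [0.55, 0.35, 0.10],
               [0.00, 0.00, 1.00]])
P2 = np.array([[0.90, 0.08, 0.02],
               [0.50, 0.38, 0.12],
               [0.00, 0.00, 1.00]])

# For a first order, non-stationary chain the transition matrix over the
# two periods is the product of the adjacent one-period matrices.
P_2step = P1 @ P2

# Distribution of states two months ahead for an account currently up to date.
start = np.array([1.0, 0.0, 0.0])
print(start @ P_2step)
```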
Results by Till and Hand (2001) also suggest MCs for the delinquency states for credit products are not stationary. Trench et al. (2003) consider redefining the states by concatenation of previous historical states to induce stationarity, but opt instead for segmenting the transition matrix for the whole sample into a matrix for each customer segment. The resulting model, when combined with an optimisation routine to maximise NPV for each existing account by choice of credit line and APR, has considerable success.

Often MC models are used for predicting the long term distributions of accounts and so enabling policies to be followed to change these distributions. Early applications of MCs (Cyert and Thompson, 1968) to credit risk involved segmenting borrowers into risk groups with each group having its own transition matrix. More recent applications have involved Mover–Stayer Markov chains, which are chains where the population of accounts is segmented into those which "stay" in the same state and those which move (Frydman, 1984). So the j period transition matrix consists of the probability that an account is a stayer times the transition matrix given that it is a stayer, plus the probability it is a mover times the transition matrix given that it is a mover. So (if we assume stationarity):

P(0, j) = SI + (I - S)M^{j}, \qquad (3)

where S is a row vector consisting of the diagonal of the matrix of S_i values, where S_i is the proportion of stayers in state i, M is the transition matrix for movers and I is the identity matrix. Notice that the stayers could be in any state of the transition matrix, and not just the up to date state. Studies show that between 44% and 72% of accounts never leave the same state (the bulk of which are in the up to date state), which supports the applicability of a stayer segment in credit portfolios. Recent research has decomposed the remaining accounts into several segments rather than just "movers". Thus Ho and Thomas were able to statistically separate the movers into three groups according to the number of changes of state made: those that move modestly ("twitchers"), those that move more substantially ("shakers") and those that move considerably ("movers").

3. Classification algorithms

We now begin our discussion of a selection of areas where recent research has been particularly active. There are many other areas of rapid research in this area and we do not mean to imply that the areas we have chosen are necessarily the most beneficial areas.

3.1. Statistical methods

Whilst logistic regression (LR) is a commonly used algorithm it was not the first to be used, and the development of newer methods has been a long standing research agenda which is likely to continue. One of the earliest techniques was discriminant analysis. Using our earlier notation, we wish to develop a rule to partition a sample of applicants, A, into goods A_G and bads A_B which maximises the separation between the two groups in some sense. Fisher (1936) argued that a natural solution is to find the direction w that maximises the difference between the sample mean values x̄_G and x̄_B. If we define ȳ_G = w^T x̄_G and ȳ_B = w^T x̄_B then we wish to maximise E(y_G − y_B) and so to maximise w^T(x̄_G − x̄_B). In terms of Fig. 1 we wish to find the w matrix that maximises the difference between the peaks of the B and G distributions in the horizontal direction. But although the sample means may be widely apart, the distributions of w^T x values may overlap considerably. The misclassifications in Fig. 1 represented by the G_W and B_W areas would be relatively large, but for a given difference in means would be lower if the distributions had smaller variances. If, for simplicity, we assume the two groups have the same covariance matrices, denoted S in the samples, the within-group variance is w^T S w and dividing w^T(x̄_G − x̄_B) by this makes the separation between the two group means smaller for larger variances. In fact the distance measure that is used is

M(w) = \frac{[w^{T}(\bar{x}_G - \bar{x}_B)]^{2}}{w^{T} S w}, \qquad (4)

(where the numerator is squared for mathematical convenience). Differentiating Eq. (4) with respect to w gives estimators for w where w ∝ S^{-1}(x̄_G − x̄_B). Since we just require a ranking of cases to compare with a cut-off value this is sufficient to classify each case. S is taken as an estimate of the population value Σ, and x̄_G and x̄_B are taken as estimates of the corresponding population values, μ_G and μ_B.

Notice that we have made no assumptions about the distributions of x_G or x_B. This is particularly
important for credit scoring because such assumptions rarely hold even when weights of evidence values are used.

One can in fact deduce the same estimator in other ways. For example we could partition the sample to minimise the costs of doing this erroneously. That is, in Fig. 1, minimise the sum of areas G_W and B_W, each weighted by the cost of misclassification of each case. Assuming that the values of the characteristics x are multivariate normal and they have a common covariance matrix we have

A_G: \; y \le x^{T} w, \qquad (5)

where w = \Sigma^{-1}(\mu_G - \mu_B) and y = 0.5\,(\mu_G^{T}\Sigma^{-1}\mu_G - \mu_B^{T}\Sigma^{-1}\mu_B) + \log\left(\frac{D\,p_B}{L\,p_G}\right), where D is the expected loss when a bad is accepted and L is the expected foregone profit when a good is rejected. For estimation the population means μ_G and μ_B and the covariance matrix are estimated by their sample values.
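The following sketch implements the Fisher/LDA direction and the cost-adjusted threshold of Eq. (5) with sample estimates; the simulated data, the cost values D and L, and the use of a pooled covariance estimate are illustrative assumptions.

```python
import numpy as np

def lda_scorecard(XG, XB, D=5.0, L=1.0):
    """Fisher/LDA direction w proportional to S^-1 (xbar_G - xbar_B) and the
    cost-adjusted threshold y of Eq. (5), with D the expected loss from accepting
    a bad and L the foregone profit from rejecting a good (sample estimates)."""
    xg, xb = XG.mean(axis=0), XB.mean(axis=0)
    # Pooled within-group covariance as the estimate of the common Sigma.
    S = ((len(XG) - 1) * np.cov(XG, rowvar=False)
         + (len(XB) - 1) * np.cov(XB, rowvar=False)) / (len(XG) + len(XB) - 2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (xg - xb)
    p_g = len(XG) / (len(XG) + len(XB))
    p_b = 1.0 - p_g
    y = 0.5 * (xg @ S_inv @ xg - xb @ S_inv @ xb) + np.log(D * p_b / (L * p_g))
    return w, y

rng = np.random.default_rng(0)
XG = rng.normal([3.0, 2.0], 1.0, size=(60, 2))   # hypothetical goods
XB = rng.normal([1.5, 1.0], 1.0, size=(40, 2))   # hypothetical bads
w, y = lda_scorecard(XG, XB)
accept_good = XG @ w >= y        # classify to A_G when x'w is at least y
print(w, y, accept_good.mean())
```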
Another common rule is a simple linear model estimated by Ordinary Least Squares where the dependent variable takes on the values of 1 or 0 according to whether the applicant has defaulted or not. Unfortunately this has the disadvantage that predicted probabilities can lie outside the range [0, 1], which is a characteristic not shared with logistic regression. This may not matter if one only requires a ranking of probabilities of default, but may be problematic if an estimated probability of default (PD) is required for capital adequacy purposes.

3.2. Mathematical programming

More recently non-statistical methods have been developed. Mathematical programming derived classifiers have found favour in several consultancies. Here one wishes to use MP to estimate the weights in a linear classifier so that the weighted sum of the application values for goods is above a cut-off and the weighted sum of values for the bads is below the cut-off. Since we cannot expect to gain perfect separation we need to add error term variables which are positive or zero to the equation for the goods and subtract them from the equation for the bads. We wish to minimise the sum of the errors over all cases, subject to the above classification rule. Arranging applicants so that the first n_G are good and the remaining n_B cases are bad, the set up is

\min \; a_1 + a_2 + \cdots + a_{n_G + n_B}
\text{s.t.} \quad w_1 x_{1i} + w_2 x_{2i} + \cdots + w_p x_{pi} > c - a_i, \quad 1 \le i \le n_G,
\quad\;\; w_1 x_{1i} + w_2 x_{2i} + \cdots + w_p x_{pi} \le c + a_i, \quad n_G + 1 \le i \le n_G + n_B,
\quad\;\; a_i \ge 0, \quad 1 \le i \le n_G + n_B. \qquad (6)

To avoid the trivial solution of the weights and cut-off being set to zero one needs to solve the program once with the cut-off set to be positive and once when negative. But even then the solution may vary under linear transformations of the data. Various modifications have been suggested to overcome this problem. An alternative to the set up of Eq. (6) is to minimise the total cost of misclassifying applicants, so the objective function is

\min \; L \sum_{i=1}^{n_G} m_i + D \sum_{i=n_G+1}^{n_G+n_B} m_i, \qquad (7)

where m_i is a dummy taking on values of 1 if a case is misclassified and 0 otherwise and L and D are as above. The constraints are appropriately modified. However this method is feasible only for relatively small samples due to its computational complexity and it is also possible that several different optimal solutions may exist with very different performances on the holdout sample. Work by Glen (1999) has amended the algorithm to avoid some of these problems.
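A sketch of the linear program in Eq. (6), solved here with scipy's linprog after fixing the cut-off at a positive constant (as discussed above) and treating the strict inequality for the goods as a weak one; the data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lp_scorecard(X_good, X_bad, c=1.0):
    """Eq. (6): choose weights w and errors a_i >= 0 minimising sum(a_i)
    s.t. w'x_i >= c - a_i for goods and w'x_i <= c + a_i for bads.
    The cut-off is fixed at a positive c to rule out the trivial zero solution."""
    p = X_good.shape[1]
    n_g, n_b = len(X_good), len(X_bad)
    n = n_g + n_b
    obj = np.concatenate([np.zeros(p), np.ones(n)])      # minimise the sum of errors
    # Goods:  -w'x_i - a_i <= -c   (i.e. w'x_i + a_i >= c)
    A_good = np.hstack([-X_good, -np.eye(n_g, n)])
    # Bads:    w'x_i - a_i <= c    (error columns offset to the bads' block)
    A_bad = np.hstack([X_bad, np.zeros((n_b, n_g)), -np.eye(n_b)])
    A_ub = np.vstack([A_good, A_bad])
    b_ub = np.concatenate([-c * np.ones(n_g), c * np.ones(n_b)])
    bounds = [(None, None)] * p + [(0, None)] * n        # w free, a_i >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p], c

X_good = np.array([[3.0, 2.5], [2.5, 3.0], [3.5, 2.0], [2.8, 2.9]])
X_bad = np.array([[1.0, 1.2], [0.8, 0.5], [1.5, 0.9]])
w, cut = lp_scorecard(X_good, X_bad)
print(w, X_good @ w >= cut, X_bad @ w >= cut)
```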
3.3. Neural networks

Neural networks (NN) are a further group of algorithms. Here each characteristic is taken as an "input" and a linear combination of them is taken with arbitrary weights. The structure of a very simple multilayer perceptron is shown in Fig. 4. The central column of circles is a hidden layer; the final circle is the output layer. The characteristics are linearly combined and subject to a non-linear transformation represented by the g and h functions, then fed as inputs into the next layer for similar manipulation. The final function yields values which can be compared with a cut-off for classification (Haykin, 1999). A large choice of activation functions is available (logistic, exponential etc.). Each training case is submitted to the network, the final output compared with the observed value (0 or 1 in this case) and the difference, the error, is propagated back through the network and the weights modified at each layer according to the contribution each weight makes to the error value (the back propagation algorithm). In essence the network takes data in characteristics space, transforms it using the weights and activation functions into hidden value space (the "g" values) and then possibly into further hidden value space, if further layers exist, and eventually into output layer space which is linearly separable. In principle only one hidden layer is needed to separate non-linearly separable data and only two hidden layers are needed to separate data where the observations are in distinct and separate regions.

Fig. 4. A two layer neural network. (Inputs X_1 (age), X_2 (income), ..., X_p (years at address) are combined with weights w into a hidden layer of q neurons with activations f_j = exp(w^T X + c_j), which are in turn combined with weights k into the output z = exp(k^T f + d), the good/bad prediction.)

Neural networks have a weakness when used to score credit applicants: the resulting set of classification rules is not easily interpretable in terms of the original input variables. This makes it difficult to use if, as is required in the US and UK, a rejected applicant is given a reason for being rejected. However several algorithms are available that try to derive interpretable rules from a NN. These algorithms are generally of two types: decompositional and pedagogical. The former take an estimated network and prune connections between neurons subject to the difference in predictions between the original network and the new one being within an accepted tolerance. Pedagogical techniques estimate a simplified set of classification rules which is both interpretable and again gives acceptably similar classifications to the original network. Neurolinear is an example of the first type and Trepan an example of the second. Baesens (2003) finds that these approaches can be successful. Of course there is no need to explain the outcomes when the network is being used to detect fraud, and whilst networks have been and are used by consultancies for application scoring, it is fraud scoring where arguably they have been most commonly used.

3.4. Other classifiers

A similar type of classifier uses fuzzy rather than crisp rules to discriminate between classes. A crisp rule is one which precisely defines a membership criterion (for example "poor" if income is less than $10k) whereas a fuzzy rule allows membership criteria to be less precisely defined, with membership depending possibly on several other less precisely defined criteria. The predictive accuracy of fuzzy rule extraction techniques applied to credit scoring data has been mixed relative to other classifiers. Baesens (2003) finds their performance to be poorer, but not significantly so, than linear discriminant analysis (LDA), NN and decision trees. But Piramuthu (1999) found the performance of fuzzy classifiers to be significantly worse than NN when applied to credit data.

Classification trees are another type of classifier. Here a characteristic is chosen and one splits the set of attributes into two classes that best separate the goods from the bads. Each case is ascribed to one branch of the classification tree. Each subclass is again divided according to the attribute value of a characteristic that again gives the best separation between the goods and bads in that sub-sample or node. The sample splitting continues until few cases remain in each branch or the difference in good rate
between the subsets after splitting is small. This process is also called the Recursive Partitioning Algorithm. The tree is then pruned back until the resulting splits are thought to give an acceptable classification performance in the population. Various statistics have been used to evaluate how well the goods and bads are separated at the end of any one branch, and so which combination of attribute values to allocate to one branch rather than to the other if the characteristic is a nominal one. One example is the KS statistic described above. Another common statistic advocated by Quinlan (1963) is the entropy index defined as

I(x) = -P(G|x)\log_2(P(G|x)) - P(B|x)\log_2(P(B|x)), \qquad (8)

which relates to the number of different ways one could get the good–bad split in a node and so indicates the information contained in the split. The analyst must decide which splitting rule to use, when to stop pruning, and how to assign terminal nodes to good and bad categories.
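The entropy index of Eq. (8) and a size-weighted comparison of two candidate splits might be computed as below; the node counts are hypothetical, and weighting the child nodes by their size is one common convention rather than a prescription taken from the paper.

```python
import numpy as np

def entropy_index(p_good):
    """I(x) of Eq. (8): -P(G|x) log2 P(G|x) - P(B|x) log2 P(B|x)."""
    p_bad = 1.0 - p_good
    terms = [p * np.log2(p) for p in (p_good, p_bad) if p > 0.0]
    return -sum(terms)

def weighted_entropy_after_split(goods_left, total_left, goods_right, total_right):
    """Average entropy of the two child nodes, weighted by node size; a lower
    value indicates a split giving better separation of goods and bads."""
    n = total_left + total_right
    e_left = entropy_index(goods_left / total_left)
    e_right = entropy_index(goods_right / total_right)
    return (total_left / n) * e_left + (total_right / n) * e_right

# Hypothetical node: 70 goods and 30 bads, two candidate binary splits.
print(entropy_index(0.70))                              # parent node
print(weighted_entropy_after_split(55, 60, 15, 40))     # candidate split A
print(weighted_entropy_after_split(40, 50, 30, 50))     # candidate split B
```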
Logistic regression is the most commonly used technique in developing scorecards, but linear regression, mathematical programming, classification trees and neural networks have also been used in practice. A number of other classification techniques such as genetic algorithms, support vector machines (SVM) and nearest neighbours have been piloted but have not become established in the industry.

Genetic methods have also been applied to credit data. In a genetic algorithm (GA) a solution consisting of a string (or chromosome) of values (alleles) may represent, for each characteristic, whether a characteristic enters the decision rule and a range of values that characteristic may take. This is represented in binary form. Many candidate solutions are considered and a measure of the predictive accuracy or fitness is computed for each solution. From this sample of solutions a random sample is selected where the probability of selection depends on the fitness. Genetic operators are applied. In crossover the first n digits of one solution are swapped with the last n digits of another. In mutation specific alleles are flipped. The new children are put back into the population. The fitness of each solution is recomputed and the procedures repeated until a termination condition is met. GAs make use of the Schemata theorem (Holland, 1968) that essentially shows that if the average fitness of a schemata is above the average in the population of possible solutions then the expected number of strings in the population with this schemata will increase exponentially. GAs are optimisation algorithms rather than a class of classifier and may be applied as described or to optimise NNs, decision trees and so on. For example when GAs are applied to a decision tree, mutation involves randomly replacing the branches from a node in a tree and crossover involves swapping branches from a node in one tree with those in another tree (Ong et al., 2005).

Support vector machines (SVM) can be considered as an extension of the mathematical programming approach, but where the scorecard is a weighted linear combination of functions of the variables, not just a weighted linear combination of the variables themselves. The objective is for all the goods to have "scores" greater than c + 1 and the bads to have "scores" below c − 1, so that there is a normalized gap of 2/‖w‖ between the two groups. The objective is also extended to minimize a combination of the size of the misclassification errors and the inverse of this gap. In fact, the algorithms work by solving the dual problem when the dual kernel is chosen to be of a certain form. Unfortunately one cannot invert this to obtain the corresponding primal functions and thus an explicit expression for the scorecard.
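A brief SVM sketch using scikit-learn's SVC with an RBF kernel on hypothetical data; consistent with the discussion above, the fitted model is used through its decision function rather than through explicit primal weights, which are not recoverable for a non-linear kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_good = rng.normal([2.5, 2.5], 0.8, size=(80, 2))   # hypothetical goods
X_bad = rng.normal([1.0, 1.0], 0.8, size=(50, 2))    # hypothetical bads
X = np.vstack([X_good, X_bad])
y = np.concatenate([np.ones(80), np.zeros(50)])

# C trades off the misclassification errors against the width 2/||w|| of the
# margin; the RBF kernel makes the scorecard a weighted combination of
# functions of the characteristics rather than of the characteristics themselves.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# decision_function gives a signed margin score that can be cut off at 0.
print(clf.n_support_, (clf.decision_function(X) >= 0).mean())
```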
Many other classifiers may be used to divide applicants into good and bad classes. These include K-nearest neighbour techniques and various Bayesian methods. Some comparisons can be made between the classifiers. LR, LDA, and MP are all linear classifiers: in input space the classifying function is linear. This is not so for quadratic discriminant analysis, where the covariances of the input variables are not the same in the goods and bads, nor in NNs, SVMs or decision trees. Second, all techniques require some parameter selection by the analyst. For example, in LR it is the assumption that a logistic function will best separate the two classes; in NN the number of hidden layers, the types of activation function, the type of error function, the number of epochs and so on; in trees it is the splitting and stopping criteria etc. Third, a common approach to devising a non-linear classifier is to map data from input space into a higher dimensional feature space. This is done with least restrictions in SVMs but is also done in NNs. Furthermore several researchers have experimented with combining classifiers. For example Schebesch and Stecking (2005) have used SVMs to select cases
(the support vectors) to present to a LDA. Others have used trees to select variables to use in a LR. Yet others have used a stepwise routine in LDA to select the variables for inclusion in a NN (Lee et al., 2002). Kao and Chiu (2001) have combined decision trees and NNs.

Whilst many comparisons of the predictive performance of these classifiers have been made, relatively few have been conducted using consumer credit data and it is possible that the data structures in application data are more amenable to one type of classifier rather than to others. Table 2 gives a comparison of published results, although caution is required since the table shows only the error rate, the papers differ in how the cut-off was set, the errors are not weighted according to the relative costs of misclassifying a bad and a good, and few papers use inferential tests to see if the differences are significant. For these reasons the figures can only be compared across a row, not between studies. For example Henley's results show a much lower proportion correctly classified because the proportion of defaults in his sample was very different from that in other studies. Nevertheless it would seem from the table that SVMs, when included, do perform better than other classifiers, a result that is consistent with comparisons using data from other contexts. Neural networks also perform well, as we would expect bearing in mind that a NN with a logistic activation function encompasses a LR model. Given that small improvements in classification accuracy result in millions of dollars of additional profits, attempts to increase accuracy by devising new algorithms are likely to continue.

However Hand (2006) argues that the relative performances illustrated in comparisons like Table 2 overemphasise the likely differences in predictive performance when classifiers are used in practice. One reason is associated with sampling issues. Over time "population drift" occurs whereby changes in p(G|x) occur due, in the case of LDA, to changes in p(G) or p(x|G), or in the case of LR to changes, not in the parameters of the model, but in the distributions of variables not included in the model but which affect the posterior probability. Another reason is that the training sample may not represent the population of all applicants, as we explain below in Section 4. Further, since the definitions of "good" and "bad" which are used are arbitrary, they may change between the time the scorecard was developed and the time when it is implemented. A model which performs well relative to other classifiers on one definition may perform relatively less well on another. Moreover each classifier typically optimizes a measure of fit (for example maximum likelihood in the case of LR) but its performance in practice may be judged by a very different criterion, for example cost-weighted error rates, for which there are many possibilities. So the relative performances of classifiers may differ according to the fitness measure used in practice.

3.5. Comparisons with corporate risk models

The methods used to predict the probability of default for individuals that we have considered so far may be contrasted with those applied in the corporate sector. One can distinguish between the risk of lending or extending a facility to an individual company and the collective risk of lending to a portfolio of companies. (A corresponding separation applies to loans to individuals.) For the former a number of credit rating agencies offer opinions as to the credit worthiness of lending to large quoted companies, which are typically indicated by an alphabetic ordinal scale, for example S&P's scale of "AAA", "AA", ..., "CCC" labels. Different agencies' scales often have different meanings. Moody's scale indicates a view of likely expected loss (including PD and loss given default) of extending a facility whereas S&P's scale reflects an opinion about the PD of an issuer across all of its loans. The details of the methods used to construct a scale are proprietary; however they involve both quantitative analyses of financial ratios and reviews of the competitive environment, including regulations and actions of competitors, and also internal factors such as managerial quality and strategies. The senior management of the firm to be rated is consulted and a committee within the agency would discuss its proposed opinion before it is made public.

Although there are monotonic relationships between PDs and ordinal risk categories, the risk category is not supposed to indicate a precise PD for a borrower. Some agencies try to produce a classification which is invariant through a business cycle unless a major deterioration in the credit worthiness of the firm is detected (but in practice some variation through the cycle is apparent (De Servigny and Renault, 2004)), while others give ratings which reflect the point in time credit risk.
Table 2
Relative predictive accuracy of different classifiers using credit application data (percentage correctly classified)

Author | Linear regression or LDA | Logistic regression | Decision trees | Math programming | Neural nets | Genetic algorithms | Genetic programming | K-nearest neighbours | Support vector machines
Srinivisan and Kim (1987) | 87.5 | 89.3 | 93.2 | 86.1 | | | | |
Boyle et al. (1992) | 77.5 | | 75.0 | 74.7 | | | | |
Henley (1995) e | | 43.4 | 43.3 | | | | | 43.8 |
Desai et al. (1997) | 66.5 | 67.3 | | | 66.4 | | | |
Yobas et al. (2000) | 68.4 | | 62.3 | | 62.0 | 64.7 | | |
West (2000) a | 79.3 | 81.8 | 77.0 | | 82.6 | | | 76.7 |
Lee et al. (2002) | 71.4 | 73.5 | | | 73.7 (77.0) b | | | |
Malhotra and Malhotra (2003) | 69.3 | | | | 72.0 | | | |
Baesens (2003) c | 79.3 | 79.3 | 77.0 | 79.0 | 79.4 | | | 78.2 | 79.7
Ong et al. (2005) d | | 80.8 | 78.4 | | 81.7 | | 82.8 | |

a Figures are an average across two data sets.
b Hybrid LDA and NN.
c Figures are averaged across eight data sets. Results are not given for fuzzy extraction rules because benchmark methods had different results compared with those shown in the columns in the table.
d Figures are averages over two data sets.
e Henley's data had a much higher proportion of defaults than did the other studies.

An alternative approach, which is typified by KMV's Credit Monitor, is to deduce the probability
of default from a variant of Merton's (1974) model of debt pricing. This type of model is called a "structural model". The essential idea is that a firm will default on its debt if the debt outstanding exceeds the value of the firm. When a firm issues debt the equity holders essentially give the firm's assets to the debt holders but have an option to buy the assets back by paying off the debt at a future date. If the value of the assets rises above the value of the debt the equity holders gain all of this excess. So the payoff to the equity holders is like a payoff from a call option on the value of the firm at a strike price of the repayment amount of the debt. Merton makes many assumptions including that the value of the firm follows a Brownian motion process. Using the Black and Scholes (1973) option pricing model the PD of the bond can be recovered. Whilst in Merton the PDs follow a cumulative normal distribution, PD = N(−DD) where DD is the distance to default, this is not assumed in KMV Credit Monitor, where DD is mapped to a default probability using a proprietary frequency distribution over DD values.

For non-quoted companies the market value of the firm is unavailable and the Merton model cannot be used. Then, corporate risk modelers often resort to methods used to model consumer credit as described in the last few sections.

4. Reject inference

A possible concern for credit risk modelers is that we want a model that will classify new applicants but we can only train on a sample of applicants that were accepted and who took up the offer of credit. The repayment performance of the rejected cases is missing. There are two possible problems here. First, the possibility arises that the parameters of such a model may be biased estimates of those for the intended population. Second, without knowing the distribution of goods and bads in the applicant population one cannot know the predictive performance of the new model on this population.

The first problem is related to Rubin's classification of missing data mechanisms (Little and Rubin, 1987). The two types of missing data mechanisms that are relevant here are missing at random (MAR) and missing not at random (MNAR). In the MAR case the mechanism that causes the data to be missing is not related to the variable of interest. MNAR occurs when the missingness mechanism is closely related to the variable of interest. So if D denotes default, A denotes whether an applicant was accepted, P(D) is to be predicted using predictor variables X₁ and we may try to explain P(A) in terms of X₂, we can write:

P(D|X_1) = P(D|X_1, A = 1)\,P(A = 1|X_2) + P(D|X_1, A = 0)\,P(A = 0|X_2). \qquad (9)

In MAR, P(D|X₁, A = 1) = P(D|X₁, A = 0), so P(D|X₁) = P(D|X₁, A = 1), whereas in MNAR P(D|X₁, A = 1) ≠ P(D|X₁, A = 0) and the sample of accepted applicants will give biased estimates of the population parameters. Hand and Henley (1994) have shown that if the X₁ set encompasses the original variables used to accept or reject applicants we have MAR, otherwise we have MNAR. If we have MAR then a classifier like LR, NNets, and SVMs which do not make distributional assumptions will yield unbiased estimates.

Researchers have experimented with several possible ways of incorporating rejected cases in the estimation of a model applicable to all applicants. These have included the EM algorithm (Feelders, 2000) and multiple imputation, but these, together with imputation by Markov Chain Monte Carlo simulation, have usually assumed the missingness mechanism is MAR when it is not known if this is the case. If it is the case then such procedures will not correct for bias since none would be expected. However if we face MNAR then we need to model the missingness mechanism. An example of this procedure is Heckman's two stage sample selection model (Heckman, 1979).

In practice a weighting procedure known as augmentation (or variants of it) is used. Augmentation involves two stages. In the first stage one estimates P(A|X₂) using an ex post classifier and the sample of all applicants including those rejected. P(A|X₂) is then predicted for those previously accepted cases and each is ascribed a "sampling" weight of 1/P(A|X₂) in the final LR. Thus each accepted case represents 1/P(A|X₂) cases and it is assumed that at each P(A|X₂) value P(D|X₁, A = 1) = P(D|X₁, A = 0). That is, the proportion of goods at that value of P(A|X₂) amongst the accepts equals that amongst the rejects, and so the accepts at this level of P(A|X₂) can be weighted up to represent the distribution of goods and bads at that level of P(A|X₂) in the applicant population as a whole.
to be missing is not related to the variable of inter- An alternative possible solution is to use a vari-
est. MNAR occurs when the missingness mecha- ant of Heckman’s sample selection method whereby
nism is closely related to the variable of interest. one estimates a bivariate probit model with selec-
So if D denotes default, A denotes whether an appli- tion (BVP) which models the probability of default
and the probability of acceptance, allowing for the fact that observed default is missing for those not accepted (Crook and Banasik, 2003). However this method has several weaknesses including the assumption that the residuals for the two equations are bivariate normal and that they are correlated. The last requirement prevents its use if the original accept model can be perfectly replicated.

There is little empirical evidence on the size of this possible bias because its evaluation depends on access to a sample where all those who applied were accepted, and this is very rarely done by lenders. However a few studies have access to a sample which almost fits this requirement and they find that the magnitude of the bias, and so the scope for correction, is very small. Experimentation has shown that the size of the bias increases monotonically if the cut-off used on the legacy scoring model increases and the accept rate decreases (Crook and Banasik, 2004). Using the training sample rather than an all applicant sample as holdout leads to a considerable overestimate of a model's performance, particularly at high original cut-offs. Augmentation has been found to actually reduce performance compared with using LR for accepted cases only. Recent work, some of which is reported on in this issue, investigates combining both the BVP and augmentation approaches, altering the BVP approach so that the restrictive distributional assumptions about the error terms are relaxed, and uses branch and bound algorithms.

5. Profit scoring

Since the mid 1990s a new strand of research has explored various ways in which scorecards may be used to maximize profit. Much of this work is proprietary, but many significant contributions to the literature have also been made. One active area concerns cut-off strategy. In a series of path breaking papers Oliver and colleagues (Marshall and Oliver, 1993; Hoadley and Oliver, 1998; Oliver and Wells, 2001) have formulated the problem as follows. For a given scoring model one can plot the expected bad rate (i.e. the expected proportion of all accepts that turn out to be bad, or (1 − F(s))p_B where F(s) is the proportion of applicants with a score less than s) against the acceptance rate (i.e. the proportion of applicants that are accepted, or (1 − F(s))). This is shown in Fig. 5 and is called a strategy curve. A perfectly discriminating model will, as the cut-off score increases, accept all goods amongst those that apply and then accept only bads. In Fig. 5, if it is assumed that 95% of applicants are good, the perfect scorecard is represented by the straight line along the horizontal axis until the accept rate is 95%; thereafter the curve slopes up at 45°. A replacement scoring model may be represented by the dotted line. The bank may currently be at point P and the bank has a choice of increasing the acceptance rate whilst maintaining the current expected bad rate (A), or keeping the current acceptance rate and decreasing the expected bad rate (B), or somewhere in between. The coordinates of all of these points can be calculated from parameters yielded by the estimation of the scorecard.

Fig. 5. The strategy curve. (Expected bad rate, (1 − F_B(s))p_B, plotted against the acceptance rate, 1 − F(s); the current operating point P and the alternative points A and B lie on the curves, with the perfect scorecard flat until an acceptance rate of 0.95.)
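A sketch of how points on a strategy curve might be traced out from a scored applicant sample, here taking the expected bad rate to be the bad rate among accepted applicants; the scores and the 5% population bad rate are hypothetical.

```python
import numpy as np

# Hypothetical applicant scores and a population bad rate p_B of 5%.
rng = np.random.default_rng(3)
is_bad = rng.random(5000) < 0.05
scores = np.where(is_bad, rng.normal(0.45, 0.12, 5000), rng.normal(0.70, 0.12, 5000))

cut_offs = np.linspace(scores.min(), scores.max(), 50)
acceptance_rate = [(scores >= c).mean() for c in cut_offs]           # 1 - F(s)
bad_rate = [is_bad[scores >= c].mean() if (scores >= c).any() else 0.0
            for c in cut_offs]                                       # bad rate among accepts

# Each (acceptance_rate, bad_rate) pair is one point on the strategy curve;
# a lender can read off how the expected bad rate trades against volume.
for a, b in list(zip(acceptance_rate, bad_rate))[::10]:
    print(round(a, 2), round(b, 3))
```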
One can go further and plot expected losses against expected profits (Fig. 6). For a given scoring model, as one lowers the cut-off an increasing proportion of accepts are bad. Given that the loss from an accepted bad is many times greater than the profit to be made from a good, this reduction in cut-off at first increases profits and the expected loss, but eventually the increase in loss from the ever increasing proportion of bads outweighs the increase in profits and profits fall. Now, the bank will wish to be on the lower portion of the curve,
which is called the efficiency frontier. The bank may now find its current position and, by choice of cut-off and adoption of the new model, decide where on the efficiency frontier of a new model it wishes to locate: whether to maintain expected losses and increase expected profits, or maintain expected profit and reduce expected losses. Oliver and Wells (2001) derive a condition for maximum profits which is, unsurprisingly, that the odds of being good equal the ratio of the loss from a bad divided by the gain from a good (D/L). Interestingly, an expression for the efficiency frontier can be gained by solving the non-linear problem of minimizing expected loss subject to expected profits being at least a certain value P₀. The implied cut-off is then expressed as H^{-1}(P_0/(L\,p_g)), where H is an expression for expected profits and all values can be derived from a parameterized scoring model.

Fig. 6. The efficiency curve. (Expected losses plotted against expected profit; the lower portion of the curve is the efficiency frontier.)
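The Oliver and Wells condition quoted above (good:bad odds equal to D/L at the profit-maximising cut-off) can be located on an estimated score-to-probability relationship as in this sketch; the calibration curve and the cost values are hypothetical.

```python
import numpy as np

def profit_maximising_cutoff(scores, p_good, D, L):
    """Return the lowest score at which the estimated odds of being good,
    p_g / (1 - p_g), reach D/L -- the condition for maximum profit quoted above,
    with D the loss from accepting a bad and L the profit foregone on a good."""
    odds = p_good / (1.0 - p_good)
    ok = odds >= D / L
    return scores[ok].min() if ok.any() else None

scores = np.linspace(0.0, 1.0, 101)
p_good = 1.0 / (1.0 + np.exp(-(scores * 12.0 - 6.0)))   # hypothetical calibration
print(profit_maximising_cutoff(scores, p_good, D=10.0, L=1.0))
```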
So far the efficiency frontier model has omitted interest expense on funds borrowed, operating costs and the costs of holding regulatory capital as required under Basel 2. In ongoing work the model has been amended by Oliver and Thomas (2005) by incorporating these elements into the profit function. They show that for a risk neutral bank the requirements of Basel 2 unambiguously increase the cut-off score for any scorecard and so reduce the number of loans made compared with no minimum capital requirement.

Some recent theoretical models of bank lending have turned their attention to more general objectives of a bank than merely expected profit. For example Keeney and Oliver (2005) have considered the optimal characteristics of a loan offer from the point of view of a lender, given that the utility of the lender depends positively on expected profit and on market share. The lender's expected profit depends on the interest rate and on the amount of the loan made to a potential borrower. To maximize this utility a lender has to consider the effects of both arguments on the probability that a potential borrower will take a loan offered and so on market share. The utility of a potential borrower is assumed to depend positively on the loan amount and negatively on the interest rate. It is also crucially assumed that as the interest rate and amount borrowed increase, whilst keeping the borrower at the same level of utility and so market share, the expected profits of the lender increase, reach a maximum and then decrease. By comparing the maximum expected profit at each utility level the lender can work out the combination of interest rate and loan amount to offer to maximize its expected profit. Because levels of the interest rate have opposite effects for a borrower and a lender, as one increases this rate and reduces the loan amount, the probability of acceptance and so market share goes down whilst at first expected profits rise. The lender has to choose the combination of market share and expected profit it desires and then compute the interest rate and loan amount to offer to the applicant to achieve this. Notice however some limitations of this model. First, it assumes that there is only one lender, since the probability of acceptance is not made dependent on the offers made by rival lenders. Second, the model omits learning by both lender and potential borrower. Third, it is very difficult to implement in practice since in principle one would need the utility function for every individual potential borrower.

This and associated models have spawned much new research. For example some progress has been made on finding the factors that affect the probability of acceptance. Seow and Thomas (2007) have presented students with hypothetical offers of a bank account and each potential borrower had a different set of (hypothetical) account characteristics offered to them. The aim was to find the applicant and account characteristics that maximized acceptance probability. Unlike earlier work, by using a classification tree in which offer characteristics varied according to the branch the applicant was on, this paper does not assume applicants have the same utility function.
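The trade-off described above can be illustrated with a toy search over offers. The sketch below follows the Keeney and Oliver style of reasoning rather than their actual model: the take-up (acceptance) probability, the profit function and the weight placed on market share are all hypothetical assumptions chosen only to show the mechanics.

```python
import numpy as np

def acceptance_prob(rate, amount):
    # Hypothetical take-up model: falls as the rate rises, rises with the amount offered.
    return 1.0 / (1.0 + np.exp(-(2.0 - 40.0 * rate + 0.00005 * amount)))

def expected_profit(rate, amount, pd=0.05, lgd=0.8, funding_cost=0.04):
    # Hypothetical one-period profit per offer: margin if taken up, less expected loss.
    return acceptance_prob(rate, amount) * ((rate - funding_cost) * amount
                                            - pd * lgd * amount)

def best_offer(weight_on_share=0.3):
    """Grid-search (rate, amount) offers, trading expected profit off against
    market share (take-up probability) through a simple additive utility."""
    best, best_utility = None, -np.inf
    for rate in np.linspace(0.05, 0.25, 41):
        for amount in np.linspace(1000, 20000, 39):
            utility = ((1 - weight_on_share) * expected_profit(rate, amount)
                       + weight_on_share * 1000 * acceptance_prob(rate, amount))
            if utility > best_utility:       # the 1000 only puts share on a money scale
                best, best_utility = (round(rate, 3), round(amount)), utility
    return best

print(best_offer())
```

Raising weight_on_share pushes the chosen offer towards lower rates and higher take-up, which is exactly the tension between profit and market share discussed above.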
6. The Basel 2 accord

A major research agenda has developed from the Basel 2 Accord. The Basel Committee on Banking Supervision was set up in 1975 by the Governors of the G10 most industrialised countries. It has no formal supervisory authority but it states supervisory guidelines and expects legal regulators in major economies to implement them. The aims of the guidelines were to stabilize the international banking system by setting standardised criteria for capital levels and a common framework to facilitate fair banking competition. The need for such standards was perceived after major reorganisations in the banking industry, certain financial crises and bank failures, the very rapid development of new financial products and increased banking competition.

The first Basel accord (Basel, 1988) stated that a bank's capital had to exceed 8% of its risk weighted assets. But the accord had many weaknesses (see De Servigny and Renault, 2004) and over the last seven years the BIS has issued several updated standards which have come to be known as the Basel 2 accord. It is due to be adopted from 2007–08. (Pillar 3 does not take effect until later.)

The Basel 2 Accord (Basel, 2005) states that sound regulation consists of three 'pillars': minimum capital requirements, a Supervisory Review Process and market discipline. We are concerned with the first. Each bank has a choice of method it may use to estimate values for the capital it needs to set aside. It may use a Standardised Approach, in which it has to set aside 2.8% of the loan value for mortgage loans and 6% for regulatory retail loans (to individuals or SMEs for revolving credit, lines of credit, personal loans and leases). Alternatively the bank may use Internal Ratings Based (IRB) approaches, in which case the parameters are estimated by the bank. Notice that if a bank adopts an IRB approach for, say, its corporate exposures it must do so for all types of exposures. The attraction of the IRB approach over the Standardised Approach is that the former may enable a bank to hold less capital and so earn higher returns on equity than the latter. Therefore it is likely that most large banks will adopt the IRB approach.

Under the IRB approach a bank must retain capital to cover both expected and unexpected losses. Expected losses are to be built into the calculation of profitability that underlies the pricing of the product, and so are covered by these profits. The unexpected losses are what the regulatory capital seeks to cover. For each type of retail exposure that is not in default the regulatory capital requirement, k, is calculated as

k = LGD · N[(1/√(1 − R)) · N⁻¹(PD) + √(R/(1 − R)) · N⁻¹(0.999)] − PD · LGD,   (10)

where LGD is loss given default (the proportion of an exposure which, if in default, would not be recovered); N is the cumulative distribution function for a standard normal random variable; N⁻¹ is the inverse of N; R is the correlation between the performance of different exposures; and PD is the probability of default. R is set at 0.15 for mortgage loans, at 0.04 for revolving, unsecured exposures to individuals up to €100 k, and

R = 0.03 · (1 − exp(−35 · PD))/(1 − exp(−35)) + 0.16 · [1 − (1 − exp(−35 · PD))/(1 − exp(−35))]   (11)

for other retail exposures.

Intuitively the second term in Eq. (10) (involving the square brackets) represents the probability that the PD lies between its predicted value and its 99.9th percentile value. Its formal derivation is given in Schonbucher (2000) and Perli and Nayda (2004). The formula is derived by using the assumption that a firm will default when the value of its assets falls below a threshold which is common for all firms. The value of the firm is standard normally distributed and, crucially, the assets of all obligors depend on one common factor and a random element which is particular to each firm. The third term represents future margin income which could be used to reduce the capital required. This is included because the interest on each loan includes an element to cover losses on defaulted loans, and this can be used to cover losses before the bank would need to use its capital. The size of this income is estimated as the expected loss.

For each type of exposure the bank has to divide its loans into 'risk buckets', i.e. groups which differ in terms of the group's PD but within which PDs are similar. When allocating a loan to a group the bank must take into account borrower risk variables (sociodemographics etc.), transaction risk variables (aspects of collateral) and whether the loan is currently delinquent. To allocate loans into groups behavioural scoring models may be used, with possible overrides. The PD value for each group has to be valid over a five year period to cover a business cycle. For retail exposures several methods to estimate the PD of each pool are available, but most banks will use a behavioural scoring model.
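To make the capital calculation concrete, the sketch below evaluates Eq. (10) for a non-defaulted retail pool, using the mortgage and qualifying revolving correlations quoted above and the other-retail correlation of Eq. (11). It is a minimal illustration; the PD and LGD values in the example are hypothetical, and in practice exposure at default, defaulted exposures and the other supervisory details of the Accord add further elements.

```python
from math import exp, sqrt
from statistics import NormalDist

N = NormalDist().cdf          # standard normal cdf
N_inv = NormalDist().inv_cdf  # its inverse

def retail_correlation(pd, exposure_type):
    """Supervisory asset correlation R for retail exposures."""
    if exposure_type == "mortgage":
        return 0.15
    if exposure_type == "revolving":
        return 0.04
    # 'Other retail': Eq. (11), a PD-dependent blend between 0.03 and 0.16.
    w = (1 - exp(-35 * pd)) / (1 - exp(-35))
    return 0.03 * w + 0.16 * (1 - w)

def capital_requirement(pd, lgd, exposure_type="other"):
    """Capital per unit of exposure, k, from Eq. (10)."""
    r = retail_correlation(pd, exposure_type)
    stressed_pd = N(N_inv(pd) / sqrt(1 - r) + sqrt(r / (1 - r)) * N_inv(0.999))
    return lgd * stressed_pd - pd * lgd   # unexpected loss only; expected loss deducted

# Hypothetical pool: PD of 3%, LGD of 45%, 'other retail' exposure.
print(round(capital_requirement(0.03, 0.45, "other"), 4))
```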
The Basel 2 Accord has led to many new lines of research. The first issue is the appropriateness of the one factor model to estimate the unexpected loss of a risk group for consumer debt, since this model was originally devised for corporate default. Although Perli and Nayda (2004) build such a consumer model, several authors have questioned the plausibility of a household defaulting on unsecured debt if the value of the household's assets dips below a threshold. Instead new models involving reputational assets and jump process models have been tried. De Andrade and Thomas (2007) assume that a borrower will default when his reputation dips below a threshold. His reputation is a function of a latent unobserved variable representing his creditworthiness.

A second topic of research is the assumed parameter values in the Basel formula, specifically the assumed correlation values. De Andrade and Thomas found that the calculated correlation coefficient for 'other retail exposures' for a sample of Brazilian data was 2.28%, which is lower than the minimum correlation of 3% in the Basel 2 range. Lopez (2002) found average asset correlation to be negatively related to default probability for firms, which is the opposite of the relationship implied by Eq. (11), and to be positively related to firm size, which is not included in Eq. (11).

A third source of research is how to estimate PDs from portfolios with very low default proportions. In very low risk groups, especially in mortgage portfolios, very few, and sometimes no, defaults are observed. Estimating the average PD in such a group as close to zero would not reflect the conservatism implied by Basel 2. Pluto and Tasche (2005) assume the analyst knows at least the rating grade into which a borrower fits and calculate confidence intervals so that, by adopting a 'most prudent estimation principle', PD estimates can be made which are monotone in the risk groups. For example, consider the case of no defaults. Three groups, i = 1, 2, 3, in increasing order of risk, have n1, n2 and n3 cases respectively. The group risk ordering implies P1 ⩽ P2 ⩽ P3. The most prudent estimate of P1 implies P1 = P2 = P3. Assuming independence of defaults, the probability of observing zero defaults is (1 − P1)^(n1+n2+n3) and the confidence region at level c for P1 is such that P1 ⩽ 1 − (1 − c)^(1/(n1+n2+n3)). At a 99.9% confidence level, P1 = 0.86%. By similar reasoning the most prudent value of P2 is P2 = P3 and the upper boundary may be calculated (at 99.9% it is 0.98%). The difference in values is a consequence of the number of cases. In the case of a few defaults, and assuming the number of defaults follows a binomial distribution, the upper confidence interval can also be calculated. Pluto and Tasche extend the calculations using a one factor probit model to find the confidence interval for the PD when asset returns are correlated. This increases the PDs considerably and the new probabilities under the same assumptions as above were P1 = 5.3%, P2 = 5.8% and P3 = 9.8%.

Forrest (2005) adopted a similar approach of calculating the 95th percentile of the default distribution. But unlike Pluto and Tasche he calculated the PD as the value that maximises the likelihood of observing the actual number of defaults. So for a portfolio consisting of two grades, A with a higher PD than B and dA defaults and (1 − dA) exposures, and B with dB defaults and (1 − dB) exposures, and letting PA and PB indicate their respective PDs, PA and PB can be calculated by choosing their values to maximise

Likelihood L(PA, PB) = PA^dA (1 − PA)^(1−dA) PB^dB (1 − PB)^(1−dB)   (12)

constrained so that PA > PB.
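The zero-default bound described above is straightforward to reproduce. The sketch below is a minimal illustration of the 'most prudent estimation' calculation under independence; the pool sizes are hypothetical and not taken from the paper, and the correlated (one factor probit) extension is not shown.

```python
def most_prudent_pd_bounds(pool_sizes, confidence=0.999):
    """Upper confidence bounds on the PD of pools with zero observed defaults.

    Following the 'most prudent estimation' idea, the bound for pool i treats
    pools i, i+1, ... as sharing one PD, so it uses all of their exposures.
    pool_sizes must be ordered from least to most risky; defaults are assumed
    independent."""
    bounds = []
    for i in range(len(pool_sizes)):
        n = sum(pool_sizes[i:])
        # P(no defaults) = (1 - p)^n >= 1 - c  implies  p <= 1 - (1 - c)^(1/n)
        bounds.append(1.0 - (1.0 - confidence) ** (1.0 / n))
    return bounds

# Hypothetical pool sizes, least to most risky (not taken from the paper).
print([round(b, 4) for b in most_prudent_pd_bounds([100, 400, 300])])
```

Because each successive pool pools fewer exposures, the bounds are automatically monotone increasing in risk, which is the point of the construction.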
A further issue is how to validate estimates of PDs, LGDs and EADs. The application and behavioural scoring models outlined in Section 2 are required to provide only an ordinal ranking of cases, whereas ratio level values are required for Basel 2 compliance. Validation of PDs has two aspects: discriminatory power, which refers to the degree of separation of the distributions of scores between groups, and calibration, which refers to the accuracy of the PDs. We considered the former in an earlier section; here we consider the latter. The crucial issue is how to test whether the size of the difference between estimated ex ante PDs and ex post observed default rates is significant. Several tests have been considered by the BIS Research Task Force (2005) (BISTF), as follows. The test is required for each risk pool. The binomial test, which assumes the default events within the risk group are independent, tests the null hypothesis that the PD is underestimated. Although the BISTF argue that empirically estimated correlation coefficients between defaults within a risk group are around 0.5–3%, they assess the implications of different asset correlations by reworking the one factor model with this assumption. They find that when the returns are correlated, an unadjusted binomial test will be too conservative: it will reject the null when the correct probability of doing so is in fact much lower than intended. The BISTF reports that a chi-square test could be used to test the same hypotheses as the binomial test but is applied to the portfolio as a whole (i.e. to all risk groups), to circumvent the well known problem that multiple tests would be expected to indicate a rejection of the null in 5% of cases. But again it has the same problems as the binomial test when defaults are correlated. The normal test allows for correlated defaults and uses the property that the distribution of the standardised sum of differences between the observed default rate in each year and the predicted probability is normally distributed. The null is that 'none of the true probabilities of default in the years t = 1, 2, ..., T is greater than the forecast Pt'. However the BISTF find the test has a conservative bias: in too many cases the null is rejected. A further test is proposed by Blochwitz et al. (2004) and called the 'traffic lights test'. Ignoring dependency in credit scores over t, the distribution of the number of defaults is assumed to follow a multinomial distribution over four categories labelled 'green', 'yellow', 'orange' and 'red'. The probability that a linear combination of the number of defaults in each category exceeds a critical value is used to test the null that 'none of the true probabilities of default in years t = 1, 2, ..., T is greater than the corresponding forecast PDt'. Running simulation tests, the BISTF find the test is rather conservative.

After comparisons of these four inferential test statistics the BISTF concluded that currently 'no really powerful tests of adequate calibre are currently available' and call for more research.
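As an illustration of the calibration side of validation, the sketch below applies a one-sided binomial check to a single risk pool: it asks how likely the observed number of defaults (or more) would be if the forecast PD were correct. This is only a minimal sketch under the independence assumption criticised above; the pool size, forecast PD and default count are hypothetical, and the BISTF's formal statement of the null hypothesis differs slightly.

```python
from scipy.stats import binom

def binomial_calibration_pvalue(n_exposures, n_defaults, forecast_pd):
    """P(observing at least n_defaults | forecast_pd), assuming independent defaults.
    A very small value suggests the forecast PD is too low for this pool."""
    return binom.sf(n_defaults - 1, n_exposures, forecast_pd)

# Hypothetical pool: 2000 exposures, forecast PD of 2%, 55 defaults observed.
print(round(binomial_calibration_pvalue(2000, 55, 0.02), 4))
```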
7. Conclusion

In this paper we have reviewed a selection of current research topics in consumer credit risk assessment. Credit scoring is central to this process and involves deriving a prediction of the risk associated with making a loan. The commonest method of estimating a classifier of applicants into those likely to repay and those unlikely to repay is logistic regression, with the logit value compared with a cut-off. Many alternative classifiers have been developed in the last thirty years and the evidence suggests that the most accurate is likely to be support vector machines, although much more evidence is needed to confirm this tentative view. Behavioural scoring involves the same principles as application scoring, but now the analyst has evidence on the behaviour of the borrower with the lender and the scores are estimated after the loan product has been taken, so updated risk estimates can be made as the borrower services his account. Initial concerns about the estimation of classifiers based on only accepted applicants, rather than on a random sample of the population of applicants, seem to be rather overstated, but using only accepted applicants to assess the predictive performance of a classifier does appear likely to lead to significant over-optimism as to a predictor's performance. Scoring for profit is a major research initiative and has led to some important tools which have been employed by consultancies over the last few years. Arguably the most significant factor impacting on consumer credit scoring processes is the passing of the Basel 2 accord. This has determined the way in which banks in major countries calculate their reserve capital. The amount depends on the use of updated scoring models to estimate the probability of default and other methods to estimate the loss given default and amount owing at default. The method is based on a one factor Merton model but has been subject to major criticisms. The effect of Basel 2 on the competitive advantage of an institution is so large that research into the adequacy, applicability and validity of the adopted models is set to be a major research initiative in the foreseeable future.

There are many topics that we do not have space in this paper to consider (see Thomas et al. (2002) for further details). Consumer credit scoring is also used to decide to whom to make a direct mailshot; it is used by government agencies (for example tax authorities) when deciding who is likely to pay debts; and it is used by utilities when deciding whether to allow consumers to use electricity/gas etc. in advance of payment. A major use of scoring techniques is in assessing small business loan applications and their monitoring for credit limit adjustment, but this topic alone would fill a complete review paper.

Credit scoring and risk assessment has been one of the most successful applications of statistical and operational research concepts ever, with considerable social implications. It has made practical the assessment and granting of hundreds of millions of credit card applications and applications for other loan products, and so it has considerably improved the lifestyles of millions of people throughout the world. It has considerably increased competition in credit markets and arguably reduced the cost of borrowing to many. The techniques developed have been applied in a wide variety of decision making contexts, so reducing costs to those who present minimal risk to lenders.

References

Anderson, T.W., Goodman, L.A., 1957. Statistical inference about Markov chains. Annals of Mathematical Statistics 28, 89–110.
Baesens, B., 2003. Developing Intelligent Systems for Credit Scoring Using Machine Learning Techniques. Doctoral Thesis no. 180, Faculteit Economische en Toegepaste Economische Wetenschappen, Katholieke Universiteit, Leuven.
Basel, 1988. Basel Committee on Banking Supervision: International Convergence of Capital Measurement and Capital Standards. Bank for International Settlements, July 1988.
Basel, 2005. Basel Committee on Banking Supervision: International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Bank for International Settlements, November 2005.
Black, F., Scholes, M., 1973. The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–659.
Blochwitz, S., Hohl, S.D., Tasche, D., Wehn, C.S., 2004. Validating Default Probabilities in Short Time Series. Mimeo, Deutsche Bundesbank.
Boyle, M., Crook, J.N., Hamilton, R., Thomas, L.C., 1992. Methods for credit scoring applied to slow payers. In: Thomas, L.C., Crook, J.N., Edelman, D.E. (Eds.), Credit Scoring and Credit Control. Oxford University Press, Oxford.
Crook, J.N., Banasik, J., 2003. Sample selection bias in credit scoring models. Journal of the Operational Research Society 54, 822–832.
Crook, J., Banasik, J., 2004. Does reject inference really improve the performance of application scoring models? Journal of Banking and Finance 28, 857–874.
Cyert, R.M., Thompson, G.L., 1968. Selecting a portfolio of credit risks by Markov chains. The Journal of Business 1, 39–46.
De Andrade, F.W.M., Thomas, L.C., 2007. Structural models in consumer credit. European Journal of Operational Research, this issue, doi:10.1016/j.ejor.2006.07.049.
Desai, V.S., Conway, D.G., Crook, J.N., Overstreet, G.A., 1997. Credit scoring models in the credit union environment using neural networks and genetic algorithms. IMA Journal of Mathematics Applied in Business and Industry 8, 323–346.
De Servigny, A., Renault, O., 2004. Measuring and Managing Credit Risk. McGraw Hill, New York.
Durand, D., 1941. Risk Elements in Consumer Instalment Financing. National Bureau of Economic Research, New York.
Feelders, A.J., 2000. Credit scoring and reject inference with mixture models. International Journal of Intelligent Systems in Accounting, Finance and Management 9, 1–8.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188.
Forrest, A., 2005. Low Default Portfolios – Theory and Practice. Presentation to Credit Scoring and Credit Control 9 Conference, Edinburgh, 7–9 September 2005.
Frydman, H., 1984. Maximum likelihood estimation in the Mover–Stayer model. Journal of the American Statistical Association 79, 632–638.
Glen, J., 1999. Integer programming models for normalisation and variable selection in mathematical programming models for discriminant analysis. Journal of the Operational Research Society 50, 1043–1053.
Hand, D.J., 2005. Good practice in retail credit scorecard assessment. Journal of the Operational Research Society 56 (9), September.
Hand, D.J., 2006. Classifier technology and the illusion of progress. Statistical Science 21, 1–14.
Hand, D.J., Henley, W.E., 1994. Inference about rejected cases in discriminant analysis. In: Diday, E., Lechvallier, Y., Schader, P., Bertrand, P., Buntschy, B. (Eds.), New Approaches in Classification and Data Analysis. Springer-Verlag, Berlin, pp. 292–299.
Hand, D.J., Henley, W.E., 1997. Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society, Series A 160, 523–541.
Haykin, S., 1999. Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey.
Heckman, J.J., 1979. Sample selection bias as a specification problem. Econometrica 47, 153–161.
Henley, W.E., 1995. Statistical Aspects of Credit Scoring. Ph.D. thesis, Open University.
Hoadley, B., Oliver, R.M., 1998. Business measures of scorecard benefit. IMA Journal of Mathematics Applied in Business and Industry 9, 55–64.
Ho, J., Thomas, L.C., Pomroy, T.A., Scherer, W.T., 2004. Segmentation in Markov chain consumer credit behavioural models. In: Thomas, L.C., Edelman, D., Crook, J.N. (Eds.), Readings in Credit Scoring. Oxford University Press, Oxford.
Holland, J.H., 1968. Hierarchical Description of Universal and Adaptive Systems. Department of Computer and Communications Sciences, University of Michigan, Ann Arbor.
Kao, L.J., Chiu, C.C., 2001. Mining the customer credit by using the neural network model with classification and regression. In: Joint 9th IFSA World Congress and 20th NAFIPS International Conference, Proceedings, vols. 1–5, pp. 923–928.
Keeney, R.L., Oliver, R.M., 2005. Designing win–win financial loan products for consumers and businesses. Journal of the Operational Research Society 56 (9), September.
Lee, T.S., Chiu, C.C., Lu, C.J., Chen, I.F., 2002. Credit scoring using the hybrid neural discriminant technique. Expert Systems with Applications 23, 245–254.
Lewis, E., 1992. Introduction to Credit Scoring. The Athena Press, San Rafael.
Little, R.J., Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, New York.
Lopez, J., 2002. The Empirical Relationship Between Average Asset Correlation, Firm Probability of Default and Asset Size. Working Paper, Federal Reserve Bank of San Francisco.
Malhotra, R., Malhotra, D.K., 2003. Evaluating consumer loans using neural networks. Omega 31, 83–96.
Marshall, K.T., Oliver, R.M., 1993. Decision Making and Forecasting. McGraw-Hill, New York, pp. 121–128.
Oliver, R.M., Thomas, L.C., 2005. How Basel Capital Requirements Will Affect Optimal Cut-offs. Presented at Credit Scoring and Credit Control 9 Conference, Edinburgh, September.
Oliver, R.M., Wells, E., 2001. Efficient frontier cut-off policies in credit portfolios. Journal of the Operational Research Society 52, 1025–1033.
Ong, C.S., Huang, J.J., Tzeng, G.H., 2005. Building credit scoring systems using genetic programming. Expert Systems with Applications 29, 41–47.
Perli, R., Nayda, W.I., 2004. Economic and regulatory capital allocation for revolving retail exposures. Journal of Banking and Finance 28, 789–809.
Piramuthu, S., 1999. Financial credit-risk evaluation with neural and neurofuzzy systems. European Journal of Operational Research 112, 310–321.
Pluto, K., Tasche, D., 2005. Estimating Probabilities for Low Default Portfolios. Presented at Credit Scoring and Credit Control 9 Conference, Edinburgh, 2005.
Quinlan, J., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Rosenberg, E., Gleit, A., 1994. Quantitative methods in credit management: A survey. Operations Research 42, 589–613.
Schebesch, K.B., Stecking, R., 2005. Support vector machines for classifying and describing credit applicants: Detecting typical and critical regions. Journal of the Operational Research Society 56, 1082–1088.
Schonbucher, P.J., 2000. Factor Models for Portfolio Credit Risk. Mimeo, Department of Statistics, Bonn University, December 2000.
Seow, H.-V., Thomas, L.C., 2007. To ask or not to ask, that is the question. European Journal of Operational Research, this issue, doi:10.1016/j.ejor.2006.08.061.
Srinivasan, V., Kim, Y.H., 1987. Credit granting: A comparative analysis of classificatory procedures. Journal of Finance 42, 655–683.
Tasche, D., 2005. Rating and probability of default validation. In: BIS Research Task Force (2005), Basel Committee on Banking Supervision: Studies on the Validation of Internal Rating Systems. Bank for International Settlements, Working Paper No. 14, May 2005.
Thomas, L.C., 2000. A survey of credit and behavioural scoring: Forecasting financial risk of lending to consumers. International Journal of Forecasting 16, 149–172.
Thomas, L.C., Edelman, D.B., Crook, J.N., 2002. Credit Scoring and its Applications. SIAM, Philadelphia.
Till, R., Hand, D., 2001. Markov Models in Consumer Credit. Presented at Credit Scoring and Credit Control 7 Conference, Edinburgh, September.
Trench, M.S., Pederson, S.P., Lau, E.T., Ma, L., Wang, H., Nair, S.K., 2003. Managing credit lines and prices for Bank One credit cards. Interfaces 33 (5), 4–21.
West, D., 2000. Neural network credit scoring models. Computers and Operations Research 27, 1131–1152.
Yobas, M.B., Crook, J.N., Ross, P., 2000. Credit scoring using neural and evolutionary techniques. IMA Journal of Mathematics Applied in Business and Industry 11, 111–125.