April 2013
Abstract:
Models evaluating credit applicants rely on payment performance data, which is only
available for accepted applicants. This sampling limitation could lead to biased parameter
estimates. We use a nationally representative sample of credit bureau records to examine
sample selection bias in account acquisition scoring models and to evaluate the
effectiveness of the industry practice of using proxy payment performance for rejected
applicants. Our results show that ignoring the rejected applicants significantly affects
forecast accuracy of credit scores while it has little effect on their discriminatory power.
Finally, we document that validating scores only on accepted applicants can be
misleading.
The authors thank David Hand, Nicholas Kiefer, Christopher Henderson, OCC and Federal Reserve Bank
of Chicago seminar participants, and an anonymous referee for helpful comments and suggestions. A
previous version of this paper has also been circulated as “Adjusting for Sample Selection Bias in
Acquisition Credit Scoring Models.” The views expressed in this paper are those of the authors and do not
necessarily reflect the views of the Office of the Comptroller of the Currency (OCC), or the US Treasury
Department. The authors can be reached via email at irina.barakova@occ.treas.gov,
dennis.glennon@occ.treas.gov, and ajay.palvia@occ.treas.gov.
1. Introduction
Consumer credit scoring models are a key input in banks’ credit acquisition
strategies and banks increasingly rely on such models to evaluate credit risk when
deciding whether or not to extend credit. Such credit scores can be developed based on
the payment performance history of previous applicants, assuming they are representative
of future applicants. However, banks can only observe the performance of their
customers and not of those applicants that were rejected. To the extent past rejection decisions were systematic, the sample of accepted applicants is not representative of the full pool of applicants, which could bias the estimated scoring model. In turn, such a biased score could lead to a misguided acquisition strategy and future losses.
This sample selection problem is well known in the industry and the academic literature, and numerous methods, collectively known as reject inference, have been proposed in order to account for the missing data. Many of these methods have been evaluated in academic studies but are not widely used in practice. Lenders’ most common way to address the
sample selection bias is to obtain proxy credit performance information on the applicants
that they have rejected in the past from their credit bureau records. However, the general
impact of this approach on the performance of credit scoring models has not been examined. We use a nationally representative sample of credit bureau records that tracks a large number of credit card applicants over a 10-year period to examine potential sample selection bias in acquisition scoring models. The performance of the resulting score is evaluated to assess the scope of the bias and the gains from supplemental bureau data. Our main findings are as follows.
First, we find that generic credit bureau scores and other risk factors are
significantly worse for loan applicants who were rejected by a major credit card issuer at
least once before receiving credit and worse still for applicants who did not obtain credit
from a major credit card issuer but were able to obtain credit elsewhere. The results
suggest that accounting for the effect of applicants who were rejected at least once could
improve scoring models. The results also indicate, however, that a large sub-portion of
applicants attempt to obtain credit but do not succeed in doing so, suggesting that the censoring problem cannot be fully eliminated with supplemental bureau data.
Second, the discriminatory power of the score is not substantially different when acquisition scoring models are built including or excluding rejected applicants. The score’s forecast accuracy, however, does improve when
supplemental data is used to infer the performance of rejected applicants. We also find
that older models perform substantially worse in terms of accuracy regardless of whether
reject inference is used. Thus, our findings indicate reject inference is important for improving model accuracy but is not a substitute for building newer models.
Finally, our out-of-sample score validation results show that ignoring the rejected
applicants when validating the score could lead to misleading results. Delinquency rates
expected for each score range appear overestimated for the first couple of years after
score implementation when the score is validated only on the accepted applicants. This
finding is important since it is not standard industry practice to use supplemental bureau data when validating scores.
The rest of this paper is organized as follows. The next section provides some
background on the issue of sample selection bias in credit scoring models. Section 3
presents the data and methodology. Section 4 discusses our analysis, and the final section concludes.
2. Literature
A small but growing literature examines the statistical techniques and issues in
credit scoring. Hand and Henley (1997) offer an excellent review of the statistical
techniques used in building credit scoring models and Glennon (1999) outlines the
conceptual framework, current practices, and modeling issues in retail credit scoring.
More recently, Glennon et al. (2008) utilize a proprietary data-set to estimate and validate
credit scoring models for bank card borrowers. The results indicate that current industry
best practices can be effective at ranking borrower risk but may fall short when it comes to accurately forecasting default rates.
Kiefer and Larsen (2006) further discuss the key conceptual and statistical issues in credit scoring. A central issue in developing acquisition scoring models is whether the applicant data used to build
such models, given that it excludes rejected applicants, is appropriate. The inability to
build a new model on the entire sample of past applicants need not necessarily lead to
selection bias. As noted by Little and Rubin (1987), missing data can be categorized in one of three ways: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the first two cases, account performance does not depend on the selection process. In contrast, if the data is MNAR, then credit
performance is a function of the selection process and it is in this case that sample
selection bias will occur. Such bias has become well known in the literature and bank
industry and numerous reject inference techniques have been proposed in order to
mitigate its effect (see Joanes (1993), Hand and Henley (1994), and Hand (2001a, 2001b)).
Among the most commonly used reject inference methods is to obtain external
performance data (from credit bureaus) from applicants that were rejected. Using such
performance data, a lender can seek to infer how rejected applicants would have
performed if accepted; this is often referred to as the supplemental data method. For the
subset of rejected applicants for which performance data is available, the method assumes
that default on some other credit product is equivalent to default on the product of
interest; the new model is then created while factoring in the behavior of these rejects.
The drawbacks of this approach include the cost of obtaining the bureau data and of
assuming default on another product is equivalent to default in the current model. Also,
because the rejects with no performance data are likely to be non-random, this method
will not completely eliminate bias. To date, there is little evidence on the effectiveness of this widely used approach.
Another way to address the problem is to generate performance data for rejected applicants that would not normally be available. Instead of obtaining such data from a bureau, the lender does so by accepting a subsample of applicants that would otherwise be rejected. Obtaining actual performance data in this way, sometimes referred to as enlargement, is an ideal way to mitigate selection bias; but it is also very
costly since rejected applicants are the most likely to default. Using simulated data, Parnitske (2005) examines this approach and finds that it does reduce selection bias.
A further category of reject inference techniques extrapolates from the accepted population. One of the more frequently used types of extrapolation, the re-weighting method, assumes that
the relation between the borrower characteristics and default is identical for accepted and
rejected applicants. 1 The method essentially works by giving greater weight to the lower-scoring accepted applicants relative to higher-scoring ones. Crook and Banasik (2004),
using proprietary data, find that the re-weighting technique does not improve the
performance of the good-bad model. Other extrapolation methods that assign default
status to rejected applicants and in essence “create” data include “re-classification” and
“parceling”. These methods, while easy to implement, tend to be quite arbitrary and often
result in false precision or lead to a distortion of the actual default data (Kiefer and
Larsen (2006)).
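As an illustration of the re-weighting idea discussed above (a hypothetical sketch, not the implementation used in any of the cited studies), each accepted applicant can be weighted by the inverse of the acceptance rate in its score band, so that the scarce low-scoring accepts stand in for the rejects with similar scores:

```python
import numpy as np

def reweight(scores_accepted, scores_all, n_bands=5):
    """Weight each accepted applicant by the inverse of the acceptance
    rate in its score band, so that low-scoring accepts (scarce relative
    to the full applicant pool) count more in the good/bad model."""
    edges = np.quantile(scores_all, np.linspace(0, 1, n_bands + 1))
    band = lambda s: np.clip(np.searchsorted(edges, s, side="right") - 1,
                             0, n_bands - 1)
    n_all = np.bincount(band(scores_all), minlength=n_bands)
    band_acc = band(scores_accepted)
    n_acc = np.bincount(band_acc, minlength=n_bands)
    accept_rate = n_acc / np.maximum(n_all, 1)
    return 1.0 / np.maximum(accept_rate[band_acc], 1e-9)
```

An accepted applicant in a band where only one in five applicants was accepted receives weight 5, so the weighted accepts sum to the size of the full applicant pool.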
The last category of reject inference techniques includes those based on variations
of Heckman’s 2-step sample bias correction procedure where the first step is the selection
equation and the second step is the default model. Greene (1998) examines the impact of
Heckman’s procedure on predicting loan default and finds that coefficients for key
[1] Extrapolation techniques such as re-weighting do not reduce selection bias in the sense that if the model
determining good/bad is assumed to be the same for accepts and rejects, there is no bias to correct for. In
these cases, reject inference is really intended to reduce variability and thus make the model more efficient.
For a more complete discussion, see Banasik and Crook (2007).
default determinants differ substantially from results obtained without correcting for
selection bias. Banasik and Crook (2007) use a proprietary data set where applicants that
would have normally been rejected were accepted and find that using a bivariate probit
model design to address sample selection bias improves model performance. In related work by Banasik et al., a proprietary data set is used and the authors find that the procedure can improve model accuracy in some cases but the improvements are small. Finally, Wu and Hand (2007),
using simulated data, find that Heckman’s procedure improves the good-bad model, but
only if the normality assumption holds, when “enough” customers are rejected and accepted, and when the original accept/reject decision was not determined primarily by the characteristics included in the default model.
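The two-step logic discussed in this paragraph can be sketched as follows; all variable names are hypothetical, and a linear second step is used purely for illustration (the credit scoring applications cited use probit or logit outcome models):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_fit(X, d):
    """Step 1: fit the selection (accept/reject) equation by probit MLE."""
    def nll(g):
        p = np.clip(norm.cdf(X @ g), 1e-10, 1 - 1e-10)
        return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

def heckman_two_step(X_sel, accepted, X_out, y_out):
    """Step 2: add the inverse Mills ratio from the selection equation
    as an extra regressor in the outcome model fit on accepts only."""
    g = probit_fit(X_sel, accepted)
    z = X_sel[accepted == 1] @ g
    imr = norm.pdf(z) / norm.cdf(z)  # inverse Mills ratio, always > 0
    X_aug = np.column_stack([X_out, imr])
    beta, *_ = np.linalg.lstsq(X_aug, y_out, rcond=None)
    return beta  # last element is the selection-correction coefficient
```

The coefficient on the inverse Mills ratio captures the correlation between the selection and outcome errors; when it is zero, selection is ignorable.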
In summary, the extant literature has highlighted sample selection issues in credit
scoring and, in particular, has evaluated several methods to adjust credit scoring models
for sample selection bias arising from building/validating models on previously booked
applicants. The evidence regarding the effectiveness of these models is mixed, however.
Further, inferring the behavior of rejected applicants using supplemental bureau data is
widely used but, to the best of our knowledge, the effectiveness of this approach has not
previously been directly examined in the literature. Our paper, in part, is motivated by this gap.
3. Sample design
3.1 Data
Banks usually purchase credit history information for the accepted and rejected
applicants from one or more of the credit bureaus for the development of acquisition
scores. Following industry practices, we use data from one of the three major credit bureaus 2 – a consumer credit database (CCDB) with information for a growing sample of 1 to 1.5 million individuals during the period 1999 to 2009. For each individual, the CCDB contains a credit bureau score and
information on all debt exposures, inquiries, public records, and any other reports to the
credit bureau, also known as tradelines. In addition, for each tradeline the CCDB includes
the type, amount, and payment history, which allow for the calculation of the hundreds of
credit risk attributes that credit bureaus provide to banks. Credit standing of the
individual is reported as of June 30th of each year, so the data consists of a series of snapshots of individuals and their risk attributes as of that date. One exception is the payment status for each tradeline, which is available at the actual monthly frequency.
The CCDB sample includes both scoreable and unscoreable individuals. The
unscoreable individuals are those that have been inactive in the past 12 months or with a
very limited credit history such that they cannot be assigned a valid bureau credit score.
They constitute around a quarter of the full credit bureau data. In the CCDB, the
scoreable individuals are over represented relative to the full population of individuals
reported to the credit bureau such that the sample is well suited for score development.
[2] The three major credit bureaus are Equifax, Experian, and TransUnion, and they maintain credit files for
around 200 million individuals. The credit files contain information from grantors of consumer credit and
collectors of public records. The bureaus use the information to build consumer credit history, consumer
credit score and consumer credit attributes used for evaluating consumer credit quality.
For the 1999 sample there are 950,000 scoreable and 50,000 unscoreable individuals, for a total of one million records. Going forward, individuals are kept in the panel if they remain or become scoreable. A random sample of unscoreable individuals is added to the panel each year, as well as another random sample of scoreable individuals averaging six percent of the existing scoreable individuals. The
design of the panel is illustrated by the vertical bars in figure 1. The large portion of
individuals that are kept in the panel from year to year allows us to observe future
performance for credit risk evaluation purposes. The added unscoreable individuals
ensure that individuals with relatively short credit history are represented in the sample.
This is important for our purposes since the individuals with relatively short credit history
are likely to apply for credit. The additional scoreable individuals are selected based on
the distribution of the bureau score so that the sampling is more representative of the population. 3 This unbalanced panel sample design allows us to track performance over different horizons for score development and validation purposes, as well as to test the scores on out-of-time samples.
3.2 Methodology
In this section we describe the selection of the sample that we will use for
evaluating the impact of reject inference on acquisition credit score performance. Bank
credit card acquisition strategies target the pool of potential customers. Each bank might
have different acquisition channels and depending on the channel, the customer pool
[3] For a more detailed illustration of the sample design, see Glennon et al. (2008).
might consist of actual applicants or the population identified for mailing applications. 4
The acceptance rate and thus the need for reject inference can vary across channels.
Since banks evaluate applicants based on characteristics known up front, for modeling purposes they take a snapshot (ie, cross section) of those characteristics and construct a subsequent performance measure for them. For consistency with this industry practice, we turn the panel data into a series of snapshot samples used for score development and validation,
which allows us to test the robustness of our results through time. One concern is that the
aging of the population in the sample could have a downward bias on the number of
rejected applicants in our analysis since the individuals with less credit history, such as
the young borrowers, are more likely to be rejected rather than granted a credit card. As a
result, we suspect that any impact of reject inference on score performance that we find in our analysis is likely understated.
Since we can construct the individual risk attributes only as of June 30th, we take
all credit card applicants from the following quarter as our sample window for identifying
the through-the-door (TTD) population. Figure 1 illustrates the construction of the sample.
We cannot completely replicate a development sample for any given bank since the bank
reported card inquiries and newly opened card accounts in the CCDB are anonymous and
cannot be associated with a particular institution. Instead we take all newly opened
bankcards and inquiries observed during the third quarter of each year as our model
[4] The use of the score for identifying a mailing base is also known as pre-approval or front-end evaluation.
Back-end evaluation refers to the use of the score for acceptance/rejection of actual applicants once the
applicants have responded to the pre-approved offers.
development sample, which is a random sample of the credit card applicants for the industry as a whole. 5
While for a particular bank, the accepted and rejected applicants are naturally
defined, we need to identify these subsets for the industry given our sample. We define as
booked (BK) the set of all individuals that have applied for a card to a bank during the
third quarter of the year and have been granted credit in each of those instances. The BK
individuals have not been rejected by any institution during the chosen quarter even if
they have been rejected at some other point in the past. The rejected individuals are those
that have made at least one inquiry and have been rejected. They could have received a card from one bank during the quarter, but at least one other bank has rejected them. This implies there is less certainty about their desirability as customers; since they would fall in the rejected pool of at least one institution, we consider them rejected.
Following industry practices, we classify the rejected applicants further depending on our
ability to infer performance. Individuals that manage to open a bank card, after being
rejected by at least one card lender during the same or the following quarter, make up our
main reject inference group (RI). This increase in the selection window is done in order
to allow for more of the rejected individuals to have opened cards to proxy performance.
To further expand the possible inference set, it is common industry practice under the supplemental approach to use, for individuals that do not manage to open any bankcard during the extended window, nonbankcard tradelines – such as a retail credit card or loan opened during the third or fourth quarters – as a proxy for performance tracking. As part of our analysis, we also mimic this industry practice to evaluate its
[5] To the extent very large banks have nationally representative applicant pools, one could argue that for the
largest few bankcard providers our development sample is indeed representative of the industry applicants.
performance. In particular, we augment the reject-inference sample with individuals that
were rejected and opened tradelines on other retail credit products during the observation
window. We label the extended reject inference sample RI*. 6 The set of individuals for
which we cannot make any inference because they do not open any new tradeline during
the selected window is labeled RNI. Note that it is possible to use existing bank cards or
other tradelines to proxy performance, but that would not be a directly relevant measure of performance on newly acquired credit. As another variation of proxy performance information for rejected applicants, banks can use the credit bureau score at the end of the performance period. However, as a summary statistic, the score reflects any credit performance deterioration and cannot distinguish the performance of newly opened accounts from that of existing ones. 7
The BK applicants combined with the RI and RI* group of applicants make up the
through-the-door (TTD) sample that can be used for score development with reject
inference, which is depicted for each of the yearly samples in figure 2. Around half of the
individuals are immediately booked (BK). Another 15 percent are rejected somewhere
but manage to receive a credit card (RI). With the expanded proxy performance (RI*),
another 5 percent of the applicants can be used for score development. However, around
[6] To the extent that the performance of different credit products is driven by different underlying factors,
assuming such products are similar could lead to biases. Though we accept that such biases might occur,
our main goal is to evaluate whether such models, which are widely used in industry, are nevertheless
helpful in inferring the behavior of rejected applicants. In terms of delinquency rates the RI and RI* groups
are much more similar to each other than to the BK group. Even if the RI* performance measure is not a
true representation of their possible performance on a bank card, a comparison between the RI and RI*
groups across the performance of non bank card accounts shows that the RI* group is riskier.
[7] The 2004 data shows that for the BK group there is a negative 50% correlation between 90 days plus
delinquency within 24 months and the fresh bureau score at the end of the 24 month period. For the group
for which we have performance on a new account (the RI group) the correlation is negative 40%. Although
the correlation is relatively high for both groups, it shows that the score is influenced by more than the
performance on newly opened accounts.
20 percent of our random sample consists of individuals for which no inference can be
made (RNI) from credit bureau data. This implies that the problem of censoring cannot be fully addressed even with supplemental data.
Account performance is typically defined in terms of delinquency or any major derogatory measure under a fixed horizon, eg 12, 18, or 24 months. Often the industry practice in setting the fixed horizon is a matter of selecting
another point in time as of which performance is evaluated. We follow this practice and
assess performance as of the end of the fourth quarter of the following year, which results in a varying performance horizon, with the booked sample having a longer horizon than the rejected sample. Given
that default risk can only increase with the length of the horizon, we may be
underestimating the true default risk of the rejected population. The rejected applicants
do appear with significantly higher delinquency rates than the booked, so any such bias works against our findings. Our main performance measure is 90 days plus delinquency (90 DPD), although alternative definitions such as 60 days delinquency are used for
robustness. The performance for the BK and RI groups is based on the worst
performance they have for any of the opened credit cards during the window. For the RI* group, performance is based on the nonbankcard tradelines opened during the selected window. The RNI individuals do not open any new tradeline, so their performance cannot be tracked.
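The worst-status labeling rule described above can be sketched as follows (the data layout is a hypothetical illustration):

```python
def bad_flag(tradeline_dpd, threshold=90):
    """Label an applicant 'bad' if the worst days-past-due status across
    all tracked tradelines reaches the threshold within the window."""
    worst = max(max(history) for history in tradeline_dpd)
    return int(worst >= threshold)

# Two tracked cards; the second hits 90 DPD, so the applicant is bad.
label = bad_flag([[0, 30, 60], [0, 90, 120]])
```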
4. Analysis
4.1 Univariate analysis
Our analysis begins with comparing the BK, RI, RI* and RNI groups of credit
card applicants defined in the previous section. Figure 3 shows the annual bad rate across these groups; the bad rates are substantially different during all time periods. The booked accounts exhibit less than half the bad
rate of the RI group. Table 1 presents the mean and standard deviations for all the credit
bureau attributes that we use in building scoring models across the nine annual cohorts. The attributes are constructed in a manner consistent with the definitions developed by the credit bureau. We select a subset of these attributes from the five broadly defined categories used in scorecard development – payment history, amount owed, length of credit history, types of credit in use, and new credit – and each category is presented in a separate panel. Relative to the BK group and across the rejected groups RI, RI*, and
RNI, the mean credit bureau score decreases, the percent of unscoreable individuals increases, the number of inquiries increases, and the instances of non-zero balance are higher even if the average balance itself is not; the balance-to-credit ratio is also higher. The RNI group has on average a shorter credit history and a worse payment history, with the highest incidence of past delinquency.
The difference in the population characteristics can also be seen in figure 4, which
depicts the full distribution of the generic bureau score across the four groups in the 2003
sample. Clearly the BK group is of higher credit quality than the three rejected groups
and the RNI group has the lowest credit quality; the RI and RI* groups appear to have similar distributions. We evaluate the magnitude of the differences in the distributions across the BK and rejected
data sets (RI, RI*, and RNI) using a non-parametric Kolmogorov-Smirnov (K-S) test,
which measures the level of separation between two distributions. 9 We report in table 2,
for a subset of attributes, the KS statistics for the difference in the distributions from the
various rejected groups (RI, RI*, and RNI) and the distribution of values from the booked
(BK) data set using the 2003 development data. We report the results for a subset of
attributes for which the K-S statistic is more than 20 percent for at least one of the comparisons; that is, the difference between the attribute distribution for the BK data is
large relative to the distribution of values for at least one of the data sets made of rejected
accounts. The shading indicates the level of the K-S statistic for each pair of distributions
with darker corresponding to higher K-S. The first column shows the K-S between the
BK and RI groups which exhibit the least differences. The third column comparing the
BK and RNI distribution is the darkest indicating the greatest amount of difference and
the most characteristics where the groups differ in distribution. Such differences in the
distribution suggest that building a score only on the booked accounts but applying it on
all booked and rejected could be misleading. Although the RNI group is large, the
[8] A common practice is for banks to incorporate in their acquisition strategy a cut off based on the generic
bureau score and in this way they eliminate at least half of the RNI group from their customer base. We do
not take this route for our analysis since such a subjective cut off depends on the bank’s risk strategy while
our goal is to document the scope for reject inference.
[9] If the distribution of values for a specific attribute varies significantly across the data groups, the KS statistic will be large (e.g., greater than 20 percent). Conversely, a small KS value (e.g., less than 10 percent) implies the distributions of values for that attribute are similar across the data sets.
addition of the RI and RI* groups to the BK for score development has the potential to
address some of the censoring given the differences across these groups.
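The two-sample K-S statistic reported in table 2 is simply the maximum vertical gap between the empirical distribution functions of an attribute in two groups; a minimal sketch, assuming the attribute values for each group are available as arrays:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of samples x and y."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))
```

Fully separated samples give a K-S of 1, identical samples give 0; the 20 percent screening threshold used in the text sits between these extremes.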
For score development purposes, we split the full sample further in terms of
“clean” individuals, ie those that have not had any major delinquency in the past, “dirty”
individuals, ie those that have been at least 60 days delinquent in the past or currently,
and individuals with “thin” files, for which it is hard to determine credit quality. Banks do split the thin segment further into dirty and clean, but the size of our sample would make such a
split impractical. Figure 5 shows the subgroups of BK, RI, RI* and RNI for each of the
clean, dirty, and thin segments. Consistently across our samples, the dirty and thin segments have far fewer booked accounts (BK), and it is in those segments that we expect reject inference to matter most. The outcome we model is whether an individual becomes 90 days delinquent or worse in the next 12-18 months. Following industry practice, we estimate a logistic regression for each of the three segments – clean, dirty, and thin – using the 90 plus day
delinquency as the dependent variable. Banks may have a more granular segmentation
based on their portfolio, but the clean, dirty, thin split is usually part of any segmentation scheme. For robustness, we also evaluate the score when built on the full population without any
segmentation. Although banks sometimes use some form of expert judgment in selecting
characteristics for scorecard development, the final score is usually based on a statistical
selection of variables, often through a stepwise regression. Glennon et al. (2009) provide evidence that this method indeed performs relatively well compared to semi-parametric alternatives.
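A stepwise selection of this kind can be sketched as a greedy forward search over attributes (a hypothetical illustration of the general technique, not the exact procedure used for the paper’s models):

```python
import numpy as np
from scipy.optimize import minimize

def logit_loglik(X, y):
    """Fit a logistic regression by MLE and return its log-likelihood."""
    def nll(b):
        z = np.clip(X @ b, -30, 30)
        return np.sum(np.log1p(np.exp(z)) - y * z)
    return -minimize(nll, np.zeros(X.shape[1]), method="BFGS").fun

def forward_stepwise(X, y, min_gain=4.0):
    """Greedy forward selection: repeatedly add the attribute that most
    improves the log-likelihood, stopping when the gain is small."""
    n, k = X.shape
    selected, current = [], logit_loglik(np.ones((n, 1)), y)
    while True:
        gains = []
        for j in range(k):
            if j in selected:
                continue
            cols = np.column_stack([np.ones(n)] +
                                   [X[:, s] for s in selected + [j]])
            gains.append((logit_loglik(cols, y) - current, j))
        if not gains:
            break
        best_gain, best_j = max(gains)
        if best_gain < min_gain:
            break
        selected.append(best_j)
        current += best_gain
    return selected
```

The stopping threshold plays the role of the entry criterion in a standard stepwise procedure; with highly collinear bureau attributes, only a handful of variables typically survive.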
We start with more than 80 credit bureau attributes summarized in table 1 but the
final models have in some cases fewer than 10 attributes. Given the high multicollinearity across the attributes, the coefficients may not be informative; although we develop nine separate scoring models – one for each annual cohort from 1999 to 2007 – we do not present all
estimated models. Instead we focus on the types of variables and how often they are
selected to be included in a scoring model. Tables 3, 4, and 5 show the number of times a
particular attribute is selected through the stepwise selection process in one of the 9
cohort-based scoring models for the clean, dirty and thin segments respectively. The
tables are sorted from the most to least selected variables, which allows for a comparison
between the scoring models built only on the booked (i.e., BK) accounts and those built on the full TTD sample.
Those results show that the scoring models developed on the TTD data
incorporate a wider range, and an alternative mix, of variables relative to the BK-based
models. The TTD scoring models more often include information on the length of credit history and on both the credit line and balances for bankcards and other trades. The bureau
score is selected in all models and since it may be capturing most of the information
about the individual’s credit quality, we also estimate the models without the score, which, as discussed later, has little impact on performance. Because of the high correlations across
all the bureau attributes, it is not clear whether any of those differences would lead to
significant variation in performance of the scoring models built on the TTD population
versus those developed on the BK sample data only, which is discussed next.
We evaluate the in-sample and out-of-time performance of the scores with respect to the objectives and purpose of the model.
Acquisition strategies are often based on several models. 10 Banks use risk scores to set
acceptance/rejection cut off values based on risk tolerance, which implies, at least
implicitly, that the expected default rate (or odds) at cut off are consistent with a given
bank’s pricing and risk/return objectives. The score is also used for assigning credit line
levels and terms of the contract. For those latter purposes, the discriminatory power of the
model is important because the bank needs to be able to differentiate the potential
customers by their level of risk. Higher lines and better terms or special offers like
balance transfers could be made to the better-quality customers in order to maximize profitability.
For many of the decisions in the acquisition area, the scores also need to indicate the level of risk accurately. The score-associated likelihood of delinquency (or, equivalently, the odds ratio) is used for
setting score cutoff levels for the approval decision but also for account and portfolio
[10] A risk score, or the likelihood of the account becoming severely delinquent, which is the focus of this
paper, is usually combined with a response likelihood forecast as well as a balance and/or revenue forecast.
profitability and loss analysis. 11 Thus, an equally important quality of the credit score is its forecast accuracy.
We use the performance of the reject inference score built on the TTD sample, when applied to the TTD pool, as a benchmark. Using the same TTD validation sample, we apply the score
built only on the BK subset and compare the results to the benchmark in order to evaluate
the selection bias. We also track the performance of the TTD score and the BK score
when applied only on the BK subset as a validation sample. These last two cases are
actually common model validation practices in the industry because, unlike for model
development purposes, for model validation banks do not typically gather performance
information for the individuals that have been rejected by the model. At the same time,
for acquisition purposes the scoring model is applied on the full pool of potential
customers so proper validation has to be done on at least the above defined TTD
population.
Discriminatory power
For ease of presenting the results, we show first the performance of the scoring
models built on the 2003 data sample. The two scoring models that we test are those built
on the BK and TTD samples labeled BK2003 and TTD2003 respectively. The first panel
of table 6 exhibits the K-S statistic for both in-sample and out-of-time samples. The
results are provided by segment (clean\dirty\thin) and also for the aggregate portfolio
(all). Each column shows a particular combination of score and validation sample:
[11] Banks use additional layers of business logic to determine final cutoffs beyond the odds suggested by the
score. Risk management might have different targets and risk tolerance across geographic regions or
products as they monitor the performance of the score within such portfolio segments.
TTD2003_TTD, BK2003_TTD, TTD2003_BK, and BK2003_BK. 12 Note that the
discriminatory power is relatively high and decreases very little through time for the
2004-2007 validation samples. This finding is consistent with the results for behavioral
scoring models reported in Glennon et al. (2008) and is generally also assumed in the
industry. The robustness of the models' ability to discriminate between good and bad accounts over time justifies the industry practice of relying on models over extended periods.
As expected, the thin and dirty segment scores exhibit lower discriminatory power. The TTD2003_TTD values are either the same as or a couple of percentage points higher than the BK2003_TTD values, but not sufficiently higher to imply that reject inference matters for the discriminatory power of the scoring model, a result that holds for all years (not reported).
For the thin and dirty segments, the models developed on the TTD data perform better by roughly 5 percentage points when validated only on the BK accounts (i.e., TTD2003_BK) than when validated on the TTD population (i.e., TTD2003_TTD). While this is not a large enough difference to warrant concerns about the model's ability to differentiate between the default and non-default distributions, it shows that validating only on booked accounts may tend to overestimate the discriminatory power of the score.
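As a concrete reference for the two measures used here, both can be computed directly from scores and outcomes. This is a minimal sketch on simulated data (not the paper's models), with Somers' D obtained via the 2·AUC − 1 identity:

```python
import numpy as np

def ks_stat(score, bad):
    """K-S: maximum vertical gap between the score CDFs of bads and goods."""
    s_bad = np.sort(score[bad == 1])
    s_good = np.sort(score[bad == 0])
    grid = np.sort(score)
    cdf_bad = np.searchsorted(s_bad, grid, side="right") / s_bad.size
    cdf_good = np.searchsorted(s_good, grid, side="right") / s_good.size
    return np.max(np.abs(cdf_bad - cdf_good))

def somers_d(score, bad):
    """Somers' D = 2*AUC - 1, with AUC from the Mann-Whitney rank formula
    (assumes continuous scores, i.e. no ties)."""
    ranks = np.empty(score.size)
    ranks[np.argsort(score)] = np.arange(1, score.size + 1)
    n_good = int((bad == 0).sum())
    n_bad = int((bad == 1).sum())
    auc = (ranks[bad == 0].sum() - n_good * (n_good + 1) / 2) / (n_good * n_bad)
    return 2 * auc - 1

# Hypothetical scores where bads tend to score lower than goods.
rng = np.random.default_rng(1)
bad = (rng.random(20_000) < 0.10).astype(int)
score = rng.normal(size=20_000) - 0.8 * bad
print(ks_stat(score, bad), somers_d(score, bad))
```

The K-S statistic depends only on the single largest CDF gap, while Somers' D aggregates rank information across the whole distribution, which is why the two can react differently to changes in the score.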
The results are confirmed in the second panel of table 6, where we show the Somers' D discriminatory power statistic as an alternative to the K-S. Unlike the K-S statistic,
12
TTD2003_TTD (_BK) refers to applying the scoring model developed on the 2003 TTD sample to the
TTD (BK) data. BK2003_TTD (_BK) refers to applying the scoring model developed on the 2003 booked
accounts only to the TTD (BK) data.
Somers' D captures the impact across the full score distribution and would reflect any change along it.
Forecast accuracy
Unlike the impact on the scoring model's discriminatory power, accounting for reject inference appears to clearly affect forecast accuracy. The top panel of table 7 shows the Hosmer-Lemeshow (H-L) statistic for the same combinations of TTD2003 and BK2003 scores applied to the TTD and BK samples in-sample and out-of-time. The results show that the forecast accuracy deteriorates relatively quickly on the out-of-time sample, which is consistent with industry and academic findings. On the TTD sample the TTD2003 score performs much better than the BK2003. Given the asymptotic distribution of a chi-square with 10 degrees of freedom, the bottom panel provides the probability values for each of the calculated H-L statistics, which are very low.14 These probability values are expected given the large number of observations and the fact that the score does not account for all the risk drivers and thus cannot fully capture the delinquency risk. In the industry, the score odds are often further calibrated to historical delinquency rates in order to achieve better forecast accuracy. For our study, we need to compare the relative accuracy of the scores with and without reject inference, which is independent of any such calibration.
13
Note that Somers' D is equivalent to the Accuracy Ratio (AR) of the Cumulative Accuracy Profile (CAP) curve and is also related to the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve (Somers' D = AR = 2AUC - 1). The industry often relies on monitoring the change in log odds, which similarly involves the full distribution and is thus reflected in the Somers' D measure.
14
Note that in-sample the mean and variance are estimated from the data, which implies that the appropriate chi-square distribution for the H-L statistic as a goodness-of-fit test has 2 degrees of freedom less, i.e., χ2(8), as shown in Hosmer and Lemeshow (1980). Out-of-time, however, the test statistic calculated for the 10 deciles is asymptotically distributed as the sum of 10 squared standard normal variables, which is a χ2(10) distribution (see Hosmer and Lemeshow 2000).
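For reference, the H-L statistic used throughout can be sketched as follows (our implementation and simulated data, not the paper's): observations are grouped into score deciles, and squared observed-minus-expected gaps are summed, each scaled by the decile's binomial variance.

```python
import numpy as np

def hosmer_lemeshow(p_hat, bad, n_groups=10):
    """H-L goodness-of-fit: sum over score deciles of
    (observed - expected)^2 / (n * p_bar * (1 - p_bar)),
    compared to chi-square with 10 d.f. out-of-time (8 d.f. in-sample)."""
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n = idx.size
        p_bar = p_hat[idx].mean()
        observed = bad[idx].sum()
        expected = n * p_bar
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return stat

# Hypothetical illustration: a well-calibrated score versus one that
# systematically understates risk.
rng = np.random.default_rng(2)
p = rng.uniform(0.02, 0.30, size=20_000)
bad = (rng.random(20_000) < p).astype(int)
print(hosmer_lemeshow(p, bad))        # moderate, consistent with calibration
print(hosmer_lemeshow(0.5 * p, bad))  # much larger: systematic underestimation
```

As in the paper's tables, large samples make the statistic very sensitive, so its relative level across models is more informative than any single p-value.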
The H-L test statistic treats under- and over-estimation equally and does not indicate which scores are less accurate. To get a better sense of the forecast accuracy issue, we summarize model performance in a more granular way in table 8. For each score decile, the corresponding probability values are given, as well as the expected minus actual delinquency rates. These results show that the likelihood of becoming seriously delinquent is underestimated in the more risky deciles if sample selection bias is ignored.15 With time, the underestimation spreads to all deciles and to both scores.
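The decile-level statistic behind table 8 (described in footnote 15) can be sketched as follows; the variable names and data are ours, not the paper's. Before squaring, each decile's contribution to the H-L statistic is asymptotically standard normal, giving a signed test of under- versus over-estimation per decile:

```python
import math
import numpy as np

def decile_z(p_hat, bad, n_groups=10):
    """Signed (observed - expected) / sqrt(n * p_bar * (1 - p_bar)) per
    score decile; positive z means delinquency is underestimated there."""
    order = np.argsort(p_hat)
    rows = []
    for idx in np.array_split(order, n_groups):
        n = idx.size
        p_bar = p_hat[idx].mean()
        z = (bad[idx].sum() - n * p_bar) / math.sqrt(n * p_bar * (1 - p_bar))
        p_val = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        rows.append((z, p_val))
    return rows

# Hypothetical: a score that understates risk by 20% shows positive z,
# most strongly in the riskiest deciles.
rng = np.random.default_rng(3)
p = rng.uniform(0.02, 0.30, size=12_000)
bad = (rng.random(12_000) < p).astype(int)
for z, p_val in decile_z(0.8 * p, bad):
    print(f"z = {z:6.2f}  p-value = {p_val:.4f}")
```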
We also find that when the scores are tested on the BK subset instead of the full TTD pool, the accuracy appears better. In particular, BK2003_BK exhibits the lowest H-L, and TTD2003_BK exhibits H-L levels lower than BK2003_TTD. Furthermore, table 8 shows that using the booked accounts (BK) for validation of a score that accounts for reject inference can lead to overestimation of the delinquency rates, especially in the high risk deciles, and that may be the case for a few years post redevelopment (TTD2003_BK). In this case, which is common practice in the industry, the deterioration of score performance will not be evident right away and the validation results could be misleading.
The results are similar across segments and across time. Table 9 shows the H-L test for all scoring models (1999-2007) using both in-sample and out-of-time samples.
Comparing the top two panels, for which the validation sample is the TTD, the second panel, which shows the scoring models built only on booked accounts (BK_TTD), has much higher H-L values than the first (TTD_TTD), i.e., lower accuracy. The bottom two
15
For each decile, the statistic evaluated is the decile's contribution to the H-L test calculation, which has an asymptotically Normal distribution.
panels confirm our conclusions about the misleading performance when the scores are evaluated only on booked accounts.
Unlike in our sample where acceptance is exogenous and does not depend on our
score value, banks’ acceptance rules based on an internal score that underestimates
delinquency risk will lead to a definite deterioration in the credit quality of booked
accounts. Our evidence strongly suggests that accounting for reject inference as part of the model-development process is important when the scores are used for defining acquisition strategies.
4.4 Robustness
In this section we discuss further the robustness of our main finding that reject
inference impacts the forecast accuracy of scores. The unbalanced panel design results in
a set of residual accounts that, by design, are not included in the development data; these include accounts that were unscoreable at the time of development but have since become scoreable, and also the small percentage of new accounts that are added each year. This
sample structure allows us to test the forecast accuracy on individuals that have not been
part of the model-development sample. We report the results for the H-L test across
scoring models and validation years in table 10, using the format from table 9. Note that this validation sample is much smaller, averaging fewer than 4,000 observations compared to roughly 100 thousand in the full sample, so the H-L test statistics are lower than in table 9. The first panel, as in table 9, has for the most part lower values than the
second panel. Another caveat for this validation sample is that it consists of a relatively
large number of unscoreable individuals, for which the model score is expected not to
perform as well. Although these results are not as clear, they confirm our earlier findings that the TTD scores are relatively more accurate than the BK scores. Similarly, the bottom two panels show that validating only on booked accounts attributes more forecast accuracy to the scores than is warranted.
We also check the robustness of the results with respect to the development of the score by varying the definition of bad performance (60 rather than 90 days delinquency) and the performance horizon (6-12 months instead of 12-18), and by re-estimating the models without segmentation, without the credit bureau score, and with a smaller set of explanatory variables. We also vary the development sample selection period to be the first two quarters of the year rather than the last two, in case there is any seasonality. However, our credit bureau attribute data are only as of June 30th, which means that the attributes we use for score building already reflect any inquiries and newly opened credit cards of the individuals in the TTD sample. As expected, in that case the scoring model has better performance, especially in terms of discriminatory power. The impact of reject inference, although smaller, appears again in these alternative specifications.
5. Conclusions
The sample selection problem created by developing scoring models using only accepted individuals while ignoring the rejected ones has been well studied in the literature. Many techniques for inferring the behavior of rejected applicants have been proposed, but these are not typically used in the industry. In this paper, we use a nationally representative
sample of credit bureau data to evaluate the problem of sample selection in credit card
acquisition score development. We evaluate the credit bureau supplemental data method
used by banks for reject inference by examining its impact on score performance.
We find that reject inference has little impact on the discriminatory power of the score, but basing the score only on booked accounts leads to underestimation of delinquency risk, especially in the high risk deciles. Furthermore, ignoring rejected individuals when validating the score leads to underestimating the deterioration in score performance over time. These results suggest that although the data augmentation method is not a perfect solution to this sample selection issue, it can significantly improve the forecast accuracy of the resulting scores.
References
Banasik, J., J. Crook, and L. C. Thomas (2003). “Sample selection bias in credit
scoring models”, Journal of the Operational Research Society 54, pp 822–832.
Crook, J. and J. Banasik (2004). “Does reject inference really improve the performance
of application scoring models?” Journal of Banking and Finance, 28, pp 857-874.
Feelders, A.J. (2000). "Credit scoring and reject inference with mixture models", International Journal of Intelligent Systems in Accounting, Finance and Management, 8 (4), pp 271-279.
Glennon, D. (1999). “Evaluating Credit Scoring Models: Theory and Practice”, OCC
Working Paper.
Glennon, D., C.E. Larson, N. Kiefer, and H. Choi (2008). "Development and Validation of Credit-Scoring Models", Journal of Credit Risk, 4, pp 1-61.
Greene, W. (1998). “Sample selection in credit-scoring models”, Japan and the World
Economy, 10, pp 299-316.
Greene, W. (2007). “A Statistical Model for Credit Scoring,” in Credit Risk: Quantitative
Methods and Analysis, Hensher, D. and S. Jones, eds., Cambridge University Press.
Hand D. J. (2001a). “Reject inference in credit operations: theory and methods” in The
Handbook of Credit Scoring, Glenlake Publishing Company, pp 225-240.
Hand D.J. and W.E. Henley (1994). “Can reject inference ever work?”, IMA Journal of
Mathematics Applied in Business and Industry, 5 (1), pp 45-55.
Hand, D.J. and W.E. Henley (1997). “Statistical classification methods in consumer
credit scoring: a review”, Journal of the Royal Statistical Society A, 160, pp 523-541.
Hosmer, D. W. and S. Lemeshow (1980). "A goodness-of-fit test for the multiple logistic regression", Communications in Statistics, A10, pp 1043-1069.
Hosmer, D. W. and S. Lemeshow (2000). Applied Logistic Regression -2nd ed. Wiley,
New York.
Joanes D.N. (1993). “Reject inference applied to logistic regression for credit scoring”,
IMA Journal of Mathematics Applied in Business and Industry, 5 (1), pp 35-43.
Little, R.J.A. and D.B. Rubin (1987). Statistical Analysis with Missing Data. New York:
John Wiley.
Parnitzke, T. (2005). "Credit Scoring and the sample selection bias", Institute of Insurance Economics, Working Paper.
Wu, I. and D. Hand (2007). "Handling selection bias when choosing actions in retail credit applications", European Journal of Operational Research, 183, pp 1560-1568.
Table 1. Average mean and standard deviation by type of variable following Fair Isaac
categories across the 9 samples used for developing and validating the score (1999-2007).
Invalid extreme values are set to missing and they represent less than half of one percent
of the TTD sample.
Credit amount (columns: BK | RI | RI* | RNI; MEAN STD within each group)
# OF OPEN AUTO LOAN TRADES BAL > 0: 0.4 0.7 | 0.4 0.7 | 0.5 0.7 | 0.4 0.6
AGG BAL FOR OPEN AUTO LOAN TRADES: 5874 10783 | 5746 10490 | 6847 11833 | 4821 10075
AGG CREDIT FOR OPEN AUTO LOAN TRADES: 8605 14451 | 8264 14039 | 9774 15740 | 6921 13388
# OF BANKCARD TRADES BAL > 0: 2.0 1.8 | 2.3 2.0 | 2.4 2.2 | 2.3 2.4
AGG BAL FOR OPEN BANKCARD TRADES: 5486 10055 | 5395 10150 | 6621 12560 | 5198 11970
AGG CREDIT FOR OPEN BANKCARD TRADES: 26948 32045 | 22246 30507 | 19757 29295 | 12352 24524
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES: 28 465 | 36 623 | 41 39 | 58 2242
# OF OPEN HOME EQUITY TRADES BAL > 0: 0.2 0.4 | 0.1 0.4 | 0.1 0.4 | 0.1 0.3
AGG BAL FOR OPEN HOME EQUITY TRADES: 7436 29329 | 6092 25724 | 6926 29450 | 4038 22171
AGG CREDIT FOR OPEN HOME EQUITY TRADES: 12801 47881 | 9854 40972 | 10658 48150 | 5925 35129
# OF INST TRADES BAL > 0: 1.3 1.7 | 1.6 2.0 | 2.2 2.6 | 2.3 2.6
AGG CREDIT FOR OPEN INST TRADES: 18285 37118 | 17603 41267 | 22295 50576 | 14406 31262
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES: 41.1 238.3 | 44.0 329.5 | 50.4 73.9 | 42.4 193.7
# OF OPEN AUTO LEASE TRADES BAL > 0: 0.1 0.3 | 0.1 0.3 | 0.1 0.3 | 0.1 0.2
AGG BAL FOR OPEN AUTO LEASE TRADES: 805 3973 | 757 3961 | 950 4511 | 579 3674
AGG CREDIT FOR OPEN AUTO LEASE TRADES: 1533 6786 | 1398 6575 | 1868 7883 | 1043 5972
# OF OPEN MORTGAGE TRADES BAL > 0: 0.6 0.7 | 0.5 0.7 | 0.5 0.7 | 0.3 0.6
AGG BAL FOR OPEN MORTGAGE TRADES: 72367 126908 | 59220 116031 | 65951 138920 | 37885 100355
AGG CREDIT FOR OPEN MORTGAGE TRADES: 78231 132319 | 63569 121104 | 70270 144766 | 40584 105792
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES: 44.3 53.3 | 36.3 47.4 | 38.1 45.6 | 25.8 239.3
# OF RETAIL TRADES BAL > 0: 0.7 1.1 | 0.8 1.2 | 0.9 1.3 | 0.8 1.3
AGG CREDIT FOR OPEN RETAIL TRADES: 4327 7852 | 3319 4830 | 3067 4600 | 1857 3551
AGG BAL TO CREDIT RATIO FOR OPEN RETAIL TRADES: 8.6 24.1 | 12.0 78.0 | 15.6 32.9 | 15.9 141.9
# OF REVOLVING TRADES BAL > 0: 3.0 2.6 | 3.2 2.8 | 3.6 3.1 | 3.2 3.2
AGG CREDIT FOR OPEN REVOLVING TRADES: 31929 32478 | 26172 32407 | 23282 30263 | 14480 25332
# OF OPEN BANKCARD TRADES: 3.7 3.0 | 3.5 3.3 | 3.3 3.1 | 2.3 2.8
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES: 24.5 117.9 | 32.4 76.3 | 39.4 37.7 | 41.0 128.7

Credit type (columns: BK | RI | RI* | RNI; MEAN STD within each group)
# OF TRADES: 14.1 7.9 | 14.6 8.3 | 15.7 8.5 | 13.1 7.8
# OF AUTO LOAN TRADES: 0.8 1.1 | 0.8 1.1 | 1.0 1.3 | 0.8 1.1
# OF AUTO LOAN OPENED W/I 12 MOS: 0.2 0.6 | 0.3 0.7 | 0.3 0.7 | 0.3 0.7
# OF BANKCARD TRADES: 5.3 4.0 | 5.6 4.5 | 5.4 4.2 | 4.6 4.1
# OF INST TRADES: 3.5 3.2 | 4.4 3.7 | 5.7 4.5 | 5.5 4.4
# OF AUTO LEASE TRADES: 0.1 0.4 | 0.1 0.4 | 0.1 0.5 | 0.1 0.4
# OF MORTGAGE TRADES: 1.4 1.8 | 1.3 1.9 | 1.5 2.1 | 1.0 1.7
# OF RETAIL TRADES: 3.4 3.1 | 3.2 3.0 | 3.2 3.1 | 2.4 2.7
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS: 2.5 2.4 | 2.1 2.4 | 2.1 2.4 | 1.4 2.0
# OF REVOLVING TRADES: 8.8 5.9 | 8.5 6.3 | 8.1 6.1 | 6.3 5.7
Table 1. (cont.) Remaining panels by group (BK, RI, RI*, RNI; MEAN and STD): Length of history, New credit, Payment, and Credit score. Only the Credit score panel values are recoverable, with the RNI column truncated:
VALID BUREAU SCORE: 765 93 | 709 110 | 678 112 | (RNI truncated)
UNSCOREABLE: 0.4% 6.1% | 0.6% 7.1% | 0.9% 8.9% | (RNI truncated)
Table 2. Bureau attributes exhibiting the largest distributional difference between booked
BK and the different rejected groups based on level of inference (RI, RI*, RNI), as
measured by the K-S statistic for the 2003 development sample. Shown are the variables
with significant difference for at least one of the pairs. The shading corresponds to the
level of K-S with darker indicating higher K-S (10-20, 20-30, 30-100).
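The two-sample K-S distance used in table 2 to compare booked and rejected distributions can be sketched as follows (hypothetical attribute draws, not the paper's data; the 0-100 scaling matches the table's shading bands):

```python
import numpy as np

def ks_two_sample(x, y):
    """Two-sample K-S distance: max gap between the empirical CDFs of x
    and y, scaled to 0-100 as in table 2."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return 100 * np.max(np.abs(cdf_x - cdf_y))

# Hypothetical aggregate-credit attribute for booked vs. rejected groups.
rng = np.random.default_rng(4)
bk = rng.lognormal(mean=10.0, sigma=1.0, size=4000)
rni = rng.lognormal(mean=9.4, sigma=1.1, size=4000)
print(f"K-S = {ks_two_sample(bk, rni):.1f}")
```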
Table 3. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the CLEAN segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES 7 6
# OF REVOLVING TRADES BAL > 0 6 6
AVERAGE AGE OF TRADES 5 8
AGG CREDIT FOR OPEN MORTGAGE TRADES 5 7
AGE OF OLDEST BANKCARD TRADE 4 7
# OF TRADES 30-180 DPD W/I 12 MOS 4 5
# OF INQUIRIES W/I 6 MOS 4 5
AGE OF OLDEST HOME EQUITY TRADE 3 4
# OF INQUIRIES W/I 12 MOS 3 4
AGE OF OLDEST TRADE 3 2
# OF MORTGAGE OPENED W/I 24 MOS 3 2
AGG BAL FOR OPEN FINANCE TRADES 2 4
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ MINOR 2 2
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 2 2
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 2 2
# OF BANKCARD TRADES 2 0
# OF INST TRADES 2 0
AGG BAL FOR OPEN HOME EQUITY TRADES 1 3
# OF ALL PUBLIC RECORD INCLUDING TRADELINE BANKRUPTCIES 1 3
AGG BAL FOR OPEN AUTO LOAN TRADES 1 2
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES 1 2
AGG BAL FOR OPEN MORTGAGE TRADES 1 2
# OF OPEN HOME EQUITY TRADES BAL > 0 1 1
# OF OPEN MORTGAGE TRADES BAL > 0 1 1
# OF RETAIL OPENED W/I 12 MOS 1 1
# OF REVOLVING TRADES 1 1
# OF CLOSED TRADES W/I 6 MOS 1 0
# OF BANKCARD TRADES BAL > 0 1 0
AGG BAL FOR OPEN INST TRADES 1 0
AGE OF OLDEST AUTO LEASE TRADE 1 0
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES 1 0
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 1 0
AGG BAL FOR OPEN BANKCARD TRADES 0 6
AGG CREDIT FOR OPEN BANKCARD TRADES 0 5
AGE OF OLDEST RETAIL TRADE 0 3
# OF TRADES 0 2
AGG CREDIT FOR OPEN AUTO LOAN TRADES 0 2
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 0 2
AGG CREDIT FOR OPEN HOME EQUITY TRADES 0 2
# OF AUTO LEASE TRADES 0 2
AGG CREDIT FOR OPEN AUTO LEASE TRADES 0 2
# OF MORTGAGE TRADES 0 2
# OF OPEN AUTO LOAN TRADES BAL > 0 0 1
# OF OPEN BANKCARD TRADES 0 1
AGG BAL FOR OPEN AUTO LEASE TRADES 0 1
AGE OF OLDEST MORTGAGE TRADE 0 1
# OF RETAIL TRADES BAL > 0 0 1
# OF RETAIL OPENED W/I 24 MOS 0 1
AGG CREDIT FOR OPEN RETAIL TRADES 0 1
AGG CREDIT FOR OPEN REVOLVING TRADES 0 1
Table 4. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the DIRTY segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGE OF OLDEST BANKCARD TRADE 6 9
# OF TRADES MAJOR DEROG 6 9
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ MINOR 6 5
AGG CREDIT FOR OPEN REVOLVING TRADES 6 2
AVERAGE AGE OF TRADES 5 8
# OF INQUIRIES W/I 12 MOS 5 7
# OF BANKRUPTCIES 5 7
# OF ALL PUBLIC RECORD INCLUDING TRADELINE 5 5
# OF BANKCARD OPENED W/I 24 MOS W/ MAJOR 4 7
# OF REVOLVING TRADES 4 7
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES 4 1
AGG BAL FOR OPEN BANKCARD TRADES 3 9
# OF TRADES 30-180 DPD W/I 12 MOS 3 9
AGG BAL FOR OPEN MORTGAGE TRADES 3 1
# OF REVOLVING TRADES BAL > 0 3 1
AGG CREDIT FOR OPEN MORTGAGE TRADES 2 8
AGE OF OLDEST HOME EQUITY TRADE 2 6
# OF TRADES CURRENTLY 30 DPD BAL > 0 2 3
# OF INST TRADES BAL > 0 2 3
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES 2 3
AGE OF OLDEST TRADE 2 2
# OF TRADES OPENED W/I 24 MOS W/ MAJOR DELINQ/DEROG 2 2
AGG CREDIT FOR OPEN INST TRADES 2 2
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 2 2
# OF TRADES 2 1
AGG CREDIT FOR OPEN BANKCARD TRADES 1 7
# OF INQUIRIES W/I 6 MOS 1 4
AGG BAL FOR OPEN FINANCE TRADES 1 4
# OF INST TRADES 1 3
# OF OPEN BANKCARD TRADES 1 2
# OF TRADES CURRENTLY 60 DPD BAL > 0 1 2
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 1 2
# OF BANKCARD TRADES - 60 DPD W/I 12 MOS 1 2
AGG CREDIT FOR OPEN AUTO LEASE TRADES 1 2
# OF OPEN MORTGAGE TRADES BAL > 0 1 2
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 1 2
# OF BANKCARD TRADES - 90 DPD W/I 12 MOS 1 1
AGG BAL FOR OPEN INST TRADES 1 1
# OF RETAIL TRADES 1 1
# OF RETAIL TRADES BAL > 0 1 1
AGG BAL FOR OPEN AUTO LOAN TRADES 1 0
# OF MORTGAGE - SEVERE DELINQUENCY INCLUDES 1 0
# OF OPEN HOME EQUITY TRADES BAL > 0 1 0
AGG BAL FOR MAJOR DEROG 1 0
# OF MORTGAGE OPENED W/I 24 MOS 1 0
AGE OF OLDEST MORTGAGE TRADE 1 0
AGE OF OLDEST RETAIL TRADE 1 0
# OF BANKCARD TRADES BAL > 0 0 4
# OF TRADES 60-180 DPD W/I 12 MOS 0 3
# OF CLOSED TRADES W/I 6 MOS 0 2
# OF TRADES 90-180 DPD W/I 12 MOS 0 2
# OF AUTO LEASE TRADES 0 2
AGE OF OLDEST AUTO LEASE TRADE 0 2
# OF MORTGAGE TRADES 0 2
AGE OF OLDEST AUTO LOAN TRADE 0 1
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 0 1
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES 0 1
# OF COLLECTION TRADES W/I 24 MOS 0 1
AGG BAL FOR OPEN HOME EQUITY TRADES 0 1
# OF RETAIL OPENED W/I 24 MOS 0 1
Table 5. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the THIN segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGG CREDIT FOR OPEN REVOLVING TRADES 8 7
# OF TRADES 30-180 DPD W/I 12 MOS 4 8
# OF TRADES MAJOR DEROG 4 5
# OF INQUIRIES W/I 6 MOS 4 4
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING 4 4
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 4 3
AGE OF OLDEST BANKCARD TRADE 3 6
# OF INST TRADES 3 3
# OF REVOLVING TRADES BAL > 0 3 2
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD 3 1
# OF INST TRADES BAL > 0 2 4
# OF INQUIRIES W/I 12 MOS 2 4
AGE OF OLDEST TRADE 2 3
# OF MORTGAGE TRADES 2 2
# OF REVOLVING TRADES 2 1
# OF TRADES CURRENTLY 60 DPD BAL > 0 2 0
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 2 0
AGG CREDIT FOR OPEN MORTGAGE TRADES 1 4
# OF TRADES 1 2
AVERAGE AGE OF TRADES 1 1
# OF BANKCARD TRADES 1 1
# OF BANKCARD OPENED W/I 24 MOS W/ MAJOR 1 1
# OF COLLECTION TRADES W/I 24 MOS 1 1
# OF BANKCARD TRADES - 60 DPD W/I 12 MOS 1 1
# OF BANKCARD TRADES - 90 DPD W/I 12 MOS 1 1
# OF MORTGAGE - SEVERE DELINQUENCY INCLUDES 1 1
# OF OPEN MORTGAGE TRADES BAL > 0 1 1
# OF BANKRUPTCIES 1 1
AGE OF OLDEST RETAIL TRADE 1 1
AGG BAL FOR OPEN AUTO LOAN TRADES 1 0
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 1 0
# OF TRADES 60-180 DPD W/I 12 MOS 1 0
AGE OF OLDEST AUTO LEASE TRADE 1 0
AGG BAL FOR OPEN BANKCARD TRADES 0 5
AGG CREDIT FOR OPEN BANKCARD TRADES 0 3
# OF RETAIL TRADES BAL > 0 0 3
# OF TRADES OPENED W/I 24 MOS W/ MAJOR 0 2
# OF BANKCARD TRADES BAL > 0 0 2
# OF OPEN BANKCARD TRADES 0 2
# OF AUTO LEASE TRADES 0 2
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 0 2
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ 0 1
# OF AUTO LOAN TRADES 0 1
AGG CREDIT FOR OPEN AUTO LOAN TRADES 0 1
# OF TRADES CURRENTLY 30 DPD BAL > 0 0 1
# OF OPEN HOME EQUITY TRADES BAL > 0 0 1
AGG BAL FOR OPEN INST TRADES 0 1
AGG CREDIT FOR OPEN INST TRADES 0 1
AGG BAL FOR OPEN AUTO LEASE TRADES 0 1
AGG BAL FOR OPEN MORTGAGE TRADES 0 1
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE 0 1
# OF ALL PUBLIC RECORD INCLUDING TRADELINE 0 1
AGG BAL TO CREDIT RATIO FOR OPEN RETAIL TRADES 0 1
Table 6. Discriminatory power results for the scores built on just booked and all through-
the-door accounts based on the 2003 data TTD2003 and BK2003 tested in sample and
out-of-time
K-S statistic (rows for the dirty and thin segments did not survive extraction)
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.59         0.58        0.59        0.59
all      2004             0.59         0.57        0.56        0.56
all      2005             0.58         0.56        0.56        0.55
all      2006             0.56         0.55        0.57        0.56
all      2007             0.52         0.51        0.52        0.51
clean    2003             0.59         0.57        0.57        0.57
clean    2004             0.56         0.55        0.51        0.51
clean    2005             0.55         0.54        0.51        0.51

Somers' D
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.74         0.73        0.74        0.75
all      2004             0.74         0.72        0.72        0.71
all      2005             0.72         0.71        0.71        0.70
all      2006             0.71         0.69        0.72        0.71
all      2007             0.67         0.66        0.67        0.65
clean    2003             0.71         0.71        0.69        0.70
clean    2004             0.70         0.68        0.64        0.64
Table 7. Score forecast accuracy based on the H-L test statistic for the scores built on just booked and all through-the-door accounts based on the 2003 data (TTD2003 and BK2003), tested in-sample and out-of-time.

H-L statistic
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             23.3         816.3       193.5       17.8
all      2004             95.1         1799.1      178.1       23.6
all      2005             26.3         1020.8      208.3       48.7
all      2006             318.3        2319.9      111.8       278.6
all      2007             1712.3       5829.6      419.4       1411.5
clean    2003             22           149.1       55.3        11.6
clean    2004             32           241.2       25.2        16.5
clean    2005             24.4         162.2       19.3        42.3

p-value, χ2(10 d.f.)
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.010        0.000       0.000       0.058
all      2004             0.000        0.000       0.000       0.009
all      2005             0.003        0.000       0.000       0.000
all      2006             0.000        0.000       0.000       0.000
all      2007             0.000        0.000       0.000       0.000
Table 8. Score forecast accuracy: Normal test by deciles for the scores built on just booked and all through-the-door accounts based on the 2003 data (TTD2003 and BK2003), tested in-sample and out-of-time. Provided are results for the full sample showing the p-value and expected less observed (exp-obs) number of BADs.
Table 9. The H-L test across all scores and validation years for the scores built on just booked (BK) and all through-the-door (TTD) individuals, with values below the χ2(10) critical value shaded. The number of observations is around 100K for all cohorts.

TTD_TTD
validation year  score 1999  2000  2001  2002  2003  2004  2005  2006  2007
1999  54 . . . . . . . .
2000  227 37 . . . . . . .
2001  139 40 27 . . . . . .
2002  172 411 393 14 . . . . .
2003  218 438 414 32 23 . . . .
2004  119 211 191 65 95 28 . . .
2005  325 561 547 38 26 103 19 . .
2006  589 378 455 215 318 184 284 12 .
2007  2190 978 1248 1370 1712 1299 1469 296 31

BK_TTD (remaining rows lost in extraction)
1999  388 . . . . . . . .

TTD_BK (remaining rows lost in extraction)
1999  143 . . . . . . . .

BK_BK
1999  25 . . . . . . . .
2000  32 16 . . . . . . .
2001  15 38 16 . . . . . .
2002  51 88 38 7 . . . . .
2003  146 188 130 47 18 . . . .
2004  157 179 117 58 24 7 . . .
2005  228 265 214 97 49 19 15 . .
2006  225 147 196 182 279 172 109 13 .
2007  788 314 768 741 1412 1012 917 205 15
Table 10. The H-L test across the booked (BK) and through-the-door (TTD) scores and validation years, applied only to the individuals that have not been part of the model-development sample, using the format of table 9 (panels in the order TTD_TTD, BK_TTD, TTD_BK, BK_BK; score columns 2000-2007). This validation sample averages fewer than 4,000 observations. Note also that this subset of the TTD population has a large portion of unscoreable individuals, which also affects the accuracy of the results.

TTD_TTD
2001  51 48 . . . . . .
2002  75 63 85 . . . . .
2003  63 70 88 72 . . . .
2004  79 88 96 80 42 . . .
2005  71 74 84 73 51 58 . .
2006  78 76 83 61 55 69 35 .
2007  93 95 94 93 95 66 43 50

BK_TTD
2001  146 94 . . . . . .
2002  94 53 69 . . . . .
2003  56 35 53 63 . . . .
2004  99 73 83 108 76 . . .
2005  77 59 72 99 76 74 . .
2006  102 86 111 134 90 59 48 .
2007  124 78 137 140 166 97 79 65

TTD_BK
2001  42 56 . . . . . .
2002  39 41 48 . . . . .
2003  36 43 54 42 . . . .
2004  68 77 79 65 43 . . .
2005  77 69 87 71 60 73 . .
2006  83 65 65 55 42 52 40 .
2007  87 67 75 60 69 66 33 37

BK_BK
2001  30 23 . . . . . .
2002  36 20 45 . . . . .
2003  28 23 23 37 . . . .
2004  62 79 44 57 29 . . .
2005  83 40 59 79 35 40 . .
2006  70 49 64 53 38 30 16 .
2007  60 53 87 65 92 40 23 22
Figure 1. Illustration of the consumer credit database CCDB and the timing of the TTD
sample selection and performance horizon. The vertical bars represent the cross sectional
bureau data in the CCDB at each June 30th snapshot of both scoreable and unscoreable
individuals. The difference in shading indicates the mixture of individuals that make the
unbalanced panel form of the CCDB. Those that remain or become scoreable are part of
the sample in the following year and in each year, new unscoreable and scoreable
individuals are added. The TTD sample is taken from the full cross sectional snapshot
and can include both scoreable and unscoreable individuals as well as both individuals
that have been in the sample in the previous year or are new to the panel.
[Diagram: for each June 30th snapshot (2002, 2003, 2004), the TTD sample is drawn from the scoreable and unscoreable cross section and followed by proxy performance accounts over the subsequent quarters.]
Figure 2. Annual sample distribution for the type of applicant
[Stacked-bar chart, 1999-2007: shares of the TTD sample by applicant type (BK, RI, RI*, RNI), with sample counts (0-200) on the secondary axis.]
[A second chart, whose caption was lost in extraction (apparently Figure 3): Percent Bad (90+ DPD), 0-12%, by year 1999-2007 for the BK, RI, and RI* groups.]
Figure 4. Distribution of the credit bureau score for the booked and the three groups of
rejected individuals based on level of inference. Extreme invalid credit score values are
set to missing.
Figure 5. Distribution of booked and rejected applicants by segment
[Three stacked-bar panels, 1999-2007, each showing the distribution of a segment's TTD sample across BK, RI, RI*, and RNI; only the "Dirty segment TTD" panel title survived extraction.]
Figure 6. Segment distribution across the development and validation samples 1999-
2008. The first panel shows the full sample in thousands and the second only new
individuals used for out-of-sample out-of-time validation.
[First panel: stacked bars, 1999-2007, full sample in thousands (0-100) split into clean, dirty, and thin. Second panel: stacked bars, 2000-2007, new individuals used for out-of-sample out-of-time validation (0-4000) split into clean, dirty, and thin.]