April 2013
Abstract:
Models evaluating credit applicants rely on payment performance data, which is only
available for accepted applicants. This sampling limitation could lead to biased parameter
estimates. We use a nationally representative sample of credit bureau records to examine
sample selection bias in account acquisition scoring models and to evaluate the
effectiveness of the industry practice of using proxy payment performance for rejected
applicants. Our results show that ignoring the rejected applicants significantly affects
forecast accuracy of credit scores while it has little effect on their discriminatory power.
Finally, we document that validating scores only on accepted applicants can be
misleading.
The authors thank David Hand, Nicholas Kiefer, Christopher Henderson, OCC and Federal Reserve Bank
of Chicago seminar participants, and an anonymous referee for helpful comments and suggestions. A
previous version of this paper has also been circulated as “Adjusting for Sample Selection Bias in
Acquisition Credit Scoring Models.” The views expressed in this paper are those of the authors and do not
necessarily reflect the views of the Office of the Comptroller of the Currency (OCC), or the US Treasury
Department. The authors can be reached via email at irina.barakova@occ.treas.gov,
dennis.glennon@occ.treas.gov, and ajay.palvia@occ.treas.gov.
1. Introduction
Consumer credit scoring models are a key input in banks’ credit acquisition
strategies and banks increasingly rely on such models to evaluate credit risk when
deciding whether or not to extend credit. Such credit scores can be developed based on
the payment performance history of previous applicants, assuming they are representative
of future applicants. However, banks can only observe the performance of their
customers and not of those applicants that were rejected. To the extent past rejection decisions were systematic, the sample of accepted applicants is not representative of the full pool of applicants, which could bias the estimated scoring model. In turn, such a biased score could lead to a misguided acquisition strategy and future losses.
This sample selection problem is well known in the industry and the academic literature, and numerous methods, collectively known as reject inference, have been proposed in order to account for the missing data. Many of these methods have been evaluated in academic studies but are not widely used in practice. Lenders’ most common way to address the
sample selection bias is to obtain proxy credit performance information on the applicants
that they have rejected in the past from their credit bureau records. However, the general
impact of this approach on the performance of credit scoring models has not been examined. We use a nationally representative sample of credit bureau records that tracks a large number of credit card applicants over a 10-year period to examine potential sample selection bias in acquisition scoring models. The performance of the resulting score is evaluated to assess the scope of the bias and the gains from supplemental bureau data. Our main findings are as follows.
First, we find that generic credit bureau scores and other risk factors are
significantly worse for loan applicants who were rejected by a major credit card issuer at
least once before receiving credit and worse still for applicants who did not obtain credit
from a major credit card issuer but were able to obtain credit elsewhere. The results
suggest that accounting for the effect of applicants who were rejected at least once could
improve scoring models. The results also indicate, however, that a large sub-portion of
applicants attempt to obtain credit but do not succeed in doing so, suggesting that the censoring problem cannot be fully eliminated with supplemental bureau data.
Second, the discriminatory power of the score is not substantially different when acquisition scoring models are built including or excluding rejected applicants. The score’s forecast accuracy, however, does improve when
supplemental data is used to infer the performance of rejected applicants. We also find
that older models perform substantially worse in terms of accuracy regardless of whether
reject inference is used. Thus, our findings indicate reject inference is important for improving model accuracy but is not a substitute for building newer models.
Finally, our out-of-sample score validation results show that ignoring the rejected
applicants when validating the score could lead to misleading results. Delinquency rates
expected for each score range appear overestimated for the first couple of years after
score implementation when the score is validated only on the accepted applicants. This
finding is important since it is not standard industry practice to use supplemental bureau data when validating scores.
The rest of this paper is organized as follows. The next section provides some
background on the issue of sample selection bias in credit scoring models. Section 3
presents the data and methodology. Section 4 discusses our analysis, and the final section concludes.
2. Literature
A small but growing literature examines the statistical techniques and issues in
credit scoring. Hand and Henley (1997) offer an excellent review of the statistical
techniques used in building credit scoring models and Glennon (1999) outlines the
conceptual framework, current practices, and modeling issues in retail credit scoring.
More recently, Glennon et al. (2008) utilize a proprietary data-set to estimate and validate
credit scoring models for bank card borrowers. The results indicate that current industry
best practices can be effective at ranking borrower risk but may fall short when it comes to accurately forecasting default rates.
Kiefer and Larsen (2006) further discuss the key conceptual and statistical issues in credit scoring. A central issue in developing acquisition scoring models is whether the applicant data used to build
such models, given that it excludes rejected applicants, is appropriate. The inability to
build a new model on the entire sample of past applicants need not necessarily lead to
selection bias. As noted by Little and Rubin (1987), missing data can be categorized in one of three ways: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the first two cases, account performance does not depend on the selection process. In contrast, if the data is MNAR, then credit
performance is a function of the selection process and it is in this case that sample
selection bias will occur. Such bias has become well known in the literature and bank
industry and numerous reject inference techniques have been proposed in order to
mitigate its effect (see Joanes (1993), Hand and Henley (1994), and Hand (2001a, 2001b)).
Among the most commonly used reject inference methods is to obtain external
performance data (from credit bureaus) from applicants that were rejected. Using such
performance data, a lender can seek to infer how rejected applicants would have
performed if accepted; this is often referred to as the supplemental data method. For the
subset of rejected applicants for which performance data is available, the method assumes
that default on some other credit product is equivalent to default on the product of
interest; the new model is then created while factoring in the behavior of these rejects.
The drawbacks of this approach include the cost of obtaining the bureau data and of
assuming default on another product is equivalent to default in the current model. Also,
because the rejects with no performance data are likely to be non-random, this method
will not completely eliminate bias. To date, there is little evidence on the effectiveness of this widely used approach.
Another way to address the problem is to generate performance data for rejected applicants that would not normally be available. Instead of obtaining such data from a bureau, the lender does so by accepting a subsample of applicants that would otherwise be rejected. Obtaining actual performance data in this way, sometimes referred to as enlargement, is an ideal way to mitigate selection bias; but it is also very
costly since rejected applicants are the most likely to default. Using simulated data, Parnitske (2005) examines this approach and finds that it does reduce selection bias.
A further category of reject inference techniques extrapolates from the accepted population. One of the more frequently used types of extrapolation, the re-weighting method, assumes that
the relation between the borrower characteristics and default is identical for accepted and
rejected applicants. 1 The method essentially works by giving greater weight to the lower-scoring accepted applicants relative to higher-scoring ones. Crook and Banasik (2004),
using proprietary data, find that the re-weighting technique does not improve the
performance of the good-bad model. Other extrapolation methods that assign default
status to rejected applicants and in essence “create” data include “re-classification” and
“parceling”. These methods, while easy to implement, tend to be quite arbitrary and often
result in false precision or lead to a distortion of the actual default data (Kiefer and
Larsen (2006)).
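As an illustration of the re-weighting idea discussed above (a hypothetical sketch, not the implementation used in any of the cited studies), each accepted applicant can be weighted by the inverse of the acceptance rate in its score band, so that the scarce low-scoring accepts stand in for the rejects with similar scores:

```python
import numpy as np

def reweight(scores_accepted, scores_all, n_bands=5):
    """Weight each accepted applicant by the inverse of the acceptance
    rate in its score band, so that low-scoring accepts (scarce relative
    to the full applicant pool) count more in the good/bad model."""
    edges = np.quantile(scores_all, np.linspace(0, 1, n_bands + 1))
    band = lambda s: np.clip(np.searchsorted(edges, s, side="right") - 1,
                             0, n_bands - 1)
    n_all = np.bincount(band(scores_all), minlength=n_bands)
    band_acc = band(scores_accepted)
    n_acc = np.bincount(band_acc, minlength=n_bands)
    accept_rate = n_acc / np.maximum(n_all, 1)
    return 1.0 / np.maximum(accept_rate[band_acc], 1e-9)
```

An accepted applicant in a band where only one in five applicants was accepted receives weight 5, so the weighted accepts sum to the size of the full applicant pool.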
The last category of reject inference techniques includes those based on variations
of Heckman’s 2-step sample bias correction procedure where the first step is the selection
equation and the second step is the default model. Greene (1998) examines the impact of
Heckman’s procedure on predicting loan default and finds that coefficients for key
[1] Extrapolation techniques such as re-weighting do not reduce selection bias in the sense that if the model
determining good/bad is assumed to be the same for accepts and rejects, there is no bias to correct for. In
these cases, reject inference is really intended to reduce variability and thus make the model more efficient.
For a more complete discussion, see Banasik and Crook (2007).
default determinants differ substantially from results obtained without correcting for
selection bias. Banasik and Crook (2007) use a proprietary data set where applicants that
would have normally been rejected were accepted and find that using a bivariate probit
model design to address sample selection bias improves model performance. In related work by Banasik et al., a proprietary data set is used and the authors find that the procedure can improve model accuracy in some cases but the improvements are small. Finally, Wu and Hand (2007),
using simulated data, find that Heckman’s procedure improves the good-bad model, but
only if the normality assumption holds, when “enough” customers are rejected and accepted, and when the original accept/reject decision was not determined primarily by the characteristics included in the default model.
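The two-step logic discussed in this paragraph can be sketched as follows; all variable names are hypothetical, and a linear second step is used purely for illustration (the credit scoring applications cited use probit or logit outcome models):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_fit(X, d):
    """Step 1: fit the selection (accept/reject) equation by probit MLE."""
    def nll(g):
        p = np.clip(norm.cdf(X @ g), 1e-10, 1 - 1e-10)
        return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

def heckman_two_step(X_sel, accepted, X_out, y_out):
    """Step 2: add the inverse Mills ratio from the selection equation
    as an extra regressor in the outcome model fit on accepts only."""
    g = probit_fit(X_sel, accepted)
    z = X_sel[accepted == 1] @ g
    imr = norm.pdf(z) / norm.cdf(z)  # inverse Mills ratio, always > 0
    X_aug = np.column_stack([X_out, imr])
    beta, *_ = np.linalg.lstsq(X_aug, y_out, rcond=None)
    return beta  # last element is the selection-correction coefficient
```

The coefficient on the inverse Mills ratio captures the correlation between the selection and outcome errors; when it is zero, selection is ignorable.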
In summary, the extant literature has highlighted sample selection issues in credit
scoring and, in particular, has evaluated several methods to adjust credit scoring models
for sample selection bias arising from building/validating models on previously booked
applicants. The evidence regarding the effectiveness of these models is mixed, however.
Further, inferring the behavior of rejected applicants using supplemental bureau data is
widely used but, to the best of our knowledge, the effectiveness of this approach has not
previously been directly examined in the literature. Our paper, in part, is motivated by this gap.
3. Sample design
3.1 Data
Banks usually purchase credit history information for the accepted and rejected
applicants from one or more of the credit bureaus for the development of acquisition
scores. Following industry practices, we use data from one of the three major credit bureaus 2 – a consumer credit database (CCDB) with information for a growing sample of 1 to 1.5 million individuals during the period 1999 to 2009. For each individual, the CCDB contains a credit bureau score and
information on all debt exposures, inquiries, public records, and any other reports to the
credit bureau, also known as tradelines. In addition, for each tradeline the CCDB includes
the type, amount, and payment history, which allow for the calculation of the hundreds of
credit risk attributes that credit bureaus provide to banks. Credit standing of the
individual is reported as of June 30th of each year, so the data consists of a series of snapshots of individuals and their risk attributes as of that date. One exception is the payment status for each tradeline, which is available at the actual monthly frequency.
The CCDB sample includes both scoreable and unscoreable individuals. The
unscoreable individuals are those that have been inactive in the past 12 months or with a
very limited credit history such that they cannot be assigned a valid bureau credit score.
They constitute around a quarter of the full credit bureau data. In the CCDB, the
scoreable individuals are over represented relative to the full population of individuals
reported to the credit bureau such that the sample is well suited for score development.
[2] The three major credit bureaus are Equifax, Experian, and TransUnion, and they maintain credit files for
around 200 million individuals. The credit files contain information from grantors of consumer credit and
collectors of public records. The bureaus use the information to build consumer credit history, consumer
credit score and consumer credit attributes used for evaluating consumer credit quality.
For the 1999 sample there are 950,000 scoreable and 50,000 unscoreable individuals, for a total of one million records. Going forward, individuals are kept in the panel if they remain or become scoreable. A random sample of unscoreable individuals is added to the panel each year, as well as another random sample of scoreable individuals averaging six percent of the existing scoreable individuals. The
design of the panel is illustrated by the vertical bars in figure 1. The large portion of
individuals that are kept in the panel from year to year allows us to observe future
performance for credit risk evaluation purposes. The added unscoreable individuals
ensure that individuals with relatively short credit history are represented in the sample.
This is important for our purposes since the individuals with relatively short credit history
are likely to apply for credit. The additional scoreable individuals are selected based on
the distribution of the bureau score so that the sampling is more representative of the population. 3 This unbalanced panel sample design allows us to track performance over different horizons for score development and validation purposes, as well as to test the scores on out-of-time samples.
3.2 Methodology
In this section we describe the selection of the sample that we will use for
evaluating the impact of reject inference on acquisition credit score performance. Bank
credit card acquisition strategies target the pool of potential customers. Each bank might
have different acquisition channels and depending on the channel, the customer pool
[3] For a more detailed illustration of the sample design, see Glennon et al. (2008).
might consist of actual applicants or the population identified for mailing applications. 4
The acceptance rate and thus the need for reject inference can vary across channels.
Since banks evaluate applicants based on characteristics known up front, for modeling purposes they take a snapshot (ie, cross section) of those characteristics and construct a subsequent performance measure for them. For consistency with this industry practice, we turn the panel data into a series of snapshot samples used for score development and validation,
which allows us to test the robustness of our results through time. One concern is that the
aging of the population in the sample could have a downward bias on the number of
rejected applicants in our analysis since the individuals with less credit history, such as
the young borrowers, are more likely to be rejected rather than granted a credit card. As a
result, we suspect that any impact of reject inference on score performance that we find in our analysis is likely understated.
Since we can construct the individual risk attributes only as of June 30th, we take
all credit card applicants from the following quarter as our sample window for identifying
the through-the-door (TTD) population. Figure 1 illustrates the construction of the sample.
We cannot completely replicate a development sample for any given bank since the bank
reported card inquiries and newly opened card accounts in the CCDB are anonymous and
cannot be associated with a particular institution. Instead we take all newly opened
bankcards and inquiries observed during the third quarter of each year as our model
[4] The use of the score for identifying a mailing base is also known as pre-approval or front-end evaluation.
Back-end evaluation refers to the use of the score for acceptance/rejection of actual applicants once the
applicants have responded to the pre-approved offers.
development sample, which is a random sample of the credit card applicants for the industry as a whole. 5
While for a particular bank, the accepted and rejected applicants are naturally
defined, we need to identify these subsets for the industry given our sample. We define as
booked (BK) the set of all individuals that have applied for a card to a bank during the
third quarter of the year and have been granted credit in each of those instances. The BK
individuals have not been rejected by any institution during the chosen quarter even if
they have been rejected at some other point in the past. The rejected individuals are those
that have made at least one inquiry and have been rejected. They could have received a card from one bank during the quarter, but at least one other bank has rejected them. This implies there is less certainty about their desirability as customers; since they would fall in the rejected pool of at least one institution, we consider them rejected.
Following industry practices, we classify the rejected applicants further depending on our
ability to infer performance. Individuals that manage to open a bank card, after being
rejected by at least one card lender during the same or the following quarter, make up our
main reject inference group (RI). This increase in the selection window is done in order
to allow for more of the rejected individuals to have opened cards to proxy performance.
To further expand the possible inference set, it is common industry practice under the supplemental approach to use, for individuals that do not manage to open any bankcard during the extended window, nonbankcard tradelines – such as a retail credit card or loan opened during the third or fourth quarters – as a proxy for performance tracking. As part of our analysis, we also mimic this industry practice to evaluate its
[5] To the extent very large banks have nationally representative applicant pools, one could argue that for the
largest few bankcard providers our development sample is indeed representative of the industry applicants.
performance. In particular, we augment the reject-inference sample with individuals that
were rejected and opened tradelines on other retail credit products during the observation
window. We label the extended reject inference sample RI*. 6 The set of individuals for
which we cannot make any inference because they do not open any new tradeline during
the selected window is labeled RNI. Note that it is possible to use existing bank cards or
other tradelines to proxy performance, but that would not be a directly relevant measure of performance on newly acquired credit. As another variation of proxy performance information for rejected applicants, banks can use the credit bureau score at the end of the performance period. However, as a summary statistic, the score reflects any credit performance deterioration and cannot distinguish the performance of newly opened accounts from that of existing ones. 7
The BK applicants combined with the RI and RI* group of applicants make up the
through-the-door (TTD) sample that can be used for score development with reject
inference, which is depicted for each of the yearly samples in figure 2. Around half of the
individuals are immediately booked (BK). Another 15 percent are rejected somewhere
but manage to receive a credit card (RI). With the expanded proxy performance (RI*),
another 5 percent of the applicants can be used for score development. However, around
[6] To the extent that the performance of different credit products is driven by different underlying factors,
assuming such products are similar could lead to biases. Though we accept that such biases might occur,
our main goal is to evaluate whether such models, which are widely used in industry, are nevertheless
helpful in inferring the behavior of rejected applicants. In terms of delinquency rates the RI and RI* groups
are much more similar to each other than to the BK group. Even if the RI* performance measure is not a
true representation of their possible performance on a bank card, a comparison between the RI and RI*
groups across the performance of non bank card accounts shows that the RI* group is riskier.
[7] The 2004 data shows that for the BK group there is a negative 50% correlation between 90 days plus
delinquency within 24 months and the fresh bureau score at the end of the 24 month period. For the group
for which we have performance on a new account (the RI group) the correlation is negative 40%. Although
the correlation is relatively high for both groups, it shows that the score is influenced by more than the
performance on newly opened accounts.
20 percent of our random sample consists of individuals for which no inference can be
made (RNI) from credit bureau data. This implies that the problem of censoring cannot be fully addressed even with supplemental data.
Account performance is typically defined in terms of delinquency or any major derogatory measure under a fixed horizon, eg 12, 18, or 24 months. Often the industry practice in setting the fixed horizon is a matter of selecting
another point in time as of which performance is evaluated. We follow this practice and
assess performance as of the end of the fourth quarter of the following year, which results in a varying performance horizon, with the booked sample having a longer horizon than the rejected sample. Given
that default risk can only increase with the length of the horizon, we may be
underestimating the true default risk of the rejected population. The rejected applicants
do appear with significantly higher delinquency rates than the booked, so any such bias works against our findings. Our main performance measure is 90 days plus delinquency (90 DPD), although alternative definitions such as 60 days delinquency are used for
robustness. The performance for the BK and RI groups is based on the worst
performance they have for any of the opened credit cards during the window. For the RI* group, performance is based on the nonbankcard tradelines opened during the selected window. The RNI individuals do not open any new tradeline, so their performance cannot be tracked.
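The worst-status labeling rule described above can be sketched as follows (the data layout is a hypothetical illustration):

```python
def bad_flag(tradeline_dpd, threshold=90):
    """Label an applicant 'bad' if the worst days-past-due status across
    all tracked tradelines reaches the threshold within the window."""
    worst = max(max(history) for history in tradeline_dpd)
    return int(worst >= threshold)

# Two tracked cards; the second hits 90 DPD, so the applicant is bad.
label = bad_flag([[0, 30, 60], [0, 90, 120]])
```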
4. Analysis
4.1 Univariate analysis
Our analysis begins with comparing the BK, RI, RI* and RNI groups of credit
card applicants defined in the previous section. Figure 3 shows the annual bad rate across these groups; the bad rates are substantially different during all time periods. The booked accounts exhibit less than half the bad
rate of the RI group. Table 1 presents the mean and standard deviations for all the credit
bureau attributes that we use in building scoring models across the nine annual cohorts. The attributes are constructed in a manner consistent with the definitions developed by the credit bureau. We select a subset of these attributes from the five broadly defined categories used in scorecard development – payment history, amount owed, length of credit history, types of credit in use, and new credit – and each category is presented in a separate panel. Relative to the BK group and across the rejected groups RI, RI*, and
RNI, the mean credit bureau score decreases, the percent of unscoreable individuals increases, the number of inquiries increases, and the instances of non-zero balance are higher even if the average balance itself is not; the balance-to-credit ratio is also higher. The RNI group has on average a shorter credit history and a worse payment history, with the highest incidence of past delinquency.
The difference in the population characteristics can also be seen in figure 4, which
depicts the full distribution of the generic bureau score across the four groups in the 2003
sample. Clearly the BK group is of higher credit quality than the three rejected groups
and the RNI group has the lowest credit quality; the RI and RI* groups appear to have similar distributions. We evaluate the magnitude of the differences in the distributions across the BK and rejected
data sets (RI, RI*, and RNI) using a non-parametric Kolmogorov-Smirnov (K-S) test,
which measures the level of separation between two distributions. 9 We report in table 2,
for a subset of attributes, the KS statistics for the difference in the distributions from the
various rejected groups (RI, RI*, and RNI) and the distribution of values from the booked
(BK) data set using the 2003 development data. We report the results for a subset of
attributes for which the K-S statistic is more than 20 percent for at least one of the comparisons; that is, the difference between the attribute distribution for the BK data is
large relative to the distribution of values for at least one of the data sets made of rejected
accounts. The shading indicates the level of the K-S statistic for each pair of distributions
with darker corresponding to higher K-S. The first column shows the K-S between the
BK and RI groups which exhibit the least differences. The third column comparing the
BK and RNI distribution is the darkest indicating the greatest amount of difference and
the most characteristics where the groups differ in distribution. Such differences in the
distribution suggest that building a score only on the booked accounts but applying it on
all booked and rejected could be misleading. Although the RNI group is large, the
[8] A common practice is for banks to incorporate in their acquisition strategy a cut off based on the generic
bureau score and in this way they eliminate at least half of the RNI group from their customer base. We do
not take this route for our analysis since such a subjective cut off depends on the bank’s risk strategy while
our goal is to document the scope for reject inference.
[9] If the distribution of values for a specific attribute varies significantly across the data groups, the KS statistic will be large (e.g., greater than 20 percent). Conversely, a small KS value (e.g., less than 10 percent) implies the distributions of values for that attribute are similar across the data sets.
addition of the RI and RI* groups to the BK for score development has the potential to
address some of the censoring given the differences across these groups.
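The two-sample K-S statistic reported in table 2 is simply the maximum vertical gap between the empirical distribution functions of an attribute in two groups; a minimal sketch, assuming the attribute values for each group are available as arrays:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of samples x and y."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))
```

Fully separated samples give a K-S of 1, identical samples give 0; the 20 percent screening threshold used in the text sits between these extremes.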
For score development purposes, we split the full sample further in terms of
“clean” individuals, ie those that have not had any major delinquency in the past, “dirty”
individuals, ie those that have been at least 60 days delinquent in the past or currently,
and individuals with “thin” files, for which it is hard to determine credit quality. Banks do split the thin segment further into dirty and clean, but the size of our sample would make such a
split impractical. Figure 5 shows the subgroups of BK, RI, RI* and RNI for each of the
clean, dirty, and thin segments. Consistently across our samples, the dirty and thin segments have far fewer booked accounts (BK), and it is in those segments that we expect reject inference to matter most. The outcome we model is whether an individual becomes 90 days delinquent or worse in the next 12-18 months. Following industry practice, we estimate a logistic regression for each of the three segments – clean, dirty, and thin – using the 90 plus day
delinquency as the dependent variable. Banks may have a more granular segmentation
based on their portfolio, but the clean, dirty, thin split is usually part of any segmentation scheme. For robustness, we also evaluate the score when built on the full population without any
segmentation. Although banks sometimes use some form of expert judgment in selecting
characteristics for scorecard development, the final score is usually based on a statistical
selection of variables, often through a stepwise regression. Glennon et al. (2009) provide evidence that this method indeed performs relatively well compared to semi-parametric alternatives.
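A stepwise selection of this kind can be sketched as a greedy forward search over attributes (a hypothetical illustration of the general technique, not the exact procedure used for the paper’s models):

```python
import numpy as np
from scipy.optimize import minimize

def logit_loglik(X, y):
    """Fit a logistic regression by MLE and return its log-likelihood."""
    def nll(b):
        z = np.clip(X @ b, -30, 30)
        return np.sum(np.log1p(np.exp(z)) - y * z)
    return -minimize(nll, np.zeros(X.shape[1]), method="BFGS").fun

def forward_stepwise(X, y, min_gain=4.0):
    """Greedy forward selection: repeatedly add the attribute that most
    improves the log-likelihood, stopping when the gain is small."""
    n, k = X.shape
    selected, current = [], logit_loglik(np.ones((n, 1)), y)
    while True:
        gains = []
        for j in range(k):
            if j in selected:
                continue
            cols = np.column_stack([np.ones(n)] +
                                   [X[:, s] for s in selected + [j]])
            gains.append((logit_loglik(cols, y) - current, j))
        if not gains:
            break
        best_gain, best_j = max(gains)
        if best_gain < min_gain:
            break
        selected.append(best_j)
        current += best_gain
    return selected
```

The stopping threshold plays the role of the entry criterion in a standard stepwise procedure; with highly collinear bureau attributes, only a handful of variables typically survive.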
We start with more than 80 credit bureau attributes summarized in table 1 but the
final models have in some cases fewer than 10 attributes. Given the high multicollinearity across the attributes, the coefficients may not be informative; although we develop nine separate scoring models – one for each annual cohort from 1999 to 2007 – we do not present all
estimated models. Instead we focus on the types of variables and how often they are
selected to be included in a scoring model. Tables 3, 4, and 5 show the number of times a
particular attribute is selected through the stepwise selection process in one of the 9
cohort-based scoring models for the clean, dirty and thin segments respectively. The
tables are sorted from the most to least selected variables, which allows for a comparison
between the scoring models built only on the booked (i.e., BK) accounts and those built on the full TTD sample.
Those results show that the scoring models developed on the TTD data
incorporate a wider range, and an alternative mix, of variables relative to the BK-based
models. The TTD scoring models more often include information on the length of credit history and on both the credit line and balances for bankcards and other trades. The bureau
score is selected in all models and since it may be capturing most of the information
about the individual’s credit quality, we also estimate the models without the score, which, as discussed later, has little impact on performance. Because of the high correlations across
all the bureau attributes, it is not clear whether any of those differences would lead to
significant variation in performance of the scoring models built on the TTD population
versus those developed on the BK sample data only, which is discussed next.
We evaluate the in-sample and out-of-time performance of the scores with respect to the objectives and purpose of the model.
Acquisition strategies are often based on several models. 10 Banks use risk scores to set
acceptance/rejection cut off values based on risk tolerance, which implies, at least
implicitly, that the expected default rate (or odds) at cut off are consistent with a given
bank’s pricing and risk/return objectives. The score is also used for assigning credit line
levels and terms of the contract. For those latter purposes, the discriminatory power of the
model is important because the bank needs to be able to differentiate the potential
customers by their level of risk. Higher lines and better terms or special offers like
balance transfers could be made to the better-quality customers in order to maximize profitability.
For many of the decisions in the acquisition area, the scores also need to indicate the level of risk accurately. The score-associated likelihood of delinquency (or, equivalently, the odds ratio) is used for
setting score cutoff levels for the approval decision but also for account and portfolio
[10] A risk score, or the likelihood of the account becoming severely delinquent, which is the focus of this
paper, is usually combined with a response likelihood forecast as well as a balance and/or revenue forecast.
profitability and loss analysis. 11 Thus, an equally important quality of the credit score is its forecast accuracy.
We use the performance of the reject inference score built on the TTD sample, when applied to the TTD pool, as a benchmark. Using the same TTD validation sample, we apply the score
built only on the BK subset and compare the results to the benchmark in order to evaluate
the selection bias. We also track the performance of the TTD score and the BK score
when applied only on the BK subset as a validation sample. These last two cases are
actually common model validation practices in the industry because, unlike for model
development purposes, for model validation banks do not typically gather performance
information for the individuals that have been rejected by the model. At the same time,
for acquisition purposes the scoring model is applied on the full pool of potential
customers so proper validation has to be done on at least the above defined TTD
population.
Discriminatory power
For ease of presenting the results, we show first the performance of the scoring
models built on the 2003 data sample. The two scoring models that we test are those built
on the BK and TTD samples labeled BK2003 and TTD2003 respectively. The first panel
of table 6 exhibits the K-S statistic for both in-sample and out-of-time samples. The
results are provided by segment (clean\dirty\thin) and also for the aggregate portfolio
(all). Each column shows a particular combination of score and validation sample:
[11] Banks use additional layers of business logic to determine final cutoffs beyond the odds suggested by the
score. Risk management might have different targets and risk tolerance across geographic regions or
products as they monitor the performance of the score within such portfolio segments.
TTD2003_TTD, BK2003_TTD, TTD2003_BK, and BK2003_BK. 12 Note that the
discriminatory power is relatively high and decreases very little through time for the
2004-2007 validation samples. This finding is consistent with the results for behavioral
scoring models reported in Glennon et al. (2008) and is generally also assumed in the
industry. The robustness of the models' ability to discriminate between good and bad accounts over time justifies the industry practice of relying on models over extended periods.
As expected, the thin and dirty segment scores exhibit lower discriminatory power. The TTD2003_TTD values are either the same as or a couple of percentage points higher than the BK2003_TTD values, but not sufficiently higher to imply that reject inference matters for the discriminatory power of the scoring model, a result that holds for all years (not reported).
For the thin and dirty segments, the models developed on the TTD data perform better by roughly 5 percentage points when validated only on the BK accounts (i.e., TTD2003_BK) than when validated on the TTD population (i.e., TTD2003_TTD). While this is not a large enough difference to warrant concerns about the model's ability to differentiate between the default and non-default distributions, it shows that validating only on booked accounts may tend to overestimate the discriminatory power of the score.
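As a concrete reference for the two measures used here, both can be computed directly from scores and outcomes. This is a minimal sketch on simulated data (not the paper's models), with Somers' D obtained via the 2·AUC − 1 identity:

```python
import numpy as np

def ks_stat(score, bad):
    """K-S: maximum vertical gap between the score CDFs of bads and goods."""
    s_bad = np.sort(score[bad == 1])
    s_good = np.sort(score[bad == 0])
    grid = np.sort(score)
    cdf_bad = np.searchsorted(s_bad, grid, side="right") / s_bad.size
    cdf_good = np.searchsorted(s_good, grid, side="right") / s_good.size
    return np.max(np.abs(cdf_bad - cdf_good))

def somers_d(score, bad):
    """Somers' D = 2*AUC - 1, with AUC from the Mann-Whitney rank formula
    (assumes continuous scores, i.e. no ties)."""
    ranks = np.empty(score.size)
    ranks[np.argsort(score)] = np.arange(1, score.size + 1)
    n_good = int((bad == 0).sum())
    n_bad = int((bad == 1).sum())
    auc = (ranks[bad == 0].sum() - n_good * (n_good + 1) / 2) / (n_good * n_bad)
    return 2 * auc - 1

# Hypothetical scores where bads tend to score lower than goods.
rng = np.random.default_rng(1)
bad = (rng.random(20_000) < 0.10).astype(int)
score = rng.normal(size=20_000) - 0.8 * bad
print(ks_stat(score, bad), somers_d(score, bad))
```

The K-S statistic depends only on the single largest CDF gap, while Somers' D aggregates rank information across the whole distribution, which is why the two can react differently to changes in the score.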
The results are confirmed in the second panel of table 6, where we show the Somers' D discriminatory power statistic as an alternative to the K-S. Unlike the K-S statistic,
12
TTD2003_TTD (_BK) refers to applying the scoring model developed on the 2003 TTD sample to the
TTD (BK) data. BK2003_TTD (_BK) refers to applying the scoring model developed on the 2003 booked
accounts only to the TTD (BK) data.
Somers' D captures the impact across the full score distribution and would reflect any change along it.
Forecast accuracy
Unlike the impact on the scoring model's discriminatory power, accounting for reject inference appears to clearly affect forecast accuracy. The top panel of table 7 shows the Hosmer-Lemeshow (H-L) statistic for the same combinations of TTD2003 and BK2003 scores applied to the TTD and BK samples in-sample and out-of-time. The results show that the forecast accuracy deteriorates relatively quickly on the out-of-time sample, which is consistent with industry and academic findings. On the TTD sample the TTD2003 score performs much better than the BK2003. Given the asymptotic distribution of a chi-square with 10 degrees of freedom, the bottom panel provides the probability values for each of the calculated H-L statistics, which are very low.14 These probability values are expected given the large number of observations and the fact that the score does not account for all the risk drivers and thus cannot fully capture the delinquency risk. In the industry, the score odds are often further calibrated to historical delinquency rates in order to achieve better forecast accuracy. For our study, we need to compare the relative accuracy of the scores with and without reject inference, which is independent of any such calibration.
13
Note that Somers' D is equivalent to the Accuracy Ratio (AR) of the Cumulative Accuracy Profile (CAP) curve and is also related to the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve (Somers' D = AR = 2AUC - 1). The industry often relies on monitoring the change in log odds, which similarly involves the full distribution and is thus reflected in the Somers' D measure.
14
Note that in-sample the mean and variance are estimated from the data, which implies that the appropriate chi-square distribution for the H-L statistic as a goodness-of-fit test has 2 degrees of freedom less, i.e., χ2(8), as shown in Hosmer and Lemeshow (1980). Out-of-time, however, the test statistic calculated for the 10 deciles is asymptotically distributed as the sum of 10 squared standard normal variables, which is a χ2(10) distribution (see Hosmer and Lemeshow 2000).
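For reference, the H-L statistic used throughout can be sketched as follows (our implementation and simulated data, not the paper's): observations are grouped into score deciles, and squared observed-minus-expected gaps are summed, each scaled by the decile's binomial variance.

```python
import numpy as np

def hosmer_lemeshow(p_hat, bad, n_groups=10):
    """H-L goodness-of-fit: sum over score deciles of
    (observed - expected)^2 / (n * p_bar * (1 - p_bar)),
    compared to chi-square with 10 d.f. out-of-time (8 d.f. in-sample)."""
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n = idx.size
        p_bar = p_hat[idx].mean()
        observed = bad[idx].sum()
        expected = n * p_bar
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return stat

# Hypothetical illustration: a well-calibrated score versus one that
# systematically understates risk.
rng = np.random.default_rng(2)
p = rng.uniform(0.02, 0.30, size=20_000)
bad = (rng.random(20_000) < p).astype(int)
print(hosmer_lemeshow(p, bad))        # moderate, consistent with calibration
print(hosmer_lemeshow(0.5 * p, bad))  # much larger: systematic underestimation
```

As in the paper's tables, large samples make the statistic very sensitive, so its relative level across models is more informative than any single p-value.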
The H-L test statistic treats under- and over-estimation equally and does not indicate which scores are less accurate. To get a better sense of the forecast accuracy issue, we summarize model performance in a more granular way in table 8. For each score decile, the corresponding probability values are given, as well as the expected minus actual delinquency rates. These results show that the likelihood of becoming seriously delinquent is underestimated in the more risky deciles if sample selection bias is ignored.15 With time, the underestimation spreads to all deciles and to both scores.
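The decile-level statistic behind table 8 (described in footnote 15) can be sketched as follows; the variable names and data are ours, not the paper's. Before squaring, each decile's contribution to the H-L statistic is asymptotically standard normal, giving a signed test of under- versus over-estimation per decile:

```python
import math
import numpy as np

def decile_z(p_hat, bad, n_groups=10):
    """Signed (observed - expected) / sqrt(n * p_bar * (1 - p_bar)) per
    score decile; positive z means delinquency is underestimated there."""
    order = np.argsort(p_hat)
    rows = []
    for idx in np.array_split(order, n_groups):
        n = idx.size
        p_bar = p_hat[idx].mean()
        z = (bad[idx].sum() - n * p_bar) / math.sqrt(n * p_bar * (1 - p_bar))
        p_val = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        rows.append((z, p_val))
    return rows

# Hypothetical: a score that understates risk by 20% shows positive z,
# most strongly in the riskiest deciles.
rng = np.random.default_rng(3)
p = rng.uniform(0.02, 0.30, size=12_000)
bad = (rng.random(12_000) < p).astype(int)
for z, p_val in decile_z(0.8 * p, bad):
    print(f"z = {z:6.2f}  p-value = {p_val:.4f}")
```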
We also find that when the scores are tested on the BK subset instead of the full TTD pool, the accuracy appears better. In particular, BK2003_BK exhibits the lowest H-L, and TTD2003_BK exhibits H-L levels lower than BK2003_TTD. Furthermore, table 8 shows that using the booked accounts (BK) for validation of a score that accounts for reject inference can lead to overestimation of the delinquency rates, especially in the high risk deciles, and that may be the case for a few years post redevelopment (TTD2003_BK). In this case, which is common practice in the industry, the deterioration of score performance will not be evident right away and the validation results could be misleading.
The results are similar across segments and across time. Table 9 shows the H-L test for all scoring models (1999-2007) using both in-sample and out-of-time samples.
Comparing the top two panels, for which the validation sample is the TTD, the second panel, which shows the scoring models built only on booked accounts (BK_TTD), has much higher H-L values than the first (TTD_TTD), i.e., lower accuracy. The bottom two
15
For each decile, the statistic evaluated is the decile's contribution to the H-L test calculation, which has an asymptotically Normal distribution.
panels confirm our conclusions about the misleading performance when the scores are evaluated only on booked accounts.
Unlike in our sample where acceptance is exogenous and does not depend on our
score value, banks’ acceptance rules based on an internal score that underestimates
delinquency risk will lead to a definite deterioration in the credit quality of booked
accounts. Our evidence strongly suggests that accounting for reject inference as part of the model-development process is important when the scores are used for defining acquisition strategies.
4.4 Robustness
In this section we discuss further the robustness of our main finding that reject
inference impacts the forecast accuracy of scores. The unbalanced panel design results in
a set of residual accounts that, by design, are not included in the development data; these include accounts that were unscoreable at the time of development but have since become scoreable, and also the small percentage of new accounts that are added each year. This
sample structure allows us to test the forecast accuracy on individuals that have not been
part of the model-development sample. We report the results for the H-L test across
scoring models and validation years in table 10, using the format from table 9. Note that this validation sample is much smaller, averaging fewer than 4,000 observations compared to roughly 100 thousand in the full sample, so the H-L test statistics are lower than in table 9. The first panel, as in table 9, has for the most part lower values than the
second panel. Another caveat for this validation sample is that it consists of a relatively
large number of unscoreable individuals, for which the model score is expected not to
perform as well. Although these results are not as clear, they confirm our earlier findings that the TTD scores are relatively more accurate than the BK scores. Similarly, the bottom two panels show that validating only on booked accounts attributes more forecast accuracy to the scores than is warranted.
We also check the robustness of the results with respect to the development of the score by varying the definition of bad performance (60 rather than 90 days delinquency) and the performance horizon (6-12 months instead of 12-18), and by re-estimating the models without segmentation, without the credit bureau score, and with a smaller set of explanatory variables. We also vary the development sample selection period to be the first two quarters of the year rather than the last two, in case there is any seasonality. However, our credit bureau attribute data are only as of June 30th, which means that the attributes we use for score building already reflect any inquiries and newly opened credit cards of the individuals in the TTD sample. As expected, in that case the scoring model has better performance, especially in terms of discriminatory power. The impact of reject inference, although smaller, appears again in these alternative specifications.
5. Conclusions
The sample selection problem created by developing scoring models using only accepted individuals while ignoring the rejected ones has been well studied in the literature. Many techniques for inferring the behavior of rejected applicants have been proposed, but these are not typically used in the industry. In this paper, we use a nationally representative
sample of credit bureau data to evaluate the problem of sample selection in credit card
acquisition score development. We evaluate the credit bureau supplemental data method
used by banks for reject inference by examining its impact on score performance.
We find that reject inference has little impact on the discriminatory power of the score, but basing the score only on booked accounts leads to underestimation of delinquency risk, especially in the high risk deciles. Furthermore, ignoring rejected individuals when validating the score leads to underestimating the deterioration in score performance over time. These results suggest that although the data augmentation method is not a perfect solution to this sample selection issue, it can significantly improve the forecast accuracy of the resulting scores.
References
Banasik, J., J. Crook, and L. C. Thomas (2003). “Sample selection bias in credit
scoring models”, Journal of the Operational Research Society 54, pp 822–832.
Crook, J. and J. Banasik (2004). “Does reject inference really improve the performance
of application scoring models?” Journal of Banking and Finance, 28, pp 857-874.
Feelders, A.J. (2000). "Credit scoring and reject inference with mixture models", International Journal of Intelligent Systems in Accounting, Finance and Management, 8 (4), pp 271-279.
Glennon, D. (1999). “Evaluating Credit Scoring Models: Theory and Practice”, OCC
Working Paper.
Glennon, D., C.E. Larson, N. Kiefer, and H. Choi (2008). "Development and Validation of Credit-Scoring Models", Journal of Credit Risk, 4, pp 1-61.
Greene, W. (1998). “Sample selection in credit-scoring models”, Japan and the World
Economy, 10, pp 299-316.
Greene, W. (2007). “A Statistical Model for Credit Scoring,” in Credit Risk: Quantitative
Methods and Analysis, Hensher, D. and S. Jones, eds., Cambridge University Press.
Hand D. J. (2001a). “Reject inference in credit operations: theory and methods” in The
Handbook of Credit Scoring, Glenlake Publishing Company, pp 225-240.
Hand D.J. and W.E. Henley (1994). “Can reject inference ever work?”, IMA Journal of
Mathematics Applied in Business and Industry, 5 (1), pp 45-55.
Hand, D.J. and W.E. Henley (1997). “Statistical classification methods in consumer
credit scoring: a review”, Journal of the Royal Statistical Society A, 160, pp 523-541.
Hosmer, D. W. and S. Lemeshow (1980). "A goodness-of-fit test for the multiple logistic regression", Communications in Statistics, A10, pp 1043-1069.
Hosmer, D. W. and S. Lemeshow (2000). Applied Logistic Regression -2nd ed. Wiley,
New York.
Joanes D.N. (1993). “Reject inference applied to logistic regression for credit scoring”,
IMA Journal of Mathematics Applied in Business and Industry, 5 (1), pp 35-43.
Little, R.J.A. and D.B. Rubin (1987). Statistical Analysis with Missing Data. New York:
John Wiley.
Parnitzke, T. (2005). "Credit Scoring and the sample selection bias", Institute of Insurance Economics, Working Paper.
Wu, I. and D. Hand (2007). "Handling selection bias when choosing actions in retail credit applications", European Journal of Operational Research, 183, pp 1560-1568.
Table 1. Average mean and standard deviation by type of variable following Fair Isaac
categories across the 9 samples used for developing and validating the score (1999-2007).
Invalid extreme values are set to missing and they represent less than half of one percent
of the TTD sample.
Credit amount (columns: BK | RI | RI* | RNI; MEAN STD within each group)
# OF OPEN AUTO LOAN TRADES BAL > 0: 0.4 0.7 | 0.4 0.7 | 0.5 0.7 | 0.4 0.6
AGG BAL FOR OPEN AUTO LOAN TRADES: 5874 10783 | 5746 10490 | 6847 11833 | 4821 10075
AGG CREDIT FOR OPEN AUTO LOAN TRADES: 8605 14451 | 8264 14039 | 9774 15740 | 6921 13388
# OF BANKCARD TRADES BAL > 0: 2.0 1.8 | 2.3 2.0 | 2.4 2.2 | 2.3 2.4
AGG BAL FOR OPEN BANKCARD TRADES: 5486 10055 | 5395 10150 | 6621 12560 | 5198 11970
AGG CREDIT FOR OPEN BANKCARD TRADES: 26948 32045 | 22246 30507 | 19757 29295 | 12352 24524
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES: 28 465 | 36 623 | 41 39 | 58 2242
# OF OPEN HOME EQUITY TRADES BAL > 0: 0.2 0.4 | 0.1 0.4 | 0.1 0.4 | 0.1 0.3
AGG BAL FOR OPEN HOME EQUITY TRADES: 7436 29329 | 6092 25724 | 6926 29450 | 4038 22171
AGG CREDIT FOR OPEN HOME EQUITY TRADES: 12801 47881 | 9854 40972 | 10658 48150 | 5925 35129
# OF INST TRADES BAL > 0: 1.3 1.7 | 1.6 2.0 | 2.2 2.6 | 2.3 2.6
AGG CREDIT FOR OPEN INST TRADES: 18285 37118 | 17603 41267 | 22295 50576 | 14406 31262
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES: 41.1 238.3 | 44.0 329.5 | 50.4 73.9 | 42.4 193.7
# OF OPEN AUTO LEASE TRADES BAL > 0: 0.1 0.3 | 0.1 0.3 | 0.1 0.3 | 0.1 0.2
AGG BAL FOR OPEN AUTO LEASE TRADES: 805 3973 | 757 3961 | 950 4511 | 579 3674
AGG CREDIT FOR OPEN AUTO LEASE TRADES: 1533 6786 | 1398 6575 | 1868 7883 | 1043 5972
# OF OPEN MORTGAGE TRADES BAL > 0: 0.6 0.7 | 0.5 0.7 | 0.5 0.7 | 0.3 0.6
AGG BAL FOR OPEN MORTGAGE TRADES: 72367 126908 | 59220 116031 | 65951 138920 | 37885 100355
AGG CREDIT FOR OPEN MORTGAGE TRADES: 78231 132319 | 63569 121104 | 70270 144766 | 40584 105792
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES: 44.3 53.3 | 36.3 47.4 | 38.1 45.6 | 25.8 239.3
# OF RETAIL TRADES BAL > 0: 0.7 1.1 | 0.8 1.2 | 0.9 1.3 | 0.8 1.3
AGG CREDIT FOR OPEN RETAIL TRADES: 4327 7852 | 3319 4830 | 3067 4600 | 1857 3551
AGG BAL TO CREDIT RATIO FOR OPEN RETAIL TRADES: 8.6 24.1 | 12.0 78.0 | 15.6 32.9 | 15.9 141.9
# OF REVOLVING TRADES BAL > 0: 3.0 2.6 | 3.2 2.8 | 3.6 3.1 | 3.2 3.2
AGG CREDIT FOR OPEN REVOLVING TRADES: 31929 32478 | 26172 32407 | 23282 30263 | 14480 25332
# OF OPEN BANKCARD TRADES: 3.7 3.0 | 3.5 3.3 | 3.3 3.1 | 2.3 2.8
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES: 24.5 117.9 | 32.4 76.3 | 39.4 37.7 | 41.0 128.7

Credit type (columns: BK | RI | RI* | RNI; MEAN STD within each group)
# OF TRADES: 14.1 7.9 | 14.6 8.3 | 15.7 8.5 | 13.1 7.8
# OF AUTO LOAN TRADES: 0.8 1.1 | 0.8 1.1 | 1.0 1.3 | 0.8 1.1
# OF AUTO LOAN OPENED W/I 12 MOS: 0.2 0.6 | 0.3 0.7 | 0.3 0.7 | 0.3 0.7
# OF BANKCARD TRADES: 5.3 4.0 | 5.6 4.5 | 5.4 4.2 | 4.6 4.1
# OF INST TRADES: 3.5 3.2 | 4.4 3.7 | 5.7 4.5 | 5.5 4.4
# OF AUTO LEASE TRADES: 0.1 0.4 | 0.1 0.4 | 0.1 0.5 | 0.1 0.4
# OF MORTGAGE TRADES: 1.4 1.8 | 1.3 1.9 | 1.5 2.1 | 1.0 1.7
# OF RETAIL TRADES: 3.4 3.1 | 3.2 3.0 | 3.2 3.1 | 2.4 2.7
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS: 2.5 2.4 | 2.1 2.4 | 2.1 2.4 | 1.4 2.0
# OF REVOLVING TRADES: 8.8 5.9 | 8.5 6.3 | 8.1 6.1 | 6.3 5.7
Table 1. (cont.) Remaining panels by group (BK, RI, RI*, RNI; MEAN and STD): Length of history, New credit, Payment, and Credit score. Only the Credit score panel values are recoverable, with the RNI column truncated:
VALID BUREAU SCORE: 765 93 | 709 110 | 678 112 | (RNI truncated)
UNSCOREABLE: 0.4% 6.1% | 0.6% 7.1% | 0.9% 8.9% | (RNI truncated)
Table 2. Bureau attributes exhibiting the largest distributional difference between booked
BK and the different rejected groups based on level of inference (RI, RI*, RNI), as
measured by the K-S statistic for the 2003 development sample. Shown are the variables
with significant difference for at least one of the pairs. The shading corresponds to the
level of K-S with darker indicating higher K-S (10-20, 20-30, 30-100).
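The two-sample K-S distance used in table 2 to compare booked and rejected distributions can be sketched as follows (hypothetical attribute draws, not the paper's data; the 0-100 scaling matches the table's shading bands):

```python
import numpy as np

def ks_two_sample(x, y):
    """Two-sample K-S distance: max gap between the empirical CDFs of x
    and y, scaled to 0-100 as in table 2."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return 100 * np.max(np.abs(cdf_x - cdf_y))

# Hypothetical aggregate-credit attribute for booked vs. rejected groups.
rng = np.random.default_rng(4)
bk = rng.lognormal(mean=10.0, sigma=1.0, size=4000)
rni = rng.lognormal(mean=9.4, sigma=1.1, size=4000)
print(f"K-S = {ks_two_sample(bk, rni):.1f}")
```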
Table 3. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the CLEAN segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES 7 6
# OF REVOLVING TRADES BAL > 0 6 6
AVERAGE AGE OF TRADES 5 8
AGG CREDIT FOR OPEN MORTGAGE TRADES 5 7
AGE OF OLDEST BANKCARD TRADE 4 7
# OF TRADES 30-180 DPD W/I 12 MOS 4 5
# OF INQUIRIES W/I 6 MOS 4 5
AGE OF OLDEST HOME EQUITY TRADE 3 4
# OF INQUIRIES W/I 12 MOS 3 4
AGE OF OLDEST TRADE 3 2
# OF MORTGAGE OPENED W/I 24 MOS 3 2
AGG BAL FOR OPEN FINANCE TRADES 2 4
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ MINOR 2 2
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 2 2
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 2 2
# OF BANKCARD TRADES 2 0
# OF INST TRADES 2 0
AGG BAL FOR OPEN HOME EQUITY TRADES 1 3
# OF ALL PUBLIC RECORD INCLUDING TRADELINE BANKRUPTCIES 1 3
AGG BAL FOR OPEN AUTO LOAN TRADES 1 2
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES 1 2
AGG BAL FOR OPEN MORTGAGE TRADES 1 2
# OF OPEN HOME EQUITY TRADES BAL > 0 1 1
# OF OPEN MORTGAGE TRADES BAL > 0 1 1
# OF RETAIL OPENED W/I 12 MOS 1 1
# OF REVOLVING TRADES 1 1
# OF CLOSED TRADES W/I 6 MOS 1 0
# OF BANKCARD TRADES BAL > 0 1 0
AGG BAL FOR OPEN INST TRADES 1 0
AGE OF OLDEST AUTO LEASE TRADE 1 0
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES 1 0
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 1 0
AGG BAL FOR OPEN BANKCARD TRADES 0 6
AGG CREDIT FOR OPEN BANKCARD TRADES 0 5
AGE OF OLDEST RETAIL TRADE 0 3
# OF TRADES 0 2
AGG CREDIT FOR OPEN AUTO LOAN TRADES 0 2
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 0 2
AGG CREDIT FOR OPEN HOME EQUITY TRADES 0 2
# OF AUTO LEASE TRADES 0 2
AGG CREDIT FOR OPEN AUTO LEASE TRADES 0 2
# OF MORTGAGE TRADES 0 2
# OF OPEN AUTO LOAN TRADES BAL > 0 0 1
# OF OPEN BANKCARD TRADES 0 1
AGG BAL FOR OPEN AUTO LEASE TRADES 0 1
AGE OF OLDEST MORTGAGE TRADE 0 1
# OF RETAIL TRADES BAL > 0 0 1
# OF RETAIL OPENED W/I 24 MOS 0 1
AGG CREDIT FOR OPEN RETAIL TRADES 0 1
AGG CREDIT FOR OPEN REVOLVING TRADES 0 1
Table 4. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the DIRTY segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGE OF OLDEST BANKCARD TRADE 6 9
# OF TRADES MAJOR DEROG 6 9
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ MINOR 6 5
AGG CREDIT FOR OPEN REVOLVING TRADES 6 2
AVERAGE AGE OF TRADES 5 8
# OF INQUIRIES W/I 12 MOS 5 7
# OF BANKRUPTCIES 5 7
# OF ALL PUBLIC RECORD INCLUDING TRADELINE 5 5
# OF BANKCARD OPENED W/I 24 MOS W/ MAJOR 4 7
# OF REVOLVING TRADES 4 7
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING TRADES 4 1
AGG BAL FOR OPEN BANKCARD TRADES 3 9
# OF TRADES 30-180 DPD W/I 12 MOS 3 9
AGG BAL FOR OPEN MORTGAGE TRADES 3 1
# OF REVOLVING TRADES BAL > 0 3 1
AGG CREDIT FOR OPEN MORTGAGE TRADES 2 8
AGE OF OLDEST HOME EQUITY TRADE 2 6
# OF TRADES CURRENTLY 30 DPD BAL > 0 2 3
# OF INST TRADES BAL > 0 2 3
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE TRADES 2 3
AGE OF OLDEST TRADE 2 2
# OF TRADES OPENED W/I 24 MOS W/ MAJOR DELINQ/DEROG 2 2
AGG CREDIT FOR OPEN INST TRADES 2 2
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 2 2
# OF TRADES 2 1
AGG CREDIT FOR OPEN BANKCARD TRADES 1 7
# OF INQUIRIES W/I 6 MOS 1 4
AGG BAL FOR OPEN FINANCE TRADES 1 4
# OF INST TRADES 1 3
# OF OPEN BANKCARD TRADES 1 2
# OF TRADES CURRENTLY 60 DPD BAL > 0 1 2
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 1 2
# OF BANKCARD TRADES - 60 DPD W/I 12 MOS 1 2
AGG CREDIT FOR OPEN AUTO LEASE TRADES 1 2
# OF OPEN MORTGAGE TRADES BAL > 0 1 2
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 1 2
# OF BANKCARD TRADES - 90 DPD W/I 12 MOS 1 1
AGG BAL FOR OPEN INST TRADES 1 1
# OF RETAIL TRADES 1 1
# OF RETAIL TRADES BAL > 0 1 1
AGG BAL FOR OPEN AUTO LOAN TRADES 1 0
# OF MORTGAGE - SEVERE DELINQUENCY INCLUDES 1 0
# OF OPEN HOME EQUITY TRADES BAL > 0 1 0
AGG BAL FOR MAJOR DEROG 1 0
# OF MORTGAGE OPENED W/I 24 MOS 1 0
AGE OF OLDEST MORTGAGE TRADE 1 0
AGE OF OLDEST RETAIL TRADE 1 0
# OF BANKCARD TRADES BAL > 0 0 4
# OF TRADES 60-180 DPD W/I 12 MOS 0 3
# OF CLOSED TRADES W/I 6 MOS 0 2
# OF TRADES 90-180 DPD W/I 12 MOS 0 2
# OF AUTO LEASE TRADES 0 2
AGE OF OLDEST AUTO LEASE TRADE 0 2
# OF MORTGAGE TRADES 0 2
AGE OF OLDEST AUTO LOAN TRADE 0 1
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 0 1
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD TRADES 0 1
# OF COLLECTION TRADES W/I 24 MOS 0 1
AGG BAL FOR OPEN HOME EQUITY TRADES 0 1
# OF RETAIL OPENED W/I 24 MOS 0 1
Table 5. Number of times a variable is selected by the stepwise regression used for
developing the acquisition scores for the THIN segment 1999-2007.
VARIABLE BK TTD
VALID BUREAU SCORE 9 9
AGG CREDIT FOR OPEN REVOLVING TRADES 8 7
# OF TRADES 30-180 DPD W/I 12 MOS 4 8
# OF TRADES MAJOR DEROG 4 5
# OF INQUIRIES W/I 6 MOS 4 4
AGG BAL TO CREDIT RATIO FOR OPEN REVOLVING 4 4
AGG BAL TO CREDIT RATIO FOR OPEN INST TRADES 4 3
AGE OF OLDEST BANKCARD TRADE 3 6
# OF INST TRADES 3 3
# OF REVOLVING TRADES BAL > 0 3 2
AGG BAL TO CREDIT RATIO FOR OPEN BANKCARD 3 1
# OF INST TRADES BAL > 0 2 4
# OF INQUIRIES W/I 12 MOS 2 4
AGE OF OLDEST TRADE 2 3
# OF MORTGAGE TRADES 2 2
# OF REVOLVING TRADES 2 1
# OF TRADES CURRENTLY 60 DPD BAL > 0 2 0
# OF BANKCARD TRADES - 30 DPD W/I 12 MOS 2 0
AGG CREDIT FOR OPEN MORTGAGE TRADES 1 4
# OF TRADES 1 2
AVERAGE AGE OF TRADES 1 1
# OF BANKCARD TRADES 1 1
# OF BANKCARD OPENED W/I 24 MOS W/ MAJOR 1 1
# OF COLLECTION TRADES W/I 24 MOS 1 1
# OF BANKCARD TRADES - 60 DPD W/I 12 MOS 1 1
# OF BANKCARD TRADES - 90 DPD W/I 12 MOS 1 1
# OF MORTGAGE - SEVERE DELINQUENCY INCLUDES 1 1
# OF OPEN MORTGAGE TRADES BAL > 0 1 1
# OF BANKRUPTCIES 1 1
AGE OF OLDEST RETAIL TRADE 1 1
AGG BAL FOR OPEN AUTO LOAN TRADES 1 0
# OF OPEN BANKCARD TRADES BAL DATE W/I 24 MOS 1 0
# OF TRADES 60-180 DPD W/I 12 MOS 1 0
AGE OF OLDEST AUTO LEASE TRADE 1 0
AGG BAL FOR OPEN BANKCARD TRADES 0 5
AGG CREDIT FOR OPEN BANKCARD TRADES 0 3
# OF RETAIL TRADES BAL > 0 0 3
# OF TRADES OPENED W/I 24 MOS W/ MAJOR 0 2
# OF BANKCARD TRADES BAL > 0 0 2
# OF OPEN BANKCARD TRADES 0 2
# OF AUTO LEASE TRADES 0 2
# OF OPEN RETAIL TRADES BAL DATE W/I 12 MOS 0 2
# OF TRADES OPENED W/I 24 MOS FOR CURRENT W/ 0 1
# OF AUTO LOAN TRADES 0 1
AGG CREDIT FOR OPEN AUTO LOAN TRADES 0 1
# OF TRADES CURRENTLY 30 DPD BAL > 0 0 1
# OF OPEN HOME EQUITY TRADES BAL > 0 0 1
AGG BAL FOR OPEN INST TRADES 0 1
AGG CREDIT FOR OPEN INST TRADES 0 1
AGG BAL FOR OPEN AUTO LEASE TRADES 0 1
AGG BAL FOR OPEN MORTGAGE TRADES 0 1
AGG BAL TO CREDIT RATIO FOR OPEN MORTGAGE 0 1
# OF ALL PUBLIC RECORD INCLUDING TRADELINE 0 1
AGG BAL TO CREDIT RATIO FOR OPEN RETAIL TRADES 0 1
Table 6. Discriminatory power results for the scores built on just booked and all through-
the-door accounts based on the 2003 data TTD2003 and BK2003 tested in sample and
out-of-time
K-S statistic (rows for the dirty and thin segments did not survive extraction)
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.59         0.58        0.59        0.59
all      2004             0.59         0.57        0.56        0.56
all      2005             0.58         0.56        0.56        0.55
all      2006             0.56         0.55        0.57        0.56
all      2007             0.52         0.51        0.52        0.51
clean    2003             0.59         0.57        0.57        0.57
clean    2004             0.56         0.55        0.51        0.51
clean    2005             0.55         0.54        0.51        0.51

Somers' D
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.74         0.73        0.74        0.75
all      2004             0.74         0.72        0.72        0.71
all      2005             0.72         0.71        0.71        0.70
all      2006             0.71         0.69        0.72        0.71
all      2007             0.67         0.66        0.67        0.65
clean    2003             0.71         0.71        0.69        0.70
clean    2004             0.70         0.68        0.64        0.64
Table 7. Score forecast accuracy based on the H-L test statistic for the scores built on just booked and all through-the-door accounts based on the 2003 data (TTD2003 and BK2003), tested in-sample and out-of-time.

H-L statistic
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             23.3         816.3       193.5       17.8
all      2004             95.1         1799.1      178.1       23.6
all      2005             26.3         1020.8      208.3       48.7
all      2006             318.3        2319.9      111.8       278.6
all      2007             1712.3       5829.6      419.4       1411.5
clean    2003             22           149.1       55.3        11.6
clean    2004             32           241.2       25.2        16.5
clean    2005             24.4         162.2       19.3        42.3

p-value, χ2(10 d.f.)
segment  validation year  TTD2003_TTD  BK2003_TTD  TTD2003_BK  BK2003_BK
all      2003             0.010        0.000       0.000       0.058
all      2004             0.000        0.000       0.000       0.009
all      2005             0.003        0.000       0.000       0.000
all      2006             0.000        0.000       0.000       0.000
all      2007             0.000        0.000       0.000       0.000
Table 8. Score forecast accuracy: Normal test by deciles for the scores built on just booked and all through-the-door accounts based on the 2003 data (TTD2003 and BK2003), tested in-sample and out-of-time. Provided are results for the full sample showing the p-value and expected less observed (exp-obs) number of BADs.
Table 9. The H-L test across all scores and validation years for the scores built on just booked (BK) and all through-the-door (TTD) individuals, with values below the χ2(10) critical value shaded. The number of observations is around 100K for all cohorts.

TTD_TTD
validation year  score 1999  2000  2001  2002  2003  2004  2005  2006  2007
1999  54 . . . . . . . .
2000  227 37 . . . . . . .
2001  139 40 27 . . . . . .
2002  172 411 393 14 . . . . .
2003  218 438 414 32 23 . . . .
2004  119 211 191 65 95 28 . . .
2005  325 561 547 38 26 103 19 . .
2006  589 378 455 215 318 184 284 12 .
2007  2190 978 1248 1370 1712 1299 1469 296 31

BK_TTD (remaining rows lost in extraction)
1999  388 . . . . . . . .

TTD_BK (remaining rows lost in extraction)
1999  143 . . . . . . . .

BK_BK
1999  25 . . . . . . . .
2000  32 16 . . . . . . .
2001  15 38 16 . . . . . .
2002  51 88 38 7 . . . . .
2003  146 188 130 47 18 . . . .
2004  157 179 117 58 24 7 . . .
2005  228 265 214 97 49 19 15 . .
2006  225 147 196 182 279 172 109 13 .
2007  788 314 768 741 1412 1012 917 205 15
Table 10. The H-L test across the booked (BK) and through-the-door (TTD) scores and validation years, applied only to the individuals that have not been part of the model-development sample, using the format of table 9 (panels in the order TTD_TTD, BK_TTD, TTD_BK, BK_BK; score columns 2000-2007). This validation sample averages fewer than 4,000 observations. Note also that this subset of the TTD population has a large portion of unscoreable individuals, which also affects the accuracy of the results.

TTD_TTD
2001  51 48 . . . . . .
2002  75 63 85 . . . . .
2003  63 70 88 72 . . . .
2004  79 88 96 80 42 . . .
2005  71 74 84 73 51 58 . .
2006  78 76 83 61 55 69 35 .
2007  93 95 94 93 95 66 43 50

BK_TTD
2001  146 94 . . . . . .
2002  94 53 69 . . . . .
2003  56 35 53 63 . . . .
2004  99 73 83 108 76 . . .
2005  77 59 72 99 76 74 . .
2006  102 86 111 134 90 59 48 .
2007  124 78 137 140 166 97 79 65

TTD_BK
2001  42 56 . . . . . .
2002  39 41 48 . . . . .
2003  36 43 54 42 . . . .
2004  68 77 79 65 43 . . .
2005  77 69 87 71 60 73 . .
2006  83 65 65 55 42 52 40 .
2007  87 67 75 60 69 66 33 37

BK_BK
2001  30 23 . . . . . .
2002  36 20 45 . . . . .
2003  28 23 23 37 . . . .
2004  62 79 44 57 29 . . .
2005  83 40 59 79 35 40 . .
2006  70 49 64 53 38 30 16 .
2007  60 53 87 65 92 40 23 22
Figure 1. Illustration of the consumer credit database CCDB and the timing of the TTD
sample selection and performance horizon. The vertical bars represent the cross sectional
bureau data in the CCDB at each June 30th snapshot of both scoreable and unscoreable
individuals. The difference in shading indicates the mixture of individuals that make the
unbalanced panel form of the CCDB. Those that remain or become scoreable are part of
the sample in the following year and in each year, new unscoreable and scoreable
individuals are added. The TTD sample is taken from the full cross sectional snapshot
and can include both scoreable and unscoreable individuals as well as both individuals
that have been in the sample in the previous year or are new to the panel.
[Diagram: for each June 30th snapshot (2002, 2003, 2004), the TTD sample is drawn from the scoreable and unscoreable cross section and followed by proxy performance accounts over the subsequent quarters.]
Figure 2. Annual sample distribution for the type of applicant
[Stacked-bar chart, 1999-2007: shares of the TTD sample by applicant type (BK, RI, RI*, RNI), with sample counts (0-200) on the secondary axis.]
[A second chart, whose caption was lost in extraction (apparently Figure 3): Percent Bad (90+ DPD), 0-12%, by year 1999-2007 for the BK, RI, and RI* groups.]
Figure 4. Distribution of the credit bureau score for the booked and the three groups of
rejected individuals based on level of inference. Extreme invalid credit score values are
set to missing.
Figure 5. Distribution of booked and rejected applicants by segment
[Three stacked-bar panels, 1999-2007, each showing the distribution of a segment's TTD sample across BK, RI, RI*, and RNI; only the "Dirty segment TTD" panel title survived extraction.]
Figure 6. Segment distribution across the development and validation samples 1999-
2008. The first panel shows the full sample in thousands and the second only new
individuals used for out-of-sample out-of-time validation.
[First panel: stacked bars, 1999-2007, full sample in thousands (0-100) split into clean, dirty, and thin. Second panel: stacked bars, 2000-2007, new individuals used for out-of-sample out-of-time validation (0-4000) split into clean, dirty, and thin.]