
**Evaluation of rating systems**

Andreas Oelerich a,*, Thorsten Poddig b,1

a HSH Nordbank AG, Gerhart-Hauptmann Platz 50, 20095 Hamburg, Germany
b University of Bremen, Chair of Finance, Hochschulring 4, 28359 Bremen, Germany

Abstract

Since the publication of the initial consultative proposal for a new Basel capital accord in June 1999 (and the latest proposal from summer 2004), the influence of the proposed changes on bank management has been discussed intensively. In particular, the ability to forecast insolvencies is one of the most relevant questions in many empirical studies. In this paper, we present an evaluation methodology for quantitative rating systems. As an example, we use the well-known logistic regression model to demonstrate the proposed evaluation methodology, and we discuss the results obtained in detail. Any other method (statistical or artificial intelligence methods, e.g. neural networks, fuzzy logic) can be evaluated in the same manner. As a side effect, the proposed approach might lead to improved forecasting results. © 2005 Elsevier Ltd. All rights reserved.

1. Introduction

Rating grades are used to individually symbolise and quantify the economic risks of a company. Most popular are the symbols of the leading rating agencies such as Moody's and Standard & Poor's (e.g. Aaa and AAA, respectively). As of the beginning of 2007, the broad use of rating grades is recommended by the BIS (Bank for International Settlements, 2003). According to the recommendation of the BIS, national supervisory authorities will reflect these proposals in national supervisory procedures or standards. Rating grades are an important input parameter to portfolio credit risk models because credit ratings are mapped to probabilities of default (Carey and Hrycay, 2001, p. 197). In order to obtain quantitatively based rating grades, statistical models are applied. They provide a simple score for a company, distinguishing between solvent and insolvent companies (e.g. Foreman, 2003). In the next step, these scores are mapped to more refined, discrete rating grades. Unlike the historical information about the state of solvency of a company, its true rating grade is always unknown and unobservable. Therefore, it is not possible to compare forecasted ratings with true ratings.

* Corresponding author. Tel.: +49 40 3333 11986; fax: +49 40 3333 611986. E-mail address: andreas.oelerich@hsh-nordbank.com (A. Oelerich). 1 Tel.: +49 421 218 7548; fax: +49 421 218 4838.

In order to evaluate the accuracy of forecasted rating grades, a Monte-Carlo simulation approach is presented in this paper. We briefly discuss the theoretical background in Section 2. In Section 3, we shortly describe the logistic regression and the bootstrap method. Section 4 introduces our evaluation system based on Monte-Carlo simulation and artificial data sets. The results are presented in Section 5 with subsequent concluding remarks.

2. Theoretical background

The proposal of a new Basel capital agreement contains three fundamental innovations (pillars). The first pillar includes the integration of operational risk in the calculation of the capital ratio and the use of rating systems to individually quantify the economic risk of a borrower. The minimum capital requirement retains both the current definition of the total capital and the minimum requirement of at least 8% of the bank's equity to risk-weighted assets. Concerning the second pillar, a supervisory review process has to validate the bank's capital allocation techniques and its compliance with relevant standards. The third pillar's aim is to strengthen market discipline through enhanced disclosure requirements for banks. This paper deals with the first pillar, especially with the generation and use of internal or external rating grades. Banks have to calculate the regulatory capital necessary for each borrower individually. The capital requirement depends on the individual economic risk of the borrower. This economic risk is represented by an internal or external credit rating grade. Banks could or should use an internal rating system based on

0957-4174/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2005.10.004


A. Oelerich, T. Poddig / Expert Systems with Applications 30 (2006) 437–447

historical data in order to optimise their risk management. In principle, there are three techniques available to forecast a rating grade: a qualitative, a quantitative and a combined approach. Moody's and Standard & Poor's mainly use qualitative approaches, which means that they rely on the know-how of credit risk experts instead of statistical methods. Yet the qualitative approach has crucial disadvantages: it is very time-consuming and expensive. Therefore, such an approach cannot be applied to rate the large number of small and medium-sized enterprises. Banks are interested in an automated process to forecast rating grades. Some banks use a statistical approach to calculate a rating grade from historical accounting data only. Other banks apply a combined approach: they use statistical methods to obtain a quantitatively based rating grade which is finally refined by experts, resulting in an improved rating grade. The major disadvantage of a purely quantitative approach is the lack of the additional judgement by credit risk experts, which might improve the forecast. Therefore, a quantitative rating grade forecast must be of high quality. Banks must be able to rely on the rating grade forecasting system, because the resulting probability of default is a crucial input parameter to credit risk portfolio models (see Carey and Hrycay, 2001, p. 197). Many statistically based methods used to generate a rating grade are only asymptotically consistent (for logistic regression see, e.g. Fahrmeir and Kaufmann, 1985, 1986). Therefore, analysts need a large database of information about insolvencies and solvencies to achieve nearly consistent forecasts of rating grades. Very often, a large database is not available. So it is questionable whether such statistical methods are unbiased, and it is unclear which effects may result. Furthermore, the whole portfolio credit risk management system of a bank is based on these probabilities of default.
The influence of a bias in the forecasting process of rating grades might be huge. Because of this small sample problem, the quality of statistical methods is a crucial question concerning modern portfolio credit risk management systems. The major objective of many empirical studies is to compare different statistical methods to classify the default risk of borrowers (see, e.g. Altman, 1968; Desai, Crook, & Overstreet, 1996; Frerichs & Wahrenburg, 2003; Huang, Che, Hsu, Chen, & Wu, 2004; Srinivasan & Kim, 1987). By using accounting data and other historical information about solvent and insolvent companies, a statistically based model enables experts to forecast whether a company will likely become insolvent or not. A possible qualitative assessment of such a model is, e.g., the proportion of correctly classified borrowers with respect to solvent vs. insolvent firms in an out-of-sample or out-of-time test (see, e.g. Altman, 1968; Anders, 1998; Desai et al., 1996; Frerichs & Wahrenburg, 2003; Huang et al., 2004; Leker & Schewe, 1998; Srinivasan & Kim, 1987). Borrowers are frequently classified with high accuracy in the cited empirical studies. Nevertheless, some borrowers must be analysed in detail in order to get an adequate classification. For example, the Deutsche Bundesbank (see Deutsche Bundesbank, 1999) decides based on

the resulting scores whether a borrower should be examined in detail or not. To minimise this effort, banks need a high-quality statistical approach. This implies that a rating system has to provide an accurate dichotomous classification, so that additional analysis of borrowers is minimised. Beside the dichotomous classification, another characteristic of rating systems is also of significant importance: the forecasted rating grade. It makes quite a difference whether a borrower gets the rating grade AAA or BBB. The individual rating grade (in conjunction with the mapping process from a rating grade to a probability of default) is the important input parameter to portfolio credit risk models. A classification system with an excellent polytomous classification rate is therefore essential. Overestimating or underestimating correct grades is expensive and introduces a bias into risk judgements and the whole risk management process. Therefore, in a first step banks need a powerful approach to forecast the dichotomous classification, because this decides about acceptance, rejection or a further detailed analysis of the borrower by credit experts. In a second step, banks require a powerful system to forecast detailed rating grades for correct capital allocation and risk management. Unfortunately, there is no possibility to observe true rating grades in a historical data sample. Only the event of insolvency is observable. If a solvent borrower gets a rating grade of, e.g., BBB, it is impossible to find out whether this (forecasted) grade will prove to be correct or not (as long as the borrower stays solvent). A validation of the forecasted rating grades using a real data sample is impossible. Nevertheless, forecasted rating grades will affect credit conditions, the credit portfolio optimisation and the bank's success. In the following, we present an evaluation system to assess the quality of statistical methods based on a given (artificial) data sample.
Based on this system, a bank is able to decide which method should be used to forecast rating grades.

3. Logistic regression and bootstrapping

In this section, we introduce the classical logistic regression and the integration of the bootstrap method. The logistic regression serves as an example in order to demonstrate the Monte-Carlo based evaluation system later on. The integration of the bootstrap method into the logistic regression is an example of a possible improvement of classical methods. The evaluation system can be applied to modern methods like neural networks or support vector machines, but the evaluation of such methods is beyond the scope of this paper. In logistic regression, a single outcome variable Y_k (where k = 1, …, N indexes the N objects) follows a Bernoulli distribution. It becomes one with probability p_k and zero with probability 1 − p_k. In the case of modelling insolvencies, the event one may describe the default of a company, and the probability p_k is then called the probability of default (shortly: PD) of the kth company. p_k varies over the different objects as described by an inverse logistic function (also called link function) of a linear predictor x_k^T β. Here, x_k contains all explanatory


Fig. 1. Bootstrap methodology. (Figure: a three-phase flow chart of the bootstrap method to predict insolvencies and rating grades — Phase 1, data management: original dataset and dataset for prediction; Phase 2, generation of resamplings and prediction of PD; Phase 3, classification, empirical distribution and empirical expected value of the predictions.)

variables with respect to the kth company and β all model parameters (including the cutpoint).

p_k = exp(x_k^T β) / (1 + exp(x_k^T β))    (1)

The model parameters are estimated by the maximum-likelihood approach (see, e.g. Agresti, 1996). The maximum-likelihood analysis works by finding the value of β that maximises the log-likelihood function. This estimate is labelled β̂. The estimated value β̂ has important characteristics: it is asymptotically consistent and asymptotically normally distributed (see, e.g. Fahrmeir & Kaufmann, 1985, 1986; for more details and discussions see, e.g. Hosmer & Lemeshow, 2000). As discussed above, banks are limited by small data samples in insolvency analysis regarding different sectors of borrowers. The asymptotic properties of the logistic regression might cause a bias in forecasting models. In some studies, the logistic regression showed a bias even in very simple models (one or two explanatory variables) when the sample size was small (see, e.g. Matin, 1994). Banks often use very complex models with various numbers of explanatory variables (see, e.g. Hayden, 2002). The number of variables is usually reduced by applying a selection algorithm that eliminates insignificant variables. Nevertheless, the sample size usually remains small in relation to model complexity. As a result, a bias in predictions may occur. This influences the entire risk management and decision process. In order to eliminate such an inaccuracy in logistic regression facing such data problems, we propose to integrate the bootstrap method into the forecasting process to stabilise the predictions. The bootstrap method was developed to generate an empirical distribution of a variable in those cases in which the derivation of an analytical distribution is impossible (see in
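The maximum-likelihood estimation described above can be sketched in a few lines of Python with NumPy. This is a minimal Newton-Raphson sketch, not the authors' implementation; the function name `fit_logistic` and the simulated data are our own illustration:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Maximum-likelihood fit of a logistic regression via Newton-Raphson.
    X: (N, p) design matrix whose first column is ones (the cutpoint),
    y: (N,) array of 0/1 outcomes."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # Eq. (1)
        w = p * (1.0 - p)                     # Fisher weights
        grad = X.T @ (y - p)                  # score vector
        hess = X.T @ (X * w[:, None])         # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Usage on simulated data with a known parameter vector:
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
true_beta = np.array([-1.5, 1.0, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
beta_hat = fit_logistic(X, y)   # close to true_beta only for large N
```

The sketch illustrates the asymptotic argument of the text: for small N, β̂ can deviate noticeably from the data-generating parameters.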

particular Efron & Tibshirani, 1993). We use this method to reduce the variability of predictions with the logistic regression model. In many studies, the bootstrap is used to quantify, e.g., a standard error or a confidence interval. The bootstrap methodology is based on the following idea: a variation of a given data set allows the estimation of an empirical distribution of a random variable. Fig. 1 shows the bootstrap process. The first phase includes all steps necessary to prepare the data set for the analysis, such as the reduction of multicollinearity and standardisation. The second phase includes the generation of the empirical distribution function. We use the bootstrap method to generate an empirical distribution of the probabilities of default p̂_k as follows. The design matrix X includes all design vectors x_k (with k = 1, …, N). These design vectors might represent real data (e.g. accounting information in an observed sample) or artificial data. To generate an empirical distribution of the forecasted probabilities of default, a number of subsets of X is needed. We take a random selection out of all N design vectors. In every selection, we choose a sample size of N, so a company may appear more than once in a subset or might not be selected at all. We generate 200 subsets and estimate, for each subset, the probability of default for all companies. Finally, an empirical distribution of probabilities of default for every company results. We use the mean of this empirical distribution for each company as the prediction of its probability of default (see Fig. 1) instead of the classical PD estimated by the simple logistic regression model (see Eq. (1)). The variation of the data set might help to reduce the errors due to the small sample size problems in classical logistic regression.
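The bootstrap procedure just described — 200 resamples of size N drawn with replacement, one refit per resample, and the per-company mean of the predicted PDs — can be sketched as follows. The names `bootstrap_pd` and `_fit_logit` are our own, the data are simulated, and the inner fit is a plain Newton-Raphson stand-in for any ML routine:

```python
import numpy as np

def _fit_logit(X, y, n_iter=25):
    # Newton-Raphson ML fit of a logistic regression (cutpoint column in X)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ (X * (p * (1.0 - p))[:, None]),
                                X.T @ (y - p))
    return beta

def bootstrap_pd(X, y, n_boot=200, seed=0):
    """Draw n_boot resamples of size N with replacement, refit the logistic
    regression on each, predict PDs for ALL companies, and return the mean
    of the empirical PD distribution per company (see Fig. 1)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    pds = np.empty((n_boot, N))
    for b in range(n_boot):
        idx = rng.integers(0, N, size=N)   # a company may repeat or be absent
        beta_b = _fit_logit(X[idx], y[idx])
        pds[b] = 1.0 / (1.0 + np.exp(-X @ beta_b))
    return pds.mean(axis=0)

# Usage on simulated data:
rng = np.random.default_rng(1)
N = 300
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([-1.0, 0.8, -0.4]))))
pd_boot = bootstrap_pd(X, y)
```

Averaging over the 200 refits replaces the single point forecast of Eq. (1) with the empirical expected value of the PD distribution.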
In order to investigate whether bootstrapping improves the classical logistic regression, we compare the bootstrap method of forecasting insolvencies and rating grades with the classical logistic regression model (without bootstrapping). We run



Fig. 2. Quality of rating methods in special and general cases. (Figure: the quality of quantitative rating systems splits into the quality of a method in special cases, assessed on real data, and the quality of a method in general cases, assessed by Monte-Carlo simulation.)

the Monte-Carlo simulation to evaluate this method as explained in Section 4.

4. Evaluation of quantitative rating systems

As outlined in the second section, the quality of quantitative rating systems is very important because the whole risk management depends on it. We distinguish two major questions when assessing such systems: the meaning of quality in empirical studies (special cases) and the meaning of quality in simulation studies (general cases) (see Fig. 2). Empirical studies mainly use quantities like hit rates (proportions of correctly classified companies) to compare different statistical methods for predicting bankruptcies (see, e.g. Altman, 1968; Desai et al., 1996; Huang et al., 2004). We regard this as quality in a special case. This assessment of different methods is based on real data. In the case of real data, the assessment of methods used to forecast rating grades is impossible, because the true rating grade is unknown and unobservable. Different methods cannot be compared in order to decide which method is the best concerning rating grades. It can be shown with an artificial data set that the best method to forecast bankruptcies is not necessarily the best method to forecast rating grades (see, e.g. Poddig & Oelerich, 2004). Banks cannot generalise the quality of bankruptcy predictions to the quality of rating grade forecasts. Artificial data allow us to develop an evaluation system. Using such data, we are able to integrate the structure of a given data set and to simulate the rating process. Such an evaluation system allows us to assess methods of bankruptcy prediction and also the quality of rating grade predictions. Moreover, based on artificial data, we are able to quantify the quality of different statistical methods. Here, all necessary information is known: information about companies, the model and the true rating grades. Based on this evaluation system, an assessment of different methods for forecasting rating grades is possible.
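The Monte-Carlo evaluation idea — generate artificial data from a known model, fit a candidate method in-sample, score it out-of-sample against the known truth, and average over many runs — can be sketched as follows. This is a minimal harness with assumed parameter values; for brevity it scores only the binary hit rate, and the trivial base-rate classifier in the usage example is our own placeholder for a real rating method:

```python
import numpy as np

def mc_evaluate(fit, predict, n_runs=100, n=250, seed=0):
    """Monte-Carlo evaluation sketch: repeatedly generate artificial data
    from a known logistic model, fit the candidate method in-sample and
    score its binary hit rate on a fresh out-of-sample half. Averaging
    over runs removes the unsystematic error; what remains reflects the
    systematic quality of the method."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([-1.4, 1.0, -0.5])   # assumed data generating model
    hit_rates = []
    for _ in range(n_runs):
        X = np.column_stack([np.ones(2 * n), rng.standard_normal((2 * n, 2))])
        p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
        y = rng.binomial(1, p_true)            # Bernoulli experiments
        model = fit(X[:n], y[:n])              # in-sample estimation
        y_hat = predict(model, X[n:]) >= 0.5   # out-of-sample classification
        hit_rates.append(np.mean(y_hat == y[n:]))
    return float(np.mean(hit_rates))

# Usage with a deliberately trivial "method" (always forecast the in-sample
# base rate) standing in for a real rating model:
hr = mc_evaluate(fit=lambda X, y: y.mean(),
                 predict=lambda model, X: np.full(X.shape[0], model))
```

Any method exposing a `fit`/`predict` pair (classical logistic regression, the bootstrap variant, neural networks) can be dropped into the same harness, which is the point of the evaluation system.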
In order to assess the quality of a predictive method, one distinguishes between two errors. First, any statistical method includes a typical statistical error, which is called the unsystematic error. The judgement of a method should not be based on this error, because a quality judgement should describe a structural

weakness, which is caused by systematic errors. Concerning systematic errors, we discuss the popular logistic regression as an example of the influence of such errors in the following. Systematic errors result from the assumptions and properties of the specific method. The logistic regression describes the relation between explanatory (or independent) variables and probabilities of default (or, in the more general case, the probability of success). This model-based relation (described by a link function) might be incorrect in reality. Additionally, the maximum-likelihood estimation is (only) asymptotically consistent and asymptotically normally distributed. For finite sample sizes, this estimation could be biased. Also, the estimated vector of model parameters and the resulting estimates of the probabilities of default are (only) asymptotically consistent and asymptotically normally distributed. Small sample sizes pose problems, especially for banks, in obtaining accurate forecasts. Tiny modifications of the data sets might change the estimated probabilities of default significantly. Monte-Carlo simulations are a tool to eliminate unsystematic errors. The remaining systematic error can be used to quantify the quality of a method. The evaluation process of the logistic regression and the bootstrap method is shown in Fig. 3. It should be noted that the evaluation of the bootstrap method described above requires the estimation of 500,000 logistic regression models, due to the Monte-Carlo simulation needed to assess the method in general and the specific bootstrapping method. The evaluation process is based on artificial data. The data set is generated with the logit link known as the link function in the logistic regression model. The data set (table, matrix) includes the known qualitative and quantitative information about all companies. In a first step, we generate the explanatory variables. The sample size is denoted by N.
A bold X describes the matrix which contains the quantitative and qualitative information of all N companies. A row of X, x_k, represents all information of the kth company, where k is an index with k = 1, …, N. In empirical studies, these vectors x_k include information like financial ratios or qualitative information such

Fig. 3. Evaluation process of statistical methods. (Figure: a three-phase flow chart of the evaluation process of logistic regression and bootstrapping — Phase 1, data management: generate artificial data and run Bernoulli experiments; Phase 2, Monte-Carlo simulation (2500 runs) of the classical logistic regression and the bootstrap method; Phase 3, empirical distributions and parameters, evaluation of predictions.)



as sector, legal form of a company or quality of management. This vector can be written as x_k = (x_k^Q, x_k^M), where x_k^Q contains codings for the qualitative information and x_k^M includes the quantitative information about the kth company. Sometimes the information in x_k^M is standardised before it is used in a model (see, e.g. Moody's Investors Service, 2001). The advantage of standardisation is the comparability of different information measured on different scales. A standardised variable has an expected value of zero and a variance of one. For this reason, we generate artificial data using standard normally distributed random variables (x_k^M ~ N(0, V)). Here, V := cov(x_k^M) = I is assumed to be the identity matrix. For reasons of simplification, we used uncorrelated data. In reality, the data will not be perfectly uncorrelated. Therefore, orthogonalisation methods might be used for data preprocessing to minimise or remove the correlation. Despite these problems concerning real data, our framework fulfils the theoretical assumptions of many statistical methods, e.g. the logistic regression model or the discriminant analysis. To generate data more similar to real-world data, an empirical covariance matrix should be used instead of the identity matrix. In this case, the covariance matrix V has to be replaced by the empirical matrix V̂. Our framework works with both a diagonal and a non-diagonal covariance matrix. To simplify the simulation system, we focus on a diagonal covariance matrix, assuming that multicollinearity is removed by data preprocessing. Qualitative information like legal form, sector, management quality and so on allows grouping companies. To simulate this kind of information, we define the number of qualitative factors, the corresponding number of levels and all resulting combinations. Then, we choose the number of companies for each combination. The data generating process including qualitative information is described in Oelerich and Poddig (2004).
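The data generating process just described — a reference-coded qualitative factor, standard normally distributed quantitative variables with identity covariance, and Bernoulli experiments on the resulting true PDs — can be sketched as follows. The function name and the parameter values in the usage example are our own illustration (the paper's actual settings are in Table A1):

```python
import numpy as np

def generate_sample(n, beta, cutpoint, n_quant=3, n_levels=2, seed=0):
    """Generate one artificial data set: a reference-coded qualitative
    factor with n_levels levels, n_quant standard normal quantitative
    variables (identity covariance), true PDs from the logit model, and
    Bernoulli experiments for the solvency state."""
    rng = np.random.default_rng(seed)
    levels = rng.integers(0, n_levels, size=n)             # qualitative factor
    dummies = (levels[:, None] == np.arange(1, n_levels)).astype(float)
    x_quant = rng.standard_normal((n, n_quant))            # x^M ~ N(0, I)
    X = np.hstack([dummies, x_quant])
    p_true = 1.0 / (1.0 + np.exp(-(cutpoint + X @ beta)))  # true PDs
    y = rng.binomial(1, p_true)                            # Bernoulli draws
    return X, p_true, y

# Usage: one dummy column (two levels, reference coding) plus three
# quantitative variables, so beta needs four entries:
X, p_true, y = generate_sample(250, beta=np.array([0.5, 1.0, -0.5, 0.25]),
                               cutpoint=-1.5)
```

Because the sketch returns the true PDs alongside the data, any method's forecasts can later be scored against the known truth, which is what makes the artificial-data evaluation possible.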
An artificial design vector might look like Eq. (2):

x_k = (1, 2, 0.06, 0.98, −0.876)    (2)

Here, the first element 1 describes the first level of the first qualitative factor, noted as A. The second element 2 corresponds to the second level of the second factor (noted as B). This information has to be coded with dummy variables by applying reference or effect coding in preprocessing (see, e.g. Hosmer & Lemeshow, 2000). In our simulation system, we use reference coding for qualitative variables, because this coding is easier to interpret than effect coding (see Hosmer & Lemeshow, 2000). The vector further includes three additional variables representing quantitative information. To summarise, we generate quantitative information as standard normally distributed random variables and use qualitative information as grouping variables. Let Y_k ~ B(p_k) be the dependent variable, where B(p_k) describes the Bernoulli distribution with p_k as the probability of the event of bankruptcy. The logistic regression model is based on the model Eq. (3) (see, e.g. Hosmer & Lemeshow, 2000):

ln(p_k / (1 − p_k)) = a + x_k^T β    (3)

Given the vector x_k and a vector β (and the cutpoint a), we calculate the probability p_k = (1 + exp(−a − x_k^T β))^(−1) for each company. A realisation of Y_k can be simulated by a Bernoulli experiment based on these probabilities p_k. Therefore, we only need a given cutpoint parameter a and a vector β of model parameters to define an artificial data generating process. We calculate for each company its probability of default using the parameters discussed above (see also Table A1) and the design vectors x_k. These probabilities are the inputs used to generate Bernoulli random variables. Thereby, an artificial data sample corresponding to a given equation results. Based on this artificial data, we know all information about every company (the table X), the model equation (a and the vector β), and the true probabilities of default p_k. In order to investigate the small sample size properties of different statistical methods with respect to forecasting rating grades, different in-sample and out-of-sample sizes are generated. To generate the true rating grade for any company, we follow the scale in Rolfes and Emse (2000). We divide the interval from 0 to 1 into nine classes (see Table 1). This rating scale is based on nine rating grades, where grades 1–8 symbolise solvent grades and grade 9 describes insolvent ratings. Banks might use other definitions of rating grades and/or add more rating classes. Our focus is not to look for a perfect definition of rating grades, but a system to evaluate statistical methods. Typically, rating agencies use up to 21 rating grades. Such complex splittings can also be integrated into our framework, but they are not regarded here in detail.

The evaluation system is based on four indexes. First, we define an index to quantify the quality of the probability estimation. We call this index upsilon (Υ). The upsilon index is based on the idea of mean squared errors, defined by:

MSE_p = (1/N) Σ_{k=1}^{N} (p̂_k − p_k)²    (4)

Here, p_k describes the true probability of default of the kth company and p̂_k the corresponding prediction (by any kind of model). The index is equal to zero if all probabilities of default are forecasted correctly. The upper limit is 1 (0 ≤ MSE_p ≤ 1). The well-known hit rate (proportion of correct classifications) is interpreted in a similar, but reversed manner. So we use the

Table 1
Probabilities of default and corresponding rating grades

Grade  Lower limit  Upper limit
1      0.00000      0.00025
2      0.00025      0.00055
3      0.00055      0.00115
4      0.00115      0.00405
5      0.00405      0.01335
6      0.01335      0.07705
7      0.07705      0.16995
8      0.16995      0.20000
9      0.20000      1.00000

A similar deﬁnition can be found in (Rolfes and Emse, 2000).
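The mapping from a probability of default to a rating grade in Table 1 amounts to a simple table lookup, which can be sketched as follows. The boundary convention (which grade receives a PD that falls exactly on a class limit) is our own choice, since the table leaves it open:

```python
import numpy as np

# Upper limits of the nine rating classes from Table 1
# (grades 1-8 are solvent, grade 9 insolvent)
UPPER_LIMITS = np.array([0.00025, 0.00055, 0.00115, 0.00405, 0.01335,
                         0.07705, 0.16995, 0.20000, 1.00000])

def pd_to_grade(pd):
    """Map probabilities of default to rating grades 1..9 via Table 1.
    A PD exactly on a class limit falls into the lower grade here."""
    return np.searchsorted(UPPER_LIMITS, pd, side="left") + 1

grades = pd_to_grade(np.array([0.0001, 0.0003, 0.02, 0.10, 0.5]))
# -> grades 1, 2, 6, 7, 9
```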


Table 2
Overview of evaluation indexes for rating systems

Index  Description
Υ^p    Quantifies the quality of probability forecasts
Υ^D    Quantifies the quality of binary classification (hit rate)
Υ^P    Quantifies the quality of rating grade predictions (hit rate)
Υ^R    Quantifies the quality of rating grade predictions (distances)
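The four indexes of Table 2, as defined in Eqs. (4)–(8), can be sketched in a few lines; the function names are our own:

```python
import numpy as np

def upsilon_prob(p_true, p_hat):
    """Eqs. (4)-(5): one minus the mean squared error of the PD forecasts."""
    return 1.0 - np.mean((p_hat - p_true) ** 2)

def upsilon_binary(d_true, d_hat):
    """Eq. (6): binary hit rate; one if all classifications are correct."""
    return 1.0 - np.mean((d_hat - d_true) ** 2)

def upsilon_grade_hit(r_true, r_hat):
    """Eq. (7): proportion of correctly rated companies."""
    return 1.0 - np.mean(r_hat != r_true)

def upsilon_grade_dist(r_true, r_hat, r=8):
    """Eq. (8): distance-based rating index, normed by r = grades - 1."""
    return 1.0 - np.mean(((r_hat - r_true) / r) ** 2)

# Usage: a two-grade miss counts fully against the hit-rate index but
# only mildly against the distance-based index:
r_true, r_hat = np.array([3, 5]), np.array([3, 7])
hit = upsilon_grade_hit(r_true, r_hat)    # 0.5
dist = upsilon_grade_dist(r_true, r_hat)  # 0.96875
```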

normed index

Υ^p := 1 − MSE_p    (5)

to get the same interpretation: The Y -index in Eq. (5) becomes one if all forecasts are correct. The logistic regression has been applied in many empirical studies to forecast insolvencies (e.g. Huang et al., 2004). The estimated probabilities are used to divide all companies into two groups/classes. Deﬁning a cutpoint-parameter (e.g. cZ0.50) every company is assigned to one class (insolvent or solvent). This binary classiﬁcation is less sensitive to the variability of the probability estimation by the statistical model than the estimation of the probability itself. The Bernoulli experiment gives the true class (insolvent, solvent) in every run of the simulation. Thus, the evaluation of binary classiﬁcations is possible. In the binary case, this index is equal to the proportion of correctly classiﬁed companies. We denote the class of solvent companies by zero and the class of insolvent companies by one. The probability pk of the Bernoulli distribution for the event one is the true probability of default. The symbol Dk expresses the true class of the kth ^ company (that means Dk2{0,1}) and Dk the corresponding estimation. The index YD Z 1K

N 1 X ^ ðD KDk Þ2 N kZ1 k

number of incorrect classiﬁcations so that YP corresponds to the proportion of correctly rated companies. The distance between the true and the estimated rating grade is also important. For example, if a company had a true grade of BB, a forecasted rating grade of AA or CCC would be a severe misclassiﬁcation. It inﬂuences the entire risk management process including especially the price and credit conditions. To include the distance of the forecasted rating grade to the true rating grade, we deﬁne an index based on distances. The true rating of the kth company is denoted by Rk and the ^ corresponding estimate by Rk . The normed sum of squared differences is deﬁned as 2 N ^ 1 X Rk KRk ; Y Z 1K N kZ1 r

R

(8)

with rC1 equal to the number of rating grades. In our framework r equals eight. The index YR is one if all ratings are correct. It becomes zero if the absolute value of the differences ^ Rk KRk is for any company equal to r. The division by r is important to formulate a normed index (Table 2).

5. Results (6) Before presenting the results obtained, we describe three data generating models used in our simulation studies. Table A1 shows the parameters of different data generating models used in the simulation. A qualitative (main) factor is denoted by A, B, C, a quantitative factor by M1 to M10. The parameters b1 to b5 correspond to the ﬁve levels of factor A in model0. Sectors or branches, legal forms or ratings for management quality are typical examples for such factors. Interaction terms are denoted by the corresponding main factors, e.g. AB denote the interaction between the main factors A and B. For simpliﬁcations, no interaction between qualitative factors is assumed, so all parameters hAB are set to ij zero. In model0, the quantitative factors M1 to M5 have different effect sizes, M6 to M10 are assumed to have no effect. These artiﬁcial parameter settings have the same important characteristics as real data, e.g. some factors (e.g. AB, M4 and M5) have no effect and other have different effect sizes. The symbol ‘***’ means that this factor (e.g. C, M5) is not used in the speciﬁc data generating model, e.g. the model0 does not contain a qualitative factor C. In order to investigate the small sample size properties of different statistical methods to forecast rating grades, different sample sizes are generated. Table 3 shows these sample sizes for all models. The proportion of solvent and insolvent companies in an artiﬁcial data set depends on the speciﬁc

describes the quality of forecasts with respect to a binary classification. The squared difference $(D_k - \hat{D}_k)^2$ in Eq. (6), the distance between the true and the estimated class, is analogous to the case of probabilities discussed above. In the binary case, this index is identical to the classical hit rate (the proportion of correctly classified companies), because the distance equals one if the classification is wrong and zero if it is correct. Rating grades are more difficult to evaluate than a binary classification in real-world applications because the true rating grades are unknown and unobservable. To evaluate a statistical rating system using artificial data, we define an index that quantifies the predictive power of a method in the same way as is usually done for binary classifications: we calculate the proportion of correctly classified companies as discussed for the binary case. Because the data are artificial, the true rating grade is known. Table 1 shows how the probability of default is mapped to rating grades in our simulations. The Y^P-index is defined as

$Y^P = 1 - \frac{1}{N} \sum_{k=1}^{N} P_k,$    (7)

with $P_k$ being zero if the estimated rating grade is equal to the true rating grade and one otherwise. The sum counts the number of incorrect rating grade forecasts.

A. Oelerich, T. Poddig / Expert Systems with Applications 30 (2006) 437–447

Table 3
Sample sizes for all data generating models used

Data generating model   In sample   Out of sample   Sum
Model0                  125         125             250
                        250         250             500
                        500         500             1000
Model1                  225         225             450
                        450         450             900
                        900         900             1800
Model2                  270         270             540
                        675         675             1350
                        1350        1350            2700
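For illustration, the data generating process can be sketched in code. The following Python sketch draws a qualitative factor with three levels and three quantitative factors, with effect sizes loosely following model1 in Table A1; the zero intercept and standard normal quantitative factors are assumptions for illustration, not the authors' exact setup:

```python
import math
import random

def generate_sample(n, seed=0):
    """Draw one artificial sample: qualitative factor A (three levels),
    quantitative factors M1-M3, probability of default via the logistic
    link, and the insolvency flag from a Bernoulli experiment."""
    rng = random.Random(seed)
    beta_a = [0.75, 0.00, -0.75]      # levels of factor A (Table A1, model1)
    beta_m = [-0.75, -0.40, 0.25]     # effects of M1-M3 (Table A1, model1)
    sample = []
    for _ in range(n):
        a = rng.randrange(3)                          # level of factor A
        m = [rng.gauss(0.0, 1.0) for _ in beta_m]     # quantitative factors
        eta = beta_a[a] + sum(b * x for b, x in zip(beta_m, m))
        p = 1.0 / (1.0 + math.exp(-eta))              # probability of default
        y = 1 if rng.random() < p else 0              # Bernoulli experiment
        sample.append({"A": a, "M": m, "pd": p, "insolvent": y})
    return sample

data = generate_sample(225)   # smallest model1 sample size
```

A negative intercept could be added to the linear predictor to shift the insolvency proportion towards the roughly 0.2 observed in the paper's simulations.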

model and the Bernoulli experiments. The observed proportion in our simulation studies is approximately 0.2. A preliminary simulation study considers only two predictive models. We denote the logistic regression using the reference coding for qualitative factors and no selection algorithm by VM. The model resulting from a forward selection algorithm based on the Wald test, with reference coding for qualitative factors, is denoted by WM. In this preliminary study, we do not discuss the bootstrapping method, which will be applied later on. As an example, the distributions of two indexes (given data generating model1 with the smallest sample size) are shown in Fig. 4. The logistic regression model exhibits some undesirable properties for small sample sizes. The figure also shows that the rating grade predictions have a lower average and a higher variance in this example. Tables A2 and A3 show important characteristics of the empirical distribution of the indexes, simulating all data

generating models with different sample sizes and forecasting with the VM and the WM prediction models. The Y^p-index illustrates the general quality of logistic regression forecasts of probabilities. The predictive models (VM and WM) seem to be powerful: the average index is often approximately 0.99. The reason this index lies so close to one is the use of squared differences of probabilities. More important is the evaluation of the Y^D-index for binary classification based on these probabilities (see Table A2). Fig. 4 shows the empirical histogram of the Y^D-index for data generating model1. In the literature, a hit rate of 75% is often regarded as a good result (see, e.g. Leker & Schewe, 1998). Our simulation studies show that such rates are not likely to be observed on average: the mean of correct classifications in our simulation studies is lower than 75% in most cases. In real-world studies, an analyst also faces further problems, such as multicollinearity or outliers. Additionally, the assumed link function might be incorrect (see the discussion of systematic errors). The decision whether a method is good or not depends on the sample size and the model complexity. We used quite simple models; in some empirical studies the complexity is much higher, occasionally with more than thirty types of information for one company (see, e.g. Hayden, 2002). In these cases, it is really doubtful that a hit rate of 75% results on average in repeated applications of the same model development procedure. The Y^P-index shows the hit rates for rating grades. These hit rates are smaller than for binary classifications and the variance is higher. For example, using data generating model0 (with 125 companies) the average binary classification hit rate is nearly 75%, whereas the hit rate for rating grades is nearly 36%. This is only half the predictive power of the binary case, and the variance is more than three times higher. With increasing

Fig. 4. Empirical distribution of the Y^D-index (left side) and the Y^P-index (right side) for data generating model1 with the smallest sample size of 225 companies. The forecasting model is a logistic regression using a forward selection algorithm based on the Wald test and the reference coding for qualitative factors. The figure shows the hit rates (x-axis) and their corresponding relative frequency in percent (y-axis).
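The Y^P hit-rate index plotted on the right of Fig. 4 follows directly from Eq. (7); a minimal sketch:

```python
def yp_index(true_grades, predicted_grades):
    """Y^P = 1 - (1/N) * sum(P_k), where P_k is 0 if the predicted rating
    grade equals the true grade and 1 otherwise (Eq. (7))."""
    n = len(true_grades)
    misses = sum(t != p for t, p in zip(true_grades, predicted_grades))
    return 1.0 - misses / n

# Three of four grades correct -> hit rate 0.75.
print(yp_index([1, 2, 3, 4], [1, 2, 3, 5]))  # -> 0.75
```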



sample size, the quality of rating grade forecasts improves. If a bank uses 21 rating classes, the quality will likely be lower than the results presented in this paper, because we use only nine classes. In many empirical studies the sample sizes correspond to the sample sizes used in our simulation studies; we do not consider extremely small or extremely large sample sizes (see Table A3). To quantify the difference between predicted and true rating grades, we propose the Y^R-index (see Table A3). Here, we observe results similar to those for the Y^p-index. For the smallest sample sizes, the variability of the VM is extremely large: using model1 and model2 as data generating processes, the variance is nearly ten times higher than that observed for the WM. These results show that the variability of forecasts differs between techniques within the same statistical model. Our simulation studies show the advantage of forecasting rating grades using a selection algorithm. Nevertheless, for small sample sizes the WM is biased. All in all, even the WM seems to provide unreliable forecasts when the sample size is small. The bootstrap method might yield improved forecasted probabilities, because its integration into the WM should reduce the variability of forecasts; this should lead to a more reliable predictive model. We discuss this for data generating model1 in detail. In the second simulation study, we apply data generating model1 with the smallest (225 companies) and the largest (900 companies) sample size only, due to the extremely time-consuming simulation runs. We find two important results in this second study. First, the bootstrap method generally has the highest average of all predictive models in our studies. The forecast using the simple WM model is based on asymptotically consistent estimators (see also Hocke, 1974; Matin, 1994), but it appears less accurate than the bootstrap method when the sample size is small.
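A WM-type forward selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: the logistic regression is refit by Newton/IRLS steps, and at each stage the candidate variable with the largest Wald statistic is added, as long as it exceeds 3.841 (the 95% quantile of the chi-square distribution with one degree of freedom):

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Fit a logistic regression (with intercept) by Newton/IRLS steps;
    returns the coefficients and their estimated covariance matrix."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        H = Z.T @ (Z * (p * (1.0 - p))[:, None])   # Fisher information
        beta = beta + np.linalg.solve(H, Z.T @ (y - p))
    return beta, np.linalg.inv(H)

def forward_wald(X, y, threshold=3.841):
    """Forward selection: repeatedly add the candidate column with the
    largest Wald statistic (beta/se)^2, while it exceeds the threshold."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best, best_w = None, threshold
        for j in remaining:
            beta, cov = fit_logit(X[:, selected + [j]], y)
            w = beta[-1] ** 2 / cov[-1, -1]        # Wald statistic of column j
            if w > best_w:
                best, best_w = j, w
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Qualitative factors in reference coding would enter X as dummy columns; a grouped Wald test over all dummies of a factor would then replace the single-coefficient statistic.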
The bootstrap method seemingly stabilises the predictions of probabilities. Note that the sample sizes applied in our studies are similar to those used in empirical studies (e.g. Anders, 1998; Huang et al., 2004; Leker & Schewe, 1998). Second, the variance is an important property of forecasting models in real-world applications. The bootstrap method has the smallest variance in all cases (see Table A4). For ratings generated by statistical methods, this robustness is very important, because ratings should be conservative and stable over time (see Bank for International Settlement, 2003). Unreliable information in the database might have a significant influence on the forecast. For example, changes in the accounting data of one company could influence the model equation in such a way that another company receives a different estimate of its probability of default and is thereby assigned another rating grade. The bootstrap method might reduce this variation in rating grade forecasts. Note that the integration of the bootstrap is straightforward for most statistical methods (e.g. logistic regression) and can be realised with many statistical software packages (e.g. SAS or SPSS).
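The stabilisation idea can be sketched as follows: the model is refit on bootstrap resamples of the estimation sample and the predicted default probabilities are averaged. This is a sketch of the general bagging idea, under the assumption that it captures the spirit of the BM procedure; the plain Newton-method logistic fit is included to keep the sketch self-contained:

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Plain Newton/IRLS logistic regression fit (with intercept)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        H = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(H, Z.T @ (y - p))
    return beta

def bootstrap_pd(X, y, X_new, n_boot=200, seed=0):
    """Average predicted probabilities of default over models fitted on
    bootstrap resamples; the averaging damps the variability caused by
    individual (possibly unreliable) observations."""
    rng = np.random.default_rng(seed)
    Z_new = np.column_stack([np.ones(len(X_new)), X_new])
    preds = np.zeros((n_boot, len(X_new)))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
        beta = fit_logit(X[idx], y[idx])
        preds[b] = 1.0 / (1.0 + np.exp(-Z_new @ beta))
    return preds.mean(axis=0)
```

The averaged probabilities would then be mapped to rating grades exactly as for the single-fit models.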

To summarise the major results: the bootstrap method shows the highest average of all four indexes in our simulations in most cases. The variability of forecasts differs strongly; the variance of the VM is up to fifteen times higher than the variance obtained with the bootstrap method. The main advantage of the bootstrap method is its robustness: the results seem to be more reliable. However, with increasing sample size the classical methods become more powerful. For the largest sample size used, the variance of the bootstrap method is only about half that of the classical methods (for the Y^p- and Y^R-indexes). Such simulation studies show the power of different methods, or the improvement gained by modifying traditional statistical methods. For example, the comparison of classical logistic regression with its bootstrap extension suggests the use of resampling approaches.

6. Conclusion

In this paper we discuss how to assess the quality of statistical methods for forecasting insolvencies and rating grades. Whereas an evaluation of statistical methods predicting insolvencies does not pose difficult problems, it is impossible for rating grades, because they are unknown and unobservable. To solve this problem, we propose a simulation system using artificial data, in which all necessary information is known: the information about the companies, the probability of default, the insolvency coding and the rating grade. Based on such data we are able to evaluate rating processes that are based on statistical methods. We apply this simulation system to several logistic regression models and to the bootstrap method for one specific logistic regression model. There are two major results. First, a prototype of an evaluation system to quantify the quality of rating grade forecasting models could be studied. Second, we find that the integration of the bootstrap results in more confident forecasts. We call such a rating process 'robust', because it reduces the variability of predictions of insolvencies and rating grades. In order to reflect real-world data more realistically, it is recommended to use an empirical covariance matrix in the data generating process when building up the design vectors (the matrix X). This empirical covariance matrix could reflect interdependencies between real companies in the artificial data. The simulation system makes it possible to assess different rating models and to identify optimal rating models for a given real data sample.
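The suggested extension can be sketched directly: given a covariance matrix estimated from real company data, correlated design vectors are obtained by transforming independent standard normals with its Cholesky factor. The concrete matrix below is purely illustrative, not taken from real data:

```python
import numpy as np

def draw_design(n, cov, seed=0):
    """Draw n design vectors whose population covariance equals `cov`,
    by transforming i.i.d. standard normals with a Cholesky factor."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.asarray(cov))
    return rng.standard_normal((n, L.shape[0])) @ L.T

# Illustrative 'empirical' covariance matrix of three quantitative factors.
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
X = draw_design(50_000, cov)   # sample covariance is close to the target
```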

Appendix A

See Tables A1–A4.

Table A1
Parameters in three different data generating models

Factor  Parameter     model0   model1   model2
A       b1             1.50     0.75     1.00
        b2             0.75     0.00     0.00
        b3             0.00    -0.75    -1.00
        b4            -0.75     ***      ***
        b5            -1.50     ***      ***
B       g1            -0.50    -0.25    -0.75
        g2            -0.25     0.00     0.00
        g3             0.00     0.25     0.75
        g4             0.25     ***      ***
        g5             0.50     ***      ***
C       d1             ***      ***      0.00
        d2             ***      ***      0.00
        d3             ***      ***      0.00
        d4             ***      ***      0.00
        d5             ***      ***      0.00
AB      h^AB_ij        0.00     0.00     0.00
AC      h^AC_ik        ***      ***      0.00
BC      h^BC_jk        ***      ***      0.00
ABC     h^ABC_ijk      ***      ***      0.00
M1      m1            -0.40    -0.75    -1.00
M2      m2            -0.75    -0.40    -0.75
M3      m3            -0.50     0.25    -0.25
M4      m4             0.25     0.00     0.00
M5      m5             0.75     0.00     0.00
M6      m6             0.00     ***      ***
M7      m7             0.00     ***      ***
M8      m8             0.00     ***      ***
M9      m9             0.00     ***      ***
M10     m10            0.00     ***      ***

Table A2
Empirical distribution of the Y^p- and Y^D-index for the VM and the WM models. The Y^D-index is given in percent. All indexes are based on the data generating models shown in Table A1 in the out-of-sample test.

Index  Model   Sample  Prediction  Minimum    Average    Maximum    Variance
Y^p    Model0  125     VM          0.7037797  0.8821377  0.9597116  0.0017986
                       WM          0.7745611  0.9349743  0.9857128  0.0006322
               250     VM          0.8805836  0.9312435  0.9665835  0.0001741
                       WM          0.9020020  0.9463119  0.9846160  0.0001388
               500     VM          0.9710683  0.9865392  0.9955058  0.0000126
                       WM          0.9694741  0.9921865  0.9983600  0.0000117
       Model1  225     VM          0.9634257  0.9895332  0.9979800  0.0000190
                       WM          0.9708070  0.9908616  0.9983320  0.0000178
               450     VM          0.9846521  0.9952083  0.9992547  0.0000037
                       WM          0.9814356  0.9959866  0.9991158  0.0000040
               900     VM          0.9941839  0.9976368  0.9996633  0.0000008
                       WM          0.9929668  0.9980054  0.9997606  0.0000009
       Model2  270     VM          0.9459336  0.9793925  0.9932119  0.0000338
                       WM          0.9702192  0.9917139  0.9990288  0.0000221
               675     VM          0.9807539  0.9913713  0.9972507  0.0000052
                       WM          0.9862482  0.9966762  0.9996188  0.0000029
               1350    VM          0.9902074  0.9956640  0.9983464  0.0000012
                       WM          0.9936707  0.9984178  0.9998965  0.0000008
Y^D    Model0  125     VM          57.60      73.49      85.60      18.02
                       WM          56.80      75.26      87.20      22.43
               250     VM          60.80      70.68      82.00      8.52
                       WM          57.20      69.23      79.20      8.87
               500     VM          65.60      73.31      79.80      3.82
                       WM          67.00      73.65      79.40      3.56
       Model1  225     VM          53.78      71.74      83.11      17.55
                       WM          43.56      71.00      83.56      20.01
               450     VM          55.78      69.81      78.89      10.14
                       WM          54.44      69.63      80.00      12.49
               900     VM          56.22      66.80      75.33      7.42
                       WM          56.78      66.70      75.89      8.56
       Model2  270     VM          58.52      72.14      82.22      11.86
                       WM          58.15      71.71      81.85      12.61
               675     VM          67.56      74.39      80.59      3.56
                       WM          67.41      74.83      80.44      3.25
               1350    VM          69.11      73.54      77.11      1.56
                       WM          69.70      73.70      77.19      1.54

Table A3
Empirical distribution of the Y^P- and Y^R-index for the VM and the WM models. The Y^P-index is given in percent. All indexes are based on the data generating models shown in Table A1 in the out-of-sample test.

Index  Model   Sample  Prediction  Minimum    Average    Maximum    Variance
Y^P    Model0  125     VM          7.20       21.89      40.80      29.16
                       WM          8.80       36.77      64.00      70.33
               250     VM          18.00      30.43      45.20      18.98
                       WM          21.60      39.91      62.40      26.95
               500     VM          30.40      54.65      75.40      31.12
                       WM          39.80      66.16      84.20      43.06
       Model1  225     VM          20.44      52.63      80.44      74.10
                       WM          22.67      55.89      82.67      91.61
               450     VM          39.11      64.41      84.00      45.90
                       WM          42.67      68.47      82.44      50.64
               900     VM          56.33      74.31      90.00      25.17
                       WM          57.33      76.94      93.11      27.43
       Model2  270     VM          21.11      44.46      64.44      42.10
                       WM          28.15      62.96      85.96      94.83
               675     VM          42.37      61.77      79.41      24.91
                       WM          45.78      75.59      91.56      39.44
               1350    VM          54.00      72.17      84.07      16.08
                       WM          62.44      83.00      95.93      23.16
Y^R    Model0  125     VM          0.4851250  0.7215956  0.9056250  0.0058781
                       WM          0.5031250  0.9222988  0.9871250  0.0035467
               250     VM          0.6919375  0.8400099  0.9497500  0.0013861
                       WM          0.7370000  0.9532424  0.9868125  0.0008482
               500     VM          0.8221563  0.9390746  0.9918750  0.0006739
                       WM          0.8231250  0.9877972  0.9967813  0.0002389
       Model1  225     VM          0.7623611  0.9570286  0.9961111  0.0013208
                       WM          0.8185417  0.9841445  0.9963194  0.0001420
               450     VM          0.8825000  0.9870136  0.9969792  0.0001924
                       WM          0.8839931  0.9916225  0.9968056  0.0000192
               900     VM          0.9448090  0.9941055  0.9983333  0.0000044
                       WM          0.9456076  0.9948134  0.9989236  0.0000037
       Model2  270     VM          0.7423032  0.8929976  0.9832755  0.0010667
                       WM          0.8255787  0.9870792  0.9972222  0.0001299
               675     VM          0.8986806  0.9722261  0.9953704  0.0002371
                       WM          0.9413426  0.9943625  0.9986574  0.0000098
               1350    VM          0.9328935  0.9892359  0.9965509  0.0000614
                       WM          0.9501042  0.9965378  0.9993634  0.0000032

Table A4
Empirical distribution of all indexes for three different statistical models

Index  Sample  Statistical model  Minimum    Mean       Maximum    Variance
Y^p    225     VM                 0.9634257  0.9895332  0.9979800  0.0000190
               WM                 0.9708070  0.9908616  0.9983320  0.0000178
               BM                 0.9757352  0.9932666  0.9992551  0.0000103
       900     VM                 0.9941839  0.9976368  0.9996633  0.0000008
               WM                 0.9929668  0.9980054  0.9997606  0.0000009
               BM                 0.9945386  0.9982839  0.9998238  0.0000006
Y^D    225     VM                 53.78      71.75      83.11      17.55
               WM                 43.56      71.00      83.56      20.01
               BM                 49.33      71.07      84.44      18.50
       900     VM                 56.22      66.80      75.33      7.42
               WM                 56.78      66.70      75.89      8.56
               BM                 56.00      66.65      75.56      8.42
Y^P    225     VM                 20.44      52.63      80.44      74.10
               WM                 22.67      55.89      82.67      91.61
               BM                 28.00      60.73      86.22      77.27
       900     VM                 56.33      74.31      90.00      25.17
               WM                 57.33      76.94      93.11      27.43
               BM                 60.78      78.28      93.89      24.17
Y^R    225     VM                 0.7623611  0.9570286  0.9961111  0.0013208
               WM                 0.8185417  0.9841445  0.9963194  0.0001420
               BM                 0.8271528  0.9877336  0.9974306  0.0000878
       900     VM                 0.9448090  0.9941055  0.9983333  0.0000044
               WM                 0.9456076  0.9948134  0.9989236  0.0000037
               BM                 0.9895486  0.9952950  0.9990451  0.0000020

VM is the logistic regression with all factors (without a selection algorithm). WM is the logistic regression using a forward selection algorithm based on the Wald test and the reference coding for qualitative factors. BM is the WM with bootstrapping. All indexes are based on data generating model1 in Table A1.

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 4, 589–609.
Anders, U. (1998). Prognose von Insolvenzwahrscheinlichkeiten mit Hilfe logistischer neuronaler Netzwerke. Zeitschrift für betriebswirtschaftliche Forschung, 50, 892–915.
Bank for International Settlement. (2003). Consultative document: The new Basel capital accord. Zürich.
Carey, M., & Hrycay, M. (2001). Parameterizing credit risk models with rating data. Journal of Banking & Finance, 25, 197–270.
Desai, V. S., Crook, J. N., & Overstreet, A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95, 24–37.
Deutsche Bundesbank. (1999). Zur Bonitätsbeurteilung von Wirtschaftsunternehmen durch die Deutsche Bundesbank. Deutsche Bundesbank Monatsbericht 1999 (pp. 51–64).
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Fahrmeir, L., & Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimators in generalized linear models. The Annals of Statistics, 13(1), 342–368.
Fahrmeir, L., & Kaufmann, H. (1986). Correction: Consistency and asymptotic normality of the maximum likelihood estimators in generalized linear models. The Annals of Statistics, 14(4), 1643.
Foreman, R. D. (2003). A logistic analysis of bankruptcy within the US local telecommunication industry. Journal of Economics and Business, 55, 135–166.
Frerichs, H., & Wahrenburg, W. (2003). Evaluating internal rating systems depending on bank size. Working Paper 115. Frankfurt: Universität Frankfurt.
Hayden, E. (2002). Modelling an accounting-based rating system for Austrian firms. Dissertation. Fakultät für Wirtschaftswissenschaften und Informatik der Universität Wien.
Hocke, J. (1974). Der Einfluss der Multikollinearität auf die Kleinstichprobeneigenschaften diverser ökonometrischer Schätzmethoden – Eine Monte-Carlo-Studie. Dissertation. Universität München.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Huang, Z., Chen, H., Hsu, C., Chen, W., & Wu, S. (2004). Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems, 37, 543–558.
Leker, J., & Schewe, G. (1998). Beurteilung des Kreditausfallrisikos im Firmenkundengeschäft der Banken. Zeitschrift für betriebswirtschaftliche Forschung, 50, 877–891.
Matin, M. A. (1994). Small-sample properties of different tests and estimators of the parameters in the logistic regression model. Research Report 4. Uppsala, Sweden: Uppsala Universitet.
Moody's Investors Service. (2001). Moody's RiskCalc™ für nicht börsennotierte Unternehmen: Das deutsche Modell.
Oelerich, A., & Poddig, T. (2004). Modified Wald statistics for generalized linear models. Allgemeines Statistisches Archiv, 1, 23–34.
Poddig, T., & Oelerich, A. (2004). Evaluierung quantitativer Ratingverfahren. In D. Bayer, & C. Ortseifen (Eds.), SAS in Hochschule und Wirtschaft (pp. 195–212). Aachen: Shaker Verlag.
Rolfes, B., & Emse, C. (2000). Ratingbasierte Ansätze zur Bemessung der Eigenkapitalunterlegung von Kreditrisiken. ecfs-Forschungsbericht, Vol. 3.
Srinivasan, V., & Kim, H. (1987). Credit granting: A comparative analysis of classification procedures. Journal of Banking and Finance, XLII(3), 665–683.
