
# CREDIT SCORING METHODS AND ACCURACY RATIO

AYŞEGÜL İŞCANOĞLU

AUGUST 2005

CREDIT SCORING METHODS AND ACCURACY RATIO

A THESIS SUBMITTED TO

THE INSTITUTE OF APPLIED MATHEMATICS

OF

THE MIDDLE EAST TECHNICAL UNIVERSITY

BY

AYŞEGÜL İŞCANOĞLU

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

THE DEPARTMENT OF FINANCIAL MATHEMATICS

AUGUST 2005

Approval of the Institute of Applied Mathematics

Prof. Dr. Ersan AKYILDIZ

Director

I certify that this thesis satisfies all the requirements as a thesis for the degree of

Master of Science.

Prof. Dr. Hayri KÖREZLİOĞLU

Head of Department

This is to certify that we have read this thesis and that in our opinion it is fully

adequate, in scope and quality, as a thesis for the degree of Master of Science.

Assist. Prof. Dr. Kasırga YILDIRAK, Co-Advisor

Prof. Dr. Hayri KÖREZLİOĞLU, Supervisor

Examining Committee Members

Prof. Dr. Hayri KÖREZLİOĞLU

Prof. Dr. Gerhard Wilhelm WEBER

Assoc. Prof. Dr. Azize HAYFAVİ

Assist. Prof. Dr. Hakan ÖKTEM

Assist. Prof. Dr. Kasırga YILDIRAK

ABSTRACT

CREDIT SCORING METHODS AND ACCURACY RATIO

İşcanoğlu, Ayşegül

M.Sc., Department of Financial Mathematics

Supervisor: Prof. Dr. Hayri KÖREZLİOĞLU

Co-Advisor: Assist. Prof. Dr. Kasırga YILDIRAK

August 2005, 132 pages

Credit scoring with the help of classification techniques enables easy and quick decisions in lending. However, no definite consensus has been reached regarding the best method for credit scoring or the conditions under which each method performs best. Although a wide range of classification techniques has been used in this area, logistic regression is regarded as an important tool and is used very widely in studies. This study examines the accuracy and bias properties of logistic regression parameter estimates using Monte Carlo simulations, in four respects: the dimension of the variable sets, the sample length, the percentage of defaults included in the data, and the effect of the variables on the estimation. Moreover, an application of some important statistical and non-statistical methods to Turkish credit default data is provided, and the accuracies of the methods are compared for the Turkish market. Finally, ratings are constructed from the results of the best method using the receiver operating characteristic curve.

Keywords: Credit Scoring, Discriminant Analysis, Regression, Probit Regression, Logistic Regression, Classification and Regression Tree, Semiparametric Regression, Neural Networks, Validation Techniques, Accuracy Ratio.


ÖZ

KREDİ SKORLAMASI VE DOĞRULUK RASYOSU (Credit Scoring and Accuracy Ratio)

İşcanoğlu, Ayşegül

M.Sc., Department of Financial Mathematics

Supervisor: Prof. Dr. Hayri KÖREZLİOĞLU

Co-Advisor: Assist. Prof. Dr. Kasırga YILDIRAK

August 2005, 132 pages

Credit scoring, with the help of classification methods, enables easy and quick decisions in lending. However, there is no definite consensus about the conditions under which credit scoring methods perform well, nor about which of them is the best. Although many different methods have been used in this area, logistic regression is seen as an important tool and is used intensively in applications. Using Monte Carlo simulations, this study aims to examine the accuracy and bias properties of logistic regression in parameter estimation from four different aspects: the dimension of the data, its length, the proportion of defaulting observations in the data, and the effects of the variables on the estimation. In addition, this study performs credit scoring on Turkish default data by applying some important statistical and non-statistical methods, and the methods are compared on this data. Finally, using the receiver operating characteristic curve, ratings are constructed from the results of the best method.

Keywords: Credit Scoring, Discriminant Analysis, Regression, Probit Regression, Logistic Regression, Classification and Regression Trees, Artificial Neural Networks, Validation Techniques, Accuracy Ratio


To my family


ACKNOWLEDGMENTS

It is a pleasure to express my appreciation to those who have influenced this study. I am grateful to Prof. Dr. Hayri KÖREZLİOĞLU and Assist. Prof. Dr. Kasırga YILDIRAK for their encouragement, guidance, and always interesting correspondence.

I am indebted to Prof. Dr. Gerhard Wilhelm WEBER for his help, valuable comments, and suggestions. I am also very grateful to Assist. Prof. Dr. Hakan ÖKTEM for all his efforts.

My special thanks go to my mother and my dear brother, without whose encouragement and support this thesis would not have been possible.

I acknowledge my debt and express my thanks to Oktay Sürücü for his endless support and for being with me all the way.

As always, it is impossible to mention everybody who had an impact on this work: Yeliz Yolcu Okur, Zehra Ekşi, İrem Yıldırım, Serkan Zeytun, Tuğba Yıldız, Zeynep Alparslan, Nejla Erdoğdu, Derya Altıntan, and others.


TABLE OF CONTENTS

ABSTRACT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

ÖZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

TABLE OF CONTENTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

LIST OF TABLES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 FOUNDATIONS OF CREDIT SCORING: DEVELOPMENT, RESEARCH AND EXPLANATORY VARIABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Primitive Age Researchers and Explanatory Variables . . . . . . . . 5

2.2 Discriminant Age Researchers and Explanatory Variables . . . . . . 6

2.3 Regression Age Researchers and Explanatory Variables . . . . . . . 8

2.4 Machine Age Researchers and Explanatory Variables . . . . . . . . 10

3 STATISTICAL METHODS IN CREDIT SCORING. . . . . . . . . . . . . . . . . . 15

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.1 Decision Theory Approach. . . . . . . . . . . . . . . . . . . . 16

3.2.2 Functional Form Approach. . . . . . . . . . . . . . . . . . . . 23

vii

3.2.3 Advantages and Disadvantages of Discriminant Analysis. . 25

3.3 Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.2 Advantages and Disadvantages of Regression. . . . . . . . . 30

3.4 Probit Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.2 Advantages and Disadvantages of Probit Regression . . . . 33

3.5 Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5.2 Advantages and Disadvantages of Logistic Regression . . . 35

3.6 Classiﬁcation and Regression Trees . . . . . . . . . . . . . . . . . 36

3.6.1 Simple Binary Questions of CART Algorithm . . . . . . . . 39

3.6.2 Goodness of Split Criterion . . . . . . . . . . . . . . . . . . . 40

3.6.3 The Class Assignment Rule to the Terminal Nodes and Resubstitution Estimates . . . . . . . . . . . . . . . . . . . . 43

3.6.4 Selecting the Correct Complexity of a Tree. . . . . . . . . . 45

3.6.5 Advantages and Disadvantages of CART . . . . . . . . . . . 46

3.7 Semi-Parametric Binary Classiﬁcation . . . . . . . . . . . . . . . 47

3.7.1 Kernel Density Estimation. . . . . . . . . . . . . . . . . . . . 48

3.7.2 Generalized Partial Linear Models . . . . . . . . . . . . . . . 55

3.7.3 Advantages and Disadvantages of Semi-Parametric Methods 59

4 NONSTATISTICAL METHODS IN CREDIT SCORING . . . . . . . 60

4.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.1.1 Structure of Neural Networks . . . . . . . . . . . . . . . . . 60

4.1.2 Learning Process. . . . . . . . . . . . . . . . . . . . . . . . . 64

4.1.3 Advantages and Disadvantages of Neural Networks . . . . 78

5 PARAMETER ESTIMATION ACCURACY OF LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.1 Introduction and Methodology. . . . . . . . . . . . . . . . . . . 80

5.2 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

viii

5.2.1 One Variable Case . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2.2 Six Variables Case . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.2.3 Twelve Variables Case. . . . . . . . . . . . . . . . . . . . . . . . 94

5.3 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 104

6 ACCURACY RATIO AND METHOD VALIDATIONS . . . . . . . . . 105

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.2 Validation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.2.1 Cumulative Accuracy Proﬁle (CAP) . . . . . . . . . . . . . . 107

6.2.2 Receiver Operating Characteristic Curve (ROC) . . . . . . 109

6.3 Method Validations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.3.1 Data and Methodology . . . . . . . . . . . . . . . . . . . . . . 113

6.3.2 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.3.3 Ratings via Logistic and Probit Regression . . . . . . . . . . 119

7 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

8 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


LIST OF FIGURES

3.1 Misclassiﬁcation errors [JW97]. . . . . . . . . . . . . . . . . . . . 21

3.2 Fisher’s linear discriminant analysis. . . . . . . . . . . . . . . . . 24

3.3 A Sample organization chart of classiﬁcation and regression trees

[BFOS84]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Kernel density estimation of car prices by Matlab (h = 100) with

triangle kernel function. . . . . . . . . . . . . . . . . . . . . . . . 49

3.5 Kernel density estimation of car prices by Matlab (h = 200) with

triangle kernel function. . . . . . . . . . . . . . . . . . . . . . . . 50

3.6 Kernel density estimation of car prices by Matlab (h = 300) with triangle kernel function. . . . . . . . . . . . . . . . . . . . . . . . 50

3.7 Kernel density estimation of car prices by Matlab (h = 400) with triangle kernel function. . . . . . . . . . . . . . . . . . . . . . . . 51

3.8 Kernel density estimation of car prices and house prices by Matlab (h_1 = 200, h_2 = 100) with Gaussian kernel function. . . . . . . 53

3.9 Kernel density estimation of car prices and house prices by Matlab (h_1 = 300, h_2 = 100) with Gaussian kernel function. . . . . . . 54

3.10 Kernel density estimation of car prices and house prices by Matlab (h_1 = 400, h_2 = 200) with Gaussian kernel function. . . . . . . 54

3.11 Kernel density estimation of car prices and house prices by Matlab (h_1 = 500, h_2 = 300) with Gaussian kernel function. . . . . . . 55

4.1 Structure of neural networks. . . . . . . . . . . . . . . . . . . . . 60

4.2 Threshold activation function [H94]. . . . . . . . . . . . . . . . . . 62

4.3 Piecewise-linear activation function [H94]. . . . . . . . . . . . . . 62

4.4 Sigmoid activation functions. . . . . . . . . . . . . . . . . . . . . . 63

4.5 A neural network structure. . . . . . . . . . . . . . . . . . . . . . 64

4.6 Diagram of learning process [H94]. . . . . . . . . . . . . . . . . . 65

4.7 Error correction learning. . . . . . . . . . . . . . . . . . . . . . . . 66


4.8 Boltzman machine [H94]. . . . . . . . . . . . . . . . . . . . . . . . 68

4.9 Single layer competitive network [H94]. . . . . . . . . . . . . . . . 71

4.10 The feed-forward network [H94]. . . . . . . . . . . . . . . . . . . . 72

4.11 Two layer feed-forward network [H94]. . . . . . . . . . . . . . . . 76

5.1 Coeﬃcient of variation of estimator in diﬀerent default levels for

w = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.2 Coeﬃcient estimate and their true value in diﬀerent default levels

for w = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3 Coeﬃcient of variation of estimator in diﬀerent default levels for

w = 0.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4 Coeﬃcient estimate and their true value in diﬀerent default levels

for w = 0.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.5 Six variable case: Average coeﬃcient of variation of estimators in

diﬀerent default levels for the ﬁrst set of weights. . . . . . . . . . 88

5.6 Six variable case: Coeﬃcients’ estimates and their true values on

diﬀerent default levels for the ﬁrst set of weights. . . . . . . . . . 89

5.7 Six variable case: Average coeﬃcient of variation of estimators in

diﬀerent default levels for the second set of weights. . . . . . . . . 90

5.8 Six variable case: Coeﬃcients’ estimates and their true values in

diﬀerent default levels for the second set of weights. . . . . . . . 91

5.9 Six variable case: Average coeﬃcient of variation of estimators in

diﬀerent default levels for the third set of weights. . . . . . . . . 92

5.10 Six variable case: Coeﬃcients’ estimates and their true values in

diﬀerent default levels for the third set of weights. . . . . . . . . 93

5.11 Twelve variable case: Average coeﬃcient of variation of estimators

in diﬀerent default levels for the ﬁrst set of weights. . . . . . . . 95

5.12 Twelve variable case: First six coeﬃcients’ estimates and their true

values in diﬀerent default levels for the ﬁrst set of weights. . . . . 96

5.13 Twelve variable case: Second six coeﬃcients’ estimates and their

true values in diﬀerent default levels for the ﬁrst set of weights. . 97


5.14 Twelve variable case: Average coeﬃcient of variation of estimators

in diﬀerent default levels for the second set of weights. . . . . . . 98

5.15 Twelve variable case: First six coeﬃcients’ estimates and their true

values in diﬀerent default levels for the second set of weights. . . 99

5.16 Twelve variable case: Second six coeﬃcients’ estimates and their

true values in diﬀerent default levels for the second set of weights. 100

5.17 Twelve variable case: Average coeﬃcient of variation of estimators

in diﬀerent default levels for the third set of weights. . . . . . . . 101

5.18 Twelve variable case: First six coeﬃcients’ estimates and their true

values in diﬀerent default levels for the third set of weights. . . . 102

5.19 Twelve variable case: Second six coeﬃcients’ estimates and their

true values in diﬀerent default levels for the third set of weights. 103

6.1 Cumulative accuracy proﬁle curve of the Example (6.1). . . . . . . 108

6.2 Cumulative accuracy proﬁle [EHT02]. . . . . . . . . . . . . . . . . 109

6.3 Distribution of rating scores for defaulting and non-defaulting debtors

[EHT02]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.4 Receiver operating characteristic curve [EHT02]. . . . . . . . . . . 111

6.5 Distribution of rating scores for Example 6.1. . . . . . . . . . . . 112

6.6 Receiver operating characteristic curve for Example 6.1. . . . . . 113

6.7 Receiver operating characteristic curves for validation sample of

size 1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.8 Receiver operating characteristic curves for training sample for size

1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118


LIST OF TABLES

2.1 The methods and scientists. . . . . . . . . . . . . . . . . . . . . . 14

3.1 Misclassiﬁcation costs. . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Kernel functions [HMSW04]. . . . . . . . . . . . . . . . . . . . . . 48

6.1 The rating classes and total number of observations. . . . . . . . . 106

6.2 The cumulative probability functions. . . . . . . . . . . . . . . . . 108

6.3 Decision results given the cut-oﬀ value C [EHT02]. . . . . . . . . 110

6.4 The MSE and accuracy results of methods. . . . . . . . . . . . . . 116

6.5 The coeﬃcients and p-values of logistic and probit regression. . . . 120

6.6 Training sample classiﬁcation results. . . . . . . . . . . . . . . . . 120

6.7 Validation sample classiﬁcation results. . . . . . . . . . . . . . . . 121

6.8 Optimum cut-off probability of defaults of logistic regression for 10 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.9 Optimum cut-off probability of defaults of probit regression for 10 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 122

6.10 Ratings of companies for 10 rating classes (Def: number of defaults

in rating categories). . . . . . . . . . . . . . . . . . . . . . . . . . 123

6.11 The areas under the ROC curve for optimum cut-oﬀ probabilities. 123

8.1 Optimum cut-off probability of defaults of logistic regression for 8 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 132

8.2 Optimum cut-off probability of defaults of probit regression for 8 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 132

8.3 Ratings of companies for 8 rating classes (Def: number of defaults

in rating categories). . . . . . . . . . . . . . . . . . . . . . . . . . 133

8.4 Optimum cut-off probability of defaults of logistic regression for 9 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 133

8.5 Optimum cut-off probability of defaults of probit regression for 9 rating categories. . . . . . . . . . . . . . . . . . . . . . . . . . 134


8.6 Ratings of companies for 9 rating classes (Def: number of defaults

in rating categories). . . . . . . . . . . . . . . . . . . . . . . . . . 134


CHAPTER 1

INTRODUCTION

In a credit granting procedure, a credit company's main aim is to determine whether a credit application should be granted or refused. Credit scoring procedures in fact measure the risk of lending. Since early civilizations, this risk has been reflected in the interest rate charged. However, studies on the financial situations of governments, companies, and individuals have demonstrated that interest alone does not diminish the risk; credit risk should be assessed separately.

The default of a firm is always very costly for both shareholders and credit agencies, because the agencies could lose whatever they lend, and the shareholders could lose all, or nearly all, of the value of their equity. The problem here is to detect a default some time before it happens, in order to take precautions. Empirical studies indicate that classification methods give signals of default. However, these methods can behave differently according to the size, shape, and structure of the data. Therefore, selecting the most suitable method for the available data is a much more complex concern.

In the literature, as far as we know, scoring studies present only accuracy comparisons of two or more models; close examinations of individual methods remain a gap in this area. Therefore, this study includes empirical research on logistic regression. Logistic regression, because of its properties, is the most widely used method in credit scoring studies. However, the conditions under which logistic regression performs well have not been discussed. This study makes a close examination of the parameter estimation of logistic regression on different variable sets and on data sets with different default levels, by means of Monte Carlo simulations.
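The Monte Carlo examination described here can be sketched in a few lines: simulate default indicators from a logistic model with known coefficients, re-estimate the coefficients on each simulated data set, and compare the average estimate with the true value. The coefficients, sample sizes, and replication count below are illustrative choices for the sketch, not the thesis's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(eta):
    # numerically safe logistic function
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))

def fit_logistic(x, y, iters=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson (IRLS)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = sigmoid(X @ beta)
        W = p * (1.0 - p)                                   # IRLS weights
        grad = X.T @ (y - p)
        hess = (X * W[:, None]).T @ X + 1e-8 * np.eye(2)    # tiny ridge for stability
        beta += np.linalg.solve(hess, grad)
    return beta

# True coefficients of the simulated model (illustrative values only);
# b0 controls the percentage of defaults in the generated data.
b0_true, b1_true = -1.0, 0.6

def mean_estimate(n, n_rep=200):
    """Average estimate of b1 over n_rep simulated data sets of length n."""
    estimates = []
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = (rng.uniform(size=n) < sigmoid(b0_true + b1_true * x)).astype(float)
        estimates.append(fit_logistic(x, y)[1])
    return float(np.mean(estimates))

for n in (100, 1000):
    print(n, round(mean_estimate(n), 3))
```

Repeating the experiment over grids of sample lengths, variable counts, and default percentages gives exactly the kind of accuracy and bias surfaces studied in Chapter 5.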

Moreover, the study presented here also includes applications of the classification methods to real Turkish credit data, and the accuracies of all methods are compared with each other. Furthermore, for Turkish data sets, which are very volatile across different groups of companies, we try to select the most appropriate model.

The organization of the thesis is as follows. Chapter 2 gives a brief overview of the development of credit scoring, the related studies, and the explanatory variables used in them. Chapter 3 provides the fundamentals of the statistical methods used in credit scoring. In Chapter 4, the important neural network learning algorithms and their derivations are given. Chapter 5 presents the Monte Carlo applications on the parameter estimation and estimation accuracies of logistic regression. Chapter 6 provides the accuracy ratio and the validations of the methods mentioned in Chapters 3 and 4. Chapter 7 concludes this thesis.


CHAPTER 2

FOUNDATIONS OF CREDIT SCORING: DEVELOPMENT, RESEARCH AND EXPLANATORY VARIABLES

Research in the area of credit scoring started in the 1930s. Since that date, many different works and methods have entered the credit scoring literature. According to the type of methods used, we can split the period 1930-2005 into four sub-periods. We call the first period the primitive age of credit scoring because it includes very basic applications. In this period, research was based only on ratio analysis: scientists compared the ratios of default and non-default companies and tried to develop an idea of the companies' financial performance. As can be guessed, these types of methods had no predictive power, and so they were not very suitable.

The second period of credit scoring started in 1966 with the application of discriminant analysis. With this application, research gained predictive power. However, the method makes very strong assumptions about the variables, and so its prediction power is not very high. Moreover, it does not give an idea of the relative performance of the variables. We call this period the discriminant age.

The application of discriminant analysis is a turning point for credit scoring because it opened the door to computer-based methods. After the 1970s, the methods applied in this area changed rapidly. The main types of methods were regression-based approaches; therefore, we can call this period the regression age of credit scoring. Linear regression was applied first, but it did not give good results, because credit default probabilities take only values between 0 and 1, whereas linear regression can produce results between −∞ and ∞. Probit regression then came into play. Since it also makes strong normality assumptions, its application ended quickly. In other words, in the period 1970-1980, the regression-type methods did not come to the fore over discriminant analysis.

In the 1980s, the study of logistic regression increased interest in regression methods, since logistic regression makes no normality assumptions on the variables, allows prediction and the interpretation of coefficients, and gives output on the interval [0, 1]. Although many other statistical methods have also been applied after the 1980s, such as k-nearest neighbors, classification and regression trees, survival analysis, etc., this method has kept its importance even today as the most widely used statistical technique in this research.
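The range argument above can be made concrete: a linear score b0 + b1·x is unbounded, while the logistic transform maps any score into (0, 1), so it can always be read as a default probability. A minimal sketch (the coefficients here are made-up illustration values, not estimates from any study in this chapter):

```python
import math

def linear_score(x, b0=-1.0, b1=0.8):
    # Linear regression output: unbounded, so it can leave [0, 1].
    return b0 + b1 * x

def logistic_prob(x, b0=-1.0, b1=0.8):
    # Logistic regression output: always strictly inside (0, 1).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (-10.0, 0.0, 10.0):
    print(x, linear_score(x), round(logistic_prob(x), 4))
```

For extreme values of x the linear score falls below 0 or above 1 and cannot be interpreted as a probability, while the logistic output merely approaches 0 or 1.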

The year 1990 is another turning point for credit scoring. In that year, statistical methods gave way to machine-learning methods with the application of neural networks. Therefore, we name this period the machine age.


2.1 Primitive Age Researchers and Explanatory Variables

The first researchers we found in the primitive age are Ramser and Foster, with their 1931 paper [BLSW96]. They were followed by Fitzpatrick in 1932 [BLSW96]. Fitzpatrick investigated nineteen pairs of failed and non-failed companies and showed a significant difference in the ratios of failed and non-failed companies at least three years prior to failure. After that, Winekor and Smith [BLSW96], in 1935, examined the mean ratios of failed firms ten years prior to failure and detected a breakdown in the mean values as failure approached [B66].

In 1942, Merwin [BLSW96] studied the mean ratios of failed and non-failed companies in the period 1926-1936, and his results differed from Winekor and Smith's work [BLSW96] only in the sixth year before failure [B66].

In these years, the primary ratio was the current ratio; however, some other ratios were also used. For example, Ramser and Foster [BLSW96] worked with equity / (net sales); Fitzpatrick [BLSW96] used equity / (fixed assets) and return on stock; Winekor and Smith [BLSW96] applied their analysis to (working capital) / (total assets); and Merwin [BLSW96], besides the current ratio, used (total debt) / equity and (working capital) / (total assets).


2.2 Discriminant Age Researchers and Explanatory Variables

The researcher who started the discriminant age is Beaver [B66]. In his work, he applied a univariate type of discriminant analysis using: (cash flow) / (total debt), (current assets) / (current liabilities), (net income) / (total assets), (total debt) / (total assets), and (working capital) / (total assets) [B66]. Then, in 1968, Altman investigated multivariate discriminant analysis with his famous z-score with five variables: 1. (market value of equity) / (book value of debt), 2. (net sales) / (total assets), 3. (operating income) / (total assets), 4. (retained earnings) / (total assets), and 5. (working capital) / (total assets) [A68]. Altman obtained 94% and 97% classification accuracy among default and non-default firms, respectively, and 95% overall accuracy [HM01].
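Altman's z-score is simply a linear combination of the five ratios listed above. As an illustration, the sketch below uses the weights commonly quoted for the 1968 model together with the often-cited distress/safe cut-offs of 1.81 and 2.99; these numbers come from the general literature, not from this thesis, and the two example firms are hypothetical:

```python
def altman_z(wc_ta, re_ta, oi_ta, mve_bvd, sales_ta):
    """Altman (1968) z-score from the five ratios above.
    Weights are the commonly quoted ones, shown for illustration."""
    return (1.2 * wc_ta + 1.4 * re_ta + 3.3 * oi_ta
            + 0.6 * mve_bvd + 1.0 * sales_ta)

# Hypothetical firms: the healthy ratios land above the often-cited
# "safe" threshold of 2.99, the distressed ratios below 1.81.
healthy = altman_z(0.25, 0.30, 0.15, 1.5, 1.2)
distressed = altman_z(-0.05, 0.02, 0.01, 0.3, 0.7)
print(round(healthy, 2), round(distressed, 2))  # → 3.31 0.88
```

The discriminant weights are what later studies re-estimated on their own samples; the variable lists in this section are the inputs to exactly this kind of linear score.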

The discriminant analysis applications also continued after the 1970s. In 1972, Deakin [D72] applied discriminant analysis using: cash / (current liabilities), (cash flow) / (total debt), cash / (net sales), cash / (total assets), the current ratio, (current assets) / (net sales), (current assets) / (total assets), (net income) / (total assets), (quick assets) / (current liabilities), (quick assets) / (net sales), (quick assets) / (total assets), (total debt) / (total assets), (working capital) / (net sales), and (working capital) / (total assets), and obtained 97% overall accuracy [D72], [HM01].

In 1972, Lane and Awh, and then Waters [RG94] in 1974, used discriminant analysis. Moreover, in 1974, Blum applied discriminant analysis with the ratios: market rate of return, quick ratio, (cash flow) / (total debt), fair market value of net worth, (net quick assets) / inventory, (book value of net worth) / (total debt), standard deviation of income, (standard deviation of net quick assets) / inventory, slope of income, (slope of net quick assets) / inventory, trend breaks of income, and (trend breaks of net quick assets) / inventory [Bl74].

One year later, Sinkey used: (cash + U.S. Treasury securities) / assets, loans / assets, (provision for loan losses) / (operating expense), loans / (capital + reserves), (operating expense) / (operating income), (loan revenue) / (total revenue), (U.S. Treasury securities' revenue) / (total revenue), (state and local obligations' revenue) / (total revenue), (interest paid on deposits) / (total revenue), and (other expenses) / (total revenue) [S75]. Then, Altman and Lorris [BLSW96] (1976) acquired 90% classification accuracy with the help of five financial ratios:

• (net income) / (total assets),

• (total liabilities + subordinate loans) / equity,

• (total assets) / (adjusted net capital),

• (ending capital − capital additions) / (beginning capital),

• scaled age,

together with a composite version of the aforementioned ones. For more information, we refer to [AL76] and [HM01].

Another paper with discriminant analysis was published by Altman, Haldeman and Narayan (1977) [BLSW96]. Here, the ratios were (retained earnings) / (total assets), (earnings before interest and taxes) / (total interest payments), (operating income) / (total assets), (market value of equity) / (book value of debt), and (current assets) / (current liabilities), and their results showed a 93% overall accuracy [HM01].

The last researchers I was able to find who made their studies only with discriminant analysis are Dambolena and Khory (1980) [DK80], with the ratios: 1. (working capital) / (total assets), 2. (retained earnings) / (total assets), 3. (earnings before interest and taxes) / (total assets), 4. (market value of equity) / (book value of debt), and 5. sales / (total assets); Altman and Izan (1984) [HM01]; and lastly, Pantalone and Platt (1987) [HM01], with 95% accuracy.

2.3 Regression Age Researchers and Explanatory Variables

The first researcher who used regression analysis, according to my study, was Orgler (1970) [Org70]. In his analysis, Orgler basically used: current ratio, working capital, cash / (current liabilities), inventory / (current assets), quick ratio, (working capital) / (current assets), (net profit) / sales, (net profit) / (net worth), (net profit) / (total assets), net profit ≶ 0, net profit, (net worth) / (total liabilities), (net worth) / (fixed assets), (net worth) / (long-term debt), net worth ≶ 0, sales / (fixed assets), sales / (net worth), sales / (total assets), sales / inventory, and sales / receivables. In the hold-out sample, he was only able to classify 75% of the bad loans as bad and 35% of the good loans as good [Org70].

In 1976, Fitzpatrick also applied multivariate regression. This research was followed by Ohlson (1980) [O80], who investigated a logistic regression analysis. He collected data from the period 1970-76. His basic ratios were: (total liabilities) / (total assets), (working capital) / (total assets), (current liabilities) / (current assets), a dummy that takes the value one if (total liabilities − total assets) > 0 and zero otherwise, (net income) / (total assets), (funds provided by operations) / (total liabilities), a dummy that takes the value one if net income was negative for the last two years, and (NI_t − NI_{t−1}) / (|NI_t| + |NI_{t−1}|); he also used the size of the company as an explanatory variable [O80]. He calculated type one and type two errors at different cut-off points and found a better average error for his second model, namely 14.4%.

Then, Pantalone and Platt (1987) [HM01] tried logistic regression in their paper. They obtained 98% accuracy in the classification of failed firms and 92% accuracy in that of non-failed firms.

In this time period, the recursive partitioning algorithm was gained to the

literature by Altman, Frydman and Kao (1985) [FAK85]. Their explanatory

variables were as follows:

• (net income) / (total assets),

• (current assets)/ (current liabilities),

• log(total assets),

• (market value of equity)/ (total capitilazation),

• (current assets) / (total assets),

• (cash ﬂow) / (total debt),

• (quick assets) / (current liabilities),

• (earning before interest and taxes) / (total assets),


• log(interest coverage+15 ),

• cash / (total sales),

• (total debt) / (total assets),

• (quick assets) / (total assets).

For further information, please refer to [FAK85].

2.4 Machine Age Researchers and Explanatory Variables

Odom and Sharda in 1990 [OS90] made a comparison between discriminant analysis and neural networks by using Altman's (1968) explanatory variables [A68]. They collected a data set covering the period 1975 to 1982. Their training sample consisted of 74 companies, 38 of which were in default, and their hold-out sample of 55 companies, 27 of which were in default. They concluded that neural networks performed better on both the training sample and the hold-out sample. Moreover, the study showed that neural networks were more robust than discriminant analysis even for small sample sizes [OS90].

In 1991, Cadden [BO04] and Coats and Fant [BO04] also compared discriminant analysis and neural networks. In the following year, Tam and Kiang [TK92] applied discriminant analysis, logistic regression and neural networks with eighteen explanatory variables. This was followed by another study of Coats and Fant [CF93] in 1993, with data from 1970 to 1989. Using Altman's (1968) variables, they ran discriminant analysis and neural networks and found that discriminant analysis gave the highest misclassification error, that is, classifying a default firm as non-default.

In 1996, Back, Laitinen, Sere and Wesel [BLSW96] worked with a large set of variables: 1. cash / (current liabilities), 2. (cash flow) / (current liabilities), 3. (cash flow) / (total assets), 4. (cash flow) / (total debt), 5. cash / (net sales), 6. cash / (total assets), 7. (current assets) / (current liabilities), 8. (current assets) / (net sales), 9. (current assets) / (total assets), 10. (current liabilities) / equity, 11. equity / (fixed assets), 12. equity / (net sales), 13. inventory / (net sales), 14. (long term debt) / equity, 15. (market value of equity) / (book value of debt), 16. (total debt) / equity, 17. (net income) / (total assets), 18. (net quick assets) / inventory, 19. (net sales) / (total assets), 20. (operating income) / (total assets), 21. (earnings before interest and taxes) / (total interest payments), 22. (quick assets) / (current liabilities), 23. (quick assets) / (net sales), 24. (quick assets) / (total assets), 25. rate of return to common stock, 26. (retained earnings) / (total assets), 27. return on stock, 28. (total debt) / (total assets), 29. (working capital) / (net sales), 30. (working capital) / equity, and 31. (working capital) / (total assets) [BLSW96].

In 1998, Kiviluoto [K98] tried discriminant analysis and neural networks using the operating margin, net income before depreciation and extraordinary items, the same item for the previous year, and the equity ratio. For further details see [K98].

After that, in 1999, Laitinen and Kankaanpaa [LK99] compared discriminant analysis, logistic regression, recursive partitioning, survival analysis and neural networks. The study's ratios were cash to current liabilities, total debt to total assets, and operating income to total assets. They examined all the methods up to three years prior to failure. In terms of total error, they found neural networks best one year prior to failure and recursive partitioning best two and three years prior to failure.

In 1999, Muller and Ronz [MR99] presented a different approach to credit default prediction, implementing semiparametric generalized partial linear models in this area. Their paper used 24 variables, which were not specified.

In 2000, recursive partitioning, discriminant analysis and neural networks were

attempted by McKee and Greenstein [MG00]. The ratios were as follows:

• (net income) / (total assets),

• (current assets) / (total assets),

• (current assets) / (current liabilities),

• cash / (total assets),

• (current assets) / sales and

• (long-term debt) / (total assets).

In 2001, Atiya [Ati01] investigated neural networks using: 1. (book value) / (total assets), 2. (cash flow) / (total assets), 3. price / (cash flow) ratio, 4. rate of change of stock price, 5. rate of change of cash flow per share, and 6. stock price volatility.

The timetable of the studies and methods can be found in Table 2.1. We refer to it for a closer examination, and we abbreviate:

Researchers, Year: Method marks (columns: Ra.A. DA RA LA CART SPR NN)

Ramser, Foster [BLSW96], 1931: X
Fitzpatrick [BLSW96], 1932: X
Winakor, Smith [BLSW96], 1935: X
Merwin [BLSW96], 1942: X
Beaver [B66], 1966: X
Mears [M66], 1966: X
Horrigan [H66], 1966: X
Neter [N66], 1966: X
Altman [A68], 1968: X
Orgler [Org70], 1970: X
Wilcox [Wil71], 1971: X
Deakin [D72], 1972: X
Lane [RG94], 1972: X
Wilcox [Wil73], 1973: X
Awh, Waters [RG94], 1974: X
Blum [Bl74], 1974: X
Sinkey [S75], 1975: X
Libby [Lib75], 1975: X
Fitzpatrick [RG94], 1976: X
Altman and Lorris [BLSW96], 1976: X
Altman, Haldeman, Narayan [BLSW96], 1977: X
Dambolena, Khoury [DK80], 1980: X
Ohlson [O80], 1980: X
Altman, Izan [BO04], 1984: X
Frydman, Altman, Kao [FAK85], 1985: X
Pantalone, Platt [BO04], 1987: X
Pantalone, Platt [BO04], 1987: X
Odom, Sharda [BO04], 1990: X X
Cadden [BO04], 1991: X X
Coats, Fant [BO04], 1991: X X
Tam, Kiang [TK92], 1992: X X X
Coats, Fant [CF93], 1993: X X
Fletcher, Goss [BO04], 1993: X X
Udo [BO04], 1993: X X
Chung, Tam [BO04], 1993: X
Altman, Marco, Varetto [BO04], 1994: X X
Back, Laitinen, Sere, Wesel [BLSW96], 1996: X X X
Bardos, Zhu [BO04], 1997: X X X
Pompe, Feelders [PF97], 1997: X X X
Kiviluoto [K98], 1998: X X
Laitinen, Kankaanpaa [LK99], 1999: X X X X
Muller, Ronz [MR99], 1999: X X
McKee, Greenstein [MG00], 2000: X X X
Pompe, Bilderbeek [BO04], 2000: X X
Yang, Temple [BO04], 2000: X X
Neophytou, Mar-Molinero [BO04], 2001: X X
Atiya [Ati01], 2001: X

Table 2.1: The methods and scientists.

Ra.A: Ratio analysis,

DA: Discriminant analysis,

RA: Regression analysis,

LA: Logistic regression,

CART: Classiﬁcation and regression trees,

SPR: Semiparametric regression,

NN: Neural networks.


chapter 3

STATISTICAL METHODS IN CREDIT SCORING

3.1 Introduction

A credit institute faces the prejudgement problem of measuring its customer firms' creditworthiness. For this reason, the credit institute primarily collects information about its customer firms' measurable features, e.g., the age of the firm, the (current assets) / (current liabilities) ratio, and so on. Let $X_i$ represent the $i$th feature of the customer firm: $X_1$ may be the age of the firm, $X_2$ may be the (current assets) / (current liabilities) ratio, and so on. Then, each customer firm can be described by a tuple of $p$ random variables, namely, by the vector $X = (X_1, X_2, \ldots, X_p)$, which indicates all of a firm's characteristic properties and its market and internal performance features. Let the actual values of the variables for a particular customer firm be $x = (x_1, x_2, \ldots, x_p) \in \mathcal{X} \subseteq \mathbb{R}^n$. Furthermore, let any different possible value $x_i$ of the variable $X_i$ be called an attribute of that feature.

Let us call the space of $X$ the input space and denote it by $\Omega$, since each customer is represented as a point in this space.

According to the records of the credit institute, there are two types of firms in the market: default firms (D) and non-default firms (ND). Here, a default firm is a customer firm that did not fulfill its obligations in the past, and a non-default firm is a customer firm that fulfilled its obligations in the past. Moreover, the space of all possible outcomes, which has only the two elements D and ND, is called the output space $Y$.

Moreover, let $\delta_{ND}$ be the prior class probability of "non-default firms", and let $\delta_D$ stand for the prior class probability of "default firms".

Our objective is to find the best scoring procedure $\Omega \rightarrow Y$ which splits the space $\Omega$ into two subspaces, $\Omega_{ND}$ and $\Omega_D$, so that a new customer firm whose indicator vector belongs to the set $\Omega_{ND}$ is classified as a "non-default firm" and one whose indicator vector belongs to the set $\Omega_D$ as a "default firm" [HKGB99], [CET02].

Here, a brief review of the credit scoring methods most commonly used in the literature will be given. The above notation is kept throughout this chapter.

3.2 Discriminant Analysis

Discriminant analysis is a standard tool for classification. It is based on maximizing the between-group variance relative to the within-group variance.

3.2.1 Decision Theory Approach

According to the discrete or continuous character of the probability distributions, we refer to the discrete or the continuous case in the following.

Discrete Case

Assume that the feature vector of the companies asking for credit has a finite number of discrete attributes, so that $\Omega$ is finite and there is only a finite number of different attribute vectors $x$.

Let $p(x|ND)$ represent the probability that a non-default firm has the attribute vector $x$. Similarly, $p(x|D)$ represents the probability that a default firm has the attribute vector $x$. These conditional probabilities can be written as
$$p(x|ND) = \frac{p(\text{firm is a non-default firm and has an indicator vector } x)}{p(\text{firm is a non-default firm})} \quad (3.2.1)$$
and
$$p(x|D) = \frac{p(\text{firm is a default firm and has an indicator vector } x)}{p(\text{firm is a default firm})}. \quad (3.2.2)$$

Since these conditional probabilities cannot be observed directly in a market, they will be obtained by using the Bayes rule and directly observable probabilities. To apply the Bayes rule, let us define $p(ND|x)$ as the probability that a company with attribute vector $x$ is a non-default company, and $p(D|x)$ as the probability that a company with attribute vector $x$ is a default company. Then,
$$p(ND|x) = \frac{p(\text{firm is a non-default firm and has an indicator vector } x)}{p(\text{firm has an indicator vector } x)}, \quad (3.2.3)$$
$$p(D|x) = \frac{p(\text{firm is a default firm and has an indicator vector } x)}{p(\text{firm has an indicator vector } x)}. \quad (3.2.4)$$

Let $\gamma(x) := p(\text{firm has an indicator vector } x)$; then (3.2.1), (3.2.3), (3.2.2) and (3.2.4), respectively, can be put together in the following formulae:
$$p(\text{firm is a non-default firm and has an indicator vector } x) = p(ND|x)\,\gamma(x) = p(x|ND)\,\delta_{ND}$$
and
$$p(\text{firm is a default firm and has an indicator vector } x) = p(D|x)\,\gamma(x) = p(x|D)\,\delta_D.$$
Then, by the Bayes rule,
$$p(ND|x) = \frac{p(x|ND)\,\delta_{ND}}{\gamma(x)} \quad (3.2.5)$$
and
$$p(D|x) = \frac{p(x|D)\,\delta_D}{\gamma(x)}. \quad (3.2.6)$$

Suppose a credit institute loses the amount $c(ND|D)$ of money for each firm it classifies as non-default although the firm is a default firm, and loses the amount $c(D|ND)$ per firm it classifies as default although the firm is a non-default firm. These misclassification costs are summarized in Table 3.1.

                     Classification Result
                     non-default      default
true   non-default       0            c(D|ND)
value  default        c(ND|D)            0

Table 3.1: Misclassification costs.

Furthermore, the probability of misclassifying a default firm with indicator vector $x$ into $\Omega_{ND}$ is
$$p(\text{firm is misclassified as non-default}) = p(x|D)\,\delta_D, \quad (3.2.7)$$
and the probability of misclassifying a non-default firm with indicator vector $x$ into $\Omega_D$ is
$$p(\text{firm is misclassified as default}) = p(x|ND)\,\delta_{ND}. \quad (3.2.8)$$

Then, the expected cost of misclassifying firms, if the firms with attributes belonging to the set $\Omega_{ND}$ are accepted and the firms with attributes belonging to the set $\Omega_D$ are refused, is
$$c(D|ND)\sum_{x\in\Omega_D} p(x|ND)\,\delta_{ND} + c(ND|D)\sum_{x\in\Omega_{ND}} p(x|D)\,\delta_D = c(D|ND)\sum_{x\in\Omega_D} p(ND|x)\,\gamma(x) + c(ND|D)\sum_{x\in\Omega_{ND}} p(D|x)\,\gamma(x). \quad (3.2.9)$$

At this point, the decision rule that minimizes this expected cost is clear. Let us consider the cost of classifying a firm with $x = (x_1, x_2, \ldots, x_p)$. If the firm is put into $\Omega_{ND}$, a cost arises only if it is in fact a default firm; the expected cost is $c(ND|D)\,p(x|D)\,\delta_D$. If the firm is put into $\Omega_D$, a cost arises only if it is in fact a non-default firm; the expected cost is
$$c(D|ND)\,p(x|ND)\,\delta_{ND}. \quad (3.2.10)$$

Therefore, $x$ can be classified into $\Omega_{ND}$ if
$$c(ND|D)\,p(x|D)\,\delta_D \le c(D|ND)\,p(x|ND)\,\delta_{ND} \quad (3.2.11)$$
is satisfied. For this reason, the decision rule that minimizes the expected costs is
$$\Omega_{ND} = \{x \mid c(ND|D)\,p(x|D)\,\delta_D \le c(D|ND)\,p(x|ND)\,\delta_{ND}\}
= \left\{x \,\middle|\, \frac{c(ND|D)}{c(D|ND)} \le \frac{p(x|ND)\,\delta_{ND}}{p(x|D)\,\delta_D}\right\}
= \left\{x \,\middle|\, \frac{c(ND|D)}{c(D|ND)} \le \frac{p(ND|x)}{p(D|x)}\right\}. \quad (3.2.12)$$
That is, we classify a firm as a non-default firm if the above condition is satisfied; otherwise, we classify the firm as a default firm.
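The cost-minimizing rule above can be sketched directly in Python. This is a minimal illustration, not an implementation from the thesis: the attribute space, class-conditional probabilities, priors and costs below are all made-up numbers.

```python
# Decision rule of (3.2.12): classify a firm with attribute vector x as
# non-default iff  c(ND|D) * p(x|D) * delta_D  <=  c(D|ND) * p(x|ND) * delta_ND.

def classify(x, p_x_given_nd, p_x_given_d, delta_nd, delta_d, c_nd_d, c_d_nd):
    """Return 'ND' when accepting the firm has the smaller expected cost."""
    cost_accept = c_nd_d * p_x_given_d[x] * delta_d    # loss if x is really D
    cost_refuse = c_d_nd * p_x_given_nd[x] * delta_nd  # loss if x is really ND
    return "ND" if cost_accept <= cost_refuse else "D"

# Hypothetical binary attribute space {0, 1} with illustrative probabilities.
p_x_nd = {0: 0.8, 1: 0.2}   # p(x|ND)
p_x_d = {0: 0.3, 1: 0.7}    # p(x|D)
decision = classify(1, p_x_nd, p_x_d, delta_nd=0.9, delta_d=0.1,
                    c_nd_d=5.0, c_d_nd=1.0)
```

Note how the asymmetric cost $c(ND|D)$ shifts the boundary: a high cost of accepting defaulters makes the rule refuse firms whose attribute is more likely under $p(x|D)$.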

Continuous Case

Let us now assume that the indicator vector has a finite number of continuous-type attributes. The same procedure as in the discrete case is applied to the continuous case; the only difference is that the conditional probability mass functions $p(x|ND)$ and $p(x|D)$ are replaced by the continuous probability density functions $f(x|ND)$ and $f(x|D)$. Then, the expected cost of misclassifying firms, if the firms with attributes belonging to the set $\Omega_{ND}$ are accepted and the firms with attributes belonging to the set $\Omega_D$ are refused, becomes

$$c(D|ND)\int_{\Omega_D} f(x|ND)\,\delta_{ND}\,dx + c(ND|D)\int_{\Omega_{ND}} f(x|D)\,\delta_D\,dx \quad (3.2.13)$$
and the decision rule minimizing this expected cost becomes
$$\Omega_{ND} = \{x \mid c(ND|D)\,f(x|D)\,\delta_D \le c(D|ND)\,f(x|ND)\,\delta_{ND}\}
= \left\{x \,\middle|\, \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}} \le \frac{f(x|ND)}{f(x|D)}\right\}. \quad (3.2.14)$$

Figure 3.1: Misclassification errors [JW97]. (The figure shows the densities $f(x|ND)$ and $f(x|D)$ and the overlap regions corresponding to misclassifying a non-default firm as default and a default firm as non-default.)

That is, we classify a firm as a non-default firm if the above condition is satisfied; otherwise, we classify the firm as a default firm.

In the literature, the normal distribution is the most widely used distribution for discriminant analysis. When using a normal probability density function, three cases can be considered:

A. Univariate Normal Feature Variable x

This is the simplest case of discriminant analysis, in which firms are classified with respect to only one feature variable. Let us say $X$ is the (current assets) / (current liabilities) ratio, and let it follow a normal distribution with mean $\mu_{ND}$ and variance $\sigma^2$ among the non-default firms, and a normal distribution with mean $\mu_D$ and variance $\sigma^2$ among the default firms. Then, the probability density functions can be written as follows:

$$f(x|ND) = (2\pi\sigma^2)^{-\frac{1}{2}}\exp\!\left(\frac{-(x-\mu_{ND})^2}{2\sigma^2}\right) \quad (3.2.15)$$
for non-default firms and
$$f(x|D) = (2\pi\sigma^2)^{-\frac{1}{2}}\exp\!\left(\frac{-(x-\mu_D)^2}{2\sigma^2}\right) \quad (3.2.16)$$
for default firms.

Then, according to the continuous case results, the decision rule minimizing the expected cost in the univariate normal case becomes
$$\Omega_{ND} = \left\{x \,\middle|\, \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}} \le \frac{f(x|ND)}{f(x|D)}\right\}
= \left\{x \,\middle|\, \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}} \le \frac{(2\pi\sigma^2)^{-\frac{1}{2}}\exp\!\left(\frac{-(x-\mu_{ND})^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{-\frac{1}{2}}\exp\!\left(\frac{-(x-\mu_D)^2}{2\sigma^2}\right)}\right\}$$
$$= \left\{x \,\middle|\, \exp\!\left(\frac{-(x-\mu_{ND})^2 + (x-\mu_D)^2}{2\sigma^2}\right) \ge \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}}\right\}
= \left\{x \,\middle|\, x\,(\mu_{ND}-\mu_D) \ge \frac{\mu_{ND}^2 - \mu_D^2}{2} + \sigma^2\log\frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}}\right\}. \quad (3.2.17)$$
That means, in this continuous case, we classify a firm as a non-default firm if the above condition is satisfied; otherwise, we classify the firm as a default firm [CET02].
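For $\mu_{ND} > \mu_D$, the inequality in (3.2.17) reduces to a single cut-off value of $x$. A small Python sketch under that assumption; all parameter values are hypothetical illustration numbers, not estimates from data in the thesis:

```python
import math

# Cut-off implied by (3.2.17): classify as non-default iff
#   x * (mu_nd - mu_d) >= (mu_nd^2 - mu_d^2)/2 + sigma^2 * log(k),
# with k = c(ND|D)*delta_D / (c(D|ND)*delta_ND).  Assumes mu_nd > mu_d.

def univariate_threshold(mu_nd, mu_d, sigma2, delta_nd, delta_d, c_nd_d, c_d_nd):
    k = (c_nd_d * delta_d) / (c_d_nd * delta_nd)
    rhs = (mu_nd ** 2 - mu_d ** 2) / 2.0 + sigma2 * math.log(k)
    return rhs / (mu_nd - mu_d)

t = univariate_threshold(mu_nd=2.0, mu_d=1.0, sigma2=1.0,
                         delta_nd=0.5, delta_d=0.5, c_nd_d=1.0, c_d_nd=1.0)
# With equal priors and equal costs, log(k) = 0 and the cut-off is the
# midpoint of the two means.
```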

B. Multivariate Normal Feature Vector x with Common Covariance Matrix Σ

In this case, the firms are classified according to the $p$-variate feature vector $x = (x_1, x_2, \ldots, x_p) \in \mathcal{X} \subseteq \mathbb{R}^n$. It is assumed that $x$ follows a normal distribution with mean $\mu_{ND}$ and variance-covariance matrix $\Sigma_{ND}$ among the non-default firms, and a normal distribution with mean $\mu_D$ and variance-covariance matrix $\Sigma_D$ among the default firms. The probability density function is

$$f(x|ND) = (2\pi)^{-\frac{p}{2}}(\det\Sigma_{ND})^{-\frac{1}{2}}\exp\!\left(-\frac{(x-\mu_{ND})\,\Sigma_{ND}^{-1}\,(x-\mu_{ND})^T}{2}\right). \quad (3.2.18)$$

Similarly to case A, our decision rule becomes
$$\Omega_{ND} = \left\{x \,\middle|\, \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}} \le \frac{f(x|ND)}{f(x|D)}\right\}
= \left\{x \,\middle|\, \left(\frac{\det\Sigma_D}{\det\Sigma_{ND}}\right)^{\frac{1}{2}}\exp\!\left(-\frac{1}{2}\Big((x-\mu_{ND})\Sigma_{ND}^{-1}(x-\mu_{ND})^T - (x-\mu_D)\Sigma_D^{-1}(x-\mu_D)^T\Big)\right) \ge \frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}}\right\}$$
$$= \left\{x \,\middle|\, x\,(\Sigma_D^{-1}-\Sigma_{ND}^{-1})\,x^T + 2x\,(\Sigma_{ND}^{-1}\mu_{ND}^T - \Sigma_D^{-1}\mu_D^T) \ge \left(\mu_{ND}\Sigma_{ND}^{-1}\mu_{ND}^T - \mu_D\Sigma_D^{-1}\mu_D^T\right) + \log\frac{\det\Sigma_{ND}}{\det\Sigma_D} + 2\log\frac{c(ND|D)\,\delta_D}{c(D|ND)\,\delta_{ND}}\right\}. \quad (3.2.19)$$

3.2.2 Functional Form Approach

This approach is also known as Fisher's discriminant function analysis after Fisher, 1936 [JW97]. In his work, he tried to fit a linear discriminant function of the feature variables that best splits the set into two subsets (see Figure 3.2 for an impression). Fisher's discriminant function is a linear combination of the feature variables.

Let $Y = w_1X_1 + w_2X_2 + \ldots + w_pX_p$ be any linear combination of the credit performance measures $X = (X_1, X_2, \ldots, X_p)$. Like analysis of variance (ANOVA), Fisher's discriminant function analysis uses the difference of the mean values of $Y$ in the two subspaces, the space of default firms and the space of non-default firms, as a splitting criterion. In this analysis, therefore, the weights of the $X_i$ ($i = 1, 2, \ldots, p$) are chosen such that they maximize the distance between the sample means of default

Figure 3.2: Fisher's linear discriminant analysis. (The figure plots the two groups of firms in the $(x_1, x_2)$-plane, separated by a linear discriminant.)

Fisher's principle is an optimization problem which looks as follows:
$$\max_w\; J(w) = \frac{w\,(m_{ND} - m_D)^T}{(wSw^T)^{1/2}}, \quad (3.2.20)$$
where $m_{ND}$ and $m_D$ are the sample mean vectors for non-default and default companies, respectively, and $S$ is the common variance-covariance matrix.

Differentiating (3.2.20) with respect to $w$ and setting it to $0$ gives the following equation:
$$\frac{(m_{ND} - m_D)^T}{(wSw^T)^{1/2}} - \frac{\left(w(m_{ND} - m_D)^T\right)(Sw^T)}{(wSw^T)^{3/2}} = 0, \quad (3.2.21)$$

hence
$$(m_{ND} - m_D)^T\,(wSw^T) = (Sw^T)\left(w(m_{ND} - m_D)^T\right). \quad (3.2.22)$$

Since $\dfrac{wSw^T}{w(m_{ND} - m_D)^T}$ is a scalar constant, (3.2.22) results in
$$w^T \propto S^{-1}(m_{ND} - m_D)^T. \quad (3.2.23)$$

For further details, see [CET02], [B04], [JW97].
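The proportionality $w^T \propto S^{-1}(m_{ND}-m_D)^T$ can be computed directly. A minimal sketch for $p = 2$ in plain Python; the sample means and pooled covariance matrix below are hypothetical illustration values:

```python
# Fisher's discriminant direction per (3.2.23), for two features.

def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def fisher_direction(m_nd, m_d, S):
    """Return w proportional to S^{-1} (m_nd - m_d)^T."""
    Sinv = inv2(S)
    d = [m_nd[0] - m_d[0], m_nd[1] - m_d[1]]
    return [Sinv[0][0] * d[0] + Sinv[0][1] * d[1],
            Sinv[1][0] * d[0] + Sinv[1][1] * d[1]]

w = fisher_direction(m_nd=[2.0, 1.0], m_d=[0.0, 1.0], S=[[1.0, 0.0], [0.0, 4.0]])
# With a diagonal S, each component of the mean difference is simply
# rescaled by the inverse of its variance.
```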

3.2.3 Advantages and Disadvantages of Discriminant Analysis

Advantages:

Discriminant analysis has the following advantages:

• dichotomous response variable,

• easy to calculate,

• yields the input needed for an immediate decision, and

• reduced error rates.

Disadvantages:

Discriminant analysis has the following disadvantages:

• normality assumption on variables,

• approximately equal variances in each group,

• assumption on equivalent correlation patterns for groups,


• problem of multi-collinearity, and

• sensitivity to the outliers.


3.3 Linear Regression

3.3.1 Introduction

Linear regression is a statistical technique for investigating and modelling the linear relationships between variables. The probability of default is defined by the following form of linear regression:
$$w_0 + w_1X_1 + w_2X_2 + \ldots + w_pX_p = w^*X^{*T}, \quad (3.3.24)$$
where $w^* = (w_0, w_1, w_2, \ldots, w_p)$ and $X^* = (X_0, X_1, X_2, \ldots, X_p)$ with $X_0 \equiv 1$.

Suppose $p(x_i)$ defines the probability of default for the $i$th individual company; then,
$$p(x_i) = w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip} + \varepsilon_i. \quad (3.3.25)$$

Linear regression rests on some primary assumptions, namely:

• The relationship between probability of default and explanatory variables

is linear, or at least it is well approximated by a straight line.

• The error term ε has zero mean.

• The error term ε has constant variance.


• The errors are uncorrelated.

• The errors are normally distributed.

Suppose $n_D$ of the companies in the training set are default companies and $n_{ND}$ are non-default companies, with $n = n_D + n_{ND}$. We denote the default and non-default companies by "1" and "0", respectively. That is, $p(x_i) = 1$ for $i = 1, 2, \ldots, n_D$, and $p(x_i) = 0$ for $i = n_D + 1, n_D + 2, \ldots, n_D + n_{ND}$.

Then, our aim is to find the best set of weights, i.e., the one which satisfies
$$\min_w \sum_{i=1}^n \varepsilon_i^2. \quad (3.3.26)$$

By (3.3.25), this amounts to minimizing
$$\sum_{i=1}^{n_D}\left(1 - \sum_{j=0}^p w_jx_{ij}\right)^2 + \sum_{i=n_D+1}^{n_D+n_{ND}}\left(\sum_{j=0}^p w_jx_{ij}\right)^2, \quad (3.3.27)$$
where $x_{i0} = 1$ for $i = 1, 2, \ldots, n$.

In matrix notation,
$$\begin{pmatrix} \mathbf{1}_D^T & X_D \\ \mathbf{1}_{ND}^T & X_{ND} \end{pmatrix}\begin{pmatrix} w_0 \\ w^T \end{pmatrix} = \begin{pmatrix} \mathbf{1}_D^T \\ \mathbf{0}^T \end{pmatrix} \quad (3.3.28)$$
or
$$Xw^T = y^T. \quad (3.3.29)$$

Here,
$$X := \begin{pmatrix} \mathbf{1}_D^T & X_D \\ \mathbf{1}_{ND}^T & X_{ND} \end{pmatrix}$$
is an $(n \times (p+1))$-matrix,
$$X_D = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & & \vdots \\ x_{n_D 1} & \cdots & x_{n_D p} \end{pmatrix}$$
is an $(n_D \times p)$-matrix,
$$X_{ND} = \begin{pmatrix} x_{n_D+1,1} & \cdots & x_{n_D+1,p} \\ \vdots & & \vdots \\ x_{n_D+n_{ND},1} & \cdots & x_{n_D+n_{ND},p} \end{pmatrix}$$
is an $(n_{ND} \times p)$-matrix, and
$$y^T = \begin{pmatrix} \mathbf{1}_D^T \\ \mathbf{0}^T \end{pmatrix}.$$
Moreover, $\mathbf{1}_D$ and $\mathbf{1}_{ND}$ are the $(1 \times n_D)$- and $(1 \times n_{ND})$-vectors with all entries equal to 1, respectively.

Then, in matrix notation, our problem in equation (3.3.27) becomes
$$\min_w\; (Xw^T - y^T)^T(Xw^T - y^T). \quad (3.3.30)$$
To treat the problem (3.3.30), we set its derivative with respect to $w$ equal to $0$, i.e.,
$$X^T(Xw^T - y^T) = 0 \quad \text{or} \quad X^TXw^T = X^Ty^T. \quad (3.3.31)$$

These equations are called the normal equations. To solve them we study the system
$$\begin{pmatrix} \mathbf{1}_D & \mathbf{1}_{ND} \\ X_D^T & X_{ND}^T \end{pmatrix}\begin{pmatrix} \mathbf{1}_D^T & X_D \\ \mathbf{1}_{ND}^T & X_{ND} \end{pmatrix}\begin{pmatrix} w_0 \\ w^T \end{pmatrix} = \begin{pmatrix} \mathbf{1}_D & \mathbf{1}_{ND} \\ X_D^T & X_{ND}^T \end{pmatrix}\begin{pmatrix} \mathbf{1}_D^T \\ \mathbf{0}^T \end{pmatrix} \quad (3.3.32)$$

and
$$\begin{pmatrix} n & n_D\mu_D + n_{ND}\mu_{ND} \\ n_D\mu_D^T + n_{ND}\mu_{ND}^T & X_D^TX_D + X_{ND}^TX_{ND} \end{pmatrix}\begin{pmatrix} w_0 \\ w^T \end{pmatrix} = \begin{pmatrix} n_D \\ n_D\mu_D^T \end{pmatrix}, \quad (3.3.33)$$
where $\mu_D$ and $\mu_{ND}$ indicate the mean (row) vectors of the explanatory variables for default and non-default companies, respectively.

Let us treat the learning-set expectations as the actual expectations and assume that the model assumptions hold. Then, we get
$$X_D^TX_D + X_{ND}^TX_{ND} = n\,\mathbb{E}[X_iX_j] = n\,\mathrm{Cov}(X_i, X_j) + n_D\,\mu_D^T\mu_D + n_{ND}\,\mu_{ND}^T\mu_{ND}. \quad (3.3.34)$$

Let $C$ denote the learning sample covariance matrix. Now, (3.3.34) becomes
$$X_D^TX_D + X_{ND}^TX_{ND} = nC + n_D\,\mu_D^T\mu_D + n_{ND}\,\mu_{ND}^T\mu_{ND}. \quad (3.3.35)$$

By using equation (3.3.35) in equation (3.3.33), we obtain
$$nw_0 + (n_D\mu_D + n_{ND}\mu_{ND})\,w^T = n_D, \quad (3.3.36)$$
$$(n_D\mu_D^T + n_{ND}\mu_{ND}^T)\,w_0 + \left(nC + n_D\,\mu_D^T\mu_D + n_{ND}\,\mu_{ND}^T\mu_{ND}\right)w^T = n_D\mu_D^T.$$

Then, substituting $w_0 = \left(n_D - (n_D\mu_D + n_{ND}\mu_{ND})\,w^T\right)/n$ from the first equation of (3.3.36) into the second one gives
$$(n_D\mu_D^T + n_{ND}\mu_{ND}^T)\,\frac{n_D - (n_D\mu_D + n_{ND}\mu_{ND})\,w^T}{n} + \left(n_D\,\mu_D^T\mu_D + n_{ND}\,\mu_{ND}^T\mu_{ND}\right)w^T + nCw^T = n_D\mu_D^T.$$
Hence,
$$\frac{n_Dn_{ND}}{n}\,(\mu_D - \mu_{ND})^T(\mu_D - \mu_{ND})\,w^T + nCw^T = \frac{n_Dn_{ND}}{n}\,(\mu_D - \mu_{ND})^T;$$
thus, since $(\mu_D - \mu_{ND})\,w^T$ is a scalar,
$$Cw^T = a\,(\mu_D - \mu_{ND})^T, \quad (3.3.37)$$
where $a$ is a constant. This relation (3.3.37) determines the optimal weight vector $w = (w_0, w_1, w_2, \ldots, w_p)$ up to scale, with $w_0$ recovered from the first equation of (3.3.36).
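For a single explanatory variable, the normal equations (3.3.31) collapse to the familiar closed-form slope and intercept. A minimal sketch in Python; the sample below is made up for illustration only:

```python
# Least-squares fit of the 0/1 default indicator on one explanatory variable,
# i.e. the p = 1 solution of the normal equations (3.3.31).

def fit_linear(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)                     # variation in x
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) # co-variation
    w1 = sxy / sxx
    w0 = ybar - w1 * xbar
    return w0, w1

# Hypothetical sample: defaulting firms (y = 1) tend to have a lower ratio x.
xs = [0.2, 0.4, 0.6, 0.8]
ys = [1, 1, 0, 0]
w0, w1 = fit_linear(xs, ys)
score = w0 + w1 * 0.3  # fitted "probability" of default; may leave [0, 1]
```

The last line illustrates the disadvantage noted in Section 3.3.2: nothing constrains the fitted value to the interval [0, 1].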

3.3.2 Advantages and Disadvantages of Regression

Advantages:

As very important advantages of regression, we note:

• The estimates of the unknown parameters obtained from linear least squares regression are optimal among a broad class of possible parameter estimates under the usual assumptions used for process modelling.

• It uses data very eﬃciently. Good results can be obtained with relatively

small data sets.

• The theory associated with linear regression is well understood and allows for the construction of different types of easily interpretable statistical intervals for predictions, calibrations, and optimizations.

Disadvantages:

As the disadvantages of regression we state:

• Outputs of regression can lie outside of the range [0,1].

• It has limitations in the shapes that linear models can assume over long

ranges.

• The extrapolation properties will be possibly poor.

• It is very sensitive to outliers.

• When its assumptions are violated, it no longer gives optimal estimates of the unknown parameters.


3.4 Probit Regression

3.4.1 Introduction

Probit regression is a tool for a dichotomous dependent variable. The term "probit" was first used in the 1930s by Chester Bliss and stands for "probability unit" [Probit].

Denote the probability of default as $p(x_i) = E(Y|x_i)$. Then, the probit model is defined as
$$p(x_i) = \Phi(x_iw) = \Phi(w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip}) \quad (i = 1, 2, \ldots, n), \quad (3.4.38)$$
where $\Phi$ is the standard normal cumulative distribution function and $w$ is the weight vector.

Here, $x_iw$ follows a normal distribution and $y_i$ follows a Bernoulli distribution. The estimation of the coefficients of a probit regression model can then be made with the help of maximum likelihood estimation (MLE). For MLE we should first write the likelihood function of the model. Let us denote the likelihood function by $L$; then,
$$L(w) = \prod_{i=1}^n p(x_i)^{y_i}\left(1 - p(x_i)\right)^{1-y_i}. \quad (3.4.39)$$

By putting (3.4.38) in (3.4.39), the following is obtained:
$$L(w) := \prod_{i=1}^n \Phi(x_iw)^{y_i}\left(1 - \Phi(x_iw)\right)^{1-y_i}. \quad (3.4.40)$$

In MLE, the logarithmic version of the likelihood function, i.e., the log-likelihood function, is used to make the calculations easy. The log-likelihood function is denoted by $l$ and given by
$$l(w) = \ln(L(w)) = \sum_{i=1}^n \left[y_i\ln(\Phi(x_iw)) + (1 - y_i)\ln(1 - \Phi(x_iw))\right]. \quad (3.4.41)$$

Maximum likelihood requires maximizing the log-likelihood function; taking its derivative with respect to $w$ and setting it to zero results in solving the following equations, where $\phi$ denotes the standard normal density:
$$\frac{\partial l(w)}{\partial w_0} = \sum_{i=1}^n \frac{\phi(x_iw)\left(y_i - \Phi(x_iw)\right)}{\Phi(x_iw)\left(1 - \Phi(x_iw)\right)} = 0, \quad (3.4.42)$$
$$\frac{\partial l(w)}{\partial w_j} = \sum_{i=1}^n x_{ij}\,\frac{\phi(x_iw)\left(y_i - \Phi(x_iw)\right)}{\Phi(x_iw)\left(1 - \Phi(x_iw)\right)} = 0 \quad (j = 1, 2, \ldots, p). \quad (3.4.43)$$
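Because these score equations are nonlinear in $w$, they are solved numerically. The following is a minimal sketch, not the thesis's estimation procedure: plain gradient ascent on the probit log-likelihood for one explanatory variable, with a made-up toy sample and an illustrative step size.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def fit_probit(xs, ys, steps=2000, lr=0.05):
    """Gradient ascent on the probit log-likelihood (one regressor)."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            z = w0 + w1 * x
            p = min(max(Phi(z), 1e-12), 1.0 - 1e-12)
            s = phi(z) * (y - p) / (p * (1.0 - p))  # score contribution
            g0 += s
            g1 += s * x
        w0 += lr * g0
        w1 += lr * g1
    return w0, w1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]            # hypothetical default indicator
w0, w1 = fit_probit(xs, ys)
```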

3.4.2 Advantages and Disadvantages of Probit Regression

Advantages:

• Scores are interpretable.

• Probabilities help for decisions.

• It is easy to compute.

Disadvantages:

• The underlying normality assumption may not hold.

• Over- or under-estimation problems can occur.

3.5 Logistic regression

3.5.1 Introduction

Logistic regression is a form of regression used when the dependent variable is binary or dichotomous and the independent variables are of any type. In any regression analysis, the main task is to find the expected value of the dependent variable given the known explanatory variables, i.e., $E(Y|x)$, where $Y$ and $x$ denote the dependent variable and the vector of explanatory variables, respectively [HL00].

Let us use the notation $p(x_i) = E(Y|x_i)$ for the probability of default of the $i$th individual company. Then, the form of the logistic regression model is
$$p(x_i) = G(x_i, w) = \frac{e^{w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip}}}{1 + e^{w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip}}} = \frac{e^{x_iw}}{1 + e^{x_iw}}. \quad (3.5.44)$$

Its logit transformation can be written as
$$\ln\frac{p(x_i)}{1 - p(x_i)} = w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip} + \varepsilon_i = x_iw + \varepsilon_i. \quad (3.5.45)$$

Here, the estimation of the coefficients of a logistic regression model is done with the help of maximum likelihood estimation (MLE). MLE states that the coefficients are estimated so that the likelihood function is maximized. In order to obtain a likelihood function, we should first introduce the probability mass function of $y_i$. Since $y_i$ follows a Bernoulli distribution, the probability mass function for the $i$th company can be written as
$$p(x_i)^{y_i}\left(1 - p(x_i)\right)^{1-y_i}. \quad (3.5.46)$$

If we assume that all observations are independently distributed, the likelihood function will be
$$L(w) := \prod_{i=1}^n p(x_i)^{y_i}\left(1 - p(x_i)\right)^{1-y_i}. \quad (3.5.47)$$

However, computations using (3.5.47) directly are difficult, so we use its logarithmic version, called the log-likelihood function and defined as
$$l(w) := \ln(L(w)) = \sum_{i=1}^n \left[y_i\ln(p(x_i)) + (1 - y_i)\ln(1 - p(x_i))\right]. \quad (3.5.48)$$

To find the values of the coefficients that maximize (3.5.48), we differentiate (3.5.48) with respect to $w$. This gives:
$$\frac{\partial l(w)}{\partial w_0} = \sum_{i=1}^n \left(y_i - p(x_i)\right) = 0, \quad (3.5.49)$$
$$\frac{\partial l(w)}{\partial w_j} = \sum_{i=1}^n x_{ij}\left(y_i - p(x_i)\right) = 0 \quad (j = 1, 2, \ldots, p). \quad (3.5.50)$$

Since these equations are nonlinear in $w$, it is not possible to solve them directly. Therefore, the solution is usually obtained with a well-known nonlinear optimization method such as the Gauss-Newton algorithm. For further information, please see [HL00].
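As a minimal illustration of solving (3.5.49)-(3.5.50) numerically — here by plain gradient ascent on the log-likelihood rather than Gauss-Newton — consider one explanatory variable with a made-up toy sample:

```python
import math

def logistic(z):
    """The logistic function of (3.5.44)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, steps=2000, lr=0.1):
    """Gradient ascent on the log-likelihood (3.5.48), one regressor."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            r = y - logistic(w0 + w1 * x)  # residual y_i - p(x_i)
            g0 += r
            g1 += r * x
        w0 += lr * g0                      # gradient of (3.5.49)
        w1 += lr * g1                      # gradient of (3.5.50)
    return w0, w1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]                       # hypothetical default indicator
w0, w1 = fit_logit(xs, ys)
# At the optimum the residuals of (3.5.49) sum to approximately zero.
```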

3.5.2 Advantages and Disadvantages of Logistic Regres-

sion

Advantages:

• Scores are interpretable in terms of log odds.

• The constructed probabilities have a chance of being meaningful.

• The posterior probability is modelled directly rather than as a ratio of two densities.

• It is a good default tool when appropriate, especially combined with feature creation and selection.

Disadvantages:

• It invites over-interpretation of some parameters.

For a closer look we refer to [M04].

3.6 Classiﬁcation and Regression Trees

Classification and regression trees (CART) constitute a nonparametric, pattern-recognition-based statistical classification technique [FAK85]. CART is a tool for analyzing both categorical and continuous dependent variables. At this point, we can split the CART methodology into two parts: for categorical dependent variables it carries the name classification tree, and for continuous dependent variables its name becomes regression tree [YW99]. However, in both cases, it produces a binary tree.

In fact, the CART algorithm divides the whole space into rectangular regions and assigns each rectangular region to one of the classes. A sample organization chart of the CART algorithm is shown in Figure 3.3.

Figure 3.3: A sample organization chart of classification and regression trees [BFOS84]. (The figure shows a tree with root node $t_1$, the binary questions $X_1 < 0.7$ and $X_2 < 0.5$, and the corresponding rectangular partition of the $(X_1, X_2)$-plane into the terminal regions $t_3$, $t_4$ and $t_5$.)

In our case, since we have a categorical dependent variable, we will only explain classification trees.

A classification tree has three types of nodes. The top node is called the root node and contains all the observations in the sample. The second type consists of the terminal nodes, i.e., the end nodes, each assigned to one of the classes. The remaining nodes are called non-terminal nodes.

The CART algorithm is a four-step classification procedure:


Step 1: simple binary questions of the CART algorithm,

Step 2: goodness of split criterion,

Step 3: the class assignment rule to the terminal nodes and resubstitution estimates,

Step 4: selecting the correct complexity of a tree.

To understand the CART algorithm better, we first need to define some probabilities. The works [BFOS84], [YW99], [FAK85] give the following expressions.

Let $N$ be the total learning sample size, and let $N_{ND}$ be the number of non-default firms in our learning sample and $N_D$ the number of default firms. Suppose CART has the nodes $t = 1, 2, \ldots, T$. In node $t$, there are $N(t)$ observations. Let $N_{ND}(t)$ represent the number of non-default firms in node $t$ and, similarly, $N_D(t)$ the number of default firms.

Let $p(ND, t)$ be the probability that a firm is a non-default firm and falls into node $t$; similarly, let $p(D, t)$ be the probability that a firm is a default firm and falls into node $t$. We can write these probabilities as
$$p(ND, t) = \delta_{ND}\,\frac{N_{ND}(t)}{N_{ND}}, \quad (3.6.51)$$
$$p(D, t) = \delta_D\,\frac{N_D(t)}{N_D}, \quad (3.6.52)$$

where $\delta_{ND}$ and $\delta_D$ stand for the prior class probabilities. Moreover, the probability that an observation falls into node $t$ is
$$p(t) = \sum_{j \in \{ND,\,D\}} p(j, t). \quad (3.6.53)$$

38

By Bayes' rule, the probability that a firm in node t is a non-default firm is denoted by p(ND|t) and equals

p(ND|t) = p(ND, t)/p(t),   (3.6.54)

and the probability that a firm in node t is a default firm is

p(D|t) = p(D, t)/p(t).   (3.6.55)

The probabilities (3.6.54) and (3.6.55) sum to one:

Σ_{j ∈ {ND, D}} p(j|t) = 1.   (3.6.56)

With the help of Bayes' rule, we can obtain the probabilities p(t|ND) and p(t|D) from equations (3.6.51) to (3.6.56). Then,

p(t|ND) = p(ND, t)/δ_ND = p(ND|t)p(t)/δ_ND,   (3.6.57)

p(t|D) = p(D, t)/δ_D = p(D|t)p(t)/δ_D.   (3.6.58)

3.6.1 Simple Binary Questions of CART Algorithm

The tree-growing procedure of the CART algorithm is based on binary questions of the type {Is x_i ≤ c?} for numerical variables and {Is x_i = d?} for categorical variables, where x_i is any variable in the feature vector x = (x_1, x_2, ..., x_p) ∈ X ⊆ ℝ^p. The CART algorithm puts all observations into the root node and then, with the help of these simple questions, searches for the best split in order to divide the root node into binary nodes.

In CART, each split is based on only one variable. For this purpose, at each node, the algorithm tries all the variables x_1, x_2, ..., x_p. For each variable, it searches for the best split. It then selects the best split point and variable among these best splits for the particular node.

As an example, following [BFOS84], [JW97], let x_5 be a categorical variable with 6 categories 1, 2, ..., 6. Then our set of questions will be: Is x_5 = 1? ... Is x_5 = 6? So CART must search 2^(6−1) − 1 = 31 different splits to find the best split for this single variable. For a numeric variable, say x_6 with range [12, 34], the possible questions are: Is x_6 ≤ 13? ... Is x_6 ≤ 34? Herewith, for a categorical variable with k categories (questions), the number of searched splits is 2^(k−1) − 1.
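The count 2^(k−1) − 1 is the number of distinct ways to send a non-empty proper subset of the k categories to the left node, treating a split and its mirror image as the same. A small sketch (our own illustration, not from the thesis) verifying this for the example above:

```python
from itertools import combinations

def candidate_splits(categories):
    """Enumerate the distinct binary partitions {S, complement} of a
    categorical variable's levels, as CART's questions induce them."""
    cats = list(categories)
    splits = set()
    for r in range(1, len(cats)):
        for subset in combinations(cats, r):
            left = frozenset(subset)
            right = frozenset(cats) - left
            # {left, right} and {right, left} describe the same split
            splits.add(frozenset({left, right}))
    return splits

splits = candidate_splits(range(1, 7))  # 6 categories
print(len(splits))  # 2**(6-1) - 1 = 31
```

Enumerating subsets of every size and deduplicating mirrored pairs reproduces the 2^(k−1) − 1 count directly.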

3.6.2 Goodness of Split Criterion

Introduction

The goodness of split criterion is applied to each candidate split point at a node in order to select the best split point for each variable and then for the node. The goodness of split criterion is an index based on impurity functions.

Definition 3.1. [BFOS84] An impurity function is a function φ defined on the set of J classes with prior probability vector δ = (δ_1, δ_2, ..., δ_J) satisfying δ_j ≥ 0 (j = 1, 2, ..., J), Σ_{j=1}^{J} δ_j = 1, with the following three properties:

(i) φ has a maximum only at the point δ = (1/J, ..., 1/J),

(ii) φ achieves its minimum only at the points δ = (1, 0, ..., 0), δ = (0, 1, 0, ..., 0), ..., δ = (0, 0, ..., 0, 1),

(iii) φ is a symmetric function of δ_1, δ_2, ..., δ_J.

Definition 3.2. [BFOS84] Given an impurity function φ, define the impurity measure i(t) of any node t as

i(t) := φ(p(1|t), p(2|t), ..., p(J|t)).

For a candidate split point, CART divides the sample into left (l) and right (r) nodes, then applies a split criterion and evaluates the split's goodness.

In the credit scoring literature, six impurity measures are most commonly used: the basic impurity index, the Gini index, the Kolmogorov-Smirnov statistic, the twoing index, the entropy index and the maximize half-sum of squares index.

Basic Impurity Index

The basic impurity index is based only on the left (l) and right (r) node probabilities p(t_l) and p(t_r) (see (3.6.53)) and impurities i(t_l) and i(t_r).

The change in the impurity at node t is defined as follows:

∆i(t, s) = i(t) − p(t_l)i(t_l) − p(t_r)i(t_r),   (3.6.59)

where s denotes the split and

i(j) = p(ND|j) if p(ND|j) ≤ 0.5,   (3.6.60)

i(j) = p(D|j) if p(D|j) < 0.5.   (3.6.61)

The greater this change in impurity, the purer the resulting nodes. That means we must select the split which maximizes this function [BFOS84], [CET02].

Gini Index

The Gini index is a quadratic impurity measure. The change in the impurity at node t is

∆i(t, s) := i(t) − p(t_l)i(t_l) − p(t_r)i(t_r),   (3.6.62)

where

i(j) := p(ND|j)p(D|j).   (3.6.63)

We refer to [CET02], [BFOS84].

Kolmogorov-Smirnov Statistics

This impurity measure is based on the idea of maximizing the distance between the probability distributions of non-default firms and default firms for a node. Using (3.6.57) and (3.6.58), the Kolmogorov-Smirnov statistic is defined by

KS(s) := |p(t_l|D) − p(t_l|ND)| = p(t_l) |p(D|t_l)/δ_D − p(ND|t_l)/δ_ND|.   (3.6.64)

This implies that we select the split s which maximizes the KS statistic [CET02].

Twoing Index

According to the twoing index, select the split s which maximizes the following measure [BFOS84]:

∆i(t, s) := (p(t_l)p(t_r)/4) [ Σ_{j ∈ {ND, D}} |p(j|t_l) − p(j|t_r)| ]².   (3.6.65)

Entropy Index

The entropy index, like the Gini criterion, is a nonlinear measure of impurity. The rule of the entropy index is to

max ∆i(t, s) := i(t) − p(t_l)i(t_l) − p(t_r)i(t_r),   (3.6.66)

where

i(j) := −p(ND|j) ln(p(ND|j)) − p(D|j) ln(p(D|j)).   (3.6.67)

Here, we refer to [CET02].

Maximize Half-Sum of Squares

Here, the index is

∆i(t, s) := n(t_l) n(t_r) ( p(ND|t_l) − p(ND|t_r) )² / ( n(t_l) + n(t_r) ),   (3.6.68)

where n(t_l) and n(t_r) are the total numbers of observations in the left and right nodes [CET02].
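To make the split criteria concrete, the following sketch evaluates the impurity decrease (3.6.62) for the Gini and entropy measures on a hypothetical two-class node, using node proportions in place of the probabilities (i.e., equal priors); the counts are invented for illustration:

```python
import math

def gini(p_nd):
    # i(j) = p(ND|j) p(D|j), eq. (3.6.63)
    return p_nd * (1.0 - p_nd)

def entropy(p_nd):
    # eq. (3.6.67), with the convention 0 * ln(0) := 0
    out = 0.0
    for p in (p_nd, 1.0 - p_nd):
        if p > 0.0:
            out -= p * math.log(p)
    return out

def impurity_decrease(n_nd, n_d, n_nd_l, n_d_l, imp):
    """Delta i(t, s) = i(t) - p(t_l) i(t_l) - p(t_r) i(t_r),
    as in (3.6.59)/(3.6.62)/(3.6.66)."""
    n = n_nd + n_d
    n_l = n_nd_l + n_d_l
    n_r = n - n_l
    i_t = imp(n_nd / n)
    i_l = imp(n_nd_l / n_l)
    i_r = imp((n_nd - n_nd_l) / n_r)
    return i_t - (n_l / n) * i_l - (n_r / n) * i_r

# A split isolating most defaults into the right node:
print(impurity_decrease(80, 20, 75, 5, gini))     # positive decrease
print(impurity_decrease(80, 20, 75, 5, entropy))  # positive decrease
```

Among candidate splits, CART would keep the one with the largest decrease under the chosen measure.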

3.6.3 The Class Assignment Rule to the Terminal Nodes

and Resubstitution Estimates

After selecting the variables with the best splits for each node, a tree (let us say T_max) will have been constructed. Let T̄ denote the set of terminal nodes.

Definition 3.3. [BFOS84] A class assignment rule assigns a class j ∈ {1, 2, ..., J} to every terminal node t ∈ T̄. The class assigned to node t ∈ T̄ is denoted by j(t).

The class assignment rule requires minimizing the cost of misclassifying a node into one of the classes. In CART analysis, the observed expected cost of misclassification of each assignment is called the resubstitution risk [FAK85]. Let us denote the resubstitution risk of assigning a terminal node t ∈ T̄ to the j-th class by R_j(t).

Suppose that the misclassification cost of classifying a default firm as non-default is c(ND|D) ≥ 0 and the misclassification cost of classifying a non-default firm as default is c(D|ND) ≥ 0.

Then, the resubstitution risk of classifying an observation which falls into the terminal node t as a non-default firm is

R_ND(t) = c(ND|D)p(D, t)   (3.6.69)
        = c(ND|D)p(t|D)δ_D   (3.6.70)
        = c(ND|D)δ_D N_D(t)/N_D   (3.6.71)

and, similarly,

R_D(t) = c(D|ND)δ_ND N_ND(t)/N_ND.   (3.6.72)

Then, our class assignment rule j(t) will be

j(t) = ND if R_D(t) ≥ R_ND(t), and j(t) = D otherwise.

This means: if the resubstitution risk of assigning node t to the class of non-default firms is greater than or equal to that of default firms, we assign node t to the class of default firms. Otherwise, we assign node t to the class of non-default firms.
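The assignment rule can be sketched numerically; the node counts, priors and costs below are hypothetical:

```python
def resubstitution_risks(n_d_t, n_d, n_nd_t, n_nd,
                         prior_d, prior_nd, c_nd_d, c_d_nd):
    """Resubstitution risks (3.6.71)-(3.6.72) for a terminal node t."""
    r_nd = c_nd_d * prior_d * n_d_t / n_d     # risk of labelling t "non-default"
    r_d = c_d_nd * prior_nd * n_nd_t / n_nd   # risk of labelling t "default"
    return r_nd, r_d

def assign_class(r_nd, r_d):
    # j(t) = ND if R_D(t) >= R_ND(t), else D
    return "ND" if r_d >= r_nd else "D"

# Hypothetical node with 10 of 50 defaulters and 5 of 150 non-defaulters,
# equal priors, misclassifying a defaulter costing 5 times more:
r_nd, r_d = resubstitution_risks(10, 50, 5, 150, 0.5, 0.5, 5.0, 1.0)
print(r_nd, r_d, assign_class(r_nd, r_d))
```

With the asymmetric costs, the node is assigned to the default class even though it contains few defaulters in absolute terms.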


3.6.4 Selecting the Correct Complexity of a Tree

The initial tree T_max includes many splits and many terminal nodes, so it is a complex tree. This causes two problems: 1. Its resubstitution risk will always appear smaller, but such trees usually overfit the data, so we face poorer estimation results on new data sets. 2. Another problem is the interpretation of complex trees: the larger the tree, the more difficult it is to interpret.

To avoid these problems, we try to select the tree with the correct complexity and a small resubstitution risk.

In this part, we first consider all subtrees of T_max. The resubstitution risk of any tree T can be computed by the formula

R(T) = Σ_{t ∈ T̄} R(t).   (3.6.73)

Then, for each tree T, we compute the following complexity measure:

R(T) + K × (number of terminal nodes of T),   (3.6.74)

where K is a non-negative constant interpreted as a penalty for complex trees. Our decision rule is, for a given K, to select the optimal tree T_opt which minimizes the complexity measure (3.6.74).

If K = 0, the optimal tree is T_max. If K > 0, the optimal tree is some subtree of T_max. As K increases, the optimal tree's complexity decreases, but its resubstitution risk grows.

3.6.5 Advantages and Disadvantages of CART

Advantages:

• CART makes no distributional assumptions of any kind for dependent and

independent variables.

• The explanatory variables can be a mixture of categorical and continuous.

• It is not at all aﬀected by the outliers, collinearities, heteroskedasticity, or

distributional error structures that aﬀect parametric procedures. Outliers

are isolated into a node and thus have no eﬀect on splitting. Contrary to

situations in parametric modeling, CART makes use of collinear variables

in surrogate splits.

• CART has a built-in algorithm to deal with the missing values of a variable

for a case, except when a linear combination of variables is used as a splitting

rule.

• Furthermore, CART has the ability to detect and reveal variable interac-

tions in the data set.

• It deals eﬀectively with large data sets and the issues of higher dimension-

ality.

• It can handle noisy and incomplete data.

• CART is a user friendly method and it gives clear output.

• It is a simple procedure.


• The probability level or conﬁdence interval associated with predictions de-

rived from a CART tree could help to classify a new set of data.

Disadvantages:

• There can be errors in the speciﬁcation of prior probabilities and misclassi-

ﬁcation costs.

• CART does not vary under a monotone transformation of independent vari-

ables.

• In CART, the relative importance of the variables is unknown.

• It is a discrete scoring system.

For more detailed information, we refer to [BO04], [YW99].

3.7 Semi-Parametric Binary Classiﬁcation

Like non-parametric models, semi-parametric binary classification methods impose no strict assumptions on the model and the data. At the same time, in contrast to non-parametric models, they allow extrapolation within some boundaries. They also reduce the dimensionality of the parameter space, protecting the statistical accuracy of the model from sharp decreases.

In this section, generalized partial linear models will be presented. Since kernel density estimation plays an important role in semi-parametric models, we first introduce it.

3.7.1 Kernel Density Estimation

Univariate Case

Suppose we have a univariate feature random variable X with n observations X_1, X_2, ..., X_n from an unknown continuous distribution.

Let h be the bandwidth. The kernel function assigns weights to each observation X_i whose distance from a given x is not bigger than h.

A kernel function is denoted by K(·), and the density estimate at x can be written as

f̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h);   (3.7.75)

see Table 3.2 for some widely used kernel functions.

Kernel              K(u)
Uniform             (1/2) I(|u| ≤ 1)
Triangle            (1 − |u|) I(|u| ≤ 1)
Epanechnikov        (3/4)(1 − u²) I(|u| ≤ 1)
Quartic (Biweight)  (15/16)(1 − u²)² I(|u| ≤ 1)
Triweight           (35/32)(1 − u²)³ I(|u| ≤ 1)
Gaussian            (1/√(2π)) exp(−u²/2)
Cosine              (π/4) cos(πu/2) I(|u| ≤ 1)

Table 3.2: Kernel functions [HMSW04].

The general form of the kernel density estimator of a probability density f, based on a random sample X_1, X_2, ..., X_n from f, looks as follows [HMSW04]:

f̂_h(x) = (1/n) Σ_{i=1}^{n} K_h(x − X_i),   (3.7.76)

where

K_h(·) = (1/h) K(·/h).   (3.7.77)

The kernel density estimates of the car prices in one of the Matlab data files were computed by implementing (3.7.76) in Matlab. These density estimates for different bandwidths are shown in Figures 3.4, 3.5, 3.6 and 3.7.
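As a cross-check of (3.7.75) outside Matlab, a minimal Python sketch of the univariate estimator with the triangle kernel might look as follows; the price sample below is made up for illustration and is not the Matlab data set used for the figures:

```python
def triangle(u):
    # Triangle kernel from Table 3.2: (1 - |u|) on [-1, 1], else 0
    return max(0.0, 1.0 - abs(u))

def kde(x, sample, h, kernel=triangle):
    """Kernel density estimate (3.7.75):
    f_h(x) = 1/(n h) * sum_i K((x - X_i)/h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

# Small illustrative sample of "prices" (invented numbers):
prices = [9000, 9500, 10200, 11000, 11500, 13000, 18000]
for h in (500, 1000, 2000):
    print(h, kde(10000, prices, h))
```

Increasing h spreads each observation's weight more widely, which is exactly the smoothing effect visible across Figures 3.4 to 3.7.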

Figure 3.4: Kernel density estimation of car prices by Matlab (h = 100) with triangle kernel function.

Figure 3.5: Kernel density estimation of car prices by Matlab (h = 200) with triangle kernel function.

Figure 3.6: Kernel density estimation of car prices by Matlab (h = 300) with triangle kernel function.

Statistical Properties of Kernel Density Functions

Figure 3.7: Kernel density estimation of car prices by Matlab (h = 400) with triangle kernel function.

Since the kernel functions are usually probability density functions, they have the following main properties:

∫_{−∞}^{∞} K(s) ds = 1,

∫_{−∞}^{∞} s K(s) ds = 0,

∫_{−∞}^{∞} s² K(s) ds < ∞,

and

∫_{−∞}^{∞} [K(s)]² ds < ∞.

For our investigation, we need some further measures of error and deviation:

A. Bias

Bias

´

f

h

(x) := E

´

f

h

(x) −f(x) (3.7.78)

=

1

n

n

¸

i=1

EK

h

(x −X

i

) −f(x)

= EK

h

(x −X

i

) −f(x)

=

1

h

K

x −u

h

f(u)du −f(x).

Let us put s :=

u−x

h

, then, du = hds and, by substitution rule,

Bias

´

f

h

(x) =

K(s)f(u)ds −f(x). (3.7.79)

51

By using the Taylor expansion of f(x + hs) around x, the equation becomes:

Bias(f̂_h(x)) = ∫ K(s) [f(x) + f′(x)hs + (1/2) f′′(x)h²s² + o(h²)] ds − f(x)   (3.7.80)
= f(x) ∫ K(s) ds + h f′(x) ∫ s K(s) ds + (h²/2) f′′(x) ∫ s² K(s) ds − f(x) + o(h²)
= (h²/2) f′′(x) ∫ s² K(s) ds + o(h²).

Hence, as h → 0, the bias vanishes. Therefore, we should take h as small as possible to reduce the bias.

B. Variance

Var(f̂_h(x)) = Var( (1/n) Σ_{i=1}^{n} K_h(x − X_i) )   (3.7.81)
= (1/n²) Σ_{i=1}^{n} Var(K_h(x − X_i))
= (1/n) Var(K_h(x − X_i))
= (1/(nh)) f(x) ∫ K²(s) ds + o(1/(nh)).
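The leading bias term in (3.7.80) can be checked numerically. The sketch below (our own check, not part of the thesis) compares the exact smoothing bias ∫ K_h(x − u) f(u) du − f(x) at x = 0 for a standard normal density against the approximation (h²/2) f′′(x) ∫ s² K(s) ds, using the Epanechnikov kernel, for which ∫ s² K(s) ds = 1/5:

```python
import math

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def exact_bias(x, h, steps=20000):
    # E f_h(x) - f(x) = int (1/h) K((x-u)/h) f(u) du - f(x),
    # integrated over the kernel's support [x-h, x+h] (midpoint rule)
    a, b = x - h, x + h
    du = (b - a) / steps
    total = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * du
        total += epanechnikov((x - u) / h) / h * normal_pdf(u) * du
    return total - normal_pdf(x)

def approx_bias(x, h):
    # Leading term (h^2/2) f''(x) mu2(K); for the standard normal,
    # f''(x) = (x^2 - 1) f(x), and mu2 = 1/5 for Epanechnikov
    return 0.5 * h * h * (x * x - 1.0) * normal_pdf(x) * 0.2

h = 0.5
print(exact_bias(0.0, h), approx_bias(0.0, h))
```

At the mode x = 0 the density is concave, so both the exact and the approximate bias are negative and agree up to the o(h²) remainder.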

Multivariate Case

Suppose we have a p-dimensional feature random vector X = (X_1, X_2, ..., X_p). Each component has n observations. Let us represent the i-th observation as

X_i = (X_{i1}, ..., X_{ip})^T   (i = 1, 2, ..., n).

Then, the multivariate kernel density estimator of X = (X_1, X_2, ..., X_p) will be

f̂_h(x) = (1/n) Σ_{i=1}^{n} (1/(h_1 ··· h_p)) ℵ( (x_1 − X_{i1})/h_1, ..., (x_p − X_{ip})/h_p ).   (3.7.82)

By using the multiplicative kernel

ℵ(u) = K(u_1)K(u_2)···K(u_p),   (3.7.83)

the estimator (3.7.82) becomes

f̂_h(x) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{p} h_j^{−1} K( (x_j − X_{ij})/h_j ).   (3.7.84)

The kernel density estimates of the car prices with respect to house prices in one of the Matlab data files were computed by implementing (3.7.84) in Matlab. These density estimates for different bandwidths are shown in Figures 3.8, 3.9, 3.10 and 3.11.
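A short sketch of the product-kernel estimator (3.7.84) for p = 2 with the Gaussian kernel; the (car price, house price) pairs below are invented for illustration and are not the thesis data:

```python
import math

def gaussian(u):
    # Gaussian kernel from Table 3.2
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde2d(x, sample, h):
    """Product-kernel density estimate (3.7.84) for p = 2:
    f_h(x) = 1/n * sum_i prod_j (1/h_j) K((x_j - X_ij)/h_j)."""
    n = len(sample)
    total = 0.0
    for xi in sample:
        w = 1.0
        for j in range(2):
            w *= gaussian((x[j] - xi[j]) / h[j]) / h[j]
        total += w
    return total / n

# Invented (car price, house price) pairs:
pairs = [(9000, 2000), (11000, 2500), (12000, 4000), (15000, 6000)]
print(kde2d((11000, 2600), pairs, (1000, 500)))
```

Each coordinate gets its own bandwidth h_j, which is what the figure captions (h_1, h_2) refer to.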

Figure 3.8: Kernel density estimation of car prices and house prices by Matlab (h_1 = 200, h_2 = 100) with Gaussian kernel function.

Figure 3.9: Kernel density estimation of car prices and house prices by Matlab (h_1 = 300, h_2 = 100) with Gaussian kernel function.

Figure 3.10: Kernel density estimation of car prices and house prices by Matlab (h_1 = 400, h_2 = 200) with Gaussian kernel function.

Figure 3.11: Kernel density estimation of car prices and house prices by Matlab (h_1 = 500, h_2 = 300) with Gaussian kernel function.

3.7.2 Generalized Partial Linear Models

Partial linear models are composed of two parts, a linear and a non-parametric part. With a known link function G(·), a generalized partial linear model (GPLM) can be represented as

E(Y |U, T) = G(U^T β + m(T)),   (3.7.85)

where β = (β_1, β_2, ..., β_p)^T is a finite-dimensional parameter and m(·) a smooth function.

The estimation of a generalized partial linear model is a two-step procedure. First, estimate β with a known m(·); then find an estimator of m(·) with the help of the known β [Mu00].

To estimate the GPLM by the semiparametric maximum likelihood method, the following assumptions should be made:

E(Y |U, T) = µ = G(η) = G(U^T β + m(T)),   (3.7.86)

Var(Y |U, T) = σ² V(µ).   (3.7.87)

Let L(µ, y) be the individual log-likelihood (or, if the distribution of Y does not belong to an exponential family, a quasi-likelihood function), written as

L(µ, y) := (1/σ²) ∫_y^µ (s − y)/V(s) ds.   (3.7.88)

Based on the sample, the estimated scale parameter σ² is

σ̂² := (1/n) Σ_{i=1}^{n} (y_i − µ̂_i)² / V(µ̂_i),   (3.7.89)

where µ̂_i := G(η̂_i) and η̂_i = x_i^T β̂ + m̂(t_i).

Proﬁle Likelihood Algorithm

The profile likelihood method is one of the methods used to estimate generalized partial linear models. It distinguishes two parts in the estimation: a parametric and a nonparametric part. The method fixes the parameter β to estimate the most likely nonparametric function m(·); it then uses the estimate of m(·) to construct the profile likelihood for β. As a result of the profile likelihood method, the estimator β̂ is √n-consistent, asymptotically normal and efficient, and the estimator m̂(·) = m̂_β(·) is consistent in the sup norm.

First, let L(β) be the parametric profile likelihood function, defined by

L(β) := Σ_{i=1}^{n} L(µ_{i,β}, y_i),   (3.7.90)

where µ_{i,β} := G(x_i^T β + m(t_i)). This likelihood is optimized to obtain an estimate for β.

Second, the local likelihood looks as follows:

L_H(m_β(t)) := Σ_{i=1}^{n} ℵ_H(t − t_i) L(µ_{i,m_β(t)}, y_i),   (3.7.91)

where µ_{i,m_β(t)} := G(x_i^T β + m(t)) and ℵ_H(t − t_i) is the kernel weight, with ℵ_H denoting the multidimensional kernel function and H the bandwidth matrix. This local likelihood is optimized to obtain an estimator for m_β(t) at t.

Then, the individual quasi-likelihood is

ℓ_i(η) := L(G(η), y_i),   (3.7.92)

and its first and second derivatives with respect to η are denoted by ℓ′_i and ℓ′′_i.

For the estimation of m(·), the maximization of the local likelihood is obtained by solving

Σ_{i=1}^{n} ℵ_H(t_i − t_j) ℓ′_i(x_i^T β + m_j) = 0   (3.7.93)

with respect to m_j. To obtain the estimator of β, the following derivative of the quasi-likelihood part has to be solved [MR99]:

Σ_{i=1}^{n} ℓ′_i(x_i^T β + m_i)(x_i + m′_i) = 0   (3.7.94)

with respect to β.

A further differentiation of (3.7.93) results in [MR99]

m′_j = − [ Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) x_i ] / [ Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) ].   (3.7.95)

The algorithm proceeds iteratively and basically includes the following steps [Mu00]:

1. Updating step for β:

β^new = β − B^{−1} Σ_{i=1}^{n} ℓ′_i(x_i^T β + m_i) x̄_i   (3.7.96)

with the Hessian-type matrix

B = Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_i) x̄_i x̄_i^T   (3.7.97)

and

x̄_j = x_j + m′_j = x_j − [ Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) x_i ] / [ Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) ];   (3.7.98)

2. Updating step for m_j:

m_j^new = m_j − [ Σ_{i=1}^{n} ℓ′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) ] / [ Σ_{i=1}^{n} ℓ′′_i(x_i^T β + m_j) ℵ_H(t_i − t_j) ].   (3.7.99)
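The iteration above can be sketched numerically for a Bernoulli response with logistic link, for which ℓ′_i(η) = y_i − G(η) and ℓ′′_i(η) = −G(η)(1 − G(η)). Everything below is our own illustrative setup: the simulated data, the Gaussian kernel playing the role of the ℵ_H weights, the bandwidth and the iteration count are all assumptions, not the thesis's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(eta):
    # logistic link
    return 1.0 / (1.0 + np.exp(-eta))

def kern(d, h=0.3):
    # Gaussian kernel weight standing in for the ℵ_H weights
    return np.exp(-0.5 * (d / h) ** 2)

# Simulated data: one linear covariate x, one smooth covariate t
n = 400
x = rng.normal(size=(n, 1))
t = rng.uniform(-1, 1, size=n)
beta_true = np.array([1.5])
y = rng.binomial(1, G(x @ beta_true + np.sin(np.pi * t)))

beta = np.zeros(1)
m = np.zeros(n)                       # m_j evaluated at each t_j
W = kern(t[:, None] - t[None, :])     # W[i, j] = weight of t_i around t_j

for _ in range(15):
    eta_ij = (x @ beta)[:, None] + m[None, :]   # x_i^T beta + m_j
    P = G(eta_ij)
    lp = y[:, None] - P          # l'_i(x_i^T beta + m_j)
    lpp = -P * (1.0 - P)         # l''_i(x_i^T beta + m_j)
    denom = (W * lpp).sum(axis=0)
    # (3.7.99): local Newton update of each m_j
    m = m - (W * lp).sum(axis=0) / denom
    # (3.7.98): adjusted design x̄_j = x_j + m'_j
    xbar = x - ((W * lpp).T @ x) / denom[:, None]
    # (3.7.96)-(3.7.97): Newton update of beta, evaluated at m_i
    eta = x @ beta + m
    lpd = y - G(eta)
    lppd = -G(eta) * (1.0 - G(eta))
    B = (xbar * lppd[:, None]).T @ xbar
    beta = beta - np.linalg.solve(B, xbar.T @ lpd)

print(beta)  # should recover a value near beta_true
```

This is only a sketch of the update equations under the stated assumptions; a production estimator would add convergence checks, bandwidth selection and step-size control.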

3.7.3 Advantages and Disadvantages of Semi-parametric

Methods

Advantages:

• The semi-parametric method preserves the degrees of freedom.

• It achieves greater precision than nonparametric models but with weaker

assumptions than parametric models.

• By restricting G(x), it reduces the eﬀective dimension of x.

• The risk of the speciﬁcation error is less than with a parametric model.

Disadvantages:

• The full functional form is not known with conﬁdence.

• The risk of speciﬁcation error is greater than for the fully nonparametric

model.

For more detailed information, we refer to [Horrowitz].

chapter 4

NONSTATISTICAL METHODS IN CREDIT SCORING

4.1 Neural Networks

4.1.1 Structure of Neural Networks

Like a human brain, a neural network has the ability of learning, remembering and generalizing. The basic element of a neural network is called a neuron. As seen in Figure 4.1, a neuron has five components: inputs, weights, a combination part, an activation part and an output. In the following, we give more detailed information about these components.

Figure 4.1: Structure of neural networks.


A. Inputs:

Inputs are denoted by x_i. The inputs receive information from the environment of the neuron. An input may be an initial input or the output of prior neurons. For a company, the initial input is (x_1, x_2, ..., x_p).

B. Weights:

The weights for neuron j are comprised in the vector (w_{j1}, w_{j2}, ..., w_{jp}). The weights are constants that determine the effects of the inputs on the neuron. The greater the weight of an input, the greater is the impact of that input on the neuron.

C. Adder:

In this part, the input values are multiplied by the weights, summed together with the threshold level θ_j, and then sent to the activation part.

D. Activation Part:

The activation function of a neuron, denoted by ϕ(·), specifies the final output of the neuron at some activity level. There are three main types of activation functions:

D.1. Threshold Function: The threshold function for neuron k is simply

ϕ(v_k) := y_k = 1 if v_k ≥ 0, and 0 if v_k < 0,   (4.1.1)

where v_k = Σ_j w_{kj} x_j − θ_k.

The threshold function is shown in Figure 4.2.

Figure 4.2: Threshold activation function [H94].

D.2. Piecewise-Linear Function: The piecewise-linear function is defined by

ϕ(v) := 1 if v ≥ 1/2; v if −1/2 < v < 1/2; 0 if v ≤ −1/2.   (4.1.2)

Figure 4.3 shows this piecewise-linear function.

Figure 4.3: Piecewise-linear activation function [H94].


D.3. Sigmoid Function: The sigmoid function is the most widely used activation function in applications of neural networks. All strictly increasing functions with suitable smoothness and asymptotic properties belong to this type. As an example of a sigmoid function, we can give the logistic function,

ϕ(v) = 1 / (1 + exp(−αv)),   (4.1.3)

where α is the slope parameter. Another example is the hyperbolic tangent function,

ϕ(v) = tanh(v/2) = (1 − exp(−v)) / (1 + exp(−v)).   (4.1.4)

In Figure 4.4, the examples of sigmoid functions can be seen.

Figure 4.4: Sigmoid activation functions.


E. Output:

In this step, the output of the activation part is diffused to the outer world or to another neuron. Every neuron has only one output.
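The neuron components above can be sketched in a few lines; the weights, inputs and threshold below are arbitrary illustrative numbers:

```python
import math

def threshold(v):
    # Threshold activation (4.1.1): 1 if v >= 0, else 0
    return 1.0 if v >= 0.0 else 0.0

def piecewise_linear(v):
    # Piecewise-linear activation, as written in (4.1.2)
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def logistic(v, alpha=1.0):
    # Logistic sigmoid (4.1.3) with slope parameter alpha
    return 1.0 / (1.0 + math.exp(-alpha * v))

def neuron_output(x, w, theta, phi):
    # Adder: v = sum_j w_j x_j - theta, then activation phi(v)
    v = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return phi(v)

print(neuron_output([1.0, 0.5], [0.8, -0.4], 0.1, logistic))
```

Swapping `phi` changes only the activation part; the inputs, weights and adder stay the same.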

4.1.2 Learning Process

Introduction

The significance of neural networks comes from their ability to learn from their environment and, accordingly, to improve their performance by adjusting weights and thresholds.

Let us consider the following network given in Figure 4.5:

Figure 4.5: A neural network structure.

In this network, x_j denotes the output of neuron j, and the x_j are connected with the internal activity v_k of neuron k. Let w_{kj}(t) represent the value of weight w_{kj} at time t. Then, to obtain the updated weight for time t + 1, the adjustment ∆w_{kj}(t) is applied in the following way:

w_{kj}(t + 1) := w_{kj}(t) + ∆w_{kj}(t).   (4.1.5)


Some basic rules of the learning processes are shown in Figure 4.6.

Figure 4.6: Diagram of learning process [H94].

Error Correction Learning

In Figure 4.7, d_j denotes the desired response or target output for neuron j. Furthermore, y_j denotes the actual output of neuron j. In error correction learning, the algorithm tries to reduce the error between the desired response and the actual output. The error is simply

e_j = d_j − y_j,   (4.1.6)

where

y_j := Σ_{i=1}^{p} w_{ji} x_{ji}.   (4.1.7)

As a performance measure, the mean squared error is used, i.e.,

J := (1/2) E(e_j²).   (4.1.8)

Here, the factor 1/2 is included for convenience. Our problem is to find the best set of weights (w_{j1}, w_{j2}, ..., w_{jp}) which minimizes (4.1.8).


Figure 4.7: Error correction learning.

The solution of this problem is known to be represented by the Wiener-Hopf equations. By substituting (4.1.6) and (4.1.7) into (4.1.8), we get

J = (1/2) E(d_j²) − E( Σ_{i=1}^{p} w_{ji} x_{ji} d_j ) + (1/2) E( Σ_{k=1}^{p} Σ_{i=1}^{p} w_{jk} w_{ji} x_{jk} x_{ji} )   (4.1.9)
  = (1/2) E(d_j²) − Σ_{i=1}^{p} w_{ji} E(x_{ji} d_j) + (1/2) Σ_{k=1}^{p} Σ_{i=1}^{p} w_{jk} w_{ji} E(x_{ji} x_{jk}).   (4.1.10)

Then, the gradient ∆w_{ji} can be found as

∆w_{ji} = ∂J/∂w_{ji}   (4.1.11)
        = −E(x_{ji} d_j) + Σ_{k=1}^{p} w_{jk} E(x_{ji} x_{jk})   (i = 1, 2, ..., p).   (4.1.12)

Therefore, the optimality condition is defined by

∆w_{ji} = 0   (i = 1, 2, ..., p).   (4.1.13)
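The optimality condition (4.1.13) with the gradient (4.1.12) is a linear system in the weights: Σ_k w_{jk} E(x_{ji} x_{jk}) = E(x_{ji} d_j) for every i, i.e., the normal (Wiener-Hopf) equations. A small sketch solving them from sample moments, on synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic supervised data: the desired response d is a noisy linear
# function of the inputs (invented example, not the thesis data)
n, p = 5000, 3
X = rng.normal(size=(n, p))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations from (4.1.12)-(4.1.13):
# sum_k w_k E(x_i x_k) = E(x_i d) for every i
R = (X.T @ X) / n       # sample estimate of the input correlation matrix
pvec = (X.T @ d) / n    # sample estimate of E(x d)
w_opt = np.linalg.solve(R, pvec)
print(w_opt)  # close to w_true
```

Error correction learning reaches the same solution iteratively instead of solving the system in closed form.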


Boltzman Learning

This type of learning is a stochastic one. The Boltzman machine basically has the following properties [H94]:

• processing units have binary values (−1 and 1),

• all the connections between units are symmetric,

• the units are picked at random, one at a time, for updating,

• it has no self-feedback,

• it permits the use of hidden nodes,

• it uses stochastic neurons with a probabilistic firing mechanism,

• it may also be trained under supervision of a probabilistic form.

The Boltzman machine is described by the following energy function:

E = −(1/2) Σ_i Σ_{j, i≠j} w_{ji} s_j s_i,   (4.1.14)

where s_i is the state of neuron i, and w_{ji} is the weight connecting neuron i to neuron j. The condition i ≠ j implies that none of the neurons in the machine has self-feedback. The machine operates by choosing a neuron at random, say neuron j, at some step of the learning process and flipping the state of neuron j from s_j to −s_j, at some constant C > 0, with probability

W(s_j → −s_j) = 1 / (1 + exp(−∆E_j / C)),   (4.1.15)

where ∆E_j is the change in the energy function of the machine.

Figure 4.8: Boltzman machine [H94].

As seen in Figure 4.8, the Boltzman machine has two types of stochastic neurons: hidden neurons and visible neurons. The visible neurons work as a connection between the network and its environment, while the hidden neurons constrain the input vectors by taking the higher-order correlations of the vectors into account. The machine has two modes of operation:

• clamped condition: the visible neurons are all clamped onto specific states determined by the environment;

• free-running condition: all neurons are permitted to operate freely.

Let the states of the visible neurons be α and those of the hidden neurons be β. We assume the network has L hidden neurons and K visible neurons, so α runs from 1 to 2^K and β runs from 1 to 2^L. Let P^+_α denote the probability that the visible neurons are collectively in state α, given that the network is operating in its clamped condition. Let P^−_α denote the probability that these same neurons are collectively in state α, given that the network is allowed to run freely but with no environmental input. The signs +, − indicate that the network is in the clamped condition and running freely, respectively. Accordingly, the set of probabilities

{P^+_α | α = 1, 2, ..., 2^K}   (4.1.16)

consists of the desired probabilities which represent the environment, and the set of probabilities

{P^−_α | α = 1, 2, ..., 2^K}   (4.1.17)

consists of the actual probabilities which are computed by the network.

Suppose ρ^+_{ji} denotes the correlation between the states of neurons i and j, conditional on the network being in its clamped condition. Let ρ^−_{ji} denote the unconditional correlation between the states of neurons i and j. The correlations ρ^+_{ji} and ρ^−_{ji} are given by

ρ^+_{ji} := Σ_α Σ_β P^+_{αβ} s_{j|αβ} s_{i|αβ},   (4.1.18)

ρ^−_{ji} := Σ_α Σ_β P^−_{αβ} s_{j|αβ} s_{i|αβ},   (4.1.19)

where s_{i|αβ} denotes the state of neuron i, given that the visible neurons of the machine are in state α and the hidden neurons are in state β. Then, according to the Boltzman learning rule, the change ∆w_{ji} applied to the weight w_{ji} from neuron i to neuron j is defined by

∆w_{ji} := η(ρ^+_{ji} − ρ^−_{ji})   for all i ≠ j,   (4.1.20)

where η is a learning-rate parameter.
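A tiny sketch of the energy function (4.1.14) and the flip probability (4.1.15), with an invented 3-neuron symmetric weight matrix; following (4.1.15), we take ∆E_j to be the energy change produced by the flip:

```python
import math

def energy(w, s):
    # Energy function (4.1.14): E = -1/2 * sum_{i != j} w[j][i] s_j s_i
    n = len(s)
    return -0.5 * sum(w[j][i] * s[j] * s[i]
                      for j in range(n) for i in range(n) if i != j)

def flip_probability(w, s, j, C):
    # Probability (4.1.15) of flipping neuron j at "temperature" C
    flipped = list(s)
    flipped[j] = -s[j]
    delta_e = energy(w, flipped) - energy(w, s)
    return 1.0 / (1.0 + math.exp(-delta_e / C))

# Invented symmetric 3-neuron machine with states in {-1, +1}:
w = [[0.0, 0.8, -0.2],
     [0.8, 0.0, 0.5],
     [-0.2, 0.5, 0.0]]
s = [1, 1, -1]
print(flip_probability(w, s, 2, C=1.0))
```

Repeatedly picking a random neuron and flipping it with this probability is the stochastic update the learning rule (4.1.20) relies on.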

Hebbian Learning

Hebb's rule of learning is the most well-known learning algorithm. Hebb proposed that the weights are adjusted in proportion to the correlation between the input and the output of the network [HS95].

Consider Figure 4.5 again. The weights, inputs and output of neuron k are denoted by w_{kj}, x_j and y_k, respectively. Hebb learning has the following form of adjustment:

∆w_{kj} := F(y_k(t), x_j(t)),   (4.1.21)

where F(·, ·) is a function of y_k(t) and x_j(t). A special case of equation (4.1.21) is

∆w_{kj} = η Cov[y_k(t), x_j(t)]   (4.1.22)
        = η E[(y_k(t) − ȳ_k)(x_j(t) − x̄_j)],   (4.1.23)

where η is the rate of learning and ȳ_k, x̄_j denote the corresponding mean values.
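The covariance form (4.1.22)-(4.1.23) can be sketched on toy signals; the correlated input/output series below are our own invention:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated input x_j(t) and output y_k(t) over time (toy signals)
T = 1000
x = rng.normal(size=T)
y = 0.7 * x + 0.3 * rng.normal(size=T)

eta = 0.1
# Covariance form of Hebb's rule (4.1.22)-(4.1.23): the weight change
# is proportional to the sample covariance of output and input
delta_w = eta * np.mean((y - y.mean()) * (x - x.mean()))
print(delta_w)  # positive: correlated activity strengthens the weight
```

Anticorrelated signals would give a negative `delta_w`, i.e., the connection would weaken.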

Competitive Learning

In competitive learning, only the output neurons which win the competition are activated. Competitive learning has three basic properties:

• A set of neurons that are all the same except for some randomly distributed weights, and which therefore respond differently to a given set of input patterns.

• A limit imposed on the "strength" of each neuron.

• A mechanism which permits the neurons to compete for the right to respond to a given subset of inputs, such that only one output neuron, or only one neuron per group, is active at a time. The neuron which wins the competition is called a winner-takes-all neuron [H94].

Figure 4.9: Single layer competitive network [H94].

Let neuron j be the winning neuron, with the largest activity level v_j. Then, the output y_j of the winning neuron is set equal to one and the outputs of the other neurons are set to zero. Furthermore, in competitive learning, a fixed amount of weight is assigned to each neuron such that

Σ_i w_{ji} = 1 for all j.   (4.1.24)

The algorithm of competitive learning adjusts the weights in the following way:

∆w_{ji} = η(x_i − w_{ji}) if neuron j wins the competition, and ∆w_{ji} = 0 if neuron j loses the competition,   (4.1.25)

where η is the learning-rate parameter.
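The winner-takes-all update (4.1.25) can be sketched as follows; the weights and the input pattern are invented, and both are normalized to sum to one so the constraint (4.1.24) is preserved by the update:

```python
import numpy as np

rng = np.random.default_rng(3)

def competitive_step(W, x, eta=0.1):
    """One winner-takes-all update, eq. (4.1.25): the neuron with the
    largest activity wins and moves its weights toward x."""
    activity = W @ x
    j = int(np.argmax(activity))   # winning neuron
    W[j] += eta * (x - W[j])       # Δw_ji = η (x_i − w_ji); losers unchanged
    return j

# Three neurons, weights normalized so each row sums to one (4.1.24)
W = rng.random((3, 4))
W /= W.sum(axis=1, keepdims=True)

x = np.array([0.7, 0.1, 0.1, 0.1])   # input pattern, already sums to one
for _ in range(50):
    winner = competitive_step(W, x)
print(winner, W[winner])  # winner's weights have moved toward x
```

Because x and each weight row sum to one, the updated row w + η(x − w) still sums to one, so (4.1.24) holds throughout.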

Back Propagation Algorithm

The back propagation algorithm is a learning rule for multi-layered neural networks. The algorithm primarily works by adjusting the synaptic weights in order to minimize the difference between the network's desired output and its actual output.

Derivation of the Back Propagation Algorithm

Figure 4.10: The feed-forward network [H94].


The error at the output of neuron j at iteration t can be defined by

e_j(t) := d_j(t) − y_j(t).   (4.1.26)

As the sum of squared errors of the network at time t we put

ξ(t) := (1/2) Σ_{j∈C} e_j²(t).   (4.1.27)

Suppose that the iteration is finished at time T. We try to minimize the average sum of squared errors over the training set. In other words, the cost function of the training-set learning performance, which is to be minimized, is

min ξ_av = (1/T) Σ_{t=1}^{T} ξ(t).   (4.1.28)

In the minimization part, the back propagation algorithm uses the least mean square (LMS) algorithm. Let us define

v_j(t) := Σ_{i=0}^{p} w_{ji}(t) x_i(t),   (4.1.29)

where p is the total number of inputs applied to neuron j. Furthermore, the output y_j(t) at iteration t is

y_j(t) := ϕ_j(v_j(t)).   (4.1.30)

The back propagation algorithm improves its weights proportionally to the instantaneous gradient ∂ξ(t)/∂w_{ji}(t). In the algorithm, this gradient serves as a sensitivity factor which determines the direction of the search for weight w_{ji}. According to the chain rule, the gradient may be expressed in the following way:

∂ξ(t)/∂w_{ji}(t) = [∂ξ(t)/∂e_j(t)] [∂e_j(t)/∂y_j(t)] [∂y_j(t)/∂v_j(t)] [∂v_j(t)/∂w_{ji}(t)].   (4.1.31)

Let us compute the gradient. Firstly, we differentiate both sides of (4.1.27) with respect to e_j(t); then,

∂ξ(t)/∂e_j(t) = e_j(t). (4.1.32)

Differentiating both sides of (4.1.26) with respect to y_j(t) gives

∂e_j(t)/∂y_j(t) = −1. (4.1.33)

Next, differentiating (4.1.30) with respect to v_j(t) gives

∂y_j(t)/∂v_j(t) = ϕ′_j(v_j(t)). (4.1.34)

Finally, differentiating (4.1.29) with respect to w_ji(t) yields

∂v_j(t)/∂w_ji(t) = x_i(t). (4.1.35)

Furthermore, by substituting (4.1.32)-(4.1.35) into (4.1.31), we obtain

∂ξ(t)/∂w_ji(t) = −e_j(t) ϕ′_j(v_j(t)) x_i(t). (4.1.36)

Then, the correction ∆w_ji(t) applied to w_ji(t) is an improvement step which can be defined by the delta rule

∆w_ji(t) := −µ ∂ξ(t)/∂w_ji(t), (4.1.37)

where µ is a constant learning rate. Use of (4.1.36) in (4.1.37) yields

∆w_ji(t) = µ δ_j(t) x_i(t), (4.1.38)

where the local gradient δ_j(t) is defined by

δ_j(t) = −[∂ξ(t)/∂e_j(t)] [∂e_j(t)/∂y_j(t)] [∂y_j(t)/∂v_j(t)] = e_j(t) ϕ′_j(v_j(t)). (4.1.39)

At this step, we are faced with two situations: the first is that neuron j is an output node; the second is that neuron j is a hidden node.

CASE I. Neuron j is an output neuron:

If neuron j lies in the output layer of the network, a desired response exists. Then we may compute the error by formula (4.1.26) and correct the synaptic weights with the help of (4.1.38).

CASE II. Neuron j is a hidden neuron:

If neuron j lies in a hidden layer of the network, there is no desired response. So the error rate must be determined in terms of the error rates of all neurons connected directly to that hidden neuron. Let us consider the situation in Figure 4.11 below:

In this figure, the j-th neuron represents the hidden node and the k-th neuron

Figure 4.11: Two layer feed-forward network [H94].

represents the output neuron, where

v_j(t) := Σ_i w_ji(t) x_i(t), (4.1.40)

v_k(t) := Σ_j w_kj(t) y_j(t) (4.1.41)

and

y_j(t) := ϕ_j(v_j(t)), (4.1.42)

y_k(t) := ϕ_k(v_k(t)). (4.1.43)

Then, the local gradient δ_j(t) for the hidden neuron j can be defined as

δ_j(t) := −[∂ξ(t)/∂y_j(t)] [∂y_j(t)/∂v_j(t)] = −[∂ξ(t)/∂y_j(t)] ϕ′_j(v_j(t)). (4.1.44)

In order to obtain this correction factor for the hidden neuron, we must calculate ∂ξ(t)/∂y_j(t) by the following procedure.

For our case, the instantaneous sum of squared errors existing at the output neurons k can be written as follows:

ξ(t) = (1/2) Σ_{k∈C} e_k^2(t). (4.1.45)

Firstly, we differentiate (4.1.45) with respect to y_j(t):

∂ξ(t)/∂y_j(t) = Σ_k e_k(t) ∂e_k(t)/∂y_j(t). (4.1.46)

Then, by the chain rule, (4.1.46) becomes

∂ξ(t)/∂y_j(t) = Σ_k e_k(t) [∂e_k(t)/∂v_k(t)] [∂v_k(t)/∂y_j(t)]. (4.1.47)

Now, from Figure 4.11, we know that

e_k(t) = d_k(t) − y_k(t) = d_k(t) − ϕ_k(v_k(t)), (4.1.48)

v_k(t) = Σ_j w_kj(t) y_j(t). (4.1.49)

Therefore,

∂e_k(t)/∂v_k(t) = −ϕ′_k(v_k(t)), (4.1.50)

∂v_k(t)/∂y_j(t) = w_kj(t) (4.1.51)

and, then,

∂ξ(t)/∂y_j(t) = −Σ_k e_k(t) ϕ′_k(v_k(t)) w_kj(t). (4.1.52)

Accordingly, by using (4.1.52) in equation (4.1.44), we obtain

δ_j(t) = ϕ′_j(v_j(t)) Σ_k e_k(t) ϕ′_k(v_k(t)) w_kj(t). (4.1.53)

From (4.1.31) and the definition of δ_j(t), we have

∂ξ(t)/∂w_ji(t) = −δ_j(t) ∂v_j(t)/∂w_ji(t). (4.1.54)

Putting

v_j(t) = Σ_i w_ji(t) x_i(t) (4.1.55)

and (4.1.53) into equation (4.1.54) yields

∂ξ(t)/∂w_ji(t) = −x_i(t) ϕ′_j(v_j(t)) Σ_k e_k(t) ϕ′_k(v_k(t)) w_kj(t). (4.1.56)

Then, by using (4.1.54) in (4.1.37), the correction for the hidden neuron is obtained in the following form:

∆w_ji(t) = −µ ∂ξ(t)/∂w_ji(t) = µ δ_j(t) x_i(t). (4.1.57)

4.1.3 Advantages and Disadvantages of Neural Networks

Advantages:

The neural network

• does not use a pre-programmed knowledge base,

• is suited to analyzing complex patterns,

• has no restrictive assumptions,

• allows for qualitative data,

• can handle noisy data,

• can overcome autocorrelation,

• is user-friendly, with clear output, and

• is robust and flexible.

Disadvantages:

The neural network

• requires high quality data,

• requires the variables to be carefully selected a priori,

• carries a risk of overfitting,

• requires a definition of its architecture,

• has a long processing time,

• may show illogical network behavior, and

• requires a large training sample.

For a closer explanation we refer to [BO04].

chapter 5

Parameter Estimation Accuracy of Logistic Regression

5.1 Introduction and Methodology

Since the 1980s, logistic regression has been the primary research tool in this area. It is used for every type and size of data, often without any consideration of the accuracy of the parameter estimation. Therefore, in this section, we check the conditions under which logistic regression performs well. We carried out the analysis with Monte Carlo simulation in Matlab.

As we mentioned in previous sections, the logistic regression model is defined as follows:

p(x_i) = G(x_i, w) = e^(w_0 + w_1 x_i1 + w_2 x_i2 + ... + w_p x_ip) / (1 + e^(w_0 + w_1 x_i1 + w_2 x_i2 + ... + w_p x_ip)) = e^(x_i w) / (1 + e^(x_i w)), (5.1.1)

where the w_i (i = 0, 1, ..., p) are weights and x_ji is the i-th independent variable for individual j.

For the Monte Carlo simulation, we follow a five-step procedure:

Step 1 The set of independent variables x is derived from various sets of distributions with three different dimensions.

Step 2 The variables are normalized in order to prevent scale problems.

Step 3 The variables are multiplied by some initial weights w_i (i = 1, 2, ..., p) and summed.

Step 4 The y scores are obtained according to the following formula:

y* = xw + ε, (5.1.2)

where ε ∼ logistic(0, π/√3), and we assign y = 0 if y* < 0, and y = 1 if y* ≥ 0.

Step 5 Lastly, by using the produced y and the derived x, the weights are estimated by logistic regression and compared with the initial ones.

In Matlab, there is no built-in tool for generating logistic random variables for the error terms. Therefore, we derive them by inverting the distribution function in the following way: if z follows a logistic distribution with parameters α and β, then

F(z, α, β) = exp((z − α)/β) / (1 + exp((z − α)/β)). (5.1.3)

This expression is equivalent to

F(z, α, β) = 1 − 1 / (1 + exp((z − α)/β)). (5.1.4)

Bringing the 1 to the other side results in

1 − F(z, α, β) = 1 / (1 + exp((z − α)/β)). (5.1.5)

Since the exponential term is always positive, we can take the reciprocal of both sides and subtract 1:

1 / (1 − F(z, α, β)) − 1 = exp((z − α)/β), (5.1.6)

F(z, α, β) / (1 − F(z, α, β)) = exp((z − α)/β). (5.1.7)

Then, by taking logarithms of both sides, we obtain

ln [F(z, α, β) / (1 − F(z, α, β))] = (z − α)/β. (5.1.8)

So,

z = α + β ln [F(z, α, β) / (1 − F(z, α, β))]. (5.1.9)

For z to be a logistic random variable, and since F(·) only takes values between 0 and 1, we replace F with a uniform random variable. Then, we obtain the following formula:

z = α + β ln [Unif(0, 1) / (1 − Unif(0, 1))]. (5.1.10)

In particular, we derive our error term ε ∼ logistic(0, π/√3) by

ε = (π/√3) ln [Unif(0, 1) / (1 − Unif(0, 1))]. (5.1.11)

To check the estimation accuracy of logistic regression in different dimensions and sizes, we ran the estimations using one, six, and twelve independent variables for 250, 500 and 1000 data cases, with 1000 simulations each.

To check the parameter accuracy of logistic regression, we took the coefficient of variation as a measure of bias. The coefficient of variation is

CV = (standard deviation of ŵ) / (mean of ŵ). (5.1.12)

Moreover, we tested it by using different sets of weights. The test weight sets are selected among sequences with a sum of one or smaller (preserving the mean of the normalized data). For the 6- and 12-variable cases, we selected the weights in the following manner: the first sets have high variation among their elements, the second sets have relatively lower values, and the third sets have relatively higher effects.
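One replication of Steps 1-5 and the coefficient of variation (5.1.12) over repeated simulations can be sketched as follows. This is an illustrative single-variable Python version (the thesis uses Matlab); the Newton-Raphson maximum-likelihood fitter without an intercept, the seed, and the replication count are assumptions standing in for the original estimation routine.

```python
import numpy as np

rng = np.random.default_rng(7)
BETA = np.pi / np.sqrt(3)          # scale of the logistic noise in eq. (5.1.2)

def fit_logit(X, y, iters=25):
    """Logistic-regression MLE via Newton-Raphson (no intercept term)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        H = (X * (p * (1.0 - p))[:, None]).T @ X   # observed information
        w += np.linalg.solve(H, X.T @ (y - p))     # Newton step
    return w

def one_replication(n=500, w_true=0.2):
    x = rng.uniform(1, 100, size=n)            # Step 1: draw the regressor
    x = (x - x.mean()) / x.std()               # Step 2: normalize
    u = rng.uniform(size=n)
    eps = BETA * np.log(u / (1.0 - u))         # logistic(0, π/√3) noise
    y = (w_true * x + eps >= 0).astype(float)  # Steps 3-4: latent score -> y
    return fit_logit(x[:, None], y)[0]         # Step 5: re-estimate the weight

w_hats = np.array([one_replication() for _ in range(200)])
cv = w_hats.std() / w_hats.mean()              # eq. (5.1.12)
```

Varying `n`, `w_true` and the number of replications reproduces the kind of bias comparisons reported in the results below.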

5.2 Results

5.2.1 One Variable Case

In this section, to examine the prediction accuracy of logistic regression in small dimensions of data, we used only a single variable, generated from the uniform distribution with parameters 1 and 100. Furthermore, we made our calculations under two different weights: one with a smaller and one with a higher loading.

Smaller Initial Weight w = 0.2

Firstly, at the second step of the algorithm, we selected the initial weight as w = 0.2. Figure 5.1 shows the coefficient of variation of the estimator. Contrary to what we expected, this graph reveals a direct relation between the percentage of defaults in the data set and the bias. According to the figure, if the data set includes more than 30% defaults, the bias increases sharply.

Figure 5.1: Coefficient of variation of estimator in different default levels for w = 0.2.

The coefficient estimates and the initial value w = 0.2 are compared in Figure 5.2. Accordingly, a nearly linear positive relationship between the default level and the accuracy of the estimate can be observed from the graph. Since the bias rises after the 30% default cut-off, it seems reasonable to include about 30% defaults in the data set. Furthermore, smaller bias and error in the estimation are observed with the larger data sets.

Figure 5.2: Coefficient estimates and their true value in different default levels for w = 0.2.

Higher Initial Weight w = 0.6

To see the effect of a variable with a higher loading, we selected the initial weight as w = 0.6. Figures 5.3 and 5.4 are the graphs of the coefficient of variation and the coefficient estimates, respectively. Similar results to the case w = 0.2 can be concluded. The first figure shows an increase in the bias below the 5% and above the 30% default levels in the data. From the other figure, a positive relation can be observed between the estimation accuracy and the default level. Therefore, we noted similar results for both choices of weight.

Figure 5.3: Coefficient of variation of estimator in different default levels for w = 0.6.

Figure 5.4: Coefficient estimates and their true value in different default levels for w = 0.6.

5.2.2 Six Variables Case

The data were generated from a set of random variables with the following distributions:

1. x_1 ∼ uniform(1, 100),

2. x_2 ∼ exponential(16),

3. x_3 ∼ normal(12, 4),

4. x_4 ∼ Weibull(0.5, 2),

5. x_5 ∼ chi(17),

6. x_6 ∼ beta(4, 3).

First Set of Weights

Our first set of initial weights is w_1 = 0.8, w_2 = −0.9, w_3 = 0.4, w_4 = 0.05, w_5 = 0.75 and w_6 = −0.1.

This set includes high and low weights at the same time. The average coefficients of variation are shown in Figure 5.5. Accordingly, it is observed that as the data size grows, the Monte Carlo simulations show lower bias in the estimation of the parameters.

Figure 5.5: Six variable case: Average coefficient of variation of estimators in different default levels for the first set of weights.

In Figure 5.6, the true values of the parameters and their estimates for different sizes of data set are shown for each variable, in order from left to right. The first, second and fifth panels correspond to the weights that are high in absolute value, and it can be seen that the more defaults there are in the data set, the closer the estimates are to the true values. The remaining panels correspond to the smaller weights and show nearly perfect estimates at the 30% default level.

Figure 5.6: Six variable case: Coefficients' estimates and their true values on different default levels for the first set of weights.

Second Set of Weights

Our second set consists of w_1 = 0.02, w_2 = 0.05, w_3 = 0.03, w_4 = 0.01, w_5 = 0.07 and w_6 = 0.04.

This set assigns low weights to the variables. The coefficient of variation estimates in different default levels, shown in Figure 5.7, indicate that in small samples the bias of the estimators is much higher if only a small number of defaults occurs in the sample. Moreover, the bias gets smaller for each sample size after the 25% default level.

Figure 5.7: Six variable case: Average coefficient of variation of estimators in different default levels for the second set of weights.

Furthermore, Figure 5.8 shows that when the weights are very low, the parameter estimation accuracy of logistic regression is nearly perfect, especially in samples with about 30% defaults.

Figure 5.8: Six variable case: Coefficients' estimates and their true values in different default levels for the second set of weights.

Third Set of Weights

The initial weights taken in this analysis are w_1 = 1.3, w_2 = 0.2, w_3 = −0.7, w_4 = −0.9, w_5 = 0.6 and w_6 = 0.5.

The third set of weights gives high loadings to the variables. To preserve the mean, we used high negative and positive values together.

Figure 5.9 shows a similar result: when the sample size is increased, the bias of the estimation decreases. Moreover, for all three sample sizes, the bias shows nearly no change after the level of 30% defaults in the sample.

Figure 5.9: Six variable case: Average coefficient of variation of estimators in different default levels for the third set of weights.

As shown in Figure 5.10, the estimates differ much more from their true values. This estimation failure of logistic regression cannot be eliminated much even at high default levels.

Figure 5.10: Six variable case: Coefficients' estimates and their true values in different default levels for the third set of weights.

5.2.3 Twelve Variables Case

The data were generated from a set of random variables with the following distributions:

1. x_1 ∼ uniform(1, 100),

2. x_2 ∼ exponential(16),

3. x_3 ∼ normal(12, 4),

4. x_4 ∼ Weibull(0.5, 2),

5. x_5 ∼ chi(17),

6. x_6 ∼ beta(4, 3),

7. x_7 ∼ uniform(1, 150),

8. x_8 ∼ exponential(11),

9. x_9 ∼ normal(32, 5),

10. x_10 ∼ Weibull(3, 1),

11. x_11 ∼ chi(51),

12. x_12 ∼ beta(7, 2).

First Set of Weights

Our first set of weights for these twelve variables has the elements w_1 = 0.01, w_2 = 0.02, w_3 = 0.04, w_4 = 0.03, w_5 = 0.015, w_6 = 0.07, w_7 = 0.085, w_8 = 0.03, w_9 = 0.1, w_10 = 0.24, w_11 = 0.06 and w_12 = 0.3. These weights are selected to give low effects to the variables and to preserve the mean.

For the coefficient of variation, Figure 5.11 shows a peak for the 250-observation data set with fewer than 5% defaults. Furthermore, up to the 25% level the bias of the coefficient estimates is very high, but after that it is nearly stable.

Figure 5.11: Twelve variable case: Average coefficient of variation of estimators in different default levels for the first set of weights.

Figures 5.12 and 5.13 present the coefficient estimates together with their true values. Accordingly, it can be observed that for these low weights the estimation accuracy is nearly perfect for all default levels and data sizes.

Figure 5.12: Twelve variable case: First six coefficients' estimates and their true values in different default levels for the first set of weights.

Figure 5.13: Twelve variable case: Second six coefficients' estimates and their true values in different default levels for the first set of weights.

Second Set of Weights

As the second set of weights, we took w_1 = 0.9, w_2 = 1.2, w_3 = −0.7, w_4 = 0.85, w_5 = −1.1, w_6 = −1.1, w_7 = 0.65, w_8 = −0.5, w_9 = 1.3, w_10 = 0.7, w_11 = −1.7 and w_12 = 0.8. This set includes very high positive and negative weights, and the mean is again preserved.

Figure 5.14: Twelve variable case: Average coefficient of variation of estimators in different default levels for the second set of weights.

From Figure 5.14, we observe that if the independent variables have strong effects on the dependent variable, the bias stabilizes after the 30% default level for the data sets with 500 and 1000 observations, i.e. for the larger data sets, and after the 40% default level for the smaller data sets.

Moreover, Figures 5.15 and 5.16 indicate very poor estimates of the weights. The smaller data sets give more accurate estimation results. Furthermore, the best results are obtained around the 30% default level for all variables.

Figure 5.15: Twelve variable case: First six coefficients' estimates and their true values in different default levels for the second set of weights.

Figure 5.16: Twelve variable case: Second six coefficients' estimates and their true values in different default levels for the second set of weights.

Third Set of Weights

The third set of weights is the following: w_1 = 0.1, w_2 = 0.2, w_3 = 0.4, w_4 = 0.3, w_5 = −0.2, w_6 = −0.5, w_7 = −0.25, w_8 = −0.15, w_9 = 0.35, w_10 = 0.47, w_11 = 0.18 and w_12 = 0.1. This set of weights is selected to show the prediction accuracy in data sets in which some variables have high and some have low effects, and some effects are negative while others are positive.

Figure 5.17: Twelve variable case: Average coefficient of variation of estimators in different default levels for the third set of weights.

According to Figure 5.17 above, the average coefficient of variation of the variables has a peak, for all sizes of data sets, when the data include fewer than 5% defaults. After the 25% default cut-off point, the bias behaves stably.

Figures 5.18 and 5.19 show the estimators and their true values. Comparing the estimated and true values, we can say that the estimation accuracy of the loadings is very good, and as the number of default cases in the data set increases, the coefficient estimates converge to their true values. However, after a level of nearly 30%, there are no big changes in the convergence.

Figure 5.18: Twelve variable case: First six coefficients' estimates and their true values in different default levels for the third set of weights.

Figure 5.19: Twelve variable case: Second six coefficients' estimates and their true values in different default levels for the third set of weights.

5.3 Summary and Conclusion

As a whole, the results for all three cases (one, six and twelve variables) are very similar. Higher loadings destroy the precision of logistic regression, and the precision gets worse as the number of variables increases. More accurate results are obtained with small weights. Moreover, larger data sets show lower bias and more convergent estimates of the weights. Furthermore, if the data set includes nearly 30% default cases, the bias reaches its minimum in some cases or attains its optimal level. As a result, we can conclude that to apply logistic regression it is good to select data sets with at least 30% default cases, with at least 500 observations, and with a small number of variables or with variables having small effects on the probability of default.


chapter 6

ACCURACY RATIO AND METHOD VALIDATIONS

6.1 Introduction

In credit scoring, the most important issue is the discriminative power of the methods, since the methods are used to evaluate creditworthiness in credit decisions, and any classification error can damage the resources of a credit institution [EHT02]. Therefore, in this chapter we focus on validation techniques and on the validation of the methods introduced in the previous chapters.

6.2 Validation Techniques

In the literature, various evaluation techniques can be found. However, the most popular ones are the Cumulative Accuracy Profile and the Receiver Operating Characteristic curve.

Let a method assign to each applicant a score out of k possible values {s_1, s_2, ..., s_k} with s_1 < s_2 < ... < s_k. For example, let k = 10; then s_10 = AAA, s_9 = AA, ... and s_1 = D, where AAA is the highest rating and D is the lowest.

Let us introduce the random variables S_T, S_D and S_ND as the model score distributions of all applicants, of the defaulters and of the non-defaulters, respectively. The probability that a defaulter has score s_i (i = 1, 2, ..., k) is denoted by p_i^D, with Σ_{i=1}^{k} p_i^D = 1. The probability that a non-defaulter has score s_i (i = 1, 2, ..., k) is p_i^ND. Given the a priori default probability π of all applicants, the probability p_i^T can be obtained in the following way:

p_i^T = π p_i^D + (1 − π) p_i^ND. (6.2.1)

Then, the cumulative probabilities are

CD_i^D = Σ_{j=1}^{i} p_j^D (i = 1, 2, ..., k), (6.2.2)

CD_i^ND = Σ_{j=1}^{i} p_j^ND (i = 1, 2, ..., k), (6.2.3)

CD_i^T = Σ_{j=1}^{i} p_j^T (i = 1, 2, ..., k), (6.2.4)

where CD^D, CD^ND and CD^T denote the cumulative distribution functions of the score values of the default, non-default and total applicants, respectively.

Let us illustrate these probabilities with an example.

Example 6.1. In a market, let there be 10 different rating classes: s_10 = AAA, s_9 = AA, s_8 = A, s_7 = BBB, s_6 = BB, s_5 = B, s_4 = CCC, s_3 = CC, s_2 = C and s_1 = D. The numbers of companies in each rating class are summarized in Table 6.1.

Rating Class   Total   Number of default cases   Number of non-default cases
AAA              5                0                          5
AA              12                1                         11
A               20                3                         17
BBB             32                9                         23
BB              27                7                         20
B               34                8                         26
CCC             67               17                         50
CC              42               20                         22
C               17                9                          8
D               12               10                          2
Total          268               84                        184

Table 6.1: The rating classes and total number of observations.

Then,

p_1^D = 10/84,
p_2^D = 9/84,
...
p_10^D = 0/84,

and

Σ_{i=1}^{10} p_i^D = 10/84 + 9/84 + ... + 0 = 1.

Similarly,

p_1^ND = 2/184,
p_2^ND = 8/184,
...
p_10^ND = 5/184,

and

Σ_{i=1}^{10} p_i^ND = 2/184 + 8/184 + ... + 5/184 = 1.

Furthermore, if we select π = 0.4, the total probabilities will be

p_1^T = 0.4 p_1^D + 0.6 p_1^ND = 0.0541,
p_2^T = 0.4 p_2^D + 0.6 p_2^ND = 0.0689,
...
p_10^T = 0.4 p_10^D + 0.6 p_10^ND = 0.0163.

The distribution functions are shown in Table 6.2. We assume CD_0^D = CD_0^ND = CD_0^T = 0.

6.2.1 Cumulative Accuracy Profile (CAP)

The cumulative accuracy profile is basically defined as the graph of all points (CD_i^T, CD_i^D), i = 1, 2, ..., k, where the points are connected by linear interpolation [EHT02].

107

i CD

T

CD

D

CD

ND

1 0,0541 0,119 0,0109

2 0,123 0,2261 0,0544

3 0,29 0,4642 0,174

4 0,534 0,6666 0,4457

5 0,6569 0,7618 0,587

6 0,7555 0,8451 0,6957

7 0,8734 0,9522 0,8207

8 0,9431 0,9879 0,9131

9 0,9837 0,9998 0,9729

10 1 1 1

Table 6.2: The cumulative probability functions.
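The probabilities of Example 6.1 and the cumulative distributions of Table 6.2 can be reproduced numerically. The sketch below is an illustration in Python (not from the thesis); it assumes numpy and stores the counts of Table 6.1 from the lowest class s_1 = D up to s_10 = AAA.

```python
import numpy as np

# Counts from Table 6.1, ordered from the lowest class s_1 = D up to s_10 = AAA
defaults     = np.array([10, 9, 20, 17, 8, 7, 9, 3, 1, 0])
non_defaults = np.array([2, 8, 22, 50, 26, 20, 23, 17, 11, 5])

p_D  = defaults / defaults.sum()            # p_i^D, sums to 1
p_ND = non_defaults / non_defaults.sum()    # p_i^ND, sums to 1
pi   = 0.4                                  # a priori default probability
p_T  = pi * p_D + (1 - pi) * p_ND           # eq. (6.2.1)

# Cumulative distributions of eqs. (6.2.2)-(6.2.4), i.e. the rows of Table 6.2
CD_D, CD_ND, CD_T = np.cumsum(p_D), np.cumsum(p_ND), np.cumsum(p_T)
# e.g. CD_T[0] ≈ 0.0541, CD_D[0] ≈ 0.1190, CD_ND[0] ≈ 0.0109
```

The first entries agree with the worked values above (0.0541, 0.1190, 0.0109), and each cumulative vector ends at 1.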

The CAP of Example 6.1 and that of the random model are shown in Figure 6.1.

Figure 6.1: Cumulative accuracy profile curve of Example 6.1.

In the random model of rating, the lowest-scored α percent of all companies in the study contains α percent of all defaults.

The assignment of the lowest scores to the companies that actually default is a measure of the quality of a rating method and is called the accuracy ratio (AR). The

accuracy ratio is in fact a monotone transformation of the method's strengths and weaknesses into a one-dimensional measure [SKSt00]. As in Figure 6.2, the accuracy ratio is defined as the ratio of the area α_R, between the CAP of the rating model being applied and the CAP of the random model, to the area α_P, between the CAP of the perfect rating model and the CAP of the random model, i.e.,

AR = α_R / α_P. (6.2.5)
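For Example 6.1, α_R can be approximated by trapezoidal integration of the CAP through the points of Table 6.2, while for the perfect model α_P = (1 − π)/2 (the perfect CAP rises to 1 over the default mass π). The Python sketch below is an illustration under these assumptions, recomputing everything from the counts of Table 6.1.

```python
import numpy as np

# Counts from Table 6.1, ordered from the worst class s_1 = D up to s_10 = AAA
defaults     = np.array([10, 9, 20, 17, 8, 7, 9, 3, 1, 0])
non_defaults = np.array([2, 8, 22, 50, 26, 20, 23, 17, 11, 5])
pi = 0.4                                   # a priori default probability

p_D  = defaults / defaults.sum()
p_ND = non_defaults / non_defaults.sum()
p_T  = pi * p_D + (1 - pi) * p_ND          # eq. (6.2.1)

# CAP points (CD_i^T, CD_i^D), with the origin prepended (CD_0 = 0)
x = np.concatenate([[0.0], np.cumsum(p_T)])
y = np.concatenate([[0.0], np.cumsum(p_D)])

area_under = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))  # trapezoidal rule
alpha_R = area_under - 0.5     # area between the model CAP and the diagonal
alpha_P = 0.5 * (1 - pi)       # same area for the perfect model
AR = alpha_R / alpha_P         # eq. (6.2.5), about 0.36 for these counts
```

An AR of roughly 0.36 means the example rating system discriminates clearly better than the random model (AR = 0) but is far from the perfect model (AR = 1).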

Figure 6.2: Cumulative accuracy profile [EHT02].

6.2.2 Receiver Operating Characteristic Curve (ROC)

Let us consider Figure 6.3, which depicts the distributions of the rating scores of default and non-default companies. For a perfect rating model the distributions would not overlap as they do here; they would be separated. The grey areas show the default companies and the white areas the non-default companies. Furthermore, C denotes the cut-off point. Usually, companies with scores lower than C are classified as default, and companies with scores higher than C are classified as non-default.


Figure 6.3: Distribution of rating scores for defaulting and non-defaulting debtors [EHT02].

                 default                     non-default
score below C    correct prediction (hit)   wrong prediction (false alarm)
score above C    wrong prediction (miss)    correct prediction (correct rejection)

Table 6.3: Decision results given the cut-off value C [EHT02].

Table 6.3 summarizes all possible decision results. Accordingly, if the score of a company is below C and the company defaults in the next period, the decision was correct; this situation is called a hit. If such a company does not default, the decision is wrong and is called a false alarm. Similarly, if the score is above C and the company does not default, a correct prediction (correct rejection) is made; otherwise the decision is wrong (a miss).
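The four cells of Table 6.3 can be tallied directly from a vector of scores and default flags. A minimal sketch follows; the function name and the inclusive "at or below C" convention (chosen to match equations (6.2.6)-(6.2.7) below) are this sketch's own.

```python
import numpy as np

def decision_counts(scores, defaulted, C):
    """Counts of the four decision outcomes of Table 6.3 at cut-off C."""
    s = np.asarray(scores)
    d = np.asarray(defaulted, dtype=bool)
    below = s <= C                                        # classified as default
    return {
        "hit":               int(np.sum(below & d)),      # below C and defaulted
        "false_alarm":       int(np.sum(below & ~d)),     # below C, did not default
        "miss":              int(np.sum(~below & d)),     # above C but defaulted
        "correct_rejection": int(np.sum(~below & ~d)),    # above C, no default
    }
```

For example, `decision_counts([1, 2, 3, 4], [1, 1, 0, 0], 2)` gives two hits and two correct rejections.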

To construct the ROC curve, let us first define the hit rate HR(C) at cut-off C as

HR(C) = P(S_D ≤ C), (6.2.6)

and the false alarm rate FAR(C) at C as

FAR(C) = P(S_ND ≤ C), (6.2.7)

where S_D denotes the rating score of a defaulter and S_ND that of a non-defaulter.


Then, the ROC curve is defined as a plot of HR(C) against FAR(C) for all values of C [EHT02]. Figure 6.4 shows an example of a ROC curve. The figure indicates that the rating model gives results between the random model and the perfect model. The nearer the graph of the rating model is to that of the perfect model, the better the predictions of the rating model.

Figure 6.4: Receiver operating characteristic curve [EHT02].

For Example 6.1, let us construct a ROC curve. The distributions for default and non-default companies are shown in Figure 6.5. Accordingly, we can say that perfect discrimination is not possible in our example rating system, since the distributions overlap.


Figure 6.5: Distribution of rating scores for Example 6.1.

In our construction process, we can take the cut-off values as C_1 = 1, C_2 = 2, ..., C_10 = 10. Then, the hit rates HR(C_i) and false alarm rates FAR(C_i) (i = 1, 2, ..., 10) computed from Table 6.1 are

HR(C_1) = 10/84, FAR(C_1) = 2/184;
HR(C_2) = 19/84, FAR(C_2) = 10/184;
...
HR(C_10) = 1, FAR(C_10) = 1.

Then, we can visualize the receiver operating characteristic curve in Figure 6.6.


Figure 6.6: Receiver operating characteristic curve for Example 6.1 (hit rate against false alarm rate, with the random, rating, and perfect models).
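The HR/FAR computation above can be sketched from per-score-class default and non-default counts. Only the first two classes of Table 6.1 are quoted in the text (10 and 9 defaulters, 2 and 8 non-defaulters), so the remaining counts below are placeholders chosen only to match the known totals of 84 defaulters and 184 non-defaulters.

```python
import numpy as np

def roc_from_counts(defaults_per_class, nondefaults_per_class):
    """HR(C_i) and FAR(C_i) for cut-offs at each score class (eqs. 6.2.6-6.2.7)."""
    d = np.asarray(defaults_per_class, dtype=float)
    nd = np.asarray(nondefaults_per_class, dtype=float)
    hr = np.cumsum(d) / d.sum()      # HR(C_i) = P(S_D <= C_i)
    far = np.cumsum(nd) / nd.sum()   # FAR(C_i) = P(S_ND <= C_i)
    return far, hr

# First two classes from the text; the rest are placeholder counts
# summing to 84 defaulters and 184 non-defaulters.
far, hr = roc_from_counts([10, 9, 20, 45], [2, 8, 90, 84])
```

Plotting `hr` against `far` (with the origin prepended) reproduces the empirical ROC curve of Figure 6.6.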

6.3 Method Validations

6.3.1 Data and Methodology

Some statistical and non-statistical credit scoring methods, with their advantages and disadvantages, were discussed in the previous chapters. However, we did not consider which one is the best in scoring. In this section, the validation of the methods mentioned in Chapter 3 and Chapter 4 (discriminant analysis, linear regression, probit regression, logistic regression, semi-parametric logistic regression, classification and regression trees, and neural networks) will be presented and applied.

The data used in this part were collected over the period 1995-2005. The data set describes the financial situations of 1000 companies, 247 of which are defaulters, and includes 17 explanatory variables:

I. Liquidity Ratios:
X_1: current ratio,
X_2: liquidity ratio,
X_3: cash ratio.

II. Activity Ratios:
X_4: receivables turnover ratio,
X_5: inventory turnover ratio,
X_6: fixed-assets turnover ratio,
X_7: equity turnover ratio,
X_8: (total liability)/(total assets),
X_10: (total liability)/equity.

III. Profitability Ratios:
X_11: gross merchandise margin,
X_12: net profit margin,
X_13: active profitability ratio,
X_14: equity profitability ratio.

IV. Growth Ratios:
X_15: increase in sales,
X_16: active growth rate,
X_17: equity growth rate.

To obtain validation results, we follow a type of cross-validation simulation with 1000 repetitions for every method. In the cross-validation methodology, we applied the following steps:

Step 1: A random data set of size 1000 is constructed from the actual data set with the help of uniform random numbers.

Step 2: Random data sets of sizes 250 and 500 are selected from the random sample of size 1000.

Step 3: 80% of each data set is used as the training sample and the remaining 20% as the validation sample. The method is applied to all three data sets.

Step 4: The mean square errors (MSE) and classification accuracies are recorded for each set.

Step 5: Steps 1-4 are repeated 1000 times and the MSEs and classification accuracies are averaged.
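The resampling scheme above can be sketched as follows. The least-squares scoring model, the synthetic data, the 100 rounds (instead of 1000, for speed), and the helper names `fit_linear` and `one_round` are all illustrative assumptions of this sketch, not the thesis code; any of the seven methods would take the place of the fit/predict step.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares scoring model, a stand-in for any of the methods."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def one_round(X, y, size, rng):
    """One round: resample (Steps 1-2), split 80/20 and fit (Step 3), score (Step 4)."""
    idx = rng.integers(0, len(X), size=size)      # uniform random resampling
    Xs, ys = X[idx], y[idx]
    cut = int(0.8 * size)                         # 80% training, 20% validation
    beta = fit_linear(Xs[:cut], ys[:cut])
    pred = np.column_stack([np.ones(size - cut), Xs[cut:]]) @ beta
    mse = np.mean((ys[cut:] - pred) ** 2)         # validation-sample error
    acc = np.mean((pred > 0.5) == ys[cut:])       # validation-sample accuracy
    return mse, acc

# Step 5: repeat and average (100 rounds here; the thesis uses 1000)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(float)
results = [one_round(X, y, 500, rng) for _ in range(100)]
avg_mse, avg_acc = np.mean(results, axis=0)
```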


6.3.2 Results

The average MSEs and accuracies of classiﬁcation techniques are summarized

in Table 6.4. The estimations were done in a way that the signiﬁcance of coeﬃ-

cients were not compared. Because we tried to compare models instead of looking

signiﬁcance of the variables on credit scoring. Moreover, since the discriminant

analysis and classiﬁcation and regression trees (CART) give the outputs 0 or 1

only, we did not calculate the mean square errors for these methods. Accord-

ing to this table, the discriminant analysis gave nearly 96% accuracy in training

samples and 95% accuracy in validation samples of small sizes, i.e., 250 and 500.

However, its accuracy is dropping 4% for large samples. A similar situation is

also valid for neural network. For small sample sizes, the accuracy of it is much

higher. Furthermore, for neural networks, when sample size is getting larger, the

MSEs are rising sharply.

The most interesting method is CART. It turns out to be a the perfect model

in the sample of size 1000. However, for other sizes its accuracy in scoring is very

low. It is about 60%.

In addition to CART, the semi-parametric regression with a logistic link is the other method that gave the worst results in both accuracy and MSE. Its classification accuracy is between 70% and 80% for all sample sizes.

Moreover, among all the methods, logistic regression is the best scoring method; its accuracy is above 98% for all sizes. The closest method is the probit, whose estimation accuracy is also very high, although it decreases for large samples.

To check the validity of the methods we constructed receiver operating characteristic curves for each method in the sample of size 1000. Figures 6.7 and 6.8 show the ROC curves for the validation and training samples of our analysis, respectively.

From these figures, we could not observe any big differences between the validation and training sample results. The graphs show that CART is the perfect scoring method and that logistic regression is near to it. The probit is also near the perfect classification model. However, the graphs of semi-parametric regression are far from the perfect model, so it is the worst model in both the validation and training samples.

Method                       data size   training sample       validation sample
                                         MSE       accuracy    MSE       accuracy
Discriminant Analysis        250         -         96.64%      -         95.35%
                             500         -         96.19%      -         95.67%
                             1000        -         92.53%      -         92.00%
Linear Regression            250         11.663    93.20%      4.3522    91.59%
                             500         24.712    92.56%      6.8408    91.84%
                             1000        51.107    92.26%      13.419    91.72%
Probit Regression            250         2.8805    98.23%      1.4438    96.56%
                             500         8.0458    97.42%      2.5722    96.72%
                             1000        19.4081   96.96%      5.3519    96.62%
Logistic Regression          250         1.1472    99.51%      0.9881    97.99%
                             500         3.3116    99.31%      1.2670    98.76%
                             1000        7.6038    99.08%      2.2173    98.82%
CART                         250         -         63.48%      -         63.66%
                             500         -         63.31%      -         63.43%
                             1000        -         100%        -         100%
Semi-parametric Regression   250         56        72%         10        80%
                             500         90        78%         18        83%
                             1000        182       78.13%      45        78.1%
Neural Networks              250         9.3362    98.5%       4.0003    96%
                             500         52.8      95.42%      17.921    93.2%
                             1000        101.08    93.5%       32.211    92.5%

Table 6.4: The MSE and accuracy results of the methods.


Figure 6.7: Receiver operating characteristic curves for the validation sample of size 1000 (one panel per method: discriminant analysis, linear regression, probit, logistic regression, semi-parametric regression, CART, and neural networks; each panel plots hit rate against false alarm rate together with the random and perfect models).


Figure 6.8: Receiver operating characteristic curves for the training sample of size 1000 (same panels as Figure 6.7).


6.3.3 Ratings via Logistic and Probit Regressions

In conclusion, the results show that logistic regression provides the best accuracy in validation. Therefore, for our data set, it is sensible to use it for the final conclusions and ratings. Moreover, to comment on and compare its results, we also apply probit regression.

Here, we again take 20% of the data set as the validation sample and the remaining 80% as the training sample. Table 6.5 summarizes the coefficient estimates and p-values. Accordingly, the p-values of the logistic regression show that the variables x_5, x_8, x_11 and x_15 are not significant in the estimation. Similarly, in the probit regression the coefficients of x_3, x_5, x_8 and x_14 are not significant at α = 0.05. To compare the two regression models, we excluded only the variables x_5 and x_8, which are insignificant for both models, and fitted the regressions with the remaining variables.

Variables   Logistic Regression        Probit Regression
            coefficient    p-value     coefficient    p-value
x_0         -18.3391       0.0001      -7.0792        <0.0001
x_1         0.0664         <0.0001     0.0162         0.0045
x_2         -0.1272        <0.0001     -0.0327        <0.0001
x_3         -0.1031        0.0042      -0.0209        0.1081
x_4         0.0485         0.0001      0.0071         0.0251
x_5         -0.0710        0.2250      -0.0252        0.1037
x_6         -0.1343        <0.0001     -0.0479        0.0016
x_7         0.0332         <0.0001     0.0080         0.0005
x_8         0.0346         0.1945      0.0138         0.1174
x_9         0.0144         <0.0001     0.0061         <0.0001
x_10        0.1845         <0.0001     0.0646         <0.0001
x_11        0.0091         0.2933      0.0105         0.0251
x_12        -0.0432        0.0182      0.0344         <0.0001
x_13        -0.1650        <0.0001     -0.0328        <0.0001
x_14        0.0455         0.0153      0.0116         0.1512
x_15        0.0032         0.1277      0.0015         0.0084
x_16        -0.0057        0.0016      -0.0031        0.0049
x_17        0.0262         <0.0001     0.0072         <0.0001

Table 6.5: The coefficients and p-values of the logistic and probit regressions.

Tables 6.6 and 6.7 present the classification errors of both models for the training and validation samples. Accordingly, the classification errors for non-default firms are the same in the logistic and probit regression models for both samples. However, for default firms, while the logistic regression gives no errors, the probit regression gives substantial misclassification errors in both samples at the probability cut-off 0.5. In fact, instead of this cut-off, the optimum cut-off should be used; however, we used 0.5 to illustrate some failures of the probit model. Furthermore, the results indicate that the probit regression assigns too-low probabilities to the default firms.

Actual class    Logistic Regression           Probit Regression
                default     non-default       default     non-default
default         198         0                 170         28
non-default     7           595               7           595

Table 6.6: Training sample classification results.

The ratings are assigned to the responses according to the estimated probabilities of both models. To select the optimum cut-off probabilities of the rating categories, we implement the following algorithm.


Actual class    Logistic Regression           Probit Regression
                default     non-default       default     non-default
default         49          0                 42          7
non-default     2           149               2           149

Table 6.7: Validation sample classification results.

This algorithm first observes the repeated probabilities, splits the data into several parts accordingly, and treats these probabilities as candidate cut-offs. Then, the algorithm tries different combinations of these cut-off probabilities and selects the optimum split, where the optimum cut-off set is the one that maximizes the area under the receiver operating characteristic curve. We ran the algorithm for 8, 9 and 10 rating categories in our analysis, and also compared them to find the optimum number of categories.
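A minimal sketch of this selection step follows. The exhaustive search over combinations of candidate cut-offs is feasible only for small candidate sets (the thesis algorithm restricts the candidates to the repeated probabilities), and the helper names `rating_auc` and `optimum_cutoffs`, as well as the tiny demonstration data, are this sketch's own.

```python
import numpy as np
from itertools import combinations

def trapezoid(y, x):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def rating_auc(pd_hat, y, cutoffs):
    """AUC of the ROC built from the rating bins defined by `cutoffs`.

    Assumes both defaulters (y == 1) and non-defaulters (y == 0) are present.
    """
    bins = np.digitize(pd_hat, cutoffs)              # rating bin per company
    far, hr = [0.0], [0.0]
    for b in range(len(cutoffs), -1, -1):            # worst (highest-PD) bin first
        hr.append(hr[-1] + np.sum((bins == b) & (y == 1)) / y.sum())
        far.append(far[-1] + np.sum((bins == b) & (y == 0)) / (1 - y).sum())
    return trapezoid(hr, far)

def optimum_cutoffs(pd_hat, y, n_categories):
    """Exhaustive search for the cut-off set maximizing the area under the ROC."""
    cands = np.unique(pd_hat)[:-1]                   # distinct PDs as candidate cuts
    best_auc, best_cut = -1.0, None
    for cut in combinations(cands, n_categories - 1):
        a = rating_auc(pd_hat, y, cut)
        if a > best_auc:
            best_auc, best_cut = a, cut
    return best_auc, best_cut

# tiny demonstration: with clearly separated PDs, one cut already yields AUC 1
best_auc, best_cut = optimum_cutoffs(np.array([0.1, 0.2, 0.8, 0.9]),
                                     np.array([0, 0, 1, 1]), 2)
```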

The optimum cut-off probabilities of the logistic and probit regressions for 10 rating classes are summarized in Tables 6.8 and 6.9, respectively:

Ratings   Probability of default from logistic regression
          validation sample                       training sample
          lower range         upper range         lower range       upper range
AAA       0.000               2.3031*10^-10       0.000             6.27*10^-12
AA        2.3032*10^-10       1.0721110*10^-8     6.28*10^-12       1.03207*10^-8
A         1.072111*10^-8      1.4504844*10^-7     1.03207*10^-8     9.937391*10^-6
BBB       1.4504845*10^-7     3.86603848*10^-6    9.937392*10^-6    0.0000014
BB        3.86603849*10^-6    0.0006369           0.0000015         0.0000232
B         0.0006369           0.007774            0.0000233         0.000139
CCC       0.007775            0.007806            0.004653          0.0001391
CC        0.0078067           0.073298            0.0001391         0.020088
C         0.073299            0.77553             0.020089          0.60002
D         0.77554             1.000               0.60003           1.000

Table 6.8: Optimum cut-off probabilities of default of the logistic regression for 10 rating categories.


From Tables 6.8 and 6.9 we observe that the logistic regression is more sensitive to the probability of default and assigns high ratings only to very low probabilities, whereas the probit regression assigns only very high probabilities to the low rating classes.

Ratings   Probability of default from probit regression
          validation sample                   training sample
          lower range       upper range       lower range       upper range
AAA       0.000             2.346*10^-11      0.000             2.346*10^-11
AA        2.346*10^-11      5.64459*10^-7     2.346*10^-11      2.07457*10^-7
A         5.64459*10^-7     4.20266*10^-5     2.07457*10^-7     1.21604*10^-5
BBB       4.20266*10^-5     0.000692          1.21604*10^-5     0.000362
BB        0.000692          0.003843          0.000362          0.00231
B         0.0038434         0.013068          0.002317          0.00853
CCC       0.01306           0.158488          0.00853           0.065516
CC        0.15848           0.63818           0.06551           0.28311
C         0.63818           0.99752           0.28311           0.95799
D         0.99752           1.000             0.95799           1.000

Table 6.9: Optimum cut-off probabilities of default of the probit regression for 10 rating categories.

Table 6.10 shows the number of companies in each rating category and the number of defaults in each category. Accordingly, while the logistic regression assigns most of the defaults to the worst rating category "D", the probit regression spreads the default observations over the categories "CC", "C" and "D".


Ratings   Logistic Regression             Probit Regression
          validation     training         validation     training
AAA       20             67               25             69
AA        14             83               20             78
A         18             67               14             66
BBB       13             87               19             79
BB        18             73               18             92
B         14             74               18             73
CCC       17             71               21             60
CC        20             66               21 (Def=7)     75
C         20 (Def=4)     15               19 (Def=17)    59 (Def=49)
D         45 (Def=45)    198 (Def=198)    25 (Def=25)    149 (Def=149)

Table 6.10: Ratings of companies for 10 rating classes (Def: number of defaults in rating categories).

The optimum cut-off probabilities for the cases of 9 and 8 rating categories are given in the Appendix.

Accordingly, the results show that more accurate ratings of the companies are obtained in the 10-rating-category case. Furthermore, Table 6.11 gives the areas under the ROC curves; the area is maximized when 10 rating categories are used, for both the logistic and the probit regression.

                        Number of rating categories
                        8         9         10
Logistic    validation  0.9270    0.9835    0.9957
Regression  training    0.9327    0.9875    0.9958
Probit      validation  0.9470    0.9865    0.9899
Regression  training    0.9497    0.9917    0.9994

Table 6.11: The areas under the ROC curve for the optimum cut-off probabilities.


chapter 7

CONCLUSION

In this work, we gave an overview of the theoretical aspects of important statistical and non-statistical methods and their applications in credit scoring. In the application part, we first focused on logistic regression, which is the most widely used method in the literature. By Monte Carlo simulations, we examined the bias of logistic regression in parameter estimation, using data sets with various percentages of defaults and different numbers of variables. Our results show that logistic regression performs well when the data set includes nearly 30% default cases. Moreover, if the set of independent variables is very large and some variables have strong effects on the dependent variable, the coefficient estimates deviate considerably from their true values. Secondly, we checked the prediction accuracies of all the methods mentioned in this work by cross-validation simulations on real Turkish credit data. The results of the analysis show that logistic regression is the best classification technique for the Turkish credit data for small, medium and large sample sizes.


references

[A68] Altman, E., Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, The Journal of Finance (September 1968) 589-609.

[AL76] Altman, E.I., and Loris, B., A financial early warning system for over-the-counter broker-dealers, Journal of Finance, 31:4 (September 1976) 1201-1217.

[Ati01] Atiya, A.F., Bankruptcy prediction for credit risk using neural networks: a survey and new results, IEEE Transactions on Neural Networks, 12 (July 2001) 929-935.

[BLSW96] Back, B., Laitinen, T., Sere, K., and van Wezel, M., Choosing bankruptcy predictors using discriminant analysis, logit analysis, and genetic algorithms, Turku Centre for Computer Science Technical Report 40 (September 1996).

[BO04] Balcaen, S., and Ooghe, H., Alternative methodologies in studies on business failure: do they produce better results than the classical statistical methods?, Working Paper: Universiteit Gent, Faculteit Economie, no. 249 (2004).

[B66] Beaver, W., Financial ratios as predictors of failure, Empirical Research in Accounting: Selected Studies 1966, supplement to vol. 5, Journal of Accounting Research (1966) 71-111.

[Bl74] Blum, M., Failing company discriminant analysis, Journal of Accounting Research 1 (1974) 1-25.

[B04] Boik, R.J., Classical Multivariate Analysis, Lecture Notes: Statistics 537, Department of Mathematical Sciences, Montana State University Bozeman (2004).

[BFOS84] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees, Chapman and Hall, New York, London (1984).

[CF93] Coats, P., and Fant, L., Recognizing financial distress patterns using a neural network tool, Financial Management 22 (1993) 142-155.

[CET02] Crook, J.N., Edelman, D.B., and Thomas, L.C., Credit Scoring and Its Applications, SIAM (2002).

[DK80] Dambolena, I.G., and Khoury, S.J., Ratio stability and corporate failure, The Journal of Finance (September 1980) 1017-1026.

[D72] Deakin, E.B., A discriminant analysis of predictors of business failure, Journal of Accounting Research, 10:1 (Spring 1972) 167-179.

[EHT02] Engelmann, B., Hayden, E., and Tasche, D., Measuring the discriminative power of rating systems (2002), www.defaultrisk.com.

[FAK85] Frydman, H., Altman, E.I., and Kao, D.-L., Introducing recursive partitioning for financial classification: the case of financial distress, The Journal of Finance XL:1 (March 1985).

[HM01] Hajdu, O., and Miklós, V., A Hungarian model for predicting financial bankruptcy, Society and Economy in Central and Eastern Europe, quarterly Journal of the Budapest University of Economic Sciences and Public Administration, XXIII:1-2 (2001) 28-46.

[HMSW04] Härdle, W., Müller, M., Sperlich, S., and Werwatz, A., Nonparametric and Semiparametric Models, Springer, Berlin (2004).

[HS95] Hassoun, M.H., Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, Mass. (1995).

[H94] Haykin, S., Neural Networks: A Comprehensive Foundation, Macmillan, New York (1994).

[HKGB99] Herbrich, R., Keilbach, M., Graepel, T., Bollmann-Sdorra, P., and Obermayer, K., Neural networks in economics, http://www.smartquant.com/references/NeuralNetworks/neural7.pdf (1999).

[Horrowitz] Horowitz, J.L., Semiparametric Single-Index Models, Lecture Notes: Department of Economics, Northwestern University.

[H66] Horrigan, J.O., The determination of long-term credit standing with financial ratios, Journal of Accounting Research, Empirical Research in Accounting: Selected Studies 4 (1966) 44-62.

[HL00] Hosmer, D.W., and Lemeshow, S., Applied Logistic Regression, Wiley, New York (2000).

[JW97] Johnson, R.A., and Wichern, D.W., Applied Multivariate Statistical Analysis, Prentice Hall, New Jersey.

[K98] Kiviluoto, K., Predicting bankruptcies with the self-organizing map, Neurocomputing 21 (1998) 191-201.

[KGSC00] Kolari, J., Glennon, D., Shin, H., and Caputo, M., Predicting large U.S. commercial bank failures, Economic and Policy Analysis Working Paper 2000-1 (January 2000).

[L99] Laitinen, E.K., Predicting a corporate credit analyst's risk estimate by logistic and linear models, International Review of Financial Analysis, 8:2 (1999) 97-121.

[LK99] Laitinen, T., and Kankaanpää, M., Comparative analysis of failure prediction methods: the Finnish case, The European Accounting Review 8:1 (1999) 67-92.

[Lib75] Libby, R., Accounting ratios and the prediction of failure: some behavioral evidence, Journal of Accounting Research, 13:1 (Spring 1975) 150-161.

[MG00] McKee, T.E., and Greenstein, M., Predicting bankruptcy using recursive partitioning and a realistically proportioned data set, Journal of Forecasting 19 (2000) 219-230.

[M66] Mears, P.K., Discussion of financial ratios as predictors of failure, Journal of Accounting Research, Empirical Research in Accounting: Selected Studies 4 (1966) 119-122.

[M04] Monsour, C., Discrete predictive modeling, Casualty Actuarial Society, Special Interest Seminar on Predictive Modeling, Chicago (2004).

[Mu00] Müller, M., Semiparametric extensions to generalized linear models, Working Paper, Humboldt-Universität zu Berlin (2000).

[MR99] Müller, M., and Rönz, B., Credit scoring using semiparametric methods, Humboldt-Universität zu Berlin, Sonderforschungsbereich 373, no. 1999-93 (1999).

[N66] Neter, J., Discussion of financial ratios as predictors of failures, Journal of Accounting Research, Empirical Research in Accounting: Selected Studies 4 (1966) 112-118.

[OS90] Odom, M.D., and Sharda, R., A neural network model for bankruptcy prediction, IJCNN International Joint Conference on Neural Networks, 2 (June 1990) 163-168.

[O80] Ohlson, J.A., Financial ratios and the probabilistic prediction of bankruptcy, Journal of Accounting Research, 18 (Spring 1980) 109-131.

[Org70] Orgler, Y.E., A credit scoring model for commercial loans, Journal of Money, Credit and Banking 2 (1970) 435-445.

[PS69] Pogue, T.F., and Soldofsky, R.M., What's in a bond rating, Journal of Financial and Quantitative Analysis, 4:2 (June 1969) 201-228.

[PF97] Pompe, P.P.M., and Feelders, A.J., Using machine learning, neural networks and statistics to predict corporate bankruptcy, Microcomputers in Civil Engineering, 12 (1997) 267-276.

[Probit] Probit regression, http://www.gseis.ucla.edu/courses/ed231a1/notes2/probit1.html.

[RG94] Rosenberg, E., and Gleit, A., Quantitative methods in credit management: a survey, Operations Research, 42:4 (1994) 589-613.

[S75] Sinkey, J.F., A multivariate analysis of the characteristics of problem banks, The Journal of Finance, 30 (March 1975) 21-36.

[SKSt00] Sobehart, J.R., Keenan, S.C., and Stein, R.M., Benchmarking quantitative default risk models: a validation methodology, Moody's Rating Methodology (2000).

[TK92] Tam, K.Y., and Kiang, M.Y., Managerial applications of neural networks: the case of bank failure predictions, Management Science, 38 (July 1992) 926-947.

[Wil71] Wilcox, J.W., A simple theory of financial ratios as predictors of failure, Journal of Accounting Research, 9:2 (1971) 389-395.

[Wil73] Wilcox, J.W., A prediction of business failure using accounting data, Journal of Accounting Research, Empirical Research in Accounting: Selected Studies 11 (1973) 163-179.

[WBS97] Wong, B.K., Bodnovich, T.A., and Selvi, Y., Neural network applications in business: a review and analysis of the literature (1988-95), Decision Support Systems 19 (1997) 301-320.

[YW99] Yohannes, Y., and Webb, P., Classification and Regression Trees, CART: A User Manual for Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity, International Food Policy Research Institute (1999).


chapter 8

APPENDIX

Ratings   Probability of default from logistic regression
          validation sample                     training sample
          lower range        upper range        lower range       upper range
AAA       0.000              2.4201*10^-10      0.000             1.3673*10^-10
AA        2.4201*10^-10      2.76115*10^-8      1.3673*10^-10     1.07211*10^-8
A         2.76115*10^-8      2.20445*10^-5      1.07211*10^-8     3.62314*10^-7
BBB       2.20445*10^-5      0.000139           3.62314*10^-7     1.15619*10^-5
BB        0.000139           0.007353           1.15619*10^-5     8.10605*10^-5
B         0.007353           0.007774           8.10605*10^-5     0.003895
C         0.007774           0.073298           0.003895          0.02008
D         0.073298038        1.000              0.02008           1.000

Table 8.1: Optimum cut-off probabilities of default of the logistic regression for 8 rating categories.

Ratings   Probability of default from probit regression
          validation sample                   training sample
          lower range      upper range        lower range      upper range
AAA       0.000            2.346*10^-11       0.000            2.346*10^-11
AA        2.346*10^-11     5.6445*10^-7       2.346*10^-11     2.0745*10^-7
A         5.6445*10^-7     4.2026*10^-5       2.0745*10^-7     1.2160*10^-5
BBB       4.2026*10^-5     0.000692           1.2160*10^-5     0.000362
BB        0.000692         0.00384            0.000362         0.002317
B         0.00384          0.01306            0.002317         0.00853
C         0.01306          0.15848            0.00853          0.06551
D         0.15848          1.000              0.06551          1.000

Table 8.2: Optimum cut-off probabilities of default of the probit regression for 8 rating categories.


Ratings   Logistic Regression        Probit Regression
          validation    training     validation    training
AAA       22            66           25            69
AA        17            83           20            78
A         21            67           14            66
BBB       16            87           19            79
BB        19            73           18            92
B         19            74           18            73
C         21            71           21            60
D         65            279          65            283

Table 8.3: Ratings of companies for 8 rating classes (Def: number of defaults in rating categories).

Ratings   Probability of default from logistic regression
          validation sample                    training sample
          lower range       upper range        lower range      upper range
AAA       0.000             2.4201*10^-10      0.000            1.367*10^-10
AA        2.4201*10^-10     2.76115*10^-8      1.367*10^-10     1.072*10^-8
A         2.76115*10^-8     1.0348*10^-6       1.072*10^-8      3.623*10^-7
BBB       1.0348*10^-6      2.33465*10^-5      3.623*10^-7      1.15619*10^-5
BB        2.33465*10^-5     0.000279           1.15619*10^-5    8.10605*10^-5
B         0.000279          0.00777            8.10605*10^-5    0.00389
CC        0.00777           0.07329            0.00389          0.02008
C         0.07329           0.79767            0.02008          0.1882
D         0.79767           1.000              0.1882           1.000

Table 8.4: Optimum cut-off probabilities of default of the logistic regression for 9 rating categories.


Ratings   Probability of default from probit regression
          validation sample                  training sample
          lower range     upper range        lower range     upper range
AAA       0.000           2.346*10^-11       0.000           2.346*10^-11
AA        2.346*10^-11    5.644*10^-7        2.346*10^-11    2.074*10^-7
A         5.644*10^-7     4.202*10^-5        2.074*10^-7     1.216*10^-5
BBB       4.202*10^-5     0.000692           1.216*10^-5     0.00036
BB        0.000692        0.00384            0.00036         0.00231
B         0.00384         0.01306            0.00231         0.00853
CC        0.01306         0.15848            0.00853         0.06551
C         0.15848         0.63818            0.06551         0.28311
D         0.63818         1.000              0.28311         1.000

Table 8.5: Optimum cut-off probabilities of default of the probit regression for 9 rating categories.

Ratings   Logistic Regression             Probit Regression
          validation     training         validation     training
AAA       22             66               25             69
AA        17             83               20             78
A         21             67               14             66
BBB       16             87               19             79
BB        19             73               18             92
B         19             74               18             73
CC        21             71               21             60
C         22 (Def=6)     66               21 (Def=7)     75
D         43 (Def=43)    213 (Def=198)    44 (Def=42)    208 (Def=198)

Table 8.6: Ratings of companies for 9 rating classes (Def: number of defaults in rating categories).
