
SUPPORT VECTOR MACHINES FOR CREDIT SCORING
Michal Haltuf
Diploma thesis
Binary classification
Explanatory variables $x_i \in \mathbb{R}^n$
Each observation belongs to one of two classes $y_i \in \{-1, +1\}$
Given the training dataset $\{(x_i, y_i)\}_{i=1}^{m}$, find a function $f(x)$ that correctly assigns a new observation to its class
Linear classifiers
Interpret each data point as an n-dimensional vector
Draw a line (plane, hyperplane) that BEST separates the classes:
$w \cdot x + b = 0$
Infinite number of separating lines (w, b). Which one is best?
Credits: Fisher, 1936
Linear classifier
What does BEST mean?
Different answers ...
Different assumptions ...
... Different classifiers
Fisher discriminant (LDA)
Logistic regression
Perceptron
Support vector machines
Linear classifiers
... but each of them can be expressed as a linear separating hyperplane $w \cdot x + b = 0$ (see the sketch below)
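To make this concrete, here is a minimal sketch (assuming scikit-learn and a synthetic two-class dataset, neither of which comes from the thesis): each of the four classifiers exposes its fitted hyperplane as a weight vector w (coef_) and intercept b (intercept_).

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

# Hypothetical two-class toy data
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic regression": LogisticRegression(),
    "Perceptron": Perceptron(),
    "Linear SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, y)
    # Each fitted model defines a separating hyperplane w . x + b = 0
    print(name, "w =", model.coef_.ravel(), "b =", model.intercept_)
```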
Support vector machines

= Maximum margin hyperplane

+ Kernel trick

+ Soft-margin (slack variable)


Maximum margin hyperplane
Linearly separable case
Find (w, b) such that the corresponding hyperplane maximizes the distance from the nearest point
Euclidean distance:
$d_i = \frac{y_i (w \cdot x_i + b)}{\|w\|}$
Scale (w, b) so that $y_i (w \cdot x_i + b) \ge 1$ (constraint)
Credits: Yifan Peng, 2013
Optimization
Maximize $\frac{1}{\|w\|}$ s.t. the constraint
= minimization of $\|w\|$
= minimization of $\frac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1, \ \forall i$
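A sketch of this primal quadratic program, under the assumption that cvxpy is available; the four data points are hypothetical and only serve to show the constraint structure.

```python
import cvxpy as cp
import numpy as np

# Tiny, hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize 1/2 ||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, "b =", b.value)
```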
Lagrange function
Primal optimization problem:
$p^* = \min_{w, b} \max_{\alpha_i \ge 0} L(w, b, \alpha)$
Dual optimization problem:
$d^* = \max_{\alpha_i \ge 0} \min_{w, b} L(w, b, \alpha)$
Generally $d^* \le p^*$ ... duality gap
Karush-Kuhn-Tucker (KKT) conditions $\Rightarrow$ $d^* = p^*$
Lagrangian dual
$\arg\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t.
$\sum_{i=1}^{m} \alpha_i y_i = 0$
$\alpha_i \ge 0, \ \forall i$
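The same toy problem in its dual form, again assuming cvxpy; the quadratic term is written as $\|G^T \alpha\|^2$ with $G$ stacking the rows $y_i x_i$, which is algebraically equivalent and keeps the problem convex by construction.

```python
import cvxpy as cp
import numpy as np

# Same hypothetical toy data as in the primal sketch above
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = y[:, None] * X            # rows are y_i * x_i, so (G G^T)_ij = y_i y_j (x_i . x_j)
alpha = cp.Variable(len(y))

# max  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", np.round(alpha.value, 4))
print("w = sum_i alpha_i y_i x_i =", G.T @ alpha.value)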
Support vectors
$w = \sum_{i=1}^{m} \alpha_i^* y_i x_i$
$\hat{y} = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i^* y_i (x_i \cdot x) + b^*\right)$
By the KKT complementarity condition, most $\alpha_i = 0$
Non-zero only for the few vectors that are closest to the separating hyperplane
⇨ SUPPORT VECTORS
Credits: Yifan Peng, 2013
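A sketch of reading off the support vectors with scikit-learn's SVC (which wraps LIBSVM); the blob data are hypothetical, and the large C only approximates the hard-margin case.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical, well-separated data
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard margin

print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i at the support vectors:\n", clf.dual_coef_)  # alpha_i = 0 everywhere else
```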
Kernel function
Some data are not linearly separable
SOLUTION:
Choose a suitable mapping function $\phi(x)$
Map vectors from the original input space to a higher-dimensional feature space
Find a linear separating hyperplane in the feature space
Cover's theorem
$(x_1, x_2) \to [x_1, x_2, x_1^2 + x_2^2]$
Credits: Eric Kim, 2013
Kernel trick
Our solution depends only on the dot products
Therefore we do not need the mapping function $\phi(x)$ explicitly
Instead, use a suitable kernel $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$
WHY?
$O(n^d) \to O(n)$ problem
Allows mapping into an infinite-dimensional space
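A small numerical check of this identity for one common quadratic map, $\phi(x_1, x_2) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$, whose kernel is $K(x, z) = (x \cdot z)^2$; this map is chosen for illustration and is not the one pictured on the Kernel function slide.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-dimensional inputs
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))     # dot product computed in the 3-dimensional feature space
print((x @ z) ** 2)        # kernel K(x, z) = (x . z)^2 -- same value, no mapping needed
```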


Gaussian Kernel
Show that $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \phi(x) \cdot \phi(x')$
$\exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|x'\|^2}{2\sigma^2}\right) \exp\left(\frac{x \cdot x'}{\sigma^2}\right)$
Apply the Taylor expansion $e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!}$
In $\mathbb{R}^1$ ... $= \exp\left(-\frac{x^2}{2\sigma^2}\right) \exp\left(-\frac{x'^2}{2\sigma^2}\right) \left(1 \cdot 1 + \frac{x x'}{\sigma^2 \, 1!} + \frac{(x x')^2}{\sigma^4 \, 2!} + \frac{(x x')^3}{\sigma^6 \, 3!} + \cdots\right)$
$\phi(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right) \left[1; \ \frac{x}{\sigma \sqrt{1!}}; \ \frac{x^2}{\sigma^2 \sqrt{2!}}; \ \frac{x^3}{\sigma^3 \sqrt{3!}}; \ \ldots\right]^T$
Function $\phi$ maps vector x into an infinite-dimensional space

Credits: Chih-Jen Lin, Support Vector Machines, MLSS, Taipei, 2006
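A quick numerical confirmation of the Gaussian kernel formula, assuming scikit-learn's rbf_kernel, which is parameterised by $\gamma = 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])
sigma = 1.5

# K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), i.e. gamma = 1 / (2 sigma^2)
explicit = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
via_library = rbf_kernel(x, z, gamma=1.0 / (2 * sigma ** 2))[0, 0]
print(explicit, via_library)   # the two values agree
```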


Soft margin
Real data are noisy, rarely separable in ANY feature space
SOLUTION:
Allow some data to appear on the "wrong" side of the hyperplane – slack variables
Constraint:
$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
Credits: Stephen Cronin, 2010
Soft margin
Adjust the objective function with the cost parameter C:
$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$
The dual of the problem simplifies to a very convenient form; it affects only one constraint and nothing else:
$C \ge \alpha_i \ge 0, \ \forall i$
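A sketch of the effect of C (scikit-learn assumed, data synthetic): the smaller the cost parameter, the more slack is tolerated and the more support vectors remain.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical noisy data: flip_y injects label noise so no perfect separation exists
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C -> wide margin, many slack violations, many support vectors
    print(f"C = {C}: {clf.n_support_.sum()} support vectors")
```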
Probabilities of default
Unlike logistic regression, SVM does not give PDs:
+1/−1: binary output not sufficient
Distance from the separating hyperplane – no interpretation
Different approaches:
Binning
Platt's scaling
$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}$
Binning – Obtaining Calibrated Probability Estimates from Support Vector Machines, J. Drish, 2001.
Platt's scaling – Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, J. C. Platt, 1999.
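A sketch of Platt's scaling in practice, assuming scikit-learn: SVC(probability=True) fits the sigmoid above on the decision values via internal cross-validation (CalibratedClassifierCV(method="sigmoid") is the more explicit route). The data are synthetic stand-ins for a credit sample.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data standing in for the credit sample
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes scikit-learn fit Platt's sigmoid 1 / (1 + exp(A f + B))
# on the SVM decision values using internal cross-validation
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test[:5]))   # per-class probabilities, usable as PDs
```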
Implementation
Quadratic programming – matrix of size $n^2$ (see the back-of-the-envelope sketch below)
Gradient methods – Lagrange multipliers $\alpha$ are not independent
Sequential minimal optimization
Standard packages, libraries (LIBSVM)
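A back-of-the-envelope illustration of the first point; the value of n is arbitrary.

```python
# Why naive QP does not scale: the full kernel matrix is n x n double-precision entries
n = 100_000
print(f"{8 * n ** 2 / 1e9:.0f} GB needed for the kernel matrix at n = {n:,}")
```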


Real-data application
Peer-to-peer lending data set, Bondora.com
2932 observations
656 defaults
37 explanatory variables
Compare models' performance with AUC

Logistic regression – Benchmark model
Two methods of categorical variable encoding – dummy variables, "woeisation" (a WOE sketch follows below)
15 significant variables (p < 0.05)
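A hedged sketch of the "woeisation" idea for a single categorical variable, using pandas; the column names and values are invented for illustration and do not come from the Bondora data.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable and default flag
df = pd.DataFrame({
    "employment": ["employed"] * 4 + ["self-employed"] * 3 + ["unemployed"] * 3,
    "default":    [0, 0, 0, 1,        1, 0, 0,                1, 1, 0],
})

stats = df.groupby("employment")["default"].agg(bads="sum", total="count")
stats["goods"] = stats["total"] - stats["bads"]

# WOE_j = ln( share of goods in category j / share of bads in category j )
stats["woe"] = np.log((stats["goods"] / stats["goods"].sum()) /
                      (stats["bads"] / stats["bads"].sum()))

# Replace the category label by its WOE value before fitting the model
df["employment_woe"] = df["employment"].map(stats["woe"])
print(stats[["goods", "bads", "woe"]])
```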


Support Vector Machines – Linear Kernel
Find optimal C – chaotic behaviour on the dummy-variable data set
Systematically underperforms logistic regression
Problem with unbalanced data (one common mitigation is sketched below)
Long training times for $\log_2 C > 5$
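One common way to address the imbalance, sketched with scikit-learn (an illustration, not necessarily the remedy used in the thesis): class_weight="balanced" rescales the cost parameter C per class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical data with roughly the same class imbalance as the Bondora sample (~22% defaults)
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)

for cw in (None, "balanced"):
    auc = cross_val_score(LinearSVC(C=1.0, class_weight=cw), X, y,
                          scoring="roc_auc", cv=5).mean()
    print(f"class_weight={cw}: mean cross-validated AUC = {auc:.3f}")
```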


Support Vector Machines – Gaussian Kernel
Find optimal parameters (C, γ) (a grid-search sketch follows below)
Performs better than LR – potential non-linearities in the data
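A sketch of the (C, γ) grid search with AUC as the selection criterion, assuming scikit-learn; the synthetic data only mimic the roughly 22% default rate of the Bondora sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for the credit data (the Bondora set is not reproduced here)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.78, 0.22],
                           random_state=0)

param_grid = {"C":     2.0 ** np.arange(-5, 6, 2),   # coarse log2 grid
              "gamma": 2.0 ** np.arange(-9, 0, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```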
Model comparison
Best model: SVM-RBF with dummy encoding, AUC = 0.831
[Chart: AUC vs. number of variables for LR, SVM-L and SVM-R]
Quantifying the edge
Backtesting using 3 different strategies:
Random diversification
Naive scoring
Quantitative model (LR, SVM-RBF)
Perform 10,000 simulations and calculate E(R), VaR on the out-of-sample data set (a simulation sketch follows below)
Do quantitative models provide added value (increase ROI and/or decrease risk)?
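A deliberately rough, hypothetical sketch of the simulation mechanics for the random-diversification strategy; the loan-level returns are simulated, not taken from the thesis, and the 5% quantile merely stands in for a VaR figure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated loan-level returns standing in for realised out-of-sample loan ROI
loan_returns = rng.normal(loc=0.10, scale=0.25, size=2000)

def random_diversification(returns, n_sims=10_000, portfolio_size=100):
    # Draw many random portfolios and record the return of each
    sims = np.array([rng.choice(returns, size=portfolio_size, replace=False).mean()
                     for _ in range(n_sims)])
    return sims.mean(), np.percentile(sims, 5)   # E(R) and the 5% quantile as a VaR proxy

e_r, var_5 = random_diversification(loan_returns)
print(f"E(R) = {e_r:.3f}, 5% quantile = {var_5:.3f}")
```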
Conclusions
SVM: strong theoretical background, but less reliable
Hard to use as a standalone method (variable selection, probability of default)
Linear SVM exhibited chaotic behavior w.r.t. the cost parameter C
SVM with Gaussian kernel slightly outperformed logistic regression
Conclusions
Real-world application of SVM in credit scoring is improbable:
Big players – model risk, conservative approach
Small investors (P2P) – better alternatives
SVM – historical importance, theory (Vapnik, VC dimension, Statistical Learning Theory)
