
SUPPORT VECTOR MACHINES FOR CREDIT SCORING
Michal Haltuf
Diploma thesis
Binary classification
Explanatory variables $x_i \in \mathbb{R}^n$
Each observation belongs to one of two classes $y_i \in \{-1, +1\}$
Given the training dataset $\{(x_i, y_i)\}_{i=1}^{m}$, find a function $f(x)$ that correctly assigns a new observation to its class
Linear classifiers
Interpret each data point as an n-dimensional vector
Draw a line (plane, hyperplane) that BEST separates the classes:
$w \cdot x + b = 0$
Infinite number of separating lines (w, b). Which one is best?
Credits: Fisher, 1936
Linear classifier
What does BEST mean?
Different answers ...
Different assumptions ...
... Different classifiers
Fisher discriminant (LDA)
Logistic regression
Perceptron
Support vector machines
Linear classifiers
... but each of them can be expressed as a linear separating hyperplane $w \cdot x + b = 0$ (see the sketch below)
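To make this concrete, here is a minimal sketch (assuming scikit-learn and a synthetic two-class dataset, neither of which comes from the thesis): each of the four classifiers exposes its fitted hyperplane as a weight vector w (coef_) and intercept b (intercept_).

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

# Hypothetical two-class toy data
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic regression": LogisticRegression(),
    "Perceptron": Perceptron(),
    "Linear SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, y)
    # Each fitted model defines a separating hyperplane w . x + b = 0
    print(name, "w =", model.coef_.ravel(), "b =", model.intercept_)
```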
Support vector machines

= Maximum margin hyperplane

+ Kernel trick

+ Soft-margin (slack variable)


Maximum margin hyperplane
Linearly separable case
Find (w, b) such that the corresponding hyperplane maximizes the distance from the nearest point
Euclidean distance:
$d_i = \frac{y_i (w \cdot x_i + b)}{\|w\|}$
Scale (w, b) so that $y_i (w \cdot x_i + b) \ge 1$ (constraint)
Credits: Yifan Peng, 2013
Optimization
Maximize $\frac{1}{\|w\|}$ s.t. the constraint
= minimization of $\|w\|$
= minimization of $\frac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1, \ \forall i$
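A sketch of this primal quadratic program, under the assumption that cvxpy is available; the four data points are hypothetical and only serve to show the constraint structure.

```python
import cvxpy as cp
import numpy as np

# Tiny, hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize 1/2 ||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, "b =", b.value)
```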
Lagrange function
Primal optimization problem:
$p^* = \min_{w, b} \max_{\alpha_i \ge 0} L(w, b, \alpha)$
Dual optimization problem:
$d^* = \max_{\alpha_i \ge 0} \min_{w, b} L(w, b, \alpha)$
Generally $d^* \le p^*$ ... duality gap
Karush-Kuhn-Tucker (KKT) conditions $\Rightarrow$ $d^* = p^*$
Lagrangian dual
$\arg\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t.
$\sum_{i=1}^{m} \alpha_i y_i = 0$
$\alpha_i \ge 0, \ \forall i$
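The same toy problem in its dual form, again assuming cvxpy; the quadratic term is written as $\|G^T \alpha\|^2$ with $G$ stacking the rows $y_i x_i$, which is algebraically equivalent and keeps the problem convex by construction.

```python
import cvxpy as cp
import numpy as np

# Same hypothetical toy data as in the primal sketch above
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = y[:, None] * X            # rows are y_i * x_i, so (G G^T)_ij = y_i y_j (x_i . x_j)
alpha = cp.Variable(len(y))

# max  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", np.round(alpha.value, 4))
print("w = sum_i alpha_i y_i x_i =", G.T @ alpha.value)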
Support vectors
$w = \sum_{i=1}^{m} \alpha_i^* y_i x_i$
$\hat{y} = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i^* y_i (x_i \cdot x) + b^*\right)$
By the KKT complementarity condition, most $\alpha_i = 0$
Non-zero only for the few vectors that are closest to the separating hyperplane
⇨ SUPPORT VECTORS
Credits: Yifan Peng, 2013
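A sketch of reading off the support vectors with scikit-learn's SVC (which wraps LIBSVM); the blob data are hypothetical, and the large C only approximates the hard-margin case.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical, well-separated data
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard margin

print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i at the support vectors:\n", clf.dual_coef_)  # alpha_i = 0 everywhere else
```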
Kernel function
Some data are not linearly separable
SOLUTION:
Choose a suitable mapping function $\phi(x)$
Map vectors from the original input space to a higher-dimensional feature space
Find a linear separating hyperplane in the feature space
Cover's theorem
$(x_1, x_2) \to [x_1, x_2, x_1^2 + x_2^2]$
Credits: Eric Kim, 2013
Kernel trick
Our solution depends only on the dot products
Therefore we do not need the mapping function $\phi(x)$ explicitly
Instead, use a suitable kernel $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$
WHY?
$O(n^d) \to O(n)$ problem
Allows mapping into an infinite-dimensional space
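A small numerical check of this identity for one common quadratic map, $\phi(x_1, x_2) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$, whose kernel is $K(x, z) = (x \cdot z)^2$; this map is chosen for illustration and is not the one pictured on the Kernel function slide.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-dimensional inputs
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))     # dot product computed in the 3-dimensional feature space
print((x @ z) ** 2)        # kernel K(x, z) = (x . z)^2 -- same value, no mapping needed
```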


Gaussian Kernel
Show that $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \phi(x) \cdot \phi(x')$
$\exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|x'\|^2}{2\sigma^2}\right) \exp\left(\frac{x \cdot x'}{\sigma^2}\right)$
Apply the Taylor expansion $e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!}$
In $\mathbb{R}^1$ ... $= \exp\left(-\frac{x^2}{2\sigma^2}\right) \exp\left(-\frac{x'^2}{2\sigma^2}\right) \left(1 \cdot 1 + \frac{x x'}{\sigma^2 \, 1!} + \frac{(x x')^2}{\sigma^4 \, 2!} + \frac{(x x')^3}{\sigma^6 \, 3!} + \cdots\right)$
$\phi(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right) \left[1; \ \frac{x}{\sigma \sqrt{1!}}; \ \frac{x^2}{\sigma^2 \sqrt{2!}}; \ \frac{x^3}{\sigma^3 \sqrt{3!}}; \ \ldots\right]^T$
Function $\phi$ maps vector x into an infinite-dimensional space

Credits: Chih-Jen Lin, Support Vector Machines, MLSS, Taipei, 2006
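A quick numerical confirmation of the Gaussian kernel formula, assuming scikit-learn's rbf_kernel, which is parameterised by $\gamma = 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])
sigma = 1.5

# K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), i.e. gamma = 1 / (2 sigma^2)
explicit = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
via_library = rbf_kernel(x, z, gamma=1.0 / (2 * sigma ** 2))[0, 0]
print(explicit, via_library)   # the two values agree
```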


Soft margin
Real data are noisy, rarely separable in ANY feature space
SOLUTION:
Allow some data to appear on the "wrong" side of the hyperplane – slack variables
Constraint:
$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
Credits: Stephen Cronin, 2010
Soft margin
Adjust the objective function with the cost parameter C:
$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$
The dual of the problem simplifies to a very convenient form; it affects only one constraint and nothing else:
$C \ge \alpha_i \ge 0, \ \forall i$
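A sketch of the effect of C (scikit-learn assumed, data synthetic): the smaller the cost parameter, the more slack is tolerated and the more support vectors remain.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical noisy data: flip_y injects label noise so no perfect separation exists
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C -> wide margin, many slack violations, many support vectors
    print(f"C = {C}: {clf.n_support_.sum()} support vectors")
```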
Probabilities of default
Unlike logistic regression, SVM does not give PDs:
+1/−1: binary output not sufficient
Distance from the separating hyperplane – no interpretation
Different approaches:
Binning
Platt's scaling
$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)}$
Binning – Obtaining Calibrated Probability Estimates from Support Vector Machines, J. Drish, 2001.
Platt's scaling – Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, J. C. Platt, 1999.
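A sketch of Platt's scaling in practice, assuming scikit-learn: SVC(probability=True) fits the sigmoid above on the decision values via internal cross-validation (CalibratedClassifierCV(method="sigmoid") is the more explicit route). The data are synthetic stand-ins for a credit sample.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data standing in for the credit sample
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes scikit-learn fit Platt's sigmoid 1 / (1 + exp(A f + B))
# on the SVM decision values using internal cross-validation
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test[:5]))   # per-class probabilities, usable as PDs
```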
Implementation
Quadratic programming – matrix of size $n^2$ (see the back-of-the-envelope sketch below)
Gradient methods – Lagrange multipliers $\alpha$ are not independent
Sequential minimal optimization
Standard packages, libraries (LIBSVM)
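A back-of-the-envelope illustration of the first point; the value of n is arbitrary.

```python
# Why naive QP does not scale: the full kernel matrix is n x n double-precision entries
n = 100_000
print(f"{8 * n ** 2 / 1e9:.0f} GB needed for the kernel matrix at n = {n:,}")
```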


Real-data application
Peer-to-peer lending data set, Bondora.com
2932 observations
656 defaults
37 explanatory variables
Compare models' performance with AUC

Logistic regression – Benchmark model
Two methods of categorical variable encoding – dummy variables, "woeisation" (a WOE sketch follows below)
15 significant variables (p < 0.05)
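A hedged sketch of the "woeisation" idea for a single categorical variable, using pandas; the column names and values are invented for illustration and do not come from the Bondora data.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable and default flag
df = pd.DataFrame({
    "employment": ["employed"] * 4 + ["self-employed"] * 3 + ["unemployed"] * 3,
    "default":    [0, 0, 0, 1,        1, 0, 0,                1, 1, 0],
})

stats = df.groupby("employment")["default"].agg(bads="sum", total="count")
stats["goods"] = stats["total"] - stats["bads"]

# WOE_j = ln( share of goods in category j / share of bads in category j )
stats["woe"] = np.log((stats["goods"] / stats["goods"].sum()) /
                      (stats["bads"] / stats["bads"].sum()))

# Replace the category label by its WOE value before fitting the model
df["employment_woe"] = df["employment"].map(stats["woe"])
print(stats[["goods", "bads", "woe"]])
```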


Support Vector Machines – Linear Kernel
Find optimal C – chaotic behaviour on the dummy-variable data set
Systematically underperforms logistic regression
Problem with unbalanced data (one common mitigation is sketched below)
Long training times for $\log_2 C > 5$
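One common way to address the imbalance, sketched with scikit-learn (an illustration, not necessarily the remedy used in the thesis): class_weight="balanced" rescales the cost parameter C per class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical data with roughly the same class imbalance as the Bondora sample (~22% defaults)
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)

for cw in (None, "balanced"):
    auc = cross_val_score(LinearSVC(C=1.0, class_weight=cw), X, y,
                          scoring="roc_auc", cv=5).mean()
    print(f"class_weight={cw}: mean cross-validated AUC = {auc:.3f}")
```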


Support Vector Machines – Gaussian Kernel
Find optimal parameters (C, γ) (a grid-search sketch follows below)
Performs better than LR – potential non-linearities in the data
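A sketch of the (C, γ) grid search with AUC as the selection criterion, assuming scikit-learn; the synthetic data only mimic the roughly 22% default rate of the Bondora sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for the credit data (the Bondora set is not reproduced here)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.78, 0.22],
                           random_state=0)

param_grid = {"C":     2.0 ** np.arange(-5, 6, 2),   # coarse log2 grid
              "gamma": 2.0 ** np.arange(-9, 0, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```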
Model comparison
Best model: SVM-RBF with dummy encoding, AUC = 0.831
[Chart: AUC vs. number of variables for LR, SVM-L and SVM-R]
Quantifying the edge
Backtesting using 3 different strategies:
Random diversification
Naive scoring
Quantitative model (LR, SVM-RBF)
Perform 10,000 simulations and calculate E(R), VaR on the out-of-sample data set (a simulation sketch follows below)
Do quantitative models provide added value (increase ROI and/or decrease risk)?
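A deliberately rough, hypothetical sketch of the simulation mechanics for the random-diversification strategy; the loan-level returns are simulated, not taken from the thesis, and the 5% quantile merely stands in for a VaR figure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated loan-level returns standing in for realised out-of-sample loan ROI
loan_returns = rng.normal(loc=0.10, scale=0.25, size=2000)

def random_diversification(returns, n_sims=10_000, portfolio_size=100):
    # Draw many random portfolios and record the return of each
    sims = np.array([rng.choice(returns, size=portfolio_size, replace=False).mean()
                     for _ in range(n_sims)])
    return sims.mean(), np.percentile(sims, 5)   # E(R) and the 5% quantile as a VaR proxy

e_r, var_5 = random_diversification(loan_returns)
print(f"E(R) = {e_r:.3f}, 5% quantile = {var_5:.3f}")
```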
Conclusions
SVM: strong theoretical background, but less reliable
Hard to use as a standalone method (variable selection, probability of default)
Linear SVM exhibited chaotic behavior w.r.t. the cost parameter C
SVM with Gaussian kernel slightly outperformed logistic regression
Conclusions
Real-world application of SVM in credit scoring is improbable:
Big players – model risk, conservative approach
Small investors (P2P) – better alternatives
SVM – historical importance, theory (Vapnik, VC dimension, Statistical Learning Theory)
