, Constantin F. Aliferis
*
(Materials about SVM Clustering were contributed by Nikita Lytkin
*
)
*
New York University,
#
Vanderbilt University,
ClopiNet
Part I
Introduction
Necessary mathematical concepts
Support vector machines for binary
classification: classical formulation
Basic principles of statistical machine learning
2
Introduction
3
About this tutorial
4
Main goal: Fully understand support vector machines (and
important extensions) with a modicum of mathematics
knowledge.
This tutorial is both modest (it does not invent anything new)
and ambitious (support vector machines are generally
considered mathematically quite difficult to grasp).
Tutorial approach:
learning problem main idea of the SVM solution
geometrical interpretation math/theory
basic algorithms extensions case studies.
Dataanalysis problems of interest
1. Build computational classification models (or
classifiers) that assign patients/samples into two or
more classes.
 Classifiers can be used for diagnosis, outcome prediction, and
other classification tasks.
 E.g., build a decisionsupport system to diagnose primary and
metastatic cancers from gene expression profiles of the patients:
5
Classifier
model
Patient Biopsy
Gene expression
profile
Primary Cancer
Metastatic Cancer
Dataanalysis problems of interest
2. Build computational regression models to predict values
of some continuous response variable or outcome.
 Regression models can be used to predict survival, length of stay
in the hospital, laboratory test values, etc.
 E.g., build a decisionsupport system to predict optimal dosage
of the drug to be administered to the patient. This dosage is
determined by the values of patient biomarkers, and clinical and
demographics data:
6
Regression
model
Patient
Biomarkers,
clinical and
demographics data
Optimal
dosage is 5
IU/Kg/week
1 2.2 3 423 2 3 92 2 1 8
Dataanalysis problems of interest
3. Out of all measured variables in the dataset, select the
smallest subset of variables that is necessary for the
most accurate prediction (classification or regression) of
some variable of interest (e.g., phenotypic response
variable).
 E.g., find the most compact panel of breast cancer biomarkers
from microarray gene expression data for 20,000 genes:
7
Breast
cancer
tissues
Normal
tissues
Dataanalysis problems of interest
4. Build a computational model to identify novel or outlier
patients/samples.
 Such models can be used to discover deviations in sample
handling protocol when doing quality control of assays, etc.
 E.g., build a decisionsupport system to identify aliens.
8
Dataanalysis problems of interest
5. Group patients/samples into several
clusters based on their similarity.
 These methods can be used to discovery
disease subtypes and for other tasks.
 E.g., consider clustering of brain tumor
patients into 4 clusters based on their gene
expression profiles. All patients have the
same pathological subtype of the disease,
and clustering discovers new disease
subtypes that happen to have different
characteristics in terms of patient survival
and time to recurrence after treatment.
9
Cluster #1
Cluster #2
Cluster #3
Cluster #4
Basic principles of classification
10
Want to classify objects as boats and houses.
Basic principles of classification
11
All objects before the coast line are boats and all objects after the
coast line are houses.
Coast line serves as a decision surface that separates two classes.
Basic principles of classification
12
These boats will be misclassified as houses
Basic principles of classification
13
Longitude
Latitude
Boat
House
The methods that build classification models (i.e., classification algorithms)
operate very similarly to the previous example.
First all objects are represented geometrically.
Basic principles of classification
14
Longitude
Latitude
Boat
House
Then the algorithm seeks to find a decision
surface that separates classes of objects
Basic principles of classification
15
Longitude
Latitude
? ? ?
?
?
?
These objects are classified as boats
These objects are classified as houses
Unseen (new) objects are classified as boats
if they fall below the decision surface and as
houses if the fall above it
The Support Vector Machine (SVM)
approach
16
Support vector machines (SVMs) is a binary classification
algorithm that offers a solution to problem #1.
Extensions of the basic SVM algorithm can be applied to
solve problems #1#5.
SVMs are important because of (a) theoretical reasons:
 Robust to very large number of variables and small samples
 Can learn both simple and highly complex classification models
 Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
Main ideas of SVMs
17
Cancer patients Normal patients
Gene X
Gene Y
Consider example dataset described by 2 genes, gene X and gene Y
Represent patients geometrically (by vectors)
Main ideas of SVMs
18
Find a linear decision surface (hyperplane) that can separate
patient classes and has the largest distance (i.e., largest gap or
margin) between borderline patients (i.e., support vectors);
Cancer patients Normal patients
Gene X
Gene Y
Main ideas of SVMs
19
If such linear decision surface does not exist, the data is mapped
into a much higher dimensional space (feature space) where the
separating decision surface is found;
The feature space is constructed via very clever mathematical
projection (kernel trick).
Gene Y
Gene X
Cancer
Normal
Cancer
Normal
kernel
Decision surface
History of SVMs and usage in the literature
20
Support vector machine classifiers have a long history of
development starting from the 1960s.
The most important milestone for development of modern SVMs
is the 1992 paper by Boser, Guyon, and Vapnik (A training
algorithm for optimal margin classifiers)
359
621
906
1,430
2,330
3,530
4,950
6,660
8,180
8,860
4 12 46 99
201
351
521
726
917
1,190
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
Use of Support Vector Machines in the Literature
General sciences
Biomedicine
9,770
10,800
12,000
13,500
14,900
16,000
17,700
19,500
20,000
19,600
14,900
15,500
19,200
18,700
19,100
22,200
24,100
20,100
17,700
18,300
0
5000
10000
15000
20000
25000
30000
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
Use of Linear Regression in the Literature
General sciences
Biomedicine
Necessary mathematical concepts
21
How to represent samples geometrically?
Vectors in ndimensional space (R
n
)
Assume that a sample/patient is described by n characteristics
(features or variables)
Representation: Every sample/patient is a vector in R
n
with
tail at point with 0 coordinates and arrowhead at point with
the feature values.
Example: Consider a patient described by 2 features:
Systolic BP = 110 and Age = 29.
This patient can be represented as a vector in R
2
:
22
Systolic BP
Age
(0, 0)
(110, 29)
0
100
200
300
0
50
100
150
200
0
20
40
60
How to represent samples geometrically?
Vectors in ndimensional space (R
n
)
Patient 3
Patient 4
Patient 1
Patient 2
23
Patient
id
Cholesterol
(mg/dl)
Systolic BP
(mmHg)
Age
(years)
Tail of the
vector
Arrowhead of
the vector
1 150 110 35 (0,0,0) (150, 110, 35)
2 250 120 30 (0,0,0) (250, 120, 30)
3 140 160 65 (0,0,0) (140, 160, 65)
4 300 180 45 (0,0,0) (300, 180, 45)
A
g
e
(
y
e
a
r
s
)
0
100
200
300
0
50
100
150
200
0
20
40
60
How to represent samples geometrically?
Vectors in ndimensional space (R
n
)
Patient 3
Patient 4
Patient 1
Patient 2
24
A
g
e
(
y
e
a
r
s
)
Since we assume that the tail of each vector is at point with 0
coordinates, we will also depict vectors as points (where the
arrowhead is pointing).
Purpose of vector representation
Having represented each sample/patient as a vector allows
now to geometrically represent the decision surface that
separates two groups of samples/patients.
In order to define the decision surface, we need to introduce
some basic math elements
25
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
0
5
10
0
1
2
3
4
5
6
7
A decision surface in R
2
A decision surface in R
3
Basic operation on vectors in R
n
1. Multiplication by a scalar
Consider a vector and a scalar c
Define:
When you multiply a vector by a scalar, you stretch it in the
same or opposite direction depending on whether the scalar is
positive or negative.
) ,..., , (
2 1 n
a a a a =
) ,..., , (
2 1 n
ca ca ca a c =
) 4 , 2 (
2
) 2 , 1 (
=
=
=
a c
c
a
a c
1
2
3
4
0 1 2 3 4
) 2 , 1 (
1
) 2 , 1 (
=
=
=
a c
c
a
a c
1
2
3
4
0 1 2 3 4 2 1
2
1
26
Basic operation on vectors in R
n
2. Addition
Consider vectors and
Define:
Recall addition of forces in
classical mechanics.
) ,..., , (
2 1 n
a a a a =
27
) ,..., , (
2 1 n
b b b b =
) ,..., , (
2 2 1 1 n n
b a b a b a b a + + + = +
1
2
3
4
0 1 2 3 4
) 2 , 4 (
) 0 , 3 (
) 2 , 1 (
= +
=
=
b a
b
a
b a
+
Basic operation on vectors in R
n
3. Subtraction
Consider vectors and
Define:
What vector do we
need to add to to
get ? I.e., similar to
subtraction of real
numbers.
) ,..., , (
2 1 n
a a a a =
28
) ,..., , (
2 1 n
b b b b =
) ,..., , (
2 2 1 1 n n
b a b a b a b a =
) 2 , 2 (
) 0 , 3 (
) 2 , 1 (
=
=
=
b a
b
a
b a
1
2
3
4
0 1 2 3 4 3 2 1
b
29
2 2
2
2
1
2
...
n
a a a a + + + =
24 . 2 5
) 2 , 1 (
2
=
=
a
a
1
2
3
0 1 2 3 4
Length of this
vector is 2.24
a
30
) ,..., , (
2 1 n
b b b b =
=
= + + + =
n
i
i i n n
b a b a b a b a b a
1
2 2 1 1
...
cos    
2 2
b a b a
=
a
0 = b a
1
2
3
4
0 1 2 3 4
3
) 0 , 3 (
) 2 , 1 (
=
=
=
b a
b
a
1
2
3
4
0 1 2 3 4
0
) 0 , 3 (
) 2 , 0 (
=
=
=
b a
b
a
Basic operation on vectors in R
n
5. Dot product (continued)
Property:
In the classical regression equation
the response variable y is just a dot product of the
vector representing patient characteristics ( ) and
the regression weights vector ( ) which is common
across all patients plus an offset b.
31
=
= + + + =
n
i
i i n n
b a b a b a b a b a
1
2 2 1 1
...
2
2 2 2 1 1
... a a a a a a a a a
n n
= + + + =
b x w y + =
w
P
0
P
w
0
x
0
x x
OP x =
0 ) (
0
= x x w
0
0
= x w x w
or
define
0
x w b
=
0 = + b x w
O
Equation of a hyperplane
35
0 43 6 4
0 43 ) , , ( ) 6 , 1 , 4 (
0 43 ) 6 , 1 , 4 (
0 43
43 ) 42 1 0 (
) 7 , 1 , 0 (
) 6 , 1 , 4 (
) 3 ( ) 2 ( ) 1 (
) 3 ( ) 2 ( ) 1 (
0
0
= + +
= +
= +
= +
= = =
=
=
x x x
x x x
x
x w
x w b
P
w
P
0
w
0 43= + x w
What happens if the b coefficient changes?
The hyperplane moves along the direction of .
We obtain parallel hyperplanes.
w
Example
0 10= + x w
0 50= + x w
w b b D
/
2 1
=
Distance between two parallel hyperplanes and
is equal to .
0
1
= + b x w
0
2
= + b x w
+ direction
 direction
(Derivation of the distance between two
parallel hyperplanes)
36
w
0
1
= + b x w
0
2
= + b x w
w b b w t D
w b b t
b w t b
b w t b b x w
b w t x w
b w t x w
b x w
w t w t D
w t x x
/
/ ) (
0
0 ) (
0
0 ) (
0
2 1
2
2 1
2
2
1
2
2
1 1 1
2
2
1
2 1
2 2
1 2
= =
=
= + +
= + + +
= + +
= + +
= +
= =
+ =
1
x
2
x
w t
Recap
37
We know
How to represent patients (as vectors)
How to define a linear decision surface (hyperplane)
We need to know
How to efficiently compute the hyperplane that separates
two classes with the largest gap?
Need to introduce basics
of relevant optimization
theory
Cancer patients Normal patients
Gene X
Gene Y
Basics of optimization:
Convex functions
38
A function is called convex if the function lies below the
straight line segment connecting two points, for any two
points in the interval.
Property: Any local minimum is a global minimum!
Convex function Nonconvex function
Global minimum
Global minimum
Local minimum
39
Quadratic programming (QP) is a special
optimization problem: the function to optimize
(objective) is quadratic, subject to linear
constraints.
Convex QP problems have convex objective
functions.
These problems can be solved easily and efficiently
by greedy algorithms (because every local
minimum is a global minimum).
Basics of optimization:
Quadratic programming (QP)
Consider
Minimize subject to
This is QP problem, and it is a convex QP as we will see later
We can rewrite it as:
Minimize subject to
40
Basics of optimization:
Example QP problem
2
2
 
2
1
x
) , (
2 1
x x x =
0 1
2 1
+x x
) (
2
1
2
2
2
1
x x + 0 1
2 1
+x x
quadratic
objective
linear
constraints
quadratic
objective
linear
constraints
41
Basics of optimization:
Example QP problem
x
1
x
2
f(x
1
,x
2
)
) (
2
1
2
2
2
1
x x +
1
2 1
+x x
0 1
2 1
+x x
0 1
2 1
+x x
The solution is x
1
=1/2and x
2
=1/2.
Congratulations! You have mastered
all math elements needed to
understand support vector machines.
Now, let us strengthen your
knowledge by a quiz
42
1) Consider a hyperplane shown
with white. It is defined by
equation:
Which of the three other
hyperplanes can be defined by
equation: ?
 Orange
 Green
 Yellow
2) What is the dot product between
vectors and ?
Quiz
43
P
0
w
0 10= + x w
0 3= + x w
) 3 , 3 ( = a
) 1 , 1 ( = b
1
2
3
4
0 1 2 3 4 2 1
2
1
a
1
3) What is the dot product between
vectors and ?
4) What is the length of a vector
and what is the length of
all other red vectors in the figure?
Quiz
44
) 3 , 3 ( = a
) 0 , 1 ( = b
1
2
3
4
0 2 3 4 2 1
2
1
a
) 0 , 2 ( = a
1
1
2
3
4
0 2 3 4 2 1
2
1
a
N
n
N
y y y
R x x x
Positive instances (y=+1) Negative instances (y=1)
Want to find a classifier
(hyperplane) to separate
negative instances from the
positive ones.
An infinite number of such
hyperplanes exist.
SVMs finds the hyperplane that
maximizes the gap between
data points on the boundaries
(socalled support vectors).
If the points on the boundaries
are not informative (e.g., due to
noise), SVMs will not do well.
w
1 ) ( + b x w y
i i
) ( ) ( b x w sign x f + =
Minimize subject to for i =1,,N
SVM optimization problem:
Primal formulation
50
=
n
i
i
w
1
2
2
1
0 1 ) ( + b x w y
i i
Objective function Constraints
This is called primal formulation of linear SVMs.
It is a convex quadratic programming (QP)
optimization problem with n variables (w
i
, i = 1,,n),
where n is the number of features in the dataset.
SVM optimization problem:
Dual formulation
51
The previous problem can be recast in the socalled dual
form giving rise to dual formulation of linear SVMs.
It is also a convex quadratic programming problem but with
N variables (
i
,i = 1,,N), where N is the number of
samples.
Maximize subject to and .
Then the wvector is defined in terms of
i
:
And the solution becomes:
= =
N
j i
j i j i j i
N
i
i
x x y y
1 ,
2
1
1
0
i
0
1
=
=
N
i
i i
y
Objective function Constraints
=
=
N
i
i i i
x y w
1
) ( ) (
1
b x x y sign x f
N
i
i i i
+ =
=
=
n
i
i
w
1
2
2
1
0 1 ) ( + b x w y
i i
Objective function Constraints
Apply the method of Lagrange multipliers.
Define Lagrangian
We need to minimize this Lagrangian with respect to and simultaneously
require that the derivative with respect to vanishes , all subject to the
constraints that
( ) ( )
= =
+ =
N
i
i i i
n
i
i P
b x w y w b w
1 1
2
2
1
1 ) ( , ,
a vector with n elements
a vector with N elements
b w,
. 0
i
( )
( )
=
=
= =
= =
N
i
i i i
P
N
i
i i
P
x y w
w
b w
y
b
b w
1
1
0
, ,
0 0
, ,
( )
, ,b w
P
( )
= =
=
N
j i
j i j i j i
N
i
i D
x x y y
1 ,
2
1
1
0
i
0
1
=
=
N
i
i i
y
Case 2: Not linearly separable data;
Softmargin linear SVM
55
Want to minimize subject to for i =1,,N
Then given a new instance x, the classifier is
=
+
N
i
i
C w
1
2
2
1
i i i
b x w y + 1 ) (
) ( ) ( b x w sign x f + =
Assign a slack variable to each instance , which can be thought of distance from
the separating hyperplane if an instance is misclassified and 0 otherwise.
0
i
0
0 0
0
0
0
0
0
0
0
0
0
0
0
0
What if the data is not linearly
separable? E.g., there are
outliers or noisy measurements,
or the data is slightly nonlinear.
Want to handle this case without changing
the family of decision functions.
Approach:
Two formulations of softmargin
linear SVM
56
Minimize subject to for i =1,,N
= =
+
N
i
i
n
i
i
C w
1 1
2
2
1
Objective function Constraints
i i i
b x w y + 1 ) (
= =
N
j i
j i j i j i
n
i
i
x x y y
1 ,
2
1
1
C
i
0 0
1
=
=
N
i
i i
y Minimize subject to and
for i =1,,N.
Objective function Constraints
Primal formulation:
Dual formulation:
Parameter C in softmargin SVM
57
Minimize subject to for i =1,,N
=
+
N
i
i
C w
1
2
2
1
i i i
b x w y + 1 ) (
C=100 C=1
C=0.15 C=0.1
When C is very large, the soft
margin SVM is equivalent to
hardmargin SVM;
When C is very small, we
admit misclassifications in the
training data at the expense of
having wvector with small
norm;
C has to be selected for the
distribution at hand as it will
be discussed later in this
tutorial.
Case 3: Not linearly separable data;
Kernel trick
58
Gene 2
Gene 1
Tumor
Normal
Tumor
Normal ?
?
Data is not linearly separable
in the input space
Data is linearly separable in the
feature space obtained by a kernel
kernel
H R
N
:
Data in a higher dimensional feature space
Kernel trick
59
) ( ) ( b x w sign x f + =
) ) ( ( ) ( b x w sign x f + =
=
=
N
i
i i i
x y w
1
=
=
N
i
i i i
x y w
1
) (
) ) ( ) ( ( ) (
1
b x x y sign x f
N
i
i i i
+ =
=
) ) , ( ( ) (
1
b x x K y sign x f
N
i
i i i
+ =
=
) (x
=
+ =
+ =
=
=
=
j i j i
j i
q
j i j i
q
j i j i
j i j i
j i j i
j i j i
x x k x x K
x x x x p x x K
x x p x x K
x x x x K
x x x x K
x x x x K
) ( ) ( ) , (
j i j i
x x x x K
=
Understanding the Gaussian kernel
61
) exp( ) , (
2
j j
x x x x K
=
Consider Gaussian kernel:
Geometrically, this is a bump or cavity centered at the
training data point :
j
x
The resulting
mapping function
is a combination
of bumps and
cavities.
"bump
cavity
Understanding the Gaussian kernel
62
Several more views of the
data is mapped to the
feature space by Gaussian
kernel
63
Understanding the Gaussian kernel
Linear hyperplane
that separates two
classes
Consider polynomial kernel:
Assume that we are dealing with 2dimensional data
(i.e., in R
2
). Where will this kernel map the data?
Understanding the polynomial kernel
64
3
) 1 ( ) , (
j i j i
x x x x K
+ =
) 2 ( ) 1 (
x x
) 2 (
2
) 1 (
2
) 2 ( ) 1 (
3
) 2 (
3
) 1 ( ) 2 ( ) 1 (
2
) 2 (
2
) 1 ( ) 2 ( ) 1 (
1 x x x x x x x x x x x x
2dimensional space
10dimensional space
kernel
Example of benefits of using a kernel
65
) 2 (
x
) 1 (
x
1
x
2
x
3
x
4
x
\




.

\

= + + =
= + =
(
(


.

\



.

\

= =
Example of benefits of using a kernel
66



.

\

=
2
) 2 (
) 2 ( ) 1 (
2
) 1 (
2 ) (
x
x x
x
x
2
x
3
x
4
x
2
) 2 (
x
2
) 1 (
x
) 2 ( ) 1 (
2 x x
2 1
, x x
4 3
, x x
kernel
=
+
+
w
b
) ( ) ( b x w sign x f + =
w
=
+
N
i
i
C w
1
2
2
1
i i i
b x w y + 1 ) (
Loss
(hinge loss)
Penalty
Meaning of SVM loss function
Consider loss function:
Recall that []
+
indicates the positive part
For a given sample/patient i, the loss is nonzero if
In other words,
Since , this means that the loss is nonzero if
for y
i
=+1
for y
i
=1
In other words, the loss is nonzero if
for y
i
=+1
for y
i
=1
73
=
+
N
i
i i
x f y
1
)] ( 1 [
0 ) ( 1 >
i i
x f y
1 ) ( <
i i
x f y
} 1 , 1 { + =
i
y
1 ) ( <
i
x f
1 ) ( >
i
x f
1 < + b x w
i
1 > + b x w
i
74
Meaning of SVM loss function
Positive instances (y=+1) Negative instances (y=1)
0 = + b x w
1 + = + b x w
1 = + b x w
1
2
3
4
If the instance is negative,
it is penalized only in
regions 2,3,4
If the instance is positive,
it is penalized only in
regions 1,2,3
Flexibility of loss + penalty framework
75
Loss function Penalty function Resulting algorithm
Hinge loss: SVMs
Mean squared error:
Ridge regression
Mean squared error:
Lasso
Mean squared error:
Elastic net
Hinge loss: 1norm SVM
Minimize (Loss + Penalty)
=
+
N
i
i i
x f y
1
)] ( 1 [
2
2
w
=
N
i
i i
x f y
1
2
)) ( (
2
2
w
=
N
i
i i
x f y
1
2
)) ( (
1
w
=
+
N
i
i i
x f y
1
)] ( 1 [
1
w
=
N
i
i i
x f y
1
2
)) ( (
2
2
2
1
1
w w
+
Part 2
Model selection for SVMs
Extensions to the basic SVM model:
1. SVMs for multicategory data
2. Support vector regression
3. Novelty detection with SVMbased methods
4. Support vector clustering
5. SVMbased variable selection
6. Computing posterior class probabilities for SVM
classifiers
76
Model selection for SVMs
77
78
Need for model selection for SVMs
Gene 2
Gene 1
Tumor
Normal
It is impossible to find a linear SVM classifier
that separates tumors from normals!
Need a nonlinear SVM classifier, e.g. SVM
with polynomial kernel of degree 2 solves
this problem without errors.
Gene 2
Gene 1
Tumor
Normal
We should not apply a nonlinear SVM
classifier while we can perfectly solve
this problem using a linear SVM
classifier!
79
A datadriven approach for
model selection for SVMs
Do not know a priori what type of SVM kernel and what kernel
parameter(s) to use for a given dataset?
Need to examine various combinations of parameters, e.g.
consider searching the following grid:
How to search this grid while producing an unbiased estimate
of classification performance?
Polynomial degree d
Parameter
C
(0.1, 1) (1, 1) (10, 1) (100, 1) (1000, 1)
(0.1, 2) (1, 2) (10, 2) (100, 2) (1000, 2)
(0.1, 3) (1, 3) (10, 3) (100, 3) (1000, 3)
(0.1, 4) (1, 4) (10, 4) (100, 4) (1000, 4)
(0.1, 5) (1, 5) (10, 5) (100, 5) (1000, 5)
80
Nested crossvalidation
Recall the main idea of crossvalidation:
data train
test
train
test
train
test
What combination of SVM
parameters to apply on
training data?
train
valid
train
valid
train
Perform grid search using another nested
loop of crossvalidation.
81
Example of nested crossvalidation
P1
P2
P3
{
d
a
t
a
Outer Loop
Inner Loop
Training
set
Validation
set
C
Accuracy
Average
Accuracy
P1 P2 86%
P2 P1 84%
P1 P2 70%
P2 P1 90%
1 85%
2 80%
Training
set
Testing
set
C Accuracy
Average
Accuracy
P1, P2 P3 1 89%
P1,P3 P2 2 84%
P2, P3 P1 1 76%
83%
choose
C=1
}
,..., ,
,..., ,
2 1
2 1
Main idea:
Find a function
that approximates y
1
,,y
N
:
it has at most derivation from
the true values y
i
it is as flat as possible (to
avoid overfitting)
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
x
y
+

b x w x f + =
) (
E.g., build a model to predict survival of cancer patients that
can admit a one month error (= ).
Find
by minimizing subject
to constraints:
for i =1,,N.
90
Formulation of hardmargin SVR
0 = + b x w
2
2
1
w
+
+
) (
) (
b x w y
b x w y
i
i
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
x
y
+

b x w x f + =
) (
I.e., difference between y
i
and the fitted function should be smaller
than and larger than  all points y
i
should be in the ribbon
around the fitted function.
91
Formulation of softmargin SVR
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
x
y
+
 *
*
If we have points like this
(e.g., outliers or noise) we
can either:
a) increase to ensure that
these points are within the
new ribbon, or
b) assign a penalty (slack
variable) to each of this
points (as was done for
softmargin SVMs)
92
Formulation of softmargin SVR
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
x
y
+
 *
*
i
*
Find
by minimizing
subject to constraints:
for i =1,,N.
( )
=
+ +
N
i
i i
C w
1
*
2
2
1
0 ,
) (
) (
*
*
+
+ +
i i
i i
i i
b x w y
b x w y
b x w x f + =
) (
Notice that only points outside ribbon are penalized!
93
Nonlinear SVR
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
(x)
y
()
x
y
+

kernel
+()
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
x
y
+

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
Cannot approximate well
this function with small !
1
=
+
b x w x f + =
) (
w
b
Loss
(linear insensitive loss)
Penalty
Error in approximation
Loss function value
95
Comparing SVR with popular
regression methods
Loss function Penalty function Resulting algorithm
Linear insensitive loss:
SVR
Quadratic insensitive loss:
Another variant of SVR
Mean squared error:
Ridge regression
Mean linear error:
Another variant of ridge
regression
=
N
i
i i
x f y
1
)  ) (  , 0 max(
2
2
w
=
N
i
i i
x f y
1
2
)) ( (
2
2
w
=
N
i
i i
x f y
1
 ) ( 
2
2
w
=
N
i
i i
x f y
1
2
) )) ( ( , 0 max(
2
2
w
96
Comparing loss functions of regression
methods

Loss function
value
Error in
approximation

Loss function
value
Error in
approximation

Loss function
value
Error in
approximation

Loss function
value
Error in
approximation
Linear insensitive loss Quadratic insensitive loss
Mean squared error Mean linear error
97
Applying SVR to real data
In the absence of domain knowledge about decision
functions, it is recommended to optimize the following
parameters (e.g., by crossvalidation using gridsearch):
parameter C
parameter
kernel parameters (e.g., degree of polynomial)
Notice that parameter depends on the ranges of
variables in the dataset; therefore it is recommended to
normalize/rescale data prior to applying SVR.
Novelty detection with SVMbased
methods
98
99
What is it about?
Find the simplest and most
compact region in the space of
predictors where the majority
of data samples live (i.e.,
with the highest density of
samples).
Build a decision function that
takes value +1 in this region
and 1 elsewhere.
Once we have such a decision
function, we can identify novel
or outlier samples/patients in
the data.
Predictor X
*
**
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
**
*
*
*
* *
*
*
*
*
*
*
*
*
*
P
r
e
d
i
c
t
o
r
Y
*
*
*
*
Decision function = +1
Decision function = 1
*
** *
** * *
*
** *
** * *
100
Key assumptions
We do not know classes/labels of samples (positive
or negative) in the data available for learning
this is not a classification problem
All positive samples are similar but each negative
sample can be different in its own way
Thus, do not need to collect data for negative samples!
101
Sample applications
Novel
Novel
Novel
Normal
Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt
102
Sample applications
Discover deviations in sample handling protocol when
doing quality control of assays.
Protein X
Protein Y
*
**
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
Samples with highquality
of processing
*
*
*
Samples with low quality
of processing from infants
Samples with low quality of
processing from patients
with lung cancer
Samples with low quality of
processing from ICU patients
Samples with low quality
of processing from the
lab of Dr. Smith
103
Sample applications
Identify websites that discuss benefits of proven cancer
treatments.
Weighted
frequency of
word Y
*
**
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
*
*
*
*
*
*
Websites that discuss
benefits of proven cancer
treatments
*
*
* *
Websites that discuss sideeffects of
proven cancer treatments
Blogs of cancer patients
Websites that discuss
cancer prevention
methods
Weighted
frequency of
word X
*
*
*
Websites that discuss
unproven cancer treatments
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
**
*
**
*
*
*
*
*
*
*
*
104
Oneclass SVM
Main idea: Find the maximal gap hyperplane that separates data from
the origin (i.e., the only member of the second class is the origin).
i
0 = + b x w
Find
by minimizing
subject to constraints:
for i =1,,N.
=
+ +
N
i
i
b
N
w
1
2
2
1
1
0
+
i
i
b x w
) ( ) ( b x w sign x f + =
Given training data:
n
N
R x x x
,..., ,
2 1
upper bound on
the fraction of
outliers (i.e., points
outside decision
surface) allowed in
the data
i.e., the decision function should
be positive in all training samples
except for small deviations
106
Formulation of oneclass SVM:
linear and nonlinear cases
Find
by minimizing
subject to constraints:
for i =1,,N.
=
+ +
N
i
i
b
N
w
1
2
2
1
1
0
+
i
i
b x w
) ( ) ( b x w sign x f + =
Find
by minimizing
subject to constraints:
for i =1,,N.
=
+ +
N
i
i
b
N
w
1
2
2
1
1
0
) (
+
i
i
b x w
) ) ( ( ) ( b x w sign x f + =
Linear case Nonlinear case
(use kernel trick)
107
More about oneclass SVM
Oneclass SVMs inherit most of properties of SVMs for
binary classification (e.g., kernel trick, sample
efficiency, ease of finding of a solution by efficient
optimization method, etc.);
The choice of other parameter significantly affects
the resulting decision surface.
The choice of origin is arbitrary and also significantly
affects the decision surface returned by the algorithm.
R
R
R
SVDD optimization criterion
111
Minimize subject to for i =1,,N
2
R
Squared radius of the sphere Constraints
2 2
 ) (  R a x
i
Formulation with hard constraints:
R
R
R
' R
a
' a
Main idea behind Support Vector
Clustering
Cluster boundaries in the input space are formed by the set of
points that when mapped from the input space to the feature
space fall exactly on the surface of the minimal enclosing
hypersphere
SVs identified by SVDD are a subset of the cluster boundary points
112
1
R
R
R
a
Cluster assignment in SVC
Two points and belong to the same cluster (i.e., have
the same label) if every point of the line segment
projected to the feature space lies within the hyper
sphere
113
R
R
R
a
Every point is within the hyper
sphere in the feature space
Some points lie outside the
hypersphere
) , (
j i
x x
i
x
j
x
Cluster assignment in SVC (continued)
Pointwise adjacency matrix is
constructed by testing the line
segments between every pair of
points
Connected components are
extracted
Points belonging to the same
connected component are
assigned the same label
114
A B C D E
A 1 1 1 0 0
B 1 1 0 0
C 1 0 0
D 1 1
E 1
A
B
C
D
E
A
B
C
D
E
Effects of noise and cluster overlap
In practice, data often contains noise, outlier points and
overlapping clusters, which would prevent contour
separation and result in all points being assigned to the
same cluster
115
Ideal data Typical data
Noise
Outliers
Overlap
SVDD with soft constraints
SVC can be used on noisy data by allowing a fraction of points,
called Bounded SVs (BSV), to lie outside the hypersphere
BSVs are not considered as cluster boundary points and are not
assigned to clusters by SVC
116
kernel
R
R
R
Overlap
a
Typical data
Noise
Outliers
Overlap
Noise
Outliers
Soft SVDD optimization criterion
117
Minimize subject to for i =1,,N
2
R
Squared radius of the sphere Soft constraints
i i
R a x +
2 2
 ) ( 
Primal formulation with soft constraints:
0
i
R
R
Overlap
a i
R +
Introduction of slack
variables mitigates
the influence of noise
and overlap on the
clustering process
i
Noise
Outliers
Gaussian kernel tends to yield tighter
contour representations of clusters than the polynomial kernel
The Gaussian kernel width parameter influences tightness of
cluster boundaries, number of SVs and the number of clusters
Increasing causes an increase in the number of clusters
Dual formulation of soft SVDD
118
Minimize
subject to for i =1,,N
=
j i
j i j i
i
i i i
x x K x x K W
,
) , ( ) , (
Constraints
C
i
0
) exp( ) , (
2
j i j i
x x x x K
=
As before, denotes a kernel function
Parameter gives a tradeoff between volume of the sphere and
the number of errors (C=1 corresponds to hard constraints)
) ( ) ( ) , (
j i j i
x x x x K =
1 0 < C
0 >
1 ) ( + b x w y
i i
w
b
) ( ) ( b x w sign x f + =
w
n
w R
121
Understanding the weight vector w
x
1
x
2 0 1 1
) 1 , 1 (
2 1
= + +
=
b x x
w
x
1
x
2 0 0 1
) 0 , 1 (
2 1
= + +
=
b x x
w
x
1
x
2
0 1 0
) 1 , 0 (
2 1
= + +
=
b x x
w
x
2
x
1
x
3
0 0 1 1
) 0 , 1 , 1 (
3 2 1
= + + +
=
b x x x
w
X
1
and X
2
are equally important X
1
is important, X
2
is not
X
2
is important, X
1
is not
X
1
and X
2
are
equally important,
X
3
is not
122
Understanding the weight vector w
Gene X
1
Gene X
2
Melanoma
Nevi
X
1
Phenotype X
2
SVM decision surface
0 1 1
) 1 , 1 (
2 1
= + +
=
b x x
w
Decision surface of
another classifier
0 0 1
) 0 , 1 (
2 1
= + +
=
b x x
w
True model
In the true model, X
1
is causal and X
2
is redundant
SVM decision surface implies that X
1
and X
2
are equally
important; thus it is locally causally inconsistent
There exists a causally consistent decision surface for this example
Causal discovery algorithms can identify that X
1
is causal and X
2
is redundant
123
Simple SVMbased variable selection
algorithm
Algorithm:
1. Train SVM classifier using data for all variables to
estimate vector
2. Rank each variable based on the magnitude of the
corresponding element in vector
3. Using the above ranking of variables, select the
smallest nested subset of variables that achieves the
best SVM prediction accuracy.
w
124
Simple SVMbased variable selection
algorithm
Consider that we have 7 variables: X
1
, X
2
, X
3
, X
4
, X
5
, X
6
, X
7
The vector is: (0.1, 0.3, 0.4, 0.01, 0.9, 0.99, 0.2)
The ranking of variables is: X
6
, X
5
, X
3
, X
2
, X
7
, X
1
, X
4
Subset of variables
Classification
accuracy
X
6
X
5
X
3
X
2
X
7
X
1
X
4
0.920
X
6
X
5
X
3
X
2
X
7
X
1
0.920
X
6
X
5
X
3
X
2
X
7
0.919
X
6
X
5
X
3
X
2
0.852
X
6
X
5
X
3
0.843
X
6
X
5
0.832
X
6
0.821
Best classification accuracy
Classification accuracy that is
statistically indistinguishable
from the best one
Select the following variable subset: X
6
, X
5
, X
3
, X
2
, X
7
w
125
Simple SVMbased variable selection
algorithm
SVM weights are not locally causally consistent we
may end up with a variable subset that is not causal
and not necessarily the most compact one.
The magnitude of a variable in vector estimates
the effect of removing that variable on the objective
function of SVM (e.g., function that we want to
minimize). However, this algorithm becomes sub
optimal when considering effect of removing several
variables at a time This pitfall is addressed in the
SVMRFE algorithm that is presented next.
w
126
SVMRFE variable selection algorithm
Algorithm:
1. Initialize V to all variables in the data
2. Repeat
3. Train SVM classifier using data for variables in V to
estimate vector
4. Estimate prediction accuracy of variables in V using
the above SVM classifier (e.g., by crossvalidation)
5. Remove from V a variable (or a subset of variables)
with the smallest magnitude of the corresponding
element in vector
6. Until there are no variables in V
7. Select the smallest subset of variables with the best
prediction accuracy
w
127
SVMRFE variable selection algorithm
10,000
genes
SVM
model
Prediction
accuracy
5,000
genes
5,000
genes
Important for
classification
Not important
for classification
SVM
model
2,500
genes
2,500
genes
Important for
classification
Not important
for classification
Discarded Discarded
Prediction
accuracy
Unlike simple SVMbased variable selection algorithm, SVM
RFE estimates vector many times to establish ranking of the
variables.
Notice that the prediction accuracy should be estimated at
each step in an unbiased fashion, e.g. by crossvalidation.
w
128
SVM variable selection in feature space
The real power of SVMs comes with application of the kernel
trick that maps data to a much higher dimensional space
(feature space) where the data is linearly separable.
Gene 2
Gene 1
Tumor
Normal
Tumor
Normal ?
?
kernel
dataset
142
Gene selection methods
1. Signaltonoise (S2N) ratio in
oneversusrest (OVR)
fashion;
2. Signaltonoise (S2N) ratio in
oneversusone (OVO)
fashion;
3. KruskalWallis nonparametric
oneway ANOVA (KW);
4. Ratio of genes between
categories to withincategory
sum of squares (BW).
genes
Uninformative genes
Highly discriminatory genes
143
Performance metrics and
statistical comparison
1. Accuracy
+ can compare to previous studies
+ easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI)
+ easy to interpret & simplifies statistical comparison
+ not sensitive to distribution of classes
+ accounts for difficulty of a decision problem
Randomized permutation testing to compare accuracies
of the classifiers (=0.05)
144
Microarray datasets
Total:
~1300 samples
74 diagnostic categories
41 cancer types and
12 normal tissue types
Sam
ples
Variables
(genes)
Cate
gories
11_Tumors 174 12533 11 Su, 2001
14_Tumors 308 15009 26 Ramaswamy, 2001
9_Tumors 60 5726 9 Staunton, 2001
Brain_Tumor1 90 5920 5 Pomeroy, 2002
Brain_Tumor2 50 10367 4 Nutt, 2003
Leukemia1 72 5327 3 Golub, 1999
Leukemia2 72 11225 3 Armstrong, 2002
Lung_Cancer 203 12600 5 Bhattacherjee, 2001
SRBCT 83 2308 4 Khan, 2001
Prostate_Tumor 102 10509 2 Singh, 2002
DLBCL 77 5469 2 Shipp, 2002
Dataset name
Number of
Reference
145
Summary of methods and datasets
S2N OneVersusRest
S2N OneVersusOne
Nonparam. ANOVA
BW ratio
Gene Selection
Methods (4)
CrossValidation
Designs (2)
10Fold CV
LOOCV
Accuracy
RCI
Performance
Metrics (2)
OneVersusRest
OneVersusOne
DAGSVM
Method by WW
Classifiers (11)
Method by CS
KNN
Backprop. NN
Prob. NN
Decision Trees
OneVersusRest
OneVersusOne
M
C

S
V
M
W
V
Majority Voting
MCSVM OVR
MCSVM OVO
MCSVM DAGSVM
Ensemble Classifiers (7)
Decision Trees
Decision Trees
Majority Voting
B
a
s
e
d
o
n
M
C

S
V
M
o
u
t
p
u
t
s
B
a
s
e
d
o
n
o
u
t
p
u
t
s
o
f
a
l
l
c
l
a
s
s
i
f
i
e
r
s
Randomized
permutation testing
Statistical
Comparison
11_Tumors
14_Tumors
9_Tumors
Brain Tumor1
Gene Expression Datasets (11)
Brain_Tumor2
Leukemia1
Leukemia2
Lung_Cancer
SRBCT
Prostate_Tumors
DLBCL
M
u
l
t
i
c
a
t
e
g
o
r
y
D
x
B
i
n
a
r
y
D
x
146
9
_
T
u
m
o
r
s
1
4
_
T
u
m
o
r
s
B
r
a
i
n
_
T
u
m
o
r
2
B
r
a
i
n
_
T
u
m
o
r
1
1
1
_
T
u
m
o
r
s
L
e
u
k
e
m
i
a
1
L
e
u
k
e
m
i
a
2
L
u
n
g
_
C
a
n
c
e
r
S
R
B
C
T
P
r
o
s
t
a
t
e
_
T
u
m
o
r
D
L
B
C
L
Results without gene selection
OVR
OVO
DAGSVM
WW
CS
KNN
NN
PNN
M
C

S
V
M
0
20
40
60
80
100
A
c
c
u
r
a
c
y
,
%
147
Results with gene selection
OVR OVO DAGSVM WW CS KNN NN PNN
10
20
30
40
50
60
70
I
m
p
r
o
v
e
m
e
n
t
i
n
a
c
c
u
r
a
c
y
,
%
SVM nonSVM
Improvement of diagnostic
performance by gene selection
(averages for the four datasets)
Average reduction of genes is 1030 times
Diagnostic performance
before and after gene selection
SVM nonSVM SVM nonSVM
20
40
60
80
100
20
40
60
80
100
A
c
c
u
r
a
c
y
,
%
9_Tumors 14_Tumors
Brain_Tumor1 Brain_Tumor2
148
Comparison with previously
published results
A
c
c
u
r
a
c
y
,
%
0
20
40
60
80
100
9
_
T
u
m
o
r
s
1
4
_
T
u
m
o
r
s
B
r
a
i
n
_
T
u
m
o
r
2
B
r
a
i
n
_
T
u
m
o
r
1
1
1
_
T
u
m
o
r
s
L
e
u
k
e
m
i
a
1
L
e
u
k
e
m
i
a
2
L
u
n
g
_
C
a
n
c
e
r
S
R
B
C
T
P
r
o
s
t
a
t
e
_
T
u
m
o
r
D
L
B
C
L
Multiple specialized
classification methods
(original primary studies)
Multiclass SVMs
(this study)
149
Summary of results
Multiclass SVMs are the best family among the
tested algorithms outperforming KNN, NN, PNN, DT,
and WV.
Gene selection in some cases improves classification
performance of all classifiers, especially of nonSVM
algorithms;
Ensemble classification does not improve
performance of SVM and other classifiers;
Results obtained by SVMs favorably compare with the
literature.
Appealing properties
Work when # of predictors > # of samples
Embedded gene selection
Incorporate interactions
Based on theory of ensemble learning
Can work with binary & multiclass tasks
Does not require much finetuning of parameters
Strong theoretical claims
Empirical evidence: (DiazUriarte and Alvarez de
Andres, BMC Bioinformatics, 2006) reported
superior classification performance of RFs compared
to SVMs and other methods
150
Random Forest (RF) classifiers
Key principles of RF classifiers
Training
data
2) Random gene
selection
3) Fit unpruned
decision trees
4) Apply to testing data &
combine predictions
Testing
data
1) Generate
bootstrap
samples
151
Results without gene selection
152
SVMs nominally outperform RFs is 15 datasets, RFs outperform SVMs in 4 datasets,
algorithms are exactly the same in 3 datasets.
In 7 datasets SVMs outperform RFs statistically significantly.
On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
Results with gene selection
153
SVMs nominally outperform RFs is 17 datasets, RFs outperform SVMs in 3 datasets,
algorithms are exactly the same in 2 datasets.
In 1 dataset SVMs outperform RFs statistically significantly.
On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
2. Text categorization in biomedicine
154
Models to categorize content and quality:
Main idea
155
1. Utilize existing (or easy to build) training corpora
1
2
3
4
2. Simple document
representations (i.e., typically
stemmed and weighted
words in title and abstract,
Mesh terms if available;
occasionally addition of
Metamap CUIs, author info) as
bagofwords
Models to categorize content and quality:
Main idea
156
Labeled
Examples
Unseen
Examples
Labeled
3. Train SVM models that capture
implicit categories of meaning or
quality criteria
5. Evaluate performance prospectively &
compare to prior crossvalidation estimates
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Txmt Di ag Prog Eti o
Esti mated Performance 2005 Performance
4. Evaluate models performances
 with nested crossvalidation or other
appropriate error estimators
 use primarily AUC as well as other metrics
(sensitivity, specificity, PPV, Precision/Recall
curves, HIT curves, etc.)
Models to categorize content and quality:
Some notable results
157
1. SVM models have
excellent ability to identify
highquality PubMed
documents according to
ACPJ gold standard
Category Average AUC Range
over n folds
Treatment 0.97* 0.96  0.98
Etiology 0.94* 0.89 0.95
Prognosis 0.95* 0.92 0.97
Diagnosis 0.95* 0.93  0.98
Category Average AUC Range
over n folds
Treatment 0.97* 0.96  0.98
Etiology 0.94* 0.89 0.95
Prognosis 0.95* 0.92 0.97
Diagnosis 0.95* 0.93  0.98
Method Treatment 
AUC
Etiology 
AUC
Prognosis 
AUC
Diagnosis 
AUC
Google Pagerank
0.54 0.54 0.43 0.46
Yahoo Webranks
0.56 0.49 0.52 0.52
Impact Factor
2005
0.67 0.62 0.51 0.52
Web page hit
count
0.63 0.63 0.58 0.57
Bibliometric
Citation Count
0.76 0.69 0.67 0.60
Machine Learning
Models
0.96 0.95 0.95 0.95
Method Treatment 
AUC
Etiology 
AUC
Prognosis 
AUC
Diagnosis 
AUC
Google Pagerank
0.54 0.54 0.43 0.46
Yahoo Webranks
0.56 0.49 0.52 0.52
Impact Factor
2005
0.67 0.62 0.51 0.52
Web page hit
count
0.63 0.63 0.58 0.57
Bibliometric
Citation Count
0.76 0.69 0.67 0.60
Machine Learning
Models
0.96 0.95 0.95 0.95
2. SVM models have better classification
performance than PageRank, Yahoo ranks,
Impact Factor, Web Page hit counts, and
bibliometric citation counts on the Web
according to ACPJ gold standard
Models to categorize content and quality:
Some notable results
158
3. SVM models have better
classification performance than
PageRank, Impact Factor and
Citation count in Medline for
SSOAB gold standard
Gold standard: SSOAB Area under the ROC
curve*
SSOABspecific filters 0.893
Citation Count 0.791
ACPJ Txmtspecific filters 0.548
Impact Factor (2001) 0.549
Impact Factor (2005) 0.558
Gold standard: SSOAB Area under the ROC
curve*
SSOABspecific filters 0.893
Citation Count 0.791
ACPJ Txmtspecific filters 0.548
Impact Factor (2001) 0.549
Impact Factor (2005) 0.558
Diagnosis  Fixed Specificity
0.65
0.97
0.82
0.97
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sens Fi xed Spec
Query Fi l ters Learni ng Model s
Diagnosis  Fixed Sensitivity
0.96
0.68
0.96
0.88
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fi xed Sens Spec
Query Fi l ters Learni ng Model s
Prognosis  Fixed Specificity
0.8 0.77
1
0.77
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sens Fi xed Spec
Query Fi l ters Learni ng Model s
Prognosis  Fixed Sensitivity
0.8
0.71
0.8
0.87
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fi xed Sens Spec
Query Fi l ters Learni ng Model s
Treatment  Fixed Specificity
0.8
0.91
0.95
0.91
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sens Fi xed Spec
Query Fi l ters Learni ng Model s
Treatment  Fixed Sensitivity
0.98
0.71
0.98
0.89
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fi xed Sens Spec
Query Fi l ters Learni ng Model s
Etiology  Fixed Specificity
0.68
0.91
0.94
0.91
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Sens Fi xed Spec
Query Fi l ters Learni ng Model s
Etiology  Fixed Sensitivity
0.98
0.44
0.98
0.75
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fi xed Sens Spec
Query Fi l ters Learni ng Model s
4. SVM models have better sensitivity/specificity in PubMed than CQFs at
comparable thresholds according to ACPJ gold standard
Other applications of SVMs to text
categorization
159
1. Identifying Web Pages with misleading treatment information according
to special purpose gold standard (Quack Watch). SVM models outperform
Quackometer and Google ranks in the tested domain of cancer treatment.
Model Area Under the
Curve
Machine Learning Models 0.93
Quackometer* 0.67
Google 0.63
Model Area Under the
Curve
Machine Learning Models 0.93
Quackometer* 0.67
Google 0.63
2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis,
AMIA 2008)
3. Prediction of clinical laboratory
values
160
Dataset generation and
experimental design
Training Testing
Validation
(25% of Training)
01/199805/2001 06/200110/2002
StarPanel database contains ~810
6
lab measurements of ~100,000 in
patients from Vanderbilt University Medical Center.
Lab measurements were taken between 01/1998 and 10/2002.
For each combination of lab test and normal range, we generated
the following datasets.
161
Querybased approach for
prediction of clinical cab values
Train SVM classifier
Training data
Data model
Validation data
database
These steps are performed
for every data model
Performance
Prediction
Testing data
Testing
sample
Optimal
data model
SVM classifier
These steps are performed
for every testing sample
162
Classification results
Area under ROC curve (without feature selection)
>1 <99 [1, 99] >2.5 <97.5 [2.5, 97.5]
BUN 80.4% 99.1% 76.6% 87.1% 98.2% 70.7%
Ca 72.8% 93.4% 55.6% 81.4% 81.4% 63.4%
CaIo 74.1% 60.0% 50.1% 64.7% 72.3% 57.7%
CO2 82.0% 93.6% 59.8% 84.4% 94.5% 56.3%
Creat 62.8% 97.7% 89.1% 91.5% 98.1% 87.7%
Mg 56.9% 70.0% 49.1% 58.6% 76.9% 59.1%
Osmol 50.9% 60.8% 60.8% 91.0% 90.5% 68.0%
PCV 74.9% 99.2% 66.3% 80.9% 80.6% 67.1%
Phos 74.5% 93.6% 64.4% 71.7% 92.2% 69.7%
L
a
b
o
r
a
t
o
r
y
t
e
s
t
Range of normal values
>1 <99 [1, 99] >2.5 <97.5 [2.5, 97.5]
BUN 75.9% 93.4% 68.5% 81.8% 92.2% 66.9%
Ca 67.5% 80.4% 55.0% 77.4% 70.8% 60.0%
CaIo 63.5% 52.9% 58.8% 46.4% 66.3% 58.7%
CO2 77.3% 88.0% 53.4% 77.5% 90.5% 58.1%
Creat 62.2% 88.4% 83.5% 88.4% 94.9% 83.8%
Mg 58.4% 71.8% 64.2% 67.0% 72.5% 62.1%
Osmol 77.9% 64.8% 65.2% 79.2% 82.4% 71.5%
PCV 62.3% 91.6% 69.7% 76.5% 84.6% 70.2%
Phos 70.8% 75.4% 60.4% 68.0% 81.8% 65.9%
Range of normal values
L
a
b
o
r
a
t
o
r
y
t
e
s
t
Area under ROC curve (without feature selection)
Including cases with K=0 (i.e. samples
with no prior lab measurements)
Excluding cases with K=0 (i.e. samples
with no prior lab measurements)
A total of 84,240 SVM classifiers were built for 16,848 possible data models.
163
Improving predictive power and parsimony
of a BUN model using feature selection
Test name BUN
Range of normal values <99 perc.
Data modeling SRT
Number of previous
measurements
5
Use variables corresponding to
hospitalization units?
Yes
Number of prior
hospitalizations used
2
Model description
N sampl es
(total )
N abnormal
sampl es
N
vari abl es
Training set 3749 78
Validation set 1251 27
Testing set 836 16
3442
Dataset description
0 50 100 150 200 250
0
0.5
1
1.5
2
2.5
3
x 10
4
Histogram of test BUN
Test value
F
r
e
q
u
e
n
c
y
(
N
m
e
a
s
u
r
e
m
e
n
t
s
)
normal
values
abnormal
values
105
Classification performance (area under ROC curve)
Al l RFE_Li near RFE_Pol y HITON_PC HITON_MB
Validation set 95.29% 98.78% 98.76% 99.12% 98.90%
Testing set 94.72% 99.66% 99.63% 99.16% 99.05%
Number of features 3442 26 3 11 17
164
Classification performance (area under ROC curve)
Al l RFE_Li near RFE_Pol y HITON_PC HITON_MB
Validation set 95.29% 98.78% 98.76% 99.12% 98.90%
Testing set 94.72% 99.66% 99.63% 99.16% 99.05%
Number of features 3442 26 3 11 17
Features
1 LAB: PM_1(BUN) LAB: PM_1(BUN) LAB: PM_1(BUN) LAB: PM_1(BUN)
2 LAB: PM_2(Cl) LAB: Indicator(PM_1(Mg)) LAB: PM_5(Creat) LAB: PM_5(Creat)
3 LAB: DT(PM_3(K))
LAB: Test Unit
NO_TEST_MEASUREMENT
(Test CaIo, PM 1)
LAB: PM_1(Phos) LAB: PM_3(PCV)
4 LAB: DT(PM_3(Creat)) LAB: Indicator(PM_1(BUN)) LAB: PM_1(Mg)
5 LAB: Test Unit J 018 (Test Ca, PM 3) LAB: Indicator(PM_5(Creat)) LAB: PM_1(Phos)
6 LAB: DT(PM_4(Cl)) LAB: Indicator(PM_1(Mg)) LAB: Indicator(PM_4(Creat))
7 LAB: DT(PM_3(Mg)) LAB: DT(PM_4(Creat)) LAB: Indicator(PM_5(Creat))
8 LAB: PM_1(Cl) LAB: Test Unit 7SCC (Test Ca, PM 1) LAB: Indicator(PM_3(PCV))
9 LAB: PM_3(Gluc) LAB: Test Unit RADR (Test Ca, PM 5) LAB: Indicator(PM_1(Phos))
10 LAB: DT(PM_1(CO2)) LAB: Test Unit 7SMI (Test PCV, PM 4) LAB: DT(PM_4(Creat))
11 LAB: DT(PM_4(Gluc)) DEMO: Gender LAB: Test Unit 11NM (Test BUN, PM 2)
12 LAB: PM_3(Mg) LAB: Test Unit 7SCC (Test Ca, PM 1)
13 LAB: DT(PM_5(Mg)) LAB: Test Unit RADR (Test Ca, PM 5)
14 LAB: PM_1(PCV) LAB: Test Unit 7SMI (Test PCV, PM 4)
15 LAB: PM_2(BUN) LAB: Test Unit CCL (Test Phos, PM 1)
16 LAB: Test Unit 11NM (Test PCV, PM 2) DEMO: Gender
17 LAB: Test Unit 7SCC (Test Mg, PM 3) DEMO: Age
18 LAB: DT(PM_2(Phos))
19 LAB: DT(PM_3(CO2))
20 LAB: DT(PM_2(Gluc))
21 LAB: DT(PM_5(CaIo))
22 DEMO: Hospitalization Unit TVOS
23 LAB: PM_1(Phos)
24 LAB: PM_2(Phos)
25 LAB: Test Unit 11NM (Test K, PM 5)
26 LAB: Test Unit VHR (Test CaIo, PM 1)
165
4. Modeling clinical judgment
166
Methodological framework and study
outline
same across physicians different across physicians
P
h
y
s
i
c
i
a
n
1
P
h
y
s
i
c
i
a
n
6
Patient Feature
Clinical
Diagnosis
Gold
Standard
1 f
1
f
m
cd
1
hd
1
N cd
N
hd
N
1 f
1
f
m
cd
1
hd
1
N cd
N
hd
N
Patients
Physicians
Guidelines
Predict clinical decisions
Identify predictors
ignored by
physicians
Explain each physicians
diagnostic model
Compare physicians with each
other and with guidelines
Clinical context of experiment
Incidence & mortality have been constantly increasing in
the last decades.
Malignant melanoma is the most dangerous form of skin cancer
Physicians and patients
Dermatologists N = 6
3 experts  3 nonexperts
Patients N=177
76 melanomas  101 nevi
Features
Lesion
location
Family history of
melanoma
Irregular Border
Streaks (radial
streaming, pseudopods)
Maxdiameter
Fitzpatricks
Phototype
Number of colors Slateblue veil
Mindiameter Sunburn
Atypical pigmented
network
Whitish veil
Evolution Ephelis Abrupt network cutoff Globular elements
Age Lentigos RegressionErythema
Comedolike openings,
milialike cysts
Gender Asymmetry Hypopigmentation Telangiectasia
Data collection:
Patients seen prospectively,
from 1999 to 2002 at
Department of Dermatology,
S.Chiara Hospital, Trento, Italy
inclusion criteria: histological
diagnosis and >1 digital image
available
Diagnoses made in 2004
Method to explain physicianspecific
SVM models
Build
SVM
SVM
Black Box
Build
DT
Regular Learning
MetaLearning
Apply SVM
FS
Results: Predicting physicians judgments
Physicians
All
(features)
HITON_PC
(features)
HITON_MB
(features)
RFE
(features)
Expert 1 0.94 (24) 0.92 (4) 0.92 (5) 0.95 (14)
Expert 2 0.92 (24) 0.89 (7) 0.90 (7) 0.90 (12)
Expert 3 0.98 (24) 0.95 (4) 0.95 (4) 0.97 (19)
NonExpert 1 0.92 (24) 0.89 (5) 0.89 (6) 0.90 (22)
NonExpert 2 1.00 (24) 0.99 (6) 0.99 (6) 0.98 (11)
NonExpert 3 0.89 (24) 0.89 (4) 0.89 (6) 0.87 (10)
Results: Physicianspecific models
Results: Explaining physician agreement
Expert 1
AUC=0.92
R
2
=99%
Expert 3
AUC=0.95
R
2
=99%
Blue
veil
irregular border streaks
Patient 001 yes no yes
Results: Explain physician disagreement
Blue
veil
irregular
border
streaks
number
of colors
evolution
Patient 002 no no yes 3 no
Expert 1
AUC=0.92
R
2
=99%
Expert 3
AUC=0.95
R
2
=99%
Results: Guideline compliance
Physician Reported
guidelines
Compliance
Experts1,2,3,
nonexpert 1
Pattern analysis
Noncompliant: they ignore the
majority of features (17 to 20)
recommended by pattern analysis.
Non expert 2 ABCDE rule
Non compliant: asymmetry, irregular
border and evolution are ignored.
Non expert 3
Nonstandard.
Reports using 7
features
Non compliant: 2 out of 7 reported
features are ignored while some non
reported ones are not
On the contrary: In all guidelines, the more predictors present,
the higher the likelihood of melanoma. All physicians were
compliant with this principle.
5. Using SVMs for feature selection
176
Feature selection methods
177
Feature selection methods (noncausal)
SVMRFE
Univariate + wrapper
Random forestbased
LARSElastic Net
RELIEF + wrapper
L0norm
Forward stepwise feature selection
No feature selection
Causal feature selection methods
HITONPC
HITONMB
IAMB
BLCD
K2MB