
Data Science for Molecular Engineering
Lecture 9
ILOs
• Understand linear classification and decision boundaries;
• Know the concepts, principles and solution methods for Support Vector Machines;
• Understand the concepts of the decision tree method;
• Understand the principles of Random Forest.
Binary classification and decision boundaries

Decision boundary (a hyperplane): wTx + b = 0
One side of the boundary: wTx + b > 0; the other side: wTx + b < 0

Classifier: f(x) = sign(wTx + b)
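A minimal sketch of this decision rule in NumPy; the weight vector w and bias b below are made-up illustrative values, not taken from the lecture:

import numpy as np

# Hypothetical weight vector and bias of a trained linear classifier
# (illustrative values only)
w = np.array([2.0, -1.0])
b = -0.5

def classify(x):
    """f(x) = sign(w^T x + b): +1 on one side of the hyperplane, -1 on the other."""
    return np.sign(w @ x + b)

print(classify(np.array([1.0, 0.0])))   # w^T x + b =  1.5 ->  1.0
print(classify(np.array([0.0, 1.0])))   # w^T x + b = -1.5 -> -1.0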

Compare with logistic regression


Multiple decision boundaries are possible

Which one is optimal?


Support Vector Machines

(Figure: a separating hyperplane wTx + b = 0, with margin ρ and with r the distance of a point from the hyperplane.)

• Intuitively, maximize the margin of the linear separation.
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors on the two sides of the hyperplane.
Mathematical formulation
Distance from point xi to the separating hyperplane (from linear algebra):

r = |wTxi + b| / ||w||

Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rd, yi ∈ {-1, 1}, be separated by a
hyperplane with margin ρ. Then for each training example (xi, yi):

wTxi + b ≤ -ρ/2 if yi = -1
wTxi + b ≥  ρ/2 if yi = 1

which is equivalent to yi(wTxi + b) ≥ ρ/2.

For every support vector xs the above inequality is an equality.

After rescaling w and b by ρ/2 in the equality, we obtain that the distance
between each xs and the hyperplane is

r = ys(wTxs + b) / ||w|| = 1 / ||w||
Mathematical formulation
The margin is then ρ = 2/||w||, so the maximum-margin problem is:

max ρ = 2/||w||
s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1

which is equivalent to the quadratic problem:

min Φ(w) = ||w||2 = wTw
s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1
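One way to see this optimization in practice is to fit a linear SVM with scikit-learn and read off w, b and the margin 2/||w||; the synthetic blobs and the very large C (used to approximate the hard-margin problem) are illustrative assumptions, not part of the lecture:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters; a very large C approximates the hard-margin problem
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                      # the learned weight vector
b = clf.intercept_[0]                 # the learned bias
print("number of support vectors:", len(clf.support_vectors_))
print("margin rho = 2/||w|| =", 2.0 / np.linalg.norm(w))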


Soft Margin Classification
• The data is sometimes not linearly separable.
• Soft margin classification allows misclassification by introducing slack variables ξi.

min Φ(w) = wTw + CΣξi

s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1 - ξi,  ξi ≥ 0
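A rough sketch of the effect of the penalty C, using scikit-learn's SVC on synthetic overlapping data (the data set and the particular C values are illustrative assumptions): a smaller C makes slack cheap, giving a wider margin and usually more support vectors.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so some points must violate the margin (nonzero slack)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C -> slack is cheap -> wider margin and usually more support vectors
    print("C =", C, " support vectors:", clf.support_vectors_.shape[0])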


Solution to SVM problem (optional)
• Quadratic programming with linear constraints
• The solution involves constructing a dual problem where a Lagrange
multiplier αi is associated with every inequality constraint in the
primal (original) problem:

Find α1…αn such that


Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
Solution to SVM problem
Given a solution α1…αn to the dual problem, the solution to the primal is:

w = Σαiyixi        b = yk - ΣαiyixiTxk   for any αk > 0

Each non-zero αi indicates that the corresponding xi is a support vector.


Then the classifying function is (note that we don’t need w explicitly):

f(x) = ΣαiyixiTx + b
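As a sanity check, the same classifying function can be reconstructed from a fitted scikit-learn SVC, whose dual_coef_ attribute stores αi·yi for the support vectors; the synthetic data set is an illustrative assumption:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]       # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_         # the support vectors x_i
b = clf.intercept_[0]

x_new = X[0]
f_manual = alpha_y @ (sv @ x_new) + b               # f(x) = sum_i alpha_i y_i x_i^T x + b
print(f_manual, clf.decision_function([x_new])[0])  # the two values agree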
Extension to nonlinear classification
• Datasets that are linearly separable with some noise work out great.

• But what are we going to do if the dataset is just too hard?

• How about… mapping data to a higher-dimensional space?

(Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x2).)
Feature transformation

Φ: x → φ(x)

Instead of defining φ(x) directly, use the “kernel trick” – define only the inner product of the mapped points:

K(xi, xj) = φ(xi)Tφ(xj)
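A small check of the kernel trick for the polynomial kernel with p = 2 and 2-D inputs: the explicit feature map φ below is one standard choice, written out by hand only for illustration, and the point is that K(xi, xj) computed directly equals φ(xi)Tφ(xj):

import numpy as np

def poly_kernel(xi, xj, p=2):
    """K(xi, xj) = (1 + xi^T xj)^p, computed without ever building phi(x)."""
    return (1.0 + xi @ xj) ** p

def phi(x):
    """Explicit feature map for p = 2 and 2-D inputs (written out only to check the trick)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(xi, xj))   # 25.0, via the kernel trick
print(phi(xi) @ phi(xj))     # 25.0, via the explicit inner product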


Common kernel functions
Linear: K(xi, xj) = xiTxj
Mapping Φ: x → φ(x), where φ(x) is x itself

Polynomial of power p: K(xi, xj) = (1 + xiTxj)p
Mapping Φ: x → φ(x), where φ(x) has (d+p choose p) dimensions

Gaussian (radial-basis function): K(xi, xj) = exp(-||xi - xj||2 / (2σ2))
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is
mapped to a function (a Gaussian); the combination of functions for the support
vectors is the separator.
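A quick sketch of the Gaussian kernel, computed by hand and with scikit-learn's rbf_kernel (which parameterizes the same function via gamma = 1/(2σ2)); the points and σ are arbitrary illustrative values:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = 1.0
print(gaussian_kernel(X[0], X[1], sigma))
# scikit-learn parameterizes the same kernel with gamma = 1 / (2 sigma^2)
print(rbf_kernel(X, gamma=1.0 / (2.0 * sigma ** 2))[0, 1])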
Decision Tree
• Each internal node is a test on one attribute (e.g., Outlook).
• The branches out of a node are the possible values of that attribute (e.g., sunny, overcast, rain).
• Leaves are the decisions (e.g., Yes / No).

(Figure: example tree with Outlook at the root and branches sunny, overcast, rain, leading to a Humidity test (high → No, normal → Yes), a Yes leaf, and a Windy test (true → No, false → Yes).)
Training decision tree
Key problem: choosing which attribute to split a given set of examples on.
• Some possibilities are:
– Random: select any attribute at random
– Least-Values: choose the attribute with the smallest number of possible values
– Most-Values: choose the attribute with the largest number of possible values
– Max-Gain: choose the attribute that has the largest expected information gain (sketched below)
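A minimal sketch of the Max-Gain criterion: Shannon entropy and expected information gain for one categorical attribute, on a tiny made-up data set (the labels and attribute values below are illustrative, not the lecture's example):

import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Expected reduction in entropy from splitting on one categorical attribute."""
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

labels  = np.array(["no", "no", "yes", "yes", "yes", "no"])
outlook = np.array(["sunny", "sunny", "overcast", "rain", "rain", "rain"])
print(information_gain(labels, outlook))   # how much splitting on outlook reduces entropy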
Overfitting of decision trees

A deep decision tree with many decision boundaries can separate all of the
training data perfectly, but such a tree does not generalize well to test data.
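A quick illustration of this effect with scikit-learn: an unrestricted tree typically reaches perfect training accuracy but lower test accuracy than a depth-limited tree; the synthetic data set and the chosen depth are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):                       # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print("max_depth =", depth,
          " train acc:", round(tree.score(X_tr, y_tr), 3),
          " test acc:", round(tree.score(X_te, y_te), 3))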
Random Forest
• Create bootstrap samples from the training data (N examples, M features).
• Train one decision tree on each bootstrap sample.
• To classify a new example, take the majority vote of the trees.
Random Forest
• In addition, when growing each tree, choose a random subset of m < M features (bagging of the features as well as of the examples).
• Again, classify by the majority vote of the trees.
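A minimal sketch with scikit-learn's RandomForestClassifier, which combines bootstrap sampling of the N examples with a random subset of features at each split and majority voting; the data set and hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # M = 20 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True resamples the N training examples for each tree;
# max_features="sqrt" uses a random subset of m < M features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # predictions are the majority vote of the 100 trees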
