
Data Science for Molecular Engineering
Lecture 9
ILOs
• Understand linear classification and decision boundaries;
• Know the concepts, principles and solution methods for Support Vector Machines;
• Understand the concepts of the decision tree method;
• Understand the principles of Random Forest.
Binary classification and decision boundaries

Decision boundary (a hyperplane): wTx + b = 0
One side of the boundary: wTx + b > 0; the other side: wTx + b < 0

Classifier: f(x) = sign(wTx + b)
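A minimal sketch of this decision rule in NumPy; the weight vector w and bias b below are made-up illustrative values, not taken from the lecture:

import numpy as np

# Hypothetical weight vector and bias of a trained linear classifier
# (illustrative values only)
w = np.array([2.0, -1.0])
b = -0.5

def classify(x):
    """f(x) = sign(w^T x + b): +1 on one side of the hyperplane, -1 on the other."""
    return np.sign(w @ x + b)

print(classify(np.array([1.0, 0.0])))   # w^T x + b =  1.5 ->  1.0
print(classify(np.array([0.0, 1.0])))   # w^T x + b = -1.5 -> -1.0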

Compare with logistic regression


Multiple decision boundaries are possible

Which one is optimal?


Support Vector Machines

(Figure: a separating hyperplane wTx + b = 0, with margin ρ and with r the distance of a point from the hyperplane.)

• Intuitively, maximize the margin of the linear separation.
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors on the two sides of the hyperplane.
Mathematical formulation
Distance from point xi to the separating hyperplane (from linear algebra):

r = |wTxi + b| / ||w||

Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rd, yi ∈ {-1, 1}, be separated by a
hyperplane with margin ρ. Then for each training example (xi, yi):

wTxi + b ≤ -ρ/2 if yi = -1
wTxi + b ≥  ρ/2 if yi = 1

which is equivalent to yi(wTxi + b) ≥ ρ/2.

For every support vector xs the above inequality is an equality.

After rescaling w and b by ρ/2 in the equality, we obtain that the distance
between each xs and the hyperplane is

r = ys(wTxs + b) / ||w|| = 1 / ||w||
Mathematical formulation
The margin is then ρ = 2/||w||, so the maximum-margin problem is:

max ρ = 2/||w||
s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1

which is equivalent to the quadratic problem:

min Φ(w) = ||w||2 = wTw
s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1
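One way to see this optimization in practice is to fit a linear SVM with scikit-learn and read off w, b and the margin 2/||w||; the synthetic blobs and the very large C (used to approximate the hard-margin problem) are illustrative assumptions, not part of the lecture:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters; a very large C approximates the hard-margin problem
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                      # the learned weight vector
b = clf.intercept_[0]                 # the learned bias
print("number of support vectors:", len(clf.support_vectors_))
print("margin rho = 2/||w|| =", 2.0 / np.linalg.norm(w))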


Soft Margin Classification
• The data is sometimes not linearly separable.
• Soft margin classification allows misclassification by introducing slack variables ξi.

min Φ(w) = wTw + CΣξi

s.t. for (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1 - ξi,  ξi ≥ 0
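A rough sketch of the effect of the penalty C, using scikit-learn's SVC on synthetic overlapping data (the data set and the particular C values are illustrative assumptions): a smaller C makes slack cheap, giving a wider margin and usually more support vectors.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so some points must violate the margin (nonzero slack)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller C -> slack is cheap -> wider margin and usually more support vectors
    print("C =", C, " support vectors:", clf.support_vectors_.shape[0])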


Solution to SVM problem (optional)
• Quadratic programming with linear constraints
• The solution involves constructing a dual problem where a Lagrange
multiplier αi is associated with every inequality constraint in the
primal (original) problem:

Find α1…αn such that


Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
Solution to SVM problem
Given a solution α1…αn to the dual problem, the solution to the primal is:

w = Σαiyixi        b = yk - ΣαiyixiTxk   for any αk > 0

Each non-zero αi indicates that the corresponding xi is a support vector.


Then the classifying function is (note that we don’t need w explicitly):

f(x) = ΣαiyixiTx + b
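As a sanity check, the same classifying function can be reconstructed from a fitted scikit-learn SVC, whose dual_coef_ attribute stores αi·yi for the support vectors; the synthetic data set is an illustrative assumption:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]       # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_         # the support vectors x_i
b = clf.intercept_[0]

x_new = X[0]
f_manual = alpha_y @ (sv @ x_new) + b               # f(x) = sum_i alpha_i y_i x_i^T x + b
print(f_manual, clf.decision_function([x_new])[0])  # the two values agree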
Extension to nonlinear classification
• Datasets that are linearly separable with some noise work out great.

• But what are we going to do if the dataset is just too hard?

• How about… mapping data to a higher-dimensional space?

(Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x2).)
Feature transformation

Φ: x → φ(x)

Instead of defining φ(x) directly, use the “kernel trick” – define only the inner product of the mapped points:

K(xi, xj) = φ(xi)Tφ(xj)
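A small check of the kernel trick for the polynomial kernel with p = 2 and 2-D inputs: the explicit feature map φ below is one standard choice, written out by hand only for illustration, and the point is that K(xi, xj) computed directly equals φ(xi)Tφ(xj):

import numpy as np

def poly_kernel(xi, xj, p=2):
    """K(xi, xj) = (1 + xi^T xj)^p, computed without ever building phi(x)."""
    return (1.0 + xi @ xj) ** p

def phi(x):
    """Explicit feature map for p = 2 and 2-D inputs (written out only to check the trick)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(xi, xj))   # 25.0, via the kernel trick
print(phi(xi) @ phi(xj))     # 25.0, via the explicit inner product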


Common kernel functions
Linear: K(xi, xj) = xiTxj
Mapping Φ: x → φ(x), where φ(x) is x itself

Polynomial of power p: K(xi, xj) = (1 + xiTxj)p
Mapping Φ: x → φ(x), where φ(x) has (d+p choose p) dimensions

Gaussian (radial-basis function): K(xi, xj) = exp(-||xi - xj||2 / (2σ2))
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is
mapped to a function (a Gaussian); the combination of functions for the support
vectors is the separator.
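A quick sketch of the Gaussian kernel, computed by hand and with scikit-learn's rbf_kernel (which parameterizes the same function via gamma = 1/(2σ2)); the points and σ are arbitrary illustrative values:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = 1.0
print(gaussian_kernel(X[0], X[1], sigma))
# scikit-learn parameterizes the same kernel with gamma = 1 / (2 sigma^2)
print(rbf_kernel(X, gamma=1.0 / (2.0 * sigma ** 2))[0, 1])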
Decision Tree
• Each internal node is a test on one attribute (e.g., Outlook).
• The branches out of a node are the possible values of that attribute (e.g., sunny, overcast, rain).
• Leaves are the decisions (e.g., Yes / No).

(Figure: example tree with Outlook at the root and branches sunny, overcast, rain, leading to a Humidity test (high → No, normal → Yes), a Yes leaf, and a Windy test (true → No, false → Yes).)
Training decision tree
Key problem: choosing which attribute to split a given set of examples on.
• Some possibilities are:
– Random: select any attribute at random
– Least-Values: choose the attribute with the smallest number of possible values
– Most-Values: choose the attribute with the largest number of possible values
– Max-Gain: choose the attribute that has the largest expected information gain (sketched below)
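A minimal sketch of the Max-Gain criterion: Shannon entropy and expected information gain for one categorical attribute, on a tiny made-up data set (the labels and attribute values below are illustrative, not the lecture's example):

import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Expected reduction in entropy from splitting on one categorical attribute."""
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

labels  = np.array(["no", "no", "yes", "yes", "yes", "no"])
outlook = np.array(["sunny", "sunny", "overcast", "rain", "rain", "rain"])
print(information_gain(labels, outlook))   # how much splitting on outlook reduces entropy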
Overfitting of decision trees

A deep decision tree with many decision boundaries can separate all of the
training data perfectly, but such a tree does not generalize well to test data.
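A quick illustration of this effect with scikit-learn: an unrestricted tree typically reaches perfect training accuracy but lower test accuracy than a depth-limited tree; the synthetic data set and the chosen depth are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):                       # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print("max_depth =", depth,
          " train acc:", round(tree.score(X_tr, y_tr), 3),
          " test acc:", round(tree.score(X_te, y_te), 3))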
Random Forest
• Create bootstrap samples from the training data (N examples, M features).
• Train one decision tree on each bootstrap sample.
• To classify a new example, take the majority vote of the trees.
Random Forest
• In addition, when growing each tree, choose a random subset of m < M features (bagging of the features as well as of the examples).
• Again, classify by the majority vote of the trees.
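A minimal sketch with scikit-learn's RandomForestClassifier, which combines bootstrap sampling of the N examples with a random subset of features at each split and majority voting; the data set and hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # M = 20 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True resamples the N training examples for each tree;
# max_features="sqrt" uses a random subset of m < M features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X_tr, y_tr)
print(forest.score(X_te, y_te))   # predictions are the majority vote of the 100 trees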
