Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Large-Scale Machine
Learning: SVM
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
New Topic: Machine Learning!
[Course map: high-dimensional data (locality-sensitive hashing, dimensionality reduction), graph data (PageRank, SimRank, spam detection), infinite data (filtering data streams, queries on streams), machine learning (SVM, Perceptron, kNN), and apps (recommender systems, duplicate document detection)]
Data is labeled:
Have many pairs {(x, y)}
x … vector of binary, categorical, or real-valued features
y … class label ({+1, −1}, or a real number)
Supervised Learning
Task: Given data (X, Y), build a model f() to predict Y' based on X'
Strategy: Estimate f on the training data (X, Y); hope that the same f also works to predict the unknown Y'
Example: spam filtering with word features such as "nigeria" and "viagra", labels Spam = +1 and Ham = −1, and a linear decision boundary $w \cdot x = 0$
If f(x) is positive: predict +1
If f(x) is negative: predict −1
Perceptron:
Start with $w_0 = 0$
Pick training examples $x_t$ one by one
Predict the class of $x_t$ using the current $w_t$: $y' = \text{sign}(w_t \cdot x_t)$
If $y'$ is correct (i.e., $y_t = y'$): no change, $w_{t+1} = w_t$
If $y'$ is wrong: adjust $w_t$: $w_{t+1} = w_t + \eta \cdot y_t \cdot x_t$
$\eta$ … the learning rate parameter
$x_t$ … the $t$-th training example
$y_t$ … the true $t$-th class label ({+1, −1})
(A minimal code sketch of this update rule follows.)
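A minimal NumPy sketch of the Perceptron update above (the toy data, learning rate, and epoch count are illustrative, not from the slides):

import numpy as np

def perceptron(X, y, eta=1.0, epochs=10):
    """On each mistake, apply w_{t+1} = w_t + eta * y_t * x_t."""
    w = np.zeros(X.shape[1])           # start with w_0 = 0
    for _ in range(epochs):            # sweep over the training examples
        for x_t, y_t in zip(X, y):
            y_pred = np.sign(w @ x_t)  # predict with the current w_t
            if y_pred != y_t:          # wrong prediction: adjust w_t
                w = w + eta * y_t * x_t
    return w

# Toy linearly separable data with labels in {+1, -1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))  # predicted signs should match y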
Perceptron: The Good and the Bad
Good: Perceptron convergence theorem:
If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
Bad: If the data is not separable, it never converges: the weights dance around indefinitely
Bad: Mediocre generalization: it finds a "barely" separating solution
We define the margin this way because of theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Why is maximizing the margin a good idea?
Remember the dot product: $A \cdot B = \|A\| \, \|B\| \cos\theta$, where $\|A\| = \sqrt{\sum_{j=1}^{d} \left(A^{(j)}\right)^2}$
Why is maximizing the margin a good idea?
What is the margin $\gamma$?
Let the line $L$ be $w \cdot x + b = 0$, i.e., $w^{(1)} x^{(1)} + w^{(2)} x^{(2)} + b = 0$, with normal vector $w = (w^{(1)}, w^{(2)})$; assume $\|w\| = 1$, so the distance from a point $A$ to $L$ is the length of the perpendicular $AH$ from $A$ to the line
Point $A = (x_A^{(1)}, x_A^{(2)})$
Point $M$ on the line $= (x_M^{(1)}, x_M^{(2)})$
$d(A, L) = |AH| = |(A - M) \cdot w|$
$= |(x_A^{(1)} - x_M^{(1)})\, w^{(1)} + (x_A^{(2)} - x_M^{(2)})\, w^{(2)}|$
$= |x_A^{(1)} w^{(1)} + x_A^{(2)} w^{(2)} + b| = |w \cdot A + b|$
Remember: $x_M^{(1)} w^{(1)} + x_M^{(2)} w^{(2)} = -b$, since $M$ belongs to line $L$
(A small numeric check of this formula follows.)
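A small numeric check of the derived distance formula, under the slide's assumption that $\|w\| = 1$ (the particular $w$, $b$, and point $A$ are made up for illustration):

import numpy as np

# Line L: w.x + b = 0 with a unit-norm w, as the derivation assumes
w = np.array([3.0, 4.0]) / 5.0   # ||w|| = 1
b = -1.0
A = np.array([2.0, 5.0])

# Distance via the derived formula: d(A, L) = |w . A + b|
d_formula = abs(w @ A + b)

# Cross-check: drop a perpendicular from A to its foot H on L
H = A - (w @ A + b) * w
d_direct = np.linalg.norm(A - H)
print(d_formula, d_direct)       # both print 4.2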
Largest Margin
Prediction $= \text{sign}(w \cdot x + b)$
"Confidence" $= (w \cdot x + b)\, y$
For the $i$-th datapoint: $\gamma_i = (w \cdot x_i + b)\, y_i$
Want to solve: $\max_{w} \min_i \gamma_i$
Can rewrite as: $\max_{w,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge \gamma$
Support Vector Machine
Maximize the margin $\gamma$:
Good according to intuition, theory (VC dimension), and practice
$\max_{w,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge \gamma$
$\gamma$ is the margin … the distance from the separating hyperplane to the closest example
Canonical Hyperplane: Problem
Problem: Scaling $w$ increases the margin! If $(w \cdot x + b)\, y = \gamma$, then $(2w \cdot x + 2b)\, y = 2\gamma$
Solution: Work with the normalized $w$: use $\frac{w}{\|w\|}$, and require the support vectors $x_1, x_2$ to lie on the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$
Canonical Hyperplane: Solution
Want to maximize the margin $\gamma$!
What is the relation between $x_1$ and $x_2$? $x_1 = x_2 + 2\gamma \frac{w}{\|w\|}$
We also know: $w \cdot x_1 + b = +1$ and $w \cdot x_2 + b = -1$
So: $w \cdot x_1 + b = 1 \;\Rightarrow\; w \cdot \left(x_2 + 2\gamma \frac{w}{\|w\|}\right) + b = 1 \;\Rightarrow\; \underbrace{w \cdot x_2 + b}_{=-1} + 2\gamma \frac{w \cdot w}{\|w\|} = 1 \;\Rightarrow\; \gamma = \frac{\|w\|}{w \cdot w} = \frac{1}{\|w\|}$
Note: $w \cdot w = \|w\|^2$
Maximizing the Margin
We started with: $\max_{w,\gamma} \gamma \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge \gamma$
But $w$ can be arbitrarily large!
We normalized and obtained: $\arg\max \gamma = \arg\max \frac{1}{\|w\|} = \arg\min \|w\| = \arg\min \tfrac{1}{2} \|w\|^2$
Then: $\min_{w} \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1$
This is called SVM with "hard" constraints
Non-linearly Separable Data
If the data is not separable, introduce a penalty:
$\min_{w} \tfrac{1}{2} \|w\|^2 + C \cdot (\text{number of training mistakes})$
$\text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1$
Minimize $\|w\|^2$ plus the number of training mistakes
Set $C$ using cross-validation
How to penalize mistakes? All mistakes are not equally bad!
Support Vector Machines
Introduce slack variables $\xi_i$:
$\min_{w,b,\,\xi_i \ge 0} \;\; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$
$\text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1 - \xi_i$
If point $x_i$ is on the wrong side of the margin, it gets penalty $\xi_i$
For each data point:
If margin $\ge 1$: don't care
If margin $< 1$: pay linear penalty (a code sketch of this penalty follows)
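A minimal sketch computing the slack variables $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i + b))$ for a fixed $(w, b)$; the data is illustrative:

import numpy as np

def slacks(w, b, X, y):
    # xi_i = max(0, 1 - y_i (w.x_i + b)): zero when the margin is >= 1, linear otherwise
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 2.0], [0.3, 0.3], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(slacks(w, b, X, y))  # [0.  0.4 0. ]: only the margin violator pays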
SVM: How to estimate w?
$\min_{w,b} \;\; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (x_i \cdot w + b) \ge 1 - \xi_i$
[Figure: the 0/1 loss vs. the SVM's linear (hinge) penalty as a function of the margin]
Want to estimate $w$ and $b$!
Standard way: Use a solver!
Solver: software for finding solutions to "common" optimization problems
Use a quadratic solver:
Minimize a quadratic function
Subject to linear constraints
Problem: Solvers are inefficient for big data! (An off-the-shelf solver example follows.)
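As a concrete instance of the solver route, a quadratic-programming-based library SVM such as scikit-learn's SVC can be used (the library choice is an assumption; the slides do not prescribe one):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Linear soft-margin SVM; C trades margin size against slack (set by cross-validation)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # the learned w and b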
SVM: How to estimate w?
Want to estimate $w$ and $b$: $\min_{w,b} \;\; \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (x_i \cdot w + b) \ge 1 - \xi_i$
Alternative approach: want to minimize $f(w, b)$:
$f(w, b) = \tfrac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \max\left(0,\; 1 - y_i (w \cdot x_i + b)\right)$
Side note: How do we minimize a convex function $g(z)$?
Use gradient descent: $\min_z g(z)$
Iterate: $z_{t+1} \leftarrow z_t - \eta\, \nabla g(z_t)$ (a tiny numeric sketch follows)
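A tiny sketch of gradient descent on an assumed convex example, $g(z) = (z - 3)^2$:

eta = 0.1
z = 0.0
for _ in range(100):
    grad = 2.0 * (z - 3.0)  # g'(z)
    z = z - eta * grad      # z_{t+1} <- z_t - eta * grad g(z_t)
print(z)                    # approaches the minimizer z = 3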
SVM: How to estimate w?
Want to minimize $f(w, b)$:
$f(w, b) = \tfrac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \underbrace{\max\left(0,\; 1 - y_i (w \cdot x_i + b)\right)}_{\text{empirical loss } L(x_i,\, y_i)}$
Compute the gradient: $\nabla f^{(j)} = \frac{\partial f(w, b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}$
where $\frac{\partial L(x_i, y_i)}{\partial w^{(j)}} = 0$ if $y_i (w \cdot x_i + b) \ge 1$, and $-y_i x_i^{(j)}$ otherwise
(A full-batch implementation sketch of these formulas follows.)
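A sketch of full-batch gradient descent using exactly these gradient formulas (the function name, step size, and iteration count are illustrative assumptions):

import numpy as np

def svm_gd(X, y, C=1.0, eta=0.01, iters=500):
    # Minimize f(w,b) = 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1  # examples whose hinge loss is nonzero
        # dL/dw = -y_i x_i (and dL/db = -y_i) only on margin violations
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b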
SVM: How to estimate w?
Gradient descent update: $w^{(j)} \leftarrow w^{(j)} - \eta\, \nabla f^{(j)}$
[Figure: runtime of SGD SVM vs. a conventional SVM solver]
The per-example SGD step $w \leftarrow w - \eta \left(w + C\, \frac{\partial L(x_i, y_i)}{\partial w}\right)$ is approximated in 2 steps:
(1) $w \leftarrow w - \eta\, C\, \frac{\partial L(x_i, y_i)}{\partial w}$ … cheap: $x_i$ is sparse, so only a few coordinates $j$ of $w$ will be updated
(2) $w \leftarrow w (1 - \eta)$ … expensive: $w$ is not sparse, so all coordinates need to be updated
Practical Considerations
Solution 1: A two-step update procedure. Represent the vector $w$ as the product of a scalar $s$ and a vector $v$ ($w = s \cdot v$). Then the update procedure becomes:
(1) $v \leftarrow v - \frac{\eta\, C}{s}\, \frac{\partial L(x_i, y_i)}{\partial w}$
(2) $s \leftarrow s (1 - \eta)$ … now a single scalar operation
Solution 2: Perform only step (1) for each training example $(x_i, y_i)$; perform step (2) with lower frequency and a higher $\eta$
(A sketch of Solution 1 follows.)
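A sketch of Solution 1, storing $w$ as $s \cdot v$ so that step (2) becomes a single scalar multiplication (the hyperparameters and the bias update are illustrative assumptions; renormalizing $s$ is omitted for brevity):

import numpy as np

def svm_sgd_scaled(X, y, C=1.0, eta=0.01, epochs=5):
    # w is represented as s * v; step (2), w <- w(1 - eta), is then O(1)
    _, d = X.shape
    s, v, b = 1.0, np.zeros(d), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (s * (v @ x_i) + b) < 1:  # margin violated: dL/dw = -y_i x_i
                # step (1): w <- w + eta*C*y_i*x_i, i.e. v <- v + (eta*C/s)*y_i*x_i
                v += (eta * C / s) * y_i * x_i
                b += eta * C * y_i
            s *= (1.0 - eta)                   # step (2): rescale the scalar only
    return s * v, b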
Multiclass SVM: learn one weight vector $w_c$ (and bias $b_c$) per class $c$, jointly:
$\min_{w,b} \;\; \tfrac{1}{2} \sum_{c} \|w_c\|^2 + C \sum_{i=1}^{n} \xi_i$
$\text{s.t.} \quad \forall i,\; \forall c \ne y_i:\; w_{y_i} \cdot x_i + b_{y_i} \ge w_c \cdot x_i + b_c + 1 - \xi_i, \qquad \xi_i \ge 0,\; \forall i$
(A prediction sketch follows.)
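Under this formulation, prediction picks the class whose score $w_c \cdot x + b_c$ is highest; a minimal sketch with illustrative weights:

import numpy as np

def predict_multiclass(W, b, x):
    # One row of W (and entry of b) per class; predict argmax_c w_c.x + b_c
    return int(np.argmax(W @ x + b))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # 3 classes, 2 features
b = np.zeros(3)
print(predict_multiclass(W, b, np.array([2.0, 0.5])))  # prints 0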