
Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Large-Scale Machine
Learning: SVM
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
New Topic: Machine Learning!

The course map, by type of data and by technique:
- High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams, Web advertising, Queries on streams
- Machine learning: SVM, Decision Trees, Perceptron, kNN
- Apps: Recommender systems, Association Rules, Duplicate document detection


Supervised Learning
- Example: Spam filtering
- Instance space x ∈ X (|X| = n data points)
  - Binary or real-valued feature vector x of word occurrences
  - d features (words + other things), d ~ 100,000
- Class y ∈ Y
  - y: Spam (+1), Ham (-1)
- Goal: Estimate a function f(x) so that y = f(x)
More generally: Supervised Learning
- Would like to do prediction: estimate a function f(x) so that y = f(x)
- Where y can be:
  - Real number: Regression
  - Categorical: Classification
  - Complex object: Ranking of items, parse tree, etc.
- Data is labeled:
  - Have many pairs {(x, y)}
    - x … vector of binary, categorical, real-valued features
    - y … class ({+1, -1}, or a real number)
Supervised Learning
- Task: Given training data (X, Y), build a model f() to predict Y' based on X'
- Strategy: Estimate y = f(x) on the training data (X, Y).
  Hope that the same f(x) also works to predict the unknown Y' on the test data (X', Y')
  - The "hope" is called generalization
  - Overfitting: f(x) predicts Y well but is unable to predict Y'
- We want to build a model that generalizes well to unseen data
  - But Jure, how can we do well on data we have never seen before?!?
Supervised Learning
- Idea: Pretend we do not know the data/labels we actually do know
  - Build the model f(x) on the training set
  - See how well f(x) does on the test set
  - If it does well, then apply it also to X'
  (Figure: data (X, Y) split into a training set, a validation set, and a test set X'.)
- Refinement: Cross validation
  - Splitting into a single training/validation set is brutal
  - Let's split our data (X, Y) into 10 folds (buckets)
  - Take out 1 fold for validation, train on the remaining 9
  - Repeat this 10 times, report average performance
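A minimal sketch of the 10-fold procedure described above, using plain NumPy; `cross_validate`, `train_fn`, and `error_fn` are illustrative names (the training and error routines are supplied by the caller), not part of the lecture:

```python
import numpy as np

def cross_validate(X, Y, train_fn, error_fn, k=10, seed=0):
    """Split (X, Y) into k folds; train on k-1 folds, validate on the held-out fold."""
    n = len(Y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train_fn(X[train], Y[train])             # fit on the remaining k-1 folds
        errors.append(error_fn(model, X[val], Y[val]))   # evaluate on the held-out fold
    return float(np.mean(errors))                        # report average performance
```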
Linear models for classification
- Binary classification:

  f(x) = +1 if w(1) x(1) + w(2) x(2) + … + w(d) x(d) ≥ θ
         -1 otherwise

- Input: Vectors xj and labels yj
  - Vectors xj are real valued, where ‖xj‖ = 1
- Goal: Find a vector w = (w(1), w(2), … , w(d))
  - Each w(i) is a real number

(Figure: linearly separable points; the decision boundary w · x = 0 is linear.)
Note: fold the threshold into the weights by setting x → (x, 1) and w → (w, -θ), so the rule becomes w · x ≥ 0.
Perceptron [Rosenblatt ‘58]
- (Very) loose motivation: Neuron
  - Inputs are feature values
  - Each feature x(i) has a weight w(i)
  - Activation is the sum: f(x) = Σi w(i) x(i) = w · x
  - If the activation f(x) is:
    - Positive: Predict +1
    - Negative: Predict -1
(Figures: a neuron summing weighted inputs x(1)…x(4) and testing "≥ 0?"; a spam example in the (viagra, nigeria) feature plane with decision boundary w · x = 0, Spam = +1, Ham = -1.)


Perceptron
Note: the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.

- Perceptron: y' = sign(w · x)
- How to find parameters w?
  - Start with w0 = 0
  - Pick training examples xt one by one
  - Predict class of xt using current wt: y' = sign(wt · xt)
  - If y' is correct (i.e., yt = y'): no change, wt+1 = wt
  - If y' is wrong: adjust wt: wt+1 = wt + η · yt · xt
    - η is the learning rate parameter
    - xt is the t-th training example
    - yt is the true t-th class label ({+1, -1})
(Figure: a mistake on (xt, yt = 1) rotates wt toward xt, giving wt+1 = wt + η yt xt.)
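A minimal sketch of this update rule (assumes NumPy arrays with labels in {+1, -1}; the learning rate and epoch count are illustrative):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """X: (n, d) array of feature vectors, y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])                 # start with w0 = 0
    for _ in range(epochs):                  # several passes over the data
        for x_t, y_t in zip(X, y):           # pick training examples one by one
            if np.sign(w @ x_t) != y_t:      # mistake: prediction disagrees with label
                w = w + eta * y_t * x_t      # w_{t+1} = w_t + eta * y_t * x_t
            # correct predictions leave w unchanged (conservative algorithm)
    return w
```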
Perceptron: The Good and the Bad
- Good: Perceptron convergence theorem:
  - If there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
- Bad: Never converges:
  - If the data is not separable, the weights dance around indefinitely
- Bad: Mediocre generalization:
  - Finds a "barely" separating solution


Updating the Learning Rate
- Perceptron will oscillate and won't converge
- So, when to stop learning?
  - (1) Slowly decrease the learning rate η
    - A classic way is: η = c1/(t + c2)
    - But we also need to determine constants c1 and c2
  - (2) Stop when the training error stops changing
  - (3) Hold out a small test dataset and stop when the test set error stops decreasing
  - (4) Stop when we have reached some maximum number of passes over the data
Support Vector Machines
Support Vector Machines
- Want to separate "+" from "-" using a line
- Data: Training examples: (x1, y1) … (xn, yn)
  - Each example i: xi = (xi(1), … , xi(d))
    - xi(j) is real valued
    - yi ∈ {-1, +1}
  - Inner product: w · x = Σj w(j) · x(j)
- Question: Which is the best linear separator (defined by w)?
(Figure: "+" and "-" points with several candidate separating lines.)
Largest Margin
- Distance from the separating hyperplane corresponds to the "confidence" of the prediction
- Example: We are more sure about the class of A and B than of C
(Figure: points A and B lie far from the separating line, point C lies close to it.)
Largest Margin
- Margin γ: Distance of the closest example from the decision line/hyperplane

The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Why is maximizing γ a good idea?
- Remember: Dot product A · B = ‖A‖ ‖B‖ cos θ
  - ‖A‖ = sqrt( Σj=1..d (A(j))² )
Why is maximizing γ a good idea?
- Dot product: A · B = ‖A‖ ‖B‖ cos θ
- What is w · x1 + b, w · x2 + b?
(Figure: panels with the same w but points x1, x2 at different distances from the line w · x + b = 0; in one case the dot product is small, in the other it is large.)
- So, w · x + b roughly corresponds to the margin
  - Bigger w · x + b ⟹ bigger separation
What is the margin?
Distance from a point to a line
- Note: we assume ‖w‖ = 1
- Let:
  - Line L: w · x + b = 0, i.e., w(1) x(1) + w(2) x(2) + b = 0
  - w = (w(1), w(2))
  - Point A = (xA(1), xA(2))
  - Point M on the line L = (xM(1), xM(2))
- Then the distance from A to L is
  d(A, L) = |AH|
          = |(A - M) · w|
          = |(xA(1) - xM(1)) w(1) + (xA(2) - xM(2)) w(2)|
          = xA(1) w(1) + xA(2) w(2) + b
          = w · A + b
  Remember: xM(1) w(1) + xM(2) w(2) = -b, since M belongs to line L
(Figure: point A, its projection H onto line L, and a point M on L; d(A, L) = |AH|.)
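A small numerical check of this formula; the particular line and point are made up for illustration, with w scaled so that ‖w‖ = 1 as assumed above:

```python
import numpy as np

# Line L: w·x + b = 0 with ‖w‖ = 1, here the line x(1) + x(2) = 2
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
b = -2.0 / np.sqrt(2.0)

A = np.array([3.0, 3.0])               # point whose distance we want
M = np.array([2.0, 0.0])               # any point on L (w·M + b = 0)

print(abs((A - M) @ w))                # |(A - M)·w|  -> 2.828...
print(w @ A + b)                       # w·A + b      -> same value, as derived above
```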
Largest Margin
- Prediction = sign(w · x + b)
- "Confidence" = (w · x + b) y
- For the i-th datapoint: γi = (w · xi + b) yi
- Want to solve: maxw mini γi
- Can rewrite as:
  max γ over w, γ
  s.t. ∀i, yi (w · xi + b) ≥ γ
(Figure: separating hyperplane w · x + b = 0 with margin γ to the closest "+" and "-" points.)
Support Vector Machine
- Maximize the margin γ:
  - Good according to intuition, theory (VC dimension) & practice

  max γ over w, γ
  s.t. ∀i, yi (w · xi + b) ≥ γ

  - γ is the margin … the distance from the separating hyperplane to the closest example
(Figure: maximizing the margin around the hyperplane w · x + b = 0.)
Support Vector Machines:
Deriving the margin
Support Vector Machines
- The separating hyperplane is defined by the support vectors
  - Points on the +/- planes from the solution
  - If you knew these points, you could ignore the rest
  - Generally, d + 1 support vectors (for d-dimensional data)


Canonical Hyperplane: Problem

- Problem:
  - Let (w · x + b) y = γ, then (2w · x + 2b) y = 2γ
  - Scaling w increases the margin!
- Solution:
  - Work with a normalized w: γ = ( (w / ‖w‖) · x + b ) y
  - Let's also require the support vectors xj to lie on the planes defined by: w · xj + b = ±1
  where ‖w‖ = sqrt( Σj=1..d (w(j))² )
(Figure: the hyperplane w · x + b = 0 with the parallel planes w · x + b = -1 and w · x + b = +1 through the support vectors x1, x2.)
Canonical Hyperplane: Solution

- Want to maximize the margin γ!
- What is the relation between x1 and x2?
  - x1 = x2 + 2γ · (w / ‖w‖)
- We also know:
  - w · x1 + b = +1
  - w · x2 + b = -1
- So:
  w · x1 + b = w · (x2 + 2γ w/‖w‖) + b
             = w · x2 + b + 2γ (w · w)/‖w‖
  ⟹ +1 = -1 + 2γ ‖w‖  ⟹  γ = 1/‖w‖
  Note: w · w = ‖w‖²
(Figure: support vectors x1 and x2 on the planes w · x + b = +1 and w · x + b = -1, separated by 2γ along the direction of w.)
Maximizing the Margin

- We started with
  max γ over w, γ
  s.t. ∀i, yi (w · xi + b) ≥ γ
  But w can be arbitrarily large!
- We normalized and...
  arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½ ‖w‖²
- Then:
  min ½ ‖w‖² over w
  s.t. ∀i, yi (w · xi + b) ≥ 1
  This is called SVM with "hard" constraints
(Figure: the hyperplane w · x + b = 0 with margin 2γ between the planes w · x + b = ±1.)
Non-linearly Separable Data
- If the data is not separable, introduce a penalty:
  min ½ ‖w‖² + C · (# of training mistakes) over w
  s.t. ∀i, yi (w · xi + b) ≥ 1
- Minimize ‖w‖² plus the number of training mistakes
  - Set C using cross validation
- How to penalize mistakes?
  - All mistakes are not equally bad!
(Figure: "+" and "-" points that are not linearly separable; a few points fall on the wrong side of w · x + b = 0.)
Support Vector Machines

- Introduce slack variables ξi:
  min ½ ‖w‖² + C · Σi=1..n ξi   over w, b, ξi ≥ 0
  s.t. ∀i, yi (w · xi + b) ≥ 1 - ξi
- If point xi is on the wrong side of the margin, then it gets penalty ξi
- For each data point:
  - If margin ≥ 1, don't care
  - If margin < 1, pay linear penalty
(Figure: points xi and xj on the wrong side of the margin, with slacks ξi and ξj.)


Slack Penalty C
  min ½ ‖w‖² + C · (# of mistakes) over w
  s.t. ∀i, yi (w · xi + b) ≥ 1
- What is the role of the slack penalty C?
  - C = ∞: Only want w, b that separate the data
  - C = 0: Can set ξi to anything, then w = 0 (basically ignores the data)
(Figure: decision boundaries for small C, "good" C, and big C.)
Support Vector Machines
- SVM in the "natural" form:
  arg min over w, b of:  ½ w · w + C · Σi=1..n max{0, 1 - yi (w · xi + b)}
  - ½ w · w … margin; C … regularization parameter; the sum … empirical loss L (how well we fit the training data)
- SVM uses the "Hinge Loss":
  min ½ ‖w‖² + C · Σi=1..n ξi   over w, b
  s.t. ∀i, yi (w · xi + b) ≥ 1 - ξi
  Hinge loss: max{0, 1 - z}, where z = yi (xi · w + b)
(Figure: the 0/1 loss and the hinge loss plotted against z; the hinge loss is 0 for z ≥ 1 and grows linearly as z drops below 1.)
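A minimal sketch of this objective, written directly from the formula above (NumPy; the function name is illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """f(w, b) = 1/2 w·w + C * sum_i max(0, 1 - y_i (w·x_i + b))."""
    margins = y * (X @ w + b)                      # z_i = y_i (x_i · w + b)
    hinge = np.maximum(0.0, 1.0 - margins)         # hinge loss per example
    return 0.5 * w @ w + C * hinge.sum()           # regularizer + empirical loss
```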
Support Vector Machines:
How to compute the margin?
SVM: How to estimate w?
  min ½ w · w + C · Σi=1..n ξi   over w, b
  s.t. ∀i, yi (xi · w + b) ≥ 1 - ξi
- Want to estimate w and b!
  - Standard way: Use a solver!
  - Solver: software for finding solutions to "common" optimization problems
- Use a quadratic solver:
  - Minimize a quadratic function
  - Subject to linear constraints
- Problem: Solvers are inefficient for big data!
SVM: How to estimate w?
  min ½ w · w + C · Σi=1..n ξi   over w, b
  s.t. ∀i, yi (xi · w + b) ≥ 1 - ξi
- Want to estimate w, b!
- Alternative approach:
  - Want to minimize f(w, b):
    f(w, b) = ½ w · w + C · Σi=1..n max{0, 1 - yi (Σj=1..d w(j) xi(j) + b)}
- Side note: How to minimize convex functions g(z)?
  - Use gradient descent: minz g(z)
  - Iterate: zt+1 ← zt - η ∇g(zt)
(Figure: a convex function g(z); gradient steps move z toward the minimum.)
SVM: How to estimate w?
- Want to minimize f(w, b):
  f(w, b) = ½ Σj=1..d (w(j))² + C · Σi=1..n max{0, 1 - yi (Σj=1..d w(j) xi(j) + b)}
  (the second term is the empirical loss L(xi, yi))
- Compute the gradient ∇f(j) with respect to w(j):
  ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C · Σi=1..n ∂L(xi, yi)/∂w(j)
  where
  ∂L(xi, yi)/∂w(j) = 0           if yi (w · xi + b) ≥ 1
                   = -yi xi(j)    otherwise
SVM: How to estimate w?
- Gradient descent:
  Iterate until convergence:
    For j = 1 … d:
      Evaluate: ∇f(j) = ∂f(w, b)/∂w(j) = w(j) + C · Σi=1..n ∂L(xi, yi)/∂w(j)
      Update: w(j) ← w(j) - η ∇f(j)
  η … learning rate parameter
  C … regularization parameter
- Problem:
  - Computing ∇f(j) takes O(n) time!
    - n … size of the training dataset
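A minimal full-batch gradient-descent sketch of this loop, vectorized over j with NumPy; the bias update is an assumption, since the slide only shows the update for w(j):

```python
import numpy as np

def svm_batch_gd(X, y, C=1.0, eta=0.01, iters=1000):
    """Full-batch gradient descent on f(w, b); each step costs O(n * d)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1                               # examples with margin < 1
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)   # w(j) + C * sum_i dL/dw(j)
        grad_b = -C * y[viol].sum()                              # assumed analogous update for b
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```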


SVM: How to estimate w?
We just had:
  ∇f(j) = w(j) + C · Σi=1..n ∂L(xi, yi)/∂w(j)
- Stochastic Gradient Descent:
  - Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
    ∇f(j)(xi) = w(j) + C · ∂L(xi, yi)/∂w(j)
    Notice: no summation over i anymore
- Stochastic gradient descent:
  Iterate until convergence:
    For i = 1 … n:
      For j = 1 … d:
        Compute: ∇f(j)(xi)
        Update: w(j) ← w(j) - η ∇f(j)(xi)
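The same objective, updated one example at a time as described above; a sketch in which per-epoch shuffling and the bias update are assumptions:

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    """Stochastic gradient descent: one update per training example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):               # visit examples in random order
            grad_w, grad_b = w.copy(), 0.0         # gradient of the regularizer
            if y[i] * (X[i] @ w + b) < 1:          # margin violated: hinge loss is active
                grad_w -= C * y[i] * X[i]          # w(j) + C * dL(x_i, y_i)/dw(j)
                grad_b -= C * y[i]
            w -= eta * grad_w                      # w(j) <- w(j) - eta * grad
            b -= eta * grad_b
    return w, b
```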
Support Vector Machines:
Example
Example: Text categorization
- Example by Leon Bottou:
  - Reuters RCV1 document corpus
    - Predict a category of a document
    - One vs. the rest classification
  - n = 781,000 training examples (documents)
  - 23,000 test examples
  - d = 50,000 features
    - One feature per word
    - Remove stop-words
    - Remove low-frequency words


Example: Text categorization
- Questions:
  - (1) Is SGD successful at minimizing f(w, b)?
  - (2) How quickly does SGD find the minimum of f(w, b)?
  - (3) What is the error on a test set?
(Table from Bottou: training time, value of f(w, b), and test error for a standard SVM solver, a "Fast SVM", and SGD-SVM.)
- (1) SGD-SVM is successful at minimizing the value of f(w, b)
- (2) SGD-SVM is super fast
- (3) SGD-SVM test set error is comparable
Optimization “Accuracy”

(Figure: optimization quality vs. training time for SGD SVM and a conventional SVM solver.)
- Optimization quality: |f(w, b) - f(wopt, bopt)|
- For optimizing f(w, b) within reasonable quality, SGD-SVM is super fast
SGD vs. Batch Conjugate Gradient
- SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples
- Theory says: Gradient descent converges in linear time k. Conjugate gradient converges in √k.
  (k … condition number)
- Bottom line: Doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times
Practical Considerations
- Need to choose the learning rate η and t0:
  wt+1 ← wt - ηt (wt + C ∂L(xi, yi)/∂w),   with ηt = η/(t + t0)
- Leon suggests:
  - Choose t0 so that the expected initial updates are comparable with the expected size of the weights
  - Choose η:
    - Select a small subsample
    - Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
    - Pick the one that most reduces the cost
    - Use that η for the next 100k iterations on the full dataset
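One way to code up this rate-selection recipe, reusing the svm_sgd and svm_objective sketches from earlier; the candidate rates, subsample size, and function name are illustrative assumptions:

```python
import numpy as np

def pick_learning_rate(X, y, C=1.0, subsample=10_000, rates=(10, 1, 0.1, 0.01, 0.001)):
    """Try several rates on a small subsample and keep the one with the lowest cost."""
    idx = np.random.default_rng(0).choice(len(y), size=min(subsample, len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_eta, best_cost = None, np.inf
    for eta in rates:
        w, b = svm_sgd(Xs, ys, C=C, eta=eta, epochs=1)   # one pass over the subsample
        cost = svm_objective(w, b, Xs, ys, C)            # cost reached with this rate
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta
```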
Practical Considerations
- Sparse Linear SVM:
  - The feature vector xi is sparse (contains many zeros)
    - Do not do: xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
    - But represent xi as a sparse vector: xi = [(4,1), (9,5), …]
  - Can we do the SGD update more efficiently?
    w ← w - η (w + C ∂L(xi, yi)/∂w)
  - Approximate it in 2 steps:
    (1) w ← w - η C ∂L(xi, yi)/∂w    cheap: xi is sparse, so only a few coordinates j of w are updated
    (2) w ← w (1 - η)                expensive: w is not sparse, all coordinates need to be updated
Practical Considerations

- Solution 1: Two-step update procedure:
  (1) w ← w - η C ∂L(xi, yi)/∂w
  (2) w ← w (1 - η)
  - Represent vector w as the product of a scalar s and a vector v, w = s · v
  - Then the update procedure is:
    (1) v ← v - (η C / s) ∂L(xi, yi)/∂w    (only the nonzero coordinates of xi change)
    (2) s ← s (1 - η)                       (a single scalar multiplication)
- Solution 2:
  - Perform only step (1) for each training example
  - Perform step (2) with lower frequency and a higher η
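A sketch of Solution 1 under the assumption w = s · v, with the sparse example stored as (index, value) pairs as on the previous slide; the bias term is omitted and the function name is illustrative:

```python
def sparse_sgd_step(s, v, x_sparse, y_i, C, eta):
    """One SGD step with w represented as s * v; x_sparse is a list of (j, value) pairs."""
    # Check the margin: w · x_i = s * (v · x_i), computed over nonzeros only.
    dot = s * sum(v[j] * val for j, val in x_sparse)
    if y_i * dot < 1:                              # hinge loss active for this example
        for j, val in x_sparse:                    # step (1): touch only nonzero coordinates
            v[j] += (eta * C / s) * y_i * val      # v <- v - (eta*C/s) * dL/dw, dL/dw = -y_i x_i(j)
    s *= (1.0 - eta)                               # step (2): one scalar multiply instead of O(d)
    return s, v
```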


Practical Considerations
- Stopping criteria: How many iterations of SGD?
  - Early stopping with cross validation
    - Create a validation set
    - Monitor the cost function on the validation set
    - Stop when the loss stops decreasing
  - Early stopping
    - Extract two disjoint subsamples A and B of the training data
    - Train on A, stop by validating on B
    - The number of epochs is an estimate of k
    - Train for k epochs on the full dataset
What about multiple classes?
- Idea 1: One against all
  - Learn 3 classifiers:
    - + vs. {o, -}
    - - vs. {o, +}
    - o vs. {+, -}
  - Obtain: w+ b+, w- b-, wo bo
- How to classify?
  - Return the class c: arg maxc wc · x + bc
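A sketch of the one-against-all prediction rule; it assumes one weight vector and bias per class (for instance produced by the binary SVM sketches above), and the function name is illustrative:

```python
import numpy as np

def one_vs_all_predict(x, classifiers):
    """classifiers: dict mapping class label c -> (w_c, b_c). Return arg max_c w_c·x + b_c."""
    scores = {c: w_c @ x + b_c for c, (w_c, b_c) in classifiers.items()}
    return max(scores, key=scores.get)             # class with the highest score
```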
Learn 1 classifier: Multiclass SVM
- Idea 2: Learn 3 sets of weights simultaneously!
  - For each class c estimate wc, bc
  - Want the correct class to have the highest margin:
    wyi · xi + byi ≥ 1 + wc · xi + bc,   ∀c ≠ yi, ∀i
(Figure: the example (xi, yi) and the margins to the other class boundaries.)


Multiclass SVM
- Optimization problem:
  min ½ Σc ‖wc‖² + C · Σi=1..n ξi   over w, b
  s.t. ∀c ≠ yi, ∀i: wyi · xi + byi ≥ wc · xi + bc + 1 - ξi,   ξi ≥ 0, ∀i
- To obtain the parameters wc, bc (for each class c), we can use techniques similar to the 2-class SVM
- SVM is widely perceived as a very powerful learning algorithm
