
# Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs

Piyush Rai
Machine Learning (CS771A)
Aug 20, 2016

## Stochastic Gradient Descent for Logistic Regression

Recall the gradient descent (GD) update rule for (unregularized) logistic regression:

$$w^{(t+1)} = w^{(t)} - \eta \sum_{n=1}^{N} \left(\mu_n^{(t)} - y_n\right) x_n$$

where the predicted probability of $y_n$ being 1 is $\mu_n^{(t)} = \frac{1}{1+\exp(-w^{(t)\top} x_n)}$

Stochastic GD (SGD): Approximate the gradient using a single randomly chosen example $(x_n, y_n)$:

$$w^{(t+1)} = w^{(t)} - \eta_t \left(\mu_n^{(t)} - y_n\right) x_n$$

where $\eta_t$ is the learning rate at update $t$ (typically decreasing with $t$)
Let's replace the predicted label probability $\mu_n^{(t)}$ by the predicted binary label $\hat{y}_n^{(t)}$:

$$w^{(t+1)} = w^{(t)} - \eta_t \left(\hat{y}_n^{(t)} - y_n\right) x_n \qquad \text{where} \qquad \hat{y}_n^{(t)} = \begin{cases} 1 & \text{if } \mu_n^{(t)} \geq 0.5 \text{, i.e., } w^{(t)\top} x_n \geq 0 \\ 0 & \text{if } \mu_n^{(t)} < 0.5 \text{, i.e., } w^{(t)\top} x_n < 0 \end{cases}$$

Thus $w^{(t)}$ gets updated only when $\hat{y}_n^{(t)} \neq y_n$ (i.e., when $w^{(t)}$ mispredicts)
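
A minimal sketch of these two update rules in Python (the function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, lr, hard_label=False):
    """One SGD update for logistic regression on a single example (x, y),
    with y in {0, 1}. With hard_label=True, the predicted probability is
    thresholded at 0.5, which makes the update mistake-driven."""
    mu = sigmoid(w @ x)
    pred = float(mu >= 0.5) if hard_label else mu
    return w - lr * (pred - y) * x
```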

## Mistake-Driven Learning
Consider the mistake-driven SGD update rule:

$$w^{(t+1)} = w^{(t)} - \eta_t \left(\hat{y}_n^{(t)} - y_n\right) x_n$$

Let's, from now on, assume the binary labels to be $\{-1,+1\}$, not $\{0,1\}$. Then

$$\hat{y}_n^{(t)} - y_n = \begin{cases} -2y_n & \text{if } \hat{y}_n^{(t)} \neq y_n \\ 0 & \text{if } \hat{y}_n^{(t)} = y_n \end{cases}$$

Thus whenever $w^{(t)}$ mispredicts (i.e., $\hat{y}_n^{(t)} \neq y_n$), we update $w^{(t)}$ as

$$w^{(t+1)} = w^{(t)} + 2\eta_t y_n x_n$$

For $\eta_t = 0.5$, this is akin to the Perceptron (a hyperplane-based learning algorithm):

$$w^{(t+1)} = w^{(t)} + y_n x_n$$
Note: There are other ways of deriving the Perceptron rule (will see shortly)

## The Perceptron Algorithm

One of the earliest algorithms for binary classification (Rosenblatt, 1958)
Learns a linear hyperplane to separate the two classes

A very simple, mistake-driven online learning algorithm that learns using one example at a time. Also highly scalable for large training datasets (like SGD-based logistic regression and other SGD-based learning algorithms)
Guaranteed to find a separating hyperplane if the data is linearly separable
If the data is not linearly separable:
First make it linearly separable (more when we discuss Kernel Methods)
.. or use multi-layer Perceptrons (more when we discuss Deep Learning)

## Hyperplanes
Separates a D-dimensional space into two half-spaces (positive and negative)

Defined by a normal vector $w \in \mathbb{R}^D$ (pointing towards the positive half-space)

$w$ is orthogonal to any vector $x$ lying on the hyperplane: $w^\top x = 0$

Assumption: The hyperplane passes through the origin. If not, add a bias $b \in \mathbb{R}$: $w^\top x + b = 0$

$b > 0$ means moving the hyperplane parallel to $w$ ($b < 0$ means in the opposite direction)

## Hyperplane-based Linear Classification

Represents the decision boundary by a hyperplane $w \in \mathbb{R}^D$

For binary classification, $w$ is assumed to point towards the positive class

Classification rule: $y = \mathrm{sign}(w^\top x + b)$

$$w^\top x + b > 0 \Rightarrow y = +1 \qquad\qquad w^\top x + b < 0 \Rightarrow y = -1$$

Note: $y(w^\top x + b) < 0$ means a mistake on training example $(x, y)$

Note: Some algorithms that we have already seen (e.g., distance from
means, logistic regression, etc.) can also be viewed as learning hyperplanes
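
The classification rule in code, as a quick illustration (a minimal sketch; names are made up):

```python
import numpy as np

def predict(w, b, x):
    """Hyperplane classifier: +1 or -1 depending on which half-space
    x falls in, i.e., the sign of w^T x + b."""
    return 1 if w @ x + b > 0 else -1
```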

## Notion of Margins
Geometric margin $\gamma_n$ of an example $x_n$ is its signed distance from the hyperplane:

$$\gamma_n = \frac{w^\top x_n + b}{\|w\|}$$

The geometric margin may be positive/negative depending on which side of the hyperplane $x_n$ lies

Margin of a set $\{x_1, \ldots, x_N\}$ w.r.t. $w$ is the minimum absolute geometric margin:

$$\gamma = \min_{1 \leq n \leq N} |\gamma_n| = \min_{1 \leq n \leq N} \frac{|w^\top x_n + b|}{\|w\|}$$

Functional margin of $w$ on a training example $(x_n, y_n)$: $y_n(w^\top x_n + b)$

Positive if $w$ predicts $y_n$ correctly; negative if $w$ predicts $y_n$ incorrectly
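
These quantities are straightforward to compute; a small sketch (the function names are illustrative):

```python
import numpy as np

def geometric_margin(w, b, x):
    """Signed distance of x from the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

def set_margin(w, b, X):
    """Minimum absolute geometric margin over the rows of X."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

def functional_margin(w, b, x, y):
    """Positive iff the label y in {-1,+1} is predicted correctly."""
    return y * (w @ x + b)
```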

## Perceptron: Another Derivation

For a hyperplane-based model, let's consider the following loss function:

$$\ell(w, b) = \sum_{n=1}^{N} \ell_n(w, b) = \sum_{n=1}^{N} \max\{0, -y_n(w^\top x_n + b)\}$$

Seems natural: if $y_n(w^\top x_n + b) \geq 0$, then the loss on $(x_n, y_n)$ will be 0; otherwise (when $y_n(w^\top x_n + b) < 0$) the model will incur some positive loss
Perform stochastic optimization on this loss. The stochastic (sub)gradients are

$$\frac{\partial \ell_n(w, b)}{\partial w} = -y_n x_n \qquad\qquad \frac{\partial \ell_n(w, b)}{\partial b} = -y_n$$

(when $y_n(w^\top x_n + b) \leq 0$, i.e., on a mistake; the (sub)gradients are zero otherwise)

Upon every mistake, the update rule for $w$ and $b$ (assuming learning rate $\eta_t = 1$) is

$$w \leftarrow w + y_n x_n \qquad\qquad b \leftarrow b + y_n$$

These updates are precisely the Perceptron algorithm!
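
One such stochastic subgradient step in Python (a sketch; the function name is made up, and w, x are numpy arrays):

```python
def perceptron_loss_step(w, b, x, y, lr=1.0):
    """One stochastic (sub)gradient step on max(0, -y (w^T x + b)),
    with y in {-1, +1}. Updates happen only when a mistake is made."""
    if y * (w @ x + b) <= 0:    # loss active: subgradients are -y*x and -y
        w = w + lr * y * x
        b = b + lr * y
    return w, b
```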


## The Perceptron Algorithm

Given: Training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

Initialize: $w_{old} = [0, \ldots, 0]$, $b_{old} = 0$

Repeat until convergence:

- For a random $(x_n, y_n) \in \mathcal{D}$: if $\mathrm{sign}(w^\top x_n + b) \neq y_n$, i.e., $y_n(w^\top x_n + b) \leq 0$ (i.e., a mistake is made), then
  $$w_{new} \leftarrow w_{old} + y_n x_n \qquad\qquad b_{new} \leftarrow b_{old} + y_n$$

Stopping condition: stop when either

- All training examples are classified correctly (may overfit, so less common in practice), or
- A fixed number of iterations is completed, or some convergence criterion is met, or
- One pass over the data is completed (each example seen once), e.g., when examples arrive in a streaming fashion and can't be stored in memory
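
A runnable sketch of this algorithm in Python (assuming numpy, labels in $\{-1,+1\}$; names are illustrative):

```python
import numpy as np

def perceptron(X, y, max_passes=100, rng=None):
    """Train a Perceptron on data X (N x D) with labels y in {-1, +1}.
    Returns the weight vector w and bias b."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for n in rng.permutation(N):          # random order over examples
            if y[n] * (X[n] @ w + b) <= 0:    # a mistake is made
                w += y[n] * X[n]              # w <- w + y_n x_n
                b += y[n]                     # b <- b + y_n
                mistakes += 1
        if mistakes == 0:                     # all examples classified correctly
            break
    return w, b
```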


Let's look at a misclassified positive example ($y_n = +1$)

Perceptron (wrongly) predicts $w_{old}^\top x_n + b_{old} < 0$. The update is

$$w_{new} = w_{old} + y_n x_n = w_{old} + x_n \quad (\text{since } y_n = +1) \qquad\qquad b_{new} = b_{old} + y_n = b_{old} + 1$$

so that

$$w_{new}^\top x_n + b_{new} = (w_{old} + x_n)^\top x_n + b_{old} + 1 = (w_{old}^\top x_n + b_{old}) + x_n^\top x_n + 1$$

Thus $w_{new}^\top x_n + b_{new}$ is less negative than $w_{old}^\top x_n + b_{old}$ (since $x_n^\top x_n = \|x_n\|^2 \geq 0$)
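
A tiny numeric check of this argument (the values are made up for illustration):

```python
import numpy as np

# A misclassified positive example: w_old^T x + b_old < 0 but y = +1
w_old, b_old = np.array([-1.0, 0.5]), -0.5
x, y = np.array([2.0, 1.0]), +1

w_new, b_new = w_old + y * x, b_old + y
print(w_old @ x + b_old)   # -2.0 (negative: a mistake on a positive example)
print(w_new @ x + b_new)   #  4.0 (= -2.0 + x^T x + 1 = -2.0 + 5.0 + 1.0)
```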


## Why Perceptron Updates Work (Pictorially)?

(Figure: the update $w_{new} = w_{old} + x$ moves the hyperplane's normal vector towards the misclassified positive example $x$)


Now let's look at a misclassified negative example ($y_n = -1$)

Perceptron (wrongly) predicts $w_{old}^\top x_n + b_{old} > 0$. The update is

$$w_{new} = w_{old} + y_n x_n = w_{old} - x_n \quad (\text{since } y_n = -1) \qquad\qquad b_{new} = b_{old} + y_n = b_{old} - 1$$

so that

$$w_{new}^\top x_n + b_{new} = (w_{old} - x_n)^\top x_n + b_{old} - 1 = (w_{old}^\top x_n + b_{old}) - x_n^\top x_n - 1$$

Thus $w_{new}^\top x_n + b_{new}$ is less positive than $w_{old}^\top x_n + b_{old}$


(Figure: the update $w_{new} = w_{old} - x$ moves the hyperplane's normal vector away from the misclassified negative example $x$)


## Convergence of Perceptron

Theorem (Block & Novikoff): If the training data is linearly separable with margin $\gamma$ by a unit norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$, then the Perceptron converges after at most $R^2/\gamma^2$ mistakes during training (assuming $\|x\| \leq R$ for all $x$).
Proof: Since $\|w^*\| = 1$, the margin is $\gamma = \min_n \frac{y_n w^{*\top} x_n}{\|w^*\|} = \min_n y_n w^{*\top} x_n$.

Consider the $(k+1)$-th mistake: $y_n w_k^\top x_n \leq 0$, and the update is $w_{k+1} = w_k + y_n x_n$. Then

$$w_{k+1}^\top w^* = w_k^\top w^* + y_n x_n^\top w^* \geq w_k^\top w^* + \gamma \quad \text{(why is this nice?)}$$

so, starting from $w_0 = 0$ and repeating over mistakes, $w_{k+1}^\top w^* \geq (k+1)\gamma$ (1). Also,

$$\|w_{k+1}\|^2 = \|w_k\|^2 + 2 y_n w_k^\top x_n + \|x_n\|^2 \leq \|w_k\|^2 + R^2 \quad (\text{since } y_n w_k^\top x_n \leq 0)$$

so $\|w_{k+1}\|^2 \leq (k+1) R^2$ (2).

Using (1), (2), Cauchy-Schwarz, and $\|w^*\| = 1$, we get $(k+1)\gamma \leq w_{k+1}^\top w^* \leq \|w_{k+1}\| \leq R\sqrt{k+1}$, hence $k+1 \leq R^2/\gamma^2$.

Nice thing: The convergence rate does not depend on the number of training examples $N$ or the data dimensionality $D$. It depends only on the margin!
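
An empirical sanity check of this bound on synthetic data (a sketch; the data generation below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)      # unit-norm separator, b = 0
X = rng.uniform(-1, 1, size=(500, 2))
m = X @ w_star
X, y = X[np.abs(m) > 0.1], np.sign(m[np.abs(m) > 0.1])  # enforce a margin

w, mistakes = np.zeros(2), 0
for _ in range(1000):                            # passes until no mistakes
    made_mistake = False
    for n in range(len(X)):
        if y[n] * (X[n] @ w) <= 0:               # mispredicted (b fixed at 0)
            w += y[n] * X[n]
            mistakes += 1
            made_mistake = True
    if not made_mistake:
        break

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ w_star))
print(mistakes, "<=", (R / gamma) ** 2)          # the bound should hold
```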

## The Best Hyperplane Separator?

Perceptron finds one of the many possible hyperplanes separating the data
.. if one exists

Intuitively, we want the hyperplane having the maximum margin

Large margin leads to good generalization on the test data


## Support Vector Machine (SVM)

One of the most popular/influential classification algorithms
Backed by solid theoretical grounding (Vapnik and Cortes, 1995)
A hyperplane based classifier (like the Perceptron)
Additionally uses the Maximum Margin Principle
Finds the hyperplane with maximum separation margin on the training data


## Support Vector Machine

A hyperplane-based linear classifier defined by $w$ and $b$

Prediction rule: $y = \mathrm{sign}(w^\top x + b)$

Given: Training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

Goal: Learn $w$ and $b$ that achieve small training error and the maximum margin

For now, assume the entire training data can be correctly classified by $(w, b)$, i.e., zero loss on the training examples (the non-zero loss case comes later)

Assume the hyperplane is such that

$$w^\top x_n + b \geq 1 \;\text{ for } y_n = +1 \qquad\qquad w^\top x_n + b \leq -1 \;\text{ for } y_n = -1$$

Equivalently, $y_n(w^\top x_n + b) \geq 1$, with $\min_{1 \leq n \leq N} |w^\top x_n + b| = 1$

The hyperplane's margin:

$$\gamma = \min_{1 \leq n \leq N} \frac{|w^\top x_n + b|}{\|w\|} = \frac{1}{\|w\|}$$

## Support Vector Machine: The Optimization Problem

We want to maximize the margin $\gamma = \frac{1}{\|w\|}$

Maximizing the margin is equivalent to minimizing $\|w\|$ (the norm)

Our optimization problem would be:

$$\text{Minimize } f(w, b) = \frac{\|w\|^2}{2} \quad \text{subject to} \quad y_n(w^\top x_n + b) \geq 1, \quad n = 1, \ldots, N$$

This is a Quadratic Program (QP) with $N$ linear inequality constraints
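
A minimal sketch of solving this QP numerically, assuming the cvxpy package (not part of the lecture) and a made-up separable toy dataset:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (made up for illustration); labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# Hard-margin SVM: minimize ||w||^2 / 2 s.t. y_n (w^T x_n + b) >= 1
objective = cp.Minimize(cp.sum_squares(w) / 2)
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, 1 / np.linalg.norm(w.value))  # margin = 1/||w||
```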


## Large margins intuitively mean good generalization

We can give a slightly more formal justification for this

Recall: margin $\gamma = \frac{1}{\|w\|}$

Large margin $\Rightarrow$ small $\|w\|$

Small $\|w\|$ $\Rightarrow$ regularized/simple solutions (the $w_i$'s don't become too large)

Simple solutions $\Rightarrow$ good generalization on test data


## Next class

Solving the SVM optimization problem

Introduction to kernel methods (nonlinear SVMs)
