Linear Classifiers
CS771: Introduction to Machine Learning
Purushottam Kar

Linear Classifiers
Before going forward, recall that linear classifiers are those that have a line or a plane as the decision boundary. A linear classifier is given by a model $(\mathbf{w}, b)$ and it makes predictions by looking at whether or not $\mathbf{w}^\top \mathbf{x} + b > 0$.
Keep appearing again and again in various methods
LwP with 2 classes and the Euclidean metric gives a linear classifier
Even if the Mahalanobis metric is used, LwP still gives a linear classifier
Decision stumps with a single feature also give a linear classifier
Extremely popular in ML
Very small model size – just one vector (and one bias value)
Very fast prediction time
Used to build DTs, deep nets etc
Instead of indirectly getting a linear classifier via LwP + Mahalanobis etc., can't we learn one directly? Learning linear classifiers directly will allow us to control many useful properties about them! That is exactly what we will do today!
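To make the prediction rule above concrete, here is a minimal numpy sketch; the weights, bias, and test points are made-up values for illustration, not anything from the slides.

```python
import numpy as np

def predict(w, b, X):
    """Predict +1/-1 labels for the rows of X using a linear classifier (w, b)."""
    scores = X @ w + b               # w^T x + b for every data point
    return np.where(scores > 0, 1, -1)

# Toy 2D classifier and two test points (illustrative values only)
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 1.0], [-1.0, 2.0]])
print(predict(w, b, X))              # prints [ 1 -1 ]
```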
The "Best" Linear Classifier
It seems infinitely many classifiers perfectly classify the data. Which one should I choose?
It is better not to select a model whose decision boundary passes very close to a training data point. Such models would be very brittle and might misclassify test data (i.e. predict the wrong class), even test data which look very similar to train data.
In the figure, all the brittle dotted classifiers misclassify the two new test points. However, the bold classifier, whose decision boundary is far from all train points, is not affected.
Large Margin Classifiers
A good linear classifier can be one where all data points are correctly classified, as well as far from the decision boundary. The distance of the closest training point from the boundary is called the geometric margin.
I cannot search all classifiers to find the one with the largest margin! It would take an infinite amount of time!
That is where math comes to our rescue! Indeed, my name is Ms M for a reason (M for ML as well as M for Math).
Large Margin Classifiers
Fact: the distance of the origin from the hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$ is $\frac{|b|}{\|\mathbf{w}\|_2}$
Fact: the distance of a point $\mathbf{x}$ from this hyperplane is $\frac{|\mathbf{w}^\top\mathbf{x} + b|}{\|\mathbf{w}\|_2}$
Given train data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$ for a binary classification problem, where $\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \{-1, +1\}$, we want two things from a classifier
Demand 1: classify every point correctly – how to ask this politely?
One way: demand that for all $i$, $\text{sign}(\mathbf{w}^\top\mathbf{x}^i + b) = y^i$
Easier way: demand that for all $i$, $y^i \cdot (\mathbf{w}^\top\mathbf{x}^i + b) > 0$
Demand 2: do not let any data point come close to the boundary
Demand that the geometric margin $\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\|\mathbf{w}\|_2}$ be as large as possible
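Both demands can be checked numerically. Below is a minimal sketch (assuming labels are in $\{-1, +1\}$ as above) of computing the geometric margin of a given classifier; it is positive exactly when Demand 1 holds, and Demand 2 asks to make it as large as possible.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance y^i (w^T x^i + b) / ||w||_2 over the data;
    positive only if every point is classified correctly."""
    signed_dists = y * (X @ w + b) / np.linalg.norm(w)
    return signed_dists.min()
```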
Support Vector Machines
Just a fancy way of saying: "Please find me a linear classifier that perfectly classifies the train data while keeping data points as far away from the hyperplane as possible"
The mathematical way of writing this request is the following
$$\max_{\mathbf{w}, b}\ \underbrace{\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\|\mathbf{w}\|_2}}_{\text{objective}} \quad \text{such that}\quad \underbrace{y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0 \ \text{ for all } i}_{\text{constraints}}$$
This is known as an optimization problem, with an objective and lots of constraints
This looks so complicated, how will I ever find a solution to this optimization problem?
Let us simplify this optimization problem
Constrained Optimization 101
Constraints are usually specified using math equations. The set of points that satisfy all the constraints is called the feasible set of the optimization problem.
HOW WE SPEAK TO A HUMAN: "I want to find an unknown $x$ that gives me the best value according to this function $f$. Oh! and btw, not any $x$ would do! $x$ must satisfy these conditions, etc. etc. All I am saying is, of the values of $x$ that satisfy my conditions, find me the one that gives the best value according to $f$."
HOW WE MUST SPEAK TO MS M: state an objective and constraints, for example
$$\min_{x}\ f(x) \quad \text{s.t.}\quad x \geq 3 \ \text{ and }\ x \leq 6$$
Here the feasible set is the interval $[3, 6]$, and Ms M replies: "For your specified constraints, the optimal (least) value of $f$ is such-and-such, and it is achieved at such-and-such."
If the constraints cannot all be satisfied at once, the feasible set is empty and Ms M replies: "Your optimization problem has no solution since no point satisfies all your constraints. Feasible set is empty!"
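As a small sketch of how such a request is posed to an off-the-shelf solver, the toy problem above can be handed to scipy; note that the objective $f(x) = (x-1)^2$ is an assumed example, since the slide does not specify $f$.

```python
from scipy.optimize import minimize

f = lambda x: (x[0] - 1.0) ** 2            # assumed objective, for illustration

# Constraints x >= 3 and x <= 6, i.e. the feasible set is the interval [3, 6]
constraints = [
    {"type": "ineq", "fun": lambda x: x[0] - 3.0},   # x - 3 >= 0
    {"type": "ineq", "fun": lambda x: 6.0 - x[0]},   # 6 - x >= 0
]

result = minimize(f, x0=[4.0], constraints=constraints)
print(result.x, result.fun)                # optimum at x = 3, since f decreases towards x = 1
```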
Back to SVMs
Assume there do exist models that perfectly classify all train data
Consider one such model $(\mathbf{w}, b)$ which classifies train data perfectly
Now, as $y^i \in \{-1, +1\}$ and every point is classified correctly, $y^i(\mathbf{w}^\top\mathbf{x}^i + b) = |\mathbf{w}^\top\mathbf{x}^i + b|$ for all $i$
Thus, the geometric margin $\min_i \frac{|\mathbf{w}^\top\mathbf{x}^i + b|}{\|\mathbf{w}\|_2}$ is the same as $\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\|\mathbf{w}\|_2}$, since the model has perfect classification!
We will use this useful fact to greatly simplify the optimization problem
What if the train data is not linearly separable, i.e. no linear classifier can perfectly classify it? We will remove this assumption later.
Support Vector Machines
The quantity $y^i(\mathbf{w}^\top\mathbf{x}^i + b)$ is called the functional margin. Note that geometric margin = functional margin / $\|\mathbf{w}\|_2$
Let $(\mathbf{x}^{i_0}, y^{i_0})$ be the data point that comes closest to the hyperplane, i.e. $i_0 = \arg\min_i\, y^i(\mathbf{w}^\top\mathbf{x}^i + b)$
Recall that all this discussion holds only for a perfect classifier
Let $m = y^{i_0}(\mathbf{w}^\top\mathbf{x}^{i_0} + b) > 0$ and consider $\tilde{\mathbf{w}} = \mathbf{w}/m$, $\tilde{b} = b/m$
Note this gives us $y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \geq 1$ for all $i$, as well as geometric margin $= 1/\|\tilde{\mathbf{w}}\|_2$ (as $y^{i_0}(\tilde{\mathbf{w}}^\top\mathbf{x}^{i_0} + \tilde{b}) = 1$)
Thus, instead of searching for $(\mathbf{w}, b)$, it is easier to search for $(\tilde{\mathbf{w}}, \tilde{b})$
$$\min_{\tilde{\mathbf{w}}, \tilde{b}}\ \|\tilde{\mathbf{w}}\|_2^2 \quad \text{such that}\quad y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \geq 1 \ \text{ for all } i$$
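Since this is just a quadratic program, any off-the-shelf convex solver can handle it. A minimal sketch using cvxpy (one possible choice, not prescribed by the slides; the toy data is illustrative):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# Hard-margin SVM: minimize ||w||_2^2 subject to y^i (w^T x^i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value)
```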
The C-SVM Technique
For linearly separable cases, where we suspect a perfect classifier exists, find the model using
$$\min_{\mathbf{w}, b}\ \|\mathbf{w}\|_2^2 \quad \text{s.t.}\quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1 \ \text{ for all } i$$
If a linear classifier cannot perfectly classify the data, then find the model using
$$\min_{\mathbf{w}, b, \xi}\ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1 - \xi_i \ \text{ for all } i, \ \text{ as well as } \xi_i \geq 0 \ \text{ for all } i$$
The $\xi_i$ terms are called slack variables (recall the English phrase "cut me some slack"). They allow some data points to come close to the hyperplane or be misclassified altogether.
Having the constraint $\xi_i \geq 0$ prevents us from misusing slack to artificially inflate the margin.
What prevents me from misusing the slack variables to learn a model that misclassifies every data point?
The term $C\sum_i \xi_i$ prevents you from doing so. If we set $C$ to be a large value (it is a hyper-parameter), then it will penalize solutions that misuse slack too much.
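In practice we rarely solve this by hand; as a minimal sketch, scikit-learn's linear-kernel SVC solves exactly this kind of problem, with C playing the role of the slack-penalty hyper-parameter described above (the data and the choice C = 10.0 are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0], [0.2, 0.1]])
y = np.array([1, 1, -1, -1, 1])           # the last point may need some slack

clf = SVC(kernel="linear", C=10.0)        # larger C penalizes slack more heavily
clf.fit(X, y)
print(clf.coef_, clf.intercept_)          # the learned (w, b)
```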
From C-SVM to Loss Functions
We can further simplify the previous optimization problem
Note that $\xi_i$ basically allows us to have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) < 1$ (even $< 0$)
Thus, the amount of slack we want is just $\xi_i = 1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)$
However, recall that we must also satisfy $\xi_i \geq 0$
Another way of saying that if you already have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1$, then you don't need any slack, i.e. you should have $\xi_i = 0$ in this case
Using the notation $[x]_+ = \max\{x, 0\}$, we thus need only set $\xi_i = [1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)]_+$
The above is nothing but the popular hinge loss function!
Hinge Loss
Captures how well a classifier classified a data point
Suppose on a data point $(\mathbf{x}, y)$, a model gives a prediction score of $\hat{y}$ (for a linear model $(\mathbf{w}, b)$, we have $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$)
We obviously want $y \cdot \hat{y} > 0$ for correct classification, but we also want $y \cdot \hat{y} \geq 1$ for a large margin – the hinge loss function $\ell_{\text{hinge}}(y, \hat{y}) = [1 - y \cdot \hat{y}]_+$ captures both
Note that hinge loss not only penalizes misclassification but also penalizes correct classification if the data point gets too close to the hyperplane!
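A tiny numpy sketch of the hinge loss on a few illustrative scores, showing that both misclassified points and correctly classified points that come too close to the hyperplane get penalized:

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss [1 - y * y_hat]_+ for labels y in {-1, +1} and prediction scores y_hat."""
    return np.maximum(1.0 - y * y_hat, 0.0)

y     = np.array([1,   1,    -1,   1])
y_hat = np.array([2.0, 0.3,  -1.5, -0.5])
print(hinge_loss(y, y_hat))    # [0.  0.7  0.  1.5]: safe, too close, safe, misclassified
```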
Final Form of C-SVM
Recall that the C-SVM optimization finds a model by solving
$$\min_{\mathbf{w}, b, \xi}\ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1 - \xi_i \ \text{ for all } i, \ \text{ as well as } \xi_i \geq 0 \ \text{ for all } i$$
Using the previous discussion, we can rewrite the above very simply as
$$\min_{\mathbf{w}, b}\ \|\mathbf{w}\|_2^2 + C\sum_{i=1}^n \big[1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)\big]_+$$
Agreed this is simpler than before, but I still don't know how to use this to find the model
This is where calculus and other math topics come to our rescue!
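To preview how calculus helps, here is a rough sketch of minimizing the unconstrained objective above by (sub)gradient descent; the step size, epoch count, and toy data are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def train_csvm(X, y, C=10.0, lr=0.01, epochs=200):
    """Subgradient descent on ||w||_2^2 + C * sum_i [1 - y^i (w^T x^i + b)]_+ ."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                                  # points with nonzero hinge loss
        grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage (illustrative data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(train_csvm(X, y))
```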

Emoticons from Flaticon, designed by Twitter


https://www.flaticon.com/packs/smileys-and-people-9
Licensed under CC BY 3.0