
So let's look at what the loss and regularization term of the objective function look like for the support vector machine.
So in lecture, you've seen that the goal of the Support Vector
Machine, or SVM, is to maximize the margin of the decision
boundary.

Now, what is this margin?


So the margin is defined in this way.
So consider a data set D that has data points xi and yi for i
equals 1 to n, so n data points.
Let's calculate the distance from point i
to the decision boundary.
Let's call that distance gamma.
So gamma is equal to yi times the quantity theta dot xi plus
theta 0, over the norm of theta.

So this gamma represents the distance between each of the points in D and the decision boundary.
And we define the margin--
let's call that d--
as the minimum distance between any
of the points to the decision boundary,
so the minimum over all points xi, yi in D
of gamma of xi, yi, theta, and theta 0.
And gamma here is a function of xi, yi, theta, and theta 0,
as you see here.
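The distance and margin computations just described can be sketched in Python (a minimal illustration, assuming numpy; the function names and the toy data set are my own, not from the lecture):

```python
import numpy as np

def gamma(x, y, theta, theta_0):
    # Signed distance from point (x, y) to the boundary theta . x + theta_0 = 0:
    # gamma = y * (theta . x + theta_0) / ||theta||, positive when x is
    # correctly classified.
    return y * (np.dot(theta, x) + theta_0) / np.linalg.norm(theta)

def margin(X, Y, theta, theta_0):
    # The margin is the minimum gamma over all points (xi, yi) in the data set.
    return min(gamma(x, y, theta, theta_0) for x, y in zip(X, Y))

# Toy data set: two positively and two negatively labeled points in 2D.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
Y = np.array([1, 1, -1, -1])
theta, theta_0 = np.array([1.0, 1.0]), 0.0
print(margin(X, Y, theta, theta_0))  # distance of the closest point
```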
So what does this margin look like in graphical terms?
So let's draw some pictures.

So I'm going to draw two identical data sets here.


And for each, there's some positively labeled points
and some negatively labeled points.

So this is a very simple classification problem.


So now let's look at two decision boundaries.
Let's have one be like this with some theta, theta 0.

And let's have another one be slightly different, like this, with some other theta and theta 0.
So in this case, the margin is the distance
between this point and the decision boundary,
because this point is the closest
to the decision boundary.
So here the margin is this distance.
Let's call this d1.
And here, the margin is again the distance
between this point and the decision boundary,
because this point is still the closest point to the decision
boundary in this graph.
And let's call this margin d2.
And we see here that d2 is slightly bigger than d1 in this case.

And in the support vector machine, we want to maximize the margin.


So we want to maximize this distance here.
And why is that the case?
Well, because generally when we maximize the margin,
or have a large margin, that gives our model
more ability to generalize,
so more generalization.

And this can be seen through a simple example.


So let's suppose that we have a test
data point that's negatively labeled.
And it happens to be somewhere right here.
So correspondingly in this graph,
it would be somewhere like here.
So in this case, because of the large margin,
this test data point is still correctly labeled
by this decision boundary.
Whereas in this case, the test data point
is actually misclassified by the decision boundary.
And here we see why having a large margin is better
and allows for more generalization.
So the goal in SVM, again, is to maximize the margin.
So let's remove the test data point for now.

And let's talk about how do we actually achieve this goal.


So for the support vector machine, the loss term we use
is the hinge loss.

And the hinge loss is defined as a function of gamma and gamma ref.
More specifically, it is 1 minus gamma over gamma ref
for gamma less than gamma ref, and 0 otherwise.
And I will talk later about what this gamma ref is.
The gamma is already defined here.

Let's first look at what happens to the loss for different values of gamma.

So if gamma is greater than gamma ref,
then we're in the second condition here,
and the loss is just 0.
So this is the hinge loss on this axis, and this is 0.
And here's gamma ref.
When gamma is less than gamma ref but greater than 0,
then the hinge loss is somewhere between 0 and 1,
because when gamma is equal to gamma ref, this is 0.
When gamma is 0, this is 1.
So the hinge loss grows like this,
from gamma equal to gamma ref to gamma equal to 0.
And for even smaller values of gamma,
this loss will just continue to grow linearly.
So it would just go like this.
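The piecewise hinge loss just sketched on the board can be written out directly (a small illustration; here gamma_ref is just taken as a given number):

```python
def hinge_loss(g, gamma_ref):
    # 0 when the point is at or beyond the margin (g >= gamma_ref);
    # grows linearly as g drops below gamma_ref: value 1 at g = 0,
    # larger than 1 for negative g (misclassified points).
    if g < gamma_ref:
        return 1 - g / gamma_ref
    return 0.0

# The three regimes from the graph, with gamma_ref = 1 for illustration:
print(hinge_loss(2.0, 1.0))   # beyond the margin: 0.0
print(hinge_loss(0.5, 1.0))   # inside the margin: 0.5
print(hinge_loss(-1.0, 1.0))  # misclassified: 2.0
```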

Now, in graphical terms we need to define what gamma ref is.


So this is where I will explain what gamma ref is.
And in essence, gamma ref is basically just the margin
of the decision boundary.
So in this graph here, gamma ref,
if we draw in the support vectors,
gamma ref is exactly this distance.
And we see that if we have a point out here,
where gamma is greater than gamma ref, then--
let's say a negatively labeled point here--
then the classification of that point
is correct and is outside of the support vectors.
So gamma is greater than gamma ref, and we incur zero loss.
If the negatively labeled point is actually
here between the support vector and the decision boundary,
such that the gamma is between gamma ref and 0,
we start to incur some loss.
And if the negatively labeled point is on this side of the decision
boundary so that it's mislabeled, then gamma is negative,
and we start to incur even more loss.

So this is our loss term for support vector machine.


Now let's look at the regularization term.

So again, with our regularization term, the goal is to allow our model to be more generalizable.

And we just mentioned that to become more generalizable, we want a large margin.
And we just defined our gamma ref to be our margin.
So what we want here is we want to maximize gamma ref.
In other words, this is the same as minimizing 1 over gamma ref,
because in our objective function
we're minimizing the function, so we want
to minimize something here.
And in practice, what we actually do
is that we minimize 1 over gamma ref squared,
which achieves the same goal.
It's just a convention to minimize 1 over gamma ref squared
instead of 1 over gamma ref.
OK.
Now let's put together loss term and the regularization term
and write out the objective function for SVM.
So for SVM, our objective function J is--
first, the loss term, which is the hinge loss.
And this is the hinge loss for a single data point.
We have n data points, so let's do an average over the n data
points.
So we have 1/n times the sum from i equals 1 to n
of the hinge loss of gamma i over gamma ref--
so that's the loss term--
plus alpha, which is our hyperparameter,
times the regularization term, which is just 1 over gamma ref
squared.
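Putting the loss and regularization terms together, the objective J can be sketched as follows (assuming numpy; gamma_ref is treated as a given scalar here, just as in the discussion above, and the toy data is my own):

```python
import numpy as np

def svm_objective(X, Y, theta, theta_0, gamma_ref, alpha):
    # gamma_i = y_i * (theta . x_i + theta_0) / ||theta|| for each point.
    gammas = Y * (X @ theta + theta_0) / np.linalg.norm(theta)
    # Average hinge loss over the n points: max(0, 1 - gamma_i / gamma_ref).
    loss = np.maximum(0.0, 1 - gammas / gamma_ref).mean()
    # Regularization: alpha / gamma_ref^2 encourages a large margin.
    return loss + alpha / gamma_ref**2

# Toy example: every point lies beyond the margin, so the average hinge loss
# is 0 and J reduces to alpha / gamma_ref^2 = 0.1 / 4.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
Y = np.array([1, 1, -1, -1])
theta, theta_0 = np.array([1.0, 1.0]), 0.0
print(svm_objective(X, Y, theta, theta_0, gamma_ref=2.0, alpha=0.1))  # 0.025
```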

So there is still a question here of how do we define gamma ref in terms of things we already know.
Because essentially, currently we're
using gamma ref as a new variable.
But we can actually define gamma ref
in terms of theta and theta 0,
and then we don't have to keep this new variable around.
So let's look into what gamma ref actually is
in terms of theta and theta 0.
