term of the objective function look like for the support
vector machine. So in lecture, you've seen that the goal of the Support Vector Machine, or SVM, is to maximize the margin of the decision boundary.
Now, what is this margin?
So the margin is defined in this way. Consider a data set D that has data points xi and yi for i equals 1 to n, so n data points. Let's calculate the distance from point i to the decision boundary, and let's call that distance gamma. So gamma is equal to yi times the quantity theta transpose xi plus theta 0, all over the norm of theta.
So this gamma represents the distance
between each of the points in D and the decision boundary. And we define the margin--
let's call that d--
as the minimum distance from any of the points to the decision boundary. So d is the minimum over all xi, yi in D of gamma of xi, yi, theta, and theta 0. And gamma here is a function of xi, yi, theta, and theta 0, as you see here. So what does this margin look like in graphical terms? Let's draw some pictures.
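To make these definitions concrete, here is a minimal sketch in NumPy. The function names and the toy data set are my own illustration, not from the lecture; they just implement the two formulas above.

```python
import numpy as np

def gamma(x_i, y_i, theta, theta_0):
    """Signed distance from the labeled point (x_i, y_i) to the decision
    boundary: y_i * (theta . x_i + theta_0) / ||theta||.
    Positive when the point is on the correct side of the boundary."""
    return y_i * (theta @ x_i + theta_0) / np.linalg.norm(theta)

def margin(X, y, theta, theta_0):
    """Margin d: the minimum of gamma over all points in the data set."""
    return min(gamma(x_i, y_i, theta, theta_0) for x_i, y_i in zip(X, y))

# A tiny separable data set: the boundary is x1 = 2,
# and each point sits at distance 1 from it.
X = np.array([[1.0, 2.0], [3.0, 1.0]])
y = np.array([-1, 1])
theta = np.array([1.0, 0.0])
theta_0 = -2.0
```

With this data, both points have gamma equal to 1, so the margin is 1 as well.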
So I'm going to draw two identical data sets here.
And for each, there's some positively labeled points and some negatively labeled points.
So this is a very simple classification problem.
So now let's look at two decision boundaries. Let's have one be like this with some theta, theta 0.
And let's have another one be slightly different, like this,
with some theta and theta 0. So in this case, the margin-- the distance between the decision boundary and the point closest to it-- is just the distance between this point, which is the closest to the decision boundary, and the decision boundary. So here the margin is this distance. Let's call this d1. And here, the margin is again the distance between this point and the decision boundary, because this point is still the closest point to the decision boundary in this graph. Let's call this margin d2. And we see that d2 is slightly bigger than d1 in this case.
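The comparison between the two boundaries can also be checked numerically. This is a small sketch using the same signed-distance formula; the toy data set and the two boundaries are my own illustration of the picture described above.

```python
import numpy as np

def margin(X, y, theta, theta_0):
    # minimum signed distance gamma over all points in the data set
    return min(y_i * (theta @ x_i + theta_0) / np.linalg.norm(theta)
               for x_i, y_i in zip(X, y))

# Toy data: negatively labeled points near x1 = 0,
# positively labeled points near x1 = 4.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])

d1 = margin(X, y, np.array([1.0, 0.0]), -3.0)  # boundary at x1 = 3, close to the positives
d2 = margin(X, y, np.array([1.0, 0.0]), -2.0)  # boundary at x1 = 2, centered between classes
```

Here d1 comes out to 1 and d2 to 2, so the centered boundary has the larger margin, as in the picture.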
And in the support vector machine, we want to maximize the margin.
So we want to maximize this distance here. And why is that the case? Well, generally when we maximize the margin, or we have a large margin, that gives our model more ability to generalize.
And this can be seen through a simple example.
So let's suppose that we have a test data point that's negatively labeled. And it happens to be somewhere right here. So correspondingly in this graph, it would be somewhere like here. In this case, because of the large margin, the test data point is still correctly labeled by this decision boundary. Whereas in the other case, the test data point is actually misclassified by the decision boundary. And here we see why having a large margin is better and allows for more generalization. So the goal in SVM, again, is to maximize the margin. Let's remove the test data point for now.
And let's talk about how do we actually achieve this goal.
So for support vector machine, the loss term we use is equal to the hinge loss.
And the hinge loss is defined as a function of gamma and gamma ref. More specifically, it is 1 minus gamma over gamma ref for gamma less than gamma ref, and 0 otherwise. I will talk later about what this gamma ref is. Gamma is already defined here.
Let's first look at what happens to a loss
for different values of gamma.
So if gamma is greater than gamma ref,
then we're in the second condition here, and the loss is just 0. So this is the hinge loss on this axis, and here's gamma ref. When gamma is less than gamma ref but greater than 0, the hinge loss is somewhere between 0 and 1, because when gamma is equal to gamma ref, this is 0, and when gamma is 0, this is 1. So the hinge loss grows linearly between gamma equal to gamma ref and gamma equal to 0. And for even smaller values of gamma, the loss just continues to grow linearly, like this.
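The piecewise definition above is short enough to write out directly. A minimal sketch, with the three regimes just described checked as examples (the function name is my own):

```python
def hinge_loss(g, gamma_ref):
    """Hinge loss as defined above: 1 - g / gamma_ref when g is below
    gamma_ref, and 0 otherwise. Zero for g >= gamma_ref, exactly 1 at
    g = 0, and growing linearly as g goes negative."""
    return 1 - g / gamma_ref if g < gamma_ref else 0.0
```

For example, with gamma ref equal to 1: a point with gamma 2 incurs loss 0, a point with gamma 0 incurs loss 1, and a misclassified point with gamma -1 incurs loss 2.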
Now, in graphical terms we need to define what gamma ref is.
So this is where I will explain what gamma ref is. In essence, gamma ref is basically just the margin of the decision boundary. So in this graph here, if we draw in the support vectors, gamma ref is exactly this distance. And we see that if we have a point out here, where gamma is greater than gamma ref-- let's say a negatively labeled point here-- then the classification of that point is correct, and the point is outside of the support vectors. So gamma is greater than gamma ref, and we incur zero loss. If the negatively labeled point is instead here, between the support vector and the decision boundary, such that gamma is between gamma ref and 0, we start to incur some loss. And if the negatively labeled point is on the other side of the decision boundary, so that it's mislabeled, then gamma is negative, and we incur even more loss.
So this is our loss term for support vector machine.
Now let's look at the regularization term.
So again, with our regularization term,
the goal is to allow our model to be more generalizable.
And we just mentioned that to become more generalizable,
we want a large margin. And we just defined our gamma ref to be our margin. So what we want here is to maximize gamma ref. In other words, this is the same as minimizing 1 over gamma ref, because in our objective function we're minimizing, so we want to minimize something here. And in practice, what we actually do is minimize 1 over gamma ref squared, which achieves the same goal. It's just a convention to minimize 1 over gamma ref squared instead of 1 over gamma ref. OK. Now let's put together the loss term and the regularization term and write out the objective function for SVM. So for SVM, our objective function J starts with the loss term, which is the hinge loss. And this is the hinge loss for a single data point. We have n data points, so let's average over the n data points. So we have 1/n times the sum from i equals 1 to n of the hinge loss of gamma of xi, yi, theta, theta 0 with respect to gamma ref-- so that's the loss term-- plus alpha, which is our hyperparameter, times the regularization term, which is just 1 over gamma ref squared.
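Putting the two terms together, the whole objective can be sketched in a few lines of NumPy. This is my own illustration of the formula just stated, treating gamma ref as a given variable for now (the next step in the lecture removes it):

```python
import numpy as np

def hinge_loss(g, gamma_ref):
    # 1 - g / gamma_ref below gamma_ref, 0 otherwise
    return 1 - g / gamma_ref if g < gamma_ref else 0.0

def svm_objective(X, y, theta, theta_0, gamma_ref, alpha):
    """J = (1/n) * sum_i hinge_loss(gamma_i, gamma_ref) + alpha / gamma_ref**2,
    where gamma_i is the signed distance of point i to the boundary."""
    norm = np.linalg.norm(theta)
    losses = [hinge_loss(y_i * (theta @ x_i + theta_0) / norm, gamma_ref)
              for x_i, y_i in zip(X, y)]
    return sum(losses) / len(losses) + alpha / gamma_ref ** 2
```

On a data set where every point sits exactly at distance gamma ref from the boundary, every hinge loss is 0 and J reduces to just the regularization term alpha over gamma ref squared.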
So there is still a question here
of how to define gamma ref in terms of things we already know, because currently we're using gamma ref as a new variable. But we can actually define gamma ref in terms of theta and theta 0, and then we don't have to keep around this new variable. So let's look into what gamma ref actually is in terms of theta and theta 0.