plus a regularization term. And we can rewrite it by moving the regularization term inside the average, since it doesn't depend on the training example; averaging a constant just returns that constant. I'm also going to simplify our calculations by omitting the offset parameter completely here. So one way to understand this is that we now have an average of simpler objective functions: a sum over the training examples of example-specific objective functions, each involving a simple loss function plus the regularization term.
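To make this concrete, here is a minimal NumPy sketch of the objective written as an average of per-example terms. The names `X`, `y`, `theta`, and `lam` are my own for illustration (a feature matrix, plus/minus-one labels, the parameter vector, and the regularization strength); they are not from the lecture.

```python
import numpy as np

def objective(theta, X, y, lam):
    """Average of per-example terms J_i(theta), with the regularizer
    folded inside the average (it is the same for every example)."""
    # Hinge loss on each example: max(0, 1 - y_i * (theta . x_i))
    hinge = np.maximum(0.0, 1.0 - y * (X @ theta))
    # Each J_i = hinge_i + (lam / 2) * ||theta||^2
    per_example = hinge + 0.5 * lam * np.dot(theta, theta)
    return per_example.mean()
```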
Now, one way to understand this is that the objective function is actually an average, an expectation, over these individual terms. So we should be able to update the parameters stochastically by taking each individual term, sampled at random: sample a training example at random, look at its individual objective function, and nudge the parameters in the direction that optimizes that particular term. Then we resample another training example, nudge the parameters in a slightly different direction, and so on. If we take small steps like that, then, on average, we are going to move in the direction that optimizes what we want. Why do we do this rather than a full gradient descent step with respect to the whole objective function? The reason is that it is actually a more efficient way to do the optimization. So what we do is sample a training example i at random from the set of possible indices of the training examples, and then perform a gradient descent update with respect to the selected sampled term. So we move the parameters in the direction that optimizes that particular term in the objective function.
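Written out in symbols (notation I'm supplying here, with J_i the i-th per-example term and eta_t the step size at iteration t), the update is:

```latex
i \sim \mathrm{Uniform}\{1,\dots,n\}, \qquad
\theta \;\leftarrow\; \theta - \eta_t \,\nabla_\theta J_i(\theta)
```

Since the expectation of the sampled gradient over a uniformly random i equals the gradient of the full average objective, each noisy step points in the right direction on average.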
Now, since we are doing this stochastically, we introduce some randomness here, so we actually need to decrease the learning rate parameter in order to make this converge. What we need is for the learning rate to go to 0 as a function of the iterations of this update. More precisely, we want the learning rates to sum to infinity if we sum over all the iterations. What that means is that we retain enough steam to be able to nudge the parameters, wherever we are, towards the minimum. But then, on the other hand, we need to reduce the variance from the stochasticity that we are introducing. So we require the learning rates to be square-summable, meaning that the sum of their squared values is finite. For example, eta_t = 1 / (1 + t) would suffice to satisfy both constraints.
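In symbols, the two conditions on the learning rate schedule are:

```latex
\sum_{t=1}^{\infty} \eta_t = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \eta_t^2 < \infty
```

The choice eta_t = 1/(1 + t) satisfies both: the first sum is essentially the harmonic series, which diverges, while the sum of 1/(1 + t) squared converges.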
Now let's look at how we actually compute the gradient.
Here we have the objective function without the offset parameter: the i-th term, J_i, and an average of those terms. So our update is to sample i at random.
Then we take a parameter update: start from the old parameter value and nudge it opposite to the gradient, with respect to the parameters theta, of the hinge loss on that particular training example plus the regularization term. So let's see what that gradient actually amounts to. We need to take the gradient with respect to the parameters of both terms here. The derivative of the loss divides into two parts. When the loss is 0, the gradient is also 0 as a vector, so nothing comes out of that term. When the loss is non-zero, the hinge loss is 1 minus its argument, and the argument is just a linear function of theta; when you take the gradient of such a thing with respect to the parameters, you get just the multiplying terms. So you get a minus sign from that "1 minus" part, and then the label y_i, which is a scalar, times x_i, which is a vector. This is the gradient of the hinge loss when the loss is non-zero. And then we have another term coming from the regularization term, whose gradient is just lambda times the parameter vector itself.
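Collecting the two cases, the gradient of the sampled term comes out as follows (my rendering of the derivation just described, writing the sampled example as x^(i), y^(i)):

```latex
\nabla_\theta \left[ \mathrm{Loss}_h\!\left(y^{(i)}\,\theta \cdot x^{(i)}\right)
  + \frac{\lambda}{2}\lVert\theta\rVert^2 \right]
= \begin{cases}
\lambda\theta, & y^{(i)}\,\theta \cdot x^{(i)} \ge 1 \quad (\text{loss} = 0)\\[4pt]
-\,y^{(i)} x^{(i)} + \lambda\theta, & \text{otherwise}
\end{cases}
```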
All right, so there are three differences between this and our earlier perceptron algorithm. First, we are using a decreasing learning rate, due to the stochasticity. Second, we perform the update even if we correctly classify the example, because the regularization term always yields an update regardless of what the loss is. The role of that regularization term is to nudge the parameters a little bit backwards, decreasing the norm of the parameter vector at every step, which corresponds to trying to maximize the margin. To counterbalance that, we get a non-zero derivative from the loss term whenever the loss is non-zero. Third, that loss update looks like the perceptron update, but it is made even for some correctly classified examples: if the example is within the margin boundaries, you still get a non-zero loss.
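In particular, when the loss is zero, the update reduces to a pure shrinkage of the parameter vector:

```latex
\theta \;\leftarrow\; \theta - \eta_t\,\lambda\,\theta \;=\; (1 - \eta_t \lambda)\,\theta
```

So each zero-loss step scales theta down slightly, which is exactly the margin-maximizing pull described above.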
So here, we have just a better way of writing what that stochastic gradient descent update, or SGD update, looks like.
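Putting all the pieces together, here is a sketch of the full update loop under the same assumptions as before (the names `X`, `y`, `lam`, and `num_steps` are mine, not the lecture's):

```python
import numpy as np

def sgd_hinge(X, y, lam, num_steps=10000, seed=0):
    """SGD on the average of J_i(theta) = hinge(y_i * theta.x_i) + (lam/2)||theta||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(num_steps):
        eta = 1.0 / (1.0 + t)           # decreasing rate: sums diverge, squares are summable
        i = rng.integers(n)             # sample a training example at random
        grad = lam * theta              # the regularizer contributes at every step
        if y[i] * (X[i] @ theta) < 1:   # loss is non-zero: misclassified or within the margin
            grad -= y[i] * X[i]         # hinge-loss part of the gradient
        theta -= eta * grad             # nudge opposite to the sampled gradient
    return theta
```

Note that the regularization step happens on every iteration, while the hinge part of the gradient only fires when the example violates the margin, matching the three differences from the perceptron discussed above.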