
Now our problem looks like this.

We have an average of the training losses plus a regularization term.
And we can write it by moving the regularization term
inside the average, since it doesn't
depend on the training example.
So any average over that would return the same thing.
I'm also going to simplify our calculations
by omitting the offset parameter completely here.
So one way to understand this is that we now have an average of simpler objective functions. We have a sum over the training examples of simpler, example-specific objective functions, each of which involves a simple loss function plus the regularization term.
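To make the decomposition concrete, here is a hedged sketch in LaTeX of what that per-example objective might look like, assuming the hinge loss and a (lambda/2) times the squared-norm regularizer from the earlier setup (the exact constant factors may differ from the slides):

```latex
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_i(\theta),
\qquad
J_i(\theta) = \mathrm{Loss}_h\!\left(y^{(i)}\, \theta \cdot x^{(i)}\right) + \frac{\lambda}{2}\,\|\theta\|^2
```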

Now, one way to understand this is that the objective function is actually an average, an expectation, over these individual terms. So we should be able to update the parameters stochastically by taking individual terms, sampled at random. So sample a training example at random, look at its individual objective function, and nudge the parameters in the direction that optimizes that particular term. Then we resample another training example, nudging the parameters in a slightly different direction, and so on. If we take small steps like that, then, on average, we are going to move in the direction that optimizes what we want.
Why do we do this rather than the full gradient descent
step with respect to the whole objective function?
The reason is that it is actually a more efficient way to do the optimization. So what we do is we sample a training example i at random from the set of possible indices of the training samples. And then we perform a gradient descent update with respect to the selected sampled term. So we move the parameters in the direction that seems to optimize that particular term in the objective function.
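In symbols, the stochastic step just described would look roughly like this, where eta is a learning rate (step size); this is a sketch of the generic SGD update, not necessarily the exact notation on the slides:

```latex
i \sim \mathrm{Uniform}\{1, \dots, n\},
\qquad
\theta \leftarrow \theta - \eta\, \nabla_{\theta} J_i(\theta)
```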

Now, since we are doing this stochastically, we introduce some randomness here. We actually need to decrease the learning rate parameter in order to make this converge. So what we actually need to do is to have the learning rate go to 0 as a function of the iterations of this update. More precisely, we want the learning rates to sum to infinity when summed over all the iterations. What that means is that we retain enough steam to be able to nudge the parameters, wherever we are, towards the minimum. But then, on the other hand, we need to reduce the variance from the stochasticity that we're introducing. So we also require the learning rates to be something called square-summable, meaning that the sum of their squared values is finite. So, for example, eta_t equal to 1 over (1 plus t) would suffice to satisfy these constraints.
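Written out, the two conditions on the learning rate schedule being described here are the standard ones from stochastic approximation, with eta_t = 1/(1+t) as one schedule that satisfies both:

```latex
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^2 < \infty,
\qquad
\text{e.g.}\ \ \eta_t = \frac{1}{1+t}
```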

Now let's look at how we actually compute the gradient.


Here we have now the objective function
without the offset parameter.
And we have the i-th term, J_i, here, and an average of those.
So our update here is then to sample i at random.

And now we take a parameter update: we take the old parameter value and nudge it along the negative gradient, with respect to the parameters theta, of the hinge loss of that particular training example plus the regularization term.
So let's see what that gradient actually amounts to.
So now we need to take the gradient, with respect to the parameters, of both terms here. And what we get is that the derivative of the loss can be divided into two parts.
When the loss is 0, then the gradient is also 0 as a vector
here.
So when the loss is 0, then nothing
will come out of that gradient.
When the loss is non-zero, the hinge loss is actually 1 minus its argument. The argument is just a linear function of theta, so when you take the gradient of such a thing with respect to the parameters, you get just the multiplying terms. So you will get a minus sign here from that "1 minus" part, and then the label, which is a scalar, times x_i, which is a vector.
So this is the gradient of the hinge loss
when the loss is non-zero.
And then we have another term here, coming from the regularization term, whose gradient will be just lambda times the parameter value itself.
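As an illustrative sketch (not the lecture's own code), the per-example gradient step just described could be written like this in Python; the function name sgd_step and its arguments are placeholders I am introducing for illustration, assuming NumPy arrays and labels in {-1, +1}:

```python
import numpy as np

def sgd_step(theta, x_i, y_i, lam, eta):
    """One stochastic gradient step for the hinge loss plus (lambda/2)*||theta||^2 term.

    theta : current parameter vector (offset omitted, as in the lecture)
    x_i, y_i : a single training example and its +/-1 label
    lam : regularization weight lambda
    eta : learning rate for this iteration
    """
    agreement = y_i * np.dot(theta, x_i)
    # Gradient of the hinge loss: zero when the loss is zero,
    # and -y_i * x_i when the loss is non-zero (agreement < 1).
    if agreement < 1:
        loss_grad = -y_i * x_i
    else:
        loss_grad = np.zeros_like(theta)
    # Gradient of the regularization term is lambda * theta.
    grad = loss_grad + lam * theta
    return theta - eta * grad
```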

All right, so there are three differences between this and our earlier perceptron algorithm. The first is that we are actually using a decreasing learning rate, due to the stochasticity.
The second difference is that we are actually
performing the update, even if we are correctly classifying
the example, because the regularization term will always
yield an update, regardless of what the loss is.
And the role of that regularization term is to nudge the parameters a little bit backwards, so as to decrease the norm of the parameter vector at every step, which corresponds to trying to maximize the margin. To counterbalance that, we will get a non-zero derivative from the loss term if the loss is non-zero. And that update looks like the perceptron update, but it is actually made even if we correctly classify the example: if the example is within the margin boundaries, you would still get a non-zero loss.

So here, we have just a better way of writing what that stochastic gradient descent update, or SGD update, looks like.
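Putting the pieces together, a minimal training loop along these lines might look like the following sketch; the function name sgd_train and the default values for lam, num_iters, and seed are assumptions for illustration, not the course's reference implementation:

```python
import numpy as np

def sgd_train(X, y, lam=0.1, num_iters=1000, seed=0):
    """Stochastic gradient descent for the averaged hinge-loss-plus-regularization
    objective described above (offset parameter omitted, labels in {-1, +1}).

    X : (n, d) array of training examples
    y : (n,) array of +/-1 labels
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(num_iters):
        eta = 1.0 / (1.0 + t)              # decreasing learning rate, e.g. eta_t = 1/(1+t)
        i = rng.integers(n)                # sample a training example at random
        agreement = y[i] * np.dot(theta, X[i])
        # Hinge-loss gradient: -y_i * x_i when the loss is non-zero, else zero.
        loss_grad = -y[i] * X[i] if agreement < 1 else np.zeros(d)
        # The regularization gradient lambda * theta is applied on every step,
        # even when the example is correctly classified.
        theta = theta - eta * (loss_grad + lam * theta)
    return theta
```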
