
• Consider the perceptron training rule that uses a hard-limiting transfer function (-1/+1)
• W(new) = W(old) + α (error) × input

• The dataset has outputs of either class +1 or class -1

• X(1) = (1, 0, 1), output = -1
• X(2) = (0, -1, -1), output = +1
• X(3) = (-1, -0.5, -1), output = +1

• Take the learning constant as 0.1 and the initial weight vector as (1, -1, 0). Show the weight updates for two epochs. Ignore the bias input (x0 = 0, w0 = 0). A small sketch of this exercise follows below.
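A minimal Python sketch of this exercise (an illustration, not a definitive answer key); it assumes the hard limiter returns +1 when the net input is exactly zero, which is one common convention.

# Perceptron training rule with a hard-limiting (-1/+1) transfer function.
# W(new) = W(old) + alpha * (target - output) * input, bias ignored.
import numpy as np

def hard_limit(net):
    # Assumed convention: output +1 when net >= 0, otherwise -1.
    return 1 if net >= 0 else -1

X = np.array([[ 1.0,  0.0,  1.0],
              [ 0.0, -1.0, -1.0],
              [-1.0, -0.5, -1.0]])
targets = np.array([-1, 1, 1])
w = np.array([1.0, -1.0, 0.0])   # initial weight vector
alpha = 0.1                      # learning constant

for epoch in (1, 2):             # two epochs, examples taken in the given order
    for x, t in zip(X, targets):
        o = hard_limit(np.dot(w, x))
        w = w + alpha * (t - o) * x
        print(f"epoch {epoch}: x={x}, target={t}, output={o}, w={w}")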
Gradient Descent learning
Variants: Stochastic, Batch and Mini-batch
The variants of Gradient Descent
• The main difference between them is the amount of data we use when computing the gradients for each learning step.
• The trade-off between them is the accuracy of the gradient estimate versus the time complexity of each parameter update.
Batch Gradient Descent
• We consider all the examples for every step of Gradient Descent, which means we compute the derivatives over all the training examples to obtain each new parameter vector (a minimal sketch follows below).
• We sum over all examples on each iteration when performing the updates to the parameters.
• This becomes infeasible when the training data is huge, because the per-iteration computational cost is very high.
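As a concrete illustration (not part of the original slides), here is a minimal batch gradient descent sketch on a toy least-squares linear model; the data, learning rate, and iteration count are made-up illustrative values.

# Batch gradient descent: every update sums the gradient over ALL n examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 toy examples, p = 3 features
t = X @ np.array([2.0, -1.0, 0.5])       # targets generated by a known linear rule
w = np.zeros(3)                          # parameters to be learned
eta = 0.005                              # learning rate

for step in range(500):
    o = X @ w                            # outputs for every training example
    grad = -(X.T @ (t - o))              # gradient of E[w] = 1/2 * sum_d (t_d - o_d)^2
    w = w - eta * grad                   # one parameter update per full pass over the data
print(w)                                 # approaches [2.0, -1.0, 0.5]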
Stochastic GD (Online training)
• Approximates the gradient by the gradient of one randomly chosen sample (a minimal sketch follows below).
• The gradient calculated this way is a stochastic approximation to the gradient calculated using the entire training data. Each update is much faster to compute than in batch gradient descent, and over many updates the accumulated effect approximates that of batch gradient descent.
• Also called incremental GD.
• In contrast, with BGD we get a smoother decrease of the objective function.
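A matching stochastic (incremental) sketch on the same toy least-squares setup as the batch sketch above; again, the numbers are illustrative.

# Stochastic (online / incremental) gradient descent:
# each update uses the gradient of ONE randomly chosen example, cost O(p) per step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # same toy data as the batch sketch
t = X @ np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
eta = 0.01

for step in range(3000):
    d = rng.integers(len(X))             # pick one training example at random
    o = X[d] @ w
    w = w + eta * (t[d] - o) * X[d]      # noisy but cheap approximation of the full gradient step
print(w)                                 # converges to roughly [2.0, -1.0, 0.5]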
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient ∇E_D[w], where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²
2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied:
For each training example d in D:
1. Compute the gradient ∇E_d[w], where E_d[w] = ½ (t_d − o_d)²
2. w ← w − η ∇E_d[w]

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
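To connect these formulas to an explicit update, assume a linear unit with output o_d = w · x_d (the usual setting for these formulas, though not stated on the slide); then the per-example gradient works out as

∇E_d[w] = ∂/∂w [ ½ (t_d − o_d)² ] = −(t_d − o_d) x_d

so step 2 of the incremental mode becomes the familiar delta-rule update

w ← w + η (t_d − o_d) x_d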

The order of the training-example indices is shuffled, and updates are repeated until convergence.
For a problem with n data points, mini-batch size b, and feature dimension p, the per-iteration costs of the three variants are (a sketch of the mini-batch variant follows below):
1. full gradient (batch): O(np)
2. mini-batch: O(bp)
3. standard SGD: O(p)
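A minimal mini-batch sketch, reusing the toy least-squares setup from the batch and stochastic sketches above; the mini-batch size b = 10 is an arbitrary illustrative choice.

# Mini-batch gradient descent: each update uses b examples, costing O(bp) per
# iteration instead of O(np) (full batch) or O(p) (pure SGD).
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 100, 3, 10                     # n examples, p features, mini-batch size b
X = rng.normal(size=(n, p))
t = X @ np.array([2.0, -1.0, 0.5])
w = np.zeros(p)
eta = 0.01

for epoch in range(50):
    order = rng.permutation(n)           # shuffle the example indices each epoch
    for start in range(0, n, b):
        idx = order[start:start + b]     # next mini-batch of b examples
        o = X[idx] @ w
        grad = -(X[idx].T @ (t[idx] - o))
        w = w - eta * grad
print(w)                                 # approaches [2.0, -1.0, 0.5]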
• If α is very small, it takes a long time to converge and becomes computationally expensive.
• If α is too large, it may overshoot the minimum and fail to converge (a small illustration follows below).
• The "mini-batch size" is a hyperparameter of the learning algorithm.
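A tiny illustration of the learning-rate remark, using the one-dimensional objective f(w) = w² (chosen only for illustration): a very small α crawls toward the minimum, while a too-large α overshoots and diverges.

# Effect of the learning rate alpha on gradient descent for f(w) = w**2,
# whose gradient is f'(w) = 2*w and whose minimum is at w = 0.
def descend(alpha, steps=20, w=1.0):
    for _ in range(steps):
        w = w - alpha * 2 * w            # one gradient descent step
    return w

print(descend(alpha=0.01))               # ~0.67: still far from 0, convergence is slow
print(descend(alpha=0.5))                # 0.0: reaches the minimum immediately
print(descend(alpha=1.1))                # ~38: overshoots each step and diverges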
