• W(new) = W(old) + α · (error) · input
• The dataset outputs belong to either class +1 or class -1:
• X(1) = (1, 0, 1), output = -1
• X(2) = (0, -1, -1), output = +1
• X(3) = (-1, -0.5, -1), output = +1
• Take the learning constant as 0.1 and the initial weight vector as (1, -1, 0). Show the weight updates for two epochs. Ignore the bias input (x0 = 0, w0 = 0).
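The exercise above can be sketched in code. This is a minimal illustration, assuming a perceptron with a sign activation and the tie-breaking convention sign(0) = +1 (the slides do not fix that convention, so it is an assumption here):

```python
def sign(z):
    # Assumed convention: sign(0) = +1
    return 1 if z >= 0 else -1

def perceptron_epochs(X, T, w, alpha=0.1, epochs=2):
    """Perceptron rule: w <- w + alpha * (t - o) * x, with o = sign(w . x)."""
    w = list(w)
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = sign(sum(wi * xi for wi, xi in zip(w, x)))
            error = t - o            # 0 when the example is classified correctly
            if error != 0:
                w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

X = [(1, 0, 1), (0, -1, -1), (-1, -0.5, -1)]
T = [-1, 1, 1]
w = perceptron_epochs(X, T, (1, -1, 0))   # final weights after two epochs
```

With these conventions the weights settle at approximately (0.4, -1.1, -0.6) after two epochs; X(1) triggers an update in both epochs, X(3) only in the first.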
Gradient Descent learning
Take the learning constant as 0.1 and the initial weight vector as (1, -1, 0). Show the weight updates for two epochs. Ignore the bias input (x0 = 0, w0 = 0).
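A sketch of the gradient-descent version of this exercise, assuming a linear unit o = w · x trained with the delta (LMS) rule — an assumption, since the slides do not state the activation used here:

```python
def gd_epochs(X, T, w, alpha=0.1, epochs=2):
    """Delta/LMS rule: linear output o = w . x, update w <- w + alpha*(t - o)*x."""
    w = list(w)
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = sum(wi * xi for wi, xi in zip(w, x))   # continuous output, no threshold
            w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, x)]
    return w

X = [(1, 0, 1), (0, -1, -1), (-1, -0.5, -1)]
T = [-1, 1, 1]
w = gd_epochs(X, T, (1, -1, 0))   # weights after two epochs of per-example updates
```

Unlike the perceptron rule, the error (t − o) here is continuous, so every example produces a (possibly small) update, and the sum of squared errors shrinks each epoch.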
The variants of Gradient Descent
• The main difference between them is the amount of data used to compute the gradient at each learning step.
• The trade-off is the accuracy of the gradient estimate versus the time complexity of each parameter update.
Batch Gradient Descent
• We consider all the examples for every step of Gradient Descent: derivatives are computed over all the training examples to obtain each new parameter vector.
• We sum over all examples on each iteration when performing the parameter updates.
• This is infeasible when the training data is huge, because the per-iteration computational cost is very high.
Stochastic GD (Online training)
• Approximates the gradient by the gradient of one randomly chosen sample.
• The gradient calculated this way is a stochastic approximation to the gradient computed over the entire training data. Each update is much faster to calculate than in batch gradient descent, but over many updates the trajectory is noisier.
• Also called incremental GD.
• In contrast, BGD yields a smoother decrease of the objective function.
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient ∇E_D[w], where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²
2. w ← w − η ∇E_D[w]
Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D:
1. Compute the gradient ∇E_d[w], where E_d[w] = ½ (t_d − o_d)²
2. w ← w − η ∇E_d[w]
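The two update modes above can be sketched for a linear unit o = w · x with squared error (a minimal illustration, assuming that linear model, which the pseudocode itself leaves unspecified):

```python
def batch_step(w, X, T, eta):
    """Batch mode: accumulate the gradient over ALL examples, then update once."""
    grad = [0.0] * len(w)
    for x, t in zip(X, T):
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * xi          # d/dw_i of 1/2 * (t - o)^2
    return [wi - eta * g for wi, g in zip(w, grad)]

def incremental_step(w, X, T, eta):
    """Incremental mode: update the weights after EACH example."""
    for x, t in zip(X, T):
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w
```

Both drive the error down on the exercise dataset; the batch step uses the exact gradient of E_D, while the incremental step follows a noisy per-example estimate.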
The index order of the training examples is shuffled each epoch, until convergence.
For a problem with n data points, mini-batch size b, and feature dimension p, the per-update costs of the three variants are:
1. Full (batch) gradient: O(np)
2. Mini-batch: O(bp)
3. Standard SGD: O(p)
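A mini-batch update sits between the two extremes: it touches b examples, so one step costs O(bp). A minimal sketch for the same linear-unit setup (the averaging by b is a common convention, assumed here rather than stated in the slides):

```python
import random

def minibatch_sgd_step(w, X, T, eta, b):
    """One update from a random mini-batch of size b: cost O(b * p)."""
    batch = random.sample(range(len(X)), b)    # b indices without replacement
    grad = [0.0] * len(w)
    for d in batch:
        o = sum(wi * xi for wi, xi in zip(w, X[d]))
        for i, xi in enumerate(X[d]):
            grad[i] += -(T[d] - o) * xi
    # Average over the mini-batch so eta has a comparable scale for any b
    return [wi - eta * g / b for wi, g in zip(w, grad)]
```

Setting b = 1 recovers standard SGD; b = n recovers the (averaged) full-batch gradient.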
• If α is very small, convergence takes a long time and becomes computationally expensive.
• If α is too large, the updates may overshoot the minimum and fail to converge.
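Both failure modes can be seen on a toy objective. This illustration (not from the slides) runs gradient descent on f(w) = w², whose gradient is 2w, so each step multiplies w by (1 − 2α):

```python
def descend(alpha, steps=50, w=1.0):
    """Gradient descent on f(w) = w^2; gradient is 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w      # w <- w * (1 - 2*alpha)
    return w

small = descend(0.01)   # |1 - 0.02| = 0.98: converges, but slowly
good = descend(0.1)     # |1 - 0.2| = 0.8: converges quickly toward 0
large = descend(1.1)    # |1 - 2.2| = 1.2 > 1: overshoots and diverges
```

After 50 steps, α = 0.1 is essentially at the minimum, α = 0.01 is still far from it, and α = 1.1 has blown up — exactly the two risks listed above.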
• The mini-batch size b is a hyperparameter of the learning algorithm.