Using a perceptron, a classifier learns the following weights for a
2D dataset (x1, x2) with 500 examples: W0 = 2,
W1 = 3, and W2 = -2.
(b) What is the decision rule for the perceptron
classifier?
(d) Another student has trained on the same dataset and gets the
same weights for both the perceptron and LR classifiers,
say W0 = -2, W1 = 1, and W2 = 3. Is this possible, or has
some mistake been made? Give reasons. What will be the
decision rule in this case? Are the two rules the same or different?
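Not part of the original question, but a minimal sketch may help frame part (d): both classifiers are linear in w·x, and a sigmoid output thresholded at 0.5 fires exactly when w·x >= 0, so with identical weights the decision rules coincide. The 0/1 perceptron convention and the 0.5 threshold below are assumptions.

```python
# Hedged sketch, not from the slides: 0/1 perceptron outputs and a 0.5
# logistic-regression threshold are assumed conventions.
import math

W0, W1, W2 = -2.0, 1.0, 3.0   # weights from part (d)

def perceptron_predict(x1, x2):
    # Decision rule: predict 1 iff W0 + W1*x1 + W2*x2 >= 0
    return 1 if W0 + W1 * x1 + W2 * x2 >= 0 else 0

def logistic_predict(x1, x2):
    p = 1.0 / (1.0 + math.exp(-(W0 + W1 * x1 + W2 * x2)))  # sigmoid(w.x)
    # sigmoid(z) >= 0.5 exactly when z >= 0, so this is the same rule.
    return 1 if p >= 0.5 else 0
```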
MSE Gradient for a Sigmoid Unit

$$E(\vec{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$$

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 = \frac{1}{2}\sum_{d \in D}2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) = -\sum_{d \in D}(t_d - o_d)\,\frac{\partial o_d}{\partial w_i}$$

By the chain rule, $\frac{\partial o_d}{\partial w_i} = \frac{\partial o_d}{\partial net_d}\cdot\frac{\partial net_d}{\partial w_i}$.

But we know: $o_d = \sigma(net_d)$, so $\frac{\partial o_d}{\partial net_d} = o_d(1 - o_d)$; and $net_d = \vec{w}\cdot\vec{x}_d$, so $\frac{\partial net_d}{\partial w_i} = x_{i,d}$.

So:

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D}(t_d - o_d)\,o_d\,(1 - o_d)\,x_{i,d}$$
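The final formula translates directly into code. Below is a minimal NumPy sketch, not from the slides; the names sigmoid and mse_gradient are illustrative.

```python
# A minimal sketch of the MSE gradient for a single sigmoid unit, matching
# dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient(w, X, t):
    """w: weights (n,), X: inputs (m, n), t: targets (m,).
    Returns dE/dw for E = 0.5 * sum_d (t_d - o_d)^2."""
    o = sigmoid(X @ w)                    # o_d = sigma(net_d), net_d = w . x_d
    return -((t - o) * o * (1 - o)) @ X   # -sum_d (t_d - o_d) o_d (1 - o_d) x_d
```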
Explanation of calculations
(Sigmoid + MSE + GD)
• The net input to the perceptron (sigmoid):
in = w1*x1 + w2*x2 + w3*x3 + w0*x0
= 0.15*0 - 0.15*0 + 0.1*1 + 0.2*(-1) = -0.1
(for the first training example)
sigma(-0.1) = 0.475; sigma' = 0.475*(1 - 0.475) = 0.249
(Note: in neuron terminology the bias weight w0 is
often called the threshold t; accordingly, the input x0
= -1, i.e. bias = -threshold.)
Explanation of calculations
• Error = target - actual output
= 0 - 0.475 = -0.475
(for the first example the target is f(x) = 0)
From the weight-update equation derived earlier for MSE:
learning rate x error x sigma' =
0.5 x (-0.475) x 0.249 = -0.059
Multiply this common factor by each input to get the updates:
For x0 = -1: del W0 = +0.059; x1 and x2 are 0, so there is no update to W1 and W2.
For x3 = 1: del W3 = -0.059.
The new weights for the next training example are:
W0(new) = W0(old) + del W0 = 0.2 + 0.059 = 0.259
W3(new) = W3(old) + del W3 = 0.1 - 0.059 = 0.041
W1(new) = W1(old)
W2(new) = W2(old)
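As a check on the arithmetic above, a small Python sketch (not from the slides) that reproduces this first update, using the slide's convention x0 = -1 for the bias input:

```python
# Reproduces the first update: weights w0=0.2, w1=0.15, w2=-0.15, w3=0.1;
# example x = (0, 0, 1) with target 0; learning rate 0.5; bias input x0 = -1.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.5
w = [0.2, 0.15, -0.15, 0.1]      # [w0, w1, w2, w3]
x = [-1, 0, 0, 1]                # [x0, x1, x2, x3]
t = 0                            # target for the first example

net = sum(wi * xi for wi, xi in zip(w, x))   # -0.1
o = sigmoid(net)                             # about 0.475
factor = eta * (t - o) * o * (1 - o)         # about -0.059
w = [wi + factor * xi for wi, xi in zip(w, x)]
print([round(wi, 3) for wi in w])            # [0.259, 0.15, -0.15, 0.041]
```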
The second update (input x = (1, 1, 0))
• The net input to the perceptron:
in = w1*x1 + w2*x2 + w3*x3 + w0*x0
= 0.15*1 - 0.15*1 + 0.041*0 + 0.259*(-1)
= -0.259
(for the second training example)
sigma(-0.259) = 0.436;
sigma' = 0.436*(1 - 0.436) = 0.246
learning rate x error x sigma' = 0.5 x (1 - 0.436) x 0.246
= 0.069 (the common multiplying factor; here the target is 1, so error = 1 - 0.436)
Multiply by the inputs to get the updates.
The updates
• W0(new) = W0(old) + (-1) * 0.069 = 0.259 - 0.069 = 0.190
• W1(new) = W1(old) + (1) * 0.069 = 0.150 + 0.069 = 0.219
• W2(new) = W2(old) + (1) * 0.069 = -0.150 + 0.069 = -0.081
• W3(new) = W3(old) + (0) * 0.069 = 0.041 + 0 = 0.041
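A compact sketch (again not from the slides) that runs both updates in sequence and reproduces these weights, up to rounding of the intermediate values on the slides:

```python
# One SGD pass over both examples ((0,0,1) -> 0 and (1,1,0) -> 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, t, eta=0.5):
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = sigmoid(net)
    factor = eta * (t - o) * o * (1 - o)      # learning rate * error * sigma'
    return [wi + factor * xi for wi, xi in zip(w, x)]

w = [0.2, 0.15, -0.15, 0.1]                   # [w0, w1, w2, w3], with x0 = -1
for x, t in [([-1, 0, 0, 1], 0), ([-1, 1, 1, 0], 1)]:
    w = sgd_step(w, x, t)
print([round(wi, 3) for wi in w])             # about [0.190, 0.219, -0.081, 0.041]
```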
Gradient Descent applications
• Neural Networks
• Linear Regression
• Logistic Regression
• Back-propagation algorithm
• Support Vector Machines
• many others…
The learning rate (alpha) in GD
• One of the most important hyper-parameters
of a learning algorithm (tuning it means we
are looking for a GOOD value). Is the GD
tuning alpha too?
• Hyper-parameters are parameters that are NOT part of the
model.
• One tries a bunch of values and picks the one
that works best.
• Instability (when alpha is too large) vs. slow learning (when alpha is too small); see the toy example below.
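A toy sketch of that trade-off on the assumed function f(x) = x^2 (not from the slides): with the update x <- x - alpha * f'(x), a small alpha crawls toward the minimum while a too-large alpha diverges.

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.
def gd(alpha, x=1.0, steps=20):
    for _ in range(steps):
        x = x - alpha * 2 * x      # x_{k+1} = x_k - alpha * f'(x_k)
    return x

print(gd(0.01))   # slow: after 20 steps still far from the minimum at 0
print(gd(0.4))    # converges quickly
print(gd(1.1))    # unstable: |1 - 2*alpha| > 1, so the iterates blow up
```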
Problems of Gradient Descent
• Calculating derivatives (numerically or
analytically): a trade-off between speed and accuracy.
• When does GD converge?
• The role of the step size is extremely important.
• Can it be used for both convex and non-convex
functions? (local/global minima, saddle points; see the sketch after this list)
• Minimization of a sum of functions:
• one epoch = one full pass over the training data set
• loss functions are evaluated on the i-th element of
the dataset
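To illustrate the convex/non-convex point, a hedged sketch on an assumed non-convex function f(x) = x^4 - 3x^2 + x (not from the slides): GD lands in different minima depending on where it starts.

```python
# GD on the non-convex f(x) = x^4 - 3x^2 + x, with f'(x) = 4x^3 - 6x + 1.
def gd(x, alpha=0.01, steps=2000):
    for _ in range(steps):
        x = x - alpha * (4 * x**3 - 6 * x + 1)
    return x

print(round(gd(-2.0), 3))   # settles in the left (global) minimum
print(round(gd(+2.0), 3))   # settles in the right (local) minimum
```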
How much data to use in gradient
calculations
• Stochastic (Incremental) Gradient Descent:
• Uses only a single training example to calculate the
gradient and update the parameters.
• Batch Gradient Descent:
• Calculates the gradient over the whole dataset and
performs just one update per iteration.
• Mini-batch Gradient Descent:
• Mini-batch gradient descent is a variation of stochastic gradient
descent where, instead of a single training example, a mini-
batch of samples is used. It is one of the most popular
optimization algorithms (a sketch contrasting all three follows below).
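A minimal sketch contrasting the three variants on an assumed least-squares problem; the data, loss, and batch size 16 are illustrative, not from the slides.

```python
# How much data each GD variant uses per parameter update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def grad(w, Xb, yb):
    """MSE gradient on the (mini-)batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
alpha = 0.1

# Batch GD: the whole dataset, one update per pass.
w = w - alpha * grad(w, X, y)

# Stochastic GD: a single training example per update.
i = rng.integers(len(y))
w = w - alpha * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch GD: a small random batch (here 16 samples) per update.
idx = rng.choice(len(y), size=16, replace=False)
w = w - alpha * grad(w, X[idx], y[idx])
```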