
Using a perceptron, a classifier learns the weights for a 2D dataset (x1, x2) with 500 examples as W0 = 2, W1 = 3 and W2 = -2.

The same dataset is used to train a logistic regression classifier with weights W0 = 1, W1 = 3 and W2 = -1. Both classifiers use GD for learning.

Answer the following

(a) Is there any mistake in the training? Can the same dataset produce different weights in this case? Give reasons for your answers.

(b) What is the decision rule for the perceptron classifier?

(c) What is the decision rule for the LR classifier?

(d) Another student has trained on the dataset and gets the same weights for both the perceptron and LR classifiers, let's say W0 = -2, W1 = 1 and W2 = 3. Is this possible, or has some mistake been made? Give reasons. What will be the decision rule in this case? Are the two rules the same or different?

(e) Does the number of training examples play any role in the weights that you get? Give reasons.
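The two decision rules asked about in (b) and (c) can be sketched in code. A minimal Python sketch, assuming the weights given above (W0 is the bias term); the function names are my own:

```python
def sigmoid(z):
    # logistic function
    import math
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_predict(x1, x2, w0=2.0, w1=3.0, w2=-2.0):
    """Perceptron rule: predict 1 if w0 + w1*x1 + w2*x2 >= 0, else 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0

def lr_predict(x1, x2, w0=1.0, w1=3.0, w2=-1.0):
    """LR rule: predict 1 if sigmoid(w0 + w1*x1 + w2*x2) >= 0.5,
    which is equivalent to the net input being >= 0."""
    return 1 if sigmoid(w0 + w1 * x1 + w2 * x2) >= 0.5 else 0
```

Note that both rules threshold the same kind of linear score; the LR classifier additionally yields a probability via the sigmoid.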

MSE Gradient for a Sigmoid Unit
E  1
  d d
wi wi 2 dD
(t  o ) 2
But we know :
 od  (net d )

1
 (t d  od ) 2   od (1  od )
2 d wi net d net d
 
1  net d  ( w  xd )
  2(t d  od )
wi
(t d  od )   xi ,d
2 d wi wi
 od 
  (t d  od )   So :
d  wi  E
od net d    (t d  od )od (1  od ) xi ,d
 -  (t d  od ) wi d D
d net d wi

Explanations of calculations
(SIGMOID + MSE + GD)
• The net input to the perceptron (sigmoid unit):
in = w1*x1 + w2*x2 + w3*x3 + w0*x0
= 0.15*0 - 0.15*0 + 0.1*1 + 0.2*(-1) = -0.1
(for the first training example)
sigma(-0.1) = 0.475; sigma' = 0.475*(1 - 0.475) = 0.249
(Note: in neuron terminology the bias weight w0 is often called the threshold t; accordingly the input is x0 = -1, so BIAS = -(THRESHOLD).)

Explanation of calculations
• Error = target - actual output = 0 - 0.475 = -0.475
(for the first example the target is f(x) = 0)
Using the weight-update equation derived earlier for MSE:
learning rate x error x sigma' = 0.5 x (-0.475) x 0.249 = -0.059
MULTIPLY WITH THE INPUTS:
For x0 = -1: del W0 = +0.059; x1 = x2 = 0, so no update to W1 and W2
For x3 = 1: del W3 = -0.059
Note the calculation of the new weights for the next training example:
W0(new) = W0(old) + del W0 = 0.2 + 0.059 = 0.259
W3(new) = W3(old) + del W3 = 0.1 - 0.059 = 0.041
W1(new) = W1(old); W2(new) = W2(old)
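The first update above can be reproduced in a few lines. A minimal NumPy sketch, assuming the example is encoded as (x0, x1, x2, x3) = (-1, 0, 0, 1) with target 0 and learning rate 0.5, as in the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# weights ordered (w0, w1, w2, w3); x0 = -1 carries the bias
w = np.array([0.2, 0.15, -0.15, 0.1])
x = np.array([-1.0, 0.0, 0.0, 1.0])   # first training example: pattern 001
t, lr = 0.0, 0.5                      # target and learning rate

o = sigmoid(w @ x)                    # net = -0.1, o ~ 0.475
delta = lr * (t - o) * o * (1 - o)    # common factor, ~ -0.059
w_new = w + delta * x                 # approx [0.259, 0.15, -0.15, 0.041]
```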

The second update (pattern 110)
• The net input to the perceptron:
in = w1*x1 + w2*x2 + w3*x3 + w0*x0
= 0.15*1 - 0.15*1 + 0.041*0 + 0.259*(-1) = -0.259
(for the second training example, target = 1)
sigma(-0.259) = 0.436
sigma' = 0.436*(1 - 0.436) = 0.246
learning rate x error x sigma' = 0.5*(1 - 0.436)*0.246 = 0.069 (common multiplying factor)
MULTIPLY WITH THE INPUTS to get the updates
The updates
• W0(new) = W0(old) + (-1)*0.069 = 0.259 - 0.069 = 0.190
• W1(new) = W1(old) + (1)*0.069 = 0.150 + 0.069 = 0.219
• W2(new) = W2(old) + (1)*0.069 = -0.150 + 0.069 = -0.081
• W3(new) = W3(old) + (0)*0.069 = 0.041 + 0 = 0.041
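Both updates can be reproduced by applying the same per-example rule twice. A minimal NumPy sketch, assuming the two training patterns (001, target 0) and (110, target 1) from the slides; the helper name is my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, t, lr=0.5):
    """One incremental (per-example) update for a sigmoid unit under MSE."""
    o = sigmoid(w @ x)
    return w + lr * (t - o) * o * (1 - o) * x

# weights ordered (w0, w1, w2, w3); x0 = -1 carries the bias
w = np.array([0.2, 0.15, -0.15, 0.1])
examples = [
    (np.array([-1.0, 0.0, 0.0, 1.0]), 0.0),  # pattern 001, target 0
    (np.array([-1.0, 1.0, 1.0, 0.0]), 1.0),  # pattern 110, target 1
]
for x, t in examples:
    w = sgd_step(w, x, t)
# w is now approx [0.190, 0.219, -0.081, 0.041], matching the slide
```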

Gradient Descent applications
• Neural Networks
• Linear Regression
• Logistic Regression
• Back-propagation algorithm
• Support Vector Machines
• many others…
The learning rate alpha in GD
• One of the most important hyper-parameters of a learning algorithm (tuning it means we are looking for a GOOD value). Is GD tuning alpha too? No: GD updates the model parameters, not alpha.
• Hyper-parameters are parameters which are NOT part of the model.
• One tries a bunch of values and picks the one which works best.
• Too large an alpha causes instability; too small an alpha causes slow learning.
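GD does not tune alpha itself; one typically sweeps a small grid of values and keeps the best. A minimal sketch, reusing the two-example sigmoid/MSE problem from the earlier slides (the function name and grid values are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_loss(w, X, t, lr, epochs=100):
    """Run batch GD on E = 1/2 * sum (t - o)^2 and return the final loss."""
    for _ in range(epochs):
        o = sigmoid(X @ w)
        grad = -(X.T @ ((t - o) * o * (1 - o)))
        w = w - lr * grad
    o = sigmoid(X @ w)
    return 0.5 * np.sum((t - o) ** 2)

# two-example toy problem; x0 = -1 carries the bias
X = np.array([[-1.0, 0.0, 0.0, 1.0],
              [-1.0, 1.0, 1.0, 0.0]])
t = np.array([0.0, 1.0])
w0 = np.array([0.2, 0.15, -0.15, 0.1])

# try a bunch of alphas and pick the one that works best
for alpha in [0.01, 0.1, 0.5, 1.0]:
    print(alpha, final_loss(w0.copy(), X, t, alpha))
```

On this tiny problem a larger (but still stable) alpha reaches a lower loss in the same number of epochs than a very small one.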
Problems of Gradient Descent
• Calculating derivatives (numerically or analytically): a trade-off between speed and accuracy
• When does GD converge?
• The role of the step size is extremely important
• Can it be used for both convex and non-convex functions? (local/global minima, saddle points)
• Minimization of a sum of functions:
• one epoch = one full pass over the training data set
• the loss function is evaluated on the i-th element of the dataset
How much data to use in gradient calculations
• Stochastic (Incremental) Gradient Descent:
Uses only a single training example to calculate the gradient and update the parameters.
• Batch Gradient Descent:
Calculates the gradient over the whole dataset and performs just one update per iteration.
• Mini-batch Gradient Descent:
A variation of stochastic gradient descent where, instead of a single training example, a mini-batch of samples is used. It is one of the most popular optimization algorithms.
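The three variants differ only in how many examples feed each gradient step. A minimal sketch for a sigmoid unit with MSE loss, as in the earlier slides (function name and defaults are my own); batch_size=1 recovers stochastic GD and batch_size=len(X) recovers batch GD:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(w, X, t, lr=0.1, batch_size=4, epochs=10):
    """Mini-batch GD for a sigmoid unit under MSE."""
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]    # indices of one mini-batch
            o = sigmoid(X[b] @ w)
            grad = -(X[b].T @ ((t[b] - o) * o * (1 - o)))
            w = w - lr * grad
    return w
```

On a linearly separable toy problem, training with mini-batches drives the MSE well below its starting value.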
