
CS7015 (Deep Learning) : Lecture 3

Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks,


Representation Power of Feedforward Neural Networks

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Acknowledgements
• For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
• For Module 3.5, I have borrowed ideas from this excellent book* which is available online
• I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them

* http://neuralnetworksanddeeplearning.com/chap4.html
Module 3.1: Sigmoid Neuron

The story ahead ...
• Enough about boolean functions!
• What about arbitrary functions of the form y = f(x) where x ∈ Rⁿ (instead of {0, 1}ⁿ) and y ∈ R (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
• A perceptron will fire if the weighted sum of its inputs is greater than the threshold (−w0)
(figure: a perceptron with a single input x1 = criticsRating, weight w1 = 1, bias w0 = −0.5 and output y)

• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
• Consider that we base our decision only on one input (x1 = criticsRating, which lies between 0 and 1)
• If the threshold is 0.5 (w0 = −0.5) and w1 = 1 then what would be the decision for a movie with criticsRating = 0.51? (like)
• What about a movie with criticsRating = 0.49? (dislike)
• It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
(figure: the perceptron output y plotted against z = Σ_{i=1}^n wi xi — a step from 0 to 1 at the threshold −w0)

• This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
• It is a characteristic of the perceptron function itself which behaves like a step function
• There will always be this sudden change in the decision (from 0 to 1) when Σ_{i=1}^n wi xi crosses the threshold (−w0)
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
(figure: the sigmoid output y plotted against z = Σ_{i=1}^n wi xi — a smooth S-shaped curve from 0 to 1 centered around −w0)

• Introducing sigmoid neurons where the output function is much smoother than the step function
• Here is one form of the sigmoid function called the logistic function

  y = 1 / (1 + e^{−(w0 + Σ_{i=1}^n wi xi)})

• We no longer see a sharp transition around the threshold −w0
• Also the output y is no longer binary but a real value between 0 and 1 which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie
Perceptron vs Sigmoid (logistic) Neuron

(figure: both neurons take inputs x0 = 1, x1, x2, .., xn with weights w0 = −θ, w1, w2, .., wn and produce an output y)

Perceptron:
  y = 1 if Σ_{i=0}^n wi ∗ xi ≥ 0
    = 0 if Σ_{i=0}^n wi ∗ xi < 0

Sigmoid (logistic) neuron:
  y = 1 / (1 + e^{−(Σ_{i=0}^n wi xi)})
(figure: y plotted against z = Σ_{i=1}^n wi xi for both neurons, with the transition around −w0)

Perceptron: not smooth, not continuous (at −w0), not differentiable
Sigmoid neuron: smooth, continuous, differentiable
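To make the contrast concrete, here is a minimal Python sketch (only w1 = 1 and w0 = −0.5 come from the movie example above; the code itself is an illustration, not part of the original slides):

```python
import math

# Single-input movie example: x1 = criticsRating, weight w1 = 1 and
# bias w0 = -0.5, so the threshold is -w0 = 0.5.
w1, w0 = 1.0, -0.5

def perceptron(x1):
    # harsh thresholding: the output jumps from 0 to 1 at the threshold
    return 1 if w1 * x1 + w0 >= 0 else 0

def sigmoid_neuron(x1):
    # smooth logistic output: a real value between 0 and 1
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w0)))

for rating in (0.49, 0.51):
    print(rating, perceptron(rating), round(sigmoid_neuron(rating), 3))
# the perceptron flips from 0 to 1, while the sigmoid barely moves (≈ 0.498 vs ≈ 0.502)
```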
Module 3.2: A typical Supervised Machine Learning Setup

(figure: the sigmoid (logistic) neuron with inputs x0 = 1, x1, x2, .., xn, weights w0 = −θ, w1, w2, .., wn and output y)

• What next?
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does “cannot deal with” mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn’t seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup which has the following components...
• Data: {xi , yi }_{i=1}^n
• Model: Our approximation of the relation between x and y. For example,

  ŷ = 1 / (1 + e^{−(wᵀx)})

  or ŷ = wᵀx
  or ŷ = xᵀWx
  or just about any function

• Parameters: In all the above cases, w is a parameter which needs to be learned from the data
• Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
• Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
• Data: {xi = movie, yi = like/dislike}_{i=1}^n
• Model: Our approximation of the relation between x and y (the probability of liking a movie).

  ŷ = 1 / (1 + e^{−(wᵀx)})

• Parameter: w
• Learning algorithm: Gradient Descent [we will see soon]
• Objective/Loss/Error function: One possibility is

  L(w) = Σ_{i=1}^n (ŷi − yi)²

The learning algorithm should aim to find a w which minimizes the above function (squared error between y and ŷ)
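A minimal sketch of this setup in Python (the two data points below are made up for illustration; only the form of the model and the loss comes from the slide):

```python
import math

# Toy version of the setup above: data, a sigmoid model with parameter
# vector w, and the squared-error loss L(w) that learning should minimize.
data = [([1.0, 0.7], 1.0), ([1.0, 0.2], 0.0)]   # each x includes x0 = 1 for the bias term

def model(w, x):
    # y_hat = 1 / (1 + exp(-(w^T x)))
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def loss(w):
    # L(w) = sum_i (y_hat_i - y_i)^2
    return sum((model(w, x) - y) ** 2 for x, y in data)

print(loss([0.0, 0.0]))   # the learning algorithm searches for the w that makes this small
```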
Module 3.3: Learning Parameters: (Infeasible) guess work

(figure: a simplified sigmoid neuron — a single input x with weight w, bias b and output ŷ = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

• Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
• σ stands for the sigmoid function (logistic function in this case)
• For ease of explanation, we will consider a very simplified version of the model having just 1 input
• Further, to be consistent with the literature, from now on, we will refer to w0 as b (bias)
• Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating(y) given imdbRating(x) (for no particular reason)
(figure: x → σ → ŷ = f(x), with weight w and bias b, where f(x) = 1 / (1 + e^{−(w·x+b)}))

Input for training
  {xi , yi }_{i=1}^N → N pairs of (x, y)

Training objective
  Find w and b such that:

  minimize_{w,b}  L(w, b) = Σ_{i=1}^N (yi − f(xi))²
What does it mean to train the network?
• Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9)
• At the end of training we expect to find w*, b* such that:
• f(0.5) → 0.2 and f(2.5) → 0.9

In other words...
• We hope to find a sigmoid function such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid
Let us see this in more detail....

• Can we try to find such a w∗, b∗ manually?
• Let us try a random guess.. (say, w = 0.5, b = 0)
• Clearly not good, but how bad is it?
• Let us revisit L(w, b) to see how bad it is ...

  σ(x) = 1 / (1 + e^{−(wx+b)})
  L(w, b) = (1/2) ∗ Σ_{i=1}^N (yi − f(xi))²
          = (1/2) ∗ [(y1 − f(x1))² + (y2 − f(x2))²]
          = (1/2) ∗ [(0.9 − f(2.5))² + (0.2 − f(0.5))²]
          = 0.073

We want L(w, b) to be as close to 0 as possible

  σ(x) = 1 / (1 + e^{−(wx+b)})
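A quick numerical check of this value (a sketch: w = 0.5, b = 0 is the guess above and (0.5, 0.2), (2.5, 0.9) are the two training points):

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b, data):
    # L(w, b) = 1/2 * sum_i (y_i - f(x_i))^2
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
print(round(L(0.5, 0.0, data), 3))   # ≈ 0.073, the value computed above
```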
Let us try some other values of w, b

   w       b       L(w, b)
  0.50    0.00     0.0730
 -0.10    0.00     0.1481    (Oops!! this made things even worse...)
  0.94   -0.94     0.0214    (Perhaps it would help to push w and b in the other direction...)
  1.42   -1.73     0.0028    (Let us keep going in this direction, i.e., increase w and decrease b)
  1.65   -2.08     0.0003
  1.78   -2.27     0.0000

  σ(x) = 1 / (1 + e^{−(wx+b)})

With some guess work and intuition we were able to find the right values for w and b
Let us look at something better than our “guess work” algorithm....

• Since we have only 2 points and 2 parameters (w, b) we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum (see the sketch below)
• But of course this becomes intractable once you have many more data points and many more parameters !!
• Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)]
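A brute-force sketch of this idea (assuming the same two training points and the (−6, 6) range mentioned above; the grid resolution is an arbitrary choice):

```python
import numpy as np

xs = np.array([0.5, 2.5])
ys = np.array([0.2, 0.9])

def loss(w, b):
    f = 1.0 / (1.0 + np.exp(-(w * xs + b)))
    return 0.5 * np.sum((ys - f) ** 2)

# evaluate L(w, b) on a 121 x 121 grid over [-6, 6] x [-6, 6] and pick the minimum
grid = np.linspace(-6, 6, 121)
W, B = np.meshgrid(grid, grid)
L = np.vectorize(loss)(W, B)
i, j = np.unravel_index(np.argmin(L), L.shape)
print(W[i, j], B[i, j], L[i, j])   # roughly w ≈ 1.8, b ≈ -2.3 with L ≈ 0
```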
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface

Module 3.4: Learning Parameters : Gradient Descent

Now let us see if there is a more efficient and principled way of doing this
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
θ = [w, b] → vector of parameters, say, randomly initialized
∆θ = [∆w, ∆b] → change in the values of w, b

(figure: starting from θ we move to θnew along the direction ∆θ; being conservative, we move only by a small amount η · ∆θ)

We moved in the direction of ∆θ
Let us be a bit conservative: move only by a small amount η

  θnew = θ + η · ∆θ

Question: What is the right ∆θ to use?

The answer comes from Taylor series
For ease of notation, let ∆θ = u, then from Taylor series, we have,

  L(θ + ηu) = L(θ) + η ∗ uᵀ∇θL(θ) + (η²/2!) ∗ uᵀ∇²L(θ)u + (η³/3!) ∗ ... + (η⁴/4!) ∗ ...
            = L(θ) + η ∗ uᵀ∇θL(θ)    [η is typically small, so η², η³, .. → 0]

Note that the move (ηu) would be favorable only if,

  L(θ + ηu) − L(θ) < 0    [i.e., if the new loss is less than the previous loss]

This implies,

  uᵀ∇θL(θ) < 0
Okay, so we have,

  uᵀ∇θL(θ) < 0

But, what is the range of uᵀ∇θL(θ)? Let us see....

Let β be the angle between u and ∇θL(θ), then we know that,

  −1 ≤ cos(β) = uᵀ∇θL(θ) / (||u|| ∗ ||∇θL(θ)||) ≤ 1

multiply throughout by k = ||u|| ∗ ||∇θL(θ)||

  −k ≤ k ∗ cos(β) = uᵀ∇θL(θ) ≤ k

Thus, L(θ + ηu) − L(θ) = η ∗ uᵀ∇θL(θ) = η ∗ k ∗ cos(β) will be most negative when cos(β) = −1, i.e., when β is 180°
Gradient Descent Rule
• The direction u that we intend to move in should be at 180° w.r.t. the gradient
• In other words, move in a direction opposite to the gradient

Parameter Update Equations

  w_{t+1} = w_t − η∇w_t
  b_{t+1} = b_t − η∇b_t

  where, ∇w_t = ∂L(w, b)/∂w at w = w_t, b = b_t and ∇b_t = ∂L(w, b)/∂b at w = w_t, b = b_t

So we now have a more principled way of moving in the w-b plane than our “guess work” algorithm
• Let us create an algorithm from this rule ...

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  while t < max_iterations do
      w_{t+1} ← w_t − η∇w_t;
      b_{t+1} ← b_t − η∇b_t;
      t ← t + 1;
  end

• To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
(figure: x → σ → y = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

Let’s assume there is only 1 point to fit (x, y)

  L(w, b) = (1/2) ∗ (f(x) − y)²

  ∇w = ∂L(w, b)/∂w = ∂/∂w [ (1/2) ∗ (f(x) − y)² ]
  ∇w = ∂/∂w [ (1/2) ∗ (f(x) − y)² ]
     = (1/2) ∗ [ 2 ∗ (f(x) − y) ∗ ∂/∂w (f(x) − y) ]
     = (f(x) − y) ∗ ∂/∂w (f(x))
     = (f(x) − y) ∗ ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]

where,

  ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]
     = (−1 / (1 + e^{−(wx+b)})²) ∗ ∂/∂w (e^{−(wx+b)})
     = (−1 / (1 + e^{−(wx+b)})²) ∗ e^{−(wx+b)} ∗ ∂/∂w (−(wx + b))
     = (−1 / (1 + e^{−(wx+b)})) ∗ (e^{−(wx+b)} / (1 + e^{−(wx+b)})) ∗ (−x)
     = (1 / (1 + e^{−(wx+b)})) ∗ (e^{−(wx+b)} / (1 + e^{−(wx+b)})) ∗ (x)
     = f(x) ∗ (1 − f(x)) ∗ x

Hence,

  ∇w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x
(figure: x → σ → y = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

So if there is only 1 point (x, y), we have,

  ∇w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

For two points,

  ∇w = Σ_{i=1}^{2} (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi)) ∗ xi

  ∇b = Σ_{i=1}^{2} (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi))
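Putting the update rule and these gradients together gives a runnable sketch of gradient_descent() for the toy network (the two points are from the slides; the learning rate η = 1.0 and the 10000 iterations are assumed choices for this illustration):

```python
import math

data = [(0.5, 0.2), (2.5, 0.9)]
eta, max_iterations = 1.0, 10000

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w, b = 0.5, 0.0                        # the initial guess used earlier
for t in range(max_iterations):
    # gradients summed over the two training points, as derived above
    dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in data)
    db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in data)
    w, b = w - eta * dw, b - eta * db  # move opposite to the gradient

print(round(w, 2), round(b, 2))        # approaches w ≈ 1.78, b ≈ -2.27 (cf. the guess-work table)
```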
• Later on in the course we will look at gradient descent in much more detail and discuss its variants
• For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron
• So where do we head from here?
Module 3.5: Representation Power of a Multilayer Network of Sigmoid Neurons
Representation power of a multilayer network of perceptrons
  A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)

Representation power of a multilayer network of sigmoid neurons
  A multilayer network of neurons with a single hidden layer can be used to approximate any continuous function to any desired precision

  In other words, there is a guarantee that for any function f(x) : Rⁿ → Rᵐ, we can always find a neural network (with 1 hidden layer containing enough neurons) whose output g(x) satisfies |g(x) − f(x)| < ε !!

  Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]
• See this link* for an excellent illustration of this proof
• The discussion in the next few slides is based on the ideas presented at the above link

* http://neuralnetworksanddeeplearning.com/chap4.html
• We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)
• We observe that such an arbitrary function can be approximated by several “tower” functions
• More the number of such “tower” functions, better the approximation
• To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions
• We make a few observations
• All these “tower” functions are similar and only differ in their heights and positions on the x-axis
• Suppose there is a black box which takes the original input (x) and constructs these tower functions
• We can then have a simple network which can just add them up to approximate the function (see the sketch after this list)
• Our job now is to figure out what is inside this blackbox

(figure: the input x feeds several “Tower maker” boxes, whose outputs are summed by a + node to produce the approximation)
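For instance, a small sketch of this idea (the target function below is an arbitrary made-up example, and a “tower” is idealized here as a rectangular bump of some height at some position):

```python
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x            # some arbitrary function to approximate
xs = np.linspace(0.0, 3.0, 300, endpoint=False)

def tower(x, left, width, height):
    # a "tower": constant height over [left, left + width), zero elsewhere
    return height * ((x >= left) & (x < left + width))

# sum up n_towers towers, each matching the function at the middle of its interval
n_towers = 30
width = 3.0 / n_towers
approx = sum(tower(xs, k * width, width, f(k * width + width / 2)) for k in range(n_towers))

print(np.max(np.abs(approx - f(xs))))            # the gap shrinks as n_towers is increased
```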
We will figure this out over the next few slides ...

• If we take the logistic function and set w to a very high value we will recover the step function
• Let us see what happens as we change the value of w

  σ(x) = 1 / (1 + e^{−(wx+b)})

(figure: σ(x) plotted as w is increased from 0 to 41 in steps of 1, with b = 0)
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 28

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 29

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 30

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 31

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 32

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 33

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 34

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 35

43/61

43
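To make this concrete, here is a minimal NumPy sketch (the library choice and the sample points are my own, not part of the lecture) that evaluates σ(wx + b) and shows how a large w sharpens the transition while b shifts where it happens.

```python
import numpy as np

def sigmoid(x, w, b):
    # Logistic neuron with a single input: sigma(w*x + b)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(-1, 1, 5)      # [-1, -0.5, 0, 0.5, 1]
print(sigmoid(x, w=1,  b=0))   # gentle, almost linear transition
print(sigmoid(x, w=50, b=0))   # ~step function: jumps from ~0 to ~1 near x = 0
print(sigmoid(x, w=50, b=25))  # same steep step, but the jump now sits near x = -b/w = -0.5
```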
• Now let us see what we get by taking two such sigmoid functions (with different b’s)
and subtracting one from the other
• Voila! We have our tower function !!

44/61
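As a quick check, a hedged NumPy sketch of this subtraction (the bias values 25 and −25 are illustrative, not taken from the slides): the difference of two steep sigmoids is ≈ 1 on a narrow interval and ≈ 0 everywhere else, which is exactly what a hidden layer of two sigmoid neurons combined with output weights +1 and −1 would compute.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_1d(x, w=50.0, b1=25.0, b2=-25.0):
    # First sigmoid transitions near x = -b1/w = -0.5, second near x = -b2/w = 0.5;
    # subtracting them leaves a bump of height ~1 in between
    return sigmoid(w * x + b1) - sigmoid(w * x + b2)

x = np.array([-1.0, -0.4, 0.0, 0.4, 1.0])
print(tower_1d(x))    # ~[0, 1, 1, 1, 0]
```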
• Can we come up with a neural network to represent this operation of subtracting one
sigmoid function from another ?

45/61

46/61

• What if we have more than one input?
• Suppose we are trying to take a decision about whether we will find oil at a particular
location on the ocean bed (Yes/No)
• Further, suppose we base our decision on two factors: Salinity (x1) and Pressure (x2)
• We are given some data and it seems that y (oil/no-oil) is a complex function of x1 and x2
• We want a neural network to approximate this function

47/61
• This is what a 2-dimensional sigmoid looks like

y = 1 / (1 + e^(−(w1x1 + w2x2 + b)))

• We need to figure out how to get a tower in this case
• First, let us set w2 to 0 and see if we can get a two-dimensional step function

[Plot: w2 = 0, b = 0, with w1 increased from 2 to 24; the sigmoid sharpens into a step that varies along x1 only]

• What would happen if we change b ?

48/61
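A hedged NumPy sketch of the w2 = 0 case (the grid and parameter values are assumptions for illustration): since x2 no longer matters, the 2-dimensional sigmoid degenerates into a step that varies along x1 only.

```python
import numpy as np

def sigmoid2d(x1, x2, w1, w2, b):
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))

x1, x2 = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
y = sigmoid2d(x1, x2, w1=25.0, w2=0.0, b=0.0)
print(y.round(2))   # every row is identical: the surface depends on x1 alone
                    # and jumps from ~0 to ~1 as x1 crosses -b/w1 = 0
```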
[Plot: w1 = 25, w2 = 0, with b increased from 5 to 45; the position of the step shifts along the x1-axis]

49/61
• What if we take two such step functions (with different b values) and subtract one
from the other
• We still don’t get a tower (or we get a tower which is open from two sides)

50/61
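In code this looks as follows (a hedged sketch; the strip of x1 values between −0.5 and 0.5 comes from the illustrative biases, not from the slides): the difference of two x1-oriented steps is ≈ 1 inside a strip of x1 values for every x2, hence a tower that is open on two sides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def open_tower_x1(x1, x2, w=50.0, b1=25.0, b2=-25.0):
    # Both step functions have w2 = 0, so x2 plays no role in the result:
    # the difference is ~1 whenever -0.5 < x1 < 0.5, for any value of x2
    return sigmoid(w * x1 + b1) - sigmoid(w * x1 + b2)

print(open_tower_x1(np.array([0.0, 0.0, 1.0]),
                    np.array([-5.0, 5.0, 0.0])))   # ~[1, 1, 0]
```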
• Now let us set w1 to 0 and adjust w2 to get a 2-dimensional step function with a
different orientation

y = 1 / (1 + e^(−(w1x1 + w2x2 + b)))

[Plot: w1 = 0, b = 0, with w2 increased from 2 to 24; a step that varies along x2 only]

• And now we change b

51/61
[Plot: w1 = 0, w2 = 25, with b increased from 5 to 45; the position of this step shifts along the x2-axis]

52/61
• Again, what if we take two such step functions (with different b values) and subtract
one from the other
• We still don’t get a tower (or we get a tower which is open from two sides)
• Notice that this open tower has a different orientation from the previous one

53/61
• Now what will we get by adding two such open towers ?
• We get a tower standing on an elevated base
• We can now pass this output through another sigmoid neuron to get the desired tower !
• We can now approximate any function by summing up many such towers

54/61
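Putting the whole construction together, a hedged end-to-end sketch (all weights, biases and the final threshold of 1.5 are illustrative choices): add an x1-oriented open tower and an x2-oriented one, then pass the sum through one more steep sigmoid so that only the region where the two ridges overlap (sum ≈ 2) comes out as ≈ 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_2d(x1, x2, w=50.0, b1=25.0, b2=-25.0):
    # Four hidden sigmoid neurons: two form a ridge open along x2, two a ridge open along x1
    ridge_x1 = sigmoid(w * x1 + b1) - sigmoid(w * x1 + b2)
    ridge_x2 = sigmoid(w * x2 + b1) - sigmoid(w * x2 + b2)
    h = ridge_x1 + ridge_x2            # ~2 where the ridges overlap, ~1 on a single ridge, ~0 elsewhere
    return sigmoid(100.0 * (h - 1.5))  # output sigmoid keeps only the overlap region

print(tower_2d(0.0, 0.0))   # ~1 : inside the tower
print(tower_2d(0.0, 5.0))   # ~0 : on one ridge only
print(tower_2d(5.0, 5.0))   # ~0 : far from both ridges
```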
• For example, we could approximate the following function using a sum of several towers

55/61
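The "sum of towers" idea can be sketched as follows (everything here is an assumption for illustration: the target function sin(x1)·cos(x2), the grid of cell centres, and the cell width): place one tower over each grid cell and scale it by the function value at the cell centre.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(x1, x2, c1, c2, half=0.25, w=100.0):
    # A roughly rectangular bump of height ~1 centred at (c1, c2)
    r1 = sigmoid(w * (x1 - c1 + half)) - sigmoid(w * (x1 - c1 - half))
    r2 = sigmoid(w * (x2 - c2 + half)) - sigmoid(w * (x2 - c2 - half))
    return sigmoid(100.0 * (r1 + r2 - 1.5))

def f(x1, x2):                           # some target function (purely illustrative)
    return np.sin(x1) * np.cos(x2)

centres = np.arange(-2.0, 2.0, 0.5) + 0.25

def approx(x1, x2):
    # One scaled tower per grid cell
    return sum(f(c1, c2) * tower(x1, x2, c1, c2) for c1 in centres for c2 in centres)

print(f(0.3, -0.3), approx(0.3, -0.3))   # roughly equal: the sum of towers is a
                                         # piecewise-constant approximation of f
```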
• Can we come up with a neural network to represent this entire procedure of constructing
a 3-dimensional tower ?

56/61
57/61

58/61

Think
• For 1 dimensional input we needed 2 neurons to construct a tower
• For 2 dimensional input we needed 4 neurons to construct a tower
• How many neurons will you need to construct a tower in n dimensions ?

59/61

Time to retrospect
• Why do we care about approximating any arbitrary function ?
• Can we tie all this back to the classification problem that we have been dealing with ?

60/61

• We are interested in separating the blue points from the red points
• Suppose we use a single sigmoidal neuron to approximate the relation between
x = [x1, x2] and y
• Obviously, there will be errors (some blue points get classified as 1 and some red points
get classified as 0)
• This is what we actually want
• The illustrative proof that we just saw tells us that we can have a neural network with
two hidden layers which can approximate the above function by a sum of towers
• Which means we can have a neural network which can exactly separate the blue points
from the red points !!

61/61
