CS7015 (Deep Learning) : Lecture 3
Mitesh M. Khapra
Acknowledgements
• For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
• For Module 3.5, I have borrowed ideas from this excellent book* which is available online
• I am sure I have been influenced by and borrowed ideas from other sources as well, and I apologize if I have failed to acknowledge them

* http://neuralnetworksanddeeplearning.com/chap4.html
Module 3.1: Sigmoid Neuron
The story ahead ...
• Enough about boolean functions!
• What about arbitrary functions of the form y = f(x), where x ∈ R^n (instead of {0, 1}^n) and y ∈ R (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• Before answering the above question we will have to first graduate from perceptrons to sigmoid neurons ...
Recall
• A perceptron will fire if the weighted sum of its inputs is greater than the threshold (−w_0)
[Figure: a perceptron with output y, a single input x_1 with weight w_1 = 1, and bias w_0 = −0.5]

• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
[Figure: the perceptron output y plotted against z = ∑_{i=1}^{n} w_i x_i: a step function that jumps from 0 to 1 at the threshold (−w_0)]

• This behavior is not a characteristic of the specific problem we chose or of the specific weights and threshold that we chose
• It is a characteristic of the perceptron function itself
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1

[Figure: the logistic function plotted against z = ∑_{i=1}^{n} w_i x_i: a smooth S-shaped curve rising from 0 to 1]

• Introducing sigmoid neurons, where the output function is much smoother than the step function
• Here is one form of the sigmoid function, called the logistic function:

    y = 1 / (1 + e^{−(w_0 + ∑_{i=1}^{n} w_i x_i)})

• We no longer see a sharp transition around the threshold (−w_0)
• Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie
Perceptron vs. Sigmoid (logistic) Neuron

[Figure: both neurons take inputs x_0 = 1, x_1, x_2, .., x_n with weights w_0 = −θ, w_1, w_2, .., w_n]

    Perceptron:              y = 1 if ∑_{i=0}^{n} w_i x_i ≥ 0
                               = 0 if ∑_{i=0}^{n} w_i x_i < 0

    Sigmoid neuron:          y = 1 / (1 + e^{−(∑_{i=0}^{n} w_i x_i)})
Perceptron vs. Sigmoid Neuron

[Figure: the step function and the logistic function, both plotted against z = ∑_{i=1}^{n} w_i x_i]

    Perceptron:     Not smooth, not continuous (at the threshold −w_0), not differentiable
    Sigmoid neuron: Smooth, continuous, differentiable
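To make the contrast concrete, here is a minimal Python sketch. The weight w_1 = 1 and bias w_0 = −0.5 are taken from the figure above; the two ratings just either side of the threshold are made-up inputs for illustration:

```python
import math

def perceptron_output(weights, bias, inputs):
    """Perceptron: fires (outputs 1) iff the weighted sum crosses the threshold -w0."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= 0 else 0

def sigmoid_output(weights, bias, inputs):
    """Sigmoid (logistic) neuron: a smooth, differentiable version of the same decision."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs that differ only slightly around the threshold 0.5:
w, b = [1.0], -0.5
print(perceptron_output(w, b, [0.49]), perceptron_output(w, b, [0.51]))  # 0 1: harsh jump
print(sigmoid_output(w, b, [0.49]), sigmoid_output(w, b, [0.51]))        # nearly equal values
```

The perceptron flips from 0 to 1 across the threshold, while the sigmoid outputs barely change, which is the "smoothness" the slide is pointing at.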
Module 3.2: A typical Supervised Machine Learning Setup
Sigmoid (logistic) Neuron

[Figure: the sigmoid neuron with output y, inputs x_0 = 1, x_1, x_2, .., x_n and weights w_0 = −θ, w_1, w_2, .., w_n]

• What next?
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does “cannot deal with” mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn’t seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup which has the following components...
• Data: {x_i, y_i}_{i=1}^{n}
• Model: Our approximation of the relation between x and y. For example,

    ŷ = 1 / (1 + e^{−(wᵀx)})

  or ŷ = wᵀx, or ŷ = xᵀWx, or just about any function
• Parameters: In all the above cases, w is a parameter which needs to be learned from the data
• Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, the perceptron learning algorithm, gradient descent, etc.)
• Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
• Data: {x_i = movie, y_i = like/dislike}_{i=1}^{n}
• Model: Our approximation of the relation between x and y (the probability of liking a movie):

    ŷ = 1 / (1 + e^{−(wᵀx)})

• Parameter: w
• Learning algorithm: Gradient Descent [we will see soon]
• Objective/Loss/Error function: One possibility is

    L(w) = ∑_{i=1}^{n} (ŷ_i − y_i)²    (squared error between y and ŷ)

The learning algorithm should aim to find a w which minimizes the above function
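The model and loss above can be sketched in a few lines of Python. The two training pairs below are hypothetical feature vectors and labels, made up purely for illustration:

```python
import math

def model(w, x):
    """Logistic model: y_hat = 1 / (1 + e^{-(w . x)}), the probability of liking the movie."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, data):
    """Squared-error objective L(w) = sum_i (y_hat_i - y_i)^2 over the training pairs."""
    return sum((model(w, x) - y) ** 2 for x, y in data)

# Hypothetical movies: 2 made-up features each, with like (1.0) / dislike (0.0) labels.
data = [([0.8, 0.1], 1.0), ([0.2, 0.9], 0.0)]
print(loss([0.5, -0.5], data))  # a single number the learning algorithm should drive down
```

The learning algorithm's job is exactly to find the w that makes this number as small as possible.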
Module 3.3: Learning Parameters: (Infeasible) guess work
[Figure: the simplified model, a single input x fed through σ to produce ŷ = f(x) = 1 / (1 + e^{−(w·x+b)})]

• Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
• σ stands for the sigmoid function (the logistic function in this case)
• For ease of explanation, we will consider a very simplified version of the model having just 1 input
• Further, to be consistent with the literature, from now on we will refer to w_0 as b (bias)
• Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating (y) given imdbRating (x) (for no particular reason)
[Figure: the same model, x → σ → ŷ = f(x), with f(x) = 1 / (1 + e^{−(w·x+b)})]

Input for training
{x_i, y_i}_{i=1}^{N} → N pairs of (x, y)

Training objective
Find w and b such that:

    minimize over w, b:   L(w, b) = ∑_{i=1}^{N} (y_i − f(x_i))²

What does it mean to train the network?
• Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9)
• At the end of training we expect to find w*, b* such that:
• f(0.5) → 0.2 and f(2.5) → 0.9
In other words...
• We hope to find a sigmoid function such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid
Let us see this in more detail....
    σ(x) = 1 / (1 + e^{−(wx+b)})

• Can we try to find such a w*, b* manually?
• Let us try a random guess.. (say, w = 0.5, b = 0)
• Clearly not good, but how bad is it?
• Let us revisit L(w, b) to see how bad it is ...

    L(w, b) = (1/2) * ∑_{i=1}^{N} (y_i − f(x_i))²
            = (1/2) * [(y_1 − f(x_1))² + (y_2 − f(x_2))²]
            = (1/2) * [(0.9 − f(2.5))² + (0.2 − f(0.5))²]
            = 0.073
Let us try some other values of w, b

    w       b       L(w, b)
    0.50    0.00    0.0730
   -0.10    0.00    0.1481
    0.94   -0.94    0.0214
    1.42   -1.73    0.0028
    1.65   -2.08    0.0003
    1.78   -2.27    0.0000

Let us keep going in this direction, i.e., increase w and decrease b. With some guess work and intuition we were able to find the right values for w and b
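The table can be reproduced (approximately, since the tabulated w and b are rounded) with a few lines of Python, using the two training points (0.5, 0.2) and (2.5, 0.9) and the L(w, b) defined above:

```python
import math

# The two training points from the lecture.
DATA = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    """The sigmoid model f(x) = 1 / (1 + e^{-(wx+b)})."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b):
    """L(w, b) = (1/2) * sum_i (y_i - f(x_i))^2."""
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in DATA)

# Evaluate the guesses from the table above.
for w, b in [(0.50, 0.00), (-0.10, 0.00), (0.94, -0.94),
             (1.42, -1.73), (1.65, -2.08), (1.78, -2.27)]:
    print(f"w={w:5.2f}  b={b:5.2f}  L={L(w, b):.4f}")
```

Each guess in the direction "increase w, decrease b" drives the loss further down, just as in the table.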
Let us look at something better than our “guess work” algorithm....
• Since we have only 2 points and 2 parameters (w, b) we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum
• But of course this becomes intractable once you have many more data points and many more parameters!!
• Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)]
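A minimal brute-force version of this plot-and-pick idea, scanning a coarse grid over (w, b) in [−6, 6] (the grid resolution of 0.05 is an illustrative choice):

```python
import math

DATA = [(0.5, 0.2), (2.5, 0.9)]

def L(w, b):
    """L(w, b) = (1/2) * sum_i (y_i - f(x_i))^2 for the two training points."""
    f = lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
    return 0.5 * sum((y - f(x)) ** 2 for x, y in DATA)

# Scan a coarse grid over the same range as the plot; this is feasible only
# because we have just two parameters and two data points.
grid = [i * 0.05 for i in range(-120, 121)]   # w, b in [-6, 6]
best = min((L(w, b), w, b) for w in grid for b in grid)
print(best)  # lowest loss found and the (w, b) achieving it
```

Even on this tiny problem the scan needs ~58,000 loss evaluations, which hints at why the approach becomes intractable with more parameters.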
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface
Module 3.4: Learning Parameters : Gradient Descent
Now let us see if there is a more efficient and principled way of doing this
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
[Figure: in the (w, b) plane, an arrow from θ to θ_new along the direction ∆θ, with the shorter conservative step η · ∆θ marked]

θ = [w, b] → vector of parameters, say, randomly initialized
∆θ = [∆w, ∆b] → change in the values of w, b

We moved in the direction of ∆θ. Let us be a bit conservative: move only by a small amount η:

    θ_new = θ + η · ∆θ
For ease of notation, let ∆θ = u; then from the Taylor series we have,

    L(θ + ηu) = L(θ) + η * uᵀ∇_θ L(θ) + (η²/2!) * uᵀ∇²L(θ)u + (η³/3!) * ... + (η⁴/4!) * ...
              = L(θ) + η * uᵀ∇_θ L(θ)    [η is typically small, so η², η³, ... → 0]

The move (ηu) is favorable only if

    L(θ + ηu) − L(θ) < 0    [i.e., if the new loss is less than the previous loss]

This implies,

    uᵀ∇_θ L(θ) < 0
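We can sanity-check this first-order argument numerically on our toy problem: estimate the gradient by finite differences (no calculus needed yet), pick u opposite to it, and compare the actual change in loss with the Taylor estimate. The step η = 0.1 and starting point (0.5, 0) are illustrative choices:

```python
import math

DATA = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in DATA)

def numeric_grad(w, b, eps=1e-6):
    # central-difference estimates of dL/dw and dL/db
    gw = (L(w + eps, b) - L(w - eps, b)) / (2 * eps)
    gb = (L(w, b + eps) - L(w, b - eps)) / (2 * eps)
    return gw, gb

w, b = 0.5, 0.0
gw, gb = numeric_grad(w, b)
eta = 0.1
u = (-gw, -gb)                                      # u . grad < 0: opposite to the gradient
lhs = L(w + eta * u[0], b + eta * u[1]) - L(w, b)   # actual change in loss
rhs = eta * (u[0] * gw + u[1] * gb)                 # first-order Taylor estimate
print(lhs, rhs)  # both negative, and close to each other for this small eta
```

As predicted, moving against the gradient makes L(θ + ηu) − L(θ) negative, and the first-order term alone predicts the change well.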
Okay, so we have,

    uᵀ∇_θ L(θ) < 0

But what is the most negative this quantity can be? Let β be the angle between u and ∇_θ L(θ), and let k = ||u|| * ||∇_θ L(θ)||; then we know that,

    −1 ≤ cos(β) = uᵀ∇_θ L(θ) / (||u|| * ||∇_θ L(θ)||) ≤ 1

Multiplying throughout by k,

    −k ≤ k * cos(β) = uᵀ∇_θ L(θ) ≤ k

Thus, L(θ + ηu) − L(θ) = η * uᵀ∇_θ L(θ) = η * k * cos(β) is most negative when cos(β) = −1, i.e., when β is 180°
Gradient Descent Rule
• The direction u that we intend to move in should be at 180° w.r.t. the gradient
• In other words, move in a direction opposite to the gradient

    w_{t+1} = w_t − η∇w_t
    b_{t+1} = b_t − η∇b_t

    where ∇w_t = ∂L(w, b)/∂w evaluated at w = w_t, b = b_t, and ∇b_t = ∂L(w, b)/∂b evaluated at w = w_t, b = b_t

So we now have a more principled way of moving in the (w, b) plane than our “guess work” algorithm
• Let us create an algorithm from this rule ...

Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    while t < max_iterations do
        w_{t+1} ← w_t − η∇w_t;
        b_{t+1} ← b_t − η∇b_t;
        t ← t + 1;
    end

• To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
[Figure: the toy network x → σ → y = f(x), with f(x) = 1 / (1 + e^{−(w·x+b)})]
    ∇w = ∂/∂w [ (1/2) * (f(x) − y)² ]
       = (1/2) * [2 * (f(x) − y) * ∂/∂w (f(x) − y)]
       = (f(x) − y) * ∂/∂w (f(x))
       = (f(x) − y) * ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]

where,

    ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ] = (−1 / (1 + e^{−(wx+b)})²) * ∂/∂w (e^{−(wx+b)})
                                   = (−1 / (1 + e^{−(wx+b)})²) * e^{−(wx+b)} * ∂/∂w (−(wx + b))
                                   = (−1 / (1 + e^{−(wx+b)})) * (e^{−(wx+b)} / (1 + e^{−(wx+b)})) * (−x)
                                   = (1 / (1 + e^{−(wx+b)})) * (e^{−(wx+b)} / (1 + e^{−(wx+b)})) * (x)
                                   = f(x) * (1 − f(x)) * x

so that,

    ∇w = (f(x) − y) * f(x) * (1 − f(x)) * x
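The derived expression is easy to check against a finite-difference estimate of the same derivative; the point (x, y, w, b) below just reuses the lecture's training point and initial guess:

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def analytic_grad_w(x, y, w, b):
    # the result of the derivation: d/dw [(1/2)(f(x)-y)^2] = (f(x)-y) * f(x) * (1-f(x)) * x
    fx = f(x, w, b)
    return (fx - y) * fx * (1 - fx) * x

def numeric_grad_w(x, y, w, b, eps=1e-6):
    # central-difference estimate of the same derivative
    Lp = 0.5 * (f(x, w + eps, b) - y) ** 2
    Lm = 0.5 * (f(x, w - eps, b) - y) ** 2
    return (Lp - Lm) / (2 * eps)

print(analytic_grad_w(2.5, 0.9, 0.5, 0.0), numeric_grad_w(2.5, 0.9, 0.5, 0.0))  # agree closely
```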
[Figure: the toy network x → σ → y = f(x)]

For our two training points,

    ∇w = ∑_{i=1}^{2} (f(x_i) − y_i) * f(x_i) * (1 − f(x_i)) * x_i
    ∇b = ∑_{i=1}^{2} (f(x_i) − y_i) * f(x_i) * (1 − f(x_i))
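Putting the algorithm and the derived gradients together gives a runnable sketch. The learning rate η = 1.0, the iteration budget, and the initial guess (0.5, 0) are illustrative choices, not prescribed by the lecture:

```python
import math

# The two training points from the lecture.
DATA = [(0.5, 0.2), (2.5, 0.9)]

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in DATA)

def gradient_descent(w=0.5, b=0.0, eta=1.0, max_iterations=10000):
    """The gradient_descent() algorithm above, with the derived grad_w and grad_b."""
    for _ in range(max_iterations):
        dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in DATA)
        db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in DATA)
        w, b = w - eta * dw, b - eta * db
    return w, b

w, b = gradient_descent()
print(w, b, loss(w, b))  # heads towards the region w ~ 1.78, b ~ -2.27 found by guess work
```

Note how the updates automatically "increase w and decrease b", the very direction our guess work stumbled upon.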
• Later on in the course we will look at gradient descent in much more detail and discuss its variants
• For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron
• So where do we head from here ?
Module 3.5: Representation Power of a Multilayer Network of
Sigmoid Neurons
Representation power of a multilayer network of perceptrons vs. representation power of a multilayer network of sigmoid neurons

[Figure: side-by-side illustrations of the two kinds of networks]
• See this link? for an excellent illustration of this proof
• The discussion in the next few slides is based on the ideas presented at the above link
?
http://neuralnetworksanddeeplearning.com/chap4.html 39/61
39
• We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)
• We observe that such an arbitrary function can be approximated by several “tower” functions
• The more such “tower” functions we use, the better the approximation
• To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions
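The idea above can be made concrete with a short sketch. Here a “tower” is modelled as a rectangle function, and an arbitrary function is approximated by summing one tower per interval; the helpers `tower` and `tower_approximation` are illustrative names of my own, not from the lecture.

```python
import numpy as np

def tower(x, left, right, height):
    """A single "tower": equal to `height` on [left, right), 0 elsewhere."""
    return height * ((x >= left) & (x < right))

def tower_approximation(f, x, n_towers):
    """Approximate f over the range of x by a sum of n_towers towers,
    each tower's height taken as f evaluated at its left edge."""
    edges = np.linspace(x.min(), x.max(), n_towers + 1)
    approx = np.zeros_like(x, dtype=float)
    for left, right in zip(edges[:-1], edges[1:]):
        approx += tower(x, left, right, f(left))
    return approx

x = np.linspace(0, 2 * np.pi, 1000)
f = np.sin
# The approximation error shrinks as we add more towers
err_coarse = np.max(np.abs(tower_approximation(f, x, 10) - f(x)))
err_fine = np.max(np.abs(tower_approximation(f, x, 100) - f(x)))
```

With 100 towers the worst-case error on sin is bounded by the bin width, so it is an order of magnitude smaller than with 10 towers.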
• We make a few observations
• All these “tower” functions are similar and only differ in their heights and positions on the x-axis
• Suppose there is a black box which takes the original input (x) and constructs these tower functions
• We can then have a simple network which can just add them up to approximate the function
• Our job now is to figure out what is inside this black box

              +
          .  .  .
  Tower   Tower   . . .   Tower   Tower
  maker   maker           maker   maker
          .  .  .
              x
We will figure this out over the next few slides ...
• If we take the logistic function and set w to a very high value, we will recover the step function
• Let us see what happens as we change the value of w: as w increases (here from 0 up to 50), the curve becomes steeper and steeper, approaching a step function
• Further, we can adjust the value of b to control the position on the x-axis at which the function transitions from 0 to 1 (here b is varied from 1 to 35 with w = 50)

σ(x) = 1 / (1 + e^(−(wx + b)))
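The effect of w and b can be checked numerically. This is a minimal sketch: with a large w the sigmoid is nearly a step, and b moves the transition point to x = −b/w.

```python
import numpy as np

def sigmoid(x, w, b):
    """Output of a sigmoid neuron: sigma(w*x + b)."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# With a large w the sigmoid is nearly a step function:
# ~0 just left of the transition, ~1 just right of it.
step_like_left = sigmoid(-0.1, w=50, b=0)   # close to 0
step_like_right = sigmoid(0.1, w=50, b=0)   # close to 1

# The bias b shifts the transition point to x = -b/w,
# where the output is exactly 0.5.
midpoint = sigmoid(-20 / 50, w=50, b=20)
```

At x = −b/w the argument of the exponential is zero, so the output is exactly 0.5 regardless of how steep the curve is.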
• Now let us see what we get by taking two such sigmoid functions (with different b’s) and subtracting one from the other
• Voila! We have our tower function !!
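The subtraction can be verified directly. A minimal sketch: two steep sigmoids with the same w but different b’s step up at x = 0.4 and x = 0.6 respectively, and their difference is ~1 between those points and ~0 elsewhere (the specific values of w and b here are illustrative).

```python
import numpy as np

def sigmoid(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Two steep sigmoids with the same w but different b's:
# the first transitions at x = -b/w = 0.4, the second at 0.6.
w = 100.0
x = np.linspace(0, 1, 1001)
tower = sigmoid(x, w, -40) - sigmoid(x, w, -60)

# `tower` is ~1 between the two transition points, ~0 outside them
```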
• Can we come up with a neural network to represent this operation of subtracting one
sigmoid function from another ?
• What if we have more than one input?
• Suppose we are trying to take a decision about whether we will find oil at a particular location on the ocean bed (Yes/No)
• Further, suppose we base our decision on two factors: Salinity (x1) and Pressure (x2)
• We are given some data and it seems that y (oil/no-oil) is a complex function of x1 and x2
• We want a neural network to approximate this function
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
48/61
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
48/61
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 2, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 3, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 4, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 5, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 6, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 7, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 8, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 9, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 10, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 11, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 12, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 13, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 14, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 15, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 16, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 17, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 18, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 19, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 20, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 21, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 22, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 23, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
48/61
w1 = 24, w2 = 0, b = 0
48
• This is what a 2-dimensional sigmoid
1 looks like
y=
1 + e−(w1 x1 +w2 x2 +b)
• We need to figure out how to get a tower
in this case
• First, let us set w2 to 0 and see if we can
get a two dimensional step function
• What would happen if we change b ?
48/61
48
• This is what a 2-dimensional sigmoid looks like:
y = 1/(1 + e^−(w1 x1 + w2 x2 + b))
• What would happen if we change b ?
(w1 = 25, w2 = 0, with b increased from 5 to 45: the step shifts along x1)
49/61
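The role of b can also be verified numerically: with w2 = 0 the output equals 0.5 exactly where w1·x1 + b = 0, i.e. at x1 = −b/w1, so sweeping b slides the step along the x1 axis. A small sketch with illustrative values:

```python
import numpy as np

def sigmoid2d(x1, x2, w1, w2, b):
    """y = 1 / (1 + exp(-(w1*x1 + w2*x2 + b)))"""
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))

# With w1 = 25 and w2 = 0, the step sits where 25*x1 + b = 0,
# so each increase in b moves the step to a more negative x1.
for b in (5.0, 25.0, 45.0):
    x_step = -b / 25.0
    print(b, x_step, sigmoid2d(x_step, 0.0, 25.0, 0.0, b))
```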
• What if we take two such step functions (with different b values) and subtract one from the other ?
• We still don’t get a tower (or we get a tower which is open from two sides)
50/61
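The "tower open from two sides" can be sketched as the difference of two sharp 2-dimensional sigmoids; the steepness w = 100 and the band boundaries below are illustrative choices:

```python
import numpy as np

def sigmoid2d(x1, x2, w1, w2, b):
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))

def open_tower(x1, x2, w=100.0, b1=50.0, b2=-50.0):
    """Difference of two sharp steps along x1: ~1 only in the band
    -0.5 < x1 < 0.5, for EVERY x2 (hence open from two sides)."""
    return sigmoid2d(x1, x2, w, 0.0, b1) - sigmoid2d(x1, x2, w, 0.0, b2)

print(open_tower(0.0, 5.0))     # inside the band
print(open_tower(2.0, 5.0))     # outside the band
print(open_tower(0.0, -100.0))  # still high: the band extends along all of x2
```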
• Now let us set w1 to 0 and adjust w2 to get a 2-dimensional step function with a different orientation
y = 1/(1 + e^−(w1 x1 + w2 x2 + b))
(w1 = 0, b = 0, with w2 increased from 2 to 24: the sigmoid sharpens into a step along x2)
• And now we change b
51/61
• And now we change b
(w1 = 0, w2 = 25, with b increased from 5 to 45: the step shifts along x2)
52/61
• Again, what if we take two such step functions (with different b values) and subtract one from the other ?
• We still don’t get a tower (or we get a tower which is open from two sides)
• Notice that this open tower has a different orientation from the previous one
53/61
• Now what will we get by adding two such open towers ?
• We get a tower standing on an elevated base
• We can now pass this output through another sigmoid neuron to get the desired tower !
• We can now approximate any function by summing up many such towers
54/61
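The whole construction can be sketched in a few lines of numpy (all weights and thresholds below are illustrative choices, not values from the lecture): two open towers with perpendicular orientations are added, and a final sigmoid neuron thresholds the sum between 1 and 2 so that only the region where both bands overlap survives.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def band(x, w=100.0, half_width=0.5):
    """Open tower: ~1 for |x| < half_width, ~0 outside
    (difference of two sharp steps along one axis)."""
    return sigmoid(w * (x + half_width)) - sigmoid(w * (x - half_width))

def tower(x1, x2):
    # Sum of two open towers with different orientations:
    # ~2 inside the square, ~1 on each open ridge, ~0 elsewhere.
    h = band(x1) + band(x2)
    # One more sigmoid neuron thresholds the sum between 1 and 2,
    # keeping only the closed tower over the square.
    return sigmoid(50.0 * (h - 1.5))

print(round(tower(0.0, 0.0)))  # inside the tower -> 1
print(round(tower(2.0, 0.0)))  # on an open ridge only -> 0
```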
• For example, we could approximate the following function using a sum of several towers
55/61
• Can we come up with a neural network to represent this entire procedure of constructing a 3-dimensional tower ?
56/61
57/61
58/61
Think
• For 1-dimensional input we needed 2 neurons to construct a tower
• For 2-dimensional input we needed 4 neurons to construct a tower
• How many neurons will you need to construct a tower in n dimensions ?
59/61
Time to retrospect
• Why do we care about approximating any arbitrary function ?
• Can we tie all this back to the classification problem that we have been dealing with ?
60/61
• We are interested in separating the blue points from the red points
• Suppose we use a single sigmoidal neuron to approximate the relation between x = [x1 , x2 ] and y
• Obviously, there will be errors (some blue points get classified as 1 and some red points get classified as 0)
• This is what we actually want
• The illustrative proof that we just saw tells us that we can have a neural network with two hidden layers which can approximate the above function by a sum of towers
• Which means we can have a neural network which can exactly separate the blue points from the red points !!
61/61