
CS7015 (Deep Learning) : Lecture 3

Sigmoid Neurons, Gradient Descent, Feedforward Neural Networks,


Representation Power of Feedforward Neural Networks

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Acknowledgements
• For Module 3.4, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube)
• For Module 3.5, I have borrowed ideas from this excellent book* which is available online
• I am sure I would have been influenced and borrowed ideas from other sources and I apologize if I have failed to acknowledge them

* http://neuralnetworksanddeeplearning.com/chap4.html
Module 3.1: Sigmoid Neuron

The story ahead ...
• Enough about boolean functions!
• What about arbitrary functions of the form y = f(x) where x ∈ Rⁿ (instead of {0, 1}ⁿ) and y ∈ R (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• Before answering the above question we will have to first graduate from perceptrons to sigmoidal neurons ...
Recall
• A perceptron will fire if the weighted sum of its inputs is greater than the threshold (−w0)
(figure: a perceptron with a single input x1 = criticsRating, weight w1 = 1, bias w0 = −0.5 and output y)

• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
• Consider that we base our decision only on one input (x1 = criticsRating, which lies between 0 and 1)
• If the threshold is 0.5 (w0 = −0.5) and w1 = 1 then what would be the decision for a movie with criticsRating = 0.51? (like)
• What about a movie with criticsRating = 0.49? (dislike)
• It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49
(figure: the perceptron output y plotted against z = Σ_{i=1}^n wi xi — a step from 0 to 1 at the threshold −w0)

• This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
• It is a characteristic of the perceptron function itself which behaves like a step function
• There will always be this sudden change in the decision (from 0 to 1) when Σ_{i=1}^n wi xi crosses the threshold (−w0)
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
(figure: the sigmoid output y plotted against z = Σ_{i=1}^n wi xi — a smooth S-shaped curve from 0 to 1 centered around −w0)

• Introducing sigmoid neurons where the output function is much smoother than the step function
• Here is one form of the sigmoid function called the logistic function

  y = 1 / (1 + e^{−(w0 + Σ_{i=1}^n wi xi)})

• We no longer see a sharp transition around the threshold −w0
• Also the output y is no longer binary but a real value between 0 and 1 which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie
Perceptron vs Sigmoid (logistic) Neuron

(figure: both neurons take inputs x0 = 1, x1, x2, .., xn with weights w0 = −θ, w1, w2, .., wn and produce an output y)

Perceptron:
  y = 1 if Σ_{i=0}^n wi ∗ xi ≥ 0
    = 0 if Σ_{i=0}^n wi ∗ xi < 0

Sigmoid (logistic) neuron:
  y = 1 / (1 + e^{−(Σ_{i=0}^n wi xi)})
(figure: y plotted against z = Σ_{i=1}^n wi xi for both neurons, with the transition around −w0)

Perceptron: not smooth, not continuous (at −w0), not differentiable
Sigmoid neuron: smooth, continuous, differentiable
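To make the contrast concrete, here is a minimal Python sketch (only w1 = 1 and w0 = −0.5 come from the movie example above; the code itself is an illustration, not part of the original slides):

```python
import math

# Single-input movie example: x1 = criticsRating, weight w1 = 1 and
# bias w0 = -0.5, so the threshold is -w0 = 0.5.
w1, w0 = 1.0, -0.5

def perceptron(x1):
    # harsh thresholding: the output jumps from 0 to 1 at the threshold
    return 1 if w1 * x1 + w0 >= 0 else 0

def sigmoid_neuron(x1):
    # smooth logistic output: a real value between 0 and 1
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w0)))

for rating in (0.49, 0.51):
    print(rating, perceptron(rating), round(sigmoid_neuron(rating), 3))
# the perceptron flips from 0 to 1, while the sigmoid barely moves (≈ 0.498 vs ≈ 0.502)
```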
Module 3.2: A typical Supervised Machine Learning Setup

(figure: the sigmoid (logistic) neuron with inputs x0 = 1, x1, x2, .., xn, weights w0 = −θ, w1, w2, .., wn and output y)

• What next?
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does “cannot deal with” mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn’t seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup which has the following components...
• Data: {xi , yi }_{i=1}^n
• Model: Our approximation of the relation between x and y. For example,

  ŷ = 1 / (1 + e^{−(wᵀx)})

  or ŷ = wᵀx
  or ŷ = xᵀWx
  or just about any function

• Parameters: In all the above cases, w is a parameter which needs to be learned from the data
• Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, perceptron learning algorithm, gradient descent, etc.)
• Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
As an illustration, consider our movie example
• Data: {xi = movie, yi = like/dislike}_{i=1}^n
• Model: Our approximation of the relation between x and y (the probability of liking a movie).

  ŷ = 1 / (1 + e^{−(wᵀx)})

• Parameter: w
• Learning algorithm: Gradient Descent [we will see soon]
• Objective/Loss/Error function: One possibility is

  L(w) = Σ_{i=1}^n (ŷi − yi)²

The learning algorithm should aim to find a w which minimizes the above function (squared error between y and ŷ)
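A minimal sketch of this setup in Python (the two data points below are made up for illustration; only the form of the model and the loss comes from the slide):

```python
import math

# Toy version of the setup above: data, a sigmoid model with parameter
# vector w, and the squared-error loss L(w) that learning should minimize.
data = [([1.0, 0.7], 1.0), ([1.0, 0.2], 0.0)]   # each x includes x0 = 1 for the bias term

def model(w, x):
    # y_hat = 1 / (1 + exp(-(w^T x)))
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def loss(w):
    # L(w) = sum_i (y_hat_i - y_i)^2
    return sum((model(w, x) - y) ** 2 for x, y in data)

print(loss([0.0, 0.0]))   # the learning algorithm searches for the w that makes this small
```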
Module 3.3: Learning Parameters: (Infeasible) guess work

(figure: a simplified sigmoid neuron — a single input x with weight w, bias b and output ŷ = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

• Keeping this supervised ML setup in mind, we will now focus on this model and discuss an algorithm for learning the parameters of this model from some given data using an appropriate objective function
• σ stands for the sigmoid function (logistic function in this case)
• For ease of explanation, we will consider a very simplified version of the model having just 1 input
• Further, to be consistent with the literature, from now on, we will refer to w0 as b (bias)
• Lastly, instead of considering the problem of predicting like/dislike, we will assume that we want to predict criticsRating(y) given imdbRating(x) (for no particular reason)
(figure: x → σ → ŷ = f(x), with weight w and bias b, where f(x) = 1 / (1 + e^{−(w·x+b)}))

Input for training
  {xi , yi }_{i=1}^N → N pairs of (x, y)

Training objective
  Find w and b such that:

  minimize_{w,b}  L(w, b) = Σ_{i=1}^N (yi − f(xi))²
What does it mean to train the network?
• Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9)
• At the end of training we expect to find w*, b* such that:
• f(0.5) → 0.2 and f(2.5) → 0.9

In other words...
• We hope to find a sigmoid function such that (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid
Let us see this in more detail....

• Can we try to find such a w∗, b∗ manually?
• Let us try a random guess.. (say, w = 0.5, b = 0)
• Clearly not good, but how bad is it?
• Let us revisit L(w, b) to see how bad it is ...

  σ(x) = 1 / (1 + e^{−(wx+b)})
  L(w, b) = (1/2) ∗ Σ_{i=1}^N (yi − f(xi))²
          = (1/2) ∗ [(y1 − f(x1))² + (y2 − f(x2))²]
          = (1/2) ∗ [(0.9 − f(2.5))² + (0.2 − f(0.5))²]
          = 0.073

We want L(w, b) to be as close to 0 as possible

  σ(x) = 1 / (1 + e^{−(wx+b)})
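A quick numerical check of this value (a sketch: w = 0.5, b = 0 is the guess above and (0.5, 0.2), (2.5, 0.9) are the two training points):

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def L(w, b, data):
    # L(w, b) = 1/2 * sum_i (y_i - f(x_i))^2
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
print(round(L(0.5, 0.0, data), 3))   # ≈ 0.073, the value computed above
```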
Let us try some other values of w, b

   w       b       L(w, b)
  0.50    0.00     0.0730
 -0.10    0.00     0.1481    (Oops!! this made things even worse...)
  0.94   -0.94     0.0214    (Perhaps it would help to push w and b in the other direction...)
  1.42   -1.73     0.0028    (Let us keep going in this direction, i.e., increase w and decrease b)
  1.65   -2.08     0.0003
  1.78   -2.27     0.0000

  σ(x) = 1 / (1 + e^{−(wx+b)})

With some guess work and intuition we were able to find the right values for w and b
Let us look at something better than our “guess work” algorithm....

• Since we have only 2 points and 2 parameters (w, b) we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum (see the sketch below)
• But of course this becomes intractable once you have many more data points and many more parameters !!
• Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)]
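A brute-force sketch of this idea (assuming the same two training points and the (−6, 6) range mentioned above; the grid resolution is an arbitrary choice):

```python
import numpy as np

xs = np.array([0.5, 2.5])
ys = np.array([0.2, 0.9])

def loss(w, b):
    f = 1.0 / (1.0 + np.exp(-(w * xs + b)))
    return 0.5 * np.sum((ys - f) ** 2)

# evaluate L(w, b) on a 121 x 121 grid over [-6, 6] x [-6, 6] and pick the minimum
grid = np.linspace(-6, 6, 121)
W, B = np.meshgrid(grid, grid)
L = np.vectorize(loss)(W, B)
i, j = np.unravel_index(np.argmin(L), L.shape)
print(W[i, j], B[i, j], L[i, j])   # roughly w ≈ 1.8, b ≈ -2.3 with L ≈ 0
```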
Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface

Module 3.4: Learning Parameters : Gradient Descent

Now let us see if there is a more efficient and principled way of doing this
Goal
Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!
θ = [w, b] → vector of parameters, say, randomly initialized
∆θ = [∆w, ∆b] → change in the values of w, b

(figure: starting from θ we move to θnew along the direction ∆θ; being conservative, we move only by a small amount η · ∆θ)

We moved in the direction of ∆θ
Let us be a bit conservative: move only by a small amount η

  θnew = θ + η · ∆θ

Question: What is the right ∆θ to use?

The answer comes from Taylor series
For ease of notation, let ∆θ = u, then from Taylor series, we have,

  L(θ + ηu) = L(θ) + η ∗ uᵀ∇θL(θ) + (η²/2!) ∗ uᵀ∇²L(θ)u + (η³/3!) ∗ ... + (η⁴/4!) ∗ ...
            = L(θ) + η ∗ uᵀ∇θL(θ)    [η is typically small, so η², η³, .. → 0]

Note that the move (ηu) would be favorable only if,

  L(θ + ηu) − L(θ) < 0    [i.e., if the new loss is less than the previous loss]

This implies,

  uᵀ∇θL(θ) < 0
Okay, so we have,

  uᵀ∇θL(θ) < 0

But, what is the range of uᵀ∇θL(θ)? Let us see....

Let β be the angle between u and ∇θL(θ), then we know that,

  −1 ≤ cos(β) = uᵀ∇θL(θ) / (||u|| ∗ ||∇θL(θ)||) ≤ 1

multiply throughout by k = ||u|| ∗ ||∇θL(θ)||

  −k ≤ k ∗ cos(β) = uᵀ∇θL(θ) ≤ k

Thus, L(θ + ηu) − L(θ) = η ∗ uᵀ∇θL(θ) = η ∗ k ∗ cos(β) will be most negative when cos(β) = −1, i.e., when β is 180°
Gradient Descent Rule
• The direction u that we intend to move in should be at 180° w.r.t. the gradient
• In other words, move in a direction opposite to the gradient

Parameter Update Equations

  w_{t+1} = w_t − η∇w_t
  b_{t+1} = b_t − η∇b_t

  where, ∇w_t = ∂L(w, b)/∂w at w = w_t, b = b_t and ∇b_t = ∂L(w, b)/∂b at w = w_t, b = b_t

So we now have a more principled way of moving in the w-b plane than our “guess work” algorithm
• Let us create an algorithm from this rule ...

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  while t < max_iterations do
      w_{t+1} ← w_t − η∇w_t;
      b_{t+1} ← b_t − η∇b_t;
      t ← t + 1;
  end

• To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network
(figure: x → σ → y = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

Let’s assume there is only 1 point to fit (x, y)

  L(w, b) = (1/2) ∗ (f(x) − y)²

  ∇w = ∂L(w, b)/∂w = ∂/∂w [ (1/2) ∗ (f(x) − y)² ]
  ∇w = ∂/∂w [ (1/2) ∗ (f(x) − y)² ]
     = (1/2) ∗ [ 2 ∗ (f(x) − y) ∗ ∂/∂w (f(x) − y) ]
     = (f(x) − y) ∗ ∂/∂w (f(x))
     = (f(x) − y) ∗ ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]

where,

  ∂/∂w [ 1 / (1 + e^{−(wx+b)}) ]
     = (−1 / (1 + e^{−(wx+b)})²) ∗ ∂/∂w (e^{−(wx+b)})
     = (−1 / (1 + e^{−(wx+b)})²) ∗ e^{−(wx+b)} ∗ ∂/∂w (−(wx + b))
     = (−1 / (1 + e^{−(wx+b)})) ∗ (e^{−(wx+b)} / (1 + e^{−(wx+b)})) ∗ (−x)
     = (1 / (1 + e^{−(wx+b)})) ∗ (e^{−(wx+b)} / (1 + e^{−(wx+b)})) ∗ (x)
     = f(x) ∗ (1 − f(x)) ∗ x

Hence,

  ∇w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x
(figure: x → σ → y = f(x), where f(x) = 1 / (1 + e^{−(w·x+b)}))

So if there is only 1 point (x, y), we have,

  ∇w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

For two points,

  ∇w = Σ_{i=1}^{2} (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi)) ∗ xi

  ∇b = Σ_{i=1}^{2} (f(xi) − yi) ∗ f(xi) ∗ (1 − f(xi))
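Putting the update rule and these gradients together gives a runnable sketch of gradient_descent() for the toy network (the two points are from the slides; the learning rate η = 1.0 and the 10000 iterations are assumed choices for this illustration):

```python
import math

data = [(0.5, 0.2), (2.5, 0.9)]
eta, max_iterations = 1.0, 10000

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w, b = 0.5, 0.0                        # the initial guess used earlier
for t in range(max_iterations):
    # gradients summed over the two training points, as derived above
    dw = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in data)
    db = sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in data)
    w, b = w - eta * dw, b - eta * db  # move opposite to the gradient

print(round(w, 2), round(b, 2))        # approaches w ≈ 1.78, b ≈ -2.27 (cf. the guess-work table)
```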
• Later on in the course we will look at gradient descent in much more detail and discuss its variants
• For the time being it suffices to know that we have an algorithm for learning the parameters of a sigmoid neuron
• So where do we head from here?
Module 3.5: Representation Power of a Multilayer Network of Sigmoid Neurons
Representation power of a multilayer network of perceptrons
  A multilayer network of perceptrons with a single hidden layer can be used to represent any boolean function precisely (no errors)

Representation power of a multilayer network of sigmoid neurons
  A multilayer network of neurons with a single hidden layer can be used to approximate any continuous function to any desired precision

  In other words, there is a guarantee that for any function f(x) : Rⁿ → Rᵐ, we can always find a neural network (with 1 hidden layer containing enough neurons) whose output g(x) satisfies |g(x) − f(x)| < ε !!

  Proof: We will see an illustrative proof of this... [Cybenko, 1989], [Hornik, 1991]
• See this link* for an excellent illustration of this proof
• The discussion in the next few slides is based on the ideas presented at the above link

* http://neuralnetworksanddeeplearning.com/chap4.html
• We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure)
• We observe that such an arbitrary function can be approximated by several “tower” functions
• More the number of such “tower” functions, better the approximation
• To be more precise, we can approximate any arbitrary function by a sum of such “tower” functions
• We make a few observations
• All these “tower” functions are similar and only differ in their heights and positions on the x-axis
• Suppose there is a black box which takes the original input (x) and constructs these tower functions
• We can then have a simple network which can just add them up to approximate the function (see the sketch after this list)
• Our job now is to figure out what is inside this blackbox

(figure: the input x feeds several “Tower maker” boxes, whose outputs are summed by a + node to produce the approximation)
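For instance, a small sketch of this idea (the target function below is an arbitrary made-up example, and a “tower” is idealized here as a rectangular bump of some height at some position):

```python
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x            # some arbitrary function to approximate
xs = np.linspace(0.0, 3.0, 300, endpoint=False)

def tower(x, left, width, height):
    # a "tower": constant height over [left, left + width), zero elsewhere
    return height * ((x >= left) & (x < left + width))

# sum up n_towers towers, each matching the function at the middle of its interval
n_towers = 30
width = 3.0 / n_towers
approx = sum(tower(xs, k * width, width, f(k * width + width / 2)) for k in range(n_towers))

print(np.max(np.abs(approx - f(xs))))            # the gap shrinks as n_towers is increased
```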
We will figure this out over the next few slides ...

• If we take the logistic function and set w to a very high value we will recover the step function
• Let us see what happens as we change the value of w

  σ(x) = 1 / (1 + e^{−(wx+b)})

(figure: σ(x) plotted as w is increased from 0 to 41 in steps of 1, with b = 0)
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 28

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 29

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 30

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 31

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 32

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 33

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 34

43/61

43
• If we take the logistic function and set w
to a very high value we will recover the
step function
• Let us see what happens as we change
the value of w
• Further we can adjust the value of b
to control the position on the x-axis at
which the function transitions from 0 to
1

1
σ(x) = 1+e−(wx+b)
w = 50, b = 35

43/61

43
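To make this concrete, here is a minimal NumPy sketch (the library choice and the sample points are my own, not part of the lecture) that evaluates σ(wx + b) and shows how a large w sharpens the transition while b shifts where it happens.

```python
import numpy as np

def sigmoid(x, w, b):
    # Logistic neuron with a single input: sigma(w*x + b)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(-1, 1, 5)      # [-1, -0.5, 0, 0.5, 1]
print(sigmoid(x, w=1,  b=0))   # gentle, almost linear transition
print(sigmoid(x, w=50, b=0))   # ~step function: jumps from ~0 to ~1 near x = 0
print(sigmoid(x, w=50, b=25))  # same steep step, but the jump now sits near x = -b/w = -0.5
```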
• Now let us see what we get by taking two such sigmoid functions (with different b’s)
and subtracting one from the other
• Voila! We have our tower function !!

44/61
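As a quick check, a hedged NumPy sketch of this subtraction (the bias values 25 and −25 are illustrative, not taken from the slides): the difference of two steep sigmoids is ≈ 1 on a narrow interval and ≈ 0 everywhere else, which is exactly what a hidden layer of two sigmoid neurons combined with output weights +1 and −1 would compute.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_1d(x, w=50.0, b1=25.0, b2=-25.0):
    # First sigmoid transitions near x = -b1/w = -0.5, second near x = -b2/w = 0.5;
    # subtracting them leaves a bump of height ~1 in between
    return sigmoid(w * x + b1) - sigmoid(w * x + b2)

x = np.array([-1.0, -0.4, 0.0, 0.4, 1.0])
print(tower_1d(x))    # ~[0, 1, 1, 1, 0]
```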
• Can we come up with a neural network to represent this operation of subtracting one
sigmoid function from another ?

45/61

46/61

• What if we have more than one input?
• Suppose we are trying to take a decision about whether we will find oil at a particular
location on the ocean bed (Yes/No)
• Further, suppose we base our decision on two factors: Salinity (x1) and Pressure (x2)
• We are given some data and it seems that y (oil/no-oil) is a complex function of x1 and x2
• We want a neural network to approximate this function

47/61
• This is what a 2-dimensional sigmoid looks like

y = 1 / (1 + e^(−(w1x1 + w2x2 + b)))

• We need to figure out how to get a tower in this case
• First, let us set w2 to 0 and see if we can get a two-dimensional step function

[Plot: w2 = 0, b = 0, with w1 increased from 2 to 24; the sigmoid sharpens into a step that varies along x1 only]

• What would happen if we change b ?

48/61
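A hedged NumPy sketch of the w2 = 0 case (the grid and parameter values are assumptions for illustration): since x2 no longer matters, the 2-dimensional sigmoid degenerates into a step that varies along x1 only.

```python
import numpy as np

def sigmoid2d(x1, x2, w1, w2, b):
    return 1.0 / (1.0 + np.exp(-(w1 * x1 + w2 * x2 + b)))

x1, x2 = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
y = sigmoid2d(x1, x2, w1=25.0, w2=0.0, b=0.0)
print(y.round(2))   # every row is identical: the surface depends on x1 alone
                    # and jumps from ~0 to ~1 as x1 crosses -b/w1 = 0
```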
[Plot: w1 = 25, w2 = 0, with b increased from 5 to 45; the position of the step shifts along the x1-axis]

49/61
• What if we take two such step functions (with different b values) and subtract one
from the other
• We still don’t get a tower (or we get a tower which is open from two sides)

50/61
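In code this looks as follows (a hedged sketch; the strip of x1 values between −0.5 and 0.5 comes from the illustrative biases, not from the slides): the difference of two x1-oriented steps is ≈ 1 inside a strip of x1 values for every x2, hence a tower that is open on two sides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def open_tower_x1(x1, x2, w=50.0, b1=25.0, b2=-25.0):
    # Both step functions have w2 = 0, so x2 plays no role in the result:
    # the difference is ~1 whenever -0.5 < x1 < 0.5, for any value of x2
    return sigmoid(w * x1 + b1) - sigmoid(w * x1 + b2)

print(open_tower_x1(np.array([0.0, 0.0, 1.0]),
                    np.array([-5.0, 5.0, 0.0])))   # ~[1, 1, 0]
```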
• Now let us set w1 to 0 and adjust w2 to get a 2-dimensional step function with a
different orientation

y = 1 / (1 + e^(−(w1x1 + w2x2 + b)))

[Plot: w1 = 0, b = 0, with w2 increased from 2 to 24; a step that varies along x2 only]

• And now we change b

51/61
[Plot: w1 = 0, w2 = 25, with b increased from 5 to 45; the position of this step shifts along the x2-axis]

52/61
• Again, what if we take two such step functions (with different b values) and subtract
one from the other
• We still don’t get a tower (or we get a tower which is open from two sides)
• Notice that this open tower has a different orientation from the previous one

53/61
• Now what will we get by adding two such open towers ?
• We get a tower standing on an elevated base
• We can now pass this output through another sigmoid neuron to get the desired tower !
• We can now approximate any function by summing up many such towers

54/61
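Putting the whole construction together, a hedged end-to-end sketch (all weights, biases and the final threshold of 1.5 are illustrative choices): add an x1-oriented open tower and an x2-oriented one, then pass the sum through one more steep sigmoid so that only the region where the two ridges overlap (sum ≈ 2) comes out as ≈ 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_2d(x1, x2, w=50.0, b1=25.0, b2=-25.0):
    # Four hidden sigmoid neurons: two form a ridge open along x2, two a ridge open along x1
    ridge_x1 = sigmoid(w * x1 + b1) - sigmoid(w * x1 + b2)
    ridge_x2 = sigmoid(w * x2 + b1) - sigmoid(w * x2 + b2)
    h = ridge_x1 + ridge_x2            # ~2 where the ridges overlap, ~1 on a single ridge, ~0 elsewhere
    return sigmoid(100.0 * (h - 1.5))  # output sigmoid keeps only the overlap region

print(tower_2d(0.0, 0.0))   # ~1 : inside the tower
print(tower_2d(0.0, 5.0))   # ~0 : on one ridge only
print(tower_2d(5.0, 5.0))   # ~0 : far from both ridges
```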
• For example, we could approximate the following function using a sum of several towers

55/61
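The "sum of towers" idea can be sketched as follows (everything here is an assumption for illustration: the target function sin(x1)·cos(x2), the grid of cell centres, and the cell width): place one tower over each grid cell and scale it by the function value at the cell centre.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(x1, x2, c1, c2, half=0.25, w=100.0):
    # A roughly rectangular bump of height ~1 centred at (c1, c2)
    r1 = sigmoid(w * (x1 - c1 + half)) - sigmoid(w * (x1 - c1 - half))
    r2 = sigmoid(w * (x2 - c2 + half)) - sigmoid(w * (x2 - c2 - half))
    return sigmoid(100.0 * (r1 + r2 - 1.5))

def f(x1, x2):                           # some target function (purely illustrative)
    return np.sin(x1) * np.cos(x2)

centres = np.arange(-2.0, 2.0, 0.5) + 0.25

def approx(x1, x2):
    # One scaled tower per grid cell
    return sum(f(c1, c2) * tower(x1, x2, c1, c2) for c1 in centres for c2 in centres)

print(f(0.3, -0.3), approx(0.3, -0.3))   # roughly equal: the sum of towers is a
                                         # piecewise-constant approximation of f
```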
• Can we come up with a neural network to represent this entire procedure of constructing
a 3-dimensional tower ?

56/61
57/61

58/61

Think
• For 1 dimensional input we needed 2 neurons to construct a tower
• For 2 dimensional input we needed 4 neurons to construct a tower
• How many neurons will you need to construct a tower in n dimensions ?

59/61

Time to retrospect
• Why do we care about approximating any arbitrary function ?
• Can we tie all this back to the classification problem that we have been dealing with ?

60/61

• We are interested in separating the blue points from the red points
• Suppose we use a single sigmoidal neuron to approximate the relation between
x = [x1, x2] and y
• Obviously, there will be errors (some blue points get classified as 1 and some red points
get classified as 0)
• This is what we actually want
• The illustrative proof that we just saw tells us that we can have a neural network with
two hidden layers which can approximate the above function by a sum of towers
• Which means we can have a neural network which can exactly separate the blue points
from the red points !!

61/61
