
Artificial Neural Networks

Dr. Syed M. Usman

1
Contents

• What is ANN?
• Perceptron and Perceptron Learning
• Multi-Layer Perceptron – Feed forward
Networks
• Back Propagation
• Implementation Session

2
What is ANN?
• An artificial neural network is a crude way of trying to simulate the
human brain (digitally)
• Human brain – Approx. 10 billion neurons
• Each neuron connected with thousands of others
• Parts of neuron
– Cell body
– Dendrites – receive input signal
– Axons – Give output

3
What is ANN?
• ANN – made up of artificial neurons
– Digitally modeled biological neuron
• Each input into the neuron has its own weight associated with
it
• As each input enters the nucleus it's multiplied by its
weight.

4
What is ANN?
• The nucleus sums all these new input values which gives us
the activation
• For n inputs and n weights – weights multiplied by input
and summed

a = x1w1+x2w2+x3w3... +xnwn

5
What is ANN?
• If the activation is greater than a threshold value, the neuron outputs a signal (for example, 1)
• If the activation is less than the threshold, the neuron outputs zero
• This is typically called a step function

6
What is ANN?
• The combination of summation and thresholding is called a
node

http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg

• For step (activation) function – The output is 1 if:

x1w1+x2w2+x3w3... +xnwn > T

7
What is ANN?
x1w1+x2w2+x3w3... +xnwn > T

x1w1+x2w2+x3w3... +xnwn -T > 0

Let w0 = -T and x0 = 1

D = x0w0 + x1w1+x2w2+x3w3... +xnwn > 0

Output is 1 if D> 0;
Output is 0 otherwise

w0 is called a bias weight
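
For illustration, a minimal Python sketch of this reformulation (the example weights and threshold are made up, not from the slides):

```python
import numpy as np

def step_unit(x, w):
    """Threshold unit using the bias-weight trick.

    x: inputs x1..xn; w: weights [w0, w1, ..., wn] where w0 = -T.
    Returns 1 if w0*1 + w1*x1 + ... + wn*xn > 0, else 0.
    """
    x = np.concatenate(([1.0], x))      # prepend the constant input x0 = 1
    return 1 if np.dot(w, x) > 0 else 0

# Hypothetical example: threshold T = 0.5, weights w1 = 0.4, w2 = 0.3
print(step_unit([1, 1], [-0.5, 0.4, 0.3]))   # 0.4 + 0.3 > 0.5, so the unit fires: 1
```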

8
Typical Activation Functions

Step function:    Y = 1 if X ≥ 0;  Y = 0 if X < 0
Sign function:    Y = +1 if X ≥ 0;  Y = -1 if X < 0
Sigmoid function: Y = 1 / (1 + e^(-X))
Linear function:  Y = X

The activation function controls when a unit is "active" or "inactive".
9
Simplest Classifier

Can a single neuron learn a task?

10
Example
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and drink.
– You get several portions of each
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out the
price of each portion.
• Each meal price gives a linear constraint on the prices of the
portions:

price = x_fish · w_fish + x_chips · w_chips + x_drink · w_drink

11
Solving the Problem
• The prices of the portions are like the weights of a linear neuron.

• We will start with guesses for the weights and then adjust the
guesses to give a better fit to the prices given by the cashier.

w = (w_fish, w_chips, w_drink)

12
The Cashier’s Brain

Price of meal = 850

A linear neuron with weights w_fish = 150, w_chips = 50, w_drink = 100 and inputs of 2 portions of fish, 5 portions of chips, and 3 portions of drink:

price = 2 × 150 + 5 × 50 + 3 × 100 = 850

13
Model of Cashier’s Brain with Arbitrary Weights

Price of meal (predicted) = 500

• With arbitrary weights of 50 for each portion and the same inputs (2 portions of fish, 5 of chips, 3 of drink): 2 × 50 + 5 × 50 + 3 × 50 = 500
• Residual error = 850 − 500 = 350
• Apply the learning rule and update the weights

14
Perceptron
• In 1958, Frank Rosenblatt introduced a training algorithm that
provided the first procedure for training a simple ANN: a
perceptron.

15
Perceptron
• A perceptron takes several inputs, x1, x2, ……, and
produces a single binary output.
• The model consists of a linear combiner followed by a hard
limiter.
• The weighted sum of the inputs is applied to the hard limiter,
which produces an output equal to +1 if its input is positive and -
1 if it is negative. (1/0 in some models).

16
Perceptron
• A bias input of 1 with weight -10, input x1 with weight 1, and input x2 with weight 4 feed the output Y.
• The decision boundary -10 + x1 + 4·x2 = 0 is the equation of a line.

17
Perceptron Learning
• A perceptron (threshold unit) can learn anything that it can
represent (i.e. anything separable with a hyperplane)

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 1

18
Implementing ‘OR’ with Perceptron
• The two-input perceptron can implement the OR function
when we set the weights: w0 = -0.3, w1 = w2 = 0.5

Decision hyperplane: w0 + w1·x1 + w2·x2 = 0
With w0 = -0.3, w1 = w2 = 0.5: -0.3 + 0.5·x1 + 0.5·x2 = 0

In the x1–x2 plane, the point (0,0) lies on the negative side of this line; the other three points lie on the positive side.

X1  X2  -0.3 + 0.5·x1 + 0.5·x2   Y
0   0   -0.3                     -1
0   1    0.2                     +1
1   0    0.2                     +1
1   1    0.7                     +1

19
Implementing ‘AND’ with Perceptron

Decision hyperplane: w0 + w1·x1 + w2·x2 = 0
With w0 = -0.8, w1 = w2 = 0.5: -0.8 + 0.5·x1 + 0.5·x2 = 0

Only the point (1,1) lies on the positive side of this line.

X1  X2  -0.8 + 0.5·x1 + 0.5·x2   Y
0   0   -0.8                     -1
0   1   -0.3                     -1
1   0   -0.3                     -1
1   1    0.2                     +1
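
Both weight settings can be checked directly in a few lines of Python; a minimal sketch that reproduces the OR and AND tables above:

```python
def hard_limiter(x1, x2, w0, w1, w2):
    """Return +1 if w0 + w1*x1 + w2*x2 >= 0, else -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1

# OR:  w0 = -0.3, w1 = w2 = 0.5      AND: w0 = -0.8, w1 = w2 = 0.5
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "OR:", hard_limiter(x1, x2, -0.3, 0.5, 0.5),
          "AND:", hard_limiter(x1, x2, -0.8, 0.5, 0.5))
```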

20
Implementing ‘XOR’ with Perceptron
• A Perceptron cannot represent Exclusive OR since it is not
linearly separable.

XOR Function
X1 X2 Y
0 0 -1
0 1 +1
1 0 +1
1 1 -1

Two perceptrons?

21
Perceptron Learning
• Gradient Descent is used to find the minimum value of E (on the error surface)

22
Perceptron Learning
• Objective: Find the values of weights which minimize the
error function

E = (1/2) Σ_{d=1..m} ( T(d) − O(d) )²

O(d) = w0 + w1·x1(d) + w2·x2(d) + .... + wn·xn(d)
O(d) is the observed and T(d) is the target output for training example ‘d’
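
As an illustration, here is a minimal gradient-descent sketch for the linear neuron using the cashier example (the learning rate, iteration count, and the two extra meals are assumptions added so that the three prices are identifiable; only the 2/5/3 meal appears in the slides):

```python
import numpy as np

# Cashier example: the true portion prices are 150 (fish), 50 (chips), 100 (drink).
meals = [
    (np.array([2.0, 5.0, 3.0]), 850.0),   # 2*150 + 5*50 + 3*100 (from the slides)
    (np.array([5.0, 1.0, 1.0]), 900.0),   # hypothetical extra meals
    (np.array([1.0, 2.0, 5.0]), 750.0),
]

w = np.array([50.0, 50.0, 50.0])   # arbitrary initial guesses for the weights
eta = 0.02                          # learning rate (assumed)

for epoch in range(200):
    for x, target in meals:
        predicted = np.dot(w, x)              # O(d) = w1*x1 + w2*x2 + w3*x3
        w += eta * (target - predicted) * x   # delta rule: w_i += eta * (T - O) * x_i

print(np.round(w, 1))   # converges to approximately [150.  50. 100.]
```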

23
Gradient Descent - Algorithm

27
Perceptron Learning: Logical Operation AND
Threshold: θ = 0.2; learning rate: α = 0.1

Epoch  x1 x2  Desired Yd  Initial w1, w2  Actual Y  Error e  Final w1, w2
1      0  0   0           0.3, -0.1       0          0       0.3, -0.1
       0  1   0           0.3, -0.1       0          0       0.3, -0.1
       1  0   0           0.3, -0.1       1         -1       0.2, -0.1
       1  1   1           0.2, -0.1       0          1       0.3,  0.0
2      0  0   0           0.3,  0.0       0          0       0.3,  0.0
       0  1   0           0.3,  0.0       0          0       0.3,  0.0
       1  0   0           0.3,  0.0       1         -1       0.2,  0.0
       1  1   1           0.2,  0.0       1          0       0.2,  0.0
3      0  0   0           0.2,  0.0       0          0       0.2,  0.0
       0  1   0           0.2,  0.0       0          0       0.2,  0.0
       1  0   0           0.2,  0.0       1         -1       0.1,  0.0
       1  1   1           0.1,  0.0       0          1       0.2,  0.1
4      0  0   0           0.2,  0.1       0          0       0.2,  0.1
       0  1   0           0.2,  0.1       0          0       0.2,  0.1
       1  0   0           0.2,  0.1       1         -1       0.1,  0.1
       1  1   1           0.1,  0.1       1          0       0.1,  0.1
5      0  0   0           0.1,  0.1       0          0       0.1,  0.1
       0  1   0           0.1,  0.1       0          0       0.1,  0.1
       1  0   0           0.1,  0.1       0          0       0.1,  0.1
       1  1   1           0.1,  0.1       1          0       0.1,  0.1

After five epochs the weights converge to w1 = 0.1, w2 = 0.1 (with threshold θ = 0.2), and the perceptron implements the AND operation.
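
A short Python sketch of the training loop that produces this table (step activation with threshold θ = 0.2, learning rate α = 0.1, initial weights 0.3 and -0.1):

```python
def step(x, theta=0.2):
    """Step activation: fire (1) if the weighted sum reaches the threshold."""
    return 1 if x >= theta else 0

# Perceptron learning of AND, reproducing the table above.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2 = 0.3, -0.1
alpha = 0.1

for epoch in range(1, 6):
    for (x1, x2), yd in data:
        y = step(w1 * x1 + w2 * x2)   # actual output
        e = yd - y                    # error
        w1 += alpha * e * x1          # perceptron learning rule
        w2 += alpha * e * x2
        print(epoch, x1, x2, yd, y, e, round(w1, 1), round(w2, 1))
```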
Logistic Regression

37
Logistic Regression (LR)
• A Binary Classification problem
• Example: Cat (1) or No-Cat (0)

• On 2-dimensional (2D) data, LR will try to draw a straight line to separate the classes
• For 3D data it’ll try to draw a 2D plane to separate the classes.
• This generalizes to N dimensional data and N-1 dimensional
hyperplane separator.

38
Logistic Regression
• Given x, predict ŷ, where ŷ = P(y = 1 | x)
• Output:
– ŷ is a probability between 0 and 1
– ŷ = σ(wᵀx + b)

39
Logistic Regression

40
LR – Linearly Separable Data

41
LR – Linearly Separable Data

• The Sequential model allows us to build deep neural networks by stacking layers one on top of another.
• Since we're now building a simple logistic regression model, we will have the input nodes directly connected to the output node, without any hidden layers.

42
LR – Linearly Separable Data

• The Dense function in Keras constructs a fully connected neural network layer, automatically initializing the weights and biases.
• The function arguments are defined as follows:
– units: The first argument, representing the number of nodes in this layer. Since we're constructing the output layer, and we said it has only one node, this value is 1.
– input_shape: The first layer in a Keras model needs to specify the input dimensions. The subsequent layers (which we don't have here but we will in later sections) don't need to specify this argument because Keras can infer the dimensions automatically. In this case our input dimensionality is 2, the x and y coordinates.
– activation: The activation function of a logistic regression model is the logistic function, also called the sigmoid.
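
Putting the above together, the model definition might look like the following sketch (assuming the TensorFlow/Keras API; the variable names are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Logistic regression: 2 input features connected directly to one sigmoid output node.
model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
```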

43
LR – Linearly Separable Data

• The compile function creates the neural network model by specifying the
details of the learning process. The model hasn’t been trained yet. The function
arguments are defined as follows:
– optimizer: Which optimizer to use in order to minimize the loss function. There are a
lot of different optimizers, most of them based on gradient descent.
– loss: The loss function to minimize. Since we’re building a binary 0/1 classifier, the
loss function to minimize is binary_crossentropy.
– metrics: Which metric to report statistics on; for classification problems we set this to accuracy.

44
LR – Linearly Separable Data

• The model is trained using the fit function. The arguments are as follows:
– x: The input data; we defined it as X above. It contains the x and y coordinates of the input points.
– y: The labels; in our case the class we're trying to predict, 0 or 1.
– verbose: Prints out the loss and accuracy; set it to 1 to see the output.
– epochs: Number of times to go over the entire training data. When training models we pass
through the training data not just once but multiple times.
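
A hedged end-to-end sketch combining the compile and fit steps described above (the synthetic blob data, the adam optimizer, and the epoch count are assumptions, not taken from the slides):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Two linearly separable blobs of 2D points (a synthetic stand-in for the slide's dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, -2), scale=0.5, size=(100, 2)),
               rng.normal(loc=(+2, +2), scale=0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, verbose=1, epochs=50)
```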

45
LR – Linearly Separable Data

46
LR – Linearly Separable Data

47
LR – Complex Data (Moons)

48
LR – Complex Data (Moons)

49
LR – Complex Data (Moons)
The current decision boundary doesn’t look as clean as the one before. The model
tried to separate out the classes from the middle, but there are a lot of misclassified
points. We need a more complex classifier with a non-linear decision boundary.
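
A sketch of how this experiment could be reproduced, assuming scikit-learn's make_moons generator for the two-moons data:

```python
from sklearn.datasets import make_moons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Two interleaving half-circles: not separable by a single straight line.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

lr = Sequential()
lr.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
lr.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lr.fit(X, y, verbose=0, epochs=50)

# Accuracy stays well below 100%: a linear boundary cannot separate the moons.
print(lr.evaluate(X, y, verbose=0))
```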

50
LR – Complex Data (Circles)

51
LR – Complex Data (Circles)

52
LR – Complex Data (Circles)

53
Can we do better?
• Re-visit same problems
• Solution: Multi-Layer Perceptron / ANN with hidden layers

• Important: The term "perceptron" is retained in "multi-layer perceptron" (MLP) for historical reasons, despite the fact that the activation functions changed.

54
Multi-Layer Perceptron
• Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units
• Piecewise linear classification using an MLP with threshold
(perceptron) units

Piece-wise linear separation

55
Multi-Layer Perceptron
• A multilayer perceptron is a feed forward neural network with one or
more hidden layers.
• The network consists of an input layer of source neurons, at least one
middle or hidden layer of computational neurons, and an output
layer of computational neurons.
• The input signals are propagated in a forward direction on a layer- by-
layer basis.

56
Multi-Layer Perceptron

57
Multi-Layer Perceptron

58
Multi-Layer Perceptron
• We first map the 3D input to a 4D vector space, then we
perform another transformation to a new 4D space, and
the final transformation reduces it to 1D.
• This is just a chain of matrix multiplications.
• The forward pass performs these matrix dot products and
applies the activation function element-wise to the result.
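
A minimal numpy sketch of that chain of matrix products for a 3D -> 4D -> 4D -> 1D network (the weights are random placeholders; tanh hidden activations and a sigmoid output are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # 3D input  -> 4D hidden
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # 4D hidden -> 4D hidden
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)   # 4D hidden -> 1D output

def forward(x):
    h1 = np.tanh(x @ W1 + b1)      # first transformation (element-wise tanh)
    h2 = np.tanh(h1 @ W2 + b2)     # second transformation
    return sigmoid(h2 @ W3 + b3)   # final reduction to a single probability

print(forward(np.array([0.5, -1.0, 2.0])))
```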

59
Multi-Layer Perceptron

60
Multi-Layer Perceptron
• By performing transformations at each layer, we are able to
project the input to a new vector space, and draw a decision
boundary to separate the classes.

61
Multi-Layer Perceptron
• This is equivalent to drawing a complex decision boundary in the
original input space.
• So the main benefit of having a deeper model is being able to
do more non-linear transformations of the input and drawing
a more complex decision boundary.

62
ANNs are universal function approximators: with enough hidden units they can approximate any continuous function to arbitrary accuracy.

63
Multi-Layer Perceptron

64
Multi-Layer Perceptron

• Output layer still uses the sigmoid activation function since we’re working on a
binary classification problem.
• Hidden layers use the tanh activation function.
• We have fewer nodes in each subsequent layer. It's common to use fewer nodes as we stack layers on top of one another, giving a roughly triangular shape.
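
As a sketch, such a tapering tanh/sigmoid stack for the moons or circles data might be defined as follows (the exact layer sizes are illustrative, not taken from the slides):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

mlp = Sequential()
mlp.add(Dense(units=8, input_shape=(2,), activation='tanh'))   # hidden layers use tanh
mlp.add(Dense(units=4, activation='tanh'))                     # fewer nodes per layer
mlp.add(Dense(units=1, activation='sigmoid'))                  # sigmoid output: binary problem
mlp.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```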

65
Multi-Layer Perceptron

66
Multi-Layer Perceptron
The ANN is able to come up with a perfect separator to distinguish the classes.

67
Multi-Layer Perceptron

68
Multi-Class Classification

• We will solve the three-class problem using:
– Softmax Regression
– MLP / ANN

69
Softmax Regression
• Softmax Regression (SR) is a generalization of LR where we can have more than 2
classes.
• In our current dataset we have 3 classes, represented as 0/1/2.

70
Softmax Regression – Difference with LR
• Number of nodes in the dense layer: LR uses 1 node, whereas SR has 3 nodes. Since
we have 3 classes, it makes sense for SR to use 3 nodes. LR models the probability of
an example belonging to class one: P(class=1). And we can calculate class 0 probability
by: 1−P(class=1). But when we have more than 2 classes, we need individual nodes for
each class. Because knowing the probability of one class doesn’t let us infer the
probability of the other classes.
• Activation function: LR used sigmoid activation function, SR uses softmax.
Softmax scales the values of the output nodes such that they represent probabilities
and sum up to 1.
• Loss function: In a binary classification problem like LR, the loss function is
binary_crossentropy. In the multiclass case, the loss function is
categorical_crossentropy.
• Fit function: LR used the vector y directly in the fit function, which has just one
column with 0/1 values. When we’re doing SR the labels need to be in one-
hot representation. In our case y_cat is a matrix with 3 columns, where all the
values are 0 except for the one that represents our class, and that is 1.
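
A sketch showing these four differences in code (the data arrays are placeholders; the point is the 3-node softmax layer, the categorical_crossentropy loss, and the one-hot labels):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Placeholder data: X is an (n, 2) array of points, y holds class labels 0/1/2.
X = np.random.rand(300, 2)
y = np.random.randint(0, 3, size=300)
y_cat = to_categorical(y, num_classes=3)   # one-hot: 3 columns, a single 1 per row

sr = Sequential()
sr.add(Dense(units=3, input_shape=(2,), activation='softmax'))   # one node per class
sr.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
sr.fit(X, y_cat, verbose=0, epochs=50)
```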

71
Softmax Regression
• LR is a linear classifier. SR is also a linear classifier, but for multiple classes.
• So the “power” of the model hasn’t changed, it’s still a linear model. We just
generalized LR to apply it to a multiclass problem.

72
Softmax Regression

73
Multi-Class with ANN

74
Multi-Class with ANN

75
Source
• Complete Source Code and Utility Functions can be found at:

https://github.com/ardendertat/Applied-Deep-Learning-with-Keras/tree/master/notebooks

76
Case Study
Handwritten Digit Recognition
(the "Hello World" of ANN)

Recognition by Humans
A handwritten digit is presented as a 28 × 28 grid of pixels.

78
Writing a Computer Program?

Input: 28x28 matrix of numbers

Output: Class label

ML: ANN

79
Digit Recognition using ANN

80
Digit Recognition using ANN

81
Digit Recognition using ANN

82
Digit Recognition using ANN

A pattern of activations in one layer causes a specific pattern of activations in the next layer.

83
Activation Function

84
Digit Recognition using ANN

85
Digit Recognition using ANN

86
Digit Recognition using ANN

87
Digit Recognition using ANN

88
Digit Recognition using ANN

89
Digit Recognition using ANN

90
Training Data

91
Digit Recognition using ANN

92
Digit Recognition using ANN

93
Digit Recognition using ANN

94
Implementing Digit Recognition
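
A hedged sketch of one possible implementation, assuming the MNIST dataset bundled with Keras and an illustrative 784-128-10 architecture (the slides' exact layer sizes are not reproduced here):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Load 28x28 grayscale digits, flatten to 784-dimensional vectors, scale to [0, 1].
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential()
model.add(Dense(128, input_shape=(784,), activation='relu'))   # hidden layer (size assumed)
model.add(Dense(10, activation='softmax'))                     # one output node per digit
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, verbose=1)
print(model.evaluate(x_test, y_test, verbose=0))
```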

95
Back Propagation

Back Propagation
(Worked Example)

96
Back Propagation

97
Back Propagation

98
Back Propagation
• Initialize random weights
• Apply the first input to the network and work out the output
• Work out the errors at the output neurons (B & C)

• Update the weights as: W_AB(new) = W_AB(old) + η × δ_B × output_A, where the error of output neuron B is δ_B = output_B × (1 − output_B) × (target_B − output_B)

• The term output_B × (1 − output_B) is due to the sigmoid function (it is the sigmoid's derivative)

99
Back Propagation

• Calculate errors for the hidden layer neurons. Since we do not have targets for them, these errors cannot be calculated directly, so we back-propagate the errors from the output neurons: δ_A = output_A × (1 − output_A) × Σ_k (δ_k × W_AK), summing over the neurons k that neuron A feeds.

• Repeat the process for the previous layers

100
Back Propagation

101
Back Propagation

102
Back Propagation

103
Back Propagation – Worked Example

Network: inputs A = 0.35 and B = 0.9 feed hidden neurons C and D through weights WAC = 0.1, WBC = 0.8, WAD = 0.4, WBD = 0.6; C and D feed the output neuron E through weights WCE = 0.3 and WDE = 0.9.

Perform a forward pass, then perform a reverse pass.
Learning Rate = 1; Target = 0.5

104
Back Propagation – Worked Example

Input to top neuron: (0.35x0.1) + (0.9x0.8) = 0.755


Output of top neuron: Sigmoid(0.755) = 0.68

Input to bottom neuron: (0.35x0.4) + (0.9x0.6) = 0.68


Output of bottom neuron: Sigmoid(0.68) = 0.663

Input to final neuron: (0.68x0.3) + (0.663x0.9) = 0.801


Output of final neuron: Sigmoid(0.801) = 0.69

105
Back Propagation – Worked Example
Forward-pass outputs: C = 0.68, D = 0.663, E = 0.69

Output error:
δ_E = output_E × (1 − output_E) × (target − output_E) = 0.69 × (1 − 0.69) × (0.5 − 0.69) ≈ −0.0406

New weights (output layer):
WCE(new) = WCE + η × δ_E × output_C = 0.3 + 1 × (−0.0406) × 0.68 ≈ 0.272
WDE(new) = WDE + η × δ_E × output_D = 0.9 + 1 × (−0.0406) × 0.663 ≈ 0.873

106
Back Propagation – Worked Example
Updated output-layer weights: WCE = 0.272, WDE = 0.873

Errors for the hidden layer neurons (back-propagated through the updated output weights):
δ_C = output_C × (1 − output_C) × δ_E × WCE = 0.68 × 0.32 × (−0.0406) × 0.272 ≈ −0.0024
δ_D = output_D × (1 − output_D) × δ_E × WDE = 0.663 × 0.337 × (−0.0406) × 0.873 ≈ −0.0079
107
Back Propagation – Worked Example

Weight updates for the hidden layer (learning rate = 1; inputs A = 0.35, B = 0.9):
WAC(new) = 0.1 + 1 × δ_C × 0.35 ≈ 0.0992
WBC(new) = 0.8 + 1 × δ_C × 0.9 ≈ 0.7978
WAD(new) = 0.4 + 1 × δ_D × 0.35 ≈ 0.3972
WBD(new) = 0.6 + 1 × δ_D × 0.9 ≈ 0.5929
108
Back Propagation – Worked Example
Feed the input with new weights:


Previous Error = -0.19
Current Error = -0.18205
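
The whole worked example can be checked with a few lines of numpy (a sketch; variable names mirror the diagram, and small differences from the slide values come from rounding in the hand calculation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A, B = 0.35, 0.9                            # inputs
wAC, wBC, wAD, wBD = 0.1, 0.8, 0.4, 0.6     # input -> hidden weights
wCE, wDE = 0.3, 0.9                         # hidden -> output weights
target, eta = 0.5, 1.0

# Forward pass
oC = sigmoid(A * wAC + B * wBC)     # ≈ 0.68
oD = sigmoid(A * wAD + B * wBD)     # ≈ 0.663
oE = sigmoid(oC * wCE + oD * wDE)   # ≈ 0.69, so the error is 0.5 - 0.69 = -0.19

# Reverse pass: output layer
dE = oE * (1 - oE) * (target - oE)  # ≈ -0.041
wCE += eta * dE * oC                # ≈ 0.272
wDE += eta * dE * oD                # ≈ 0.873

# Reverse pass: hidden layer (errors back-propagated through the updated output weights)
dC = oC * (1 - oC) * dE * wCE
dD = oD * (1 - oD) * dE * wDE
wAC += eta * dC * A;  wBC += eta * dC * B
wAD += eta * dD * A;  wBD += eta * dD * B

# Forward pass again with the updated weights
oC = sigmoid(A * wAC + B * wBC)
oD = sigmoid(A * wAD + B * wBD)
oE = sigmoid(oC * wCE + oD * wDE)
print(round(target - oE, 5))        # ≈ -0.1820, matching the slide's -0.18205 up to rounding
```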

109
MLP in Keras – PIMA INDIAN DIABETES
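
One possible sketch of this model, assuming the standard UCI Pima Indians diabetes CSV layout (8 feature columns and a 0/1 label in the last column); the file path and layer sizes are placeholders:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the dataset: 8 clinical features per patient, last column = diabetes (0/1).
data = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')   # path is a placeholder
X, y = data[:, :8], data[:, 8]

model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))   # layer sizes assumed
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
print(model.evaluate(X, y, verbose=0))
```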

110
MLP in Keras – IRIS DATA SET

111
References
• Neural Networks, by Megan Vasta
• Perceptrons and Multilayer Perceptrons, Cognitive
Systems II - Machine Learning, SS 2005
• Videos by 3Blue1Brown
• https://towardsdatascience.com/@ardendertat

112
