
Artificial Neural Networks

Dr. Syed M. Usman

1
Contents

• What is ANN?
• Perceptron and Perceptron Learning
• Multi-Layer Perceptron – Feed forward
Networks
• Back Propagation
• Implementation Session

2
What is ANN?
• An artificial neural network is a crude way of trying to simulate the
human brain (digitally)
• Human brain – Approx. 10 billion neurons
• Each neuron connected with thousands of others
• Parts of neuron
– Cell body
– Dendrites – receive input signal
– Axons – Give output

3
What is ANN?
• ANN – made up of artificial neurons
– Digitally modeled biological neuron
• Each input into the neuron has its own weight associated with
it
• As each input enters the nucleus it's multiplied by its
weight.

4
What is ANN?
• The nucleus sums all these new input values which gives us
the activation
• For n inputs and n weights – weights multiplied by input
and summed

a = x1w1+x2w2+x3w3... +xnwn

5
What is ANN?
• If the activation is greater than a threshold value, the neuron outputs a signal (for example, 1)
• If the activation is less than the threshold, the neuron outputs zero
• This is typically called a step function

6
What is ANN?
• The combination of summation and thresholding is called a
node

http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg

• For step (activation) function – The output is 1 if:

x1w1+x2w2+x3w3... +xnwn > T

7
What is ANN?
x1w1+x2w2+x3w3... +xnwn > T

x1w1+x2w2+x3w3... +xnwn -T > 0

Let w0 = -T and x0 = 1

D = x0w0 + x1w1+x2w2+x3w3... +xnwn > 0

Output is 1 if D> 0;
Output is 0 otherwise

w0 is called a bias weight
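
For illustration, a minimal Python sketch of this reformulation (the example weights and threshold are made up, not from the slides):

```python
import numpy as np

def step_unit(x, w):
    """Threshold unit using the bias-weight trick.

    x: inputs x1..xn; w: weights [w0, w1, ..., wn] where w0 = -T.
    Returns 1 if w0*1 + w1*x1 + ... + wn*xn > 0, else 0.
    """
    x = np.concatenate(([1.0], x))      # prepend the constant input x0 = 1
    return 1 if np.dot(w, x) > 0 else 0

# Hypothetical example: threshold T = 0.5, weights w1 = 0.4, w2 = 0.3
print(step_unit([1, 1], [-0.5, 0.4, 0.3]))   # 0.4 + 0.3 > 0.5, so the unit fires: 1
```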

8
Typical Activation Functions

Step function:    Y = 1 if X ≥ 0;  Y = 0 if X < 0
Sign function:    Y = +1 if X ≥ 0;  Y = -1 if X < 0
Sigmoid function: Y = 1 / (1 + e^(-X))
Linear function:  Y = X

The activation function controls when a unit is "active" or "inactive".
9
Simplest Classifier

Can a single neuron learn a task?

10
Example
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and drink.
– You get several portions of each
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out the
price of each portion.
• Each meal price gives a linear constraint on the prices of the
portions:

price = x_fish · w_fish + x_chips · w_chips + x_drink · w_drink

11
Solving the Problem
• The prices of the portions are like the weights of a linear neuron.

• We will start with guesses for the weights and then adjust the
guesses to give a better fit to the prices given by the cashier.

w = (w_fish, w_chips, w_drink)

12
The Cashier’s Brain

Price of meal = 850

A linear neuron with weights w_fish = 150, w_chips = 50, w_drink = 100 and inputs of 2 portions of fish, 5 portions of chips, and 3 portions of drink:

price = 2 × 150 + 5 × 50 + 3 × 100 = 850

13
Model of Cashier’s Brain with Arbitrary Weights

Price of meal (predicted) = 500

• With arbitrary weights of 50 for each portion and the same inputs (2 portions of fish, 5 of chips, 3 of drink): 2 × 50 + 5 × 50 + 3 × 50 = 500
• Residual error = 850 − 500 = 350
• Apply the learning rule and update the weights

14
Perceptron
• In 1958, Frank Rosenblatt introduced a training algorithm that
provided the first procedure for training a simple ANN: a
perceptron.

15
Perceptron
• A perceptron takes several inputs, x1, x2, ……, and
produces a single binary output.
• The model consists of a linear combiner followed by a hard
limiter.
• The weighted sum of the inputs is applied to the hard limiter,
which produces an output equal to +1 if its input is positive and -
1 if it is negative. (1/0 in some models).

16
Perceptron
• A bias input of 1 with weight -10, input x1 with weight 1, and input x2 with weight 4 feed the output Y.
• The decision boundary -10 + x1 + 4·x2 = 0 is the equation of a line.

17
Perceptron Learning
• A perceptron (threshold unit) can learn anything that it can
represent (i.e. anything separable with a hyperplane)

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 1

18
Implementing ‘OR’ with Perceptron
• The two-input perceptron can implement the OR function
when we set the weights: w0 = -0.3, w1 = w2 = 0.5

Decision hyperplane: w0 + w1·x1 + w2·x2 = 0
With w0 = -0.3, w1 = w2 = 0.5: -0.3 + 0.5·x1 + 0.5·x2 = 0

In the x1–x2 plane, the point (0,0) lies on the negative side of this line; the other three points lie on the positive side.

X1  X2  -0.3 + 0.5·x1 + 0.5·x2   Y
0   0   -0.3                     -1
0   1    0.2                     +1
1   0    0.2                     +1
1   1    0.7                     +1

19
Implementing ‘AND’ with Perceptron

Decision hyperplane: w0 + w1·x1 + w2·x2 = 0
With w0 = -0.8, w1 = w2 = 0.5: -0.8 + 0.5·x1 + 0.5·x2 = 0

Only the point (1,1) lies on the positive side of this line.

X1  X2  -0.8 + 0.5·x1 + 0.5·x2   Y
0   0   -0.8                     -1
0   1   -0.3                     -1
1   0   -0.3                     -1
1   1    0.2                     +1
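
Both weight settings can be checked directly in a few lines of Python; a minimal sketch that reproduces the OR and AND tables above:

```python
def hard_limiter(x1, x2, w0, w1, w2):
    """Return +1 if w0 + w1*x1 + w2*x2 >= 0, else -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1

# OR:  w0 = -0.3, w1 = w2 = 0.5      AND: w0 = -0.8, w1 = w2 = 0.5
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "OR:", hard_limiter(x1, x2, -0.3, 0.5, 0.5),
          "AND:", hard_limiter(x1, x2, -0.8, 0.5, 0.5))
```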

20
Implementing ‘XOR’ with Perceptron
• A Perceptron cannot represent Exclusive OR since it is not
linearly separable.

XOR Function
X1 X2 Y
0 0 -1
0 1 +1
1 0 +1
1 1 -1

Two perceptrons?

21
Perceptron Learning
• Gradient Descent is used to find the minimum value of E (on the error surface)

22
Perceptron Learning
• Objective: Find the values of weights which minimize the
error function

E = (1/2) Σ_{d=1..m} ( T(d) − O(d) )²

O(d) = w0 + w1·x1(d) + w2·x2(d) + .... + wn·xn(d)
O(d) is the observed and T(d) is the target output for training example ‘d’
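
As an illustration, here is a minimal gradient-descent sketch for the linear neuron using the cashier example (the learning rate, iteration count, and the two extra meals are assumptions added so that the three prices are identifiable; only the 2/5/3 meal appears in the slides):

```python
import numpy as np

# Cashier example: the true portion prices are 150 (fish), 50 (chips), 100 (drink).
meals = [
    (np.array([2.0, 5.0, 3.0]), 850.0),   # 2*150 + 5*50 + 3*100 (from the slides)
    (np.array([5.0, 1.0, 1.0]), 900.0),   # hypothetical extra meals
    (np.array([1.0, 2.0, 5.0]), 750.0),
]

w = np.array([50.0, 50.0, 50.0])   # arbitrary initial guesses for the weights
eta = 0.02                          # learning rate (assumed)

for epoch in range(200):
    for x, target in meals:
        predicted = np.dot(w, x)              # O(d) = w1*x1 + w2*x2 + w3*x3
        w += eta * (target - predicted) * x   # delta rule: w_i += eta * (T - O) * x_i

print(np.round(w, 1))   # converges to approximately [150.  50. 100.]
```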

23
Gradient Descent - Algorithm

27
Perceptron Learning: Logical Operation AND
Threshold: θ = 0.2; learning rate: α = 0.1

Epoch  x1 x2  Desired Yd  Initial w1, w2  Actual Y  Error e  Final w1, w2
1      0  0   0           0.3, -0.1       0          0       0.3, -0.1
       0  1   0           0.3, -0.1       0          0       0.3, -0.1
       1  0   0           0.3, -0.1       1         -1       0.2, -0.1
       1  1   1           0.2, -0.1       0          1       0.3,  0.0
2      0  0   0           0.3,  0.0       0          0       0.3,  0.0
       0  1   0           0.3,  0.0       0          0       0.3,  0.0
       1  0   0           0.3,  0.0       1         -1       0.2,  0.0
       1  1   1           0.2,  0.0       1          0       0.2,  0.0
3      0  0   0           0.2,  0.0       0          0       0.2,  0.0
       0  1   0           0.2,  0.0       0          0       0.2,  0.0
       1  0   0           0.2,  0.0       1         -1       0.1,  0.0
       1  1   1           0.1,  0.0       0          1       0.2,  0.1
4      0  0   0           0.2,  0.1       0          0       0.2,  0.1
       0  1   0           0.2,  0.1       0          0       0.2,  0.1
       1  0   0           0.2,  0.1       1         -1       0.1,  0.1
       1  1   1           0.1,  0.1       1          0       0.1,  0.1
5      0  0   0           0.1,  0.1       0          0       0.1,  0.1
       0  1   0           0.1,  0.1       0          0       0.1,  0.1
       1  0   0           0.1,  0.1       0          0       0.1,  0.1
       1  1   1           0.1,  0.1       1          0       0.1,  0.1

After five epochs the weights converge to w1 = 0.1, w2 = 0.1 (with threshold θ = 0.2), and the perceptron implements the AND operation.
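
A short Python sketch of the training loop that produces this table (step activation with threshold θ = 0.2, learning rate α = 0.1, initial weights 0.3 and -0.1):

```python
def step(x, theta=0.2):
    """Step activation: fire (1) if the weighted sum reaches the threshold."""
    return 1 if x >= theta else 0

# Perceptron learning of AND, reproducing the table above.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2 = 0.3, -0.1
alpha = 0.1

for epoch in range(1, 6):
    for (x1, x2), yd in data:
        y = step(w1 * x1 + w2 * x2)   # actual output
        e = yd - y                    # error
        w1 += alpha * e * x1          # perceptron learning rule
        w2 += alpha * e * x2
        print(epoch, x1, x2, yd, y, e, round(w1, 1), round(w2, 1))
```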
Logistic Regression

37
Logistic Regression (LR)
• A Binary Classification problem
• Example: Cat (1) or No-Cat (0)

• On 2-dimensional (2D) data, LR will try to draw a straight line to separate the classes
• For 3D data it’ll try to draw a 2D plane to separate the classes.
• This generalizes to N dimensional data and N-1 dimensional
hyperplane separator.

38
Logistic Regression
• Given x, predict ŷ, where ŷ = P(y = 1 | x)
• Output:
– ŷ is a probability between 0 and 1
– ŷ = σ(wᵀx + b)

39
Logistic Regression

40
LR – Linearly Separable Data

41
LR – Linearly Separable Data

• The Sequential model allows us to build deep neural networks by stacking layers one on top of another.
• Since we're now building a simple logistic regression model, we will have the input nodes directly connected to the output node, without any hidden layers.

42
LR – Linearly Separable Data

• The Dense function in Keras constructs a fully connected neural network layer, automatically initializing the weights and biases.
• The function arguments are defined as follows:
– units: The first argument, representing the number of nodes in this layer. Since we're constructing the output layer, and we said it has only one node, this value is 1.
– input_shape: The first layer in a Keras model needs to specify the input dimensions. The subsequent layers (which we don't have here but we will in later sections) don't need to specify this argument because Keras can infer the dimensions automatically. In this case our input dimensionality is 2, the x and y coordinates.
– activation: The activation function of a logistic regression model is the logistic function, also called the sigmoid.
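
Putting the above together, the model definition might look like the following sketch (assuming the TensorFlow/Keras API; the variable names are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Logistic regression: 2 input features connected directly to one sigmoid output node.
model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
```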

43
LR – Linearly Separable Data

• The compile function creates the neural network model by specifying the
details of the learning process. The model hasn’t been trained yet. The function
arguments are defined as follows:
– optimizer: Which optimizer to use in order to minimize the loss function. There are a
lot of different optimizers, most of them based on gradient descent.
– loss: The loss function to minimize. Since we’re building a binary 0/1 classifier, the
loss function to minimize is binary_crossentropy.
– metrics: Which metric to report statistics on; for classification problems we set this to accuracy.

44
LR – Linearly Separable Data

• The model is trained using the fit function. The arguments are as follows:
– x: The input data; we defined it as X above. It contains the x and y coordinates of the input points.
– y: The labels; in our case the class we're trying to predict, 0 or 1.
– verbose: Prints out the loss and accuracy; set it to 1 to see the output.
– epochs: Number of times to go over the entire training data. When training models we pass
through the training data not just once but multiple times.
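
A hedged end-to-end sketch combining the compile and fit steps described above (the synthetic blob data, the adam optimizer, and the epoch count are assumptions, not taken from the slides):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Two linearly separable blobs of 2D points (a synthetic stand-in for the slide's dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, -2), scale=0.5, size=(100, 2)),
               rng.normal(loc=(+2, +2), scale=0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = Sequential()
model.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, verbose=1, epochs=50)
```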

45
LR – Linearly Separable Data

46
LR – Linearly Separable Data

47
LR – Complex Data (Moons)

48
LR – Complex Data (Moons)

49
LR – Complex Data (Moons)
The current decision boundary doesn’t look as clean as the one before. The model
tried to separate out the classes from the middle, but there are a lot of misclassified
points. We need a more complex classifier with a non-linear decision boundary.
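
A sketch of how this experiment could be reproduced, assuming scikit-learn's make_moons generator for the two-moons data:

```python
from sklearn.datasets import make_moons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Two interleaving half-circles: not separable by a single straight line.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

lr = Sequential()
lr.add(Dense(units=1, input_shape=(2,), activation='sigmoid'))
lr.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lr.fit(X, y, verbose=0, epochs=50)

# Accuracy stays well below 100%: a linear boundary cannot separate the moons.
print(lr.evaluate(X, y, verbose=0))
```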

50
LR – Complex Data (Circles)

51
LR – Complex Data (Circles)

52
LR – Complex Data (Circles)

53
Can we do better?
• Re-visit same problems
• Solution: Multi-Layer Perceptron / ANN with hidden layers

• Important: The term "perceptron" is retained in "multi-layer perceptron" (MLP) for historical reasons, despite the fact that the activation functions changed.

54
Multi-Layer Perceptron
• Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units
• Piecewise linear classification using an MLP with threshold
(perceptron) units

Piece-wise linear separation

55
Multi-Layer Perceptron
• A multilayer perceptron is a feed forward neural network with one or
more hidden layers.
• The network consists of an input layer of source neurons, at least one
middle or hidden layer of computational neurons, and an output
layer of computational neurons.
• The input signals are propagated in a forward direction on a layer- by-
layer basis.

56
Multi-Layer Perceptron

57
Multi-Layer Perceptron

58
Multi-Layer Perceptron
• We first map the 3D input to a 4D vector space, then we
perform another transformation to a new 4D space, and
the final transformation reduces it to 1D.
• This is just a chain of matrix multiplications.
• The forward pass performs these matrix dot products and
applies the activation function element-wise to the result.
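
A minimal numpy sketch of that chain of matrix products for a 3D -> 4D -> 4D -> 1D network (the weights are random placeholders; tanh hidden activations and a sigmoid output are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # 3D input  -> 4D hidden
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # 4D hidden -> 4D hidden
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)   # 4D hidden -> 1D output

def forward(x):
    h1 = np.tanh(x @ W1 + b1)      # first transformation (element-wise tanh)
    h2 = np.tanh(h1 @ W2 + b2)     # second transformation
    return sigmoid(h2 @ W3 + b3)   # final reduction to a single probability

print(forward(np.array([0.5, -1.0, 2.0])))
```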

59
Multi-Layer Perceptron

60
Multi-Layer Perceptron
• By performing transformations at each layer, we are able to
project the input to a new vector space, and draw a decision
boundary to separate the classes.

61
Multi-Layer Perceptron
• This is equivalent to drawing a complex decision boundary in the
original input space.
• So the main benefit of having a deeper model is being able to
do more non-linear transformations of the input and drawing
a more complex decision boundary.

62
ANNs are universal function approximators: with enough hidden units they can approximate any continuous function to arbitrary accuracy.

63
Multi-Layer Perceptron

64
Multi-Layer Perceptron

• Output layer still uses the sigmoid activation function since we’re working on a
binary classification problem.
• Hidden layers use the tanh activation function.
• We have fewer nodes in each subsequent layer. It's common to use fewer nodes as we stack layers on top of one another, giving a roughly triangular shape.
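
As a sketch, such a tapering tanh/sigmoid stack for the moons or circles data might be defined as follows (the exact layer sizes are illustrative, not taken from the slides):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

mlp = Sequential()
mlp.add(Dense(units=8, input_shape=(2,), activation='tanh'))   # hidden layers use tanh
mlp.add(Dense(units=4, activation='tanh'))                     # fewer nodes per layer
mlp.add(Dense(units=1, activation='sigmoid'))                  # sigmoid output: binary problem
mlp.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```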

65
Multi-Layer Perceptron

66
Multi-Layer Perceptron
The ANN is able to come up with a perfect separator to distinguish the classes.

67
Multi-Layer Perceptron

68
Multi-Class Classification

• We will solve the three-class problem using:
– Softmax Regression
– MLP / ANN

69
Softmax Regression
• Softmax Regression (SR) is a generalization of LR where we can have more than 2
classes.
• In our current dataset we have 3 classes, represented as 0/1/2.

70
Softmax Regression – Difference with LR
• Number of nodes in the dense layer: LR uses 1 node, whereas SR has 3 nodes. Since
we have 3 classes, it makes sense for SR to use 3 nodes. LR models the probability of
an example belonging to class one: P(class=1). And we can calculate class 0 probability
by: 1−P(class=1). But when we have more than 2 classes, we need individual nodes for
each class. Because knowing the probability of one class doesn’t let us infer the
probability of the other classes.
• Activation function: LR used sigmoid activation function, SR uses softmax.
Softmax scales the values of the output nodes such that they represent probabilities
and sum up to 1.
• Loss function: In a binary classification problem like LR, the loss function is
binary_crossentropy. In the multiclass case, the loss function is
categorical_crossentropy.
• Fit function: LR used the vector y directly in the fit function, which has just one
column with 0/1 values. When we’re doing SR the labels need to be in one-
hot representation. In our case y_cat is a matrix with 3 columns, where all the
values are 0 except for the one that represents our class, and that is 1.
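
A sketch showing these four differences in code (the data arrays are placeholders; the point is the 3-node softmax layer, the categorical_crossentropy loss, and the one-hot labels):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Placeholder data: X is an (n, 2) array of points, y holds class labels 0/1/2.
X = np.random.rand(300, 2)
y = np.random.randint(0, 3, size=300)
y_cat = to_categorical(y, num_classes=3)   # one-hot: 3 columns, a single 1 per row

sr = Sequential()
sr.add(Dense(units=3, input_shape=(2,), activation='softmax'))   # one node per class
sr.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
sr.fit(X, y_cat, verbose=0, epochs=50)
```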

71
Softmax Regression
• LR is a linear classifier. SR is also a linear classifier, but for multiple classes.
• So the “power” of the model hasn’t changed, it’s still a linear model. We just
generalized LR to apply it to a multiclass problem.

72
Softmax Regression

73
Multi-Class with ANN

74
Multi-Class with ANN

75
Source
• Complete Source Code and Utility Functions can be found at:

https://github.com/ardendertat/Applied-Deep-Learning-with-Keras/tree/master/notebooks

76
Case Study
Handwritten Digit Recognition
(the "Hello World" of ANN)

Recognition by Humans
A handwritten digit is presented as a 28 × 28 grid of pixels.

78
Writing a Computer Program?

Input: 28x28 matrix of numbers

Output: Class label

ML: ANN

79
Digit Recognition using ANN

80
Digit Recognition using ANN

81
Digit Recognition using ANN

82
Digit Recognition using ANN

A pattern of activations in one layer causes a specific pattern of activations in the next layer.

83
Activation Function

84
Digit Recognition using ANN

85
Digit Recognition using ANN

86
Digit Recognition using ANN

87
Digit Recognition using ANN

88
Digit Recognition using ANN

89
Digit Recognition using ANN

90
Training Data

91
Digit Recognition using ANN

92
Digit Recognition using ANN

93
Digit Recognition using ANN

94
Implementing Digit Recognition
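
A hedged sketch of one possible implementation, assuming the MNIST dataset bundled with Keras and an illustrative 784-128-10 architecture (the slides' exact layer sizes are not reproduced here):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Load 28x28 grayscale digits, flatten to 784-dimensional vectors, scale to [0, 1].
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential()
model.add(Dense(128, input_shape=(784,), activation='relu'))   # hidden layer (size assumed)
model.add(Dense(10, activation='softmax'))                     # one output node per digit
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, verbose=1)
print(model.evaluate(x_test, y_test, verbose=0))
```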

95
Back Propagation

Back Propagation
(Worked Example)

96
Back Propagation

97
Back Propagation

98
Back Propagation
• Initialize random weights
• Apply the first input to the network and work out the output
• Work out the errors at the output neurons (B & C)

• Update the weights as: W_AB(new) = W_AB(old) + η × δ_B × output_A, where the error of output neuron B is δ_B = output_B × (1 − output_B) × (target_B − output_B)

• The term output_B × (1 − output_B) is due to the sigmoid function (it is the sigmoid's derivative)

99
Back Propagation

• Calculate errors for the hidden layer neurons. Since we do not have targets for them, these errors cannot be calculated directly, so we back-propagate the errors from the output neurons: δ_A = output_A × (1 − output_A) × Σ_k (δ_k × W_AK), summing over the neurons k that neuron A feeds.

• Repeat the process for the previous layers

100
Back Propagation

101
Back Propagation

102
Back Propagation

103
Back Propagation – Worked Example

Network: inputs A = 0.35 and B = 0.9 feed hidden neurons C and D through weights WAC = 0.1, WBC = 0.8, WAD = 0.4, WBD = 0.6; C and D feed the output neuron E through weights WCE = 0.3 and WDE = 0.9.

Perform a forward pass, then perform a reverse pass.
Learning Rate = 1; Target = 0.5

104
Back Propagation – Worked Example

Input to top neuron: (0.35x0.1) + (0.9x0.8) = 0.755


Output of top neuron: Sigmoid(0.755) = 0.68

Input to bottom neuron: (0.35x0.4) + (0.9x0.6) = 0.68


Output of bottom neuron: Sigmoid(0.68) = 0.663

Input to final neuron: (0.68x0.3) + (0.663x0.9) = 0.801


Output of final neuron: Sigmoid(0.801) = 0.69

105
Back Propagation – Worked Example
Forward-pass outputs: C = 0.68, D = 0.663, E = 0.69

Output error:
δ_E = output_E × (1 − output_E) × (target − output_E) = 0.69 × (1 − 0.69) × (0.5 − 0.69) ≈ −0.0406

New weights (output layer):
WCE(new) = WCE + η × δ_E × output_C = 0.3 + 1 × (−0.0406) × 0.68 ≈ 0.272
WDE(new) = WDE + η × δ_E × output_D = 0.9 + 1 × (−0.0406) × 0.663 ≈ 0.873

106
Back Propagation – Worked Example
Updated output-layer weights: WCE = 0.272, WDE = 0.873

Errors for the hidden layer neurons (back-propagated through the updated output weights):
δ_C = output_C × (1 − output_C) × δ_E × WCE = 0.68 × 0.32 × (−0.0406) × 0.272 ≈ −0.0024
δ_D = output_D × (1 − output_D) × δ_E × WDE = 0.663 × 0.337 × (−0.0406) × 0.873 ≈ −0.0079
107
Back Propagation – Worked Example

Weight updates for the hidden layer (learning rate = 1; inputs A = 0.35, B = 0.9):
WAC(new) = 0.1 + 1 × δ_C × 0.35 ≈ 0.0992
WBC(new) = 0.8 + 1 × δ_C × 0.9 ≈ 0.7978
WAD(new) = 0.4 + 1 × δ_D × 0.35 ≈ 0.3972
WBD(new) = 0.6 + 1 × δ_D × 0.9 ≈ 0.5929
108
Back Propagation – Worked Example
Feed the input with new weights:


Previous Error = -0.19
Current Error = -0.18205
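
The whole worked example can be checked with a few lines of numpy (a sketch; variable names mirror the diagram, and small differences from the slide values come from rounding in the hand calculation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A, B = 0.35, 0.9                            # inputs
wAC, wBC, wAD, wBD = 0.1, 0.8, 0.4, 0.6     # input -> hidden weights
wCE, wDE = 0.3, 0.9                         # hidden -> output weights
target, eta = 0.5, 1.0

# Forward pass
oC = sigmoid(A * wAC + B * wBC)     # ≈ 0.68
oD = sigmoid(A * wAD + B * wBD)     # ≈ 0.663
oE = sigmoid(oC * wCE + oD * wDE)   # ≈ 0.69, so the error is 0.5 - 0.69 = -0.19

# Reverse pass: output layer
dE = oE * (1 - oE) * (target - oE)  # ≈ -0.041
wCE += eta * dE * oC                # ≈ 0.272
wDE += eta * dE * oD                # ≈ 0.873

# Reverse pass: hidden layer (errors back-propagated through the updated output weights)
dC = oC * (1 - oC) * dE * wCE
dD = oD * (1 - oD) * dE * wDE
wAC += eta * dC * A;  wBC += eta * dC * B
wAD += eta * dD * A;  wBD += eta * dD * B

# Forward pass again with the updated weights
oC = sigmoid(A * wAC + B * wBC)
oD = sigmoid(A * wAD + B * wBD)
oE = sigmoid(oC * wCE + oD * wDE)
print(round(target - oE, 5))        # ≈ -0.1820, matching the slide's -0.18205 up to rounding
```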

109
MLP in Keras – PIMA INDIAN DIABETES
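
One possible sketch of this model, assuming the standard UCI Pima Indians diabetes CSV layout (8 feature columns and a 0/1 label in the last column); the file path and layer sizes are placeholders:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the dataset: 8 clinical features per patient, last column = diabetes (0/1).
data = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')   # path is a placeholder
X, y = data[:, :8], data[:, 8]

model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))   # layer sizes assumed
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
print(model.evaluate(X, y, verbose=0))
```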

110
MLP in Keras – IRIS DATA SET

111
References
• Neural Networks, by Megan Vasta
• Perceptrons and Multilayer Perceptrons, Cognitive
Systems II - Machine Learning, SS 2005
• Videos by 3Blue1Brown
• https://towardsdatascience.com/@ardendertat

112
