Contents
• What is ANN?
• Perceptron and Perceptron Learning
• Multi-Layer Perceptron – Feedforward Networks
• Back Propagation
• Implementation Session
What is ANN?
• An artificial neural network is a crude way of trying to simulate the
human brain (digitally)
• Human brain – Approx. 10 billion neurons
• Each neuron connected with thousands of others
• Parts of a neuron
– Cell body
– Dendrites – receive input signals
– Axon – gives output
• ANN – made up of artificial neurons
– Digitally modeled biological neuron
• Each input into the neuron has its own associated weight
• As each input enters the nucleus, it is multiplied by its weight.
• The nucleus sums all these new input values, which gives us the activation
• For n inputs and n weights – the weights are multiplied by the inputs and summed:
a = x1w1 + x2w2 + x3w3 + ... + xnwn
• If the activation is greater than a threshold value, the neuron outputs a signal (for example, 1)
• If the activation is less than the threshold, the neuron outputs zero.
• This is typically called a step function, as in the sketch below.
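A minimal sketch of this summation-and-threshold behaviour; the function name and the example numbers are illustrative, not from the slides:

import numpy as np

def step_neuron(x, w, threshold):
    # Weighted sum: a = x1w1 + x2w2 + ... + xnwn
    a = np.dot(x, w)
    # Step function: fire (1) above the threshold, output zero otherwise
    return 1 if a > threshold else 0

# Three inputs with arbitrary weights and threshold T = 1.0
print(step_neuron(np.array([1.0, 0.5, -0.2]), np.array([0.8, 0.4, 0.3]), 1.0))  # prints 0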
• The combination of summation and thresholding is called a
node
[Figure: a node combining summation and thresholding; image: http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg]
x1w1 + x2w2 + x3w3 + ... + xnwn > T
Let w0 = -T and x0 = 1. The condition then becomes a = x0w0 + x1w1 + ... + xnwn > 0:
Output is 1 if a > 0;
Output is 0 otherwise
Typical Activation Functions
[Figure: plots of four typical activation functions, each drawn on axes running from -1 to +1.]
Example
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and drink.
– You get several portions of each
• The cashier only tells you the total price of the meal
– After several days, you should be able to figure out the price of each portion.
• Each meal price gives a linear constraint on the prices of the portions:
price = xfish·wfish + xchips·wchips + xdrink·wdrink
Solving the Problem
• The prices of the portions are like the weights of a linear neuron.
• We will start with guesses for the weights and then adjust the
guesses to give a better fit to the prices given by the cashier.
The Cashier’s Brain
[Figure: a linear neuron whose inputs are the portions bought (2 fish, 5 chips, 3 drinks) and whose weights are the true portion prices (150, 50, 100); its output is the total price of the meal.]
Model of Cashier’s Brain with Arbitrary Weights
[Figure: the same linear neuron with arbitrary starting weights; the inputs are the portions of fish, chips, and drink (2, 5, 3).]
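A minimal sketch of this iterative fitting, using the figure’s numbers (portions 2, 5, 3; true prices 150, 50, 100); the starting guesses and the learning rate are assumptions:

import numpy as np

x = np.array([2.0, 5.0, 3.0])            # portions of fish, chips, drink
true_w = np.array([150.0, 50.0, 100.0])  # true portion prices (the cashier's weights)
target = np.dot(x, true_w)               # total price of the meal (850)

w = np.array([50.0, 50.0, 50.0])         # arbitrary starting guesses
lr = 1.0 / 35.0                          # small learning rate

for _ in range(20):
    y = np.dot(x, w)                     # the linear neuron's estimate of the price
    w += lr * (target - y) * x           # delta rule: nudge weights toward a better fit

# A single meal only pins down a price vector consistent with this one bill;
# observing many different meals would identify each portion's true price.
print(w.round(1))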
Perceptron
• In 1958, Frank Rosenblatt introduced a training algorithm that
provided the first procedure for training a simple ANN: a
perceptron.
• A perceptron takes several inputs, x1, x2, ..., and produces a single binary output.
• The model consists of a linear combiner followed by a hard
limiter.
• The weighted sum of the inputs is applied to the hard limiter, which produces an output equal to +1 if its input is positive and -1 if it is negative (1/0 in some models).
Perceptron
[Figure: an example perceptron with inputs x1 and x2 feeding output Y; the diagram shows the weight values 1 and 4 and a bias weight of -10.]
Perceptron Learning
• A perceptron (threshold unit) can learn anything that it can represent (i.e., anything separable with a hyperplane). For example, OR:
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   1
Implementing ‘OR’ with Perceptron
• The two-input perceptron can implement the OR function when we set the weights: w0 = -0.3, w1 = w2 = 0.5
Decision hyperplane: w0 + w1x1 + w2x2 = 0, i.e. -0.3 + 0.5x1 + 0.5x2 = 0
[Figure: the line -0.3 + 0.5x1 + 0.5x2 = 0 in the (x1, x2) plane, separating the negative point (0, 0) from the three positive points.]
X1  X2  -0.3 + 0.5x1 + 0.5x2   Y
0   0   -0.3                   -1
0   1    0.2                   +1
1   0    0.2                   +1
1   1    0.7                   +1
Implementing ‘AND’ with Perceptron
• The two-input perceptron implements the AND function with the weights: w0 = -0.8, w1 = w2 = 0.5
Decision hyperplane: w0 + w1x1 + w2x2 = 0, i.e. -0.8 + 0.5x1 + 0.5x2 = 0
[Figure: the line -0.8 + 0.5x1 + 0.5x2 = 0 in the (x1, x2) plane, separating the three negative points from the positive point (1, 1).]
X1  X2  -0.8 + 0.5x1 + 0.5x2   Y
0   0   -0.8                   -1
0   1   -0.3                   -1
1   0   -0.3                   -1
1   1    0.2                   +1
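A quick sketch verifying both weight settings against the tables above:

def perceptron(x1, x2, w0, w1, w2):
    # Hard limiter: +1 if the weighted sum is positive, -1 otherwise
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        y_or = perceptron(x1, x2, -0.3, 0.5, 0.5)    # OR weights
        y_and = perceptron(x1, x2, -0.8, 0.5, 0.5)   # AND weights
        print(x1, x2, y_or, y_and)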
Implementing ‘XOR’ with Perceptron
• A Perceptron cannot represent Exclusive OR since it is not
linearly separable.
XOR Function
X1 X2 Y
0 0 -1
0 1 +1
1 0 +1
1 1 -1
Two perceptrons?
Perceptron Learning
• Gradient Descent is used to find the minimum value of E (on the error surface)
• Objective: find the values of the weights that minimize the error function

E = (1/2) Σd=1..m (T(d) - O(d))²

O(d) = w0 + w1x1(d) + w2x2(d) + ... + wnxn(d)

O(d) is the observed output and T(d) is the target output for training example ‘d’. Gradient descent then updates each weight by Δwi = η Σd (T(d) - O(d)) xi(d), where η is the learning rate.
Gradient Descent – Algorithm
[Table: example of perceptron learning – training a perceptron to perform the logical AND operation. Columns: epoch; inputs x1, x2; desired output Yd; initial weights w1, w2; actual output Y; error e; final weights w1, w2. Threshold: θ = 0.2; learning rate: α = 0.1. Starting from initial weights w1 = 0.3, w2 = -0.1, the weights converge to w1 = w2 = 0.1 over five epochs.]
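A minimal sketch of the training run summarized above, assuming the initial weights w1 = 0.3, w2 = -0.1:

# Perceptron learning of the logical AND operation
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]   # (x1, x2, desired Yd)
w1, w2 = 0.3, -0.1
theta, alpha = 0.2, 0.1            # threshold and learning rate from the slide

for epoch in range(1, 6):
    for x1, x2, yd in data:
        y = 1 if x1 * w1 + x2 * w2 >= theta else 0    # step activation
        e = yd - y                                    # error
        w1 += alpha * x1 * e                          # perceptron learning rule
        w2 += alpha * x2 * e
    print(f"epoch {epoch}: w1 = {w1:.1f}, w2 = {w2:.1f}")
# By epoch 5 the weights settle at w1 = w2 = 0.1, which implements AND.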
Logistic Regression (LR)
• A Binary Classification problem
• Example: Cat (1) or No-Cat (0)
• Given x, predict ŷ, where ŷ = P(y = 1 | x)
• Output:
– ŷ is a probability between 0 and 1
– ŷ = σ(wᵀx + b)
LR – Linearly Separable Data
• The compile function creates the neural network model by specifying the details of the learning process. The model hasn’t been trained yet. The function arguments are defined as follows:
– optimizer: which optimizer to use in order to minimize the loss function. There are a lot of different optimizers, most of them based on gradient descent.
– loss: the loss function to minimize. Since we’re building a binary 0/1 classifier, the loss function to minimize is binary_crossentropy.
– metrics: which metric to report statistics on; for classification problems we set this to accuracy.
• The model is trained using the fit function. The arguments are as follows (a sketch combining compile and fit appears after this list):
– x: the input data; we defined it as X above. It contains the x and y coordinates of the input points.
– y: the labels; in our case the class we’re trying to predict, 0 or 1.
– verbose: prints out the loss and accuracy; set it to 1 to see the output.
– epochs: the number of times to go over the entire training data. When training models we pass through the training data not just once but multiple times.
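Putting compile and fit together, a minimal sketch; the generated X, y here are a stand-in for the linearly separable points used in the session:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Stand-in for the linearly separable 2D points (X, y are defined in the session)
X = np.random.rand(1000, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)            # binary 0/1 labels

model = Sequential()
model.add(Dense(1, input_shape=(2,), activation='sigmoid'))  # LR: one sigmoid node

model.compile(optimizer='adam',                    # gradient-descent-based optimizer
              loss='binary_crossentropy',          # binary 0/1 classifier
              metrics=['accuracy'])
model.fit(x=X, y=y, verbose=1, epochs=50)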
LR – Complex Data (Moons)
The current decision boundary doesn’t look as clean as the one before. The model
tried to separate out the classes from the middle, but there are a lot of misclassified
points. We need a more complex classifier with a non-linear decision boundary.
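The moons data can be generated with scikit-learn; fitting the same one-node model illustrates the problem (the noise level and sample count are assumptions):

from sklearn.datasets import make_moons
from keras.models import Sequential
from keras.layers import Dense

# Two interleaving half-moons: not separable by a single straight line
X, y = make_moons(n_samples=1000, noise=0.05, random_state=0)

lr = Sequential()
lr.add(Dense(1, input_shape=(2,), activation='sigmoid'))
lr.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lr.fit(X, y, epochs=100, verbose=0)

# Accuracy plateaus well short of 100%: a linear boundary misclassifies many points
print(lr.evaluate(X, y, verbose=0))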
LR – Complex Data (Circles)
Can we do better?
• Re-visit same problems
• Solution: Multi-Layer Perceptron / ANN with hidden layers
Multi-Layer Perceptron
• Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of Units
• Piecewise linear classification using an MLP with threshold
(perceptron) units
55
• A multilayer perceptron is a feedforward neural network with one or more hidden layers.
• The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons.
• The input signals are propagated in a forward direction on a layer-by-layer basis.
• We first map the 3D input to a 4D vector space, then we
perform another transformation to a new 4D space, and
the final transformation reduces it to 1D.
• This is just a chain of matrix multiplications.
• The forward pass performs these matrix dot products and
applies the activation function element-wise to the result.
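A minimal numpy sketch of such a 3D → 4D → 4D → 1D chain; the random weights are arbitrary, and sigmoid is used throughout purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)                            # a 3D input vector

W1, b1 = rng.random((3, 4)), np.zeros(4)     # 3D -> 4D
W2, b2 = rng.random((4, 4)), np.zeros(4)     # 4D -> 4D
W3, b3 = rng.random((4, 1)), np.zeros(1)     # 4D -> 1D

# Forward pass: a chain of matrix dot products, with the activation
# function applied element-wise after each one
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
out = sigmoid(h2 @ W3 + b3)
print(out)                                   # the final 1D output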
• By performing transformations at each layer, we are able to
project the input to a new vector space, and draw a decision
boundary to separate the classes.
• This is equivalent to drawing a complex decision boundary in the
original input space.
• So the main benefit of having a deeper model is being able to do more non-linear transformations of the input and draw a more complex decision boundary.
ANNs are universal function approximators: they can approximate virtually any complex function.
• The output layer still uses the sigmoid activation function since we’re working on a binary classification problem.
• The hidden layers use the tanh activation function.
• We have fewer nodes in each subsequent layer. It’s common to have fewer nodes as we stack layers on top of one another, giving a roughly triangular shape (see the sketch below).
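A sketch of such a model in Keras; the exact hidden sizes (4 and 2) are an assumption for 2D input data:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, input_shape=(2,), activation='tanh'))  # hidden layer 1
model.add(Dense(2, activation='tanh'))                    # hidden layer 2, fewer nodes
model.add(Dense(1, activation='sigmoid'))                 # binary output

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])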
The ANN is able to come up with a perfect separator to distinguish the classes.
Multi-Class Classification
Softmax Regression
• Softmax Regression (SR) is a generalization of LR where we can have more than 2 classes.
• In our current dataset we have 3 classes, represented as 0/1/2.
Softmax Regression – Difference with LR
• Number of nodes in the dense layer: LR uses 1 node, whereas SR uses 3 nodes. Since we have 3 classes it makes sense for SR to use 3 nodes. LR models the probability of an example belonging to class one, P(class = 1); the class 0 probability is then 1 − P(class = 1). But when we have more than 2 classes, we need an individual node for each class, because knowing the probability of one class doesn’t let us infer the probabilities of the others.
• Activation function: LR uses the sigmoid activation function; SR uses softmax. Softmax scales the values of the output nodes so that they represent probabilities and sum to 1.
• Loss function: in a binary classification problem like LR, the loss function is binary_crossentropy. In the multiclass case, the loss function is categorical_crossentropy.
• Fit function: LR used the vector y directly in the fit function; it has just one column with 0/1 values. When we’re doing SR, the labels need to be in one-hot representation. In our case y_cat is a matrix with 3 columns, where all the values are 0 except the one that represents our class, which is 1. (A sketch of these changes follows.)
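A minimal sketch collecting these four changes, assuming X holds the 2D points and y holds integer labels 0/1/2 as above:

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

y_cat = to_categorical(y, num_classes=3)       # one-hot labels: 3 columns of 0/1

model = Sequential()
model.add(Dense(3, input_shape=(2,), activation='softmax'))  # one node per class

model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # multiclass loss
              metrics=['accuracy'])
model.fit(X, y_cat, verbose=1, epochs=50)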
Softmax Regression
• LR is a linear classifier. SR is also a linear classifier, but for multiple classes.
• So the “power” of the model hasn’t changed, it’s still a linear model. We just
generalized LR to apply it to a multiclass problem.
Multi-Class with ANN
Source
• Complete source code and utility functions can be found at:
https://github.com/ardendertat/Applied-Deep-Learning-with-Keras/tree/master/notebooks
Case Study
Handwritten Digit Recognition (the “Hello World” of ANN)
Recognition by Human
[Figure: a handwritten digit shown as a 28 × 28 pixel image.]
Writing a Computer Program?
ML: ANN
Digit Recognition using ANN
Activation Function
Training Data
Implementing Digit Recognition
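A minimal end-to-end sketch using the MNIST data shipped with Keras; the hidden-layer size and epoch count are assumptions, not the session’s exact settings:

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# 28 x 28 grayscale digits, flattened into 784-dimensional vectors in [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential()
model.add(Dense(128, input_shape=(784,), activation='relu'))  # hidden layer
model.add(Dense(10, activation='softmax'))                    # one node per digit

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, verbose=1)
print(model.evaluate(x_test, y_test, verbose=0))   # [loss, accuracy] on held-out digits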
Back Propagation
• Initialize random weights
• Apply the first input to the network and work out the output
• Work out the errors at the output neurons (B & C), as in the sketch below
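A minimal numpy sketch of one such step for the 2–2–1 network of the worked example below; the initial weights, the input (0.35, 0.9), and the target 0.5 follow the standard version of that example and should be treated as assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.35, 0.9])          # inputs at A and B
W1 = np.array([[0.1, 0.4],         # WAC, WAD
               [0.8, 0.6]])        # WBC, WBD
W2 = np.array([0.3, 0.9])          # WCE, WDE
target, lr = 0.5, 1.0

# Forward pass: work out the outputs
h = sigmoid(x @ W1)                # C and D outputs: approx. [0.680, 0.663]
y = sigmoid(h @ W2)                # E output: approx. 0.690

# Error at the output neuron (the sigmoid derivative is y(1 - y))
delta_out = (target - y) * y * (1 - y)

# Errors for the hidden layer, computed with the pre-update weights
delta_hid = h * (1 - h) * W2 * delta_out

# Weight updates: WCE and WDE become approx. 0.272 and 0.873
W2 = W2 + lr * delta_out * h
W1 = W1 + lr * np.outer(x, delta_hid)
print(np.round(h, 3), np.round(y, 3), np.round(W2, 3))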
Back Propagation – Worked Example
[Figure: a 2–2–1 network with inputs A and B, hidden neurons C and D, and output neuron E; weights WAC, WBC, WAD, WBD, WCE, WDE.]
[Figure: forward pass – the outputs of C, D, and E are 0.68, 0.663, and 0.69; the output error is computed at E, giving updated weights WCE = 0.272 and WDE = 0.873.]
[Figure: the errors for the hidden layers are computed next and used to update the remaining weights.]
MLP in Keras – PIMA INDIAN DIABETES
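A minimal sketch, assuming a local copy of the Pima Indians Diabetes CSV (8 feature columns plus a 0/1 outcome column); the file name and layer sizes are assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Load the dataset: 8 medical measurements per patient, last column = diabetic (0/1)
data = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X, y = data[:, :8], data[:, 8]

model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))    # binary output: diabetic or not

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
print(model.evaluate(X, y, verbose=0))       # [loss, accuracy]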
MLP in Keras – IRIS DATA SET
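A minimal sketch using the Iris data from scikit-learn (4 features, 3 species); the hidden-layer size is an assumption:

from sklearn.datasets import load_iris
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

iris = load_iris()
X = iris.data                                  # 150 flowers, 4 features each
y_cat = to_categorical(iris.target, 3)         # one-hot labels for the 3 species

model = Sequential()
model.add(Dense(8, input_shape=(4,), activation='relu'))
model.add(Dense(3, activation='softmax'))      # one node per species

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y_cat, epochs=100, verbose=0)
print(model.evaluate(X, y_cat, verbose=0))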
References
• Neural Networks, by Megan Vasta
• Perceptrons and Multilayer Perceptrons, Cognitive Systems II – Machine Learning, SS 2005
• Videos by 3Blue1Brown
• https://towardsdatascience.com/@ardendertat