
Artificial Neural Networks
Artificial Neural Network
• Neural Networks (NN) are based on the structure of the biological neural system, which consists of many connected elements called neurons
• Stimuli from the external environment or inputs from sensory organs are accepted by dendrites
• These inputs create electric impulses, which quickly travel through the neural network
• Neurons get signals from dendrites and pass them to the next neurons
Artificial Neural Network (Contd…)
• Neural networks are composed of multiple nodes, which imitate the biological neurons of the human brain
• The nodes play the role of neurons
• Each node takes input data and performs simple operations on it
• The result of these operations is passed to other neurons
• The output at each node is called its activation or node value


Artificial Neural Network (Contd…)
• A network is organized into an input layer, one or more hidden layers, and an output layer
• Layers between the input and output layers are known as hidden layers; they are generally treated as a black box where the calculations happen, and the calculated values are then sent to the output layer
• Each layer extracts some information related to the features and forwards it, scaled by a weight, to the next layer
• The output is the sum of these pieces of information multiplied by their related weights
Artificial Neural Network (Contd…)
An artificial neuron in a hidden layer applies an association function followed by an activation function:

inputs x, w, b -> association: $Z = W^{T} x + b$ -> activation: $a = \sigma(Z)$ -> output Y

Where,
W is the weight, b is the bias,
x is the input, T is the vector transpose
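The association and activation steps of a single neuron can be sketched in a few lines of plain Python. The function names `sigmoid` and `neuron` and the sample numbers are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Association Z = w^T x + b followed by activation a = sigmoid(Z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

# Hypothetical input, weights and bias, chosen so that Z comes out as 0
z, a = neuron(x=[1.0, 2.0], w=[0.5, -0.5], b=0.5)
print(z, a)
```

With these sample values the weighted sum is zero, so the activation sits at the midpoint of the sigmoid.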
Artificial Neural Network (Contd…)
Hidden layer with inputs x1, x2, x3 and output y:

Node 1: Z1[1] = W1[1]T x(i) + b1[1], a1[1] = σ(Z1[1])
Node 2: Z2[1] = W2[1]T x(i) + b2[1], a2[1] = σ(Z2[1])

Where [1] denotes layer 1 (the hidden layer)


Artificial Neural Network (Contd…)
• A weight shows the strength of the connection into a particular node
• A bias value allows you to shift the activation function curve left or right along the input axis
Neural Networks Overview
In logistic regression we had:

X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /

Neural Networks Overview
In neural networks with one hidden layer we will have:

X1 \
X2 => z[1] = W[1] X + b[1] => a[1] = Sigmoid(z[1] ) => z[2] = W[2] a[1]+ b[2] => a[2] = Sigmoid(z[2]) => L(a[2],Y)
X3 /

X is the input vector (X1, X2, X3), and Y is the output variable (1x1)
[1] ---> refers to Layer 1
[2] ---> refers to Layer 2

An NN is a stack of logistic regression units


Deep Neural Network
• Deep Neural Networks are complex ANNs with more than one hidden layer
• Each layer forwards its generated output to the next layer, and the last layer is responsible for providing the final value after computation
• For example, an output of 0 may mean that an application is normal, while an output of 1 means it is malicious
Activation Functions
1) Sigmoid or Logistic Activation Function

• The sigmoid function curve is S-shaped
• Sigmoid: $a = \frac{1}{1 + e^{-z}}$
• Its output lies between 0 and 1
• Therefore, it is used in the output layer for binary classification


Activation Functions (Contd…)
2) Tanh or hyperbolic tangent Activation Function
• Tanh usually works better than sigmoid in hidden layers: its values lie between -1 and +1, so the activations coming out of a hidden layer have a mean closer to zero rather than 0.5
• Tanh: $a = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• The range of the Tanh function is (-1 to 1)
• The advantage is that negative inputs are mapped strongly negative and inputs near zero are mapped near zero in the Tanh graph
• Therefore, it is used in hidden layers
• It is not used in the output layer for binary classification, because y is either 0 or 1 rather than between -1 and 1
Activation Functions (Contd…)
3) ReLU (Rectified Linear Unit) Activation Function
• ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero
• ReLU: a = max(0,z)
• Range of the ReLU function is from 0 to infinity
• Used in hidden layers
• A neural network usually learns much faster with ReLU than with Sigmoid or Tanh
• Drawback: any negative input is mapped to zero immediately, so negative values are not represented in the output, which can decrease the ability of the model to fit or train from the data properly
Activation Functions (Contd…)
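4) Leaky ReLU Activation Function

• Designed to solve the dying ReLU problem, where units that only receive negative inputs output zero and stop learning
• The leak extends the range of the ReLU function to negative values
• Leaky ReLU: a = max(0.01 z, z)
• Usually, the leak coefficient (the slope for negative z) is 0.01 or so
• When the coefficient is instead drawn at random during training, the variant is called Randomized ReLU
• The range of the Leaky ReLU is (-infinity to infinity)
• Used in hidden layers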
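The four activation functions can be compared side by side with a short sketch. The function names and the `leak` parameter name are assumptions chosen for illustration; the formulas are the ones given on the slides.

```python
import math

def sigmoid(z):
    """a = 1 / (1 + e^-z), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """a = (e^z - e^-z) / (e^z + e^-z), output in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """a = max(0, z), output in [0, infinity)."""
    return max(0.0, z)

def leaky_relu(z, leak=0.01):
    """a = max(leak * z, z); negative inputs keep a small slope."""
    return max(leak * z, z)

# Evaluate each function on a negative, a zero and a positive input
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

For the negative input, ReLU returns zero while Leaky ReLU keeps a small negative value, which is the difference the dying ReLU discussion refers to.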


Evolution of Neural Networks
Year | Neural Network | Designer | Description
1943 | McCulloch-Pitts neuron | McCulloch and Pitts | Arrangement of neurons is a combination of logic gates. Unique feature is the threshold
1949 | Hebb network | Hebb | If two neurons are active, then their connection strengths should be increased
1958-1988 | Perceptron | Frank Rosenblatt, Block, Minsky and Papert | Weights of path can be adjusted
1960 | Adaline | Widrow and Hoff | The weights are adjusted to reduce the difference between the net input to the output unit and the desired output
1972 | Kohonen self-organizing feature map | Kohonen | Inputs are clustered to obtain a fired output neuron
Evolution of Neural Networks
Year | Neural Network | Designer | Description
1982 | Hopfield network | John Hopfield | Based on fixed weights. Can act as associative memory nets
1986 | Back-propagation network | Rumelhart | Multilayered. Error propagated backward from output to the hidden units
1987-90 | ART | Carpenter and Grossberg | Used for both binary and analog
Hebb Network

Hebb rule is used for pattern association, pattern categorization, pattern classification and over a range of other areas.
The activation function of the input layer is the identity function: xi = si for i = 1 to n, where s:t is the input training vector and target output pair
Example: Designing a Hebb network to implement
AND function
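The worked figures for this example did not survive in the text, so the following is only a sketch under stated assumptions: bipolar inputs and targets (+1/-1), zero initial weights, and the usual Hebb update w_i(new) = w_i(old) + x_i * t with b(new) = b(old) + t. The names `hebb_train` and `predict` are hypothetical.

```python
def hebb_train(samples):
    """One pass of the Hebb rule: w_i += x_i * t and b += t for every pair."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for x, t in samples:
        w = [wi + xi * t for wi, xi in zip(w, x)]
        b += t
    return w, b

def predict(x, w, b):
    """Bipolar output: +1 if the net input is non-negative, else -1."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if net >= 0 else -1

# Bipolar AND: the target is +1 only when both inputs are +1
and_samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
w, b = hebb_train(and_samples)
print(w, b)
print([predict(x, w, b) for x, _ in and_samples])

# Bipolar XOR for comparison: the updates cancel out
xor_samples = [([1, 1], -1), ([1, -1], 1), ([-1, 1], 1), ([-1, -1], -1)]
w_xor, b_xor = hebb_train(xor_samples)
print(w_xor, b_xor)
```

Under these assumptions the AND patterns are all classified correctly after a single pass, while for XOR every weight ends at zero, so no decision boundary is learned.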
XOR using Hebb Network

No decision boundary
Exercise
Exercise –Solution Notes
Perceptron or Single-layer Perceptron
• A single-layer perceptron has just two layers: input and output
• Only the output layer performs computation, hence the name single-layer perceptron
• Unlike a multilayer perceptron, it does not contain hidden layers
Perceptrons
A perceptron is a type of ANN that can be seen as the simplest kind of feedforward neural network: a linear classifier
Introduced in the late 1950s
Perceptron convergence theorem (Rosenblatt 1962):
◦ Perceptron will learn to classify any linearly separable set of inputs.

Perceptron is a network:
– single-layer
– feed-forward: data only travels in one direction

[Figure: XOR function (no linear separation)]
Perceptron or Single-layer Perceptron
• A perceptron is a single-layer neural network, whereas a multi-layer perceptron is what is usually called a neural network
• The perceptron consists of 4 parts:
◦ Input values or one input layer
◦ Weights and bias
◦ Net sum
◦ Activation function
Working of Perceptron
• Each input x is multiplied by its weight w; call each product k
• All the multiplied values are added together to give the Weighted Sum
• The weighted sum is passed through the chosen Activation Function
Perceptron or Single-layer Perceptron

• Perceptron is usually used to classify the data into two parts
• Therefore, it is also known as a Linear Binary Classifier
• It is trained by supervised learning
Multi Layer Perceptron (MLP)

• It is a neural network where the mapping between inputs and output is non-linear
• An MLP has input and output layers, and one or more hidden layers with many neurons stacked together
Multi Layer Perceptron (MLP)
• A type of feed-forward artificial neural network that generates a set of outputs from a set of inputs
• An MLP is a neural network connecting multiple layers in a directed graph, which means that the signal path through the nodes goes one way
• The MLP network consists of input, output, and hidden layers. Each hidden layer consists of numerous perceptrons, which are called hidden units
[Figure: Perceptrons have no hidden layers; multilayer perceptrons may have many]
Learning Power of an ANN
Perceptron is guaranteed to converge if data is linearly separable
◦ It will learn a hyperplane that separates the classes

A multilayer ANN has no guarantee of convergence but can learn functions that are not linearly separable
Multi Layer Perceptron (MLP)
• Each layer feeds the next one with the result of its computation, its internal representation of the data
• This goes all the way through the hidden layers to the output layer
• If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it wouldn’t be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning
• This is where Backpropagation comes into play
Perceptron Training
Assume supervised training examples giving the desired output for a unit given a set of known input activations.
Goal: learn the weight vector (synaptic weights) that causes the perceptron to produce the correct +/-1 values
Perceptron uses an iterative update algorithm to learn a correct set of weights
• Perceptron training rule
• Delta rule (Not in Syllabus)
Both algorithms are guaranteed to converge, under somewhat different conditions
Perceptron Training Rule
Update weights by:
$w_i \leftarrow w_i + \Delta w_i$
$\Delta w_i = \eta \, (t - o) \, x_i$
(x_i is the input associated with weight w_i)

where
η is the learning rate (written as α elsewhere in these slides)
◦ a small value (e.g., 0.1)
◦ sometimes decays as the number of weight-tuning operations increases

t – target output for the current training example


o – linear unit output for the current training example
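The update rule can be tried out on a small linearly separable problem. This is a minimal sketch, assuming bipolar inputs and targets, zero initial weights, a thresholded output o of +1 or -1, and a bias updated like a weight on a constant input of 1; the names `step` and `train_perceptron` and the AND data are illustrative assumptions.

```python
def step(net):
    """Bipolar threshold output o: +1 when the net input is positive, else -1."""
    return 1 if net > 0 else -1

def train_perceptron(samples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no errors."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, t in samples:
            o = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                b += eta * (t - o)  # bias treated as a weight on a constant input of 1
        if errors == 0:
            return w, b, epoch
    return w, b, max_epochs

# Bipolar AND is linearly separable, so training is expected to converge
and_samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
w, b, epochs = train_perceptron(and_samples)
outputs = [step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in and_samples]
print(w, b, epochs, outputs)
```

When the output already matches the target, (t - o) is zero and the weights are left unchanged, which is the "if output is correct do nothing" case.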
Perceptron Training Rule
Equivalent to rules:
◦ If output is correct do nothing.
◦ If output is high, lower weights on active inputs
◦ If output is low, increase weights on active inputs
Can prove it will converge
◦ if training data is linearly separable and η is small
Works reasonably well when training data is not linearly separable

Perceptron Learning

Each execution of the outer loop is typically called an epoch.


Perceptron Learning Rule
• Consider a finite number N of input training vectors x(n) with their associated target (desired) values t(n), where n ranges from 1 to N.
• The target is either +1 or -1.
• The output "y" is obtained by applying the activation function to the calculated net input.
Example1: Perceptron for AND
Example-2
FeedForward Network

1.1×0.3+2.6×1.0=2.93
Neural Network Representation

There are ______ layer(s) in this NN

• Remember that while counting the number of layers in a NN, we do not count the input layer
• So, there are 2 layers in the NN shown above, i.e., one hidden layer and one output layer
Neural Network Representation
• The input layer is referred to as a[0], the hidden layer as a[1], and the final layer as a[2]
• Here ‘a’ stands for activations, which are the values that the different layers of a neural network pass on to the next layer
Neural Network Representation
• The corresponding parameters are w[1], b[1] and w[2], b[2]
• w[1] = (4,3) matrix, because the hidden layer has 4 units and the input layer has 3 units
• b[1] = (4,1) vector
• w[2] = (1,4) matrix, because the output layer has 1 unit and the hidden layer has 4 units
• b[2] = (1,1)
Neural Network Representation
• noOfHiddenNeurons = 4
• Nx = 3
• Shapes of the variables:
o W1 is the weight matrix of the hidden layer, it has a shape of (noOfHiddenNeurons,nx)
o b1 is the bias vector of the hidden layer, it has a shape of (noOfHiddenNeurons,1)
o z1 is the result of the equation z1 = W1*X + b1, it has a shape of (noOfHiddenNeurons,1)
o a1 is the result of the equation a1 = sigmoid(z1), it has a shape of (noOfHiddenNeurons,1)
o W2 is the weight matrix of the output layer, it has a shape of (1,noOfHiddenNeurons)
o b2 is the bias of the output layer, it has a shape of (1,1)
o z2 is the result of the equation z2 = W2*a1 + b2, it has a shape of (1,1)
o a2 is the result of the equation a2 = sigmoid(z2), it has a shape of (1,1)
Computing a Neural Network's Output

• Let’s look in detail at how each neuron of a neural network works
• Each neuron takes its inputs, performs an operation on them (calculates z = wTX + b), and then applies the sigmoid function
Computing a Neural Network's Output
The equations for the first hidden layer with four neurons will be:

For given input X, the outputs for each neuron will be:
z[1] = W[1]x + b[1]
a[1] = 𝛔(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = 𝛔(z[2])
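The four equations can be traced with a small forward pass for the 3-4-1 network described in these slides. This is a sketch in plain Python lists rather than a matrix library; the helper name `dense`, the random weight values and the sample input are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, a_prev, b):
    """Compute sigmoid(W a_prev + b) for one layer; W is a list of rows."""
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + bi)
            for row, bi in zip(W, b)]

random.seed(0)
n_x, n_h, n_y = 3, 4, 1  # layer sizes used in the slides

# Hypothetical parameter values; only the shapes follow the slides
W1 = [[random.uniform(-1, 1) for _ in range(n_x)] for _ in range(n_h)]  # shape (4, 3)
b1 = [0.0] * n_h                                                        # shape (4, 1)
W2 = [[random.uniform(-1, 1) for _ in range(n_h)] for _ in range(n_y)]  # shape (1, 4)
b2 = [0.0] * n_y                                                        # shape (1, 1)

x = [0.5, -1.0, 2.0]   # hypothetical input vector
a1 = dense(W1, x, b1)  # z[1] = W[1]x + b[1], a[1] = sigmoid(z[1])
a2 = dense(W2, a1, b2) # z[2] = W[2]a[1] + b[2], a[2] = sigmoid(z[2])
print(len(a1), len(a2), a2)
```

The second layer takes a[1], not x, as its input, which is why the shapes (1,4) and (4,1) line up.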
Backpropagation

Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function


Architecture of a back-propagation network
Backpropagation
• In each iteration, after the weighted sums are forwarded through all layers, the gradient of the cost (for example, the Mean Squared Error) is computed across all input and output pairs
• Then, to propagate it back, the weights of each layer, from the output layer back to the first hidden layer, are updated using the value of the gradient
• That’s how the error is propagated back to the starting point of the neural network!
Backpropagation
NN parameters:
n[0] = Nx = 3
n[1] = NoOfHiddenNeurons = 4
n[2] = NoOfOutputNeurons = 1
W[1] shape is (n[1],n[0])
b[1] shape is (n[1],1)
W[2] shape is (n[2],n[1])
b[2] shape is (n[2],1)

Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

where $\hat{y}$ is the prediction a[2]


Gradient Descent
Gradient descent: initialize the parameters randomly rather than to all zeros

Repeat:
Compute predictions (y'[i], i = 1,...m)
Get derivatives: dW[1], db[1], dW[2], db[2]
Update:
W[1] = W[1] - α * dW[1]
b[1] = b[1] - α * db[1]
W[2] = W[2] - α * dW[2]
b[2] = b[2] - α * db[2]

where $dW^{[1]} = \frac{\partial J}{\partial W^{[1]}}$ and $db^{[1]} = \frac{\partial J}{\partial b^{[1]}}$, and likewise for layer 2
Gradient Descent
Forward propagation:

Z[1] = W[1]A[0] + b[1] # A[0] is X
A[1] = g[1] (Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = g[2] (Z[2]) = Sigmoid(Z[2]) # Sigmoid because the output is between 0 and 1
Gradient Descent
Backpropagation (derivations):

dZ[2] = A[2] - Y # derivative of the cost function we used * derivative of the sigmoid function
dW[2] = (dZ[2] * A[1].T) / m
db[2] = Sum(dZ[2]) / m
dZ[1] = (W[2].T * dZ[2]) * g'[1] (Z[1]) # the second * is an element-wise product
dW[1] = (dZ[1] * A[0].T) / m # A0 = X
db[1] = Sum(dZ[1]) / m
# Hint: the transposes appear in the multiplications to keep the dimensions correct
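The forward pass, the derivative formulas and the update step can be combined into one small training loop. This is a sketch under assumptions: a 2-2-1 network, a single training example (m = 1), sigmoid as the hidden activation g[1], cross-entropy as the loss, and initial weights that are made up for illustration. Plain lists are used so each term stays visible; W2 is a flat list because there is one output unit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation: returns hidden activations a1 and output a2."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, a2

def loss(a2, y):
    """Cross-entropy loss, the cost for which dZ[2] = A[2] - Y holds."""
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def gradient_step(x, y, W1, b1, W2, b2, alpha):
    """One gradient descent update for a single example (m = 1)."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    dz2 = a2 - y
    dW2 = [dz2 * a for a in a1]
    db2 = dz2
    # g'(z) = a * (1 - a) when the hidden activation is sigmoid
    dz1 = [w * dz2 * a * (1 - a) for w, a in zip(W2, a1)]
    dW1 = [[d * xi for xi in x] for d in dz1]
    db1 = dz1
    W1 = [[w - alpha * g for w, g in zip(row, grow)] for row, grow in zip(W1, dW1)]
    b1 = [b - alpha * g for b, g in zip(b1, db1)]
    W2 = [w - alpha * g for w, g in zip(W2, dW2)]
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

# Hypothetical starting point: input pattern, target and made-up initial weights
x, y, alpha = [0.0, 1.0], 1.0, 0.25
W1, b1 = [[0.1, 0.2], [-0.2, 0.3]], [0.0, 0.0]
W2, b2 = [0.3, -0.1], 0.0

loss_before = loss(forward(x, W1, b1, W2, b2)[1], y)
for _ in range(10):
    W1, b1, W2, b2 = gradient_step(x, y, W1, b1, W2, b2, alpha)
loss_after = loss(forward(x, W1, b1, W2, b2)[1], y)
print(loss_before, loss_after)
```

The loss printed after ten updates should be lower than the starting loss, which is the behaviour gradient descent is meant to produce.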
Random Initialization
• We have previously seen that the weights can be initialized to 0 for logistic regression
• For logistic regression, it was okay to initialize the weights to 0 because it doesn’t have any hidden layer
• But should we initialize the weights of a neural network to 0?
Random Initialization
If the weights are initialized to 0, the W matrix will be:

Using these weights: a1[1] = a2[1]

Identical or Symmetric because both of these hidden units are computing exactly the
same function
Random Initialization
When we compute backpropagation:
dZ1[1] = dZ2[1]

Identical, because the outgoing weights are also equal:

W[2] = [0 0]

No matter how many hidden units we use in a layer, they all compute the same output, which is equivalent to using a single unit
So, instead of initializing the weights to 0, we randomly initialize them
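The symmetry argument can be checked numerically. This sketch compares the hidden activations of a two-unit layer under zero initialization and under small random initialization; the helper names, the 0.01 scale, the seed and the input vector are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(W1, b1, x):
    """a[1] = sigmoid(W[1] x + b[1]) for a layer given as a list of rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]

def init_weights(n_rows, n_cols, scale=0.01):
    """Small random weights; the 0.01 scale is a common choice, assumed here."""
    return [[random.gauss(0.0, 1.0) * scale for _ in range(n_cols)] for _ in range(n_rows)]

x = [1.0, 2.0]  # hypothetical input

# Zero initialization: both hidden units compute exactly the same value
zero_W1 = [[0.0, 0.0], [0.0, 0.0]]
a_zero = hidden_activations(zero_W1, [0.0, 0.0], x)

# Random initialization: the two units start out different
random.seed(42)
rand_W1 = init_weights(2, 2)
a_rand = hidden_activations(rand_W1, [0.0, 0.0], x)
print(a_zero, a_rand)
```

With zero weights both units output 0.5 for any input, so their gradients are also identical and they can never diverge during training.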
Back-propagation network training
Example-1

Using a back-propagation network, find the new weights for the net shown in the figure. It is presented with the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal activation function.
Example-1

Initial Weights:

Binary Sigmoidal Activation function:


The solved examples of Perceptron and Back-propagation are taken from the book “Principles of Soft Computing” by Sivanandam and Deepa. Explore the first 3 chapters of this book for more details on Neural Networks.
Machine Learning vs Deep Learning
• DL is a newer area of ML that uses multi-layered artificial neural networks to deliver high accuracy in tasks such as:
◦ intrusion detection
◦ object detection
◦ speech recognition
◦ language translation

• DL can automatically learn/extract/translate the features from data sets such as:
◦ Images
◦ Video
◦ Text
• Images
• Video
• Text
Image Source: https://www.xenonstack.com/blog/data-science/log-analytics-deep-machine-learning-ai/
Why is Deep Learning taking off?
Source: https://www.coursera.org/learn/neural-networks-deep-learning/lecture/praGm/why-is-deep-learning-taking-off
Example of Deep Learning

Network Intrusion detection:
• Intrusion Detection System
• Intrusion Prevention System
• Next-Generation Firewall
Steps for Implementation
1. Data Preprocessing
2. Feature Extraction
3. Feature Selection
4. Implement Machine learning model
5. Training and testing (if supervised model)
6. Calculate Parameters
Evaluation – Model Training
While the parameters of each model may differ, there are several methods to train a model.
◦ We want to avoid overfitting a model and maximize its predictive power.

There are two standard methods for training a model:


◦ Hold-out – reserve 2/3 of data for training and 1/3 for testing
◦ Cross-Validation – partition data into k disjoint subsets, train on k-1 partitions, test on remaining &
run k times to get average, e.g., 10-fold validation is commonly used.
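Both evaluation schemes come down to how the example indices are partitioned. The following is a minimal sketch using only the standard library; the function names `holdout_split` and `k_fold_splits`, the shuffling and the seed are assumptions, and tools such as WEKA or RapidMiner handle this internally.

```python
import random

def holdout_split(n, train_fraction=2/3, seed=0):
    """Shuffle indices 0..n-1 and reserve train_fraction of them for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

train_idx, test_idx = holdout_split(30)
print(len(train_idx), len(test_idx))

splits = list(k_fold_splits(25, k=5))
for train, test in splits:
    print(len(train), len(test))
```

In cross-validation the model would be trained and scored once per yielded pair, and the k scores averaged.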

Many software packages (e.g., WEKA, RapidMiner) will apply these methods automatically for you.
Evaluation
There are several questions we should ask after model training:
◦ How predictive is the model we learned?
◦ How reliable and accurate are the predicted results?
◦ Which model performs better?

We want our model to perform well on our training set but also have strong predictive power.

Fortunately, various metrics applied on the testing set can help us choose the “best” model for
our application.
Metrics for Performance Evaluation
A Confusion Matrix provides the counts needed to compute a model’s accuracy:
◦ True Positives (TP) – # of positive examples correctly predicted by the model
◦ False Negative (FN) – # of positive examples wrongly predicted as negative by the model
◦ False Positive (FP) – # of negative examples wrongly predicted as positive by the model
◦ True Negative (TN) – # of negative examples correctly predicted by the model
Example(s)
True Positives (TP) -> An instance for which both predicted and actual values are positive
E.g., Images that are cats and are predicted as cat
True Negatives (TN) -> An instance for which both predicted and actual values are negative
E.g., Images that are not cats and are predicted as not-cat
False Positives (FP) -> An instance for which the predicted value is positive but the actual value is negative
E.g., Images that are not cats but are predicted as cat
False Negatives (FN) -> An instance for which the predicted value is negative but the actual value is positive
E.g., Images that are cats but are predicted as not-cat
Metrics for Performance Evaluation
Precision = $\frac{TP}{TP + FP}$ (1)
Precision -> How precise our model is: out of those predicted positive, how many of them are actually positive

Recall = $\frac{TP}{TP + FN}$ (2)
Recall -> How many of the actual positives the ML model captures by labeling them as positive

Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$ (3)
Accuracy -> Percentage of correct predictions made by the ML model

F-Score = $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ (4)
F-Score is the harmonic mean of precision and recall
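The four metrics follow directly from the confusion matrix counts. This is a small sketch; the function name and the example counts are made up for illustration, and it does not guard against empty denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F-score from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Hypothetical counts for 100 test examples
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this example the accuracy is 0.70 while the recall is only about 0.67, showing how the metrics can tell different stories about the same predictions.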
Metrics for Performance Evaluation
However, accuracy can be skewed due to a class imbalance.

Other measures are better indicators for model performance.

Metric | Description | Calculation
Precision | Exactness: % of tuples the classifier labeled as positive that are actually positive | TP / (TP + FP)
Recall | Completeness: % of positive tuples the classifier actually labeled as positive | TP / (TP + FN)
F-Measure | Harmonic mean of precision and recall | (2 * Recall * Precision) / (Recall + Precision)
Metrics for Performance Evaluation
Models can also be compared visually using a Receiver Operating Characteristic
(ROC) curve.

An ROC curve characterizes the trade-off between TP and FP rates.


◦ TP rate is plotted on the y-axis against FP rate on the x-axis
◦ Stronger models will generally have more Area Under the ROC curve (AUC).
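An ROC curve can be traced by lowering the decision threshold over the model's scores and recording the FP rate and TP rate at each step; the AUC is then the area under those points. The sketch below uses the trapezoidal rule; the function names, scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all examples that share the same score together
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores with true labels (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

A model that ranks every positive above every negative would reach an AUC of 1.0, while random scoring tends toward 0.5.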
