Artificial Neural Networks

Artificial Neural Network
• Neural Networks (NN) are based on the structure of the biological neural system, which consists of several connected elements named neurons
• Stimuli from the external environment or inputs from sensory organs are accepted by dendrites
• These inputs create electric impulses, which quickly travel through the neural network
• Neurons get signals from dendrites and pass them to the next neurons
Artificial Neural Network (Contd.)
• Neural networks are composed of multiple nodes, which imitate the biological neurons of the human brain
• The nodes are neurons
• The nodes can take input data and perform simple operations on the data
• The result of these operations is passed to other neurons
• The output at each node is called its activation or node value
• A network is organized into an input layer, hidden layer(s), and an output layer
• Layer(s) present between the input and output layers are known as hidden layers. They are generally treated as a black box where all calculations happen, and the calculated values are then sent to the output layer
• Each layer extracts some information related to the features and forwards it with a weight to the next layer
• The output is the sum of all these pieces of information multiplied by their related weights
An artificial neuron in a hidden layer applies an association (weighted sum) followed by an activation function:
Association: $Z = W^{T}x + b$
Activation: $a = \sigma(Z)$
Where,
W is weight, b is bias
x is the input, T is vector transpose
Hidden Layer (inputs x1, x2, x3; output y)
Node 1: $Z_1^{[1]} = W_1^{[1]T} x^{(i)} + b_1^{[1]}$, $a_1^{[1]} = \sigma(Z_1^{[1]})$
Node 2: $Z_2^{[1]} = W_2^{[1]T} x^{(i)} + b_2^{[1]}$, $a_2^{[1]} = \sigma(Z_2^{[1]})$
Where, [1] is layer 1 (hidden layer)
• Weights show the strength of the particular node
• A bias value allows you to shift the activation function curve left or right along the input axis
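The association and activation steps of a single neuron can be sketched as follows. This is a minimal illustration, not part of the original slides: the weight, input, and bias values are hypothetical, and the helper names `sigmoid` and `neuron` are introduced only for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b):
    """Association Z = w^T x + b followed by activation a = sigmoid(Z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

# Hypothetical values, chosen only for illustration.
w = [0.5, -0.25, 0.1]   # one weight per input
x = [1.0, 2.0, 3.0]     # inputs x1, x2, x3
b = -0.3                # bias

z, a = neuron(w, x, b)
print("Z =", z, "a =", a)
```

Here the weighted sum is approximately zero, so the activation is approximately 0.5, the midpoint of the sigmoid.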
Neural Networks Overview
In logistic regression we had:
X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /
In a neural network with one hidden layer we will have:
X1 \
X2 => z[1] = W[1] X + b[1] => a[1] = Sigmoid(z[1]) => z[2] = W[2] a[1] + b[2] => a[2] = Sigmoid(z[2]) => L(a[2],Y)
X3 /
X is the input vector (X1, X2, X3), and Y is the output variable (1x1)
[1] ---> refers to Layer 1
[2] ---> refers to Layer 2
A NN is a stack of logistic regression units

Deep Neural Network
• Deep Neural Networks are complex ANNs with more than one hidden layer
• Each layer forwards its generated output to the next layer, and the last layer is responsible for providing the final value after computation
• For example, output 0 means that it is a normal application while output 1 means it is malicious
Activation Functions
1) Sigmoid or Logistic Activation Function
• The Sigmoid function curve is S-shaped
• Sigmoid: $a = \frac{1}{1 + e^{-z}}$
• Its output lies between 0 and 1
• Therefore, it is used in the output layer for binary classification

Activation Functions (Contd.)
2) Tanh or Hyperbolic Tangent Activation Function
• Tanh usually works better than sigmoid in hidden layers because its values lie between -1 and +1, so the mean of the activations coming out of a hidden layer is closer to zero rather than 0.5
• Tanh: $a = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• The range of the Tanh function is (-1, 1)
• The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero in the Tanh graph
• Therefore, it is used in hidden layers
• It is not used in the output layer for binary classification, because y is either 0 or 1 rather than between -1 and 1
3) ReLU (Rectified Linear Unit) Activation Function
• ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero and f(z) is equal to z when z is greater than or equal to zero
• ReLU: a = max(0, z)
• The range of the ReLU function is from 0 to infinity
• Used in hidden layers
• A neural network usually learns much faster with ReLU than with Sigmoid or Tanh
• Drawback: any negative input is turned into zero immediately, so negative values are not mapped in the graph. This decreases the ability of the model to fit or train from the data properly (the dying ReLU problem)
4) Leaky ReLU Activation Function
• Introduced to solve the dying ReLU problem
• The leak helps to increase the range of the ReLU function
• Leaky ReLU: a = max(0.01 z, z)
• Usually, the slope applied to negative inputs is 0.01 or so
• When the slope is drawn at random rather than fixed at 0.01, the function is called Randomized ReLU
• The range of the Leaky ReLU is (-infinity to infinity)
• Used in hidden layers
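The four activation functions above can be written directly from their formulas. The sketch below is illustrative only: the function names and the `slope` parameter are naming assumptions made for this example, and the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """a = 1 / (1 + e^(-z)); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """a = (e^z - e^(-z)) / (e^z + e^(-z)); output in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """a = max(0, z); negative inputs become zero."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """a = max(slope * z, z); negative inputs keep a small slope."""
    return max(slope * z, z)

# Compare the four functions on a few arbitrary inputs.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Note how ReLU returns zero for the negative input while Leaky ReLU returns a small negative value, which is the "leak" described above.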

Evolution of Neural Networks
Year | Neural Network | Designer | Description
1943 | McCulloch-Pitts neuron | McCulloch and Pitts | Arrangement of neurons is a combination of logic gates. Unique feature is the threshold
1949 | Hebb network | Hebb | If two neurons are active, then their connection strengths should be increased
1958-1988 | Perceptron | Frank Rosenblatt, Block, Minsky and Papert | Weights of path can be adjusted
1960 | Adaline | Widrow and Hoff | The weights are adjusted to reduce the difference between the net input to the output unit and the desired output
1972 | Kohonen self-organizing feature map | Kohonen | Inputs are clustered to obtain a fired output neuron
1982 | Hopfield network | John Hopfield | Based on fixed weights. Can act as associative memory nets
1986 | Back-propagation network | Rumelhart | Multilayered. Error propagated backward from output to the hidden units
1987-90 | ART | Carpenter and Grossberg | Used for both binary and analog inputs
Hebb Network
The Hebb rule is used for pattern association, pattern categorization, pattern classification and over a range of other areas.
The activation function of the input layer is the identity function: xi = si for i = 1 to n, where s:t is the input training vector and target output pair
Example: Designing a Hebb network to implement AND function
XOR using Hebb Network
No decision boundary
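The worked figures for these two examples did not survive, so the following is a minimal sketch rather than the slides' own solution. It assumes the standard Hebb update (each weight grows by input times target, the bias grows by the target), bipolar (+1/-1) inputs and targets, and zero initial weights; the function name `train_hebb` is hypothetical.

```python
def train_hebb(samples):
    """One pass of the Hebb rule: w_i += x_i * t and b += t for every sample."""
    n_inputs = len(samples[0][0])
    w = [0] * n_inputs
    b = 0
    for x, t in samples:
        for i in range(n_inputs):
            w[i] += x[i] * t
        b += t
    return w, b

# Bipolar truth tables (assumed encoding): ((x1, x2), target)
AND_SAMPLES = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
XOR_SAMPLES = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

and_w, and_b = train_hebb(AND_SAMPLES)
xor_w, xor_b = train_hebb(XOR_SAMPLES)

print("AND:", and_w, and_b)   # weights define the line 2*x1 + 2*x2 - 2 = 0
print("XOR:", xor_w, xor_b)   # every weight cancels out to zero
```

Under these assumptions the AND weights give a separating line, while for XOR all updates cancel and the weights return to zero, which is consistent with the "no decision boundary" remark above.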
Exercise
Exercise –Solution Notes
Perceptron or Single-layer Perceptron
• A single-layer perceptron has just two layers: input and output
• It has only a single layer of computing units (the input layer is not counted), hence the name single-layer perceptron
• It does not contain hidden layers, unlike a multilayer perceptron
Perceptrons
A perceptron is a type of ANN that can be seen as the simplest kind of feedforward neural network: a linear classifier
Introduced in the late 1950s
Perceptron convergence theorem (Rosenblatt 1962):
◦ A perceptron will learn to classify any linearly separable set of inputs.
A perceptron is a network that is:
– single-layer
– feed-forward: data only travels in one direction
[Figure: XOR function (no linear separation)]
• A perceptron is a single-layer neural network, whereas a multi-layer perceptron is what is usually called a neural network
• The perceptron consists of 4 parts:
  ◦ Input values or one input layer
  ◦ Weights and bias
  ◦ Net sum
  ◦ Activation function
Working of Perceptron
• All the inputs x are multiplied by their weights w. Let's call each product k
• Add all the multiplied values and call the result the Weighted Sum
• Apply the appropriate Activation Function to that weighted sum
• A perceptron is usually used to classify data into two parts
• Therefore, it is also known as a Linear Binary Classifier
• It is trained with supervised learning
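The working steps above (multiply, sum, activate) can be sketched as below. This is an illustration only: the step activation with outputs +1/-1, the weights, and the bias are assumptions chosen for the example, and the function names are hypothetical.

```python
def weighted_sum(x, w, b):
    """Multiply each input by its weight (the products the slide calls k), then add them and the bias."""
    products = [xi * wi for xi, wi in zip(x, w)]
    return sum(products) + b

def step(net):
    """Assumed activation: +1 if the weighted sum is non-negative, otherwise -1."""
    return 1 if net >= 0 else -1

def perceptron_output(x, w, b):
    return step(weighted_sum(x, w, b))

# Hypothetical weights and bias that split binary inputs into two classes.
w = [1.0, 1.0]
b = -1.5

print(perceptron_output([1, 1], w, b))  # weighted sum 0.5
print(perceptron_output([0, 1], w, b))  # weighted sum -0.5
```

With these assumed values only the input [1, 1] lands on the positive side of the line, so the unit behaves as a linear binary classifier.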
Multi Layer Perceptron (MLP)
• It is a neural network where the mapping between inputs and output is non-linear
• An MLP has input and output layers, and one or more hidden layers with many neurons stacked together
• It is a type of feed-forward artificial neural network that generates a set of outputs from a set of inputs
• An MLP connects multiple layers in a directed graph, which means that the signal path through the nodes goes only one way
• The MLP network consists of input, output, and hidden layers. Each hidden layer consists of numerous perceptrons, which are called hidden units
• Perceptrons have no hidden layers, whereas multilayer perceptrons may have many
Learning Power of an ANN
A perceptron is guaranteed to converge if the data is linearly separable
◦ It will learn a hyperplane that separates the classes
A multilayer ANN has no guarantee of convergence but can learn functions that are not linearly separable
• Each layer feeds the next one with the result of its computation, its internal representation of the data
• This goes all the way through the hidden layers to the output layer
• If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it would not be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning
• This is where Backpropagation comes into play
Perceptron Training
Assume supervised training examples giving the desired output for a unit given a set of known input activations.
Goal: learn the weight vector (synaptic weights) that causes the perceptron to produce the correct +/-1 values
The perceptron uses an iterative update algorithm to learn a correct set of weights
• Perceptron training rule
• Delta rule (Not in Syllabus)
Both algorithms are guaranteed to converge under somewhat different conditions
Perceptron Training Rule
Update weights by:
$w_i \leftarrow w_i + \Delta w_i$
$\Delta w_i = \eta (t - o) x_i$
where
η is the learning rate (written as α elsewhere in these slides)
◦ a small value (e.g., 0.1)
◦ sometimes decays as the number of weight-tuning operations increases
t – target output for the current training example
o – output of the unit for the current training example
x_i – the i-th input of the current training example
Perceptron Training Rule
Equivalent to rules:
◦ If output is correct do nothing.
◦ If output is high, lower weights on active inputs
◦ If output is low, increase weights on active inputs
It can be proven to converge
◦ if the training data is linearly separable and η is sufficiently small
It works reasonably well even when the training data is not linearly separable
Perceptron Learning
Each execution of the outer loop is typically called an epoch.

Perceptron Learning Rule
• Consider a finite number N of input training vectors x(n) with their associated target (desired) values t(n), where n ranges from 1 to N.
• The target is either +1 or -1.
• The output y is obtained from the net input by applying the activation function to it.
Example1: Perceptron for AND
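The slide's worked solution is not recoverable, so here is a hedged sketch of how the perceptron training rule could be applied to AND. It assumes bipolar inputs and targets, zero initial weights, a learning rate of 0.1, a bias treated as a weight on a constant input of 1, and an output of +1 only when the net input is positive. `train_perceptron` is a hypothetical helper name.

```python
def train_perceptron(samples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no errors."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    for epoch in range(1, max_epochs + 1):   # each pass of this outer loop is one epoch
        errors = 0
        for x, t in samples:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b
            o = 1 if net > 0 else -1
            if o != t:
                errors += 1
                for i in range(n_inputs):
                    w[i] += eta * (t - o) * x[i]
                b += eta * (t - o)           # bias acts as a weight on a constant input 1
        if errors == 0:
            return w, b, epoch
    return w, b, max_epochs

# Bipolar AND truth table (assumed encoding): ((x1, x2), target)
AND_SAMPLES = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

w, b, epochs = train_perceptron(AND_SAMPLES)
print("weights:", w, "bias:", b, "epochs:", epochs)
```

Under these assumptions the first epoch makes three corrections and the second epoch makes none, so training stops after two epochs.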
Example-2
FeedForward Network
1.1×0.3+2.6×1.0=2.93
Neural Network Representation
There are ______ layer(s) in this NN
• Remember that while counting the number of layers in a NN, we do not count the input layer
• So, there are 2 layers in the NN shown above, i.e., one hidden layer and one output layer
• The input layer is referred to as a[0], the hidden layer as a[1], and the final (output) layer as a[2]
• Here 'a' stands for activations, which are the values that the different layers of a neural network pass on to the next layer
• The corresponding parameters are w[1], b[1] and w[2], b[2]
• w[1] = (4,3) matrix
• Hidden layer has 4 units and input layer has 3 units
• b[1] = (4,1) vector
• w[2] = (1,4)
• Output layer has 1 unit and hidden layer has 4 units
• b[2] = (1,1)
• noOfHiddenNeurons = 4
• Nx = 3
• Shapes of the variables:
o W1 is the weight matrix of the hidden layer, it has a shape of (noOfHiddenNeurons,nx)
o b1 is the bias vector of the hidden layer, it has a shape of (noOfHiddenNeurons,1)
o z1 is the result of the equation z1 = W1*X + b1, it has a shape of (noOfHiddenNeurons,1)
o a1 is the result of the equation a1 = sigmoid(z1), it has a shape of (noOfHiddenNeurons,1)
o W2 is the weight matrix of the output layer, it has a shape of (1,noOfHiddenNeurons)
o b2 is the bias of the output layer, it has a shape of (1,1)
o z2 is the result of the equation z2 = W2*a1 + b2, it has a shape of (1,1)
o a2 is the result of the equation a2 = sigmoid(z2), it has a shape of (1,1)
Computing a Neural Network's Output
• Let's look in detail at how each neuron of a neural network works
• Each neuron takes an input, performs some operation on it (calculates z = wTx + b), and then applies the sigmoid function
For a given input x, the equations for the hidden layer with four neurons and for the output layer will be:
z[1] = W[1]x + b[1]
a[1] = σ(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = σ(z[2])
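As a rough sketch of this computation for the 3-4-1 network described earlier, the code below runs one forward pass in plain Python. All parameter values and the input are hypothetical, and `dense` is a helper name introduced only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, a_prev):
    """One layer: z = W a_prev + b, then a = sigmoid(z). W is a list of rows."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]

# Hypothetical parameters with the shapes from the slides.
W1 = [[0.1, 0.2, 0.3],      # shape (4, 3): 4 hidden units, 3 inputs
      [-0.1, 0.0, 0.1],
      [0.5, -0.5, 0.0],
      [0.2, 0.2, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]   # shape (4, 1)
W2 = [[0.3, -0.2, 0.1, 0.4]]  # shape (1, 4): 1 output unit, 4 hidden units
b2 = [0.0]                  # shape (1, 1)

x = [1.0, 0.5, -1.0]        # hypothetical input (x1, x2, x3)

a1 = dense(W1, b1, x)       # a[1] = sigmoid(W[1] x + b[1])
a2 = dense(W2, b2, a1)      # a[2] = sigmoid(W[2] a[1] + b[2])
print("a[1] =", a1)
print("a[2] =", a2)
```

The output layer takes a[1], not x, as its input, which is why the second call receives `a1`.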
Backpropagation
Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function
[Figure: Architecture of a back-propagation network]
• In each iteration, after the weighted sums are forwarded through all layers, the gradient of the cost function (for example, the Mean Squared Error) is computed across all input and output pairs
• Then, to propagate it back, the weights are updated layer by layer, from the output layer back to the first hidden layer, using the value of the gradient
• That's how the error is propagated back to the starting point of the neural network!
NN parameters:
n[0] = Nx = 3
n[1] = NoOfHiddenNeurons = 4
n[2] = NoOfOutputNeurons = 1
W[1] shape is (n[1],n[0])
b[1] shape is (n[1],1)
W[2] shape is (n[2],n[1])
b[2] shape is (n[2],1)
Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$
where ŷ is the prediction a[2]

Gradient Descent
Gradient descent: initialize the parameters randomly rather than to all zeros
Repeat:
Compute predictions (y'[i], i = 1,...,m)
Get derivatives: dW[1], db[1], dW[2], db[2], where $dW^{[1]} = \frac{\partial J}{\partial W^{[1]}}$ and $db^{[1]} = \frac{\partial J}{\partial b^{[1]}}$ (likewise for layer 2)
Update:
W[1] = W[1] - α * dW[1]
b[1] = b[1] - α * db[1]
W[2] = W[2] - α * dW[2]
b[2] = b[2] - α * db[2]
Forward propagation:
Z[1] = W[1]A[0] + b[1] # A[0] is X
A[1] = g[1](Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = g[2](Z[2]) = Sigmoid(Z[2]) # Sigmoid because the output is between 0 and 1
Backpropagation (derivations):
dZ[2] = A[2] - Y # combines the derivative of the cross-entropy cost with the derivative of the sigmoid function
dW[2] = (dZ[2] A[1].T) / m
db[2] = Sum(dZ[2]) / m
dZ[1] = (W[2].T dZ[2]) * g'[1](Z[1]) # * is the element-wise product
dW[1] = (dZ[1] A[0].T) / m # A[0] = X
db[1] = Sum(dZ[1]) / m
# Hint: the transposes are there to keep the matrix dimensions consistent
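The forward and backward equations can be combined into a small training loop. The sketch below is one possible reading, not the slides' implementation: it assumes a sigmoid hidden activation (so g'(Z) = A(1 - A)), the cross-entropy cost, a learning rate of 0.5, and a tiny made-up data set. The output layer has one unit, so W2 is stored as a flat list and b2 as a scalar.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass for one example: returns hidden activations a1 and output a2."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, a2

def cost(X, Y, params):
    """Average cross-entropy loss over all m examples."""
    total = 0.0
    for x, y in zip(X, Y):
        _, a2 = forward(x, *params)
        total += -(y * math.log(a2) + (1 - y) * math.log(1 - a2))
    return total / len(X)

def gradient_step(X, Y, params, alpha):
    """One gradient descent update using the backpropagation equations."""
    W1, b1, W2, b2 = params
    m, n1, n0 = len(X), len(W1), len(W1[0])
    dW1 = [[0.0] * n0 for _ in range(n1)]
    db1 = [0.0] * n1
    dW2 = [0.0] * n1
    db2 = 0.0
    for x, y in zip(X, Y):
        a1, a2 = forward(x, W1, b1, W2, b2)
        dz2 = a2 - y                                 # dZ[2] = A[2] - Y
        db2 += dz2 / m                               # db[2] = Sum(dZ[2]) / m
        for j in range(n1):
            dW2[j] += dz2 * a1[j] / m                # dW[2] = dZ[2] A[1].T / m
            dz1 = W2[j] * dz2 * a1[j] * (1 - a1[j])  # dZ[1] = W[2].T dZ[2] * g'(Z[1])
            db1[j] += dz1 / m                        # db[1] = Sum(dZ[1]) / m
            for i in range(n0):
                dW1[j][i] += dz1 * x[i] / m          # dW[1] = dZ[1] A[0].T / m
    new_W1 = [[W1[j][i] - alpha * dW1[j][i] for i in range(n0)] for j in range(n1)]
    new_b1 = [b1[j] - alpha * db1[j] for j in range(n1)]
    new_W2 = [W2[j] - alpha * dW2[j] for j in range(n1)]
    new_b2 = b2 - alpha * db2
    return new_W1, new_b1, new_W2, new_b2

random.seed(0)                                       # fixed seed for repeatability
n0, n1 = 3, 4                                        # n[0] = 3 inputs, n[1] = 4 hidden units
params = (
    [[random.uniform(-0.5, 0.5) for _ in range(n0)] for _ in range(n1)],
    [0.0] * n1,
    [random.uniform(-0.5, 0.5) for _ in range(n1)],
    0.0,
)

# Made-up data: the label simply equals the first feature.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
Y = [0, 0, 1, 1]

initial_cost = cost(X, Y, params)
for _ in range(500):
    params = gradient_step(X, Y, params, alpha=0.5)
final_cost = cost(X, Y, params)
print("cost before:", initial_cost, "cost after:", final_cost)
```

The loop accumulates each derivative over the m examples instead of using matrix products, which keeps the example dependency-free at the price of speed.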
Random Initialization
• We have previously seen that the weights can be initialized to 0 in the case of logistic regression
• For logistic regression, it was okay to initialize the weights to 0 because it doesn't have any hidden layer
• But should we initialize the weights of a neural network to 0?
If the weights are initialized to 0, the W[1] matrix will contain only zeros, and so will the outgoing weights W[2] = [0 0]
Using these weights: a1[1] = a2[1]
The hidden units are identical (symmetric) because both of them compute exactly the same function
When we compute backpropagation:
dZ1[1] = dZ2[1]
These are identical because the outgoing weights are also equal
No matter how many hidden units we use in a layer, they always produce the same output, which is equivalent to using a single unit
So, instead of initializing the weights to 0, we randomly initialize them
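The symmetry argument can be checked numerically. The sketch below assumes two hidden units with sigmoid activations and uses a hypothetical input and target; it only illustrates that zero weights give both hidden units the same activation and the same gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.7, -1.2, 3.0]                      # hypothetical input
y = 1                                     # hypothetical target
W1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # two hidden units, all weights zero
b1 = [0.0, 0.0]
W2 = [0.0, 0.0]                           # outgoing weights, also zero
b2 = 0.0

# Forward pass: both hidden units see the same weighted sum.
a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)

# Backward pass for this single example (sigmoid hidden units assumed).
dz2 = a2 - y
dz1 = [W2[j] * dz2 * a1[j] * (1 - a1[j]) for j in range(2)]
dW1 = [[dz1[j] * xi for xi in x] for j in range(2)]

print("a1 =", a1)     # both activations are equal
print("dz1 =", dz1)   # both gradients are equal
```

Because the two rows of `dW1` are equal, a gradient step changes both hidden units in the same way and they remain identical after the update.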
Back-propagation network training
Example-1
Using a back-propagation network, find the new weights for the net shown in the figure. It is presented with the input pattern [0, 1] and the target output is 1. Use a learning rate α = 0.25 and the binary sigmoidal activation function.
Initial weights: [shown in the figure]
Binary sigmoidal activation function: $f(x) = \frac{1}{1 + e^{-x}}$
The solved examples of Perceptron and Back-propagation are taken from the book “Principles of Soft Computing” by Sivanandam and Deepa. Explore the first 3 chapters of this book for more details on Neural Networks.
Machine Learning vs Deep Learning
• DL is a newer area of ML. It uses multi-layered artificial neural networks to deliver high accuracy in tasks such as:
  ◦ intrusion detection
  ◦ object detection
  ◦ speech recognition
  ◦ language translation
• DL can automatically learn/extract/translate the features from data sets such as:
  ◦ Images
  ◦ Video
  ◦ Text
Image Source: https://www.xenonstack.com/blog/data-science/log-analytics-deep-machine-learning-ai/
Why is Deep Learning taking off?
Source: https://www.coursera.org/learn/neural-networks-deep-learning/lecture/praGm/why-is-deep-learning-taking-off
Example of Deep Learning
Network Intrusion detection:
• Intrusion Detection System
• Intrusion Prevention System
• Next-Generation Firewall
Steps for Implementation
1. Data Preprocessing
2. Feature Extraction
3. Feature Selection
4. Implement the machine learning model
5. Training and testing (if supervised model)
6. Calculate Parameters
Evaluation – Model Training
While the parameters of each model may differ, there are several methods to train a model.
◦ We want to avoid overfitting a model and maximize its predictive power.
There are two standard methods for splitting the data when training a model:
◦ Hold-out – reserve 2/3 of the data for training and 1/3 for testing
◦ Cross-Validation – partition the data into k disjoint subsets, train on k-1 partitions, test on the remaining one, and repeat k times to get an average, e.g., 10-fold cross-validation is commonly used.
Many software packages (e.g., WEKA, RapidMiner) will perform these methods automatically for you.
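To make the two splitting schemes concrete, here is a small sketch that works on example indices only. The function names are hypothetical, and for simplicity the data is not shuffled, which a real experiment would normally do first.

```python
def hold_out_split(n_samples, train_fraction=2 / 3):
    """Reserve the first train_fraction of the indices for training, the rest for testing."""
    cut = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

train_idx, test_idx = hold_out_split(9)
print("hold-out:", train_idx, test_idx)

folds = list(k_fold_indices(10, 5))
for fold_train, fold_test in folds:
    print("train on", fold_train, "test on", fold_test)
```

In cross-validation each example is used for testing exactly once, and the k test scores are averaged to estimate performance.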
Evaluation
There are several questions we should ask after model training:
◦ How predictive is the model we learned?
◦ How reliable and accurate are the predicted results?
◦ Which model performs better?
We want our model to perform well on our training set but also have strong predictive power.
Fortunately, various metrics applied on the testing set can help us choose the “best” model for
our application.
Metrics for Performance Evaluation
A Confusion Matrix provides measures to compute a model's accuracy:
◦ True Positives (TP) – # of positive examples correctly predicted by the model
◦ False Negatives (FN) – # of positive examples wrongly predicted as negative by the model
◦ False Positives (FP) – # of negative examples wrongly predicted as positive by the model
◦ True Negatives (TN) – # of negative examples correctly predicted by the model
Example(s)
True Positives (TP): an instance for which both predicted and actual values are positive
E.g., images that are cats and are predicted as cat
True Negatives (TN): an instance for which both predicted and actual values are negative
E.g., images that are not cats and are predicted as not-cat
False Positives (FP): an instance for which the predicted value is positive but the actual value is negative
E.g., images that are not cats but are predicted as cat
False Negatives (FN): an instance for which the predicted value is negative but the actual value is positive
E.g., images that are cats but are predicted as not-cat
(1) Precision = $\frac{TP}{TP + FP}$: how precise the model is; out of those predicted positive, how many are actually positive
(2) Recall = $\frac{TP}{TP + FN}$: how many of the actual positives the ML model captures by labeling them as positive
(3) Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$: percentage of correct predictions made by the ML model
(4) F-Score = $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$: the harmonic mean of precision and recall
However, accuracy can be skewed due to a class imbalance.
Other measures are better indicators for model performance.
Metric | Description | Calculation
Precision | Exactness – % of tuples the classifier labeled as positive that are actually positive | TP / (TP + FP)
Recall | Completeness – % of positive tuples the classifier actually labeled as positive | TP / (TP + FN)
F-Measure | Harmonic mean of precision and recall | 2 * Recall * Precision / (Recall + Precision)
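The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up counts and a hypothetical function name, and it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Made-up counts for illustration: 100 examples in total.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these counts the precision is 0.8 and the recall is about 0.667, so the F-score of about 0.727 sits between them and closer to the lower value, as expected of a harmonic mean.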
Models can also be compared visually using a Receiver Operating Characteristic
(ROC) curve.
An ROC curve characterizes the trade-off between TP and FP rates.

◦ TP rate is plotted on the y-axis against FP rate on the x-axis
◦ Stronger models will generally have more Area Under the ROC curve (AUC).
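As an illustration of how an ROC curve and its AUC can be obtained, the sketch below sweeps the decision threshold over a handful of made-up scores. It assumes labels are 0 or 1, that both classes are present, and that no two scores are tied; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points obtained by lowering the threshold one example at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print("ROC points:", points)
print("AUC:", auc(points))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9; a model that ranked every positive above every negative would reach 1.0.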
