

Karthik Murali Madhavan Rathai
Assistant professor
Department of Mechatronics engineering
SRM University, Kattankulathur
Room no. H316
E-mail –
Phone – +91-9840291486
Website –
Syllabus for ANN in MH1202
Mathematical prerequisite
• Linear algebra (Matrices, vectors, null
space, rank…etc)
• Optimization (Convex/non-convex,
gradient free methods…etc)
• Probability & statistics (Distributions, data
fitting, data analysis…etc)
• Functional analysis (Kernels,
Hilbert/Banach spaces…etc)
• Calculus (Basic multivariate
Artificial neural network (ANN)
• ANNs are nonlinear information (signal) processing devices, which are built
from interconnected elementary processing devices called neurons.
• ANNs are inspired by the way biological nervous systems, e.g. the brain,
process information. The key element of this paradigm is the novel
structure of the information processing system.
• A neural network is a massively parallel distributed processor that has a
natural propensity for storing experiential knowledge and making it
available for use. It resembles the brain in two respects:
 Knowledge is acquired by the network through a learning process
 Inter-neuron connection strengths known as synaptic weights are used to store the
knowledge
Artificial neural network (ANN)
• An artificial neuron is characterized by
 Architecture (connection between neurons)
 Training or learning (determining weights on the connections)
 Activation function
• Example
[Figure: two input-layer neurons x1 and x2 connected through weights w1 and w2 to a single output-layer neuron y]
Why artificial neural networks (ANNs)?
• The long course of evolution has given the human brain many
desirable characteristics not present in Von Neumann or modern
parallel computers, which include
 Massive parallelism
 Distributed representation and computation
 Learning ability
 Generalization ability
 Adaptivity
 Inherent contextual information processing
 Fault tolerance
 Low energy consumption
NNs vs Computers
Digital Computers
o Deductive reasoning - We apply known rules to input data to produce output.
o Computation is centralized, synchronous, and serial.
o Memory is packeted, literally stored, and location addressable.
o Not fault tolerant. One transistor goes and it no longer works.
o Exact.
o Static connectivity.
o Applicable if well defined rules with precise input data.
Neural Networks
o Inductive reasoning - Given input and output data (training examples), we construct the rules.
o Computation is collective, asynchronous, and parallel.
o Memory is distributed, internalized, short term and content addressable.
o Fault tolerant, redundancy, and sharing of responsibilities.
o Inexact.
o Dynamic connectivity.
o Applicable if rules are unknown or complicated, or if data are noisy or partial.
Other key advantages
• Adaptive learning – An ability to learn how to do tasks based on the
data given for training or initial experience.
• Self-organization – An ANN can create its own organization or
representation of the information it receives during learning time.
• Real-time operation – ANN computations may be carried out in
parallel, using special hardware devices designed and manufactured
to take advantage of this capability.
• Fault tolerance via redundant information coding – Partial destruction
of a network leads to a corresponding degradation of performance.
However, some network capabilities may be retained even after
major network damage due to this feature.
Historical tour on ANN
• 1943 – McCulloch & Pitts: Start of the modern era of neural networks
 This forms a logical calculus of neural networks. A network consisting of a
sufficient number of neurons and properly set synaptic connections can
compute any computable function. Each McCulloch-Pitts neuron performs a
simple logic function determined by the weights set on it, so an
arrangement of such neurons may be represented as a combination of logic
functions. The most important feature of this neuron is the concept of a
threshold.
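The threshold idea can be illustrated with a minimal Python sketch. The function name `mcculloch_pitts` and the particular weights and thresholds are illustrative assumptions, chosen so that one neuron realises AND and another realises OR.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (return 1) when the weighted input sum reaches the threshold, else 0."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0


def logic_and(a, b):
    # Hypothetical choice: unit weights, threshold 2, so both inputs must be on.
    return mcculloch_pitts([a, b], [1, 1], threshold=2)


def logic_or(a, b):
    # Same weights, threshold 1, so a single active input is enough.
    return mcculloch_pitts([a, b], [1, 1], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logic_and(a, b), "OR:", logic_or(a, b))
```

Only the threshold differs between the two neurons, which is why the threshold is singled out as the key feature.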
Historical tour on ANN
• 1949 – Hebb’s book “The Organization of Behavior”
 An explicit statement of a physiological learning rule for synaptic
modification was presented for the first time. Hebb proposed that the
connectivity of the brain is continually changing as an organism learns
differing functional tasks and that neural assemblies are created by such
changes.
 The concept of Hebb’s theory in simple words – “If two neurons are
found to be active simultaneously the strength of connection between the
two neurons should be increased”
Historical tour on ANN
• 1958 – Rosenblatt introduces the Perceptron
 In the Perceptron network the weights on the connection paths can be adjusted
by a method of iterative weight adjustment.
The Perceptron net is found to converge if weights exist that allow the net
to reproduce exactly all the training input and target output vector pairs.
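A minimal sketch of iterative weight adjustment, assuming the classic error-driven Perceptron update with a hard-limit activation. The function names, the learning rate and the AND training set are illustrative assumptions, not taken from the slides.

```python
def step(net):
    """Hard-limit activation: 1 if the net input is positive, else 0."""
    return 1 if net > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Adjust weights iteratively until every training pair is reproduced."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output
            if error != 0:
                # Shift the weights toward (or away from) the misclassified input.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
                errors += 1
        if errors == 0:  # all pairs reproduced exactly, so training stops
            break
    return weights, bias


# Hypothetical training set: the logical AND function (linearly separable).
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_samples)
predictions = [step(sum(w * x for w, x in zip(weights, xs)) + bias)
               for xs, _ in and_samples]
print(weights, bias, predictions)
```

The loop stops as soon as a full pass produces no errors, which mirrors the convergence condition stated above.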
Historical tour on ANN

• 1960 – Widrow and Hoff introduced ADALINE
 ADALINE, short for Adaptive Linear Neuron, uses a learning rule called the
least mean square (LMS) rule or delta rule. This rule adjusts the
weights so as to reduce the difference between the net input to the output
unit and the desired output. The convergence criterion in this case is the
reduction of the mean square error to a minimum value. The delta rule for a
single-layer net can be called a precursor of the backpropagation rule used for
multi-layer nets. The multi-layer extension of ADALINE is MADALINE.
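The LMS (delta) rule can be sketched as follows for a single linear unit. The data set, learning rate and function names are assumptions made for illustration; the targets are generated from a known linear rule so that the result can be checked.

```python
def train_adaline(samples, learning_rate=0.1, epochs=500):
    """Online LMS: move the weights along the error between target and net input."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            error = target - net  # desired output minus net input
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def mean_square_error(samples, weights, bias):
    """Average squared difference between target and net input."""
    total = 0.0
    for inputs, target in samples:
        net = sum(w * x for w, x in zip(weights, inputs)) + bias
        total += (target - net) ** 2
    return total / len(samples)


# Hypothetical noiseless data generated by t = 2*x1 - 1*x2 + 0.5.
samples = [([0.0, 0.0], 0.5), ([0.0, 1.0], -0.5),
           ([1.0, 0.0], 2.5), ([1.0, 1.0], 1.5)]
weights, bias = train_adaline(samples)
print(weights, bias, mean_square_error(samples, weights, bias))
```

Unlike the Perceptron update, the error here is computed on the net input before any thresholding, so the weights keep improving even when the thresholded output would already be correct.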
Historical tour on ANN
• 1982 – John Hopfield’s networks
 Hopfield showed how to use an “Ising spin glass” type of model to store
information in dynamically stable networks. These nets are widely used as
associative memory nets. Hopfield nets exist in both continuous-valued
and discrete-valued forms. This net has also been applied to the travelling salesman problem (TSP).
• 1972 – Kohonen’s self organizing map (SOM)
 Kohonen’s SOM are capable of reproducing important aspects of the
structure of biological neural nets. They make use of data representation
using topographic maps which are common in the nervous system. SOM also
has a wide range of applications. It shows how the output layer can pick up
the correlational structure from inputs in the form of the spatial
arrangements of units
Historical tour on ANN
• Backpropagation algorithm: 1985 – Parker, 1986 – LeCun
 The method propagates the error information at the output units back to the
hidden units using a generalized delta rule. This net is basically a multilayer,
feed-forward net trained by means of backpropagation. Backpropagation
emerged as the most popular learning algorithm for the training of multilayer
Perceptrons and has been the workhorse for many neural network
applications.
• 1988 – Radial basis functions, Broomhead and Lowe
• 1990 – Support vector machine, Vapnik
• 1993+ – Convolutional neural network (Deep learning stuff)
Applications of NNs
 Classification
• In marketing: consumer spending pattern classification
• In defence: radar and sonar image classification
• In agriculture & fishing: fruit and catch grading
• In medicine: ultrasound and electrocardiogram image classification, EEGs, medical
 Recognition and identification
• In general computing and telecommunications: speech, vision and handwriting recognition
• In finance: signature verification and bank note verification
 Assessment
• In engineering: product inspection monitoring and control
• In defence: target tracking
• In security: motion detection, surveillance image analysis and fingerprint matching
 Forecasting and prediction
• In finance: foreign exchange rate and stock market forecasting
• In agriculture: crop yield forecasting
• In marketing: sales forecasting
• In meteorology: weather prediction
What can you do with an NN and what not?
• In principle, NNs can compute any computable function, i.e., they can
do everything a normal digital computer can do. Almost any mapping
between vector spaces can be approximated to arbitrary precision by
feedforward NNs
• In practice, NNs are especially useful for classification and function
approximation problems usually when rules such as those that might
be used in an expert system cannot easily be applied.
• NNs are, at least today, difficult to apply successfully to problems that
concern manipulation of symbols and memory. And there are no
methods for training NNs that can magically create information that is
not contained in the training data.
Who is concerned with NNs?
• Computer scientists want to find out about the properties of non-symbolic information
processing with neural nets and about learning systems in general.
• Statisticians use neural nets as flexible, nonlinear regression and classification models.
• Engineers of many kinds exploit the capabilities of neural networks in many areas, such
as signal processing and automatic control.
• Cognitive scientists view neural networks as a possible apparatus to describe models of
thinking and consciousness (High-level brain function).
• Neuro-physiologists use neural networks to describe and explore medium-level brain
function (e.g. memory, sensory system, motorics).
• Physicists use neural networks to model phenomena in statistical mechanics and for a
lot of other tasks.
• Biologists use Neural Networks to interpret nucleotide sequences.
• Philosophers and some other people may also be interested in Neural Networks for
various reasons
The Biological Neuron

• The brain is a collection of about 10 billion interconnected neurons. Each neuron is a
cell that uses biochemical reactions to receive, process and transmit information.
• Each terminal button is connected to other neurons across a small gap called a synapse.
• A neuron's dendritic tree is connected to a thousand neighboring neurons. When one
of those neurons fires, a positive or negative charge is received by one of the dendrites.
The strengths of all the received charges are added together through the processes of
spatial and temporal summation.
Biological NN and Artificial NN comparison
[Figure: artificial neuron with inputs p1, p2, p3, weights w1, w2, w3, activation function f and output a]
Biological neural network    Artificial neural network
Cell body                    Neurons
Dendrite                     Weights or interconnections
Soma                         Net input
Axon                         Output
$a = f(p_1 w_1 + p_2 w_2 + p_3 w_3 + b) = f\left(\sum_i p_i w_i + b\right)$
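The neuron formula a = f(Σ p_i w_i + b) can be sketched in a few lines of Python. The logistic choice for f and the numeric values are assumptions for illustration only.

```python
import math


def logistic(n):
    """One possible activation function f (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-n))


def neuron_output(p, w, b, f=logistic):
    """Return a = f(sum_i p_i * w_i + b)."""
    net_input = sum(pi * wi for pi, wi in zip(p, w)) + b
    return f(net_input)


# Hypothetical inputs, weights and bias for a three-input neuron.
p = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5
a = neuron_output(p, w, b)
print(a)
```

Passing a different `f` swaps the activation function without touching the net input computation.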
Basic building blocks of ANN
• The basic building blocks of the artificial neural network are
 Network architecture
 Setting the weights
 Activation function
• Network architecture
 The arrangement of neurons into layers and the pattern of connections within
and between layers is called the architecture of the net. The
neurons within a layer may or may not be fully interconnected.
 The number of layers in the net is defined as the number of layers of
weighted interconnection links between the slabs of neurons.
Different ANN architecture
Setting up weights
• The method of setting the value for the weights enables the process
of learning or training. The process of modifying the weights in the
connections between network layers with the objective of achieving
the expected output is called training a network.
• The three types of training are
 Supervised learning
 Unsupervised learning
 Reinforcement learning
Setting up weights
• Unsupervised learning
 Feedback nets
o Binary adaptive resonance theory (ART1)
o Analog adaptive resonance theory (ART2, ART2a)
o Discrete Hopfield (DH)
o Continuous Hopfield (CH)
o Learning vector quantization (LVQ)
o Discrete bi-directional associative memory (BAM)
o Temporal associative memory (TAM)
o Adaptive bi-directional associative memory (ABAM)
o Kohonen self-organizing map/Topology preserving map (SOM/TPM)
o Competitive learning
 Feedforward-only nets
o Learning matrix (LM)
o Driver–reinforcement learning
o Counter propagation (CPN)
• Supervised learning
 Feedback nets
o Boltzmann machine (BM)
o Mean field annealing (MFT)
o Recurrent cascade correlation (RCC)
o Backpropagation through time (BPTT)
o Real time recurrent learning (RTRL)
 Feedforward-only nets
o Perceptron
o Adaline, Madaline
o Backpropagation (BP)
o Cauchy machine (CM)
o Artmap
o Cascade correlation (CasCor)
Activation Functions
• Use different functions to obtain different models.
• Three most common choices:
 Step function
 Sign function
 Sigmoid function
• An output of 1 represents firing of a neuron down the axon.
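The three choices can be written out as a short Python sketch. The convention at exactly zero (treated as firing) is an assumption, since the slides do not state it.

```python
import math


def step(v):
    """Step function: 1 (fire) if v >= 0, else 0."""
    return 1 if v >= 0 else 0


def sign(v):
    """Sign function: +1 if v >= 0, else -1."""
    return 1 if v >= 0 else -1


def sigmoid(v):
    """Sigmoid (logistic) function: smooth output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))


for v in (-2.0, 0.0, 2.0):
    print(v, step(v), sign(v), round(sigmoid(v), 4))
```

The step and sign functions give hard decisions, while the sigmoid is differentiable, which matters for gradient-based training.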
Stochastic (Activation) Model of a Neuron
• So far we have introduced only deterministic models of ANNs.
• A stochastic (probabilistic) model can also be defined.
• If x denotes the state of a neuron, then P(v) denotes the probability of
firing a neuron, where v is the induced activation potential (bias +
linear combination):
$P(v) = \dfrac{1}{1 + e^{-v/T}}$
• Where T is a pseudo-temperature used to control the noise level (and
therefore the uncertainty in firing).
• As 𝑇 → 0, the stochastic model tends to deterministic model.
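A minimal sketch of the stochastic neuron, assuming the state is +1 (fire) with probability P(v) and -1 otherwise. The function names and the use of Python's `random` module are illustrative choices.

```python
import math
import random


def firing_probability(v, T):
    """P(v) = 1 / (1 + exp(-v / T)), with T the pseudo-temperature."""
    return 1.0 / (1.0 + math.exp(-v / T))


def stochastic_neuron_state(v, T, rng=random):
    """Return +1 with probability P(v), otherwise -1 (assumed state coding)."""
    return 1 if rng.random() < firing_probability(v, T) else -1


# Lowering T sharpens the probability toward a deterministic threshold at v = 0.
for T in (2.0, 0.5, 0.01):
    print(T, round(firing_probability(1.0, T), 6),
          round(firing_probability(-1.0, T), 6))
```

At T = 0.01 the two printed probabilities are already very close to 1 and 0, which illustrates the deterministic limit mentioned above.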
• Single-layer Feedforward Networks
 Input layer and output layer - Single (computation) layer
 Feedforward, acyclic
• Multilayer feedforward network
 Hidden layers - hidden neurons and hidden units
 Enables the network to extract higher-order statistics
 Fully connected layered network
• Recurrent Networks
 At least one feedback loop
 With or without hidden neurons
Feedforward Networks
• One input and one output layer
• One or more hidden layers
• Each hidden layer is built from artificial neurons
• Each element of the preceding layer is connected with each element
of the next layer.
• There is no interconnection between artificial neurons from the same
layer.
• Finding the weights is a task that depends on the problem
a specific network is meant to solve.
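The layer-by-layer structure described above can be sketched as a forward pass. The 2-3-1 layout, the weight values and the sigmoid activation are hypothetical choices for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(inputs, weights, biases):
    """Fully connected layer: weights[j][i] links input i to neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feedforward(inputs, layers):
    """Pass the signal through each layer in turn; no links within a layer."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical 2-3-1 network: one hidden layer of three neurons, one output.
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = feedforward([1.0, 0.0], layers)
print(output)
```

Each neuron reads only the previous layer's outputs, so the computation is acyclic and a single left-to-right sweep is enough.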
Recurrent/dynamic systems/Feedback
• The interconnections go in two directions between ANNs or with the
• The Boltzmann machine is an example of a recurrent net and is a
generalization of Hopfield nets. Another example of recurrent nets:
Adaptive Resonance Theory (ART) nets.
Neural Processing
• Recall: the process by which the ANN computes an output o for a given input x.
• Its objective is to retrieve the information, i.e., to decode the stored content which must have been
encoded in the network previously.
• Autoassociation
 When the network is presented with a pattern similar to a member of the stored set, auto-association
associates the input pattern with the closest stored pattern.
 Reconstruction of incomplete or noisy image.
• Heteroassociation
 The network associates the input pattern with pairs of patterns stored.
Neural Processing
• Classification
 A set of patterns is already divided into a number of classes, or categories
 When an input pattern is presented, the classifier recalls the information regarding
the class membership of the input pattern
 The classes are expressed by discrete-valued output vectors, thus the output
neurons of the classifier employ binary activation functions.
• Clustering
 Unsupervised classification of patterns/objects without providing information about
the actual classes
 The network must discover for itself any existing patterns, regularities, separating
properties, etc.
 While discovering these, the network undergoes change of its parameters, which is
called Self organization
• Learning is a process by which the free parameters of an NN are adapted through
stimulation from the environment.
• Sequence of Events
 Stimulated by an environment
 Undergoes changes in its free parameters
 Responds in a new way to the environment
• Learning Algorithm
 Prescribed steps of a process to make a system learn, i.e. ways to adjust the synaptic weights of a
neuron
 No unique learning algorithm - kit of tools
• Five learning rules/Learning paradigms
 Error-correction learning
 Memory based learning
 Hebbian learning
 Competitive learning
 Boltzmann learning
Important terminologies in ANN
• Training set: The ensemble of “inputs” used to train the system. For a
supervised network, it is the ensemble of “input-desired” response pairs
used to train the system.
• Validation set: The ensemble of samples that will be used to validate the
parameters used in the training (not to be confused with the test set which
assesses the performance of the classifier).
• Test set: The ensemble of “input-desired” response data used to verify the
performance of a trained system. This data is not used for training.
• Training epoch: one cycle through the set of training patterns.
• Generalization: The ability of a NN to produce reasonable responses to
input patterns that are similar, but not identical, to training patterns.
Important terminologies in ANN
• Asynchronous: process in which weights or activations are updated one at
a time, rather than all being updated simultaneously.
• Synchronous updates: All weights are adjusted at the same time.
• Inhibitory connection: connection link between two neurons such that a
signal sent over this link will reduce the activation of the neuron that
receives the signal. This may result from the connection having a negative
weight, or from the signal received being used to reduce the activation of a
neuron by scaling the net input the neuron receives from other neurons.
• Activation: a node’s level of activity; the result of applying the activation
function to the net input to the node. Typically this is also the value the
node transmits.
Mathematical preliminaries - Vectors
Learning Rule/Learning paradigm #1 –
“Error Correction Learning Rule”
Error Correction Learning
Perceptron learning
Memory-based Learning
Hebbian learning
• In 1949, Donald Hebb proposed one of the key ideas in biological learning,
commonly known as Hebb’s Law. Hebb’s Law states that if neuron “i” is
near enough to excite neuron “j” and repeatedly participates in its
activation, the synaptic connection between these two neurons is
strengthened and neuron “j” becomes more sensitive to stimuli from
neuron “i”.
• Hebb’s Law can be represented in the form of two rules:
 If two neurons on either side of a connection are activated synchronously, then the
weight of that connection is increased.
 If two neurons on either side of a connection are activated asynchronously, then the
weight of that connection is decreased.
• Hebb’s Law provides the basis for learning without a teacher. Learning here
is a local phenomenon occurring without feedback from the environment.
[Figure: Hebbian learning in a neural network. Input signals enter the network, output signals leave it, and neurons i and j are joined by a connection.]
Hebbian learning
• Using Hebb’s Law we can express the adjustment applied to the
weight $w_{ij}$ at iteration p in the following form:
$\Delta w_{ij}(p) = F[y_j(p), x_i(p)]$
• As a special case, we can represent Hebb’s Law as follows:
$\Delta w_{ij}(p) = \alpha\, y_j(p)\, x_i(p)$
• where $\alpha$ is the learning rate parameter. This equation is referred to as
the activity product rule.
Hebbian learning
• Hebbian learning implies that weights can only increase. To resolve this problem, we might impose
a limit on the growth of synaptic weights. It can be done by introducing a non-linear forgetting
factor into Hebb’s Law:
$\Delta w_{ij}(p) = \alpha\, y_j(p)\, x_i(p) - \varphi\, y_j(p)\, w_{ij}(p)$
• where $\varphi$ is the forgetting factor.
• Forgetting factor usually falls in the interval between 0 and 1, typically between 0.01 and 0.1, to
allow only a little “forgetting” while limiting the weight growth.
• Oja’s rule - The plain Hebbian learning rule has a severe problem: there is nothing to stop the
connections from growing all the time, finally leading to very large values. There should be another
term to balance this growth. In many neuron models, another term representing "forgetting" has
been used: the value of the weight itself should be subtracted from the right hand side. The central
idea in the Oja learning rule is to make this forgetting term proportional, not only to the value of
the weight, but also to the square of the output of the neuron. The Oja rule reads
$\Delta w_{ij}(p) = \alpha\, y_j(p)\, [x_i(p) - y_j(p)\, w_{ij}(p)]$
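A small sketch contrasting the plain activity product rule with Oja's rule for a single linear neuron y = w·x. The input vector, starting weights, learning rate and iteration count are hypothetical values chosen so the behaviour is easy to see.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hebb_step(w, x, alpha):
    """Plain Hebb: delta_w_i = alpha * y * x_i, with y = w . x."""
    y = dot(w, x)
    return [wi + alpha * y * xi for wi, xi in zip(w, x)]


def oja_step(w, x, alpha):
    """Oja: delta_w_i = alpha * y * (x_i - y * w_i), forgetting scales with y^2."""
    y = dot(w, x)
    return [wi + alpha * y * (xi - y * wi) for wi, xi in zip(w, x)]


def norm(w):
    return math.sqrt(dot(w, w))


x = [0.6, 0.8]        # hypothetical unit-length input pattern
w_hebb = [0.1, 0.1]   # same starting weights for both rules
w_oja = [0.1, 0.1]
for _ in range(300):
    w_hebb = hebb_step(w_hebb, x, alpha=0.1)
    w_oja = oja_step(w_oja, x, alpha=0.1)

print("Hebb weight norm:", norm(w_hebb))
print("Oja weight norm: ", norm(w_oja), w_oja)
```

With the same inputs, the plain rule lets the weight norm grow without bound, while the Oja weights settle at unit length and line up with the input direction.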
Hebbian learning algorithm
• Step 1: Initialization - Set initial synaptic weights and thresholds to
small random values, say in an interval [0, 1].
• Step 2: Activation - Compute the neuron output at iteration p
$y_j(p) = \sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j$
• where n is the number of neuron inputs, and $\theta_j$ is the threshold value
of neuron j.
Hebbian learning
• Step 3: Learning - Update the weights in the network:
$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$
• where $\Delta w_{ij}(p)$ is the weight correction at iteration p.
• The weight correction is determined by the generalized activity
product rule:
$\Delta w_{ij}(p) = \varphi\, y_j(p)\, [\lambda\, x_i(p) - w_{ij}(p)]$, where $\lambda = \alpha / \varphi$
• Step 4: Iteration - Increase iteration p by one, go back to Step 2.
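The four steps can be sketched as a short training loop. Two points are assumptions rather than part of the slides: the Step 2 value is passed through a 0/1 step so that outputs stay binary, and the training patterns, epoch count and seed are hypothetical.

```python
import random


def hebbian_train(patterns, alpha=0.1, phi=0.02, epochs=50, seed=0):
    """Generalized activity product rule on a single-layer n-by-n network."""
    rng = random.Random(seed)
    n = len(patterns[0])
    lam = alpha / phi
    # Step 1: small random weights w[i][j] (input i -> output j) and thresholds.
    w = [[rng.random() for _ in range(n)] for _ in range(n)]
    theta = [rng.random() for _ in range(n)]
    for _ in range(epochs):  # Step 4: repeat for the next iteration
        for x in patterns:
            # Step 2: weighted sum minus threshold, then a 0/1 step (assumption).
            y = [1 if sum(x[i] * w[i][j] for i in range(n)) - theta[j] >= 0 else 0
                 for j in range(n)]
            # Step 3: w_ij <- w_ij + phi * y_j * (lam * x_i - w_ij)
            for i in range(n):
                for j in range(n):
                    w[i][j] += phi * y[j] * (lam * x[i] - w[i][j])
    return w, theta


# Hypothetical binary training patterns for a five-neuron layer.
patterns = [[1, 0, 0, 0, 1], [0, 1, 0, 0, 1], [0, 0, 0, 1, 0]]
w, theta = hebbian_train(patterns)
for row in w:
    print([round(value, 3) for value in row])
```

Because each update is a small move of w_ij toward either lam * x_i or toward zero, the weights stay bounded instead of growing indefinitely.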
Hebbian learning example
Initial and final states of the network
[Figure: single-layer network with five inputs x1 to x5 and five outputs y1 to y5. Initial state: input (1, 0, 0, 0, 1) gives output (1, 0, 0, 0, 1). Final state: input (1, 0, 0, 0, 1) gives output (0, 1, 0, 0, 1).]
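Initial and final weight matrices (rows 1 to 5, columns: output layer neurons 1 to 5)
$W_{\text{initial}} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$
$W_{\text{final}} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 2.0204 & 0 & 0 & 2.0204 \\ 0 & 0 & 1.0200 & 0 & 0 \\ 0 & 0 & 0 & 0.9996 & 0 \\ 0 & 2.0204 & 0 & 0 & 2.0204 \end{bmatrix}$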
• A test input vector, or probe, is defined as
$X = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 \end{bmatrix}^T$
• When this probe is presented to the network, we obtain:
$Y = \operatorname{sign}\left( \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 2.0204 & 0 & 0 & 2.0204 \\ 0 & 0 & 1.0200 & 0 & 0 \\ 0 & 0 & 0 & 0.9996 & 0 \\ 0 & 2.0204 & 0 & 0 & 2.0204 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} - \begin{bmatrix} 0.4940 \\ 0.2661 \\ 0.0907 \\ 0.9478 \\ 0.0737 \end{bmatrix} \right) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}$
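The probe calculation can be checked with a few lines of Python using the final weight matrix, probe and thresholds from the example. One assumption is made explicit: the "sign" function is read here as a 0/1 step, since the outputs in the example are 0 and 1.

```python
# Final weight matrix, thresholds and probe as given in the example.
W = [
    [0.0, 0.0,    0.0,    0.0,    0.0],
    [0.0, 2.0204, 0.0,    0.0,    2.0204],
    [0.0, 0.0,    1.0200, 0.0,    0.0],
    [0.0, 0.0,    0.0,    0.9996, 0.0],
    [0.0, 2.0204, 0.0,    0.0,    2.0204],
]
theta = [0.4940, 0.2661, 0.0907, 0.9478, 0.0737]
X = [1, 0, 0, 0, 1]


def recall(weights, probe, thresholds):
    """Return 1 where (row . probe - threshold) >= 0, else 0 (assumed 0/1 step)."""
    outputs = []
    for row, threshold in zip(weights, thresholds):
        net = sum(w * x for w, x in zip(row, probe)) - threshold
        outputs.append(1 if net >= 0 else 0)
    return outputs


Y = recall(W, X, theta)
print(Y)
```

Only rows 2 and 5 have a non-zero weight on input 5, so only outputs 2 and 5 clear their thresholds.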
Competitive Learning
• Basic Elements
 A set of neurons that are all the same except for their synaptic weight distribution
 Respond differently to a given set of input patterns
 A mechanism to compete to respond to a given input
 The neuron that wins the competition is called the “winner-takes-all” neuron
Competitive Learning
• Competitive Learning Rule: Adapt the neuron m which has the
maximum response due to input x.

• Weights are typically initialized at random values and their strengths
are normalized during learning.
Competitive Learning
• x has some constant Euclidean length (also normalized) and
$\sum_j w_{mj} = 1$ for all m
Competitive Learning
• What is required for the net to encode the training set is that the
weight vectors become aligned with any clusters present in this set
and that each cluster is represented by at least one node. Then, when
a vector is presented to the net there will be a node, or group of
nodes, which respond maximally to the input and which respond in
this way only when this vector is shown at the input.
• If the net can learn a weight vector configuration like this, without
being told explicitly of the existence of clusters at the input, then it is
said to undergo a process of self organised or unsupervised learning.
This is to be contrasted with nets trained with the delta
rule, for example, where a target vector or output had to be supplied.
Competitive Learning
• In order to achieve this goal, the weight vectors must be rotated
around the sphere so that they line up with the training set.
• The first thing to notice is that this may be achieved in a gradual and
efficient way by moving the weight vector which is closest (in an
angular sense) to the current input vector towards that vector slightly.
• The node k with the closest vector is that which gives the greatest
input excitation v = w.x since this is just the dot product of the weight
and input vectors. As shown below, the weight vector of node k may
be aligned more closely with the input if a change is made according
Winner-Take-All learning

The winner neighbourhood is sometimes extended beyond the single winning neuron
to include the neighbouring neurons.
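A minimal winner-take-all sketch on the unit circle. It assumes the common update in which only the winning weight vector is moved a fraction eta toward the input and then renormalised to unit length; the cluster angles, starting weights and parameter values are hypothetical.

```python
import math


def unit(angle_deg):
    """Unit-length 2-D vector at the given angle."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def winner(weights, x):
    """Index of the node with the greatest excitation v = w . x."""
    excitations = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return excitations.index(max(excitations))


def competitive_train(weights, inputs, eta=0.2, epochs=50):
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in inputs:
            k = winner(weights, x)
            # Move only the winner slightly toward the input (assumed update),
            # then put it back on the unit circle.
            moved = [wi + eta * (xi - wi) for wi, xi in zip(weights[k], x)]
            weights[k] = normalize(moved)
    return weights


# Two hypothetical clusters: inputs near 0 degrees and near 90 degrees.
inputs = [unit(0), unit(10), unit(80), unit(90)]
initial = [unit(30), unit(60)]
trained = competitive_train(initial, inputs)
for w in trained:
    print(round(math.degrees(math.atan2(w[1], w[0])), 1))
```

No targets are supplied at any point; each weight vector simply rotates toward the cluster it keeps winning.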
Summary of learning rules
Boltzmann Learning
• Rooted in statistical mechanics
• Boltzmann machine: an NN based on Boltzmann learning
• The neurons constitute a recurrent structure (see next slide)
 They are stochastic neurons
 Operate in a binary manner: “on”: +1 and “off”: -1
 Visible neurons and hidden neurons
 Energy function of the machine ($x_j$ = state of neuron j):
$E = -\dfrac{1}{2} \sum_{j} \sum_{k,\, j \neq k} w_{kj}\, x_k\, x_j$
 $j \neq k$ means no self feedback
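The energy of a given state can be sketched as below, assuming the usual form E = -1/2 Σ_j Σ_k w_kj x_k x_j over j ≠ k with symmetric weights. The weight matrix and state are hypothetical values.

```python
def energy(w, x):
    """E = -1/2 * sum over j != k of w[k][j] * x[k] * x[j] (no self feedback)."""
    n = len(x)
    total = sum(w[k][j] * x[k] * x[j]
                for j in range(n) for k in range(n) if j != k)
    return -0.5 * total


# Hypothetical symmetric weights with zero diagonal, and a +1/-1 state vector.
w = [[0, 1, -1],
     [1, 0, 2],
     [-1, 2, 0]]
x = [1, -1, 1]
print(energy(w, x))
```

Skipping the j = k terms is what "no self feedback" amounts to in the sum.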

Boltzmann Machine

Fig: Architecture of the Boltzmann machine. K is the number of visible neurons and L is the number of hidden neurons
Boltzmann Machine Operation
• Choose a neuron k at random, then flip the state of the neuron from state
$x_k$ to state $-x_k$ (random perturbation) with probability
$P(x_k \to -x_k) = \dfrac{1}{1 + \exp(-\Delta E_k / T)}$
• where $\Delta E_k$ is the energy change of the machine resulting from such a flip (flip
from state $x_k$ to state $-x_k$)
• If this rule is applied repeatedly, the machine reaches thermal equilibrium
(note that T is a pseudo-temperature).
• Two modes of operation
 Clamped condition : visible neurons are clamped onto specific states determined by
environment (i.e. under the influence of training set).
 Free-running condition: all neurons (visible and hidden) are allowed to operate freely
(i.e. with no environmental input)
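The random flipping and the two modes of operation can be sketched as follows. Several points are assumptions: the energy change is taken as the energy drop caused by the flip (energy before minus energy after), so that energy-lowering flips are favoured; the weights are symmetric with zero diagonal; and the `clamped` argument, weights and temperature are hypothetical.

```python
import math
import random


def energy_drop_on_flip(w, x, k):
    """Energy before minus energy after flipping x[k] (symmetric w, zero diagonal)."""
    local_field = sum(w[k][j] * x[j] for j in range(len(x)) if j != k)
    return -2.0 * x[k] * local_field


def flip_probability(delta_e, T):
    """Probability of accepting a flip: 1 / (1 + exp(-delta_e / T))."""
    return 1.0 / (1.0 + math.exp(-delta_e / T))


def simulate(w, x, T, steps, clamped=(), seed=0):
    """Repeatedly pick a free neuron at random and flip it stochastically."""
    rng = random.Random(seed)
    state = list(x)
    free = [k for k in range(len(state)) if k not in clamped]
    for _ in range(steps):
        k = rng.choice(free)
        if rng.random() < flip_probability(energy_drop_on_flip(w, state, k), T):
            state[k] = -state[k]
    return state


# Hypothetical symmetric weights and starting state.
w = [[0, 1, -1],
     [1, 0, 2],
     [-1, 2, 0]]
x = [1, -1, 1]
print(simulate(w, x, T=1.0, steps=200))                 # free-running
print(simulate(w, x, T=1.0, steps=200, clamped=(0,)))   # neuron 0 clamped
```

Clamped neurons are simply excluded from the candidates for flipping, which is one simple way to model the influence of the environment.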
Boltzmann Machine operation
• Such a network can be used for pattern completion
• The goal of Boltzmann learning is to maximize the likelihood function (using gradient

• ℑ denotes the set of training examples drawn from a pdf of interest.

• 𝑥𝛼 represents the state of the visible neurons.
• 𝑥𝛽 represents the state of the hidden neurons.
• A set of synaptic weights is called a model of the environment if it leads to the same
probability distribution of the states of the visible units
Boltzmann Learning Rule
Pattern Recognition
• (One) Definition
 The identification of implicit objects, types or relationships in raw data by an
animal or machine
 i.e. recognizing hidden information in data
• Common Problems
 What is it?
 Where is it?
 How is it constructed?
 These problems interact. Example: in optical character recognition (OCR),
detected characters may influence the detection of words and text lines, and
vice versa
Pattern Recognition
Common Tasks
• What is it? (Task: Classification)
 Identifying a handwritten character, CAPTCHAs
 Discriminating humans from computers
• Where is it? (Task: Segmentation)
 Detecting text or face regions in images
• How is it constructed? (Tasks: Parsing, Syntactic Pattern Recognition)
 Determining how a group of math symbols are related, and how they form an
 Determining protein structure to decide its type (class) (an example of what is
often called “Syntactic PR”)
Models and Search: Key Elements of Solutions
to Pattern Recognition Problems
• Models
 For algorithmic solutions, we use a formal model of entities to be detected.
This model represents knowledge about the problem domain (‘prior
knowledge’). It also defines the space of possible inputs and output
• Search: Machine Learning and Finding Solutions
 Normally, model parameters are set using “learning” algorithms
 Classification: learn parameters for a function from model inputs to classes
 Segmentation: learn search algorithm parameters for detecting Regions of Interest
(ROIs: note that this requires a classifier to identify ROIs)
 Parsing: learn search algorithm parameters for constructing structural descriptions
(trees/graphs, often use segmenters & classifiers to identify ROIs and their relationships in
Pattern Classification
Classifying an Object

Obtaining Model Inputs

Physical signals converted to digital signal (transducer(s)); a
region of interest is identified, features computed for this
Making a Decision
Classifier returns a class; may be revised in post-processing
(e.g. modify recognized character based on surrounding
Example (DHS): Classifying Salmon and Sea
Designing a classifier or
clustering algorithm
Feature Selection and Extraction
• Feature Selection
 Choosing from available features those to be used in our classification model.
Ideally, these
 Discriminate well between classes
 Are simple and efficient to compute
• Feature Extraction
 Computing features for inputs at run-time
• Preprocessing
 Used to reduce data complexity and/or variation, and applied before feature
extraction to permit/simplify feature computations; sometimes involves other
PR algorithms (e.g. segmentation)
Types of Features
Example Single Feature (DHS): Fish
A Better Feature: Average Lightness of Fish
A Combination of Features:
Lightness and Width
Classifier: A Formal Definition
Canonical model
Regions and Boundaries
Example: Linear Discriminant
Separating Two Classes
Classifier design
Poor Generalization due to Over-fitting the Decision Boundary
Avoiding Over-Fitting
A Simpler Decision Boundary, with Better