
SOFT COMPUTING

MODULE 1
Dr. R. B. Ghongade
SOFT COMPUTING CONSTITUENTS AND CONVENTIONAL
ARTIFICIAL INTELLIGENCE

• Soft Computing (SC): the symbiotic use of many emerging problem-solving disciplines.
• Soft computing is an emerging approach to computing
which parallels the remarkable ability of the human mind
to reason and learn in an environment of uncertainty and
imprecision. (Lotfi A. Zadeh, 1992 )
• Soft computing consists of several computing paradigms,
– neural networks
– fuzzy set theory, approximate reasoning
– derivative-free optimization methods such as genetic algorithms
and simulated annealing
PROBLEM SOLVING TECHNIQUES
• Hard computing (precise models): symbolic logic and reasoning; traditional numerical modeling and search
• Soft computing (approximate models): approximate reasoning; functional approximation and randomized search
Methodology                                  Strength
Neural network                               Learning and adaptation
Fuzzy set theory                             Knowledge representation via fuzzy if-then rules
Genetic algorithm and simulated annealing    Systematic random search
Conventional AI                              Symbolic manipulation

• The seamless integration of these methodologies forms the core of soft computing
• The synergism allows soft computing to incorporate human knowledge effectively, deal with imprecision and uncertainty, and learn to adapt to unknown or changing environments for better performance
An Example
• A neural character recognizer and a knowledge base are used together to determine the meaning of a hand-written word
• The neural character recognizer generates two possible answers, "dog" and "dag," since the middle character could be either an "o" or an "a"
• If the knowledge base provides an extra piece of information that the given word is related to animals, then the answer "dog" is correctly picked
From Conventional AI to Computational Intelligence
• Humans usually employ natural languages in
reasoning and drawing conclusions
• Conventional AI attempts to mimic human intelligent
behavior by expressing it in language forms or
symbolic rules
• AI manipulates symbols on the assumption that such behavior can be stored in symbolically structured knowledge bases; this is called the physical symbol system hypothesis
• Symbolic systems provide a good basis for modeling human experts in some narrow problem areas if explicit knowledge is available, e.g., an expert system
A typical expert system
Shortcomings of symbolicism
• In practice, the symbolic manipulations limit the
situations to which the conventional AI theories can be
applied as knowledge acquisition and representation
are difficult tasks
• Hence more attention has been directed toward
biologically inspired methodologies such as brain
modeling, evolutionary algorithms, and immune
modeling; they simulate biological mechanisms
responsible for generating natural intelligence
• These methodologies are somewhat orthogonal to
conventional AI approaches and generally compensate
for the shortcomings of symbolicism
• The long-term goal of AI research is the
creation and understanding of machine
intelligence
• Soft computing shares the same ultimate goal
with AI

An intelligent system
NEURAL NETWORKS
DARPA Neural Network Study (1988, AFCEA International Press, p. 60):

... a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.

An artificial neuron
DEFINITIONS OF NEURAL NETWORKS
According to Haykin (1994), p. 2:

A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

• Knowledge is acquired by the network through a learning process.
• Interneuron connection strengths known as synaptic weights are used to store the knowledge.
According to Nigrin (1993), p. 11:

A neural network is a circuit composed of a very large number of simple processing elements that are neurally based. Each element operates only on local information. Furthermore each element operates asynchronously; thus there is no overall system clock.

According to Zurada (1992):

Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store and utilize experiential knowledge.
A multi-layered neural network
MULTIDISCIPLINARY VIEW OF NEURAL NETWORKS
FUZZY LOGIC

• Origins: multivalued logic for treatment of imprecision and vagueness
– 1930s: Post, Kleene, and Lukasiewicz attempted to represent undetermined, unknown, and other possible intermediate truth-values.
– 1937: Max Black suggested the use of a consistency profile to represent vague (ambiguous) concepts.
– 1965: Zadeh proposed a complete theory of fuzzy sets (and its isomorphic fuzzy logic) to represent and manipulate ill-defined concepts.
FUZZY LOGIC – LINGUISTIC VARIABLES

– Fuzzy logic gives us a language (with syntax and local semantics) into which we can translate our qualitative domain knowledge.
– Linguistic variables are used to model dynamic systems
– These variables take linguistic values that are characterized by:
  • a label - a sentence generated from the syntax
  • a meaning - a membership function determined by a local semantic procedure
FUZZY LOGIC – REASONING METHODS

– The meaning of a linguistic variable may be interpreted as an elastic constraint on its value.
– These constraints are propagated by fuzzy inference operations, based on the generalized modus ponens.
– An FL Controller (FLC) applies this reasoning system to a Knowledge Base (KB) containing the problem domain heuristics.
– The inference is the result of interpolating among the outputs of all relevant rules.
– The outcome is a membership distribution on the output space, which is defuzzified to produce a crisp output.
GENETIC ALGORITHM
EVOLUTIONARY PROCESS
DEFINITION OF GENETIC ALGORITHM

– The genetic algorithm is a probabilistic search algorithm that iteratively transforms a set (called a population) of mathematical objects (typically fixed-length binary character strings), each with an associated fitness value, into a new population of offspring objects using the Darwinian principle of natural selection and operations that are patterned after naturally occurring genetic operations, such as crossover (sexual recombination) and mutation.
STEPS INVOLVED IN GENETIC ALGORITHM
Genetic algorithms follow the process of natural evolution to find better solutions to complicated problems. Foundations of genetic algorithms are given in the books by Holland (1975) and Goldberg (1989).
Genetic algorithms consist of the following steps:

 Initialization
 Selection
 Reproduction with crossover and mutation

Selection and reproduction are repeated for each generation until a solution is reached, as the sketch below illustrates.
During this procedure, strings of symbols known as chromosomes evolve toward better solutions.
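A minimal Python sketch of these steps, under illustrative assumptions (the OneMax fitness function, population size, and crossover/mutation rates are not from the course material):

```python
# A minimal genetic algorithm sketch (assumed example: maximize the
# number of 1-bits in a fixed-length binary chromosome, "OneMax").
import random

CHROM_LEN, POP_SIZE, P_CROSS, P_MUT, GENERATIONS = 20, 30, 0.8, 0.02, 50

def fitness(chrom):                       # fitness = count of 1-bits
    return sum(chrom)

def select(pop):                          # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                    # single-point crossover
    if random.random() < P_CROSS:
        cut = random.randint(1, CHROM_LEN - 1)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(chrom):                        # bit-flip mutation
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in chrom]

# Initialization
population = [[random.randint(0, 1) for _ in range(CHROM_LEN)]
              for _ in range(POP_SIZE)]

# Selection and reproduction repeated for each generation
for gen in range(GENERATIONS):
    offspring = []
    while len(offspring) < POP_SIZE:
        c1, c2 = crossover(select(population), select(population))
        offspring += [mutate(c1), mutate(c2)]
    population = offspring[:POP_SIZE]

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```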
HYBRID SYSTEMS
Hybrid systems enable one to combine various soft computing paradigms to arrive at the best solution. The three major hybrid systems are as follows:

 Hybrid Fuzzy Logic (FL) Systems

 Hybrid Neural Network (NN) Systems

 Hybrid Evolutionary Algorithm (EA) Systems


SOFT COMPUTING: HYBRID FL SYSTEMS
• Taxonomy: approximate reasoning (probabilistic models; multivalued & fuzzy logics, comprising fuzzy systems, multivalued algebras, and fuzzy logic controllers) and functional approximation/randomized search (neural networks; evolutionary algorithms)
HYBRID FL SYSTEMS
• NN modified by FS (fuzzy neural systems)
• FLC tuned by NN (neural fuzzy systems)
• FLC generated and tuned by EA
SOFT COMPUTING: HYBRID NN SYSTEMS
• Taxonomy: approximate reasoning (probabilistic models; multivalued & fuzzy logics) and functional approximation/randomized search (neural networks; evolutionary algorithms); neural networks divide into feedforward NNs (single/multiple layer perceptron, RBF) and recurrent NNs (Hopfield, ART, SOM)
HYBRID NN SYSTEMS
• NN parameters (learning rate η, momentum α) controlled by FLC
• NN topology and/or weights generated by EAs
SOFT COMPUTING: HYBRID EA SYSTEMS
• Taxonomy: approximate reasoning (probabilistic models; multivalued & fuzzy logics) and functional approximation/randomized search (neural networks; evolutionary algorithms); evolutionary algorithms include evolution strategies, evolutionary programs, genetic algorithms, and genetic programs
HYBRID EA SYSTEMS
• EA parameters (N, Pcr, Pmu) controlled by FLC
• EA-based search intertwined with hill-climbing
• EA parameters (population size, selection) controlled by EA
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS

• Human expertise
– SC utilizes human expertise in the form of fuzzy if-then rules, as
well as in conventional knowledge representations, to solve
practical problems
• Biologically inspired computing models
– Inspired by biological neural networks, artificial neural networks
are employed extensively in soft computing to deal with
perception, pattern recognition, and nonlinear regression and
classification problems
• New optimization techniques
– Soft computing applies innovative optimization methods arising from various sources: genetic algorithms (inspired by the evolution and selection process), simulated annealing (motivated by thermodynamics), and the random search method. These optimization methods do not require the gradient vector of an objective function, so they are more flexible in dealing with complex optimization problems
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS

• Numerical computation
– Unlike symbolic AI, soft computing relies mainly on numerical
computation. Incorporation of symbolic techniques in soft
computing is an active research area within this field.
• New application domains
– Because of its numerical computation, soft computing has found
a number of new application domains besides those of AI
approaches. These application domains are mostly computation
intensive and include adaptive signal processing, adaptive
control, nonlinear system identification, nonlinear regression,
and pattern recognition.
• Model-free learning
– Neural networks and adaptive fuzzy inference systems have the
ability to construct models using only target system sample
data. Detailed insight into the target system helps set up the
initial model structure, but it is not mandatory.
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS

• Intensive computation
– Without assuming too much background knowledge of the
problem being solved, neuro-fuzzy and soft computing rely
heavily on high-speed number-crunching computation to
find rules or regularity in data sets. This is a common
feature of all areas of computational intelligence
• Fault tolerance
– Both neural networks and fuzzy inference systems exhibit
fault tolerance. The deletion of a neuron in a neural
network, or a rule in a fuzzy inference system, does not
necessarily destroy the system. Instead, the system
continues performing because of its parallel and
redundant architecture, although performance quality
gradually deteriorates
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS

• Goal driven characteristics


– Neuro-fuzzy and soft computing are goal driven; the path leading from
the current state to the solution does not really matter as long as we
are moving toward the goal in the long run. This is particularly true
when used with derivative-free optimization schemes, such as genetic
algorithms, simulated annealing, and the random search method.
Domain specific knowledge helps reduce the amount of computation
and search time, but it is not a requirement.
• Real-world applications
– Most real-world problems are large scale and inevitably incorporate
built-in uncertainties; this precludes using conventional approaches
that require detailed description of the problem being solved. Soft
computing is an integrated approach that can usually utilize specific
techniques within subtasks to construct generally satisfactory
solutions to real-world problems
APPLICATIONS OF SOFT COMPUTING
 Handwriting Recognition
 Image Processing and Data Compression
 Automotive Systems and Manufacturing
 Soft Computing in Architecture
 Decision-support Systems
 Soft Computing in Power Systems
 Neuro-fuzzy Systems
 Fuzzy Logic Control
 Machine Learning Applications
 Speech and Vision Recognition Systems
 Process Control, and so on
ARTIFICIAL NEURAL NETWORKS
MODULE 2
Dr. R. B. Ghongade, Pune-411043
Topics covered
• Introduction
• Neural networks (Usefulness & Capabilities)
• Nervous System
• Structure of a Nerve Cell
• Electrical Model of a Neuron
• Artificial Neuron
• Activation Functions
• McCulloch-Pitts Neuron (AND-NOT & XOR implementation)
• ANN Topologies
• ANN Architectures
• Learning Paradigms
• Revision of Some Basic Maths
Introduction
• BRAIN COMPUTATION
– A highly complex, non-linear, and parallel computer with structural constituents called "NEURONS"
– The human brain contains about 10^11 nerve cells, or neurons
– On average, each neuron is connected to other neurons through approximately 10^4 synapses

How is it then that the brain outperforms a digital computer?


Neural networks (Usefulness & Capabilities)
a) Exploits Non-linearity
– A system is linear if its output can be expressed as a linear combination of its inputs:
  y = a1·x1 + a2·x2 + … + an·xn
– A system is non-linear if its output contains not only the linear terms but also higher-order terms of its inputs:
  y = a1·x1 + b1·x1² + …
– Most real-world problems are non-linear and hence we need non-linear units to solve them; neurons are non-linear
– Since the brain has a large interconnection of non-linear neurons, non-linearity is distributed throughout
b) Input-Output Mapping
– By modifying free parameters we may be able to map input to
desired output
– This process is called as learning with the help of a teacher
– Two types of learning
• With the help of a teacher
• Without the help of a teacher (also called auto-associative learning)
– We specify that for a given input what should be the output or
desired response
– It is possible that we may get a different output than the desired one; in this case we accordingly modify the set of free parameters so as to get the output closest to the desired output
– This may not be achieved immediately at first, but the difference between the output and the desired value is reduced, and subsequently with more iterations the outputs will match the desired response; this is the process of learning
– A teacher is required to adjust the free parameters
– Thus neural networks are different from conventional systems
because they involve learning
– Ex. Learning of a child
c) Adaptability
– Neural networks can adapt the free parameters to
the changes in the surrounding environment
– Humans adapt to their surroundings from time to time and thus cope with the world!
d) Evidential Response
– Humans respond with confidence level
– Ex. “I think I am going to pass the exam”, denotes
that the individual is confident up to a good
degree
– Thus there is a decision with a measure of
confidence
e) Fault Tolerance
– Even if a single connection is malfunctioning, the nervous system still functions; there is no catastrophic failure
– At the maximum there is graceful degradation
– It depends on how severe the fault is, if the fault is with too
many units, the degree of degradation is large and vice-versa
– It is possible to incorporate this fault tolerant mechanism in
artificial neural networks
f) VLSI Implementation Ability
– It is possible to integrate a large number of artificial neurons
using VLSI technology
– Neurons do independent computations giving a good degree
of parallelism
g) Neurobiological Analogy
– These properties of a biological neuron can be imparted to
the artificial neuron
Nervous System
• Human nervous system may be viewed as a three-stage system
– The receptors convert stimuli from the human body or the external
environment into electrical impulses that convey information to the
neural net (brain)
– Neural (nerve) net, which continually receives information, perceives
it, and makes appropriate decisions
– The effectors convert electrical impulses generated by the neural net
into discernible responses as system outputs
• Arrows pointing from left to right indicate the forward transmission
of information-bearing signals through the system
• The arrows pointing from right to left signify the presence of
feedback in the system
Structure of a Nerve Cell
• Synapses are elementary
structural and functional units
that mediate the interactions
between neurons
• Axons, the transmission lines, and
dendrites, the receptive zones,
constitute two types of cell
filaments that are distinguished
on morphological grounds
• An axon has a smoother surface,
fewer branches, and greater
length
• A dendrite (so called because of
its resemblance to a tree) has an
irregular surface and more
branches
• It is assumed that a synapse is a simple connection that can impose excitation or inhibition, but not both, on the receptive neuron
Operation
• Strength of synaptic connections decide the net
signal coming to the cell body
• Free parameters refer to the strength of the
synapse
• Every input is associated with synaptic strengths
• Assuming some a-priori synaptic strength a cell
computes the response
• If the response differs from the desired value, the
synaptic strengths are adjusted
• This process may take many iterations
Model of a Neuron
Three basic elements of the neuronal model:
• A set of synapses or connecting links, each of which is
characterized by a weight or strength of its own
– Specifically, a signal xj at the input of synapse j connected to neuron k is multiplied by the synaptic weight w_kj
– Unlike a synapse in the brain, the synaptic weight of an
artificial neuron may lie in a range that includes negative
as well as positive values
• An adder for summing the input signals, weighted by the
respective synapses of the neuron; the operations described
here constitute a linear combiner
• An activation function for limiting the amplitude of the output
of a neuron
Electrical Model of a Neuron

y_in_k = b_k + Σ_{i=1}^{m} x_i·w_ki = b_k + x_1·w_k1 + x_2·w_k2 + ⋯ + x_m·w_km

y_k = f(y_in_k)
Effect of bias
• The neuronal model also includes an
externally applied bias, denoted by bk
• The bias bk has the effect of increasing or
lowering the net input of the activation
function, depending on whether it is
positive or negative, respectively
• The use of bias bk , has the effect of applying
an affine transformation to the output y_ink
of the linear combiner in the model
• An affine transformation is any
transformation that preserves collinearity
(i.e., all points lying on a line initially still lie
on a line after transformation)
• Depending on whether the bias bk is
positive or negative, the relationship
between the induced local field or activation
potential y_ink of neuron k and the linear
combiner output uk is modified
Another nonlinear model of a neuron; wk0 accounts for
the bias bk.
Activation Functions- Threshold Function

y_k = 1 if y_in_k ≥ θ
y_k = 0 otherwise
Activation Functions- Signum Function

y_k = 1 if y_in_k ≥ θ
y_k = -1 otherwise
Activation Functions- Linear Function

y = y_in
Activation Functions- Saturating Linear Function

y_k = 1 if y_in_k ≥ 1
y_k = -1 if y_in_k ≤ -1
y_k = y_in_k otherwise
Artificial Neuron Model

y_in_k = Σ_{i=0}^{m} x_i·w_ki = 1·w_k0 + x_1·w_k1 + x_2·w_k2 + ⋯ + x_m·w_km

y_k = f(y_in_k)
y_in_k is called the induced local field
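A minimal Python sketch of this neuron model (the inputs, weights, and bias below are illustrative made-up values):

```python
# A single artificial neuron: weighted sum plus bias, then activation.
import math

def neuron(x, w, b, f):
    """y_in = b + sum(x_i * w_i); y = f(y_in)"""
    y_in = b + sum(xi * wi for xi, wi in zip(x, w))
    return f(y_in)

# Some of the activation functions listed above
threshold = lambda v: 1 if v >= 0 else 0          # theta = 0
signum    = lambda v: 1 if v >= 0 else -1
logsig    = lambda v: 1 / (1 + math.exp(-v))

x = [0.5, -1.0, 2.0]          # illustrative inputs
w = [0.4,  0.6, 0.1]          # illustrative synaptic weights
b = -0.2                      # illustrative bias

print(neuron(x, w, b, threshold), neuron(x, w, b, logsig))
```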
McCulloch-Pitts Neuron
• A McCulloch-Pitts neuron Y may receive signals from any number of other neurons
• Each connection path is either excitatory, with weight w > 0, or inhibitory, with weight -p (p > 0)
• The condition that inhibition is absolute requires that the threshold θ of the activation function satisfy the inequality

  θ > k·w - p

• Y will fire if it receives k or more excitatory inputs and no inhibitory inputs, where

  k·w ≥ θ > (k - 1)·w

• Although all excitatory weights coming into any particular unit must be the same, the weights coming into one unit, say Y1, do not have to be the same as the weights coming into another unit, say Y2
Logical AND NOT Implementation

x1  x2  Y
0   0   0
0   1   0
1   0   1
1   1   0

• Assume that w1, w2 are excitatory and w1 = w2 = 1
• Output is given by y_in = x1·w1 + x2·w2

x1·w1  x2·w2  y_in
0(1)   0(1)   0
0(1)   1(1)   1
1(1)   0(1)   1
1(1)   1(1)   2

• We see that no threshold value can separate the third combination
• Change the weights to w1 = 1, w2 = -1

x1·w1  x2·w2  y_in
0(1)   0(-1)  0
0(1)   1(-1)  -1
1(1)   0(-1)  1
1(1)   1(-1)  0

• Now we can separate the third combination by setting θ = 1
• So with w1 = 1, w2 = -1 and θ = 1 we can have the following response:

y = 1 if y_in ≥ 1
y = 0 otherwise

x1·w1  x2·w2  y_in  θ  y
0(1)   0(-1)  0     1  0
0(1)   1(-1)  -1    1  0
1(1)   0(-1)  1     1  1
1(1)   1(-1)  0     1  0

All the weight and threshold setting has to be done with trial and error!
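A quick Python check of the unit just derived (w1 = 1, w2 = -1, θ = 1):

```python
# Verify the McCulloch-Pitts AND NOT unit: w1 = 1, w2 = -1, theta = 1.
def mp_neuron(inputs, weights, theta):
    y_in = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y_in >= theta else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        y = mp_neuron((x1, x2), (1, -1), theta=1)
        print(x1, x2, "->", y)   # fires only for x1=1, x2=0
```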
Logical XOR Implementation
x1 x2 Y
0 0 0
0 1 1
1 0 1
1 1 0

We observe that we need two AND NOT gates and one OR gate

z1 = x1 AND NOT x2

x1  x2  z1
0   0   0
0   1   0
1   0   1
1   1   0

• For subfunction z1 (AND NOT gate)
• Assume that w11, w21 are excitatory and w11 = w21 = 1
• Output is given by z1_in = x1·w11 + x2·w21

x1·w11  x2·w21  z1_in
0(1)    0(1)    0
0(1)    1(1)    1
1(1)    0(1)    1
1(1)    1(1)    2

• We see that no threshold value can separate the third combination
• Change the weights to w11 = 1, w21 = -1

x1·w11  x2·w21  z1_in
0(1)    0(-1)   0
0(1)    1(-1)   -1
1(1)    0(-1)   1
1(1)    1(-1)   0

• We see that setting θ = 1 can separate the third entry
• So with w11 = 1, w21 = -1 and θ = 1 we can have the desired response

x1·w11  x2·w21  z1_in  θ  z1
0(1)    0(-1)   0      1  0
0(1)    1(-1)   -1     1  0
1(1)    0(-1)   1      1  1
1(1)    1(-1)   0      1  0
z2 = x2 AND NOT x1

x1  x2  z2
0   0   0
0   1   1
1   0   0
1   1   0

• For subfunction z2 (AND NOT gate)
• Assume that w12, w22 are excitatory and w12 = w22 = 1
• Output is given by z2_in = x1·w12 + x2·w22

x1·w12  x2·w22  z2_in
0(1)    0(1)    0
0(1)    1(1)    1
1(1)    0(1)    1
1(1)    1(1)    2

• We see that no threshold value can separate the second combination
• Change the weights to w12 = -1, w22 = 1
• We see that setting θ = 1 can separate the second entry
• So with w12 = -1, w22 = 1 and θ = 1 we can have the desired response

x1·w12  x2·w22  z2_in  θ  z2
0(-1)   0(1)    0      1  0
0(-1)   1(1)    1      1  1
1(-1)   0(1)    -1     1  0
1(-1)   1(1)    0      1  0
y = z1 OR z2

• For subfunction y (OR gate)
• Note that the combination z1 = z2 = 1 NEVER OCCURS

x1  x2  z1  z2  y
0   0   0   0   0
0   1   0   1   1
1   0   1   0   1
1   1   0   0   0

• Assume that v1, v2 are excitatory and v1 = v2 = 1
• Output is given by y_in = z1·v1 + z2·v2

z1·v1  z2·v2  y_in
0(1)   0(1)   0
0(1)   1(1)   1
1(1)   0(1)   1
0(1)   0(1)   0

• We see that with θ = 1 we can separate the second and the third combinations
• With v1 = 1, v2 = 1 and θ = 1 we can have the desired response

x1  x2  z1  z2  y_in  θ  y
0   0   0   0   0     1  0
0   1   0   1   1     1  1
1   0   1   0   1     1  1
1   1   0   0   0     1  0

• Hence the complete solution to the XOR implementation is the two-layer net sketched below


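A Python sketch of the complete two-layer McCulloch-Pitts solution (z1 = x1 AND NOT x2 with weights (1, -1), z2 = x2 AND NOT x1 with weights (-1, 1), y = z1 OR z2 with weights (1, 1), all thresholds 1):

```python
# Two-layer McCulloch-Pitts network implementing XOR.
def mp_neuron(inputs, weights, theta=1):
    y_in = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y_in >= theta else 0

def xor(x1, x2):
    z1 = mp_neuron((x1, x2), (1, -1))   # z1 = x1 AND NOT x2
    z2 = mp_neuron((x1, x2), (-1, 1))   # z2 = x2 AND NOT x1
    return mp_neuron((z1, z2), (1, 1))  # y  = z1 OR z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor(x1, x2))  # prints 0, 1, 1, 0
```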
ANN Topologies
ANN Architectures

• Feedforward network with a single layer of neurons
• Fully connected feedforward network with one hidden layer and one output layer
• Recurrent network with no self-feedback loops and no hidden neurons
• Recurrent network with hidden neurons
Learning Paradigms
LEARNING WITH A TEACHER (SUPERVISED LEARNING)
• The teacher has knowledge of the environment, with that knowledge being represented by a set of input-output examples
• The environment is, however,
unknown to the neural network of
interest
• Suppose now that the teacher and
the neural network are both
exposed to a training vector
• By virtue of built-in knowledge, the
teacher is able to provide the
neural network with a desired
response for that training vector
• The network parameters are adjusted under the combined influence of the
training vector and the error signal, iteratively
• The error signal is defined as the difference between the desired response and
the actual response of the network
LEARNING WITHOUT A TEACHER
• There is no teacher to oversee the learning
process
• Hence there are no labeled examples of the
function to be learned by the network
• Two sub types:
– Reinforcement learning / Neurodynamic
programming
– Unsupervised or self-organized learning
Reinforcement learning
• Learning of an input-output
mapping is performed through
continued interaction with the
environment in order to
minimize a scalar index of
performance
• Built around a critic that converts
a primary reinforcement signal
received from the environment
into a higher quality
reinforcement signal called the
heuristic reinforcement signal,
both of which are scalar inputs
• The system observes a temporal sequence of stimuli (i.e., state vectors) also
received from the environment, which eventually result in the generation
of the heuristic reinforcement signal
• The goal of learning is to minimize a cost-to-go function, defined as the
expectation of the cumulative cost of actions taken over a sequence of steps
instead of simply the immediate cost
Unsupervised Learning

• There is no external teacher or critic to oversee the learning process
• Provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the free parameters of the network are optimized with respect to that measure
• Once the network has become tuned to the statistical regularities of the input data, it develops the ability to form internal representations for encoding features of the input and thereby to create new classes automatically
Revision of Some Basic Maths
• Vector and Matrix

– Row vector/column vector/vector transposition


– Vector length/norm
– Inner/dot product
– Matrix (vector) multiplication
– Linear algebra
– Euclidean space

• Basic Calculus

– Partial derivatives
– Gradient
– Chain rule
Revision of Some Basic Maths
• Inner/dot product

x = [x1, x2, …, xn]^T , y = [y1, y2, …, yn]^T

Inner/dot product of x and y:

x^T y = x1·y1 + x2·y2 + ⋯ + xn·yn = Σ_{i=1}^{n} xi·yi

• Matrix/Vector multiplication
Revision of Some Basic Maths
• Vector space/Euclidean space

• A vector space V is a set that is closed under finite vector addition and scalar multiplication.
• The basic example is n-dimensional Euclidean space, where every element is represented by a list of n real numbers
• An n-dimensional real vector corresponds to a point in the Euclidean space.

  [1, 3] is a point in 2-dimensional space
  [2, 4, 6] is a point in 3-dimensional space
Revision of Some Basic Maths
• Vector space/Euclidean space

– Euclidean space (Euclidean distance)

  ‖X − Y‖ = √((x1 − y1)² + (x2 − y2)² + ⋯ + (xn − yn)²)

– Dot/inner product and Euclidean distance

  • Let x and y be two normalized vectors, ‖X‖ = 1, ‖Y‖ = 1; we can write

    ‖X − Y‖² = (X − Y)^T (X − Y) = 2 − 2·X^T Y

  • Minimization of the Euclidean distance between two vectors corresponds to maximization of their inner product, as the sketch below checks numerically.

– Euclidean distance/inner product as similarity measure


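A small numpy check of this identity on random unit vectors (numpy is assumed available; the seed is arbitrary):

```python
# Check ||x - y||^2 = 2 - 2 x.y for unit vectors, so a smaller distance
# means a larger inner product.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)   # normalize to length 1
y = rng.normal(size=5); y /= np.linalg.norm(y)

dist_sq = np.linalg.norm(x - y) ** 2
print(dist_sq, 2 - 2 * np.dot(x, y))             # the two values agree
```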
Revision of Some Basic Maths
• Basic Calculus

– Multivariable function:
  y(x) = f(x1, x2, ..., xn)

– Partial derivative: gives the direction and speed of change of y with respect to xi

– Gradient: ∇f = [∂f/∂x1, ……, ∂f/∂xn]

– Chain rule: Let y = f(g(x)), u = g(x); then dy/dx = (dy/du)·(du/dx)

– Let z = f(x, y), x = g(t), y = h(t); then dz/dt = (∂f/∂x)·(dx/dt) + (∂f/∂y)·(dy/dt)
Feature Space
• Representing real world objects using feature vectors
• Each object i is described by a feature vector X(i) = [x1(i), x2(i)]
(Figure: sixteen elliptical blobs (objects), each mapped to a feature vector in the (x1, x2) feature space)
Feature Space
 From Objects to Feature Vectors to Points in the Feature Spaces
(Figure: each elliptical blob i corresponds to a point X(i) in the (x1, x2) feature space)
Linear Neuron as classifier

• Consider the line represented by a single neuron with inputs x1, x2, weights w11, w21, and bias b, with equation:

  x1·w11 + x2·w21 + b = 0

  Re-arranging we get:

  x2 = -(w11/w21)·x1 - (b/w21)

• This is of the form y = mx + c, where m = -(w11/w21) and c = -(b/w21)
• If we set appropriate values for w11, w21, b (such that yellow objects produce output < 0 and red objects produce output > 0) then we can use this line as a decision boundary separating objects in the (x1, x2) feature space
• Now if a new object T = [t1 t2] produces a neuron output y > 0, it simply means the object belongs to the red object class
• Thus the linear neuron works as a classifier!
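A Python sketch of this decision rule (the weights, bias, and test points are assumed illustrative values, chosen so the line x1 + x2 = 1 separates them):

```python
# Linear neuron as a two-class classifier: sign of w11*x1 + w21*x2 + b.
def classify(point, w=(1.0, 1.0), b=-1.0):
    x1, x2 = point
    y = w[0] * x1 + w[1] * x2 + b      # neuron output before thresholding
    return "red" if y > 0 else "yellow"

print(classify((2.0, 1.5)))   # above the line x1 + x2 = 1 -> red
print(classify((0.1, 0.2)))   # below the line -> yellow
```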
ARTIFICIAL NEURAL NETWORKS
MODULE 3
Dr. R. B. Ghongade
Topics covered
• Linear Regression
• Gradient Descent Algorithm
• More Activation Functions
• Learning Processes
– Error correction Learning
– Memory based Learning
– Hebbian Learning
– Competitive Learning
• Biases and Thresholds
• Linear Separability
• Perceptron
Linear Regression
• In statistics, linear regression is an
approach to modeling the
relationship between a scalar
variable y and one or more
explanatory variables denoted X
• Once we know the equation of this
fitted line we can use it to find y
given X
• Say, we have conducted an experiment that gives us some output value for a set of inputs
• There are various conventional methods to do this, e.g., ordinary least squares
• The best fit line has equation: y = mx + c
• We have a choice of two parameters: m and c
• A linear neuron can be used for this task since the output of a linear neuron is:

  y = x0·w0 + x1·w1

• If we keep x0 = 1, then w1 is the slope and w0 is the intercept; hence we have to modify w1 and w0 to obtain the best fit line
• If we make a good choice of w0 and w1 our job is over!
• Many real world problems involve more than one independent variable; then the regression is called multiple regression
• For example if we have two independent variables x1 and x2, then it becomes a 2-D problem, and the new network would be different
• Now we have to adjust two slopes and one intercept
• This can be extended to solve n-dimensional problems
• For a function like y = f(x1, x2) we have a 2-D problem
The Concept of Error Energy

• We can have a combined error measure by squaring and adding all the errors and dividing by the number of observations, giving the MEAN SQUARE ERROR (mse); this in fact is the ERROR ENERGY
• This mse shows how well or badly the line is fitted!
• Considering a 1-D problem, the error energy at a point p is:

  e_p = (t_p - y_p)²

• The total error energy is given by:

  E = Σ_p e_p = Σ_p (t_p - y_p)²

• Why do we have to measure this error energy?
• In order to get the best fit the error energy should be minimized!
Gradient Descent Algorithm
• If we plot the error energy versus weights w0 and w1, we get an error graph showing the combined error over the set of all input points for various values of w0 and w1

• The objective is to reach the minimum error energy point by adjusting w0 and w1, at which point we say that the ANN has converged
Minima

• We may have two or more minima, our goal being to reach the global minima
• We cannot guarantee the global minima
How do we reach the (hopefully global) minima?

• We can find the gradient (slope) at the starting point and slide down in the opposite direction of the gradient to reach the minima
• This technique is called the steepest descent approach
• Even though reaching the global minima is not guaranteed by this approach, we may find a good combination of w0 and w1 which gives the lowest fitting error!
The Algorithm
• Let there be p observations available for training; then

  e^p = Σ_o (t_o^p - y_o^p)²

  since there may be several outputs o
• For computational convenience we define e^p as:

  e^p = (1/2)·Σ_o (t_o^p - y_o^p)² ..............(1)

• The gradient of the error with respect to any weight is ∂e/∂w; we wish to find the gradient with respect to a certain weight w_oi and a particular pattern p
• Using the chain rule we get

  ∂e^p/∂w_oi = (∂e^p/∂y_o^p)·(∂y_o^p/∂w_oi) ..............(2)

• From (1) we have ∂e^p/∂y_o^p = -(t_o^p - y_o^p)
• But ∂y_o^p/∂w_oi = x_i ..............(3)
• Combining (2) and (3) we have

  ∂e^p/∂w_oi = -(t_o - y_o)·x_i

• This is the gradient with respect to one particular weight w_oi, considering o as the output and i as the input unit
• Now we have to move in the opposite direction of the derivative; hence the correction to be applied is obtained by simply multiplying the error by the input, thus

  Δw_oi = (t_o - y_o)·x_i

• Hence the new weight is

  w_oi(new) = w_oi(old) + Δw_oi

• Generally we apply a constant as a controlling parameter called the learning rate η

  Δw_oi = η·(t_o - y_o)·x_i
• If learning rate is high the network learns fast and vice-versa
• But a higher learning rate may lead to unstable operation and
no learning at all
• If learning is slower, it is at least guaranteed that there is
progress
• We have to thus look out for an optimal learning rate
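A minimal Python sketch of this per-pattern update fitting a line (the synthetic noisy data, learning rate, and epoch count are illustrative assumptions):

```python
# Per-pattern gradient descent (delta rule) for a linear neuron y = w1*x + w0.
import random

data = [(x, 2.0 * x + 1.0 + random.uniform(-0.1, 0.1))  # noisy line y ~ 2x + 1
        for x in [i / 10 for i in range(20)]]

w0, w1, eta = 0.0, 0.0, 0.05        # intercept, slope, learning rate
for epoch in range(200):
    for x, t in data:
        y = w1 * x + w0             # forward pass
        err = t - y                 # (t_o - y_o)
        w1 += eta * err * x         # delta w = eta * error * input
        w0 += eta * err             # bias input is fixed at 1

print(round(w1, 2), round(w0, 2))   # approaches slope 2 and intercept 1
```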
Non-linear Regression
• Real world problems are mostly non-linear hence we have to go for non-linear
regression
• This can be done with a non-linear neuron, i.e., a neuron with a non-linear activation function
2-D Non-linear Regression
Activation Functions
• To map non-linear output functions we require non-
linear activation functions
• The activation functions should be
• Continuous
• Monotonically increasing
• Differentiable
Monotonicity

• Monotonically increasing function

• Monotonically decreasing function

• Non Monotonic function


Activation Functions
Log Sigmoid Output limits:[0,1]
Activation Functions
Bipolar Sigmoid Output limits:[-1,1]
Activation Functions
Tanh Sigmoid Output limits: [-1,1]

y = tanh(a·y_in) = (exp(a·y_in) - exp(-a·y_in)) / (exp(a·y_in) + exp(-a·y_in))
Linear regression MATLAB Demo
Learning Processes
• Ability of the network to learn from its environment, and to
improve its performance through learning
• Neural network learns about its environment through an
interactive process of adjustments applied to its synaptic
weights and bias levels
• Definition:
Learning is a process by which the free parameters of a neural
network are adapted through a process of stimulation by the
environment in which the network is embedded. The type of
learning is determined by the manner in which the parameter
changes take place
• The learning process implies the following sequence of events:
1. The neural network is stimulated by an
environment
2. The neural network undergoes changes in its
free parameters as a result of this stimulation
3. The neural network responds in a new way to
the environment because of the changes that
have occurred in its internal structure
• A prescribed set of well-defined rules for the
solution of a learning problem is called a
learning algorithm
Learning Mechanisms
• Five Types
– Error correction Learning
– Memory based Learning
– Hebbian Learning
– Competitive Learning
– Boltzmann Learning
Error correction Learning
• We compute the error at time step n for output neuron k as:

  e_k(n) = d_k(n) - y_k(n)

• Then we minimize the term

  E(n) = (1/2)·Σ_k e_k²(n)

• The weight correction rule we get is

  Δw_kj(n) = η·e_k(n)·x_j(n)

• This learning rule is called the DELTA rule or WIDROW-HOFF rule
• The updated synaptic weight for the next time step is then

  w_kj(n+1) = w_kj(n) + Δw_kj(n)

• The adjustment made to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse in question
Memory based Learning
• In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples

  {(x_i, d_i)}, i = 1, …, N

  where x_i = input vector and d_i = desired response

• If this is a binary classification problem, there are two classes/hypotheses denoted by H1 and H2; d_i then takes the value 0 (or -1) for H1 and the value 1 for H2
• When classification of a test vector (not seen before) is
required, the algorithm responds by retrieving and analyzing
the training data in a "local neighborhood" of 𝐭𝐞𝐬𝐭
• Memory-based learning algorithms involve two essential
ingredients
1. Criterion used for defining the local neighborhood of the test vector
𝐱 𝐭𝐞𝐬𝐭
2. Learning rule applied to the training examples in the local
neighborhood of 𝐱 𝐭𝐞𝐬𝐭
• The algorithms differ from each other in the way in which
these two ingredients are defined
• In the simplest type of memory-based learning, known as the nearest neighbor rule, the local neighborhood is defined as the training example that lies in the immediate neighborhood of the test vector x_test
• The vector x_N' ∈ {x_1, x_2, …, x_N} is said to be the nearest neighbor of x_test if

  min_i d(x_i, x_test) = d(x_N', x_test)

  where d(x_i, x_test) is the Euclidean distance between the vectors x_i and x_test
• The class associated with the minimum distance, that is, vector x_N', is reported as the classification of x_test
• But this simple rule poses a problem:

• The test vector seems to have the least Euclidean distance with an
outlier from Class 0 and is classified as belonging to Class 0
• This is wrong!
• To solve this problem we modify the nearest neighbor rule to
k-nearest neighbor classifier
• Identify the k classified patterns that lie nearest to the test vector x_test, for some integer k
• Assign x_test to the class (hypothesis) that is most frequently represented in the k-nearest neighbors to x_test (i.e., use a majority vote to make the classification)
• Here, k = 3, and x_test is now classified as belonging to Class 1 (out of the three nearest neighbors, two belong to Class 1)
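A compact Python sketch of the k-nearest neighbor rule (the stored examples and the test vector are made-up illustrative points):

```python
# k-nearest neighbor classification by majority vote over Euclidean distance.
from collections import Counter
import math

def knn_classify(x_test, examples, k=3):
    """examples: list of (vector, label) pairs; returns the majority label."""
    by_dist = sorted(examples, key=lambda ex: math.dist(ex[0], x_test))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.1), 1),
         ((0.9, 0.8), 1), ((1.2, 0.9), 1)]
print(knn_classify((0.8, 1.0), train, k=3))   # -> 1 (two of three neighbors)
```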
Hebbian Learning
• Hebbian learning is considered to be closer to the learning mechanism of a biological neuron
• Hebb, a neurophysiologist, in his 1949 book "The Organization of Behavior", postulated the "Hebb Rule"
“When an axon of cell A is near enough
to excite a cell B and repeatedly or
persistently takes part in firing it, some
growth process or metabolic changes
take place in one or both cells such that
A's efficiency as one of the cells firing B,
is increased”
• Thus if cell A fires consistently the cell B then the synaptic
weight increases such that the next time cell A has got a
better probability of firing cell B
• Also if cell A does not take part in firing cell B, the synaptic
weight weakens
• If pre-synaptic neuron and the post-synaptic neurons show similar
activation we shall increase the synaptic weight and vice-versa
• Hebbian synapse is a synapse that uses a time dependent, highly
local, and strongly interactive mechanism to increase synaptic
efficiency as a function of the correlation between the presynaptic
and postsynaptic activities
• Four key mechanisms (properties) that characterize a Hebbian
synapse
1. Time-dependent mechanism: modifications in a Hebbian synapse
depend on the exact time of occurrence of the presynaptic and
postsynaptic signals
2. Local mechanism: locally available information is used by a Hebbian
synapse to produce a local synaptic modification that is input specific
3. Interactive mechanism: Hebbian form of learning depends on a "true
interaction" between presynaptic and postsynaptic signals in the sense
that we cannot make a prediction from either one of these two activities
by itself
4. Correlational mechanism: the correlation over time between presynaptic
and postsynaptic signals is viewed as being responsible for a synaptic
change
Mathematical Models of Hebbian Modifications
• Consider a synaptic weight w_kj of neuron k with presynaptic and postsynaptic signals denoted by x_j and y_k respectively.
• The adjustment applied to the synaptic weight w_kj at time step n is expressed in the general form:

  Δw_kj(n) = F(y_k(n), x_j(n))

  where F is a function of both postsynaptic and presynaptic signals
• Hebb's hypothesis or Hebbian learning rule is then given as

  Δw_kj(n) = η·y_k(n)·x_j(n)

  where η is a positive constant that determines the rate of learning
• This rule is also called the activity product rule
• But this rule has a basic flaw:
  • The repeated application of the input signal (presynaptic activity) x_j leads to an increase in y_k, and therefore exponential growth that finally drives the synaptic connection into saturation
  • At that point no information will be stored in the synapse and selectivity is lost
• This limitation is overcome by the Covariance hypothesis
• Presynaptic and postsynaptic signals are replaced by the departure of presynaptic and postsynaptic signals from their respective average values over a certain time interval
• Let x̄ and ȳ denote the time-averaged values of the presynaptic signal x_j and postsynaptic signal y_k, respectively
• According to the covariance hypothesis, the adjustment applied to the synaptic weight w_kj is defined by:

  Δw_kj = η·(y_k - ȳ)·(x_j - x̄)
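A short Python sketch contrasting the activity product rule with the covariance rule (the signal histories and learning rate are illustrative assumptions):

```python
# Hebbian updates: plain activity product rule vs. covariance hypothesis.
def hebb_update(w, x_j, y_k, eta=0.1):
    return w + eta * y_k * x_j                      # activity product rule

def covariance_update(w, x_j, y_k, x_bar, y_bar, eta=0.1):
    return w + eta * (y_k - y_bar) * (x_j - x_bar)  # covariance rule

w = 0.5
x_hist, y_hist = [1.0, 0.9, 1.1, 1.0], [0.8, 0.7, 0.9, 0.8]
x_bar = sum(x_hist) / len(x_hist)   # time-averaged presynaptic signal
y_bar = sum(y_hist) / len(y_hist)   # time-averaged postsynaptic signal

# The plain rule only grows w under persistently positive activity; the
# covariance rule can also depress w when activity falls below average.
print(hebb_update(w, 1.0, 0.8))
print(covariance_update(w, 0.9, 0.7, x_bar, y_bar))
```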
Competitive Learning
• Output neurons of a neural network compete among
themselves to become active (fired)
• In competitive learning only a
single output neuron is active
at any one time
• Highly suited to discover
statistically salient features
that may be used to classify a
set of input patterns
• Accordingly the individual
neurons of the network learn
to specialize on ensembles of
similar patterns; in so doing
they become feature
detectors for different classes
of input patterns
• There are three basic elements to a competitive
learning rule
– A set of neurons that are all the same except for
some randomly distributed synaptic weights, and
which therefore respond differently to a given set
of input patterns
– A limit imposed on the "strength" of each neuron
– A mechanism that permits the neurons to
compete for the right to respond to a given subset
of inputs, such that only one output neuron, or
only one neuron per group, is active (i.e., "on") at
a time. The neuron that wins the competition is
called a winner-takes-all neuron
• For a neuron k to be the winning neuron, its induced local field v_k for a specified input pattern x must be the largest among all the neurons in the network
• The output signal y_k of winning neuron k is set equal to one; the output signals of all the neurons that lose the competition are set equal to zero
• Thus

  y_k = 1 if v_k > v_j for all j, j ≠ k
  y_k = 0 otherwise

  where the induced local field v_k represents the combined action of all the forward and feedback inputs to neuron k
• Let w_kj denote the synaptic weight connecting input node j to neuron k
• Suppose that each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is distributed among its input nodes; that is

  Σ_j w_kj = 1 for all k

• As per the standard competitive learning rule, the change Δw_kj applied to synaptic weight w_kj is defined by:

  Δw_kj = η·(x_j - w_kj) if neuron k wins the competition
  Δw_kj = 0 if neuron k loses the competition

• This rule has the overall effect of moving the synaptic weight vector w_k of winning neuron k toward the input pattern x
• Thus, for a given input, we are effectively moving the winning neuron's weight vector toward the input pattern X
• What we ultimately do is align the weights toward the input vector for a specific input and weight combination
Geometric Interpretation
• Consider vector 𝐗 𝑥 𝑥 𝑥
• The constraint we lay down is that
𝑋 1 , this means
𝑥1 𝑥2 𝑥3 1

• Suppose we have three patterns with


same unit length 𝑋 , 𝑋 , 𝑋
• These vectors will lie on the surface of
a sphere(there can be a large number
of vectors)
• We expect that the output neurons
represent these clusters
• Similar to input vectors we can assume (constrain) that sum of squares of weights
per neuron is equal to 1:

𝑤 1

• Now when we present, say, X1, output y1 wins and its weights are adjusted, while for the other neurons the weights stay unchanged
• On presenting inputs belonging to cluster 1, W1 moves closer to cluster 1
• Similarly on presenting the other vectors X2 and X3, W2 and W3 get aligned to the respective clusters, as the sketch below illustrates
• But what will happen if there are more than four clusters?
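A Python sketch of winner-takes-all updates on unit vectors (the three clusters, number of neurons, and learning rate are illustrative assumptions; numpy is assumed available):

```python
# Winner-takes-all competitive learning: move the winner's weight vector
# toward the input, leave the losers unchanged.
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# Three clusters of unit vectors in 3-D (illustrative data)
centers = [unit(np.array(c, dtype=float)) for c in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
inputs = [unit(c + 0.1 * rng.normal(size=3)) for c in centers for _ in range(20)]

W = np.array([unit(rng.normal(size=3)) for _ in range(3)])  # one row per neuron
eta = 0.1

for _ in range(50):
    for x in inputs:
        k = int(np.argmax(W @ x))    # winner: largest induced local field
        W[k] += eta * (x - W[k])     # delta w = eta (x - w) for the winner only
        W[k] = unit(W[k])            # keep the weight vector on the unit sphere

print(np.round(W, 2))                # rows align with the three cluster centers
```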
Biases and Thresholds
• A bias acts exactly as a weight on a connection from a unit
whose activation is always 1
• Increasing the bias increases the net input to the unit
• If a bias is included, the activation function is typically taken to be

  y = f(y_in) = 1 if y_in ≥ 0; -1 if y_in < 0

  where

  y_in = b + Σ_i x_i·w_i

• If a bias is not used, the same network can be described as

  y = f(y_in) = 1 if y_in ≥ θ; -1 if y_in < θ

  where

  y_in = Σ_i x_i·w_i
• Consider the following neural net: we consider the separation of the input space into regions where the response of the net is positive and regions where the response is negative
• The boundary between the values of x1 and x2 for which the net gives a positive response and the values for which it gives a negative response is the separating line:

  b + w1·x1 + w2·x2 = 0

  or, assuming w2 ≠ 0,

  x2 = -(w1/w2)·x1 - (b/w2)

• The requirement for a positive response from the output unit is that the net input it receives, namely y_in = b + w1·x1 + w2·x2, be greater than 0
• During training, values of w1, w2 and b are determined so that the net will have the correct response for the training data
• If we think in terms of the threshold, the requirement for a positive response from the output unit is that the net input it receives, namely y_in = w1·x1 + w2·x2, be greater than the threshold θ
• This gives the equation of the line separating positive from negative output as:

  w1·x1 + w2·x2 = θ

  or, assuming w2 ≠ 0,

  x2 = -(w1/w2)·x1 + θ/w2

• During training, values of w1 and w2 are determined so that the net will have the correct response for the training data
• In this case, the separating line cannot pass through the origin, but a line can be found that passes arbitrarily close to the origin
• Thus there is no advantage to including both a bias and a
nonzero threshold for a neuron that uses the step function as
its activation function
• Also including neither a bias nor a threshold is equivalent to
requiring the separating line (or plane or hyperplane for
inputs with more components) to pass through the origin
• This may or may not be appropriate for a particular problem
• Example: Going to watch a cricket match!
• Conclusion: Since it is the relative values of the weights,
rather than their actual magnitudes, that determine the
response of the neuron, the model can cover all possibilities
using either the fixed threshold or the adjustable bias
• So we can use either adjustable bias or fixed threshold NOT
both!
Advantages of bias over threshold
• Using bias can remove the dependency on
threshold
• We can modify the bias (more adaptable) but
threshold is fixed
• Bias helps in simplifying the separation
boundaries
• Computational flexibility and ease is more if we
use bias as thresholds may be different for
different output neurons
• Hence use of bias is more preferable
Linear Separability
• The intent is to train the net (i.e., adaptively determine its
weights) so that it will respond with the desired classification
when presented with an input pattern that it was trained on
or when presented with one that is sufficiently similar to one
of the training patterns
• For a particular output unit, the desired response is a "yes" if
the input pattern is a member of its class and a "no" if it is not
• A "yes" response is represented by an output signal of 1, a
"no" by an output signal of -1 (for bipolar signals)
• Since we want one of two responses, the activation (or
transfer or output) function is taken to be a step function
• The value of the function is 1 if the net input is positive and -
1 if the net input is negative
• Since the net input to the output unit is

  y_in = b + Σ_i x_i·w_i

• It is easy to see that the boundary between the region where y_in > 0 and the region where y_in < 0, which we call the decision boundary, is determined by the relation

  b + Σ_i x_i·w_i = 0

• If there are weights (and a bias) so that all of the training input vectors for which the correct response is +1 lie on one side of the decision boundary and all of the training input vectors for which the correct response is -1 lie on the other side of the decision boundary, we say that the problem is "linearly separable"
• The region where y is positive is separated from the region where it is negative by the line

  b + Σ_i x_i·w_i = 0

• These two regions are often called decision regions for the net
Response regions for the AND function

x1  x2  Y
-1  -1  -1
-1   1  -1
 1  -1  -1
 1   1   1

• Solving this graphically, we find that the red line can be a good decision boundary
• Points (0,1) and (1,0) lie on the line; hence using the equation

  x2 = -(w1/w2)·x1 - (b/w2)

  and assuming w2 = 1, point (0,1) gives b = -1, and point (1,0) gives w1 = 1
• Actually the choice of sign for b is determined by the requirement that b + w1·x1 + w2·x2 < 0 at the point (0,0)
• We can then set x1 = x2 = 0 and compute b, knowing to which side of the line the point (0,0) should lie
Response regions for the OR function

x1  x2  Y
-1  -1  -1
-1   1   1
 1  -1   1
 1   1   1

• Points (-1,0) and (0,-1) lie on the line; hence using the equation

  x2 = -(w1/w2)·x1 - (b/w2)

  and assuming w2 = 1, we get w1 = 1 and b = 1
XOR - Example of linearly non-separable problems

x1  x2  Y
-1  -1  -1
-1   1   1
 1  -1   1
 1   1  -1

What should the decision boundary be, and where?
The Perceptron
• The perceptron learning rule is a more powerful learning rule than the Hebb rule
• Under suitable assumptions, its iterative learning procedure can be proved to converge to the correct weights, i.e., the weights that allow the net to produce the correct output value for each of the training input patterns
• The weight update rule is

  w_i(new) = w_i(old) + η·t·x_i

  where t is the target output and η is the learning rate
• If an error did not occur, the weights would not be changed
• Training continues until no error occurs
• The perceptron learning rule convergence theorem states that if weights exist to allow the net to respond correctly to all training patterns, then the rule's procedure for adjusting the weights will find values such that the net does respond correctly to all training patterns (i.e., the net solves the problem, or learns the classification)
• Also, the net will find these weights in a finite number of training steps
Perceptron Architecture

• The goal of the net is to classify each input pattern as


belonging, or not belonging, to a particular class.
• Belonging is signified by the output unit giving a response
of + 1; not belonging is indicated by a response of - 1
The perceptron algorithm
STEP 0 Initialize weights and biases.
       (Set to zero.)
       Set learning rate η (0 < η < 1)

STEP 1 While stopping condition is false, do Steps 2-6.

  STEP 2 For each training pair, do Steps 3-5.
    STEP 3 Set activations of input units
           x_i = s_i
    STEP 4 Compute response of the output unit

           y_in = b + Σ_i x_i·w_i

           y = 1 if y_in > θ
           y = 0 if -θ ≤ y_in ≤ θ
           y = -1 if y_in < -θ

    STEP 5 Update weights and bias if an error occurred for this pattern
           If y ≠ t
             w_i(new) = w_i(old) + η·t·x_i
             b(new) = b(old) + η·t
           else
             w_i(new) = w_i(old)
             b(new) = b(old)

  STEP 6 Test stopping condition
         If no weights changed in Step 2, stop; else, continue
• The threshold here plays a different role
• The threshold on the activation function for the response unit is a fixed, non-negative value θ
• The form of the activation function for the output unit (response unit) is such that there is an "undecided" band (of fixed width determined by θ) separating the region of positive response from that of negative response.
A Perceptron for the AND function: bipolar inputs, bipolar targets

x1  x2  t
-1  -1  -1
-1   1  -1
 1  -1  -1
 1   1   1

• The training process for bipolar input, η = 1, threshold θ = 0, and initial weights and bias w1 = w2 = b = 0

First iteration
Input (x1 x2 1)   y_in   y    t    Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                                 ( 0  0  0)
 -1  -1  1          0    0   -1          1   1  -1               ( 1  1 -1)
 -1   1  1         -1   -1   -1          0   0   0               ( 1  1 -1)
  1  -1  1         -1   -1   -1          0   0   0               ( 1  1 -1)
  1   1  1          1    1    1          0   0   0               ( 1  1 -1)

Second iteration
Input (x1 x2 1)   y_in   y    t    Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                                 ( 1  1 -1)
 -1  -1  1         -3   -1   -1          0   0   0               ( 1  1 -1)
 -1   1  1         -1   -1   -1          0   0   0               ( 1  1 -1)
  1  -1  1         -1   -1   -1          0   0   0               ( 1  1 -1)
  1   1  1          1    1    1          0   0   0               ( 1  1 -1)

• We see that there is no weight change in the second iteration, hence we conclude that the net has converged
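A Python sketch reproducing this training run (bipolar AND, η = 1, θ = 0, zero initial weights and bias):

```python
# Perceptron training for bipolar AND with an undecided band of width theta.
def activation(y_in, theta=0.0):
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0                           # undecided band

data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w, b, eta = [0.0, 0.0], 0.0, 1.0

changed = True
while changed:                         # repeat epochs until no weight changes
    changed = False
    for (x1, x2), t in data:
        y = activation(b + x1 * w[0] + x2 * w[1])
        if y != t:                     # update only when an error occurs
            w[0] += eta * t * x1
            w[1] += eta * t * x2
            b += eta * t
            changed = True

print(w, b)                            # converges to [1.0, 1.0], -1.0
```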
MULTI-LAYERED PERCEPTRON AND
BACKPROPAGATION
MODULE 4
Dr. Rajesh B. Ghongade
Agenda
• Perceptron and its limitations
• MLP architecture
• Activation functions
• Gradient Descent Algorithm and Delta Rule
• Generalized Delta Rule( Backpropagation)
• Signal Flow
• Standard Backpropagation Algorithm
• XOR problem
• MATLAB Demo
• Some tips for net convergence
• Variations in standard backpropagation algorithm
• Applications of MLP trained with backpropagation algorithm
Perceptron and its limitation
(Figure: single-layer perceptron with inputs x1, x2 applied at units X1, X2, weights w11, w21, bias w01, net input y_in1, and output unit Y1 producing y1; hard-limit activation: a = 0 for n < 0, a = 1 for n ≥ 0)
• Creates a linear separation boundary called the decision boundary
• Capable of classifying linearly separable objects only

  w01 + w11·x1 + w21·x2 = w01 + Σ_{j=1}^{2} wj1·xj = 0
• Minsky and Papert (1969) showed that a perceptron is incapable of solving a simple XOR problem, which is not linearly separable

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

• Minsky and Papert (1969) also showed that such a problem can be solved by adding another layer of perceptrons and combining the responses
(Figure: two-layer network with hidden units Z1, Z2 (weights v11, v21, v12, v22, biases v01, v02) feeding output unit Y1 (weights w11, w21, bias w01); each hidden unit draws a line in the (x1, x2) plane, and in the transformed (z1, z2) space, with corners (0,1), (1,0), (1,1), the problem becomes linearly separable)

XOR problem can be solved, but how do we find all the weights and biases?
MLP Architecture
(Figure: fully connected multilayer perceptron with input units X1 … Xi … Xn, hidden units Z1 … Zj … Zp with weights vij and biases v0j, and output units Y1 … Yk … Ym with weights wjk and biases w0k)
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 2 layers
• Number of output units need not equal number of input
units
• Number of hidden units per layer can be more or less than
input or output units
• Can also view 1st layer as using local knowledge while 2nd
layer does global
• With sigmoidal activation functions can show that a 2
layer net can approximate any function to arbitrary
accuracy: property of Universal Approximation
1st layer draws linear boundaries, the 2nd layer combines the boundaries, and the 3rd layer can generate arbitrarily complex boundaries
Concept of layers!

Number of layers = layers of weights


or
Number of layers = layers of processing elements
Activation Functions
• To map non-linear output functions we require non-
linear activation functions
• The activation functions should be
• Continuous
• Monotonically increasing
• Differentiable
Activation Functions
Log Sigmoid     f(x) = y = 1 / (1 + e^(-x))     Output limits: [0,1]

Derivative      f'(x) = dy/dx = f(x)·(1 - f(x))
Activation Functions
Bipolar Sigmoid     f(x) = y = 2 / (1 + e^(-x)) - 1     Output limits: [-1,1]

Derivative      f'(x) = dy/dx = (1/2)·(1 + f(x))·(1 - f(x))
Activation Functions
Tanh Sigmoid     f(x) = y = (e^x - e^(-x)) / (e^x + e^(-x))     Output limits: [-1,1]

Derivative      f'(x) = dy/dx = 1 - (f(x))²
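The same three pairs in Python, with each derivative written in terms of f(x) as above and checked numerically against a central finite difference:

```python
# Sigmoid activation functions and their derivatives expressed via f(x).
import math

def logsig(x):     return 1.0 / (1.0 + math.exp(-x))
def logsig_d(x):   f = logsig(x);  return f * (1 - f)

def bipolar(x):    return 2.0 / (1.0 + math.exp(-x)) - 1.0
def bipolar_d(x):  f = bipolar(x); return 0.5 * (1 + f) * (1 - f)

def tanh_d(x):     f = math.tanh(x); return 1 - f * f

# Finite-difference check that the closed forms are right
h, x = 1e-6, 0.7
for f, fd in ((logsig, logsig_d), (bipolar, bipolar_d), (math.tanh, tanh_d)):
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(abs(numeric - fd(x)) < 1e-6)   # True for all three
```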
Gradient Descent Algorithm

• Error is defined as the deviation of the actual output from the desired target:

  e = t - y = t - y_in = f(x_I, w_I)

  where y_in = w0 + Σ_{i=1}^{n} wi·xi = Σ_{i=0}^{n} wi·xi

• Error energy is then: E = (1/2)·e² = (1/2)·(t - y_in)²
• In order to reduce the squared error, i.e., error energy, we have to find the partial derivative of E with respect to the weights, and modify the weights in a direction opposite to the partial derivative; this is an optimization technique known as gradient descent.
• Hence we want to compute ∂E/∂w_I, i.e., the gradient of error energy with respect to weight w_I
• Once the gradient is obtained we can move along the opposite direction to the gradient in the hope of reaching the global minima
• Hence the weights have to be modified as:

  Δw_I = -η·∂E/∂w_I

  But ∂E/∂w_I = -(t - y_in)·(∂y_in/∂w_I)

  And ∂y_in/∂w_I = x_I since y_in = w0 + Σ_{i=1}^{n} wi·xi = Σ_{i=0}^{n} wi·xi

  hence Δw_I = -η·∂E/∂w_I = η·(t - y_in)·x_I

  Thus the delta rule becomes Δw_I = η·(t - y_in)·x_I

  where η is the learning rate (0 < η < 1)
Backpropagation
• In the perceptron/single layer nets, we used gradient descent on the error function to find the correct weights:

  Δw_I = η·(t - y_in)·x_I

• We see that errors/updates are local to the node, i.e., the change in the weight w_ij from node i to output j is controlled by the input that travels along the connection and the error signal from output j

• But with more layers, how are the weights for the first 2 layers found when the error is computed for layer 3 only?

• There is no direct error signal for the first layers!!!!!
Derivation of Backpropagation Algorithm
• We shall denote the weight between hidden unit j and the output unit k by w_jk
• The subscripts are used analogously for the weights v_ij between input unit i and the hidden unit j
• Let f be any arbitrary activation function with the derivative f'
• We desire to minimize the error by modifying the weights
• Since ∂E/∂w denotes the change of error with respect to the weights, we want to modify the weights in the opposite direction to ∂E/∂w, hence

  Δw = -η·∂E/∂w, where η = learning rate

• Let us find out the dependence of the output error on the weights in the output and the hidden layers.
• We shall first define the error term to be minimized as:

  e_k = t_k - y_k

  E = (1/2)·Σ_k (t_k - y_k)² = (1/2)·Σ_k e_k²

  E_av = (1/L)·Σ_{l=1}^{L} E_l

• The output of neuron k before applying the activation function is

  y_in_k = w_0k + Σ_{j=1}^{p} zj·w_jk = Σ_{j=0}^{p} zj·w_jk

  y_k = f(y_in_k)

• We want to find out the contribution of weight w_jk to the error E, i.e., ∂E/∂w_jk
• We can express it using the chain rule as follows:

  ∂E/∂w_jk = (∂E/∂e_k)·(∂e_k/∂y_k)·(∂y_k/∂y_in_k)·(∂y_in_k/∂w_jk)

  but ∂E/∂e_k = e_k , ∂e_k/∂y_k = -1 , ∂y_k/∂y_in_k = f'(y_in_k) , ∂y_in_k/∂w_jk = zj

  hence ∂E/∂w_jk = -e_k·f'(y_in_k)·zj

• Now we want to reduce the error by changing the weights proportionately as:

  Δw_jk = -η·∂E/∂w_jk

  where η = learning rate

  hence Δw_jk = η·e_k·f'(y_in_k)·zj

• We define the local gradient as:

  δ_k = e_k·f'(y_in_k)

• Hence the weight update equation becomes:

  Δw_jk = η·δ_k·zj

• Actually δ_k = -∂E/∂y_in_k, as can be shown by using the chain rule again as follows:

  δ_k = -∂E/∂y_in_k = -(∂E/∂e_k)·(∂e_k/∂y_k)·(∂y_k/∂y_in_k) = -e_k·(-1)·f'(y_in_k) = e_k·f'(y_in_k)

• Thus the local gradient is the negative derivative of the error energy with respect to the neuron's own induced field.
Computing the local gradient for a hidden layer neuron:
• Computing the local gradient δ_k is easy since the error value is directly available, but that for the hidden layer neurons requires more analysis.
• We start again with the definition of the local gradient; now we want to compute the local gradient for a neuron z_j belonging to the hidden layer, hence:
δ_j = −∂E/∂z_in_j
Using the chain rule again to express the above equation:
δ_j = −∂E/∂z_in_j = −(∂E/∂z_j)·(∂z_j/∂z_in_j) = −(∂E/∂z_j)·f′(z_in_j)
We have: E = (1/2)·Σ_k e_k²
We now compute ∂E/∂z_j as follows:
∂E/∂z_j = Σ_k e_k·(∂e_k/∂z_j) = Σ_k e_k·(∂e_k/∂y_k)·(∂y_k/∂y_in_k)·(∂y_in_k/∂z_j)
Note that here the error contribution comes from all the output neurons, as against the previous case, hence we have to sum all the error terms.
Again:
∂e_k/∂y_k = −1 , ∂y_k/∂y_in_k = f′(y_in_k) , ∂y_in_k/∂z_j = w_jk
Substituting these derivatives in the expression for ∂E/∂z_j, we get:
∂E/∂z_j = −Σ_k e_k·f′(y_in_k)·w_jk
But e_k·f′(y_in_k) = δ_k, hence we re-write the above equation as follows:
∂E/∂z_j = −Σ_k δ_k·w_jk
Using this in the expression for δ_j, we get:
δ_j = f′(z_in_j)·Σ_k δ_k·w_jk
Again, to reduce the error E, we change the weights v_ij proportionately as:
Δv_ij = η·δ_j·x_i
Backpropagation Algorithm thus has THREE phases:
1. Forward phase
Where the input signal propagates in the forward direction
2. Error backpropagation phase
Where the error propagates in the reverse direction
3. Weights and biases update phase
Where the local gradients are used to compute the weight and bias updates:
Δw_jk = η·δ2_k·z_j   Δw_0k = η·δ2_k
Δv_ij = η·δ1_j·x_i   Δv_0j = η·δ1_j
(using δ2_k for the output-layer and δ1_j for the hidden-layer local gradients; a code sketch of the three phases follows)
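The three phases map directly onto code. Below is a minimal NumPy sketch of one training pair for a single-hidden-layer net with logistic activations; the helper names (`f`, `backprop_step`) are ours, not from the course's MATLAB demos:

```python
import numpy as np

def f(x):            # logistic activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):      # its derivative: f'(x) = f(x) (1 - f(x))
    s = f(x)
    return s * (1.0 - s)

def backprop_step(x, t, V, W, eta=0.5):
    """One training pair. V: (n+1, p) input->hidden, W: (p+1, m) hidden->output;
    row 0 of each holds the biases. Returns updated (V, W)."""
    # --- forward phase ---
    z_in = V[0] + x @ V[1:]          # z_in_j = v_0j + sum_i x_i v_ij
    z = f(z_in)
    y_in = W[0] + z @ W[1:]          # y_in_k = w_0k + sum_j z_j w_jk
    y = f(y_in)
    # --- error backpropagation phase ---
    delta2 = (t - y) * f_prime(y_in)           # δ2_k = e_k f'(y_in_k)
    delta1 = f_prime(z_in) * (W[1:] @ delta2)  # δ1_j = f'(z_in_j) Σ_k δ2_k w_jk
    # --- weights and biases update phase ---
    W[1:] += eta * np.outer(z, delta2);  W[0] += eta * delta2
    V[1:] += eta * np.outer(x, delta1);  V[0] += eta * delta1
    return V, W
```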
Signal flow graph
[Figure: signal-flow graph of the net. Forward phase: inputs x_i pass through weights v_ij to give z_in_j; the activation f(·) gives z_j; these pass through weights w_jk to give y_in_k; the activation f(·) gives y_k, and the error is e_k = t_k − y_k. Error backpropagation phase: each output error e_k is multiplied by f′(y_in_k) to give δ2_k; the δ2_k flow back through the weights w_jk and are summed to give δ_in_j, which is multiplied by f′(z_in_j) to give δ1_j.]
The standard backpropagation algorithm
STEP 0 Initialize weights.
(Set to small random values.)
STEP 1 While stopping condition is false, do Steps 2-9.
STEP 2 For each training pair, do Steps 3-8.
FORWARD PHASE
STEP 3 Each input unit (X_i, i = 1,…,n) receives input signal x_i and broadcasts this signal to all units in the hidden layer.
STEP 4 Each hidden unit (Z_j, j = 1,…,p) sums its weighted input signals,
z_in_j = v_0j + Σ_i x_i·v_ij
applies its activation function to compute its output signal,
z_j = f(z_in_j)

and sends this signal to all units in the output layer.
STEP 5 Each output unit (Y_k, k = 1,…,m) sums its weighted input signals,
y_in_k = w_0k + Σ_j z_j·w_jk
applies its activation function to compute its output signal,
y_k = f(y_in_k)
BACKPROPAGATION PHASE
STEP 6 Each output unit (Y_k, k = 1,…,m) receives a target pattern corresponding to the input training pattern and computes its error information term,
δ_k = (t_k − y_k)·f′(y_in_k)
calculates its weight correction term (used to update w_jk later),
Δw_jk = η·δ_k·z_j
calculates its bias correction term (used to update w_0k later),
Δw_0k = η·δ_k
and sends δ_k to units in the hidden layer.
STEP 7 Each hidden unit (Z_j, j = 1,…,p) sums its delta inputs (from units in the output layer),
δ_in_j = Σ_k δ_k·w_jk
multiplies by the derivative of its activation function to calculate its error information term,
δ_j = δ_in_j·f′(z_in_j)
calculates its weight correction term (used to update v_ij later),
Δv_ij = η·δ_j·x_i
calculates its bias correction term (used to update v_0j later),
Δv_0j = η·δ_j
WEIGHTS AND BIASES UPDATE PHASE
STEP 8 Each output unit (Y_k, k = 1,…,m) updates its bias and weights (j = 0,…,p):
w_jk(new) = w_jk(old) + Δw_jk
Each hidden unit (Z_j, j = 1,…,p) updates its bias and weights (i = 0,…,n):
v_ij(new) = v_ij(old) + Δv_ij
STEP 9 Calculate the mean square error:
mse = 0.5·Σ_k (t_k − y_k)²
Test the stopping condition (max. iterations reached, or an acceptable mse value).
XOR solution
[Figure: a 2-4-1 backpropagation net solving XOR. Forward phase: z_in_j = v_0j + Σ_i x_i·v_ij, z_j = f(z_in_j) for the four hidden units; y_in_1 = w_01 + Σ_j z_j·w_j1, y_1 = f(y_in_1). Backpropagation: δ2_1 = (t_1 − y_1)·f′(y_in_1); δ_in_j = δ2_1·w_j1; δ1_j = δ_in_j·f′(z_in_j). Updates: Δw_j1 = η·δ2_1·z_j (j = 1,…,4) and Δw_01 = η·δ2_1 for the output unit; Δv_ij = η·δ1_j·x_i and Δv_0j = η·δ1_j for the hidden units.]
MATLAB Demo
Some tips for net convergence
Choice of initial weights and biases
• Random Initialization: The choice of initial weights will influence whether
the net reaches a global (or only a local) minimum of the error and; if so,
how quickly it converges.
• The update of the weight between two units depends on both the derivative
of the output unit's activation function and the activation of the hidden
unit. For this reason, it is important to avoid choices of initial weights that
would make it likely that either activations or derivatives of activations are
zero.
• The values for the initial weights must not be too large, or the initial input
signals to each hidden or output unit will be likely to fall in the region where
the derivative of the sigmoid function has a very small value (the so-called
saturation region)
• On the other hand, if the initial weights are too small, the net input to a
hidden or output unit will be close to zero, which also causes extremely slow
learning.
• A common procedure is to initialize the weights (and biases) to random
values between -0.5 and 0.5 (or between -1 and 1 or some other suitable
interval)
How long to train the net ?
• Since the usual motivation for applying a backpropagation net is
to achieve a balance between correct responses to training
patterns and good responses to new input patterns (i.e., a
balance between memorization and generalization), it is not
necessarily advantageous to continue training until the total
squared error actually reaches a minimum.
• Hecht-Nielsen (1990) suggests using two sets of data during training: a set of training patterns and a set of cross-validation patterns. These two sets are disjoint.
• Weight adjustments are based on the training patterns;
however, at intervals during training, the error is computed
using the cross-validation patterns. As long as the error for the
cross-validation patterns decreases, training continues. When
the error begins to increase, the net is starting to memorize the
training patterns too specifically (and starting to lose its ability
to generalize). At this point, training is terminated.
How many training pairs should there be?
• Under what circumstances can I be assured that a net which is trained to classify a given percentage of the training patterns correctly will also classify correctly testing patterns drawn from the same sample space?
• A thumb rule dictates: N = O(W/e), i.e., N should be of the order of W/e
where N = number of training exemplars, W = number of free parameters to be adjusted (weights & biases), e = fraction of permissible classification error
• For example, with e = 0.1, a net with 80 weights will require
800 training patterns to be assured of classifying 90% of the
testing patterns correctly, assuming that the net was trained to
classify 95% of the training patterns correctly
• But experience suggests that the optimum number of training
patterns is problem specific !
Data Representation
• It is recommended that the number of dimensions of
the data be reduced by suitable methods , this process
is called feature extraction
• Feature extraction methods like the Principal
Component Analysis, Transforms like FFT, DCT, Wavelet
have to be carefully chosen so that the intelligence in
the data is preserved
• In general, it is easier for a neural net to learn a set of
distinct responses than a continuous-valued response,
therefore encoding the targets is important, we
employ one-hot coding for the output neurons for
classification problems
Number of Hidden Layer Neurons
• Researchers have attempted to find the optimal number of hidden layer neurons, but have not succeeded so far!
• The number of hidden layer neurons is highly problem specific and has to be found using a brute-force technique.
• This technique simply starts with a few hidden layer neurons and carries out the training-testing phase over increasing numbers of neurons, to find the maximum-accuracy configuration.
Number of Hidden Layers
• Generally one hidden layer is sufficient for a backpropagation
net to approximate any continuous mapping from the input
patterns to the output patterns to an arbitrary degree of
accuracy.
• However, two hidden layers may make training easier in some
situations
Choice of learning rate
• Since we use the gradient descent algorithm to reach a global
minima , if it exists, learning rate plays an important role in the
training of the net
• A very high learning rate can de-stabilize the net into
producing oscillatory behavior
• A very low learning rate on the other hand slows down the
learning and takes longer to reach the acceptable value of
error
• Generally 0 < η < 1
Generalization
Affected by THREE factors:
1. Size of the training set
2. Architecture of the net
3. Physical complexity of the problem at hand
Variations in the standard backpropagation algorithm
Alternative Weight Update Procedures
Momentum: the weight change is in a direction that is a combination of the current gradient and the previous gradient:
w_jk(t+1) = w_jk(t) + η·δ_k·z_j + μ·[w_jk(t) − w_jk(t−1)]
where μ is the momentum parameter (0 < μ < 1)
Adaptive Learning Rates
Delta-Bar-Delta: Allow each weight to have its own learning rate, and let the learning rates vary with time as training progresses:
w_jk(t+1) = w_jk(t) − η_jk(t+1)·∂E/∂w_jk = w_jk(t) + η_jk(t+1)·δ_k·z_j
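A hedged sketch of the momentum update (variable names are ours):

```python
import numpy as np

def momentum_update(W, dW_prev, delta, z, eta=0.2, mu=0.9):
    """W: (p, m) hidden->output weights, delta: (m,) local gradients δ_k,
    z: (p,) hidden outputs. Returns (new W, weight change to remember)."""
    dW = eta * np.outer(z, delta) + mu * dW_prev  # Δw = η δ_k z_j + μ [w(t) − w(t−1)]
    return W + dW, dW
```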
Applications of MLP trained with backpropagation algorithm
1. Regression
2. Pattern Classification
3. Forecasting
Thank you!
RADIAL BASIS FUNCTION NETWORKS
MODULE 5
Dr. Rajesh B. Ghongade
Agenda
• Concept of RBFN and Cover’s Theorem
• RBFN architecture
• φ functions
• XOR problem
• RBFN Algorithm
• K-means clustering
• MATLAB Demo
• Comparison between MLP and RBFN
• Applications of RBFN
Concept of RBFN
• MLP can be viewed as a stochastic approximation approach
but we can view it also as a surface fitting consideration
• MLP succeeds because of the in-between mapping done by
the hidden layer
• For non-linearly separable problems we cannot pass a hyperplane to separate the classes, but we may be able to pass a hypersphere or a hyperquadric (multi-dimensional) surface
• Thus given a non linearly separable problem our task is to
determine a hyper surface that can classify properly, which is a
surface fitting problem in multi dimensional space and is the
essence of RBF networks
• RBF = radial-basis function: a function which depends
only on the radial distance from a point
Cover's Theorem
"A pattern classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space."
Consider a set H of N patterns X₁, X₂, …, X_N.
Each vector can be assigned to either of two classes, H₁ or H₂.
If we have a family of surfaces wherein at least one surface separates the two classes, then we say that the set H is separable with respect to that family of surfaces.
X has dimension m₀.
For each X ∈ H, define a vector made of a set of real-valued functions:
φ(X) = [φ₁(X), φ₂(X), …, φ_m₁(X)]ᵀ, where m₁ = number of real-valued functions.
These are the hidden-layer functions, so the input vector gets transformed as X → φ(X).
The space spanned by the set of hidden functions {φ_i(X)}, i = 1,…,m₁, is the hidden space or feature space.
We are mapping m₀-dimensional inputs into an m₁-dimensional hidden space.
A dichotomy {H₁, H₂} is φ-separable if there exists one m₁-dimensional vector W such that:
Wᵀφ(X) > 0 ⇒ X ∈ H₁
Wᵀφ(X) < 0 ⇒ X ∈ H₂
The hyperplane equation Wᵀφ(X) = 0 describes the separating surface in the φ-space.
If X₁, X₂, …, X_N are independently chosen patterns, there is a large number of dichotomies, but only a few are separable; Cover studied the probabilities of those dichotomies which are separable.
Assuming all possible dichotomies of H = {X_i}, i = 1,…,N, are equiprobable, the probability P(N, m₁) that a particular dichotomy picked at random is φ-separable is given by Cover's theorem as:
P(N, m₁) = (1/2)^(N−1) · Σ_{m=0}^{m₁−1} C(N−1, m)
If m₁ is very high, the probability is closer to 1.
RBFN Architecture
• RBFNs have hidden units which provide a set of functions
which forms a basis for mapping into the hidden layer space,
hence the need for some basis functions
• Hidden neurons provide the basis functions hence the name
radial basis function networks
• Basic form of RBFN consists of three layers:
• Input layer contains source nodes connected to the
environment
• Only one hidden layer which does the non-linear
transformation (mapping) from input space to hidden
space, where hidden space dimensionality is higher than
input space
• Output layer supplies the response
A typical RBF Network
[Figure: a three-layer RBF network. The inputs feed m radial-basis units φ₁ … φ_m; their outputs are combined through weights w_ik and biases w_0k and passed through f(·) to give the outputs y₁ … y_k.]
The network output is:
y = f(x) = Σ_{i=1}^{m} w_i·φ(‖x − c_i‖)
φ functions
1) Gaussian
φ(r) = exp(−r² / (2σ²)) for some σ > 0, r ∈ ℝ
Localized function: φ(r) → 0 as r → ∞
2) Multi-quadrics
φ(r) = (r² + c²)^(1/2) for some c > 0, r ∈ ℝ
Non-localized function: φ(r) → ∞ as r → ∞
3) Inverse multi-quadrics
φ(r) = 1 / (r² + c²)^(1/2) for some c > 0, r ∈ ℝ
Localized function: φ(r) → 0 as r → ∞
XOR problem
[Figure: a 2-input RBF net with two Gaussian units φ₁ (center c₁) and φ₂ (center c₂) and output y = f(w₁φ₁ + w₂φ₂ + w₀).]
We select the two φ-functions as Gaussian with µ = 0 and σ = 1/√2, so that
φ(r) = exp(−r² / (2σ²)) reduces to φ(r) = exp(−r²)
We select the centers as C1 = [0,0] and C2 = [1,1], hence:
φ₁(X) = exp(−‖X − C1‖²)
φ₂(X) = exp(−‖X − C2‖²)
Thus for each input X the distances from the centers are r₁ = ‖X − C1‖ and r₂ = ‖X − C2‖, and the outputs from the hidden units are φ₁ and φ₂.
When all the inputs are presented to the net we get, e.g., for X₂ = (0,1):
r₁ = ‖X₂ − C1‖ = ‖(0,1) − (0,0)‖ = √((0−0)² + (1−0)²) = 1
r₂ = ‖X₂ − C2‖ = ‖(0,1) − (1,1)‖ = √((0−1)² + (1−1)²) = 1
φ₁ = exp(−r₁²) = exp(−1²) = 0.3679
φ₂ = exp(−r₂²) = exp(−1²) = 0.3679

x₁  x₂ | r₁      r₂      | φ₁      φ₂
0   0  | 0       1.4142  | 1       0.1353
0   1  | 1       1       | 0.3679  0.3679
1   0  | 1       1       | 0.3679  0.3679
1   1  | 1.4142  0       | 0.1353  1
Mapped points
[Figure: in the (φ₁, φ₂) plane the four mapped points X1 … X4 become linearly separable.]
Φ = [ 1       0.1353  1
      0.3679  0.3679  1
      0.3679  0.3679  1
      0.1353  1       1 ] ,  W = [w₁, w₂, w₀]ᵀ ,  D = [0, 1, 1, 0]ᵀ
Solve the discrimination function Φ·W = D to complete the solution. Since Φ is not square, use the pseudo-inverse:
W = Φ⁺·D = (ΦᵀΦ)⁻¹·Φᵀ·D
which yields W = [−2.5031, −2.5031, 2.8418]ᵀ.
Checking the solution: Φ·W = [0, 1, 1, 0]ᵀ = D. ✓
Fixing the radius σ
This is usually done using the P-nearest-neighbor algorithm. A number P is chosen, and for each center, the P nearest centers are found. The root-mean-squared distance between the current cluster center and its P nearest neighbors is calculated, and this is the value chosen for σ. So, if the current cluster center is c_j, the value is:
σ_j = sqrt( (1/P)·Σ_{i=1}^{P} ‖c_j − c_i‖² )
A typical value for P is 2, in which case σ is set to be the average distance from the two nearest neighboring cluster centers.
The variable σ defines the width or radius of the bell shape and is something that has to be determined empirically. When the distance from the center of the Gaussian reaches σ, the output drops from 1 to about 0.6 (exp(−0.5) ≈ 0.61).
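A sketch of this P-nearest-neighbor rule (assuming more centers than P; names are ours):

```python
import numpy as np

def fix_sigma(centers, P=2):
    """centers: (k, d) cluster centers, with k > P. Returns sigma_j for each
    center: RMS distance to its P nearest neighbouring centers."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d_nearest = np.sort(d, axis=1)[:, 1:P + 1]     # skip the self-distance 0
    return np.sqrt(np.mean(d_nearest**2, axis=1))  # sigma_j = RMS of P distances
```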
RBFN Algorithm
Step 1 Get n, k, N, the feature vectors and their target vectors; input the number of iterations I; set i = 0; set the m centers of the RBFs as the N exemplar vectors; set mse_goal and the learning rate α
Step 2 UNSUPERVISED LEARNING
Using a clustering algorithm like k-means clustering, find the m cluster centers. Find the minimum Euclidean distance between the cluster centers to fix σ (the radius)
Step 3 Compute the Φ-matrix
Step 4 SUPERVISED LEARNING
Choose weights and biases W at random between −0.5 and 0.5
Step 5 Compute y_k, the error E, and the mse
Step 6 Using the Delta Rule, update all parameters W_mk for all m and k at the current iteration
Step 7 Compute y_k and the new value of the error E
Step 8 If mse ≤ mse_goal stop, else continue
Step 9 Increment the iteration i; if i < I then go to Step 5, else stop
k-means clustering
• The k-means algorithm partitions the data into k clusters. A popular criterion function associated with the k-means algorithm is the sum of squared error. Let k be the number of clusters and n the number of data in the sample X₁, X₂, …, X_n. We define the cluster centroid m_i as:
m_i = ( Σ_{j=1}^{n} ω_ij·x_j ) / ( Σ_{j=1}^{n} ω_ij ) , ∀i
with the membership function ω_ij indicating whether the data point X_j belongs to a cluster ω_i. The membership values vary according to the type of k-means algorithm. The standard k-means uses an all-or-nothing procedure, that is, ω_ij = 1 if the data sample X_j belongs to cluster ω_i, else ω_ij = 0.
• The membership function must also satisfy the following constraints:
Σ_{i=1}^{k} ω_ij = 1 , ∀j
0 < Σ_{j=1}^{n} ω_ij < n , ∀i
• The k-means algorithm uses a criterion function based on the measure of similarity or distance. For example, using the Euclidean distance, which will favor hyperspherical clusters, a criterion function to minimize is defined by:
J = Σ_{i=1}^{k} Σ_{j=1}^{n} ω_ij·‖x_j − m_i‖²
which, considering the all-or-nothing membership function, simplifies to:
J = Σ_{i=1}^{k} Σ_{x_j ∈ ω_i} ‖x_j − m_i‖²
• The k-means clustering algorithm is an
iterative algorithm that minimizes the
criterion function J.
• The initial choice of cluster and
measure of similarity or distance affects
the way in which the algorithm
behaves.
• This type of algorithm tends to converge to local minima close to the initially set cluster centroids.
• This reinforces the importance of the initial choice of clusters, keeping in mind that such an algorithm does not converge to the global minimum.
Numerical example of k-means clustering
• Suppose we have several objects (4 types of medicines) and each object
have two attributes or features as shown in table below.
• Our goal is to group these objects into K=2 group of medicine based on the
two features (pH and weight index).
Objects Attribute 1 (weight index) Attribute 2 (pH)
Medicine A 1 1
Medicine B 2 1
Medicine C 4 3
Medicine D 5 4
• Each medicine represents one point with two attributes that we can
represent as coordinates in an attribute space as shown in the figure below.
1. Initial value of centroids : Suppose we use medicine A and medicine B as the
first centroids. Let C1 and C2 denote the coordinate of the centroids, then C1=[1,1]
and C2 =[2,1]
2. Objects-Centroids distance: we calculate the distance between each cluster centroid and each object. Let us use the Euclidean distance; then we have the distance matrix at iteration 0 as:
D⁰ = [ 0  1  3.61  5.00
       1  0  2.83  4.24 ]
with objects O = [ 1 2 4 5   (Attribute 1: weight index)
                   1 1 3 4 ] (Attribute 2: pH)
C1 = [1, 1] and C2 = [2, 1].
Each column in the distance matrix symbolizes an object. The first row of the distance matrix corresponds to the distance of each object to the first centroid and the second row is the distance of each object to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid C1 = (1, 1) is √((4−1)² + (3−1)²) = 3.61, and the distance to C2 = (2, 1) is √((4−2)² + (3−1)²) = 2.83, and so on.
3. Objects clustering: We assign each object based on the minimum distance. Thus, medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and medicine D to group 2. The element of the group matrix below is 1 if and only if the object is assigned to that group:
G⁰ = [ 1 0 0 0   (Group 1)
       0 1 1 1 ] (Group 2)
4. Iteration-1, determine centroids : Knowing the members of each group, now we
compute the new centroid of each group based on these new memberships. Group 1
only has one member thus the centroid remains in C1=(1,1). Group 2 now has three
members, thus the centroid is the average coordinate among the three members:
 2  4  5 1 3  4 
C2   ,    3.67, 2.67 
 3 3 
5. Iteration-1, Objects-Centroids distances: The next step is to compute the distance of all objects to the new centroids. Similar to step 2, we have the distance matrix at iteration 1 as:
D¹ = [ 0     1     3.61  5.00
       3.14  2.36  0.47  1.89 ]
with C1 = [1, 1] and C2 = [3.67, 2.67].
6. Iteration-1, Objects clustering: Similar to step 3, we assign each object based on the minimum distance. Based on the new distance matrix, we move medicine B to Group 1 while all the other objects remain unchanged. The group matrix is shown below:
G¹ = [ 1 1 0 0   (Group 1)
       0 0 1 1 ] (Group 2)
7. Iteration-2, determine centroids: Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, thus the new centroids are:
C1 = ( (1 + 2)/2 , (1 + 1)/2 ) = (1.5, 1)   and   C2 = ( (4 + 5)/2 , (3 + 4)/2 ) = (4.5, 3.5)
8. Iteration-2, Objects-Centroids distances: Repeating step 2 again, we have the new distance matrix at iteration 2 as:
D² = [ 0.5   0.5   3.20  4.61
       4.30  3.54  0.71  0.71 ]
with C1 = [1.5, 1] and C2 = [4.5, 3.5].
9. Iteration-2, Objects clustering: Again, we assign each object based on the minimum distance:
G² = [ 1 1 0 0   (Group 1)
       0 0 1 1 ] (Group 2)
We see that G² = G¹, hence the objects do not move anymore and the algorithm has reached a stable point.
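The iterations above can be reproduced with a few lines of standard (all-or-nothing) k-means. This sketch assumes no cluster ever becomes empty, which holds for this data:

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # medicines A, B, C, D
C = np.array([[1, 1], [2, 1]], dtype=float)                  # initial centroids

for _ in range(10):
    # assignment step: each object joins the nearest centroid (min distance)
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (4, 2) distances
    groups = np.argmin(d, axis=1)
    # update step: each centroid moves to the mean of its members
    new_C = np.array([X[groups == i].mean(axis=0) for i in range(len(C))])
    if np.allclose(new_C, C):   # stable point reached: groups no longer change
        break
    C = new_C

print(groups)  # -> [0 0 1 1]: clusters {A, B} and {C, D}
print(C)       # -> [[1.5, 1.], [4.5, 3.5]]
```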
RBFN MATLAB DEMO
If only life were so simple…
• How do we choose the number of clusters? This is similar to the problem of selecting the number of hidden nodes for an MLP.
• What type of pre-processing is best?
• Does the clustering method work for the data? E.g., it might be better to fix σ and try again.
There is NO general answer: each choice will be problem-specific. The only info you have is your performance measure.
Comparison
MLP:
• Feedforward network
• Universal approximator
• Can have more than one hidden layer
• Completely supervised learning
• Network dimension according to the dimension of the data
• All neurons, output and hidden layer, share a common neuronal model
• MLPs construct global approximations to non-linear input-output mapping
RBFN:
• Feedforward network
• Universal approximator
• Only one hidden layer
• Learning in two phases: unsupervised learning (clustering) followed by supervised learning (delta rule)
• Network dimension according to information content
• Hidden layer neurons are completely different than output neurons
• RBFs, using exponentially decaying localized non-linearities (Gaussians), construct local approximations to input-output mapping
Applications of RBFN
1. Regression
2. Pattern Classification
Thank You!
UNSUPERVISED LEARNING
SELF ORGANIZING MAPS
MODULE 6
Dr. R.B.Ghongade,
Department of E&TC,
BVDUCOE, Pune-43
Agenda
• Introduction
• Concept
• Motivation
• Models
– Willshaw–von der Malsburg model
– Kohonen model
• Architecture of SOM
• Essential Processes in SOM
– Competition
– Co-operation
– Synaptic Adaptation
• Mathematical model of the processes
• Phases of weight adaptation
• Concept of Topology Organization
• Algorithm
• MATLAB Demo
Introduction
• Self organizing maps(SOM) are also known as Self
Organizing Feature maps(SOFM)
• SOMs are topology preserving nets
• They employ unsupervised learning
• SOMs work on the principle of competitive learning
• There is a spatial organization of the distribution of
neurons(lattice of output neurons, 1-D ,2-D or even higher
D)
• Most practical problems make use of 1-D or 2-D lattices,
higher D lattices NOT preferred due to complexity
• Neurons are arranged in a structured manner, and if we present input patterns, they act as stimuli to these neurons
Structured Output Neurons
Concept
• Out of the different neurons in the output lattice, one of
them will be the winner and synaptic weights from input
to output will be adjusted so that the Euclidean distance
between the synaptic weights and the input vector is
minimized
• Minimization of the Euclidean distance means maximization of the dot product W_jᵀX
• The inputs are randomly distributed in the input space,
but if we start with a regular structure, depending on the
input statistics, the ultimate organization of the lattice
that results would be indicative of the statistics of input
patterns that we are applying as stimuli
• Training is the process where the lattice structure is
disturbed and it will move the winning neuron physically
towards the input pattern
• Neurons at the output act in a competitive
manner: they inhibit the responses of other
output neurons
• But in the vicinity of the winning neuron an
excitatory response is created while an
inhibitory response is created for other
output neurons which are far apart
• Such networks exhibit:
– Short range excitation
– Long range inhibition
Motivation
• Neuro-biological phenomena in the brain actually exhibit topologically ordered computation
• Various sensory inputs like visual, acoustic and tactile inputs are processed in different regions of the cerebral cortex
[Figure: map of the cerebral cortex indicating distinct areas for tactile, taste, and smell processing.]
Models of SOM
1. Willshaw–von der Malsburg model
2. Kohonen model
Willshaw–von der Malsburg model
• Two arrays (lattices) of pre-
synaptic and post-synaptic
neurons
• Was used to explain the retino-
optic mapping from retina to
visual cortex, where retina will
form the pre-synaptic layer and
the visual cortex is the post-
synaptic layer
• Dimensions of input and output
are same
• Electric signals of pre-synaptic
neurons are based on geometric
proximities
• The neurons near the excited neuron have highly correlated electrical responses
• If signal for A is strong , the signal for a geometrically close neuron B, will also be strong
and it will enhance spatially those output neurons which are there in similar spatial
locations
• There is a spatial correlation of activities
• Hence pre-synaptic layer activities are mapped onto similar neuron activities in the post
synaptic layer
Kohonen model
• Based on vector coding
algorithm which optimally
places a fixed number of
vectors (called code
words) into a higher
dimensional input space
• Same as data
compression technique
like entropy coding
• If we have an input of length l, it can be compressed into a code word of length m, which is much less than l
• This technique exploits the entropy in the data, e.g., Huffman coding
• A fixed number of code words forms the codebook
• The Kohonen model is more popularly used because it offers dimensionality reduction and is more general
• Hence SOMs are also referred to as Kohonen self-organizing maps
Architecture of SOM
Essential Processes in SOM
1. Competition:
– Long range inhibition
– For each input pattern the neurons in the output
layer determine the value of a function
– Called as the discriminant function
– This function provides the basis of competition
– A particular neuron with the largest discriminant
function value is the winner
– Responses of other neurons (at longer geometric
distances) is set to zero
Essential Processes in SOM
2. Co-operation:
– Short range excitation
– Winning neuron determines the topological
neighbourhood of excitation for other output
neurons
– Neuron that wins , excites the neighbouring neurons
surrounding it
– Excitation is co-operative and always follows
competition
– In other words, the winning neuron determines the
spatial location of topological neighbourhood of
excited neurons
Essential Processes in SOM
3. Synaptic Adaptation:
– Enables the excited neurons to increase their
individual values of discriminant function in
relation to the input pattern
– This adaptation is only for excited neurons
– When similar input patterns are fed repeatedly ,
then we have the increasing winning neuron
response each time
Competition
• Consider an m-dimensional input X = [x₁, x₂, …, x_m]ᵀ
• Then the weight vector of output neuron j is W_j = [w_j1, w_j2, …, w_jm]ᵀ, j = 1, 2, …, l (l = total number of output neurons)
• Find the best match between X and W_j
• There will be competition between the neurons, and whichever W_j has the best match with X, that j-th neuron will emerge as the winner
• Some applications require only the index of the winning neuron while some may require the actual winning weight vector
• We compute W_jᵀX for j = 1, 2, …, l and select the largest amongst these
• Maximizing W_jᵀX is equivalent to minimizing ‖X − W_j‖, the Euclidean distance
• Using the index i(X), where i is the index of the winner based on the input vector:
i(X) = arg min_j ‖X − W_j‖ , j = 1, 2, …, l
• The corresponding weight vector W_i is closest to the input pattern X
• The input is a continuous m-dimensional space and we are mapping it onto a discrete space of outputs
• Hence a continuous input space of activation patterns is mapped onto a discrete output space of neurons by a process of competition
Illustration of the relationship between the feature map Φ and the weight vector w_i of winning neuron i.
Cooperation
• When a neuron fires, it also excites the neighbouring neurons
• If we go farther away from the winning neuron, its neighbourhood function should gradually decrease
• This function should be a monotonically decreasing function
• The topological neighbourhood is defined as h_{j,i(X)}:
h_{j,i}: topological neighbourhood centered around the winning neuron i, encompassing the excited neuron j, using the measure d_{j,i}
d_{j,i}: lateral distance between the winning neuron i and the excited neuron j
• Weights of these excited neurons have to be updated, and to what extent is determined by the neighbourhood
• This topological neighbourhood h_{j,i} should satisfy:
– It should be symmetric about d_{j,i} = 0
– It should be a monotonically decaying function, decreasing with distance d_{j,i}, i.e., as d_{j,i} → ∞, h_{j,i} → 0
• A typical choice is the Gaussian function:
h_{j,i(X)} = exp( −d²_{j,i} / (2σ²) )
• This function is independent of the position of the winning neuron, hence it is translation invariant
• In case of a 1-D lattice: d_{j,i} = |j − i|
• For a 2-D lattice: d²_{j,i} = ‖r_j − r_i‖²
where r_j is the position vector of the excited neuron j and r_i is the position vector of the winning neuron i
• Another important feature of SOM is that σ is not constant with iterations
• As iterations increase, σ decreases, meaning the neighbourhood shrinks with iterations:
σ(n) = σ₀·exp(−n/τ₁)
where τ₁ is a time constant
• Consequently:
h_{j,i(X)}(n) = exp( −d²_{j,i} / (2σ²(n)) )
• h_{j,i(X)}(n) is called the neighbourhood function
• Significance of co-operation:
– Initially a large number of neurons participate in the co-operative
process because 𝜎 is large
– We update the winning neurons and its neighbours
– If say 𝑋 is fed to the network and output neuron 𝑖 is the winner, next if
𝑋 is now fed to the network and now neuron 𝑗 is the new winner.
– If 𝑋 and 𝑋 are close then neuron 𝑖 also participates in weight
updation
– If we feed a large number of vectors close to 𝑋 i.e,; cluster of
patterns, the winners are different but every output neuron in the
neighbourhood gets its share of weight update
– Ultimately the topology of the network will be adjusted to the clusters
Alternative neighbourhood Functions
Weight Adaptation
• This is the learning phase
• We could use Hebbian learning, but it is severely limited by weight saturation if presented with the same input pattern repeatedly
• Hence we modify it by introducing a forgetting term g(y_j)·W_j:
ΔW_j = η·y_j·X − g(y_j)·W_j
where g(y_j) is a positive scalar function of the response y_j
• To simplify, let g(y_j) = η·y_j
• We can set the response as: y_j = h_{j,i(X)}
• The modified Hebbian learning / weight update rule becomes:
ΔW_j = η·y_j·X − η·y_j·W_j
or
ΔW_j = η·h_{j,i(X)}·X − η·h_{j,i(X)}·W_j
Finally this can be written as:
ΔW_j = η·h_{j,i(X)}·(X − W_j)
• We essentially move the weight vector to align with the input vector
• Using a discrete-time formulation, we express the weight update as:
W_j(n+1) = W_j(n) + η(n)·h_{j,i(X)}(n)·(X − W_j(n))
• η(n) is the variable learning rate
• This leads to topological ordering
• The two important heuristics involved here are η(n) and h_{j,i(X)}(n)
• We use:
η(n) = η₀·exp(−n/τ₂)
where τ₂ is another time constant
• Recall that:
σ(n) = σ₀·exp(−n/τ₁)  and  h_{j,i(X)}(n) = exp( −d²_{j,i} / (2σ²(n)) )
Two phases of weight adaptation
1. Self-organizing (ordering) phase
– where the topology is arranged
– It is a fast phase and requires fewer iterations
– η(n) starts with, say, 0.1 and should decrease to 0.01 (τ₂ = 1000)
– It takes generally more than 1000 iterations
– h_{j,i(X)}(n) to start with should include a large number of neurons and later be restricted to a small neighbourhood
2. Convergence phase
– All obtained positions are fine-tuned
– Requires many iterations, at least 500 times the number of output neurons
– η(n) remains constant in this phase, but h_{j,i(X)}(n) may continue to decrease
Concept of topology organization
Algorithm (simplified)
STEP 0 Initialize weights w_ij.
Set topological neighborhood parameters.
Set learning rate parameters.
STEP 1 While stopping condition is false, do Steps 2-8.
STEP 2 For each input vector X, do Steps 3-5.
STEP 3 For each j, compute:
D(j) = Σ_i (x_i − w_ij)²
STEP 4 Find index J such that D(J) is a minimum.
STEP 5 For all units j within a specified neighborhood of J, and for all i, compute y_j = Σ_i w_ij·x_i and update:
w_ij(new) = w_ij(old) + η·y_j·[x_i − w_ij(old)]
STEP 6 Update the learning rate.
STEP 7 Reduce the radius of the topological neighborhood at specified times.
STEP 8 Test stopping condition.
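A compact sketch of SOM training in NumPy. Note that it uses the Gaussian neighbourhood h_{j,i}(n) with decaying σ(n) and η(n) from the preceding slides, rather than the y_j factor of the simplified listing; all names are ours:

```python
import numpy as np

def train_som(X, n_units=10, iters=2000, eta0=0.1, sigma0=5.0, tau=1000.0):
    """1-D SOM: X is (N, d). Returns the (n_units, d) weight lattice."""
    rng = np.random.default_rng(0)
    W = rng.uniform(X.min(), X.max(), size=(n_units, X.shape[1]))
    for n in range(iters):
        x = X[rng.integers(len(X))]                 # pick a random input
        J = np.argmin(((x - W) ** 2).sum(axis=1))   # winner: minimum D(j)
        sigma = sigma0 * np.exp(-n / tau)           # shrinking neighbourhood σ(n)
        eta = eta0 * np.exp(-n / tau)               # decaying learning rate η(n)
        d2 = (np.arange(n_units) - J) ** 2          # lattice distances to winner
        h = np.exp(-d2 / (2 * sigma ** 2))          # neighbourhood function h
        W += eta * h[:, None] * (x - W)             # move units toward x
    return W

# Usage: map 2-D points on a ring onto a 1-D chain of units
theta = np.linspace(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)]
print(train_som(X).round(2))
```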
MATLAB Demo
APPLICATIONS OF MLP AND RBFN AS
CLASSIFIERS
MODULE 12
Dr. Rajesh B. Ghongade
Professor, BVDUCOE, Pune-43
Agenda
• Pattern Classifiers
• Classifier performance metrics
• Case Study : Two-Class QRS classification with
MLP
• Data Pre-processing
• MLP Classifier MATLAB DEMO
• Case Study : Two-Class QRS classification with
RBFN
• RBFN Classifier MATLAB DEMO
Pattern Classifiers
• Pattern recognition ultimately is used for classification of a
pattern
• Identify the relevant features about the pattern from the
original information and then use a feature extractor to
measure them
• These measurements are then passed to a classifier which
performs the actual classification, i.e., determines at which of
the existing classes to classify the pattern
• Here we assume the existence of natural grouping, i.e. we have
some a priori knowledge about the classes and the data
• For example we may know the exact or approximate number
of the classes and the correct classification of some given
patterns which are called the training patterns
• This type of information and the type of the features may suggest which classifier to apply for a given application
Parametric and Nonparametric
Classifiers
• A classifier is called a parametric classifier if the discriminant
functions have a well-defined mathematical functional form
(Gaussian) that depends on a set of parameters (mean and
variance)
• In nonparametric classifiers, there is no assumed functional
form for the discriminants. Nonparametric classification is
solely driven by the data. These methods require a great deal
of data for acceptable performance, but they are free from
assumptions about the shape of discriminant functions (or
data distributions) that may be erroneous.
Minimum-Distance Classifiers
• If the training patterns seem to form clusters we often use classifiers
which use distance functions for classification.
• If each class is represented by a single prototype called the cluster
center, we can use a minimum-distance classifier to classify a new
pattern.
• A similar modified classifier is used if every class consists of several
clusters.
• The nearest-neighbor classifier classifies a new pattern by measuring
its distances from the training patterns and choosing the class to
which the nearest neighbor belongs
• Sometimes the a priori information is the exact or approximate
number of classes c.
• Each training pattern is in one of these classes but its specific
classification is not known. In this case we use algorithms to
determine the cluster (class) centers by minimizing some
performance index and are found iteratively and then a new pattern
is classified using a minimum-distance classifier.
Statistical Classifiers
• Many times the training patterns of various classes overlap
for example when they are originated by some statistical
distributions.
• In this case a statistical approach is appropriate, particularly
when the various distribution functions of the classes are
known
• A statistical classifier must also evaluate the risk associated
with every classification which measures the probability of
misclassification.
• The Bayes classifier based on Bayes formula from probability
theory minimizes the total expected risk
Fuzzy Classifiers
• Quite often classification is performed with some degree of
uncertainty
• Either the classification outcome itself may be in doubt, or
the classified pattern x may belong in some degree to more
than one class.
• We thus naturally introduce fuzzy classification where a
pattern is a member of every class with some grade of
membership between 0 and 1
• For such a situation the crisp k-Means algorithm is
generalized and replaced by the Fuzzy k- Means and after the
cluster centers are determined, each incoming pattern is
given a final set of grades of membership which determine
the degrees of its classification in the various clusters.
• These are nonparametric classifiers.
Artificial Neural Networks
• The neural net approach assumes as other approaches
before that a set of training patterns and their correct
classifications is given
• The architecture of the net which includes input layer, output
layer and hidden layers may be very complex
• It is characterized by a set of weights and activation function
which determine how any information (input signal) is being
transmitted to the output layer.
• The neural net is trained by training patterns and adjusts the
weights until the correct classifications are obtained
• It is then used to classify arbitrary unknown patterns
• There are several popular neural net classifiers, like the
multilayered perceptron (MLP), radial basis function neural
nets (RBFN), self-organizing feature maps (SOFM) and
support vector machine (SVM)
• These belong to the semi-parametric classifier type
Pattern Classifiers
Classifier Performance Metrics
The Confusion Matrix
• The confusion matrix is a table where the true classification is compared with the output of the classifier
• Let us assume that the true classification is the row and the classifier output is the column:

                     CLASSIFIER OUTPUT
GROUND TRUTH      NORMAL     ABNORMAL
NORMAL            TN         FP
ABNORMAL          FN         TP

• The classification of each sample (specified by a column) is added to the row of the true classification
• A perfect classification provides a confusion matrix that has only the diagonal populated
• All the other entries are zero
• The classification error is the sum of the off-diagonal entries divided by the total number of samples
TP: true positive; TN: true negative; FP: false positive; FN: false negative
Classifier Performance Metrics
1. Sensitivity (Se) is the fraction of abnormal ECG beats that are correctly detected among all the abnormal ECG beats:
Se = TP / (TP + FN)
2. Specificity (Spe) is the fraction of normal ECG beats that are correctly classified among all the normal ECG beats:
Spe = TN / (TN + FP)
3. Positive predictivity (PP) is the fraction of real abnormal ECG beats among all detected abnormal beats:
PP = TP / (TP + FP)
4. False positive rate (FPR) is the fraction of all normal ECG beats that are not rejected:
FPR = FP / (TN + FP) = 1 − Spe
Classifier Performance Metrics continued…
5. Classification rate (CR) is the fraction of all correctly classified ECG beats, regardless of normal or abnormal, among all the ECG beats:
CR = (TN + TP) / (TN + FP + FN + TP)
6. Mean squared error (MSE) is a measure used only as the stopping criterion while training the ANN
7. Percentage average accuracy is the total accuracy of the classifier:
Percentage Average Accuracy = [ TP(NORMAL)/TOTAL(NORMAL) + TP(ABNORMAL)/TOTAL(ABNORMAL) ] × 100
8. Training time is the CPU time required for training an ANN, described in terms of time per epoch per total exemplars, in seconds
9. Pre-processing time is the CPU time required for generating the transform part of the feature vector, in seconds
10. Resources consumed for the ANN topology is the sum of the weights and biases for the first layer and the second layer, also called the adjustable or free parameters of the network
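Metrics 1-5 follow mechanically from the confusion-matrix counts; a small sketch (the example counts are made up for illustration):

```python
def classifier_metrics(tn, fp, fn, tp):
    """Compute Se, Spe, PP, FPR and CR from confusion-matrix counts."""
    return {
        "Se":  tp / (tp + fn),                   # sensitivity
        "Spe": tn / (tn + fp),                   # specificity
        "PP":  tp / (tp + fp),                   # positive predictivity
        "FPR": fp / (tn + fp),                   # false positive rate = 1 - Spe
        "CR":  (tn + tp) / (tn + fp + fn + tp),  # classification rate
    }

print(classifier_metrics(tn=90, fp=10, fn=5, tp=95))
```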
Receiver Operating Characteristics (ROC)
• The receiver operating characteristic is a metric used to check the quality of classifiers
• For each class of a classifier, roc applies threshold values
across the interval [0,1] to outputs.
• For each threshold, two values are calculated, the True
Positive Ratio (the number of outputs greater or equal to the
threshold, divided by the number of one targets), and the
False Positive Ratio (the number of outputs less than the
threshold, divided by the number of zero targets)
• ROC gives us the insight of the classifier performance
especially in the high sensitivity and high selectivity region
A typical ROC
• For an ideal classifier
the area under curve
of ROC =1
Case Study : Two-Class QRS
classification with MLP
Problem Statement:
Design a system to correctly classify extracted QRS
complexes into TWO classes: NORMAL and ABNORMAL
using MLP
NORMAL ABNORMAL
Data Pre-processing
• Mean adjust the data to remove the DC
component
• We can use all 180 points for training but it poses
a computational overhead
• Using features for training minimizes the
response of ANN to noise present in the signal
• Training time of the ANN reduces
• Overall accuracy of the ANN improves
• Generalization of the network improves
• Feature extraction in transform domain can use
– Discrete Cosine Transform (DCT)
• One-hot encoding technique for the target
Selection of significant components
• Metrics for component selection:
– Components that retain 99.5% of the signal energy
– Percent Root-mean-square Difference (PRD):
PRD = sqrt( Σ_{i=0}^{n} (x_original(i) − x_reconstructed(i))² / Σ_{i=0}^{n} x_original(i)² ) × 100
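A sketch of PRD measurement for DCT truncation, using SciPy's orthonormal DCT; the test signal here is a synthetic stand-in, not the ECG data of the case study:

```python
import numpy as np
from scipy.fft import dct, idct

def prd(x, x_rec):
    """Percent root-mean-square difference between original and reconstruction."""
    return np.sqrt(np.sum((x - x_rec) ** 2) / np.sum(x ** 2)) * 100.0

def dct_truncate(x, m):
    c = dct(x, norm="ortho")
    c[m:] = 0.0                       # keep only the first m coefficients
    return idct(c, norm="ortho")

x = np.sin(np.linspace(0, 4 * np.pi, 180))   # stand-in for a 180-sample QRS beat
for m in (5, 15, 30):
    print(m, round(prd(x, dct_truncate(x, m)), 4))
```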
Discrete Cosine Transform
[Figure: % signal energy vs. number of DCT coefficients (1 to 176); the curve rises steeply and saturates near 100%.]
• Thirty DCT components contribute 99.86% of the signal energy
• PRD (30 components) = 0.5343%
Discrete Cosine Transform - selection of significant coefficients

Sr. | Transform | Coefficients | PRD (%) | % Signal Energy
1   | DCT       | 5            | 9.8228  | 35.81
2   | DCT       | 15           | 5.4096  | 80.53
3   | DCT       | 30           | 0.5343  | 99.86
4   | DCT       | 40           | 0.3134  | 99.93

Reconstruction Errors
[Figure: effect of truncating DCT coefficients — the original 180-sample waveform (in mV) overlaid with reconstructions from 5, 15, and 30 coefficients.]
Dataset Creation
• It is always desirable that we have equal number
of exemplars from each class for the training
dataset
• This prevents “favoring” any class during
training
• If the number of exemplars are unequal we have
to de-skew the classifier decision
• De-skewing simply scales the output according
to the probability of the input classes
• Data randomization before training is a must, otherwise repetitive training with the same class may not allow the network to converge; remember that the error gradient is important for achieving the global minimum, if it exists!
• Partition the data into THREE disjoint sets
• Training set
• Cross-validation set
• Testing set
• Before the data is presented to the net for training we have to normalize the data to the range [−1, 1]; this helps in faster network learning
• Amplitude and Offset are given as:
Amp(i) = (UpperBound − LowerBound) / (max(i) − min(i))
Off(i) = UpperBound − Amp(i)·max(i)
To normalize data: Data(i) = Amp(i)·Data(i) + Off(i)
To de-normalize data: Data(i) = (Data(i) − Off(i)) / Amp(i)
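A sketch of this per-feature normalization (the function name is ours):

```python
import numpy as np

def fit_norm(col, upper=1.0, lower=-1.0):
    """Return (Amp, Off) mapping a feature column into [lower, upper]."""
    amp = (upper - lower) / (col.max() - col.min())  # Amp(i)
    off = upper - amp * col.max()                    # Off(i)
    return amp, off

col = np.array([2.0, 5.0, 9.0])
amp, off = fit_norm(col)
normed = amp * col + off          # -> [-1.0, ~-0.14, 1.0]
restored = (normed - off) / amp   # -> the original values
print(normed, restored)
```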
Methodology
[Flowchart: extraction of equal-length signals → pre-processing (mean adjust) → feature extraction (DCT transform) → supervised ANN training, yielding a NORMAL/ABNORMAL classifier. Training uses backpfun_logsig with cross-validation; testing uses testbackfun; finally compute the confusion matrix, Se, Spe, PP, FPR, and CR.]
MLP Classifier MATLAB DEMO
Case Study : Two-Class QRS
classification with RBFN
Problem Statement:
Design a system to correctly classify extracted QRS
complexes into TWO classes: NORMAL and ABNORMAL
using RBFN
NORMAL ABNORMAL
We use the same pre-processed data from the previous case study.
[Flowchart: training with rbfn_fun → testing with testrbfn_fun → compute the confusion matrix, Se, Spe, PP, FPR, and CR.]
RBFN Classifier MATLAB DEMO
Thank You!
LVQ, Hopfield Net and BAM
Prof.Dr.R.B.Ghongade
Learning Vector Quantization
• Learning vector quantization (LVQ) (proposed by Kohonen)is a pattern
classification method in which each output unit represents a particular class or
category
• The weight vector for an output unit is often referred to as a reference (or
codebook) vector for the class that the unit represents
• During training, the output units are positioned (by adjusting their weights
through supervised training) to approximate the decision surfaces
• It is assumed that a set of training patterns with known classifications is provided,
along with an initial distribution of reference vectors (each of which represents a
known classification).
• After training, an LVQ net classifies an input vector by assigning it to the
same class as the output unit that has its weight vector (reference vector) closest
to the input vector
Architecture
Algorithm
STEP 0 Initialize reference vectors; initialize learning rate α(0).
STEP 1 While stopping condition is false, do Steps 2-6.
STEP 2 For each training input vector X, do Steps 3-4.
STEP 3 Compute J so that ‖X − W_J‖ is minimum.
STEP 4 Update W_J as follows:
If T = C_J, then w_J(new) = w_J(old) + α·[x − w_J(old)]
If T ≠ C_J, then w_J(new) = w_J(old) − α·[x − w_J(old)]
STEP 5 Reduce the learning rate.
STEP 6 Test stopping condition. The condition may specify a fixed number of iterations (i.e., executions of Step 1) or the learning rate reaching a sufficiently small value.
Notation: X = training vector; T = correct category or class for the training vector; W_J = weight (reference) vector for the J-th output unit; C_J = class represented by the J-th output unit; ‖X − W_J‖ = Euclidean distance between X and W_J.
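A sketch of the LVQ1 loop above (all names are ours):

```python
import numpy as np

def train_lvq(X, T, W, classes, alpha=0.1, epochs=20, decay=0.95):
    """X: (N, d) inputs, T: (N,) labels, W: (u, d) reference vectors,
    classes: (u,) class label of each reference vector."""
    for _ in range(epochs):
        for x, t in zip(X, T):
            J = np.argmin(np.linalg.norm(x - W, axis=1))  # nearest reference
            if t == classes[J]:
                W[J] += alpha * (x - W[J])   # correct class: pull toward input
            else:
                W[J] -= alpha * (x - W[J])   # wrong class: push away from input
        alpha *= decay                       # reduce the learning rate
    return W
```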
Hopfield Network
• Developed by Hopfield (1982, 1984) , is an iterative auto-associative net
• The net is a fully interconnected neural net, in the sense that each unit is connected to every
other unit
• The net has symmetric weights with no self-connections
• Only one unit updates its activation at a time (based on the signal it receives from each other
unit)
• Each unit continues to receive an external signal in addition to the signal
from the other units in the net
• The asynchronous updating of the units allows a function, known as an energy or Lyapunov
function, to be found for the net
• The existence of such a function enables us to prove that the net will converge to a stable set of
activations, rather than oscillating
• Hopfield net can be used as content-addressable memory
Architecture
Note that: w_ii = 0 for i = 1,…,n, and w_ij = w_ji
Discrete Hopfield Net Algorithm (bipolar inputs)
STEP 0 Initialize weights to store patterns using the Hebb rule (outer product).
Store bipolar patterns s(p), p = 1,…,P, where s(p) = [s₁(p), …, s_i(p), …, s_n(p)]:
w_ij = Σ_p s_i(p)·s_j(p) for i ≠ j ; w_ii = 0
While activations of the net are not converged, do Steps 1-7.
STEP 1 For each input vector x, do Steps 2-6.
STEP 2 Set initial activations of the net equal to the external input vector x:
y_i = x_i , i = 1,…,n
STEP 3 Do Steps 4-6 for each unit Y_i.
(Units should be updated in random order.)
STEP 4 Compute the net input: y_in_i = x_i + Σ_j y_j·w_ji
STEP 5 Determine the activation (output signal):
y_i = 1 if y_in_i > θ_i ; y_i unchanged if y_in_i = θ_i ; y_i = −1 if y_in_i < θ_i
STEP 6 Broadcast the value of y_i to all other units.
(This updates the activation vector.)
STEP 7 Test for convergence.
Analysis
• Energy Function:
• Hopfield [1984] proved that the discrete Hopfield net will converge to a stable limit point (pattern of activation of the units) by considering an energy (or Lyapunov) function for the system
• An energy function is a function that is bounded below and is a non-increasing function of the state of the system
• For a neural net, the state of the system is the vector of activations of the units
• Thus, if an energy function can be found for an iterative neural net, the net will converge to a stable set of activations
• An energy function for the discrete Hopfield net is given by:
E = −0.5·Σ_i Σ_{j≠i} y_i·y_j·w_ij − Σ_i x_i·y_i + Σ_i θ_i·y_i
Analysis
• Storage Capacity:
• Hopfield found experimentally that the number of binary patterns that can be stored and recalled in a net with reasonable accuracy is given approximately by:
P ≈ 0.15·n
where n is the number of neurons in the net.
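A sketch of Hebbian storage and asynchronous recall for a small bipolar Hopfield net (θ = 0 assumed; names are ours):

```python
import numpy as np

def hopfield_store(patterns):
    P = np.array(patterns, dtype=float)   # (P, n) bipolar patterns
    W = P.T @ P                           # Hebb rule: summed outer products
    np.fill_diagonal(W, 0.0)              # no self-connections (w_ii = 0)
    return W

def hopfield_recall(W, x, steps=5, seed=0):
    y = x.astype(float).copy()
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        for i in rng.permutation(len(y)):  # asynchronous, random order
            y_in = x[i] + y @ W[:, i]      # y_in_i = x_i + sum_j y_j w_ji
            if y_in != 0:                  # leave y_i unchanged when y_in = θ = 0
                y[i] = 1.0 if y_in > 0 else -1.0
    return y

pattern = np.array([1, 1, -1, -1])
W = hopfield_store([pattern])
noisy = np.array([1, -1, -1, -1])          # one flipped bit
print(hopfield_recall(W, noisy))           # -> [ 1.  1. -1. -1.]
```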
Bi-directional Associative Memory (BAM)
• A bidirectional associative memory [Kosko, 1988] stores a set of pattern
associations by summing bipolar correlation matrices (an n by m outer
product
matrix for each pattern to be stored)
• The architecture of the net consists of two layers of neurons, connected by
directional weighted connection paths
• The net iterates, sending signals back and forth between the two layers
until all neurons reach equilibrium (i.e., until each neuron’s activation
remains constant for several steps)
• Bidirectional associative memory neural nets can respond to input to either
layer.
Architecture
Algorithm for Discrete BAM
• Setting the weights: The weight matrix to store a set of input and target vectors {s(p) : t(p)}, p = 1,…,P, where
s(p) = [s₁(p), …, s_i(p), …, s_n(p)] and t(p) = [t₁(p), …, t_j(p), …, t_m(p)],
can be determined by the Hebb rule (outer product) as:
w_ij = Σ_p s_i(p)·t_j(p)
• Activation functions:
For bipolar input vectors, the activation function for the Y-layer is:
y_j = 1 if y_in_j > θ ; y_j unchanged if y_in_j = θ ; y_j = −1 if y_in_j < θ
and the activation function for the X-layer is:
x_i = 1 if x_in_i > θ ; x_i unchanged if x_in_i = θ ; x_i = −1 if x_in_i < θ
Algorithm for Discrete BAM (bipolar)
STEP 0 Initialize the weights to store a set of P vectors; initialize all activations to 0.
STEP 1 For each testing input, do Steps 2-6.
STEP 2a Present input pattern x to the X-layer (i.e., set activations of the X-layer to the current input pattern).
STEP 2b Present input pattern y to the Y-layer. (Either of the input patterns may be the zero vector.)
STEP 3 While activations are not converged, do Steps 4-6.
STEP 4 Update activations of units in the Y-layer.
Compute net inputs: y_in_j = Σ_i x_i·w_ij
Compute activations: y_j = f(y_in_j)
Send the signal to the X-layer.
STEP 5 Update activations of units in the X-layer.
Compute net inputs: x_in_i = Σ_j y_j·w_ij
Compute activations: x_i = f(x_in_i)
Send the signal to the Y-layer.
STEP 6 Test for convergence:
If the activation vectors x and y have reached equilibrium, then stop; otherwise, continue.
Introduction to Deep
Learning
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield of learning representations of data.
Exceptional effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of)
representation by using a hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it
and respond in useful ways.
Why is DL useful?
o Manually designed features are often over-specified, incomplete and
take a long time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable
framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data

In ~2010 DL started outperforming


other ML techniques
first in speech and vision, then NLP
Neural Network Intro
Weights
Activation functions
How do we train?
For example, a small 3-4-2 fully connected network has:
4 + 2 = 6 neurons (not counting inputs)
[3 x 4] + [4 x 2] = 20 weights
4 + 2 = 6 biases
26 learnable parameters
Training
Sample labeled data (batch) → forward it through the network, get predictions → back-propagate the errors → update the network weights
Optimize (min. or max.) the objective/cost function
Generate an error signal that measures the difference between predictions and target values
Use the error signal to change the weights and get more accurate predictions
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function
https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39
Gradient Descent
Given an objective/cost function J(θ):
Update each element of θ: θ_j ← θ_j − η·∂J(θ)/∂θ_j
In matrix notation for all parameters: θ ← θ − η·∇_θ J(θ)
where η is the learning rate.
Backpropagation: recursively apply the chain rule through each node; this requires one forward pass to compute the activations first.
Text (input) representation
TFIDF
Word embeddings
….
[Figure: word-embedding vectors for the input feeding an output layer whose scores map to sentiment classes: very positive, positive, negative, very negative.]
Activation functions
Non-linearities are needed to learn complex (non-linear) representations of data; otherwise the NN would be just a linear function.
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
More layers and neurons can approximate more complex functions.
Activation: Sigmoid
Takes a real-valued number and "squashes" it into the range between 0 and 1.
http://adilmoujahid.com/images/activation.png
+ Nice interpretation as the firing rate of a neuron
• 0 = not firing at all
• 1 = fully firing
− Sigmoid neurons saturate and kill gradients, thus the NN will barely learn
• when the neuron's activations are 0 or 1 (saturated)
• the gradient in these regions is almost zero
• almost no signal will flow through to its weights
• if the initial weights are too large then most neurons will saturate
Activation: Tanh
Takes a real-valued number and "squashes" it into the range between −1 and 1.
http://adilmoujahid.com/images/activation.png
Activation: ReLU
Takes a real-valued number and thresholds it at zero: f(x) = max(0, x)
http://adilmoujahid.com/images/activation.png
Most deep networks use ReLU nowadays
⦁ Trains much faster
• accelerates the convergence of SGD
• due to its linear, non-saturating form
⦁ Less expensive operations
• compared to sigmoid/tanh (exponentials etc.)
• implemented by simply thresholding a matrix at zero
⦁ More expressive
⦁ Prevents the gradient vanishing problem
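For concreteness, the three activations and their derivatives in NumPy (a sketch; the derivative values show why ReLU avoids saturation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes into (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # ~0 for large |x|: the saturation problem

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2          # tanh squashes into (-1, 1); also saturates

def relu(x):
    return np.maximum(0.0, x)             # thresholds at zero

def relu_grad(x):
    return (x > 0).astype(float)          # 1 for all x > 0: gradient does not vanish

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid_grad(x), relu_grad(x))
```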
Overfitting
http://wiki.bethanycrane.com/overfitting-of-data
A learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their
connections) during training
• Each unit retained with fixed probability
p, independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. Journal of Machine Learning Research (2014)
L2 = weight decay
• Regularization term that penalizes big weights, added to the objective
• The weight decay value determines how dominant regularization is during gradient computation
• Big weight decay coefficient → big penalty for big weights
Early-stopping
• Use validation error to decide when to stop training
• Stop when the monitored quantity has not improved after n subsequent epochs
• n is called patience
Tuning hyper-parameters
Consider a function f(x, y) = g(x) + h(y) where g matters and h barely does, so that f(x, y) ≈ g(x); g(x) is shown in green and h(y) in yellow.
Bergstra, James, and Yoshua Bengio. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research, Feb (2012)
"Grid and random search of 9 trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x).
With grid search, nine trials only test g(x) in three distinct places.
With random search, all nine trials explore distinct values of g."
Both grid and random search try configurations blindly: the next trial is independent of all the trials done before.
A smarter choice for the next trial minimizes the number of trials:
1. Collect the performance at several configurations
2. Make inference and decide what configuration to try next
Loss functions and output
Classification:
• Training examples: Rⁿ × {class_1, …, class_n} (one-hot encoding)
• Output layer: soft-max (maps Rⁿ to a probability distribution)
• Cost (loss) function: cross-entropy
Regression:
• Training examples: Rⁿ × Rᵐ
• Output layer: linear (identity), f(x) = x, or sigmoid
• Cost (loss) function: mean squared error or mean absolute error
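A sketch of the classification column above: soft-max output plus cross-entropy loss (names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()               # maps raw scores to a probability distribution

def cross_entropy(p, target_onehot):
    return -np.sum(target_onehot * np.log(p + 1e-12))

z = np.array([2.0, 1.0, 0.1])        # raw scores for 3 classes
t = np.array([1.0, 0.0, 0.0])        # one-hot target
p = softmax(z)
print(p, cross_entropy(p, t))
```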
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards

Example: “this takes too long” compute vectors for:


This takes, takes too, too long, this takes too, takes too long, this takes too long

Convolution
[Figure: an input matrix convolved with a 3x3 filter.]
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards

max pool
2x2 filters
and stride 2

https://shafeentejani.github.io/assets/images/pooling.gif
CNN for text classification

Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for
Twitter Sentiment Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters

Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)

sliding over 3, 4 or 5 words at a time


Recurrent Neural Networks
(RNNs)
Main RNN idea for text:
Condition on all previous words
Use same set of weights at all time steps

https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg

⦁ Stack them up

https://discuss.pytorch.org/uploads/default/original/1X/6415da0424dd66f2f5b134709b92baa59e604c55.jpg
Bidirectional RNNs
Main idea: incorporate both left and right context
The output may depend not only on the previous elements in the sequence, but also on future elements
(past and future context around a single token).

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

• Two RNNs, one per direction, stacked on top of each other
• The output is computed based on the hidden states of both RNNs
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs

http://www.wildml.com/2015/10/recurrent-neural-network-
tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-
theano/

The update gate z controls how much of the past state should matter now.

If z is close to 1, then we can copy information in that unit through many steps!
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs

http://www.wildml.com/2015/10/recurrent-neural-network-
tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-
theano/

If the reset gate is close to 0, the previous hidden state is ignored
(this allows the model to drop information that is irrelevant in the future)

Units with short-term dependencies often have very active reset gates
Units with long-term dependencies have active update gates z
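
Putting the two gates together, a minimal NumPy sketch of one GRU step, following the update convention on these slides (h_t = z ∘ h_{t−1} + (1 − z) ∘ h̃_t, so z near 1 copies the old state); the weight matrices are random placeholders, not trained values:

import numpy as np

rng = np.random.default_rng(0)
d = 3  # illustrative hidden/input size (kept equal for brevity)
Wz, Uz = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wr, Ur = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wh, Uh = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x):
    z = sigmoid(Wz @ x + Uz @ h_prev)   # update gate: how much past state matters
    r = sigmoid(Wr @ x + Ur @ h_prev)   # reset gate: r near 0 ignores previous state
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    # z near 1 copies the old state through; z near 0 takes the new candidate
    return z * h_prev + (1.0 - z) * h_tilde

h = np.zeros(d)
for x in rng.standard_normal((4, d)):
    h = gru_step(h, x)
print(h)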
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs

http://www.wildml.com/2015/10/recurrent-neural-network-
tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-
theano/

LSTMs are a more complex form, but follow basically the same intuition
GRUs are often preferred over LSTMs
The memory cell combines current & previous time steps
LSTM
Generative Adversarial
Networks
GAN- Introduction
• Generative Adversarial Networks, or GANs for short, are an approach to generative modeling
using deep learning methods, such as convolutional neural networks.
• Generative modeling is an unsupervised learning task in machine learning that involves
automatically discovering and learning the regularities or patterns in input data in such a
way that the model can be used to generate or output new examples that plausibly could
have been drawn from the original dataset.
• GANs are a clever way of training a generative model by framing the problem as a
supervised learning problem with two sub-models: the generator model that we train to
generate new examples, and the discriminator model that tries to classify examples as
either real (from the domain) or fake (generated). The two models are trained together in
an adversarial, zero-sum game, until the discriminator model is fooled about half the time,
meaning the generator model is generating plausible examples.
• Generative models have demonstrated a remarkable ability to generate realistic examples across a range of problem
domains, most notably in image-to-image translation tasks such as translating photos of summer
to winter or day to night, and in generating photorealistic photos of objects, scenes, and people
that even humans cannot tell are fake.
What Are Generative Models?
Supervised vs. Unsupervised Learning
• A typical machine learning problem involves using a model to make a
prediction, e.g. predictive modeling.

• This requires a training dataset that is used to train a model, comprised of multiple
examples, called samples, each with input variables (X) and output class labels (y)
• A model is trained by showing examples of inputs, having it predict
outputs, and correcting the model to make the outputs more like the
expected outputs.
Supervised vs. Unsupervised Learning
Discriminative vs. Generative Modeling
• In supervised learning, we may be interested
in developing a model to predict a class label
given an example of input variables.
• This predictive modeling task is called
classification.
• Classification is also traditionally referred to as
discriminative modeling.
• This is because a model must discriminate
examples of input variables across classes; it
must choose or make a decision as to what
class a given example belongs.
• Alternately, unsupervised models that summarize the distribution of
input variables may be able to be used to create or generate new
examples in the input distribution.
• As such, these types of models are referred to as generative models.

• For example, a single variable may have a known data distribution, such as a Gaussian
distribution, or bell shape. A generative model may be able to sufficiently summarize this
data distribution, and then be used to generate new variables that plausibly fit into the
distribution of the input variable.
• In fact, a really good generative model may be able to
generate new examples that are not just plausible, but
indistinguishable from real examples from the problem
domain
What Are Generative Adversarial Networks?
• Generative Adversarial Networks, or GANs, are a deep-learning-based
generative model.
• More generally, GANs are a model architecture for training a generative
model, and it is most common to use deep learning models in this
architecture.
• The GAN model architecture involves two sub-models: a generator model
for generating new examples and a discriminator model for classifying
whether generated examples are real, from the domain, or fake, generated
by the generator model.

• Generator: Model that is used to generate new plausible examples from the problem
domain.
• Discriminator: Model that is used to classify examples as real (from the domain) or
fake (generated).
The Generator Model
• The generator model takes a fixed-length random vector as input and
generates a sample in the domain.
• The vector is drawn randomly from a Gaussian distribution, and is used to seed the
generative process
• After training, points in this multidimensional vector space will correspond
to points in the problem domain, forming a compressed representation of
the data distribution.
• This vector space is referred to as a latent space, or a vector space
comprised of latent variables
• Latent variables, or hidden variables, are those variables that are important
for a domain but are not directly observable.
The Generator Model continued…
• We often refer to latent variables, or a latent space,
as a projection or compression of a data
distribution
• That is, a latent space provides a compression of, or high-level concepts over, the
observed raw data, such as the input data distribution
• In the case of GANs, the generator model applies
meaning to points in a chosen latent space, such
that new points drawn from the latent space can be
provided to the generator model as input and used
to generate new and different output examples.
• After training, the generator model is kept and used
to generate new samples.
The Discriminator Model
• The discriminator model takes an example from the
domain as input (real or generated) and predicts a binary
class label of real or fake (generated).
• The real example comes from the training dataset
• The generated examples are output by the generator
model.
• The discriminator is a normal (and well understood)
classification model.
• After the training process, the discriminator model is
discarded as we are interested in the generator.
• Sometimes, the generator can be repurposed as it has
learned to effectively extract features from examples in
the problem domain
• Some or all of the feature extraction layers can be used in
transfer learning applications using the same or similar
input data.
GANs as a Two Player Game
• Generative modeling is an unsupervised learning problem, although a
clever property of the GAN architecture is that the training of the
generative model is framed as a supervised learning problem.
• The two models, the generator and discriminator, are trained together
• The generator generates a batch of samples, and these, along with real
examples from the domain, are provided to the discriminator and classified
as real or fake.
• The discriminator is then updated to get better at discriminating real and
fake samples in the next round, and importantly, the generator is updated
based on how well, or not, the generated samples fooled the discriminator.
• In this way, the two models are competing against each other, they are
adversarial in the game theory sense, and are playing a zero-sum game.
• In this case, zero-sum means that when the discriminator successfully identifies
real and fake samples, it is rewarded or no change is needed to the model
parameters, whereas the generator is penalized with large updates to model
parameters.

• Alternately, when the generator fools the discriminator, it is rewarded, or no change is
needed to the model parameters, but the discriminator is penalized and its model
parameters are updated.

• At a limit, the generator generates perfect replicas from the input domain every
time, and the discriminator cannot tell the difference and predicts “unsure” (e.g.
50% for real and fake) in every case. This is just an example of an idealized case;
we do not need to get to this point to arrive at a useful generator model.
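
To make the two-player training loop concrete, here is a minimal sketch in PyTorch, assuming a toy 1-D "domain" of samples from N(3, 0.5); the architectures and hyper-parameters are illustrative placeholders, not a definitive implementation:

import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 8

# Generator: latent vector -> fake sample; Discriminator: sample -> P(real)
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0   # toy "domain": samples from N(3, 0.5)
    fake = G(torch.randn(64, latent_dim))   # generator seeds from random latent vectors

    # Discriminator update: push real -> 1, fake -> 0 (fake detached so G is untouched)
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: rewarded when its fakes are classified as real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, generated samples should drift toward the real distribution
print(G(torch.randn(5, latent_dim)).detach().squeeze())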
Example of the Generative Adversarial
Network Model Architecture

Advantages of GANs: they can

1. model high-dimensional data
2. handle missing data
3. provide multi-modal outputs, i.e., multiple plausible answers
