Soft Computing 2023 Units 1,2,3 Handouts
MODULE 1
Dr. R.B.Ghongade,
SOFT COMPUTING CONSTITUENTS AND CONVENTIONAL
ARTIFICIAL INTELLIGENCE
[Figure: constituents of an intelligent system — Symbolic Logic and Reasoning, Traditional Numerical Modeling and Search, Approximate Reasoning, Functional Approximation and Randomized Search]
Methodology                                 Strength
Neural network                              Learning and adaptation
Fuzzy set theory                            Knowledge representation via fuzzy if-then rules
Genetic algorithm and simulated annealing   Systematic random search
Conventional AI                             Symbolic manipulation
An intelligent system
NEURAL NETWORKS
DARPA Neural Network Study (1988, AFCEA International
Press, p. 60):
An artificial neuron
DEFINITIONS OF NEURAL NETWORKS
According to Haykin (1994), p. 2:
Initialization
Selection
Reproduction with crossover and mutation
HYBRID FL SYSTEMS
[Figure: hybrid fuzzy-logic systems — Fuzzy Systems, Multivalued Algebras, Fuzzy Logic Controllers]
SOFT COMPUTING: HYBRID NN SYSTEMS
[Figure: neural networks for approximate reasoning and functional approximation/randomized search — Feedforward NN (Single/Multiple Layer Perceptron, RBF) and Recurrent NN (Hopfield, SOM, ART)]
HYBRID NN SYSTEMS
NN parameters (learning rate η, momentum α) controlled by FLC; NN topology and/or weights generated by EAs
SOFT COMPUTING: HYBRID EA SYSTEMS
[Figure: evolutionary algorithms for approximate reasoning and functional approximation/randomized search — Evolution Strategies, Genetic Algorithms, Evolutionary Programs, Genetic Programs]
HYBRID EA SYSTEMS
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS
• Human expertise
– SC utilizes human expertise in the form of fuzzy if-then rules, as
well as in conventional knowledge representations, to solve
practical problems
• Biologically inspired computing models
– Inspired by biological neural networks, artificial neural networks
are employed extensively in soft computing to deal with
perception, pattern recognition, and nonlinear regression and
classification problems
• New optimization techniques
– Soft computing applies innovative optimization methods arising from various sources: genetic algorithms (inspired by the evolution and selection process), simulated annealing (motivated by thermodynamics), and the random search method. These optimization methods do not require the gradient vector of an objective function, so they are more flexible in dealing with complex optimization problems
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS
• Numerical computation
– Unlike symbolic AI, soft computing relies mainly on numerical
computation. Incorporation of symbolic techniques in soft
computing is an active research area within this field.
• New application domains
– Because of its numerical computation, soft computing has found
a number of new application domains besides that of AI
approaches. These application domains are mostly computation
intensive and include adaptive signal processing, adaptive
control, nonlinear system identification, nonlinear regression,
and pattern recognition.
• Model-free learning
– Neural networks and adaptive fuzzy inference systems have the
ability to construct models using only target system sample
data. Detailed insight into the target system helps set up the
initial model structure, but it is not mandatory.
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS
• Intensive computation
– Without assuming too much background knowledge of the
problem being solved, neuro-fuzzy and soft computing rely
heavily on high-speed number-crunching computation to
find rules or regularity in data sets. This is a common
feature of all areas of computational intelligence
• Fault tolerance
– Both neural networks and fuzzy inference systems exhibit
fault tolerance. The deletion of a neuron in a neural
network, or a rule in a fuzzy inference system, does not
necessarily destroy the system. Instead, the system
continues performing because of its parallel and
redundant architecture, although performance quality
gradually deteriorates
NEURO-FUZZY AND SOFT COMPUTING CHARACTERISTICS
y_k = f(y_in_k)
Effect of bias
• The neuronal model also includes an
externally applied bias, denoted by bk
• The bias bk has the effect of increasing or
lowering the net input of the activation
function, depending on whether it is
positive or negative, respectively
• The use of bias bk , has the effect of applying
an affine transformation to the output y_ink
of the linear combiner in the model
• An affine transformation is any
transformation that preserves collinearity
(i.e., all points lying on a line initially still lie
on a line after transformation)
• Depending on whether the bias bk is
positive or negative, the relationship
between the induced local field or activation
potential y_ink of neuron k and the linear
combiner output uk is modified
Another nonlinear model of a neuron; wk0 accounts for
the bias bk.
Activation Functions- Threshold Function
y_k = 1 if y_in_k ≥ θ
y_k = 0 otherwise
Activation Functions- Signum Function
y_k = 1 if y_in_k ≥ θ
y_k = −1 otherwise
Activation Functions- Linear Function
y = y_in
Activation Functions- Saturating Linear Function
y_k = 1 if y_in_k ≥ 1
y_k = −1 if y_in_k ≤ −1
y_k = y_in_k otherwise
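These four activation rules can be written down directly; a minimal Python sketch (θ defaults to 0, an assumption where the handout leaves the threshold unspecified):

```python
def threshold(y_in, theta=0.0):
    # Binary threshold: 1 if the net input reaches theta, else 0
    return 1 if y_in >= theta else 0

def signum(y_in, theta=0.0):
    # Bipolar threshold: 1 if the net input reaches theta, else -1
    return 1 if y_in >= theta else -1

def linear(y_in):
    # Identity activation: output equals the net input
    return y_in

def saturating_linear(y_in):
    # Clamp the net input to the range [-1, 1]
    if y_in >= 1:
        return 1.0
    if y_in <= -1:
        return -1.0
    return y_in
```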
Artificial Neuron Model
y_k = f(y_in_k)
y_in_k is called the induced local field
Mc Culloch-Pitts Neuron
• McCulloch-Pitts
neuron Y may
receive signals
from any number
of other neurons
• Each connection
path is either
excitatory, with
weight w > 0, or
inhibitory, with
weight - p (p > 0)
• The condition that inhibition is absolute requires that the activation function satisfy:
y = 1 if y_in ≥ θ
y = 0 otherwise
All the weight and threshold setting has to be done with trial and error!
Logical XOR Implementation

x1 x2 Y
0  0  0
0  1  1
1  0  1
1  1  0

• XOR is decomposed as y = z1 OR z2, where z1 = x1 AND NOT x2 and z2 = NOT x1 AND x2
• For subfunction z2 (NOT x1 AND x2):

x1 x2 z2
0  0  0
0  1  1
1  0  0
1  1  0

• Assume that w12, w22 are excitatory and w12 = w22 = 1
• Output is given by z2_in = x1·w12 + x2·w22

x1·w12  x2·w22  z2_in
0       0       0
0       1       1
1       0       1
1       1       2

• We see that for no threshold value can we separate the third combination
• Change the weights to w12 = −1, w22 = 1

x1·w12  x2·w22  z2_in
0       0       0
0       1       1
−1      0       −1
−1      1       0

• We see that setting θ = 1 can separate the third entry
• So with w12 = −1, w22 = 1 and θ = 1 we can have the desired response

x1·w12  x2·w22  z2_in  θ  z2
0       0       0      1  0
0       1       1      1  1
−1      0       −1     1  0
−1      1       0      1  0

• By symmetry, z1 = x1 AND NOT x2 is obtained with w11 = 1, w21 = −1 and θ = 1
y = z1 OR z2
x1 x2 z1 z2 y
0 0 0 0 0
0 1 0 1 1
1 0 1 0 1
1 1 0 0 0
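With the weights and thresholds derived above (w = ±1, θ = 1 throughout), the three McCulloch-Pitts units can be composed directly. A sketch of the construction:

```python
def mp_neuron(inputs, weights, theta):
    # McCulloch-Pitts unit: fires (1) iff the weighted sum reaches theta
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= theta else 0

def xor(x1, x2):
    z1 = mp_neuron([x1, x2], [1, -1], 1)   # z1 = x1 AND NOT x2
    z2 = mp_neuron([x1, x2], [-1, 1], 1)   # z2 = NOT x1 AND x2
    return mp_neuron([z1, z2], [1, 1], 1)  # y  = z1 OR z2
```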
Revision of Some Basic Maths
• Basic Calculus
– Partial derivatives
– Gradient
– Chain rule
• Inner/dot product: XᵀY = Σᵢ xᵢ·yᵢ
• Matrix/Vector multiplication
• Vector space/Euclidean space
‖X − Y‖² = (X − Y)ᵀ(X − Y) = ‖X‖² + ‖Y‖² − 2XᵀY
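The identity ‖X − Y‖² = ‖X‖² + ‖Y‖² − 2XᵀY can be verified numerically; a pure-Python sketch with arbitrary example vectors:

```python
def dot(a, b):
    # Inner/dot product of two vectors
    return sum(x * y for x, y in zip(a, b))

def sq_norm(a):
    # Squared Euclidean norm
    return dot(a, a)

X, Y = [1.0, 2.0, 3.0], [4.0, -1.0, 0.5]
lhs = sq_norm([x - y for x, y in zip(X, Y)])
rhs = sq_norm(X) + sq_norm(Y) - 2 * dot(X, Y)
```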
– Multivariable function: y(x) = f(x1, x2, ..., xn)
– Chain rule: Let y = f(g(x)), u = g(x); then dy/dx = (dy/du)·(du/dx)
– Let z = f(x, y), x = g(t), y = h(t); then dz/dt = (∂f/∂x)·(dx/dt) + (∂f/∂y)·(dy/dt)
Feature Space
• Representing real world objects using feature vectors
• Object i is described by the feature vector X(i) = [x1(i), x2(i)]
[Figure: elliptical blobs (objects) numbered 1–16 in the (x1, x2) plane]
From Objects to Feature Vectors to Points in the Feature Space
[Figure: the same objects mapped to points X(1)…X(16) in the (x1, x2) feature space]
Linear Neuron as classifier
• Consider the line represented by a single neuron with inputs x1, x2, weights w11, w21 and bias b, with equation:
x1·w11 + x2·w21 + b = 0
• Re-arranging we get:
x2 = −(w11/w21)·x1 − b/w21
• This is of the form y = mx + c, where m = −w11/w21 and c = −b/w21
• We can use this line as a decision boundary separating objects in the (Feature 1, Feature 2) plane
• Now if a new object T = [t1, t2] produces neuron output y > 0, it simply means the object belongs to the red object class
• Thus the linear neuron works as a classifier!
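A sketch of the linear neuron used as a classifier; the weight values and class names here are illustrative, not taken from the handout's figure:

```python
def linear_neuron_class(t1, t2, w11=1.0, w21=1.0, b=-1.0):
    # The sign of t1*w11 + t2*w21 + b tells which side of the
    # decision line x1*w11 + x2*w21 + b = 0 the point lies on.
    y = t1 * w11 + t2 * w21 + b
    return "class A" if y > 0 else "class B"
```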
ARTIFICIAL NEURAL NETWORKS
MODULE 3
Dr. R.B.Ghongade,
Topics covered
• Linear Regression
• Gradient Descent Algorithm
• More Activation Functions
• Learning Processes
– Error correction Learning
– Memory based Learning
– Hebbian Learning
– Competitive Learning
• Biases and Thresholds
• Linear Separability
• Perceptron
Linear Regression
• In statistics, linear regression is an
approach to modeling the
relationship between a scalar
variable y and one or more
explanatory variables denoted X
• Once we know the equation of this
fitted line we can use it to find y
given X
• Say , we have conducted an
experiment that gives us some
output value for a set of inputs
• There are various conventional methods to do this ,e.g. Ordinary least
squares
• The best fit line has equation : 𝑦 𝑚𝑥 𝑐
• We have a choice of two parameters: 𝑚 and 𝑐
• A linear neuron can be used for this task since the output of a linear neuron is:
y = x1·w1 + x2·w2
• If we keep x2 = 1, then w1 is the slope and w2 is the intercept; hence we have to modify w1 and w2 to obtain the best fit line
• If we make a good choice of w1 and w2, our job is over!
• Many real world problems involve more than one independent variable; the regression is then called multiple regression
• For example, if we have two independent variables x1 and x2, it becomes a 2-D problem and the new network would be different
• Now we have to adjust two slopes and one intercept
• This can be extended to solve n-
dimensional problems
• For a function like y = f(x1, x2) we have a 2-D problem
The Concept of Error Energy
• We can have a combined error measure by squaring and adding all the
errors and dividing it by number of observations giving MEAN SQUARE
ERROR (mse), this in fact is the ERROR ENERGY
• This mse shows how good or bad the line is fitted!
• Considering a 1-D problem, the error energy at a point p is:
e_p² = (t_p − y_p)²
• The total error energy is given by:
E = Σ_p e_p² = Σ_p (t_p − y_p)²
• We can find the gradient(slope) at the starting point and slide down in the
opposite direction of the gradient to reach the minima
• This technique is called the steepest descent approach
• Even though reaching global minima is not guaranteed by this approach we
may find a good combination of 𝑤 and 𝑤 which gives lowest fitting error!
The Algorithm
• Let there be p observations available for training; then
E = (1/2)·Σ_p e_p² = (1/2)·Σ_p (t_p − y_p)²
where e_p = t_p − y_p
• Gradient of error with respect to any weight w_oi is
∂E/∂w_oi = −(t_o − y_o)·x_i
• This is the gradient with respect to one particular weight w_oi, considering o as the output and i as the input unit
• Now we have to move in the opposite direction of the
derivative hence correction to be applied to weights is
• We apply correction by simply multiplying the error by the input, thus
Δw_oi = (t_o − y_o)·x_i
• Hence the new weight is
w_oi(new) = w_oi(old) + Δw_oi
• Generally we apply a constant as a controlling parameter called the learning rate η:
Δw_oi = η·(t_o − y_o)·x_i
• If learning rate is high the network learns fast and vice-versa
• But a higher learning rate may lead to unstable operation and
no learning at all
• If learning is slower, it is at least guaranteed that there is
progress
• We have to thus look out for an optimal learning rate
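The update rule above can be sketched as a small training loop; the data set (points on y = 2x + 1), learning rate and epoch count are illustrative:

```python
def fit_line(xs, ts, eta=0.05, epochs=200):
    # w1: slope, w2: intercept (fed by the constant input x2 = 1)
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            y = w1 * x + w2        # linear neuron output
            e = t - y              # error at this observation
            w1 += eta * e * x      # correction = eta * error * input
            w2 += eta * e * 1.0
    return w1, w2

xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]          # samples of y = 2x + 1
w1, w2 = fit_line(xs, ts)
```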
Non-linear Regression
• Real world problems are mostly non-linear hence we have to go for non-linear
regression
• This can be done with a non-linear neuron i.e. a neuron with non-linear activation
function
2-D Non-linear Regression
Activation Functions
• To map non-linear output functions we require non-
linear activation functions
• The activation functions should be
• Continuous
• Monotonically increasing
• Differentiable
Monotonicity
• With error energy E = (1/2)·Σ_k e_k², the weight correction rule we get is
Δw_kj = η·e_k·x_j
• This learning rule is called the DELTA rule or WIDROW-HOFF rule
• The updated synaptic weight for the next time step is then
w_kj(new) = w_kj(old) + Δw_kj
• The adjustment made to a synaptic weight of a neuron is
proportional to the product of the error signal and the input
signal of the synapse in question
Memory based Learning
• In memory-based learning, all (or most) of the past
experiences are explicitly stored in a large memory of
correctly classified input-output examples
• The test vector seems to have the least Euclidean distance with an
outlier from Class 0 and is classified as belonging to Class 0
• This is wrong!
• To solve this problem we modify the nearest neighbor rule to
k-nearest neighbor classifier
• Identify the k classified patterns that lie nearest to the test vector X_test, for some integer k
• Assign X_test to the class (hypothesis) that is most frequently represented in the k nearest neighbors to X_test (i.e., use a majority vote to make the classification)
• Here k = 3, and X_test is now classified as belonging to Class 1 (out of its three nearest neighbors, two belong to Class 1)
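A minimal k-nearest-neighbor sketch; the training set below is illustrative, with one Class 0 outlier sitting inside the Class 1 cluster as in the discussion above:

```python
from collections import Counter

def knn_classify(x_test, data, k=3):
    # data: list of (feature_vector, class_label) pairs
    nearest = sorted(data, key=lambda p: sum((a - b) ** 2
                                             for a, b in zip(p[0], x_test)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Class 0 near (0, 0) plus one outlier near (2, 2); Class 1 near (2, 2)
train = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([2.1, 2.1], 0),
         ([2, 2], 1), ([2, 3], 1), ([3, 2], 1)]
```

With k = 1 the outlier wins and the test point is (wrongly) labeled Class 0; the majority vote with k = 3 corrects this.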
Hebbian Learning
• Hebbian learning is considered to be closer to the learning mechanism of a biological neuron
• Hebb, a neurophysiologist, in his book "Organization of Behaviour" (1949), postulated the "Hebb Rule"
“When an axon of cell A is near enough
to excite a cell B and repeatedly or
persistently takes part in firing it, some
growth process or metabolic changes
take place in one or both cells such that
A's efficiency as one of the cells firing B,
is increased”
• Thus if cell A consistently fires cell B, the synaptic weight increases, so that the next time cell A has a higher probability of firing cell B
• Also if cell A does not take part in firing cell B, the synaptic weight weakens
• If pre-synaptic neuron and the post-synaptic neurons show similar
activation we shall increase the synaptic weight and vice-versa
• Hebbian synapse is a synapse that uses a time dependent, highly
local, and strongly interactive mechanism to increase synaptic
efficiency as a function of the correlation between the presynaptic
and postsynaptic activities
• Four key mechanisms (properties) that characterize a Hebbian
synapse
1. Time-dependent mechanism: modifications in a Hebbian synapse
depend on the exact time of occurrence of the presynaptic and
postsynaptic signals
2. Local mechanism: locally available information is used by a Hebbian
synapse to produce a local synaptic modification that is input specific
3. Interactive mechanism: Hebbian form of learning depends on a "true
interaction" between presynaptic and postsynaptic signals in the sense
that we cannot make a prediction from either one of these two activities
by itself
4. Correlational mechanism: the correlation over time between presynaptic
and postsynaptic signals is viewed as being responsible for a synaptic
change
Mathematical Models of Hebbian Modifications
• Consider a synaptic weight w_kj of neuron k with presynaptic and postsynaptic signals denoted by x_j and y_k respectively
• The adjustment applied to the synaptic weight at time step n is expressed in the general (covariance) form, with x̄ and ȳ the time-averaged presynaptic and postsynaptic signals:
Δw_kj = η·(y_k − ȳ)·(x_j − x̄)
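The covariance form of the update can be sketched as a one-line rule (η and the mean activities below are illustrative parameters):

```python
def hebbian_update(w, x, y, x_bar, y_bar, eta=0.1):
    # Covariance form of Hebb's rule: the weight strengthens when the
    # pre- and post-synaptic activities deviate from their means together,
    # and weakens when they deviate in opposite directions.
    return w + eta * (y - y_bar) * (x - x_bar)
```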
Competitive Learning
• Output neurons of a neural network compete among
themselves to become active (fired)
• In competitive learning only a
single output neuron is active
at any one time
• Highly suited to discover
statistically salient features
that may be used to classify a
set of input patterns
• Accordingly the individual
neurons of the network learn
to specialize on ensembles of
similar patterns; in so doing
they become feature
detectors for different classes
of input patterns
• There are three basic elements to a competitive
learning rule
– A set of neurons that are all the same except for
some randomly distributed synaptic weights, and
which therefore respond differently to a given set
of input patterns
– A limit imposed on the "strength" of each neuron
– A mechanism that permits the neurons to
compete for the right to respond to a given subset
of inputs, such that only one output neuron, or
only one neuron per group, is active (i.e., "on") at
a time. The neuron that wins the competition is
called a winner-takes-all neuron
• For a neuron 𝑘 to be the winning neuron, its induced local field 𝑣 ,
for a specified input pattern x must be the largest among all the
neurons in the network
• The output signal 𝑦 of winning neuron 𝑘 is set equal to one; the
output signals of all the neurons that lose the competition are set
equal to zero
• Thus
y_k = 1 if v_k > v_j for all j, j ≠ k
y_k = 0 otherwise
where the induced local field v_k represents the combined action of all the forward and feedback inputs to neuron k
• Let 𝑤 denote the synaptic weight connecting input node j to
neuron k
• Suppose that each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is distributed among its input nodes; that is
Σ_j w_kj = 1 for all k
• As per the standard competitive learning rule, the change Δw_kj applied to synaptic weight w_kj is defined by:
Δw_kj = η·(x_j − w_kj) if neuron k wins the competition
Δw_kj = 0 if neuron k loses the competition
• This rule has the overall effect of moving the synaptic weight
vector , of winning neuron k toward the input pattern
• Thus if we have a weight vector w_k and an input vector X, we are effectively moving w_k towards the input pattern X
• What we ultimately do is align the weight vector towards the input vector for a specific input and weight combination
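A sketch of one step of the standard competitive learning rule; only the winning neuron's weight vector moves toward the input:

```python
def competitive_step(weights, x, eta=0.5):
    # weights: one weight vector per output neuron
    # Winner: the neuron whose weight vector is closest to x
    def d2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    k = min(range(len(weights)), key=lambda j: d2(weights[j]))
    # Only the winner moves, toward the input pattern
    weights[k] = [wi + eta * (xi - wi) for wi, xi in zip(weights[k], x)]
    return k
```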
Geometric Interpretation
• Consider vector X = [x1 x2 x3]ᵀ
• The constraint we lay down is that ‖X‖ = 1, this means
x1² + x2² + x3² = 1
• Similarly ‖w_k‖ = 1, so both the input vectors and the weight vectors lie on a unit sphere
• If bias is not used the same network can be described as
y = f(y_in) = 1 if y_in ≥ θ; −1 if y_in < θ
where
y_in = Σᵢ xᵢ·wᵢ
• These two regions are often called decision regions for the net
Response regions for the AND function

x1  x2  Y
-1  -1  -1
-1   1  -1
 1  -1  -1
 1   1   1

• Solving this graphically, we find that the red line can be a good decision boundary
• Points (0, 1) and (1, 0) lie on the line, hence using the equation
x2 = −(w1/w2)·x1 − b/w2
assuming w2 = 1 gives b = −1 using point (0, 1), and w1 = 1 using point (1, 0) and b = −1
• Actually the choice of sign for b is determined by the requirement that
b + w1·x1 + w2·x2 < 0
on the negative side of the boundary
• We can then set x1 = x2 = 0 and compute b knowing to which side of the line the point (0, 0) should lie
Response regions for the OR function

x1  x2  Y
-1  -1  -1
-1   1   1
 1  -1   1
 1   1   1

• Points (-1, 0) and (0, -1) lie on the line, hence using the equation
x2 = −(w1/w2)·x1 − b/w2
assuming w2 = 1 gives b = 1 and w1 = 1
XOR-Example of linearly non-separable problems

x1  x2  Y
-1  -1  -1
-1   1   1
 1  -1   1
 1   1  -1

y = 1 if y_in > θ
y = 0 if −θ ≤ y_in ≤ θ
y = −1 if y_in < −θ
STEP 5 Update weights and bias if an error occurred for this pattern
If y ≠ t
w_i(new) = w_i(old) + η·t·x_i
b(new) = b(old) + η·t
else
w_i(new) = w_i(old)
b(new) = b(old)
First iteration (Δw_i = η·t·x_i, η = 1)

Input (x1, x2, 1)  y_in  y   t   Weight changes (Δw1, Δw2, Δb)  Weights (w1, w2, b)
                                                                (0, 0, 0)
(-1, -1, 1)         0    1  -1   (1, 1, -1)                     (1, 1, -1)
(-1, 1, 1)         -1   -1  -1   (0, 0, 0)                      (1, 1, -1)
(1, -1, 1)         -1   -1  -1   (0, 0, 0)                      (1, 1, -1)
(1, 1, 1)           1    1   1   (0, 0, 0)                      (1, 1, -1)
Second iteration

Input (x1, x2, 1)  y_in  y   t   Weight changes  Weights (w1, w2, b)
                                                 (1, 1, -1)
(-1, -1, 1)        -3   -1  -1   (0, 0, 0)       (1, 1, -1)
(-1, 1, 1)         -1   -1  -1   (0, 0, 0)       (1, 1, -1)
(1, -1, 1)         -1   -1  -1   (0, 0, 0)       (1, 1, -1)
(1, 1, 1)           1    1   1   (0, 0, 0)       (1, 1, -1)
• We see that there is no weight change in the second iteration, hence we conclude that the net has converged
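The two iterations above can be reproduced with a short training loop; the activation at y_in = 0 is taken as 1, matching the table (θ = 0 is assumed):

```python
def train_perceptron(samples, eta=1.0, epochs=10):
    # Bipolar perceptron trained on (x1, x2, t) triples, weights start at 0
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        changed = False
        for x1, x2, t in samples:
            y_in = x1 * w1 + x2 * w2 + b
            y = 1 if y_in >= 0 else -1      # activation as used in the table
            if y != t:                       # update only when an error occurs
                w1 += eta * t * x1
                w2 += eta * t * x2
                b += eta * t
                changed = True
        if not changed:                      # converged: a full pass with no change
            break
    return w1, w2, b

AND = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]
```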
MULTI-LAYERED PERCEPTRON AND
BACKPROPAGATION
MODULE 4
Dr. Rajesh B. Ghongade
Agenda
• Perceptron and its limitations
• MLP architecture
• Activation functions
• Gradient Descent Algorithm and Delta Rule
• Generalized Delta Rule (Backpropagation)
• Signal Flow
• Standard Backpropagation Algorithm
• XOR problem
• MATLAB Demo
• Some tips for net convergence
• Variations in standard backpropagation algorithm
• Applications of MLP trained with backpropagation algorithm
Perceptron and its limitations
[Figure: single-layer perceptron — inputs x1, x2 through units X1, X2 with weights w11, w21 and bias w01 feed output unit Y1, which produces y1 = f(y_in1)]
a = 0 if n < 0
a = 1 if n ≥ 0
• Creates a linear separation boundary called the decision boundary
• Capable of classifying linearly separable objects only
• The boundary is
w01 + w11·x1 + w21·x2 = w01 + Σ_{j=1}^{2} w_j1·x_j = 0
• Minsky and Papert (1969) showed that a perceptron is incapable of solving a simple XOR problem, which is not linearly separable

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

• Minsky and Papert (1969) also showed that such a problem can be solved by adding another layer of perceptrons and combining the responses
[Figure: two-layer network for XOR — inputs X1, X2 feed hidden units Z1, Z2 (weights v11, v21, v12, v22, biases v01, v02), whose outputs z1, z2 feed output unit Y1 (weights w11, w21, bias w01)]
[Figure: in the (z1, z2) hidden space the four XOR patterns become linearly separable]
XOR problem can be solved, but how to find out all the weights and biases?
MLP Architecture
[Figure: fully connected MLP — input units X1…Xn, hidden units Z1…Zp with weights v_ij and biases v_0j, output units Y1…Ym with weights w_jk and biases w_0k]
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 2 layers
• Number of output units need not equal number of input
units
• Number of hidden units per layer can be more or less than
input or output units
• Can also view 1st layer as using local knowledge while 2nd
layer does global
• With sigmoidal activation functions can show that a 2
layer net can approximate any function to arbitrary
accuracy: property of Universal Approximation
1st layer draws linear boundaries; 2nd layer combines the boundaries; 3rd layer can generate arbitrarily complex boundaries
Concept of layers!
Activation Functions
Binary Sigmoid: f(x) = y = 1 / (1 + e^(−x)), output limits: [0, 1]
Derivative: f'(x) = dy/dx = f(x)·(1 − f(x))
Activation Functions
Bipolar Sigmoid: f(x) = y = 2 / (1 + e^(−x)) − 1, output limits: [-1, 1]
Derivative: f'(x) = dy/dx = (1/2)·(1 + f(x))·(1 − f(x))
Activation Functions
Tanh Sigmoid: f(x) = y = (e^x − e^(−x)) / (e^x + e^(−x)), output limits: [-1, 1]
Derivative: f'(x) = dy/dx = 1 − f(x)²
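The three sigmoids and their derivative formulas can be cross-checked against a numerical derivative; a small sketch:

```python
import math

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def tanh_sigmoid(x):
    return math.tanh(x)

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation to f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
d_binary = binary_sigmoid(x) * (1 - binary_sigmoid(x))
d_bipolar = 0.5 * (1 + bipolar_sigmoid(x)) * (1 - bipolar_sigmoid(x))
d_tanh = 1 - tanh_sigmoid(x) ** 2
```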
Gradient Descent Algorithm
• Output of the linear neuron: y_in = w0 + Σ_{i=1}^{n} wᵢ·xᵢ = Σ_{i=0}^{n} wᵢ·xᵢ (with x0 = 1)
• Error energy is then: E = (1/2)·e² = (1/2)·(t − y_in)²
• In order to reduce the squared error, i.e. the error energy, we have to find the partial derivative of E with respect to the weights, and modify the weights in a direction opposite to the partial derivative; this is an optimization technique known as gradient descent.
• Hence we want to compute ∂E/∂w_I, i.e. the gradient of error energy with respect to weight w_I
• Once the gradient is obtained we can move along the opposite direction to the gradient in the hope of reaching the global minima
• Hence the weights have to be modified as: Δw_I = −η·∂E/∂w_I
But ∂E/∂w_I = −(t − y_in)·∂(y_in)/∂w_I
And ∂(y_in)/∂w_I = x_I since y_in = w0 + Σ_{i=1}^{n} wᵢ·xᵢ = Σ_{i=0}^{n} wᵢ·xᵢ
hence −∂E/∂w_I = (t − y_in)·x_I
Thus the delta rule becomes Δw_I = η·(t − y_in)·x_I
where η is the learning rate (0 < η ≤ 1)
Backpropagation
• In the perceptron/single layer nets, we used gradient descent on the error function to find the correct weights:
Δw_I = η·(t − y_in)·x_I
• We see that errors/updates are local to the node, i.e. the change in the weight w_ij from node i to output j is controlled by the input that travels along the connection and by the error signal at the output
• But with more layers, how are the weights for the first 2 layers found when the error is computed for layer 3 only?
• Instantaneous error energy: E = (1/2)·Σ_k e_k², where e_k = t_k − y_k
• Average error energy over L patterns: E_av = (1/L)·Σ_{l=1}^{L} E_l
• Output of unit k: y_k = f(y_in_k)
• We want to find out the contribution of weight w_jk to the error E, i.e.:
∂E/∂w_jk = −e_k·f'(y_in_k)·z_j
• Now we want to reduce the error by changing weights proportionately as:
Δw_jk = −η·∂E/∂w_jk
where η = learning rate
hence Δw_jk = η·e_k·f'(y_in_k)·z_j
• We define the local gradient as:
δ_k = e_k·f'(y_in_k)
so that Δw_jk = η·δ_k·z_j
• Actually δ_k = −∂E/∂y_in_k, as can be shown by using the chain rule again as follows:
δ_k = −∂E/∂y_in_k = −(∂E/∂e_k)·(∂e_k/∂y_k)·(∂y_k/∂y_in_k) = −e_k·(−1)·f'(y_in_k) = e_k·f'(y_in_k)
• For a hidden unit j the local gradient is:
δ_j = −∂E/∂z_in_j = f'(z_in_j)·Σ_k δ_k·w_jk
• We have E = (1/2)·Σ_k e_k²
• The hidden-layer weight update is then:
Δv_ij = η·δ_j·x_i
Backpropagation Algorithm thus has THREE phases:
1. Forward phase, where the input signal propagates in the forward direction
[Figure: forward signal flow — inputs x_i with weights v_ij produce z_in_j, hidden outputs z_j = f(z_in_j) with weights w_jk produce y_in_k, outputs y_k = f(y_in_k), and errors e_k = t_k − y_k]
2. Error backpropagation phase, where the errors e_k, scaled by f'(y_in_k), propagate backwards through the weights w_jk
3. Weight update phase
FORWARD PHASE
STEP 3 Each input unit (X_i, i = 1, ..., n) receives input signal x_i and broadcasts this signal to all units in the hidden layer
STEP 4 Each hidden unit (Z_j, j = 1, ..., p) sums its weighted input signals,
z_in_j = v_0j + Σ_i x_i·v_ij
and sends this signal to all units in the output layer
STEP 5 Each output unit (Y_k, k = 1, ..., m) sums its weighted input signals,
y_in_k = w_0k + Σ_j z_j·w_jk
STEP 8 Each output unit (Y_k, k = 1, ..., m) updates its bias and weights (j = 0, ..., p):
w_jk(new) = w_jk(old) + Δw_jk
[Figure: worked error-backpropagation flow for a 2-4-1 network. For output unit 1:
δ_1 = (t1 − y1)·f'(y_in1)
Δw_j1 = η·δ_1·z_j and Δw_01 = η·δ_1
For each hidden unit j:
δ_in_j = δ_1·w_j1
δ_j = δ_in_j·f'(z_in_j)]
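The local-gradient formulas can be checked on a tiny 2-2-1 network by comparing a backpropagated gradient with a numerical derivative of the error; all weights below are illustrative:

```python
import math

def f(x):
    # Binary sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

# A 2-2-1 network with fixed illustrative weights (bias listed first)
V = [[0.1, 0.5, -0.4], [-0.3, 0.4, 0.6]]   # hidden-unit weights v_0j, v_1j, v_2j
W = [0.2, -0.5, 0.4]                        # output-unit weights w_0, w_1, w_2

def forward(x1, x2, V, W):
    z = [f(v[0] + v[1] * x1 + v[2] * x2) for v in V]
    y = f(W[0] + W[1] * z[0] + W[2] * z[1])
    return z, y

def backprop_grads(x1, x2, t, V, W):
    # Gradients of E = 0.5 * (t - y)^2 via the local-gradient formulas
    z, y = forward(x1, x2, V, W)
    delta_k = (t - y) * y * (1 - y)                    # delta_k = e_k * f'(y_in_k)
    dW = [-delta_k, -delta_k * z[0], -delta_k * z[1]]  # dE/dw_jk = -delta_k * z_j
    delta_j = [delta_k * W[j + 1] * z[j] * (1 - z[j]) for j in range(2)]
    dV = [[-d, -d * x1, -d * x2] for d in delta_j]     # dE/dv_ij = -delta_j * x_i
    return dW, dV

# Numerical derivative of the error with respect to w_1 for one pattern
def E(w1):
    _, y = forward(1, 0, V, [W[0], w1, W[2]])
    return 0.5 * (1 - y) ** 2

dW, dV = backprop_grads(1, 0, 1, V, W)
h = 1e-6
numeric = (E(W[1] + h) - E(W[1] - h)) / (2 * h)
```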
1. Regression
2. Pattern Classification
3. Forecasting
Thank you!
RADIAL BASIS FUNCTION NETWORKS
MODULE 5
[Figure: RBF network — the input vector x feeds m radial basis units φ(‖x − c_i‖), whose outputs are combined by weights w_ik and biases w_0k to give the outputs y_k]
y = f(x) = Σ_{i=1}^{m} w_i·φ(‖x − c_i‖)
φ functions
1) Gaussian
φ(r) = exp(−r² / (2σ²)) for some σ > 0, r ∈ ℝ
Localized function: φ(r) → 0 as r → ∞
2) Multi-quadrics
φ(r) = (r² + c²)^(1/2) for some c > 0, r ∈ ℝ
Non-localized function: φ(r) → ∞ as r → ∞
3) Inverse Multi-quadrics
φ(r) = 1 / (r² + c²)^(1/2) for some c > 0, r ∈ ℝ
Localized function: φ(r) → 0 as r → ∞
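The three basis functions as a Python sketch (σ and c default to 1, an illustrative choice):

```python
import math

def gaussian(r, sigma=1.0):
    # Localized: tends to 0 as r grows
    return math.exp(-r * r / (2 * sigma * sigma))

def multiquadric(r, c=1.0):
    # Non-localized: grows without bound as r grows
    return math.sqrt(r * r + c * c)

def inverse_multiquadric(r, c=1.0):
    # Localized: tends to 0 as r grows
    return 1.0 / math.sqrt(r * r + c * c)
```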
XOR problem
[Figure: RBF network for XOR — inputs x1, x2 feed two Gaussian units φ1 (center C1) and φ2 (center C2); their outputs are combined with weights w1, w2 and bias w0 to give output y]
We select the two φ-functions as Gaussian with µ = 0 and σ = 1/√2, so that
φ(r) = exp(−r² / (2σ²)) reduces to φ(r) = exp(−r²)
We select the centers as C1 = [0, 0] and C2 = [1, 1], hence:
φ1(X) = exp(−‖X − C1‖²)
φ2(X) = exp(−‖X − C2‖²)

x1  x2  ‖X − C1‖  ‖X − C2‖  φ1      φ2
0   0   0         1.4142    1       0.1353
0   1   1         1         0.3679  0.3679
1   0   1         1         0.3679  0.3679
1   1   1.4142    0         0.1353  1
Mapped points
[Figure: in the (φ1, φ2) plane the four patterns map to X1 = (1, 0.1353), X2 = X3 = (0.3679, 0.3679) and X4 = (0.1353, 1), and become linearly separable]

Φ = [ 1       0.1353  1
      0.3679  0.3679  1
      0.3679  0.3679  1
      0.1353  1       1 ],  W = [w1, w2, w0]ᵀ,  D = [0, 1, 1, 0]ᵀ

Solve this discrimination function to complete the solution:
Φ·W = D
W = (ΦᵀΦ)⁻¹·Φᵀ·D
yields:
W = [−2.5031, −2.5031, 2.8418]ᵀ
Checking the solution: Φ·W ≈ [0, 1, 1, 0]ᵀ = D
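The solution can be checked numerically: with the weight vector from the pseudo-inverse, ΦW reproduces D to within round-off. A pure-Python sketch:

```python
import math

def phi(x, c):
    # Gaussian with sigma = 1/sqrt(2): phi = exp(-||x - c||^2)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)))

C1, C2 = (0, 0), (1, 1)
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
D = [0, 1, 1, 0]
Phi = [[phi(x, C1), phi(x, C2), 1.0] for x in inputs]
W = [-2.5031, -2.5031, 2.8418]          # from the pseudo-inverse solution
outputs = [sum(p * w for p, w in zip(row, W)) for row in Phi]
```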
Fixing the radius σ
This is usually done using the P-nearest neighbor algorithm. A
number P is chosen, and for each center, the P nearest centers are
found. The root mean squared distance between the current cluster
center and its P nearest neighbors is calculated, and this is the value
chosen for σ. So, if the current cluster center is 𝑐 , the value is:
σ_j = sqrt( (1/P) · Σ_{i=1}^{P} ‖c_k − c_i‖² )
A typical value for P is 2, in which case σ is set to be the average
distance from the two nearest neighboring cluster centers.
The variable σ defines the width or radius of the bell shape and is
something that has to be determined empirically. When the
distance from the center of the Gaussian reaches σ, the output
drops from 1 to 0.6.
RBFN Algorithm
Step 1 Get n, k, N, the feature vectors and their target vectors; input the number of iterations I, set i = 0, set the m centers of the RBFs as the N exemplar vectors, and set mse_goal and the learning rate α
Step 2 UNSUPERVISED LEARNING
Using clustering algorithm like k-means clustering find the 𝑚 cluster centers. Find
minimum Euclidean distance of the cluster centers to fix σ (radius)
Step 3 Compute the -matrix
Step 4 SUPERVISED LEARNING
Choose weights and biases W at random between -0.5 and 0.5
Step 5 Compute 𝑦𝑘 and error 𝐸 and 𝑚𝑠𝑒
Step 6 Using the Delta Rule, update all parameters W_mk for all m and k at the current iteration
Step 7 Compute y_k and the new value of error E
Step 8 If mse ≤ mse_goal stop, else continue
Step 9 Increment iteration i; if i < I then go to Step 5, else stop
k-means clustering
• The k-means algorithm partitions the data into 𝑘 clusters. A popular
criterion function associated with the k-means algorithm is the sum of
squared error. Let 𝑘 be the number of clusters and 𝑛 the number of
data in the sample X1, X2, ..., Xn. We define the cluster centroid m_i as:
m_i = ( Σ_{j=1}^{n} ω_ij·x_j ) / ( Σ_{j=1}^{n} ω_ij ), for all i
with the membership function 𝜔 indicating whether the data point 𝑋 belongs
to a cluster 𝜔 . The membership values vary according to the type of k-means
algorithm. The standard k-means uses an all-or-nothing procedure, that is,
𝜔 =1, if the data sample 𝑋 belongs to cluster 𝜔 , else 𝜔 =0 .
• The membership function must also satisfy the following constraints:
Σ_{i=1}^{k} ω_ij = 1, for all j
0 < Σ_{j=1}^{n} ω_ij < n, for all i
• The k-means algorithm uses a criterion function based on the measure of similarity
or distance. For example, using the Euclidean distance that will favor the hyper
spherical cluster, a criterion function to minimize is defined by:
J = Σ_{i=1}^{k} Σ_{j=1}^{n} ω_ij·‖x_j − m_i‖²
The objects (medicines) are A = (1, 1), B = (2, 1), C = (4, 3) and D = (5, 4), with initial centroids C1 = (1, 1) and C2 = (2, 1). The distance matrix is

D⁰ = [ 0  1  3.61  5
       1  0  2.83  4.24 ]

Each column in the distance matrix symbolizes an object. The first row of the distance matrix corresponds to the distance of each object to the first centroid and the second row is the distance of each object to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid C1 = (1, 1) is sqrt((4 − 1)² + (3 − 1)²) = 3.61, the distance to C2 = (2, 1) is sqrt((4 − 2)² + (3 − 1)²) = 2.83, and so on.
3. Objects clustering : We assign each object based on the minimum distance. Thus,
medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and
medicine D to group 2. The element of Group matrix below is 1 if and only if the object
is assigned to that group.
G⁰ = [ 1  0  0  0    (Group 1)
       0  1  1  1 ]  (Group 2)
4. Iteration-1, determine centroids : Knowing the members of each group, now we
compute the new centroid of each group based on these new memberships. Group 1
only has one member thus the centroid remains in C1=(1,1). Group 2 now has three
members, thus the centroid is the average coordinate among the three members:
C2 = ( (2 + 4 + 5)/3, (1 + 3 + 4)/3 ) = (3.67, 2.67)
5. Iteration-1, Objects-Centroids distances : The next step is to compute the distance of
all objects to the new centroids. Similar to step 2, we have distance matrix at iteration 1
as:
D¹ = [ 0     1     3.61  5
       3.14  2.36  0.47  1.89 ]

with C1 = (1, 1) and C2 = (3.67, 2.67)
6. Iteration-1, Objects clustering: Similar to step 3, we assign each object based on the
minimum distance. Based on the new distance matrix, we move the medicine B to
Group 1 while all the other objects remain unchanged. The Group matrix is shown
below:
G¹ = [ 1  1  0  0    (Group 1)
       0  0  1  1 ]  (Group 2)
7. Iteration 2, determine centroids: Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, thus the new centroids are
C1 = ( (1 + 2)/2, (1 + 1)/2 ) = (1.5, 1)
C2 = ( (4 + 5)/2, (3 + 4)/2 ) = (4.5, 3.5)
8. Iteration-2, Objects-Centroids distances: Repeating step 2 again, we have the new distance matrix at iteration 2 as:

D² = [ 0.5   0.5   3.20  4.61
       4.30  3.54  0.71  0.71 ]
9. Iteration-2, Objects clustering: Again, we assign each object based on the minimum
distance:
G² = [ 1  1  0  0    (Group 1)
       0  0  1  1 ]  (Group 2)
We see that G² = G¹, hence the objects do not move anymore and the algorithm has reached a stable point.
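The worked example can be reproduced with a short k-means sketch using the all-or-nothing membership described above:

```python
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            d2 = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[d2.index(min(d2))].append(p)
        # Update step: centroid = mean of its members (kept if group is empty)
        new = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
               if g else c for g, c in zip(groups, centroids)]
        if new == centroids:        # stable point reached
            break
        centroids = new
    return centroids

medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]   # A, B, C, D
centroids = kmeans(medicines, [(1, 1), (2, 1)])
```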
RBFN MATLAB DEMO
If only life were so simple…
[Figure: cortical maps — distinct regions of the cortex handle different sensory modalities such as tactile, taste and smell]
Models of SOM
1. Willshaw-von der Malsburg model
2. Kohonen model
Willshaw-von der Malsburg model
• Two arrays (lattices) of pre-
synaptic and post-synaptic
neurons
• Was used to explain the retinotopic mapping from the retina to the visual cortex, where the retina forms the pre-synaptic layer and the visual cortex is the post-synaptic layer
• Dimensions of input and output
are same
• Electric signals of pre-synaptic
neurons are based on geometric
proximities
• The neurons near the excited neuron have highly correlated electrical responses
• If signal for A is strong , the signal for a geometrically close neuron B, will also be strong
and it will enhance spatially those output neurons which are there in similar spatial
locations
• There is a spatial correlation of activities
• Hence pre-synaptic layer activities are mapped onto similar neuron activities in the post
synaptic layer
Kohonen model
• Based on vector coding
algorithm which optimally
places a fixed number of
vectors (called code
words) into a higher
dimensional input space
• Same as data
compression technique
like entropy coding
• If we have an input of length 𝑙 , it can be compressed into a code word of
length 𝑚 , which is much less than 𝑙
• This technique exploits the entropy in the data eg: Huffmann coding
• A fixed number of codewords form a vector
• The Kohonen model is more popularly used because it offers dimensionality reduction and is more general
• Hence SOMs are also referred to as Kohonen self organizing maps
Architecture of SOM
Essential Processes in SOM
1. Competition:
– Long range inhibition
– For each input pattern the neurons in the output
layer determine the value of a function
– Called as the discriminant function
– This function provides the basis of competition
– A particular neuron with the largest discriminant
function value is the winner
– Responses of other neurons (at longer geometric
distances) is set to zero
Essential Processes in SOM
2. Co-operation:
– Short range excitation
– Winning neuron determines the topological
neighbourhood of excitation for other output
neurons
– Neuron that wins , excites the neighbouring neurons
surrounding it
– Excitation is co-operative and always follows
competition
– In other words, the winning neuron determines the
spatial location of topological neighbourhood of
excited neurons
Essential Processes in SOM
3. Synaptic Adaptation:
– Enables the excited neurons to increase their
individual values of discriminant function in
relation to the input pattern
– This adaptation is only for excited neurons
– When similar input patterns are fed repeatedly,
the winning neuron's response increases each
time
Competition
• Consider an m-dimensional input x = [x₁, x₂, …, x_m]ᵀ
• Then each output neuron j has a weight vector w_j = [w_j1, w_j2, …, w_jm]ᵀ, j = 1, 2, …, l (l = total
number of output neurons)
• Find the best match between x and w_j
• There will be competition between neurons and whichever w_j
has the best match with x, that j-th neuron will
emerge as the winner
• Some applications require only the index of the winning
neuron while some may require the actual winning weight vector
• We compute xᵀw_j for j = 1, 2, …, l and select the largest
amongst these
• Maximizing xᵀw_j -> minimizing ‖x − w_j‖, the Euclidean
distance
• Using index i(x) = arg min_j ‖x − w_j‖, where i(x) is the index of the winning neuron based on input vector x
• Then the topological neighbourhood centred on the winning neuron is
  h_j,i(x) = exp( −d²_j,i / 2σ² )
  where d_j,i is the lateral distance between excited neuron j and winning neuron i
• This function is independent of the position of the winning
neuron, hence it is translation invariant
• In case of a 1-D lattice, d_j,i = |j − i|
• For a 2-D lattice, d²_j,i = ‖r_j − r_i‖²
where r_j : position vector of excited neuron j
r_i : position vector of winning neuron i
• Another important feature of SOM is that σ is not constant
with iterations
• As iterations increase σ decreases, meaning the neighbourhood
shrinks with iterations:
  σ(n) = σ₀ exp( −n / τ₁ ) , n = 0, 1, 2, …
• h_j,i(x)(n) = exp( −d²_j,i / 2σ²(n) )
is called the neighbourhood function
• Significance of co-operation:
– Initially a large number of neurons participate in the co-operative
process because 𝜎 is large
– We update the winning neurons and its neighbours
– If say X₁ is fed to the network and output neuron 𝑖 is the winner; next, if
X₂ is fed to the network, neuron 𝑗 is the new winner
– If X₁ and X₂ are close then neuron 𝑖 also participates in weight
updation
– If we feed a large number of vectors close to X₁, i.e., a cluster of
patterns, the winners are different but every output neuron in the
neighbourhood gets its share of weight update
– Ultimately the topology of the network will be adjusted to the clusters
Alternative neighbourhood Functions
Weight Adaptation
• This is the learning phase
• We can use Hebbian learning, but it is severely limited by
weight saturation if presented with the same input pattern
repeatedly
• Hence we modify it by introducing a forgetting term g(y_j) w_j with g(y_j) = η y_j , so that
  Δw_j = η y_j x − g(y_j) w_j = η y_j ( x − w_j )
• We use y_j = h_j,i(x) , giving the update
  w_j(n+1) = w_j(n) + η(n) h_j,i(x)(n) [ x − w_j(n) ]
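The three SOM processes (competition, co-operation, adaptation) can be sketched in a short program. This is a minimal illustration on a 1-D lattice, not the lecture's code: the decay constants (eta0, sigma0, tau) and the toy data are arbitrary choices for demonstration.

```python
import math
import random

def winner(x, W):
    """Competition: index of the weight vector closest to x (Euclidean distance)."""
    return min(range(len(W)),
               key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, W[j])))

def neighbourhood(j, i, sigma):
    """Co-operation: Gaussian neighbourhood h = exp(-d^2 / 2 sigma^2) on a 1-D lattice."""
    return math.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))

def som_train(X, n_units, n_iter=500, eta0=0.5, sigma0=2.0, tau=200.0, seed=0):
    """Synaptic adaptation: w_j <- w_j + eta(n) * h_{j,i(x)}(n) * (x - w_j)."""
    rng = random.Random(seed)
    dim = len(X[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_units)]
    for n in range(n_iter):
        x = rng.choice(X)
        i = winner(x, W)                     # competition
        eta = eta0 * math.exp(-n / tau)      # learning rate decays with iterations
        sigma = sigma0 * math.exp(-n / tau)  # neighbourhood shrinks with iterations
        for j in range(n_units):
            h = neighbourhood(j, i, sigma)   # co-operation
            W[j] = [w + eta * h * (xi - w) for w, xi in zip(W[j], x)]
    return W
```

After training on two well-separated clusters, different lattice units win for the two cluster centres, illustrating the topology-preserving mapping.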
GROUND TRUTH
• The classification of each sample is compared against its ground-truth label (NORMAL or ABNORMAL) to obtain the confusion-matrix counts TP, TN, FP and FN
Classifier Performance Metrics
1. Sensitivity (Se) is the fraction of abnormal ECG beats that are correctly detected
among all the abnormal ECG beats
   Sensitivity(Se) = TP / (TP + FN)
2. Specificity (Spe) is the fraction of normal ECG beats that are correctly classified
among all the normal ECG beats
   Specificity(Spe) = TN / (TN + FP)
3. Positive predictivity (PP) is the fraction of real abnormal ECG beats among all detected
abnormal beats
   Positive Predictivity(PP) = TP / (TP + FP)
4. False Positive Rate (FPR) is the fraction of all normal ECG beats that are not rejected
   False Prediction Rate(FPR) = FP / (TN + FP) = 1 − Spe
Classifier Performance Metrics continued…
5. Classification rate (CR) is the fraction of all correctly classified ECG beats, whether
normal or abnormal, among all the ECG beats
   Classification Rate(CR) = (TN + TP) / (TN + FP + FN + TP)
6. Mean squared error (MSE) is a measure used only as the stopping criterion while
training the ANN
7. Percentage average accuracy is the total accuracy of the classifier
   Percentage Average Accuracy = [ TP(NORMAL)/TOTAL(NORMAL) + TP(ABNORMAL)/TOTAL(ABNORMAL) ] / 2 × 100
8. Training Time is the CPU time required for training an ANN described in terms of time
per epoch per total exemplars in seconds
9. Pre-processing time is the CPU time required for generating the transform part of the
feature vector in seconds
10. Resources consumed for the ANN topology is the sum of weights and biases for the
first layer and the second layer, also called the adjustable or free parameters of the
network
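The count-based metrics above can be computed directly from the confusion matrix. A minimal sketch (the function name and dictionary keys are my own choices, not from the slides):

```python
def classifier_metrics(tp, tn, fp, fn):
    """Compute the slide's performance metrics from confusion-matrix counts."""
    se  = tp / (tp + fn)                   # Sensitivity: abnormal beats correctly detected
    spe = tn / (tn + fp)                   # Specificity: normal beats correctly classified
    pp  = tp / (tp + fp)                   # Positive predictivity
    fpr = fp / (tn + fp)                   # False positive rate = 1 - Spe
    cr  = (tn + tp) / (tn + tp + fn + fp)  # Classification rate
    return {"Se": se, "Spe": spe, "PP": pp, "FPR": fpr, "CR": cr}
```

For example, with TP=45, TN=40, FP=10, FN=5 this gives Se=0.9, Spe=0.8 and CR=0.85.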
Receiver Operating Characteristics (ROC)
Percent Root-mean-square Difference (PRD)

  PRD = √[ Σᵢ ( x_original(i) − x_reconstructed(i) )² / Σᵢ x_original(i)² ] × 100

where the sums run over the n samples of the signal, i = 0, …, n.
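The PRD formula translates directly into code; this is a minimal sketch with a function name of my own choosing:

```python
import math

def prd(original, reconstructed):
    """Percent root-mean-square difference between a signal and its reconstruction."""
    num = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    den = sum(o ** 2 for o in original)
    return math.sqrt(num / den) * 100.0
```

A perfect reconstruction gives PRD = 0; a reconstruction of all zeros gives PRD = 100.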
Discrete Cosine Transform
[Plot: "Signal Energy v/s # of DCT coefficients" — % Signal Energy (0.00%–100.00%) against the number of DCT coefficients (1–176)]
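The energy-compaction property behind that plot can be reproduced with a direct (orthonormal) DCT-II. This is a naive O(N²) sketch for illustration; in practice one would use an optimized library routine.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a sequence x (Parseval: energy is preserved)."""
    N = len(x)
    X = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        X.append(scale * s)
    return X

def energy_fraction(x, m):
    """Fraction of total signal energy captured by the first m DCT coefficients."""
    X = dct2(x)
    total = sum(c * c for c in X)
    return sum(c * c for c in X[:m]) / total
```

For a smooth signal (e.g. a ramp), the first few coefficients capture nearly all the energy, which is why the curve in the plot saturates so quickly.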
Reconstruction Errors
[Plot: effect of truncating DCT coefficients — reconstructions from 5, 15 and 30 coefficients overlaid on the original signal; amplitude in mV (850–1250) over samples 1–179]
Dataset Creation
• It is always desirable that we have equal number
of exemplars from each class for the training
dataset
• This prevents “favoring” any class during
training
• If the number of exemplars is unequal we have
to de-skew the classifier decision
• De-skewing simply scales the output according
to the probability of the input classes
• Data randomization before training is a must,
otherwise repetitive training with the same class may
not allow the network to converge; remember
that the error gradient is important for achieving
the global minimum, if it exists!
• Partition the data into THREE disjoint sets
• Training set
• Cross-validation set
• Testing set
• Before the data is presented to the net for training we
have to normalize the data to the range [−1, 1]; this helps in
faster network learning
• Amplitude and Offset are given as:
Amp (i )
UpperBound LowerBound
max(i ) min(i )
Off (i ) UpperBound Amp(i ) max(i )
Data (i ) Off (i )
To de-normalize data: Data (i )
Amp (i )
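The amplitude/offset normalization described above can be sketched as follows (function names are my own; the formulas match the min-max scaling into [LowerBound, UpperBound]):

```python
def norm_params(data, lower=-1.0, upper=1.0):
    """Amplitude and offset that map [min(data), max(data)] onto [lower, upper]."""
    amp = (upper - lower) / (max(data) - min(data))
    off = upper - amp * max(data)
    return amp, off

def normalize(data, amp, off):
    """Apply Data_norm = Amp * Data + Off."""
    return [amp * d + off for d in data]

def denormalize(norm, amp, off):
    """Invert the mapping: Data = (Data_norm - Off) / Amp."""
    return [(v - off) / amp for v in norm]
```

For data [2, 4, 6] this maps the minimum to −1, the midpoint to 0 and the maximum to +1, and de-normalization recovers the original values.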
Methodology
Extraction of equal-length signals → Transform (DCT) → Feature-extracted data → Classifier → NORMAL / ABNORMAL

1. START
2. TRAINING: use backpfun_logsig for training
3. CROSS-VALIDATION
4. TESTING: use testbackfun for testing
5. COMPUTE confusion matrix, Se, Spe, Pp, FPR, CR
6. STOP
MLP Classifier MATLAB DEMO
Case Study : Two-Class QRS
classification with RBFN
Problem Statement:
Design a system to correctly classify extracted QRS
complexes into TWO classes: NORMAL and ABNORMAL
using RBFN
Classes: NORMAL, ABNORMAL

1. TRAINING: use rbfn_fun for training
2. COMPUTE confusion matrix, Se, Spe, Pp, FPR, CR
3. STOP
RBFN Classifier MATLAB DEMO
Thank You!
LVQ, Hopfield Net and BAM
Prof.Dr.R.B.Ghongade
Learning Vector Quantization
• Learning vector quantization (LVQ), proposed by Kohonen, is a pattern
classification method in which each output unit represents a particular class or
category
• The weight vector for an output unit is often referred to as a reference (or
codebook) vector for the class that the unit represents
• During training, the output units are positioned (by adjusting their weights
through supervised training) to approximate the decision surfaces
• It is assumed that a set of training patterns with known classifications is provided,
along with an initial distribution of reference vectors (each of which represents a
known classification).
• After training, an LVQ net classifies an input vector by assigning it to the
same class as the output unit that has its weight vector (reference vector) closest
to the input vector
Architecture
Algorithm
STEP 0 Initialize reference vectors;
Initialize learning rate α(0).

Discrete Hopfield Network
• Setting the weights (Hebb rule, bipolar patterns):
  W = [w_ij], where w_ij = Σ_p s_i(p) s_j(p) for i ≠ j, and w_ii = 0
• Application algorithm:
While activations of the net are not converged, do Steps 1-7.
STEP 1 For each input vector x, do Steps 2-6.
STEP 2 Set initial activations of net equal to the external input vector x:
  y_i = x_i , i = 1, …, n
• Energy function:
  E = −0.5 Σ_i Σ_j y_i y_j w_ij − Σ_i x_i y_i + Σ_i θ_i y_i
Analysis
• Storage Capacity:
• Hopfield found experimentally that the number of binary patterns that can be
stored and recalled in a net with reasonable accuracy, is given approximately
by:
  P ≅ 0.15 n
where n is the number of neurons in the net.
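The Hebb-rule weight setting and asynchronous recall can be sketched as below; this is a toy illustration (a single stored pattern, well under the ≈0.15n capacity limit), not the lecture's code.

```python
def hopfield_weights(patterns):
    """Hebb rule: w_ij = sum_p s_i(p) s_j(p), with w_ii = 0 (bipolar patterns)."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += s[i] * s[j]
    return W

def hopfield_recall(W, x, max_sweeps=10):
    """Asynchronous updates (net input = external input + weighted activations)
    repeated until the state stops changing."""
    y = list(x)
    n = len(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = x[i] + sum(y[j] * W[j][i] for j in range(n))
            new = 1 if net > 0 else (-1 if net < 0 else y[i])
            if new != y[i]:
                y[i], changed = new, True
        if not changed:
            break
    return y
```

Presenting a stored pattern with one flipped bit, the net settles back to the stored pattern.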
Bi-directional Associative Memory (BAM)
• A bidirectional associative memory [Kosko, 1988] stores a set of pattern
associations by summing bipolar correlation matrices (an n × m outer product
matrix for each pattern to be stored)
• The architecture of the net consists of two layers of neurons, connected by
directional weighted connection paths
• The net iterates, sending signals back and forth between the two layers
until all neurons reach equilibrium (i.e., until each neuron’s activation
remains constant for several steps)
• Bidirectional associative memory neural nets can respond to input to either
layer.
Architecture
Algorithm for Discrete BAM
• Setting the weights: The weight matrix to store a set of input and target vectors s(p):t(p), p = 1, …, P,
where s(p) = ( s₁(p), …, s_i(p), …, s_n(p) )
and t(p) = ( t₁(p), …, t_j(p), …, t_m(p) ),
can be determined by the Hebb rule (outer products) as:
  W = [w_ij], where w_ij = Σ_p s_i(p) t_j(p)
• Activation functions:
For bipolar input vectors, the activation function for the Y-layer is:
  y_j = 1 if y_in_j > θ_j ; y_j if y_in_j = θ_j ; −1 if y_in_j < θ_j
and the activation function for the X-layer is:
  x_i = 1 if x_in_i > θ_i ; x_i if x_in_i = θ_i ; −1 if x_in_i < θ_i
Algorithm for Discrete BAM (bipolar)
STEP 0 Initialize the weights to store a set of P vectors;
initialize all activations to 0.
STEP 1 For each testing input, do Steps 2-6.
STEP 2a Present input pattern x to the X-layer (i.e., set activations of the X-layer to the current input pattern).
STEP 2b Present input pattern y to the Y-layer. (Either of the input patterns may be the zero vector.)
STEP 3 While activations are not converged, do Steps 4-6.
STEP 4 Update activations of units in Y-layer.
Compute net inputs:
  y_in_j = Σ_i x_i w_ij
Compute activations:
  y_j = f( y_in_j )
Send signal to X-layer.
STEP 5 Update activations of units in X-layer.
Compute net inputs:
  x_in_i = Σ_j y_j w_ij
Compute activations:
  x_i = f( x_in_i )
Send signal to Y-layer.
STEP 6 Test for convergence:
If the activation vectors x and y have reached equilibrium, then stop;
otherwise, continue.
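The BAM algorithm above can be sketched as a short program; this is a minimal illustration with names of my own choosing (thresholds θ taken as 0).

```python
def bam_weights(pairs):
    """W[i][j] = sum_p s_i(p) * t_j(p): sum of bipolar outer-product matrices."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * m for _ in range(n)]
    for s, t in pairs:
        for i in range(n):
            for j in range(m):
                W[i][j] += s[i] * t[j]
    return W

def sign_keep(net, prev):
    """Bipolar activation: +1 / -1 on strict sign, previous value on a tie."""
    return 1 if net > 0 else (-1 if net < 0 else prev)

def bam_recall(W, x, steps=5):
    """Bounce signals back and forth between the layers; returns (x, y)."""
    n, m = len(W), len(W[0])
    y = [0] * m  # the Y-layer input may be the zero vector
    for _ in range(steps):
        y = [sign_keep(sum(x[i] * W[i][j] for i in range(n)), y[j]) for j in range(m)]
        x = [sign_keep(sum(y[j] * W[i][j] for j in range(m)), x[i]) for i in range(n)]
    return x, y
```

Presenting a stored input pattern to the X-layer retrieves its associated target on the Y-layer.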
Introduction to Deep
Learning
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield of learning representations of data.
Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of)
representation by using a hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it
and respond in useful ways.
Why is DL useful?
o Manually designed features are often over-specified, incomplete and
take a long time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable
framework for representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
Activation functions
How do we train?
learning rate
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
Activation: Sigmoid
Takes a real-valued number and
“squashes” it into range between
0 and 1.
http://adilmoujahid.com/images/activation.png
- Sigmoid neurons saturate and kill gradients, thus the NN will barely learn
• when the neuron's activations are 0 or 1 (saturated)
⦁ the gradient at these regions is almost zero
⦁ almost no signal will flow to its weights
⦁ if the initial weights are too large then most neurons will saturate
Activation: Tanh
Takes a real-valued number and
“squashes” it into range between
-1 and 1.
http://adilmoujahid.com/images/activation.png
Activation: ReLU
Takes a real-valued number and
thresholds it at zero
http://adilmoujahid.com/images/activation.png
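The three activations and the sigmoid's saturation problem can be seen in a few lines; this is a minimal sketch, not library code:

```python
import math

def sigmoid(x):
    """Squashes a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes a real number into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Thresholds a real number at zero."""
    return max(0.0, x)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)).
    Nearly zero when |x| is large -- the saturation that kills gradients."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The gradient peaks at 0.25 for x = 0 and is vanishingly small for x = 10, which is why a saturated sigmoid passes almost no learning signal to its weights.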
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their
connections) during training
• Each unit retained with fixed probability
p, independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. Journal of machine learning research (2014)
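A dropout layer boils down to a random mask. The sketch below uses the common "inverted dropout" convention (surviving activations scaled by 1/p so no rescaling is needed at test time), which is an implementation choice, not something stated on the slide:

```python
import random

def dropout(activations, p, rng, train=True):
    """Inverted dropout: each unit is retained with probability p, independently
    of the others; survivors are scaled by 1/p so the expected activation is
    unchanged, and at test time (train=False) the layer is an identity."""
    if not train:
        return list(activations)
    return [a / p if rng.random() < p else 0.0 for a in activations]
```

With p = 0.8, roughly 80% of units survive each forward pass and are scaled to 1/0.8 = 1.25 of their value.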
L2 = weight decay
• Regularization term that penalizes big weights,
added to the objective
• Weight decay value determines how dominant
regularization is during gradient computation
• Big weight decay coefficient → big penalty for big weights
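In gradient computation, the L2 term simply adds wd·w to each weight's gradient; a minimal sketch (names and values are my own):

```python
def sgd_step_weight_decay(w, grad, lr, wd):
    """One SGD step on loss + (wd/2) * ||w||^2: the extra wd * w_i term in the
    gradient shrinks ("decays") every weight toward zero each step."""
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]
```

Even with a zero loss gradient, every step multiplies each weight by (1 − lr·wd), so big weights are penalized most.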
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
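The patience rule can be sketched as a small function over the per-epoch validation errors (function name and return convention are my own choices):

```python
def early_stop_epoch(val_errors, patience):
    """Return the epoch at which training stops: the first epoch at which the
    best validation error has not improved for `patience` epochs, or the last
    epoch if that never happens."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch        # improvement: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                          # patience exhausted: stop
    return len(val_errors) - 1
```

For errors [1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.73] with patience 3, the best error occurs at epoch 2 and training stops at epoch 5.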
Tuning hyper-parameters
f(x, y) = g(x) + h(y) ≈ g(x)
"Grid and random search of 9 trials for optimizing the function f(x, y) = g(x) + h(y) ≈ g(x).
With grid search, nine trials only test g(x) in three distinct places.
With random search, all nine trials explore distinct values of g."
Make smarter choice for the next trial, minimize the number of trials
1. Collect the performance at several configurations
2. Make inference and decide what configuration to try next
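The grid-vs-random argument can be demonstrated with a toy objective. Here g, h and f are hypothetical stand-ins (an "important" and an "unimportant" hyper-parameter), chosen only for illustration:

```python
import math
import random

def g(x):          # hypothetical "important" dimension, peaked at x = 0.3
    return -(x - 0.3) ** 2

def h(y):          # hypothetical "unimportant" dimension, tiny contribution
    return 0.01 * math.sin(y)

def f(x, y):       # f(x, y) = g(x) + h(y) is approximately g(x)
    return g(x) + h(y)

def random_search(n_trials, rng):
    """Each trial draws both hyper-parameters at random, so all n trials
    explore n distinct values of the important dimension x."""
    return max(((rng.random(), rng.random()) for _ in range(n_trials)),
               key=lambda p: f(*p))

def grid_search(grid_vals):
    """A 3x3 grid only ever tests len(grid_vals) distinct values of x."""
    return max(((x, y) for x in grid_vals for y in grid_vals),
               key=lambda p: f(*p))
```

With nine trials each, the 3×3 grid can get no closer to the optimum x = 0.3 than its nearest grid line, while random search typically lands much closer.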
Loss functions and output
Classification Regression
f(x)=x
Convolutional
Input matrix 3x3 filter
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
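Sliding a 3x3 filter over an input matrix can be sketched in a few lines. This is the "valid" (no padding) case; as is common in CNN libraries, the code computes cross-correlation (the filter is not flipped):

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the input matrix and sum
    the element-wise products at each position (no padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out
```

A 4x4 input with a 3x3 filter yields a 2x2 feature map; a [-1, 0, 1] column filter responds strongly at vertical edges.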
Convolutional Neural
Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
max pool
2x2 filters
and stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
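The 2x2/stride-2 max pooling from the figure can be sketched directly (function name is my own):

```python
def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep only the largest value in each size x size window,
    moving the window by `stride` rows/columns."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, w - size + 1, stride)]
            for r in range(0, h - size + 1, stride)]
```

A 4x4 map is reduced to 2x2, each output holding the maximum of its window.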
CNN for text classification
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for
Twitter Sentiment Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
⦁ Stack them up
https://discuss.pytorch.org/uploads/default/original/1X/6415da0424dd66f2f5b134709b92baa59e604c55.jpg
Bidirectional RNNs
Main idea: incorporate both left and right context
the output may not only depend on the previous elements in the sequence,
but also on future elements.
http://www.wildml.com/2015/10/recurrent-neural-network-
tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-
theano/
Units with short-term dependencies often have reset gates very active
Units with long-term dependencies have active update gates z
Gated Recurrent Units
(GRUs)
Main idea:
keep around memory to capture long dependencies
Allow error messages to flow at different strengths depending on the inputs
http://www.wildml.com/2015/10/recurrent-neural-network-
tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-
theano/
• Generator: Model that is used to generate new plausible examples from the problem
domain.
• Discriminator: Model that is used to classify examples as real (from the domain) or
fake (generated).
The Generator Model
• The generator model takes a fixed-length random vector as input and
generates a sample in the domain.
• The vector is drawn randomly from a Gaussian distribution, and is
used to seed the generative process
• After training, points in this multidimensional vector space will correspond
to points in the problem domain, forming a compressed representation of
the data distribution.
• This vector space is referred to as a latent space, or a vector space
composed of latent variables
• Latent variables, or hidden variables, are those variables that are important
for a domain but are not directly observable.
The Generator Model continued…
• We often refer to latent variables, or a latent space,
as a projection or compression of a data
distribution
– That is, a latent space provides a compression of, or
high-level concepts for, the observed raw data, such
as the input data distribution
• In the case of GANs, the generator model applies
meaning to points in a chosen latent space, such
that new points drawn from the latent space can be
provided to the generator model as input and used
to generate new and different output examples.
• After training, the generator model is kept and used
to generate new samples.
The Discriminator Model
• The discriminator model takes an example from the
domain as input (real or generated) and predicts a binary
class label of real or fake (generated).
• The real example comes from the training dataset
• The generated examples are output by the generator
model.
• The discriminator is a normal (and well understood)
classification model.
• After the training process, the discriminator model is
discarded as we are interested in the generator.
• Sometimes, the generator can be repurposed as it has
learned to effectively extract features from examples in
the problem domain
• Some or all of the feature extraction layers can be used in
transfer learning applications using the same or similar
input data.
GANs as a Two Player Game
• Generative modeling is an unsupervised learning problem, although a
clever property of the GAN architecture is that the training of the
generative model is framed as a supervised learning problem.
• The two models, the generator and discriminator, are trained together
• The generator generates a batch of samples, and these, along with real
examples from the domain, are provided to the discriminator and classified
as real or fake.
• The discriminator is then updated to get better at discriminating real and
fake samples in the next round, and importantly, the generator is updated
based on how well, or not, the generated samples fooled the discriminator.
• In this way, the two models are competing against each other, they are
adversarial in the game theory sense, and are playing a zero-sum game.
• In this case, zero-sum means that when the discriminator successfully identifies
real and fake samples, it is rewarded or no change is needed to the model
parameters, whereas the generator is penalized with large updates to model
parameters.
• At a limit, the generator generates perfect replicas from the input domain every
time, and the discriminator cannot tell the difference and predicts “unsure” (e.g.
50% for real and fake) in every case. This is just an example of an idealized case;
we do not need to get to this point to arrive at a useful generator model.
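The alternating, supervised framing of the two-player game can be shown as a training-loop skeleton. The models here are stubs that only record what they are asked to learn; `rng.gauss` stands in for a real generator network G(z):

```python
import random

class StubModel:
    """Stand-in for a real network; records the supervised batches it receives."""
    def __init__(self):
        self.updates = []

    def update(self, batch, labels):
        self.updates.append((len(batch), list(labels)))

def gan_training_round(generator, discriminator, real_batch, rng):
    """One round of the two-player game, framed as two supervised problems."""
    # The generator produces a batch of fakes from random latent vectors
    # (rng.gauss stands in for G(z) here).
    fake_batch = [rng.gauss(0.0, 1.0) for _ in real_batch]
    # The discriminator is updated to separate real (label 1) from fake (label 0).
    discriminator.update(real_batch + fake_batch,
                         [1] * len(real_batch) + [0] * len(fake_batch))
    # The generator is updated based on how well its fakes fooled the
    # discriminator: its samples are presented as if they were real.
    generator.update(fake_batch, [1] * len(fake_batch))
    return fake_batch
```

Each round, the discriminator sees a mixed labelled batch while the generator is trained only on its own samples labelled "real" — the adversarial pressure of the zero-sum game.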
Example of the Generative Adversarial
Network Model Architecture