NEURAL NETWORKS (PDFDrive) PDF

NEURAL NETWORKS
Vedat Tavşanoğlu
What Is a Neural Network?
 Work on artificial neural networks,

commonly referred to as "neural networks,"
has been motivated right from its inception
by the recognition that the brain computes
in an entirely different way from the
conventional digital computer.
 The struggle to understand the brain owes much
to the pioneering work of Ramon y Cajal (1911),
who introduced the idea of neurons as structural
constituents of the brain.
 Typically, neurons are five to six orders of

magnitude slower than silicon logic gates; events
in a silicon chip happen in the nanosecond (10-9 s)
range, whereas neural events happen in the
millisecond (10-3 s) range.
 However, the brain makes up for the

relatively slow rate of operation of a neuron
by having a truly staggering number of
neurons (nerve cells) with massive
interconnections between them.
 It is estimated that there must be on the order of
10 billion neurons in the human cortex, and 60
trillion synapses or connections (Shepherd and
Koch, 1990). The net result is that the brain is an
enormously efficient structure. Specifically, the
energetic efficiency of the brain is approximately
10-16 joules (J) per operation per second.
 The corresponding value for the best computers

in use today is about 10-6 joules per operation per
second (Faggin, 1991).
 The brain is a highly complex, nonlinear,
and parallel computer (information-
processing system). It has the capability of
organizing neurons so as to perform certain
computations (e.g., pattern recognition,
perception, and motor control) many times
faster than the fastest digital computer in
existence today.
 Consider, for example, human vision, which is an
information-processing task (Churchland and
Sejnowski, 1992; Levine, 1985; Marr, 1982).
 It is the function of the visual system to provide a

representation of the environment around us and,
more important, to supply the information we
need to interact with the environment.
 The brain routinely accomplishes perceptual
recognition tasks (e.g., recognizing a
familiar face embedded in an unfamiliar
scene) in something of the order of 100-200
ms, whereas tasks of much lesser complexity
will take hours on conventional computers.
 For another example, consider the sonar of a bat.
Sonar is an active echo-location system.
 In addition to providing information about how far

away a target (e.g., a flying insect) is, a bat sonar
conveys information about the relative velocity of
the target, the size of the target, the size of various
features of the target, and the azimuth and
elevation of the target (Suga, 1990a, b).
 The complex neural computations needed to
extract all this information from the target
echo occur within a brain the size of a plum.
Indeed, an echo-locating bat can pursue and
capture its target with a facility and success
rate that would be the envy of a radar or
sonar engineer.
 How, then, does a human brain or the brain
of a bat do it?
 At birth, a brain has great structure and the

ability to build up its own rules through
what we usually refer to as "experience."
 Indeed, experience is built up over the years,
with the most dramatic development (i.e.,'·
hard-wiring) of the human brain taking place
in the first two years from birth; but the
development continues well beyond that stage.
 During this early stage of development, about 1

million synapses are formed per second.
 Synapses are elementary structural and

functional units that mediate the
interactions between neurons. The most
common kind of synapse is a chemical
synapse, which operates as follows:
 A presynaptic process liberates a transmitter
substance that diffuses across the synaptic
junction between neurons and then acts on a
postsynaptic process.
 Thus a synapse converts a presynaptic

electrical signal into a chemical signal and
then back into a postsynaptic electrical
signal (Shepherd and Koch, 1990).
 In electrical terminology, such an element is
said to be a nonreciprocal two-port device.
 In traditional descriptions of neural

organization, it is assumed that a synapse is
a simple connection that can impose
excitation or inhibition, but not both on the
receptive neuron.
 A developing neuron is synonymous with a plastic
brain: Plasticity ([Latin plasticus, from Greek
plastikos, from plastos, molded, from plassein, to
mold; see pelə-2 in Indo-European roots.] )permits
the developing nervous system to adapt to its
surrounding environment (Churchland and
Sejnowski, 1992; Eggermont, 1990). In an adult
brain, plasticity may be accounted for by two
mechanisms: the creation of new synaptic
connections between neurons, and the
modification of existing synapses.
 Axons, the transmission lines, and dendrites, the

receptive zones, constitute two types of cell
filaments that are distinguished on morphological
grounds; an axon has a smoother surface, fewer
branches, and greater length, whereas a dendrite
(so called because of its resemblance to a tree)
has an irregular surface and more branches
(Freeman, 1975).
 Neurons come in a wide variety of shapes
and sizes in different parts of the brain. The
figure illustrates the shape of a pyramidal
cell, which is one of the most common types
of cortical neurons.
 Like many other types of neurons, it receives
most of its inputs through dendritic spines.
 The pyramidal cell can receive 10,000 or
more synaptic contacts and it can project
onto thousands of target cells.
 Just as plasticity appears to be essential to the

functioning of neurons as informationprocessing
units in the human brain, so it is with neural
networks made up of artificial neurons.
 In its most general form, a neural network is

a machine that is designed to model the way
in which the brain performs a particular task
or function of interest; the network is usually
implemented using electronic components
or simulated in software on a digital
computer.
 In most cases the interest is confined largely

to an important class of neural networks that
perform useful computations through a
process of learning.
 To achieve good performance, neural

networks employ a massive interconnection
of simple computing cells referred to as
"neurons" or "processing units." We may
thus offer the following definition of a neural
network viewed as an adaptive machine:
 A neural network is a massively parallel distributed
processor that has a natural propensity for storing
experiential knowledge and making it available for use.
It resembles the brain in two respects:
1 Knowledge is acquired by the network

through a learning process .
2 Interneuron connection strengths known as

synaptic weights are used to store the knowledge.
 The procedure used to perform the learning

process is called a learning algorithm,
 the function of which is to modify the synaptic
weights of the network
 in an orderly fashion so as to attain a desired
design objective.
 The modification of synaptic weights provides the
traditional method for the design of neural
networks. Such an approach is the closest to
linear adaptive filter theory, which is already well
established and successfully applied in such
diverse fields as communications, control, radar,
sonar, seismology, and biomedical engineering
(Haykin, 1991; Widrow and Stearns, 1985).
 However, it is also possible for a neural

network to modify its own topology, which
is motivated by the fact that neurons in the
human brain can die and that new synaptic
connections can grow.
 Neural networks are also referred to in the

literature as neurocomputers, connectionist
networks, parallel distributed processors,
etc.
Benefits of Neural Networks
 From the above discussion, it is apparent that a
neural network derives its computing power
through:
1. its massively parallel distributed structure,
1. its ability to learn and therefore generalize;

generalization refers to the neural network
producing reasonable outputs for inputs not
encountered during training (learning).
How does the following example help you to generalize ?
confer: L. conferre- con-, together, ferre, to bring
v.t. to give, to bestow (to place or put by), to talk or consult together
defer: L. differre- dis-, asunder (adv. apart, into parts, separately), ferre, to bear , to carry
v.t. to put off to another time, to delay
defer: L. deferre- de-, down, ferre, to bear

v.i. to yield (to the wishes or opinions of another, or to authority), v.t. to submit or
to or to lay before somebody
differ: L. differre- dif.( for dis-), apart, ferre, to bear

v.i. to be unlike, distinct or various
infer: L. inferre- in-, into, ferre, to bring

v.t.. to bring on, to drive as a conclusion
prefer: L. preaferre- prea-,in front of, ferre, to bear

v.t. to set in front, to put forward, offer, submit, present, for acceptance or consideration,
to promote
convene L. convenire, con- together, and venire, to come
v.i. to come together, v.i. to call together
convent v.t. to convene
convention the act of convening, :an assembly, esp. of
special delegates for some common object, an agreement
(Geneva Convention)
invent L. invenire, inventum, in-, upon, venire, to come
v.t. to find, to device or contrive
prevent L. preavenire, prea- in front of, venire, to come
v.t. to precede, to be, go, act, earlier than, to preclude, to stop,
keep, or hinder effectually, to keep from coming to pass
 Synonym:1432 (but rare before 18c.), from L.
synonymum, from Gk. synonymon "word having the
same sense as another," noun use of neut. of
synonymos "having the same name as,
synonymous," from syn- "together, same" +
onyma, Aeolic dialectal form of onoma "name"
(see name). Synonymous is attested from 1610.
 Antonym:1870, created to serve as opposite of
synonym, from Gk. anti- "equal to, instead of,
opposite" (see anti-) + -onym "name" (see name).
 Anonymous:1601, from Gk. anonymos "without a

name," from an- "without" + onyma, Æolic
dialectal form of onoma "name" (see name).
 These two information-processing
capabilities,i.e.,
(1) massively parallel distributed structure
(2) the ability to generalize
make it possible for neural networks to solve
complex (large-scale) problems that are
currently intractable. In practice, however,
neural networks cannot provide the solution
working by themselves alone. Rather, they need
to be integrated into a consistent system
engineering approach.
 Specifically, a complex problem of interest is
decomposed into a number of relatively simple tasks,
and neural networks are assigned a subset of the tasks
e.g.,
1. pattern recognition,
2. associative memory,
3. control,etc.
 that match their inherent capabilities. It is important to
recognize, however, that we have a long way to go (if
ever) before we can build a computer architecture that
mimics the human brain.
Properties and Capabilities of Neural
Networks
 1. Nonlinearity
A neuron is basically a nonlinear device.
Consequently, a neural network, made up of an
interconnection of neurons, is itself nonlinear.
Moreover, the nonlinearity is of a special kind in
the sense that it is distributed throughout the
network. Nonlinearity is a highly important
property, particularly if the underlying physical
mechanism responsible for the generation of an
input signal (e.g., speech signal) is inherently
nonlinear.
Networks
 2. Input-Output Mapping
A popular paradigm of learning called
supervised learning involves the modification
of the synaptic weights of a neural network by
applying a set of labeled training samples or
task examples. Each example consists of a
unique input signal and the corresponding
desired response.
Networks
 The network is presented an example picked at

random from the set,
and
 the synaptic weights(free parameters) of the
network are modified so as to minimize the
difference between the desired response and the
actual response of the network
Networks
 The training of the network is repeated for many

examples in the set until the network reaches a
steady state, where there are no further significant
changes in the synaptic weights;
 The previously applied training examples may be

reapplied during the training session but in a
different order.
Properties and Capabilities of
Neural Networks
Thus the network learns from the examples by

constructing an input-output mapping for the
problem at hand.
Networks
 Such an approach brings to mind the study of
nonparametric statistical inference which is a branch of
statistics dealing with model-free estimation, or, from a
biological viewpoint, tabula rasa learning (Geman et al.,
1992).
 (tabula rasa: a smoothed or blank tablet, a mind not yet

influenced by outside impressions and experiences)
([Medieval Latin tabula rāsa : Latin tabula, tablet + Latin
rāsa, feminine of rāsus, erased.]
Neural Networks
 Consider, for example, a pattern classification task,
where the requirement is to assign an input signal
representing a physical object or event to one of several
prespecified categories (classes).
 In a nonparametric approach to this problem, the

requirement is to "estimate" arbitrary decision
boundaries in the input signal space for the pattern-
classification task using a set of examples, and to do so
without invoking a probabilistic distribution model.
Networks
 A similar point of view is implicit in the

supervised learning paradigm, which suggests
a close analogy between the input-output
mapping performed by a neural network and
nonparametric statistical inference.
 paradigm:1.Grammar. a.a set of forms all of which
contain a particular element, esp. the set of all
inflected forms based on a single stem or theme.
b.a display in fixed arrangement of such a set, as
boy, boy's, boys, boys'. 2.an example serving as a
model; pattern. [Origin: 1475–85; < LL paradīgma
< Gk parádeigma pattern (verbid of paradeiknýnai
to show side by side), equiv. to para- para-1 + deik-
, base of deiknýnai to show (see deictic) + -ma n.
suffix ]
Networks
 analogy: 1550, from L. analogia, from Gk.
analogia "proportion," from ana- "upon,
according to" + logos "ratio," also "word,
speech, reckoning." A mathematical term
used in a wider sense by Plato.
Networks
 3. Adaptivity.
Neural networks have a built-in capability to
adapt their synaptic weights to changes in the
surrounding environment. In particular, a neural
network trained to operate in a specific
environment can be easily retrained to deal with
minor changes in the operating environmental
conditions.
Networks
 Moreover, when it is operating in a nonstationary
environment (i.e., one whose statistics change
with time), a neural network can be designed to
change its synaptic weights in real time. The
natural architecture of a neural network for
pattern classification, signal processing, and
control applications, coupled with the adaptive
capability of the network, make it an ideal tool for
use in adaptive pattern classification, adaptive
signal processing, and adaptive control.
Neural Networks
 As a general rule, it may be said that the more

adaptive we make a system in a properly
designed fashion, assuming the adaptive system
is stable, the more robust its performance will
likely be when the system is required to operate
in a nonstationary environment.
Neural Networks
 It should be emphasized, however, that
adaptivity does not always lead to
robustness; indeed, it may do the very
opposite. For example, an adaptive system
with short time constants may change
rapidly and therefore tend to respond to
spurious disturbances, causing a drastic
degradation in system performance.
Neural Networks
 To realize the full benefits of adaptivity, the
principal time constants of the system should be
long enough for the system to ignore spurious (L.
spurius, false) disturbances and yet short enough
to respond to meaningful changes in the
environment; the problem described here is
referred to as the stability-plasticity dilema
(Grossberg, 1988). Adaptivity (or “in situ,(L.)in
the original situation” training as it is sometimes
referred to) is an open research topic.
Neural Networks
 4. Evidential Response
In the context of pattern classification, a neural
network can be designed to provide information
not only about which particular pattern to select,
but also about the confidence in the decision
made. This latter information may be used to
reject ambiguous patterns, should they arise, and
thereby improve the classification performance of
the network.
Neural Networks
 5. Contextual Information
(L. contextus,contexere-con-, texere, textum, to weave)
Knowledge is represented by the very structure and
activation state of a neural network.
Every neuron in the network is potentially affected by the

global activity of all other neurons in the network.
Consequently, contextual information is dealt with

naturally by a neural network.
Neural Networks
 6. Fault Tolerance
A neural network, implemented in hardware form, has
the potential to be inherently fault tolerant in the sense
that its performance is degraded gracefully under adverse
operating conditions (Bolt, 1992).
For example, if a neuron or its connecting links are

damaged, recall of a stored pattern is impaired in quality.
However, owing to the distributed nature of information
in the network, the damage has to be extensive before
the overall response of the network is degraded seriously.
Thus, in principle, a neural network exhibits a graceful
degradation in performance rather than catastrophic
failure.
Neural Networks
 7. VLSI Implementability
The massively parallel nature of a neural network makes it
potentially fast for the computation of certain tasks. This same
feature makes a neural network ideally suited for implementation
using very-Iarge-scale-integrated (VLSI) technology.
The particular virtue of VLSI is that it provides a means of

capturing truly complex behavior in a highly hierarchical fashion
(Mead and Conway, 1980), which makes it possible to use a neural
network as a tool for real-time applications involving pattern
recognition, signal processing, and control.
Neural Networks
 8. Uniformity of Analysis and Design. Basically, neural
networks enjoy universality as information processors.
We say this in the sense that the same notation is used in
all the domains involving the application of neural
networks. This feature manifests itself in different ways:
 Neurons, in one form or another, represent an ingredient
common to all neural networks.
 This commonality makes it possible to share theories
and learning algorithms in different applications of
neural networks.
 Modular networks can be built through a seamless
integration of modules.
Networks
 analysis: [Medieval Latin, from Greek analusis, a dissolving, from analūein, to
undo : ana-, throughout; see ana- + lūein, to loosen; see leu- in Indo-European
roots.]
(Download Now or Buy the Book) The American Heritage® Dictionary of the English
Language, Fourth Edition
Copyright © 2006 by Houghton Mifflin Company.
Published by Houghton Mifflin Company. All rights reserved.Online Etymology
Dictionary - Cite This Source - Share This
analysis
1581, "resolution of anything complex into simple elements" (opposite of
synthesis), from M.L. analysis, from Gk. analysis "a breaking up," from analyein
"unloose," from ana- "up, throughout" + lysis "a loosening" (see lose).
Psychological sense is from 1890. Phrase in the final (or last) analysis (1844),
translates Fr. en dernière analyse.
Networks
 Design: 1548, from L. designare "mark out,
devise," from de- "out" + signare "to mark," from
signum "a mark, sign." Originally in Eng. with the
meaning now attached to designate (1646, from L.
designatus, pp. of designare); many modern uses
of design are metaphoric extensions. Designer
(adj.) in the fashion sense of "prestigious" is first
recorded 1966; designer drug is from 1983.
Designing "scheming" is from 1671. Designated
hitter introduced in American League baseball in
1973, soon giving wide figurative extension to
designated.
Neural Networks
 9. Neurobiological Analogy
 The design of a neural network is motivated by
analogy with the brain, which is a living proof
that fault-tolerant parallel processing is not only
physically possible but also fast and powerful.
Neurobiologists look to (artificial) neural
networks as a research tool for the interpretation
of neurobiological phenomena.
Neural Networks
 For example, neural networks have been used to provide insight
on the development of premotor (relating to, or being the area of
the cortex of the frontal lobe lying immediately in front of the
motor area of the precentral gyrus(Any of the prominent, rounded,
elevated convolutions on the surfaces of the cerebral hemispheres.
[Latin gȳrus, circle; see gyre.] ) )circuits in the oculomotor
(1.Of or relating to movements of the eyeball: an oculomotor
muscle.
2.Of or relating to the oculomotor nerve.
[Latin oculus, eye; see okw- in Indo-European roots + motor.]
system (responsible for eye movements) and the manner in which
they process signals (Robinson, 1992). On the other hand,
engineers look to neurobiology for new ideas to solve problems
more complex than those based on conventional hard-wired
design techniques.
Neural Networks
 Here, for example, we may mention the development of
a model sonar receiver based on the bat (Simmons et aI.,
1992). The batinspired model consists of three stages:
 (1) a front end that mimics the inner ear of the bat in
order to encode waveforms;
 (2) a subsystem of delay lines that computes echo delays;
 (3) a subsystem that computes the spectrum of echoes,
which is in turn used to estimate the time separation of
echoes from multiple target glints.
Neural Networks
 The motivation is to develop a new sonar receiver
that is superior to one designed by conventional
methods. The neurobiological analogy is also
useful in another important way: It provides a
hope and belief (and, to a certain extent, an
existence proof) that physical understanding of
neurobiological structures could indeed influence
the art of electronics and thus VLSI (Andreou,
1992).
Neural Networks
 With inspiration from neurobiological analogy in

mind, it seems appropriate that we take a brief
look at the structural levels of organization in the
brain.
Neural Networks
 1.2 Structural Levels of Organization in the
Brain
The human nervous system may be viewed
as a three-stage system,(Arbib, 1987).
Block-diagram representation of nervous system

Neural Networks
Central to the system is the brain, represented by the
neural (nerve) net in this figure, which continually
receives information, perceives it, and makes
appropriate decisions. Two sets of arrows are shown in
this figure:
1 Those pointing from left to right indicate the forward

transmission of information-bearing signals through
the system.
2 The arrows pointing from right to left signify the

presence of feedback in the system.
Neural Networks
The receptors in the figure convert stimuli from
the human body or the external environment into
electrical impulses that convey information to the
neural net (brain). The effectors, on the other
hand, convert electrical impulses generated by the
neural net into discernible responses as system
outputs. . In the brain there are both small-scale
and large-scale anatomical organizations, and
different functions take place at lower and higher
levels.
Neural Networks
This figure shows a
hierarchy of interwoven
levels of organization that
has emerged from the
extensive work done on
the analysis of local
regions in the brain
(Churchland and
Sejnowski, 1992; Shepherd
and Koch, 1990).
Neural Networks
Proceeding upward from synapses that represent

the most fundamental level and that depend on
molecules and ions for their action, we have
neural microcircuits dendritic trees, and then
neurons.
Neural Networks
A neural microcircuit refers to an assembly
of synapses organized into patterns of
connectivity so as to produce a functional
operation of interest. A neural microcircuit
may be likened to a silicon chip made up of
an assembly of transistors.
Neural Networks
The smallest size of microcircuits is measured in
micrometers (m), and their fastest speed of
operation is measured in milliseconds. The neural
microcircuits are grouped to form dendritic
subunits within the dendritic trees of individual
neurons. The whole neuron, about 100 m in size,
contains several dendritic subunits.
Neural Networks
 At the next level of complexity, we have local
circuits (about 1 mm in size) made up of
neurons with similar or different properties;
these neural assemblies perform operations
characteristic of a localized region in the
brain.
Neural Networks
 This is followed by interregional circuits made up
of pathways, columns, and topographic maps,
which involve multiple regions located in different
parts of the brain.
Neural Networks
 Topographic maps are organized to respond to
incoming sensory information. These maps are
often arranged in sheets, as in the superior
colliculus, where the visual, auditory, and
somatosensory maps are stacked in adjacent
layers in such a way that stimuli from
corresponding points in space lie above each
other. Finally, the topographic maps, and other
interregional circuits mediate specific types of
behavior in the central nervous system .
Neural Networks
 It is important to recognize that the
structural levels of organization described
herein are a unique characteristic of the
brain. They are nowhere to be found in a
digital computer, and we are nowhere close
to realizing them with artificial neural
networks. Nevertheless, we are inching our
way toward a hierarchy of computational
levels similar to that described in the last
figure.
Neural Networks
The artificial neurons we use to build our
neural networks are truly primitive in
comparison to those found in the brain.
The neural networks we are presently able to

design are just as primitive compared to the
local circuits and the interregional circuits in
the brain.
Neural Networks
 What is really satisfying, however, is the
remarkable progress that we have made on
so many fronts during the past 20 years.
With the neurobiological analogy as the
source of inspiration, and the wealth of
theoretical and technological tools that we
are bringing together, it is for certain that in
another 10 years our understanding of
artificial neural networks will be much more
sophisticated than it is today.
Neural Networks
 Our primary interest here is confined to the study
of artificial neural networks from an engineering
perspective, to which we refer simply as neural
networks. We begin the study by describing the
models of (artificial) neurons
that form the basis of the neural networks
considered in these lectures.
Models of a Neuron
 Models of a Neuron
A neuron is an information-processing unit that is
fundamental to the operation of a neural network.
The figure on the next slide shows the model for a
neuron.
Models of a Neuron
Nonlinear model of a neuron

Models of a Neuron
1. A set of synapses or connecting links, each
of which is characterized by a weight or
strength of its own. Specifically, a signal xj
at the input of synapse j connected to
neuron k is multiplied by the synaptic
weight wkj. It is important to make a note
of the manner in which the subscripts of
the synaptic weight wkj are written.
Models of a Neuron
The first subscript refers to the neuron in
question and the second subscript refers to
the input end of the synapse to which the
weight refers; the reverse of this notation is
also used in the literature.
The weight wkj is positive if the associated
synapse is excitatory; it is negative if the
synapse is inhibitory (Middle English
inhibiten, to forbid, from Latin inhibēre, inhibit-
, to restrain, forbid : in-, in; see in-2 + habēre, to
hold; see ghabh- in Indo-European roots.] )
Models of a Neuron
2.An adder for summing the input signals,

weighted by the respective synapses of
the neuron; the operations described
here constitute a linear combiner.
Models of a Neuron
3.An activation function for limiting the amplitude
of the output of a neuron. The activation function,
is also referred to in the literature as a squashing
function in that it squashes (limits) the permissible
amplitude range of the output signal to some finite
value. Typically, the normalized amplitude range
of the output of a neuron is written as the closed
unit interval [0,1] or alternatively [-1,1].
Models of a Neuron
4. The model of a neuron also includes an
externally applied threshold k that has the
effect of lowering the net input of the
activation function.
On the other hand, the net input of the

activation function may be increased by
employing a bias term rather than a
threshold; the bias is the negative of the
threshold.
Models of a Neuron
In mathematical terms, we may describe neuron k by
writing the following pair of equations:
p
Mathematical Model uk   wkj x j
of a Neuron j 1
yk   ( uk   k )
where xj’s are the input signals; wkj’s are the synaptic
weights of neuron k; uk is the linear combiner output;
k is the threshold;  is the activation function;
and yk is the output signal of the neuron.
Models of a Neuron
Block-Diagram Representation of a Neuron
p
uk   wkj x j
j 1 
yk   ( u k   k )
Models of a Neuron
 The use of threshold k has the effect of
applying an affine transformation to the
output uk of the linear combiner in the
model of the figure, as shown by
vk  u k   k
Models of a Neuron
In particular, depending on
whether the threshold k is
positive or negative, the
relationship between the
effective internal activity level or
activation potential vk of neuron
k and the linear combiner
output uk is modified in the
manner illustrated in the figure.
Note that as a result of this
affine transformation, the graph
of vk versus uk no longer passes
through the origin.
Models of a Neuron
The threshold k is an external parameter
of artificial neuron k. We may account for its
presence as in the above equation.
Equivalently, we may formulate the
combination of the two equations as follows:
p
vk   wkj x j
j 0
yk   (vk )
Models of a Neuron
Here we have added a new synapse, whose input
is
x0  1
and whose weight is
wk 0   k
Models of a Neuron
We may therefore
reformulate the model
of neuron k as in the
figure, where the effect
of the threshold is
represented by doing
two things:
Models of a Neuron
(1) adding a new input signal fixed at -1, and
(2) adding a new synaptic weight equal to the
threshold k .
Alternatively, we may model the neuron as in

the following slide:
Y1
where the
Models of a Neuron
combination of
fixed input
xo = + 1 and
weight wkO = bk
accounts for the
bias bk•
Although the
models of the
two figures are
different in
appearance,
they are
mathematically
equivalent.
Slayt 92
Y1 YTU; 15.03.2005
Models of a Neuron
Signal-Flow Graph Representation of a Neuron
p
uk  w
j 1
kj xj

yk   ( uk   k )
Models of a Neuron
Signal-Flow Graph Representation of a Neuron
Two different types of links may be distinguished:
(a) Synaptic links, defined by a linear input-output
relation. Specifically, the node signal xj is
multiplied by the synaptic weight wkj to produce
the node signal vk .
(b) Activation links, defined in general by a
nonlinear input-output relation. This form of
relationship is the nonlinear activation function
given as
 (.)
Models of a Neuron
The Activation Function

The activation function, denoted by
y   (v )
defines the output y of a neuron in terms of the
activity level at its input v.
Models of a Neuron
We may identify three basic types of activation
functions:
1. Threshold Function
2. Piecewise-linear Function
3. Sigmoid Function
Models of a Neuron
Threshold (hard limiter or binary activation )
1.
Function (leading to discrete perceptron)
 (v )
1 1
1  (v)   sgn(v)
2 2
0
v
(a) Unipolar
Models of a Neuron
 (v )
 (v)  sgn(v)
1
00
v
0
-1
(b) Bipolar
Models of a Neuron
2. Piecewise-linear Function
 (v )
1 1 1 1 1
 (v )    v   v  
0.5 2 2 2 2
-0.5 0 v
0
0
(a) Unipolar
y ij (t )  f ( xij ) 
1
2

xij (t )  1  x ij (t )  1 
Models of a Neuron
 (v )
 (v )   v  1  v  1 
1 1
2
-1
0
1 v
-1
(b) Bipolar
Models of a Neuron
3. Sigmoid Function
 (v )
1 1
 (v )   av
;a 0
0.5
v
1  e
(a) Unipolar
Models of a Neuron
 (v )
1
1  e  av 2
 (v )   av
=  av
-1 ; a  0
1 e 1 e
v
-1
(b) Bipolar
Models of Artificial Neural Networks
DEFINITION OF Neural Network
(Jacek M. Zurada, ARTIFICIAL NEURAL SYSTEMS, 1992, West Publishing Company)
A Neural Network is an interconnection of

neurons such that neuron outputs are
connected, through weights, to all other
neurons including themselves; both lagfree
and delay connections are allowed.
Neural Networks Viewed as Directed Graphs
1. Block-Diagram Representation (BDR)

2. Signal-Flow Graph Representation (SFGR)
These are obtained when BDR and SFGR for

the neurons are used.
An alternative definition of Neural Network
(Simon Haykin, NEURAL NETWORKS, 1994, Macmillan College Publishing Company)
A neural network is a directed graph (SFG) consisting of

nodes with interconnecting synaptic and activation
links, and which is characterized by four properties:
 Each neuron is represented by a set of linear synaptic

links, an externally applied threshold, and a nonlinear
activation link. The threshold is represented by a
synaptic link with an input signal fixed at a value of -1.
2. The synaptic links of a neuron weight their
respective input signals.
3. The weighted sum of the input signals
defines the total internal activity level of
the neuron in question.
4. The activation link squashes the internal
activity level of the neuron to produce an
output that represents the output of the
neuron.
Network Architectures
 In general, we may identify four different
classes of network architectures:
 1. Single-Layer Feedforward Networks
 2. Multilayer Feedforward Networks
 3. Recurrent Networks
 4. Lattice Structures
1. Single-Layer Feedforward Networks

A layered neural network is a network of
neurons organized in the form of layers. In
the simplest form of a layered network, we
just have an input layer of source nodes that
projects onto an output layer of neurons
(computation nodes), but not vice versa.
 In other words, this network is strictly of a
feedforward type. It is illustrated on the
following slide for the case of four nodes in
both the input and output layers. Such a
network is called a single-layer network,
with the designation "single layer" referring
to the output layer of computation nodes
(neurons). In other words, we do not count
the input layer of source nodes, because no
computation is performed there.
 2. Multilayer Feedforward Networks
The second class of a feedforward neural
network distinguishes itself by the presence
of one or more hidden layers, whose
computation nodes are correspondingly
called hidden neurons or hidden units. The
function of the hidden neurons is to
intervene between the external input and the
network output.
 By adding one or more hidden layers, the network is
enabled to extract higher-order statistics, for (in a rather
loose sense) the network acquires a global perspective
despite its local connectivity by virtue of:
 the extra set of synaptic connections

 the extra dimension of neural interactions.
The ability of hidden neurons to extract higher-order

statistics is particularly valuable when the size of the
input layer is large.
 The source nodes in the input layer of the
network supply respective elements of the
activation pattern (input vector), which constitute
the input signals applied to the neurons
(computation nodes) in the second layer (i.e., the
first hidden layer). The output signals of the
second layer are used as inputs to the third layer,
and so on for the rest of the network.
 Typically, the neurons in each layer of the
network have as their inputs the output signals of
the preceding layer only.
 The set of output signals of the neurons in the

output (final) layer of the network constitutes the
overall response of the network to the activation
pattern supplied by the source nodes in the input
(first) layer.
This graph illustrates the
layout of a multilayer
feedforward neural
network for the case of a
single hidden layer. For
brevity this network is
referred to as a 10-4-2
network in that it has 10
source nodes, 4 hidden
neurons, and 2 output
neurons.
As another example, a feedforward
network with p source nodes, h1
neurons in the first hidden layer, h2
neurons in the second layer, and q
neurons in the output layer, say, is
referred to as a p-h1-h2-q network.
 The neural network of
this figure is said to
be fully connected in
the sense that every
node in each layer of
the network is
connected to every
other node in the
adjacent forward
layer.
 If, however, some of the communication links
(synaptic connections) are missing from the
network, we say that the network is partially
connected. A form of partially connected
multilayer feedforward network of particular
interest is a locally connected network. An
example of such a network with a single hidden
layer is presented on the next slide. Each neuron
in the hidden layer is connected to a local (partial)
set of source nodes that lies in the immediate
neighborhood.
Such a set of localized
nodes feeding a neuron
is said to constitute the
receptive field of the
neuron.
Likewise, each neuron in

the output layer is
connected to a local set
of hidden neurons. Partially connected
feedforward neural network
 3. Recurrent (Feedback or Dynamical)
Networks
A recurrent neural network
distinguishes itself from a feedforward
neural network in that it has at least
one feedback loop. For example, a
recurrent network may consist of a
single layer of neurons with each
neuron feeding its output signal back
to the inputs of all the other neurons,
as illustrated in the architectural
graph of the figure on the right. In the
structure depicted in this figure there
are no self-feedback loops in the
network; self-feedback refers to a Recurrent network with no
situation where the output of a neuron self-feedback loops and no
is fedback to its own input. hidden neurons
 The recurrent network
illustrated on the previous
slide also has no hidden
neurons. Here we illustrate
another class of recurrent
networks with hidden
neurons. The feedback
connections shown originate
from the hidden neurons as Recurrent network with
well as the output neurons. hidden neurons
 The presence of feedback loops, be it as in the
recurrent structure with or without hidden
neurons, has a profound impact on the learning
capability of the network, and on its
performance. Moreover, the feedback loops
involve the use of particular branches
composed of unit-delay elements (denoted by
z-1), which result in a nonlinear dynamical
behavior by virtue of the nonlinear nature of
the neurons.
4. Lattice (Multicategory Perceptron)
Structures
A lattice consists of a one-dimensional,
two-dimensional, or higher-dimensional
array of neurons with a corresponding set
of source nodes that supply the input One dimensional lattice of 3 neurons
signals to the array; the dimension of the
lattice refers to the number of the
dimensions of the space in which the
graph lies.
A lattice network is really a feedforward
network with the output neurons
arranged in rows and columns.
Two dimensional lattice of 3-by-3 neurons

The Perceptron
 The perceptron is the simplest
form of a neural network used
for the classification of a special
type of patterns said to be
linearly separable (i.e., patterns
that lie on opposite sides of a
hyperplane).
 Basically, it consists of a single
neuron with adjustable synaptic
weights and threshold, as shown
in the figures.
The Perceptron
 The algorithm used to adjust the free parameters of this
neural network first appeared in a learning procedure
developed by Rosenblatt (1958, 1962) for his perceptron
brain model. Indeed, Rosenblatt proved that if the
patterns (vectors) used to train the perceptron are drawn
from two linearly separable classes, then the perceptron
algorithm converges and positions the decision surface in
the form of a hyperplane between the two classes. The
proof of convergence of the algorithm is known as the
perceptron convergence theorem.
The Perceptron
 The single-layer perceptron depicted has a single
neuron. Such a perceptron is limited to performing
pattern classification with only two classes.
 By expanding the output (computation) layer of the

perceptron to include more than one neuron, we may
correspondingly form classification with more than two
classes. However, the classes would have to be linearly
separable for the perceptron to work properly.
The Perceptron
 From this model we find that
the linear combiner output
(i.e., hard limiter input) is
p
v   wkj x j  
j 1
The purpose of the perceptron is to classify the set of

externally applied stimuli x1 ,x2,…., xp into one of two
classes, C1 or C2, say. The decision rule for the
classification is to assign the point represented by the
inputs x1 ,x2,…., xp to class C1, if the perceptron output y is
+ 1 and to class C2 if it is -1.
The Perceptron
 To develop insight into the behavior of a pattern
classifier, it is customary to plot a map of the decision
regions in the p-dimensional signal space spanned by
the p input variables x1 ,x2,…., xp. In the case of an
elementary perceptron, there are two decision regions
separated by a hyperplane defined by
w
j 1
kj xj   0
The Perceptron
This is illustrated here for the case of
two input variables xl and x2, for which
the decision boundary takes the form of
a straight line called the decision line. A
point (x1,x2) that lies above the decision
line is assigned to class C1, and a point
(x1,x2) that lies below the decision line
is assigned to class C2. Note also that
the effect of the threshold  is merely to
shift the decision line away from the
origin. The synaptic weights w1 w2, ..,wp
of the perceptron can be fixed or
adapted on an iteration-by-iteration
basis. For the adaptation, we may use
an error-correction rule known as the
perceptron convergence algorithm.
The Perceptron
We find it more convenient to
work with the modified  signal-
flow graph given here.
In this second model, which is
equivalent to that of the previous
figure, the threshold  is treated as
a synaptic weight is connected
to a fixed input equal to -1. We may thus define the
(p + 1)-by-1 (augmented) input vector and the
corresponding (augmented) weight vector as:
x  [ x1 x2 ... ... x p 1]t w  [ w1 w2 ... ... wp  ]t
The Perceptron
 Pattern Space
Any pattern can be represented by a point in

n-dimensinal Euclidean space En called the
pattern space. Points in that space corresponding to
members of the pattern set are n-tuple vectors x.
The Perceptron
Example 1: Consider the six patterns in two dimensional
pattern space shown in the following figure.
x2
(0, 0) (2,0)
x1
(-0.5, -1) (1.5,-1)
(-1, -2) (1, -2)

The Perceptron
Design a perceptron such that these are classified
according to their membership in sets as follows :
2 1.5  1 
   ,   ,    : class 1
0  1   2  
 0   0.5  1 
  ,   ,    : class 2
 0   1   2  
The Perceptron
One possible decision line is given by x2= 2x1-2
which is drawn in the following figure.
x2 x = 2x -2
2 1
(0, 0) (2,0)
x1
(-0.5, -1) (1.5,-1)
(-1, -2) (1, -2)

The Perceptron
One decision surface for this line is obtained as: x3  2 x1  x2  2
x3  0  2 x1  x2  2  0 gives the points on the decision line

x3  0  2 x1  x2  2  0 gives the part of the surface above the decision line
x3  0  2 x1  x2  2  0 gives the part of the surface below the decision line
Such a pattern classification can be performed by

the following (discrete) perceptron (dichotomizer):
dichotomize: to divide or separate into two parts
dicha: in two; tomia: to cut
The Perceptron
x1
-2
x2 1 + v sgn(v) y
-2
-1 y  sgn(2 x1  x2  2)
Single-Layer Feedforward Neural
Network
Example 2: Assume that a set of eight points,
P0, P1... , P7 , in three-dimensional space is available.
The set consists of all vertices of a three-dimensional
cube as follows:
{P0(-l, -1, -l), P1(-l, -1, l), P2(-1, 1, -1), P3(-1, 1, 1),
P4(1, -1, -l), P5(1, -1, 1), P6(1,1, -1), P7(1, 1, 1)}
Elements of this set need to be classified into two categories
The first category is defined as containing points with two
or more positive ones; the second category contains all the
remaining points that do not belong to the first category.
Network
Classification of points P3, P5, P6, and P7 can be
based on the summation of coordinate values for
each point evaluated for category membership.
Notice that for each point Pi (x1, x2, x3) ,where
i = 0, ... , 7, the membership in the category can be
established by the following calculation:
 1, then category 1
If sgn( x1  x2  x3 )  
 1, then category 2
Network
The neural network given below implements the
above expression:
Network
The network above performs the
threedimensional Cartesian space partitioning
as illustrated below :
Network
Discriminant Functions
In Example 1
x3  2 x1  x2  2
can be viewed as a Discriminant Function. We
may also write
g ( x1 , x2 )  2 x1  x2  2
or
 x1 
g ( x )  2 x1  x2  2 where x =  
 x2 
Network
g ( x1 , x2 )  2 x1  x2  2
can also be viewed as the equation of a plane in
3-D Euclidean space.
On the other hand

g ( x1 , x2 )  0   2 x1  x2  2  0
is the intersection line of the above plane with the

xy-plane.
Network
 Obviously:
g ( x )  0  2 x1  x2  2  0 gives the points on the decision line

g ( x )  0  2 x1  x2  2  0 gives the points on the plane above the decision line
g ( x )  0  2 x1  x2  2  0 gives the points on the plane below the decision line
Network
Since on the decision line we have
g ( x1 , x2 )  0
we can write
g ( x1 , x2 ) g ( x1 , x2 )
dg ( x1 , x2 )  dx1  dx2  0
x1 x2
where dx1 and dx2 are the increments given to
x1 and x2 on the decision line.
Network
Now defining
 g ( x1 , x2 ) 
 x   dx1 
g ( x1 , x2 )   1
 and dr   
 g ( x1 , x2 )   dx2 
 x 
 2 
where
g and dr
are known to be the gradient vector (or normal

vector) and the tangent vector, respectively,
Network
The gradient vector points toward the positive side
of the decision line. However, there are two normal
vectors, one pointing toward the positive side, 1,
and the other toward the negative side, 2 =-1.
For the above example the gradient and normal
vectors are given by:
 g ( x1 , x2 ) 
 x   2   2   2
g ( x1 , x2 )       , 1  g ( x1 , x2 )    ,2   
1
 g ( x1 , x2 )   1  1  1
 x 
 2 
Network
In fact 2 is obtained from
 g ( x1 , x2 )  0
Note that 1 and 2 are the projections of the
normal vectors on the x-y plane of two intersecting
planes whose intersection line is given by
g ( x1 , x2 )  0
Network
Although 1 and 2 are unique, there are infinetely
many plane pairs whose intersection line is given
by
g ( x1 , x2 )  0
Plane pairs can be built by appropriately
augementing the 2-D normal vectors 1 and 2
to 3-D normal vectors which will be the normal
vectors of the two intersecting planes.
Network
The 2-D normal vectors are plane vectors given in
the x-y plane.
 2  2
1    ,2   
1  1
These can be augmented to 3-D by adding a third

component, say 2, yielding
 2  2
n1   1  , n2   1
   
 2   2 
Network
The details of building the augmented vectors
are shown below: g
n1 x2
Decision
n
1 2
 line

 1 2 x1
-2 0


-1
-2
Network
Note that 1 and 2 are the normal vectors of the
plane that is perpendicular to the x-y plane and
intersects the x-y plane at the decision line.
On the other hand the vectors n1 and n2 are the

normal vectors of the planes obtained by rotating
the above perpendicular plane around the decision
line by and , respectively.
Network
We can now determine the equations for the these
planes by using the normal vector-point form of
plane equation given as:
n ( x  x0 )  0
t
where:
•n is the normal vector of the plane,
•x is the vector connecting any point on the plane
to the origin,
•x0 is the vector connecting a fixed point on the
plane to the origin.
Network
This means that x-x0 represents the vector

connecting all possible points x on the plane
to fixed point x0 on the same plane. That is x-x0
is a vector that lies on the plane.
Now let us find the plane equations for the two

normal vectors found above.
Network
Let x0 be the point (1,0,0) on the decision line.
We can write:
 2    x1  1  
       1
For n1   1    2 1 2   x2   0    0  g1  x1  x2  1
  2
 2    g1  0  
2   x1  1  
       1
For n2   1   2 1 2   x2   0    0  g 2   x1  x2  1
  2
 2    g 2  0  
Network
Because of the way g1(x) and g2(x) are built we can
state the following:
g1 ( x)  g 2 ( x)  0 on the positive side of the decision line

g 2 ( x)  g1 ( x)  0 on the negative side of the decision line
 2  2
  1   1
For n1   1   g1  x1  x2  1 For n2   1  g 2   x1  x2  1
2 2
 2   2 
g
Network
Decision
Decisio
n2 line
n1 g1
g2
x1
Decision
line x2
Network
Now we can compute g1(x) and g2(x) for the selected
patterns in Example 1.
Class 1 Class 2
(2,0) (1.5,-1) (1,-2) (0,0) (-0.5,1) (-1,-2)
g1-g2>0 g1-g2>0 g1-g2>0 g2-g1>0 g2-g1>0 g2-g1>0

Network
Henceforth, such gi(x) functions will be called
Discriminant Functions. We can conclude that:
g1 ( x)  g 2 ( x) for the patterns in Class 1

g 2 ( x)  g1 ( x) for the patterns in Class 2
Network
Minimum Distance Classification
The classification of two clusters is carried out in
such a way that the boundary of these two clusters
is drawn as a line perpendicular to and passing
through the midpoint of the line connecting the
center points of two clusters . Therefore the
boundary line is the perpendicular bisector of a
connecting line.
Pi Network Positive side
xi-xj Negative side
xi
P0=(xi+xj)/2
Pj
xj
0
Network
Now we will derive the equation of the boundary line.
Let the vector x and x0 represent any point on this and
the point P0, respectively. Then the following must hold:
( xi  x j ) ( x  x 0 )  0
t
which can be written in the form

1
( xi  x j ) ( x  ( xi  x j ))  0
t
2
Network
and
1
( xi  x j ) x  ( xi  x j ) ( xi  x j )  0
t t
2
or
1 2
( xi  x j ) x  ( xi  x j )  0
t 2
2
Network
Now defining
1 2
gij ( x )  ( xi  x j ) x  ( xi  xj )
t 2
2
We have already seen that the boundary (decision)
line can be taken as the intersection of two planes
gi and gj .
Network
Therefore
gij ( x )  gi ( x )  g j ( x )
where we have called gi (x) discriminant functions
and shown that they are associated with plane
equations.
Network
Now using the two equations above we obtain
1 2
( xi  x j ) x  ( xi  x j )  gi ( x )  g j ( x )
t 2
2
which can be used to make the following
identification:
1 1 2
gi ( x )  xi x  xi
2
t
g j ( x)  x j x  x j
t
2 2
Network
gi (x) can also be expressed as:
gi ( x )  wi x  wi ,n1
t
Therefore we can make the identification:
wi = x i
1
wi ,n1   xi
2
2
Network
An alternative approach towards the construction
of discriminant functions may be taken as follows:
Let us assume that a minimum–distance
classification is requried to classify patterns into
R categories. Each of the classes is represented by
its center point Pi , i=1,2,…..,R. The Euclidean
distance between an input pattern x and the point
Pi is given by the norm of the vector x-xi as:
x  xi  ( x  xi ) ( x  xi )
t
Network
A minimum–distance classifier computes the distance
from a pattern of unknown classification to each of the
center points Pi . Then the category number of the point
that yields the minimum distance is assigned to the
unknown pattern.
Squaring the above equation yields

1 t
x  xi  x x - 2x x + x xi = x x - 2(x x - xi xi ) > 0
2 t t t t t
i i i
2
Network
Since xxt is independent of i, this term is constant
with respect to the categories. Therefore, in order
to minimize the distance
x  xi
we need to maximize
t 1 t
gi (x) = x x - xi xi
i
2
which is called a discriminant function.
Network
Example 3: A linear minimum-distance classifier
will be designed for the three points given as:
10 2   5
x1   , x2   , x3   
2   5 5
It is also assumed that the index of each point
(pattern)corresponds to its class number.
The three points and the connecting lines
constitute a triangle which is shown on the
next slide:
Network
x2
P3(-5,5)
P1(10,2)
0 x1
P2(2,-5)
Network
Now let us draw the circle passing through all three
vertices of the triangle, the circumcircle. We can
conclude that each boundary is a perpendicular
bisector of the triangle. A perpendicular bisector of a
triangle is a straight line passing through the
midpoint of a side and being perpendicular to it, i.e.
forming a right angle with it. The three
perpendicular bisectors meet at a single point, the
triangle's circumcenter; this point is the center of the
circumcircle.
Network
x2
P3(-5,5)
P1(10,2)
0 x1
P2(2,-5)
Network
Now using
10 2   5
x1   , x2   , x3   
2   5 5
and
1 2
gij ( x )  ( xi  x j ) x  ( xi  xj )
t 2
2
we obtain
Network
1
g12 ( x )  ( x1  x2 ) x  ( x1  x2 )
t 2 2
2
t
 10   2    x1  1
           [(100  4)  (4  25)]
  2   5   x2  2
 8 x1  7 x2  37.5
Network
1
g13 ( x )  ( x1  x3 ) x  ( x1  x3 )
t 2 2
2
t
 10   5   x1  1
        x   2 [(100  4)  (25  25)]
 2   5    2
 15 x1  3 x2  27
Network
1
g 23 ( x )  ( x2  x3 ) x  ( x2  x3 )
t 2 2
2
t
  2   5   x1  1
           [(4  25)  (25  25)]
  5  5    x2  2
 7 x1  10 x2  10.5
Network
Now using
wi = x i
1
wi ,n1   xi
2
2
we obtain
 10   2   5 
w1   2  ; w2   5  ; w3   5  ;
     
-52  -14.5 -25
and using
Network
gi ( x )  wi x  wi ,n1
t
we obtain
g1 ( x )  10 x1  2 x2  52
g 2 ( x )  2 x1  5 x2  14.5
g3 ( x )  5 x1  5 x2  25
Network
A block diagram producing the three discriminant
functions is shown below:
x1 10 10 x1  2 x2  52
2
52
2
x2 -5 2 x1  5 x2  14.5
14.5
5 -5
-1 25 5 x1  5 x2  25
Network
The discriminant values for the three patterns
P1(10,2), P2(2,-5) and P3(-5,5) are shown in the
table below:
Input
Discriminant Class 1 Class 2 Class 3

[10 2]t [2 -5]t [-5 5]t
g1(x)=10x1+2x2-52 52 -42 -92
g2(x)= 2x1-5x2-14.5 -4.5 14.5 -49.5
g3(x)=-5x1+5x2-25 -65 -60 25

Network
As required by the definition of the discriminant

function, the responses on the diagonal are the
largest in each column. It will be shown later that
the same is true for any three points P1,P2 ,P3 taken
from the three decision regions H1,, H2, H3 provided
that the decision regions are determined as shown
above. Therefore using a maximum selector at the
output will provide the required function from the
network.
Network
Using the same network with TLUs (bipolar
activation functions) will result in the outputs
given in the table below:
Input
Class 1 Class 2 Class 3

[10 2]t [2 -5]t [-5 5]t
sgn(g1(x)=5x1+3x2-5) 1 -1 -1
sgn(g2(x)= -x2-2) -1 1 -1
sgn(g3(x)=-9x1+x2) -1 -1 1
Network
The diagonal entries=1
The offdiagonal entries=-1
However, as the next example will demonstrate
this is not true for any three points P1,P2 ,P3 taken
from the three decision regions H1, H2, H3.
Network
The response of the same network to the patterns
Q1(5,0), Q2(0,1) and Q3(-4,0) are shown in the table below:
Input

[5 0]t [0 1]t [-4 0]t
g1(x)=10x1+2x2-52 -2 -50 -92
g2(x)= 2x1-5x2-14.5 -4.5 -19.5 -22.5
g3(x)=-5x1+5x2-25 -50 -20 -5

Network
The responses on the diagonal are still the largest

in each column. However, using the same network
with TLUs (bipolar activation functions) will result
in the outputs given in the table on the next slide:
Network
Input

[5 0]t [0 1]t [-4 0]t
sgn(g1(x)=10x1+2x2-52) -1 -1 -1
sgn(g2(x)= 2x1-5x2-14.5) -1 -1 -1
sgn(g3(x)=-5x1+5x2-25) -1 -1 -1
Network
It is therefore impossible to use TLUs once the
decision lines are calculated using the
minimum-distance calssification procedure.
The only way out is using a maximum selector.
The explanation of the responses on the diagonal

being the largest in each column will now be made
in detail.
Network
The discriminant functions determine the plane
equations
g1  10 x1  2 x2  52  0
g 2  2 x1  5 x2  14.5  0
g3  5 x1  5 x2  25  0
Network
These planes are shown on the next slide.
It is easily seen that:

For any point in H1 : g1(x)>g2(x) and g1(x)>g3(x)
Network
100
50
-50
gi
-100
-150
-200
10
-5
x2 -10 5 10
-10 -5 0
x1
Network
The decision regions H1,H2, H3 are projections of

the planes g1,g2 and g3, respectively, on the x1-x2
plane and the decision lines are the projections of
the intersection lines of the planes gi on the x1-x2
plane which are shown on the next slide.
x Network
2
H3 g13 (x)=0 H1,,
P3(-5,5) g1 ( x)  g 2 ( x)
g1 ( x)  g3 ( x)
g3 ( x)  g1 ( x)
g3 ( x)  g 2 ( x) P123(2.337,2.686)
P1(10,2)
0 x1
g 2 ( x)  g1 ( x)
g23 (x)=0 g 2 ( x)  g3 ( x)
g12 (x)=0
H2 P2(2,-5)
Network
A MATLAB plot of the projections of the
intersection lines of the planes gi are shown
on the next slide
Network
30
20
10
-10
-20
-30
-30 -20 -10 0 10 20 30
Network
The projections of the intersection lines of the
planes gi on the x1-x2 plane are shown to be given
by the following line equations:
g12 ( x )  8 x1  7 x2  37.5  0
g13 ( x )  15 x1  3 x2  27  0
g 23 ( x )  7 x1  10 x2  10.5  0
The previous slide shows the segments that can
be seen from the top.
Network
The continuation of the line g12=0 remains
underneath the plane g3.


Network
A classifier using a maximum selector is shown on

the next slide. The maximum selector selects the
maximum discriminant and responds with the
number of the discriminant having the largest
value.
Network
x1 10 g1(x)
1
2
52
2 Maximum i=1,2, or 3
x2 -5 g2(x)
2 selector
14.5
5 -5 g3(x)
-1 25 3
Classifier using the maximum selector

Network
The classifier can be redrawn as follows:
x1 10
2 g1(x)
x2 1
-1 52
x1 2
g2(x) Maximum i=1,2, or 3
x2 - 5 2 selector
-1 14.5
x1 -5
x2 5 g3(x)
3
-1 25
Network
x1 10
2 g1(x)
x2 1
-1 52
x1 2
-5 g2(x) Maximum i=1,2, or 3
x2 2 selector
-1 14.5
x1 -5
x2 5 g3(x)
3
-1 25
Network
In the above we have designed a classifier which was

based on the minimumdistance classification for known
clusters and derived the network with three perceptrons
from the discriminant functions which were interpreted
as plane equations. Instead, now let us consider the
network on the next slide which is obtained as a result of
training a network with three perceptrons using the same input
patterns P1(10,2), P2(2,-5) and P3(-5,5) as in the previous network .
Network
5 x1  3 x2  5
x1 5 TLU#1
3
5
0  x2  2
x2 -1 TLU#2
2
1 -9 9x1  x2
-1 0 TLU#3
Network
In fact gi(x)=0 define the intersection of gi planes
with x1-x2 plane. Therefore the TLU divides the gi
planes into two regions:
(1)the upper-half plane which is above x1-x2 plane

and
(1)the lower-half plane which is below x1-x2 plane.
Network
The decision lines are obtained by setting gi(x)=0
5 x1  3x2  5  0
 x2  2  0
9 x1  x2  0
which are given on the next slide. The shaded
areas are indecision regions which will become
clear in the following discussion.
Network
x2  g1 (0,9)   22   1 
 g (0,9)    29    1
5 x1  3 x2  5  0 Q1(0,9)  2     
 g 2 (0,9)   9   1 
 g1 (5,5)   15  1

 g (5,5)    7    1 P3(-5,5) 9 x1  x2  0  g1 (10, 2)   51   1 
 2     
 g 2 (5,5)   50   1   g (10, 2)    4    1
 2     
 g 2 (10, 2)   88  1
0 P1(10,2)
x1
 x2  2  0
 g1 (1, 3)   19   1

 g (1, 3)    1    1  Q3(-1,-3) Q2(4,-4)  g1 (4, 4)   3   1 
 g (4, 4)    2    1 
 2       2     
 g 2 (1, 3)   6   1  P2(2,-5)  g 2 (4, 4)   40   1
 g1 (2, 5)   10   1

 g (2, 5)    3    1 
 2     
 g 2 (2, 5)   23  1
Network
The discriminant values g1(x), g2(x), g3(x) for
the same three patterns P1(10,2), P2(2,-5) and
P3(-5,5) are shown in the table below:
Input

[10 2]t [2 -5]t [-5 5]t
g1(x)=5x1+3x2-5 51 -10 -15
g2(x)= -x2-2 -4 3 -7
g3(x)=-9x1+x2 -88 -23 50

Network
The outputs of the network with three discrete
perceptrons are shown in the table below:
Input
Class 1 Class 2 Class 3

[10 2]t [2 -5]t [-5 5]t
sgn(g1(x)=5x1+3x2-5) 1 -1 -1
sgn(g2(x)= -x2-2) -1 1 -1
sgn(g3(x)=-9x1+x2) -1 -1 1
Network
The table on the previous slide shows that the new
discriminant functions
g1 ( x )  5 x1  3 x2  5
g 2 ( x )   x2  2
g3 ( x )  9 x1  x2
classify the paterns P1(10,2), P2(2,-5) and P3(-5,5)
in the same way as the discriminant functions
g1 ( x )  10 x1  2 x2  52
g 2 ( x )  2 x1  5 x2  14.5
g3 ( x )  5 x1  5 x2  25
Network
Conclusion:
The network, which is obtained through the
perceptron learning algorithm, and the network
obtained using the maximum-distance classification
procedure have classified the three points
P1(10,2), P2(2,-5) and P3(-5,5) in exactly the same
way, i.e.,
P1 (10,2)  class 1
P2 (2,-5)  class 2
P3 (-5,5)  class 3
Network
Now consider the patterns Q1(0,9), Q2(4,-4)
and Q3(-1,-3) which fall into shaded areas.
The discriminant values for these patterns are
shown in the table on the next slide:
Network
Input
Discriminant [0 9]t [4 -4]t [-1 -3]t

g1(x)=5x1+3x2-5 22 3 -19
g2(x)= -x2-2 -29 2 1
g3(x)=-9x1+x2 9 -40 6
Network
Since
g1 (0,9)  g3 (0,9), g 2 (0,9)
g3 (  1, 3)>g 2 (1, 3), g1 (1, 3)
g1 (4,-4)>g 2 (4,-4),g3 (4,-4)
if we use a maximum selector instead of the three
TLUs the network can decide that
Q1 (0,9) and Q 2 (4,-4)  class 1

Q3 (1, 3)  class 3
Network
On the other hand, if we use TLUs we would
obtain the outputs in the following table:
Input
Discriminant [0 9]t [4 -4]t [-1 -3]t
g1(x)=5x1+3x2-5 1 1 -1
g2(x)= -x2-2 -1 1 1
g3(x)=-9x1+x2 1 -1 1
Network
In order to make a classification we should have a
column with one 1 and two -1s. Therefore
according to the table obtained non of the three
patterns Q1(0,9), Q2(4,-4) and Q3(-1,-3) could be
classified into any class. Therefore according to
the network with TLUs the shaded areas will be
called indecision regions.
Network
Now let us consider the planes defined by
g1  5 x1  3 x2  5
g 2   x2  2
g3  9 x1  x2
which are plotted on the next slide:
Network
100
50
0
gi
-50
-100
10
x2 -10
-10 -8 -6 -4 -2 0 2 4 6 8 10
x1
Network
The projections of the intersection lines of the
planes gi(x) on the x1-x2 plane are given by
g12 : 5 x1  4 x2  3  0
g 23 : 9 x1  2 x2  2  0
g13 : 14 x1  2 x2  5  0
The segments that can be seen from the top are

plotted on the next slide.
Network
15
10
-5
-10
-15
-10 -5 0 5 10 15 20
Network


Network
x1 w11 v1 y1
1
Neurons
wk1 v2
x2 2 y2
wK1
wK2
wkj vk yk
xj k
wmn vm
xn K ym
Input nodes Output nodes
Network
v1  w11 x1  w12 x2  .........  w1 j x j  ......  w1J xJ y1  f ( v1 )
v2  w21 x1  w22 x2  .........  w2 j x j  ......  w2 J xJ y2  f ( v2 )
vk  wk1 x1  wk 2 x2  .........  wkj x j  ......  wkJ xJ yk  f ( vk )
vK  wK 1 x1  wK 2 x2  .........  wKj x j  ......  wKJ xJ y K  f ( vK )

Network
 v1   w11 w12 . . w1J   x1   y1   f ( v1 )    v1  
v  w     y   f ( v )  
 2   21 w22 . . w1J   x2    v2  
 2  2 
 .  . . . . .  .   .    .   Γ . 
          
 .  . . . . .  .   .   .   . 
vK   wK 1  yK   f ( vJ )  
wK 2 . . wKJ   xJ   v J  
v  Wx y  Γ(v)
Network
 y1   f (.) 0 . . 0   v1 
y   0 f (.) . . 0  v 
 2   2 
 .  . . . . .  . 
    
 .   . . . . .  . 
 yK   0 0 . . f (.) vJ 
y  Γ[Wx]
Network
Example 1:
 x1 
x   x2 
 1
 v1   5 3 5   x1  5 x1  3 x2  5 x3 
v    0 1 2   x     x  2 x 
 2   2  2 3 
 v3   9 1 0   1  9 x1  x2 
 y1  sgn(5 x1  3x2  5 x3 ) 
 y    sgn( x  2 x ) 
 2  2 3 
 y3   sgn(9 x1  x2 ) 
Two-Layer Feedforward Neural
Network
Example 1: Design a neural network such that the network
maps the shaded region of plane x1, x2 into y = 1, and it
maps its complement into y = -1, where y is the output of the
neural network. In summary, the network will provide the
mapping of the entire x1, x2 plane into one of the two points
±1 on the real number axis.
Network
Solution: The inputs to the neural network will
be x1, x2 and the threshold value -1. Thus the
input vector is given as:
 x1 

x   x2 
 1
Network
The boundaries of the shaded region are given
by the equations:
x1  1  0
x1  2  0
x2  0
x2  3  0
Network
The shaded region satisfies the inequalities:
x1  1 x1  1  0
x1  2  x1  2  0
or
x2  0 x2  0
x2  3  x2  3  0
Network
These inequalities may be implemented using
four neurons:
Network
The equations for the first layer are obtained as:
 v1   1 0 1 
v   1 0 2   x1 
 2     x2 
 v3   0 1 0   
     1
v4   0 1 3
y  sgn( x1  1) sgn(  x1  2) sgn( x2 ) sgn(  x2  3)

t
where binary (threshold or hard limiter )activation

function, i.e., discrete perceptron is used.
Network
Let us discuss the mapping performed by the first
layer. Note that each of the neurons 1 through 4
divides the plane xl ,x2 into two half-planes.
 The half-planes where the neurons' responses are

positive (+ 1) have been marked with arrows pointing
toward the positive response half plane.
 The response of the. second layer can be easily
obtained as
y  sgn( y1  y2  y3  y4  3.5)
Network
The resultant neural network
The Perceptron Training Algorithm
For the development of the perceptron

learning algorithm for a single-layer
perceptron, we find it more convenient
to work with the modified
signal-flow graph model given here. In this second model,
which is equivalent to that of the previous figure, the threshold
 is treated as a synaptic weight connected to a fixed input
equal to -1. We may thus define the (p + 1)-by-1 input vector and
the corresponding weight vector as:
x  [ x1 x2 ... ... xn 1]t w  [ w1 w2 ... ... wn  ]t
 These vectors are respectively called the augmented input vector
and the augmented weight vector.

 For fixed n, the equation wtx = 0, plotted in

p-dimensional space with coordinates xl; xz, ... , xp, defines a
hyperplane as the decision surface between two different classes of
inputs.
Suppose then the input variables of the single-layer perceptron

originate from two linearly separable classes that fall on the
opposite sides of some hyperplane. Let Xl be the subset of training
vectors xl(1), xl(2), ... that belong to class C1, and let X2 be the
subset of training vectors x2(1), x2(2), ...
that belong to class C2. The union of Xl and X2 is the complete
training set X.
 Given the sets of vectors

Xl and X 2 to train the classifier,
the training process involves the adjustment of the
weight vector w in such a way that the two classes Cl and
C2 are separable.
 These two classes are said to be linearly separable if a

realizable setting of the weight vector w exists.
 Conversely, if the two classes Cl and C2 are known to be

linearly separable, then there exists a weight vector w
such that we may state:
w x - 0
t for every input vector x belonging to class C1
w x0
t
for every input vector x belonging to class C2
 Given the subsets of training

 vectors X1 and X2,
the training problem for the elementary ·perceptron
is then to find a weight vector w such that the two
inequalities above are satisfied.
 However, until this is achieved in the itermediate steps
we will have
wx
t
- 0 for some input vectors x belonging to class C2
w x0
t
for some input vector x belonging to class C1
In the former case therefore wt x will
be reduced until wt x  0 is achieved,

and in the latter case wt x will be

increased until wt x  0 is reached.
Here we will begin to examine neural

network classifiers that derive their
weights during the learning cycle.
• The sample pattern vectors x1 , x2, ... , xp, called
the training sequence, are presented to the
machine along with the correct response.
• The response is provided by the teacher and

specifies the classification information 'for
each mput vector. The classifier modifies its
parameters by means of iterative, supervised
learning.
 The network learns from 'experience

by comparing the targeted correct
response with the actual response.
 The classifier structure is usually

adjusted after each incorrect response
based on the error value generated.
 Let us now look again at the dichotomizer
introduced and defined earlier.
 We will develop a supervised training procedure

for this two-class linear classifier.
 Assuming that the desired response is provided,

the error signal is computed.
 The error information can be used to adapt the

weights of the discrete perceptron.
 First we examine the geometrical conditions in
the augmented weight space.
 This will make it possible to devise a

meaningful training procedure for the
dichotomizer under consideration.
 The decision surface equation in n+1

dimensional augmented pattern space is
w x0
t
When the above equation is considered in the

pattern space then it is written for fixed weights
w(1), w(2),……….,w(k). Therefore the variables
of the function f(wt(i )x) are x1, x2,……….,xn+1 ,
the components of the pattern vector.
x2
w (i )
x1
f(wt(i )x)=0
The normal vector w(i ) (weight wector) points

toward the side of the pattern space for which
wt(i )x > 0, called the positive side.
When the above equation is considered in the

weight space then it is written for fixed patterns
x(1), x(2),……….,x(p). Therefore the variables of
the function f(wtx(i)) are w1, w2,……….,wn+1 , the
components of the weight vector.
w2
f(wtx(i))=0
w1
x ( i)
The normal vector x(i) (pattern vector) points

toward the side of the weight space for which
wtx(i) > 0, called the positive side.
 f ( w1 , w2 ,....., wn1 ) 
 w1 
   x1 (i ) 
 f ( w1 , w2 ,....., wn1 )  
x2 (i ) 
∇
f ( w t x (i ))  f ( w1 , w2 ,....., wn1 )   w2  
 
 x( i )
   
   xn1 (i ) 
 f ( w1 , w2 ,....., wn1 ) 
 wn1 
In further discussion it will be understood that the

normal vector will always point toward the side of
the space for which wtx> 0, called the positive side,
or semispace, of the hyperplane.
Decision hyperplane in augmented weight space

for a five pattern set from two classes
 Note that the vectors x(i) points toward the
positive side of the decision hyperplanes
wtx(i)= 0.
 By labeling each decision boundary in the

augmented weight space with an arrow
pointing into the positive half-plane, we can
easily find a region in the weight space that
satisfies the linearly separable classification.
 To find the solution for weights, we will

look for the intersection of the positive
decision regions due to the prototypes
of class 1·and of the negative decision
regions due to the prototypes of class 2.
 Inspection of the figure reveals that the

intersection of the sets of weights
yielding all five correct classifications
of depicted patterns is in the shaded
region of the second quadrant as shown
in the figure above.
 Let us now attempt to arrive iteratively

at the weight vector w located in the
shaded weight solution area.
 To accomplish this, the weights need

to be adjusted from the initial value
located anywhere in the weight space.
 This assumption is due to our ignorance of

the weight solution region as well as weight
initialization.
 The adjustment discussed, or network
training, is based on an error-correction
scheme.
 At this point we will introduce the Perceptron
Learning (Traning) Rule (Algorithm).
 The perceptron learning rule is of central

importance for supervised learning of neural
networks.
 The weights are initialized at any values in this

method
 A neuron is considered to be an adaptive

element. Its weights are modifiable
depending on the input signal it receives,
its output value, and the associated
teacher (supervisor) response.
The weight vector is changed according to the
following:
w(i  1)  w(i )  w(i )

where
w(i )  cr  w(i ), x(i ), d (i )  x(i )
and
• d(i) is the teacher’s (supervisor’s) signal
• r is the learning signal
•c is a positive number called the learning
constant depending on the sign of r.
Here we have used

 f ( w t x( i )) 
  w 
 1   x (i ) 
∇   ( w x( i ))   1 
t
   x2 (i ) 
w2   x( i )
t
(w x( i )) = 

   
  x
 n1 
(i )
  ( w t x( i )) 
 w 
 n 1 
This reveals that the change in the weight

vector is in the direction of steepest ascent
(or descent)of wtx(i).
Perceptron Learning (Traning) Rule (Algorithm).
In this case the learning signal is defined as:
r (i )  d (i )  y (i )
where d(i) is the desired output signal and y(i) is
the actual output signal for the input pattern x(i)
given by:
y (i )  sgn( w (i) x(i) )
t
The weight adjustment is given by:
w(i )  c[d (i )  sgn( w (i ) x (i ))] x (i )

t
d = 1, i.e.,class 1 is input :
1) y =sgn( wt x)  1,i.e., the input is misclassified  r  d  y  1  (1)  2;
the correction is in the direction of steepest ascent and given as w(i)  2c x(i)
2) y =sgn( wt x)  1,i.e., the input is correctly classified  r  d  y  1 1  0; no correction
d = -1, i.e.,class 2 is input :
1) y  sgn( wt x)  1,i.e., the input is correctly classified  r  d  y  1  (1)  0; no correction
2) y  sgn(wt x)  1,i.e., the input is misclassified  r  d  y  1  (1)  2;
the correction is in the direction of steepest decent and given as w(i)  2c x(i)
 EXAMPLE:
 The trained classifier should provide the
following classification of four patterns x
with known class membership d:
 x(1)= 1, x(3) =3, d1= d3 = 1: class C1

 x(2)= -0.5, x(4)=-2, d2 = d4 = -1: class C2
The augmented input vectors are given as:

1  0.5  3  2 
x(1)    , x(2)    , x(3)    , x(4)   
1  1  1 1
x( x x(1) x

Let us choose an arbitrary augmented weight vector of
 2.5
w(1)   
 1.75 
With x (1) being the input , we obtain
1
wt (1) x(1)   2.5 1.75    0.75  0
1
and binary activation function (discrete

perceptron)
sgn( wt (1) x(1))  1
Hence x(1) is classified as being in class C2. However
this is not true. Therefore a correction has to be made.
The question to be asked at this point is: How do we make
this correction? The answer depends on which training
algorithm used. Since sgn{wt(1)x(1)}=-1, one thing is
certain,however, is that the correction should me made in
such a way that
t
wx
increases. In order that we achieve this we must first find
out if there is a direction in which the decrease or, for that
matter, increase takes place. To show this, let us consider
the surface given by :
z  f ( w1 , w2 ,....., wn1 )
We can write:
f ( w1 , w2 ,....., wn1 ) f ( w1 , w2 ,....., wn1 ) f ( w1 , w2 ,....., wn1 )
df ( w1 , w2 ,....., wn1 )  dw1  dw2  .....  dwn1
w1 w2 wn1
Let us now restrict ourselves to the case of

3 dimensions, namely,
z , w1 , w2 or more succinctly z , x, y
The Perceptron Training
Algorithm
Now consider the surface
z  f ( x, y )
If the level curves are interpreted as contour lines
of the landscape, i.e., of the surface, then along
these curves
z  f ( x, y )  constant
Algorithm
consequently, we obtain
dz  df ( x, y )  0
hence
f ( x , y ) f ( x , y )
df ( x , y )  dx  dy  0
x y
where dx and dy are the increments given to
x and y on the level curve.
Algorithm
Now defining
 f ( x , y ) 
 x  dx 
f ( x , y )   f ( x , y )  and dr   
  dy 
 y 
where
f and dr
are known to be the gradient vector and the tangent
vector,respectively,
we can write
df ( x , y )  f dr  0t
This means that the gradient vector and the

tangent vector are orthogonal vectors. Moreover
it can be shown that the gradient vector points in
the direction of steepest ascent of the function
f(x,y). Furthermore, the gradient is the rate of
climb in the direction of steepest ascent.
Now consider the surface
z  f ( x, y )  ( x  50)  ( y  50)  32
2 2 2
The following MATLAB program plots this

surface:
close all

clear all
for x=1:1:100;
for y=1:1:100;
f(x,y)=(x-50).^2+(y-50).^2-1024;
end
end
mesh(f);title('f(x,y)=x^2+y^2-1024');
figure,imshow(f,[ ],'notruesize');colormap(jet);
colorbar;title('f(x,y)=x^2+y^2-1024');
z=f(x,y)=(x-50)2+(y-50)2-322 z=f(x,y)=(x-50)2+(y-50)2-322
4000
3000
4000
3000 2000
2000
1000
1000
0 0
-1000
-1000
-2000
100
100 -2000
80 100
50 60
40 50
20 0 60 80 100
0 0 0 20 40
z=f(x,y)=(x-50)2+(y-50)2-322
3500
3000
2500
2000
1500
1000
500
-500
-1000
The level curves are obtained from
z  f ( x, y )  ( x  50)  ( y  50)  32  Ci
2 2 2
where Ci are constants.

y C3  16384
C2  9216
C1  4096
x
 f ( x, y ) 
 x   2( x  50)   dx 
f ( x, y )      and dr   dy 
 f ( x , y )   2( y  50)   
 y 
Considering the four quadrants of the circle:
Q.1  2( x  50)  0 Q.2  2( x  50)  0

2( y  50)  0 2( y  50)  0
Q.3  2( x  50)  0 Q.4  2( x  50)  0
2( y  50)  0 2( y  50)  0
the gradient vector points in directions as given
below:
Q.2 Q.1
Q.3 Q.4
The fact that the gradient vector is orthogonal to
the tangent vector proves that it is in the direction
of steepest ascent or steepest descent.
The directions found for the example show that
the gradient vector points in the direction of ascent
of the function f(x,y).
Combining the two facts we can conclude that it
points in the direction of steepest ascent.
 x ( 1 )  1  x ( 2 )   0.5  x1( 3 )  3  x1( 4 )   2
x( 1 )   1     , x( 2 )   1     , x( 3 )       , x( 4 )     
 x2 ( 1 ) 1  x2 ( 2 )  1   x2 ( 3 ) 1  x2 ( 4 )  1 
In the weight space the following straight lines
represent the decision lines:
1
w1 w2    w1  w2  0  w2   w1
1
 0.5
w1 w2    w1  0.5w2  0  w2  0.5w1
 1 
3
w1 w2    w1  3w2  0  w2  3w1
1
  2
w1 w2    w1  2 w2  0  w2  2 w1
1
Decision lines in
weight space w2
Initial weight vector
x(2) x(1) x(3)

x(4)
w1
3 1
4
Algorithm
The corresponding gradient vectors
are computed as follows:
 ( wt x( 1 )) 
   x ( 1 ) 1
 w
for x( 1 ) w x( 1 )  w1  w2  ( w x( 1 ))  
t t 1   
1
  x( 1 )  1
 ( w x( 1 ))   x1( 1 )
t

 w 
 2 
 ( wt x( 2 )) 
   x ( 2 )  0.5
 w
for x( 2 ) w x( 1 )  0.5w1  w2  ( w x( 2 ))  
t t 1   
1
  x( 2 )   1 
 ( w x( 2 ))   x1( 2 )
t
 
 w 
 2 
 ( wt x( 3 )) 
   x ( 3 ) 3
  w 
for x( 3 ) w x( 3 )  0.5w1  w2  ( w x( 3 )) 
t t 1
 1
  x( 3 )  1
 ( w x( 3 ))   x1( 3 )
t
 
 w 
 2 
 ( wt x( 4 )) 
   x ( 4 )   2
 w
for x( 4 ) w x( 4 )  2 w1  w2  ( w x( 4 ))  
t t 1   
1
  x( 4 )  1
 ( w x( 4 ))   x1( 4 )
t
 
 w 
 2 
wtx(3)<0 wtx(3)>0
Decision lines and
gradient vectors in
w2
weight space
Initial weight vector wtx(2)>0
wtx(2)<0
 2.5
w(1)    x(2)
 1.75  x(1) x(3)
x(4)
w1
2 wtx(4)>0 wtx(1)>0
wtx(4)<0
wtx(1)<0
3 1
4
Now we can concentrate on the particular training
(or learning) algorithm (or rule).

This is a supervised learning algorithm. This
means that at each step the correction is made
according to the directive given by the supervisor
as shown in the following figure.
yi
x
di
Weight learning rule: di is provided only in the

case of supervised learning
Now consider
r  d i  sgn( w x )t
Since
d i  1 , sgn( w x )  1
t
r can take on one of the three values:

 2,  2, 0
d i  1, sgn( w x )  1  r  2
forIn fact t
for d i  1, sgn( w x )  1  r  2

t
d i  1, sgn( w x )  1  r  0
for Since t
and d i  1, sgn( w x )  1  r  0

t
Therefore we can define the correction rule in terms

of the correction amount at the nth step as follows:
Δwi ( n )   ( n )( d i ( n )  sgn( w ( n )x( n )))( w ( n )x( n ))
t t
Δwi ( n )   ( n )( d i ( n )  sgn( w ( n )x( n )))x( n )

t
yi
di
Algorithm
In order for the correct cllasification of the entire
training set x( 1 ), x( 2 ), x( 3 ), and x( 4 )
with respective class memberships
d ( 1 )  1, d ( 2 )  1, d ( 3 )  1; and d ( 4 )  1
the following four inequalities must hold:
w ( N )x( 1 )  0
t
w ( N )x( 2 )  0
t
wt ( N )x( 3 )  0
wt ( N )x( 4 )  0
where w(N) is the final weight vector that provides
correct classification for the entire training set.
Algorithm
 This means that after N-1 training steps

the weight vector w(N) ends up in the
solution area, which is the shaded area
in the following figure.
wtx(3)<0 wtx(3)>0 Weight Space
w2
Initial weight vector wtx(2)>0
 2.5 wtx(2)<0
w(1)    x(2)
 1.75  x(1) x(3)
x(4)
w1
wtx(1)>0
2
wtx(4)>0 wtx(4)<0
wtx(1)<0
1
4 3
Algorithm
 The training has so far been shown in the weight
space. This is achieved using the decision lines
defined by x(1), x(2), x(3) nd x(4). However, the
original decision lines determined by the
perceptron at each step are defined in the pattern
space as this enables the classification to be easily
seen. These decision lines are defined by w(1), w(2)
w(3) and w(4).
 In the following we show the correction steps of the
weight vector as well as the corresponding decision
surfaces in the pattern space.
Algorithm
 In the pattern space
w( 1 )x  0
determines the the decision line defined by
the initial weight vector
 2.5
w(1)   
 1.75 
as
 x1 
w (1) x   2.5 1.75    2.5 x1  1.75 x2  0  x2  1.429 x1
t
 x2 
Algorithm
The corresponding gradient vector is computed
as follows:
 ( wt ( 1 )x ) 
   w ( 1 )  2.5
 x
w ( 1 )x  2.5 x1  x2  ( w ( 1 )x  
t t 1   
1
  w( 1 )   1.75 
 ( w ( 1 )x )   w2 ( 1 )
t
 
 x2 
 
which is the initial weight vector. As the
gradient vector lies on the side of
w (1) x  2.5 x1  1.75 x2  0
t
where
w (1) x  2.5 x1  1.75 x2  0
t
Algorithm
 However, x(1) and x(3) have class 1 ,i.e.,

d1= d3=1 and x(2) and x(4) have class 2 ,i.e.,
d2= d4=-1.
 This means that x(1), x(2), x(3), x(4) all are
wrongly classified.
Algorithm
Weight Space Pattern Space
Initial Initial
 2.5 x 
wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
weight w( 1 )     x2  decision
 1.75 
vector line
Weight vector is orthogonal to corresponding decision line

w2 x2
wt ( 1 ) x  0 wt ( 1 ) x  0
w( 1 ) (is orthogonal todecion line)
x(2) x(1) x(3)
Initial weight vector w1 x(4) x1
 2.5
w(1)   
 1.75 
x 
wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
 x2 
Decision line for initial weight vector
Algorithm
Pattern x(1) is input
Initial x  Initial
 2.5 wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
weight w( 1 )   
 x2 
decision
 1.75 
vector line
Weight vector is orthogonal to corresponding decision line

Initial 1
First
decision wt x( 1 )  w1
1
w2    w1  w2  0  w2   w1
x( 1)    input
1 1
line vector
Decision line is orthogonal to corresponding input vector

Step 1:Pattern x(1) is input
w2 x2 1
wt ( 1 ) x  0 wt ( 1 ) x  0
wt x( 1 )  0 w( 1 )
x(1) x(2) x(1) x(3)
 2.5
w(1)    wtx(1)>0
 1.75 
wtx(1)<0 Line 1 is
wt x( 1 )  w1
1
w2    w1  w2  0  w2   w1 decision line
1 x 
wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
 x2 
Algorithm
Step 1 (Update 1): Pattern x(1) is input
Initial weight First input
vector vector Initial decision line
 2.5 1
w( 1 )    x( 1)     x 
wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
 1.75  1  x2 
1
y( 1 )  sgn( wt ( 1 )x( 1 ))  sgn(  2.5 1.75  )  1
1
d ( 1 )  y( 1 )  1  ( 1 )  2
Updated weight vector Updated decision line
 2.5 1  1.5 x 
wt ( 2 )x   1.5 2.75 1   1.5 x1  2.75 x2  0  x2  0.545 x1
w(2)  w(1)  x(1)        x2 
1.75  1  2.75 
Step 1 (Update 1): :Pattern x(1) is input
 1.5
w( 2 )   
 2.75  w2 w( 2 ) x2
wt ( 2 ) x  0
wt x( 1 )  0
x(1) x(2) x(1) x(3)

wt ( 2 ) x  0
 2.5
w(1)    wtx(1)>0
 1.75  2 Line 2 is
wtx(1)<0
decision line
1 x 
wt x( 1 )  w1 w2    w1  w2  0  w2   w1 wt ( 2 )x   1.5 2.75 1   1.5 x1  2.75 x2  0  x2  0.545 x1
1  x2 
Weight Space Algorithm Pattern Space
Weight Second
vector to be input
Decision line to be updated
updated vector
 1.5   0.5
w( 2 )    x( 2 )     x 
wt ( 2 )x   1.5 2.75 1   1.5 x1  2.75 x2  0  x2  0.545 x1
 2 .75   1   x2 
  0. 5
y( 2 )  sgn( wt ( 2 )x( 2 ))  sgn(  1.5 2.75  ) 1
 1 
d ( 2 )  y( 2 )  1  1 )  2
 Second update 
Updated decision line
 1.5  0.5  1  x 
w(3)  w(2)  x(2)      1   1.75 wt ( 3 )x   1 1.75 1    x1  1.75 x2  0  x2  0.57 x1
 2.75       x2 
w( 3 ) w2 x2
w( 3 )
wt x( 2 )  0 wt ( 3 ) x  0
wtx(2)>0 x(2) x(1)

x(2) x(3)
w1 x(4)
w t ( 3 ) x  0 x1
wtx(2)<0
Line 3 is
3 decision line
  0 .5 
w1 w2    w1  0.5w2  0  w2  0.5w1
 1 
Weight Third
vector to be input
updated vector
 1  3 x 
w( 3 )    x( 3 )    wt ( 3 )x   1 1.75 1    x1  1.75 x2  0  x2  0.57 x1
1.75 1  x2 

3
y( 3 )  sgn( w ( 3 )x( 3 ))  sgn(  1 1.75  )  1
t
1
d ( 3 )  y( 3 )  1  ( 1 )  2 Updated decision line

 1  3  2 
w(4)  w(3)  x(3)       x 
wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
1.75 1  2.75  x2 
w2 x2
w(4)
wt x( 3 )  0
wtx(3)>0 wt ( 4 ) x  0
x(3) x(2) x(1) x(3)
wt ( 4 ) x  0 4
wtx(3)<0
Line 4 is
3 decision line
w1 w2    3w1  w2  0  w2  3w1
1
Weight Fourth
vector to be input
updated vector
 2 
w( 4 )  
  2 x 
wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
 x( 4 )     x2 
2.75 1
  2
y( 4 )  sgn( w ( 4 ) x( 4 ))  sgn( 2 2.75  )  1
t
1
No update,
d ( 4 )  y( 4 )  1  ( 1 )  0
same decision line
 2 
w(5)  w(4)    x 
wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
 2.75  x2 
w2 x2
w(5) =w(4)
wt x( 4 )  0
wt ( 4 ) x  0
x(2) x(1) x(3)
x(4)
wt ( 4 ) x  0 5 =4
wtx(4)<0 Line 4 remains
wtx(4)>0
  2 decision line
w1 w2    2 w1  w2  0  w2  2 w1
1
Algorithm
Step 5 (Update5): :Pattern x(1) is input
Weight First
vector to be input
updated vector
 2  1 x 
w( 5 )  w( 4 )   x( 1)    wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
  x2 
2.75 1
1
y( 5 )  sgn( wt ( 5 ) x( 1 ))  sgn( wt ( 4 )x( 1 ))  sgn( 2 2.75  )  1
1
d ( 1 )  y( 5 )  1  1  0 No update,
 2 
same decision line
w(6)  w(5)  w(4)    x 
wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
 2.75  x2 
Step 5 (Update5): :Pattern x(1) is input
w2 x2
w(6) =w(5)= w(4)
wt x( 1 )  0
wt ( 4 ) x  0
x(1) x(2) x(1) x(3)
wtx(1)<0 wtx(1)>0 wt ( 4 ) x  0 6 5 4
Line 4 remains
1 decision line
w1 w2    w1  w2  0  w2   w1
1
Algorithm
Weight Second
vector to be input
updated vector
 2    0.5 x 
wt ( 4 )x  2 2.75 1   2 x1  2.75 x2  0  x2  0.73 x1
w( 6 )  w( 5 )  w( 4 )  
.  x( 2 )     x2 
 2 75  1 
 0.5
y (6)  sgn( wt (6) x(2))  sgn( wt (6) x(2))  sgn( 2 2.75   ) 1
 1 
d ( 2 )  y( 6 )  1  1 )  2 Updated decision line
 2   0.5  2.5  t x1
w( 7 )  w( 4 )  x( 2 )          w ( 7 )x  2.5 1.75x   2.5x1 1.75x2 0x2  1.43x1
2.75  1  1.75  2
w2 x2
wt x( 2 )  0 w(7)
w(7)
wtx(3)>0 wt ( 7 ) x  0
x(3) x(2) x(1) x(3)
wtx(3)<0 7
wt ( 7 ) x  0
  2
w1 w2    2 w1  w2  0  w2  2 w1
1
w2 x2
w(4)
wt x( 3 )  0
wt ( 3 ) x  0
wtx(3)>0
x(2) x(1) x(3)
x(2)
w1 x(4)
w t ( 3 ) x  0 x1
wtx(3)<0
3
3
w1 w2    3w1  w2  0  w2  3w1
1
Algorithm
Step 7:Pattern x3 is input

 3
y (7)  sgn( wt (7) x(3))  sgn( 2.5 1.75   )  1
1
d ( 3 )  y( 7 )  1  1  0
 2.5 
w(8)  w(7)   
1.75
Algorithm
Step 8:Pattern x4 is input

  2
y( 8 )  sgn( w ( 8 )x( 4 ))  sgn( w ( 7 )x( 4 ))  sgn( 2.5 1.75  )  1
t t
1
d ( 4 )  y( 7 )  1  ( 1 )  0
 2.5 
w(9)  w(8)  w(7)   
1.75
Algorithm

1
y( 9 )  sgn( w ( 9 )x( 1 ))  sgn( w ( 7 )x( 1 ))  sgn( 2.5 1.75  )  1
t t
1
d ( 1 )  y( 9 )  1  1  0
 2.5 
w(10)  w(9)  w(8)  w(7)   
1.75
Algorithm

1
y( 10 )  sgn( w ( 10 )x( 2 ))  sgn( w ( 7 )x( 1 ))  sgn( 2.5 1.75  )  1
t t
1
d ( 2 )  y( 10 )  1  1  2
 2.5 
w(11)   
1.75
Algorithm
w2 x2
wt x( 1 )  0
wt ( 1 ) x  0 wt ( 1 ) x  0
w( 1 )
x(1) x(2) x(1) x(3)
 2.5
w(1)    wtx(1)>0
 1.75 
wtx(1)<0
1
wt x( 1 )  w1 w2    w1  w2  0  w2   w1
1 x 
wt ( 1 )x   2.5 1.75 1   2.5 x1  1.75 x2  0  x2  1.429 x1
 x2 
The initial weight vector w(1) and the weight
vectors w(2)- w(11) obtained during the training
algorithm are given below:
 2.5  1.5  1   2 
w(1)    , w(2)    , w(3)    , w(4)    ,
1.75   2.75  1.75  2.75
 2.5 
w(5)  w(4), w(6)  w(5)  w(4), w(7)    ,
1.75
 3 
w(8)  w(7), w(9)  w(8)  w(7), w(10)  w(9)  w(8)  w(7), w(11)   
0.75
As can be seen from these vectors, out of the ten

vectors w(2)- w(11) only five are different .
These five vectors are given in the MATLAB plot
below:
3
2.5
1.5
0.5
-0.5
-1
-1.5
-2
-8 -6 -4 -2 0 2 4 6 8 10 12
x3
Example: The trained (1,0,1)
(0,0,1)
classifier is required
to provide the
(0,0,1)
classification such
that the yellow vertices (0,1,1) x1
of the cube have class 0
membership d=1 and (0,0,0) (1,0,0)
the blue vertices have
class membership (0,1,0)
d=2. (1,1,0)
x2
x3=0.25 plane
SUMMARY OF CONTINUOUS PERCEPTRON TRAINING ALGORITHM
Given are the p training pairs
{x1,d1, x2,d2,…………….,xp,dp}, where xi is (N+1) x 1
Di is 1 x 1, i=1,2, ,P. In the following n denotes
the training step and p denotes the step counter
within the training cycle.
Step 1: c>0 is chosen.
Step2: Weights are initialized at w at random small
values, w is (N+1) x 1. Counters and error are
initialized.
1  k ,1  p and 0  E
Step3: The training cycle begins here. Input is
presented and output is computed:
x p  x, d p  d , y  sgn( w x)
t
Step4: Weights are updated:

→ 1
x p  x, w w  c(d  y ) x
2
Step5: Cycle error is computed.
1
E  (d  y )2
2
Step 6: If p<P then p  p  1, n  n  1
and go to Step 3, otherwise go to Step 7.
Step 7: The training cycle is completed. For E=0,
terminate the training session. Output weights, k
and E. If E>0 then enter the new training cycle
by going to Step 3.
Single-Layer Continuous
Perceptron
 Here the activation function is a continuous function of
the weights instead of the signum.
 There are two main objectives of this:
1 To define a continuous function of the weights as the
error function so as to obtain finer control over the
weights as well as over the whole training procedure;
2 To enable the computation of the error gradient in
order to be continuously in a position to know the
direction in which the error decreases.
Training Rule for a Single-Layer
Continuous Perceptron:The Delta
Training Rule
The Delta Training Rule is based on the
minimisation of the error function which is given by
1
E( n )  ( d ( n )  y( n ))2
2
where n is a positive integer representing the traning
step number, i.e.,the step number in the minimisation
process, d(n) is the desired output signal and
y (n)  f [ w (n) x(n)]

t
is the actual output.

Training Rule
The error function (error surgface) is a function
of the weights: E(w1,w2,....,wp)=E(w) which is
minimised using an iterative minimisation
method which computes the new values of the
weights according to
w(n  1)  w(n)  w(n)
where w(n) is the increment given to the
present weight vector w(n) to obtain the new
weight vector w(n+1).
Training Rule
Let us now use the steepest descent method for the
minimisation of the error function E(w) where it is
required that the weight changes be in the negative
gradient direction. Therefore we take:
w(n)  E ( w(n))

∇
where E(w(n)) is the gradient vector and  is called the

learning constant. Using this in the equation above, we obtain
w( n  1 )  w( n )  E( w( n ))
Training Rule
E( w( n ))  E( n )
is the error surface at the n’th training step .
Therefore the error to be minimised is:
1
E( n )  ( d ( n )  f ( wt ( n )x( n ))2
2
The independent variables for minimisation at
each training step are wi, the components of the
weight vector.
Training Rule
The error minimisation requires the computation

of the gradient of the error function:
1 2
E( w )w w( n )    ( d ( n )  f ( w x( n )) 
t
2  w w( n )
Training Rule
The gradient vector is defined as:
 E 
 w 
 1

 E 
 w2 
E ( w )  
. 
 
 . 
 E 
 w 
 p 1 
Training Rule
 Using
1 2
E( w )w w( n )    ( d ( n )  f ( w x( n )) 
t
2  w w( n )
and defining
v( w )  w x t
we obtain
Training Rule
  v( w )  
  w  
  1

  v( w )  
 df ( v( w ))  w2  
E( w )ww( n )   d ( n )  f ( v( w ))  .  
 dv  
  . 
  v( w )  
  w  
  p 1   w w( n )
Training Rule
Since
v( w)
 xi
wi
and
f(v) y
we can write
Training Rule
df (v( w)) 
E ( w)w w( n )    d ( n)  y ( n)   x ( n)
dv  ww ( n )
and
 E ( w)   df (v( w)) 
     d ( n)  y ( n)    x i ( n)
 wi  ww( n )  dv  ww ( n )
Training Rule
If bipolar continuous activation function is used
then we have:
v
1 e
f (v )  v
1 e
and
v
df (v) 2e
 v 2
dv (1  e )
Training Rule
In fact
df (v) 2ev
1  v 2 
 1 e  1
 
 1  v 
  
1  f 2
(v) 
dv 1  e 
v 2 2   1 e   2
 
Training Rule
E( w )w w( n )  
  d ( n )  y( n )1  y 2 ( n ) x( n )
1
2
Training Rule
Conclusion: The delta training rule for the

bipolar continuous perceptron is given as:
1
2
 2

w( n  1 )  w( n )   d ( n )  y( n )1  y ( n ) x( n )
Training Rule
If unipolar continuous activation function is used
then we have:
1
f (v )  v
1 e
and
v
df (v) e
 v 2
dv (1  e )
Training Rule
we can write
df (v) e v 1 1  ev  1 1 1
   (1  )
1  e  1  e 1  e
v v v v
dv  v 2
1 e 1 e
=f (v)(1  f (v))
Training Rule
Example:We will carry out the same
training algorithm as in the previous
example but this time using a continuous
bipolar perceptron.
The error at step n is given by:
2 2
1 1 e 
 v( n )
1  2 
E ( n )  d ( n )  v( n ) 
 d ( n )    v( n )
 1 
2 1 e  2 1  e 
Training Rule
For the first pattern x(1)=[1 1]t , d(1)=1.
The error at step 1 is given by:
2
1  2 
E (1)  d (1)    ( w1  w2 )
 1  
2 1  e 
2 2
  1  2e 
 ( w1  w2 )
1  2 2
1    ( w1  w2 )
 1     ( w1  w2 ) 
 ( w1  w2 ) 2
2  1  e  2 1  e  (1  e )
Training Rule
For the second pattern x(2)=[-0.5 1]t ,
d(2)=-1.
2
1  2 
E ( 2 )  d ( 2 )   ( 0.5 w1  w2 )
 1  
2 1  e 
2
1  2
2
 2 1  2
 1    1    ( 0.5 w1  w2 ) 

2 1  e
( 0.5 w1  w2 )
 2 1  e  
1  e( 0.5 w1  w2 ) 
2
Training Rule
For the third pattern x(3)=[3 1]t , d(3)=1.

2
1  2 
E ( 3 )  d ( 3 )   ( 3 w1  w2 )
 1  
2 1  e 
2 2
1  2  1  2e 
( 3 w1  w2 )
2
 1    1    ( 3 w1  w2 ) 

2 1  e
( 3 w1  w2 )
 2 1  e  1  e( 3 w1  w2 )

2
Training Rule
For the fourth pattern x(4)=[-2 1]t , d(4)=-1.

2
1  2 
E ( 4 )  d ( 4 )   ( 2 w1  w2 )
 1  
2 1  e 
2
1  2
2
 2 1  2
 1    1    ( 2 w1  w2 ) 

2 1  e
( 2 w1  w2 )
 2 1  e  
1  e( 2 w1  w2 ) 
2
Training Rule
close all;
clear all;
[w1,w2] = meshgrid(-4:.1:4, -4:.1:4);
Z1 = exp(w1+w2);
E1=2./(1+Z1).^2
mesh(Z1)
Z2 = exp(.5*w1-w2);
E2=2./(1+Z2).^2
figure,mesh(Z2)
Z3 = exp(3*w1+w2);
E3=2./(1+Z3).^2
figure,mesh(Z3)
Z4 = exp(2*w1-w2);
E4=2./(1+Z4).^2
figure,mesh(Z4)
subplot(2,2,1),mesh(E1);title('Error surface for xt(1)=[1 1] and y=f(wt*x(1))');
xlabel('w1'),ylabel('w2'),zlabel('E1(w1,w2)');
Training Rule
subplot(2,2,2),mesh(E2);title('Error surface for xt(2)=[.5 -1] and y=f(wt*x(2))');
subplot(2,2,3),mesh(E3);title('Error surface for xt(3)=[3 1] and y=f(wt*x(3))');
subplot(2,2,4),mesh(E4);title('Error surface for xt(4)=[2 -1] and y=f(wt*x(4))');
E = E1+E2+E3+E4;
figure,mesh(E);title('Total Error
E(w1,w2)=E1(w1,w2)+E2(w1,w2)+E3(w1,w2)+E4(w1,w2),MESH');
xlabel('w1'),ylabel('w2'),zlabel('E(w1,w2)');
figure,imshow(E,[],'notruesize');colormap(jet);title('Total Error
E(w1,w2)=E1(w1,w2)+E2(w1,w2)+E3(w1,w2)+E4(w1,w2),IMSHOW');
xlabel('w1'),ylabel('w2')
The error surfaces for the above four cases are
shown in the next slide:
Training Rule for a Single-Layer Continuous
Perceptron:The Delta Training Rule
Error surface for xt(1)=[1 1] and y=f(wt*x(1))Error surface for xt(2)=[.5 -1] and y=f(wt*x(2))
2 2
E2(w1,w2)
E1(w1,w2)
1 1
0 0
0 100
100 100
50 50 50 50
100 0 w2 w2 0 0 w1
w1
Error surface for xt(3)=[3 1] and y=f(wt*x(3))Error surface for xt(4)=[2 -1] and y=f(wt*x(4))
2
2
E4(w1,w2)
E3(w1,w2)
1
1
0 0
0 100 100
50 50 50 100
0 0 50
100 0 w2 w2 w1
w1
Training Rule
The total error is defined by:
E( w1 , w2 )  E1( w1 , w2 )  E2 ( w1 , w2 )  E3 ( w1 , w2 )  E4 ( w1 , w2 )
The total error surface is shown in the next

slide.
A contour map of the total error is depicted below:
The classifier training has been simulated for

 = 0.5 for four arbitrarily chosen initial weight
vectors, including the one taken from
x(1)= 1, x(3) =3, d1= d3 = 1: class C1

x(2)= -0.5, x(4)=-2, d2 = d4 = -1: class C2
The resulting trajectories of 150 simulated
training steps are shown in the following figure
(each tenth step is shown).
In each case the weights converge during training

toward the center of the solution region obtained
for the discrete perceptron case given on the next
slide, which coincides with the dark blue region in
the contour map of the total error is depicted before
and also shown on the next slide.
SUMMARY OF CONTINUOUS PERCEPTRON TRAINING ALGORITHM
Given are the p training pairs
{x1,d1, x2,d2,…………….,xp,dp}, where xi is (N+1) x 1
Di is 1 x 1, i=1,2, ,P. In the following n denotes
the training step and p denotes the step counter
within the training cycle.
Step 1: >0, =1 and Emax>0 chosen.
Step2: Weights are initialized at w at random small
values, w is (N+1) x 1. Counters and error are
initialized.
1  k ,1  p and 0  E
Step3: The training cycle begins here. Input is
presented and output is computed:
x p  x, d p  d , y  f ( w x )
t
Step4: Weights are updated:

1
x p  x, w  w   (d  y )(1  y 2 ) x
2
Step5: Cycle error is computed.
1
E  (d  y )2
2
Step 6: If p<P then p  p  1, n  n  1
and go to Step 3, otherwise go to Step 7.
Step 7: The training cycle is completed. For E<Emax
terminate the training session. Output weights, k
and E. If E>Emax then enter the new training cycle
by going to Step 3.
Delta Training Rule for
Multi-Perceptron Layer
x1 w11 v1 y1
w12 1
w21 v2
x2 2 y2
wK1
w1j
wjk vk yk
xj k
j-th
column w1J wK2
of nodes
wKJ vK
xJ K yK
k-th column of nodes
-1 Neurons
The above can be redrawn as:
v1  w11 x1  w12 x2  .........  w1 j x j  ......  w1J xJ y1  f ( v1 )
v2  w21 x1  w22 x2  .........  w2 j x j  ......  w2 J xJ y2  f ( v2 )
vl  wl1 x1  wl 2 x2  .........  wlj x j  ......  wlJ xJ yl  f (vl )
vK  wK 1 x1  wK 2 x2  .........  wKj x j  ......  wKJ xJ y K  f ( vK )

 v1   w11 w12 . . w1J   x1   y1   f ( v1 )    v1  
v  w     y   f ( v )  
 2   21 w22 . . w1J   x2    v2  
 2  2 
 .  . . . . .  .   .    .   Γ . 
          
 .  . . . . .  .   .   .   . 
vK   wK 1  yK   f ( vJ )  
wK 2 . . wKJ   xJ   v J  
v  Wx y  Γ(v)
 y1   f (.) 0 . . 0   v1 
y   0 f (.) . . 0  v 
 2   2 
 .  . . . . .  . 
    
 .   . . . . .  . 
 yK   0 0 . . f (.) vJ 
y  Γ[Wx]
 The desired and actual output vectors at the nth
training step are given as:
 d1 (n)   y1 (n) 
 d ( n)   y ( n) 
 2   2 
d  .  y . 
   
.  . 
 
 d K (n)   yK (n) 
The error expression for a single perceptron was given as:
1
E( n )  ( d ( n )  y( n )) 2
2
which can be generalised to include all squared errors at

the outputs k=1,2,.....,K
1 K 1
E( n )   ( d k ( n )  yk ( n ))  d( n )  y( n )
2 2
2 k 1 2
where n represents the n-th step which corresponds

to a specific input pattern that produces the output error.
 The updated weight value from input j to
neuron k at step n is given by:
wkj ( n 1 )  wkj ( n )  Δwkj ( n )
According to the delta training rule for
continuous perceptron
E  k  1,2 ,..., K 
Δwkj ( n )     for  
wkj   j  1,2 ,..., J 
wkj  wkj ( n )
where
E E vk
E  E (v( w))  w  v w
kj k kj
Using
vk  wk1 x1  wk 2 x2  .........  wkj x j  ......  wkJ xJ
we have
vk
 xj
wkj
The error signal term produced by the kth
neuron is defined as:
E
 yk 
vk
Using this yields
E
  yk x j
wkj
On the other hand we can write:
E E yk
 yk  
vk yk vk
Since
1 K 1
E( n )   ( d k ( n )  yk ( n ))  d( n )  y( n )
2 2
2 k 1 2
we get
E
 ( d k  yk )
yk
On the other hand using
yk f ( vk )

vk vk
yields
E E yk f ( vk )
 yk    ( d k  yk )
vk yk vk vk
which is used to obtain
f ( vk )
Δwkj ( n )   ( d k  yk ) xj
vk
For bipolar continuous activation function
we already know that
f ( vk ) 1
vk
  1
 1  ( f ( vk ))  1  yk
2
2
2
2

 Hence
1
Δwkj ( n )   ( d k ( n )  yk ( n ))( 1  ( yk ( n )) )x j
2
2
and
1
wkj ( n  1 )  wkj ( n )   ( d k ( n )  yk ( n ))( 1  ( yk ( n ))2 )x j ( n )
2
where
 yk ( n )  ( d k ( n )  yk ( n ))( 1 ( yk ( n )) ) 2
we can write
 w11( n  1 ) . w1J ( n  1 )    w11( n ) . w1J ( n )     y1 ( n ) 

 . . .    . . .      . x ( n ) . x ( n )
      1 J
 wK 1( n  1 ) . wKJ ( n  1 )   wK 1( n ) . wKJ ( n )   yK ( n )

 
Now defining
 x1 (n)  1 (n) 


x ( n)   .    
 ( n)   . 
 xJ (n)   J (n) 
W (n  1)  W (n)   δy x t
Generalised Delta Training Rule for
Multi- Layer Perceptron
z1 t11 u1 w11 v1 y1
1
x1
. v2
z2 t21 u2 wk1 y2
2
. x2 wK1
wK2
. t uj
zi j1 wkj vk yk
k
xj
.
. vK
tIJ wkJ yK
zI uJ xJ K
i-th column of nodes j-th column of nodes (hidden layer) k-th column of nodes
The weight adjustment for the hidden layer
according to the gradient descent method will be:
E   j  1,2 ,..., J 
Δt ji ( n )     for  
t ji 
t ji t ji ( n )
i  1,2 ,..., I 
where
E E u j

t ji u j t ji
Here
E
 xj   for j  1,2 ,...., J
u j
is the error signal term of the hidden layer
with output x. This term is produced by the j-th
neuron of the hidden layer, where j=1,2,....,J.
On the other hand, using
u j  t j1 z1  t j 2 z2  ...............  t jI z I
we can calculate
u j
t ji
as
u j
 zi
t ji
Therefore
E E u j
   xj zi
t ji u j t ji
and
Δt ji   xj zi
Since
xj  f ( uj )
E E x j
 xj   
u j x j u j
E  1
 2
K
   d k  f ( vk ) 
x j x j  2 k 1 
and
x j f ( u j )

u j u j
E K
f ( vk )
  ( d k  f ( vk )
x j k 1 x j
K
f ( vk ) vk
   ( d k  yk )
k 1 vk x j
Now using
vk  wk1 x1  wk 2 x2  .........  wkj x j  ......  wkJ xJ

we have
vk
 wkj
x j
Now using this equality and
E f ( vk )
 yk   ( d k  yk )
vk vk
in
E K
f ( vk )
  ( d k  f ( vk )
x j k 1 x j
K
f ( vk ) vk
   ( d k  yk )
k 1 vk x j
we obtain
E K
   yk wkj
x j k 1
vk
Now using this and  wkj in
x j
E E x j
 xj   
u j x j u j
we obtain
f ( u j ) K
 xj 
u j

k 1
yk wkj
Now using
Δt ji   xj zi
we get
f ( u j ) K
Δt ji   zi   yk wkj
u j k 1

K
 f ( u j )
t ji ( n  1 )  t ji ( n )      yk wkj  zi
 k 1  u j
for j  1,2 ,..., J
i  1,2 ,..., I
Now defining the j-th column of the matrix
 w11 w12 . . w1J 
w w22 . . w1J 
 21
W  . . . . . 
 
 . . . . . 
 wK 1 wK 2 . . wKJ 
as
wj
and using
  y1 
 
δy   . 
 yK 
 
we can write
K

k 1
yk wkj  w δy t
j
In the case of bipolar activation function we
obtain for the hidden layer
f ( u j ) 1
 f  (1  x j )
' 2
xj
u j 2
Now construct a vector whose entries are the
above terms for j=1,2,...,J, i.e.,
1 2 
( 1  x )
 f x1   2
' 1 
 '  1 
f
 x 2   ( 1  x2 )
2
fx    
'  2 
   
   
 f '  1 
 xJ   ( 1  x 2 )
J
2 
and define
 z1 
z 
 2
z .
 
.
 z I 
We then have
 K
 '
   yk wkj  f j zi  ( w j δy ) f x z
t '
 k 1 
Now defining
δx  w δy f t
j
'
x
and
 t11 t12 . . t1I 
t . . t1I 
 21 t22
T  . . . . .
 
 . . . . .
t J 1 t J 2 . . t JI 
we finally obtain
T ( n 1 )  T ( n )  δx z t
This updating formula is called the Generalised

Delta Rule for adjusting the hidden layer weights.
A similar formula was given for updating the
output layer weights:
W (n  1)  W (n)   δy x t
Here the main difference is in computing the error
signals y and x. In fact, the entries of y are given
as
f ( vk )
 yk  ( d k  yk )
vk
which only contain terms belonging to the output
layer. However, this is not the case with x ,
δx  w δy f t
j
'
x
f x'
whose entries are weighted sum of error

signals yk produced by the following layer.
Here we can draw the following conclusion:
The Generalised Delta Learning Rule propagates
the error back by one layer which is true for every
layer.
- Summary of the Error Back-Propagation
Training Algorithm (EBPTA) Given are P
training pairs
where zi is (1 Xl), di is (K X 1), and i = 1, 2, ... ,

P. Note that the l'th component of each zi is of
value -1 since input vectors have been
augmented. Size J - 1 of the hidden layer
having outputs y is
Error Back-Propagation Training
Algorithm (EBPTA)
selected. Note that the J'th component of y is of
value -1, since hidden layer outputs have also
been augmented; y is (J X 1) and 0 is (K Xl).
Algorithm (EBPTA)
Step 1: 'T/ > 0, Emax chosen .
Weights W and V are initialized at small random
values; W is (K X J), V is (J X /).
q f-- 1, p f-- 1, E f-- 0
. Step 2: Training step starts here
(See Note 1 at end of list.)
Input is presented and the layers' outputs computed
[f(net) as in (2.3a) is used]:
Algorithm (EBPTA)
Z f-- zP' d f-- dp
Yj f-- f(vjz), for j = 1, 2, ... , J where ~'" a column
vector, is the j'th row of V, and J .
ok f-- f(w~y), . for k = 1, 2, ... , K
where Wk, a column vector, is the k'th row of W.
Step 3: Error value is computed:
1 2
E f-- '2(dk - Ok) + E, for k = 1, 2, ... , K
Algorithm (EBPTA)
Step 4: Error signal vectors 80 and 8y of
both layers are computed.
Vector 80 is (K Xl), 8y is (J X 1). (See
Note 2 at end of list.) The error signal
terms of the output layer in this step are
Algorithm (EBPTA)
1 2
Ook = -(dk- 0k)(l - Ok)' for k = 1, 2, ... , K 2
The error signal terms of the hidden layer in this step
are
12K
0yj = _ (1 - Yj) L 0okWkj, for j = 1, 2, ... , J
k=l
Step 5: Output layer weights are adjusted:
Wkj f-- Wkj + T/OokYj, for k = 1, 2, ... , K and j = 1, 2,
... ,J
Algorithm (EBPTA)
Step 6: Hidden layer weights are adjusted:
Vji f-- Vji + T/OyPi' for j = 1, 2, ... ,J and i = 1,
2, ... ,I
Step 7: If p < P then p f-- P + 1, q f-- q + 1, and

go to Step 2; otherwise, go to Step 8.
Algorithm (EBPTA)
Step 8: The training cycle is completed.
For E < Emax terminate the training session. Output
weights W, V, q, and E.
If E > Emax' then E f- 0, P f- 1, and initiate the new
training cycle by going to Step 2.
NOTE 1 For best results, patterns should be
chosen at random
from the training set Qustification follows in Section
4.5).
Algorithm (EBPTA)
11IIIII NOTE 2 If formula (2.4a) is used in Step

2, then the error
signal terms in Step 4 are computed as follows
80k = (dk - 0k)(l - 0k)ok' for k = 1, 2, ... , K K
8y) = y}(l - y) L 8okWk}' for j = 1, 2, ... , J k=l
The Hopfield Network
 The Hopfield Network
We know that the Hopfield Network is a
Recurrent (Feedback or Dynamical) Neural
Network.
Let yi , i=1,2,.....,n, be the outputs of the
network and the energy function E satisfy the
following:
2
dE n
 dy i 
   i   0
dt i 1  dt 
where i>0 , i=1,2,.....,n.
 The above inequlity reveals that the energy
decreases with time and becomes zero if and
only if
dyi
 0 i ,
dt
i.e.,
yi ( t )  cons tan t i ,
i.e.,
yi ( t ) reach their stable equilibrium states .

Now let us assume that
1
df ( yi )
 i  Ci
dyi
where
Ci  0
For the bipolar activation function
 ax
1 e
y  f(x)  ax
1 e
the inverse function is given by:
1 1 1 y
x  f ( y )   ln
a 1 y
The Bipolar Activation Function and its Inverse
The Derivative of the Inverse of the Bipolar Function
dx 2 1

dy a 1  y 2
We can conclude that
df 1( yi )
0 for  1  yi  1
dyi
Therefore
2
dE  dyi  df ( yi )
n -1
  Ci   0
dt i 1  dt  dyi
Considering
dxi df ( yi ) df ( yi ) dyi
-1 -1
 
dt dt dyi dt
we obtain
2
dE  iy f y n
dxi dyi
n -1
d d ( )
  Ci  
i
  Ci
dt i 1  dt  dyi i 1 dt dt
Now defining
 x1   y1 
x  y 
 2  2
x . 
 
y . 
 
C  diag ( Ci )
 .  . 
 xN   y N 
yields
t t
dE dy d x  dx  dy
 C   C 
dt dt dt  dt  dt
Since
dE dy
  E( y )
t
dt dt
We can write
dx
 E ( y )  C
dt
This reveals that the capacitor current vector
is parallel to the negative gradient vector.
N
dxi
Ci   wij ( y j  xi )  g i xi  I i
dt j 1
wi1
y1 Ii
xi yi
wi2
y2
gi Ci
.
.
.  N 
wiN dxi N
  yi  f ( xi ) 
yN Ci   wij y j    wij  g i xi  I i
dt j 1  j 1  xi  f 1( yi )
 
Now define
N
w
j 1
ij  g i  Gi
C  diag ( Ci ) G  diag ( Gi ), i  1,2,..., N

 w11 w12 . . w1N   x1   I1 
w w22 . . w2 N  x  I 
 12  2  2
W  . . . . .  x .  I  . 
     
 . . . . .   .  .
 w1N w2 N . . wNN   xN   I N 
We obtain N
dxi
Ci   wij y j  Gi xi  I i
dt j 1
consequently
dx( t )
C  Wy  Gx  I
dt
and since
dx
 E ( y )  C
dt
we obtain
 E( y )  Wy  Gx  I
In the case of bipolar activation function we know that
1 1 1 y
x  f ( y )   ln
a 1 y
Therefore the state vector is given as:
 1  y1 
ln
 1 y 
 1

 ln 1  y2 
1  1  y2 
x  
a 
 
 1  yN 
ln 1  y 
 N 
We already know
N
dE dxi dyi
  Ci
dt i 1 dt dt
therefore
N N
dE dyi
  ( wij y j  Gi xi  I i )
dt i 1 j 1 dt
and
N N
dE dyi N dyi N dyi
 ( wij y j   Gi xi   Ii )
dt i 1 j 1 dt i 1 dt i 1 dt
Now consider:
t
d dy dy
( y Wy ) 
t
Wy  y W
t
dt dt dt
If
W W
t
then
t
d dy dy
( y Wy ) 
t
W y yW
t t
dt dt dt
t t
dy dy dy t dy
W y
t
Wy  ( y W
t
)  yW
t
dt dt dt dt
Therefore
d dy
( y Wy )  2 y W
t t
dt dt
and
dy 1 d
t
yW  t
( y Wy )
dt 2 dt
Now consider the first term of
N N
dE dyi N dyi N dyi
 ( wij y j   Gi xi   Ii )
dt i 1 j 1 dt i 1 dt i 1 dt
We can write:
N N
dyi dy

i 1 j 1
wij y j
dt
 yW
t
dt
Now using the above equality, we have
N N
dyi 1 d t

i 1 j 1
wij y j 
dt 2 dt
(y Wy)
Now consider the second term in the same equation:
dyi 1 dyi
xi  f ( yi )
dt dt
we can write
yi
d
1
f ( yi )  (  f ( y )dy )
1
dyi 0
yi yi
dyi d dyi d
1
f ( yi )  (  f ( y )dy )
1
 (  f ( y )dy )
1
dt dyi 0 dt dt 0
N yi N
dE d 1 t
  ( y Wy   Gi  f ( yi )dy   I i yi )
1
dt dt 2 i 1 0 i 1
N yi N
1 t
E  ( y Wy   Gi  f ( yi )dy   I i yi )
1
2 i 1 0 i 1
In order to obtain the state equations in terms
of the outputs yi consider once again
N
dxi
dt j 1
Using
dxi 2 1

dyi a 1  yi 2
we obtain
N
2 dy i
a (1  y i ) dt
2
j 1
and
dy i a (1  y i )
2 N
dt

2Ci
w
j 1
ij y j  Gi xi  I i
 a( 1  y ) 
dy
 
2
 diag  i
 Wy  GΓ (y)  I
1
dt  2Ci 
g12 g11 x1 y1
g1
C1
dx1
g1 x1  C1  g11 ( y1  x1 )+ g12 ( y2  x1 )
dt
g22
g21 x2
y2
g2 C2
dx2
g 2 x2  C2  g 22 ( y2  x2 )+g 21 ( y1  x2 )
dt
 dx1 
C1 0   dt   g11 g12   y1   g1  g11  g12 0   x1 
0       
 C1   dx2   g 21 g 22   y2   0 g 2  g 22  g 21   x2 
 dt 
which yields
C1 0   g11 g12   g1  g11  g12 0 

C  , W   , G
 0 C1   g 21 g 22   0 g 2  g 22  g 21 
and
yi
1  g11 g12   y1  2
E    y1 y 2        Gi  f  y  dy
1
2  g 21 g 22   y2  i1 0
 1  y1 
ln
 g11 g12   y1  1 G1 0   1  y1 
E ( y1 , y2 )          
 g 21 g 22   y2  a  0 G2   1  y2 
ln 1  y 
 2
and
 E   1 1  y1 
  y    g11 y1  g12 y2  a G1 ln 1  y 
E ( y1 , y2 )   1
   1

 E   1 1  y2 
  y    g 22 y1  g 21 y1  a G2 ln 1  y 
 2  2
y 1 y
2
1
E   ( g11 y1  g 22 y2  y1 y2 g12  y1 y2 g12 )  G1  f  y  dy  G2  f 1  y  dy
2 2 1
2 0 0
y y
1 1 1 1 y 2
1 y
E   ( g11 y1  g 22 y2  y1 y2 g 21  y1 y2 g12 )  G1  ln
2 2
dy  G2  ln dy
2 a 0 1 y 0
1 y
Now consider
yi yi yi
1 y
I   ln dy   ln( 1  y )dy   ln( 1  y )dy
0
1 y 0 0
and yi
I1   ln( 1  y )dy
0
ln( 1  y )  u  dy
du 
Let then 1 y
1 y  v
dv  dy
yi yi yi
I1   ln( 1  y )dy    udv  ( uv ]0yi   vdu )

Hence 0 0 0
yi
 dy
 { ln ( 1  y )}( 1  y )]   ( 1  y )
yi
1 y
0
0
I1  { ln ( 1  yi )}( 1  yi )  yi
I 2  {ln ( 1  yi )}( 1  yi )  yi
yi yi yi
1 y
I   ln dy   ln( 1  y )dy   ln( 1  y )dy
0
1 y 0 0
I  I1  I 2  { ln ( 1  yi )}( 1  yi )  {ln ( 1  yi )}( 1  yi )

y y
1 1 1 1 y 1 2
1 y
E   ( g11 y1  g 22 y2  y1 y2 g12  y1 y2 g12 )  G1  ln
2 2
dy  G2  ln dy
2 a 0 1 y a 0 1 y
1 1
E   {g11 y12  g 22 y22  (g12  g 21 )y1 y2 }  G1 ( 1  y1 ) ln( 1  y1 )  ( 1  y1 ) ln( 1  y1 )
2 a
1
 G2 ( 1  y2 ) ln( 1  y2 )  ( 1  y2 ) ln( 1  y2 )
a
Using
 a( 1  yi2 ) 
dy
dt

 diag  

1

Wy  GΓ (y)  I 
 2Ci 
the state equations are obtained as

 dy1  a 1  y 2
  1 1  y1  
1
 ln
 dt   2C1
0   

  11g g   y   G 0  a 1  y1 
    
12 1 1
     
 dy2   0 1  y22    g 21 g 22   y2   0 G2   1 1  y2  

 dt  
a    a ln 1  y  
2C2    2 
 dy1   1 (1  y 2 )(ag y  ag y  G ln 1  y1 ) 
 dt   2C1 1 11 1 12 2 1
1  y1 
  
 dy2   1 (1  y 2 )(ag y  ag y  G ln 1  y2 ) 
 dt   2C2 2 22 2 21 1 2
1  y2 
Discrete-Time Hopfield Networks
Consider the state equation of the Gradient-Type
Hopfield Network:
dx( t )
C  Wy  Gx  I
dt
We can write
dx( t )
C  Wy  GΓ (y)  I
-1
dt
As the plot of the inverse
bipolar activation function
shows the second term in
the above equation is zero
for high gain neurons.
Hence:
dx( t )
C  Wy  I
dt
Now consider
dxi df ( yi ) df ( yi ) dyi
-1 -1
 
dt dt dyi dt
Using the this plot we

can conclude that
df -1 ( yi )
0
dyi
for high gain neurons.

Hence
dx( t )
0
dt
It follows that
0  Wy  I
Now let us solve this equation using Jacobi’s
algorithm. To this end define:
W'  W  D = L  U
where
L,U and D = diag( wii )
Are the lower and upper triangular and diagonal

matrices shown in the following decomposition
of W.
 w11 w12 . . w1N   w11 0 . . 0   0 0 . . 0  0  w12 . .  w1N 
w  0  0
 21 w 22 . . w 2N   0 w 22 . . 0    w 21 0 . . 0 . .  w 2N 

W  . . . . .  . . . . .     w 31  w 32 0 . .. . . . . 
       
 . . . . .   . . . . .   . . . . . . . . .  w N1,N 
 w N1 w N2 . . w NN   0 0 . . w NN    w N1  w N2 .  w N,N1 0  0 0 . . 0 
Now defining
D  diag( wii )
we obtain
-Dy = W'y + I
-1 -1
y = -D W'y - D I
Now define
-D W' = W
-1
-D I  I
-1
y = Wy  I
Now replace the vector y on the right-hand side
by an initial y(0) vector. If the vector y on the left-
hand side is obtained as y(0), then y(0) is the
solution of the system. If not then call the vector y
obtained on the left-hand side y(1), i.e.,
y(1) = W y(0)  I
and in general we can write
y(k + 1) = W y(k)  I
The method will always converge if the matrix
W is strictly or irreducibly diagonally
dominant. Strict row diagonal dominance
means that for each row, the absolute value of
the diagonal term is greater than the sum of
absolute values of other terms:
wii   wij
i j
The Jacobi method sometimes converges even
if this condition is not satisfied. It is necessary,
however, that the diagonal terms in the matrix
are greater (in magnitude) than the other
terms.
Solution by Gauss-Seidel Method
In Jacobi’s method the updating of the
unknowns is made after all N unknowns have
been moved to the left side of the equation. We
will see in the following that this is not
necessary, i.e., the updating can be made
individually for each unknown and this updated
value can be used in the next equation. This is
shown in the following equations:
1
x1 (n  1)  [ a12 x2 (n)  a13 x3 (n)  ......  a1N xN (n)  b1 ]
a11
1
x2 (n  1)   a21x1 (n  1)  a23 x3 (n)  .....  a2 N xN (n)  b2  a22
a22
1
x3 ( n  1)   a31x1 (n  1)  a32 x2 (n  1)  a34 x2 (n)  ......  a3 N xN (n)  b3  a33
a33
and
1
xN (n  1)    aN 1 x1 ( n  1)  aN 2 x2 ( n  1)  ......  aN , N 1 xN 1 ( n  1)  bN  aNN
aNN
In vector-matrix form, we can write:

D x (n  1)  L x (n  1)  U x (n)  b
( D  L) x (n  1)  U x (n)  b
x (n  1)  ( D  L) 1 (U x (n)  b)
This matrix expression is mainly used to analyze

the method. When implementing Gauss Seidel,
an explicit entry-by-entry approach is used:
1  
xi (n  1)  bi   aij x j (n  1)   aij x j (n) 
aii  j i j i 
Gauss-Seidel method is defined on matrices with non-zero
diagonals, but convergence is only guaranteed if the matrix
is either:
1. diagonally dominant or
2. symmetric and positive definite.

NEURAL NETWORKS (PDFDrive) PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

NEURAL NETWORKS (PDFDrive) PDF

Uploaded by

Copyright:

Available Formats

NEURAL NETWORKS

 Work on artificial neural networks,

 Typically, neurons are five to six orders of

 However, the brain makes up for the

 The corresponding value for the best computers

 It is the function of the visual system to provide a

 In addition to providing information about how far

 At birth, a brain has great structure and the

 During this early stage of development, about 1

 Synapses are elementary structural and

 Thus a synapse converts a presynaptic

 In traditional descriptions of neural

 Axons, the transmission lines, and dendrites, the

 Just as plasticity appears to be essential to the

 In its most general form, a neural network is

 In most cases the interest is confined largely

 To achieve good performance, neural

1 Knowledge is acquired by the network

2 Interneuron connection strengths known as

 The procedure used to perform the learning

 However, it is also possible for a neural

 Neural networks are also referred to in the

1. its massively parallel distributed structure,

1. its ability to learn and therefore generalize;

defer: L. deferre- de-, down, ferre, to bear

differ: L. differre- dif.( for dis-), apart, ferre, to bear

infer: L. inferre- in-, into, ferre, to bring

prefer: L. preaferre- prea-,in front of, ferre, to bear

 Anonymous:1601, from Gk. anonymos "without a

 The network is presented an example picked at

 The training of the network is repeated for many

 The previously applied training examples may be

Thus the network learns from the examples by

 (tabula rasa: a smoothed or blank tablet, a mind not yet

 In a nonparametric approach to this problem, the

 A similar point of view is implicit in the

 As a general rule, it may be said that the more

Every neuron in the network is potentially affected by the

Consequently, contextual information is dealt with

For example, if a neuron or its connecting links are

The particular virtue of VLSI is that it provides a means of

 With inspiration from neurobiological analogy in

Block-diagram representation of nervous system

1 Those pointing from left to right indicate the forward

2 The arrows pointing from right to left signify the

Proceeding upward from synapses that represent

The neural networks we are presently able to

Nonlinear model of a neuron

2.An adder for summing the input signals,

On the other hand, the net input of the

Alternatively, we may model the neuron as in

The Activation Function

A Neural Network is an interconnection of

1. Block-Diagram Representation (BDR)

These are obtained when BDR and SFGR for

A neural network is a directed graph (SFG) consisting of

 Each neuron is represented by a set of linear synaptic

1. Single-Layer Feedforward Networks

 the extra set of synaptic connections

The ability of hidden neurons to extract higher-order

 The set of output signals of the neurons in the

Likewise, each neuron in

Two dimensional lattice of 3-by-3 neurons