An artificial neural network (ANN), often simply called a neural network, is a mathematical or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data.
Learning
What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve, and a class of functions F, learning means using a set of observations to find f* ∈ F which solves the task in some optimal sense.
This entails defining a cost function C : F → R such that, for the optimal solution f*, C(f*) ≤ C(f) for all f ∈ F (i.e., no solution has a cost less than the cost of the optimal solution).
The cost function C is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations; otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model f which minimizes C = E[(f(x) − y)²], for data pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize Ĉ = (1/N) Σi=1..N (f(xi) − yi)². Thus, the cost is minimized over a sample of the data rather than the entire data set.
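As a rough sketch of the distinction above, the empirical cost can be minimized directly over a finite sample. The data-generating slope, sample sizes, and grid search below are all invented for illustration:

```python
import numpy as np

# Hypothetical data pairs (x_i, y_i): y = 2x + noise. All numbers here are
# assumptions for illustration, not from the text.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)

# Restrict the class of functions to F = {f(x) = w * x}; the empirical cost
# is the average of (f(x_i) - y_i)^2 over the N available samples.
def empirical_cost(w):
    return np.mean((w * x - y) ** 2)

# Minimize over the sample rather than the (unavailable) full distribution.
best_w = min(np.linspace(-5, 5, 1001), key=empirical_cost)
print(best_w)  # close to the true slope 2.0
```

With more samples, the empirical minimizer approaches the minimizer of the true expected cost.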
When N → ∞, some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when D is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for infinite datasets.
While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because
it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem
(e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the
cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided
below.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning.
Supervised learning
In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f : X → Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain.
A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called multilayer perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks.
Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
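The gradient-descent-on-mean-squared-error procedure can be sketched for the simplest case, a single linear unit (the data, learning rate, and iteration count below are invented; multilayer perceptrons extend this with backpropagation through hidden layers):

```python
import numpy as np

# Illustrative sketch: minimizing mean-squared error by gradient descent
# for a single linear unit. Target weights [1.5, -0.5] are assumptions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5])          # assumed target mapping

w = np.zeros(2)
lr = 0.1
for _ in range(200):
    err = X @ w - y                    # f(x) - y for every example pair
    grad = 2 * X.T @ err / len(y)      # gradient of the mean-squared error
    w -= lr * grad                     # step down the cost surface

print(np.round(w, 2))  # approaches the target weights
```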
Unsupervised learning
In unsupervised learning, some data x is given and the cost function to be minimized can be any function of the data x and the network's output, f. The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).
As a trivial example, consider the model f(x) = a, where a is a constant, and the cost C = E[(x − f(x))²]. Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y, whereas in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized.)
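The trivial example above can be checked numerically; the data values and the candidate grid are arbitrary choices for illustration:

```python
import numpy as np

# Model f(x) = a with cost E[(x - a)^2]: minimizing the empirical version
# of this cost recovers the sample mean. Data values are invented.
data = np.array([2.0, 4.0, 6.0, 8.0])

candidates = np.linspace(0.0, 10.0, 1001)
costs = [np.mean((data - a) ** 2) for a in candidates]
best_a = candidates[int(np.argmin(costs))]

print(best_a)  # the mean of the data, 5.0
```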
Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.
Reinforcement learning
In reinforcement learning, data are usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action yt and the environment generates an observation xt and an instantaneous cost ct, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a Markov decision process (MDP) with states s1, ..., sn ∈ S and actions a1, ..., am ∈ A with the following probability distributions: the instantaneous cost distribution P(ct | st), the observation distribution P(xt | st) and the transition P(st+1 | st, at), while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost, i.e., the MC for which the cost is minimal.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision
making tasks.
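As a small illustration of the "long-term cost of a policy" idea, the expected discounted cumulative cost of a fixed policy on a toy MDP satisfies a linear system we can solve directly. The transition matrix, costs, and discount factor below are all invented:

```python
import numpy as np

# Hypothetical two-state MDP under a fixed policy: the long-term
# (discounted) cost v satisfies v = c + gamma * P v.
P = np.array([[0.9, 0.1],      # transition probabilities under the policy
              [0.2, 0.8]])
c = np.array([1.0, 0.0])       # instantaneous cost in each state
gamma = 0.9                    # discount factor (an assumption)

v = np.linalg.solve(np.eye(2) - gamma * P, c)
print(np.round(v, 2))  # expected cumulative discounted cost per start state
```

Comparing such values across policies is what reinforcement learning methods estimate when the dynamics are unknown.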
A Boltzmann machine is a type of stochastic recurrent neural network invented by Geoffrey Hinton and Terry Sejnowski.
Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first
examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient
time) solve difficult combinatoric problems. However, due to a number of issues discussed below, Boltzmann machines
with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are
still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.
Structure
A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, E, in a Boltzmann machine is:
E = −(Σi<j wij si sj + Σi θi si)
Where:
wij is the connection strength between unit j and unit i.
si is the state, si ∈ {0, 1}, of unit i.
θi is the bias of unit i in the global energy function.
The difference in the global energy that results from a single unit i being 0 (off) versus 1 (on), written ΔEi, is given by:
ΔEi = Σj wij sj + θi
We then substitute the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):
ΔEi = kB T ln(pi=on) − kB T ln(pi=off)
where kB is Boltzmann's constant and becomes lumped into the artificial notion of temperature T. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:
ΔEi / T = ln(pi=on) − ln(1 − pi=on)
We can now finally solve for pi=on, the probability that the i-th unit is on:
pi=on = 1 / (1 + exp(−ΔEi / T))
This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
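The activation probability can be computed directly from the energy gap; the particular values of ΔEi and T below are arbitrary:

```python
import math

# Unit-activation probability of a Boltzmann machine unit:
# p_on = 1 / (1 + exp(-dE / T)). Example values are invented.
def p_on(delta_e, temperature):
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

print(p_on(0.0, 1.0))   # 0.5: no energy difference, a fair coin flip
print(p_on(2.0, 1.0))   # turning on lowers the energy, so p_on ~ 0.88
print(p_on(2.0, 10.0))  # high temperature pushes probabilities toward 0.5
```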
Equilibrium state
The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for
long enough at a certain temperature, the probability of a global state of the network will depend only upon that global
state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in
their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability
distribution of global states has converged. If we start running the network from a high temperature, and gradually
decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution
where the energy level fluctuates around the global minimum. This process is called simulated annealing.
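The run-and-anneal procedure above can be sketched as a sampling loop; the weights, biases, and annealing schedule here are all invented for illustration:

```python
import math, random

# Minimal sketch of running a small Boltzmann machine: repeatedly pick a
# unit, compute its energy gap dE_i = sum_j w_ij s_j + theta_i, and set it
# on with probability 1 / (1 + exp(-dE_i / T)), lowering T over time.
random.seed(0)
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.5],
     [-0.5, 0.5, 0.0]]          # symmetric weights, zero diagonal (assumed)
theta = [0.0, 0.0, 0.0]         # biases
s = [random.randint(0, 1) for _ in range(3)]

for T in [4.0, 2.0, 1.0, 0.5, 0.25]:   # invented annealing schedule
    for _ in range(100):
        i = random.randrange(3)
        delta_e = sum(W[i][j] * s[j] for j in range(3)) + theta[i]
        prob_on = 1.0 / (1.0 + math.exp(-delta_e / T))
        s[i] = 1 if random.random() < prob_on else 0

print(s)  # a low-energy configuration of the three units
```

At high T the states are nearly random; as T drops, the network settles into configurations whose energy fluctuates around a minimum.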
If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies.
Training
The units in the Boltzmann machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted P+(V).
On the Boltzmann machine side, as recalled, the distribution over the global states converges as we reach thermal equilibrium. We denote the converged distribution, after we marginalize it over the visible units V, as P−(V).
Our goal is to approximate the "real" distribution P+(V) using the P−(V) which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback–Leibler divergence, G:
G = Σv P+(v) ln( P+(v) / P−(v) )
where the sum is over all the possible states v of V. G is a function of the weights, since they determine the energy of a state, and the energy determines P−(v), as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over G, so a given weight, wij, is changed by subtracting the partial derivative of G with respect to the weight.
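The Kullback–Leibler divergence itself is straightforward to evaluate when both distributions are known explicitly; the two example distributions below are made up for illustration:

```python
import math

# KL divergence G between an environment distribution p_plus over visible
# states and the machine's equilibrium distribution p_minus. Both vectors
# of probabilities here are invented examples.
def kl(p_plus, p_minus):
    return sum(p * math.log(p / q) for p, q in zip(p_plus, p_minus) if p > 0)

p_plus = [0.5, 0.25, 0.25]   # "real" distribution over visible states
p_minus = [0.4, 0.3, 0.3]    # machine's current distribution

print(kl(p_plus, p_minus))   # positive: the two distributions differ
print(kl(p_plus, p_plus))    # 0.0 when they match exactly
```

Training drives G toward zero, i.e. it pulls P−(V) toward P+(V).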
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to P+). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, wij, is given by the very simple equation (proved in Ackley et al.):
∂G/∂wij = −(1/T)[p+ij − p−ij]
Where:
p+ij is the probability of units i and j both being on when the machine is at equilibrium in the positive phase.
p−ij is the probability of units i and j both being on when the machine is at equilibrium in the negative phase.
This result follows from the fact that at thermal equilibrium the probability P−(s) of any global state s when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
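The resulting weight update can be sketched in a few lines; the learning rate and the co-activation statistics below are invented numbers (in practice p+ij and p−ij are estimated by sampling the two phases at equilibrium):

```python
# Sketch of one Boltzmann machine weight update. Descending the divergence G
# means increasing w_ij when units i and j co-activate more often in the
# clamped (positive) phase than in the free-running (negative) phase.
def update_weight(w_ij, p_ij_plus, p_ij_minus, lr=0.1, T=1.0):
    # dG/dw_ij = -(1/T) * (p+_ij - p-_ij); gradient descent subtracts it.
    return w_ij + lr * (p_ij_plus - p_ij_minus) / T

w = 0.0
w = update_weight(w, p_ij_plus=0.8, p_ij_minus=0.3)  # invented statistics
print(w)  # the weight grows, since co-activation is higher when clamped
```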
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse, biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
Training the biases is similar, but uses only single node activity:
∂G/∂θi = −(1/T)[p+i − p−i]
Problems
The Boltzmann machine would theoretically be a rather general computational medium. For
instance, if trained on photographs, the machine would theoretically model the distribution of
photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the
learning seems to stop working correctly when the machine is scaled up to anything larger than a
trivial machine. This is due to a number of effects, the most important of which are:
- the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
- connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM", which does not allow intralayer connections between hidden units. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently, and as each new layer is added the overall generative model improves.
There is an extension to the Restricted Boltzmann Machine that affords using real-valued data rather than binary data. Along with higher-order Boltzmann Machines, it is outlined here [1].
Hopfield net
From Wikipedia, the free encyclopedia
A Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. Hopfield nets serve as content-addressable memory systems with binary threshold units. They are guaranteed to converge to a local minimum, but convergence to one of the stored patterns is not guaranteed.
Structure
A Hopfield net with four nodes.
The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold. Hopfield nets can either have units that take on values of 1 or −1, or units that take on values of 1 or 0. So, the two possible definitions for unit i's activation, ai, are:
(1) ai = 1 if Σj wij sj > θi, and −1 otherwise.
(2) ai = 1 if Σj wij sj > θi, and 0 otherwise.
Where:
wij is the strength of the connection weight from unit j to unit i (the weight of the connection).
sj is the state of unit j.
θi is the threshold of unit i.
The requirement that weights be symmetric is typically used, as it will guarantee that the energy function decreases
monotonically while following the activation rules, and the network may exhibit some periodic or chaotic behaviour if non-
symmetric weights are used. However, Hopfield found that this chaotic behaviour is confined to relatively small parts of
the phase space, and does not impair the network's ability to act as a content-addressable associative memory system.
Hopfield nets have a scalar value associated with each state of the network, referred to as the "energy", E, of the network, where:
E = −(1/2) Σi,j wij si sj + Σi θi si
This value is called the "energy" because the definition ensures that if units are randomly chosen to update their activations, the network will converge to states which are local minima in the energy function (which is considered to be a Lyapunov function). Thus, if a state is a local minimum in the energy function, it is a stable state for the network. Note that this energy function belongs to a general class of models in physics, under the name of Ising models; these in turn are a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.
At each step, pick a node at random. The node's behavior is then deterministic: it moves to a state to minimize the
energy of itself and its neighbors. (In contrast, the Boltzmann machine has a stochastic update rule.)
Training
Training a Hopfield net involves lowering the energy of states that the net should "remember". This allows the net
to serve as a content addressable memory system, that is to say, the network will converge to a "remembered"
state if it is given only part of the state. The net can be used to recover from a distorted input the trained state that
is most similar to that input. This is called associative memory because it recovers memories on the basis of
similarity. For example, if we train a Hopfield net with five units so that the state (1, 0, 1, 0, 1) is an energy
minimum, and we give the network the state (1, 0, 0, 0, 1) it will converge to (1, 0, 1, 0, 1). Thus, the network is
properly trained when the energy of states which the network should remember are local minima.
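The five-unit example above can be sketched end to end. The Hebbian outer-product rule used to set the weights is one common choice (an assumption beyond the text), with 0/1 states and the threshold update rule with zero thresholds:

```python
import numpy as np

# Store the pattern (1, 0, 1, 0, 1) with a Hebbian outer-product rule
# (one common way to lower that state's energy), then recall it from a
# distorted input. Zero thresholds are assumed.
pattern = np.array([1, 0, 1, 0, 1])
bipolar = 2 * pattern - 1                      # map {0,1} -> {-1,+1}
W = np.outer(bipolar, bipolar).astype(float)
np.fill_diagonal(W, 0.0)                       # no self-connections

state = np.array([1, 0, 0, 0, 1])              # distorted input
for _ in range(3):                             # a few asynchronous sweeps
    for i in range(5):
        state[i] = 1 if W[i] @ (2 * state - 1) > 0 else 0

print(state.tolist())  # converges back to [1, 0, 1, 0, 1]
```

The distorted input lands in the stored pattern's basin of attraction, so the asynchronous updates pull it back to the energy minimum.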
One of the earliest recurrent neural networks reported in the literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each unit i and j, i ≠ j.
The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.
All connections are weighted. In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). This network, which we will describe in this chapter, is therefore generally referred to as the Hopfield network.
The state of the system is given by the activation values y = (yk). The net input sk(t+1) of a neuron k at cycle t+1 is a weighted sum:
sk(t+1) = Σj yj(t) wjk + θk
A simple threshold function is applied to the net input to obtain the new activation value yk(t+1) at time t+1:
yk(t+1) = sgn(sk(t+1))