An artificial neural network (ANN), often simply called a neural network, is a mathematical or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data.
Learning
What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve, and a class of functions F, learning means using a set of observations to find f* ∈ F which solves the task in some optimal sense.
This entails defining a cost function C : F → R such that, for the optimal solution f*, C(f*) ≤ C(f) for all f ∈ F (i.e., no solution has a cost less than the cost of the optimal solution).
The cost function C is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations; otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model f which minimizes C = E[(f(x) − y)²], for data pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize Ĉ = (1/N) Σi=1..N (f(xi) − yi)². Thus, the cost is minimized over a sample of the data rather than the entire data set.
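As a rough sketch of the distinction above, the empirical cost can be minimized directly over a finite sample. The data-generating slope, sample sizes, and grid search below are all invented for illustration:

```python
import numpy as np

# Hypothetical data pairs (x_i, y_i): y = 2x + noise. All numbers here are
# assumptions for illustration, not from the text.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)

# Restrict the class of functions to F = {f(x) = w * x}; the empirical cost
# is the average of (f(x_i) - y_i)^2 over the N available samples.
def empirical_cost(w):
    return np.mean((w * x - y) ** 2)

# Minimize over the sample rather than the (unavailable) full distribution.
best_w = min(np.linspace(-5, 5, 1001), key=empirical_cost)
print(best_w)  # close to the true slope 2.0
```

With more samples, the empirical minimizer approaches the minimizer of the true expected cost.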
When N → ∞, some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when D is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for infinite datasets.
While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because
it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem
(e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the
cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided
below.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning.
Supervised learning
In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f : X → Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain.
A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called multilayer perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks.
Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
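The gradient-descent-on-mean-squared-error procedure can be sketched for the simplest case, a single linear unit (the data, learning rate, and iteration count below are invented; multilayer perceptrons extend this with backpropagation through hidden layers):

```python
import numpy as np

# Illustrative sketch: minimizing mean-squared error by gradient descent
# for a single linear unit. Target weights [1.5, -0.5] are assumptions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5])          # assumed target mapping

w = np.zeros(2)
lr = 0.1
for _ in range(200):
    err = X @ w - y                    # f(x) - y for every example pair
    grad = 2 * X.T @ err / len(y)      # gradient of the mean-squared error
    w -= lr * grad                     # step down the cost surface

print(np.round(w, 2))  # approaches the target weights
```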
Unsupervised learning
In unsupervised learning, some data x is given and the cost function to be minimized can be any function of the data x and the network's output, f. The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).
As a trivial example, consider the model f(x) = a, where a is a constant, and the cost C = E[(x − f(x))²]. Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y, whereas in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized.)
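The trivial example above can be checked numerically; the data values and the candidate grid are arbitrary choices for illustration:

```python
import numpy as np

# Model f(x) = a with cost E[(x - a)^2]: minimizing the empirical version
# of this cost recovers the sample mean. Data values are invented.
data = np.array([2.0, 4.0, 6.0, 8.0])

candidates = np.linspace(0.0, 10.0, 1001)
costs = [np.mean((data - a) ** 2) for a in candidates]
best_a = candidates[int(np.argmin(costs))]

print(best_a)  # the mean of the data, 5.0
```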
Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.
Reinforcement learning
In reinforcement learning, data are usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action yt and the environment generates an observation xt and an instantaneous cost ct, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a Markov decision process (MDP) with states s1, ..., sn ∈ S and actions a1, ..., am ∈ A with the following probability distributions: the instantaneous cost distribution P(ct | st), the observation distribution P(xt | st) and the transition P(st+1 | st, at), while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost, i.e., the MC for which the cost is minimal.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision
making tasks.
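As a small illustration of the "long-term cost of a policy" idea, the expected discounted cumulative cost of a fixed policy on a toy MDP satisfies a linear system we can solve directly. The transition matrix, costs, and discount factor below are all invented:

```python
import numpy as np

# Hypothetical two-state MDP under a fixed policy: the long-term
# (discounted) cost v satisfies v = c + gamma * P v.
P = np.array([[0.9, 0.1],      # transition probabilities under the policy
              [0.2, 0.8]])
c = np.array([1.0, 0.0])       # instantaneous cost in each state
gamma = 0.9                    # discount factor (an assumption)

v = np.linalg.solve(np.eye(2) - gamma * P, c)
print(np.round(v, 2))  # expected cumulative discounted cost per start state
```

Comparing such values across policies is what reinforcement learning methods estimate when the dynamics are unknown.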
A Boltzmann machine is a type of stochastic recurrent neural network invented by Geoffrey Hinton and Terry Sejnowski.
Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first
examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient
time) solve difficult combinatoric problems. However, due to a number of issues discussed below, Boltzmann machines
with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are
still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.
Structure
A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, E, in a Boltzmann machine is:
E = −(Σi<j wij si sj + Σi θi si)
Where:
wij is the connection strength between unit j and unit i.
si is the state, si ∈ {0, 1}, of unit i.
θi is the bias of unit i in the global energy function.
The difference in the global energy that results from a single unit i being 0 (off) versus 1 (on), written ΔEi, is given by:
ΔEi = Σj wij sj + θi
We then substitute the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):
ΔEi = kB T ln(pi=on) − kB T ln(pi=off)
where kB is Boltzmann's constant and becomes lumped into the artificial notion of temperature T. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:
ΔEi / T = ln(pi=on) − ln(1 − pi=on)
We can now finally solve for pi=on, the probability that the i-th unit is on:
pi=on = 1 / (1 + exp(−ΔEi / T))
This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
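The activation probability can be computed directly from the energy gap; the particular values of ΔEi and T below are arbitrary:

```python
import math

# Unit-activation probability of a Boltzmann machine unit:
# p_on = 1 / (1 + exp(-dE / T)). Example values are invented.
def p_on(delta_e, temperature):
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

print(p_on(0.0, 1.0))   # 0.5: no energy difference, a fair coin flip
print(p_on(2.0, 1.0))   # turning on lowers the energy, so p_on ~ 0.88
print(p_on(2.0, 10.0))  # high temperature pushes probabilities toward 0.5
```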
Equilibrium state
The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for
long enough at a certain temperature, the probability of a global state of the network will depend only upon that global
state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in
their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability
distribution of global states has converged. If we start running the network from a high temperature, and gradually
decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution
where the energy level fluctuates around the global minimum. This process is called simulated annealing.
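The run-and-anneal procedure above can be sketched as a sampling loop; the weights, biases, and annealing schedule here are all invented for illustration:

```python
import math, random

# Minimal sketch of running a small Boltzmann machine: repeatedly pick a
# unit, compute its energy gap dE_i = sum_j w_ij s_j + theta_i, and set it
# on with probability 1 / (1 + exp(-dE_i / T)), lowering T over time.
random.seed(0)
W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.5],
     [-0.5, 0.5, 0.0]]          # symmetric weights, zero diagonal (assumed)
theta = [0.0, 0.0, 0.0]         # biases
s = [random.randint(0, 1) for _ in range(3)]

for T in [4.0, 2.0, 1.0, 0.5, 0.25]:   # invented annealing schedule
    for _ in range(100):
        i = random.randrange(3)
        delta_e = sum(W[i][j] * s[j] for j in range(3)) + theta[i]
        prob_on = 1.0 / (1.0 + math.exp(-delta_e / T))
        s[i] = 1 if random.random() < prob_on else 0

print(s)  # a low-energy configuration of the three units
```

At high T the states are nearly random; as T drops, the network settles into configurations whose energy fluctuates around a minimum.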
If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies.
Training
The units in the Boltzmann machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted P+(V).
On the Boltzmann machine side, as recalled, the distribution over the global states converges as we reach thermal equilibrium. We denote the converged distribution, after we marginalize it over the visible units V, as P−(V).
Our goal is to approximate the "real" distribution P+(V) using the P−(V) which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback–Leibler divergence, G:
G = Σv P+(v) ln( P+(v) / P−(v) )
where the sum is over all the possible states v of V. G is a function of the weights, since they determine the energy of a state, and the energy determines P−(v), as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over G, so a given weight, wij, is changed by subtracting the partial derivative of G with respect to the weight.
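The Kullback–Leibler divergence itself is straightforward to evaluate when both distributions are known explicitly; the two example distributions below are made up for illustration:

```python
import math

# KL divergence G between an environment distribution p_plus over visible
# states and the machine's equilibrium distribution p_minus. Both vectors
# of probabilities here are invented examples.
def kl(p_plus, p_minus):
    return sum(p * math.log(p / q) for p, q in zip(p_plus, p_minus) if p > 0)

p_plus = [0.5, 0.25, 0.25]   # "real" distribution over visible states
p_minus = [0.4, 0.3, 0.3]    # machine's current distribution

print(kl(p_plus, p_minus))   # positive: the two distributions differ
print(kl(p_plus, p_plus))    # 0.0 when they match exactly
```

Training drives G toward zero, i.e. it pulls P−(V) toward P+(V).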
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to P+). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, wij, is given by the very simple equation (proved in Ackley et al.):
∂G/∂wij = −(1/T)[p+ij − p−ij]
Where:
p+ij is the probability of units i and j both being on when the machine is at equilibrium in the positive phase.
p−ij is the probability of units i and j both being on when the machine is at equilibrium in the negative phase.
This result follows from the fact that at thermal equilibrium the probability P−(s) of any global state s when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
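The resulting weight update can be sketched in a few lines; the learning rate and the co-activation statistics below are invented numbers (in practice p+ij and p−ij are estimated by sampling the two phases at equilibrium):

```python
# Sketch of one Boltzmann machine weight update. Descending the divergence G
# means increasing w_ij when units i and j co-activate more often in the
# clamped (positive) phase than in the free-running (negative) phase.
def update_weight(w_ij, p_ij_plus, p_ij_minus, lr=0.1, T=1.0):
    # dG/dw_ij = -(1/T) * (p+_ij - p-_ij); gradient descent subtracts it.
    return w_ij + lr * (p_ij_plus - p_ij_minus) / T

w = 0.0
w = update_weight(w, p_ij_plus=0.8, p_ij_minus=0.3)  # invented statistics
print(w)  # the weight grows, since co-activation is higher when clamped
```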
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse, biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
Training the biases is similar, but uses only single node activity:
∂G/∂θi = −(1/T)[p+i − p−i]
Problems
The Boltzmann machine would theoretically be a rather general computational medium. For
instance, if trained on photographs, the machine would theoretically model the distribution of
photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the
learning seems to stop working correctly when the machine is scaled up to anything larger than a
trivial machine. This is due to a number of effects, the most important of which are:
- the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
- connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM", which does not allow intralayer connections between hidden units. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently, and as each new layer is added the overall generative model improves.
There is an extension to the Restricted Boltzmann Machine that affords using real-valued data rather than binary data. Along with higher-order Boltzmann Machines, it is outlined here [1].
Hopfield net
From Wikipedia, the free encyclopedia
A Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. Hopfield nets serve as content-addressable memory systems with binary threshold units. They are guaranteed to converge to a local minimum, but convergence to one of the stored patterns is not guaranteed.
Structure
A Hopfield net with four nodes.
The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold. Hopfield nets can either have units that take on values of 1 or −1, or units that take on values of 1 or 0. So, the two possible definitions for unit i's activation, ai, are:
(1) ai = 1 if Σj wij sj > θi, and −1 otherwise.
(2) ai = 1 if Σj wij sj > θi, and 0 otherwise.
Where:
wij is the strength of the connection weight from unit j to unit i (the weight of the connection).
sj is the state of unit j.
θi is the threshold of unit i.
The requirement that weights be symmetric is typically used, as it will guarantee that the energy function decreases
monotonically while following the activation rules, and the network may exhibit some periodic or chaotic behaviour if non-
symmetric weights are used. However, Hopfield found that this chaotic behaviour is confined to relatively small parts of
the phase space, and does not impair the network's ability to act as a content-addressable associative memory system.
Hopfield nets have a scalar value associated with each state of the network, referred to as the "energy", E, of the network, where:
E = −(1/2) Σi,j wij si sj + Σi θi si
This value is called the "energy" because the definition ensures that if units are randomly chosen to update their activations, the network will converge to states which are local minima in the energy function (which is considered to be a Lyapunov function). Thus, if a state is a local minimum in the energy function, it is a stable state for the network. Note that this energy function belongs to a general class of models in physics, under the name of Ising models; these in turn are a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.
At each step, pick a node at random. The node's behavior is then deterministic: it moves to a state to minimize the
energy of itself and its neighbors. (In contrast, the Boltzmann machine has a stochastic update rule.)
Training
Training a Hopfield net involves lowering the energy of states that the net should "remember". This allows the net
to serve as a content addressable memory system, that is to say, the network will converge to a "remembered"
state if it is given only part of the state. The net can be used to recover from a distorted input the trained state that
is most similar to that input. This is called associative memory because it recovers memories on the basis of
similarity. For example, if we train a Hopfield net with five units so that the state (1, 0, 1, 0, 1) is an energy
minimum, and we give the network the state (1, 0, 0, 0, 1) it will converge to (1, 0, 1, 0, 1). Thus, the network is
properly trained when the energy of states which the network should remember are local minima.
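The five-unit example above can be sketched end to end. The Hebbian outer-product rule used to set the weights is one common choice (an assumption beyond the text), with 0/1 states and the threshold update rule with zero thresholds:

```python
import numpy as np

# Store the pattern (1, 0, 1, 0, 1) with a Hebbian outer-product rule
# (one common way to lower that state's energy), then recall it from a
# distorted input. Zero thresholds are assumed.
pattern = np.array([1, 0, 1, 0, 1])
bipolar = 2 * pattern - 1                      # map {0,1} -> {-1,+1}
W = np.outer(bipolar, bipolar).astype(float)
np.fill_diagonal(W, 0.0)                       # no self-connections

state = np.array([1, 0, 0, 0, 1])              # distorted input
for _ in range(3):                             # a few asynchronous sweeps
    for i in range(5):
        state[i] = 1 if W[i] @ (2 * state - 1) > 0 else 0

print(state.tolist())  # converges back to [1, 0, 1, 0, 1]
```

The distorted input lands in the stored pattern's basin of attraction, so the asynchronous updates pull it back to the energy minimum.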
One of the earliest recurrent neural networks reported in the literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each unit i and j, i ≠ j.
The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.
All connections are weighted. In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). This network, which we will describe in this chapter, is therefore generally referred to as the Hopfield network.
The state of the system is given by the activation values y = (yk). The net input sk(t+1) of a neuron k at cycle t+1 is a weighted sum:
sk(t+1) = Σj yj(t) wjk + θk
A simple threshold function is applied to the net input to obtain the new activation value yk(t+1) at time t+1:
yk(t+1) = sgn(sk(t+1))