
An artificial neural network (ANN), usually called a neural network (NN), is a mathematical model or computational

model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists
of an interconnected group of artificial neurons, and it processes information using a connectionist approach
to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal
information that flows through the network during the learning phase. Modern neural networks are non-
linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or
to find patterns in data.

Learning

What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve, and a class of functions $F$, learning means using a set of observations to find $f^* \in F$ which solves the task in some optimal sense.

This entails defining a cost function $C : F \to \mathbb{R}$ such that, for the optimal solution $f^*$, $C(f^*) \leq C(f)\ \forall f \in F$ (i.e., no solution has a cost less than the cost of the optimal solution).

The cost function is an important concept in learning, as it is a measure of how far away a particular solution is from an

optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that

has the smallest possible cost.

For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model $f$ which minimizes $C = E\left[(f(x) - y)^2\right]$, for data pairs $(x, y)$ drawn from some distribution $\mathcal{D}$. In practical situations we would only have $N$ samples from $\mathcal{D}$ and thus, for the above example, we would only minimize $\hat{C} = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2$. Thus, the cost is minimized over a sample of the data rather than the entire data set.

When $N \rightarrow \infty$ some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when $\mathcal{D}$ is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
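To illustrate the distinction, here is a minimal sketch (not from the article) of an online update that partially minimizes the sampled cost as each new example arrives; the scalar linear model, the data-generating slope of 2.0, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                        # samples x_i from the (unknown) distribution D
y = 2.0 * x + rng.normal(scale=0.1, size=200)   # observed targets y_i

w, lr = 0.0, 0.01                               # scalar linear model f(x) = w * x

# Online learning: the cost is partially minimized as each new example is seen.
for xi, yi in zip(x, y):
    grad = 2.0 * (w * xi - yi) * xi             # gradient of (f(x_i) - y_i)^2 w.r.t. w
    w -= lr * grad

print(w)  # approaches 2.0, the minimizer of the sampled cost
```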

See also: Optimization (mathematics), Estimation theory, and Machine learning


Choosing a cost function

While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because

it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem

(e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the

cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided

below.

Learning paradigms

There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning.


Supervised learning

In supervised learning, we are given a set of example pairs $(x, y),\ x \in X,\ y \in Y$ and the aim is to find a function $f : X \rightarrow Y$ in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.

A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, $f(x)$, and the target value $y$ over all the example pairs. When one tries to minimize this cost

using gradient descent for the class of neural networks called multilayer perceptrons, one obtains the common and well-

known backpropagation algorithm for training neural networks.
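As an illustrative sketch of this procedure (not a canonical implementation), here is a small multilayer perceptron trained on XOR with gradient descent on the mean-squared error; the layer sizes, sigmoid units, learning rate and iteration count are arbitrary choices, and a different seed may need more iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn XOR, a classic non-linearly-separable mapping.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units.
W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(20000):
    # Forward pass.
    H = sigmoid(X @ W1 + b1)            # hidden activations
    F = sigmoid(H @ W2 + b2)            # network output f(x)

    # Backward pass: error signals for the mean-squared-error cost.
    dF = (F - Y) * F * (1 - F)          # output layer
    dH = (dF @ W2.T) * H * (1 - H)      # propagated back to the hidden layer

    # Gradient-descent weight updates.
    W2 -= lr * H.T @ dF / len(X); b2 -= lr * dF.mean(axis=0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)

print(np.round(F.ravel(), 2))           # should approach [0, 1, 1, 0]
```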

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification)

and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential

data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a

function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, some data $x$ is given together with a cost function to be minimized, which can be any function of the data $x$ and the network's output, $f$.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit

properties of our model, its parameters and the observed variables).

As a trivial example, consider the model $f(x) = a$, where $a$ is a constant and the cost $C = E\left[(x - f(x))^2\right]$. Minimizing this cost will give us a value of $a$ that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between $x$ and $y$, whereas in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized.)
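Filling in the one-step calculation behind this example: setting the derivative of the cost with respect to the constant $a$ to zero gives

```latex
C = E\!\left[(x - a)^2\right]
\quad\Rightarrow\quad
\frac{\partial C}{\partial a} = -2\,E[x - a] = 0
\quad\Rightarrow\quad
a = E[x]
```

i.e., the optimal constant model is the mean of the data.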

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications

include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

In reinforcement learning, data $x_t$ are usually not given, but generated by an agent's interactions with the environment. At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost; i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov decision process (MDP) with states $s_1, \ldots, s_n \in S$ and actions $a_1, \ldots, a_m \in A$ with the following probability distributions: the instantaneous cost distribution $P(c_t \mid s_t)$, the observation distribution $P(x_t \mid s_t)$ and the transition $P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal.
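To make the formalism concrete, here is a minimal tabular value-iteration sketch (not from the article) for a hypothetical two-state, two-action MDP; it assumes the dynamics $P(s_{t+1} \mid s_t, a_t)$ and costs are known, whereas in practice, as noted above, they are usually unknown and must be estimated.

```python
import numpy as np

# Hypothetical two-state, two-action MDP with known dynamics and costs.
P = np.array([[[0.9, 0.1],    # P(s'|s=0, a=0)
               [0.2, 0.8]],   # P(s'|s=0, a=1)
              [[0.7, 0.3],    # P(s'|s=1, a=0)
               [0.1, 0.9]]])  # P(s'|s=1, a=1)
C = np.array([[1.0, 2.0],
              [0.5, 0.0]])    # instantaneous cost c(s, a)
gamma = 0.95                  # discount for the long-term (cumulative) cost

V = np.zeros(2)
for _ in range(500):
    # Bellman update: choose the action minimizing immediate plus future cost.
    Q = C + gamma * (P @ V)   # Q[s, a]
    V = Q.min(axis=1)

policy = Q.argmin(axis=1)     # greedy policy over the long-term cost
print(V, policy)
```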

ANNs are frequently used in reinforcement learning as part of the overall algorithm.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision-making tasks.

See also: dynamic programming and stochastic control

Boltzmann machine

A Boltzmann machine is a type of stochastic recurrent neural network invented by Geoffrey Hinton and Terry Sejnowski.

Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first

examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient

time) solve difficult combinatoric problems. However, due to a number of issues discussed below, Boltzmann machines

with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are

still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their

parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the

learning can be made efficient enough to be useful for practical problems.

They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.

Structure

A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also

has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, E, in a Boltzmann

machine is identical in form to that of a Hopfield network:

$$E = -\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i$$

Where:

• $w_{ij}$ is the connection strength between unit $j$ and unit $i$.
• $s_i$ is the state, $s_i \in \{0, 1\}$, of unit $i$.
• $\theta_i$ is the threshold of unit $i$.

The connections in a Boltzmann machine have two restrictions:

• No unit has a connection with itself ($w_{ii} = 0$).
• All connections are symmetric ($w_{ij} = w_{ji}$).

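As a small sketch, the global energy can be computed directly from a state vector; the vectorized form below assumes the symmetric, zero-diagonal weight matrix just described.

```python
import numpy as np

def global_energy(s, W, theta):
    # E = -sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i; with W symmetric and
    # zero-diagonal, the pairwise sum equals one half of s^T W s.
    return -0.5 * s @ W @ s + theta @ s
```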

Probability of a unit's state

The difference in the global energy that results from a single unit $i$ being 0 (off) versus 1 (on), written $\Delta E_i$, is given by:

$$\Delta E_i = \sum_j w_{ij}\, s_j - \theta_i$$

This can be expressed as the difference of energies of two states:

$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$

We then substitute the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):

$$\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - \left(-k_B T \ln(p_{i=\text{on}})\right)$$

where $k_B$ is Boltzmann's constant and becomes lumped into the artificial notion of temperature $T$. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}}) = \ln\!\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right)$$

We can now finally solve for $p_{i=\text{on}}$, the probability that the $i$-th unit is on:

$$p_{i=\text{on}} = \frac{1}{1 + e^{-\Delta E_i / T}}$$

where the scalar $T$ is referred to as the temperature of the system.

This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
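A minimal sketch of this stochastic update rule (the function name and setup are illustrative, not from the article):

```python
import numpy as np

def update_unit(s, W, theta, i, T, rng):
    """Set unit i on with probability p = 1 / (1 + exp(-dE_i / T))."""
    dE = W[i] @ s - theta[i]    # energy gap between unit i off and on
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s
```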
Equilibrium state

The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for

long enough at a certain temperature, the probability of a global state of the network will depend only upon that global

state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in

their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability

distribution of global states has converged. If we start running the network from a high temperature, and gradually

decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution

where the energy level fluctuates around the global minimum. This process is called simulated annealing.
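And a sketch of the annealing schedule just described, reusing `update_unit` from the previous sketch; the random symmetric weights, geometric cooling schedule, and sweep counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2.0   # symmetric weights
np.fill_diagonal(W, 0.0)                           # no self-connections
theta = np.zeros(n)
s = rng.integers(0, 2, size=n)

# Start at a high temperature and cool gradually toward equilibrium.
for T in np.geomspace(10.0, 0.1, num=50):
    for _ in range(100):
        s = update_unit(s, W, theta, rng.integers(n), T, rng)
```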

If we want to train the network so that the chance it will converge to a global state accords with an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. This is done by the following training procedure.

Training

The units in the Boltzmann Machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those

which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The

distribution over the training set is denoted $P^+(V)$.

On the Boltzmann Machine side, as recalled, the distribution over the global states converges as we reach thermal equilibrium. We denote the converged distribution, after we marginalize it over the hidden units, as $P^-(V)$.

Our goal is to approximate the "real" distribution $P^+(V)$ using the $P^-(V)$ which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback-Leibler divergence, $G$:

$$G = \sum_v P^+(v) \ln\!\left(\frac{P^+(v)}{P^-(v)}\right)$$

where the sum is over all the possible states $v$ of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^-(v)$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to the weight.

There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^+$). The other is the "negative" phase where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the very simple equation (proved in Ackley et al.):

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p^+_{ij} - p^-_{ij}\right]$$

Where:

• $p^+_{ij}$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium on the positive phase.
• $p^-_{ij}$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium on the negative phase.
• $R$ denotes the learning rate

This result follows from the fact that at thermal equilibrium the probability $P^-(s)$ of any global

state s when the network is free-running is given by the Boltzmann distribution (hence the name

"Boltzmann machine").

Remarkably, this learning rule is fairly biologically plausible because the only information needed to change

the weights is provided by "local" information. That is, the connection (or synapse biologically speaking)

does not need information about anything other than the two neurons it connects. This is far more

biologically realistic than the information needed by a connection in many other neural network training

algorithms, such as backpropagation.

The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL divergence is equivalent to maximizing the log-likelihood of the data.

Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This

is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated

before the maximization of the expected value of the complete data likelihood during the M-step.

Training the biases is similar, but uses only single node activity:

$$\frac{\partial G}{\partial \theta_i} = \frac{1}{R}\left[p^+_i - p^-_i\right]$$

where $p^+_i$ and $p^-_i$ are the probabilities of unit $i$ being on in the positive and negative phase, respectively.

Problems

The Boltzmann machine would theoretically be a rather general computational medium. For

instance, if trained on photographs, the machine would theoretically model the distribution of

photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the

learning seems to stop working correctly when the machine is scaled up to anything larger than a

trivial machine. This is due to a number of effects, the most important of which are:

• the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths;
• connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a random walk until the activities saturate.

Restricted Boltzmann Machine

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an

architecture called the "Restricted Boltzmann Machine" or "RBM", which does not allow intra-layer connections between hidden units or between visible units. After training one RBM, the activities of its hidden units can be

treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to

train many layers of hidden units efficiently and as each new layer is added the overall generative

model gets better.
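As an illustrative sketch (not from the article), here is a minimal binary RBM trained with one step of contrastive divergence, a standard practical shortcut that approximates the positive- and negative-phase statistics described above with a single reconstruction step instead of running to equilibrium; bias terms are omitted and all sizes are arbitrary, with random binary vectors standing in for real training data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_vis, n_hid))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

data = rng.integers(0, 2, size=(100, n_vis)).astype(float)  # stand-in training vectors

for epoch in range(50):
    for v0 in data:
        # Positive phase: hidden activities with the visible units clamped to data.
        ph0 = sigmoid(v0 @ W)
        h0 = sample(ph0)
        # Negative phase (approximated): one reconstruction step instead of
        # running the free network to thermal equilibrium.
        v1 = sample(sigmoid(W @ h0))
        ph1 = sigmoid(v1 @ W)
        # Weight update ~ positive-phase correlations minus negative-phase ones.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
```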

There is an extension to the Restricted Boltzmann Machine that allows the use of real-valued data rather than binary data. Along with higher-order Boltzmann Machines, it is outlined here [1].


Hopfield net

A Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. Hopfield nets serve as content-

addressable memory systems with binary threshold units. They are guaranteed to converge to a local minimum, but convergence

to one of the stored patterns is not guaranteed.

Structure
(Figure: A Hopfield net with four nodes.)

The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values for their states and the value is determined by whether or not the unit's input exceeds its threshold. Hopfield nets can either have units that take on values of 1 or -1, or units that take on values of 1 or 0. So, the two possible definitions for unit $i$'s activation, $a_i$, are:

$$a_i = \begin{cases} 1 & \text{if } \sum_j w_{ij}\, s_j > \theta_i \\ -1 & \text{otherwise} \end{cases} \quad (1)$$

$$a_i = \begin{cases} 1 & \text{if } \sum_j w_{ij}\, s_j > \theta_i \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

Where:

• $w_{ij}$ is the strength of the connection weight from unit $j$ to unit $i$.
• $s_j$ is the state of unit $j$.
• $\theta_i$ is the threshold of unit $i$.

The connections in a Hopfield net typically have the following restrictions:

• $w_{ii} = 0,\ \forall i$ (no unit has a connection with itself)
• $w_{ij} = w_{ji},\ \forall i, j$ (connections are symmetric)

The requirement that weights be symmetric is typically used, as it will guarantee that the energy function decreases

monotonically while following the activation rules, and the network may exhibit some periodic or chaotic behaviour if non-

symmetric weights are used. However, Hopfield found that this chaotic behaviour is confined to relatively small parts of

the phase space, and does not impair the network's ability to act as a content-addressable associative memory system.

Hopfield nets have a scalar value associated with each state of the network, referred to as the "energy", $E$, of the network, where:

$$E = -\frac{1}{2}\sum_{i,j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i$$

This value is called the "energy" because the definition ensures that if units are randomly chosen to update their

activations the network will converge to states which are local minima in the energy function (which is considered

to be a Lyapunov function). Thus, if a state is a local minimum in the energy function it is a stable state for the

network. Note that this energy function belongs to a general class of models in physics, under the name of Ising

models; these in turn are a special case of Markov networks, since the associated probability measure, the Gibbs

measure, has the Markov property.


Running

At each step, pick a node at random. The node's behavior is then deterministic: it moves to a state to minimize the

energy of itself and its neighbors. (In contrast, the Boltzmann machine has a stochastic update rule.)

Training

Training a Hopfield net involves lowering the energy of states that the net should "remember". This allows the net

to serve as a content addressable memory system, that is to say, the network will converge to a "remembered"

state if it is given only part of the state. The net can be used to recover from a distorted input the trained state that

is most similar to that input. This is called associative memory because it recovers memories on the basis of

similarity. For example, if we train a Hopfield net with five units so that the state (1, 0, 1, 0, 1) is an energy

minimum, and we give the network the state (1, 0, 0, 0, 1), it will converge to (1, 0, 1, 0, 1). Thus, the network is properly trained when the states which the network should remember are local minima of the energy function.
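A minimal sketch of the training and recall just described, using the article's five-unit example; storing the pattern with a Hebbian outer product on ±1 versions of the 0/1 states is a standard choice, not something the text specifies.

```python
import numpy as np

pattern = np.array([1, 0, 1, 0, 1])

# Hebbian storage on the bipolar (+1/-1) version of the stored pattern.
x = 2 * pattern - 1
W = np.outer(x, x).astype(float)
np.fill_diagonal(W, 0.0)           # no unit connects to itself

def recall(state, W, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    s = 2 * np.asarray(state) - 1  # work in +1/-1
    for _ in range(steps):
        i = rng.integers(len(s))   # asynchronous update: pick a random unit
        s[i] = 1 if W[i] @ s >= 0 else -1
    return (s + 1) // 2            # back to 0/1

print(recall([1, 0, 0, 0, 1], W))  # converges to [1, 0, 1, 0, 1]
```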


One of the earliest recurrent neural networks reported in the literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each pair of units $i$ and $j$, $i \neq j$.

(Figure: The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.)
All connections are weighted. In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). This is why the network, which we will describe in this chapter, is generally referred to as the Hopfield network.

The Hopfield network consists of a set of $N$ interconnected neurons which update their activation values asynchronously and independently of other neurons. All neurons are both input and output neurons. The activation values are binary. Originally, Hopfield chose activation values of 1 and 0, but using values +1 and -1 presents some advantages discussed below. We will therefore adhere to the latter convention.

The state of the system is given by the activation values $y = (y_k)$. The net input $s_k(t+1)$ of a neuron $k$ at cycle $t+1$ is a weighted sum:

$$s_k(t+1) = \sum_{j \neq k} y_j(t)\, w_{jk} + \theta_k$$

A simple threshold function is applied to the net input to obtain the new activation value $y_k(t+1)$ at time $t+1$:

$$y_k(t+1) = \operatorname{sgn}\big(s_k(t+1)\big)$$

Hopfield network as associative memory

A primary application of the Hopfield network is as an associative memory.


In this case, the weights of the connections between the neurons have to be set such that the states of the system corresponding with the patterns which are to be stored in the network are stable. These states can be seen as 'dips' in energy space. When the network is cued with a noisy or incomplete test pattern, it will fill in the incorrect or missing data by iterating to a stable state which is in some sense 'near' to the cued pattern.
The Hebb rule can be used to store the patterns $x^p$: $w_{jk} = \sum_p x^p_j x^p_k$ for $j \neq k$; i.e., if $x^p_j$ and $x^p_k$ are equal, $w_{jk}$ is increased, otherwise decreased by one (note that, in the original Hebb rule, weights only increase). It appears, however, that the network gets saturated very quickly, and that about $0.15N$ memories can be stored before recall errors become severe. There are two problems associated with storing too many patterns:

1. the stored patterns become unstable;


2. spurious stable states appear (i.e., stable states which do not
correspond with stored patterns).

The first of these two problems can be solved by an algorithm proposed by Bruce et al. (Bruce, Canning, Forrest, Gardner, & Wallace, 1986):

Algorithm 1. Given a starting weight matrix $W = [w_{jk}]$, for each pattern $x^p$ to be stored and each element $x^p_k$ in $x^p$ define a correction $\varepsilon_k$ such that

$$\varepsilon_k = \begin{cases} 0 & \text{if } y_k = x^p_k \\ 1 & \text{otherwise} \end{cases}$$

Now modify $w_{jk}$ by $\Delta w_{jk} = y_j y_k (\varepsilon_j + \varepsilon_k)$ if $j \neq k$. Repeat this procedure until all patterns are stable.

It appears that, in practice, this algorithm usually converges. There exist


cases, however, where the algorithm remains oscillatory (try to find one)!
