
Top 10 Neural Network Architectures You Need to Know
1 - Perceptrons
Considered the first generation of neural networks, perceptrons are simply computational models of a single neuron. The term was coined by Frank Rosenblatt in “The perceptron: a probabilistic model for information storage and organization in the brain” [1]. Also called a feed-forward neural network, a perceptron feeds information from the front to the back. Training perceptrons usually requires back-propagation, giving the network paired datasets of inputs and outputs. Inputs are sent into the neuron, processed, and result in an output. The error that is back-propagated is usually the difference between the network’s actual output and the desired output. If the network has enough hidden neurons, it can always model the relationship between the input and output. In practice, their use is a lot more limited, but they are popularly combined with other networks to form new networks.
If you choose features by hand and have enough features, you can do almost anything. For binary input vectors, we can have a separate feature unit for each of the exponentially many binary vectors, so we can make any possible discrimination on binary input vectors. However, perceptrons do have limitations: once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn.
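
To make the decision rule and error-driven update concrete, here is a minimal sketch of the classic perceptron learning rule in Python with NumPy. The function names, learning rate, and epoch count are illustrative assumptions for the example, not anything from the original paper:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """X: (n_samples, n_features) binary features; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # Weights move only when the prediction is wrong: the update
            # is driven by the (target - output) error.
            error = target - pred
            w += lr * error * xi
            b += lr * error
    return w, b

# Learn logical AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([(1 if x @ w + b > 0 else 0) for x in X])  # [0, 0, 0, 1]
```

Because AND is linearly separable over these raw inputs, the weights converge; XOR would not, which is exactly the kind of limitation on hand-coded features noted above.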

2 - Convolutional Neural Networks


In 1998, Yann LeCun and his collaborators developed a really good recognizer for handwritten digits called LeNet. It used back-propagation in a feed-forward net with many hidden layers, many maps of replicated units in each layer, pooling of the outputs of nearby replicated units, a wide net that can cope with several characters at once even if they overlap, and a clever way of training a complete system, not just a recognizer. This approach was later formalized under the name ***convolutional neural networks (CNNs)***.

Convolutional neural networks are quite different from most other networks. They are primarily used for image processing, but can also be applied to other types of input, such as audio. A typical use case for CNNs is one where you feed the network images and it classifies the data. CNNs tend to start with an input “scanner,” which is not intended to parse all of the training data at once. For example, to input an image of 100 x 100 pixels, you wouldn’t want a layer with 10,000 nodes. Rather, you create a scanning input layer of, say, 10 x 10, and you feed it the first 10 x 10 pixels of the image. Once you’ve passed that input, you feed it the next 10 x 10 pixels by moving the scanner one pixel to the right.

This input data is then fed through convolutional layers instead of normal layers, where not all nodes are connected. Each node only concerns itself with close neighboring cells. These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input. Besides these convolutional layers, CNNs also often feature pooling layers. Pooling is a way to filter out details: a commonly used pooling technique is max pooling, where we take, say, a 2 x 2 block of pixels and pass on only the one with the highest value. If you want to dig deeper into CNNs, read Yann LeCun’s original paper, “Gradient-based learning applied to document recognition” (1998) [2].
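
As a rough illustration of the scanner-plus-pooling idea (not LeNet itself), here is a minimal NumPy sketch: a single convolutional filter slides across a 100 x 100 image one pixel at a time, and a 2 x 2 max-pooling step then keeps only the strongest activation in each block. The edge filter and all sizes are arbitrary choices for the example:

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output node only looks at a small neighborhood.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Keep only the strongest activation in each block.
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.rand(100, 100)
edge_filter = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
pooled = max_pool(conv2d(image, edge_filter))
print(pooled.shape)  # (49, 49)
```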

3 - Recurrent Neural Networks


To understand RNNs, we need to have a brief overview of sequence
modeling. When applying machine learning to sequences, we often want to
turn an input sequence into an output sequence that lives in a different
domain. For example, turn a sequence of sound pressures into a sequence
of word identities. When there is no separate target sequence, we can get a
teaching signal by trying to predict the next term in the input sequence. The
target output sequence is the input sequence with an advance of one step.
This seems much more natural than trying to predict one pixel in an image
from the other pixels, or one patch of an image from the rest of the image.
Predicting the next term in a sequence blurs the distinction between
supervised and unsupervised learning. It uses methods designed for
supervised learning but doesn’t require a separate teaching signal.

Memoryless models are the standard approach to this task. In particular, autoregressive models can predict the next term in a sequence from a fixed number of previous terms using “delay taps.” Feed-forward neural nets are generalized autoregressive models that use one or more layers of non-linear hidden units. However, if we give our generative model some hidden state, and if we give this hidden state its own internal dynamics, we get a much more interesting kind of model that can store information in its hidden state for a long time. If the dynamics and the way the model generates outputs from its hidden state are noisy, we will never know its exact hidden state. The best we can do is infer a probability distribution over the space of hidden state vectors. This inference is only tractable for two types of hidden state models: linear dynamical systems and hidden Markov models.
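
For contrast with the hidden-state models discussed next, here is a minimal sketch of a memoryless autoregressive predictor: the next term is estimated from a fixed window of k “delay taps” by least squares. The sine series and the window size are arbitrary assumptions for illustration; a feed-forward net would add non-linear hidden layers on top of the same windowed input:

```python
import numpy as np

def fit_ar(series, k=3):
    # Build (window, next-value) pairs and solve by ordinary least squares.
    X = np.array([series[i:i + k] for i in range(len(series) - k)])
    y = series[k:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

t = np.arange(200)
series = np.sin(0.1 * t)
coeffs = fit_ar(series, k=3)
# Predict the next value from the last 3 observed terms.
print(series[-3:] @ coeffs, np.sin(0.1 * 200))  # nearly identical
```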

Originally introduced in Jeffrey Elman’s “Finding structure in time” (1990) [3], recurrent neural networks (RNNs) are basically perceptrons. However, unlike perceptrons, which are stateless, they have connections between passes, connections through time. RNNs are very powerful, because they combine two properties: 1) a distributed hidden state that allows them to store a lot of information about the past efficiently, and 2) non-linear dynamics that allow them to update their hidden state in complicated ways. With enough neurons and time, RNNs can compute anything that your computer can compute. So what kinds of behavior can RNNs exhibit? They can oscillate, settle to point attractors, and behave chaotically. They can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects.
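
The hidden-state update itself is compact. Below is a minimal sketch of one vanilla RNN step, where the new state is a non-linear function of the current input and the previous state; all dimensions and names are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Distributed hidden state + non-linear dynamics, as described above.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a length-5 input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (16,)
```

Note that gradients flow back through the repeated `W_hh` multiplication, which is exactly where the vanishing/exploding gradient problem discussed next comes from.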

One big problem with RNNs is the vanishing (or exploding) gradient problem, where, depending on the activation functions used, information rapidly gets lost over time. Intuitively, this wouldn’t seem to be much of a problem, because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored. If a weight reaches a value of 0 or 1,000,000, the previous state won’t be very informative. RNNs can, in principle, be used in many fields, as most forms of data that don’t actually have a timeline (i.e., unlike audio or video) can still be represented as a sequence. A picture or a string of text can be fed one pixel or character at a time, so time-dependent weights are used for what came before in the sequence, not for what happened x seconds before. In general, recurrent networks are a good choice for advancing or completing information, such as autocompletion.

4 - Long / Short Term Memory


Hochreiter & Schmidhuber (1997) [4] solved the problem of getting an RNN to remember things for a long time by building what are known as ***long short-term memory networks (LSTMs)***. LSTMs try to combat the vanishing/exploding gradient problem by introducing gates and an explicitly defined memory cell. The memory cell stores the previous values and holds onto them unless a "forget gate" tells the cell to forget those values. LSTMs also have an "input gate" that adds new stuff to the cell and an "output gate" that decides when to pass along the vectors from the cell to the next hidden state.

Recall that with all RNNs, the values coming in from X_train and H_previous
are used to determine what happens in the current hidden state. The results
of the current hidden state (H_current) are used to determine what happens
in the next hidden state. LSTMs simply add a cell layer to make sure the
transfer of hidden state information from one iteration to the next is
reasonably high. Put another way, we want to remember stuff from previous
iterations for as long as needed, and the cells in LSTMs allow this to happen.
LSTMs are able to learn complex sequences, such as Hemingway’s writing
or Mozart’s music.
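
Here is a minimal sketch of a single LSTM step that mirrors the gate story above: the forget gate decides what to erase from the cell, the input gate what to add, and the output gate what to expose as the next hidden state. The stacked weight layout and all names are assumptions for the example, not code from the original paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    hidden = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b      # all four gates at once
    f = sigmoid(z[:hidden])                        # forget gate
    i = sigmoid(z[hidden:2 * hidden])              # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])          # output gate
    g = np.tanh(z[3 * hidden:])                    # candidate values
    c = f * c_prev + i * g                         # memory cell holds on
    h = o * np.tanh(c)                             # what gets passed along
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (16,) (16,)
```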

5 - Gated Recurrent Unit


Gated recurrent units (GRUs) are a slight variation on LSTMs. They take
X_train and H_previous as inputs. They perform some calculations and then
pass along H_current. In the next iteration, X_train.next and H_current are
used for more calculations, and so on. What makes them different from
LSTMs is that GRUs don't need the cell layer to pass values along. The
calculations within each iteration ensure that the H_current values being
passed along either retain a high amount of old information or are jump-
started with a high amount of new information.
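
A sketch of one GRU step makes the difference visible: there is no separate cell, just gates that blend old and new information directly into H_current. Names and shapes here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ W_z + b_z)                     # update gate
    r = sigmoid(xh @ W_r + b_r)                     # reset gate
    h_cand = np.tanh(np.concatenate([x_t, r * h_prev]) @ W_h + b_h)
    # Retain old information (z near 1) or jump-start with new (z near 0).
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
shape = (input_dim + hidden_dim, hidden_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h.shape)  # (16,)
```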

In most cases, GRUs function very similarly to LSTMs, with the biggest
difference being that GRUs are slightly faster and easier to run (but also
slightly less expressive). In practice, these tend to cancel each other out, as
you need a bigger network to regain some expressiveness, which then in
turn cancels out the performance benefits. In some cases where the extra
expressiveness is not needed, GRUs can outperform LSTMs. You can read
more about GRUs in Junyoung Chung’s 2014 “Empirical evaluation of gated
recurrent neural networks on sequence modeling” [5].

6 - Hopfield Network
Recurrent networks of non-linear units are generally very hard to analyze.
They can behave in many different ways: settle to a stable state, oscillate, or
follow chaotic trajectories that cannot be predicted far into the future. To
resolve this problem, John Hopfield introduced the Hopfield Net in his 1982
work “Neural networks and physical systems with emergent collective
computational abilities” [6]. A Hopfield network (HN) is a network where every neuron is connected to every other neuron. It is a completely entangled plate of spaghetti, since every node functions as everything: each node acts as an input before training, as a hidden unit during training, and as an output afterwards. The networks are trained by setting the value of the neurons to the desired pattern, after which the weights can be computed. The weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns, because the network is only stable in those states.
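
Here is a minimal sketch of this train-once-then-converge behavior, using the standard Hebbian outer-product rule (an assumption here, since the article does not specify the rule): the weights are computed from the stored patterns and never change, and recall iterates node updates until the state settles into a learned pattern.

```python
import numpy as np

def train_hopfield(patterns):
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:                 # patterns use values in {-1, +1}
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)             # no self-connections
    return W / patterns.shape[0]

def recall(W, state, steps=5):
    state = state.copy()
    for _ in range(steps):
        for i in range(len(state)):    # asynchronous node updates
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

patterns = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]])
W = train_hopfield(patterns)
noisy = np.array([1, -1, 1, -1, 1, 1])   # first pattern, one bit flipped
print(recall(W, noisy))                   # -> [ 1 -1  1 -1  1 -1]
```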

There is another computational role for Hopfield nets. Instead of using the
net to store memories, we use it to construct interpretations of sensory input.
The input is represented by the visible units, the interpretation is represented by the states of the hidden units, and the badness of the interpretation is represented by the energy.

Unfortunately, people have shown that a Hopfield net is very limited in its capacity: a Hopfield net of N units can only memorize about 0.15N patterns, because of so-called spurious minima in its energy function. The idea is that since the energy function is continuous in the space of its weights, if two local minima are too close, they might “fall” into each other to create a single local minimum that doesn’t correspond to any training sample, while forgetting about the two samples it is supposed to memorize. This phenomenon significantly limits the number of samples that a Hopfield net can learn.

7 - Boltzmann Machine
A Boltzmann machine is a type of stochastic recurrent neural network. It can be seen as the stochastic, generative counterpart of Hopfield nets. It was one of the first neural networks capable of learning internal representations, and it is able to represent and solve difficult combinatorial problems. First introduced by Geoffrey Hinton and Terrence Sejnowski in “Learning and relearning in Boltzmann machines” (1986) [7], Boltzmann machines are a lot like Hopfield networks, but some neurons are marked as input neurons and others remain “hidden.” The input neurons become output neurons at the end of a full network update. The network starts with random weights and learns through a stochastic two-phase procedure (described below) rather than plain back-propagation. Compared to a Hopfield net, the neurons mostly have binary activation patterns.

The goal of learning for a Boltzmann machine learning algorithm is to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It is also equivalent to maximizing the probability that we would obtain exactly the N training cases if we did the following: 1) let the network settle to its stationary distribution N different times with no external input, and 2) sample the visible vector once each time.
An efficient mini-batch learning procedure for Boltzmann machines was proposed by Salakhutdinov and Hinton in 2012 [8]. It has two phases (a code sketch follows the list):

- For the positive phase, first initialize the hidden probabilities at 0.5, clamp a data vector on the visible units, then update all of the hidden units in parallel until convergence using mean-field updates. After the net has converged, record P_i P_j for every connected pair of units and average this over all data in the mini-batch.

- For the negative phase, first keep a set of “fantasy particles,” where each particle has a value that is a global configuration. Sequentially update all of the units in each fantasy particle a few times. For every connected pair of units, average S_i S_j over all of the fantasy particles.
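
The sketch below illustrates the two phases in a simplified, layered (restricted) setting where the mean-field positive phase and persistent fantasy particles are easy to write down; a general Boltzmann machine would also have within-layer connections, and everything here (names, sizes, schedules) is an assumption for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boltzmann_update(W, batch, fantasy_v, lr=0.01, mf_steps=10, gibbs_steps=5):
    """One mini-batch update from positive and negative phase statistics."""
    rng = np.random.default_rng()

    # Positive phase: clamp the data on the visible units and run mean-field
    # updates (with no within-layer connections these settle immediately),
    # then average P_i * P_j over the mini-batch.
    h_prob = np.full((batch.shape[0], W.shape[1]), 0.5)
    for _ in range(mf_steps):
        h_prob = sigmoid(batch @ W)
    positive = batch.T @ h_prob / batch.shape[0]

    # Negative phase: update the units of each persistent fantasy particle
    # a few times, then average S_i * S_j over all particles.
    v = fantasy_v
    for _ in range(gibbs_steps):
        h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W)).astype(float)
        v = (rng.random(v.shape) < sigmoid(h @ W.T)).astype(float)
    negative = v.T @ sigmoid(v @ W) / v.shape[0]

    return W + lr * (positive - negative), v  # particles persist across batches

rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(6, 4))        # 6 visible, 4 hidden units
batch = rng.integers(0, 2, size=(8, 6)).astype(float)
fantasy = rng.integers(0, 2, size=(8, 6)).astype(float)
W, fantasy = boltzmann_update(W, batch, fantasy)
print(W.shape)  # (6, 4)
```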

In a general Boltzmann machine, the stochastic updates of units need to be sequential. There is a special architecture that allows alternating parallel updates, which are much more efficient (no connections within a layer, no skip-layer connections); this makes the updates of the Boltzmann machine much more parallel. The resulting architecture is called a Deep Boltzmann Machine (DBM): a general Boltzmann machine with a lot of missing connections.

8 - Deep Belief Networks


Back-propagation is considered the standard method in artificial neural networks for calculating the error contribution of each neuron after a batch of data is processed. However, there are some major problems with back-propagation. First, it requires labeled training data, while almost all data is unlabeled. Second, the learning time does not scale well, which means it is very slow in networks with multiple hidden layers. Third, it can get stuck in poor local optima, so for deep nets it is far from optimal.

To overcome the limitations of back-propagation, researchers have considered using unsupervised learning approaches. This keeps the efficiency and simplicity of using a gradient method for adjusting the weights, while also using it to model the structure of the sensory input. In particular, the weights are adjusted to maximize the probability that a generative model would have generated the sensory input. The question is what kind of generative model we should learn. Can it be an energy-based model like a Boltzmann machine? Or a causal model made of idealized neurons? Or a hybrid of the two?

Deep belief networks (DBNs) were shown to be effectively trainable stack by stack, an idea introduced by Geoffrey Hinton and developed further in Yoshua Bengio’s “Greedy layer-wise training of deep networks” [9]. This technique is also known as greedy training, where greedy means making locally optimal choices to get to a decent, but possibly not optimal, answer. A belief net is a directed acyclic graph composed of stochastic variables. Given a belief net, we get to observe some of the variables, and we would like to solve two problems: 1) the inference problem: infer the states of the unobserved variables, and 2) the learning problem: adjust the interactions among variables to make the network more likely to generate the training data.
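
A minimal sketch of the greedy stack-by-stack idea follows: train one layer as an RBM using one-step contrastive divergence (a common stand-in here, not necessarily the exact procedure of [9]), freeze it, push the activations upward, and repeat on the next layer. Biases are omitted for brevity and all names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        h = sigmoid(data @ W)                        # positive phase
        v_recon = sigmoid(h @ W.T)                   # reconstruction
        h_recon = sigmoid(v_recon @ W)               # negative phase
        W += lr * (data.T @ h - v_recon.T @ h_recon) / data.shape[0]
    return W

def train_dbn(data, layer_sizes):
    weights, layer_input = [], data
    for n_hidden in layer_sizes:                     # greedy: one stack at a time
        W = train_rbm(layer_input, n_hidden)
        weights.append(W)                            # freeze this layer
        layer_input = sigmoid(layer_input @ W)       # feed activations upward
    return weights

data = (np.random.default_rng(1).random((100, 20)) > 0.5).astype(float)
weights = train_dbn(data, layer_sizes=[16, 8])
print([W.shape for W in weights])  # [(20, 16), (16, 8)]
```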
