1 - Perceptrons
Considered the first generation of neural networks, perceptrons are simply
computational models of a single neuron. The term was originally coined by
Frank Rosenblatt in "The perceptron: a probabilistic model for information
storage and organization in the brain" [1]. Also called a feed-forward neural
network, a perceptron feeds information from the front to the back. Training
perceptrons usually requires back-propagation, giving the network paired
datasets of inputs and desired outputs. Inputs are sent into the neuron, processed,
and result in an output. The error that is back-propagated is usually the
difference between the produced output and the desired output. If the network
has enough hidden neurons, it can in theory model the relationship between the
input and output. In practice, their use is a lot more limited, but they are
popularly combined with other networks to form new networks.
If you choose features by hand and use enough of them, you can do almost
anything. For binary input vectors, we can have a separate feature unit for
each of the exponentially many binary vectors, and so we can make any
possible discrimination on binary input vectors. However, perceptrons do
have limitations: once the hand-coded features have been determined, there
are very strong limitations on what a perceptron can learn.
2 - Convolutional Neural Networks
Convolutional neural networks are quite different from most other networks.
They are primarily used for image processing, but can also be used for other
types of input, such as audio. A typical use case for CNNs is one where you
feed the network images and it classifies the data. CNNs tend to start with an
input "scanner," which is not intended to parse all of the training data at
once. For example, to input an image of 100 x 100 pixels, you wouldn't want
a layer with 10,000 nodes. Rather, you create a scanning input layer of, say,
10 x 10, and you feed it the first 10 x 10 pixels of the image. Once you've
passed that input, you feed it the next 10 x 10 pixels by moving the scanner
one pixel to the right.
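A rough sketch of that scanning idea, assuming a plain NumPy array stands in for the image; the 10 x 10 window and one-pixel step come from the text, everything else is illustrative:

```python
import numpy as np

image = np.random.rand(100, 100)  # stand-in for a 100 x 100 pixel grayscale image
window = 10                        # the 10 x 10 "scanner" from the text
stride = 1                         # move one pixel at a time

patches = []
for top in range(0, image.shape[0] - window + 1, stride):
    for left in range(0, image.shape[1] - window + 1, stride):
        # Each patch is what the scanning input layer sees at one position.
        patches.append(image[top:top + window, left:left + window])

print(len(patches))  # (100 - 10 + 1) ** 2 = 8281 overlapping 10 x 10 patches
```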
This input data is then fed through convolutional layers instead of normal
layers, in which not every node is connected to every other node. Each node
only concerns itself with close neighboring cells. These convolutional layers
also tend to shrink as they become deeper, mostly by easily divisible factors
of the input. Besides these convolutional layers, CNNs also often feature
pooling layers. Pooling is a way to filter out details: a commonly used
pooling technique is max pooling, where we take, say, a 2 x 2 block of pixels
and pass on only the largest value in that block. If you want to dig deeper
into CNNs, read Yann LeCun's original paper, "Gradient-based learning
applied to document recognition" (1998) [2].
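A minimal sketch of 2 x 2 max pooling on a small feature map, again assuming plain NumPy; the input values are made up for illustration:

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # toy 4 x 4 feature map

def max_pool_2x2(x):
    """Keep only the largest value in each non-overlapping 2 x 2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(feature_map))
# [[ 5.  7.]
#  [13. 15.]]
```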
3 - Recurrent Neural Networks
One big problem with RNNs is the vanishing (or exploding) gradient problem,
where, depending on the activation functions used, information rapidly gets
lost over time. Intuitively, this wouldn't seem to be much of a problem because
these are just weights and not neuron states, but the weights through time are
actually where the information from the past is stored. If a weight reaches a
value of 0 or 1,000,000, the previous state won't be very informative. RNNs
can, in principle, be used in many fields, as most forms of data that don't
actually have a timeline (i.e., unlike audio or video) can still be represented
as a sequence. A picture or a string of text can be fed one pixel or character
at a time, so time-dependent weights are used for what came before in the
sequence, not for what happened x seconds before. In general, recurrent
networks are a good choice for advancing or completing information, such as
autocompletion.
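To make the "weights through time" idea concrete, here is a minimal sketch of a single vanilla recurrent step in NumPy. The same hidden-to-hidden matrix W_h is reused at every step, which is exactly why its repeated application can shrink or blow up information from early in the sequence. All sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden, reused every step
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The same W_h carries information from one step to the next;
    # repeated multiplication by it is what can vanish or explode.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)
sequence = rng.normal(size=(20, input_size))  # e.g. 20 characters or pixels fed one at a time
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h.shape)  # (16,)
```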
4 - Long Short-Term Memory (LSTM)
Recall that with all RNNs, the values coming in from X_train and H_previous
are used to determine what happens in the current hidden state. The results
of the current hidden state (H_current) are then used to determine what
happens in the next hidden state. LSTMs simply add a cell state (and gates
around it) to make sure the transfer of hidden-state information from one
iteration to the next is preserved reasonably well. Put another way, we want
to remember things from previous iterations for as long as needed, and the
cells in LSTMs allow this to happen. LSTMs are able to learn complex
sequences, such as Hemingway's writing or Mozart's music.
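A minimal sketch of a single LSTM step, following the standard gate equations and reusing the H_previous / H_current naming from the text; the cell state C is what carries information across iterations. Weight shapes and initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)

# One weight matrix and bias per gate (input, forget, output) plus the candidate values.
W = {g: rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
     for g in ("i", "f", "o", "g")}
b = {g: np.zeros(hidden_size) for g in ("i", "f", "o", "g")}

def lstm_step(x_t, h_previous, c_previous):
    z = np.concatenate([x_t, h_previous])
    i = sigmoid(W["i"] @ z + b["i"])    # input gate: what to write to the cell
    f = sigmoid(W["f"] @ z + b["f"])    # forget gate: what to keep from the old cell
    o = sigmoid(W["o"] @ z + b["o"])    # output gate: what to expose as hidden state
    g = np.tanh(W["g"] @ z + b["g"])    # candidate values
    c_current = f * c_previous + i * g  # the cell preserves information across iterations
    h_current = o * np.tanh(c_current)
    return h_current, c_current

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
```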
5 - Gated Recurrent Units (GRU)
In most cases, GRUs function very similarly to LSTMs, the biggest difference
being that GRUs are slightly faster and easier to run (but also slightly less
expressive). In practice, these properties tend to cancel each other out: you
need a bigger network to regain some expressiveness, which in turn cancels
out the performance benefits. In some cases where the extra expressiveness
is not needed, GRUs can outperform LSTMs. You can read more about GRUs in
Junyoung Chung's 2014 paper, "Empirical evaluation of gated recurrent neural
networks on sequence modeling" [5].
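For comparison with the LSTM sketch above, here is the corresponding GRU step: it folds the machinery into two gates and drops the separate cell state, which is where the "faster and easier to run" trade-off comes from. Shapes and initialization are again illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)

W = {g: rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
     for g in ("z", "r", "h")}
b = {g: np.zeros(hidden_size) for g in ("z", "r", "h")}

def gru_step(x_t, h_previous):
    zin = np.concatenate([x_t, h_previous])
    z = sigmoid(W["z"] @ zin + b["z"])   # update gate
    r = sigmoid(W["r"] @ zin + b["r"])   # reset gate
    h_candidate = np.tanh(W["h"] @ np.concatenate([x_t, r * h_previous]) + b["h"])
    # No separate cell state: the hidden state itself is interpolated.
    return (1.0 - z) * h_previous + z * h_candidate

h = gru_step(rng.normal(size=input_size), np.zeros(hidden_size))
```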
6 - Hopfield Network
Recurrent networks of non-linear units are generally very hard to analyze.
They can behave in many different ways: settle to a stable state, oscillate, or
follow chaotic trajectories that cannot be predicted far into the future. To
resolve this problem, John Hopfield introduced the Hopfield net in his 1982
work "Neural networks and physical systems with emergent collective
computational abilities" [6]. A Hopfield network (HN) is a network in which
every neuron is connected to every other neuron. It is a completely entangled
plate of spaghetti, as every node functions as everything: each node acts as
an input before training, is hidden during training, and acts as an output
afterwards. The networks are trained by setting the value of the neurons to
the desired pattern, after which the weights can be computed. The weights
do not change after this. Once trained for one or more patterns, the network
will always converge to one of the learned patterns, because the network is
only stable in those states.
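A minimal sketch of that train-once-then-recall behaviour, assuming bipolar (+1/-1) patterns and the classic Hebbian outer-product rule for computing the fixed weights; the stored patterns themselves are made up for illustration:

```python
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])   # illustrative stored patterns

# Train once: weights are the sum of outer products, with no self-connections.
n = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)  # the weights never change after this

def energy(state):
    # Lower energy = better fit; the learned patterns sit in local minima.
    return -0.5 * state @ W @ state

def recall(state, steps=20):
    state = state.copy()
    for _ in range(steps):
        for i in np.random.permutation(n):       # asynchronous unit updates
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

noisy = np.array([-1, -1, 1, -1, 1, -1])          # pattern 0 with its first unit flipped
settled = recall(noisy)
print(settled)                                     # recovers the first stored pattern
print(energy(noisy), energy(settled))              # energy decreases as the net settles
```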
There is another computational role for Hopfield nets. Instead of using the
net to store memories, we can use it to construct interpretations of sensory
input: the input is represented by the visible units, the interpretation is
represented by the states of the hidden units, and the badness of the
interpretation is represented by the energy. Unfortunately, people have shown
that a Hopfield net is very limited in its capacity. A Hopfield net of N units
can only memorize about 0.15N patterns because of the so-called spurious
minima in its energy function. The idea is that since the energy function is
continuous in the space of its weights, if two local minima are too close,
they might "fall" into each other to create a single local minimum that
doesn't correspond to any training sample, while forgetting the two samples
it was supposed to memorize. This phenomenon significantly limits the number
of samples a Hopfield net can learn.
7 - Boltzmann Machine
A Boltzmann machine is a type of stochastic recurrent neural network. It can
be seen as the stochastic, generative counterpart of Hopfield nets. It was
one of the first neural networks capable of learning internal representations,
and it is able to represent and solve difficult combinatorial problems. First
introduced by Geoffrey Hinton and Terrence Sejnowski in "Learning and
relearning in Boltzmann machines" (1986) [7], Boltzmann machines are a lot
like Hopfield networks, but some neurons are marked as input neurons and
others remain "hidden." The input neurons become output neurons at the
end of a full network update. The network starts with random weights and
learns by comparing statistics gathered in a positive (data-driven) phase with
statistics gathered in a negative (free-running) phase, as described below,
rather than through back-propagation. Compared to a Hopfield net, the neurons
mostly have binary activation patterns.
For the positive phase, first initialize the hidden probabilities at 0.5,
clamp a data vector on the visible units, then update all of the hidden
units in parallel until convergence using mean-field updates. After the
net has converged, record the product p_i p_j for every connected pair of
units and average this over all data in the mini-batch.
For the negative phase, first keep a set of "fantasy particles," where each
particle's value is a global configuration. Sequentially update all of the
units in each fantasy particle a few times. For every connected pair of
units, average s_i s_j over all of the fantasy particles.
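A rough sketch of the two phases described above, for a small Boltzmann machine with one symmetric weight matrix over all (visible + hidden) units. The mean-field iteration count, number of fantasy particles, learning rate, and data are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
n = n_visible + n_hidden

# One symmetric weight matrix over all units, with no self-connections.
W = rng.normal(scale=0.01, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def positive_phase(v):
    """Clamp a data vector on the visible units, run mean-field updates on the hidden units."""
    state = np.concatenate([v, np.full(n_hidden, 0.5)])   # hidden probabilities start at 0.5
    for _ in range(20):                                    # iterate until (approximate) convergence
        state[n_visible:] = sigmoid(W[n_visible:] @ state + b[n_visible:])
    return np.outer(state, state)                          # p_i * p_j for every pair of units

def negative_phase(particles, sweeps=3):
    """Sequentially update every unit of each fantasy particle, then average s_i * s_j."""
    for s in particles:
        for _ in range(sweeps):
            for i in range(n):
                s[i] = rng.random() < sigmoid(W[i] @ s + b[i])
    return np.mean([np.outer(s, s) for s in particles], axis=0)

data = rng.integers(0, 2, size=(10, n_visible)).astype(float)   # toy mini-batch
particles = rng.integers(0, 2, size=(5, n)).astype(float)       # persistent fantasy particles

lr = 0.05
for epoch in range(50):
    pos = np.mean([positive_phase(v) for v in data], axis=0)
    neg = negative_phase(particles)
    dW = lr * (pos - neg)             # lower the energy of data, raise the energy of fantasies
    np.fill_diagonal(dW, 0.0)
    W += (dW + dW.T) / 2
```

The weight update is simply the difference between the averaged positive-phase and negative-phase pair statistics, which is the learning signal the two phases above are collecting.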