Abstract

We investigate a new method to augment recurrent neural networks with extra memory without increasing the number of network parameters. The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks. Holographic Reduced Representations have limited capacity: as they store more information, each retrieval becomes noisier due to interference. Our system in contrast creates redundant copies of stored information, which enables retrieval with reduced noise. Experiments demonstrate faster learning on multiple memorization tasks.

1. Introduction

We aim to enhance LSTM (Hochreiter & Schmidhuber, 1997), which in recent years has become widely used in sequence prediction, speech recognition and machine translation (Graves, 2013; Graves et al., 2013; Sutskever et al., 2014). We address two limitations of LSTM. The first limitation is that the number of memory cells is linked to the size of the recurrent weight matrices: an LSTM with N_h memory cells requires a recurrent weight matrix with O(N_h^2) weights. The second limitation is that LSTM is a poor candidate for learning to represent common data structures like arrays, because it lacks a mechanism to index its memory while writing and reading.

To overcome these limitations, recurrent neural networks have previously been augmented with soft or hard attention mechanisms to external memories (Graves et al., 2014; Sukhbaatar et al., 2015; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Zaremba & Sutskever, 2015). The attention acts as an addressing system that selects memory locations; the content of the selected memory locations can be read or modified by the network.

We provide a different addressing mechanism in Associative LSTM, where, like LSTM, an item is stored in a distributed vector representation without locations. Our system implements an associative array that stores key-value pairs, based on two contributions:

1. We combine LSTM with ideas from Holographic Reduced Representations (HRRs) (Plate, 2003) to enable key-value storage of data.

2. A direct application of the HRR idea leads to very lossy storage. We use redundant storage to increase memory capacity and to reduce noise in memory retrieval.

HRRs use a "binding" operator to implement key-value binding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists. Since Holographic Reduced Representations may be unfamiliar to many readers, Section 2 provides a short introduction to them and to related vector-symbolic architectures (Kanerva, 2009).

In computing, Redundant Arrays of Inexpensive Disks (RAID) provided a means to build reliable storage from unreliable components. We similarly reduce retrieval error inside a holographic representation by using redundant storage, a construction described in Section 3. We then combine the redundant associative memory with LSTM in Section 5. The system can be equipped with a large memory without increasing the number of network parameters. Our experiments in Section 6 show the benefits of the memory system for learning speed and accuracy.
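As a concrete preview of key-value binding and its lossiness (a minimal NumPy sketch; we assume element-wise complex multiplication with unit-modulus random keys, the instantiation used in Sections 3 and 5):

import numpy as np

rng = np.random.default_rng(0)
D = 1024  # vector dimension

# Random complex keys with moduli equal to 1.
keys = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (10, D)))
values = rng.standard_normal((10, D))

# Bind each key to its value and superimpose all pairs
# into a single distributed memory trace.
trace = (keys * values).sum(axis=0)

# Multiplying by the conjugate inverts a unit-modulus key;
# the other nine pairs contribute zero-mean crosstalk noise.
estimate = (np.conj(keys[0]) * trace).real

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print([round(float(cosine(estimate, v)), 2) for v in values])
# Highest similarity at index 0, but the crosstalk grows with the
# number of stored items: the lossiness that redundant storage
# (contribution 2) addresses.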
Figure 1. From top to bottom: 20 original images and the image sequence retrieved from 1 copy, 4 copies, and 20 copies of the memory
trace. Using more copies reduces the noise.
Figure 2. The mean squared error per pixel when retrieving an ImageNet image from a memory trace. Left: 50 images are stored in the
memory trace. The number of copies ranges from 1 to 100. Middle: 50 copies are used, and the number of stored images goes from 1
to 100. The mean squared error grows linearly. Right: The number of copies is increased together with the number of stored images.
After reaching 50 copies, the mean squared error is almost constant.
to be the real part and the second half the imaginary part of a complex vector. The sequence of images is encoded by using random keys with moduli equal to 1. We see that retrieval is less corrupted by noise as the number of copies grows.

The mean squared error of retrieved ImageNet images is analysed in Figure 2 with varying numbers of copies and images. The simulation agrees accurately with our prediction: the mean squared error is proportional to the number of stored items and inversely proportional to the number of copies.

The redundant associative memory has several nice properties; a short sketch of the construction follows the list:

• The number of copies can be modified at any time: reducing their number increases retrieval noise, while increasing the number of copies enlarges capacity.

• It is possible to query using a partially known key by setting some key elements to zero. Each copy of the permuted key P_s r_k routes the zeroed key elements to different dimensions. We need to know O(N_h / N_copies) elements of the key to recover the whole value.

• Unlike Neural Turing Machines (Graves et al., 2014), it is not necessary to search for free locations when writing.

• It is possible to store more items than the number of copies, at the cost of increased retrieval noise.
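A minimal NumPy sketch of the redundant storage and averaged read-out (the per-copy permutations P_s and the mean over copies follow Section 3; the dimensions and counts are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(1)
D, n_items, n_copies = 1024, 50, 10

keys = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (n_items, D)))
values = rng.standard_normal((n_items, D))
perms = [rng.permutation(D) for _ in range(n_copies)]  # one P_s per copy

# Every copy stores all key-value pairs, each bound under its
# permuted key, giving n_copies memory traces.
traces = np.stack([(keys[:, p] * values).sum(axis=0) for p in perms])

def retrieve(key):
    # Average the per-copy look-ups: the permutations decorrelate
    # the crosstalk across copies, so its variance shrinks roughly
    # as 1/n_copies while the signal term stays fixed.
    return np.mean([np.conj(key[p]) * t for p, t in zip(perms, traces)],
                   axis=0).real

mse = np.mean((retrieve(keys[0]) - values[0]) ** 2)
print(mse)  # grows with n_items, shrinks with n_copies (compare Figure 2)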
4. Long Short-Term Memory

We briefly introduce the LSTM with forget gates (Gers et al., 2000), a recurrent neural network whose hidden state is described by two parts h_t, c_t ∈ R^{N_h}. At each time step, the network is presented with an input x_t and updates its state to

ĝ_f, ĝ_i, ĝ_o, û = W_xh x_t + W_hh h_{t−1} + b_h    (9)
g_f = σ(ĝ_f)    (10)
g_i = σ(ĝ_i)    (11)
g_o = σ(ĝ_o)    (12)
u = tanh(û)    (13)
c_t = g_f ⊙ c_{t−1} + g_i ⊙ u    (14)
h_t = g_o ⊙ tanh(c_t)    (15)

where σ(x) ∈ (0, 1) is the logistic sigmoid function, and g_f, g_i, g_o are the forget gate, input gate, and output gate, respectively. The u vector is a proposed update to the cell state c. W_xh, W_hh are weight matrices, and b_h is a bias vector. "⊙" denotes element-wise multiplication of two vectors.
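For concreteness, one step of this recurrence can be written out as follows (a minimal NumPy sketch of Equations 9-15; stacking the four pre-activations into one affine map, with W_xh of shape (4 N_h, N_x), W_hh of shape (4 N_h, N_h) and b_h of length 4 N_h, is our assumption about the weight layout):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_xh, W_hh, b_h):
    # Eq. 9: one affine map yields all four pre-activations,
    # stacked as [ĝ_f; ĝ_i; ĝ_o; û].
    pre = W_xh @ x_t + W_hh @ h_prev + b_h
    g_f_hat, g_i_hat, g_o_hat, u_hat = np.split(pre, 4)
    g_f = sigmoid(g_f_hat)            # Eq. 10, forget gate
    g_i = sigmoid(g_i_hat)            # Eq. 11, input gate
    g_o = sigmoid(g_o_hat)            # Eq. 12, output gate
    u = np.tanh(u_hat)                # Eq. 13, proposed update
    c_t = g_f * c_prev + g_i * u      # Eq. 14
    h_t = g_o * np.tanh(c_t)          # Eq. 15
    return h_t, c_t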
5. Associative Long Short-Term Memory

When we combine Holographic Reduced Representations with the LSTM, we need to implement complex vector multiplications. For a complex vector z = h_real + i h_imaginary, we use the form

h = [h_real; h_imaginary]    (16)

where h ∈ R^{N_h} and h_real, h_imaginary ∈ R^{N_h/2}; "[a; b]" stacks a on top of b. In the network description below, the reader can assume that all vectors and matrices are strictly real-valued.

As in LSTM, we first compute gate variables, but we also produce parameters that will be used to define associative keys r̂_i, r̂_o. The same gates are applied to the real and imaginary parts:

ĝ_f, ĝ_i, ĝ_o, r̂_i, r̂_o = W_xh x_t + W_hh h_{t−1} + b_h    (17)
û = W_xu x_t + W_hu h_{t−1} + b_u    (18)
g_f = [σ(ĝ_f); σ(ĝ_f)]    (19)
g_i = [σ(ĝ_i); σ(ĝ_i)]    (20)
g_o = [σ(ĝ_o); σ(ĝ_o)]    (21)
Unlike LSTM, we use an activation function that operates only on the modulus of a complex number. The following function restricts the modulus of a (real, imaginary) pair to be between 0 and 1:

bound(h) = [h_real ⊘ d; h_imaginary ⊘ d]    (22)

where "⊘" is element-wise division, and d ∈ R^{N_h/2}, d = max(1, √(h_real ⊙ h_real + h_imaginary ⊙ h_imaginary)), corresponds to element-wise normalisation by the modulus of each complex number when the modulus is greater than one. This hard bounding worked slightly better than applying tanh to the modulus. "bound" is then used to construct the update and two keys:

u = bound(û)    (23)
r_i = bound(r̂_i)    (24)
r_o = bound(r̂_o)    (25)

where r_i ∈ R^{N_h} is an input key, acting as a storage key in the associative array, and r_o ∈ R^{N_h} is an output key, corresponding to a lookup key. The update u is multiplied with the input gate g_i to produce the value to be stored.
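A direct transcription of Equation 22 (a minimal NumPy sketch, with h stacking the real half on top of the imaginary half as in Equation 16):

import numpy as np

def bound(h):
    # Normalise each (real, imaginary) pair by its modulus,
    # but only when the modulus exceeds 1 (Equation 22).
    h_real, h_imag = np.split(h, 2)
    d = np.maximum(1.0, np.sqrt(h_real ** 2 + h_imag ** 2))
    return np.concatenate([h_real / d, h_imag / d])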
Now, we introduce redundant storage and provide the procedure for memory retrieval. For each copy, indexed by s ∈ {1, . . . , N_copies}, we add the same key-value pair to the cell state:

r_{i,s} = [P_s 0; 0 P_s] r_i    (26)
c_{s,t} = g_f ⊙ c_{s,t−1} + r_{i,s} ⊛ (g_i ⊙ u)    (27)

where r_{i,s} is the permuted input key, and P_s ∈ R^{N_h/2 × N_h/2} is a constant random permutation matrix, specific to the s-th copy. "⊛" is element-wise complex multiplication and is computed using

r ⊛ u = [r_real ⊙ u_real − r_imaginary ⊙ u_imaginary; r_real ⊙ u_imaginary + r_imaginary ⊙ u_real]    (28)

The output key for each copy, r_{o,s}, is permuted by the same matrix as the copy's input key:

r_{o,s} = [P_s 0; 0 P_s] r_o    (29)

Finally, the cells (memory trace) are read out by averaging the copies:

h_t = g_o ⊙ bound((1/N_copies) Σ_{s=1}^{N_copies} r_{o,s} ⊛ c_{s,t})    (30)

Note that the permutation can be performed in O(N_h) computations. Additionally, all copies can be updated in parallel by operating on tensors of size N_copies × N_h.

On some tasks, we found that learning speed was improved by not feeding h_{t−1} to the update u: namely, W_hu is set to zero in Equation 18, which causes u to serve as an embedding of x_t. This modification was made for the episodic copy, XML modeling, and variable assignment tasks below.
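Putting Equations 17-30 together, one update step can be sketched as follows (a minimal NumPy sketch: the parameter shapes and the Python loop over copies are our choices for clarity; an efficient implementation would operate on N_copies × N_h tensors as noted above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bound(h):
    # Eq. 22: clip the modulus of each complex pair to at most 1.
    h_re, h_im = np.split(h, 2)
    d = np.maximum(1.0, np.sqrt(h_re ** 2 + h_im ** 2))
    return np.concatenate([h_re / d, h_im / d])

def complex_mul(r, u):
    # Eq. 28: element-wise complex multiplication.
    r_re, r_im = np.split(r, 2)
    u_re, u_im = np.split(u, 2)
    return np.concatenate([r_re * u_re - r_im * u_im,
                           r_re * u_im + r_im * u_re])

def permute(key, perm):
    # Eqs. 26 and 29: apply the copy's permutation P_s
    # to the real and imaginary halves of a key.
    k_re, k_im = np.split(key, 2)
    return np.concatenate([k_re[perm], k_im[perm]])

def associative_lstm_step(x_t, h_prev, cells_prev, params, perms):
    # cells_prev: (n_copies, n_h); perms: one fixed random
    # permutation of n_h/2 indices per copy.
    W_xh, W_hh, b_h, W_xu, W_hu, b_u = params
    n2 = h_prev.shape[0] // 2

    # Eq. 17: gates (n_h/2 values each) and two keys (n_h values each).
    pre = W_xh @ x_t + W_hh @ h_prev + b_h
    g_f_hat, g_i_hat, g_o_hat, r_i_hat, r_o_hat = np.split(
        pre, [n2, 2 * n2, 3 * n2, 5 * n2])
    u_hat = W_xu @ x_t + W_hu @ h_prev + b_u          # Eq. 18

    # Eqs. 19-21: the same gate acts on the real and imaginary halves.
    g_f = np.tile(sigmoid(g_f_hat), 2)
    g_i = np.tile(sigmoid(g_i_hat), 2)
    g_o = np.tile(sigmoid(g_o_hat), 2)

    u = bound(u_hat)        # Eq. 23
    r_i = bound(r_i_hat)    # Eq. 24
    r_o = bound(r_o_hat)    # Eq. 25

    # Eqs. 26-27: every copy stores the same key-value pair.
    value = g_i * u
    cells = np.stack([g_f * c + complex_mul(permute(r_i, p), value)
                      for c, p in zip(cells_prev, perms)])

    # Eqs. 29-30: read out by averaging the copies.
    read = np.mean([complex_mul(permute(r_o, p), c)
                    for c, p in zip(cells, perms)], axis=0)
    h_t = g_o * bound(read)
    return h_t, cells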
6. Experiments

All experiments used the Adam optimisation algorithm (Kingma & Ba, 2014) with no gradient clipping. For experiments with synthetic data, we generate new data for each training minibatch, obviating the need for a separate test data set. Minibatches of size 2 were used in all tasks besides the Wikipedia task below, where the minibatch size was 10.

We compared Associative LSTM to multiple baselines:

LSTM. We use LSTM with forget gates and without peepholes (Gers et al., 2000).

Permutation RNN. Each sequence is encoded by using powers of a constant random permutation matrix as keys:

h_t = P h_{t−1} + W x_t    (31)

Unitary RNN. The hidden state is updated as

h_t = f(W h_{t−1} + V x_t)    (32)

where W is a unitary matrix (W† W = I); the product of unitary matrices is itself unitary.
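A minimal sketch of the Permutation RNN encoder in Equation 31 (the sizes and initialisation scale are our choices for illustration; the decoding side is not shown):

import numpy as np

rng = np.random.default_rng(2)
n_h, n_x = 256, 64
P = np.eye(n_h)[rng.permutation(n_h)]       # constant random permutation matrix
W = rng.standard_normal((n_h, n_x)) * 0.01  # input projection

def permutation_rnn_step(h_prev, x_t):
    # Eq. 31. Unrolling gives h_T = Σ_t P^(T−t) W x_t,
    # so each input is effectively keyed by a power of P.
    return P @ h_prev + W @ x_t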
Figure 4. Training cost per sequence on the episodic copy task with variable-length sequences (1 to 10 characters). Associative LSTM learns quickly and almost as fast as in the fixed-length episodic copy. Unitary RNN converges slowly relative to the fixed-length task.

Figure 5. Training cost on the XML task, including Unitary RNNs.
could use a deep network at the output.

To present a more complex challenge, we constructed one other variant of the task in which the number of random
Figure 7. Training cost on the variable assignment task.

Figure 8. Training cost on the arithmetic task.
The English Wikipedia dataset has 100 million characters, of which we used the last 4 million as a test set. We used larger models on this task but, to reduce training time, did not use extremely large models, since our primary motivation was to compare the architectures. LSTM and Associative LSTM were constructed from a stack of 3 recurrent layers with 256 units, by sending the output of each layer into the input of the next layer as in Graves et al. (2013).

Associative LSTM performed comparably to LSTM (Figure 10). We expected that Associative LSTM would perform at least as well as LSTM: if the input key r_i is set to the all-ones vector, then the update in Equation 27 reduces to the LSTM update in Equation 14.

Figure 10. Test cost when modeling English Wikipedia.
Figure 11. Training cost on the XML task.

Figure 13. Training cost on the arithmetic task.