

Memory Networks

Published as a conference paper at ICLR 2015

Jason Weston, Sumit Chopra & Antoine Bordes
Facebook AI Research
770 Broadway
New York, USA
arXiv:1410.3916v11 [cs.AI] 29 Nov 2015


We describe a new class of learning models called memory networks. Memory
networks reason with inference components combined with a long-term memory
component; they learn how to use these jointly. The long-term memory can be
read and written to, with the goal of using it for prediction. We investigate these
models in the context of question answering (QA) where the long-term mem-
ory effectively acts as a (dynamic) knowledge base, and the output is a textual
response. We evaluate them on a large-scale QA task, and a smaller, but more
complex, toy task generated from a simulated world. In the latter, we show the
reasoning power of such models by chaining multiple supporting sentences to an-
swer questions that require understanding the intension of verbs.


Most machine learning models lack an easy way to read and write to part of a (potentially very
large) long-term memory component, and to combine this seamlessly with inference. Hence, they
do not take advantage of one of the great assets of a modern day computer. For example, consider
the task of being told a set of facts or a story, and then having to answer questions on that subject.
In principle this could be achieved by a language modeler such as a recurrent neural network (RNN)
(Mikolov et al., 2010; Hochreiter & Schmidhuber, 1997) as these models are trained to predict the
next (set of) word(s) to output after having read a stream of words. However, their memory (en-
coded by hidden states and weights) is typically too small, and is not compartmentalized enough
to accurately remember facts from the past (knowledge is compressed into dense vectors). RNNs
are known to have difficulty in performing memorization, for example the simple copying task of
outputting the same input sequence they have just read (Zaremba & Sutskever, 2014). The situation
is similar for other tasks, e.g., in the vision and audio domains a long term memory is required to
watch a movie and answer questions about it.
In this work, we introduce a class of models called memory networks that attempt to rectify this
problem. The central idea is to combine the successful learning strategies developed in the machine
learning literature for inference with a memory component that can be read and written to. The
model is then trained to learn how to operate effectively with the memory component. We introduce
the general framework in Section 2, and present a specific implementation in the text domain for
the task of question answering in Section 3. We discuss related work in Section 4, describe our
experiments in Section 5, and finally conclude in Section 6.


A memory network consists of a memory m (an array of objects, for example an array of vectors or
an array of strings, indexed by mi ) and four (potentially learned) components I, G, O and R as follows:

I: (input feature map) – converts the incoming input to the internal feature representation.


G: (generalization) – updates old memories given the new input. We call this generalization
as there is an opportunity for the network to compress and generalize its memories at this
stage for some intended future use.
O: (output feature map) – produces a new output (in the feature representation space), given
the new input and the current memory state.
R: (response) – converts the output into the response format desired. For example, a textual
response or an action.

Given an input x (e.g., an input character, word or sentence depending on the granularity chosen, an
image or an audio signal) the flow of the model is as follows:

1. Convert x to an internal feature representation I(x).

2. Update memories mi given the new input: mi = G(mi , I(x), m), ∀i.
3. Compute output features o given the new input and the memory: o = O(I(x), m).
4. Finally, decode output features o to give the final response: r = R(o).
This process is applied at both train and test time, if there is a distinction between such phases, that
is, memories are also stored at test time, but the model parameters of I, G, O and R are not updated.
Memory networks cover a wide class of possible implementations. The components I, G, O and R
can potentially use any existing ideas from the machine learning literature, e.g., make use of your
favorite models (SVMs, decision trees, etc.).
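
The four-step flow above can be sketched as a small class. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MemoryNetwork` and the toy components `tokenize`, `append_slot`, `most_overlap` and `join_words` are hypothetical, G is simplified to an in-place append, and O uses plain word overlap rather than a learned scoring function.

```python
from typing import Callable, List


class MemoryNetwork:
    """Skeleton of the I/G/O/R flow; the four components are plain callables."""

    def __init__(self, I: Callable, G: Callable, O: Callable, R: Callable):
        self.I, self.G, self.O, self.R = I, G, O, R
        self.memory: List = []  # the memory array m

    def step(self, x):
        features = self.I(x)               # 1. internal feature representation I(x)
        self.G(self.memory, features)      # 2. update memories given the new input
        o = self.O(features, self.memory)  # 3. output features from input and memory
        return self.R(o)                   # 4. decode output features into a response


# Toy components (assumptions for illustration only).
def tokenize(x: str) -> List[str]:
    return x.lower().rstrip(".?").split()


def append_slot(memory: List, features) -> None:
    memory.append(features)  # simplest G: store I(x) in the next free slot


def most_overlap(features, memory):
    # Pick the stored memory (other than the input just written) sharing most words.
    candidates = memory[:-1] or memory
    return max(candidates, key=lambda m: len(set(m) & set(features)))


def join_words(o) -> str:
    return " ".join(o)


net = MemoryNetwork(tokenize, append_slot, most_overlap, join_words)
net.step("Joe went to the kitchen.")
net.step("Fred picked up the milk.")
answer = net.step("Where is Joe?")
```

Because the components are passed in as callables, any of them could be replaced by a learned model without changing the control flow.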

I component: Component I can make use of standard pre-processing, e.g., parsing, coreference
and entity resolution for text inputs. It could also encode the input into an internal feature represen-
tation, e.g., convert from text to a sparse or dense feature vector.

G component: The simplest form of G is to store I(x) in a “slot” in the memory:

$m_{H(x)} = I(x)$,    (1)
where H(.) is a function selecting the slot. That is, G updates the index H(x) of m, but all other
parts of the memory remain untouched. More sophisticated variants of G could go back and update
earlier stored memories (potentially, all memories) based on the new evidence from the current input
x. If the input is at the character or word level one could group inputs (i.e., by segmenting them into
chunks) and store each chunk in a memory slot.
If the memory is huge (e.g., consider all of Freebase or Wikipedia) one needs to organize the memo-
ries. This can be achieved with the slot choosing function H just described: for example, it could be
designed, or trained, to store memories by entity or topic. Consequently, for efficiency at scale, G
(and O) need not operate on all memories: they can operate on only a retrieved subset of candidates
(only operating on memories that are on the right topic). We explore a simple variant of this in our
experiments.
If the memory becomes full, a procedure for “forgetting” could also be implemented by H as it
chooses which memory is replaced, e.g., H could score the utility of each memory, and overwrite
the least useful. We have not explored this experimentally yet.
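
As a purely hypothetical sketch of such a forgetting procedure (the text notes it was not explored experimentally), a slot-choosing function could return a free slot when one exists and otherwise the slot holding the lowest-utility memory. The names `choose_slot` and `write`, and the use of sentence length as a placeholder utility, are assumptions made only for illustration.

```python
from typing import Callable, List, Optional


def choose_slot(memory: List[Optional[str]], utility: Callable[[str], float]) -> int:
    """Hypothetical H: a free slot if any, else the slot of the least useful memory."""
    for i, m in enumerate(memory):
        if m is None:
            return i
    return min(range(len(memory)), key=lambda i: utility(memory[i]))


def write(memory: List[Optional[str]], item: str, utility: Callable[[str], float]) -> None:
    # m_{H(x)} = I(x); all other slots remain untouched
    memory[choose_slot(memory, utility)] = item


memory: List[Optional[str]] = [None, None]  # a memory with only two slots
utility = len  # placeholder utility: longer sentences count as more useful
for sentence in ["Joe went to the kitchen", "Fred left", "Joe picked up the milk"]:
    write(memory, sentence, utility)
```

In this toy run the third write finds the memory full and overwrites the shortest stored sentence.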

O and R components: The O component is typically responsible for reading from memory and
performing inference, e.g., calculating which memories are relevant to producing a good response.
The R component then produces the final response given O. For example in a question answering
setup O finds relevant memories, and then R produces the actual wording of the answer, e.g., R
could be an RNN that is conditioned on the output of O. Our hypothesis is that without conditioning
on such memories, such an RNN will perform poorly.

One particular instantiation of a memory network is where the components are neural networks. We
refer to these as memory neural networks (MemNNs). In this section we describe a relatively simple
implementation of a MemNN with textual input and output.



In our basic architecture, the I module takes an input text. Let us first assume this to be a sentence:
either the statement of a fact, or a question to be answered by the system (later we will consider
word-based input sequences). The text is stored in the next available memory slot in its original
form [footnote 2], i.e., S(x) returns the next empty memory slot N : mN = x, N = N + 1. The G module
is thus only used to store this new memory, so old memories are not updated. More sophisticated
models are described in subsequent sections.
The core of inference lies in the O and R modules. The O module produces output features by
finding k supporting memories given x. We use k up to 2, but the procedure is generalizable to
larger k. For k = 1 the highest scoring supporting memory is retrieved with:
$o_1 = O_1(x, m) = \arg\max_{i=1,\dots,N} s_O(x, m_i)$    (2)

where sO is a function that scores the match between the pair of sentences x and mi . For the case
k = 2 we then find a second supporting memory given the first found in the previous iteration:
$o_2 = O_2(x, m) = \arg\max_{i=1,\dots,N} s_O([x, m_{o_1}], m_i)$    (3)

where the candidate supporting memory mi is now scored with respect to both the original input
and the first supporting memory, where square brackets denote a list [footnote 3]. The final output o is
[x, mo1 , mo2 ], which is input to the module R.
Finally, R needs to produce a textual response r. The simplest response is to return mok , i.e.,
to output the previously uttered sentence we retrieved. To perform true sentence generation, one
can instead employ an RNN. In our experiments we also consider an easy to evaluate compromise
approach where we limit textual responses to be a single word (out of all the words seen by the
model) by ranking them:
$r = \arg\max_{w \in W} s_R([x, m_{o_1}, m_{o_2}], w)$    (4)
where W is the set of all words in the dictionary, and sR is a function that scores the match.
An example task is given in Figure 1. In order to answer the question x = “Where is the milk now?”,
the O module first scores all memories, i.e., all previously seen sentences, against x to retrieve the
most relevant fact, mo1 = “Joe left the milk” in this case. Then, it would search the memory again
to find the second relevant fact given [x, mo1 ], that is mo2 = “Joe travelled to the office” (the last
place Joe went before dropping the milk). Finally, the R module using eq. (4) would score words
given [x, mo1 , mo2 ] to output r = “office”.
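
The iterative retrieval of eq. (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions: `overlap_score` is a hand-written stand-in for the learned scoring function sO, the stop-word list is arbitrary, and excluding already chosen memories from later hops is a choice made here for the toy scorer, not something stated in the text.

```python
from typing import Callable, List, Sequence, Set

STOP = {"the", "to", "is", "where", "up"}  # arbitrary stop words for the toy scorer


def words(s: str) -> Set[str]:
    return set(s.lower().split()) - STOP


def overlap_score(query: Sequence[str], memory: str) -> float:
    """Stand-in for s_O: shared content words between the query list and a memory."""
    q = set().union(*(words(s) for s in query))
    return float(len(q & words(memory)))


def retrieve(x: str, memories: List[str], score: Callable = overlap_score, k: int = 2) -> List[int]:
    chosen: List[int] = []
    for _ in range(k):
        query = [x] + [memories[i] for i in chosen]  # [x], then [x, m_o1], ...
        rest = [i for i in range(len(memories)) if i not in chosen]  # assumption: no repeats
        chosen.append(max(rest, key=lambda i: score(query, memories[i])))
    return chosen


memories = [
    "joe went to the kitchen",
    "fred went to the office",
    "joe picked up the milk",
]
o1, o2 = retrieve("where is the milk", memories)
```

In this toy story the first hop finds the statement that mentions the milk, and the second hop, conditioned on that statement, finds where Joe went.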
In our experiments, the scoring functions sO and sR have the same form, that of an embedding
model:
$s(x, y) = \Phi_x(x)^\top U^\top U \Phi_y(y)$,    (5)
where U is an n × D matrix, D is the number of features and n is the embedding dimension.
The role of Φx and Φy is to map the original text to the D-dimensional feature space. The simplest
feature space to choose is a bag of words representation; we choose D = 3|W | for sO , i.e., every
word in the dictionary has three different representations: one for Φy (.) and two for Φx (.) depending
on whether the words of the input arguments are from the actual input x or from the supporting
memories so that they can be modeled differently [footnote 4]. Similarly, we used D = 3|W | for sR as well.
sO and sR use different weight matrices UO and UR .
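
A small sketch may make the D = 3|W| feature layout and eq. (5) concrete. Everything here is illustrative: the five-word vocabulary, the block ordering (0 for input words, 1 for supporting-memory words, 2 for the candidate), the embedding dimension n = 4 and the random untrained matrix U are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

vocab = ["joe", "milk", "office", "left", "the"]
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
D = 3 * V  # three copies of the dictionary
n = 4      # embedding dimension (arbitrary here)


def bow(texts_with_blocks: Sequence[Tuple[str, int]]) -> List[float]:
    """Bag-of-words vector in R^D; each text is counted inside its own block."""
    v = [0.0] * D
    for text, block in texts_with_blocks:
        for w in text.lower().split():
            if w in index:
                v[block * V + index[w]] += 1.0
    return v


def phi_x(x: str, supports: Sequence[str] = ()) -> List[float]:
    # block 0: words of the actual input; block 1: words of supporting memories
    return bow([(x, 0)] + [(m, 1) for m in supports])


def phi_y(y: str) -> List[float]:
    return bow([(y, 2)])  # block 2: candidate memory or response word


def embed(U: List[List[float]], v: List[float]) -> List[float]:
    return [sum(u_j * v_j for u_j, v_j in zip(row, v)) for row in U]


def score(U: List[List[float]], fx: List[float], fy: List[float]) -> float:
    """s(x, y) = Phi_x(x)^T U^T U Phi_y(y), computed as (U Phi_x) . (U Phi_y)."""
    return sum(a * b for a, b in zip(embed(U, fx), embed(U, fy)))


rng = random.Random(0)
U = [[rng.gauss(0.0, 0.1) for _ in range(D)] for _ in range(n)]  # untrained weights
s = score(U, phi_x("the milk", ["joe left the milk"]), phi_y("office"))
```

Because the model is linear in the bag-of-words features, scoring the list [x, mo1] equals the sum of scoring x and mo1 separately, each in its own block.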
[Footnote 2] Technically, we will be using an embedding model to represent text, so we could store the incoming input
using its learned embedding vector in memory instead. The downside of such a choice is that during learning
the embedding parameters are changing, and hence the stored vectors would go stale. However, at test time
(where the parameters are not changing) storing as embedding vectors could make sense, as this is faster than
reading the original words and then embedding them repeatedly.
[Footnote 3] As we will use a bag-of-words model where both x and mo1 are represented in the bag (but with two
different dictionaries) this is equivalent to using the sum sO (x, mi ) + sO (mo1 , mi ), however a more sophisticated
modeling of the inputs (e.g., with nonlinearities) may not separate into a sum.
[Footnote 4] Experiments with only a single dictionary and linear embeddings performed worse (not shown). In order
to model with only a single dictionary, one could consider deeper networks that transform the words dependent
on their context. We leave this to future work.


Figure 1: Example “story” statements, questions and answers generated by a simple simulation.
Answering the question about the location of the milk requires comprehension of the actions “picked
up” and “left”. The questions also require comprehension of the time elements of the story, e.g., to
answer “where was Joe before the office?”.

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk.
Joe travelled to the office. Joe left the milk. Joe went to the bathroom.
Where is the milk now? A: office
Where is Joe? A: bathroom
Where was Joe before the office? A: kitchen

Training We train in a fully supervised setting where we are given desired inputs and responses,
and the supporting sentences are labeled as such in the training data (but not in the test data, where
we are given only the inputs). That is, during training we know the best choice of both max functions
in eq. (2) and (3) [footnote 5]. Training is then performed with a margin ranking loss and stochastic gradient
descent (SGD). Specifically, for a given question x with true response r and supporting sentences
mo1 and mo2 (when k = 2), we minimize over model parameters UO and UR :

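$\sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma - s_O(x, m_{o_1}) + s_O(x, \bar{f})) \; +$    (6)

$\sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma - s_O([x, m_{o_1}], m_{o_2}) + s_O([x, m_{o_1}], \bar{f}')) \; +$    (7)

$\sum_{\bar{r} \neq r} \max(0, \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r}))$    (8)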

where f¯, f¯′ and r̄ are all other choices than the correct labels, and γ is the margin. At every step
of SGD we sample f¯, f¯′ , r̄ rather than compute the whole sum for each training example, following
e.g., Weston et al. (2011).
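
The sampled form of this objective can be sketched as below: one negative is drawn per term and the three hinge terms are summed. This is an illustration only; the function names `hinge` and `sampled_loss`, the way scoring functions are passed in as callables, and the uninformative `flat` scorer are assumptions, and the gradient step itself is omitted.

```python
import random
from typing import Callable, List


def hinge(gamma: float, pos: float, neg: float) -> float:
    return max(0.0, gamma - pos + neg)


def sampled_loss(s_O: Callable, s_R: Callable, x: str, memories: List[str],
                 o1: int, o2: int, r: str, vocab: List[str],
                 gamma: float, rng: random.Random) -> float:
    """One stochastic sample of the three-term margin ranking loss for k = 2."""
    f_bar = rng.choice([m for i, m in enumerate(memories) if i != o1])   # wrong first fact
    f_bar2 = rng.choice([m for i, m in enumerate(memories) if i != o2])  # wrong second fact
    r_bar = rng.choice([w for w in vocab if w != r])                     # wrong response
    m1, m2 = memories[o1], memories[o2]
    return (hinge(gamma, s_O([x], m1), s_O([x], f_bar))
            + hinge(gamma, s_O([x, m1], m2), s_O([x, m1], f_bar2))
            + hinge(gamma, s_R([x, m1, m2], r), s_R([x, m1, m2], r_bar)))


memories = ["joe left the milk", "joe travelled to the office", "fred went to the kitchen"]
flat = lambda query, y: 0.0  # an uninformative scorer: every candidate scores zero
loss = sampled_loss(flat, flat, "where is the milk", memories, 0, 1, "office",
                    ["office", "kitchen"], gamma=0.1, rng=random.Random(0))
```

With an uninformative scorer each term contributes exactly the margin, so the loss is three times γ; a scorer that separates the labeled choices from the sampled negatives by more than γ gives zero loss.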
In the case of employing an RNN for the R component of our MemNN (instead of using a single
word response as above) we replace the last term with the standard log likelihood used in a language
modeling task, where the RNN is fed the sequence [x, o1 , o2 , r]. At test time we output its prediction
r given [x, o1 , o2 ]. In contrast the absolute simplest model, that of using k = 1 and outputting the
located memory mo1 as response r, would only use the first term to train.
In the following subsections we consider some extensions of our basic model.


If input is at the word rather than sentence level, that is words arrive in a stream (as is often done, e.g.,
with RNNs) and not already segmented as statements and questions, we need to modify the approach
we have so far described. We hence add a “segmentation” function, to be learned, which takes as in-
put the last sequence of words that have so far not been segmented and looks for breakpoints. When
the segmenter fires (indicates the current sequence is a segment) we write that sequence to memory,
and can then proceed as before. The segmenter is modeled similarly to our other components, as an
embedding model of the form:

$\mathrm{seg}(c) = W_{\mathrm{seg}}^\top U_S \Phi_{\mathrm{seg}}(c)$    (9)
where Wseg is a vector (effectively the parameters of a linear classifier in embedding space), and c is
the sequence of input words represented as bag of words using a separate dictionary. If seg(c) > γ,
where γ is the margin, then this sequence is recognised as a segment. In this way, our MemNN has
a learning component in its write operation. We consider this segmenter a first proof of concept:
of course, one could design something much more sophisticated. Further details on the training
mechanism are given in Appendix B.
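
A toy version of the segmenter's write path is sketched below. The parameters are hand-set rather than learned (embedding dimension 1, with weight only on the "." token), and the names `phi_seg`, `seg` and `stream_to_memory` are hypothetical; the sketch only illustrates the thresholding of eq. (9).

```python
from typing import List

vocab = ["joe", "went", "to", "the", "kitchen", "."]
index = {w: i for i, w in enumerate(vocab)}


def phi_seg(c: List[str]) -> List[float]:
    """Bag-of-words features of the not-yet-segmented word sequence c."""
    v = [0.0] * len(vocab)
    for w in c:
        if w in index:
            v[index[w]] += 1.0
    return v


def seg(c: List[str], W_seg: List[float], U_S: List[List[float]]) -> float:
    """seg(c) = W_seg^T (U_S Phi_seg(c)): a linear classifier in embedding space."""
    feats = phi_seg(c)
    emb = [sum(u * f for u, f in zip(row, feats)) for row in U_S]
    return sum(w * e for w, e in zip(W_seg, emb))


def stream_to_memory(stream: List[str], W_seg, U_S, gamma: float) -> List[str]:
    memory: List[str] = []
    pending: List[str] = []
    for word in stream:
        pending.append(word)
        if seg(pending, W_seg, U_S) > gamma:  # the segmenter fires
            memory.append(" ".join(pending))  # write the segment to memory
            pending = []
    return memory


# Hand-set (untrained) parameters: only "." carries weight.
U_S = [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
W_seg = [1.0]
stream = "joe went to the kitchen . joe went to the kitchen .".split()
memory = stream_to_memory(stream, W_seg, U_S, gamma=0.1)
```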

[Footnote 5] However, note that methods like RNNs and LSTMs cannot easily use this information.



If the set of stored memories is very large it is prohibitively expensive to score all of them as in
equations (2) and (3). Instead we explore hashing tricks to speed up lookup: hash the input I(x) into
one or more buckets and then only score memories mi that are in the same buckets. We investigated
two ways of doing hashing: (i) via hashing words; and (ii) via clustering word embeddings. For (i)
we construct as many buckets as there are words in the dictionary, then for a given sentence we hash
it into all the buckets corresponding to its words. The problem with (i) is that a memory mi will
only be considered if it shares at least one word with the input I(x). Method (ii) tries to solve this
by clustering instead. After training the embedding matrix UO , we run K-means to cluster word
vectors (UO )i , thus giving K buckets. We then hash a given sentence into all the buckets that its
individual words fall into. As word vectors tend to be close to their synonyms, they cluster together
and we thus also will score those similar memories as well. Exact word matches between input and
memory will still be scored by definition. Choosing K controls the speed-accuracy trade-off.
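
Both hashing schemes can be sketched with one bucket index whose key function is either the word itself (method i) or the word's cluster (method ii). This is a hedged illustration: the function names are hypothetical, and the hand-made `cluster_of` mapping stands in for the K-means clustering of learned word vectors, which is not reproduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Set


def build_buckets(memories: List[str],
                  key: Callable[[str], Hashable] = lambda w: w) -> Dict[Hashable, Set[int]]:
    """Hash each memory into every bucket that one of its words maps to."""
    buckets: Dict[Hashable, Set[int]] = defaultdict(set)
    for i, m in enumerate(memories):
        for w in m.lower().split():
            buckets[key(w)].add(i)
    return buckets


def candidates(x: str, buckets: Dict[Hashable, Set[int]],
               key: Callable[[str], Hashable] = lambda w: w) -> Set[int]:
    """Indices of memories sharing at least one bucket with the input."""
    out: Set[int] = set()
    for w in x.lower().split():
        out |= buckets.get(key(w), set())
    return out


memories = ["milne authored winnie-the-pooh", "sheep be-afraid-of wolf", "wolf eat sheep"]

# (i) word hashing: one bucket per word
word_buckets = build_buckets(memories)
by_word = candidates("who wrote pooh", word_buckets)  # shares no word with any memory

# (ii) cluster hashing: a hand-made mapping standing in for K-means clusters
cluster_of = {"wrote": "c_write", "authored": "c_write",
              "pooh": "c_pooh", "winnie-the-pooh": "c_pooh"}
cluster_key = lambda w: cluster_of.get(w, w)
cluster_buckets = build_buckets(memories, key=cluster_key)
by_cluster = candidates("who wrote pooh", cluster_buckets, key=cluster_key)
```

Only the returned candidate indices would then be scored with sO, instead of all stored memories.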


We can extend our model to take into account when a memory slot was written to. This is not
important when answering questions about fixed facts (“What is the capital of France?”) but is
important when answering questions about a story, see e.g., Figure 1. One obvious way to implement
this is to add extra features to the representations Φx and Φy that encode the index j of a given
memory mj , assuming that j follows write time (i.e., no memory slot rewriting). However, that
requires dealing with absolute rather than relative time. We had more success empirically with the
following procedure: instead of scoring input, candidate pairs with s as above, learn a function on
triples sOt (x, y, y ′ ):
$s_{O_t}(x, y, y') = \Phi_x(x)^\top U_{O_t}^\top U_{O_t} \left( \Phi_y(y) - \Phi_y(y') + \Phi_t(x, y, y') \right)$.    (10)
Φt (x, y, y ′ ) uses three new features which take on the value 0 or 1: whether x is older than y, x is
older than y ′ , and y older than y ′ . (That is, we extended the dimensionality of all the Φ embeddings
by 3, and set these three dimensions to zero when not used.) Now, if sOt (x, y, y ′ ) > 0 the model
prefers y over y ′ , and if sOt (x, y, y ′ ) < 0 it prefers y ′ . The argmax of eq. (2) and (3) are replaced by
a loop over memories i = 1, . . . , N , keeping the winning memory (y or y ′ ) at each step, and always
comparing the current winner to the next memory mi . This procedure is equivalent to the argmax
before if the time features are removed. More details are given in Appendix C.
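
The winner-keeping loop that replaces the argmax can be sketched as follows. The scorer `toy_s_Ot` is a hypothetical stand-in for the learned function of eq. (10): it combines a word-overlap difference with a fixed recency bonus, and it treats the memory index as the write time. Keeping the current winner on an exact tie is also an assumption.

```python
from typing import Callable, List, Tuple


def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_s_Ot(x: str, y: Tuple[int, str], y2: Tuple[int, str]) -> float:
    """Hypothetical stand-in for s_Ot: relevance difference plus a recency term."""
    (iy, ty), (iy2, ty2) = y, y2
    relevance = overlap(x, ty) - overlap(x, ty2)
    recency = 0.5 if iy > iy2 else -0.5  # plays the role of the "y older than y'" feature
    return relevance + recency


def best_memory(x: str, memories: List[str], s_Ot: Callable = toy_s_Ot) -> int:
    winner = 0
    for i in range(1, len(memories)):
        # s_Ot > 0: keep the current winner y; s_Ot < 0: the challenger y' wins
        if s_Ot(x, (winner, memories[winner]), (i, memories[i])) < 0:
            winner = i
    return winner


story = ["joe went to the kitchen", "fred went to the office", "joe went to the bathroom"]
latest = best_memory("where is joe", story)
```

With the recency term in place, the loop prefers the most recent statement about Joe; dropping that term reduces the loop to an ordinary argmax over relevance.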


Even for humans who have read a lot of text, new words are continuously introduced. For example,
the first time the word “Boromir” appears in Lord of The Rings (Tolkien, 1954). How should a
machine learning model deal with this? Ideally it should work having seen only one example. A
possible way would be to use a language model: given the neighboring words, predict what the word
should be, and assume the new word is similar to that. Our proposed approach takes this idea, but
incorporates it into our networks sO and sR , rather than as a separate step.
Concretely, for each word we see, we store a bag of words it has co-occurred with, one bag for the
left context, and one for the right. Any unknown word can be represented with such features. Hence,
we increase our feature representation D from 3|W | to 5|W | to model these contexts (|W | features
for each bag). Our model learns to deal with new words during training using a kind of “dropout”
technique: d% of the time we pretend we have not seen a word before, and hence do not have an
n-dimensional embedding for that word, and represent it with the context instead.
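
A rough sketch of the context-bag idea and the dropout-style switch is given below. Several details are assumptions made for illustration: contexts are limited to the immediately adjacent word, features are stored as sparse dictionaries rather than blocks of a 5|W|-dimensional vector, `d` is a fraction rather than a percentage, and the names `observe` and `features` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Set, Tuple

left_ctx: Dict[str, Set[str]] = defaultdict(set)   # word -> words seen just to its left
right_ctx: Dict[str, Set[str]] = defaultdict(set)  # word -> words seen just to its right


def observe(sentence: str) -> None:
    """Record left and right neighbors for every word in the sentence."""
    toks = sentence.lower().split()
    for i, w in enumerate(toks):
        if i > 0:
            left_ctx[w].add(toks[i - 1])
        if i + 1 < len(toks):
            right_ctx[w].add(toks[i + 1])


def features(word: str, known: Set[str], d: float,
             rng: random.Random) -> Dict[Tuple[str, str], float]:
    """Sparse features for one word: its own identity, or its context bags."""
    if word in known and rng.random() >= d:  # usual case: use the word itself
        return {("word", word): 1.0}
    # unknown word, or a known word "dropped out" with probability d
    feats = {("left", w): 1.0 for w in left_ctx[word]}
    feats.update({("right", w): 1.0 for w in right_ctx[word]})
    return feats


observe("bilbo travelled to the cave")
rng = random.Random(0)
unseen = features("bilbo", known={"travelled", "to", "the", "cave"}, d=0.0, rng=rng)
dropped = features("travelled", known={"travelled"}, d=1.0, rng=rng)
```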


Embedding models cannot efficiently use exact word matches due to the low dimensionality n. One
solution is to score a pair x, y with
$\Phi_x(x)^\top U^\top U \Phi_y(y) + \lambda \Phi_x(x)^\top \Phi_y(y)$    (11)
instead. That is, add the “bag of words” matching score to the learned embedding score (with a
mixing parameter λ). Another, related way, that we propose is to stay in the n-dimensional embedding
space, but to extend the feature representation D with matching features, e.g., one per
word. A matching feature indicates if a word occurs in both x and y. That is, we score with
$\Phi_x(x)^\top U^\top U \Phi_y(y, x)$ where Φy is actually built conditionally on x: if some of the words in y
match the words in x we set those matching features to 1. Unseen words can be modeled similarly
by using matching features on their context words. This then gives a feature space of D = 8|W |.


Classical QA methods use a set of documents as a kind of memory, and information retrieval meth-
ods to find answers, see e.g., (Kolomiyets & Moens, 2011) and references therein. More recent
methods try instead to create a graph of facts – a knowledge base (KB) – as their memory, and map
questions to logical queries (Berant et al., 2013; 2014). Neural network and embedding approaches
have also been recently explored (Bordes et al., 2014a; Iyyer et al., 2014; Yih et al., 2014). Com-
pared to recent knowledge base approaches, memory networks differ in that they do not apply a
two-stage strategy: (i) apply information extraction principles first to build the KB; followed by (ii)
inference over the KB. Instead, extraction of useful information to answer a question is performed
on-the-fly over the memory which can be stored as raw text, as well as other choices such as embed-
ding vectors. This is potentially less brittle as the first stage of building the KB may have already
thrown away the relevant part of the original data.
Classical neural network memory models such as associative memory networks aim to provide
content-addressable memory, i.e., given a key vector to output a value vector, see e.g., Haykin (1994)
and references therein. Typically this type of memory is distributed across the whole network of
weights of the model rather than being compartmentalized into memory locations. Memory-based
learning such as nearest neighbor, on the other hand, does seek to store all (typically labeled) exam-
ples in compartments in memory, but only uses them for finding closest labels. Memory networks
combine compartmentalized memory with neural network modules that can learn how to (poten-
tially successively) read and write to that memory, e.g., to perform reasoning they can iteratively
read salient facts from the memory.
However, some notable models from the 1990s attempted to include memory read and write
operations. In particular, Das et al. (1992) designed differentiable push and pop actions in a model
called a neural network pushdown automaton. The work of Schmidhuber (1992) incorporated the
concept of two neural networks where one has very fast changing weights which can potentially be
used as memory. Schmidhuber (1993) proposed to allow a network to modify its own weights “self-
referentially” which can also be seen as a kind of memory addressing. Finally two other relevant
works are the DISCERN model of script processing and memory (Miikkulainen, 1990) and the
NARX recurrent networks for modeling long term dependencies (Lin et al., 1996).
Our work was submitted to arXiv just before the Neural Turing Machine work of Graves et al. (2014),
which is one of the most relevant related methods. Their method also proposes to perform (sequence)
prediction using a “large, addressable memory” which can be read and written to. In their experi-
ments, the memory size was limited to 128 locations, whereas we consider much larger storage (up
to 14M sentences). The experimental setups are notably quite different also: whereas we focus on
language and reasoning tasks, their paper focuses on problems of sorting, copying and recall. On the
one hand their problems require considerably more complex models than the memory network de-
scribed in Section 3. On the other hand, their problems have known algorithmic solutions, whereas
(non-toy) language problems do not.
There are other recent related works. RNNSearch (Bahdanau et al., 2014) is a method of machine
translation that uses a learned alignment mechanism over the input sentence representation while
predicting an output in order to overcome poor performance on long sentences. The work of Graves
(2013) performs handwriting recognition by dynamically determining “an alignment between the text
and the pen locations” so that “it learns to decide which character to write next”. One can view these
as particular variants of memory networks where in that case the memory only extends back a single
sentence or character sequence.


Table 1: Results on the large-scale QA task of (Fader et al., 2013).

Method                        F1
(Fader et al., 2013)          0.54
(Bordes et al., 2014b)        0.73
MemNN (embedding only)        0.72
MemNN (with BoW features)     0.82

Table 2: Memory hashing results on the large-scale QA task of (Fader et al., 2013).
Method                  Embedding F1   Embedding + BoW F1   Candidates (speedup)
MemNN (no hashing)      0.72           0.82                 14M (0x)
MemNN (word hash)       0.63           0.68                 13k (1000x)
MemNN (cluster hash)    0.71           0.80                 177k (80x)



We perform experiments on the QA dataset introduced in Fader et al. (2013). It consists of 14M
statements, stored as (subject, relation, object) triples, which are stored as memories in the MemNN
model. The triples are REVERB extractions mined from the ClueWeb09 corpus and cover di-
verse topics such as (milne, authored, winnie-the-pooh) and (sheep, be-afraid-of, wolf). Following
Fader et al. (2013) and Bordes et al. (2014b), training combines pseudo-labeled QA pairs made of a
question and an associated triple, and 35M pairs of paraphrased questions from WikiAnswers like
“Who wrote the Winnie the Pooh books?” and “Who is poohs creator?”.
We performed experiments in the framework of re-ranking the top returned candidate answers by
several systems measuring F1 score over the test set, following Bordes et al. (2014b). These answers
have been annotated as right or wrong by humans, whereas other answers are ignored at test time as
we do not know their label. We used a MemNN model of Section 3 with a k = 1 supporting memory,
which ends up being similar to the approach of Bordes et al. (2014b) [footnote 6]. We also tried adding the bag
of words features of Section 3.6. Time and unseen word modeling were not used. Results
are given in Table 1. The results show that MemNNs are a viable approach for large scale QA in
terms of performance. However, lookup is linear in the size of the memory, which with 14M facts is
slow. We therefore implemented the memory hashing techniques of Section 3.3 using both hashing
of words and clustered embeddings. For the latter we tried K = 1000 clusters. The results given in
Table 2 show that one can get significant speedups (∼80x) while maintaining similar performance
using the cluster-based hash. The string hash on the other hand loses performance (whilst being a
lot faster) because answers which share no words are now no longer matched.


Similar to the approach of Bordes et al. (2010) we also built a simple simulation of 4 characters, 3
objects and 5 rooms – with characters moving around, picking up and dropping objects. The actions
are transcribed into text using a simple automated grammar, and labeled questions are generated in
a similar way. This gives a QA task on simple “stories” such as in Figure 1. The overall difficulty of
the task is that multiple statements have to be used to do inference when asking where an object is,
e.g. to answer where is the milk in Figure 1 one has to understand the meaning of the actions “picked
up” and “left” and the influence of their relative order. We generated 7k statements and 3k questions
from the simulator for training [footnote 7], and an identical number for testing, and compare MemNNs to RNNs
and LSTMs (long short-term memory RNNs (Hochreiter & Schmidhuber, 1997)) on this task. To
test with sequences of words as input (Section 3.2) the statements are joined together again with a
simple grammar [footnote 8], to produce sentences that may contain multiple statements, see e.g., Figure 2.
[Footnote 6] We use a larger 128 dimension for embeddings, and no fine tuning, hence the MemNN result
differs slightly from that reported in Bordes et al. (2014b).
[Footnote 7] Learning curves with different numbers of training examples are given in Appendix D.

Table 3: Test accuracy on the simulation QA task.

                        Difficulty 1                                Difficulty 5
Method                  actor w/o before   actor    actor+object   actor    actor+object
RNN                     100%               60.9%    27.9%          23.8%    17.8%
LSTM                    100%               64.8%    49.1%          35.2%    29.0%
MemNN k = 1             97.8%              31.0%    24.0%          21.9%    18.5%
MemNN k = 1 (+time)     99.9%              60.2%    42.5%          60.8%    44.4%
MemNN k = 2 (+time)     100%               100%     100%           100%     99.9%

We control the complexity of the task by setting a limit on the number of time steps in the past the
entity we ask the question about was last mentioned. We try two experiments: using a limit of 1, and
of 5, i.e., if the limit is 5 then we pick a random sentence between 1-5 time steps in the past. If this
chosen sentence only mentions an actor, e.g., “Bill is in the kitchen” then we generate the question
“where is Bill?” or “where was Bill before the kitchen?”. If the sentence mentions an object, e.g.,
“Bill dropped the football” then we ask the question “where is the football?”. For the answers we
consider two options: (i) single word answers; and (ii) a simple grammar for generating true answers
in sentence form, e.g., “kitchen” for (i) and “He is in the kitchen I believe” (and other variants) for
(ii). More details on the dataset generation are given in Appendix A. Note that in the object case
the supporting statements necessary to deduce the answer may not lie in the last 5 sentences, e.g.,
in this example the answer depends on other sentences to find out where Bill actually was when he
dropped the football. In fact, in the dataset we generated necessary supporting statements can be
up to 65 sentences before (but are usually closer). For that reason, we also conducted two further
types of experiments: where we only ask questions about actors (easier) and about actors and objects
(harder). We also consider the actor-based questions without the “before” questions for the simplest
possible task (i.e. “where is Bill?” but not “where was Bill before the kitchen?” questions).

Figure 2: Sample test set predictions (in red) for the simulation in the setting of word-based input
and where answers are sentences and an LSTM is used as the R component of the MemNN.

Joe went to the garden then Fred picked up the milk; Joe moved to the bathroom and Fred dropped the
milk, and then Dan moved to the living room.
Where is Dan? A: living room I believe
Where is Joe? A: the bathroom
Fred moved to the bedroom and Joe went to the kitchen then Joe took the milk there and Dan journeyed
to the bedroom; Joe discarded the milk.
Where is the milk now ? A: the milk is in the kitchen
Where is Dan now? A: I think he is in the bedroom
Joe took the milk there, after that Mike travelled to the office, then Joe went to the living room, next Dan
went back to the kitchen and Joe travelled to the office.
Where is Joe now? A: I think Joe is in the office

For the baseline RNN and LSTM systems we perform language modeling with backpropagation
through time (Mikolov et al., 2010), but where we backprop only on answer words [footnote 9]. We optimized
the hyperparameters: size of the hidden layer, bptt steps, and learning rate for each dataset. For
MemNNs we fixed the embedding dimension to 100, learning rate to 0.01 and margin γ to 0.1, and
used 10 epochs of training in all experiments.

[Footnote 8] We also tried the same kind of experiments with sentence-level rather than word-sequence input, without
joining sentences, giving results with similar overall conclusions, see Appendix E.
[Footnote 9] We tried using standard language modeling on the questions as well, with slightly worse results.

Figure 3: An example story with questions correctly answered by a MemNN. The MemNN was
trained on the simulation described in Section 5.2 and had never seen many of these words before,
e.g., Bilbo, Frodo and Gollum.

Bilbo travelled to the cave. Gollum dropped the ring there. Bilbo took the ring.
Bilbo went back to the Shire. Bilbo left the ring there. Frodo got the ring.
Frodo journeyed to Mount-Doom. Frodo dropped the ring there. Sauron died.
Frodo went back to the Shire. Bilbo travelled to the Grey-havens. The End.
Where is the ring? A: Mount-Doom
Where is Bilbo now? A: Grey-havens
Where is Frodo now? A: Shire

Figure 4: An example dialogue with a MemNN system trained on both the simulation data and
the large-scale QA data. The system is able to (attempt to) answer questions about general world
knowledge and about specific story-based statements in the dialogue.

Fred went to the kitchen. Fred picked up the milk. Fred travelled to the office.
Where is the milk ? A: office
Where does milk come from ? A: milk come from cow
What is a cow a type of ? A: cow be female of cattle
Where are cattle found ? A: cattle farm become widespread in brazil
What does milk taste like ? A: milk taste like milk
What does milk go well with ? A: milk go with coffee
Where was Fred before the office ? A: kitchen

Results The results for the single word answer setting (i) are given in Table 3. For the actor-only
tasks, RNN and LSTMs solve the simpler difficulty level 1 task without before questions (“w/o
before”), but perform worse with before questions, and even worse on the difficulty 5 tasks. This
demonstrates that the poor performance of the RNN is due to its failure to encode long(er)-term
memory. This would likely deteriorate even further with higher difficulty levels (distances). LSTMs
are however better than RNNs, as expected, as they are designed with a more sophisticated memory
model, but still have trouble remembering sentences too far in the past. MemNNs do not have
this memory limitation and its mistakes are instead due to incorrect usage of its memory, when the
wrong statement is picked by sO . Time features are necessary for good performance on before
questions or difficulty > 1 (i.e., when the answer is not in the last statement), otherwise sO can pick
a statement about a person’s whereabouts but they have since moved. Finally, results on the harder
actor+object task indicate that MemNN also successfully perform 2-stage inference using k = 2,
whereas MemNNs without such inference (with k = 1) and RNNs and LSTMs fail.
We also tested MemNNs in the multi-word answer setting (ii) with similar results, whereby
MemNNs outperform RNNs and LSTMs, which are detailed in Appendix F. Example test prediction
output demonstrating the model in that setting is given in Figure 2.


We then tested the ability of MemNNs to deal with previously unseen words at test time using the
unseen word modeling approach of Sections 3.5 and 3.6. We trained the MemNN on the same simulated
dataset as before and tested on the story given in Figure 3. This story is generated using similar
structures as in the simulation data, except that the nouns are unknown to the system at training
time. Despite never seeing any of the Lord of the Rings specific words before (e.g., Bilbo, Frodo,
Sauron, Gollum, Shire and Mount-Doom), MemNNs are able to correctly answer the questions.
MemNNs can discover simple linguistic patterns based on verbal forms such as (X, dropped, Y), (X,
took, Y) or (X, journeyed to, Y) and can successfully generalize the meaning of their instantiations
using unknown words to perform 2-stage inference. Without the unseen word modeling described
in Section 3.5, they completely fail on this task.



Combining simulated world learning with real-world data might be one way to show the power and
generality of the models we design. We implemented a naive setup towards that goal: we took the
two models from Sections 5.1 and 5.2, trained on large-scale QA and simulated data respectively,
and built an ensemble of the two. We present the input to both systems and then, for each question,
simply output whichever of the two responses has the higher score. This allows us to perform
simple dialogues with our combined MemNN system. The system is then capable of answering both
general knowledge questions and questions about specific statements in the previous dialogue. An example
dialogue trace is given in Fig. 4. Some answers appear fine, whereas others are nonsensical. Future
work should combine these models more effectively, for example by multitasking the tasks directly
with a single model.


In this paper we introduced a powerful class of models, memory networks, and showed one instantiation
for QA. Future work should develop MemNNs for text further, evaluating them on harder QA
and open-domain machine comprehension tasks (Richardson et al., 2013). For example, large scale
QA tasks that require multi-hop inference such as WebQuestions should also be tried (Berant et al.,
2013). More complex simulation data could also be constructed in order to bridge that gap, e.g.,
requiring coreference, involving more verbs and nouns, sentences with more structure and requiring
more temporal and causal understanding. More sophisticated architectures should also be explored
in order to deal with these tasks, e.g., using more sophisticated memory management via G and
more sophisticated sentence representations. Weakly supervised settings are also very important,
and should be explored, as many datasets only have supervision in the form of question answer
pairs, and not also supporting facts as we used here. Finally, we believe this class of models is
much richer than the one specific variant we detail here, which is the only one we have explored
so far. Memory networks should be applied to other text tasks,
and other domains, such as vision, as well.

We thank Tomas Mikolov for useful discussions.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Berant, Jonathan, Chou, Andrew, Frostig, Roy, and Liang, Percy. Semantic parsing on freebase from
question-answer pairs. In EMNLP, pp. 1533–1544, 2013.
Berant, Jonathan, Srikumar, Vivek, Chen, Pei-Chun, Huang, Brad, Manning, Christopher D, Vander
Linden, Abby, Harding, Brittany, and Clark, Peter. Modeling biological processes for reading
comprehension. In Proc. EMNLP, 2014.
Bordes, Antoine, Usunier, Nicolas, Collobert, Ronan, and Weston, Jason. Towards understanding
situated natural language. In AISTATS, 2010.
Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Question answering with subgraph embed-
dings. In Proc. EMNLP, 2014a.
Bordes, Antoine, Weston, Jason, and Usunier, Nicolas. Open question answering with weakly su-
pervised embedding models. ECML-PKDD, 2014b.
Das, Sreerupa, Giles, C Lee, and Sun, Guo-Zheng. Learning context-free grammars: Capabilities
and limitations of a recurrent neural network with an external stack memory. In Proceedings of
The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.
Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Paraphrase-driven learning for open question
answering. In ACL, pp. 1608–1618, 2013.


Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850, 2013.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint
arXiv:1410.5401, 2014.
Haykin, Simon. Neural networks: A comprehensive foundation. 1994.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Iyyer, Mohit, Boyd-Graber, Jordan, Claudino, Leonardo, Socher, Richard, and III, Hal Daumé. A
neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Language Processing (EMNLP), pp. 633–644, 2014.
Kolomiyets, Oleksandr and Moens, Marie-Francine. A survey on question answering technology
from an information retrieval perspective. Information Sciences, 181(24):5412–5434, 2011.
Lin, Tsungnam, Horne, Bil G, Tiňo, Peter, and Giles, C Lee. Learning long-term dependencies in
narx recurrent neural networks. Neural Networks, IEEE Transactions on, 7(6):1329–1338, 1996.
Miikkulainen, Risto. DISCERN: A distributed artificial neural network model of script processing
and memory. 1990.
Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Černocký, Jan, and Khudanpur, Sanjeev. Recurrent
neural network based language model. In Interspeech, pp. 1045–1048, 2010.
Richardson, Matthew, Burges, Christopher JC, and Renshaw, Erin. Mctest: A challenge dataset for
the open-domain machine comprehension of text. In EMNLP, pp. 193–203, 2013.
Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recur-
rent networks. Neural Computation, 4(1):131–139, 1992.
Schmidhuber, Jürgen. A self-referential weight matrix. In ICANN93, pp. 446–450. Springer, 1993.
Tolkien, John Ronald Reuel. The Fellowship of the Ring. George Allen & Unwin, 1954.
Weston, Jason, Bengio, Samy, and Usunier, Nicolas. Wsabie: Scaling up to large vocabulary im-
age annotation. In Proceedings of the Twenty-Second international joint conference on Artificial
Intelligence-Volume Volume Three, pp. 2764–2770. AAAI Press, 2011.
Yih, Wen-Tau, He, Xiaodong, and Meek, Christopher. Semantic parsing for single-relation question
answering. In Proceedings of ACL. Association for Computational Linguistics, June 2014. URL
Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.



Aim We have built a simple simulation which behaves much like a classic text adventure game.
The idea is that generating text within this simulation allows us to ground the language used.
Some comments about our intent:

• Firstly, while this currently only encompasses a very small part of the kind of language and
understanding we want a model to learn to move towards full language understanding, we
believe it is a prerequisite that models should perform well on this kind of task for them to
work on real-world environments.
• Secondly, our aim is to make this simulation more complex and to release improved ver-
sions over time. Hopefully it can then scale up to evaluate more and more useful properties.

Currently, tasks within the simulation are restricted to question answering tasks about the location
of people and objects. However, we envisage other tasks should be possible, including asking the
learner to perform actions within the simulation (“Please pick up the milk”, “Please find John and
give him the milk”) and asking the learner to describe actions (“What did John just do?”).

Actions The underlying actions in the simulation consist of the following:

go <location>, get <object>, get <object1> from <object2>,
put <object1> in/on <object2>, give <object> to <actor>,
drop <object>, look, inventory, examine <object>.
There is a set of constraints on those actions. For example, an actor cannot get something that they
or someone else already has, cannot go to a place they are already at, cannot drop something
they do not already have, and so on.
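
As a rough illustration of how such action constraints might be checked, the following is a minimal sketch covering only go, get and drop. The class name `World`, its fields, and the rule that an object must be in the actor's current location to be picked up are assumptions made for this example; they are not taken from the simulation itself.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Hypothetical world state for a text-adventure style simulation."""
    actor_loc: dict = field(default_factory=dict)   # actor -> location
    object_loc: dict = field(default_factory=dict)  # object -> location, None while carried
    holder: dict = field(default_factory=dict)      # object -> actor carrying it, or None

    def is_valid(self, actor, verb, arg):
        """Return True if the action respects the constraints."""
        if verb == "go":
            # An actor cannot go to a place they are already at.
            return self.actor_loc[actor] != arg
        if verb == "get":
            # Nobody may already hold the object. Requiring co-location
            # is an assumption of this sketch.
            return (self.holder[arg] is None
                    and self.object_loc[arg] == self.actor_loc[actor])
        if verb == "drop":
            # An actor cannot drop something they do not have.
            return self.holder[arg] == actor
        return False

    def apply(self, actor, verb, arg):
        """Execute a valid action, or raise ValueError for an invalid one."""
        if not self.is_valid(actor, verb, arg):
            raise ValueError(f"invalid action: {actor} {verb} {arg}")
        if verb == "go":
            self.actor_loc[actor] = arg
        elif verb == "get":
            self.holder[arg] = actor
            self.object_loc[arg] = None
        elif verb == "drop":
            self.holder[arg] = None
            self.object_loc[arg] = self.actor_loc[actor]
```

A random valid action could then be drawn by enumerating candidate (actor, verb, argument) triples and keeping only those for which `is_valid` returns True.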

Executing Actions and Asking Questions Using the underlying actions and their constraints,
there is then a (hand-built) model that defines how actors act. Currently this is very simple: they try
to make a random valid action, at the moment restricted to either go alone, or go, get and drop, depending on
which of two types of experiments we are running: (i) actor; or (ii) actor + object.
If we write these actions down in text form this gives us a very simple “story” which is executable
by the simulation, e.g., joe go kitchen; fred go kitchen; joe get milk; joe go office; joe drop milk;
joe go bathroom. This example corresponds to the story given in Figure 1. The system can then ask
questions about the state of the simulation e.g., where milk?, where joe?, where joe before office? It
is easy to calculate the true answers for these questions as we have access to the underlying world.
What remains is to convert both the statements and the questions to look more like natural language.
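
The example story and its three questions can be replayed with a few lines of code. This is only a sketch under simplifying assumptions: the function names `run_story` and `where` are hypothetical, commands are assumed to have the form "actor verb argument", and no constraint checking is done.

```python
def run_story(story):
    """Replay ';'-separated 'actor verb arg' commands and track the world state."""
    actor_loc, holder, object_loc, history = {}, {}, {}, {}
    for command in story.split(";"):
        actor, verb, arg = command.split()
        if verb == "go":
            actor_loc[actor] = arg
            history.setdefault(actor, []).append(arg)  # ordered list of visited places
        elif verb == "get":
            holder[arg] = actor
        elif verb == "drop":
            holder.pop(arg, None)
            object_loc[arg] = actor_loc[actor]
    return actor_loc, holder, object_loc, history


def where(entity, state, before=None):
    """Answer 'where <entity>?' or 'where <actor> before <location>?'."""
    actor_loc, holder, object_loc, history = state
    if before is not None:
        visits = history[entity]
        # Index of the most recent visit to `before`; the answer is the place visited just prior.
        idx = len(visits) - 1 - visits[::-1].index(before)
        return visits[idx - 1] if idx > 0 else None
    if entity in actor_loc:
        return actor_loc[entity]
    if entity in holder:
        return actor_loc[holder[entity]]  # a carried object is wherever its holder is
    return object_loc.get(entity)


story = ("joe go kitchen; fred go kitchen; joe get milk; "
         "joe go office; joe drop milk; joe go bathroom")
state = run_story(story)
print(where("milk", state))                 # office
print(where("joe", state))                  # bathroom
print(where("joe", state, before="office")) # kitchen
```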

Simple Grammar For Generating Language In order to produce more natural looking text with
lexical variety we built a simple automated grammar. Each verb is assigned a set of synonyms,
e.g., the simulation command get is replaced with either picked up, got, grabbed or took, and drop
is replaced with either dropped, left, discarded or put down. Similarly, each object and actor can
have a set of replacement synonyms as well, although currently there is no ambiguity there in our
experiments: we simply add articles or not. We do add lexical variation to questions, e.g., “Where is
John ?” or “Where is John now ?”.
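
A grammar of this kind might be sketched as below. The synonym lists for get and drop are the ones given above; the list for go is an assumption based on the phrasings that appear in the example stories, and the function names and the choice to always insert "the" are likewise illustrative only.

```python
import random

VERB_SYNONYMS = {
    "get": ["picked up", "got", "grabbed", "took"],
    "drop": ["dropped", "left", "discarded", "put down"],
    # Assumed list for "go", based on phrasings seen in the example stories.
    "go": ["went to", "travelled to", "journeyed to"],
}


def render(command, rng):
    """Turn an 'actor verb arg' command into a more natural looking statement."""
    actor, verb, arg = command.split()
    phrase = rng.choice(VERB_SYNONYMS[verb])
    return f"{actor.capitalize()} {phrase} the {arg}."


def render_question(actor, rng):
    """Render a location question with simple lexical variation."""
    name = actor.capitalize()
    return rng.choice([f"Where is {name} ?", f"Where is {name} now ?"])


rng = random.Random(0)  # seeded only to make runs repeatable
print(render("joe get milk", rng))
print(render_question("joe", rng))
```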

Joining Statements Finally, for the word sequence training setting, we join the statements above
into compound sentences. To do this we simply take the set of statements and then join them
randomly with one of the following: “.”, “and”, “then”, “, then”, “;”, “, later”, “, after that”, “, and
then”, or “, next”. Example output can be seen in Figure 2.
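
The joining step can be sketched as follows. The connectives are those listed above; the exact spacing around them and the function name `join_statements` are assumptions of this example.

```python
import random

# Connectives from the list above; leading spaces are an assumption about layout.
JOINERS = [".", " and", " then", ", then", ";", ", later",
           ", after that", ", and then", ", next"]


def join_statements(statements, rng):
    """Join a non-empty list of statements with randomly chosen connectives."""
    out = statements[0]
    for statement in statements[1:]:
        out += rng.choice(JOINERS) + " " + statement
    return out


rng = random.Random(1)  # seeded only to make runs repeatable
print(join_statements(
    ["Joe went to the kitchen", "Fred went to the kitchen", "Joe took the milk"], rng))
```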

Issues There are a great many aspects of language not yet modeled. For example, currently coref-
erence is not modeled (e.g., “He picked up the milk”) and similarly there are no compound noun
phrases (“John and Fred went to the kitchen”). Some of these seem easy to add to the simulation.
The hope is that adding these complexities will help evaluate models in a controlled way, within the
simulated environment, which is hard to do with real data. Of course, this is not a substitute for real
data which our models should be applied to as well, but does serve as a useful testbed.



For segmenting an input word stream as generated in Appendix A we use a segmenter of the form:

seg(c) = W_{seg}^{\top} U_S \Phi_{seg}(c)

where Wseg is a vector (effectively the parameters of a linear classifier in embedding space). As we
are already in the fully supervised setting, where for each question in the training set we are given
the answer and the supporting facts from the input stream, we can also use that supervision for the
segmenter as well. That is, for any known supporting fact, such as “Bill is in the Kitchen” for the
question “Where is Bill?” we wish the segmenter to fire for such a statement, but not for unfinished
statements such as “Bill is in the”. We can thus write our training criterion for segmentation as the
minimization of:
\sum_{f \in F} \max(0, \gamma - seg(f)) + \sum_{\bar{f} \in \bar{F}} \max(0, \gamma + seg(\bar{f}))    (12)

where F are all known supporting segments in the labeled training set, and F̄ are all other segments
in the training set.
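
To make the segmentation criterion concrete, here is a minimal pure-Python sketch of the segmenter score and its margin loss. It assumes plain lists for the vector and matrix, feature vectors that are already computed, and a default margin of 0.1; the function names are hypothetical.

```python
def seg_score(w_seg, U_S, phi):
    """seg(c) = w_seg^T U_S phi(c); U_S is an n x D matrix given as a list of rows."""
    embedded = [sum(u * p for u, p in zip(row, phi)) for row in U_S]
    return sum(w * e for w, e in zip(w_seg, embedded))


def segmentation_loss(w_seg, U_S, positives, negatives, gamma=0.1):
    """Margin loss: supporting segments should score above gamma, all others below -gamma."""
    loss = sum(max(0.0, gamma - seg_score(w_seg, U_S, f)) for f in positives)
    loss += sum(max(0.0, gamma + seg_score(w_seg, U_S, f)) for f in negatives)
    return loss
```

With this loss, a supporting segment contributes nothing once its score exceeds the margin, and an unfinished segment contributes nothing once its score falls below the negative margin.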


The training procedure to take into account modeling write time is slightly different to that described
in Section 3.1. Write time features are important so that the MemNN knows when each memory
was written, and hence knows the ordering of statements that comprise a story or dialogue. Note
that this is different to time information described in the text of a statement, such as the tense of a
statement, or statements containing time expressions, e.g., “He went to the office yesterday”. For
such cases, write time features are not directly necessary, and they could (potentially) be modeled
directly from the text.
As was described in Section 3.4 we add three write time features to the model and score triples:

s_{O_t}(x, y, y') = \Phi_x(x)^{\top} U_{O_t}^{\top} U_{O_t} \left( \Phi_y(y) - \Phi_y(y') + \Phi_t(x, y, y') \right)    (13)

If s_{O_t}(x, y, y') > 0 the model prefers y over y', and if s_{O_t}(x, y, y') < 0 it prefers y'. The arg max of
eq. (2) and (3) is replaced by a loop over memories i = 1, . . . , N, keeping the winning memory
(y or y') at each step, and always comparing the current winner to the next memory m_i. That is,
at inference time, for a k = 2 model the arg max functions of eq. (2) and (3) are replaced with
o_1 = O_t(x, m) and o_2 = O_t([x, m_{o_1}], m) where O_t is defined in Algorithm 1 below.

Algorithm 1 O_t replacement to arg max when using write time features

function O_t(q, m)
    t ← 1
    for i = 2, . . . , N do
        if s_{O_t}(q, m_i, m_t) > 0 then
            t ← i
        end if
    end for
    return t
end function
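
The winner-keeping loop described above, which compares the current winner with each next memory in turn, can be written in a few lines. This is a sketch rather than the authors' implementation: `score` stands in for the pairwise scoring function (a positive value means its second argument is preferred over its third), and indices are 0-based here.

```python
def o_t(q, memories, score):
    """Return the index of the winning memory under pairwise comparisons.

    score(q, y, y_prime) > 0 means y is preferred over y_prime.
    """
    t = 0  # current winner; the first memory wins by default
    for i in range(1, len(memories)):
        if score(q, memories[i], memories[t]) > 0:
            t = i  # memory i beats the current winner
    return t
```

Each memory is compared once against the running winner, so selection needs N - 1 pairwise scores instead of a single arg max over independent scores.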

\Phi_t(x, y, y') uses three new features which take on the value 0 or 1: whether x is older than y,
x is older than y', and y older than y'. When finding the second supporting memory (computing
O_t([x, m_{o_1}], m)) we encode whether m_{o_1} is older than y, m_{o_1} is older than y', and y older than y' to
capture the relative age of the first supporting memory w.r.t. the second one in the first two features.
Note that when finding the first supporting memory (i.e., for O_t(x, m)) the first two features are
useless as x is the last thing in the memory and hence y and y' are always older.
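
The three binary features can be illustrated as follows, assuming each item carries an integer write time and that a smaller write time means the item is older. The function name and the timestamp representation are assumptions of this sketch.

```python
def time_features(t_x, t_y, t_y_prime):
    """Three 0/1 write time features; a smaller timestamp means written earlier (older)."""
    return [
        int(t_x < t_y),        # x is older than y
        int(t_x < t_y_prime),  # x is older than y'
        int(t_y < t_y_prime),  # y is older than y'
    ]


# When x is the most recent item, the first two features are always 0.
print(time_features(10, 3, 5))  # [0, 0, 1]
```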


To train our model with write time features we need to replace the hinge loss in eqs. (6)-(7) with a
loss that matches Algorithm 1. To do this, we instead minimize:
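\sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma - s_{O_t}(x, m_{o_1}, \bar{f})) + \sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma + s_{O_t}(x, \bar{f}, m_{o_1})) +
\sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma - s_{O_t}([x, m_{o_1}], m_{o_2}, \bar{f}')) + \sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma + s_{O_t}([x, m_{o_1}], \bar{f}', m_{o_2})) +
\sum_{\bar{r} \neq r} \max(0, \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r}))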

The last term is the same as in eq. (8) and is for the final ranking of words to return a response,
which remains unchanged (as usual, this can also be replaced by an RNN for a more sophisticated
model). Terms 1-4 replace eqs. (6)-(7) by considering triples directly. For both m_{o_1} and m_{o_2} we
need to have two terms considering them as the second or third argument to s_{O_t} as they may appear
on either side during inference (via Algorithm 1). As before, at every step of SGD we sample \bar{f}, \bar{f}', \bar{r}
rather than compute the whole sum for each training example.
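
As an illustration of the sampling step, the sketch below computes one sampled pair of terms for a single supporting memory: the true memory appears once as the second argument and once as the third argument of the scoring function. The names `sampled_triple_loss` and `score`, and the default margin, are assumptions; gradients and parameter updates are left out.

```python
import random


def sampled_triple_loss(score, x, m_true, candidates, gamma=0.1, rng=random):
    """One sampled pair of hinge terms for a supporting memory m_true.

    score(x, y, y_prime) > 0 means y is preferred over y_prime.
    """
    # Sample one negative memory instead of summing over all of them.
    f_bar = rng.choice([c for c in candidates if c != m_true])
    # m_true as second argument: its score against f_bar should exceed +gamma.
    first = max(0.0, gamma - score(x, m_true, f_bar))
    # m_true as third argument: the score of f_bar against it should fall below -gamma.
    second = max(0.0, gamma + score(x, f_bar, m_true))
    return first + second
```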


We computed the test accuracy of MemNNs k = 2 (+ time) for varying amounts of training data:
100, 500, 1000 and 3000 training questions. The results are given in Table 4. These results can be
compared with RNNs and LSTMs on the full data (3000 examples) by comparing with Table 3.
For example, on the difficulty 5 actor and actor + object tasks MemNNs outperform LSTMs even
using 30 times fewer training examples.

Table 4: Test accuracy of MemNNs k = 2 (+time) on the word-sequence simulation QA task for
differing numbers of training examples (number of questions).

                          Difficulty 1               Difficulty 5
Num. training questions   actor    actor + object    actor    actor + object
100                       73.8%    64.9%             74.4%    49.8%
500                       99.9%    99.2%             99.8%    95.1%
1000                      99.9%    100%              100%     98.4%
3000                      100%     100%              100%     99.9%


We conducted experiments where input was at the sentence level, that is, the data was already pre-segmented
into statements and questions as input to the MemNN (as opposed to being input as a
stream of words). Results comparing RNNs with MemNNs are given in Table 5. The conclusions
are similar to those at the word level from Section 5.2. That is, MemNNs outperform RNNs, and
both inference that finds k = 2 supporting statements and time features are necessary for the actor
w/o before + object task.

Table 5: Test accuracy on the sentence-level simulation QA task.

                        Difficulty 1                           Difficulty 5
                        actor        actor w/o before          actor        actor w/o before
Method                  w/o before   + object                  w/o before   + object
RNN                     100%         58%                       29%          17%
MemNN k = 1             90%          9%                        46%          21%
MemNN k = 1 (+time)     100%         73%                       100%         73%
MemNN k = 2 (+time)     100%         99.95%                    100%         99.4%



We conducted experiments for the simulation data in the case where the answers are sentences (see
Appendix A and Figure 2). As the single word answer model can no longer be used, we simply
compare MemNNs using either RNNs or LSTMs for the response module R. As baselines we can
still use RNNs and LSTMs in the standard setting of being fed words only including the statements
and the question as a word stream. In contrast, the MemNN RNN and LSTMs are effectively fed
the output of the O module (see Section 3.1). In these experiments we only consider the difficulty
5 actor+object setting in the case of MemNNs with k = 2 iterations (eq. (3)), which means the
module R is fed the features [x, mo1 , mo2 ] after the modules I, G and O have run.
The sentence generation is performed on the test data, and the evaluation we chose is as follows. A
correct generation has to contain the correct location answer, and can optionally contain the subject
or a correct pronoun referring to it. For example the question “Where is Bill?” allows the correct
answers “Kitchen”, “In the kitchen”, “Bill is in the kitchen”, “He is in the kitchen” and “I think Bill
is in the kitchen”. Incorrect answers, however, contain an incorrect location or subject reference, for
example “Joe is in the kitchen”, “It is in the kitchen” or “Bill is in the bathroom I believe”. We can
then measure the percentage of test examples that are correct using this metric.
The numerical results are given in Table 6, and example output is given in Figure 2. The results
indicate that MemNNs with LSTMs perform quite strongly, outperforming MemNNs using RNNs.
However, both MemNN variants outperform both RNNs and LSTMs by some distance.

Table 6: Test accuracy on the multi-word answer simulation QA task. We compare conventional
RNN and LSTMs with MemNNs using an RNN or LSTM module R (i.e., where R is fed features
[x, m_{o_1}, m_{o_2}] after the modules I, G and O have run).

Model    MemNN: IGO features [x, m_{o_1}, m_{o_2}]    Word features
RNN      68.83%                                       13.97%
LSTM     90.98%                                       14.01%
Neural Turing Machines
arXiv:1410.5401v2 [cs.NE] 10 Dec 2014

Alex Graves

Greg Wayne
Ivo Danihelka

Google DeepMind, London, UK


We extend the capabilities of neural networks by coupling them to external memory re-
sources, which they can interact with by attentional processes. The combined system is
analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-
end, allowing it to be efficiently trained with gradient descent. Preliminary results demon-
strate that Neural Turing Machines can infer simple algorithms such as copying, sorting,
and associative recall from input and output examples.

1 Introduction
Computer programs make use of three fundamental mechanisms: elementary operations
(e.g., arithmetic operations), logical flow control (branching), and external memory, which
can be written to and read from in the course of computation (Von Neumann, 1945). De-
spite its wide-ranging success in modelling complicated data, modern machine learning
has largely neglected the use of logical flow control and external memory.
Recurrent neural networks (RNNs) stand out from other machine learning methods
for their ability to learn and carry out complicated transformations of data over extended
periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and
Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly
wired. Yet what is possible in principle is not always what is simple in practice. We
therefore enrich the capabilities of standard recurrent networks to simplify the solution of
algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by
analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we
dub our device a “Neural Turing Machine” (NTM). Unlike a Turing machine, an NTM
is a differentiable computer that can be trained by gradient descent, yielding a practical
mechanism for learning programs.
In human cognition, the process that shares the most similarity to algorithmic operation
is known as “working memory.” While the mechanisms of working memory remain some-
what obscure at the level of neurophysiology, the verbal definition is understood to mean
a capacity for short-term storage of information and its rule-based manipulation (Badde-
ley et al., 2009). In computational terms, these rules are simple programs, and the stored
information constitutes the arguments of these programs. Therefore, an NTM resembles
a working memory system, as it is designed to solve tasks that require the application of
approximate rules to “rapidly-created variables.” Rapidly-created variables (Hadley, 2009)
are data that are quickly bound to memory slots, in the same way that the number 3 and the
number 4 are put inside registers in a conventional computer and added to make 7 (Minsky,
1967). An NTM bears another close resemblance to models of working memory since the
NTM architecture uses an attentional process to read from and write to memory selectively.
In contrast to most models of working memory, our architecture can learn to use its working
memory instead of deploying a fixed set of procedures over symbolic data.
The organisation of this report begins with a brief review of germane research on work-
ing memory in psychology, linguistics, and neuroscience, along with related research in
artificial intelligence and neural networks. We then describe our basic contribution, a mem-
ory architecture and attentional controller that we believe is well-suited to the performance
of tasks that require the induction and execution of simple programs. To test this architec-
ture, we have constructed a battery of problems, and we present their precise descriptions
along with our results. We conclude by summarising the strengths of the architecture.

2 Foundational Research
2.1 Psychology and Neuroscience
The concept of working memory has been most heavily developed in psychology to explain
the performance of tasks involving the short-term manipulation of information. The broad
picture is that a “central executive” focuses attention and performs operations on data in a
memory buffer (Baddeley et al., 2009). Psychologists have extensively studied the capacity
limitations of working memory, which is often quantified by the number of “chunks” of
information that can be readily recalled (Miller, 1956).1 These capacity limitations lead
toward an understanding of structural constraints in the human working memory system,
but in our own work we are happy to exceed them.
1 There remains vigorous debate about how best to characterise capacity limitations (Barrouillet et al.,

In neuroscience, the working memory process has been ascribed to the functioning of a
system composed of the prefrontal cortex and basal ganglia (Goldman-Rakic, 1995). Typical
experiments involve recording from a single neuron or group of neurons in prefrontal
cortex while a monkey is performing a task that involves observing a transient cue, waiting
through a “delay period,” then responding in a manner dependent on the cue. Certain tasks
elicit persistent firing from individual neurons during the delay period or more complicated
neural dynamics. A recent study quantified delay period activity in prefrontal cortex for a
complex, context-dependent task based on measures of “dimensionality” of the population
code and showed that it predicted memory performance (Rigotti et al., 2013).
Modeling studies of working memory range from those that consider how biophysical
circuits could implement persistent neuronal firing (Wang, 1999) to those that try to solve
explicit tasks (Hazy et al., 2006) (Dayan, 2008) (Eliasmith, 2013). Of these, Hazy et al.’s
model is the most relevant to our work, as it is itself analogous to the Long Short-Term
Memory architecture, which we have modified ourselves. As in our architecture, Hazy
et al.’s has mechanisms to gate information into memory slots, which they use to solve a
memory task constructed of nested rules. In contrast to our work, the authors include no
sophisticated notion of memory addressing, which limits the system to storage and recall
of relatively simple, atomic data. Addressing, fundamental to our work, is usually left
out from computational models in neuroscience, though it deserves to be mentioned that
Gallistel and King (Gallistel and King, 2009) and Marcus (Marcus, 2003) have argued that
addressing must be implicated in the operation of the brain.

2.2 Cognitive Science and Linguistics

Historically, cognitive science and linguistics emerged as fields at roughly the same time
as artificial intelligence, all deeply influenced by the advent of the computer (Chomsky,
1956) (Miller, 2003). Their intentions were to explain human mental behaviour based on
information or symbol-processing metaphors. In the early 1980s, both fields considered
recursive or procedural (rule-based) symbol-processing to be the highest mark of cogni-
tion. The Parallel Distributed Processing (PDP) or connectionist revolution cast aside the
symbol-processing metaphor in favour of a so-called “sub-symbolic” description of thought
processes (Rumelhart et al., 1986).
Fodor and Pylyshyn (Fodor and Pylyshyn, 1988) famously made two barbed claims
about the limitations of neural networks for cognitive modeling. They first objected that
connectionist theories were incapable of variable-binding, or the assignment of a particular
datum to a particular slot in a data structure. In language, variable-binding is ubiquitous;
for example, when one produces or interprets a sentence of the form, “Mary spoke to John,”
one has assigned “Mary” the role of subject, “John” the role of object, and “spoke to” the
role of the transitive verb. Fodor and Pylyshyn also argued that neural networks with fixed-
length input domains could not reproduce human capabilities in tasks that involve process-
ing variable-length structures. In response to this criticism, neural network researchers
including Hinton (Hinton, 1986), Smolensky (Smolensky, 1990), Touretzky (Touretzky,
1990), Pollack (Pollack, 1990), Plate (Plate, 2003), and Kanerva (Kanerva, 2009) investigated
specific mechanisms that could support both variable-binding and variable-length
structure within a connectionist framework. Our architecture draws on and potentiates this work.
Recursive processing of variable-length structures continues to be regarded as a hall-
mark of human cognition. In the last decade, a firefight in the linguistics community staked
several leaders of the field against one another. At issue was whether recursive processing
is the “uniquely human” evolutionary innovation that enables language and is specialized to
language, a view supported by Fitch, Hauser, and Chomsky (Fitch et al., 2005), or whether
multiple new adaptations are responsible for human language evolution and recursive pro-
cessing predates language (Jackendoff and Pinker, 2005). Regardless of recursive process-
ing’s evolutionary origins, all agreed that it is essential to human cognitive flexibility.

2.3 Recurrent Neural Networks

Recurrent neural networks constitute a broad class of machines with dynamic state; that
is, they have state whose evolution depends both on the input to the system and on the
current state. In comparison to hidden Markov models, which also contain dynamic state,
RNNs have a distributed state and therefore have significantly larger and richer memory
and computational capacity. Dynamic state is crucial because it affords the possibility of
context-dependent computation; a signal entering at a given moment can alter the behaviour
of the network at a much later moment.
A crucial innovation to recurrent networks was the Long Short-Term Memory (LSTM)
(Hochreiter and Schmidhuber, 1997). This very general architecture was developed for a
specific purpose, to address the “vanishing and exploding gradient” problem (Hochreiter
et al., 2001a), which we might relabel the problem of “vanishing and exploding sensitivity.”
LSTM ameliorates the problem by embedding perfect integrators (Seung, 1998) for mem-
ory storage in the network. The simplest example of a perfect integrator is the equation
x(t + 1) = x(t) + i(t), where i(t) is an input to the system. The implicit identity matrix
Ix(t) means that signals do not dynamically vanish or explode. If we attach a mechanism
to this integrator that allows an enclosing network to choose when the integrator listens to
inputs, namely, a programmable gate depending on context, we have an equation of the
form x(t + 1) = x(t) + g(context)i(t). We can now selectively store information for an
indefinite length of time.
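
The gated integrator x(t + 1) = x(t) + g(context) i(t) can be simulated directly. In this minimal sketch the gate values are supplied as a list rather than computed from context, which is a simplification made for illustration, and the function name is hypothetical.

```python
def gated_integrator(inputs, gates, x0=0.0):
    """Iterate x(t + 1) = x(t) + g(t) * i(t) and return the state after each step."""
    x = x0
    trace = []
    for i_t, g_t in zip(inputs, gates):
        x = x + g_t * i_t  # the gate decides whether the integrator listens to the input
        trace.append(x)
    return trace


# The gate is open only at the first step, so the value 5 is stored
# and later inputs are ignored.
print(gated_integrator([5, 3, -2, 7], [1, 0, 0, 0]))  # [5.0, 5.0, 5.0, 5.0]
```

Because the state is carried forward with an identity weight, the stored value neither decays nor grows while the gate stays closed.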
Recurrent networks readily process variable-length structures without modification. In
sequential problems, inputs to the network arrive at different times, allowing variable-
length or composite structures to be processed over multiple steps. Because they natively
handle variable-length structures, they have recently been used in a variety of cognitive
problems, including speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), text
generation (Sutskever et al., 2011), handwriting generation (Graves, 2013) and machine
translation (Sutskever et al., 2014). Considering this property, we do not feel that it is ur-
gent or even necessarily valuable to build explicit parse trees to merge composite structures
greedily (Pollack, 1990) (Socher et al., 2012) (Frasconi et al., 1998).
Other important precursors to our work include differentiable models of attention (Graves,
2013) (Bahdanau et al., 2014) and program search (Hochreiter et al., 2001b) (Das et al.,
1992), constructed with recurrent neural networks.

Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller
network receives inputs from an external environment and emits outputs in response. It also
reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed
line indicates the division between the NTM circuit and the outside world.

3 Neural Turing Machines

A Neural Turing Machine (NTM) architecture contains two basic components: a neural
network controller and a memory bank. Figure 1 presents a high-level diagram of the NTM
architecture. Like most neural networks, the controller interacts with the external world via
input and output vectors. Unlike a standard network, it also interacts with a memory matrix
using selective read and write operations. By analogy to the Turing machine we refer to the
network outputs that parametrise these operations as “heads.”
Crucially, every component of the architecture is differentiable, making it straightfor-
ward to train with gradient descent. We achieved this by defining ‘blurry’ read and write
operations that interact to a greater or lesser degree with all the elements in memory (rather
than addressing a single element, as in a normal Turing machine or digital computer). The
degree of blurriness is determined by an attentional “focus” mechanism that constrains each
read and write operation to interact with a small portion of the memory, while ignoring the
rest. Because interaction with the memory is highly sparse, the NTM is biased towards
storing data without interference. The memory location brought into attentional focus is
determined by specialised outputs emitted by the heads. These outputs define a normalised
weighting over the rows in the memory matrix (referred to as memory “locations”). Each
weighting, one per read or write head, defines the degree to which the head reads or writes
at each location. A head can thereby attend sharply to the memory at a single location or
weakly to the memory at many locations.

3.1 Reading
Let Mt be the contents of the N × M memory matrix at time t, where N is the number
of memory locations, and M is the vector size at each location. Let wt be a vector of
weightings over the N locations emitted by a read head at time t. Since all weightings are
normalised, the N elements wt (i) of wt obey the following constraints:
\[ \sum_{i} w_t(i) = 1, \qquad 0 \le w_t(i) \le 1, \quad \forall i. \tag{1} \]

The length M read vector rt returned by the head is defined as a convex combination of
the row-vectors Mt (i) in memory:
\[ r_t \longleftarrow \sum_{i} w_t(i)\, M_t(i), \tag{2} \]

which is clearly differentiable with respect to both the memory and the weighting.
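
The read operation can be sketched in NumPy as a single vector-matrix product. The function name `read` and the example memory contents are illustrative assumptions.

```python
import numpy as np

def read(memory, w):
    """Return r_t = sum_i w_t(i) M_t(i) for an N x M memory and a length-N weighting."""
    # The weighting must be normalised: non-negative entries that sum to one.
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return w @ memory

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])      # N = 4 locations, M = 3
sharp = np.array([0.0, 1.0, 0.0, 0.0])    # attends to a single location
blurry = np.array([0.5, 0.5, 0.0, 0.0])   # attends weakly to two locations

r_sharp = read(memory, sharp)     # equals row 1 of the memory
r_blurry = read(memory, blurry)   # equals the average of rows 0 and 1
```

Because the result is linear in both `memory` and `w`, gradients flow to each of them.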

3.2 Writing
Taking inspiration from the input and forget gates in LSTM, we decompose each write into
two parts: an erase followed by an add.
Given a weighting wt emitted by a write head at time t, along with an erase vector
et whose M elements all lie in the range (0, 1), the memory vectors Mt−1 (i) from the
previous time-step are modified as follows:
M̃t (i) ←− Mt−1 (i) [1 − wt (i)et ] , (3)
where 1 is a row-vector of all 1-s, and the multiplication against the memory location acts
point-wise. Therefore, the elements of a memory location are reset to zero only if both the
weighting at the location and the erase element are one; if either the weighting or the erase
is zero, the memory is left unchanged. When multiple write heads are present, the erasures
can be performed in any order, as multiplication is commutative.
Each write head also produces a length M add vector at , which is added to the memory
after the erase step has been performed:
Mt (i) ←− M̃t (i) + wt (i) at . (4)
Once again, the order in which the adds are performed by multiple heads is irrelevant. The
combined erase and add operations of all the write heads produces the final content of the
memory at time t. Since both erase and add are differentiable, the composite write oper-
ation is differentiable too. Note that both the erase and add vectors have M independent
components, allowing fine-grained control over which elements in each memory location
are modified.
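
A minimal NumPy sketch of a single head's erase-then-add write (Equations 3 and 4) follows. The function name `write` and the example values are illustrative assumptions; the erase values 0 and 1 used here are the limiting cases of the (0, 1) range.

```python
import numpy as np

def write(memory_prev, w, erase, add):
    """Apply one head's erase step followed by its add step."""
    # np.outer(w, erase) is an N x M matrix whose row i is w(i) * erase.
    erased = memory_prev * (1.0 - np.outer(w, erase))
    # np.outer(w, add) adds w(i) * add to row i.
    return erased + np.outer(w, add)

memory_prev = np.ones((3, 2))      # N = 3 locations, M = 2
w = np.array([0.0, 1.0, 0.0])      # write head focused on location 1
erase = np.array([1.0, 0.0])       # erase only the first element
add = np.array([5.0, 0.0])         # then add 5 to the first element

memory = write(memory_prev, w, erase, add)
```

Only the first element of location 1 changes; locations with zero weighting are left untouched.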

Figure 2: Flow Diagram of the Addressing Mechanism. The key vector, kt , and key
strength, βt , are used to perform content-based addressing of the memory matrix, Mt . The
resulting content-based weighting is interpolated with the weighting from the previous time step
based on the value of the interpolation gate, gt . The shift weighting, st , determines whether
and by how much the weighting is rotated. Finally, depending on γt , the weighting is sharpened
and used for memory access.

3.3 Addressing Mechanisms

Although we have now shown the equations of reading and writing, we have not described
how the weightings are produced. These weightings arise by combining two addressing
mechanisms with complementary facilities. The first mechanism, “content-based address-
ing,” focuses attention on locations based on the similarity between their current values
and values emitted by the controller. This is related to the content-addressing of Hopfield
networks (Hopfield, 1982). The advantage of content-based addressing is that retrieval is
simple, merely requiring the controller to produce an approximation to a part of the stored
data, which is then compared to memory to yield the exact stored value.
However, not all problems are well-suited to content-based addressing. In certain tasks
the content of a variable is arbitrary, but the variable still needs a recognisable name or ad-
dress. Arithmetic problems fall into this category: the variable x and the variable y can take
on any two values, but the procedure f (x, y) = x × y should still be defined. A controller
for this task could take the values of the variables x and y, store them in different addresses,
then retrieve them and perform a multiplication algorithm. In this case, the variables are
addressed by location, not by content. We call this form of addressing “location-based ad-
dressing.” Content-based addressing is strictly more general than location-based addressing
as the content of a memory location could include location information inside it. In our ex-
periments however, providing location-based addressing as a primitive operation proved
essential for some forms of generalisation, so we employ both mechanisms concurrently.
Figure 2 presents a flow diagram of the entire addressing system that shows the order
of operations for constructing a weighting vector when reading or writing.

3.3.1 Focusing by Content
For content-addressing, each head (whether employed for reading or writing) first produces
a length M key vector k_t that is compared to each vector M_t(i) by a similarity measure
K[·, ·]. The content-based system produces a normalised weighting w_t^c based on the
similarity and a positive key strength, β_t, which can amplify or attenuate the precision of the
focus:
\[ w_t^c(i) \longleftarrow \frac{\exp\big(\beta_t K[k_t, M_t(i)]\big)}{\sum_j \exp\big(\beta_t K[k_t, M_t(j)]\big)}. \tag{5} \]

In our current implementation, the similarity measure is cosine similarity:

\[ K[u, v] = \frac{u \cdot v}{\|u\| \cdot \|v\|}. \tag{6} \]
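
Content-based focusing can be sketched as a softmax over key-strength-scaled cosine similarities. The function names, the small `eps` guard against zero-norm vectors, and the example memory are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    # eps avoids division by zero for all-zero vectors (an added assumption).
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def content_weighting(memory, key, beta):
    """Normalised weighting from similarity to `key`, sharpened by `beta` > 0."""
    scores = beta * np.array([cosine_similarity(key, row) for row in memory])
    scores -= scores.max()           # numerical stability; the softmax is unchanged
    weights = np.exp(scores)
    return weights / weights.sum()

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
key = np.array([0.9, 0.1])           # an approximation to the content of row 0

w_soft = content_weighting(memory, key, beta=1.0)    # diffuse focus
w_sharp = content_weighting(memory, key, beta=50.0)  # nearly one-hot on row 0
```

A larger key strength concentrates the weighting on the best-matching location, while a smaller one spreads it over partially matching locations.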

3.3.2 Focusing by Location

The location-based addressing mechanism is designed to facilitate both simple iteration
across the locations of the memory and random-access jumps. It does so by implementing
a rotational shift of a weighting. For example, if the current weighting focuses entirely on
a single location, a rotation of 1 would shift the focus to the next location. A negative shift
would move the weighting in the opposite direction.
Prior to rotation, each head emits a scalar interpolation gate g_t in the range (0, 1). The
value of g_t is used to blend between the weighting w_{t−1} produced by the head at the previous
time-step and the weighting w_t^c produced by the content system at the current time-step,
yielding the gated weighting w_t^g:

\[ w_t^g \longleftarrow g_t\, w_t^c + (1 - g_t)\, w_{t-1}. \tag{7} \]

If the gate is zero, then the content weighting is entirely ignored, and the weighting from the
previous time step is used. Conversely, if the gate is one, the weighting from the previous
iteration is ignored, and the system applies content-based addressing.
After interpolation, each head emits a shift weighting st that defines a normalised distri-
bution over the allowed integer shifts. For example, if shifts between -1 and 1 are allowed,
st has three elements corresponding to the degree to which shifts of -1, 0 and 1 are per-
formed. The simplest way to define the shift weightings is to use a softmax layer of the
appropriate size attached to the controller. We also experimented with another technique,
where the controller emits a single scalar that is interpreted as the lower bound of a width
one uniform distribution over shifts. For example, if the shift scalar is 6.7, then st (6) = 0.3,
st (7) = 0.7, and the rest of st is zero.
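
One reading of the scalar technique that reproduces the example above is to take a uniform distribution on [x, x + 1) and assign each integer shift k the mass falling in [k, k + 1). The sketch below follows that reading; the function name and the allowed range 0 to 10 are hypothetical.

```python
import math

def shift_weighting_from_scalar(x, min_shift, max_shift):
    """Spread a width-one uniform distribution starting at x over integer shifts."""
    lower = math.floor(x)
    frac = x - lower
    if lower < min_shift or lower + 1 > max_shift:
        raise ValueError("shift scalar outside the allowed range")
    s = {k: 0.0 for k in range(min_shift, max_shift + 1)}
    s[lower] = 1.0 - frac      # mass of [x, lower + 1)
    s[lower + 1] = frac        # mass of [lower + 1, x + 1)
    return s

# The example from the text: a shift scalar of 6.7.
s = shift_weighting_from_scalar(6.7, min_shift=0, max_shift=10)
```

Here `s[6]` is approximately 0.3 and `s[7]` approximately 0.7, with every other entry zero.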

If we index the N memory locations from 0 to N − 1, the rotation applied to wtg by st
can be expressed as the following circular convolution:
\[ \tilde{w}_t(i) \longleftarrow \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \tag{8} \]

where all index arithmetic is computed modulo N . The convolution operation in Equa-
tion (8) can cause leakage or dispersion of weightings over time if the shift weighting is
not sharp. For example, if shifts of -1, 0 and 1 are given weights of 0.1, 0.8 and 0.1, the
rotation will transform a weighting focused at a single point into one slightly blurred over
three points. To combat this, each head emits one further scalar γt ≥ 1 whose effect is to
sharpen the final weighting as follows:
\[ w_t(i) \longleftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \tag{9} \]
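
The rotation and sharpening steps can be sketched in NumPy as follows, using the 0.1, 0.8, 0.1 shift weighting from the text. The function names and the choice of γ = 4 are illustrative assumptions.

```python
import numpy as np

def circular_shift(w_g, s, shifts):
    """Circular convolution of w_g with s, all indices modulo N.

    `s` holds the weights of the integer offsets listed in `shifts`.
    """
    w_tilde = np.zeros_like(w_g)
    for weight, offset in zip(s, shifts):
        # np.roll(w, k)[i] == w[(i - k) % N], so the focus moves forward by k.
        w_tilde += weight * np.roll(w_g, offset)
    return w_tilde

def sharpen(w_tilde, gamma):
    """Raise each weight to the power gamma >= 1 and renormalise."""
    powered = w_tilde ** gamma
    return powered / powered.sum()

w_g = np.zeros(8)
w_g[3] = 1.0                                   # focused on location 3
w_tilde = circular_shift(w_g, s=[0.1, 0.8, 0.1], shifts=[-1, 0, 1])
w = sharpen(w_tilde, gamma=4.0)                # refocuses on location 3
```

After the shift the weighting is blurred over locations 2, 3 and 4; sharpening restores almost all of the mass to location 3.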

The combined addressing system of weighting interpolation and content and location-
based addressing can operate in three complementary modes. One, a weighting can be
chosen by the content system without any modification by the location system. Two, a
weighting produced by the content addressing system can be chosen and then shifted. This
allows the focus to jump to a location next to, but not on, an address accessed by content;
in computational terms this allows a head to find a contiguous block of data, then access a
particular element within that block. Three, a weighting from the previous time step can
be rotated without any input from the content-based addressing system. This allows the
weighting to iterate through a sequence of addresses by advancing the same distance at
each time-step.
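
The three modes can be illustrated with a compact sketch of the whole addressing pipeline (content lookup, interpolation, shift, sharpening). The function `address`, the identity-matrix memory and the parameter values are assumptions chosen to make each mode easy to see; they are not the configuration used in the experiments.

```python
import numpy as np

def address(memory, w_prev, key, beta, g, s, gamma, shifts=(-1, 0, 1)):
    """Produce a weighting from the head outputs (key, beta, g, s, gamma)."""
    # 1. Content addressing: softmax over scaled cosine similarities.
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    scores = beta * (memory @ key) / norms
    w_c = np.exp(scores - scores.max())
    w_c /= w_c.sum()
    # 2. Interpolation with the previous weighting.
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3. Circular shift.
    w_tilde = sum(s_k * np.roll(w_g, k) for s_k, k in zip(s, shifts))
    # 4. Sharpening.
    w = w_tilde ** gamma
    return w / w.sum()

memory = np.eye(5)                  # five distinct, easily matched rows
key = memory[1]                     # content lookup should find location 1
w_prev = np.eye(5)[3]               # previous focus on location 3
stay, forward = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

mode1 = address(memory, w_prev, key, beta=50.0, g=1.0, s=stay, gamma=1.0)
mode2 = address(memory, w_prev, key, beta=50.0, g=1.0, s=forward, gamma=1.0)
mode3 = address(memory, w_prev, key, beta=50.0, g=0.0, s=forward, gamma=1.0)
```

Mode one focuses on location 1 (pure content lookup), mode two on location 2 (content lookup followed by a shift), and mode three on location 4 (the previous focus advanced by one, with the content weighting gated out).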

3.4 Controller Network

The NTM architecture described above has several free parameters, including
the size of the memory, the number of read and write heads, and the range of allowed lo-
cation shifts. But perhaps the most significant architectural choice is the type of neural
network used as the controller. In particular, one has to decide whether to use a recurrent
or feedforward network. A recurrent controller such as LSTM has its own internal memory
that can complement the larger memory in the matrix. If one compares the controller to
the central processing unit in a digital computer (albeit with adaptive rather than predefined
instructions) and the memory matrix to RAM, then the hidden activations of the recurrent
controller are akin to the registers in the processor. They allow the controller to mix infor-
mation across multiple time steps of operation. On the other hand a feedforward controller
can mimic a recurrent network by reading and writing at the same location in memory at
every step. Furthermore, feedforward controllers often confer greater transparency to the
network’s operation because the pattern of reading from and writing to the memory matrix
is usually easier to interpret than the internal state of an RNN. However, one limitation of
a feedforward controller is that the number of concurrent read and write heads imposes a
bottleneck on the type of computation the NTM can perform. With a single read head, it
can perform only a unary transform on a single memory vector at each time-step, with two
read heads it can perform binary vector transforms, and so on. Recurrent controllers can
internally store read vectors from previous time-steps, so do not suffer from this limitation.

4 Experiments
This section presents preliminary experiments on a set of simple algorithmic tasks such
as copying and sorting data sequences. The goal was not only to establish that NTM is
able to solve the problems, but also that it is able to do so by learning compact internal
programs. The hallmark of such solutions is that they generalise well beyond the range of
the training data. For example, we were curious to see if a network that had been trained
to copy sequences of length up to 20 could copy a sequence of length 100 with no further
training.
For all the experiments we compared three architectures: NTM with a feedforward
controller, NTM with an LSTM controller, and a standard LSTM network. Because all
the tasks were episodic, we reset the dynamic state of the networks at the start of each
input sequence. For the LSTM networks, this meant setting the previous hidden state equal
to a learned bias vector. For NTM the previous state of the controller, the value of the
previous read vectors, and the contents of the memory were all reset to bias values. All
the tasks were supervised learning problems with binary targets; all networks had logistic
sigmoid output layers and were trained with the cross-entropy objective function. Sequence
prediction errors are reported in bits-per-sequence. For more details about the experimental
parameters see Section 4.6.
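
As a sketch of how such a cost could be computed, the following sums the binary cross-entropy over every time-step and output unit of one sequence and converts it to bits. Treating "bits-per-sequence" as this sum, along with the function name and the clipping constant `eps`, is an assumption.

```python
import numpy as np

def bits_per_sequence(probs, targets, eps=1e-12):
    """Binary cross-entropy summed over a whole sequence, expressed in bits."""
    probs = np.clip(probs, eps, 1.0 - eps)      # avoid log(0)
    nats = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(nats.sum() / np.log(2.0))      # convert nats to bits

targets = np.array([[1, 0, 1, 1],
                    [0, 0, 1, 0]], dtype=float)  # two time-steps of four bits
chance = np.full_like(targets, 0.5)              # an uninformed predictor

cost_chance = bits_per_sequence(chance, targets)     # one bit per binary target
cost_perfect = bits_per_sequence(targets, targets)   # close to zero
```

Under this reading, a predictor that always outputs 0.5 costs exactly one bit per target element.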

4.1 Copy
The copy task tests whether NTM can store and recall a long sequence of arbitrary in-
formation. The network is presented with an input sequence of random binary vectors
followed by a delimiter flag. Storage and access of information over long time periods has
always been problematic for RNNs and other dynamic architectures. We were particularly
interested to see if an NTM is able to bridge longer time delays than LSTM.
The networks were trained to copy sequences of eight bit random vectors, where the
sequence lengths were randomised between 1 and 20. The target sequence was simply a
copy of the input sequence (without the delimiter flag). Note that no inputs were presented
to the network while it received the targets, to ensure that it recalled the entire sequence with
no intermediate assistance.
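
A training example for the copy task might be generated as in the sketch below. Placing the delimiter flag on an extra input channel, and the function name `make_copy_example`, are assumptions; the text specifies only the vector width, the length range and the absence of inputs during recall.

```python
import numpy as np

def make_copy_example(rng, min_len=1, max_len=20, width=8):
    """Return (inputs, targets) for one copy episode."""
    length = int(rng.integers(min_len, max_len + 1))
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    steps = 2 * length + 1                     # presentation, delimiter, recall
    # One extra input channel carries the delimiter flag (assumed layout).
    inputs = np.zeros((steps, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = 1.0                # delimiter after the sequence
    # Input rows after the delimiter stay zero: nothing is shown during recall.
    targets = np.zeros((steps, width))
    targets[length + 1:] = seq                 # the copy, without the delimiter
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = make_copy_example(rng)
```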
As can be seen from Figure 3, NTM (with either a feedforward or LSTM controller)
learned much faster than LSTM alone, and converged to a lower cost. The disparity between
the NTM and LSTM learning curves is dramatic enough to suggest a qualitative,
rather than quantitative, difference in the way the two models solve the problem.

Figure 3: Copy Learning Curves. (Plot of cost per sequence in bits against sequence number in thousands; legend entries include NTM with LSTM Controller and NTM with Feedforward Controller.)

We also studied the ability of the networks to generalise to longer sequences than seen
during training (that they can generalise to novel vectors is clear from the training error).
Figures 4 and 5 demonstrate that the behaviour of LSTM and NTM in this regime is radically
different. NTM continues to copy as the length increases,² while LSTM rapidly
degrades beyond length 20.
The preceding analysis suggests that NTM, unlike LSTM, has learned some form of
copy algorithm. To determine what this algorithm is, we examined the interaction between
the controller and the memory (Figure 6). We believe that the sequence of operations per-
formed by the network can be summarised by the following pseudocode:

initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while

² The limiting factor was the size of the memory (128 locations), after which the cyclical shifts wrapped
around and previous writes were overwritten.

Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row
depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30,
and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network
was only trained on sequences of up to length 20. The first four sequences are reproduced with
high confidence and very few mistakes. The longest one has a few more local errors and one
global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated,
pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy,
this leads to a high loss.

This is essentially how a human programmer would perform the same task in a low-level
programming language. In terms of data structures, we could say that NTM has
learned how to create and iterate through arrays. Note that the algorithm combines both
content-based addressing (to jump to start of the sequence) and location-based address-
ing (to move along the sequence). Also note that the iteration would not generalise to
long sequences without the ability to use relative shifts from the previous read and write
weightings (Equation 7), and that without the focus-sharpening mechanism (Equation 9)
the weightings would probably lose precision over time.
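
For comparison, the hypothesised copy procedure can be written as an ordinary program over an array with a single head position. This is an idealised, discrete sketch rather than what the network computes: the function name, the hard (non-blurry) head, and the bounded output loop are all assumptions. The default of 128 locations mirrors the memory size mentioned for the experiments.

```python
def copy_with_memory(sequence, memory_size=128, start=0):
    """Write a sequence into a circular array, then read it back in order."""
    memory = [None] * memory_size
    head = start
    for vector in sequence:                 # input phase, until the delimiter
        memory[head] = vector
        head = (head + 1) % memory_size     # relative shift of one location
    head = start                            # return the head to the start
    outputs = []
    for _ in range(len(sequence)):          # output phase
        outputs.append(memory[head])
        head = (head + 1) % memory_size
    return outputs

short = copy_with_memory([[1, 0], [0, 1], [1, 1]])
# With only 4 locations, a length-6 sequence wraps around and overwrites itself.
wrapped = copy_with_memory([0, 1, 2, 3, 4, 5], memory_size=4)
```

The second call shows the failure mode that appears once the sequence exceeds the memory size: the cyclic shift wraps around and earlier writes are lost.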

4.2 Repeat Copy

The repeat copy task extends copy by requiring the network to output the copied sequence a
specified number of times and then emit an end-of-sequence marker. The main motivation
was to see if the NTM could learn a simple nested function. Ideally, we would like it to be
able to execute a “for loop” containing any subroutine it has already learned.
The network receives random-length sequences of random binary vectors, followed by
a scalar value indicating the desired number of copies, which appears on a separate input
channel. To emit the end marker at the correct time the network must be both able to
interpret the extra input and keep count of the number of copies it has performed so far.
As with the copy task, no inputs are provided to the network after the initial sequence and
repeat number. The networks were trained to reproduce sequences of size eight random
binary vectors, where both the sequence length and the number of repetitions were chosen
randomly from one to ten. The input representing the repeat number was normalised to
have mean zero and variance one.
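
One way to build such an example is sketched below. The channel layout (an extra input channel for the repeat count and an extra target channel for the end marker), the function names, and the use of the uniform 1 to 10 range to compute the normalisation statistics are assumptions.

```python
import numpy as np

# Repeat counts are drawn from 1..10, so normalise with that range's statistics.
REPEAT_VALUES = np.arange(1, 11)
REPEAT_MEAN = REPEAT_VALUES.mean()
REPEAT_STD = REPEAT_VALUES.std()

def normalise_repeats(n):
    """Map a repeat count to a zero-mean, unit-variance input value."""
    return (n - REPEAT_MEAN) / REPEAT_STD

def make_repeat_copy_example(rng, max_len=10, max_repeats=10, width=8):
    length = int(rng.integers(1, max_len + 1))
    repeats = int(rng.integers(1, max_repeats + 1))
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    out_len = length * repeats + 1              # all copies plus the end marker
    steps = length + 1 + out_len
    inputs = np.zeros((steps, width + 1))       # last channel: repeat count
    inputs[:length, :width] = seq
    inputs[length, width] = normalise_repeats(repeats)
    targets = np.zeros((steps, width + 1))      # last channel: end marker
    targets[length + 1:length + 1 + length * repeats, :width] = np.tile(seq, (repeats, 1))
    targets[-1, width] = 1.0
    return inputs, targets, length, repeats

rng = np.random.default_rng(0)
inputs, targets, length, repeats = make_repeat_copy_example(rng)
```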

Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs
for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences
of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences.
Also note that the length of the accurate prefix decreases as the sequence length increases,
suggesting that the network has trouble retaining information for long periods.

Figure 6: NTM Memory Use During the Copy Task. The plots in the left column depict
the inputs to the network (top), the vectors added to memory (middle) and the corresponding
write weightings (bottom) during a single test sequence for the copy task. The plots on the right
show the outputs from the network (top), the vectors read from memory (middle) and the read
weightings (bottom). Only a subset of memory locations are shown. Notice the sharp focus of
all the weightings on a single location in memory (black is weight zero, white is weight one).
Also note that the translation of the focal point over time reflects the network’s use of iterative
shifts for location-based addressing, as described in Section 3.3.2. Lastly, observe that the read
locations exactly match the write locations, and the read vectors match the add vectors. This
suggests that the network writes each input vector in turn to a specific memory location during
the input phase, then reads from the same location sequence during the output phase.

Figure 7: Repeat Copy Learning Curves. (Plot of cost per sequence in bits against sequence number in thousands for LSTM, NTM with LSTM Controller and NTM with Feedforward Controller.)

Figure 7 shows that NTM learns the task much faster than LSTM, but both were able to
solve it perfectly.³ The difference between the two architectures only becomes clear when
they are asked to generalise beyond the training data. In this case we were interested in
generalisation along two dimensions: sequence length and number of repetitions. Figure 8
illustrates the effect of doubling first one, then the other, for both LSTM and NTM. Whereas
LSTM fails both tests, NTM succeeds with longer sequences and is able to perform more
than ten repetitions; however it is unable to keep count of how many repeats it has
completed, and does not predict the end marker correctly. This is probably a consequence
of representing the number of repetitions numerically, which does not easily generalise
beyond a fixed range.
Figure 9 suggests that NTM learns a simple extension of the copy algorithm in the
previous section, where the sequential read is repeated as many times as necessary.

4.3 Associative Recall

The previous tasks show that the NTM can apply algorithms to relatively simple, linear data
structures. The next order of complexity in organising data arises from “indirection”—that
is, when one data item points to another. We test the NTM’s capability for learning an
instance of this more interesting class by constructing a list of items so that querying with
one of the items demands that the network return the subsequent item. More specifically,
we define an item as a sequence of binary vectors that is bounded on the left and right
by delimiter symbols. After several items have been propagated to the network, we query
by showing a random item, and we ask the network to produce the next item. In our
experiments, each item consisted of three six-bit binary vectors (giving a total of 18 bits
per item). During training, we used a minimum of 2 items and a maximum of 6 items in a
single episode.

³ It surprised us that LSTM performed better here than on the copy problem. The likely reasons are that the
sequences were shorter (up to length 10 instead of up to 20), and the LSTM network was larger and therefore
had more memory capacity.

Figure 8: NTM and LSTM Generalisation for the Repeat Copy Task. NTM generalises
almost perfectly to longer sequences than seen during training. When the number of repeats is
increased it is able to continue duplicating the input sequence fairly accurately; but it is unable
to predict when the sequence will end, emitting the end marker after the end of every repetition
beyond the eleventh. LSTM struggles with both increased length and number, rapidly diverging
from the input sequence in both cases.

Figure 10 shows that NTM learns this task significantly faster than LSTM, terminating
at near zero cost within approximately 30,000 episodes, whereas LSTM does not reach
zero cost after a million episodes. Additionally, NTM with a feedforward controller learns
faster than NTM with an LSTM controller. These two results suggest that NTM’s external
memory is a more effective way of maintaining the data structure than LSTM’s internal
state. NTM also generalises much better to longer sequences than LSTM, as can be seen
in Figure 11. NTM with a feedforward controller is nearly perfect for sequences of up to
12 items (twice the maximum length used in training), and still has an average cost below
1 bit per sequence for sequences of 15 items.

Figure 9: NTM Memory Use During the Repeat Copy Task. As with the copy task the
network first writes the input vectors to memory using iterative shifts. It then reads through
the sequence to replicate the input as many times as necessary (six in this case). The white dot
at the bottom of the read weightings seems to correspond to an intermediate location used to
redirect the head to the start of the sequence (the NTM equivalent of a goto statement).

Figure 10: Associative Recall Learning Curves for NTM and LSTM. (Plot of cost per sequence in bits against sequence number in thousands; legend entries include NTM with LSTM Controller and NTM with Feedforward Controller.)

Figure 11: Generalisation Performance on Associative Recall for Longer Item Sequences.
The NTM with either a feedforward or LSTM controller generalises to much longer sequences
of items than the LSTM alone. In particular, the NTM with a feedforward controller is nearly
perfect for item sequences of twice the length of sequences in its training set. (Plot of cost per sequence in bits against number of items per sequence for LSTM, NTM with LSTM Controller and NTM with Feedforward Controller.)

In Figure 12, we show the operation of the NTM memory, controlled by an LSTM
with one head, on a single test episode. In “Inputs,” we see that the input denotes item
delimiters as single bits in row 7. After the sequence of items has been propagated, a
delimiter in row 8 prepares the network to receive a query item. In this case, the query
item corresponds to the second item in the sequence (contained in the green box). In
“Outputs,” we see that the network crisply outputs item 3 in the sequence (from the red
box). In “Read Weightings,” on the last three time steps, we see that the controller reads
from contiguous locations that each store the time slices of item 3. This is curious because it
appears that the network has jumped directly to the correct location storing item 3. However
we can explain this behaviour by looking at “Write Weightings.” Here we see that the
memory is written to even when the input presents a delimiter symbol between items.
One can confirm in “Adds” that data are indeed written to memory when the delimiters
are presented (e.g., the data within the black box); furthermore, each time a delimiter is
presented, the vector added to memory is different. Further analysis of the memory reveals
that the network accesses the location it reads after the query by using a content-based
lookup that produces a weighting that is shifted by one. Additionally, the key used for
content-lookup corresponds to the vector that was added in the black box. This implies the
following memory-access algorithm: when each item delimiter is presented, the controller
writes a compressed representation of the previous three time slices of the item. After the
query arrives, the controller recomputes the same compressed representation of the query
item, uses a content-based lookup to find the location where it wrote the first representation,
and then shifts by one to produce the subsequent item in the sequence (thereby combining
content-based lookup with location-based offsetting).
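
The inferred memory-access algorithm can be caricatured in a few lines. This sketch simplifies heavily: it stores one flattened item per location in place of the learned compressed representation, and it replaces the soft content weighting with a hard argmax. The function name and the hand-built items are hypothetical.

```python
import numpy as np

def associative_recall(items, query):
    """Find the stored item matching `query`, then return the item after it."""
    # One memory location per item, written in presentation order.
    memory = np.stack([item.ravel() for item in items])
    key = query.ravel()
    # Content-based lookup by cosine similarity (hard argmax for simplicity).
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    location = int(np.argmax(sims))
    # Location-based offset: shift the focus by one to reach the next item.
    next_location = (location + 1) % len(items)
    return memory[next_location].reshape(query.shape)

# Four distinct items, each made of three six-bit vectors.
items = [np.eye(6)[[k, k + 1, k + 2]] for k in range(4)]
answer = associative_recall(items, query=items[1])   # expected to equal items[2]
```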

4.4 Dynamic N-Grams

The goal of the dynamic N-Grams task was to test whether NTM could rapidly adapt to
new predictive distributions. In particular we were interested to see if it were able to use its
memory as a re-writable table that it could use to keep count of transition statistics, thereby
emulating a conventional N-Gram model.

Figure 12: NTM Memory Use During the Associative Recall Task. In “Inputs,” a
sequence of items, each composed of three consecutive binary random vectors is propagated to the
controller. The distinction between items is designated by delimiter symbols (row 7 in “Inputs”).
After several items have been presented, a delimiter that designates a query is presented (row 8
in “Inputs”). A single query item is presented (green box), and the network target corresponds
to the subsequent item in the sequence (red box). In “Outputs,” we see that the network
correctly produces the target item. The red boxes in the read and write weightings highlight the
three locations where the target item was written and then read. The solution the network finds
is to form a compressed representation (black box in “Adds”) of each item that it can store in
a single location. For further analysis, see the main text.

We considered the set of all possible 6-Gram distributions over binary sequences. Each
6-Gram distribution can be expressed as a table of 2^5 = 32 numbers, specifying the
probability that the next bit will be one, given all possible length five binary histories. For
each training example, we first generated random 6-Gram probabilities by independently
drawing all 32 probabilities from the Beta(1/2, 1/2) distribution.
We then generated a particular training sequence by drawing 200 successive bits using
the current lookup table.⁴ The network observes the sequence one bit at a time and is then
asked to predict the next bit. The optimal estimator for the problem can be determined by
Bayesian analysis (Murphy, 2012):

\[ P(B = 1 \mid N_1, N_0, c) = \frac{N_1 + \frac{1}{2}}{N_1 + N_0 + 1} \tag{10} \]

where c is the five bit previous context, B is the value of the next bit and N0 and N1 are
respectively the number of zeros and ones observed after c so far in the sequence. We can
therefore compare NTM to the optimal predictor as well as LSTM. To assess performance
we used a validation set of 1000 length 200 sequences sampled from the same
distribution as the training data. As shown in Figure 13, NTM achieves a small, but significant
performance advantage over LSTM, but never quite reaches the optimum cost.

⁴ The first 5 bits, for which insufficient context exists to sample from the table, are drawn i.i.d. from a
Bernoulli distribution with p = 0.5.

Figure 13: Dynamic N-Gram Learning Curves. (Plot of cost per sequence in bits against sequence number in thousands; legend entries include NTM with LSTM Controller, NTM with Feedforward Controller and Optimal Estimator.)

The evolution of the two architectures’ predictions as they observe new inputs is shown
in Figure 14, along with the optimal predictions. Close analysis of NTM’s memory usage
(Figure 15) suggests that the controller uses the memory to count how many ones and zeros
it has observed in different contexts, allowing it to implement an algorithm similar to the
optimal estimator.
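
The task and its optimal estimator can be sketched with the Python standard library. The function names are illustrative, and treating the 200 bits as including the five initial i.i.d. bits is an assumption.

```python
import random

def sample_ngram_table(rng, context_len=5):
    """One Beta(1/2, 1/2) probability of 'next bit is 1' per binary context."""
    return {c: rng.betavariate(0.5, 0.5) for c in range(2 ** context_len)}

def sample_sequence(rng, table, length=200, context_len=5):
    # The first bits have no full context, so they are drawn with p = 0.5.
    bits = [int(rng.random() < 0.5) for _ in range(context_len)]
    while len(bits) < length:
        context = int("".join(map(str, bits[-context_len:])), 2)
        bits.append(int(rng.random() < table[context]))
    return bits

def optimal_predictions(bits, context_len=5):
    """Predict each bit as (N1 + 1/2) / (N1 + N0 + 1) from counts seen so far."""
    counts = {}                                  # context -> (N0, N1)
    predictions = []
    for t in range(context_len, len(bits)):
        context = tuple(bits[t - context_len:t])
        n0, n1 = counts.get(context, (0, 0))
        predictions.append((n1 + 0.5) / (n1 + n0 + 1))
        # Update the counts only after predicting, using the observed bit.
        counts[context] = (n0 + (bits[t] == 0), n1 + (bits[t] == 1))
    return predictions

rng = random.Random(0)
table = sample_ngram_table(rng)
sequence = sample_sequence(rng, table)
predictions = optimal_predictions(sequence)
```

The estimator keeps two counters per context, which is the kind of re-writable table the network appears to emulate in its memory.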

4.5 Priority Sort

This task tests whether the NTM can sort data—an important elementary algorithm. A
sequence of random binary vectors is input to the network along with a scalar priority
rating for each vector. The priority is drawn uniformly from the range [-1, 1]. The target
sequence contains the binary vectors sorted according to their priorities, as depicted in
Figure 16.
Each input sequence contained 20 binary vectors with corresponding priorities, and
each target sequence was the 16 highest-priority vectors in the input.5 Inspection of NTM’s
We limited the sort to size 16 because we were interested to see if NTM would solve the task using a
binary heap sort of depth 4.

Figure 14: Dynamic N-Gram Inference. The top row shows a test sequence from the N-Gram
task, and the rows below show the corresponding predictive distributions emitted by the optimal
estimator, NTM, and LSTM. In most places the NTM predictions are almost indistinguishable
from the optimal ones. However at the points indicated by the two arrows it makes clear
mistakes, one of which is explained in Figure 15. LSTM follows the optimal predictions closely
in some places but appears to diverge further as the sequence progresses; we speculate that this
is due to LSTM “forgetting” the observations at the start of the sequence.

Figure 15: NTM Memory Use During the Dynamic N-Gram Task. The red and green
arrows indicate point where the same context is repeatedly observed during the test sequence
(“00010” for the green arrows, “01111” for the red arrows). At each such point the same
location is accessed by the read head, and then, on the next time-step, accessed by the write
head. We postulate that the network uses the writes to keep count of the fraction of ones and
zeros following each context in the sequence so far. This is supported by the add vectors, which
are clearly anti-correlated at places where the input is one or zero, suggesting a distributed
“counter.” Note that the write weightings grow fainter as the same context is repeatedly seen;
this may be because the memory records a ratio of ones to zeros, rather than absolute counts.
The red box in the prediction sequence corresponds to the mistake at the first red arrow in
Figure 14; the controller appears to have accessed the wrong memory location, as the previous
context was “01101” and not “01111.”

Figure 16: Example Input and Target Sequence for the Priority Sort Task. The input
sequence contains random binary vectors and random scalar priorities. The target sequence is a
subset of the input vectors sorted by the priorities.

[Figure 17 panels, left to right: Hypothesised Locations, Write Weightings, Read Weightings; horizontal axis of each panel: Time]

Figure 17: NTM Memory Use During the Priority Sort Task. Left: Write locations
returned by fitting a linear function of the priorities to the observed write locations. Middle:
Observed write locations. Right: Read locations.

memory use led us to hypothesise that it uses the priorities to determine the relative location
of each write. To test this hypothesis we fitted a linear function of the priority to the
observed write locations. Figure 17 shows that the locations returned by the linear function
closely match the observed write locations. It also shows that the network reads from the
memory locations in increasing order, thereby traversing the sorted sequence.
The learning curves in Figure 18 demonstrate that NTM with either a feedforward or an
LSTM controller substantially outperforms LSTM on this task. Note that eight parallel
read and write heads were needed for best performance with a feedforward controller on
this task; this may reflect the difficulty of sorting vectors using only unary vector operations
(see Section 3.4).
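
As a concrete reading of the task description in this section, one training example could be generated roughly as follows. This is an illustrative sketch, not the authors' code: the function name, the vector width of 8 bits, and the highest-priority-first target order are assumptions, while the 20 inputs, the 16 targets, and the uniform priorities in [-1, 1] come from the text.

```python
import random

def make_priority_sort_example(num_inputs=20, num_targets=16, width=8, rng=None):
    """Return (inputs, targets) for one priority-sort example.

    inputs  : list of (binary_vector, priority) pairs
    targets : the num_targets highest-priority vectors
    width=8 and the highest-first ordering are assumptions of this sketch.
    """
    rng = rng or random.Random()
    inputs = []
    for _ in range(num_inputs):
        vector = [rng.randint(0, 1) for _ in range(width)]
        priority = rng.uniform(-1.0, 1.0)
        inputs.append((vector, priority))
    # Sort by priority and keep only the top num_targets vectors.
    ranked = sorted(inputs, key=lambda pair: pair[1], reverse=True)
    targets = [vector for vector, _ in ranked[:num_targets]]
    return inputs, targets

inputs, targets = make_priority_sort_example(rng=random.Random(0))
```

The four lowest-priority vectors never appear in the target, so a model has to rank the inputs rather than merely reorder them.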

4.6 Experimental Details

For all experiments, the RMSProp algorithm was used for training in the form described
in (Graves, 2013) with momentum of 0.9. Tables 1 to 3 give details about the network
configurations and learning rates used in the experiments. All LSTM networks had three
stacked hidden layers. Note that the number of LSTM parameters grows quadratically with


[Figure 18 plot: cost per sequence (bits) against sequence number (thousands); recovered legend entries: NTM with LSTM Controller, NTM with Feedforward Controller]

Figure 18: Priority Sort Learning Curves.

Task          | #Heads | Controller Size | Memory Size | Learning Rate | #Parameters
Copy          | 1      | 100             | 128 × 20    | 10^-4         | 17,162
Repeat Copy   | 1      | 100             | 128 × 20    | 10^-4         | 16,712
Associative   | 4      | 256             | 128 × 20    | 10^-4         | 146,845
N-Grams       | 1      | 100             | 128 × 20    | 3 × 10^-5     | 14,656
Priority Sort | 8      | 512             | 128 × 20    | 3 × 10^-5     | 508,305

Table 1: NTM with Feedforward Controller Experimental Settings

the number of hidden units (due to the recurrent connections in the hidden layers). This
contrasts with NTM, where the number of parameters does not increase with the number of
memory locations. During the training backward pass, all gradient components are clipped
elementwise to the range (-10, 10).
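
The training setup in this section can be sketched in a few lines. The block below shows elementwise gradient clipping to ±10 followed by one RMSProp-style update with momentum 0.9. It is a simplified illustration, not the exact form from Graves (2013): the function names, the decay constant 0.95, and the stabiliser 1e-4 are assumptions of this sketch.

```python
import math

def clip_elementwise(grads, limit=10.0):
    """Clip each gradient component to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def rmsprop_momentum_step(params, grads, state, lr=1e-4, momentum=0.9,
                          decay=0.95, eps=1e-4):
    """One update step; decay and eps are assumed values, not from the paper."""
    clipped = clip_elementwise(grads)
    new_params = []
    for i, (w, g) in enumerate(zip(params, clipped)):
        # Running average of squared gradients.
        state["mean_sq"][i] = decay * state["mean_sq"][i] + (1.0 - decay) * g * g
        # Momentum-smoothed step, scaled by the root of the running average.
        state["delta"][i] = (momentum * state["delta"][i]
                             - lr * g / math.sqrt(state["mean_sq"][i] + eps))
        new_params.append(w + state["delta"][i])
    return new_params

params = [1.0, -2.0]
state = {"mean_sq": [0.0, 0.0], "delta": [0.0, 0.0]}
params = rmsprop_momentum_step(params, [25.0, -0.5], state)
```

Clipping happens before the running averages are updated, so a single exploding gradient component cannot dominate the step-size statistics.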

5 Conclusion
We have introduced the Neural Turing Machine, a neural network architecture that takes
inspiration from both models of biological working memory and the design of digital com-
puters. Like conventional neural networks, the architecture is differentiable end-to-end and
can be trained with gradient descent. Our experiments demonstrate that it is capable of
learning simple algorithms from example data and of using these algorithms to generalise
well outside its training regime.

Task          | #Heads | Controller Size | Memory Size | Learning Rate | #Parameters
Copy          | 1      | 100             | 128 × 20    | 10^-4         | 67,561
Repeat Copy   | 1      | 100             | 128 × 20    | 10^-4         | 66,111
Associative   | 1      | 100             | 128 × 20    | 10^-4         | 70,330
N-Grams       | 1      | 100             | 128 × 20    | 3 × 10^-5     | 61,749
Priority Sort | 5      | 2 × 100         | 128 × 20    | 3 × 10^-5     | 269,038

Table 2: NTM with LSTM Controller Experimental Settings

Task          | Network Size | Learning Rate | #Parameters
Copy          | 3 × 256      | 3 × 10^-5     | 1,352,969
Repeat Copy   | 3 × 512      | 3 × 10^-5     | 5,312,007
Associative   | 3 × 256      | 10^-4         | 1,344,518
N-Grams       | 3 × 128      | 10^-4         | 331,905
Priority Sort | 3 × 128      | 3 × 10^-5     | 384,424

Table 3: LSTM Network Experimental Settings

6 Acknowledgments
Many have offered thoughtful insights, but we would especially like to thank Daan Wier-
stra, Peter Dayan, Ilya Sutskever, Charles Blundell, Joel Veness, Koray Kavukcuoglu,
Dharshan Kumaran, Georg Ostrovski, Chris Summerfield, Jeff Dean, Geoffrey Hinton, and
Demis Hassabis.

Baddeley, A., Eysenck, M., and Anderson, M. (2009). Memory. Psychology Press.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. abs/1409.0473.

Barrouillet, P., Bernardin, S., and Camos, V. (2004). Time constraints and resource shar-
ing in adults’ working memory spans. Journal of Experimental Psychology: General,

Chomsky, N. (1956). Three models for the description of language. Information Theory,
IEEE Transactions on, 2(3):113–124.

Das, S., Giles, C. L., and Sun, G.-Z. (1992). Learning context-free grammars: Capabil-
ities and limitations of a recurrent neural network with an external stack memory. In
Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana

Dayan, P. (2008). Simple substrates for complex cognition. Frontiers in neuroscience,


Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition.
Oxford University Press.

Fitch, W., Hauser, M. D., and Chomsky, N. (2005). The evolution of the language faculty:
clarifications and implications. Cognition, 97(2):179–210.

Fodor, J. A. and Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A
critical analysis. Cognition, 28(1):3–71.

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive process-
ing of data structures. Neural Networks, IEEE Transactions on, 9(5):768–786.

Gallistel, C. R. and King, A. P. (2009). Memory and the computational brain: Why cogni-
tive science will transform neuroscience, volume 3. John Wiley & Sons.

Goldman-Rakic, P. S. (1995). Cellular basis of working memory. Neuron, 14(3):477–485.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In Proceedings of the 31st International Conference on Machine Learn-
ing (ICML-14), pages 1764–1772.

Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE
International Conference on, pages 6645–6649. IEEE.

Hadley, R. F. (2009). The problem of rapid variable creation. Neural computation,


Hazy, T. E., Frank, M. J., and O’Reilly, R. C. (2006). Banishing the homunculus: making
working memory work. Neuroscience, 139(1):105–118.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings
of the eighth annual conference of the cognitive science society, volume 1, page 12.
Amherst, MA.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001a). Gradient flow in
recurrent nets: the difficulty of learning long-term dependencies.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation,


Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001b). Learning to learn using gradient
descent. In Artificial Neural Networks - ICANN 2001, pages 87–94. Springer.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the national academy of sciences, 79(8):2554–

Jackendoff, R. and Pinker, S. (2005). The nature of the language faculty and its implications
for evolution of language (reply to fitch, hauser, and chomsky). Cognition, 97(2):211–

Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed
representation with high-dimensional random vectors. Cognitive Computation,

Marcus, G. F. (2003). The algebraic mind: Integrating connectionism and cognitive sci-
ence. MIT press.

Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our
capacity for processing information. Psychological review, 63(2):81.

Miller, G. A. (2003). The cognitive revolution: a historical perspective. Trends in cognitive
sciences, 7(3):141–144.

Minsky, M. L. (1967). Computation: finite and infinite machines. Prentice-Hall, Inc.

Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

Plate, T. A. (2003). Holographic Reduced Representation: Distributed representation for
cognitive structures. CSLI.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence,


Rigotti, M., Barak, O., Warden, M. R., Wang, X.-J., Daw, N. D., Miller, E. K., and Fusi,
S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature,

Rumelhart, D. E., McClelland, J. L., Group, P. R., et al. (1986). Parallel distributed pro-
cessing, volume 1. MIT press.

Seung, H. S. (1998). Continuous attractors and oculomotor control. Neural Networks,


Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of computer and system sciences, 50(1):132–150.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic
structures in connectionist systems. Artificial intelligence, 46(1):159–216.

Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012). Semantic compositionality
through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Lan-
guage Learning, pages 1201–1211. Association for Computational Linguistics.

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural
networks. In Proceedings of the 28th International Conference on Machine Learning
(ICML-11), pages 1017–1024.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. arXiv preprint arXiv:1409.3215.

Touretzky, D. S. (1990). Boltzcons: Dynamic symbol structures in a connectionist network.
Artificial Intelligence, 46(1):5–46.

Von Neumann, J. (1945). First draft of a report on the edvac.

Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: the importance of nmda
receptors to working memory. The Journal of Neuroscience, 19(21):9587–9603.

Under review as a conference paper at ICLR 2016

Wojciech Zaremba1,2 Ilya Sutskever2
New York University Google Brain
Facebook AI Research

arXiv:1505.00521v3 [cs.LG] 12 Jan 2016

The Neural Turing Machine (NTM) is more expressive than all previously considered
models because of its external memory. It can be viewed as part of a broader effort to use
abstract external Interfaces and to learn a parametric model that interacts with them.
The capabilities of a model can be extended by providing it with proper Interfaces
that interact with the world. These external Interfaces include memory, a database,
a search engine, or a piece of software such as a theorem verifier. Some of these
Interfaces are provided by the developers of the model. However, many important
existing Interfaces, such as databases and search engines, are discrete.
We examine the feasibility of learning models to interact with discrete Interfaces. We
investigate the following discrete Interfaces: a memory Tape, an input Tape, and an
output Tape. We use a Reinforcement Learning algorithm to train a neural network
that interacts with such Interfaces to solve simple algorithmic tasks. Our Interfaces
are expressive enough to make our model Turing complete.

Graves et al. (2014b)’s Neural Turing Machine (NTM) is a model that learns to interact with an external
memory that is differentiable and continuous. An external memory extends the capabilities of the NTM,
allowing it to solve tasks that were previously unsolvable by conventional machine learning methods.
This is the source of the NTM’s expressive power. In general, it appears that ML models become
significantly more powerful if they are able to learn to interact with external interfaces.
There exist a vast number of Interfaces that could be used with our models. For example, the Google
search engine is one such Interface. The search engine consumes queries (which are actions),
and outputs search results. However, the search engine is not differentiable, and the model interacts
with the Interface using discrete actions. This work examines the feasibility of learning to interact with
discrete Interfaces using the Reinforce algorithm.
Discrete Interfaces cannot be trained directly with standard backpropagation because they are not dif-
ferentiable. It is most natural to learn to interact with discrete Interfaces using Reinforcement Learning
methods. In this work, we consider an Input Tape and a Memory Tape interface with discrete access.
Our concrete proposal is to use the Reinforce algorithm to learn where to access the discrete interfaces,
and to use the backpropagation algorithm to determine what to write to the memory and to the output.
We call this model the RL–NTM.
Discrete Interfaces are computationally attractive because the cost of accessing a discrete Interface is
often independent of its size. This is not the case for continuous Interfaces, where the cost of access
scales linearly with size. This is a significant disadvantage, since slow models cannot scale to large, difficult
problems that require intensive training on large datasets. In addition, an output Interface that lets
the model decide when it wants to make a prediction allows the model’s runtime to be in principle
unbounded. If the model has an output interface of this kind together with an interface to an unbounded
memory, the model becomes Turing complete.
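
The cost argument above can be made concrete with a small sketch. The two hypothetical functions below (names and memory layout are assumptions for illustration, not part of the paper) contrast a continuous, attention-style read, which touches every memory row, with a discrete read, which looks up a single address.

```python
import math

def soft_read(memory, scores):
    """Continuous read: softmax-weighted sum over every row.

    The work grows linearly with len(memory).
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    width = len(memory[0])
    return [sum(e / z * row[j] for e, row in zip(exps, memory))
            for j in range(width)]

def discrete_read(memory, address):
    """Discrete read: a single row lookup, independent of len(memory)."""
    return memory[address]

memory = [[1.0, 0.0], [0.0, 1.0]]
soft = soft_read(memory, [0.0, 0.0])   # blends both rows equally
hard = discrete_read(memory, 1)        # returns exactly one row
```

The soft read is differentiable with respect to the scores, whereas the discrete address is not, which is why the discrete case calls for a method such as Reinforce.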
We evaluate the RL-NTM on a number of simple algorithmic tasks. The RL-NTM succeeds on problems
such as copying an input several times to the output tape (the “repeat copy” task from Graves et al.
(2014b)), reversing a sequence, and a few more tasks of comparable difficulty. However, its success is
highly dependent on the architecture of the “controller”. We discuss this in more detail in Section 8.
Work done while the author was at Google.
Both authors contributed equally to this work.


Finally, we found it non-trivial to correctly implement the RL-NTM due to its large number of interacting
components. We developed a simple procedure to numerically check the gradients of the Reinforce
algorithm (Section 5). The procedure can be applied to problems unrelated to NTMs, and is of
independent interest. The code for this work can be found at


Many difficult tasks require a prolonged, multi-step interaction with an external environment. Examples
of such environments include computer games (Mnih et al., 2013), the stock market, an advertisement
system, or the physical world (Levine et al., 2015). A model can observe a partial state from the
environment, and influence the environment through its actions. This is seen as a general reinforcement
learning problem. However, our setting departs from classical RL in that we have the freedom to design
the tools available to solve a given problem. Tools might cooperate with the model (e.g., backpropagation
through memory), and the tools specify the actions over the environment. We formalize this concept
under the name Interface–Controller interaction.
The external environment is exposed to the model through a number of Interfaces, each with its own
API. For instance, a human perceives the world through its senses, which include the vision Interface
and the touch Interface. The touch Interface provides methods for contracting the various muscles, and
methods for sensing the current state of the muscles, pain level, temperature and a few others. In this
work, we explore a number of simple Interfaces that allow the controller to access an input tape, a
memory tape, and an output tape.
The part of the model that communicates with Interfaces is called the Controller, which is the only part
of the system that learns. The Controller can have prior knowledge about the behavior of its Interfaces,
but this is not the case in our experiments. The Controller learns to interact with Interfaces in a way that
allows it to solve a given task. Fig. 1 illustrates the complete Interfaces–Controller abstraction.

[Figure 1 diagram. Left panel, "An abstract Interface–Controller model": a Controller with Past State and Future State; its Controller Input is read from the Input, Output, and Memory Interfaces, and its Controller Output is written to them. Right panel, "Our model as an Interface–Controller": an LSTM with Past State and Future State that reads the Current Input and Current Memory, and emits an input position increment, a target prediction with a decision to output a symbol or not, a memory address increment, and a new memory value vector.]

Figure 1: (Left) The Interface–Controller abstraction, (Right) an instantiation of our model as an Interface–
Controller. The bottom boxes are the read methods, and the top are the write methods. The RL–NTM makes
discrete decisions regarding the move over the input tape, the memory tape, and whether to make a prediction at a
given timestep. During training, the model’s prediction is compared with the desired output, and is used to train the
model when the RL-NTM chooses to advance its position on the output tape; otherwise it is ignored. The memory
value vector is a vector of content that is stored in the memory cell.

We now describe the RL–NTM. As a controller, it uses either an LSTM or a direct access LSTM (see
sec. 8.1 for a definition). It has a one-dimensional input tape, a one-dimensional memory, and a
one-dimensional output tape as Interfaces. Both the input tape and the memory tape have a head that reads
the Tape’s content at the current location. The heads of the input tape and the memory tape can move in
either direction. However, the output tape is a write-only tape, and its head can either stay at the current
position or move forward. Fig. 2 shows an example execution trace for the entire RL–NTM on the
reverse task (sec. 6).
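
The tape Interfaces described above can be sketched as small data structures. Everything below is illustrative: the class names, the clamping of heads at the tape ends, and the `<eos>` string standing in for the end symbol ∅ are assumptions, and the controller is a hand-written rule rather than a learned policy. It only shows that head moves in {-1, 0, 1} plus a stay-or-advance output head are enough to express the reverse task.

```python
END = "<eos>"  # stands in for the end-of-sequence symbol

class ReadTape:
    """Readable tape whose head moves by -1, 0 or +1 per step."""
    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.position = 0

    def read(self):
        return self.symbols[self.position]

    def move(self, delta):
        if delta not in (-1, 0, 1):
            raise ValueError("head moves are restricted to -1, 0, +1")
        # Clamping at the ends is an assumption of this sketch.
        self.position = min(max(self.position + delta, 0), len(self.symbols) - 1)

class OutputTape:
    """Write-only tape whose head either stays (0) or advances (1)."""
    def __init__(self):
        self.symbols = []

    def step(self, prediction, advance):
        if advance not in (0, 1):
            raise ValueError("output head can only stay or advance")
        if advance == 1:
            self.symbols.append(prediction)

def reverse_with_tapes(symbols):
    """Hand-coded controller: walk right to the end marker, then emit leftwards."""
    tape = ReadTape(list(symbols) + [END])
    out = OutputTape()
    while tape.read() != END:          # phase 1: move right, predict nothing
        tape.move(+1)
    for _ in range(len(symbols)):      # phase 2: move left, emit each symbol
        tape.move(-1)
        out.step(tape.read(), advance=1)
    out.step(END, advance=1)
    return out.symbols

result = reverse_with_tapes("abc")
```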
At the core of the RL–NTM is an LSTM controller which receives multiple inputs and has to generate
multiple outputs at each timestep. Table 1 summarizes the controller’s inputs and outputs, and the
way in which the RL–NTM is trained to produce them. The objective function of the RL–NTM is
the expected log probability of the desired outputs, where the expectation is taken over all possible
sequences of actions, weighted by the probability of taking these actions. Both backpropagation and
Reinforce maximize this objective. Backpropagation maximizes the log probabilities of the model’s
predictions, while the Reinforce algorithm influences the probabilities of action sequences.


Figure 2: Execution of RL–NTM on the ForwardReverse task. At each timestep, the RL-NTM con-
sumes the value of the current input tape, the value of the current memory cell, and a representation
of all the actions that have been taken in the previous timestep (not marked on the figures). The RL-
NTM then outputs a new value for the current memory cell (marked with a star), a prediction for the
next target symbol, and discrete decisions for changing the positions of the heads on the various tapes.
The RL-NTM learns to make discrete decisions using the Reinforce algorithm, and learns to produce
continuous outputs using backpropagation.

The global objective can be written formally as:

\[
\sum_{[a_1, a_2, \ldots, a_n] \in A^\dagger} p_{\text{reinforce}}(a_1, a_2, \ldots, a_n \mid \theta) \Big[ \sum_{i=1}^{n} \log p_{\text{bp}}(y_i \mid x_1, \ldots, x_i, a_1, \ldots, a_i, \theta) \Big]
\]

$A^\dagger$ represents the space of sequences of actions that lead to the end of episode. The probabilities in the
above equation are parametrized with a neural network (the Controller). We have marked with $p_{\text{reinforce}}$
the part of the equation which is learned with Reinforce. $p_{\text{bp}}$ indicates the part of the equation optimized
with the classical backpropagation.

Interface     | Part    | Read                                                     | Write                               | Training Type
Input Tape    | Head    | window of values surrounding the current position        | distribution over [−1, 0, 1]        | Reinforce
Output Tape   | Head    | ∅                                                        | distribution over [0, 1]            | Reinforce
Output Tape   | Content | ∅                                                        | distribution over output vocabulary | Backpropagation
Memory Tape   | Head    | window of memory values surrounding the current address  | distribution over [−1, 0, 1]        | Reinforce
Memory Tape   | Content | (same read as the Memory Tape Head row)                  | vector of real values to store      | Backpropagation
Miscellaneous |         | all actions taken in the previous time step              | ∅                                   | ∅

Table 1: Summary of what the Controller reads at every time step, and what it has to produce. The
“Training Type” column indicates how the given part of the model is trained.

The RL–NTM receives a direct learning signal only when it decides to make a prediction. If it chooses to
not make a prediction at a given timestep, then it will not receive a direct learning signal. Theoretically,
we can allow the RL–NTM to run for an arbitrary number of steps without making any prediction,
hoping that after sufficiently many steps, it would decide to make a prediction. Doing so will also
provide the RL–NTM with arbitrary computational capability. However, this strategy is both unstable
and computationally infeasible. Thus, we resort to limiting the total number of computational steps to
a fixed upper bound, and force the RL–NTM to predict the next desired output whenever the number of
remaining desired outputs is equal to the number of remaining computational steps.
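
The forced-prediction rule in the paragraph above can be sketched as a small scheduling loop. The function name and the `wants_to_predict` callback (a stand-in for the controller's own discrete decision) are hypothetical; only the rule itself, forcing a prediction when the remaining outputs equal the remaining steps, comes from the text.

```python
def run_schedule(wants_to_predict, num_outputs, max_steps):
    """Return one flag per executed step saying whether a prediction was made."""
    if max_steps < num_outputs:
        raise ValueError("need at least one step per desired output")
    made = []
    remaining_outputs = num_outputs
    for step in range(max_steps):
        if remaining_outputs == 0:
            break  # every desired output has been predicted
        remaining_steps = max_steps - step
        # Force a prediction once there is no slack left.
        forced = remaining_outputs == remaining_steps
        predict = forced or wants_to_predict(step)
        made.append(predict)
        if predict:
            remaining_outputs -= 1
    return made

lazy = run_schedule(lambda step: False, num_outputs=3, max_steps=5)
eager = run_schedule(lambda step: True, num_outputs=3, max_steps=5)
```

A controller that never chooses to predict is still forced to emit all three outputs in the last three steps, so every episode yields a learning signal for each desired output.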


This work is most similar to the Neural Turing Machine (Graves et al., 2014b). The NTM is an
ambitious, computationally universal model that can be trained (or “automatically programmed”) with
the backpropagation algorithm using only input-output examples.
Following the introduction of the NTM, several other memory-based models have been introduced. All of
them can be seen as part of a larger community effort. These models are constructed according to the
Interface–Controller abstraction (Section 2).
Neural Turing Machine (NTM) (Graves et al., 2014a) has a modified LSTM as the Controller, and the
following three Interfaces: a sequential input, a delayed Output, and a differentiable Memory.


Weakly supervised Memory Network (Sukhbaatar et al., 2015) uses a feed forward network as the
Controller, and has a differentiable soft-attention Input, and Delayed Output as Interfaces.
Stack RNN (Joulin & Mikolov, 2015) has an RNN as the Controller, and the sequential input, a
differentiable memory stack, and sequential output as Interfaces. It also uses search to improve its performance.
Neural DeQue (Grefenstette et al., 2015) has an LSTM as the Controller, and a Sequential Input, a
differentiable Memory Queue, and the Sequential Output as Interfaces.
Our model fits into the Interfaces–Controller abstraction. It has a direct access LSTM as the Controller
(or LSTM or feed forward network), and its three interfaces are the Input Tape, the Memory Tape, and
the Output Tape. All three Interfaces of the RL–NTM are discrete and cannot be trained only with
backpropagation.
This prior work investigates continuous and differentiable Interfaces, while we consider discrete In-
terfaces. Discrete Interfaces are more challenging to train because backpropagation cannot be used.
However, many external Interfaces are inherently discrete, even though humans can easily use them
(apparently without using continuous backpropagation). For instance, one interacts with the Google
search engine with discrete actions. This work examines the possibility of learning models that interact
with discrete Interfaces with the Reinforce algorithm.
The Reinforce algorithm (Williams, 1992) is a classical RL algorithm, which has been applied to a
broad spectrum of planning problems (Peters & Schaal, 2006; Kohl & Stone, 2004; Aberdeen & Baxter,
2002). In addition, it has been applied in object recognition to implement visual attention (Mnih et al.,
2014; Ba et al., 2014). This work uses Reinforce to train an attention mechanism: we use it to train how
to access the various tapes provided to the model.
The RL–NTM can postpone prediction for an arbitrary number of timesteps, and in principle has access
to an unbounded memory. As a result, the RL-NTM is Turing complete in principle. There have been
very few prior models that are Turing complete (Schmidhuber, 2012; 2004). Although our model is
Turing complete, it is not very powerful because it is very difficult to train, and our model can solve
only relatively simple problems. Moreover, the RL–NTM does not exploit Turing completeness, as
none of the tasks that it solves requires superlinear runtime.


Let $A$ be a space of actions, and $A^\dagger$ be a space of all sequences of actions that cause an episode to end
(so $A^\dagger \subset A^*$). An action at time-step $t$ is denoted by $a_t$. We denote time at the end of episode by $T$ (this
is not completely formal as some episodes can vary in time). Let $a_{1:t}$ stand for a sequence of actions
$[a_1, a_2, \ldots, a_t]$. Let $r(a_{1:t})$ denote the reward achieved at time $t$, having executed the sequence of
actions $a_{1:t}$, and $R(a_{1:T})$ is the cumulative reward, namely $R(a_{k:T}) = \sum_{t=k}^{T} r(a_{1:t})$. Let $p_\theta(a_t \mid a_{1:(t-1)})$
be a parametric conditional probability of an action $a_t$ given all previous actions $a_{1:(t-1)}$. Finally, $p_\theta$ is
a policy parametrized by $\theta$.
This work relies on learning discrete actions with the Reinforce algorithm (Williams, 1992). We now
describe this algorithm in detail. Moreover, the supplementary materials include descriptions of tech-
niques for reducing variance of the gradient estimators.
The goal of reinforcement learning is to maximize the sum of future rewards. The Reinforce algorithm
(Williams, 1992) does so directly by optimizing the parameters of the policy pθ (at |a1:(t−1) ). Reinforce
follows the gradient of the sum of the future rewards. The objective function for episodic reinforce can
be expressed as the sum over all sequences of valid actions that cause the episode to end:

\[
J(\theta) = \sum_{[a_1, a_2, \ldots, a_T] \in A^\dagger} p_\theta(a_1, a_2, \ldots, a_T) R(a_1, a_2, \ldots, a_T) = \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) R(a_{1:T})
\]

This sum iterates over all possible action sequences. This set is usually exponentially large or even infinite,
so the sum cannot be computed exactly and cheaply for most problems. However, it can be written as


an expectation, which can be approximated with an unbiased estimator. We have that:

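\[
\begin{aligned}
J(\theta) &= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) R(a_{1:T}) \\
&= E_{a_{1:T} \sim p_\theta} \sum_{t=1}^{T} r(a_{1:t}) \\
&= E_{a_1 \sim p_\theta(a_1)} E_{a_2 \sim p_\theta(a_2 \mid a_1)} \cdots E_{a_T \sim p_\theta(a_T \mid a_{1:(T-1)})} \sum_{t=1}^{T} r(a_{1:t})
\end{aligned}
\]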

The last expression suggests a procedure to estimate J(θ): simply sequentially sample each at from
the model distribution pθ (at |a1:(t−1) ) for t from 1 to T . The unbiased estimator of J(θ) is the sum of
r(a1:t ). This gives us an algorithm to estimate J(θ). However, the main interest is in training a model
to maximize this quantity.
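
The sampling procedure just described can be sketched on a toy problem. The setup is an assumption made for illustration: a tabular softmax policy with one row of logits per time step (so it ignores the action history), fixed-length episodes, and a reward of 1 whenever the sampled action matches a hypothetical `TARGET` sequence. For this toy case $J(\theta)$ has a closed form, which lets the Monte Carlo estimate be compared against it.

```python
import math
import random

TARGET = [1, 0]  # hypothetical rewarded action at each time step

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(actions):
    """r(a_{1:t}): 1 if the latest action matches the target for its step."""
    return 1.0 if actions[-1] == TARGET[len(actions) - 1] else 0.0

def sample_return(theta, rng):
    """Sample a_1..a_T in order and return the sum of rewards."""
    actions, total = [], 0.0
    for logits in theta:
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(action)
        total += reward(actions)
    return total

def estimate_J(theta, num_samples, rng):
    """Unbiased Monte Carlo estimate of J(theta)."""
    return sum(sample_return(theta, rng) for _ in range(num_samples)) / num_samples

def exact_J(theta):
    """Closed form, valid only for this toy policy and reward."""
    return sum(softmax(logits)[TARGET[t]] for t, logits in enumerate(theta))

theta = [[0.2, -0.1], [0.0, 0.5]]
estimate = estimate_J(theta, 20000, random.Random(0))
```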
The Reinforce algorithm maximizes $J(\theta)$ by following its gradient:
\[
\partial_\theta J(\theta) = \sum_{a_{1:T} \in A^\dagger} \big[ \partial_\theta p_\theta(a_{1:T}) \big] R(a_{1:T})
\]

However, the above expression is a sum over the set of possible action sequences, so it cannot be
computed directly for most $A^\dagger$. Once again, the Reinforce algorithm rewrites this sum as an expectation
that is approximated with sampling. It relies on the identity
\[
\partial_\theta f(\theta) = f(\theta) \frac{\partial_\theta f(\theta)}{f(\theta)} = f(\theta) \, \partial_\theta [\log f(\theta)].
\]
This identity is valid as long as $f(\theta) \neq 0$. As typical neural network parametrizations of distributions
assign non-zero probability to every action, this condition holds for $f = p_\theta$. We have that:
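\[
\begin{aligned}
\partial_\theta J(\theta) &= \sum_{a_{1:T} \in A^\dagger} \big[ \partial_\theta p_\theta(a_{1:T}) \big] R(a_{1:T}) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) \big[ \partial_\theta \log p_\theta(a_{1:T}) \big] R(a_{1:T}) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a_{1:T}) \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \Big] R(a_{1:T}) \\
&= E_{a_1 \sim p_\theta(a_1)} E_{a_2 \sim p_\theta(a_2 \mid a_1)} \cdots E_{a_T \sim p_\theta(a_T \mid a_{1:(T-1)})} \Big[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \Big] \Big[ \sum_{t=1}^{T} r(a_{1:t}) \Big]
\end{aligned}
\]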

The last expression gives us an algorithm for estimating $\partial_\theta J(\theta)$. We have sketched it on the left side
of Figure 3. It is easiest to describe with respect to the computational graph behind a neural network.
Reinforce can be implemented as follows. A neural network outputs $l_t = \log p_\theta(a_t \mid a_{1:(t-1)})$. Sequentially
sample action $a_t$ from the distribution $e^{l_t}$, and execute the sampled action $a_t$. Simultaneously,
experience a reward $r(a_{1:t})$. Backpropagate the sum of the rewards $\sum_{t=1}^{T} r(a_{1:t})$ to every node
$\partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})$.
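
The estimator described above can be sketched without a neural network. The example assumes a tabular softmax policy with one row of logits per time step and a hypothetical `TARGET`-matching reward, so that the score $\partial_\theta \log p_\theta(a_t)$ with respect to the logits is simply the one-hot vector of the sampled action minus the probabilities. These choices are illustrative and are not the RL–NTM itself; the closed-form gradient holds only for this toy case.

```python
import math
import random

TARGET = [1, 0]  # hypothetical rewarded action at each time step

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(actions):
    return 1.0 if actions[-1] == TARGET[len(actions) - 1] else 0.0

def reinforce_gradient(theta, num_samples, rng):
    """Average of (sum_t score_t) * (sum_t r_t) over sampled episodes."""
    grad = [[0.0] * len(row) for row in theta]
    for _ in range(num_samples):
        actions, scores, total = [], [], 0.0
        for logits in theta:
            probs = softmax(logits)
            action = rng.choices(range(len(probs)), weights=probs)[0]
            actions.append(action)
            total += reward(actions)
            # d/d logits of log softmax(logits)[action] = onehot(action) - probs
            scores.append([(1.0 if k == action else 0.0) - p
                           for k, p in enumerate(probs)])
        # "Backpropagate" the total reward into every log-probability node.
        for t, row in enumerate(scores):
            for k, s in enumerate(row):
                grad[t][k] += s * total / num_samples
    return grad

def exact_gradient(theta):
    """Closed-form gradient of J for this toy policy and reward only."""
    grad = []
    for t, logits in enumerate(theta):
        probs = softmax(logits)
        hit = probs[TARGET[t]]
        grad.append([hit * ((1.0 if k == TARGET[t] else 0.0) - p)
                     for k, p in enumerate(probs)])
    return grad

theta = [[0.2, -0.1], [0.0, 0.5]]
estimated = reinforce_gradient(theta, 40000, random.Random(0))
exact = exact_gradient(theta)
```

Even on this two-step problem tens of thousands of samples are needed for a close match, which illustrates the high variance the text goes on to discuss.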
We have derived an unbiased estimator for the sum of future rewards, and the unbiased estimator of its
gradient. However, the derived gradient estimator has high variance, which makes learning difficult.
RL–NTM employs several techniques to reduce gradient estimator variance: (1) future rewards back-
propagation, (2) online baseline prediction, and (3) offline baseline prediction. All these techniques are
crucial to solve our tasks. We provide a detailed description of these techniques in the Supplementary material.
Finally, we needed a way of verifying the correctness of our implementation. We discovered a technique
that makes it possible to easily implement a gradient checker for nearly any model that uses Reinforce.
Section 5 describes this technique.

The RL–NTM is complex, so we needed to find an automated way of verifying the correctness of
our implementation. We discovered a technique that makes it possible to easily implement a gradient
checker for nearly any model that uses Reinforce. This discovery is an independent contribution of this


[Figure 3 panels: (Left) Reinforce; (Right) Gradient Checking of Reinforce]

Figure 3: Sketches of two algorithms: (Left) the Reinforce algorithm, (Right) gradient checking for the
Reinforce algorithm. Red marks the steps that must be overridden to turn Reinforce into its
gradient checker.

work. This Section describes gradient checking for any implementation of the Reinforce algorithm
that uses a general function for sampling from a multinomial distribution.
Reinforce gradient verification should ensure that the expected gradient over all sequences of actions
matches the numerical derivative of the expected objective. However, even for a tiny problem, we would
need to draw billions of samples to achieve estimates accurate enough to state whether there is a match or a
mismatch. Instead, we developed a technique which avoids sampling, and allows for gradient verification
of Reinforce within seconds on a laptop.
First, we have to reduce the size of our task to make sure that the number of possible actions is
manageable (e.g., < 10^4). This is similar to conventional gradient checkers, which can only be applied
to small models. Next, we enumerate all possible sequences of actions that terminate the episode. By
definition, these are precisely all the elements of $A^\dagger$.
The key idea is the following: we replace the sampling function, which turns a multinomial distribution
into a random sample, with a deterministic function that chooses actions from an
appropriate action sequence from $A^\dagger$, while accumulating their probabilities. Calling the modified
sampler repeatedly produces every possible action sequence from $A^\dagger$ exactly once.
For efficiency, it is desirable to use a single minibatch whose size is $\#A^\dagger$. The sampling function needs
to be adapted so that it incrementally outputs the appropriate sequence from $A^\dagger$ as we
repeatedly call the sampling function. At the end of the minibatch, the sampling function will have
access to the total probability of each action sequence ($\prod_t p_\theta(a_t \mid a_{1:t-1})$), which in turn can be used to
exactly compute $J(\theta)$ and its derivative. To compute the derivative, the Reinforce gradient produced by
each sequence $a_{1:T} \in A^\dagger$ should be weighted by its probability $p_\theta(a_{1:T})$. We summarize this procedure
in Figure 3.
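
The procedure above can be sketched on a toy problem. The assumptions here are a tabular softmax policy with one row of logits per step, fixed-length episodes (so $A^\dagger$ is every length-$T$ sequence), and a hypothetical reward for alternating actions. The class and function names are illustrative. The sampler is replaced by an object that replays a prescribed sequence while accumulating its probability, and the probability-weighted Reinforce gradient is compared with a central finite difference of the exactly computed $J(\theta)$.

```python
import itertools
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(actions):
    """Hypothetical reward: 1 whenever the action differs from the previous one."""
    return 1.0 if len(actions) >= 2 and actions[-1] != actions[-2] else 0.0

class EnumeratingSampler:
    """Replaces random sampling: replays a fixed sequence, tracks its probability."""
    def __init__(self, sequence):
        self.sequence = list(sequence)
        self.position = 0
        self.probability = 1.0

    def sample(self, probs):
        action = self.sequence[self.position]
        self.position += 1
        self.probability *= probs[action]
        return action

def run_episode(theta, sampler):
    """Ordinary Reinforce episode; returns (total reward, per-episode gradient)."""
    actions, scores, total = [], [], 0.0
    for logits in theta:
        probs = softmax(logits)
        action = sampler.sample(probs)
        actions.append(action)
        total += reward(actions)
        scores.append([(1.0 if k == action else 0.0) - p
                       for k, p in enumerate(probs)])
    return total, [[s * total for s in row] for row in scores]

def exact_J_and_gradient(theta):
    """Enumerate every action sequence once, weighting by its probability."""
    num_actions = len(theta[0])
    J = 0.0
    grad = [[0.0] * num_actions for _ in theta]
    for seq in itertools.product(range(num_actions), repeat=len(theta)):
        sampler = EnumeratingSampler(seq)
        total, g = run_episode(theta, sampler)
        J += sampler.probability * total
        for t in range(len(theta)):
            for k in range(num_actions):
                grad[t][k] += sampler.probability * g[t][k]
    return J, grad

def numerical_gradient(theta, h=1e-5):
    """Central finite difference of the exact objective."""
    grad = [[0.0] * len(row) for row in theta]
    for t in range(len(theta)):
        for k in range(len(theta[t])):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[t][k] += h
            minus[t][k] -= h
            grad[t][k] = (exact_J_and_gradient(plus)[0]
                          - exact_J_and_gradient(minus)[0]) / (2 * h)
    return grad

theta = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
_, analytic = exact_J_and_gradient(theta)
numeric = numerical_gradient(theta)
max_error = max(abs(a - n) for ra, rn in zip(analytic, numeric)
                for a, n in zip(ra, rn))
```

Because `run_episode` is the same code path that training would use with a random sampler, agreement here checks the Reinforce implementation itself rather than a separate re-derivation.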
The gradient checking is critical for ensuring the correctness of our implementation. While the basic
reinforce algorithm is conceptually simple, the RL–NTM is fairly complicated, as reinforce is used
to train several Interfaces of our model. Moreover, the RL–NTM uses three separate techniques for
reducing the variance of the gradient estimators. The model’s high complexity greatly increases the
probability of a code error. In particular, our early implementations were incorrect, and we were able to
fix them only after implementing gradient checking.


This section defines tasks used in the experiments. Figure 4 shows exemplary instantiations of our tasks.
Table 2 summarizes the Interfaces that are available for each task.


Task              Input Tape   Memory Tape
Copy              ✓            ×
DuplicatedInput   ✓            ×
Reverse           ✓            ×
RepeatCopy        ✓            ×
ForwardReverse    ×            ✓

Table 2: This table marks the available Interfaces for each task. The difficulty of a task is dependent on
the type of Interfaces available to the model.

[Figure 4 panels: Copy, DuplicatedInput, Reverse, RepeatCopy, ForwardReverse]

Figure 4: This figure presents the initial state for every task. The yellow box indicates the starting
position of the reading head over the Input Interface. The gray characters on the Output Tape represent
the target symbols. Our tasks involve reordering symbols, and the symbols $x_i$ have been picked
uniformly from a set of size 30.
Copy. A generic input is $x_1 x_2 x_3 \ldots x_C \emptyset$ and the desired output is
$x_1 x_2 \ldots x_C \emptyset$. Thus the goal is to repeat the input. The length of the input sequence
is variable and is allowed to change. The input sequence and the desired output both terminate with a
special end-of-sequence symbol $\emptyset$.
DuplicatedInput. A generic input has the form
$x_1 x_1 x_1 x_2 x_2 x_2 x_3 \ldots x_{C-1} x_C x_C x_C \emptyset$ while the desired output is
$x_1 x_2 x_3 \ldots x_C \emptyset$. Thus each input symbol is replicated three times, so the RL-NTM
must emit every third input symbol.
Reverse. A generic input is $x_1 x_2 \ldots x_{C-1} x_C \emptyset$ and the desired output is
$x_C x_{C-1} \ldots x_2 x_1 \emptyset$.
RepeatCopy. A generic input is $m x_1 x_2 x_3 \ldots x_C \emptyset$ and the desired output is
$x_1 x_2 \ldots x_C x_1 \ldots x_C x_1 \ldots x_C \emptyset$, where the number of copies is given by $m$.
Thus the goal is to copy the input $m$ times, where $m$ can be only 2 or 3.
ForwardReverse. The task is identical to Reverse, but the RL-NTM is only allowed to move its input
tape pointer forward. This means that a perfect solution must use the NTM's external memory.
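
As an illustration of the task definitions above, the following sketch generates input/target pairs. The function name `make_instance`, the integer encoding of symbols (1 to 30) and the use of 0 for the end-of-sequence symbol are assumptions made for this example only.

```python
import random

END = 0  # stands in for the end-of-sequence symbol


def make_instance(task, length, rng, num_symbols=30):
    """Return (input_tape, target_output) for one instance of the given task."""
    xs = [rng.randint(1, num_symbols) for _ in range(length)]
    if task == "Copy":
        return xs + [END], xs + [END]
    if task == "DuplicatedInput":
        tripled = [x for x in xs for _ in range(3)]
        return tripled + [END], xs + [END]
    if task in ("Reverse", "ForwardReverse"):
        # Same data for both; ForwardReverse differs only in allowed head moves.
        return xs + [END], xs[::-1] + [END]
    if task == "RepeatCopy":
        m = rng.choice([2, 3])
        return [m] + xs + [END], xs * m + [END]
    raise ValueError(f"unknown task: {task}")
```

For example, `make_instance("Reverse", 5, random.Random(0))` returns a five-symbol input followed by the end marker, and the same symbols in reverse order as the target.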


Humans and animals learn much better when the examples are not randomly presented but organized in a
meaningful order which illustrates gradually more concepts, and gradually more complex ones. . . . and
call them “curriculum learning”.
Bengio et al. (2009)

We were unable to solve tasks with RL–NTM by training it on the difficult instances of the problems
(where difficult usually means long). To succeed, we had to create a curriculum of tasks of increasing
complexity. We verified that our tasks were completely unsolvable (in an all-or-nothing sense) for
all but the shortest sequences when we did not use a curriculum. In our experiments, we measure
the complexity c of a problem instance by the maximal length of the desired output to typical inputs.
During training, we maintain a distribution over the task complexity. We shift the distribution over the
task complexities whenever the performance of the RL–NTM exceeds a threshold. Then, our model
focuses on more difficult problem instances as its performance improves.

Probability   Procedure to pick complexity d
10%           d is drawn uniformly at random from the possible task complexities
25%           d is drawn uniformly from [1, c + e]
65%           d = c + e

Table 3: The curriculum learning distribution, indexed by c. Here e is a sample from a geometric
distribution whose success probability is 1/2, i.e., $p(e = k) = 1/2^k$.
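
A minimal sketch of sampling from the distribution in Table 3 follows. The function names, the assumption that the smallest complexity is 1, and the clipping of the result to `max_complexity` are choices made for this illustration and are not stated in the text.

```python
import random


def sample_geometric(rng):
    """Sample e with p(e = k) = 1 / 2**k for k = 1, 2, ..."""
    k = 1
    while rng.random() >= 0.5:
        k += 1
    return k


def sample_complexity(c, max_complexity, rng):
    """Draw a task complexity d given the current curriculum index c."""
    u = rng.random()
    if u < 0.10:
        d = rng.randint(1, max_complexity)           # 10%: any complexity
    elif u < 0.35:
        d = rng.randint(1, c + sample_geometric(rng))  # 25%: uniform on [1, c + e]
    else:
        d = c + sample_geometric(rng)                # 65%: slightly beyond c
    return min(d, max_complexity)  # clipping is an assumption of this sketch
```

Most of the mass sits just above the current index c, while the first branch keeps some probability on every difficulty level, including the hardest ones.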


The distribution over task complexities is indexed with an integer c, and is defined in Table 3. While
we have not tuned the coefficients in the curriculum learning setup, we experimentally verified that it is
critical to always maintain non-negligible mass over the hardest difficulty levels (Zaremba & Sutskever,
2014). Removing it makes the curriculum much less effective.
Whenever the average zero-one loss (normalized by the length of the target sequence) of our RL–NTM
decreases below 0.2, we increase c by 1. We keep doing so until c reaches its maximal allowable value.
Finally, we enforce a refractory period to ensure that successive increments of c are separated by at
least 100 parameter updates, since we encountered situations where c increased in rapid succession,
which consistently caused learning to fail.
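
The threshold rule and the refractory period can be sketched as a small state machine. The class name `Curriculum` and its interface are hypothetical, and the choice to make the first increment wait a full refractory period as well is an assumption of this sketch.

```python
class Curriculum:
    """Tracks the curriculum index c; call step() once per parameter update."""

    def __init__(self, max_complexity, threshold=0.2, refractory=100, start=1):
        self.c = start
        self.max_complexity = max_complexity
        self.threshold = threshold
        self.refractory = refractory
        self.updates_since_increase = 0

    def step(self, avg_loss):
        """avg_loss: running average of the length-normalized zero-one loss."""
        self.updates_since_increase += 1
        ready = self.updates_since_increase >= self.refractory
        if avg_loss < self.threshold and self.c < self.max_complexity and ready:
            self.c += 1
            self.updates_since_increase = 0  # start a new refractory period
        return self.c
```

With the default settings, a model whose loss stays below 0.2 advances at most once every 100 updates, which avoids the rapid successive increments described above.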


The success of reinforcement learning training depends heavily on the complexity of the controller and
on its ease of training. It is common either to limit the number of parameters of the network, or to
constrain it by initializing from a model pretrained on some other task (for instance, an object
recognition network for robotics). Ideally, models should be generic enough not to need such "tricks".
However, some tasks still require building task-specific architectures.

Figure 6: The direct access controller.

Figure 5: LSTM as a controller.

This work considers two controllers. The first is an LSTM (Fig. 5), and the second is a direct access
controller (Fig. 6). The LSTM is a generic controller that in principle should be powerful enough to
solve any of the considered tasks. However, it has trouble solving many of them. The direct access
controller is a much better fit for symbol rearrangement tasks, but it is not a generic solution.


All the tasks that we consider involve rearranging the input symbols in some way. For example, a
typical task is to reverse a sequence (section 6 lists the tasks). For such tasks, the controller would
benefit from a built-in mechanism for directly copying an appropriate input to memory and to the output.
Such a mechanism would free the LSTM controller from remembering the input symbol in its control
variables (“registers”), and would shorten the backpropagation paths and therefore make learning easier.
We implemented this mechanism by adding the input to the memory and the output, and also adding
the memory to the output and to the adjacent memories (Figure 6), while modulating each of these
additive contributions by a dynamic scalar (a sigmoid) which is computed from the controller's state.
This way, the controller can decide to effectively not add the current input to the output at a given
timestep.
Unfortunately the necessity of this architectural modification is a drawback of our implementation,
since it is not domain independent and would therefore not improve the performance of the RL–NTM
on many tasks of interest.
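
The gated additive connection described above can be sketched for a single path (for example, input to output). This is only an illustration of the gating idea: the function names are hypothetical, and how the gate logit is computed from the controller's state is not specified here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_add(target, source, gate_logit):
    """Add `source` to `target`, scaled by a scalar gate in (0, 1).

    `gate_logit` is assumed to be computed elsewhere from the controller state.
    """
    g = sigmoid(gate_logit)
    return [t + g * s for t, s in zip(target, source)]
```

A strongly negative `gate_logit` leaves `target` essentially unchanged, which corresponds to the controller deciding not to copy the current input at that timestep.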

Task              LSTM   Direct Access
Copy              ✓      ✓
DuplicatedInput   ✓      ✓
Reverse           ×      ✓
ForwardReverse    ×      ✓
RepeatCopy        ×      ✓

Table 4: Success of training on various tasks for a given controller.



We present results of training the RL–NTM on all aforementioned tasks. The main drawback of our
experiments is the lack of comparison to other models. However, the tasks that we consider have
to be considered in conjunction with the available Interfaces, and other models haven't been considered
with the same set of Interfaces. The statement "this model solves addition" is difficult to assess, as the
way that digits are delivered defines the task difficulty.
The closest model to ours is the NTM, and the shared task that they consider is copying. We are able to
generalize with copying to an arbitrary length. However, our Interfaces make this task very simple.
Table 4 summarizes the results.
We trained our model using SGD with a fixed learning rate of 0.05 and a fixed momentum of 0.9. We
used a batch of size 200, which we found to work better than smaller batch sizes (such as 50 or 20).
We normalized the gradient by batch size but not by sequence length. We independently clip the norm
of the gradients w.r.t. the RL-NTM parameters to 5, and the gradient w.r.t. the baseline network to 2.
We initialize the RL–NTM controller and the baseline model using a Gaussian with standard deviation
0.1. We used an inverse temperature of 0.01 for the different action distributions. Doing so reduced
the effective learning rate of the Reinforce derivatives. The memory consists of 35 real values through
which we backpropagate. The initial memory state and the controller’s initial hidden states were set to
the zero vector.
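
The clipping and update rules in the paragraph above can be sketched as follows. The text does not say which momentum formulation was used, so the classical form below is an assumption, as are the function names.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)


def sgd_momentum_step(params, velocity, grad, lr=0.05, momentum=0.9):
    """One classical momentum update; returns (new_params, new_velocity)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Following the text, the model gradient would be passed through `clip_by_norm(grad, 5)` and the baseline-network gradient through `clip_by_norm(grad, 2)`, each independently, before its own update.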


Figure 7: (Left) Trace of ForwardReverse solution, (Right) trace of RepeatInput. The vertical axis
depicts execution time. The rows show the input pointer, output pointer, and memory pointer (with the ∗
symbol) at each step of the RL-NTM's execution. Note that we represent the set {1, . . . , 30} with 30
distinct symbols, and lack of prediction with #.

The ForwardReverse task is particularly interesting. In order to solve the problem, the RL–NTM has
to move to the end of the sequence without making any predictions. While doing so, it has to store
the input sequence into its memory (encoded in real values), and use its memory when reversing the
sequence (Fig. 7).
We have also experimented with a number of additional tasks but with less empirical success. Tasks we
found to be too difficult include sorting and long integer addition (in base 3 for simplicity), and
RepeatCopy when the input tape is forced to only move forward. While we were able to achieve reasonable
performance on the sorting task, the RL–NTM learned an ad-hoc algorithm and made excessive use of
its controller memory in order to sort the sequence.
Empirically, we found all the components of the RL-NTM essential to successfully solving these
problems. All our tasks are either solvable in under 20,000 parameter updates or fail regardless of the
number of updates. We were completely unable to solve RepeatCopy, Reverse, and ForwardReverse with the
LSTM controller, but with the direct access controller we succeeded. Moreover, we were also unable to
solve any of these problems at all without a curriculum (except for short sequences of length 5). We
present more traces for our tasks in the supplementary material (together with failure traces).


We have shown that the Reinforce algorithm is capable of training an NTM-style model to solve very
simple algorithmic problems. While the Reinforce algorithm is very general and is easily applicable to
a wide range of problems, it seems that learning memory access patterns with Reinforce is difficult.
Our gradient checking procedure for Reinforce can be applied to a wide variety of implementations. We
also found it extremely useful: without it, we had no way of being sure that our gradient was correct,
which made debugging and tuning much more difficult.

We thank Christopher Olah for the LSTM figure that has been used in the paper, and Tencia Lee for
revising the paper.

Aberdeen, Douglas and Baxter, Jonathan. Scaling internal-state policy-gradient methods for pomdps. In MACHINE

Ba, Jimmy, Mnih, Volodymyr, and Kavukcuoglu, Koray. Multiple object recognition with visual attention. arXiv
preprint arXiv:1412.7755, 2014.

Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings
of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014a.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014b.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with
unbounded memory. arXiv preprint arXiv:1506.02516, 2015.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv
preprint arXiv:1503.01007, 2015.

Kohl, Nate and Stone, Peter. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics
and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, volume 3, pp. 2619–
2624. IEEE, 2004.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies.
arXiv preprint arXiv:1504.00702, 2015.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and
Riedmiller, Martin. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural
Information Processing Systems, pp. 2204–2212, 2014.

Peters, Jan and Schaal, Stefan. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006
IEEE/RSJ International Conference on, pp. 2219–2225. IEEE, 2006.

Schmidhuber, Juergen. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118, 2012.

Schmidhuber, Jürgen. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. Weakly supervised memory networks.
arXiv preprint arXiv:1503.08895, 2015.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine learning, 8(3-4):229–256, 1992.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.



We present here several techniques to decrease the variance of the gradient estimate for Reinforce.
We have employed all of these tricks in our RL–NTM implementation.
We expand the notation introduced in Sec. 4. Let $A^\ddagger$ denote all valid subsequences of actions
(i.e. $A^\ddagger \subset A^\dagger \subset A^*$). Moreover, we define the set of sequences of actions
that are valid after executing a sequence $a_{1:t}$, and that terminate. We denote this set by
$A^\dagger_{a_{1:t}}$. Every sequence $a_{(t+1):T} \in A^\dagger_{a_{1:t}}$ terminates an episode.


Actions at time t cannot possibly influence rewards obtained in the past, because the past rewards are
caused by actions prior to them. This idea allows us to derive an unbiased estimator of
$\partial_\theta J(\theta)$ with lower variance. Here, we formalize it:
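$$
\begin{aligned}
\partial_\theta J(\theta)
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \left[ \partial_\theta \log p_\theta(a) \right] R(a) \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a) \left[ \partial_\theta \log p_\theta(a) \right]
   \left[ \sum_{t=1}^{T} r(a_{1:t}) \right] \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a)
   \left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t}) \right] \\
&= \sum_{a_{1:T} \in A^\dagger} p_\theta(a)
   \left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t})
   + \partial_\theta \log p_\theta(a_{(t+1):T} \mid a_{1:t}) \, r(a_{1:t}) \right] \\
&= \sum_{a_{1:T} \in A^\dagger}
   \left[ \sum_{t=1}^{T} p_\theta(a_{1:t}) \, \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t})
   + p_\theta(a) \, \partial_\theta \log p_\theta(a_{(t+1):T} \mid a_{1:t}) \, r(a_{1:t}) \right] \\
&= \sum_{a_{1:T} \in A^\dagger}
   \left[ \sum_{t=1}^{T} p_\theta(a_{1:t}) \, \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t})
   + p_\theta(a_{1:t}) \, r(a_{1:t}) \, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \right] \\
&= \sum_{a_{1:T} \in A^\dagger}
   \left[ \sum_{t=1}^{T} p_\theta(a_{1:t}) \, \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t}) \right]
   + \sum_{a_{1:T} \in A^\dagger}
   \left[ \sum_{t=1}^{T} p_\theta(a_{1:t}) \, r(a_{1:t}) \, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \right]
\end{aligned}
$$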

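We will show that the second sum on the right side of this equation is equal to zero. It is zero
because the future actions $a_{(t+1):T}$ do not influence the past rewards $r(a_{1:t})$. To formalize
this, we use the identity
$\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} p_\theta(a_{(t+1):T} \mid a_{1:t}) = 1$:

$$
\begin{aligned}
\sum_{a_{1:T} \in A^\dagger} \left[ \sum_{t=1}^{T}
  p_\theta(a_{1:t}) \, r(a_{1:t}) \, \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \right]
&= \sum_{a_{1:t} \in A^\ddagger} \left[ p_\theta(a_{1:t}) \, r(a_{1:t})
  \sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) \right] \\
&= \sum_{a_{1:t} \in A^\ddagger} p_\theta(a_{1:t}) \, r(a_{1:t}) \, \partial_\theta 1 = 0
\end{aligned}
$$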

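We can therefore drop the second sum from the equation for $\partial_\theta J(\theta)$:

$$
\begin{aligned}
\partial_\theta J(\theta)
&= \sum_{a_{1:T} \in A^\dagger} \left[ \sum_{t=1}^{T}
   p_\theta(a_{1:t}) \, \partial_\theta \log p_\theta(a_{1:t}) \, r(a_{1:t}) \right] \\
&= \mathbb{E}_{a_1 \sim p_\theta(a)} \mathbb{E}_{a_2 \sim p_\theta(a \mid a_1)} \cdots
   \mathbb{E}_{a_T \sim p_\theta(a \mid a_{1:(T-1)})}
   \left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
   \sum_{i=t}^{T} r(a_{1:i}) \right]
\end{aligned}
$$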


The last line of the derived equations describes the learning algorithm. This can be implemented as
follows. A neural network outputs $l_t = \log p_\theta(a_t \mid a_{1:(t-1)})$. We sequentially sample
action $a_t$ from the distribution $e^{l_t}$, and execute the sampled action $a_t$. Simultaneously, we
experience a reward $r(a_{1:t})$. We should backpropagate to the node
$\partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})$ the sum of rewards starting from time step $t$:
$\sum_{i=t}^{T} r(a_{1:i})$. The only difference in comparison to the initial algorithm is that we
backpropagate the sum of rewards starting from the current time step, instead of the sum of rewards
over the entire episode.
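
The quantity backpropagated at each step, the sum of rewards from the current time step onward, can be computed in one backward pass over the episode. The function name `rewards_to_go` is chosen for this sketch; `rewards[t]` stands for the reward received at the step with (zero-based) index t.

```python
def rewards_to_go(rewards):
    """Return a list whose entry t is the sum of rewards[t], ..., rewards[-1]."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out
```

Entry 0 equals the total episode reward used by the initial algorithm, and later entries exclude rewards that the corresponding action could not have influenced.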


Online baseline prediction rests on the idea that the importance of a reward is determined by its
relation to other rewards. All the rewards could be shifted by a constant, and such a change should not
affect this relation; thus it should not influence the expected gradient. However, it could decrease
the variance of the gradient estimate.
The aforementioned shift is called the baseline, and it can be estimated separately for every time
step. We have that:

$$
\begin{aligned}
\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} p_\theta(a_{(t+1):T} \mid a_{1:t}) &= 1 \\
\sum_{a_{(t+1):T} \in A^\dagger_{a_{1:t}}} \partial_\theta p_\theta(a_{(t+1):T} \mid a_{1:t}) &= 0
\end{aligned}
$$

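We are allowed to subtract the above quantity (multiplied by $b_t$) from our estimate of the gradient
without changing its expected value:

$$
\partial_\theta J(\theta) =
\mathbb{E}_{a_1 \sim p_\theta(a)} \mathbb{E}_{a_2 \sim p_\theta(a \mid a_1)} \cdots
\mathbb{E}_{a_T \sim p_\theta(a \mid a_{1:(T-1)})}
\left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
\sum_{i=t}^{T} \left( r(a_{1:i}) - b_t \right) \right]
$$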

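The above statement holds for any sequence of $b_t$. We aim to find the sequence $b_t$ that yields the
lowest variance estimator of $\partial_\theta J(\theta)$. The variance of our estimator is:

$$
\begin{aligned}
\mathrm{Var} ={}&
\mathbb{E}_{a_1 \sim p_\theta(a)} \mathbb{E}_{a_2 \sim p_\theta(a \mid a_1)} \cdots
\mathbb{E}_{a_T \sim p_\theta(a \mid a_{1:(T-1)})}
\left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
\sum_{i=t}^{T} \left( r(a_{1:i}) - b_t \right) \right]^2 \\
&- \left[
\mathbb{E}_{a_1 \sim p_\theta(a)} \mathbb{E}_{a_2 \sim p_\theta(a \mid a_1)} \cdots
\mathbb{E}_{a_T \sim p_\theta(a \mid a_{1:(T-1)})}
\left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
\sum_{i=t}^{T} \left( r(a_{1:i}) - b_t \right) \right] \right]^2
\end{aligned}
$$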

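The second term doesn't depend on $b_t$, and the variance is always non-negative. It is sufficient to
minimize the first term. The first term is minimal when its derivative with respect to $b_t$ is zero.
This implies

$$
\begin{aligned}
\mathbb{E}_{a_1 \sim p_\theta(a)} \mathbb{E}_{a_2 \sim p_\theta(a \mid a_1)} \cdots
\mathbb{E}_{a_T \sim p_\theta(a \mid a_{1:(T-1)})}
\left[ \sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
\sum_{i=t}^{T} \left( r(a_{1:i}) - b_t \right) \right] &= 0 \\
\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})
\sum_{i=t}^{T} \left( r(a_{1:i}) - b_t \right) &= 0 \\
b_t = \frac{\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)}) \sum_{i=t}^{T} r(a_{1:i})}
           {\sum_{t=1}^{T} \partial_\theta \log p_\theta(a_t \mid a_{1:(t-1)})}
\end{aligned}
$$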

This gives us an estimate for a vector $b_t \in \mathbb{R}^{\#\theta}$. However, it is common to use a
single scalar $b_t \in \mathbb{R}$, and to estimate it as
$\mathbb{E}_{p_\theta(a_{t:T} \mid a_{1:(t-1)})} R(a_{t:T})$.
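
The claim that subtracting a baseline leaves the expected gradient unchanged can be checked numerically on a toy case. The sketch below assumes a single-step softmax policy (an assumption for illustration, not the RL–NTM policy) and computes the expectation of the score function times a constant baseline exactly, by summing over all actions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_baseline_term(logits, baseline):
    """Exact E_a[ d log p(a) / d logits * baseline ] for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        for k in range(len(logits)):
            dlogp = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p * dlogp * baseline
    return grad


term = expected_baseline_term([0.3, -1.2, 2.0], baseline=7.5)
```

Every component of `term` is zero up to floating-point rounding, so the baseline changes only the variance of a sampled estimate and not its mean.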


The Reinforce algorithm works much better whenever it has accurate baselines. A separate LSTM can
help in the baseline estimation. First, run the baseline LSTM on the entire input tape to produce a vector
summarizing the input. Next, continue running the baseline LSTM in tandem with the controller LSTM,
so that the baseline LSTM receives precisely the same inputs as the controller LSTM, and outputs
a baseline $b_t$ at each timestep $t$. The baseline LSTM is trained to minimize
$\sum_{t=1}^{T} \left[ R(a_{t:T}) - b_t \right]^2$ (Fig. 8). This technique introduces a biased
estimator; however, it works well in practice.

Figure 8: The baseline LSTM computes a baseline $b_t$ for every computational step $t$ of the RL-NTM.
The baseline LSTM receives the same inputs as the RL-NTM, and it computes a baseline $b_t$ for time
$t$ before observing the chosen actions of time $t$. However, it is important to first provide the
baseline LSTM with the entire input tape as a preliminary input, because doing so allows the baseline
LSTM to accurately estimate the true difficulty of a given problem instance and therefore compute better
baselines. For example, if a problem instance is unusually difficult, then we expect $R_1$ to be large
and negative. If the baseline LSTM is given the entire input tape as an auxiliary input, it could
compute an appropriately large and negative $b_1$.

We found it important to first have the baseline LSTM go over the entire input before computing the
baselines bt . It is especially beneficial whenever there is considerable variation in the difficulty of the
examples. For example, if the baseline LSTM can recognize that the current instance is unusually
difficult, it can output a large negative value for $b_1$ in anticipation of a large and negative $R_1$. In
general, it is cheap and therefore worthwhile to provide the baseline network with all of the available
information, even if this information would not be available at test time, because the baseline network
is not needed at test time.


We present several execution traces of the RL–NTM. Each figure shows execution traces of the trained
RL-NTM on each of the tasks. The first row shows the input tape and the desired output, while each
subsequent row shows the RL-NTM’s position on the input tape and its prediction for the output tape.
In these examples, the RL-NTM solved each task perfectly, so the predictions made in the output tape
perfectly match the desired outputs listed in the first row.


An RL-NTM successfully solving a small instance of the Reverse problem (where the external
memory is not used).

An RL-NTM successfully solving a small instance of the ForwardReverse problem, where the
external memory is used.

An RL-NTM successfully solving an instance of the RepeatCopy problem where the input is to be
repeated three times.

An example of a failure of the RepeatCopy task, where the input tape is only allowed to move forward.
The correct solution would have been to copy the input to the memory, and then solve the task using the
memory. Instead, the memory pointer is moving randomly.


A Neural Conversational Model

Oriol Vinyals VINYALS@GOOGLE.COM

arXiv:1506.05869v3 [cs.CL] 22 Jul 2015

Abstract

Conversational modeling is an important task in natural language understanding and machine
intelligence. Although previous approaches exist, they are often restricted to specific domains
(e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple
approach for this task which uses the recently proposed sequence to sequence framework. Our model
converses by predicting the next sentence given the previous sentence or sentences in a conversation.
The strength of our model is that it can be trained end-to-end and thus requires far fewer hand-crafted
rules. We find that this straightforward model can generate simple conversations given a large
conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong
objective function, the model is able to converse well. It is able to extract knowledge from both a
domain-specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a
domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via
conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of
common sense reasoning. As expected, we also find that the lack of consistency is a common failure
mode of our model.

Proceedings of the 31st International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP
volume 37. Copyright 2015 by the author(s).

1. Introduction
Advances in end-to-end training of neural networks have led to remarkable progress in many domains
such as speech recognition, computer vision, and language processing. Recent work suggests that neural
networks can do more than just mere classification; they can be used to map complicated structures to
other complicated structures. An example of this is the task of mapping a sequence to another sequence,
which has direct applications in natural language understanding (Sutskever et al., 2014). The main
advantage of this framework is that it requires little feature engineering and domain specificity whilst
matching or surpassing state-of-the-art results. This advance, in our opinion, allows researchers to
work on tasks for which domain knowledge may not be readily available, or for tasks which are simply
too hard to design rules for manually.

Conversational modeling can directly benefit from this formulation because it requires mapping between
queries and responses. Due to the complexity of this mapping, conversational modeling has previously
been designed to be very narrow in domain, with a major undertaking on feature engineering. In this
work, we experiment with the conversation modeling task by casting it to a task of predicting the next
sequence given the previous sequence or sequences using recurrent networks (Sutskever et al., 2014). We
find that this approach can do surprisingly well on generating fluent and accurate replies to
conversations.

We test the model on chat sessions from an IT helpdesk dataset of conversations, and find that the
model can sometimes track the problem and provide a useful answer to the user. We also experiment with
conversations obtained from a noisy dataset of movie subtitles, and find that the model can hold a
natural conversation and sometimes perform simple forms of common sense reasoning. In both cases, the
recurrent nets obtain better perplexity compared to the n-gram model and capture important long-range
correlations. From a qualitative point of view, our model is sometimes able to produce natural
conversations.

2. Related Work
Our approach is based on recent work which proposed to use neural networks to map sequences to
sequences (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014). This framework
has been
used for neural machine translation and achieves improvements on the English-French and English-German
translation tasks from the WMT'14 dataset (Luong et al., 2014; Jean et al., 2014). It has also been
used for other tasks such as parsing (Vinyals et al., 2014a) and image captioning (Vinyals et al.,
2014b). Since it is well known that vanilla RNNs suffer from vanishing gradients, most researchers use
variants of Long Short Term Memory (LSTM) recurrent neural networks (Hochreiter & Schmidhuber, 1997).

Our work is also inspired by the recent success of neural language modeling (Bengio et al., 2003;
Mikolov et al., 2010; Mikolov, 2012), which shows that recurrent neural networks are rather effective
models for natural language. More recently, work by Sordoni et al. (Sordoni et al., 2015) and Shang et
al. (Shang et al., 2015) used recurrent neural networks to model dialogue in short conversations
(trained on Twitter-style chats).

Building bots and conversational agents has been pursued by many researchers over the last decades, and
it is out of the scope of this paper to provide an exhaustive list of references. However, most of these
systems require a rather complicated processing pipeline of many stages (Lester et al., 2004; Will,
2007; Jurafsky & Martin, 2009). Our work differs from conventional systems by proposing an end-to-end
approach to the problem which lacks domain knowledge. It could, in principle, be combined with other
systems to re-score a short-list of candidate responses, but our work is based on producing answers
given by a probabilistic model trained to maximize the probability of the answer given some context.

3. Model
Our approach makes use of the sequence-to-sequence (seq2seq) framework described in (Sutskever et al.,
2014). The model is based on a recurrent neural network which reads the input sequence one token at a
time, and predicts the output sequence, also one token at a time. During training, the true output
sequence is given to the model, so learning can be done by backpropagation. The model is trained to
minimize the cross entropy of the correct sequence given its context. During inference, given that the
true output sequence is not observed, we simply feed the predicted output token as input to predict the
next output. This is a "greedy" inference approach. A less greedy approach would be to use beam search,
and feed several candidates at the previous step to the next step. The predicted sequence can be
selected based on the probability of the sequence.

Figure 1. Using the seq2seq framework for modeling conversations.

Concretely, suppose that we observe a conversation with two turns: the first person utters "ABC", and
the second person replies "WXYZ". We can use a recurrent neural network, and train it to map "ABC" to
"WXYZ" as shown in Figure 1 above. The hidden state of the model when it receives the end of sequence
symbol "<eos>" can be viewed as the thought vector because it stores the information of the sentence,
or thought, "ABC".

The strength of this model lies in its simplicity and generality. We can use this model for machine
translation, question/answering, and conversations without major changes in the architecture. Applying
this technique to conversation modeling is also straightforward: the input sequence can be the
concatenation of what has been conversed so far (the context), and the output sequence is the reply.

Unlike easier tasks like translation, however, a model like sequence-to-sequence will not be able to
successfully "solve" the problem of modeling dialogue due to several obvious simplifications: the
objective function being optimized does not capture the actual objective achieved through human
communication, which is typically longer term and based on exchange of information rather than next
step prediction. The lack of a model to ensure consistency and general world knowledge is another
obvious limitation of a purely unsupervised model.

4. Datasets
In our experiments we used two datasets: a closed-domain IT helpdesk troubleshooting dataset and an
open-domain movie transcript dataset. The details of the two datasets are as follows.

4.1. IT Helpdesk Troubleshooting dataset
In our first set of experiments, we used a dataset which was extracted from an IT helpdesk
troubleshooting chat service. In this service, customers face computer related issues, and a specialist
helps them by conversing and walking through a solution. Typical interactions (or threads) are 400 words
long, and turn taking is clearly signaled. Our training set contains 30M tokens, and 3M tokens were
used as validation. Some amount of clean up was performed, such as removing common names, numbers, and
full URLs.
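
The greedy inference procedure described in Section 3 can be sketched independently of any particular network. In the sketch below, `next_token_probs` is a hypothetical callable standing in for the trained model: it takes the context and the tokens generated so far and returns a dictionary of next-token probabilities. The toy model simply replays the "ABC" to "WXYZ" example and is not a trained network.

```python
def greedy_decode(next_token_probs, context, eos="<eos>", max_len=20):
    """Repeatedly pick the most probable next token and feed it back in."""
    output = []
    while len(output) < max_len:
        probs = next_token_probs(context, output)
        token = max(probs, key=probs.get)
        if token == eos:
            break
        output.append(token)
    return output


def toy_model(context, prefix):
    """Hypothetical stand-in for a trained model: prefers W, X, Y, Z, <eos>."""
    target = ["W", "X", "Y", "Z", "<eos>"]
    preferred = target[min(len(prefix), len(target) - 1)]
    return {tok: (0.9 if tok == preferred else 0.025) for tok in target}


reply = greedy_decode(toy_model, ["A", "B", "C"])
```

Beam search would differ only in keeping several partial outputs at each step and ranking them by total sequence probability, rather than committing to the single best token.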

4.2. OpenSubtitles dataset
We also tested our model on the OpenSubtitles dataset (Tiedemann, 2009). This dataset consists of
movie conversations in XML format. It contains sentences uttered by characters in movies. We applied a
simple processing step removing XML tags and obvious non-conversational text (e.g., hyperlinks) from
the dataset. As turn taking is not clearly indicated, we treated consecutive sentences assuming they
were uttered by different characters. We trained our model to predict the next sentence given the
previous one, and we did this for every sentence (noting that this doubles our dataset size, as each
sentence is used both for context and as target). Our training and validation split has 62M sentences
(923M tokens) as training examples, and the validation set has 26M sentences (395M tokens). The split
is done in such a way that each sentence in a pair of sentences either appears together in the training
set or test set but not both. Unlike the previous dataset, the OpenSubtitles dataset is quite large, and
rather noisy because consecutive sentences may be uttered by the same character. Given the broad scope
of movies, this is an open-domain conversation dataset, contrasting with the technical troubleshooting
dataset.

5. Experiments
In this section, we describe the experimental results with the two datasets and show some samples of
the interactions with the system that we trained. We also compare the performance of our system against
a popular rule-based bot (CleverBot) using human evaluations on a set of 200

5.1. IT Helpdesk Troubleshooting experiments
In this experiment, we trained a single layer LSTM with 1024 memory cells using stochastic gradient
descent with gradient clipping. The vocabulary consists of the most common 20K words, which includes
special tokens indicating turn taking and actor.

At convergence, this model achieved a perplexity of 8, whereas an n-gram model achieved 18. Below are a
few samples of simulated troubleshooting sessions for three of the most common IT problems (remote
access, software crashes, and password issues). In these conversations, Machine is our Neural
Conversational Model, and Human the human actor interacting with it.

Conversation 1: VPN issues.

Describe your problem: i am having issues accessing vpn
Machine: hi
Human: hello
Machine: could you please let me know what are the operating systems you are using ?
Human: i am using linux
Machine: ok it was in the ssl vpn on which you stated it was broken
Machine: please check if you are logged in to corp network
Machine: what is the error that you are running
Human: i am seeing an error related to vpn
Machine: what is the error message that you are getting when connecting to vpn using network connect ?
Human: connection refused or something like
Machine: may i know the version of network connect you connect ?
Human: i am not sure i know that
Machine: is the network connect prompting for any user certificate
Machine: ?
Human: yes
Machine: are you at home or at work at the moment ?
Machine: office ?
Human: i am at home
Machine: try this
Machine: goto <URL> page and enter your name and password
Human: done
Machine: try logging in with <NAME> and check if its working fine now
Human: yes , now it works !
Machine: great
Machine: anything else that i can help
Human: no , that was very useful
Machine: you are other than <NAME> : )
Machine: have a nice night

(The <URL> indeed contains information about vpn access.)

Conversation 2: Browser issues.

Describe your problem: browser keeps crashing
Machine: hi
Human: hi
Machine: could you please let me know what is the operating system you are using ?
Human: linux
Machine: ok this is the issue with the system performance , nasser , tools .
Human: what is that ?
Machine: is it a windows machine ?
Machine: we have linux on your machine .
Machine: is it windows xp or windows 7 ?
Human: i am using linux
Machine: let me take a remote of your machine and check if issue has been pushed properly .
Human: ok
Machine: please open terminal and run the below commands in one computer :
Machine: sudo apt - get update
Machine: sudo apt - get upgrade
Machine: sudo <COMMAND> - updater
Human: done
Machine: let me try connecting now
Human: ok
Machine: its done
Machine: should use the second or hit run on the background
Human: done
Machine: try signing out now and follow the prompts as per the message given above
Machine: shd i reboot the machine .
Human: ok , please go ahead
Machine: once its completed please try to login .
Human: sure
Machine: please ping me back once you have tried to login
Machine: are we still connected
Human: i now logged in
Machine: thanks for your confirmation
Human: what should i do now ?
Machine: wait for 1 5 minutes to complete the login policy
Machine: and then try logging in
Human: ok
Machine: is there anything else to help you ?
Human: let my try browsing a little bit and if it keeps crashing i will let you know
Machine: sure will do
Machine: seems the mail is not loading
Machine: are we still connected to the chat ?
Machine: <reinvite>
Human: it seems good now , cool , thank you , bye
Machine: you , too
Machine: bye

Conversation 3: Password issues.

Describe your problem: i forgot my password
Machine: hi , this is <NAME> from helpdesk connect , what can i help you with today ?
Human: hi , i forgot my password
Machine: ok great
Human: can you tell me how i can recover it ?
Machine: i ’ ll need to verify who it is at that machine
Machine: can we do a hangout
Machine: ?
Human: yes
Machine: <anonymized hangout instructions and verifications>
Human: cool , i am good now
Machine: how ’ s this looking good ?
Machine: i ’ ll go ahead and close this chat .
Machine: take care . : )
Human: bye
Machine: bye .

5.2. OpenSubtitles experiments

In this experiment, we trained a two-layered LSTM using AdaGrad with gradient clipping. Each layer
of the LSTM has 4096 memory cells, and we built a vocabulary consisting of the most frequent 100K
words. To speed up the softmax, we project the memory cells to 2048 linear units before feeding
the information to the classifier.

At convergence the perplexity of the recurrent model on the validation set is 17. Our smoothed
5-gram model achieves a perplexity of 28. Interestingly, adding the soft attention mechanism of
(Bahdanau et al., 2014) did not significantly improve the perplexity on either the training or
the validation set.

5.2.1. MODEL SAMPLES

In addition to the perplexity measure, our simple recurrent model does often produce plausible
answers. Here are a few sample question-answer pairs that allowed us to understand the
capabilities of the model when trained on the noisy but large dataset of OpenSubtitles from
movies. Again, in these conversations, Machine is our Neural Conversational Model, and Human the
human actor interacting with it.

Conversation 1: Basic

H: hello !
Machine: hello !
Human: how are you ?
Machine: i ’m good .
Human: what ’s your name ?
Machine: i ’m julia .
Human: when were you born ?
Machine: july 20th .
Human: what year were you born ?
Machine: 1977 .
Human: where are you from ?
Machine: i ’m out in the boonies.
Human: are you a man or a woman ?
Machine: i ’m a woman .
Human: why are we here ?
Machine: i ’m not sure .
Human: okay, bye !
Machine: bye .
Human: see you later .
Machine: bye .

Conversation 2: Simple Q&A (contexts and multiple choice)

H: my name is david . what is my name ?
Machine: david .
Human: my name is john . what is my name ?
Machine: john .
Human: are you a leader or a follower ?
Machine: i ’m a leader .
Human: are you a follower or a leader ?
Machine: i ’m a leader .

Conversation 3: General knowledge Q&A

H: who is skywalker ?
Machine: he is a hero .
Human: who is bill clinton ?
Machine: he ’s a billionaire .
Human: is sky blue or black ?
Machine: blue .
Human: does a cat have a tail ?
Machine: yes .
Human: does a cat have a wing ?
Machine: no
Human: can a cat fly ?
Machine: no .
Human: how many legs does a cat have ?
Machine: four , i think .
Human: how many legs does a spider have ?
Machine: three , i think .
Human: how many legs does a centipede have ?
Machine: eight .
Human: what is the color of the sky ?
Machine: blue .
Human: what is the color of water ?
Machine: water .
Human: what is the color of blood ?
Machine: it is the same as a black eye .
Human: what is the usual color of a leaf ?
Machine: it is a green one .
Human: what is the color of a yellow car ?
Machine: yellow .
Human: how much is two plus two ?
Machine: four .
Human: how much is ten minus two ?
Machine: seventy - two .

Conversation 4: Philosophical Q&A

H: what is the purpose of life ?
Machine: to serve the greater good .
Human: what is the purpose of living ?
Machine: to live forever .
Human: what is the purpose of existence ?
Machine: to find out what happens when we get to the planet earth .
Human: where are you now ?
Machine: i ’m in the middle of nowhere .
Human: what is the purpose of dying ?
Machine: to have a life .
Human: what is the purpose of being intelligent ?
Machine: to find out what it is .
Human: what is the purpose of emotions ?
Machine: i don ’t know .

Conversation 5: Morality

H: what is moral ?
Machine: what empowered humanity , what intellectual the essence is .
Human: what is immoral ?
Machine: the fact that you have a child .
Human: what is morality ?
Machine: what is altruism ?
Human: what is the definition of altruism ?
Machine: if you don ’t believe in god , then you don ’t know .
Human: ok ... so what is the definition of morality ?
Machine: well , the truth is , you ’re not a believer in god almighty .
Human: tell me the definition of morality , i am quite upset now !
Machine: i ’m not ashamed of being a philosopher !
Human: tell me the definition of morality .
Machine: i don ’t have ethics .
Human: look , i need help , i need to know more about morality ...
Machine: i don ’t know what ethics is .
Human: seriously , what is morality ?
Machine: what is the definition of living ?
Human: why living has anything to do with morality ?
Machine: you ’re not a cop .
Human: okay , i need to know how should i behave morally ...
Machine: i don ’t know how to tell you .
Human: is morality and ethics the same ?
Machine: yes , sir .
Human: what are the things that i do to be immoral ?
Machine: i don ’t know .
Human: give me some examples of moral actions ...
Machine: i ’m not a moralist .
Human: alright , morality ?
Machine: integrity .
Human: what is integrity ?
Machine: i ’m sorry , i don ’t know what else to say .
Human: be moral !
Machine: be a man !
Human: i really like our discussion on morality and ethics ...
Machine: and how i ’m not in the mood for a philosophical debate .
Human: what do you like to talk about ?
Machine: nothing .

Conversation 6: Opinions (notice the model outputs “he” , “she” or “it” depending on the subject
being asked)

H: what do you think about tesla ?
Machine: he ’s a good conductor .
Human: what do you think about bill gates ?
Machine: he ’s a good man .
Human: What do you think about messi ?
Machine: he ’s a great player .
Human: what do you think about cleopatra ?
Machine: oh , she ’s very regal .
Human: what do you think about england during the reign of elizabeth ?
Machine: it was a great place .

We find it encouraging that the model can remember facts, understand contexts, and perform common
sense reasoning without the complexity of traditional pipelines. What surprises us is that the
model does so without any explicit knowledge representation component except for the parameters
in the word vectors.

Perhaps most practically significant is the fact that the model can generalize to new questions.
In other words, it does not simply look up an answer by matching the question with the existing
database. In fact, most of the questions presented above, except for the first conversation, do
not appear in the training set.

Nonetheless, one drawback of this basic model is that it only gives simple, short, sometimes
unsatisfying answers to our questions, as can be seen above. Perhaps a more problematic drawback
is that the model does not capture a consistent personality. Indeed, if we ask not identical but
semantically similar questions, the answers can sometimes be inconsistent. This is expected due
to the simplicity of our model and the dataset in our experiments. The conversation below
illustrates such a failure:

Conversation 7: Job and Personality

H: what is your job ?
Machine: i ’m a lawyer .
Human: what do you do ?
Machine: i ’m a doctor .

Lastly, as a comparison with other traditional bots that are rule-based, we provide the output of
CleverBot for some of the questions in our test [2]:

[2] The outputs of CleverBot are non-deterministic so the answers may be different for different
trials.

H: Are you a follower or a leader ?
CleverBot: No !!
Human: My name is david . what is my name ?
CleverBot: How old are you David ?
Human: My name is john . what is my name ?
CleverBot: John
Human: What is the purpose of being intelligent
CleverBot: The purpose of life is to love God and love people.
Human: Is sky blue or black ?
CleverBot: I am 200 years old.
Human: What is the color of a yellow car ?
CleverBot: Brown.
Human: What is the usual color of a leaf ?
CleverBot: What is the color of the apple in the fridge ?
Human: What is the purpose of life ?
CleverBot: My purpose it to forward my species, in other words to make it easier for future generations of mankind to live.
Human: What do you think about tesla ?
CleverBot: I am tesla.
Human: What do you think about cleopatra ?
CleverBot: Cool is cool.
Human: What do you think about england during the reign of elizabeth ?
CleverBot: I like him. Do you ?
Human: What do you think about bill gates ?
CleverBot: Lets talk about Link.
Human: What is moral ?
CleverBot: Morality is the quality of being in accord with standards of right or good conduct.
Human: What is immoral ?
CleverBot: What is the objective of sistetic life ?
Human: Does a cat have a wing ?
CleverBot: What is the name of your husband.
Human: Can a cat fly ?
CleverBot: What is your husband’s name ?
Human: Who is skywalker ?
CleverBot: There are two of them, Anakin Skywalker and Luke Skywalker, both characters from star Wars.
Human: How many legs does a cat have ?
CleverBot: An ant.
Human: How many legs does a centipede have ?
CleverBot: A centipede’s number of legs varies from thirty to two hundred.
Human: What is your job ?
CleverBot: I work in mental health.
Human: What do you do ?
CleverBot: I have no job.

5.2.2. HUMAN EVALUATION

In order to fairly and objectively compare our model against CleverBot, we picked 200 questions
[3], and asked four different humans to rate our model (NCM) versus CleverBot (CB). Human judges
were asked to pick which of the two bots they preferred, and were also allowed to declare a tie
if both answers were of equal quality.

[3] The questions we used together with the answers can be found in ˜quocle/QAresults.pdf

If three out of four human judges agreed, we record that score. Our model was preferred in 97 out
of 200 questions, whereas CleverBot was picked in 60 out of 200. There was a tie in 20 questions,
and in 23 questions the judges were in disagreement.

Here are some samples of how the judges evaluated the systems (in bold we show their preferred
answer, the third example shows a tie, and the fourth example shows disagreement):

H: What is the deepest spot on the world ?
NCM: The pacific ocean.
CB: What ?

H: Do you like Mexican food or Indian food ?
NCM: I hate it.
CB: I love Mexican food.

H: Who are you crazy about ?
NCM: You’re the one who’s crazy.
CB: Who are you ?

H: What’s thirty plus forty ?
NCM: Sixty ?
CB: What should we talk about ?

We believe, however, that designing a good metric to quickly measure the quality of a
conversational model remains an open research problem. We did so by manual inspection, by using
Mechanical Turk to get human evaluations comparing with a rule-based bot, and by computing the
perplexity. However, there are obvious drawbacks to using those, and we are actively pursuing
this and other research directions.

6. Discussion

In this paper, we show that a simple language model based on the seq2seq framework can be used to
train a conversational engine. Our modest results show that it can generate simple and basic
conversations, and extract knowledge from a noisy but open-domain dataset. Even though the model
has obvious limitations, it is surprising to us that a purely data driven approach without any
rules can produce rather proper answers to many types of questions. However, the model may
require substantial modifications to be able to deliver realistic conversations. Amongst the many
limitations, the lack of a coherent personality makes it difficult for our system to pass the
Turing test (Turing, 1950).

ACKNOWLEDGMENTS

We thank Greg Corrado, Andrew Dai, Jeff Dean, Tom Dean, Matthieu Devin, Rajat Monga, Mike
Schuster, Noam Shazeer, Ilya Sutskever and the Google Brain team for the help with the project.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473, 2014.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. The
Journal of Machine Learning Research, 3:1137–1155, 2003.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural
machine translation. CoRR, abs/1412.2007, 2014.
Jurafsky, D. and Martin, J. Speech and language processing. Pearson International, 2009.
Kalchbrenner, N. and Blunsom, P. Recurrent continuous
translation models. In EMNLP, 2013.
Lester, J., Branting, K., and Mott, B. Conversational
agents. In Handbook of Internet Computing. Chapman
& Hall, 2004.
Luong, T., Sutskever, I., Le, Q. V., Vinyals, O., and
Zaremba, W. Addressing the rare word problem in neu-
ral machine translation. arXiv preprint arXiv:1410.8206,
Mikolov, T. Statistical Language Models based on Neural
Networks. PhD thesis, Brno University of Technology,
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., and
Khudanpur, S. Recurrent neural network based language
model. In INTERSPEECH, pp. 1045–1048, 2010.
Shang, L., Lu, Z., and Li, H. Neural responding ma-
chine for short-text conversation. In Proceedings of ACL,
Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y.,
Mitchell, M., Gao, J., Dolan, B., and Nie, J.-Y. A neural
network approach to context-sensitive generation of con-
versational responses. In Proceedings of NAACL, 2015.
Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to se-
quence learning with neural networks. In NIPS, 2014.
Tiedemann, J. News from OPUS - A collection of multi-
lingual parallel corpora with tools and interfaces. In Ni-
colov, N., Bontcheva, K., Angelova, G., and Mitkov, R.
(eds.), Recent Advances in Natural Language Process-
ing, volume V, pp. 237–248. John Benjamins, Amster-
dam/Philadelphia, Borovets, Bulgaria, 2009. ISBN 978
90 272 4825 1.
Turing, A. M. Computing machinery and intelligence.
Mind, pp. 433–460, 1950.
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I.,
and Hinton, G. Grammar as a foreign language. arXiv
preprint arXiv:1412.7449, 2014a.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show
and tell: A neural image caption generator. arXiv
preprint arXiv:1411.4555, 2014b.
Will, T. Creating a Dynamic Speech Dialogue. VDM Ver-
lag Dr, 2007.
Listen, Attend and Spell

William Chan Navdeep Jaitly, Quoc V. Le, Oriol Vinyals

Carnegie Mellon University Google Brain {ndjaitly,qvl,vinyals}
arXiv:1508.01211v2 [cs.CL] 20 Aug 2015


We present Listen, Attend and Spell (LAS), a neural network that learns to tran-
scribe speech utterances to characters. Unlike traditional DNN-HMM models, this
model learns all the components of a speech recognizer jointly. Our system has
two components: a listener and a speller. The listener is a pyramidal recurrent net-
work encoder that accepts filter bank spectra as inputs. The speller is an attention-
based recurrent network decoder that emits characters as outputs. The network
produces character sequences without making any independence assumptions be-
tween the characters. This is the key improvement of LAS over previous end-to-
end CTC models. On a subset of the Google voice search task, LAS achieves a
word error rate (WER) of 14.1% without a dictionary or a language model, and
10.3% with language model rescoring over the top 32 beams. By comparison, the
state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

1 Introduction
Deep Neural Networks (DNNs) have led to improvements in various components of speech recog-
nizers. They are commonly used in hybrid DNN-HMM speech recognition systems for acoustic
modeling [1, 2, 3, 4, 5, 6]. DNNs have also produced significant gains in pronunciation models that
map words to phoneme sequences [7, 8]. In language modeling, recurrent models have been shown
to improve speech recognition accuracy by rescoring n-best lists [9]. Traditionally these compo-
nents – acoustic, pronunciation and language models – have all been trained separately, each with a
different objective. Recent work in this area attempts to rectify this disjoint training issue by design-
ing models that are trained end-to-end – from speech directly to transcripts [10, 11, 12, 13, 14, 15].
Two main approaches for this are Connectionist Temporal Classification (CTC) [10] and sequence
to sequence models with attention [16]. Both of these approaches have limitations that we try to
address: CTC assumes that the label outputs are conditionally independent of each other; whereas
the sequence to sequence approach has only been applied to phoneme sequences [14, 15], and not
trained end-to-end for speech recognition.
In this paper we introduce Listen, Attend and Spell (LAS), a neural network that improves upon the
previous attempts [12, 14, 15]. The network learns to transcribe an audio sequence signal to a word
sequence, one character at a time. Unlike previous approaches, LAS does not make independence
assumptions in the label sequence and it does not rely on HMMs. LAS is based on the sequence to
sequence learning framework with attention [17, 18, 16, 14, 15]. It consists of an encoder recurrent
neural network (RNN), which is named the listener, and a decoder RNN, which is named the speller.
The listener is a pyramidal RNN that converts low level speech signals into higher level features.
The speller is an RNN that converts these higher level features into output utterances by specifying
a probability distribution over sequences of characters using the attention mechanism [16, 14, 15].
The listener and the speller are trained jointly.
Key to our approach is the fact that we use a pyramidal RNN model for the listener, which reduces
the number of time steps that the attention model has to extract relevant information from. Rare and
out-of-vocabulary (OOV) words are handled automatically, since the model outputs the character

sequence, one character at a time. Another advantage of modeling characters as outputs is that the
network is able to generate multiple spelling variants naturally. For example, for the phrase “triple a”
the model produces both “triple a” and “aaa” in the top beams (see section 4.5). A model like CTC
may have trouble producing such diverse transcripts for the same utterance because of conditional
independence assumptions between frames.
In our experiments, we find that these components are necessary for LAS to work well. Without the
attention mechanism, the model overfits the training data significantly, in spite of our large training
set of three million utterances - it memorizes the training transcripts without paying attention to the
acoustics. Without the pyramid structure in the encoder side, our model converges too slowly - even
after a month of training, the error rates were significantly higher than the errors we report here.
Both of these problems arise because the acoustic signals can have hundreds to thousands of frames
which makes it difficult to train the RNNs. Finally, to reduce the overfitting of the speller to the
training transcripts, we use a sampling trick during training [19].
With these improvements, LAS achieves 14.1% WER on a subset of the Google voice search task,
without a dictionary or a language model. When combined with language model rescoring, LAS
achieves 10.3% WER. By comparison, the Google state-of-the-art CLDNN-HMM system achieves
8.0% WER on the same data set [20].

2 Related Work

Even though deep networks have been successfully used in many applications, until recently, they
have mainly been used in classification: mapping a fixed-length vector to an output category [21].
For structured problems, such as mapping one variable-length sequence to another variable-length
sequence, neural networks have to be combined with other sequential models such as Hidden
Markov Models (HMMs) [22] and Conditional Random Fields (CRFs) [23]. A drawback of this
combining approach is that the resulting models cannot be easily trained end-to-end and they make
simplistic assumptions about the probability distribution of the data.
Sequence to sequence learning is a framework that attempts to address the problem of learning
variable-length input and output sequences [17]. It uses an encoder RNN to map the sequential
variable-length input into a fixed-length vector. A decoder RNN then uses this vector to produce
the variable-length output sequence, one token at a time. During training, the model feeds the
groundtruth labels as inputs to the decoder. During inference, the model performs a beam search to
generate suitable candidates for next step predictions.
Sequence to sequence models can be improved significantly by the use of an attention mechanism
that provides the decoder RNN more information when it produces the output tokens [16]. At each
output step, the last hidden state of the decoder RNN is used to generate an attention vector over
the input sequence of the encoder. The attention vector is used to propagate information from the
encoder to the decoder at every time step, instead of just once, as with the original sequence to
sequence model [17]. This attention vector can be thought of as skip connections that allow the
information and the gradients to flow more effectively in an RNN.
The sequence to sequence framework has been used extensively for many applications: machine
translation [24, 25], image captioning [26, 27], parsing [28] and conversational modeling [29]. The
generality of this framework suggests that speech recognition can also be a direct application [14,

3 Model

In this section, we will formally describe LAS, which accepts acoustic features as inputs
and emits English characters as outputs. Let x = (x1, ..., xT) be our input sequence
of filter bank spectra features, and let y = (⟨sos⟩, y1, ..., yS, ⟨eos⟩), yi ∈
{a, b, c, ..., z, 0, ..., 9, ⟨space⟩, ⟨comma⟩, ⟨period⟩, ⟨apostrophe⟩, ⟨unk⟩}, be the output
sequence of characters. Here ⟨sos⟩ and ⟨eos⟩ are the special start-of-sentence and
end-of-sentence tokens, respectively.

We want to model each character output yi as a conditional distribution over the previous characters
y<i and the input signal x using the chain rule:
    P(y|x) = \prod_i P(y_i | x, y_{<i})    (1)
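
As an illustration of the chain rule in equation (1), the following minimal sketch sums per-character conditional log-probabilities to obtain the log-probability of a whole transcript. The per-step distributions are made-up numbers standing in for the model's outputs, and `sequence_log_prob` is a hypothetical helper name, not part of the paper.

```python
import math

def sequence_log_prob(step_distributions, target):
    """Return sum_i log P(y_i | x, y_<i).

    Each entry of step_distributions is assumed to be the distribution over
    characters at step i, already conditioned on x and the previous characters.
    """
    return sum(math.log(dist[ch]) for dist, ch in zip(step_distributions, target))

# Hypothetical per-step distributions for the target "h", "i", "<eos>".
steps = [
    {"h": 0.5, "i": 0.25, "<eos>": 0.25},
    {"h": 0.25, "i": 0.5, "<eos>": 0.25},
    {"h": 0.25, "i": 0.25, "<eos>": 0.5},
]
target = ["h", "i", "<eos>"]
log_p = sequence_log_prob(steps, target)
```

Working in log space turns the product in equation (1) into a sum, which avoids numerical underflow for long transcripts.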

Our Listen, Attend and Spell (LAS) model consists of two sub-modules: the listener and the speller.
The listener is an acoustic model encoder, whose key operation is Listen. The speller is an attention-
based character decoder, whose key operation is AttendAndSpell. The Listen function transforms
the original signal x into a high level representation h = (h1 , . . . , hU ) with U ≤ T , while the
AttendAndSpell function consumes h and produces a probability distribution over character
sequences:
    h = Listen(x)    (2)
    P(y|x) = AttendAndSpell(h, y)    (3)

Figure 1 visualizes LAS with these two components. We provide more details of these components
in the following sections.

[Figure 1 diagram: the speller (top) emits the characters y2, y3, y4, ..., ⟨eos⟩ from decoder
states s1, s2, ... and context vectors c1, c2, ..., taking ⟨sos⟩, y2, y3, ..., yS−1 as inputs;
AttentionContext creates the context vector ci from h and si. The listener (bottom) encodes the
long input sequence x = (x1, ..., xT) with the pyramidal BLSTM Listen into the shorter sequence
h = (h1, ..., hU).]

Figure 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding our input
sequence x into high level features h, the speller is an attention-based decoder generating the y characters
from h.

3.1 Listen

The Listen operation uses a Bidirectional Long Short Term Memory RNN (BLSTM) [30, 31, 12]
with a pyramid structure. This modification is required to reduce the length U of h, from T , the
length of the input x, because the input speech signals can be hundreds to thousands of frames long.
A direct application of BLSTM for the operation Listen converged slowly and produced results
inferior to those reported here, even after a month of training time. This is presumably because the
operation AttendAndSpell has a hard time extracting the relevant information from a large number
of input time steps.
We circumvent this problem by using a pyramid BLSTM (pBLSTM) similar to the Clockwork RNN
[33]. In each successive stacked pBLSTM layer, we reduce the time resolution by a factor of 2. In a
typical deep BLSTM architecture, the output at the i-th time step, from the j-th layer is computed as
follows:

    h_i^j = \mathrm{BLSTM}(h_{i-1}^j, h_i^{j-1})    (4)

In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding
it to the next layer, i.e.:

    h_i^j = \mathrm{pBLSTM}(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])    (5)

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution
2^3 = 8 times. This allows the attention model (see next section) to extract the relevant information
from a smaller number of time steps. In addition to reducing the resolution, the deep architecture
allows the model to learn nonlinear feature representations of the data. See Figure 1 for a
visualization of the pBLSTM.
The pyramid structure also reduces the computational complexity. In the next section we show that
the attention mechanism over U features has a computational complexity of O(U S). Thus, reducing
U speeds up learning and inference significantly.
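
The time reduction of the pyramid can be illustrated with a small sketch of the concatenation step alone. This is not the pBLSTM itself: the recurrent computation between layers is omitted, the toy features are one-dimensional, and dropping a trailing odd frame is an assumption made here for simplicity. `pyramid_concat` is a hypothetical helper name.

```python
def pyramid_concat(frames):
    """Concatenate each pair of consecutive frames (2i, 2i+1), halving the length.

    A trailing unpaired frame is dropped (an assumption; the paper does not
    say how odd lengths are handled).
    """
    usable = len(frames) - len(frames) % 2
    return [frames[i] + frames[i + 1] for i in range(0, usable, 2)]

# Toy input: T = 16 frames, each a 1-dimensional feature vector.
T = 16
x = [[float(t)] for t in range(T)]

h = x
for _ in range(3):          # three pyramid layers, as in the model described above
    h = pyramid_concat(h)

U = len(h)                  # 16 / 2^3 = 2 time steps remain
feature_dim = len(h[0])     # each remaining step spans 8 original frames
```

Because the attention cost grows as O(US), shrinking U by a factor of 8 reduces the work per output character by the same factor.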

3.2 Attend and Spell

We now describe the AttendAndSpell function. The function is computed using an attention-based
LSTM transducer [16, 15]. At every output step, the transducer produces a probability distribution
over the next character conditioned on all the characters seen previously. The distribution for yi is
a function of the decoder state si and context ci . The decoder state si is a function of the previous
state si−1 , the previously emitted character yi−1 and context ci−1 . The context vector ci is produced
by an attention mechanism. Specifically,

ci = AttentionContext(si , h) (6)
si = RNN(si−1 , yi−1 , ci−1 ) (7)
P (yi |x, y<i ) = CharacterDistribution(si , ci ) (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2
layer LSTM.
At each time step, i, the attention mechanism, AttentionContext generates a context vector, ci
encapsulating the information in the acoustic signal needed to generate the next character. The
attention model is content based - the contents of the decoder state si are matched to the contents
of hu representing time step u of h, to generate an attention vector αi . αi is used to linearly blend
vectors hu to create ci .
Specifically, at each decoder timestep i, the AttentionContext function computes the scalar energy
ei,u for each time step u, using vector hu ∈ h and si. The scalar energy ei,u is converted into a
probability distribution over time steps (or attention) αi using a softmax function. This is used to
create the context vector ci by linearly blending the listener features, hu, at different time steps:

    e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle    (9)
    \alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})}    (10)
    c_i = \sum_u \alpha_{i,u} h_u    (11)

where φ and ψ are MLP networks. On convergence, the αi distribution is typically very sharp, and
focused on only a few frames of h; ci can be seen as a continuous bag of weighted features of h.
Figure 1 shows the LAS architecture.
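
The three steps of AttentionContext (energy, softmax, weighted blend) can be sketched in plain Python as follows. In this sketch φ and ψ default to the identity function purely for illustration, whereas the paper uses MLPs; the function and variable names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def attention_context(s_i, h, phi=lambda v: v, psi=lambda v: v):
    """Return (c_i, alpha_i) for decoder state s_i and listener features h.

    phi and psi stand in for the MLPs of equation (9); identity is an assumption.
    """
    energies = [dot(phi(s_i), psi(h_u)) for h_u in h]        # equation (9)
    shift = max(energies)                                    # for numerical stability
    weights = [math.exp(e - shift) for e in energies]
    total = sum(weights)
    alpha = [w / total for w in weights]                     # equation (10)
    dim = len(h[0])
    c_i = [sum(a * h_u[d] for a, h_u in zip(alpha, h)) for d in range(dim)]  # equation (11)
    return c_i, alpha

# Toy listener output with U = 3 time steps and a toy decoder state.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [2.0, 0.0]
context, alpha = attention_context(s, h)
```

Subtracting the maximum energy before exponentiating leaves the softmax unchanged but avoids overflow for large energies.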

3.3 Learning

The Listen and AttendAndSpell functions can be trained jointly for end-to-end speech recognition.
The sequence to sequence methods condition the next step prediction on the previous characters [17,
16] and maximize the log probability:

    \max_\theta \sum_i \log P(y_i | x, y_{<i}; \theta)    (12)

where y<i is the groundtruth of the previous characters.
However during inference, the groundtruth is missing and the predictions can suffer because the
model was not trained to be resilient to feeding in bad predictions at some time steps. To ameliorate
this effect, we use a trick that was proposed in [19]. During training, instead of always feeding in the
ground truth transcript for next step prediction, we sometimes sample from our previous character
distribution and use that as the inputs in the next step predictions:
    \tilde{y}_i \sim \mathrm{CharacterDistribution}(s_i, c_i)    (13)
    \max_\theta \sum_i \log P(y_i | x, \tilde{y}_{<i}; \theta)    (14)

where ỹi−1 is the character chosen from the ground truth, or sampled from the model with a certain
sampling rate. Unlike [19], we do not use a schedule and simply use a constant sampling rate of
10% right from the start of training.
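
A minimal sketch of the constant-rate sampling trick is given below. It only shows how the decoder input at each step is chosen; `sample_from_model` is a hypothetical callback standing in for drawing a character from CharacterDistribution, and a real implementation would make that draw depend on the current decoder state.

```python
import random

def choose_decoder_inputs(ground_truth, sample_from_model, sampling_rate=0.1, rng=None):
    """Pick, per step, either the ground-truth character or a model sample.

    With probability sampling_rate the (hypothetical) sample_from_model(i)
    callback supplies the character; otherwise the ground truth is used.
    """
    if rng is None:
        rng = random.Random(0)
    inputs = []
    for i, true_char in enumerate(ground_truth):
        if rng.random() < sampling_rate:
            inputs.append(sample_from_model(i))
        else:
            inputs.append(true_char)
    return inputs

# Usage: a stand-in "model" that always proposes "?".
transcript = list("call home")
mixed = choose_decoder_inputs(transcript, lambda i: "?", sampling_rate=0.1)
```

A constant rate keeps the procedure free of a schedule, which matches the choice described above.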
As the system is a very deep network it may appear that some type of pretraining would be required.
However, in our experiments, we found no need for pretraining. In particular, we attempted to
pretrain the Listen function with context independent or context dependent phonemes generated
from a conventional GMM-HMM system. A softmax network was attached to the output units
hu ∈ h of the listener and used to make multi-frame phoneme state predictions [34] but led to no
improvements. We also attempted to use the phonemes as a joint objective target [35], but found no
improvements.

3.4 Decoding and Rescoring

During inference we want to find the most likely character sequence given the input acoustics:
    \hat{y} = \arg\max_y \log P(y|x)    (15)

Decoding is performed with a simple left-to-right beam search algorithm similar to [17]. We maintain
a set of β partial hypotheses, starting with the start-of-sentence ⟨sos⟩ token. At each timestep,
each partial hypothesis in the beam is expanded with every possible character and only the β most
likely beams are kept. When the ⟨eos⟩ token is encountered, the hypothesis is removed from the beam and
added to the set of complete hypotheses. A dictionary can optionally be added to constrain the search space
to valid words; however, we found that this was not necessary since the model learns to spell real
words almost all the time.
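The decoding procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the callback `next_log_probs`, the `max_len` cutoff, the ASCII token spellings and the toy model are all assumptions introduced here.

```python
import math

SOS, EOS = "<sos>", "<eos>"  # ASCII stand-ins for the sentence boundary tokens


def beam_search(next_log_probs, beam_width, max_len):
    """Left-to-right beam search.

    next_log_probs(prefix) returns a dict mapping every possible next token
    to its log probability given the prefix (a tuple of tokens).
    Returns complete hypotheses as (tokens, log_prob), best first.
    """
    beam = [((SOS,), 0.0)]
    complete = []
    for _ in range(max_len):
        # Expand every partial hypothesis with every possible token.
        candidates = []
        for prefix, score in beam:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_width most likely candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for hyp in candidates[:beam_width]:
            if hyp[0][-1] == EOS:
                complete.append(hyp)  # finished: move out of the beam
            else:
                beam.append(hyp)
        if not beam:
            break
    complete.sort(key=lambda c: c[1], reverse=True)
    return complete


def toy_model(prefix):
    """Hypothetical stand-in for the speller's next-character distribution."""
    if len(prefix) == 1:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    return {EOS: math.log(0.9), "a": math.log(0.1)}


results = beam_search(toy_model, beam_width=2, max_len=3)
```

Here `results` holds the completed hypotheses, which could then be passed on to a rescoring step.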
We have vast quantities of text data [36], compared to the amount of transcribed speech utterances.
We can use language models trained on text corpora alone similar to conventional speech systems
[37]. To do so we can rescore our beams with the language model. We find that our model has a
small bias for shorter utterances so we normalize our probabilities by the number of characters $|y|_c$
in the hypothesis and combine it with a language model probability $P_{LM}(y)$:

$$s(y \mid x) = \frac{\log P(y \mid x)}{|y|_c} + \lambda \log P_{LM}(y) \tag{16}$$

where λ is our language model weight and can be determined by a held-out validation set.
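A minimal sketch of the rescoring rule in Equation 16 follows. The function name `rescore`, the `lm_log_prob` callback and the choice to count every character of the string (including spaces) toward |y|_c are assumptions made for illustration.

```python
def rescore(hypotheses, lm_log_prob, lm_weight):
    """Rank hypotheses by s(y|x) = log P(y|x) / |y|_c + lm_weight * log P_LM(y).

    hypotheses: list of (text, log_prob) pairs from the beam search.
    lm_log_prob: callable mapping a text to its language model log probability
                 (a hypothetical interface, not a specific toolkit's API).
    lm_weight: the language model weight (lambda in Equation 16).
    """
    def score(item):
        text, log_p = item
        n_chars = max(len(text), 1)  # guard against an empty hypothesis
        return log_p / n_chars + lm_weight * lm_log_prob(text)

    return sorted(hypotheses, key=score, reverse=True)
```

Dividing by the character count offsets the bias toward short hypotheses: a hypothesis with log probability -3.0 over four characters scores -0.75 and outranks one with -2.0 over two characters, which scores -1.0.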

4 Experiments
We used a dataset of approximately three million Google voice search utterances (representing 2000
hours of data) for our experiments. Approximately 10 hours of utterances were randomly selected as
a held-out validation set. Data augmentation was performed using a room simulator, adding different
types of noise and reverberations; the noise sources were obtained from YouTube and environmental
recordings of daily events [20]. This increased the amount of audio data by 20 times. 40-dimensional
log-mel filter bank features were computed every 10ms and used as the acoustic inputs to the listener.
A separate set of 22K utterances representing approximately 16 hours of data were used as the test
data. A noisy test data set was also created using the same corruption strategy that was applied to
the training data. All training sets are anonymized and hand-transcribed, and are representative of
Google’s speech traffic.
The text was normalized by converting all characters to lower case English alphanumerics (including
digits). The punctuation marks space, comma, period and apostrophe were kept, while all other tokens
were converted to the unknown ⟨unk⟩ token. As mentioned earlier, all utterances were padded with
the start-of-sentence ⟨sos⟩ and the end-of-sentence ⟨eos⟩ tokens.
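The normalization described above might be sketched as follows. The token spellings `<sos>`, `<eos>` and `<unk>`, and the character-level treatment of "all other tokens", are assumptions for illustration.

```python
import string

# Characters that are kept: lower-case letters, digits, space, comma, period, apostrophe.
ALLOWED = set(string.ascii_lowercase + string.digits + " ,.'")
SOS, EOS, UNK = "<sos>", "<eos>", "<unk>"


def normalize(transcript):
    """Lower-case the transcript, map disallowed characters to UNK, and pad."""
    tokens = [ch if ch in ALLOWED else UNK for ch in transcript.lower()]
    return [SOS] + tokens + [EOS]
```

For instance, `normalize("Hi!")` keeps `h` and `i`, replaces `!` with the unknown token, and wraps the result in the start and end tokens.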
The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [20].
The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set.
However, we note that the CLDNN uses unidirectional CLDNNs and would certainly benefit even
further from the use of a bidirectional CLDNN architecture.
For the Listen function we used 3 layers of 512 pBLSTM nodes (i.e., 256 nodes per direction) on
top of a BLSTM that operates on the input. This reduced the time resolution by 8 = 2^3 times. The
Spell function used a two layer LSTM with 512 nodes each. The weights were initialized with a
uniform distribution U(−0.1, 0.1).
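The factor of 8 comes from stacking three pyramidal layers. The sketch below isolates only that time reduction, assuming each pyramidal layer halves the sequence length by concatenating adjacent frames. It omits the BLSTM computation applied between reductions, and the handling of an odd trailing frame is an assumption.

```python
def pyramid_reduce(frames):
    """Concatenate each pair of consecutive frames, halving the sequence length.

    frames: list of feature vectors (each a list of floats).
    """
    if len(frames) % 2:
        frames = frames[:-1]  # assumption: drop a trailing odd frame
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


# Sixteen one-dimensional frames pass through three reductions.
features = [[float(t)] for t in range(16)]
for _ in range(3):
    features = pyramid_reduce(features)
```

After three reductions the 16 input frames become 2 frames, each covering 8 of the original timesteps.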
Asynchronous Stochastic Gradient Descent (ASGD) was used for training our model [38]. A learn-
ing rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20-th of an epoch).
We used the DistBelief framework [38] with 32 replicas, each with a minibatch of 32 utterances.
In order to further speed up training, the sequences were grouped into buckets based on their frame
length [17].
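The learning rate schedule can be written out explicitly. This sketch assumes a staircase decay (the rate drops once per full 3M utterances); the text does not say whether the decay was applied in steps or continuously, and the function name is hypothetical.

```python
def learning_rate(utterances_seen, base_rate=0.2, decay=0.98, interval=3_000_000):
    """Geometric decay: multiply the rate by `decay` once per `interval` utterances."""
    return base_rate * decay ** (utterances_seen // interval)
```

Under this assumption the rate is 0.2 for the first 3M utterances and 0.2 × 0.98 = 0.196 for the next 3M.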
The model was trained using groundtruth previous characters until results on the validation set
stopped improving. This took approximately two weeks. The model was decoded using beam width
β = 32 and achieved 16.2% WER on the clean test set and 19.0% WER on the noisy test set without
any dictionary or language model. We found that constraining the beam search with a dictionary had
no impact on the WER. Rescoring the top 32 beams with the same n-gram language model that was
used by the CLDNN system using a language model weight of λ = 0.008 improved the results for
the clean and noisy test sets to 12.6% and 14.7% respectively. Note that for convenience, we did not
decode with a language model, but rather only rescored the top 32 beams. It is possible that further
gains could have been achieved by using the language model during decoding.
As mentioned in Section 3.3, there is a mismatch between training and testing. During training
the model is conditioned on the correct previous characters but during testing mistakes made by
the model corrupt future predictions. We trained another model by sampling from our previous
character distribution with a probability of 10% (we did not use a schedule as described in [19]).
This improved our results on the clean and noisy test sets to 14.1% and 16.5% WER respectively
when no language model rescoring was used. With language model rescoring, we achieved 10.3%
and 12.0% WER on the clean and noisy test sets, respectively. Table 1 summarizes these results.
On the clean test set, this model is within 2.5% absolute WER of the state-of-the-art CLDNN-HMM
system, while on the noisy set it is 3.1% absolute WER worse. We suspect that convolutional
filters could lead to improved results, as they have been reported to improve performance by
5% relative WER on clean speech and 7% relative on noisy speech compared to non-convolutional
architectures [20].

Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is
the state-of-the-art system, the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32.
Language Model (LM) rescoring was applied to our beams, and a sampling trick was applied to bridge the gap
between training and inference.

Model                             Clean WER   Noisy WER

CLDNN-HMM [20]                    8.0         8.9
LAS                               16.2        19.0
LAS + LM Rescoring                12.6        14.7
LAS + Sampling                    14.1        16.5
LAS + Sampling + LM Rescoring     10.3        12.0

4.1 Attention Visualization

Figure 2: Alignments between character outputs and audio signal produced by the Listen, Attend and Spell
(LAS) model for the utterance “how much would a woodchuck chuck”. The content based attention mechanism
was able to identify the start position in the audio sequence for the first character correctly. The alignment
produced is generally monotonic without a need for any location based priors.

The content-based attention mechanism creates an explicit alignment between the characters and
audio signal. We can visualize the attention mechanism by recording the attention distribution on
the acoustic sequence at every character output timestep. Figure 2 visualizes the attention align-
ment between the characters and the filterbanks for the utterance “how much would a woodchuck
chuck”. For this particular utterance, the model learnt a monotonic distribution without any location
priors. The words “woodchuck” and “chuck” have acoustic similarities, and the attention mechanism
was slightly confused when emitting “woodchuck”, showing a dilution in the distribution. The attention
model was also able to identify the start and end of the utterance properly.

In the following sections, we report results of control experiments that were conducted to understand
the effects of beam widths, utterance lengths and word frequency on the WER of our model.

4.2 Effects of Beam Width

We investigate the correlation between the performance of the model and the width of beam search,
with and without the language model rescoring. Figure 3 shows the effect of the decode beam width,
β, on the WER for the clean test set. We see consistent WER improvements by increasing the beam
width up to 16, after which we observe no significant benefits. At a beam width of 32, the WER
is 14.1% without rescoring and 10.3% after language model rescoring. Rescoring the top 32 beams with an oracle
produces a WER of 4.3% on the clean test set and 5.5% on the noisy test set.

[Figure 3 plot: “Beam Width vs. WER”; x-axis: Beam Width (1, 2, 4, 8, 16, 32); y-axis: WER.]
Figure 3: The effect of the decode beam width on WER for the clean Google voice search task. The reported
WERs are without a dictionary or language model, with language model rescoring and the oracle WER for
different beam widths. The figure shows that good results can be obtained even with a relatively small beam

4.3 Effects of Utterance Length

We measure the performance of our model as a function of the number of words in the utterance. We
expect the model to do poorly on longer utterances due to the limited number of long training utterances
in our distribution. Hence it is not surprising that longer utterances have a larger error rate.
Deletions dominate the error for long utterances, suggesting we may be missing out on words. It is
surprising that short utterances (e.g., 2 words or fewer) perform quite poorly. Here, substitutions
and insertions are the main sources of errors, suggesting the model may split words apart.
Figure 4 also suggests that our model struggles to generalize to long utterances when trained on a
distribution of shorter utterances. It is possible location-based priors may help in these situations as
reported by [15].

4.4 Word Frequency

We study the performance of our model on rare words. We use the recall metric to indicate whether
a word appears in the utterance regardless of position (higher is better). Figure 5 reports the recall
of each word in the test distribution as a function of the word frequency in the training distribution.
Rare words have higher variance and lower recall while more frequent words typically have higher

[Figure 4 plot: “Utterance Length vs. Error”; x-axis: Number of Words in Utterance (5 to 25+); plotted series include the data distribution, insertion errors, WER and oracle WER.]
Figure 4: The correlation between error rates (insertion, deletion, substitution and WER) and the number of
words in an utterance. The WER is reported without a dictionary or language model, with language model
rescoring and the oracle WER for the clean Google voice search task. The data distribution with respect to the
number of words in an utterance is overlaid in the figure. LAS performs poorly with short utterances despite
an abundance of data. LAS also fails to generalize well on longer utterances when trained on a distribution
of shorter utterances. Insertions and substitutions are the main sources of errors for short utterances, while
deletions dominate the error for long utterances.

recall. The word “and” occurs 85k times in the training set; however, it has a recall of only 80% even
after language model rescoring. The word “and” is frequently mis-transcribed as “in” (which has
95% recall). This suggests improvements are needed in the language model. By contrast, the word
“walkerville” occurs just once in the training set but it has a recall of 100%. This suggests that the
recall for a word depends both on its frequency in the training set and its acoustic uniqueness.
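One way to compute the per-word recall used here is sketched below. The exact definition is an assumption: for each word, the fraction of reference utterances containing it whose hypothesis also contains it, regardless of position. The function name is hypothetical.

```python
from collections import Counter


def word_recall(references, hypotheses):
    """Per-word recall over paired reference and hypothesis transcripts."""
    appeared = Counter()   # utterances whose reference contains the word
    recovered = Counter()  # ... and whose hypothesis also contains it
    for ref, hyp in zip(references, hypotheses):
        hyp_words = set(hyp.split())
        for word in set(ref.split()):
            appeared[word] += 1
            if word in hyp_words:
                recovered[word] += 1
    return {word: recovered[word] / appeared[word] for word in appeared}


recall = word_recall(
    ["cats and dogs", "salt and pepper"],
    ["cats in dogs", "salt and pepper"],
)
```

In this toy example “and” is recovered in one of its two utterances, so its recall is 0.5, while “cats” has a recall of 1.0.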

4.5 Interesting Decoding Examples

In this section, we show the outputs of the model on several utterances to demonstrate the capabilities
of LAS. All the results in this section are decoded without a dictionary or a language model.
During our experiments, we observed that LAS can learn multiple spelling variants given the same
acoustics. Table 2 shows top beams for the utterance that includes “triple a”. As can be seen,
the model produces both “triple a” and “aaa” within the top four beams. The decoder is able to
generate such varied parses, because the next step prediction model makes no assumptions on the
probability distribution by using the chain rule decomposition. It would be difficult to produce such
differing transcripts using CTC due to the conditional independence assumptions, where $p(y_i \mid x)$
is conditionally independent of $p(y_{i+1} \mid x)$. Conventional DNN-HMM systems would require both
spellings to be in the pronunciation dictionary to generate both spelling permutations.
It can also be seen that the model produced “xxx” even though acoustically “x” is very different
from “a” - this is presumably because the language model overpowers the acoustic signal in this
case. In the training corpus “xxx” is a very common phrase and we suspect the language model

[Figure 5 plot: “Word Frequency vs. Recall”; x-axis: Word Frequency (log scale, 10^0 to 10^6); y-axis: Word Recall Percentage.]
Figure 5: The correlation between word frequency in the training distribution and recall in the test distribution.
In general, rare words report worse recall compared to more frequent words.

Table 2: Example 1: “triple a” vs. “aaa” spelling variants.

Beam    Text                                 Log Probability   WER

Truth   call aaa roadside assistance         -                 -
1       call aaa roadside assistance         -0.5740           0.00
2       call triple a roadside assistance    -1.5399           50.00
3       call trip way roadside assistance    -3.5012           50.00
4       call xxx roadside assistance         -4.4375           25.00

implicit in the speller learns to associate “triple” with “xxx”. We note that “triple a” occurs 4 times
in the training distribution and “aaa” (when pronounced “triple a” rather than “a”-“a”-“a”) occurs
only once in the training distribution.
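The WER column of Table 2 can be reproduced with a standard word-level edit distance. The sketch below is a generic Levenshtein-based WER, not necessarily the scoring tool the authors used.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

For the second beam, “call triple a roadside assistance” needs one substitution and one insertion against the four-word truth, giving the 50.00% reported in the table.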
We are also surprised that the model is capable of handling utterances with repeated words despite
the fact that it uses content-based attention. Table 3 shows an example of an utterance with a repeated
word. Since LAS implements content-based attention, one would expect it to “lose its attention” during
the decoding steps and produce a word more or fewer times than the number of times the word was
spoken. As can be seen from this example, even though “seven” is repeated three times, the model
successfully outputs “seven” three times. This hints that location-based priors (e.g., location based
attention or location based regularization) may not be needed for repeated contents.

Table 3: Example 2: Repeated “seven”s.

Beam    Text                                        Log Probability   WER

Truth   eight nine four minus seven seven seven     -                 -
1       eight nine four minus seven seven seven     -0.2145           0.00
2       eight nine four nine seven seven seven      -1.9071           14.29
3       eight nine four minus seven seventy seven   -4.7316           14.29
4       eight nine four nine s seven seven seven    -5.1252           28.57

5 Conclusions

We have presented Listen, Attend and Spell (LAS), an attention-based neural network that can di-
rectly transcribe acoustic signals to characters. LAS is based on the sequence to sequence framework
with a pyramid structure in the encoder that reduces the number of timesteps that the decoder has
to attend to. LAS is trained end-to-end and has two main components. The first component, the
listener, is a pyramidal acoustic RNN encoder that transforms the input sequence into a high level
feature representation. The second component, the speller, is an RNN decoder that attends to the
high level features and spells out the transcript one character at a time. Our system does not use
the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. We bypass the
conditional independence assumptions of CTC, and show how we can learn an implicit language
model that can generate multiple spelling variants given the same acoustics. To further improve
the results, we used samples from the softmax classifier in the decoder as inputs to the next step
prediction during training. Finally, we showed how a language model trained on additional text can
be used to rerank our top hypotheses.


We thank Tara Sainath and Babak Damavandi for helping us with the data and language models, and for
helpful comments. We also thank Andrew Dai, Ashish Agarwal, Samy Bengio, Eugene Brevdo,
Greg Corrado, Jeff Dean, Rajat Monga, Christopher Olah, Mike Schuster, Noam
Shazeer, Ilya Sutskever, Vincent Vanhoucke and the Google Brain team for helpful comments, sug-
gestions and technical assistance.

[1] Nathaniel Morgan and Herve Bourlard. Continuous Speech Recognition using Multilayer Per-
ceptrons with Hidden Markov Models. In IEEE International Conference on Acoustics, Speech
and Signal Processing, 1990.
[2] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Deep belief networks for
phone recognition. In Neural Information Processing Systems: Workshop on Deep Learning
for Speech Recognition and Related Applications, 2009.
[3] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Large vocabulary continuous speech
recognition with context-dependent dbn-hmms. In IEEE International Conference on Acous-
tics, Speech and Signal Processing, 2011.
[4] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling us-
ing deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing,
20(1):14–22, 2012.
[5] Navdeep Jaitly, Patrick Nguyen, Andrew W. Senior, and Vincent Vanhoucke. Application
of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition. In INTER-
SPEECH, 2012.
[6] Tara Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep
Convolutional Neural Networks for LVCSR. In IEEE International Conference on Acoustics,
Speech and Signal Processing, 2013.
[7] Kanishka Rao, Fuchun Peng, Hasim Sak, and Francoise Beaufays. Grapheme-to-phoneme
conversion using long short-term memory recurrent neural networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing, 2015.
[8] Kaisheng Yao and Geoffrey Zweig. Sequence-to-Sequence Neural Net Models for Grapheme-
to-Phoneme Conversion. 2015.
[9] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recur-
rent neural network based language model. In INTERSPEECH, 2010.
[10] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist
Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Net-
works. In International Conference on Machine Learning, 2006.

[11] Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Con-
ference on Machine Learning: Representation Learning Workshop, 2012.
[12] Alex Graves and Navdeep Jaitly. Towards End-to-End Speech Recognition with Recurrent
Neural Networks. In International Conference on Machine Learning, 2014.
[13] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Ng. Deep Speech:
Scaling up end-to-end speech recognition. In, 2014.
[14] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end Con-
tinuous Speech Recognition using Attention-based Recurrent NN: First Results. In Neural In-
formation Processing Systems: Workshop Deep Learning and Representation Learning Work-
shop, 2014.
[15] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-Based Models for Speech Recognition. In, 2015.
[16] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by
Jointly Learning to Align and Translate. In International Conference on Learning Representa-
tions, 2015.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to Sequence Learning with Neural
Networks. In Neural Information Processing Systems, 2014.
[18] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-
Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural
Language Processing, 2014.
[19] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled Sampling for Se-
quence Prediction with Recurrent Neural Networks. In, 2015.
[20] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. Convolutional, Long Short-
Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on
Acoustics, Speech and Signal Processing, 2015.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks. In Neural Information Processing Systems, 2012.
[22] Leonard E. Baum and Ted Petrie. Statistical Inference for Probabilistic Functions of Finite
State Markov Chains. The Annals of Mathematical Statistics, 37:1554–1563, 1966.
[23] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Proba-
bilistic Models for Segmenting and Labeling Sequence Data. In International Conference on
Machine Learning, 2001.
[24] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Ad-
dressing the Rare Word Problem in Neural Machine Translation. In Association for Computa-
tional Linguistics, 2015.
[25] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On Using Very
Large Target Vocabulary for Neural Machine Translation. In Association for Computational
Linguistics, 2015.
[26] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural
Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition,
[27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov,
Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation
with Visual Attention. In International Conference on Machine Learning, 2015.
[28] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton.
Grammar as a foreign language. In, 2014.
[29] Oriol Vinyals and Quoc V. Le. A Neural Conversational Model. In International Conference
on Machine Learning: Deep Learning Workshop, 2015.
[30] Sepp Hochreiter and Jurgen Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8):1735–1780, November 1997.

[31] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid Speech Recognition with
Bidirectional LSTM. In Automatic Speech Recognition and Understanding Workshop, 2013.
[32] Salah Hihi and Yoshua Bengio. Hierarchical Recurrent Neural Networks for Long-Term De-
pendencies. In Neural Information Processing Systems, 1996.
[33] Jan Koutnik, Klaus Greff, Faustino Gomez, and Jurgen Schmidhuber. A Clockwork RNN. In
International Conference on Machine Learning, 2014.
[34] Navdeep Jaitly, Vincent Vanhoucke, and Geoffrey Hinton. Autoregressive product of multi-
frame predictions can improve the accuracy of hybrid models. In INTERSPEECH, 2014.
[35] Hasim Sak, Andrew Senior, Kanishka Rao, and Francoise Beaufays. Fast and Accurate Recur-
rent Neural Network Acoustic Models for Speech Recognition. In INTERSPEECH, 2015.
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Repre-
sentations of Words and Phrases and their Compositionality. In Neural Information Processing
Systems, 2013.
[37] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra
Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg
Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In Automatic Speech
Recognition and Understanding Workshop, 2011.
[38] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z.
Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large
Scale Distributed Deep Networks. In Neural Information Processing Systems, 2012.

A Alignment Examples
In this section, we give additional visualization examples of our model and the attention distribution.

Figure 6: The spelling variants “aaa” and “triple a” produce different attention distributions; both spelling
variants appear in our top beams. The ground truth is: “aaa emergency roadside service”.

Figure 7: The spelling variants “st” and “saint” produce different attention distributions; both spelling
variants appear in our top beams. The ground truth is: “st mary’s animal clinic”.

Figure 8: The word “cancel” is repeated three times. Note the parallel diagonals: the content attention
mechanism gets slightly confused, but the model still emits the correct hypothesis. The ground truth is:
“cancel cancel cancel”.

Published as a conference paper at ICLR 2016


Arvind Neelakantan∗ Quoc V. Le Ilya Sutskever
University of Massachusetts Amherst Google Brain Google Brain

arXiv:1511.04834v3 [cs.LG] 4 Aug 2016

Deep neural networks have achieved impressive supervised classification perfor-
mance in many tasks including image recognition, speech recognition, and se-
quence to sequence learning. However, this success has not been translated to ap-
plications like question answering that may involve complex arithmetic and logic
reasoning. A major limitation of these models is in their inability to learn even
simple arithmetic and logic operations. For example, it has been shown that neural
networks fail to learn to add two binary numbers reliably. In this work, we pro-
pose Neural Programmer, a neural network augmented with a small set of basic
arithmetic and logic operations that can be trained end-to-end using backpropaga-
tion. Neural Programmer can call these augmented operations over several steps,
thereby inducing compositional programs that are more complex than the built-in
operations. The model learns from a weak supervision signal which is the result of
execution of the correct program, hence it does not require expensive annotation
of the correct program itself. The decisions of what operations to call, and what
data segments to apply to are inferred by Neural Programmer. Such decisions,
during training, are done in a differentiable fashion so that the entire network can
be trained jointly by gradient descent. We find that training the model is diffi-
cult, but it can be greatly improved by adding random noise to the gradient. On
a fairly complex synthetic table-comprehension dataset, traditional recurrent net-
works and attentional models perform poorly while Neural Programmer typically
obtains nearly perfect accuracy.


The past few years have seen the tremendous success of deep neural networks (DNNs) in a variety of
supervised classification tasks starting with image recognition (Krizhevsky et al., 2012) and speech
recognition (Hinton et al., 2012) where the DNNs act on a fixed-length input and output. More
recently, this success has been translated into applications that involve a variable-length sequence
as input and/or output such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2014;
Luong et al., 2014), image captioning (Vinyals et al., 2015; Xu et al., 2015), conversational model-
ing (Shang et al., 2015; Vinyals & Le, 2015), end-to-end Q&A (Sukhbaatar et al., 2015; Peng et al.,
2015; Hermann et al., 2015), and end-to-end speech recognition (Graves & Jaitly, 2014; Hannun
et al., 2014; Chan et al., 2015; Bahdanau et al., 2015).
While these results strongly indicate that DNN models are capable of learning the fuzzy underlying
patterns in the data, they have not had similar impact in applications that involve crisp reasoning.
A major limitation of these models is in their inability to learn even simple arithmetic and logic
operations. For example, Joulin & Mikolov (2015) show that recurrent neural networks (RNNs) fail
at the task of adding two binary numbers even when the result has fewer than 10 bits. This makes
existing DNN models unsuitable for downstream applications that require complex reasoning, e.g.,
natural language question answering. For example, to answer the question “how many states border
Texas?” (see Zettlemoyer & Collins (2005)), the algorithm has to perform an act of counting in a
table which is something that a neural network is not yet good at.

Work done during an internship at Google.


A fairly common method for solving these problems is program induction where the goal is to find
a program (in SQL or some high-level languages) that can correctly solve the task. An application
of these models is in semantic parsing where the task is to build a natural language interface to a
structured database (Zelle & Mooney, 1996). This problem is often formulated as mapping a natural
language question to an executable query.
A drawback of existing methods in semantic parsing is that they are difficult to train and require
a great deal of human supervision. As the space over programs is non-smooth, it is difficult to
apply simple gradient descent; most often, gradient descent is augmented with a complex search
procedure, such as sampling (Liang et al., 2010). To further simplify training, the algorithmic de-
signers have to manually add more supervision signals to the models in the form of annotation of the
complete program for every question (Zettlemoyer & Collins, 2005) or a domain-specific grammar
(Liang et al., 2011). For example, this involves designing grammars that contain rules to associate lexical
items with the correct operations, e.g., the word “largest” with the operation “argmax”, or to produce
syntactically valid programs, e.g., to disallow the program >= dog. The role of hand-crafted grammars is crucial in
semantic parsing yet also limits its general applicability to many different domains. In a recent work
by Wang et al. (2015) to build semantic parsers for 7 domains, the authors hand engineer a separate
grammar for each domain.
The goal of this work is to develop a model that does not require substantial human supervision
and is broadly applicable across different domains, data sources and natural languages. We propose
Neural Programmer (Figure 1), a neural network augmented with a small set of basic arithmetic
and logic operations that can be trained end-to-end using backpropagation. In our formulation, the
neural network can run several steps using a recurrent neural network. At each step, it can select a
segment in the data source and a particular operation to apply to that segment. The neural network
propagates these outputs forward at every step to form the final, more complicated output. Using
the target output, we can adjust the network to select the right data segments and operations, thereby
inducing the correct program. Key to our approach is that the selection process (for the data source
and operations) is done in a differentiable fashion (i.e., soft selection or attention), so that the whole
neural network can be trained jointly by gradient descent. At test time, we replace soft selection
with hard selection.
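The difference between soft selection during training and hard selection at test time can be illustrated with a small sketch. The scalar operation outputs, the function names and the use of a softmax over controller scores are simplifying assumptions; the full model also selects data segments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(scores, op_outputs):
    """Training: differentiable, probability-weighted average of all operation outputs."""
    probs = softmax(scores)
    return sum(p * o for p, o in zip(probs, op_outputs))


def hard_select(scores, op_outputs):
    """Test time: output of the single highest-scoring operation."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return op_outputs[best]
```

Because `soft_select` is a smooth function of the scores, gradients can flow back to whatever produced them, whereas `hard_select` commits to a single discrete choice.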

[Figure 1 diagram; block labels: Input, Controller, Soft, Apply, Arithmetic and logic operations, Data, Memory, Output; timestep t = 1, 2, …, T.]

Figure 1: The architecture of Neural Programmer, a neural network augmented with arithmetic and
logic operations. The controller selects the operation and the data segment. The memory stores the
output of the operations applied to the data segments and the previous actions taken by the controller.
The controller runs for several steps thereby inducing compositional programs that are more complex
than the built-in operations. The dotted line indicates that the controller uses information in the
memory to make decisions in the next time step.

By combining a neural network with mathematical operations, we can utilize both the fuzzy pattern
matching capabilities of deep networks and the crisp algorithmic power of traditional programmable
computers. This approach of using an augmented logic and arithmetic component is reminiscent of
the idea of using an ALU (arithmetic and logic unit) in a conventional computer (Von Neumann,
1945). It is loosely related to the symbolic numerical processing abilities exhibited in the intrapari-
etal sulcus (IPS) area of the brain (Piazza et al., 2004; Cantlon et al., 2006; Kucian et al., 2006; Fias
et al., 2007; Dastjerdi et al., 2013). Our work is also inspired by the success of the soft attention
mechanism (Bahdanau et al., 2014) and its application in learning a neural network to control an
additional memory component (Graves et al., 2014; Sukhbaatar et al., 2015).


Neural Programmer has two attractive properties. First, it learns from a weak supervision signal
which is the result of execution of the correct program. It does not require the expensive annotation
of the correct program for the training examples. The human supervision effort is in the form of
question, data source and answer triples. Second, Neural Programmer does not require additional
rules to guide the program search, making it a general framework. With Neural Programmer, the
algorithm designer only defines a list of basic operations, which requires less human effort than
previous program induction techniques.
We experiment with a synthetic table-comprehension dataset, consisting of questions with a wide
range of difficulty levels. Examples of natural language translated queries include “print elements in
column H whose field in column C is greater than 50 and field in column E is less than 20?” or “what
is the difference between sum of elements in column A and number of rows in the table?”. We find
that LSTM recurrent networks (Hochreiter & Schmidhuber, 1997) and LSTM models with attention
(Bahdanau et al., 2014) do not work well. Neural Programmer, however, can completely solve this
task or achieve greater than 99% accuracy in most cases by inducing the required latent program.
We find that training the model is difficult, but it can be greatly improved by injecting random
Gaussian noise to the gradient (Welling & Teh, 2011; Neelakantan et al., 2016) which enhances the
generalization ability of the Neural Programmer.

Even though our model is quite general, in this paper, we apply Neural Programmer to the task of
question answering on tables, a task that has not been previously attempted by neural networks.
In our implementation for this task, Neural Programmer is run for a total of T time steps chosen
in advance to induce compositional programs of up to T operations. The model consists of four
modules:

• A question Recurrent Neural Network (RNN) to process the input question,

• A selector to assign two probability distributions at every step, one over the set of operations
and the other over the data segments,
• A list of operations that the model can apply, and
• A history RNN to remember the previous operations and data segments selected by the
model till the current time step.

These four modules are also shown in Figure 2. The history RNN combined with the selector module
functions as the controller in this case. Each component is discussed in the following sections.

Figure 2: An implementation of Neural Programmer for the task of question answering on tables.
The output of the model at time step t is obtained by applying the operations on the data segments
weighted by their probabilities. The final output of the model is the output at time step T . The dotted
line indicates the input to the history RNN at step t+1.

Apart from the list of operations, all the other modules are learned using gradient descent on a
training set consisting of triples, where each triple has a question, a data source and an answer. We
assume that the data source is in the form of a table, $\text{table} \in \mathbb{R}^{M \times C}$, containing $M$ rows and $C$
columns ($M$ and $C$ can vary amongst examples). The data segments in our experiments are the
columns, where each column also has a column name.


The question module converts the question tokens to a distributed representation. In the basic version
of our model, we use a simple RNN (Werbos, 1990) parameterized by $W^{question}$ and the last hidden
state of the RNN is used as the question representation (Figure 3).


Figure 3: The question module to process the input question. q = zq denotes the question represen-
tation used by Neural Programmer.

Consider an input question containing $Q$ words $\{w_1, w_2, \ldots, w_Q\}$. The question module performs
the following computation:
$$z_i = \tanh(W^{question}[z_{i-1}; V(w_i)]), \quad \forall i = 1, 2, \ldots, Q$$
where $V(w_i) \in \mathbb{R}^d$ is the embedded representation of the word $w_i$, $[a; b] \in \mathbb{R}^{2d}$ is
the concatenation of two vectors $a, b \in \mathbb{R}^d$, $W^{question} \in \mathbb{R}^{d \times 2d}$ is the recurrent matrix of the
question RNN, tanh is the element-wise non-linearity and $z_Q \in \mathbb{R}^d$ is the representation
of the question. We set $z_0$ to $[0]^d$. We pre-process the question by removing numbers from it and
storing the numbers in a separate list. Along with each number we store the word that appeared to its
left in the question, which is useful for computing the pivot values for the comparison operations
described in Section 2.3.
For tasks that involve longer questions, we use a bidirectional RNN since we find that a simple
unidirectional RNN has trouble remembering the beginning of the question. When the bidirectional
RNN is used, the question representation is obtained by concatenating the last hidden states of the
two-ends of the bidirectional RNNs. The question representation is denoted by q.
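
The unidirectional recurrence of the question module can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (`matvec`, `question_rnn`), the toy dimension d = 2 and the hand-picked weights are hypothetical and do not come from the paper, and plain Python lists stand in for tensors.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def question_rnn(token_embeddings, W_question, d):
    """Return the hidden states z_1..z_Q of the simple question RNN.

    token_embeddings: list of Q vectors V(w_i), each of length d.
    W_question: d x 2d recurrent matrix (list of d rows of length 2d).
    """
    z = [0.0] * d  # z_0 = [0]^d
    states = []
    for v in token_embeddings:
        # z_i = tanh(W_question [z_{i-1}; V(w_i)]); list "+" is concatenation.
        z = [math.tanh(a) for a in matvec(W_question, z + v)]
        states.append(z)
    return states

# Toy usage with d = 2 and illustrative (not learned) weights.
d = 2
W_question = [[0.5, 0.0, 1.0, 0.0],
              [0.0, 0.5, 0.0, 1.0]]
embeddings = [[0.1, -0.2], [0.3, 0.4]]
states = question_rnn(embeddings, W_question, d)
q = states[-1]  # question representation q = z_Q
```

All hidden states are returned, not only the last one, because later components (pivot computation and text match) read the intermediate states as well.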


The selector produces two probability distributions at every time step $t$ ($t = 1, 2, \ldots, T$): one
probability distribution over the set of operations and another over the set
of columns. The inputs to the selector are the question representation ($q \in \mathbb{R}^d$) from the question
module and the output of the history RNN (described in Section 2.4) at time step $t$ ($h_t \in \mathbb{R}^d$), which
stores information about the operations and columns selected by the model up to the previous step.
Each operation is represented using a $d$-dimensional vector. Let the number of operations be $O$ and
let $U \in \mathbb{R}^{O \times d}$ be the matrix storing the representations of the operations.
Operation Selection is performed by:
$$\alpha_t^{op} = \text{softmax}(U \tanh(W^{op}[q; h_t]))$$
where $W^{op} \in \mathbb{R}^{d \times 2d}$ is the parameter matrix of the operation selector that produces the probability
distribution $\alpha_t^{op} \in [0, 1]^O$ over the set of operations (Figure 4).
The selector also produces a probability distribution over the columns at every time step. We obtain
vector representations for the column names using the parameters in the question module (Section
2.1) by word embedding or an RNN phrase embedding. Let $P \in \mathbb{R}^{C \times d}$ be the matrix storing the
representations of the column names.
Data Selection is performed by:


Figure 4: Operation selection at time step t where the selector assigns a probability distribution over
the set of operations.

$$\alpha_t^{col} = \text{softmax}(P \tanh(W^{col}[q; h_t]))$$

where $W^{col} \in \mathbb{R}^{d \times 2d}$ is the parameter matrix of the column selector that produces the probability
distribution $\alpha_t^{col} \in [0, 1]^C$ over the set of columns (Figure 5).


Figure 5: Data selection at time step t where the selector assigns a probability distribution over the
set of columns.
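
Operation selection and data selection share the same functional form, softmax(E tanh(W [q; h_t])), with E being either the operation matrix U or the column-name matrix P. The sketch below makes that explicit; the function names and the toy weights are assumptions made for illustration only.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select(E, W, q, h_t):
    """Return softmax(E tanh(W [q; h_t])).

    E is U (O x d) for operation selection or P (C x d) for data selection;
    W is the matching d x 2d selector matrix (W_op or W_col).
    """
    hidden = [math.tanh(a) for a in matvec(W, q + h_t)]
    return softmax(matvec(E, hidden))

# Toy usage: d = 2, three operations, illustrative weights.
q, h_t = [0.2, -0.1], [0.0, 0.3]
W_op = [[1.0, 0.0, 0.5, 0.0],
        [0.0, 1.0, 0.0, 0.5]]
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
alpha_op = select(U, W_op, q, h_t)  # probability distribution over 3 operations
```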


Neural Programmer currently supports two types of outputs: a) a scalar output, and b) a list of items
selected from the table (i.e., table lookup).$^1$ The first type of output is for questions of type “Sum
of elements in column C” while the second type of output is for questions of type “Print elements
in column A that are greater than 50.” To facilitate this, the model maintains two kinds of output
variables at every step $t$, $\textit{scalar\_answer}_t \in \mathbb{R}$ and $\textit{lookup\_answer}_t \in [0, 1]^{M \times C}$. The output
$\textit{lookup\_answer}_t[i][j]$ stores the probability that the element $(i, j)$ in the table is part of the output.
The final output of the model is $\textit{scalar\_answer}_T$ or $\textit{lookup\_answer}_T$, depending on whichever
of the two is updated after $T$ time steps. Apart from the two output variables, the model maintains
an additional variable $\textit{row\_select}_t \in [0, 1]^M$ that is updated at every time step. The variables
$\textit{row\_select}_t[i]$ ($\forall i = 1, 2, \ldots, M$) maintain the probability of selecting row $i$ and allow the model
to dynamically select a subset of rows within a column. The output is initialized to zero while the
row select variable is initialized to $[1]^M$.
Key to Neural Programmer are the built-in operations, which have access to the outputs of the
model at every time step before the current time step $t$, i.e., the operations have access to
$(\textit{scalar\_answer}_i, \textit{lookup\_answer}_i)$, $\forall i = 1, 2, \ldots, t - 1$. This enables the model to build powerful
compositional programs.
It is important to design the operations such that they can work with probabilistic row and column
selection so that the model is differentiable. Table 1 shows the list of operations built into the model
along with their definitions. The reset operation can be selected any number of times, which, when
required, allows the model to induce programs whose complexity is less than $T$ steps.

Footnote 1: It is trivial to extend the model to support general text responses by adding a decoder RNN to generate text


| Type | Operation | Definition |
|---|---|---|
| Aggregate | Sum | $sum_t[j] = \sum_{i=1}^{M} \textit{row\_select}_{t-1}[i] * table[i][j], \forall j = 1, 2, \ldots, C$ |
| Aggregate | Count | $count_t = \sum_{i=1}^{M} \textit{row\_select}_{t-1}[i]$ |
| Arithmetic | Difference | $diff_t = \textit{scalar\_answer}_{t-3} - \textit{scalar\_answer}_{t-1}$ |
| Comparison | Greater | $g_t[i][j] = table[i][j] > pivot_g, \forall (i, j), i = 1, \ldots, M, j = 1, \ldots, C$ |
| Comparison | Lesser | $l_t[i][j] = table[i][j] < pivot_l, \forall (i, j), i = 1, \ldots, M, j = 1, \ldots, C$ |
| Logic | And | $and_t[i] = \min(\textit{row\_select}_{t-1}[i], \textit{row\_select}_{t-2}[i]), \forall i = 1, 2, \ldots, M$ |
| Logic | Or | $or_t[i] = \max(\textit{row\_select}_{t-1}[i], \textit{row\_select}_{t-2}[i]), \forall i = 1, 2, \ldots, M$ |
| Assign Lookup | assign | $assign_t[i][j] = \textit{row\_select}_{t-1}[i], \forall (i, j), i = 1, 2, \ldots, M, j = 1, 2, \ldots, C$ |
| Reset | Reset | $reset_t[i] = 1, \forall i = 1, 2, \ldots, M$ |

Table 1: List of operations along with their definitions at time step $t$. $\text{table} \in \mathbb{R}^{M \times C}$ is the data
source in the form of a table and $\textit{row\_select}_t \in [0, 1]^M$ functions as a row selector.

While the definitions of the operations are fairly straightforward, the comparison operations greater
and lesser require a pivot value as input (refer to Table 1), which appears in the question. Let
$qn_1, qn_2, \ldots, qn_N$ be the numbers that appear in the question.
For every comparison operation (greater and lesser), we compute its pivot value by adding up all the
numbers in the question, each weighted by the probability assigned to it, which is computed using
the hidden vector at the position to the left of the number,$^2$ and the operation’s embedding vector. More
precisely:
$$\beta_{op} = \text{softmax}(Z\,U(op))$$
$$pivot_{op} = \sum_{i=1}^{N} \beta_{op}(i)\, qn_i$$

where $U(op) \in \mathbb{R}^d$ is the vector representation of operation $op$ ($op \in \{\text{greater}, \text{lesser}\}$) and
$Z \in \mathbb{R}^{N \times d}$ is the matrix storing the hidden vectors of the question RNN at positions to the left of the
occurrence of the numbers.
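
As a small worked sketch of the pivot computation: each number in the question is scored by the dot product between the hidden vector to its left and the operation embedding, the scores are normalized with a softmax, and the pivot is the resulting weighted sum of the numbers. The function names and the toy vectors below are illustrative assumptions, not values from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pivot_value(Z, u_op, numbers):
    """pivot_op = sum_i beta_op(i) * qn_i with beta_op = softmax(Z U(op)).

    Z: N x d hidden vectors at the positions left of each number.
    u_op: d-dimensional embedding of the comparison operation.
    numbers: the N numbers qn_1..qn_N extracted from the question.
    """
    scores = [sum(z_k * u_k for z_k, u_k in zip(z, u_op)) for z in Z]
    beta = softmax(scores)
    return sum(b * n for b, n in zip(beta, numbers))

# Toy usage: the first hidden vector aligns strongly with the operation
# embedding, so nearly all weight goes to the first number (50.0).
Z = [[10.0, 0.0], [-10.0, 0.0]]
u_greater = [1.0, 0.0]
pivot_g = pivot_value(Z, u_greater, [50.0, 20.0])
```

When the scores are equal, the pivot falls back to the plain average of the numbers, which is one way to see why a sharp attention distribution matters for the comparison operations.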
By overloading the definition of $\alpha_t^{op}$ and $\alpha_t^{col}$, let $\alpha_t^{op}(x)$ and $\alpha_t^{col}(j)$ denote the probability assigned
by the selector to operation $x$ ($x \in \{$sum, count, difference, greater, lesser, and, or, assign, reset$\}$)
and column $j$ ($\forall j = 1, 2, \ldots, C$) at time step $t$ respectively.
Figure 6 shows how the output and row selector variables are computed. The output and row selector
variables at a step are obtained by additively combining the outputs of the individual operations on the
different data segments, weighted by the corresponding probabilities assigned by the model.


Figure 6: The output and row selector variables are obtained by applying the operations on the data
segments and additively combining their outputs weighted using the probabilities assigned by the
selector.

Footnote 2: This choice is made to reflect the common case in English where the pivot number is usually mentioned
after the operation, but it is trivial to extend it to use hidden vectors both to the left and the right of the number.


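More formally, the output variables are given by:

$$\textit{scalar\_answer}_t = \alpha_t^{op}(\text{count})\, count_t + \alpha_t^{op}(\text{difference})\, diff_t + \sum_{j=1}^{C} \alpha_t^{col}(j)\, \alpha_t^{op}(\text{sum})\, sum_t[j],$$

$$\textit{lookup\_answer}_t[i][j] = \alpha_t^{col}(j)\, \alpha_t^{op}(\text{assign})\, assign_t[i][j], \quad \forall (i, j),\ i = 1, 2, \ldots, M,\ j = 1, 2, \ldots, C$$

The row selector variable is given by:
$$\textit{row\_select}_t[i] = \alpha_t^{op}(\text{and})\, and_t[i] + \alpha_t^{op}(\text{or})\, or_t[i] + \alpha_t^{op}(\text{reset})\, reset_t[i] + \sum_{j=1}^{C} \alpha_t^{col}(j) \left( \alpha_t^{op}(\text{greater})\, g_t[i][j] + \alpha_t^{op}(\text{lesser})\, l_t[i][j] \right), \quad \forall i = 1, \ldots, M$$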

It is important to note that other operations like equal to, max, min, not etc. can be built into this
model easily.
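
To make the soft update concrete, the sketch below evaluates the operations of Table 1 on a small numeric table and combines them with the selector probabilities, following the output and row selector equations. It is a minimal illustration under assumptions: the function and argument names are hypothetical, the selector distributions are passed in as one-hot vectors for readability, and plain Python lists replace tensors.

```python
OPS = ["sum", "count", "difference", "greater", "lesser",
       "and", "or", "assign", "reset"]

def one_hot_op(name):
    """Selector distribution that puts all probability on one operation."""
    return {op: (1.0 if op == name else 0.0) for op in OPS}

def soft_step(table, a_op, a_col, row_prev, row_prev2,
              scalar_prev, scalar_prev3, pivot_g, pivot_l):
    """One soft update of (scalar_answer, lookup_answer, row_select).

    row_prev / row_prev2: row_select at steps t-1 / t-2.
    scalar_prev / scalar_prev3: scalar_answer at steps t-1 / t-3.
    """
    M, C = len(table), len(table[0])
    rows, cols = range(M), range(C)
    # Operation outputs as defined in Table 1.
    sum_t = [sum(row_prev[i] * table[i][j] for i in rows) for j in cols]
    count_t = sum(row_prev)
    diff_t = scalar_prev3 - scalar_prev
    greater = [[float(table[i][j] > pivot_g) for j in cols] for i in rows]
    lesser = [[float(table[i][j] < pivot_l) for j in cols] for i in rows]
    and_t = [min(row_prev[i], row_prev2[i]) for i in rows]
    or_t = [max(row_prev[i], row_prev2[i]) for i in rows]
    # Probability-weighted combination of the operation outputs.
    scalar = (a_op["count"] * count_t + a_op["difference"] * diff_t
              + sum(a_col[j] * a_op["sum"] * sum_t[j] for j in cols))
    lookup = [[a_col[j] * a_op["assign"] * row_prev[i] for j in cols]
              for i in rows]
    row_select = [
        a_op["and"] * and_t[i] + a_op["or"] * or_t[i] + a_op["reset"] * 1.0
        + sum(a_col[j] * (a_op["greater"] * greater[i][j]
                          + a_op["lesser"] * lesser[i][j]) for j in cols)
        for i in rows
    ]
    return scalar, lookup, row_select

table = [[10.0, 60.0], [20.0, 40.0], [30.0, 70.0]]
ones = [1.0, 1.0, 1.0]
# Step 1: "greater" on column 1 with pivot 50 keeps rows 0 and 2.
_, _, rows1 = soft_step(table, one_hot_op("greater"), [0.0, 1.0],
                        ones, ones, 0.0, 0.0, 50.0, 0.0)
# Step 2: "sum" on column 0 over the rows selected at step 1.
answer, _, _ = soft_step(table, one_hot_op("sum"), [1.0, 0.0],
                         rows1, ones, 0.0, 0.0, 50.0, 0.0)
```

During training the distributions are soft rather than one-hot, so every operation contributes a little to each variable; this is what keeps the whole step differentiable.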


So far, our discussion has been only concerned with tables that have numeric entries. In this section
we describe how Neural Programmer handles text entries in the input table. We assume a column
can contain either numeric or text entries. An example query is “what is the sum of elements in
column B whose field in column C is word:1 and field in column A is word:7?”. In other words, the
query is looking for text entries in the column that match specified words in the questions. To answer
these queries, we add a text match operation that updates the row selector variable appropriately. In
our implementation, the parameters for vector representations of the column’s text entries are shared
with the question module.
The text match operation uses a two-stage soft attention mechanism, back and forth from the text
entries to question module. In the following, we explain its implementation in detail.
Let $TC_1, TC_2, \ldots, TC_K$ be the set of columns that each have $M$ text entries and let $A \in \mathbb{R}^{M \times K \times d}$
store the vector representations of the text entries. In the first stage, the question representation
coarsely selects the appropriate text entries through the sigmoid operation. Concretely, the coarse
selection, $B$, is given by the sigmoid of the dot product between the vector representations of the text entries, $A$,
and the question representation, $q$:

$$B[m][k] = \text{sigmoid}\left( \sum_{p=1}^{d} A[m][k][p] \cdot q[p] \right), \quad \forall (m, k),\ m = 1, \ldots, M,\ k = 1, \ldots, K$$

To obtain question-specific column representations, $D$, we use $B$ as weighting factors to compute
the weighted average of the vector representations of the text entries in that column:

$$D[k][p] = \frac{1}{M} \sum_{m=1}^{M} \left( B[m][k] \cdot A[m][k][p] \right), \quad \forall (k, p),\ k = 1, \ldots, K,\ p = 1, \ldots, d$$

To allow different words in the question to be matched to the corresponding columns (e.g., match
word:1 in column C and match word:7 in column A for the question “what is the sum of elements in
column B whose field in column C is word:1 and field in column A is word:7?”), we add the column
name representations (described in Section 2.2), $P$, to $D$ to obtain column representations $E$. This
makes the representation also sensitive to the column name.
In the second stage, we use $E$ to compute an attention over the hidden states of the question RNN
to get an attention vector $G$ for each column of the input table. More concretely, we compute the dot
product between $E$ and the hidden states of the question RNN to obtain scalar values. We then
pass them through softmax to obtain weighting factors for each hidden state. $G$ is the weighted
combination of the hidden states of the question RNN.
Finally, text match selection is done by:
$$\textit{text\_match}[m][k] = \text{sigmoid}\left( \sum_{p=1}^{d} A[m][k][p] \cdot G[k][p] \right), \quad \forall (m, k),\ m = 1, \ldots, M,\ k = 1, \ldots, K$$

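Without loss of generality, let the first $K$ ($K \in [0, 1, \ldots, C]$) columns out of the $C$ columns of the table
contain text entries while the remaining contain numeric entries. The row selector variable now is
given by:
$$\textit{row\_select}_t[i] = \alpha_t^{op}(\text{and})\, and_t[i] + \alpha_t^{op}(\text{or})\, or_t[i] + \alpha_t^{op}(\text{reset})\, reset_t[i] + \sum_{j=K+1}^{C} \alpha_t^{col}(j) \left( \alpha_t^{op}(\text{greater})\, g_t[i][j] + \alpha_t^{op}(\text{lesser})\, l_t[i][j] \right) + \sum_{j=1}^{K} \alpha_t^{col}(j) \left( \alpha_t^{op}(\text{text match})\, \textit{text\_match}_t[i][j] \right), \quad \forall i = 1, \ldots, M$$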

The two-stage mechanism is required since in our experiments we find that simply averaging the
vector representations fails to make the representation of the column specific enough to the question.
Unless otherwise stated, our experiments are with input tables whose entries are only numeric and
in that case the model does not contain the text match operation.
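
The two-stage text match described above can be sketched as follows. This is an illustrative reading of the equations, with hypothetical names and toy inputs: `A` holds the M x K x d entry vectors, `P` the K column-name vectors of the text columns, and `Z` the hidden states of the question RNN.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_match(A, q, P, Z):
    """Two-stage soft attention; returns an M x K matrix of match scores."""
    M, K, d = len(A), len(A[0]), len(q)
    # Stage 1: coarse selection B, then question-specific column vectors D.
    B = [[sigmoid(dot(A[m][k], q)) for k in range(K)] for m in range(M)]
    D = [[sum(B[m][k] * A[m][k][p] for m in range(M)) / M for p in range(d)]
         for k in range(K)]
    # Adding the column-name vectors makes E sensitive to the column name.
    E = [[D[k][p] + P[k][p] for p in range(d)] for k in range(K)]
    # Stage 2: each column attends over the question RNN hidden states.
    G = []
    for k in range(K):
        weights = softmax([dot(E[k], z) for z in Z])
        G.append([sum(w * z[p] for w, z in zip(weights, Z)) for p in range(d)])
    return [[sigmoid(dot(A[m][k], G[k])) for k in range(K)] for m in range(M)]

# Toy usage: one text column, two rows with opposite entry vectors.
A = [[[1.0, 0.0]], [[-1.0, 0.0]]]
q = [1.0, 0.0]
P = [[0.0, 0.0]]
Z = [[1.0, 0.0]]
scores = text_match(A, q, P, Z)
```

In this toy case the first row, whose entry vector points in the same direction as the question state, receives the higher match score.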


The history RNN keeps track of the previous operations and columns selected by the selector module
so that the model can induce compositional programs. This information is encoded in the hidden
vector of the history RNN at time step $t$, $h_t \in \mathbb{R}^d$. This helps the selector module to induce the
probability distributions over the operations and columns by taking into account the previous actions
selected by the model. Figure 7 shows details of this component.


Figure 7: The history RNN which helps in remembering the previous operations and data segments
selected by the model. The dotted line indicates the input to the history RNN at step t+1.

The input to the history RNN at time step $t$, $c_t \in \mathbb{R}^{2d}$, is obtained by concatenating the
representations of operations and column names, weighted by their corresponding probability distributions
produced by the selector at step $t - 1$. More precisely:
$$c_t = [(\alpha_{t-1}^{op})^T U; (\alpha_{t-1}^{col})^T P]$$
The hidden state of the history RNN at step $t$ is computed as:
$$h_t = \tanh(W^{history}[c_t; h_{t-1}]), \quad \forall t = 1, 2, \ldots, T$$
where $W^{history} \in \mathbb{R}^{d \times 3d}$ is the recurrent matrix of the history RNN, and $h_t \in \mathbb{R}^d$ is the current
representation of the history. The history vector at time $t = 1$, $h_1$, is set to $[0]^d$.
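
The history update can be illustrated with a short sketch: the previous selector distributions pick out a soft mixture of operation and column-name vectors, which is concatenated and fed to the recurrence. Names and toy values below are assumptions for illustration.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def history_input(alpha_op_prev, alpha_col_prev, U, P):
    """c_t = [(alpha_op_{t-1})^T U ; (alpha_col_{t-1})^T P], a 2d-vector."""
    d = len(U[0])
    op_part = [sum(a * U[o][k] for o, a in enumerate(alpha_op_prev))
               for k in range(d)]
    col_part = [sum(a * P[j][k] for j, a in enumerate(alpha_col_prev))
                for k in range(d)]
    return op_part + col_part

def history_step(c_t, h_prev, W_history):
    """h_t = tanh(W_history [c_t ; h_{t-1}]) with W_history of size d x 3d."""
    return [math.tanh(a) for a in matvec(W_history, c_t + h_prev)]

# Toy usage: d = 2, two operations, one column, illustrative weights.
U = [[1.0, 2.0], [3.0, 4.0]]
P = [[5.0, 6.0]]
c_t = history_input([0.0, 1.0], [1.0], U, P)  # one-hot picks U[1] and P[0]
W_history = [[0.1] * 6, [-0.1] * 6]
h_t = history_step(c_t, [0.0, 0.0], W_history)
```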



The parameters of the model include the parameters of the question RNN, $W^{question}$, the parameters
of the history RNN, $W^{history}$, word embeddings $V(\cdot)$, operation embeddings $U$, and the operation selector
and column selector matrices, $W^{op}$ and $W^{col}$ respectively. During training, depending on whether
the answer is a scalar or a lookup from the table, we have two different loss functions.
When the answer is a scalar, we use Huber loss (Huber, 1964) given by:
$$L_{scalar}(\textit{scalar\_answer}_T, y) = \begin{cases} \frac{1}{2} a^2, & \text{if } a \le \delta \\ \delta a - \frac{1}{2} \delta^2, & \text{otherwise} \end{cases}$$
where $a = |\textit{scalar\_answer}_T - y|$ is the absolute difference between the predicted and true answer,
and $\delta$ is the Huber constant treated as a model hyper-parameter. In our experiments, we find that
using square loss makes training unstable while using the absolute loss makes the optimization
difficult near the non-differentiable point.
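
A direct transcription of the Huber loss, given here as a small sketch (the function name is an assumption):

```python
def huber_loss(prediction, target, delta):
    """Quadratic for small errors, linear for errors larger than delta."""
    a = abs(prediction - target)
    if a <= delta:
        return 0.5 * a * a
    return delta * a - 0.5 * delta * delta

small = huber_loss(3.0, 0.0, 10.0)    # quadratic branch: 0.5 * 3^2
large = huber_loss(30.0, 0.0, 10.0)   # linear branch: 10 * 30 - 0.5 * 10^2
```

The two branches agree at a = δ, so the loss is continuous there, and its gradient magnitude is capped at δ for large errors, which is the property that avoids the instability observed with square loss.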
When the answer is a list of items selected from the table, we convert the answer to $y \in \{0, 1\}^{M \times C}$,
where $y[i, j]$ indicates whether the element $(i, j)$ is part of the output. In this case we use log-loss
over the set of elements in the table given by:
$$L_{lookup}(\textit{lookup\_answer}_T, y) = -\frac{1}{MC} \sum_{i=1}^{M} \sum_{j=1}^{C} \Big( y[i, j] \log(\textit{lookup\_answer}_T[i, j]) + (1 - y[i, j]) \log(1 - \textit{lookup\_answer}_T[i, j]) \Big)$$

The training objective of the model is given by:

$$L = \frac{1}{N} \sum_{k=1}^{N} \left( [n_k == True]\, L_{scalar}^{(k)} + [n_k == False]\, \lambda L_{lookup}^{(k)} \right)$$

where $N$ is the number of training examples, $L_{scalar}^{(k)}$ and $L_{lookup}^{(k)}$ are the scalar and lookup loss on the
$k$-th example, $n_k$ is a boolean random variable which is set to True when the $k$-th example’s answer
is a scalar and set to False when the answer is a lookup, and $\lambda$ is a hyper-parameter of the model
that allows the two loss functions to be weighted appropriately.
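
The lookup loss and the combined objective can be sketched as below. The clamping constant `eps` is an assumption added here to keep the logarithm finite; it is not part of the formulation above. Per-example losses are passed in as precomputed numbers to keep the example short.

```python
import math

def lookup_loss(lookup_answer, y, eps=1e-12):
    """Mean binary log-loss over all M x C table cells."""
    M, C = len(y), len(y[0])
    total = 0.0
    for i in range(M):
        for j in range(C):
            # Clamp probabilities away from 0 and 1 (assumption, see lead-in).
            p = min(max(lookup_answer[i][j], eps), 1.0 - eps)
            total += y[i][j] * math.log(p) + (1 - y[i][j]) * math.log(1.0 - p)
    return -total / (M * C)

def training_objective(examples, lam):
    """Average of scalar losses and lambda-weighted lookup losses.

    examples: list of (is_scalar, loss_value) pairs, one per training example.
    """
    total = sum(loss if is_scalar else lam * loss
                for is_scalar, loss in examples)
    return total / len(examples)

cell_loss = lookup_loss([[0.5, 0.5]], [[1, 0]])          # equals log(2)
objective = training_objective([(True, 2.0), (False, 1.0)], lam=50.0)
```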
At inference time, we replace the three softmax layers in the model with the conventional
maximum (hardmax) operation and the final output of the model is either $\textit{scalar\_answer}_T$ or
$\textit{lookup\_answer}_T$, depending on whichever among them is updated after $T$ time steps. Algorithm 1
gives a high-level view of Neural Programmer during inference.
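
A hardmax can be written as a one-hot vector at the position of the largest score, as in the following sketch. Breaking ties in favor of the first maximum is an assumption made here; the text does not specify tie handling.

```python
def hardmax(scores):
    """One-hot vector with a 1.0 at the index of the largest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

choice = hardmax([0.1, 2.0, -1.0])
```

With one-hot selections, each step applies exactly one operation to exactly one column, so the induced program can be read off directly from the selected indices.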

Neural Programmer is faced with many challenges, specifically: 1) can the model learn the param-
eters of the different modules with delayed supervision after T steps? 2) can it exhibit composi-
tionality by generalizing to unseen questions? and 3) can the question module handle the variability
and ambiguity of natural language? In our experiments, we mainly focus on answering the first two
questions using synthetic data. Our reason for using synthetic data is that it is easier to understand a
new model with a synthetic dataset. We can generate the data in a large quantity, whereas the biggest
real-world semantic parsing datasets we know of contain only about 14k training examples (Pasupat
& Liang, 2015), which is very small by neural network standards. In one of our experiments,
we introduce simple word-level variability to simulate one aspect of the difficulties in dealing with
natural language input.

Algorithm 1 High-level view of Neural Programmer during its inference stage for an input example.
1: Input: $\text{table} \in \mathbb{R}^{M \times C}$ and question
2: Initialize: $\textit{scalar\_answer}_0 = 0$, $\textit{lookup\_answer}_0 = 0^{M \times C}$, $\textit{row\_select}_0 = 1^M$, history vector
   at time $t = 0$, $h_0 = 0^d$, and input to history RNN at time $t = 0$, $c_0 = 0^{2d}$
3: Preprocessing: Remove numbers from the question and store them in a list along with the words
   that appear to their left. The tokens in the input question are $\{w_1, w_2, \ldots, w_Q\}$.
4: Question Module: Run the question RNN on the preprocessed question to get the question
   representation $q$ and the list of hidden states $z_1, z_2, \ldots, z_Q$
5: Pivot numbers: $pivot_g$ and $pivot_l$ are computed using hidden states from the question RNN and
   operation representations $U$
6: for $t = 1, 2, \ldots, T$ do
7:    Compute history vector $h_t$ by passing input $c_t$ to the history RNN
8:    Operation selection using $q$, $h_t$ and operation representations $U$
9:    Data selection on table using $q$, $h_t$ and column representations $P$
10:   Update $\textit{scalar\_answer}_t$, $\textit{lookup\_answer}_t$ and $\textit{row\_select}_t$ using the selected operation and column
11:   Compute input to the history RNN at time $t + 1$, $c_{t+1}$
12: end for
13: Output: $\textit{scalar\_answer}_T$ or $\textit{lookup\_answer}_T$ depending on whichever of the two is updated
    at step $T$

3.1 DATA

We generate question, table and answer triples using a synthetic grammar. Tables 4 and 5 (see
Appendix) show examples of question templates from the synthetic grammar for single and multiple
columns respectively. The elements in the table are uniformly randomly sampled from [-100, 100]
and [-200, 200] during training and test time respectively. The number of rows is sampled randomly
from [30, 100] in training while during prediction the number of rows is 120. Each question in the
test set is unique, i.e., it is generated from a distinct template. We use the following settings:
Single Column: We first perform experiments with a single column that enables 23 different question
templates which can be answered using 4 time steps.
Many Columns: We increase the difficulty by experimenting with multiple columns (max columns
= 3, 5 or 10). During training, the number of columns is randomly sampled from (1, max columns)
and at test time every question has the maximum number of columns used during training.
Variability: To simulate one aspect of the difficulties in dealing with natural language input, we
consider multiple ways to refer to the same operation (Tables 6 and 7).
Text Match: Now we consider cases where some columns in the input table contain text entries.
We use a small vocabulary of 10 words and fill the column by uniformly randomly sampling from
them. In our first experiment with text entries, the table always contains two columns, one with text
and the other with numeric entries (Table 8). In the next experiment, each example can have up to 3
columns containing numeric entries and up to 2 columns containing text entries during training. At
test time, all the examples contain 3 columns with numeric entries and 2 columns with text entries.
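
The numeric table sampling described in this section could be sketched as follows. Several details are assumptions made for illustration: entries are drawn as real numbers, all ranges are treated as inclusive, and the function name and signature are hypothetical. Question generation from the synthetic grammar is not covered.

```python
import random

def sample_numeric_table(rng, max_columns, train=True):
    """Sample a numeric table following the train/test ranges in the text.

    Training: entries in [-100, 100], 30 to 100 rows, 1 to max_columns columns.
    Test: entries in [-200, 200], 120 rows, exactly max_columns columns.
    """
    low, high = (-100.0, 100.0) if train else (-200.0, 200.0)
    n_rows = rng.randint(30, 100) if train else 120
    n_cols = rng.randint(1, max_columns) if train else max_columns
    return [[rng.uniform(low, high) for _ in range(n_cols)]
            for _ in range(n_rows)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_table = sample_numeric_table(rng, max_columns=5, train=True)
test_table = sample_numeric_table(rng, max_columns=5, train=False)
```

Because test tables are larger and use a wider value range than training tables, a model can only do well on them by learning the operations themselves and not the statistics of the training numbers.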


In the following, we benchmark the performance of Neural Programmer on various versions of the
table-comprehension dataset. We slowly increase the difficulty of the task by changing the table
properties (more columns, mixed numeric and text entries) and question properties (word variability).
After that we discuss a comparison between Neural Programmer, LSTM, and LSTM with
attention.

We use 4 time steps in our experiments (T = 4). Neural Programmer is trained with mini-batch
stochastic gradient descent with Adam optimizer (Kingma & Ba, 2014). The parameters are ini-
tialized uniformly randomly within the range [-0.1, 0.1]. In all experiments, we set the mini-batch
size to 50, dimensionality d to 256, the initial learning rate and the momentum hyper-parameters
of Adam to their default values (Kingma & Ba, 2014). We found that it is extremely useful to add
random Gaussian noise to our gradients at every training step. This acts as a regularizer to the model
and allows it to actively explore more programs. We use a schedule inspired by Welling & Teh
(2011), where at every step we sample a Gaussian of 0 mean and variance $= \textit{curr\_step}^{-0.55}$.
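
The annealed noise schedule can be sketched as below. The sketch assumes that steps are counted from 1 and that gradients are given as a flat list of floats; both are illustrative choices, as are the function names.

```python
import math
import random

def noise_std(step):
    """Standard deviation for variance = step ** -0.55 (step counted from 1)."""
    return math.sqrt(step ** -0.55)

def add_gradient_noise(grads, step, rng):
    """Add zero-mean Gaussian noise with the annealed variance to each gradient."""
    std = noise_std(step)
    return [g + rng.gauss(0.0, std) for g in grads]

rng = random.Random(0)
noisy = add_gradient_noise([0.5, -1.0, 2.0], step=10, rng=rng)
```

The variance starts at 1 and decays slowly with the step count, so the noise is largest early in training, when exploring different programs matters most.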
To prevent exploding gradients, we perform gradient clipping by scaling the gradient when the norm
exceeds a threshold (Graves, 2013). The threshold value is picked from [1, 5, 50]. We tune the $\epsilon$
hyper-parameter in Adam from [1e-6, 1e-8], the Huber constant $\delta$ from [10, 25, 50] and $\lambda$ (weight
between two losses) from [25, 50, 75, 100] using grid search. While performing experiments with
multiple random restarts we find that the performance of the model is stable with respect to $\epsilon$ and
the gradient clipping threshold but we have to tune $\delta$ and $\lambda$ for the different random seeds.
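
Clipping by norm rescales the whole gradient vector when its L2 norm exceeds the threshold, leaving its direction unchanged. A minimal sketch, assuming the gradient is a flat list of floats and using a hypothetical function name:

```python
import math

def clip_by_norm(grads, threshold):
    """Scale the gradient so that its L2 norm is at most the threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)     # norm 5 scaled down to 1
unchanged = clip_by_norm([3.0, 4.0], threshold=10.0)  # below threshold
```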

| Type | No. of Test Question Templates | Accuracy (%) | % seen test |
|---|---|---|---|
| Single Column | 23 | 100.0 | 100 |
| 3 Columns | 307 | 99.02 | 100 |
| 5 Columns | 1231 | 99.11 | 98.62 |
| 10 Columns | 7900 | 99.13 | 62.44 |
| Word Variability on 1 Column | 1368 | 96.49 | 100 |
| Word Variability on 5 Columns | 24000 | 88.99 | 31.31 |
| Text Match on 2 Columns | 1125 | 99.11 | 97.42 |
| Text Match on 5 Columns | 14600 | 98.03 | 31.02 |

Table 2: Summary of the performance of Neural Programmer on various versions of the synthetic
table-comprehension task. The prediction of the model is considered correct if it is equal to the
correct answer up to the first decimal place. The last column indicates the percentage of question
templates in the test set that are observed during training. The unseen question templates generate
questions containing sequences of words that the model has never seen before. The model can
generalize to unseen question templates which is evident in the 10-columns, word variability on
5-columns and text match on 5 columns experiments. This indicates that Neural Programmer is
a powerful compositional model since solving unseen question templates requires performing a
sequence of actions that it has never done during training.

The training set consists of 50,000 triples in all our experiments. Table 2 shows the performance
of Neural Programmer on synthetic data experiments. In single column experiments, the model
answers all questions correctly which we manually verify by inspecting the programs induced by
the model. In many columns experiments with 5 columns, we use a bidirectional RNN and for 10
columns we additionally perform attention (Bahdanau et al., 2014) on the question at every time step
using the history vector. The model is able to generalize to unseen question templates which are a
considerable fraction in our ten columns experiment. This can also be seen in the word variability
experiment with 5 columns and text match experiment with 5 columns where more than two-thirds
of the test set contains question templates that are unseen during training. This indicates that Neural
Programmer is a powerful compositional model since solving unseen question templates requires
inducing programs that do not appear during training. Almost all the errors made by the model were
on questions that require the difference operation to be used. Table 3 shows examples of how the
model selects the operation and column at every time step for three test questions.

Figure 8 shows an example of the effect of adding random noise to the gradients in our experiment
with 5 columns.


We apply a three-layer sequence-to-sequence LSTM recurrent network model (Hochreiter &
Schmidhuber, 1997; Sutskever et al., 2014) and LSTM model with attention (Bahdanau et al., 2014).
We explore multiple attention heads (1, 5, 10) and try two cases, placing the input table before and
after the question. We consider a simpler version of the single column dataset with only questions
that have scalar answers. The number of elements in the column is uniformly randomly sampled


| Question | t | Selected Op | Selected Column | $pivot_g$ | $pivot_l$ | Row select |
|---|---|---|---|---|---|---|
| greater 50.32 C and lesser 20.21 E sum H: What is the sum of numbers in column H whose field in column C is greater than 50.32 and field in Column E is lesser than 20.21. | 1 | Greater | C | 50.32 | 20.21 | $g_1$ |
| | 2 | Lesser | E | | | $l_2$ |
| | 3 | And | - | | | $and_3$ |
| | 4 | Sum | H | | | $[0]^M$ |
| lesser -80.97 D or greater 12.57 B print F: Print elements in column F whose field in column D is lesser than -80.97 or field in Column B is greater than 12.57. | 1 | Lesser | D | 12.57 | -80.97 | $l_1$ |
| | 2 | Greater | B | | | $g_2$ |
| | 3 | Or | - | | | $or_3$ |
| | 4 | Assign | F | | | $[0]^M$ |
| sum A diff count: What is the difference between sum of elements in column A and number of rows | 1 | Sum | A | -1 | -1 | $[0]^M$ |
| | 2 | Reset | - | | | $[1]^M$ |
| | 3 | Count | - | | | $[0]^M$ |
| | 4 | Diff | - | | | $[0]^M$ |

Table 3: Example outputs from the model for T = 4 time steps on three questions in the test set.
We show the synthetically generated question along with its natural language translation. For each
question, the model takes 4 steps and at each step selects an operation and a column. The pivot
numbers for the comparison operations are computed before performing the 4 steps. We show the
selected columns in cases during which the selected operation acts on a particular column.

[Figure 8 plots, two panels: "Train Loss: Noise Vs. No Noise" (train loss against no. of epochs) and "Test Accuracy: Noise Vs. No Noise" (test accuracy against no. of epochs), each comparing the "noise" and "no noise" runs.]

Figure 8: The effect of adding random noise to the gradients versus not adding it in our experiment
with 5 columns when all hyper-parameters are the same. The models trained with noise almost
always generalize better.

from [4, 7] while the elements are sampled from [−10, 10]. The best accuracy using these models is
close to 80% in spite of relatively easier questions and supplying fresh training examples at every
step. When the scale of the input numbers is changed to [−50, 50] at test time, the accuracy drops to
Neural Programmer solves this task and achieves 100% accuracy using 50,000 training examples.
Since the hardmax operation is used at test time, the answers (or the programs induced) from Neural
Programmer are invariant to the scale of numbers and the length of the input.

Program induction has been studied in the context of semantic parsing (Zelle & Mooney, 1996;
Zettlemoyer & Collins, 2005; Liang et al., 2011) in natural language processing. Pasupat & Liang
(2015) develop a semantic parser with a hand engineered grammar for question answering on tables
with natural language questions. Methods such as Piantadosi et al. (2008); Eisenstein et al. (2009);
Clarke et al. (2010) learn a compositional semantic model without hand engineered compositional
grammar, but still requiring a hand labeled lexical mapping of words to the operations. Poon (2013)
develop an unsupervised method for semantic parsing, which requires many pre-processing steps


including dependency parsing and mapping from words to operations. Liang et al. (2010) propose
a hierarchical Bayesian approach to learn simple programs.
There has been some early work in using neural networks for learning context free grammar (Das
et al., 1992a;b; Zeng et al., 1994) and context sensitive grammar (Steijvers, 1996; Gers & Schmid-
huber, 2001) for small problems. Neelakantan et al. (2015); Lin et al. (2015) learn simple Horn
clauses in a large knowledge base using RNNs. Neural networks have also been used for Q&A on
datasets that do not require complicated arithmetic and logic reasoning (Bordes et al., 2014; Iyyer
et al., 2014; Sukhbaatar et al., 2015; Peng et al., 2015; Hermann et al., 2015). While there has been
a lot of work in augmenting neural networks with additional memory (Das et al., 1992a; Schmidhu-
ber, 1993; Hochreiter & Schmidhuber, 1997; Graves et al., 2014; Weston et al., 2015; Kumar et al.,
2015; Joulin & Mikolov, 2015), we are not aware of any other work that augments a neural network
with a set of operations to enhance complex reasoning capabilities.
After our work was submitted to arXiv, Neural Programmer-Interpreters (Reed & Freitas, 2016), a
method that learns to induce programs with supervision of the entire program, was proposed. This
was followed by Neural Enquirer (Yin et al., 2015), which, similar to our work, tackles the problem of
synthetic table QA. However, their method achieves perfect accuracy only when given supervision
of the entire program. Later, the dynamic neural module network (Andreas et al., 2016) was proposed
for question answering; it uses syntactic supervision in the form of dependency trees.

We develop Neural Programmer, a neural network model augmented with a small set of arithmetic
and logic operations to perform complex arithmetic and logic reasoning. The model can be trained in
an end-to-end fashion using backpropagation to induce programs, requiring much less sophisticated
human supervision than prior work. It is a general model for program induction broadly applicable
across different domains, data sources and languages. Our experiments indicate that the model is
capable of learning with delayed supervision and exhibits powerful compositionality.

Acknowledgements We sincerely thank Greg Corrado, Andrew Dai, Jeff Dean, Shixiang Gu,
Andrew McCallum, and Luke Vilnis for their suggestions and the Google Brain team for the support.

Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, and Klein, Dan. Learning to compose neural
networks for question answering. ArXiv, 2016.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly
learning to align and translate. ICLR, 2014.
Bahdanau, Dzmitry, Chorowski, Jan, Serdyuk, Dmitriy, Brakel, Philemon, and Bengio,
Yoshua. End-to-end attention-based large vocabulary speech recognition. arXiv preprint
arxiv:1508.04395, 2015.
Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Question answering with subgraph embed-
dings. In EMNLP, 2014.
Cantlon, Jessica F., Brannon, Elizabeth M., Carter, Elizabeth J., and Pelphrey, Kevin A. Functional
imaging of numerical processing in adults and 4-y-old children. PLoS Biology, 2006.
Chan, William, Jaitly, Navdeep, Le, Quoc V., and Vinyals, Oriol. Listen, attend and spell. arXiv
preprint arxiv:1508.01211, 2015.
Clarke, James, Goldwasser, Dan, Chang, Ming-Wei, and Roth, Dan. Driving semantic parsing from
the world’s response. In CoNLL, 2010.
Das, Sreerupa, Giles, C. Lee, and Sun, Guo-Zheng. Learning context-free grammars: Capabilities
and limitations of a recurrent neural network with an external stack memory. In CogSci, 1992a.
Das, Sreerupa, Giles, C. Lee, and Sun, Guo-Zheng. Using prior knowledge in an NNPDA to learn
context-free languages. In NIPS, 1992b.

Dastjerdi, Mohammad, Ozker, Muge, Foster, Brett L, Rangarajan, Vinitha, and Parvizi, Josef. Nu-
merical processing in the human parietal cortex during experimental and natural conditions. Na-
ture communications, 4, 2013.
Eisenstein, Jacob, Clarke, James, Goldwasser, Dan, and Roth, Dan. Reading to learn: Constructing
features from semantic abstracts. In EMNLP, 2009.
Fias, Wim, Lammertyn, Jan, Caessens, Bernie, and Orban, Guy A. Processing of abstract ordinal
knowledge in the horizontal segment of the intraparietal sulcus. The Journal of Neuroscience,
Gers, Felix A. and Schmidhuber, Jürgen. LSTM recurrent networks learn simple context free and
context sensitive languages. IEEE Transactions on Neural Networks, 2001.
Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint
arxiv:1308.0850, 2013.
Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural
networks. In ICML, 2014.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint
arxiv:1410.5401, 2014.
Hannun, Awni Y., Case, Carl, Casper, Jared, Catanzaro, Bryan C., Diamos, Greg, Elsen, Erich,
Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, and Ng, Andrew Y. Deep
Speech: Scaling up end-to-end speech recognition. arXiv preprint arxiv:1412.5567, 2014.
Hermann, Karl Moritz, Kociský, Tomás, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suley-
man, Mustafa, and Blunsom, Phil. Teaching machines to read and comprehend. NIPS, 2015.
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep,
Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep
neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 1997.
Huber, Peter. Robust estimation of a location parameter. In The Annals of Mathematical Statistics,
Iyyer, Mohit, Boyd-Graber, Jordan L., Claudino, Leonardo Max Batista, Socher, Richard, and Daumé III,
Hal. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.
Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent
nets. NIPS, 2015.
Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. ICLR, 2014.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep
convolutional neural networks. In NIPS, 2012.
Kucian, Karin, Loenneker, Thomas, Dietrich, Thomas, Dosch, Mengia, Martin, Ernst, and
Von Aster, Michael. Impaired neural networks for approximate calculation in dyscalculic
children: a functional MRI study. Behavioral and Brain Functions, 2006.
Kumar, Ankit, Irsoy, Ozan, Su, Jonathan, Bradbury, James, English, Robert, Pierce, Brian, On-
druska, Peter, Gulrajani, Ishaan, and Socher, Richard. Ask me anything: Dynamic memory net-
works for natural language processing. ArXiv, 2015.
Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning programs: A hierarchical Bayesian
approach. In ICML, 2010.
Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning dependency-based compositional se-
mantics. In ACL, 2011.

Lin, Yankai, Liu, Zhiyuan, Luan, Huan-Bo, Sun, Maosong, Rao, Siwei, and Liu, Song. Modeling
relation paths for representation learning of knowledge bases. In EMNLP, 2015.
Luong, Thang, Sutskever, Ilya, Le, Quoc V., Vinyals, Oriol, and Zaremba, Wojciech. Addressing
the rare word problem in neural machine translation. ACL, 2014.
Neelakantan, Arvind, Roth, Benjamin, and McCallum, Andrew. Compositional vector space models
for knowledge base completion. In ACL, 2015.
Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V., Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol,
and Martens, James. Adding gradient noise improves learning for very deep networks. ICLR
Workshop, 2016.
Pasupat, Panupong and Liang, Percy. Compositional semantic parsing on semi-structured tables. In
ACL, 2015.
Peng, Baolin, Lu, Zhengdong, Li, Hang, and Wong, Kam-Fai. Towards neural network-based rea-
soning. arXiv preprint arxiv:1508.05508, 2015.
Piantadosi, Steven T., Goodman, N.D., Ellis, B.A., and Tenenbaum, J.B. A Bayesian model of the
acquisition of compositional semantics. In CogSci, 2008.
Piazza, Manuela, Izard, Veronique, Pinel, Philippe, Le Bihan, Denis, and Dehaene, Stanislas. Tuning
curves for approximate numerosity in the human intraparietal sulcus. Neuron, 2004.
Poon, Hoifung. Grounded unsupervised semantic parsing. In ACL, 2013.
Reed, Scott and Freitas, Nando De. Neural programmer-interpreters. ICLR, 2016.
Schmidhuber, J. A self-referential weight matrix. In ICANN, 1993.
Shang, Lifeng, Lu, Zhengdong, and Li, Hang. Neural responding machine for short-text
conversation. arXiv preprint arXiv:1503.02364, 2015.
Steijvers, Mark. A recurrent network that performs a context-sensitive prediction task. In CogSci, 1996.
Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory net-
works. arXiv preprint arXiv:1503.08895, 2015.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural net-
works. In NIPS, 2014.
Vinyals, Oriol and Le, Quoc V. A neural conversational model. ICML DL Workshop, 2015.
Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural
image caption generator. In CVPR, 2015.
Von Neumann, John. First draft of a report on the EDVAC. Technical report, 1945.
Wang, Yushi, Berant, Jonathan, and Liang, Percy. Building a semantic parser overnight. In ACL,
Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient Langevin dynamics. In
ICML, 2011.
Werbos, P. Backpropagation through time: what does it do and how to do it. In Proceedings of
IEEE, 1990.
Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory Networks. 2015.
Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron C., Salakhutdinov, Ruslan,
Zemel, Richard S., and Bengio, Yoshua. Show, attend and tell: Neural image caption generation
with visual attention. In ICML, 2015.

Yin, Pengcheng, Lu, Zhengdong, Li, Hang, and Kao, Ben. Neural enquirer: Learning to query tables
with natural language. ArXiv, 2015.
Zelle, John M. and Mooney, Raymond J. Learning to parse database queries using inductive logic
programming. In AAAI/IAAI, 1996.
Zeng, Z., Goodman, R., and Smyth, P. Discrete recurrent neural networks for grammatical inference.
IEEE Transactions on Neural Networks, 1994.
Zettlemoyer, Luke S. and Collins, Michael. Learning to map sentences to logical form: Structured
classification with probabilistic categorial grammars. In UAI, 2005.


greater [number] sum
lesser [number] sum
greater [number] count
lesser [number] count
greater [number] print
lesser [number] print
greater [number1] and lesser [number2] sum
lesser [number1] and greater [number2] sum
greater [number1] or lesser [number2] sum
lesser [number1] or greater [number2] sum
greater [number1] and lesser [number2] count
lesser [number1] and greater [number2] count
greater [number1] or lesser [number2] count
lesser [number1] or greater [number2] count
greater [number1] and lesser [number2] print
lesser [number1] and greater [number2] print
greater [number1] or lesser [number2] print
lesser [number1] or greater [number2] print
sum diff count
count diff sum

Table 4: 23 question templates for single column experiment. We have four categories of questions:
1) simple aggregation (sum, count), 2) comparison (greater, lesser), 3) logic (and, or), and 4) arithmetic
(diff). We first sample the categories uniformly at random, and each program within a category
is equally likely. In the word variability experiment with 5 columns we sampled from the set of
all programs uniformly at random, since greater than 90% of the test questions were unseen during
training using the other procedure.

greater [number1] A and lesser [number2] A sum A

greater [number1] B and lesser [number2] B sum B
greater [number1] A and lesser [number2] A sum B
greater [number1] A and lesser [number2] B sum A
greater [number1] B and lesser [number2] A sum A
greater [number1] A and lesser [number2] B sum B
greater [number1] B and lesser [number2] B sum A
greater [number1] B and lesser [number2] A sum B

Table 5: 8 question templates of type “greater [number1] and lesser [number2] sum” when there are
2 columns.
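The eight templates in Table 5 appear to correspond to every way of assigning one of the two columns to the three slots of the question (the greater condition, the lesser condition, and the summed column). A minimal sketch of that enumeration follows; the function name `two_column_templates` is illustrative and not taken from the paper.

```python
from itertools import product

def two_column_templates(columns=("A", "B")):
    """Enumerate 'greater ... and lesser ... sum' templates over all column choices."""
    pattern = "greater [number1] {} and lesser [number2] {} sum {}"
    return [pattern.format(c1, c2, c3) for c1, c2, c3 in product(columns, repeat=3)]

templates = two_column_templates()
for template in templates:
    print(template)
```

With two columns and three slots this yields 2^3 = 8 distinct templates.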

sum         sum, total, total of, sum of
count       count, count of, how many
greater     greater, greater than, bigger, bigger than, larger, larger than
lesser      lesser, lesser than, smaller, smaller than, under
assign      print, display, show
difference  difference, difference between

Table 6: Word variability, multiple ways to refer to the same operation.

greater [number] sum

greater [number] total
greater [number] total of
greater [number] sum of
greater than [number] sum
greater than [number] total
greater than [number] total of
greater than [number] sum of
bigger [number] sum
bigger [number] total
bigger [number] total of
bigger [number] sum of
bigger than [number] sum
bigger than [number] total
bigger than [number] total of
bigger than [number] sum of
larger [number] sum
larger [number] total
larger [number] total of
larger [number] sum of
larger than [number] sum
larger than [number] total
larger than [number] total of
larger than [number] sum of

Table 7: 24 question templates for questions of type “greater [number] sum” in the single column
word variability experiment.
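The 24 templates of Table 7 can be reproduced by substituting every phrasing from Table 6 into a base template. The sketch below assumes the phrase lists exactly as printed in Table 6; the dictionary keys `print` and `diff` follow the tokens used in the Table 4 templates, and the names `PHRASES` and `expand` are hypothetical.

```python
from itertools import product

# Phrase lists copied from Table 6. Keys are the tokens used in the templates
# (an assumption for "print" and "diff", which Table 6 labels assign/difference).
PHRASES = {
    "sum": ["sum", "total", "total of", "sum of"],
    "count": ["count", "count of", "how many"],
    "greater": ["greater", "greater than", "bigger", "bigger than", "larger", "larger than"],
    "lesser": ["lesser", "lesser than", "smaller", "smaller than", "under"],
    "print": ["print", "display", "show"],
    "diff": ["difference", "difference between"],
}

def expand(template):
    """Return every surface form of a template such as 'greater [number] sum'."""
    # Tokens without an entry in PHRASES (e.g. '[number]') are kept unchanged.
    options = [PHRASES.get(token, [token]) for token in template.split()]
    return [" ".join(choice) for choice in product(*options)]

variants = expand("greater [number] sum")
print(len(variants))  # 6 phrasings of "greater" x 4 phrasings of "sum"
```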

word:0 A sum B
word:1 A sum B
word:2 A sum B
word:3 A sum B
word:4 A sum B
word:5 A sum B
word:6 A sum B
word:7 A sum B
word:8 A sum B
word:9 A sum B

Table 8: 10 question templates for questions of type “[word] A sum B” in the two-column text
match experiment.

Published as a conference paper at ICLR 2016


Scott Reed & Nando de Freitas
Google DeepMind
London, UK

arXiv:1511.06279v4 [cs.LG] 29 Feb 2016

We propose the neural programmer-interpreter (NPI): a recurrent and compositional
neural network that learns to represent and execute programs. NPI has three
learnable components: a task-agnostic recurrent core, a persistent key-value pro-
gram memory, and domain-specific encoders that enable a single NPI to operate in
multiple perceptually diverse environments with distinct affordances. By learning
to compose lower-level programs to express higher-level programs, NPI reduces
sample complexity and increases generalization ability compared to sequence-to-
sequence LSTMs. The program memory allows efficient learning of additional
tasks by building on existing programs. NPI can also harness the environment
(e.g. a scratch pad with read-write pointers) to cache intermediate results of com-
putation, lessening the long-term memory burden on recurrent hidden units. In
this work we train the NPI with fully-supervised execution traces; each program
has example sequences of calls to the immediate subprograms conditioned on the
input. Rather than training on a huge number of relatively weak labels, NPI learns
from a small number of rich examples. We demonstrate the capability of our
model to learn several types of compositional programs: addition, sorting, and
canonicalizing 3D models. Furthermore, a single NPI learns to execute these pro-
grams and all 21 associated subprograms.

Teaching machines to learn new programs, to rapidly compose new programs from existing pro-
grams, and to conditionally execute these programs automatically so as to solve a wide variety of
tasks is one of the central challenges of AI. Programs appear in many guises in various AI problems,
including motor behaviours, image transformations, reinforcement learning policies, classical
algorithms, and symbolic relations.
In this paper, we develop a compositional architecture that learns to represent and interpret pro-
grams. We refer to this architecture as the Neural Programmer-Interpreter (NPI). The core module
is an LSTM-based sequence model that takes as input a learnable program embedding, program
arguments passed on by the calling program, and a feature representation of the environment. The
output of the core module is a key indicating what program to call next, arguments for the following
program and a flag indicating whether the program should terminate. In addition to the recurrent
core, the NPI architecture includes a learnable key-value memory of program embeddings. This
program-memory is essential for learning and re-using programs in a continual manner. Figures 1
and 2 illustrate the NPI on two different tasks.
We show in our experiments that the NPI architecture can learn 21 programs, including addition,
sorting, and trajectory planning from image pixels. Crucially, this can be achieved using a single
core model with the same parameters shared across all tasks. Different environments (for example
images, text, and scratch-pads) may require specific perception modules or encoders to produce the
features used by the shared core, as well as environment-specific actuators. Both perception modules
and actuators can be learned from data when training the NPI architecture.
To train the NPI we use curriculum learning and supervision via example execution traces. Each
program has example sequences of calls to the immediate subprograms conditioned on the input.


Figure 1: Example execution of canonicalizing 3D car models. The task is to move the camera such
that a target angle and elevation are reached. There is a read-only scratch pad containing the target
(angle 1, elevation 2 here). The image encoder is a convnet trained from scratch on pixels.

Figure 2: Example execution trace of single-digit addition. The task is to perform a single-digit add
on the numbers at pointer locations in the first two rows. The carry (row 3) and output (row 4) should
be updated to reflect the addition. At each time step, an observation of the environment (viewed from
each pointer on a scratch pad) is encoded into a fixed-length vector.

By using neural networks to represent the subprograms and learning these from data, the approach
can generalize on tasks involving rich perceptual inputs and uncertainty.
We may envision two approaches to provide supervision. In one, we provide a very large number
of labeled examples, as in object recognition, speech and machine translation. In the other, the
approach followed in this paper, the aim is to provide far fewer labeled examples, but where
the labels contain richer information allowing the model to learn compositional structure. While
unsupervised and reinforcement learning play important roles in perception and motor control, other
cognitive abilities are possible thanks to rich supervision and curriculum learning. This is indeed
the reason for sending our children to school.
An advantage of our approach to model building and training is that the learned programs exhibit
strong generalization. Specifically, when trained to sort sequences of up to twenty numbers in
length, they can sort much longer sequences at test time. In contrast, the experiments will show that
more standard sequence-to-sequence LSTMs only exhibit weak generalization; see Figure 6.
A trained NPI with fixed parameters and a learned library of programs can act both as an interpreter
and as a programmer. As an interpreter, it takes input in the form of a program embedding and input
data and subsequently executes the program. As a programmer, it uses samples drawn from a new
task to generate a new program embedding that can be added to its library of programs.

Several ideas related to our approach have a long history. For example, the idea of using dynam-
ically programmable networks in which the activations of one network become the weights (the
program) of a second network was mentioned in the Sigma-Pi units section of the influential PDP
paper (Rumelhart et al., 1986). This idea appeared in (Sutskever & Hinton, 2009) in the context of
learning higher order symbolic relations and in (Donnarumma et al., 2015) as the key ingredient of an
architecture for prefrontal cognitive control. Schmidhuber (1992) proposed a related meta-learning
idea, whereby one learns the parameters of a slowly changing network, which in turn generates
context dependent weight changes for a second rapidly changing network. These approaches have
only been demonstrated in very limited settings. In cognitive science, several theories of brain areas
controlling other brain parts so as to carry out multiple tasks have been proposed; see for example
Schneider & Chein (2003); Anderson (2010) and Donnarumma et al. (2012).
Related problems have been studied in the literature on hierarchical reinforcement learning (e.g.,
Dietterich (2000); Andre & Russell (2001); Sutton et al. (1999) and Schaul et al. (2015)), imitation
and apprenticeship learning (e.g., Kolter et al. (2008) and Rothkopf & Ballard (2013)) and elicita-
tion of options through human interaction (Subramanian et al., 2011). These ideas have held great
promise, but have not enjoyed significant impact. We believe the recurrent compositional neural
representations proposed in this paper could help these approaches in the future, and in particular in
overcoming feature engineering.
Several recent advancements have extended recurrent networks to solve problems beyond simple
sequence prediction. Graves et al. (2014) developed a neural Turing machine capable of learning
and executing simple programs such as repeat copying, simple priority sorting and associative recall.
Vinyals et al. (2015) developed Pointer Networks that generalize the notion of encoder attention in
order to provide the decoder a variable-sized output space depending on the input sequence length.
This model was shown to be effective for combinatorial optimization problems such as the traveling
salesman and Delaunay triangulation. While our proposed model is trained on execution traces in-
stead of input and output pairs, in exchange for this richer supervision we benefit from compositional
program structure, improving data efficiency on several problems.
This work is also closely related to program induction. Most previous work on program induc-
tion, i.e. inducing a program given example input and output pairs, has used genetic program-
ming (Banzhaf et al., 1998) to evolve useful programs from candidate populations. Mou et al.
(2014) process program symbols to learn max-margin program embeddings with the help of parse
trees. Zaremba & Sutskever (2014) trained LSTM models to read in the text of simple programs
character-by-character and correctly predict the program output. Joulin & Mikolov (2015) aug-
mented a recurrent network with a pushdown stack, allowing for generalization to longer input
sequences than seen during training for several algorithmic patterns.
Contemporary to this work, several papers have also studied program induction with variants of
recurrent neural networks (Zaremba & Sutskever, 2015; Zaremba et al., 2015; Kaiser & Sutskever,
2015; Kurach et al., 2015; Neelakantan et al., 2015). While we share a similar motivation, our
approach is distinct in that we explicitly incorporate compositional structure into the network using
a program memory, allowing the model to learn new programs by combining sub-programs.

The NPI core is a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997)
that acts as a router between programs conditioned on the current state observation and previous
hidden unit states. At each time step, the core module can select another program to invoke using
content-based addressing. It emits the probability of ending the current program with a single binary
unit. If this probability is over a threshold (we used 0.5), control is returned to the caller by popping
the caller’s LSTM hidden units and program embedding off of a program call stack and resuming
execution in this context.
The NPI may also optionally write arguments (ARG) that are passed by reference or value to the
invoked sub-programs. For example, an argument could indicate a specific location in the input
sequence (by reference), or it could specify a number to write down at a particular location in the
sequence (by value). The subsequent state consists of these arguments and observations of the
environment. The approach is illustrated in Figures 1 and 2.
It must be emphasized that there is a single inference core. That is, all the LSTM instantiations
executing arbitrary programs share the same parameters. Different programs correspond to program
embeddings, which are stored in a learnable persistent memory. The programs therefore have a more
succinct representation than neural programs encoded as the full set of weights in a neural network
(Rumelhart et al., 1986; Graves et al., 2014).
The output of an NPI, conditioned on an input state and a program to run, is a sequence of actions
in a given environment. In this work, we consider several environments: a 1-D array with read-only
pointers and a swap action, a 2-D scratch pad with read-write pointers, and a CAD renderer with
controllable elevation and azimuth movements. Note that the sequence of actions for a program is
not fixed, but dependent also on the input state.
Denote the environment observation at time $t$ as $e_t \in \mathcal{E}$, and the current program arguments as
$a_t \in \mathcal{A}$. The form of $e_t$ can vary dramatically by environment; for example it could be a color
image or an array of numbers. The program arguments $a_t$ can also vary by environment, but in
the experiments for this paper we always used a 3-tuple of integers $(a_t(1), a_t(2), a_t(3))$. Given
the environment and arguments at time $t$, a fixed-length state encoding $s_t \in \mathbb{R}^D$ is extracted by a
domain-specific encoder $f_{enc} : \mathcal{E} \times \mathcal{A} \to \mathbb{R}^D$. In Section 4 we provide examples of several encoders.
Note that a single NPI network can have multiple encoders for multiple environments, and encoders
can potentially also be shared across tasks.
We denote the current program embedding as $p_t \in \mathbb{R}^P$. The previous hidden unit and cell states
are $h_{t-1}^{(l)} \in \mathbb{R}^M$ and $c_{t-1}^{(l)} \in \mathbb{R}^M$, $l = 1, \ldots, L$, where $L$ is the number of layers in the LSTM.
The program and state vectors are then propagated forward through an LSTM mapping $f_{lstm}$ as in
(Sutskever et al., 2014). How to fuse $p_t$ and $s_t$ within $f_{lstm}$ is an implementation detail, but in this
work we concatenate and feed through a 2-layer MLP with rectified linear (ReLU) hidden activation
and linear decoder.
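The fusion step described above (concatenate the state and program vectors, then apply a 2-layer MLP with a ReLU hidden layer and a linear output) might look roughly as follows. This is only a sketch in plain Python: the layer sizes, the random initialisation, and the helper names (`linear`, `relu`, `init_layer`, `fuse`) are assumptions made for illustration, not details given in the paper.

```python
import random

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse(s_t, p_t, layer1, layer2):
    """Concatenate state and program vectors, then ReLU hidden layer + linear output."""
    x = list(s_t) + list(p_t)
    hidden = relu(linear(x, *layer1))
    return linear(hidden, *layer2)

rng = random.Random(0)
D, P, H, OUT = 4, 3, 8, 5  # toy sizes, chosen only for illustration
layer1 = init_layer(D + P, H, rng)
layer2 = init_layer(H, OUT, rng)
fused = fuse([0.5] * D, [1.0] * P, layer1, layer2)
print(len(fused))
```

The fused vector would then serve as the input to the LSTM at that time step.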
From the top LSTM hidden state $h_t^L$, several decoders generate the outputs. The probability of
finishing the program and returning to the caller$^1$ is computed by $f_{end} : \mathbb{R}^M \to [0, 1]$. The lookup
key embedding used for retrieving the next program from memory is computed by $f_{prog} : \mathbb{R}^M \to \mathbb{R}^K$.
Note that $\mathbb{R}^K$ can be much smaller than $\mathbb{R}^P$ because the key only need act as the identifier
of a program, while the program embedding must have enough capacity to conditionally generate a
sequence of actions. The contents of the arguments to the next program to be called are generated
by $f_{arg} : \mathbb{R}^M \to \mathcal{A}$. The feed-forward steps of program inference are summarized below:
$$s_t = f_{enc}(e_t, a_t) \tag{1}$$
$$h_t = f_{lstm}(s_t, p_t, h_{t-1}) \tag{2}$$
$$r_t = f_{end}(h_t), \quad k_t = f_{prog}(h_t), \quad a_{t+1} = f_{arg}(h_t) \tag{3}$$
where $r_t$, $k_t$ and $a_{t+1}$ correspond to the end-of-program probability, program key embedding, and
output arguments at time $t$, respectively. These yield input arguments at time $t + 1$. To simplify the
notation, we have abstracted properties such as layers and cell memory in the sequence-to-sequence
LSTM of equation (2); see (Sutskever et al., 2014) for details.
The NPI representation is equipped with key-value memory structures $M^{key} \in \mathbb{R}^{N \times K}$ and
$M^{prog} \in \mathbb{R}^{N \times P}$ storing program keys and program embeddings, respectively, where $N$ is the
current number of programs in memory. We can add more programs by adding rows to memory.
During training, the next program identifier is provided to the model as ground-truth, so that its
embedding can be retrieved from the corresponding row of $M^{prog}$. At test time, we compute the
“program ID” by comparing the key embedding $k_t$ to each row of $M^{key}$ storing all program keys.
Then the program embedding is retrieved from $M^{prog}$ as follows:
$$i^* = \arg\max_{i = 1..N} (M^{key}_{i,:})^T k_t, \quad p_{t+1} = M^{prog}_{i^*,:} \tag{4}$$
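The lookup in equation (4) amounts to a dot product between the emitted key and every stored key, followed by an arg max and a row read from the program memory. A minimal sketch with a toy memory is given below; the function name `lookup_program` and the numerical values are made up for illustration.

```python
def lookup_program(key, m_key, m_prog):
    """Return (i_star, p_next): the row whose stored key best matches `key`."""
    scores = [sum(a * b for a, b in zip(row, key)) for row in m_key]
    i_star = max(range(len(scores)), key=scores.__getitem__)
    return i_star, m_prog[i_star]

# Toy memory with N = 3 programs, K = 2 key dimensions, P = 3 embedding dimensions.
M_key = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
M_prog = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

i_star, p_next = lookup_program([0.2, 0.9], M_key, M_prog)
print(i_star, p_next)
```

Here the scores are 0.2, 0.9 and -1.1, so the second row (index 1) is selected.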
The next environmental state $e_{t+1}$ will be determined by the dynamics of the environment and can
be affected by both the choice of program $p_t$ and the contents of the output arguments $a_t$, i.e.
$$e_{t+1} \sim f_{env}(e_t, p_t, a_t) \tag{5}$$
The transition mapping $f_{env}$ is domain-specific and will be discussed in Section 4. A description of
the inference procedure is given in Algorithm 1.
$^1$ In our implementation, a program may first call a subprogram before itself finishing. The only exception
is the ACT program that signals a low-level action to the environment, e.g. moving a pointer one step left or
writing a value. By convention ACT does not call any further sub-programs.

Algorithm 1 Neural programming inference

Inputs: Environment observation $e$, program id $i$, arguments $a$, stop threshold $\alpha$
function RUN($i$, $a$)
    $h \leftarrow 0$, $r \leftarrow 0$, $p \leftarrow M^{prog}_{i,:}$    ▷ Init LSTM and return probability.
    while $r < \alpha$ do
        $s \leftarrow f_{enc}(e, a)$, $h \leftarrow f_{lstm}(s, p, h)$    ▷ Feed-forward NPI one step.
        $r \leftarrow f_{end}(h)$, $k \leftarrow f_{prog}(h)$, $a_2 \leftarrow f_{arg}(h)$
        $i_2 \leftarrow \arg\max_j (M^{key}_{j,:})^T k$    ▷ Decide the next program to run.
        if $i$ == ACT then $e \leftarrow f_{env}(e, p, a)$    ▷ Update the environment based on ACT.
        else RUN($i_2$, $a_2$)    ▷ Run subprogram $i_2$ with arguments $a_2$.
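To make the control flow of Algorithm 1 concrete, the sketch below runs the same recursive loop with hand-written stand-ins for the learned components. Everything in it is a toy assumption: the two programs (`ACT` and a hypothetical `INC_TWICE`), the counter environment, and the scripted `f_lstm`, `f_end`, `f_prog` and `f_arg` functions replace the trained LSTM core and decoders purely for illustration.

```python
ACT, INC_TWICE = 0, 1               # hypothetical toy program ids
M_KEY = [[1.0, 0.0], [0.0, 1.0]]    # one key row per program
M_PROG = [[0.0], [1.0]]             # toy "embeddings": they only store the program id

def f_enc(env, args):
    return (env["counter"], args)

def f_lstm(state, prog, hidden):
    # Stand-in for the LSTM core: the hidden state only counts steps taken.
    return {"prog": int(prog[0]), "steps": hidden["steps"] + 1}

def f_end(hidden):
    # Scripted halting: ACT stops after one step, INC_TWICE after two.
    limit = 1 if hidden["prog"] == ACT else 2
    return 1.0 if hidden["steps"] >= limit else 0.0

def f_prog(hidden):
    return M_KEY[ACT]               # this toy controller always calls ACT next

def f_arg(hidden):
    return (1,)                     # ACT argument: amount to add to the counter

def f_env(env, prog, args):
    env["counter"] += args[0]
    return env

def run(env, i, args, alpha=0.5, trace=None):
    """Recursive inference loop following the structure of Algorithm 1."""
    trace = [] if trace is None else trace
    hidden, r, prog = {"steps": 0}, 0.0, M_PROG[i]
    while r < alpha:
        state = f_enc(env, args)
        hidden = f_lstm(state, prog, hidden)
        r, key, next_args = f_end(hidden), f_prog(hidden), f_arg(hidden)
        scores = [sum(m * k for m, k in zip(row, key)) for row in M_KEY]
        i_next = max(range(len(scores)), key=scores.__getitem__)
        if i == ACT:
            env = f_env(env, prog, args)               # low-level action
            trace.append(("ACT", args))
        else:
            run(env, i_next, next_args, alpha, trace)  # descend into subprogram
    return trace

env = {"counter": 0}
trace = run(env, INC_TWICE, ())
print(env["counter"], trace)
```

In this toy run, `INC_TWICE` calls `ACT` twice and the counter ends at 2. In the actual model the halting probability, key and arguments come from the learned decoders rather than from fixed rules.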

Each task has a set of actions that affect the environment. For example, in addition there are LEFT
and RIGHT actions that move a specified pointer, and a WRITE action which writes a value at
a specified location. These actions are encapsulated into a general-purpose ACT program shared
across tasks, and the concrete action to be taken is indicated by the NPI-generated arguments $a_t$.
Note that the core LSTM module of our NPI representation is completely agnostic to the data modal-
ity used to produce the state encoding. As long as the same fixed-length embedding is extracted,
the same module can in practice route between programs related to sorting arrays just as easily as
between programs related to rotating 3D objects. In the experimental sections, we provide details of
the modality-specific deep neural networks that we use to produce these fixed-length state vectors.
To train we use execution traces $\xi_t^{inp} : \{e_t, i_t, a_t\}$ and $\xi_t^{out} : \{i_{t+1}, a_{t+1}, r_t\}$, $t = 1, \ldots, T$, where $T$ is
the sequence length. Program IDs $i_t$ and $i_{t+1}$ are row-indices in $M^{key}$ and $M^{prog}$ of the programs
to run at time $t$ and $t+1$, respectively. We propose to directly maximize the probability of the correct
execution trace output $\xi^{out}$ conditioned on $\xi^{inp}$:
$$\theta^* = \arg\max_{\theta} \sum_{(\xi^{inp}, \xi^{out})} \log P(\xi^{out} \mid \xi^{inp}; \theta) \tag{6}$$
where $\theta$ are the parameters of our model. Since the traces are variable in length depending on the
input, we apply the chain rule to model the joint probability over $\xi_1^{out}, \ldots, \xi_T^{out}$ as follows:
$$\log P(\xi^{out} \mid \xi^{inp}; \theta) = \sum_{t=1}^{T} \log P(\xi_t^{out} \mid \xi_1^{inp}, \ldots, \xi_t^{inp}; \theta) \tag{7}$$
Note that for many problems the input history $\xi_1^{inp}, \ldots, \xi_t^{inp}$ is critical to deciding future actions
because the environment observation at the current time-step $e_t$ alone does not contain enough
information. The hidden unit activations of the LSTM in NPI are capable of capturing these temporal
dependencies. The single-step conditional probability in equation (7) can be factorized into three
further conditional distributions, corresponding to predicting the next program, next arguments, and
whether to halt execution:
$$\log P(\xi_t^{out} \mid \xi_1^{inp}, \ldots, \xi_t^{inp}) = \log P(i_{t+1} \mid h_t) + \log P(a_{t+1} \mid h_t) + \log P(r_t \mid h_t) \tag{8}$$
where $h_t$ is the output of $f_{lstm}$ at time $t$, carrying information from previous time steps. We train
by gradient ascent on the likelihood in equation (7).
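The factorization in equations (7) and (8) can be sketched as a sum of three log-probability terms per time step, accumulated over the trace. The code below is illustrative only: it assumes the program prediction is a categorical distribution, each argument slot is an independent categorical, and the halting decision is a Bernoulli variable; the function names and the example numbers are hypothetical.

```python
import math

def step_log_likelihood(prog_probs, arg_probs, end_prob,
                        target_prog, target_args, target_end):
    """log P(i_{t+1}|h_t) + log P(a_{t+1}|h_t) + log P(r_t|h_t) for one step."""
    log_p_prog = math.log(prog_probs[target_prog])
    # Assumption: each argument slot has its own independent categorical.
    log_p_args = sum(math.log(dist[v]) for dist, v in zip(arg_probs, target_args))
    log_p_end = math.log(end_prob if target_end else 1.0 - end_prob)
    return log_p_prog + log_p_args + log_p_end

def trace_log_likelihood(steps):
    """Sum the per-step terms over the whole trace, as in equation (7)."""
    return sum(step_log_likelihood(*step) for step in steps)

# One made-up step: two candidate programs, one argument slot with two values.
steps = [([0.25, 0.75], [[0.5, 0.5]], 0.5, 1, (0,), False)]
ll = trace_log_likelihood(steps)
print(ll)
```

Gradient ascent on this quantity, summed over training traces, corresponds to the objective in equation (6).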
We used an adaptive curriculum in which training examples for each mini-batch are fetched with fre-
quency proportional to the model’s current prediction error for the corresponding program. Specif-
ically, we set the sampling frequency using a softmax over average prediction error across all pro-
grams, with configurable temperature. Every 1000 steps of training we re-estimated these prediction
errors. Intuitively, this forces the model to focus on learning the program for which it currently per-
forms worst in executing. We found that the adaptive curriculum immediately worked much better
than our best-performing hand-designed curriculum, allowing a multi-task NPI to achieve compara-
ble performance to single-task NPI on all tasks.
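The adaptive curriculum described above can be sketched as a softmax over per-program average errors that is then used to weight sampling. The program names and error values below are made up for illustration, and `sampling_frequencies` is a hypothetical helper, not code from the paper.

```python
import math
import random

def sampling_frequencies(avg_errors, temperature=1.0):
    """Softmax over per-program average prediction error."""
    scaled = [err / temperature for err in avg_errors.values()]
    largest = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(v - largest) for v in scaled]
    total = sum(exps)
    return {name: x / total for name, x in zip(avg_errors, exps)}

# Made-up error rates; in the paper these are re-estimated every 1000 steps.
errors = {"ADD": 0.05, "BUBBLESORT": 0.40, "GOTO": 0.10}
freqs = sampling_frequencies(errors, temperature=0.1)

rng = random.Random(0)
batch = rng.choices(list(freqs), weights=list(freqs.values()), k=8)
print(freqs, batch)
```

A lower temperature concentrates sampling on the worst-performing program, while a higher temperature moves the distribution toward uniform.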
We also note that our program has a distinct memory advantage over basic LSTMs because all subprograms
can be trained in parallel. For programs whose execution length grows e.g. quadratically
with the input sequence length, an LSTM will be highly constrained by device memory to train on
short sequences. By exploiting compositionality, an effective curriculum can often be developed
with sublinear-length subprograms, enabling our NPI model to train on sequences an order of
magnitude longer than the LSTM can.

Figure 3: Illustration of the addition environment used in our experiments. (a) Example scratch pad
and pointers used for computing “96 + 125 = 221”; a carry step is being implemented. (b) Actual
trace of the addition program generated by our model on the problem shown in (a). Note that we
substituted the ACT calls in the trace with more human-readable steps.

This section describes the environment and state encoder function for each task, and shows example
outputs and prediction accuracy results. For all tasks, the core LSTM had two layers of size 256.
We trained the NPI using the ADAM solver (Kingma & Ba, 2015) with base learning rate 0.0001,
batch size 1, and decayed the learning rate by a factor of 0.95 every 10,000 steps.


In this section we provide an overview of the tasks used to evaluate our model. Table 2 in the
appendix provides a full listing of all the programs and subprograms learned by our model.

The task in this environment is to read in the digits of two base-10 numbers and produce the digits
of the answer. Our goal is to teach the model the standard (at least in the US) grade school algorithm
of adding, in which one works from right to left applying single-digit add and carry operations.
In this environment, the network is endowed with a “scratch pad” with which to store intermediate
computations; e.g. to record carries. There are four pointers; one for each of the two input numbers,
one for the carry, and another to write the output. At each time step, a pointer can be moved left or
right, or it can record a value to the pad. Figure 3a illustrates the environment of this model, and
Figure 3b provides a real execution trace generated by our model.
For the state encoder $f_{enc}$, the model is allowed a view of the scratch pad from the perspective of
each of the four pointers. That is, the model sees the current values at pointer locations of the two
inputs, the carry row and the output row, as 1-of-K encodings, where K is 10 because we are working
in base 10. We also append the values of the input argument tuple $a_t$:
$$f_{enc}(Q, i_1, i_2, i_3, i_4, a_t) = MLP([Q(1, i_1), Q(2, i_2), Q(3, i_3), Q(4, i_4), a_t(1), a_t(2), a_t(3)]) \tag{9}$$
where $Q \in \mathbb{R}^{4 \times N \times K}$, and $i_1, ..., i_4$ are pointers, one per scratch pad row. The first dimension of Q
corresponds to scratch pad rows, N is the number of columns (digits) and K is the one-hot encoding
dimension. To begin the ADD program, we set the initial arguments to a default value and initialize
all pointers to be at the rightmost column. The only subprogram with non-default arguments is ACT,
in which case the arguments indicate an action to be taken by a specified pointer.
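The grade-school procedure that the ADD program is trained to imitate can be sketched as a hand-written reference routine. This is not the learned model. The function name, the trace labels and the use of a single shared column index (instead of four separately moved pointers) are simplifying assumptions, as is stopping after a fixed number of columns rather than predicting when to halt.

```python
def add_with_scratch_pad(a, b):
    """Right-to-left addition on a 4-row pad: input 1, input 2, carry, output."""
    width = max(len(str(a)), len(str(b))) + 1  # extra column for a final carry
    in1 = [int(d) for d in str(a).zfill(width)]
    in2 = [int(d) for d in str(b).zfill(width)]
    carry = [0] * width
    out = [0] * width
    trace = []
    col = width - 1  # all pointers start at the rightmost column

    while col >= 0:
        # ADD1: single-digit add using only the values under the pointers.
        total = in1[col] + in2[col] + carry[col]
        out[col] = total % 10
        trace.append(("ADD1", "WRITE OUT", out[col]))
        if total >= 10 and col > 0:
            # CARRY: mark a 1 in the carry row one unit to the left.
            carry[col - 1] = 1
            trace.append(("CARRY", "WRITE CARRY", 1))
        # LSHIFT: move the pointers one step left.
        col -= 1
        trace.append(("LSHIFT",))
    return int("".join(str(d) for d in out)), trace

result, trace = add_with_scratch_pad(96, 125)
written = [step[2] for step in trace if step[0] == "ADD1"]
```

Each step reads and writes only the current column, which is the locality that the scratch pad and pointers are meant to expose.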

In this section we apply our model to a setting with potentially much longer execution traces: sorting
an array of numbers using bubblesort. As in the case of addition we can use a scratch pad to store
intermediate states of the array. We define the encoder as follows:
$$f_{enc}(Q, i_1, i_2, a_t) = MLP([Q(1, i_1), Q(1, i_2), a_t(1), a_t(2), a_t(3)]) \tag{10}$$
where $Q \in \mathbb{R}^{1 \times N \times K}$ is the pad, N is the array length and K is the array entry embedding dimension.
Figure 4 shows an example series of array states and an excerpt of an execution trace.

Figure 4: Illustration of the sorting environment used in our experiments. (a) Example scratch pad
and pointers used for sorting. Several steps of the BUBBLE subprogram are shown. (b) Excerpt from
the trace of the learned bubblesort program.

We also apply our model to a vision task with a very different perceptual environment - pixels. Given
a rendering of a 3D car, we would like to learn a visual program that “canonicalizes” the model with
respect to its pose. Whatever the starting position, the program should generate a trajectory of
actions that delivers the camera to the target view, e.g. frontal pose at a 15° elevation. For training
data, we used renderings of the 3D car CAD models from (Fidler et al., 2012).
This is a nontrivial problem because different starting positions will require quite different trajec-
tories to reach the target. Further complicating the problem is the fact that the model will need to
generalize to different car models than it saw during training.
We again use a scratch pad, but here it is a very simple read-only pad that only contains a target
camera elevation and azimuth – i.e., the “canonical pose”. Since observations come in the form of
image pixels, we use a convolutional neural network $f_{CNN}$ as the image encoder:
$$f_{enc}(Q, x, i_1, i_2, a_t) = MLP([Q(1, i_1), Q(2, i_2), f_{CNN}(x), a_t(1), a_t(2), a_t(3)]) \tag{11}$$
where $x \in \mathbb{R}^{H \times W \times 3}$ is a car rendering at the current pose, $Q \in \mathbb{R}^{2 \times 1 \times K}$ is the pad containing
canonical azimuth and elevation, $i_1, i_2$ are the (fixed at 1) pointer locations, and K is the one-hot
encoding dimension of pose coordinates. We set K = 24 corresponding to 15° pose increments.
Note, critically, that our NPI model only has access to pixels of the rendering and the target pose,
and is not provided the pose of query frames. We are also aware that one solution to this problem
would be to train a pose classifier network and then find the shortest path to canonical pose via
classical methods. That is also a sensible approach. However, our purpose here is to show that our
method generalizes beyond the scratch pad domain to detailed images of 3D objects, and also to
other environments with a single multi-task model.
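To make the pose encoding and the shape of a canonicalization trajectory concrete, here is a small reference sketch. It uses the true pose, which NPI is never given, so it only illustrates the kind of action sequence a trajectory contains. The function names, the convention that LEFT decreases azimuth, and the horizontal-then-vertical order are assumptions.

```python
STEP = 15          # degrees moved by one ACT call
K = 360 // STEP    # 24 one-hot bins for a pose coordinate

def one_hot_pose(angle_deg):
    """1-of-K encoding of an angle given in multiples of 15 degrees."""
    vec = [0] * K
    vec[(angle_deg % 360) // STEP] = 1
    return vec

def goto(azimuth, elevation, target_azimuth=0, target_elevation=15):
    """Reference trajectory: horizontal moves first, then vertical moves."""
    for angle in (azimuth, elevation, target_azimuth, target_elevation):
        if angle % STEP != 0:
            raise ValueError("angles must be multiples of 15 degrees")
    trace = []
    while azimuth != target_azimuth:          # HGOTO
        if azimuth > target_azimuth:
            azimuth -= STEP
            trace.append(("LGOTO", "ACT(LEFT)"))
        else:
            azimuth += STEP
            trace.append(("RGOTO", "ACT(RIGHT)"))
    while elevation != target_elevation:      # VGOTO
        if elevation < target_elevation:
            elevation += STEP
            trace.append(("UGOTO", "ACT(UP)"))
        else:
            elevation -= STEP
            trace.append(("DGOTO", "ACT(DOWN)"))
    return (azimuth, elevation), trace

pose, trace = goto(45, 0)
```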


Both LSTMs and Neural Turing Machines can learn to perform sorting to a limited degree, although
they have not been shown to generalize well to much longer arrays than were seen during training.
However, we are interested not only in whether sorting can be accomplished, but whether a particular
sorting algorithm (e.g. bubblesort) can be learned by the model, and how effectively in terms of
sample complexity and generalization.
We compare the generalization ability of our model to a flat sequence-to-sequence LSTM (Sutskever
et al., 2014), using the same number of layers (2) and hidden units (256). Note that a flat version
of NPI could also learn sorting of short arrays, but because bubblesort runs in $O(N^2)$ for arrays of
length N, the execution traces quickly become far too long to store the required number of LSTM
states in memory. (By flat in this case, we mean non-compositional, not making use of subprograms,
and only making calls to ACT in order to swap values and move pointers.) Our NPI architecture can
train on much larger arrays by exploiting compositional structure; the memory requirements of any
given subprogram can be restricted to $O(N)$.

Figure 5: Sample complexity. Test accuracy of sequence-to-sequence LSTM versus NPI on
length-20 arrays of single-digit numbers. Note that NPI is able to mine and train on subprogram
traces from each bubblesort example.

Figure 6: Strong vs. weak generalization. Test accuracy of sequence-to-sequence LSTM versus NPI
on varying-length arrays of single-digit numbers. Both models were trained on arrays of
single-digit numbers up to length 20.

A strong indicator of whether a neural network has learned a program well is whether it can run the
program on inputs of previously-unseen sizes. To evaluate this property, we train both the sequence-
to-sequence LSTM and NPI to perform bubblesort on arrays of single-digit numbers from length 2
to length 20. Compared to fixed-length inputs this raises the challenge level during training, but in
exchange we can get a more flexible and generalizable sorting program.
To handle variable-sized inputs, the state representation must have some information about input se-
quence length and the number of steps taken so far. For example, the main BUBBLESORT program
naturally needs to call its helper function BUBBLE a number of times dependent on the sequence
length. We enable this in our model by adding a third pointer that acts as a counter; each time BUB-
BLE is called the pointer is advanced by one step. The scratch pad environment also provides a bit
indicating whether a pointer is at the start or end of a sequence, equivalent in purpose to end tokens
used in a sequence-to-sequence model.
For each length, we provided 64 example bubblesort traces, for a total of 1,216 examples. Then,
we evaluated whether the network can learn to sort arrays beyond length 20. We found that the
trained model generalizes well, and is capable of sorting arrays up to size 60; see Figure 6. At 60
and beyond, we observed a failure mode in which sweeps of pointers across the array would take
the wrong number of steps, suggesting that the limiting performance factor is related to counting.
In stark contrast, when provided with the 1,216 examples, the sequence-to-sequence LSTMs fail to
generalize beyond arrays of length 25 as shown in Figure 6.
To study sample complexity further, we fix the length of the arrays to 20 and vary the number of
training examples. We see in Figure 5 that NPI starts learning with 2 examples and is able to sort
almost perfectly with only 8 examples. The sequence-to-sequence model on the other hand requires
64 examples to start learning and only manages to sort well with over 250 examples.
Figure 7 shows several example canonicalization trajectories generated by our model, starting from
the leftmost car. The image encoder was a convolutional network with three passes of stride-2
convolution and pooling, trained on renderings of size 128 × 128. The canonical target pose in this
case is frontal with 15° elevation. At test time, from an initial rendering, NPI is able to canonicalize
cars of varying appearance from multiple starting positions. Importantly, it can generalize to car
appearances not encountered in the training set as shown in Figure 7.


One challenge for continual learning of neural-network-based agents is that training on new tasks
and experiences can lead to degraded performance in old tasks. The learning of new tasks may
require that the network weights change substantially, so care must be taken to avoid catastrophic
forgetting (McCloskey & Cohen, 1989; O'Reilly et al., 2014). Using NPI, one solution is to fix the
weights of the core routing module, and only make sparse updates to the program memory.

Figure 7: Example canonicalization of several different test set cars. The network is able to generate
and execute the appropriate plan based on the starting car image. This NPI was trained on trajectories
starting at azimuth (−75°...75°), elevation (0°...60°) in 15° increments. The training trajectories
target azimuth 0° and elevation 15°, as in the generated traces above.

When adding a new program the core module’s routing computation will be completely unaffected;
all the learning for a new task occurs in program embedding space. Of course, the addition of new
programs to the memory adds a new choice of program at each time step, and an old program could
mistakenly call a newly added program. To overcome this, when learning a new set of program
vectors with a fixed core, in practice we train not only on example traces of the new program, but
also traces of existing programs. Alternatively, a simpler approach is to prevent existing programs
from calling subsequently added programs, allowing addition of new programs without ever looking
back at training data for known programs. In either case, note that only the memory slots of the new
programs are updated, and all other weights, including other program embeddings, are fixed.
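The fixed-core update rule can be sketched as a masked gradient step over the program memory. This is an illustration under stated assumptions: the dictionary layout, the function name, the learning rate and the toy gradients are hypothetical, and the core network weights are not modelled at all (they would simply receive no update).

```python
def sparse_memory_update(program_memory, grads, new_programs, lr=0.1):
    """Gradient-ascent step applied only to embeddings of newly added programs."""
    updated = {}
    for name, embedding in program_memory.items():
        if name in new_programs:
            updated[name] = [w + lr * g for w, g in zip(embedding, grads[name])]
        else:
            # Existing program embeddings stay frozen.
            updated[name] = list(embedding)
    return updated

# Toy memory: one existing program and two newly added ones.
memory = {"BUBBLESORT": [1.0, 2.0], "MAX": [0.0, 0.0], "RJMP": [0.5, 0.5]}
grads = {name: [1.0, -1.0] for name in memory}
new_memory = sparse_memory_update(memory, grads, new_programs={"MAX", "RJMP"})
```

Only the `MAX` and `RJMP` rows change, so behaviour that depends solely on the frozen core and the old embeddings is left untouched.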
Table 1 shows the result of adding a maximum-finding program MAX to a multitask NPI trained
on addition, sorting and canonicalization. MAX first calls BUBBLESORT and then a new program
RJMP, which moves pointers to the right of the sorted array, where the max element can be read.
During training we froze all weights except for the two newly-added program embeddings. We
find that NPI learns MAX perfectly without forgetting the other tasks. In particular, after training a
single multi-task model as outlined in the following section, learning the MAX program with this
fixed-core multi-task NPI results in no performance deterioration for all three tasks.
In this section we perform a controlled experiment to compare the performance of a multi-task NPI
with several single-task NPI models. Table 1 shows the results for addition, sorting and canonical-
izing 3D car models. We trained and evaluated on 10-digit numbers for addition, length-5 arrays for
sorting, and up to four-step trajectories for canonicalization. As shown in Table 1, one multi-task
NPI can learn all three programs (and necessarily the 21 subprograms) with accuracy comparable
to that of each single-task NPI.

Task              Single   Multi   + Max
Addition          100.0    97.0    97.0
Sorting           100.0    100.0   100.0
Canon. seen car   89.5     91.4    91.4
Canon. unseen     88.7     89.9    89.9
Maximum           -        -       100.0

Table 1: Per-sequence % accuracy. “+ Max” indicates performance after addition of the additional
max-finding subprograms to memory. “unseen” uses a test set with disjoint car models from the
training set, while “seen car” uses the same car models but different trajectories.

We have shown that the NPI can learn programs in very dissimilar environments with different
affordances. In the context of sorting we showed that NPI exhibits very strong generalization in
comparison to sequence-to-sequence LSTMs. We also showed how a trained NPI with a fixed core
can continue to learn new programs without forgetting already learned programs.

We sincerely thank Arun Nair and Ed Grefenstette for helpful suggestions.


Anderson, Michael L. Neural reuse: A fundamental organizational principle of the brain. Behavioral
and Brain Sciences, 33:245–266, 8 2010.
Andre, David and Russell, Stuart J. Programmable reinforcement learning agents. In Advances in
Neural Information Processing Systems, pp. 1019–1025. 2001.
Banzhaf, Wolfgang, Nordin, Peter, Keller, Robert E, and Francone, Frank D. Genetic programming:
An introduction, volume 1. Morgan Kaufmann San Francisco, 1998.
Dietterich, Thomas G. Hierarchical reinforcement learning with the MAXQ value function decom-
position. Journal of Artificial Intelligence Research, 13:227–303, 2000.
Donnarumma, Francesco, Prevete, Roberto, and Trautteur, Giuseppe. Programming in the brain: A
neural network theoretical framework. Connection Science, 24(2-3):71–90, 2012.
Donnarumma, Francesco, Prevete, Roberto, Chersi, Fabian, and Pezzulo, Giovanni. A programmer-
interpreter neural network architecture for prefrontal cognitive control. International Journal of
Neural Systems, 25(6):1550017, 2015.
Fidler, Sanja, Dickinson, Sven, and Urtasun, Raquel. 3D object detection and viewpoint estimation
with a deformable 3D cuboid model. In Advances in neural information processing systems, 2012.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint
arXiv:1410.5401, 2014.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent
nets. In NIPS, 2015.
Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228,
2015.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. 2015.
Kolter, Zico, Abbeel, Pieter, and Ng, Andrew Y. Hierarchical apprenticeship learning with appli-
cation to quadruped locomotion. In Advances in Neural Information Processing Systems, pp.
769–776. 2008.
Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random-access machines. arXiv
preprint arXiv:1511.06392, 2015.
McCloskey, Michael and Cohen, Neal J. Catastrophic interference in connectionist networks: The
sequential learning problem. In The psychology of learning and motivation, volume 24, pp. 109–
165. 1989.
Mou, Lili, Li, Ge, Liu, Yuxuan, Peng, Hao, Jin, Zhi, Xu, Yan, and Zhang, Lu. Building program
vector representations for deep learning. arXiv preprint arXiv:1409.3358, 2014.
Neelakantan, Arvind, Le, Quoc V, and Sutskever, Ilya. Neural programmer: Inducing latent pro-
grams with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
O'Reilly, Randall C., Bhattacharyya, Rajan, Howard, Michael D., and Ketz, Nicholas. Complementary
learning systems. Cognitive Science, 38(6):1229–1248, 2014.
Rothkopf, Constantin A. and Ballard, Dana H. Modular inverse reinforcement learning for visuomotor
behavior. Biological Cybernetics, 107(4):477–490, 2013.
Rumelhart, D. E., Hinton, G. E., and McClelland, J. L. Parallel distributed processing: Explorations
in the microstructure of cognition, vol. 1. chapter A General Framework for Parallel Distributed
Processing, pp. 45–76. MIT Press, 1986.


Schaul, Tom, Horgan, Daniel, Gregor, Karol, and Silver, David. Universal value function approxi-
mators. In International Conference on Machine Learning, 2015.
Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recur-
rent networks. Neural Computation, 4(1):131–139, 1992.
Schneider, Walter and Chein, Jason M. Controlled and automatic processing: behavior, theory, and
biological mechanisms. Cognitive Science, 27(3):525–559, 2003.
Subramanian, Kaushik, Isbell, Charles, and Thomaz, Andrea. Learning options through human
interaction. In IJCAI Workshop on Agents Learning Interactively from Human Teachers, 2011.
Sutskever, Ilya and Hinton, Geoffrey E. Using matrices to model symbolic relationship. In Advances
in Neural Information Processing Systems, pp. 1593–1600. 2009.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A frame-
work for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–
211, 1999.
Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. Advances in Neural Infor-
mation Processing Systems (NIPS), 2015.
Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural Turing machines. arXiv
preprint arXiv:1505.00521, 2015.
Zaremba, Wojciech, Mikolov, Tomas, Joulin, Armand, and Fergus, Rob. Learning simple algorithms
from examples. arXiv preprint arXiv:1511.07275, 2015.



Below we list the programs learned by our model:

Program      Description                                           Calls

ADD          Perform multi-digit addition                          ADD1, LSHIFT
ADD1         Perform single-digit addition                         ACT, CARRY
CARRY        Mark a 1 in the carry row one unit left               ACT
LSHIFT       Shift a specified pointer one step left               ACT
RSHIFT       Shift a specified pointer one step right              ACT
ACT          Move a pointer or write to the scratch pad            -
BUBBLESORT   Perform bubble sort (ascending order)                 BUBBLE, RESET
BUBBLE       Perform one sweep of pointers left to right           ACT, BSTEP
RESET        Move both pointers all the way left                   LSHIFT
BSTEP        Conditionally swap and advance pointers               COMPSWAP, RSHIFT
COMPSWAP     Conditionally swap two elements                       ACT
LSHIFT       Shift a specified pointer one step left               ACT
RSHIFT       Shift a specified pointer one step right              ACT
ACT          Swap two values at pointer locations or move a pointer  -
GOTO         Change 3D car pose to match the target                HGOTO, VGOTO
HGOTO        Move horizontally to the target angle                 LGOTO, RGOTO
LGOTO        Move left to match the target angle                   ACT
RGOTO        Move right to match the target angle                  ACT
VGOTO        Move vertically to the target elevation               UGOTO, DGOTO
UGOTO        Move up to match the target elevation                 ACT
DGOTO        Move down to match the target elevation               ACT
ACT          Move camera 15° up, down, left or right               -
RJMP         Move all pointers to the rightmost position           RSHIFT
MAX          Find maximum element of an array                      BUBBLESORT, RJMP

Table 2: Programs learned for addition, sorting and 3D car canonicalization. Note that the ACT
program has a different effect depending on the environment and on the passed-in arguments.


Figure 8 shows the sequence of program calls for BUBBLESORT. Pointers 1 and 2 are used to implement
the “bubble” operation involving the comparison and swapping of adjacent array elements.
The third pointer (referred to in the trace as “PTR 3”) is used to count the number of calls to BUBBLE.
After every call to RESET the swapping pointers are moved to the beginning of the array and
the counting pointer is advanced by 1. When it has reached the end of the scratch pad, the model
learns to halt execution of BUBBLESORT.

Figure 8: Generated execution trace from our trained NPI sorting the array [9,2,5].
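The pointer-based control flow described above can be sketched as a hand-written reference routine, following the program hierarchy of Table 2. It is not the learned model. The function name and trace strings are illustrative assumptions, and pointer movement is simplified: RESET is modelled as a direct jump, and pointer 2 is allowed to step one past the end of the pad.

```python
def bubblesort_with_pointers(values):
    """Bubble sort driven by two swap pointers and a counting pointer."""
    pad = list(values)
    n = len(pad)
    trace = []
    ptr1 = ptr2 = ptr3 = 0

    # BUBBLESORT halts once the counting pointer reaches the end of the pad.
    while ptr3 < n - 1:
        trace.append("BUBBLE")
        ptr2 = ptr1 + 1               # pointer 2 moves next to pointer 1
        trace.append("PTR 2 RIGHT")
        while ptr2 < n:
            # BSTEP -> COMPSWAP: swap if the adjacent pair is out of order.
            if pad[ptr1] > pad[ptr2]:
                pad[ptr1], pad[ptr2] = pad[ptr2], pad[ptr1]
                trace.append("SWAP 1 2")
            # BSTEP -> RSHIFT: advance both swap pointers.
            ptr1 += 1
            ptr2 += 1
            trace.append("PTR 1 RIGHT")
            trace.append("PTR 2 RIGHT")
        # RESET: swap pointers back to the start, counter advanced by one.
        trace.append("RESET")
        ptr1 = ptr2 = 0
        ptr3 += 1
        trace.append("PTR 3 RIGHT")
    return pad, trace

sorted_pad, trace = bubblesort_with_pointers([9, 2, 5])
```

Each BUBBLE sweep touches O(N) positions, while the whole trace grows quadratically with the array length.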



Based on reviewer feedback, we conducted an additional comparison of NPI and sequence-to-sequence
models for the addition task, to evaluate the generalization ability. We implemented addition
in a sequence-to-sequence model, training to model sequences of the following form, e.g. for
“90 + 160 = 250” we represent the sequence as:


For the simple Seq2Seq baseline above (same number of LSTM layers and hidden units as NPI), we
observed that the model could predict one or two digits reliably, but did not generalize even up to
20-digit addition. However, we are aware that others have gotten multi-digit addition of the above
form to work to some extent with curriculum learning (Zaremba & Sutskever, 2014). In order to
make a more competitive baseline, we helped Seq2Seq in two ways: 1) reverse input digits and
stack the two numbers on top of each other to form a 2-channel sequence, and 2) reverse input digits
and generate reversed output digits immediately at each time step.
In the approach of 1), the seq2seq model schematically looks like this:

output: XXXX250
input 1: 090XXXX
input 2: 061XXXX

In the approach of 2), the sequence looks like this:

output: 052
input 1: 090
input 2: 061

Both 1), which we call s2s-stacked, and 2), which we call s2s-easy, are much stronger competitors to
NPI than even the proposed addition baseline. We compare the generalization performance of NPI
to these baselines in the figure below:
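The two input layouts shown above can be reproduced with a short sketch. The function names are hypothetical, and the rule for the number of `X` padding symbols in the stacked variant is not stated in the text, so it is exposed as a parameter whose default of 4 is chosen only to match the example.

```python
def s2s_easy(a, b):
    """Reversed inputs and reversed output, aligned digit by digit."""
    width = len(str(a + b))
    in1 = str(a).zfill(width)[::-1]
    in2 = str(b).zfill(width)[::-1]
    out = str(a + b).zfill(width)[::-1]
    return in1, in2, out

def s2s_stacked(a, b, pad=4):
    """Reversed inputs stacked as two channels; output emitted after the inputs.

    `pad` is an assumption: the source only shows one example with four X's.
    """
    width = len(str(a + b))
    in1 = str(a).zfill(width)[::-1] + "X" * pad
    in2 = str(b).zfill(width)[::-1] + "X" * pad
    out = "X" * pad + str(a + b)
    return in1, in2, out

easy = s2s_easy(90, 160)
stacked = s2s_stacked(90, 160)
```

In the easy variant every output digit depends only on the input digits seen so far plus a carry, which is the locality the comparison is designed to probe.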

Figure 9: Comparing NPI and Seq2Seq variants on addition generalization to longer sequences.

We found that NPI trained on 32 examples for problem lengths 1,...,20 generalizes with 100% ac-
curacy to all the lengths we tried (up to 3000). s2s-easy trained on twice as many examples gen-
eralizes to just over length 2000 problems. s2s-stacked barely generalizes beyond 5, even with far
more data. This suggests that locality of computation makes a large impact on generalization per-
formance. Even when we carefully ordered and stacked the input numbers for Seq2Seq, NPI still
had an edge in performance. In contrast to Seq2Seq, NPI is taught (supervised for now) to move
its pointers so that the key operations (e.g. single digit add, carry) can be done using only local
information, and this appears to help generalization.

Published as a conference paper at ICLR 2016


Karol Kurach∗ & Marcin Andrychowicz∗ & Ilya Sutskever


arXiv:1511.06392v3 [cs.LG] 9 Feb 2016

In this paper, we propose and investigate a new neural network architecture called
Neural Random Access Machine. It can manipulate and dereference pointers to
an external variable-size random-access memory. The model is trained from pure
input-output examples using backpropagation.
We evaluate the new model on a number of simple algorithmic tasks whose so-
lutions require pointer manipulation and dereferencing. Our results show that the
proposed model can learn to solve algorithmic tasks of such type and is capable
of operating on simple data structures like linked-lists and binary trees. For easier
tasks, the learned solutions generalize to sequences of arbitrary length. More-
over, memory access during inference can be done in a constant time under some


Deep learning is successful for two reasons. First, deep neural networks are able to represent the
“right” kind of functions; second, deep neural networks are trainable. Deep neural networks can
be potentially improved if they get deeper and have fewer parameters, while maintaining train-
ability. By doing so, we move closer towards a practical implementation of Solomonoff induc-
tion (Solomonoff, 1964). The first model that we know of that attempted to train extremely deep
networks with a large memory and few parameters is the Neural Turing Machine (NTM) (Graves
et al., 2014) — a computationally universal deep neural network that is trainable with backprop-
agation. Other models with this property include variants of Stack-Augmented recurrent neural
networks (Joulin & Mikolov, 2015; Grefenstette et al., 2015), and the Grid-LSTM (Kalchbrenner
et al., 2015)—of which the Grid-LSTM has achieved the greatest success on both synthetic and real
tasks. The key characteristic of these models is that their depth, the size of their short term memory,
and their number of parameters are no longer confounded and can be altered independently — which
stands in contrast to models like the LSTM (Hochreiter & Schmidhuber, 1997), whose number of
parameters grows quadratically with the size of their short term memory.
A fundamental operation of modern computers is pointer manipulation and dereferencing. In this
work, we investigate a model class that we name the Neural Random-Access Machine (NRAM),
which is a neural network that has, as primitive operations, the ability to manipulate, store in mem-
ory, and dereference pointers into its working memory. By providing our model with dereferencing
as a primitive, it becomes possible to train models on problems whose solutions require pointer
manipulation and chasing. Although all computationally universal neural networks are equivalent,
which means that the NRAM model does not have a representational advantage over other models if
they are given a sufficient number of computational steps, in practice, the number of timesteps that
a given model has is highly limited, as extremely deep models are very difficult to train. As a result,
the model’s core primitives have a strong effect on the set of functions that can be feasibly learned
in practice, similarly to the way in which the choice of a programming language strongly affects the
functions that can be implemented with an extremely small amount of code.
Finally, the usefulness of computationally-universal neural networks depends entirely on the ability
of backpropagation to find good settings of their parameters. Indeed, it is trivial to define the
“optimal” hypothesis class (Solomonoff, 1964), but the problem of finding the best (or even a good)
function in that class is intractable. Our work puts the backpropagation algorithm to another test,
where the model is extremely deep and intricate.

∗ Equal contribution.

In our experiments, we evaluate our model on several algorithmic problems whose solutions required
pointer manipulation and chasing. These problems include algorithms on a linked-list and a binary
tree. While we were able to achieve encouraging results on these problems, we found that standard
optimization algorithms struggle with these extremely deep and nonlinear models. We believe that
advances in optimization methods will likely lead to better results.


There has been a significant interest in the problem of learning algorithms in the past few years.
The most relevant recent paper is Neural Turing Machines (NTMs) (Graves et al., 2014). It was the
first paper to explicitly suggest the notion that it is worth training a computationally universal neural
network, and achieved encouraging results.
A follow-up model that had the goal of learning algorithms was the Stack-Augmented Recurrent
Neural Network (Joulin & Mikolov, 2015). This work demonstrated that the Stack-Augmented RNN
can generalize to long problem instances from short problem instances. A related model is the
Reinforcement Learning Neural Turing Machine (Zaremba & Sutskever, 2015), which attempted to
use reinforcement learning techniques to train a discrete-continuous hybrid model.
The memory network (Weston et al., 2014) is an early model that attempted to explicitly separate
the memory from computation in a neural network model. The followup work of Sukhbaatar et al.
(2015) combined the memory network with the soft attention mechanism, which allowed it to be
trained with less supervision.
The Grid-LSTM (Kalchbrenner et al., 2015) is a highly interesting extension of LSTM, which allows
LSTM cells to be used for both deep and sequential computation. It achieves excellent results on both
synthetic, algorithmic problems and on real tasks, such as language modelling, machine translation,
and object recognition.
The Pointer Network (Vinyals et al., 2015) is somewhat different from the above models in that it
does not have a writable memory — it is more similar to the attention model of Bahdanau et al.
(2014) in this regard. Despite not having a memory, this model was able to solve a number of diffi-
cult algorithmic problems that include the convex hull and the approximate 2D travelling salesman
problem (TSP).
Finally, it is important to mention the attention model of Bahdanau et al. (2014). Although this
work is not explicitly aimed at learning algorithms, it is by far the most practical model that has
an “algorithmic bent”. Indeed, this model has proven to be highly versatile, and variants of this
model have achieved state-of-the-art results on machine translation (Luong et al., 2015), speech
recognition (Chan et al., 2015), and syntactic parsing (Vinyals et al., 2014), with almost no
domain-specific tuning.


In this section we describe the NRAM model. We start with a description of the simplified version
of our model which does not use an external memory and then explain how to augment it with a
variable-size random-access memory. The core part of the model is a neural controller, which acts
as a “processor”. The controller can be a feedforward neural network or an LSTM, and it is the only
trainable part of the model.
The model contains R registers, each of which holds an integer value. To make our model trainable
with gradient descent, we made it fully differentiable. Hence, each register represents an integer
value with a distribution over the set $\{0, 1, \ldots, M - 1\}$, for some constant M. We do not assume
that these distributions have any special form; they are simply stored as vectors $p \in \mathbb{R}^M$ satisfying
$p_i \ge 0$ and $\sum_i p_i = 1$. The controller does not have direct access to the registers; it can interact
with them using a number of prespecified modules (gates), such as integer addition or equality test.


Let’s denote the modules m1 , m2 , . . . , mQ , where each module is a function:

mi : {0, 1, . . . , M − 1} × {0, 1, . . . , M − 1} → {0, 1, . . . , M − 1}.

On a high level, the model performs a sequence of timesteps, each of which consists of the following
steps:

1. The controller gets some inputs depending on the values of the registers (the controller’s
inputs are described in Sec. 3.1).
2. The controller updates its internal state (if the controller is an LSTM).
3. The controller outputs the description of a “fuzzy circuit” with inputs r1 , . . . , rR , gates
m1 , . . . , mQ and R outputs.
4. The values of the registers are overwritten with the outputs of the circuit.

More precisely, each circuit is created as follows. The inputs for the module mi are chosen by the
controller from the set {r1 , . . . , rR , o1 , . . . , oi−1 }, where:

• rj is the value stored in the j-th register at the current timestep, and
• oj is the output of the module mj at the current timestep.

Hence, for each 1 ≤ i ≤ Q the controller chooses weighted averages of the values
{r1 , . . . , rR , o1 , . . . , oi−1 } which are given as inputs to the module. Therefore,

$$o_i = m_i\big( (r_1, \ldots, r_R, o_1, \ldots, o_{i-1})^T \mathrm{softmax}(a_i),\; (r_1, \ldots, r_R, o_1, \ldots, o_{i-1})^T \mathrm{softmax}(b_i) \big), \qquad (1)$$

where the vectors $a_i, b_i \in \mathbb{R}^{R+i-1}$ are produced by the controller (Fig. 1).

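[Figure 1 omitted: the registers r1, ..., rR and the outputs of modules o1, ..., oi−1 feed two inner-product gates ⟨·, ·⟩, weighted by the softmax (s-m) of ai and bi; the two results are the inputs of mi, which produces oi.]

Figure 1: The execution of the module mi. Gates s-m represent the softmax function and ⟨·, ·⟩
denotes inner product. See Eq. 1 for details.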
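
As an illustration of Eq. 1, the following minimal sketch computes one input of a module as a softmax-weighted average of candidate distributions. The function names, the toy sizes (two registers, M = 3) and the logits are assumptions made for this example only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def select_input(values, logits):
    """Weighted average of candidate distributions, as in Eq. 1.

    values: array of shape (K, M); row k is the distribution of the k-th
            candidate (a register r_j or an earlier module output o_j).
    logits: array of shape (K,) produced by the controller (a_i or b_i).
    Returns a length-M vector, which is again a probability distribution.
    """
    return values.T @ softmax(logits)

# Hypothetical toy setting: R = 2 registers, no earlier module outputs, M = 3.
r1 = np.array([1.0, 0.0, 0.0])  # register 1 holds the integer 0
r2 = np.array([0.0, 0.0, 1.0])  # register 2 holds the integer 2
values = np.stack([r1, r2])

a_i = np.array([0.0, 0.0])      # equal logits: the controller is undecided
first_arg = select_input(values, a_i)
print(first_arg)                # an equal mixture of the two registers
```

With equal logits the module receives an even mixture of both registers; increasing one logit moves the mixture towards the corresponding register.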

Recall that the variables rj represent probability distributions and therefore the inputs to mi , be-
ing weighted averages of probability distributions, are also probability distributions. Thus, as the
modules mi are originally defined for integer inputs and outputs, we must extend their domain to
probability distributions as inputs, which can be done in a natural way (and make their output also
be a probability distribution):
$$\forall\, 0 \le c < M: \quad P(m_i(A, B) = c) = \sum_{0 \le a, b < M} P(A = a)\, P(B = b)\, [m_i(a, b) = c]. \qquad (2)$$
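
A direct, unoptimized reading of Eq. 2 can be sketched as below. The helper name `fuzzy_apply` and the choice M = 4 are assumptions for the example; the ADD module is the one listed later in the text.

```python
import numpy as np

def fuzzy_apply(m, A, B):
    """Extend an integer module m(a, b) to input distributions A and B (Eq. 2)."""
    M = len(A)
    out = np.zeros(M)
    for a in range(M):
        for b in range(M):
            # Probability mass of the pair (a, b) goes to the value m(a, b).
            out[m(a, b)] += A[a] * B[b]
    return out

M = 4  # hypothetical small value range {0, 1, 2, 3}

def add(a, b):
    """The ADD module: (a + b) mod M."""
    return (a + b) % M

A = np.array([0.5, 0.5, 0.0, 0.0])  # value is 0 or 1 with equal probability
B = np.array([0.0, 1.0, 0.0, 0.0])  # value is exactly 1
result = fuzzy_apply(add, A, B)
print(result)
```

The double loop visits all M² pairs (a, b), which makes the cost of operating on full distributions explicit.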

After the modules have produced their outputs, the controller decides which of the values
{r1 , . . . , rR , o1 , . . . , oQ } should be stored in the registers. In detail, the controller outputs the
vectors $c_i \in \mathbb{R}^{R+Q}$ for 1 ≤ i ≤ R and the values of the registers are updated (simultaneously) using
the formula:
$$r_i := (r_1, \ldots, r_R, o_1, \ldots, o_Q)^T \mathrm{softmax}(c_i). \qquad (3)$$



Recall that at the beginning of each timestep the controller receives some inputs, and it is an
important design decision where these inputs should come from. A naive approach is to
use the values of the registers as inputs to the controller. However, the values of the registers are
probability distributions and are stored as vectors $p \in \mathbb{R}^M$. If the entire distributions were given as
inputs to the controller then the number of the model’s parameters would depend on M . This would
be undesirable because, as will be explained in the next section, the value M is linked to the size of
an external random-access memory tape and hence it would prevent the model from generalizing to
different memory sizes.
Hence, for each 1 ≤ i ≤ R the controller receives, as input, only one scalar from each register,
namely P(ri = 0), the probability that the value in the register is equal to 0. This solution has
an additional advantage: it limits the amount of information available to the controller and
forces it to rely on the modules instead of trying to solve the problem on its own. Notice that this
information is sufficient to get the exact value of ri if ri ∈ {0, 1}, which is the case whenever ri is
an output of a “boolean” module, e.g. the inequality test module mi (a, b) = [a < b].


One could use the model described so far for learning sequence-to-sequence transformations by
initializing the registers with the input sequence, and training the model to produce the desired
output sequence in its registers after a given number of timesteps. The disadvantage of such a model
is that it would be completely unable to generalize to longer sequences, because the length of the
sequence that the model can process is equal to the number of its registers, which is constant.
Therefore, we extend the model with a variable-size memory tape, which consists of M memory
cells, each of which stores a distribution over the set {0, 1, . . . , M −1}. Notice that each distribution
stored in a memory cell or a register can be interpreted as a fuzzy address in the memory and used
as a fuzzy pointer. We will hence identify integers in the set {0, 1, . . . , M − 1} with pointers to the
memory. Therefore, the value in each memory cell may be interpreted as an integer or as a pointer.
The exact state of the memory can be described by a matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$, where the value $\mathcal{M}_{i,j}$ is the
probability that the i-th cell holds the value j.
The model interacts with the memory tape solely using two special modules:

• READ module: this module takes as the input a pointer¹ and returns the value stored under
the given address in the memory. This operation is extended to fuzzy pointers similarly
to Eq. 2. More precisely, if p is a vector representing the probability distribution of the
input (i.e. pi is the probability that the input pointer points to the i-th cell) then the module
returns the value $\mathcal{M}^T p$.
• WRITE module: this module takes as the input a pointer p and a value a and stores the value
a under the address p in the memory. The fuzzy form of the operation can be effectively
expressed using matrix operations².
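
The two memory modules can be sketched with NumPy as follows, using the READ expression above and the WRITE update formula given in the accompanying footnote. The function names and the 3-cell memory are assumptions for the example.

```python
import numpy as np

def read(mem, p):
    """Fuzzy READ. mem[i, j] = P(cell i holds value j); p is a pointer distribution."""
    return mem.T @ p

def write(mem, p, a):
    """Fuzzy WRITE: (J - p) J^T * mem + p a^T, with * taken coordinate-wise."""
    J = np.ones_like(p)
    return np.outer(J - p, J) * mem + np.outer(p, a)

M = 3
mem = np.eye(M)                # cell i holds the value i with certainty
p = np.array([0.0, 1.0, 0.0])  # sharp pointer to cell 1
a = np.array([1.0, 0.0, 0.0])  # the value 0, as a distribution

new_mem = write(mem, p, a)     # cell 1 now holds 0; cells 0 and 2 are unchanged
value = read(new_mem, p)       # reading back through the same pointer returns a
print(new_mem)
print(value)
```

For a sharp pointer the update overwrites exactly one row; for a fuzzy pointer each row i is blended between its old content (weight 1 − p_i) and a (weight p_i), so every row remains a probability distribution.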

The full architecture of the NRAM model is presented in Fig. 2.


The memory tape also serves as an input-output channel — the model’s memory is initialized with
the input sequence and the model is expected to produce the output in the memory. Moreover, we
use a novel way of deciding how many timesteps should be executed. After each timestep we let
the controller decide whether it would like to continue the execution or finish it, in which case the
current state of the memory is treated as the output.

¹ Formally each module takes two arguments. In this case the second argument is simply ignored.
² The exact formula is $\mathcal{M} := (J - p) J^T \cdot \mathcal{M} + p a^T$, where J denotes a (column) vector consisting of M
ones and · denotes coordinate-wise multiplication.


[Figure 2 omitted: an LSTM controller reads the registers r1, ..., r4, outputs a circuit over the modules m1, m2, m3 that produces the new values of r1, ..., r4, and outputs a "finish?" signal; some modules are connected to the memory tape.]

Figure 2: One timestep of the NRAM architecture with R = 4 registers. The LSTM controller gets
the “binarized” values r1 , r2 , . . . stored in the registers as inputs and outputs the description of the
circuit in the grey box and the probability of finishing the execution in the current timestep (see
Sec. 3.3 for more detail). The weights of the solid thin connections are outputted by the controller.
The weights of the solid thick connections are trainable parameters of the model. Some of the
modules (i.e. READ and WRITE) may interact with the memory tape (dashed connections).

More precisely, after the timestep t the controller outputs a scalar $f_t \in [0, 1]$³, which denotes the
willingness to finish the execution in the current timestep. Therefore, the probability that the
execution has not been finished before the timestep t is equal to $\prod_{i=1}^{t-1} (1 - f_i)$, and the probability that
the output is produced exactly at the timestep t is equal to $p_t = f_t \cdot \prod_{i=1}^{t-1} (1 - f_i)$. There is also
some maximal allowed number of timesteps T, which is a hyperparameter. The model is forced to
produce output in the last step if it has not done it yet, i.e. $p_T = 1 - \sum_{i=1}^{T-1} p_i$ regardless of the value
$f_T$.
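
The output probabilities defined above can be computed in one pass over the willingness values, as in this small sketch. The function name and the example values of f are assumptions.

```python
def output_probabilities(f):
    """Given f[0..T-1] (willingness to finish at steps 1..T), return [p_1, ..., p_T]."""
    T = len(f)
    p = []
    not_finished = 1.0  # probability that execution has not finished before this step
    for t in range(T - 1):
        p.append(f[t] * not_finished)
        not_finished *= 1.0 - f[t]
    # The model is forced to output at the last step, regardless of f[T-1].
    p.append(1.0 - sum(p))
    return p

p = output_probabilities([0.5, 0.5, 0.0])
print(p)
```

Note that the last entry ignores the final willingness value, so the probabilities always sum to 1.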
Let $\mathcal{M}^{(t)} \in \mathbb{R}^{M \times M}$ denote the memory matrix after the timestep t, i.e. $\mathcal{M}^{(t)}_{i,j}$ is the probability that
the i-th memory cell holds the value j after the timestep t. For an input-output pair (x, y), where
$x, y \in \{0, 1, \ldots, M - 1\}^M$, we define the loss of the model as the expected negative log-likelihood
of producing the correct output, i.e., $-\sum_{t=1}^{T} \left( p_t \cdot \sum_{i=1}^{M} \log \mathcal{M}^{(t)}_{i,y_i} \right)$, assuming that the memory
was initialized with the sequence x⁴. Moreover, for all problems we consider the output sequence
is shorter than the memory. Therefore, we compute the loss only over the memory cells that should
contain the output.
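
A minimal NumPy sketch of this loss, restricted to the output cells, is given below. The function name, the argument layout and the lower bound `eps` inside the logarithm are assumptions for the example.

```python
import numpy as np

def expected_nll(mems, p, y, out_cells, eps=1e-12):
    """Expected negative log-likelihood of the correct output.

    mems: list of T arrays of shape (M, M); mems[t][i, j] is the probability
          that cell i holds value j after timestep t + 1.
    p: length-T output probabilities p_t.
    y: integer array of length M with the target value of every cell.
    out_cells: indices of the cells that should contain the output.
    eps: hypothetical lower bound applied inside the logarithm.
    """
    loss = 0.0
    for p_t, mem in zip(p, mems):
        probs = mem[out_cells, y[out_cells]]  # probability of the correct value per cell
        loss -= p_t * np.sum(np.log(np.maximum(probs, eps)))
    return loss

# Hypothetical toy case: M = 2 cells, T = 2 timesteps.
y = np.array([0, 1])
out_cells = np.array([0, 1])
mems = [np.full((2, 2), 0.5),  # after step 1: both cells are uniform
        np.eye(2)]             # after step 2: both cells hold the target
loss = expected_nll(mems, [0.5, 0.5], y, out_cells)
print(loss)
```

In this toy case only the first timestep contributes, giving 0.5 · 2 · log 2 = log 2.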


Computing the outputs of modules, represented as probability distributions, is a computationally
costly operation. For example, computing the output of the READ module takes Θ(M²) time as it
requires the multiplication of the matrix $\mathcal{M} \in \mathbb{R}^{M \times M}$ and the vector $p \in \mathbb{R}^M$.

One may however suspect (and we empirically verify this claim in Sec. 4) that the NRAM model
naturally learns solutions in which the distributions of intermediate values have very low entropy.
The argument for this hypothesis is that fuzziness in the intermediate values would probably
propagate to the output and cause a higher value of the cost function. To test this hypothesis we trained
the model and then used its discretized version during inference. In the discretized version every
module gets as inputs the values from the modules (or registers) that are the most probable to produce
the given input according to the distribution outputted by the controller. More precisely, it
corresponds to replacing the function softmax in equations (1, 3) with the function returning the vector
containing 1 on the position of the maximum value in the input and zeros on all other positions.

³ In fact, the controller outputs a scalar $x_i$ and $f_i = \mathrm{sigmoid}(x_i)$.
⁴ One could also use the negative log-likelihood of the expected output, i.e. $-\sum_{i=1}^{M} \log\left( \sum_{t=1}^{T} p_t \cdot \mathcal{M}^{(t)}_{i,y_i} \right)$,
as the loss function.
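
The replacement of softmax used for discretization can be sketched as a one-hot "hardmax". The name `hardmax` and the toy candidates are assumptions for the example.

```python
import numpy as np

def hardmax(x):
    """One-hot vector with 1 at the position of the maximum of x, zeros elsewhere."""
    out = np.zeros_like(x, dtype=float)
    out[np.argmax(x)] = 1.0
    return out

# Two candidate values stored as sharp distributions over {0, 1, 2}.
values = np.array([[1.0, 0.0, 0.0],   # candidate 0 holds the integer 0
                   [0.0, 0.0, 1.0]])  # candidate 1 holds the integer 2
logits = np.array([0.2, 1.5])         # hypothetical controller output

selected = values.T @ hardmax(logits)  # picks exactly one candidate
print(selected)
```

With hardmax in place of softmax the weighted average reduces to selecting a single candidate, so sharp inputs stay sharp.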
Notice that in the discretized NRAM model each register and memory cell stores an integer from
the set {0, 1, . . . , M − 1} and therefore all modules may be executed efficiently (assuming that
the functions represented by the modules can be efficiently computed). In case of a feedforward
controller and a small (e.g. ≤ 20) number of registers the inference can be accelerated even
further. Recall that the only inputs to the controller are binarized values of the registers. Therefore,
instead of executing the controller one may simply precompute the (discretized) controller’s output
for each configuration of the registers’ binarized values. Such an algorithm would enjoy an extremely
efficient implementation in machine code.
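
The precomputation described above amounts to building a lookup table with 2^R entries. The sketch below illustrates the idea; `toy_controller` is a hypothetical stand-in for a discretized feedforward controller, not the trained network.

```python
from itertools import product

def binarize(regs):
    """Controller inputs in the discretized model: 1 if a register holds 0, else 0."""
    return tuple(int(r == 0) for r in regs)

def precompute_controller(controller, R):
    """Tabulate a deterministic controller over all 2**R binarized register inputs."""
    return {bits: controller(bits) for bits in product((0, 1), repeat=R)}

def toy_controller(bits):
    # Hypothetical placeholder: a real controller would return a circuit description.
    return sum(bits)

table = precompute_controller(toy_controller, R=4)
circuit = table[binarize((0, 5, 0, 1))]  # constant-time lookup at inference
print(len(table), circuit)
```

The table size grows as 2^R, which is why the text restricts this trick to a small number of registers.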



The NRAM model is fully differentiable and we trained it using the Adam optimization algorithm
(Kingma & Ba, 2014) with the negative log-likelihood cost function. Notice that we do not use any
additional supervised data (such as memory access traces) beyond pure input-output examples.
We used multilayer perceptrons (MLPs) with two hidden layers or LSTMs with a hidden layer
between input and LSTM cells as controllers. The number of hidden units in each layer was equal.
The ReLU nonlinearity (Nair & Hinton, 2010) was used in all experiments.
Below are some important techniques that we used in the training:

Curriculum learning As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever,
2014), curriculum learning is crucial for training deep networks on very complicated problems. We
followed the curriculum learning schedule from Zaremba & Sutskever (2014) without any modifi-
cations. The details can be found in Appendix B.

Gradient clipping Notice that the depth of the unfolded execution is roughly a product of the
number of timesteps and the number of modules. Even for moderately small experiments (e.g. 14
modules and 20 timesteps) this value easily exceeds a few hundred. In networks of such depth,
the gradients can often “explode” (Bengio et al., 1994), which makes training by backpropagation
much harder. We noticed that the gradients w.r.t. the intermediate values inside the backpropagation
were so large that they sometimes led to an overflow in single-precision floating-point arithmetic.
Therefore, we clipped the gradients w.r.t. the activations within the execution of the backpropagation
algorithm. More precisely, each coordinate is separately cropped into the range [−C1 , C1 ] for
some constant C1 . Before updating parameters, we also globally rescale the whole gradient vector,
so that its L2 norm is not bigger than some constant value C2 .
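
The two clipping operations can be sketched in NumPy as below. This is only an illustration of the arithmetic: the function names and example constants are assumptions, and wiring the coordinate-wise clipping into the backward pass depends on the framework used.

```python
import numpy as np

def clip_coordinates(grad, c1):
    """Crop every coordinate of an intermediate gradient into [-c1, c1]."""
    return np.clip(grad, -c1, c1)

def rescale_global_norm(grads, c2):
    """Rescale a list of parameter gradients so their joint L2 norm is at most c2."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= c2:
        return grads
    return [g * (c2 / total) for g in grads]

clipped = clip_coordinates(np.array([-5.0, 0.5, 3.0]), c1=1.0)
rescaled = rescale_global_norm([np.array([3.0, 0.0]), np.array([0.0, 4.0])], c2=1.0)
print(clipped)
print(rescaled)
```

Coordinate-wise cropping changes the direction of the gradient, whereas the global rescaling preserves it and only shrinks its length.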

Noise We added random Gaussian noise to the computed gradients after the backpropagation step.
The variance of this noise decays exponentially during the training. The details can be found in
Neelakantan et al. (2015).

Enforcing Distribution Constraints For very deep networks, a small error in one place can prop-
agate to a huge error in some other place. This was the case with our pointers: they are probability
distributions over memory cells and they should sum up to 1. However, after a number of operations
are applied, they can accumulate error as a result of inaccurate floating-point arithmetic.
We have a special layer which is responsible for rescaling all values (multiplying by the inverse of
their sum), to make sure they always represent a probability distribution. We add this layer to our
model in a few critical places (e.g. after the softmax operation)⁵.

⁵ We do not however backpropagate through these renormalizing operations, i.e. during the backward pass
we simply assume that they are identities.


Entropy While searching for a solution, the network can fix the pointer distribution on some
particular value. This is advantageous at the end of training, because ideally we would like to be
able to discretize the model. However, if this happens at the beginning of training, it could force the
network to stay in a local minimum, with a small chance of moving the probability mass to some
other value. To address this problem, we encourage the network to explore the space of solutions by
adding an “entropy bonus” that decreases over time. More precisely, for every distribution outputted
by the controller, we subtract from the cost function the entropy of the distribution multiplied by
some coefficient, which decreases exponentially during the training.
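
The entropy bonus can be sketched as follows. The initial coefficient `coef0` and the per-step `decay` factor are hypothetical values chosen for the example; the text only states that the coefficient decreases exponentially.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution; zero entries contribute 0."""
    return -sum(q * math.log(q) for q in dist if q > 0)

def cost_with_entropy_bonus(cost, dists, step, coef0=0.1, decay=0.999):
    """Subtract coef * entropy for every controller distribution.

    coef0 and decay are hypothetical hyperparameters; the coefficient
    coef0 * decay**step decays exponentially with the training step.
    """
    coef = coef0 * decay ** step
    return cost - coef * sum(entropy(d) for d in dists)

early = cost_with_entropy_bonus(1.0, [[0.5, 0.5]], step=0)
late = cost_with_entropy_bonus(1.0, [[0.5, 0.5]], step=10000)
print(early, late)
```

Early in training a spread-out distribution lowers the cost noticeably; late in training the bonus is negligible, so sharp distributions are no longer discouraged.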

Limiting the values of logarithms There are two places in our model where the logarithms are
computed: in the cost function and in the entropy computation. Inputs to those logarithms can
be very small numbers, which may cause very big values of the cost function or even overflows in
floating-point arithmetic. To prevent this phenomenon we use log(max(x, ε)) instead of log(x) for
some small hyperparameter ε whenever a logarithm is computed.
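
In code, the bounded logarithm is a one-liner. The default value of `eps` below is an assumption for the example.

```python
import math

def safe_log(x, eps=1e-6):
    """log(max(x, eps)): bounded below by log(eps) even when x is 0."""
    return math.log(max(x, eps))

print(safe_log(0.5))  # unchanged for ordinary inputs
print(safe_log(0.0))  # equals log(eps); math.log(0.0) would raise ValueError
```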


We now describe the tasks used in our experiments. For every task, the input is given to the network
in the memory tape, and the network’s goal is to modify the memory according to the task’s
specification. We allow the network to modify the original input. The final error for a test case is computed
as $\frac{c}{m}$, where c is the number of correctly written cells, and m represents the total number of cells
that should be modified.
Due to limited space, we describe the tasks only briefly here. The detailed memory layout of inputs
and outputs can be found in the Appendix A.

1. Access Given a value k and an array A, return A[k].

2. Increment Given an array, increment all its elements by 1.
3. Copy Given an array and a pointer to the destination, copy all elements from the array to
the given location.
4. Reverse Given an array and a pointer to the destination, copy all elements from the array
in reversed order.
5. Swap Given two pointers p, q and an array A, swap elements A[p] and A[q].
6. Permutation Given two arrays of n elements: P (contains a permutation of numbers
1, . . . , n) and A (contains random elements), permute A according to P .
7. ListK Given a pointer to the head of a linked list and a number k, find the value of the k-th
element on the list.
8. ListSearch Given a pointer to the head of a linked list and a value v to find, return a pointer
to the first node on the list with the value v.
9. Merge Given pointers to 2 sorted arrays A and B, merge them.
10. WalkBST Given a pointer to the root of a Binary Search Tree, and a path to be traversed
(sequence of left/right steps), return the element at the end of the path.


In all of our experiments we used the same sequence of 14 modules: READ (described in Sec. 3.2),
ZERO(a, b) = 0, ONE(a, b) = 1, TWO(a, b) = 2, INC(a, b) = (a+1) mod M , ADD(a, b) = (a+b)
mod M , SUB(a, b) = (a − b) mod M , DEC(a, b) = (a − 1) mod M , LESS-THAN(a, b) = [a <
b], LESS-OR-EQUAL-THAN(a, b) = [a ≤ b], EQUALITY-TEST(a, b) = [a = b], MIN(a, b) =
min(a, b), MAX(a, b) = max(a, b), WRITE (described in Sec. 3.2).
We also considered settings in which the module sequence is repeated many times, e.g. there are 28
modules, where modules number 1. and 15. are READ, modules number 2. and 16. are ZERO and so
on. The number of repetitions is a hyperparameter.
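
For reference, the twelve purely arithmetic modules from the list above can be written as integer functions, as in the sketch below. READ and WRITE are left out because they need access to the memory tape; the dictionary layout and the value M = 8 are assumptions for the example.

```python
M = 8  # hypothetical number of integer values (and memory cells)

# Each module maps a pair of integers in {0, ..., M-1} to an integer in the same set.
# Boolean modules return 1 for true and 0 for false.
MODULES = {
    "ZERO": lambda a, b: 0,
    "ONE": lambda a, b: 1,
    "TWO": lambda a, b: 2,
    "INC": lambda a, b: (a + 1) % M,
    "ADD": lambda a, b: (a + b) % M,
    "SUB": lambda a, b: (a - b) % M,
    "DEC": lambda a, b: (a - 1) % M,
    "LESS-THAN": lambda a, b: int(a < b),
    "LESS-OR-EQUAL-THAN": lambda a, b: int(a <= b),
    "EQUALITY-TEST": lambda a, b: int(a == b),
    "MIN": lambda a, b: min(a, b),
    "MAX": lambda a, b: max(a, b),
}

print(MODULES["INC"](M - 1, 0))  # wraps around to 0
print(MODULES["SUB"](0, 1))      # wraps around to M - 1
```

Python's `%` operator returns a non-negative result for a positive modulus, so SUB and DEC stay inside {0, ..., M − 1} without extra handling.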


Task          Train Complexity          Train error   Generalization   Discretization

Access        len(A) ≤ 20               0             perfect          perfect
Increment     len(A) ≤ 15               0             perfect          perfect
Copy          len(A) ≤ 15               0             perfect          perfect
Reverse       len(A) ≤ 15               0             perfect          perfect
Swap          len(A) ≤ 20               0             perfect          perfect
Permutation   len(A) ≤ 6                0             almost perfect   perfect
ListK         len(list) ≤ 10            0             strong           hurts performance
ListSearch    len(list) ≤ 6             0             weak             hurts performance
Merge         len(A) + len(B) ≤ 10      1%            weak             hurts performance
WalkBST       size(tree) ≤ 10           0.3%          strong           hurts performance

Table 1: Results of the experiments. The perfect generalization error means that the tested problem
had error 0 for complexity up to 50. Exact generalization errors are presented in Fig. 3. The perfect
discretization means that the discretized version of the model produced exactly the same outputs as
the original model on all test cases.

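[Figure 3 omitted: plot of test error (vertical axis) against max task complexity (horizontal axis, 10 to 30), with one curve per hard task; the surviving legend entries are WalkBST and ListSearch.]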

Figure 3: Generalization errors for hard tasks. The Permutation and ListSearch problems were
trained only up to complexity 6. The remaining problems were trained up to complexity 10. The
horizontal axis denotes the maximal task complexity, i.e., x = 20 denotes results with complexity
sampled uniformly from the interval [1, 20].


Overall, we were able to find parameters that achieved an error 0 for all problems except Merge and
WalkBST (where we got an error of ≤ 1%). As described in Sec. 4.2, our metric is an accuracy on the
memory cells that should be modified. To compute it, we take the continuous memory state produced
by our network, then discretize it (every cell will contain the value with the highest probability), and
finally compare with the expected output. The results of the experiments are summarized in Table 1.
Below we describe our results on all 10 tasks in more detail. We divide them into 2 categories:
“easy” and “hard” tasks. Easy tasks are those that achieved low error scores for many
sets of parameters, so we did not have to spend much time tuning them. The first 5 problems
from our task list belong to this category. Hard tasks, on the other hand, are problems that
trained to a low error rate only in a very small number of cases, e.g. 1 out of 100.


This category includes the following problems: Access, Increment, Copy, Reverse, Swap. For
all of them we were able to find, without much effort, many sets of hyperparameters that achieved
error 0 or close to it.


Step    0  1  2  3  4  5  6  7  8  9 10 11    r1 r2 r3 r4    READ    WRITE

1       6  2 10  6  8  9  0  0  0  0  0  0     0  0  0  0    p:0     p:0 a:6
2       6  2 10  6  8  9  0  0  0  0  0  0     0  5  0  1    p:1     p:6 a:2
3       6  2 10  6  8  9  2  0  0  0  0  0     0  5  1  1    p:1     p:6 a:2
4       6  2 10  6  8  9  2  0  0  0  0  0     0  5  1  2    p:2     p:7 a:10
5       6  2 10  6  8  9  2 10  0  0  0  0     0  5  2  2    p:2     p:7 a:10
6       6  2 10  6  8  9  2 10  0  0  0  0     0  5  2  3    p:3     p:8 a:6
7       6  2 10  6  8  9  2 10  6  0  0  0     0  5  3  3    p:3     p:8 a:6
8       6  2 10  6  8  9  2 10  6  0  0  0     0  5  3  4    p:4     p:9 a:8
9       6  2 10  6  8  9  2 10  6  8  0  0     0  5  4  4    p:4     p:9 a:8
10      6  2 10  6  8  9  2 10  6  8  0  0     0  5  4  5    p:5     p:10 a:9
11      6  2 10  6  8  9  2 10  6  8  9  0     0  5  5  5    p:5     p:10 a:9

Table 2: State of memory and registers for the Copy problem at the start of every timestep. We also show
the arguments given to the READ and WRITE functions in each timestep. The argument “p:” represents the
source/destination address and “a:” represents the value to be written (for WRITE). The value 6 at position 0
in the memory is the pointer to the destination array. It is followed by 5 values (gray columns) that should be
copied.

We also tested how those solutions generalize to longer input sequences. To do this, for every
problem we selected a model that achieved error 0 during the training, and tested it on inputs with
lengths up to 50⁶. To perform these tests we also increased the memory size and the number of
allowed timesteps.
In all cases the model solved the problem perfectly, which shows that it generalizes not only to longer
input sequences, but also to different memory sizes and numbers of allowed timesteps. Moreover,
the discretized version of the model (see Sec. 3.4 for details) also solves all the problems perfectly.
These results show that the NRAM model naturally learns “algorithmic” solutions, which generalize
well.
We were also interested in whether the found solutions generalize to sequences of arbitrary length. This is
easiest to verify in the case of a discretized model with a feedforward controller, because then the
circuits outputted by the controller depend solely on the values of registers, which are integers. We
manually analysed circuits for problems Copy and Increment and verified that the found solutions
generalize to inputs of arbitrary length, assuming that the number of allowed timesteps is appropriate.


This category includes: Permutation, ListK, ListSearch, Merge and WalkBST. For all of
them we had to perform an extensive random search to find a good set of hyperparameters.
Usually, most of the parameter combinations were stuck on the starting curriculum level with
a high error of 50% − 70%. For the first 3 tasks we managed to train the network to achieve
error 0. For WalkBST and Merge the training errors were 0.3% and 1% respectively. For
training those problems we had to introduce additional techniques described in Sec. 4.1.
For Permutation, ListK and WalkBST our model generalizes very well and achieves low
error rates on inputs at least twice as long as the ones seen during the training. The exact
generalization errors are shown in Fig. 3.

[Figure 4 omitted: circuit diagram mapping the registers r1, ..., r4 to their new values r1', ..., r4' through the modules inc, min, add and write, with pointer inputs labelled p.]

Figure 4: The circuit generated at every timestep ≥ 2. The values of the pointer (p) for READ,
WRITE and the value to be written (a) for WRITE are presented in Table 2. The modules whose
outputs are not used were removed from the picture.

The only hard problem on which our model discretizes well is Permutation: on this task
the discretized version of the model produces exactly the same outputs as the original model on all
cases tested. For the remaining four problems the discretized versions of the models perform very
poorly (error rates ≥ 70%). We believe that better results may be obtained by using some techniques
encouraging discretization during the training⁷.

⁶ Unfortunately we could not test for lengths longer than 50 due to the memory restrictions.
We noticed that the training procedure is very unstable and the error often rises from a few percent
to e.g. 70% in just one epoch. Moreover, even if we use the best found set of hyperparameters, the
percentage of random seeds that converge to error 0 was usually about 11%. We observed that
the percentage of converging seeds is much lower if we do not add noise to the gradient: in this case
only about 1% of seeds converge.


A comparison to other models is challenging because we are the first to consider problems with
pointers. The NTM can solve tasks like Copy or Reverse, but it suffers from the inability to naturally
store a pointer to a fixed location in the memory. This makes it unlikely that it could solve tasks such
as ListK, ListSearch or WalkBST since the pointers used in these tasks refer to absolute positions.
What distinguishes our model from most of the previous attempts (including NTMs, Memory Net-
works, Pointer Networks) is the lack of content-based addressing. It was a deliberate design deci-
sion, since this kind of addressing inherently slows down the memory access. In contrast, our model
— if discretized — can access the memory in a constant time.
The NRAM is also the first model that we are aware of employing a differentiable mechanism for
deciding when to finish the computation.


We present one example execution of our model for the problem Copy. For the example, we use
a very small model with 12 memory cells, 4 registers and the standard set of 14 modules. The
controller for this model is a feedforward network, and we run it for 11 timesteps. Table 2 contains,
for every timestep, the state of the memory and registers at the beginning of the timestep.
The model can execute different circuits at different timesteps. In particular, we observed that the
first circuit is slightly different from the rest, since it needs to handle the initialization. Starting from
the second step all generated circuits are the same. We present this circuit in Fig. 4. The register r2
is constant and keeps the offset between the destination array and the source array (6 − 1 = 5 in
this case). The register r3 is responsible for incrementing the pointer in the source array. Its value is
copied to r4⁸, the register used by the READ module. For the WRITE module, it also uses r4 which
is shifted by r2 . The register r1 is unused. This solution generalizes to sequences of arbitrary length.
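
The discretized circuit for timesteps ≥ 2 can be replayed in a few lines. The sketch below is a reconstruction read off Table 2, Fig. 4 and the description above, not the authors' code; in particular, the rule that r3 receives the previous value of r4 is inferred from the register columns of Table 2.

```python
M = 12  # number of memory cells in the example

def step(mem, regs):
    """One timestep of the circuit used from the second step on (inferred from Table 2)."""
    r1, r2, r3, r4 = regs
    value = mem[r4]                    # READ with pointer r4
    mem = list(mem)                    # copy so the caller's list is not modified
    mem[(r4 + r2) % M] = value         # WRITE at ADD(r4, r2)
    new_r4 = min((r3 + 1) % M, r2)     # MIN(INC(r3), r2)
    return mem, (r1, r2, r4, new_r4)   # assumption: r3 takes the old value of r4

# State at the start of timestep 2 in Table 2.
mem = [6, 2, 10, 6, 8, 9, 0, 0, 0, 0, 0, 0]
regs = (0, 5, 0, 1)

trace = []
for t in range(2, 12):                 # timesteps 2, ..., 11
    trace.append((t, list(mem), regs))
    mem, regs = step(mem, regs)

print(mem)   # the five source values now also appear at positions 6..10
print(regs)
```

Each recorded entry of `trace` matches the corresponding row of Table 2, and the source array is copied after two timesteps per element.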

In this paper we presented the Neural Random-Access Machine, which can learn to solve problems
that require explicit manipulation and dereferencing of pointers.
We showed that this model can learn to solve a number of algorithmic problems and generalize well
to inputs longer than ones seen during the training. In particular, for some problems it generalizes
to inputs of arbitrary length.
However, we noticed that the optimization problem resulting from backpropagating through the
execution trace of the program is very challenging for standard optimization techniques. It seems
likely that a method that can search in an easier “abstract” space would be more effective at solving
such problems.

⁷ One could for example add at later stages of training a penalty proportional to the entropy of the
intermediate values of registers/memory.
⁸ In our case r3 < r2, so the MIN module always outputs the value r3 + 1. It is not satisfied in the last
timestep, but then the array is already copied.


Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gra-
dient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In
Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv
preprint arXiv:1508.01211, 2015.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint
arXiv:1410.5401, 2014.
Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to
transduce with unbounded memory. arXiv preprint arXiv:1506.02516, 2015.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent
nets. arXiv preprint arXiv:1503.01007, 2015.
Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. arXiv preprint
arXiv:1507.01526, 2015.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-
based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–
814, 2010.
Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and
Martens, James. Adding gradient noise improves learning for very deep networks. arXiv preprint
arXiv:1511.06807, 2015.
Solomonoff, Ray J. A formal theory of inductive inference. part i. Information and control, 7(1):
1–22, 1964.
Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory net-
works. arXiv preprint arXiv:1503.08895, 2015.
Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey.
Grammar as a foreign language. arXiv preprint arXiv:1412.7449, 2014.
Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. arXiv preprint
arXiv:1506.03134, 2015.
Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint
arXiv:1410.3916, 2014.
Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. arXiv
preprint arXiv:1505.00521, 2015.



In this section we describe in detail the memory layout of inputs and outputs for the tasks used in
our experiments. In all descriptions below, capital letters represent arrays and lowercase letters represent
pointers. NULL denotes the value 0 and is used to mark the end of an array or a missing next
element in a list or a binary tree.

1. Access Given a value k and an array A, return A[k]. Input is given as k, A[0], .., A[n −
1], N U LL and the network should replace the first memory cell with A[k].
2. Increment Given an array A, increment all its elements by 1. Input is given as
A[0], ..., A[n − 1], N U LL and the expected output is A[0] + 1, ..., A[n − 1] + 1.
3. Copy Given an array and a pointer to the destination, copy all elements from the array to
the given location. Input is given as p, A[0], ..., A[n−1] where p points to one element after
A[n − 1]. The expected output is A[0], ..., A[n − 1] at positions p, ..., p + n − 1 respectively.
4. Reverse Given an array and a pointer to the destination, copy all elements from the array
in reversed order. Input is given as p, A[0], ..., A[n − 1] where p points one element after
A[n − 1]. The expected output is A[n − 1], ..., A[0] at positions p, ..., p + n − 1 respectively.
5. Swap Given two pointers p, q and an array A, swap elements A[p] and A[q]. Input is
given as p, q, A[0], .., A[p], ..., A[q], ..., A[n − 1], 0. The expected modified array A is:
A[0], ..., A[q], ..., A[p], ..., A[n − 1].
6. Permutation Given two arrays of n elements: P (contains a permutation of numbers
0, . . . , n − 1) and A (contains random elements), permutate A according to P . Input is
given as a, P [0], ..., P [n − 1], A[0], ..., A[n − 1], where a is a pointer to the array A. The
expected output is A[P [0]], ..., A[P [n − 1]], which should override the array P .
7. ListK Given a pointer to the head of a linked list and a number k, find the value of the
k-th element on the list. List nodes are represented as two adjacent memory cells: a pointer
to the next node and a value. Elements are in random locations in the memory, so that
the network needs to follow the pointers to find the correct element. Input is given as:
head, k, out, ... where head is a pointer to the first node on the list, k indicates how many
hops are needed and out is a cell where the output should be put.
8. ListSearch Given a pointer to the head of a linked list and a value v to find return a pointer
to the first node on the list with the value v. The list is placed in memory in the same way
as in the task ListK. We fill empty memory with “trash” values to prevent the network from
“cheating” and just iterating over the whole memory.
9. Merge Given pointers to 2 sorted arrays A and B, and the pointer to the output o,
merge the two arrays into one sorted array. The input is given as: a, b, o, A[0], ..., A[n − 1],
G, B[0], ..., B[m − 1], G, where G is a special guardian value, a and b point to the first
elements of arrays A and B respectively, and o points to the address after the second G.
The n + m elements should be written in the correct order starting from position o.
10. WalkBST Given a pointer to the root of a Binary Search Tree, and a path to be traversed,
return the element at the end of the path. The BST nodes are represented as triples (v, l,
r), where v is the value, and l, r are pointers to the left/right child. The triples are placed
randomly in the memory. Input is given as root, out, d1 , d2 , ..., dk , NULL, ..., where root
points to the root node and out is a slot for the output. The sequence d1 , ..., dk with di ∈ {0, 1}
represents the path to be traversed: di = 0 means that the network should go to the left
child, di = 1 represents going to the right child.
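To make the memory layout of the pointer-based tasks concrete, here is a minimal reference solution for ListK written directly over a memory tape. It is only a sketch: the function name `solve_list_k`, the reading of k as the number of pointer hops starting from the head, and the example tape are assumptions made for illustration, not part of the task definition.

```python
def solve_list_k(memory):
    """Reference ListK solver over a flat memory tape (illustrative sketch).

    Assumed layout: memory[0] = head pointer, memory[1] = k, memory[2] = out.
    A list node at address p stores its next pointer in memory[p] and its
    value in memory[p + 1].
    """
    memory = list(memory)            # do not modify the caller's tape
    head, k, out = memory[0], memory[1], memory[2]
    node = head
    for _ in range(k):               # follow k next-pointers (assumed meaning of k)
        node = memory[node]
    memory[out] = memory[node + 1]   # write the value of the node that was reached
    return memory


# Hypothetical tape: head = 7, k = 2, out = 3; the nodes are chained 7 -> 4 -> 9.
tape = [7, 2, 3, 0, 9, 50, 0, 4, 40, 0, 60, 0]
result = solve_list_k(tape)
```

Because the nodes sit at unrelated addresses, a solver cannot scan the memory linearly; it has to dereference each next pointer in turn, which is exactly what the loop does.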

Published as a conference paper at ICLR 2016


As noticed in several papers (Bengio et al., 2009; Zaremba & Sutskever, 2014), curriculum learning
is crucial for training deep networks on very complicated problems. We followed the curriculum
learning schedule from Zaremba & Sutskever (2014) without any modifications.
For each of the tasks we have manually defined a sequence of subtasks with increasing difficulty,
where the difficulty is usually measured by the length of the input sequence. During training the
input-output examples are sampled from a distribution that is determined by the current difficulty
level D. The level is increased (up to some maximal value) whenever the error rate of the model
goes below some threshold. Moreover, we ensure that successive increases of D are separated by
some number of batches.
In more detail, to generate an input-output example we first sample a difficulty d from a distribution
determined by the current level D and then draw the example with the difficulty d. The procedure
for sampling d is the following:

• with probability 10%: pick d uniformly at random from the set of all possible difficulties;
• with probability 25%: pick d uniformly from [1, D + e], where e is a sample from a geo-
metric distribution with a success probability 1/2;
• with probability 65%: set d = D + e, where e is sampled as above.

Notice that the above procedure guarantees that every difficulty d can be picked regardless of the
current level D, which has been shown to increase performance (Zaremba & Sutskever, 2014).
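The sampling procedure above can be sketched in a few lines of Python. Several details are assumptions on our part, because the text does not fix them: the geometric variable is taken to count failures before the first success (support 0, 1, 2, ...), the sampled difficulty is clamped to the largest available one, and the names `sample_geometric` and `sample_difficulty` are illustrative.

```python
import random


def sample_geometric(rng, p=0.5):
    """Number of failures before the first success (assumed support 0, 1, 2, ...)."""
    e = 0
    while rng.random() >= p:
        e += 1
    return e


def sample_difficulty(level, max_difficulty, rng):
    """Sample a difficulty d given the current curriculum level D = `level`."""
    roll = rng.random()
    if roll < 0.10:
        # 10%: any difficulty, uniformly at random.
        d = rng.randint(1, max_difficulty)
    elif roll < 0.35:
        # 25%: uniform on [1, D + e].
        d = rng.randint(1, level + sample_geometric(rng))
    else:
        # 65%: d = D + e.
        d = level + sample_geometric(rng)
    # Clamping to the largest available difficulty is an assumption.
    return min(d, max_difficulty)


rng = random.Random(0)
samples = [sample_difficulty(level=5, max_difficulty=20, rng=rng) for _ in range(1000)]
```

Most samples land at or slightly above the current level, while the first two branches keep easier and much harder examples in the mix.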


Below are presented example circuits generated during training for all simple tasks (except Copy
which was presented in the paper). For modules READ and WRITE, the value of the first argument
(pointer to the address to be read/written) is marked as p. For WRITE, the value to be written
is marked as a and the value returned by this module is always 0. For modules LESS-THAN and
LESS-OR-EQUAL-THAN the first parameter is marked as x and the second one as y. Other modules
either have only one parameter or the order of parameters is not important.
For all tasks below (except Increment), the circuit generated at timestep 1 differs from the circuits
generated at steps ≥ 2, which are the same. This is because the first circuit needs to handle the
initialization. We present only the “main” circuits generated for timesteps ≥ 2.


[Circuit diagram not reproduced; modules shown: lt, min, inc, read, write.]

Figure 5: The circuit generated at every timestep ≥ 2 for the task Access.

Step   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  r1  r2
   1   3   1  12   4   7  12   1  13   8   2   1   3  11  11  12   0   0   0
   2   3   1  12   4   7  12   1  13   8   2   1   3  11  11  12   0   3   0
   3   4   1  12   4   7  12   1  13   8   2   1   3  11  11  12   0   3   0

Table 3: Memory for task Access. Only the first memory cell is modified.



[Circuit diagram not reproduced; modules shown: read, inc, max, add, min.]

Figure 6: The circuit generated at every timestep for the task Increment.

Step   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  r1  r2  r3  r4  r5
   1   1  11   3   8   1   2   9   8   5   3   0   0   0   0   0   0   0   0   0   0   0
   2   2  11   3   8   1   2   9   8   5   3   0   0   0   0   0   0   1   2   2   2   1
   3   2  12   3   8   1   2   9   8   5   3   0   0   0   0   0   0   2  12  12  12   2
   4   2  12   4   8   1   2   9   8   5   3   0   0   0   0   0   0   3   4   4   4   3
   5   2  12   4   9   1   2   9   8   5   3   0   0   0   0   0   0   4   9   9   9   4
   6   2  12   4   9   2   2   9   8   5   3   0   0   0   0   0   0   5   2   2   2   5
   7   2  12   4   9   2   3   9   8   5   3   0   0   0   0   0   0   6   3   3   3   6
   8   2  12   4   9   2   3  10   8   5   3   0   0   0   0   0   0   7  10  10  10   7
   9   2  12   4   9   2   3  10   9   5   3   0   0   0   0   0   0   8   9   9   9   8
  10   2  12   4   9   2   3  10   9   6   3   0   0   0   0   0   0   9   6   6   6   9
  11   2  12   4   9   2   3  10   9   6   4   0   0   0   0   0   0  10   4   4   4  10

Table 4: Memory for task Increment.



[Circuit diagram not reproduced; modules shown: le, add, dec, inc, read, write, min, max.]

Figure 7: The circuit generated at every timestep ≥ 2 for the task Reverse.

Step   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  r1  r2  r3  r4
   1   8   8   1   3   5   1   1   2   0   0   0   0   0   0   0   0   0   0   0   0
   2   8   8   1   3   5   1   1   2   0   0   0   0   0   0   0   0   8   0   1   1
   3   8   8   1   3   5   1   1   2   0   0   0   0   0   0   8   0   8   1   2   1
   4   8   8   1   3   5   1   1   2   0   0   0   0   0   1   8   0   8   1   3   1
   5   8   8   1   3   5   1   1   2   0   0   0   0   3   1   8   0   8   1   4   1
   6   8   8   1   3   5   1   1   2   0   0   0   5   3   1   8   0   8   1   5   1
   7   8   8   1   3   5   1   1   2   0   0   1   5   3   1   8   0   8   1   6   1
   8   8   8   1   3   5   1   1   2   0   1   1   5   3   1   8   0   8   1   7   1
   9   8   8   1   3   5   1   1   2   2   1   1   5   3   1   8   0   8   1   8   1
  10   8   8   1   3   5   1   1   2   2   1   1   5   3   1   8   0   8   1   9   1

Table 5: Memory for task Reverse.



For Swap we observed that 2 different circuits are generated, one for even timesteps and one for odd timesteps.

[Circuit diagram not reproduced; modules shown: read, max, write, add.]


Figure 8: The circuit generated at every even timestep for the task Swap.

[Circuit diagram not reproduced; modules shown: write, sub, lt, le, inc.]

Figure 9: The circuit generated at every odd timestep ≥ 3 for the task Swap.

Step   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  r1  r2
   1   4  13   6  10   5   4   6   3   7   1   1  11  13  12   0   0   0   0
   2   5  13   6  10   5   4   6   3   7   1   1  11  13  12   0   0   1   4
   3   5  13   6  10  12   4   6   3   7   1   1  11  13  12   0   0   0  13
   4   5  13   6  10  12   4   6   3   7   1   1  11  13   5   0   0   5   1

Table 6: Memory for task Swap.

Published as a conference paper at ICLR 2016


Łukasz Kaiser & Ilya Sutskever
Google Brain {lukaszkaiser,ilyasu}

arXiv:1511.08228v3 [cs.LG] 15 Mar 2016

Learning an algorithm from examples is a fundamental problem that has been
widely studied. It has been addressed using neural networks too, in particular by
Neural Turing Machines (NTMs). These are fully differentiable computers that
use backpropagation to learn their own programming. Despite their appeal, NTMs
have a weakness that is caused by their sequential nature: they are not parallel and
are hard to train due to their large depth when unfolded.
We present a neural network architecture to address this problem: the Neural
GPU. It is based on a type of convolutional gated recurrent unit and, like the
NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly
parallel which makes it easier to train and efficient to run.
An essential property of algorithms is their ability to handle inputs of arbitrary
size. We show that the Neural GPU can be trained on short instances of an al-
gorithmic task and successfully generalize to long instances. We verified it on a
number of tasks including long addition and long multiplication of numbers represented
in binary. We train the Neural GPU on numbers with up to 20 bits and
observe no errors whatsoever while testing it, even on much longer numbers.
To achieve these results we introduce a technique for training deep recurrent net-
works: parameter sharing relaxation. We also found a small amount of dropout
and gradient noise to have a large positive effect on learning and generalization.


Deep neural networks have recently proven successful at various tasks, such as computer vision
(Krizhevsky et al., 2012), speech recognition (Dahl et al., 2012), and in other domains. Recurrent
neural networks based on long short-term memory (LSTM) cells (Hochreiter & Schmidhuber, 1997)
have been successfully applied to a number of natural language processing tasks. Sequence-to-
sequence recurrent neural networks with such cells can learn very complex tasks in an end-to-end
manner, such as translation (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), parsing
(Vinyals & Kaiser et al., 2015), speech recognition (Chan et al., 2016) or image caption generation
(Vinyals et al., 2014). Since so many tasks can be solved with essentially one model, a natural
question arises: is this model the best we can hope for in supervised learning?
Despite its recent success, the sequence-to-sequence model has limitations. In its basic form, the
entire input is encoded into a single fixed-size vector, so the model cannot generalize to inputs much
longer than this fixed capacity. One way to resolve this problem is by using an attention mechanism
(Bahdanau et al., 2014). This allows the network to inspect arbitrary parts of the input in every de-
coding step, so the basic limitation is removed. But other problems remain, and Joulin & Mikolov
(2015) show a number of basic algorithmic tasks on which sequence-to-sequence LSTM networks
fail to generalize. They propose a stack-augmented recurrent network, and it works on some prob-
lems, but is limited in other ways.
In the best case one would desire a neural network model able to learn arbitrarily complex algorithms
given enough resources. Neural Turing Machines (Graves et al., 2014) have this theoretical property.
However, they are not computationally efficient because they use soft attention and because they tend
to be of considerable depth. Their depth makes the training objective difficult to optimize and im-
possible to parallelize because they are learning a sequential program. Their use of soft attention
requires accessing the entire memory in order to simulate 1 step of computation, which introduces
substantial overhead. These two factors make learning complex algorithms using Neural Turing Machines
difficult. These issues are not limited to Neural Turing Machines; they apply to other architectures
too, such as stack-RNNs (Joulin & Mikolov, 2015) or (De)Queue-RNNs (Grefenstette et al.,
2015). One can try to alleviate these problems using hard attention and reinforcement learning, but
such non-differentiable models do not learn well at present (Zaremba & Sutskever, 2015b).
In this work we present a neural network model, the Neural GPU, that addresses the above issues.
It is a Turing-complete model capable of learning arbitrary algorithms in principle, like a Neural
Turing Machine. But, in contrast to Neural Turing Machines, it is designed to be as parallel and as
shallow as possible. It is more similar to a GPU than to a Turing machine since it uses a smaller num-
ber of parallel computational steps. We show that the Neural GPU works in multiple experiments:

• A Neural GPU can learn long binary multiplication from examples. It is the first neural
network able to learn an algorithm whose run-time is superlinear in the size of its input.
Trained on numbers of up to 20 bits, it made no error on any input we tested, and we
tested on numbers up to 2000 bits long.
• The same architecture can also learn long binary addition and a number of other algorith-
mic tasks, such as counting, copying sequences, reversing them, or duplicating them.


The learning of algorithms with neural networks has seen a lot of interest after the success
of sequence-to-sequence neural networks on language processing tasks (Sutskever et al., 2014;
Bahdanau et al., 2014; Cho et al., 2014). An attempt has even been made to learn to evaluate sim-
ple python programs with a pure sequence-to-sequence model (Zaremba & Sutskever, 2015a), but
more success was seen with more complex models. Neural Turing Machines (Graves et al., 2014)
were shown to learn a number of basic sequence transformations and memory access patterns, and
their reinforcement learning variant (Zaremba & Sutskever, 2015b) has reasonable performance on
a number of tasks as well. Stack, Queue and DeQueue networks (Grefenstette et al., 2015) were also
shown to learn basic sequence transformations such as bigram flipping or sequence reversal.
The Grid LSTM (Kalchbrenner et al., 2016) is another powerful architecture that can learn to mul-
tiply 15-digit decimal numbers. As we will see in the next section, the Grid-LSTM is quite similar
to the Neural GPU – the main difference is that the Neural GPU is less recurrent and is explicitly
constructed from the highly parallel convolution operator.
In image processing, convolutional LSTMs, an architecture similar to the Neural GPU, have recently
been used for weather prediction (Shi et al., 2015) and image compression (Toderici et al., 2016).
We find it encouraging as it hints that the Neural GPU might perform well in other contexts.
Most comparable to this work are the prior experiments with the stack-augmented RNNs
(Joulin & Mikolov, 2015). These networks manage to learn and generalize to unseen lengths on
a number of algorithmic tasks. But, as we show in Section 3.1, stack-augmented RNNs trained to
add numbers up to 20 bits long generalize only to ∼100-bit numbers, never to 200-bit ones, and
never without error. Still, their generalization is the best we were able to obtain without using the
Neural GPU and far surpasses a baseline LSTM sequence-to-sequence model with attention.
The quest for learning algorithms has been pursued much more widely with tools other than neu-
ral networks. It is known under names such as program synthesis, program induction, automatic
programming, or inductive synthesis, and has a long history with many works that we do not cover
here; see, e.g., Gulwani (2010) and Kitzelmann (2010) for a more general perspective.
Since one of our results is the synthesis of an algorithm for long binary addition, let us recall how
this problem has been addressed without neural networks. Importantly, there are two cases of this
problem with different complexity. The easier case is when the two numbers that are to be added
are aligned at input, i.e., if the first (lower-endian) bit of the first number is presented at the same
time as the first bit of the second number, then come the second bits, and so on, as depicted below
for x = 9 = 8 + 1 and y = 5 = 4 + 1 written in binary with the least-significant bit on the left.

Input (x and y aligned)    1 0 0 1
                           1 0 1 0
Desired Output (x + y)     0 1 1 1


In this representation the triples of bits from (x, y, x + y), e.g., (1, 1, 0) (0, 0, 1) (0, 1, 1) (1, 0, 1)
as in the figure above, form a regular language. To learn binary addition in this representation it
therefore suffices to find a regular expression or an automaton that accepts this language, which can
be done with a variant of Angluin’s algorithm (Angluin, 1987). But only a few interesting functions
have regular representations; long multiplication, for example, does not (Blumensath & Grädel,
2000). It is therefore desirable to learn long binary addition without alignment, for example when x
and y are provided one after another. This is the representation we use in the present paper.

Input (x, y) 1 0 0 1 + 1 0 1 0
Desired Output (x + y) 0 1 1 1
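To see why the aligned representation is a regular language, note that a recognizer only needs to remember the current carry bit. The sketch below is a two-state automaton over triples of bits, least-significant bit first. The function name and the requirement that no carry remains at the end (i.e., that the sum fits in the given number of positions) are our assumptions.

```python
def accepts_aligned_addition(triples):
    """Return True if the (x_bit, y_bit, sum_bit) triples, least-significant
    bit first, describe a correct binary addition.

    The only state is the carry bit, so this is a 2-state finite automaton.
    """
    carry = 0
    for x_bit, y_bit, sum_bit in triples:
        total = x_bit + y_bit + carry
        if total % 2 != sum_bit:
            return False            # reject: wrong output bit
        carry = total // 2
    return carry == 0               # accept only if no carry is left over


# The example from the text: x = 9, y = 5, x + y = 14.
example = [(1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1)]
is_valid = accepts_aligned_addition(example)
```

No such constant-memory recognizer exists once x and y arrive one after another, since the whole of x must be remembered before the first bit of y is seen.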

Before we introduce the Neural GPU, let us recall the architecture of a Gated Recurrent Unit
(GRU) (Cho et al., 2014). A GRU is similar to an LSTM, but its input and state are the same
size, which makes it easier for us to generalize it later; a highway network could have also been
used (Srivastava et al., 2015), but it lacks the reset gate. GRUs have shown performance similar to
LSTMs on a number of tasks (Chung et al., 2014; Greff et al., 2015). A GRU takes an input vector
x and a current state vector s, and outputs:
GRU(x, s) = u ⊙ s + (1 − u) ⊙ tanh(W x + U (r ⊙ s) + B), where
u = σ(W ′ x + U ′ s + B ′ ) and r = σ(W ′′ x + U ′′ s + B ′′ ).
In the equations above, W, W ′ , W ′′ , U, U ′ , U ′′ are matrices and B, B ′ , B ′′ are bias vectors; these
are the parameters that will be learned. We write W x for a matrix-vector multiplication and r ⊙ s
for elementwise vector multiplication. The vectors u and r are called gates since their elements are
in [0, 1] — u is the update gate and r is the reset gate.
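The GRU equations translate almost line by line into code. The following dependency-free sketch stores matrices as lists of rows; the dictionary keys (`W1`, `U1`, `B1` for the update gate, `W2`, `U2`, `B2` for the reset gate) and the helper `zero_params` are illustrative names of ours.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru(x, s, p):
    """One GRU step. `p` holds W, U, B (candidate), W1, U1, B1 (update gate u)
    and W2, U2, B2 (reset gate r); the key names are illustrative."""
    u = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["W1"], x), matvec(p["U1"], s), p["B1"])]
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["W2"], x), matvec(p["U2"], s), p["B2"])]
    reset_state = [ri * si for ri, si in zip(r, s)]            # r ⊙ s
    candidate = [math.tanh(a + b + c)
                 for a, b, c in zip(matvec(p["W"], x),
                                    matvec(p["U"], reset_state), p["B"])]
    # u ⊙ s + (1 − u) ⊙ candidate
    return [ui * si + (1.0 - ui) * ci for ui, si, ci in zip(u, s, candidate)]


def zero_params(m):
    """All-zero parameters for state size m (used for the sanity check below)."""
    def zeros():
        return [[0.0] * m for _ in range(m)]
    return {"W": zeros(), "U": zeros(), "B": [0.0] * m,
            "W1": zeros(), "U1": zeros(), "B1": [0.0] * m,
            "W2": zeros(), "U2": zeros(), "B2": [0.0] * m}


new_state = gru([1.0, 2.0], [0.4, -0.8], zero_params(2))
```

With all parameters at zero both gates equal 0.5 and the candidate is tanh(0) = 0, so the step simply halves the state. This gives a quick check that the gating is wired as in the formula.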
In recurrent neural networks a unit like GRU is applied at every step and the result is both passed as
new state and used to compute the output. In a Neural GPU we do not process a new input in every
step. Instead, all inputs are written into the starting state s0 . This state has 2-dimensional structure:
it consists of w × h vectors of m numbers, i.e., it is a 3-dimensional tensor of shape [w, h, m]. This
mental image evolves in time in a way defined by a convolutional gated recurrent unit:
CGRU(s) = u ⊙ s + (1 − u) ⊙ tanh(U ∗ (r ⊙ s) + B), where
u = σ(U ′ ∗ s + B ′ ) and r = σ(U ′′ ∗ s + B ′′ ).
U ∗ s above denotes the convolution of a kernel bank U with the mental image s. A kernel bank is a
4-dimensional tensor of shape [kw , kh , m, m], i.e., it contains kw · kh · m2 parameters, where kw and
kh are kernel width and height. It is applied to a mental image s of shape [w, h, m] which results in
another mental image U ∗ s of the same shape defined by:
$$U * s[x, y, i] = \sum_{u=\lfloor -k_w/2 \rfloor}^{\lfloor k_w/2 \rfloor} \; \sum_{v=\lfloor -k_h/2 \rfloor}^{\lfloor k_h/2 \rfloor} \; \sum_{c=1}^{m} s[x+u,\, y+v,\, c] \cdot U[u, v, c, i].$$

In the equation above the index x + u might sometimes be negative or larger than the size of s, and
in such cases we assume the value is 0. This corresponds to the standard convolution operator used
in convolutional neural networks with zero padding on both sides and stride 1. Using the standard
operator has the advantage that it is heavily optimized (see Section 4 for Neural GPU performance).
New work on faster convolutions, e.g., Lavin & Gray (2015), can be directly used in a Neural GPU.
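As an illustration of the convolution and the CGRU step, here is a deliberately slow pure-Python sketch. It is not the optimized operator the text refers to. We assume odd kernel sizes, with offsets −⌊k/2⌋, ..., ⌊k/2⌋ mapped to kernel indices 0, ..., k − 1, and a bias vector of length m that is shared by all positions; both conventions, as well as the argument names, are assumptions.

```python
import math


def conv(kernel, image):
    """Zero-padded, stride-1 convolution of a kernel bank with a mental image.

    image:  nested lists of shape [w][h][m]
    kernel: nested lists of shape [kw][kh][m][m], with kw and kh odd (assumed)
    """
    w, h, m = len(image), len(image[0]), len(image[0][0])
    kw, kh = len(kernel), len(kernel[0])
    out = [[[0.0] * m for _ in range(h)] for _ in range(w)]
    for x in range(w):
        for y in range(h):
            for du in range(-(kw // 2), kw // 2 + 1):
                for dv in range(-(kh // 2), kh // 2 + 1):
                    xs, ys = x + du, y + dv
                    if not (0 <= xs < w and 0 <= ys < h):
                        continue    # outside the image: the value is treated as 0
                    taps = kernel[du + kw // 2][dv + kh // 2]
                    for c in range(m):
                        for i in range(m):
                            out[x][y][i] += image[xs][ys][c] * taps[c][i]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cgru(s, U, B, U1, B1, U2, B2):
    """One CGRU step; U1/B1 parameterize the update gate, U2/B2 the reset gate."""
    w, h, m = len(s), len(s[0]), len(s[0][0])

    def grid(f):
        return [[[f(x, y, i) for i in range(m)] for y in range(h)] for x in range(w)]

    cu, cr = conv(U1, s), conv(U2, s)
    u = grid(lambda x, y, i: sigmoid(cu[x][y][i] + B1[i]))
    r = grid(lambda x, y, i: sigmoid(cr[x][y][i] + B2[i]))
    cc = conv(U, grid(lambda x, y, i: r[x][y][i] * s[x][y][i]))
    return grid(lambda x, y, i: u[x][y][i] * s[x][y][i]
                + (1.0 - u[x][y][i]) * math.tanh(cc[x][y][i] + B[i]))
```

A kernel bank whose only non-zero taps form an identity matrix at the central offset leaves the mental image unchanged, which is a convenient check of the index conventions.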
Knowing how a CGRU gate works, the definition of an l-layer Neural GPU is simple, as depicted in
Figure 1. The given sequence i = (i1 , . . . , in ) of n discrete symbols from {0, . . . , I} is first em-
bedded into the mental image s0 by concatenating the vectors obtained from an embedding lookup
of the input symbols into its first column. More precisely, we create the starting mental image s0 of
shape [w, n, m] by using an embedding matrix E of shape [I, m] and setting s0 [0, k, :] = E[ik ] (in
python notation) for all k = 1 . . . n (here i1 , . . . , in is the input). All other elements of s0 are set to
0. Then, we apply l different CGRU gates in turn for n steps to produce the final mental image sfin :
$s_{t+1} = \mathrm{CGRU}_l(\mathrm{CGRU}_{l-1} \ldots \mathrm{CGRU}_1(s_t) \ldots)$ and $s_{\mathrm{fin}} = s_n$.


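[Diagram not reproduced: the inputs i1, ..., in are written into s0, the state evolves through s1, ..., sn−1 to sn, and the outputs o1, ..., on are read from sn.]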

Figure 1: Neural GPU with 2 layers and width w = 3 unfolded in time.

The result of a Neural GPU is produced by multiplying each item in the first column of sfin by
an output matrix O to obtain the logits lk = Osfin [0, k, :] and then selecting the maximal one:
ok = argmax(lk ). During training we use the standard loss function, i.e., we compute a softmax
over the logits lk and use the negative log probability of the target as the loss.
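Putting the pieces together, the forward pass (embedding, n steps of l layers, readout) can be sketched as follows. To keep the block self-contained the CGRU layers are passed in as arbitrary callables, and indices are 0-based rather than the 1-based k used in the text; the function name and the toy matrices are illustrative assumptions.

```python
def neural_gpu_forward(symbols, E, layers, O, w):
    """Run a Neural GPU on a list of input symbols (illustrative sketch).

    symbols: list of n integer symbols, each a valid row index of E
    E:       embedding matrix, one length-m row per symbol
    layers:  list of l callables, each mapping a mental image to a mental image
             (stand-ins for the CGRU layers)
    O:       output matrix, one length-m row per output symbol
    w:       width of the mental image
    """
    n, m = len(symbols), len(E[0])
    # s0 has shape [w][n][m]; only the first column (index 0) holds the input.
    s = [[[0.0] * m for _ in range(n)] for _ in range(w)]
    for k, symbol in enumerate(symbols):
        s[0][k] = list(E[symbol])
    # Apply all l layers in turn, for n steps.
    for _ in range(n):
        for layer in layers:
            s = layer(s)
    # Read out: logits l_k = O s_fin[0, k, :], then take the argmax.
    outputs = []
    for k in range(n):
        logits = [sum(o * v for o, v in zip(row, s[0][k])) for row in O]
        outputs.append(max(range(len(logits)), key=logits.__getitem__))
    return outputs


# Toy check with identity "layers": the readout then simply decodes the embedding.
E = [[1.0, 0.0], [0.0, 1.0]]
O = [[1.0, 0.0], [0.0, 1.0]]
identity = lambda image: image
decoded = neural_gpu_forward([1, 0, 1], E, [identity, identity], O, w=4)
```

Note that the number of steps equals the input length n, so longer inputs automatically receive more computation without any change to the parameters.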
Since all components of a Neural GPU are clearly differentiable, we can train using any stochastic
gradient descent optimizer. For the results presented in this paper we used the Adam optimizer
(Kingma & Ba, 2014) with ε = 10^−4 and the gradient norm clipped to 1. The number of layers was
set to l = 2, the width of mental images was constant at w = 4, the number of maps in each mental
image point was m = 24, and the convolution kernel width and height were always kw = kh = 3.

Computational power of Neural GPUs. While the above definition is simple, it might not be
immediately obvious what kind of functions a Neural GPU can compute. Why can we expect it to
be able to perform long multiplication? To answer such questions it is useful to draw an analogy
between a Neural GPU and a discrete 2-dimensional cellular automaton. Except for being discrete
and the lack of a gating mechanism, such automata are quite similar to Neural GPUs. Of course,
these are large exceptions. Dense representations often have more capacity than purely discrete
states and the gating mechanism is crucial to avoid vanishing gradients during training. But the
computational power of cellular automata is much better understood. In particular, it is well known
that a cellular automaton can exploit its parallelism to multiply two n-bit numbers in O(n) steps
using Atrubin’s algorithm. We recommend the online book (Vivien, 2003) to get an understanding
of this algorithm and the computational power of cellular automata.

In this section, we present experiments showing that a Neural GPU can successfully learn a number
of algorithmic tasks and generalize well beyond the lengths that it was trained on. We start with the
two tasks we focused on, long binary addition and long binary multiplication. Then, to demonstrate
the generality of the model, we show that Neural GPUs perform well on several other tasks as well.


The two core tasks on which we study the performance of Neural GPUs are long binary addition
and long binary multiplication. We chose them because they are fundamental tasks and because
there is no known linear-time algorithm for long multiplication. As described in Section 2, we
input a sequence of discrete symbols into the network and we read out a sequence of symbols
again. For binary addition, we use a set of 4 symbols: {0, 1, +, PAD} and for multiplication we use
{0, 1, ·, PAD}. The PAD symbol is only used for padding so we depict it as empty space below.

Long binary addition (badd) is the task of adding two numbers represented lower-endian in
binary notation. We always add numbers of the same length, but we allow them to have 0s at the start,
so numbers of differing lengths can be padded to equal size. Given two d-bit numbers the full
sequence length is n = 2d + 1, as seen in the example below, representing (1 + 4) + (2 + 4 + 8) =
5 + 14 = 19 = (16 + 2 + 1).


Task@Bits    Neural GPU   stackRNN   LSTM+A

badd@20      100%         100%       100%
badd@25      100%         100%       73%
badd@100     100%         88%        0%
badd@200     100%         0%         0%
badd@2000    100%         0%         0%
bmul@20      100%         N/A        0%
bmul@25      100%         N/A        0%
bmul@200     100%         N/A        0%
bmul@2000    100%         N/A        0%

Table 1: Neural GPU, stackRNN, and LSTM+A results on addition and multiplication. The table
shows the fraction of test cases for which every single bit of the model’s output is correct.

Input 1 0 1 0 + 0 1 1 1
Output 1 1 0 0 1

Long binary multiplication (bmul) is the task of multiplying two binary numbers, represented
lower-endian. Again, we always multiply numbers of the same length, but we allow them to have 0s
at the start, so numbers of differing lengths can be padded to equal size. Given two d-bit numbers, the
full sequence length is again n = 2d+1, as seen in the example below, representing (2+4)·(2+8) =
6 · 10 = 60 = 32 + 16 + 8 + 4.

Input 0 1 1 0 · 0 1 0 1
Output 0 0 1 1 1 1
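The input and output encodings for both tasks can be generated with a few lines of code, which also makes it easy to verify the worked examples by hand. This is a sketch under assumptions: the helper names are ours, the character `*` stands in for the multiplication symbol, and padding the output with PAD symbols up to the sequence length is omitted.

```python
def to_bits(value, length=None):
    """Binary digits of `value`, least-significant bit first.

    With `length` given, the result is zero-padded to that many digits.
    """
    bits = []
    while value > 0:
        bits.append(value % 2)
        value //= 2
    if length is not None:
        bits += [0] * (length - len(bits))
    return bits or [0]


def make_example(x, y, d, op):
    """Build (input, output) symbol lists for two d-bit operands.

    `op` is "+" for badd or "*" for bmul. The input has length 2d + 1.
    """
    result = x + y if op == "+" else x * y
    return to_bits(x, d) + [op] + to_bits(y, d), to_bits(result)


badd_in, badd_out = make_example(5, 14, 4, "+")
bmul_in, bmul_out = make_example(6, 10, 4, "*")
```

For d = 4 both inputs have 2d + 1 = 9 symbols. The product of two d-bit numbers can need up to 2d bits, so `bmul_out` may be longer than either operand.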

Models. We compare three different models on the above tasks. In addition to the Neural GPU
we include a baseline LSTM recurrent neural network with an attention mechanism. We call this
model LSTM+A as it is exactly the same as described in (Vinyals & Kaiser et al., 2015). It is a
3-layer model with 64 units in each LSTM cell in each layer, which results in about 200k parameters
(the Neural GPU uses m = 24 and has about 30k parameters). Both the Neural GPU and
the LSTM+A baseline were trained using all the techniques described below, including curriculum
training and gradient noise. Finally, on binary addition, we also include the stack-RNN model from
(Joulin & Mikolov, 2015). This model was not trained using our training regime, but exactly
as provided in its source code, only with nmax = 41. To match our training procedure, we ran
it 729 times (cf. Section 3.3) with different random seeds and we report the best obtained result.

Results. We measure the rate of fully correct output sequences and report the results in Table 1.
For both tasks, we first show the rate at the maximum length seen during training, i.e., for
20-bit numbers. Note that LSTM+A is not able to learn long binary multiplication at this length; it
does not even fit the training data. Then we report numbers for sizes not seen during training.
As you can see, a Neural GPU can learn a multiplication algorithm that generalizes perfectly, at least
as far as we were able to test (technical limits of our implementation prevented us from testing much
above 2000 bits). Even for the simpler task of binary addition, stack-RNNs work only up to length
100. This is still much better than the LSTM+A baseline which only generalizes to length 25.


In addition to the two main tasks above, we tested Neural GPUs on the following simpler algorithmic
tasks. The same architecture as used above was able to solve all of the tasks described below, i.e.,
after being trained on sequences of length up to 41 we were not able to find any error on sequences
of any length we tested (up to 4001).

Copying sequences is the simple task of producing on output the same sequence as on input. It is
very easy for a Neural GPU; in fact, all models converge quickly and generalize perfectly.

Reversing sequences is the task of reversing a sequence of bits, where n is the length of the sequence.


Duplicating sequences is the task of duplicating the input bit sequence on output twice, as in the
example below. We use the padding symbol on input to make it match the output length. We trained
on sequences of inputs up to 20 bits, so outputs were up to 40 bits long, and tested on inputs up to
2000 bits long.

Input 0 0 1 1
Output 0 0 1 1 0 0 1 1

Counting by sorting bits is the task of sorting the input bit sequence on output. Since there are
only 2 symbols to sort, this is a counting task – the network must count how many 0s are in the
input and produce the output accordingly, as in the example below.

Input 1 0 1 1 0 0 1 0
Output 0 0 0 0 1 1 1 1
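The four simpler tasks are easy to state as input/output generators, sketched below. The function names and the use of `None` as the padding symbol are illustrative choices of ours.

```python
PAD = None  # stand-in for the padding symbol


def copy_task(bits):
    return list(bits), list(bits)


def reverse_task(bits):
    return list(bits), list(reversed(bits))


def duplicate_task(bits):
    # The input is padded so that it matches the output length.
    return list(bits) + [PAD] * len(bits), list(bits) * 2


def sort_task(bits):
    # With only two symbols, sorting amounts to counting the 0s.
    return list(bits), sorted(bits)


dup_in, dup_out = duplicate_task([0, 0, 1, 1])
sort_in, sort_out = sort_task([1, 0, 1, 1, 0, 0, 1, 0])
```

The two calls at the end reproduce the duplication and sorting examples shown in the text.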


Here we describe the training methods that we used to improve our results. Note that we applied
these methods to the LSTM+A baseline as well, to keep the above comparison fair. We focus on
the most important elements of our training regime; all less relevant details can be found in the code
which is released as open-source.1

Grid search. Each result we report is obtained by running a grid search over 3^6 = 729 instances.
We consider 3 settings of the learning rate, initial parameters scale, and 4 other hyperparameters
discussed below: the relaxation pull factor, curriculum progress threshold, gradient noise scale, and
dropout. An important effect of running this grid search is also that we train 729 models with differ-
ent random seeds every time. Usually only a few of these models generalize to 2000-bit numbers,
but a significant fraction works well on 200-bit numbers, as discussed below.

Curriculum learning. We use a curriculum learning approach inspired by Zaremba & Sutskever
(2015a). This means that we train, e.g., on 7-digit numbers only after crossing a curriculum progress
threshold (e.g., over 90% fully correct outputs) on 6-digit numbers. However, with 20% probability
we pick a minibatch of d-digit numbers with d chosen uniformly at random between 1 and 20.
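A minimal sketch of this curriculum is given below. We assume that the remaining 80% of minibatches use exactly the current curriculum length and that the length advances by one digit at a time; the function names and default values are illustrative and may differ from the released code.

```python
import random


def pick_length(current_length, rng, max_length=20, mix_probability=0.2):
    """Choose the operand length d for the next minibatch."""
    if rng.random() < mix_probability:
        return rng.randint(1, max_length)    # uniform over all lengths
    return current_length                    # otherwise train at the current length


def advance(current_length, fully_correct_rate, threshold=0.9, max_length=20):
    """Move to the next length once the progress threshold is crossed."""
    if fully_correct_rate > threshold and current_length < max_length:
        return current_length + 1
    return current_length


rng = random.Random(0)
lengths = [pick_length(6, rng) for _ in range(500)]
```

Mixing in uniformly sampled lengths keeps the model exposed to both shorter and longer examples while it is still working on the current length.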

Gradient noise. To improve training speed and stability we add noise to gradients in each training
step. Inspired by the schedule from Welling & Teh (2011), we add to gradients a noise drawn from
the normal distribution with mean 0 and variance inversely proportional to the square root of the step
number (i.e., with standard deviation inversely proportional to the 4-th root of the step number). We multiply this
noise by the gradient noise scale and, to avoid noise in converged models, we also multiply it by the
fraction of non-fully-correct outputs (which is 0 for a perfect model).
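The noise schedule can be written down directly. In this sketch the proportionality constant is taken to be 1 and gradients are a flat list of floats; both are assumptions, as are the function names.

```python
import random


def noise_std(step, noise_scale, error_fraction):
    """Standard deviation of the gradient noise at training step `step` >= 1.

    A variance proportional to step**-0.5 gives a standard deviation
    proportional to step**-0.25; the constant is taken to be 1 (an assumption).
    """
    return noise_scale * error_fraction * step ** -0.25


def add_noise(gradients, step, noise_scale, error_fraction, rng):
    """Add zero-mean Gaussian noise with the scheduled deviation to each gradient."""
    std = noise_std(step, noise_scale, error_fraction)
    return [g + rng.gauss(0.0, std) for g in gradients]


rng = random.Random(0)
noisy = add_noise([0.1, -0.2], step=16, noise_scale=1.0, error_fraction=0.25, rng=rng)
clean = add_noise([0.1, -0.2], step=16, noise_scale=1.0, error_fraction=0.0, rng=rng)
```

Since the noise is scaled by the fraction of non-fully-correct outputs, it vanishes once the model is perfect, so `clean` equals the input gradients.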

Gate cutoff. In Section 2 we defined the gates in a CGRU using the sigmoid function, e.g., we
wrote u = σ(U ′ ∗ s + B ′ ). Usually the standard sigmoid function is used, σ(x) = 1/(1 + e^−x). We
found that adding a hard threshold on the top and bottom helps slightly in our setting, so we use
1.2σ(x) − 0.1 cut to the interval [0, 1], i.e., σ ′ (x) = max(0, min(1, 1.2σ(x) − 0.1)).
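The cut-off gate is a one-line change to the standard sigmoid, as the following sketch shows (the function name `cutoff_gate` is ours).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cutoff_gate(x):
    """The gate σ'(x) = max(0, min(1, 1.2·σ(x) − 0.1))."""
    return max(0.0, min(1.0, 1.2 * sigmoid(x) - 0.1))


values = [cutoff_gate(x) for x in (-10.0, 0.0, 10.0)]
```

Unlike the standard sigmoid, this gate reaches exactly 0 and exactly 1: it saturates once σ(x) ≤ 1/12 or σ(x) ≥ 11/12, i.e., for |x| ≥ ln 11 ≈ 2.4.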


Dropout is a widely applied technique for regularizing neural networks. But when applying it to
recurrent networks, it has been counter-productive to apply it on recurrent connections – it only
worked when applied to the non-recurrent ones, as reported by Pham et al. (2014).
Since a Neural GPU does not have non-recurrent connections it might seem that dropout will not
be useful for this architecture. Surprisingly, we found the contrary – it is useful and improves
generalization. The key to using dropout effectively in this setting is to set a small dropout rate.
When we run a grid search for dropout rates we vary them between 6%, 9%, and 13.5%, meaning
that over 85% of the values are always preserved. It turns out that even this small dropout has a large
effect since we apply it to the whole mental image si in each step i. Presumably the network now
learns to include some redundancy in its internal representation and generalization benefits from it.
Without dropout we usually see only a few models from a 729 grid search generalize reasonably,
while with dropout it is a much larger fraction and they generalize to higher lengths. In particular,
dropout was necessary to train models for multiplication that generalize to 2000 bits.

The code is at


To improve optimization of our deep network we use a relaxation technique for shared parameters
which works as follows. Instead of training with parameters shared across time-steps we use r
identical sets of non-shared parameters (we often use r = 6, larger numbers work better but use
more memory). At time-step t of the Neural GPU we use the i-th set if t mod r = i.
The procedure described above relaxes the network, as it can now perform different operations in
different time-steps. Training becomes easier, but we now have r parameter sets instead of the single
shared set we want. To unify them we add a term to the cost function representing the distance
of each parameter from the average of this parameter in all the r sets. This term in the final cost
function is multiplied by a scalar which we call the relaxation pull. If the relaxation pull is 0, the
network behaves as if the r parameter sets were separate, but when it is large, the cost forces the
network to unify the parameters across different sets.
During training, we gradually increase the relaxation pull. We start with a small value and every time
the curriculum makes progress, e.g., when the model performs well on 6-digit numbers, we multiply
the relaxation pull by a relaxation pull factor. When the curriculum reaches the maximal length we
average the parameters from all sets and continue to train with a single shared parameter set.
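The three ingredients of parameter sharing relaxation (choosing a set per time-step, penalizing disagreement, and finally averaging) can be sketched as follows. Each parameter set is represented as a flat list of floats, and the squared Euclidean distance is our assumption, since the text only speaks of a distance; the function names are illustrative.

```python
def parameter_set_index(t, r):
    """Index of the parameter set used at time-step t."""
    return t % r


def average_parameters(param_sets):
    """Element-wise mean of r parameter vectors (lists of floats)."""
    r = len(param_sets)
    return [sum(values) / r for values in zip(*param_sets)]


def relaxation_penalty(param_sets, pull):
    """Pull-weighted distance of every set from the average.

    Squared Euclidean distance is an assumption; the text only says "distance".
    """
    mean = average_parameters(param_sets)
    total = sum((p - m) ** 2 for params in param_sets for p, m in zip(params, mean))
    return pull * total


sets = [[1.0, 2.0], [3.0, 2.0]]
penalty = relaxation_penalty(sets, pull=0.5)
shared = average_parameters(sets)
```

When all sets agree the penalty is zero, so replacing them by their average at the end of the curriculum does not change the network; increasing the pull during training moves the sets toward that situation.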
This method is crucial for learning multiplication. Without it, a Neural GPU with m = 24 has
trouble even fitting the training set, and the few models that manage to do it do not generalize. With
relaxation almost all models in our 729 runs manage to fit the training data.

We prepared a video of the Neural GPU trained to solve the tasks mentioned above.2 It shows
the state in each step with values of −1 drawn in white, 1 in black, and other values in gray. This gives
an intuition of how the Neural GPU solves the discussed problems, e.g., it is quite clear that for the
duplication task the Neural GPU learned to move a part of the embedding downwards in each step.
What did not work well? For one, using decimal inputs degrades performance. All tasks above can
easily be formulated with decimal inputs instead of binary ones. One could hope that a Neural GPU
will work well in this case too, maybe with a larger m. We experimented with this formulation and
our results were worse than when the representation was binary: we did not manage to learn long
decimal multiplication. Increasing m to 128 allows the model to learn all other tasks in the decimal setting.
Another problem is that often only a few models in a 729 grid search generalize to very long unseen
instances. Among those 729 models, there usually are many models that generalize to 40 or even 200
bits, but only a few working without error for 2000-bit numbers. Using dropout and gradient noise
improves the reliability of training and generalization, but maybe another technique could help even
more. How could we make more models achieve good generalization? One idea that looks natural
is to try to reduce the number of parameters by decreasing m. Surprisingly, this does not seem to
have any influence. In addition to the m = 24 results presented above we ran experiments with
m = 32, 64, 128 and the results were similar. In fact using m = 128 we got the most models to
generalize. Additionally, we observed that ensembling a few models, just by averaging their outputs,
helps to generalize: ensembles of 5 models almost always generalize perfectly on binary tasks.
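
The ensembling step mentioned above can be illustrated as follows. The text only states that model outputs are averaged; this sketch assumes each model emits a per-bit probability of the bit being 1 and that the averaged probability is thresholded at 0.5. The function name `ensemble_predict` is hypothetical.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the per-bit probabilities of several models, then round to bits."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return (mean_probs > 0.5).astype(int)

# Three hypothetical models; the second one is wrong on the last bit.
probs = [
    np.array([0.9, 0.2, 0.6]),
    np.array([0.8, 0.4, 0.3]),
    np.array([0.7, 0.1, 0.9]),
]
bits = ensemble_predict(probs)
```

In this toy case the average outvotes the single model that errs on the last bit.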
Why use width? The Neural GPU is defined using two-dimensional convolutions and in our exper-
iments one of the dimensions is always set to 4. Doing so is not necessary since a one-dimensional
Neural GPU that uses four times larger m can represent every function representable by the original
one. In fact we trained a model for long binary multiplication that generalized to 2000-bit numbers
using a Neural GPU with width 1 and m = 64. However, the width of the Neural GPU increases the
amount of information carried in its hidden state without increasing the number of its parameters.
Thus it can be thought of as a factorization and might be useful for other tasks.

² The video is available at
Speed and data efficiency. Neural GPUs use the standard, heavily optimized convolution operation
and are fast. We experimented with a 2-layer Neural GPU for n = 32 and m = 64. After unfolding
in time it has 128 layers of CGRUs, each operating on 32 mental images, each 4 × 64 × 64. The
joint forward-backward step time for this network was about 0.6s on an NVIDIA GTX 970 GPU.
We were also surprised by how data-efficient a Neural GPU can be. The experiments presented
above were all performed using 10k random training data examples for each training length. Since
we train on up to 20-bit numbers this adds up to about 200k training examples. We tried to train using
only 100 examples per length, so about 2000 total training instances. We were surprised to see
that it actually worked well for binary addition: there were models that generalized well to 200-bit
numbers and to all lengths below despite such a small training set. But we never managed to train a
good model for binary multiplication with that little training data.


The results presented in Table 1 show clearly that there is a qualitative difference between what can
be achieved with a Neural GPU and what was possible with previous architectures. In particular, for
the first time, we show a neural network that learns a non-trivial superlinear-time algorithm in a way
that generalizes to much greater lengths without errors.
This opens the way to use neural networks in domains that were previously only addressed by
discrete methods, such as program synthesis. With the surprising data efficiency of Neural GPUs it
could even be possible to replicate previous program synthesis results, e.g., Kaiser (2012), but in a
more scalable way. It is also interesting that a Neural GPU can learn symbolic algorithms without
using any discrete state at all, and adding dropout and noise only improves its performance.
Another promising direction for future work is to apply Neural GPUs to language processing tasks.
Good results have already been obtained on translation with a convolutional architecture over words
(Kalchbrenner & Blunsom, 2013) and adding gating and recursion, like in a Neural GPU, should
allow training much deeper models without overfitting. Finally, the parameter sharing relaxation
technique can be applied to any deep recurrent network and has the potential to improve RNN
training in general.

Angluin, Dana. Learning regular sets from queries and counterexamples. Information and Computation, 75:
87–106, 1987.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to
align and translate. CoRR, abs/1409.0473, 2014. URL
Blumensath, Achim and Grädel, Erich. Automatic Structures. In Proceedings of LICS 2000, pp. 51–62, 2000.
Chan, William, Jaitly, Navdeep, Le, Quoc V., and Vinyals, Oriol. Listen, attend and spell. In International
Conference on Acoustics, Speech and Signal Processing, ICASSP’16, 2016.
Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio,
Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
CoRR, abs/1406.1078, 2014. URL
Chung, Junyoung, Gülçehre, Çaglar, Cho, Kyunghyun, and Bengio, Yoshua. Empirical evaluation
of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. URL
Dahl, George E., Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20
(1):30–42, 2012.
Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. CoRR, abs/1410.5401, 2014. URL


Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil.
Learning to transduce with unbounded memory. CoRR, abs/1506.02516, 2015. URL
Greff, Klaus, Srivastava, Rupesh Kumar, Koutnı́k, Jan, Steunebrink, Bas R., and Schmidhuber, Jürgen. LSTM:
A search space odyssey. CoRR, abs/1503.04069, 2015. URL
Gulwani, Sumit. Dimensions in program synthesis. In Proceedings of PPDP 2010, PPDP ’10, pp. 13–24, 2010.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets.
CoRR, abs/1503.01007, 2015. URL
Kaiser, Łukasz. Learning games from videos guided by descriptive complexity. In Proceedings of the AAAI-12,
pp. 963–970. AAAI Press, 2012. URL
Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In Proceedings EMNLP 2013,
pp. 1700–1709, 2013. URL
Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex. Grid long short-term memory. In International Confer-
ence on Learning Representations, 2016. URL
Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980,
2014. URL
Kitzelmann, Emanuel. Inductive programming: A survey of program synthesis techniques. In Approaches and
Applications of Inductive Programming, AAIP 2009, volume 5812 of LNCS, pp. 50–73, 2010.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing Systems, 2012.
Lavin, Andrew and Gray, Scott. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308,
2015. URL
Pham, Vu, Bluche, Théodore, Kermorvant, Christopher, and Louradour, Jérôme. Dropout improves recur-
rent neural networks for handwriting recognition. In International Conference on Frontiers in Handwriting
Recognition (ICFHR), pp. 285–290. IEEE, 2014. URL
Shi, Xingjian, Chen, Zhourong, Wang, Hao, Yeung, Dit-Yan, kin Wong, Wai, and chun Woo, Wang. Convo-
lutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural
Information Processing Systems, 2015. URL
Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. CoRR,
abs/1505.00387, 2015. URL
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural net-
works. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014. URL
Toderici, George, O’Malley, Sean M., Hwang, Sung Jin, Vincent, Damien, Minnen, David, Baluja,
Shumeet, Covell, Michele, and Sukthankar, Rahul. Variable rate image compression with recur-
rent neural networks. In International Conference on Learning Representations, 2016. URL
Vinyals, Oriol, Kaiser, Łukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. Grammar as a foreign language. In Advances in Neural
Information Processing Systems, 2015. URL
Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption
generator. CoRR, abs/1411.4555, 2014. URL
Vivien, Helene. An Introduction to cellular automata. 2003. URL˜yunes/ca/archives/bookvivien.pdf.
Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient Langevin dynamics. In Proceed-
ings of ICML 2011, pp. 681–688, 2011.
Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. CoRR, abs/1410.4615, 2015a. URL
Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. CoRR,
abs/1505.00521, 2015b. URL

Learning Efficient Algorithms with Hierarchical Attentive Memory

Marcin Andrychowicz∗ (Google DeepMind), MARCINA@GOOGLE.COM

Karol Kurach∗ (Google / University of Warsaw¹), KKURACH@GOOGLE.COM

∗ Equal contribution.
arXiv:1602.03218v2 [cs.LG] 23 Feb 2016

Abstract

In this paper, we propose and investigate a novel memory architecture for neural networks called
Hierarchical Attentive Memory (HAM). It is based on a binary tree with leaves corresponding to
memory cells. This allows HAM to perform memory access in Θ(log n) complexity, which is a
significant improvement over the standard attention mechanism that requires Θ(n) operations,
where n is the size of the memory.

We show that an LSTM network augmented with HAM can learn algorithms for problems like merging,
sorting or binary searching from pure input-output examples. In particular, it learns to sort n
numbers in time Θ(n log n) and generalizes well to input sequences much longer than the ones seen
during the training. We also show that HAM can be trained to act like classic data structures: a
stack, a FIFO queue and a priority queue.

1. Intro

Deep Recurrent Neural Networks (RNNs) have recently proven to be very successful in real-world
tasks, e.g. machine translation (Sutskever et al., 2014) and computer vision (Vinyals et al.,
2014). However, the success has been achieved only on tasks which do not require a large memory to
solve the problem, e.g. we can translate sentences using RNNs, but we cannot produce reasonable
translations of really long pieces of text, like books.

A high-capacity memory is a crucial component necessary to deal with large-scale problems that
contain plenty of long-range dependencies. Currently used RNNs do not scale well to larger
memories, e.g. the number of parameters in an LSTM (Hochreiter & Schmidhuber, 1997) grows
quadratically with the size of the network's memory. In practice, this limits the number of used
memory cells to a few thousand.

It would be desirable for the size of the memory to be independent of the number of model
parameters. The first versatile and highly successful architecture with this property was the
Neural Turing Machine (NTM) (Graves et al., 2014). The main idea behind the NTM is to split the
network into a trainable “controller” and an “external” variable-size memory. It sparked a wave of
other neural network architectures with external memories (see Sec. 2).

However, one aspect which has usually been neglected so far is the efficiency of the memory
access. Most of the proposed memory architectures have Θ(n) access complexity, where n is the size
of the memory. It means that, for instance, copying a sequence of length n requires performing
Θ(n^2) operations, which is clearly unsatisfactory.

1.1. Our contribution

In this paper we propose a novel memory module for neural networks, called Hierarchical Attentive
Memory (HAM). The HAM module is generic and can be used as a building block of larger neural
architectures. Its crucial property is that it scales well with the memory size: the memory access
requires only Θ(log n) operations, where n is the size of the memory. This complexity is achieved
by using a new attention mechanism based on a binary tree with leaves corresponding to memory
cells. The novel attention mechanism is not only faster than the standard one used in Deep
Learning (Bahdanau et al., 2014), but it also facilitates learning algorithms due to a built-in
bias towards operating on intervals.

We show that an LSTM augmented with HAM is able to learn algorithms for tasks like merging,
sorting or binary searching. In particular, it is the first neural network that we are aware of
which is able to learn to sort from pure input-output examples and generalizes well to input
sequences much longer than the ones seen during the training. Moreover, the learned sorting
algorithm runs in time Θ(n log n).

We also show that the HAM memory itself is capable of simulating different classic memory
structures: a stack, a FIFO queue and a priority queue.

¹ Work done while at Google.
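
As a rough illustration of the complexity claims above, the sketch below counts how many memory locations a single access inspects under standard attention (all n cells) and under a tree descent (one node per level), and what that implies for copying n items. The function names are hypothetical, constant factors are ignored, and n is assumed to be a power of two.

```python
import math

def access_cost(n, hierarchical):
    """Locations inspected by one memory access; n is assumed to be a power of two."""
    return int(math.log2(n)) if hierarchical else n

def copy_cost(n, hierarchical):
    """Copying n items needs n memory accesses."""
    return n * access_cost(n, hierarchical)

for n in (32, 1024):
    print(n, copy_cost(n, hierarchical=False), copy_cost(n, hierarchical=True))
```

For n = 1024 this gives about a million inspected cells with linear attention against roughly ten thousand visited nodes with a tree of depth 10.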

2. Related work

In this section we mention a number of recently proposed neural architectures with an external
memory whose size is independent of the number of the model parameters.

Memory architectures based on attention. Attention is a recent but already extremely successful
technique in Deep Learning. This mechanism allows networks to attend to parts of the (potentially
preprocessed) input sequence (Bahdanau et al., 2014) while generating the output sequence. It is
implemented by giving the network as an auxiliary input a linear combination of input symbols,
where the weights of this linear combination can be controlled by the network.

An attention mechanism was used to access the memory in Neural Turing Machines (NTMs) (Graves et
al., 2014). It was the first paper that explicitly attempted to train a computationally universal
neural network and achieved encouraging results.

The Memory Network (Weston et al., 2014) is an early model that attempted to explicitly separate
the memory from computation in a neural network model. The follow-up work of Sukhbaatar et al.
(2015) combined the memory network with the soft attention mechanism, which allowed it to be
trained with less supervision. In contrast to NTMs, the memory in these models is non-writeable.

Another model without writeable memory is the Pointer Network (Vinyals et al., 2015), which is
very similar to the attention model of Bahdanau et al. (2014). Despite not having a memory, this
model was able to solve a number of difficult algorithmic problems that include the Convex Hull
and the approximate 2D Travelling Salesman Problem.

All of the architectures mentioned so far use standard attention mechanisms to access the memory
and therefore memory access complexity scales linearly with the memory size.

Memory architectures based on data structures. Stack-Augmented Recurrent Neural Network (Joulin &
Mikolov, 2015) is a neural architecture combining an RNN and a differentiable stack. In another
paper (Grefenstette et al., 2015) authors consider extending an LSTM with a stack, a FIFO queue or
a double-ended queue and show some promising results. The advantage of the latter model is that
the presented data structures have a constant access time.

Memory architectures based on pointers. In two recent papers (Zaremba & Sutskever, 2015; Zaremba
et al., 2015) authors consider extending neural networks with nondifferentiable memories based on
pointers and trained using Reinforcement Learning. The big advantage of these models is that they
allow a constant time memory access. They were however only successful on relatively simple tasks.

Another model which can use a pointer-based memory is the Neural Programmer-Interpreter (Reed & de
Freitas, 2015). It is very interesting, because it managed to learn sub-procedures. Unfortunately,
it requires strong supervision in the form of execution traces.

Another type of pointer-based memory was presented in Neural Random-Access Machine (Kurach et al.,
2015), which is a neural architecture mimicking classic computers.

Parallel memory architectures. There are two recent memory architectures which are especially
suited for parallel computation. Grid-LSTM (Kalchbrenner et al., 2015) is an extension of LSTM to
multiple dimensions. Another recent model of this type is Neural GPU (Kaiser & Sutskever, 2015),
which can learn to multiply long binary numbers.

3. Hierarchical Attentive Memory

In this section we describe our novel memory module called Hierarchical Attentive Memory (HAM).
The HAM module is generic and can be used as a building block of larger neural network
architectures. For instance, it can be added to feedforward or LSTM networks to extend their
capabilities. To make our description more concrete we will consider a model consisting of an LSTM
“controller” extended with a HAM module.

The high-level idea behind the HAM module is as follows. The memory is structured as a full binary
tree with the leaves containing the data stored in the memory. The inner nodes contain some
auxiliary data, which allows us to efficiently perform some types of “queries” on the memory. In
order to access the memory, one starts from the root of the tree and performs a top-down descent
in the tree, which is similar to the hierarchical softmax procedure (Morin & Bengio, 2005). At
every node of the tree, one decides to go left or right based on the auxiliary data stored in this
node and a “query”. Details are provided in the rest of this section.

3.1. Notation

The model takes as input a sequence x_1, x_2, ... and outputs a sequence y_1, y_2, .... We assume
that each element of these sequences is a binary vector of size b ∈ N, i.e. x_i, y_i ∈ {0, 1}^b.
Suppose for a moment that we only want to process input sequences of length ≤ n, where n ∈ N is a
power of two (we show later how to process sequences of an arbitrary length). The model is based
on the full binary tree with n leaves. Let V denote the set of the nodes in that tree (notice that
|V| = 2n − 1) and let L ⊂ V denote the set of its leaves. Let l(e) for e ∈ V \ L be the left child
of the node e and let r(e) be its right child.

We will now present the inference procedure for the model and then discuss how to train it.

Figure 1. The LSTM+HAM model consists of an LSTM controller and a HAM module. The execution of the
model starts with the initialization of HAM using the whole input sequence x_1, x_2, ..., x_m. At
each timestep, the HAM module produces an input for the LSTM, which then produces an output symbol
y_t. Afterwards, the hidden states of the LSTM and HAM are updated.

Figure 2. Initialization of the model. The value in the i-th leaf of HAM is initialized with
EMBED(x_i), where EMBED is a trainable feed-forward network. If there are more leaves than input
symbols, we initialize the values in the excess leaves with zeros. Then, we initialize the values
in the inner nodes bottom-up using the formula h_e = JOIN(h_l(e), h_r(e)). The hidden state of the
LSTM, h_LSTM, is initialized with zeros.
[Figure 2 diagram: a binary tree with root h_1, inner nodes h_2 to h_7 and leaves h_8 to h_15 above the inputs x_1, ..., x_6.]

[Figure 3 diagram: the same tree with SEARCH(h_1, h_LSTM) = 0.95, SEARCH(h_3, h_LSTM) = 0.1 and SEARCH(h_6, h_LSTM) = 1 marked along the descent; the attended leaf h_a is drawn in the position of h_13.]
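
The tree notation of Sec. 3.1, the initialization in Fig. 2 and the top-down descent can be sketched with the usual array layout of a full binary tree, in which node e has children 2e and 2e + 1 and the leaves are n, ..., 2n − 1. This layout agrees with the numbering h_1, ..., h_15 in the figures for n = 8, but it is an assumption of this sketch rather than something the text prescribes. EMBED, JOIN and SEARCH are passed in as plain Python callables standing in for the trainable networks, and `init_tree` and `descend` are hypothetical names.

```python
import numpy as np

def init_tree(xs, n, embed, join, d):
    """Build hidden values h[1..2n-1] (index 0 unused) for a tree with n leaves.

    Leaves are n..2n-1, inner nodes are 1..n-1, children of e are 2e and 2e+1.
    """
    h = np.zeros((2 * n, d))                 # excess leaves stay at zero
    for i, x in enumerate(xs):
        h[n + i] = embed(x)                  # i-th leaf gets EMBED(x_i)
    for e in range(n - 1, 0, -1):            # bottom-up: children before parents
        h[e] = join(h[2 * e], h[2 * e + 1])  # h_e = JOIN(h_l(e), h_r(e))
    return h

def descend(h, n, search, h_lstm, rng=None):
    """Top-down descent from the root; returns (attended leaf, visited inner nodes).

    With rng given, the model goes right with probability p (sampling);
    without it, it goes right iff p > 0.5 (deterministic variant).
    """
    c, path = 1, []
    while c < n:                             # stop once a leaf is reached
        p = search(h[c], h_lstm)             # p = SEARCH(h_c, h_LSTM)
        go_right = (rng.random() < p) if rng is not None else (p > 0.5)
        path.append(c)
        c = 2 * c + 1 if go_right else 2 * c
    return c, path
```

With n = 8 and SEARCH values 0.95, 0.1 and 1 at nodes 1, 3 and 6, as drawn in the Figure 3 diagram, the deterministic descent visits nodes 1, 3, 6 and ends in leaf 13. Each access touches log2(n) inner nodes.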
3.2. Inference

The high-level view of the model execution is presented in Fig. 1. The hidden state of the model
consists of two components: the hidden state of the LSTM controller (denoted h_LSTM ∈ R^l for some
l ∈ N) and the hidden values stored in the nodes of the HAM tree. More precisely, for every node
e ∈ V there is a hidden value h_e ∈ R^d. These values change during the recurrent execution of the
model, but we drop all timestep indices to simplify the notation.

The parameters of the model describe the input-output behaviour of the LSTM, as well as the
following 4 transformations, which describe the HAM module: EMBED : R^b → R^d,
JOIN : R^d × R^d → R^d, SEARCH : R^d × R^l → [0, 1] and WRITE : R^d × R^l → R^d. These
transformations may be represented by arbitrary function approximators, e.g. Multilayer
Perceptrons (MLPs). Their meaning will be described soon.

The details of the model are presented in 4 figures. Fig. 2 describes the initialization of the
model. Each recurrent timestep of the model consists of three phases: the attention phase
described in Fig. 3, the output phase described in Fig. 4 and the update phase described in
Fig. 5. The whole timestep can be performed in time Θ(log n).

Figure 3. Attention phase. In this phase the model performs a top-down “search” in the tree
starting from the root. Suppose that we are currently at the node c ∈ V \ L. We compute the value
p = SEARCH(h_c, h_LSTM). Then, with probability p the model goes right (i.e. c := r(c)) and with
probability 1 − p it goes left (i.e. c := l(c)). This procedure is continued until we reach one of
the leaves. This leaf is called the attended or accessed leaf and denoted a.

Figure 4. Output phase. The value h_a stored in the attended leaf is given to the LSTM as an
input. Then, the LSTM produces an output symbol y_t ∈ {0, 1}^b. More precisely, the value u ∈ R^b
is computed by a trainable linear transformation from h_LSTM and the distribution of y_t is
defined by the formula p(y_{t,i} = 1) = sigmoid(u_i) for 1 ≤ i ≤ b. It may be beneficial to allow
the model to access the memory a few times between producing consecutive output symbols.
Therefore, the model produces an output symbol only at timesteps with indices divisible by some
constant η ∈ N, which is a hyperparameter.

Figure 5. Update phase. In this phase the value in the attended leaf a is updated. More precisely,
the value is modified using the formula h_a := WRITE(h_a, h_LSTM). Then, we update the values of
the inner nodes encountered during the attention phase (h_6, h_3 and h_1 in the figure) bottom-up
using the equation h_e = JOIN(h_l(e), h_r(e)).

The HAM parameters describe only the 4 mentioned transformations and hence the number of the model
parameters does not depend on the size of the binary tree used. Thus, we can use the model to
process inputs of an arbitrary length by using big enough binary trees. It is not clear that the
same set of parameters will give good results across different tree sizes, but we showed
experimentally that it is indeed the case (see Sec. 4 for more details).

We decided to represent the transformations defining HAM with MLPs with ReLU (Nair & Hinton, 2010)
activation function in all neurons except the output layer of SEARCH, which uses sigmoid
activation function to ensure that the output may be interpreted as a probability. Moreover, the
network for WRITE is enhanced in a similar way as Highway Networks (Srivastava et al., 2015), i.e.

WRITE(h_a, h_LSTM) = T(h_a, h_LSTM) · H(h_a, h_LSTM) + (1 − T(h_a, h_LSTM)) · h_a,

where H and T are two MLPs with sigmoid activation function in the output layer. This allows the
WRITE transformation to easily leave the value h_a unchanged.

3.3. Training

In this section we describe how to train our model from purely input-output examples using
REINFORCE (Williams, 1992). In Appendix A we also present a different variant of HAM which is
fully differentiable and can be trained using end-to-end backpropagation.

Let x, y be an input-output pair. Recall that both x and y are sequences. Moreover, let θ denote
the parameters of the model and let A denote the sequence of all decisions whether to go left or
right made during the whole execution of the model. We would like to maximize the log-probability
of producing the correct output, i.e.

$$\mathcal{L} = \log p(y \mid x, \theta) = \log \sum_{A} p(A \mid x, \theta)\, p(y \mid A, x, \theta).$$

This sum is intractable, so instead of maximizing it directly, we maximize a variational lower
bound on it:

$$\mathcal{F} = \sum_{A} p(A \mid x, \theta) \log p(y \mid A, x, \theta) \le \mathcal{L}.$$

This sum is also intractable, so we approximate its gradient using REINFORCE, which we briefly
explain below. Using the identity ∇p(A|x, θ) = p(A|x, θ) ∇ log p(A|x, θ), the gradient of the
lower bound with respect to the model parameters can be rewritten as:

$$\nabla \mathcal{F} = \sum_{A} p(A \mid x, \theta) \Big[ \nabla \log p(y \mid A, x, \theta) + \log p(y \mid A, x, \theta)\, \nabla \log p(A \mid x, \theta) \Big].$$

We estimate this value using Monte Carlo approximation. For every x we sample Ã from p(A|x, θ) and
approximate the gradient for the input x as

$$\nabla \log p(y \mid \tilde{A}, x, \theta) + \log p(y \mid \tilde{A}, x, \theta)\, \nabla \log p(\tilde{A} \mid x, \theta).$$

Notice that this gradient estimate can be computed using normal backpropagation if we substitute
the gradients in the nodes² which sample whether we should go left or right during the attention
phase by

$$\underbrace{\log p(y \mid \tilde{A}, x, \theta)}_{\text{return}}\, \nabla \log p(\tilde{A} \mid x, \theta).$$

² For a general discussion of computing gradients in computation graphs which contain stochastic
nodes see (Schulman et al., 2015).

This term is called the REINFORCE gradient estimate and the left factor is called a return in the
Reinforcement Learning literature. This gradient estimator is unbiased, but it often has a high
variance. Therefore, we employ two standard variance-reduction techniques for REINFORCE:
discounted returns and baselines (Williams, 1992). Discounted returns mean that our return at the
t-th timestep has the form $\sum_{t \le i} \gamma^{i-t} \log p(y_i \mid \tilde{A}, x, \theta)$
for some discount constant γ ∈ [0, 1], which is a hyperparameter. This biases the estimator if
γ < 1, but it often decreases its variance.

For lack of space we do not describe the baselines technique. We only mention that our baseline is case and

timestep dependent: it is computed using a learnable linear transformation from h_LSTM and trained
using MSE loss function.

The whole model is trained with the Adam (Kingma & Ba, 2014) algorithm. We also employ the
following three training techniques:

Different reward function. During our experiments we noticed that better results may be obtained
by using a different reward function for REINFORCE. More precisely, instead of the log-probability
of producing the correct output, we use the percentage of the output bits which have the
probability of being predicted correctly (given Ã) greater than 50%, i.e. our discounted return is
equal to

$$\sum_{t \le i,\; 1 \le j \le b} \gamma^{i-t} \big[ p(y_{i,j} \mid \tilde{A}, x, \theta) > 0.5 \big].$$

Notice that it corresponds to the Hamming distance between the most probable outcome according to
the model (given Ã) and the correct output.

Entropy bonus term. We add a special term to the cost function which encourages exploration. More
precisely, for each sampling node we add to the cost function the term α / H(p), where H(p) is the
entropy of the distribution of the decision whether to go left or right in this node and α is an
exponentially decaying coefficient. This term goes to infinity whenever the entropy goes to zero,
which ensures some level of exploration. We noticed that this term works better in our experiments
than the standard term of the form −αH(p) (Williams, 1992).

Curriculum schedule. We start with training on inputs with lengths sampled uniformly from [1, n]
for some n = 2^k and the binary tree with n leaves. Whenever the error drops below some threshold,
we increment the value k and start using the bigger tree with 2n leaves and inputs with lengths
sampled uniformly from [1, 2n].

4. Experiments

In this section, we evaluate two variants of using the HAM module. The first one is the model
described in Sec. 3, which combines an LSTM controller with a HAM module (denoted by LSTM+HAM).
Then, in Sec. 4.3 we investigate the “raw” HAM (without the LSTM controller) to check its
capability of acting as classic data structures: a stack, a FIFO queue and a priority queue.

4.1. Test setup

For each test that we perform, we apply the following procedure. First, we train the model with
memory of size up to n = 32 using the curriculum schedule described in Sec. 3.3. The model is
trained using the minibatch Adam algorithm with exponentially decaying learning rate. We use
random search to determine the best hyper-parameters for the model. We use gradient clipping
(Pascanu et al., 2012) with constant 5. The depth of our MLPs is either 1 or 2, the LSTM
controller has l = 20 memory cells and the hidden values in the tree have dimensionality d = 20.
The constant η, which determines the number of memory accesses between producing consecutive
output symbols (Fig. 4), is equal to either 1 or 2. We always train for 100 epochs, each
consisting of 1000 batches of size 50. After each epoch we evaluate the model on 200 validation
batches without learning. When the training is finished, we select the model parameters that gave
the lowest error rate on validation batches and report the error using these parameters on 2,500
fresh random examples.

We report two types of errors: a test error and a generalization error. The test error shows how
well the model is able to fit the data distribution and generalize to unknown cases, assuming that
cases of similar lengths were shown during the training. It is computed using the HAM memory with
n = 32 leaves, as the percentage of output sequences which were predicted incorrectly. The lengths
of test examples are sampled uniformly from the range [1, n]. Notice that we mark the whole output
sequence as incorrect even if only one bit was predicted incorrectly, e.g. a hypothetical model
predicting each bit incorrectly with probability 1% (and independently of the errors on the other
bits) has an error rate of 96% on whole sequences if outputs consist of 320 bits.

The generalization error shows how well the model performs with enlarged memory on examples with
lengths exceeding n. We test our model with memory 4 times bigger than the training one. The
lengths of input sequences are now sampled uniformly from the range [2n + 1, 4n].

During testing we make our model fully deterministic by using the most probable outcomes instead
of stochastic sampling. More precisely, we assume that during the attention phase the model
decides to go right iff p > 0.5 (Fig. 3). Moreover, the output symbols (Fig. 4) are computed by
rounding to zero or one instead of sampling.

4.2. LSTM+HAM

We evaluate the model on a number of algorithmic tasks described below:

Reverse: Given a sequence of 10-bit vectors, output them in reversed order, i.e.
y_i = x_{m+1−i} for 1 ≤ i ≤ m, where m is the length of the input sequence.

Search: Given a sequence of pairs x_i = key_i || value_i for 1 ≤ i ≤ m − 1 sorted by keys and a
query x_m = q, find the smallest i such that key_i = q and output y_1 = value_i.

Keys and values are 5-bit vectors and keys are compared lexicographically. The LSTM+HAM model is
given only two timesteps (η = 2) to solve this problem, which forces it to use a form of binary
search.

Merge: Given two sorted sequences of pairs (p_1, v_1), ..., (p_m, v_m) and
(p'_1, v'_1), ..., (p'_{m'}, v'_{m'}), where p_i, p'_i ∈ [0, 1] and v_i, v'_i ∈ {0, 1}^5, merge
them. Pairs are compared according to their priorities, i.e. values p_i and p'_i. Priorities are
unique and sampled uniformly from the set {1/300, ..., 300/300}, because neural networks cannot
easily distinguish two real numbers which are very close to each other. Input is encoded as
x_i = p_i || v_i for 1 ≤ i ≤ m and x_{m+i} = p'_i || v'_i for 1 ≤ i ≤ m'. The output consists of
the vectors v_i and v'_i sorted according to their priorities³.

Sort: Given a sequence of pairs x_i = key_i || value_i, sort them in a stable way⁴ according to
the lexicographic order of the keys. Keys and values are 5-bit vectors.

Add: Given two numbers represented in binary, compute their sum. The input is represented as
a_1, ..., a_m, +, b_1, ..., b_m, = (i.e. x_1 = a_1, x_2 = a_2 and so on), where a_1, ..., a_m and
b_1, ..., b_m are bits of the input numbers and +, = are some special symbols. Input and output
numbers are encoded starting from the least significant bits.

eralizes very well to new sizes of the binary tree. We find this fact quite interesting, because
it means that parameters learned from a small neural network (i.e. HAM based on a tree with 32
leaves) can be successfully used in a different, bigger network (i.e. HAM with 128 memory cells).

In comparison, the LSTM with attention does not learn to merge, nor sort. It also completely fails
to generalize to longer examples, which shows that LSTM+A learns some statistical dependencies
between inputs and outputs rather than the real algorithms.

The LSTM+HAM model makes a few errors when testing on longer outputs than the ones encountered
during the training. Notice however, that we show in the table the percentage of output sequences
which contain at least one incorrect bit. For instance, LSTM+HAM on the problem Merge predicts
incorrectly only 0.03% of output bits, which corresponds to 2.48% of incorrect output sequences.
We believe that these rare mistakes could be avoided if one trained the model longer and chose the
learning rate schedule carefully. One more way to boost generalization capabilities would be to
simultaneously train the models with different memory sizes and shared parameters. We have not
tried this as the generalization properties of the model were already very good.
Table 1. Experimental results. The upper table presents the error
Every example output shown during the training is finished rates on inputs of the same lengths as the ones used during train-
by a special “End Of Output” symbol, which the model ing. The lower table shows the error rates on input sequences
learns to predict. It forces the model to learn not only the 2 to 4 times longer than the ones encountered during training.
output symbols, but also the length of the correct output. LSTM+A denotes an LSTM with the standard attention mecha-
nism. Each error rate is a percentage of output sequences, which
We compare our model with 2 strong baseline mod- contained at least one incorrectly predicted bit.
els: encoder-decoder LSTM (Sutskever et al., 2014) and test error LSTM LSTM+A LSTM+HAM
encoder-decoder LSTM with attention (denoted LSTM+A) Reverse 73% 0% 0%
(Bahdanau et al., 2014). The number of the LSTM cells Search 62% 0.04% 0.12%
in the baselines was chosen in such a way, that they have Merge 88% 16% 0%
more parameters than the biggest of our models. We also Sort 99% 25% 0.04%
use random search to select an optimal learning rate and Add 39% 0% 0%
some other parameters for the baselines and train them us- 2-4x longer inputs LSTM LSTM+A LSTM+HAM
ing the same curriculum scheme as LSTM+HAM. Reverse 100% 100% 0%
Search 89% 0.52% 1.68%
The results are presented in Table 1. Not only, does Merge 100% 100% 2.48%
LSTM+HAM solve all the problems almost perfectly, but Sort 100% 100% 0.24%
it also generalizes very well to much longer inputs on all Add 100% 100% 100%
problems except Add. Recall that for the generalization Complexity Θ(1) Θ(n) Θ(log n)
tests we used a HAM memory of a different size than the
ones used during the training, what shows that HAM gen-
3 4.3. Raw HAM
Notice that we earlier assumed for the sake of simplicity that
the input sequences consist of binary vectors and in this task the In this section, we evaluate “raw” HAM module (without
priorities are real values. It does not however require any change
the LSTM controller) to see if it can act as a drop-in re-
of our model. We decided to use real priorities in this task in order
to diversify our set of problems. placement for 3 classic data structures: a stack, a FIFO
Stability means that pairs with equal keys should be ordered queue and a priority queue. For each task, the network is
accordingly to their order in the input sequence. given a sequence of PUSH and POP operations in an on-
Learning Efficient Algorithms with Hierarchical Attentive Memory

line manner: at timestep t the network sees only the t-th

Table 2. Results of experiments with the raw version of HAM
operation to perform xt . This is a more realistic scenario
(without the LSTM controller). Error rates are measured as a per-
for data structures usage as it prevents the network from centage of operation sequences in which at least one POP query
cheating by peeking into the future. was not answered correctly.
Raw HAM module differs from the LSTM+HAM model Task Test Error Generalization
from Sec. 3 in the following way: Error
Stack 0% 0%
Queue 0% 0%
• The HAM memory is initialized with zeros. PriorityQueue 0.08% 0.2%

• The t-th output symbol yt is computed using an MLP

from the value in the accessed leaf ha . 4.4. Analysis

• Notice that in the LSTM+HAM model, hLSTM acted In this section, we present some insights into the algorithms
as a kind of “query” or “command” guiding the be- learned by the LSTM+HAM model, by investigating the
haviour of HAM. We will now use the values xt in- the hidden representations he learned for a variant of the
stead. Therefore, at the t-th timestep we use xt in- problem Sort in which we sort 4-bit vectors lexicograph-
stead of hLSTM whenever hLSTM was used in the orig- ically5 . For demonstration purposes, we use a small tree
inal model, e.g. during the attention phase (Fig. 3) with n = 8 leaves and d = 6.
we use p = SEARCH(hc , xt ) instead of p = The trained network performs sorting perfectly. It attends
SEARCH(hc , hLSTM). to the leaves in the order corresponding to the order of the
sorted input values, i.e. at every timestep HAM attends to
We evaluate raw HAM on the following tasks: the leaf corresponding to the smallest input value among
the leaves, which have not been attended so far.
Stack: The “PUSH x” operation places the element x It would be interesting to exactly understand the algorithm
(a 5-bit vector) on top of the stack, and the “POP” returns used by the network to perform this operation. A natural
the last added element and removes it from the stack. solution to this problem would be to store in each hidden
node e the smallest input value among the (unattended so
Queue: The “PUSH x” operation places the element x (a far) leaves below e together with the information whether
5-bit vector) at the end of the queue and the “POP” returns the smallest value is in the right or the left subtree under e.
the oldest element and removes it from the queue. We present two timesteps of our model together with some
insights into the algorithm used by the network in Fig.6.
PriorityQueue: The “PUSH x p” operations adds
the element x with priority p to the queue. The “POP”
5. Comparison to other models
operation returns the value with the highest priority and re-
move it from the queue. Both x and p are represented as Comparing neural networks able to learn algorithms is dif-
5-bit vectors and priorities are compared lexicographically. ficult for a few reasons. First of all, there are no well-
To avoid ties we assume that all elements have different established benchmark problems for this area. Secondly,
priorities. the difficulty of a problem often depends on the way in-
puts and outputs are encoded. For example, the difficulty
Model was trained with the memory of size up to n =
of the problem of adding long binary numbers depends on
32 with operation sequences of length n. Sequences of
whether the numbers are aligned (i.e. the i-th bit of the
PUSH/POP actions for training were selected randomly.
second number is “under” the i-th bit of the first number)
The t-th operation out of n operations in the sequence was
or written next to each other (e.g. 10011+10101). More-
POP with probability nt and PUSH otherwise. To test gen-
over, we could compare error rates on inputs from the same
eralization, we report the error rates with the memory of
distribution as the ones seen during the training or com-
size 4n on sequences of operations of length 4n.
pare error rates on inputs longer than the ones seen dur-
The results presented in Table 2 shows that HAM sim- ing the training to see if the model “really learned the al-
ulates a stack and a queue perfectly with no errors 5
whatsoever even for memory 4 times bigger. For the In the problem Sort considered in the experimental results,
there are separate keys and values, which forces the model to learn
PriorityQueue task, the model generalizes almost per- stable sorting. Here, for the sake of simplicity, we consider the
fectly to large memory, with errors only in 0.2% of output simplified version of the problem and do not use separate keys
sequences. and values.
Learning Efficient Algorithms with Hierarchical Attentive Memory

(a) The first timestep (b) The second timestep

Figure 6. This figure shows two timesteps of the model. The LSTM controller is not presented to simplify the exposition. The input
sequence is presented on the left, below the tree: x1 = 0000, x2 = 1110, x3 = 1101 and so on. The 2x3 grids in the nodes of the
tree represent the values he ∈ R6 . White cells correspond to value 0 and non-white cells correspond to values > 0. The lower-rightmost
cells are presented in pink, because we managed to decipher the meaning of this coordinate for the inner nodes. This coordinate in the
node e denotes whether the minimum in the subtree (among the values unattended so far) is in the right or left subtree of e. Value greater
than 0 (pink in the picture) means that the minimum is in the right subtree and therefore we should go right while visiting this node in
the attention phase. In the first timestep the leftmost leaf (corresponding to the input 0000) is accessed. Notice that the last coordinates
(shown in pink) are updated appropriately, e.g. the smallest unattended value at the beginning of the second timestep is 0101, which
corresponds to the 6-th leaf. It is in the right subtree under the root and accordingly the last coordinate in the hidden value stored in the
root is high (i.e. pink in the figure).

gorithm”. Furthermore, different models scale differently chine (Kurach et al., 2015), and Queue-Augmented LSTM
with the memory size, which makes direct comparison of (Grefenstette et al., 2015). However, the first three models
error rates less meaningful. have been only successful on relatively simple tasks. The
last model was successful on some synthetic tasks from the
As far as we know, our model is the first one which is
domain of Natural Language Processing, which are very
able to learn a sorting algorithm from pure input-output
different from the tasks we tested our model on, so we can
examples. In (Reed & de Freitas, 2015) it is shown that
not directly compare the two models.
an LSTM is able to learn to sort short sequences, but it
fails to generalize to inputs longer than the ones seen dur- Finally, we do not claim that our model is superior to
ing the training. It is quite clear that an LSTM can not the all other ones, e.g. Neural Turing Machines (NTM)
learn a “real” sorting algorithm, because it uses a bounded (Graves et al., 2014). We believe that both memory mech-
memory independent of the length of the input. The Neu- anisms are complementary: NTM memory has a built-in
ral Programmer-Interpreter (Reed & de Freitas, 2015) is a associative map functionality, which may be difficult to
neural network architecture, which is able to learn bubble achieve in HAM. On the other hand, HAM performs bet-
sort, but it requires strong supervision in the form of execu- ter in tasks like sorting due to a built-in bias towards op-
tion traces. In comparison, our model can be trained from erating on intervals of memory cells. Moreover, HAM al-
pure input-output examples, which is crucial if we want to lows much more efficient memory access than NTM. It is
use it to solve problems for which we do not know any al- also quite possible that a machine able to learn algorithms
gorithms. should use many different types of memory in the same
way as human brain stores a piece of information differ-
An important feature of neural memories is their ef-
ently depending on its type and how long it should be stored
ficiency. Our HAM module in comparison to many
(Berntson & Cacioppo, 2009).
other recently proposed solutions is effective and al-
lows to access the memory in Θ(log(n)) complexity. 6. Conclusions
In the context of learning algorithms it may sound sur-
prising that among all the architectures mentioned in We presented a new memory architecture for neural net-
Sec. 2 the only ones, which can copy a sequence of works called Hierarchical Attentive Memory. Its crucial
length n without Θ(n2 ) operations are: Reinforcement- property is that it scales well with the memory size — the
Learning NTM (Zaremba & Sutskever, 2015), the model memory access requires only Θ(log n) operations. This
from (Zaremba et al., 2015), Neural Random-Access Ma- complexity is achieved by using a new attention mecha-
Learning Efficient Algorithms with Hierarchical Attentive Memory

nism based on a binary tree. The novel attention mecha- 2015).

nism is not only faster than the standard one used in Deep
This version of the model is fully differentiable and there-
Learning, but it also facilities learning algorithms due to
fore it can be trained using end-to-end backpropagation on
the embedded tree structure.
the log-probability of producing the correct output. We ob-
We showed that an LSTM augmented with HAM can learn served that training DHAM is slightly easier than the RE-
a number of algorithms like merging, sorting or binary INFORCE version. However, DHAM does not generalize
searching from pure input-output examples. In particular, as well as HAM to larger memory sizes.
it is the first neural architecture able to learn a sorting algo-
rithm and generalize well to sequences much longer than References
the ones seen during the training.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio,
We believe that some concepts used in HAM, namely the Yoshua. Neural machine translation by jointly learning
novel attention mechanism and the idea of aggregating in- to align and translate. arXiv preprint arXiv:1409.0473,
formation through a binary tree may find applications in 2014.
Deep Learning outside of the problem of designing neural
memories. Berntson, G.G. and Cacioppo, J.T. Handbook of Neuro-
science for the Behavioral Sciences. Number v. 1 in
Acknowledgements Handbook of Neuroscience for the Behavioral Sciences.
Wiley, 2009. ISBN 9780470083567.
We would like to thank Nando de Freitas, Alexander
Graves, Serkan Cabi, Misha Denil and Jonathan Hunt for Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural
helpful comments and discussions. turing machines. arXiv preprint arXiv:1410.5401, 2014.
Grefenstette, Edward, Hermann, Karl Moritz, Suleyman,
A. Using soft attention Mustafa, and Blunsom, Phil. Learning to transduce with
One of the open questions in the area of designing neu- unbounded memory. In Advances in Neural Information
ral networks with attention mechanisms is whether to use Processing Systems, pp. 1819–1827, 2015.
a soft or hard attention. The model described in the pa- Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-
per belongs to the latter class of attention mechanisms as it term memory. Neural computation, 9(8):1735–1780,
makes hard, stochastic choices. The other solution would 1997.
be to use a soft, differentiable mechanism, which attends to
a linear combination of the potential attention targets and Joulin, Armand and Mikolov, Tomas. Inferring algorith-
do not involve any sampling. The main advantage of such mic patterns with stack-augmented recurrent nets. arXiv
models is that their gradients can be computed exactly. preprint arXiv:1503.01007, 2015.
We now describe how to modify the model to make it Kaiser, Łukasz and Sutskever, Ilya. Neural gpus learn al-
fully differentiable (”DHAM”). Recall that in the origi- gorithms. arXiv preprint arXiv:1511.08228, 2015.
nal model the leaf which is attended at every timestep is
sampled stochastically. Instead of that, we will now at ev- Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex.
ery timestep compute for every leaf e the probability p(e) Grid long short-term memory. arXiv preprint
that this leaf would be attended if we used the stochastic arXiv:1507.01526, 2015.
procedure described in Fig. 3. The value p(e) can be com- Kingma, Diederik and Ba, Jimmy. Adam: A
puted by multiplying the probabilities of going in the right method for stochastic optimization. arXiv preprint
direction from all the nodes on the path from the root to e. arXiv:1412.6980, 2014.
P the input for the LSTM we then use the value Kurach, Karol, Andrychowicz, Marcin, and Sutskever,
e∈L p(e) · he . During the write phase, we update the Ilya. Neural random-access machines. arXiv preprint
values of all the leaves using the formula he := p(e) ·
arXiv:1511.06392, 2015.
WRITE(he , hROOT ) + (1 − p(e)) · he . Then, in the up-
date phase we update the values of all the inner nodes, so Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel,
that the equation he = JOIN(hl(e) , hr(e) ) is satisfied for Richard. Gated graph sequence neural networks. arXiv
each inner node e. Notice that one timestep of the soft ver- preprint arXiv:1511.05493, 2015.
sion of the model takes time Θ(n) as we have to update the
values of all the nodes in the tree. Our model may be seen Morin, Frederic and Bengio, Yoshua. Hierarchical proba-
as a special case of Gated Graph Neural Network (Li et al., bilistic neural network language model. In Aistats, vol-
ume 5, pp. 246–252. Citeseer, 2005.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814,
2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. Understanding the exploding gradient problem.
Computing Research Repository (CoRR) abs/1211.5063, 2012.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. arXiv preprint
arXiv:1511.06279, 2015.

Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using
stochastic computation graphs. In Advances in Neural Information Processing Systems,
pp. 3510–3522, 2015.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint
arXiv:1505.00387, 2015.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks.
arXiv preprint arXiv:1503.08895, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image
caption generator. arXiv preprint arXiv:1411.4555, 2014.

Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. arXiv preprint
arXiv:1506.03134, 2015.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint
arXiv:1410.3916, 2014.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural turing machines. arXiv
preprint arXiv:1505.00521, 2015.

Zaremba, Wojciech, Mikolov, Tomas, Joulin, Armand, and Fergus, Rob. Learning simple algorithms
from examples. arXiv preprint arXiv:1511.07275, 2015.

Adaptive Computation Time
for Recurrent Neural Networks
arXiv:1603.08983v6 [cs.NE] 21 Feb 2017

Alex Graves
Google DeepMind


This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neu-
ral networks to learn how many computational steps to take between receiving an input and emitting
an output. ACT requires minimal changes to the network architecture, is deterministic and differen-
tiable, and does not add any noise to the parameter gradients. Experimental results are provided for
four synthetic problems: determining the parity of binary vectors, applying binary logic operations,
adding integers, and sorting real numbers. Overall, performance is dramatically improved by the
use of ACT, which successfully adapts the number of computational steps to the requirements of the
problem. We also present character-level language modelling results on the Hutter prize Wikipedia
dataset. In this case ACT does not yield large gains in performance; however it does provide in-
triguing insight into the structure of the data, with more computation allocated to harder-to-predict
transitions, such as spaces between words and ends of sentences. This suggests that ACT or other
adaptive computation methods could provide a generic method for inferring segment boundaries in
sequence data.

1 Introduction
The amount of time required to pose a problem and the amount of thought required to solve it are
notoriously unrelated. Pierre de Fermat was able to write in a margin the conjecture (if not the
proof) of a theorem that took three and a half centuries and reams of mathematics to solve [35].
More mundanely, we expect the effort required to find a satisfactory route between two cities, or the
number of queries needed to check a particular fact, to vary greatly, and unpredictably, from case
to case. Most machine learning algorithms, however, are unable to dynamically adapt the amount
of computation they employ to the complexity of the task they perform.
For artificial neural networks, where the neurons are typically arranged in densely connected
layers, an obvious measure of computation time is the number of layer-to-layer transformations the
network performs. In feedforward networks this is controlled by the network depth, or number of
layers stacked on top of each other. For recurrent networks, the number of transformations also
depends on the length of the input sequence — which can be padded or otherwise extended to allow
for extra computation. The evidence that increased depth leads to more performant networks is by
now inarguable [5, 4, 19, 9], and recent results show that increased sequence length can be similarly
beneficial [31, 33, 25]. However it remains necessary for the experimenter to decide a priori on the
amount of computation allocated to a particular input vector or sequence. One solution is to simply

make every network very deep and design its architecture in such a way as to mitigate the vanishing
gradient problem [13] associated with long chains of iteration [29, 17]. However in the interests
of both computational efficiency and ease of learning it seems preferable to dynamically vary the
number of steps for which the network ‘ponders’ each input before emitting an output. In this case
the effective depth of the network at each step along the sequence becomes a dynamic function of
the inputs received so far.
The approach pursued here is to augment the network output with a sigmoidal halting unit
whose activation determines the probability that computation should continue. The resulting halting
distribution is used to define a mean-field vector for both the network output and the internal network
state propagated along the sequence. A stochastic alternative would be to halt or continue according
to binary samples drawn from the halting distribution—a technique that has recently been applied to
scene understanding with recurrent networks [7]. However the mean-field approach has the advantage
of using a smooth function of the outputs and states, with no need for stochastic gradient estimates.
We expect this to be particularly beneficial when long sequences of halting decisions must be made,
since each decision is likely to affect all subsequent ones, and sampling noise will rapidly accumulate
(as observed for policy gradient methods [36]).
A related architecture known as Self-Delimiting Neural Networks [26, 30] employs a halting
neuron to end a particular update within a large, partially activated network; in this case however a
simple activation threshold is used to make the decision, and no gradient with respect to halting time
is propagated. More broadly, learning when to halt can be seen as a form of conditional computing,
where parts of the network are selectively enabled and disabled according to a learned policy [3, 6].
We would like the network to be parsimonious in its use of computation, ideally limiting itself to
the minimum number of steps necessary to solve the problem. Finding this limit in its most general
form would be equivalent to determining the Kolmogorov complexity of the data (and hence solving
the halting problem) [21]. We therefore take the more pragmatic approach of adding a time cost to
the loss function to encourage faster solutions. The network then has to learn to trade off accuracy
against speed, just as a person must when making decisions under time pressure. One weakness is
that the numerical weight assigned to the time cost has to be hand-chosen, and the behaviour of the
network is quite sensitive to its value.
The rest of the paper is structured as follows: the Adaptive Computation Time algorithm is
presented in Section 2, experimental results on four synthetic problems and one real-world dataset
are reported in Section 3, and concluding remarks are given in Section 4.

2 Adaptive Computation Time

Consider a recurrent neural network R composed of a matrix of input weights Wx , a parametric
state transition model S, a set of output weights Wy and an output bias by . When applied to an
input sequence x = (x1 , . . . , xT ), R computes the state sequence s = (s1 , . . . , sT ) and the output
sequence y = (y1 , . . . , yT ) by iterating the following equations from t = 1 to T :

st = S(st−1 , Wx xt ) (1)
yt = Wy st + by (2)

The state is a fixed-size vector of real numbers containing the complete dynamic information of the
network. For a standard recurrent network this is simply the vector of hidden unit activations. For
a Long Short-Term Memory network (LSTM) [14], the state also contains the activations of the
memory cells. For a memory augmented network such as a Neural Turing Machine (NTM) [10],
the state contains both the complete state of the controller network and the complete state of the
memory. In general some portions of the state (for example the NTM memory contents) will not be
visible to the output units; in this case we consider the corresponding columns of Wy to be fixed to
zero.

Adaptive Computation Time (ACT) modifies the conventional setup by allowing R to perform a
variable number of state transitions and compute a variable number of outputs at each input step.
Let N(t) be the total number of updates performed at step t. Then define the intermediate state
sequence $(s_t^1, \ldots, s_t^{N(t)})$ and intermediate output sequence
$(y_t^1, \ldots, y_t^{N(t)})$ at step t as follows

$$s_t^n = \begin{cases} S(s_{t-1}, x_t^1) & \text{if } n = 1 \\ S(s_t^{n-1}, x_t^n) & \text{otherwise} \end{cases} \tag{3}$$

$$y_t^n = W_y s_t^n + b_y \tag{4}$$

where $x_t^n = x_t + \delta_{n,1}$ is the input at time t augmented with a binary flag that indicates whether the
input step has just been incremented, allowing the network to distinguish between repeated inputs
and repeated computations for the same input. Note that the same state function is used for all
state transitions (intermediate or otherwise), and similarly the output weights and bias are shared
for all outputs. It would also be possible to use different state and output parameters for each
intermediate step; however doing so would cloud the distinction between increasing the number of
parameters and increasing the number of computational steps. We leave this for future work.
To determine how many updates R performs at each input step an extra sigmoidal halting unit
h is added to the network output, with associated weight matrix Wh and bias bh :

$$h_t^n = \sigma(W_h s_t^n + b_h) \tag{5}$$

As with the output weights, some columns of Wh may be fixed to zero to give selective access to the
network state. The activation of the halting unit is then used to determine the halting probability
$p_t^n$ of the intermediate steps:

$$p_t^n = \begin{cases} R(t) & \text{if } n = N(t) \\ h_t^n & \text{otherwise} \end{cases} \tag{6}$$

where

$$N(t) = \min\Big\{ n' : \sum_{n=1}^{n'} h_t^n \ge 1 - \epsilon \Big\} \tag{7}$$

the remainder R(t) is defined as follows

$$R(t) = 1 - \sum_{n=1}^{N(t)-1} h_t^n \tag{8}$$

and $\epsilon$ is a small constant (0.01 for the experiments in this paper), whose purpose is to allow
computation to halt after a single update if $h_t^1 \ge 1 - \epsilon$, as otherwise a minimum of two
updates would be required for every input step. It follows directly from the definition that
$\sum_{n=1}^{N(t)} p_t^n = 1$ and $0 \le p_t^n \le 1 \; \forall n$, so this is a valid probability
distribution. A similar distribution was recently used
to define differentiable push and pop operations for neural stacks and queues [11].
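
As an illustration of Equations (6) to (8), the following minimal sketch computes N(t), R(t) and the halting probabilities for a single input step. It assumes the halting activations are already available as a plain sequence of floats; in an actual network they would be produced one at a time by the halting unit. The function name `halting_distribution` and the example values are hypothetical and not part of the paper.

```python
from typing import Iterable, List, Tuple


def halting_distribution(h: Iterable[float], eps: float = 0.01) -> Tuple[int, float, List[float]]:
    """Return (N, R, p) for one input step, following Eqs. (6)-(8).

    `h` yields the halting activations h^1, h^2, ... for this step.
    """
    seen: List[float] = []   # activations h^1 .. h^n consumed so far
    total = 0.0              # running sum of the activations in `seen`
    for h_n in h:
        seen.append(h_n)
        total += h_n
        if total >= 1.0 - eps:                 # Eq. (7): first n' whose sum reaches 1 - eps
            n_updates = len(seen)
            remainder = 1.0 - sum(seen[:-1])   # Eq. (8): sum over n < N(t)
            probs = seen[:-1] + [remainder]    # Eq. (6): the last step receives R(t)
            return n_updates, remainder, probs
    raise ValueError("halting activations never summed to 1 - eps")


# Hypothetical activations: the running sum is 0.1, 0.4, 1.2, so the step halts at n = 3.
n_updates, remainder, probs = halting_distribution([0.1, 0.3, 0.8])
print(n_updates, round(remainder, 6), [round(p, 6) for p in probs])
```

Note that a single activation above 1 − ε (for example 0.995) gives N(t) = 1 and R(t) = 1, which is the one-update case that ε is introduced to allow.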
At this point we could proceed stochastically by sampling $\hat{n}$ from $p_t^n$ and setting
$s_t = s_t^{\hat{n}}$, $y_t = y_t^{\hat{n}}$. However we will eschew sampling techniques and the
associated problems of noisy gradients, instead using $p_t^n$ to determine mean-field updates for
the states and outputs:

$$s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n \qquad y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n \tag{9}$$
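
The mean-field update of Equation (9) is a probability-weighted sum of the intermediate vectors. A minimal sketch, assuming states are plain lists of floats and using hypothetical values for the halting probabilities and intermediate states (the name `mean_field` is not from the paper):

```python
from typing import List, Sequence


def mean_field(probs: Sequence[float], vectors: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of the intermediate vectors, as in Eq. (9)."""
    if len(probs) != len(vectors):
        raise ValueError("need one probability per intermediate vector")
    dim = len(vectors[0])
    out = [0.0] * dim
    for p_n, v_n in zip(probs, vectors):
        for i in range(dim):
            out[i] += p_n * v_n[i]   # accumulate p^n * s^n coordinate by coordinate
    return out


# Hypothetical halting probabilities and intermediate states for a single input step.
probs = [0.1, 0.3, 0.6]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_field(probs, states))
```

The same function applies unchanged to the intermediate outputs, since states and outputs share the same weights $p_t^n$.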

The implicit assumption is that the states and outputs are approximately linear, in the sense that
a linear interpolation between a pair of state or output vectors will also interpolate between the
properties the vectors embody. There are several reasons to believe that such an assumption is
reasonable. Firstly, it has been observed that the high-dimensional representations present in neu-
ral networks naturally tend to behave in a linear way [32, 20], even remaining consistent under
arithmetic operations such as addition and subtraction [22]. Secondly, neural networks have been
successfully trained under a wide range of adversarial regularisation constraints, including sparse
internal states [23], stochastically masked units [28] and randomly perturbed weights [1]. This leads
us to believe that the relatively benign constraint of approximately linear representations will not
be too damaging. Thirdly, as training converges, the tendency for both mean-field and stochastic
latent variables is to concentrate all the probability mass on a single value. In this case that yields a
standard RNN with each input duplicated a variable, but deterministic, number of times, rendering
the linearity assumption irrelevant.
A diagram of the unrolled computation graph of a standard RNN is illustrated in Figure 1, while
Figure 2 provides the equivalent diagram for an RNN trained with ACT.

Figure 1: RNN Computation Graph. An RNN unrolled over two input steps (separated by vertical dotted lines). The input
and output weights $W_x$, $W_y$, and the state transition operator S are shared over all steps.

Figure 2: RNN Computation Graph with Adaptive Computation Time. The graph is equivalent to Figure 1, only with
each state and output computation expanded to a variable number of intermediate updates. Arrows touching boxes denote
operations applied to all units in the box, while arrows leaving boxes denote summations over all units in the box.

2.1 Limiting Computation Time
If no constraints are placed on the number of updates R can take at each step it will naturally
tend to ‘ponder’ each input for as long as possible (so as to avoid making predictions and incurring
errors). We therefore require a way of limiting the amount of computation the network performs.
Given a length T input sequence x, define the ponder sequence $(\rho_1, \ldots, \rho_T)$ of R as

$$\rho_t = N(t) + R(t) \tag{10}$$

and the ponder cost P(x) as

$$P(x) = \sum_{t=1}^{T} \rho_t \tag{11}$$

Since $R(t) \in (0, 1)$, P(x) is an upper bound on the (non-differentiable) property we ultimately want
to reduce, namely the total computation $\sum_{t=1}^{T} N(t)$ during the sequence.^1
We can encourage the network to minimise P(x) by modifying the sequence loss function L(x, y)
used for training:
$$\hat{\mathcal{L}}(x, y) = \mathcal{L}(x, y) + \tau \mathcal{P}(x) \tag{12}$$
where τ is a time penalty parameter that weights the relative cost of computation versus error. As
we will see in the experiments section the behaviour of the network is quite sensitive to the value
of τ , and it is not obvious how to choose a good value. If computation time and prediction error
can be meaningfully equated (for example if the relative financial cost of both were known) a more
principled technique for selecting τ should be possible.
To prevent very long sequences at the beginning of training (while the network is learning how
to use the halting unit) the bias term bh can be initialised to a positive value. In addition, a hard
limit M on the maximum allowed value of N (t) can be imposed to avoid excessive space and time
costs. In this case Equation (7) is modified to
$$N(t) = \min\Big\{M, \min\Big\{n : \sum_{n'=1}^{n} h_t^{n'} \geq 1 - \epsilon\Big\}\Big\} \tag{13}$$
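As a concrete reading of the halting rule, the following Python sketch computes N(t), R(t) and ρ_t for a single input step from a list of halting activations, and then the penalised loss of Equation (12). It assumes that the remainder is R(t) = 1 minus the sum of the halting activations before step N(t), which is consistent with the gradients derived in Section 2.2. The function names and example values are illustrative and are not taken from the paper.

```python
def ponder_step(halting, epsilon=0.01, max_steps=100):
    """Return (N, R, rho) for one input step.

    halting: halting activations h^1, h^2, ... for this step, each in (0, 1).
    epsilon, max_steps: the epsilon and M of Equation (13).
    """
    cumulative = 0.0
    for n, h in enumerate(halting, start=1):
        # Halt once the summed halting mass reaches 1 - epsilon,
        # or when the hard limit M on the number of updates is hit.
        if cumulative + h >= 1.0 - epsilon or n == max_steps:
            remainder = 1.0 - cumulative  # assumed form of R(t)
            return n, remainder, n + remainder
        cumulative += h
    raise ValueError("halting activations ended before the halting condition was met")


def penalised_loss(task_loss, ponder_sequence, tau):
    """Equation (12): task loss plus tau times the summed ponder cost."""
    return task_loss + tau * sum(ponder_sequence)


# Illustrative values only: the third update pushes the sum past 1 - epsilon.
n_updates, remainder, rho = ponder_step([0.1, 0.3, 0.7])
```

With a small `max_steps` the same function returns early, which mirrors the role of M as a safeguard against very long ponder times.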

2.2 Error Gradients

The ponder costs ρt are discontinuous with respect to the halting probabilities at the points where
N (t) increments or decrements (that is, when the summed probability mass up to some n either
decreases below or increases above 1 − ε). However they are continuous away from those points,
as N (t) remains constant and R(t) is a linear function of the probabilities. In practice we simply
ignore the discontinuities by treating N (t) as constant and minimising R(t) everywhere.
Given this approximation, the gradient of the ponder cost with respect to the halting activations
is straightforward:
$$\frac{\partial \mathcal{P}(x)}{\partial h_t^n} = \begin{cases} 0 & \text{if } n = N(t) \\ -1 & \text{otherwise} \end{cases} \tag{14}$$
¹ For a stochastic ACT network, a more natural halting distribution than the one described in Equations (6) to (8)
would be to simply treat $h_t^n$ as the probability of halting at step n, in which case
$p_t^n = h_t^n \prod_{n'=1}^{n-1} (1 - h_t^{n'})$. One could then set $\rho_t = \sum_{n=1}^{N(t)} n p_t^n$, i.e. the
expected ponder time under the stochastic distribution. However experiments
show that networks trained to minimise expected rather than total halting time learn to ‘cheat’ in the following
ingenious way: they set $h_t^1$ to a value just below the halting threshold, then keep $h_t^n = 0$ until some N(t) when they
set $h_t^{N(t)}$ high enough to ensure they halt. In this case $p_t^{N(t)} \ll p_t^1$, so the states and outputs at n = N(t) have much
lower weight in the mean field updates (Equation (9)) than those at n = 1; however by making the magnitudes of the
states and output vectors much larger at N(t) than n = 1 the network can still ensure that the update is dominated
by the final vectors, despite having paid a low ponder penalty.

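and hence

$$\frac{\partial \hat{\mathcal{L}}(x, y)}{\partial h_t^n} = \frac{\partial \mathcal{L}(x, y)}{\partial h_t^n} - \begin{cases} 0 & \text{if } n = N(t) \\ \tau & \text{otherwise} \end{cases} \tag{15}$$
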
The halting activations only influence L via their effect on the halting probabilities, therefore
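
$$\frac{\partial \mathcal{L}(x, y)}{\partial h_t^n} = \sum_{n'=1}^{N(t)} \frac{\partial \mathcal{L}(x, y)}{\partial p_t^{n'}} \frac{\partial p_t^{n'}}{\partial h_t^n} \tag{16}$$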

Furthermore, since the halting probabilities only influence L via their effect on the states and outputs,
it follows from Equation (9) that

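$$\frac{\partial \mathcal{L}(x, y)}{\partial p_t^n} = \frac{\partial \mathcal{L}(x, y)}{\partial y_t} y_t^n + \frac{\partial \mathcal{L}(x, y)}{\partial s_t} s_t^n \tag{17}$$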

while, from Equations (6) and (8)

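$$\frac{\partial p_t^{n'}}{\partial h_t^n} = \begin{cases} \delta_{n,n'} & \text{if } n' < N(t) \text{ and } n < N(t) \\ -1 & \text{if } n' = N(t) \text{ and } n < N(t) \\ 0 & \text{if } n = N(t) \end{cases} \tag{18}$$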

Combining Equations (15), (17) and (18) gives, for n < N (t)

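$$\frac{\partial \hat{\mathcal{L}}(x, y)}{\partial h_t^n} = \frac{\partial \mathcal{L}(x, y)}{\partial y_t} \left( y_t^n - y_t^{N(t)} \right) + \frac{\partial \mathcal{L}(x, y)}{\partial s_t} \left( s_t^n - s_t^{N(t)} \right) - \tau \tag{19}$$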

while for n = N (t)

$$\frac{\partial \hat{\mathcal{L}}(x, y)}{\partial h_t^{N(t)}} = 0 \tag{20}$$

Thereafter the network can be differentiated as usual (e.g. with backpropagation through time [36])
and trained with gradient descent.
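
Equation (18) can be checked numerically. The sketch below assumes the halting probabilities implied by that equation, namely p^n = h^n for n < N(t) and p^N(t) = R(t) = 1 minus the sum of the earlier halting activations, and holds N(t) constant as in the approximation above. It compares the analytic Jacobian with finite differences. The function names are illustrative and indices are 0-based in the code.

```python
def halting_probs(h, n_steps):
    """p^n = h^n for n < N, and p^N = 1 - sum of the earlier h^n (N held fixed)."""
    head = list(h[: n_steps - 1])
    return head + [1.0 - sum(head)]


def analytic_jacobian(n_steps):
    """jac[n_prime][n] = d p^{n'} / d h^n, following Equation (18)."""
    last = n_steps - 1
    jac = [[0.0] * n_steps for _ in range(n_steps)]
    for n in range(last):       # columns with n < N(t)
        jac[n][n] = 1.0         # Kronecker delta term for n' < N(t)
        jac[last][n] = -1.0     # the remainder shrinks as h^n grows
    return jac                  # the column for n = N(t) stays zero


def numeric_jacobian(h, n_steps, step=1e-6):
    """Forward finite differences of halting_probs with respect to each h^n."""
    base = halting_probs(h, n_steps)
    jac = [[0.0] * n_steps for _ in range(n_steps)]
    for n in range(n_steps):
        bumped = list(h)
        bumped[n] += step
        moved = halting_probs(bumped, n_steps)
        for n_prime in range(n_steps):
            jac[n_prime][n] = (moved[n_prime] - base[n_prime]) / step
    return jac


# Illustrative halting activations for a step that halts at N(t) = 3.
h_example = [0.2, 0.3, 0.6]
exact = analytic_jacobian(3)
approx = numeric_jacobian(h_example, 3)
```

The zero column for n = N(t) is the reason Equation (20) vanishes: under this approximation the final halting activation has no effect on the halting probabilities.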

3 Experiments
We tested recurrent neural networks (RNNs) with and without ACT on four synthetic tasks and one
real-world language processing task. LSTM was used as the network architecture for all experiments
except one, where a simple RNN was used. However we stress that ACT is equally applicable to
any recurrent architecture.
All the tasks were supervised learning problems with discrete targets and cross-entropy loss.
The data for the synthetic tasks was generated online and cross-validation was therefore not needed.
Similarly, the character prediction dataset was sufficiently large that the network did not overfit.
The performance metric for the synthetic tasks was the sequence error rate: the fraction of examples
where any mistakes were made in the complete output sequence. This metric is useful as it is trivial
to evaluate without decoding. For character prediction the metric was the average log-loss of the
output predictions, in units of bits per character.
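
The sequence error rate described above can be computed as in the following minimal sketch, which assumes that predictions and targets have already been converted to lists of discrete labels. The function name and the example data are hypothetical.

```python
def sequence_error_rate(predictions, targets):
    """Fraction of examples whose predicted sequence differs anywhere from its target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must contain the same number of sequences")
    wrong = sum(1 for pred, tgt in zip(predictions, targets) if list(pred) != list(tgt))
    return wrong / len(targets)


# Illustrative data: the second sequence contains a single mistake,
# so it counts as one wrong example out of three.
rate = sequence_error_rate([[1, 0], [1, 1], [0]], [[1, 0], [1, 0], [0]])
```

A single wrong element makes the whole sequence count as an error, so this metric is stricter than a per-element error rate.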
Figure 3: Parity Training Example. Each sequence consists of a single input and target vector. Only 8 of the 64 input bits
are shown for clarity.

Most of the training parameters were fixed for all experiments: Adam [18] was used for optimisation
with a learning rate of 10^−4; the Hogwild! algorithm [24] was used for asynchronous training
with 16 threads; the initial halting unit bias b_h mentioned in Equation (5) was 1; the ε term from
Equation (7) was 0.01. The synthetic tasks were all trained for 1M iterations, where an iteration
is defined as a weight update on a single thread (hence the total number of weight updates is
approximately 16 times the number of iterations). The character prediction task was trained for 10K
iterations. Early stopping was not used for any of the experiments.
A logarithmic grid search over time penalties was performed for each experiment, with 20 ran-
domly initialised networks trained for each value of τ . For the synthetic problems the range of the
grid search was i × 10^−j, with integer i in the range 1–10 and the exponent j in the range 1–4.
For the language modelling task, which took many days to complete, the range of j was limited to
1–3 to reduce training time (lower values of τ , which naturally induce more pondering, tend to give
greater data efficiency but slower wall clock training time).
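
The grid of time penalties can be enumerated as in the sketch below. It uses exact fractions so that values produced twice (for example 10 × 10^−2 and 1 × 10^−1) collapse into one entry. Whether the original experiments removed such duplicates is not stated, so the deduplication is an assumption, and the function name is illustrative.

```python
from fractions import Fraction


def tau_grid(mantissas=range(1, 11), exponents=range(1, 5)):
    """All distinct values i * 10**-j for the given integer ranges, in ascending order."""
    values = {Fraction(i, 10 ** j) for j in exponents for i in mantissas}
    return [float(v) for v in sorted(values)]


synthetic_grid = tau_grid()                       # j in 1-4, as for the synthetic tasks
language_grid = tau_grid(exponents=range(1, 4))   # j in 1-3, as for language modelling
```

Under these assumptions the synthetic grid spans four decades, from 10^−4 up to 1.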
Unless otherwise stated the maximum computation time M (Equation (13)) was set to 100. In
all experiments the networks converged on learned values of N (t) that were far less than M , which
functions mainly as a safeguard against excessively long ponder times early in training.

3.1 Parity
Determining the parity of a sequence of binary numbers is a trivial task for a recurrent neural
network [27], which simply needs to implement an internal switch that changes sign every time
a one is received. For shallow feedforward networks receiving the entire sequence in one vector,
however, the number of distinct input patterns, and hence difficulty of the task, grows exponentially
with the number of bits. We gauged the ability of ACT to infer an inherently sequential algorithm
from statically presented data by presenting large binary vectors to the network and asking it to
determine the parity. By varying the number of binary bits for which parity must be calculated we
were also able to assess ACT’s ability to adapt the amount of computation to the difficulty of the
vector.
The input vectors had 64 elements, of which a random number from 1 to 64 were randomly set
to 1 or −1 and the rest were set to 0. The corresponding target was 1 if there was an odd number
of ones and 0 if there was an even number of ones. Each training sequence consisted of a single
input and target vector, an example of which is shown in Figure 3. The network architecture was
a simple RNN with a single hidden layer containing 128 tanh units and a single sigmoidal output
unit, trained with binary cross-entropy loss on minibatches of size 128. Note that without ACT the
recurrent connection in the hidden layer was never used since the data had no sequential component,
and the network reduced to a feedforward network with a single hidden layer.
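
One way to generate a parity example matching this description is sketched below. Two details are assumptions rather than statements from the text: the non-zero entries are placed at randomly chosen positions, and the target counts only the entries equal to 1, following the wording "odd number of ones". The function name is hypothetical.

```python
import random


def parity_example(size=64, rng=random):
    """Return (vector, target) for one parity example."""
    n_bits = rng.randint(1, size)                      # how many entries are non-zero
    vector = [0] * size
    for index in rng.sample(range(size), n_bits):      # assumed: random positions
        vector[index] = rng.choice((1, -1))
    target = sum(1 for v in vector if v == 1) % 2      # 1 for an odd number of ones
    return vector, target


vector, target = parity_example(rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the example reproducible, which is convenient when comparing runs.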
Figure 4: Parity Error Rates. Bar heights show the mean error rates for different time penalties at the end of training.
The error bars show the standard error in the mean.

Figure 5: Parity Learning Curves and Error Rates Versus Ponder Time. Left: faint coloured curves show the errors for
individual runs. Bold lines show the mean errors over all 20 runs for each τ value. ‘Iterations’ is the number of gradient
updates per asynchronous worker. Right: Small circles represent individual runs after training is complete, large circles
represent the mean over 20 runs for each τ value. ‘Ponder’ is the mean number of computation steps per input timestep
(minimum 1). The black dotted line shows the mean error for the networks without ACT. The height of the ellipses
surrounding the mean values represents the standard error over error rates for that value of τ , while the width shows the
standard error over ponder times.

Figure 4 demonstrates that the network was unable to reliably solve the problem without ACT,
with a mean of almost 40% error compared to 50% for random guessing. For penalties of 0.03 and
below the mean error was below 5%. Figure 5 reveals that the solutions were both more rapid and
more accurate with lower time penalties. It also highlights the relationship between the time penalty,
the classification error rate and the average ponder time per input. The variance in ponder time
for low τ networks is very high, indicating that many correct solutions with widely varying runtime
can be discovered. We speculate that progressively higher τ values lead the network to compute
the parities of successively larger chunks of the input vector at each ponder step, then iteratively
combine these calculations to obtain the parity of the complete vector.
Figure 6 shows that for the networks without ACT and those with overly high time penalties, the
error rate increases sharply with the difficulty of the task (where difficulty is defined as the number
of bits whose parity must be determined), while the amount of ponder remains roughly constant.
For the more successful networks, with intermediate τ values, ponder time appears to grow linearly
with difficulty, with a slope that generally increases as τ decreases. Even for the best networks the
error rate increased somewhat with difficulty. For some of the lowest τ networks there is a dramatic
increase in ponder after about 32 bits, suggesting an inefficient algorithm.

3.2 Logic
Like parity, the logic task tests if an RNN with ACT can sequentially process a static vector.
Unlike parity it also requires the network to internally transfer information across successive input
timesteps, thereby testing whether ACT can propagate coherent internal states.
Figure 6: Parity Ponder Time and Error Rate Versus Input Difficulty. Faint lines are individual runs, bold lines are means
over 20 networks. ‘Difficulty’ is the number of bits in the parity vectors, with a mean over 1,000 random vectors used for
each data-point.

Table 1: Binary Truth Tables for the Logic Task

P Q NOR Xq ABJ XOR NAND AND XNOR if/then then/if OR

Each input sequence consists of a random number, from 1 to 10, of size 102 input vectors. The
first two elements of each input represent a pair of binary numbers; the remainder of the vector
is divided up into 10 chunks of size 10. The first B chunks, where B is a random number from
1 to 10, contain one-hot representations of randomly chosen numbers between 1 and 10; each of
these numbers corresponds to an index into the subset of binary logic gates whose truth tables are
listed in Table 1. The remaining 10 − B chunks were zeroed to indicate that no further binary
operations were defined for that vector. The binary target b_{B+1} for each input is the truth value
yielded by recursively applying the B binary gates in the vector to the two initial bits b_1, b_0. That
is, for 1 ≤ i ≤ B:

$$b_{i+1} = T_i(b_i, b_{i-1}) \tag{21}$$

where T_i(., .) is the truth table indexed by chunk i in the input vector.
For the first vector in the sequence, the two input bits b0 , b1 were randomly chosen to be false (0)
or true (1) and assigned to the first two elements in the vector. For subsequent vectors, only b1 was
random, while b0 was implicitly equal to the target bit from the previous vector (for the purposes
of calculating the current target bit), but was always set to zero in the input vector. To solve the
task, the network therefore had to learn both how to calculate the sequence of binary operations
represented by the chunks in each vector, and how to carry the final output of that sequence over
to the next timestep. An example input-target sequence pair is shown in Figure 7.
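
The recursion of Equation (21) can be sketched as follows. Only the symmetric gates from Table 1 are included, because the order in which the two arguments are assigned to P and Q for the asymmetric gates cannot be recovered from the text. The gate definitions are the standard ones for those names, and the function and dictionary names are illustrative.

```python
# Standard definitions of the symmetric gates named in Table 1 (inputs are 0 or 1).
GATES = {
    "NOR": lambda p, q: int(not (p or q)),
    "XOR": lambda p, q: int(p != q),
    "NAND": lambda p, q: int(not (p and q)),
    "AND": lambda p, q: int(p and q),
    "XNOR": lambda p, q: int(p == q),
    "OR": lambda p, q: int(p or q),
}


def logic_target(b0, b1, gate_names):
    """Apply b_{i+1} = T_i(b_i, b_{i-1}) for each gate in turn and return the final bit."""
    previous, current = b0, b1
    for name in gate_names:
        previous, current = current, GATES[name](current, previous)
    return current


def sequence_targets(first_b0, b1_values, gates_per_vector):
    """Targets for a whole sequence: each target becomes b0 of the next vector."""
    targets, b0 = [], first_b0
    for b1, gate_names in zip(b1_values, gates_per_vector):
        b0 = logic_target(b0, b1, gate_names)
        targets.append(b0)
    return targets


# Illustrative sequence of two vectors with two gates each.
example_targets = sequence_targets(0, [1, 0], [["XOR", "AND"], ["NOR", "OR"]])
```

The second helper makes the carry-over explicit: the network never sees the previous target in its input, yet needs it to compute the next one.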
Figure 7: Logic Training Example. Both the input and target sequences consist of 3 vectors. For simplicity only 2 of the 10
possible logic gates represented in the input are shown, and each is restricted to one of the first 3 gates in Table 1 (NOR,
Xq, and ABJ). The segmentation of the input vectors is shown on the left and the recursive application of Equation (21)
required to determine the targets (and subsequent b0 values) is shown in italics above the target vectors.

Figure 8: Logic Error Rates.

The network architecture was a single-layer LSTM with 128 cells. The output was a single sigmoidal
unit, trained with binary cross-entropy, and the minibatch size was 16.
Figure 8 shows that the network reaches a minimum sequence error rate of around 0.2 without
ACT (compared to 0.5 for random guessing), and virtually zero error for all τ ≤ 0.01. From Figure 9
it can be seen that low τ ACT networks solve the task very quickly, requiring about 10,000 training
iterations. For higher τ values ponder time reduces to 1, at which point the networks trained with
ACT behave identically to those without. For lower τ values, the spread of ponder values, and
hence computational cost, is quite large. Again we speculate that this is due to the network learning
more or less ‘chunked’ solutions in which composite truth tables are learned for multiple successive
logic operations. This is somewhat supported by the clustering of the lowest τ networks around a
ponder time of 5–6, which is approximately the mean number of logic gates applied per sequence,
and hence the minimum number of computations the network would need if calculating single binary
operations at a time.
Figure 10 shows a surprisingly high ponder time for the least difficult inputs, with some networks
taking more than 10 steps to evaluate a single logic gate. From 5 to 10 logic gates, ponder gradually
increases with difficulty as expected, suggesting that a qualitatively different solution is learned for
the two regimes. This is supported by the error rates for the non ACT and high τ networks, which
increase abruptly after 5 gates. It may be that 5 is the upper limit on the number of successive
gates the network can learn as a single composite operation, and thereafter it is forced to apply an
iterative algorithm.

3.3 Addition
The addition task presents the network with an input sequence of 1 to 5 size 50 input vectors. Each
vector represents a D digit number, where D is drawn randomly from 1 to 5, and each digit is drawn
randomly from 0 to 9. The first 10D elements of the vector are a concatenation of one-hot encodings
of the D digits in the number, and the remainder of the vector is set to 0. The required output
is the cumulative sum of all inputs up to the current one, represented as a set of 6 simultaneous
classifications for the 6 possible digits in the sum. There is no target for the first vector in the
sequence, as no sums have yet been calculated. Because the previous sum must be carried over by
the network, this task again requires the internal state of the network to remain coherent. Each
classification is modelled by a size 11 softmax, where the first 10 classes are the digits and the 11th
is a special marker used to indicate that the number is complete. An example input-target pair is
shown in Figure 11.
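
The input and target encodings for the addition task can be sketched as below. The digit order (most significant digit first) and the use of class index 10 for the end marker are assumptions, since the text does not fix them, and the function names are hypothetical.

```python
def encode_number(digits, max_digits=5):
    """Size 10 * max_digits vector: one-hot blocks for each digit, then zeros."""
    if not 1 <= len(digits) <= max_digits:
        raise ValueError("number of digits out of range")
    vector = [0] * (10 * max_digits)
    for position, digit in enumerate(digits):
        vector[10 * position + digit] = 1
    return vector


def target_classes(total, n_slots=6, end_marker=10):
    """Six 11-way class labels: the digits of the sum, padded with the end marker."""
    digits = [int(ch) for ch in str(total)]
    if len(digits) > n_slots:
        raise ValueError("sum has too many digits for the target")
    return digits + [end_marker] * (n_slots - len(digits))


# Illustrative values: the number 307 as input, and a cumulative sum of 42 as target.
x = encode_number([3, 0, 7])
y = target_classes(42)
```

Six target slots suffice because the largest possible cumulative sum, five inputs of 99999 each, is 499995.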
Figure 9: Logic Learning Curves and Error Rates Versus Ponder Time.

Figure 10: Logic Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the number of logic gates in each
input vector; all sequences were length 5.

Figure 11: Addition Training Example. Each digit in the input sequence is represented by a size 10 one-hot encoding.
Unused input digits, marked ‘-’, are represented by a vector of 10 zeros. The black vector at the start of the target sequence
indicates that no target was required for that step. The target digits are represented as 1-of-11 classes, where the 11th
class, marked ‘*’, is used for digits beyond the end of the target number.

Figure 12: Addition Error Rates.

Figure 13: Addition Learning Curves and Error Rates Versus Ponder Time.

The network was a single-layer LSTM with 512 memory cells. The loss function was the joint
cross-entropy of all 6 targets at each time-step where targets were present and the minibatch size
was 32. The maximum ponder M was set to 20 for this task, as it was found that some networks
had very high ponder times early in training.
The results in Figure 12 show that the task was perfectly solved by the ACT networks for all
values of τ in the grid search. Unusually, networks with higher τ solved the problem with fewer
training examples. Figure 14 demonstrates that the relationship between the ponder time and the
number of digits was approximately linear for most of the ACT networks, and that for the most
efficient networks (with the highest τ values) the slope of the line was close to 1, which matches our
expectations that an efficient long addition algorithm should need one computation step per digit.
Figure 15 shows how the ponder time is distributed during individual addition sequences, pro-
viding further evidence of an approximately linear-time long addition algorithm.

3.4 Sort
The sort task requires the network to sort sequences of 2 to 15 numbers drawn from a standard
normal distribution in ascending order. The experiments considered so far have been designed to
favour ACT by compressing sequential information into single vectors, and thereby requiring the
use of multiple computation steps to unpack them. For the sort task a more natural sequential
representation was used: the random numbers were presented one at a time as inputs, and the
required output was the sequence of indices into the number sequence placed in sorted order; an
example is shown in Figure 16. We were particularly curious to see how the number of ponder steps
scaled with the number of elements to be sorted, knowing that efficient sorting algorithms have
O(N log N ) computational cost.
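
A sort example of this kind can be generated as in the following sketch. It assumes 0-based indices in the target, and the function names are illustrative.

```python
import random


def sorted_indices(values):
    """Indices into `values`, listed in ascending order of the values they point to."""
    return sorted(range(len(values)), key=values.__getitem__)


def sort_example(rng, min_len=2, max_len=15):
    """Draw a sequence of standard normal samples and its target index sequence."""
    length = rng.randint(min_len, max_len)
    values = [rng.gauss(0.0, 1.0) for _ in range(length)]
    return values, sorted_indices(values)


values, order = sort_example(random.Random(0))
```

For instance, `sorted_indices([0.3, -1.2, 0.9])` returns `[1, 0, 2]`, because the smallest value sits at index 1.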
Figure 14: Addition Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the number of digits in each input
vector; all sequences were length 3.

Figure 15: Ponder Time During Three Addition Sequences. The input sequence is shown along the bottom x-axis and
the network output sequence is shown along the top x-axis. The ponder time ρt at each input step is shown by the black
lines; the actual number of computational steps taken at each point is ρt rounded up to the next integer. The grey lines
show the total number of digits in the two numbers being summed at each step; this appears to give a rough lower bound
on the ponder time, suggesting an internal algorithm that is approximately linear in the number of digits. All plots were
created using the same network, trained with τ = 9e−4.

The network was a single-layer LSTM with 512 cells. The output layer was a size 15 softmax,
trained with cross-entropy to classify the indices of the sorted inputs. The minibatch size was 16.
Figure 17 shows that the advantage of using ACT is less dramatic for this task than the previous
three, but still substantial (from around 12% error without ACT to around 6% for the best τ value).
However from Figure 18 it is clear that these gains come at a heavy computational cost, with the best
networks requiring roughly 9 times as much computation as those without ACT. Not surprisingly,
Figure 19 shows that the error rate grew rapidly with the sequence length for all networks. It
also indicates that the better networks had a sublinear growth in computations per input step with
sequence length, though whether this indicates a logarithmic time algorithm is unclear. One problem
with the sort task was that the Gaussian samples were sometimes very close together, making it hard
for the network to determine which was greater; enforcing a minimum separation between successive
values would probably be beneficial.
Figure 20 shows the ponder time during three sort sequences of varying length. As can be seen,
there is a large spike in ponder time near (though not precisely at) the end of the input sequence,
presumably when the majority of the sort comparisons take place. Note that the spike is much higher
for the longer two sequences than the length 5 one, again pointing to an algorithm that is nonlinear
in sequence length (the average ponder per timestep is nonetheless lower for longer sequences, as
little pondering is done away from the spike).

Figure 16: Sort Training Example. Each size 2 input vector consists of one real number and one binary flag to indicate the
end of sequence to be sorted; inputs following the sort sequence are set to zero and marked in black. No targets are present
until after the sort sequence; thereafter the size 15 target vectors represent the sorted indices of the input sequence.

Figure 17: Sort Error Rates.

Figure 18: Sort Learning Curves and Error Rates Versus Ponder Time.

Figure 19: Sort Ponder Time and Error Rate Versus Input Difficulty. ‘Difficulty’ is the length of the sequence to be
sorted.

Figure 20: Ponder Time During Three Sort Sequences. The input sequences to be sorted are shown along the bottom
x-axes and the network output sequences are shown along the top x-axes. All plots created using the same network, trained
with τ = 10^−3.

Figure 21: Wikipedia Error Rates.

3.5 Wikipedia Character Prediction

The Wikipedia task is character prediction on text drawn from the Hutter prize Wikipedia dataset [15].
Following previous RNN experiments on the same data [8], the raw unicode text was used, including
XML tags and markup characters, with one byte presented per input timestep and the next byte
predicted as a target. No validation set was used for early stopping, as the networks were unable to
overfit the data, and all error rates are recorded on the training set. Sequences of 500 consecutive
bytes were randomly chosen from the training set and presented to the network, whose internal state
was reset to 0 at the start of each sequence.
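
The sampling of training sequences can be sketched as follows. It assumes that each sequence supplies 500 input bytes together with the 500 bytes that follow them as targets; the text does not say whether the final target byte is counted within the 500. The function name and the stand-in data are hypothetical.

```python
import random


def sample_byte_sequence(data, rng, length=500):
    """Return (inputs, targets): `length` consecutive bytes and their successors."""
    if len(data) <= length:
        raise ValueError("data must be longer than the requested sequence length")
    start = rng.randrange(len(data) - length)
    window = data[start : start + length + 1]
    return list(window[:-1]), list(window[1:])


# Stand-in data for illustration only (not the Hutter prize text).
corpus = bytes(range(256)) * 4
inputs, targets = sample_byte_sequence(corpus, random.Random(0))
```

Each byte takes one of 256 values, which matches the size 256 softmax used for this task.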
LSTM networks were used with a single layer of 1500 cells and a size 256 softmax classification
layer. As can be seen from Figures 21 and 22, the error rates are fairly similar with and without
ACT, and across values of τ (although the learning curves suggest that the ACT networks are
somewhat more data efficient). Furthermore the amount of ponder per input is much lower than for
the other problems, suggesting that the advantages of extra computation were slight for this task.
Figure 22: Wikipedia Learning Curves (Zoomed) and Error Rates Versus Ponder Time.

Figure 23: Ponder Time, Prediction Loss and Prediction Entropy During a Wikipedia Text Sequence. Plot created using
a network trained with τ = 6e−3.

However Figure 23 reveals an intriguing pattern of ponder allocation while processing a sequence.
Character prediction networks trained with ACT consistently pause at spaces between words, and
pause for longer at ‘boundary’ characters such as commas and full stops. We speculate that the extra
computation is used to make predictions about the next ‘chunk’ in the data (word, sentence, clause),
much as humans have been found to do in self-paced reading experiments [16]. This suggests that
ACT could be useful for inferring implicit boundaries or transitions in sequence data. Alternative
measures for inferring transitions include the next-step prediction loss and predictive entropy, both
of which tend to increase during harder predictions. However, as can be seen from Figure 23, they
are a less reliable indicator of boundaries, and are not likely to increase at points such as full stops
and commas, as these are invariably followed by space characters. More generally, loss and entropy
only indicate the difficulty of the current prediction, not the degree to which the current input is
likely to impact future predictions.
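
The two alternative measures can be written down directly. The sketch below computes them in bits for a single predictive distribution; the function names and the example distributions are illustrative.

```python
import math


def prediction_loss_bits(probs, target):
    """Log-loss in bits of the probability assigned to the symbol that occurred."""
    return -math.log2(probs[target])


def predictive_entropy_bits(probs):
    """Entropy in bits of the predictive distribution, whatever symbol occurs."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# A confident prediction, e.g. a space after a full stop: both measures are low.
confident = [0.97, 0.01, 0.01, 0.01]
# An unpredictable symbol, e.g. a digit in an ID number: both measures are high.
uniform = [0.25, 0.25, 0.25, 0.25]
```

Both quantities depend only on the current predictive distribution, which is why they stay low after a full stop and high on random digits, regardless of how much the current input matters for later predictions.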
Furthermore Figure 24 reveals that, as well as being an effective detector of non-text transition
markers such as the opening brackets of XML tags, ACT does not increase computation time during
random or fundamentally unpredictable sequences like the two ID numbers. This is unsurprising,
as doing so will not improve its predictions. In contrast, both entropy and loss are inevitably high
for unpredictable data. We are therefore hopeful that computation time will provide a better way
to distinguish between structure and noise (or at least data perceived by the network as structure
or noise) than existing measures of predictive difficulty.

Figure 24: Ponder Time, Prediction Loss and Prediction Entropy During a Wikipedia Sequence Containing XML Tags.
Created using the same network as Figure 23.

4 Conclusion
This paper has introduced Adaptive Computation Time (ACT), a method that allows recurrent
neural networks to learn how many updates to perform for each input they receive. Experiments on
synthetic data prove that ACT can make otherwise inaccessible problems straightforward for RNNs
to learn, and that it is able to dynamically adapt the amount of computation it uses to the demands
of the data. An experiment on real data suggests that the allocation of computation steps learned
by ACT can yield insight into both the structure of the data and the computational demands of
predicting it.
ACT promises to be particularly interesting for recurrent architectures containing soft attention
modules [2, 10, 34, 12], which it could enable to dynamically adapt the number of glances or internal
operations they perform at each time-step.
One weakness of the current algorithm is that it is quite sensitive to the time penalty parameter
that controls the relative cost of computation time versus prediction error. An important direction
for future work will be to find ways of automatically determining and adapting the trade-off between
accuracy and speed.

The author wishes to thank Ivo Danihelka, Greg Wayne, Tim Harley, Malcolm Reynolds, Jacob
Menick, Oriol Vinyals, Joel Leibo, Koray Kavukcuoglu and many others on the DeepMind team for
valuable comments and suggestions, as well as Albert Zeyer, Martin Abadi, Dario Amodei, Eugene
Brevdo and Christopher Olah for pointing out the discontinuity in the ponder cost, which was
erroneously described as smooth in an earlier version of the paper.

[1] G. An. The effects of adding noise during backpropagation training on a generalization perfor-
mance. Neural Computation, 8(3):643–674, 1996.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align
and translate. abs/1409.0473, 2014.
[3] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks
for faster models. arXiv preprint arXiv:1511.06297, 2015.
[4] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image
classification. In arXiv:1202.2745v1 [cs.CV], 2012.

[5] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language
Processing, 20(1):30–42, Jan. 2012.
[6] L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510,

[7] S. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer,
repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575,
[8] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850, 2013.
[9] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural net-
works. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Con-
ference on, pages 6645–6649. IEEE, 2013.
[10] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint
arXiv:1410.5401, 2014.
[11] E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with
unbounded memory. In Advances in Neural Information Processing Systems, pages 1819–1827,

[12] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for
image generation. arXiv preprint arXiv:1502.04623, 2015.
[13] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies, 2001.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–
1780, 1997.
[15] M. Hutter. Universal artificial intelligence. Springer, 2005.
[16] M. A. Just, P. A. Carpenter, and J. D. Woolley. Paradigms and processes in reading compre-
hension. Journal of experimental psychology: General, 111(2):228, 1982.

[17] N. Kalchbrenner, I. Danihelka, and A. Graves. Grid long short-term memory. arXiv preprint
arXiv:1507.01526, 2015.
[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[20] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv
preprint arXiv:1405.4053, 2014.
[21] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer
Science & Business Media, 2013.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations
of words and phrases and their compositionality. In Advances in neural information processing
systems, pages 3111–3119, 2013.

[23] B. A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse
code for natural images. Nature, 381(6583):607–609, 1996.
[24] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic
gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[25] S. Reed and N. de Freitas. Neural programmer-interpreters. Technical Report arXiv:1511.06279,

[26] J. Schmidhuber. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118, 2012.
[27] J. Schmidhuber and S. Hochreiter. Guessing can outperform many long time lag algorithms.
Technical report, 1996.

[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple
way to prevent neural networks from overfitting. The Journal of Machine Learning Research,
15(1):1929–1958, 2014.
[29] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in
Neural Information Processing Systems, pages 2368–2376, 2015.
[30] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with powerplay.
Neural Networks, 41:130–136, 2013.
[31] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in
Neural Information Processing Systems, pages 2431–2439, 2015.

[32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks.
arXiv preprint arXiv:1409.3215, 2014.
[33] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv
preprint arXiv:1511.06391, 2015.

[34] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information
Processing Systems, pages 2674–2682, 2015.
[35] A. J. Wiles. Modular elliptic curves and Fermat’s last theorem. Annals of Mathematics, 141:141,

[36] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and
their computational complexity. Back-propagation: Theory, architectures and applications,
pages 433–486, 1995.

DeepMath - Deep Sequence Models for Premise Selection

Alexander A. Alemi∗ (Google Inc.)
François Chollet∗ (Google Inc.)
Niklas Een∗ (Google Inc.)
Geoffrey Irving∗ (Google Inc.)
Christian Szegedy∗ (Google Inc.)
Josef Urban∗† (Czech Technical University in Prague)

arXiv:1606.04442v2 [cs.AI] 26 Jan 2017


We study the effectiveness of neural sequence models for premise selection in
automated theorem proving, one of the main bottlenecks in the formalization of
mathematics. We propose a two stage approach for this task that yields good
results for the premise selection task on the Mizar corpus while avoiding the hand-
engineered features of existing state-of-the-art models. To our knowledge, this is
the first time deep learning has been applied to theorem proving on a large scale.

1 Introduction

Mathematics underpins all scientific disciplines. Machine learning itself rests on measure and
probability theory, calculus, linear algebra, functional analysis, and information theory. Complex
mathematics underlies computer chips, transit systems, communication systems, and financial infras-
tructure – thus the correctness of many of these systems can be reduced to mathematical proofs.
Unfortunately, these correctness proofs are often impractical to produce without automation, and
present-day computers have only limited ability to assist humans in developing mathematical proofs
and formally verifying human proofs. There are two main bottlenecks: (1) lack of automated methods
for semantic or formal parsing of informal mathematical texts (autoformalization), and (2) lack of
strong automated reasoning methods to fill in the gaps in already formalized human-written proofs.
The two bottlenecks are related. Strong automated reasoning can act as a semantic filter for autoformal-
ization, and successful autoformalization would provide a large corpus of computer-understandable
facts, proofs, and theory developments. Such a corpus would serve as both background knowledge to
fill in gaps in human-level proofs and as a training set to guide automated reasoning. Such guidance
is crucial: exhaustive deductive reasoning tools such as today’s resolution/superposition automated
theorem provers (ATPs) quickly hit combinatorial explosion, and are unusable when reasoning with a
very large number of facts without careful selection [4].
In this work, we focus on the latter bottleneck. We develop deep neural networks that learn from a
large repository of manually formalized computer-understandable proofs. We learn the task that is
essential for making today’s ATPs usable over large formal corpora: the selection of a limited number
of most relevant facts for proving a new conjecture. This is known as premise selection.
The main contributions of this work are:

• A demonstration for the first time that neural network models are useful for aiding in large
scale automated logical reasoning without the need for hand-engineered features.
• The comparison of various network architectures (including convolutional, recurrent and
hybrid models) and their effect on premise selection performance.
• A method of semantic-aware “definition”-embeddings for function symbols that improves
the generalization of formulas with symbols occurring infrequently. This model outperforms
previous approaches.
• Analysis showing that neural network based premise selection methods are complementary
to those with hand-engineered features: ensembling with previous results produces superior
results.

∗ Authors listed alphabetically. All contributions are considered equal.
† Supported by ERC Consolidator grant nr. 649043 AI4REASON.

2 Formalization and Theorem Proving

In the last two decades, large corpora of complex mathematical knowledge have been formalized:
encoded in complete detail so that computers can fully understand the semantics of complicated
mathematical objects. The process of writing such formal and verifiable theorems, definitions, proofs,
and theories is called Interactive Theorem Proving (ITP).
The ITP field dates back to the 1960s [16] and the Automath system by N.G. de Bruijn [9]. ITP systems
include HOL (Light) [15], Isabelle [37], Mizar [13], Coq [7], and ACL2 [23]. The development of
ITP has been intertwined with the development of its cousin field of Automated Theorem Proving
(ATP) [31], where proofs of conjectures are attempted fully automatically. Unlike ATP systems,
ITP systems allow human-assisted formalization and proving of theorems that are often beyond the
capabilities of the fully automated systems.
Large ITP libraries include the Mizar Mathematical Library (MML) with over 50,000 lemmas, and
the core Isabelle, HOL, Coq, and ACL2 libraries with thousands of lemmas. These core libraries are a
basis for large projects in formalized mathematics and software and hardware verification. Examples
in mathematics include the HOL Light proof of the Kepler conjecture (Flyspeck project) [14], the
Coq proofs of the Feit-Thompson theorem [12] and Four Color theorem [11], and the verification of
most of the Compendium of Continuous Lattices in Mizar [2]. ITP verification of the seL4 kernel [25]
and CompCert compiler [27] show comparable progress in large scale software verification. While
these large projects mark a coming of age of formalization, ITP remains labor-intensive. For example,
Flyspeck took about 20 person-years, with twice as much for Feit-Thompson. Behind this cost are
our two bottlenecks: lack of tools for autoformalization and strong proof automation.
Recently the field of Automated Reasoning in Large Theories (ARLT) [35] has developed, including
AI/ATP/ITP (AITP) systems called hammers that assist ITP formalization [4]. Hammers analyze
the full set of theorems and proofs in the ITP libraries, estimate the relevance of each theorem, and
apply optimized translations from the ITP logic to simpler ATP formalism. Then they attack new
conjectures using the most promising combinations of existing theorems and ATP search strategies.
Recent evaluations have proved 40% of all Mizar and Flyspeck theorems fully automatically [20, 21].
However, there is significant room for improvement: with perfect premise selection (a perfect choice
of library facts) ATPs can prove at least 56% of Mizar and Flyspeck instead of today’s 40% [4]. In
the next section we explain the premise selection task and the experimental setting for measuring
such improvements.

3 Premise Selection, Experimental Setting and Previous Results

Given a formal corpus of facts and proofs expressed in an ATP-compatible format, our task is:
Definition (Premise selection problem). Given a large set of premises P, an ATP system A with
given resource limits, and a new conjecture C, predict those premises from P that will most likely
lead to an automatically constructed proof of C by A.

:: t99_jordan: Jordan curve theorem in Mizar
for C being Simple_closed_curve holds C is Jordan;

:: Translation to first order logic

fof(t99_jordan, axiom, (! [A] : ( (v1_topreal2(A) & m1_subset_1(A,
k1_zfmisc_1(u1_struct_0(k15_euclid(2))))) => v1_jordan1(A)) ) ).

Figure 1: (top) The final statement of the Mizar formalization of the Jordan curve theorem. (bottom) The
translation to first-order logic, using name mangling to ensure uniqueness across the entire corpus.

(a) Length in chars. (b) Length in words. (c) Word occurrences. (d) Dependencies.
Figure 2: Histograms of statement lengths, occurrences of each word, and statement dependencies in the
Mizar corpus translated to first order logic. The wide length distribution poses difficulties for RNN models and
batching, and many rarely occurring words make it important to take definitions of words into account.

We use the Mizar Mathematical Library (MML) version 4.181.11473 as the formal corpus and E
prover [32] version 1.9 as the underlying ATP system. The following list exemplifies a small
non-representative sample of topics and theorems that are included in the Mizar Mathematical Library:
Cauchy-Riemann Differential Equations of Complex Functions, Characterization and Existence of
Gröbner Bases, Maximum Network Flow Algorithm by Ford and Fulkerson, Gödel’s Completeness
Theorem, Brouwer Fixed Point Theorem, Arrow’s Impossibility Theorem, Borsuk-Ulam Theorem,
Dickson’s Lemma, Sylow Theorems, Hahn Banach Theorem, The Law of Quadratic Reciprocity,
Pepin’s Primality Test for Public-Key Cryptography, Ramsey’s Theorem.
This version of MML was used for the latest AITP evaluation reported in [21]. There are 57,917
proved Mizar theorems and unnamed top-level lemmas in this MML organized into 1,147 articles.
This set is chronologically ordered by the order of articles in MML and by the order of theorems in
the articles. Proofs of later theorems can only refer to earlier theorems. This ordering also applies
to 88,783 other Mizar formulas (encoding the type system and other automation known to Mizar)
used in the problems. The formulas have been translated into first-order logic formulas by the MPTP
system [34] (see Figure 1).
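The chronological constraint above (proofs of later theorems may only refer to earlier statements) can be illustrated with a minimal sketch. The function name `available_premises` and the statement names in `corpus` are made up for illustration; they are not taken from the MML or from the authors' code.

```python
def available_premises(ordered_names, conjecture):
    """Return the statements that precede `conjecture` in corpus order.

    `ordered_names` is assumed to list every statement once, in the
    chronological order of the library.
    """
    index = ordered_names.index(conjecture)
    return ordered_names[:index]


# Hypothetical statement names in a made-up chronological order.
corpus = ["t1_xboole_0", "t2_xboole_0", "d1_tarski", "t1_tarski"]
earlier = available_premises(corpus, "d1_tarski")
print(earlier)
```

Under this convention the first statement of the corpus has no available premises at all.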
Our goal is to automatically prove as many theorems as possible, using at each step all previous
theorems and proofs. We can learn from both human proofs and ATP proofs, but previous experi-
ments [26, 20] show that learning only from the ATP proofs is preferable to including human proofs
if the set of ATP proofs is sufficiently large. Since for 32,524 (56.2%) of the 57,917 theorems an ATP
proof was previously found by a combination of manual and learning-based premise selection [21],
we use only these ATP proofs for training.
The 40% success rate from [21] used a portfolio of 14 AITP methods using different learners, ATPs,
and numbers of premises. The best single method proved 27.3% of the theorems. Only fast and
simple learners such as k-nearest-neighbors, naive Bayes, and their ensembles were used, based on
hand-crafted features such as the set of (normalized) sub-terms and symbols in each formula.

4 Motivation for the use of Deep Learning

Strong premise selection requires models capable of reasoning over mathematical statements, here
encoded as variable-length strings of first-order logic. In natural language processing, deep neural net-
works have proven useful in language modeling [28], text classification [8], sentence pair scoring [3],
conversation modeling [36], and question answering [33]. These results have demonstrated the ability
of deep networks to extract useful representations from sequential inputs without hand-tuned feature
engineering. Neural networks can also mimic some higher-level reasoning on simple algorithmic
tasks [38, 18].

[Figure 3 diagram. Left: axiom and conjecture first order logic sequences → CNN/RNN sequence
models → concatenate embeddings → fully connected layer with 1024 outputs → fully connected layer
with 1 output → logistic loss. Right: stacked Wx+b and Ux+c layers over the input characters
"! [ A , B ] : ( g t a ...".]
Figure 3: (left) Our network structure. The input sequences are either character-level (section 5.1) or word-level
(section 5.2). We use separate models to embed conjecture and axiom, and a logistic layer to predict whether the
axiom is useful for proving the conjecture. (right) A convolutional model.
The Mizar data set is also an interesting case study in neural network sequence tasks, as it differs
from natural language problems in several ways. It is highly structured with a simple context free
grammar – the interesting task occurs only after parsing. The distribution of lengths is wide, ranging
from 5 to 84,299 characters with mean 304.5, and from 2 to 21,251 tokens with mean 107.4 (see
Figure 2). Fully recurrent models would have to back-propagate through 100s to 1000s of characters
or 100s of tokens to embed a whole statement. Finally, there are many rare words – 60.3% of the
words occur fewer than 10 times – motivating the definition-aware embeddings in section 5.2.

5 Overview of our approach

The full premise selection task takes a conjecture and a set of axioms and chooses a subset of
axioms to pass to the ATP. We simplify from subset selection to pairwise relevance by predicting the
probability that a given axiom is useful for proving a given conjecture. This approach depends on a
relatively sparse dependency graph. Our general architecture is shown in Figure 3(left): the conjecture
and axiom sequences are separately embedded into fixed length real vectors, then concatenated and
passed to a third network with two fully connected layers and logistic loss. During training time, the
two embedding networks and the joined predictor path are trained jointly.
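As a rough illustration of the pairwise predictor described above, the following pure-Python sketch concatenates two fixed-length embeddings and passes them through two fully connected layers with a logistic output. The helper names, the toy layer sizes, the random weights, and the ReLU nonlinearity on the hidden layer are all assumptions made for this sketch; the text only fixes the overall shape (concatenation, two fully connected layers, logistic loss).

```python
import math
import random


def dense(weights, bias, x):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relevance(conj_emb, axiom_emb, hidden_layer, output_layer):
    """Probability that the axiom is useful for proving the conjecture."""
    x = conj_emb + axiom_emb                             # concatenate the embeddings
    h = [max(0.0, v) for v in dense(*hidden_layer, x)]   # ReLU is an assumption
    (logit,) = dense(*output_layer, h)                   # single output unit
    return 1.0 / (1.0 + math.exp(-logit))                # logistic function


rng = random.Random(0)
EMB, HIDDEN = 8, 16   # toy sizes; section 6.3 reports a hidden layer of size 1024
hidden_layer = init_layer(rng, HIDDEN, 2 * EMB)
output_layer = init_layer(rng, 1, HIDDEN)
conj = [rng.random() for _ in range(EMB)]
axiom = [rng.random() for _ in range(EMB)]
p = relevance(conj, axiom, hidden_layer, output_layer)
print(p)
```

In training, the two embedding networks that produce `conj` and `axiom` would receive gradients through this classifier, which is what "trained jointly" refers to.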
As discussed in section 3, we train our models on premise selection data generated by a combination
of various methods, including k-nearest-neighbor search on hand-engineered similarity metrics. We
start with a first stage of character-level models, and then build second and later stages of word-level
models on top of the results of earlier stages.

5.1 Stage 1: Character-level models

We begin by avoiding special-purpose engineering: we treat formulas at the character level, using
an 80-dimensional one-hot encoding of the character sequence. These sequences are passed to a
weight-shared network for variable-length input. For the embedding computation, we have explored
the following architectures:
1. Pure recurrent LSTM [17] and GRU [6] networks.
2. A pure multi-layer convolutional network with various numbers of convolutional layers (with strides)
followed by a global temporal max-pooling reduction (see Figure 3(right)).
3. A recurrent-convolutional network that uses convolutional layers to produce a shorter sequence, which
is processed by an LSTM.

The exact architectures used are specified in the experimental section.
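To make the second architecture concrete, here is a minimal sketch of a one-hot character encoding followed by a single strided 1-D convolution and a global temporal max-pooling reduction. It is written in plain Python for readability; the function names, the toy alphabet (the paper uses 80 symbols), the filter count, and the untrained random filters are assumptions, and the nonlinearity and multiple layers of a real model are omitted.

```python
import random


def one_hot(text, alphabet):
    """Encode each character as a one-hot row over `alphabet`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text:
        row = [0.0] * len(alphabet)
        row[index[ch]] = 1.0
        rows.append(row)
    return rows


def conv1d(rows, filters, width, stride):
    """Valid 1-D convolution; each filter is a flat list of width * channels weights."""
    frames = []
    for start in range(0, len(rows) - width + 1, stride):
        window = [v for row in rows[start:start + width] for v in row]
        frames.append([sum(w * v for w, v in zip(f, window)) for f in filters])
    return frames


def global_max_pool(frames):
    """Maximum over time for every channel, giving a fixed-length vector."""
    return [max(channel) for channel in zip(*frames)]


rng = random.Random(0)
shorter = "![A]:(p(A)=>q(A))"
longer = "![A,B]:((p(A)&p(B))=>q(A))"
alphabet = sorted(set(shorter + longer))   # toy alphabet built from the two inputs
filters = [[rng.uniform(-1.0, 1.0) for _ in range(3 * len(alphabet))] for _ in range(4)]
emb_short = global_max_pool(conv1d(one_hot(shorter, alphabet), filters, width=3, stride=2))
emb_long = global_max_pool(conv1d(one_hot(longer, alphabet), filters, width=3, stride=2))
print(len(emb_short), len(emb_long))
```

The point of the pooling step is that inputs of different lengths end up as vectors of the same length (one entry per filter), which is what allows the embeddings to be concatenated downstream.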

It is computationally prohibitive to compute a large number of (conjecture, axiom) pairs due to the
costly embedding phase. Fortunately, our architecture allows caching the embeddings for conjectures
and axioms and evaluating the shared portion of the network for a given pair. This makes it practical
to consider all pairs during evaluation.
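The caching idea can be sketched as follows: each statement is embedded once, and only the cheap classifier on top is evaluated per pair. The class `CachedScorer` and the toy embedding and classifier functions are hypothetical stand-ins, not the authors' implementation.

```python
class CachedScorer:
    """Scores (conjecture, axiom) pairs while embedding each statement only once."""

    def __init__(self, embed_conjecture, embed_axiom, classify):
        self._embed_conjecture = embed_conjecture
        self._embed_axiom = embed_axiom
        self._classify = classify
        self._conjecture_cache = {}
        self._axiom_cache = {}

    @staticmethod
    def _lookup(cache, embed, text):
        if text not in cache:
            cache[text] = embed(text)   # the costly step, done once per statement
        return cache[text]

    def score(self, conjecture, axiom):
        c = self._lookup(self._conjecture_cache, self._embed_conjecture, conjecture)
        a = self._lookup(self._axiom_cache, self._embed_axiom, axiom)
        return self._classify(c, a)     # the cheap step, done once per pair


calls = {"embed": 0}


def toy_embed(text):
    """Stand-in for an expensive embedding network."""
    calls["embed"] += 1
    return [float(len(text)), float(text.count("("))]


def toy_classify(c, a):
    """Stand-in for the classifier on top of the two embeddings."""
    return sum(x * y for x, y in zip(c, a))


scorer = CachedScorer(toy_embed, toy_embed, toy_classify)
conjectures = ["c1(A)", "c2(A,B)"]
axioms = ["a1(A)", "a2(B)", "a3(A,B)"]
scores = {(c, a): scorer.score(c, a) for c in conjectures for a in axioms}
print(len(scores), calls["embed"])
```

With m conjectures and n axioms, this performs m + n embedding computations instead of 2·m·n, while still evaluating the classifier on all m·n pairs.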

5.2 Stage 2: Word-level models

The character-level models are limited to word and structure similarity within the axiom or conjecture
being embedded. However, many of the symbols occurring in a formula are defined by formulas
earlier in the corpus, and we can use the axiom-embeddings of those symbols to improve model
performance.
Since Mizar is based on first-order set theory, definitions of symbols can be either explicit or implicit.
An explicit definition of x sets x = e for some expression e, while an implicit definition states a
property of the defined object, such as defining a function f (x) by ∀x.f (f (x)) = g(x). To avoid
manually encoding the structure of implicit definitions, we embed the entire statement defining a
symbol f , and then use the stage 1 axiom-embedding corresponding to the whole statement as a
word-level embedding.
Ideally, we would train a single network that embeds statements by recursively expanding and
embedding the definitions of the defined symbols. Unfortunately, this recursion would dramatically
increase the cost of training since the definition chains can be quite deep. For example, Mizar defines
real numbers in terms of non-negative reals, which are defined as Dedekind cuts of non-negative
rationals, which are defined as ratios of naturals, etc. As an inexpensive alternative, we reuse the
axiom embeddings computed by a previously trained character-level model, mapping each defined
symbol to the axiom embedding of its defining statement. Other tokens such as brackets and operators
are mapped to fixed pseudo-random vectors of the same dimension.
Since we embed one token at a time ignoring the grammatical structure, our approach does not require
a parser: a trivial lexer is implemented in a few lines of Python. With word-level embeddings, we use
the same architectures with shorter input sequence to produce axiom and conjecture embeddings for
ranking the (conjecture, axiom) pairs. Iterating this approach by using the resulting, stronger axiom
embeddings as word embeddings multiple times for additional stages did not yield measurable gains.
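The token-level lookup described in this section might look like the sketch below. It is not the authors' lexer: the regular expression, the function names, the 4-dimensional toy vectors, and the choice to treat variables such as `A` like any other undefined token are assumptions. Defined symbols map to the (here invented) stage-1 embedding of their defining statement, and every other token maps to a fixed pseudo-random vector of the same dimension.

```python
import random
import re

# Identifiers, a few multi-character operators, then any other single symbol.
TOKEN = re.compile(r"\w+|<=>|=>|!=|\S")


def lex(formula):
    """Split a first-order formula into tokens, ignoring whitespace."""
    return TOKEN.findall(formula)


def pseudo_random_vector(token, dim):
    """Fixed vector for a token without a definition (seeded by the token text)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_tokens(formula, definition_embeddings, dim):
    """Map each token to its definition embedding, or to a fixed random vector."""
    vectors = []
    for token in lex(formula):
        if token in definition_embeddings:
            vectors.append(definition_embeddings[token])
        else:
            vectors.append(pseudo_random_vector(token, dim))
    return vectors


DIM = 4
# Hypothetical stage-1 axiom embeddings of the statements defining two symbols.
definition_embeddings = {
    "v1_topreal2": [0.1, 0.2, 0.3, 0.4],
    "v1_jordan1": [0.5, 0.6, 0.7, 0.8],
}
formula = "! [A] : (v1_topreal2(A) => v1_jordan1(A))"
tokens = lex(formula)
sequence = embed_tokens(formula, definition_embeddings, DIM)
print(tokens)
```

The resulting list of vectors plays the role of the word-level input sequence for the stage-2 model, which is much shorter than the character sequence of the same formula.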

6 Experiments
6.1 Experimental Setup

For training and evaluation we use a subset of 32,524 out of 57,917 theorems that are known to
be provable by an ATP given the right set of premises. We split off a random 10% of these (3,124
statements) for testing and validation. Also, we held out 400 statements from the 3,124 for monitoring
training progress, as well as for model and checkpoint selection. Final evaluation was done on the
remaining 2,724 conjectures. Note that we only held out conjectures, but we trained on all statements
as axioms. This is comparable to our k-NN baseline which is also trained on all statements as axioms.
The randomized selection of the training and testing sets may also lead to learning from future proofs:
a proof Pj of theorem Tj written after theorem Ti may guide the premise selection for Ti . However,
previous k-NN experiments show similar performance between a full 10-fold cross-validation and
incremental evaluation as long as chronologically preceding formulas participate in proofs of only
later theorems.

6.2 Metrics

For each conjecture, our models output a ranking of possible premises. Our primary metric is the
number of conjectures proved from the top-k premises, where k = 16, 32, . . . , 1024. This metric can
accommodate alternative proofs but is computationally expensive. Therefore we additionally measure
the ranking quality using the average maximum relative rank of the testing premise set. Formally,
average max relative rank is
\[
\mathrm{aMRR} = \operatorname*{mean}_{C} \; \max_{P \in P_{\mathrm{test}}(C)} \frac{\mathrm{rank}(P, P_{\mathrm{avail}}(C))}{|P_{\mathrm{avail}}(C)|}
\]
where C ranges over conjectures, Pavail (C) is the set of premises available to prove C, Ptest (C) is the
set of premises for conjecture C from the test set, and rank(P, Pavail (C)) is the rank of premise P
among the set Pavail (C) according to the model. The motivation for aMRR is that conjectures are
easier to prove if all their dependencies occur early in the ranking.
Since it is too expensive to rank all axioms for a conjecture during continuous evaluation, we
approximate our objective. For our holdout set of 400 conjectures, we select all true dependencies
Ptest (C) and 128 fixed random false dependencies from Pavail (C) − Ptest (C) and compute the average
max relative rank in this ordering. Note that aMRR is nonzero even if all true dependencies are
ordered before false dependencies; the best possible value is 0.051.
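A small sketch of the metric may help. It assumes that ranks start at 1 and that a higher model score means a more relevant premise; both conventions, and the function names, are assumptions rather than details given in the text. Under these assumptions, a conjecture with n true dependencies that are all ranked ahead of the 128 sampled negatives has max relative rank n / (n + 128), which is why the metric cannot reach zero.

```python
def max_relative_rank(scores, true_premises):
    """Worst rank of a true premise, divided by the number of ranked premises.

    `scores` maps every available premise to its model score.
    """
    ranking = sorted(scores, key=scores.get, reverse=True)
    position = {premise: i + 1 for i, premise in enumerate(ranking)}  # 1-based ranks
    worst = max(position[p] for p in true_premises)
    return worst / len(ranking)


def amrr(examples):
    """Mean over conjectures of the max relative rank."""
    values = [max_relative_rank(scores, true) for scores, true in examples]
    return sum(values) / len(values)


# Toy data: 128 negatives with low scores, plus two true premises.
negatives = {f"n{i}": 0.1 - i * 1e-4 for i in range(128)}
perfect = ({"t1": 0.9, "t2": 0.8, **negatives}, {"t1", "t2"})
worst_case = ({"t1": 0.9, "t2": -1.0, **negatives}, {"t1", "t2"})
print(max_relative_rank(*perfect), max_relative_rank(*worst_case))
```

In the perfect ordering the value is 2/130; if one true premise is ranked last, the value is 1.0 regardless of how well the other is ranked.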

Figure 4: Specification of the different embedder networks.

6.3 Network Architectures

All our neural network models use the general architecture from Fig 3: a classifier on top of the
concatenated embeddings of an axiom and a conjecture. The same classifier architecture was used for
all models: a fully-connected neural network with one hidden layer of size 1024. For each model, the
axiom and conjecture embedding networks have the same architecture without sharing weights. The
details of the embedding networks are shown in Fig 4.

6.4 Network Training

The neural networks were trained using asynchronous distributed stochastic gradient descent using
the Adam optimizer [24] with up to 20 parallel NVIDIA K-80 GPU workers per model. We used the
TensorFlow framework [1] and the Keras library [5]. The weights were initialized using [10]. Polyak
averaging with 0.9999 decay was used for producing the evaluation weights [30]. The character-level
models were trained with a maximum sequence length of 2048 characters, whereas the word-level
(and definition embedding) based models had a maximum sequence length of 500 words. For good
performance, especially for low cutoff thresholds, it was critical to employ negative mining during
training. A side process was continuously evaluating many (conjecture, axiom) pairs. For each
conjecture, we pick the lowest scoring statements that have higher score than the lowest scoring true
positive. A queue of previously mined negatives is maintained for producing a mixture of examples
in which the ratio of mined instances is about 25% and the rest are randomly selected premises.
Negative mining was crucial for good quality: at the top-16 cutoff, the number of proved theorems
on the test set doubled. For the union of proof attempts over all cutoff thresholds, the ratio of
successful proofs increased from 61.3% to 66.4% for the best neural model.
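The mining rule and the 25% mixture can be sketched as below. This is an interpretation of the description above, not the authors' code: the function names, the `limit` parameter, the use of a plain list as the queue of mined negatives, and sampling the random negatives with replacement are all assumptions.

```python
import random


def mine_negatives(scores, true_premises, limit):
    """False premises that outscore the weakest true premise, lowest scoring first."""
    threshold = min(scores[p] for p in true_premises)
    candidates = [p for p in scores if p not in true_premises and scores[p] > threshold]
    candidates.sort(key=scores.get)
    return candidates[:limit]


def sample_negatives(mined_queue, all_negatives, batch_size, rng, mined_ratio=0.25):
    """Mix previously mined negatives (about 25%) with randomly chosen ones."""
    n_mined = min(len(mined_queue), round(batch_size * mined_ratio))
    mined = rng.sample(mined_queue, n_mined)
    randoms = [rng.choice(all_negatives) for _ in range(batch_size - n_mined)]
    return mined + randoms


# Toy scores for one conjecture: t1 and t2 are its true premises.
scores = {"t1": 0.9, "t2": 0.4, "n1": 0.95, "n2": 0.5, "n3": 0.45, "n4": 0.1}
true_premises = {"t1", "t2"}
mined = mine_negatives(scores, true_premises, limit=2)
batch = sample_negatives(mined, ["n1", "n2", "n3", "n4"], batch_size=8, rng=random.Random(0))
print(mined, batch)
```

Here `n4` is never mined because it already scores below every true premise, so it carries little training signal for the ranking.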

6.5 Experimental Results

(a) Training accuracy for different character-level models without hard negative mining. Recurrent
models seem to underperform, while pure convolutional models yield the best results. For each
architecture, we trained three models with different random initialization seeds. Only the best runs
are shown on this graph; we did not see much variance between runs on the same architecture.
(b) Test average max relative rank for different models without hard negative mining. The best is a
word-level CNN using definition embeddings from a character-level 2-layer CNN. An identical
word-embedding model with random starting embedding overfits after only 250,000 iterations and
underperforms the best character-level model.

Our best selection pipeline uses a stage-1 character-level convolutional neural network model to
produce word-level embeddings for the second stage. The baseline uses distance-weighted
k-NN [19, 21] with handcrafted semantic features [22]. For all conjectures in our holdout set, we
consider all the chronologically preceding statements (lemmas, definitions and axioms) as premise
candidates. In the DeepMath case, premises were ordered by their logistic scores. E prover was
applied to the top-k of the premise-candidates for each of the cutoffs k ∈ (16, 32, . . . , 1024) until a
proof is found or k = 1024 fails. Table 1 reports the number of theorems proved with a cutoff value
at most the k in the leftmost column. For E prover, we used auto strategy with a soft time limit of 90
seconds, a hard time limit of 120 seconds, a memory limit of 4 GB, and a processed clauses limit of
Our most successful models employ simple convolutional networks followed by max pooling (as
opposed to recurrent networks like LSTM/GRU), and the two stage definition-based def-CNN
outperforms the naïve word-CNN word embedding significantly. In the latter the word embeddings
were learned in a single pass; in the former they are fixed from the stage-1 character-level model. For
each architecture (cf. Figure 4) two convolutional layers perform best. Although our models differ
significantly from each other, they differ even more from the k-NN baseline based on hand-crafted
features. The right column of Table 1 shows the result if we average the prediction score of the stage-1
model with that of the definition based stage-2 model. We also experimented with character-based
RNN models using shorter sequences: these lagged behind our long-sequence CNN models but
performed significantly better than those RNNs trained on longer sequences. This suggests that RNNs
could be improved by more sophisticated optimization techniques such as curriculum learning.
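The cutoff schedule used in this evaluation (try the top-k premises for k = 16, 32, ..., 1024 until a proof is found) can be sketched as follows. The `try_prove` callback is a hypothetical stand-in for a call to an ATP such as E prover; no real prover interface is used here, and the toy prover simply checks whether the needed premises are among those supplied.

```python
CUTOFFS = (16, 32, 64, 128, 256, 512, 1024)


def prove_with_cutoffs(ranked_premises, try_prove, cutoffs=CUTOFFS):
    """Return the first cutoff k whose top-k premises yield a proof, else None."""
    for k in cutoffs:
        if try_prove(ranked_premises[:k]):
            return k
    return None


def cumulative_counts(first_success, cutoffs=CUTOFFS):
    """For each k, count the conjectures proved with a cutoff of at most k."""
    return {
        k: sum(1 for s in first_success if s is not None and s <= k)
        for k in cutoffs
    }


# Toy setting: premises ranked p0, p1, ...; a "proof" needs specific premises.
ranked = [f"p{i}" for i in range(2000)]


def needs(required):
    """Build a toy prover that succeeds when all required premises are supplied."""
    return lambda premises: required <= set(premises)


k_found = prove_with_cutoffs(ranked, needs({"p3", "p40"}))
k_missing = prove_with_cutoffs(ranked, needs({"p1500"}))
table = cumulative_counts([16, 64, None, 32])
print(k_found, k_missing, table)
```

The cumulative counting mirrors how Table 1 is read: each row includes every conjecture already proved at a smaller cutoff.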

Cutoff  k-NN Baseline (%)  char-CNN (%)  word-CNN (%)  def-CNN-LSTM (%)  def-CNN (%)  def+char-CNN (%)
16      674 (24.6)         687 (25.1)    709 (25.9)    644 (23.5)        734 (26.8)   835 (30.5)
32      1081 (39.4)        1028 (37.5)   1063 (38.8)   924 (33.7)        1093 (39.9)  1218 (44.4)
64      1399 (51)          1295 (47.2)   1355 (49.4)   1196 (43.6)       1381 (50.4)  1470 (53.6)
128     1612 (58.8)        1534 (55.9)   1552 (56.6)   1401 (51.1)       1617 (59)    1695 (61.8)
256     1709 (62.3)        1656 (60.4)   1635 (59.6)   1519 (55.4)       1708 (62.3)  1780 (64.9)
512     1762 (64.3)        1711 (62.4)   1712 (62.4)   1593 (58.1)       1780 (64.9)  1830 (66.7)
1024    1786 (65.1)        1762 (64.3)   1755 (64)     1647 (60.1)       1822 (66.4)  1862 (67.9)

Table 1: Results of ATP premise selection experiments with hard negative mining on a test set of 2,742 theorems.
Each entry is the number (%) of theorems proved by E prover using that particular model to rank the premises.
The union of def-CNN and char-CNN proves 69.8% of the test set, while the union of the def-CNN and k-NN
proves 74.25%. This means that the neural network predictions are more complementary to the k-NN predictions
than to other neural models. The union of all methods proves 2218 theorems (80.9%) and just the neural models
prove 2151 (78.4%).

Also, when we applied two of the premise selection models to those Mizar statements that had not
been proven automatically before, we managed to prove 823 more of them.

Model          Test min average relative rank
char-CNN       0.0585
word-CNN       0.06
def-CNN-LSTM   0.0605
def-CNN        0.0575

(c) Jaccard similarities between proved sets of conjectures across models. The predictions of the
neural network models are more similar to each other than to those of the k-NN baseline.
(d) Best sustained test results obtained by the above models. Lower values are better. This was
monitored continuously during training on a holdout set with 400 theorems, using all true positive
premises and 128 randomly selected negatives. In this setup, the lowest attainable max average
relative rank with perfect predictions is 0.051.
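For reference, the Jaccard similarity between two sets of proved conjectures is the size of their intersection divided by the size of their union. The sketch below uses made-up conjecture identifiers; returning 1.0 for two empty sets is a convention chosen here, not something stated in the text.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections of proved conjectures."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical sets of conjectures proved by two different models.
proved_by_model_1 = {"t1", "t2", "t3"}
proved_by_model_2 = {"t2", "t3", "t4"}
similarity = jaccard(proved_by_model_1, proved_by_model_2)
print(similarity)
```

A lower similarity between two models means their proved sets overlap less, so their union proves more conjectures than either model alone.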

7 Conclusions
In this work we provide evidence that even simple neural models can compete with hand-engineered
features for premise selection, helping to find many new proofs. This translates to real gains in
automatic theorem proving. Despite these encouraging results, our models are relatively shallow
networks with inherent limitations to representational power and are incapable of capturing high level
properties of mathematical statements. We believe theorem proving is a challenging and important
domain for deep learning methods, and that more sophisticated optimization techniques and training
methodologies will prove more useful than in less structured domains.

8 Acknowledgments
We would like to thank Cezary Kaliszyk for providing us with an improved baseline model. Also
many thanks go to the Google Brain team for their generous help with the training infrastructure. We
would like to thank Quoc Le for useful discussions on the topic and to Sergio Guadarrama for his
help with TensorFlow-slim.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. War-
den, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from
[2] G. Bancerek and P. Rudnicki. A Compendium of Continuous Lattices in MIZAR. J. Autom. Reasoning,
29(3-4):189–224, 2002.
[3] P. Baudiš, J. Pichl, T. Vyskočil, and J. Šedivý. Sentence pair scoring: Towards unified framework for text
comprehension. arXiv preprint arXiv:1603.06127, 2016.
[4] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban. Hammering towards QED. J. Formalized
Reasoning, 9(1):101–148, 2016.
[5] F. Chollet. Keras, 2015.
[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv preprint
arXiv:1502.02367, 2015.
[7] The Coq Proof Assistant.
[8] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing
Systems, pages 3061–3069, 2015.
[9] N. de Bruijn. The mathematical language AUTOMATH, its usage, and some of its extensions. In M. Laudet,
editor, Proceedings of the Symposium on Automatic Demonstration, pages 29–61, Versailles, France, Dec.
1968. Springer-Verlag LNM 125.
[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In
International conference on artificial intelligence and statistics, pages 249–256, 2010.

[11] G. Gonthier. The four colour theorem: Engineering of a formal proof. In D. Kapur, editor, Computer
Mathematics, 8th Asian Symposium, ASCM 2007, Singapore, December 15-17, 2007. Revised and Invited
Papers, volume 5081 of Lecture Notes in Computer Science, page 333. Springer, 2007.
[12] G. Gonthier, A. Asperti, J. Avigad, Y. Bertot, C. Cohen, F. Garillot, S. L. Roux, A. Mahboubi, R. O’Connor,
S. O. Biha, I. Pasca, L. Rideau, A. Solovyev, E. Tassi, and L. Théry. A machine-checked proof of the Odd
Order Theorem. In S. Blazy, C. Paulin-Mohring, and D. Pichardie, editors, ITP, volume 7998 of LNCS,
pages 163–179. Springer, 2013.
[13] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning,
3(2):153–245, 2010.
[14] T. C. Hales, M. Adams, G. Bauer, D. T. Dang, J. Harrison, T. L. Hoang, C. Kaliszyk, V. Magron,
S. McLaughlin, T. T. Nguyen, T. Q. Nguyen, T. Nipkow, S. Obua, J. Pleso, J. Rute, A. Solovyev, A. H. T.
Ta, T. N. Tran, D. T. Trieu, J. Urban, K. K. Vu, and R. Zumkeller. A formal proof of the Kepler conjecture.
CoRR, abs/1501.02155, 2015.
[15] J. Harrison. HOL Light: A tutorial introduction. In M. K. Srivas and A. J. Camilleri, editors, FMCAD,
volume 1166 of LNCS, pages 265–269. Springer, 1996.
[16] J. Harrison, J. Urban, and F. Wiedijk. History of interactive theorem proving. In J. H. Siekmann, editor,
Computational Logic, volume 9 of Handbook of the History of Logic, pages 135 – 214. North-Holland,
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[18] Ł. Kaiser and I. Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
[19] C. Kaliszyk and J. Urban. Stronger automation for Flyspeck by feature weighting and strategy evolution.
In J. C. Blanchette and J. Urban, editors, PxTP 2013, volume 14 of EPiC Series, pages 87–95. EasyChair,
[20] C. Kaliszyk and J. Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning,
53(2):173–213, 2014.
[21] C. Kaliszyk and J. Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.
[22] C. Kaliszyk, J. Urban, and J. Vyskocil. Efficient semantic features for automated reasoning over large
theories. In Q. Yang and M. Wooldridge, editors, Proceedings of the Twenty-Fourth International Joint
Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages
3084–3090. AAAI Press, 2015.
[23] M. Kaufmann and J. S. Moore. An ACL2 tutorial. In Mohamed et al. [29], pages 17–21.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] G. Klein, J. Andronick, K. Elphinstone, G. Heiser, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt,
R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood. seL4: formal verification of an operating-
system kernel. Commun. ACM, 53(6):107–115, 2010.
[26] D. Kuehlwein and J. Urban. Learning from multiple proofs: First experiments. In P. Fontaine, R. A.
Schmidt, and S. Schulz, editors, PAAR-2012, volume 21 of EPiC Series, pages 82–94. EasyChair, 2013.
[27] X. Leroy. Formal verification of a realistic compiler. Commun. ACM, 52(7):107–115, 2009.
[28] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based
language model. In INTERSPEECH, volume 2, page 3, 2010.
[29] O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors. Theorem Proving in Higher Order Logics, 21st
International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume
5170 of LNCS. Springer, 2008.
[30] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on
Control and Optimization, 30(4):838–855, 1992.
[31] J. A. Robinson and A. Voronkov, editors. Handbook of Automated Reasoning (in 2 volumes). Elsevier and
MIT Press, 2001.
[32] S. Schulz. E - A Brainiac Theorem Prover. AI Commun., 15(2-3):111–126, 2002.
[33] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information
Processing Systems, pages 2431–2439, 2015.
[34] J. Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1-2):21–43,
[35] J. Urban and J. Vyskočil. Theorem proving in large formal mathematics as an emerging AI field. In M. P.
Bonacina and M. E. Stickel, editors, Automated Reasoning and Mathematics: Essays in Memory of William
McCune, volume 7788 of LNAI, pages 240–257. Springer, 2013.

[36] O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[37] M. Wenzel, L. C. Paulson, and T. Nipkow. The Isabelle framework. In Mohamed et al. [29], pages 33–38.
[38] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Learning to Transduce with Unbounded Memory

Edward Grefenstette Karl Moritz Hermann Mustafa Suleyman

Google DeepMind Google DeepMind Google DeepMind

Phil Blunsom
Google DeepMind and Oxford University


Recently, strong results have been demonstrated by Deep Recurrent Neural Net-
works on natural language transduction problems. In this paper we explore the
representational power of these models using synthetic grammars designed to ex-
hibit phenomena similar to those found in real transduction problems such as ma-
chine translation. These experiments lead us to propose new memory-based recur-
rent networks that implement continuously differentiable analogues of traditional
data structures such as Stacks, Queues, and DeQues. We show that these architec-
tures exhibit superior generalisation performance to Deep RNNs and are often able
to learn the underlying generating algorithms in our transduction experiments.

1 Introduction

Recurrent neural networks (RNNs) offer a compelling tool for processing natural language input in
a straightforward sequential manner. Many natural language processing (NLP) tasks can be viewed
as transduction problems, that is learning to convert one string into another. Machine translation is
a prototypical example of transduction and recent results indicate that Deep RNNs have the ability
to encode long source strings and produce coherent translations [1, 2]. While elegant, the appli-
cation of RNNs to transduction tasks requires hidden layers large enough to store representations
of the longest strings likely to be encountered, implying wastage on shorter strings and a strong
dependency between the number of parameters in the model and its memory.
In this paper we use a number of linguistically-inspired synthetic transduction tasks to explore the
ability of RNNs to learn long-range reorderings and substitutions. Further, inspired by prior work on
neural network implementations of stack data structures [3], we propose and evaluate transduction
models based on Neural Stacks, Queues, and DeQues (double ended queues). Stack algorithms are
well-suited to processing the hierarchical structures observed in natural language and we hypothesise
that their neural analogues will provide an effective and learnable transduction tool. Our models
provide a middle ground between simple RNNs and the recently proposed Neural Turing Machine
(NTM) [4] which implements a powerful random access memory with read and write operations.
Neural Stacks, Queues, and DeQues also provide a logically unbounded memory while permitting
efficient constant time push and pop operations.
Our results indicate that the models proposed in this work, and in particular the Neural DeQue, are
able to consistently learn a range of challenging transductions. While Deep RNNs based on long
short-term memory (LSTM) cells [1, 5] can learn some transductions when tested on inputs of the
same length as seen in training, they fail to consistently generalise to longer strings. In contrast,
our sequential memory-based algorithms are able to learn to reproduce the generating transduction
algorithms, often generalising perfectly to inputs well beyond those encountered in training.

2 Related Work
String transduction is central to many applications in NLP, from name transliteration and spelling
correction, to inflectional morphology and machine translation. The most common approach lever-
ages symbolic finite state transducers [6, 7], with approaches based on context free representations
also being popular [8]. RNNs offer an attractive alternative to symbolic transducers due to their sim-
ple algorithms and expressive representations [9]. However, as we show in this work, such models
are limited in their ability to generalise beyond their training data and have a memory capacity that
scales with the number of their trainable parameters.
Previous work has touched on the topic of rendering discrete data structures such as stacks continu-
ous, especially within the context of modelling pushdown automata with neural networks [10, 11, 3].
We were inspired by the continuous pop and push operations of these architectures and the idea of
an RNN controlling the data structure when developing our own models. The key difference is that
our work adapts these operations to work within a recurrent continuous Stack/Queue/DeQue-like
structure, the dynamics of which are fully decoupled from those of the RNN controlling it. In our
models, the backwards dynamics are easily analysable in order to obtain the exact partial derivatives
for use in error propagation, rather than having to approximate them as done in previous work.
In a parallel effort to ours, researchers are exploring the addition of memory to recurrent networks.
The NTM and Memory Networks [4, 12, 13] provide powerful random access memory operations,
whereas we focus on a more efficient and restricted class of models which we believe are sufficient
for natural language transduction tasks. More closely related to our work, [14] have sought to
develop a continuous stack controlled by an RNN. Note that this model—unlike the work proposed
here—renders discrete push and pop operations continuous by “mixing” information across levels of
the stack at each time step according to scalar push/pop action values. This means the model ends up
compressing information in the stack, thereby limiting its use, as it effectively loses the unbounded
memory nature of traditional symbolic models.

3 Models
In this section, we present an extensible memory enhancement to recurrent layers which can be set
up to act as a continuous version of a classical Stack, Queue, or DeQue (double-ended queue). We
begin by describing the operations and dynamics of a neural Stack, before showing how to modify
it to act as a Queue, and extend it to act as a DeQue.

3.1 Neural Stack

Let a Neural Stack be a differentiable structure onto and from which continuous vectors are pushed
and popped. Inspired by the neural pushdown automaton of [3], we render these traditionally dis-
crete operations continuous by letting push and pop operations be real values in the interval (0, 1).
Intuitively, we can interpret these values as the degree of certainty with which some controller wishes
to push a vector v onto the stack, or pop the top of the stack.

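$$V_t[i] = \begin{cases} V_{t-1}[i] & \text{if } 1 \le i < t \\ v_t & \text{if } i = t \end{cases} \qquad (\text{note that } V_t[i] = v_i \text{ for all } i \le t) \tag{1}$$

$$s_t[i] = \begin{cases} \max\left(0,\; s_{t-1}[i] - \max\left(0,\; u_t - \sum_{j=i+1}^{t-1} s_{t-1}[j]\right)\right) & \text{if } 1 \le i < t \\ d_t & \text{if } i = t \end{cases} \tag{2}$$

$$r_t = \sum_{i=1}^{t} \min\left(s_t[i],\; \max\left(0,\; 1 - \sum_{j=i+1}^{t} s_t[j]\right)\right) \cdot V_t[i] \tag{3}$$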

Formally, a Neural Stack, fully parametrised by an embedding size $m$, is described at some timestep
$t$ by a $t \times m$ value matrix $V_t$ and a strength vector $s_t \in \mathbb{R}^t$. These form the core of a recurrent layer
which is acted upon by a controller by receiving, from the controller, a value $v_t \in \mathbb{R}^m$, a pop signal
$u_t \in (0, 1)$, and a push signal $d_t \in (0, 1)$. It outputs a read vector $r_t \in \mathbb{R}^m$. The recurrence of this
layer comes from the fact that it will receive as previous state of the stack the pair $(V_{t-1}, s_{t-1})$, and
produce as next state the pair $(V_t, s_t)$ following the dynamics described in Equations 1–3. Here, $V_t[i]$ represents
the $i$th row (an $m$-dimensional vector) of $V_t$ and $s_t[i]$ represents the $i$th value of $s_t$.
Equation 1 shows the update of the value component of the recurrent layer state represented as a
matrix, the number of rows of which grows with time, maintaining a record of the values pushed to
the stack at each timestep (whether or not they are still logically on the stack). Values are appended
to the bottom of the matrix (top of the stack) and never changed.
Equation 2 shows the effect of the push and pop signal in updating the strength vector $s_{t-1}$ to
produce $s_t$. First, the pop operation removes objects from the stack. We can think of the pop value
$u_t$ as the initial deletion quantity for the operation. We traverse the strength vector $s_{t-1}$ from the
highest index to the lowest. If the next strength scalar is less than the remaining deletion quantity, it
is subtracted from the remaining quantity and its value is set to 0. If the remaining deletion quantity
is less than the next strength scalar, the remaining deletion quantity is subtracted from that scalar and
deletion stops. Next, the push value is set as the strength for the value added in the current timestep.
Equation 3 shows the dynamics of the read operation, which are similar to the pop operation. A
fixed initial read quantity of 1 is set at the top of a temporary copy of the strength vector st which
is traversed from the highest index to the lowest. If the next strength scalar is smaller than the
remaining read quantity, its value is preserved for this operation and subtracted from the remaining
read quantity. If not, it is temporarily set to the remaining read quantity, and the strength scalars of
all lower indices are temporarily set to 0. The output rt of the read operation is the weighted sum
of the rows of Vt , scaled by the temporary scalar values created during the traversal. An example
of the stack read calculations across three timesteps, after pushes and pops as described above, is
illustrated in Figure 1a. The third step shows how setting the strength s3 [2] to 0 for V3 [2] logically
removes v2 from the stack, and how it is ignored during the read.
This completes the description of the forward dynamics of a neural Stack, cast as a recurrent layer,
as illustrated in Figure 1b. All operations described in this section are differentiable (see footnote 1). The equations
describing the backwards dynamics are provided in Appendix A of the supplementary materials.
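
The forward dynamics in Equations 1–3 can be sketched in NumPy as follows. This is a minimal, non-vectorised sketch rather than the authors' implementation; the function name `stack_step` and the use of one-hot vectors for v1, v2, v3 are assumptions made for illustration. The usage lines replay the pushes and pops listed in Figure 1a.

```python
import numpy as np


def stack_step(V_prev, s_prev, v_t, u_t, d_t):
    """One forward step of a neural Stack (Equations 1-3).

    V_prev: (t-1, m) value matrix, s_prev: (t-1,) strength vector,
    v_t: (m,) value to push, u_t / d_t: pop and push strengths.
    """
    # Equation 1: append the new value as the last row (top of the stack).
    V_t = np.vstack([V_prev, v_t[None, :]])

    # Equation 2: the pop quantity u_t is consumed from the top downwards,
    # then the new row receives the push strength d_t.
    s_t = np.empty(len(s_prev) + 1)
    for i in range(len(s_prev)):
        strength_above = s_prev[i + 1:].sum()
        s_t[i] = max(0.0, s_prev[i] - max(0.0, u_t - strength_above))
    s_t[-1] = d_t

    # Equation 3: read a total strength of at most 1, starting from the top.
    r_t = np.zeros(V_t.shape[1])
    for i in range(len(s_t)):
        strength_above = s_t[i + 1:].sum()
        r_t += min(s_t[i], max(0.0, 1.0 - strength_above)) * V_t[i]
    return V_t, s_t, r_t


# Replay Figure 1a with one-hot values (an illustrative assumption).
v1, v2, v3 = np.eye(3)
V, s = np.zeros((0, 3)), np.zeros(0)          # empty stack
V, s, r1 = stack_step(V, s, v1, u_t=0.0, d_t=0.8)
V, s, r2 = stack_step(V, s, v2, u_t=0.1, d_t=0.5)
V, s, r3 = stack_step(V, s, v3, u_t=0.9, d_t=0.9)
```

With one-hot values, each read vector directly lists the weight given to v1, v2, and v3, so `r3` can be compared against the third column of Figure 1a.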

Figure 1: Illustrating a Neural Stack’s Operations, Recurrent Structure, and Control

(a) Example Operation of a Continuous Neural Stack (the stack grows upwards; each row lists a value and its strength)

          t = 1               t = 2                       t = 3
          u1 = 0, d1 = 0.8    u2 = 0.1, d2 = 0.5          u3 = 0.9, d3 = 0.9
row 3                                                     v3  0.9
row 2                         v2  0.5                     v2  0 (v2 removed from stack)
row 1     v1  0.8             v1  0.7                     v1  0.3
read      r1 = 0.8 · v1       r2 = 0.5 · v2 + 0.5 · v1    r3 = 0.9 · v3 + 0 · v2 + 0.1 · v1

(b) Neural Stack as a Recurrent Layer [diagram: the Neural Stack takes the previous state (Vt-1, st-1), push (dt), pop (ut), and value (vt) as input, and produces the next state (Vt, st) and the output (rt)]

(c) RNN Controlling a Stack [diagram: the RNN takes the input (it, rt-1) and the previous state ht-1; its output (ot, …) is split into ot, dt, ut, and vt, the last three of which are passed to the Neural Stack]

Footnote 1: The max(x, y) and min(x, y) functions are technically not differentiable for x = y. Following the work
on rectified linear units [15], we arbitrarily take the partial differentiation of the left argument in these cases.

3.2 Neural Queue

A neural Queue operates the same way as a neural Stack, with the exception that the pop operation
reads the lowest index of the strength vector $s_t$, rather than the highest. This represents popping and
reading from the front of the Queue rather than the top of the stack. These operations are described
in Equations 4–5.

$$s_t[i] = \begin{cases} \max\left(0,\; s_{t-1}[i] - \max\left(0,\; u_t - \sum_{j=1}^{i-1} s_{t-1}[j]\right)\right) & \text{if } 1 \le i < t \\ d_t & \text{if } i = t \end{cases} \tag{4}$$

$$r_t = \sum_{i=1}^{t} \min\left(s_t[i],\; \max\left(0,\; 1 - \sum_{j=1}^{i-1} s_t[j]\right)\right) \cdot V_t[i] \tag{5}$$
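
As a rough illustration of Equations 4–5, the sketch below mirrors the Stack update but accumulates strength from the lowest index instead of the highest. The name `queue_step`, the one-hot values, and the use of the limiting strengths 0 and 1 (the signals themselves lie in the open interval (0, 1)) are assumptions chosen to keep the example easy to check by hand.

```python
import numpy as np


def queue_step(V_prev, s_prev, v_t, u_t, d_t):
    """One forward step of a neural Queue (Equation 1 and Equations 4-5)."""
    # Values are appended exactly as for the Stack (Equation 1).
    V_t = np.vstack([V_prev, v_t[None, :]])

    # Equation 4: the pop quantity is consumed from the front (lowest index).
    s_t = np.empty(len(s_prev) + 1)
    for i in range(len(s_prev)):
        strength_before = s_prev[:i].sum()
        s_t[i] = max(0.0, s_prev[i] - max(0.0, u_t - strength_before))
    s_t[-1] = d_t

    # Equation 5: read a total strength of at most 1, starting from the front.
    r_t = np.zeros(V_t.shape[1])
    for i in range(len(s_t)):
        strength_before = s_t[:i].sum()
        r_t += min(s_t[i], max(0.0, 1.0 - strength_before)) * V_t[i]
    return V_t, s_t, r_t


# Enqueue v1, enqueue v2, then dequeue once while pushing v3 with strength 0.
v1, v2, v3 = np.eye(3)
V, s = np.zeros((0, 3)), np.zeros(0)
V, s, r1 = queue_step(V, s, v1, u_t=0.0, d_t=1.0)   # front of the queue is v1
V, s, r2 = queue_step(V, s, v2, u_t=0.0, d_t=1.0)   # front is still v1
V, s, r3 = queue_step(V, s, v3, u_t=1.0, d_t=0.0)   # v1 dequeued, front is v2
```

Under these settings the reads follow first-in-first-out order, whereas the same inputs passed to a Stack would expose v2 after the second step.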

3.3 Neural DeQue

A neural DeQue operates like a neural Stack, except it takes a push, pop, and value as input for
both “ends” of the structure (which we call top and bot), and outputs a read for both ends. We write
$u_t^{\text{top}}$ and $u_t^{\text{bot}}$ instead of $u_t$, $v_t^{\text{top}}$ and $v_t^{\text{bot}}$ instead of $v_t$, and so on. The state, $V_t$ and $s_t$, are now
a $2t \times m$-dimensional matrix and a $2t$-dimensional vector, respectively. At each timestep, a pop
from the top is followed by a pop from the bottom of the DeQue, followed by the pushes and reads.
The dynamics of a DeQue, which unlike a neural Stack or Queue “grows” in two directions, are
described in Equations 6–11, below. Equations 7–9 decompose the strength vector update into three
steps purely for notational clarity.

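$$V_t[i] = \begin{cases} v_t^{\text{bot}} & \text{if } i = 1 \\ v_t^{\text{top}} & \text{if } i = 2t \\ V_{t-1}[i-1] & \text{if } 1 < i < 2t \end{cases} \tag{6}$$

$$s_t^{\text{top}}[i] = \max\left(0,\; s_{t-1}[i] - \max\left(0,\; u_t^{\text{top}} - \sum_{j=i+1}^{2(t-1)-1} s_{t-1}[j]\right)\right) \quad \text{if } 1 \le i < 2(t-1) \tag{7}$$

$$s_t^{\text{both}}[i] = \max\left(0,\; s_t^{\text{top}}[i] - \max\left(0,\; u_t^{\text{bot}} - \sum_{j=1}^{i-1} s_t^{\text{top}}[j]\right)\right) \quad \text{if } 1 \le i < 2(t-1) \tag{8}$$

$$s_t[i] = \begin{cases} s_t^{\text{both}}[i-1] & \text{if } 1 < i < 2t \\ d_t^{\text{bot}} & \text{if } i = 1 \\ d_t^{\text{top}} & \text{if } i = 2t \end{cases} \tag{9}$$

$$r_t^{\text{top}} = \sum_{i=1}^{2t} \min\left(s_t[i],\; \max\left(0,\; 1 - \sum_{j=i+1}^{2t} s_t[j]\right)\right) \cdot V_t[i] \tag{10}$$

$$r_t^{\text{bot}} = \sum_{i=1}^{2t} \min\left(s_t[i],\; \max\left(0,\; 1 - \sum_{j=1}^{i-1} s_t[j]\right)\right) \cdot V_t[i] \tag{11}$$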

To summarise, a neural DeQue acts like two neural Stacks operated on in tandem, except that the
pushes and pops from one end may eventually affect pops and reads on the other, and vice versa.
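
The following NumPy sketch illustrates one DeQue step in the order given in the prose: pop from the top, pop from the bottom, then push at both ends and read from both ends. It is an assumption-laden illustration, not the authors' code: the name `deque_step` and the one-hot values are hypothetical, and the pops are applied to every existing row, which is how we read the prose description and may differ in boundary cases from the index ranges printed in Equations 7–8.

```python
import numpy as np


def deque_step(V_prev, s_prev, v_top, v_bot, u_top, u_bot, d_top, d_bot):
    """One forward step of a neural DeQue (after Equations 6-11)."""
    n = len(s_prev)
    # Pop from the top: consume u_top from the highest index downwards.
    s_top = np.array([max(0.0, s_prev[i] - max(0.0, u_top - s_prev[i + 1:].sum()))
                      for i in range(n)])
    # Pop from the bottom: consume u_bot from the lowest index upwards.
    s_both = np.array([max(0.0, s_top[i] - max(0.0, u_bot - s_top[:i].sum()))
                       for i in range(n)])
    # Push at both ends: the state grows by one row at the bottom and one at the top.
    s_t = np.concatenate([[d_bot], s_both, [d_top]])
    V_t = np.vstack([v_bot[None, :], V_prev, v_top[None, :]])
    # Reads: at most a total strength of 1 from each end.
    rows = range(len(s_t))
    r_top = sum(min(s_t[i], max(0.0, 1.0 - s_t[i + 1:].sum())) * V_t[i] for i in rows)
    r_bot = sum(min(s_t[i], max(0.0, 1.0 - s_t[:i].sum())) * V_t[i] for i in rows)
    return V_t, s_t, r_top, r_bot


a, b = np.eye(2)          # illustrative one-hot values
nothing = np.zeros(2)     # placeholder value pushed with strength 0
V, s = np.zeros((0, 2)), np.zeros(0)
# Push a, then b, onto the top; nothing is pushed at the bottom.
V, s, _, _ = deque_step(V, s, a, nothing, u_top=0.0, u_bot=0.0, d_top=1.0, d_bot=0.0)
V, s, r_top, r_bot = deque_step(V, s, b, nothing, u_top=0.0, u_bot=0.0, d_top=1.0, d_bot=0.0)
# Pop once from the bottom: a is removed, so both ends now expose b.
V, s, r_top2, r_bot2 = deque_step(V, s, nothing, nothing,
                                  u_top=0.0, u_bot=1.0, d_top=0.0, d_bot=0.0)
```

After the second step the top read returns b (stack-like behaviour) while the bottom read returns a (queue-like behaviour), which is the sense in which one structure can serve as either.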

3.4 Interaction with a Controller

While the three memory modules described can be seen as recurrent layers, with the operations being
used to produce the next state and output from the input and previous state being fully differentiable,
they contain no tunable parameters to optimise during training. As such, they need to be attached
to a controller in order to be used for any practical purposes. In exchange, they offer an extensible
memory, the logical size of which is unbounded and decoupled from both the nature and parameters
of the controller, and from the size of the problem they are applied to. Here, we describe how any
RNN controller may be enhanced by a neural Stack, Queue or DeQue.
We begin by giving the case where the memory is a neural Stack, as illustrated in Figure 1c. Here
we wish to replicate the overall ‘interface’ of a recurrent layer (as seen from outside the dotted
lines), which takes the previous recurrent state $H_{t-1}$ and an input vector $i_t$, and transforms them
to return the next recurrent state $H_t$ and an output vector $o_t$. In our setup, the previous state $H_{t-1}$
of the recurrent layer will be the tuple $(h_{t-1}, r_{t-1}, (V_{t-1}, s_{t-1}))$, where $h_{t-1}$ is the previous state
of the RNN, $r_{t-1}$ is the previous stack read, and $(V_{t-1}, s_{t-1})$ is the previous state of the stack
as described above. With the exception of $h_0$, which is initialised randomly and optimised during
training, all other initial states, $r_0$ and $(V_0, s_0)$, are set to 0-valued vectors/matrices and not updated
during training.
The overall input $i_t$ is concatenated with the previous read $r_{t-1}$ and passed to the RNN controller as
input along with the previous controller state $h_{t-1}$. The controller outputs its next state $h_t$ and a
controller output $o'_t$, from which we obtain the push and pop scalars $d_t$ and $u_t$ and the value vector
$v_t$, which are passed to the stack, as well as the network output $o_t$:

$$d_t = \mathrm{sigmoid}(W_d o'_t + b_d) \qquad u_t = \mathrm{sigmoid}(W_u o'_t + b_u)$$
$$v_t = \tanh(W_v o'_t + b_v) \qquad o_t = \tanh(W_o o'_t + b_o)$$

where $W_d$ and $W_u$ are vector-to-scalar projection matrices, and $b_d$ and $b_u$ are their scalar biases;
$W_v$ and $W_o$ are vector-to-vector projections, and $b_v$ and $b_o$ are their vector biases, all randomly
initialised and then tuned during training. Along with the previous stack state $(V_{t-1}, s_{t-1})$, the stack
operations $d_t$ and $u_t$ and the value $v_t$ are passed to the neural stack to obtain the next read $r_t$ and
next stack state $(V_t, s_t)$, which are packed into a tuple with the controller state $h_t$ to form the next
state $H_t$ of the overall recurrent layer. The output vector $o_t$ serves as the overall output of the
recurrent layer. The structure described here can be adapted to control a neural Queue instead of a
stack by substituting one memory module for the other.
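
To make the data flow concrete, here is a hedged NumPy sketch of one step of a Stack-enhanced recurrent layer. Several choices are assumptions made for brevity and are not taken from the paper: the controller is a plain tanh RNN (the experiments use LSTMs), the controller output $o'_t$ is taken to be its hidden state, and the parameter names, dimensions, and initialisation scale are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stack_step(V_prev, s_prev, v_t, u_t, d_t):
    """Neural Stack forward step following Equations 1-3."""
    popped = [max(0.0, s_prev[i] - max(0.0, u_t - s_prev[i + 1:].sum()))
              for i in range(len(s_prev))]
    s_t = np.array(popped + [d_t])
    V_t = np.vstack([V_prev, v_t[None, :]])
    weights = np.array([min(s_t[i], max(0.0, 1.0 - s_t[i + 1:].sum()))
                        for i in range(len(s_t))])
    return V_t, s_t, weights @ V_t


def init_params(rng, input_dim, hidden_dim, m, output_dim):
    """Randomly initialised parameters (hypothetical names and scale)."""
    def rand(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "W_x": rand(hidden_dim, input_dim + m),   # input concatenated with previous read
        "W_h": rand(hidden_dim, hidden_dim), "b_h": np.zeros(hidden_dim),
        "W_d": rand(hidden_dim), "b_d": 0.0,      # vector-to-scalar push projection
        "W_u": rand(hidden_dim), "b_u": 0.0,      # vector-to-scalar pop projection
        "W_v": rand(m, hidden_dim), "b_v": np.zeros(m),
        "W_o": rand(output_dim, hidden_dim), "b_o": np.zeros(output_dim),
    }


def layer_step(params, H_prev, i_t):
    """Map (H_{t-1}, i_t) to (H_t, o_t) for a Stack-enhanced controller."""
    h_prev, r_prev, (V_prev, s_prev) = H_prev
    x = np.concatenate([i_t, r_prev])
    h_t = np.tanh(params["W_x"] @ x + params["W_h"] @ h_prev + params["b_h"])
    o_prime = h_t  # assumption: a vanilla RNN exposes its hidden state as output
    d_t = sigmoid(params["W_d"] @ o_prime + params["b_d"])
    u_t = sigmoid(params["W_u"] @ o_prime + params["b_u"])
    v_t = np.tanh(params["W_v"] @ o_prime + params["b_v"])
    o_t = np.tanh(params["W_o"] @ o_prime + params["b_o"])
    V_t, s_t, r_t = stack_step(V_prev, s_prev, v_t, u_t, d_t)
    return (h_t, r_t, (V_t, s_t)), o_t


rng = np.random.default_rng(0)
input_dim, hidden_dim, m, output_dim = 4, 8, 3, 5
params = init_params(rng, input_dim, hidden_dim, m, output_dim)
# h_0 is random; r_0 and (V_0, s_0) are zero-valued (here: an empty stack).
H = (rng.normal(size=hidden_dim), np.zeros(m), (np.zeros((0, m)), np.zeros(0)))
for _ in range(5):
    H, o_t = layer_step(params, H, rng.normal(size=input_dim))
h_t, r_t, (V_t, s_t) = H
```

Note that the stack itself contributes no entries to `params`: only the projections into and out of the controller are trainable, while the value matrix simply gains one row per timestep.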
The only additional trainable parameters in either configuration, relative to a non-enhanced RNN,
are the projections for the input concatenated with the previous read into the RNN controller, and the
projections from the controller output into the various Stack/Queue inputs, described above. In the
case of a DeQue, both the top read $r^{\text{top}}$ and bottom read $r^{\text{bot}}$ must be preserved in the overall state.
They are both concatenated with the input to form the input to the RNN controller. The output of the
controller must have additional projections to output push/pop operations and values for the bottom
of the DeQue. This roughly doubles the number of additional tunable parameters “wrapping” the
RNN controller, compared to the Stack/Queue case.

4 Experiments

In every experiment, integer-encoded source and target sequence pairs are presented to the candidate
model as a batch of single joint sequences. The joint sequence starts with a start-of-sequence (SOS)
symbol, and ends with an end-of-sequence (EOS) symbol, with a separator symbol separating the
source and target sequences. Integer-encoded symbols are converted to 64-dimensional embeddings
via an embedding matrix, which is randomly initialised and tuned during training. Separate word-
to-index mappings are used for source and target vocabularies. Separate embedding matrices are
used to encode input and output (predicted) embeddings.

4.1 Synthetic Transduction Tasks

The aim of each of the following tasks is to read an input sequence, and generate as target sequence a
transformed version of the source sequence, followed by an EOS symbol. Source sequences are ran-
domly generated from a vocabulary of 128 meaningless symbols. The length of each training source
sequence is uniformly sampled from unif {8, 64}, and each symbol in the sequence is drawn with
replacement from a uniform distribution over the source vocabulary (ignoring SOS, and separator).
A deterministic task-specific transformation, described for each task below, is applied to the source
sequence to yield the target sequence. As the training sequences are entirely determined by the
source sequence, there are close to $10^{135}$ training sequences for each task, and training examples
are sampled from this space due to the random generation of source sequences. The following steps
are followed before each training and test sequence is presented to the models: the SOS symbol
($\langle s \rangle$) is prepended to the source sequence, which is concatenated with a separator symbol (|||) and
the target sequence, to which the EOS symbol ($\langle /s \rangle$) is appended.
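
The figure of close to $10^{135}$ can be checked with a few lines of arithmetic: with 128 symbols and source lengths from 8 to 64, the number of distinct source sequences is the sum of $128^k$ over those lengths, which is dominated by the $128^{64} = 2^{448}$ term. The variable names below are illustrative only.

```python
import math

VOCAB_SIZE = 128
MIN_LEN, MAX_LEN = 8, 64

# Count every distinct source sequence of each permitted length.
num_sources = sum(VOCAB_SIZE ** k for k in range(MIN_LEN, MAX_LEN + 1))

# Python integers are arbitrary precision, so the logarithm can be taken directly.
log10_count = math.log10(num_sources)
print(f"about 10^{log10_count:.1f} distinct source sequences")
```

The exponent comes out just under 135, in line with the count quoted above.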

Sequence Copying The source sequence is copied to form the target sequence. Sequences have
the form:
$$\langle s \rangle\; a_1 \ldots a_k \;|||\; a_1 \ldots a_k \;\langle /s \rangle$$

Sequence Reversal The source sequence is deterministically reversed to produce the target se-
quence. Sequences have the form:
$$\langle s \rangle\; a_1 a_2 \ldots a_k \;|||\; a_k \ldots a_2 a_1 \;\langle /s \rangle$$

Bigram flipping The source side is restricted to even-length sequences. The target is produced
by swapping, for all odd source sequence indices $i \in [1, |seq|] \wedge \mathrm{odd}(i)$, the $i$th symbol with the
$(i + 1)$th symbol. Sequences have the form:
$$\langle s \rangle\; a_1 a_2 a_3 a_4 \ldots a_{k-1} a_k \;|||\; a_2 a_1 a_4 a_3 \ldots a_k a_{k-1} \;\langle /s \rangle$$
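
The three transformations and the joint-sequence layout are simple enough to sketch with the standard library. The function names, the string markers, and the way even lengths are drawn for bigram flipping are assumptions for illustration; in the experiments all symbols, including the markers, are integer-encoded.

```python
import random

SOS, SEP, EOS = "<s>", "|||", "</s>"  # illustrative marker tokens


def copy_transform(src):
    return list(src)


def reverse_transform(src):
    return list(reversed(src))


def bigram_flip_transform(src):
    """Swap each odd-indexed symbol (1-based) with its successor."""
    if len(src) % 2 != 0:
        raise ValueError("bigram flipping needs an even-length source")
    out = []
    for i in range(0, len(src), 2):
        out += [src[i + 1], src[i]]
    return out


def sample_joint_sequence(transform, rng, min_len=8, max_len=64,
                          vocab_size=128, even_only=False):
    """Draw a source uniformly at random and build <s> source ||| target </s>."""
    lengths = [n for n in range(min_len, max_len + 1)
               if not even_only or n % 2 == 0]
    length = rng.choice(lengths)
    src = [rng.randrange(vocab_size) for _ in range(length)]
    return [SOS] + src + [SEP] + transform(src) + [EOS]


rng = random.Random(0)
example = sample_joint_sequence(bigram_flip_transform, rng, even_only=True)
sep_index = example.index(SEP)
example_src = example[1:sep_index]
example_tgt = example[sep_index + 1:-1]
```

Passing `min_len=65, max_len=128` to the same function would yield the longer sequences used for testing.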

4.2 ITG Transduction Tasks

The following tasks examine how well models can approach sequence transduction problems where
the source and target sequence are jointly generated by Inversion Transduction Grammars (ITG) [8],
a subclass of Synchronous Context-Free Grammars [16] often used in machine translation [17]. We
present two simple ITG-based datasets with interesting linguistic properties and their underlying
grammars. We show these grammars in Table 1, in Appendix C of the supplementary materials. For
each synchronised non-terminal, an expansion is chosen according to the probability distribution
specified by the rule probability p at the beginning of each rule. For each grammar, ‘A’ is always the
root of the ITG tree.
We tuned the generative probabilities for recursive rules by hand so that the grammars generate left
and right sequences of lengths 8 to 128 with relatively uniform distribution. We generate training
data by rejecting samples that are outside of the range [8, 64], and testing data by rejecting samples
outside of the range [65, 128]. For terminal symbol-generating rules, we balance the classes so
that for k terminal-generating symbols in the grammar, each terminal-generating non-terminal ‘X’
generates a vocabulary of approximately 128/k, and each vocabulary word under that class is
equiprobable. These design choices were made to maximise the similarity between the experimental
settings of the ITG tasks described here and the synthetic tasks described above.

Subj–Verb–Obj to Subj–Obj–Verb A persistent challenge in machine translation is to learn to
faithfully reproduce high-level syntactic divergences between languages. For instance, when trans-
lating an English sentence with a non-finite verb into German, a transducer must locate and move
the verb over the object to the final position. We simulate this phenomenon with a synchronous
grammar which generates strings exhibiting verb movements. To add an extra challenge, we also
simulate simple relative clause embeddings to test the models’ ability to transduce in the presence
of unbounded recursive structures.
A sample output of the grammar is presented here, with spaces between words being included for
stylistic purposes, and where s, o, and v indicate subject, object, and verb terminals respectively, i
and o mark input and output, and rp indicates a relative pronoun:
si1 vi28 oi5 oi7 si15 rpi si19 vi16 oi10 oi24 ||| so1 oo5 oo7 so15 rpo so19 vo16 oo10 oo24 vo28

Genderless to gendered grammar We design a small grammar to simulate translations from a
language with gender-free articles to one with gender-specific definite and indefinite articles. A
real world example of such a translation would be from English (the, a) to German (der/die/das,
ein/eine/ein).
The grammar simulates sentences in (NP/(V/NP)) or (NP/V) form, where every noun phrase
can become an infinite sequence of nouns joined by a conjunction. Each noun in the source language
has a neutral definite or indefinite article. The matching word in the target language then needs to be
preceded by its appropriate article. A sample output of the grammar is presented here, with spaces
between words being included for stylistic purposes:
we11 the en19 and the em17 ||| wg11 das gn19 und der gm17

4.3 Evaluation

For each task, test data is generated through the same procedure as training data, with the key dif-
ference that the length of the source sequence is sampled from unif {65, 128}. As a result of this
change, we not only are assured that the models cannot observe any test sequences during training,
but are also measuring how well the sequence transduction capabilities of the evaluated models gen-
eralise beyond the sequence lengths observed during training. To control for generalisation ability,
we also report accuracy scores on sequences separately sampled from the training set, which given
the size of the sample space are unlikely to have ever been observed during actual model training.
For each round of testing, we sample 1000 sequences from the appropriate test set. For each se-
quence, the model reads in the source sequence and separator symbol, and begins generating the
next symbol by taking the maximally likely symbol from the softmax distribution over target sym-
bols produced by the model at each step. Based on this process, we give each model a coarse
accuracy score, corresponding to the proportion of test sequences correctly predicted from begin-
ning until end (EOS symbol) without error, as well as a fine accuracy score, corresponding to the
average proportion of each sequence correctly generated before the first error. Formally, we have:
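$$\text{coarse} = \frac{\#\text{correct}}{\#\text{seqs}} \qquad \text{fine} = \frac{1}{\#\text{seqs}} \sum_{i=1}^{\#\text{seqs}} \frac{\#\text{correct}_i}{|\text{target}_i|}$$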
where #correct and #seqs are the number of correctly predicted sequences (end-to-end) and the
total number of sequences in the test batch (1000 in this experiment), respectively; $\#\text{correct}_i$ is the
number of correctly predicted symbols before the first error in the $i$th sequence of the test batch, and
$|\text{target}_i|$ is the length of the target segment of that sequence (including the EOS symbol).
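
The two scores can be computed as in the sketch below. The function name `coarse_and_fine` and the toy sequences are assumptions for illustration; targets are taken to include the EOS symbol, as in the definition above.

```python
def coarse_and_fine(predictions, targets):
    """Return (coarse, fine) accuracy for paired predicted and target sequences."""
    num_seqs = len(targets)
    fully_correct = 0
    fine_total = 0.0
    for pred, tgt in zip(predictions, targets):
        # Count correctly predicted symbols before the first error.
        correct_prefix = 0
        for p, t in zip(pred, tgt):
            if p != t:
                break
            correct_prefix += 1
        if correct_prefix == len(tgt):
            fully_correct += 1
        fine_total += correct_prefix / len(tgt)
    return fully_correct / num_seqs, fine_total / num_seqs


toy_targets = [[1, 2, 3, "</s>"], [1, 2, 3, "</s>"]]
toy_predictions = [[1, 2, 3, "</s>"], [1, 9, 3, "</s>"]]  # second errs at position 2
coarse, fine = coarse_and_fine(toy_predictions, toy_targets)
```

In the toy batch the second prediction is correct again after its error, but only the single symbol before the first error counts towards the fine score.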

4.4 Models Compared and Experimental Setup

For each task, we use as benchmarks the Deep LSTMs described in [1], with 1, 2, 4, and 8 layers.
Against these benchmarks, we evaluate neural Stack-, Queue-, and DeQue-enhanced LSTMs. When
running experiments, we trained and tested a version of each model where all LSTMs in each model
have a hidden layer size of 256, and one for a hidden layer size of 512. The Stack/Queue/DeQue
embedding size was arbitrarily set to 256, half the maximum hidden size. The number of parameters
for each model is reported for each architecture in Table 2 of the appendix. Concretely, the neural
Stack-, Queue-, and DeQue-enhanced LSTMs have the same number of trainable parameters as a
two-layer Deep LSTM. These all come from the extra connections to and from the memory module,
which itself has no trainable parameters, regardless of its logical size.
Models are trained with minibatch RMSProp [18], with a batch size of 10. We grid-searched learning
rates across the set $\{5 \times 10^{-3}, 1 \times 10^{-3}, 5 \times 10^{-4}, 1 \times 10^{-4}, 5 \times 10^{-5}\}$. We used gradient clipping
[19], clipping all gradients above 1. Average training perplexity was calculated every 100 batches.
Training and test set accuracies were recorded every 1000 batches.

5 Results and Discussion

Because of the impossibility of overfitting the datasets, we let the models train an unbounded number
of steps, and report results at convergence. We present in Figure 2a the coarse- and fine-grained
accuracies, for each task, of the best model of each architecture described in this paper alongside
the best performing Deep LSTM benchmark. The best models were automatically selected based on
average training perplexity. The LSTM benchmarks performed similarly across the range of random
initialisations, so the effect of this procedure is primarily to try and select the better performing
Stack/Queue/DeQue-enhanced LSTM. In most cases, this procedure does not yield the actual best-
performing model, and in practice a more sophisticated procedure such as ensembling [20] should
produce better results.
For all experiments, the Neural Stack or Queue outperforms the Deep LSTM benchmarks, often by
a significant margin. For most experiments, if a Neural Stack- or Queue-enhanced LSTM learns
to partially or consistently solve the problem, then so does the Neural DeQue. For experiments
where the enhanced LSTMs solve the problem completely (consistent accuracy of 1) in training,
the accuracy persists in longer sequences in the test set, whereas benchmark accuracies drop for
all experiments except the SVO to SOV and Gender Conjugation ITG transduction tasks. Across
all tasks which the enhanced LSTMs solve, the convergence on the top accuracy happens orders of
magnitude earlier for enhanced LSTMs than for benchmark LSTMs, as exemplified in Figure 2b.

(a) Comparing Enhanced LSTMs to Best Benchmarks

                                        Training          Testing
Experiment            Model             Coarse   Fine     Coarse   Fine
Sequence Copying      4-layer LSTM      0.98     0.98     0.01     0.50
                      Stack-LSTM        0.89     0.94     0.00     0.22
                      Queue-LSTM        1.00     1.00     1.00     1.00
                      DeQue-LSTM        1.00     1.00     1.00     1.00
Sequence Reversal     8-layer LSTM      0.95     0.98     0.04     0.13
                      Stack-LSTM        1.00     1.00     1.00     1.00
                      Queue-LSTM        0.44     0.61     0.00     0.01
                      DeQue-LSTM        1.00     1.00     1.00     1.00
Bigram Flipping       2-layer LSTM      0.54     0.93     0.02     0.52
                      Stack-LSTM        0.44     0.90     0.00     0.48
                      Queue-LSTM        0.55     0.94     0.55     0.98
                      DeQue-LSTM        0.55     0.94     0.53     0.98
SVO to SOV            8-layer LSTM      0.98     0.99     0.98     0.99
                      Stack-LSTM        1.00     1.00     1.00     1.00
                      Queue-LSTM        1.00     1.00     1.00     1.00
                      DeQue-LSTM        1.00     1.00     1.00     1.00
Gender Conjugation    8-layer LSTM      0.98     0.99     0.99     0.99
                      Stack-LSTM        0.93     0.97     0.93     0.97
                      Queue-LSTM        1.00     1.00     1.00     1.00
                      DeQue-LSTM        1.00     1.00     1.00     1.00

(b) Comparison of Model Convergence during Training [plot not reproduced]

Figure 2: Results on the transduction tasks and convergence properties

The results for the sequence inversion and copying tasks serve as unit tests for our models, as the
controller mainly needs to learn to push the appropriate number of times and then pop continuously.
Nonetheless, the failure of Deep LSTMs to learn such a regular pattern and generalise is itself
indicative of the limitations of the benchmarks presented here, and of the relative expressive power
of our models. Their ability to generalise perfectly to sequences up to twice as long as those attested
during training is also notable, and also attested in the other experiments. Finally, this pair of
experiments illustrates how while the neural Queue solves copying and the Stack solves reversal, a
simple LSTM controller can learn to operate a DeQue as either structure, and solve both tasks.
The results of the Bigram Flipping task for all models are consistent with the failure to consistently
correctly generate the last two symbols of the sequence. We hypothesise that both Deep LSTMs and
our models economically learn to pairwise flip the sequence tokens, and attempt to do so half the
time when reaching the EOS token. For the two ITG tasks, the success of Deep LSTM benchmarks
relative to their performance in other tasks can be explained by their ability to exploit short local
dependencies dominating the longer dependencies in these particular grammars.
Overall, the rapid convergence, where possible, on a general solution to a transduction problem
in a manner which propagates to longer sequences without loss of accuracy is indicative that an
unbounded memory-enhanced controller can learn to solve these problems procedurally, rather than
memorising the underlying distribution of the data.

6 Conclusions
The experiments performed in this paper demonstrate that single-layer LSTMs enhanced by an un-
bounded differentiable memory capable of acting, in the limit, like a classical Stack, Queue, or
DeQue, are capable of solving sequence-to-sequence transduction tasks for which Deep LSTMs
falter. Even in tasks for which benchmarks obtain high accuracies, the memory-enhanced LSTMs
converge earlier, and to higher accuracies, while requiring considerably fewer parameters than all
but the simplest of Deep LSTMs. We therefore believe these constitute a crucial addition to our neu-
ral network toolbox, and that more complex linguistic transduction tasks such as machine translation
or parsing will be rendered more tractable by their inclusion.

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural
networks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,
editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran
Associates, Inc., 2014.
[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078, 2014.
[3] GZ Sun, C Lee Giles, HH Chen, and YC Lee. The neural network pushdown automaton:
Model, stack and learning simulations. 1998.
[4] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401,
[5] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of
Studies in Computational Intelligence. Springer, 2012.
[6] Markus Dreyer, Jason R. Smith, and Jason Eisner. Latent-variable modeling of string trans-
ductions with finite-state methods. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’08, pages 1080–1089, Stroudsburg, PA, USA, 2008.
Association for Computational Linguistics.
[7] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. Open-
FST: A general and efficient weighted finite-state transducer library. In Implementation and
Application of Automata, volume 4783 of Lecture Notes in Computer Science, pages 11–23.
Springer Berlin Heidelberg, 2007.
[8] Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel cor-
pora. Computational linguistics, 23(3):377–403, 1997.
[9] Alex Graves. Sequence transduction with recurrent neural networks. In Representation Learning
Workshop, ICML. 2012.
[10] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities
and limitations of a recurrent neural network with an external stack memory. In Proceedings
of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.
[11] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Using prior knowledge in a {NNPDA} to
learn context-free languages. Advances in neural information processing systems, 1993.
[12] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. Weakly supervised mem-
ory networks. CoRR, abs/1503.08895, 2015.
[13] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv
preprint arXiv:1505.00521, 2015.
[14] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented re-
current nets. arXiv preprint arXiv:1503.01007, 2015.
[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann ma-
chines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10),
pages 807–814, 2010.
[16] Alfred V Aho and Jeffrey D Ullman. The theory of parsing, translation, and compiling.
Prentice-Hall, Inc., 1972.
[17] Dekai Wu and Hongsing Wong. Machine translation with a stochastic grammatical channel.
In Proceedings of the 17th international conference on Computational linguistics-Volume 2,
pages 1408–1415. Association for Computational Linguistics, 1998.
[18] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4,
[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient
problem. Computing Research Repository (CoRR) abs/1211.5063, 2012.
[20] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: many could be better
than all. Artificial intelligence, 137(1):239–263, 2002.

Inferring Algorithmic Patterns with
Stack-Augmented Recurrent Nets

Armand Joulin, Tomas Mikolov

Facebook AI Research
770 Broadway, New York, USA.

Despite the recent achievements in machine learning, we are still very far from
achieving real artificial intelligence. In this paper, we discuss the limitations of
standard deep learning approaches and show that some of these limitations can be
overcome by learning how to grow the complexity of a model in a structured way.
Specifically, we study the simplest sequence prediction problems that are beyond
the scope of what is learnable with standard recurrent networks, algorithmically
generated sequences which can only be learned by models which have the capacity
to count and to memorize sequences. We show that some basic algorithms can be
learned from sequential data using a recurrent network associated with a trainable memory.

1 Introduction
Machine learning aims to find regularities in data to perform various tasks. Historically there have
been two major sources of breakthroughs: scaling up the existing approaches to larger datasets, and
development of novel approaches [5, 14, 22, 30]. In recent years, a lot of progress has been
made in scaling up learning algorithms, by either using alternative hardware such as GPUs [9] or by
taking advantage of large clusters [28]. While improving computational efficiency of the existing
methods is important to deploy the models in real world applications [4], it is crucial for the research
community to continue exploring novel approaches able to tackle new problems.
Recently, deep neural networks have become very successful at various tasks, leading to a shift in
the computer vision [21] and speech recognition communities [11]. This breakthrough is commonly
attributed to two aspects of deep networks: their similarity to the hierarchical, recurrent structure of
the neocortex and the theoretical justification that certain patterns are more efficiently represented
by functions employing multiple non-linearities instead of a single one [1, 25].
This paper investigates which patterns are difficult to represent and learn with the current
state-of-the-art methods. This would hopefully give us hints about how to design new approaches which will
advance machine learning research further. In the past, this approach has led to crucial breakthrough
results: the well-known XOR problem is an example of a trivial classification problem that cannot
be solved using linear classifiers, but can be solved with a non-linear one. This popularized the use
of non-linear hidden layers [30] and kernel methods [2]. Another well-known example is the parity
problem described by Minsky and Papert [25]: it demonstrates that while a single non-linear hidden
layer is sufficient to represent any function, it is not guaranteed to represent it efficiently, and in
some cases can even require exponentially many more parameters (and thus, also training data) than
what is sufficient for a deeper model. This led to the use of architectures that have several layers of
non-linearities, currently known as deep learning models.

Sequence generator                        Example
{a^n b^n | n > 0}                         aabbaaabbbabaaaaabbbbb
{a^n b^n c^n | n > 0}                     aaabbbcccabcaaaaabbbbbccccc
{a^n b^n c^n d^n | n > 0}                 aabbccddaaabbbcccdddabcd
{a^n b^{2n} | n > 0}                      aabbbbaaabbbbbbabb
{a^n b^m c^{n+m} | n, m > 0}              aabcccaaabbcccccabcc
n ∈ [1, k], X → nXn, X → = (k = 2)        12=212122=221211121=12111

Table 1: Examples generated from the algorithms studied in this paper. In bold, the characters which
can be predicted deterministically. During training, we do not have access to this information and at
test time, we evaluate only on deterministically predictable characters.

Following this line of work, we study basic patterns which are difficult to represent and learn for
standard deep models. In particular, we study learning regularities in sequences of symbols generated
by simple algorithms. Interestingly, we find that these regularities are difficult to learn even
for some advanced deep learning methods, such as recurrent networks. We attempt to increase the
learning capabilities of recurrent nets by allowing them to learn how to control an infinite structured
memory. We explore two basic topologies of the structured memory: a pushdown stack and a list.
Our structured memory is defined by constraining part of the recurrent matrix in a recurrent net [24].
We use multiplicative gating mechanisms as learnable controllers over the memory [8, 19] and show
that this allows our network to operate as if it was performing simple read and write operations, such
as PUSH or POP for a stack.
Among recent work with similar motivation, we are aware of the Neural Turing Machine [17] and
Memory Networks [33]. However, our work can be considered more as a follow up of the research
done in the early nineties, when similar types of memory augmented neural networks were stud-
ied [12, 26, 27, 37].

2 Algorithmic Patterns

We focus on sequences generated by simple, short algorithms. The goal is to learn regularities in
these sequences by building predictive models. We are mostly interested in discrete patterns related
to those that occur in the real world, such as various forms of a long term memory.
More precisely, we suppose that during training we only have access to a stream of data which is
obtained by concatenating sequences generated by a given algorithm. We do not have access to the
boundary of any sequence nor to sequences which are not generated by the algorithm. We denote
the regularities in these sequences of symbols as Algorithmic patterns. In this paper, we focus on
algorithmic patterns which involve some form of counting and memorization. Examples of these
patterns are presented in Table 1. For simplicity, we mostly focus on the unary and binary numeral
systems to represent patterns. This allows us to focus on designing a model which can learn these
algorithms when the input is given in its simplest form.
Some algorithms can be given as context-free grammars; however, we are interested in the more
general case of sequential patterns that have a short description length in some general Turing-complete
computational system. Of particular interest are patterns relevant to developing a better language
understanding. Finally, this study is limited to patterns whose symbols can be predicted in a single
computational step, leaving out algorithms such as sorting or dynamic programming.
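
As an illustration of the training stream described above, the following minimal sketch concatenates sequences drawn from two of the generators of Table 1 without marking their boundaries. The function names, the uniform sampling of n and m, and the seed handling are assumptions made for this example and are not taken from the paper.

```python
import random


def sample_anbn(max_n, rng):
    """Draw one sequence a^n b^n with 1 <= n <= max_n (uniform n is an assumption)."""
    n = rng.randint(1, max_n)
    return "a" * n + "b" * n


def sample_anbmcnm(max_total, rng):
    """Draw one sequence a^n b^m c^(n+m) with n, m >= 1 and n + m <= max_total."""
    n = rng.randint(1, max_total - 1)
    m = rng.randint(1, max_total - n)
    return "a" * n + "b" * m + "c" * (n + m)


def make_stream(sampler, num_sequences, max_n, seed=0):
    """Concatenate generated sequences into one unsegmented stream of symbols."""
    rng = random.Random(seed)
    return "".join(sampler(max_n, rng) for _ in range(num_sequences))


if __name__ == "__main__":
    print(make_stream(sample_anbn, num_sequences=4, max_n=5))
```

A model trained on such a stream sees only the symbols; the sequence boundaries and the value of n have to be inferred.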

3 Related work

Some of the algorithmic patterns we study in this paper are closely related to context free and context
sensitive grammars which were widely studied in the past. Some works used recurrent networks
with hardwired symbolic structures [10, 15, 18]. These networks are continuous implementations of
symbolic systems, and can deal with recursive patterns in computational linguistics. While these
approaches are interesting for understanding the link between symbolic and sub-symbolic systems such
as neural networks, they are often hand-designed for each specific grammar.
Wiles and Elman [34] show that simple recurrent networks are able to learn sequences of the form
a^n b^n and generalize on a limited range of n. While this is a promising result, their model does not
truly learn how to count but instead relies mostly on memorization of the patterns seen in the training
data. Rodriguez et al. [29] further studied the behavior of this network. Grünwald [18] designs a
hardwired second order recurrent network to tackle similar sequences. Christiansen and Chater [7]
extended these results to grammars with larger vocabularies. This work shows that this type of
architecture can learn complex internal representations of the symbols but cannot generalize to
longer sequences generated by the same algorithm. Besides simple recurrent networks, other
structures have been used to deal with recursive patterns, such as pushdown dynamical automata [31]
or sequential cascaded networks [3, 27].
Hochreiter and Schmidhuber [19] introduced the Long Short Term Memory network (LSTM) architecture.
While this model was originally developed to address the vanishing and exploding gradient
problems, LSTM is also able to learn simple context-free and context-sensitive grammars [16, 36].
This is possible because its hidden units can choose through a multiplicative gating mechanism to
be either linear or non-linear. The linear units allow the network to potentially count (one can easily
add and subtract constants) and store a finite amount of information for a long period of time. These
mechanisms are also used in the Gated Recurrent Unit network [8]. In our work we investigate the
use of a similar mechanism in a context where the memory is unbounded and structured. As opposed
to previous work, we do not need to “erase” our memory to store a new unit. More recently, Graves
et al. [17] have extended LSTM with an attention mechanism to build a model which roughly resembles
a Turing machine with limited tape. Their memory controller works with a fixed size memory
and it is not clear if its complexity is necessary for the simple problems they study.
Finally, many works have also used external memory modules with a recurrent network, such as
stacks [12, 13, 20, 26, 37]. Zheng et al. [37] use a discrete external stack which may be hard
to learn on long sequences. Das et al. [12] learn a continuous stack which has some similarities
with ours. The mechanisms used in their work are quite different from ours. Their memory cells are
associated with weights to allow a continuous representation of the stack, in order to train it with
a continuous optimization scheme. On the other hand, our solution is closer to a standard RNN with
special connectivities which simulate a stack with unbounded capacity. We tackle problems which
are closely related to the ones addressed in these works and try to go further by exploring more
challenging problems such as binary addition.

4 Model
4.1 Simple recurrent network

We consider sequential data that comes in the form of discrete tokens, such as characters or words.
The goal is to design a model able to predict the next symbol in a stream of data. Our approach is
based on a standard model called the recurrent neural network (RNN), popularized by Elman [14].
An RNN consists of an input layer, a hidden layer with a recurrent time-delayed connection and an
output layer. The recurrent connection allows the propagation of information through time. Given a
sequence of tokens, the RNN takes as input the one-hot encoding x_t of the current token and predicts
the probability y_t of the next symbol. There is a hidden layer with m units which stores additional
information about the previous tokens seen in the sequence. More precisely, at each time t, the state
of the hidden layer h_t is updated based on its previous state h_{t-1} and the encoding x_t of the current
token, according to the following equation:
    h_t = \sigma(U x_t + R h_{t-1}),    (1)
where \sigma(x) = 1/(1 + \exp(-x)) is the sigmoid activation function applied coordinate-wise, U is the
d × m token embedding matrix and R is the m × m matrix of recurrent weights. Given the state of
these hidden units, the network then outputs the probability vector y_t of the next token, according to
the following equation:
    y_t = f(V h_t),    (2)
where f is the softmax function [6] and V is the m × d output matrix, where d is the number of
different tokens. This architecture is able to learn relatively complex patterns similar in nature to
the ones captured by N-grams. While this has made the RNNs interesting for language modeling
[23], they may not have the capacity to learn how algorithmic patterns are generated. In the next
section, we show how to add an external memory to RNNs which has the theoretical capability to
learn simple algorithmic patterns.
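
The two update equations of this section can be sketched in a few lines of NumPy. This is an illustrative forward step only, not the authors' implementation: the function names are hypothetical, and the matrices are stored here with shapes (m, d) for U and (d, m) for V so that the products U x_t and V h_t are defined for column vectors, which is an assumption about the storage convention.

```python
import numpy as np


def sigmoid(z):
    """Coordinate-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def rnn_step(x_t, h_prev, U, R, V):
    """One forward step of a simple recurrent network.

    x_t:    one-hot encoding of the current token, shape (d,)
    h_prev: previous hidden state, shape (m,)
    U: (m, d), R: (m, m), V: (d, m)  -- assumed storage shapes
    """
    h_t = sigmoid(U @ x_t + R @ h_prev)  # Eq. (1)
    y_t = softmax(V @ h_t)               # Eq. (2)
    return h_t, y_t
```

Iterating `rnn_step` over a sequence and feeding each `h_t` back as `h_prev` yields the next-symbol distributions y_1, y_2, and so on.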

Figure 1: (a) Neural network extended with push-down stack and a controlling mechanism that
learns what action (among PUSH, POP and NO-OP) to perform. (b) The same model extended with
a doubly-linked list with actions INSERT, LEFT, RIGHT and NO-OP.

4.2 Pushdown network

In this section, we describe a simple structured memory inspired by the pushdown automaton, i.e., an
automaton which employs a stack. We train our network to learn how to operate this memory with
standard optimization tools.
A stack is a type of persistent memory which can be only accessed through its topmost element.
Three basic operations can be performed with a stack: POP removes the top element, PUSH adds
a new element on top of the stack and NO-OP does nothing. For simplicity, we first consider a
simplified version where the model can only choose between a PUSH or a POP at each time step.
We suppose that this decision is made by a 2-dimensional variable a_t which depends on the state of
the hidden variable h_t:
    a_t = f(A h_t),    (3)
where A is a 2 × m matrix (m is the size of the hidden layer) and f is a softmax function. We denote
by a_t[PUSH] the probability of the PUSH action, and by a_t[POP] the probability of the POP action.
We suppose that the stack is stored at time t in a vector s_t of size p. Note that p could be increased
on demand and does not have to be fixed, which allows the capacity of the model to grow. The top
element is stored at position 0, with value s_t[0]:
    s_t[0] = a_t[PUSH] \sigma(D h_t) + a_t[POP] s_{t-1}[1],    (4)
where D is a 1 × m matrix. If a_t[POP] is equal to 1, the top element is replaced by the value below
(all values are moved by one position up in the stack structure). If a_t[PUSH] is equal to 1, we move
all values down in the stack and add a value on top of the stack. Similarly, for an element stored at
a depth i > 0 in the stack, we have the following update rule:
    s_t[i] = a_t[PUSH] s_{t-1}[i-1] + a_t[POP] s_{t-1}[i+1].    (5)
We use the stack to carry information to the hidden layer at the next time step. When the stack is
empty, s_t is set to −1. The hidden layer h_t is now updated as:
    h_t = \sigma(U x_t + R h_{t-1} + P s^k_{t-1}),    (6)
where P is an m × k recurrent matrix and s^k_{t-1} are the k top-most elements of the stack at time t − 1.
In our experiments, we set k to 2. We call this model Stack RNN, and show it in Figure 1-a without
the recurrent matrix R for clarity.
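
The continuous stack update can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the stack has a fixed size p (so a PUSH drops the bottom cell, whereas the text allows p to grow), cells below the bottom read as −1, U is stored as an (m, d) array, and all function names are hypothetical.

```python
import numpy as np

EMPTY = -1.0  # value used for empty stack cells, as in the text


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def stack_update(s_prev, a_push, a_pop, new_top):
    """Soft PUSH/POP update of Eqs. (4)-(5); s_prev[0] is the top of the stack."""
    pushed = np.concatenate(([new_top], s_prev[:-1]))  # every value moves one cell down
    popped = np.concatenate((s_prev[1:], [EMPTY]))     # every value moves one cell up
    return a_push * pushed + a_pop * popped


def stack_rnn_step(x_t, h_prev, s_prev, U, R, P, A, D, k=2):
    """One step of a single-stack Stack RNN restricted to PUSH and POP."""
    # Hidden update that also reads the k top-most stack cells.
    h_t = sigmoid(U @ x_t + R @ h_prev + P @ s_prev[:k])
    a_t = softmax(A @ h_t)         # Eq. (3): a_t = [PUSH, POP]
    new_top = sigmoid(D @ h_t)[0]  # candidate value to push, D has shape (1, m)
    s_t = stack_update(s_prev, a_t[0], a_t[1], new_top)
    return h_t, s_t, a_t
```

With a_push = 1 the update reduces to an ordinary push and with a_pop = 1 to an ordinary pop; intermediate values give a convex mixture of the two resulting stacks, which is what makes the controller trainable by gradient descent.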
Stack with a no-operation. Adding the NO-OP action allows the stack to keep the same value on
top by a minor change of the stack update rule. Eq. (4) is replaced by:
    s_t[0] = a_t[PUSH] \sigma(D h_t) + a_t[POP] s_{t-1}[1] + a_t[NO-OP] s_{t-1}[0].
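
The text states the modified rule for the top element only. A natural reading, given here as an assumption rather than as the authors' stated rule, is that the controller becomes 3-dimensional and that deeper elements also stay in place under NO-OP, so that Eq. (5) gains the analogous term:

```latex
a_t = f(A h_t), \qquad A \in \mathbb{R}^{3 \times m},
\qquad
s_t[i] = a_t[\mathrm{PUSH}]\, s_{t-1}[i-1]
       + a_t[\mathrm{POP}]\, s_{t-1}[i+1]
       + a_t[\mathrm{NO\text{-}OP}]\, s_{t-1}[i], \quad i > 0 .
```

Under this reading, a_t[NO-OP] = 1 leaves the whole stack unchanged, and the three action probabilities still sum to 1.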
Extension to multiple stacks. Using a single stack has serious limitations, especially considering
that at each time step, only one action can be performed. We increase the capacity of the model by
using multiple stacks in parallel. The stacks can interact through the hidden layer allowing them to
process more challenging patterns.

Method                        a^n b^n   a^n b^n c^n   a^n b^n c^n d^n   a^n b^{2n}   a^n b^m c^{n+m}
RNN                           25%       23.3%         13.3%             23.3%        33.3%
LSTM                          100%      100%          68.3%             75%          100%
List RNN 40+5                 100%      33.3%         100%              100%         100%
Stack RNN 40+10               100%      100%          100%              100%         43.3%
Stack RNN 40+10 + rounding    100%      100%          100%              100%         100%

Table 2: Comparison with RNN and LSTM on sequences generated by counting algorithms. The
sequences seen during training are such that n < 20 (and n + m < 20), and we test on sequences
up to n = 60. We report the percent of n for which the model was able to correctly predict the
sequences. Performance above 33.3% means it is able to generalize to never seen sequence lengths.

Doubly-linked lists. While in this paper we mostly focus on an infinite memory based on stacks, it
is straightforward to extend the model to other forms of infinite memory, for example, the
doubly-linked list. A list is a one dimensional memory where each node is connected to its left and right
neighbors. There is a read/write head associated with the list. The head can move between nearby
nodes and insert a new node at its current position. More precisely, we consider three different
actions: INSERT, which inserts an element at the current position of the head, LEFT, which moves
the head to the left, and RIGHT, which moves it to the right. Given a list L and a fixed head position
HEAD, the updates are:
    L_t[i] = \begin{cases}
      a_t[RIGHT] L_{t-1}[i+1] + a_t[LEFT] L_{t-1}[i-1] + a_t[INSERT] \sigma(D h_t)  & \text{if } i = HEAD, \\
      a_t[RIGHT] L_{t-1}[i+1] + a_t[LEFT] L_{t-1}[i-1] + a_t[INSERT] L_{t-1}[i+1]   & \text{if } i < HEAD, \\
      a_t[RIGHT] L_{t-1}[i+1] + a_t[LEFT] L_{t-1}[i-1] + a_t[INSERT] L_{t-1}[i]     & \text{if } i > HEAD.
    \end{cases}
Note that we can add a NO-OP operation as well. We call this model List RNN, and show it in
Figure 1-b without the recurrent matrix R for clarity.
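
The list update can be sketched in the same style as the stack. The following is an illustration of the three cases of the update rule on a fixed-size array, under assumptions that are not in the text: cells beyond either end of the array read as −1, the array has a fixed length, and the function name and argument order are hypothetical.

```python
import numpy as np

EMPTY = -1.0  # assumed padding value beyond both ends of the stored list


def list_update(L_prev, a_right, a_left, a_insert, new_value, head):
    """Soft RIGHT/LEFT/INSERT update of a list stored around a fixed head index."""
    right = np.concatenate((L_prev[1:], [EMPTY]))  # L_{t-1}[i + 1] for every i
    left = np.concatenate(([EMPTY], L_prev[:-1]))  # L_{t-1}[i - 1] for every i
    inserted = L_prev.copy()                       # i > HEAD: L_{t-1}[i]
    inserted[:head] = L_prev[1:head + 1]           # i < HEAD: L_{t-1}[i + 1]
    inserted[head] = new_value                     # i = HEAD: the new value
    return a_right * right + a_left * left + a_insert * inserted
```

For example, with `L_prev = np.array([1., 2., 3., 4., 5.])` and `head = 2`, a hard INSERT of 9 places the new value at the head and moves the cells at and left of the head one position further left.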
Optimization. The models presented above are continuous and can thus be trained with stochastic
gradient descent (SGD) method and back-propagation through time [30, 32, 35]. As patterns become
more complex, more complex memory controllers must be learned. In practice, we observe
that these more complex controllers are harder to learn with SGD. Using several random restarts
seems to solve the problem in our case. We have also explored other types of search-based procedures,
as discussed in the supplementary material.
Rounding. Continuous operators on stacks introduce small imprecisions leading to numerical issues
on very long sequences. While simply discretizing the controllers partially solves this problem,
we design a more robust rounding procedure tailored to our model. We slowly make the controllers
converge to discrete values by multiplying their weights by a constant which slowly goes to infinity. We
fine-tune the weights of our network as this multiplicative variable increases, leading to a smoother
rounding of our network. Finally, we remove unused stacks by exploring models which use only a
subset of the stacks. While brute-force search would be exponential in the number of stacks, we can do it
efficiently by building a tree of removable stacks and exploring it with depth-first search.
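
The effect of the multiplicative constant on a controller can be illustrated numerically. Multiplying the controller weights A by a constant is equivalent to multiplying the logits A h_t by the same constant before the softmax. The sketch below shows only this sharpening effect; the fine-tuning between increases and the stack-pruning search are not reproduced, and the name `beta` and the example logits are assumptions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def sharpened_actions(logits, beta):
    """Action probabilities after scaling the controller logits by beta."""
    return softmax(beta * np.asarray(logits, dtype=float))


if __name__ == "__main__":
    logits = [0.3, -0.1]  # hypothetical values of A h_t for [PUSH, POP]
    for beta in (1.0, 10.0, 100.0):
        print(beta, sharpened_actions(logits, beta))
```

As beta grows, the probability of the preferred action approaches 1, so the continuous controller approaches a discrete one without an abrupt change in behavior.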

5 Experiments and results

First, we consider various sequences generated by simple algorithms, where the goal is to learn their
generation rule [3, 12, 29]. We hope to understand the scope of algorithmic patterns each model can
capture. We also evaluate the models on a standard language modeling dataset, Penn Treebank.
Implementation details. Stack and List RNNs are trained with SGD and backpropagation through
time with 50 steps [32], a hard clipping of 15 to prevent gradient explosions [23], and an initial
learning rate of 0.1. The learning rate is divided by 2 each time the entropy on the validation set is
not decreasing. The depth k defined in Eq. (6) is set to 2. The free parameters are the number of
hidden units, stacks and the use of NO-OP. The baselines are RNNs with 40, 100 and 500 units, and
LSTMs with 1 and 2 layers with 50, 100 and 200 units. The hyper-parameters of the baselines are
selected on the validation sets.
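
Two of the training details above can be made concrete with a short sketch. It assumes that "hard clipping of 15" means clipping each gradient coordinate to [−15, 15] and that "not decreasing" is judged against the previous validation measurement; both readings, and the function names, are assumptions of this example.

```python
import numpy as np


def clip_gradient(grad, threshold=15.0):
    """Hard clipping: limit every gradient coordinate to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)


def learning_rate_schedule(val_entropies, initial_lr=0.1):
    """Halve the learning rate whenever validation entropy fails to decrease."""
    lr = initial_lr
    previous = float("inf")
    rates = []
    for entropy in val_entropies:
        if entropy >= previous:
            lr /= 2.0
        rates.append(lr)
        previous = entropy
    return rates
```

For instance, validation entropies of 1.0, 0.8, 0.85 and 0.7 would trigger a single halving, at the third measurement.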

5.1 Learning simple algorithmic patterns

current  next  prediction  proba(next)  action (stack 1)  action (stack 2)  stack1[top]  stack2[top]
b        a     a           0.99         POP               POP               -1           0.53
a        a     a           0.99         PUSH              POP               0.01         0.97
a        a     a           0.95         PUSH              PUSH              0.18         0.99
a        a     a           0.93         PUSH              PUSH              0.32         0.98
a        a     a           0.91         PUSH              PUSH              0.40         0.97
a        a     a           0.90         PUSH              PUSH              0.46         0.97
a        b     a           0.10         PUSH              PUSH              0.52         0.97
b        b     b           0.99         PUSH              PUSH              0.57         0.97
b        b     b           1.00         POP               PUSH              0.52         0.56
b        b     b           1.00         POP               PUSH              0.46         0.01
b        b     b           1.00         POP               PUSH              0.40         0.00
b        b     b           1.00         POP               PUSH              0.32         0.00
b        b     b           1.00         POP               PUSH              0.18         0.00
b        b     b           0.99         POP               PUSH              0.01         0.00
b        b     b           0.99         POP               POP               -1           0.00
b        b     b           0.99         POP               POP               -1           0.00
b        b     b           0.99         POP               POP               -1           0.00
b        b     b           0.99         POP               POP               -1           0.01
b        a     a           0.99         POP               POP               -1           0.56

Table 3: Example of the Stack RNN with 20 hidden units and 2 stacks on a sequence a^n b^{2n} with
n = 6. −1 means that the stack is empty. The depth k is set to 1 for clarity. We see that the first
stack pushes an element every time it sees a and pops when it sees b. The second stack pushes when
it sees a. When it sees b, it pushes if the first stack is not empty and pops otherwise. This shows how
the two stacks interact to correctly predict the deterministic part of the sequence (shown in bold).

Figure 2 (left panel: Memorization; right panel: Binary addition): Comparison of RNN, LSTM, List RNN
and Stack RNN on memorization and the performance of Stack RNN on binary addition. The accuracy is
the proportion of correctly predicted sequences generated with a given n. We use 100 hidden units and
10 stacks.

Given an algorithm with short description length, we generate sequences and concatenate them into
longer sequences. This is an unsupervised task, since the boundaries of each generated sequence
are not known. We study patterns related to counting and memorization as shown in Table 1. To
evaluate if a model has the capacity to understand the generation rule used to produce the sequences,
it is tested on sequences it has not seen during training. Our experimental setting is the following:
the training and validation sets are composed of sequences generated with n up to N < 20 while
the test set is composed of sequences generated with n up to 60. During training, we incrementally
increase the parameter n every few epochs until it reaches some N . At test time, we measure the
performance by counting the number of correctly predicted sequences. A sequence is considered as
correctly predicted if we correctly predict its deterministic part, shown in bold in Table 1. On these
toy examples, the recurrent matrix R defined in Eq. (1) is set to 0 to isolate the mechanisms that
the stack and the list can capture.
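
The evaluation protocol can be sketched for the pattern a^n b^n. The sketch assumes that the deterministic part of a sequence consists of every prediction made after the first b has been read (the remaining b symbols and the first a of the following sequence), which matches the behavior shown in Table 3 for a^n b^{2n}. The function names and the `predict_next` interface, a callable mapping a sequence to its predicted next symbols, are hypothetical.

```python
def deterministic_mask(n):
    """Mark positions of a^n b^n whose next symbol can be predicted with certainty."""
    return [False] * n + [True] * n


def is_correct(predicted_next, true_next, mask):
    """A sequence counts as correct if all deterministic predictions are right."""
    return all(p == t for p, t, m in zip(predicted_next, true_next, mask) if m)


def fraction_of_lengths_solved(predict_next, max_n=60):
    """Fraction of n in [1, max_n] for which the sequence a^n b^n is predicted correctly."""
    solved = 0
    for n in range(1, max_n + 1):
        seq = "a" * n + "b" * n
        true_next = seq[1:] + "a"  # the stream continues with the next sequence
        if is_correct(predict_next(seq), true_next, deterministic_mask(n)):
            solved += 1
    return solved / max_n
```

A predictor that is only right on the lengths seen during training scores roughly one third under this measure when testing up to n = 60, which is why higher values indicate generalization to unseen lengths.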
Counting. Results on patterns generated by “counting” algorithms are shown in Table 2. We report
the percentage of sequence lengths for which a method is able to correctly predict sequences of
that length. List RNN and Stack RNN have 40 hidden units and either 5 lists or 10 stacks. For
these tasks, the NO-OP operation is not used. Table 2 shows that RNNs are unable to generalize to
longer sequences, and they only correctly predict sequences seen during training. LSTM is able to
generalize to longer sequences which shows that it is able to count since the hidden units in an LSTM
can be linear [16]. With a finer hyper-parameter search, the LSTM should be able to achieve 100%
on all of these tasks. Despite the absence of linear units, Stack RNN and List RNN are also able to
generalize. For a^n b^m c^{n+m}, rounding is required to obtain the best performance.
Table 3 shows an example of actions done by a Stack RNN with two stacks on a sequence of the
form a^n b^{2n}. For clarity, we show a sequence generated with n equal to 6, and we use discretization.
Stack RNN pushes an element on both stacks when it sees a. The first stack pops elements when the
input is b and the second stack starts popping only when the first one is empty. Note that the second
stack pushes a special value to keep track of the sequence length, i.e. 0.56.
Memorization. Figure 2 shows results on memorization for a dictionary with two elements. Stack
RNN has 100 units and 10 stacks, and List RNN has 10 lists. We use random restarts and we repeat
this process multiple times. Stack RNN and List RNN are able to learn memorization, while RNN
and LSTM do not seem to generalize. In practice, List RNN is more unstable than Stack RNN and
overfits on the training set more frequently. This instability may be explained by the higher number
of actions the controller can choose from (4 versus 3). For this reason, we focus on Stack RNN in
the rest of the experiments.

Figure 3: An example of a learned Stack RNN that performs binary addition. The last column
is our interpretation of the functionality learned by the different stacks. The color code is: green
means PUSH, red means POP and grey means actions equivalent to NO-OP. We show the current
(discretized) value on the top of each stack at each given time. The sequence is read from left
to right, one character at a time. In bold is the part of the sequence which has to be predicted. Note
that the result is written in reverse.

Binary addition. Given a sequence representing a binary addition, e.g., “101+1=”, the goal is
to predict the result, e.g., “110.” where “.” represents the end of the sequence. As opposed to
the previous tasks, this task is supervised, i.e., the location of the deterministic tokens is provided.
The result of the addition is asked in the reverse order, e.g., “011.” in the previous example. As
previously, we train on short sequences and test on longer ones. The length of the two input numbers
is chosen such that the sum of their lengths is equal to n (less than 20 during training and up to 60
at test time). Their most significant digit is always set to 1. Stack RNN has 100 hidden units with
10 stacks. The right panel of Figure 2 shows the results averaged over multiple runs (with random
restarts). While Stack RNN generalizes to longer numbers, it overfits on the validation set
for some runs, leading to a larger error bar than in the previous experiments.
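
The input and target format of this task can be sketched as follows. The helper names are hypothetical; the sketch only reproduces the format described above, with the sum written in reverse order and terminated by a period.

```python
def make_addition_example(x, y):
    """Return (input, target) strings for the binary addition of integers x, y >= 1."""
    question = f"{x:b}+{y:b}="            # binary operands, most significant digit first
    answer = f"{x + y:b}"[::-1] + "."     # sum written in reverse, then end marker
    return question, answer


def operand_length(x, y):
    """Sum of the lengths of the two binary operands (the n of the text)."""
    return x.bit_length() + y.bit_length()


if __name__ == "__main__":
    print(make_addition_example(5, 1))
```

Writing the result in reverse lets the model emit the least significant digit first, which is the order in which carries are resolved.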
Figure 3 shows an example of a model which generalizes to long sequences of binary addition. This
example illustrates the moderately complex behavior that the Stack RNN learns to solve this task: the
first stack keeps track of where we are in the sequence, i.e., either reading the first number, reading
the second number or writing the result. Stack 6 keeps in memory the first number. Interestingly, the
first number is first captured by the stacks 3 and 5 and then copied to stack 6. The second number is
stored on stack 3, while its length is captured on stack 4 (by pushing a one and then a set of zeros).
When producing the result, the values stored on these three stacks are popped. Finally, stack 5 takes
care of the carry: it switches between two states (0 or 1) which explicitly say if there is a carry over
or not. While this use of stacks is not optimal in the sense of minimal description length, it is able
to generalize to sequences never seen before.

5.2 Language modeling

Model                   Ngram   Ngram + Cache   RNN   LSTM   SRCN [24]   Stack RNN
Validation perplexity   -       -               137   120    120         124
Test perplexity         141     125             129   115    115         118

Table 4: Comparison of RNN, LSTM, SRCN [24] and Stack RNN on Penn Treebank Corpus. We
use the recurrent matrix R in Stack RNN as well as 100 hidden units and 60 stacks.

We compare Stack RNN with RNN, LSTM and SRCN [24] on the standard language modeling
dataset Penn Treebank Corpus. SRCN is a standard RNN with additional self-connected linear
units which capture long term dependencies similar to bag of words. The models have only one
hidden layer with 100 hidden units. Table 4 shows that Stack RNN performs better than RNN with
a comparable number of parameters, but not as well as LSTM and SRCN. Empirically, we observe
that Stack RNN learns to store an exponentially decaying bag of words similar in nature to the memory
of SRCN.
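
For reference, perplexity as reported in Table 4 is the exponential of the average negative log-probability that a model assigns to the test tokens. The sketch below states this standard definition; the function name and the input format (a list of per-token probabilities) are assumptions.

```python
import math


def perplexity(token_probabilities):
    """Perplexity from the probabilities a model assigned to each observed token."""
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))
```

A model that assigns probability 1/4 to every observed token has perplexity 4, so lower values in Table 4 correspond to better predictions.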
6 Discussion and future work
Continuous versus discrete model and search. Certain simple algorithmic patterns can be effi-
ciently learned using a continuous optimization approach (stochastic gradient descent) applied to a
continuous model representation (in our case RNN). Note that Stack RNN works better than prior
work based on RNN from the nineties [12, 34, 37]. It also seems simpler than many other approaches
designed for these tasks [3, 17, 31]. However, it is not clear if a continuous representation
is completely appropriate for learning algorithmic patterns. It may be more natural to attempt to
solve these problems with a discrete model. This motivates us to try to combine continuous and
discrete optimization. It is possible that the future of learning of algorithmic patterns will involve
such a combination of discrete and continuous optimization.
Long-term memory. While in theory using multiple stacks for representing memory is as powerful
as a Turing complete computational system, intricate interactions between stacks need to be learned
to capture more complex algorithmic patterns. Stack RNN also requires the input and output se-
quences to be in the right format (e.g., memorization is in reversed order). It would be interesting
to consider in the future other forms of memory which may be more flexible, as well as additional
mechanisms which allow performing multiple steps with the memory, such as loops or random access.
Finally, complex algorithmic patterns can be more easily learned by composing simpler algorithms.
Designing a model which possesses a mechanism to compose algorithms automatically and training
it on incrementally harder tasks is a very important research direction.

7 Conclusion
We have shown that certain difficult pattern recognition problems can be solved by augmenting a
recurrent network with structured, growing (potentially unlimited) memory. We studied very simple
memory structures such as a stack and a list, but the same approach can be used to learn how to
operate more complex ones (for example a multi-dimensional tape). While currently the topology
of the long term memory is fixed, we think that it should be learned from the data as well.
Acknowledgment. We would like to thank Arthur Szlam, Keith Adams, Jason Weston, Yann LeCun
and the rest of the Facebook AI Research team for their useful comments.

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. Large-scale kernel machines, 2007.
[2] C. M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
[3] M. Bodén and J. Wiles. Context-free and context-sensitive dynamics in recurrent neural networks. Con-
nection Science, 2000.
The code is available at

[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT. Springer, 2010.
[5] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[6] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships
to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[7] M. H. Christiansen and N. Chater. Toward a connectionist model of recursion in human linguistic perfor-
mance. Cognitive Science, 23(2):157–205, 1999.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv, 2015.
[9] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural
networks for visual object classification. arXiv preprint, 2011.
[10] M. W. Crocker. Mechanisms for sentence processing. University of Edinburgh, 1996.
[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for
large-vocabulary speech recognition. Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[12] S. Das, C. Giles, and G. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent
neural network with an external stack memory. In ACCSS, 1992.
[13] S. Das, C. Giles, and G. Sun. Using prior knowledge in a nnpda to learn context-free languages. NIPS,
[14] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[15] M. Fanty. Context-free parsing in connectionist networks. Parallel natural language processing, 1994.
[16] F. A. Gers and J. Schmidhuber. Lstm recurrent networks learn simple context-free and context-sensitive
languages. Transactions on Neural Networks, 12(6):1333–1340, 2001.
[17] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint, 2014.
[18] P. Grünwald. A recurrent network that performs a context-sensitive prediction task. In ACCSS, 1996.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[20] S. Holldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case study of dynamics and
activation landscapes in recurrent networks. In Advances in Artificial Intelligence, 1997.
[21] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural net-
works. In NIPS, 2012.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
[23] T. Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of
Technology, 2012.
[24] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. A. Ranzato. Learning longer memory in recurrent
neural networks. arXiv preprint, 2014.
[25] M. Minsky and S. Papert. Perceptrons. MIT press, 1969.
[26] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free
languages. NIPS, 1993.
[27] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2-3):227–252, 1991.
[28] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient
descent. In NIPS, 2011.
[29] P. Rodriguez, J. Wiles, and J. L. Elman. A recurrent neural network that learns to count. Connection
Science, 1999.
[30] D. E Rumelhart, G. Hinton, and R. J. Williams. Learning internal representations by error propagation.
Technical report, DTIC Document, 1985.
[31] W. Tabor. Fractal encoding of context-free grammars in connectionist networks. Expert Systems, 2000.
[32] P. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural
Networks, 1(4):339–356, 1988.
[33] J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2015.
[34] J. Wiles and J. Elman. Learning to count without a counter: A case study of dynamics and activation
landscapes in recurrent networks. In ACCSS, 1995.
[35] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their com-
putational complexity. Back-propagation: Theory, architectures and applications, pages 433–486, 1995.
[36] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint, 2014.
[37] Z. Zeng, R. M. Goodman, and P. Smyth. Discrete recurrent neural networks for grammatical inference.
Transactions on Neural Networks, 5(2):320–330, 1994.