
Neural Symbolic Machines:

Learning Semantic Parsers on Freebase with Weak Supervision


Chen Liang∗, Jonathan Berant†, Quoc Le, Kenneth D. Forbus, Ni Lao
Northwestern University, Evanston, IL
Tel-Aviv University, Tel Aviv-Yafo, Israel
Google Inc., Mountain View, CA
{chenliang2013,forbus}@u.northwestern.edu, joberant@cs.tau.ac.il, {qvl,nlao}@google.com

Abstract

Harnessing the statistical power of neural networks to perform language understanding and symbolic reasoning is difficult when it requires executing efficient discrete operations against a large knowledge-base. In this work, we introduce a Neural Symbolic Machine (NSM), which contains (a) a neural "programmer", i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality, and (b) a symbolic "computer", i.e., a Lisp interpreter that performs program execution and helps find good programs by pruning the search space. We apply REINFORCE to directly optimize the task reward of this structured prediction problem. To train with weak supervision and improve the stability of REINFORCE, we augment it with an iterative maximum-likelihood training process. NSM outperforms the state-of-the-art on the WEBQUESTIONSSP dataset when trained from question-answer pairs only, without requiring any feature engineering or domain-specific knowledge.

Figure 1: The main challenges of training a semantic parser from weak supervision: (a) compositionality: we use variables (v0, v1, v2) to store execution results of intermediate generated programs. (b) search: we prune the search space and augment REINFORCE with pseudo-gold programs. The figure illustrates the example x: "Largest city in the US" ⇒ y: NYC, parsed into v0 ← (USA); v1 ← (Hop v0 CityIn); v2 ← (Argmax v1 Population), and the large space of alternative tokens (e.g., Size, Elevation) at each step.

1 Introduction

Deep neural networks have achieved impressive performance in supervised classification and structured prediction tasks such as speech recognition (Hinton et al., 2012), machine translation (Bahdanau et al., 2014; Wu et al., 2016) and more. However, training neural networks for semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011) or program induction, where language is mapped to a symbolic representation that is executed by an executor, through weak supervision remains challenging. This is because the model must interact with a symbolic executor through non-differentiable operations to search over a large program space.

In semantic parsing, recent work (Dong and Lapata, 2016; Jia and Liang, 2016) handled this by training from manually annotated programs and avoiding program execution at training time. However, annotating programs is known to be expensive and scales poorly. In program induction, attempts to address this problem (Graves et al., 2014; Reed and de Freitas, 2016; Kaiser and Sutskever, 2015; Graves et al., 2016b; Andreas et al., 2016) either utilized low-level memory (Zaremba and Sutskever, 2015), or required memory to be differentiable (Neelakantan et al., 2015; Yin et al., 2015) so that the model can be trained with backpropagation. This makes it difficult to use the efficient discrete operations and memory of a traditional computer, and limited the application to synthetic or small knowledge bases.

∗ Work done while the author was interning at Google.
† Work done while the author was visiting Google.
In this paper, we propose to utilize the memory and discrete operations of a traditional computer in a novel Manager-Programmer-Computer (MPC) framework for neural program induction, which integrates three components:

1. A "manager" that provides weak supervision (e.g., "NYC" in Figure 1) through a reward indicating how well a task is accomplished. Unlike full supervision, weak supervision is easy to obtain at scale (Section 3.1).
2. A "programmer" that takes natural language as input and generates a program that is a sequence of tokens (Figure 2). The programmer learns from the reward and must overcome the hard search problem of finding correct programs (Section 2.2).
3. A "computer" that executes programs in a high-level programming language. Its non-differentiable memory enables abstract, scalable and precise operations, but makes training more challenging (Section 2.3). To help the "programmer" prune the search space, it provides a friendly neural computer interface, which detects and eliminates invalid choices (Section 2.1).

Within this framework, we introduce the Neural Symbolic Machine (NSM) and apply it to semantic parsing. NSM contains a neural sequence-to-sequence (seq2seq) "programmer" (Sutskever et al., 2014) and a symbolic non-differentiable Lisp interpreter ("computer") that executes programs against a large knowledge-base (KB).

Our technical contribution in this work is threefold. First, to support language compositionality, we augment the standard seq2seq model with a key-variable memory to save and reuse intermediate execution results (Figure 1). This is a novel application of pointer networks (Vinyals et al., 2015) to compositional semantics.

Second, to alleviate the search problem of finding correct programs when training from question-answer pairs, we use the computer to execute partial programs and prune the programmer's search space by checking the syntax and semantics of generated programs. This generalizes the weakly supervised semantic parsing framework (Liang et al., 2011; Berant et al., 2013) by leveraging semantic denotations during structural search.

Third, to train from weak supervision and directly maximize the expected reward, we turn to the REINFORCE (Williams, 1992) algorithm. Since learning from scratch is difficult for REINFORCE, we combine it with an iterative maximum likelihood (ML) training process, where beam search is used to find pseudo-gold programs, which are then used to augment the objective of REINFORCE.

On the WEBQUESTIONSSP dataset (Yih et al., 2016), NSM achieves new state-of-the-art results with weak supervision, significantly closing the gap between weak and full supervision for this task. Unlike prior works, it is trained end-to-end, and does not require feature engineering or domain-specific knowledge.

2 Neural Symbolic Machines

We now introduce NSM by first describing the "computer", a non-differentiable Lisp interpreter that executes programs against a large KB and provides code assistance (Section 2.1). We then propose a seq2seq model ("programmer") that supports compositionality using a key-variable memory to save and reuse intermediate results (Section 2.2). Finally, we describe a training procedure that is based on REINFORCE, but is augmented with pseudo-gold programs found by an iterative ML training procedure (Section 2.3).

Before diving into details, we define the semantic parsing task: given a knowledge base K and a question x = (w1, w2, ..., wm), produce a program or logical form z that, when executed against K, generates the right answer y. Let E denote a set of entities (e.g., ABELINCOLN),¹ and let P denote a set of properties (e.g., PLACEOFBIRTH). A knowledge base K is a set of assertions or triples (e1, p, e2) ∈ E × P × E, such as (ABELINCOLN, PLACEOFBIRTH, HODGENVILLE).

¹ We also consider numbers (e.g., "1.33") and date-times (e.g., "1999-1-1") as entities.

2.1 Computer: Lisp Interpreter with Code Assistance

Semantic parsing typically requires using a set of operations to query the knowledge base and process the results. Operations learned with neural networks, such as addition and sorting, do not perfectly generalize to inputs that are larger than the ones observed in the training data (Graves et al., 2014; Reed and de Freitas, 2016). In contrast, operations implemented in high-level programming languages are abstract, scalable, and precise, and thus generalize perfectly to inputs of arbitrary size. Based on this observation, we implement the operations necessary for semantic parsing with an ordinary programming language instead of trying to learn them with a neural network.
We adopt a Lisp interpreter as the "computer". A program C is a list of expressions (c1 ... cN), where each expression is either a special token "Return" indicating the end of the program, or a list of tokens enclosed by parentheses "(F A1 ... AK)". F is a function, which takes as input K arguments of specific types. Table 1 defines the semantics of each function and the types of its arguments (either a property p or a variable r). When a function is executed, it returns an entity list that is the expression's denotation in K, and saves it to a new variable.

By introducing variables that save the intermediate results of execution, the program naturally models language compositionality and describes from left to right a bottom-up derivation of the full meaning of the natural language input, which is convenient in a seq2seq model (Figure 1). This is reminiscent of the floating parser (Wang et al., 2015; Pasupat and Liang, 2015), where a derivation tree that is not grounded in the input is incrementally constructed.

The set of programs defined by our functions is equivalent to the subset of λ-calculus presented in Yih et al. (2015). We did not use the full Lisp programming language here, because constructs like control flow and loops are unnecessary for most current semantic parsing tasks, and it is simple to add more functions to the model when necessary.

To create a friendly neural computer interface, the interpreter provides code assistance to the programmer by producing a list of valid tokens at each step. First, a valid token should not cause a syntax error: e.g., if the previous token is "(", the next token must be a function name, and if the previous token is "Hop", the next token must be a variable. More importantly, a valid token should not cause a semantic (run-time) error: this is detected using the denotation saved in the variables. For example, if the previously generated tokens were "( Hop r", the next available token is restricted to properties {p | ∃e, e′ : e ∈ r, (e, p, e′) ∈ K} that are reachable from entities in r in the KB. These checks are enabled by the variables and can be derived from the definition of the functions in Table 1. The interpreter prunes the "programmer"'s search space by orders of magnitude, and enables learning from weak supervision on a large KB.
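To make the interpreter and its code assistance concrete, the following is a minimal Python sketch, not the actual implementation, of the functions later defined in Table 1 over a toy knowledge base of triples, together with the semantic check that restricts the token after "( Hop r" to properties reachable from the entities in r. The toy triples, entity ids, and helper names are illustrative only.

# Toy KB: a set of (entity, property, value) triples; values may be entities or numbers.
KB = {
    ("m.USA", "CityIn", "m.NYC"),
    ("m.USA", "CityIn", "m.LA"),
    ("m.NYC", "Population", 8500000),
    ("m.LA",  "Population", 4000000),
}

def values(e, p):
    # All v with (e, p, v) in the KB.
    return {v for (e1, prop, v) in KB if e1 == e and prop == p}

def hop(r, p):
    # ( Hop r p )  =>  {e2 | e1 in r, (e1, p, e2) in K}
    return {v for e in r for v in values(e, p)}

def argmax(r, p):
    # ( ArgMax r p )  =>  entities in r whose p-value is maximal.
    scored = {e: max(values(e, p)) for e in r if values(e, p)}
    return {e for e, v in scored.items() if v == max(scored.values())} if scored else set()

def filter_(r1, r2, p):
    # ( Filter r1 r2 p )  =>  {e1 in r1 | some e2 in r2 with (e1, p, e2) in K}
    return {e1 for e1 in r1 if values(e1, p) & r2}

def valid_properties(r):
    # Semantic check: after "( Hop r", only properties reachable from r are valid tokens.
    return {prop for (e1, prop, v) in KB if e1 in r}

# Executing the program of Figure 1 step by step:
v0 = {"m.USA"}                    # v0 <- (USA)
v1 = hop(v0, "CityIn")            # v1 <- (Hop v0 CityIn)        -> {"m.NYC", "m.LA"}
v2 = argmax(v1, "Population")     # v2 <- (ArgMax v1 Population) -> {"m.NYC"}
print(v2, valid_properties(v1))   # {'m.NYC'} {'Population'}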
2.2 Programmer: Seq2seq Model with Key-Variable Memory

Given the "computer", the "programmer" needs to map natural language into a program, which is a sequence of tokens that reference operations and values in the "computer". We base our programmer on a standard seq2seq model with attention, but extend it with a key-variable memory that allows the model to learn to represent and refer to program variables (Figure 2).

Sequence-to-sequence models consist of two RNNs, an encoder and a decoder. We used a 1-layer GRU (Cho et al., 2014) for both the encoder and the decoder. Given a sequence of words w1, w2, ..., wm, each word wt is mapped to an embedding qt (embedding details are in Section 3). Then, the encoder reads these embeddings and updates its hidden state step by step using ht+1 = GRU(ht, qt; θEncoder), where θEncoder are the GRU parameters. The decoder updates its hidden states ut by ut+1 = GRU(ut, ct−1; θDecoder), where ct−1 is the embedding of the last step's output token at−1, and θDecoder are the GRU parameters. The last hidden state of the encoder hT is used as the decoder's initial state. We also adopt a dot-product attention similar to Dong and Lapata (2016). The tokens of the program a1, a2, ..., an are generated one by one using a softmax over the vocabulary of valid tokens at each step, as provided by the "computer" (Section 2.1).

To achieve compositionality, the decoder must learn to represent and refer to intermediate variables whose value was saved in the "computer" after execution. Therefore, we augment the model with a key-variable memory, where each entry has two components: a continuous embedding key vi, and a corresponding variable token Ri referencing the value in the "computer" (see Figure 2). During encoding, we use an entity linker to link text spans (e.g., "US") to KB entities. For each linked entity we add a memory entry where the key is the average of the GRU hidden states over the entity span, and the variable token (R1) is the name of a variable in the computer holding the linked entity (m.USA) as its value. During decoding, when a full expression is generated (i.e., the decoder generates ")"), it gets executed, and the result is stored as the value of a new variable in the "computer". This variable is keyed by the GRU hidden state at that step.
( Hop r p )        ⇒ {e2 | e1 ∈ r, (e1, p, e2) ∈ K}
( ArgMax r p )     ⇒ {e1 | e1 ∈ r, ∃e2 ∈ E : (e1, p, e2) ∈ K, ∀e : (e1, p, e) ∈ K, e2 ≥ e}
( ArgMin r p )     ⇒ {e1 | e1 ∈ r, ∃e2 ∈ E : (e1, p, e2) ∈ K, ∀e : (e1, p, e) ∈ K, e2 ≤ e}
( Filter r1 r2 p ) ⇒ {e1 | e1 ∈ r1, ∃e2 ∈ r2 : (e1, p, e2) ∈ K}

Table 1: Interpreter functions. r represents a variable, p a property in Freebase. ≥ and ≤ are defined on numbers and dates.

Figure 2: Semantic Parsing with NSM. The key embeddings of the key-variable memory are the output of the sequence model at certain encoding or decoding steps. For illustration purposes, we also show the values of the variables in parentheses, but the sequence model never sees these values, and only references them with the name of the variable ("R1"). A special token "GO" indicates the start of decoding, and "Return" indicates the end of decoding. In the example, the decoder generates "( Hop R1 !CityIn )" and "( Argmax R2 Population )" followed by "Return"; executing each expression adds a new variable (R2: the list of US cities, R3: m.NYC) to the memory.

When a new variable R1 with key embedding v1 is added into the key-variable memory, the token R1 is added into the decoder vocabulary with v1 as its embedding. The final answer returned by the "programmer" is the value of the last computed variable.

Similar to pointer networks (Vinyals et al., 2015), the key embeddings for variables are dynamically generated for each example. During training, the model learns to represent variables by backpropagating gradients from a time step where a variable is selected by the decoder, through the key-variable memory, to an earlier time step when the key embedding was computed. Thus, the encoder/decoder learns to generate representations for variables such that they can be used at the right time to construct the correct program.

While the key embeddings are differentiable, the values referenced by the variables (lists of entities), stored in the "computer", are symbolic and non-differentiable. This distinguishes the key-variable memory from other memory-augmented neural networks that use continuous differentiable embeddings as the values of memory entries (Weston et al., 2014; Graves et al., 2016a).
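As a rough illustration of how the key-variable memory couples differentiable keys with symbolic values, the following Python sketch (with numpy vectors standing in for GRU hidden states; the class and variable names are hypothetical, not the paper's code) shows entries being added during encoding and after executing an expression, and how a decoder could score variable tokens by dot product with their keys.

import numpy as np

class KeyVariableMemory:
    # Each entry pairs a continuous key (a stand-in for a GRU hidden state)
    # with a symbolic variable token whose value lives in the "computer".
    def __init__(self):
        self.keys = []       # key embeddings, one per variable
        self.tokens = []     # variable tokens "R0", "R1", ...
        self.values = {}     # symbolic, non-differentiable values

    def add(self, key_embedding, value):
        token = "R%d" % len(self.tokens)
        self.keys.append(key_embedding)
        self.tokens.append(token)
        self.values[token] = value   # stored symbolically in the "computer"
        return token                 # the token (and its key) joins the decoder vocabulary

    def attention_over_variables(self, decoder_state):
        # The decoder scores variable tokens by dot product with their keys,
        # as in pointer networks; gradients flow into the keys, not the values.
        scores = np.array([np.dot(decoder_state, k) for k in self.keys])
        probs = np.exp(scores - scores.max())
        return probs / probs.sum()

memory = KeyVariableMemory()
dim = 50
# During encoding: the entity linker found "US"; key = average hidden state over the span.
r0 = memory.add(np.random.randn(dim), {"m.USA"})
# During decoding: "( Hop R0 !CityIn )" is completed and executed (hypothetical step);
# the result is stored, keyed by the decoder hidden state at that step.
r1 = memory.add(np.random.randn(dim), {"m.NYC", "m.LA"})
print(r0, r1, memory.attention_over_variables(np.random.randn(dim)))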

2.3 Training NSM with Weak Supervision

NSM executes non-differentiable operations against a KB, and thus end-to-end backpropagation is not possible. Therefore, we base our training procedure on REINFORCE (Williams, 1992; Norouzi et al., 2016). When the reward signal is sparse and the search space is large, it is common to utilize some full supervision to pre-train REINFORCE (Silver et al., 2016). To train from weak supervision, we suggest an iterative ML procedure for finding pseudo-gold programs that will bootstrap REINFORCE.

REINFORCE We can formulate training as a reinforcement learning problem: given a question x, the state, action and reward at each time step t ∈ {0, 1, ..., T} are (st, at, rt). Since the environment is deterministic, the state is defined by the question x and the action sequence: st = (x, a0:t−1), where a0:t−1 = (a0, ..., at−1) is the history of actions at time t. A valid action at time t is at ∈ A(st), where A(st) is the set of valid tokens given by the "computer". Since each action corresponds to a token, the full history a0:T corresponds to a program. The reward rt = I[t = T] · F1(x, a0:T) is non-zero only at the last step of decoding, and is the F1 score computed by comparing the gold answer and the answer generated by executing the program a0:T. Thus, the cumulative reward of a program a0:T is

    R(x, a0:T) = Σ_t rt = F1(x, a0:T).

The agent's decision making procedure at each time step is defined by a policy, πθ(s, a) = Pθ(at = a | x, a0:t−1), where θ are the model parameters. Since the environment is deterministic, the probability of generating a program a0:T is

    Pθ(a0:T | x) = Π_t Pθ(at | x, a0:t−1).

We can define our objective to be the expected cumulative reward and use policy gradient methods such as REINFORCE for training.
The objective and gradient are:

    J^RL(θ) = Σ_x E_{Pθ(a0:T | x)} [R(x, a0:T)],

    ∇θ J^RL(θ) = Σ_x Σ_{a0:T} Pθ(a0:T | x) · [R(x, a0:T) − B(x)] · ∇θ log Pθ(a0:T | x),

where B(x) = Σ_{a0:T} Pθ(a0:T | x) R(x, a0:T) is a baseline that reduces the variance of the gradient estimation without introducing bias. Having a separate network to predict the baseline is an interesting future direction.

While REINFORCE assumes a stochastic policy, we use beam search for gradient estimation. Thus, in contrast with the common practice of approximating the gradient by sampling from the model, we use the top-k action sequences (programs) in the beam with normalized probabilities. This allows training to focus on sequences with high probability, which are on the decision boundaries, and reduces the variance of the gradient.
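A small sketch of this gradient estimate, under the assumption that the beam programs' log-probabilities and rewards are already available: each program is weighted by its probability normalized over the beam, and the baseline B(x) is the resulting expected reward. In a real implementation the returned coefficients would multiply ∇θ log Pθ(program | x), e.g., as weights on a cross-entropy loss.

import math

def reinforce_beam_gradient_weights(log_probs, rewards):
    # Probabilities of the top-k beam programs, normalized over the beam.
    probs = [math.exp(lp) for lp in log_probs]
    z = sum(probs)
    q = [p / z for p in probs]
    # Baseline B(x): expected reward under the beam-normalized distribution.
    baseline = sum(qj * rj for qj, rj in zip(q, rewards))
    # Coefficient of grad log P_theta(program_j | x) for each beam program;
    # an autodiff framework would minimize -sum_j q_j * (R_j - B) * log P_theta(...).
    return [qj * (rj - baseline) for qj, rj in zip(q, rewards)]

# Toy beam of three programs; only the second one answers correctly (F1 = 1).
print(reinforce_beam_gradient_weights([-2.0, -3.0, -2.5], [0.0, 1.0, 0.0]))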
Empirically (and in line with prior work), REINFORCE converged slowly and often got stuck in local optima (see Section 3). The difficulty of training resulted from the sparse reward signal in the large search space, which caused model probabilities for programs with non-zero reward to be very small at the beginning. If the beam size k is small, good programs fall off the beam, leading to zero gradients for all programs in the beam. If the beam size k is large, training is very slow, and the normalized probabilities of good programs when the model is untrained are still very small, leading to (1) near-zero baselines, and thus near-zero gradients on "bad" programs, and (2) near-zero gradients on good programs due to the low probability Pθ(a0:T | x). To combat this, we present an alternative training strategy based on maximum likelihood.

Iterative ML If we had gold programs, we could directly optimize their likelihood. Since we do not have gold programs, we can perform an iterative procedure (similar to hard Expectation-Maximization (EM)), where we search for good programs given fixed parameters, and then optimize the probability of the best program found so far. We decode on each example with a large beam size and declare a^best_0:T(x), the program that achieved the highest reward with the shortest length among the programs decoded on x in all previous iterations, to be the pseudo-gold program. Then, we can optimize the ML objective:

    J^ML(θ) = Σ_x log Pθ(a^best_0:T(x) | x)    (1)

A question x is not included if we did not find any program with positive reward.

Training with iterative ML is fast because there is at most one program per example and the gradient is not weighted by model probability. While decoding with a large beam size is slow, we can train for multiple epochs after each decoding. This iterative process has a bootstrapping effect: a better model leads to a better program a^best_0:T(x) through decoding, and a better program a^best_0:T(x) leads to a better model through training.

Even with a large beam size, some programs are hard to find because of the large search space. A common solution to this problem is to use curriculum learning (Zaremba and Sutskever, 2015; Reed and de Freitas, 2016). The size of the search space is controlled by both the set of functions used in the program and the program length. We apply curriculum learning by gradually increasing both of these quantities (see details in Section 3) when performing iterative ML.

Nevertheless, iterative ML uses only pseudo-gold programs and does not directly optimize the objective we truly care about. This has two adverse effects: (1) The best program a^best_0:T(x) could be a spurious program that accidentally produces the correct answer (e.g., using the property PLACEOFBIRTH instead of PLACEOFDEATH when the two places are the same), and thus does not generalize to other questions. (2) Because training does not observe full negative programs, the model often fails to distinguish between tokens that are related to one another. For example, differentiating PARENTSOF vs. SIBLINGSOF vs. CHILDRENOF can be challenging. We now present learning where we combine iterative ML with REINFORCE.

Augmented REINFORCE To bootstrap REINFORCE, we can use iterative ML to find pseudo-gold programs, and then add these programs to the beam with a reasonably large probability. This is similar to methods from imitation learning (Ross et al., 2011; Jiang et al., 2012) that define a proposal distribution by linearly interpolating the model distribution and an oracle.
Algorithm 1 IML-REINFORCE
Input: question-answer pairs D = {(xi, yi)}, mix ratio α, reward function R(·), training iterations NML, NRL, and beam sizes BML, BRL.
Procedure:
  Initialize C*_x = ∅, the best program so far for x
  Initialize model θ randomly                              ▷ Iterative ML
  for n = 1 to NML do
    for (x, y) in D do
      C ← Decode BML programs given x
      for j in 1...|C| do
        if Rx,y(Cj) > Rx,y(C*_x) then C*_x ← Cj
    θ ← ML training with DML = {(x, C*_x)}
  Initialize model θ randomly                              ▷ REINFORCE
  for n = 1 to NRL do
    DRL ← ∅, the RL training set
    for (x, y) in D do
      C ← Decode BRL programs given x
      for j in 1...|C| do
        if Rx,y(Cj) > Rx,y(C*_x) then C*_x ← Cj
      C ← C ∪ {C*_x}
      for j in 1...|C| do
        p̂j ← (1 − α) · pj / Σ_j′ pj′   where pj = Pθ(Cj | x)
        if Cj = C*_x then p̂j ← p̂j + α
        DRL ← DRL ∪ {(x, Cj, p̂j)}
    θ ← REINFORCE training with DRL

Algorithm 1 describes our overall training procedure. We first run iterative ML for NML iterations and record the best program found for every example xi. Then, we run REINFORCE, where we normalize the probabilities of the programs in the beam to sum to (1 − α) and add α to the probability of the best found program C*(xi). Consequently, the model always puts a reasonable amount of probability on a program with high reward during training. Note that we randomly initialized the parameters for REINFORCE, since initializing from the final ML parameters seems to get stuck in a local optimum and produced worse results.
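The proposal distribution used in Algorithm 1 can be summarized by the short sketch below (it assumes the pseudo-gold program found by iterative ML is already included in the decoded beam, and the program strings are illustrative): beam probabilities are renormalized to sum to 1 − α, and α extra mass is placed on the pseudo-gold program.

import math

def augmented_proposal(beam_programs, beam_logprobs, pseudo_gold, alpha=0.1):
    # Renormalize model probabilities over the beam to sum to (1 - alpha).
    probs = [math.exp(lp) for lp in beam_logprobs]
    z = sum(probs)
    weights = {prog: (1.0 - alpha) * p / z for prog, p in zip(beam_programs, probs)}
    # Boost the pseudo-gold program by alpha so it always receives
    # a reasonable amount of probability during REINFORCE training.
    weights[pseudo_gold] = weights.get(pseudo_gold, 0.0) + alpha
    return weights  # used as the example weights (x, C_j, p_hat_j) for the update

beam = ["( Hop R1 PlaceOfBirth ) Return", "( Hop R1 PlaceOfDeath ) Return"]
print(augmented_proposal(beam, [-1.2, -0.9], pseudo_gold=beam[0]))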
On top of imitation learning, our approach is related to the common practice in reinforcement learning (Schaul et al., 2016) of replaying rare successful experiences to reduce the training variance and improve training efficiency. It is also similar to recent developments (Wu et al., 2016) in machine translation, where ML and RL objectives are linearly combined, because anchoring the model to some high-reward outputs stabilizes training.

3 Experiments and Analysis

We now empirically show that NSM can learn a semantic parser from weak supervision over a large KB. We evaluate on WEBQUESTIONSSP, a challenging semantic parsing dataset with strong baselines. Experiments show that NSM achieves new state-of-the-art performance on WEBQUESTIONSSP with weak supervision, and significantly closes the gap between weak and full supervision for this task.

3.1 The WEBQUESTIONSSP dataset

The WEBQUESTIONSSP dataset (Yih et al., 2016) contains full semantic parses for a subset of the questions from WEBQUESTIONS (Berant et al., 2013), because 18.5% of the original dataset were found to be "not answerable". It consists of 3,098 question-answer pairs for training and 1,639 for testing, which were collected using the Google Suggest API; the answers were originally obtained using Amazon Mechanical Turk workers, and were updated in Yih et al. (2016) by annotators who were familiar with the design of Freebase and who added semantic parses. We further separated out 620 questions from the training set as a validation set. For query pre-processing we used an in-house named entity linking system to find the entities in a question. The quality of the entity linker is similar to that of Yih et al. (2015), with 94% of the gold root entities being included. Similar to Dong and Lapata (2016), we replaced named entity tokens with a special token "ENT". For example, the question "who plays meg in family guy" is changed to "who plays ENT in ENT ENT". This helps reduce overfitting, because instead of memorizing the correct program for a specific entity, the model has to focus on other context words in the sentence, which improves generalization.

Following Yih et al. (2015), we used the last publicly available snapshot of Freebase (Bollacker et al., 2008). Since NSM training requires random access to Freebase during decoding, we pre-processed Freebase by removing predicates that are not related to world knowledge (starting with "/common/", "/type/", "/freebase/"),² and removing all text-valued predicates, which are rarely the answer. Out of all 27K relations, 434 relations are removed during preprocessing. This results in a graph that fits in memory with 23K relations, 82M nodes, and 417M edges.

² We kept "/common/topic/notable types".
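The entity anonymization step described earlier in this section can be sketched as follows (a hypothetical helper; the entity spans are assumed to come from the entity linker as token index ranges):

def anonymize(question_tokens, entity_spans):
    # Replace every token covered by a linked entity span with the special token "ENT".
    # entity_spans are (start, end) token indices, end exclusive.
    tokens = list(question_tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            tokens[i] = "ENT"
    return tokens

# "who plays meg in family guy" with entity spans for "meg" and "family guy":
print(anonymize("who plays meg in family guy".split(), [(2, 3), (4, 6)]))
# ['who', 'plays', 'ENT', 'in', 'ENT', 'ENT']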
3.2 Model Details

For pre-trained word embeddings, we used the 300-dimensional GloVe word embeddings trained on 840B tokens (Pennington et al., 2014). On the encoder side, we added a projection matrix to transform the embeddings into 50 dimensions. On the decoder side, we used the same GloVe embeddings to construct an embedding for each property using its Freebase id, and also added a projection matrix to transform this embedding to 50 dimensions. A Freebase id contains three parts: domain, type, and property. For example, the Freebase id for PARENTSOF is "/people/person/parents"; "people" is the domain, "person" is the type, and "parents" is the property. The embedding is constructed by concatenating the average of the word embeddings in the domain and type names to the average of the word embeddings in the property name. For example, if the embedding dimension is 300, the embedding dimension for "/people/person/parents" will be 600: the first 300 dimensions will be the average of the embeddings for "people" and "person", and the second 300 dimensions will be the embedding for "parents".
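A sketch of this property-embedding construction, assuming a dictionary glove mapping words to 300-dimensional vectors (the dictionary, the underscore splitting of multi-word names, and the zero vector for unknown words are assumptions, not details from the paper):

import numpy as np

def property_embedding(freebase_id, glove, dim=300):
    # "/people/person/parents" -> domain "people", type "person", property "parents".
    domain, typ, prop = freebase_id.strip("/").split("/")
    def avg(words):
        # Average the word vectors of a (possibly multi-word) name.
        return np.mean([glove.get(w, np.zeros(dim)) for w in words], axis=0)
    left = avg(domain.split("_") + typ.split("_"))   # average over domain and type words
    right = avg(prop.split("_"))                     # average over property-name words
    return np.concatenate([left, right])             # 2*dim, later projected to 50 dims

toy_glove = {w: np.random.randn(300) for w in ["people", "person", "parents"]}
print(property_embedding("/people/person/parents", toy_glove).shape)  # (600,)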
The dimensions of the encoder hidden state, decoder hidden state and key embeddings are all 50. The embeddings for the functions and special tokens (e.g., "UNK", "GO") are randomly initialized from a truncated normal distribution with mean 0.0 and standard deviation 0.1. All the weight matrices are initialized with a uniform distribution in [−√3/√d, √3/√d], where d is the input dimension. The dropout rate is set to 0.5, and we see a clear tendency for a larger dropout rate to produce better performance, indicating that overfitting is a major problem for learning.

3.3 Training Details

In iterative ML training, the decoder uses a beam of size k = 100 to update the pseudo-gold programs, and the model is trained for 20 epochs after each decoding step. We use the Adam optimizer (Kingma and Ba, 2014) with initial learning rate 0.001. In our experiments, this process usually converges after a few (5-8) iterations.
For REINFORCE training, the best hyperparameters are chosen using the validation set. We use a beam of size k = 5 for decoding, and α is set to 0.1. Because the dataset is small and some relations are only used once in the whole training set, we train the model on the entire training set for 200 iterations with the best hyperparameters. Then we train the model with learning rate decay until convergence. The learning rate is decayed as

    gt = g0 × β^(max(0, t − ts) / m),

where g0 = 0.001, β = 0.5, m = 1000, and ts is the number of training steps at the end of iteration 200.
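For reference, the decay schedule written as a small helper (the step counts in the example call are made up):

def learning_rate(t, ts, g0=0.001, beta=0.5, m=1000):
    # g_t = g0 * beta ** (max(0, t - ts) / m), where t is the global training step
    # and ts is the step count at the end of the 200 fixed-rate iterations.
    return g0 * beta ** (max(0, t - ts) / m)

print(learning_rate(t=0, ts=5000))     # 0.001 (no decay before ts)
print(learning_rate(t=7000, ts=5000))  # 0.001 * 0.5 ** 2 = 0.00025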
Since decoding needs to query the knowledge base (KB) constantly, the speed bottleneck for training is decoding. We address this problem in our implementation by partitioning the dataset and using multiple decoders in parallel to handle each partition. We use 100 decoders, which query 50 KG servers, and one trainer. The neural network model is implemented in TensorFlow. Since the model is small, we didn't see a significant speedup from using GPUs, so all the decoders and the trainer use CPU only.

Inspired by the staged generation process in Yih et al. (2015), curriculum learning includes two steps. We first run iterative ML for 10 iterations with programs constrained to only use the "Hop" function and at most 2 expressions. Then, we run iterative ML again, but use both "Hop" and "Filter"; the maximum number of expressions is 3, and the relations used by "Hop" are restricted to those that appeared in a^best_0:T(q) in the first step.

3.4 Results and discussion

We evaluate performance using the official evaluation script for WEBQUESTIONSSP. Because the answer to a question may contain multiple entities or values, precision, recall and F1 are computed based on the output of each individual question, and average F1 is reported as the main evaluation metric. Accuracy measures the proportion of questions that are answered exactly.
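A minimal sketch of the per-question F1 (the same quantity used as the reward in Section 2.3); how the official script handles empty answer sets is an assumption here:

def f1(predicted, gold):
    # Precision and recall over the predicted vs. gold answer sets.
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 1.0 if predicted == gold else 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

print(f1({"m.NYC"}, {"m.NYC"}))          # 1.0
print(f1({"m.NYC", "m.LA"}, {"m.NYC"}))  # 0.666...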
A comparison to STAGG, the previous state-of-the-art model (Yih et al., 2016, 2015), is shown in Table 2. Our model beats STAGG with weak supervision by a significant margin on all metrics, while relying on no feature engineering or hand-crafted rules. When STAGG is trained with strong supervision it obtains an F1 of 71.7, and thus NSM closes half the gap between training with weak and full supervision.

Model   Prec.  Rec.   F1     Acc.
STAGG   67.3   73.1   66.8   58.8
NSM     70.8   76.0   69.0   59.5

Table 2: Results on the test set. Average F1 is the main evaluation metric and NSM outperforms STAGG with no domain-specific knowledge or feature engineering.

Four key ingredients lead to the final performance of NSM. The first one is the neural computer interface that provides code assistance by checking for syntax and semantic errors. We find that semantic checks are very effective for open-domain KBs with a large number of properties. For our task, the average number of choices is reduced from 23K per step (all properties) to less than 100 (the average number of properties connected to an entity).

The second ingredient is augmented REINFORCE training. Table 3 compares augmented REINFORCE, REINFORCE, and iterative ML on the validation set. REINFORCE gets stuck in a local optimum and performs poorly. Iterative ML training does not directly optimize the F1 measure, and achieves sub-optimal results. In contrast, augmented REINFORCE is able to bootstrap using pseudo-gold programs found by iterative ML and achieves the best performance on both the training and validation sets.

Settings               Train F1  Valid F1
Iterative ML           68.6      60.1
REINFORCE              55.1      47.8
Augmented REINFORCE    83.0      67.2

Table 3: Average F1 on the validation set for augmented REINFORCE, REINFORCE, and iterative ML.

The third ingredient is curriculum learning during iterative ML. We compare the performance of the best programs found with and without curriculum learning in Table 4. We find that the best programs found with curriculum learning are substantially better than those found without it on every metric.

Settings        Prec.  Rec.   F1     Acc.
No curriculum   79.1   91.1   78.5   67.2
Curriculum      88.6   96.1   89.5   79.8

Table 4: Evaluation of the programs with the highest F1 score in the beam (a^best_0:t) with and without curriculum learning.

The last important ingredient is reducing overfitting. Given the small size of the dataset, overfitting is a major problem for training neural network models. We show the contributions of different techniques for controlling overfitting in Table 5. Note that after all the techniques have been applied, the model is still overfitting, with training F1@1 = 83.0% and validation F1@1 = 67.2%.

Settings                            ∆ F1@1
−Pretrained word embeddings         −5.5
−Pretrained property embeddings     −2.7
−Dropout on GRU input and output    −2.4
−Dropout on softmax                 −1.1
−Anonymize entity tokens            −2.0

Table 5: Contributions of different overfitting techniques on the validation set.

Among the programs generated by the model, a significant portion (36.7%) uses more than one expression. From Table 6, we can see that the performance does not decrease much as the compositional depth increases, indicating that the model is effective at capturing compositionality. We observe that programs with three expressions use a more limited set of properties, mainly focusing on answering a few types of questions such as "who plays meg in family guy", "what college did jeff corwin go to" and "which countries does russia border". In contrast, programs with two expressions use a more diverse set of properties, which could explain the lower performance compared to programs with three expressions.

#Expressions   0      1       2       3
Percentage     0.4%   62.9%   29.8%   6.9%
F1             0.0    73.5    59.9    70.3

Table 6: Percentage and performance of model-generated programs with different complexity (number of expressions).

Error analysis Error analysis on the validation set shows two main sources of errors:
1. Search failure: Programs with high reward are not found during search for pseudo-gold programs, either because the beam size is not large enough, or because the set of functions implemented by the interpreter is insufficient. The 89.5% F1 score in Table 4 indicates that at least 10% of the questions are of this kind.
2. Ranking failure: Programs with high reward exist in the beam, but are not ranked at the top during decoding. Because the training error is low, this is largely due to overfitting or spurious programs. The 67.2% F1 score in Table 3 indicates that about 20% of the questions are of this kind.

4 Related work

Among deep learning models for program induction, Reinforcement Learning Neural Turing Machines (RL-NTMs) (Zaremba and Sutskever, 2015) are the most similar to NSM, as a non-differentiable machine is controlled by a sequence model.
Therefore, both models rely on REINFORCE for training. The main difference between the two is the abstraction level of the programming language. RL-NTM uses lower-level operations such as memory address manipulation and byte reading/writing, while NSM uses a high-level programming language over a large knowledge base that includes operations such as following properties from entities, or sorting based on a property, which is more suitable for representing semantics. Earlier work such as OOPS (Schmidhuber, 2004) has desirable characteristics, for example, the ability to define new functions; these remain future improvements for NSM.

We formulate NSM training as an instance of reinforcement learning (Sutton and Barto, 1998) in order to directly optimize the task reward of the structured prediction problem (Norouzi et al., 2016; Li et al., 2016; Yu et al., 2017). Compared to imitation learning methods (Daume et al., 2009; Ross et al., 2011) that interpolate a model distribution with an oracle, NSM needs to solve a challenging search problem of training from weak supervision in a large search space. Our solution employs two techniques: (a) a symbolic "computer" helps find good programs by pruning the search space; (b) an iterative ML training process, where beam search is used to find pseudo-gold programs. Wiseman and Rush (2016) proposed a max-margin approach to train a sequence-to-sequence scorer. However, their training procedure is more involved, and we did not implement it in this work. MIXER (Ranzato et al., 2015) also proposed to combine ML training and REINFORCE, but they only considered tasks with full supervision. Berant and Liang (2015) applied imitation learning to semantic parsing, but their approach still requires hand-crafted grammars and features.

NSM is similar to Neural Programmer (Neelakantan et al., 2015) and Dynamic Neural Module Network (Andreas et al., 2016) in that they all solve the problem of semantic parsing from structured data, and generate programs using similar semantics. The main difference between these approaches is how an intermediate result (the memory) is represented. Neural Programmer and Dynamic-NMN chose to represent results as vectors of weights (row selectors and attention vectors), which enables backpropagation and search through all possible programs in parallel. However, their strategy is not applicable to a large KB such as Freebase, which contains about 100M entities and more than 20k properties. Instead, NSM chooses a more scalable approach, where the "computer" saves intermediate results, and the neural network only refers to them with variable names (e.g., "R1" for all cities in the US).

NSM is similar to the Path Ranking Algorithm (PRA) (Lao et al., 2011) in that semantics is encoded as a sequence of actions, and denotations are used to prune the search space during learning. NSM is more powerful than PRA by 1) allowing more complex semantics to be composed through the use of a key-variable memory; 2) controlling the search procedure with a trained neural network, while PRA only samples actions uniformly; and 3) allowing input questions to express complex relations and then dynamically generating action sequences. PRA can combine multiple semantic representations to produce the final prediction; this remains future work for NSM.

5 Conclusion

We propose the Manager-Programmer-Computer framework for neural program induction. It integrates neural networks with a symbolic non-differentiable computer to support abstract, scalable and precise operations through a friendly neural computer interface. Within this framework, we introduce the Neural Symbolic Machine, which integrates a neural sequence-to-sequence "programmer" with a key-variable memory, and a symbolic Lisp interpreter with code assistance. Because the interpreter is non-differentiable, and in order to directly optimize the task reward, we apply REINFORCE and use pseudo-gold programs found by an iterative ML training process to bootstrap training. NSM achieves new state-of-the-art results on a challenging semantic parsing dataset with weak supervision, and significantly closes the gap between weak and full supervision. It is trained end-to-end, and does not require any feature engineering or domain-specific knowledge.

Acknowledgements

We thank Arvind Neelakantan, Mohammad Norouzi, Tom Kwiatkowski, Eugene Brevdo, Lukasz Kaizer, Thomas Strohmann, Yonghui Wu, Zhifeng Chen, Alexandre Lacoste, and John Blitzer for discussions and help. The second author is partially supported by the Israel Science Foundation, grant 942/16.
References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. CoRR abs/1601.01705.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. TACL 3:545–558.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD), pages 1247–1250.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics. http://www.aclweb.org/anthology/D14-1179.

H. Daume, J. Langford, and D. Marcu. 2009. Search-based structured prediction. Machine Learning 75:297–325.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio G. Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià P. Badia, Karl M. Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016a. Hybrid computing using a neural network with dynamic external memory. Nature advance online publication. https://doi.org/10.1038/nature20101.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016b. Hybrid computing using a neural network with dynamic external memory. Nature.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

J. Jiang, A. Teichert, J. Eisner, and H. Daume. 2012. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS).

Łukasz Kaiser and Ilya Sutskever. 2015. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Ni Lao, Tom Mitchell, and William W Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with gradient descent. CoRR abs/1511.04834.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems (NIPS).

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Scott Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In ICLR.

S. Ross, G. Gordon, and A. Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS).

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico.

Jürgen Schmidhuber. 2004. Optimal ordered problem solver. Machine Learning 54(3):211–254.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems (NIPS).

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, pages 229–256.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. CoRR abs/1606.02960. http://arxiv.org/abs/1606.02960.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2015. Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.

Adam Yu, Hongrae Lee, and Quoc Le. 2017. Learning to skim text. In ACL.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666.
A Supplementary Material

A.1 Extra Figures

Figure 3: Manager-Programmer-Computer framework. The manager provides the question, weak supervision and reward; the programmer produces a program; the computer executes it with non-differentiable built-in functions against the knowledge base (built-in data), returning the answer and providing code assistance.

Figure 4: Seq2seq model architecture with dot-product attention and dropout at GRU input, output, and softmax layers.

Figure 5: System Architecture. 100 decoders, 50 KB servers and 1 trainer.
