Figure 2: Semantic Parsing with NSM. The key embeddings of the key-variable memory are the output of the sequence model at certain encoding or decoding steps. For illustration purposes, we also show the values of the variables in parentheses, but the sequence model never sees these values, and only references them with the name of the variable (“R1”). A special token “GO” indicates the start of decoding, and “Return” indicates the end of decoding.
the token R1 is added into the decoder vocabulary with v1 as its embedding. The final answer returned by the “programmer” is the value of the last computed variable.

Similar to pointer networks (Vinyals et al., 2015), the key embeddings for variables are dynamically generated for each example. During training, the model learns to represent variables by backpropagating gradients from a time step where a variable is selected by the decoder, through the key-variable memory, to an earlier time step when the key embedding was computed. Thus, the encoder/decoder learns to generate representations for variables such that they can be used at the right time to construct the correct program.

While the key embeddings are differentiable, the values referenced by the variables (lists of entities), stored in the “computer”, are symbolic and non-differentiable. This distinguishes the key-variable memory from other memory-augmented neural networks that use continuous differentiable embeddings as the values of memory entries (Weston et al., 2014; Graves et al., 2016a).
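To make the key-variable memory concrete, here is a minimal sketch in Python/NumPy under our own assumptions (the class and method names are hypothetical, not the paper's code): keys are dense vectors produced by the sequence model, values are only the names of variables, and the decoder selects a variable by dot-product attention over the keys, never seeing the entity lists held by the “computer”.

import numpy as np

class KeyVariableMemory:
    def __init__(self):
        self.keys = []    # key embeddings (np.ndarray) written by the encoder/decoder
        self.names = []   # parallel variable names, e.g. "R1", "R2"

    def write(self, key_embedding, name):
        # Register a new variable; its symbolic value stays in the "computer".
        self.keys.append(key_embedding)
        self.names.append(name)

    def select(self, query):
        # Dot-product attention of the current decoder state over the stored keys;
        # returns only the chosen variable's name plus the attention distribution.
        scores = np.array([key @ query for key in self.keys])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        return self.names[int(np.argmax(probs))], probs

memory = KeyVariableMemory()
memory.write(np.random.randn(50), "R1")   # e.g. an entity resolved from the question
memory.write(np.random.randn(50), "R2")   # e.g. the result of an executed expression
name, attention = memory.select(np.random.randn(50))
print(name, attention)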
It is common to utilize some full supervision to pre-train REINFORCE (Silver et al., 2016). To train from weak supervision, we suggest an iterative ML procedure for finding pseudo-gold programs that will bootstrap REINFORCE.

REINFORCE We can formulate training as a reinforcement learning problem: given a question x, the state, action and reward at each time step t ∈ {0, 1, ..., T} are (s_t, a_t, r_t). Since the environment is deterministic, the state is defined by the question x and the action sequence: s_t = (x, a_{0:t-1}), where a_{0:t-1} = (a_0, ..., a_{t-1}) is the history of actions at time t. A valid action at time t is a_t ∈ A(s_t), where A(s_t) is the set of valid tokens given by the “computer”. Since each action corresponds to a token, the full history a_{0:T} corresponds to a program. The reward r_t = I[t = T] · F_1(x, a_{0:T}) is non-zero only at the last step of decoding, and is the F_1 score computed by comparing the gold answer and the answer generated by executing the program a_{0:T}. Thus, the cumulative reward of a program a_{0:T} is

R(x, a_{0:T}) = \sum_t r_t = F_1(x, a_{0:T}).

We can define our objective to be the expected cumulative reward and use policy gradient methods such as REINFORCE for training. The objective and gradient are:

J^{RL}(\theta) = \sum_x \mathbb{E}_{P_\theta(a_{0:T} \mid x)}[R(x, a_{0:T})],

\nabla_\theta J^{RL}(\theta) = \sum_x \sum_{a_{0:T}} P_\theta(a_{0:T} \mid x) \cdot [R(x, a_{0:T}) - B(x)] \cdot \nabla_\theta \log P_\theta(a_{0:T} \mid x),

where B(x) = \sum_{a_{0:T}} P_\theta(a_{0:T} \mid x) R(x, a_{0:T}) is a baseline that reduces the variance of the gradient estimation without introducing bias. Having a separate network to predict the baseline is an interesting future direction.

While REINFORCE assumes a stochastic policy, we use beam search for gradient estimation. Thus, in contrast with the common practice of approximating the gradient by sampling from the model, we use the top-k action sequences (programs) in the beam with normalized probabilities. This allows training to focus on sequences with high probability, which are on the decision boundaries, and reduces the variance of the gradient.
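As a concrete illustration of this beam-based estimator, here is a small sketch in Python under our own assumptions (the function and variable names are hypothetical): for one question, the top-k programs in the beam are renormalized, the baseline B(x) is their probability-weighted reward, and each program's ∇ log P_θ term is weighted by p̂_j · (R_j - B(x)) in the update computed by the underlying seq2seq framework.

import math

def beam_policy_gradient_weights(beam):
    # beam: list of (program, log_prob, reward) decoded for one question x.
    # Returns w_j = p_hat_j * (R_j - B(x)); the gradient estimate is
    # sum_j w_j * grad log P_theta(program_j | x).
    log_probs = [lp for _, lp, _ in beam]
    shift = max(log_probs)
    unnorm = [math.exp(lp - shift) for lp in log_probs]
    z = sum(unnorm)
    p_hat = [u / z for u in unnorm]                               # probabilities normalized over the beam
    baseline = sum(p * r for p, (_, _, r) in zip(p_hat, beam))    # B(x)
    return [p * (r - baseline) for p, (_, _, r) in zip(p_hat, beam)]

# Hypothetical beam: only the second program reaches the gold answer (F1 = 1.0).
beam = [("(hop R1 p1)", -2.3, 0.0), ("(hop R1 p2)", -2.9, 1.0), ("(argmax R1 p3)", -3.5, 0.0)]
print(beam_policy_gradient_weights(beam))

With an untrained model, the normalized probability of the rewarded program is typically tiny, which is exactly the near-zero-gradient issue discussed next.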
Empirically (and in line with prior work), REINFORCE converged slowly and often got stuck in local optima (see Section 3). The difficulty of training resulted from the sparse reward signal in the large search space, which caused model probabilities for programs with non-zero reward to be very small at the beginning. If the beam size k is small, good programs fall off the beam, leading to zero gradients for all programs in the beam. If the beam size k is large, training is very slow, and the normalized probabilities of good programs when the model is untrained are still very small, leading to (1) near-zero baselines, and thus near-zero gradients on “bad” programs, and (2) near-zero gradients on good programs due to the low probability P_θ(a_{0:T} | x). To combat this, we present an alternative training strategy based on maximum likelihood.

Iterative ML If we had gold programs, we could directly optimize their likelihood. Since we do not have gold programs, we can perform an iterative procedure (similar to hard Expectation-Maximization (EM)), where we search for good programs given fixed parameters, and then optimize the probability of the best program found so far. We do decoding on an example with a large beam size and declare a^{best}_{0:T}(x) to be the pseudo-gold program, which achieved the highest reward with the shortest length among the programs decoded on x in all previous iterations. Then, we can optimize the ML objective:

J^{ML}(\theta) = \sum_x \log P_\theta(a^{best}_{0:T}(x) \mid x)    (1)

A question x is not included if we did not find any program with positive reward.
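The loop below is a compact Python sketch of this hard-EM-style procedure; decode_fn, ml_train_fn, and reward_fn are placeholders we introduce for beam-search decoding, likelihood training of the seq2seq model, and the F1 reward, not parts of the paper's implementation.

def iterative_ml(dataset, decode_fn, ml_train_fn, reward_fn, n_iterations, beam_size):
    # dataset: list of (question, answer) pairs with only weak supervision.
    best = {}                                  # question -> (reward, -length, program)
    for _ in range(n_iterations):
        for question, answer in dataset:
            for program in decode_fn(question, beam_size):
                candidate = (reward_fn(program, answer), -len(program), program)
                # Keep the highest-reward program found so far, breaking ties by shorter length.
                if question not in best or candidate[:2] > best[question][:2]:
                    best[question] = candidate
        # Maximize log P(pseudo-gold program | question); zero-reward questions are skipped.
        pseudo_gold = [(q, entry[2]) for q, entry in best.items() if entry[0] > 0]
        ml_train_fn(pseudo_gold)
    return {q: entry[2] for q, entry in best.items() if entry[0] > 0}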
Training with iterative ML is fast because there is at most one program per example and the gradient is not weighted by model probability. While decoding with a large beam size is slow, we could train for multiple epochs after each decoding. This iterative process has a bootstrapping effect: a better model leads to a better program a^{best}_{0:T}(x) through decoding, and a better program a^{best}_{0:T}(x) leads to a better model through training.

Even with a large beam size, some programs are hard to find because of the large search space. A common solution to this problem is to use curriculum learning (Zaremba and Sutskever, 2015; Reed and de Freitas, 2016). The size of the search space is controlled by both the set of functions used in the program and the program length. We apply curriculum learning by gradually increasing both of these quantities (see details in Section 3) when performing iterative ML.

Nevertheless, iterative ML uses only pseudo-gold programs and does not directly optimize the objective we truly care about. This has two adverse effects: (1) the best program a^{best}_{0:T}(x) could be a spurious program that accidentally produces the correct answer (e.g., using the property PlaceOfBirth instead of PlaceOfDeath when the two places are the same), and thus does not generalize to other questions; (2) because training does not observe full negative programs, the model often fails to distinguish between tokens that are related to one another. For example, differentiating ParentsOf vs. SiblingsOf vs. ChildrenOf can be challenging. We now present learning where we combine iterative ML with REINFORCE.

Augmented REINFORCE To bootstrap REINFORCE, we can use iterative ML to find pseudo-gold programs, and then add these programs to the beam with a reasonably large probability. This is similar to methods from imitation learning (Ross et al., 2011; Jiang et al., 2012) that define a proposal distribution by linearly interpolating the model distribution and an oracle.
Algorithm 1 IML-REINFORCE
Input: question-answer pairs D = {(x_i, y_i)}, mix ratio α, reward function R(·), training iterations N_ML and N_RL, and beam sizes B_ML and B_RL.
Procedure:
  Initialize C*_x = ∅, the best program so far for x
  Initialize model θ randomly                                ▷ Iterative ML
  for n = 1 to N_ML do
    for (x, y) in D do
      C ← Decode B_ML programs given x
      for j in 1 ... |C| do
        if R_{x,y}(C_j) > R_{x,y}(C*_x) then C*_x ← C_j
    θ ← ML training with D_ML = {(x, C*_x)}
  Initialize model θ randomly                                ▷ REINFORCE
  for n = 1 to N_RL do
    D_RL ← ∅, the RL training set
    for (x, y) in D do
      C ← Decode B_RL programs from x
      for j in 1 ... |C| do
        if R_{x,y}(C_j) > R_{x,y}(C*_x) then C*_x ← C_j
      C ← C ∪ {C*_x}
      for j in 1 ... |C| do
        p̂_j ← (1 − α) · p_j / Σ_{j′} p_{j′}, where p_j = P_θ(C_j | x)
        if C_j = C*_x then p̂_j ← p̂_j + α
        D_RL ← D_RL ∪ {(x, C_j, p̂_j)}
    θ ← REINFORCE training with D_RL

Algorithm 1 describes our overall training procedure. We first run iterative ML for N_ML iterations and record the best program found for every example x_i. Then, we run REINFORCE, where we normalize the probabilities of the programs in the beam to sum to (1 − α) and add α to the probability of the best found program C*(x_i). Consequently, the model always puts a reasonable amount of probability on a program with high reward during training. Note that we randomly initialized the parameters for REINFORCE, since initializing from the final ML parameters seems to get stuck in a local optimum and produced worse results.
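For concreteness, this is a small Python sketch of the probability-mixing step of Algorithm 1 as we read it (augmented_beam_distribution and its arguments are our names, not the paper's code): beam probabilities are rescaled to sum to (1 − α) and the pseudo-gold program receives an additional α.

import math

def augmented_beam_distribution(beam, best_program, alpha=0.1):
    # beam: list of (program, log_prob) from decoding; best_program: the pseudo-gold C*_x.
    probs = [math.exp(lp) for _, lp in beam]
    programs = [prog for prog, _ in beam]
    if best_program not in programs:
        # Simplification: if the pseudo-gold program fell off the beam, add it
        # with model probability treated as 0 before mixing.
        programs.append(best_program)
        probs.append(0.0)
    z = sum(probs) or 1.0
    scaled = [(prog, (1.0 - alpha) * p / z) for prog, p in zip(programs, probs)]
    return [(prog, p + alpha if prog == best_program else p) for prog, p in scaled]

# Hypothetical usage: a two-program beam plus a pseudo-gold program from iterative ML.
dist = augmented_beam_distribution([("prog_a", math.log(0.6)), ("prog_b", math.log(0.2))], "prog_gold", alpha=0.1)
print(dist, sum(p for _, p in dist))   # the mixed probabilities sum to 1.0

These p̂_j values weight each program's ∇ log P_θ term in the REINFORCE update, so a high-reward program is never starved of probability mass during training.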
On top of imitation learning, our approach is related to the common practice in reinforcement learning (Schaul et al., 2016) of replaying rare successful experiences to reduce the training variance and improve training efficiency. This is also similar to recent developments (Wu et al., 2016) in machine translation, where ML and RL objectives are linearly combined, because anchoring the model to some high-reward outputs stabilizes training.

Our approach achieves new state-of-the-art performance on WebQuestionsSP with weak supervision, and significantly closes the gap between weak and full supervision for this task.

3.1 The WebQuestionsSP dataset

The WebQuestionsSP dataset (Yih et al., 2016) contains full semantic parses for a subset of the questions from WebQuestions (Berant et al., 2013), because 18.5% of the original dataset were found to be “not answerable”. It consists of 3,098 question-answer pairs for training and 1,639 for testing, which were collected using the Google Suggest API; the answers were originally obtained using Amazon Mechanical Turk workers and were updated in (Yih et al., 2016) by annotators who were familiar with the design of Freebase and who added semantic parses. We further separated out 620 questions from the training set as a validation set. For query pre-processing we used an in-house named entity linking system to find the entities in a question. The quality of the entity linker is similar to that of Yih et al. (2015), with 94% of the gold root entities being included. Similar to Dong and Lapata (2016), we replaced named entity tokens with a special token “ENT”. For example, the question “who plays meg in family guy” is changed to “who plays ENT in ENT ENT”. This helps reduce overfitting: instead of memorizing the correct program for a specific entity, the model has to focus on other context words in the sentence, which improves generalization.
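A toy sketch of this entity-masking step (our own illustration; the entity spans are assumed to come from the in-house entity linker, which is not shown):

def mask_entities(question, entity_spans):
    # entity_spans: list of (start, end) token indices (end exclusive) from an entity linker.
    tokens = question.split()
    for start, end in sorted(entity_spans, reverse=True):
        tokens[start:end] = ["ENT"] * (end - start)
    return " ".join(tokens)

# "meg" spans tokens [2, 3) and "family guy" spans tokens [4, 6):
print(mask_entities("who plays meg in family guy", [(2, 3), (4, 6)]))
# -> "who plays ENT in ENT ENT"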
Following (Yih et al., 2015) we used the last publicly available snapshot of Freebase (Bollacker et al., 2008). Since NSM training requires random access to Freebase during decoding, we pre-processed Freebase by removing predicates that are not related to world knowledge (starting with “/common/”, “/type/”, “/freebase/”), and removing all text-valued predicates, which are rarely the answer. Out of all 27K relations, 434 relations are removed during preprocessing. This results in a graph that fits in memory, with 23K relations, 82M nodes, and 417M edges.
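A rough sketch of that predicate filter (the prefix list follows the text above; the is_text_valued check is an assumption standing in for the actual type information):

EXCLUDED_PREFIXES = ("/common/", "/type/", "/freebase/")

def keep_predicate(predicate, is_text_valued):
    # Drop predicates that are not world knowledge, and text-valued predicates.
    if predicate.startswith(EXCLUDED_PREFIXES):
        return False
    return not is_text_valued

print(keep_predicate("/people/person/place_of_birth", is_text_valued=False))  # True: kept
print(keep_predicate("/type/object/name", is_text_valued=True))               # False: removed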
References

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP. volume 2, page 6.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. TACL 3:545–558.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.

H. Daume, J. Langford, and D. Marcu. 2009. Search-based structured prediction. Machine Learning 75:297–325.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL). pages 590–599.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio G. Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià P. Badia, Karl M. Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016a. Hybrid computing using a neural network with dynamic external memory. Nature advance online publication. https://doi.org/10.1038/nature20101.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016b. Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471–476.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

J. Jiang, A. Teichert, J. Eisner, and H. Daume. 2012. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS).

Łukasz Kaiser and Ilya Sutskever. 2015. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 529–539.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL).

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with gradient descent. CoRR abs/1511.04834.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems (NIPS).

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Scott Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In ICLR.

S. Ross, G. Gordon, and A. Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS).

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In International Conference on Learning Representations. Puerto Rico.

Jürgen Schmidhuber. 2004. Optimal ordered problem solver. Machine Learning 54(3):211–254.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems (NIPS).

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2015. Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.

Adam Yu, Hongrae Lee, and Quoc Le. 2017. Learning to skim text. In ACL.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521.

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI). pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI). pages 658–666.
[Figure (diagram): the Manager-Programmer-Computer framework. The manager provides weak supervision and reward; the neural "programmer" maps the question to a program; the symbolic "computer" executes the program against the knowledge base (built-in data).]
Figure 4: Seq2seq model architecture with dot-product attention and dropout at GRU input, output, and softmax layers.