
Learning Fine-Grained Expressions to Solve Math Word Problems

Danqing Huang1∗, Shuming Shi2, Jian Yin1, and Chin-Yew Lin3

1 Guangdong Key Laboratory of Big Data Analysis and Processing, Sun Yat-sen University
2 Tencent AI Lab
3 Microsoft Research

{huangdq2@mail2,issjyin@mail}.sysu.edu.cn
shumingshi@tencent.com
cyl@microsoft.com

∗ Work done while this author was an intern at Microsoft Research.

Abstract

This paper presents a novel template-based method to solve math word problems. This method learns the mappings between math concept phrases in math word problems and their math expressions from training data. For each equation template, we automatically construct a rich template sketch by aggregating information from various problems with the same template. Our approach is implemented in a two-stage system. It first retrieves a few relevant equation system templates and aligns numbers in math word problems to those templates for candidate equation generation. It then does a fine-grained inference to obtain the final answer. Experiment results show that our method achieves an accuracy of 28.4% on the linear Dolphin18K benchmark, which is 10% (54% relative) higher than previous state-of-the-art systems, while achieving an accuracy increase of 12% (59% relative) on the TS6 benchmark subset.

1 Introduction

The research topic of automatically solving math word problems dates back to the 1960s (Bobrow, 1964a,b; Charniak, 1968). Recently many systems have been proposed for these types of problems (Kushman et al., 2014; Hosseini et al., 2014; Koncel-Kedziorski et al., 2015; Zhou et al., 2015; Roy and Roth, 2015; Shi et al., 2015; Upadhyay et al., 2016; Mitra and Baral, 2016). In a recent evaluation conducted by Huang et al. (2016), current state-of-the-art systems only achieved an 18.3% accuracy on their published dataset Dolphin18K. Their results indicate that math word problem solving is a very challenging task.

To solve a math word problem, a system needs to understand natural language text to extract information from the problem as local context. Also, it should provide an external knowledge base, including commonsense knowledge (e.g. "a chicken has two legs") and mathematical knowledge (e.g. "the perimeter of a rectangle = 2 * length + 2 * width"). The system can then perform reasoning based on the above two resources to generate an answer.

P1: What's 25% off $139.99?
Equation: (1-0.25)*139.99 = x
P2: How much will the ipod now be if the original price is $260 and I get 10% discount?
Equation: (1-0.1)*260 = x
Template: (1-n1)*n2 = x

P3: I bought something for $306.00 dollars. I got a 20% discount. What was the original price?
Equation: (1-0.2)*x = 306
Template: (1-n1)*x = n2

Figure 1: Math Word Problem Examples.

In this paper, we focus on the acquisition of mathematical knowledge, or deriving math concepts from natural language. Consider the first two problems P1 and P2 in Figure 1. The math concept in the problems tells you to take away a percentage from one and get the resulting percentage of a total. Using mathematical language, it can be formulated as (1 - n1) * n2, where n1, n2 are quantities. In this example, we can derive the concept of subtraction from the text "[NUM] % off" and "[NUM] % discount".

Acquisition of mathematical knowledge is non-trivial. Initial statistical approaches (Hosseini et al., 2014; Roy and Roth, 2015; Koncel-Kedziorski et al., 2015) derive math concepts based on observations from their datasets of specific types of problems, e.g. problems with one single equation. For example, Hosseini et al. (2014) assumes that verbs and only verbs embed math concepts and maps them to addition/subtraction. Roy and Roth (2015) and Koncel-Kedziorski et al. (2015) assume there is only one unknown variable in the problem and cannot derive math concepts involving constants or more than one unknown variable, such as "the product of two unknown numbers".

Template-based approaches (Kushman et al., 2014; Zhou et al., 2015; Upadhyay et al., 2016), on the other hand, leverage the built-in composition structure of equation system templates to formulate all types of math concepts seen in training data, such as (1 - n1) * n2 = x in Figure 1. However, they suffer from two major shortcomings. First, the math concepts they learn, each expressed as an entire template, fail to capture a lot of useful information when training instances are sparse. We argue that it would be more expressive if math concepts were learned at a finer granularity. Second, their learning processes rely heavily on lexical and syntactic features, such as the dependency path between two slots in a template. When applied to a large-scale dataset, they create a huge and sparse feature space and it is unclear how these template-related features would contribute.

To alleviate the sparseness problem of math concept learning and better utilize templates, we propose a novel approach to capture the rich information contained in templates, including textual expressions that imply math concepts. We parse the template into a tree structure and define a "template fragment" as any subtree with at least one operator and two operands. We learn fine-grained mappings between textual expressions and template fragments, based on longest common substrings. For example, given the three problems in Figure 1, we can map "[NUM] % off" and "[NUM] % discount" to 1 - n1, and "[NUM] % off [NUM]" to (1 - n1) * n2 = x. In this way, we can decompose the templates and learn math concepts at a finer grain. Furthermore, we observe that problems of the same template share some common properties. By aggregating problems of the same template and capturing these properties, we automatically construct a sketch for each template in the training data.

Our approach is implemented in a two-stage system. We first retrieve a few relevant templates in the training data. This narrows our search space to focus only on those templates that are likely to be relevant. Then we align numbers in the problem to those few returned templates, and do fine-grained inference to obtain the final answer. We show that the textual expressions and template sketch we propose are effective for both stages. In addition, our system significantly reduces the hypothesis space of candidate equations compared to previous systems, which benefits the learning process and inference at scale.

We evaluate our system on the benchmark dataset provided by Huang et al. (2016). Experiments show that our system outperforms two state-of-the-art baselines with a more than 10% absolute (54% relative) accuracy increase on the linear benchmark, and a more than 20% absolute (71% relative) accuracy increase on the subset with a template size greater than or equal to 6.

In the remaining parts of this paper, we introduce related work in Section 2, describe template sketch and textual expression learning in Section 3, present our two-stage system in Section 4, summarize experiment setup and results in Section 5, and conclude this paper in Section 6.

2 Related Work

Automatic math word problem solving methods (Bobrow, 1964a,b; Charniak, 1968, 1969; Briars and Larkin, 1984; Fletcher, 1985; Dellarosa, 1986; Bakman, 2007; Yuhui et al., 2010) developed before 2008 are mostly rule-based. They accept limited well-formatted input sentences and map them into certain structures by pattern matching. They usually focus on problems with simple math operations such as addition or subtraction. Please see Mukherjee and Garain (2008) for a summary.

In recent years, symbolic and statistical methods have been explored by various researchers. In the symbolic approach, systems transform math word problems to structured representations. Bakman (2007) maps math problems to predefined schemas with a table of textual formulas and changing verbs. Liguda and Pfeiffer (2012) use augmented semantic networks to represent math problems.
Shi et al. (2015) parses math problems to their pre-defined semantic language. However, these methods are only effective in their designated math problem categories and are not scalable to other categories. For example, the method used by Shi et al. (2015) works extremely well for solving number word problems but not others.

In the statistical machine learning approach, Hosseini et al. (2014) solves addition and subtraction problems by extracting quantities as states and deriving math concepts from verbs in the training data. Kushman et al. (2014) and Zhou et al. (2015) generalize equations attached to problems with variable slots and number slots. They learn a probabilistic model for finding the best solution equation. Upadhyay et al. (2016) follows their approach and leverages math word problems without equation annotation as external resources. Seo et al. (2015) solves a set of SAT geometry questions with text and diagrams provided. Koncel-Kedziorski et al. (2015) and Roy and Roth (2015) target math problems that can be solved by one single linear equation. They map quantities and words to candidate equation trees and select the best tree using a statistical learning model. Mitra and Baral (2016) considers addition and subtraction problems in three basic problem types: "Change", "Part Whole" and "Comparison". They manually design different features for each type, which is difficult to expand to more types.

In summary, previous methods can achieve high accuracy in limited math problem categories (e.g. Kushman et al., 2014; Shi et al., 2015), but do not scale or perform well on datasets containing various math problem types as in Huang et al. (2016), as their designed features become sparse. Their process of acquiring mathematical knowledge is either sparse or based on certain assumptions about specific problem types. To alleviate this problem, we introduce our template sketch construction and fine-grained expression learning in the next section.

3 Template Sketch Construction

A template sketch contains template information. We define three categories of information for the sketch in this section. Next we describe how we construct a template sketch via aggregation of rich information from training problems. We group problems of the same template in the training set as one cluster and collect information. See Figure 2 for the outline of our template sketch construction.

3.1 Definition

Template: It is first introduced in Kushman et al. (2014). It is a unique form of an equation system. For example, given an equation system as follows:

2 · x1 + 4 · x2 = 34
x1 + 4 = x2

This equation system is a solution for a specific math word problem. We replace the numbers with four number slots {n1, n2, n3, n4} and generalize the equations to the following template:

n1 · x1 + n2 · x2 = n3
x1 + n4 = x2

Alignment: We align numbers in the math problem with the number slots of a template. For the first math problem in Figure 1 with its corresponding template (1 - n1) * n2 = x, there are two numbers 0.25 and 139.99 to align with two number slots n1 and n2, which results in two different alignments.

Kushman et al. (2014) also aligns nouns to variable slots {x1, x2, ...}, which leads to a huge hypothesis space and does not perform as well as the number-slot-only alignment method proposed later by Zhou et al. (2015). Therefore, we only consider number slot alignment in this paper.

3.2 Textual Expressions

For template fragments, there are usually some textual expressions. For example, "n1 % off" and "n1 % discount" are both mapped to the template fragment 1 - n1.

We employ a statistical framework to automatically mine textual expressions for template fragments from a training dataset. First we parse the equation into a hierarchical tree. In a bottom-up manner, we obtain each possible subtree as a template fragment tk, which is associated with at least one number slot. For each tk, we use the numbers to anchor the number-related phrases in the problem, replace numbers with "[NUM]" and noun phrases with "[VAR]", and cluster the phrases P = {p1, p2, ...} that share the same tk across all training problems. Then we compute the longest common substring lcs_ij^k between each pair pi and pj and calculate the tf-idf score of lcs_ij^k. We keep the lcs_ij^k with scores above a certain empirically determined threshold as the textual expressions.
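To make the mining procedure above concrete, the following is a minimal Python sketch of the same idea: cluster the masked phrases per template fragment, take pairwise longest common substrings, and keep those whose tf-idf-style score clears a threshold. The tokenization, the exact weighting scheme, and the threshold value are illustrative assumptions, not the implementation used in the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

def longest_common_substring(a, b):
    """Longest common token substring of two token lists (simple DP)."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return tuple(a[best_end - best:best_end])

def mine_expressions(fragment_phrases, threshold=0.0):
    """fragment_phrases maps a template fragment (e.g. "1-n1") to the
    anchored phrases of its training problems, already masked with
    [NUM]/[VAR] and tokenized (assumed to be done upstream)."""
    candidates = defaultdict(Counter)
    for frag, phrases in fragment_phrases.items():
        for p_i, p_j in combinations(phrases, 2):
            lcs = longest_common_substring(p_i, p_j)
            if lcs:
                candidates[frag][lcs] += 1
    # Count how many fragments each substring occurs with (document frequency).
    df = Counter()
    for counter in candidates.values():
        for lcs in counter:
            df[lcs] += 1
    n_frags = max(len(candidates), 1)
    expressions = defaultdict(list)
    for frag, counter in candidates.items():
        for lcs, tf in counter.items():
            score = tf * math.log((1 + n_frags) / (1 + df[lcs]))
            if score >= threshold:
                expressions[frag].append(" ".join(lcs))
    return dict(expressions)

# Toy run over the Figure 1 phrases for the fragment 1-n1:
phrases = {"1-n1": ["[NUM] % off [VAR]".split(),
                    "get [NUM] % discount".split(),
                    "[NUM] % off".split()]}
print(mine_expressions(phrases))
```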

[Figure 2: Template Sketch Construction. An example problem ("Wallace received a discount of 28% on an item priced at $275. What is the total price that he paid for it?", Equation: (1-0.28)*275 = x, Template: (1-n1)*n2 = x) is processed in three steps: 1. Quantity Extraction (0.28 with unit %, normalized unit 0; 275 with unit $, normalized unit 1), 2. Question Keyword Detection ({total price}), and 3. Textual Patterns (e.g. 1-0.28 / 1-n1 <- "a discount of [NUM] %"; (1-0.28)*275 / (1-n1)*n2 <- "a discount of [NUM] % on [VAR] priced at [NUM]"). Aggregating such problems yields the sketch for the template (1-n1)*n2 = x, with a unit sequence {%, $}, a normalized unit sequence {0, 1}, a question keyword set {price}, and textual expressions for the fragments 1.0-n1 (e.g. "a discount of [NUM] %", "mark down [NUM] %", "[NUM] % less than") and (1.0-n1)*n2 (e.g. "a discount of [NUM] % on ...", "[NUM] % off [NUM]").]

Figure 2: Template Sketch Construction.
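Reading Figure 2 as a data flow, a sketch is simply the aggregate of what the three steps produce for every training problem of one template. Below is a minimal sketch of that aggregation; the field names and the per-problem input format are assumptions made for illustration, not the paper's actual data structures.

```python
from collections import defaultdict

def build_sketch(problems):
    """Aggregate a template sketch from training problems sharing one template.
    Each problem is assumed to be pre-processed into:
      units            e.g. ["%", "$"]
      normalized_units e.g. [0, 1]   (type ids for distinct unit types)
      question_keyword e.g. "total price"
      expressions      dict: template fragment -> mined phrases
    These field names are illustrative only."""
    sketch = {
        "unit_sequences": set(),
        "normalized_unit_sequences": set(),
        "question_keywords": set(),
        "textual_expressions": defaultdict(set),
    }
    for p in problems:
        sketch["unit_sequences"].add(tuple(p["units"]))
        sketch["normalized_unit_sequences"].add(tuple(p["normalized_units"]))
        if p.get("question_keyword"):
            sketch["question_keywords"].add(p["question_keyword"])
        for fragment, phrases in p["expressions"].items():
            sketch["textual_expressions"][fragment].update(phrases)
    return sketch

# One aggregated problem for the template (1-n1)*n2 = x (cf. Figure 2):
problem = {
    "units": ["%", "$"],
    "normalized_units": [0, 1],
    "question_keyword": "total price",
    "expressions": {"1-n1": ["a discount of [NUM] %"],
                    "(1-n1)*n2": ["a discount of [NUM] % on [VAR] priced at [NUM]"]},
}
print(build_sketch([problem]))
```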

3.3 Slot Type

Number slots in templates have their own type constraints. For example, in the template (1 - n1) * n2 = x, usually n1 represents a percentage quantity and n2 is the quantity of an object.

We model slot types with quantity units, and find the direct governing noun phrase of each quantity as its "owner". For the problem in Figure 2, we extract the quantity unit sequence {%, $}, the normalized unit sequence {0, 1} (because % and $ are of different quantity types), and the quantity owners {discount, item}. The slot type information provides important clues for choosing the correct template and alignment.

3.4 Question Keyword

The question keyword decides which template we use. Given the following problem setting: "A rectangle has a width of 5cm and a length of 10cm.", we can ask either Q1: "What is the area of the rectangle?" or Q2: "What is the difference between width and length?". The question keywords area and difference help our system decide whether it should apply template n1 * n2 = x for Q1 or template n1 - n2 = x for Q2.

We first detect the question sentence (containing keywords "what", "how", "figure out", ...). Then we extract the question keyword on the dependency tree with simple rules that we observed in the dev set (e.g. retrieving nouns with an "attr - nsubj" dependency relation to the keyword "what"). Please note that we favor recall over precision for the detected question keywords, since they are used as features instead of hard constraints on the template decision. Simple rule-based extraction can already satisfy our need for detecting question keywords in math problems.
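As a rough illustration of Sections 3.3 and 3.4, the snippet below extracts a unit sequence with its normalized (type-id) version and pulls a question keyword out of the question sentence. The unit inventory, the regular expressions, and the keyword rule are deliberately simplified stand-ins for the paper's quantity extraction and dependency-tree rules.

```python
import re

# A small, illustrative unit inventory; the real system also records the
# governing noun phrase of each quantity as its "owner" (Section 3.3).
UNIT_PATTERN = re.compile(r"(\$)?\s*(\d+(?:\.\d+)?)\s*(%|cm|kg|dollars?)?", re.I)
QUESTION_CUES = ("what", "how", "figure out")

def unit_sequence(text):
    """Return the unit sequence and a normalized (type-id) sequence."""
    units, normalized, seen = [], [], {}
    for dollar, _num, unit in UNIT_PATTERN.findall(text):
        u = "$" if dollar else (unit.lower() if unit else "")
        units.append(u)
        normalized.append(seen.setdefault(u, len(seen)))
    return units, normalized

def question_keyword(text):
    """Very rough stand-in for the dependency-rule extraction of Section 3.4:
    grab the noun phrase right after "what is the ..." in the question sentence."""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if any(cue in sent.lower() for cue in QUESTION_CUES):
            m = re.search(r"what is the ([a-z ]+?)(?: that| of| for|\?)", sent.lower())
            return m.group(1).strip() if m else None
    return None

text = ("Wallace received a discount of 28% on an item priced at $275. "
        "What is the total price that he paid for it?")
print(unit_sequence(text))      # (['%', '$'], [0, 1])
print(question_keyword(text))   # 'total price'
```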

4 Two-Stage System

In this section, we describe our two-stage system for solving math problems, consisting of template retrieval and alignment ranking. We show how textual expressions and the template sketch are applied in our system.

4.1 Template Retrieval

We use an efficient retrieval module to first narrow our search space and focus only on templates that are likely to be relevant. Let χ denote the set of test problems, and T = {t1, t2, ..., tj} the template set in the training data. For each test problem xi, our goal is to select the correct template tj. We define the conditional probability of selecting a template given a problem as follows:

p(tj | xi; νt) = exp(νt · f(xi, tj)) / Σ_{tj' ∈ T} exp(νt · f(xi, tj'))

where νt is the model parameter vector and f(xi, tj) is the feature vector. We apply a Ranking SVM (Herbrich et al., 2000) to minimize a regularized margin-based pairwise loss, giving the following objective function:

(1/2) ||νt||^2 + C Σ_i l(νt^T f(xi, tj)^+ − νt^T f(xi, tl)^−)

where the superscript "+" indicates the correct instance, "−" indicates a false one, and the loss function is l(t) = max(0, 1 − t)^2.

To construct the vector f(xi, tj) for template tj, we use the three categories of the template sketch, shown in Table 1. Let Q(tj) represent the cluster of training problems with template tj.

Textual Features
- Contains textual expressions in each template fragment?
- Average word overlap with Q(tj)
- Max word overlap with Q(tj)
Quantity Features
- Unit sequence in Q(tj)
- Normalized unit sequence in Q(tj)
Question Features
- Is question keyword in Q(tj)?

Table 1: Features for template retrieval.

At the phrase level, as we have mined expressions for template fragments in Section 3.2, we can extract the phrases related to each number or number pair in a test problem and match them against those expressions. For example, given a test problem to match against template (1 - n1) * n2 = x in Figure 2, we have two groups of patterns to match, corresponding to 1.0 - n1 and (1.0 - n1) * n2 respectively.

Quantity types in a problem are important. We use the unit type sequence and the normalized unit type sequence to describe the number slot types of a template. In addition, if the unit types cannot differentiate the number slots, we make use of the number "owner" as defined in Section 3.3. For example, in the sentence "The width is 3cm and the length is 5cm", we extract two quantities with unit type sequence {cm, cm} and owners {width, length}.

In addition, we consider question keywords for templates. For example, if the question keyword is "difference", then x + n1 = n2 will have a higher probability of being selected than x = n1 * n2.

We observe that in some cases a one-word difference can lead to two different templates. To handle cases in which some templates are very similar (e.g. x + n1 = n2 and n1 + n2 = x, part/whole unknown), we retrieve the top ranked N (N=3) templates as candidates for alignment in the next stage.

4.2 Alignment Ranking

For each of the top N templates from the previous stage, we generate the possible alignments A = {a1, a2, ..., am} as candidate equation systems for the test problem xi. We train a ranking model to choose the alignment with the highest probability p(ak | xi, tj; νa), where νa is the model parameter vector:

p(ak) = exp(νa · f(xi, ak)) / Σ_{ak' ∈ A} exp(νa · f(xi, ak'))

We use the same ranking model as in the template selection stage, with the objective function changed to:

(1/2) ||νa||^2 + C Σ_i l(νa^T f(xi, ak)^+ − νa^T f(xi, al)^−)

We design more fine-grained features for each number slot to formulate the alignment feature vector f(xi, ak). It contains the features shown in Table 2.

Textual Features
- Match textual expressions of the template fragment aligned to each number slot (pair)
Quantity Features
- Aligned unit sequence in Q(tj)
- Aligned normalized unit sequence in Q(tj)
- Relationship with noun phrase
- Optimal number 1 or 2 is used?
Solution Features
- Is integer solution?
- Is positive solution?

Table 2: Features for alignment ranking.
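Both stages therefore reduce to scoring candidates with a linear model over such features. The sketch below shows the alignment side: enumerate the injective assignments of a problem's numbers to a template's number slots and rank them with a toy scorer. The feature function and the hand-set weights are stand-ins for the Table 2 features and the trained Ranking SVM.

```python
from itertools import permutations

def candidate_alignments(numbers, num_slots):
    """All injective assignments of the problem's numbers to slots n1..nk.
    For m numbers and k slots this is A(m, k) = m!/(m-k)! candidates,
    matching the search-space discussion in Section 4.3."""
    return [dict(zip(range(1, num_slots + 1), combo))
            for combo in permutations(numbers, num_slots)]

def score(alignment, units, weights):
    """Toy linear scorer: one indicator feature per (slot, unit) pair.
    The real system uses the richer features of Table 2 and learned weights."""
    feats = {f"slot{slot}_unit_{units[value]}": 1.0
             for slot, value in alignment.items()}
    return sum(weights.get(name, 0.0) * v for name, v in feats.items())

# Problem 1 of Figure 1 against the template (1 - n1) * n2 = x.
numbers = [0.25, 139.99]
units = {0.25: "%", 139.99: "$"}                      # from quantity extraction
weights = {"slot1_unit_%": 1.0, "slot2_unit_$": 1.0}  # hand-set for illustration

ranked = sorted(candidate_alignments(numbers, 2),
                key=lambda a: score(a, units, weights), reverse=True)
best = ranked[0]                     # {1: 0.25, 2: 139.99}
print((1 - best[1]) * best[2])       # 104.9925, i.e. 25% off $139.99
```

In the full system the weights would come from the pairwise ranking objective above, and the score feeds the probability p(ak).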

At the textual level, we want to capture the textual expressions describing each number slot. For example, for the template (1 - n1) * n2 = x, we have mined patterns of 1 - n1 in Section 3.2, such as "a discount of n1 %", "mark down n1 %", etc. Given the problem in Figure 2 as the test problem, the alignment (1-0.28) * 275 = x matches textual expressions, while (1-275) * 0.28 = x does not.

For quantity features, we use the alignment-ordered unit sequence. For the problem in Figure 2 mapping to template (1 - n1) * n2 = x, we have two different alignments: {n1:0.28, n2:275} and {n1:275, n2:0.28}. Their aligned unit sequences are {%, $} and {$, %} respectively. We also use the relations of quantities with noun phrases to differentiate how number slots interact with unknown variable slots and with other number slots, such as n1 * x and n1 * n2.

Some templates have numerical solution properties while others do not. For example, the template x1 = (n1 - n2)/(n3 - n4) is less likely to have any strong indication of integer solution properties. We count the percentage of integer/positive solutions over the corresponding problems as the probability that a template prefers an integer/positive solution.

4.3 Model Discussion

Our method has two main differences from previous template-based methods (Kushman et al., 2014; Zhou et al., 2015; Upadhyay et al., 2016). First, previous methods implicitly model the mapping from problem text to templates. We learn fine-grained textual expressions mapped to template fragments, and explicitly model the properties of templates with template sketches. Second, previous methods align numbers to all templates in a training set, while we only examine the N most probable templates. This significantly reduces the equation candidate search space. Given a problem in which m numbers align with a template of n number slots, the number of possible equation candidates would be A_m^n. The search space grows linearly with the number of templates in the training data. Suppose m = 5, n = 4 and we have 1000 templates; the total space would be (5 * 4 * 3 * 2) * 1000 = 120,000 for one problem in Zhou et al. (2015), and would be much larger if unknown variable alignment were also considered, as in Kushman et al. (2014).

5 Experiments

Settings: As demonstrated in Huang et al. (2016), previous datasets for math problems are limited in both scale and diversity. We conduct our experiments on their dataset Dolphin18K. We use the linear subset, containing 10,644 problems in total.

We use two baseline systems for comparison: (1) ZDC (Zhou et al., 2015) is a statistical learning method that is an improved version of KAZB (Kushman et al., 2014); we ignore KAZB itself because it does not complete running on the dataset in three days. (2) SIM (Huang et al., 2016) is a simple similarity-based method. We do not compare other systems because they only solve one specific type of problem, e.g. Hosseini et al. (2014) only handles addition/subtraction problems and Koncel-Kedziorski et al. (2015) aims to solve problems with one single linear equation.

Experiments are conducted using 5-fold cross-validation with 80% of the problems randomly selected as training data and the remaining 20% for testing. We report the solution accuracy.

5.1 Overall Evaluation Results

Table 3 shows the overall performance of the different systems. In the table, the size of a template is the number of problems corresponding to that template. For example, for templates with a size of 100 or larger, the problem counts add up to 1,807.

Template Size | problems | ZDC (%) | SIM (%) | Ours (%)
>=100 | 1807 | 34.2 | 29.7 | 64.5
>=50 | 4281 | 31.1 | 27.2 | 39.3
>=20 | 5392 | 29.4 | 25.8 | 36.9
>=10 | 6216 | 25.3 | 24.6 | 35.7
>=6 | 6827 | 21.7 | 20.2 | 34.6
>=5 | 7081 | 21.6 | 20.1 | 34.3
>=4 | 7262 | 21.1 | 19.8 | 33.8
>=3 | 7466 | 20.7 | 19.7 | 33.2
>=2 | 8229 | 20.6 | 20.3 | 32.2
>=1 | 10644 | 17.9 | 18.4 | 28.4

Table 3: Overall evaluation results.

From the table, we observe that our model consistently achieves better performance than the baselines at all template sizes. As the template size becomes larger, all three systems achieve better performance. When the template size equals 6 (TS6, the de-facto template size constraint adopted in ZDC), our model achieves an absolute increase of over 12% (59% relative). This demonstrates the effectiveness of our proposed method.

When including long-tail problems with a template size less than 2, the performance of all three systems drops significantly. This is because the templates of these problems are not seen in the training set, and thus they are difficult to solve using template-based methods. Still, we have at least a 10% absolute (54% relative) accuracy increase on the whole test set compared to the two baselines. Previous template-based methods require a template size larger than 6 in the data as a constraint. From the results, we can see that our method relaxes the template size constraint and matches more problems with less frequent templates.

5.2 Accuracy per Template

Here we investigate the performance on different templates. In Table 4, we sample some dominant templates and report their accuracies. For our model, we report both the template retrieval accuracy and the final solution accuracy.

As we can see, our method performs better than the baselines for most dominant templates. Performance on the dominant templates can reach an accuracy level of 60%. This shows that our template sketch and textual expressions are effective in capturing rich template information.

To our surprise, some templates tend to perform better than others even with smaller template sizes. For example, x1 = n1 - n2, which represents the subtraction problem, has 63 problems but performs not as well as x1 = (n1 - n2)/(n3 - n4), which has 48 problems. We looked into their corresponding problems and found that x1 = n1 - n2 is applied to more themes in natural language than x1 = (n1 - n2)/(n3 - n4), which is almost exclusively about the theme of "coordinate slope".

In our model, there is a gap between template retrieval accuracy and final solution accuracy, which means that although we select the correct template candidates for a problem, the alignment model cannot always rank the equations correctly.

5.3 Two-Stage Evaluation

Next, we evaluate the performance of our two-stage system. The accuracy of template retrieval and alignment ranking is shown in Table 5.

For template retrieval accuracy, Hit@N means the correct template for a problem is included in the top N list returned by our model. We estimate the best achievable performance by using oracle template retrieval. The result is 47.1% (Hit@ALL), which means 47.1% of the templates exist more than once in the problem set. Please note that our template retrieval evaluation may be underestimated, since in some cases a test problem can be solved by different templates.

We then use the top N templates as input for both our alignment ranking and ZDC. From the table, we have the following observations: (1) Hit@3 performs better than Hit@1 for both systems. This confirms our claim that some templates are similar and we need to incorporate more fine-grained features to differentiate them in the alignment step; (2) The highest accuracy is obtained when N = 3 and it decreases when N gets larger. Both systems benefit from our template retrieval, which helps retrieve relevant templates and reduce the hypothesis space of equations; (3) Given the same N templates as input, our alignment ranking achieves better performance than ZDC. This implies that our features are more indicative.

5.4 Feature Ablation

This section describes our feature ablation study.

Template Retrieval: In Table 6, we compare three ablated configurations against our full model (FULL). Each ablated configuration corresponds to one category of the template sketch. From the table, we can see that all three categories of features contribute to system performance. Removing QUANTITY results in the worst performance compared to the FULL model.

Alignment Ranking: In Table 7, N means selecting the top N templates from the previous stage for alignment. The column "Correct Template" represents how well the alignment model can perform given the correct template as input. Our alignment model (FULL) performs the best compared to the three ablated settings, which confirms the effectiveness of the template properties.

5.5 Error Analysis

We have observed that template-based methods have difficulty solving problems with small template sizes, especially for cases that have a single problem instance (i.e. template size = 1). We sample 100 problems on which our system makes mistakes in the dev set of Dolphin18K and summarize them in Table 8.
Template | problems | ZDC (%) | SIM (%) | Ours: Template retrieval Acc (%) | Ours: Final Acc (%)
n1 * x1 = n2 | 548 | 26.3 | 23.9 | 87.0 | 58.7
n1 / x1 = n2 / n3 | 453 | 21.4 | 29.8 | 94.1 | 61.5
x1 = n1 * n2 | 403 | 23.6 | 28.0 | 78.9 | 63.4
n1 * x1 + n2 * x2 = n3; x1 + x2 = n4 | 300 | 86.3 | 69.7 | 94.9 | 85.8
x1 = n1 * n2 * n3 | 103 | 22.3 | 32.0 | 67.0 | 55.0
x1 + x2 = n1; x1 - x2 = n2 | 80 | 39.7 | 48.8 | 79.4 | 65.1
x1 = n1 - n2 | 63 | 11.7 | 15.9 | 50.7 | 23.4
x1 = (n1 - n2)/(n3 - n4) | 48 | 14.9 | 18.8 | 95.7 | 89.4

Table 4: Accuracy per template. Template retrieval Acc reports the percentage of problems whose correct template appears in the top 3 templates returned by our method.

Hit@N | 1 | 2 | 3 | 4 | 5 | 10 | 20 | 50 | ALL
Template retrieval Acc (%) | 17.5 | 22.4 | 26.3 | 27.2 | 28.0 | 30.2 | 32.7 | 35.2 | 47.1
Final Acc (%) | 24.9 | 27.6 | 28.4 | 27.9 | 27.4 | 25.3 | 22.3 | 22.1 | 20.1
ZDC (%) | 19.5 | 20.1 | 20.1 | 19.9 | 19.8 | 19.1 | 18.9 | 18.6 | 17.9

Table 5: Results of template retrieval and final accuracy with different top N templates retrieved.

Model | Hit@1 (%) | Hit@3 (%) | Hit@10 (%)
FULL | 17.5 | 26.3 | 30.2
-TEXTUAL | 14.1 | 24.7 | 28.4
-QUANTITY | 11.4 | 23.4 | 25.9
-QUESTION | 16.9 | 25.4 | 29.8

Table 6: Feature ablation of template retrieval.

Model | Correct Template (%) | N=1 (%) | N=3 (%)
FULL | 34.5 | 24.9 | 28.4
-TEXTUAL | 31.9 | 22.2 | 25.1
-QUANTITY | 29.2 | 20.9 | 23.3
-SOLUTION | 26.3 | 18.7 | 21.2

Table 7: Feature ablation of alignment ranking.

Quantity Type: The types of quantities are difficult to determine. For the example problem in the table, if we can detect that "24 male" is the same as "men", the problem can be solved.

Relation/State Detection: If we can identify the changed states or the mathematical relations between variables, we can solve this category of problems. In the example problem, it is important to understand that "commission is taken out" describes a change of the money state.

External Knowledge: This requires specific mathematical models, such as permutation and combination, or commonsense knowledge, e.g. a dice has 6 sides.

Equation Decomposition: The limitation of template-based approaches is that they require test problems to belong to one of the templates seen in training. Thus, for problems corresponding to template sizes less than 2, we can decompose templates into smaller units and conduct learning more precisely. We then need to generate the equations, which is also a challenge.

Category | Math Problem
Quantity Type (10%) | The ratio of women to men in a certain club is 3 to 2. If there are 24 male club members, then how many female club members are there?
Relation/State Detection (12%) | If I am selling something for $25,000 and a 7% commission is taken out, how much money will I be left with?
External Knowledge (23%) | Find the probability that total score is 10 or more given at least one dice show 6 if 2 dice red & blue thrown?
Equation Decomposition (55%) | The average weight of A, B and C is 45 kg. If the average weight of A and B is 40 kg and that of B and C is 43 kg, the weight of B is?

Table 8: Error Categorization.

6 Conclusion and Future Work

In this paper, we propose a novel approach to solving math word problems with rich template information. We learn mappings between textual expressions and template fragments. Furthermore, we automatically construct sketches for each template. We implement a two-stage system, including template retrieval and alignment ranking. Experiments show that our method performs significantly better than two state-of-the-art systems.

Based on our error analysis, we plan to improve our model by detecting quantity types more accurately, learning relations, and incorporating commonsense knowledge. For long-tail problems with a template size less than 2, we want to utilize the fine-grained expressions we have learned and decompose equations for learning. Then we can reason with small equation units to generate a final equation at test time. We would also like to leverage semantic parsing and transform math problems into a more structured representation. Additionally, we plan to apply our findings to generating math problems.

7 Acknowledgments

This work is supported by the National Natural Science Foundation of China (61472453, U1401256, U1501252, U1611264). Thanks to the anonymous reviewers for their helpful comments and suggestions.

References

Yefim Bakman. 2007. Robust understanding of word problems with extraneous information. http://arxiv.org/abs/math/0701393.

Daniel G. Bobrow. 1964a. Natural language input for a computer problem solving system. Technical report, Cambridge, MA, USA.

Daniel G. Bobrow. 1964b. Natural language input for a computer problem solving system. Ph.D. Thesis.

Diane J. Briars and Jill H. Larkin. 1984. An integrated model of skill in solving elementary word problems. Cognition and Instruction, 1(3):245–296.

Eugene Charniak. 1968. Carps, a program which solves calculus word problems. Technical report.

Eugene Charniak. 1969. Computer solution of calculus word problems. In Proceedings of the 1st International Joint Conference on Artificial Intelligence, pages 303–316.

Denise Dellarosa. 1986. A computer simulation of children's arithmetic word-problem solving. Behavior Research Methods, Instruments, & Computers, 18(2):147–154.

Charles R. Fletcher. 1985. Understanding and solving arithmetic word problems: A computer simulation. Behavior Research Methods, Instruments, & Computers, 17(5):565–571.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 2000. Large Margin Rank Boundaries for Ordinal Regression, chapter 7.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Christian Liguda and Thies Pfeiffer. 2012. Modeling math word problems with augmented semantic networks. In Natural Language Processing and Information Systems. International Conference on Applications of Natural Language to Information Systems (NLDB-2012), pages 247–252.

Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Anirban Mukherjee and Utpal Garain. 2008. A review
of methods for automatic understanding of natural
language mathematical problems. Artificial Intelli-
gence Review, 29(2):93–122.
Subhro Roy and Dan Roth. 2015. Solving general
arithmetic word problems. In Proceedings of the
2015 Conference on Empirical Methods in Natural
Language Processing, pages 1743–1752. The Asso-
ciation for Computational Linguistics.
Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren
Etzioni, and Clint Malcolm. 2015. Solving geome-
try problems: Combining text and diagram interpre-
tation. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Process-
ing.
Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang
Liu, and Yong Rui. 2015. Automatically solving
number word problems by semantic parsing and rea-
soning. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Process-
ing.
Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang,
and Wen-tau Yih. 2016. Learning from explicit and
implicit supervision jointly for algebra word prob-
lems. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Process-
ing.

Ma Yuhui, Zhou Ying, Cui Guangzuo, Ren Yun, and
Huang Ronghuai. 2010. Frame-based calculus of
solving arithmetic multistep addition and subtrac-
tion word problems. Education Technology and
Computer Science, International Workshop, 2:476–
479.
Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015.
Learn to solve algebra word problems using
quadratic programming. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan-
guage Processing.

