
Reassessing the Goals of Grammatical Error Correction:

Fluency Instead of Grammaticality

Keisuke Sakaguchi¹, Courtney Napoles¹, Matt Post², and Joel Tetreault³

¹Center for Language and Speech Processing, Johns Hopkins University
²Human Language Technology Center of Excellence, Johns Hopkins University
³Yahoo
{keisuke,napoles,post}@cs.jhu.edu, tetreaul@yahoo-inc.com

Abstract

The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC's reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.

1 Introduction

What is the purpose of grammatical error correction (GEC)? One response is that GEC aims to help people become better writers by correcting grammatical mistakes in their writing. In the NLP community, the original scope of GEC was correcting targeted error types with the goal of providing feedback to non-native writers (Chodorow and Leacock, 2000; Dale and Kilgarriff, 2011; Leacock et al., 2014). As systems improved and more advanced methods were applied to the task, the definition evolved to whole-sentence correction, or correcting all errors of every error type (Ng et al., 2014). With this pivot, we urge the community to revisit the original question.

It is often the case that writing exhibits problems that are difficult to ascribe to specific grammatical categories. Consider the following example:

Original: From this scope , social media has shorten our distance .
Corrected: From this scope , social media has shortened our distance .

If the goal is to correct verb errors, the grammatical mistake in the original sentence has been addressed and we can move on. However, when we aim to correct the sentence as a whole, a more vexing problem remains. The more prominent error has to do with how unnaturally this sentence reads. The meanings of words and phrases like scope and the corrected shortened our distance are clear, but this is not how a native English speaker would use them. A more fluent version of this sentence would be the following:

Fluent: From this perspective , social media has shortened the distance between us .

This issue argues for a broader definition of grammaticality that we will term native-language fluency, or simply fluency. One can argue that traditional understanding of grammar and grammar correction encompasses the idea of native-language fluency. However, the metrics commonly used in evaluating GEC undermine these arguments. The performance of GEC systems is typically evaluated using metrics that compute corrections against error-coded corpora, which impose a taxonomy of types of grammatical errors.


Assigning these codes can be difficult, as evidenced by the low agreement found between annotators of these corpora. It is also quite expensive. But most importantly, as we will show in this paper, annotating for explicit error codes places a downward pressure on annotators to find and fix concrete, easily-identifiable grammatical errors (such as wrong verb tense) in lieu of addressing the native fluency of the text.

A related problem is the presence of multiple evaluation metrics computed over error-annotated corpora. Recent work has shown that metrics like M² and I-measure, both of which require error-coded corpora, produce dramatically different results when used to score system output and produce a ranking of systems in conventional competitions (Felice and Briscoe, 2015).

In light of all of this, we suggest that the GEC task has overlooked a fundamental question: What are the best practices for corpus annotation and system evaluation? This work attempts to answer this question. We show that native speakers prefer text that exhibits fluent sentences over ones that have only minimal grammatical corrections. We explore different methods for corpus annotation (with and without error codes, written by experts and non-experts) and different evaluation metrics to determine which configuration of annotated corpus and metric has the strongest correlation with the human ranking. In so doing, we establish a reliable and replicable evaluation procedure to help further the advancement of GEC methods.[1] To date, this is the only work to undertake a comprehensive empirical study of annotation and evaluation. As we will show, the two areas are intimately related.

Fundamentally, this work reframes grammatical error correction as a fluency task. Our proposed evaluation framework produces system rankings with strong to very strong correlations with human judgments (Spearman's ρ = 0.82, Pearson's r = 0.73), using a variation of the GLEU metric (Napoles et al., 2015)[2] and two sets of "fluent" sentence rewrites as a gold standard, which are simpler and cheaper to collect than previous annotations.

[1] All the scripts and new data we collected are available at https://github.com/keisks/reassess-gec.
[2] This metric should not be confused with the method of the same name presented in Mutton et al. (2007) for sentence-level fluency evaluation.

2 Current issues in GEC

In this section, we will address issues of the GEC task, reviewing previous work with respect to error annotation and evaluation metrics.

2.1 Annotation methodologies

Existing corpora for GEC are annotated for errors using fine-grained coding schemes. To create error-coded corpora, trained annotators must identify spans of text containing an error, assign codes corresponding to the error type, and provide corrections to those spans for each error in the sentence.

One of the main issues with coded annotation schemes is the difficulty of defining the granularity of error types. These sets of error tags are not easily interchangeable between different corpora. Specifically, two major GEC corpora have different taxonomies: the Cambridge Learner Corpus (CLC) (Nicholls, 2003) has 80 tags, which generally represent the word class of the error and the type of error (such as replace preposition, unnecessary pronoun, or missing determiner). In contrast, the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) has only 27 error types. A direct conversion between them, if possible, would be very complex. Additionally, it is difficult for annotators to agree on error annotations, which complicates the annotation validity as a gold standard (Leacock et al., 2014). This is due to the nature of grammatical error correction, where there can be diverse correct edits for a sentence (Figure 1). In other words, there is no single gold-standard correction. The variety of error types and potential correct edits result in very low inter-annotator agreement (IAA), as reported in previous studies (Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010; Bryant and Ng, 2015).

This leads to a more fundamental question: why do we depend so much on fine-grained, low-consensus error-type annotations as a gold standard for evaluating GEC systems?

One answer is that error tags can be informative and useful to provide feedback to language learners, especially for specific closed-class error types (such as determiners and prepositions).
Figure 1: An ungrammatical sentence that can be corrected in different ways. (The figure shows the sentence "As the development of the technology , social media becomes more and more significant role in the whole world ." together with several alternative corrections of its spans.)

Indeed, the CLC, the first large-scale corpus of annotated grammatical errors, was coded specifically with the intent of gathering statistics about errors to inform the development of tools to help English language learners (Nicholls, 2003). Later GEC corpora adhered to the same error-coding template, if not the same error types (Rozovskaya and Roth, 2010; Yannakoudakis et al., 2011; Dahlmeier et al., 2013).

The first shared task in GEC aspired to the CLC's same objective: to develop tools for language learners (Dale and Kilgarriff, 2011). Subsequent shared tasks (Dale et al., 2012; Ng et al., 2013) followed suit, targeting specific error types. Error-coded corpora are effective training and evaluation data for targeted error correction, and statistical classifiers have been developed to handle errors involving closed-class words (Rozovskaya and Roth, 2014). However, the 2014 CoNLL shared task engendered a sea change in GEC: in this shared task, systems needed to correct all errors in a sentence, of all error types, including ones more stylistic in nature (Ng et al., 2014). The evaluation metrics and annotated data from the previous shared task were used; however, we argue that they do not align with the use case of this reframed task. What is the use case of whole-sentence correction? It should not be to provide specific targeted feedback on error types, but rather to rewrite sentences as a proofreader would.

The community has already begun to view whole-sentence correction as a task, with the yet unstated goal of improving the overall fluency of sentences. Independent papers published human evaluations of the shared task system output (Napoles et al., 2015; Grundkiewicz et al., 2015), asking judges to rank systems based on their grammaticality. As GEC moves toward correcting an entire sentence instead of targeted error types, the myriad acceptable edits will result in much lower IAA, compromising evaluation metrics based on the precision and recall of coded errors. At this juncture, it is crucial that we examine whether error-coded corpora and evaluation are necessary for this new direction of GEC.

Finally, it would be remiss not to address the cost and time of corpus annotation. Tetreault and Chodorow (2008) noted that it would take 80 hours to correct 1,000 preposition errors by one trained annotator. Bryant and Ng (2015) reported that it took about three weeks (504 hours) to collect 7 independent annotations for 1,312 sentences, with all 28 CoNLL-2014 error types annotated. Clearly, constructing a corpus with fine-grained error annotations is a labor-intensive process. Due to the time and cost of annotation, the corpora currently used in the community are few and tend to be small, hampering robust evaluations as well as limiting the power of statistical models for generating corrections. If an effective method could be devised to decrease time or cost, larger corpora—and more of them—could be created. There has been some work exploring this, namely Tetreault and Chodorow (2008), which used a sampling approach that would only work for errors involving closed-class words. Pavlick et al. (2014) also describe preliminary work into designing an improved crowdsourcing interface to expedite data collection of coded errors.

Section 3 outlines our annotation approach, which is faster and cheaper than previous approaches because it does not make use of error coding.

2.2 Evaluation practices

Three evaluation metrics[3] have been proposed for GEC: MaxMatch (M²) (Dahlmeier and Ng, 2012), I-measure (Felice and Briscoe, 2015), and GLEU (Napoles et al., 2015). The first two compare the changes made in the output to error-coded spans of the reference corrections. M² was the metric used for the 2013 and 2014 CoNLL GEC shared tasks (Ng et al., 2013; Ng et al., 2014). It captures word- and phrase-level edits by building an edit lattice and calculating an F-score over the lattice.

[3] Not including the metrics of the HOO shared tasks, which were precision, recall, and F-score.
Felice and Briscoe (2015) note problems with M²: specifically, it does not distinguish between a "do-nothing baseline" and systems that only propose wrong corrections; also, phrase-level edits can be easily gamed because the lattice treats the deletion of a long phrase as a single edit. To address these issues, they propose I-measure, which generates a token-level alignment between the source sentence, system output, and gold-standard sentences, and then computes accuracy based on the alignment.

Unlike these approaches, GLEU does not use error-coded references[4] (Napoles et al., 2015). Based on BLEU (Papineni et al., 2002), it computes n-gram precision of the system output against reference sentences. GLEU additionally penalizes text in the output that was unchanged from the source but changed in the reference sentences.

[4] We use the term references to refer to the corrected sentences, since the term gold standard suggests that there is just one right correction.

Recent work by Napoles et al. (2015) and Grundkiewicz et al. (2015) evaluated these metrics against human evaluations obtained using methods borrowed from the Workshop on Statistical Machine Translation (Bojar et al., 2014). Both papers found a moderate to strong correlation with human judgments for GLEU and M², and a slightly negative correlation for I-measure. Importantly, however, none of these metrics achieved as high a correlation with the human oracle ranking as desired in a fully reliable metric.

In Section 4, we examine the available metrics over different types of reference sets to identify an evaluation setup nearly as reliable as human experts.

3 Creating a new, fluent GEC corpus

We hypothesize that human judges, when presented with two versions of a sentence, will favor fluent versions over ones that exhibit only technical grammaticality.

By technical grammaticality, we mean adherence to an accepted set of grammatical conventions. In contrast, we consider a text to be fluent when it looks and sounds natural to a native-speaking population. Both of these terms are hard to define precisely, and fluency especially is a nuanced concept for which there is no checklist of criteria to be met.[5] To illustrate the intuition, Table 1 contains examples of sentences that are one, both, or neither. A text does not have to be technically grammatical to be considered fluent, although in almost all cases, fluent texts are also technically grammatical. In the rest of this paper, we will demonstrate how they are quantifiably different with respect to GEC.

[5] It is important to note that both grammaticality and fluency are determined with respect to a particular speaker population and setting. In this paper, we focus on Standard Written English, which is the standard used in education, business, and journalism. While judgments of individual sentences would differ for other populations and settings (for example, spoken African-American Vernacular English), the distinction between grammaticality and fluency would remain.

Annotating coded errors encourages a minimal set of edits because more substantial edits often address overlapping and interacting errors. For example, the annotators of the NUCLE corpus, which was used for the recent shared tasks, were explicitly instructed to select the minimal text span of possible alternatives (Dahlmeier et al., 2013). There are situations where error-coded annotations are useful to help students correct specific grammatical errors. The ability to do this with the non-error-coded, fluent annotations we advocate here is no longer direct, but is not lost entirely. For this purpose, some recent studies have proposed post hoc automated error-type classification methods (Swanson and Yamangil, 2012; Xue and Hwa, 2014), which compare the original sentence to its correction and deduce the error types.

We speculate that, by removing the error-coding restraint, we can obtain edits that sound more fluent to native speakers while also reducing the expense of annotation, with diminished time and training requirements. Chodorow et al. (2012) and Tetreault et al. (2014) suggested that it is better to have a large number of annotators to reduce bias in automatic evaluation. Following this recommendation, we collected additional annotations without error codes, written by both experts and non-experts.
Table 1: Examples and counterexamples of technically grammatical and fluent sentences.
  Fluent, technically grammatical: In addition, it is impractical to make such a law.
  Fluent, not technically grammatical: I don't like this book, it's really boring.
  Not fluent, technically grammatical: Firstly , someone having any kind of disease belongs to his or her privacy .
  Not fluent, not technically grammatical: It is unfair to release a law only point to the genetic disorder.

Table 2: An example sentence with expert and non-expert fluency edits.
  Original: Genetic disorder may or may not be hirataged hereditary disease and it is sometimes hard to find out one has these kinds of diseases .
  Expert fluency: A genetic disorder may or may not be a hereditary disease , and it is sometimes hard to find out whether one has these kinds of diseases .
  Non-expert fluency: Genetic factors can manifest overtly as disease , or simply be carried , making it hard , sometimes , to find out if one has a genetic predisposition to disease .

3.1 Data collection

We collected a large set of additional human corrections to the NUCLE 3.2 test set,[6] which was used in the 2014 CoNLL Shared Task on GEC (Ng et al., 2014) and contains 1,312 sentences error-coded by two trained annotators. Bryant and Ng (2015) collected an additional eight annotations using the same error-coding framework, referred to here as BN15.

We collected annotations from both experts and non-experts. The experts[7] were three native English speakers familiar with the task. To ensure that the edits were clean and meaning-preserving, each expert's corrections were inspected by a different expert in a second pass. For non-experts, we used crowdsourcing, which has shown potential for annotating closed-class errors as effectively as experts (Tetreault et al., 2010; Madnani et al., 2011; Tetreault et al., 2014). We hired 14 participants on Amazon Mechanical Turk (MTurk) who had a HIT approval rate of at least 95% and were located in the United States. The non-experts went through an additional screening process: before completing the task, they wrote corrections for five sample sentences, which were checked by the three experts.[8]

[6] www.comp.nus.edu.sg/~nlp/conll14st.html
[7] All of the expert annotators are authors of this work.
[8] The experts verified that the participants were following the instructions and not gaming the HITs.

We collected four complete sets of annotations by both types of annotators: two sets of minimal edits, designed to make the original sentences technically grammatical (following the NUCLE annotation instructions but without error coding), and two sets of fluency edits, designed to elicit native-sounding, fluent text. The instructions were:

• Minimal edits: Make the smallest number of changes so that each sentence is grammatical.

• Fluency edits: Make whatever changes necessary for sentences to appear as if they had been written by a native speaker.

In total, we collected 8 (2 × 2 × 2) annotations for each original sentence (minimal and fluency, expert and non-expert, two corrections each). Of the original 1,312 sentences, the experts flagged 34 sentences that needed to be merged together, so we skipped these sentences in our analysis and experiments. In the next two subsections we compare the changes made under both the fluency and minimal edit conditions (Section 3.2) and show how humans rate corrections made by experts and non-experts in both settings (Section 3.3).

3.2 Edit analysis
When people (both experts and non-experts) are asked to make minimal edits, they make few changes to the sentences and also change fewer of the sentences. Fluency edits show the opposite effect, with non-experts taking more liberties than experts with both the number of sentences changed and the degree of change within each sentence (see Table 2 for an extreme example of this phenomenon).

Table 3: An example sentence with the original NUCLE correction and fluency and minimal edits written by experts and non-experts.
  Original: Some family may feel hurt , with regards to their family pride or reputation , on having the knowledge of such genetic disorder running in their family .
  NUCLE: Some family members may feel hurt with regards to their family pride or reputation on having the knowledge of a genetic disorder running in their family .
  Expert fluency: On learning of such a genetic disorder running in their family , some family members may feel hurt regarding their family pride or reputation .
  Non-expert fluency: Some relatives may be concerned about the family 's reputation – not to mention their own pride – in relation to this news of familial genetic defectiveness .
  Expert minimal: Some families may feel hurt with regards to their family pride or reputation , on having knowledge of such a genetic disorder running in their family .
  Non-expert minimal: Some family may feel hurt with regards to their family pride or reputation on having the knowledge of such genetic disorder running in their family .

In order to quantify the extent of changes made in the different annotations, we look at the percent of sentences that were left unchanged as well as the number of changes needed to transform the original sentence into the corrected annotation. To calculate the number of changes, we used a modified Translation Edit Rate (TER), which measures the number of edits needed to transform one sentence into another (Snover et al., 2006). An edit can be an insertion, deletion, substitution, or shift. We chose this metric because it counts the movement of a phrase (a shift) as one change, which the Levenshtein distance would heavily penalize. TER is calculated as the number of changes per token, but instead we report the number of changes per sentence for ease of interpretation, which we call the sTER.
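As a rough illustration of the sTER idea, the sketch below counts word-level edit operations per sentence rather than per token. It is a simplification under stated assumptions: it uses a plain dynamic-programming alignment (insertions, deletions, substitutions) and omits TER's block-shift search, which the actual computation relies on to charge a moved phrase as a single edit; the function names are ours, not taken from the released scripts.

```python
def edit_ops(source, target):
    """Count word-level insertions, deletions, and substitutions needed to
    turn `source` into `target` (standard dynamic-programming alignment).
    Unlike TER, this sketch has no notion of phrase shifts, so a moved
    phrase is charged as several edits rather than one."""
    m, n = len(source), len(target)
    # dp[i][j] = minimum edits to turn source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

def mean_ster(originals, corrections):
    """Edits per sentence (not per token), averaged over sentence pairs."""
    pairs = list(zip(originals, corrections))
    return sum(edit_ops(o.split(), c.split()) for o, c in pairs) / len(pairs)

# Example: a single substitution ("shorten" -> "shortened") gives sTER = 1.0.
print(mean_ster(["social media has shorten our distance"],
                ["social media has shortened our distance"]))
```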
We compare the original set of sentences to the new annotations and the existing NUCLE and BN15 reference sets to determine the relative extent of changes made by the fluency and minimal edits (Figure 2). Compared to the original, non-experts had a higher average sTER than experts, meaning that they made more changes per sentence. For fluency edits, experts and non-experts changed approximately the same number of sentences, but the non-experts made about seven edits per sentence compared to the experts' four. Minimal edits by both experts and non-experts exhibit a similar degree of change from the original sentences, so further qualitative assessment is necessary to understand whether the annotators differ. Table 3 contains an example of how the same ungrammatical sentence was corrected using both minimal and fluency edits, as well as one of the original NUCLE corrections.

Figure 2: Amount of changes made by different annotation sets compared to the original sentences. (The figure plots, for NUCLE, BN15, E-minimal, N-minimal, E-fluency, N-fluency, and the system output, the percent of sentences changed and the mean sTER broken down into insertions, deletions, substitutions, and phrases shifted.)

The error-coded annotations of NUCLE and BN15 fall somewhere in between the fluency and minimal edits in terms of sTER. The most conservative set of sentences is the system output of the CoNLL 2014 shared task, with sTER = 1.4, or approximately one change made per sentence. In contrast, the most conservative human annotations, the minimal edits, edited a similar percent of the sentences but made about two changes per sentence.

When there are multiple annotators working on the same data, one natural question is the inter-annotator agreement (IAA). For GEC, IAA is often low and arguably not an appropriate measure of agreement (Bryant and Ng, 2015). Additionally, it would be difficult, if at all possible, to reliably calculate IAA without coded alignments between the new and original sentences. Therefore, we look at two alternate measures: the percent of sentences to which different annotators made the same correction(s) and the sTER between two annotators' corrections, reported in Table 4.

Table 4: A comparison of annotations across different annotators (E for expert, N for non-expert). Where there were more than two annotators, statistics are over the full pairwise set. Identical refers to the percentage of sentences where both annotators made the same correction and sTER is the mean sTER between the annotators' corrections.

  Edit type   Annotator   Identical   sTER
  Fluency     E1 v. E2    15.3%       5.1
              N1 v. N2    5.9%        10.0
              E v. N      8.5%        7.9
  Minimal     E1 v. E2    38.7%       1.7
              N1 v. N2    21.8%       2.9
              E v. N      25.9%       2.4
  NUCLE       A v. B      18.8%       3.3

As we expect, there is notably lower agreement between the annotators for fluency edits than for minimal edits, due to the presumably smaller set of required versus optional stylistic changes. Expert annotators produced the same correction on 15% of the fluency edits, but more than 38% of their minimal edits were identical. Half of these identical sentences were unchanged from the original. There was lower agreement between non-expert annotators than experts on both types of edits. We performed the same calculations between the two NUCLE annotators and found that they had agreement rates similar to the non-expert minimal edits. However, the experts' minimal edits have much higher consensus than both the non-experts' and NUCLE, with twice as many identical corrected sentences and half the sTER.

From this analysis, one could infer that the expert annotations are more reliable than the non-expert because there are fewer differences between annotators and fewer changes per sentence.

3.3 Human evaluation

As an additional validation, we ran a task to establish the relative quality of the new fluency and minimal-edit annotations using crowdsourcing via MTurk. Participants needed to be in the United States with a HIT approval rate of at least 95% and pass a preliminary ranking task, graded by the authors. We randomly selected 300 sentences and asked participants to rank the new annotations, one randomly selected NUCLE correction, and the original sentence in order of grammaticality and meaning preservation (that is, a sentence that is well-formed but changes the meaning of the original source should have a lower rank than one that is equally well-formed but maintains the original meaning). Since we were comparing the minimal edits to the fluency edits, we did not define the term grammaticality, but instead relied on the participants' understanding of the term. Each sentence was ranked by two different judges, for a total of 600 rankings, yielding 7,795 pairwise comparisons.

To rank systems, we use the TrueSkill approach (Herbrich et al., 2006; Sakaguchi et al., 2014), based on a protocol established by the Workshop on Machine Translation (Bojar et al., 2014; Bojar et al., 2015). For each competing system, TrueSkill infers the absolute system quality from the pairwise comparisons, representing each as the mean of a Gaussian. These means can then be sorted to rank systems. By running TrueSkill 1,000 times using bootstrap resampling and producing a system ranking each time, we collect a range of ranks for each system. We can then cluster systems according to non-overlapping rank ranges (Koehn, 2012) to produce the final ranking, allowing ties.
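The bootstrap-and-cluster step described above can be sketched as follows. This is not the paper's implementation: `infer_ranking` is a placeholder for a TrueSkill model fit to the pairwise judgments, and the clustering is a simplified reading of the non-overlapping rank-range criterion attributed to Koehn (2012).

```python
import random

def bootstrap_rank_ranges(systems, comparisons, infer_ranking, iters=1000):
    """For every system, collect the range of ranks it receives when the
    ranking procedure is rerun on `iters` bootstrap resamples of the pairwise
    comparisons (each comparison being, e.g., a (winner, loser) tuple)."""
    ranges = {s: [] for s in systems}
    for _ in range(iters):
        sample = [random.choice(comparisons) for _ in comparisons]
        ranking = infer_ranking(systems, sample)   # best-to-worst list of systems
        for rank, system in enumerate(ranking, start=1):
            ranges[system].append(rank)
    return {s: (min(r), max(r)) for s, r in ranges.items()}

def cluster_by_rank_range(rank_ranges):
    """Group systems whose rank ranges overlap into the same cluster;
    systems within one cluster are reported as tied."""
    ordered = sorted(rank_ranges.items(), key=lambda kv: kv[1])
    clusters, current, current_hi = [], [], -1
    for system, (lo, hi) in ordered:
        if current and lo > current_hi:   # no overlap with the open cluster
            clusters.append(current)
            current = []
        current.append(system)
        current_hi = max(current_hi, hi)
    if current:
        clusters.append(current)
    return clusters
```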
Table 5 shows the ranking of "grammatical" judgments for the additional annotations and the original NUCLE annotations. While the score of the expert fluency edits is higher than the non-expert fluency, they are within the same cluster, suggesting that the judges perceived them to be just as good. The fluency rewrites by both experts and non-experts are clearly preferable over the minimal edit corrections, although the error-coded NUCLE corrections are perceived as more grammatical than the minimal corrections.

Table 5: Human ranking of the new annotations by grammaticality. Lines between systems indicate clusters according to bootstrap resampling at p ≤ 0.05. Systems in the same cluster are considered to be tied.

  #   Score    Range   Annotation type
  1   1.164    1–2     Expert fluency
      0.976    1–2     Non-expert fluency
  3   0.540    3       NUCLE
  4   0.265    4       Expert minimal
  5   -0.020   5       Non-expert minimal
  6   -2.925   6       Original sentence

4 Automatic metrics

We have demonstrated that humans prefer fluency edits to error-coded and minimal-edit corrections, but it is unclear whether these annotations are an effective reference for automatic evaluation. The broad range of changes that can be made with non-minimal edits may make it especially challenging for current automatic evaluation metrics to use. In this section, we investigate the impact that different reference sets have on the system ranking found by different evaluation metrics. With reference sets having such different characteristics, the natural question is: which reference and evaluation metric pairing best reflects human judgments of grammaticality?

To answer this question, we performed a comprehensive investigation of existing metrics and annotation sets to evaluate the 12 system outputs made public from the 2014 CoNLL Shared Task. To our knowledge, this is the first time that the interplay of annotation scheme and evaluation metric, as well as the rater expertise, has been evaluated jointly for GEC.

4.1 Experimental setup

The four automatic metrics that we investigate are M², I-measure,[9] GLEU, and BLEU. We include the machine-translation metric BLEU because evaluating against our new non-coded annotations is similar to machine-translation evaluation, which considers overlap instead of absolute alignment between the output and reference sentences.

[9] We ran I-measure with the -nomix flag, preventing the algorithm from finding the optimal alignment across all possible edits. Alignment was very memory-intensive and time consuming, even when skipping long sentences.

For the M² and I-measure evaluations, we aligned the fluency and minimal edits to the original sentences using a Levenshtein edit distance algorithm.[10] Neither metric makes use of the annotation labels, so we simply assigned dummy error codes.

[10] Costs for insertion, deletion, and substitution are set to 1, allowing partial match (e.g. same lemma).

Our GLEU implementation differs from that of Napoles et al. (2015). We use a simpler, modified version: precision is the number of candidate (C) n-grams that match the reference (R) n-grams, minus the counts of n-grams found more often in the source (S) than the reference (Equation 1). Because the number of possible reference n-grams increases as more reference sets are used, we calculate an intermediate GLEU by drawing a random sample from one of the references and report the mean score over 500 iterations.[11]

[11] Running all iterations, it takes less than 30 seconds to evaluate 1,000 sentences.

We compare the system outputs to each of the six annotation sets and a seventh set containing all of the annotations, using each metric. We ranked the systems based on their scores using each metric–annotation-set pair, and thus generated a total of 28 different rankings (4 metrics × 7 annotation sets).

To determine the best metric, we compared the system-level ranking obtained from each evaluation technique against the expert human ranking reported in Grundkiewicz et al. (2015), Table 3c.
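Comparing a metric-induced system ranking against the expert ranking reduces to the rank and score correlations reported later in Table 6. A minimal sketch using SciPy is below; the system names and score values are illustrative placeholders, not the paper's data.

```python
from scipy.stats import spearmanr, pearsonr

# Hypothetical scores for the same systems under two evaluations:
# an expert human ranking (higher = better) and one metric-reference pair.
systems       = ["AMU", "CAMB", "RAC", "CUUI", "POST"]
human_scores  = [0.62, 0.57, 0.41, 0.39, 0.33]   # illustrative values only
metric_scores = [66.1, 67.4, 60.2, 61.8, 58.0]   # illustrative values only

rho, _ = spearmanr(human_scores, metric_scores)  # rank correlation
r, _ = pearsonr(human_scores, metric_scores)     # linear correlation
print(f"Spearman's rho = {rho:.3f}, Pearson's r = {r:.3f}")
```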
\[
p_n^{*} \;=\; \frac{\displaystyle\sum_{\mathrm{ngram}\,\in\,\{C \cap R\}} \mathrm{count}_{C,R}(\mathrm{ngram}) \;-\; \sum_{\mathrm{ngram}\,\in\,\{C \cap S\}} \max\bigl[\,0,\; \mathrm{count}_{C,S}(\mathrm{ngram}) - \mathrm{count}_{C,R}(\mathrm{ngram})\,\bigr]}{\displaystyle\sum_{\mathrm{ngram}\,\in\,\{C\}} \mathrm{count}(\mathrm{ngram})}
\tag{1}
\]

\[
\mathrm{count}_{A,B}(\mathrm{ngram}) \;=\; \min\bigl(\#\text{ occurrences of ngram in } A,\; \#\text{ occurrences of ngram in } B\bigr)
\]
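To make Equation (1) concrete, the sketch below computes the modified n-gram precision p*_n for a single order n from tokenized candidate, reference, and source sentences. The function names are ours, and clipping the numerator at zero is our assumption (as written, the numerator can go negative); per Section 4.1, the full metric combines these precisions over n-gram orders in the style of BLEU and, with multiple reference sets, averages scores over 500 random draws of a single reference.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, source, n):
    """p*_n from Equation (1), where count_{A,B}(g) = min(count of g in A, in B)."""
    C = ngram_counts(candidate, n)
    R = ngram_counts(reference, n)
    S = ngram_counts(source, n)
    # Candidate n-grams that also appear in the reference, clipped by R.
    matches = sum((C & R).values())
    # Penalty for n-grams shared with the source more often than the reference
    # allows, i.e., text left unchanged that the reference changed.
    penalty = sum(max(0, min(C[g], S[g]) - min(C[g], R[g])) for g in (C & S))
    total = sum(C.values())
    return max(0.0, matches - penalty) / total if total else 0.0

src = "From this scope , social media has shorten our distance .".split()
ref = "From this perspective , social media has shortened the distance between us .".split()
hyp = "From this scope , social media has shortened our distance .".split()
print(modified_precision(hyp, ref, src, n=2))
```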

Figure 3: Correlation of the human ranking with metric scores over different reference sets (Spearman's ρ). The number of annotations per sentence in each set is in parentheses. See Table 6 for the numeric values.

4.2 Results

Figure 3 and Table 6 show the correlation of the expert rankings with all of the evaluation configurations. For the leading metrics, M² and GLEU, the expert annotations had stronger positive correlations than the non-expert annotations. Using just two expert fluency annotations with GLEU has the strongest correlation with the human ranking out of all other metric–reference pairings (ρ = 0.819, r = 0.731), and it is additionally cheaper and faster to collect. E-fluency is the third-best reference set with M², which does better with minimal changes: the reference sets with the strongest correlations for M² are E-minimal (ρ = 0.775) and NUCLE (r = 0.677). Even though the non-expert fluency edits had more changes than the expert fluency edits, they still did reasonably well using both M² and GLEU.

Table 6: Correlation between the human ranking and metric scores over different reference sets. The first line of each cell is Spearman's ρ and the second line is Pearson's r. The strongest correlations for each metric are starred, and the overall strongest correlations are in bold in the original.

               M²       GLEU     I-measure   BLEU
  NUCLE     ρ  0.725    0.626    -0.423      -0.456
            r  0.677*   0.646    -0.313      -0.310
  BN15      ρ  0.692    0.720    -0.066      -0.319*
            r  0.641    0.697    -0.007      -0.255
  E-fluency ρ  0.758    0.819*   -0.297      -0.385
            r  0.665    0.731*   -0.256      -0.230*
  N-fluency ρ  0.703    0.676    -0.451      -0.451
            r  0.655    0.668    -0.319      -0.388
  E-min.    ρ  0.775*   0.786    -0.467      -0.456
            r  0.655    0.676    -0.385      -0.396
  N-min.    ρ  0.769    -0.187   -0.467      -0.495
            r  0.641    -0.110   -0.402      -0.473
  All       ρ  0.692    0.725    -0.055*     -0.462
            r  0.617    0.724     0.061*     -0.314

The GLEU metric has strongest correlation when comparing against the E-fluency, BN15, E-minimal, and "All" reference sets. One could argue that, except for E-minimal, these references all have greater diversity of edits than NUCLE and minimal edits. Although BN15 has fewer changes made per sentence than the fluency edits, because of the number of annotators, the total pool of n-grams seen per sentence increases. E-minimal edits also have strong correlation, suggesting there may be a trade-off between quantity and quality of references.

A larger number of references could improve performance for GLEU. Because fluency edits tend to have more variations than error-coded minimal-edit annotations, it is not obvious how many fluency edits are necessary to cover the full range of possible corrections. To address this question, we ran an additional small-scale experiment, where we collected 10 non-expert fluency edits for 20 sentences and computed the average GLEU scores of the submitted systems against an increasing number of these fluency references. The result (Figure 5) shows that the GLEU score increases with more fluency references, but the effect starts to level off when there are at least 4 references, suggesting that 4 references cover the majority of possible changes. A similar pattern was observed by Bryant and Ng (2015) in error-coded annotations with the M² metric.

Figure 5: Mean GLEU scores with different numbers of fluency references. The red line corresponds to the average GLEU score of the 12 GEC systems and the vertical bars show the maximum and minimum GLEU scores.
The reference sets against which M² has the strongest correlation are NUCLE, expert fluency, and expert minimal edits. Even non-expert fluency annotations result in a stronger correlation with the human metric than BN15. These findings support the use of fluency edits even with a metric designed for error-coded corpora.

One notable difference between M² and GLEU is their relative performance using non-expert minimal edits as a metric. M² is robust to the non-expert minimal edits and, as a reference set, this achieves the second strongest Spearman's correlation for this metric. However, pairing the non-expert minimal edits with GLEU results in slightly negative correlation. This is an unexpected result, as there is sizable overlap between the non-expert and expert minimal edits (Table 4). We speculate that this difference may be due to the quality of the non-expert minimal edits. Recall that humans perceived these sentences to be worse than the other annotations, and better only than the original sentence (Table 5).

I-measure and BLEU are shown to be unfavorable for this task, having negative correlation with the human ranking, which supports the findings of Napoles et al. (2015) and Grundkiewicz et al. (2015). Even though BLEU and GLEU are both based on the n-gram overlap between the hypothesis and original sentences, GLEU has strong positive correlations with human rankings while BLEU has a moderate negative correlation. The advantage of GLEU is that it penalizes n-grams in the system output that were present in the input sentence and absent from the reference. In other words, a system loses credit for missing n-grams that should have been changed. BLEU has no such penalty and instead only rewards n-grams that occur in the references and the output, which is a problem in same-language text rewriting tasks where there is significant overlap between the reference and the original sentences. For this data, BLEU assigns a higher score to the original sentences than to any of the systems.[12]

[12] Of course, it could be that the input sentences are the best, but the human ranking in Figure 4 suggests otherwise.

Figure 4 shows the system ranking for the most strongly correlated annotation–evaluation combination (GLEU with E-fluency) compared to the "ground truth" human rankings. The automatic metric clusters the systems into the correct upper and lower halves, and the input is correctly placed in the lower half of the rankings.

Figure 4: System rankings produced by GLEU with expert fluency (E-fluency) as the reference compared to the expert human ranking.
  Expert: AMU, CAMB, RAC, CUUI, POST, PKU, UMC, UFC, IITB, input, SJTU, NTHU, IPN
  GLEU with E-fluency: CAMB, POST, CUUI, AMU, PKU, RAC, UMC, SJTU, NTHU, UFC, IITB, IPN, input
Even though automatic metrics strongly correlate with human judgments, they still do not have the same reliability as manual evaluation. Like error-coded annotations, judgment by specialists is expensive, so we investigate a more practical alternative in the following section.

5 Human evaluation

Automatic metrics are only a proxy for human judgments, which are crucial to truthfully ascertain the quality of systems. Even the best result in Section 4.2, which is state of the art and has very strong rank correlation (ρ = 0.819) with the expert ranking, makes dramatic errors in the system ranking. Given the inherent imperfection of automatic evaluation (and possible over-optimization to the NUCLE data set), we recommend that human evaluation be produced alongside metric scores whenever possible. However, human judgments can be expensive to obtain. Crowdsourcing may address this problem and has been shown to yield reasonably good judgments for several error types at a relatively low cost (Tetreault et al., 2014). Therefore, we apply crowdsourcing to sentence-level grammaticality judgments, by replicating previous experiments that reported expert rankings of system output (Napoles et al., 2015; Grundkiewicz et al., 2015) using non-experts on MTurk.

5.1 Experimental setup

Using the same data set as those experiments and the work described in this paper, we asked screened participants[13] on MTurk to rank five randomly selected systems and NUCLE corrections from best to worst, with ties allowed. 294 sentences were randomly selected for evaluation from the NUCLE subsection used in Grundkiewicz et al. (2015), and the output for each sentence was ranked by two different participants. The 588 system rankings yield 26,265 pairwise judgments, from which we inferred the absolute system ranking using TrueSkill.

[13] Participants in the United States with a HIT approval rate ≥ 95% had to pass a sample ranking task graded by the authors.

5.2 Results

Figure 6 compares the system ranking by non-experts to the same expert ranking used in Section 4.1. The rankings have very strong correlation (ρ = 0.917, r = 0.876), indicating that non-expert grammaticality judgments are comparably reliable to those by experts. Compared to the best metric ranking shown in Figure 4, the non-expert ranking appears significantly better. No system has a rank more than two away from the expert rank, while GLEU has six systems with ranks that are three away. The non-expert correlation can be seen as an upper bound for the task, which is approached but not yet attained by automatic metrics.

Figure 6: Output of system rankings by experts and non-experts, from best to worst. Dotted lines indicate clusters according to bootstrap resampling (p ≤ 0.05).
  Expert: AMU, CAMB, RAC, CUUI, POST, PKU, UMC, UFC, IITB, input, SJTU, NTHU, IPN
  Non-expert: AMU, CAMB, CUUI, POST, RAC, UMC, IITB, PKU, input, UFC, SJTU, IPN, NTHU

Table 7: Inter-annotator agreement of pairwise system judgments within non-experts, experts and between them. We show Cohen's κ and quadratic-weighted κ.[15]

  Judges                    κ      κw
  Non-experts               0.29   0.43
  Experts                   0.29   0.45
  Non-experts and Experts   0.15   0.23

[15] In addition to Cohen's κ, we report weighted κ because A > B and A < B should have less agreement than A > B and A = B.

Systems in the same cluster, indicated by dotted lines in Figure 6, can be viewed as ties. From this perspective the expert and non-expert rankings are virtually identical. In addition, experts and non-experts have similar inter-annotator agreement in their pairwise system judgments (Table 7). The agreement between experts and non-experts is lower than the agreement between just experts or just non-experts, which may be due to the difference of these experimental settings for experts (Grundkiewicz et al., 2015) and for non-experts (this work). However, this finding is not overly concerning since the correlation between the rankings is so strong.
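The agreement figures in Table 7 are Cohen's κ and its quadratic-weighted variant over the pairwise system judgments. A minimal sketch with scikit-learn follows; the three-way label encoding of the judgments is our choice, and the example values are illustrative only.

```python
from sklearn.metrics import cohen_kappa_score

# Each pairwise system judgment is encoded as an ordered label:
# 0 = "A worse than B", 1 = "A tied with B", 2 = "A better than B".
judge_1 = [2, 2, 0, 1, 2, 0, 1, 0]   # illustrative judgments only
judge_2 = [2, 1, 0, 1, 2, 0, 2, 0]   # illustrative judgments only

kappa = cohen_kappa_score(judge_1, judge_2)
# Quadratic weighting penalizes ">" vs. "<" disagreements more than ">" vs. "=".
kappa_w = cohen_kappa_score(judge_1, judge_2, weights="quadratic")
print(f"kappa = {kappa:.2f}, quadratic-weighted kappa = {kappa_w:.2f}")
```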
In all, judgments cost approximately $140 ($0.2 per sentence) and took a total of 32 hours to complete. Because the non-expert ranking very strongly correlates to the expert ranking and non-experts have similar IAA as experts, we conclude that expensive expert judgments can be replaced by non-experts, when those annotators have been appropriately screened.

6 Conclusion

There is a real distinction between technical grammaticality and fluency. Fluency is a level of mastery that goes beyond knowledge of how to follow the rules, and includes knowing when they can be broken or flouted. Language learners—who are a prime constituency motivating the GEC task—ultimately care about the latter. But crucially, the current approach of collecting error-coded annotations places downward pressure on annotators to minimize edits in order to neatly label them. This results in annotations that are less fluent, and therefore less useful, than they should be. We have demonstrated this with the collection of both minimally-edited and fluent rewrites of a common test set (Section 3.1); the preference for fluent rewrites over minimal edits is clear (Table 5).

To correct this, the annotations and associated metrics used to score automated GEC systems should be brought more in line with this broadened goal. We advocate for the collection of fluent sentence-level rewrites of ungrammatical sentences, which is cheaper than error-coded annotations and provides annotators with the freedom to produce fluent edits. In the realm of automatic metrics, we found that a modified form of GLEU computed against expert fluency rewrites correlates best with a human ranking of the systems; a close runner-up collects the rewrites from non-experts instead of experts.

Finally, to stimulate metric development, we found that we were able to produce a new human ranking of systems using non-expert judges. These judges produced a ranking that was highly correlated with the expert ranking produced in earlier work (Grundkiewicz et al., 2015). The implication is further reduced costs in producing a gold-standard ranking for new sets of system outputs against both existing and new corpora.

As a result, we make the following recommendations:

• GEC should be evaluated against 2–4 whole-sentence rewrites, which can be obtained by non-experts.

• Automatic metrics that rely on error coding are not necessary, depending on the use case. Of the automatic metrics that have been proposed, we found that a modified form of GLEU (Napoles et al., 2015) is the best-correlated.

• The field of GEC is in danger from over-reliance on a single annotated corpus (NUCLE). New corpora should be produced in a regular fashion, similar to the Workshop on Statistical Machine Translation.

Fortunately, collecting annotations in the form of unannotated sentence-level rewrites is much cheaper than error-coding, facilitating these practices.

By framing grammatical error correction as fluency, we can reduce the cost of annotation while creating a more reliable gold standard. We have clearly laid out improved practices for annotation and evaluation, demonstrating that better quality results can be achieved for less cost using fluency edits instead of error coding. All of the source code and data, including templates for data collection, will be publicly available, which we believe is crucial for supporting the improvement of GEC in the long term.

Acknowledgments

We would like to thank Christopher Bryant, Mariano Felice, Roman Grundkiewicz and Marcin Junczys-Dowmunt for providing data and code. We would also like to thank the TACL editor, Chris Quirk, and the three anonymous reviewers for their comments and feedback. This material is based upon work partially supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1232825.
References

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September. Association for Computational Linguistics.

Christopher Bryant and Hwee Tou Ng. 2015. How far are we from fully automatic high quality grammatical error correction? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 697–707, Beijing, China, July. Association for Computational Linguistics.

Martin Chodorow and Claudia Leacock. 2000. An unsupervised method for detecting grammatical errors. In Proceedings of the Conference of the North American Chapter of the Association of Computational Linguistics (NAACL), pages 140–147.

Martin Chodorow, Markus Dickinson, Ross Israel, and Joel Tetreault. 2012. Problems in evaluating grammatical error detection systems. In Proceedings of COLING 2012, pages 611–628, Mumbai, India, December. The COLING 2012 Organizing Committee.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montréal, Canada, June. Association for Computational Linguistics.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31, Atlanta, Georgia, June. Association for Computational Linguistics.

Robert Dale and Adam Kilgarriff. 2011. Helping our own: The HOO 2011 pilot shared task. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 242–249, Nancy, France, September. Association for Computational Linguistics.

Robert Dale, Ilya Anisimoff, and George Narroway. 2012. HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 54–62, Montréal, Canada, June. Association for Computational Linguistics.

Mariano Felice and Ted Briscoe. 2015. Towards a standard evaluation method for grammatical error detection and correction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, CO, June. Association for Computational Linguistics.

Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Edward Gillian. 2015. Human evaluation of grammatical error correction systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 461–470, Lisbon, Portugal, September. Association for Computational Linguistics.

Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill: A Bayesian skill rating system. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, pages 569–576, Vancouver, British Columbia, Canada, December. MIT Press.

Philipp Koehn. 2012. Simulating human judgment in machine translation evaluation campaigns. In Proceedings of IWSLT 2012.

C. Leacock, M. Chodorow, M. Gamon, and J. Tetreault. 2014. Automated Grammatical Error Detection for Language Learners, Second Edition. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Nitin Madnani, Martin Chodorow, Joel Tetreault, and Alla Rozovskaya. 2011. They can help: Using crowdsourcing to improve the evaluation of grammatical error detection systems. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 508–513, Portland, Oregon, USA, June. Association for Computational Linguistics.

Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. GLEU: Automatic evaluation of sentence-level fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 344–351, Prague, Czech Republic, June. Association for Computational Linguistics.

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China, July. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 Shared Task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria, August. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 Shared Task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland, June. Association for Computational Linguistics.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, pages 572–581.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.

Ellie Pavlick, Rui Yan, and Chris Callison-Burch. 2014. Crowdsourcing for grammatical error correction. In Proceedings of the companion publication of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 209–212. ACM.

Alla Rozovskaya and Dan Roth. 2010. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 28–36, Los Angeles, California, June. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2014. Building a state-of-the-art grammatical error correction system. Transactions of the Association for Computational Linguistics, 2:414–434.

Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2014. Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 1–11, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Ben Swanson and Elif Yamangil. 2012. Correction detection and error type selection as an ESL educational aid. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 357–361, Montréal, Canada, June. Association for Computational Linguistics.

Joel Tetreault and Martin Chodorow. 2008. Native judgments of non-native usage: Experiments in preposition error detection. In Coling 2008: Proceedings of the Workshop on Human Judgements in Computational Linguistics, pages 24–32, Manchester, UK, August. Coling 2008 Organizing Committee.

Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010. Rethinking grammatical error annotation and evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48, Los Angeles, California, June. Association for Computational Linguistics.

Joel Tetreault, Martin Chodorow, and Nitin Madnani. 2014. Bucking the trend: Improved evaluation and annotation practices for ESL error detection systems. Language Resources and Evaluation, 48(1):5–31.

Huichao Xue and Rebecca Hwa. 2014. Improved correction detection in revised ESL sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–604, Baltimore, Maryland, June. Association for Computational Linguistics.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA, June. Association for Computational Linguistics.
