A Comprehensive Survey of
Grammar Error Correction
Yu Wang*, Yuelin Wang*, Jie Liu, and Zhuo Liu
Index Terms—Grammar error correction, machine translation, natural language processing

Fig. 1. A typical instance of grammar error correction.

characteristics, for example, the Language Tool [1] and the ESL Assistant [2]. However, the complexity of designing rules and resolving conflicts among the rules requires a great amount of labour. Although some GEC works today still use rules as an additional source of correction, the performance of rule based GEC systems has been superseded by data-driven approaches, which are the focus of our survey.

Remarkable progress has been achieved in GEC by data-driven approaches. In the late 2000s, classification based approaches were developed to correct preposition errors and article errors. In this approach, classifiers are trained on large amounts of native error-free text to predict the correct target word, taking into account the linguistic features given by context. However, GEC systems that are capable of correcting all types of errors are more desirable. As an improvement, statistical machine translation (SMT) based GEC systems gained significant attention in the early 2010s. In this approach, SMT models are trained on parallel sentence pairs, correcting all types of errors by "translating" the ungrammatical sentences into refined output. More recently, with the increasing popularity of deep learning, neural machine translation (NMT) based GEC systems applying neural seq2seq models have become dominant and achieve state-of-the-art performance. Since MT based approaches require corpora containing large numbers of parallel sentence pairs, language model based GEC approaches have been researched as an alternative that does not rely on supervised training. We describe each of the data-driven approaches in detail in Section 3.

Apart from the aforementioned components of GEC systems, we also cover other aspects of the GEC task. We summarize the statistics and properties of several public GEC datasets, briefly discuss research on data annotation schemes, and introduce two GEC shared tasks, which are critical to the development of GEC. Besides, standard evaluation provides a platform where multiple GEC systems can be compared and analysed quantitatively. The evaluation metrics, both reference-based and reference-less, are explained and compared.

GEC has always been a challenging task in the NLP research community. First, due to the unrestricted mutability of language, it is hard to design a model that is capable of correcting all possible errors made by non-native learners, especially when error patterns in new text are not observed in training data. Second, unlike machine translation, where annotated training resources are abundant, large amounts of annotated ungrammatical texts and their corrected counterparts are not available, adding difficulty to training MT based GEC models. Although data augmentation methods have been proposed to alleviate the problem, if the artificially generated data cannot precisely capture the error distribution in real erroneous data, the final performance of GEC systems will be impaired.

We provide a comprehensive literature retrospective on the research of GEC. The overall categorization of the research is visualized in Figure 2.
Beyond the four basic approaches, numerous techniques have also attracted significant attention as ways to help GEC systems achieve better overall performance, especially in SMT and NMT based GEC systems. Since GEC is an area that stresses the appropriate application of basic models, these techniques are important for adapting existing models to GEC. They have been explored and developed incrementally and can be combined to further improve the error correction ability of GEC systems. In our survey, we describe research on GEC techniques and their broad application in existing works.
Fig. 2. Categorization of research on GEC.
Data augmentation methods are also of great importance to the development of GEC. The lack of large amounts of public training sentence pairs restrains the development of more powerful MT based GEC systems. This problem can be partly ameliorated by various data augmentation methods, which generate artificial parallel data for training GEC models. Some inject noise into error-free texts for corruption, while others apply back translation to error-free texts to translate them into ungrammatical counterparts. Both methods are commonly applied in today's GEC systems.

Our survey makes the following contributions.

• We present the first comprehensive survey in GEC. To the best of our knowledge, no prior work focuses on an overall survey covering all aspects of datasets, approaches, performance boosting techniques, data augmentation methods and evaluation of the great number of researches in GEC, especially the explorations in recent years that yield significant progress.

• We explicitly separate the elements belonging to the approaches, performance boosting techniques and data augmentation and put them together for a more
TABLE 1
Statistics and properties of public GEC datasets.

| Corpus   | Component | #Sents | #Tokens | #Chars/sent | %Sents changed | #Refs | Edit type | Error type | Proficiency | Topic   | L1      |
|----------|-----------|--------|---------|-------------|----------------|-------|-----------|------------|-------------|---------|---------|
| NUCLE    | -         | 57k    | 1.16M   | 115         | 38%            | 2     | minimal   | Labeled    | Simplex     | Simplex | Simplex |
| FCE      | Train     | 28k    | 455k    | 74          | 62%            | 1     | minimal   | Labeled    | Simplex     | Diverse | Diverse |
|          | Dev       | 2.1k   | 35k     |             |                |       |           |            |             |         |         |
|          | Test      | 2.7k   | 42k     |             |                |       |           |            |             |         |         |
| Lang-8   | -         | 1.04M  | 11.86M  | 56          | 42%            | 1-8   | fluency   | None       | Diverse     | Diverse | Diverse |
| JFLEG    | Dev       | 754    | 14k     | 94          | 86%            | 4     | fluency   | None       | Diverse     | Diverse | Diverse |
|          | Test      | 747    | 13k     |             |                |       |           |            |             |         |         |
| W&I      | Train     | 34.3k  | 628.7k  | 60          | 67%            | 1     | -         | Labeled    | Diverse     | Diverse | Diverse |
|          | Dev       | 3.4k   | 63.9k   | 94          | 69%            | 1     |           |            |             |         |         |
|          | Test      | 3.5k   | 62.5k   | -           | -              | 5     |           |            |             |         |         |
| LOCNESS  | Dev       | 1k     | 23.1k   | 123         | 52%            | 1     | -         | Labeled    | Diverse     | Diverse | Simplex |
|          | Test      | 1k     | 23.1k   | -           | -              | 5     |           |            |             |         |         |
structured description of GEC works. Due to the nature of GEC, this is more beneficial for following works based on the incorporation and application of disparate approaches, techniques and data augmentation methods.

• We recapitulate current progress and present an analysis based on empirical results in aspects of the approaches, performance boosting techniques and integrated GEC systems, for a clearer picture of the existing literature in GEC.

• We propose five prospective directions for GEC, in aspects of adapting GEC to the native language of English learners, low-resource scenario GEC, combination of multiple GEC systems, datasets, and better evaluation, based on the existing works and progress.

The rest of this survey is organized as follows. In Section 2, we introduce the public GEC corpora and the annotation schema. Section 3 summarizes the commonly adopted data-driven GEC approaches. We classify the numerous techniques in Section 4 and collect data augmentation methods in Section 5. The standard evaluation metrics and discussion of GEC systems are in Section 6. In Section 7, we propose the prospective directions, and we conclude the survey in Section 8.

II. PROBLEM

In this section, we introduce the fundamental concepts of GEC and their notations in the survey, the public datasets, the annotation schema of data and the shared tasks in GEC.

2.1 Notation and Definition

Since GEC is an area of application of disparate theories and models, we focus only on the universal concepts that are shared among most GEC systems.

Given a source sentence 𝑥 = {𝑥1, 𝑥2, …, 𝑥𝑇𝑥} which contains grammar errors, a GEC system learns to map 𝑥 to its corresponding target sentence 𝑦 = {𝑦1, 𝑦2, …, 𝑦𝑇𝑦} which is error-free. The output of the GEC system is called the hypothesis 𝑦̂ = {𝑦̂1, 𝑦̂2, …, 𝑦̂𝑇𝑦̂}.

2.2 Datasets

We first introduce several datasets that are most widely used to develop GEC systems based on a supervised approach, including NUCLE [3], Lang-8 [4], FCE [5], JFLEG [6], and Write&Improve+LOCNESS [7]. These datasets contain parallel sentence pairs that are used to develop MT based systems. Statistics and more properties of these datasets are listed in Table 1. Then, we also describe some commonly used monolingual corpora.

2.2.1 NUCLE

The NUS Corpus of Learner English (NUCLE) is the first GEC dataset that is freely available for research purposes. NUCLE consists of 1414 essays written by Asian undergraduate students at the National University of Singapore. This leads to low diversity of topic, sentence proficiency and L1 (the first language of the writer) in the data. Most tokens in NUCLE are grammatically correct, resulting in 46,597 annotated errors for 1,220,257 tokens. NUCLE is also annotated with error types.

2.2.2 FCE

The First Certificate in English Corpus (FCE) is a portion of the proprietary Cambridge Learner Corpus (CLC) [8], and it is a collection of 1244 scripts written by English language learners in response to FCE exam questions. FCE
contains broader native languages of writers, and more diverse sentence proficiency and topics. Similar to NUCLE, FCE is also annotated with error types. However, FCE is annotated by only 1 annotator, which may lead to fewer usages than NUCLE.

2.2.3 Lang-8

The Lang-8 Corpus of Learner English (Lang-8) is a somewhat-clean English subsection of a social language learning website where essays are posted by language learners and corrected by native speakers. Although Lang-8 contains the largest number of original sentences, it is corrected by users with different English proficiency, weakening the quality of the data. Besides, it is not annotated with error types.

2.2.4 JFLEG

The JHU FLuency-Extended GUG Corpus (JFLEG) contains 1511 sentences in the GUG development and test sets. Unlike NUCLE and FCE, annotation of JFLEG involves not only grammar error correction but also sentence-level rewriting to make the source sentence sound more fluent. Each sentence is annotated four times on Amazon Mechanical Turk. JFLEG provides a new perspective that GEC should make fluency edits to source sentences.

2.2.5 W&I+LOCNESS

The Write&Improve Corpus (W&I) and the LOCNESS Corpus were recently introduced by the BEA-2019 Shared Task on Grammatical Error Correction. W&I consists of 3600 annotated essay submissions from Write&Improve [9], an online web platform that assists non-native English students with their writing. The 3600 submissions are distributed into train, development and test sets with 3000, 300 and 300 essays respectively. The LOCNESS Corpus is a collection of about 400 essays written by British and American undergraduates, thus it contains only native grammatical errors. Note that LOCNESS contains only a development set and a test set. All the sentences in W&I and LOCNESS are evenly distributed across CEFR levels.

2.2.6 Monolingual Corpora

Except for the parallel datasets discussed above, there are also some public monolingual corpora that can be used in training language models, pre-training neural GEC models and generating artificial training data. The commonly used monolingual corpora are listed below.

• Wikipedia. Wikipedia is an online encyclopedia based on Wiki technology, written in multiple languages with more than 47 million pages.

• Simple Wiki. The Simple English Wikipedia, compared to ordinary English Wikipedia, only uses around 1500 common English words, which makes information much easier to understand both in grammar and structure.

• Gigaword. The English Gigaword Corpus is collected by the Linguistic Data Consortium at the University of Pennsylvania, consisting of comprehensive news text data (including text from Xinhua News Agency, New York Times, etc.).

• One Billion Word Benchmark. The One Billion Word Benchmark is a corpus with more than one billion words of training data, aiming at measuring research progress. Since the training data is from the web, different techniques can be compared fairly on it.

• COCA. The Corpus of Contemporary American English is the largest genre-balanced corpus, containing more than 560 million words, covering different genres including spoken, fiction, newspapers, popular magazines and academic journals.

• Common Crawl. The Common Crawl corpus [10] is a repository of web crawl data which is open to everyone. It has completed crawls monthly since 2011.

• EVP. The English Vocabulary Profile is an extensive online research resource based on the CLC, providing information, including example sentences, about words and phrases for learners at each CEFR level.

2.3 Annotation

In this section, we talk about the annotation of GEC data, including research on annotation schemes and inter-annotator agreement. Although annotation requires a great amount of labor and time, it is of great importance to the development of GEC. Data-driven approaches rely heavily on annotated data. Most importantly, annotation enables comparable and quantitative evaluation of different GEC systems when they are evaluated on data with standard annotation. For example, systems in the CoNLL-2014 and the BEA-2019 shared tasks are evaluated on data with official annotation, and they are ranked according to scores using identical evaluation metrics respectively. Besides, annotation also promotes targeted evaluation of GEC systems, since error types may be annotated in the data. We can examine the performance of GEC systems on specific error types to identify their strengths and weaknesses. This is meaningful since a robust specialised system is more desirable than a mediocre general system [11].

The most widely applied annotation is error-coded. Error-coded annotation involves identifying (1) the span of the grammatically erroneous context, (2) the error type and (3)
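Error-coded annotation of this kind can be represented programmatically as (span, error type, correction) triples. The following is a minimal sketch; the field names and the `Edit` class are illustrative, not part of any particular annotation toolkit:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    start: int        # token index where the erroneous span begins
    end: int          # one past the end of the span (start == end: insertion)
    error_type: str   # e.g. "SVA" (subject-verb agreement), "ArtOrDet"
    correction: str   # replacement text; "" deletes the span

def apply_edits(tokens, edits):
    """Apply error-coded edits right-to-left so earlier indices stay valid."""
    out = list(tokens)
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        out[e.start:e.end] = e.correction.split()
    return out

corrected = apply_edits(
    ["He", "go", "to", "school"],
    [Edit(1, 2, "SVA", "goes")],
)
```

Applying edits from right to left is the standard trick that keeps the annotated token offsets valid as the sentence changes length.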
considerable results under a more extensive scene. The preset labels limit the classification method to certain types of errors, and it cannot be extended to deal with more complicated error types. For GEC, generally speaking, the errors that appear often involve not only the wrong words themselves, but also the context, which may contain the surrounding tokens in a sentence or even cross-sentence information. Further, the choice of correction is also diverse and flexible, which promotes the development of generative models for GEC. In this section, we present SMT based models and their development on grammatical error correction. Since most current researchers focus on models based on neural machine translation, we do not discuss too many particulars of models and methods in detail here.

3.1.1 Statistical Machine Translation

Machine translation aims to find a translation 𝑦 in the target language for a given source sentence 𝑥 in the source language, i.e., it probabilistically finds a sentence 𝑦 that maximizes 𝑝(𝑦|𝑥). The distribution 𝑝(𝑦|𝑥) is the translation model we want to estimate. Often, the translation model consists of two parts, the translation probability Π_{𝑖=1}^{𝑈} 𝑝(𝑢𝑖|𝑣𝑖) of all the "units" in a sentence with 𝑈 units and the reordering probability Π_{𝑖=1}^{𝑈} 𝑑(𝑠𝑡𝑎𝑟𝑡𝑖 − 𝑒𝑛𝑑𝑖−1 − 1), which are used as the basis for translation in different models. In the noise channel model, the correction is found as

𝑦̂ = argmax_𝑦 𝑝(𝑥|𝑦) 𝑝(𝑦)

combining these two models. This idea is better reflected in the log-linear model, which generalizes it with the following equation:

𝑦̂ = argmax_𝑦 Σ_{𝑚=1}^{𝑀} 𝜆𝑚 𝑓𝑚(𝑥, 𝑦)

Compared to the noise channel model, the log-linear model provides a more general framework. This framework contains a variable number of sub-models, namely the feature functions 𝑓𝑚(𝑥, 𝑦), which can be features such as a translation model or a language model, and a group of tunable weight parameters 𝜆𝑚, which are used to adjust the effect of each sub-model in the translation process. The total number of feature functions is represented as 𝑀. With the log-linear model, prior knowledge and semantic information can be injected by adding feature functions.

SMT based methods treat GEC as a translation process from "bad English" to "good English". Through training on a large number of parallel corpora, the translation model can learn the correspondence between a grammar error and its correct form, just as the translation process does in mapping bilingual sentences. Several researches have used statistical machine translation models as the basic framework for GEC, and have taken advantage of the versatility and flexibility provided by log-linear models, with considerable results achieved.
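To make the log-linear decision rule concrete, here is a toy sketch. The two feature functions and their weights are invented for illustration (a real system would use trained translation and language models and tune the weights 𝜆 on development data):

```python
def loglinear_best(source, candidates, feature_fns, weights):
    """Return the candidate y maximizing sum_m lambda_m * f_m(x, y)."""
    def score(y):
        return sum(w * f(source, y) for f, w in zip(feature_fns, weights))
    return max(candidates, key=score)

# Hypothetical stand-ins for a translation model and a language model:
def f_tm(x, y):
    # crude "translation" feature: rewards keeping source tokens
    return len(set(x.split()) & set(y.split()))

def f_lm(x, y):
    # crude "fluency" feature: rewards a known-fluent phrase
    return 1.0 if "has been" in y else 0.0

best = loglinear_best(
    "He have been here",
    ["He have been here", "He has been here"],
    [f_tm, f_lm],
    [0.5, 2.0],
)
```

Adding a new knowledge source is just appending another (feature function, weight) pair, which is exactly the extensibility the text attributes to the log-linear framework.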
using MEMT [30], and the final system achieves a state-of-the-art result. Another commonly used feature based on neural networks was first integrated into SMT based GEC systems in more recent studies [31]: both a neural network joint model (NNJM) and a neural network global lexicon model (NNGLM) are introduced into the SMT based GEC models. The NNJM is further improved using the regularized adaptive training method described in later work [32] on a higher quality training dataset, which has a higher error-per-sentence ratio.

A group of studies [33], [34], [35] paid more attention to n-best list reranking to solve the "first but not the best" problem in the outputs of previous SMT based GEC systems. An extra rescoring post-processing step also enabled more information to be incorporated into the SMT based methods, through which improvement over the previous systems was acquired. An empirical study [36] compared the SMT and classification based approaches by performing error analysis of their outputs, which also promoted a sound pipeline system using classification based error type-specific components, a context sensitive spelling correction system, punctuation and casing correction systems, and SMT. Also, a system based on SMT only [37]
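The n-best reranking idea above can be sketched in a few lines: keep the decoder's score for each hypothesis, add weighted scores from extra features such as a language model, and re-sort. The n-best list, the toy LM feature and the weight below are invented for illustration:

```python
def rerank_nbest(nbest, extra_features, extra_weights):
    """Rescore an SMT n-best list: add weighted extra feature scores
    (e.g. a language model) to each hypothesis's decoder score."""
    rescored = []
    for hyp, decoder_score in nbest:
        total = decoder_score + sum(
            w * f(hyp) for f, w in zip(extra_features, extra_weights))
        rescored.append((hyp, total))
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Hypothetical n-best list where the best correction ranked second:
nbest = [("She like dogs", -1.0), ("She likes dogs", -1.2)]
lm = lambda h: 0.0 if " like " in f" {h} " else 1.0  # toy fluency feature
reranked = rerank_nbest(nbest, [lm], [0.5])
```

This is the "first but not the best" fix in miniature: the decoder's top hypothesis is demoted once the extra feature is taken into account.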
TABLE 3
NMT based GEC systems. The column "Training LM" is the same as that in Table 2.

| Source | Framework | Model       | Input                        | Handling Misspelling              | Training LM  | Year |
|--------|-----------|-------------|------------------------------|-----------------------------------|--------------|------|
| [45]   | ED        | RNN         | token level                  | alignment, word-level translation | -            | 2016 |
| [47]   | ED        | RNN         | character level              | character-level translation model | Common Crawl | 2016 |
| [49]   | ED        | RNN         | token level, character level | character-level translation model | Common Crawl | 2017 |
| [68]   | ED        | RNN         | token level                  | spellchecker                      | -            | 2017 |
| [50]   | ED        | CNN         | token level                  | spellchecker                      | -            | 2018 |
| [67]   | ED        | RNN         | token level                  | spellchecker                      | Wikipedia    | 2018 |
| [39]   | ED        | RNN         | token level                  | spellchecker                      | Common Crawl | 2018 |
| [53]   | ED        | Transformer | token level                  | spellchecker                      | Common Crawl | 2018 |
| [56]   | ED        | Transformer | token level                  | spell error correction system     | Common Crawl | 2019 |
| [66]   | ED        | CNN         | token level, sentence level  | spellchecker                      | -            | 2019 |
| [69]   | PIE       | Transformer | token level                  | spellchecker                      | Wikipedia    | 2019 |
| [71]   | ED        | Transformer | token level                  | -                                 | -            | 2019 |
corpus could be Wikipedia [29], [28], [32], [35], [31], [36], Gigaword [26], [34] and Common Crawl [28], [37], [38], [39]. Such LMs have been trained with tools such as KenLM or IRSTLM in a range of researches to further improve their systems' performance.

3.1.5 Neural Networks as Features

Although the translation model of SMT based approaches has shown its effectiveness through the estimated phrase table, the discrete phrase table and the linear mapping on which the translation probability is based still limit the ability of the model to generalize to patterns beyond the training sets. Also, global information is always ignored by the basic SMT based model. Although a range of language models and other sequence modeling features have been incorporated into several SMT based GEC models, context information concerning each word is lacking. Some works tackled these limitations using complementary neural networks as additional features, namely the neural network global lexicon model (NNGLM) and the neural network joint model (NNJM), which construct a continuous representation space with non-linear mappings [32], [31], [38].

NNGLM. A global lexicon model here is a feed-forward neural network used to predict the presence of words in the corrected output by estimating the overall probability of the hypothesis given the source sentence. The probability of a target hypothesis is computed using the following equation:

𝑝(𝑦̂|𝑥) = Π_𝑖 𝑝(𝑦̂𝑖|𝑥)

where 𝑝(𝑦̂𝑖|𝑥) is the probability of the target word 𝑦̂𝑖 given the source sentence 𝑥; 𝑝(𝑦̂𝑖|𝑥) is the output of the neural network.

NNJM. Joint models in translation augment the context information in language models with words from the source sentence. Unlike the global lexicon model, the NNJM uses a fixed window from the source side and takes sequence information of words into consideration in order to estimate the probability of the target word. The probability of the hypothesis ℎ given the source sentence 𝑥 is estimated by the following equation:

𝑝(ℎ|𝑥) = Π_𝑖 𝑝(𝑦̂𝑖|𝑐𝑖)

where 𝑐𝑖 is the context (history) for the target word 𝑦̂𝑖. The context 𝑐𝑖 consists of a set of source words centered on the word 𝑦̂𝑖 and a certain number of words preceding 𝑦̂𝑖 from the target sentence, as in a language model.

3.2 NMT Based Approaches

Although the SMT based approach benefits from its ability to incorporate large amounts of parallel data and monolingual data, as well as the auxiliary neural network components, it still suffers from the lack of contextual information and limited generalization ability. As a solution, many researches started to investigate NMT based approaches for GEC. With the increasing performance obtained by neural encoder-decoder models [43] in machine translation, neural encoder-decoder based models have been adopted and modified. Compared to SMT based GEC systems, NMT based models have two advantages. First, the neural encoder-decoder model learns the mappings from source to target directly from parallel training data, rather than through the features required in SMT to capture the mapping regularities. Second, NMT based systems are able to correct unseen ungrammatical phrases and sentences more effectively than SMT based approaches, increasing the generalization ability [44].

In this section, we trace the development of representative works, which are most commonly used as backbones combined with other techniques in GEC systems today. We leave the numerous techniques to Section 4, and focus only on works researching the design and training schema of neural models in GEC. As we will discuss, all the neural GEC systems are based on the encoder-decoder (ED) model, with an exception based on the parallel iterative edit (PIE) model.

3.2.1 Development of NMT Based Approaches

Yuan and Briscoe first applied NMT based models to GEC [45]. In this work, the encoder encodes the source sentence 𝑥 = {𝑥1, 𝑥2, …, 𝑥𝑇𝑥} as a vector 𝑣. Then, the vector is passed to the decoder to generate the correction 𝑦 through

𝑝(𝑦|𝑥) = Π_𝑖 𝑝(𝑦𝑖|{𝑦1, …, 𝑦𝑖−1}, 𝑣)

At each time step, the target word is predicted from the vector and the previously generated words. Both the encoder and the decoder are RNNs composed of GRU or LSTM units.

The attention mechanism [46] is usually applied to improve the output at each decoding step 𝑖 by selectively focusing on the most relevant context 𝑐𝑖 in the source. To be more specific, the hidden state in the decoder is calculated by

𝑠𝑖 = 𝑓(𝑠𝑖−1, 𝑦𝑖−1, 𝑐𝑖)

where 𝑠𝑖−1 is the hidden state at the last step; 𝑦𝑖−1 is the word generated at the last step; 𝑓(·) is a non-linear function. The variable 𝑐𝑖 is calculated as

𝑐𝑖 = Σ_𝑗 𝛼𝑖𝑗 ℎ𝑗

where ℎ𝑗 is the hidden state of the encoder at step 𝑗 and 𝛼𝑖𝑗 is calculated as

𝛼𝑖𝑗 = exp(𝑚𝑖𝑗) / Σ_𝑘 exp(𝑚𝑖𝑘)

The variable 𝑚𝑖𝑘 is a match score between 𝑠𝑖−1 and ℎ𝑘.
softmax and the possibility of copying the word from the source sentence, where 𝛼^𝑐𝑜𝑝𝑦 and 𝑝𝑦(𝜔) are calculated with copy attention between the encoder's output 𝐻^𝑒𝑛𝑐 and the decoder's hidden state 𝐻^𝑑𝑒𝑐 at step 𝑡; the matrices 𝐾 and 𝑉, the vector 𝑞𝑡 and the attention 𝐴𝑡 are computed accordingly.

Integrating cross-sentence context is common in research on neural machine translation [63], [64], [65], but was first applied to GEC by Chollampatt et al. recently. In this work, they designed a cross-sentence convolutional encoder-decoder model with an auxiliary encoder and gating, which was an extension of the previous work [50]. To be more specific, the auxiliary encoder encodes the previous two sentences into 𝑆̂ ∈ ℝ^{𝑑×|𝑆̂|}, and the corresponding output of the final encoder layer is 𝐸̂ ∈ ℝ^{𝑑×|𝐸̂|}. Auxiliary attention is involved, and the auxiliary encoder representation at each decoder layer 𝐶̂^𝑙 is calculated in the same way as equation 15 but with different linear layers, 𝑆̂ and 𝐸̂. The output of the 𝑙th decoder layer is now calculated by gating with 𝐶̂^𝑙, where ◦ is the element-wise product and 𝛬^𝑙 is the gate [66].

Fluency boost learning aims at increasing the fluency and soundness of the source sentence to correct grammar errors [67]. The neural sequence-to-sequence model is trained with fluency boosting sentence pairs, and multi-round inference strategies are applied to make full corrections to source sentences. Experiments have demonstrated the effectiveness of both the fluency boost learning and the inference strategies. We cover more details about the data generation methods and inference strategies in Section 5.2 and Section 4.4 respectively.

Reinforcement learning has long been applied to other NLP tasks for post-processing, but was innovatively used in GEC to directly optimize the parameters of the neural encoder-decoder model [68]. The model (agent) searches the policy 𝑝(ℎ) directly by maximizing the expected reward E_{ℎ∼𝑝(ℎ)}[𝑟(𝑦̂, 𝑦) − 𝑏], where the variable 𝑏 is a baseline used to reduce the variance. The reward 𝑟(𝑦̂, 𝑦) is specified as GLEU. The parameters of the model are optimized using the policy gradient algorithm.

Besides the direct sentence translation models discussed above, two models translating source sentences into edits that should be made to source tokens were successfully applied to GEC recently. The first is the PIE model [69], an improved version of the parallel model that had been previously explored in machine translation [70]. The PIE model is trained to label source tokens with edits from an edit space comprising edit operations including copying, deletion, replacing with a 2-gram sequence, appending a 2-gram sequence and morphology transformation. The required supervision is collected by comparing source and target sentences in the parallel training data. The PIE model is based on the architecture of BERT, but with additional inputs and a self-attention mechanism to better predict replacing and appending. Unlike the commonly used ED model, the PIE model generates the output concurrently, much faster than the encoder-decoder model. The second is the LaserTagger [71], which generates edit tags for source sentences. The LaserTagger is composed of a BERT encoder with 12 self-attention layers and an autoregressive single-layer Transformer decoder. The edit tags include all possible combinations of the two base tags, KEEP and DELETE, with phrases that should be inserted before the source tokens. The phrases are the 500 most frequent n-grams that are not aligned between source sentences and target sentences in the training data. An algorithm is designed to convert the training examples into corresponding tag sequences. The very advantage of generating edits is that the edit space is much smaller than the target vocabulary space.

3.2.2 Handling Misspellings

Due to the limitation of vocabulary size in training NMT models, it is unrealistic to cover all the misspelled words in the vocabulary. So, several methods are applied to correct misspelling errors in NMT based GEC systems.

• Alignment+Translation. The misspelled words can be viewed as UNKs, and the problem can be solved by first aligning the UNKs in the target sentence to their corresponding words in the source sentence with an aligner, and then training a word-level translation model to translate those words in a postprocessing step [45].

• Character-level translation model. As discussed before, some NMT based models are character-level, so the problem of misspelling errors is naturally solved [47], [49].

• Spellchecker. Many NMT based GEC systems utilize a spellchecker1 or an open-source spellchecking library2 in preprocessing [68], [50], [53], [67], [66], [39].

1 https://azure.microsoft.com/en-us/services/cognitiveservices/spellcheck/
2 https://github.com/AbiWord/enchant
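The edit-tagging idea behind LaserTagger and PIE can be illustrated with a tiny tag applier. The tag encoding below ("phrase|KEEP" meaning "insert phrase, then keep the token") is illustrative, not the papers' exact scheme:

```python
def apply_edit_tags(tokens, tags):
    """Apply LaserTagger-style tags token by token: each tag is KEEP or
    DELETE, optionally prefixed with a phrase to insert before the
    token, encoded here as 'phrase|KEEP' / 'phrase|DELETE'."""
    out = []
    for token, tag in zip(tokens, tags):
        phrase, _, base = tag.rpartition("|")
        if phrase:                  # insertion before this token
            out.extend(phrase.split())
        if base == "KEEP":
            out.append(token)       # DELETE simply drops the token
    return out

result = apply_edit_tags(
    ["He", "going", "home", "now"],
    ["KEEP", "is|KEEP", "KEEP", "DELETE"],
)
```

Because every output token is determined by one tag on one source token, all tags can be predicted in parallel, which is exactly why these models decode much faster than autoregressive encoder-decoder generation.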
TABLE 4
Classification based GEC systems.

| Source | Target Error Type                                                                                            | Features                       | Year |
|--------|--------------------------------------------------------------------------------------------------------------|--------------------------------|------|
| [72]   | article                                                                                                      | Linguistic feature engineering | 2006 |
| [73]   | preposition                                                                                                  | Linguistic feature engineering | 2007 |
| [75]   | preposition                                                                                                  | Linguistic feature engineering | 2008 |
| [76]   | preposition                                                                                                  | Linguistic feature engineering | 2010 |
| [74]   | preposition                                                                                                  | Linguistic feature engineering | 2008 |
| [77]   | article, preposition                                                                                         | Linguistic feature engineering | 2008 |
| [78]   | article, preposition                                                                                         | Linguistic feature engineering | 2010 |
| [79]   | article, preposition                                                                                         | Linguistic feature engineering | 2010 |
| [80]   | article, preposition, verb form, noun number, sub-verb agreement                                             | Linguistic feature engineering | 2013 |
| [81]   | article, preposition, verb form, noun number, sub-verb agreement, word form, orthography and punctuation, style | Linguistic feature engineering | 2014 |
| [82]   | article                                                                                                      | Deep learning (CNN)            | 2015 |
| [83]   | article, preposition, verb form, noun number, sub-verb agreement                                             | Deep learning (biRNN)          | 2017 |
| [86]   | word form                                                                                                    | Deep learning (biRNN)          | 2018 |
| [84]   | article, preposition, verb form, noun number, sub-verb agreement, comma                                      | Deep learning (biRNN)          | 2019 |
| [87]   | all                                                                                                          | Deep learning (biRNN)          | 2019 |
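Classification based correction ultimately reduces to choosing a word from a confusion set given context features. A toy count-based sketch of this setup for article errors (the "classifier", its single context feature and the training sentences are invented for illustration; the systems in Table 4 use far richer features or neural encoders):

```python
from collections import Counter, defaultdict

ARTICLES = ("a", "an", "the")   # confusion set for article errors

def train_article_classifier(sentences):
    """Count which article precedes each noun in error-free text;
    prediction backs off to the overall majority article."""
    by_noun = defaultdict(Counter)
    overall = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for art, noun in zip(tokens, tokens[1:]):
            if art in ARTICLES:
                by_noun[noun][art] += 1
                overall[art] += 1

    def predict(noun):
        counts = by_noun.get(noun.lower())
        source = counts if counts else overall
        return source.most_common(1)[0][0]

    return predict

predict = train_article_classifier([
    "I saw an owl in the garden",
    "An owl hooted at the moon",
])
```

This also shows why such systems were trained on native error-free text, as noted earlier: the supervision is simply which member of the confusion set native writers actually use in each context.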
choice [77]. Gamon combined classifiers with language models and proposed a metaclassifier for correcting articles and prepositions. The metaclassifier is trained using features transformed from the output of the traditional classifiers and language models, which score the suggested correction. Slightly differently, the meta-classifier is trained on error-annotated data: each error annotation provides the information whether the suggested correction is correct or incorrect [78]. Instead of treating all prepositions equally when training classifiers, the confusion set can be restricted to several specific candidates which are frequently observed in occasions of misuse in non-native data; this is called an L1-dependent confusion set, since the distribution of preposition errors differs by L1. Two methods that train classifiers based on L1-dependent confusion sets are proposed. In the first method, only prepositions in the L1-dependent confusion set are viewed as negative examples for each source preposition. In the second method, the all-class multi-classifier is trained on augmented data taking into account the L1 error statistics. The restriction increases both the precision and the recall compared to previous work [79].

Later, some researches extended previous methods so that the classification based approaches could be applied to correcting more types of errors, instead of article errors and preposition errors only. The University of Illinois System trained classifiers to correct noun number errors, subject agreement errors and verb form errors. The confusion sets are composed of morphological variants of the source word, such as noun plurals, gerunds and past tenses of verbs. This system ranked first in the CoNLL-2013 shared task [80]. Subsequently, in the CoNLL-2014 shared task, the Illinois Columbia System was extended by incorporating another three error-specific classifiers: word form errors, orthography and punctuation errors, and style errors. They also applied joint inference to address inconsistent corrections suggested by multiple classifiers [81].

More recently, deep learning has been combined with classification based systems. Without relying on traditional elaborate feature engineering, Sun et al. first employed a CNN to represent the context of source articles and predict the correct labels [82]. Wang presented a classification based study which trained a deep GRU to represent context and predict the correct word directly. For the source word 𝜔𝑖 in the sentence, the context vector is calculated as

𝑐𝑖 = [lGRU(𝜔1, …, 𝜔𝑖−1); rGRU(𝜔𝑖+1, …, 𝜔𝑛)]

where lGRU and rGRU are GRUs reading the words from left to right and from right to left in the given context respectively. Then, the target word 𝑡 is predicted from 𝑐𝑖 through a softmax layer. For each error type involved (article, preposition, verb form, subject agreement, noun number), a deep bidirectional GRU was trained [83]. Following the research above, Kaili et al. proposed two attention mechanisms to predict correct words better. The first attention considers context words only, while the second attention takes into account the source word. The second attention is used for the correction of noun number errors and word form errors, since the interaction between the source word and its context is important to the correction of these two types of errors. Li et al. trained bidirectional GRUs to correct errors including subject-verb agreement, article, plural or singular noun, verb form, preposition substitution, missing comma and period comma substitution [84]. They also trained a pointer context model [85] to correct word form errors [86]. Makarenkov et al. designed a bidirectional LSTM to assign a distribution over the vocabulary from which correction tokens are selected and to suggest proper word substitutions. A postprocessing step based on a POS tagger is appended to filter out suggested words whose properties differ from the source tokens [87].

We summarize the introduced classification based GEC systems in Table 4, with their target error types and where the features come from.

3.4 LM Based Approaches

The SMT based methods and NMT based methods are all supervised, and thus require large amounts of parallel training data. However, unsupervised approaches based on LMs have been applied to GEC and achieve comparable performance with supervised methods. The very advantage of LM based approaches is that they do not need large parallel annotated data, given the fact that large amounts of parallel data are not available in GEC. As a result, LM based GEC systems are often developed for low-resource situations. Most LM based GEC methods use a LM to assess the hypotheses for correction, where a hypothesis is generated by making changes to the source ungrammatical sentence according to designed rules or by substituting tokens in the source sentence with words selected from confusion sets. In this section, we first describe the basic and the optimized LM based approaches, then give more details about how to generate confusion sets.

An n-gram language model can be used to compute a feature score, such as the normalised log probability, for each hypothesis. The hypothesis with the highest score is added to the search beam and then modified to generate another set of hypotheses. This iteration does not end until the beam is empty or the number of iterations reaches a threshold [88]. Following this, Bryant and Briscoe re-evaluated LM based GEC on several benchmark datasets. A language model is used to calculate the normalised log probability of the hypothesis. The process of generating confusion set
14
and evaluating hypotheses is also iterated following previous work. Without annotated data, this work achieves performance comparable with state-of-the-art systems when tested on the JFLEG test set. This is because the annotation of JFLEG is fluency oriented, which makes language model based approaches more powerful [89].

A problem of neural LM based GEC is that the space of possible hypotheses is rather vast. To solve this issue, finite state transducers (FSTs) are used to represent the large structured search space, constraining the hypothesis space while also ensuring that it is large enough to contain admirable corrections [90]. Several transducers are designed, including an input lattice Î which maps the input sentence to itself, an edit flower transducer Ê which makes substitutions with a cost, a penalization transducer P̂, a 5-gram count based language model transducer L̂, and a transducer T̂ which maps words to byte pair encodings. P̂ and L̂ are used to score the hypothesis space. The best hypothesis is searched using the score of the combined FST and a neural language model. Additionally, if large annotated data is available to train SMT and NMT models, the n-best list of the SMT system is used to construct Î and the NMT score is incorporated into the final decoding procedure. Based on this, Stahlberg and Byrne composed the input transducer Î with a deletion transducer D̂ to delete tokens that are frequently deleted. They also constructed an insertion transducer Â that restricts insertion to three specific tokens ",", "-" and "'s". Their LM based approaches achieved a higher score on the CoNLL-2014 test set [91].

The confusion set is an important element in LM based GEC, since the language model selects the most probable hypothesis from it. Dahlmeier and Ng generated hypotheses by (1) replacing a misspelled word with its correction; (2) changing the observed article before a noun phrase; (3) changing the observed preposition; (4) inserting punctuation; (5) altering a noun's number form (singular or plural) [88]. Similarly, Bryant and Briscoe created confusion sets by (1) applying the spell checker CyHunspell to misspelled words; (2) altering morphological forms of words using an automatically generated inflection database; (3) changing the observed article and preposition [89]. Besides, the WikEd Error Corpus was also used to create a part of the confusion set [92].

Flachs et al. combined pre-trained language models such as BERT and GPT-2 with a noisy channel model for GEC. For each word w in a given sentence, the noisy channel model estimates the probability that w was transformed from a candidate c in the confusion set. The idea behind this approach is that for any given word w, there is a genuine word c that passes through the noisy channel and is transformed into w. The goal is to choose the most probable genuine word c*. According to Bayes' principle, this can be written as

  c* = argmax_{c ∈ C} P(c | w) = argmax_{c ∈ C} P(w | c) · P(c)

where P(w | c) is estimated by the noisy channel model, P(c) is given by a pre-trained language model (BERT, GPT-2), and C is the generated confusion set for w [92].

3.5 Hybrid Approaches

Apart from the models we have discussed, a range of studies also integrated several models into their systems to profit from different models and obtain better performance. Such systems with sub-models are surveyed in this section as hybrid methods; they often contain heterogeneous sub-components. It is worth noticing that ensemble methods (Section 4.3) also involve model combination, which is similar to this section. However, in this survey we restrict ensemble methods to techniques that assist the combination of outputs produced by a range of models. The major difference between hybrid and ensemble methods in this survey is that hybrid methods integrate several sub-systems to construct a complete correction process, whereas ensemble methods use several independent whole systems, often combined at the output level, to improve the correction performance.

Yoshimoto et al. and Felice et al. employed different sub-components in their systems to deal with specific error types in the source sentence, and the outputs from the sub-systems, which are partially corrected, are merged to produce an error-free sentence. More specifically, Yoshimoto et al. used three systems to deal with all five error types in the CoNLL-2013 shared task, including a system based on the Treelet language model for verb form and subject-verb agreement errors, a classifier trained on both learner and native corpora for noun number errors, and an SMT based model for preposition and determiner errors [26]. Felice et al. combined both a rule based system and an SMT based system. Their outputs are combined to produce all possible candidates without overlapping corrections, and then a language model is trained to rerank those candidates [27].

Pipeline methods have also been researched for combining subsystems, where the output corrected by one system is passed as input to a following correcting system. Rozovskaya and Roth first applied a classification model followed by an SMT based system, since the SMT system has the ability to handle more complex situations than classifiers [36]. A similar framework was proposed by combining more powerful neural classifiers with an SMT based system [86]. Grundkiewicz and Junczys-Dowmunt constructed an SMT-NMT pipeline, and their experiments have shown that complementary corrections are made. They also explored using the NMT system to rescore the n-best hypotheses obtained through the SMT system to improve
[…] incorporated for rescoring [33].

• Adaptive language model. Adaptive LM scores are calculated from the n-best list's n-gram probabilities. N-gram counts are collected using the entries in the n-best list for each source sentence. N-grams occurring more often than others in the n-best list get higher scores, ameliorating incorrect lexical choices and word order. The n-gram probability for a hypothesis word ŷ_i given its history ŷ_{i−n+1..i−1} is defined as:

  P(ŷ_i | ŷ_{i−n+1..i−1}) = count(ŷ_{i−n+1..i}) / count(ŷ_{i−n+1..i−1})

where the counts are collected over the n-best list.

[…] λ_i is the weight for the i-th feature; f_i(·) is the feature function; ŷ is the hypothesis being reranked.

Also, several studies trained a discriminative ranking model to score the n-best list using preset features extracted from the hypotheses [33], [34]. Such incorporation of features ties the ranking performance closely to the ranking model that is trained. By weighting feature functions differently, as the parameter tuning in the previous methods does, the ranking models serve the same purpose: obtaining a learnable model that better combines the different feature functions' scores according to performance.
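As a minimal sketch of this kind of weighted feature-function reranking, the following toy example scores each hypothesis in an n-best list as Σ_i λ_i·f_i(ŷ), combining an adaptive-LM-style n-gram count feature with a length feature. The feature functions, weights and n-best list here are illustrative assumptions, not those of any cited system.

```python
from collections import Counter

def ngram_counts(nbest, n=2):
    """Collect n-gram counts over all entries of the n-best list
    (the adaptive language model idea described above)."""
    counts = Counter()
    for hyp in nbest:
        toks = hyp.split()
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def adaptive_lm_feature(hyp, counts, n=2):
    """Average n-best-list count of the hypothesis' n-grams:
    n-grams shared by many candidates score higher."""
    toks = hyp.split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return sum(counts[g] for g in grams) / len(grams) if grams else 0.0

def rerank(nbest, feature_fns, weights):
    """Score each hypothesis as sum_i lambda_i * f_i(hyp); return the best."""
    return max(nbest, key=lambda h: sum(w * f(h) for w, f in zip(weights, feature_fns)))

nbest = ["He go to school .", "He goes to school .", "He goes to the school ."]
counts = ngram_counts(nbest)
features = [
    lambda h: adaptive_lm_feature(h, counts),  # adaptive LM feature
    lambda h: -abs(len(h.split()) - 5),        # toy length feature
]
best = rerank(nbest, features, weights=[1.0, 0.5])
print(best)  # -> He goes to school .
```

A real system would tune the weights λ_i on a development set, exactly the role the parameter tuning and discriminative ranking models above play.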
4.3.2 Recombining the Edits

A large proportion of GEC systems regard the correction task as a special machine translation problem, which maps the original sentence and the corrected sentence to the source language sentence and the target language sentence in MT, respectively. However, a significant difference between GEC and MT should be noticed: commonly, the output of an MT system is produced and evaluated as a whole, whereas an output derived from GEC systems can be seen as a combination of edits or corrections which can be evaluated separately. The pilot implementation of this idea was done by a system taking advantage of outputs from both a classification and an SMT approach [29]. In this work, pairwise alignments are first constructed using MEMT, forming a confusion network, and then the one-best hypothesis is produced during a beam search process. Similar to MEMT, they employ features including language model, match, backoff and length features to rescore both partial and full hypotheses. Their method allows switching among all component systems, flexibly combining their outputs.

More recently, there have been some successful GEC systems that use more sophisticated ensemble strategies to combine multiple corrections extracted from the outputs produced by different systems, based on the fact mentioned above. Li et al. presented an ensemble method to integrate the outputs of 4 CNN based ensemble GEC models and 8 Transformer based GEC models and to resolve possible conflicts. They first built a confidence table for each individual system, consisting of the precision and F0.5 of each error type given by the ERRANT toolkit. Then, they designed three rules to merge the corrections output by different systems, using the precision as the confidence of each correction, increasing the confidence of identical corrections while discarding the conflicting correction with lower confidence. Three types of model ensemble with different combinations of CNN based and Transformer based models are investigated [84]. Kantor et al. proposed an ensemble method to combine different GEC systems at a higher scale, treating each individual system as a black box. Their method merges multiple M2 format files of disparate GEC systems by splitting the M2 format files and identifying the probability that an edit of an error type is observed in a subset of edits. An optimization problem is solved to determine the optimal probability for each error type [105].

4.4 Iterative Correction

Iterative correction in GEC aims at correcting source sentences not in a single decoding round, as in traditional sequence generation, but in multiple rounds of inference. The generated hypothesis is not regarded as the final output, but is fed into the model to be edited in the following iteration. This is because some sentences with multiple grammar errors may not be fully corrected in only one decoding round, which requires a higher inference ability of the model. The inspiration comes from the iterative beam search decoder [106], which corrects an ungrammatical sentence through multiple rounds. In each round, the decoder searches the hypothesis space to select the best correction for the source sentence. The selected sentence is input to the decoder as the source sentence in the next iteration and is incrementally corrected. Some language model based approaches rely on this algorithm [88], [89].

Fluency oriented iterative refinement [67] corrects the source sentence through multiple rounds of inference until the fluency of the hypothesis does not increase. The fluency of a sentence x is defined as follows:

  f(x) = 1 / (1 + H(x)),  where  H(x) = −(1/|x|) Σ_i log p(x_i | x_<i)

The probability p(x_i | x_<i) is given by a language model.

Some works also proposed an iterative decoding algorithm and successfully applied it to a model trained on out-of-domain Wikipedia data. In each iteration, an identity cost and every non-identity cost of the hypotheses in the beam are calculated, and the hypothesis with the minimal non-identity cost is retained. If the minimal non-identity cost is less than the product of the identity cost and a predetermined threshold, this hypothesis is viewed as a correction and becomes the input for the next iteration. The model continues to correct the input sentence until no more edits are required [54], [60], [107].

Other tricks which can be viewed as iterative correction include feeding the output into the ensemble translation model for a second-pass correction [84], letting the SMT based model and classifiers take turns making corrections until no more alterations are required [86], and the parallel iterative edit model [69], which takes the output of the model as the input for further refinement until the output is identical to a previous hypothesis or the iteration number reaches the maximum round.

4.5 Auxiliary Task Learning

Although auxiliary task learning has been proved beneficial on many NLP tasks, not many GEC works involve this technique. The most common auxiliary task for GEC is grammar error detection (GED), both word-level and sentence-level.

Word-level GED. Word-level GED can be treated as a token-level labeling task [56]. In particular, the token-level labeling task assigns a label to each source token indicating whether the source token is right or wrong, based on the hidden representation of the source token h_i^src. The model can learn more about the source token's correctness with this auxiliary task.

Sentence-level GED. Sentence-level GED can be treated as a sentence-level labeling task [59]. The model
is trained to predict whether source sentences are grammatical or not, based on the hidden representation of the source sentence H^src. This classification task can be modified into a sentence-level copying task, which enables the model to trust the source sentences more when they are grammatical [56]. Besides, the sentence-level GED task can also be combined with a sentence proficiency level prediction task [108]. A BERT is trained to predict both the binary GED label (whether correction is required) and the 3-level proficiency label simultaneously, and fine-tuned on each set of sentences with the same proficiency level.

4.6 Edit-weighted Objective

Another performance boosting technique is to directly modify the loss function in order to increase the weight of tokens edited between source and target [53], [102], [58], [60], [56]. To be more specific, suppose that each target token y_j can be aligned to a corresponding source token x_{a_j}, with a_j ∈ {0, 1, …, T_x}. If y_j differs from x_{a_j}, then the loss for y_j is enhanced by a factor Λ. The modified loss is as follows:

  ℓ(x, y) = −Σ_j Λ_j · log p(y_j | y_<j, x),  where Λ_j = Λ if y_j ≠ x_{a_j} and Λ_j = 1 otherwise

V. DATA AUGMENTATION

Data augmentation, known in GEC as Artificial Error Generation (AEG), has always been explored, since supervised models suffer from the scarcity and low quality of parallel data. Unlike machine translation, a large magnitude of parallel training data is not currently available in GEC. Thus, a wide spectrum of data augmentation methods incorporating pseudo training data has been studied and applied in order to ameliorate the problem. Most importantly, data augmentation methods should capture commonly observed grammatical errors and imitate them when generating pseudo data. In this section, we survey some typical data augmentation methods, most of which can be divided into two kinds: noise injection and back translation.

It is worth noting that the generated data can typically be used in two ways. First, append the generated data to the existing training data and use the combined data for training. In this approach, the pseudo data is viewed as equivalent to real training data. Second, use the generated data for pre-training the neural model, and then use the real training data for fine-tuning. We have discussed pre-training in Section 4.1. It is demonstrated that when used for pre-training, larger amounts of pseudo data yield better performance, while the equivalent-data setting does not show significant improvement in the final result [99]. As a result, many neural based GEC methods adopt pre-training to incorporate a large magnitude of artificial data.

5.1 Noise Injection

5.1.1 Deterministic Methods

Among the data augmentation methods that corrupt error-free data by adding noise or applying a noising function, some methods directly inject noise according to predefined rules, while others inject noise with consideration of more linguistic features. We first summarize direct noise injection methods, which are also called deterministic methods.

Izumi et al. first used predefined rules to create a corpus for grammar error detection by replacing prepositions in original sentences with alternatives [109]. Brockett et al. used predefined rules to alter quantifiers, generate plurals and insert redundant determiners [22]. Lee and Seneff defined rules to change verb forms to create an artificial corpus [110]. Ehsan and Faili defined error templates to inject artificial errors into treebank sentences when the original sentences matched the error templates [111]. More recently, similar to direct noising methods applied in low-resource machine translation [112], Zhao et al. generated artificial data using corruption operations including (1) inserting a random token with some probability; (2) deleting a random token with some probability; (3) probabilistically replacing a token with another token randomly picked from the dictionary; (4) shuffling the tokens [56]. Lichtarge et al. applied character-level deletion, insertion, replacement and transposition to create misspelling errors [54]. Similarly, Yang and Wang corrupted the One Billion Word Benchmark corpus by deleting a token, adding a token and replacing a token with equal probability [58].

5.1.2 Probabilistic Methods

Although deterministic noise injection methods are effective to some extent, the generated errors, such as random word order shuffling and word replacement, are less realistic than the errors observed in GEC datasets. Many approaches inject noise into original data with consideration of more linguistic features, so that the generated errors more closely resemble real errors made by English learners. We call these approaches probabilistic methods.

There are some typical works from the early stage. GenERRate [113] is an error generation tool with more consideration of POS, morphology and context. Rozovskaya and Roth proposed three methods to generate artificial article errors with more consideration of frequency information and error distribution statistics of ESL corpora: (1) injecting articles so that their distribution in the generated data resembles the distribution of articles in ESL corpora before the annotator's correction; (2) injecting articles so that
their distribution in the generated data resembles the distribution of articles in ESL corpora after the annotator's correction; and (3) injecting articles according to a specific conditional probability. P(token_src | token_trg) is estimated from annotated ESL corpora, and denotes the probability that article token_src should be corrected into article token_trg. During error generation, an article token_trg in error-free context is replaced with token_src with probability P(token_src | token_trg) [114]. Following this, the inflation method [115] was proposed in order to solve the problem of error sparsity and low error rates in artificial data. This noise injection approach has been widely applied by subsequent GEC models and systems, especially classification based ones [80], [81]. Yuan and Felice extracted two types of correction patterns from the NUCLE corpus, context tokens and POS tags, and injected the patterns into the EVP corpus with equal probability to generate artificial training data [24].

More recently, Felice and Yuan extended the previous work by injecting errors with more linguistic information, and their approach generated 5 types of errors. Specifically, they refined the conditional probability P(token_src | token_trg) in the following aspects: (1) estimating the probability of each error type and using it to change relevant instances in error-free context; (2) estimating the conditional probability of words in specific classes for different morphological contexts; (3) estimating the conditional probability of source words when the POS of target words is assigned; (4) estimating the conditional probability of source words when the semantic classes of target words are assigned; (5) estimating the conditional probability of source words when the particular sense of target words is assigned. Experimental results validated the effectiveness of their consideration of various linguistic characteristics [116]. Based on this, the pattern extraction method [117] is an improvement and can be applied to generate errors of all types. In this method, correction patterns (ungrammatical phrase, grammatical phrase) are extracted from parallel sentences in the training corpus, and errors are injected by looking for matches between error-free context and the grammatical phrase. Xu et al. generated 5 types of errors, including concatenation, misspelling, substitution, deletion and transposition, by considering the relation between sentence length and the number of errors, and assigned a probability distribution to generate each error type [100]. They designed an algorithm to create a word tree that was used to create more likely substitution errors.

Other noise injection methods have also been proposed and applied in recent GEC research. The Aspell spellchecker was applied to create confusion sets to generate word replacement errors more accurately than randomly selecting a substitute from a word list [57]. Choe et al. generated artificial errors based on an inspection of frequent edits and error types in annotated GEC corpora [55]. Besides, Kantor et al. generated synthetic data by applying the corrections observed in the training corpus W&I backward to native error-free data [105].

5.2 Back Translation

Back translation was initially proposed to augment training data for neural machine translation [118] and then adapted to GEC. In this kind of AEG method, a reverse machine translation model is trained, translating error-free grammatical data into ungrammatical data. Rei et al. first trained a phrase-based SMT model on the public FCE dataset to generate errors. The trained model was further applied to the EVP dataset to extract generated parallel data. They also made attempts on other monolingual (error-free) data such as the Wikipedia dataset but failed to obtain data of equal quality, which was demonstrated by their development experiments, where keeping the writing style and vocabulary close to the target domain gives better results compared to simply including more data [117].

Kasewa et al. first applied an attention-based neural encoder-decoder model to error generation [119]. Three different error generation methods were discussed and compared. (1) Argmax (AM) selects the most likely word at each decoding time step according to each candidate token's generation probability p_i. (2) Temperature Sampling (TS) involves a temperature parameter τ to alter the distribution:

  p̃_i = p_i^{1/τ} / Σ_j p_j^{1/τ}

which controls the diversity of the generated tokens. (3) Beam Search (BS) maintains the n best hypotheses at each time step according to their scores. Experimental results show that the Beam Search error generation method improves the performance of an error detection model trained on the generated parallel data most significantly.

However, traditional beam search in the decoder would result in far fewer errors in the generated ungrammatical sentences than in the original noisy text [103]. As a solution, the beam search noising schemes can be extended through (1) penalizing each hypothesis ŷ in the beam by adding kβ_rank to its score s(ŷ), where k is its rank in descending log probability p(ŷ) and β_rank is a hyperparameter; (2) penalizing only the top hypothesis h_top of the beam by adding β_top to s(h_top); and (3) penalizing each hypothesis in the beam by adding rβ_random to its score s(ŷ), where r is a random variable between 0 and 1. As a result, more diversity and noise are involved in the synthesised sentences.

Quality control [59] was applied to guarantee that the generated sentences are less grammatical. To be more specific, the generated ungrammatical sentence ỹ can be added to the training set only when ỹ and the error-free original sentence y satisfy the following inequation:
[…] where the variable σ is learned on a development set of size |N|.

TABLE 5
Different evaluation metrics

Metric  | Definition | Multiple references | System-level Pearson r | System-level Spearman ρ | Sentence-level Kendall τ
M2      | Eq. 54-56  | max                 | 0.623                  | 0.687                   | 0.617
I       | Eq. 58-59  | max                 | -0.25                  | -0.385                  | 0.564
GLEU    | Eq. 60-62  | average             | 0.691                  | 0.407                   | 0.567
ERRANT  | Eq. 54-56  | max                 | 0.64                   | 0.626                   | 0.623
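Since the metrics in Table 5 and the analysis that follows quote precision-weighted F0.5 scores throughout, a minimal sketch of how precision, recall and F0.5 are computed from proposed and gold edit sets (the core of the M2-style evaluation) may help; the edit representation used here is a simplified assumption.

```python
def f_beta(proposed, gold, beta=0.5):
    """Precision, recall and F_beta over edit sets.

    Edits are hashable tuples (start, end, replacement); beta=0.5 weighs
    precision twice as heavily as recall, as is standard in GEC.
    """
    proposed, gold = set(proposed), set(gold)
    tp = len(proposed & gold)                      # true positives: matching edits
    p = tp / len(proposed) if proposed else 1.0
    r = tp / len(gold) if gold else 1.0
    if p + r == 0:
        return 0.0, 0.0, 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f

gold = [(1, 2, "goes"), (4, 4, "the")]            # reference edits
proposed = [(1, 2, "goes"), (6, 7, "a")]          # system edits
p, r, f05 = f_beta(proposed, gold)
print(p, r, f05)  # -> 0.5 0.5 0.5
```

Real scorers such as the M2 scorer and ERRANT additionally extract and align edits from raw sentence pairs and take the maximum score over multiple reference annotations, as the table's "max" column indicates.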
count score given by e-rater (ER) and LanguageTool (LT) [132]. Then, taking fluency and meaning preservation into consideration demonstrates that reference-less metrics can replace reference-based metrics in evaluating GEC systems [133]. Besides, a statistical model is trained to scale the grammaticality of a sentence using features including misspelling counts, n-gram language model scores, parser outputs, and features extracted from a precision grammar parser [131]. Finally, based on the UCCA scheme [134], USIM is a reference-less measure of semantic faithfulness which compares the differences between the UCCA structures of outputs and source sentences according to alignable but different tokens. Compared to RBMs requiring many references, USIM has a lower cost in GEC systems. It also shows sensitivity to changes in meaning [135].

6.2 Analysis of Approaches

This section presents an analysis of the previously discussed approaches, including approaches based on SMT, NMT, LM and classification. The analysis mainly focuses on two aspects: the development of MT based approaches and the comparison among different approaches.

6.2.1 Development of Approaches

Fig. 4. Development of SMT based approaches and NMT based approaches.

We summarize the experimental results of the SMT based systems and NMT based systems that we have introduced in Section 3 in Figure 4. The systems are evaluated on the CoNLL-2014 test set with the M2 metric. The experimental results of the selected systems mainly reflect the performance of the approaches themselves, although some may be combined with other performance boosting techniques. The systems are arranged in chronological order. It is obvious that incremental performance is achieved in both SMT based approaches and NMT based approaches.

The potential of SMT based GEC systems was first exposed through their top-tier performance in the CoNLL-2014 shared task, with F0.5 scores of 35.01 and 37.33 [28], [27]. Following work [37] developed models tailored to the GEC task by incorporating GEC-specific features into standard SMT based approaches and tuning towards M2. Their system provided a strong SMT baseline at a 49.49 F0.5 score that even outperformed previous hybrid approaches [29], [36] which made attempts to take advantage of both SMT and classifiers. The integration of neural network features such as NNJM and NNGLM into SMT based systems brings benefits [31], [32], [38]. Apart from the incremental performance produced by NNJM, a 53.14 F0.5 score was achieved due to an additional character-level SMT model [38]. It is also confirmed that systems augmented with reranking techniques enjoy an extra bonus in final performance [33], [34], [35], which requires no modification of model structures.

For NMT models, combining word-level attention and character-level attention proves effective, achieving 45.15 F0.5 [49]. The multi-layer CNN model brings significant improvement, but the final performance (54.79 F0.5) is not only achieved by the translation model itself, but also enhanced by performance boosting techniques, by about 8 points [50]. Replacing the traditional RNN with a Transformer, together with other performance boosting techniques, also behaves better [53]. Another significant improvement is attained by adding a copy mechanism to a pre-trained Transformer [56]. We will discuss more details about the performance boosting techniques in the next section.

Apart from the analysis above concerning SMT and NMT separately, we can also draw a brief conclusion from the broken lines in Figure 4 with regard to several turning points and preference switches among different approaches. Before the CoNLL-2014 task, classification based and rule based approaches were widely used in GEC; they are not included in Figure 4, as they predate all the works described in the line graph. Although SMT was proposed as early as 2006 [22] with considerable performance on a certain error type, further research stagnated due to the lack of parallel data, until the shared tasks, together with the proposal of NUCLE and the utilization of web-crawled language learner data, revived interest in GEC [23], [24]. Systems exploiting SMT based approaches [28], [27] obtained 35.01 and 37.33 respectively in the CoNLL-2014 shared task, as shown in the graph, seizing the first and third places among all systems. Later work [29] that combined both SMT and classification in 2014 obtained state-of-the-art performance with a 39.39 F0.5 score, which demonstrated that SMT based methods had achieved an equal or even more significant position compared to classifiers in the field of GEC. More research employing SMT in 2016 developed models better tailored to the GEC task, and a strong baseline was created entirely with SMT [37], which reached a new state-of-the-art result of 49.49. This work also symbolized the peak of SMT based systems in GEC at the time. Early in the same year, several pilot works introduced NMT based methods, utilizing attention-
TABLE 6
Comparison of different GEC approaches. We use "+" and "-" to indicate the positive and negative characteristics of the approaches with respect to each property.

Error coverage
  SMT: + all errors occurring in the training data
  NMT: + all errors occurring in the training data
  Classification: - only errors covered by the classifiers
  LM: - only errors covered by the confusion set

Error complexity
  SMT: + handled automatically through parallel data; limited to within a phrase
  NMT: + handled automatically through parallel data; captures longer dependencies
  Classification: - needs to be developed via specific approaches
  LM: - needs more complex confusion sets to be developed

Generalizability
  SMT: - only confusions observed in training can be corrected
  NMT: + generalizes through continuous representation and approximation
  Classification: + well generalizable via confusion sets and features
  LM: - only errors in confusion sets can be corrected

Supervision/Annotation
  SMT: - required
  NMT: - required
  Classification: + not required
  LM: + not required

System flexibility
  SMT: - inflexible; sparse features used for specific phenomena
  NMT: - inflexible; multi-task learning facilitates specific phenomena
  Classification: + flexible; phenomenon-specific knowledge sources
  LM: + flexible; knowledge injected into confusion sets
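To make the LM column of Table 6 concrete, the following sketch shows the confusion-set-plus-LM correction loop from Section 3.4: each confusable token is tentatively replaced, and a substitution is kept only if the normalised LM score improves. The bigram probabilities and confusion sets here are toy assumptions, not a trained model.

```python
import math

# Toy bigram probabilities and confusion sets (illustrative assumptions).
BIGRAMS = {("<s>", "he"): 0.1, ("he", "goes"): 0.05, ("he", "go"): 0.001,
           ("goes", "home"): 0.04, ("go", "home"): 0.01}
CONFUSION = {"go": ["go", "goes", "went"], "goes": ["go", "goes", "went"]}

def log_prob(tokens):
    """Length-normalised log probability under the toy bigram LM,
    with a small floor for unseen bigrams."""
    lp = 0.0
    for prev, cur in zip(["<s>"] + tokens, tokens):
        lp += math.log(BIGRAMS.get((prev, cur), 1e-6))
    return lp / len(tokens)

def correct(tokens):
    """Greedy single-pass LM correction: try each confusable substitution
    and keep it only if the LM score improves."""
    best = list(tokens)
    for i, tok in enumerate(tokens):
        for cand in CONFUSION.get(tok, []):
            hyp = best[:i] + [cand] + best[i + 1:]
            if log_prob(hyp) > log_prob(best):
                best = hyp
    return best

print(correct(["he", "go", "home"]))  # -> ['he', 'goes', 'home']
```

Only tokens present in a confusion set can ever be changed, which is exactly the coverage limitation the table notes for LM based (and classification based) approaches.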
Fig. 5. Recall for each error type, indicating how well different approaches perform on a particular error type. The green dashed line is at the height where recall is 10. Both the classification and SMT systems we select from the CoNLL-2014 shared task are almost bare-bone systems without any other performance boosting techniques. NMT-1 and NMT-2 represent systems using a CNN-based and an RNN-based sequence-to-sequence model, respectively. Results of LM-based methods are not covered due to the lack of related empirical work.
augmented RNN, into the GEC task [47], [45]. Although such NMT-based methods had never been tried in GEC before, they showed promising results, with F0.5 scores of 40.56 and 39.9. Under the influence of the empirical performance of neural methods, SMT-based systems further modified their models with additional neural features to remedy the inability of SMT methods to capture context information or build continuous representations. Indeed, a range of studies [31], [32], [38] demonstrated that neural network features such as NNJM did bring extra improvement to the standard SMT model in GEC.

More recently, an NMT-based model employing a convolutional sequence-to-sequence architecture outperformed the predominant SMT approaches with an F0.5 score of 54.79 [50]; the dependencies that the correction process requires can be captured more readily than with an RNN, since convolutions construct shorter link paths among tokens. These refined techniques and optimized network structures better released the potential of neural methods to incorporate context, and pushed NMT-based approaches to a new state of the art. A subsequent hybrid system [39] that combined NMT and SMT further demonstrated the effectiveness and efficiency of NMT-based approaches in GEC. Then the Transformer, capable of constructing more flexible and efficient links among tokens, was introduced into GEC and produced better performance [53], after which most GEC systems preferred the Transformer as their baseline model. Besides, work using a parallel iterative edit model, inspired by the use of parallel models in machine translation, obtained performance comparable to encoder-decoder models [69]. Benefiting from its advantage in time cost and its GEC-tailored structure, this kind of model framework deserves further development.

6.2.2 Comparison Among Approaches
We now compare the different approaches along several key properties that distinguish GEC approaches and that we identified to better characterize the models' frameworks as well as their output performance. We mainly consider the same properties as previous empirical work [36], but extend the conclusions to all types of models in GEC:
TABLE 8
Recapitulation of integrated GEC systems. "A" and "E" refer to auxiliary task learning and the edit-weighted objective, respectively. We make the following explanation with regard to systems notated with asterisks. Round-trip translation [107] is described in Section 5.3. Approaches of systems notated with three asterisks are language model based or classification based; training these systems requires large monolingual data instead of parallel training data, so their cells in the "Corpus" columns are blank.

Columns: Source | Year | Approach (SMT, NMT, Classification, LM, Hybrid) | Performance Boosting Technique (Pre-training, Reranking, Ensemble, Iterative Correction, Others) | Data Augmentation (Noising, Back Translation) | Corpus (NUCLE, FCE, Lang-8, W&I+LOCNESS) | Evaluation (CoNLL-2014 M2, BEA-2019 ERRANT, JFLEG GLEU)

[99]   2019  √ √ √ √ √ √ √        65     70.2   61.4
[57]   2019  √ √ √ √ √ √ √ √ √    64.16  69.47  61.16
[100]  2019  √ √ √ √ √ √          63.2   66.61  62.6
[69]*  2019  √ √ √ √ √ √ √        61.2   -      61
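Iterative correction, one of the boosting techniques tallied in Table 8, simply re-applies the corrector to its own output until the sentence stops changing or a round limit is reached. A minimal sketch, where `correct_once` stands in for any single-pass GEC model (here a toy rule table, purely for illustration):

```python
# Toy single-pass corrector: each pass fixes only what its rules cover.
RULES = {"a apple": "an apple", "have ate": "have eaten"}

def correct_once(sentence: str) -> str:
    for wrong, right in RULES.items():
        sentence = sentence.replace(wrong, right)
    return sentence

def iterative_correct(sentence: str, max_rounds: int = 5) -> str:
    """Re-apply the model until the output converges, as in iterative decoding."""
    for _ in range(max_rounds):
        corrected = correct_once(sentence)
        if corrected == sentence:       # fixed point reached, stop early
            break
        sentence = corrected
    return sentence

print(iterative_correct("I have ate a apple ."))  # I have eaten an apple .
```

The appeal for GEC is that corrections applied in an early round can expose further errors that only become visible in the partially corrected sentence.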
and their approximation property when learning; nevertheless, limitations remain when further generalization is required. Although classification methods cannot cover all types of errors well, they still demonstrate a capacity for generalization on certain error types: for categories such as prepositions or articles, more general rules can be obtained.

Another property of the model framework is flexibility: whether the model allows a flexible architecture into which a range of error-specific knowledge can be incorporated via additional resources. Classification methods, which exploit various classifiers suited to individual error types, have demonstrated their flexibility by adjusting patterns or augmenting with specific classifiers. Although MT-based methods appear to have a more fixed structure, where a whole model is used to solve all error types, several additional techniques have been proposed to better incorporate extra resources on top of the standard MT methods. For SMT, a range of features is often integrated into the log-linear model, an architecture that also allows extra information to be incorporated as sparse features, such as syntactic patterns and certain correction patterns. In NMT-based methods, extra information is more difficult to inject, although methods such as multi-task learning can be used directly against several specific phenomena.
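The confusion-set flexibility described above, injecting knowledge by enumerating candidate replacements and letting a language model arbitrate, can be sketched as follows. The bigram scores and confusion sets here are toy assumptions standing in for a real language model and real lexical resources:

```python
import math

# Toy "language model": log-probabilities for a handful of bigrams.
BIGRAM_LOGP = {
    ("an", "apple"): math.log(0.9),
    ("a", "apple"): math.log(0.01),
    ("the", "apple"): math.log(0.09),
}

# Knowledge injected as confusion sets: candidates for article errors.
CONFUSION_SETS = {"a": ["a", "an", "the"], "an": ["a", "an", "the"]}

def score(tokens):
    """Sum of bigram log-probabilities; unseen bigrams get a small floor."""
    return sum(BIGRAM_LOGP.get(bg, math.log(1e-6))
               for bg in zip(tokens, tokens[1:]))

def correct(tokens):
    """For each word in a confusion set, keep the candidate the LM prefers."""
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        best = tok
        for cand in CONFUSION_SETS.get(tok, []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            if score(trial) > score(tokens[:i] + [best] + tokens[i + 1:]):
                best = cand
        tokens[i] = best
    return tokens

print(correct(["a", "apple"]))  # ['an', 'apple']
```

Extending coverage to a new error type only requires adding a confusion set, which is exactly the flexibility (and the coverage limitation) that Table 6 attributes to LM-based methods.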
6.3 Analysis of the Techniques
Table 7 summarizes the improvement in system performance brought by the performance boosting techniques introduced in Section 4. The experimental results come from the ablation studies in the original papers, which we collect together. Most systems are evaluated on the CoNLL-2014 test set using the M2 F0.5 evaluation metric.

While all the techniques lift performance to some degree, pre-training is the most effective, especially considering the significant increase in ERRANT F0.5 on the W&I+LOCNESS development set, from 47.01 to 55.37 (an increase of 8.36), and on the test set, from 53.24 to 67.21 (an increase of 13.97) [100]. Ensemble and reranking both yield steady improvement, with ensemble generally a little more beneficial than reranking. Auxiliary task learning brings the least improvement, possibly because the systems have already been combined with other techniques and enhanced, weakening the additional benefit brought by auxiliary task learning alone. Although iterative correction seems promising on the CoNLL-2014 test set, with an increase of 3.10 M2 F0.5, the beneficial effect is minor on the W&I+LOCNESS development set (0.67 ERRANT F0.5 [84] and 0.98 ERRANT F0.5 [60]). Besides, the boosting effect of the edit-weighted objective (3.00 M2 F0.5), reported on only one system [53], may not generalize to other systems.

6.4 Analysis of GEC Systems
Table 8 recapitulates typical GEC systems after CoNLL-2014, plus the top 3 ranking systems in CoNLL-2014, where each column of the table corresponds to a subsection of the survey. The systems include works that present an initial trial of a new approach, research that explores new beneficial techniques and data augmentation methods, and submissions to the CoNLL-2014 and BEA-2019 shared tasks that combine different components based on existing works. Note that for the CoNLL-2014 shared task, we include only the 3 top-ranking systems, since the performance of most systems in CoNLL-2014 is relatively limited compared to subsequent works, weakening their value for reference. For the BEA-2019 task, a system might participate in multiple tracks, so for a more controlled analysis we include the evaluation results on the restricted track, where the available annotated training data (i.e. FCE, Lang-8, NUCLE and W&I+LOCNESS) and the unlimited monolingual data are the same across systems. For a full report on the participants in BEA-2019, we refer to the task overview paper [7]. We summarize each system in terms of which basic approaches it is based on, which techniques and data augmentation methods it uses, and which datasets are used for training. Systems are evaluated on at least one of the three test sets with the corresponding evaluation metric: the CoNLL-2014 test set with M2 F0.5, the BEA-2019 test set with ERRANT F0.5, and JFLEG with GLEU. We include the best score and the corresponding configuration of approaches, techniques and data augmentation methods. Systems published before the CoNLL-2014 shared task are not included, since they were not evaluated on any of the three test sets. Systems are ordered primarily by publication year. For systems from the same year, we rank them by their performance on the CoNLL-2014 test set first, then the BEA-2019 test set, and last JFLEG. This is because the CoNLL-2014 test set is the most widely used benchmark in today's GEC research community, whereas the BEA-2019 test set and JFLEG are not adopted as frequently. All systems evaluated on the BEA-2019 test set using the ERRANT metric, except two systems [99], [105], are submissions to the BEA-2019 shared task, among which some also report M2 on the CoNLL-2014 test set while others do not. Besides, no system was evaluated on JFLEG before it was released.

Great progress has been achieved in the five years after the CoNLL-2014 shared task. Although the submissions to the CoNLL-2014 shared task included in Table 8 ranked top 3 [27], [28], [81], they have been largely superseded by more powerful systems, and incremental improvement in performance is observed approaching 2019. Several systems that participated in BEA-2019 are also evaluated on the CoNLL-2014 test set using the M2 evaluation metric [99], [57], [100], [55], [91], and their scores significantly surpass those of the top-ranked participants in the CoNLL-2014 shared task.

Neural machine translation based approaches have become increasingly popular in recent years thanks to the rise of deep learning, and the NMT-based approach has clearly been dominant since 2017. All submissions to the BEA-2019 shared task are NMT based, except for [92], which participated in the low-resource track only and adopted a language model based approach. However, although NMT-based GEC was researched as early as 2014 [45], [47], it did not outperform state-of-the-art SMT-based GEC until [50]. With the proposal of the Transformer, many NMT-based systems started to gain higher scores. Nearly two thirds of the systems in the BEA-2019 shared task are based on the Transformer, which also leads to similar scores for some NMT-based systems [100], [84], [59].

The proposal and application of performance boosting techniques also contribute to the better performance of GEC systems. Pre-training has become a requisite for the more powerful NMT-based GEC systems no matter what other techniques are adopted, owing to its unmatched ability to incorporate artificial training data or large monolingual data; examples include the systems with the top 5 scores on the CoNLL-2014 test set [99], [57], [100], [69], [56] and the systems ranking top 4 on the BEA-2019 test set [105], [99], [57], [55]. Ensemble is the most widely adopted technique and always yields better performance than a single model. More than half of the systems in 2019 used the combination of an NMT-based GEC approach, pre-training, and an ensemble of multiple basic models with identical architecture. It is worth mentioning that the system with the highest ERRANT score (73.18) [105] and the system with the fourth highest score (66.78) [84] on the BEA-2019 test set are based on sophisticated ensemble strategies that combine the output of different GEC models, highlighting the benefit brought by combining different GEC systems; we have discussed them in Section 4.3. While pre-training and ensembling are restricted to NMT-based systems, reranking is more universal since it is more model-independent. Reranking was researched as early as 2014 in combination with SMT-based systems [27], [33], [34], [35], and it remains important for better final performance, as it is adopted by the systems ranked 2-5 on the BEA-2019 test set. Iterative correction is not as widely applied or as effective, but it can be used with systems based on more approaches [69], [86]. Auxiliary task learning and the edit-weighted training objective are less extensively used and are restricted to NMT-based systems.

An important discrepancy between the two data augmentation methods is that data generated via noising is always used for pre-training [99], [57], [100], [56], [55], [105], while data generated by back translation is treated as equivalent to authentic training data. This is because data generated by noising may not precisely capture the distribution of error patterns in realistic data, although many explorations have been conducted to improve the quality of the noisy data (Section 5.1). Besides, it can be concluded that the more sophisticated the noising strategy, the better the performance gained; for example, the system pre-trained on augmented data generated by a sophisticated process achieved a 63.2 M2 F0.5 score on the CoNLL-2014 test set [100], about 5 points higher than the system pre-trained on data generated by a deterministic approach, which achieved 58.3 M2 F0.5 on the same test set [54].

MT-based systems rely heavily on parallel training data. Almost all the participants in BEA-2019 used all the available training corpora, and NUCLE and Lang-8 are used by nearly every MT-based system. FCE is not used as extensively, possibly due to its lack of multiple references. Although the performance of classification based and LM based systems is relatively limited compared to MT-based systems, their low reliance on parallel training data gives them many advantages for application in low-resource scenarios, for example, grammar error correction for languages other than English. In terms of evaluation, no system obtains the highest score on all test sets according to the corresponding metrics, which suggests that evaluation results may vary with different test sets; we discuss this further in Section 7.

VII. PROSPECTIVE DIRECTIONS
In this section, we discuss several prospective directions for future work.

• Adaptation to L1. Most existing works are L1-agnostic, treating text written by writers with various first languages as equivalent. However, for better application of GEC, prospective GEC systems should provide more individualized feedback to English learners, taking their L1s into consideration. Although some works have already focused on several specific first languages [32], [136], there is still much room to be explored.

• Low-resource scenario GEC. Datasets in machine translation have tens of millions of sentence pairs, but the amount of parallel data in GEC is not comparable even in the largest corpus, Lang-8. What's worse, when applied to other minority languages where a large amount of parallel data is not available, the performance of many powerful MT-based GEC systems will be seriously impaired. Training better GEC systems in low-resource scenarios therefore remains to be explored. Possible solutions may be better pre-training and data augmentation strategies to incorporate large error-free texts, and more exploration of LM-based approaches that do not rely on supervision.

• Combination of disparate systems. It has been demonstrated that different GEC systems are better at different sentence proficiencies [7]. Besides, as indicated before, the systems achieving the 1st and 4th best performance on the BEA-2019 test set are based on ensembles of multiple individual GEC systems. This evidence suggests that it is promising to explore combination strategies to better incorporate the strengths of disparate GEC systems, which may be specialized for different error types, topics, sentence proficiency levels and L1s.

• Datasets. The most widely used dataset in GEC is NUCLE, which has a single sentence proficiency, topic and L1, as indicated in Table 1. This lack of variety means that the performance of GEC systems in other conditions remains unknown [137]. Besides, most datasets contain only a limited number of references for source sentences. More references increase inter-annotator agreement [15], but adding references requires extra labor. How many references should be added to datasets, with cost taken into consideration, remains to be researched.

• Better evaluation. Systems have typically been evaluated on a single test set. However, different test sets lead to inconsistent evaluation results, and cross-corpora evaluation has been proposed as a remedy [124]. Besides, although existing evaluation metrics capture grammatical correction and fluency, none measures the preservation of meaning, which is also necessary to consider when
evaluating a GEC system. So, more comprehensive metrics should account for the grammar, fluency and meaning fidelity of system output.

VIII. CONCLUSION
We present the first survey of grammar error correction (GEC) as a comprehensive retrospective of existing progress. We first give a definition of the task and an introduction to the public datasets, annotation schemas, and two important shared tasks. Then, four dominant basic approaches and their development are explained in detail. After that, we classify the numerous performance boosting techniques into six branches and describe their application and development in GEC, and two data augmentation methods are introduced separately due to their importance. More importantly, in Section 6, after introducing the standard evaluation metrics, we give an in-depth analysis based on empirical results in terms of approaches, techniques and integrated GEC systems, for a clearer picture of existing works. Finally, we present five prospective directions based on existing progress in GEC. We hope our effort can assist future research in the community.

REFERENCES

[1] D. Naber, "A rule-based style and grammar checker," 01 2003.
[2] M. Gamon, C. Leacock, C. Brockett, W. Dolan, J. Gao, D. Belenko, and A. Klementiev, "Using statistical techniques and web search to correct ESL errors," CALICO Journal, vol. 26, pp. 491–511, 01 2013.
[3] D. Dahlmeier, H. T. Ng, and S. M. Wu, "Building a large annotated corpus of learner English: The NUS corpus of learner English," in Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Atlanta, Georgia: Association for Computational Linguistics, Jun. 2013, pp. 22–31.
[4] T. Tajiri, M. Komachi, and Y. Matsumoto, "Tense and aspect error correction for ESL learners using global context," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Jeju Island, Korea: Association for Computational Linguistics, Jul. 2012, pp. 198–202.
[5] H. Yannakoudakis, T. Briscoe, and B. Medlock, "A new dataset and method for automatically grading ESOL texts," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, Jun. 2011, pp. 180–189.
[6] C. Napoles, K. Sakaguchi, and J. Tetreault, "JFLEG: A fluency corpus and benchmark for grammatical error correction," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 229–234.
[7] C. Bryant, M. Felice, Ø. E. Andersen, and T. Briscoe, "The BEA-2019 shared task on grammatical error correction," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 52–75.
[8] D. Nicholls, "The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT," in Proceedings of the Corpus Linguistics 2003 conference, 2003, pp. 572–581.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[10] C. Buck, K. Heafield, and B. van Ooyen, "N-gram counts and language models from the Common Crawl," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 3579–3584.
[11] C. Bryant, M. Felice, and T. Briscoe, "Automatic annotation and evaluation of error types for grammatical error correction," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 793–805.
[12] J. Tetreault and M. Chodorow, "Native judgments of non-native usage: Experiments in preposition error detection," in Coling 2008: Proceedings of the workshop on Human Judgements in Computational Linguistics. Manchester, UK: Coling 2008 Organizing Committee, Aug. 2008, pp. 24–32.
[13] A. Rozovskaya and D. Roth, "Annotating ESL errors: Challenges and rewards," in Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications. Los Angeles, California: Association for Computational Linguistics, Jun. 2010, pp. 28–36.
[14] K. Sakaguchi, C. Napoles, M. Post, and J. Tetreault, "Reassessing the goals of grammatical error correction: Fluency instead of grammaticality," Transactions of the Association for Computational Linguistics, vol. 4, pp. 169–182, 2016.
[15] M. Chodorow, M. Dickinson, R. Israel, and J. Tetreault, "Problems in evaluating grammatical error detection systems," in Proceedings of COLING 2012. Mumbai, India: The COLING 2012 Organizing Committee, Dec. 2012, pp. 611–628.
Computational Linguistics, Aug. 2016, pp. 2205–2215.
[37] M. Junczys-Dowmunt and R. Grundkiewicz, "Phrase-based machine translation is state-of-the-art for automatic grammatical error correction," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 1546–1556.
[38] S. Chollampatt and H. T. Ng, "Connecting the dots: Towards human-level grammatical error correction," in Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 327–333.
[39] R. Grundkiewicz and M. Junczys-Dowmunt, "Near human-level performance in grammatical error correction with hybrid machine translation," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 284–290.
[40] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Prague, Czech Republic: Association for Computational Linguistics, Jun. 2007, pp. 177–180.
[41] N. Durrani, A. Fraser, H. Schmid, H. Hoang, and P. Koehn, "Can Markov models over minimal translation units help phrase-based SMT?" in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 399–405.
[42] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013.
[43] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," CoRR, 2014.
[44] Z. Yuan, "Grammatical error correction in non-native English," University of Cambridge, Computer Laboratory, Tech. Rep., Mar. 2017.
[45] Z. Yuan and T. Briscoe, "Grammatical error correction using neural machine translation," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 380–386.
[46] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014.
[47] Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng, "Neural language correction with character-based attention," CoRR, 2016.
[48] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," CoRR, 2015.
[49] J. Ji, Q. Wang, K. Toutanova, Y. Gong, S. Truong, and J. Gao, "A nested attention neural hybrid model for grammatical error correction," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 753–762.
[50] S. Chollampatt and H. T. Ng, "A multilayer convolutional encoder-decoder neural network for grammatical error correction," CoRR, 2018.
[51] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," CoRR, 2017.
[52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, 2017.
[53] M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield, "Approaching neural grammatical error correction as a low-resource machine translation task," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 595–606.
[54] J. Lichtarge, C. Alberti, S. Kumar, N. Shazeer, and N. Parmar, "Weakly supervised grammatical error correction using iterative decoding," CoRR, 2018.
[55] Y. J. Choe, J. Ham, K. Park, and Y. Yoon, "A neural grammatical error correction system built on better pre-training and sequential transfer learning," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 213–227.
[56] W. Zhao, L. Wang, K. Shen, R. Jia, and J. Liu, "Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 156–165.
[57] R. Grundkiewicz, M. Junczys-Dowmunt, and K. Heafield, "Neural grammatical error correction systems with unsupervised pre-training on synthetic data," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 252–263.
[58] L. Yang and C. Wang, "The BLCU system in the BEA 2019 shared task," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 197–206.
[59] Z. Yuan, F. Stahlberg, M. Rei, B. Byrne, and H. Yannakoudakis, "Neural and FST-based approaches to grammatical error correction," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 228–239.
[60] J. Náplava and M. Straka, "CUNI system for the building educational applications 2019 shared task: Grammatical error correction," in Proceedings of the Fourteenth Workshop on Innovative Use of
NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 183–190.
[61] M. Kaneko, K. Hotate, S. Katsumata, and M. Komachi, "TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track," in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 207–212.
[62] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio, "Pointing the unknown words," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 140–149.
[63] J. Tiedemann and Y. Scherrer, "Neural machine translation with extended context," in Proceedings of the Third Workshop on Discourse in Machine Translation. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 82–92.
[64] S. Jean, S. Lauly, O. Firat, and K. Cho, "Does neural machine translation benefit from larger context?" 2017.
[65] S. Kuang, D. Xiong, W. Luo, and G. Zhou, "Modeling coherence for neural machine translation with dynamic and topic caches," in Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 596–606.
[66] S. Chollampatt, W. Wang, and H. T. Ng, "Cross-sentence grammatical error correction," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 435–445.
[67] T. Ge, F. Wei, and M. Zhou, "Fluency boost learning and inference for neural grammatical error correction," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1055–1065.
[68] K. Sakaguchi, M. Post, and B. Van Durme, "Grammatical error correction with neural reinforcement learning," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017, pp. 366–372.
[69] A. Awasthi, S. Sarawagi, R. Goyal, S. Ghosh, and V. Piratla, "Parallel iterative edit models for local sequence transduction," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 4260–4270.
[70] L. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer, "Fast decoding in sequence models using discrete latent variables," CoRR, 2018.
[71] E. Malmi, S. Krause, S. Rothe, D. Mirylenka, and A. Severyn, "Encode, tag, realize: High-precision text editing," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 5054–5065.
[72] N.-R. Han, M. Chodorow, and C. Leacock, "Detecting errors in English article usage by non-native speakers," in Natural Language Engineering, 2006, pp. 115–129.
[73] M. Chodorow, J. Tetreault, and N.-R. Han, "Detection of grammatical errors involving prepositions," in Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions. Prague, Czech Republic: Association for Computational Linguistics, Jun. 2007, pp. 25–30.
[74] R. De Felice and S. G. Pulman, "A classifier-based approach to preposition and determiner error correction in L2 English," in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK: Coling 2008 Organizing Committee, Aug. 2008, pp. 169–176.
[75] J. R. Tetreault and M. Chodorow, "The ups and downs of preposition error detection in ESL writing," in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK: Coling 2008 Organizing Committee, Aug. 2008, pp. 865–872.
[76] J. Tetreault, J. Foster, and M. Chodorow, "Using parse features for preposition selection and error detection," in Proceedings of the ACL 2010 Conference Short Papers. Uppsala, Sweden: Association for Computational Linguistics, Jul. 2010, pp. 353–358.
[77] M. Gamon, J. Gao, C. Brockett, A. Klementiev, W. B. Dolan, D. Belenko, and L. Vanderwende, "Using contextual speller techniques and language modeling for ESL error correction," in Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I, 2008.
[78] M. Gamon, "Using mostly native data to correct errors in learners' writing," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California: Association for Computational Linguistics, Jun. 2010, pp. 163–171.
[79] A. Rozovskaya and D. Roth, "Generating confusion sets for context-sensitive error correction," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA: Association for Computational Linguistics, Oct. 2010, pp. 961–970.
[80] A. Rozovskaya, K.-W. Chang, M. Sammons, and D. Roth, "The University of Illinois system in the CoNLL-2013 shared task," in Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task. Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 13–19.
[81] A. Rozovskaya, K.-W. Chang, M. Sammons, D. Roth, and N. Habash, "The Illinois-Columbia system in the CoNLL-2014 shared task," in Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Baltimore, Maryland: Association for Computational Linguistics, Jun. 2014, pp. 34–42.
[82] C. Sun, X. Jin, L. Lei, Y. Zhao, and X. Wang, "Convolutional Neural Networks for Correcting
[128] D. Dahlmeier and H. T. Ng, “Better evaluation for grammatical error correction,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada: Association for Computational Linguistics, Jun. 2012, pp. 568–572.
[129] M. Felice and T. Briscoe, “Towards a standard evaluation method for grammatical error detection and correction,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics, May–Jun. 2015, pp. 578–587.
[130] S. Chollampatt and H. T. Ng, “A reassessment of reference-based grammatical error correction metrics,” in Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 2730–2741.
[131] M. Heilman, A. Cahill, N. Madnani, M. Lopez, M. Mulholland, and J. Tetreault, “Predicting grammaticality on an ordinal scale,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, Jun. 2014, pp. 174–180.
[132] C. Napoles, K. Sakaguchi, and J. Tetreault, “There’s no comparison: Reference-less evaluation metrics in grammatical error correction,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2109–2115.
[133] H. Asano, T. Mizumoto, and K. Inui, “Reference-based metrics can be replaced with reference-less metrics in evaluating grammatical error correction systems,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017, pp. 343–348.
[134] O. Abend and A. Rappoport, “Universal conceptual cognitive annotation (UCCA),” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 228–238.
[135] L. Choshen and O. Abend, “Reference-less measure of faithfulness for grammatical error correction,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 124–129.
[136] M. Nadejde and J. Tetreault, “Personalizing grammatical error correction: Adaptation to proficiency level and L1,” in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 27–33.
[137] K. Sakaguchi, C. Napoles, and J. Tetreault, “GEC into the future: Where are we going and how do we get there?” in Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 180–187.

Yu Wang is an undergraduate in the College of Artificial Intelligence at Nankai University. His research interests include data-driven approaches to natural language processing and deep reinforcement learning.

Yuelin Wang is an undergraduate in the College of Artificial Intelligence at Nankai University. His research interests include natural language processing, deep reinforcement learning, and pattern recognition.

Jie Liu is a professor in the College of Artificial Intelligence at Nankai University. His research interests include machine learning, pattern recognition, information retrieval, and data mining. He has published several papers at CIKM, ICDM, PAKDD, APWEB, IPM, WWWJ, Soft Computing, etc. He is a coauthor of the best student paper at ICMLC 2013. He won second place in the international ICDAR Book Structure Extraction Competition in 2012 and 2013. He visited the University of California, Santa Cruz from Sept. 2007 to Sept. 2008, and Microsoft Research Asia from Aug. 2012 to Feb. 2013. Prior to joining Nankai University, he obtained his Ph.D. in computer science at Nankai University.

Zhuo Liu is an undergraduate in the College of Artificial Intelligence at Nankai University. His research interests include data mining and graph neural networks.