ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu† Wenxuan Wang† Yuxuan Wan† Wenxiang Jiao‡ Michael R. Lyu†
†Department of Computer Science and Engineering, The Chinese University of Hong Kong
1155157061@link.cuhk.edu.hk {wxwang,yxwan9,lyu}@cse.cuhk.edu.hk
‡Tencent AI Lab
joelwxjiao@tencent.com

arXiv:2303.13648v1 [cs.CL] 15 Mar 2023

Abstract

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of automatic evaluation metrics (e.g., the F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

1 Introduction

ChatGPT¹, the current "super-star" in the artificial intelligence (AI) area, has attracted millions of registered users within just a week since its launch by OpenAI. One of the reasons for ChatGPT being so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), code debugging (Xia and Zhang, 2023), etc. There is also a trend of using ChatGPT as a writing assistant for text polishing.

Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014), and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million daily users (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:

Is ChatGPT a good tool for GEC?

To the best of our knowledge, this is the first study on ChatGPT's ability in GEC. We present the major insights gained from this evaluation as below:

• ChatGPT performs worse than the baseline systems in terms of the automatic evaluation metrics (e.g., the F0.5 score), particularly on long sentences.

• ChatGPT goes beyond one-by-one corrections by introducing more changes to the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness.

• Human evaluation quantitatively demonstrates that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections.

Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.

¹ https://chat.openai.com/chat
Type | Error | Correction
Preposition | I sat in the talk | I sat in on the talk
Morphology | dreamed | dreamt
Determiner | I like the ice cream | I like ice cream
Tense/Aspect | I like play basketball | I like playing basketball
Syntax | I have not the book | I do not have the book
Punctuation | We met they talked and left | We met, they talked and left

Table 1: Different types of errors in GEC.
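As an illustration (not part of the original evaluation), the coarse error categories shown in Table 1 can be recovered from a token-level diff between an erroneous sentence and its correction. The sketch below uses Python's standard difflib; the whitespace tokenization and the function name are assumptions for illustration, not the authors' tooling.

```python
import difflib


def edit_ops(source_tokens, corrected_tokens):
    """Classify token-level edits into the coarse categories of Table 1:
    omission (a token missing from the source), insertion (a spurious
    token in the source), and replacement (a token changed)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=source_tokens, b=corrected_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":       # correction adds tokens -> omission error
            ops.append(("omission", corrected_tokens[j1:j2]))
        elif tag == "delete":     # correction drops tokens -> insertion error
            ops.append(("insertion", source_tokens[i1:i2]))
        elif tag == "replace":
            ops.append(("replacement", source_tokens[i1:i2], corrected_tokens[j1:j2]))
    return ops


# "I sat in the talk" -> "I sat in on the talk": the source omitted "on".
print(edit_ops("I sat in the talk".split(), "I sat in on the talk".split()))
# → [('omission', ['on'])]
```

Dedicated GEC tools such as ERRANT perform a linguistically informed version of this alignment; the sketch only mirrors the three-way split described in the text.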

2 Background

2.1 ChatGPT

ChatGPT is an intelligent chatbot powered by large language models developed by OpenAI. It has attracted great attention from industry, academia, and the general public due to its strong ability in answering various follow-up questions, correcting inappropriate questions (Zhong et al., 2023), and even refusing illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF, Christiano et al., 2017).

2.2 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand, as writing plays an important role in academics, work, and daily life. Table 1 presents illustrations of different grammatical errors, borrowed from Bryant et al. (2022), a comprehensive survey on grammatical error correction. In general, grammatical errors can be roughly classified into three categories: omission errors, such as "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as "the" in the third example.

To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:

• CoNLL-2014: Given short English texts written by non-native speakers, the task requires a participating system to correct all errors present in each text.

• BEA-2019: It is similar to CoNLL-2014 but introduces a new dataset, namely the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities (Bryant et al., 2019).

• JFLEG: It represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding (Tetreault et al., 2017).

System | Precision | Recall | F0.5
GECToR | 71.2 | 38.4 | 60.8
Grammarly | 67.3 | 51.1 | 63.3
ChatGPT | 51.2 | 62.8 | 53.1

Table 2: GEC performance of GECToR, Grammarly, and ChatGPT.

3 ChatGPT for GEC

3.1 Experimental Setup

Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 task (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of the grammatical errors. We sequentially pulled 100 sentences from the official-combined test set in the alternate folder of the dataset.

Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely Precision, Recall, and the F0.5 score. Among them, the F0.5 score combines both Precision and Recall, where Precision is assigned a higher weight (Wikipedia contributors, 2023a).
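To make the precision-favoring weight concrete, the following minimal sketch (not the official CoNLL-2014 scorer) implements the general F-beta formula, which reduces to the paper's F0.5 at beta = 0.5, and reproduces the F0.5 scores reported in Table 2 from the reported precision and recall:

```python
def f_beta(precision, recall, beta=0.5):
    """General F-beta score; beta < 1 weights precision more heavily
    than recall.  With beta = 0.5 this is exactly
    1.25 * P * R / (0.25 * P + R), the F0.5 used in the paper."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Cross-check against Table 2 (values in percent).
print(round(f_beta(71.2, 38.4), 1))  # GECToR    → 60.8
print(round(f_beta(67.3, 51.1), 1))  # Grammarly → 63.3
```

Note that recomputing ChatGPT's F0.5 from the rounded precision and recall in Table 2 differs from the reported 53.1 in the last decimal, as expected from rounding.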
System | Short P / R / F0.5 | Medium P / R / F0.5 | Long P / R / F0.5
GECToR | 76.9 / 38.5 / 64.1 | 68.8 / 37.5 / 58.9 | 71.8 / 38.9 / 61.5
Grammarly | 62.5 / 60.6 / 62.1 | 68.9 / 56.0 / 65.9 | 67.3 / 45.3 / 61.4
ChatGPT | 58.5 / 66.7 / 60.0 | 48.7 / 60.7 / 50.7 | 51.0 / 62.8 / 53.0

Table 3: GEC performance (Precision / Recall / F0.5) with respect to sentence length.
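The paper does not state exactly how the three length buckets are drawn; a plausible sketch, assuming whitespace token counts as the length measure and a simple three-way split (both assumptions, not the authors' method):

```python
def split_by_length(sentences):
    """Rank sentences by token count and split them into three
    near-equal buckets: Short, Medium, Long.  The length measure
    (whitespace tokens) is an assumption for illustration."""
    ranked = sorted(sentences, key=lambda s: len(s.split()))
    k = len(ranked) // 3
    # First third -> Short, middle third -> Medium, remainder -> Long.
    return ranked[:k], ranked[k:2 * k], ranked[2 * k:]
```

With 100 sentences this yields buckets of 33 / 33 / 34, i.e. "three equally sized categories" up to rounding.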

Specifically, the three metrics are expressed as:

    Precision = TP / (TP + FP),    (1)

    Recall = TP / (TP + FN),    (2)

    F0.5 = (1.25 × Precision × Recall) / (0.25 × Precision + Recall),    (3)

where TP, FP and FN represent the true positives, false positives and false negatives of the predictions, respectively. We use the official scoring program provided by CoNLL2014, adapted to be compatible with the latest Python environment.

Baselines. In this report, we perform the GEC task with three systems:

• ChatGPT: We query ChatGPT manually rather than through an API due to the instability of ChatGPT. For example, when a query sentence resembles a question or demand, ChatGPT may stop the process of GEC and respond to the "demand" instead. After a few trials, we found a prompt that works well for ChatGPT:

    Do grammatical error correction on all the following sentences I type in the conversation.

We query ChatGPT with this prompt for each test sample.

• Grammarly: Grammarly is a prevalent cloud-based English typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors (Wikipedia contributors, 2023b). As stated by Grammarly, every day, 30 million people and 50,000 teams around the world use Grammarly with their writing (Grammarly, 2023). When querying Grammarly, we open a text file and paste all the test samples as separate paragraphs. We enable all the grammar correction options in the settings and only ask it to correct the items with correctness problems (red underline), while leaving the clarity (blue underline), engagement (green underline) and delivery (purple underline) suggestions unchanged. We iterate this process several times until no error is detected by Grammarly.

• GECToR: Besides Grammarly, we also compare ChatGPT with GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model in research, which also exhibits good performance on the CoNLL2014 task. We adopt the implementation based on the pre-trained RoBERTa model.

3.2 Results and Analysis

Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall, GECToR obtains the highest precision, while Grammarly achieves a better balance between the two metrics and attains the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only the errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both, such that it performs more stably.

ChatGPT Performs Worse on Long Sentences? To understand which kinds of sentences ChatGPT is good at, we divide the 100 test sentences into three equally sized categories, namely Short, Medium and Long. Table 3 shows the results with respect to sentence length. As seen, the gap between ChatGPT and Grammarly narrows significantly on short sentences. In contrast, ChatGPT performs much worse on longer sentences, at least in terms of the existing evaluation metrics.

ChatGPT Goes Beyond One-by-One Corrections. We inspect the outputs of the three systems, especially those for long sentences, and find that
System | Sentence
Source | For an example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
Reference | For example , if exercising (OR exercise) is helpful for a potential family disease , we can always look for more chances for the family to do exercise .
GECToR | For example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
Grammarly | For example , if exercising is helpful for a family 's potential disease , we can always look for more chances for the family to go exercise .
ChatGPT | For example , if exercise is helpful in preventing potential family diseases , we can always look for more opportunities for the family to exercise .

Table 4: Comparison of the outputs from different GEC systems.

System | Precision | Recall | F0.5
GECToR | 71.2 | 38.4 | 60.8
+ Grammarly | -5.9 | +16.5 | +2.1
ChatGPT | 51.2 | 62.8 | 53.1
+ Grammarly | +0.4 | +0.8 | +0.5

Table 5: GEC performance with Grammarly for further correction.

System | #Under | #Mis | #Over
GECToR | 13 | 4 | 0
Grammarly | 14 | 0 | 1
ChatGPT | 3 | 3 | 30

Table 6: Number of under-corrections (Under), mis-corrections (Mis) and over-corrections (Over) produced by different GEC systems.
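The categories counted in Table 6 were annotated manually; as a rough automatic proxy (an assumption for illustration, not the paper's procedure), two of the three categories can be approximated by set operations over system edits and reference edits:

```python
def categorize(system_edits, reference_edits):
    """Rough automatic proxy for the manual annotation in Table 6:
    under-corrections = reference edits the system missed,
    over-corrections  = system edits absent from the reference.
    Mis-corrections require span alignment and are omitted here."""
    system, reference = set(system_edits), set(reference_edits)
    return {
        "under": len(reference - system),
        "over": len(system - reference),
    }
```

A full implementation would also align overlapping spans to separate mis-corrections from over-corrections, which is precisely why the paper relies on human annotators.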

ChatGPT is not limited to correcting errors in a one-by-one fashion. Instead, it is more willing to change the superficial expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., "an example" to "example", "family potential disease" to "a family 's potential disease"), while ChatGPT modifies the sentence structure (i.e., "for family potential disease" to "in preventing potential family diseases") and word choice (i.e., "chances" to "opportunities"). This indicates that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.

To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We can observe that Grammarly introduces a negligible improvement to the output of ChatGPT, demonstrating that ChatGPT indeed generates correct sentences. On the contrary, Grammarly further improves the performance of GECToR noticeably (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.

Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) Under-correction, i.e., grammatical errors that are not found; 2) Mis-correction, i.e., grammatical errors that are found but modified incorrectly, which can be either grammatically or semantically incorrect; 3) Over-correction, i.e., modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. Obviously, ChatGPT has the smallest number of under-corrections among the three systems and fewer mis-corrections than GECToR, which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces more over-corrections, which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expressions in GEC.

Discussions. We have checked the outputs corresponding to the results of Table 5, and observed different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems but Grammarly is, though the modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by +0.0015. Other than punctuation problems, Grammarly also corrects a few grammatical errors on articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example:

    ... constructs of the family and kinship are a social construct, ...

Grammarly first changes it to

    ... constructs of the family and kinship are a social constructs, ...

Then, it changes it to

    ... constructs of the family and kinship are social constructs, ...

Nonetheless, it does correct some errors that ChatGPT fails to correct.

4 Conclusion

This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than a commercial product, Grammarly, and a state-of-the-art model, GECToR, in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.

Limitations and Future Works

There are several limitations in this version, which we leave for future work:

• More Datasets: In this version, we only use the CoNLL-2014 test set and only select 100 sentences to conduct the evaluation. In our future work, we will conduct experiments on more datasets.

• More Prompts and In-context Learning: In this version, we only use one prompt to query ChatGPT and do not utilize advanced techniques from the in-context learning field, such as providing demonstration examples (Brown et al., 2020) or chain-of-thought prompting (Wei et al., 2022b), which may under-estimate the full potential of ChatGPT. In our future work, we will explore in-context learning methods for GEC to improve its performance.

• More Evaluation Metrics: In this version, we only adopt Precision, Recall and F0.5 as evaluation metrics. In our future work, we will utilize more metrics, such as pretraining-based metrics (Gong et al., 2022), to evaluate the performance comprehensively.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The bea-2019 shared task on grammatical error correction. In BEA@ACL.

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2022. Grammatical error correction: A survey of the state of the art. ArXiv.

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS.
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of chatgpt. ArXiv.

Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. Revisiting grammatical error correction evaluation and beyond. EMNLP.

Grammarly. 2023. Grammarly website about us page.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? a preliminary study. ArXiv.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The conll-2014 shared task on grammatical error correction. In CoNLL.

Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. Chatgpt versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. ArXiv.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. Gector – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv.

Sebastian Ruder. 2022. NLP-progress.

Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. Jfleg: A fluency corpus and benchmark for grammatical error correction. In EACL.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS.

Wikipedia contributors. 2023a. F-score — Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].

Wikipedia contributors. 2023b. Grammarly — Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].

Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. 2023. Exploring the limits of chatgpt for query or aspect-based text summarization. ArXiv.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. ArXiv.
