Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

Wenxiang Jiao∗ Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu


Tencent AI Lab

∗Correspondence: joelwxjiao@tencent.com

Abstract

This report provides a preliminary evaluation of ChatGPT for machine translation, including translation prompt, multilingual translation, and translation robustness. We adopt the prompts advised by ChatGPT to trigger its translation ability and find that the candidate prompts generally work well and show minor performance differences. By evaluating on a number of benchmark test sets (scripts and data: https://github.com/wxjiao/Is-ChatGPT-A-Good-Translator), we find that ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages. For distant languages, we explore an interesting strategy named pivot prompting, which asks ChatGPT to translate the source sentence into a high-resource pivot language before translating it into the target language, and which improves the translation performance significantly. As for translation robustness, ChatGPT does not perform as well as the commercial systems on biomedical abstracts or Reddit comments, but exhibits good results on spoken language. With the launch of the GPT-4 engine, the translation performance of ChatGPT is significantly boosted, becoming comparable to commercial translation products, even for distant languages. In other words, ChatGPT has already become a good translator.

[Figure 1: Prompts advised by ChatGPT for machine translation (Date: 2022.12.16).]

1 Introduction

ChatGPT (https://chat.openai.com) is an intelligent chatting machine developed by OpenAI upon InstructGPT (Ouyang et al., 2022), which is trained to follow an instruction in a prompt and provide a detailed response. According to the official statement, ChatGPT is able to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests thanks to its dialogue format. It integrates various abilities of natural language processing, including question answering, storytelling, logic reasoning, code debugging, machine translation, and so on. We are particularly interested in how ChatGPT performs on machine translation tasks, especially the gap between ChatGPT and commercial translation products (e.g., Google Translate, DeepL Translate).

In this report, we provide a preliminary study of ChatGPT on machine translation to gain a better understanding of it. Specifically, we focus on three aspects:

• Translation Prompt: ChatGPT is essentially a large language model, which needs prompts as guidance to trigger its translation ability. The style of prompts may affect the quality of translation outputs. For example, how to mention the source or target language information matters in multilingual machine translation models, and is usually solved by attaching language tokens (Johnson et al., 2017; Fan et al., 2021).

• Multilingual Translation: ChatGPT is a single model handling various NLP tasks and covering different languages, and can thus be considered a unified multilingual machine translation model. We are therefore curious about how ChatGPT performs on different language pairs considering
both the resource difference (e.g., high vs. low) and the language family (e.g., European vs. Asian).

• Translation Robustness: ChatGPT is developed upon GPT-3, which was trained on large-scale datasets covering various domains. Therefore, we wonder whether it can perform robustly on domain-specific or even noisy sentences.

To trigger the translation ability of ChatGPT, we ask ChatGPT itself for advice and obtain three candidate translation prompts. By evaluating on the Chinese⇒English translation task, we find that the candidate prompts generally work well and show minor performance differences. Nevertheless, we adopt the best-performing prompt for the rest of the study. By evaluating the translation among four selected languages on the Flores-101 test sets, we find that ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages. For distant languages, we explore an interesting strategy named pivot prompting, which asks ChatGPT to translate the source sentence into a high-resource pivot language before translating it into the target language, and which improves the translation performance significantly. As for translation robustness, results on three robustness sets suggest that ChatGPT does not perform as well as the commercial systems on biomedical abstracts or Reddit comments, but exhibits good results on spoken language.

With the launch of GPT-4 (OpenAI, 2023; https://openai.com/research/gpt-4) on March 15, 2023, we re-evaluate the translation ability of ChatGPT and observe a significant performance boost, making it comparable to commercial translation products, even for distant languages. To this end, we can conclude that ChatGPT has already become a good translator, with GPT-4 as the engine!

2 ChatGPT for Machine Translation

2.1 Evaluation Setting

We provide a brief introduction of the evaluation setting, which mainly includes the compared baselines and the test data.

Baselines. We compare ChatGPT with three commercial translation products, namely, Google Translate (https://translate.google.com), DeepL Translate (https://www.deepl.com/translator), and Tencent TranSmart (https://transmart.qq.com/zh-CN/index). So far, the three commercial systems support translation in 133, 29, and 16 languages, respectively. By default, the results in this report come from the ChatGPT version of 2022.12.16. For newer results, we mark the updated version information correspondingly.

Data. For multilingual translation, we evaluate the above translation systems on the Flores-101 test sets (Goyal et al., 2021; https://github.com/facebookresearch/flores), which consist of 1012 sentences translated into 101 languages. To test the translation robustness, we adopt the test set of the WMT19 Biomedical Translation Task (Bawden et al., 2019; i.e., Bio) and set2 and set3 of the WMT20 Robustness Task (Specia et al., 2020; i.e., Rob2 and Rob3). We obtain the first two test sets through SacreBLEU, and the third pre-processed by Wang et al. (2021) (https://github.com/hsing-wang/WMT2020_BioMedical/tree/master/Bio-18-19-testset). Table 1 lists the information of these test sets. However, obtaining the translation results from ChatGPT is time-consuming, since it can only be interacted with manually and cannot respond to large batches. Thus, we randomly sample 50 sentences from each set for evaluation (see the sampling sketch after Table 1).

Table 1: Information of adopted test sets.

Test Set      Direction   Domain        Size
Flores-101    Any         General       1012
WMT19 Bio     De⇒En       Biomedical    373
WMT20 Rob2    En⇒Ja       Reddit        1376
WMT20 Rob2    Ja⇒En       Reddit        997
WMT20 Rob3    De⇒En       Common Voice  5609
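As an illustration of this sampling setup, the sketch below shows one way to fetch a robustness test set through the SacreBLEU CLI and then draw a fixed random sample of 50 aligned source/reference pairs. The file names, the helper function, and the fixed seed are our own assumptions for illustration, not part of the original scripts.

    # Hypothetical sketch of the 50-sentence sampling step.
    # Fetch source and reference with the SacreBLEU CLI first, e.g.:
    #   sacrebleu -t wmt20/robust/set3 -l de-en --echo src > rob3.de
    #   sacrebleu -t wmt20/robust/set3 -l de-en --echo ref > rob3.en
    # (assuming your SacreBLEU version ships the wmt20/robust test sets)
    import random

    def sample_pairs(src_path: str, ref_path: str, k: int = 50, seed: int = 42):
        """Draw k aligned source/reference pairs for manual ChatGPT evaluation."""
        with open(src_path, encoding="utf-8") as f_src, \
             open(ref_path, encoding="utf-8") as f_ref:
            pairs = list(zip(f_src.read().splitlines(), f_ref.read().splitlines()))
        random.seed(seed)  # fix the seed so the subset is reproducible
        return random.sample(pairs, k)

    subset = sample_pairs("rob3.de", "rob3.en")
    print(len(subset), subset[0])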
Table 4: Performance of ChatGPT for multilingual translation (BLEU; +/- is relative to Google Translate).

System    De⇒En           En⇒De           Ro⇒En           En⇒Ro           Zh⇒En           En⇒Zh
Google    45.04           41.16           50.12           46.03           31.66           43.58
DeepL     49.23 (+9.3%)   41.46 (+0.7%)   50.61 (+0.9%)   48.39 (+5.1%)   31.22 (-1.3%)   44.31 (+1.6%)
Tencent   n/a             n/a             n/a             n/a             29.69 (-6.2%)   46.06 (+5.6%)
ChatGPT   43.71 (-2.9%)   38.87 (-5.5%)   44.95 (-10.3%)  24.85 (-46.0%)  24.73 (-21.8%)  38.27 (-12.1%)

System    De⇒Zh           Zh⇒De           Ro⇒Zh           Zh⇒Ro           De⇒Ro           Ro⇒De
Google    38.71           21.68           39.05           25.59           33.31           32.27
DeepL     40.46 (+4.5%)   22.82 (+5.2%)   38.95 (-0.2%)   25.39 (-0.7%)   35.19 (+5.6%)   34.27 (+6.1%)
Tencent   40.66 (+5.0%)   19.44 (-10.3%)  n/a             n/a             n/a             n/a
ChatGPT   34.46 (-10.9%)  19.80 (-8.6%)   30.84 (-21.0%)  19.17 (-25.0%)  33.38 (+0.2%)   29.89 (-7.3%)

Metric. We adopt the most widely used BLEU score (Papineni et al., 2002) as our primary metric, and also report ChrF++ (Popović, 2017) and TER (Snover et al., 2006) in some cases. These three metrics are all supported by SacreBLEU (Post, 2018; https://github.com/mjpost/sacrebleu), as illustrated in the sketch below.
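For concreteness, the following sketch shows how these three scores can be computed with the SacreBLEU Python API; the hypothesis/reference lists are placeholders, and the exact keyword arguments may differ slightly across SacreBLEU versions.

    # Minimal sketch of scoring with SacreBLEU's Python API (placeholder data).
    import sacrebleu

    hyps = ["Do we still have noodles?"]    # system outputs
    refs = [["Do we still have noodles?"]]  # one list per reference set

    bleu = sacrebleu.corpus_bleu(hyps, refs)                 # BLEU (higher is better)
    chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)   # word_order=2 gives ChrF++
    ter = sacrebleu.corpus_ter(hyps, refs)                   # TER (lower is better)

    print(f"BLEU {bleu.score:.2f} | ChrF++ {chrf.score:.2f} | TER {ter.score:.2f}")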
2.2 Translation Prompts

To design the prompts for triggering the machine translation ability of ChatGPT, we seek inspiration from ChatGPT itself by asking it for advice. Specifically, we query ChatGPT with the following prompt:

    Provide ten concise prompts or templates that can make you translate.

and obtain the results shown in Figure 1. The generated prompts look reasonable but share similar formats. Thus, we summarize them into the three candidate prompts shown in Table 2, where [SRC] and [TGT] represent the source and target languages of the translation. Note that we add an extra command to TP2 to ask ChatGPT not to generate double quotes around the translation, which often occurs with the original format. Nevertheless, the format is still unstable, such that the sentences in a batch (spanning multiple lines) are occasionally translated into a single line.

Table 2: Candidate translation prompts.

ID    Translation Prompt
TP1   Translate these sentences from [SRC] to [TGT]:
TP2   Answer with no quotes. What do these sentences mean in [TGT]?
TP3   Please provide the [TGT] translation for these sentences:

We compare the three candidate prompts on the Chinese-to-English (Zh⇒En) translation task with the test set from Flores-101. Table 3 shows the results of ChatGPT and the three commercial systems. While ChatGPT provides reasonably good translations, it still lags behind the baselines by at least 5.0 BLEU points. Among the three candidate prompts, TP3 performs the best in terms of all three metrics. Thus, we use TP3 throughout this report by default.

Table 3: Comparison of different prompts for ChatGPT to perform Chinese-to-English (Zh⇒En) translation.

System           BLEU↑   ChrF++↑   TER↓
Google           31.66   57.09     56.21
DeepL            31.22   56.74     57.84
Tencent          29.69   56.24     57.16
ChatGPT w/ TP1   23.25   53.07     66.03
ChatGPT w/ TP2   24.54   53.05     63.79
ChatGPT w/ TP3   24.73   53.71     62.84
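To make the prompt formats concrete, here is a small sketch of how TP1-TP3 could be instantiated and sent through the later-released gpt-3.5-turbo API. This API route, the model name, and the helper below are illustrative assumptions: the results in this report were collected manually via the ChatGPT web interface, not through this code.

    # Illustrative sketch only: the report's translations were collected manually
    # via the ChatGPT web page, not through this API.
    import openai  # assumes the openai 0.x SDK is installed and OPENAI_API_KEY is set

    PROMPTS = {
        "TP1": "Translate these sentences from {src} to {tgt}:\n{text}",
        "TP2": "Answer with no quotes. What do these sentences mean in {tgt}?\n{text}",
        "TP3": "Please provide the {tgt} translation for these sentences:\n{text}",
    }

    def translate(text: str, src: str, tgt: str, template: str = "TP3") -> str:
        prompt = PROMPTS[template].format(src=src, tgt=tgt, text=text)
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce the run-to-run variance noted in the Limitations
        )
        return response["choices"][0]["message"]["content"].strip()

    # Example: Chinese-to-English with the default prompt TP3.
    # print(translate("我们还有面条吗？", "Chinese", "English"))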
2.3 Multilingual Translation

We select four languages to evaluate the capability of ChatGPT in multilingual translation: German (De), English (En), Romanian (Ro), and Chinese (Zh), which are commonly adopted in both research (Wang et al., 2022a; Jiao et al., 2021, 2022b) and competitions (Bojar et al., 2016; Farhad et al., 2021). The first three languages come from the same family and use Latin scripts, while the last comes from another family and uses Chinese scripts (Fan et al., 2021). We test the translation performance between any two of these languages, which involves 12 directions in total. For clarity of comparison, we report the BLEU scores and the relative improvement or drop in performance (i.e., +/-) with respect to Google Translate, computed as shown below. Table 4 presents the results.
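The relative numbers in Table 4 follow directly from this definition; as a quick sanity check, the helper below (our own illustration) reproduces, e.g., DeepL's +9.3% on De⇒En from 49.23 vs. Google's 45.04.

    # Relative BLEU change with respect to the Google Translate baseline.
    def relative_change(system_bleu: float, google_bleu: float) -> float:
        """Signed percentage difference, as reported in Table 4."""
        return (system_bleu - google_bleu) / google_bleu * 100

    print(f"{relative_change(49.23, 45.04):+.1f}%")  # DeepL De=>En: +9.3%
    print(f"{relative_change(24.85, 46.03):+.1f}%")  # ChatGPT En=>Ro: -46.0%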
Table 5: Performance of ChatGPT with pivot prompting. New results are obtained from the updated ChatGPT version of 2023.01.31. LR: length ratio.

System                  De⇒Zh           Ro⇒Zh
                        BLEU    LR      BLEU    LR
Google                  38.71   0.94    39.05   0.95
DeepL                   40.46   0.98    38.95   0.99
ChatGPT (Direct)        34.46   0.97    30.84   0.91
ChatGPT (Direct, new)   30.76   0.92    27.51   0.93
ChatGPT (Pivot, new)    34.68   0.95    34.19   0.98

[Figure 2: Translation results by ChatGPT with pivot prompting (Date: 2023.01.31).]

Resource Difference. We consider the resource difference of languages in the same family. In machine translation, German⇔English translation is usually regarded as a high-resource task supported by over ten million sentence pairs (Farhad et al., 2021), while Romanian⇔English translation is supported by much less data (Bojar et al., 2016). This resource difference is also indicated by the data statistics of GPT-3 (Brown et al., 2020; https://github.com/openai/gpt-3/tree/master/dataset_statistics), although we do not know the data information of ChatGPT. As shown in Table 4, ChatGPT performs competitively with Google Translate and DeepL Translate for both German⇒English and English⇒German translations. However, it lags behind them significantly on Romanian⇒English and English⇒Romanian. Specifically, ChatGPT obtains a BLEU score on English⇒Romanian that is 46.4% lower than that of Google Translate, while the corresponding drop is only 10.3% on Romanian⇒English. We speculate that the huge difference in monolingual data between English and Romanian limits the language modeling capability of Romanian, which partially explains the poor performance on English⇒Romanian. On the contrary, Romanian⇒English can benefit from the strong language modeling capability of English, such that the resource gap in parallel data can be somewhat compensated.

Language Family. We also take the impact of language families into account. In machine translation, translating between different language families is often considered harder than translating within the same language family, due to the different cultures and writing scripts. By comparing German⇔English with Chinese⇔English or German⇔Chinese translation, we find that the gap between ChatGPT and the commercial systems becomes larger. We attribute this to the better knowledge transfer within the same family (i.e., from English to German) than between different families (e.g., from English to Chinese). For language pairs that are both low-resource and from different families (e.g., Romanian⇔Chinese), the performance gap can be enlarged even further (Wang et al., 2022b). Since ChatGPT handles different tasks in one model, low-resource translation tasks compete not only with high-resource translation tasks (Jiao et al., 2022a) but also with other NLP tasks for the model capacity, which explains their poor performance.

Pivot Prompting. We explore an interesting strategy named Pivot Prompting to improve the translation quality between distant languages. Specifically, we ask ChatGPT to translate the source sentence into a high-resource pivot language (i.e., English by default) first and then into the target language. We adjust the TP3 prompt as below:

    Please provide the [PIV] translation first and then the
    [TGT] translation for these sentences one by one:

where [PIV] denotes the pivot language. As a large language model, ChatGPT will naturally condition on both the prompt and the translation result in the pivot language to generate the translation in the target language. Figure 2 shows an example of pivot prompting in use. Pivot prompting has several advantages:

• Knowledge Transfer: While the parallel data between two distant languages is often scarce (Fan et al., 2021; Wang et al., 2022b), the parallel data between each of them and the pivot language can be relatively considerable, which is expected to yield better translation ability for the source-pivot and pivot-target directions than for the source-target direction. Thus, pivot prompting can potentially transfer the knowledge of the high-resource pivot language to the low-resource target languages (Zoph et al., 2016; Aji et al., 2020; Li et al., 2022; He et al., 2022).

• Convenience: Essentially, pivot prompting is similar to the pivot translation technique in previous studies (Cheng et al., 2016), but it is more convenient for ChatGPT. For the commonly adopted multilingual sequence-to-sequence translation models (Fan et al., 2021), pivot translation requires two steps: (1) input the source sentence and translate it into the pivot language; (2) input the translation result in the pivot language and translate it into the target language. In contrast, ChatGPT can identify both the [PIV] and [TGT] languages and translate the source sentence into the two languages sequentially (see Figure 2 and the sketch below), which requires only a single step.
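A minimal sketch of this single-step recipe follows. The prompt template is the adjusted TP3 from above; the example source sentence and the naive parsing of the reply into pivot and target translations are our own assumptions about the response format.

    # Minimal sketch of pivot prompting (assumptions: the API route from the
    # earlier sketch, and that ChatGPT puts the two translations on separate lines).
    PIVOT_TEMPLATE = (
        "Please provide the {piv} translation first "
        "and then the {tgt} translation for these sentences one by one:\n{text}"
    )

    def pivot_prompt(text: str, piv: str = "English", tgt: str = "Chinese") -> str:
        """Build the adjusted TP3 prompt; one request yields both translations."""
        return PIVOT_TEMPLATE.format(piv=piv, tgt=tgt, text=text)

    prompt = pivot_prompt("Avem nevoie de mai multe date.", piv="English", tgt="Chinese")
    # reply = translate_via_api(prompt)               # hypothetical single request
    # pivot_out, target_out = reply.splitlines()[:2]  # naive split; real replies may differ
    print(prompt)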
Table 6: Performance of GPT-4 (Date: 2023.03.15) for multilingual translation.

System                  Zh⇒En   En⇒Zh   De⇒Zh   Ro⇒Zh
Google                  31.66   43.58   38.71   39.05
DeepL                   31.22   44.31   40.46   38.95
Tencent                 29.69   46.06   40.66   n/a
ChatGPT (Direct)        24.73   38.27   34.46   30.84
ChatGPT (Direct, new)   n/a     n/a     30.76   27.51
ChatGPT (Pivot, new)    n/a     n/a     34.68   34.19
GPT-4                   28.50   42.50   38.16   37.84
Table 5 presents our results in BLEU score and the length ratio (LR) of the translation results over the references. We obtain the translation results by using TP3 (i.e., Direct) and by pivot prompting (i.e., Pivot) through English (i.e., source-to-English-to-target), respectively. As seen, the latest update of ChatGPT seems to harm the translation quality for German⇒Chinese and Romanian⇒Chinese, compared with the previous version we used (i.e., Direct-new vs. Direct). Nevertheless, pivot prompting significantly improves the translation performance, by nearly 3.9 and 6.6 BLEU points for German⇒Chinese and Romanian⇒Chinese, respectively, which demonstrates its effectiveness. By inspecting the translation results, we find that direct translation with TP3 under-translates some tokens in the source sentences, which is noticeably fixed by pivot prompting. This is also reflected in the length ratio results (computed as sketched below). Note that, while pivot prompting is convenient for ChatGPT, how to further accelerate the inference process remains an important research question, since we need to generate longer outputs.
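The length ratio in Table 5 is the usual hypothesis-to-reference length ratio. A hedged sketch of computing it directly follows; whitespace tokenization is a simplification here, whereas Chinese would need a proper tokenizer, such as the one SacreBLEU applies for zh.

    # Length ratio (LR): total hypothesis length over total reference length.
    # Note: SacreBLEU's BLEU result also exposes this ratio under its own
    # tokenizer (bleu.sys_len / bleu.ref_len).
    def length_ratio(hyps: list[str], refs: list[str]) -> float:
        hyp_len = sum(len(h.split()) for h in hyps)
        ref_len = sum(len(r.split()) for r in refs)
        return hyp_len / ref_len  # values below 1 hint at under-translation

    print(round(length_ratio(["we still have noodles"],
                             ["do we still have noodles"]), 2))  # 0.8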
GPT-4 Further Bridges the Gap. With the launch of GPT-4 (OpenAI, 2023) on March 15, 2023, we re-evaluate the performance on four translation directions. As shown in Table 6, GPT-4 boosts the performance over ChatGPT significantly on all four directions, bringing the BLEU scores to the level of the top commercial translation systems. Note that these results come from zero-shot settings only. With modern techniques like in-context learning with demonstrations (Brown et al., 2020; Agrawal et al., 2022), the translation performance could be improved further. In other words, GPT-4 has already become a good translator!

2.4 Translation Robustness

We further evaluate the translation robustness of ChatGPT on the WMT19 Bio and the WMT20 Rob2 and Rob3 test sets, which introduce the impact of domain bias and potentially noisy data. For example, the WMT19 Bio test set is composed of Medline abstracts, which require domain-specific knowledge to handle the terminologies. WMT20 Rob2 consists of comments from the social media website reddit.com, which can contain various errors, including spelling/typographical errors, word omission/insertion/repetition, grammatical errors, spoken language, Internet slang, and so on (Michel and Neubig, 2018).
Table 7: Performance of ChatGPT for translation robustness (BLEU).

          WMT19 Bio   WMT20 Rob2          WMT20 Rob3
System    De⇒En       En⇒Ja     Ja⇒En     De⇒En
Google    37.83       29.72     19.21     42.91
DeepL     37.13       26.25     19.83     41.29
ChatGPT   33.22       22.36     18.34     44.59
Table 7 lists the BLEU scores. Obviously, ChatGPT does not perform as well as Google Translate or DeepL Translate on the WMT19 Bio and WMT20 Rob2 test sets. The reason may be that commercial translation systems like Google Translate constantly need to improve their ability to translate domain-specific (e.g., biomedical) or noisy sentences, since these are real-world applications that require better generalization over out-of-distribution data, and this may not have been done for ChatGPT.

An interesting finding is that ChatGPT significantly outperforms Google Translate and DeepL Translate on the WMT20 Rob3 test set, which contains a crowdsourced speech recognition corpus. This suggests that ChatGPT, which is essentially an artificial-intelligence chatting machine, is capable of generating more natural spoken language than these commercial translation systems. We provide some examples in Table 8.

Table 8: Examples from WMT20 Robust Set3.

SRC       Haben wir noch Nudeln?
REF       Do we still have noodles?
Google    Do we still have pasta?
DeepL     Do we have any noodles left?
ChatGPT   Do we still have noodles?

SRC       Tatsächlich ist der zu häufige Gebrauch von Seife schlecht für die Haut.
REF       Actually, very frequent usage of soap is bad for the skin.
Google    In fact, using soap too often is bad for your skin.
DeepL     In fact, using soap too often is bad for the skin.
ChatGPT   In fact, the frequent use of soap is bad for the skin.

3 Conclusion

We present a preliminary study of ChatGPT for machine translation, including translation prompt, multilingual translation, and translation robustness. By evaluating on a number of benchmark test sets, we find that ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages. For distant languages, we explore an interesting strategy named pivot prompting, which asks ChatGPT to translate the source sentence into a high-resource pivot language before translating it into the target language, and which improves the translation performance significantly. As for translation robustness, ChatGPT does not perform as well as the commercial systems on biomedical abstracts or Reddit comments, but exhibits good results on spoken language. With the launch of the GPT-4 engine, the translation performance of ChatGPT is significantly boosted, becoming comparable to commercial translation products, even for distant languages. In other words, ChatGPT has already become a good translator.

Limitations

We admit that this report is far from complete; various aspects could make it more reliable:

• Coverage of Test Data: Currently, we randomly select 50 samples from each test set for evaluation due to the response delay of ChatGPT. While there are some projects on GitHub trying to automate the access process, they are vulnerable to browser refreshes or network issues. While OpenAI has released the gpt-3.5-turbo API, it appears to be less powerful than the webpage ChatGPT version (https://github.com/inspired-cognition/critique-apps/tree/main/prompt-gym). Therefore, we still report the current results.

• Reproducibility Issue: By querying ChatGPT multiple times, we found that the results of the same query may vary across trials, which brings randomness to the evaluation results. For more reliable results, it is best to repeat the translation multiple times for each test set and report the average result.

• Evaluation Metrics: The results here are calculated by automatic metrics with single references, which may not properly reflect some characteristics of the translations, e.g., nativeness. Human evaluation can provide more insights when comparing ChatGPT with commercial translators.

• Translation Abilities: We only focus on multilingual translation and translation robustness in this report. However, there are other translation abilities that could be further evaluated, e.g., constrained machine translation and document-level machine translation.
References
Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022. In-context examples selection for machine translation. arXiv.

Alham Fikri Aji, Nikolay Bogoychev, Kenneth Heafield, and Rico Sennrich. 2020. In neural machine translation, what does transfer learning transfer? In ACL.

Rachel Bawden, Kevin Bretonnel Cohen, Cristian Grozea, Antonio Jimeno Yepes, Madeleine Kittner, Martin Krallinger, Nancy Mah, Aurelie Neveol, Mariana Neves, Felipe Soares, et al. 2019. Findings of the WMT 2019 biomedical translation shared task: Evaluation for Medline abstracts and biomedical terminologies. In WMT.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In WMT.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS.

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016. Neural machine translation with pivot languages. arXiv.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond English-centric multilingual machine translation. JMLR, 22(107):1-48.

Akhbardeh Farhad, Arkhangorodsky Arkady, Biesialska Magdalena, Bojar Ondřej, Chatterjee Rajen, Chaudhary Vishrav, Marta R Costa-jussa, España-Bonet Cristina, Fan Angela, Federmann Christian, et al. 2021. Findings of the 2021 conference on machine translation (WMT21). In WMT.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzman, and Angela Fan. 2021. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv.

Zhiwei He, Xing Wang, Zhaopeng Tu, Shuming Shi, and Rui Wang. 2022. Tencent AI Lab - Shanghai Jiao Tong University low-resource translation system for the WMT22 translation task. In WMT.

Wenxiang Jiao, Zhaopeng Tu, Jiarui Li, Wenxuan Wang, Jen-tse Huang, and Shuming Shi. 2022a. Tencent's multilingual machine translation system for WMT22 large-scale African languages. In WMT.

Wenxiang Jiao, Xing Wang, Shilin He, Zhaopeng Tu, Irwin King, and Michael R Lyu. 2022b. Exploiting inactive examples for natural language generation with data rejuvenation. IEEE/ACM TASLP.

Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Shuming Shi, Michael Lyu, and Irwin King. 2021. Self-training sampling with monolingual data uncertainty for neural machine translation. In ACL.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. TACL.

Zhaocong Li, Xuebo Liu, Derek F Wong, Lidia S Chao, and Min Zhang. 2022. ConsistTL: Modeling consistency in transfer learning for low-resource neural machine translation. In EMNLP.

Paul Michel and Graham Neubig. 2018. MTNT: A testbed for machine translation of noisy text. In EMNLP.

OpenAI. 2023. GPT-4 technical report. arXiv.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.

Maja Popović. 2017. ChrF++: Words helping character n-grams. In WMT.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In AMTA.

Lucia Specia, Zhenhao Li, Juan Pino, Vishrav Chaudhary, Francisco Guzmán, Graham Neubig, Nadir Durrani, Yonatan Belinkov, Philipp Koehn, Hassan Sajjad, et al. 2020. Findings of the WMT 2020 shared task on machine translation robustness. In WMT.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022a. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.

Wenxuan Wang, Wenxiang Jiao, Shuo Wang, Zhaopeng Tu, and Michael R Lyu. 2022b. Understanding and mitigating the uncertainty in zero-shot translation. arXiv.

Xing Wang, Zhaopeng Tu, and Shuming Shi. 2021. Tencent AI Lab machine translation systems for the WMT21 biomedical translation task. In WMT.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.
