
Unleashing the Power of ChatGPT for Translation: An Empirical Study

Yuan Gao, Ruili Wang, Feng Hou


School of Mathematical and Computational Science
Massey University, New Zealand
{y.gao, ruili.wang, f.hou}@massey.ac.nz

arXiv:2304.02182v1 [cs.CL] 5 Apr 2023

Abstract

The recently released ChatGPT has demonstrated surprising abilities in natural language understanding and natural language generation. Machine translation is an important and extensively studied task in the field of natural language processing, and it relies heavily on the abilities of language understanding and generation. Thus, in this paper, we explore how to assist machine translation with ChatGPT. We adopt several translation prompts on a wide range of translations. Our experimental results show that ChatGPT with designed translation prompts can achieve comparable or better performance than professional translation systems for high-resource language translations, but lags behind significantly on low-resource translations. We further evaluate the translation quality using multiple references, and ChatGPT achieves superior performance compared to the professional systems. We also conduct experiments on domain-specific translations; the results show that ChatGPT is able to comprehend the provided domain keyword and adjust its output accordingly to produce proper translations. Finally, we apply few-shot prompts, which show consistent improvement across different base prompts. Our work provides empirical evidence that ChatGPT has great potential in translation.

1 Introduction

Machine translation (MT) aims to automatically translate text from one language to another target language with the aid of computers. MT is an important research area in Artificial Intelligence and has attracted massive attention in both academia and industry (Stahlberg, 2020; Yang et al., 2020). In recent years, there has been a growing trend towards incorporating large-scale pre-trained language models for natural language processing (Yang et al., 2019; Brown et al., 2020). These models are typically trained on massive monolingual data, which allows them to capture rich representations of the input text. Such pre-trained models have achieved significant improvements in various natural language processing tasks, including machine translation (Lewis et al., 2020).

The recently released ChatGPT¹ is also a powerful pre-trained language model developed by OpenAI. ChatGPT is built upon GPT-3.5 and optimized with Reinforcement Learning from Human Feedback (RLHF). Due to its surprising abilities in natural language understanding and natural language generation, ChatGPT has attracted more and more attention across millions of users. While ChatGPT is primarily designed as an intelligent conversational system, it can also perform plenty of human-like tasks (e.g., writing poems and fixing coding bugs). However, recent work (Artetxe et al., 2018) reveals that ChatGPT exhibits a non-negligible performance gap in comparison with professional translation systems, such as Google Translate and DeepL Translate. Particularly, this gap is further magnified in low-resource translations.

Different from other professional translation systems, ChatGPT is capable of adjusting its output bias based on the provided instructions. As a result, instead of solely requesting translations from ChatGPT, users can input various instructions (translation prompts) into its dialogue box along with the source sentences. Since OpenAI only provides a web interface to access ChatGPT, we cannot modify its internal components or retrieve its intermediate representations. Thus, we treat ChatGPT as a black-box system and focus on what kind of translation prompts can unleash the power of ChatGPT as much as possible. In addition, as ChatGPT was trained on massive amounts of multilingual general-domain data, the generated text is inevitably influenced by miscellaneous prior knowledge instead of being based

¹ https://openai.com/blog/chatgpt/
solely on the current input. In this work, we explore how to enhance the translation proficiency of ChatGPT effectively. We hypothesize that specifying the translation direction or data domain could make the system focus on the current input data, thereby improving the quality of translations. Moreover, we also introduce Part-of-Speech (POS) tags into prompts as auxiliary information to assist the translation process.

We conduct comprehensive experiments to assess the effectiveness of various prompts on ChatGPT. The experiments include six high-resource translations and four low-resource translations, as well as non-English-centric translations, with the aim of evaluating the model's performance across a range of language resources. The experimental results indicate that with our devised prompts ChatGPT performs competitively with or even better than professional translation systems for high-resource language translations, but still lags significantly behind for low-resource translations. This finding holds even after incorporating POS tags into prompts, despite the fact that POS tags lead to meaningful improvements in low-resource translations. To sum up, while the performance of ChatGPT on low-resource translation is still not satisfactory, the significant improvement achieved by providing additional information shows its great potential.

We also evaluate ChatGPT on multi-reference and multi-domain test sets. For the multi-reference sets, which assign 10 different human translations to each source sentence, ChatGPT achieves great improvement when measured against multiple references and even outperforms professional systems. As for the multi-domain sets, we conduct experiments on four domain-specific sets that exhibit statistical differences. We find that ChatGPT with domain information performs well across all domains and consistently achieves higher BLEU scores than other prompts. Additionally, the experimental results on different domains demonstrate that ChatGPT is able to comprehend the provided domain keyword and subsequently adapt the translation output accordingly.

At last, we conduct experiments to explore the influence of in-context learning prompts (referred to as few-shot prompts) on ChatGPT, inspired by previous research that utilizes large-scale language models (LLMs) to enhance downstream tasks (Brown et al., 2020; Chen et al., 2021; Dou et al., 2022). To mitigate any unexpected effects on ChatGPT, we selectively sample multiple high-quality translation pairs from the same dataset. Notably, the selected examples and test samples are mutually exclusive.

2 Background

Machine translation (MT) is a crucial area of research in the field of natural language processing and has gained immense attention in recent years. The primary goal of MT is to automatically translate text data from the source language to the target language with the help of computers. Thus, a mature translation system must possess robust language understanding and language generation capabilities in order to produce adequate and fluent translations. Prior research (Liu et al., 2019; Guo et al., 2020; Zhu et al., 2020) has shown that LLMs can enhance the ability of translation systems to understand the source text but struggle to improve generation ability. ChatGPT has exhibited an impressive level of performance in natural language understanding and generation, as evidenced by its ability to comprehend and generate human-like responses in a wide range of contexts. Therefore, exploring the application of ChatGPT in translation presents a compelling and promising area of research.

ChatGPT is a large-scale language model developed by OpenAI that was fine-tuned upon GPT-3.5. As the official website states, ChatGPT was optimized using RLHF and trained on massive amounts of text data to generate detailed responses by following the instructions provided in a prompt. While ChatGPT is primarily designed as an intelligent conversational system, it is capable of performing various human-like tasks, including machine translation. However, recent work (Jiao et al., 2023) found that ChatGPT exhibits an undeniable performance gap when compared with professional translation systems, such as Google Translate and DeepL Translate, and this gap is exacerbated in low-resource languages. Thus, we explore how to unleash the power of ChatGPT for translations.

Inspired by the training objective of ChatGPT, which employs prompts to guide the generation of responses, we consider that a proper translation prompt is able to unleash more of the translation potential of ChatGPT. In this work, we adopt several translation prompts to trigger ChatGPT for translation
Translation Prompts

TP3:    Please provide the [TGT] translation for these sentences:
TD:     This is a [SRC] to [TGT] translation task, please provide the [TGT] translation for these sentences:
DD:     Please provide the [TGT] translation for these sentences taken from the [SPECIFIC DOMAIN]:
TD-pos: This is a [SRC] to [TGT] translation task, please provide the [TGT] translation for these sentences with the help of given POS tags:
DD-pos: Please provide the [TGT] translation for these sentences taken from the [SPECIFIC DOMAIN] with the help of given POS tags:

Table 1: Translation prompts adopted in this work. TD stands for Translation Direction, DD stands for Data Domain, and "-pos" indicates that the prompt contains POS tags. TP3 is extracted from Jiao et al. (2023) directly without any edits, and the other prompts are built upon TP3.

Figure 1: Best performing translations answered by ChatGPT.

and evaluate them in comprehensive experiments.

3 Experiments

In this section, we first describe our designed translation prompts and provide details on the experimental setup, including the datasets, baselines, and evaluation metrics used. We also conduct various experiments to explore the effectiveness of the designed prompts under different translation conditions. The experimental results and analysis are presented as well.

3.1 Prompt Design

ChatGPT is trained to generate responses according to the provided prompts, which makes it highly sensitive to both the content and format of the prompts. However, Jiao et al. (2023) report similar translation performance using three different prompts suggested by ChatGPT itself; the prompts are as follows:

• TP1 - Translate these sentences from [SRC] to [TGT]:
• TP2 - Answer with no quotes. What do these sentences mean in [TGT]?
• TP3 - Please provide the [TGT] translation for these sentences:

We argue that the translation results are unlikely to be significantly impacted by these three prompts, as they merely provide general instructions advised by ChatGPT without any supplementary information. It is worth noting that ChatGPT was trained on massive amounts of multilingual general-domain data, and as a result, the generated text is inevitably influenced by various prior knowledge instead of being solely based on the current input. Thus, we consider that specifying the translation direction (TD) or data domain (DD) could make ChatGPT more focused on the current input text, resulting in better translations. In light of this, we devise two translation prompts from these two perspectives and compare them with the most effective prompt of Jiao et al. (2023) (referred to as TP3). Besides, we integrate POS tags as auxiliary information into both the TD and DD prompts. All translation prompts adopted in this work are presented in Table 1.

3.2 Experimental Setup

3.2.1 Datasets

We conduct experiments on a range of benchmarks, including high-resource and low-resource translations, as well as multi-reference translations and domain-specific translations. For the high-resource scenario, we choose English ↔ Spanish and English ↔ French, both of which ChatGPT is capable of translating with high quality as shown in Figure 1, and we further conduct Spanish ↔ French translations to evaluate ChatGPT in the non-English-centric scenario. For the low-resource translations, we evaluate translation systems on English ↔ Turkish and English ↔ Romanian translations. We use the Flores-101 dataset (Goyal et al.,
System   En->Fr: BLEU↑ ChrF++↑ TER↓   Fr->En: BLEU↑ ChrF++↑ TER↓   Fr->Es: BLEU↑ ChrF++↑ TER↓
Google   54.75  75.10  33.64          49.66  72.74  35.48          22.48  48.50  64.76
DeepL    53.87  74.65  33.64          50.09  72.29  35.30          21.68  47.67  65.38
TP3      45.03  69.50  42.11          44.86  68.85  36.02          21.33  48.25  65.31
TD       45.85  70.22  40.89          50.18* 72.84* 36.76          21.92  48.53  65.24
DD       45.92  70.40  40.67          49.68  72.72  35.05*         22.13  48.97  64.28
TD-pos   47.11* 71.05* 39.40*         48.87  72.39  36.08          22.20  48.88  65.10
DD-pos   46.93  70.31  39.60          49.33  72.42  36.84          22.58* 49.25* 63.79*

System   En->Es: BLEU↑ ChrF++↑ TER↓   Es->En: BLEU↑ ChrF++↑ TER↓   Es->Fr: BLEU↑ ChrF++↑ TER↓
Google   23.49  50.73  62.62          25.32  53.21  70.01          26.89  53.47  66.07
DeepL    23.62  49.86  63.59          26.74  54.64  67.44          29.55  54.19  61.84
TP3      23.00  49.93  62.28          25.42  54.07  70.35          26.20  53.23  65.85
TD       23.47  50.43  62.69          27.10* 55.40* 67.81          26.66* 53.92* 63.72*
DD       23.33  50.76  62.69          26.89  54.91  67.10*         26.05  53.58  64.00
TD-pos   23.69  50.81  63.17          26.42  53.84  67.52          26.35  53.63  64.15
DD-pos   23.92* 51.06* 62.07*         26.37  54.64  68.21          26.19  53.38  63.93

Table 2: Performance comparison of ChatGPT for high-resource translations using BLEU, ChrF++, and TER scores. The best scores are highlighted in bold, and * denotes the highest score achieved by ChatGPT across different prompts.
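The prompts in Table 1 are fixed templates whose bracketed slots are filled per request. The sketch below shows one way such prompts might be assembled before being pasted into the dialogue box; the function and dictionary names are ours, not the paper's:

```python
# Illustrative only: fills the [SRC]/[TGT]/[SPECIFIC DOMAIN] slots of the
# Table 1 templates. All identifiers here are our own, hypothetical names.
TEMPLATES = {
    "TP3": "Please provide the {tgt} translation for these sentences:",
    "TD": ("This is a {src} to {tgt} translation task, "
           "please provide the {tgt} translation for these sentences:"),
    "DD": ("Please provide the {tgt} translation for these sentences "
           "taken from the {domain}:"),
}

def build_prompt(name, sentences, src="English", tgt="French",
                 domain="news domain"):
    """Return the instruction line followed by the source sentences."""
    header = TEMPLATES[name].format(src=src, tgt=tgt, domain=domain)
    return "\n".join([header, *sentences])

print(build_prompt("TD", ["The weather is lovely today."]))
```

The same pattern extends to the "-pos" variants by appending the POS-tag suffix to the corresponding header.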

2022) for all of the above directions; it contains 1012 sentences from Wikipedia articles for each language. Since the text generated by ChatGPT is more flexible and diverse than that of conventional translation systems, it can be challenging to evaluate ChatGPT's actual performance with a single reference. Consequently, we employ multiple references with different word order and usage to evaluate the translation quality of ChatGPT; this data was released by Ott et al. (2018).

Due to the response delay caused by the available computing resources, using ChatGPT for translation tasks can be a time-consuming and labor-intensive process. Therefore, we follow the strategy outlined in Jiao et al. (2023) and randomly sample 50 sentence pairs from each test set to form our final evaluation sets.

3.2.2 Baselines and Evaluation Metrics

To verify whether specifying the translation direction or data domain can improve the translation performance of ChatGPT, we devise a set of translation prompts that explicitly indicate the direction of translation or provide relevant information about the data domain. To evaluate the efficacy of these prompts, we compare them against TP3, which represents the baseline condition in which no supplementary information is provided about the input text. In addition, as ChatGPT is a well-established pre-trained system, we complement our study by comparing its performance under different prompts against two dominant industrial translation systems, Google Translate and DeepL Translate, instead of academic translation systems trained from scratch. We access ChatGPT through the publicly released web interface², and we access Google Translate and DeepL Translate in the same way. To ensure that our evaluation results are robust and not impacted by randomness, we take the average score over three randomly sampled test sets. We evaluate performance using multiple metrics: BLEU (Papineni et al., 2002), ChrF++ (Popović, 2017), and TER (Snover et al., 2006). In this work, we report BLEU scores computed with SacreBLEU (Post, 2018).

² https://chat.openai.com/chat
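For intuition about what the reported BLEU numbers measure, the metric can be sketched as clipped n-gram precision combined with a brevity penalty. The following simplified re-implementation is ours, not SacreBLEU: it tokenizes on whitespace, assumes a single reference per sentence, and omits the smoothing and standardized tokenization that SacreBLEU applies, so its absolute scores would not match the paper's.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (single reference per sentence):
    geometric mean of clipped n-gram precisions times a brevity penalty."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len, ref_len = hyp_len + len(h), ref_len + len(r)
        for n in range(1, max_n + 1):
            h_grams, r_grams = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, r_grams[g]) for g, c in h_grams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:   # unsmoothed: an empty order zeroes the score
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_prec)
```

A hypothesis identical to its reference scores 100, while any n-gram order with zero matches collapses the unsmoothed score to 0, which is why production toolkits apply smoothing on short segments.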
System   En->Tr: BLEU↑ ChrF++↑   Tr->En: BLEU↑ ChrF++↑   En->Ro: BLEU↑ ChrF++↑   Ro->En: BLEU↑ ChrF++↑
Google   19.37  55.91            25.82  56.78            34.88  61.62            45.92  67.65
DeepL    21.25  57.64            29.46  58.44            37.40  64.96            48.05  68.00
TP3      11.12  47.75            24.48  54.97            28.37  58.48            34.31  62.88
TD       12.32* 49.17*           25.91  56.93            28.88  58.94            36.24  63.56
DD       12.16  49.12            25.40  56.05            29.27  59.27            36.85  63.88
TD-pos   10.79  46.48            26.85  56.47            30.11  59.61            37.04* 64.57*
DD-pos   10.15  47.64            26.90* 57.00*           31.21* 61.02*           36.94  63.30

Table 3: Performance comparison of ChatGPT for low-resource translations using BLEU and ChrF++ scores. The best scores are highlighted in bold, and * denotes the highest score achieved by ChatGPT across different prompts.
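The evaluation protocol from Section 3.2 — score three randomly sampled 50-pair sets and average the results — can be sketched as follows. The function name and the toy exact-match metric are ours; in the paper's setup the metric would be SacreBLEU, ChrF++, or TER over ChatGPT outputs.

```python
import random
from statistics import mean

def averaged_score(pairs, score_fn, k=50, n_sets=3, seed=0):
    """Average a corpus-level metric over several randomly sampled
    evaluation sets, mirroring the three-sample protocol of Section 3.2.
    `pairs` is a list of (hypothesis, reference) strings."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sets):
        sample = rng.sample(pairs, k)          # one 50-pair evaluation set
        hyps, refs = zip(*sample)
        scores.append(score_fn(list(hyps), list(refs)))
    return mean(scores)

# Toy metric standing in for BLEU: percentage of exact-match sentences.
def exact_match(hyps, refs):
    return 100.0 * sum(h == r for h, r in zip(hyps, refs)) / len(hyps)

pairs = [(f"sentence {i}", f"sentence {i}") for i in range(200)]
print(averaged_score(pairs, exact_match))  # 100.0: every hypothesis matches
```

Fixing the random seed makes the sampled evaluation sets reproducible across systems, so all systems are scored on identical samples.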

3.3 High-resource Translation

We conduct a complete multilingual translation among three high-resource languages: English (En), French (Fr), and Spanish (Es), all of which use Latin script and belong to the European language family (English is Germanic; French and Spanish are Romance (Fan et al., 2021)). The results are reported in Table 2. It can be observed that the performance of ChatGPT with the default translation prompt (TP3) generally falls behind that of either Google Translate or DeepL Translate, yet remains competitive in certain translation directions, which is consistent with the experimental results of Jiao et al. (2023). Specifically, ChatGPT obtains a +0.12 higher BLEU score than Google Translate on the Spanish → English translation. While TP3 enables ChatGPT to perform competitively with other translation systems, our translation prompts further improve the BLEU scores by an average of 0.6, 5.07, 0.4, and 1.33 points on the four English-centric translations respectively, demonstrating the efficacy of our prompts. Surprising results appear on the X → English translations, where our prompts significantly outperform TP3 and even exceed the baseline systems by 0.52 and 1.87 BLEU scores in French → English and Spanish → English translations, respectively. These findings suggest that ChatGPT has significant potential for English generation tasks.

As for Spanish → French translations, our prompts yield only marginal improvement over TP3, unlike the surprising gains on the English-centric tasks. However, ChatGPT with DD-pos exhibits the most exceptional performance on French → Spanish translation, surpassing other prompts and systems, and this improvement appears in English → Spanish as well. Combining this finding with the previous observation on X → English translations, it is evident that the performance of ChatGPT is influenced by the target language, indicating that its translation capabilities are not only biased towards English-centric translations but also depend on the specific target language being translated.

3.4 Low-resource Translation

ChatGPT has the ability to understand and generate numerous languages apart from those depicted in Figure 1. This capability stems from its training data, which encompasses text in more than 100 languages in addition to the dominant English corpus. Therefore, ChatGPT can be utilized to address low-resource translation tasks. We evaluate the efficacy of ChatGPT with prompts by conducting experiments on widely explored low-resource translation tasks, English ↔ Turkish and English ↔ Romanian, with the same setup as the high-resource translations. For these two low-resource languages, since Romanian also belongs to the European language family, the translation difficulty of Romanian is theoretically lower than that of Turkish, which belongs to the Turkic language family. As Table 3 shows, ChatGPT exhibits a huge performance gap compared with Google and DeepL in low-resource translation settings, particularly falling behind by over 10 BLEU scores in English ↔ Turkish translation. However, similar to the high-resource cases, our prompts (i.e., TD and DD) still bring consistent improvements in terms of both BLEU and ChrF++ scores, and improve Romanian → English by 2.54 BLEU scores, which indicates their effectiveness.
3.5 POS Tags for ChatGPT

Neural machine translation has been shown to benefit from additional linguistic features, especially for underrepresented languages. Petrushkov et al. (2018) use human feedback to improve the accuracy and robustness of translation. Niu and Carpuat (2020) adopt target-language formality in different ways to improve the performance of the translation model. Li et al. (2022) insert word constraints and order constraints into prompts to improve translation quality. In addition, extensive work (Khan et al., 2020; Perera et al., 2022) has been conducted on the utilization of POS tags for improving translations. POS tags provide syntactic information that locates each word in a sentence, helping to disambiguate words and to understand the grammatical structure of a sentence. For example, knowing the POS tag of a word can help determine whether it is a subject, object, or modifier in a sentence. Therefore, in this work, we conjecture that incorporating POS tags as auxiliary information could further help ChatGPT release its translation ability on top of TD and DD. We concatenate POS tag sequences with the original sentences directly and prepend [sentence]/[POS] tags to the two sequences to identify and segment them. We apply Stanza (Qi et al., 2020) to automatically generate POS tags for the test samples. Notably, ChatGPT is able to fully understand the input content and distinguish between the two sequences, and the final output contains only the translation of the [sentence] part. At this juncture, ChatGPT demonstrates remarkable proficiency.

          Single           Multiple
System    BLEU    Δ        BLEU    Δ
En→Fr
DeepL     47.46   -        82.28   -
Google    49.39   +1.93    88.22   +5.94
TP3       45.63   -1.83    85.59   +3.31
TD        46.48   -0.98    87.45   +5.17
DD        48.65   +1.19    87.95   +5.67
En→De
DeepL     34.88   -        80.90   -
Google    38.18   +3.30    80.06   -0.84
TP3       32.21   -2.67    80.56   -0.34
TD        33.21   -1.67    82.22   +1.32
DD        34.28   -0.60    82.88   +1.98

Table 4: BLEU scores on En→Fr and En→De test sets with single and multiple references (Δ is the difference from DeepL).

As shown by the results in Tables 2 and 3, incorporating POS tags boosts ChatGPT in many translation directions, for example improving English → Romanian translation by +1.94 BLEU points on top of DD. However, the performance drops in some directions, including French → English, Spanish → English, and English → Turkish. We suspect that the performance drop is caused by the auxiliary information negatively impacting the original input, either by introducing noise or by increasing the complexity of the input text. While the necessity of incorporating POS tags is not conclusively demonstrated, auxiliary information clearly has some positive effect on ChatGPT. What kind of auxiliary information to use, and how to integrate it, remain important questions for future research.

3.6 Translation Diversity

Since ChatGPT is trained on huge amounts of text data, the model distribution is widespread in hypothesis space. Thus, the model shows great flexibility in vocabulary choice during generation. In addition, ChatGPT has been specifically tailored as an intelligent chatbot and optimized via the RLHF training scheme, which allows it to produce coherent and concise responses that facilitate reader understanding. In the absence of intervention, however, these attributes lead ChatGPT to produce uncertain translations, which are intentionally generated to enhance fluency and comprehensibility but are inconsistent with the gold references.

Therefore, we follow the work of Ott et al. (2018) and evaluate translation quality using multi-reference test sets that collect 10 different human translations for each source sentence. The source sentences of the two multi-reference test sets are randomly selected from the WMT 2014 English-French and English-German News sets, respectively. Experimental results are listed in Table 4. Similar to the previous experimental results, ChatGPT still shows a competitive level of performance when evaluated with single references (outperforming DeepL by 1.19 BLEU with DD on En-Fr), while achieving significant improvement when measured with multiple references (surpassing DeepL by 5.67 BLEU with DD on En-Fr). The empirical evidence provided by
Domains   Uni-Words (%)   # of Tokens   # of Chars
News      20.4            20.8          109.4
e-Com.    20.8            22.4          119.4
Social    10.8            19.5          93.2
Conver.   6.7             13.4          59.9

Table 5: The statistics of the adopted domain-specific sets.

Prompts   News    e-Com.   Social   Conver.   Avg.
TP3       35.60   25.47    34.59    34.93     32.65
TD        37.73   26.87    35.10    35.19     33.72
DD        39.93   30.28    35.80    36.25     35.57
w-DD      38.03   25.41    35.13    36.04     33.80

Table 6: BLEU scores on multi-domain translation. The best scores are highlighted in bold.

these results suggests that the utilization of more comprehensive criteria is imperative for evaluating ChatGPT's performance in translations. Additionally, ChatGPT with DD consistently outperforms other prompts on the News sets, contrary to the previous results on Flores-101. To gain further insight into this, we conduct multi-domain translation experiments in the next section.

3.7 Multi-domain Translation

The translation performance of ChatGPT exhibits inconsistency when using TD and DD across the FLORES (as shown in Tables 2 and 3) and WMT (as shown in Table 4) test sets. Specifically, ChatGPT with TD performs better than with DD on the FLORES test sets, whereas DD is better than TD on the WMT test sets. This variability in performance may be attributed to the different sources of data used. The FLORES dataset comprises Wikipedia articles that cover multiple topics (Goyal et al., 2022) and should be considered general-domain data. On the other hand, the data used in Table 4 is from a specific (News) domain. To investigate whether ChatGPT with DD is effective for domain-specific translation, we conduct experiments on the multi-domain test sets taken from Hendy et al. (2023), which cover four domains: News, e-Commerce, Social, and Conversational.

Table 5 illustrates the diversity in statistical patterns observed in the datasets obtained from different domains. The data from different domains display noticeable discrepancies in the proportion of unique words, average sentence length, and average word length. In particular, although the Conversational domain appears to be a general domain, it contains 6.7% unique words that are not present in the other sets. Additionally, its average sentence length and average number of characters per word are notably shorter in comparison to the other domain sets, as shown in Table 5. It is important to note that we lowercase and lemmatize words prior to counting them, to eliminate the impact of morphological transformation on vocabulary overlap, and the proportion is calculated over the total number of tokens rather than the vocabulary size.

As shown in Table 6, ChatGPT with DD achieves remarkable performance across all domains and gains an average BLEU score higher than TP3 and TD by +2.92 and +1.85, respectively. Meanwhile, the News and e-Commerce domains, which have higher unique-word proportions, achieve greater improvements (outperforming TP3 by +4.33 and +4.81, respectively) compared to the Social and Conversational domains (outperforming TP3 by +1.21 and +1.32, respectively). This suggests that domain information is advantageous for ChatGPT in translating data from specific domains. However, the BLEU scores demonstrate no apparent correlation between translation quality and domain uniqueness. To assess the impact of providing incorrect domain information, we conducted a comparative experiment using wrong domain information to prompt ChatGPT (referred to as w-DD). The experimental results reveal a significant influence on translation quality when the wrong domain information is used, particularly in the News and e-Commerce domains. This aligns with the previous observations where correct domain information is used, and indicates that domain information explicitly impacts ChatGPT on translations, especially in domains with more specific words.

3.8 Few-shot Prompts

Recent research (Brown et al., 2020; Chen et al., 2021; Dou et al., 2022) has demonstrated the advantages of in-context learning for LLMs. In-context learning involves adding several input-output examples (serving as prompts) to the input text to enhance the performance of LLMs across multi-
Please provide the [TGT] translation for these sentences:
[n-shot input]
Output: [n-shot reference]

Please provide the [TGT] translation for these sentences:
[n-shot input]
Output:

Figure 2: The template of few-shot prompts, taking TP3 as an example.

ple tasks, without any adjustments to parameters or architecture. Li and Liang (2021) proposed prefix-tuning, which uses a small amount of labeled data to enable general LLMs to achieve results competitive with a specialized vector trained on the downstream datasets. Inspired by prefix-tuning, Tsimpoukelli et al. (2021) utilize in-context learning to improve LLM performance on a variety of multimodal tasks. In addition, a series of studies has focused on prompting LLMs for MT (Vilar et al., 2022; Zhang et al., 2023). In-context learning relies on the quality and quantity of the selected examples, with previous research suggesting that providing more high-quality examples leads to greater downstream task improvements (Yang et al., 2020; Wei et al., 2022). However, the number of examples is limited by the LLM architecture (Olmo et al., 2021), and providing more examples does not result in any meaningful improvement (Hendy et al., 2023).

Thus, in this section, we perform experiments with few-shot prompts by selecting 1 and 5 examples (shots) from the original test sets and integrating them with the previously used prompts (i.e., TP3, TD, DD) as shown in Figure 2. Note that the selected shots do not overlap with the data used for testing. To assess the quality of the shots, we use LaBSE (Feng et al., 2022) to score the translation pairs following Hendy et al. (2023), and then select the top-1 and top-5 pairs as in-context shots. Our experiments with few-shot prompts are conducted on French→English and Spanish→English, and the results are presented in Figures 3 and 4. We compare the performance of three few-shot settings, namely 0-shot, 1-shot, and 5-shot. Based on the observations, ChatGPT with 1-shot or 5-shot setups achieves substantial improvement over 0-shot for TP3, TD, and DD. However, it is worth noting that few-shot prompts yield the most significant gains when applied to the basic TP3, while the improvements are relatively limited for TD and DD. This observation is consistent in both French→English and Spanish→English translations. In other words, combining few-shot learning with the designed prompts does not necessarily result in equivalent additive improvements. We speculate that there is some overlap between the information contained in the in-context shots and the information provided by the designed prompts.

Figure 3: Comparing BLEU scores of ChatGPT with 0, 1, and 5-shot prompts on the Fr→En translation.

Figure 4: Comparing BLEU scores of ChatGPT with 0, 1, and 5-shot prompts on the Es→En translation.

4 Conclusion and Further Work

In this work, we present an empirical study exploring how to effectively improve the performance of ChatGPT for machine translation. Specifically, we evaluate the designed translation prompts and analyze them in various settings, including high-resource, low-resource, multi-reference, multi-domain, and few-shot. We primarily design two
4 Conclusion and Future Work

In this work, we present an empirical study exploring how to effectively improve the performance of ChatGPT for machine translation. Specifically, we evaluate the designed translation prompts and analyze them in various settings, including high-resource, low-resource, multi-reference, multi-domain, and few-shot. We primarily design two types of translation prompts, namely TD and DD, based on the prompts advised by ChatGPT itself (i.e., TP3).

Our findings indicate that, by adjusting prompts, ChatGPT is capable of achieving results comparable to those of professional translation systems for high-resource languages. However, in low-resource translations, ChatGPT's performance falls behind that of Google and DeepL. To facilitate ChatGPT in low-resource translations, we incorporate POS tags into prompts as auxiliary information, building upon TD and DD. However, the instabilities observed in some translation directions indicate that this strategy requires further investigation to fully understand its limitations and potential.

Since ChatGPT is trained specifically for dialogue and prioritizes coherence and conciseness when generating sentences, a single reference may not be sufficient to evaluate its translation quality. Therefore, we use multi-reference test sets to evaluate the performance of ChatGPT-based machine translation, which allows us to take into account the diversity of possible translations. The experimental results indicate that the performance of ChatGPT-based machine translation differs significantly when evaluated with multiple references rather than a single reference. This highlights the importance of using more comprehensive evaluation criteria to assess the quality of ChatGPT's translations.

To further investigate the impact of DD on ChatGPT, we conduct experiments on multi-domain test sets. The experimental results show that introducing correct domain information into prompts can effectively improve the performance of ChatGPT for translation. Additionally, we prompt ChatGPT with few-shot prompts, which achieve substantial improvements.

For future work, several directions need to be explored. Firstly, although the incorporation of POS tags shows advantages in high-resource translations, it does not show consistent benefits in low-resource translations. Therefore, it is important to investigate what types of auxiliary data can be used and how to use them effectively. Secondly, due to the response delay of generating translations with ChatGPT, we only used randomly sampled test sets for evaluation; conducting experiments on more comprehensive test sets once efficient APIs are released will be necessary. Lastly, combining multiple translation prompts and techniques could potentially lead to better performance and should be explored further.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In International Conference on Learning Representations.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems.

Yiming Chen, Yan Zhang, Chen Zhang, Grandee Lee, Ran Cheng, and Haizhou Li. 2021. Revisiting self-training for few-shot learning of language model. arXiv preprint arXiv:2110.01256.

Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A Smith, and Yejin Choi. 2022. Is GPT-3 text indistinguishable from human text? Scarecrow: A framework for scrutinizing machine text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond English-centric multilingual machine translation. The Journal of Machine Learning Research.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics.

Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, and Enhong Chen. 2020. Incorporating BERT into parallel sequence decoding with adapters. Advances in Neural Information Processing Systems.

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210.
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Nabeel Sabir Khan, Adnan Abid, and Kamran Abid. 2020. A novel natural language processing (NLP)-based machine translation model for English to Pakistan Sign Language translation. Cognitive Computation.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

Yafu Li, Yongjing Yin, Jing Li, and Yue Zhang. 2022. Prompt-driven neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2022.

Zihan Liu, Yan Xu, Genta Indra Winata, and Pascale Fung. 2019. Incorporating word and subword units in unsupervised machine translation using language model rescoring. WMT 2019.

Xing Niu and Marine Carpuat. 2020. Controlling neural machine translation formality with synthetic supervision. In Proceedings of the AAAI Conference on Artificial Intelligence.

Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2021. GPT3-to-plan: Extracting plans from text using GPT-3. arXiv preprint arXiv:2106.07131.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Ravinga Perera, Thilakshi Fonseka, Rashmini Naranpanawa, and Uthayasanker Thayasivam. 2022. Improving English to Sinhala neural machine translation using part-of-speech tag. arXiv preprint arXiv:2202.08882.

Pavel Petrushkov, Shahram Khadivi, and Evgeny Matusov. 2018. Learning from chunk-based feedback in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Maja Popović. 2017. chrF++: Words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231.

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research.

Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems.

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2022. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. 2020. A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems.

Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating BERT into neural machine translation. In International Conference on Learning Representations.