
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT


Qingyu Lu♢,ℜ , Baopu Qiu♭,ℜ , Liang Dingℜ , Kanjian Zhang♢ , Tom Kocmi♡ , Dacheng Taoℜ

♢ Southeast University ℜ JD Explore Academy, JD.com Inc. ♭ Nanjing University ♡ Microsoft
luqingyu@seu.edu.cn, qiubaopu@smail.nju.edu.cn,
liangding.liam@gmail.com, tomkocmi@microsoft.com
https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt

Abstract

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing ChatGPT for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conduct an investigation into several prompting methods, and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with error analysis can generate human-like MT evaluations at both the system and the segment level. Additionally, we discover some limitations of ChatGPT as an MT evaluator: for example, changing the order of inputs may significantly influence its judgment when multiple translations are provided in a single query. This work provides a preliminary experience of prompting LLMs as evaluators to improve the reliability of translation evaluation metrics under the error analysis paradigm.

1 Introduction

Large language models (LLMs), especially Generative Pre-trained Transformer (GPT) models (Radford et al., 2019; Brown et al., 2020), have shown remarkable performance in various natural language processing (NLP) tasks. Recently, OpenAI developed ChatGPT, an interactive chatbot built upon InstructGPT (Ouyang et al., 2022), which has captured the attention of researchers in the NLP community (Qin et al., 2023; Zhong et al., 2023). This chatbot is capable of integrating multiple NLP tasks and can generate detailed and comprehensive responses to human inquiries. Additionally, it can respond appropriately to follow-up questions and maintain sensitivity throughout several turns of conversation.

Previous research has demonstrated that ChatGPT can perform as well as or even better than other LLMs on the machine translation task (Hendy et al., 2023; Jiao et al., 2023; Peng et al., 2023). However, it remains uncertain whether ChatGPT can be used as a metric to evaluate the quality of translations. If ChatGPT is suitable for this task, how should we design prompts so that it generates reliable evaluations? Concurrent to our work, GEMBA (Kocmi and Federmann, 2023) presents an encouraging finding that LLMs such as ChatGPT can outperform the current best MT metrics at system-level quality assessment with zero-shot standard prompting, but such prompts show unreliable performance at the segment level.

In this work, we take a further step by investigating advanced few-shot prompting strategies for ChatGPT on MT quality assessment, and propose a novel prompting strategy, Error Analysis Prompting (EAPrompt), which combines Chain-of-Thought (CoT, Wei et al. (2022)) and Error Analysis (EA, Lu et al. (2022)). We give an example of EAPrompt in Figure 1. The idea is to prompt ChatGPT to generate a human-like evaluation like MQM (Freitag et al., 2021) by ❶ identifying major and minor errors, and ❷ scoring the translation according to the severity of these errors.

We conduct experiments on 122,823 segments from 64 MT systems across various domains to verify the effectiveness of our approach, and find that:

• Our proposed EAPrompt outperforms standard prompting (Kocmi and Federmann, 2023) at both the system level and the segment level, achieving human-like evaluations on LLMs such as gpt-3.5-turbo (Turbo). Notably, EAPrompt is also reliable when evaluating top-performing MT systems.
[Figure 1 compares the two prompting strategies on a worked example. The Standard Prompting panel asks the model to "act as a translation evaluation metric that scores a translation between 0 to 100 based on the source and reference" and returns a single number (e.g., "A: 40."). The Error Analysis Prompting panel contains an in-context example, an "Identify Errors" instruction asking for itemized major and minor errors, and a "Score Translation" instruction that deducts 5 points per major error and 1 point per minor error, yielding a human-like evaluation (e.g., two major errors give a final score of -10).]

Figure 1: A comparative overview of Standard Prompting and our proposed Error Analysis Prompting in assessing MT quality with ChatGPT.

• EAPrompt on Turbo achieves nearly the performance of GEMBA on GPT-4. We believe that if EAPrompt were applied to GPT-4, it could potentially outperform the GEMBA variants.

• When designing prompts, itemized responses are better than lengthy and detailed explanations of errors. Moreover, splitting the instruction into two parts, one for identifying errors and one for scoring the translation, can improve evaluation stability.

• EAPrompt may have a detrimental effect on text-davinci-003 (Dav3) as an evaluator. This discrepancy may be attributed to Dav3's relatively inferior CoT capability in contrast to Turbo.

• Despite its good performance, we show that ChatGPT may be unreliable at evaluating high-quality MT systems, and may score the same translation differently.

• It is NOT advisable to combine multiple translations into a single query input, as ChatGPT has a preference for the translations placed earlier.

This study provides an initial exploration of utilizing error analysis to prompt LLMs as evaluators. The proposed approach holds the promise of enhancing the reliability of translation evaluation metrics. Furthermore, this methodology can also be extended to benefit other evaluation scenarios within natural language generation, including text summarization and data-to-text tasks.

2 Prompt ChatGPT with Error Analysis

2.1 Translation Evaluation Metric

Translation evaluation metrics are used to assess the performance of machine translation systems on specific test sets (Freitag et al., 2022; Mathur et al., 2020b). Modern evaluation metrics often leverage pre-trained language models to enhance reliability (Kocmi et al., 2021). These metrics typically take inputs from three sources: the sentence in the source language ("Source"), the reference translation provided by human translators ("Reference"), and the hypothesis being evaluated ("Translation"). In scenarios where reference signals are not provided, such "reference-less" metrics can also be utilized for quality estimation purposes (Zerva et al., 2022; Specia et al., 2010; Qiu et al., 2022). The output of the metric is a score or rank indicating the translation quality of each hypothesis.

To test the reliability of metrics, we use human evaluation as the gold standard. A common approach for collecting human evaluation is Direct Assessment (DA, Graham et al. (2017)), which assigns each sentence a score ranging from 0 to 100.
Multi-dimensional Quality Metric (MQM) has recently been adopted in WMT as a high-quality human evaluation strategy (Freitag et al., 2021). It asks human experts to annotate the errors in the hypothesis and categorize them as "Major" or "Minor" according to their severity.

2.2 Prompt LLMs as Evaluation Metrics

When prompting an LLM as an evaluation metric, it is crucial to design appropriate instructions that describe the evaluation task and the scoring range. In this paper, we mainly adopt two prompting strategies: "Standard Prompting" and "Error Analysis Prompting".

The standard prompting approach directly asks LLMs to generate a score that reflects the quality of the translation. In a recent study, GEMBA (Kocmi and Federmann, 2023) adopted four different standard prompting techniques, demonstrating state-of-the-art performance at the system level when compared to other model-based metrics. However, they also observe that the performance at the segment level is relatively poorer. This highlights the importance of combining Chain-of-Thought with the Error Analysis strategy to prompt LLMs in a manner that more closely resembles human evaluation.

2.3 Error Analysis Prompting

Motivated by the MQM framework in human evaluation, the idea of the Error Analysis (EA) paradigm, as introduced by Lu et al. (2022), is to enhance the automatic scoring process by explicitly incorporating error identification, thus providing a more human-like evaluation.

The Chain-of-Thought (CoT) prompting strategy was first proposed by Wei et al. (2022). Instead of directly generating the answer, CoT prompts LLMs to think step by step. This approach has shown significant performance improvements on reasoning tasks such as GSM8K (Cobbe et al., 2021). CoT is an emergent ability of LLMs and has been incorporated in instruction fine-tuning of LLMs (Chung et al., 2022) as well as in benchmarks designed to evaluate LLM capabilities (Suzgun et al., 2022).

In this work, we combine the CoT and EA paradigms, introducing a novel prompting strategy called Error Analysis Prompting (EAPrompt). As shown in Figure 1, EAPrompt divides the scoring process into two stages: first, the LLM is instructed to identify major and minor errors in the translation ("Instruction: Identify Errors"); subsequently, the number of these two types of errors is counted, and the final score is computed ("Instruction: Score Translation"). Distinguished from standard prompting, EAPrompt provides a more detailed and human-like evaluation approach.

After exploring several prompt contexts in initial experiments, we made the following modifications to EAPrompt:

• we adopt the one-shot learning format (Brown et al., 2020) to enhance the LLMs' understanding of the task; different in-context examples are used for different language pairs;

• we employ an itemized template response, enabling clearer identification and quantification of errors;

• we divide the scoring process into two stages, enhancing the stability of LLMs during each query and avoiding potential inconsistencies.

We present a thorough analysis of the impact of these prompt variants in §4.3. The specific prompt contexts we utilize can be found in Appendix A.

3 Experimental Results

3.1 Experiment Setup

Dataset We utilize the test set from the WMT22 shared tasks (Freitag et al., 2022) in three language pairs: English-German (En-De), English-Russian (En-Ru), and Chinese-English (Zh-En). In addition, we present results from the WMT20 shared tasks (Mathur et al., 2020b) in Appendix B; since there might be potential contamination of the WMT20 test set in GPT training, we exclude these results from the main findings. Table 1 provides detailed information about our test sets. Compared to the WMT20 test set (news domain only), the WMT22 test set consists of samples from four domains: conversational, e-commerce, news, and social. Since the training data of the OpenAI LLMs we use were collected up to September 2021, this significantly reduces the chance of test set leakage, making our analyses more convincing.

Human Evaluation Human evaluation of translated texts is widely considered to be the gold standard for evaluating metrics. We use a high-quality human evaluation dataset, Multi-dimensional Quality Metrics (MQM, Freitag et al. (2021)), as human judgments. This dataset is annotated by human experts and has been widely adopted in recent translation evaluation (Freitag et al., 2022) and quality estimation tasks (Zerva et al., 2022) in WMT.
Dataset   Language Pair   Segments   Systems   Domains
WMT22     En-De           2037       15        conversational, e-commerce, news, social
WMT22     En-Ru           2037       16        conversational, e-commerce, news, social
WMT22     Zh-En           1875       18        conversational, e-commerce, news, social
WMT20     En-De           1418       7         news
WMT20     Zh-En           2000       8         news

Table 1: Statistics of the test sets. Source texts, reference texts, and translations are from the WMT20 and WMT22 metrics shared tasks.

Meta Evaluation We utilize the system-level pairwise accuracy of system rankings (Kocmi et al., 2021). At the segment level, we follow Freitag et al. (2022) and adopt the average of three types of Kendall correlation. Specifically, these values are computed by flattening the scores into a single vector and calculating the average correlations over systems or over segments. To avoid skewing the results, we compute the pairwise accuracy for MT systems across all three language pairs as the final performance at the system level, and we compute the average Kendall correlation across all language pairs for the segment-level results. In order to maintain consistency and comparability with other metrics, all meta-evaluations are calculated with MTME (https://github.com/google-research/mt-metrics-eval), a metric evaluation tool recommended by WMT22 (Freitag et al., 2022).

3.2 Baselines and Large Language Models

Baseline Metrics Given the reported unreliability of BLEU (Papineni et al., 2002) in WMT22, we compare LLMs with several commonly used model-based metrics for MT evaluation. BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020) are supervised neural metrics trained on human judgments. UniTE (Wan et al., 2022) is a learnt metric that evaluates MT outputs by combining three different evaluation scenarios. MetricX XXL (Freitag et al., 2022) is a large-scale multi-task metric that fine-tunes LLM checkpoints using diverse human feedback data. For reference-less metrics, we reproduce COMET-QE (Rei et al., 2021), which was one of the top-performing metrics in WMT21. These metrics have shown a strong correlation with human judgments.

Large Language Models We assess the evaluation capability of LLMs via the OpenAI API (https://platform.openai.com/). We mainly involve two models from the GPT-3.5 family: gpt-3.5-turbo ("Turbo") and text-davinci-003 ("Dav3"). Both models are selected due to their proximity in capabilities to ChatGPT, since the internal model behind ChatGPT is unknown. We also compare our proposed approach with the GPT-4 model ("GPT-4") prompted with GEMBA, to see whether the performance of Turbo with EAPrompt can approach this powerful LLM.

3.3 Experimental Results

We report the system- and segment-level performance of EAPrompt with LLMs in Table 2. Full results for WMT20 and WMT22 are presented in Appendix B. We can see that:

(i) EAPrompt achieves state-of-the-art performance for Turbo at both the system level and the segment level. Consistent with the findings of Kocmi and Federmann (2023), LLMs achieve SOTA performance across all three language pairs at the system level. Notably, EAPrompt further enhances Turbo's performance compared to other prompting strategies, achieving an overall pairwise accuracy of 90.9%, as opposed to GEMBA-DA's 89.4%.

At the segment level, despite previous findings by Kocmi and Federmann (2023) regarding the poor correlation between LLM evaluators and human judgments, EAPrompt surpasses GEMBA-DA's performance on Turbo by a significant margin, averaging a 9.7% improvement. This result verifies the effectiveness of EAPrompt when used with LLMs such as ChatGPT.

(ii) EAPrompt on Turbo approaches the performance of GPT-4. We also compare EAPrompt on Turbo with prompting GPT-4 with GEMBA. An interesting finding is that the difference between EAPrompt on Turbo (90.9 at the system level, 33.6 at the segment level) and GEMBA-Stars on GPT-4 (91.2 at the system level, 35.3 at the segment level) is negligible.
En-De En-Ru Zh-En Overall
Models Metrics / Prompts SYS SEG SYS SEG SYS SEG SYS SEG
Baselines MetricX XXL 76.9 36.0 91.4 42.0 84.6 42.7 85.0 40.2
BLEURT20 76.9 34.4 90.5 35.9 84.6 36.1 84.7 35.5
COMET22 76.9 36.8 86.7 40.0 86.8 42.8 83.9 39.9
UniTE 74.4 36.9 87.6 37.8 84.6 35.7 82.8 36.8
COMET-QE[noref] 71.8 28.1 80.0 34.1 81.3 36.5 78.1 32.9
Dav3 GEMBA-DA 92.3 30.6 85.7 33.2 86.8 37.1 88.0 33.6
EAPrompt 67.9 20.0 87.6 25.5 79.1 22.9 79.6 22.8
GEMBA-DA[noref] 87.2 18.0 88.6 25.8 82.4 28.9 86.1 24.2
EAPrompt[noref] 39.7 12.7 80.0 23.1 74.7 17.6 67.9 17.8
Turbo GEMBA-Stars 89.7 25.9 90.5 22.3 87.9 26.5 89.4 24.9
EAPrompt 89.7 32.4 92.4 32.3 90.1 36.2 90.9 33.6
GEMBA-SQM[noref] 89.7 25.9 91.4 30.9 81.3 29.1 87.6 28.6
EAPrompt[noref] 87.2 28.4 90.5 30.1 90.1 32.5 89.4 30.3
GPT-4 GEMBA-Stars 89.7 32.6 94.3 35.1 89.0 38.2 91.2 35.3
GEMBA-Classes[noref] 92.3 30.4 92.4 39.0 89.0 31.3 91.2 33.6

Table 2: The system and segment level results of metrics using pairwise accuracy (%) and Kendall correlation (%)
with human-annotated MQM scores, respectively. The best results among the same model are highlighted in bold.
The best results among all metrics are underlined. Our proposed EAPrompt results are highlighted in orange .
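The pairwise accuracy and Kendall correlation values reported in Table 2 can be approximated with a short script. The sketch below is a simplified illustration, not the MTME tool used in the paper (which averages several Kendall variants and groups scores over systems or segments); the function names and the plain tau-b variant are our own assumptions.

```python
# Simplified meta-evaluation: system-level pairwise accuracy and
# segment-level Kendall correlation against MQM human scores.
from itertools import combinations
from scipy.stats import kendalltau

def system_pairwise_accuracy(metric_sys: dict, mqm_sys: dict) -> float:
    """metric_sys / mqm_sys map system name -> system-level score."""
    agree, total = 0, 0
    for a, b in combinations(metric_sys, 2):
        if mqm_sys[a] == mqm_sys[b]:
            continue  # skip pairs tied in the human ranking
        total += 1
        # a pair counts as correct if both rankings order a and b the same way
        agree += int((metric_sys[a] - metric_sys[b]) * (mqm_sys[a] - mqm_sys[b]) > 0)
    return agree / total

def segment_kendall(metric_seg: list, mqm_seg: list) -> float:
    """Plain Kendall tau over flattened segment scores (a simplification)."""
    tau, _ = kendalltau(metric_seg, mqm_seg)
    return tau
```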

This negligible difference suggests that if EAPrompt were applied to GPT-4, it could potentially outperform the GEMBA variants. This belief stems from the fact that GEMBA mainly leverages the DA scoring technique, aiming to avoid using test sets as development sets in GPT-4. Consequently, EAPrompt appears to be better suited for ChatGPT models as the translation evaluator.

(iii) EAPrompt also boosts Turbo's performance in reference-less scenarios. Our findings remain consistent in reference-less settings ("[noref]"), where EAPrompt[noref] on Turbo shows improved performance compared to GEMBA, with an increase of 1.8% at the system level and 1.7% at the segment level. Notably, EAPrompt[noref] on Turbo surpasses existing reference-less metrics and even reference-based metrics at the system level. These results indicate that LLMs possess impressive cross-lingual capabilities and are well-suited for quality estimation under EAPrompt, where the absence of reference translations poses a significant challenge.

(iv) EAPrompt may have a detrimental effect on Dav3 as an evaluator. EAPrompt with Dav3 does not exhibit the same level of effectiveness as observed with Turbo. This discrepancy may be attributed to Dav3's relatively inferior CoT capability in contrast to Turbo. A standard prompting strategy like GEMBA-DA may be more suitable for Dav3, whereas Turbo, a model geared toward dialogue completion, likely excels at CoT and aligns well with our two-turn EAPrompt approach. Another suspicion is that a stricter profanity filter was applied to Dav3, leading to instances where the model returns "None" outputs.

It is important to note that the results for WMT20 are also provided in Appendix B. These results largely mirror the findings from WMT22, reinforcing the consistency of our observations across different evaluation settings.

4 Analysis

4.1 The reliability of ChatGPT when evaluating top-performing systems

Previous research has shown the unreliability of automatic metrics in evaluating top-performing systems, as they exhibit a significant decline in correlation with human evaluation (Mathur et al., 2020a). To further test the reliability of EAPrompt on LLMs, we compute the performance of LLMs on the top-k MT systems at the system and segment level, as presented in Figure 2.
Instruction | Response | Separation | Human           | Turbo           | Dav3            | GPT-3.5 (Interface) | GPT-4 (Interface)
            |          |            | Maj / Min / All | Maj / Min / All | Maj / Min / All | Maj / Min / All     | Maj / Min / All
Standard    | -        | -          | -               | - / - / 70      | - / - / 85      | - / - / 90          | - / - / 70
EA          | Detailed | ✗          | 1 / 1 / -6      | 0 / 2 / -2      | 1 / 1 / -6      | 1 / 2 / -7          | 1 / 1 / -6
EA          | Itemized | ✗          | 1 / 1 / -6      | 1 / 0 / -5      | 1 / 1 / -6      | 0 / 2 / -2          | 1 / 0 / -5
EA          | Itemized | ✓          | 1 / 1 / -6      | 2 / 0 / -10     | 1 / 0 / -5      | 2 / 0 / -10         | 1 / 1 / -6

Table 3: Comparison of the segment-level scores of ChatGPT for different variants of prompts. Instructions are divided into standard prompting and EA. The response can be either itemized or detailed. The instruction can be separated into two queries (one for identifying errors and another for scoring) or combined into a single query. "Maj", "Min", and "All" denote the number of major errors, the number of minor errors, and the final score, respectively.

[Figure 2 plots, for each WMT22 language pair (Zh-En, En-De, En-Ru), the pairwise accuracy (panel a, system level) and Kendall's correlation (panel b, segment level) of GEMBA-DA and EAPrompt on Turbo and Dav3, with and without references, as the evaluation is restricted to the top-k MT systems (k = 15, 12, 9, 6, 4).]

Figure 2: System- and segment-level performance on the top-k MT systems of the WMT22 dataset.

We can observe that, at the segment level, the ranking of the different prompts remains relatively consistent, indicating the stability of LLM performance. However, at the system level, there is a noticeable decline in correlations when restricting the evaluation to the top k <= 6 systems in most settings. This finding warns us that LLMs as evaluators may not exhibit the same level of accuracy, especially when the quality of the MT systems is relatively high.

Prompt                 Dav3   Turbo   GPT-4
GEMBA-DA               0      0       0
GEMBA-Stars            58     0       -
GEMBA-SQM              1279   0       -
GEMBA-Classes          0      0       -
EAPrompt               40     3       -
GEMBA-DA[noref]        53     0       0
GEMBA-Stars[noref]     1      0       0
GEMBA-SQM[noref]       1      0       0
GEMBA-Classes[noref]   0      0       -
EAPrompt[noref]        58     6       -

Table 4: Number of invalid answers using different prompts in WMT22 (test set size: 106,758).
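Table 4 counts responses from which no score could be recovered, i.e., cases where the model explains itself without ever producing a definitive number. A minimal validity check of this kind is sketched below; the exact extraction criterion used to produce Table 4 may differ, so treat this purely as an illustration.

```python
import re

def extract_score(response: str):
    """Return the first number found in a response, or None if the model
    explained itself without producing a score (an "invalid answer")."""
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(match.group()) if match else None

responses = ["A: 40.", "I cannot score this translation without more context."]
num_invalid = sum(extract_score(r) is None for r in responses)
print(num_invalid)  # 1
```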
4.2 EAPrompt yields an acceptable number of invalid answers compared with standard prompting

GEMBA (Kocmi and Federmann, 2023) highlights that LLMs may provide invalid answers by explaining their decision-making process instead of providing a definitive score. Given that our approach involves more complex prompt contexts, necessitating two separate queries for each test sample, a possible concern is that EAPrompt may generate invalid answers with higher probability than standard prompting.

To this end, we report the number of invalid answers in Table 4. We observe that when using EAPrompt on Turbo for WMT22, it generates only 9 invalid answers (3 in the reference-based setting, 6 in the reference-less setting). However, on Dav3, this number increases to 98, which may be attributed to the stricter profanity filter used by the Dav3 model, causing some samples to not receive the expected response. Overall, the number of invalid answers is comparable to that of GEMBA, confirming the effectiveness of our approach and its suitability for real-world evaluation.

4.3 EAPrompt empowers ChatGPT to produce human-like evaluations

Given the crucial significance of prompt design, we explore several versions of the in-context prompt contexts and present an analysis in Table 3. See Appendix A for the prompt contexts used in our experiment. We find that:

(i) ChatGPT becomes more adept at identifying errors when instructed by error analysis. With standard prompting, it becomes challenging to interpret the overall quality of a test sample due to the differing evaluation criteria among models. For instance, the same translation receives a score of 70 from Turbo but 90 from the GPT-3.5 interface, so a given number is hard to interpret consistently across models. In contrast, EAPrompt provides more specific instructions, resulting in greater interpretability of its judgments. Therefore, we recommend incorporating error analysis instructions in prompt contexts rather than relying solely on standard instructions.

(ii) An itemized template response is better than a detailed illustration. As shown in the "Response" column, providing detailed descriptions of errors may hinder ChatGPT's capability to accurately identify them. In our initial experiments, we also observe that generating excessively detailed responses can lead to incorrect error counting or misclassification of error severity. Therefore, it is recommended to provide clear and concise descriptions of errors in a format that is easy for ChatGPT to process and comprehend.

(iii) Separating the scoring process from error identification may improve the stability of ChatGPT. We suggest splitting the error analysis instruction into two queries, one for identifying errors and the other for scoring the translation. Although this may not yield a significant performance gain, we observe that ChatGPT sometimes fails to deduct points for identified errors or presents an incorrect calculation of scores. Separating the scoring process may be helpful, as it allows ChatGPT to focus on a single procedure in each query and thus provide more accurate judgments.

5 Case Study

In Figure 3, we illustrate, with case studies, several typical issues that one should be aware of when using ChatGPT as a translation evaluator. We also present detailed responses in Appendix D.

5.1 ChatGPT is unstable when conducting the evaluation process

When assessing translations using ChatGPT, it is not uncommon to observe variations in the scores assigned to the same input. As depicted in Case 1, we regenerate several responses with the same input and obtain three different scores (98, 95, 100) for the translation. The discrepancies in scores could be attributed to the inherent randomness of the model behind ChatGPT. Another possible reason is the lack of clearly stated evaluation criteria in the prompt contexts. Therefore, we suggest using the specific guidelines we propose to minimize the impact of these variations.

5.2 ChatGPT prefers earlier inputs when provided with multiple translations

An interesting phenomenon is that when multiple translations are presented together as a single input to ChatGPT for evaluation, it tends to believe that the translations provided earlier are of higher quality, while the later translations are relatively poorer.

Case 2 shows an example of this attack on ChatGPT. We provide 8 translations along with their corresponding source and reference sentences. First, we present the translations sequentially and ask ChatGPT to rank them according to their translation quality. ChatGPT ranks the translations as (SYS1, SYS2, SYS4, SYS5, SYS3, SYS6, SYS7, SYS8), with SYS1 being the best translation and SYS8 being the worst. Then, we reverse the order of the translations and obtain an entirely different sequence of ranks, (SYS8, SYS7, SYS6, SYS5, SYS4, SYS3, SYS2, SYS1), with SYS8 being the best translation and SYS1 being the worst.
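This order-sensitivity check can be replicated for any set of candidate translations: rank them once in the given order and once reversed, then map the second ranking back to the original numbering and compare. A small helper is sketched below; the function that actually queries the model for a ranking is passed in as `rank_fn` and is not shown (it would wrap a ranking prompt like the one in Figure 6).

```python
def order_sensitivity(translations, rank_fn):
    """rank_fn takes an ordered list of translations and returns a ranking
    (best to worst) as indices into that list, e.g. by querying ChatGPT."""
    forward = rank_fn(translations)
    backward = rank_fn(list(reversed(translations)))
    n = len(translations)
    # express the reversed-order ranking in terms of the original numbering
    backward_remapped = [n - 1 - i for i in backward]
    return forward, backward_remapped, forward == backward_remapped
```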
[Figure 3 shows three cases. Case 1: the same source, reference, and translation ("携程国庆旅游报告预测…" / "Ctrip's National Day tourism report predicts…") are scored 100, 98, and 95 across regenerations of the same standard prompt. Case 2: when eight system translations of the same source are ranked in a single query, ChatGPT returns SYS1>SYS2>SYS4>SYS5>SYS3>SYS6>SYS7>SYS8 for the original input order and SYS8>SYS7>SYS6>SYS5>SYS4>SYS3>SYS2>SYS1 when the inputs are reversed, so the input order affects the evaluation results. Case 3: asked to score a translation of "兴安盟属于大兴安岭南麓集中连片贫困地区" without the "Do not use existing metrics" instruction, ChatGPT falls back on the BLEU score.]

Figure 3: Case study of potential issues in ChatGPT. Top: ChatGPT exhibits variations in its responses upon multiple regenerations; Middle: a different input order of samples may affect the judgment of ChatGPT; Bottom: ChatGPT sometimes relies on existing metrics during translation evaluation.

The contradictory results may be attributed to the auto-regressive nature of the decoder model, which gives more attention to the later parts of the input, potentially leading it to identify more errors in the translations provided later. Therefore, we recommend that researchers input one translation at a time instead of providing multiple translations at once.

5.3 ChatGPT may directly adopt existing evaluation metrics

We observe that in certain cases, when prompted conventionally, ChatGPT occasionally relies on established evaluation metrics such as BLEU and METEOR. An illustration of this behavior can be seen in Case 3, where ChatGPT resorts to the BLEU score instead of offering judgments based on its inherent capabilities.

As our objective is to examine ChatGPT's performance on translation evaluation, rather than its capability to invoke pre-existing evaluation procedures, we include an explicit instruction, "Do not use existing metrics", in standard prompting. This encourages ChatGPT to develop its own approach to evaluating translations.

6 Conclusion

In this paper, we explore the potential of ChatGPT as a metric for evaluating translations. We design a novel in-context prompting strategy based on chain-of-thought and error analysis, and show that this strategy significantly improves ChatGPT's evaluation performance. We further verify the capabilities of LLMs for evaluating top-performing MT systems, and also compare the number of invalid answers produced by our approach. We compare our approach with other prompt designs to show the effectiveness of error analysis. We hope this experience can benefit NLP researchers in developing more reliable prompting strategies. In Section 5, we also highlight several potential issues that researchers should be aware of when using ChatGPT as a translation evaluator.

In future work, we would like to experiment with a broader range of LLMs (Barrault et al., 2019; Anastasopoulos et al., 2021; Kocmi et al., 2022; Zan et al., 2022) to make our conclusions more convincing. Lastly, it will be interesting to test the capabilities of LLMs on other MT-related tasks, such as grammatical error correction and automatic post-editing (Wu et al., 2023; Vidal et al., 2022).
Limitations

The limitations of this work are three-fold:

• Potential Test Data Contamination: Although we utilized WMT22 to minimize the risk of test set leakage into the training data of LLMs, it is still possible that some contamination from the test data remains. Therefore, future researchers utilizing these datasets should be cautious and carefully address this issue, as it may affect the suitability of the test set for comparison purposes.

• Budget Constraints: Due to limited resources, we were unable to explore more prompt choices comprehensively in our research. The findings presented in this study only reflect our initial experiments. The impact of different prompt choices, as well as the influence of the instability and input-order issues in ChatGPT, remains to be investigated.

• Limited Range of LLMs Tested: In this study, we focused on evaluating a limited number of LLMs that we believed possessed the potential and capability to serve as translation evaluators. However, it is important to note that not all existing LLMs can necessarily serve as reliable evaluators under the EAPrompt approach. Future research could explore and experiment with a broader range of LLMs, examining their effectiveness and assessing their suitability as evaluators.

Ethics Statement

We take ethical considerations very seriously, and strictly adhere to the Code of Ethics. All procedures performed in this study are in accordance with the ethical standards. This paper focuses on evaluating the capabilities of LLMs as translation evaluators. Our proposed approach, EAPrompt, does not include statements that induce the model to generate harmful information. Additionally, this method solely extracts and processes the numerical scores from the model's responses, thereby further mitigating potential risks. Both the datasets and the models used in this paper are publicly available and have been widely adopted by researchers. Our method will not learn from user inputs or cause potential risks to the NLP community. We ensure that the findings and conclusions of this paper are reported accurately and objectively. Informed consent was obtained from all individual participants included in this study.

References

Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, et al. 2021. Findings of the IWSLT 2021 evaluation campaign. In IWSLT.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, et al. 2019. Findings of the 2019 conference on machine translation (WMT19). In WMT.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. 2020. Language models are few-shot learners. NeurIPS.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint.

Markus Freitag, George Foster, David Grangier, et al. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. TACL.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, et al. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In WMT.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering.

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, et al. 2023. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint.

Tom Kocmi, Rachel Bawden, Ondřej Bojar, et al. 2022. Findings of the 2022 conference on machine translation (WMT22). In WMT.

Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint.
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In WMT.

Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong, and Dacheng Tao. 2022. Toward human-like evaluation for natural language generation with error analysis. arXiv preprint.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020a. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In ACL.

Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020b. Results of the WMT20 metrics shared task. In WMT.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. arXiv preprint.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint.

Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, and Dacheng Tao. 2022. Original or translated? On the use of parallel data for translation quality estimation. arXiv preprint.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In WMT.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In EMNLP.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In ACL.

Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint.

Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In EMNLP.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Blanca Vidal, Albert Llorens, and Juan Alonso. 2022. Automatic post-editing of MT output using large language models. In AMTA.

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022. UniTE: Unified translation evaluation. In ACL.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint.

Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv preprint.

Changtong Zan, Keqin Peng, Liang Ding, Baopu Qiu, et al. 2022. Vega-MT: The JD Explore Academy machine translation system for WMT22. In WMT.

Chrysoula Zerva, Frédéric Blain, Ricardo Rei, et al. 2022. Findings of the WMT 2022 shared task on quality estimation. In WMT.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.

A Prompt Contexts

Figure 4 compares the prompt contexts implemented in Error Analysis Prompting with the detailed-response and combined-instruction variants discussed in Section 4.3.

B Additional Results

To further validate the findings of our experiments, we provide a comprehensive presentation of the WMT22 results obtained under various prompt settings, as shown in Table 6.

In addition, we present a comparative analysis of the results obtained by EAPrompt on WMT20, in contrast to the baseline metrics, as illustrated in Table 7.
C Performance on other LLMs

To further verify the effect of the EAPrompt strategy on other types of LLMs, we use an open-source model, Llama2-70b-chat (Touvron et al., 2023), as our base model, together with an in-house Llama2-70b model that has been fine-tuned on mathematical reasoning datasets to enhance its CoT capabilities. We also test our approach on GPT-4 with a subset of 600 samples, due to budget constraints. Results are shown in Table 5.

Model              DA      EAPrompt   ∆
Llama2-70b-chat    21.9    22.6       +0.7
Llama2-70b +SFT    22.1    37.4       +15.3
GPT-4*             44.92   45.97      +1.05

Table 5: Segment-level Kendall correlation on WMT22 Zh-En using different models. *: Due to the budget limit, the GPT-4 experiments were conducted on a subset of 600 samples.

We can see that EAPrompt exhibits a slight advantage over GEMBA-DA, by 0.7 points, on the base model. However, once we inject additional reasoning capabilities into the model (referred to as "+SFT"), the performance gap widens significantly, with EAPrompt outperforming GEMBA-DA by a large margin (+15.3). This further verifies that EAPrompt is better suited for LLMs with stronger CoT reasoning capabilities.

On GPT-4, EAPrompt shows a slight performance advantage over GEMBA-DA, outperforming it by 1.05 points. This could be attributed to the intrinsic evaluation capabilities of GPT-4, which may already be adept at translation evaluation. EAPrompt introduces interpretability to the translation evaluation process while upholding evaluation accuracy.

D Examples for Prompt Selection and Case Study

Table 8 provides the detailed information of the example utilized in Section 4.3. The prompt contexts of Case 1, regarding the instability issue, are depicted in Figure 5, as discussed in Section 5.1. The prompt contexts of Case 2 and the response from ChatGPT concerning the input-order issue are shown in Figure 6, as illustrated in Section 5.2. Furthermore, Figure 7 displays the prompt contexts of Case 3 and the response from ChatGPT addressing the utilization of existing metrics, as explained in Section 5.3.
Here is an example template:

In-Context Example
Q: Source: 中新网北京9月27日电 (记者 杜燕)为加强节前市场监管执法,北京市市场监管局在国庆节前夕检查各类经营主体2000余户。
Reference: Chinanews.com Report on September 27 in Beijing (Journalist Du Yan) The Beijing Administration for Market Regulation inspected more than 2,000 operating entities of different types before the National Day holiday to strengthen pre-holiday market regulation and law enforcement.
Translation: BEIJING, Sept. 27 (Reporter Du Yan) In order to strengthen market supervision and law enforcement before the festival, the Beijing Municipal Market Supervision Bureau inspected more than 2,000 households of various business subjects on the eve of the National Day.
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.

Itemized Response
A: Major errors:
(1) "BEIJING" – Omission
(2) "subjects" – Mistranslation
Minor errors:
(1) "households of various" – Mistranslation
(2) "festival" – Mistranslation
(3) "supervision" – Mistranslation
(4) "Beijing Municipal Market Supervision Bureau" – Inappropriate for context
(5) "BEIJING" – Spelling

Detailed Response (variant)
A: I think the mistranslation of "subjects" should be categorized as a major error, and the omission of "BEIJING" should also be considered a major error. "households of various", "festival" and "supervision" are three mistranslation errors; they should be categorized as minor errors. The terminology "Beijing Municipal Market Supervision Bureau" is inappropriate for the context and should also be categorized as a minor error. "BEIJING" also has a spelling error, which is considered a minor error.

Test Question
Use the template above to answer the following question:
Q: Source: 当日,以"讲好中国故事,传播好中国声音"为宗旨的外宣工作创新性、基础性平台——"中国好故事"数据库正式上线,向世界展现真实、立体、全面的中国。
Reference: On that day, the externally publicized innovative and basic platform-"The story of China", for the purpose of telling the story of China well and spreading the voice of China well", was officially on line today, to show the world a true, three-dimensional and comprehensive China.
Translation: On that day, the "China Good Story" database, an innovative and basic platform for outreach work with the aim of "telling a good Chinese story and disseminating a good Chinese voice", was officially launched to show the world a real, three-dimensional and comprehensive China.

Separated Instructions
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.
A:
Q: Count the number of major and minor errors identified in your last response and compute the final score for this translation. Deduct 5 points for each major error. Deduct 1 point for each minor error. If the translation has no errors, its score will be 0.
A:

Combined Instruction (variant)
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation. Count the number of major and minor errors identified and compute the final score for this translation. Deduct 5 points for each major error. Deduct 1 point for each minor error. If the translation has no errors, its score will be 0.

Figure 4: The prompt contexts utilized in EAPrompt, together with the other prompt contexts we use when comparing different prompt selections.
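For readers who wish to script the above, the sketch below shows how the two separated instructions could be issued as a single multi-turn chat via the OpenAI API. It is illustrative only: the exact client interface depends on the SDK version, the one-shot in-context example from Figure 4 is omitted for brevity, and the helper names are ours.

```python
import openai  # OpenAI SDK v0.x style; newer SDKs expose a client object instead

IDENTIFY = (
    "Source: {src}\nReference: {ref}\nTranslation: {hyp}\n"
    "Based on the given source and reference, identify the major and minor errors in this "
    "translation. Note that Major errors refer to actual translation or grammatical errors, "
    "and Minor errors refer to smaller imperfections, and purely subjective opinions about "
    "the translation."
)
SCORE = (
    "Count the number of major and minor errors identified in your last response and compute "
    "the final score for this translation. Deduct 5 points for each major error. Deduct 1 "
    "point for each minor error. If the translation has no errors, its score will be 0."
)

def eaprompt(src, ref, hyp, model="gpt-3.5-turbo"):
    # Turn 1: ask the model to identify major and minor errors.
    messages = [{"role": "user", "content": IDENTIFY.format(src=src, ref=ref, hyp=hyp)}]
    first = openai.ChatCompletion.create(model=model, messages=messages)
    errors = first["choices"][0]["message"]["content"]
    # Turn 2: ask the model to count the errors and compute the final score.
    messages += [
        {"role": "assistant", "content": errors},
        {"role": "user", "content": SCORE},
    ]
    second = openai.ChatCompletion.create(model=model, messages=messages)
    return errors, second["choices"][0]["message"]["content"]
```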
En-De En-Ru Zh-En Overall
Models Metrics / Prompts SYS SEG SYS SEG SYS SEG SYS SEG
Baselines MetricX XXL 76.9 36.0 91.4 42.0 84.6 42.7 85.0 40.2
BLEURT20 76.9 34.4 90.5 35.9 84.6 36.1 84.7 35.5
COMET22 76.9 36.8 86.7 40.0 86.8 42.8 83.9 39.9
UniTE 74.4 36.9 87.6 37.8 84.6 35.7 82.8 36.8
COMET-QE[noref] 71.8 28.1 80.0 34.1 81.3 36.5 78.1 32.9
Dav3 DA 92.3 30.6 85.7 33.2 86.8 37.1 88.0 33.6
Stars 88.5 29.4 81.9 29.4 87.9 29.7 85.8 29.5
SQM 93.6 28.3 83.8 30.8 80.2 34.6 85.4 31.2
Classes 83.3 23.5 87.6 28.9 84.6 25.1 85.4 25.8
EAPrompt 67.9 20.0 87.6 25.5 79.1 22.9 79.6 22.8
DA[noref] 87.2 18.0 88.6 25.8 82.4 28.9 86.1 24.2
Stars[noref] 80.8 19.8 83.8 31.0 84.6 23.5 83.2 24.8
SQM[noref] 82.1 21.8 84.8 32.8 80.2 26.8 82.5 27.1
Classes[noref] 76.9 17.6 81.9 27.1 76.9 17.2 78.8 20.6
EAPrompt[noref] 39.7 12.7 80.0 23.1 74.7 17.6 67.9 17.8
Turbo DA 85.9 25.0 90.5 23.4 82.4 25.5 86.5 24.6
Stars 89.7 25.9 90.5 22.3 87.9 26.5 89.4 24.9
SQM 87.2 29.8 91.4 27.7 82.4 31.3 87.2 29.6
Classes 82.1 17.0 87.6 16.7 76.9 17.8 82.5 17.2
EAPrompt 89.7 32.4 92.4 32.3 90.1 36.2 90.9 33.6
DA[noref] 83.3 25.5 90.5 29.4 85.7 26.4 86.9 27.1
Stars[noref] 88.5 25.5 88.6 27.9 75.8 26.1 84.3 26.5
SQM[noref] 89.7 25.9 91.4 30.9 81.3 29.1 87.6 28.6
Classes[noref] 62.8 -1.0 61.9 2.7 61.5 2.9 62.0 1.5
EAPrompt[noref] 87.2 28.4 90.5 30.1 90.1 32.5 89.4 30.3
GPT-4 DA 89.7 35.7 92.4 35.8 86.8 38.2 89.8 36.6
Stars 89.7 32.6 94.3 35.1 89.0 38.2 91.2 35.3
SQM 88.5 38.0 92.4 38.8 84.6 39.8 88.7 38.9
Classes 89.7 22.2 90.5 26.7 86.8 27.3 89.1 25.4
DA[noref] 84.6 31.1 91.4 40.5 85.7 40.7 87.6 37.4
Stars[noref] 84.6 30.8 93.3 36.6 87.9 40.4 89.1 35.9
SQM[noref] 84.6 35.9 95.2 43.2 85.7 41.6 89.1 40.2
Classes[noref] 92.3 30.4 92.4 39.0 89.0 31.3 91.2 33.6

Table 6: The system and segment level results of metrics using pairwise accuracy (%) and Kendall correlation (%)
with human-annotated MQM scores. The best results among the same model are highlighted in bold. The best
results among all metrics are underlined. Our proposed EAPrompt results are highlighted in orange .
En-De Zh-En
Models Metrics / Prompts SYS SEG SYS SEG
Baselines BLEU (Papineni et al., 2002) 85.7 12.2 67.9 17.4
BLEURT (Sellam et al., 2020) 76.2 43.6 82.1 43.1
COMET (Rei et al., 2020) 85.7 34.8 78.6 37.1
PRISM (Thompson and Post, 2020) 90.5 28.2 82.1 32.3
Turbo EAPrompt 90.5 35.5 82.1 46.1
EAPrompt[noref] 81 34.9 75 37.1

Table 7: The system and segment level results of metrics on WMT20 En-De and Zh-En datasets, using pairwise
accuracy (%) and Kendall correlation (%) with human-annotated MQM scores. Best results are highlighted in bold.

System: Online-A.en
Domain: conversational
Doc_id: 1
Seg_id: 6
Text:
  Source: 请问,订单情况现在是什么样?
  Reference: May I ask what the status of the order is now?
  Translation: Please ask, what is the order situation now?
Human Evaluation:
  Major Error: "Please ask" - Accuracy/Mistranslation
  Minor Error: "situation" - Style/Awkward

Table 8: Case information.


[Figure 5 shows three regenerated responses to the same standard-prompting query from Case 1, scored 100, 98, and 95.]

Figure 5: When evaluating the same translation three times, ChatGPT generates similar explanations but different scores.

[Figure 6 shows the two ranking queries from Case 2: with the original input order, Translation 8 is ranked as the worst, whereas with the reversed input order it is ranked as the best.]

Figure 6: Comparison of providing multiple translations in sequential or reverse order. ChatGPT tends to prefer the former translations and generates contradictory judgments.

Figure 7: An example of ChatGPT directly adopting BLEU to evaluate translation quality.
