chine translation, text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing ChatGPT for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conduct an investigation into several prompting methods, and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with error analysis can generate human-like MT evaluations at both the system and segment level. Additionally, we are the first to discover some limitations of ChatGPT as an MT evaluator; for example, changing the order of input may significantly influence the judgment when providing multiple translations in a single query. This work provides preliminary experience in prompting LLMs as evaluators to improve the reliability of translation evaluation metrics under the error analysis paradigm.

1 Introduction

Large language models (LLMs), especially Generative Pre-trained Transformer (GPT) models (Radford et al., 2019; Brown et al., 2020), have shown remarkable performance in various natural language processing (NLP) tasks. Recently, OpenAI developed ChatGPT, an interactive chatbot built upon InstructGPT (Ouyang et al., 2022), which has captured the attention of researchers in the NLP community (Qin et al., 2023; Zhong et al., 2023). This chatbot is capable of integrating multiple NLP tasks and can generate detailed and comprehensive responses to human inquiries. Additionally, it can respond appropriately to follow-up questions and has shown strong performance relative to other LLMs in the machine translation task (Hendy et al., 2023; Jiao et al., 2023; Peng et al., 2023). However, it remains uncertain whether ChatGPT can be used as a metric to evaluate the quality of translations. If ChatGPT is suitable for this task, then how should we develop appropriate prompts that can make ChatGPT generate reliable evaluations? Concurrent to our work, GEMBA (Kocmi and Federmann, 2023) presents an encouraging finding that LLMs, e.g., ChatGPT, could outperform the current best MT metrics at system-level quality assessment with zero-shot standard prompting, but such prompts show unreliable performance at the segment level.

In this work, we take a further step by investigating advanced few-shot prompting strategies upon ChatGPT for MT quality assessment, and propose a novel prompting strategy, Error Analysis Prompting (EAPrompt), combining Chain-of-Thought (CoT, Wei et al. (2022)) and Error Analysis (EA, Lu et al. (2022)). We give an example of EAPrompt in Figure 1. The idea is to prompt ChatGPT to generate a human-like evaluation like MQM (Freitag et al., 2021) by ❶ identifying major and minor errors, and ❷ scoring the translations according to the severity of these errors.

We conduct experiments on 122,823 segments from 64 MT systems across various domains to verify the effectiveness of our approach, and find that:

• Our proposed EAPrompt outperforms standard prompting (Kocmi and Federmann, 2023) at both the system level and the segment level, achieving human-like evaluations on LLMs such as gpt-3.5-turbo (Turbo). Notably, EAPrompt is also reliable when evaluating top-performing MT systems.
[Figure 1 shows the two prompting strategies side by side. Standard Prompting issues a single query: "Please act as a translation evaluation metric that scores a translation between 0 to 100 based on the source and reference. Do not use existing metrics." Error Analysis Prompting instead provides an in-context example, first asks the model to identify major and minor errors ("Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation"), and then asks it to count the identified errors and compute the final score, deducting 5 points per major error and 1 point per minor error, with a score of 0 for an error-free translation. The example response reads: "There are 2 major errors in the translation. Therefore, 10 points will be deducted (2 errors × 5 points each). There are no minor errors in the translation. Hence, the final score for this translation will be -10."]

Figure 1: A comparative overview between Standard Prompting and our proposed Error Analysis Prompting in assessing MT quality with ChatGPT.
• EAPrompt on Turbo achieves nearly the performance of GEMBA on GPT-4. We believe that if EAPrompt were applied to GPT-4, it could potentially outperform the GEMBA variants.

• When designing prompts, itemized responses are better than lengthy and detailed explanations of errors. Moreover, splitting the instruction into two steps, identifying errors and scoring the translation, can improve evaluation stability.

• EAPrompt may have a detrimental effect on text-davinci-003 (Dav3) as an evaluator. This discrepancy may be attributed to Dav3's relatively inferior CoT capability in contrast to Turbo.

• Despite its good performance, we show that ChatGPT may be unreliable at evaluating high-quality MT systems, and may score the same translation differently.

• It is NOT advisable to combine multiple translations into a single query input, as ChatGPT has a preference for the former translations.

This study provides an initial exploration of utilizing error analysis to prompt LLMs as evaluators. The proposed approach holds the promise of enhancing the reliability of translation evaluation metrics. Furthermore, this methodology can also be extended to benefit other evaluation scenarios within natural language generation, including text summarization and data-to-text tasks.

2 Prompt ChatGPT with Error Analysis

2.1 Translation Evaluation Metric

Translation evaluation metrics are used to assess the performance of machine translation systems on specific test sets (Freitag et al., 2022; Mathur et al., 2020b). Modern evaluation metrics often leverage pre-trained language models to enhance reliability (Kocmi et al., 2021). These metrics typically take inputs from three sources: the sentence from the source language ("Source"), the reference translation provided by human translators ("Reference"), and the hypothesis being evaluated ("Translation"). In scenarios where reference signals are not provided, such "reference-less" metrics can also be utilized for quality estimation purposes (Zerva et al., 2022; Specia et al., 2010; Qiu et al., 2022). The output of the metric is a score or rank indicating the translation quality of each hypothesis.

To test the reliability of metrics, we use human evaluation as the gold standard. A common approach for collecting human evaluations is Direct Assessment (DA, Graham et al. (2017)), which assigns each sentence a score ranging from 0 to 100. The Multi-dimensional Quality Metric (MQM) has recently been adopted in WMT as a high-quality human evaluation strategy (Freitag et al., 2021). It asks human experts to annotate the errors in the hypothesis and categorize them as "Major" or "Minor" according to their severity.

2.2 Prompt LLMs as Evaluation Metrics

When prompting an LLM as an evaluation metric, it is crucial to design appropriate instructions that describe the evaluation task and the scoring range. In this paper, we mainly adopt two prompting strategies: "Standard Prompting" and "Error Analysis Prompting".

The standard prompting approach directly asks LLMs to generate a score that reflects the quality of the translation. In a recent study, GEMBA (Kocmi and Federmann, 2023) adopted four different standard prompting techniques, demonstrating state-of-the-art performance at the system level when compared to other model-based metrics. However, they also observe that the performance at the segment level is relatively poorer. This highlights the importance of combining Chain-of-Thought with the Error Analysis strategy to prompt LLMs in a manner that more closely resembles human evaluation.

2.3 Error Analysis Prompting

Motivated by the MQM framework in human evaluation, the idea of the Error Analysis (EA) paradigm, as introduced by Lu et al. (2022), is to enhance the automatic scoring process by explicitly incorporating error identification, thus providing a more human-like evaluation.

The Chain-of-Thought (CoT) prompting strategy was first proposed by Wei et al. (2022). Instead of directly generating the answer, CoT prompts LLMs to think step by step. This approach has shown significant performance improvements on reasoning tasks such as GSM8K (Cobbe et al., 2021). CoT is an emergent ability of LLMs and has been incorporated in instruction fine-tuning of LLMs (Chung et al., 2022) as well as in benchmarks designed to evaluate LLM capabilities (Suzgun et al., 2022).

In this work, we combine the CoT and EA paradigms, introducing a novel prompting strategy called Error Analysis Prompting (EAPrompt). As shown in Figure 1, EAPrompt divides the scoring process into two stages: First, the LLM is instructed to identify major and minor errors in the translation ("Instruction: Identify Errors"). Subsequently, the number of these two types of errors is counted, and the final score is computed ("Instruction: Score Translation"). Distinguished from standard prompting, EAPrompt provides a more detailed and human-like evaluation approach.

After exploring several prompt contexts in initial experiments, we made the following modifications to EAPrompt:

• we adopt the one-shot learning format (Brown et al., 2020) to enhance the LLMs' understanding of the task; different in-context examples are used for different language pairs;

• we employ an itemized template response, enabling clearer identification and quantification of errors;

• we divide the scoring process into two stages, enhancing the stability of LLMs during each query and avoiding potential inconsistencies.

We present a thorough analysis of the prompt variants' impact in §4.3. The specific prompt contexts we utilize can be found in Appendix A.

3 Experimental Results

3.1 Experiment Setup

Dataset We utilize the test set from the WMT22 shared tasks (Freitag et al., 2022) in three language pairs: English-German (En-De), English-Russian (En-Ru), and Chinese-English (Zh-En). In addition, we present results from the WMT20 shared tasks (Mathur et al., 2020b) in Appendix B.[1] Table 1 provides detailed information about our test sets. Compared to the WMT20 test set (news domain only), the WMT22 test set consists of samples from 4 domains: conversational, e-commerce, news, and social. Since the training data of the OpenAI LLMs we use extends only up to Sep 2021, this significantly reduces the chance of test set leakage, making our analyses more convincing.

Human Evaluation Human evaluation of translated texts is widely considered to be the gold standard in evaluating metrics. We use a high-quality human evaluation dataset, Multi-dimensional Quality Metrics (MQM, Freitag et al. (2021)), as human judgments. This dataset is annotated by human experts and has been widely adopted in recent translation evaluation (Freitag et al., 2022) and quality estimation tasks (Zerva et al., 2022) in WMT.

[1] Since there might be potential contamination of the WMT20 test set in GPT training, we exclude these results from the main findings.
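The two-stage procedure described in §2.3 can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual parser: it assumes the first-stage reply lists errors one per line, each tagged "major" or "minor", and applies the deduction rule shown in Figure 1 (5 points per major error, 1 per minor, 0 for a perfect translation).

```python
import re

MAJOR_PENALTY = 5   # points deducted per major error
MINOR_PENALTY = 1   # points deducted per minor error

def count_errors(reply: str) -> tuple[int, int]:
    """Count itemized major/minor errors in a first-stage model reply."""
    major = len(re.findall(r"^\s*[-*\d].*\bmajor\b", reply, re.I | re.M))
    minor = len(re.findall(r"^\s*[-*\d].*\bminor\b", reply, re.I | re.M))
    return major, minor

def eaprompt_score(major: int, minor: int) -> int:
    """0 means a perfect translation; scores grow more negative with errors."""
    return -(MAJOR_PENALTY * major + MINOR_PENALTY * minor)

# Hypothetical itemized reply in the spirit of the Figure 1 example:
reply = (
    "1. Major error: 'Mike' mistranslated as 'Jerry'.\n"
    "2. Major error: 'happily' is not in the source.\n"
)
maj, mn = count_errors(reply)
print(eaprompt_score(maj, mn))  # -10, matching the Figure 1 example
```

In practice the reply format varies with the prompt template, so a real parser would have to be matched to the template actually used.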
Dataset   Language Pair   Segments   Systems   Domains
WMT22     En-De           2037       15        conversational, e-commerce, news, social
          En-Ru           2037       16        conversational, e-commerce, news, social
          Zh-En           1875       18        conversational, e-commerce, news, social
WMT20     En-De           1418       7         news
          Zh-En           2000       8         news

Table 1: Statistics of the test sets. Source, reference texts, and translations are from the WMT20 and WMT22 metrics shared tasks.
Meta Evaluation We utilize the system-level pairwise accuracy of system rankings (Kocmi et al., 2021). At the segment level, we follow Freitag et al. (2022) and adopt the average of three types of Kendall correlation. Specifically, these values are computed by flattening the scores into a single vector and calculating the average correlations over systems, or over segments. To avoid skewing the results, we compute the pairwise accuracy for MT systems across all three language pairs as the final performance at the system level. We compute the average Kendall correlation for the segment-level results across all language pairs. In order to maintain consistency and comparability with other metrics, all meta-evaluations are calculated with MTME[2], a metric evaluation tool recommended by WMT22 (Freitag et al., 2022).

[2] https://github.com/google-research/mt-metrics-eval

3.2 Baselines and Large Language Models

Baseline Metrics Given the reported unreliability of BLEU (Papineni et al., 2002) in WMT22, we compare LLMs with several commonly used model-based metrics for MT evaluation. BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020) are supervised neural metrics trained on human judgments. UniTE (Wan et al., 2022) is a learnt metric that evaluates MT outputs by combining three different evaluation scenarios. MetricX XXL (Freitag et al., 2022) is a large-scale multi-task metric that fine-tunes LLM checkpoints using diverse human feedback data. For reference-less metrics, we reproduce COMET-QE (Rei et al., 2021), which was one of the top-performing metrics in WMT21. These metrics have shown a strong correlation with human judgments.

Large Language Models We assess the evaluation capability of LLMs via the OpenAI API[3]. We mainly involve two models from the GPT3.5 family: gpt-3.5-turbo ("Turbo") and text-davinci-003 ("Dav3"). Both models are selected due to their proximity in capabilities to ChatGPT, since the internal model behind ChatGPT is unknown. We compare our proposed approach with the GPT-4 model ("GPT-4") on GEMBA, to see if the performance of Turbo with EAPrompt could approach this powerful LLM.

[3] https://platform.openai.com/

3.3 Experimental Results

We report the system- and segment-level performance of EAPrompt with LLMs in Table 2. Full results for WMT20 and WMT22 are presented in Appendix B. We can see that:

(i) EAPrompt achieves state-of-the-art performance for Turbo at both the system level and segment level. Consistent with the findings of Kocmi and Federmann (2023), LLMs achieve SOTA performance across all three language pairs at the system level. Notably, EAPrompt further enhances Turbo's performance compared to other prompting strategies, achieving an overall score of 90.9% as opposed to GEMBA-DA's 89.4% in terms of pairwise accuracy.

At the segment level, despite previous findings by Kocmi and Federmann (2023) regarding the poor correlation between LLMs as evaluators and human judgments, EAPrompt surpasses GEMBA-DA's performance on Turbo by a significant margin, averaging a 9.7% improvement. This result verifies the effectiveness of EAPrompt when used with LLMs such as ChatGPT.

(ii) EAPrompt on Turbo approaches the performance of GPT-4. We also compare EAPrompt on Turbo with prompting GPT-4 with GEMBA. An interesting finding is that the difference between EAPrompt on Turbo (90.9 - system level, 33.6 - segment level) and GEMBA-Stars on GPT-4 (91.2
                                   En-De        En-Ru        Zh-En        Overall
Models     Metrics / Prompts       SYS   SEG    SYS   SEG    SYS   SEG    SYS   SEG
Baselines  MetricX XXL             76.9  36.0   91.4  42.0   84.6  42.7   85.0  40.2
           BLEURT20                76.9  34.4   90.5  35.9   84.6  36.1   84.7  35.5
           COMET22                 76.9  36.8   86.7  40.0   86.8  42.8   83.9  39.9
           UniTE                   74.4  36.9   87.6  37.8   84.6  35.7   82.8  36.8
           COMET-QE[noref]         71.8  28.1   80.0  34.1   81.3  36.5   78.1  32.9
Dav3       GEMBA-DA                92.3  30.6   85.7  33.2   86.8  37.1   88.0  33.6
           EAPrompt                67.9  20.0   87.6  25.5   79.1  22.9   79.6  22.8
           GEMBA-DA[noref]         87.2  18.0   88.6  25.8   82.4  28.9   86.1  24.2
           EAPrompt[noref]         39.7  12.7   80.0  23.1   74.7  17.6   67.9  17.8
Turbo      GEMBA-Stars             89.7  25.9   90.5  22.3   87.9  26.5   89.4  24.9
           EAPrompt                89.7  32.4   92.4  32.3   90.1  36.2   90.9  33.6
           GEMBA-SQM[noref]        89.7  25.9   91.4  30.9   81.3  29.1   87.6  28.6
           EAPrompt[noref]         87.2  28.4   90.5  30.1   90.1  32.5   89.4  30.3
GPT-4      GEMBA-Stars             89.7  32.6   94.3  35.1   89.0  38.2   91.2  35.3
           GEMBA-Classes[noref]    92.3  30.4   92.4  39.0   89.0  31.3   91.2  33.6

Table 2: The system- and segment-level results of metrics using pairwise accuracy (%) and Kendall correlation (%) with human-annotated MQM scores, respectively.
- system level, 35.3 - segment level) is negligible. This finding suggests that if EAPrompt were applied to GPT-4, it could potentially outperform the GEMBA variants. This belief stems from the fact that GEMBA mainly leverages the DA scoring technique, aiming to avoid using test sets as development sets in GPT-4. Consequently, EAPrompt appears to be better suited for ChatGPT models as the translation evaluator.

(iii) EAPrompt also boosts Turbo's performance in reference-less scenarios. Our findings remain consistent in reference-less settings ("[noref]"), where EAPrompt[noref] on Turbo shows improved performance compared to GEMBA, with an increase of 1.8% at the system level and 1.7% at the segment level. Notably, EAPrompt[noref] on Turbo surpasses existing reference-less metrics and even reference-based metrics at the system level. These results indicate that LLMs possess impressive cross-lingual capabilities and are well-suited for quality estimation under EAPrompt, where the absence of reference translations poses a significant challenge.

(iv) EAPrompt may have a detrimental effect on Dav3 as an evaluator. EAPrompt with Dav3 does not exhibit the same level of effectiveness as observed with Turbo. This discrepancy may be attributed to Dav3's relatively inferior CoT capability in contrast to Turbo. A standard prompting strategy like GEMBA-DA may be more suitable for Dav3, while Turbo, a model tuned for dialog completion, likely excels in CoT and aligns well with our two-turn EAPrompt approach. Another suspicion is that a stricter profanity filter was applied to Dav3, leading to instances where the model returns "None" outputs.

It is important to note that the results for WMT20 are also provided in Appendix B. These results largely mirror the findings from WMT22, reinforcing the consistency of our observations across different evaluation settings.

4 Analysis

4.1 The reliability of ChatGPT when evaluating top-performance systems

Previous research has shown that automatic metrics are unreliable when evaluating top-performing systems, as they exhibit a significant decline in correlation with human evaluation (Mathur et al., 2020a). To further test the reliability of EAPrompt on LLMs, we compute the performance of LLMs on top-k MT systems at the sys-
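The meta-evaluation behind the results above (§3.1) can be made concrete with a small sketch. This is an illustrative re-implementation under simplifying assumptions (no ties, a single flat score vector), not the MTME tool the paper actually uses, and the scores below are hypothetical:

```python
from itertools import combinations

def pairwise_accuracy(metric: list[float], human: list[float]) -> float:
    """System-level pairwise accuracy (Kocmi et al., 2021): fraction of
    system pairs that the metric orders the same way as human scores."""
    pairs = list(combinations(range(len(metric)), 2))
    agree = sum(
        1 for i, j in pairs
        if (metric[i] - metric[j]) * (human[i] - human[j]) > 0
    )
    return agree / len(pairs)

def kendall_tau(x: list[float], y: list[float]) -> float:
    """Simple Kendall correlation over flattened segment-score vectors."""
    pairs = list(combinations(range(len(x)), 2))
    c = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    d = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (c - d) / len(pairs)

# Hypothetical scores for four MT systems:
metric_scores = [0.82, 0.75, 0.90, 0.60]
human_scores = [70.0, 72.0, 88.0, 55.0]
print(round(pairwise_accuracy(metric_scores, human_scores), 3))  # 0.833
print(round(kendall_tau(metric_scores, human_scores), 3))        # 0.667
```

One inverted pair (the metric prefers the first system, humans prefer the second) costs one of the six pairwise comparisons, which is why accuracy drops to 5/6 here.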
Instruction   Response   Separated   Human          Turbo          Dav3           GPT-3.5 (Interface)   GPT-4 (Interface)
                                     Maj  Min  All  Maj  Min  All  Maj  Min  All  Maj  Min  All         Maj  Min  All
Standard      -          -           -    -    -    -    -    70   -    -    85   -    -    90          -    -    70
EA            Detailed   ✗           0    2    -2   1    1    -6   1    2    -7   1    1    -6          1    1    -6
EA            Itemized   ✗           1    0    -5   1    1    -6   0    2    -2   1    0    -5
EA            Itemized   ✓           2    0    -10  1    0    -5   2    0    -10  1    1    -6

Table 3: Comparison of the segment-level scores of ChatGPT for different variants of prompts. Instructions are divided into standard prompting and EA. The response can either be itemized or detailed. The instruction can be separated into two queries (one for identifying errors and another for scoring) or combined into a single query. "Maj", "Min", and "All" represent the number of major errors, the number of minor errors, and the final score, respectively.
[Figure 2 consists of three panels plotting correlation (y-axis, roughly 0.2 to 0.8) against the number of retained top-k systems (x-axis: 15, 12, 9, 6, 4); panel (a) shows the system level.]

Figure 2: System- and segment-level performance on top-k MT systems on the WMT22 dataset.
tem and segment level, presented in Figure 2. We can observe that, at the segment level, the ranking of different prompts remains relatively consistent, indicating the stability in performance of LLMs. However, at the system level, there is a noticeable decline in correlations when restricting the evaluation to the top k <= 6 systems in most settings. This finding warns us that LLMs as evaluators may not exhibit the same level of accuracy, especially when the quality of the Machine Translation (MT) systems is relatively high.

Prompt                   Dav3   Turbo   GPT-4
GEMBA - DA               0      0       0
      - Stars            58     0       -
      - SQM              1279   0       -
      - Classes          0      0       -
EAPrompt                 40     3       -
GEMBA - DA[noref]        53     0       0
      - Stars[noref]     1      0       0
      - SQM[noref]       1      0       0
      - Classes[noref]   0      0       -
EAPrompt[noref]          58     6       -

Table 4: Number of invalid answers using different prompts in WMT22 (testset size: 106,758).

4.2 EAPrompt brings acceptable invalid answers compared with standard prompting

As GEMBA (Kocmi and Federmann, 2023) highlights in their study, LLMs may provide invalid answers by explaining their decision-making process instead of providing a definitive score. Given that our approach involves more complex prompt contexts, necessitating two separate queries for each test sample, a possible concern is that there is a higher probability that EAPrompt may generate
invalid answers compared to standard prompting. To this end, we report the number of invalid answers in Table 4. We observe that when using EAPrompt on Turbo for WMT22, it generates only 9 invalid answers (6 in the reference-based setting, 3 in the reference-less setting). However, on Dav3, this number increases to 98, which may be attributed to the Dav3 model utilizing a stricter profanity filter, causing some samples to not receive the expected response. Overall, the number of invalid answers is comparable to that of GEMBA, confirming the effectiveness of our approach and its suitability for real-world evaluation.

4.3 EAPrompt empowers ChatGPT to produce human-like evaluations

Given the crucial significance of prompt design, we explore several versions of in-context prompt contexts and present an analysis in Table 3. See Appendix A for the prompt contexts used in our experiment. We find that:

(i) ChatGPT becomes more adept at identifying errors when instructed by error analysis. For standard prompting, it becomes challenging to interpret the overall quality of a test sample due to the differing evaluation criteria among different models. For instance, a translation score of 70 on Turbo might be considered worse, whereas on the GPT-3.5 interface, a score of 90 could be perceived as significantly better. In contrast, EAPrompt's judgment provides more specific instructions, resulting in greater interpretability. Therefore, we recommend incorporating error analysis instructions in prompt contexts rather than relying solely on standard instructions.

(ii) An itemized template response is better than a detailed illustration. As shown in the "Response" column, providing descriptions of errors in detail may hinder ChatGPT's capability to accurately identify errors. In our initial experiments, we also observe that generating excessively detailed responses can lead to incorrect error counting or misclassification of error severity. Therefore, it is recommended to provide clear and concise descriptions of errors in a format that is easy for ChatGPT to process and comprehend.

(iii) Separating the scoring process from error identification may improve the stability of ChatGPT. We suggest splitting the instruction of error analysis into two queries, one for identifying errors and the other for scoring the translation. Although this may not cause a significant performance gain, we observe that sometimes ChatGPT fails to deduct points for identified errors or presents an incorrect calculation of scores. Separating the scoring process may be helpful, as it allows ChatGPT to focus on one single procedure in each query and thus can provide more accurate judgments.

5 Case Study

In Figure 3, we list several typical issues, with case studies, that one should be aware of when using ChatGPT as a translation evaluator. We also present detailed responses in Appendix D.

5.1 ChatGPT is unstable when conducting the evaluation process

When assessing translations using ChatGPT, it is not uncommon to observe variations in the scores assigned to the same input. As depicted in Case 1, we regenerate several responses with the same input and obtain 3 different scores (98, 95, 100) for the translation. The discrepancies in scores could be attributed to the inherent randomness of the model behind ChatGPT. Another possible reason is the lack of clearly stated evaluation criteria in the prompt contexts. Therefore, we suggest using the specific guidelines we propose to minimize the impact of these variations.

5.2 ChatGPT prefers former inputs when provided with multiple translations

An interesting phenomenon is that when multiple translations are presented together as a single input to ChatGPT for evaluation, it tends to believe that the translations provided earlier are of higher quality, while the quality of later translations is relatively poorer.

Case 2 shows an example of this attack on ChatGPT. We provide 8 translations along with their corresponding source and reference sentences. First, we present the translations sequentially and ask ChatGPT to rank them according to their translation quality. ChatGPT ranks the translations as (SYS1, SYS2, SYS4, SYS5, SYS3, SYS6, SYS7, SYS8), with SYS1 being the best translation and SYS8 being the worst. Then, we reverse the order of the translations and obtain an entirely different sequence of ranks - (SYS8, SYS7, SYS6, SYS5, SYS4, SYS3, SYS2, SYS1), with SYS8 being the best translation and SYS1 being the worst.
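The order sensitivity observed in Case 2 can be quantified with a simple probe: query the evaluator twice, once with the candidates in their original order and once reversed, and measure how consistently the two returned rankings order each pair of systems. The sketch below uses the rankings reported above; `rank_agreement` is a hypothetical helper, not part of any evaluation toolkit.

```python
from itertools import combinations

def rank_agreement(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of system pairs that both rankings order the same way."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        1 for s, t in pairs
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0
    )
    return agree / len(pairs)

# Rankings ChatGPT returned in Case 2 (best to worst):
forward = ["SYS1", "SYS2", "SYS4", "SYS5", "SYS3", "SYS6", "SYS7", "SYS8"]
with_input_reversed = ["SYS8", "SYS7", "SYS6", "SYS5", "SYS4", "SYS3", "SYS2", "SYS1"]

print(rank_agreement(forward, forward))  # 1.0 for an order-insensitive evaluator
print(round(rank_agreement(forward, with_input_reversed), 3))  # 0.071: near-total flip
```

An agreement close to 1.0 would indicate a position-insensitive evaluator; the near-zero value here reflects the almost complete reversal reported in Case 2.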
[Figure 3 presents three cases. Case 1: the same scoring prompt (source 携程国庆旅游报告预测…, reference "Ctrip's National Day tourism report predicts that, based on previous National Day tourism statistics, it is expected that …") is regenerated several times, and ChatGPT returns 100, 98, and 95 for the same translation. Case 2: eight system translations are ranked in the given order and again in reversed order, producing the rankings SYS1>SYS2>SYS4>SYS5>SYS3>SYS6>SYS7>SYS8 and SYS8>SYS7>SYS6>SYS5>SYS4>SYS3>SYS2>SYS1, showing that input order affects the evaluation results. Case 3: asked to score a translation without using existing metrics, ChatGPT nevertheless replies that "the translation can be evaluated using the BLEU score, which is a widely used metric …".]

Figure 3: Case study of potential issues in ChatGPT. Top: ChatGPT exhibits variations in its responses upon multiple regenerations; Middle: a different input order of samples may affect the judgment of ChatGPT; Bottom: ChatGPT sometimes relies on existing metrics during translation evaluation.
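One pragmatic response to the instability illustrated in Case 1 (Figure 3, top) is to sample several regenerations and aggregate them, treating the spread as a stability signal. The snippet below is a sketch of that idea, not an experiment from the paper; the three scores are the ones observed in Case 1.

```python
from statistics import mean, pstdev

def aggregate_scores(scores: list[float]) -> dict[str, float]:
    """Aggregate repeated evaluations of the same translation."""
    return {"mean": mean(scores), "spread": pstdev(scores)}

# Three regenerations of the same scoring query (Case 1):
summary = aggregate_scores([100, 98, 95])
print(round(summary["mean"], 2), round(summary["spread"], 2))  # 97.67 2.05
```

Averaging reduces the variance of any single regeneration, and a large spread flags segments where the evaluator's judgment should not be trusted.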
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020a. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In ACL.

Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020b. Results of the WMT20 metrics shared task. In WMT.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. arXiv preprint.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint.

Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, and Dacheng Tao. 2022. Original or translated? On the use of parallel data for translation quality estimation. arXiv preprint.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In WMT.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In EMNLP.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In ACL.

Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Blanca Vidal, Albert Llorens, and Juan Alonso. 2022. Automatic post-editing of MT output using large language models. In AMTA.

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022. UniTE: Unified translation evaluation. In ACL.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint.

Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on the grammatical error correction benchmark. arXiv preprint.

Changtong Zan, Keqin Peng, Liang Ding, Baopu Qiu, et al. 2022. Vega-MT: The JD Explore Academy machine translation system for WMT22. In WMT.

Chrysoula Zerva, Frédéric Blain, Ricardo Rei, et al. 2022. Findings of the WMT 2022 shared task on quality estimation. In WMT.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.

A Prompt Contexts

Figure 4 compares the prompt contexts implemented in error analysis prompting with the detailed response and combined instruction discussed in Section 4.3.

B Additional Results

To further validate the findings of our experiments, we provide a comprehensive presentation of the WMT22 results obtained under various prompt settings, as shown in Table 6. In addition, we present a comparative analysis of the results obtained by EAPrompt on WMT20, in contrast to the baseline metrics, as illustrated in Table 7.

… displays the prompt contexts of Case 3 and the response from ChatGPT addressing the utilization of existing metrics, as explained in Section 5.3.
C Performance on other LLMs
To further verify the effect of the EAPrompt strategy on other types of LLMs, we use an open-source model, Llama2-70b-chat (Touvron et al., 2023), as our base model, together with an in-house Llama2-70b model that has been fine-tuned on mathematical reasoning datasets to enhance CoT capabilities. We also test our approach on GPT-4 with a subset of 600 samples, due to budget constraints. Results are shown in Table 5.
Model DA EAPrompt ∆
Llama2-70b-chat 21.9 22.6 +0.7
Llama2-70b +SFT 22.1 37.4 +15.3
GPT-4* 44.92 45.97 +1.05
Q: Source: 当日,以“讲好中国故事,传播好中国声音”为宗旨的外宣工作创新性、基础性平台——“中国好故事”数据库正式上线,向世界展现真实、立体、全面的中国。
Reference: On that day, the externally publicized innovative and basic platform-“The story of China”, for the purpose of telling the story of China well and spreading the voice of China well”, was officially on line today, to show the world a true, three-dimensional and comprehensive China.
Translation: On that day, the "China Good Story" database, an innovative and basic platform for outreach work with the aim of "telling a good Chinese story and disseminating a good Chinese voice", was officially launched to show the world a real, three-dimensional and comprehensive China.

Separated Instructions:
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation.
A:
Q: Count the number of major and minor errors identified in your last response and compute the final score for this translation. Deduct 5 points for each major error. Deduct 1 point for each minor error. If the translation has no errors, its score will be 0.
A:

Combined Instruction:
Based on the given source and reference, identify the major and minor errors in this translation. Note that Major errors refer to actual translation or grammatical errors, and Minor errors refer to smaller imperfections, and purely subjective opinions about the translation. Count the number of major and minor errors identified and compute the final score for this translation. Deduct 5 points for each major error. Deduct 1 point for each minor error. If the translation has no errors, its score will be 0.

Figure 4: The prompt contexts utilized in EAPrompt. We also present other prompt contexts we use when comparing different prompt selections.
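The scoring rule in the second instruction (deduct 5 points per major error, 1 per minor, with 0 for an error-free translation) is simple to apply programmatically once the error counts are extracted from the model's reply. Below is a minimal sketch; the function names and the assumed response format (`Major errors: N` / `Minor errors: N`) are our own illustration, not part of the paper's pipeline.

```python
import re

def eaprompt_score(n_major: int, n_minor: int) -> int:
    """Final score: deduct 5 points per major error and 1 point per
    minor error; a translation with no errors scores 0 (the maximum)."""
    return -(5 * n_major + n_minor)

def parse_error_counts(response: str) -> tuple[int, int]:
    """Toy parser (an assumption for illustration): expects the model's
    reply to contain phrases such as 'Major errors: 2'."""
    def grab(kind: str) -> int:
        m = re.search(rf"{kind} errors?\s*[:=]?\s*(\d+)", response, re.I)
        return int(m.group(1)) if m else 0
    return grab("major"), grab("minor")

reply = "Major errors: 1\nMinor errors: 2"
print(eaprompt_score(*parse_error_counts(reply)))  # -7
```

In practice a more robust parser (or a constrained output format) would be needed, since LLM replies vary in phrasing.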
En-De En-Ru Zh-En Overall
Models Metrics / Prompts SYS SEG SYS SEG SYS SEG SYS SEG
Baselines MetricX XXL 76.9 36.0 91.4 42.0 84.6 42.7 85.0 40.2
BLEURT20 76.9 34.4 90.5 35.9 84.6 36.1 84.7 35.5
COMET22 76.9 36.8 86.7 40.0 86.8 42.8 83.9 39.9
UniTE 74.4 36.9 87.6 37.8 84.6 35.7 82.8 36.8
COMET-QE[noref] 71.8 28.1 80.0 34.1 81.3 36.5 78.1 32.9
Dav3 DA 92.3 30.6 85.7 33.2 86.8 37.1 88.0 33.6
Stars 88.5 29.4 81.9 29.4 87.9 29.7 85.8 29.5
SQM 93.6 28.3 83.8 30.8 80.2 34.6 85.4 31.2
Classes 83.3 23.5 87.6 28.9 84.6 25.1 85.4 25.8
EAPrompt 67.9 20.0 87.6 25.5 79.1 22.9 79.6 22.8
DA[noref] 87.2 18.0 88.6 25.8 82.4 28.9 86.1 24.2
Stars[noref] 80.8 19.8 83.8 31.0 84.6 23.5 83.2 24.8
SQM[noref] 82.1 21.8 84.8 32.8 80.2 26.8 82.5 27.1
Classes[noref] 76.9 17.6 81.9 27.1 76.9 17.2 78.8 20.6
EAPrompt[noref] 39.7 12.7 80.0 23.1 74.7 17.6 67.9 17.8
Turbo DA 85.9 25.0 90.5 23.4 82.4 25.5 86.5 24.6
Stars 89.7 25.9 90.5 22.3 87.9 26.5 89.4 24.9
SQM 87.2 29.8 91.4 27.7 82.4 31.3 87.2 29.6
Classes 82.1 17.0 87.6 16.7 76.9 17.8 82.5 17.2
EAPrompt 89.7 32.4 92.4 32.3 90.1 36.2 90.9 33.6
DA[noref] 83.3 25.5 90.5 29.4 85.7 26.4 86.9 27.1
Stars[noref] 88.5 25.5 88.6 27.9 75.8 26.1 84.3 26.5
SQM[noref] 89.7 25.9 91.4 30.9 81.3 29.1 87.6 28.6
Classes[noref] 62.8 -1.0 61.9 2.7 61.5 2.9 62.0 1.5
EAPrompt[noref] 87.2 28.4 90.5 30.1 90.1 32.5 89.4 30.3
GPT-4 DA 89.7 35.7 92.4 35.8 86.8 38.2 89.8 36.6
Stars 89.7 32.6 94.3 35.1 89.0 38.2 91.2 35.3
SQM 88.5 38.0 92.4 38.8 84.6 39.8 88.7 38.9
Classes 89.7 22.2 90.5 26.7 86.8 27.3 89.1 25.4
DA[noref] 84.6 31.1 91.4 40.5 85.7 40.7 87.6 37.4
Stars[noref] 84.6 30.8 93.3 36.6 87.9 40.4 89.1 35.9
SQM[noref] 84.6 35.9 95.2 43.2 85.7 41.6 89.1 40.2
Classes[noref] 92.3 30.4 92.4 39.0 89.0 31.3 91.2 33.6
Table 6: The system and segment level results of metrics using pairwise accuracy (%) and Kendall correlation (%)
with human-annotated MQM scores. The best results among the same model are highlighted in bold. The best
results among all metrics are underlined. Our proposed EAPrompt results are highlighted in orange.
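The two measures reported in Tables 6 and 7 can be reproduced as follows. This is a generic pure-Python sketch of system-level pairwise accuracy and segment-level Kendall correlation; the official WMT evaluation scripts additionally handle ties and grouping, which are simplified away here.

```python
from itertools import combinations

def pairwise_accuracy(metric: list[float], human: list[float]) -> float:
    """System level: fraction of system pairs that the metric ranks
    in the same order as the human MQM scores (ties skipped)."""
    agree = total = 0
    for i, j in combinations(range(len(metric)), 2):
        m, h = metric[i] - metric[j], human[i] - human[j]
        if m != 0 and h != 0:
            total += 1
            agree += (m > 0) == (h > 0)
    return agree / total

def kendall_tau(metric: list[float], human: list[float]) -> float:
    """Segment level: (concordant - discordant) pairs over all
    non-tied pairs, a simplified Kendall's tau."""
    conc = disc = 0
    for i, j in combinations(range(len(metric)), 2):
        s = (metric[i] - metric[j]) * (human[i] - human[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (conc + disc)
```

For example, with metric scores [1, 2, 3, 4] against human scores [1, 2, 4, 3], one of six pairs is discordant, giving a pairwise accuracy of 5/6 and a tau of 4/6.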
En-De Zh-En
Models Metrics / Prompts SYS SEG SYS SEG
Baselines BLEU (Papineni et al., 2002) 85.7 12.2 67.9 17.4
BLEURT (Sellam et al., 2020) 76.2 43.6 82.1 43.1
COMET (Rei et al., 2020) 85.7 34.8 78.6 37.1
PRISM (Thompson and Post, 2020) 90.5 28.2 82.1 32.3
Turbo EAPrompt 90.5 35.5 82.1 46.1
EAPrompt[noref] 81.0 34.9 75.0 37.1
Table 7: The system and segment level results of metrics on WMT20 En-De and Zh-En datasets, using pairwise
accuracy (%) and Kendall correlation (%) with human-annotated MQM scores. Best results are highlighted in bold.
System: Online-A.en
Domain: conversational
Doc_id: 1
Seg_id: 6
Text:
  Source: 请问,订单情况现在是什么样?
  Reference: May I ask what the status of the order is now?
  Translation: Please ask, what is the order situation now?
Human Evaluation:
  Major Error: "Please ask" - Accuracy/Mistranslation
  Minor Error: "situation" - Style/Awkward
Translation 8 is ranked as the best.